CI/CD Operations
GitHub Actions workflows across the SecurityV0 workspace. All repos live under the securityv0 GitHub organization.
Workflow Inventory
sv0-platform
Inventory current as of 2026-05-22: 18 workflows. Cost behaviour of ci.yml is governed by ADR-030; see § Cost and Actions Minutes.
| Workflow | Trigger | Runner | Purpose |
|---|---|---|---|
| ci.yml | Push main/redesign/v06-pilot/v* tags + PRs | ubuntu-latest | Lint, typecheck, test, build; push Docker images to GHCR. amd64-only on PRs, multi-arch on main/tags; image build is path-gated to app changes; superseded PR runs cancelled (ADR-030) |
| deploy-dev.yml | workflow_run (ci success) + dispatch | ubuntu-latest | Auto-deploy to Hetzner dev.securityv0.com + PR previews pr-N-dev.securityv0.com |
| deploy-prod.yml | Manual dispatch | ubuntu-latest | Deploy to Hetzner app.securityv0.com (approval gate) |
| deploy-dev-cleanup.yml | Schedule + dispatch | ubuntu-latest | GC stale PR-preview instances (scheduled sweep; reaps closed PRs) |
| deploy-azure-dev.yml | workflow_run (ci success) + dispatch | ubuntu-latest | Deploy to Azure dev demo VM via OIDC Run Command (ADR-024) |
| deploy-azure-staging.yml | workflow_run (ci success) + dispatch | ubuntu-latest | Deploy to Azure staging VM via OIDC Run Command |
| smoke-staging.yml | workflow_run + schedule + dispatch | ubuntu-latest | Post-deploy smoke tests against staging |
| visual-review.yml | PR (ui/api changes) + dispatch | ubuntu-latest | Before/after screenshots + visual diff on sv0-reviews.pages.dev |
| visual-review-cleanup.yml | PR closed | ubuntu-latest | Delete Cloudflare Pages visual-review deployments |
| visual-review-stale-cleanup.yml | Schedule + dispatch | ubuntu-latest | GC stale Cloudflare Pages visual-review deployments |
| visual-regression.yml | PR (ui changes) | ubuntu-latest | Visual regression checks |
| token-health.yml | Weekly cron + dispatch | ubuntu-latest | Check Cloudflare Access service-token expiry, open issues near expiration |
| demo-data-health.yml | Manual dispatch | ubuntu-latest | On-demand demo data-health probe via /admin/data-health (staging) |
| seed-jira-aws-smoke.yml | Schedule + dispatch | ubuntu-latest | Jira→AWS demo seed smoke test |
| azure-ops.yml | Manual dispatch | ubuntu-latest | Demo-data ops (docker run from the deployed image) on Azure |
| pr-preview-admin.yml | Manual dispatch | ubuntu-latest | Seed/restore admin helper for Hetzner PR previews |
| chain-builder-version-bump.yml | PR | ubuntu-latest | Guard: fails the PR if chain-builder source changed without bumping CHAIN_BUILDER_VERSION |
| bootstrap-cf-access.yml | Manual dispatch | ubuntu-latest | Bootstrap/reconcile the "SecurityV0 PR Previews" Cloudflare Access app |
sv0-website
| Workflow | Trigger | Runner | Purpose |
|---|---|---|---|
| deploy.yml | PR to main | ubuntu-latest | Preview deploy to pr-N.securityv0.pages.dev |
| deploy-prod.yml | Manual dispatch | ubuntu-latest | Deploy to securityv0.com (approval gate) |
| staging.yml | Push to main | ubuntu-latest | Deploy staging, generate report, await approval, deploy prod |
| report.yml | Reusable workflow | ubuntu-latest | Lighthouse audit, screenshots, visual diff, business logic checks, post to issue #18 |
| visual-review.yml | PR (src/public changes) | ubuntu-latest | Visual diff on sv0-website-reviews.pages.dev |
| visual-review-cleanup.yml | PR closed | ubuntu-latest | Delete Cloudflare Pages deployments |
sv0-connectors
| Workflow | Trigger | Runner | Purpose |
|---|---|---|---|
| azure-foundry-ci.yml | Push/PR (azure-foundry paths) | ubuntu-latest | Lint + test (Python 3.11-3.13 matrix) |
| entra-servicenow-ci.yml | Push/PR (entra-servicenow paths) | ubuntu-latest | Test + connector reports |
| entra-servicenow-quality.yml | Push/PR | ubuntu-latest | Lint, format, typecheck, test, build (Python matrix) |
| entra-servicenow-scan.yml | Push/PR + dispatch | ubuntu-latest | Run security scans against live Azure/ServiceNow |
| servicenow-keepalive.yml | Every 30 min | ubuntu-latest | Ping ServiceNow dev instance to prevent hibernation |
sv0-documentation
| Workflow | Trigger | Runner | Purpose |
|---|---|---|---|
| docs-ci.yml | Push/PR (docs paths) | ubuntu-latest | Build Docusaurus, deploy to sv0-docs.pages.dev |
sv0-intelligence
| Workflow | Trigger | Runner | Purpose |
|---|---|---|---|
| weekly-incident.yml | Mondays 8am UTC + dispatch | ubuntu-latest | Gather AI security signals, score with Claude, open PR to sv0-website |
Dependency Graph
sv0-platform:
ci.yml ──workflow_run──> deploy-dev.yml ──creates──> PR preview instances
│
visual-review.yml ──screenshots──> PR preview instances ┘
Schedule ──> deploy-dev-cleanup.yml (reaps previews of closed PRs)
PR closed ──> visual-review-cleanup.yml
token-health.yml ──monitors──> CF_ACCESS_* service tokens
sv0-website:
staging.yml ──calls──> report.yml (reusable) ──approval──> deploy-prod
PR opened ──> deploy.yml (preview)
PR opened ──> visual-review.yml
PR closed ──> visual-review-cleanup.yml
sv0-connectors:
servicenow-keepalive.yml ──every 30min──> ServiceNow dev instance (prevent hibernation)
sv0-intelligence:
weekly-incident.yml ──opens PR──> sv0-website ──triggers──> website CI
(deploy.yml preview + visual-review.yml)
Secrets Inventory
| Secret Name | Repo(s) | Purpose | Rotation |
|---|---|---|---|
| GITHUB_TOKEN (implicit) | All repos | GHCR, GH API | Auto-managed |
| DEPLOY_SSH_KEY | sv0-platform | SSH to Hetzner servers | Manual rotation |
| DEPLOY_HOST / DEPLOY_HOST_KEY | sv0-platform | Server address + host key | Change on server migration |
| CLOUDFLARE_API_TOKEN | sv0-platform, sv0-website, sv0-documentation | Pages deployments | Manual rotation |
| CLOUDFLARE_ACCOUNT_ID | sv0-platform, sv0-website, sv0-documentation | Cloudflare account | Static |
| CF_ACCESS_CLIENT_ID_DEPLOY / CF_ACCESS_CLIENT_SECRET_DEPLOY | sv0-platform | CI deploy bot Cloudflare Access | Expires 2027-03-31, monitored by token-health.yml |
| CF_ACCESS_CLIENT_ID_VISUAL / CF_ACCESS_CLIENT_SECRET_VISUAL | sv0-platform | Visual review bot Cloudflare Access | Expires 2027-03-31, monitored by token-health.yml |
| CLOUDFLARE_API_TOKEN_ZERO_TRUST | sv0-platform | Zero Trust management API | Manual rotation |
| ENTRA_SERVICENOW_AZURE_* (3 secrets) | sv0-connectors | Azure Entra connector | Manual rotation |
| ENTRA_SERVICENOW_SNOW_* (3 secrets) | sv0-connectors | ServiceNow connector | Manual rotation |
| ANTHROPIC_API_KEY | sv0-intelligence | Claude API for signal scoring | Manual rotation |
| GH_TOKEN | sv0-intelligence | Cross-repo PR creation | Manual rotation |
Credential Rotation Strategy
The infrastructure strategy doc (2026-03-31-infrastructure-strategy.md) defines a tiered secrets management approach. Operational details:
Automated monitoring -- The token-health.yml workflow runs weekly and on-demand. It queries the Cloudflare Zero Trust API for service token expiry dates and opens GitHub issues when tokens are within 30 days of expiration.
Cloudflare service tokens -- 1-year expiry (current tokens expire 2027-03-31). Rotation is automated via Cloudflare API: create new token, update GitHub secrets, delete old token. Monitored by token-health.yml.
GitHub secrets -- Manual rotation, no built-in expiry tracking. Rely on documentation and calendar reminders.
Future expansion -- Extend token-health.yml to check:
- Azure client secrets (Entra connector) via Microsoft Graph API
- ServiceNow passwords via ServiceNow Table API
- Anthropic API key validity via a lightweight API call
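The 30-day warning window described under automated monitoring can be sketched as a pair of bash helpers. This is an illustration only, not the contents of token-health.yml: the function names `days_until` and `needs_rotation` are hypothetical, and the sketch assumes GNU `date` (as found on ubuntu-latest) and expiry dates in YYYY-MM-DD form.

```shell
# Days from a reference date until an expiry date (both YYYY-MM-DD, UTC).
# Assumes GNU date; the reference date defaults to today.
days_until() {
  local expiry="$1" today="${2:-$(date -u +%F)}"
  local exp_s now_s
  exp_s=$(date -u -d "$expiry" +%s)
  now_s=$(date -u -d "$today" +%s)
  echo $(( (exp_s - now_s) / 86400 ))
}

# Exit 0 when the expiry falls inside the warning window (default 30 days).
needs_rotation() {
  local expiry="$1" today="${2:-$(date -u +%F)}" window="${3:-30}"
  [ "$(days_until "$expiry" "$today")" -le "$window" ]
}
```

For example, `needs_rotation 2027-03-31 && echo "open an issue"` would start firing on 2027-03-01 under these assumptions.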
Cloudflare Pages Projects
| Project | Repo | Branch Pattern | Purpose |
|---|---|---|---|
| securityv0 | sv0-website | main / staging / pr-N | Marketing website |
| sv0-reviews | sv0-platform | pr-N / custom | Platform visual review reports |
| sv0-website-reviews | sv0-website | pr-N / staging | Website visual review reports |
| sv0-docs / sv0-docs-docusaurus | sv0-documentation | main / pr-N | Documentation site |
Runner Infrastructure
All workflows currently use ubuntu-latest (GitHub-hosted runners). The workspace switched from self-hosted mac-mini runners in March 2026 for reliability and reduced maintenance.
The self-hosted mac-mini runner is still registered but not actively used by any workflow. It remains available as a fallback if GitHub-hosted runners become insufficient (e.g., for tasks requiring macOS or persistent local state). Do not move heavy CI (multi-arch Docker builds) onto it — the host is memory-constrained and has caused kernel panics under load. If native arm64 builds are ever needed, use GitHub's native ubuntu-24.04-arm runners (no QEMU), not self-hosting. See ADR-030.
Cost and Actions Minutes
The org has 50,000 included GitHub Actions Linux-minutes per month (resets on the 1st). Overage is billed at $0.006/min, so steady-state burn is cheap: about $60 for 10k minutes over the 50k pool, about $300 at 100k total. The dollar amount is financially trivial. The operational risks are what matter: a $0 budget cap halting all CI, multi-hour build hangs, and unbounded runaway burn (a wedged job or loop has no steady-state ceiling). See ADR-030 for the decision; this section is the operational how-to.
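The overage arithmetic can be checked with a small helper. This is a sketch under the figures quoted in this section (50,000 included minutes, $0.006 per extra minute); `overage_usd` is a hypothetical name, and actual invoices may differ if other SKUs or multipliers apply.

```shell
# Overage cost in USD for a month's total Linux minutes.
# Defaults mirror the figures quoted in this section; override them if the plan changes.
overage_usd() {
  local total="$1" included="${2:-50000}" rate="${3:-0.006}"
  awk -v t="$total" -v i="$included" -v r="$rate" 'BEGIN {
    over = t - i
    if (over < 0) over = 0        # still inside the included pool
    printf "%.2f\n", over * r
  }'
}
```

Calling `overage_usd 60000` corresponds to the 10k-over case, and `overage_usd 100000` to the 100k-total case.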
The pool follows active development
The org-wide pool is effectively a single-repo pool — it is consumed almost entirely by whichever repo is under heaviest development that month:
| Month | Linux minutes | Repo consuming ~all of it |
|---|---|---|
| March 2026 | 3,829 | excalidraw-diagram-skill |
| April 2026 | 33,780 | sv0-connectors |
| May 2026 (22 days) | 45,019 | sv0-platform |
So when you investigate a spike, start from the billing-by-repo breakdown, then drill into that repo's heaviest workflow (almost always its ci). In May 2026, sv0-platform's ci was ~80% of the pool — and the spend was bimodal: a typical run was ~20 billed min, but a tail of ~67 runs hung for 300–1,078 min each on multi-arch arm64-via-QEMU image builds (no job timeout to cap them). Look for that long-tail shape, not a high average.
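To see why the mean hides a hang tail, it can help to summarise per-run billed minutes with both the mean and a count of runs above a cut-off. The helper below is a minimal sketch: `tail_summary` is a hypothetical name, the 300-minute default threshold is an assumption taken from the May figures, and the input is expected as one number per line on stdin.

```shell
# Summarise per-run billed minutes read from stdin (one number per line).
# Prints run count, mean, max, and how many runs exceed the threshold.
tail_summary() {
  local threshold="${1:-300}"   # assumed cut-off for a "hung" run, in billed minutes
  awk -v t="$threshold" '
    BEGIN { max = 0; over = 0 }
    { n++; sum += $1; if ($1 > max) max = $1; if ($1 > t) over++ }
    END {
      if (n == 0) { print "runs=0"; exit }
      printf "runs=%d mean=%.0f max=%d over=%d\n", n, sum / n, max, over
    }'
}
```

A mean far above the typical run combined with a non-zero `over` count is the long-tail shape described above.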
Diagnosing a spike
# 1. Spend by repo for a month (Actions Linux minutes). The legacy
# /orgs/.../settings/billing/actions endpoint is gone (HTTP 410).
gh api /organizations/SecurityV0/settings/billing/usage --jq '
[.usageItems[] | select(.product=="actions" and .sku=="Actions Linux" and (.date|startswith("2026-05")))]
| sort_by(.quantity) | reverse | .[] | "\(.quantity|floor) min \(.repositoryName)"'
# 2. Find the heavy workflow's id (or just use its filename below):
gh workflow list --repo SecurityV0/<repo>
# 3. True run count for a workflow. (gh run list defaults to 20 and needs an
# explicit --limit; for an exact count use the API total_count field. The
# filename form is easiest — no id lookup needed.)
gh api "/repos/SecurityV0/<repo>/actions/workflows/ci.yml/runs?created=>=2026-05-01&per_page=1" --jq '.total_count'
# 4. Per-run billed minutes = sum over jobs of ceil(job_seconds / 60). The
# /workflows/{id}/timing endpoint returns an empty {"billable":{}} and is
# useless — sum job durations from /actions/runs/{run_id}/jobs instead.
# Look at the DISTRIBUTION (a hang tail), not just the mean, and include
# failed/cancelled runs — they bill too.
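Step 4 has no command attached, so here is one possible sketch of the per-run sum. It assumes an authenticated `gh` CLI, that each job object carries `started_at` and `completed_at` timestamps, and that `fromdateiso8601` is available in gh's built-in jq. The function names are hypothetical, and wall-clock job duration only approximates what GitHub finally bills.

```shell
# Billed minutes for one run, given each job's duration in seconds:
# every job is rounded up to a whole minute, then the jobs are summed.
billed_minutes() {
  local total=0 secs
  for secs in "$@"; do
    total=$(( total + (secs + 59) / 60 ))
  done
  echo "$total"
}

# Fetch job durations for a run and sum them.
# Jobs that never started or have not finished are skipped.
run_billed_minutes() {
  local repo="$1" run_id="$2" durations
  durations=$(gh api --paginate "/repos/${repo}/actions/runs/${run_id}/jobs" --jq '
    .jobs[]
    | select(.started_at != null and .completed_at != null)
    | (.completed_at | fromdateiso8601) - (.started_at | fromdateiso8601)') || return 1
  # Word-splitting on the newline-separated list is intended here.
  billed_minutes $durations
}
```

Usage would look like `run_billed_minutes SecurityV0/<repo> <run_id>`; running it over every run id for the month gives the distribution to inspect.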
Levers (highest ROI first)
- Cap heavy jobs with timeout-minutes. The May spike was 67 runs hanging up to 18h on wedged QEMU builds with no timeout. Setting timeout-minutes: 30 on build-images fails fast instead of billing to the 6-hour default. Highest-value, lowest-risk. (Follow-up, not yet shipped.)
- amd64-only image builds on PRs: removes the arm64-via-QEMU emulation that causes the hangs; keep multi-arch on main/tags only. (ADR-030, shipped.)
- concurrency + cancel-in-progress for PR refs: stop stacking full runs from rapid pushes. Scope to PRs only; main/tags/redesign/v06-pilot pushes are not auto-cancelled (they must publish their images). (ADR-030, shipped.)
- Path-gate the non-required image build: docs/test-only PRs skip it; keep required checks always-on. Never make build-images a required check (a skipped required check blocks merge forever). (ADR-030, shipped.)
- Label-gate PR-preview builds: only 4 dev preview slots exist; don't build images for PRs that can't deploy. (Follow-up.)
- Move expensive optional checks to workflow_dispatch / label-gated: visual-regression, release multi-arch builds. (Follow-up.)
- Native ARM runners, not self-hosting, if arm64 is ever required on PRs (self-hosted + fork PRs = code execution on persistent hardware).
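As a starting point for the timeout lever, a coarse audit can list workflow files that never set timeout-minutes. This is a hedged sketch: `workflows_without_timeout` is a hypothetical helper, it assumes workflows live in .github/workflows, and it checks at file level only, so a multi-job workflow passes as soon as any one job sets a timeout.

```shell
# List workflow files that contain no timeout-minutes key at all.
# File-level check only: it does not verify that every job has a timeout.
workflows_without_timeout() {
  local dir="${1:-.github/workflows}"
  # grep -L prints the names of files with no matching line;
  # errors from an unmatched *.yaml glob are discarded.
  grep -L -E '^[[:space:]]*timeout-minutes:' "$dir"/*.yml "$dir"/*.yaml 2>/dev/null || true
}
```

Any file it prints is a candidate for the 6-hour default and worth a manual look at its heaviest job.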
For budget: do not set a $0 Actions budget — it converts a cost event into a CI outage. Use two layers: an alert budget with headroom (detection) and a non-zero hard ceiling set well above expected burn plus per-job timeout-minutes (containment). Alerts alone don't stop a 3am runaway.
Security Note
This document is internal to SecurityV0. Secret names are listed for operational reference — actual secret values are stored in GitHub Actions secrets and are not accessible without repository admin access. Do not share this document externally without redacting the secrets inventory.