CI/CD Operations

GitHub Actions workflows across the SecurityV0 workspace. All repos live under the securityv0 GitHub organization.

Workflow Inventory

sv0-platform

Inventory current as of 2026-05-22. 18 workflows. Cost behaviour of ci.yml is governed by ADR-030 — see § Cost and Actions Minutes.

| Workflow | Trigger | Runner | Purpose |
| --- | --- | --- | --- |
| ci.yml | Push main/redesign/v06-pilot/v* tags + PRs | ubuntu-latest | Lint, typecheck, test, build; push Docker images to GHCR. amd64-only on PRs, multi-arch on main/tags; image build is path-gated to app changes; superseded PR runs cancelled (ADR-030) |
| deploy-dev.yml | workflow_run (ci success) + dispatch | ubuntu-latest | Auto-deploy to Hetzner dev.securityv0.com + PR previews pr-N-dev.securityv0.com |
| deploy-prod.yml | Manual dispatch | ubuntu-latest | Deploy to Hetzner app.securityv0.com (approval gate) |
| deploy-dev-cleanup.yml | Schedule + dispatch | ubuntu-latest | GC stale PR-preview instances (scheduled sweep; reaps closed PRs) |
| deploy-azure-dev.yml | workflow_run (ci success) + dispatch | ubuntu-latest | Deploy to Azure dev demo VM via OIDC Run Command (ADR-024) |
| deploy-azure-staging.yml | workflow_run (ci success) + dispatch | ubuntu-latest | Deploy to Azure staging VM via OIDC Run Command |
| smoke-staging.yml | workflow_run + schedule + dispatch | ubuntu-latest | Post-deploy smoke tests against staging |
| visual-review.yml | PR (ui/api changes) + dispatch | ubuntu-latest | Before/after screenshots + visual diff on sv0-reviews.pages.dev |
| visual-review-cleanup.yml | PR closed | ubuntu-latest | Delete Cloudflare Pages visual-review deployments |
| visual-review-stale-cleanup.yml | Schedule + dispatch | ubuntu-latest | GC stale Cloudflare Pages visual-review deployments |
| visual-regression.yml | PR (ui changes) | ubuntu-latest | Visual regression checks |
| token-health.yml | Weekly cron + dispatch | ubuntu-latest | Check Cloudflare Access service-token expiry, open issues near expiration |
| demo-data-health.yml | Manual dispatch | ubuntu-latest | On-demand demo data-health probe via /admin/data-health (staging) |
| seed-jira-aws-smoke.yml | Schedule + dispatch | ubuntu-latest | Jira→AWS demo seed smoke test |
| azure-ops.yml | Manual dispatch | ubuntu-latest | Demo-data ops (docker run from the deployed image) on Azure |
| pr-preview-admin.yml | Manual dispatch | ubuntu-latest | Seed/restore admin helper for Hetzner PR previews |
| chain-builder-version-bump.yml | PR | ubuntu-latest | Guard: fails the PR if chain-builder source changed without bumping CHAIN_BUILDER_VERSION |
| bootstrap-cf-access.yml | Manual dispatch | ubuntu-latest | Bootstrap/reconcile the "SecurityV0 PR Previews" Cloudflare Access app |

sv0-website

| Workflow | Trigger | Runner | Purpose |
| --- | --- | --- | --- |
| deploy.yml | PR to main | ubuntu-latest | Preview deploy to pr-N.securityv0.pages.dev |
| deploy-prod.yml | Manual dispatch | ubuntu-latest | Deploy to securityv0.com (approval gate) |
| staging.yml | Push to main | ubuntu-latest | Deploy staging, generate report, await approval, deploy prod |
| report.yml | Reusable workflow | ubuntu-latest | Lighthouse audit, screenshots, visual diff, business logic checks, post to issue #18 |
| visual-review.yml | PR (src/public changes) | ubuntu-latest | Visual diff on sv0-website-reviews.pages.dev |
| visual-review-cleanup.yml | PR closed | ubuntu-latest | Delete Cloudflare Pages deployments |

sv0-connectors

| Workflow | Trigger | Runner | Purpose |
| --- | --- | --- | --- |
| azure-foundry-ci.yml | Push/PR (azure-foundry paths) | ubuntu-latest | Lint + test (Python 3.11-3.13 matrix) |
| entra-servicenow-ci.yml | Push/PR (entra-servicenow paths) | ubuntu-latest | Test + connector reports |
| entra-servicenow-quality.yml | Push/PR | ubuntu-latest | Lint, format, typecheck, test, build (Python matrix) |
| entra-servicenow-scan.yml | Push/PR + dispatch | ubuntu-latest | Run security scans against live Azure/ServiceNow |
| servicenow-keepalive.yml | Every 30 min | ubuntu-latest | Ping ServiceNow dev instance to prevent hibernation |

sv0-documentation

| Workflow | Trigger | Runner | Purpose |
| --- | --- | --- | --- |
| docs-ci.yml | Push/PR (docs paths) | ubuntu-latest | Build Docusaurus, deploy to sv0-docs.pages.dev |

sv0-intelligence

| Workflow | Trigger | Runner | Purpose |
| --- | --- | --- | --- |
| weekly-incident.yml | Mondays 8am UTC + dispatch | ubuntu-latest | Gather AI security signals, score with Claude, open PR to sv0-website |

Dependency Graph

sv0-platform:
ci.yml ──workflow_run──> deploy-dev.yml ──creates──> PR preview instances
visual-review.yml ──screenshots──> PR preview instances (created by deploy-dev.yml)

Schedule / dispatch ──> deploy-dev-cleanup.yml (reaps previews of closed PRs)
PR closed ──> visual-review-cleanup.yml

token-health.yml ──monitors──> CF_ACCESS_* service tokens

sv0-website:
staging.yml ──calls──> report.yml (reusable) ──approval──> deploy-prod

PR opened ──> deploy.yml (preview)
PR opened ──> visual-review.yml
PR closed ──> visual-review-cleanup.yml

sv0-connectors:
servicenow-keepalive.yml ──every 30min──> ServiceNow dev instance (prevent hibernation)

sv0-intelligence:
weekly-incident.yml ──opens PR──> sv0-website ──triggers──> website CI
(deploy.yml preview + visual-review.yml)

Secrets Inventory

| Secret Name | Repo(s) | Purpose | Rotation |
| --- | --- | --- | --- |
| GITHUB_TOKEN (implicit) | All repos | GHCR, GH API | Auto-managed |
| DEPLOY_SSH_KEY | sv0-platform | SSH to Hetzner servers | Manual rotation |
| DEPLOY_HOST / DEPLOY_HOST_KEY | sv0-platform | Server address + host key | Change on server migration |
| CLOUDFLARE_API_TOKEN | sv0-platform, sv0-website, sv0-documentation | Pages deployments | Manual rotation |
| CLOUDFLARE_ACCOUNT_ID | sv0-platform, sv0-website, sv0-documentation | Cloudflare account | Static |
| CF_ACCESS_CLIENT_ID_DEPLOY / CF_ACCESS_CLIENT_SECRET_DEPLOY | sv0-platform | CI deploy bot Cloudflare Access | Expires 2027-03-31, monitored by token-health.yml |
| CF_ACCESS_CLIENT_ID_VISUAL / CF_ACCESS_CLIENT_SECRET_VISUAL | sv0-platform | Visual review bot Cloudflare Access | Expires 2027-03-31, monitored by token-health.yml |
| CLOUDFLARE_API_TOKEN_ZERO_TRUST | sv0-platform | Zero Trust management API | Manual rotation |
| ENTRA_SERVICENOW_AZURE_* (3 secrets) | sv0-connectors | Azure Entra connector | Manual rotation |
| ENTRA_SERVICENOW_SNOW_* (3 secrets) | sv0-connectors | ServiceNow connector | Manual rotation |
| ANTHROPIC_API_KEY | sv0-intelligence | Claude API for signal scoring | Manual rotation |
| GH_TOKEN | sv0-intelligence | Cross-repo PR creation | Manual rotation |

Credential Rotation Strategy

The infrastructure strategy doc (2026-03-31-infrastructure-strategy.md) defines a tiered secrets management approach. Operational details:

Automated monitoring -- The token-health.yml workflow runs weekly and on-demand. It queries the Cloudflare Zero Trust API for service token expiry dates and opens GitHub issues when tokens are within 30 days of expiration.
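The 30-day window described above reduces to a date subtraction. The following is a minimal sketch of that check, not the token-health.yml implementation: the function names days_until and expires_soon are hypothetical, and it assumes GNU date (as shipped on ubuntu-latest).

```shell
# Hypothetical helpers; assumes GNU date (-d), as on ubuntu-latest runners.

# days_until EXPIRY [TODAY]: whole days from TODAY (default: now) to EXPIRY, in UTC.
days_until() {
  du_expiry=$(date -u -d "$1" +%s) || return 1
  du_today=$(date -u -d "${2:-now}" +%s) || return 1
  echo $(( (du_expiry - du_today) / 86400 ))
}

# expires_soon EXPIRY [TODAY] [WINDOW_DAYS]: succeeds when EXPIRY is within
# WINDOW_DAYS (default 30) of TODAY, i.e. when an issue should be opened.
expires_soon() {
  [ "$(days_until "$1" "${2:-now}")" -le "${3:-30}" ]
}

# Example: the current tokens expire 2027-03-31.
if expires_soon 2027-03-31; then
  echo "service token expires within 30 days"
fi
```

In a real workflow the expiry date would come from the Cloudflare Zero Trust API response rather than a literal.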

Cloudflare service tokens -- 1-year expiry (current tokens expire 2027-03-31). Rotation is automated via Cloudflare API: create new token, update GitHub secrets, delete old token. Monitored by token-health.yml.

GitHub secrets -- Manual rotation, no built-in expiry tracking. Rely on documentation and calendar reminders.

Future expansion -- Extend token-health.yml to check:

  • Azure client secrets (Entra connector) via Microsoft Graph API
  • ServiceNow passwords via ServiceNow Table API
  • Anthropic API key validity via a lightweight API call

Cloudflare Pages Projects

| Project | Repo | Branch Pattern | Purpose |
| --- | --- | --- | --- |
| securityv0 | sv0-website | main / staging / pr-N | Marketing website |
| sv0-reviews | sv0-platform | pr-N / custom | Platform visual review reports |
| sv0-website-reviews | sv0-website | pr-N / staging | Website visual review reports |
| sv0-docs / sv0-docs-docusaurus | sv0-documentation | main / pr-N | Documentation site |

Runner Infrastructure

All workflows currently use ubuntu-latest (GitHub-hosted runners). The workspace switched from self-hosted mac-mini runners in March 2026 for reliability and reduced maintenance.

The self-hosted mac-mini runner is still registered but not actively used by any workflow. It remains available as a fallback if GitHub-hosted runners become insufficient (e.g., for tasks requiring macOS or persistent local state). Do not move heavy CI (multi-arch Docker builds) onto it — the host is memory-constrained and has caused kernel panics under load. If native arm64 builds are ever needed, use GitHub's native ubuntu-24.04-arm runners (no QEMU), not self-hosting. See ADR-030.

Cost and Actions Minutes

The org has 50,000 included GitHub Actions Linux-minutes per month (resets on the 1st). Overage is billed at $0.006/min: about $60 for 10k minutes over the 50k pool, about $300 at 100k total. At steady-state burn that is financially trivial. The real risks are operational, not financial: a $0 budget cap halting all CI, multi-hour build hangs, and unbounded runaway burn (a wedged job or loop has no steady-state ceiling). See ADR-030 for the decision; this section is the operational how-to.
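The overage figures above follow from (total minutes minus the included pool) times the per-minute rate. A small sketch of that arithmetic, using the 50,000-minute pool and $0.006/min rate stated in this section as defaults; the function name overage_usd is hypothetical.

```shell
# overage_usd TOTAL_MINUTES [INCLUDED_MINUTES] [USD_PER_MINUTE]
# Prints the overage cost in USD. Defaults come from this section:
# 50,000 included Linux-minutes and $0.006 per overage minute.
overage_usd() {
  awk -v total="$1" -v included="${2:-50000}" -v rate="${3:-0.006}" 'BEGIN {
    over = total - included
    if (over < 0) over = 0          # usage inside the pool costs nothing
    printf "%.2f\n", over * rate
  }'
}

overage_usd 60000    # 10k minutes over the pool
overage_usd 100000   # 100k total, 50k over the pool
```

The two calls print 60.00 and 300.00, matching the estimates in the paragraph above.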

The pool follows active development

The org-wide pool is effectively a single-repo pool — it is consumed almost entirely by whichever repo is under heaviest development that month:

| Month | Linux minutes | Repo consuming ~all of it |
| --- | --- | --- |
| March 2026 | 3,829 | excalidraw-diagram-skill |
| April 2026 | 33,780 | sv0-connectors |
| May 2026 (22 days) | 45,019 | sv0-platform |

So when you investigate a spike, start from the billing-by-repo breakdown, then drill into that repo's heaviest workflow (almost always its ci). In May 2026, sv0-platform's ci was ~80% of the pool, and the spend was bimodal: a typical run was ~20 billed min, but a tail of ~67 runs hung for 300–1,078 min each on multi-arch arm64-via-QEMU image builds, with no explicit job timeout to cap them. Look for that long-tail shape, not a high average.
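One way to make the long-tail shape visible is to measure how much of the total a few slow runs account for. The sketch below assumes you already have one per-run billed-minute figure per line; the function name tail_share and the sample numbers are illustrative, not measured data.

```shell
# tail_share THRESHOLD_MINUTES
# Reads one per-run billed-minute value per line on stdin and reports how
# many runs are at or above the threshold and what share of the total
# minutes they consumed.
tail_share() {
  awk -v thr="$1" '
    NF { total += $1; n++; if ($1 >= thr) { tail += $1; k++ } }
    END {
      pct = (total > 0) ? 100 * tail / total : 0
      printf "%d of %d runs >= %d min; %d of %d min (%.0f%%)\n", k, n, thr, tail, total, pct
    }'
}

# Illustrative input: three typical ~20 min runs and two hung runs.
printf '20\n20\n20\n300\n1078\n' | tail_share 300
```

With this sample the two hung runs account for 1378 of 1438 minutes, which a mean of roughly 288 min per run would hide.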

Diagnosing a spike

# 1. Spend by repo for a month (Actions Linux minutes). The legacy
# /orgs/.../settings/billing/actions endpoint is gone (HTTP 410).
# Usage items are summed per repository so each repo prints once.
gh api /organizations/SecurityV0/settings/billing/usage --jq '
[.usageItems[] | select(.product=="actions" and .sku=="Actions Linux" and (.date|startswith("2026-05")))]
| group_by(.repositoryName)
| map({repo: .[0].repositoryName, minutes: (map(.quantity) | add)})
| sort_by(.minutes) | reverse | .[] | "\(.minutes|floor) min \(.repo)"'

# 2. Find the heavy workflow's id (or just use its filename below):
gh workflow list --repo SecurityV0/<repo>

# 3. True run count for a workflow. (gh run list defaults to 20 and needs an
# explicit --limit; for an exact count use the API total_count field. The
# filename form is easiest — no id lookup needed.)
gh api "/repos/SecurityV0/<repo>/actions/workflows/ci.yml/runs?created=>=2026-05-01&per_page=1" --jq '.total_count'

# 4. Per-run billed minutes = sum over jobs of ceil(job_seconds / 60). The
# /workflows/{id}/timing endpoint returns an empty {"billable":{}} and is
# useless — sum job durations from /actions/runs/{run_id}/jobs instead.
# Look at the DISTRIBUTION (a hang tail), not just the mean, and include
# failed/cancelled runs — they bill too.
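Step 4 can be sketched as a small filter that applies ceil(job_seconds / 60) per job and sums the result. The function name billed_minutes is hypothetical, and the commented gh invocation assumes that the jobs endpoint returns started_at and completed_at as ISO 8601 UTC timestamps that jq's fromdate can parse.

```shell
# billed_minutes: reads one job duration in whole seconds per line on stdin
# and prints the run's billed minutes, rounding each job up to a full minute.
billed_minutes() {
  awk 'NF { total += int(($1 + 59) / 60) } END { print total + 0 }'
}

# Three jobs of 61 s, 120 s and 1 s bill as 2 + 2 + 1 minutes.
printf '61\n120\n1\n' | billed_minutes

# Usage sketch against the API (not run here; <repo> and <run_id> are placeholders):
#   gh api "/repos/SecurityV0/<repo>/actions/runs/<run_id>/jobs?per_page=100" \
#     --jq '.jobs[] | select(.completed_at != null)
#           | (.completed_at | fromdate) - (.started_at | fromdate)' \
#     | billed_minutes
```

Rounding per job rather than per run matters: many short jobs each bill a full minute, so a run's billed time can exceed its wall-clock time.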

Levers (highest ROI first)

  1. Cap heavy jobs with timeout-minutes: the May spike was ~67 runs that each billed up to ~18 h (1,078 min) on wedged QEMU builds with no explicit timeout. Setting timeout-minutes: 30 on build-images fails fast instead of letting each job bill up to the 6-hour default. Highest-value, lowest-risk. (Follow-up, not yet shipped.)
  2. amd64-only image builds on PRs — removes the arm64-via-QEMU emulation that causes the hangs; keep multi-arch on main/tags only. (ADR-030, shipped.)
  3. concurrency + cancel-in-progress for PR refs — stop stacking full runs from rapid pushes. Scope to PRs only; main/tags/redesign/v06-pilot pushes are not auto-cancelled (they must publish their images). (ADR-030, shipped.)
  4. Path-gate the non-required image build — docs/test-only PRs skip it; keep required checks always-on. Never make build-images a required check (a skipped required check blocks merge forever). (ADR-030, shipped.)
  5. Label-gate PR-preview builds — only 4 dev preview slots exist; don't build images for PRs that can't deploy. (Follow-up.)
  6. Move expensive optional checks to workflow_dispatch / label-gated — visual-regression, release multi-arch builds. (Follow-up.)
  7. Native ARM runners, not self-hosting, if arm64 is ever required on PRs (self-hosted + fork PRs = code execution on persistent hardware).
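As a starting point for lever 1, a coarse audit can list workflow files that never mention timeout-minutes. This is only a sketch: the function name workflows_without_timeout is hypothetical, it assumes workflows live in .github/workflows with a .yml extension, and a file-level match cannot tell whether every job in a file is capped.

```shell
# workflows_without_timeout DIR
# Prints the workflow files in DIR that contain no "timeout-minutes:" key at
# all. grep -L lists files with no matching line.
workflows_without_timeout() {
  grep -L 'timeout-minutes:' "$1"/*.yml 2>/dev/null || true
}

# From the root of a checked-out repo:
workflows_without_timeout .github/workflows
```

Files it prints are the first candidates for a per-job cap; files it omits still need a manual look at each job.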

For budget: do not set a $0 Actions budget — it converts a cost event into a CI outage. Use two layers: an alert budget with headroom (detection) and a non-zero hard ceiling set well above expected burn plus per-job timeout-minutes (containment). Alerts alone don't stop a 3am runaway.

Security Note

This document is internal to SecurityV0. Secret names are listed for operational reference — actual secret values are stored in GitHub Actions secrets and are not accessible without repository admin access. Do not share this document externally without redacting the secrets inventory.