ADR-030: CI Cost & Build-Architecture Strategy

Status

Accepted — 2026-05-22. Shipped in sv0-platform#1301 (issue #1300).

Operational detail (how to diagnose a spend spike, the billing-API command, the full lever list) lives in ci-cd-operations.md § Cost and Actions Minutes. This ADR records the why.

Context

On 2026-05-22 GitHub alerted that we had used 90% of the 50,000 included Actions minutes (~45,000 used; the alert said 45,002, the billing API showed 45,019 by end of day) with 10 days left in the cycle. The first instinct was that we open too many PRs — even one-line changes get a PR — and that we should move to long-lived branches with manually-triggered heavy CI. The data said otherwise.

Where the minutes actually go

Pulled from the GitHub enhanced billing usage API (gh api /organizations/SecurityV0/settings/billing/usage; the legacy /orgs/.../settings/billing/actions endpoint is gone — HTTP 410):

| Month | Actions Linux minutes | Repo consuming ~all of it |
| --- | --- | --- |
| March 2026 | 3,829 | excalidraw-diagram-skill |
| April 2026 | 33,780 | sv0-connectors |
| May 2026 (22 days) | 45,019 | sv0-platform |

Two facts fall out:

  1. The org-wide pool is effectively a single-repo pool — it follows whoever is doing the most active development. It is not one repo's misconfiguration; it is the process applied to whichever repo is hot that month. So the fix must be a process pattern, replicable to any repo, not a one-off.

  2. Within sv0-platform, ci.yml is the whole story. Of ~3,470 workflow runs that month, ci ran ~700 times and consumed ≈36,000 billed minutes (summed per job from the runs API), or ~80% of the 45,019 minutes the org had used. The remainder is the deploy/visual fan-out, mostly cheap.

Why ci was so expensive — a hang tail, not a uniform per-run cost

The cost is bimodal, and the first analysis got this wrong by quoting a "~41-minute average." A typical ci run is cheap (~20 billed min); the spend lives in a long tail of multi-arch image builds that hang for hours:

Typical run (~631 of ~700): ~20 billed min total
    build-test            ~4 min
    integration-tests     ~2 min (MongoDB service container)
    build-images (api)    ~5 min } multi-arch amd64 + arm64 (QEMU emulation)
    build-images (ui)     ~8 min }
Hang tail (67 runs): 300–1,078 min wall-clock each
    a build-images (api|ui) job stuck under arm64 QEMU emulation, billing until
    it hit the 6h job ceiling / was cancelled. These ~67 runs dominate ci spend.

The "~41-minute average" cited in the first pass was an artifact: a random run sample catches a few multi-hundred-minute hangs, which drag the mean up. The real driver is the hang tail, not the body — which is why the fix targets the thing that causes hangs (arm64-via-QEMU) rather than shaving minutes off warm runs.

Four structural waste sources, all in ci.yml:

  • Multi-arch image builds on every PR event. The platforms input was linux/amd64,linux/arm64, built on pull_request. The arm64 half runs under QEMU emulation — the slow half of every build and the source of the multi-hour hang tail above.
  • No job timeout. build-images had no timeout-minutes, so a wedged QEMU build billed up to the 6-hour GitHub default (some runs reached ~18h wall-clock across retries) instead of failing fast. This is the single biggest waste and the easiest to cap.
  • No concurrency / cancel-in-progress. Every push to a PR branch launched a fresh full CI; superseded in-flight runs were never cancelled. Humans and Claude Code agents both push frequently.
  • No path filtering. Docs-only, markdown, .claude/, and test-only PRs ran the full pipeline including both multi-arch image builds.

arm64 has no consumer

Verified before removing it: zero arm64/aarch64 references in any workflow or deploy script; deploys pull images by tag with no --platform, so an x86 host pulls amd64. Deploy targets (Hetzner VPS, Azure VMs per ADR-024) and GitHub-hosted runners are all x86. Local development builds its own images (docker compose up --build) rather than pulling arm64 from GHCR. The arm64 images were built on every PR and consumed by nothing.

The financial reframe

Actions overage is $0.006/min, so steady-state burn is cheap: ~$60 for 10k minutes over the 50k pool, ~$300 for 100k total (50k over the pool, double our burn). For a funded company this is noise. The real risks are:

  1. A $0 Actions budget cap would halt all CI until the cycle resets — development stops.
  2. Multi-hour build hangs that block every developer's and agent's feedback on the affected PR.
  3. Unbounded runaway burn. The $60/$300 figures assume bounded spend. A wedged job with no timeout (our actual May situation) or a workflow loop has no ceiling — it is bounded only by reaction time to an alert. Steady-state cost is not the failure mode to design against; unbounded burn is.

So the objective is not to minimise dollars. It is to (a) stay clear of the hard cap, (b) contain unbounded burn with hard limits (timeouts, a non-zero budget ceiling — see Decision), and (c) tighten the feedback loop by deleting work nobody uses — without sacrificing the per-change CI safety the PR-per-change workflow gives us.

Decision

Two principles and four mechanisms (the first three shipped in #1301; the fourth is a tracked follow-up).

Principles

  1. Cut per-run cost before run count. Reducing how many PRs we open (the long-lived-branch idea) is the smaller lever: it costs us per-change CI safety, makes reviews harder, invites merge conflicts, and fights the issue-per-change discipline. At half the PRs we would still spend ~22k minutes/month. We keep PR-per-change and attack the cost per run instead.

  2. Optimise for cap-block risk and feedback latency, not for dollars. Take the cheap, high-ROI fixes. Do not set a $0 budget cap (it converts a cost event into an outage). Accept modest overage if it ever occurs.

Mechanisms (all in ci.yml + deploy-dev.yml)

  1. amd64-only image builds on PRs; multi-arch only on main / release tags / the pilot trunk. PR builds set platforms: linux/amd64 and skip the QEMU setup step entirely. main, v* tags, and redesign/v06-pilot keep linux/amd64,linux/arm64 so any Apple-Silicon GHCR pulls of the published images keep working. The conditional is a workflow expression on github.event_name.

  2. concurrency with cancel-in-progress scoped to PRs. A top-level group keyed on github.head_ref || github.ref cancels superseded runs; cancel-in-progress is true only for pull_request events. Pushes to main / tags / the pilot trunk are not auto-cancelled — they queue and run serially so each can publish its sha-<...> / :latest images to GHCR (which deploy-dev depends on). Caveat: "not auto-cancelled" is not "never cancelled" — GitHub keeps one running + one pending per group, so a backlog of 3+ rapid main pushes can still evict a pending middle run (it never starts, so its sha- image never publishes). deploy-dev's main path tolerates this by re-resolving to current main HEAD and gating on image existence, but a deploy pinned to a skipped middle SHA (manual rollback) would 404.

  3. Path-gate the (non-required) image build. A fast changes job (using dorny/paths-filter) determines whether a PR touches what the Dockerfiles actually bake in; build-images runs on a PR only when it does. The required status checks — build-test and integration-tests — always run on every PR (no if: condition), so branch protection still gates every change. The filter mirrors both Dockerfiles' COPY sets (api: src/, scripts/, package*.json, tsconfig.json, Dockerfile; ui: the whole ui/ tree). test/** is excluded — neither image copies it.

    Invariant — do not break this: build-images must never be added to required status checks. It is conditionally skipped on PRs, and a skipped required check leaves a PR's merge state pending forever (a skipped job reports no conclusion to a required context). Only the two unconditional jobs (build-test, integration-tests) may be required. "Branch protection gates every change" holds because those two always run — do not "harden" CI by adding the conditional job.

    Because a docs/test-only PR now skips the image build, deploy-dev's PR-preview path gained an image-existence guard: it checks the triggering ci run's build-images jobs succeeded before deploying, and posts a ::notice instead of failing red on a missing pr-N tag. This mirrors the gate the main path already had. (The guard collapses "skipped" and "failed" into one skip path with an "…likely a docs/test-only PR" notice; a genuine build failure still shows red on the build-images job itself, so it is not hidden — only the deploy notice is imprecise about the cause.)

  4. Cap heavy jobs and the budget — contain unbounded burn. build-images should carry a timeout-minutes (e.g. 30) so a wedged QEMU build fails fast instead of billing to the 6-hour default — this directly kills the hang tail that caused this whole exercise. (Shipping the timeout is a tracked follow-up; the amd64-only change already removes the QEMU hang source on PRs.) Pair it with a non-zero Actions budget ceiling set well above expected burn: a $0 cap is an outage, but no ceiling leaves runaway burn unbounded — a generous hard cap trips only on a genuine runaway.
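As an illustration of how the four mechanisms could fit together in one workflow, here is a minimal sketch. It is not the shipped ci.yml: the job names, branch names, and COPY paths come from this ADR, but the filter name `app`, the matrix contexts, the placeholder test step, and the omission of registry login and tagging are assumptions made for brevity.

```yaml
# Illustrative ci.yml sketch. Job names follow the ADR; the filter name "app",
# the matrix contexts, and the placeholder test step are assumptions.
name: ci

on:
  pull_request:
  push:
    branches: [main, redesign/v06-pilot]
    tags: ['v*']

# Mechanism 2: one group per PR head branch (or pushed ref). Superseded runs
# are cancelled only for pull_request events; pushes queue instead.
concurrency:
  group: ci-${{ github.head_ref || github.ref }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

jobs:
  # Required check: no `if:`, so it runs on every PR.
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "lint, build, unit tests"   # placeholder step

  # Mechanism 3: cheap PR-only job that reports whether image inputs changed.
  changes:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: read   # paths-filter lists PR files via the API
    outputs:
      app: ${{ steps.filter.outputs.app }}
    steps:
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            app:
              - 'src/**'
              - 'scripts/**'
              - 'package*.json'
              - 'tsconfig.json'
              - 'Dockerfile'
              - 'ui/**'

  # Not a required check: skipped on PRs that touch nothing the images bake in.
  build-images:
    needs: changes
    # !cancelled() lets the job run on pushes, where `changes` is skipped.
    if: >-
      !cancelled() &&
      (github.event_name != 'pull_request' || needs.changes.outputs.app == 'true')
    runs-on: ubuntu-latest
    timeout-minutes: 30   # Mechanism 4: fail fast instead of billing for hours
    strategy:
      matrix:
        include:
          - image: api
            context: .
          - image: ui
            context: ui
    steps:
      - uses: actions/checkout@v4
      # Mechanism 1: QEMU is set up only when arm64 is actually built.
      - uses: docker/setup-qemu-action@v3
        if: github.event_name != 'pull_request'
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: ${{ matrix.context }}
          push: false   # registry login and tagging omitted from this sketch
          platforms: ${{ github.event_name == 'pull_request' && 'linux/amd64' || 'linux/amd64,linux/arm64' }}
```

The `platforms` expression relies on `'linux/amd64'` being truthy, so the `&& ... ||` form behaves like a ternary. The `if:` on build-images is written as a folded block scalar because a plain YAML value cannot start with `!`.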

Combined, the shipped mechanisms cut sv0-platform CI minutes by an estimated ~55–70% — a projection, not yet a measured month. The saving comes almost entirely from eliminating the multi-arch QEMU hang tail (the dominant cost) and cancelling superseded PR runs, not from cheaper warm builds: a warm build-images is ~unchanged (~12 min) whether amd64-only or multi-arch, because GitHub bills wall-clock and the warm arm64 layer was not the slow part. PR-per-change is untouched.

Consequences

Positive

  • Eliminates the multi-hour QEMU hang tail on PRs (the dominant cost) and removes arm64 emulation from the PR feedback loop. Projected ~55–70% aggregate CI-minute reduction.
  • PR-per-change discipline is preserved. No long-lived branches forced; required checks unchanged.
  • Published images are unchanged. main/release builds still produce full multi-arch.
  • The pattern is portable. Because spend follows active dev, the same mechanisms apply to any sibling repo when it becomes the hot one.

Negative / accepted

  • No change to the deploy/preview workflow, but there is a test-parity shift: arm64 image build failures (native dep, base-image arch quirk) now surface only on main/tag builds, not per-PR. Before, every PR at least built (never ran) the arm64 image. Accepted because no arm64 deploy target exists; the "When to Reconsider" arm64 trigger covers re-enabling it.
  • PR previews of docs/test-only PRs no longer exist — there is nothing to preview; the deploy degrades to a notice, not a failure.
  • Superseded run results are lost when you push again to a PR — only the latest commit's run matters for merge.
  • A path-filter miss could skip an image build for a real change, producing a stale preview (never a wrong merge — required checks still run). The filter is built from the Dockerfiles' COPY sets to minimise this; if it drifts, widen the filter.
  • arm64 GHCR images for PR tags are gone. Accepted: nothing consumed them.

Trade-offs deliberately rejected

  • Fewer PRs / long-lived branches (the original instinct). Rejected as the primary lever — see Principle 1. Two narrow slices of the idea are kept as follow-ups: batch trivial/agent-generated churn, and move expensive optional checks to manual triggers.
  • Self-hosted runners (infra/github-runner exists, unused). Rejected on two grounds: (1) the only host is the memory-constrained Mac Mini that has caused kernel panics under load; (2) a self-hosted runner must never run pull_request jobs from forks — that is arbitrary code execution on persistent hardware sitting next to deploy keys and .env credential stores. Currently moot (repo is private, 0 forks) but a hard constraint if that ever changes. If native arm64 is ever needed, use GitHub's ephemeral ubuntu-24.04-arm runners (no QEMU), which avoid both problems.
  • A $0 budget cap. Rejected: it turns a cost event into a CI outage. The correct control is a non-zero ceiling plus per-job timeout-minutes (Decision §4) — containment without self-DoS.

Migration plan

Shipped as one PR (sv0-platform#1301); ~3 hours; ci.yml + deploy-dev.yml only.

One bug surfaced during verification and is recorded as a reusable gotcha: dorny/paths-filter lists a PR's changed files via the PR Files API, which needs pull-requests: read. The repo default GITHUB_TOKEN is read-only (contents: read only), so the changes job first failed with "Resource not accessible by integration". Fix: a per-job permissions: { contents: read, pull-requests: read } block.

Note on action pinning: third-party actions here (dorny/paths-filter@v3, docker/*, etc.) are tag-pinned for readability. deploy-dev.yml already SHA-pins one action (webfactory/ssh-agent); for a security-conscious repo, SHA-pinning the rest is a reasonable hardening follow-up (low urgency while the repo is private with no forks).

Follow-ups (not in #1301)

  • Add timeout-minutes to build-images (and audit other long-runnable jobs). This is the direct fix for the hang tail that caused the spend spike — the highest-value remaining item; the amd64-only change already removes the QEMU hang source on PRs, but a timeout is the defense-in-depth that caps any future wedge.
  • Set a non-zero Actions budget ceiling (well above expected burn) as the containment layer for runaway burn — never $0.
  • Label-gate PR-preview image builds. Only 4 dev preview slots exist (OOM guard), so building images for every PR when at most 4 can deploy is wasteful.
  • Move visual-regression and release multi-arch builds to workflow_dispatch / label-gated. This is the "manually triggered heavy CI when required" model, applied precisely where it pays.
  • Replicate the mechanisms to sibling repos (sv0-connectors first — it was April's hot repo).
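The label-gated and manually triggered follow-ups above might look roughly like the sketch below. The label name `preview` and the placeholder build step are hypothetical; nothing here has shipped.

```yaml
# Hypothetical follow-up sketch: the "preview" label and the build step are assumptions.
on:
  pull_request:
    # "labeled" re-triggers the workflow when someone adds the label later.
    types: [opened, synchronize, reopened, labeled]
  workflow_dispatch:

jobs:
  build-images:
    # Build on manual dispatch, or on PRs that carry the (assumed) "preview" label.
    if: >-
      github.event_name == 'workflow_dispatch' ||
      contains(github.event.pull_request.labels.*.name, 'preview')
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - run: echo "image build steps go here"   # placeholder step
```

Under this shape an unlabeled PR skips the image build entirely, so the invariant from the Decision section still applies: the gated job must stay out of required status checks.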

When to Reconsider

  • A deploy target starts running on arm64 (e.g., Graviton/Ampere VMs to cut hosting cost). Then PR builds need arm64 again — switch the PR path to native ubuntu-24.04-arm runners, not QEMU.
  • The path filter causes a stale preview that misleads a reviewer. Widen the app filter or revert to always-building on PRs.
  • CI minutes climb back toward the cap despite these fixes. Pick up the follow-ups (label-gated previews, manual heavy checks) and apply the pattern to whichever repo is now hot.
  • The org moves to a managed CI platform or a paid Actions tier with different economics. Re-evaluate the cost-vs-velocity trade in Principle 2.
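For the first trigger above (a deploy target moving to arm64), a native-runner build could be sketched as a per-architecture matrix instead of a single QEMU-emulated job. The runner label ubuntu-24.04-arm is the one named in this ADR; the matrix layout and the omission of registry login and push are assumptions.

```yaml
# Hypothetical sketch: each architecture builds natively on its own runner, no QEMU step.
jobs:
  build-images:
    strategy:
      matrix:
        include:
          - platform: linux/amd64
            runner: ubuntu-24.04
          - platform: linux/arm64
            runner: ubuntu-24.04-arm
    runs-on: ${{ matrix.runner }}
    timeout-minutes: 30   # still capped, in case a native build wedges
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: .
          platforms: ${{ matrix.platform }}
          push: false   # registry login and tagging omitted from this sketch
```

Each matrix leg produces a single-architecture image. Publishing one multi-arch tag would need an extra step that merges the per-arch images into a manifest list (for example with docker buildx imagetools create), which this sketch leaves out.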