ADR-030: CI Cost & Build-Architecture Strategy
Status
Accepted — 2026-05-22. Shipped in sv0-platform#1301 (issue #1300).
Operational detail (how to diagnose a spend spike, the billing-API command, the full lever list) lives in ci-cd-operations.md § Cost and Actions Minutes. This ADR records the why.
Context
On 2026-05-22 GitHub alerted that we had used 90% of the 50,000 included Actions minutes (~45,000 used; the alert said 45,002, the billing API showed 45,019 by end of day) with 10 days left in the cycle. The first instinct was that we open too many PRs — even one-line changes get a PR — and that we should move to long-lived branches with manually-triggered heavy CI. The data said otherwise.
Where the minutes actually go
Pulled from the GitHub enhanced billing usage API (gh api /organizations/SecurityV0/settings/billing/usage; the legacy /orgs/.../settings/billing/actions endpoint is gone — HTTP 410):
| Month | Actions Linux minutes | Repo consuming ~all of it |
|---|---|---|
| March 2026 | 3,829 | excalidraw-diagram-skill |
| April 2026 | 33,780 | sv0-connectors |
| May 2026 (22 days) | 45,019 | sv0-platform |
Two facts fall out:
- The org-wide pool is effectively a single-repo pool: it follows whoever is doing the most active development. It is not one repo's misconfiguration; it is the process applied to whichever repo is hot that month. So the fix must be a process pattern, replicable to any repo, not a one-off.
- Within sv0-platform, ci.yml is the whole story. Of ~3,470 workflow runs that month, ci ran ~700 times and consumed ≈36,000 measured billed minutes, ≈80% of the entire org pool (summed per-job from the runs API). The remainder is the deploy/visual fan-out, mostly cheap.
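The per-month, per-repo table above can be rebuilt from the usage API's JSON with a short aggregation. The sketch below is illustrative only: the field names (usageItems entries carrying date, product, unitType, quantity, repositoryName) are an assumption about the response shape and should be checked against a real response, and the sample rows and repository names are hypothetical, not our billing data.

```python
from collections import defaultdict


def minutes_by_month_and_repo(usage_items):
    """Sum Actions minutes per (YYYY-MM, repository).

    Assumes each item is a dict with 'date', 'product', 'unitType',
    'quantity' and 'repositoryName' keys (assumed response shape).
    """
    totals = defaultdict(float)
    for item in usage_items:
        is_actions = item.get("product", "").lower() == "actions"
        is_minutes = item.get("unitType", "").lower() == "minutes"
        if not (is_actions and is_minutes):
            continue  # skip storage and non-Actions line items
        month = item["date"][:7]  # works for 'YYYY-MM-DD' and ISO timestamps
        totals[(month, item["repositoryName"])] += item["quantity"]
    return dict(totals)


# Hypothetical sample rows, not real usage.
SAMPLE_ITEMS = [
    {"date": "2026-05-01", "product": "actions", "unitType": "Minutes",
     "quantity": 100, "repositoryName": "example-org/repo-a"},
    {"date": "2026-05-02", "product": "actions", "unitType": "Minutes",
     "quantity": 50, "repositoryName": "example-org/repo-a"},
    {"date": "2026-05-02", "product": "actions", "unitType": "Minutes",
     "quantity": 25, "repositoryName": "example-org/repo-b"},
    {"date": "2026-04-30", "product": "actions", "unitType": "Minutes",
     "quantity": 10, "repositoryName": "example-org/repo-a"},
    {"date": "2026-05-02", "product": "actions", "unitType": "GigabyteHours",
     "quantity": 3, "repositoryName": "example-org/repo-a"},
]

SAMPLE_TOTALS = minutes_by_month_and_repo(SAMPLE_ITEMS)
```

Sorting the result by minutes within each month makes the "one hot repo" pattern visible at a glance.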
Why ci was so expensive — a hang tail, not a uniform per-run cost
The cost is bimodal, and the first analysis got this wrong by quoting a "~41-minute average." A typical ci run is cheap (~20 billed min); the spend lives in a long tail of multi-arch image builds that hang for hours:
Typical run (~631 of ~700): ~20 billed min total
- build-test: ~4 min
- integration-tests: ~2 min (MongoDB service container)
- build-images (api): ~5 min, multi-arch amd64 + arm64 (QEMU emulation)
- build-images (ui): ~8 min, multi-arch amd64 + arm64 (QEMU emulation)

Hang tail (67 runs): 300–1,078 min wall-clock each
- A build-images (api|ui) job stuck under arm64 QEMU emulation, billing until it hit the 6h job ceiling or was cancelled. These ~67 runs dominate ci spend.
The "~41-minute average" cited in the first pass was an artifact: a random run sample catches a few multi-hundred-minute hangs, which drag the mean up. The real driver is the hang tail, not the body — which is why the fix targets the thing that causes hangs (arm64-via-QEMU) rather than shaving minutes off warm runs.
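A small calculation shows why the mean misleads on a bimodal distribution like this one. The run counts and the ~20-minute typical cost come from the breakdown above. The per-hang figure is an assumption: it uses the 300-minute floor of the wall-clock range as a stand-in for billed minutes, so the result illustrates the shape of the problem rather than reproducing the measured totals.

```python
from statistics import mean, median

TYPICAL_RUNS = 631      # from the breakdown above
TYPICAL_MINUTES = 20    # ~20 billed min per typical run
HANG_RUNS = 67          # from the breakdown above
HANG_MINUTES = 300      # ASSUMPTION: floor of the 300-1,078 min range

# One entry per run, in billed minutes.
runs = [TYPICAL_MINUTES] * TYPICAL_RUNS + [HANG_MINUTES] * HANG_RUNS

average_minutes = mean(runs)      # dragged upward by the tail
typical_minutes = median(runs)    # what a normal run actually costs
tail_share = (HANG_RUNS * HANG_MINUTES) / sum(runs)

print(f"mean={average_minutes:.1f} median={typical_minutes} "
      f"tail share={tail_share:.0%}")
```

Even at the conservative 300-minute floor, fewer than 10% of runs account for well over half the minutes, and the mean lands at more than double the median. Shaving the median would barely move the total.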
Four structural waste sources, all in ci.yml:
- Multi-arch image builds on every PR event. The platforms input was linux/amd64,linux/arm64, built on pull_request. The arm64 half runs under QEMU emulation: it is the slow half of every build and the source of the multi-hour hang tail above.
- No job timeout. build-images had no timeout-minutes, so a wedged QEMU build billed up to the 6-hour GitHub default (some runs reached ~18h wall-clock across retries) instead of failing fast. This is the single biggest waste and the easiest to cap.
- No concurrency / cancel-in-progress. Every push to a PR branch launched a fresh full CI; superseded in-flight runs were never cancelled. Humans and Claude Code agents both push frequently.
- No path filtering. Docs-only, markdown, .claude/, and test-only PRs ran the full pipeline including both multi-arch image builds.
arm64 has no consumer
Verified before removing it: zero arm64/aarch64 references in any workflow or deploy script; deploys pull images by tag with no --platform, so an x86 host pulls amd64. Deploy targets (Hetzner VPS, Azure VMs per ADR-024) and GitHub-hosted runners are all x86. Local development builds its own images (docker compose up --build) rather than pulling arm64 from GHCR. The arm64 images were built on every PR and consumed by nothing.
The financial reframe
Actions overage is $0.006/min for steady-state burn: ~$60 for 10k minutes over the 50k pool, ~$300 for 100k total (double our burn). For a funded company this is noise. The real risks are:
- A $0 Actions budget cap would halt all CI until the cycle resets, so development stops.
- Multi-hour build hangs that block every developer's and agent's feedback on the affected PR.
- Unbounded runaway burn. The $60/$300 figures assume bounded spend. A wedged job with no timeout (our actual May situation) or a workflow loop has no ceiling; it is bounded only by reaction time to an alert. Steady-state cost is not the failure mode to design against; unbounded burn is.
So the objective is not to minimise dollars. It is to (a) stay clear of the hard cap, (b) contain unbounded burn with hard limits (timeouts, a non-zero budget ceiling — see Decision), and (c) tighten the feedback loop by deleting work nobody uses — without sacrificing the per-change CI safety the PR-per-change workflow gives us.
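The steady-state overage figures quoted above follow from one line of arithmetic. This sketch uses the 50,000-minute pool and the $0.006/min rate from the text; the function name is illustrative.

```python
INCLUDED_MINUTES = 50_000   # included Actions minutes per cycle (from the text)
OVERAGE_RATE_USD = 0.006    # per-minute overage rate (from the text)


def overage_cost(total_minutes, included=INCLUDED_MINUTES, rate=OVERAGE_RATE_USD):
    """Dollar cost of the minutes used beyond the included pool."""
    minutes_over = max(0, total_minutes - included)
    return round(minutes_over * rate, 2)


cost_10k_over = overage_cost(60_000)    # 10k minutes over the pool
cost_100k_total = overage_cost(100_000) # 100k total, i.e. 50k over
cost_may_to_date = overage_cost(45_019) # still inside the pool
```

The function is only meaningful when total_minutes is bounded. A wedged job without a timeout makes that input open-ended, which is the case the containment measures are meant to rule out.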
Decision
Two principles and four mechanisms.
Principles
1. Cut per-run cost before run count. Reducing how many PRs we open (the long-lived-branch idea) is the smaller lever: it costs us per-change CI safety, makes reviews harder, invites merge conflicts, and fights the issue-per-change discipline. At half the PRs we would still spend ~22k minutes/month. We keep PR-per-change and attack the cost per run instead.
2. Optimise for cap-block risk and feedback latency, not for dollars. Take the cheap, high-ROI fixes. Do not set a $0 budget cap (it converts a cost event into an outage). Accept modest overage if it ever occurs.
Mechanisms (all in ci.yml + deploy-dev.yml)
1. amd64-only image builds on PRs; multi-arch only on main / release tags / the pilot trunk. PR builds set platforms: linux/amd64 and skip the QEMU setup step entirely. main, v* tags, and redesign/v06-pilot keep linux/amd64,linux/arm64 so any Apple-Silicon GHCR pulls of the published images keep working. The conditional is a workflow expression on github.event_name.
2. concurrency with cancel-in-progress scoped to PRs. A top-level group keyed on github.head_ref || github.ref cancels superseded runs; cancel-in-progress is true only for pull_request events. Pushes to main / tags / the pilot trunk are not auto-cancelled: they queue and run serially so each can publish its sha-<...> / :latest images to GHCR (which deploy-dev depends on). Caveat: "not auto-cancelled" is not "never cancelled". GitHub keeps one running + one pending per group, so a backlog of 3+ rapid main pushes can still evict a pending middle run (it never starts, so its sha- image never publishes). deploy-dev's main path tolerates this by re-resolving to current main HEAD and gating on image existence, but a deploy pinned to a skipped middle SHA (manual rollback) would 404.
3. Path-gate the (non-required) image build. A fast changes job (using dorny/paths-filter) determines whether a PR touches what the Dockerfiles actually bake in; build-images runs on a PR only when it does. The required status checks, build-test and integration-tests, always run on every PR (no if: condition), so branch protection still gates every change. The filter mirrors both Dockerfiles' COPY sets (api: src/, scripts/, package*.json, tsconfig.json, Dockerfile; ui: the whole ui/ tree). test/** is excluded because neither image copies it.

   Invariant (do not break this): build-images must never be added to required status checks. It is conditionally skipped on PRs, and a skipped required check leaves a PR's merge state pending forever (a skipped job reports no conclusion to a required context). Only the two unconditional jobs (build-test, integration-tests) may be required. "Branch protection gates every change" holds because those two always run; do not "harden" CI by adding the conditional job.

   Because a docs/test-only PR now skips the image build, deploy-dev's PR-preview path gained an image-existence guard: it checks that the triggering ci run's build-images jobs succeeded before deploying, and posts a ::notice instead of failing red on a missing pr-N tag. This mirrors the gate the main path already had. (The guard collapses "skipped" and "failed" into one skip path with an "…likely a docs/test-only PR" notice; a genuine build failure still shows red on the build-images job itself, so it is not hidden. Only the deploy notice is imprecise about the cause.)
4. Cap heavy jobs and the budget to contain unbounded burn. build-images should carry a timeout-minutes (e.g. 30) so a wedged QEMU build fails fast instead of billing to the 6-hour default. This directly kills the hang tail that caused this whole exercise. (Shipping the timeout is a tracked follow-up; the amd64-only change already removes the QEMU hang source on PRs.) Pair it with a non-zero Actions budget ceiling set well above expected burn: a $0 cap is an outage, but no ceiling leaves runaway burn unbounded. A generous hard cap trips only on a genuine runaway.
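The effect of a job timeout on the hang tail is easy to bound. The sketch below compares the worst case per wedged job under the 6-hour default and under the example 30-minute cap. It assumes one wedged job per hang run and no retries, so it is an upper bound per attempt rather than a forecast; the function names are illustrative.

```python
DEFAULT_JOB_TIMEOUT_MIN = 360   # GitHub's default per-job ceiling (6 hours)
PROPOSED_TIMEOUT_MIN = 30       # the example value from the text
HANG_RUNS = 67                  # hang-tail runs observed in May


def billed_minutes(wall_clock_min, timeout_min=DEFAULT_JOB_TIMEOUT_MIN):
    """A job bills for its wall-clock time, cut off at the timeout."""
    return min(wall_clock_min, timeout_min)


def worst_case_tail(hung_jobs, timeout_min):
    """Upper bound on minutes billed if every hung job runs to its timeout.

    ASSUMPTION: one wedged job per run, no retries.
    """
    return hung_jobs * timeout_min


tail_uncapped = worst_case_tail(HANG_RUNS, DEFAULT_JOB_TIMEOUT_MIN)
tail_capped = worst_case_tail(HANG_RUNS, PROPOSED_TIMEOUT_MIN)
```

A healthy ~8-minute build is unaffected by the cap (billed_minutes(8, 30) is still 8), while a wedged one stops at 30. The timeout should sit comfortably above the slowest legitimate cold build so it trips only on real hangs.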
Combined, the shipped mechanisms cut sv0-platform CI minutes by an estimated ~55–70% — a projection, not yet a measured month. The saving comes almost entirely from eliminating the multi-arch QEMU hang tail (the dominant cost) and cancelling superseded PR runs, not from cheaper warm builds: a warm build-images is ~unchanged (~12 min) whether amd64-only or multi-arch, because GitHub bills wall-clock and the warm arm64 layer was not the slow part. PR-per-change is untouched.
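The concurrency caveat in mechanism 2 (one running plus one pending run per group) can be modelled in a few lines. This is a simplified simulation, not GitHub's scheduler: it assumes cancel-in-progress is off and that every push arrives while the first run is still executing. The SHA labels are hypothetical.

```python
def simulate_group(pushes):
    """Return (ran, evicted) for pushes queued in one concurrency group.

    Model: one run executes, at most one run waits. A newer arrival
    replaces whatever is waiting; the replaced run never starts.
    ASSUMPTION: all pushes arrive before the first run finishes.
    """
    running = None
    pending = None
    evicted = []
    for sha in pushes:
        if running is None:
            running = sha            # first push starts immediately
        else:
            if pending is not None:
                evicted.append(pending)  # superseded while waiting
            pending = sha
    ran = [sha for sha in (running, pending) if sha is not None]
    return ran, evicted


ran, evicted = simulate_group(["sha-a", "sha-b", "sha-c"])
```

With three rapid pushes, the first and last run and the middle one is evicted, so its sha- image is never published. That is the case where a deploy pinned to the middle SHA would fail, while a deploy that re-resolves to current HEAD is unaffected.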
Consequences
Positive
- Eliminates the multi-hour QEMU hang tail on PRs (the dominant cost) and removes arm64 emulation from the PR feedback loop. Projected ~55–70% aggregate CI-minute reduction.
- PR-per-change discipline is preserved. No long-lived branches forced; required checks unchanged.
- Published images are unchanged. main/release builds still produce full multi-arch.
- The pattern is portable. Because spend follows active dev, the same mechanisms apply to any sibling repo when it becomes the hot one.
Negative / accepted
- No change to the deploy/preview workflow, but there is a test-parity shift: arm64 image build failures (native dep, base-image arch quirk) now surface only on main/tag builds, not per-PR. Before, every PR at least built (never ran) the arm64 image. Accepted because no arm64 deploy target exists; the "When to Reconsider" arm64 trigger covers re-enabling it.
- PR previews of docs/test-only PRs no longer exist. There is nothing to preview; the deploy degrades to a notice, not a failure.
- Superseded run results are lost when you push again to a PR; only the latest commit's run matters for merge.
- A path-filter miss could skip an image build for a real change, producing a stale preview (never a wrong merge, because required checks still run). The filter is built from the Dockerfiles' COPY sets to minimise this; if it drifts, widen the filter.
- arm64 GHCR images for PR tags are gone. Accepted: nothing consumed them.
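The path gate and its drift risk can be checked offline with a small matcher over a PR's changed files. This is an approximation under stated assumptions: it uses Python's fnmatch, where * also crosses directory separators, rather than the glob engine dorny/paths-filter actually uses, and the pattern list is transcribed from the COPY sets described in the Decision section. The names APP_PATTERNS and needs_image_build are hypothetical.

```python
from fnmatch import fnmatchcase

# Transcribed from the Dockerfiles' COPY sets described in the Decision.
# In fnmatch, '*' matches '/' too, so 'src/*' covers the whole tree.
APP_PATTERNS = [
    "src/*",
    "scripts/*",
    "package*.json",
    "tsconfig.json",
    "Dockerfile",
    "ui/*",
]


def needs_image_build(changed_files, patterns=APP_PATTERNS):
    """True if any changed path matches something the images bake in."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


docs_only = needs_image_build(["docs/adr/ADR-030.md", "test/unit/a.test.ts"])
api_change = needs_image_build(["src/index.ts"])
ui_change = needs_image_build(["README.md", "ui/src/App.tsx"])
```

Running a check like this against the file lists of recently merged PRs is one way to spot drift: any PR that changed image behaviour but returns False points at a COPY path missing from the filter.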
Trade-offs deliberately rejected
- Fewer PRs / long-lived branches (the original instinct). Rejected as the primary lever; see Principle 1. Two narrow slices of the idea are kept as follow-ups: batch trivial/agent-generated churn, and move expensive optional checks to manual triggers.
- Self-hosted runners (infra/github-runner exists, unused). Rejected on two grounds: (1) the only host is the memory-constrained Mac Mini that has caused kernel panics under load; (2) a self-hosted runner must never run pull_request jobs from forks, because that is arbitrary code execution on persistent hardware sitting next to deploy keys and .env credential stores. Currently moot (repo is private, 0 forks) but a hard constraint if that ever changes. If native arm64 is ever needed, use GitHub's ephemeral ubuntu-24.04-arm runners (no QEMU), which avoid both problems.
- A $0 budget cap. Rejected: it turns a cost event into a CI outage. The correct control is a non-zero ceiling plus per-job timeout-minutes (Decision §4): containment without self-DoS.
Migration plan
Shipped as one PR (sv0-platform#1301); ~3 hours; ci.yml + deploy-dev.yml only.
One bug surfaced during verification and is recorded as a reusable gotcha: dorny/paths-filter lists a PR's changed files via the PR Files API, which needs pull-requests: read. The repo default GITHUB_TOKEN is read-only (contents: read only), so the changes job initially failed with "Resource not accessible by integration". Fix: a per-job permissions: { contents: read, pull-requests: read } block.
Note on action pinning: third-party actions here (dorny/paths-filter@v3, docker/*, etc.) are tag-pinned for readability. deploy-dev.yml already SHA-pins one action (webfactory/ssh-agent); for a security-conscious repo, SHA-pinning the rest is a reasonable hardening follow-up (low urgency while the repo is private with no forks).
Follow-ups (not in #1301)
- Add timeout-minutes to build-images (and audit other long-runnable jobs). This is the direct fix for the hang tail that caused the spend spike and the highest-value remaining item; the amd64-only change already removes the QEMU hang source on PRs, but a timeout is the defense-in-depth that caps any future wedge.
- Set a non-zero Actions budget ceiling (well above expected burn) as the containment layer for runaway burn; never $0.
- Label-gate PR-preview image builds. Only 4 dev preview slots exist (OOM guard), so building images for every PR when at most 4 can deploy is wasteful.
- Move visual-regression and release multi-arch builds to workflow_dispatch / label-gated. This is the "manually triggered heavy CI when required" model, applied precisely where it pays.
- Replicate the mechanisms to sibling repos (sv0-connectors first, since it was April's hot repo).
When to Reconsider
- A deploy target starts running on arm64 (e.g., Graviton/Ampere VMs to cut hosting cost). Then PR builds need arm64 again: switch the PR path to native ubuntu-24.04-arm runners, not QEMU.
- The path filter causes a stale preview that misleads a reviewer. Widen the app filter or revert to always-building on PRs.
- CI minutes climb back toward the cap despite these fixes. Pick up the follow-ups (label-gated previews, manual heavy checks) and apply the pattern to whichever repo is now hot.
- The org moves to a managed CI platform or a paid Actions tier with different economics. Re-evaluate the cost-vs-velocity trade in Principle 2.