ADR-019: Infrastructure-as-Code Strategy

Status

Accepted (2026-04-23)

Context

SecurityV0 currently manages infrastructure through three uncoordinated surfaces:

Dashboard-managed — Cloudflare (DNS, Zero Trust Access applications, service tokens), GitHub Actions secrets and Environments, Hetzner VMs, BetterStack and Grafana Cloud (pending signup), MongoDB Atlas (pending provisioning). Every change is a click. Drift between what reviewers believe is deployed and what is actually deployed is guaranteed over any timescale longer than a week.
Partially terraformed — sv0-connectors/infra/ has real HCL modules for Azure Foundry and Azure identity, with a 1Password Service Account credential pattern and Makefile wrappers. sv0-demo-labs/labs/*/infra/ has Terraform for per-lab Azure environments. Both use local state, per-developer.
Not yet in existence — post-pilot cloud-agnostic compute (AWS VM or Azure VM per the readiness-review §2.3 decision), MediaPro Atlas cluster, observability stack config, any cross-cutting resources that cross the product-repo boundary.

Three pressures are forcing this decision now:

MediaPro pilot in early May. Pilot infrastructure is being provisioned right now; everything new we provision during pilot week is a dashboard click we will have to reconcile later.
Dedicated-deployment customers are a known future requirement. ADR-016's amendment committed to supporting single-tenant deployments via a SINGLE_TENANT_SLUG environment flag when the first such customer signs. That promise is only as cheap as our ability to stamp out a new customer stack — Atlas cluster, compute VM, Cloudflare DNS, GitHub environment secrets, observability wiring — in hours not days.
The 2026-04-12 architecture review flagged hand-managed deploy tooling as a ship-blocker for enterprise credibility. Terraforming production infrastructure is step one of the managed-platform migration path ADR-018 committed to.

The alternative paths — keeping dashboard management and writing runbooks, or deferring IaC until after pilot #3 — were considered and rejected. Runbooks encode click-paths that rot every time a vendor changes their UI. Deferring means that every tenant we onboard between pilot-1 and pilot-3 is a second stack we'll have to terraform-import later, with all the import-in-production risk that entails.

Decision

Adopt Terraform as the single IaC tool, a hybrid repo structure (new sv0-infrastructure for cross-cutting, existing product-scoped modules stay where they are), and Terraform Cloud free tier for state and plan/apply orchestration. Design every module so it is tenant-parameterizable from day one: each customer stack is an envs/tenant-<slug>/ instantiation of the same module set.

The decision has six interlocking parts.

1. Tool: Terraform (HashiCorp)

Stay on the upstream Terraform 1.9+ binary with HCL. Two working Terraform codebases already exist in sv0-connectors/infra/ and sv0-demo-labs/labs/*/infra/; porting them to Pulumi is negative-value work. OpenTofu is defensible — same HCL, same providers, BSL-proof — but its provider registry lags HashiCorp's by 1-2 releases, and several providers we need (Cloudflare, MongoDB Atlas, BetterStack, Grafana Cloud) publish to the HashiCorp registry first.

Revisit trigger. HashiCorp relicenses the Terraform binary itself under terms that block our usage. At that point switch to OpenTofu; HCL and modules are unchanged.
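
A minimal sketch of how the 1.9+ floor could be encoded, assuming a versions.tf file at the root of each workspace (the file name and the subset of providers listed are illustrative, not taken from the existing repos). Provider version constraints are deliberately left out; they would need pinning against whatever releases are current when the file is written.

```hcl
# versions.tf (hypothetical file name)
terraform {
  # Encodes the "Terraform 1.9+" floor stated in this section.
  required_version = ">= 1.9.0"

  required_providers {
    # Source addresses as published on the public registry.
    cloudflare = {
      source = "cloudflare/cloudflare"
    }
    mongodbatlas = {
      source = "mongodb/mongodbatlas"
    }
    github = {
      source = "integrations/github"
    }
    grafana = {
      source = "grafana/grafana"
    }
  }
}
```

Keeping the version floor and provider sources in one file per workspace means a later tool switch would touch a single, predictable place.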

2. Repo structure: Hybrid

New sv0-infrastructure repo holds cross-cutting infrastructure: Cloudflare zone and DNS, Zero Trust Access applications, GitHub org-level settings and environment secrets, MongoDB Atlas project and cluster, BetterStack monitors and status page, Grafana Cloud stack, and (Phase 3) the post-pilot compute landing zone on AWS or Azure.

Existing product-scoped Terraform stays where it is:

sv0-connectors/infra/ — connector-dev Azure modules (Foundry, identity) with 1Password SP credentials and Makefile wrappers. Per-developer, local state, not managed centrally. No migration.
sv0-demo-labs/labs/*/infra/ — per-lab Azure environments. Same pattern. No migration.
sv0-platform/infra/ — currently empty. Phase 3 populates it with the post-pilot compute VM module when MediaPro lands.

Why hybrid and not everything in one repo: the blast radius, cadence, and access-control profile of cross-cutting infrastructure (Cloudflare zone, production compute) differ substantially from product-scoped dev environments. Bundling them pressures every production change to compete for review attention with connector-dev module additions. Separating repos lets CODEOWNERS, branch protection, and Terraform Cloud workspace approval policies encode the different trust boundaries structurally.

Why hybrid and not per-product Terraform in every repo: Cloudflare DNS does not belong to any single product. GitHub org-level secrets do not belong to any single product. Atlas projects at this scale serve the platform, not a specific repo. Putting cross-cutting config inside sv0-platform/infra/ or sv0-connectors/infra/ encodes a false ownership claim and forces engineers in unrelated repos to edit someone else's repo for routine DNS changes.

3. State backend: Terraform Cloud free tier

Terraform Cloud (HCP Terraform) free tier provides remote state with locking, VCS-driven plan-on-PR, and apply-on-merge with workspace-level manual-approval gates, all hosted with no ops burden. The current "Enhanced Free" tier (HashiCorp consolidated the legacy Free plan into Enhanced Free on 2026-03-31) covers 500 managed resources, unlimited users, 1 concurrent run, 1 policy set with up to 5 policies, and 1 self-hosted agent. At the end of Phase 4 the total resource count is projected at 40-80, roughly an order of magnitude below the 500-resource limit. The 1-concurrent-run cap is a non-issue with one infra engineer; it becomes a queue problem only when two or more engineers run plans simultaneously.
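
As a rough illustration, each workspace directory would point at Terraform Cloud through a cloud block like the one below. The organization and workspace names are placeholders invented for this sketch, not names that exist today.

```hcl
terraform {
  cloud {
    # Hypothetical organization name.
    organization = "securityv0"

    workspaces {
      # Hypothetical workspace name, assumed to map to envs/shared/.
      name = "sv0-infrastructure-shared"
    }
  }
}
```

A configuration cannot contain both a cloud block and a backend block, so a move to a different backend would replace this block rather than sit alongside it.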

What the free tier does NOT include, and how we work around it:

Scheduled drift detection (Health Assessments) is a Standard/Premium feature, not free. Phase 1-3 rely on manual periodic terraform plan (Ivan runs it weekly, eyeballs the diff). Phase 4 adds a GitHub Actions scheduled workflow that runs terraform plan on a daily cron using OIDC-federated dynamic credentials to TFC — no long-lived TFC token stored in GitHub — and opens/updates a GitHub Issue in sv0-infrastructure on detected diff. This replaces the Health Assessment feature at zero cost. The paid-tier upgrade trigger is "weekly manual plan is no longer enough" (likely when we have >1 person making infra changes, or the surface area grows past ~100 resources).
Notifications are webhook / Slack / Teams / email, not native GitHub Issue creation. The Phase-4 drift workflow writes the issue directly via gh issue create from the scheduled workflow; we don't rely on TFC's notification channels for the issue-creation path.

Rejected: S3 + DynamoDB on AWS. Industry default, fully open-source, but requires (a) bootstrapping an AWS account and IAM OIDC federation before the first terraform apply, (b) writing custom GitHub Actions workflow for plan-comment-on-PR across every repo, (c) paying the cognitive cost of IaC-bootstrapping-IaC during pilot week.

Revisit trigger. Resource count approaches 400, or user count approaches 5. Migration path is terraform state pull → reconfigure backend block → terraform state push. Approximately 30 min per workspace.

4. Modules are tenant-parameterizable by design

Every production module accepts a tenant_slug variable (or equivalent) and is structured so that stamping out a dedicated-deployment customer stack is a matter of instantiating the same modules in a new envs/tenant-<slug>/ directory with different variable values. This ADR commits to this design principle as a first-class constraint, not an afterthought.

Concretely:

modules/atlas-cluster/ takes tenant_slug, emits a per-tenant Atlas project or cluster.
modules/compute-vm/ (Phase 3) takes tenant_slug + cloud_provider, emits a VM with tenant-scoped DNS records and Cloudflare Access policies.
modules/cloudflare-tenant-domain/ (optional for dedicated clients) takes tenant_slug + custom domain, configures DNS + Access for <tenant>.securityv0.com or sv0.<tenant-domain>/.
modules/github-environment/ takes tenant_slug, creates the corresponding GitHub Environment with the right secrets and protection rules.
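
A sketch of what the interface of one such module might look like, using modules/atlas-cluster/ as the example. The slug naming rule, the atlas_org_id input, and the sv0-<slug> project name are assumptions made for illustration; only tenant_slug comes from this ADR. The sketch covers the per-tenant project only and leaves the cluster resource out.

```hcl
# modules/atlas-cluster/variables.tf (sketch)
variable "tenant_slug" {
  type        = string
  description = "Short identifier for the tenant, used in resource names."

  validation {
    # Assumed rule: 2-32 chars, lowercase letters, digits, or hyphens,
    # with no hyphen at the start or end.
    condition     = can(regex("^[a-z0-9][a-z0-9-]{0,30}[a-z0-9]$", var.tenant_slug))
    error_message = "The tenant_slug value must be 2-32 lowercase letters, digits, or hyphens, and must not start or end with a hyphen."
  }
}

variable "atlas_org_id" {
  type        = string
  description = "MongoDB Atlas organization ID that owns the project (hypothetical input)."
}

# modules/atlas-cluster/main.tf (sketch)
resource "mongodbatlas_project" "tenant" {
  name   = "sv0-${var.tenant_slug}" # hypothetical naming scheme
  org_id = var.atlas_org_id
}

output "project_id" {
  description = "ID of the per-tenant Atlas project."
  value       = mongodbatlas_project.tenant.id
}
```

Validating the slug at the module boundary keeps a malformed value from propagating into DNS names, GitHub Environment names, and Atlas project names at apply time.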

Shared-SaaS production is a special case: envs/shared/ with a single tenant_slug = "prod" equivalent (or no tenant scoping where it doesn't apply — e.g., the SecurityV0-internal GitHub org config).

Dedicated-deployment customer onboarding becomes: create envs/tenant-<slug>/, instantiate the standard module set with customer-specific values, terraform apply, done. No per-customer fork, no copy-and-paste divergence.
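
The onboarding step could look roughly like this for a made-up tenant called "acme". Only the inputs named in this ADR (tenant_slug, cloud_provider) are shown; real modules would almost certainly need more, and the "aws" string format is an assumption.

```hcl
# envs/tenant-acme/main.tf ("acme" is a hypothetical tenant)
locals {
  tenant_slug = "acme"
}

module "atlas" {
  source      = "../../modules/atlas-cluster"
  tenant_slug = local.tenant_slug
}

module "github_environment" {
  source      = "../../modules/github-environment"
  tenant_slug = local.tenant_slug
}

module "compute" {
  source         = "../../modules/compute-vm"
  tenant_slug    = local.tenant_slug
  cloud_provider = "aws" # assumed value format; "azure" would be the alternative
}
```

Because every tenant directory references the same module sources, a fix to a module reaches all tenants on their next plan instead of being copied by hand.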

Why this matters now. The ADR-016 amendment committed to supporting dedicated deployments via the SINGLE_TENANT_SLUG env flag on the application side. This ADR commits the infrastructure side of that same promise. Without it, the app flag works but provisioning the surrounding stack is hand-rolled per customer — which is exactly the manual-click pattern we're eliminating.

5. Secrets boundary

Three tiers, each thing lives in exactly one:

Tier	Contents	Rotation cadence
1Password `sv0-infra` vault	Human-held root credentials: scoped Cloudflare API Token (Zone:Edit + Zero Trust:Edit + Account:Read on `securityv0.com`, not the legacy Global API Key — which is account-wide and cannot be narrowed), Atlas org-owner API key, GitHub PAT, BetterStack team token, Grafana Cloud org token, Hetzner API token. Human reads these once to paste into Terraform Cloud workspace variables.	Quarterly, via scheduled ops sprint with documented security scoping
Terraform Cloud workspace variables (marked sensitive)	Exactly the tokens Terraform needs to call each provider's API at apply time. Write-only via the TFC UI; readable by TFC runs, not by humans.	On rotation, triggered by 1Password update
GitHub Actions environment secrets (written by Terraform via the `github` provider)	Workflow-time secrets: `DEPLOY_SSH_KEY`, `DEPLOY_HOST`, `CF_ACCESS_CLIENT_*`, `METRICS_BEARER_TOKEN`, `MONGODB_URI`, `WORKOS_CLIENT_SECRET`.	On rotation, via Terraform PR
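
For the third tier, a minimal sketch of Terraform writing one workflow-time secret through the github provider. The input variables and the tenant-<slug> environment naming are assumptions for illustration; METRICS_BEARER_TOKEN is one of the secret names listed in the table.

```hcl
variable "repository" {
  type        = string
  description = "Repository name within the org (hypothetical input)."
}

variable "tenant_slug" {
  type = string
}

variable "metrics_bearer_token" {
  type      = string
  sensitive = true
}

resource "github_repository_environment" "tenant" {
  repository  = var.repository
  environment = "tenant-${var.tenant_slug}" # hypothetical naming scheme
}

resource "github_actions_environment_secret" "metrics_bearer_token" {
  repository      = var.repository
  environment     = github_repository_environment.tenant.environment
  secret_name     = "METRICS_BEARER_TOKEN"
  plaintext_value = var.metrics_bearer_token
}
```

Note that plaintext_value ends up in Terraform state, so the state backend's access controls are part of the secrets boundary for this tier.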

For cloud providers that support OIDC federation (AWS, Azure post-pilot), Terraform Cloud mints short-lived JWTs and the cloud provider validates them — no long-lived cloud credentials stored anywhere.
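
For the AWS case, HCP Terraform's dynamic provider credentials are enabled per workspace through two environment variables. The sketch below sets them with the hashicorp/tfe provider, which is an assumption of this example: this ADR does not say the workspace variables are managed in Terraform, and the same values could equally be entered in the TFC UI. The workspace ID and role ARN inputs are hypothetical.

```hcl
variable "workspace_id" {
  type        = string
  description = "ID of the TFC workspace to configure (hypothetical input)."
}

variable "run_role_arn" {
  type        = string
  description = "ARN of the IAM role that TFC runs assume (hypothetical input)."
}

# Turns on OIDC-based dynamic credentials for the AWS provider.
resource "tfe_variable" "aws_provider_auth" {
  workspace_id = var.workspace_id
  key          = "TFC_AWS_PROVIDER_AUTH"
  value        = "true"
  category     = "env"
}

# Tells TFC which IAM role to assume with the short-lived token.
resource "tfe_variable" "aws_run_role_arn" {
  workspace_id = var.workspace_id
  key          = "TFC_AWS_RUN_ROLE_ARN"
  value        = var.run_role_arn
  category     = "env"
}
```

Neither value is a secret: the role ARN is only usable by a run whose token satisfies the role's trust policy on the AWS side.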

Quarterly ops sprint (per Ivan's direction, 2026-04-23): every 90 days, a scheduled sprint rotates root tokens and documents the security scoping of each. The rotation playbook lives in docs/runbooks/ and is version-controlled.

6. CI/CD and PR-gating pattern

Terraform Cloud's VCS integration handles plan-on-PR and apply-on-merge natively for every workspace. GitHub Actions is used only as belt-and-braces for cross-workspace validation where TFC's one-workspace-per-directory model would miss cross-PR coordination.

Approval gating is structural, in two layers that compose, plus a third that depends on the GitHub plan:

CODEOWNERS in sv0-infrastructure: /envs/shared/cloudflare-*.tf, /envs/prod/**, and /modules/** require Ivan or Sergey review. Lighter-review paths (BetterStack monitor additions, DNS for demo domains) route through the default reviewer count. CODEOWNERS works on any GitHub plan.
Terraform Cloud workspace approvals: envs/prod and envs/shared have Auto-apply: off in TFC. Every terraform apply requires a manual "Confirm" click by an approver in the TFC UI. This gate is plan-independent and does the real work.
Branch protection on main (plan-dependent): require PR + 1 approval + passing TFC plan status check. GitHub's protected-branch feature requires Team/Enterprise for private repos; on GitHub Free for private repos it is not enforceable. If the org is on Free, the CODEOWNERS + TFC workspace approval combination remains as the enforceable gate; the plan doc's Prerequisites §1 addresses the plan-detection step.
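
Where the GitHub plan allows it, the branch-protection layer could itself be expressed in Terraform with the github provider. This is a sketch under assumptions: the status-check context string is a placeholder (the exact name TFC reports would need checking against a real PR), and nothing in this ADR says branch protection must be managed this way.

```hcl
data "github_repository" "infra" {
  name = "sv0-infrastructure"
}

resource "github_branch_protection" "main" {
  repository_id = data.github_repository.infra.node_id
  pattern       = "main"

  # Require a PR with one approval, including a CODEOWNERS review.
  required_pull_request_reviews {
    required_approving_review_count = 1
    require_code_owner_reviews      = true
  }

  # Require the TFC plan check to pass before merge.
  required_status_checks {
    contexts = ["Terraform Cloud/securityv0/sv0-infrastructure-shared"] # hypothetical context name
  }
}
```

If the org is on a plan where protected branches are unavailable for private repos, applying this resource would be expected to fail, which matches the plan-dependent caveat above.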

Approvers for production TFC workspaces: Ivan and Sergey. Victor is deferred pending trust evaluation; added when the team scales.

Drift detection (Phase 1-3): Manual. Ivan runs terraform plan against each workspace weekly and reviews the diff. This is sufficient while the total resource count is below ~100 and there is effectively one person making infra changes.

Drift detection (Phase 4+): A scheduled GitHub Actions workflow in sv0-infrastructure runs terraform plan daily using OIDC-federated dynamic TFC credentials (so no long-lived TFC token lives in GitHub), diffs the plan output against the previous day's, and opens or updates a GitHub Issue labeled drift + <workspace-name> when non-empty. Reconciliation has three paths: (a) accept the dashboard change by opening a PR that updates HCL to match reality; (b) revert the drift by running terraform apply; (c) explicit known-drift entry in envs/<workspace>/drift-allowlist.md with a date-stamped reason and a monthly review reminder. Hand-clicking during incidents is accepted reality; the goal is cheap reconciliation, not Terraform purity.

We do not use HCP Terraform Health Assessments — that is a Standard/Premium-tier feature and we are on free tier. The self-built GitHub-Actions drift workflow is the free-tier equivalent.

Consequences

Positive

Every production infrastructure change is a reviewed PR with a visible plan diff. Manual dashboard drift becomes a detectable signal, not an invisible assumption.
Stamping out a dedicated-deployment customer stack is a ~1-hour operation, not a ~1-week bespoke provisioning job. The first such customer justifies the module-parameterization cost immediately; every subsequent customer amortizes it.
Secrets live in one of three well-defined locations with a documented rotation cadence. Quarterly ops sprints make rotation a predictable workflow rather than an event.
The managed-platform migration committed to in ADR-018 is substantively started by Phase 3 of this plan's rollout.

Negative

Three-to-five engineer-days of up-front IaC setup before the first benefit is visible. Phase 1 delivers the Cloudflare baseline but requires reviewing ~400 lines of HCL and replaying ~15 terraform import commands against production.
Terraform Cloud is a new vendor in the stack. Free tier is generous today but product decisions from HashiCorp can change; migration to S3+DynamoDB is documented as a contingency.
Cloudflare API Token is scoped from day 1 (Zone:Edit + Zero Trust:Edit + Account:Read on securityv0.com). Narrowing further later is a credential cutover — generate narrower token → update TFC workspace variable → confirm terraform plan is still clean → revoke the broader token. We do not use the legacy Cloudflare Global API Key. That credential is account-wide and cannot be narrowed in place; it's the wrong class of credential.
WorkOS has no first-class Terraform provider at the time of writing. WorkOS configuration stays dashboard-managed with a monthly reconciliation checklist.
Hetzner VMs are intentionally not terraformed. They stay hand-managed through decommission. This is correct given they are transitional infrastructure (pre-client, cheapest), but it means one production-adjacent surface stays outside IaC until it goes away.

Non-goals

This ADR does not migrate sv0-connectors/infra/ into sv0-infrastructure. Those modules stay where they are.
This ADR does not terraform WorkOS, Hetzner, or any surface that lacks a mature provider.
This ADR does not commit to a specific AWS-vs-Azure post-pilot compute decision. That decision is made in ADR-018 and updated in the readiness review; this ADR commits only to terraforming whichever cloud is chosen.

Related

Paired rollout plan: docs/plans/2026-04-22-iac-rollout-plan.md, a phased implementation with first-PR scope, expected diff sizes, and per-phase done-definitions.
Supersedes scope of: ADR-018 — Deploy-Server Security Posture Before Managed-Platform Migration on the "managed platform migration trigger" point; this ADR commits to the structure of that migration.
Depends on: ADR-016 — Multi-Tenant Authentication Architecture amendment (2026-04-22) — the tenant-parameterizable module design principle implements the infrastructure side of the SINGLE_TENANT_SLUG commitment.
Depends on: ADR-017 — WorkOS as Authentication Provider — WorkOS is the one exception to the terraform-everything principle at this time.
Related observability context: docs/architecture/research/2026-04-22-observability-stack.md — Grafana Cloud + BetterStack + grafana/mcp-grafana picks land in Phase 4 of this rollout.
Related infra strategy context: docs/architecture/research/2026-03-31-infrastructure-strategy.md — AWS credit + migration roadmap; Phase 3 of this rollout operationalizes it.
