IaC Rollout Plan
Companion to ADR-019. Four phases, each landable in 1-3 days. Phase 1 is the urgent one — it imports what exists today plus the new Cloudflare Access Bypass for health probes (currently being added by hand). Phases 2-4 execute against MediaPro pilot + post-pilot cadence.
Prerequisites (Ivan creates before Phase 1)
- Create GitHub repo SecurityV0/sv0-infrastructure: private, default branch main. Review gating is structural:
  - If the SecurityV0 org is on GitHub Team/Enterprise: enable branch protection on main (require PR + 1 approval + passing tf-plan status check from TFC).
  - If the SecurityV0 org is on GitHub Free (branch protection on private repos requires a paid plan, per the GitHub docs on protected branches): fall back to two advisory-but-effective gates: a CODEOWNERS file (covered below) that requires Ivan/Sergey review on sensitive paths, and a GitHub Environment-based approval (same pattern sv0-platform uses for its prod environment) applied to the Terraform-apply GitHub Action (Phase 4 detail). Environments + required reviewers ARE available on GitHub Free for private repos.
  - Verify the plan before starting Phase 1. If the answer is "Free," this ADR's enforcement story degrades to advisory in that one dimension; proceed.
- Create Terraform Cloud organization securityv0 (free tier). Sign up at https://app.terraform.io.
- Provision Terraform Cloud workspace sv0-shared in the new org. VCS-connect to sv0-infrastructure → working directory envs/shared. Execution mode: remote. Auto-apply: off. Manual "Confirm" click required to apply.
- No GitHub Actions → TFC token needed in Phase 1. TFC's native VCS integration handles plan-on-PR (posting plan output as a PR comment via TFC's "VCS-driven plan" mode) and apply-on-merge without any GitHub Actions workflow calling TFC. The belt-and-braces tf-plan.yml workflow that was in a prior revision of this plan is removed; it would have required storing a TF_API_TOKEN in GitHub repo secrets, which creates a fourth secrets tier outside the three-tier model in ADR-019 §5. Revisit if we ever need cross-workspace validation that TFC-native cannot do.
- Rotate & move root credentials to 1Password vault sv0-infra (existing Business plan):
  - Cloudflare API Token (Account → My Profile → API Tokens → Create Token). Scoped from day 1: Zone → DNS → Edit on securityv0.com, Account → Access: Apps and Policies → Edit, Account → Account Settings → Read. Do not use the legacy Global API Key: it cannot be scoped down in place (it's an account-wide credential, not a token), and migrating from it later requires a full credential cutover with a plan/apply to confirm the new token works before revoking the old one.
  - GitHub PAT with admin:org + repo scopes (https://github.com/settings/tokens).
  - MongoDB Atlas org-owner API key (from Atlas dashboard → Organization → Access Manager).
  - BetterStack team token (Phase 4; from BetterStack dashboard).
  - Grafana Cloud org token (Phase 4; from Grafana Cloud dashboard).
- Populate TFC workspace variables for sv0-shared from 1Password values. All marked sensitive:
  - CLOUDFLARE_API_TOKEN
  - GITHUB_TOKEN (the PAT)
  - MONGODBATLAS_PUBLIC_KEY, MONGODBATLAS_PRIVATE_KEY
Phase 1 — Cloudflare baseline (1-2 days)
Trigger: Immediate. Blocks the health-probe monitor legitimacy and the status.securityv0.com CNAME required for the BetterStack status page.
Scope
Import every existing Cloudflare DNS record and Zero Trust Access application into Terraform. Add two new resources:
- Health-probe Bypass application covering /health, /api/v1/health, /ready, /api/v1/ready on both app.securityv0.com and dev.securityv0.com.
- CNAME status.securityv0.com → the BetterStack-provided target (set after BetterStack status page is provisioned).
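One way the health-probe Bypass could be expressed with the Cloudflare provider v4 is sketched below. It is not the plan's module: it assumes one Access application per host/path pair (the plan's cloudflare-access-app module may instead take a list of domains), and the names account_id, health_bypass, and the "~> 4.0" constraint are illustrative assumptions.

```hcl
terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0" # assumed concrete form of the plan's "~> 4.x"
    }
  }
}

variable "account_id" {
  type        = string
  description = "Cloudflare account ID (hypothetical variable name)."
}

locals {
  health_hosts = ["app.securityv0.com", "dev.securityv0.com"]
  health_paths = ["/health", "/api/v1/health", "/ready", "/api/v1/ready"]

  # Every host/path combination, e.g. "app.securityv0.com/api/v1/health".
  health_domains = toset([
    for pair in setproduct(local.health_hosts, local.health_paths) :
    "${pair[0]}${pair[1]}"
  ])
}

# One self-hosted Access application per probe URL (assumption: the real
# module may group these differently).
resource "cloudflare_access_application" "health_bypass" {
  for_each = local.health_domains

  account_id = var.account_id
  name       = "health-bypass ${each.value}"
  domain     = each.value
  type       = "self_hosted"
}

# A bypass policy that matches everyone, so probes skip the Access login.
resource "cloudflare_access_policy" "health_bypass" {
  for_each = cloudflare_access_application.health_bypass

  account_id     = var.account_id
  application_id = each.value.id
  name           = "bypass-health-probes"
  precedence     = 1
  decision       = "bypass"

  include {
    everyone = true
  }
}
```

Because a bypass policy disables Access enforcement for the matched URLs, the path list is kept explicit rather than wildcarded.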
File layout (first PR in sv0-infrastructure)
sv0-infrastructure/
├── README.md                    # what this repo is, 1Password + TFC conventions
├── CODEOWNERS                   # envs/shared/cloudflare-*.tf + envs/prod/** → @ivanfofanov @sergey-medved
├── .gitignore                   # .terraform/, *.tfstate*, *.plan
├── modules/
│   ├── cloudflare-zone/
│   │   ├── main.tf              # zone data source, dns records from for_each
│   │   ├── variables.tf         # zone_id, records: list(object({...}))
│   │   └── outputs.tf
│   └── cloudflare-access-app/
│       ├── main.tf              # cloudflare_access_application + cloudflare_access_policy
│       ├── variables.tf         # name, domains, session_duration, policies
│       └── outputs.tf
├── envs/shared/
│   ├── versions.tf              # terraform 1.9+, cloudflare/cloudflare ~> 4.x
│   ├── providers.tf             # cloudflare provider with API token from TFC var
│   ├── cloudflare-dns.tf        # module "dns" { source = "../../modules/cloudflare-zone" ... }
│   ├── cloudflare-access.tf     # 4 Access apps: app-prod, dev-*, dev-wildcard, health-bypass
│   ├── terraform.tfvars         # zone_id, account_id (non-sensitive, committed)
│   └── backend.tf               # cloud { organization = "securityv0", workspaces { name = "sv0-shared" } }
└── .github/                     # optional; no Terraform workflow in Phase 1 because TFC VCS-driven plan handles it
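The comments on versions.tf, backend.tf, and providers.tf in the layout above compress to roughly the following. This is a sketch shown as one block for brevity; the ">= 1.9.0" and "~> 4.0" constraints are assumed concrete spellings of "terraform 1.9+" and "~> 4.x", and it assumes the TFC variable CLOUDFLARE_API_TOKEN is created as an environment variable rather than a Terraform variable.

```hcl
# envs/shared/versions.tf + backend.tf (sketch)
terraform {
  required_version = ">= 1.9.0"

  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0"
    }
  }

  # Terraform Cloud backend; organization and workspace names are from this plan.
  cloud {
    organization = "securityv0"

    workspaces {
      name = "sv0-shared"
    }
  }
}

# envs/shared/providers.tf (sketch)
# With no api_token argument, the provider falls back to the
# CLOUDFLARE_API_TOKEN environment variable set on the TFC workspace.
provider "cloudflare" {}
```

Keeping the token out of the provider block means nothing secret is referenced in committed HCL; the workspace variable is the only place the value lives outside 1Password.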
Import script (committed in PR body as scripts/phase-1-imports.sh)
#!/bin/bash
set -euo pipefail
cd envs/shared
ZONE_ID="${CLOUDFLARE_ZONE_ID:?set CLOUDFLARE_ZONE_ID}"
ACCOUNT_ID="${CLOUDFLARE_ACCOUNT_ID:?set CLOUDFLARE_ACCOUNT_ID}"
# DNS records — one import per record (look up IDs via Cloudflare dashboard or API)
terraform import 'module.dns.cloudflare_record.records["app"]' "${ZONE_ID}/<record_id_for_app_securityv0_com>"
terraform import 'module.dns.cloudflare_record.records["dev"]' "${ZONE_ID}/<record_id_for_dev_securityv0_com>"
terraform import 'module.dns.cloudflare_record.records["wildcard_dev"]' "${ZONE_ID}/<record_id_for_wildcard_dev>"
# ... etc for each existing A/CNAME/TXT record
# Access applications
terraform import 'module.access_prod.cloudflare_access_application.app' "${ACCOUNT_ID}/<app_id_prod>"
terraform import 'module.access_dev.cloudflare_access_application.app' "${ACCOUNT_ID}/<app_id_dev>"
# ... etc for each remaining existing Access application
# Bypass app is NEW — not imported; created by `terraform apply`.
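The import addresses in the script imply a for_each keyed by short names such as "app" and "dev". A minimal sketch of a module shape that would produce those addresses follows, together with a declarative import block (Terraform 1.5+, with expressions in id from 1.6) as a possible alternative to the CLI imports. Everything here is an assumption rather than the plan's actual code: the map-typed records variable, the content attribute (available in later 4.x releases of the provider), the imports.tf file name, and the 203.0.113.10 placeholder address. The angle-bracket record ID is left as a placeholder exactly as in the script.

```hcl
# modules/cloudflare-zone/main.tf (sketch)
terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0"
    }
  }
}

variable "zone_id" {
  type = string
}

variable "records" {
  # Map keys ("app", "dev", ...) become the for_each instance keys.
  type = map(object({
    name    = string
    type    = string
    content = string
    proxied = optional(bool, false)
    ttl     = optional(number, 1) # 1 = automatic TTL in Cloudflare
  }))
}

resource "cloudflare_record" "records" {
  for_each = var.records

  zone_id = var.zone_id
  name    = each.value.name
  type    = each.value.type
  content = each.value.content
  proxied = each.value.proxied
  ttl     = each.value.ttl
}

# envs/shared/cloudflare-dns.tf (sketch, separate file in the root module;
# var.zone_id is assumed to be declared there and set in terraform.tfvars)
module "dns" {
  source  = "../../modules/cloudflare-zone"
  zone_id = var.zone_id

  records = {
    app = {
      name    = "app"
      type    = "A"
      content = "203.0.113.10" # placeholder; use the value of the live record
    }
  }
}

# envs/shared/imports.tf (hypothetical file name)
import {
  to = module.dns.cloudflare_record.records["app"]
  id = "${var.zone_id}/<record_id_for_app_securityv0_com>"
}
```

Import blocks are planned like any other change, so the import shows up in the TFC plan for review instead of being run from a laptop. They can be deleted once the apply has succeeded.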
Done when
- terraform plan in envs/shared shows zero changes against imported resources.
- New health-probe Bypass Access application is live. Verified by:
    curl -i https://app.securityv0.com/api/v1/health
    # Expected: HTTP/2 200 with JSON body, no 302 redirect to cloudflareaccess.com.
- status.securityv0.com CNAME resolves (pending BetterStack status page being published in Phase 4 or provisioned by Ivan ahead of Phase 1).
- CODEOWNERS is in place; Ivan and Sergey are added as approvers on the TFC workspace.
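The curl verification could also be kept in the configuration as a Terraform check block, so each plan re-tests it. The sketch below assumes the hashicorp/http provider is acceptable in this workspace; the check name and the "~> 3.4" constraint are illustrative. The JSON test is included because a redirect to the Access login page, if the data source followed it, would likely end in an HTML 200 rather than a failure.

```hcl
terraform {
  required_version = ">= 1.9.0"

  required_providers {
    http = {
      source  = "hashicorp/http"
      version = "~> 3.4"
    }
  }
}

# A failed check surfaces as a warning on plan/apply; it does not block the run.
check "health_probe_bypass" {
  data "http" "prod_health" {
    url = "https://app.securityv0.com/api/v1/health"
  }

  assert {
    condition = (
      data.http.prod_health.status_code == 200 &&
      can(jsondecode(data.http.prod_health.response_body))
    )
    error_message = "Health endpoint did not return 200 with a JSON body; the Access bypass may not be in effect."
  }
}
```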
Expected diff size
~400 lines HCL, ~80 lines README. No GitHub Actions workflow lands in Phase 1 (TFC's VCS-driven plan replaces the tf-plan.yml workflow from the prior revision). ~15 terraform import invocations documented in the PR body.
Phase 2 — Atlas + GitHub environment secrets (1-2 days)
Trigger: Before the MediaPro Atlas cutover (~Day 2 of readiness-plan Track A). This phase provisions the Atlas cluster via Terraform and (in a follow-up) terraform-imports the existing GitHub Actions environment secrets so the MONGODB_URI secret (and every other existing secret) is code-managed going forward.
Workspace decision — workspace-per-environment
The original draft of this plan put Atlas in envs/shared/atlas.tf. Phase 2 implementation deviated to a workspace-per-environment model:
| Workspace | Directory | What lives there |
|---|---|---|
| sv0-shared | envs/shared/ | Cross-environment resources only (Cloudflare zone, future GitHub org-level config). Atlas does not live here. |
| sv0-prod | envs/prod/ | Persistent prod infrastructure: the M10 Atlas cluster, future prod compute (Phase 3), prod GitHub env secrets. |
| sv0-staging | envs/staging-ephemeral/ | On-demand staging cluster for drills. Idle by default (var.staging_enabled = false); flip on for PITR drill / version upgrade rehearsal / cutover validation. |
Three reasons for the deviation:
- API token scoping. sv0-shared already holds the Cloudflare token. Adding the Atlas key there expands blast radius for anyone with sv0-shared write access. Per-env workspaces isolate the Atlas key to sv0-prod + sv0-staging.
- State isolation. A typo or destroy in envs/staging-ephemeral cannot affect prod resources because the state file is different.
- Future split is free. When ADR-020 Phase 1 splits prod from dev (separate clusters), sv0-prod stays put and sv0-staging gets a real persistent cluster. No cross-workspace state moves.
Scope
- New module modules/atlas-project/: project + advanced_cluster (M10) + database_user (SCRAM-SHA, for_each roles) + project_ip_access_list (for_each over CIDRs). Default region per ADR-020 §3 (aws:eu-west-1, Ireland) and Phase 0 carve-out per ADR-020 §0. PITR togglable, termination protection togglable, auto-scaling explicitly disabled.
- envs/prod/atlas.tf: single M10 cluster sv0-prod with two databases (sv0_prod + sv0_dev) per the carve-out, PITR ON, termination protection ON, IP allowlist starts empty (cluster refuses connections until populated by follow-up PR).
- envs/staging-ephemeral/atlas.tf: same module, gated behind the var.staging_enabled = false default. terraform plan shows No changes while idle. Stand up on demand for drills; tear down with terraform destroy. Termination protection OFF (must be destroyable).
- (Follow-up PR, not in initial Phase 2 scope) New module modules/github-environment/: GitHub environment resource + secrets + required-reviewer protection rule. envs/prod/github-environments.tf instantiates dev and prod GitHub environments in sv0-platform, with all existing secrets imported. The MONGODB_URI secret is composed from the Atlas connection string + app password (TFC sensitive output) and pushed via the github_actions_environment_secret resource.
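The staging gate and the for_each IP allowlist described in the scope above can be illustrated with a small standalone sketch. It is not the atlas-project module itself: the resources are written inline rather than behind a module, the cluster is omitted, and the variable names atlas_org_id and allowed_cidrs and the project name sv0-staging are assumptions.

```hcl
terraform {
  required_providers {
    mongodbatlas = {
      source  = "mongodb/mongodbatlas"
      version = "~> 2.11" # pin taken from this plan
    }
  }
}

variable "staging_enabled" {
  type    = bool
  default = false # idle by default: nothing is created, plan shows no changes
}

variable "atlas_org_id" {
  type = string
}

variable "allowed_cidrs" {
  type    = list(string)
  default = [] # empty allowlist: no client can connect until this is populated
}

# Created only while staging_enabled is true.
resource "mongodbatlas_project" "staging" {
  count = var.staging_enabled ? 1 : 0

  name   = "sv0-staging" # hypothetical project name
  org_id = var.atlas_org_id
}

# One allowlist entry per CIDR, and none at all while staging is disabled.
resource "mongodbatlas_project_ip_access_list" "staging" {
  for_each = var.staging_enabled ? toset(var.allowed_cidrs) : toset([])

  project_id = one(mongodbatlas_project.staging[*].id)
  cidr_block = each.value
  comment    = "managed by Terraform"
}
```

Flipping staging_enabled back to false removes the same resources on the next apply, which is an alternative to running terraform destroy against the whole workspace.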
Provider pin
mongodb/mongodbatlas ~> 2.11. Uses v2.x attribute-form replication_specs = [{...}] and use_effective_fields = true. Forward-compatible with v3 when that lands (v3 makes the new behavior default).
Done when
- Atlas cluster sv0-prod is live in aws:eu-west-1; DB auth enabled; PITR confirmed in Atlas console; termination protection ON.
- App password (sv0_app) read once from TFC sensitive outputs into 1Password sv0-infra (entry: sv0_app password (atlas)).
- IP allowlist populated with Hetzner dev VM static egress + Ivan's ops IP via follow-up PR; cluster accepts connections from those addresses.
- terraform plan in envs/staging-ephemeral shows No changes (because staging_enabled = false).
- (Follow-up PR) Every existing GitHub environment secret in sv0-platform is Terraform-managed; terraform plan shows zero diff; deploy-prod.yml still passes end-to-end with Terraform-managed secrets.
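For the follow-up PR, the environment and secret wiring might look roughly like the sketch below, using the integrations/github provider. The variable names, the "~> 6.0" constraint, and passing the URI in as a variable (rather than composing it from Atlas outputs) are assumptions made to keep the example standalone.

```hcl
terraform {
  required_providers {
    github = {
      source  = "integrations/github"
      version = "~> 6.0"
    }
  }
}

# The token is read from the GITHUB_TOKEN environment variable.
provider "github" {
  owner = "SecurityV0"
}

variable "mongodb_uri" {
  type      = string
  sensitive = true # in the plan this is composed from Atlas outputs instead
}

variable "prod_reviewer_ids" {
  type        = list(number)
  description = "Numeric GitHub user IDs allowed to approve prod deployments (hypothetical variable)."
}

resource "github_repository_environment" "prod" {
  repository  = "sv0-platform"
  environment = "prod"

  reviewers {
    users = var.prod_reviewer_ids
  }
}

resource "github_actions_environment_secret" "mongodb_uri" {
  repository      = "sv0-platform"
  environment     = github_repository_environment.prod.environment
  secret_name     = "MONGODB_URI"
  plaintext_value = var.mongodb_uri
}
```

GitHub's API never returns secret values, so importing an existing secret brings only its name under management; each value still has to be supplied from 1Password or a Terraform output. The plaintext value is stored in state, which is one more reason the state stays in TFC.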
Expected diff size
~700 lines HCL for the cluster work (initial Phase 2 PR). ~300 additional lines + ~15 terraform import invocations when GH environment secrets land in the follow-up PR.
Phase 3 — Post-pilot compute (2-3 days)
Trigger: After MediaPro pilot goes live and is stable for ≥1 week. Readiness review §2.3 is explicit: compute migration is post-pilot.
Scope
New module set under sv0-platform/infra/ (not sv0-infrastructure — this is product-scoped compute):
- modules/compute-vm/: cloud-agnostic Linux VM + Docker Compose + Caddy. Variable cloud_provider selects between aws and azure sub-modules. Explicit anti-patterns locked in ADR-019 and readiness review §2.3: no Container Apps / ECS / Fargate / Cosmos DB.
- modules/compute-vm/aws/: EC2 t3.small, SG, ACM cert, Route53 or Cloudflare DNS.
- modules/compute-vm/azure/: mirror using existing sv0-connectors/infra/ patterns (resource group, VNet, NSG, VM, Key Vault).
- sv0-platform/infra/envs/prod/: production VM instantiation.
Critical design requirement (per ADR-019 §4): compute-vm module accepts tenant_slug variable. Shared-SaaS production uses tenant_slug = "prod" (or equivalent sentinel); dedicated-deployment customers later instantiate the same module at sv0-infrastructure/envs/tenant-<slug>/ or similar.
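The cloud_provider switch and the tenant_slug requirement could be wired at the top of modules/compute-vm/ along these lines. This is a sketch only: the slug pattern and the assumption that both sub-modules accept a tenant_slug input are illustrative, and the sub-modules' real inputs are not defined in this plan.

```hcl
# modules/compute-vm/main.tf (sketch)

variable "cloud_provider" {
  type = string

  validation {
    condition     = contains(["aws", "azure"], var.cloud_provider)
    error_message = "The cloud_provider value must be \"aws\" or \"azure\"."
  }
}

variable "tenant_slug" {
  type        = string
  description = "Tenant identifier; \"prod\" for shared SaaS, a customer slug for dedicated deployments."

  validation {
    # Assumed naming rule: lowercase letters, digits, and hyphens.
    condition     = can(regex("^[a-z0-9][a-z0-9-]*$", var.tenant_slug))
    error_message = "The tenant_slug value must contain only lowercase letters, digits, and hyphens."
  }
}

# Exactly one of the two sub-modules is instantiated.
module "aws" {
  source = "./aws"
  count  = var.cloud_provider == "aws" ? 1 : 0

  tenant_slug = var.tenant_slug # hypothetical sub-module input
}

module "azure" {
  source = "./azure"
  count  = var.cloud_provider == "azure" ? 1 : 0

  tenant_slug = var.tenant_slug # hypothetical sub-module input
}
```

Making tenant_slug a required input from the start means a dedicated-deployment customer later only needs a new envs directory and a different slug, not a module change.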
Cloud selection logic (decided at Phase 3 start, not now)
- Azure VM if Azure Founders Hub credits are approved and ≥$2K available. Uses existing sv0-connectors/infra/ Azure patterns for consistency.
- Otherwise AWS EC2 t3.small (Mercury Activate $5K is already activated).
Done when
- Production API container runs on the new VM.
- deploy-prod.yml deploy targets the new VM (DEPLOY_HOST secret rotated via Terraform).
- BetterStack health check is green on the new host for 24 consecutive hours.
- Hetzner prod VM is stopped but not destroyed for 14 days as a warm standby. After 14 days, destroyed; the Hetzner infrastructure stays only as the dev environment.
Expected diff size
~400 lines HCL + ~200 lines cloud-init YAML template.
Phase 4 — BetterStack + Grafana Cloud (1 day)
Trigger: After observability stack rollout per docs/architecture/research/2026-04-22-observability-stack.md. Phase 4 terraform-imports the BetterStack monitors and Grafana Cloud stack config so adding a new monitor becomes a PR instead of a dashboard click.
Scope
- modules/betterstack-monitors/: uptime monitor + alert routing (SMS + email).
- modules/betterstack-status-page/: status page + component wiring. Used to anchor status.securityv0.com.
- modules/grafana-cloud-stack/: Grafana Cloud stack, data source tokens, MCP API key, Alloy remote-write credentials.
- envs/shared/observability.tf: instantiate all of the above.
Provider requirements:
- betterstackhq/betterstack ~> 0.9
- grafana/grafana ~> 3.x
Done when
- All uptime monitors, status page, and alert rules are Terraform-managed.
- grafana/mcp-grafana API token is emitted as a TFC sensitive output and pasted into the sv0-platform .claude/settings.json MCP config.
- Adding a new monitor (e.g., for a customer tenant's dedicated endpoint) is a PR that modifies envs/shared/observability.tf.
Expected diff size
~150 lines HCL.
Phase 5 and beyond — out of scope for this plan
Explicitly deferred; tracked in follow-up issues:
- WorkOS Terraform provider. No mature provider exists at the time of writing. WorkOS stays dashboard-managed with a monthly reconciliation checklist. Revisit when a provider lands.
- Connector-dev Azure infra migration to shared TFC state. sv0-connectors/infra/ stays on local state + 1Password SP credentials until/unless we need shared state (e.g., for per-customer connector Azure SPs at scale).
- Hetzner VM Terraform. Transitional infrastructure; stays hand-managed until decommission.
- AWS Organizations landing zone (accounts, SCPs, budgets from the 2026-03-31 infra strategy). Deferred until we commit to AWS as the post-pilot compute cloud AND have >1 workload there.
- Cloudflare token narrowing (follow-up within 30 days of Phase 1 merge). Phase 1 uses a scoped Cloudflare API Token with Zone:Edit + Zero Trust:Edit + Account:Read on securityv0.com. If we later want to narrow further (e.g., per-environment tokens with only the zones each environment touches), this is a credential cutover: generate the narrower token, update TFC workspace variables, run terraform plan to confirm no drift, then revoke the broader token. Not "edit in place." Prerequisite §5 now specifies the scoped API Token from day 1, not the legacy Global API Key (which is account-wide and cannot be narrowed).
Quarterly ops sprint — security and rotation
Per ADR-019 §5, every 90 days there is a scheduled ops sprint that:
- Rotates every root token stored in 1Password sv0-infra (Cloudflare, Atlas, GitHub, BetterStack, Grafana Cloud, TFC, Hetzner).
- Updates corresponding TFC workspace variables.
- Triggers Terraform-driven GitHub Actions secret rotations via PR.
- Documents the security scoping decisions made during rotation (e.g., "this quarter we scoped the Cloudflare token from Global to zone-specific; here's the token's permissions table").
- Reviews the drift-allowlist.md entries in each workspace and removes stale known-drift exceptions.
The first scheduled sprint is 2026-07-23 (90 days after this plan's acceptance). Reminder goes in Ivan's calendar and as a recurring GitHub issue in sv0-infrastructure.
Tracking
| Phase | Issue | Owner | Status |
|---|---|---|---|
| Prerequisites | Create sv0-infrastructure repo + TFC org + 1Password vault entries | Ivan | blocks Phase 1 |
| Phase 1 | sv0-documentation#203 until sv0-infrastructure exists, then replace with repo-local issue | Claude drafts, Ivan reviews | not started |
| Phase 2 | sv0-infrastructure#7 (PR #8) | Claude drafts, Ivan reviews | cluster scaffold open for review; GH-env-secrets follow-up after cluster is live |
| Phase 3 | sv0-infrastructure#? + sv0-platform#? | tbd | queued behind pilot-stable |
| Phase 4 | sv0-infrastructure#? | tbd | queued behind observability stack rollout (sv0-platform#494) |
| Quarterly sprint | scheduled for 2026-07-23 | Ivan | recurring |
Umbrella epic: sv0-documentation#195 (MediaPro readiness) references Phases 1-2. Phases 3-4 get their own follow-up issues when their triggers fire.