
IaC Rollout Plan

Companion to ADR-019. Four phases, each landable in 1-3 days. Phase 1 is the urgent one: it imports what exists today and adds the new Cloudflare Access Bypass for health probes (currently being added by hand). Phases 2-4 follow the MediaPro pilot and the post-pilot cadence.


Prerequisites (Ivan creates before Phase 1)

  1. Create GitHub repo SecurityV0/sv0-infrastructure — private, default branch main. Review gating is structural:
    • If the SecurityV0 org is on GitHub Team/Enterprise: enable branch protection on main (require PR + 1 approval + passing tf-plan status check from TFC).
    • If the SecurityV0 org is on GitHub Free (branch protection on private repos requires a paid plan; see the GitHub docs on protected branches): fall back to two advisory gates. First, a CODEOWNERS file (covered below) that auto-requests Ivan/Sergey review on sensitive paths; without branch protection it requests review but cannot block a merge. Second, a GitHub Environment-based approval (same pattern sv0-platform uses for its prod environment) applied to the Terraform-apply GitHub Action (Phase 4 detail). Confirm that environments with required reviewers are available for private repos on the org's plan: GitHub gates these features by plan as well, so check the current GitHub docs on deployment environments rather than assuming.
    • Verify which GitHub plan the org is on before starting Phase 1. If the answer is "Free," this ADR's enforcement story degrades to advisory in that one dimension; proceed anyway.
  2. Create Terraform Cloud organization securityv0 (free tier). Sign up at https://app.terraform.io.
  3. Provision Terraform Cloud workspace sv0-shared in the new org. VCS-connect to sv0-infrastructure → working directory envs/shared. Execution mode: remote. Auto-apply: off. Manual "Confirm" click required to apply.
  4. No GitHub Actions → TFC token needed in Phase 1. TFC's native VCS integration runs a speculative plan on every PR (reported on the PR as a status check that links to the run) and queues an apply run on merge, which still waits for the manual "Confirm" from step 3. No GitHub Actions workflow calls TFC. The belt-and-braces tf-plan.yml workflow from a prior revision of this plan is removed: it would have required storing a TF_API_TOKEN in GitHub repo secrets, creating a fourth secrets tier outside the three-tier model in ADR-019 §5. Revisit if we ever need cross-workspace validation that TFC-native cannot do.
  5. Rotate & move root credentials to 1Password vault sv0-infra (existing Business plan):
    • Cloudflare API Token (Account → My Profile → API Tokens → Create Token). Scoped from day 1: Zone → DNS → Edit on securityv0.com, Account → Access: Apps and Policies → Edit, Account → Account Settings → Read. Do not use the legacy Global API Key — it cannot be scoped down in place (it's an account-wide credential, not a token), and migrating from it later requires a full credential cutover with a plan/apply to confirm the new token works before revoking the old one.
    • GitHub PAT with admin:org + repo scopes (https://github.com/settings/tokens).
    • MongoDB Atlas org-owner API key (from Atlas dashboard → Organization → Access Manager).
    • BetterStack team token (Phase 4; from BetterStack dashboard).
    • Grafana Cloud org token (Phase 4; from Grafana Cloud dashboard).
  6. Populate TFC workspace variables for sv0-shared from 1Password values. All marked sensitive:
    • CLOUDFLARE_API_TOKEN
    • GITHUB_TOKEN (the PAT)
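    • MONGODBATLAS_PUBLIC_KEY, MONGODBATLAS_PRIVATE_KEY (per the Phase 2 workspace decision, set these on sv0-prod and sv0-staging rather than sv0-shared, so the Atlas key stays isolated from the workspace that holds the Cloudflare token)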

Phase 1 — Cloudflare baseline (1-2 days)

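Trigger: Immediate. Two things are blocked until this lands: health-probe monitors that reach the health endpoints directly instead of the Cloudflare Access login redirect, and the status.securityv0.com CNAME required for the BetterStack status page.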

Scope

Import every existing Cloudflare DNS record and Zero Trust Access application into Terraform. Add two new resources:

  • Health-probe Bypass application covering /health, /api/v1/health, /ready, /api/v1/ready on both app.securityv0.com and dev.securityv0.com.
  • CNAME status.securityv0.com → the BetterStack-provided target (set after BetterStack status page is provisioned).
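As a rough illustration of the Bypass application described above, the following sketch builds the eight host/path entries and attaches a single bypass policy. It assumes the cloudflare/cloudflare 4.x resource names used elsewhere in this plan; the local names, the variable name, and the policy name are hypothetical. Attribute names should be checked against the pinned provider version, and whether one application may carry all eight entries is worth confirming against Cloudflare's limits.

```hcl
# Sketch only: names below are illustrative, not the repo's actual code.
variable "account_id" {
  type = string
}

locals {
  hosts        = ["app.securityv0.com", "dev.securityv0.com"]
  health_paths = ["/health", "/api/v1/health", "/ready", "/api/v1/ready"]

  # Every host/path combination, e.g. "app.securityv0.com/health".
  health_domains = [
    for pair in setproduct(local.hosts, local.health_paths) : "${pair[0]}${pair[1]}"
  ]
}

resource "cloudflare_access_application" "health_bypass" {
  account_id          = var.account_id
  name                = "health-bypass"
  type                = "self_hosted"
  domain              = local.health_domains[0] # primary entry
  self_hosted_domains = local.health_domains    # all entries, including the primary
}

resource "cloudflare_access_policy" "health_bypass" {
  account_id     = var.account_id
  application_id = cloudflare_access_application.health_bypass.id
  name           = "Bypass health probes" # hypothetical policy name
  precedence     = 1
  decision       = "bypass"

  # Bypass applies to every caller; the endpoints expose only health status.
  include {
    everyone = true
  }
}
```

Because the paths are listed explicitly, everything else on both hosts stays behind the existing Access applications.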

File layout (first PR in sv0-infrastructure)

sv0-infrastructure/
├── README.md                    # what this repo is, 1Password + TFC conventions
├── CODEOWNERS                   # envs/shared/cloudflare-*.tf + envs/prod/** → @ivanfofanov @sergey-medved
├── .gitignore                   # .terraform/, *.tfstate*, *.plan
├── modules/
│   ├── cloudflare-zone/
│   │   ├── main.tf              # zone data source, dns records from for_each
│   │   ├── variables.tf         # zone_id, records: list(object{...})
│   │   └── outputs.tf
│   └── cloudflare-access-app/
│       ├── main.tf              # cloudflare_access_application + cloudflare_access_policy
│       ├── variables.tf         # name, domains, session_duration, policies
│       └── outputs.tf
├── envs/shared/
│   ├── versions.tf              # terraform 1.9+, cloudflare/cloudflare ~> 4.x
│   ├── providers.tf             # cloudflare provider with API token from TFC var
│   ├── cloudflare-dns.tf        # module "dns" { source = "../../modules/cloudflare-zone" ... }
│   ├── cloudflare-access.tf     # 4 Access apps: app-prod, dev-*, dev-wildcard, health-bypass
│   ├── terraform.tfvars         # zone_id, account_id (non-sensitive, committed)
│   └── backend.tf               # cloud { organization = "securityv0", workspaces { name = "sv0-shared" } }
└── .github/                     # optional; no Terraform workflow in Phase 1 because TFC VCS-driven plan handles it
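The three root-configuration files in envs/shared could look roughly like the sketch below. The organization and workspace names come from the layout above; the exact version constraints are assumptions (the layout only says "1.9+" and "4.x"). It also assumes CLOUDFLARE_API_TOKEN is stored in TFC as an environment variable, not a Terraform variable, so the provider block needs no explicit token argument.

```hcl
# envs/shared/versions.tf
terraform {
  required_version = ">= 1.9.0"

  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0" # assumption: any 4.x release
    }
  }
}

# envs/shared/backend.tf
terraform {
  cloud {
    organization = "securityv0"

    workspaces {
      name = "sv0-shared"
    }
  }
}

# envs/shared/providers.tf
provider "cloudflare" {
  # Intentionally empty: the provider reads CLOUDFLARE_API_TOKEN from the
  # environment, which TFC injects from the sensitive workspace variable.
}
```

Keeping the token out of the provider block means no Terraform variable has to carry the secret, so it cannot appear in terraform.tfvars by accident.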

Import script (scripts/phase-1-imports.sh, also included in the PR body)

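#!/bin/bash
# `terraform import` executes on the machine that invokes it, even when the
# workspace uses remote execution. Run `terraform login` first and export
# CLOUDFLARE_API_TOKEN in this shell.
set -euo pipefail

# Resolve envs/shared relative to this script so it works from any directory.
cd "$(dirname "$0")/../envs/shared"

: "${CLOUDFLARE_API_TOKEN:?set CLOUDFLARE_API_TOKEN}"
ZONE_ID="${CLOUDFLARE_ZONE_ID:?set CLOUDFLARE_ZONE_ID}"
ACCOUNT_ID="${CLOUDFLARE_ACCOUNT_ID:?set CLOUDFLARE_ACCOUNT_ID}"

# DNS records: one import per record (look up IDs via Cloudflare dashboard or API)
terraform import 'module.dns.cloudflare_record.records["app"]' "${ZONE_ID}/<record_id_for_app_securityv0_com>"
terraform import 'module.dns.cloudflare_record.records["dev"]' "${ZONE_ID}/<record_id_for_dev_securityv0_com>"
terraform import 'module.dns.cloudflare_record.records["wildcard_dev"]' "${ZONE_ID}/<record_id_for_wildcard_dev>"
# ... etc for each existing A/CNAME/TXT record

# Access applications
terraform import 'module.access_prod.cloudflare_access_application.app' "${ACCOUNT_ID}/<app_id_prod>"
terraform import 'module.access_dev.cloudflare_access_application.app' "${ACCOUNT_ID}/<app_id_dev>"
# ... etc for the remaining existing Access apps.
# The policies attached to each existing app need imports as well; otherwise
# `terraform plan` proposes creating them. Check the provider docs for the
# cloudflare_access_policy import ID format.
# Bypass app is NEW: not imported; created by `terraform apply`.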
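The import addresses above use string keys such as records["app"], which implies a for_each over a map (or a list converted to a map by key). The sketch below shows one way the cloudflare-zone module could be shaped to produce those addresses, along with a declarative import block as an alternative to the CLI command. The object field names and the imports.tf filename are assumptions. Import blocks are available in the Terraform version pinned for this repo and run as part of a normal plan and apply, so they go through the same TFC review flow as any other change.

```hcl
# modules/cloudflare-zone/versions.tf
# Needed so the module resolves cloudflare/cloudflare, not hashicorp/cloudflare.
terraform {
  required_providers {
    cloudflare = {
      source = "cloudflare/cloudflare"
    }
  }
}

# modules/cloudflare-zone/variables.tf (field names are assumptions)
variable "zone_id" {
  type = string
}

variable "records" {
  # Keyed by a stable label ("app", "dev", "wildcard_dev") so that resource
  # addresses read cloudflare_record.records["app"].
  type = map(object({
    name    = string
    type    = string
    value   = string
    proxied = optional(bool, false)
  }))
}

# modules/cloudflare-zone/main.tf
resource "cloudflare_record" "records" {
  for_each = var.records

  zone_id = var.zone_id
  name    = each.value.name
  type    = each.value.type
  value   = each.value.value # later 4.x releases also accept `content`
  proxied = each.value.proxied
}

# envs/shared/imports.tf (hypothetical file; placeholder ID left unfilled)
import {
  to = module.dns.cloudflare_record.records["app"]
  id = "${var.zone_id}/<record_id_for_app_securityv0_com>"
}
```

Once an import has been applied, its import block can be deleted in a follow-up PR without affecting state.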

Done when

  • terraform plan in envs/shared shows zero changes against imported resources.
  • New health-probe Bypass Access application is live. Verified by:
    curl -i https://app.securityv0.com/api/v1/health
    # Expected: HTTP/2 200 with JSON body, no 302 redirect to cloudflareaccess.com.
  • status.securityv0.com CNAME resolves (pending BetterStack status page being published in Phase 4 or provisioned by Ivan ahead of Phase 1).
  • CODEOWNERS is in place; Ivan and Sergey are added as approvers on the TFC workspace.

Expected diff size

~400 lines HCL, ~80 lines README. No GitHub Actions workflow lines, since Prerequisite 4 removed the tf-plan.yml workflow. ~15 terraform import invocations documented in the PR body.


Phase 2 — Atlas + GitHub environment secrets (1-2 days)

Trigger: Before the MediaPro Atlas cutover (~Day 2 of readiness-plan Track A). This phase provisions the Atlas cluster via Terraform and (in a follow-up) terraform-imports the existing GitHub Actions environment secrets so the MONGODB_URI secret (and every other existing secret) is code-managed going forward.

Workspace decision — workspace-per-environment

The original draft of this plan put Atlas in envs/shared/atlas.tf. Phase 2 implementation deviated to a workspace-per-environment model:

| Workspace | Directory | What lives there |
|---|---|---|
| sv0-shared | envs/shared/ | Cross-environment resources only (Cloudflare zone, future GitHub org-level config). Atlas does not live here. |
| sv0-prod | envs/prod/ | Persistent prod infrastructure: the M10 Atlas cluster, future prod compute (Phase 3), prod GitHub env secrets. |
| sv0-staging | envs/staging-ephemeral/ | On-demand staging cluster for drills. Idle by default (var.staging_enabled = false); flip on for PITR drill / version upgrade rehearsal / cutover validation. |

Three reasons for the deviation:

  1. API token scoping. sv0-shared already holds the Cloudflare token. Adding the Atlas key there expands blast radius for anyone with sv0-shared write access. Per-env workspaces isolate the Atlas key to sv0-prod + sv0-staging.
  2. State isolation. A typo or destroy in envs/staging-ephemeral cannot affect prod resources because the state file is different.
  3. Future split is free. When ADR-020 Phase 1 splits prod from dev (separate clusters), sv0-prod stays put and sv0-staging gets a real persistent cluster. No cross-workspace state moves.

Scope

  • New module modules/atlas-project/ — project + advanced_cluster (M10) + database_user (SCRAM-SHA, for_each roles) + project_ip_access_list (for_each over CIDRs). Default region per ADR-020 §3 (aws:eu-west-1 Ireland) and Phase 0 carve-out per ADR-020 §0. PITR togglable, termination protection togglable, auto-scaling explicitly disabled.
  • envs/prod/atlas.tf — single M10 cluster sv0-prod with two databases (sv0_prod + sv0_dev) per the carve-out, PITR ON, termination protection ON, IP allowlist starts empty (cluster refuses connections until populated by follow-up PR).
  • envs/staging-ephemeral/atlas.tf — same module, gated behind var.staging_enabled = false default. terraform plan shows No changes while idle. Stand up on demand for drills; tear down with terraform destroy. Termination protection OFF (must be destroyable).
  • (Follow-up PR, not in initial Phase 2 scope) New module modules/github-environment/ — GitHub environment resource + secrets + required-reviewer protection rule. envs/prod/github-environments.tf instantiates dev and prod GitHub environments in sv0-platform, with all existing secrets imported. The MONGODB_URI secret is composed from the Atlas connection string + app password (TFC sensitive output) and pushed via the github_actions_environment_secret resource.
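The staging_enabled gate can be sketched with a conditional count on the module call, which is why an idle workspace plans to "No changes". The module input names and the cluster name below are hypothetical; only the source path, the variable name, its default, and the termination-protection setting come from this plan.

```hcl
# envs/staging-ephemeral/atlas.tf (sketch)
variable "staging_enabled" {
  type        = bool
  default     = false
  description = "Set to true to stand up the drill cluster."
}

module "atlas" {
  source = "../../modules/atlas-project"

  # Zero instances while idle, so the plan is empty until the flag is flipped.
  count = var.staging_enabled ? 1 : 0

  # Hypothetical input names; the real module may differ.
  cluster_name           = "sv0-staging"
  termination_protection = false # must stay destroyable
}
```

With this shape, setting staging_enabled back to false and applying removes the cluster in the same way terraform destroy does.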

Provider pin

mongodb/mongodbatlas ~> 2.11. Uses v2.x attribute-form replication_specs = [{...}] and use_effective_fields = true. Forward-compatible with v3 when that lands (v3 makes the new behavior default).

Done when

  • Atlas cluster sv0-prod is live in aws:eu-west-1; DB auth enabled; PITR confirmed in Atlas console; termination protection ON.
  • App password (sv0_app) read once from TFC sensitive outputs into 1Password sv0-infra (entry: sv0_app password (atlas)).
  • IP allowlist populated with Hetzner dev VM static egress + Ivan's ops IP via follow-up PR; cluster accepts connections from those addresses.
  • terraform plan in envs/staging-ephemeral shows No changes (because staging_enabled = false).
  • (Follow-up PR) Every existing GitHub environment secret in sv0-platform is Terraform-managed; terraform plan shows zero diff; deploy-prod.yml still passes end-to-end with Terraform-managed secrets.
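For the follow-up PR, composing MONGODB_URI and pushing it to the prod environment might look like the sketch below. The two input variables are hypothetical stand-ins for whatever the Atlas module exposes, and the database name in the URI is an assumption (the carve-out has both sv0_prod and sv0_dev). It also assumes the GitHub provider is configured with the SecurityV0 organization as owner.

```hcl
# Sketch only: variable names are hypothetical.
variable "atlas_srv_host" {
  type        = string
  description = "Cluster SRV hostname without the mongodb+srv:// scheme."
}

variable "app_password" {
  type      = string
  sensitive = true
}

locals {
  # urlencode guards against passwords containing characters such as "@" or "/".
  mongodb_uri = "mongodb+srv://sv0_app:${urlencode(var.app_password)}@${var.atlas_srv_host}/sv0_prod"
}

resource "github_actions_environment_secret" "mongodb_uri" {
  repository      = "sv0-platform"
  environment     = "prod"
  secret_name     = "MONGODB_URI"
  plaintext_value = local.mongodb_uri
}
```

The plaintext value is stored in Terraform state, so this pattern depends on the TFC workspace state staying access-controlled.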

Expected diff size

~700 lines HCL for the cluster work (initial Phase 2 PR). ~300 additional lines + ~15 terraform import invocations when GH environment secrets land in the follow-up PR.


Phase 3 — Post-pilot compute (2-3 days)

Trigger: After MediaPro pilot goes live and is stable for ≥1 week. Readiness review §2.3 is explicit: compute migration is post-pilot.

Scope

New module set under sv0-platform/infra/ (not sv0-infrastructure — this is product-scoped compute):

  • modules/compute-vm/ — cloud-agnostic Linux VM + Docker Compose + Caddy. Variable cloud_provider selects between aws and azure sub-modules. Explicit anti-patterns locked in ADR-019 and readiness review §2.3: no Container Apps / ECS / Fargate / Cosmos DB.
  • modules/compute-vm/aws/ — EC2 t3.small, SG, ACM cert, Route53 or Cloudflare DNS.
  • modules/compute-vm/azure/ — mirror using existing sv0-connectors/infra/ patterns (resource group, VNet, NSG, VM, Key Vault).
  • sv0-platform/infra/envs/prod/ — production VM instantiation.

Critical design requirement (per ADR-019 §4): compute-vm module accepts tenant_slug variable. Shared-SaaS production uses tenant_slug = "prod" (or equivalent sentinel); dedicated-deployment customers later instantiate the same module at sv0-infrastructure/envs/tenant-<slug>/ or similar.
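A minimal sketch of how the cloud_provider switch and the tenant_slug input could fit together in compute-vm. The sub-module paths follow the Scope list above; the public_ip output is a hypothetical name for whatever the sub-modules expose.

```hcl
# modules/compute-vm/variables.tf (sketch)
variable "cloud_provider" {
  type = string

  validation {
    condition     = contains(["aws", "azure"], var.cloud_provider)
    error_message = "cloud_provider must be \"aws\" or \"azure\"."
  }
}

variable "tenant_slug" {
  type        = string
  description = "\"prod\" for shared SaaS; a customer slug for dedicated deployments."
}

# modules/compute-vm/main.tf
# Exactly one of the two sub-modules is instantiated.
module "aws" {
  source = "./aws"
  count  = var.cloud_provider == "aws" ? 1 : 0

  tenant_slug = var.tenant_slug
}

module "azure" {
  source = "./azure"
  count  = var.cloud_provider == "azure" ? 1 : 0

  tenant_slug = var.tenant_slug
}

# modules/compute-vm/outputs.tf
output "public_ip" {
  # Hypothetical sub-module output; one() unwraps the single active instance.
  value = one(concat(module.aws[*].public_ip, module.azure[*].public_ip))
}
```

A dedicated deployment would then call the same module with a different tenant_slug, leaving the shared-SaaS instantiation untouched.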

Cloud selection logic (decided at Phase 3 start, not now)

  • Azure VM if Azure Founders Hub credits are approved and ≥$2K available. Uses existing sv0-connectors/infra/ Azure patterns for consistency.
  • Otherwise AWS EC2 t3.small (Mercury Activate $5K is already activated).

Done when

  • Production API container runs on the new VM.
  • deploy-prod.yml targets the new VM (DEPLOY_HOST secret rotated via Terraform).
  • BetterStack health check is green on the new host for 24 consecutive hours.
  • Hetzner prod VM is stopped but not destroyed for 14 days as a cold standby. After 14 days it is destroyed, and Hetzner remains only as the dev environment.

Expected diff size

~400 lines HCL + ~200 lines cloud-init YAML template.


Phase 4 — BetterStack + Grafana Cloud (1 day)

Trigger: After observability stack rollout per docs/architecture/research/2026-04-22-observability-stack.md. Phase 4 terraform-imports the BetterStack monitors and Grafana Cloud stack config so adding a new monitor becomes a PR instead of a dashboard click.

Scope

  • modules/betterstack-monitors/ — uptime monitor + alert routing (SMS + email).
  • modules/betterstack-status-page/ — status page + component wiring. Used to anchor status.securityv0.com.
  • modules/grafana-cloud-stack/ — Grafana Cloud stack, data source tokens, MCP API key, Alloy remote-write credentials.
  • envs/shared/observability.tf — instantiate all of the above.

Provider requirements:

  • betterstackhq/betterstack ~> 0.9
  • grafana/grafana ~> 3.x

Done when

  • All uptime monitors, status page, and alert rules are Terraform-managed.
  • grafana/mcp-grafana API token is emitted as a TFC sensitive output and pasted into the sv0-platform .claude/settings.json MCP config.
  • Adding a new monitor (e.g., for a customer tenant's dedicated endpoint) is a PR that modifies envs/shared/observability.tf.
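One way to make "a new monitor is a PR" concrete is to drive the monitors module from a map, so that each addition is a one-line diff. This is only a sketch: the module's input names (name, url) and the map keys are hypothetical, and the URLs reuse the health endpoints from Phase 1.

```hcl
# envs/shared/observability.tf (sketch; module inputs are hypothetical)
locals {
  monitors = {
    app_health = "https://app.securityv0.com/api/v1/health"
    dev_health = "https://dev.securityv0.com/api/v1/health"
    # A new tenant endpoint would be added here as one more entry.
  }
}

module "monitors" {
  source   = "../../modules/betterstack-monitors"
  for_each = local.monitors

  name = each.key
  url  = each.value
}
```

Keying the map by a stable label keeps existing monitors at the same resource address when entries are added or removed.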

Expected diff size

~150 lines HCL.


Phase 5 and beyond — out of scope for this plan

Explicitly deferred; tracked in follow-up issues:

  • WorkOS Terraform provider. No mature provider exists at the time of writing. WorkOS stays dashboard-managed with a monthly reconciliation checklist. Revisit when a provider lands.
  • Connector-dev Azure infra migration to shared TFC state. sv0-connectors/infra/ stays on local state + 1Password SP credentials until/unless we need shared state (e.g., for per-customer connector Azure SPs at scale).
  • Hetzner VM Terraform. Transitional infrastructure; stays hand-managed until decommission.
  • AWS Organizations landing zone (accounts, SCPs, budgets from the 2026-03-31 infra strategy). Deferred until we commit to AWS as the post-pilot compute cloud AND have >1 workload there.
  • Cloudflare token narrowing (follow-up within 30 days of Phase 1 merge). Phase 1 uses the scoped Cloudflare API Token from Prerequisite 5: Zone DNS Edit on securityv0.com, Account Access: Apps and Policies Edit, and Account Settings Read. If we later want to narrow further (e.g., per-environment tokens with only the zones each environment touches), this is a credential cutover, not an edit in place: generate the narrower token, update TFC workspace variables, run terraform plan to confirm no drift, then revoke the broader token. The legacy Global API Key is never used, because it is account-wide and cannot be narrowed.

Quarterly ops sprint — security and rotation

Per ADR-019 §5, every 90 days there is a scheduled ops sprint that:

  1. Rotates every root token stored in 1Password sv0-infra (Cloudflare, Atlas, GitHub, BetterStack, Grafana Cloud, TFC, Hetzner).
  2. Updates corresponding TFC workspace variables.
  3. Triggers Terraform-driven GitHub Actions secret rotations via PR.
  4. Documents the security scoping decisions made during rotation (e.g., "this quarter we scoped the Cloudflare token from Global to zone-specific; here's the token's permissions table").
  5. Reviews the drift-allowlist.md entries in each workspace and removes stale known-drift exceptions.

The first scheduled sprint is 2026-07-23 (90 days after this plan's acceptance). Reminder goes in Ivan's calendar and as a recurring GitHub issue in sv0-infrastructure.


Tracking

| Phase | Issue | Owner | Status |
|---|---|---|---|
| Prerequisites | Create sv0-infrastructure repo + TFC org + 1Password vault entries | Ivan | blocks Phase 1 |
| Phase 1 | sv0-documentation#203 until sv0-infrastructure exists, then replace with repo-local issue | Claude drafts, Ivan reviews | not started |
| Phase 2 | sv0-infrastructure#7 (PR #8) | Claude drafts, Ivan reviews | cluster scaffold open for review; GH-env-secrets follow-up after cluster is live |
| Phase 3 | sv0-infrastructure#? + sv0-platform#? | tbd | queued behind pilot-stable |
| Phase 4 | sv0-infrastructure#? | tbd | queued behind observability stack rollout (sv0-platform#494) |
| Quarterly sprint | scheduled for 2026-07-23 | Ivan | recurring |

Umbrella epic: sv0-documentation#195 (MediaPro readiness) references Phases 1-2. Phases 3-4 get their own follow-up issues when their triggers fire.