Azure VM Landing Zone — Implementation Plan
Companion to: ADR-022
This runbook holds the concrete plan for the Hetzner → Azure migration: current-state inventory, target topology, sequencing, sub-issues to file, and the cutover playbook.
The ADR holds the durable decisions; this runbook holds the work. When a step here is contradicted by a later ADR amendment, the ADR wins.
Current state — Hetzner footprint as of 2026-05-08
VMs
| Environment | Hostname | IP | DNS | Spec | Cost | Notes |
|---|---|---|---|---|---|---|
| Dev | docker-ce-ubuntu-4gb-ash-1 | 178.156.217.150 | dev.securityv0.com, *.dev.securityv0.com | CPX21, 2 vCPU / 4 GB / 80 GB SSD | ~EUR 11/mo | Hosts dev branch + every open PR (pr-N-dev.securityv0.com) on the same box |
| Prod | docker-ce-ubuntu-4gb-ash-2 | 178.156.245.75 | app.securityv0.com | CPX21, 2 vCPU / 4 GB / 80 GB SSD | ~EUR 11/mo | Hosts the main branch app stack |
Both in Hetzner Cloud, Ashburn VA (us-east).
Pain points the diagram makes visible: every PR lands on the same dev box (5+ concurrent → OOM, single disk fills); CI holds long-lived SSH keys to both VMs; every Atlas read crosses the Atlantic.
Workloads on each VM
Caddy (host, :443)
└── reverse_proxy localhost:8080
    └── nginx (Docker, :8080)
        ├── / → React SPA bundle
        └── /api/* → api:3000 (Docker)
            └── MONGODB_URI → Atlas EU_WEST_1 (since 2026-05-04)
The mongo container that used to run on each VM has been decommissioned as part of epic #550 phase 2. Containers in scope on each VM today are: api, ui, plus per-PR overlays on the dev VM.
Identity and ingress
- Public ingress: Cloudflare DNS A record → host public IPv4 → Caddy:443 (TLS terminate, Let's Encrypt) → nginx
- Cloudflare Access: `app.securityv0.com`, `dev.securityv0.com`, `*-dev.securityv0.com` are all behind Cloudflare Access (Zero Trust). CI/CD uses service tokens (`CF_ACCESS_CLIENT_ID` / `CF_ACCESS_CLIENT_SECRET`).
- Human SSH: `ssh deploy@<host>` with `~/.ssh/sv0-deploy-prod`. Root SSH on prod is gated through Ivan's 1Password SSH agent.
- CI deploy SSH: GitHub Actions environment secrets `DEPLOY_SSH_KEY`, `DEPLOY_HOST`, `DEPLOY_HOST_KEY`. The `deploy` user is in the `docker` group (ADR-018 accepted risk).
CI/CD pipeline
- `ci.yml` builds API + UI images, pushes to GHCR with `sha-<commit>` tag (and `pr-N` for PRs)
- `deploy-dev.yml` triggers on `workflow_run` after CI; SSHes to dev VM, pulls images, restarts containers
- `deploy-dev-cleanup.yml` triggers on PR close; SSHes to dev VM, tears down the per-PR compose project
- `deploy-prod.yml` triggers on `workflow_dispatch` with approval gate
Pain points the migration must address
- Dev OOM under PR concurrency: more than ~5 simultaneous PRs cause memory pressure that has crashed the dev branch and other PR previews.
- Disk-full outage: on 2026-04-17, the dev VM's 75 GB filled with orphan PR-instance directories. A reaper now exists (`/home/deploy/scripts/cleanup-instances.sh`, daily at 04:00), but the underlying single-disk-per-tenant model is the bottleneck.
- `DEPLOY_SSH_KEY` rotation toil: ADR-018 acceptance requires this to be rotatable on suspicion of leak; this is a manual workflow in two places (1Password, GHA secret).
- `docker`-group root-equivalence: ADR-018 accepted risk, time-bounded by this migration.
- us-east → eu-west request hop: Atlas now lives in `EU_WEST_1`. Every read crosses the Atlantic, adding ~80-120 ms of unnecessary latency to every API call that touches Mongo.
Target state — Azure VM landing zone
Region
westeurope (Amsterdam). Co-locates with Atlas EU_WEST_1 and Grafana Cloud Frankfurt.
Subscription
Azure subscription 1, ID 2a25bc41-c1ce-4d04-9cb6-a62deccc3bcc, tenant bcf375ed-e122-4d76-a43d-82c94a3f7e3b. Same subscription used by sv0-connectors/infra/ and sv0-demo-labs. Credits via Microsoft for Startups (allocation tracked in 1Password sv0-infra vault).
Resource groups
Per the 2026-05-09 modest-hybrid amendment (ADR-022 §3), one RG per environment. Bootstrap owns its own RG since it's a permanent local-apply directory.
| RG | Owner | Holds |
|---|---|---|
| rg-sv0-bootstrap | bootstrap/ (local apply) | Federated Azure AD app, break-glass SP, Entra emergency group, custom RBAC role, state-backup storage account |
| rg-sv0-shared | sv0-shared TFC workspace | VNet, subnets, NAT Gateway, NSGs, Compute Gallery, Recovery Services Vault, Key Vaults (kv-sv0-staging, kv-sv0-prod, kv-sv0-dev) |
| rg-sv0-staging | sv0-staging TFC workspace | Phase 3b. 1 Azure VM (compute + colocated Mongo container), Managed Identity, Cloudflare Tunnel sv0-staging. Gated by staging_compute_enabled; default false = zero compute resources. |
| rg-sv0-prod | sv0-prod TFC workspace | Phase 3c. Prod fleet VMs, Managed Identity, prod Tunnel config, lifecycle-protected. |
| rg-sv0-dev | sv0-dev TFC workspace | Phase 3f-DEFERRED. Dev VMs (one or more), Managed Identity, dev Tunnel config. Empty until 3f ships. |
| rg-sv0-pr-previews-pr-N | not Terraformed | Per-PR ephemeral VM, NIC, OS disk, tunnel resource. Created by GHA workflow, deleted as one unit. |
Network
VNet: sv0-vnet 10.0.0.0/16 westeurope
├─ Subnet: snet-staging 10.0.0.0/24
├─ Subnet: snet-prod 10.0.1.0/24
├─ Subnet: snet-dev 10.0.2.0/24
├─ Subnet: snet-pr-previews 10.0.3.0/24
└─ Subnet: snet-shared 10.0.4.0/24 (reserved; future jumphost or scrape VM)
NAT Gateway (single, zonal):
  natgw-zone1
    sku                  Standard
    zone                 1
    public_ip_addresses  [pip-natgw-zone1] # static, pinned in Atlas allowlist
    subnets_attached     snet-staging, snet-prod, snet-dev, snet-pr-previews, snet-shared
NSG (one per subnet):
- No inbound rules except the implicit Azure-managed VNet/AzureLoadBalancer
- All outbound permitted (NAT Gateway is the egress chokepoint)
Constraint that drives this design: Azure permits at most one NAT Gateway per subnet, and Standard NAT Gateway is a zonal (not zone-redundant) resource. The earlier draft sketched two NAT Gateways "each attached to snet-*" — that is invalid as written. The chosen design uses a single zonal NAT in zone 1, with documented zonal-failure exposure (ADR §5b). Phase-4+ upgrade option: split into per-zone subnet stacks (snet-prod-z1, snet-prod-z2, etc.) each with its own NAT. Not done yet because (a) it doubles NAT cost, (b) it complicates IaC, (c) we are pre-revenue and a multi-hour zonal outage is acceptable.
No Application Gateway. No Azure Load Balancer. No Front Door. No public IPs on VMs.
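The address plan can be sanity-checked offline before the Phase 3a apply. The sketch below is only an illustration using Python's standard `ipaddress` module: it takes the five subnets listed under the Phase 3a deliverables and checks that each one sits inside the VNet and that no two overlap. The function name `check_plan` is hypothetical and not part of the repo.

```python
import ipaddress
from itertools import combinations

VNET = ipaddress.ip_network("10.0.0.0/16")
# Subnet names and CIDRs as listed in the Phase 3a deliverables.
SUBNETS = {
    "snet-staging": "10.0.0.0/24",
    "snet-prod": "10.0.1.0/24",
    "snet-dev": "10.0.2.0/24",
    "snet-pr-previews": "10.0.3.0/24",
    "snet-shared": "10.0.4.0/24",
}


def check_plan(vnet, subnets):
    """Return a list of problems; an empty list means the plan is consistent."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    # Every subnet must be contained in the VNet address space.
    problems = [
        f"{name} is outside {vnet}"
        for name, net in nets.items()
        if not net.subnet_of(vnet)
    ]
    # No two subnets may share addresses.
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} overlaps {b}")
    return problems


if __name__ == "__main__":
    print(check_plan(VNET, SUBNETS) or "address plan OK")
```

A check like this could run in CI next to `terraform validate`; it says nothing about NAT Gateway attachment, which Azure enforces at apply time.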
VMs
| Tier | Count | Size | Zone(s) | OS | Disk | Mongo |
|---|---|---|---|---|---|---|
| Staging | 1 (toggleable) | Standard_B2s (2 vCPU, 4 GB) | 1 | Ubuntu 24.04 LTS Server (Compute Gallery custom image) | 64 GB Premium SSD | Colocated container by default; env-var switch to prod Atlas sv0_staging DB |
| Prod | 2 | Standard_B2s (2 vCPU, 4 GB) | 1 + 2 | Same | 64 GB Premium SSD | Always Atlas sv0_prod |
| Dev (Phase 3f) | 1+ | Standard_B2s (2 vCPU, 4 GB) | 1 | Same | 64 GB Premium SSD | Colocated container only |
| PR preview | 0–10 | Standard_B2s (default; tunable) | 1 | Same | 32 GB Standard SSD | Reuses dev pattern (colocated container) |
PR-preview sizing default is B2s until Phase 3d measures actual idle/load RSS. ADR §4 holds the rationale; downsize to B1ms is a follow-up PR if data supports it.
Staging compute is destroyable. When staging_compute_enabled = false, all staging compute is destroyed, including the throwaway OS disk; the separate Mongo data disk persists (~$10/mo), so re-applying restores the local Mongo data. This is the cost-aware default: staging is online only while it's actively in use.
Topology — all environments in one view
Four things this diagram makes load-bearing:
- Staging stands up first, then prod. Phase 3b ships staging end-to-end on `staging.securityv0.com`; Phase 3c only starts after every operational concern is exercised on a cheap throwaway environment. Dev (Phase 3f) lands last and is shown dashed/gray here.
- Mongo lives on the staging VM by default. Staging runs a colocated Mongo container (same Docker pattern as today's Hetzner setup). The dashed line to Atlas represents the `MONGODB_URI` env-var switch: flip to Atlas-mode only when you need real-cluster validation. No extra Mongo VM. The `sv0_staging` database on the prod M10 has zero incremental cost.
- One workspace per environment. Prod, staging, dev each own their own RG; shared resources (VNet, NAT, gallery, KVs, RSV) live in a separate RG owned by `sv0-shared`. PR previews are GHA-managed, never in TF state.
- Single zonal NAT is the deliberate trade. Azure forbids multiple NAT Gateways per subnet; Standard NAT is zonal, not zone-redundant. A zone-1 outage takes egress with it for all environments. Accepted at our scale (ADR-022 §5b); per-zone subnet split deferred to Phase 4+.
Cost estimate (rough, USD/yr) — point-in-time, not auditable
⚠️ Numbers below are author-built estimates from publicly listed Azure prices for westeurope as of 2026-05-08, not from an Azure Pricing Calculator export. They are budgeting-grade, not billing-grade. Phase 3a apply will attach a calculator export to the implementation PR for actuals.
| Line | Qty | Unit/mo | Annualised |
|---|---|---|---|
| Prod VMs (B2s) | 2 | ~$30 | ~$720 |
| Staging VM (B2s, realistic 10–15% on-duty) | 1 effective × 0.1–0.15 | ~$3–5 avg | ~$45–60 |
| Staging Mongo data disk (Premium SSD, persists across compute toggles) | 1 | ~$10 | ~$120 |
| Staging OS disk (destroyed when compute off, paid only while on) | 1 × 0.1–0.15 | ~$1 avg | ~$12 |
| Dev VMs (B2s) — Phase 3f, not yet incurred | 1 | ~$30 | ~$360 (deferred) |
| PR-preview VMs (B2s default, average 5 active × ~30% lifetime) | ~1.5 effective | ~$30 | ~$540 |
| NAT Gateway (single, zonal) — fixed + ~50 GB egress/mo | 1 | ~$40 | ~$480 |
| Recovery Services Vault (prod-only, daily, 30d retention) | 2 protected | ~$10 | ~$240 |
| Premium / Standard SSD storage (OS disks; Mongo data disk counted separately above) | ~7 active | ~$10 | ~$840 |
| Compute Gallery image storage (1 image, 1 region replica) | 1 | ~$5 | ~$60 |
| Key Vault (staging + prod + dev, ~15K ops/mo) | 3 | ~$1 | ~$36 |
| Public IP on NAT Gateway (static) | 1 | ~$3 | ~$36 |
| Logs / metrics egress to Grafana Cloud (Loki + Prom) | — | — | ~$200 |
| Subtotal (pre-Phase 3f) | | | ~$3,300/yr |
| Subtotal (post-Phase 3f, dev VM added) | | | ~$3,700/yr |
Staging cost note — realistic usage pattern: staging is on for 2–5 days per active testing period (e.g. validating a release before the prod deploy or a quarterly drill), then off. At ~4 days/month that's ~13% on-duty, $4/mo for VM compute. The 50% framing in earlier drafts was conservative for budgeting but didn't match how staging actually gets used; reframed here to match expected reality. The Mongo data disk is the floor cost ($10/mo, always paid) because it persists across compute toggles. Atlas-mode switch (point staging at prod Atlas sv0_staging DB) has zero cost delta — the sv0_staging database is created on the already-paid prod M10 cluster.
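The duty-cycle arithmetic in the note above can be reproduced in a few lines. This is a minimal sketch, assuming a 30-day month (an assumption of this example, not stated in the plan) and the ~$30/mo B2s and ~$10/mo data-disk figures from the table; the function names are illustrative only.

```python
B2S_MONTHLY_USD = 30        # table estimate for one B2s VM running all month
DATA_DISK_MONTHLY_USD = 10  # Mongo data disk, paid regardless of compute state


def on_duty_fraction(days_on_per_month, days_in_month=30):
    """Share of the month that staging compute is running."""
    return days_on_per_month / days_in_month


def average_monthly_compute(full_month_price, days_on_per_month):
    """Average monthly compute cost when the VM only exists part of the month."""
    return full_month_price * on_duty_fraction(days_on_per_month)


duty = on_duty_fraction(4)
compute = average_monthly_compute(B2S_MONTHLY_USD, 4)
print(f"on-duty {duty:.0%}, compute ${compute:.2f}/mo, floor ${DATA_DISK_MONTHLY_USD}/mo")
```

At 4 days per month this gives roughly 13% on-duty and about $4/mo of compute on top of the data-disk floor, matching the note.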
What is not modelled here, and may bump real spend by 10–30%: Cloudflare Tunnel response egress through Azure NAT (every cloudflared connection's response bytes hit egress), GHCR pull egress on heavy PR-preview churn, snapshot storage growth over 30-day retention, Compute Gallery per-region replica storage if we add a second region, and Cloudflare/Grafana plan-tier increases.
Comfortably inside the credit allocation either way. Phase 3a sets Azure budget alerts at 50% / 75% / 90% of $5,000/yr, with a 100%-budget action group that auto-disables the tfc-sv0-infrastructure Azure AD app to halt new provisioning.
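As a cross-check on the table and the alert levels, the following sketch sums the annualised line items and derives the 50% / 75% / 90% thresholds for the $5,000/yr budget. The figures are copied from the estimate table; the dictionary keys and function names are made up for this example.

```python
# Annualised estimates in USD, as (low, high); equal values where the table gives one number.
LINE_ITEMS = {
    "prod_vms": (720, 720),
    "staging_vm": (45, 60),
    "staging_mongo_disk": (120, 120),
    "staging_os_disk": (12, 12),
    "pr_preview_vms": (540, 540),
    "nat_gateway": (480, 480),
    "recovery_vault": (240, 240),
    "ssd_storage": (840, 840),
    "gallery_image": (60, 60),
    "key_vault": (36, 36),
    "nat_public_ip": (36, 36),
    "grafana_egress": (200, 200),
}
DEV_VM_DEFERRED = 360   # added once Phase 3f ships
ANNUAL_BUDGET = 5000
ALERT_PERCENTS = (50, 75, 90)


def subtotal(items):
    """Sum the low and high ends of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


def alert_thresholds(budget, percents):
    """Dollar amounts at which each budget alert would fire."""
    return [budget * p / 100 for p in percents]


low, high = subtotal(LINE_ITEMS)
print(f"pre-3f subtotal: ${low}-${high}/yr")
print(f"post-3f subtotal: ${low + DEV_VM_DEFERRED}-${high + DEV_VM_DEFERRED}/yr")
print("alerts at:", alert_thresholds(ANNUAL_BUDGET, ALERT_PERCENTS))
```

Both subtotals round to the table's ~$3,300 and ~$3,700 figures; the unmodelled items listed above are not included.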
Ingress paths after cutover
prod traffic     Internet → Cloudflare edge → Cloudflare Tunnel "sv0-prod" (2 replicas, failover)
                 ├─ replica on prod VM 1 → nginx → containers → Atlas sv0_prod
                 └─ replica on prod VM 2 → nginx → containers → Atlas sv0_prod
                 (multi-replica = connection redundancy + nearest-replica routing,
                 NOT health-checked LB. Long-lived sessions may reconnect on replica
                 loss. Escalation: add Cloudflare Load Balancer if observed
                 failover is insufficient; see ADR §5a.)
staging traffic  Internet → Cloudflare edge → Cloudflare Tunnel "sv0-staging" (1 replica)
                 └─ staging VM → nginx → app container → mongo container (local)
                    OR (Atlas-mode switch) → Atlas sv0_staging DB on prod cluster
dev traffic      [Phase 3f] Internet → Cloudflare edge → Cloudflare Tunnel "sv0-dev"
                 └─ dev VM → nginx → app container → mongo container (local)
PR-N traffic     Internet → Cloudflare edge → Cloudflare Tunnel "sv0-pr-N" → nginx → containers
human SSH        `cloudflared access ssh --hostname <vm>` → Cloudflare Access SSO →
                 short-lived user cert (signed by Cloudflare Access SSH CA) →
                 sshd validates against TrustedUserCAKeys → session
                 (sshd still runs with HOST keys; no public network reach to port 22)
CI deploy        GitHub Actions → push image to GHCR → image-watcher on VM polls
                 the pointer doc → docker compose pull && up -d
                 (no CI-to-VM session; pull-based, see ADR §5d. SSH-from-CI was
                 in the earlier draft and is removed.)
VM egress        VM → snet-* → NAT Gateway (zone-1) → Internet/Atlas/GHCR/WorkOS/Grafana
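The pull-based "CI deploy" path reduces to one decision on each poll: does the pointer doc name a different image than the one running? The sketch below illustrates that decision only. The pointer doc's format is not specified in this runbook, so the JSON shape with an `image` field is an assumption, as are the function name and the example image references.

```python
import json


def plan_rollout(running_image, pointer_doc_text):
    """Return the commands to run, or [] when the VM is already on the desired image."""
    # Hypothetical pointer-doc shape: {"image": "<registry>/<name>:<tag>"}
    desired = json.loads(pointer_doc_text)["image"]
    if desired == running_image:
        return []
    # Same two steps the ingress-path summary names for the image-watcher.
    return [
        ["docker", "compose", "pull"],
        ["docker", "compose", "up", "-d"],
    ]


if __name__ == "__main__":
    doc = json.dumps({"image": "ghcr.io/example/api:sha-bbb"})
    for cmd in plan_rollout("ghcr.io/example/api:sha-aaa", doc):
        print(" ".join(cmd))
```

A real watcher would pass each command list to `subprocess.run` on a timer; keeping the decision as a pure function makes the no-op case easy to test.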
Phasing
The migration runs in six phases. Each phase is one or more PRs, each phase has a verifiable end-state, and each phase is reversible (Hetzner stays running through Phase 3e).
Staging-first sequencing (2026-05-10 amendment). Staging stands up before prod so every operational concern — cloudflared HA, secrets-via-KV, pull-deploy, image-watcher cadence, Alloy log shipping, break-glass — gets exercised on a cheap throwaway environment before prod compute is touched.
Phase 3-bootstrap — One-time OIDC federation setup (precedes 3a)
Status: ✅ Applied 2026-05-10 against modest-hybrid topology (ADR-022 amendment 2026-05-09). 38 resources live in subscription 2a25bc41-c1ce-4d04-9cb6-a62deccc3bcc.
Follow-up apply needed (2026-05-10 amendment): the staging-first amendment adds the following to the bootstrap layer. All changes are additive — adding "sv0-staging" to the tfc_workspaces set produces new map keys ("sv0-staging-plan", "sv0-staging-apply") because for_each is keyed by string, not by index; existing federated credentials are unchanged. Verify with terraform plan and look for any destroy-then-create on existing resources before applying.
- `rg-sv0-staging` resource group (1)
- `tfc-sv0-staging-plan` + `tfc-sv0-staging-apply` federated credentials (2)
- Plan-phase RBAC for `sv0-staging` SP: Reader on subscription + Reader on `rg-sv0-shared` (2)
- Apply-phase RBAC for `sv0-staging` SP: Contributor + Virtual Machine Contributor on `rg-sv0-staging` (2)
- `sv0-vm-emergency-ops` group → `sv0-serial-console-operator` role assignment on `rg-sv0-staging` (1)
Total ~8 additive resources. Done as part of the Phase 3b prep PR before staging compute can be provisioned.
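Why the change is additive can be seen by modelling the `for_each` keys as a set of strings. This sketch is an illustration in Python rather than Terraform, and it assumes the pre-amendment workspace set was `sv0-shared`, `sv0-prod`, and `sv0-dev`; the helper name is hypothetical.

```python
PHASES = ("plan", "apply")


def credential_keys(workspaces):
    """Map keys a string-keyed for_each would produce: one per workspace and run phase."""
    return {f"{ws}-{phase}" for ws in workspaces for phase in PHASES}


before = credential_keys({"sv0-shared", "sv0-prod", "sv0-dev"})
after = credential_keys({"sv0-shared", "sv0-prod", "sv0-dev", "sv0-staging"})

added = sorted(after - before)
removed = sorted(before - after)
print("added:", added)
print("removed:", removed)
```

Because the keys are strings and not list indices, inserting a workspace changes no existing key, which is why the plan should show only creates for the federated credentials.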
Local Terraform run, not via TFC. Output is the tfc-sv0-infrastructure Azure AD app + per-workspace federated credentials (8 total, 4 workspaces × {plan, apply}), the break-glass SP, the sv0-vm-emergency-ops Entra group, the custom sv0-serial-console-operator RBAC role, the state-backup storage account, and the five resource groups (rg-sv0-bootstrap, rg-sv0-shared, rg-sv0-staging, rg-sv0-prod, rg-sv0-dev). Documented in ADR §7a; concrete commands:
- Operator authenticates with their own Azure account (Owner-scoped): `az login`. (The earlier "pull sv0-azure-bootstrap SP from 1Password" instruction was wrong: no such credential existed, and creating one would be its own bootstrap problem.)
- Out-of-band: grant the operator `Storage Blob Data Owner` on the state-backup storage account. Required because `shared_access_key_enabled = false` forces terraform's data-plane polling through AAD; without this role, container creation fails with `KeyBasedAuthenticationNotPermitted`.
- `cd sv0-infrastructure/bootstrap && terraform init && terraform apply`. State stays local; do NOT migrate to a TFC workspace. The state file (`bootstrap/terraform.tfstate`) is backed up to 1Password `sv0-infra` after meaningful changes.
- Capture outputs (`tfc_app_client_id`, `tenant_id`, `subscription_id`) and set them as env-category workspace variables on each of the four workspaces (`sv0-shared`, `sv0-staging`, `sv0-prod`, `sv0-dev`): `TFC_AZURE_PROVIDER_AUTH=true`, `TFC_AZURE_RUN_CLIENT_ID=<id>`, `ARM_TENANT_ID=<tenant>`, `ARM_SUBSCRIPTION_ID=<sub>`. Use the TFC API (token from `~/.terraform.d/credentials.tfrc.json`) rather than the UI, so the step is repeatable across all four workspaces. `sv0-staging` is easy to miss because the federation for it lands in the bootstrap re-apply rather than the original apply; without these vars set, any Phase 3b TFC plan fails at provider init with `AADSTS700213`.
- Store the break-glass SP secret + state-backup storage account name in 1Password `sv0-infra` vault as item `sv0-azure-break-glass`.
Federated subject gotcha. The credential subject field uses TFC organization/project display names (case-sensitive, may contain spaces) — for SecurityV0 that is organization:SecurityV0:project:Default Project:workspace:<ws>:run_phase:<plan|apply>. The default values in bootstrap/variables.tf (tfc_organization, tfc_project) reflect this. Mismatch surfaces only at runtime as AADSTS700213 and never at terraform validate.
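To make the subject format concrete, here is a small sketch that builds the eight expected subject strings (four workspaces × two run phases) from the organization and project display names given above. The function name is illustrative; the string layout follows the pattern quoted in the previous paragraph.

```python
WORKSPACES = ("sv0-shared", "sv0-staging", "sv0-prod", "sv0-dev")
RUN_PHASES = ("plan", "apply")


def federated_subject(organization, project, workspace, run_phase):
    """Build the subject string; names are case-sensitive and may contain spaces."""
    if run_phase not in RUN_PHASES:
        raise ValueError(f"unknown run phase: {run_phase!r}")
    return (
        f"organization:{organization}:project:{project}:"
        f"workspace:{workspace}:run_phase:{run_phase}"
    )


subjects = [
    federated_subject("SecurityV0", "Default Project", ws, phase)
    for ws in WORKSPACES
    for phase in RUN_PHASES
]

if __name__ == "__main__":
    for subject in subjects:
        print(subject)
```

Comparing this list against the subjects on the Azure AD app's federated credentials before the first TFC run is one way to catch a display-name mismatch before it surfaces as `AADSTS700213`.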
End state: TFC can authenticate to Azure for sv0-shared, sv0-staging, sv0-prod, sv0-dev plan and apply phases without any static SP secret in TFC variables. Federation smoke-test gates (each returns planned_and_finished, 0 changes):
- `sv0-dev` plan against empty `envs/dev/` ✅ (proven 2026-05-10)
- `sv0-staging` plan against empty `envs/staging/` (added by Phase 3b prep PR)
- `sv0-prod` plan against empty `envs/prod/` (Phase 3a sanity check)
- `sv0-shared` plan against empty `envs/shared/` Azure portion (Phase 3a sanity check)
Phase 3a — Shared network substrate (no compute)
Goal: stand up everything compute will need, in one workspace, before any VM exists.
Deliverables (all in sv0-shared workspace, envs/shared/):
- VNet `sv0-vnet` 10.0.0.0/16 in westeurope.
- Subnets: `snet-staging` 10.0.0.0/24, `snet-prod` 10.0.1.0/24, `snet-dev` 10.0.2.0/24 (reserved for Phase 3f), `snet-pr-previews` 10.0.3.0/24, `snet-shared` 10.0.4.0/24 (reserved).
- NSG per subnet: no inbound (Cloudflare Tunnel is ingress), all outbound.
- Single zonal Standard NAT Gateway in zone 1 with static public IP, attached to all five subnets. The egress IP is captured for the Atlas allowlist update in Phase 3c. Cost is ~$35/mo total, shared across all environments; see ADR §5b "Cost amortization" for the per-VM breakdown.
- Compute Gallery image definition for Ubuntu 24.04 LTS + Docker + cloudflared + Alloy + secrets-fetcher + image-watcher. First image version baked + replicated to westeurope.
- Recovery Services Vault (used by Phase 3c for prod snapshots).
- Key Vaults `kv-sv0-staging`, `kv-sv0-prod`, `kv-sv0-dev` (last one empty until Phase 3f), with Managed-Identity-scoped access policies and per-environment secrets pre-populated per ADR §9.
- Entra IdP federated into Cloudflare Access (~30 min one-time, Cloudflare-side prerequisite; see ADR §5c and §5c.1). Requires an Azure App Registration ("Cloudflare Access" or similar), with the App's client ID + secret + tenant ID configured in Cloudflare Zero Trust → Settings → Authentication. Without this, the following surfaces fall back to GitHub IdP, and the "Entra MFA required" claim in §5c is unenforceable: (a) all tier-1 SSH to Azure VMs (forced, since there is no app behind SSH; ADR §5c.1 row "Tier-1 SSH"); (b) the two-door URLs: `dev-azure.securityv0.com`, `dev.securityv0.com` after Phase 3f, all `pr-N-dev.securityv0.com` PR previews. NOT needed for `staging.securityv0.com` (one-door, WorkOS only; ADR §5c.1). The spike PR (sv0-infrastructure#26) surfaced this prerequisite; the auth-simplification owner confirmed the Entra-direct path 2026-05-11.
- Two-person approval policy enforced on `sv0-shared` workspace.
End state: sv0-shared plan applies cleanly; the NAT egress IP, Compute Gallery image ID, and Key Vault URIs are available as workspace outputs for downstream workspaces to consume via terraform_remote_state.
Phase 3b — Staging environment on Azure
Goal: validate the entire prod design end-to-end on a single cheap VM before prod compute is touched. Cut over staging.securityv0.com to it. Default Mongo runs in a container on the same VM; an env-var switch points at the prod Atlas sv0_staging database when end-to-end Atlas validation is needed.
Deliverables (all in sv0-staging workspace, envs/staging/):
- Rename `envs/staging-ephemeral/` → `envs/staging/`. Existing Atlas-drill code moves with the directory; its `atlas_drill_enabled` switch (renamed from `staging_enabled` to disambiguate) stays default-off.
- Ordering matters: update the TFC workspace's Working Directory setting (`sv0-staging` → `envs/staging`) via the TFC API BEFORE the rename PR merges. The `sv0-staging` workspace is VCS-driven, so it auto-triggers a plan on every commit to `main`. If the rename merges first, every push will fail-plan with "directory not found" until the Working Directory setting catches up. Procedure: (1) `curl -X PATCH ...workspaces/<ws_id>` to set `working-directory: envs/staging` (yes, ahead of the merge: the directory doesn't exist yet on `main`, but the setting is just a string), (2) merge the rename PR, (3) the next push triggers a plan from the new directory.
- New `staging_compute_enabled` master switch (default `false`, meaning zero compute resources when off). When `true`, the following are provisioned in `rg-sv0-staging`:
  - 1 Azure VM (`Standard_B2s`, zone 1) built from the Compute Gallery image. OS disk uses `delete_os_disk_on_deletion = true` (the OS is re-creatable from IaC + Compute Gallery).
  - Separate `azurerm_managed_disk` for Mongo data: 64 GB Premium SSD, lifecycle-detached from the VM (`prevent_destroy = true`). Mounted at `/var/lib/mongo` via cloud-init. This is the actual "OS disk persists" mechanism: the OS disk is throwaway, but the Mongo data disk survives `staging_compute_enabled = false → true` cycles. Cost when staging is off: ~$10/mo (64 GB Premium SSD), independent of compute state.
  - Managed Identity with `get` on `kv-sv0-staging` secrets.
  - Cloudflare Tunnel `sv0-staging` (single replica), DNS CNAME `staging.securityv0.com`. NO Cloudflare Access app on the URL: staging mirrors prod's one-door posture (WorkOS hosted login is the only gate). Per ADR §5c.1: staging is supposed to validate prod's auth shape end-to-end, so it gets the same one-door shape as prod. WorkOS auth happens inside the app, against the same Connect apps as prod.
  - Cloudflare Access SSH app for the staging VM only (separate from the URL). `cloudflared access ssh --hostname staging-vm.<team>.cloudflareaccess.com` → Entra IdP → short-lived cert → sshd. This is the tier-1 path per ADR §5c; required regardless of the URL's door count because SSH has no app behind it.
  - cloud-init runs Docker Compose with two services: `api` (+ `ui`) and `mongo` (community container, data on the mounted data disk above).
- Mongo data disk re-attach procedure: when `staging_compute_enabled` flips `false → true`, the new VM mounts the existing data disk at `/var/lib/mongo`. If the previous VM shut down uncleanly, WiredTiger's lock file (`/var/lib/mongo/mongod.lock`) will block startup. Cloud-init runs `rm -f /var/lib/mongo/mongod.lock` only after verifying the disk was attached cleanly (no in-flight writes, checked via Azure's disk-state status). Documented as a deliberate step, not a silent rm: WiredTiger's recovery journal handles the rest.
- `MONGODB_URI` selector wired as a runtime env var: `MONGODB_URI=mongodb://localhost:27017/sv0_staging` by default; flip to the prod Atlas connection string via a workspace variable to run Atlas-mode E2E tests. No code rebuild on flip, but the app code may still need changes. Atlas connection strings carry `+srv` DNS seeding, `tls=true`, `authSource=admin`, and longer `serverSelectionTimeoutMS` to tolerate replica-set discovery. If the platform code hardcodes `directConnection: true`, a local CA bundle path, or a timeout shorter than Atlas's failover window, the flip will fail at first use. The Phase 3b PR audits `sv0-platform` for these assumptions and adjusts code if needed; see validation gate "Atlas-mode flip transparent" below.
- Pre-create the `sv0_staging` database on the prod Atlas cluster (zero cost on M10, no separate Atlas project needed). Add Atlas user `sv0_staging` scoped to that database only. Store credentials in `kv-sv0-staging`.
- Cloudflare Access policy on `staging.securityv0.com`: Entra IdP federated, MFA required, same named-humans-only restriction as prod.
- Validation gates (each is a runbook step recorded in the PR description):
  - Pull-deploy: push `:staging` tag → image-watcher rolls within 30 s.
  - Secrets-via-KV: app reads `WORKOS_API_KEY` etc. from `kv-sv0-staging` via Managed Identity, no plaintext on disk.
  - Cloudflare Tunnel: registers + serves traffic from `staging.securityv0.com`. Confirm CF Universal SSL issues a cert (single-level subdomain, covered per repo memory; `*.staging.` would NOT be covered, so no nested subdomains here).
  - One-door posture confirmed: anonymous `curl https://staging.securityv0.com/` returns the WorkOS hosted login page (or a redirect to it), NOT a Cloudflare Access challenge. If the response is HTML containing `cloudflareaccess.com`, the staging URL has a CF Access app that shouldn't be there; remove it.
  - Atlas-mode flip transparent: flip `MONGODB_URI` workspace variable to the prod Atlas SRV string (`mongodb+srv://...sv0_staging`), apply, verify app reconnects WITHOUT code change or container rebuild. If the flip fails because the app hardcodes Mongo locality (`directConnection: true`, short `serverSelectionTimeoutMS`, missing TLS support), record the exact code change required in the PR and either land it in this PR or file a follow-up.
  - Cloudflare Access SSH: human can `cloudflared access ssh --hostname <vm>...` and land a session.
  - Azure Serial Console: tier-2 emergency drill on the staging VM produces a shell prompt.
  - Alloy log + metrics shipping: confirm `cloudflared`, `sshd`, app containers, `cloud-init` all visible in Grafana Cloud Loki + Prom.
  - Pull-deploy is the only mutation path: no SSH-push from CI anywhere in the staging deploy pipeline.
  - Disk-full alerting: fill the OS disk to >80% with `fallocate -l 50G /tmp/fill && du -sh /tmp/fill` → confirm Grafana alert fires within the alert window. The data disk is alerted separately at the same threshold. Recovery: `rm /tmp/fill`; alert clears.
  - Mongo unclean-shutdown durability: `docker kill -s SIGKILL sv0-mongo && docker start sv0-mongo` → confirm WiredTiger recovers, no data corruption (run a count query on a known collection before and after). This is the simulator for OOM-kill, power-loss, and unclean-VM-restart events that the on-disk Mongo will face in real operation.
- Idle-mode cost discipline: when staging is not actively in use, set `staging_compute_enabled = false` and apply. The VM and its throwaway OS disk are destroyed; the separate Mongo data disk persists (~$10/mo), so a re-apply 7 days later remounts the same data. Because the Mongo data survives, a quick smoke test after re-apply is faster than a cold start.
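For the Working Directory ordering step in the deliverables above, the PATCH can be scripted rather than typed by hand. The sketch below only builds the request and does not send it. It assumes the hosted Terraform Cloud API at `app.terraform.io` and uses a placeholder workspace ID and token, both hypothetical; the payload follows the JSON:API shape the workspaces endpoint expects.

```python
import json
import urllib.request


def working_directory_patch(workspace_id, working_directory, token):
    """Build (but do not send) the request that updates a workspace's working directory."""
    body = {
        "data": {
            "type": "workspaces",
            "attributes": {"working-directory": working_directory},
        }
    }
    return urllib.request.Request(
        url=f"https://app.terraform.io/api/v2/workspaces/{workspace_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/vnd.api+json",
        },
    )


# Placeholder values; the real workspace ID and token come from TFC.
req = working_directory_patch("ws-EXAMPLE", "envs/staging", "REDACTED")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted so the sketch has no side effects.
```

Running this before the rename PR merges matches step (1) of the procedure: the setting is just a string, so it can point at a directory that does not exist on `main` yet.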
End state: staging.securityv0.com live and validated end-to-end. Every concern that prod will need is exercised. The "what does it mean to validate staging works" question is answered by the validation gates above — each is checked off in the Phase 3b PR before Phase 3c is filed.
Phase 3c — Prod fleet (1 VM → 2 VMs → DNS cut)
Goal: Replicate the staging pattern with two zone-spread VMs, run a failover drill on app-staging.securityv0.com, then cut app.securityv0.com from Hetzner DNS to the Cloudflare Tunnel.
Step 3c.1 — Provision prod VM 1 alongside Hetzner:
- `sv0-prod` workspace (Azure portion): 1 prod VM (`Standard_B2s`, zone 1), Managed Identity with `get` on `kv-sv0-prod`, cloud-init same as staging but without the local Mongo container (prod always uses Atlas).
- Cloudflare Tunnel `sv0-prod` (1 replica initially), DNS CNAME `app-staging.securityv0.com` → Tunnel. Do not cut `app.securityv0.com` yet; traffic still flows through Hetzner.
- Two-person approval enforced on `sv0-prod` workspace.
- Same validation gates as staging, but pointed at the prod stack.
Step 3c.2 — Add prod VM 2, run failover drill, cut DNS:
- Provision second prod VM (zone 2). Same image, same cloud-init; the second cloudflared replica registers to the same `sv0-prod` Tunnel.
- Failover drill (before DNS cut): with both replicas registered and traffic flowing through `app-staging`, stop `cloudflared` on VM 1. Measure the time between replica-1 disappearing and traffic resuming on replica 2 (Grafana Cloud k6 or external curl loop). Record P50/P95/max in the PR. If P95 failover exceeds 30 s, halt the cutover and revisit ADR §5a (escalate to Cloudflare Load Balancer).
- Cut `app.securityv0.com` DNS from Hetzner public IP → Cloudflare Tunnel CNAME. DNS TTL is 60 s; production cutover completes within ~2 minutes.
- Atlas IP allowlist updated to NAT Gateway egress IP. Closes `sv0-infrastructure#11`.
- `lifecycle.prevent_destroy = true` set on all prod VM/NIC/disk/tunnel resources.
End state: prod fully on Azure with two zone-spread VMs and measured failover. Hetzner prod VM idle but still reachable for break-glass.
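The failover drill's go/no-go rule can be written down so the PR records it the same way each time. This is a minimal sketch using the nearest-rank percentile method, which is one reasonable choice for a small number of drill runs and not something the ADR prescribes. The sample timings in the usage lines are invented for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def drill_verdict(failover_seconds, p95_limit=30.0):
    """Summarise a drill and say whether the cutover may proceed (P95 within the limit)."""
    p95 = percentile(failover_seconds, 95)
    return {
        "p50": percentile(failover_seconds, 50),
        "p95": p95,
        "max": max(failover_seconds),
        "proceed": p95 <= p95_limit,
    }


if __name__ == "__main__":
    # Hypothetical measurements in seconds, not real drill data.
    print(drill_verdict([4, 5, 6, 7, 8, 9, 10, 12, 15, 31]))
```

With only a handful of runs, nearest-rank P95 is effectively the worst or second-worst sample, so a single slow failover is enough to halt the cutover; that is conservative by design.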
Phase 3d — PR-preview lifecycle automation
Five workflows replace the single provision-preview.yml from the earlier draft:
- `preview-create.yml` (on `pull_request: opened, reopened`): idempotent `az group create` + `az vm create` + Cloudflare API tunnel/DNS create, all tagged `sv0:pr-preview=N`. Skips Azure provisioning if the RG already exists.
- `preview-deploy.yml` (on `pull_request: synchronize`): builds and pushes `:pr-N` to GHCR. Does NOT touch Azure or Cloudflare. The PR's VM image-watcher rolls itself.
- `preview-destroy.yml` (on `pull_request: closed`): `az group delete --yes --no-wait`, then Cloudflare API delete tunnel + DNS, all idempotent and individually logged.
- `preview-reaper.yml` (scheduled daily): same destroy logic for any PR-preview RG idle >7 days.
- `preview-reconcile.yml` (scheduled daily): cross-check Cloudflare resources tagged `sv0:pr-preview=*` against open PRs + live Azure RGs; reap orphans. Alerts on any orphan that survives one cycle.
- 10-VM concurrency cap enforced in `preview-create.yml` with PR-comment fallback (ADR §6).
End state: PR previews each get their own VM. Zero dev-VM contention. Cross-cloud cleanup is closed-loop.
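The cap, reaper, and reconcile rules are simple set and date comparisons. The sketch below shows one possible reading of them, with two assumptions flagged: "idle" is taken to mean time since a last-activity timestamp the workflow would have to record, and an orphan is taken to be a Cloudflare-tagged preview with neither an open PR nor a live Azure RG. All function names and PR numbers are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_PREVIEWS = 10               # concurrency cap from the plan
IDLE_LIMIT = timedelta(days=7)  # reaper threshold from the plan


def can_create(live_preview_prs):
    """True when another preview VM fits under the concurrency cap."""
    return len(live_preview_prs) < MAX_PREVIEWS


def idle_previews(last_activity, now):
    """PR numbers whose preview RG has been idle longer than IDLE_LIMIT."""
    return sorted(pr for pr, seen in last_activity.items() if now - seen > IDLE_LIMIT)


def orphaned_tunnels(cloudflare_prs, open_prs, azure_rg_prs):
    """Cloudflare-tagged previews with no open PR and no live Azure RG (assumed definition)."""
    return sorted(set(cloudflare_prs) - set(open_prs) - set(azure_rg_prs))


if __name__ == "__main__":
    now = datetime(2026, 5, 20, tzinfo=timezone.utc)
    activity = {101: now - timedelta(days=8), 102: now - timedelta(days=1)}
    print("reap:", idle_previews(activity, now))
    print("orphans:", orphaned_tunnels({101, 102, 103}, {102}, {101, 102}))
```

In the workflows, the inputs would come from the Azure and Cloudflare APIs and the GitHub PR list; keeping the comparison logic separate from those calls makes the reaper and reconcile behaviour testable without cloud access.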
Phase 3e — Hetzner decommission
The dev VM is still on Hetzner at this point — Phase 3f is deferred. Decommission focuses on the prod Hetzner VM; the dev Hetzner VM stays until 3f ships.
- Confirm 7+ days of zero traffic on the prod Hetzner public IP (Cloudflare logs).
- Snapshot the prod Hetzner VM to a local archive (forensic + audit-trail).
- Power off, retain snapshot 30 days, then destroy.
- Update `docs/deploy/deployment.md` prod section to point at the Azure runbook.
- Hetzner dev VM stays until Phase 3f ships; Hetzner billing reduces but doesn't go to zero yet.
End state: prod on Azure, dev on Hetzner. Single prod substrate. Epic sv0-platform#550 not fully closed until 3f lands.
Phase 3f-DEFERRED — Dev VM pool
Status: Deferred. Primary triggers (any one fires this phase):
- (a) Hetzner dev box becomes a real bottleneck again — OOM kills, disk-full, or PR-preview concurrency limits. Most likely trigger. Estimated arrival: weeks after Phase 3d ships and PR previews land on Azure, because PR previews leaving the dev box frees it up considerably.
- (b) A second engineer joins and needs an isolated dev environment — current single-developer setup is fine on Hetzner; second engineer breaks that.
- (c) Cost or security review concludes the Hetzner dev box should retire, e.g., the `docker`-group root-equivalence acceptance from ADR-018 hits its time-bound expiry, or Hetzner pricing changes.
Long-tail re-review trigger: if 6+ months pass after Phase 3c without any of (a)–(c) firing, this phase is re-evaluated against current reality — the deferral may have become a "never" and the carve-out (sv0_dev on prod Atlas) may need to be formalized in ADR-020 rather than treated as transitional. Not an automatic trigger; explicit decision.
Goal: retire the Hetzner dev VM. Move dev to one or more long-running Azure VMs with simple Cloudflare-Access SSH for engineers, same colocated-Mongo-container pattern as staging and as today's Hetzner setup.
Design choices that lock in at the time:
- Mongo runs in a container on each dev VM, not on Atlas. Confirmed by Ivan 2026-05-10: dev VMs use community Mongo only, same Docker pattern they already run. The transitional `sv0_dev` carve-out on the prod Atlas cluster (ADR-020 Phase 0) retires when this phase ships.
- One dev VM vs. one VM per engineer: TBD at 3f time. Lean toward shared dev VM until the team is >2 engineers; per-engineer VMs are the upgrade path.
- Auto-apply: on for the `sv0-dev` workspace (matches the Hetzner cadence, where every `main` merge auto-deploys to dev).
Deliverables (all in sv0-dev workspace, envs/dev/):
- 1 long-running Azure VM (`Standard_B2s`, zone 1) built from the Compute Gallery image. (Or N VMs if per-engineer is chosen at 3f time.)
- Managed Identity with `get` on `kv-sv0-dev` secrets.
- Cloudflare Tunnel `sv0-dev` (single replica), DNS CNAME `dev.securityv0.com`.
- Same Docker Compose pattern as staging: `api` + `ui` + `mongo` (community container with persistent volume).
- Update `deploy-dev.yml` to publish to GHCR only, with no SSH push. The dev VM's image-watcher rolls to the new image.
- Decommission Hetzner dev VM after 7+ days of zero traffic.
End state: dev on Azure, no Hetzner. Epic sv0-platform#550 closes.
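The deliverables rely on an image-watcher that "rolls to the new image" once `deploy-dev.yml` publishes to GHCR. As a rough illustration of the decision such a watcher makes on each poll, here is a Python sketch of the logic only. It assumes the watcher compares image digests; `WatcherState`, `poll_once`, `fetch_digest`, and `roll` are hypothetical names, and the real implementation is still an open question in this runbook.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WatcherState:
    """Digest of the image the VM currently runs (None before the first roll)."""
    running_digest: Optional[str] = None


def poll_once(
    state: WatcherState,
    fetch_digest: Callable[[], Optional[str]],
    roll: Callable[[str], None],
) -> bool:
    """Run one polling cycle and return True if a roll happened.

    fetch_digest and roll are hypothetical hooks: the first would ask the
    registry which digest the deploy tag points at, the second would pull
    that image and restart the Compose stack.
    """
    latest = fetch_digest()
    if latest is None:
        # Registry unreachable or tag missing: keep running what we have.
        return False
    if latest == state.running_digest:
        # Nothing new was published since the last roll.
        return False
    roll(latest)
    # Record the digest only after roll() returned without raising.
    state.running_digest = latest
    return True


if __name__ == "__main__":
    rolled = []
    state = WatcherState()
    poll_once(state, lambda: "sha256:aaa", rolled.append)  # first poll rolls
    poll_once(state, lambda: "sha256:aaa", rolled.append)  # unchanged, no-op
    print(rolled)
```

Recording the digest after `roll` returns means a failed roll is retried on the next poll instead of being silently skipped.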
Sub-issues to file in sv0-infrastructure
Filed once this PR merges, each linked back to sv0-infrastructure#18 (this work) and sv0-platform#550 (parent epic):
- feat(shared): VNet + NAT Gateway + Compute Gallery + Key Vaults in westeurope (Phase 3a)
- feat(staging): rename envs/staging-ephemeral → envs/staging + Azure compute scaffolding (Phase 3b prep)
- feat(staging): provision staging VM + colocated Mongo + cloudflared + validation gates (Phase 3b)
- feat(prod): provision prod VM 1 on app-staging.securityv0.com (Phase 3c.1)
- feat(prod): add prod VM 2, failover drill, cut app.securityv0.com DNS to Tunnel (Phase 3c.2)
- feat(pr-previews): GHA workflows for ephemeral per-PR VM lifecycle (Phase 3d)
- chore(decommission): power off + snapshot + destroy Hetzner prod VM (Phase 3e)
- feat(dev): provision dev VM(s) + decommission Hetzner dev (Phase 3f, deferred)
- tighten Atlas IP allowlist to NAT Gateway egress IPs (closes #11), fired during Phase 3c.2
The Alloy + Atlas-scrape sub-issues (sv0-platform#764, sv0-infrastructure#16) are unblocked by Phase 3a and proceed independently.
Open implementation questions (to resolve in Phase 3a PR review)
These are deliberately deferred from the ADR because they're implementation details that don't change the topology:
- Caddy or nginx in container? Host-side Caddy is dropped (cloudflared terminates ingress). Each VM keeps an in-container reverse proxy for the SPA + API split. Lean: keep the existing nginx container from the Hetzner stack unchanged — same image, same config. Decide-and-document in the Phase 3a image PR.
- Pre-baked Compute Gallery image vs. cloud-init from base Ubuntu, per tier. Pre-baked = faster PR-preview spin-up (~30s vs ~3min); cloud-init = more transparent. Lean: pre-baked for PR-preview tier (latency matters per ADR §6 measurement), cloud-init from the prebaked base for prod/dev (transparency on rare provisioning events). Image versioning + rotation cadence to be defined.
- Recovery Services Vault retention. ADR says 30 days. Does compliance want longer? Confirm with Sergey before Phase 3a apply. Cost impact is bounded — every additional 30 days adds ~$5/mo per protected VM.
- Annual Azure budget threshold. ADR Negative section commits to budget alerts at 50% / 75% / 90% of $5,000/yr (the Microsoft for Startups credit allocation), with a 100%-budget action group that disables the `tfc-sv0-infrastructure` Azure AD app. Confirm $5,000 figure with the latest credit-allocation snapshot before Phase 3a apply.
- Monitor agent vs. Alloy for VM-level metrics. Alloy is going on every VM anyway (#764). Azure Monitor agent duplicates the role. Lean: Alloy only, no Azure Monitor agent. Trade-off: Azure Backup + Defender for Cloud features that depend on Monitor agent are unavailable; if compliance wants those, install Monitor agent alongside Alloy in a follow-up.
- PR-preview VM sizing. ADR §4 currently defaults to `B2s` for PR previews (same as dev) until Phase 3d measures actual RSS. The follow-up PR that flips the default to `B1ms` is gated on data, not on schedule.
- Image-watcher implementation. Lean: a small Go binary + systemd unit, ~150 LOC, polling a pointer doc in GHCR's manifest list. Alternative is `watchtower` (off-the-shelf but adds a moving part). Decide-and-document in the Phase 3a image PR.
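The budget and backup-retention questions above reduce to simple arithmetic. The Python sketch below works through it, assuming the $5,000/yr figure and the ~$5/mo per additional 30 days estimate hold (both are still to be confirmed, as noted in the list). The function names are illustrative, and the retention cost is assumed to scale linearly.

```python
from decimal import Decimal

ANNUAL_BUDGET_USD = Decimal("5000")  # ADR figure, pending confirmation
ALERT_FRACTIONS = (Decimal("0.50"), Decimal("0.75"), Decimal("0.90"))


def alert_thresholds(budget: Decimal = ANNUAL_BUDGET_USD) -> list:
    """Dollar amounts at which the 50% / 75% / 90% budget alerts would fire."""
    return [budget * fraction for fraction in ALERT_FRACTIONS]


def extra_backup_cost_per_month(
    retention_days: int,
    protected_vms: int,
    base_days: int = 30,
    usd_per_extra_30_days: Decimal = Decimal("5"),
) -> Decimal:
    """Added monthly cost of retaining backups longer than the 30-day baseline.

    Assumes the cost scales linearly with the extra retention period.
    """
    extra_days = max(0, retention_days - base_days)
    return Decimal(extra_days) / Decimal(30) * usd_per_extra_30_days * protected_vms


if __name__ == "__main__":
    print([str(t) for t in alert_thresholds()])
    # 90-day retention on two protected VMs, relative to the 30-day baseline.
    print(extra_backup_cost_per_month(retention_days=90, protected_vms=2))
```

With these inputs the alerts land at $2,500, $3,750, and $4,500, and moving two VMs from 30-day to 90-day retention adds roughly $20/mo.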
Break-glass
Three break-glass scenarios, in increasing order of severity. Each has an executable procedure with specific commands.
Note on tier-2 human emergency access (Azure Serial Console). Scenarios A and B below assume Cloudflare Access SSH is the way to reach a VM. If Access SSH itself is broken (cert path wedged, sshd hung) but the VM is otherwise healthy, try Azure Serial Console first — az serial-console connect --name <vm> -g <rg> — before escalating to public-IP failback or break-glass SP. Serial Console hits the hypervisor serial port and is independent of in-VM networking and sshd state. Authentication is via Entra ID + the sv0-vm-emergency-ops group (Ivan + Sergey).
Scenario A — Cloudflare Tunnel down, Azure healthy, TFC healthy (Phase 3a–3d, Hetzner still alive)
Symptoms: app.securityv0.com 5xxs from Cloudflare edge; Azure VMs themselves look healthy in Grafana; Hetzner VMs idle but reachable.
- In Cloudflare DNS dashboard, change `app.securityv0.com` from CNAME-to-Tunnel back to A → `178.156.245.75` (Hetzner prod).
- DNS TTL is 60s; traffic returns to Hetzner within ~1 minute.
- If Hetzner stack is stale (last deploy >24h old), SSH in and run `cd ~/sv0-platform && docker compose -f docker-compose.deploy.yml pull && docker compose -f docker-compose.deploy.yml up -d` to refresh from latest GHCR `sha-<commit>` tag.
- Open incident ticket; root-cause Cloudflare side.
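The staleness test in the refresh step is a plain time comparison against the 24-hour threshold. A minimal Python sketch, assuming the operator can obtain the last deploy time as a timezone-aware timestamp (how it is read from the host is not specified here, and `stack_is_stale` is a hypothetical helper):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(hours=24)  # threshold from the Scenario A steps


def stack_is_stale(last_deploy: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when the last deploy is more than 24 hours old.

    Both timestamps must be timezone-aware so the subtraction is well defined.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_deploy > STALE_AFTER


if __name__ == "__main__":
    now = datetime(2026, 5, 10, 12, 0, tzinfo=timezone.utc)
    print(stack_is_stale(now - timedelta(hours=30), now))  # older than the threshold
    print(stack_is_stale(now - timedelta(hours=2), now))   # recent deploy
```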
Scenario B — Cloudflare Tunnel down, Azure healthy, TFC healthy (post-Phase-3e, Hetzner gone)
Symptoms: same as A but no Hetzner to fall back to.
- In TFC, open `sv0-prod` workspace. Set workspace variable `var.expose_public_ip_emergency = "true"`.
- Run plan + apply. The compute module's preconditions allow this only when `var.expose_public_ip_emergency == "true"`. Apply adds: one Standard SKU public IP per prod VM, NSG inbound rule on 443 restricted to Cloudflare's published egress IP ranges (ASN 13335), and a minimal Caddy systemd unit on each VM that's preinstalled-but-disabled in the Compute Gallery image.
- In Cloudflare DNS, switch `app.securityv0.com` from Tunnel CNAME to A records pointing at the new Public IPs.
- Verify request path: `curl -I https://app.securityv0.com/health`.
- Once Cloudflare Tunnel is healthy again, set `var.expose_public_ip_emergency = "false"` and apply to undo.
Total time: ~20 minutes once the operator is at the keyboard. Depends only on TFC, not on Cloudflare Tunnel.
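The emergency NSG rule admits port-443 traffic only from Cloudflare's published ranges. The Python sketch below shows the membership test that rule expresses, which may help when sanity-checking a source address seen in logs during the drill. The two CIDRs are an illustrative subset only, `source_allowed` is a hypothetical helper, and the authoritative list is whatever Cloudflare publishes at the time of the apply.

```python
import ipaddress
from typing import Iterable

# Illustrative subset only; always take the full list from Cloudflare's
# published ranges rather than from this sketch.
EXAMPLE_ALLOWED_CIDRS = ("173.245.48.0/20", "103.21.244.0/22")


def source_allowed(source_ip: str, allowed_cidrs: Iterable[str] = EXAMPLE_ALLOWED_CIDRS) -> bool:
    """Return True if source_ip falls inside any of the allowed CIDR blocks."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


if __name__ == "__main__":
    print(source_allowed("173.245.48.1"))  # inside the first example block
    print(source_allowed("203.0.113.7"))   # documentation range, not allowed
```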
Scenario C — TFC unreachable AND Azure compute needs intervention (most severe)
Symptoms: TFC dashboard down, plan/apply blocked, prod VMs need attention (e.g., one is in stop-deallocated state or NSG needs an emergency rule).
The earlier draft simply said "terraform apply" — which fails in the exact scenario being described. Concrete procedure:
- Pull `sv0-azure-break-glass` from 1Password `sv0-infra` vault. RBAC scope (set in bootstrap): Contributor on `rg-sv0-prod`, Reader on `rg-sv0-shared`, Backup Contributor on `rg-sv0-prod`, Storage Blob Data Reader on the state-backup storage account. MFA-required.
- `az login --service-principal -u <client_id> --tenant <tenant> --password <secret>` in a clean shell.
- Pull TFC state to local. TFC's API is what's down, but the state file is replicated to Azure Blob Storage as part of the standard TFC workspace settings (configured in Phase 3-bootstrap). The break-glass account has Storage Blob Data Reader on the state container:

      az storage blob download \
        --account-name <state_backup_storage_account> \
        --container-name state \
        --name sv0-prod.tfstate \
        --file /tmp/sv0-prod.tfstate

  (`<state_backup_storage_account>` is recorded in the 1Password break-glass item; it is `sv0tfcstate<hex>` with a random suffix.)
- Local apply with the break-glass credential and the pulled state:

      cd sv0-infrastructure/envs/prod
      export ARM_USE_OIDC=false # use SP creds
      export ARM_CLIENT_ID=<from 1Password>
      export ARM_CLIENT_SECRET=<from 1Password>
      export ARM_TENANT_ID=<...>
      export ARM_SUBSCRIPTION_ID=<...>
      terraform init -backend=false
      terraform apply -state=/tmp/sv0-prod.tfstate -target=<specific resource>

  Use `-target` aggressively: only touch the resource that needs intervention. Do NOT `apply` the full plan; the local state may be slightly stale.
- Manual reconcile after TFC recovers. Once TFC is back, run `terraform refresh` and a full plan; the diff is the manual change made above. Resolve drift by either accepting the manual change (commit the equivalent HCL) or reverting (`terraform apply` from TFC).
- Quarterly drill commitment. This procedure is exercised against a non-prod workspace every quarter. The drill produces a runbook update if any step has gone stale.
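Because the local apply depends on several exported `ARM_*` variables, a preflight check before running Terraform can catch a half-populated shell. The Python sketch below is one way to do that, not part of the documented procedure. `preflight` and `missing_arm_vars` are hypothetical helpers, and the variable names are the ones listed in the local-apply step.

```python
import os
from typing import List, Mapping

# Variables the local-apply step exports for service-principal auth.
REQUIRED_ARM_VARS = (
    "ARM_CLIENT_ID",
    "ARM_CLIENT_SECRET",
    "ARM_TENANT_ID",
    "ARM_SUBSCRIPTION_ID",
)


def missing_arm_vars(env: Mapping[str, str]) -> List[str]:
    """Names of required variables that are unset or blank in env."""
    return [name for name in REQUIRED_ARM_VARS if not env.get(name, "").strip()]


def preflight(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means ready to apply."""
    problems = [f"{name} is not set" for name in missing_arm_vars(env)]
    if env.get("ARM_USE_OIDC", "").strip().lower() == "true":
        problems.append("ARM_USE_OIDC is true; the break-glass path uses SP creds")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ):
        print(problem)
```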
Why state replication to Blob Storage is part of break-glass design
TFC's state is canonical when TFC is up; when TFC is down the operator needs some version of state. The TFC "Remote State Backup" feature (configured in Phase 3-bootstrap) writes a copy of every applied state to a Blob container in the state-backup storage account (in rg-sv0-bootstrap, ZRS replication, AAD-only auth, public network access locked to TFC's published egress ranges plus operator IPs). This is documented as part of the break-glass-readiness check in the Verification checklist below.
Verification checklist for each phase
Each phase ends with the following all-green:
- `terraform plan` empty against the relevant workspace
- `cloudflared --version` matches pinned baseline on each VM; systemd unit `Restart=always`, `Active: active (running)`
- Tunnel replica count matches expected for that phase: 0 in 3a (no compute), 1 staging replica in 3b, 1 prod replica in 3c.1, 2 prod replicas in 3c.2 onward, N PR-preview replicas in 3d
- Cloudflare Access SSH login works for a named human (browser flow → short-lived cert → sshd accepts); tier-1 human path
- Azure Serial Console reachable (`az serial-console connect --name <vm> -g <rg>` returns a prompt); tier-2 human emergency path, tested per VM at provisioning
- `sv0-vm-emergency-ops` Entra group exists and has the custom `sv0-serial-console-operator` role on `rg-sv0-prod` + `rg-sv0-dev`; membership matches the policy (Ivan + Sergey only)
- Cloudflare Access logs flowing to Loki; Activity Log entries for `Microsoft.SerialConsole/serialPorts/connect/action` appear in Loki when the drill runs
- Image-watcher rolls a new GHCR tag within 30s of publish (smoke test from each phase)
- Atlas IP allowlist accepts NAT Gateway egress IP only (Phase 3b+)
- Grafana Cloud Prom shows VM up, Alloy `up == 1`, `cloudflared_tunnel_active_streams > 0`
- Grafana Cloud Loki receives logs from `cloudflared`, `sshd`, app containers, `cloud-init`
- Azure Backup has a successful daily snapshot recorded for prod VMs (Phase 3b+)
- No public IP on any VM in `rg-sv0-prod` / `rg-sv0-dev` (assertion: `az vm list-ip-addresses --query "[].virtualMachine.network.publicIpAddresses" -o tsv` returns empty for non-NAT IPs)
- `lifecycle.prevent_destroy = true` on all prod VM/NIC/disk resources (Phase 3b+)
- Two-person approval policy enforced on `sv0-prod` and `sv0-shared` workspaces
- Break-glass readiness (Phase 3a+): TFC state replicated to the state-backup blob container; `sv0-azure-break-glass` SP RBAC verified via `az role assignment list --assignee <sp_id>`
- Failover drill recorded in the Phase 3b PR with P50/P95/max measurements
- Cost dashboard shows actuals within 20% of estimate for the prior 7 days
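A few of the checklist items are mechanical enough to express as predicates. The Python sketch below covers three of them under stated assumptions: the replica table copies the counts given in the checklist (PR-preview counts are omitted because N varies), "within 20%" is read as a two-sided tolerance, and the public-IP check treats any non-blank output from the `az vm list-ip-addresses` query as a failure. All function names are hypothetical.

```python
# Tunnel replica counts stated in the checklist, keyed by phase.
# 3b counts the staging replica; 3c.1 and 3c.2 count prod replicas.
EXPECTED_TUNNEL_REPLICAS = {"3a": 0, "3b": 1, "3c.1": 1, "3c.2": 2}


def replicas_ok(phase: str, observed: int) -> bool:
    """True when the observed replica count matches the phase's expectation."""
    return EXPECTED_TUNNEL_REPLICAS[phase] == observed


def cost_within_tolerance(actual: float, estimate: float, tolerance: float = 0.20) -> bool:
    """True when actual spend is within +/- tolerance of the estimate."""
    if estimate <= 0:
        raise ValueError("estimate must be positive")
    return abs(actual - estimate) <= tolerance * estimate


def no_public_ips(tsv_output: str) -> bool:
    """True when the az query printed nothing, i.e. no VM has a public IP."""
    return tsv_output.strip() == ""


if __name__ == "__main__":
    print(replicas_ok("3c.2", 2))
    print(cost_within_tolerance(actual=110.0, estimate=100.0))
    print(no_public_ips(""))
```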
Related runbooks
- 04 — Git Workflow, Branching, and Worktrees — branch + PR discipline for the Phase 3 sub-issues
- 11 — IaC Drift and Emergency Changes — drift workflow extends to the new workspaces
- Agent and M2M Authentication — for any service running on the VMs that needs an identity