
Azure VM Landing Zone — Implementation Plan

Companion to: ADR-022

This runbook holds the concrete plan for the Hetzner → Azure migration: current-state inventory, target topology, sequencing, sub-issues to file, and the cutover playbook.

The ADR holds the durable decisions; this runbook holds the work. When a step here is contradicted by a later ADR amendment, the ADR wins.


Current state — Hetzner footprint as of 2026-05-08

VMs

| Environment | Hostname | IP | DNS | Spec | Cost | Notes |
|---|---|---|---|---|---|---|
| Dev | docker-ce-ubuntu-4gb-ash-1 | 178.156.217.150 | dev.securityv0.com, *.dev.securityv0.com | CPX21, 2 vCPU / 4 GB / 80 GB SSD | ~EUR 11/mo | Hosts dev branch + every open PR (pr-N-dev.securityv0.com) on the same box |
| Prod | docker-ce-ubuntu-4gb-ash-2 | 178.156.245.75 | app.securityv0.com | CPX21, 2 vCPU / 4 GB / 80 GB SSD | ~EUR 11/mo | Hosts the main branch app stack |

Both in Hetzner Cloud, Ashburn VA (us-east).

Pain points the diagram makes visible: every PR lands on the same dev box (5+ concurrent → OOM, single disk fills); CI holds long-lived SSH keys to both VMs; every Atlas read crosses the Atlantic.

Workloads on each VM

Caddy (host, :443)
└── reverse_proxy localhost:8080
    └── nginx (Docker, :8080)
        ├── /       → React SPA bundle
        └── /api/*  → api:3000 (Docker)
            └── MONGODB_URI → Atlas EU_WEST_1 (since 2026-05-04)

The mongo container that used to run on each VM has been decommissioned as part of epic #550 phase 2. Containers in scope on each VM today are: api, ui, plus per-PR overlays on the dev VM.

Identity and ingress

  • Public ingress: Cloudflare DNS A record → host public IPv4 → Caddy:443 (TLS terminate, Let's Encrypt) → nginx
  • Cloudflare Access: app.securityv0.com, dev.securityv0.com, *-dev.securityv0.com are all behind Cloudflare Access (Zero Trust). CI/CD uses service tokens (CF_ACCESS_CLIENT_ID / CF_ACCESS_CLIENT_SECRET).
  • Human SSH: ssh deploy@<host> with ~/.ssh/sv0-deploy-prod. Root SSH on prod is gated through Ivan's 1Password SSH agent.
  • CI deploy SSH: GitHub Actions environment secrets DEPLOY_SSH_KEY, DEPLOY_HOST, DEPLOY_HOST_KEY. The deploy user is in the docker group (ADR-018 accepted risk).

CI/CD pipeline

  • ci.yml builds API + UI images, pushes to GHCR with sha-<commit> tag (and pr-N for PRs)
  • deploy-dev.yml triggers on workflow_run after CI; SSHes to dev VM, pulls images, restarts containers
  • deploy-dev-cleanup.yml triggers on PR close; SSHes to dev VM, tears down the per-PR compose project
  • deploy-prod.yml triggers on workflow_dispatch with approval gate

Pain points the migration must address

  1. Dev OOM under PR concurrency — More than ~5 simultaneous PRs cause memory pressure that has materially crashed the dev branch and other PR previews.
  2. Disk-full outage — 2026-04-17, dev VM 75 GB filled with orphan PR-instance directories. Reaper now exists (/home/deploy/scripts/cleanup-instances.sh daily 04:00) but the underlying single-disk-per-tenant model is the bottleneck.
  3. DEPLOY_SSH_KEY rotation toil — ADR-018 acceptance requires this to be rotatable on suspicion of leak; this is a manual workflow in two places (1Password, GHA secret).
  4. docker-group root-equivalence — ADR-018 accepted risk, time-bounded by this migration.
  5. us-east → eu-west request hop — Atlas now lives in EU_WEST_1. Every read crosses the Atlantic, ~80-120 ms of unnecessary latency on every API call that touches Mongo.

Target state — Azure VM landing zone

Region

westeurope (Amsterdam). Co-locates with Atlas EU_WEST_1 and Grafana Cloud Frankfurt.

Subscription

Azure subscription 1, ID 2a25bc41-c1ce-4d04-9cb6-a62deccc3bcc, tenant bcf375ed-e122-4d76-a43d-82c94a3f7e3b. Same subscription used by sv0-connectors/infra/ and sv0-demo-labs. Credits via Microsoft for Startups (allocation tracked in 1Password sv0-infra vault).

Resource groups

Per the 2026-05-09 modest-hybrid amendment (ADR-022 §3), one RG per environment. Bootstrap owns its own RG since it's a permanent local-apply directory.

| RG | Owner | Holds |
|---|---|---|
| rg-sv0-bootstrap | bootstrap/ (local apply) | Federated Azure AD app, break-glass SP, Entra emergency group, custom RBAC role, state-backup storage account |
| rg-sv0-shared | sv0-shared TFC workspace | VNet, subnets, NAT Gateway, NSGs, Compute Gallery, Recovery Services Vault, Key Vaults (kv-sv0-staging, kv-sv0-prod, kv-sv0-dev) |
| rg-sv0-staging | sv0-staging TFC workspace | Phase 3b. 1 Azure VM (compute + colocated Mongo container), Managed Identity, Cloudflare Tunnel sv0-staging. Gated by staging_compute_enabled; default false = zero compute resources. |
| rg-sv0-prod | sv0-prod TFC workspace | Phase 3c. Prod fleet VMs, Managed Identity, prod Tunnel config, lifecycle-protected. |
| rg-sv0-dev | sv0-dev TFC workspace | Phase 3f-DEFERRED. Dev VMs (one or more), Managed Identity, dev Tunnel config. Empty until 3f ships. |
| rg-sv0-pr-previews-pr-N | not Terraformed | Per-PR ephemeral VM, NIC, OS disk, tunnel resource. Created by GHA workflow, deleted as one unit. |

Network

VNet:        sv0-vnet            10.0.0.0/16   westeurope
  ├─ Subnet: snet-staging        10.0.0.0/24
  ├─ Subnet: snet-prod           10.0.1.0/24
  ├─ Subnet: snet-dev            10.0.2.0/24
  ├─ Subnet: snet-pr-previews    10.0.3.0/24
  └─ Subnet: snet-shared         10.0.4.0/24   (reserved; future jumphost or scrape VM)

NAT Gateway (single, zonal):
  natgw-zone1
    sku                  Standard
    zone                 1
    public_ip_addresses  [pip-natgw-zone1]   # static, pinned in Atlas allowlist
    subnets_attached     snet-staging, snet-prod, snet-dev, snet-pr-previews, snet-shared

NSG (one per subnet):
  - No inbound rules except the implicit Azure-managed VNet/AzureLoadBalancer
  - All outbound permitted (NAT Gateway is the egress chokepoint)

Constraint that drives this design: Azure permits at most one NAT Gateway per subnet, and Standard NAT Gateway is a zonal (not zone-redundant) resource. The earlier draft sketched two NAT Gateways "each attached to snet-*" — that is invalid as written. The chosen design uses a single zonal NAT in zone 1, with documented zonal-failure exposure (ADR §5b). Phase-4+ upgrade option: split into per-zone subnet stacks (snet-prod-z1, snet-prod-z2, etc.) each with its own NAT. Not done yet because (a) it doubles NAT cost, (b) it complicates IaC, (c) we are pre-revenue and a multi-hour zonal outage is acceptable.

No Application Gateway. No Azure Load Balancer. No Front Door. No public IPs on VMs.

VMs

| Tier | Count | Size | Zone(s) | OS | Disk | Mongo |
|---|---|---|---|---|---|---|
| Staging | 1 (toggleable) | Standard_B2s (2 vCPU, 4 GB) | 1 | Ubuntu 24.04 LTS Server (Compute Gallery custom image) | 64 GB Premium SSD | Colocated container by default; env-var switch to prod Atlas sv0_staging DB |
| Prod | 2 | Standard_B2s (2 vCPU, 4 GB) | 1 + 2 | Same | 64 GB Premium SSD | Always Atlas sv0_prod |
| Dev (Phase 3f) | 1+ | Standard_B2s (2 vCPU, 4 GB) | 1 | Same | 64 GB Premium SSD | Colocated container only |
| PR preview | 0–10 | Standard_B2s (default; tunable) | 1 | Same | 32 GB Standard SSD | Reuses dev pattern (colocated container) |

PR-preview sizing default is B2s until Phase 3d measures actual idle/load RSS. ADR §4 holds the rationale; downsize to B1ms is a follow-up PR if data supports it.

Staging compute is destroyable. When staging_compute_enabled = false, all staging compute is destroyed, including the throwaway OS disk; the separate Mongo data disk persists (~$10/mo), so re-applying brings up a fresh VM with the existing local Mongo data re-attached. This is the cost-aware default: staging is online only while it's actively in use.

Topology — all environments in one view

Four things this diagram makes load-bearing:

  • Staging stands up first, then prod. Phase 3b ships staging end-to-end on staging.securityv0.com; Phase 3c only starts after every operational concern is exercised on a cheap throwaway environment. Dev (Phase 3f) lands last and is shown dashed/gray here.
  • Mongo lives on the staging VM by default. Staging runs a colocated Mongo container (same Docker pattern as today's Hetzner setup). The dashed line to Atlas represents the MONGODB_URI env-var switch — flip to Atlas-mode only when you need real-cluster validation. No extra Mongo VM. The sv0_staging database on the prod M10 has zero incremental cost.
  • One workspace per environment. Prod, staging, dev each own their own RG; shared resources (VNet, NAT, gallery, KVs, RSV) live in a separate RG owned by sv0-shared. PR previews are GHA-managed, never in TF state.
  • Single zonal NAT is the deliberate trade. Azure forbids multiple NAT Gateways per subnet; Standard NAT is zonal not zone-redundant. A zone-1 outage takes egress with it for all environments. Accepted at our scale (ADR-022 §5b); per-zone subnet split deferred to Phase 4+.

Cost estimate (rough, USD/yr) — point-in-time, not auditable

⚠️ Numbers below are author-built estimates from publicly listed Azure prices for westeurope as of 2026-05-08, not from an Azure Pricing Calculator export. They are budgeting-grade, not billing-grade. Phase 3a apply will attach a calculator export to the implementation PR for actuals.

| Line | Qty | Unit/mo | Annualised |
|---|---|---|---|
| Prod VMs (B2s) | 2 | ~$30 | ~$720 |
| Staging VM (B2s, realistic 10–15% on-duty) | 1 effective × 0.1–0.15 | ~$3–5 avg | ~$45–60 |
| Staging Mongo data disk (Premium SSD, persists across compute toggles) | 1 | ~$10 | ~$120 |
| Staging OS disk (destroyed when compute off, paid only while on) | 1 × 0.1–0.15 | ~$1 avg | ~$12 |
| Dev VMs (B2s), Phase 3f, not yet incurred | 1 | ~$30 | ~$360 (deferred) |
| PR-preview VMs (B2s default, average 5 active × ~30% lifetime) | ~5 effective | ~$30 | ~$540 |
| NAT Gateway (single, zonal), fixed + ~50 GB egress/mo | 1 | ~$40 | ~$480 |
| Recovery Services Vault (prod-only, daily, 30d retention) | 2 protected | ~$10 | ~$240 |
| Premium / Standard SSD storage (OS disks; Mongo data disk counted separately above) | ~7 active | ~$10 | ~$840 |
| Compute Gallery image storage (1 image, 1 region replica) | 1 | ~$5 | ~$60 |
| Key Vault (staging + prod + dev, ~15K ops/mo) | 3 | ~$1 | ~$36 |
| Public IP on NAT Gateway (static) | 1 | ~$3 | ~$36 |
| Logs / metrics egress to Grafana Cloud (Loki + Prom) | | | ~$200 |
| Subtotal (pre-Phase 3f) | | | ~$3,300/yr |
| Subtotal (post-Phase 3f, dev VM added) | | | ~$3,700/yr |

Staging cost note — realistic usage pattern: staging is on for 2–5 days per active testing period (e.g. validating a release before the prod deploy or a quarterly drill), then off. At ~4 days/month that's ~13% on-duty, $4/mo for VM compute. The 50% framing in earlier drafts was conservative for budgeting but didn't match how staging actually gets used; reframed here to match expected reality. The Mongo data disk is the floor cost ($10/mo, always paid) because it persists across compute toggles. Atlas-mode switch (point staging at prod Atlas sv0_staging DB) has zero cost delta — the sv0_staging database is created on the already-paid prod M10 cluster.
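As a cross-check on the table and the staging note, the arithmetic can be reproduced in a few lines. This is a sketch only: the figures are copied from the table above, the dictionary keys and function names are illustrative labels rather than billing meter names, and the staging VM is carried as the low and high ends of its quoted range.

```python
# Annualised line items (USD/yr) copied from the cost table above.
# Keys are illustrative labels, not Azure billing meter names.
FIXED_LINES = {
    "prod_vms": 720,
    "staging_mongo_data_disk": 120,
    "staging_os_disk": 12,
    "pr_preview_vms": 540,
    "nat_gateway": 480,
    "recovery_services_vault": 240,
    "ssd_storage": 840,
    "compute_gallery": 60,
    "key_vault": 36,
    "nat_public_ip": 36,
    "grafana_egress": 200,
}
STAGING_VM_RANGE = (45, 60)  # ~$45-60/yr at 10-15% on-duty
DEV_VM_DEFERRED = 360        # Phase 3f, not yet incurred


def subtotal_range(include_dev: bool) -> tuple[int, int]:
    """Return the (low, high) annual subtotal in USD."""
    base = sum(FIXED_LINES.values()) + (DEV_VM_DEFERRED if include_dev else 0)
    return base + STAGING_VM_RANGE[0], base + STAGING_VM_RANGE[1]


def staging_duty(days_on_per_month: float, vm_monthly_usd: float = 30.0) -> tuple[float, float]:
    """Return (on-duty fraction, average monthly VM cost) for a 30-day month."""
    fraction = days_on_per_month / 30.0
    return fraction, fraction * vm_monthly_usd


if __name__ == "__main__":
    print(subtotal_range(include_dev=False))
    print(subtotal_range(include_dev=True))
    print(staging_duty(4))
```

Both ranges round to the ~$3,300 and ~$3,700 subtotals in the table, and four days a month works out to roughly 13% on-duty and about $4/mo of VM compute.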

What is not modelled here, and may bump real spend by 10–30%: Cloudflare Tunnel response egress through Azure NAT (every cloudflared connection's response bytes hit egress), GHCR pull egress on heavy PR-preview churn, snapshot storage growth over 30-day retention, Compute Gallery per-region replica storage if we add a second region, and Cloudflare/Grafana plan-tier increases.

Comfortably inside the credit allocation either way. Phase 3a sets Azure budget alerts at 50% / 75% / 90% of $5,000/yr, with a 100%-budget action group that auto-disables the tfc-sv0-infrastructure Azure AD app to halt new provisioning.
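The alert percentages translate into dollar thresholds as follows. A minimal sketch, assuming the percentages apply to the full $5,000/yr figure; the function name is illustrative.

```python
def alert_thresholds(annual_budget_usd: float, percents=(50, 75, 90, 100)) -> dict[int, float]:
    """Map each alert percentage to the spend (USD) at which it fires."""
    return {p: annual_budget_usd * p / 100 for p in percents}


if __name__ == "__main__":
    for pct, usd in alert_thresholds(5000).items():
        print(f"{pct}% -> ${usd:,.0f}")
```

If the budget period is the full year, the ~$3,300 to ~$3,700 estimate above would be expected to cross the 50% threshold in a normal year and to sit close to the 75% threshold once Phase 3f lands, so those two alerts are informational rather than signs of overspend.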

Ingress paths after cutover

prod traffic     Internet → Cloudflare edge → Cloudflare Tunnel "sv0-prod" (2 replicas, failover)
                   ├─ replica on prod VM 1 → nginx → containers → Atlas sv0_prod
                   └─ replica on prod VM 2 → nginx → containers → Atlas sv0_prod
                 (multi-replica = connection redundancy + nearest-replica routing,
                 NOT health-checked LB. Long-lived sessions may reconnect on replica
                 loss. Escalation: add Cloudflare Load Balancer if observed
                 failover is insufficient; see ADR §5a.)

staging traffic  Internet → Cloudflare edge → Cloudflare Tunnel "sv0-staging" (1 replica)
                   └─ staging VM → nginx → app container → mongo container (local)
                      OR (Atlas-mode switch) → Atlas sv0_staging DB on prod cluster

dev traffic      [Phase 3f] Internet → Cloudflare edge → Cloudflare Tunnel "sv0-dev"
                   └─ dev VM → nginx → app container → mongo container (local)

PR-N traffic     Internet → Cloudflare edge → Cloudflare Tunnel "sv0-pr-N" → nginx → containers

human SSH        `cloudflared access ssh --hostname <vm>` → Cloudflare Access SSO →
                 short-lived user cert (signed by Cloudflare Access SSH CA) →
                 sshd validates against TrustedUserCAKeys → session
                 (sshd still runs with HOST keys; no public network reach to port 22)

CI deploy        GitHub Actions → push image to GHCR → image-watcher on VM polls
                 the pointer doc → docker compose pull && up -d
                 (no CI-to-VM session; pull-based, see ADR §5d. SSH-from-CI was
                 in the earlier draft and is removed.)

VM egress        VM → snet-* → NAT Gateway (zone-1) → Internet/Atlas/GHCR/WorkOS/Grafana
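The pull-based CI deploy path above can be sketched as a single polling step. This is an assumption-heavy illustration: the pointer-doc shape (a JSON object with an `image` field), the function names, and the injected `run` callable are all hypothetical; only the `docker compose pull` and `docker compose up -d` commands come from the description above.

```python
import json
from typing import Callable, Optional


def desired_image(pointer_doc: str) -> str:
    """Parse the pointer doc. Assumed shape: {"image": "<repo>:<tag>"}."""
    return json.loads(pointer_doc)["image"]


def poll_once(pointer_doc: str, running_image: Optional[str],
              run: Callable[[list], None]) -> str:
    """Roll the stack only if the pointer doc names a different image.

    `run` executes one command (for example subprocess.check_call). It is
    injected so the decision logic can be exercised without Docker.
    """
    target = desired_image(pointer_doc)
    if target != running_image:
        run(["docker", "compose", "pull"])
        run(["docker", "compose", "up", "-d"])
    return target


if __name__ == "__main__":
    issued = []
    doc = json.dumps({"image": "example/api:sha-abc123"})  # hypothetical tag
    current = poll_once(doc, None, issued.append)      # first poll: rolls
    current = poll_once(doc, current, issued.append)   # unchanged: no-op
    print(issued)
```

The point of the shape is that the VM initiates every step: CI only publishes an image and updates the pointer, so no inbound session to the VM is needed.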

Phasing

The migration runs in six phases. Each phase is one or more PRs, each phase has a verifiable end-state, and each phase is reversible (Hetzner stays running through Phase 3e).

Staging-first sequencing (2026-05-10 amendment). Staging stands up before prod so every operational concern — cloudflared HA, secrets-via-KV, pull-deploy, image-watcher cadence, Alloy log shipping, break-glass — gets exercised on a cheap throwaway environment before prod compute is touched.

Phase 3-bootstrap — One-time OIDC federation setup (precedes 3a)

Status: ✅ Applied 2026-05-10 against modest-hybrid topology (ADR-022 amendment 2026-05-09). 38 resources live in subscription 2a25bc41-c1ce-4d04-9cb6-a62deccc3bcc.

Follow-up apply needed (2026-05-10 amendment): the staging-first amendment adds the following to the bootstrap layer. All changes are additive — adding "sv0-staging" to the tfc_workspaces set produces new map keys ("sv0-staging-plan", "sv0-staging-apply") because for_each is keyed by string, not by index; existing federated credentials are unchanged. Verify with terraform plan and look for any destroy-then-create on existing resources before applying.

  • rg-sv0-staging resource group (1)
  • tfc-sv0-staging-plan + tfc-sv0-staging-apply federated credentials (2)
  • Plan-phase RBAC for sv0-staging SP: Reader on subscription + Reader on rg-sv0-shared (2)
  • Apply-phase RBAC for sv0-staging SP: Contributor + Virtual Machine Contributor on rg-sv0-staging (2)
  • sv0-vm-emergency-ops group → sv0-serial-console-operator role assignment on rg-sv0-staging (1)

Total ~8 additive resources. Done as part of the Phase 3b prep PR before staging compute can be provisioned.
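Why the change is additive can be illustrated outside Terraform. The sketch below mimics a string-keyed for_each in Python; the "-plan" / "-apply" key pattern follows the keys quoted above, while the function names and the three-workspace starting set are assumptions inferred from the text.

```python
PHASES = ("plan", "apply")


def credential_keys(workspaces: set[str]) -> set[str]:
    """Mimic a for_each keyed by "<workspace>-<phase>" strings."""
    return {f"{ws}-{phase}" for ws in workspaces for phase in PHASES}


def diff_keys(before: set[str], after: set[str]) -> dict[str, set[str]]:
    """Classify keys the way a plan would: added, removed, unchanged."""
    return {
        "added": after - before,
        "removed": before - after,
        "unchanged": before & after,
    }


if __name__ == "__main__":
    old = credential_keys({"sv0-shared", "sv0-prod", "sv0-dev"})
    new = credential_keys({"sv0-shared", "sv0-prod", "sv0-dev", "sv0-staging"})
    result = diff_keys(old, new)
    print(sorted(result["added"]))
    print(len(result["removed"]))
```

With an index-keyed list, inserting an element can shift later indices and force destroy-then-create on existing resources; string keys avoid that, which is why the plan should show only additions.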

Local Terraform run, not via TFC. Output is the tfc-sv0-infrastructure Azure AD app + per-workspace federated credentials (8 total, 4 workspaces × {plan, apply}), the break-glass SP, the sv0-vm-emergency-ops Entra group, the custom sv0-serial-console-operator RBAC role, the state-backup storage account, and the five resource groups (rg-sv0-bootstrap, rg-sv0-shared, rg-sv0-staging, rg-sv0-prod, rg-sv0-dev). Documented in ADR §7a; concrete commands:

  1. Operator authenticates with their own Azure account (Owner-scoped): az login. (The earlier "pull sv0-azure-bootstrap SP from 1Password" instruction was wrong — no such credential existed, and creating one would be its own bootstrap problem.)
  2. Out-of-band: grant the operator Storage Blob Data Owner on the state-backup storage account. Required because shared_access_key_enabled = false forces terraform's data-plane polling through AAD; without this role, container creation fails with KeyBasedAuthenticationNotPermitted.
  3. cd sv0-infrastructure/bootstrap && terraform init && terraform apply. State stays local — do NOT migrate to a TFC workspace. The state file (bootstrap/terraform.tfstate) is backed up to 1Password sv0-infra after meaningful changes.
  4. Capture outputs (tfc_app_client_id, tenant_id, subscription_id) and set them as env-category workspace variables on each of the four workspaces (sv0-shared, sv0-staging, sv0-prod, sv0-dev): TFC_AZURE_PROVIDER_AUTH=true, TFC_AZURE_RUN_CLIENT_ID=<id>, ARM_TENANT_ID=<tenant>, ARM_SUBSCRIPTION_ID=<sub>. Use the TFC API (token from ~/.terraform.d/credentials.tfrc.json) rather than the UI, so the step is repeatable across all four workspaces. sv0-staging is easy to miss because the federation for it lands in the bootstrap re-apply rather than the original apply; without these vars set, any Phase 3b TFC plan fails at provider init with AADSTS700213.
  5. Store the break-glass SP secret + state-backup storage account name in 1Password sv0-infra vault as item sv0-azure-break-glass.
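Step 4 lends itself to scripting. A minimal sketch that only builds the request bodies, assuming the Terraform Cloud workspace-variables endpoint (POST /api/v2/workspaces/<workspace_id>/vars) and its JSON:API payload shape; the helper names are hypothetical and the placeholder values stand in for the bootstrap outputs.

```python
import json

WORKSPACES = ("sv0-shared", "sv0-staging", "sv0-prod", "sv0-dev")


def env_vars(client_id: str, tenant_id: str, subscription_id: str) -> dict[str, str]:
    """The four env-category variables named in step 4."""
    return {
        "TFC_AZURE_PROVIDER_AUTH": "true",
        "TFC_AZURE_RUN_CLIENT_ID": client_id,
        "ARM_TENANT_ID": tenant_id,
        "ARM_SUBSCRIPTION_ID": subscription_id,
    }


def var_payload(key: str, value: str) -> dict:
    """JSON:API body for creating one env-category workspace variable."""
    return {
        "data": {
            "type": "vars",
            "attributes": {
                "key": key,
                "value": value,
                "category": "env",
                "hcl": False,
                "sensitive": False,
            },
        }
    }


def all_payloads(client_id: str, tenant_id: str, subscription_id: str) -> dict[str, list]:
    """One list of request bodies per workspace, so none is skipped."""
    variables = env_vars(client_id, tenant_id, subscription_id)
    return {ws: [var_payload(k, v) for k, v in variables.items()] for ws in WORKSPACES}


if __name__ == "__main__":
    bodies = all_payloads("<id>", "<tenant>", "<sub>")
    print(json.dumps(bodies["sv0-staging"][0], indent=2))
```

Driving the loop from a fixed workspace tuple is the design choice that matters here: sv0-staging gets the same four variables as the others instead of relying on someone remembering it.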

Federated subject gotcha. The credential subject field uses TFC organization/project display names (case-sensitive, may contain spaces) — for SecurityV0 that is organization:SecurityV0:project:Default Project:workspace:<ws>:run_phase:<plan|apply>. The default values in bootstrap/variables.tf (tfc_organization, tfc_project) reflect this. Mismatch surfaces only at runtime as AADSTS700213 and never at terraform validate.
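Because the mismatch only shows up at runtime, it can help to generate the expected subject string and compare it against what the federated credential holds. A small sketch, with the format taken from the paragraph above; the function name is illustrative.

```python
def federated_subject(organization: str, project: str, workspace: str, run_phase: str) -> str:
    """Build the subject claim; names are used verbatim (case and spaces kept)."""
    if run_phase not in ("plan", "apply"):
        raise ValueError(f"run_phase must be 'plan' or 'apply', got {run_phase!r}")
    return (
        f"organization:{organization}:project:{project}"
        f":workspace:{workspace}:run_phase:{run_phase}"
    )


if __name__ == "__main__":
    expected = federated_subject("SecurityV0", "Default Project", "sv0-staging", "plan")
    print(expected)
    # A lower-cased or slugified name yields a different string, which is
    # the kind of mismatch that only surfaces at token exchange.
    print(expected == federated_subject("securityv0", "default-project", "sv0-staging", "plan"))
```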

End state: TFC can authenticate to Azure for sv0-shared, sv0-staging, sv0-prod, sv0-dev plan and apply phases without any static SP secret in TFC variables. Federation smoke-test gates (each returns planned_and_finished, 0 changes):

  • sv0-dev plan against empty envs/dev/ ✅ (proven 2026-05-10)
  • sv0-staging plan against empty envs/staging/ (added by Phase 3b prep PR)
  • sv0-prod plan against empty envs/prod/ (Phase 3a sanity check)
  • sv0-shared plan against empty envs/shared/ Azure portion (Phase 3a sanity check)

Phase 3a — Shared network substrate (no compute)

Goal: stand up everything compute will need, in one workspace, before any VM exists.

Deliverables (all in sv0-shared workspace, envs/shared/):

  • VNet sv0-vnet 10.0.0.0/16 in westeurope.
  • Subnets: snet-staging 10.0.0.0/24, snet-prod 10.0.1.0/24, snet-dev 10.0.2.0/24 (reserved for Phase 3f), snet-pr-previews 10.0.3.0/24, snet-shared 10.0.4.0/24 (reserved).
  • NSG per subnet — no inbound (Cloudflare Tunnel is ingress), all outbound.
  • Single zonal Standard NAT Gateway in zone 1 with static public IP, attached to all five subnets. Captured for the Atlas allowlist update in Phase 3c. Cost is ~$35/mo total, shared across all environments — see ADR §5b "Cost amortization" for the per-VM breakdown.
  • Compute Gallery image definition for Ubuntu 24.04 LTS + Docker + cloudflared + Alloy + secrets-fetcher + image-watcher. First image version baked + replicated to westeurope.
  • Recovery Services Vault (used by Phase 3c for prod snapshots).
  • Key Vaults kv-sv0-staging, kv-sv0-prod, kv-sv0-dev (last one empty until Phase 3f), with Managed-Identity-scoped access policies and per-environment secrets pre-populated per ADR §9.
  • Entra IdP federated into Cloudflare Access (~30 min one-time, Cloudflare-side prerequisite — see ADR §5c and §5c.1). Requires an Azure App Registration ("Cloudflare Access" or similar), the App's client ID + secret + tenant ID configured in Cloudflare Zero Trust → Settings → Authentication. Without this, the following surfaces fall back to GitHub IdP, and the "Entra MFA required" claim in §5c is unenforceable: (a) all tier-1 SSH to Azure VMs (forced — no app behind SSH; ADR §5c.1 row "Tier-1 SSH"); (b) the two-door URLs — dev-azure.securityv0.com, dev.securityv0.com after Phase 3f, all pr-N-dev.securityv0.com PR previews. NOT needed for staging.securityv0.com (one-door, WorkOS only — ADR §5c.1). The spike PR (sv0-infrastructure#26) surfaced this prerequisite; the auth-simplification owner confirmed the Entra-direct path 2026-05-11.
  • Two-person approval policy enforced on sv0-shared workspace.
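The subnet plan in the deliverables above is easy to sanity-check before it reaches Terraform. A minimal sketch using Python's standard ipaddress module; the CIDRs are copied from the list, and the function name is illustrative.

```python
import ipaddress
from itertools import combinations

VNET = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "snet-staging": "10.0.0.0/24",
    "snet-prod": "10.0.1.0/24",
    "snet-dev": "10.0.2.0/24",
    "snet-pr-previews": "10.0.3.0/24",
    "snet-shared": "10.0.4.0/24",
}


def validate(vnet, subnets: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the plan is consistent."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = [
        f"{name} is outside {vnet}"
        for name, net in nets.items()
        if not net.subnet_of(vnet)
    ]
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} overlaps {b}")
    return problems


if __name__ == "__main__":
    print(validate(VNET, SUBNETS))
```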

End state: sv0-shared plan applies cleanly; the NAT egress IP, Compute Gallery image ID, and Key Vault URIs are available as workspace outputs for downstream workspaces to consume via terraform_remote_state.

Phase 3b — Staging environment on Azure

Goal: validate the entire prod design end-to-end on a single cheap VM before prod compute is touched. Cut over staging.securityv0.com to it. Default Mongo runs in a container on the same VM; an env-var switch points at the prod Atlas sv0_staging database when end-to-end Atlas validation is needed.

Deliverables (all in sv0-staging workspace, envs/staging/):

  • Rename envs/staging-ephemeral/ → envs/staging/. Existing Atlas-drill code moves with the directory; its atlas_drill_enabled switch (renamed from staging_enabled to disambiguate) stays default-off.
  • Ordering matters: update the TFC workspace's Working Directory setting (sv0-staging → envs/staging) via the TFC API BEFORE the rename PR merges. The sv0-staging workspace is VCS-driven, so it auto-triggers a plan on every commit to main. If the rename merges first, every push will fail-plan with "directory not found" until the Working Directory setting catches up. Procedure: (1) curl -X PATCH ...workspaces/<ws_id> to set working-directory: envs/staging (yes, ahead of the merge: the directory doesn't exist yet on main, but the setting is just a string), (2) merge the rename PR, (3) the next push triggers a plan from the new directory.
  • New staging_compute_enabled master switch (default false — zero compute resources when off). When true, the following are provisioned in rg-sv0-staging:
    • 1 Azure VM (Standard_B2s, zone 1) built from the Compute Gallery image. OS disk uses delete_os_disk_on_deletion = true (the OS is re-creatable from IaC + Compute Gallery).
    • Separate azurerm_managed_disk for Mongo data — 64 GB Premium SSD, lifecycle-detached from the VM (prevent_destroy = true). Mounted at /var/lib/mongo via cloud-init. This is the actual "OS disk persists" mechanism: the OS disk is throwaway, but the Mongo data disk survives staging_compute_enabled = false → true cycles. Cost when staging is off: ~$10/mo (64 GB Premium SSD), independent of compute state.
    • Managed Identity with get on kv-sv0-staging secrets.
    • Cloudflare Tunnel sv0-staging (single replica), DNS CNAME staging.securityv0.com. NO Cloudflare Access app on the URL — staging mirrors prod's one-door posture (WorkOS hosted login is the only gate). Per ADR §5c.1: staging is supposed to validate prod's auth shape end-to-end, so it gets the same one-door shape as prod. WorkOS auth happens inside the app, against the same Connect apps as prod.
    • Cloudflare Access SSH app for the staging VM only (separate from the URL). cloudflared access ssh --hostname staging-vm.<team>.cloudflareaccess.com → Entra IdP → short-lived cert → sshd. This is the tier-1 path per ADR §5c; required regardless of the URL's door count because SSH has no app behind it.
    • cloud-init runs Docker Compose with two services: api (+ ui) and mongo (community container, data on the mounted data disk above).
  • Mongo data disk re-attach procedure — when staging_compute_enabled flips false → true, the new VM mounts the existing data disk at /var/lib/mongo. If the previous VM shut down uncleanly, WiredTiger's lock file (/var/lib/mongo/mongod.lock) will block startup. Cloud-init runs rm -f /var/lib/mongo/mongod.lock only after verifying the disk was attached cleanly (no in-flight writes — checked via Azure's disk-state status). Documented as a deliberate step, not silent rm: WiredTiger's recovery journal handles the rest.
  • MONGODB_URI selector wired as a runtime env var: MONGODB_URI=mongodb://localhost:27017/sv0_staging by default; flip to the prod Atlas connection string via a workspace variable to run Atlas-mode E2E tests. No code rebuild on flip — but the app code may still need changes. Atlas connection strings carry +srv DNS seeding, tls=true, authSource=admin, and longer serverSelectionTimeoutMS to tolerate replica-set discovery. If the platform code hardcodes directConnection: true, a local CA bundle path, or a timeout shorter than Atlas's failover window, the flip will fail at first use. The Phase 3b PR audits sv0-platform for these assumptions and adjusts code if needed — see validation gate "Atlas-mode flip transparent" below.
  • Pre-create the sv0_staging database on the prod Atlas cluster (zero cost on M10, no separate Atlas project needed). Add Atlas user sv0_staging scoped to that database only. Store credentials in kv-sv0-staging.
  • Cloudflare Access policy on the staging VM SSH app (not on staging.securityv0.com, which stays one-door per ADR §5c.1): Entra IdP federated, MFA required, same named-humans-only restriction as prod.
  • Validation gates (each is a runbook step recorded in the PR description):
    • Pull-deploy: push :staging tag → image-watcher rolls within 30 s.
    • Secrets-via-KV: app reads WORKOS_API_KEY etc. from kv-sv0-staging via Managed Identity, no plaintext on disk.
    • Cloudflare Tunnel: registers + serves traffic from staging.securityv0.com. Confirm CF Universal SSL issues a cert (single-level subdomain — covered per repo memory; *.staging. would NOT be covered, so no nested subdomains here).
    • One-door posture confirmed: anonymous curl https://staging.securityv0.com/ returns the WorkOS hosted login page (or a redirect to it), NOT a Cloudflare Access challenge. If the response is HTML containing cloudflareaccess.com, the staging URL has a CF Access app that shouldn't be there — remove it.
    • Atlas-mode flip transparent: flip MONGODB_URI workspace variable to the prod Atlas SRV string (mongodb+srv://...sv0_staging), apply, verify app reconnects WITHOUT code change or container rebuild. If the flip fails because the app hardcodes Mongo locality (directConnection: true, short serverSelectionTimeoutMS, missing TLS support), record the exact code change required in the PR and either land it in this PR or file a follow-up.
    • Cloudflare Access SSH: human can cloudflared access ssh --hostname <vm>... and land a session.
    • Azure Serial Console: tier-2 emergency drill on the staging VM produces a shell prompt.
    • Alloy log + metrics shipping: confirm cloudflared, sshd, app containers, cloud-init all visible in Grafana Cloud Loki + Prom.
    • Pull-deploy is the only mutation path: no SSH-push from CI anywhere in the staging deploy pipeline.
    • Disk-full alerting: fill the OS disk to >80% with fallocate -l 50G /tmp/fill && du -sh /tmp/fill → confirm Grafana alert fires within the alert window. The data disk is alerted separately at the same threshold. Recovery: rm /tmp/fill; alert clears.
    • Mongo unclean-shutdown durability: docker kill -s SIGKILL sv0-mongo && docker start sv0-mongo → confirm WiredTiger recovers, no data corruption (run a count query on a known collection before and after). This is the simulator for OOM-kill, power-loss, and unclean-VM-restart events that the on-disk Mongo will face in real operation.
  • Idle-mode cost discipline: when staging is not actively in use, set staging_compute_enabled = false and apply. The VM and its OS disk are destroyed; the Mongo data disk persists (~$10/mo), so a re-apply 7 days later rebuilds the VM from the Compute Gallery image and re-attaches the same data. Because the Mongo data survives, a quick smoke test after re-apply is faster than a cold start.
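Part of the MONGODB_URI audit described in the deliverables above can be automated on the connection-string side. The sketch below flags the URI options the text names as flip hazards; it is an illustration under assumptions: the 30000 ms floor is a placeholder to tune against the real failover window, the function name is hypothetical, and options hardcoded in application code still need a manual review of sv0-platform.

```python
from urllib.parse import urlsplit, parse_qs

MIN_SELECTION_TIMEOUT_MS = 30000  # assumed floor; tune to the real failover window


def audit_mongo_uri(uri: str) -> list[str]:
    """Flag URI options that would make a local-to-Atlas flip fail."""
    parts = urlsplit(uri)
    # Option names are case-insensitive in practice, so normalise the keys.
    opts = {k.lower(): v[-1] for k, v in parse_qs(parts.query).items()}
    findings = []
    if parts.scheme == "mongodb+srv":
        if opts.get("directconnection", "false").lower() == "true":
            findings.append("directConnection=true conflicts with replica-set discovery")
        if opts.get("tls", "true").lower() == "false":
            findings.append("tls=false, but Atlas requires TLS")
    timeout = opts.get("serverselectiontimeoutms")
    if timeout is not None and int(timeout) < MIN_SELECTION_TIMEOUT_MS:
        findings.append(
            f"serverSelectionTimeoutMS={timeout} is below {MIN_SELECTION_TIMEOUT_MS}"
        )
    return findings


if __name__ == "__main__":
    print(audit_mongo_uri("mongodb://localhost:27017/sv0_staging"))
    print(audit_mongo_uri(
        "mongodb+srv://user:pw@cluster.example.net/sv0_staging"
        "?directConnection=true&serverSelectionTimeoutMS=2000"
    ))
```

The host cluster.example.net is a placeholder, not the real Atlas hostname.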

End state: staging.securityv0.com live and validated end-to-end. Every concern that prod will need is exercised. The "what does it mean to validate staging works" question is answered by the validation gates above — each is checked off in the Phase 3b PR before Phase 3c is filed.
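The one-door validation gate reduces to classifying an anonymous response. A minimal sketch of that check, assuming the caller has already fetched the redirect Location header and the response body (for example with curl); the "workos" substring is an assumed marker for the hosted login page, and the function name is illustrative.

```python
def classify_front_door(location: str, body: str) -> str:
    """Classify an anonymous response as 'cf-access', 'workos', or 'unknown'."""
    haystack = f"{location}\n{body}".lower()
    if "cloudflareaccess.com" in haystack:
        return "cf-access"  # unexpected second door: a CF Access app is on the URL
    if "workos" in haystack:
        return "workos"     # assumed marker for the WorkOS hosted login page
    return "unknown"


if __name__ == "__main__":
    print(classify_front_door("", "<html>Sign in with WorkOS</html>"))
    print(classify_front_door("https://team.cloudflareaccess.com/cdn-cgi/access/login", ""))
```

The Cloudflare check runs first on purpose: a page that mentions both should still fail the gate.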

Phase 3c — Prod fleet (1 VM → 2 VMs → DNS cut)

Goal: Replicate the staging pattern with two zone-spread VMs, run a failover drill on app-staging.securityv0.com, then cut app.securityv0.com from Hetzner DNS to the Cloudflare Tunnel.

Step 3c.1 — Provision prod VM 1 alongside Hetzner:

  • sv0-prod workspace (Azure portion): 1 prod VM (Standard_B2s, zone 1), Managed Identity with get on kv-sv0-prod, cloud-init same as staging but without the local Mongo container (prod always uses Atlas).
  • Cloudflare Tunnel sv0-prod (1 replica initially), DNS CNAME app-staging.securityv0.com → Tunnel. Do not cut app.securityv0.com yet — traffic still flows through Hetzner.
  • Two-person approval enforced on sv0-prod workspace.
  • Same validation gates as staging, but pointed at the prod stack.

Step 3c.2 — Add prod VM 2, run failover drill, cut DNS:

  • Provision second prod VM (zone 2). Same image, same cloud-init, second cloudflared replica registers to the same sv0-prod Tunnel.
  • Failover drill (before DNS cut): with both replicas registered and traffic flowing through app-staging, stop cloudflared on VM 1. Measure the time between replica-1 disappearing and traffic resuming on replica 2 (Grafana Cloud k6 or external curl loop). Record P50/P95/max in the PR. If P95 failover exceeds 30 s, halt the cutover and revisit ADR §5a (escalate to Cloudflare Load Balancer).
  • Cut app.securityv0.com DNS from Hetzner public IP → Cloudflare Tunnel CNAME. DNS TTL is 60 s; production cutover completes within ~2 minutes.
  • Atlas IP allowlist updated to NAT Gateway egress IP. Closes sv0-infrastructure#11.
  • lifecycle.prevent_destroy = true set on all prod VM/NIC/disk/tunnel resources.
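The failover-drill gate in the steps above (record P50/P95/max, halt if P95 exceeds 30 s) can be computed from the measured gaps like this. A sketch using the nearest-rank percentile method, which is one of several reasonable definitions; the function names are illustrative and the sample values are made up, not measurements.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct in (0, 100]) of a non-empty sample list."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def drill_verdict(failover_seconds: list[float], p95_limit_s: float = 30.0) -> dict:
    """Summarise drill runs and apply the P95 <= 30 s gate."""
    p95 = percentile(failover_seconds, 95)
    return {
        "p50": percentile(failover_seconds, 50),
        "p95": p95,
        "max": max(failover_seconds),
        "proceed": p95 <= p95_limit_s,
    }


if __name__ == "__main__":
    # Made-up sample gaps in seconds, one per drill repetition.
    print(drill_verdict([4, 6, 5, 9, 7, 8, 5, 6, 7, 41]))
```

With only a handful of repetitions, P95 is effectively the worst run, so a single slow failover is enough to halt the cutover; running more repetitions gives the gate more resolution.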

End state: prod fully on Azure with two zone-spread VMs and measured failover. Hetzner prod VM idle but still reachable for break-glass.

Phase 3d — PR-preview lifecycle automation

Five workflows replace the single provision-preview.yml from the earlier draft:

  • preview-create.yml (on pull_request: opened, reopened): idempotent az group create + az vm create + Cloudflare API tunnel/DNS create, all tagged sv0:pr-preview=N. Skips Azure provisioning if the RG already exists.
  • preview-deploy.yml (on pull_request: synchronize): builds and pushes :pr-N to GHCR. Does NOT touch Azure or Cloudflare. The PR's VM image-watcher rolls itself.
  • preview-destroy.yml (on pull_request: closed): az group delete --yes --no-wait, then Cloudflare API delete tunnel + DNS, all idempotent and individually logged.
  • preview-reaper.yml (scheduled daily): same destroy logic for any PR-preview RG idle >7 days.
  • preview-reconcile.yml (scheduled daily): cross-check Cloudflare resources tagged sv0:pr-preview=* against open PRs + live Azure RGs; reap orphans. Alerts on any orphan that survives one cycle.
  • 10-VM concurrency cap enforced in preview-create.yml with PR-comment fallback (ADR §6).
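The reconcile cross-check and the concurrency cap above are both small set operations. A minimal sketch, assuming PR numbers have already been extracted from the sv0:pr-preview tags and RG names; it treats a Cloudflare resource as an orphan only when it matches neither an open PR nor a live Azure RG, which is one reading of the cross-check. Function names are hypothetical.

```python
MAX_PREVIEWS = 10  # concurrency cap quoted above (ADR §6)


def cloudflare_orphans(cloudflare_prs: set[int], open_prs: set[int],
                       azure_rg_prs: set[int]) -> set[int]:
    """PR numbers with Cloudflare resources but no open PR and no live RG."""
    return cloudflare_prs - (open_prs | azure_rg_prs)


def may_create(active_previews: int) -> bool:
    """True if preview-create may provision another VM under the cap."""
    return active_previews < MAX_PREVIEWS


if __name__ == "__main__":
    # PR 4 has a tunnel/DNS record but neither an open PR nor an Azure RG.
    print(cloudflare_orphans({1, 2, 3, 4}, open_prs={1, 2}, azure_rg_prs={2, 3}))
    print(may_create(10))
```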

End state: PR previews each get their own VM. Zero dev-VM contention. Cross-cloud cleanup is closed-loop.

Phase 3e — Hetzner decommission

The dev VM is still on Hetzner at this point — Phase 3f is deferred. Decommission focuses on the prod Hetzner VM; the dev Hetzner VM stays until 3f ships.

  • Confirm 7+ days of zero traffic on the prod Hetzner public IP (Cloudflare logs).
  • Snapshot the prod Hetzner VM to a local archive (forensic + audit-trail).
  • Power off, retain snapshot 30 days, then destroy.
  • Update docs/deploy/deployment.md prod section to point at the Azure runbook.
  • Hetzner dev VM stays until Phase 3f ships; Hetzner billing reduces but doesn't go to zero yet.

End state: prod on Azure, dev on Hetzner. Single prod substrate. Epic sv0-platform#550 not fully closed until 3f lands.

Phase 3f-DEFERRED — Dev VM pool

Status: Deferred. Primary triggers (any one fires this phase):

  • (a) Hetzner dev box becomes a real bottleneck again — OOM kills, disk-full, or PR-preview concurrency limits. Most likely trigger. Estimated arrival: weeks after Phase 3d ships and PR previews land on Azure, because PR previews leaving the dev box frees it up considerably.
  • (b) A second engineer joins and needs an isolated dev environment — current single-developer setup is fine on Hetzner; second engineer breaks that.
  • (c) Cost or security review concludes the Hetzner dev box should retire — e.g., the docker-group root-equivalence acceptance from ADR-018 hits its time-bound expiry, or Hetzner pricing changes.

Long-tail re-review trigger: if 6+ months pass after Phase 3c without any of (a)–(c) firing, this phase is re-evaluated against current reality — the deferral may have become a "never" and the carve-out (sv0_dev on prod Atlas) may need to be formalized in ADR-020 rather than treated as transitional. Not an automatic trigger; explicit decision.

Goal: retire the Hetzner dev VM. Move dev to one or more long-running Azure VMs with simple Cloudflare-Access SSH for engineers, same colocated-Mongo-container pattern as staging and as today's Hetzner setup.

Design choices that lock in at the time:

  • Mongo runs in a container on each dev VM, not on Atlas. Confirmed by Ivan 2026-05-10: dev VMs use community Mongo only, same Docker pattern they already run. The transitional sv0_dev carve-out on the prod Atlas cluster (ADR-020 Phase 0) retires when this phase ships.
  • One dev VM vs. one VM per engineer: TBD at 3f time. Lean toward shared dev VM until the team is >2 engineers; per-engineer VMs are the upgrade path.
  • Auto-apply: on for the sv0-dev workspace (matches the Hetzner cadence — every main merge auto-deploys to dev).

Deliverables (all in sv0-dev workspace, envs/dev/):

  • 1 long-running Azure VM (Standard_B2s, zone 1) built from the Compute Gallery image. (Or N VMs if per-engineer is chosen at 3f time.)
  • Managed Identity with get permission on kv-sv0-dev secrets.
  • Cloudflare Tunnel sv0-dev (single replica), DNS CNAME dev.securityv0.com.
  • Same Docker Compose pattern as staging — api + ui + mongo (community container with persistent volume).
  • Update deploy-dev.yml to publish to GHCR only — no SSH push. The dev VM's image-watcher rolls to the new image.
  • Decommission Hetzner dev VM after 7+ days of zero traffic.

End state: dev on Azure, no Hetzner. Epic sv0-platform#550 closes.


Sub-issues to file in sv0-infrastructure

Filed once this PR merges, each linked back to sv0-infrastructure#18 (this work) and sv0-platform#550 (parent epic):

  • feat(shared): VNet + NAT Gateway + Compute Gallery + Key Vaults in westeurope (Phase 3a)
  • feat(staging): rename envs/staging-ephemeral → envs/staging + Azure compute scaffolding (Phase 3b prep)
  • feat(staging): provision staging VM + colocated Mongo + cloudflared + validation gates (Phase 3b)
  • feat(prod): provision prod VM 1 on app-staging.securityv0.com (Phase 3c.1)
  • feat(prod): add prod VM 2, failover drill, cut app.securityv0.com DNS to Tunnel (Phase 3c.2)
  • feat(pr-previews): GHA workflows for ephemeral per-PR VM lifecycle (Phase 3d)
  • chore(decommission): power off + snapshot + destroy Hetzner prod VM (Phase 3e)
  • feat(dev): provision dev VM(s) + decommission Hetzner dev (Phase 3f, deferred)
  • tighten Atlas IP allowlist to NAT Gateway egress IPs (closes #11) — fired during Phase 3c.2

The Alloy + Atlas-scrape sub-issues (sv0-platform#764, sv0-infrastructure#16) are unblocked by Phase 3a and proceed independently.


Open implementation questions (to resolve in Phase 3a PR review)

These are deliberately deferred from the ADR because they're implementation details that don't change the topology:

  1. Caddy or nginx in container? Host-side Caddy is dropped (cloudflared terminates ingress). Each VM keeps an in-container reverse proxy for the SPA + API split. Lean: keep the existing nginx container from the Hetzner stack unchanged — same image, same config. Decide-and-document in the Phase 3a image PR.
  2. Pre-baked Compute Gallery image vs. cloud-init from base Ubuntu, per tier. Pre-baked = faster PR-preview spin-up (~30s vs ~3min); cloud-init = more transparent. Lean: pre-baked for PR-preview tier (latency matters per ADR §6 measurement), cloud-init from the prebaked base for prod/dev (transparency on rare provisioning events). Image versioning + rotation cadence to be defined.
  3. Recovery Services Vault retention. ADR says 30 days. Does compliance want longer? Confirm with Sergey before Phase 3a apply. Cost impact is bounded — every additional 30 days adds ~$5/mo per protected VM.
  4. Annual Azure budget threshold. ADR Negative section commits to budget alerts at 50% / 75% / 90% of $5,000/yr (the Microsoft for Startups credit allocation), with a 100%-budget action group that disables the tfc-sv0-infrastructure Azure AD app. Confirm $5,000 figure with the latest credit-allocation snapshot before Phase 3a apply.
  5. Monitor agent vs. Alloy for VM-level metrics. Alloy is going on every VM anyway (#764). Azure Monitor agent duplicates the role. Lean: Alloy only, no Azure Monitor agent. Trade-off: Azure Backup + Defender for Cloud features that depend on Monitor agent are unavailable; if compliance wants those, install Monitor agent alongside Alloy in a follow-up.
  6. PR-preview VM sizing. ADR §4 currently defaults to B2s for PR previews (same as dev) until Phase 3d measures actual RSS. The follow-up PR that flips the default to B1ms is gated on data, not on schedule.
  7. Image-watcher implementation. Lean: a small Go binary + systemd unit, ~150 LOC, polling a pointer doc in GHCR's manifest list. Alternative is watchtower (off-the-shelf but adds a moving part). Decide-and-document in the Phase 3a image PR.
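
The alert thresholds in question 4 are plain arithmetic on the annual figure. As a minimal sketch, assuming the $5,000/yr figure still holds, the helper below prints the dollar value behind each percentage. The function name `budget_thresholds` is hypothetical and not part of any existing repo.

```shell
# Hypothetical helper: print the dollar value of each budget alert threshold.
# Usage: budget_thresholds <annual_budget_usd> <pct>...
budget_thresholds() {
  budget=$1
  shift
  for pct in "$@"; do
    # Integer division; exact whenever budget * pct is a multiple of 100,
    # which holds for a 5000 budget and whole-number percentages.
    echo "${pct}% = \$$((budget * pct / 100))"
  done
}
```

Calling `budget_thresholds 5000 50 75 90 100` prints `50% = $2500`, `75% = $3750`, `90% = $4500` and `100% = $5000`, one per line. If the credit-allocation snapshot changes the annual figure, only the first argument changes.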

Break-glass

Three break-glass scenarios, in increasing order of severity. Each has an executable procedure with specific commands.

Note on tier-2 human emergency access (Azure Serial Console). Scenarios A and B below assume Cloudflare Access SSH is the way to reach a VM. If Access SSH itself is broken (cert path wedged, sshd hung) but the VM is otherwise healthy, try Azure Serial Console first (az serial-console connect --name <vm> -g <rg>) before escalating to public-IP failback or the break-glass SP. Serial Console hits the hypervisor serial port and is independent of in-VM networking and sshd state. Authentication is via Entra ID + the sv0-vm-emergency-ops group (Ivan + Sergey).

Scenario A — Cloudflare Tunnel down, Azure healthy, TFC healthy (Phase 3a–3d, Hetzner still alive)

Symptoms: app.securityv0.com 5xxs from Cloudflare edge; Azure VMs themselves look healthy in Grafana; Hetzner VMs idle but reachable.

  1. In Cloudflare DNS dashboard, change app.securityv0.com from CNAME-to-Tunnel back to A → 178.156.245.75 (Hetzner prod).
  2. DNS TTL is 60s; traffic returns to Hetzner within ~1 minute.
  3. If Hetzner stack is stale (last deploy >24h old), SSH in and run cd ~/sv0-platform && docker compose -f docker-compose.deploy.yml pull && docker compose -f docker-compose.deploy.yml up -d to refresh from latest GHCR sha-<commit> tag.
  4. Open incident ticket; root-cause Cloudflare side.

Scenario B — Cloudflare Tunnel down, Azure healthy, TFC healthy (post-Phase-3e, Hetzner gone)

Symptoms: same as A but no Hetzner to fall back to.

  1. In TFC, open sv0-prod workspace. Set workspace variable var.expose_public_ip_emergency = "true".
  2. Run plan + apply. The compute module's preconditions allow this only when var.expose_public_ip_emergency == "true". Apply adds: one Standard SKU public IP per prod VM, NSG inbound rule on 443 restricted to Cloudflare's published egress IP ranges (ASN 13335), and a minimal Caddy systemd unit on each VM that's preinstalled-but-disabled in the Compute Gallery image.
  3. In Cloudflare DNS, switch app.securityv0.com from Tunnel CNAME to A records pointing at the new Public IPs.
  4. Verify request path: curl -I https://app.securityv0.com/health.
  5. Once Cloudflare Tunnel is healthy again, set var.expose_public_ip_emergency = "false" and apply to undo.

Total time: ~20 minutes once the operator is at the keyboard. Depends only on TFC, not on Cloudflare Tunnel.
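
The verification in step 4 can be wrapped in a small retry loop so the operator does not have to re-run curl by hand while DNS settles. This is a sketch only: `wait_for_healthy` is a hypothetical name, the defaults (10 attempts, 6 seconds apart) are an assumption chosen to cover roughly one 60s TTL window, and it assumes a healthy endpoint answers a HEAD request with HTTP 200. A request that passes through Cloudflare Access without a service token may return a redirect instead, which this check treats as not healthy.

```shell
# Hypothetical helper: poll a URL with HEAD requests until it returns HTTP 200.
# Usage: wait_for_healthy <url> [attempts] [delay_seconds]
wait_for_healthy() {
  url=$1
  attempts=${2:-10}
  delay=${3:-6}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -I sends a HEAD request (as in step 4); -w prints only the status code.
    # On a connection failure curl prints 000 and exits non-zero; the exit
    # status is ignored so the loop can retry.
    code=$(curl -s -o /dev/null -I --max-time 10 -w '%{http_code}' "$url" || true)
    if [ "$code" = "200" ]; then
      echo "healthy: $url returned 200 on attempt $i"
      return 0
    fi
    echo "attempt $i/$attempts: got ${code:-no response}" >&2
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

A typical call would be `wait_for_healthy https://app.securityv0.com/health`. A non-zero exit after the last attempt is the signal to re-check the DNS records and the NSG rule before retrying.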

Scenario C — TFC unreachable AND Azure compute needs intervention (most severe)

Symptoms: TFC dashboard down, plan/apply blocked, prod VMs need attention (e.g., one is in stop-deallocated state or NSG needs an emergency rule).

The earlier draft simply said "terraform apply" — which fails in the exact scenario being described. Concrete procedure:

  1. Pull sv0-azure-break-glass from 1Password sv0-infra vault. RBAC scope (set in bootstrap): Contributor on rg-sv0-prod, Reader on rg-sv0-shared, Backup Contributor on rg-sv0-prod, Storage Blob Data Reader on the state-backup storage account. MFA-required.
  2. az login --service-principal -u <client_id> --tenant <tenant> --password <secret> in a clean shell.
  3. Pull TFC state to local. TFC's API is what's down, but the state file is replicated to Azure Blob Storage as part of the standard TFC workspace settings (configured in Phase 3-bootstrap). The break-glass account has Storage Blob Data Reader on the state container:
    az storage blob download \
      --auth-mode login \
      --account-name <state_backup_storage_account> \
      --container-name state \
      --name sv0-prod.tfstate \
      --file /tmp/sv0-prod.tfstate
    (<state_backup_storage_account> is recorded in the 1Password break-glass item; it is sv0tfcstate<hex> with a random suffix.)
  4. Local apply with the break-glass credential and the pulled state:
    cd sv0-infrastructure/envs/prod
    export ARM_USE_OIDC=false # use SP creds
    export ARM_CLIENT_ID=<from 1Password>
    export ARM_CLIENT_SECRET=<from 1Password>
    export ARM_TENANT_ID=<...>
    export ARM_SUBSCRIPTION_ID=<...>
    terraform init -backend=false
    terraform apply -state=/tmp/sv0-prod.tfstate -target=<specific resource>
    Use -target aggressively: only touch the resource that needs intervention. Do NOT apply the full plan — the local state may be slightly stale.
  5. Manual reconcile after TFC recovers. Once TFC is back, run terraform plan -refresh-only (terraform refresh is deprecated in current Terraform releases) and then a full plan; the diff is the manual change made above. Resolve drift by either accepting the manual change (commit the equivalent HCL) or reverting (terraform apply from TFC).
  6. Quarterly drill commitment. This procedure is exercised against a non-prod workspace every quarter. The drill produces a runbook update if any step has gone stale.
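
The "use -target aggressively" rule in step 4 is easy to forget under pressure. One way to make it hard to skip is a thin wrapper that refuses to run without at least one resource address. The sketch below mirrors the apply command from step 4; the function name `break_glass_apply` and the example resource address in the usage note are hypothetical.

```shell
# Hypothetical wrapper around step 4: refuse to run an untargeted apply.
# Usage: break_glass_apply <state_file> <resource_address>...
break_glass_apply() {
  if [ "$#" -lt 2 ]; then
    echo "usage: break_glass_apply <state_file> <resource_address>..." >&2
    return 2
  fi
  state_file=$1
  shift
  if [ ! -s "$state_file" ]; then
    echo "state file missing or empty: $state_file" >&2
    return 2
  fi
  # Rotate through the positional arguments, turning each resource
  # address into a -target=<address> flag.
  n=$#
  while [ "$n" -gt 0 ]; do
    set -- "$@" "-target=$1"
    shift
    n=$((n - 1))
  done
  terraform apply -state="$state_file" "$@"
}
```

For example, `break_glass_apply /tmp/sv0-prod.tfstate azurerm_example.placeholder` runs a targeted apply against the pulled state, while `break_glass_apply /tmp/sv0-prod.tfstate` exits with status 2 without invoking Terraform.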

Why state replication to Blob Storage is part of break-glass design

TFC's state is canonical when TFC is up; when TFC is down the operator needs some version of state. The TFC "Remote State Backup" feature (configured in Phase 3-bootstrap) writes a copy of every applied state to a Blob container in the state-backup storage account (in rg-sv0-bootstrap, ZRS replication, AAD-only auth, public network access locked to TFC's published egress ranges plus operator IPs). This is documented as part of the break-glass-readiness check in the Verification checklist below.


Verification checklist for each phase

Each phase ends with the following all-green:

  • terraform plan empty against the relevant workspace
  • cloudflared --version matches pinned baseline on each VM; systemd unit Restart=always, Active: active (running)
  • Tunnel replica count matches expected for that phase: 0 in 3a (no compute), 1 staging replica in 3b, 1 prod replica in 3c.1, 2 prod replicas in 3c.2 onward, N PR-preview replicas in 3d
  • Cloudflare Access SSH login works for a named human (browser flow → short-lived cert → sshd accepts) — tier-1 human path
  • Azure Serial Console reachable (az serial-console connect --name <vm> -g <rg> returns a prompt) — tier-2 human emergency path, tested per VM at provisioning
  • sv0-vm-emergency-ops Entra group exists and has the custom sv0-serial-console-operator role on rg-sv0-prod + rg-sv0-dev; membership matches the policy (Ivan + Sergey only)
  • Cloudflare Access logs flowing to Loki; Activity Log entries for Microsoft.SerialConsole/serialPorts/connect/action appear in Loki when the drill runs
  • Image-watcher rolls a new GHCR tag within 30s of publish (smoke test from each phase)
  • Atlas IP allowlist accepts NAT Gateway egress IP only (Phase 3b+)
  • Grafana Cloud Prom shows VM up, Alloy up == 1, cloudflared_tunnel_active_streams > 0
  • Grafana Cloud Loki receives logs from cloudflared, sshd, app containers, cloud-init
  • Azure Backup has a successful daily snapshot recorded for prod VMs (Phase 3b+)
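  • No public IP on any VM in rg-sv0-prod / rg-sv0-dev (assertion: az vm list-ip-addresses -g <rg> --query "[].virtualMachine.network.publicIpAddresses[].ipAddress" -o tsv returns no output for each resource group; NAT Gateway egress IPs are attached to the gateway, not to VMs, so they do not appear here)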
  • lifecycle.prevent_destroy = true on all prod VM/NIC/disk resources (Phase 3b+)
  • Two-person approval policy enforced on sv0-prod and sv0-shared workspaces
  • Break-glass readiness (Phase 3a+): TFC state replicated to the state-backup blob container; sv0-azure-break-glass SP RBAC verified via az role assignment list --assignee <sp_id>
  • Failover drill recorded in the Phase 3c.2 PR with P50/P95/max measurements
  • Cost dashboard shows actuals within 20% of estimate for the prior 7 days
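
The "no public IP on any VM" item lends itself to a scripted assertion. The sketch below assumes the operator is already logged in with az and has Reader access on the resource groups. The function name `assert_no_public_ips` is hypothetical. It deliberately treats a failed az call as a failure, so an expired login cannot pass as "no public IPs".

```shell
# Hypothetical check: fail if any VM in the given resource groups has a public IP.
# Usage: assert_no_public_ips <resource_group>...
assert_no_public_ips() {
  status=0
  for rg in "$@"; do
    # Flatten to one IP address per line; empty output means no public IPs.
    if ! ips=$(az vm list-ip-addresses -g "$rg" \
      --query "[].virtualMachine.network.publicIpAddresses[].ipAddress" -o tsv); then
      echo "ERROR: could not list IP addresses for $rg" >&2
      status=1
      continue
    fi
    if [ -n "$ips" ]; then
      echo "FAIL: $rg has VM public IPs: $(echo "$ips" | tr '\n' ' ')" >&2
      status=1
    else
      echo "OK: no VM public IPs in $rg"
    fi
  done
  return "$status"
}
```

Running `assert_no_public_ips rg-sv0-prod rg-sv0-dev` exits 0 only when both resource groups are clean. During a Scenario B failback the check is expected to fail for rg-sv0-prod until the emergency public IPs are removed again.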