ADR-022: Azure Compute Landing Zone

Status

Accepted — 2026-05-08

Amended — 2026-05-09: workspace topology collapsed to modest hybrid (one TFC workspace per environment, mixed-provider where needed) rather than split-by-resource-family. The amended text is the current §3 below. The bootstrap layer (bootstrap/) became a permanent local-apply directory instead of a TFC sv0-bootstrap workspace. The canonical reference for the trade analysis is sv0-infrastructure/README.md §"Topology principle". This ADR is the durable design record; cross-environment naming and RBAC scopes were updated in place to match the modest-hybrid form so the document does not contradict reality. Phase 3-bootstrap was executed under this amended topology on 2026-05-10.

Amended — 2026-05-10: staging environment promoted to a real deployment target, phasing reordered to stand staging up before prod. Three load-bearing changes:

  1. sv0-staging workspace promoted from "Atlas-drill only" to "staging environment + Atlas-drill." The existing workspace at envs/staging-ephemeral/ retains its drill capability (atlas_drill_enabled switch, default off) and gains Azure staging compute (staging_compute_enabled switch, default off). Both are independently gated; both produce zero resources when off. The working directory is renamed envs/staging-ephemeral/ → envs/staging/ to reflect the broader scope; the TFC workspace name (sv0-staging) is unchanged.

  2. Phasing reordered: staging before prod. The original 3a–3e plan started with prod compute and treated staging as a side concern. The amended plan ships a full Azure staging environment first (Phase 3b), validates the entire prod design end-to-end on a cheap throwaway environment, and only then provisions prod fleet (Phase 3c). Every operational concern — cloudflared HA, secrets-via-KV, pull-deploy, image-watcher cadence, Alloy log shipping, break-glass — gets exercised on staging before prod touches it. Runbook 12 holds the rewritten phase plan.

  3. Mongo lives on the staging VM, not on prod Atlas, by default. Staging runs a single Azure VM with a colocated Mongo container — same image pattern as today's Hetzner setup, no second VM to maintain. An env-var switch (MONGODB_URI) repoints staging at the prod Atlas sv0_staging database for the small set of cases that demand real-Atlas validation (driver/connector tuning, replica-set behavior, Atlas-specific monitoring). The sv0_staging database is pre-created on the prod M10 cluster (zero cost on a tier already paid for) so the switch is a config flip rather than a provisioning step. This same colocated-Mongo pattern applies to the deferred Phase 3f dev VMs.

The rationale is cost discipline: we have credit limits, an idle staging Atlas M10 was the largest avoidable line, and the prod M10 has spare capacity for an isolated sv0_staging database when end-to-end Atlas validation is genuinely needed.

Operationalises the Phase-3 commitment of ADR-018 and the cloud-pick from ADR-019. Paired with docs/runbooks/12-azure-vm-landing-zone.md, which holds the implementation plan and the migration sequencing.

Context

Three things are now true at once:

  1. The Hetzner footprint is the bottleneck. Two CPX21 instances (2 vCPU / 4 GB, Ashburn VA) host both dev.securityv0.com and app.securityv0.com. The dev box is shared by every open PR — each PR brings up its own pr-N-dev instance via Caddy drop-ins and a per-PR docker-compose project, all on a 4 GB host. We've already had one disk-full outage (2026-04-17) and several sustained periods where >5 concurrent PRs OOM-killed each other. The "single dev VM as a multi-tenant PR oven" model is the constraint, not the cloud bill.

  2. The MongoDB layer is already gone. sv0-prod Atlas M10 in EU_WEST_1 went live (epic #550, phase 2). The remaining workload on the Hetzner VMs is API + UI + Caddy. That's a notably smaller blast radius to migrate.

  3. The decisions ADR-018 and ADR-019 deferred are now load-bearing. ADR-018 said "managed platform within 3-6 months, no further Hetzner hardening." ADR-019 picked Terraform + TFC + tenant-parameterizable modules and named Phase 3 as the post-pilot compute landing zone. Both ADRs explicitly punted on which cloud and which topology. Epic #550's 2026-05-08 update locked Azure as the cloud. This ADR locks the topology.

What this ADR is not doing:

  • Not picking Container Apps / AKS / ACI. ADR-018 contemplated managed container platform as the eventual destination; this ADR does VMs first, on the explicit grounds that VMs are the cloud-portable substrate, and managed-PaaS lock-in is the wrong move while we still owe future migrations to AWS/GCP for partner/customer reasons.
  • Not migrating per-tenant workloads. Phase 5+ (post-cutover) when the first dedicated-deployment customer signs.
  • Not changing the GHCR / image-pipeline path. That layer is already cloud-agnostic.

What Ivan asked for, 2026-05-08

Not a 1:1 migration from Hetzner to Azure, but improving our situation overall and the reliability… utilize more options for VMs (maybe more VMs) for development and PR-based deployments because we have ~5,000 credits for us, and on Hetzner we were limited sometimes because we deployed too much stuff on development… separately, we need to remove one machine dependency, so we might need some kind of load balancing and several production machines. If one goes down, we still operate on another machine… keep it simple but effective and migratable to AWS later, so we don't want to depend too much on Azure specifics.

The four requirements, made concrete:

  1. Capacity headroom for dev/preview — solve the Hetzner OOM problem at the architecture level, not by buying a bigger box.
  2. Prod HA — no single-VM dependency.
  3. Cloud-portable — design must remain migratable to AWS/GCP later. Specifically: no managed-PaaS lock-in.
  4. Simple but effective — minimum surface area to deliver (1) + (2) + (3).

Decision

Adopt an Azure VM landing zone in westeurope that uses only IaaS primitives (VMs, VNet, NAT Gateway, Recovery Services Vault), routes all ingress through Cloudflare Tunnel (no public IPs, no Azure load balancer), runs prod across two availability zones with two cloudflared replicas providing the HA, and provisions a fresh ephemeral VM per open PR.

The decision has eleven load-bearing parts. Parts 1–8 lock the topology and identity. Part 9 specifies how app secrets reach VMs. Part 10 records the operational guardrails (cloudflared supervision, monitoring, lifecycle protection, PR-preview dependencies). Part 11 is the cloud-portability rule set every later Phase-3+ change must conform to.

1. Region: westeurope (Amsterdam)

Closest Azure region to Atlas EU_WEST_1 and Grafana Cloud Frankfurt. Note: Atlas EU_WEST_1 is AWS Ireland (Dublin) — Atlas region codes refer to the underlying cloud, and the sv0-prod cluster runs on AWS, not Azure. Amsterdam → Dublin is ~700 km / ~10–15 ms RTT, vs. ~80–120 ms from Hetzner Ashburn today. The migration shrinks but does not eliminate the cross-cloud hop. Largest Azure EU service catalog by quota and feature parity. Two-AZ minimum is satisfied (zones 1, 2, 3 available).

Rejected alternatives. germanywestcentral (Frankfurt) — narrower service catalog, slightly more expensive, sovereignty story not load-bearing for SecurityV0's data model (we don't store PII server-side; identity is WorkOS). eastus2 — would re-introduce the trans-atlantic hop to Atlas. westus3 — same. Moving Atlas onto Azure (so the cluster truly co-locates with compute) is a separate decision, gated by Atlas-on-Azure pricing in EU and revisited when this ADR is.

2. Subscription: reuse Azure subscription 1 (2a25bc41-c1ce-4d04-9cb6-a62deccc3bcc)

sv0-connectors/infra/ and sv0-demo-labs/labs/*/infra/ already deploy here. Splitting now means provisioning a second subscription, transferring credits, and re-bootstrapping IAM — all for blast-radius isolation we don't need yet. Revisit when (a) credit allocation needs to be split per-product, (b) regulatory scope demands subscription-level data isolation, or (c) we onboard a dedicated-deployment customer that wants their stack in their own subscription.

3. TFC workspace topology

Modest hybrid: one HCP Terraform workspace per environment, mixed-provider where needed. Four workspaces extend the existing sv0-infrastructure repo. The original four-workspace split-by-resource-family proposal was collapsed on 2026-05-09 (see the first Status amendment); the sv0-staging workspace was promoted from "Atlas-drill only" to "staging environment" on 2026-05-10 (second amendment).

| Workspace | Path | Auto-apply | Holds |
|---|---|---|---|
| sv0-shared | envs/shared/ | off (two-person approval) | Mixed-provider. Cloudflare account-level config (existing). Azure: VNet, subnets, NSGs, single zonal NAT Gateway, Compute Gallery image definitions, Recovery Services Vault, Key Vaults (kv-sv0-prod, kv-sv0-dev, kv-sv0-staging). |
| sv0-staging | envs/staging/ | off | Mixed-purpose, double-gated. (a) Pre-existing Atlas drill capability: the atlas_drill_enabled master switch spins up an isolated ephemeral M10 cluster for PITR restore drills, version upgrade rehearsals, perf load-tests; zero resources when off. (b) New Azure staging environment: the staging_compute_enabled master switch spins up 1 Azure VM (colocated app + Mongo container) on staging.securityv0.com; zero compute when off, OS disk persists for cheap warm restart. Phase 3b (this amendment). |
| sv0-prod | envs/prod/ | off (two-person approval) | Mixed-provider. Atlas sv0-prod cluster (existing, holds sv0_prod + sv0_dev + sv0_staging databases). Cloudflare prod tunnel/DNS. Azure prod fleet VMs, prod Managed Identity, lifecycle-protected. |
| sv0-dev | envs/dev/ | on | Deferred (Phase 3f). Long-running dev VMs with simple Cloudflare-Access SSH. Same colocated-Mongo-container pattern as Hetzner today and staging: no separate Atlas database is needed once dev is on its own VM. The transitional sv0_dev carve-out on the prod cluster (ADR-020 Phase 0) retires at that time. |

Why "modest hybrid" rather than split-by-resource-family: the alternative (sv0-shared-network, sv0-prod-compute, sv0-dev-compute, plus a sv0-bootstrap) would have given finer blast-radius isolation at the cost of more cross-workspace terraform_remote_state reads. At our team size the cross-workspace coordination overhead exceeds the blast-radius benefit. sv0-infrastructure/README.md §"Topology principle" holds the full pros/cons.

Why dev runs with auto-apply on: it matches the Hetzner cadence (every main merge auto-deploys to dev), and an unintended dev-VM change costs ~5 minutes of dev downtime, which is a recoverable error. Prod and shared stay off, with two-person approval.

Bootstrap is local-apply, not a TFC workspace. Federation setup (the tfc-sv0-infrastructure Azure AD app, its per-workspace federated credentials, the state-backup storage account, and the break-glass SP) lives in sv0-infrastructure/bootstrap/ and is applied locally by an Owner-scoped operator. The state file stays on the operator's machine and is backed up to 1Password after meaningful changes. Putting bootstrap into a TFC workspace would create a chicken-and-egg dependency — the federation that TFC uses to reach Azure cannot itself be created by TFC. See §7a.

PR-preview VMs are deliberately not in TF state. They are created and destroyed by GitHub Actions workflows calling az directly, in their own resource group (see §6 below). Putting ephemeral compute in long-lived TF state generates churn that a human reviewer would have to plan-approve dozens of times a day.

4. VM topology

| Tier | Count | Size | Zone(s) | Role |
|---|---|---|---|---|
| Prod | 2 | Standard_B2s (2 vCPU, 4 GB) | 1 + 2 (zone-spread) | API + UI + cloudflared + image-watcher. Each is a full-stack replica. |
| Dev | 1 | Standard_B2s (2 vCPU, 4 GB) | 1 | API + UI + cloudflared + image-watcher on the dev branch. |
| PR preview | 0–N (ephemeral) | Standard_B2s (default; tunable) | 1 | One VM per open PR, with a cap of 10 concurrent. See sizing note below. |

Defaults are encoded as Terraform variables in the compute module — switching a VM size does not require module edits.

PR-preview sizing — B2s default, B1ms candidate after measurement. The Hetzner footprint shows the API + UI + Docker + OS together fitting comfortably in 4 GB. The earlier draft proposed B1ms (1 vCPU, 2 GB) for PR previews to save cost; Codex review flagged this as unsupported. Since we have not measured the actual idle and steady-state RSS of the API + UI + cloudflared + image-watcher together on a 2 GB VM, the safer default is the same B2s as dev. Phase 3d collects memory data from the first 10 PR-preview VMs; if idle RSS comfortably fits 1.5 GB with margin, the Terraform default flips to B1ms in a follow-up PR. The cost delta is bounded — see runbook §"Cost estimate".

Why B-series. Burstable B-series is the right shape for our load profile (idle most of the time, occasional bursts during deploys / PR builds). The credit-banking model is provisional — Phase 3a includes a CPU-credit-balance Grafana panel and an alert at <30% credits. If prod fleet credit balance trends downward during normal hours, the Terraform variable flips to D2as_v5 or equivalent non-burstable.

Why two prod replicas, not three. Three is the textbook quorum size for stateful systems. Our prod fleet is stateless — Atlas is the durable layer. Two is the smallest count that satisfies "no single-VM dependency"; adding a third doesn't materially change the failure model and doubles incremental cost. Revisit if traffic ever exceeds what one B2s can serve when its peer is down.

Why a separate dev VM rather than shared with PR previews. The Hetzner failure mode was exactly that: dev branch and N PRs co-tenanted on one box, with no isolation. Giving dev its own VM and giving each PR its own VM gives architectural isolation — one PR's runaway memory cannot crash another PR or the dev branch.

Why per-PR VMs, not container-per-PR on a shared box. Tested at Hetzner; it doesn't work past ~5 PRs on a 4 GB host. Bigger shared host raises the ceiling but doesn't change the failure mode (one bad PR still takes the whole pool down). Per-PR VM is the architectural fix. Cost is bounded by the concurrency cap (§6).

5. Networking: Cloudflare Tunnel ingress, Azure NAT egress, no public VMs

All ingress to all VMs (prod, dev, PR previews) goes through Cloudflare Tunnel. There are no Azure public IPs on any VM and no Azure inbound load balancer.

Two things the diagram makes load-bearing:

  • Ingress is outbound from the VM perspective. cloudflared dials out to Cloudflare's edge; no inbound NSG rule, no public IP. Failure mode if Cloudflare is down: both VMs are unreachable from the internet even though they are healthy. The break-glass procedure (runbook §"Scenario B") adds a public IP per VM as the failover path.
  • Zone-2 egress goes through the zone-1 NAT. A zone-1 outage takes egress down for zone-2 VMs too: they lose Atlas/GHCR/WorkOS reachability until the NAT recovers or the per-zone subnet split (Phase-4+) ships. Inbound requests still reach zone-2 VMs during a zone-1 outage, but any request that depends on Mongo fails.

5a. Ingress and HA characteristics

Each prod VM runs a cloudflared replica pointed at the same sv0-prod tunnel. Per Cloudflare's documentation, multiple replicas of the same tunnel give:

  • Connection redundancy. Each replica establishes 4 outbound connections spread across at least two Cloudflare edge data centers; new requests pick a healthy connection.
  • Nearest-replica routing. Cloudflare prefers the geographically nearest healthy replica.
  • Failover on connection loss. If a replica's edge connection drops, in-flight requests on it can fail; new requests route to surviving replicas. Long-lived connections (WebSocket, SSH-over-Access, server-sent events) may need to reconnect on replica loss.

What multi-replica Tunnel does NOT give out of the box:

  • Health-checked, latency-aware L7 load balancing across replicas.
  • Programmatic alerting on replica-down.
  • Active failover for in-flight long-lived connections.

If the observed failover behavior in Phase 3b is unacceptable, the documented escalation is to add Cloudflare Load Balancer (separate paid product) with one origin per VM and active health checks. We do not commit to that in this ADR; we commit to measuring failover behavior with two replicas first and revisiting if it is too slow.

The platform's request profile today is short-lived REST — failover-on-reconnect is acceptable. If we add SSE / WebSocket endpoints, this section gets revisited.

PR previews each get their own dedicated tunnel scoped to pr-N-dev.securityv0.com (single replica, no HA need).

5b. Egress and stable Atlas-allowlist IP

Outbound from VMs (apt, GHCR, Atlas, WorkOS API, Grafana Cloud remote_write) goes through Azure NAT Gateway. Azure's constraint: a subnet can be associated with at most one NAT Gateway, and Standard NAT Gateway is a zonal resource. This rules out my earlier "two NATs on the same subnet" sketch.

The chosen design:

  • Single Standard NAT Gateway in zone 1, attached to all subnets (snet-staging, snet-prod, snet-dev, snet-pr-previews, snet-shared). One static public IP for the Atlas allowlist. The Azure constraint is asymmetric: a subnet may be associated with at most one NAT Gateway, but a NAT Gateway can serve many subnets — so the one NAT is shared across every environment.
  • Documented zonal failure mode: a zone-1 outage takes egress with it. VMs in zone 2 lose Atlas/GHCR reachability until the NAT recovers or DNS is re-pointed. This is acceptable because (a) we are pre-revenue, (b) Azure zonal outages are minutes-to-hours and rare, (c) Atlas itself is in a different cloud and unaffected.
  • Phase 3c validates the NAT Gateway IP shows up in Atlas IP allowlist and that VMs in both zones egress through it as expected.

Cost amortization. Standard NAT Gateway is ~$35/month (1 IP × 720 h × $0.045/h = ~$32/mo fixed, plus per-GB egress). That cost is paid once for the entire org, not per environment. The effective per-VM NAT cost falls as the substrate fills:

| Stage | Compute footprint | Effective NAT $/VM-month |
|---|---|---|
| Phase 3b (dev spike or staging alone) | 1 VM | $35 |
| Phase 3b + dev spike | 2 VMs | $17.50 |
| Phase 3c.2 (staging + 2 prod) | 3 VMs | $11.67 |
| Phase 3d steady state (+ avg 5 PR previews + dev) | ~9 VMs | $4 |
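The amortization arithmetic can be reproduced with a few lines of Python. This is a minimal sketch that uses only the figures quoted in this section (the $0.045/h rate, 720 hours, and the ~$35/month planning figure); the function names are illustrative, and per-GB egress is left out because it depends on traffic.

```python
# Figures quoted in this section; per-GB egress is excluded on purpose.
HOURLY_RATE_USD = 0.045     # hourly NAT Gateway charge, per the text
HOURS_PER_MONTH = 720
ALL_IN_MONTHLY_USD = 35.0   # the ~$35/month planning figure used in the table


def fixed_monthly_cost(hourly_rate: float = HOURLY_RATE_USD,
                       hours: int = HOURS_PER_MONTH) -> float:
    """Fixed NAT Gateway charge for one month, before per-GB egress."""
    return round(hourly_rate * hours, 2)


def nat_cost_per_vm(vm_count: int,
                    monthly_cost: float = ALL_IN_MONTHLY_USD) -> float:
    """Share of the single shared NAT Gateway attributed to each VM."""
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    return round(monthly_cost / vm_count, 2)


if __name__ == "__main__":
    print(f"fixed charge: ${fixed_monthly_cost():.2f}/month")
    for vms in (1, 2, 3, 9):
        print(f"{vms} VM(s): ${nat_cost_per_vm(vms):.2f} per VM-month")
```

The per-VM share is a bookkeeping view only: the bill stays roughly constant while the divisor grows with each phase.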

This means an isolated spike VM appears to "pay" the full NAT cost, but that perception is a phase-transition artifact — the NAT serves the whole org and amortizes as Phase 3b/3c/3d/3f land. The per-VM Public IP alternative (Public IP per VM, no NAT) was considered and rejected: it (a) breaks the "stable Atlas allowlist IP" property which prod requires, (b) creates a different egress topology in dev than in prod (the spike's whole purpose is to validate the same path prod will use), and (c) is only ~$32/mo cheaper while a single VM is the sole user — savings disappear at 2 VMs.

Phase-4+ upgrade path if zonal NAT becomes a real availability problem: split into per-zone subnet stacks (snet-prod-z1, snet-prod-z2, etc.) each with its own NAT Gateway, or migrate to a zone-redundant NAT (NAT Gateway v2 / Azure Firewall) when GA features support it. This is an explicit deferred decision, not a missing one.

5c. SSH access — what changes, and what does NOT change

The host's sshd continues to run with host SSH keys (server identity). What changes:

  • No public network reachability for sshd. No public IP on VMs; no inbound NSG rule on port 22; sshd is reachable only from inside the VNet and through Cloudflare Tunnel.
  • User authentication moves from ~/.ssh/authorized_keys to short-lived certificates issued by Cloudflare Access for Infrastructure. Each VM's sshd_config is set to:
    • PubkeyAuthentication yes
    • TrustedUserCAKeys /etc/ssh/cloudflare-access.pub — the Cloudflare Access SSH CA public key
    • AuthorizedPrincipalsFile /etc/ssh/principals/%u — maps Cloudflare identity to local user
    • PasswordAuthentication no
  • No long-lived authorized_keys for any deploy or human user. The deploy@ SSH key + GitHub-Actions-managed DEPLOY_SSH_KEY pattern from ADR-018 is retired.
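A Phase 3a verification step could check these directives mechanically. The sketch below is one hypothetical way to do that in Python: it parses sshd_config text and reports any directive from the list above that is missing or set differently. The function names and sample text are assumptions for illustration, and the parser deliberately ignores Match blocks and Include files.

```python
# Directives listed above. Keys are lowercased because sshd keywords are
# case-insensitive; argument values are compared exactly.
REQUIRED_DIRECTIVES = {
    "pubkeyauthentication": "yes",
    "trustedusercakeys": "/etc/ssh/cloudflare-access.pub",
    "authorizedprincipalsfile": "/etc/ssh/principals/%u",
    "passwordauthentication": "no",
}

SAMPLE_CONFIG = """\
# Hypothetical sshd_config fragment for illustration
PubkeyAuthentication yes
TrustedUserCAKeys /etc/ssh/cloudflare-access.pub
AuthorizedPrincipalsFile /etc/ssh/principals/%u
PasswordAuthentication no
"""


def parse_sshd_config(text: str) -> dict:
    """Return keyword -> value, keeping the first value seen per keyword
    (sshd uses the first obtained value for each keyword)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts
        settings.setdefault(keyword.lower(), value.strip())
    return settings


def find_violations(text: str) -> dict:
    """Map each required directive that is missing or wrong to what was found."""
    actual = parse_sshd_config(text)
    return {
        key: {"expected": expected, "actual": actual.get(key)}
        for key, expected in REQUIRED_DIRECTIVES.items()
        if actual.get(key) != expected
    }


if __name__ == "__main__":
    print(find_violations(SAMPLE_CONFIG) or "sshd_config matches the ADR")
```

An empty result means the four directives are present with the expected values; it says nothing about the rest of the file.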

Human emergency access — three tiers, deliberately layered

Even with pull-based deploy retiring all CI SSH, humans still need machine access for critical emergencies (live debugging, manual intervention when an automated path is broken). The design provides three independent access tiers so no single failure removes all human reach to the box.

| Tier | When | Path | Authentication / Authorized | Audit |
|---|---|---|---|---|
| 1: Normal + most emergencies | Default. Use this unless tier 1 itself is down. | cloudflared access ssh --hostname <vm>.<account>.cloudflareaccess.com → browser-based Cloudflare Access SSO → Access mints a short-lived (1 h) user cert → sshd validates against TrustedUserCAKeys | Cloudflare Access policy: Entra IdP federated (see prerequisite below), MFA required, restricted to named humans (Ivan, Sergey). No service-token path on this hostname. | Cloudflare Access logs every session (origin IP, identity, cert serial, duration). Shipped to Loki via the Access log push. |
| 2: Cloudflare Access itself is broken | Tier 1 fails because Cloudflare Tunnel/Access is degraded, sshd is wedged, or the cert path is broken. Networking does not need to be healthy. | Azure Serial Console via Azure Portal or az CLI: az serial-console connect --name <vm> --resource-group <rg>. Hits the VM's hypervisor-level serial port, completely separate from sshd, NSG, or any in-VM network state. | Azure RBAC: custom role sv0-serial-console-operator (provides Microsoft.SerialConsole/serialPorts/connect/action + Microsoft.Compute/virtualMachines/read) on the target VM, plus Entra ID with MFA. Granted to Ivan + Sergey via Entra group sv0-vm-emergency-ops. Note: the built-in Virtual Machine User Login role does NOT include the SerialConsole action and is not sufficient. | Every session emits an entry to the Azure Activity Log: Microsoft.SerialConsole/serialPorts/connect/action with the caller's object ID. Activity log streamed to Grafana Cloud Loki. |
| 3: Both Cloudflare AND TFC are unreachable | Worst case. The compute itself needs intervention (NSG rule change, restart, snapshot restore) and the IaC pipeline can't run it. | The break-glass procedure in runbooks/12-azure-vm-landing-zone.md "Scenario C": pull sv0-azure-break-glass SP from 1Password, az login --service-principal, pull state from Blob Storage backup, local terraform apply -target. | 1Password sv0-infra vault, MFA-required, shared with named individuals. SP RBAC: Contributor on rg-sv0-prod, Reader on rg-sv0-shared, Backup Contributor on rg-sv0-prod, Storage Blob Data Reader on the state-backup storage account. | 1Password access log + Activity Log on every Azure mutation. Mandatory post-incident report (within 24 h) covering what was changed and how state will be reconciled. |

Entra IdP is a Cloudflare-side prerequisite, not an Azure resource. Cloudflare Access does not federate to Azure Entra out of the box: it requires an Azure App Registration on the Entra side (~30 minutes one-time), plus configuration in the Cloudflare Zero Trust admin UI to register Entra as an identity provider. Until that's done, Access falls back to whatever IdP is configured (today: GitHub IdP). The Phase 3a PR is the place to land this; without it, the Phase 3b/3c Access apps cannot enforce the "Entra MFA" policy this ADR commits to. The spike PR (sv0-infrastructure#26) confirmed this gap; runbook 12 §Phase 3a lists Entra IdP setup as a deliverable.

Why Azure Serial Console for tier 2 (instead of Azure Bastion): Serial Console is free, ships in every Azure subscription, and gives terminal access at the VM's hypervisor serial port — so it works even when in-VM networking is broken or sshd is wedged. Azure Bastion is a managed PaaS that costs ~$140/month and proxies SSH/RDP through a public-IP'd jump host; for our scale and emergency-only usage, Bastion is over-engineered and adds an Azure-specific surface. AWS and GCP both have 1:1 equivalents (EC2 Serial Console, GCP Compute Engine Serial Port), so this choice is cloud-portable per §11.

Per-VM Serial Console enablement is set in the Compute Gallery image (enable_serial_console = true on the VM resource) and verified in the Phase 3a verification checklist. Boot diagnostics are stored in a rg-sv0-shared storage account with 30-day retention.

Required hygiene around all three tiers:

  • Every emergency session (tier 2 or 3) requires a follow-up incident note in the next-day standup, even if the issue was minor. The point is not punishment; it's so we know which automated paths to fix so we don't need emergency access next time.
  • Quarterly emergency-access drill: Ivan opens a tier-2 session against a non-prod VM and runs through a small task list (read a log, restart a service). Confirms the path is alive; confirms the audit trail captures the event. If a step has gone stale, the runbook gets updated.
  • The break-glass SP credential is rotated annually OR immediately after any tier-3 use, whichever is sooner.
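The rotation rule in the last bullet is easy to get subtly wrong in a reminder script, so here is a small sketch of it as a predicate. It assumes "annually" means 365 days and that tier-3 usage dates are recorded somewhere queryable; both are assumptions, and the function name is hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

ROTATION_INTERVAL = timedelta(days=365)  # assumption: "annually" = 365 days


def rotation_due(last_rotated: date,
                 today: date,
                 last_tier3_use: Optional[date] = None) -> bool:
    """True when the break-glass credential must be rotated now.

    Rotation is due immediately after any tier-3 use that happened on or
    after the last rotation, or once the annual interval has elapsed,
    whichever comes first.
    """
    if last_tier3_use is not None and last_tier3_use >= last_rotated:
        return True
    return today - last_rotated >= ROTATION_INTERVAL


if __name__ == "__main__":
    # Rotated in January, used for break-glass in March: rotate now.
    print(rotation_due(date(2026, 1, 10), date(2026, 3, 2),
                       last_tier3_use=date(2026, 3, 1)))
```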

5c.1. Perimeter (CF Access) vs app-layer (WorkOS) auth — two different doors

Cloudflare Access and WorkOS protect different layers and use different IdPs. The auth-simplification owner's verdict (2026-05-11) confirmed that running two IdPs at two layers is not a contradiction.

  • WorkOS = the door on the app. Inside the Node process. Decides "who is this principal, what are their permissions." Used by every customer, every staff member, every bot. Drives the five-kind principal model and the super-admin signal (WorkOS organization membership).
  • CF Access = the door at the network. At Cloudflare's edge, in front of the app. Decides "can this person reach this URL at all," before the Node process sees the request. Federated to Entra ID for staff identity, gated by the Entra group sv0-vm-emergency-ops for SSH, and by named-humans-only Access policies for HTTP endpoints.

Why two doors on some URLs and one on others:

| Surface | Doors | Why |
|---|---|---|
| app.securityv0.com (prod) | 1 (WorkOS only) | Customer-facing. Anyone can reach the URL; they bounce off the WorkOS login page. |
| staging.securityv0.com | 1 (WorkOS only) | Staging is supposed to mirror prod's auth posture so end-to-end auth validation is meaningful. Same one-door shape as prod. |
| dev.securityv0.com (current Hetzner; future Azure dev) | 2 (CF Access + WorkOS) | Dev runs broken / WIP / half-deployed code; we don't want the URL indexed or scraped. |
| dev-azure.securityv0.com (Phase 3b spike) | 2 (CF Access + WorkOS) | Same reasoning as dev. Matches the dev pattern so promotion to dev.securityv0.com later is no-change. |
| pr-N-dev.securityv0.com (PR previews) | 2 (CF Access + WorkOS) | Same as dev. |
| Tier-1 SSH to any VM | 1 (CF Access only) | No app behind SSH; CF Access is the only door. Forced to have an IdP: Entra direct. |

Hard rule for the app code: nothing in the Node process may read the Cf-Access-Jwt-Assertion header to derive identity. App identity stays WorkOS-only. CF Access is a perimeter gate, not an auth principal source. This matches the auth-simplification "one signal per authorization decision" rule.

Hard rule for the perimeter: CF Access policies use Entra group membership (sv0-vm-emergency-ops for SSH, named-humans for HTTP). No parallel allowlist of email addresses in the Access app config.

5d. CI deploys: pull-based, not SSH-push

The original draft attempted "Cloudflare Access service-token-based SSH from CI." That conflates two unrelated Cloudflare features — service tokens authenticate HTTP requests, they do not authenticate SSH sessions. The original sketch is not implementable as written.

The cleaner replacement is pull-based deploy. CI's only job is to push container images to GHCR. Each VM polls its image tag and rolls itself when a new tag is published.

| Path | Today (Hetzner) | After Phase 3a |
|---|---|---|
| Public HTTPS to prod | Internet → Cloudflare DNS A record → Hetzner public IPv4 → Caddy:443 → nginx → containers | Internet → Cloudflare edge → Cloudflare Tunnel → cloudflared on VM → nginx → containers |
| Public HTTPS to dev | Same pattern, Hetzner public IPv4 | Same pattern (one replica) |
| Public HTTPS to PR-N | pr-N-dev.securityv0.com → dev box Caddy drop-in → per-PR compose | pr-N-dev.securityv0.com → CF tunnel → cloudflared on PR's own VM |
| Human SSH (normal) | ssh deploy@<host> with sv0-deploy-prod key | cloudflared access ssh --hostname <vm>.<account>.cloudflareaccess.com (short-lived cert via Access SSH CA) |
| Human emergency (Access SSH down) | ssh root@<host> with 1Password agent | Azure Serial Console via Portal or az serial-console connect. Independent code path that works when sshd or networking is broken. |
| Human break-glass (TFC + CF down) | n/a (improvised) | Documented in runbook §"Scenario C": pull break-glass SP from 1Password, local terraform apply -target against state replica in Blob Storage. |
| CI deploy | DEPLOY_SSH_KEY GHA secret → SSH push → docker compose pull && up -d | CI pushes ghcr.io/...:sha-<commit> and updates a tiny pointer doc in GHCR / object storage; on each VM, a sv0-image-watcher systemd unit (15 s poll) sees the new pointer, pulls, and runs docker compose up -d. No CI-to-VM session at all. |
| Outbound from VMs | Hetzner default egress | NAT Gateway, single static egress IP |

The pointer-doc indirection (instead of :latest) lets prod and dev pin different tags while sharing the same image-watcher mechanism. Rollback is "set the prod pointer back to a previous SHA"; the watcher does the rest.
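The decision the watcher makes on each poll can be sketched as a pure function. The example below is a hedged illustration, not the sv0-image-watcher implementation: the pointer-doc format (a JSON object mapping environment name to image tag) and the function names are assumptions, and the actual pull and docker compose steps are omitted.

```python
import json
from typing import Optional


def desired_tag(pointer_doc: str, environment: str) -> str:
    """Read the tag an environment is pinned to.

    Assumes the pointer doc is JSON such as
    {"prod": "sha-1111111", "dev": "sha-2222222"}.
    """
    pointers = json.loads(pointer_doc)
    return pointers[environment]


def plan_roll(pointer_doc: str,
              environment: str,
              running_tag: Optional[str]) -> Optional[str]:
    """Return the tag to roll to, or None when the VM already runs it."""
    target = desired_tag(pointer_doc, environment)
    return None if target == running_tag else target


if __name__ == "__main__":
    doc = '{"prod": "sha-1111111", "dev": "sha-2222222"}'
    # Prod is already on its pinned tag; dev needs to roll forward.
    print(plan_roll(doc, "prod", "sha-1111111"))
    print(plan_roll(doc, "dev", "sha-1111111"))
```

Under this shape, a rollback is the same code path as a deploy: editing the prod entry back to an earlier SHA makes plan_roll return that SHA on the next poll.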

GHCR auth for the watcher uses the VM's Azure Managed Identity → Federated GitHub token (via the existing GitHub→Azure AD federation pattern). No long-lived GHCR PAT on the VM.

5e. Why no Azure LB

Cloudflare Tunnel + DNS handles ingress, including the modest L4 LB role we'd otherwise need. Azure Standard LB has features (zone redundancy, native backend health) but every one of those features couples us to Azure semantics; on AWS the equivalent is Network Load Balancer or Application Load Balancer with different config and different IAM. Cloudflare Tunnel behaves identically on Hetzner, Azure, AWS, GCP, or bare metal. This is the single biggest cloud-portability win in the design — at the cost of concentrating ingress dependency on Cloudflare (see "Risks" below).

5f. Risks of the Cloudflare-centric path

  • Cloudflare in the request path. Cloudflare is already in the path today (Cloudflare Access at the edge). Tunnel + Access SSH concentrates risk we already carry, rather than introducing new risk. Break-glass to direct-public-IP + Caddy is documented in the runbook (with a tested local-apply procedure for the case where TFC is also unreachable).
  • Tunnel multi-replica is failover, not health-checked LB. As §5a states. If the observed failover characteristics in Phase 3b are bad, escalate to Cloudflare Load Balancer. Don't pretend the free tunnel feature is more than it is.
  • Cloudflare Access SSH latency spikes on reconnect. Acceptable for human SSH; deploy doesn't use SSH.
  • Pull-based deploy ≠ instant deploy. Worst case 15 s + container restart on a watcher cycle. Acceptable for our cadence.
  • GHCR is the new single point of failure for deploy. It already was (CI pushes images there) — pull-based deploy doesn't add coupling, it removes the "CI also has to reach the VM" coupling.

6. PR-preview VMs: ephemeral, GitHub-Actions-managed

PR previews are not Terraformed. They are created and destroyed by GitHub Actions workflows in sv0-platform's .github/workflows/. All workflows use the same OIDC federation TFC uses, so no service-principal secret is mounted to GitHub Actions.

Lifecycle has three distinct PR-event phases (create, update, destroy) plus two daily scheduled sweeps, not "one workflow runs on every PR event":

| Trigger | Workflow | Action | Typical latency |
|---|---|---|---|
| pull_request: opened, reopened | preview-create.yml | Idempotent: if rg-sv0-pr-previews-pr-N already exists, skip Azure provisioning. Otherwise: az group create, az vm create from a pre-baked Compute Gallery image, cloud-init installs cloudflared and the image-watcher, Cloudflare API creates a PR-scoped tunnel + DNS CNAME pr-N-dev.securityv0.com (tagged with sv0:pr-preview=N for orphan reconciliation) | ~3 min total: VM boot + cloud-init + cloudflared register + first GHCR pull |
| pull_request: synchronize | preview-deploy.yml | Container update only; does NOT re-create VM. CI builds + pushes :pr-N tag to GHCR. The VM's image-watcher (§5d) sees the new tag and rolls containers. Workflow itself only updates the GHCR image; the VM does the rest. | ~30 s after CI publishes |
| pull_request: closed | preview-destroy.yml | Three-step cleanup, all idempotent: (1) az group delete --name rg-sv0-pr-previews-pr-N --yes --no-wait, (2) Cloudflare API delete tunnel sv0-pr-N, (3) Cloudflare DNS record delete for pr-N-dev.securityv0.com. All three logged; failure of any step fires an alert and leaves the others completed. | ~10 s (RG delete is async on Azure side) |
| Daily scheduled, idle >7 days | preview-reaper.yml | Same as preview-destroy.yml for any PR-preview RG matching idle criteria. Posts a PR comment noting the reap so the developer knows what happened. | n/a |
| Daily scheduled, orphan reconciler | preview-reconcile.yml | Cross-check Cloudflare tunnels and DNS records tagged sv0:pr-preview=* against open PRs and live Azure RGs. Anything tagged for a PR that is closed AND has no Azure RG → reap the Cloudflare side. Alerts on any orphan that survives one cycle. | n/a |

Why three workflows, not one. Codex review caught the original "one workflow on every PR event" sketch as the source of two bugs: it would re-create VMs on every push (expensive and slow), and it framed Azure RG deletion as if that also cleaned Cloudflare resources (it doesn't — Azure RG only deletes Azure things). Splitting create/update/destroy makes each workflow's idempotency contract explicit, and the orphan reconciler closes the cross-cloud cleanup gap.
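
The reconciler's decision rule ("PR is closed AND has no Azure RG") can be sketched as a pure function. This is a minimal illustration, not the workflow's actual code: `TaggedResource`, `find_orphans`, and the input shapes are hypothetical names, and a real run would populate them from the Cloudflare API, the GitHub API, and the Azure RG listing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedResource:
    """A Cloudflare tunnel or DNS record tagged sv0:pr-preview=N (hypothetical shape)."""
    kind: str        # "tunnel" or "dns"
    name: str        # e.g. "sv0-pr-41" or "pr-41-dev.securityv0.com"
    pr_number: int   # the N from the sv0:pr-preview=N tag


def find_orphans(resources, open_prs, live_rg_prs):
    """Return resources whose PR is closed AND has no live Azure resource group."""
    return [
        r for r in resources
        if r.pr_number not in open_prs and r.pr_number not in live_rg_prs
    ]


resources = [
    TaggedResource("tunnel", "sv0-pr-41", 41),                 # closed, RG gone
    TaggedResource("dns", "pr-41-dev.securityv0.com", 41),     # closed, RG gone
    TaggedResource("tunnel", "sv0-pr-42", 42),                 # PR still open
    TaggedResource("tunnel", "sv0-pr-43", 43),                 # closed, RG delete in flight
]
orphans = find_orphans(resources, open_prs={42}, live_rg_prs={42, 43})
for r in orphans:
    print(f"reap {r.kind} {r.name}")
```

PR 43 is deliberately left alone in this example: its RG still exists, so the destroy workflow (or the next reconcile cycle) remains responsible for it.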

Hard cap: 10 concurrent PR-preview VMs. preview-create.yml checks open RG count first. Behavior at PR 11:

  1. Workflow fails fast with a status check on the PR titled "Preview capacity full (10/10)".
  2. PR comment lists the other PRs holding capacity, sorted by last-activity-time, and tells the developer that the oldest idle PR will be reaped on the next daily reaper cycle.
  3. If the developer needs a slot sooner, they can either close one of their other PRs, or comment /preview reap-oldest-idle to trigger an immediate reaper run scoped to their PRs only. (Implementation detail for Phase 3d, not committed in this ADR.)
  4. The PR keeps building containers and merging works fine without a preview — preview is a developer convenience, not a CI gate.

This is "fail-with-clear-explanation", not "PR is bricked." Acceptable.
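
The admission check at the cap can be sketched as follows. This is an illustration under stated assumptions: `Preview`, `admit`, and the message wording beyond the status-check title are hypothetical, and "sorted by last-activity-time" is read here as oldest activity first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_PREVIEWS = 10  # hard cap stated in this section


@dataclass(frozen=True)
class Preview:
    pr_number: int
    last_activity: datetime


def admit(pr_number, active):
    """Return (allowed, message) for a PR asking for a preview VM."""
    if any(p.pr_number == pr_number for p in active):
        return True, "preview already exists; skip provisioning"
    if len(active) < MAX_PREVIEWS:
        return True, f"capacity available ({len(active)}/{MAX_PREVIEWS})"
    # Assumption: the PR comment lists holders with the oldest activity first.
    holders = sorted(active, key=lambda p: p.last_activity)
    listing = ", ".join(f"#{p.pr_number}" for p in holders)
    return False, (
        f"Preview capacity full ({len(active)}/{MAX_PREVIEWS}). "
        f"Holders, oldest idle first: {listing}"
    )


now = datetime(2026, 5, 10, tzinfo=timezone.utc)
# Ten active previews; PR n was last touched n days ago, so PR 10 is the oldest.
active = [Preview(n, now - timedelta(days=n)) for n in range(1, 11)]
allowed, message = admit(11, active)
print(allowed, message)
```

The first branch mirrors the idempotency rule of preview-create.yml: a PR that already holds a slot is never counted against the cap a second time.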

Why the dedicated rg-sv0-pr-previews-pr-N resource group per PR. The RG is the Azure-side cleanup boundary. If anything in the per-PR provisioning is broken — a half-installed cloud-init, a hung VM, a stray disk — az group delete --yes --no-wait removes the entire blast radius in one operation. The matching Cloudflare-side cleanup (tunnel + DNS) runs alongside; the orphan reconciler is the safety net.
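
The "failure of any step fires an alert and leaves the others completed" contract of preview-destroy.yml can be sketched like this. The step functions and `run_cleanup` are hypothetical stand-ins; real steps would wrap the az and Cloudflare API calls named in the table above.

```python
def run_cleanup(steps, alert):
    """Run every step even if an earlier one fails; alert once per failure.

    steps: list of (name, zero-argument callable). Returns the names that failed.
    """
    failed = []
    for name, step in steps:
        try:
            step()
            print(f"ok: {name}")
        except Exception as exc:  # a failing step must not block the remaining steps
            failed.append(name)
            alert(f"{name} failed: {exc}")
    return failed


def delete_rg():
    pass  # would wrap: az group delete --name rg-sv0-pr-previews-pr-N --yes --no-wait


def delete_tunnel():
    raise RuntimeError("Cloudflare API returned 500")  # simulated failure


def delete_dns():
    pass  # would wrap the Cloudflare DNS record delete


alerts = []
failed = run_cleanup(
    [("azure-rg", delete_rg), ("cf-tunnel", delete_tunnel), ("cf-dns", delete_dns)],
    alert=alerts.append,
)
print(failed, alerts)
```

Because each step is idempotent, re-running the whole sequence after a partial failure is safe; anything still left behind falls to the orphan reconciler.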

7. Identity for the Terraform pipeline: OIDC federation, no static SP secret

The bootstrap layer owns one Azure AD app registration (tfc-sv0-infrastructure) with federated credentials per TFC workspace per phase. Per HashiCorp's documented pattern, each workspace gets two federated credentials — one for terraform_run_phase:plan, one for terraform_run_phase:apply — so plan-phase tokens are read-only and apply-phase tokens are write-scoped. This honours ADR-019 §5.

Three things the diagram makes load-bearing:

  • No static SP secret in TFC. Each TFC run mints a short-lived OIDC token, exchanges it for an Azure access token via the federation, and discards it. The only static credential in the chain is the break-glass SP in 1Password, used only when TFC itself is unreachable.
  • The federation app is created out-of-band. TFC cannot create the federation it uses to reach Azure — that's the chicken-and-egg. Bootstrap (sv0-infrastructure/bootstrap/) is a permanent local-apply directory the Owner runs from their own Azure account; it owns the app registration, all federated credentials, the break-glass SP, the custom RBAC role, the Entra emergency-access group, and the state-backup storage account.
  • Plan and apply have separate federated credentials. A leaked plan-phase token is read-only: it cannot mutate Azure state. Eight federated credentials total (4 workspaces × {plan, apply}), each with its own subject claim. The sv0-staging and sv0-dev federations are created in bootstrap alongside the others, even though their workspaces don't hold compute yet; this keeps the federation layer stable across phases.

| TFC workspace | Federated credential subjects | Azure RBAC scope |
| --- | --- | --- |
| sv0-shared | …:run_phase:plan, …:run_phase:apply | Reader on subscription; Network Contributor + Storage Contributor + Key Vault Contributor on rg-sv0-shared |
| sv0-staging | …:run_phase:plan, …:run_phase:apply | Reader on subscription + rg-sv0-shared; Contributor on rg-sv0-staging; Virtual Machine Contributor |
| sv0-prod | …:run_phase:plan, …:run_phase:apply | Reader on subscription + rg-sv0-shared; Contributor on rg-sv0-prod; Virtual Machine Contributor + Backup Contributor |
| sv0-dev | …:run_phase:plan, …:run_phase:apply | Reader on subscription + rg-sv0-shared; Contributor on rg-sv0-dev |

Plan-phase credentials get Reader-equivalent scope; apply-phase get the write scopes above. This means a leaked plan-phase token cannot mutate state.

Federated subject format. TFC sends the subject claim using the organization and project display names (case-sensitive, may contain spaces), not URL slugs: organization:SecurityV0:project:Default Project:workspace:<workspace>:run_phase:<plan|apply>. Each federated credential's subject must match exactly or Azure returns AADSTS700213. Verify with curl -H "Authorization: Bearer $TFC_TOKEN" https://app.terraform.io/api/v2/organizations/<slug> | jq '.data.attributes.name' (double quotes, so the shell expands $TFC_TOKEN).
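
As a quick illustration of the subject format, the eight expected subject strings can be generated and compared against what bootstrap registers. This is a sketch: the helper name `subject` is hypothetical, and the organization and project values are the display names quoted above.

```python
ORG = "SecurityV0"            # organization display name, as quoted in this section
PROJECT = "Default Project"   # project display name; note the space
WORKSPACES = ["sv0-shared", "sv0-staging", "sv0-prod", "sv0-dev"]
PHASES = ["plan", "apply"]


def subject(workspace, phase):
    """Build the subject claim for one workspace and run phase."""
    return (
        f"organization:{ORG}:project:{PROJECT}"
        f":workspace:{workspace}:run_phase:{phase}"
    )


subjects = [subject(w, p) for w in WORKSPACES for p in PHASES]
for s in subjects:
    print(s)
```

Any difference in case or whitespace between these strings and the registered federated credentials is enough to trigger the AADSTS700213 mismatch.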

7a. Bootstrap procedure

Bootstrap is the sv0-bootstrap TFC workspace (sv0-infrastructure/bootstrap/). Changes ship via PR; the workspace runs plan/apply on TFC with OIDC, like every other workspace. Auto-apply is off so bootstrap mutations get an explicit operator confirm.

The chicken-and-egg of the very first bootstrap (TFC can't authenticate to Azure until federation exists) is solved by a one-time local apply that pushes its state into TFC on completion. After that, no operator ever runs terraform apply locally against this module. The first bootstrap on this tenant happened 2026-05-13 (issue #29 closure).

To re-bootstrap a fresh subscription: see bootstrap/README.md "Reset / re-bootstrap a fresh subscription." Routine bootstrap changes (new federated workspace, new RBAC scope, etc.) are pushed as PRs against sv0-infrastructure and approved in the sv0-bootstrap TFC workspace.

A few resources are deliberately not TF-managed by bootstrap:

  • sv0-vm-emergency-ops Entra group — the bootstrap SP doesn't have tenant-wide Graph permission to manage Entra groups; the group object ID is hardcoded as a local because the group is permanent. Membership goes via az ad group member add (rare — Sergey + Ivan only).

7b. Recovery credentials

Earlier drafts of this ADR provisioned a static-credential sv0-azure-break-glass SP (Contributor on prod RG, Reader on shared, Backup Contributor on prod, Storage Blob Data Reader on state SA) for use "when TFC is unreachable." That SP was deleted on 2026-05-13 as part of issue #29 because the Tier-3 lockout it hedged against is already closed: Sergey + Ivan are both subscription Owners since 2026-01-04 / 2026-03-10. Verify state with az role assignment list --scope /subscriptions/<sub> --role Owner -o table.

Design knowledge — when a recovery SP is genuinely warranted (≥10-staff scale, compliance ask, or a scenario where adding a second human with the privileged role is blocked) — is banked at patterns/recovery-credentials.md. Seven patterns: UAA-not-Owner, credential out-of-band (never in TF state), file-pipe activation, sunset condition, prevent_destroy, per-operator vs shared, explicit RTO + verify-subscription-state-first.

The existing sv0-terraform-admin Service Principal in 1Password sv0-bots (used by sv0-connectors/infra/) stays where it is — different blast radius, predates this ADR.

8. Backup posture

  • Atlas (data layer): PITR is on, retained 7 days. Atlas-native alerts (sv0-infrastructure#13) page on NO_PRIMARY / disk / connections. Already in place; this ADR does not change it.
  • Prod VMs: Azure Backup via Recovery Services Vault, daily snapshots, 30-day retention. Restoration drill quarterly.
  • Dev / PR-preview VMs: no backup. They are re-creatable from IaC + GHCR images, and contain no durable state.

9. App secrets delivery to VMs

App-side env vars (MONGODB_URI, WORKOS_CLIENT_SECRET, internal API keys, the GHCR pull token for the image-watcher) reach VMs via Azure Key Vault + Managed Identity, not via cloud-init custom_data. The earlier draft was silent on this; Codex review flagged it as a critical gap.

| Secret | Where it lives | Read path | Rotation |
| --- | --- | --- | --- |
| MONGODB_URI (app password) | kv-sv0-prod Key Vault, secret mongodb-uri | Each prod VM has Managed Identity with get on this secret. Cloud-init helper script writes /etc/sv0/env.d/mongodb, owned root:docker, mode 640. Docker Compose reads via env_file:. | Atlas password rotated (existing playbook); helper rewrites file on systemd timer (5 min) without container restart |
| WORKOS_CLIENT_SECRET, WORKOS_API_KEY | Same KV, separate secrets | Same pattern | WorkOS rotation playbook |
| GHCR_PULL_TOKEN | Same KV, secret ghcr-pull-token | Image-watcher reads via Managed Identity at startup; refreshes on token expiry | Tokens are short-lived (federated GitHub→Azure pull), no manual rotation |
| Cloudflare Tunnel credentials JSON | Same KV, secret cloudflared-<vm-id> | Cloud-init writes /etc/cloudflared/<vm-id>.json once at first boot, deletes from KV after registering (the tunnel JSON is single-use after creation) | Re-issued on cloudflared re-create, not on schedule |

Key Vault per environment:

  • kv-sv0-prod — accessed by prod VMs only (Managed Identity scoped per VM).
  • kv-sv0-dev — accessed by dev VM and PR-preview VMs. PR previews share a pull token + WorkOS test creds; they never get prod secrets.

Cloud-init must NOT contain secrets in custom_data. custom_data is readable from the Azure VM Instance Metadata Service and persists in VM properties. The cloud-init script's job is to install the helper that fetches secrets at runtime via Managed Identity, never to embed the secret values.

Cloud-portability hook: on AWS the equivalent is AWS Secrets Manager + IAM Instance Profile (1:1 swap); on GCP it's Secret Manager + Workload Identity. The helper script is parameterized so the cloud-specific fetch is one of three implementations behind the same interface.
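
The "same interface, cloud-specific fetch" shape of the helper might look like the sketch below. Everything here is illustrative: `SecretFetcher`, `InMemoryFetcher`, and `write_env_file` are hypothetical names, the in-memory fetcher stands in for the Key Vault, Secrets Manager, and Secret Manager implementations, and setting root:docker ownership is omitted because it needs root.

```python
import os
import tempfile
from typing import Protocol


class SecretFetcher(Protocol):
    """Interface each cloud-specific implementation would satisfy."""
    def fetch(self, name: str) -> str: ...


class InMemoryFetcher:
    """Test double standing in for a real cloud-backed fetcher."""
    def __init__(self, secrets):
        self._secrets = dict(secrets)

    def fetch(self, name):
        return self._secrets[name]


def write_env_file(fetcher, secret_name, env_var, path, mode=0o640):
    """Fetch one secret and atomically write ENV_VAR=value with the given mode."""
    value = fetcher.fetch(secret_name)
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)  # created with mode 0600
    try:
        with os.fdopen(fd, "w") as f:
            f.write(f"{env_var}={value}\n")
        os.chmod(tmp, mode)
        os.replace(tmp, path)  # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "mongodb")
    fetcher = InMemoryFetcher({"mongodb-uri": "mongodb://example"})
    write_env_file(fetcher, "mongodb-uri", "MONGODB_URI", target)
    with open(target) as f:
        content = f.read()
    file_mode = os.stat(target).st_mode & 0o777
print(content.strip(), oct(file_mode))
```

Writing to a temporary file and renaming means a reader of the env file never sees a half-written secret, which matters when the helper rewrites the file on a timer.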

10. Operational guardrails

These items were initially missing or asserted-without-detail; Codex review flagged them. Committed here as part of the design contract.

10a. cloudflared supervision and monitoring

  • cloudflared runs as a systemd unit (cloudflared.service) with Restart=always, RestartSec=5s, After=network-online.target, and a start limit of 5 attempts per 60 s (StartLimitBurst=5, StartLimitIntervalSec=60s).
  • Boot verification: cloud-init runs cloudflared tunnel info <id> after install and fails the deployment if the tunnel isn't registered.
  • Metrics scraped from cloudflared's /metrics endpoint (port 2000, localhost-only) by Alloy, sent to Grafana Cloud Prom. Tracked metrics: cloudflared_tunnel_active_streams, cloudflared_tunnel_total_requests, cloudflared_tunnel_response_by_code, connector connectivity.
  • Alerts (Grafana Cloud, fire to ops email):
    • prod_tunnel_replicas < 2 for >2 min — one replica down
    • prod_tunnel_replicas == 0 for >30 s — both replicas down (page)
    • cloudflared_tunnel_response_by_code{code=~"5.."} rate >1/s for >5 min
  • Log shipping: journald → Alloy → Grafana Cloud Loki, with labels vm=<name>, unit=cloudflared, tier=prod|dev|pr-preview. Same path for app containers, nginx, sshd, cloud-init.
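
The two replica alerts combine a threshold with a hold duration. The sketch below shows that logic on a list of samples; it illustrates the intent only and is not how Grafana Cloud evaluates rules internally. `held_for`, `replica_alert`, and the sample format are hypothetical.

```python
def held_for(samples, predicate):
    """Seconds the predicate has held continuously up to the latest sample.

    samples: list of (unix_seconds, replica_count), oldest first.
    """
    if not samples or not predicate(samples[-1][1]):
        return 0
    start = samples[-1][0]
    for ts, count in reversed(samples):
        if not predicate(count):
            break
        start = ts
    return samples[-1][0] - start


def replica_alert(samples):
    """Return the alert to fire, or None; the page takes priority."""
    if held_for(samples, lambda n: n == 0) > 30:
        return "page: both replicas down"
    if held_for(samples, lambda n: n < 2) > 120:
        return "alert: fewer than 2 replicas"
    return None


one_down = [(0, 2), (15, 1), (60, 1), (150, 1)]   # one replica missing for 135 s
both_down = [(0, 2), (10, 0), (45, 0)]            # zero replicas for 35 s
brief_blip = [(0, 2), (15, 1), (30, 2)]           # recovered, nothing fires
print(replica_alert(one_down), replica_alert(both_down), replica_alert(brief_blip))
```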

10b. Lifecycle guards on prod resources

  • Every VM, OS disk, NIC, and tunnel resource owned by sv0-prod carries lifecycle { prevent_destroy = true }. Replacement requires explicit removal of the lifecycle block in a separate PR — a deliberate two-PR motion, not an accidental apply.
  • Two-person approval policy on sv0-prod and sv0-shared workspaces (TFC apply requires Ivan + Sergey, not just one approver).
  • Destroy of prod resources requires setting var.allow_prod_destroy = true in the workspace AND running plan; absent the flag, the module's preconditions fail.

10c. PR-preview dependency map and failure modes

PR previews depend on six external systems. The reaper / reconciler workflow handles each as listed; nothing is silently retried forever.

| Dependency | Failure mode | Handling |
| --- | --- | --- |
| GitHub Actions OIDC | Token mint failure | Workflow fails; PR comment with retry instructions |
| Azure ARM API | Throttling, RG-create failure | 3 retries with exponential backoff, then fail-loud |
| Compute Gallery image | Image not yet replicated to region | Fail-loud; pre-check in workflow asserts image is present in westeurope |
| GHCR pull | Image not yet pushed | Image-watcher polls; PR-preview boots but app stays at "image pending" until CI publishes |
| Cloudflare API | Tunnel/DNS create failure | Retry once, then fail-loud, then trigger Cloudflare-side cleanup of partially-created resources |
| Atlas allowlist | NAT IP not present | Phase 3a smoke-test catches this; once allowlist is in place, dependency is durable |
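
The "3 retries with exponential backoff, then fail-loud" handling for the Azure ARM API can be sketched as a small wrapper. The 1-second base delay and the names `call_with_backoff` and `flaky` are assumptions for illustration; the ADR does not fix the delay values.

```python
import time


def call_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn; on failure retry up to `retries` times with doubling delays.

    After the final retry fails, the last exception propagates (fail-loud).
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # surface the error to the workflow instead of looping
            sleep(base_delay * 2 ** attempt)


delays = []
attempts = {"n": 0}


def flaky():
    """Simulated ARM call that is throttled twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("throttled")
    return "created"


# Inject a fake sleep so the example records delays instead of waiting.
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Setting `retries=1` gives the "retry once, then fail-loud" variant listed for the Cloudflare API.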

11. Cloud-portability rule set

Every Phase-3+ change to the compute landing zone must satisfy all five rules:

  1. IaaS primitives only. VM, virtual network, subnet, NAT/Internet Gateway, snapshot/backup vault. No Container Apps, AKS, ACI, App Service, Functions. No Application Gateway, Front Door, Traffic Manager (Cloudflare does L7 / WAF / DNS).
  2. Managed-PaaS for stateful only when re-implementable. Atlas (cross-cloud already), Cloudflare (cross-cloud already), Grafana Cloud (cross-cloud already), GHCR (cross-cloud already). Those are the only PaaS dependencies the compute layer leans on. Anything else needs an ADR.
  3. Cloud-init, not Azure VM Extensions, for app provisioning. Extensions are acceptable for thin wrappers (Azure Backup agent, Monitor agent) where the AWS/GCP equivalent is a 1:1 substitution. Anything that does real work — package install, config render, service enable — uses cloud-init in custom_data.
  4. Modules wrap provider-specific resources behind a stable interface. module "compute_node" accepts cloud_provider ("azure" today, "aws" when migration triggers) and the same input set: name, size, image, subnet, tags, cloud_init. The provider-specific logic lives inside the module; consumers don't see Azure types.
  5. No Azure-specific VM-side identity in app code. App code uses MONGODB_URI, WORKOS_CLIENT_SECRET, GHCR_TOKEN — same env vars on every cloud. Azure Managed Identity is used only for backup/observability, where the AWS equivalent (IAM Instance Profile) is a 1:1 swap.

These rules are the contract. A change that violates one needs an explicit ADR amendment recording why.


Consequences

Positive

  • The Hetzner dev OOM problem is structurally fixed. Each PR has its own VM. One bad PR cannot crash another. The disk-full + memory-pressure failure modes from 2026-04-17 are not reachable in this topology.
  • Prod is no longer single-machine-dependent for the request path. Two zone-spread VMs each running cloudflared. Either VM can fail and Cloudflare's tunnel-replica failover routes new requests to the surviving one. (HA is failover, not health-checked LB — see Negative below.)
  • The public attack surface shrinks. No public IP on any VM, no public reachability of sshd, no DEPLOY_SSH_KEY rotation choreography. The CI deploy mechanism flips from SSH-push to image-pull, removing the need for any CI-to-VM session. ADR-018 risk is retired by architecture rather than by accepted-risk paperwork.
  • Cloud-portable design. Re-implementation on AWS is in scope: swap the compute_node module backend, swap Azure NAT Gateway for AWS NAT Gateway, swap Recovery Services Vault for AWS Backup vault, swap Compute Gallery for AMI/EC2 Image Builder, swap Key Vault for Secrets Manager. Cloudflare Tunnel, GHCR, Atlas, Grafana Cloud, WorkOS unchanged. The 2–3 engineer-day estimate from the earlier draft was unsupported (Codex review caught this). Realistic estimate is "1 sprint to port modules + 1 cutover window," with the actual breakdown documented when we draft an AWS port plan. The point is the work is in TF + a runbook, not a re-architecture.
  • TFC OIDC retires the last long-lived cloud credential in the production pipeline. The break-glass SP in 1Password is the only path that can apply changes without TFC, scoped to enable diagnosis of shared resources (Reader on rg-sv0-shared) and remediation of compute (Contributor on rg-sv0-prod).
  • Stable egress IP for Atlas allowlist (NAT Gateway). Closes sv0-infrastructure#11 mechanically.

Negative

  • Cloudflare becomes more deeply load-bearing in the request path AND the SSH path. We were already terminating TLS at Cloudflare Edge and gating with Cloudflare Access; adding Tunnel and Access SSH increases the number of Cloudflare features we depend on. A sustained Cloudflare outage that takes Tunnel offline takes our public surface offline AND blocks human SSH. Mitigation: documented break-glass to flip a TF variable that adds a public IP + Caddy on each prod VM; runbook holds the executable procedure for the case where TFC is also unreachable.
  • Tunnel replica HA is failover, not health-checked LB. Cloudflare documents multi-replica as connection redundancy + nearest-replica, not active health-checked load balancing. Long-lived connections (WebSocket / SSH-over-Access) may need to reconnect on replica loss. Phase 3b includes a measured failover drill; if the observed behavior is unacceptable, the documented escalation is to add Cloudflare Load Balancer (a separate paid product) with one origin per VM and active health checks.
  • Per-PR VM warmup latency is ~3 min on PR open (VM boot + cloud-init + cloudflared register + first GHCR pull). On subsequent pushes (pull_request: synchronize) the latency drops to ~30 s because the synchronize workflow only updates the GHCR image and the VM's image-watcher rolls itself. Hetzner's per-PR-on-shared-host model is closer to 30 s on every push but at the cost of the OOM problem.
  • One new TFC workspace and two mixed-provider extensions. Per the modest-hybrid amendment: sv0-dev is added; sv0-shared extends to host Azure VNet/NAT/Compute Gallery/Key Vaults alongside its existing Cloudflare resources; sv0-prod extends to host Azure prod compute alongside its existing Atlas cluster. Workspace count goes from 4 to 5 (the pre-existing sv0-shared / sv0-prod / sv0-observability / sv0-staging plus the new sv0-dev). Bootstrap is local-apply, not a TFC workspace. Still well under the resource-count and concurrency limits of the free tier (ADR-019 §3).
  • Single Standard NAT Gateway is zone-pinned. Earlier draft sketched two NATs but Azure forbids multiple NAT Gateways on one subnet, and Standard NAT is zonal not zone-redundant. A zone-1 outage takes egress with it; VMs in zone 2 lose Atlas/GHCR reachability until NAT recovers. Accepted because we are pre-revenue and Azure zonal outages are minutes-to-hours and rare. Phase-4+ upgrade options listed in §5b.
  • Cost ceiling is policy, not enforcement. The 10-concurrent-PR cap is checked at provisioning time but is not a credit ceiling. A runaway workflow loop could burn through the credit allocation before annual renewal. Mitigations: cap, reaper, Azure budget alerts at 50% / 75% / 90%, hard-stop budget action at 100% (auto-disable the tfc-sv0-infrastructure app via Azure Cost Management action group).

Accepted risks

  • No backup on dev. A dev outage is a re-deploy from main. We do not pay backup cost for this.
  • No backup on PR previews. Same logic; they are GHCR-tag-replayable.
  • Zonal NAT Gateway. A zonal outage that takes out the NAT removes egress for VMs in any zone. Acceptable at our scale; documented upgrade path in §5b.
  • Tunnel multi-replica is failover, not LB. Acceptable for short-lived REST traffic; revisit if we add SSE/WebSocket endpoints. Documented escalation to Cloudflare Load Balancer if observed failover is too slow.
  • Pull-based deploy has 15 s + container-restart latency. Acceptable for our deploy cadence; there is no production scenario today where a sub-15-second deploy matters.
  • Cost estimate is unaudited. Numbers in the runbook §"Cost estimate" are author-built using publicly listed Azure prices. They are point-in-time and not tied to an Azure Pricing Calculator export. Phase 3a apply attaches a calculator export to the PR for the actuals; this section's numbers are budgeting, not billing-grade.

Non-goals

  • This ADR does not Terraform the GitHub Actions workflows that drive PR-preview lifecycle. Those land in sv0-platform/.github/workflows/ as code, reviewed alongside the rest of the deploy pipeline.
  • This ADR does not replace Caddy with a more sophisticated reverse proxy. Each VM keeps Caddy or nginx (TBD; Phase 3a will pick one) for the cloudflared → containers hop.
  • This ADR does not commit to migrating workspaces stamped from these modules (per-tenant deployments). That's the tenant-parameterizable promise from ADR-019; Phase 5+.

When to reconsider

This ADR is deliberately drawn at the IaaS layer and explicitly defers managed-platform questions. Revisit when any of the following happens:

  • First dedicated-deployment customer signs. May or may not change the topology; the modules are tenant-parameterizable already, so the first answer is usually "stamp out a new env, point it at the customer's subscription/account, done."
  • Concurrency on PR previews regularly hits the 10-VM cap during normal weeks. At that point we either raise the cap, or we re-evaluate per-PR-VM vs. shared-pool-with-isolation (Kubernetes-namespace-style). The per-PR VM is correct at our team size; once we are >5 active engineers it may flip.
  • Cloudflare's tunnel + access-SSH path develops sustained reliability problems that materially affect deploy or ingress. At that point we re-evaluate Azure LB + public IP for prod, or add a second WAN provider.
  • Cost variance gets surprising. If the credit allocation is on track to deplete before the renewal cycle, we re-size the prod fleet to A1_v2 burstable (smaller / cheaper), reduce dev to ephemeral-only, or move dev to a smaller cloud (Hetzner stays decommissioned, but a self-hosted Mac mini becomes plausible).
  • Regulatory scope changes. SOC 2 / ISO 27001 attestation may push us to managed services with audited control attestations (Container Apps, AKS-managed). The cloud-portability rule set is paid in flexibility precisely so this revisit is not catastrophic.
  • Azure deprecates B-series. Has happened before to other VM lines. Revisit sizing and switch to D-series equivalents.

References