Azure Ephemeral PR-Preview VMs — Deferred Design
DEFERRED DESIGN: NOT RUNNING. This document describes a complete design for running per-PR ephemeral Azure VMs with strict lifecycle binding. It is not implemented. No Azure resources exist for this. Today's PR previews run on Hetzner (per sv0-platform/.github/workflows/deploy-dev.yml); see ADR-024 for why this design was banked rather than built.
If you landed here looking for the active Azure dev deploy mechanism: that's ADR-024 (demo VM only, no PR-preview pool).
When to lift into an active ADR:
- Hetzner OOM pattern reaches >1× per week (the original 2026-04-17 disk-full outage class).
- A partner or customer requires per-PR Azure isolation (data residency, regulatory).
- Team scales beyond 4 engineers with concurrent partner reviews (Hetzner's Compose-per-PR model becomes the bottleneck again).
- Hetzner host is decommissioned.
When this design is lifted, all parameters here are subject to re-review — they reflect a 2026-05 team size and threat model.
What this design delivers
A per-PR Azure VM lifecycle bound to GitHub's pull_request events:
- PR opened / synchronized → ephemeral VM provisioned at pr-N-dev-azure.securityv0.com within ~5 minutes.
- PR closed (merged or not) → VM destroyed within ~5 minutes.
- Cap: 3 concurrent ephemeral previews. 4th PR's workflow fails loud with a clear PR comment.
- Nightly drift sweeper as load-bearing backstop for missed cleanup events.
End-state guarantee: zero VMs running when no PRs are open.
Why deployment stacks, not plain deployments
The naive design is az deployment group create/delete. It doesn't work — az deployment group delete only removes the deployment-history record. Resources survive. Codex caught this on the first ADR-024 draft.
The correct primitive is Azure deployment stacks (GA May 2024, Microsoft Learn):
- Create: az stack group create -n pr-N -f bicep/pr-preview-vm.bicep --action-on-unmanage deleteResources --deny-settings-mode denyWriteAndDelete --parameters …
- Delete: az stack group delete -n pr-N --action-on-unmanage deleteResources --yes --no-wait
The --action-on-unmanage deleteResources flag (NOT a separate --delete-resources flag — that's the wrong syntax) is what causes stack-managed resources to be destroyed when the stack is removed or when a resource is dropped from the template. denyWriteAndDelete installs an RBAC deny-assignment so outside-the-stack az vm delete against a stack-managed resource is rejected — prevents accidental cross-PR damage when multiple stacks share an RG.
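A minimal sketch of how a workflow step might assemble these two commands, assuming Python is used as glue. The flags mirror the commands above. The helper names and the explicit -g argument (which the commands above omit) are illustrative additions, and the elided --parameters values are left for the caller to supply rather than guessed.

```python
import shlex

# Resource group name taken from this document's layout. Passing it via -g is
# an addition in this sketch; the commands above omit it.
RESOURCE_GROUP = "rg-sv0-dev-pr-previews"
TEMPLATE = "bicep/pr-preview-vm.bicep"


def stack_name(pr_number: int) -> str:
    """Stacks are named pr-N."""
    return f"pr-{pr_number}"


def stack_create_argv(pr_number: int, parameters: list[str]) -> list[str]:
    """Build the argv for creating the per-PR stack.

    `parameters` is whatever the Bicep template requires; this sketch does
    not guess at it and omits the flag when the list is empty.
    """
    argv = [
        "az", "stack", "group", "create",
        "-n", stack_name(pr_number),
        "-g", RESOURCE_GROUP,
        "-f", TEMPLATE,
        "--action-on-unmanage", "deleteResources",
        "--deny-settings-mode", "denyWriteAndDelete",
    ]
    if parameters:
        argv += ["--parameters", *parameters]
    return argv


def stack_delete_argv(pr_number: int) -> list[str]:
    """Build the argv for deleting the stack together with its resources."""
    return [
        "az", "stack", "group", "delete",
        "-n", stack_name(pr_number),
        "-g", RESOURCE_GROUP,
        "--action-on-unmanage", "deleteResources",
        "--yes", "--no-wait",
    ]


if __name__ == "__main__":
    # Print the delete command for a hypothetical PR number.
    print(shlex.join(stack_delete_argv(42)))
```

Building argv lists rather than shell strings keeps the PR number out of shell interpolation, which matters when the value originates from an event payload.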
Resource group layout
| RG | Lifecycle | Owner | Holds |
|---|---|---|---|
| rg-sv0-dev | Durable | sv0-dev TFC workspace | Demo VM (ADR-024 Phase 1 surface). |
| rg-sv0-dev-pr-previews | Durable | sv0-dev TFC workspace | One Azure deployment stack per active PR. Stacks are ephemeral; the RG itself is not. |
Stacks within rg-sv0-dev-pr-previews are named pr-N. Resources inside each stack carry deterministic names (vm-pr-N, nic-pr-N, disk-pr-N, bootdiag-pr-N) so a corrupted stack can be force-cleaned by direct resource deletion as defense in depth.
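The naming scheme can be sketched as a pure function, so that a force-clean script derives its targets from the PR number alone. The function names and the deletion order are assumptions of this sketch. The order puts the VM first because Azure refuses to delete a NIC or managed disk that is still attached to a VM. One caveat worth checking at activation: if bootdiag-pr-N is meant to be the storage account name itself, Azure storage account names permit only lowercase letters and digits, so the hyphenated form would need adjusting.

```python
def resource_names(pr_number: int) -> dict[str, str]:
    """Deterministic names for the stack and the resources inside it."""
    suffix = f"pr-{pr_number}"
    return {
        "stack": suffix,
        "vm": f"vm-{suffix}",
        "nic": f"nic-{suffix}",
        "disk": f"disk-{suffix}",
        "bootdiag": f"bootdiag-{suffix}",
    }


# Assumed order for direct deletion: the VM goes first so that the NIC and
# disk are detached before their own deletes are attempted.
FORCE_CLEAN_ORDER = ("vm", "nic", "disk", "bootdiag")


def force_clean_targets(pr_number: int) -> list[str]:
    """Resource names to delete directly, in order, if the stack is corrupted."""
    names = resource_names(pr_number)
    return [names[kind] for kind in FORCE_CLEAN_ORDER]


if __name__ == "__main__":
    print(force_clean_targets(42))
```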
OIDC + RBAC
Reuse the gha-sv0-platform-deploy Entra app from ADR-024 §2 (do NOT create a second app). Add two new federated credential subjects to it:
| Subject | Use |
|---|---|
| repo:SecurityV0/sv0-platform:environment:dev-preview | Provision + cleanup workflows |
| repo:SecurityV0/sv0-platform:environment:dev-sweeper | Nightly drift-sweeper workflow |
Both reach the same Service Principal in Azure. The two distinct GitHub environments give an audit-log discriminator: the originating sub claim is recorded in Entra ID sign-in logs (Activity Log alone records only the SP object ID; correlate via correlationId).
Both dev-preview and dev-sweeper GitHub Environments MUST be configured with deployment-branch policy Selected branches and tags → main only. Without it, any branch can mint a token with these subjects.
The SP needs the following RBAC. DO NOT use Contributor on rg-sv0-dev-pr-previews — it includes Microsoft.Compute/disks/beginGetAccess/action (disk SAS exfiltration of any Mongo container disk in the RG) and grants Microsoft.Storage/storageAccounts/listKeys/action via wildcard (full data-plane on any storage account the SP creates, including boot diagnostics). Composite role instead:
| Role or action | Scope | Why |
|---|---|---|
| Virtual Machine Contributor | rg-sv0-dev-pr-previews | VM CRUD + runCommand/action for image flips. |
| Network Contributor | rg-sv0-dev-pr-previews | NIC + NSG management. |
| Microsoft.Resources/deployments/* (custom or Contributor) | rg-sv0-dev-pr-previews | Stack create requires this. |
| Microsoft.Resources/deploymentStacks/* (custom) | rg-sv0-dev-pr-previews | Stack lifecycle. |
| Microsoft.Resources/deploymentStacks/manageDenySetting/action | rg-sv0-dev-pr-previews | REQUIRED for --deny-settings-mode denyWriteAndDelete. Excluded from built-in Contributor via NotActions; missing it = 403 on stack create. (Azure/deployment-stacks issue #163.) |
| Storage Account Contributor | NOT applicable | This role DOES include listKeys via Microsoft.Storage/storageAccounts/* wildcard. Do not grant. Set allowSharedKeyAccess = false + storage_use_azuread = true on the boot-diagnostics storage account (per stored ops memory feedback_storage_use_azuread_required); then the SP doesn't need data-plane access. |
This is meaningfully more involved than ADR-024's Phase-1 RBAC. Take the time to construct the custom role at activation time; do not paper over with Contributor.
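One possible shape for the custom part of that composite role, sketched as the kind of JSON document that az role definition create accepts. The role name, description and subscription placeholder are hypothetical; only the three action strings come from the table above. The small matcher is a simplification that looks at Actions only and ignores NotActions, so treat it as a sanity check against the two actions this section warns about and not as a model of Azure's evaluation.

```python
import fnmatch
import json

# Hypothetical custom role; only the action strings are taken from this document.
ROLE_DEFINITION = {
    "Name": "sv0-pr-preview-stack-operator",  # hypothetical name
    "Description": "Deployment-stack lifecycle for PR previews (sketch).",
    "Actions": [
        "Microsoft.Resources/deployments/*",
        "Microsoft.Resources/deploymentStacks/*",
        "Microsoft.Resources/deploymentStacks/manageDenySetting/action",
    ],
    "NotActions": [],
    "AssignableScopes": [
        # Placeholder: the real subscription ID is not stated in this document.
        "/subscriptions/<subscription-id>/resourceGroups/rg-sv0-dev-pr-previews",
    ],
}

# The two actions this section says the SP must not hold.
FORBIDDEN_ACTIONS = (
    "Microsoft.Compute/disks/beginGetAccess/action",
    "Microsoft.Storage/storageAccounts/listKeys/action",
)


def grants(action_patterns: list[str], action: str) -> bool:
    """True if any pattern matches the action (case-insensitive, '*' wildcard)."""
    return any(
        fnmatch.fnmatchcase(action.lower(), pattern.lower())
        for pattern in action_patterns
    )


def forbidden_hits(action_patterns: list[str]) -> list[str]:
    """Forbidden actions that the given patterns would grant."""
    return [a for a in FORBIDDEN_ACTIONS if grants(action_patterns, a)]


if __name__ == "__main__":
    print(json.dumps(ROLE_DEFINITION, indent=2))
    print("forbidden actions granted:", forbidden_hits(ROLE_DEFINITION["Actions"]))
```

Running the same check against a bare ["*"] pattern flags both forbidden actions, which is the wildcard problem this section describes.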
Workflow design
Three workflows in sv0-platform/.github/workflows/:
| Trigger | Workflow | Action |
|---|---|---|
| pull_request: opened, reopened, synchronize | pr-preview-azure.yml | (1) Singleton cap-check job (workflow-level concurrency: pr-preview-azure-capcheck to serialize across PRs) counts active stacks; fails loud if ≥3 and this PR doesn't already own one. (2) Per-PR-grouped provision job (concurrency: pr-preview-azure-${{ github.event.pull_request.number }}) does az stack group create + CF API for tunnel + DNS + Access app. (3) On synchronize, additionally az vm run-command to bump IMAGE_TAG. (4) if: failure() && steps.stack.outcome == 'success' cleanup-on-failure step destroys the stack before exiting non-zero; this prevents orphans. |
| pull_request: closed + workflow_dispatch (pr_number) | pr-preview-azure-cleanup.yml | (1) CF API: delete Access policy → Access app → DNS CNAME → tunnel (in order; CF refuses tunnel-delete while DNS references it). (2) az stack group delete -n pr-N --action-on-unmanage deleteResources --yes --no-wait. (3) PR comment confirming teardown. Every step tolerates already-deleted. |
| Nightly cron + workflow_dispatch | pr-preview-azure-sweeper.yml | For each stack in rg-sv0-dev-pr-previews, look up the PR state via gh pr view. If closed/merged/null, run cleanup. Also: for any CF tunnel matching sv0-pr-N-dev-azure with no Azure stack, delete the CF resources. Alert via webhook on any reap (reaping by sweeper means the primary cleanup path is broken). |
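The sweeper's decision logic can be separated from its side effects, which makes it easy to test before any Azure or Cloudflare call is wired in. The sketch below assumes the caller has already listed stack names, tunnel names and PR states (for example from gh pr view --json state, which reports OPEN, CLOSED or MERGED). The function and type names are hypothetical, and treating unrecognized stack names as out of scope is an assumption of this sketch.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional

STACK_RE = re.compile(r"^pr-(\d+)$")
TUNNEL_RE = re.compile(r"^sv0-pr-(\d+)-dev-azure$")


@dataclass(frozen=True)
class SweepPlan:
    stacks_to_reap: list[int]   # PR numbers whose stack should be cleaned up
    orphan_tunnels: list[int]   # PR numbers with a CF tunnel but no Azure stack
    alert: bool                 # any reap means the primary cleanup path failed


def _pr_numbers(names: Iterable[str], pattern: re.Pattern) -> set[int]:
    """Extract PR numbers from names matching the pattern; skip the rest."""
    matches = (pattern.match(name) for name in names)
    return {int(m.group(1)) for m in matches if m}


def plan_sweep(
    stack_names: Iterable[str],
    pr_states: Mapping[int, Optional[str]],
    tunnel_names: Iterable[str],
) -> SweepPlan:
    """Decide what the nightly sweeper should reap.

    A PR that is missing from pr_states is treated like a null lookup,
    so its stack is reaped.
    """
    stack_prs = _pr_numbers(stack_names, STACK_RE)
    tunnel_prs = _pr_numbers(tunnel_names, TUNNEL_RE)

    # Reap every stack whose PR is anything other than open.
    reap = sorted(pr for pr in stack_prs if pr_states.get(pr) != "OPEN")
    # Tunnels with no matching stack are Cloudflare-side orphans.
    orphans = sorted(tunnel_prs - stack_prs)
    return SweepPlan(reap, orphans, alert=bool(reap or orphans))


if __name__ == "__main__":
    plan = plan_sweep(
        ["pr-1", "pr-2", "pr-3"],
        {1: "OPEN", 2: "MERGED"},
        ["sv0-pr-1-dev-azure", "sv0-pr-9-dev-azure"],
    )
    print(plan)
```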
Cap = 3, hard-fail, no LRU eviction
The 4th concurrent PR opening sees:
Active PR previews at cap (3/3). To open a preview for this PR, close another
open PR's preview first (or merge it). The three currently-active previews are:
- PR #N1 — title — opened by @user — last activity 2h ago
- PR #N2 — title — opened by @user — last activity 6h ago
- PR #N3 — title — opened by @user — last activity 4d ago
The PR's CI and merge process are not blocked — this only affects the preview
environment. To force the oldest one closed, comment on this PR:
/preview reap PR-#N3
(Available to Ivan/Sergey only.)
Posted as both a workflow failed-check and a PR comment from the workflow bot identity.
LRU eviction was rejected. Auto-destroying the oldest active preview breaks reviewers mid-review with no warning.
Cap=3 was chosen over ADR-022 §6's cap=10 because the dev tier is sized for active development on a small team. The low cap is a forcing function for prioritization, not a license for unbounded resource consumption. Revisit if cap-exceeded happens >2× per sprint.
/preview reap is intentionally out of scope for the first activation. The cap-exceeded message names it so that the affordance is already signposted if scope later expands to include it. Until then, the workflow is the cap-exceeded message plus "ask Ivan/Sergey".
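The cap rule itself is small enough to state as code. The sketch below assumes the set of PR numbers that currently own a stack has already been listed by the singleton cap-check job; the names are hypothetical. It deliberately has no eviction branch, matching the hard-fail decision above.

```python
from dataclasses import dataclass

CAP = 3


@dataclass(frozen=True)
class CapDecision:
    allowed: bool
    reason: str


def cap_check(active_prs: set[int], pr_number: int, cap: int = CAP) -> CapDecision:
    """Decide whether this PR may provision (or keep updating) a preview."""
    if pr_number in active_prs:
        # A synchronize on a PR that already owns a stack never counts
        # against the cap.
        return CapDecision(True, "PR already owns a preview")
    if len(active_prs) < cap:
        return CapDecision(True, f"{len(active_prs)}/{cap} previews in use")
    # Hard fail: no LRU eviction, the workflow posts the message and exits.
    return CapDecision(False, f"Active PR previews at cap ({len(active_prs)}/{cap}).")


if __name__ == "__main__":
    print(cap_check({101, 102, 103}, 104))
    print(cap_check({101, 102, 103}, 102))
```

Because the check and the subsequent stack creation are not atomic, the serialization provided by the workflow-level concurrency group is what keeps two PRs from both seeing 2/3 at the same moment.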
Cloud-init secret sourcing (PR-preview specific)
The PR-preview cloud-init template needs the same six runtime secrets as the demo VM:
| Variable | Source | Tier |
|---|---|---|
| tunnel_token | CF API response after creating the per-PR tunnel | per-PR, ephemeral |
| ghcr_token | GHA secrets.GITHUB_TOKEN (read:packages) | per-run, ephemeral |
| workos_api_key | GHA env secret STAGING_WORKOS_API_KEY on dev-preview environment | staging-tier WorkOS only |
| workos_client_id | GHA env secret STAGING_WORKOS_CLIENT_ID | staging-tier |
| session_cookie_password | GHA env secret STAGING_SESSION_COOKIE_PASSWORD | staging-tier, never reused for prod |
| metrics_bearer_token | GHA env secret STAGING_METRICS_BEARER_TOKEN | dev-tier metrics |
Forks cannot read environment secrets (GitHub docs). PR-from-fork workflows cannot mint the OIDC token (subject environment:dev-preview requires the environment to be wired, which forks cannot satisfy). Both controls reinforce that PR-preview VMs never see production credentials.
Production WorkOS API key MUST NEVER be added to the dev-preview environment. Same prohibition that today's deploy-dev.yml honours.
Cloudflare API token — accepted residual risk
The workflows reuse the existing CF_API_TOKEN GHA secret. That token is zone-wide on securityv0.com — sufficient to rewrite app.securityv0.com, MX records, etc. Cloudflare's hostname-scoped token policies (2024 feature) are not currently configured for this account.
Mitigations on top: GitHub Environment branch-protection (workflows can only run on main), CF Audit Logs forwarded to a SIEM-equivalent (post-hoc detection of anomalous writes). Open follow-up: scope a dedicated CF token tighter than zone-wide when hostname-scoping is configurable for the account.
SKU choice: Standard_B2s
Same SKU as the demo VM. 2 vCPU / 4 GB RAM is comfortable for mongo:7 + api + ui containers (~1.2 GB idle, ≥1.6 GB under load). The first ADR-024 draft proposed Standard_B1ms (1 vCPU / 2 GB) on cost grounds; codex review showed that 2 GB risks swapping under realistic load.
Cost model at cap=3, all VMs continuously running: 3 VMs × 24 h × 30 days × ~$0.038/h (Standard_B2s, westeurope; verify with Azure Retail Prices API at activation time) ≈ $82/month worst case. Expected steady-state (PRs close within 1–3 days): $15–$30/month.
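The arithmetic behind the worst-case figure, as a small sketch. It assumes a 30-day month and counts compute only; the hourly rate is the approximate figure quoted above and should be re-checked at activation. The second helper inverts the formula to show what average number of running VMs the $15 to $30 steady-state range implies.

```python
HOURLY_USD = 0.038          # approximate Standard_B2s rate quoted above
HOURS_PER_MONTH = 24 * 30   # assumption: 30-day month
CAP = 3


def monthly_cost(avg_concurrent_vms: float, hourly: float = HOURLY_USD) -> float:
    """Compute-only monthly cost for an average number of running VMs."""
    return avg_concurrent_vms * HOURS_PER_MONTH * hourly


def implied_avg_vms(monthly_usd: float, hourly: float = HOURLY_USD) -> float:
    """Average concurrent VMs that a given monthly spend corresponds to."""
    return monthly_usd / (HOURS_PER_MONTH * hourly)


if __name__ == "__main__":
    print(f"worst case at cap: ${monthly_cost(CAP):.2f}/month")
    for spend in (15, 30):
        print(f"${spend}/month ~ {implied_avg_vms(spend):.2f} VMs running on average")
```

Read this way, the expected steady-state range corresponds to roughly half a VM to one VM running at any given time.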
When-this-design-was-correct-only
This design reflects 2026-05's environment. Re-validate these before lifting:
- Hetzner is still the PR-preview substrate. If it's already gone, the rationale for deferring this design shifts.
- The tfc-sv0-infrastructure Entra app's role assignments. Phase 2 RBAC assumes a separate gha-sv0-platform-deploy app exists per ADR-024.
- Azure deployment-stack flags. The Azure CLI surface may evolve between proposal (2026-05-14) and any future activation; verify --action-on-unmanage / --deny-settings-mode are still the right spelling.
- GitHub pull_request: workflow file resolution semantics. Changed Nov 2025; could change again.
- Cloudflare API token scoping. May have hostname-scoped policies by the activation date.
Linked design history
- First ADR-024 draft (2026-05-14) included this design as Phase 2. Cut by Ivan after CEO/SOC/fact-check cross-review on the same day. Branch feat/adr-024-azure-deploy-lifecycle in sv0-documentation carries the full review trail.
- Tracking issue: SecurityV0/sv0-infrastructure#63 (scope reduced to Phase 1 only).
- ADR-022 §6 is the parent design at the prod tier (cap=10, image-watcher, 7-day reaper). This document would amend §6 for the dev tier when activated.