Azure Ephemeral PR-Preview VMs — Deferred Design

DEFERRED DESIGN — NOT RUNNING. This document describes a complete design for running per-PR ephemeral Azure VMs with strict lifecycle binding. It is not implemented. No Azure resources exist for this. Today's PR previews run on Hetzner (per sv0-platform/.github/workflows/deploy-dev.yml); see ADR-024 for why this design was banked rather than built.

If you landed here looking for the active Azure dev deploy mechanism: that's ADR-024 (demo VM only, no PR-preview pool).

When to lift into an active ADR:

Hetzner OOM pattern reaches >1× per week (the original 2026-04-17 disk-full outage class).

A partner or customer requires per-PR Azure isolation (data residency, regulatory).

Team scales beyond 4 engineers with concurrent partner reviews (Hetzner's Compose-per-PR model becomes the bottleneck again).

Hetzner host is decommissioned.

When this design is lifted, all parameters here are subject to re-review — they reflect a 2026-05 team size and threat model.

What this design delivers

A per-PR Azure VM lifecycle bound to GitHub's pull_request events:

PR opened / synchronized → ephemeral VM provisioned at pr-N-dev-azure.securityv0.com within ~5 minutes.
PR closed (merged or not) → VM destroyed within ~5 minutes.
Cap: 3 concurrent ephemeral previews. 4th PR's workflow fails loud with a clear PR comment.
Nightly drift sweeper as load-bearing backstop for missed cleanup events.

End-state guarantee: zero VMs running when no PRs are open.

Why deployment stacks, not plain deployments

The naive design is az deployment group create/delete. It doesn't work — az deployment group delete only removes the deployment-history record. Resources survive. Codex caught this on the first ADR-024 draft.

The correct primitive is Azure deployment stacks (GA May 2024, Microsoft Learn):

Create: az stack group create -n pr-N -f bicep/pr-preview-vm.bicep --action-on-unmanage deleteResources --deny-settings-mode denyWriteAndDelete --parameters …
Delete: az stack group delete -n pr-N --action-on-unmanage deleteResources --yes --no-wait

The --action-on-unmanage deleteResources flag (NOT a separate --delete-resources flag — that's the wrong syntax) is what causes stack-managed resources to be destroyed when the stack is removed or when a resource is dropped from the template. denyWriteAndDelete installs an RBAC deny-assignment so outside-the-stack az vm delete against a stack-managed resource is rejected — prevents accidental cross-PR damage when multiple stacks share an RG.
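The create and delete invocations above can be sketched as argv builders, which keeps the flag spelling in one place and makes it checkable before anything touches Azure. This is a minimal sketch, not the workflow's actual implementation: the function names are hypothetical, passing `--resource-group` explicitly is an assumption (the commands above omit it), and parameter values are left to the caller.

```python
import shlex

# Resource group name taken from the layout table in this document.
RESOURCE_GROUP = "rg-sv0-dev-pr-previews"


def stack_create_argv(pr_number: int, template: str, parameters: list[str]) -> list[str]:
    """Return the argv for creating (or updating) the stack that backs one PR."""
    argv = [
        "az", "stack", "group", "create",
        "--name", f"pr-{pr_number}",
        "--resource-group", RESOURCE_GROUP,  # assumption: passed explicitly
        "--template-file", template,
        "--action-on-unmanage", "deleteResources",
        "--deny-settings-mode", "denyWriteAndDelete",
    ]
    if parameters:
        # Parameter values are supplied by the caller; none are assumed here.
        argv += ["--parameters", *parameters]
    return argv


def stack_delete_argv(pr_number: int) -> list[str]:
    """Return the argv for deleting the stack and the resources it manages."""
    return [
        "az", "stack", "group", "delete",
        "--name", f"pr-{pr_number}",
        "--resource-group", RESOURCE_GROUP,
        "--action-on-unmanage", "deleteResources",
        "--yes", "--no-wait",
    ]


if __name__ == "__main__":
    # Print the delete command for a hypothetical PR number without running it.
    print(shlex.join(stack_delete_argv(42)))
```

Building a list rather than a shell string avoids quoting problems when the argv is handed to `subprocess.run`.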

Resource group layout

RG	Lifecycle	Owner	Holds
`rg-sv0-dev`	Durable	`sv0-dev` TFC workspace	Demo VM (ADR-024 Phase 1 surface).
`rg-sv0-dev-pr-previews`	Durable	`sv0-dev` TFC workspace	One Azure deployment stack per active PR. Stacks are ephemeral; the RG itself is not.

Stacks within rg-sv0-dev-pr-previews are named pr-N. Resources inside each stack carry deterministic names (vm-pr-N, nic-pr-N, disk-pr-N, bootdiag-pr-N) so a corrupted stack can be force-cleaned by direct resource deletion as defense in depth.
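The deterministic naming scheme can be sketched as a pair of helpers: one derives every resource name from a PR number, the other recovers the PR number from a stack name. The function names are hypothetical; the name patterns are the ones listed above.

```python
import re
from typing import Optional

_STACK_NAME = re.compile(r"pr-(\d+)")


def preview_resource_names(pr_number: int) -> dict[str, str]:
    """Derive the stack name and per-resource names for one PR preview."""
    if pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    suffix = f"pr-{pr_number}"
    return {
        "stack": suffix,
        "vm": f"vm-{suffix}",
        "nic": f"nic-{suffix}",
        "disk": f"disk-{suffix}",
        "bootdiag": f"bootdiag-{suffix}",
    }


def pr_number_from_stack(stack_name: str) -> Optional[int]:
    """Return the PR number encoded in a stack name, or None if it does not match."""
    match = _STACK_NAME.fullmatch(stack_name)
    return int(match.group(1)) if match else None
```

Because the names are a pure function of the PR number, a force-clean of a corrupted stack needs no lookup beyond the PR number itself.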

OIDC + RBAC

Reuse the gha-sv0-platform-deploy Entra app from ADR-024 §2 (do NOT create a second app). Add two new federated credential subjects to it:

Subject	Use
`repo:SecurityV0/sv0-platform:environment:dev-preview`	Provision + cleanup workflows
`repo:SecurityV0/sv0-platform:environment:dev-sweeper`	Nightly drift-sweeper workflow

Both reach the same Service Principal in Azure. The two distinct GitHub environments give an audit-log discriminator: the originating sub claim is recorded in Entra ID sign-in logs (Activity Log alone records only the SP object ID; correlate via correlationId).

Both dev-preview and dev-sweeper GitHub Environments MUST be configured with deployment-branch policy Selected branches and tags → main only. Without it, any branch can mint a token with these subjects.

The SP needs the following RBAC. DO NOT use Contributor on rg-sv0-dev-pr-previews — it includes Microsoft.Compute/disks/beginGetAccess/action (disk SAS exfiltration of any Mongo container disk in the RG) and grants Microsoft.Storage/storageAccounts/listKeys/action via wildcard (full data-plane on any storage account the SP creates, including boot diagnostics). Composite role instead:

Role or permission	Scope	Why
`Virtual Machine Contributor`	`rg-sv0-dev-pr-previews`	VM CRUD + `runCommand/action` for image flips.
`Network Contributor`	`rg-sv0-dev-pr-previews`	NIC + NSG management.
`Microsoft.Resources/deployments/*` (custom)	`rg-sv0-dev-pr-previews`	Stack create requires this. Grant it through the custom role, not `Contributor`.
`Microsoft.Resources/deploymentStacks/*` (custom)	`rg-sv0-dev-pr-previews`	Stack lifecycle.
`Microsoft.Resources/deploymentStacks/manageDenySetting/action`	`rg-sv0-dev-pr-previews`	REQUIRED for `--deny-settings-mode denyWriteAndDelete`. Excluded from built-in `Contributor` via `NotActions`; missing it = 403 on stack create. (Azure/deployment-stacks issue #163.)
`Storage Account Contributor`	NOT applicable	This role DOES include `listKeys` via `Microsoft.Storage/storageAccounts/*` wildcard. Do not grant. Set `allowSharedKeyAccess = false` + `storage_use_azuread = true` on the boot-diagnostics storage account (per stored ops memory `feedback_storage_use_azuread_required`); then the SP doesn't need data-plane access.

This is meaningfully more involved than ADR-024's Phase-1 RBAC. Take the time to construct the custom role at activation time; do not paper over with Contributor.

Workflow design

Three workflows in sv0-platform/.github/workflows/:

Trigger	Workflow	Action
`pull_request: opened, reopened, synchronize`	`pr-preview-azure.yml`	(1) Singleton cap-check job (workflow-level `concurrency: pr-preview-azure-capcheck` to serialize across PRs) counts active stacks; fails loud if ≥3 and this PR doesn't already own one. (2) Per-PR-grouped provision job (`concurrency: pr-preview-azure-${{ github.event.pull_request.number }}`) does `az stack group create` + CF API for tunnel + DNS + Access app. (3) On `synchronize`, additionally `az vm run-command` to bump `IMAGE_TAG`. (4) `if: failure() && steps.stack.outcome == 'success'` cleanup-on-failure step destroys the stack before exiting non-zero, which prevents orphans.
`pull_request: closed` + `workflow_dispatch (pr_number)`	`pr-preview-azure-cleanup.yml`	(1) CF API: delete Access policy → Access app → DNS CNAME → tunnel (in order; CF refuses tunnel-delete while DNS references it). (2) `az stack group delete -n pr-N --action-on-unmanage deleteResources --yes --no-wait`. (3) PR comment confirming teardown. Every step tolerates already-deleted.
Nightly cron + `workflow_dispatch`	`pr-preview-azure-sweeper.yml`	For each stack in `rg-sv0-dev-pr-previews`, look up the PR state via `gh pr view`. If closed/merged/null, run cleanup. Also: for any CF tunnel matching `sv0-pr-N-dev-azure` with no Azure stack, delete the CF resources. Alert via webhook on any reap (reaping by sweeper means the primary cleanup path is broken).
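The sweeper's decision step from the table above can be sketched as a pure function that takes the listed stack names, the looked-up PR states, and the listed tunnel names, and returns what to reap. This is a sketch under stated assumptions: the type and function names are hypothetical, and the state strings `OPEN`, `CLOSED`, and `MERGED` are assumed to be what the PR lookup reports, with `None` standing for a failed or empty lookup.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional

STACK_RE = re.compile(r"pr-(\d+)")
TUNNEL_RE = re.compile(r"sv0-pr-(\d+)-dev-azure")


@dataclass(frozen=True)
class SweepPlan:
    reap_stacks: list[int]   # PR numbers whose stack needs the cleanup path
    reap_tunnels: list[int]  # PR numbers with a CF tunnel but no Azure stack

    @property
    def should_alert(self) -> bool:
        # Any reap means the primary cleanup path missed something.
        return bool(self.reap_stacks or self.reap_tunnels)


def _pr_numbers(names: Iterable[str], pattern: re.Pattern) -> set[int]:
    """Extract PR numbers from names that fully match the pattern."""
    return {int(m.group(1)) for name in names if (m := pattern.fullmatch(name))}


def plan_sweep(
    stack_names: Iterable[str],
    pr_states: Mapping[int, Optional[str]],
    tunnel_names: Iterable[str],
) -> SweepPlan:
    stacks = _pr_numbers(stack_names, STACK_RE)
    tunnels = _pr_numbers(tunnel_names, TUNNEL_RE)
    # Closed, merged, or unknown PRs all lose their stack; only OPEN survives.
    reap_stacks = sorted(n for n in stacks if pr_states.get(n) != "OPEN")
    # Tunnels whose PR has no stack at all are orphaned CF resources.
    reap_tunnels = sorted(tunnels - stacks)
    return SweepPlan(reap_stacks=reap_stacks, reap_tunnels=reap_tunnels)
```

Keeping the decision separate from the `az`, `gh`, and CF API calls lets the reap logic be tested without credentials.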

Cap = 3, hard-fail, no LRU eviction

The 4th concurrent PR opening sees:

Active PR previews at cap (3/3). To open a preview for this PR, close another
open PR's preview first (or merge it). The three currently-active previews are:

  - PR #N1 — title — opened by @user — last activity 2h ago
  - PR #N2 — title — opened by @user — last activity 6h ago
  - PR #N3 — title — opened by @user — last activity 4d ago

The PR's CI and merge process are not blocked — this only affects the preview
environment. To force the oldest one closed, comment on this PR:

    /preview reap PR-#N3

(Available to Ivan/Sergey only.)

Posted as both a workflow failed-check and a PR comment from the workflow bot identity.
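The cap-check rule behind this message (fail when three previews are active and this PR does not already own one) can be sketched as a single predicate. The function name is hypothetical, and the set of active PR numbers is assumed to come from listing the stacks in the preview resource group.

```python
PREVIEW_CAP = 3  # cap from this design; subject to re-review at activation


def may_provision(active_prs: set[int], this_pr: int, cap: int = PREVIEW_CAP) -> bool:
    """Return True when the provision job may proceed for this PR."""
    if this_pr in active_prs:
        # A PR that already owns a stack is updating it, not adding a new one.
        return True
    return len(active_prs) < cap
```

Checking ownership first matters on `synchronize`: a PR that already has a preview must still be able to update it while the pool is full.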

LRU eviction was rejected. Auto-destroying the oldest active preview breaks reviewers mid-review with no warning.

Cap=3 was chosen over ADR-022 §6's cap=10 because the dev tier is sized for active development on a small team. The low cap acts as a forcing function for prioritization rather than allowing unbounded resource consumption. Revisit if cap-exceeded happens >2× per sprint.

/preview reap is intentionally out of scope for the first activation. The cap-exceeded message names it anyway, so the affordance is already signposted if scope later grows to include it. Until then, the workflow is the cap-exceeded message plus "ask Ivan/Sergey".

Cloud-init secret sourcing (PR-preview specific)

The PR-preview cloud-init template needs the same six runtime secrets as the demo VM:

Variable	Source	Tier
`tunnel_token`	CF API response after creating the per-PR tunnel	per-PR, ephemeral
`ghcr_token`	GHA `secrets.GITHUB_TOKEN` (`packages: read` permission)	per-run, ephemeral
`workos_api_key`	GHA env secret `STAGING_WORKOS_API_KEY` on `dev-preview` environment	staging-tier WorkOS only
`workos_client_id`	GHA env secret `STAGING_WORKOS_CLIENT_ID`	staging-tier
`session_cookie_password`	GHA env secret `STAGING_SESSION_COOKIE_PASSWORD`	staging-tier, never reused for prod
`metrics_bearer_token`	GHA env secret `STAGING_METRICS_BEARER_TOKEN`	dev-tier metrics

Forks cannot read environment secrets (GitHub docs). PR-from-fork workflows cannot mint the OIDC token (subject environment:dev-preview requires the environment to be wired, which forks cannot satisfy). Both controls reinforce that PR-preview VMs never see production credentials.

Production WorkOS API key MUST NEVER be added to the dev-preview environment. Same prohibition that today's deploy-dev.yml honours.
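A provision job could refuse to render the cloud-init template when any of the six variables in the table above is missing or blank. The following is a minimal sketch of that pre-flight check; the function name is hypothetical, and how the values are collected from the CF API response and the GHA environment secrets is left out.

```python
# Variable names as listed in the cloud-init secret table of this document.
REQUIRED_CLOUD_INIT_VARS = (
    "tunnel_token",
    "ghcr_token",
    "workos_api_key",
    "workos_client_id",
    "session_cookie_password",
    "metrics_bearer_token",
)


def missing_cloud_init_vars(values: dict[str, str]) -> list[str]:
    """Return the required variable names that are absent or blank."""
    return [
        name
        for name in REQUIRED_CLOUD_INIT_VARS
        if not values.get(name, "").strip()
    ]
```

An empty result means the template can be rendered; a non-empty one is the list to report before failing the job. The check cannot tell a staging key from a production one, so it does not replace the prohibition above.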

Cloudflare API token — accepted residual risk

The workflows reuse the existing CF_API_TOKEN GHA secret. That token is zone-wide on securityv0.com — sufficient to rewrite app.securityv0.com, MX records, etc. Cloudflare's hostname-scoped token policies (2024 feature) are not currently configured for this account.

Mitigations on top: GitHub Environment branch-protection (workflows can only run on main), CF Audit Logs forwarded to a SIEM-equivalent (post-hoc detection of anomalous writes). Open follow-up: scope a dedicated CF token tighter than zone-wide when hostname-scoping is configurable for the account.

SKU choice: `Standard_B2s`

Same SKU as the demo VM. 2 vCPU / 4 GB RAM is comfortable for the mongo:7 + api + ui containers (~1.2 GB idle, ≥1.6 GB under load). The first ADR-024 draft proposed Standard_B1ms (1 vCPU / 2 GB) on cost grounds; Codex review showed that 2 GB risks swapping under realistic load.

Cost model at cap=3, all VMs continuously running: 3 × Standard_B2s × 24 h/day × ~30 days × ~$0.038/h (westeurope, verify with Azure Retail Prices API at activation time) ≈ $82/month worst case. Expected steady-state (PRs close within 1–3 days): $15–$30/month.
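The worst-case figure can be reproduced with a few lines of arithmetic. This sketch assumes a 30-day month and takes the approximate hourly rate quoted above as given; the function name is hypothetical, and the rate should be replaced with a verified price at activation time.

```python
def monthly_cost_usd(
    vm_count: int,
    hourly_rate_usd: float,
    hours_per_day: float = 24.0,
    days_per_month: int = 30,  # assumption: 30-day month
) -> float:
    """Cost of running vm_count VMs for the given hours over one month."""
    return vm_count * hourly_rate_usd * hours_per_day * days_per_month


if __name__ == "__main__":
    # Three VMs at the cap, running continuously at the quoted rate.
    worst_case = monthly_cost_usd(vm_count=3, hourly_rate_usd=0.038)
    print(f"worst case: ${worst_case:.2f}/month")
```

At these inputs the result rounds to $82.08, which matches the ≈ $82/month worst case.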

When-this-design-was-correct-only

This design reflects 2026-05's environment. Re-validate these before lifting:

Hetzner is still the PR-preview substrate. If it's already gone, the rationale for deferring this design shifts.
The tfc-sv0-infrastructure Entra app's role assignments. Phase 2 RBAC assumes a separate gha-sv0-platform-deploy app exists per ADR-024.
Azure deployment-stack flags. The Azure CLI surface may evolve between proposal (2026-05-14) and any future activation; verify that --action-on-unmanage and --deny-settings-mode are still the right spelling.
GitHub pull_request: workflow file resolution semantics. Changed Nov 2025; could change again.
Cloudflare API token scoping. May have hostname-scoped policies by the activation date.

Linked design history

First ADR-024 draft (2026-05-14) included this design as Phase 2. Cut by Ivan after CEO/SOC/fact-check cross-review on the same day. Branch feat/adr-024-azure-deploy-lifecycle in sv0-documentation carries the full review trail.
Tracking issue: SecurityV0/sv0-infrastructure#63 (scope reduced to Phase 1 only).
ADR-022 §6 is the parent design at the prod tier (cap=10, image-watcher, 7-day reaper). This document would amend §6 for the dev tier when activated.
