Azure Ephemeral PR-Preview VMs — Deferred Design

DEFERRED DESIGN — NOT RUNNING. This document describes a complete design for running per-PR ephemeral Azure VMs with strict lifecycle binding. It is not implemented. No Azure resources exist for this. Today's PR previews run on Hetzner (per sv0-platform/.github/workflows/deploy-dev.yml); see ADR-024 for why this design was banked rather than built.

If you landed here looking for the active Azure dev deploy mechanism: that's ADR-024 (demo VM only, no PR-preview pool).

When to lift into an active ADR:

  • Hetzner OOM pattern reaches >1× per week (the original 2026-04-17 disk-full outage class).
  • A partner or customer requires per-PR Azure isolation (data residency, regulatory).
  • Team scales beyond 4 engineers with concurrent partner reviews (Hetzner's Compose-per-PR model becomes the bottleneck again).
  • Hetzner host is decommissioned.

When this design is lifted, all parameters here are subject to re-review — they reflect a 2026-05 team size and threat model.

What this design delivers

A per-PR Azure VM lifecycle bound to GitHub's pull_request events:

  • PR opened / synchronized → ephemeral VM provisioned at pr-N-dev-azure.securityv0.com within ~5 minutes.
  • PR closed (merged or not) → VM destroyed within ~5 minutes.
  • Cap: 3 concurrent ephemeral previews. 4th PR's workflow fails loud with a clear PR comment.
  • Nightly drift sweeper as load-bearing backstop for missed cleanup events.

End-state guarantee: zero VMs running when no PRs are open.

Why deployment stacks, not plain deployments

The naive design is az deployment group create/delete. It doesn't work: az deployment group delete only removes the deployment-history record, and the resources survive. Codex caught this on the first ADR-024 draft.

The correct primitive is Azure deployment stacks (GA May 2024, Microsoft Learn):

  • Create: az stack group create -n pr-N -f bicep/pr-preview-vm.bicep --action-on-unmanage deleteResources --deny-settings-mode denyWriteAndDelete --parameters …
  • Delete: az stack group delete -n pr-N --action-on-unmanage deleteResources --yes --no-wait

The --action-on-unmanage deleteResources flag (NOT a separate --delete-resources flag, which is the wrong syntax) is what causes stack-managed resources to be destroyed when the stack is removed or when a resource is dropped from the template. The denyWriteAndDelete mode installs an RBAC deny assignment, so an outside-the-stack az vm delete against a stack-managed resource is rejected. This prevents accidental cross-PR damage when multiple stacks share an RG.
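As a minimal sketch, the create and delete invocations above can be assembled as argument lists so the flag spelling lives in one place. The explicit -g resource group argument and the function names are assumptions made for illustration; the template parameters are elided in this document and are deliberately left out here.

```python
from typing import List

RESOURCE_GROUP = "rg-sv0-dev-pr-previews"  # RG named in this design
TEMPLATE = "bicep/pr-preview-vm.bicep"     # template path quoted in this design


def stack_name(pr_number: int) -> str:
    """Stacks are named pr-N."""
    if pr_number <= 0:
        raise ValueError("PR number must be positive")
    return f"pr-{pr_number}"


def stack_create_argv(pr_number: int) -> List[str]:
    """Argument list for creating the per-PR stack (template parameters omitted)."""
    return [
        "az", "stack", "group", "create",
        "-n", stack_name(pr_number),
        "-g", RESOURCE_GROUP,  # assumption: RG passed explicitly
        "-f", TEMPLATE,
        "--action-on-unmanage", "deleteResources",
        "--deny-settings-mode", "denyWriteAndDelete",
    ]


def stack_delete_argv(pr_number: int) -> List[str]:
    """Argument list for deleting the per-PR stack and its managed resources."""
    return [
        "az", "stack", "group", "delete",
        "-n", stack_name(pr_number),
        "-g", RESOURCE_GROUP,  # assumption: RG passed explicitly
        "--action-on-unmanage", "deleteResources",
        "--yes", "--no-wait",
    ]
```

Keeping both commands behind one helper makes it harder for the create and delete paths to drift apart on --action-on-unmanage, which is the flag that actually decides whether resources are destroyed.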

Resource group layout

| RG | Lifecycle | Owner | Holds |
| --- | --- | --- | --- |
| rg-sv0-dev | Durable | sv0-dev TFC workspace | Demo VM (ADR-024 Phase 1 surface). |
| rg-sv0-dev-pr-previews | Durable | sv0-dev TFC workspace | One Azure deployment stack per active PR. Stacks are ephemeral; the RG itself is not. |

Stacks within rg-sv0-dev-pr-previews are named pr-N. Resources inside each stack carry deterministic names (vm-pr-N, nic-pr-N, disk-pr-N, bootdiag-pr-N) so a corrupted stack can be force-cleaned by direct resource deletion as defense in depth.
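The naming scheme can be sketched as a pair of helpers: one derives every name from the PR number, the other recovers the PR number from a resource name so that a force-clean can target exactly one PR. The function names and the dictionary keys are illustrative assumptions; only the name patterns come from this document.

```python
import re
from typing import Dict, Optional

# Resource kinds with deterministic names, as listed in this design.
RESOURCE_PREFIXES = ("vm", "nic", "disk", "bootdiag")
_NAME_RE = re.compile(r"^(?:vm|nic|disk|bootdiag)-pr-(\d+)$")


def preview_resource_names(pr_number: int) -> Dict[str, str]:
    """Return the stack name and every deterministic resource name for a PR."""
    if pr_number <= 0:
        raise ValueError("PR number must be positive")
    stack = f"pr-{pr_number}"
    names = {"stack": stack}
    for prefix in RESOURCE_PREFIXES:
        names[prefix] = f"{prefix}-{stack}"
    return names


def pr_number_of(resource_name: str) -> Optional[int]:
    """Recover the PR number from a resource name, or None if it does not match."""
    match = _NAME_RE.match(resource_name)
    return int(match.group(1)) if match else None
```

Anchoring the pattern on both ends matters for the force-clean path: a name such as vm-pr-12 must not be mistaken for a resource of PR 1.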

OIDC + RBAC

Reuse the gha-sv0-platform-deploy Entra app from ADR-024 §2 (do NOT create a second app). Add two new federated credential subjects to it:

| Subject | Use |
| --- | --- |
| repo:SecurityV0/sv0-platform:environment:dev-preview | Provision + cleanup workflows |
| repo:SecurityV0/sv0-platform:environment:dev-sweeper | Nightly drift-sweeper workflow |

Both reach the same Service Principal in Azure. The two distinct GitHub environments give an audit-log discriminator: the originating sub claim is recorded in Entra ID sign-in logs (Activity Log alone records only the SP object ID; correlate via correlationId).

Both dev-preview and dev-sweeper GitHub Environments MUST be configured with deployment-branch policy Selected branches and tags → main only. Without it, any branch can mint a token with these subjects.
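A small sketch of the audit-log discriminator, assuming the sub claim has already been extracted from a sign-in log entry. The helper names and the returned labels are hypothetical; the subject strings are the two listed above.

```python
REPO = "SecurityV0/sv0-platform"

# GitHub environment -> which workflow family uses it (from the subject table).
ENVIRONMENT_USE = {
    "dev-preview": "provision+cleanup",
    "dev-sweeper": "sweeper",
}


def federated_subject(environment: str) -> str:
    """Build the federated credential subject for a GitHub environment."""
    return f"repo:{REPO}:environment:{environment}"


# Reverse lookup: full subject string -> workflow family.
SUBJECT_USE = {federated_subject(env): use for env, use in ENVIRONMENT_USE.items()}


def classify_subject(sub: str) -> str:
    """Map a token's sub claim to its workflow family, or 'unexpected'."""
    return SUBJECT_USE.get(sub, "unexpected")
```

Any sign-in classified as unexpected would deserve a look, since only these two subjects are meant to be added for the preview workflows.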

The SP needs the following RBAC. DO NOT use Contributor on rg-sv0-dev-pr-previews: it includes Microsoft.Compute/disks/beginGetAccess/action (disk SAS exfiltration of any Mongo container disk in the RG) and grants Microsoft.Storage/storageAccounts/listKeys/action via wildcard (full data-plane access on any storage account the SP creates, including boot diagnostics). Use a composite role instead:

| Built-in role | Scope | Why |
| --- | --- | --- |
| Virtual Machine Contributor | rg-sv0-dev-pr-previews | VM CRUD + runCommand/action for image flips. |
| Network Contributor | rg-sv0-dev-pr-previews | NIC + NSG management. |
| Microsoft.Resources/deployments/* (custom or Contributor) | rg-sv0-dev-pr-previews | Stack create requires this. |
| Microsoft.Resources/deploymentStacks/* (custom) | rg-sv0-dev-pr-previews | Stack lifecycle. |
| Microsoft.Resources/deploymentStacks/manageDenySetting/action | rg-sv0-dev-pr-previews | REQUIRED for --deny-settings-mode denyWriteAndDelete. Excluded from built-in Contributor via NotActions; missing it = 403 on stack create. (Azure/deployment-stacks issue #163.) |
| Storage Account Contributor | NOT applicable | This role DOES include listKeys via Microsoft.Storage/storageAccounts/* wildcard. Do not grant. Set allowSharedKeyAccess = false + storage_use_azuread = true on the boot-diagnostics storage account (per stored ops memory feedback_storage_use_azuread_required); then the SP doesn't need data-plane access. |

This is meaningfully more involved than ADR-024's Phase-1 RBAC. Take the time to construct the custom role at activation time; do not paper over with Contributor.
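One possible shape for the custom part of the composite role, sketched as a role-definition document plus a guard that flags any action pattern covering the two permissions this section warns about. The role name, the placeholder subscription ID, and the helper names are assumptions; the action strings are the ones named in this section, and the wildcard matching is a simplified approximation of how Azure evaluates action patterns.

```python
import fnmatch
import json
from typing import Dict, List

# Permissions this design says the SP must NOT end up holding.
FORBIDDEN_ACTIONS = (
    "Microsoft.Compute/disks/beginGetAccess/action",
    "Microsoft.Storage/storageAccounts/listKeys/action",
)


def stack_operator_role(subscription_id: str, resource_group: str) -> Dict:
    """Custom role covering only the deployment and deployment-stack actions."""
    return {
        "Name": "sv0-pr-preview-stack-operator",  # hypothetical role name
        "IsCustom": True,
        "Description": "Deployment-stack lifecycle for PR previews.",
        "Actions": [
            "Microsoft.Resources/deployments/*",
            "Microsoft.Resources/deploymentStacks/*",
            "Microsoft.Resources/deploymentStacks/manageDenySetting/action",
        ],
        "NotActions": [],
        "AssignableScopes": [
            f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        ],
    }


def covered_forbidden(actions: List[str]) -> List[str]:
    """Return the forbidden actions matched by any pattern in `actions`."""
    hits = []
    for forbidden in FORBIDDEN_ACTIONS:
        for pattern in actions:
            # Case-insensitive wildcard match; '*' also spans '/' segments.
            if fnmatch.fnmatchcase(forbidden.lower(), pattern.lower()):
                hits.append(forbidden)
                break
    return hits


if __name__ == "__main__":
    role = stack_operator_role(
        "00000000-0000-0000-0000-000000000000",  # placeholder subscription ID
        "rg-sv0-dev-pr-previews",
    )
    assert covered_forbidden(role["Actions"]) == []
    print(json.dumps(role, indent=2))
```

Running the guard against a candidate action list at activation time gives a quick check that a convenience wildcard has not reintroduced listKeys or disk SAS access.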

Workflow design

Three workflows in sv0-platform/.github/workflows/:

| Trigger | Workflow | Action |
| --- | --- | --- |
| pull_request: opened, reopened, synchronize | pr-preview-azure.yml | (1) Singleton cap-check job (workflow-level concurrency: pr-preview-azure-capcheck to serialize across PRs) counts active stacks; fails loud if ≥3 and this PR doesn't already own one. (2) Per-PR-grouped provision job (concurrency: pr-preview-azure-${{ github.event.pull_request.number }}) does az stack group create + CF API for tunnel + DNS + Access app. (3) On synchronize, additionally az vm run-command to bump IMAGE_TAG. (4) if: failure() && steps.stack.outcome == 'success' cleanup-on-failure step destroys the stack before exiting non-zero, which prevents orphans. |
| pull_request: closed + workflow_dispatch (pr_number) | pr-preview-azure-cleanup.yml | (1) CF API: delete Access policy → Access app → DNS CNAME → tunnel (in order; CF refuses tunnel-delete while DNS references it). (2) az stack group delete -n pr-N --action-on-unmanage deleteResources --yes --no-wait. (3) PR comment confirming teardown. Every step tolerates already-deleted. |
| Nightly cron + workflow_dispatch | pr-preview-azure-sweeper.yml | For each stack in rg-sv0-dev-pr-previews, look up the PR state via gh pr view. If closed/merged/null, run cleanup. Also: for any CF tunnel matching sv0-pr-N-dev-azure with no Azure stack, delete the CF resources. Alert via webhook on any reap (reaping by sweeper means the primary cleanup path is broken). |
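The sweeper's decision rule can be sketched as two pure functions over data the workflow would already have fetched: the stack names in the RG, a PR-number-to-state map, and the Cloudflare tunnel names. The function names are illustrative, and the uppercase state strings (OPEN, CLOSED, MERGED) are an assumption about what the PR lookup returns; a missing entry stands for the null case.

```python
import re
from typing import Dict, Iterable, List, Optional

STACK_RE = re.compile(r"^pr-(\d+)$")
TUNNEL_RE = re.compile(r"^sv0-pr-(\d+)-dev-azure$")
REAPABLE_STATES = {"CLOSED", "MERGED"}


def _pr_number(pattern: re.Pattern, name: str) -> Optional[int]:
    """Extract the PR number from a name, or None if the name does not match."""
    match = pattern.match(name)
    return int(match.group(1)) if match else None


def stacks_to_reap(stack_names: Iterable[str],
                   pr_states: Dict[int, Optional[str]]) -> List[str]:
    """Stacks whose PR is closed, merged, or unknown (null)."""
    reap = []
    for name in stack_names:
        number = _pr_number(STACK_RE, name)
        if number is None:
            continue  # not a pr-N stack; leave it alone
        state = pr_states.get(number)
        if state is None or state.upper() in REAPABLE_STATES:
            reap.append(name)
    return reap


def orphan_tunnels(tunnel_names: Iterable[str],
                   stack_names: Iterable[str]) -> List[str]:
    """Tunnels named sv0-pr-N-dev-azure that have no matching pr-N stack."""
    live = {_pr_number(STACK_RE, s) for s in stack_names}
    live.discard(None)
    orphans = []
    for tunnel in tunnel_names:
        number = _pr_number(TUNNEL_RE, tunnel)
        if number is not None and number not in live:
            orphans.append(tunnel)
    return orphans
```

Because any non-empty result means the primary cleanup path missed something, the caller would fire the webhook alert whenever either list has entries.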

Cap = 3, hard-fail, no LRU eviction

The 4th concurrent PR opening sees:

Active PR previews at cap (3/3). To open a preview for this PR, close another
open PR's preview first (or merge it). The three currently-active previews are:

- PR #N1 — title — opened by @user — last activity 2h ago
- PR #N2 — title — opened by @user — last activity 6h ago
- PR #N3 — title — opened by @user — last activity 4d ago

The PR's CI and merge process are not blocked — this only affects the preview
environment. To force the oldest one closed, comment on this PR:

/preview reap PR-#N3

(Available to Ivan/Sergey only.)

Posted as both a workflow failed-check and a PR comment from the workflow bot identity.

LRU eviction was rejected. Auto-destroying the oldest active preview breaks reviewers mid-review with no warning.

Cap=3 was chosen over ADR-022 §6's cap=10 because the dev tier is sized for active development on a small team. The cap is a forcing function for prioritization rather than a license for unbounded resource consumption. Revisit if the cap is exceeded more than 2× per sprint.

/preview reap is intentionally out of scope for the first activation. The cap-exceeded message names it so that the affordance is already signposted if scope later demands it. Until then, the workflow is the cap-exceeded message plus "ask Ivan/Sergey".
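The hard-fail cap rule (fail if the pool is at cap and this PR does not already own a stack) can be sketched as follows. The function names are hypothetical, and the input is assumed to be the list of stack names already read from the RG.

```python
from typing import Iterable

CAP = 3  # concurrent ephemeral previews allowed by this design


def may_provision(active_stacks: Iterable[str], pr_number: int, cap: int = CAP) -> bool:
    """True if this PR may create or update its preview stack."""
    active = set(active_stacks)
    if f"pr-{pr_number}" in active:
        return True  # PR already owns a stack; a synchronize must not be blocked
    return len(active) < cap


def cap_headline(active_stacks: Iterable[str], cap: int = CAP) -> str:
    """First line of the cap-exceeded PR comment."""
    return f"Active PR previews at cap ({len(set(active_stacks))}/{cap})."
```

There is no eviction branch on purpose: when the check fails, the workflow only reports, matching the decision to reject LRU eviction.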

Cloud-init secret sourcing (PR-preview specific)

The PR-preview cloud-init template needs the same six runtime secrets as the demo VM:

| Variable | Source | Tier |
| --- | --- | --- |
| tunnel_token | CF API response after creating the per-PR tunnel | per-PR, ephemeral |
| ghcr_token | GHA secrets.GITHUB_TOKEN (read:packages) | per-run, ephemeral |
| workos_api_key | GHA env secret STAGING_WORKOS_API_KEY on dev-preview environment | staging-tier WorkOS only |
| workos_client_id | GHA env secret STAGING_WORKOS_CLIENT_ID | staging-tier |
| session_cookie_password | GHA env secret STAGING_SESSION_COOKIE_PASSWORD | staging-tier, never reused for prod |
| metrics_bearer_token | GHA env secret STAGING_METRICS_BEARER_TOKEN | dev-tier metrics |

Forks cannot read environment secrets (GitHub docs). PR-from-fork workflows cannot mint the OIDC token (subject environment:dev-preview requires the environment to be wired, which forks cannot satisfy). Both controls reinforce that PR-preview VMs never see production credentials.

Production WorkOS API key MUST NEVER be added to the dev-preview environment. Same prohibition that today's deploy-dev.yml honours.
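A hedged sketch of how the six cloud-init variables might be assembled, with a guard that refuses any environment secret whose name lacks the STAGING_ prefix. The function name, the guard, and the idea of passing the environment as a mapping are assumptions for illustration; the variable and secret names come from the table above.

```python
from typing import Dict, Mapping

# cloud-init variable -> GitHub environment secret name (from the table above).
ENV_SECRET_FOR = {
    "workos_api_key": "STAGING_WORKOS_API_KEY",
    "workos_client_id": "STAGING_WORKOS_CLIENT_ID",
    "session_cookie_password": "STAGING_SESSION_COOKIE_PASSWORD",
    "metrics_bearer_token": "STAGING_METRICS_BEARER_TOKEN",
}


def cloud_init_vars(env: Mapping[str, str], tunnel_token: str,
                    ghcr_token: str) -> Dict[str, str]:
    """Collect the six runtime secrets for the PR-preview cloud-init template."""
    for secret_name in ENV_SECRET_FOR.values():
        # Guard (assumption): only staging-prefixed secrets may be wired in.
        if not secret_name.startswith("STAGING_"):
            raise ValueError(f"non-staging secret refused: {secret_name}")
    missing = sorted(s for s in ENV_SECRET_FOR.values() if not env.get(s))
    if missing:
        raise KeyError(f"missing environment secrets: {', '.join(missing)}")
    variables = {"tunnel_token": tunnel_token, "ghcr_token": ghcr_token}
    for variable, secret_name in ENV_SECRET_FOR.items():
        variables[variable] = env[secret_name]
    return variables
```

In a workflow the mapping would typically be os.environ; failing on a missing secret before the VM is created avoids booting a preview that cannot authenticate.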

Cloudflare API token — accepted residual risk

The workflows reuse the existing CF_API_TOKEN GHA secret. That token is zone-wide on securityv0.com — sufficient to rewrite app.securityv0.com, MX records, etc. Cloudflare's hostname-scoped token policies (2024 feature) are not currently configured for this account.

Mitigations on top: GitHub Environment branch-protection (workflows can only run on main), CF Audit Logs forwarded to a SIEM-equivalent (post-hoc detection of anomalous writes). Open follow-up: scope a dedicated CF token tighter than zone-wide when hostname-scoping is configurable for the account.

SKU choice: Standard_B2s

Same SKU as the demo VM. 2 vCPU / 4 GB RAM is comfortable for the mongo:7 + api + ui containers (~1.2 GB ideal, ≥1.6 GB under load). The first ADR-024 draft proposed Standard_B1ms (1 vCPU / 2 GB) on cost grounds; Codex review showed that 2 GB risks swapping under realistic load.

Cost model at cap=3, all VMs continuously running: 3 × Standard_B2s × 24 h × 30 days × ~$0.038/h (westeurope, verify with Azure Retail Prices API at activation time) ≈ $82/month worst case. Expected steady-state (PRs close within 1–3 days): $15–$30/month.
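The worst-case figure can be reproduced with a few lines of arithmetic. The 30-day month is an assumption, and the hourly rate is the approximate figure quoted above, which should be re-checked at activation time.

```python
HOURLY_RATE_USD = 0.038  # approximate Standard_B2s rate quoted in this design
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30      # assumption: 30-day billing month
CAP = 3


def monthly_cost(vm_count: float, hourly_rate: float = HOURLY_RATE_USD,
                 days: int = DAYS_PER_MONTH) -> float:
    """Cost in USD of `vm_count` VMs running continuously for one month."""
    return vm_count * HOURS_PER_DAY * days * hourly_rate


def implied_average_vms(monthly_usd: float) -> float:
    """Average number of continuously running VMs a monthly spend corresponds to."""
    return monthly_usd / monthly_cost(1)


if __name__ == "__main__":
    # 3 x 24 x 30 x 0.038 = 82.08
    print(f"worst case at cap: ${monthly_cost(CAP):.2f}/month")
    for spend in (15, 30):
        print(f"${spend}/month is about {implied_average_vms(spend):.2f} VMs on average")
```

Reading the steady-state range through implied_average_vms shows what it assumes: roughly one preview or fewer running at any given time.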

When-this-design-was-correct-only

This design reflects 2026-05's environment. Re-validate these before lifting:

  • Hetzner is still the PR-preview substrate. If it's already gone, the rationale for deferring this design shifts.
  • The tfc-sv0-infrastructure Entra app's role assignments. Phase 2 RBAC assumes a separate gha-sv0-platform-deploy app exists per ADR-024.
  • Azure deployment-stack flags. The Azure CLI surface may change between the proposal (2026-05-14) and any future activation; verify that --action-on-unmanage and --deny-settings-mode are still the correct spellings.
  • GitHub pull_request: workflow file resolution semantics. Changed Nov 2025; could change again.
  • Cloudflare API token scoping. May have hostname-scoped policies by the activation date.

Linked design history

  • First ADR-024 draft (2026-05-14) included this design as Phase 2. Cut by Ivan after CEO/SOC/fact-check cross-review on the same day. Branch feat/adr-024-azure-deploy-lifecycle in sv0-documentation carries the full review trail.
  • Tracking issue: SecurityV0/sv0-infrastructure#63 (scope reduced to Phase 1 only).
  • ADR-022 §6 is the parent design at the prod tier (cap=10, image-watcher, 7-day reaper). This document would amend §6 for the dev tier when activated.