ADR-024: Azure Demo VM Deploy from GitHub Actions
Status
Proposed — 2026-05-14.
Operationalises ADR-022 §3 Phase 3f (long-running dev VMs) for the deploy mechanism only. PR-preview ephemeral VMs (ADR-022 §6) are explicitly not scoped here; the full Phase-2 design (deployment stacks + cap-3 + drift sweeper + composite RBAC) is banked in docs/infrastructure/azure-ephemeral-pr-previews-design.md for activation when a concrete trigger appears (see §"When to Reconsider").
Context
Three things are true:
- The Azure dev VM exists (spike landed 2026-05-12 via sv0-infrastructure/envs/dev/). vm-sv0-dev-1 serves dev-azure.securityv0.com via Cloudflare Tunnel. SSH is CF Access (ADR-023 §3.4.2). Today, "deploy" = re-apply Terraform with a new image_tag: heavyweight, and routed through TFC for a frequent low-risk operation.
- Hetzner is the demo-DoS source today. Per ADR-022 §"Context": shared 4 GB Hetzner host, every PR brings up its own Compose stack, >5 concurrent PRs OOM-kill dev.securityv0.com. The fix Ivan asked for (2026-05-14) is demo isolation (move the demo VM away from PR-preview churn), not VM-level lifecycle binding. That's separable.
- Scope cut, 2026-05-14. First draft of this ADR also designed an ephemeral PR-preview pool on Azure (Phase 2). Cross-review by codex-rescue, security-auditor, ceo-reviewer, secops-analyst, and a fact-checker pointed at the same answer: Phase 2 delivers zero value while Hetzner keeps running PR previews, introduces multiple correctness errors (deployment-stack CLI flags, RBAC composition, GitHub semantics) that wouldn't exist if Phase 2 weren't being built, and bundles a forcing-function cap that pinches developer UX on a 2–3 engineer team. The Phase-2 design is banked, not built.
What Ivan asked for, 2026-05-14
I want to have a deployment procedure updated to this Azure machine… Azure default and back up to dev Hetzner on the machine.
Concrete requirements for this ADR:
- Demo VM deploys to Azure on every main CI success without a TFC apply.
- Hetzner keeps running as the fallback target (deploy-dev.yml unchanged).
- No SSH key managed in GitHub Actions for the Azure path.
Decision
A new GitHub Actions workflow deploy-azure-dev.yml in sv0-platform deploys to the demo VM via Azure Run Command authenticated by OIDC federation from GitHub Actions through a new tightly-scoped Entra app. Hetzner's deploy-dev.yml continues to run unchanged; both targets get every main-CI deploy during cohabitation.
The decision has four load-bearing parts.
1. Hostname stays dev-azure.securityv0.com during Hetzner cohabitation
dev.securityv0.com continues to point at Hetzner. The Azure demo VM keeps its existing dev-azure.securityv0.com URL. That hostname sits one level below securityv0.com, so it is covered by the existing free Cloudflare Universal SSL certificate on *.securityv0.com (per memory project_cf_universal_ssl_one_level).
When Hetzner retires, a follow-up decision renames the Azure URL to drop the -azure suffix. Until then, dev-azure is the canonical Azure-dev hostname.
2. New Entra app gha-sv0-platform-deploy, RG-scoped to rg-sv0-dev
ADR-022 §6 said "All workflows use the same OIDC federation TFC uses" — meaning GitHub Actions runs would auth via the existing tfc-sv0-infrastructure app. This ADR rejects that assertion on security grounds:
The TFC app has Contributor on rg-sv0-prod via the sv0-prod workspace federated credential (ADR-022 §7). In Azure, federated credentials are trust assertions on the same Service Principal — the resulting Azure access token has the union of all the SP's role assignments. Adding a GitHub-Actions federated credential to tfc-sv0-infrastructure would give every workflow run the same rg-sv0-prod Contributor blast radius. That's a hard no.
A new Entra app gha-sv0-platform-deploy is added to bootstrap/azuread.tf (the sv0-bootstrap TFC workspace, post-sv0-infrastructure#29).
| Federated credential subject | Use | RBAC |
|---|---|---|
| repo:SecurityV0/sv0-platform:environment:dev | deploy-azure-dev.yml (and any future Azure-touching GHA workflows in this repo's dev environment) | Virtual Machine Contributor on rg-sv0-dev; Reader on rg-sv0-shared (subnet lookup if ever needed; defensive). |
GitHub Environment protection is part of the trust boundary. GitHub federation subjects of the form repo:OWNER/REPO:environment:NAME are only minted when the running job declares environment: NAME. PRs from forks cannot satisfy environment protection rules and cannot mint these tokens.
The bootstrap step (Migration plan §2) MUST configure the sv0-platform repo's dev environment with a tight deployment-branch policy. As implemented 2026-05-15, the policy allows main + redesign/v06-pilot — the long-running redesign pilot branch needs the same OIDC trust as main while it stays open. When the pilot lands or is closed, the policy collapses to main only. Without this gate, any branch's workflow could mint a token with this subject.
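The trust boundary described above can be modelled in a few lines of shell. This is a minimal sketch for reasoning about the gate or for a defense-in-depth preflight step; the function names are hypothetical, and GitHub's environment deployment-branch policy remains the actual enforcement point.

```shell
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical and do not exist in any
# workflow referenced by this ADR.

# Build the OIDC subject GitHub mints for a job that declares `environment: NAME`.
oidc_environment_subject() {
  local owner_repo="$1" environment="$2"
  printf 'repo:%s:environment:%s\n' "$owner_repo" "$environment"
}

# Return 0 when the branch is on the dev environment allow-list.
# The list mirrors the policy as implemented 2026-05-15 (main + redesign/v06-pilot).
branch_allowed_for_dev() {
  case "$1" in
    main|redesign/v06-pilot) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `oidc_environment_subject SecurityV0/sv0-platform dev` prints the subject the federated credential must match, and `branch_allowed_for_dev "$GITHUB_REF_NAME"` could fail a job early if the environment policy ever drifts.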
Note on pull_request_target workflow file resolution. Per the Nov 7 2025 GitHub change, pull_request_target-triggered workflows resolve their workflow file from the repository's default branch (= main), regardless of the PR's base. This is favorable for the trust boundary, but deploy-azure-dev.yml triggers on workflow_run + workflow_dispatch only, not on any pull-request event, so this is informational, not load-bearing.
3. Demo deploy via az vm run-command invoke
deploy-azure-dev.yml mirrors deploy-dev.yml's triggers (workflow_run: ci on success + workflow_dispatch). Both run on every successful main CI in parallel — Hetzner via SSH (today's flow), Azure via Run Command (new).
The Run Command path:
- OIDC-federate to Azure via azure/login@v2 using gha-sv0-platform-deploy credentials. No client secret.
- az vm run-command invoke -g rg-sv0-dev -n vm-sv0-dev-1 --command-id RunShellScript --scripts "<inline>". Inline script:
  - sed -i "s|^IMAGE_TAG=.*|IMAGE_TAG=$NEW_TAG|" /etc/sv0/app.env
  - systemctl restart sv0-stack
  - docker ps --format '{{.Names}}\t{{.Status}}' | grep '^sv0-' for verification in the workflow log.
- Health-check https://dev-azure.securityv0.com/health via CF Access service token (reuses CF_ACCESS_CLIENT_ID_DEPLOY / CF_ACCESS_CLIENT_SECRET_DEPLOY). /health is canonical; /deploy-health (used by Hetzner workflow legacy) is NOT mirrored.
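The Run Command path above could be sketched as the shell functions below. This is an illustration under stated assumptions, not the contents of deploy-azure-dev.yml: the function names and the image-tag allow-list are hypothetical, and passing a multi-line string to --scripts is assumed to behave as a single script. The resource names, the env file path, and the CF Access secret names come from this ADR.

```shell
#!/usr/bin/env bash
# Sketch only. Function names and the tag allow-list are assumptions.

# Reject anything that is not a plain image tag before it is spliced into sed.
validate_image_tag() {
  case "$1" in
    ''|*[!A-Za-z0-9._-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Emit the inline script that Run Command executes on the VM.
build_inline_script() {
  local tag="$1"
  validate_image_tag "$tag" || { echo "invalid image tag: $tag" >&2; return 1; }
  cat <<EOF
set -eu
sed -i "s|^IMAGE_TAG=.*|IMAGE_TAG=${tag}|" /etc/sv0/app.env
systemctl restart sv0-stack
docker ps --format '{{.Names}}\t{{.Status}}' | grep '^sv0-'
EOF
}

# Push the script to the demo VM through the Azure control plane.
deploy_via_run_command() {
  local script
  script="$(build_inline_script "$1")" || return 1
  az vm run-command invoke \
    -g rg-sv0-dev -n vm-sv0-dev-1 \
    --command-id RunShellScript \
    --scripts "$script"
}

# Probe /health using the CF Access service-token headers.
health_check() {
  curl --fail --silent --show-error --max-time 10 \
    -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID_DEPLOY}" \
    -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET_DEPLOY}" \
    https://dev-azure.securityv0.com/health
}
```

Validating the tag before building the script matters because the value is interpolated into a sed expression that runs as root on the VM; a tag containing `|` or shell metacharacters would otherwise change what the script does.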
No SSH key in this path. Microsoft.Compute/virtualMachines/runCommand/action is included in the built-in Virtual Machine Contributor role (Microsoft Learn confirms: "Run scripts in a VM using Run Command"). The script body and truncated output appear in Azure Activity Log against the SP identity. ADR-023's audit-log-truncation hazard (Tier-3 §) applies; the script is small (<500 bytes) so truncation is not a problem here, but the GHA workflow log remains the primary forensic record.
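The "<500 bytes" property can be asserted rather than assumed. A minimal sketch, assuming a hypothetical helper name and treating 500 bytes as the budget taken from the paragraph above:

```shell
#!/usr/bin/env bash
# Sketch only: script_within_budget is a hypothetical helper.

# Succeed when the inline script fits the byte budget (default 500).
script_within_budget() {
  local script="$1" budget="${2:-500}"
  local size
  size="$(printf '%s' "$script" | wc -c | tr -d '[:space:]')"
  [ "$size" -le "$budget" ]
}
```

A workflow step could call this on the generated inline script and fail before invoking Run Command if the script has grown past the point where the forensic-record assumption still holds.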
Why Run Command over image-watcher pull (the §5d/§6 mechanism in ADR-022). Image-watcher works fine for prod, but for the dev tier:
- Push deploys are deterministic in timing. Workflow finishes ⟹ deploy is done. Image-watcher introduces a 30s–N-min poll delay.
- Push deploys are observable in one place. Workflow log shows everything; image-watcher splits the deploy across CI + VM.
- Push deploys avoid an additional VM-side systemd service. No third unit to maintain.
- Push deploys reuse Tier-3 emergency-operations capability (ADR-023 §3.4.4). No new auth surface.
The trade-off is that Run Command requires the VM reachable via the Azure control plane at deploy time. If the VM is wedged at the kernel level, the deploy fails loud — which is the right behavior; an image-watcher poll would silently lag.
4. Live-demo outage risk: accepted, not mitigated
Every main-CI success triggers a Run Command + systemctl restart sv0-stack against the demo VM. The compose-restart causes a 10–30 second outage on dev-azure.securityv0.com during which the api/ui containers come back up. If a customer demo is running on that URL at the moment of the restart, the demo sees the outage.
Decision (Ivan, 2026-05-14): accept the outage risk. Demos are infrequent enough that retrofitting a freeze mechanism (repo-variable flag, cron window, branch protection) is premature complexity. If a demo coincides with a CI restart, the operator runs the demo across the window or pages the on-PR developer to delay merge. Revisit per §"When to Reconsider" if demo-restart collision happens >1×/sprint.
Cloud-portability check (ADR-022 §11)
Phase 1 surface is small enough that cloud-portability is trivial: az vm run-command ⇄ AWS SSM Run Command ⇄ GCP gcloud compute ssh --command; azure/login@v2 ⇄ AWS configure-aws-credentials@v4 ⇄ GCP auth@v2. Same OIDC subject mapping. Whole migration of this ADR's deliverable is ~30 LOC of workflow YAML.
Consequences
Positive
- Demo VM is isolated from PR-preview churn. Sergey/customers cannot be DoS'd by a busy sprint.
- No SSH key in the Azure deploy path. Run Command auth is RBAC; no DEPLOY_SSH_KEY equivalent.
- OIDC blast radius is tight. gha-sv0-platform-deploy SP has zero production access. A leaked GHA federation token can affect the dev demo VM only.
- Hetzner unchanged. Roll back by disabling the new workflow; Hetzner keeps serving as today.
- Phase 2 design is preserved, not lost. When the trigger materialises, docs/infrastructure/azure-ephemeral-pr-previews-design.md is ready to lift into an implementation ADR.
Negative
- Live-demo outage risk accepted, not mitigated (§4).
- Dual-deploy for the cohabitation window. Every main push runs both Hetzner and Azure deploy. ~2 minutes of extra CI per push; trivial cost, but two workflow runs to monitor.
- Activity Log truncation at ~4 KB for Run Command output. Mitigated by tiny inline scripts (<500 B) and GHA-log primacy.
Trade-offs deliberately rejected
- Reuse tfc-sv0-infrastructure for the GHA workflow. Rejected (§2): blast radius gives the workflow rg-sv0-prod Contributor.
- Image-watcher pull instead of Run Command push. Rejected (§3): determinism + observability + reuses Tier-3 capability.
- Demo-freeze mechanism for live demos. Rejected (§4) as premature.
- Phase-2 ephemeral PR-preview pool. Cut (docs/infrastructure/azure-ephemeral-pr-previews-design.md instead).
Migration plan
Three workstreams; total est. 2–3 days elapsed.
1. ADR-024 lands (this document). Cross-reviewed for security boundary correctness. No code changes.
2. sv0-infrastructure/bootstrap/azuread.tf adds the gha-sv0-platform-deploy app + one federated credential + role assignments. Bootstrap is the sv0-bootstrap TFC workspace (post-sv0-infrastructure#29). Output gha_deploy_app_client_id is set as a GHA repo variable in sv0-platform. Note: ADR-022 §3 "Bootstrap is local-apply" text is stale post-#29 and needs a follow-up amendment, not part of this PR. Estimated ~60 LOC of Terraform.
3. GitHub Environment configuration in sv0-platform: create the dev environment (if not already), set deployment-branch policy to main only (as implemented 2026-05-15, main + redesign/v06-pilot while the pilot branch stays open; see §2). Document the manual step in the bootstrap PR description; GitHub's environment-protection sub-policies are not fully Terraformable today.
4. sv0-platform/.github/workflows/deploy-azure-dev.yml new. Triggers: workflow_run: ci on success + workflow_dispatch. Steps: validate-instance-name, OIDC login (with retry-with-backoff to absorb Azure RBAC propagation latency on the first run, up to 30 min), Run Command invoke, health check, deployment summary. Estimated ~120 LOC.
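The retry-with-backoff step named in the workflow could look roughly like the helper below. This is a sketch, not the workflow's actual implementation: retry_with_backoff and its argument order are hypothetical, and the attempt counts in the usage note are illustrative values chosen to span the 30-minute propagation window.

```shell
#!/usr/bin/env bash
# Sketch only: retry_with_backoff is a hypothetical helper, not an input of
# azure/login or any other action.

# Run a command until it succeeds, doubling the wait between attempts.
# usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_S MAX_DELAY_S command [args...]
retry_with_backoff() {
  local max_attempts="$1" delay="$2" max_delay="$3"
  shift 3
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
    # Cap the wait so late attempts do not stall the job for too long.
    if [ "$delay" -gt "$max_delay" ]; then
      delay="$max_delay"
    fi
  done
}
```

As a usage example, `retry_with_backoff 11 15 300 az vm show -g rg-sv0-dev -n vm-sv0-dev-1 --query id -o tsv` waits 15, 30, 60, 120, 240 and then five times 300 seconds between attempts, about 33 minutes in total, which covers the propagation window stated above.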
Cutover sequence:
1. Land #2 + #3: verify the new SP can read rg-sv0-dev (az role assignment list --assignee <appId>). First check may take up to 30 min for RBAC propagation; the workflow's retry-with-backoff covers this thereafter.
2. Land #4 in parallel-test mode: the workflow runs on workflow_run: ci but only echoes what it would do. Validates the wiring against real CI traffic.
3. Flip to active mode: actual az vm run-command calls execute. Hetzner workflow still runs; both targets get every main deploy.
4. Watch for one sprint week. If clean, declare Phase 1 done; the Hetzner-fallback story stays.
Rollback at any step is "disable deploy-azure-dev.yml." Hetzner is untouched throughout.
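The parallel-test mode in the cutover sequence can be implemented with a small wrapper that either runs a command or prints it. A minimal sketch, assuming a hypothetical DEPLOY_MODE variable and helper name; defaulting to dry-run when the variable is unset is a design choice made here, not something the ADR specifies.

```shell
#!/usr/bin/env bash
# Sketch only: DEPLOY_MODE and run_or_echo are hypothetical names.

# Execute the command in active mode; otherwise print what would have run.
run_or_echo() {
  if [ "${DEPLOY_MODE:-dry-run}" = "active" ]; then
    "$@"
  else
    echo "DRY-RUN: $*"
  fi
}
```

With this in place, flipping to active mode is a one-variable change (for example a repo variable mapped to DEPLOY_MODE), and the same workflow file is exercised in both phases.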
When to Reconsider
- Hetzner OOMs reach >1×/week. Re-evaluate moving PR previews off Hetzner; at that point, lift docs/infrastructure/azure-ephemeral-pr-previews-design.md into an implementation ADR.
- Demo-restart-mid-customer-demo happens >1×/sprint. Build a demo-freeze mechanism (10 LOC).
- A partner or customer requires per-PR Azure isolation (e.g., regulatory data residency that GitHub-managed runners don't satisfy). Activate the Phase-2 pattern.
- az vm run-command Activity-Log truncation bites a real incident. Switch to a forensic-friendly mechanism.
- OIDC federation token lifetime (~1h default) causes mid-workflow re-auth failures. Revisit token caching.
- ADR-022 §3 "bootstrap is local-apply" still says that when the next reader looks. Submit the amendment.
Addendum (2026-05-15): cross-tier deploy auth — the side-by-side view
ADR-024's main body covers the Azure dev demo VM in isolation. Operators need to reason about the four deploy targets together (current and future). This addendum maps them onto a single contract.
| Target | Tier | Active today | Workflow | Auth from GHA | Secrets backing it | Trust boundary |
|---|---|---|---|---|---|---|
| Hetzner VPS 178.156.217.150 | dev | ✓ (serves dev.securityv0.com + *-dev.securityv0.com) | deploy-dev.yml, deploy-dev-cleanup.yml, pr-preview-admin.yml | SSH | dev.DEPLOY_HOST, dev.DEPLOY_HOST_KEY, dev.DEPLOY_SSH_KEY (env scope) | Repo branch protection on main + reviewer policy; PR-close cleanup uses a scheduled GC sweep (per sv0-platform#948) to avoid the head-branch-context trap |
| Hetzner prod VM | prod | ✓ (serves app.securityv0.com) | deploy-prod.yml | SSH | prod.DEPLOY_HOST, prod.DEPLOY_HOST_KEY, prod.DEPLOY_SSH_KEY (env scope) | environment: prod branch policy + reviewer requirement |
| Azure vm-sv0-dev-1 (dev-azure.securityv0.com) | dev demo | Bootstrap only; no app stack yet (per migration memory). Will receive deploys via Run Command once the app stack lands. | deploy-azure-dev.yml | OIDC, Entra app gha-sv0-platform-deploy, federated subject repo:SecurityV0/sv0-platform:environment:dev | None; auth is RBAC (Virtual Machine Contributor on rg-sv0-dev) | environment: dev branch policy = main + redesign/v06-pilot |
| Azure staging (future) | staging | n/a — per ADR-022 §3 Phase 3b | TBD | OIDC, separate Entra app gha-sv0-platform-deploy-staging | Tier-specific staging.WORKOS_* etc. (new env) | environment: staging branch policy + (likely) reviewers |
| Azure prod (future) | prod | n/a — per ADR-022 Phase 3c | TBD | OIDC, separate Entra app gha-sv0-platform-deploy-prod | Existing prod.* env stays, swap prod.DEPLOY_* for OIDC RBAC | environment: prod branch policy + required reviewers (existing) |
One Entra app per blast-radius tier, multiple federated credentials within. This is the rule ADR-024 §2 set when it rejected reusing tfc-sv0-infrastructure for GHA. It generalises forward — staging and prod must NOT share the dev app, because federated credential RBAC unions at the SP and one leaked GHA token would inherit every tier's role assignments.
Repo-level (cross-tier) secrets stay constant across all rows. See github-secrets-inventory.md § VM ↔ secret mapping for the complete cross-tier-vs-tier-bound split.
No staging GitHub Environment exists today. The STAGING_* prefix on secrets in the dev env refers to the WorkOS staging tenant used by dev-tier deploys, not a staging deploy environment. This is unintuitive and worth flagging; see the inventory's "STAGING_/PROD_ prefix convention" section.
Branch-policy drift control. Today the dev env branch policy (main + redesign/v06-pilot) is configured in the GitHub UI, not in IaC. Before the staging or prod Entra app federation lands, the policy should move to Terraform-managed configuration (github_repository_environment_deployment_policy) so the load-bearing security control is reviewable.