ADR-024: Azure Demo VM Deploy from GitHub Actions

Status

Proposed — 2026-05-14.

Operationalises ADR-022 §3 Phase 3f (long-running dev VMs) for the deploy mechanism only. PR-preview ephemeral VMs (ADR-022 §6) are explicitly not scoped here; the full Phase-2 design (deployment stacks + cap-3 + drift sweeper + composite RBAC) is banked in docs/infrastructure/azure-ephemeral-pr-previews-design.md for activation when a concrete trigger appears (see §"When to Reconsider").

Context

Three things are true:

The Azure dev VM exists (spike landed 2026-05-12 via sv0-infrastructure/envs/dev/). vm-sv0-dev-1 serves dev-azure.securityv0.com via Cloudflare Tunnel. SSH is CF Access (ADR-023 §3.4.2). Today, a deploy means re-applying Terraform with a new image_tag: heavyweight, and routed through TFC for what is a frequent, low-risk operation.
Hetzner is the demo-DoS source today. Per ADR-022 §"Context": shared 4 GB Hetzner host, every PR brings up its own Compose stack, >5 concurrent PRs OOM-kill dev.securityv0.com. The fix Ivan asked for (2026-05-14) is demo isolation — move the demo VM away from PR-preview churn — not VM-level lifecycle binding. That's separable.
Scope cut, 2026-05-14. First draft of this ADR also designed an ephemeral PR-preview pool on Azure (Phase 2). Cross-review by codex-rescue, security-auditor, ceo-reviewer, secops-analyst, and a fact-checker pointed at the same answer: Phase 2 delivers zero value while Hetzner keeps running PR previews, introduces multiple correctness errors (deployment-stack CLI flags, RBAC composition, GitHub semantics) that wouldn't exist if Phase 2 weren't being built, and bundles a forcing-function cap that pinches developer UX on a 2–3 engineer team. The Phase-2 design is banked, not built.

What Ivan asked for, 2026-05-14

I want to have a deployment procedure updated to this Azure machine… Azure default and back up to dev Hetzner on the machine.

Concrete requirements for this ADR:

Demo VM deploys to Azure on every main CI success without a TFC apply.
Hetzner keeps running as the fallback target (deploy-dev.yml unchanged).
No SSH key managed in GitHub Actions for the Azure path.

Decision

A new GitHub Actions workflow deploy-azure-dev.yml in sv0-platform deploys to the demo VM via Azure Run Command. It authenticates by OIDC federation from GitHub Actions through a new, tightly scoped Entra app. Hetzner's deploy-dev.yml continues to run unchanged; both targets get every main-CI deploy during cohabitation.

The decision has four load-bearing parts.

1. Hostname stays `dev-azure.securityv0.com` during Hetzner cohabitation

dev.securityv0.com continues to point at Hetzner. The Azure demo VM keeps its existing dev-azure.securityv0.com URL. That hostname is one label deep under securityv0.com, so it is covered by the existing free Cloudflare Universal SSL certificate for *.securityv0.com (per memory project_cf_universal_ssl_one_level).

When Hetzner retires, a follow-up decision renames the Azure URL to drop the -azure suffix. Until then, dev-azure is the canonical Azure-dev hostname.

2. New Entra app `gha-sv0-platform-deploy`, RG-scoped to `rg-sv0-dev`

ADR-022 §6 said "All workflows use the same OIDC federation TFC uses" — meaning GitHub Actions runs would auth via the existing tfc-sv0-infrastructure app. This ADR rejects that assertion on security grounds:

The TFC app has Contributor on rg-sv0-prod via the sv0-prod workspace federated credential (ADR-022 §7). In Azure, federated credentials are trust assertions on the same Service Principal — the resulting Azure access token has the union of all the SP's role assignments. Adding a GitHub-Actions federated credential to tfc-sv0-infrastructure would give every workflow run the same rg-sv0-prod Contributor blast radius. That's a hard no.

A new Entra app gha-sv0-platform-deploy is added to bootstrap/azuread.tf (the sv0-bootstrap TFC workspace, post-sv0-infrastructure#29).

| Federated credential subject | Use | RBAC |
| --- | --- | --- |
| `repo:SecurityV0/sv0-platform:environment:dev` | `deploy-azure-dev.yml` (and any future Azure-touching GHA workflows in this repo's `dev` environment) | `Virtual Machine Contributor` on `rg-sv0-dev`; `Reader` on `rg-sv0-shared` (subnet lookup if ever needed; defensive). |

GitHub Environment protection is part of the trust boundary. GitHub federation subjects of the form repo:OWNER/REPO:environment:NAME are only minted when the running job declares environment: NAME. PRs from forks cannot satisfy environment protection rules and cannot mint these tokens.

The bootstrap step (Migration plan §2) MUST configure the sv0-platform repo's dev environment with a tight deployment-branch policy. As implemented 2026-05-15, the policy allows main + redesign/v06-pilot — the long-running redesign pilot branch needs the same OIDC trust as main while it stays open. When the pilot lands or is closed, the policy collapses to main only. Without this gate, any branch's workflow could mint a token with this subject.
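For orientation, the federated credential described above has roughly the following shape when written as the JSON parameters that `az ad app federated-credential create` accepts. This is a sketch, not the bootstrap code: the ADR manages the credential in Terraform, the helper name `federated_credential_json` and the credential name `gha-<env>` are hypothetical, and only the issuer and audience are the standard values for GitHub Actions OIDC against Entra ID.

```shell
# Print federated-credential parameters for a GitHub Environment subject.
# $1: OWNER/REPO, $2: GitHub Environment name.
# The "name" value is a hypothetical label; the ADR does not specify one.
federated_credential_json() {
  cat <<EOF
{
  "name": "gha-$2",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:$1:environment:$2",
  "audiences": ["api://AzureADTokenExchange"]
}
EOF
}

# Example (not executed here; APP_ID is an assumed variable):
#   federated_credential_json SecurityV0/sv0-platform dev > cred.json
#   az ad app federated-credential create --id "$APP_ID" --parameters @cred.json
```

The subject string is the whole trust statement: a job that does not declare `environment: dev` presents a different subject and the token exchange fails.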

Note on pull_request_target workflow file resolution. Per GitHub's Nov 7 2025 changelog, pull_request_target-triggered workflows resolve their workflow file from the repository's default branch (= main), regardless of the PR's base; plain pull_request workflows still run from the PR's merge ref. This is favorable for the trust boundary, but deploy-azure-dev.yml triggers on workflow_run + workflow_dispatch only, not on either pull-request event, so this is informational, not load-bearing.

3. Demo deploy via `az vm run-command invoke`

deploy-azure-dev.yml mirrors deploy-dev.yml's triggers (workflow_run: ci on success + workflow_dispatch). Both run on every successful main CI in parallel — Hetzner via SSH (today's flow), Azure via Run Command (new).

The Run Command path:

OIDC-federate to Azure via azure/login@v2 using gha-sv0-platform-deploy credentials. No client secret.
az vm run-command invoke -g rg-sv0-dev -n vm-sv0-dev-1 --command-id RunShellScript --scripts "<inline>". Inline script:
- sed -i "s|^IMAGE_TAG=.*|IMAGE_TAG=$NEW_TAG|" /etc/sv0/app.env
- systemctl restart sv0-stack
- docker ps --format '{{.Names}}\t{{.Status}}' | grep '^sv0-' for verification in the workflow log.
Health-check https://dev-azure.securityv0.com/health via CF Access service token (reuses CF_ACCESS_CLIENT_ID_DEPLOY / CF_ACCESS_CLIENT_SECRET_DEPLOY). /health is canonical; /deploy-health (used by Hetzner workflow legacy) is NOT mirrored.
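The Run Command step above can be sketched as two small shell functions. This is a minimal illustration under stated assumptions, not the workflow's actual code: the function names are hypothetical, the tag validation is an addition (it keeps a malformed tag from breaking out of the `sed` expression), and the script is passed as a file through the CLI's `@file` convention rather than inline.

```shell
# Render the script that Run Command executes on the VM.
# $1: image tag. Rejected unless it uses only tag-safe characters, so it
# cannot break out of the sed expression below.
render_deploy_script() {
  case "$1" in
    '' | *[!A-Za-z0-9_.-]*)
      echo "invalid image tag: $1" >&2
      return 1
      ;;
  esac
  cat <<EOF
set -eu
sed -i "s|^IMAGE_TAG=.*|IMAGE_TAG=$1|" /etc/sv0/app.env
systemctl restart sv0-stack
docker ps --format '{{.Names}}\t{{.Status}}' | grep '^sv0-'
EOF
}

# Write the script to a temp file and hand it to Run Command.
# Assumes the job has already logged in via azure/login@v2.
deploy_azure_dev() {
  local script_file rc
  rc=0
  script_file=$(mktemp) || return 1
  if ! render_deploy_script "$1" > "$script_file"; then
    rm -f "$script_file"
    return 1
  fi
  az vm run-command invoke -g rg-sv0-dev -n vm-sv0-dev-1 \
    --command-id RunShellScript --scripts "@${script_file}" || rc=$?
  rm -f "$script_file"
  return "$rc"
}
```

The CLI's exit status reflects the control-plane call and may not reflect the script's own exit code, so the health check remains the real success signal.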

No SSH key in this path. Microsoft.Compute/virtualMachines/runCommand/action is included in the built-in Virtual Machine Contributor role (Microsoft Learn confirms: "Run scripts in a VM using Run Command"). The script body and truncated output appear in Azure Activity Log against the SP identity. ADR-023's audit-log-truncation hazard (Tier-3 §) applies; the script is small (<500 bytes) so truncation is not a problem here, but the GHA workflow log remains the primary forensic record.

Why Run Command over image-watcher pull (the §5d/§6 mechanism in ADR-022). Image-watcher works fine for prod, but for the dev tier:

Push deploys are deterministic in timing. Workflow finishes ⟹ deploy is done. Image-watcher introduces a 30s–N-min poll delay.
Push deploys are observable in one place. Workflow log shows everything; image-watcher splits the deploy across CI + VM.
Push deploys avoid an additional VM-side systemd service. No third unit to maintain.
Push deploys reuse Tier-3 emergency-operations capability (ADR-023 §3.4.4). No new auth surface.

The trade-off is that Run Command requires the VM reachable via the Azure control plane at deploy time. If the VM is wedged at the kernel level, the deploy fails loud — which is the right behavior; an image-watcher poll would silently lag.

4. Live-demo outage risk: accepted, not mitigated

Every main-CI success triggers a Run Command + systemctl restart sv0-stack against the demo VM. The compose-restart causes a 10–30 second outage on dev-azure.securityv0.com during which the api/ui containers come back up. If a customer demo is running on that URL at the moment of the restart, the demo sees the outage.
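Because the restart leaves the URL unavailable for 10–30 seconds, a health check that probes once would likely fail; polling is the natural fit. A minimal sketch, assuming a hypothetical helper `wait_for_health` and assuming the two service-token secrets are exported as environment variables under their secret names. The `CF-Access-Client-Id` and `CF-Access-Client-Secret` headers are the standard Cloudflare Access service-token headers.

```shell
# Poll a URL through Cloudflare Access until it returns HTTP 200.
# $1: URL, $2: max attempts, $3: seconds to wait between attempts.
# Reads the service token from CF_ACCESS_CLIENT_ID_DEPLOY and
# CF_ACCESS_CLIENT_SECRET_DEPLOY in the environment (assumed names).
wait_for_health() {
  local url attempts delay i code
  url=$1
  attempts=$2
  delay=$3
  i=1
  while [ "$i" -le "$attempts" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
      -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID_DEPLOY:-}" \
      -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET_DEPLOY:-}" \
      "$url") || code=000
    if [ "$code" = "200" ]; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    echo "attempt ${i}: HTTP ${code}" >&2
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

A call such as `wait_for_health https://dev-azure.securityv0.com/health 12 5` would tolerate roughly a minute of downtime, comfortably above the expected restart window.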

Decision (Ivan, 2026-05-14): accept the outage risk. Demos are infrequent enough that retrofitting a freeze mechanism (repo-variable flag, cron window, branch protection) is premature complexity. If a demo coincides with a CI restart, the operator runs the demo across the window or pages the on-PR developer to delay merge. Revisit per §"When to Reconsider" if demo-restart collision happens >1×/sprint.

Cloud-portability check (ADR-022 §11)

Phase 1 surface is small enough that cloud-portability is trivial: az vm run-command ⇄ AWS SSM Run Command ⇄ GCP gcloud compute ssh --command; azure/login@v2 ⇄ aws-actions/configure-aws-credentials@v4 ⇄ google-github-actions/auth@v2. The OIDC subject mapping is the same in all three. Migrating this ADR's whole deliverable is ~30 LOC of workflow YAML.

Consequences

Positive

Demo VM is isolated from PR-preview churn. Sergey/customers cannot be DoS'd by a busy sprint.
No SSH key in the Azure deploy path. Run Command auth is RBAC; no DEPLOY_SSH_KEY equivalent.
OIDC blast radius is tight. gha-sv0-platform-deploy SP has zero production access. A leaked GHA federation token can affect the dev demo VM only.
Hetzner unchanged. Roll back by disabling the new workflow; Hetzner keeps serving as today.
Phase 2 design is preserved, not lost. When the trigger materialises, docs/infrastructure/azure-ephemeral-pr-previews-design.md is ready to lift into an implementation ADR.

Negative

Live-demo outage risk accepted, not mitigated (§4).
Dual-deploy for the cohabitation window. Every main push runs both Hetzner and Azure deploy. ~2 minutes of extra CI per push; trivial cost, but two workflow runs to monitor.
Activity Log truncation at ~4 KB for Run Command output. Mitigated by tiny inline scripts (<500 B) and GHA-log primacy.

Trade-offs deliberately rejected

Reuse tfc-sv0-infrastructure for the GHA workflow. Rejected (§2): blast radius gives the workflow rg-sv0-prod Contributor.
Image-watcher pull instead of Run Command push. Rejected (§3): determinism + observability + reuses Tier-3 capability.
Demo-freeze mechanism for live demos. Rejected (§4) as premature.
Phase-2 ephemeral PR-preview pool. Cut (docs/infrastructure/azure-ephemeral-pr-previews-design.md instead).

Migration plan

Four workstreams; total est. 2–3 days elapsed.

1. ADR-024 lands (this document). Cross-reviewed for security boundary correctness. No code changes.
2. sv0-infrastructure/bootstrap/azuread.tf adds the gha-sv0-platform-deploy app + one federated credential + role assignments. Bootstrap is the sv0-bootstrap TFC workspace (post-sv0-infrastructure#29). Output gha_deploy_app_client_id is set as a GHA repo variable in sv0-platform. Note: ADR-022 §3 "Bootstrap is local-apply" text is stale post-#29 and needs a follow-up amendment, not part of this PR. Estimated ~60 LOC of Terraform.
3. GitHub Environment configuration in sv0-platform: create the dev environment (if not already present) and set a tight deployment-branch policy (main only as planned; as implemented 2026-05-15, main + redesign/v06-pilot, see §2). Document the manual step in the bootstrap PR description; GitHub's environment-protection sub-policies are not fully Terraformable today.
4. sv0-platform/.github/workflows/deploy-azure-dev.yml new. Triggers: workflow_run: ci on success + workflow_dispatch. Steps: validate-instance-name, OIDC login (with retry-with-backoff to absorb Azure RBAC propagation latency on the first run, up to 30 min), Run Command invoke, health check, deployment summary. Estimated ~120 LOC.
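The retry-with-backoff mentioned for the first run could look like the helper below. It is a sketch under assumptions: the function name, the attempt count, and the initial delay are illustrative, and the ADR does not say which command the workflow uses as its RBAC probe.

```shell
# Run a command until it succeeds, doubling the wait after each failure.
# $1: max attempts, $2: initial delay in seconds, rest: command to run.
retry_with_backoff() {
  local max_attempts delay attempt
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
  return 0
}

# Example (not executed here): 8 attempts starting at 15 s sleep for
# 15+30+60+120+240+480+960 = 1905 s in total, just under 32 minutes.
#   retry_with_backoff 8 15 az vm show -g rg-sv0-dev -n vm-sv0-dev-1 \
#     --query name -o tsv
```

Doubling the delay keeps the early retries cheap while still covering the 30-minute propagation window without dozens of calls.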

Cutover sequence:

1. Land #2 + #3: verify the new SP can read rg-sv0-dev (az role assignment list --assignee <appId>). First check may take up to 30 min for RBAC propagation; the workflow's retry-with-backoff covers this thereafter.
2. Land #4 in parallel-test mode: the workflow runs on workflow_run: ci but only echoes what it would do. Validates the wiring against real CI traffic.
3. Flip to active mode: actual az vm run-command calls execute. Hetzner workflow still runs; both targets get every main deploy.
4. Watch for one sprint week. If clean, declare Phase 1 done; the Hetzner-fallback story stays.
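The parallel-test and active modes in the cutover can share one code path if every mutating command goes through a thin wrapper. A minimal sketch, assuming a hypothetical `DEPLOY_MODE` variable (not named in this ADR) that defaults to dry-run so a missing setting never deploys:

```shell
# Execute the command only when DEPLOY_MODE=active; otherwise print it.
# DEPLOY_MODE is a hypothetical variable; anything but "active" is dry-run.
run_or_echo() {
  if [ "${DEPLOY_MODE:-dry-run}" = "active" ]; then
    "$@"
  else
    printf 'DRY-RUN:'
    printf ' %s' "$@"
    printf '\n'
  fi
}

# Example (not executed here):
#   run_or_echo az vm run-command invoke -g rg-sv0-dev -n vm-sv0-dev-1 \
#     --command-id RunShellScript --scripts @deploy.sh
```

Flipping to active mode is then a one-variable change, and rollback to parallel-test mode is the same change in reverse.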

Rollback at any step is "disable deploy-azure-dev.yml." Hetzner is untouched throughout.

When to Reconsider

Hetzner OOMs reach >1×/week. Re-evaluate moving PR previews off Hetzner — at that point, lift docs/infrastructure/azure-ephemeral-pr-previews-design.md into an implementation ADR.
Demo-restart-mid-customer-demo happens >1×/sprint. Build a demo-freeze mechanism (10 LOC).
A partner or customer requires per-PR Azure isolation (e.g., regulatory data residency that GitHub-managed runners don't satisfy). Activate the Phase-2 pattern.
az vm run-command Activity-Log truncation bites a real incident. Switch to a forensic-friendly mechanism.
OIDC federation token lifetime (~1h default) causes mid-workflow re-auth failures. Revisit token caching.
ADR-022 §3 still carries the stale "Bootstrap is local-apply" text when the next reader looks. Submit the amendment.

Addendum (2026-05-15): cross-tier deploy auth — the side-by-side view

ADR-024's main body covers the Azure dev demo VM in isolation. Operators need to reason about the four deploy targets together (current and future). This addendum maps them onto a single contract.

| Target | Tier | Active today | Workflow | Auth from GHA | Secrets backing it | Trust boundary |
| --- | --- | --- | --- | --- | --- | --- |
| Hetzner VPS `178.156.217.150` | dev | ✓ (serves `dev.securityv0.com` + `*-dev.securityv0.com`) | `deploy-dev.yml`, `deploy-dev-cleanup.yml`, `pr-preview-admin.yml` | SSH | `dev.DEPLOY_HOST`, `dev.DEPLOY_HOST_KEY`, `dev.DEPLOY_SSH_KEY` (env scope) | Repo branch protection on main + reviewer policy; PR-close cleanup uses a scheduled GC sweep (per sv0-platform#948) to avoid the head-branch-context trap |
| Hetzner prod VM | prod | ✓ (serves `app.securityv0.com`) | `deploy-prod.yml` | SSH | `prod.DEPLOY_HOST`, `prod.DEPLOY_HOST_KEY`, `prod.DEPLOY_SSH_KEY` (env scope) | `environment: prod` branch policy + reviewer requirement |
| Azure `vm-sv0-dev-1` (`dev-azure.securityv0.com`) | dev demo | Bootstrap only; no app stack yet (per migration memory). Will receive deploys via Run Command once the app stack lands. | `deploy-azure-dev.yml` | OIDC, Entra app `gha-sv0-platform-deploy`, federated subject `repo:SecurityV0/sv0-platform:environment:dev` | None; auth is RBAC (`Virtual Machine Contributor` on `rg-sv0-dev`) | `environment: dev` branch policy = `main` + `redesign/v06-pilot` |
| Azure staging (future) | staging | n/a (per ADR-022 §3 Phase 3b) | TBD | OIDC, separate Entra app `gha-sv0-platform-deploy-staging` | Tier-specific `staging.WORKOS_*` etc. (new env) | `environment: staging` branch policy + (likely) reviewers |
| Azure prod (future) | prod | n/a (per ADR-022 Phase 3c) | TBD | OIDC, separate Entra app `gha-sv0-platform-deploy-prod` | Existing `prod.` env stays, swap `prod.DEPLOY_` for OIDC RBAC | `environment: prod` branch policy + required reviewers (existing) |

One Entra app per blast-radius tier, multiple federated credentials within. This is the rule ADR-024 §2 set when it rejected reusing tfc-sv0-infrastructure for GHA. It generalises forward — staging and prod must NOT share the dev app, because federated credential RBAC unions at the SP and one leaked GHA token would inherit every tier's role assignments.

Repo-level (cross-tier) secrets stay constant across all rows. See github-secrets-inventory.md § VM ↔ secret mapping for the complete cross-tier-vs-tier-bound split.

No staging GitHub Environment exists today. The STAGING_* prefix on secrets in the dev env refers to the WorkOS staging tenant used by dev-tier deploys, not a staging deploy environment. This is unintuitive and worth flagging; see the inventory's "STAGING_/PROD_ prefix convention" section.

Branch-policy drift control. Today the dev env branch policy (main + redesign/v06-pilot) is configured in the GitHub UI, not in IaC. Before the staging or prod Entra app federation lands, the policy should move to Terraform-managed (github_repository_environment_deployment_branch_policy) so the load-bearing security control is reviewable.
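Until the policy is Terraform-managed, a scheduled check could at least detect drift between the UI-configured policy and the list this ADR records. A minimal sketch: the helper name `policy_matches` is hypothetical, and the `gh api` call in the comment uses GitHub's REST endpoint for listing an environment's deployment branch policies; it is shown for context and not executed here.

```shell
# Compare branch-policy names on stdin (one per line) against the
# expected names given as arguments. Order does not matter.
policy_matches() {
  local expected actual
  expected=$(printf '%s\n' "$@" | sort)
  actual=$(sort)
  if [ "$expected" = "$actual" ]; then
    return 0
  fi
  echo "branch policy drift detected" >&2
  echo "expected: $(printf '%s' "$expected" | tr '\n' ' ')" >&2
  echo "actual:   $(printf '%s' "$actual" | tr '\n' ' ')" >&2
  return 1
}

# Example (not executed here):
#   gh api repos/SecurityV0/sv0-platform/environments/dev/deployment-branch-policies \
#     --jq '.branch_policies[].name' | policy_matches main redesign/v06-pilot
```

This only reports drift; it does not make the control reviewable the way the Terraform resource would.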
