Headless agent ops on dev Azure VMs

STATUS: BANKED (2026-06-03). Not built — superseded for now by Tailscale SSH.

The routine-headless need ("reach the dev/staging VM shell without a browser, from a phone or an agent") shipped via Tailscale SSH instead (ADR-023 §3.4.6; sv0-infrastructure#120) — far simpler for a solo, pre-client operator. A simplicity-lens review found this GHA-OIDC design optimises for fully unattended, no-human-in-the-loop ops, which is a different (future) requirement.

This design is retained, not deleted. It is the right answer when the need becomes no standing VM reachability + every action a discrete audited ARM call — i.e. at the first-client / compliance / ≥3-operator trigger (cf. ADR-024's banked Phase-2 pool). The three-round adversarial review below stands; activate this plan when the trigger fires. Everything from "Relationship to ADR-024" onward is the banked design as reviewed.

Summary

Today every operational action against the dev Azure VM (dev-azure-ssh.securityv0.com) is gated by an interactive Cloudflare Access browser SSO. That is correct for a human at a keyboard but blocks unattended automation: an agent or cron job cannot run dev-VM ops without a person clicking through GitHub OAuth.

This plan adds a headless, non-interactive operational path for dev VMs only: the agent dispatches a GitHub Actions workflow that authenticates to Azure with OIDC federation (no secret) and runs az vm run-command over the Azure management plane. No SSH, no Cloudflare, no port 22, no browser, and no Azure credential on the local machine. Prod is intentionally left untouched on the interactive Cloudflare Access SSH path.

This is additive and parallel to the existing four SSH tiers in ADR-023 §3.4 — it does not weaken, replace, or reroute any human access path.

Relationship to ADR-024 (read this first)

ADR-024 already establishes the run-command-over-OIDC pattern for this exact VM: a GitHub-Actions workflow OIDC-federates to the gha-sv0-platform-deploy Entra app (no client secret) holding Virtual Machine Contributor on rg-sv0-dev, and runs az vm run-command invoke -g rg-sv0-dev -n vm-sv0-dev-1 to deploy. This plan is the operational-ops sibling of that deploy path, with two deliberate differences:

  1. Tighter role. ADR-024 uses the built-in Virtual Machine Contributor (which can also start/stop/resize/delete the VM). Ops gets a custom run-command-only role instead — least privilege for a broader-blast-radius capability (arbitrary root scripts).
  2. Separate identity. Per ADR-024 §2's own rule ("one Entra app per blast-radius tier, multiple federated credentials within"), ops does not reuse the deploy app — arbitrary-script ops is a bigger blast radius than a fixed compose-restart deploy, so it gets its own federated credential + role.

The earlier draft of this plan proposed a local-Mac service principal with a client secret. That was abandoned after review: on macOS az login --service-principal caches the secret in cleartext (service_principal_entries.json), so the laptop would hold an exfiltratable Azure credential — strictly worse than the secret-free OIDC path ADR-024 already proves out.

Problem

The dev SSH config runs cloudflared access ssh-gen on every connect to mint a 4-minute SSH cert. That minting needs a valid Cloudflare Access org token, whose lifetime equals the SSH Access app session (~1h per ADR-023 §3.4.2). When the org token lapses, ssh-gen opens a GitHub OAuth tab. So routine dev ops keep triggering a browser, and a fully unattended agent (no browser at all) cannot proceed.

Why the obvious shortcuts don't work

Shortcut | Why it fails
--- | ---
Drop a CF Access service token into the ssh_config | Service tokens authenticate HTTP and the cloudflared access ssh TCP proxy, but cannot mint the short-lived SSH cert via ssh-gen; there is no public mechanism for it (tracked in cloudflared #1056 / #212). Cloudflare's newer Access for Infrastructure also has no documented non-interactive SSH path, so the conclusion is unchanged.
Add a static authorized_keys entry for an automation user | Violates ADR-023 Hard Rule #3 (no long-lived authorized_keys for routine ops). The sshd on the VM trusts the Cloudflare CA only (TrustedUserCAKeys), by design.
Use az ssh vm | Already considered and rejected in ADR-023 §3.4.3 (Entra dependency per operator, extension requirement, no Cloudflare-Tunnel composition, not cloud-portable).
Local SP client secret on the Mac | On macOS the secret is cached in cleartext by az (see ADR-024 relationship). A laptop-resident exfiltratable Azure credential is a worse posture than secret-free OIDC.
Just extend the CF Access SSH session duration | Reduces browser frequency for a human, but still requires a browser eventually; it does not make a headless agent work. (Still worth doing as an independent ergonomic win; out of scope here.)

Decision

Option D — Azure-native control-plane command execution, dispatched via a GitHub-Actions OIDC workflow.

  • A new Entra app gha-sv0-dev-agent-ops, OIDC-federated to a dedicated ops workflow (no client secret).
  • A custom RBAC role sv0-runcommand-operator, scoped to rg-sv0-dev only, granting just the managed run-command verbs (tighter than ADR-024's Virtual Machine Contributor).
  • A workflow dev-ops-runcommand.yml takes a script input, OIDC-logs-in, runs managed az vm run-command with the full create→poll→fetch→delete lifecycle, and surfaces output as a run log + artifact.
  • The agent runs gh workflow run dev-ops-runcommand.yml -f script=… and reads the result with gh run view / the artifact. The only credential on the Mac is the existing gh token; no Azure credential is stored locally.

This mirrors the sv0-serial-console-operator custom-role pattern (ADR-023 §3.4 Tier-2) and extends ADR-024's OIDC run-command pattern (the run-command precedent — not runbook 12 break-glass, which is a terraform apply path).

Target design

Identity

  • Entra app: gha-sv0-dev-agent-ops, one federated credential with subject repo:SecurityV0/<workflow-repo>:environment:dev-ops — where <workflow-repo> is the unresolved Open question 1 (sv0-infrastructure vs sv0-platform). The subject is exact-string-matched, so it must be set to the final repo; this plan does not yet assert which. No client secret.
  • Custom role: sv0-runcommand-operator, assignable scope and assignment both at rg-sv0-dev. Canonical permission list (managed-only — legacy runCommand/action intentionally excluded since we use the managed flavor):
    • Microsoft.Compute/virtualMachines/read
    • Microsoft.Compute/virtualMachines/runCommands/read
    • Microsoft.Compute/virtualMachines/runCommands/write
    • Microsoft.Compute/virtualMachines/runCommands/delete
    • Microsoft.Compute/locations/operations/read (async poll)
  • Use this exact list verbatim in the ADR and runbook — no runCommand[s]/* wildcard (the wildcard silently re-adds the legacy action and is broader than this enumeration). Subscription-scoped Microsoft.Compute/locations/runCommands/read (listing built-in command docs) is intentionally not granted — not needed for inline RunShellScript.
  • Explicitly NOT granted: anything on rg-sv0-prod or rg-sv0-shared, nothing subscription-wide, no Key Vault access, no role-assignment write.
  • Blob output (only for >4 KB output): the output/error blobs are AppendBlob blobs in a dedicated runcmd-out container, in a storage account in rg-sv0-dev (so the grant stays RG-scoped). The VM's run-command extension writes them via a user-delegation SAS (read add create write) that the workflow mints. Minting that SAS calls generateUserDelegationKey, which acts at the storage-account level — so it cannot be a container-only grant: the ops identity needs Storage Blob Delegator at the storage-account (or rg-sv0-dev) scope to mint the key, plus a data role (Storage Blob Data Contributor) to write/read the blobs. RG-scoped, but not "container only" and not "Storage Blob Data Reader only." Small ops (≤4 KB) skip blobs and read inline output.
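To make the shape of the role concrete, here is a minimal sketch of the definition as `az role definition create` would accept it. The plan provisions the role in Terraform (bootstrap/azuread.tf), so the CLI form is illustrative only. The subscription ID argument and the description text are placeholders, not values from this plan; the role name, scope, and action list are copied from the canonical list above.

```shell
# Print the custom role definition as JSON.
# $1 is a placeholder subscription ID (assumption: not stated in this plan).
role_definition_json() {
  local subscription_id="$1"
  cat <<EOF
{
  "Name": "sv0-runcommand-operator",
  "IsCustom": true,
  "Description": "Managed run-command only, on rg-sv0-dev (illustrative text).",
  "Actions": [
    "Microsoft.Compute/virtualMachines/read",
    "Microsoft.Compute/virtualMachines/runCommands/read",
    "Microsoft.Compute/virtualMachines/runCommands/write",
    "Microsoft.Compute/virtualMachines/runCommands/delete",
    "Microsoft.Compute/locations/operations/read"
  ],
  "NotActions": [],
  "AssignableScopes": [
    "/subscriptions/${subscription_id}/resourceGroups/rg-sv0-dev"
  ]
}
EOF
}

# Create the role; the caller needs rights to write role definitions.
create_role() {
  az role definition create --role-definition "$(role_definition_json "$1")"
}
```

Keeping the action list enumerated in one place like this makes it easy to diff against what `az role definition list` reports later.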

Execution path

Use Managed Run Command (az vm run-command create/show/delete), not legacy az vm run-command invoke:

Property | Legacy invoke | Managed run-command
--- | --- | ---
Output | Truncated at ~4 KB | Inline still capped (~last 4 KB in instanceView); un-truncated only with blob URIs
Concurrency | One script per VM | Multiple concurrent
Shape | Synchronous-ish | Async, parameterized, a VM child resource

The "no truncation" property is not free. Managed run-command's inline output (instanceView) is itself capped at roughly the last 4 KB. To exceed that you must pass --output-blob-uri / --error-blob-uri pointing at AppendBlob blobs. With the az CLI those URIs must be SAS (read add create write) — the VM-managed-identity-writes-blob option exists only in the ARM/REST/PowerShell layer (OutputBlobManagedIdentity), not as an az vm run-command flag, so the workflow mints a user-delegation SAS instead. The workflow then downloads the blob and republishes it as a GitHub artifact.
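The SAS mint described above could look roughly like the sketch below. The storage account name and the per-run blob name are hypothetical arguments; only the runcmd-out container and the read/add/create/write permission set come from this plan. Whether the extension creates the AppendBlob itself or expects it to exist already is not settled here and would need checking during the phase-4 smoke test.

```shell
# Print a full blob URL carrying a user-delegation SAS (read, add, create, write).
# --as-user with --auth-mode login makes az request a user delegation key,
# which is why the Storage Blob Delegator grant is required.
# $1 = storage account (placeholder), $2 = blob name (hypothetical), $3 = UTC expiry.
mint_output_blob_uri() {
  local account="$1" blob_name="$2" expiry="$3"
  az storage blob generate-sas \
    --account-name "$account" \
    --container-name runcmd-out \
    --name "$blob_name" \
    --permissions acrw \
    --expiry "$expiry" \
    --auth-mode login \
    --as-user \
    --full-uri \
    --output tsv
}
```

A caller would pass the result straight to `--output-blob-uri` (and a second URI, for a second blob, to `--error-blob-uri`), with an expiry comfortably longer than the command's timeout.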

Token lifetime vs run-command duration. Managed run-command's default timeout is 90 min, but the Azure access token azure/login obtains via OIDC lives ~1 h (ADR-024 records this exact risk). A synchronous 90-min create/poll will start failing auth around the 1 h mark — and so can the if: always() delete, leaking the very child resource the lifecycle reaps. Mitigation the workflow must implement: run with --no-wait (async), cap --timeout-in-seconds well under an hour for synchronous ops, and re-run azure/login before the poll/fetch/delete phase of any long command. Without this, long ops (the whole reason to prefer managed over legacy invoke) fail mid-flight.
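One way to reason about the token gap is to treat the login as valid for somewhat less than an hour and derive both decisions from that single number. The sketch below assumes a 50-minute safety window (3000 s); that figure and both helper names are illustrative, not part of the reviewed design.

```shell
# Assumption: treat an OIDC-derived token as safe for 50 minutes,
# leaving a margin under the ~1 h lifetime noted above.
TOKEN_SAFE_SECONDS=3000

# Succeeds when the login is old enough that azure/login should be re-run
# before the next poll/fetch/delete call.
# $1 = epoch seconds at login, $2 = epoch seconds now.
needs_reauth() {
  local login_epoch="$1" now_epoch="$2"
  [ $(( now_epoch - login_epoch )) -ge "$TOKEN_SAFE_SECONDS" ]
}

# Clamp a requested timeout for synchronous ops so the command
# finishes before the token does.
sync_timeout_seconds() {
  local requested="$1"
  if [ "$requested" -gt "$TOKEN_SAFE_SECONDS" ]; then
    echo "$TOKEN_SAFE_SECONDS"
  else
    echo "$requested"
  fi
}
```

In a workflow, a check like `needs_reauth` would gate a second azure/login step rather than perform the login itself.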

Ops workflow + dispatch

The "wrapper" is a GitHub Actions workflow, dev-ops-runcommand.yml (in the Open question 1 repo, gated by a dev-ops GitHub Environment). It must declare a concurrency: block keyed by the target VM with cancel-in-progress: false. github.run_id makes the child-resource name unique, but Azure does not serialize the guest state, so two simultaneous root scripts can corrupt each other's Docker/env/package operations without either failing at the API; the concurrency block serializes dispatches per VM. (If parallel root-on-dev is genuinely wanted, drop it and document the risk.) Managed run-commands are VM child resources that persist until deleted, so the workflow owns the full lifecycle:

  1. permissions: { id-token: write, contents: read } + azure/login@v2 with gha-sv0-dev-agent-ops (OIDC, no secret); the job declares environment: dev-ops (required for both the branch-policy gate and the OIDC subject).
  2. (If output may exceed 4 KB) mint a user-delegation SAS for runcmd-out.
  3. create the run-command named sv0-dev-runcmd-${{ github.run_id }} (run-id makes the name unique — no $(date +%s) collisions) with --output-blob-uri/--error-blob-uri, in a step with if: always() cleanup.
  4. show/poll until terminal, then fetch output (inline ≤4 KB, else the blob), publish as the run log + an artifact.
  5. delete the run-command child resource in an if: always() step (the delete verb is in the role for exactly this).
  6. A scheduled stale-resource sweeper lists runCommands on each dev VM and deletes only sv0-dev-runcmd-* resources that are in a terminal state AND older than one max-timeout window (≥90 min) — never an in-flight peer. (delete is RG-wide, so the age+state guard is what prevents the sweeper or a concurrent run from killing a live command.)
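Steps 3 to 5 can be sketched as a single shell function, shown here for the inline-output case only (no blob URIs). The resource group and VM name come from ADR-024 as quoted earlier; the 3000 s timeout, the poll interval, and the 120-attempt cap are illustrative assumptions, and the first argument stands in for github.run_id.

```shell
RG=rg-sv0-dev
VM=vm-sv0-dev-1

# $1 = run id (stands in for github.run_id), $2 = self-contained script text.
run_ops_command() {
  local run_id="$1" script="$2"
  local name="sv0-dev-runcmd-${run_id}" state="" attempt rc=0

  # Create without waiting, so the CLI call returns before the script finishes.
  az vm run-command create -g "$RG" --vm-name "$VM" --name "$name" \
    --script "$script" --timeout-in-seconds 3000 --no-wait || rc=$?

  if [ "$rc" -eq 0 ]; then
    # Poll the instance view until the execution state is terminal.
    for attempt in $(seq 1 120); do
      state="$(az vm run-command show -g "$RG" --vm-name "$VM" --name "$name" \
        --instance-view --query 'instanceView.executionState' -o tsv)" || state=""
      case "$state" in
        Succeeded|Failed|TimedOut|Canceled) break ;;
      esac
      sleep "${POLL_SECONDS:-15}"
    done
    # Fetch inline output (capped at roughly the last 4 KB).
    az vm run-command show -g "$RG" --vm-name "$VM" --name "$name" \
      --instance-view --query 'instanceView.output' -o tsv || rc=$?
    [ "$state" = "Succeeded" ] || rc=1
  fi

  # Always delete the child resource, mirroring the if: always() step.
  az vm run-command delete -g "$RG" --vm-name "$VM" --name "$name" --yes || true
  return "$rc"
}
```

In the real workflow these phases would be separate steps so that the delete can carry if: always(); folding them into one function here only keeps the ordering visible.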

The agent side must capture a deterministic run ID (a bare gh run watch/gh run view is interactive and, under concurrent dispatches, can latch onto the wrong run). Dispatch, then recover the id and pass it explicitly:

gh workflow run dev-ops-runcommand.yml -f script='<self-contained script>'
RID=$(gh run list --workflow dev-ops-runcommand.yml --event workflow_dispatch \
--json databaseId,createdAt -q 'sort_by(.createdAt)|last|.databaseId')
gh run watch "$RID" && gh run view "$RID" --log # then download the artifact for >4 KB output

Trust boundary — stated honestly

There is no Azure credential on the Mac — only the gh token (already present). The standing credential is the OIDC trust, which lives in GitHub + Entra, not on the laptop. But that moves the boundary, it doesn't remove it:

  • Anyone who can dispatch dev-ops-runcommand.yml gets root-on-dev (the script input runs as root on the VM — see Risks). The gate is therefore GitHub repo-write + the dev-ops Environment's deployment-branch policy. The Environment must not require human reviewers (that would break "unattended"), so at current scale the control is "who has repo write." Accepted at 1–2 operators; revisit at ≥3.
  • The branch policy is load-bearing and must be pinned to main only, which is stricter than ADR-024's dev env (that one currently also allows redesign/v06-pilot as a temporary exception; don't replicate that exception for dev-ops). ADR-024 §2 establishes why the gate matters: "without this gate, any branch's workflow could mint a token." workflow_dispatch runs from whatever ref the caller names (gh workflow run --ref <branch>), and both the workflow file and the script step are resolved from that ref, so an unpinned policy lets a repo-writer point at any branch and run arbitrary root on dev. This is a verification-checklist item, not an open question.
  • The federated subject is environment-pinned (environment:dev-ops), so fork PRs cannot mint the token (same protection ADR-024 §2 relies on).
  • Treat run-command output as sensitive. A root-shell script can emit secrets (env, /etc/sv0/*.env, docker inspect) straight into the run log + artifact (default 90-day retention, readable by anyone with repo read). Set short artifact retention, restrict who can read the workflow's artifacts, and warn operators against dumping env/secret files. The double-audit-trail property and this exfiltration surface are the same channel.
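Because the federated subject is exact-string-matched, it helps to see the whole credential body in one place. The sketch below prints the parameters that `az ad app federated-credential create --parameters` accepts. The credential name is hypothetical, and the issuer and audience are the standard GitHub Actions to Entra values rather than anything this plan specifies. The repo is a required argument because Open question 1 is unresolved.

```shell
# Print federated-credential parameters for the ops Entra app.
# $1 = workflow repo name (Open question 1; deliberately no default).
federated_credential_json() {
  local repo="$1"
  cat <<EOF
{
  "name": "gha-dev-ops-environment",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:SecurityV0/${repo}:environment:dev-ops",
  "audiences": ["api://AzureADTokenExchange"]
}
EOF
}
```

A subject of this form only matches jobs that declare environment: dev-ops in that exact repo, which is what keeps fork PRs and other environments from minting the token.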

Security properties

Framed against ADR-023's deliberately-paranoid posture:

  • No local Azure credential. The macOS-plaintext-secret problem (the reason the SP-secret draft was dropped) does not exist here — the agent holds only its gh token.
  • No standing access to the VM. No open shell, no reachable port 22, no SSH key on the box. Each action is a discrete, individually-authorized ARM call.
  • Least privilege. Custom run-command-only role on exactly rg-sv0-dev (tighter than ADR-024's VM Contributor). Can't start/stop/delete the VM, can't touch prod, can't read Key Vault.
  • Double audit trail. GitHub Actions run log (actor + script + output) plus Azure Activity Log: managed run-commands emit Microsoft.Compute/virtualMachines/runCommands/write (create) and …/runCommands/delete (cleanup), attributed to the ops SP. Loki/alerts must match runCommands/* (plural) — note that Microsoft's managed-run-command how-to page erroneously says runCommand/write (singular), which is not a real provider operation; the plural is correct for both the role and the Activity-Log op name. Match the wrong string and routine ops go invisible while the check stays green.
  • Zero CF/prod blast radius. Dev SSH stays as-is for humans; prod is entirely untouched.
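The plural-versus-singular trap in the audit-trail bullet is easy to encode as a small predicate. This is a sketch of the matching rule only, not of any Loki query. The helper name is hypothetical, and the case-insensitive comparison is an assumption (exported Activity Log records do not always preserve the provider's casing) that should be confirmed against a real run.

```shell
# Succeeds only for the managed run-command operations (plural "runCommands").
# $1 = operation name as it appears in the Activity Log record.
is_managed_runcommand_op() {
  local op
  op="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$op" in
    microsoft.compute/virtualmachines/runcommands/write) return 0 ;;
    microsoft.compute/virtualmachines/runcommands/delete) return 0 ;;
    *) return 1 ;;
  esac
}
```

With this rule, the singular runCommand/write from the how-to page and the legacy runCommand/action both fall through to the non-matching branch.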

Risks and limitations (name these explicitly)

  1. Run-command executes as root (managed run-command default on Linux). RBAC scopes which VMs, never what the script does — so dispatch rights = effectively root-on-dev. And root-on-dev inherits whatever the dev VM's own managed identity + any Key Vault it reads + its NAT egress can reach: confirm the dev VM's MI has no cross-environment (prod/shared) data-plane access before accepting "can't touch prod." Verb-level allowlisting (a command-broker that accepts a fixed verb set instead of arbitrary script) is the tightening if dispatch-rights-as-root becomes too broad; deferred.
  2. Not a shell. No state between calls, no cd, no TTY. Scripts must be self-contained.
  3. Requires GitHub connectivity. The agent cannot run dev-VM ops fully offline; it depends on GitHub Actions being reachable. (The SP-secret draft was offline-capable; this trade was accepted to drop the local secret.)
  4. Latency. Dispatch + runner spin-up + RBAC propagation (up to ~30 min on first run per ADR-024) means this is ops-paced, not interactive.
  5. runCommands/delete is RG-wide + concurrency. The role can delete any run-command on any VM in rg-sv0-dev; the sweeper's age+terminal-state guard (step 6) is what stops it killing in-flight work. Unique github.run_id names prevent collisions.
  6. Shared fate with waagent. If the guest agent is wedged, run-command fails — same dependency as the deploy path in ADR-024.
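The sweeper guard from risk 5 reduces to three conditions that must all hold before a delete. A minimal sketch follows, assuming the caller has already derived the resource's age in seconds (for example from its instance view timestamps); the helper name and that derivation are illustrative.

```shell
MAX_TIMEOUT_SECONDS=5400   # 90 min, the managed run-command default timeout

# Succeeds only when a run-command is safe to reap:
# our name prefix, a terminal state, and older than one max-timeout window.
# $1 = resource name, $2 = execution state, $3 = age in seconds.
should_reap() {
  local name="$1" state="$2" age_seconds="$3"
  case "$name" in sv0-dev-runcmd-*) ;; *) return 1 ;; esac
  case "$state" in Succeeded|Failed|TimedOut|Canceled) ;; *) return 1 ;; esac
  [ "$age_seconds" -ge "$MAX_TIMEOUT_SECONDS" ]
}
```

Anything still Pending or Running is never reaped regardless of age, so a slow in-flight peer survives the sweep.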

Cloud portability

ARM-specific, but the pattern — control-plane command execution via an OIDC-federated workload identity — maps directly to AWS SSM Run Command + a scoped IAM role with GitHub OIDC (configure-aws-credentials), exactly as ADR-024 §Cloud-portability notes. Not a lock-in.

Implementation phases (in sv0-infrastructure + the workflow repo)

Phase | Work | Where
--- | --- | ---
1 | Custom role sv0-runcommand-operator + Entra app gha-sv0-dev-agent-ops + federated credential (environment:dev-ops) + role assignment on rg-sv0-dev + runcmd-out storage/container with Storage Blob Delegator (account/RG scope) + Storage Blob Data Contributor for the ops identity | bootstrap/azuread.tf (alongside gha-sv0-platform-deploy)
2 | dev-ops-runcommand.yml workflow: OIDC login, SAS mint, managed run-command create→poll→fetch→delete (if: always()), publish artifact; the dev-ops Environment + branch policy | workflow repo (sv0-infrastructure)
3 | Scheduled stale-resource sweeper (terminal + age-guarded) | sv0-infrastructure workflow
4 | Verify GHA log + Activity Log → Loki for runCommands/write + runCommands/delete; smoke a real ops command + a >4 KB output end-to-end | observability
5 | ADR-023 §3.4.6 + runbook 12 section + cross-refs (this PR covers the docs) | sv0-documentation

Documentation changes (for this banked design, if activated)

As shipped (2026-06-03), ADR-023 §3.4.6 and runbook 12 lead with Tailscale SSH — not the GHA-OIDC run-command path below. The list here describes the doc edits that this design would carry if/when it's activated; it is not a record of what currently ships.

  • adr-023 §3.4.6 — would become the OIDC-via-GHA run-command sub-tier (today it ships Tailscale, with this design noted as the banked alternative).
  • runbook 12 — routine run-command section (dispatch + workflow lifecycle), referencing ADR-024 as the precedent.
  • agent-and-m2m-authentication.md — scope note pointing infra/VM-ops readers here (vs platform-API M2M).
  • agent-auth-deployed-envs.md — cross-ref: VM shell ≠ VM URL; service tokens are HTTP-only.
  • cf-access-service-token-setup.md — the service-token≠SSH-cert limitation, recorded.

Open questions

  1. Workflow repo — sv0-infrastructure or sv0-platform? Real trade, not cosmetic: sv0-platform already has a dev GitHub Environment and a working GitHub→Entra OIDC-to-Azure trust (ADR-024's deploy workflow), so reusing it is the lower lift — at the cost of coupling infra-ops into the app repo. sv0-infrastructure has no GHA-OIDC-to-Azure federation today (it's TFC-driven), so the lean choice means standing up the first GitHub Environment + GHA OIDC federation in that repo — net-new plumbing ADR-024 didn't need. Decide with that lift in view; the federated subject must match wherever it lands.
  2. One ops identity for all dev VMs, or per-VM? Lean one gha-sv0-dev-agent-ops scoped to rg-sv0-dev (covers the current VM + future pool).
  3. Command-broker for verb allowlisting? Deferred — dispatch-rights-as-root accepted at current scale; revisit at ≥3 operators or first compliance ask.

Next steps (only when activating this banked design)

This design is banked (see the banner at the top); the routine-headless need shipped via Tailscale SSH. Do the following only if/when the activation trigger fires:

  1. File sv0-infrastructure issue: feat(dev): OIDC ops identity (gha-sv0-dev-agent-ops) + custom run-command role + dev-ops-runcommand.yml.
  2. Confirm the workflow repo + dev-ops Environment branch policy (no required reviewers, to stay unattended).
  3. Update ADR-023 §3.4.6 + runbook 12 to make this the active path (today they lead with Tailscale).

Verification checklist

  • az role definition list --name sv0-runcommand-operator shows exactly the five actions in the canonical list above, scoped to rg-sv0-dev (no wildcard, no legacy runCommand/action).
  • az role assignment list --assignee <ops-app> shows the role on rg-sv0-dev (+ Storage Blob Delegator at account/RG scope and Storage Blob Data Contributor for runcmd-out) and nothing on prod/shared.
  • The ops Entra app has a federated credential and no client secret (az ad app credential list shows no passwords).
  • Agent runs gh workflow run dev-ops-runcommand.yml end-to-end with no browser, no SSH, no port 22, and no Azure credential on the Mac.
  • Loki/alerts match the managed events …/runCommands/write and …/runCommands/delete (plural), attributed to the ops app — confirmed by a real run (not the legacy runCommand/action, not the singular form from the MS how-to page).
  • A >4 KB output is retrieved in full via a user-delegation SAS + --output-blob-uri → artifact (proves the no-truncation property; the SAS mint succeeds, i.e. the Delegator grant is at account/RG scope, not container).
  • A run that exceeds ~1 h still completes and still gets cleaned up — i.e. the workflow re-auths (or runs async) so the OIDC-token-vs-90-min-timeout gap doesn't strand the command or skip the delete.
  • After a run, no sv0-dev-runcmd-* child resource remains on the VM; the sweeper reaps a terminal orphan but never an in-flight command.
  • The dev-ops Environment branch policy is pinned to main and is the only dispatch gate; confirm a --ref <other-branch> dispatch is rejected and fork PRs cannot mint the token.
  • Run-command output artifacts have short retention + restricted readers (output is treated as potentially-secret-bearing).
  • The dev VM's own managed identity has no prod/shared data-plane access (bounds "root-on-dev").
  • Prod and dev human SSH paths unchanged (regression check).
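The branch-policy item in this checklist can be checked mechanically. The sketch below reads the Environment's custom deployment-branch policies through the GitHub REST API and passes only if the list is exactly main. The owner/repo argument is a placeholder for the Open question 1 repo. It also assumes the Environment uses custom branch policies; if it is set to "protected branches only" the list would come back empty and the check would fail, which would need handling separately.

```shell
# Succeeds only if the dev-ops Environment allows deployments from "main" alone.
# $1 = owner/repo (placeholder for the Open question 1 repo).
branch_policy_is_main_only() {
  local owner_repo="$1" names
  names="$(gh api "repos/${owner_repo}/environments/dev-ops/deployment-branch-policies" \
    --jq '.branch_policies[].name')" || return 1
  [ "$names" = "main" ]
}
```

This covers only the configuration half of the item; a --ref dispatch from another branch still needs to be attempted to confirm that it is actually rejected.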