ADR-023: Authentication Target Architecture

Status

Accepted — 2026-05-12.

Phase 0 (§6.0) — the hard precondition for this ADR's operational guarantees — landed in sv0-platform at commit d7885d8 via PR #856. The PR included three rounds of adversarial review and post-merge regression coverage for ADR-023 Rule #1 (CI grep gate) and Rule #8 (cross-env startup assertion + tests). An automated GHA gate that enforces future ADR-023 amendments against the Phase 0 SHA is tracked as a follow-up in §6.4 step 18.

Operationalises and supersedes the perimeter-IdP choice in ADR-022 §5c.1. Paired with docs/runbooks/12-azure-vm-landing-zone.md, which holds the Azure-side implementation sequencing for the items in §6 of this ADR.

2026-05-13 amendment (no-PIM revision + CEO scope review + state-verification correction) — §3.4 Tier-3, §6.2 step 12, §7, glossary amended to drop Azure PIM. Verification returned subscribedSkus: [] + 400 AadPremiumLicenseRequired; P2/PIM reserved for product/demo use cases, not infra access control. The interim design layered on a sv0-azure-backup-owner UAA service principal as an account-lockout rollback (Codex adversarial review tightened it to UAA + out-of-band credential + sunset condition). CEO scope review same day caught that the SP's own sunset trigger — "delete when a 2nd human Owner exists" — was cheaper than the SP itself. State verification (az role assignment list) then showed the trigger was already satisfied months ago: Sergey has been subscription Owner since 2026-01-04 (created the subscription); Ivan was added 2026-03-10. The 2-human Owner rollback has existed since March. The entire PIM-design + backup-SP-design + "add Sergey" issue (#60) were solving a non-problem masked by stale documentation that said "Tier-3 = Ivan only." Bridge cancelled (PR #57 closed). #60 closed as already-resolved. Design patterns from Codex review (UAA > Owner, out-of-band > TF-state, sunset conditions, safe activation pattern) banked in docs/patterns/recovery-credentials.md for any future scenario where a recovery SP is genuinely warranted.

2026-05-20 amendment, revised 2026-06-03 (routine headless ops on dev + staging VMs) — adds §3.4.6, a sanctioned non-interactive path for a remote operator or agent to reach the dev/staging VM shell with no browser. The shipped path is Tailscale SSH (the VMs join the tailnet as tagged nodes via an IaC-managed extension; ssh sv0admin@vm-sv0-dev-1 from any tailnet device, no authorized_keys so Hard Rule #3 is honoured — sv0-infrastructure#120). This is additive — it does not change, weaken, or reroute any of the four human SSH tiers, and prod stays on interactive Tier-1 CF Access SSH. The originally-drafted GHA-OIDC az vm run-command design (secret-free, no-standing-access, but built for unattended ops) is banked for the first-client/compliance trigger, not deleted. Motivated by the constraint (verified 2026-05-20) that a CF Access service token cannot mint a short-lived SSH cert (cloudflared #1056/#212) and the §5 Hard Rule #3 bar on static authorized_keys. §4 and §4a clarified accordingly. Banked design + review history: docs/plans/2026-05-20-headless-dev-vm-ops-plan.md.

Context

SV0 today operates three identity-bearing systems: GitHub (SecurityV0 org), WorkOS (AuthKit), and Entra ID (Azure default directory). Until 2026-05-11 there was no canonical document of which system was the source of truth for which decision; this drift led directly to an incorrect "Entra at Cloudflare Access" verdict on 2026-05-11 that survived 24 hours before being reversed. The reversal exposed several latent issues: app code in dev-provider.ts reading Cf-Access-Jwt-Assertion (a violation of the intended layering), no defined emergency-access path for the Azure compute landing zone, no documented offboarding TTL, no rotation discipline on CF Access service tokens, and no rollback for the tier-3 subscription-owner SPOF.

This ADR locks the target architecture covering three scopes — portal UI access, API access, and infrastructure access — using only IdPs SV0 already operates (zero new identity stores) and applying ADR-022's cloud-portability discipline so the design moves cleanly to AWS if/when credits arrive there.

The document is intentionally long because the failure mode it guards against is future-Claude-session-reverses-this-on-flimsy-reasoning (per §11.4). Three rounds of adversarial review (saved under ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review*.md) pressure-tested every claim. Round-3 verdict: ship as ADR with the Phase 0 precondition.

Decision

The full decision is laid out below in §1 (executive summary) through §11 (amendment process). Key invariants are §5's "Hard rules"; key implementation items are §6's phased rollout (Phase 0 is a hard precondition, Phase 3a is the next sprint).

Decision summary

Three IdPs, each the source of truth for exactly one thing: GitHub (L1 staff perimeter), WorkOS (L2 platform users), Entra (L3 Azure RBAC, break-glass only).
Four SSH tiers: Tier-1 = CF Access SSH + GitHub IdP, Tier-1.5 = narrow per-VM emergency key (CTO/CEO-only), Tier-2 = Azure Serial Console + custom Entra-group role, Tier-3 = dual Active subscription Owners (Sergey since 2026-01-04, Ivan since 2026-03-10) + Security Defaults MFA-on-sign-in. No PIM, no backup SP: Entra ID P2 is not adopted for infra (see §7), and the proposed backup SP was cancelled when state verification showed the 2-Owner rollback already existed (see §6.2 step 12).
Cloud-portable by construction: WorkOS + GitHub + Cloudflare are cloud-agnostic; Azure RBAC is the only cloud-specific config, and its AWS-equivalent mapping is summarised in §1 (the detailed migration sketch lives in the session notes referenced there).
Phase 0 (precondition): rewrite dev-provider.ts to drop CF-Access-JWT reads, add CI grep gate, add WORKOS_AUTHKIT_DOMAIN cross-env startup assertion + test.

Consequences

Positive

One coherent identity model covering UI, API, and infra — replaces the ad-hoc, surface-by-surface state that produced the 2026-05-11 reversal.
Zero net new IdPs to operate; everything below uses what SV0 already runs.
Tier-1.5 + Tier-3 dual-Owner rollback close two real SPOFs that the spike-era setup left implicit (the 2-Owner state already exists; the original ADR text claiming "Tier-3 = Ivan only" was stale).
Cloud portability preserved end-to-end (only L3 config changes on AWS migration).
Explicit #audit-prod-staff-writes Slack channel + quarterly review give a SOC2-prep audit trail without adding anomaly-detection complexity that's premature at current scale.

Negative / accepted residual risk

Tier-3 direct (non-PIM) Azure Bastion role for CF-Access-down recovery is a narrow persistent SPOF (Bastion-reader on one Bastion host, tier-3-only). Accepted as a net-positive trade vs. the CF-fronted-PIM deadlock alternative.
§11.3 two-Claude-session convergence is a thin epistemic guard (same model + same training cutoff + similar context can converge on a wrong answer). Re-litigated when team scales past 5 staff.
1Password is the credential vector for the existing sv0-azure-break-glass SP (rg-sv0-prod Contributor), but the Tier-3 subscription-Owner SPOF concern doesn't apply: Sergey + Ivan are both Active Owners with independent Entra accounts and independent MFA devices. Mutual recovery between operators is the in-place rollback. Microsoft support tenant-root reset (RTO: days) remains the residual fallback only for the joint-loss-of-both-Owner-accounts scenario, which is acceptable.
Backup FIDO2 keys are operator-managed without a physical-safe requirement (per Ivan's pre-client simplicity preference). Acceptable at 1-2 operator scale; revisit at team-of-5 or first compliance ask.
Anomaly detection for super-admin actions is deferred to first compliance ask or 5+ staff.

What changes downstream

ADR-022 §5c.1 needs amendment to swap "Entra IdP at CF Access" for "GitHub IdP at CF Access" (tracked as §6.4 step 19).
Runbook 12 phases adopt the four-tier SSH model and the dual-Owner rollback procedure (no PIM, no backup SP).
Phase 0 implementation lands in sv0-platform (dev-provider.ts rewrite + CI grep gate + env.ts iss assertion + test).
Phase 3a implementation lands in sv0-infrastructure (CF Access SSH on dev-azure-ssh.securityv0.com, Tier-1.5 emergency key + cloud-init wiring, check-cf-service-tokens.yml GHA, direct Bastion role). Dual-Owner state at Tier-3 was already in place — no Phase 3a-4 Azure provisioning required.

1. Executive summary

Three identity providers, each the source of truth for one thing:

GitHub (SecurityV0 org) — source of truth for SV0 staff identity at the network perimeter (Cloudflare Access). Every SV0 staff member has a GitHub account in the org. MFA enforced at GitHub.
WorkOS (AuthKit) — source of truth for platform user accounts (Layer 2 inside the platform). Federates Google Workspace (@securityv0.com), Magic Link, OTP. Customer users and staff-as-product-users authenticate here.
Entra ID (Azure default directory) — source of truth for Azure resource access only. Holds a small fixed sv0-vm-emergency-ops tier-2 emergency operator group + the tier-3 subscription owner accounts (Sergey + Ivan). Not a staff identity store.

Three layers, each answering a different question:

Layer	Question	Source of truth	Surface
L1 — Network perimeter	Is this an SV0 staff member who's allowed to reach this URL/port at all?	GitHub (org membership)	Cloudflare Access
L2 — Application	Is this a platform user (customer or staff-as-product-user) with a valid session?	WorkOS	Inside the platform
L3 — Cloud resource RBAC	Is this Azure principal allowed to call this Azure API?	Entra (Azure roles, Active assignments + Security Defaults MFA-on-sign-in)	Azure portal / CLI / Serial Console

Net new IdPs vs today: zero.

Cloud portability — WorkOS + GitHub + Cloudflare layers are cloud-agnostic. Only L3 RBAC (Azure roles, Entra groups, Security Defaults) is Azure-specific; on AWS migration it maps to IAM + IAM Identity Center + an IAM Identity Center MFA policy. Detailed sketch lives in .scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v3.md — promote to a sub-section here if/when AWS is actually funded. Speculative content removed from this ADR per CEO scope review 2026-05-13.

2. Audiences

Audience	Examples	Auth path
SV0 staff	Engineers, ops, founders	GitHub at L1 + WorkOS at L2
SV0 staff doing emergency ops	Tier-2 emergency operators (1–3 manually-provisioned Entra members)	+ Entra account (Active role, Security Defaults MFA-on-sign-in) for Serial Console
SV0 subscription owners	Tier-3 — today Sergey (original Owner since 2026-01-04) and Ivan (Owner since 2026-03-10)	+ Entra account with Active Owner role (Security Defaults MFA-on-sign-in, browser session 8h). The other Owner is the account-lockout rollback (§6.2).
Customer users	Customer admin/analyst logging into SV0 platform	WorkOS at L2 only (no L1 — prod has no perimeter)
Service / agent / CI	TFC runs, connector pipelines, internal bots	OIDC federation (Azure), M2M tokens (WorkOS, CF service tokens)

3. Surface decision matrix

3.1 Customer-facing portal (UI)

Surface	L1 perimeter	L2 application	L3 resource
`app.securityv0.com` (prod portal)	— (public DNS, no CF Access)	WorkOS prod env (Google + Magic Link + OTP)	n/a
`staging.securityv0.com`	— (matches prod's posture so E2E auth tests are meaningful)	WorkOS staging env (separate org, distinct JWKS + audience)	n/a

Customers need to reach prod/staging URLs — CF Access in front would require provisioning every customer in CF Access, bypassing WorkOS. Prod's only door is L2.

3.2 Internal portals (UI)

Surface	L1 perimeter	L2 application	L3 resource
`dev.securityv0.com` (Hetzner today, Azure later)	CF Access + GitHub IdP	WorkOS	n/a
`dev-azure.securityv0.com` (spike)	CF Access + GitHub IdP	WorkOS	n/a
`pr-N-dev.securityv0.com` (PR previews)	CF Access + GitHub IdP	WorkOS	n/a

L1 keeps unfinished builds invisible to the world; L2 means even L1-authenticated visitors need a real WorkOS session. CF Access policy at L1 is GitHub org membership in SecurityV0. Never parallel email allowlists.

3.3 API access (programmatic + CLI)

Surface	L1 perimeter	L2 application
`app.securityv0.com/api/*` (prod)	—	WorkOS bearer (session cookie OR M2M token)
`staging.securityv0.com/api/*`	—	WorkOS staging bearer
`dev.*.securityv0.com/api/*`	CF Access (inherits from URL)	WorkOS bearer
Internal CLI scripts hitting prod/staging	(perimeter inherits)	WorkOS device_code flow → bearer
Internal CLI scripts hitting dev/dev-azure/preview	CF Access service token (§4a) OR human GitHub flow	WorkOS bearer
Customer-tenant API consumer	—	WorkOS M2M token (per-tenant)

Hard rule: application code MUST NOT read Cf-Access-Jwt-Assertion to derive identity. App identity is always WorkOS-derived. L1 is a network-reachability gate, not an identity signal the app trusts. (See §5 Rule #1 — Phase 0 resolves the legacy dev-provider violation.)
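A minimal sketch of what the CI grep gate for this rule could check. It is not the gate that shipped in sv0-platform: the helper name `findCfAccessReads`, the in-memory input shape, and the exact pattern are assumptions made for illustration. Only the guarded roots (src/ and ui/src/) and the cf-access-* target come from this ADR.

```typescript
// Hypothetical sketch of the Rule #1 gate: flag any reference to a
// Cloudflare Access header inside the application source trees.
// The pattern and helper names are assumptions for illustration.
const CF_ACCESS_PATTERN = /cf-access-/i;
const GUARDED_ROOTS = ["src/", "ui/src/"];

interface Violation {
  path: string;
  line: number; // 1-based line number of the offending text
  text: string;
}

// `files` maps a repo-relative path to that file's contents.
function findCfAccessReads(files: Record<string, string>): Violation[] {
  const violations: Violation[] = [];
  for (const [path, content] of Object.entries(files)) {
    // Only application code is gated; docs and infra may mention the header.
    if (!GUARDED_ROOTS.some((root) => path.startsWith(root))) continue;
    content.split("\n").forEach((text, index) => {
      if (CF_ACCESS_PATTERN.test(text)) {
        violations.push({ path, line: index + 1, text: text.trim() });
      }
    });
  }
  return violations;
}

const sample: Record<string, string> = {
  "src/api/auth/providers/dev-provider.ts":
    'const jwt = req.headers["cf-access-jwt-assertion"];',
  "src/api/auth/providers/workos-provider.ts":
    "const session = verifyWorkosSession(token);",
  "docs/adr/023.md": "Cf-Access-Jwt-Assertion is discussed here.",
};

const found = findCfAccessReads(sample);
```

A CI step built on this idea would fail the build whenever the returned list is non-empty, which keeps L1 a reachability gate rather than an identity signal.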

3.4 Infrastructure access

Four tiers, each with a distinct mechanism.

Tier	Use case	Mechanism	Identity	Lifetime
Tier-1 SSH	Routine operator SSH to a VM	Cloudflare Access SSH in front of port 22, GitHub IdP, CF SSH CA short-lived certs	GitHub user (must be in `SecurityV0` org)	~1h per session
Tier-1.5 emergency SSH	CTO/CEO-level break-glass: cloud-init broke, Serial Console unreachable, or CF Access SSH degraded	Per-VM ed25519 key, `sv0emergency` user (no sudo, read-only `/var/log`, single `sv0-rescue` script)	1Password-stored private key	Until per-VM key is rotated (on next redeploy)
Tier-2 emergency console	Network/SSH itself is broken — last-resort	Azure portal Serial Console, gated by custom `sv0-serial-console-operator` role on `sv0-vm-emergency-ops` Entra group	Entra account, manually provisioned, 1–3 members	Per-session, Azure-audited
Tier-3 subscription owner	Subscription-Owner-level ops: bootstrap, RBAC, policy edits	Azure portal / CLI, Entra account, Active assignment + Security Defaults MFA-on-sign-in. Dual-Owner (Sergey + Ivan; each Owner is the other's account-lockout rollback).	Tier-3 operators' Entra accounts (Sergey since 2026-01-04, Ivan since 2026-03-10)	Per-session (browser 8h, MFA-on-sign-in)

3.4.1 Tier-1.5 per-VM emergency key — scope and constraints

The Tier-1.5 key is not "SSH keys for daily ops" — it's a narrowly-scoped fourth route in when the first three fail. Per Ivan's 2026-05-12 decision, accepted only under these constraints, all enforced at provisioning time:

Key is NOT in authorized_keys for sv0admin (routine-ops account). Only sv0emergency accepts it.
sv0emergency has:
- No sudo, not in wheel or sudo groups, no sudoers entry.
- Read-only access to /var/log.
- Permission to run exactly one script: /usr/local/bin/sv0-rescue, which:
  1. Writes local diagnostics to disk first (timestamped tarball: last 1000 syslog lines, journalctl tail, docker ps, df -h) at /var/lib/sv0-rescue/$(date +%s).tar.gz. This is the primary action — never deferred.
  2. Then attempts to post an audit-trail webhook to a CF Audit endpoint with a 5-second timeout. On failure, logs to stderr (loud warning) and continues. Audit completeness is achieved by weekly reconciliation of local tarballs against received webhooks (see §3.4.4).
- No shell init, no PATH access to user binaries beyond sv0-rescue.
Key generated per-VM at TF apply-time, stored in 1Password sv0-infra vault as item vm-emergency-<vm-name>. Storage path is out-of-band (see §3.4.5).
Rotated automatically on every VM redeploy.
Accessible only to CTO/CEO-level operators (today: Ivan + Sergey).

This satisfies the spirit of Rule #3: no long-lived authorized_keys for routine ops; emergency-only key with narrow blast radius and tight rotation.

3.4.2 Tier-1 SSH via Cloudflare Access — concrete shape

Separate hostname per VM environment, not the URL hostname (Cloudflare enforces domain-uniqueness across app types). Hostname must be depth-1 from the apex (single label under securityv0.com) — Cloudflare Free's Universal SSL covers only one wildcard level, so a depth-2 name like ssh.dev-azure.securityv0.com will TLS-fail at the edge with handshake_failure. Use a depth-1 pattern like dev-azure-ssh.securityv0.com for the spike VM, staging-ssh.securityv0.com for staging, etc. (PR #35 + sv0-infrastructure issue #38 confirmed the depth-2 failure live; the pattern is also called out in the operator memory project_cf_universal_ssl_one_level.)
DNS CNAME for that hostname → the Cloudflare Tunnel.
Tunnel ingress rule routing SSH traffic to ssh://localhost:22 on the VM.
Cloudflare Access app of type=ssh on that hostname, GitHub IdP, auto_redirect_to_identity=true, allow-list filtered by GitHub org membership.
MFA enforcement is upstream at GitHub's org-policy (require-2FA), not via CF Access require { auth_method = "mfa" }. Empirically confirmed 2026-05-12: CF Access's IdP-based MFA require reads the OIDC amr claim and is only supported for Okta, Microsoft Entra ID, Generic OIDC, Generic SAML 2.0. GitHub OAuth (which the GitHub IdP uses) does not emit amr, so the require is structurally unsatisfiable — it denies all authentications rather than enforcing MFA. The replacement path (CF Access independent MFA: TOTP/WebAuthn at the application layer, IdP-agnostic) is tracked in sv0-infrastructure#36. Until that lands, MFA is enforced only at GitHub's session layer.
Per-app SSH CA managed via Terraform (cloudflare_zero_trust_access_short_lived_certificate). Public key rendered into cloud-init at apply time; no runtime fetch.
VM sshd configured: TrustedUserCAKeys /etc/ssh/cloudflare_ca.pub, AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u.
Cert principal format: CF Access SSH emits the user's email local-part (not full email, not GitHub login) as the cert Principals field, with the full email in Key ID. The principals file (/etc/ssh/auth_principals/sv0admin) must contain email local-parts to match. Empirically confirmed 2026-05-12 via ssh-keygen -L on a live cert (Principals=ifofanov, Key ID=ifofanov@securityv0.com).
CF-issued cert validity is 4 minutes. The operator ssh_config's Match host ... exec "cloudflared access ssh-gen --hostname %h" re-mints the cert on every connection. If ssh-gen fails (e.g., CF Access degraded), ssh fails closed with publickey denied — no silent fallback to a long-lived key. See §3.4.4 for the CF-down recovery path.
Operator connects: cloudflared access ssh --hostname dev-azure-ssh.securityv0.com. Full ssh_config block per cloudflared access ssh-config --hostname ... --short-lived-cert; must include IdentitiesOnly yes + IdentityAgent none so a local SSH agent (1Password etc.) doesn't shadow the CF cert.
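Two of the constraints above lend themselves to a provisioning-time check: the depth-1 hostname rule and the local-part principal format. The sketch below is illustrative only; the helper names `isDepthOneHost` and `certPrincipalFor` are hypothetical and do not exist in sv0-infrastructure.

```typescript
// Hypothetical provisioning-time checks for the Tier-1 SSH shape.
const APEX = "securityv0.com";

// True when the hostname is exactly one label under the apex, which is
// what a single-level wildcard certificate covers.
function isDepthOneHost(hostname: string, apex: string = APEX): boolean {
  const suffix = "." + apex;
  if (!hostname.endsWith(suffix)) return false;
  const label = hostname.slice(0, -suffix.length);
  return label.length > 0 && !label.includes(".");
}

// The cert principal is the email local-part, so the principals file
// must list local-parts rather than full addresses or GitHub logins.
function certPrincipalFor(email: string): string {
  const at = email.indexOf("@");
  if (at <= 0) throw new Error(`not an email address: ${email}`);
  return email.slice(0, at);
}

const spikeHostOk = isDepthOneHost("dev-azure-ssh.securityv0.com");
const nestedHostOk = isDepthOneHost("ssh.dev-azure.securityv0.com");
const principal = certPrincipalFor("ifofanov@securityv0.com");
```

Rejecting a depth-2 name before apply surfaces the problem earlier than the edge `handshake_failure` does, and deriving principals from emails keeps /etc/ssh/auth_principals/sv0admin aligned with what the cert actually carries.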

3.4.3 Why not Azure AD SSH login (`az ssh vm`)?

Considered, rejected: adds Entra dependency on every operator (SV0 staff aren't in Entra); requires aadsshlogin-extension; doesn't compose with Cloudflare Tunnel (no public IP, needs Azure Bastion → more cost); GitHub already covers everyone; not cloud-portable to AWS.

3.4.4 When Cloudflare Access is down

CF Access has had outages (most recently 2026-02-13, ~2h partial control-plane). During such windows, every L1-gated surface is unreachable — including Tier-1 SSH.

The CF-independent fallback path:

Confirm CF Access is down (not local network): curl -fsSL https://api.cloudflare.com/client/v4/zones | head — repeated 5xx or timeouts → it's CF.
portal.azure.com works — it's Microsoft-hosted, not CF-fronted. Tier-3 Owner can reach Entra portal as normal (Owner is Active, no activation step).
Tier-1.5 emergency key is unreachable via cloudflared (tunnel needs CF control plane to reconnect). Use Azure Bastion via a direct (non-PIM) Bastion-reader role assignment — this is the CF-independent transport. The direct role is provisioned for tier-3 only and stays Active (not PIM-eligible) so it remains usable during CF outages.
If neither Bastion nor Tier-1.5 works, fall back to Tier-2 Serial Console (Azure portal, CF-independent).
cloudflared reconnect is best-effort during CF control-plane outages — do not rely on tunnel availability.

Note: sv0-rescue writes local diagnostics first regardless of CF reachability (per §3.4.1). The CF Audit webhook is best-effort; reconciliation runs weekly to catch missing entries.
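The weekly reconciliation can be pictured as a set difference between local tarballs and received webhooks. This is a sketch under stated assumptions: it assumes each webhook carries the same Unix timestamp that names its tarball, and the function and field names are hypothetical.

```typescript
// Hypothetical weekly reconciliation: every local rescue tarball should
// have a matching audit webhook. Names and shapes are assumptions.
interface Reconciliation {
  missingWebhooks: number[]; // tarballs with no webhook received
  orphanWebhooks: number[]; // webhooks with no local tarball
}

// Tarballs are named <unix-seconds>.tar.gz under /var/lib/sv0-rescue.
function tarballTimestamp(fileName: string): number | null {
  const match = /^(\d+)\.tar\.gz$/.exec(fileName);
  return match ? Number(match[1]) : null;
}

function reconcile(tarballs: string[], webhookTimestamps: number[]): Reconciliation {
  const local = new Set<number>();
  for (const name of tarballs) {
    const ts = tarballTimestamp(name);
    if (ts !== null) local.add(ts); // ignore files that are not rescue tarballs
  }
  const received = new Set<number>(webhookTimestamps);
  const ascending = (a: number, b: number) => a - b;
  return {
    missingWebhooks: Array.from(local).filter((ts) => !received.has(ts)).sort(ascending),
    orphanWebhooks: Array.from(received).filter((ts) => !local.has(ts)).sort(ascending),
  };
}

const report = reconcile(
  ["1700000000.tar.gz", "1700000100.tar.gz", "notes.txt"],
  [1700000000, 1700000500],
);
```

A non-empty `missingWebhooks` list is the expected result after a CF outage. An orphan webhook (a webhook with no local tarball) is the more suspicious case, since diagnostics are supposed to be written first.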

Drills: quarterly per-operator exercise — run through CF-down procedure on a test VM. First time you need this for real is the wrong time to learn it.

3.4.5 Tier-1.5 key storage — out-of-band 1Password write

Decision: out-of-band write, not the 1Password CLI provider at TF apply-time. Rationale: coupling TFC's apply identity to 1Password write access is sensitive (compromise blast radius = vault write). Out-of-band keeps blast narrow.

Enforcement: post-apply CI check verifies that 1Password contains an item vm-emergency-<vm-name> whose created_at >= apply_completed_at. CI gates merge of the apply-PR on this check passing. This makes "rotated automatically on every VM redeploy" enforced, not aspirational.
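A minimal sketch of the freshness comparison that check performs, assuming the vault listing has already been fetched into memory. The `VaultItem` shape and the function name are hypothetical; the real check reads 1Password through the read-only service account described in this section.

```typescript
// Hypothetical shape of a vault item as seen by the verifier.
interface VaultItem {
  title: string;
  createdAt: string; // ISO-8601 timestamp
}

// True only if an item named vm-emergency-<vmName> exists and was created
// at or after the apply finished (created_at >= apply_completed_at).
function emergencyKeyIsFresh(
  items: VaultItem[],
  vmName: string,
  applyCompletedAt: string,
): boolean {
  const wanted = `vm-emergency-${vmName}`;
  const applyTime = Date.parse(applyCompletedAt);
  if (Number.isNaN(applyTime)) throw new Error("invalid apply_completed_at");
  // An unparseable createdAt yields NaN, which fails the comparison (fail closed).
  return items.some(
    (item) => item.title === wanted && Date.parse(item.createdAt) >= applyTime,
  );
}

const vault: VaultItem[] = [
  { title: "vm-emergency-vm-sv0-dev-1", createdAt: "2026-05-12T10:05:00Z" },
  { title: "vm-emergency-vm-sv0-staging-1", createdAt: "2026-04-01T09:00:00Z" },
];

const devFresh = emergencyKeyIsFresh(vault, "vm-sv0-dev-1", "2026-05-12T10:00:00Z");
const stagingFresh = emergencyKeyIsFresh(vault, "vm-sv0-staging-1", "2026-05-12T10:00:00Z");
```

The stale staging item fails the check even though it exists, which is the point: presence alone does not prove the key was rotated with the redeploy.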

Verifier credentials: a 1Password service account scoped read-only to items matching vm-emergency-* in the sv0-infra vault, stored as GHA secret OP_VM_EMERGENCY_VERIFIER. Rotated yearly (calendar reminder, same cadence as §4a service tokens). Read-only and item-prefix-scoped → minimal blast radius even if leaked. The service account has no write or delete capability anywhere. Tracked in 1Password under item op-service-account-vm-emergency-verifier.

Implementation note: the apply itself writes the public key to the VM's cloud-init. A subsequent step (manual or GHA workflow with operator OP_TOKEN at PR-author request time) uses 1Password CLI with operator credentials to write the private key. The CI verification step uses the read-only service account above to verify the write happened. Write and read are separate identities by design.

3.4.6 Routine headless ops (non-interactive, dev + staging VMs)

The four tiers above are all human interactive access. They cannot serve a remote operator who can't reach a browser, or an agent acting unattended: Tier-1 requires an interactive GitHub OAuth browser flow (the CF Access org token expires ~1h and cloudflared access ssh-gen re-prompts), and there is no headless variant — a CF Access service token passes HTTP and the cloudflared access ssh TCP proxy but cannot mint a short-lived SSH cert (no public mechanism; tracked in cloudflared #1056/#212, and Access for Infrastructure has no headless path either). Forcing SSH headless would require a static authorized_keys entry, which §5 Hard Rule #3 forbids.

Sanctioned path: Tailscale SSH (shipped 2026-06-03). The dev and staging Azure VMs join the operator's tailnet and enable Tailscale SSH, so any tailnet device — the operator's iPhone/MacBook, or an agent on the tailnet — can ssh sv0admin@vm-sv0-dev-1 with no browser, no SSH key, and no exposed port 22. Authentication is the device's WireGuard identity; authorization is the tailnet ACL.

Enrolment is IaC. An azurerm_virtual_machine_extension (CustomScript) in sv0-infrastructure installs Tailscale and runs tailscale up --reset --ssh on each VM — applied to the running VM with no recreate (no custom_data change, so it sidesteps the cloud-init-forces-replace trap), and re-run automatically on rebuild. The VMs enrol as tagged nodes (tag:sv0-dev / tag:sv0-staging): owned by a tag, not a user, so no key expiry. (sv0-infrastructure#120.)
Does NOT trip Hard Rule #3. Tailscale SSH never writes authorized_keys — node identity is WireGuard, authorization is the tailnet ACL, and access is centrally revocable and SSO-backed. It honours Rule #3's rationale (no exfiltratable static key) while adding a standing, ACL-scoped, audited reachability path. Treat it as a new tier alongside the four human tiers, not a change to them.
Additive. The CF Access SSH tier (§3.4.2) and the Tier-1.5 emergency key (§3.4.1) are unchanged and remain the fallback if the tailnet is unreachable; port 22 is not opened.
Why this shape: for a solo, pre-client operator the real need is "reach the VM shell from my phone, simply" — Tailscale delivers exactly that with one lightweight daemon and a one-paragraph ACL. The heavier, no-standing-access run-command design (below) is banked for when the need becomes truly unattended, no-human-in-the-loop, every-action-an-audited-ARM-call.
Prod is excluded — prod ops stay on interactive Tier-1 CF Access SSH.
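To make the "dev + staging only" boundary concrete, the sketch below models the `ssh` section of a tailnet policy file as a TypeScript literal and lints it for out-of-scope destinations. The rule fields follow the general shape of Tailscale SSH policy rules, but the source group name and the lint helper are hypothetical; the authoritative ACL is the one described in sv0-infrastructure/docs/tailscale-ssh.md.

```typescript
// Assumed shape of one rule in the `ssh` section of a tailnet policy file.
interface SshRule {
  action: "accept" | "check";
  src: string[];
  dst: string[];
  users: string[];
}

const sshRules: SshRule[] = [
  {
    action: "accept",
    src: ["group:sv0-operators"], // hypothetical group name
    dst: ["tag:sv0-dev", "tag:sv0-staging"],
    users: ["sv0admin"],
  },
];

const ALLOWED_TAGS = new Set<string>(["tag:sv0-dev", "tag:sv0-staging"]);

// Returns destinations that fall outside dev + staging (e.g. a prod tag).
function disallowedDestinations(rules: SshRule[]): string[] {
  const bad: string[] = [];
  for (const rule of rules) {
    for (const dst of rule.dst) {
      if (!ALLOWED_TAGS.has(dst)) bad.push(dst);
    }
  }
  return bad;
}

const outOfScope = disallowedDestinations(sshRules);
```

Running a lint like this against the policy before it is applied would catch a prod tag being added by accident, keeping prod on interactive Tier-1 CF Access SSH.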

Operational steps: runbook 12 § Routine headless ops. Tailnet ACL + extension wiring: sv0-infrastructure/docs/tailscale-ssh.md.

Banked alternative — GHA-OIDC az vm run-command (designed, reviewed, not built). A fully secret-free, no-standing-access design: an agent dispatches a GitHub-Actions workflow that OIDC-federates to Azure (no client secret on the machine) under a custom run-command-only RBAC role, and runs managed az vm run-command. Its distinguishing strengths — no standing VM reachability, every action a discrete audited ARM call, least-privilege role — earn their keep at the first-client / compliance / ≥3-operator trigger, not before (cf. ADR-024's banked Phase-2 pool). It was chosen against here because it optimises for unattended ops, while the current need is remote human-or-agent shell access; it is also materially more to build and maintain. Full design, three-round review history, gotchas, and rollout: docs/plans/2026-05-20-headless-dev-vm-ops-plan.md. Activate when the trigger fires.

3.5 Other admin surfaces

Surface	Auth
`github.com/SecurityV0/*`	GitHub login (MFA enforced via org policy)
`dash.cloudflare.com`	Cloudflare account login (federated to Google Workspace `@securityv0.com`)
`portal.azure.com` (subscription)	Tier-3 subscription owners (Sergey's + Ivan's Entra) — Active Owner, Security Defaults MFA-on-sign-in
MongoDB Atlas console	Atlas login (federated to Google Workspace)
HCP Terraform UI	HashiCorp Cloud login (federated to GitHub)
1Password (secret vault)	1Password account, MFA enforced

3.6 Staff access to prod — audit trail

Prod has no L1 perimeter (customers need to reach it), so staff and customers use the same L2 (WorkOS). The staff/customer boundary lives entirely in detection.

Event	Action
Super-admin (staff with `WORKOS_SUPER_ADMIN_ORG_ID` membership) authenticates against prod	Standard WorkOS audit log, retained 90d
Super-admin writes to a customer tenant	Slack notification to `#audit-prod-staff-writes`: actor + tenant + endpoint + timestamp
Super-admin reads customer tenant data	Tail-aggregated daily, posted to `#audit-prod-staff-reads` as rollup
`WORKOS_SUPER_ADMIN_ORG_ID` membership change	Notified to `#audit-workos-membership`

#audit-* Slack channels are the audit record, not the alerting mechanism. PagerDuty/Opsgenie pages when a super-admin acts outside business hours (configurable).
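The event table above maps naturally onto a small routing function. This is a sketch, not the platform's implementation: the `AuditEvent` type, the field names, and `routeAuditEvent` are hypothetical, while the channel names and the actor + tenant + endpoint + timestamp payload come from the table.

```typescript
// Hypothetical event model for the four rows of the audit table.
type AuditEvent =
  | { kind: "super_admin_login"; actor: string; timestamp: string }
  | { kind: "tenant_write"; actor: string; tenant: string; endpoint: string; timestamp: string }
  | { kind: "tenant_read"; actor: string; tenant: string; timestamp: string }
  | { kind: "membership_change"; actor: string; timestamp: string };

interface Routed {
  channel: string | null; // null = WorkOS audit log only, no Slack post
  immediate: boolean; // false = folded into the daily rollup
  text: string;
}

function routeAuditEvent(event: AuditEvent): Routed {
  switch (event.kind) {
    case "super_admin_login":
      return { channel: null, immediate: false, text: "" };
    case "tenant_write":
      return {
        channel: "#audit-prod-staff-writes",
        immediate: true,
        text: `${event.actor} wrote to ${event.tenant} via ${event.endpoint} at ${event.timestamp}`,
      };
    case "tenant_read":
      return {
        channel: "#audit-prod-staff-reads",
        immediate: false,
        text: `${event.actor} read ${event.tenant} at ${event.timestamp}`,
      };
    case "membership_change":
      return {
        channel: "#audit-workos-membership",
        immediate: true,
        text: `super-admin membership changed by ${event.actor} at ${event.timestamp}`,
      };
  }
}

const routed = routeAuditEvent({
  kind: "tenant_write",
  actor: "ivan",
  tenant: "acme",
  endpoint: "PATCH /api/users/1",
  timestamp: "2026-05-12T10:00:00Z",
});
```

Keeping the routing in one exhaustive switch means a new event kind cannot be added without the compiler forcing a decision about where its audit record goes.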

Anomaly detection deferred. Metrics-based detection (super-admin writes per hour > N, or writes to > M distinct tenants in a window) is explicitly deferred to first compliance ask or 5+ staff. At current scale (1-2 operators), the Slack rollup + quarterly review by Sergey/Ivan is the control. Re-evaluate when the team grows or a customer audit requires it.

Policy (start weak, tighten as the team grows):

Routine staff prod access should reference a paired ticket or customer support escalation. The audit channels make this enforceable retroactively.
Production write actions by staff must be reproducible from non-staff API endpoints. If not (one-off Mongo edit), file a follow-up to make them reproducible.

3.7 Account recovery

When a staff member loses GitHub access (locked, MFA device lost):

Capability	Status during recovery
`dev.*`/`pr-*-dev.*` URLs	Blocked (L1 GitHub gate)
Tier-1 SSH	Blocked (CF Access SSH uses GitHub IdP)
`app.securityv0.com`, `staging.securityv0.com`	Works (L2 WorkOS uses Google Workspace, independent of GitHub)
GitHub `SecurityV0/*` repos	Blocked
Customer support work (read-only on prod)	Works
Serial Console	Works if user is a tier-2 emergency operator (Entra independent of GitHub)

Expected GitHub recovery TAT: 2–5 business days for fully-locked account; minutes-to-hours for self-serve MFA backup-code path.

Backup FIDO2 setup (pragmatic, simple):

Each tier-2 emergency operator registers a backup FIDO2 key with both GitHub and Entra.
The backup key is kept physically separate from daily-carry items (not on the same keyring as the daily MFA device) — the operator chooses where.
The FIDO2 PIN is memorized, not stored anywhere.
Yearly check that the backup FIDO2 still authenticates against GitHub + Entra (5-min self-test, calendar reminder).
No physical-safe requirement, no per-location restrictions — operators work from any location.

Accepted residual risk: if the backup FIDO2 is lost or compromised at the same time as the daily MFA device, recovery falls to the GitHub support flow (2-5 business days). This is acceptable at current scale.

4. Service / machine identities

Use case	Mechanism	Notes
TFC plans/applies → Azure	OIDC federation (per-workspace SP)	No long-lived secrets
TFC writes state backup to Azure Storage	OIDC federation	Same SP as apply
Bootstrap apply → Azure	Operator's `az login` (Tier-3 Active Owner — Sergey or Ivan)	Local-apply per ADR-022 §7; migration to TFC: `sv0-infrastructure#29`
Connectors reading source systems	Per-connector API key, stored in source system	Read-only
CI runs (GitHub Actions → external)	GitHub OIDC where supported; PAT otherwise (read-scoped)
Internal agents (Claude Code) hitting platform API	WorkOS M2M client per agent (`delegated_agent` kind)	Memory `project_auth_principal_model_locked`
Internal scripts hitting prod/staging	WorkOS device_code flow → short-lived bearer
Internal scripts hitting dev URLs (must pass CF Access)	CF Access service token (§4a) OR cloudflared with operator's GitHub identity	HTTP only — see the note below; a service token cannot mint an SSH cert
Headless ops on a dev/staging VM (its shell, not its URL)	Tailscale SSH (§3.4.6) — the GHA-OIDC `az vm run-command` design is banked	Tailnet node; no browser, no `authorized_keys` (Hard Rule #3 honoured); dev+staging only; tailnet-log audited

4a. CF Access service tokens — policy

Service tokens cover HTTP and the SSH TCP proxy, not SSH certs. A CF Access service token satisfies an HTTP Access policy and the cloudflared access ssh proxy, but cannot mint a short-lived SSH cert via cloudflared access ssh-gen (cloudflared #1056/#212). For browserless VM shell ops use the §3.4.6 Tailscale SSH path, not a service token in ssh_config.

One token per script/workflow, named cf-access-st-<purpose> (e.g., cf-access-st-seed-demo).
Stored in 1Password sv0-infra vault with prefix cf-access-st-. Each 1Password item has an expires_at field set to issue-date + 90d and records where the consumer stores the token (TFC variable name, GH Actions secret name).
Scope-per-app: each token bound to exactly one CF Access app. Wildcards forbidden.
Rotation enforced by automation: scheduled GHA workflow check-cf-service-tokens.yml (weekly) reads 1Password via OP CLI, lists tokens past expires_at, opens an issue assigned to the token's owner, fails CI if any token is >120d old (hard cap). Manual rotation only — automated rotation is non-trivial and not worth the complexity at this scale.
Fail loud on missing config: every consumer asserts both CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET non-empty at startup, hard-exit if missing. Per Rule #6.
Leak detection: GitHub secret scanning patterns for CF_ACCESS_CLIENT_SECRET.
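The rotation policy reduces to two thresholds measured from the issue date: 90 days (expires_at, open an issue) and 120 days (hard cap, fail CI). The sketch below shows that classification in isolation; the function and type names are hypothetical, and it does not reproduce how check-cf-service-tokens.yml reads 1Password.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const ROTATE_AFTER_DAYS = 90; // expires_at = issue date + 90d
const HARD_CAP_DAYS = 120; // CI fails beyond this age

type TokenStatus = "ok" | "rotation_due" | "hard_cap_exceeded";

// Classifies a token by age. `issuedAt` is an ISO-8601 timestamp.
function tokenStatus(issuedAt: string, now: Date): TokenStatus {
  const issued = Date.parse(issuedAt);
  if (Number.isNaN(issued)) throw new Error(`invalid issue date: ${issuedAt}`);
  const ageMs = now.getTime() - issued;
  if (ageMs > HARD_CAP_DAYS * DAY_MS) return "hard_cap_exceeded"; // fail CI
  if (ageMs > ROTATE_AFTER_DAYS * DAY_MS) return "rotation_due"; // open an issue
  return "ok";
}

const issued = "2026-01-01T00:00:00Z";
const young = tokenStatus(issued, new Date("2026-02-01T00:00:00Z")); // 31 days old
const due = tokenStatus(issued, new Date("2026-04-15T00:00:00Z")); // 104 days old
const capped = tokenStatus(issued, new Date("2026-06-01T00:00:00Z")); // 151 days old
```

The 30-day gap between the two thresholds is the window in which the owner is expected to rotate manually before CI starts failing.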

4b. Session lifetimes (target)

Token / session	TTL	Reason
WorkOS staff session	8h	One workday; daily re-auth via Google Workspace
WorkOS customer session	30d	Customer UX expectation; refreshed on activity
CF Access dev session (browser)	8h	Matches staff workday
CF Access SSH session	~1h	Per-session re-auth is the value of CF Access SSH. The SSH cert itself is valid for 4 minutes and is re-minted on every connection (§3.4.2).
Azure Owner Entra session	8h (browser default)	MFA-on-sign-in via Security Defaults. PIM (eligible/JIT) NOT adopted — see §7.
CF Access service token	None (rotate every 90d, hard cap 120d)	Long-lived by design; rotation is the control
GitHub PATs	90d max	Per GitHub org policy

5. Hard rules

Non-negotiable invariants. Changes require the §11 amendment process.

1. App code never reads Cf-Access-Jwt-Assertion to derive user identity or grant access. App identity is WorkOS-only. CF Access is a network gate, not an identity signal the app trusts.
- Exception (legacy): src/api/auth/providers/dev-provider.ts currently reads CF Access JWT. Phase 0 (§6.0) rewrites it to use a hardcoded identity. Phase 0 PR merge is a hard precondition for ADR-023 promotion. CI grep gate enforces zero cf-access-* reads in src/ and ui/src/ after Phase 0.
2. No parallel allow-lists at the same layer. Each surface's CF Access policy uses exactly one signal: GitHub org membership. No "GitHub OR email-in-list" fallbacks.
3. No long-lived authorized_keys for routine operator SSH. Tier-1 SSH = CF Access SSH CA. Tier-1.5 per-VM emergency keys are narrow-scoped (§3.4.1) and explicitly NOT routine ops.
4. Tier-2 emergency access is Entra + Azure RBAC only, via the custom sv0-serial-console-operator role on the fixed sv0-vm-emergency-ops group. Never federated to GitHub or WorkOS.
5. Service identities are always explicit. No human-tied tokens used by services. CI doesn't reuse a developer's GitHub token; agents don't reuse a developer's WorkOS session.
6. Fail loud on missing config. A missing secret → hard exit at boot or in the deploy fail-closed check. No silent fallback to a less-secure path.
- Corollary 6a: WorkOS membership and super-admin signals MUST NOT use cached fallback on lookup failure. Fail the request (503), do not degrade to last-known. Required test in test/api/auth/.
7. One source of truth per identity domain. GitHub = staff at perimeter. WorkOS = product users at app. Entra = Azure RBAC only. Never federate one into another for the sake of "single sign-on" if it adds an indirection.
8. WorkOS environment isolation. Prod and staging use separate WorkOS environments with distinct JWKS endpoints and audience claims. Never shared org IDs across envs.
- Startup assertion (Phase 0): src/shared/config/env.ts asserts on NODE_ENV=production that WORKOS_AUTHKIT_DOMAIN matches a prod-allowlist regex (or doesn't match a staging-denylist). Hard-exit with a named error otherwise. Symmetric assertion on staging.
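Corollary 6a above can be illustrated with a small decision function. This is a sketch under stated assumptions, not the code in src/api/auth/: the LookupResult and Decision types and the decideAccess name are hypothetical, and returning 403 for a confirmed non-member is an assumption (the ADR only fixes the 503 on lookup failure).

```typescript
// Hypothetical result of a WorkOS membership or super-admin lookup.
type LookupResult =
  | { ok: true; isMember: boolean }
  | { ok: false; error: string };

interface Decision {
  status: 200 | 403 | 503;
  reason: string;
}

// The function takes no cached value on purpose: with no last-known state in
// scope, a failed lookup cannot silently degrade to it.
function decideAccess(result: LookupResult): Decision {
  if (!result.ok) {
    // Lookup failure fails the request; it is never treated as "probably still a member".
    return { status: 503, reason: `membership lookup failed: ${result.error}` };
  }
  return result.isMember
    ? { status: 200, reason: "member" }
    : { status: 403, reason: "not a member" }; // assumed status for non-members
}
```

Keeping the cache out of the function signature makes the required test in test/api/auth/ straightforward: a failing lookup must map to 503 regardless of any earlier successful result.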

6. What changes from today

6.0 Phase 0 — must land before ADR-023 merges

Phase 0 is a hard precondition: ADR-023's "Status" header will reference the Phase 0 PR's merge SHA. ADR-023 does not merge until Phase 0 is green.

Rewrite src/api/auth/providers/dev-provider.ts to drop the CF-Access-JWT path entirely. Provider returns DEV_USER unconditionally. No verifyCfAccessJwt call.
Add CI grep gate: grep -rE "cf-access-jwt-assertion|Cf-Access-Jwt-Assertion" src/ ui/src/ returns zero matches. Fail CI step otherwise.
Audit the codebase for similar silent-fallback patterns and remove any found (focus on || '', ?? '', if (!secret) against sensitive vars).
Implement Rule #8 startup assertion in src/shared/config/env.ts: on NODE_ENV=production, assert WORKOS_AUTHKIT_DOMAIN doesn't match staging-denylist regex (e.g., must not contain staging, test, dev substrings); symmetric assertion on staging. Hard-exit on misconfig. Test in test/shared/config/ proves misconfig → hard exit, not silent degradation.
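The Rule #8 startup assertion in the last step could look roughly like the following. It is a sketch, not the contents of src/shared/config/env.ts: the EnvIsolationError and assertWorkosEnvIsolation names are hypothetical, the denylist uses only the example substrings given above (staging, test, dev), and reading "symmetric" as "a staging domain must carry one of those markers" is an assumption.

```typescript
class EnvIsolationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "EnvIsolationError";
  }
}

// Example denylist from the ADR text; the real regex may differ.
const NON_PROD_MARKERS = /staging|test|dev/i;

function assertWorkosEnvIsolation(env: Record<string, string | undefined>): void {
  const nodeEnv = env["NODE_ENV"];
  const domain = env["WORKOS_AUTHKIT_DOMAIN"];

  // Rule #6: a missing value is a hard failure, never a silent default.
  if (!domain) {
    throw new EnvIsolationError("WORKOS_AUTHKIT_DOMAIN is missing");
  }

  const looksNonProd = NON_PROD_MARKERS.test(domain);
  if (nodeEnv === "production" && looksNonProd) {
    throw new EnvIsolationError(`production is pointed at a non-prod WorkOS domain: ${domain}`);
  }
  if (nodeEnv === "staging" && !looksNonProd) {
    // Assumed meaning of the symmetric assertion.
    throw new EnvIsolationError(`staging is pointed at a prod-looking WorkOS domain: ${domain}`);
  }
}
```

The function throws rather than exiting so that a test in test/shared/config/ can assert on the named error; the boot path would call it once with the process environment and exit non-zero on any throw.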

6.1 Immediate corrections (the dev-azure spike, this week)

Delete the cf_entra_idp_id workspace variable on sv0-dev, or set it to "". The Terraform local.cf_idp_id then falls through to the GitHub IdP fallback. Re-trigger the sv0-dev apply.
PR: remove dual-IdP conditional from envs/dev/main.tf. Replace var.cf_entra_idp_id != "" ? var.cf_entra_idp_id : "45cdd3b1-..." with the GitHub IdP ID as a named local (no conditional). Per Rule #6 / Rule #2.
Disable Entra IdP at CF Access (already missing per 2026-05-12 diagnostic — confirm via dashboard). Azure App Registration "Cloudflare Access" stays quiescent; cleanup is deferred.
Close sv0-infrastructure#27 with reversal note.
Offboarding runbook (place in sv0-documentation/docs/runbooks/): when a staff member leaves SV0, in order:
1. Revoke their CF Access user sessions: POST /accounts/{id}/access/organizations/revoke_user with their email.
2. Remove from GitHub SecurityV0 org.
3. Verify by attempting cloudflared access login from a test machine with the offboarded identity (should be denied).
4. Remove from WorkOS organizations they were in.
5. Remove from any sv0-vm-emergency-ops Entra group membership.
6. Audit the offboarded user's last 30d activity in WorkOS, CF Access, GitHub.
Max time-to-revocation: minutes if procedure followed; hours otherwise.
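Step 1 of the offboarding runbook can be scripted. The sketch below only builds the request so that it can be reviewed or tested without network access. The buildRevokeUserRequest name and HttpRequest shape are hypothetical, and both the JSON body field (email) and bearer-token authentication are assumptions to confirm against the Cloudflare API reference before use.

```typescript
interface HttpRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

const CF_API_BASE = "https://api.cloudflare.com/client/v4";

function buildRevokeUserRequest(accountId: string, apiToken: string, email: string): HttpRequest {
  // Fail loud (Rule #6): refuse to build a request from empty inputs.
  if (!accountId || !apiToken || !email) {
    throw new Error("accountId, apiToken and email are all required");
  }
  return {
    method: "POST",
    url: `${CF_API_BASE}/accounts/${encodeURIComponent(accountId)}/access/organizations/revoke_user`,
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ email }), // assumed body shape
  };
}
```

Running this step first matters for the ordering above: sessions issued before the GitHub org removal would otherwise remain valid until they expire.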

6.2 Phase 3a (next sprint)

CF Access SSH for the dev-azure VM on dev-azure-ssh.securityv0.com per §3.4.2 (depth-1 hostname — see Universal SSL constraint there). DNS CNAME + tunnel ingress for ssh://localhost:22 + CF Access app type=ssh with GitHub IdP (MFA enforced upstream at GitHub org policy, not via CF auth_method=mfa — see §3.4.2) + cloud-init configures sshd to trust the per-app CF SSH CA + principals file (email local-parts).
Tier-1.5 per-VM emergency key for the dev-azure VM per §3.4.1. Cloud-init creates sv0emergency user with no sudo, writes per-VM ed25519 public key, installs /usr/local/bin/sv0-rescue (which writes local diagnostics first, then best-effort CF webhook). Private key stored out-of-band in 1Password as vm-emergency-vm-sv0-dev-1. Post-apply CI check verifies item exists and is fresh (§3.4.5).
Tier-3 Owner — Active dual-Owner + Security Defaults MFA (no PIM, no backup SP):

Amended 2026-05-13 (no-PIM revision + CEO scope review + state-verification correction). The premise that "Tier-3 = Ivan only" was a documentation-staleness bug: az role assignment list verifies that Sergey has been subscription Owner since 2026-01-04 (original Owner; created the subscription) and Ivan was added 2026-03-10. The 2-human Owner rollback has been in place for 2+ months. The original PIM-eligibility design + the proposed sv0-azure-backup-owner bridge SP were both solving a non-existent SPOF. The bridge is cancelled (PR #57 closed). The recovery-credentials design patterns from the cancelled work are banked in docs/patterns/recovery-credentials.md for any future scenario where a real recovery SP is genuinely warranted.
1. Verify Security Defaults is enabled on the tenant (free-tier policy that enforces MFA on az login / portal.azure.com / ARM API via the ARM MFA-required policy). Read via portal: Entra → Properties → Manage Security Defaults. This is the free-tier replacement for PIM's MFA-on-activate enforcement — without it, Tier-3 Owners have no MFA gate.
2. The 2-human Owner setup IS the rollback. Subscription Owner assignments (verified 2026-05-13):
  - Sergey (098551cd-0071-4408-846a-961c35da98a4) — Owner since 2026-01-04
  - Ivan (a38b998e-b2f4-4e73-ac3d-370da0b0a1da) — Owner since 2026-03-10
  Either operator can re-provision the other's role assignment in a lockout scenario. No backup SP, no Microsoft-support-RTO concern.
Hard rule (no exception): the 2-human Owner state above is preserved until ≥3 human Owners exist OR a documented superseding design is in place. Neither Owner is removable without an explicit migration plan.

What we explicitly do not get without PIM (accepted tradeoff for not paying for P2):
- No JIT activation window — Owner is always Active for both operators. Compensating control: Security Defaults MFA-on-sign-in + dual-Owner attribution (each human's actions logged distinctly in Activity Log).
- No per-activation business-justification field. Compensating control: routine Owner-scoped operations go through TFC (audited via TFC run history).
- No activation-event audit log. Compensating control: Azure Activity Log captures every role-scoped operation, with monthly review automated per sv0-infrastructure#59 (scoped to break_glass + bootstrap SPs; loud-on-zero-actions).
Re-evaluate PIM adoption when (a) Entra P2 is procured for product/demo reasons and we can opportunistically extend, or (b) staff with Owner-scoped access grows to ≥3.
Azure Bastion direct (non-PIM) role assignment for tier-3 — provisioned for the Tier-3 Owners (Sergey + Ivan), kept Active. This is the CF-independent transport during CF Access outages (§3.4.4). One-time setup; deferred until Bastion is actually provisioned (out of Phase 3a-4 scope).
GHA workflow check-cf-service-tokens.yml — weekly, reads 1Password via OP CLI, opens issues for tokens past expires_at, fails CI for tokens >120d old. Per §4a.
Quarterly tier-2 emergency drill (per §3.4.4): each emergency operator runs Serial Console + Tier-1.5 procedures on a test VM, results posted to #audit-tier-2-drills.
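The dual-Owner hard rule in the Tier-3 step above relies on state that drifted from the documentation once already, so it may be worth checking mechanically. The following is a sketch only: the function names are hypothetical, the input is assumed to be the JSON from az role assignment list already filtered to subscription scope, and treating principalType "User" as "human" is an assumption.

```typescript
// Subset of the fields emitted by `az role assignment list --output json`.
interface RoleAssignment {
  principalId: string;
  principalType: string; // e.g. "User", "ServicePrincipal", "Group"
  roleDefinitionName: string;
}

function humanOwnerIds(assignments: RoleAssignment[]): string[] {
  const ids = assignments
    .filter((a) => a.roleDefinitionName === "Owner" && a.principalType === "User")
    .map((a) => a.principalId);
  // De-duplicate: one principal can hold Owner at more than one scope.
  return ids.filter((id, i) => ids.indexOf(id) === i);
}

function assertDualOwner(assignments: RoleAssignment[]): void {
  const owners = humanOwnerIds(assignments);
  if (owners.length < 2) {
    throw new Error(`expected at least 2 human Owners, found ${owners.length}`);
  }
}
```

Service principals are excluded from the count on purpose, since the rollback described above depends on two humans who can each restore the other's role assignment.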

Explicitly NOT in this phase: separate FIDO2 break-glass Entra account in a physical safe (per Ivan 2026-05-12). Backup FIDO2 setup per §3.7 (no safe requirement) is sufficient at current scale.

6.3 Phase 3b (formal staging)

Staging applies the same pattern: L2 via the WorkOS staging env (separate org from prod), with no L1 perimeter. SSH via CF Access SSH on staging-ssh.securityv0.com (depth-1 per §3.4.2). Per-VM emergency key for the staging VM.

6.4 Deferred cleanup (any time)

Delete the "Cloudflare Access" Azure App Registration and its 1Password client secret entry. Non-blocking.
ADR-023 promotion CI gate (closes the human-review-only state described in the Status header). GHA workflow in sv0-documentation that triggers when docs/architecture/decisions/adr-023-*.md is added or its Status line changes:
- Parse the Status line for a Phase 0 commit SHA.
- Verify the SHA exists in sv0-platform main.
- Verify the §6.0 four steps are observable at the SHA: dev-provider has no CF-Access-JWT path, the CI grep gate is present in .github/workflows/, the env.ts Rule #8 startup assertion (WORKOS_AUTHKIT_DOMAIN) is present, and the corresponding test exists.
- Fail merge of the ADR PR if any check fails.
Until this lands, ADR-023 promotion is gated by human review (Ivan verifies Phase 0 is merged before merging the ADR). Tracked as a sv0-documentation issue to file alongside the ADR PR.
ADR-022 amendments:
- §5c.1 "Two doors": replace every "Entra IdP at CF Access" with "GitHub IdP at CF Access."
- §5c (Emergency access tiers): tier-1 SSH = CF Access SSH (GitHub), tier-1.5 = per-VM emergency key (CTO/CEO-only), tier-2 = Azure Serial Console (Entra group + custom role), tier-3 = dual Active Azure Owners (Sergey + Ivan) + Security Defaults MFA-on-sign-in (no PIM, no backup SP — see §7).
- §4: change default vm_size from Standard_B2s (NotAvailableForSubscription in westeurope) to Standard_D2as_v6, and amend the Azure Policy to match.
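The first check of the ADR-023 promotion gate described above (parse the Status line for a Phase 0 commit SHA) could be sketched as follows. The extractPhase0Sha name is hypothetical, and the "commit <sha>" wording is an assumption based on how the current Status section phrases it; the later checks (SHA exists on main, the four steps are observable) need repository access and are not shown.

```typescript
// Returns the first abbreviated or full hex SHA that follows the word "commit",
// or null when the Status text does not reference one.
function extractPhase0Sha(statusText: string): string | null {
  const match = /\bcommit\s+([0-9a-f]{7,40})\b/i.exec(statusText);
  return match?.[1]?.toLowerCase() ?? null;
}

// The gate fails the ADR PR when no SHA is present at all.
function requirePhase0Sha(statusText: string): string {
  const sha = extractPhase0Sha(statusText);
  if (sha === null) {
    throw new Error("Status section does not reference a Phase 0 commit SHA");
  }
  return sha;
}
```

Failing on a missing SHA, rather than skipping the remaining checks, keeps the gate consistent with Rule #6.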

6.5 What does NOT change

WorkOS at L2 for prod/staging/dev portals — unchanged.
Azure RBAC + Entra group for Serial Console — unchanged.
TFC OIDC federation for sv0-shared, sv0-prod, sv0-dev workspaces — unchanged.
GitHub at CF Access for dev.securityv0.com — unchanged.

7. Things explicitly NOT in scope

WorkOS at L1. Pricing-gated (OIDC Connect is a separate SKU), wrong product fit, blurs customer/staff boundary.
Entra at L1. Wrong source of truth — SV0 staff are GitHub users.
Mixed IdP fallbacks. Rule #2.
Federating GitHub into Entra (or vice versa). Adds indirection for no operational gain at our team size. Reopen if team grows past ~20 staff or compliance demands single-IdP.
Cloudflare Access as an application-identity source. Rule #1.
Long-lived authorized_keys for routine operators. Rule #3 (Tier-1.5 is narrow-scope exception).
Bastion / jump host pattern for SSH. CF Access SSH replaces this.
Azure AD SSH login. §3.4.3.
Separate FIDO2 break-glass Entra account in a physical safe. Deferred per Ivan 2026-05-12. Current §3.7 setup is sufficient.
Entra ID P2 / Azure PIM for infra access control. Resolved 2026-05-13. Verification returned subscribedSkus: [] + 400 AadPremiumLicenseRequired (P2 absent on tenant). P2 procurement is reserved for product/demo use cases (Entra audit logs in execution findings); we do not adopt P2/PIM to gate our own infra. Tier-3 runs dual Active Owners (Sergey + Ivan; verified via az role assignment list 2026-05-13) + Security Defaults MFA-on-sign-in. Mutual recovery between the two Owners is the account-lockout rollback. Microsoft support tenant-root reset remains as a residual fallback only for the joint-loss scenario, RTO of days — acceptable. Re-evaluate PIM when (a) P2 lands for product reasons and we can opportunistically extend, or (b) staff with Owner-scoped access grows to ≥3.
Two-human signer on amendments. Deferred until team scales past 5 staff (Ivan 2026-05-12). Current process: two independent Claude/Codex sessions for AI-proposed amendments (§11.3).
Anomaly detector for super-admin actions. Deferred to first compliance ask or 5+ staff (§3.6).
Replacing Cloudflare with a cloud-specific edge. Cloudflare stays — perimeter, DNS, Tunnel, Access all live there. Cloud-portable as-is.

8. Open questions

Most v1/v2 questions are resolved in v3. Remaining live items:

The bootstrap operator role — currently either Tier-3 Owner (Sergey or Ivan), running local-apply on bootstrap/. Long-term it should be a distinct identity (a service account with limited scope used only for bootstrap apply). Track as a Phase 3c+ item.
Cross-Claude-session convergence for AI-proposed amendments — implementation: a checklist in the PR template? A make verify-amendment script that wraps both Claude sessions? Decide before the first such amendment lands.

Closed: "Entra ID P2 license state on the current tenant" — resolved 2026-05-13 (subscribedSkus: []). See §7 + §6.2 step 12 amendment.

9. Glossary (used precisely throughout this doc)

L1 / "perimeter" — Cloudflare Access in front of a URL or port. Decides whether the network connection reaches the backend. Used in §3.2, §3.4. Never "staff/external trust boundary" generally — that's "trust boundary."
L2 / "application" — Authentication inside the application itself. For SV0: WorkOS sessions.
L3 / "resource RBAC" — Cloud-side authorization decisions (Azure RBAC, MongoDB Atlas roles, GitHub permissions).
IdP — Identity provider.
CF Access app — A Cloudflare Access "application" resource: hostname + type + allowed IdPs + policy.
CF Access SSH — Cloudflare Access in front of port 22, VM trusts CF Access SSH CA. Operator: cloudflared access ssh --hostname X.
Service token — Long-lived token issued by Cloudflare Access for service-to-service use.
M2M token — Machine-to-machine token issued by WorkOS for a service principal.
Tier-1 SSH — Routine operator SSH (target: CF Access SSH).
Tier-1.5 emergency SSH — Narrow-scope per-VM key for CTO/CEO-level break-glass when Tier-1 and Tier-2 are unavailable.
Tier-2 emergency operator — Person assigned to sv0-vm-emergency-ops Entra group. Has Serial Console + custom role. 1–3 members.
Tier-3 subscription owner — Person with Azure subscription Owner role, Active assignment, MFA-on-sign-in via Security Defaults. Today: Sergey (original Owner since 2026-01-04) and Ivan (Owner since 2026-03-10). Mutual recovery between the two is the account-lockout rollback; no backup SP (the proposed sv0-azure-backup-owner was cancelled when state-verification 2026-05-13 showed the 2-Owner state already existed — design patterns banked in docs/patterns/recovery-credentials.md for any future scenario where a backup SP is actually warranted).
Bootstrap operator — The role used for bootstrap/ local-apply. Currently the same identity as tier-3, conceptually distinct (§8 Q1).
"Break-glass" (adjective only) — describes tier-1.5 or tier-2 access. Not used as a noun.
Security Defaults — Free-tier Entra policy that enforces MFA registration for all users and MFA-on-sign-in for Entra administrator roles and for Azure management access (portal, az CLI, ARM API), tenant-wide. The free-tier replacement for Conditional Access (which requires P1/P2). Not granular — applies to all admins uniformly. SV0 staff scope is small enough that this is sufficient.
PIM — Azure Privileged Identity Management. Entra ID P2 feature. Eligible-not-active role assignments, MFA-on-activate. NOT adopted for SV0 infra (§7); referenced only for the AWS-migration sketch in §1.1.

10. References

Round 3 review of v3: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review-r3.md
Round 2 review of v2: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review-r2.md
Round 1 review of v1: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review.md
v3 draft: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v3.md
v2 draft: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v2.md
v1 draft: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v1.md
Reversed earlier verdict: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-11-cf-access-idp-question-for-auth-agent.md
Azure landing zone session: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-11-azure-landing-zone-staging-first-session.md
ADR-022 — Azure compute landing zone: ~/dev/securityv0/repos/sv0-documentation/docs/architecture/decisions/adr-022-azure-compute-landing-zone.md
Runbook 12 — Azure VM landing zone: ~/dev/securityv0/repos/sv0-documentation/docs/runbooks/12-azure-vm-landing-zone.md
WorkOS principal model: ~/dev/securityv0/repos/sv0-skills/auth-context/SKILL.md
Memory: project_auth_principal_model_locked (2026-04-30), feedback_subagent_backward_compat_neutralizes_fix, feedback_fail_loud_over_silent_fallback

11. How to change this document

At the current 1-2 operator scale, amendments are operator PRs reviewed by the other operator (or merged solo for documentation-only changes). Two substantive criteria apply:

Hard Rules in §5 are invariants. Changes to a hard rule require the operator to articulate the threat-model delta in the PR description.
AI-proposed amendments need a second-model pass. Run any AI-generated amendment through a second independent session (Codex or a fresh Claude session) before merging. The 2026-05-13 no-PIM revision is the worked example: Codex caught five threat-model premises the first model missed; the CEO scope review caught that the resulting bridge was unnecessary. Both passes mattered.

Trigger to revisit and expand this section: team growth past 5 staff, OR first SOC2 / customer audit ask. Until then, the lighter-weight process is correct; a heavier process for a 2-person team is the failure mode an earlier version of this section exhibited.

Reversal lesson worth keeping (2026-05-11): when proposing an auth change, ask "what does existing infra already provide?" before re-deriving from first principles. The 2026-05-11 verdict was reversed within 24 hours because it conflated backend presence (Azure tenant has 1-3 emergency accounts) with staff identity store (where SV0 staff actually live = GitHub). The same failure mode reappeared 2026-05-13 with the cancelled backup-Owner SP: building a sophisticated workaround for a problem whose existing infra-equivalent (add a 2nd human Owner) was cheaper. Pattern-match to it.

— Authentication target architecture (DRAFT v4), 2026-05-12
