ADR-023: Authentication Target Architecture

Status

Accepted — 2026-05-12.

Phase 0 (§6.0) — the hard precondition for this ADR's operational guarantees — landed in sv0-platform at commit d7885d8 via PR #856. The PR included three rounds of adversarial review and post-merge regression coverage for ADR-023 Rule #1 (CI grep gate) and Rule #8 (cross-env startup assertion + tests). An automated GHA gate that enforces future ADR-023 amendments against the Phase 0 SHA is tracked as a follow-up in §6.4 step 18.

Operationalises and supersedes the perimeter-IdP choice in ADR-022 §5c.1. Paired with docs/runbooks/12-azure-vm-landing-zone.md, which holds the Azure-side implementation sequencing for the items in §6 of this ADR.

2026-05-13 amendment (no-PIM revision + CEO scope review + state-verification correction) — §3.4 Tier-3, §6.2 step 12, §7, glossary amended to drop Azure PIM. Verification returned subscribedSkus: [] + 400 AadPremiumLicenseRequired; P2/PIM reserved for product/demo use cases, not infra access control. The interim design layered on a sv0-azure-backup-owner UAA service principal as an account-lockout rollback (Codex adversarial review tightened it to UAA + out-of-band credential + sunset condition). CEO scope review same day caught that the SP's own sunset trigger — "delete when a 2nd human Owner exists" — was cheaper than the SP itself. State verification (az role assignment list) then showed the trigger was already satisfied months ago: Sergey has been subscription Owner since 2026-01-04 (created the subscription); Ivan was added 2026-03-10. The 2-human Owner rollback has existed since March. The entire PIM-design + backup-SP-design + "add Sergey" issue (#60) were solving a non-problem masked by stale documentation that said "Tier-3 = Ivan only." Bridge cancelled (PR #57 closed). #60 closed as already-resolved. Design patterns from Codex review (UAA > Owner, out-of-band > TF-state, sunset conditions, safe activation pattern) banked in docs/patterns/recovery-credentials.md for any future scenario where a recovery SP is genuinely warranted.

Context

SV0 today operates three identity-bearing systems: GitHub (SecurityV0 org), WorkOS (AuthKit), and Entra ID (Azure default directory). Until 2026-05-11 there was no canonical document of which system was the source of truth for which decision; this drift led directly to an incorrect "Entra at Cloudflare Access" verdict on 2026-05-11 that survived 24 hours before being reversed. The reversal exposed several latent issues: app code in dev-provider.ts reading Cf-Access-Jwt-Assertion (a violation of the intended layering), no defined emergency-access path for the Azure compute landing zone, no documented offboarding TTL, no rotation discipline on CF Access service tokens, and no rollback for the tier-3 subscription-owner SPOF.

This ADR locks the target architecture covering three scopes — portal UI access, API access, and infrastructure access — using only IdPs SV0 already operates (zero new identity stores) and applying ADR-022's cloud-portability discipline so the design moves cleanly to AWS if/when credits arrive there.

The document is intentionally long because the failure mode it guards against is future-Claude-session-reverses-this-on-flimsy-reasoning (per §11.4). Three rounds of adversarial review (saved under ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review*.md) pressure-tested every claim. Round-3 verdict: ship as ADR with the Phase 0 precondition.

Decision

The full decision is laid out below in §1 (executive summary) through §11 (amendment process). Key invariants are §5's "Hard rules"; key implementation items are §6's phased rollout (Phase 0 is a hard precondition, Phase 3a is the next sprint).

Decision summary

  • Three IdPs, each the source of truth for exactly one thing: GitHub (L1 staff perimeter), WorkOS (L2 platform users), Entra (L3 Azure RBAC, break-glass only).
  • Four SSH tiers: Tier-1 = CF Access SSH + GitHub IdP, Tier-1.5 = narrow per-VM emergency key (CTO/CEO-only), Tier-2 = Azure Serial Console + custom Entra-group role, Tier-3 = dual Active subscription Owners (Sergey, Owner since 2026-01-04, and Ivan, Owner since 2026-03-10) + Security Defaults MFA-on-sign-in. No PIM, no backup SP: Entra ID P2 is not adopted for infra (see §7), and the proposed backup SP was cancelled when state verification showed the 2-Owner rollback already existed (see §6.2 step 12).
  • Cloud-portable by construction: WorkOS + GitHub + Cloudflare are cloud-agnostic; Azure RBAC is the only cloud-specific config, and its AWS-equivalent mapping is summarised in §1 (Cloud portability).
  • Phase 0 (precondition): rewrite dev-provider.ts to drop CF-Access-JWT reads, add CI grep gate, add WORKOS_AUTHKIT_DOMAIN cross-env startup assertion + test.

Consequences

Positive

  • One coherent identity model covering UI, API, and infra — replaces the ad-hoc, surface-by-surface state that produced the 2026-05-11 reversal.
  • Zero net new IdPs to operate; everything below uses what SV0 already runs.
  • Tier-1.5 + Tier-3 dual-Owner rollback close two real SPOFs that the spike-era setup left implicit (the 2-Owner state already exists; the original ADR text claiming "Tier-3 = Ivan only" was stale).
  • Cloud portability preserved end-to-end (only L3 config changes on AWS migration).
  • Explicit #audit-prod-staff-writes Slack channel + quarterly review give a SOC2-prep audit trail without adding anomaly-detection complexity that's premature at current scale.

Negative / accepted residual risk

  • Tier-3 direct (non-PIM) Azure Bastion role for CF-Access-down recovery is a narrow persistent SPOF (Bastion-reader on one Bastion host, tier-3-only). Accepted as a net-positive trade vs. the CF-fronted-PIM deadlock alternative.
  • §11.3 two-Claude-session convergence is a thin epistemic guard (same model + same training cutoff + similar context can converge on a wrong answer). To be re-litigated when the team scales past 5 staff.
  • 1Password is the credential vector for the existing sv0-azure-break-glass SP (rg-sv0-prod Contributor), but the Tier-3 subscription-Owner SPOF concern doesn't apply: Sergey + Ivan are both Active Owners with independent Entra accounts and independent MFA devices. Mutual recovery between operators is the in-place rollback. Microsoft support tenant-root reset (RTO: days) remains the residual fallback only for the joint-loss-of-both-Owner-accounts scenario, which is acceptable.
  • Backup FIDO2 keys are operator-managed without a physical-safe requirement (per Ivan's pre-client simplicity preference). Acceptable at 1-2 operator scale; revisit at team-of-5 or first compliance ask.
  • Anomaly detection for super-admin actions is deferred to first compliance ask or 5+ staff.

What changes downstream

  • ADR-022 §5c.1 needs amendment to swap "Entra IdP at CF Access" for "GitHub IdP at CF Access" (tracked as §6.4 step 19).
  • Runbook 12 phases adopt the four-tier SSH model and the dual-Owner rollback procedure (no PIM, no backup SP).
  • Phase 0 implementation lands in sv0-platform (dev-provider.ts rewrite + CI grep gate + env.ts iss assertion + test).
  • Phase 3a implementation lands in sv0-infrastructure (CF Access SSH on dev-azure-ssh.securityv0.com, Tier-1.5 emergency key + cloud-init wiring, check-cf-service-tokens.yml GHA, direct Bastion role). Dual-Owner state at Tier-3 was already in place — no Phase 3a-4 Azure provisioning required.

1. Executive summary

Three identity providers, each the source of truth for one thing:

  1. GitHub (SecurityV0 org) — source of truth for SV0 staff identity at the network perimeter (Cloudflare Access). Every SV0 staff member has a GitHub account in the org. MFA enforced at GitHub.
  2. WorkOS (AuthKit) — source of truth for platform user accounts (Layer 2 inside the platform). Federates Google Workspace (@securityv0.com), Magic Link, OTP. Customer users and staff-as-product-users authenticate here.
  3. Entra ID (Azure default directory) — source of truth for Azure resource access only. Holds a small fixed sv0-vm-emergency-ops tier-2 emergency operator group + the tier-3 subscription owner accounts (Sergey + Ivan). Not a staff identity store.

Three layers, each answering a different question:

| Layer | Question | Source of truth | Surface |
| --- | --- | --- | --- |
| L1: Network perimeter | Is this an SV0 staff member who's allowed to reach this URL/port at all? | GitHub (org membership) | Cloudflare Access |
| L2: Application | Is this a platform user (customer or staff-as-product-user) with a valid session? | WorkOS | Inside the platform |
| L3: Cloud resource RBAC | Is this Azure principal allowed to call this Azure API? | Entra (Azure roles, Active assignments + Security Defaults MFA-on-sign-in) | Azure portal / CLI / Serial Console |

Net new IdPs vs today: zero.

Cloud portability — WorkOS + GitHub + Cloudflare layers are cloud-agnostic. Only L3 RBAC (Azure roles, Entra groups, Security Defaults) is Azure-specific; on AWS migration it maps to IAM + IAM Identity Center + an IAM Identity Center MFA policy. Detailed sketch lives in .scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v3.md — promote to a sub-section here if/when AWS is actually funded. Speculative content removed from this ADR per CEO scope review 2026-05-13.


2. Audiences

| Audience | Examples | Auth path |
| --- | --- | --- |
| SV0 staff | Engineers, ops, founders | GitHub at L1 + WorkOS at L2 |
| SV0 staff doing emergency ops | Tier-2 emergency operators (1–3 manually-provisioned Entra members) | + Entra account (Active role, Security Defaults MFA-on-sign-in) for Serial Console |
| SV0 subscription owners | Tier-3: today Sergey (original Owner since 2026-01-04) and Ivan (Owner since 2026-03-10) | + Entra account with Active Owner role (Security Defaults MFA-on-sign-in, browser session 8h). The other Owner is the account-lockout rollback (§6.2). |
| Customer users | Customer admin/analyst logging into SV0 platform | WorkOS at L2 only (no L1, because prod has no perimeter) |
| Service / agent / CI | TFC runs, connector pipelines, internal bots | OIDC federation (Azure), M2M tokens (WorkOS, CF service tokens) |

3. Surface decision matrix

3.1 Customer-facing portal (UI)

| Surface | L1 perimeter | L2 application | L3 resource |
| --- | --- | --- | --- |
| app.securityv0.com (prod portal) | (public DNS, no CF Access) | WorkOS prod env (Google + Magic Link + OTP) | n/a |
| staging.securityv0.com | (matches prod's posture so E2E auth tests are meaningful) | WorkOS staging env (separate org, distinct JWKS + audience) | n/a |

Customers need to reach prod/staging URLs — CF Access in front would require provisioning every customer in CF Access, bypassing WorkOS. Prod's only door is L2.

3.2 Internal portals (UI)

| Surface | L1 perimeter | L2 application | L3 resource |
| --- | --- | --- | --- |
| dev.securityv0.com (Hetzner today, Azure later) | CF Access + GitHub IdP | WorkOS | n/a |
| dev-azure.securityv0.com (spike) | CF Access + GitHub IdP | WorkOS | n/a |
| pr-N-dev.securityv0.com (PR previews) | CF Access + GitHub IdP | WorkOS | n/a |

L1 keeps unfinished builds invisible to the world; L2 means even L1-authenticated visitors need a real WorkOS session. CF Access policy at L1 is GitHub org membership in SecurityV0. Never parallel email allowlists.

3.3 API access (programmatic + CLI)

| Surface | L1 perimeter | L2 application |
| --- | --- | --- |
| app.securityv0.com/api/* (prod) | | WorkOS bearer (session cookie OR M2M token) |
| staging.securityv0.com/api/* | | WorkOS staging bearer |
| dev.*.securityv0.com/api/* | CF Access (inherits from URL) | WorkOS bearer |
| Internal CLI scripts hitting prod/staging | (perimeter inherits) | WorkOS device_code flow → bearer |
| Internal CLI scripts hitting dev/dev-azure/preview | CF Access service token (§4a) OR human GitHub flow | WorkOS bearer |
| Customer-tenant API consumer | | WorkOS M2M token (per-tenant) |

Hard rule: application code MUST NOT read Cf-Access-Jwt-Assertion to derive identity. App identity is always WorkOS-derived. L1 is a network-reachability gate, not an identity signal the app trusts. (See §5 Rule #1 — Phase 0 resolves the legacy dev-provider violation.)

3.4 Infrastructure access

Four tiers, each with a distinct mechanism.

| Tier | Use case | Mechanism | Identity | Lifetime |
| --- | --- | --- | --- | --- |
| Tier-1 SSH | Routine operator SSH to a VM | Cloudflare Access SSH in front of port 22, GitHub IdP, CF SSH CA short-lived certs | GitHub user (must be in SecurityV0 org) | ~1h per session |
| Tier-1.5 emergency SSH | CTO/CEO-level break-glass: cloud-init broke, Serial Console unreachable, or CF Access SSH degraded | Per-VM ed25519 key, sv0emergency user (no sudo, read-only /var/log, single sv0-rescue script) | 1Password-stored private key | Until per-VM key is rotated (on next redeploy) |
| Tier-2 emergency console | Network/SSH itself is broken (last resort) | Azure portal Serial Console, gated by custom sv0-serial-console-operator role on sv0-vm-emergency-ops Entra group | Entra account, manually provisioned, 1–3 members | Per-session, Azure-audited |
| Tier-3 subscription owner | Subscription-Owner-level ops: bootstrap, RBAC, policy edits | Azure portal / CLI, Entra account, Active assignment + Security Defaults MFA-on-sign-in. Dual-Owner (Sergey + Ivan, the other Owner is the account-lockout rollback). | Tier-3 operators' Entra accounts (Sergey since 2026-01-04, Ivan since 2026-03-10) | Per-session (browser 8h, MFA-on-sign-in) |

3.4.1 Tier-1.5 per-VM emergency key — scope and constraints

The Tier-1.5 key is not "SSH keys for daily ops" — it's a narrowly-scoped fourth route in when the first three fail. Per Ivan's 2026-05-12 decision, accepted only under these constraints, all enforced at provisioning time:

  • Key is NOT in authorized_keys for sv0admin (routine-ops account). Only sv0emergency accepts it.
  • sv0emergency has:
    • No sudo, not in wheel or sudo groups, no sudoers entry.
    • Read-only access to /var/log.
    • Permission to run exactly one script: /usr/local/bin/sv0-rescue, which:
      1. Writes local diagnostics to disk first (timestamped tarball: last 1000 syslog lines, journalctl tail, docker ps, df -h) at /var/lib/sv0-rescue/$(date +%s).tar.gz. This is the primary action — never deferred.
      2. Then attempts to post an audit-trail webhook to a CF Audit endpoint with a 5-second timeout. On failure, logs to stderr (loud warning) and continues. Audit completeness is achieved by weekly reconciliation of local tarballs against received webhooks (see §3.4.4).
    • No shell init, no PATH access to user binaries beyond sv0-rescue.
  • Key generated per-VM at TF apply-time, stored in 1Password sv0-infra vault as item vm-emergency-<vm-name>. Storage path is out-of-band (see §3.4.5).
  • Rotated automatically on every VM redeploy.
  • Accessible only to CTO/CEO-level operators (today: Ivan + Sergey).

This satisfies the spirit of Rule #3: no long-lived authorized_keys for routine ops; emergency-only key with narrow blast radius and tight rotation.

3.4.2 Tier-1 SSH via Cloudflare Access — concrete shape

  • Separate hostname per VM environment, not the URL hostname (Cloudflare enforces domain-uniqueness across app types). Hostname must be depth-1 from the apex (single label under securityv0.com) — Cloudflare Free's Universal SSL covers only one wildcard level, so a depth-2 name like ssh.dev-azure.securityv0.com will TLS-fail at the edge with handshake_failure. Use a depth-1 pattern like dev-azure-ssh.securityv0.com for the spike VM, staging-ssh.securityv0.com for staging, etc. (PR #35 + sv0-infrastructure issue #38 confirmed the depth-2 failure live; the pattern is also called out in the operator memory project_cf_universal_ssl_one_level.)
  • DNS CNAME for that hostname → the Cloudflare Tunnel.
  • Tunnel ingress rule routing SSH traffic to ssh://localhost:22 on the VM.
  • Cloudflare Access app of type=ssh on that hostname, GitHub IdP, auto_redirect_to_identity=true, allow-list filtered by GitHub org membership.
  • MFA enforcement is upstream at GitHub's org-policy (require-2FA), not via CF Access require { auth_method = "mfa" }. Empirically confirmed 2026-05-12: CF Access's IdP-based MFA require reads the OIDC amr claim and is only supported for Okta, Microsoft Entra ID, Generic OIDC, Generic SAML 2.0. GitHub OAuth (which the GitHub IdP uses) does not emit amr, so the require is structurally unsatisfiable — it denies all authentications rather than enforcing MFA. The replacement path (CF Access independent MFA: TOTP/WebAuthn at the application layer, IdP-agnostic) is tracked in sv0-infrastructure#36. Until that lands, MFA is enforced only at GitHub's session layer.
  • Per-app SSH CA managed via Terraform (cloudflare_zero_trust_access_short_lived_certificate). Public key rendered into cloud-init at apply time; no runtime fetch.
  • VM sshd configured: TrustedUserCAKeys /etc/ssh/cloudflare_ca.pub, AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u.
  • Cert principal format: CF Access SSH emits the user's email local-part (not full email, not GitHub login) as the cert Principals field, with the full email in Key ID. The principals file (/etc/ssh/auth_principals/sv0admin) must contain email local-parts to match. Empirically confirmed 2026-05-12 via ssh-keygen -L on a live cert (Principals=ifofanov, Key ID=ifofanov@securityv0.com).
  • CF-issued cert validity is 4 minutes. The operator ssh_config's Match host ... exec "cloudflared access ssh-gen --hostname %h" re-mints the cert on every connection. If ssh-gen fails (e.g., CF Access degraded), ssh fails closed with publickey denied — no silent fallback to a long-lived key. See §3.4.4 for the CF-down recovery path.
  • Operator connects: cloudflared access ssh --hostname dev-azure-ssh.securityv0.com. Full ssh_config block per cloudflared access ssh-config --hostname ... --short-lived-cert; must include IdentitiesOnly yes + IdentityAgent none so a local SSH agent (1Password etc.) doesn't shadow the CF cert.
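
Two of the constraints above (depth-1 hostnames and email-local-part principals) are easy to get wrong at provisioning time. The following is a minimal sketch of how they could be checked before an apply; the helper names `isDepthOneHostname` and `certPrincipalForEmail` are hypothetical and do not come from the SV0 codebase.

```typescript
// Hypothetical pre-apply helpers; names are illustrative only.

/** True when `host` is exactly one label under `apex`, e.g. dev-azure-ssh.securityv0.com. */
function isDepthOneHostname(host: string, apex: string): boolean {
  const h = host.toLowerCase();
  const suffix = "." + apex.toLowerCase();
  if (!h.endsWith(suffix)) return false;
  const label = h.slice(0, h.length - suffix.length);
  // Exactly one non-empty label: a further dot would make the name depth-2 or deeper.
  return label.length > 0 && !label.includes(".");
}

/** Email local-part, which the text above reports as the cert Principals value. */
function certPrincipalForEmail(email: string): string {
  const at = email.lastIndexOf("@");
  if (at <= 0) throw new Error(`not an email address: ${email}`);
  return email.slice(0, at);
}
```

With these, `ssh.dev-azure.securityv0.com` is rejected while `dev-azure-ssh.securityv0.com` passes, and the principals file entry for `ifofanov@securityv0.com` comes out as `ifofanov`.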

3.4.3 Why not Azure AD SSH login (az ssh vm)?

Considered, rejected: adds Entra dependency on every operator (SV0 staff aren't in Entra); requires aadsshlogin-extension; doesn't compose with Cloudflare Tunnel (no public IP, needs Azure Bastion → more cost); GitHub already covers everyone; not cloud-portable to AWS.

3.4.4 When Cloudflare Access is down

CF Access has had outages (most recently 2026-02-13, ~2h partial control-plane). During such windows, every L1-gated surface is unreachable — including Tier-1 SSH.

The CF-independent fallback path:

  1. Confirm CF Access is down (not local network): curl -fsSL https://api.cloudflare.com/client/v4/zones | head — repeated 5xx or timeouts → it's CF.
  2. portal.azure.com works — it's Microsoft-hosted, not CF-fronted. Tier-3 Owner can reach Entra portal as normal (Owner is Active, no activation step).
  3. Tier-1.5 emergency key is unreachable via cloudflared (tunnel needs CF control plane to reconnect). Use Azure Bastion via a direct (non-PIM) Bastion-reader role assignment — this is the CF-independent transport. The direct role is provisioned for tier-3 only and stays Active (not PIM-eligible) so it remains usable during CF outages.
  4. If neither Bastion nor Tier-1.5 works, fall back to Tier-2 Serial Console (Azure portal, CF-independent).
  5. cloudflared reconnect is best-effort during CF control-plane outages — do not rely on tunnel availability.

Note: sv0-rescue writes local diagnostics first regardless of CF reachability (per §3.4.1). The CF Audit webhook is best-effort; reconciliation runs weekly to catch missing entries.
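
The weekly reconciliation mentioned above amounts to a set difference between local tarballs and received webhooks. Here is a minimal sketch, assuming the webhook payload carries the same epoch-seconds value that names the tarball; that payload shape and both function names are assumptions, not the actual implementation.

```typescript
// Tarball names follow the /var/lib/sv0-rescue/<epoch-seconds>.tar.gz pattern from §3.4.1.
// ASSUMPTION: each received webhook is identified by that same epoch-seconds run id.

function runIdFromTarball(path: string): number | null {
  const m = /(\d+)\.tar\.gz$/.exec(path);
  return m ? Number(m[1]) : null;
}

/** Run ids that have a local tarball but no matching webhook, sorted ascending. */
function missingAuditEntries(tarballPaths: string[], receivedRunIds: number[]): number[] {
  const received = new Set(receivedRunIds);
  const missing: number[] = [];
  for (const p of tarballPaths) {
    const id = runIdFromTarball(p);
    // Files that do not match the naming pattern are skipped, not reported.
    if (id !== null && !received.has(id)) missing.push(id);
  }
  return missing.sort((a, b) => a - b);
}
```

Any id returned here marks an sv0-rescue run whose audit webhook never arrived and therefore needs a manual audit entry.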

Drills: quarterly per-operator exercise — run through CF-down procedure on a test VM. First time you need this for real is the wrong time to learn it.

3.4.5 Tier-1.5 key storage — out-of-band 1Password write

Decision: out-of-band write, not the 1Password CLI provider at TF apply-time. Rationale: coupling TFC's apply identity to 1Password write access is sensitive (compromise blast radius = vault write). Out-of-band keeps blast narrow.

Enforcement: post-apply CI check verifies that 1Password contains an item vm-emergency-<vm-name> whose created_at >= apply_completed_at. CI gates merge of the apply-PR on this check passing. This makes "rotated automatically on every VM redeploy" enforced, not aspirational.

Verifier credentials: a 1Password service account scoped read-only to items matching vm-emergency-* in the sv0-infra vault, stored as GHA secret OP_VM_EMERGENCY_VERIFIER. Rotated yearly (calendar reminder, same cadence as §4a service tokens). Read-only and item-prefix-scoped → minimal blast radius even if leaked. The service account has no write or delete capability anywhere. Tracked in 1Password under item op-service-account-vm-emergency-verifier.

Implementation note: the apply itself writes the public key to the VM's cloud-init. A subsequent step (manual or GHA workflow with operator OP_TOKEN at PR-author request time) uses 1Password CLI with operator credentials to write the private key. The CI verification step uses the read-only service account above to verify the write happened. Write and read are separate identities by design.
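
The post-apply freshness check reduces to one comparison per vault item. The sketch below assumes the verifier has already listed vault items into a simple shape; the `VaultItem` interface and its field names are hypothetical and do not mirror the 1Password CLI output.

```typescript
// ASSUMPTION: items were fetched by the read-only verifier and mapped into this shape.
interface VaultItem {
  title: string;
  createdAt: string; // ISO-8601 timestamp
}

/** True when vm-emergency-<vmName> exists and was created at or after the apply completed. */
function emergencyKeyIsFresh(items: VaultItem[], vmName: string, applyCompletedAt: string): boolean {
  const wanted = `vm-emergency-${vmName}`;
  const applyMs = Date.parse(applyCompletedAt);
  if (Number.isNaN(applyMs)) throw new Error(`bad apply timestamp: ${applyCompletedAt}`);
  // An unparsable createdAt yields NaN, which fails the comparison, so the check fails closed.
  return items.some((i) => i.title === wanted && Date.parse(i.createdAt) >= applyMs);
}
```

A stale item left over from a previous deploy returns false, which is what turns "rotated on every redeploy" into a merge gate.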

3.5 Other admin surfaces

| Surface | Auth |
| --- | --- |
| github.com/SecurityV0/* | GitHub login (MFA enforced via org policy) |
| dash.cloudflare.com | Cloudflare account login (federated to Google Workspace @securityv0.com) |
| portal.azure.com (subscription) | Tier-3 subscription owners (Sergey's + Ivan's Entra): Active Owner, Security Defaults MFA-on-sign-in |
| MongoDB Atlas console | Atlas login (federated to Google Workspace) |
| HCP Terraform UI | HashiCorp Cloud login (federated to GitHub) |
| 1Password (secret vault) | 1Password account, MFA enforced |

3.6 Staff access to prod — audit trail

Prod has no L1 perimeter (customers need to reach it), so staff and customers use the same L2 (WorkOS). The staff/customer boundary lives entirely in detection.

| Event | Action |
| --- | --- |
| Super-admin (staff with WORKOS_SUPER_ADMIN_ORG_ID membership) authenticates against prod | Standard WorkOS audit log, retained 90d |
| Super-admin writes to a customer tenant | Slack notification to #audit-prod-staff-writes: actor + tenant + endpoint + timestamp |
| Super-admin reads customer tenant data | Tail-aggregated daily, posted to #audit-prod-staff-reads as rollup |
| WORKOS_SUPER_ADMIN_ORG_ID membership change | Notified to #audit-workos-membership |

#audit-* Slack channels are the audit record, not the alerting mechanism. PagerDuty/Opsgenie pages when a super-admin acts outside business hours (configurable).
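
The Slack-bound rows of the table can be read as a routing function from event kind to channel. The following is only a sketch of that mapping: the event type names and the `delivery` field are hypothetical, and treating membership changes as immediate is an assumption (the table only says "notified"). The authentication row is omitted because it lands in the WorkOS audit log, not Slack.

```typescript
// Hypothetical event model; only the channel names come from the table above.
type StaffAuditEvent =
  | { kind: "tenant_write"; actor: string; tenant: string; endpoint: string; at: string }
  | { kind: "tenant_read"; actor: string; tenant: string; at: string }
  | { kind: "super_admin_membership_change"; actor: string; at: string };

interface AuditRoute {
  channel: string;
  delivery: "immediate" | "daily_rollup";
}

function auditRouteFor(event: StaffAuditEvent): AuditRoute {
  switch (event.kind) {
    case "tenant_write":
      return { channel: "#audit-prod-staff-writes", delivery: "immediate" };
    case "tenant_read":
      // Reads are aggregated and posted once a day rather than per request.
      return { channel: "#audit-prod-staff-reads", delivery: "daily_rollup" };
    case "super_admin_membership_change":
      // ASSUMPTION: posted as it happens.
      return { channel: "#audit-workos-membership", delivery: "immediate" };
  }
}
```

Keeping the mapping in one exhaustive switch means a new event kind fails type-checking until someone decides which audit channel records it.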

Anomaly detection deferred. Metrics-based detection (super-admin writes per hour > N, or writes to > M distinct tenants in a window) is explicitly deferred to first compliance ask or 5+ staff. At current scale (1-2 operators), the Slack rollup + quarterly review by Sergey/Ivan is the control. Re-evaluate when the team grows or a customer audit requires it.

Policy (start weak, tighten as the team grows):

  • Routine staff prod access should reference a paired ticket or customer support escalation. The audit channels make this enforceable retroactively.
  • Production write actions by staff must be reproducible from non-staff API endpoints. If not (one-off Mongo edit), file a follow-up to make them reproducible.

3.7 Account recovery

When a staff member loses GitHub access (locked, MFA device lost):

| Capability | Status during recovery |
| --- | --- |
| dev.*/pr-*-dev.* URLs | Blocked (L1 GitHub gate) |
| Tier-1 SSH | Blocked (CF Access SSH uses GitHub IdP) |
| app.securityv0.com, staging.securityv0.com | Works (L2 WorkOS uses Google Workspace, independent of GitHub) |
| GitHub SecurityV0/* repos | Blocked |
| Customer support work (read-only on prod) | Works |
| Serial Console | Works if user is a tier-2 emergency operator (Entra independent of GitHub) |

Expected GitHub recovery TAT: 2–5 business days for fully-locked account; minutes-to-hours for self-serve MFA backup-code path.

Backup FIDO2 setup (pragmatic, simple):

  • Each tier-2 emergency operator registers a backup FIDO2 key with both GitHub and Entra.
  • The backup key is kept physically separate from daily-carry items (not on the same keyring as the daily MFA device) — the operator chooses where.
  • The FIDO2 PIN is memorized, not stored anywhere.
  • Yearly check that the backup FIDO2 still authenticates against GitHub + Entra (5-min self-test, calendar reminder).
  • No physical-safe requirement, no per-location restrictions — operators work from any location.

Accepted residual risk: if the backup FIDO2 is lost or compromised at the same time as the daily MFA device, recovery falls to the GitHub support flow (2-5 business days). This is acceptable at current scale.


4. Service / machine identities

| Use case | Mechanism | Notes |
| --- | --- | --- |
| TFC plans/applies → Azure | OIDC federation (per-workspace SP) | No long-lived secrets |
| TFC writes state backup to Azure Storage | OIDC federation | Same SP as apply |
| Bootstrap apply → Azure | Operator's az login (Tier-3 Active Owner: Sergey or Ivan) | Local-apply per ADR-022 §7; migration to TFC: sv0-infrastructure#29 |
| Connectors reading source systems | Per-connector API key, stored in source system | Read-only |
| CI runs (GitHub Actions → external) | GitHub OIDC where supported; PAT otherwise (read-scoped) | |
| Internal agents (Claude Code) hitting platform API | WorkOS M2M client per agent (delegated_agent kind) | Memory project_auth_principal_model_locked |
| Internal scripts hitting prod/staging | WorkOS device_code flow → short-lived bearer | |
| Internal scripts hitting dev URLs (must pass CF Access) | CF Access service token (§4a) OR cloudflared with operator's GitHub identity | |

4a. CF Access service tokens — policy

  • One token per script/workflow, named cf-access-st-<purpose> (e.g., cf-access-st-seed-demo).
  • Stored in 1Password sv0-infra vault with prefix cf-access-st-. Each 1Password item has an expires_at field set to issue-date + 90d. Item references where the consumer stores it (TFC variable name, GH Actions secret name).
  • Scope-per-app: each token bound to exactly one CF Access app. Wildcards forbidden.
  • Rotation deadlines enforced by automation: a scheduled GHA workflow, check-cf-service-tokens.yml (weekly), reads 1Password via the OP CLI, lists tokens past expires_at, opens an issue assigned to each token's owner, and fails CI if any token is >120d old (hard cap). The rotation itself stays manual: automating it is non-trivial and not worth the complexity at this scale.
  • Fail loud on missing config: every consumer asserts both CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET non-empty at startup, hard-exit if missing. Per Rule #6.
  • Leak detection: GitHub secret scanning patterns for CF_ACCESS_CLIENT_SECRET.
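
The 90d/120d policy above boils down to a three-way classification by token age. This is a minimal sketch of the logic the weekly check could apply; the function and type names are hypothetical, and the real workflow reads expires_at from 1Password instead of computing it from the issue date.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const ROTATE_AFTER_DAYS = 90; // expires_at = issue date + 90d
const HARD_CAP_DAYS = 120; // CI fails beyond this age

type TokenStatus = "ok" | "rotation_due" | "hard_cap_exceeded";

function tokenStatus(issuedAt: Date, now: Date): TokenStatus {
  const ageDays = (now.getTime() - issuedAt.getTime()) / DAY_MS;
  if (ageDays > HARD_CAP_DAYS) return "hard_cap_exceeded"; // fail CI
  if (ageDays > ROTATE_AFTER_DAYS) return "rotation_due"; // open an issue for the owner
  return "ok";
}
```

The 30-day gap between the two thresholds is the window in which the opened issue is expected to be acted on before CI starts failing.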

4b. Session lifetimes (target)

| Token / session | TTL | Reason |
| --- | --- | --- |
| WorkOS staff session | 8h | One workday; daily re-auth via Google Workspace |
| WorkOS customer session | 30d | Customer UX expectation; refreshed on activity |
| CF Access dev session (browser) | 8h | Matches staff workday |
| CF Access SSH cert | 1h | Per-session re-auth is the value of CF Access SSH |
| Azure Owner Entra session | 8h (browser default) | MFA-on-sign-in via Security Defaults. PIM (eligible/JIT) NOT adopted; see §7. |
| CF Access service token | None (rotate every 90d, hard cap 120d) | Long-lived by design; rotation is the control |
| GitHub PATs | 90d max | Per GitHub org policy |

5. Hard rules

Non-negotiable invariants. Changes require the §11 amendment process.

  1. App code never reads Cf-Access-Jwt-Assertion to derive user identity or grant access. App identity is WorkOS-only. CF Access is a network gate, not an identity signal the app trusts.

    • Exception (legacy): src/api/auth/providers/dev-provider.ts currently reads CF Access JWT. Phase 0 (§6.0) rewrites it to use a hardcoded identity. Phase 0 PR merge is a hard precondition for ADR-023 promotion. CI grep gate enforces zero cf-access-* reads in src/ and ui/src/ after Phase 0.
  2. No parallel allow-lists at the same layer. Each surface's CF Access policy uses exactly one signal: GitHub org membership. No "GitHub OR email-in-list" fallbacks.

  3. No long-lived authorized_keys for routine operator SSH. Tier-1 SSH = CF Access SSH CA. Tier-1.5 per-VM emergency keys are narrow-scoped (§3.4.1) and explicitly NOT routine ops.

  4. Tier-2 emergency access is Entra + Azure RBAC only, via the custom sv0-serial-console-operator role on the fixed sv0-vm-emergency-ops group. Never federated to GitHub or WorkOS.

  5. Service identities are always explicit. No human-tied tokens used by services. CI doesn't reuse a developer's GitHub token; agents don't reuse a developer's WorkOS session.

  6. Fail loud on missing config. A missing secret → hard exit at boot or in the deploy fail-closed check. No silent fallback to a less-secure path.

    • Corollary 6a: WorkOS membership and super-admin signals MUST NOT use cached fallback on lookup failure. Fail the request (503), do not degrade to last-known. Required test in test/api/auth/.
  7. One source of truth per identity domain. GitHub = staff at perimeter. WorkOS = product users at app. Entra = Azure RBAC only. Never federate one into another for the sake of "single sign-on" if it adds an indirection.

  8. WorkOS environment isolation. Prod and staging use separate WorkOS environments with distinct JWKS endpoints and audience claims. Never shared org IDs across envs.

    • Startup assertion (Phase 0): src/shared/config/env.ts asserts on NODE_ENV=production that WORKOS_AUTHKIT_DOMAIN matches a prod-allowlist regex (or doesn't match a staging-denylist). Hard-exit with a named error otherwise. Symmetric assertion on staging.
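
The Rule #8 startup assertion could look roughly like the sketch below. It is not the contents of src/shared/config/env.ts. The denylist substrings follow the example given in §6.0. Requiring a non-prod marker on staging, the `NODE_ENV=staging` value, and the error name are all assumptions. The function throws so it can be tested; a real entry point would catch the error and hard-exit.

```typescript
type Env = Record<string, string | undefined>;

// Naive substring match, per the §6.0 example ("staging", "test", "dev").
const NON_PROD_MARKERS = /staging|test|dev/i;

function configError(message: string): Error {
  const e = new Error(message);
  e.name = "WorkosEnvMismatchError"; // hypothetical named error
  return e;
}

function assertWorkosDomainMatchesEnv(env: Env): void {
  const nodeEnv = env["NODE_ENV"];
  const domain = env["WORKOS_AUTHKIT_DOMAIN"];
  // Rule #6: a missing value is a hard failure, never a silent default.
  if (!domain) throw configError("WORKOS_AUTHKIT_DOMAIN is not set");
  const looksNonProd = NON_PROD_MARKERS.test(domain);
  if (nodeEnv === "production" && looksNonProd) {
    throw configError(`production must not use non-prod AuthKit domain: ${domain}`);
  }
  // ASSUMPTION: the symmetric check requires a non-prod marker on staging.
  if (nodeEnv === "staging" && !looksNonProd) {
    throw configError(`staging must not use a prod AuthKit domain: ${domain}`);
  }
}
```

A substring denylist will also reject a legitimate prod domain that happens to contain "dev" or "test", which is one reason the rule text offers a prod-allowlist regex as the alternative.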

6. What changes from today

6.0 Phase 0 — must land before ADR-023 merges

Phase 0 is a hard precondition: ADR-023's "Status" header will reference the Phase 0 PR's merge SHA. ADR-023 does not merge until Phase 0 is green.

  1. Rewrite src/api/auth/providers/dev-provider.ts to drop the CF-Access-JWT path entirely. Provider returns DEV_USER unconditionally. No verifyCfAccessJwt call.
  2. Add CI grep gate: grep -rE "cf-access-jwt-assertion|Cf-Access-Jwt-Assertion" src/ ui/src/ returns zero matches. Fail CI step otherwise.
  3. Audit the codebase for similar silent-fallback patterns and remove any found (focus on || '', ?? '', if (!secret) against sensitive vars).
  4. Implement Rule #8 startup assertion in src/shared/config/env.ts: on NODE_ENV=production, assert WORKOS_AUTHKIT_DOMAIN doesn't match staging-denylist regex (e.g., must not contain staging, test, dev substrings); symmetric assertion on staging. Hard-exit on misconfig. Test in test/shared/config/ proves misconfig → hard exit, not silent degradation.
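
The grep gate in step 2 needs to fail when matches exist, whereas plain grep exits non-zero when there are none, so the CI step has to invert the result. As an illustration of the same check in TypeScript, here is a minimal sketch that scans already-loaded file contents; the `SourceFile` shape and function name are hypothetical, and the match is case-insensitive, a superset of the two spellings in the grep pattern.

```typescript
const FORBIDDEN_HEADER = /cf-access-jwt-assertion/i;

interface SourceFile {
  path: string; // repo-relative, forward slashes
  text: string;
}

interface Violation {
  path: string;
  line: number; // 1-based
}

/** Every line under src/ or ui/src/ that mentions the CF Access JWT header. */
function findCfAccessReads(files: SourceFile[]): Violation[] {
  const hits: Violation[] = [];
  for (const f of files) {
    const inScope = f.path.startsWith("src/") || f.path.startsWith("ui/src/");
    if (!inScope) continue;
    f.text.split("\n").forEach((lineText, idx) => {
      if (FORBIDDEN_HEADER.test(lineText)) hits.push({ path: f.path, line: idx + 1 });
    });
  }
  return hits;
}
```

A CI wrapper would exit non-zero whenever the returned array is non-empty, printing each path and line so the offending read is easy to find.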

6.1 Immediate corrections (the dev-azure spike, this week)

  1. Delete cf_entra_idp_id workspace variable on sv0-dev OR set to "". The Terraform local.cf_idp_id falls through to the GitHub IdP fallback. Re-trigger sv0-dev apply.

  2. PR: remove dual-IdP conditional from envs/dev/main.tf. Replace var.cf_entra_idp_id != "" ? var.cf_entra_idp_id : "45cdd3b1-..." with the GitHub IdP ID as a named local (no conditional). Per Rule #6 / Rule #2.

  3. Disable Entra IdP at CF Access (already missing per 2026-05-12 diagnostic — confirm via dashboard). Azure App Registration "Cloudflare Access" stays quiescent; cleanup is deferred.

  4. Close sv0-infrastructure#27 with reversal note.

  5. Offboarding runbook (place in sv0-documentation/docs/runbooks/): when a staff member leaves SV0, in order:

    1. Revoke their CF Access user sessions: POST /accounts/{id}/access/organizations/revoke_user with their email.
    2. Remove from GitHub SecurityV0 org.
    3. Verify by attempting cloudflared access login from a test machine with the offboarded identity (should be denied).
    4. Remove from WorkOS organizations they were in.
    5. Remove from any sv0-vm-emergency-ops Entra group membership.
    6. Audit the offboarded user's last 30d activity in WorkOS, CF Access, GitHub.

    Max time-to-revocation: minutes if procedure followed; hours otherwise.

6.2 Phase 3a (next sprint)

  1. CF Access SSH for the dev-azure VM on dev-azure-ssh.securityv0.com per §3.4.2 (depth-1 hostname — see Universal SSL constraint there). DNS CNAME + tunnel ingress for ssh://localhost:22 + CF Access app type=ssh with GitHub IdP (MFA enforced upstream at GitHub org policy, not via CF auth_method=mfa — see §3.4.2) + cloud-init configures sshd to trust the per-app CF SSH CA + principals file (email local-parts).

  2. Tier-1.5 per-VM emergency key for the dev-azure VM per §3.4.1. Cloud-init creates sv0emergency user with no sudo, writes per-VM ed25519 public key, installs /usr/local/bin/sv0-rescue (which writes local diagnostics first, then best-effort CF webhook). Private key stored out-of-band in 1Password as vm-emergency-vm-sv0-dev-1. Post-apply CI check verifies item exists and is fresh (§3.4.5).

  3. Tier-3 Owner — Active dual-Owner + Security Defaults MFA (no PIM, no backup SP):

    Amended 2026-05-13 (no-PIM revision + CEO scope review + state-verification correction). The premise that "Tier-3 = Ivan only" was a documentation-staleness bug: az role assignment list shows that Sergey has been subscription Owner since 2026-01-04 (original Owner; created the subscription) and that Ivan was added 2026-03-10. The 2-human Owner rollback has been in place for 2+ months. The original PIM-eligibility design and the proposed sv0-azure-backup-owner bridge SP were both solving a non-existent SPOF. The bridge is cancelled (PR #57 closed). The recovery-credentials design patterns from the cancelled work are banked in docs/patterns/recovery-credentials.md for any future scenario where a recovery SP is genuinely warranted.

    1. Verify Security Defaults is enabled on the tenant (free-tier policy that enforces MFA on az login / portal.azure.com / ARM API via the ARM MFA-required policy). Read via portal: Entra → Properties → Manage Security Defaults. This is the free-tier replacement for PIM's MFA-on-activate enforcement — without it, Tier-3 Owners have no MFA gate.

    2. The 2-human Owner setup IS the rollback. Subscription Owner assignments (verified 2026-05-13):

      • Sergey (098551cd-0071-4408-846a-961c35da98a4) — Owner since 2026-01-04
      • Ivan (a38b998e-b2f4-4e73-ac3d-370da0b0a1da) — Owner since 2026-03-10

      Either operator can re-provision the other's role assignment in a lockout scenario. No backup SP, no Microsoft-support-RTO concern.

    Hard rule (no exception): the 2-human Owner state above is preserved until ≥3 human Owners exist OR a documented superseding design is in place. Neither Owner may be removed without an explicit migration plan.

    What we explicitly do not get without PIM (accepted tradeoff for not paying for P2):

    • No JIT activation window — Owner is always Active for both operators. Compensating control: Security Defaults MFA-on-sign-in + dual-Owner attribution (each human's actions logged distinctly in Activity Log).
    • No per-activation business-justification field. Compensating control: routine Owner-scoped operations go through TFC (audited via TFC run history).
    • No activation-event audit log. Compensating control: Azure Activity Log captures every role-scoped operation, with monthly review automated per sv0-infrastructure#59 (scoped to break_glass + bootstrap SPs; loud-on-zero-actions).

    Re-evaluate PIM adoption when (a) Entra P2 is procured for product/demo reasons and we can opportunistically extend, or (b) staff with Owner-scoped access grows to ≥3.

  4. Azure Bastion direct (non-PIM) role assignment for tier-3 — provisioned for the Tier-3 Owners (Sergey + Ivan), kept Active. This is the CF-independent transport during CF Access outages (§3.4.4). One-time setup; deferred until Bastion is actually provisioned (out of Phase 3a-4 scope).

  5. GHA workflow check-cf-service-tokens.yml — weekly, reads 1Password via OP CLI, opens issues for tokens past expires_at, fails CI for tokens >120d old. Per §4a.

  6. Quarterly tier-2 emergency drill (per §3.4.4): each emergency operator runs Serial Console + Tier-1.5 procedures on a test VM, results posted to #audit-tier-2-drills.
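The two thresholds in the check-cf-service-tokens.yml step (issue on expiry, CI failure past 120 days) can be sketched as a pure function that the workflow would call per token. This is an illustration only: the `ServiceToken` record and its `name` and `created_at` fields are hypothetical, since the ADR names only `expires_at`, and reading the records from 1Password is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

MAX_TOKEN_AGE = timedelta(days=120)  # the ">120d old" rule


@dataclass(frozen=True)
class ServiceToken:
    name: str             # hypothetical field
    created_at: datetime  # hypothetical field: when the token was issued
    expires_at: datetime  # named in the ADR


def actions_for(token: ServiceToken, now: datetime) -> List[str]:
    """Return the actions the weekly check would take for one token."""
    actions = []
    if now > token.expires_at:
        actions.append("open_issue")  # token is past expires_at
    if now - token.created_at > MAX_TOKEN_AGE:
        actions.append("fail_ci")     # token is older than 120 days
    return actions


def utc(year: int, month: int, day: int) -> datetime:
    """Small helper for timezone-aware example dates."""
    return datetime(year, month, day, tzinfo=timezone.utc)
```

The two conditions are evaluated independently, so a token that is both expired and older than 120 days triggers both actions.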

Explicitly NOT in this phase: separate FIDO2 break-glass Entra account in a physical safe (per Ivan 2026-05-12). Backup FIDO2 setup per §3.7 (no safe requirement) is sufficient at current scale.
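The Tier-3 hard rule in this phase (both human Owners stay assigned) is checkable from the same `az role assignment list` output that drove the 2026-05-13 correction. A minimal sketch, assuming the GUIDs listed under the Tier-3 step are the `principalId` values in that output and that the JSON is captured with `--output json`; the function name is hypothetical:

```python
import json
from typing import List

# GUIDs copied from the Tier-3 Owner list above.
# Assumption: they correspond to the principalId field of the role assignments.
EXPECTED_OWNERS = {
    "098551cd-0071-4408-846a-961c35da98a4": "Sergey",
    "a38b998e-b2f4-4e73-ac3d-370da0b0a1da": "Ivan",
}


def missing_human_owners(assignments_json: str) -> List[str]:
    """Return the names of expected human Owners absent from the assignment list."""
    assignments = json.loads(assignments_json)
    present = {
        a.get("principalId")
        for a in assignments
        if a.get("roleDefinitionName") == "Owner" and a.get("principalType") == "User"
    }
    return sorted(name for pid, name in EXPECTED_OWNERS.items() if pid not in present)
```

Fed the output of `az role assignment list --role Owner --scope /subscriptions/<subscription-id> --output json`, an empty result means the dual-Owner rollback is intact; any name returned means the hard rule has been broken and needs attention before further changes.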

6.3 Phase 3b (formal staging)

  1. Staging applies the same pattern. L2 (WorkOS staging env, separate org from prod), no L1 perimeter. SSH via CF Access SSH on staging-ssh.securityv0.com (depth-1 per §3.4.2). Per-VM emergency key for staging VM.
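The depth-1 hostname requirement is easy to violate when a new environment is added, so a small guard can help. A sketch, assuming "depth-1" means exactly one DNS label below the securityv0.com zone apex (the full constraint lives in §3.4.2); the function name is hypothetical:

```python
def is_depth_one(hostname: str, zone: str = "securityv0.com") -> bool:
    """True if hostname is exactly one label below the zone apex."""
    hostname = hostname.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    suffix = "." + zone
    if not hostname.endswith(suffix):
        return False  # the apex itself or a different zone
    label = hostname[: -len(suffix)]
    # Exactly one non-empty label: no further dots allowed.
    return bool(label) and "." not in label
```

Under this reading, staging-ssh.securityv0.com and dev-azure-ssh.securityv0.com pass, while a nested name such as ssh.staging.securityv0.com does not.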

6.4 Deferred cleanup (any time)

  1. Delete the "Cloudflare Access" Azure App Registration and its 1Password client secret entry. Non-blocking.

  2. ADR-023 promotion CI gate (closes the human-review-only state described in the Status header). GHA workflow in sv0-documentation that triggers when docs/architecture/decisions/adr-023-*.md is added or its Status line changes:

    • Parse the Status line for a Phase 0 commit SHA.
    • Verify the SHA exists in sv0-platform main.
    • Verify the §6.0 four steps are observable at the SHA: dev-provider has no CF-Access-JWT path, CI grep gate is present in .github/workflows/, env.ts iss-claim assertion is present, corresponding test exists.
    • Fail merge of the ADR PR if any check fails.

    Until this lands, ADR-023 promotion is gated by human review (Ivan verifies Phase 0 is merged before merging the ADR). Tracked as a sv0-documentation issue to file alongside the ADR PR.

  3. ADR-022 amendments:

    • §5c.1 "Two doors": replace every "Entra IdP at CF Access" with "GitHub IdP at CF Access."
    • §5c (Emergency access tiers): tier-1 SSH = CF Access SSH (GitHub), tier-1.5 = per-VM emergency key (CTO/CEO-only), tier-2 = Azure Serial Console (Entra group + custom role), tier-3 = dual Active Azure Owners (Sergey + Ivan) + Security Defaults MFA-on-sign-in (no PIM, no backup SP — see §7).
    • §4: change default vm_size from Standard_B2s (NotAvailableForSubscription in westeurope) to Standard_D2as_v6, and amend the Azure Policy to match.
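The first bullet of the promotion CI gate (parse the Status line for a Phase 0 commit SHA) could look like the sketch below. It assumes the Status text keeps the current "at commit <sha>" phrasing; the pattern and function name are hypothetical, and the later checks (SHA present in sv0-platform main, §6.0 steps observable) are not covered.

```python
import re
from typing import Optional

# Assumption: the Status header keeps the phrasing "commit <hex SHA>",
# with an abbreviated (7+) or full (40) hexadecimal SHA.
SHA_PATTERN = re.compile(r"\bcommit\s+([0-9a-f]{7,40})\b")


def phase0_sha(status_text: str) -> Optional[str]:
    """Return the first commit SHA mentioned in the Status text, or None."""
    match = SHA_PATTERN.search(status_text)
    return match.group(1) if match else None


status = "Phase 0 (§6.0) landed in sv0-platform at commit d7885d8 via PR #856."
sha = phase0_sha(status)
```

A gate built on this would fail closed: if `phase0_sha` returns None, the ADR PR should not merge, which matches the "fail merge if any check fails" bullet.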

6.5 What does NOT change

  • WorkOS at L2 for prod/staging/dev portals — unchanged.
  • Azure RBAC + Entra group for Serial Console — unchanged.
  • TFC OIDC federation for sv0-shared, sv0-prod, sv0-dev workspaces — unchanged.
  • GitHub at CF Access for dev.securityv0.com — unchanged.

7. Things explicitly NOT in scope

  • WorkOS at L1. Pricing-gated (OIDC Connect is a separate SKU), wrong product fit, blurs customer/staff boundary.
  • Entra at L1. Wrong source of truth — SV0 staff are GitHub users.
  • Mixed IdP fallbacks. Rule #2.
  • Federating GitHub into Entra (or vice versa). Adds indirection for no operational gain at our team size. Reopen if team grows past ~20 staff or compliance demands single-IdP.
  • Cloudflare Access as an application-identity source. Rule #1.
  • Long-lived authorized_keys for routine operators. Rule #3 (Tier-1.5 is narrow-scope exception).
  • Bastion / jump host pattern for SSH. CF Access SSH replaces this.
  • Azure AD SSH login. §3.4.3.
  • Separate FIDO2 break-glass Entra account in a physical safe. Deferred per Ivan 2026-05-12. Current §3.7 setup is sufficient.
  • Entra ID P2 / Azure PIM for infra access control. Resolved 2026-05-13. Verification returned subscribedSkus: [] + 400 AadPremiumLicenseRequired (P2 absent on tenant). P2 procurement is reserved for product/demo use cases (Entra audit logs in execution findings); we do not adopt P2/PIM to gate our own infra. Tier-3 runs dual Active Owners (Sergey + Ivan; verified via az role assignment list 2026-05-13) + Security Defaults MFA-on-sign-in. Mutual recovery between the two Owners is the account-lockout rollback. Microsoft support tenant-root reset remains as a residual fallback only for the joint-loss scenario, RTO of days — acceptable. Re-evaluate PIM when (a) P2 lands for product reasons and we can opportunistically extend, or (b) staff with Owner-scoped access grows to ≥3.
  • Two-human signer on amendments. Deferred until team scales past 5 staff (Ivan 2026-05-12). Current process: two independent Claude/Codex sessions for AI-proposed amendments (§11, criterion 2).
  • Anomaly detector for super-admin actions. Deferred to first compliance ask or 5+ staff (§3.6).
  • Replacing Cloudflare with a cloud-specific edge. Cloudflare stays — perimeter, DNS, Tunnel, Access all live there. Cloud-portable as-is.

8. Open questions

Most v1/v2 questions are resolved in v3. Remaining live items:

  1. The bootstrap operator role — currently either Tier-3 Owner (Sergey or Ivan), running local-apply on bootstrap/. Long-term it should be a distinct identity (a service account with limited scope used only for bootstrap apply). Track as a Phase 3c+ item.
  2. Cross-Claude-session convergence for AI-proposed amendments — implementation: a checklist in the PR template? A make verify-amendment script that wraps both Claude sessions? Decide before the first such amendment lands.

Closed: "Entra ID P2 license state on the current tenant" — resolved 2026-05-13 (subscribedSkus: []). See §7 + §6.2 step 12 amendment.


9. Glossary (used precisely throughout this doc)

  • L1 / "perimeter" — Cloudflare Access in front of a URL or port. Decides whether the network connection reaches the backend. Used in §3.2, §3.4. Never "staff/external trust boundary" generally — that's "trust boundary."
  • L2 / "application" — Authentication inside the application itself. For SV0: WorkOS sessions.
  • L3 / "resource RBAC" — Cloud-side authorization decisions (Azure RBAC, MongoDB Atlas roles, GitHub permissions).
  • IdP — Identity provider.
  • CF Access app — A Cloudflare Access "application" resource: hostname + type + allowed IdPs + policy.
  • CF Access SSH — Cloudflare Access in front of port 22, VM trusts CF Access SSH CA. Operator: cloudflared access ssh --hostname X.
  • Service token — Long-lived token issued by Cloudflare Access for service-to-service use.
  • M2M token — Machine-to-machine token issued by WorkOS for a service principal.
  • Tier-1 SSH — Routine operator SSH (target: CF Access SSH).
  • Tier-1.5 emergency SSH — Narrow-scope per-VM key for CTO/CEO-level break-glass when Tier-1 and Tier-2 are unavailable.
  • Tier-2 emergency operator — Person assigned to sv0-vm-emergency-ops Entra group. Has Serial Console + custom role. 1–3 members.
  • Tier-3 subscription owner — Person with Azure subscription Owner role, Active assignment, MFA-on-sign-in via Security Defaults. Today: Sergey (original Owner since 2026-01-04) and Ivan (Owner since 2026-03-10). Mutual recovery between the two is the account-lockout rollback; no backup SP (the proposed sv0-azure-backup-owner was cancelled when state-verification 2026-05-13 showed the 2-Owner state already existed — design patterns banked in docs/patterns/recovery-credentials.md for any future scenario where a backup SP is actually warranted).
  • Bootstrap operator — The role used for bootstrap/ local-apply. Currently the same identity as tier-3, conceptually distinct (§8 Q1).
  • "Break-glass" (adjective only) — describes tier-1.5 or tier-2 access. Not used as a noun.
  • Security Defaults — Free-tier Entra policy that enforces MFA registration for all users and MFA-on-sign-in for Entra administrator roles and for any user reaching Azure management endpoints (portal, Azure CLI, ARM API), which is the path that covers subscription Owner and Contributor. The free-tier replacement for Conditional Access (which requires P1/P2). Not granular: it applies tenant-wide, uniformly. SV0 staff scope is small enough that this is sufficient.
  • PIM — Azure Privileged Identity Management. Entra ID P2 feature. Eligible-not-active role assignments, MFA-on-activate. NOT adopted for SV0 infra (§7); referenced only for the AWS-migration sketch in §1.1.

10. References

  • Round 3 review of v3: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review-r3.md
  • Round 2 review of v2: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review-r2.md
  • Round 1 review of v1: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review.md
  • v3 draft: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v3.md
  • v2 draft: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v2.md
  • v1 draft: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v1.md
  • Reversed earlier verdict: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-11-cf-access-idp-question-for-auth-agent.md
  • Azure landing zone session: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-11-azure-landing-zone-staging-first-session.md
  • ADR-022 — Azure compute landing zone: ~/dev/securityv0/repos/sv0-documentation/docs/architecture/decisions/adr-022-azure-compute-landing-zone.md
  • Runbook 12 — Azure VM landing zone: ~/dev/securityv0/repos/sv0-documentation/docs/runbooks/12-azure-vm-landing-zone.md
  • WorkOS principal model: ~/dev/securityv0/repos/sv0-skills/auth-context/SKILL.md
  • Memory: project_auth_principal_model_locked (2026-04-30), feedback_subagent_backward_compat_neutralizes_fix, feedback_fail_loud_over_silent_fallback

11. How to change this document

At current 1-2 operator scale, amendments are operator-PRs reviewed by the other operator (or solo for documentation-only changes). Two named substantive criteria:

  1. Hard Rules in §5 are invariants. Changes to a hard rule require the operator to articulate the threat-model delta in the PR description.
  2. AI-proposed amendments need a second-model pass. Run any AI-generated amendment through a second independent session (Codex or a fresh Claude session) before merging. The 2026-05-13 no-PIM revision is the worked example: Codex caught five threat-model premises the first model missed; the CEO scope review caught that the resulting bridge was unnecessary. Both passes mattered.

Trigger to revisit and expand this section: team growth past 5 staff, OR first SOC2 / customer audit ask. Until then, the lighter-weight process is correct; an earlier version of this section prescribed a heavier process than a 2-person team can sustain, and that was the failure mode.

Reversal lesson worth keeping (2026-05-11): when proposing an auth change, ask "what does existing infra already provide?" before re-deriving from first principles. The 2026-05-11 verdict was reversed within 24 hours because it conflated backend presence (Azure tenant has 1-3 emergency accounts) with staff identity store (where SV0 staff actually live = GitHub). The same failure mode reappeared 2026-05-13 with the cancelled backup-Owner SP: building a sophisticated workaround for a problem whose existing infra-equivalent (add a 2nd human Owner) was cheaper. Pattern-match to it.


— Authentication target architecture (DRAFT v4), 2026-05-12