ADR-023: Authentication Target Architecture
Status
Accepted — 2026-05-12.
Phase 0 (§6.0) — the hard precondition for this ADR's operational guarantees — landed in sv0-platform at commit d7885d8 via PR #856. The PR included three rounds of adversarial review and post-merge regression coverage for ADR-023 Rule #1 (CI grep gate) and Rule #8 (cross-env startup assertion + tests). An automated GHA gate that enforces future ADR-023 amendments against the Phase 0 SHA is tracked as a follow-up in §6.4 step 18.
Operationalises and supersedes the perimeter-IdP choice in ADR-022 §5c.1. Paired with docs/runbooks/12-azure-vm-landing-zone.md, which holds the Azure-side implementation sequencing for the items in §6 of this ADR.
2026-05-13 amendment (no-PIM revision + CEO scope review + state-verification correction): §3.4 Tier-3, §6.2 step 12, §7, and the glossary were amended to drop Azure PIM. Verification returned subscribedSkus: [] + 400 AadPremiumLicenseRequired; P2/PIM is reserved for product/demo use cases, not infra access control. The interim design layered on a sv0-azure-backup-owner UAA service principal as an account-lockout rollback (Codex adversarial review tightened it to UAA + out-of-band credential + sunset condition). The CEO scope review the same day caught that satisfying the SP's own sunset trigger ("delete when a 2nd human Owner exists") was cheaper than building the SP. State verification (az role assignment list) then showed the trigger had already been satisfied months earlier: Sergey has been subscription Owner since 2026-01-04 (he created the subscription), and Ivan was added on 2026-03-10. The 2-human Owner rollback has existed since March. The PIM design, the backup-SP design, and the "add Sergey" issue (#60) were all solving a non-problem masked by stale documentation that said "Tier-3 = Ivan only." The bridge was cancelled (PR #57 closed) and #60 was closed as already resolved. Design patterns from the Codex review (UAA > Owner, out-of-band > TF-state, sunset conditions, safe activation pattern) are banked in docs/patterns/recovery-credentials.md for any future scenario where a recovery SP is genuinely warranted.
Context
SV0 today operates three identity-bearing systems: GitHub (SecurityV0 org), WorkOS (AuthKit), and Entra ID (Azure default directory). Until 2026-05-11 there was no canonical document of which system was the source of truth for which decision; this drift led directly to an incorrect "Entra at Cloudflare Access" verdict on 2026-05-11 that survived 24 hours before being reversed. The reversal exposed several latent issues: app code in dev-provider.ts reading Cf-Access-Jwt-Assertion (a violation of the intended layering), no defined emergency-access path for the Azure compute landing zone, no documented offboarding TTL, no rotation discipline on CF Access service tokens, and no rollback for the tier-3 subscription-owner SPOF.
This ADR locks the target architecture covering three scopes — portal UI access, API access, and infrastructure access — using only IdPs SV0 already operates (zero new identity stores) and applying ADR-022's cloud-portability discipline so the design moves cleanly to AWS if/when credits arrive there.
The document is intentionally long because the failure mode it guards against is future-Claude-session-reverses-this-on-flimsy-reasoning (per §11.4). Three rounds of adversarial review (saved under ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review*.md) pressure-tested every claim. Round-3 verdict: ship as ADR with the Phase 0 precondition.
Decision
The full decision is laid out below in §1 (executive summary) through §11 (amendment process). Key invariants are §5's "Hard rules"; key implementation items are §6's phased rollout (Phase 0 is a hard precondition, Phase 3a is the next sprint).
Decision summary
- Three IdPs, each the source of truth for exactly one thing: GitHub (L1 staff perimeter), WorkOS (L2 platform users), Entra (L3 Azure RBAC, break-glass only).
- Four SSH tiers: Tier-1 = CF Access SSH + GitHub IdP, Tier-1.5 = narrow per-VM emergency key (CTO/CEO-only), Tier-2 = Azure Serial Console + custom Entra-group role, Tier-3 = dual Active subscription Owners (Sergey + Ivan, both Owners since 2026-01-04 / 2026-03-10) + Security Defaults MFA-on-sign-in. No PIM, no backup SP (Entra ID P2 not adopted for infra — see §7; the proposed backup SP was cancelled when state-verification showed the 2-Owner rollback already existed — see §6.2 step 12).
- Cloud-portable by construction: WorkOS + GitHub + Cloudflare are cloud-agnostic; Azure RBAC is the only cloud-specific config, and its AWS-equivalent mapping is summarized in §1.
- Phase 0 (precondition): rewrite dev-provider.ts to drop CF-Access-JWT reads, add CI grep gate, add WORKOS_AUTHKIT_DOMAIN cross-env startup assertion + test.
Consequences
Positive
- One coherent identity model covering UI, API, and infra — replaces the ad-hoc, surface-by-surface state that produced the 2026-05-11 reversal.
- Zero net new IdPs to operate; everything below uses what SV0 already runs.
- Tier-1.5 + Tier-3 dual-Owner rollback close two real SPOFs that the spike-era setup left implicit (the 2-Owner state already exists; the original ADR text claiming "Tier-3 = Ivan only" was stale).
- Cloud portability preserved end-to-end (only L3 config changes on AWS migration).
- Explicit #audit-prod-staff-writes Slack channel + quarterly review give a SOC2-prep audit trail without adding anomaly-detection complexity that's premature at current scale.
Negative / accepted residual risk
- Tier-3 direct (non-PIM) Azure Bastion role for CF-Access-down recovery is a narrow persistent SPOF (Bastion-reader on one Bastion host, tier-3-only). Accepted as a net-positive trade vs. the CF-fronted-PIM deadlock alternative.
- §11.3 two-Claude-session convergence is a thin epistemic guard (same model + same training cutoff + similar context can converge on a wrong answer). Re-litigated when team scales past 5 staff.
- 1Password is the credential vector for the existing sv0-azure-break-glass SP (rg-sv0-prod Contributor), but the Tier-3 subscription-Owner SPOF concern doesn't apply: Sergey + Ivan are both Active Owners with independent Entra accounts and independent MFA devices. Mutual recovery between operators is the in-place rollback. Microsoft support tenant-root reset (RTO: days) remains the residual fallback only for the joint-loss-of-both-Owner-accounts scenario, which is acceptable.
- Backup FIDO2 keys are operator-managed without a physical-safe requirement (per Ivan's pre-client simplicity preference). Acceptable at 1-2 operator scale; revisit at team-of-5 or first compliance ask.
- Anomaly detection for super-admin actions is deferred to first compliance ask or 5+ staff.
What changes downstream
- ADR-022 §5c.1 needs amendment to swap "Entra IdP at CF Access" for "GitHub IdP at CF Access" (tracked as §6.4 step 19).
- Runbook 12 phases adopt the four-tier SSH model and the dual-Owner rollback procedure (no PIM, no backup SP).
- Phase 0 implementation lands in sv0-platform (dev-provider.ts rewrite + CI grep gate + env.ts iss assertion + test).
- Phase 3a implementation lands in sv0-infrastructure (CF Access SSH on dev-azure-ssh.securityv0.com, Tier-1.5 emergency key + cloud-init wiring, check-cf-service-tokens.yml GHA, direct Bastion role). Dual-Owner state at Tier-3 was already in place; no Phase 3a-4 Azure provisioning required.
1. Executive summary
Three identity providers, each the source of truth for one thing:
- GitHub (SecurityV0 org): source of truth for SV0 staff identity at the network perimeter (Cloudflare Access). Every SV0 staff member has a GitHub account in the org. MFA enforced at GitHub.
- WorkOS (AuthKit): source of truth for platform user accounts (Layer 2 inside the platform). Federates Google Workspace (@securityv0.com), Magic Link, OTP. Customer users and staff-as-product-users authenticate here.
- Entra ID (Azure default directory): source of truth for Azure resource access only. Holds a small fixed sv0-vm-emergency-ops tier-2 emergency operator group + the tier-3 subscription owner accounts (Sergey + Ivan). Not a staff identity store.
Three layers, each answering a different question:
| Layer | Question | Source of truth | Surface |
|---|---|---|---|
| L1 — Network perimeter | Is this an SV0 staff member who's allowed to reach this URL/port at all? | GitHub (org membership) | Cloudflare Access |
| L2 — Application | Is this a platform user (customer or staff-as-product-user) with a valid session? | WorkOS | Inside the platform |
| L3 — Cloud resource RBAC | Is this Azure principal allowed to call this Azure API? | Entra (Azure roles, Active assignments + Security Defaults MFA-on-sign-in) | Azure portal / CLI / Serial Console |
Net new IdPs vs today: zero.
Cloud portability: WorkOS + GitHub + Cloudflare layers are cloud-agnostic. Only L3 RBAC (Azure roles, Entra groups, Security Defaults) is Azure-specific; on AWS migration it maps to IAM + IAM Identity Center + an IAM Identity Center MFA policy. Detailed sketch lives in .scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v3.md; promote to a sub-section here if/when AWS is actually funded. Speculative content removed from this ADR per CEO scope review 2026-05-13.
2. Audiences
| Audience | Examples | Auth path |
|---|---|---|
| SV0 staff | Engineers, ops, founders | GitHub at L1 + WorkOS at L2 |
| SV0 staff doing emergency ops | Tier-2 emergency operators (1–3 manually-provisioned Entra members) | + Entra account (Active role, Security Defaults MFA-on-sign-in) for Serial Console |
| SV0 subscription owners | Tier-3 — today Sergey (original Owner since 2026-01-04) and Ivan (Owner since 2026-03-10) | + Entra account with Active Owner role (Security Defaults MFA-on-sign-in, browser session 8h). The other Owner is the account-lockout rollback (§6.2). |
| Customer users | Customer admin/analyst logging into SV0 platform | WorkOS at L2 only (no L1 — prod has no perimeter) |
| Service / agent / CI | TFC runs, connector pipelines, internal bots | OIDC federation (Azure), M2M tokens (WorkOS, CF service tokens) |
3. Surface decision matrix
3.1 Customer-facing portal (UI)
| Surface | L1 perimeter | L2 application | L3 resource |
|---|---|---|---|
| app.securityv0.com (prod portal) | — (public DNS, no CF Access) | WorkOS prod env (Google + Magic Link + OTP) | n/a |
| staging.securityv0.com | — (matches prod's posture so E2E auth tests are meaningful) | WorkOS staging env (separate org, distinct JWKS + audience) | n/a |
Customers need to reach prod/staging URLs — CF Access in front would require provisioning every customer in CF Access, bypassing WorkOS. Prod's only door is L2.
3.2 Internal portals (UI)
| Surface | L1 perimeter | L2 application | L3 resource |
|---|---|---|---|
| dev.securityv0.com (Hetzner today, Azure later) | CF Access + GitHub IdP | WorkOS | n/a |
| dev-azure.securityv0.com (spike) | CF Access + GitHub IdP | WorkOS | n/a |
| pr-N-dev.securityv0.com (PR previews) | CF Access + GitHub IdP | WorkOS | n/a |
L1 keeps unfinished builds invisible to the world; L2 means even L1-authenticated visitors need a real WorkOS session. CF Access policy at L1 is GitHub org membership in SecurityV0. Never parallel email allowlists.
3.3 API access (programmatic + CLI)
| Surface | L1 perimeter | L2 application |
|---|---|---|
| app.securityv0.com/api/* (prod) | — | WorkOS bearer (session cookie OR M2M token) |
| staging.securityv0.com/api/* | — | WorkOS staging bearer |
| dev.*.securityv0.com/api/* | CF Access (inherits from URL) | WorkOS bearer |
| Internal CLI scripts hitting prod/staging | (perimeter inherits) | WorkOS device_code flow → bearer |
| Internal CLI scripts hitting dev/dev-azure/preview | CF Access service token (§4a) OR human GitHub flow | WorkOS bearer |
| Customer-tenant API consumer | — | WorkOS M2M token (per-tenant) |
Hard rule: application code MUST NOT read Cf-Access-Jwt-Assertion to derive identity. App identity is always WorkOS-derived. L1 is a network-reachability gate, not an identity signal the app trusts. (See §5 Rule #1 — Phase 0 resolves the legacy dev-provider violation.)
3.4 Infrastructure access
Four tiers, each with a distinct mechanism.
| Tier | Use case | Mechanism | Identity | Lifetime |
|---|---|---|---|---|
| Tier-1 SSH | Routine operator SSH to a VM | Cloudflare Access SSH in front of port 22, GitHub IdP, CF SSH CA short-lived certs | GitHub user (must be in SecurityV0 org) | ~1h per session |
| Tier-1.5 emergency SSH | CTO/CEO-level break-glass: cloud-init broke, Serial Console unreachable, or CF Access SSH degraded | Per-VM ed25519 key, sv0emergency user (no sudo, read-only /var/log, single sv0-rescue script) | 1Password-stored private key | Until per-VM key is rotated (on next redeploy) |
| Tier-2 emergency console | Network/SSH itself is broken — last-resort | Azure portal Serial Console, gated by custom sv0-serial-console-operator role on sv0-vm-emergency-ops Entra group | Entra account, manually provisioned, 1–3 members | Per-session, Azure-audited |
| Tier-3 subscription owner | Subscription-Owner-level ops: bootstrap, RBAC, policy edits | Azure portal / CLI, Entra account, Active assignment + Security Defaults MFA-on-sign-in. Dual-Owner (Sergey + Ivan, the other Owner is the account-lockout rollback). | Tier-3 operators' Entra accounts (Sergey since 2026-01-04, Ivan since 2026-03-10) | Per-session (browser 8h, MFA-on-sign-in) |
3.4.1 Tier-1.5 per-VM emergency key — scope and constraints
The Tier-1.5 key is not "SSH keys for daily ops" — it's a narrowly-scoped fourth route in when the first three fail. Per Ivan's 2026-05-12 decision, accepted only under these constraints, all enforced at provisioning time:
- Key is NOT in authorized_keys for sv0admin (routine-ops account). Only sv0emergency accepts it.
- sv0emergency has:
  - No sudo, not in wheel or sudo groups, no sudoers entry.
  - Read-only access to /var/log.
  - Permission to run exactly one script: /usr/local/bin/sv0-rescue, which:
    - Writes local diagnostics to disk first (timestamped tarball: last 1000 syslog lines, journalctl tail, docker ps, df -h) at /var/lib/sv0-rescue/$(date +%s).tar.gz. This is the primary action, never deferred.
    - Then attempts to post an audit-trail webhook to a CF Audit endpoint with a 5-second timeout. On failure, logs to stderr (loud warning) and continues. Audit completeness is achieved by weekly reconciliation of local tarballs against received webhooks (see §3.4.4).
  - No shell init, no PATH access to user binaries beyond sv0-rescue.
- Key generated per-VM at TF apply-time, stored in 1Password sv0-infra vault as item vm-emergency-<vm-name>. Storage path is out-of-band (see §3.4.5).
- Rotated automatically on every VM redeploy.
- Accessible only to CTO/CEO-level operators (today: Ivan + Sergey).
This satisfies the spirit of Rule #3: no long-lived authorized_keys for routine ops; emergency-only key with narrow blast radius and tight rotation.
3.4.2 Tier-1 SSH via Cloudflare Access — concrete shape
- Separate hostname per VM environment, not the URL hostname (Cloudflare enforces domain-uniqueness across app types). Hostname must be depth-1 from the apex (single label under securityv0.com): Cloudflare Free's Universal SSL covers only one wildcard level, so a depth-2 name like ssh.dev-azure.securityv0.com will TLS-fail at the edge with handshake_failure. Use a depth-1 pattern like dev-azure-ssh.securityv0.com for the spike VM, staging-ssh.securityv0.com for staging, etc. (PR #35 + sv0-infrastructure issue #38 confirmed the depth-2 failure live; the pattern is also called out in the operator memory project_cf_universal_ssl_one_level.)
- DNS CNAME for that hostname → the Cloudflare Tunnel.
- Tunnel ingress rule routing SSH traffic to ssh://localhost:22 on the VM.
- Cloudflare Access app of type=ssh on that hostname, GitHub IdP, auto_redirect_to_identity=true, allow-list filtered by GitHub org membership.
- MFA enforcement is upstream at GitHub's org-policy (require-2FA), not via CF Access require { auth_method = "mfa" }. Empirically confirmed 2026-05-12: CF Access's IdP-based MFA require reads the OIDC amr claim and is only supported for Okta, Microsoft Entra ID, Generic OIDC, Generic SAML 2.0. GitHub OAuth (which the GitHub IdP uses) does not emit amr, so the require is structurally unsatisfiable: it denies all authentications rather than enforcing MFA. The replacement path (CF Access independent MFA: TOTP/WebAuthn at the application layer, IdP-agnostic) is tracked in sv0-infrastructure#36. Until that lands, MFA is enforced only at GitHub's session layer.
- Per-app SSH CA managed via Terraform (cloudflare_zero_trust_access_short_lived_certificate). Public key rendered into cloud-init at apply time; no runtime fetch.
- VM sshd configured: TrustedUserCAKeys /etc/ssh/cloudflare_ca.pub, AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u.
- Cert principal format: CF Access SSH emits the user's email local-part (not full email, not GitHub login) as the cert Principals field, with the full email in Key ID. The principals file (/etc/ssh/auth_principals/sv0admin) must contain email local-parts to match. Empirically confirmed 2026-05-12 via ssh-keygen -L on a live cert (Principals=ifofanov, Key ID=ifofanov@securityv0.com).
- CF-issued cert validity is 4 minutes. The operator ssh_config's Match host ... exec "cloudflared access ssh-gen --hostname %h" re-mints the cert on every connection. If ssh-gen fails (e.g., CF Access degraded), ssh fails closed with publickey denied, with no silent fallback to a long-lived key. See §3.4.4 for the CF-down recovery path.
- Operator connects: cloudflared access ssh --hostname dev-azure-ssh.securityv0.com. Full ssh_config block per cloudflared access ssh-config --hostname ... --short-lived-cert; must include IdentitiesOnly yes + IdentityAgent none so a local SSH agent (1Password etc.) doesn't shadow the CF cert.
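Two of the constraints above (the depth-1 hostname rule and the email-local-part principal format) are easy to get wrong in review, so a minimal sketch of both checks follows. The function names are hypothetical and not taken from the SV0 codebase; the principals-file parsing assumes one bare principal per line and ignores any per-line key options.

```typescript
// Hypothetical helpers for illustration; names are not from the SV0 repos.

/** True when `hostname` sits exactly one label below `apex` (e.g. x.securityv0.com). */
function isDepthOneHostname(hostname: string, apex: string): boolean {
  const host = hostname.toLowerCase();
  const suffix = "." + apex.toLowerCase();
  if (host.length <= suffix.length) return false;
  if (host.slice(-suffix.length) !== suffix) return false;
  const label = host.slice(0, host.length - suffix.length);
  return label.indexOf(".") === -1; // a dot here means depth-2 or deeper
}

/** CF Access SSH puts the email local-part (not the full email) in Principals. */
function principalFromEmail(email: string): string {
  const at = email.indexOf("@");
  if (at <= 0) throw new Error(`not an email address: ${email}`);
  return email.slice(0, at);
}

/**
 * Checks a principal against the text of an AuthorizedPrincipalsFile.
 * Assumes one bare principal per line; blank lines and '#' comments are skipped.
 */
function isPrincipalAllowed(principalsFile: string, principal: string): boolean {
  const entries = principalsFile
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && line.charAt(0) !== "#");
  return entries.indexOf(principal) !== -1;
}
```

With these, a principals file containing the full email would be flagged as a mismatch, which is the failure the 2026-05-12 ssh-keygen -L check uncovered.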
3.4.3 Why not Azure AD SSH login (az ssh vm)?
Considered, rejected: adds Entra dependency on every operator (SV0 staff aren't in Entra); requires aadsshlogin-extension; doesn't compose with Cloudflare Tunnel (no public IP, needs Azure Bastion → more cost); GitHub already covers everyone; not cloud-portable to AWS.
3.4.4 When Cloudflare Access is down
CF Access has had outages (most recently 2026-02-13, ~2h partial control-plane). During such windows, every L1-gated surface is unreachable — including Tier-1 SSH.
The CF-independent fallback path:
- Confirm CF Access is down (not local network): curl -fsSL https://api.cloudflare.com/client/v4/zones | head. Repeated 5xx or timeouts → it's CF.
- portal.azure.com works: it's Microsoft-hosted, not CF-fronted. Tier-3 Owner can reach Entra portal as normal (Owner is Active, no activation step).
- Tier-1.5 emergency key is unreachable via cloudflared (tunnel needs CF control plane to reconnect). Use Azure Bastion via a direct (non-PIM) Bastion-reader role assignment; this is the CF-independent transport. The direct role is provisioned for tier-3 only and stays Active (not PIM-eligible) so it remains usable during CF outages.
- If neither Bastion nor Tier-1.5 work, fall to Tier-2 Serial Console (Azure portal, CF-independent).
- cloudflared reconnect is best-effort during CF control-plane outages; do not rely on tunnel availability.
Note: sv0-rescue writes local diagnostics first regardless of CF reachability (per §3.4.1). The CF Audit webhook is best-effort; reconciliation runs weekly to catch missing entries.
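The weekly reconciliation mentioned here amounts to a set difference between local tarballs and received webhooks. A minimal sketch follows, assuming each webhook records the same epoch-seconds value that sv0-rescue used in the tarball name; that correlation key and both function names are assumptions, and only the /var/lib/sv0-rescue/<epoch>.tar.gz layout comes from §3.4.1.

```typescript
// Hypothetical reconciliation sketch; not the actual SV0 job.

/** Extracts the epoch-seconds stamp from a tarball path, or null if the name does not match. */
function rescueRunEpoch(path: string): number | null {
  const match = /(?:^|\/)(\d+)\.tar\.gz$/.exec(path);
  return match ? Number(match[1]) : null;
}

/**
 * Returns the epochs of local rescue runs that have no matching audit webhook.
 * `receivedEpochs` is assumed to hold the epoch each received webhook reported.
 */
function findUnreportedRuns(tarballPaths: string[], receivedEpochs: number[]): number[] {
  const missing: number[] = [];
  for (const path of tarballPaths) {
    const epoch = rescueRunEpoch(path);
    if (epoch === null) continue; // not a rescue tarball, ignore
    if (receivedEpochs.indexOf(epoch) === -1) missing.push(epoch);
  }
  return missing.sort((a, b) => a - b);
}
```

Any epoch returned is a rescue run whose webhook never arrived, for example because it ran during a CF outage, and would need a manual audit entry.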
Drills: quarterly per-operator exercise — run through CF-down procedure on a test VM. First time you need this for real is the wrong time to learn it.
3.4.5 Tier-1.5 key storage — out-of-band 1Password write
Decision: out-of-band write, not the 1Password CLI provider at TF apply-time. Rationale: coupling TFC's apply identity to 1Password write access is sensitive (compromise blast radius = vault write). Out-of-band keeps blast narrow.
Enforcement: post-apply CI check verifies that 1Password contains an item vm-emergency-<vm-name> whose created_at >= apply_completed_at. CI gates merge of the apply-PR on this check passing. This makes "rotated automatically on every VM redeploy" enforced, not aspirational.
Verifier credentials: a 1Password service account scoped read-only to items matching vm-emergency-* in the sv0-infra vault, stored as GHA secret OP_VM_EMERGENCY_VERIFIER. Rotated yearly (calendar reminder, same cadence as §4a service tokens). Read-only and item-prefix-scoped → minimal blast radius even if leaked. The service account has no write or delete capability anywhere. Tracked in 1Password under item op-service-account-vm-emergency-verifier.
Implementation note: the apply itself writes the public key to the VM's cloud-init. A subsequent step (manual or GHA workflow with operator OP_TOKEN at PR-author request time) uses 1Password CLI with operator credentials to write the private key. The CI verification step uses the read-only service account above to verify the write happened. Write and read are separate identities by design.
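The freshness rule in the enforcement paragraph (created_at >= apply_completed_at) can be sketched as a pure check over item metadata. The VaultItem shape and function name are assumptions for illustration; how the verifier lists items from 1Password is out of scope here.

```typescript
// Hypothetical shape of the metadata the read-only verifier would see.
interface VaultItem {
  title: string;
  createdAt: string; // ISO-8601 timestamp (assumed format)
}

/**
 * True when the vault holds vm-emergency-<vmName> created at or after the apply finished.
 * Unparseable item timestamps compare as false, so the check fails closed.
 */
function isEmergencyKeyFresh(items: VaultItem[], vmName: string, applyCompletedAt: string): boolean {
  const applyTime = Date.parse(applyCompletedAt);
  if (isNaN(applyTime)) throw new Error(`invalid apply timestamp: ${applyCompletedAt}`);
  const expectedTitle = `vm-emergency-${vmName}`;
  return items.some(
    (item) => item.title === expectedTitle && Date.parse(item.createdAt) >= applyTime
  );
}
```

A stale item left over from the previous deploy fails this check, which is what turns "rotated on every redeploy" into a merge gate.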
3.5 Other admin surfaces
| Surface | Auth |
|---|---|
| github.com/SecurityV0/* | GitHub login (MFA enforced via org policy) |
| dash.cloudflare.com | Cloudflare account login (federated to Google Workspace @securityv0.com) |
| portal.azure.com (subscription) | Tier-3 subscription owners (Sergey's + Ivan's Entra): Active Owner, Security Defaults MFA-on-sign-in |
| MongoDB Atlas console | Atlas login (federated to Google Workspace) |
| HCP Terraform UI | HashiCorp Cloud login (federated to GitHub) |
| 1Password (secret vault) | 1Password account, MFA enforced |
3.6 Staff access to prod — audit trail
Prod has no L1 perimeter (customers need to reach it), so staff and customers use the same L2 (WorkOS). The staff/customer boundary lives entirely in detection.
| Event | Action |
|---|---|
| Super-admin (staff with WORKOS_SUPER_ADMIN_ORG_ID membership) authenticates against prod | Standard WorkOS audit log, retained 90d |
| Super-admin writes to a customer tenant | Slack notification to #audit-prod-staff-writes: actor + tenant + endpoint + timestamp |
| Super-admin reads customer tenant data | Tail-aggregated daily, posted to #audit-prod-staff-reads as rollup |
| WORKOS_SUPER_ADMIN_ORG_ID membership change | Notified to #audit-workos-membership |
#audit-* Slack channels are the audit record, not the alerting mechanism. PagerDuty/Opsgenie pages when a super-admin acts outside business hours (configurable).
Anomaly detection deferred. Metrics-based detection (super-admin writes per hour > N, or writes to > M distinct tenants in a window) is explicitly deferred to first compliance ask or 5+ staff. At current scale (1-2 operators), the Slack rollup + quarterly review by Sergey/Ivan is the control. Re-evaluate when the team grows or a customer audit requires it.
Policy (start weak, tighten as the team grows):
- Routine staff prod access should reference a paired ticket or customer support escalation. The audit channels make this enforceable retroactively.
- Production write actions by staff must be reproducible from non-staff API endpoints. If not (one-off Mongo edit), file a follow-up to make them reproducible.
3.7 Account recovery
When a staff member loses GitHub access (locked, MFA device lost):
| Capability | Status during recovery |
|---|---|
| dev.*/pr-*-dev.* URLs | Blocked (L1 GitHub gate) |
| Tier-1 SSH | Blocked (CF Access SSH uses GitHub IdP) |
| app.securityv0.com, staging.securityv0.com | Works (L2 WorkOS uses Google Workspace, independent of GitHub) |
| GitHub SecurityV0/* repos | Blocked |
| Customer support work (read-only on prod) | Works |
| Serial Console | Works if user is a tier-2 emergency operator (Entra independent of GitHub) |
Expected GitHub recovery TAT: 2–5 business days for fully-locked account; minutes-to-hours for self-serve MFA backup-code path.
Backup FIDO2 setup (pragmatic, simple):
- Each tier-2 emergency operator registers a backup FIDO2 key with both GitHub and Entra.
- The backup key is kept physically separate from daily-carry items (not on the same keyring as the daily MFA device) — the operator chooses where.
- The FIDO2 PIN is memorized, not stored anywhere.
- Yearly check that the backup FIDO2 still authenticates against GitHub + Entra (5-min self-test, calendar reminder).
- No physical-safe requirement, no per-location restrictions — operators work from any location.
Accepted residual risk: if the backup FIDO2 is lost or compromised at the same time as the daily MFA device, recovery falls to the GitHub support flow (2-5 business days). This is acceptable at current scale.
4. Service / machine identities
| Use case | Mechanism | Notes |
|---|---|---|
| TFC plans/applies → Azure | OIDC federation (per-workspace SP) | No long-lived secrets |
| TFC writes state backup to Azure Storage | OIDC federation | Same SP as apply |
| Bootstrap apply → Azure | Operator's az login (Tier-3 Active Owner — Sergey or Ivan) | Local-apply per ADR-022 §7; migration to TFC: sv0-infrastructure#29 |
| Connectors reading source systems | Per-connector API key, stored in source system | Read-only |
| CI runs (GitHub Actions → external) | GitHub OIDC where supported; PAT otherwise (read-scoped) | |
| Internal agents (Claude Code) hitting platform API | WorkOS M2M client per agent (delegated_agent kind) | Memory project_auth_principal_model_locked |
| Internal scripts hitting prod/staging | WorkOS device_code flow → short-lived bearer | |
| Internal scripts hitting dev URLs (must pass CF Access) | CF Access service token (§4a) OR cloudflared with operator's GitHub identity | |
4a. CF Access service tokens — policy
- One token per script/workflow, named cf-access-st-<purpose> (e.g., cf-access-st-seed-demo).
- Stored in 1Password sv0-infra vault with prefix cf-access-st-. Each 1Password item has an expires_at field set to issue-date + 90d. Item references where the consumer stores it (TFC variable name, GH Actions secret name).
- Scope-per-app: each token bound to exactly one CF Access app. Wildcards forbidden.
- Rotation enforced by automation: scheduled GHA workflow check-cf-service-tokens.yml (weekly) reads 1Password via OP CLI, lists tokens past expires_at, opens an issue assigned to the token's owner, fails CI if any token is >120d old (hard cap). Manual rotation only: automated rotation is non-trivial and not worth the complexity at this scale.
- Fail loud on missing config: every consumer asserts both CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET non-empty at startup, hard-exit if missing. Per Rule #6.
- Leak detection: GitHub secret scanning patterns for CF_ACCESS_CLIENT_SECRET.
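The rotation thresholds above (expires_at at issue-date + 90d, hard cap at 120d) can be sketched as a classification step that a workflow like check-cf-service-tokens.yml might run per token. The ServiceToken shape, the issuedAt field, and the function name are assumptions; the policy itself stores expires_at, so this sketch derives the same boundary from the issue date instead.

```typescript
// Hypothetical token metadata; the real 1Password item layout may differ.
interface ServiceToken {
  name: string; // expected to start with "cf-access-st-"
  issuedAt: string; // ISO-8601 issue date (assumed field)
}

type TokenStatus = "ok" | "rotation-due" | "hard-cap-exceeded";

const DAY_MS = 24 * 60 * 60 * 1000;

/** Classifies a token by age: past 90d opens an issue, past 120d fails CI. */
function classifyToken(token: ServiceToken, now: Date): TokenStatus {
  if (token.name.indexOf("cf-access-st-") !== 0) {
    throw new Error(`token name lacks cf-access-st- prefix: ${token.name}`);
  }
  const ageDays = (now.getTime() - Date.parse(token.issuedAt)) / DAY_MS;
  if (isNaN(ageDays)) throw new Error(`invalid issue date for ${token.name}`);
  if (ageDays > 120) return "hard-cap-exceeded";
  if (ageDays > 90) return "rotation-due";
  return "ok";
}
```

Throwing on a malformed name or date follows the fail-loud rule: a token the checker cannot evaluate should not pass silently.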
4b. Session lifetimes (target)
| Token / session | TTL | Reason |
|---|---|---|
| WorkOS staff session | 8h | One workday; daily re-auth via Google Workspace |
| WorkOS customer session | 30d | Customer UX expectation; refreshed on activity |
| CF Access dev session (browser) | 8h | Matches staff workday |
| CF Access SSH cert | 1h | Per-session re-auth is the value of CF Access SSH |
| Azure Owner Entra session | 8h (browser default) | MFA-on-sign-in via Security Defaults. PIM (eligible/JIT) NOT adopted — see §7. |
| CF Access service token | None (rotate every 90d, hard cap 120d) | Long-lived by design; rotation is the control |
| GitHub PATs | 90d max | Per GitHub org policy |
5. Hard rules
Non-negotiable invariants. Changes require the §11 amendment process.
1. App code never reads Cf-Access-Jwt-Assertion to derive user identity or grant access. App identity is WorkOS-only. CF Access is a network gate, not an identity signal the app trusts.
   - Exception (legacy): src/api/auth/providers/dev-provider.ts currently reads CF Access JWT. Phase 0 (§6.0) rewrites it to use a hardcoded identity. Phase 0 PR merge is a hard precondition for ADR-023 promotion. CI grep gate enforces zero cf-access-* reads in src/ and ui/src/ after Phase 0.
2. No parallel allow-lists at the same layer. Each surface's CF Access policy uses exactly one signal: GitHub org membership. No "GitHub OR email-in-list" fallbacks.
3. No long-lived authorized_keys for routine operator SSH. Tier-1 SSH = CF Access SSH CA. Tier-1.5 per-VM emergency keys are narrow-scoped (§3.4.1) and explicitly NOT routine ops.
4. Tier-2 emergency access is Entra + Azure RBAC only, via the custom sv0-serial-console-operator role on the fixed sv0-vm-emergency-ops group. Never federated to GitHub or WorkOS.
5. Service identities are always explicit. No human-tied tokens used by services. CI doesn't reuse a developer's GitHub token; agents don't reuse a developer's WorkOS session.
6. Fail loud on missing config. A missing secret → hard exit at boot or in the deploy fail-closed check. No silent fallback to a less-secure path.
   - Corollary 6a: WorkOS membership and super-admin signals MUST NOT use cached fallback on lookup failure. Fail the request (503), do not degrade to last-known. Required test in test/api/auth/.
7. One source of truth per identity domain. GitHub = staff at perimeter. WorkOS = product users at app. Entra = Azure RBAC only. Never federate one into another for the sake of "single sign-on" if it adds an indirection.
8. WorkOS environment isolation. Prod and staging use separate WorkOS environments with distinct JWKS endpoints and audience claims. Never shared org IDs across envs.
   - Startup assertion (Phase 0): src/shared/config/env.ts asserts on NODE_ENV=production that WORKOS_AUTHKIT_DOMAIN matches a prod-allowlist regex (or doesn't match a staging-denylist). Hard-exit with a named error otherwise. Symmetric assertion on staging.
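The Rule #8 startup assertion could look roughly like the sketch below. This is not the contents of src/shared/config/env.ts: the function name, the error name, and both regexes are assumptions. The denylist substrings follow §6.0, and the staging-side check (domain must contain "staging") is one guess at what "symmetric" means.

```typescript
// Sketch under stated assumptions; not the actual env.ts implementation.

const NON_PROD_MARKERS = /staging|test|dev/i; // denylist substrings named in §6.0
const STAGING_MARKER = /staging/i; // assumed form of the symmetric staging check

function workosEnvMismatch(message: string): Error {
  const err = new Error(message);
  err.name = "WorkosEnvMismatchError"; // named so the boot log states the cause
  return err;
}

/**
 * Throws when WORKOS_AUTHKIT_DOMAIN does not belong to the running environment.
 * The caller is expected to let this propagate and exit non-zero at boot.
 */
function assertWorkosDomainMatchesEnv(nodeEnv: string, authkitDomain: string): void {
  if (authkitDomain.trim() === "") {
    throw workosEnvMismatch("WORKOS_AUTHKIT_DOMAIN is empty");
  }
  if (nodeEnv === "production" && NON_PROD_MARKERS.test(authkitDomain)) {
    throw workosEnvMismatch(`production is pointed at a non-prod AuthKit domain: ${authkitDomain}`);
  }
  if (nodeEnv === "staging" && !STAGING_MARKER.test(authkitDomain)) {
    throw workosEnvMismatch(`staging is pointed at a non-staging AuthKit domain: ${authkitDomain}`);
  }
}
```

A substring denylist can misfire on a legitimate prod domain that happens to contain "dev" or "test", which is one reason the rule also allows a prod-allowlist regex instead.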
6. What changes from today
6.0 Phase 0 — must land before ADR-023 merges
Phase 0 is a hard precondition: ADR-023's "Status" header will reference the Phase 0 PR's merge SHA. ADR-023 does not merge until Phase 0 is green.
1. Rewrite src/api/auth/providers/dev-provider.ts to drop the CF-Access-JWT path entirely. Provider returns DEV_USER unconditionally. No verifyCfAccessJwt call.
2. Add CI grep gate: grep -rE "cf-access-jwt-assertion|Cf-Access-Jwt-Assertion" src/ ui/src/ returns zero matches. Fail CI step otherwise.
3. Audit the codebase for similar silent-fallback patterns and remove any found (focus on || '', ?? '', if (!secret) against sensitive vars).
4. Implement Rule #8 startup assertion in src/shared/config/env.ts: on NODE_ENV=production, assert WORKOS_AUTHKIT_DOMAIN doesn't match staging-denylist regex (e.g., must not contain staging, test, dev substrings); symmetric assertion on staging. Hard-exit on misconfig. Test in test/shared/config/ proves misconfig → hard exit, not silent degradation.
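For teams that prefer the grep gate as a unit-testable function rather than a shell step, the same check can be sketched as below. The function name and the in-memory file map are assumptions; the pattern is case-insensitive, so it is slightly stricter than the two spellings in the grep expression.

```typescript
// Hypothetical in-process equivalent of the CI grep gate; not the actual gate.

interface Violation {
  file: string;
  line: number; // 1-based line number
  text: string;
}

const FORBIDDEN_HEADER = /cf-access-jwt-assertion/i;

/**
 * Scans source text keyed by path and returns every line that mentions the header.
 * The caller decides which paths to pass in (src/ and ui/src/ per Rule #1).
 */
function findCfAccessReads(sources: { [path: string]: string }): Violation[] {
  const violations: Violation[] = [];
  for (const file of Object.keys(sources).sort()) {
    sources[file].split("\n").forEach((text, index) => {
      if (FORBIDDEN_HEADER.test(text)) {
        violations.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return violations;
}
```

A CI step would fail whenever the returned array is non-empty, mirroring the "zero matches" requirement.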
6.1 Immediate corrections (the dev-azure spike, this week)
- Delete the cf_entra_idp_id workspace variable on sv0-dev OR set it to "". The Terraform local.cf_idp_id then falls through to the GitHub IdP fallback. Re-trigger the sv0-dev apply.
- PR: remove the dual-IdP conditional from envs/dev/main.tf. Replace var.cf_entra_idp_id != "" ? var.cf_entra_idp_id : "45cdd3b1-..." with the GitHub IdP ID as a named local (no conditional). Per Rule #6 / Rule #2.
- Disable Entra IdP at CF Access (already missing per 2026-05-12 diagnostic; confirm via dashboard). Azure App Registration "Cloudflare Access" stays quiescent; cleanup is deferred.
- Close sv0-infrastructure#27 with reversal note.
- Offboarding runbook (place in sv0-documentation/docs/runbooks/): when a staff member leaves SV0, in order:
  - Revoke their CF Access user sessions: POST /accounts/{id}/access/organizations/revoke_user with their email.
  - Remove from GitHub SecurityV0 org.
  - Verify by attempting cloudflared access login from a test machine with the offboarded identity (should be denied).
  - Remove from WorkOS organizations they were in.
  - Remove from any sv0-vm-emergency-ops Entra group membership.
  - Audit the offboarded user's last 30d activity in WorkOS, CF Access, GitHub.

  Max time-to-revocation: minutes if procedure followed; hours otherwise.
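The first offboarding step above, revoking CF Access sessions, can be sketched as a request builder. This is an illustration only: the helper name and the RevokeRequest shape are hypothetical, and the base URL, bearer-token header, and JSON body with an email field reflect the public Cloudflare v4 API as understood here, so they should be checked against the current API reference before use. The builder deliberately does not send anything.

```typescript
// Hypothetical helper: builds (does not send) the session-revocation request.
interface RevokeRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const CF_API_BASE = "https://api.cloudflare.com/client/v4";

function buildRevokeUserRequest(accountId: string, email: string, apiToken: string): RevokeRequest {
  // Fail loud on empty inputs rather than sending a malformed revocation.
  if (!accountId || !email || !apiToken) {
    throw new Error("accountId, email and apiToken are all required");
  }
  return {
    url: `${CF_API_BASE}/accounts/${encodeURIComponent(accountId)}/access/organizations/revoke_user`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ email }),
  };
}
```

A runbook script could pass the result to fetch as `fetch(req.url, req)` and then proceed to the remaining steps only after a successful response, which keeps the "in order" property of the procedure explicit.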
6.2 Phase 3a (next sprint)
- CF Access SSH for the dev-azure VM on dev-azure-ssh.securityv0.com per §3.4.2 (depth-1 hostname; see the Universal SSL constraint there). DNS CNAME + tunnel ingress for ssh://localhost:22 + CF Access app type=ssh with GitHub IdP (MFA enforced upstream at GitHub org policy, not via CF auth_method=mfa; see §3.4.2) + cloud-init configures sshd to trust the per-app CF SSH CA + principals file (email local-parts).
- Tier-1.5 per-VM emergency key for the dev-azure VM per §3.4.1. Cloud-init creates the sv0emergency user with no sudo, writes the per-VM ed25519 public key, and installs /usr/local/bin/sv0-rescue (which writes local diagnostics first, then a best-effort CF webhook). Private key stored out-of-band in 1Password as vm-emergency-vm-sv0-dev-1. Post-apply CI check verifies the item exists and is fresh (§3.4.5).
- Tier-3 Owner: Active dual-Owner + Security Defaults MFA (no PIM, no backup SP).

  Amended 2026-05-13 (no-PIM revision + CEO scope review + state-verification correction). The premise that "Tier-3 = Ivan only" was a documentation-staleness bug: az role assignment list verifies that Sergey has been subscription Owner since 2026-01-04 (original Owner; created the subscription) and Ivan was added 2026-03-10. The 2-human Owner rollback has been in place for 2+ months. The original PIM-eligibility design + the proposed sv0-azure-backup-owner bridge SP were both solving a non-existent SPOF. The bridge is cancelled (PR #57 closed). The recovery-credentials design patterns from the cancelled work are banked in docs/patterns/recovery-credentials.md for any future scenario where a recovery SP is genuinely warranted.
  - Verify Security Defaults is enabled on the tenant (free-tier policy that enforces MFA on az login / portal.azure.com / ARM API via the ARM MFA-required policy). Read via portal: Entra → Properties → Manage Security Defaults. This is the free-tier replacement for PIM's MFA-on-activate enforcement; without it, Tier-3 Owners have no MFA gate.
  - The 2-human Owner setup IS the rollback. Subscription Owner assignments (verified 2026-05-13):
    - Sergey (098551cd-0071-4408-846a-961c35da98a4): Owner since 2026-01-04
    - Ivan (a38b998e-b2f4-4e73-ac3d-370da0b0a1da): Owner since 2026-03-10

    Either operator can re-provision the other's role assignment in a lockout scenario. No backup SP, no Microsoft-support-RTO concern.

  Hard rule (no exception): the 2-human Owner state above is preserved until ≥3 humans exist OR a documented superseding design is in place. Neither Owner is removable without an explicit migration plan.

  What we explicitly do not get without PIM (accepted tradeoff for not paying for P2):
  - No JIT activation window: Owner is always Active for both operators. Compensating control: Security Defaults MFA-on-sign-in + dual-Owner attribution (each human's actions logged distinctly in Activity Log).
  - No per-activation business-justification field. Compensating control: routine Owner-scoped operations go through TFC (audited via TFC run history).
  - No activation-event audit log. Compensating control: Azure Activity Log captures every role-scoped operation, with monthly review automated per sv0-infrastructure#59 (scoped to break_glass + bootstrap SPs; loud-on-zero-actions).

  Re-evaluate PIM adoption when (a) Entra P2 is procured for product/demo reasons and we can opportunistically extend, or (b) staff with Owner-scoped access grows to ≥3.
- Azure Bastion direct (non-PIM) role assignment for tier-3: provisioned for the Tier-3 Owners (Sergey + Ivan), kept Active. This is the CF-independent transport during CF Access outages (§3.4.4). One-time setup; deferred until Bastion is actually provisioned (out of Phase 3a-4 scope).
- GHA workflow check-cf-service-tokens.yml: weekly, reads 1Password via OP CLI, opens issues for tokens past expires_at, fails CI for tokens >120d old. Per §4a.
- Quarterly tier-2 emergency drill (per §3.4.4): each emergency operator runs Serial Console + Tier-1.5 procedures on a test VM, results posted to #audit-tier-2-drills.
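The decision logic behind check-cf-service-tokens.yml in the list above (open an issue for tokens past expires_at, fail CI for tokens older than 120 days) can be sketched as a pure function. The record shape, field names, and the choice to treat unparseable dates as a CI failure are assumptions for illustration; how the records are read from 1Password is out of scope here.

```typescript
// Hypothetical record shape; the real 1Password item fields may differ.
interface ServiceToken {
  name: string;
  createdAt: string; // ISO-8601 timestamp
  expiresAt: string; // ISO-8601 timestamp
}

interface Finding {
  name: string;
  action: "open-issue" | "fail-ci";
  reason: string;
}

const MAX_AGE_DAYS = 120;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function auditServiceTokens(tokens: ServiceToken[], now: Date): Finding[] {
  const findings: Finding[] = [];
  for (const t of tokens) {
    const created = Date.parse(t.createdAt);
    const expires = Date.parse(t.expiresAt);
    if (isNaN(created) || isNaN(expires)) {
      // Assumption: bad metadata fails loudly instead of being skipped.
      findings.push({ name: t.name, action: "fail-ci", reason: "unparseable createdAt/expiresAt" });
      continue;
    }
    if (expires < now.getTime()) {
      findings.push({ name: t.name, action: "open-issue", reason: "past expires_at" });
    }
    const ageDays = (now.getTime() - created) / MS_PER_DAY;
    if (ageDays > MAX_AGE_DAYS) {
      findings.push({ name: t.name, action: "fail-ci", reason: `older than ${MAX_AGE_DAYS}d` });
    }
  }
  return findings;
}
```

A token can produce both findings at once; the workflow would then open the issue and fail the run. Passing `now` in explicitly keeps the function deterministic under test.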
Explicitly NOT in this phase: separate FIDO2 break-glass Entra account in a physical safe (per Ivan 2026-05-12). Backup FIDO2 setup per §3.7 (no safe requirement) is sufficient at current scale.
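The Tier-3 hard rule in §6.2 (two human Owners preserved) was violated only on paper, by stale documentation, which suggests checking it against live state rather than prose. The sketch below evaluates the invariant over the JSON that az role assignment list emits. The function names and the subscription ID in the usage note are hypothetical; the field names (principalId, principalType, roleDefinitionName, scope) are those the Azure CLI is understood to return, and should be confirmed against real output.

```typescript
// Subset of the fields in `az role assignment list` JSON output (assumed).
interface RoleAssignment {
  principalId: string;
  principalType: string;      // "User", "ServicePrincipal", "Group", ...
  roleDefinitionName: string; // e.g. "Owner"
  scope: string;              // e.g. "/subscriptions/<id>"
}

const MIN_HUMAN_OWNERS = 2;

// Distinct human (User) principals holding Owner directly at subscription scope.
// Inherited assignments (e.g. management-group scope) are deliberately excluded.
function humanSubscriptionOwners(assignments: RoleAssignment[], subscriptionId: string): string[] {
  const wantedScope = `/subscriptions/${subscriptionId}`.toLowerCase();
  const ids: string[] = [];
  for (const a of assignments) {
    const isHumanOwner =
      a.roleDefinitionName === "Owner" &&
      a.principalType === "User" &&
      a.scope.toLowerCase() === wantedScope;
    if (isHumanOwner && ids.indexOf(a.principalId) === -1) ids.push(a.principalId);
  }
  return ids;
}

function dualOwnerInvariantHolds(assignments: RoleAssignment[], subscriptionId: string): boolean {
  return humanSubscriptionOwners(assignments, subscriptionId).length >= MIN_HUMAN_OWNERS;
}
```

Fed with the parsed output of az role assignment list for the subscription, a false result would indicate that an Owner was removed without the migration plan the hard rule requires. Service principals are ignored on purpose, since the rule is about human Owners.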
6.3 Phase 3b (formal staging)
- Staging applies the same pattern. L2 (WorkOS staging env, separate org from prod), no L1 perimeter. SSH via CF Access SSH on staging-ssh.securityv0.com (depth-1 per §3.4.2). Per-VM emergency key for staging VM.
6.4 Deferred cleanup (any time)
- Delete the "Cloudflare Access" Azure App Registration and its 1Password client secret entry. Non-blocking.
- ADR-023 promotion CI gate (closes the human-review-only state described in the Status header). GHA workflow in sv0-documentation that triggers when docs/architecture/decisions/adr-023-*.md is added or its Status line changes:
  - Parse the Status line for a Phase 0 commit SHA.
  - Verify the SHA exists in sv0-platform main.
  - Verify the four §6.0 steps are observable at the SHA: dev-provider has no CF-Access-JWT path, CI grep gate is present in .github/workflows/, env.ts iss-claim assertion is present, corresponding test exists.
  - Fail merge of the ADR PR if any check fails.

  Until this lands, ADR-023 promotion is gated by human review (Ivan verifies Phase 0 is merged before merging the ADR). Tracked as a sv0-documentation issue to file alongside the ADR PR.
- ADR-022 amendments:
  - §5c.1 "Two doors": replace every "Entra IdP at CF Access" with "GitHub IdP at CF Access."
  - §5c (Emergency access tiers): tier-1 SSH = CF Access SSH (GitHub), tier-1.5 = per-VM emergency key (CTO/CEO-only), tier-2 = Azure Serial Console (Entra group + custom role), tier-3 = dual Active Azure Owners (Sergey + Ivan) + Security Defaults MFA-on-sign-in (no PIM, no backup SP; see §7).
  - §4: change default vm_size from Standard_B2s (NotAvailableForSubscription in westeurope) to Standard_D2as_v6, and amend the Azure Policy to match.
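The first step of the promotion gate above, parsing the Status line for a Phase 0 commit SHA, can be sketched as below. The regular expression and function name are assumptions: they rely on the Status text using the phrasing "commit <sha>" as the current header does, and a real gate might prefer a stricter, machine-readable marker.

```typescript
// Assumption: the Status text names the SHA as "commit <7-40 hex chars>".
const PHASE0_SHA_PATTERN = /\bcommit\s+([0-9a-f]{7,40})\b/i;

// Returns the lowercase SHA, or null when the Status text carries none.
function parsePhase0Sha(statusText: string): string | null {
  const match = PHASE0_SHA_PATTERN.exec(statusText);
  return match && match[1] ? match[1].toLowerCase() : null;
}

// The gate fails closed: no SHA means the ADR PR must not merge.
function promotionGatePasses(statusText: string, shaExistsOnMain: (sha: string) => boolean): boolean {
  const sha = parsePhase0Sha(statusText);
  return sha !== null && shaExistsOnMain(sha);
}
```

The shaExistsOnMain callback stands in for the second gate step (verifying the SHA exists in sv0-platform main); injecting it keeps the parsing testable without repository access.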
6.5 What does NOT change
- WorkOS at L2 for prod/staging/dev portals: unchanged.
- Azure RBAC + Entra group for Serial Console: unchanged.
- TFC OIDC federation for sv0-shared, sv0-prod, sv0-dev workspaces: unchanged.
- GitHub at CF Access for dev.securityv0.com: unchanged.
7. Things explicitly NOT in scope
- WorkOS at L1. Pricing-gated (OIDC Connect is a separate SKU), wrong product fit, blurs customer/staff boundary.
- Entra at L1. Wrong source of truth — SV0 staff are GitHub users.
- Mixed IdP fallbacks. Rule #2.
- Federating GitHub into Entra (or vice versa). Adds indirection for no operational gain at our team size. Reopen if team grows past ~20 staff or compliance demands single-IdP.
- Cloudflare Access as an application-identity source. Rule #1.
- Long-lived authorized_keys for routine operators. Rule #3 (Tier-1.5 is the narrow-scope exception).
- Bastion / jump host pattern for SSH. CF Access SSH replaces this.
- Azure AD SSH login. §3.4.3.
- Separate FIDO2 break-glass Entra account in a physical safe. Deferred per Ivan 2026-05-12. Current §3.7 setup is sufficient.
- Entra ID P2 / Azure PIM for infra access control. Resolved 2026-05-13. Verification returned subscribedSkus: [] + 400 AadPremiumLicenseRequired (P2 absent on tenant). P2 procurement is reserved for product/demo use cases (Entra audit logs in execution findings); we do not adopt P2/PIM to gate our own infra. Tier-3 runs dual Active Owners (Sergey + Ivan; verified via az role assignment list 2026-05-13) + Security Defaults MFA-on-sign-in. Mutual recovery between the two Owners is the account-lockout rollback. Microsoft support tenant-root reset remains as a residual fallback only for the joint-loss scenario, with an RTO of days, which is acceptable. Re-evaluate PIM when (a) P2 lands for product reasons and we can opportunistically extend, or (b) staff with Owner-scoped access grows to ≥3.
- Two-human signer on amendments. Deferred until team scales past 5 staff (Ivan 2026-05-12). Current process: two independent Claude/Codex sessions for AI-proposed amendments (§11.3).
- Anomaly detector for super-admin actions. Deferred to first compliance ask or 5+ staff (§3.6).
- Replacing Cloudflare with a cloud-specific edge. Cloudflare stays — perimeter, DNS, Tunnel, Access all live there. Cloud-portable as-is.
8. Open questions
Most v1/v2 questions are resolved in v3. Remaining live items:
- The bootstrap operator role: currently either Tier-3 Owner (Sergey or Ivan), running local-apply on bootstrap/. Long-term it should be a distinct identity (a service account with limited scope used only for bootstrap apply). Track as a Phase 3c+ item.
- Cross-Claude-session convergence for AI-proposed amendments. Implementation: a checklist in the PR template? A make verify-amendment script that wraps both Claude sessions? Decide before the first such amendment lands.
Closed: "Entra ID P2 license state on the current tenant", resolved 2026-05-13 (subscribedSkus: []). See §7 + §6.2 step 12 amendment.
9. Glossary (used precisely throughout this doc)
- L1 / "perimeter": Cloudflare Access in front of a URL or port. Decides whether the network connection reaches the backend. Used in §3.2, §3.4. Never "staff/external trust boundary" generally; that's "trust boundary."
- L2 / "application": Authentication inside the application itself. For SV0: WorkOS sessions.
- L3 / "resource RBAC": Cloud-side authorization decisions (Azure RBAC, MongoDB Atlas roles, GitHub permissions).
- IdP: Identity provider.
- CF Access app: A Cloudflare Access "application" resource: hostname + type + allowed IdPs + policy.
- CF Access SSH: Cloudflare Access in front of port 22; the VM trusts the CF Access SSH CA. Operator: cloudflared access ssh --hostname X.
- Service token: Long-lived token issued by Cloudflare Access for service-to-service use.
- M2M token: Machine-to-machine token issued by WorkOS for a service principal.
- Tier-1 SSH: Routine operator SSH (target: CF Access SSH).
- Tier-1.5 emergency SSH: Narrow-scope per-VM key for CTO/CEO-level break-glass when Tier-1 and Tier-2 are unavailable.
- Tier-2 emergency operator: Person assigned to the sv0-vm-emergency-ops Entra group. Has Serial Console + custom role. 1–3 members.
- Tier-3 subscription owner: Person with Azure subscription Owner role, Active assignment, MFA-on-sign-in via Security Defaults. Today: Sergey (original Owner since 2026-01-04) and Ivan (Owner since 2026-03-10). Mutual recovery between the two is the account-lockout rollback; no backup SP (the proposed sv0-azure-backup-owner was cancelled when state-verification 2026-05-13 showed the 2-Owner state already existed; design patterns banked in docs/patterns/recovery-credentials.md for any future scenario where a backup SP is actually warranted).
- Bootstrap operator: The role used for bootstrap/ local-apply. Currently the same identity as tier-3, conceptually distinct (§8 Q1).
- "Break-glass" (adjective only): describes tier-1.5 or tier-2 access. Not used as a noun.
- Security Defaults: Free-tier Entra policy, applied tenant-wide, that enforces MFA registration for all users and MFA-on-sign-in for Entra administrator roles and for any user reaching Azure Resource Manager (portal, az CLI, ARM API). The ARM path is what covers Azure RBAC roles such as subscription Owner and Contributor. The free-tier replacement for Conditional Access (which requires P1/P2). Not granular: it applies uniformly and cannot be scoped per user or per role. SV0 staff scope is small enough that this is sufficient.
- PIM: Azure Privileged Identity Management. Entra ID P2 feature. Eligible-not-active role assignments, MFA-on-activate. NOT adopted for SV0 infra (§7); referenced only for the AWS-migration sketch in §1.1.
10. References
- Round 3 review of v3: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review-r3.md
- Round 2 review of v2: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review-r2.md
- Round 1 review of v1: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-review.md
- v3 draft: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v3.md
- v2 draft: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v2.md
- v1 draft: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-12-target-auth-architecture-v1.md
- Reversed earlier verdict: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-11-cf-access-idp-question-for-auth-agent.md
- Azure landing zone session: ~/dev/securityv0/.scratch/session-notes/cross-repo/2026-05-11-azure-landing-zone-staging-first-session.md
- ADR-022 (Azure compute landing zone): ~/dev/securityv0/repos/sv0-documentation/docs/architecture/decisions/adr-022-azure-compute-landing-zone.md
- Runbook 12 (Azure VM landing zone): ~/dev/securityv0/repos/sv0-documentation/docs/runbooks/12-azure-vm-landing-zone.md
- WorkOS principal model: ~/dev/securityv0/repos/sv0-skills/auth-context/SKILL.md
- Memory: project_auth_principal_model_locked (2026-04-30), feedback_subagent_backward_compat_neutralizes_fix, feedback_fail_loud_over_silent_fallback
11. How to change this document
At current 1-2 operator scale, amendments are operator-PRs reviewed by the other operator (or solo for documentation-only changes). Two named substantive criteria:
- Hard Rules in §5 are invariants. Changes to a hard rule require the operator to articulate the threat-model delta in the PR description.
- AI-proposed amendments need a second-model pass. Run any AI-generated amendment through a second independent session (Codex or a fresh Claude session) before merging. The 2026-05-13 no-PIM revision is the worked example: Codex caught five threat-model premises the first model missed; the CEO scope review caught that the resulting bridge was unnecessary. Both passes mattered.
Trigger to revisit and expand this section: team growth past 5 staff, OR first SOC2 / customer audit ask. Until then, lighter-weight is correct: a heavier process for a 2-person team is the failure mode an earlier version of this section fell into.
Reversal lesson worth keeping (2026-05-11): when proposing an auth change, ask "what does existing infra already provide?" before re-deriving from first principles. The 2026-05-11 verdict was reversed within 24 hours because it conflated backend presence (Azure tenant has 1-3 emergency accounts) with staff identity store (where SV0 staff actually live = GitHub). The same failure mode reappeared 2026-05-13 with the cancelled backup-Owner SP: building a sophisticated workaround for a problem whose existing infra-equivalent (add a 2nd human Owner) was cheaper. Pattern-match to it.
— Authentication target architecture (DRAFT v4), 2026-05-12