Auth Simplification Plan
Status: Active. PR #822 shipped the bearer-downgrade hotfix (issue #821) — that's the regression #816 introduced, not part of this plan's architectural work.
Goal: collapse the auth env-var surface from 14 per container (post-#816) to ~7, kill dead code, and stop the "add a secret instead of fixing the legacy path" pattern.
Shape: four PRs. Earlier drafts had six sequenced steps; that mirrored the accretion pattern this plan exists to fix. The four-PR version is the simplification applied to itself.
Out of scope: webhook receiver wiring, connector API key system (#645 stays), CF Access bypass for visual-review, real new features, the STAGING_CI_M2M Connect App. Note on STAGING_CI_M2M: it is a service-principal Connect App (no principalUserId) and is the intended shape for headless service auth — see agent-and-m2m-authentication §Path 2. What this plan deletes is the identity-bridging pattern (personal-agent-*), not service principals.
Why this matters
Every duplicated config slot in this surface has produced a real production incident in the last 60 days. Not aesthetics:
- The `optionalNonEmpty` shim in `agent-clients.ts:60-63` exists because of #732: `${VAR:-}` expansion in docker-compose flowed empty values for the unused-prefix pair and crashed boot.
- The redirect-URI allowlist accreted across #368, #810, #813: three PRs solving facets of one duplicated config.
- The cookie-password split is structurally divergent: `index.ts:60` configures iron-session via `SESSION_COOKIE_PASSWORD`; `env.ts:333` builds the WorkOS provider with `WORKOS_COOKIE_PASSWORD` directly. If the two diverge, the seal and the provider's internal cookie machinery use different secrets.
- #821 / PR #822: the `??` form in `bearer-token-middleware.ts:561` silently stripped DB super-admin from any user not on `STAFF_SUPER_ADMIN_PROVIDER_USER_IDS`. Bearer auth disagreed with cookie auth. Fixed.
Removing each duplicate removes the class of bug, not just one bug.
Reasoning constraints
- Each item is reduction, not addition. No new env vars, no new abstractions, no new providers.
- No silent fallback shims. Each step removes the old path entirely after the new path lands.
- Fail loud over silent fallback. Preflight must error, not pick up an empty value.
- Bundle interdependent changes. Three changes that touch the same lines of `env.ts` should be one PR, not three sequenced.
- Every PR pairs with a docs PR or explicit deprecation note. Where stale docs already exist, update or stub them in the same change.
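The fail-loud constraint can be sketched as a small preflight helper. This is a minimal illustration under stated assumptions, not the project's code: `requireNonEmpty` and `preflight` are hypothetical names, and the variable names passed in are only examples. The point is that an empty string (which is what `${VAR:-}` expansion yields for an unset variable) is treated exactly like a missing value.

```typescript
type EnvMap = Record<string, string | undefined>;

// Hypothetical helper: true when the variable is unset, empty, or whitespace-only.
function isBlank(env: EnvMap, name: string): boolean {
  const value = env[name];
  return value === undefined || value.trim() === "";
}

// Hypothetical helper: returns the value or throws. There is no fallback arm,
// so an empty value can never be silently picked up.
function requireNonEmpty(env: EnvMap, name: string): string {
  if (isBlank(env, name)) {
    throw new Error(`${name} is required but is missing or empty`);
  }
  return env[name] as string;
}

// Hypothetical preflight: reports every blank variable in one error instead of
// stopping at the first one.
function preflight(env: EnvMap, required: string[]): void {
  const missing = required.filter((name) => isBlank(env, name));
  if (missing.length > 0) {
    throw new Error(`Preflight failed, missing or empty: ${missing.join(", ")}`);
  }
}
```

A deploy step would call `preflight(process.env, [...])` before boot; the exact list of names is whatever the inventory below settles on.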
Container env-var inventory (concrete)
The numbers in this plan are not abstract. Today's 14 → target ~7 means these specific env vars:
Today (post-#816), per container
| Env var | Purpose | Plan disposition |
|---|---|---|
| `AUTH_PROVIDER` | Provider selection (workos/dev) | Keep |
| `WORKOS_API_KEY` | WorkOS server-side API auth | Keep |
| `WORKOS_CLIENT_ID` | Main user-session OAuth app | Keep |
| `WORKOS_AUTHKIT_DOMAIN` | M2M JWT issuer + JWKS | Keep |
| `WORKOS_REDIRECT_URI` | Legacy single-host redirect (#368) | Delete in PR-D (covered by allowlist) |
| `WORKOS_REDIRECT_URI_ALLOWED_HOSTS` | Per-request derived redirect | Keep, becomes mandatory |
| `WORKOS_COOKIE_PASSWORD` | WorkOS provider cookie config | Delete in PR-D (consolidated into `SESSION_COOKIE_PASSWORD`) |
| `SESSION_COOKIE_PASSWORD` | iron-session seal | Keep, becomes the one slot |
| `WORKOS_SUPER_ADMIN_ORG_ID` | Org-based super-admin signal | Keep (canonical signal under PR-C Option A) |
| `STAFF_SUPER_ADMIN_PROVIDER_USER_IDS` | Per-user staff allowlist (added by #816) | Delete in PR-C (Option A) |
| `WORKOS_WEBHOOK_SECRET` | Webhook receiver auth (placeholder; receiver not wired) | Keep (stub, no plan disposition) |
| `STAGING_/PROD_WORKOS_APP_CLAUDECODE_CLIENT_ID/SECRET` (4 secrets total in env, 2 per container at runtime via `NODE_ENV` branching) | Staff CLI device_code Connect App | Optional: PR-E (collapse to unprefixed) |
| `STAGING_/PROD_PERSONAL_AGENT_IVAN_CLIENT_ID/SECRET/PRINCIPAL_USER_ID` (6 secrets total, 3 per container at runtime) | Personal-agent bridge for one staff member | Delete in PR-B |
| `REQUIRE_AUTH` | Dev-bypass gate | Keep |
The OIDC_*, JWT_*, ALLOWED_API_KEYS, API_KEY_HEADER env vars are listed in env.ts but dead at runtime — they're deleted in PR-D as part of the env.ts cleanup.
After PR-B (delete bridge), per container
Removed: 3 personal-agent vars. Net: 11.
After PR-C (one super-admin signal), per container
Removed: `STAFF_SUPER_ADMIN_PROVIDER_USER_IDS` (Option A keeps `WORKOS_SUPER_ADMIN_ORG_ID` as the canonical signal). Net: 10.
After PR-D (env.ts cleanup), per container
Removed: WORKOS_REDIRECT_URI, WORKOS_COOKIE_PASSWORD, plus the OIDC/JWT/API-key dead env vars. Net: ~8.
After PR-E (optional STAGING/PROD prefix collapse), per container
Removed: the `STAGING_`/`PROD_` prefix on `WORKOS_APP_CLAUDECODE_*`. The 2 vars per container stay at 2 (no count change), but the runtime branch and the `optionalNonEmpty` shim go away. Net: ~7-8.
The "floor" is WORKOS_API_KEY + WORKOS_CLIENT_ID + WORKOS_AUTHKIT_DOMAIN + AUTH_PROVIDER + SESSION_COOKIE_PASSWORD + the survived super-admin signal + REQUIRE_AUTH + the agent-client pair = ~7-8 vars. Three are orthogonal facts that can't collapse further.
The four PRs
PR-A (already shipped): #822 hotfix for #821
│
▼
PR-B: Delete personal-agent bridge ──────────┐
│ (independent of PR-C) │
▼ ├─→ PR-D: env.ts cleanup
PR-C: One super-admin signal ────────────────┤ (S2 + S4 + S5 bundled — same file, same lines)
│
▼
PR-E (OPTIONAL): STAGING/PROD prefix collapse
— park unless deploy YAML is being touched anyway
PR-B and PR-C are independent (the bridge mechanism doesn't touch the super-admin signal). They can land in either order; PR-B first is recommended because it's the highest-leverage single change (~−800 LOC test, 6 secrets removed). PR-D depends on PR-C only because PR-C's outcome determines which super-admin env vars get deleted.
PR-A — Bearer-downgrade hotfix (SHIPPED)
Status: filed as PR #822, closes #821.
bearer-token-middleware.ts:561 switched from ?? to Boolean(...) || user.is_super_admin. The allowlist promotes; never demotes. Bearer and cookie now agree on isSuperAdmin for the same user.
This is the regression #816 introduced. Not part of the architectural simplification work — it's a fix for a bug that landed during the planning window. Listed first in the plan only because it's the temporal first PR; it does not advance the architectural goal.
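The difference between the two forms is easy to miss, so a reduced sketch may help. The pre-fix expression is not quoted in this plan; `resolveWithNullish` below is an assumed reconstruction of a `??`-based shape that produces the described symptom, and all names other than `is_super_admin` are illustrative.

```typescript
interface UserRow {
  providerUserId: string;
  is_super_admin: boolean; // value stored in the DB
}

// Assumed pre-fix shape: once an allowlist is configured, the lookup always
// yields a boolean, and `??` only falls through on null/undefined. A `false`
// lookup therefore wins over a `true` DB value.
function resolveWithNullish(user: UserRow, allowlist?: Set<string>): boolean {
  const override = allowlist ? allowlist.has(user.providerUserId) : undefined;
  return override ?? user.is_super_admin;
}

// Post-fix shape described above: the allowlist can promote, never demote.
function resolveWithOr(user: UserRow, allowlist?: Set<string>): boolean {
  return Boolean(allowlist?.has(user.providerUserId)) || user.is_super_admin;
}

const dbAdmin: UserRow = { providerUserId: "user_not_listed", is_super_admin: true };
const staffAllowlist = new Set(["user_listed"]);

console.log(resolveWithNullish(dbAdmin, staffAllowlist)); // false: DB super-admin stripped
console.log(resolveWithOr(dbAdmin, staffAllowlist)); // true: DB value preserved
```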
PR-B — Delete the personal-agent bridge
Why: Six secrets, one staff member, zero non-interactive consumers confirmed across all sv0 repos and workflows. Justification ("Telegram bots, SSH-from-anywhere") never materialized.
Prerequisite (you): explicit Slack confirmation from Ivan that no laptop-local script (cron, Telegram bridge in development, ad-hoc tooling) uses the bridge. The 800 LOC of behavioral test coverage means someone judged the bridge worth that investment, so confirm the need is really gone before deleting.
Changes:
- `src/api/auth/agent-clients.ts:147-203`: remove personal-agent-ivan staging+prod blocks, `buildPersonalAgentIvanEntry`, `principalUserId` field on `AgentClientEntry`.
- `src/api/middleware/bearer-token-middleware.ts:229-278`: remove the bridge branch in the M2M `client_*` path. Falls through to standard service-principal handling.
- Test cleanup (~−800 LOC):
  - Delete `test/api/auth/personal-agent-bridge.test.ts` (344 LOC).
  - Delete `test/api/integration/personal-agent-bridge-full-chain.test.ts` (301 LOC).
  - Remove personal-agent describe blocks at `test/api/auth/agent-clients.test.ts:190-342` (~150 LOC).
- Delete `scripts/validate-personal-agent-playwright.ts`.
- Workflows: remove `STAGING_/PROD_PERSONAL_AGENT_IVAN_*` from `deploy-{dev,prod}.yml`, `deploy-instance.sh`, `docker-compose.deploy.yml`.
- GitHub: delete the 6 personal-agent secrets.
- Docs: the deprecated `provision-personal-agent` runbook is already a stub pointing at `agent-and-m2m-authentication`. PR-B can either delete it entirely or leave the stub for one release; either is fine.
Port a small regression guard before deletion — three security invariants the bridge tests cover that, untested, could let the same bug class re-emerge:
- Tenant comes from JWT `org_id`, not from a header (currently at `personal-agent-bridge.test.ts:218`).
- Service-path preservation: a registered `client_*` token without `principalUserId` stays a machine principal, never silently becomes a delegated_agent.
- Scope intersection for delegated_agent (currently at `personal-agent-bridge-full-chain.test.ts:234`); port to the device_code path.
Net: ~3 small new tests in bearer-token-middleware.test.ts (~50 LOC) replacing the bridge-specific files.
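As a rough sketch of what the three ported guards assert, the invariants can be written as pure functions. Everything here is hypothetical: `resolveTenantId`, `classifyPrincipal`, and `intersectScopes` are illustrative names, and identifying the M2M path by a `client_` prefix on the token subject is an assumption about how the middleware distinguishes the two paths.

```typescript
interface JwtClaims {
  sub: string;
  org_id?: string;
}

// Invariant 1: the tenant comes from the JWT org_id claim. Request headers are
// accepted as a parameter only to show that they are deliberately ignored.
function resolveTenantId(claims: JwtClaims, _headers: Record<string, string>): string {
  if (!claims.org_id) {
    throw new Error("token has no org_id claim");
  }
  return claims.org_id;
}

type PrincipalKind = "machine" | "delegated_agent";

// Invariant 2 (assumed discriminator): a client_* subject is always a machine
// principal; nothing in the request can turn it into a delegated agent.
function classifyPrincipal(claims: JwtClaims): PrincipalKind {
  return claims.sub.startsWith("client_") ? "machine" : "delegated_agent";
}

// Invariant 3: a delegated agent gets only the scopes that both the token and
// the underlying user hold.
function intersectScopes(tokenScopes: string[], userScopes: string[]): string[] {
  const allowed = new Set(userScopes);
  return tokenScopes.filter((scope) => allowed.has(scope));
}
```

The real tests would drive these properties through the middleware rather than through helpers like these; the sketch only pins down what each guard has to prove.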
LOC impact: ~−100 src, ~−800 test (with ~+50 ported), ~−50 deploy YAML.
PR-C — One canonical super-admin signal (DECIDED: Option A)
Today: three mechanisms running simultaneously: WORKOS_SUPER_ADMIN_ORG_ID (canonical per the prod gate at env.ts:240), @securityv0.com email-domain fallback (dev/test), STAFF_SUPER_ADMIN_PROVIDER_USER_IDS (added by #816).
Each was added because the prior one didn't quite fit a case at hand — exactly the pattern this plan exists to stop. PR-C ends with one mechanism live and the other two deleted.
Verification result (2026-05-08): AuthKit's authenticateWithCode returns organizationId: null even when the user is a member of the org, on personal-email logins. Confirmed empirically with Ivan's account and corroborated by the WorkOS SDK type. Option B is therefore not viable without also pinning organizationId in getLoginUrl (which would break multi-org switcher behavior for users in more than one org).
Decision: Option A. Option C (accept STAFF_SUPER_ADMIN_PROVIDER_USER_IDS as canonical) was considered and rejected: it's simpler to ship this week but every subsequent staff hire/offboarding requires updating a GitHub-secret allowlist and redeploying. The source of truth becomes a parallel manually-maintained list rather than the WorkOS org membership data the system is already maintaining for SSO and billing. Long-term simplicity wins.
Option A: implement membership lookup at callback
The canonical signal is org membership in the WorkOS org named by WORKOS_SUPER_ADMIN_ORG_ID, derived at request time from WorkOS — not from a local allowlist.
Changes:
- `src/api/auth/providers/workos-provider.ts`: extend `handleCallback` to call `GET /user_management/organization_memberships?user_id=<sub>` after `authenticateWithCode` and return the user's org IDs as part of `AuthCallbackResult`. Add a small in-memory cache keyed by `provider_user_id` with a short TTL (5 minutes is fine; invalidation on logout/revoke is not required for super-admin computation, and staleness reverts within one window).
- `src/api/routes/auth.ts:159-166`: replace the three-branch `isSuperAdmin` resolution with a single match: `isSuperAdmin = result.organizationMemberships.some(m => m.organizationId === deps.superAdminOrgId)`. Drop the `result.isSuperAdmin` provider override path and the email-domain fallback path entirely.
- `src/api/middleware/auth-middleware.ts:160-167` (cookie path): leave `user.is_super_admin` as the cached source. The DB is updated by the callback path each login from the canonical signal.
- `src/api/middleware/bearer-token-middleware.ts:556-562`: remove the `superAdminProviderUserIds` parameter from `BearerTokenMiddlewareOptions` entirely. The `Boolean(...) || user.is_super_admin` form from PR-A collapses to just `user.is_super_admin` (the override no longer exists). For delegated_agent contexts where the user record was JIT-upserted with `is_super_admin: false` before the membership lookup ran, the next login refreshes the DB row.
- `src/shared/config/env.ts`: delete the `STAFF_SUPER_ADMIN_PROVIDER_USER_IDS` schema entry, the `parseStaffSuperAdminUserIds` parser, and the validator that requires at least one super-admin signal in production (drop the `||` arm; `WORKOS_SUPER_ADMIN_ORG_ID` becomes unconditionally required in prod). Delete the email-domain fallback constant.
- Workflows + `docker-compose.deploy.yml`: remove `STAFF_SUPER_ADMIN_PROVIDER_USER_IDS` env var injection from `deploy-{dev,prod}.yml` and `deploy-instance.sh`. Confirm `WORKOS_SUPER_ADMIN_ORG_ID` flows correctly in dev (it should already; double-check).
- GitHub: delete the `STAFF_SUPER_ADMIN_PROVIDER_USER_IDS` secret from both environments after a one-deploy-cycle safety window.
- Tests:
  - Update `bearer-token-middleware.test.ts` super-admin override tests; most are deletable since the override is gone. Keep one test asserting cookie/bearer parity on `user.is_super_admin`.
  - Add a callback test: a user whose membership lookup returns the super-admin org gets `is_super_admin: true` upserted; a user whose memberships exclude that org gets `is_super_admin: false`.
- Docs:
  - Update §16.1 row to "Decision: Option A; canonical signal is WorkOS org membership."
  - Update `agent-and-m2m-authentication.md` DON'T section #3 to reflect that there is now exactly one super-admin signal.
Cache TTL note: 5 minutes is the recommendation. The membership API call adds ~50-100ms to the callback path, but that's amortized over a session that lasts hours. In-process cache means no Redis dependency.
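A minimal sketch of the lookup-plus-cache shape, under stated assumptions: `MembershipCache`, `membershipsFor`, and the `fetchMemberships` callback are hypothetical names, and the callback stands in for the WorkOS membership request rather than reproducing any SDK signature. Only the `organizationId` field and the `.some(...)` match come from the change list above.

```typescript
interface OrganizationMembership {
  organizationId: string;
}

interface CacheEntry {
  expiresAt: number;
  value: OrganizationMembership[];
}

// In-process TTL cache keyed by provider_user_id. The clock is injectable so
// expiry can be exercised in tests without waiting.
class MembershipCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(userId: string): OrganizationMembership[] | undefined {
    const entry = this.entries.get(userId);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(userId); // expired: drop it and force a refetch
      return undefined;
    }
    return entry.value;
  }

  set(userId: string, value: OrganizationMembership[]): void {
    this.entries.set(userId, { expiresAt: this.now() + this.ttlMs, value });
  }
}

// `fetchMemberships` is a stand-in for the membership API call (assumption).
async function membershipsFor(
  userId: string,
  cache: MembershipCache,
  fetchMemberships: (userId: string) => Promise<OrganizationMembership[]>,
): Promise<OrganizationMembership[]> {
  const cached = cache.get(userId);
  if (cached) return cached;
  const fresh = await fetchMemberships(userId);
  cache.set(userId, fresh);
  return fresh;
}

// The single match that replaces the three-branch resolution.
function isSuperAdmin(memberships: OrganizationMembership[], superAdminOrgId: string): boolean {
  return memberships.some((m) => m.organizationId === superAdminOrgId);
}
```

With a 5-minute TTL the cache would be constructed as `new MembershipCache(5 * 60 * 1000)`. Because the cache is per-process, each container warms independently, which matches the "no Redis dependency" trade-off noted above.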
Operational characteristics:
| Lifecycle event | Today (with allowlist) | After PR-C (Option A) |
|---|---|---|
| New staff hire | Update STAFF_SUPER_ADMIN_PROVIDER_USER_IDS GitHub secret in dev + prod, redeploy both | Add to WorkOS org via dashboard. Done. |
| Staff offboarding | Remove from WorkOS org AND remove from secret list. Drift risk if step 2 forgotten. | Remove from WorkOS org. One step, no drift. |
| Cert / key rotation | n/a | n/a |
| Adding a new super-admin org (e.g., partner) | Not supported without code changes | Set WORKOS_SUPER_ADMIN_ORG_ID to a comma-separated list, parse, match (small follow-up if needed) |
LOC impact: ~+30 src (membership lookup + cache), ~−60 src (delete allowlist parser, email-domain fallback, three-branch logic), ~+10 test (callback test), ~−40 test (bearer override tests delete), ~−15 deploy YAML. Net deletion.
PR-D — env.ts cleanup (S2 + S4 + S5 bundled)
Why bundle: all three changes touch the zod block at env.ts:240-291 and the workos provider config emission. Sequencing them means three PRs fighting for the same lines. One PR is the right shape.
Three things this PR deletes:
D1. OIDC provider + legacy authMiddleware + legacy JWT/ALLOWED_API_KEYS
OIDC is never deployed (oidc-provider.ts:32 throws on construction; production gate at env.ts:228 rejects AUTH_PROVIDER=oidc). Legacy authMiddleware (src/api/middleware/auth.ts) is the OLD pipeline; comment says "remove in Phase 5" — Phase 5 hasn't happened.
- Delete `src/api/auth/providers/oidc-provider.ts`.
- Delete `src/api/middleware/auth.ts` and `JwtVerificationOptions`.
- `env.ts`: remove `OIDC_*`, `JWT_*`, `ALLOWED_API_KEYS`, `API_KEY_HEADER` schema entries; remove the `"oidc"` member of `AuthProviderType`; remove the OIDC validation block; remove the OIDC config emission block. Keep `REQUIRE_AUTH`: it gates the dev-bypass at `auth-middleware.ts:76` and integration tests rely on it.
- `provider-factory.ts:63-71`: remove the `oidc` case.
- `app.ts:28`: remove the `authMiddleware` import.
- Critical: preserve `app.ts:155-171`, the #347 IDOR defense-in-depth guard (throws in prod if `createApp` is invoked without `authProvider` + `storageAdapter`). Test `app-bearer-mount.test.ts:247` enforces it. Only remove the legacy `app.use(authMiddleware(...))` call, not the prod-throw.
- Delete `.env.example` OIDC section.
- Test cleanup:
  - OIDC residue (~−40 LOC): describe blocks in `workos-provider-verify.test.ts:222`, `provider-factory.test.ts:48`, `env.test.ts:98-105`.
  - Legacy-auth route coverage (~−200 LOC): `test/api/auth-ingest.test.ts:18`, `test/api/middleware/auth-jwt.test.ts:11`, `test/api/scan-runs-auth.test.ts:49`. Read each end-to-end and decide per-file: delete (if behavior is covered elsewhere) or port (if it tests a route-level invariant the new path doesn't cover).
Verification — tenantContextMiddleware exposure: trace src/api/middleware/tenant-context.ts:30 and require-tenant.ts:71. Verify the new createSessionMiddleware + bearer pipeline gates everything before tenant resolution. If non-prod paths lack the equivalent gate, PR-D must add it before deleting authMiddleware.
Pre-step grep (mandatory):
grep -rn "method:.*api_key\|JwtVerificationOptions\|ALLOWED_API_KEYS\|JWT_JWKS_URI" src/ test/ scripts/
D2. Collapse WORKOS_REDIRECT_URI into WORKOS_REDIRECT_URI_ALLOWED_HOSTS
Single host is expressible as a one-element allowlist. The duplication produced #368/#810/#813.
- `env.ts`: remove `WORKOS_REDIRECT_URI` schema entry; remove fallback branching; make `WORKOS_REDIRECT_URI_ALLOWED_HOSTS` required when `AUTH_PROVIDER=workos`.
- `workos-provider.ts`: remove the legacy `redirectUri` field; always derive per-request from request host validated against allowlist.
- Workflows: drop `WORKOS_REDIRECT_URI`. Allowlists:
  - Prod: `app.securityv0.com`
  - Dev: `*.securityv0.com` (depth-1 wildcard, matches `pr-N-dev.securityv0.com`). Keep this: Cloudflare free Universal SSL covers depth-1 only; `*.dev.securityv0.com` (depth-2) silently breaks in browsers.
- `deploy-instance.sh:227,228`: remove `WORKOS_REDIRECT_URI` line.
- WorkOS AuthKit dashboard: no change needed if it currently registers `https://*.securityv0.com/auth/callback`.
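The per-request derivation could look roughly like the sketch below. It is an illustration under assumptions, not the provider's code: `hostMatches`, `parseAllowedHosts`, and `deriveRedirectUri` are hypothetical names, the comma-separated format of `WORKOS_REDIRECT_URI_ALLOWED_HOSTS` is assumed, and the wildcard is interpreted as matching exactly one DNS label. The `/auth/callback` path is taken from the dashboard entry above.

```typescript
// True when `host` matches `pattern`. A leading "*." matches exactly one DNS
// label, so "*.securityv0.com" accepts "pr-7-dev.securityv0.com" but rejects
// both "securityv0.com" and "a.dev.securityv0.com".
function hostMatches(pattern: string, host: string): boolean {
  const p = pattern.toLowerCase();
  const h = host.toLowerCase();
  if (!p.startsWith("*.")) return p === h;
  const suffix = p.slice(1); // e.g. ".securityv0.com"
  if (!h.endsWith(suffix)) return false;
  const label = h.slice(0, h.length - suffix.length);
  // Restricting the label to hostname characters keeps slashes, dots, and
  // ports out of the derived URL.
  return /^[a-z0-9-]+$/.test(label);
}

// Assumed format: comma-separated host patterns. An empty list is an error,
// never a silent "allow nothing" or "allow everything".
function parseAllowedHosts(raw: string | undefined): string[] {
  const hosts = (raw ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
  if (hosts.length === 0) {
    throw new Error("WORKOS_REDIRECT_URI_ALLOWED_HOSTS is required when AUTH_PROVIDER=workos");
  }
  return hosts;
}

// Derives the redirect URI from the request host, or throws if the host is
// not on the allowlist.
function deriveRedirectUri(requestHost: string, allowedHosts: string[]): string {
  if (!allowedHosts.some((pattern) => hostMatches(pattern, requestHost))) {
    throw new Error(`Host not in redirect allowlist: ${requestHost}`);
  }
  return `https://${requestHost.toLowerCase()}/auth/callback`;
}
```

A single production host is then just a one-element allowlist, which is the collapse D2 relies on.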
D3. Collapse SESSION_COOKIE_PASSWORD and WORKOS_COOKIE_PASSWORD
The two env vars are read by different code paths (see "Why this matters" above). The split exists for the OIDC path that no longer exists after D1.
- `env.ts:81, 70, 207-215`: remove `WORKOS_COOKIE_PASSWORD`, keep `SESSION_COOKIE_PASSWORD` as the single source. Required for any non-dev provider; dev auto-generates.
- Workflows: rename secret `WORKOS_COOKIE_PASSWORD` → `SESSION_COOKIE_PASSWORD` per environment.
- `workos-provider.ts`: read the cookie password from the unified slot.
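The single-slot rule (required for non-dev, auto-generated for dev) might be sketched as follows. `resolveCookiePassword` and the injected `generateDevPassword` callback are hypothetical; the 32-character minimum reflects iron-session's documented password requirement and should be checked against the version in use.

```typescript
// iron-session documents a minimum password length of 32 characters.
const MIN_COOKIE_PASSWORD_LENGTH = 32;

// Resolves the one cookie password slot. There is deliberately no second
// variable to fall back to: both iron-session and the WorkOS provider would
// receive the value returned here.
function resolveCookiePassword(
  env: Record<string, string | undefined>,
  authProvider: string,
  generateDevPassword: () => string, // hypothetical: supplies a random dev secret
): string {
  const configured = env["SESSION_COOKIE_PASSWORD"];
  if (configured !== undefined && configured !== "") {
    if (configured.length < MIN_COOKIE_PASSWORD_LENGTH) {
      throw new Error(
        `SESSION_COOKIE_PASSWORD must be at least ${MIN_COOKIE_PASSWORD_LENGTH} characters`,
      );
    }
    return configured;
  }
  if (authProvider === "dev") {
    return generateDevPassword();
  }
  throw new Error("SESSION_COOKIE_PASSWORD is required for any non-dev provider");
}
```

Passing the generator in keeps the sketch free of a crypto dependency; a real implementation would presumably draw the dev secret from a cryptographic random source.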
LOC impact (PR-D total): ~−250 src, ~−240 test (after the per-file delete-vs-port decisions in D1), ~−30 YAML.
PR-E (OPTIONAL) — Drop STAGING_/PROD_ prefix on agent-client envs
Park unless deploy YAML is already being touched for another reason. This is the lowest-leverage step in the original draft and the most operationally noisy.
The benefit (real): removes the optionalNonEmpty shim at agent-clients.ts:60-63 and the isProd branch in loadAgentClientRegistry. Eliminates a class of "accidentally added a STAGING_X because that's how the others looked" bugs.
The cost (also real): rename GitHub Secrets in two environments, update scripts/cli/auth.ts and scripts/lib/api-client.ts (both read prefixed names from a developer-local .env), ask every developer to rotate their local .env in a Slack message, plus the rollback trap requiring "wait one full prod-deploy cycle before deleting old secret names."
For a team of fewer than 10 contributors, the rename cost roughly equals the simplification benefit. Defer until a future PR is already in the same files and the prefix removal is incremental work. Don't run it as a standalone simplification PR.
If/when this lands, the changes are:
- `agent-clients.ts`: collapse `AGENT_CLIENT_ENV_SCHEMA` to one set, delete `optionalNonEmpty` shim and `isProd` branching.
- `provider-factory.ts:20-26`: `readAgentCredential` reads unprefixed names.
- Workflows + `docker-compose.deploy.yml` + `deploy-instance.sh`: inject unprefixed names from the per-environment GitHub secrets.
- `scripts/cli/auth.ts:76-85`, `scripts/lib/api-client.ts:70-78`: read unprefixed names.
- GitHub: rename secrets per environment; keep both names alive through one deploy cycle as rollback safety.
- `deploy-prod.yml` preflight at L123-124 must validate the new unprefixed names. Don't ship without this.
Total impact
After PR-A through PR-D:
- Container env vars: 14 → ~8.
- Distinct WorkOS Apps in use: 6 → 4 (main, claude-code-staging, claude-code-prod, ci-staging-m2m). Three with PR-B applied if you also retire one of the claude-code variants (separate decision, not in this plan).
- LOC delta: ~−400 src, ~−1,060 test, ~−100 deploy YAML.
- Eliminated bug classes: the duplications that produced #732, #368, #810, #813. The `??`-vs-`||` divergence (#821) gone via PR-A. Bearer/cookie super-admin agreement enforced via PR-C.
Acceptance test for the whole plan
After PR-A through PR-D land:
- List secrets in GitHub Environment "dev". Count ≤ 8 auth-related entries.
- List secrets in "prod". Same count, parallel names.
- `grep -rn "OIDC\|oidc" src/api` → zero hits.
- `grep -rn "personal-agent" src/` → zero hits (PR-B).
- `grep -rn "WORKOS_REDIRECT_URI[^_]" src/` → zero hits (PR-D D2; only `WORKOS_REDIRECT_URI_ALLOWED_HOSTS` survives).
- `grep -rn "WORKOS_COOKIE_PASSWORD" src/` → zero hits (PR-D D3).
- Only `WORKOS_SUPER_ADMIN_ORG_ID` survives in `env.ts`; `STAFF_SUPER_ADMIN_PROVIDER_USER_IDS` is gone (PR-C, Option A).
- `app.ts:155-171` IDOR guard still present; `app-bearer-mount.test.ts:247` still passes.
- Bearer and cookie auth agree on `isSuperAdmin` for the same user: both read `user.is_super_admin`, with no allowlist override (PR-C).
- `tenantContextMiddleware` non-prod path verified gated by auth.
- `docker-compose.deploy.yml` references no removed env vars; container boots clean.
- Boot prod against the new env. Login works for a staff super-admin (PR-C outcome verified). M2M JWT works. Connector API key works.
- The three legacy-auth route tests are either deleted or ported.
- The three ported regression guards exercise the device_code/standard machine paths and survive PR-B.
- `provision-personal-agent.md` is either deleted or remains as a stub redirecting to `agent-and-m2m-authentication.md`.
- The "Today" inventory in `architecture/13` §16.1 is updated to reflect the post-plan state.
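The grep-based checks could also be run as one scripted pass. The sketch below is hypothetical (`findViolations` and the `forbidden` table are illustrative names) and simplifies scope by applying every pattern to every file supplied, whereas the OIDC check above targets `src/api` only. One detail worth noting: `WORKOS_REDIRECT_URI[^_]` needs a character after the name, so an occurrence at the very end of a line would not match; the negative lookahead used here does not have that gap.

```typescript
interface ForbiddenPattern {
  label: string;
  pattern: RegExp; // no /g flag, so .test() carries no state between calls
}

const forbidden: ForbiddenPattern[] = [
  { label: "OIDC", pattern: /OIDC|oidc/ },
  { label: "personal-agent", pattern: /personal-agent/ },
  // (?!_) excludes WORKOS_REDIRECT_URI_ALLOWED_HOSTS and still matches at end of line.
  { label: "WORKOS_REDIRECT_URI", pattern: /WORKOS_REDIRECT_URI(?!_)/ },
  { label: "WORKOS_COOKIE_PASSWORD", pattern: /WORKOS_COOKIE_PASSWORD/ },
];

// Takes a map of path -> file content and returns "path:line: label" entries,
// one per forbidden pattern found on a line.
function findViolations(files: Record<string, string>): string[] {
  const hits: string[] = [];
  for (const [path, content] of Object.entries(files)) {
    content.split("\n").forEach((line, index) => {
      for (const { label, pattern } of forbidden) {
        if (pattern.test(line)) {
          hits.push(`${path}:${index + 1}: ${label}`);
        }
      }
    });
  }
  return hits;
}
```

An empty result corresponds to the "zero hits" criteria; loading the file contents from disk is left out so the sketch stays self-contained.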
Tracking
| PR | Description | Status |
|---|---|---|
| PR-A | #822: `??` → `Boolean(...) \|\| user.is_super_admin` (closes #821) | Shipped |
| PR-B | Delete personal-agent bridge | Blocked on Ivan's Slack confirmation |
| PR-C | One super-admin signal — Option A (WorkOS membership lookup at callback) | Verification done 2026-05-08; B ruled out empirically; A chosen for long-term simplicity. Ready to implement. |
| PR-D | env.ts cleanup (OIDC + legacy auth + REDIRECT_URI + COOKIE_PASSWORD) | Blocked on PR-C |
| PR-E | OPTIONAL — STAGING/PROD prefix collapse | Parked |
Related
- Architecture: 13 — Authentication and User Management §16
- Runbook: Agent and M2M Authentication
- WorkOS Production Configuration
- Provisioning a personal-agent (DEPRECATED stub)
- WorkOS Auth Implementation Plan (2026-04-09)
- ADR-016: Multi-Tenant Authentication Architecture
- ADR-017: WorkOS as Authentication Provider