
Auth Simplification Plan

Status: Active. PR #822 shipped the bearer-downgrade hotfix (issue #821) — that's the regression #816 introduced, not part of this plan's architectural work.

Goal: collapse the auth env-var surface from 14 per container (post-#816) to ~7, kill dead code, and stop the "add a secret instead of fixing the legacy path" pattern.

Shape: four PRs. Earlier drafts had six sequenced steps; that mirrored the accretion pattern this plan exists to fix. The four-PR version is the simplification applied to itself.

Out of scope: webhook receiver wiring, connector API key system (#645 stays), CF Access bypass for visual-review, real new features, the STAGING_CI_M2M Connect App. Note on STAGING_CI_M2M: it is a service-principal Connect App (no principalUserId) and is the intended shape for headless service auth — see agent-and-m2m-authentication §Path 2. What this plan deletes is the identity-bridging pattern (personal-agent-*), not service principals.


Why this matters

Every duplicated config slot in this surface has produced a real production incident in the last 60 days. This is not about aesthetics:

  • The optionalNonEmpty shim in agent-clients.ts:60-63 exists because of #732: ${VAR:-} expansion in docker-compose passed empty values for the unused-prefix pair and crashed boot.
  • The redirect-URI allowlist accreted across #368, #810, #813 — three PRs solving facets of one duplicated config.
  • The cookie-password split is structurally divergent: index.ts:60 configures iron-session via SESSION_COOKIE_PASSWORD; env.ts:333 builds the WorkOS provider with WORKOS_COOKIE_PASSWORD directly. If the two diverge, the seal and the provider's internal cookie machinery use different secrets.
  • #821 / PR #822 — the ?? form in bearer-token-middleware.ts:561 silently stripped DB super-admin from any user not on STAFF_SUPER_ADMIN_PROVIDER_USER_IDS. Bearer auth disagreed with cookie auth. Fixed.

Removing each duplicate removes the class of bug, not just one bug.


Reasoning constraints

  • Each item is reduction, not addition. No new env vars, no new abstractions, no new providers.
  • No silent fallback shims. Each step removes the old path entirely after the new path lands.
  • Fail loud over silent fallback. Preflight must error, not pick up an empty value.
  • Bundle interdependent changes. Three changes that touch the same lines of env.ts should be one PR, not three sequenced.
  • Every PR pairs with a docs PR or explicit deprecation note. Where stale docs already exist, update or stub them in the same change.
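
The fail-loud constraint can be sketched as a small preflight helper. This is a minimal illustration, not the code in env.ts: `requireNonEmpty` and `preflight` are hypothetical names, and the env object is passed in explicitly so the sketch stays self-contained.

```typescript
// Hypothetical preflight helpers; names are illustrative, not taken from env.ts.
type Env = Record<string, string | undefined>;

function requireNonEmpty(env: Env, name: string): string {
  const value = env[name];
  // docker-compose's ${VAR:-} expands an unset variable to "", so an empty
  // string is treated exactly like a missing variable.
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required env var: ${name}`);
  }
  return value;
}

function preflight(env: Env, required: readonly string[]): void {
  // Collect every failure so a single run reports all missing slots.
  const missing = required.filter((name) => {
    try {
      requireNonEmpty(env, name);
      return false;
    } catch {
      return true;
    }
  });
  if (missing.length > 0) {
    throw new Error(`Preflight failed, missing: ${missing.join(", ")}`);
  }
}
```

The point of the design is that there is no default branch: an empty value stops the boot instead of selecting a fallback path.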

Container env-var inventory (concrete)

The numbers in this plan are not abstract. Today's 14 → target ~7 means these specific env vars:

Today (post-#816), per container

| Env var | Purpose | Plan disposition |
| --- | --- | --- |
| AUTH_PROVIDER | Provider selection (workos/dev) | Keep |
| WORKOS_API_KEY | WorkOS server-side API auth | Keep |
| WORKOS_CLIENT_ID | Main user-session OAuth app | Keep |
| WORKOS_AUTHKIT_DOMAIN | M2M JWT issuer + JWKS | Keep |
| WORKOS_REDIRECT_URI | Legacy single-host redirect (#368) | Delete in PR-D (covered by allowlist) |
| WORKOS_REDIRECT_URI_ALLOWED_HOSTS | Per-request derived redirect | Keep, becomes mandatory |
| WORKOS_COOKIE_PASSWORD | WorkOS provider cookie config | Delete in PR-D (consolidated into SESSION_COOKIE_PASSWORD) |
| SESSION_COOKIE_PASSWORD | iron-session seal | Keep, becomes the one slot |
| WORKOS_SUPER_ADMIN_ORG_ID | Org-based super-admin signal | Keep: canonical signal under Option A (decided in PR-C) |
| STAFF_SUPER_ADMIN_PROVIDER_USER_IDS | Per-user staff allowlist (added by #816) | Delete in PR-C (Option A) |
| WORKOS_WEBHOOK_SECRET | Webhook receiver auth (placeholder; receiver not wired) | Keep (stub, no plan disposition) |
| STAGING_/PROD_WORKOS_APP_CLAUDECODE_CLIENT_ID/SECRET (4 secrets total in env, 2 per container at runtime via NODE_ENV branching) | Staff CLI device_code Connect App | Optional: PR-E (collapse to unprefixed) |
| STAGING_/PROD_PERSONAL_AGENT_IVAN_CLIENT_ID/SECRET/PRINCIPAL_USER_ID (6 secrets total, 3 per container at runtime) | Personal-agent bridge for one staff member | Delete in PR-B |
| REQUIRE_AUTH | Dev-bypass gate | Keep |

The OIDC_*, JWT_*, ALLOWED_API_KEYS, API_KEY_HEADER env vars are listed in env.ts but dead at runtime — they're deleted in PR-D as part of the env.ts cleanup.

After PR-B (delete bridge), per container

Removed: 3 personal-agent vars. Net: 11.

After PR-C (one super-admin signal), per container

Removed: STAFF_SUPER_ADMIN_PROVIDER_USER_IDS (Option A keeps WORKOS_SUPER_ADMIN_ORG_ID as the canonical signal). Net: 10.

After PR-D (env.ts cleanup), per container

Removed: WORKOS_REDIRECT_URI, WORKOS_COOKIE_PASSWORD, plus the OIDC/JWT/API-key dead env vars. Net: ~8.

After PR-E (optional STAGING/PROD prefix collapse), per container

Removed: the STAGING_/PROD_ prefix on WORKOS_APP_CLAUDECODE_*. The container still receives 2 vars, so the count does not change, but the runtime branch and the optionalNonEmpty shim go away. Net: ~7-8.

The "floor" is WORKOS_API_KEY + WORKOS_CLIENT_ID + WORKOS_AUTHKIT_DOMAIN + AUTH_PROVIDER + SESSION_COOKIE_PASSWORD + the survived super-admin signal + REQUIRE_AUTH + the agent-client pair = ~7-8 vars. Three are orthogonal facts that can't collapse further.


The four PRs

PR-A (already shipped): #822 hotfix for #821


PR-B: Delete personal-agent bridge ──────────┐
  │ (independent of PR-C)                    │
  ▼                                          ├─→ PR-D: env.ts cleanup
PR-C: One super-admin signal ────────────────┘   (S2 + S4 + S5 bundled: same file, same lines)


PR-E (OPTIONAL): STAGING/PROD prefix collapse
— park unless deploy YAML is being touched anyway

PR-B and PR-C are independent (the bridge mechanism doesn't touch the super-admin signal). They can land in either order; PR-B first is recommended because it's the highest-leverage single change (~−800 LOC test, 6 secrets removed). PR-D depends on PR-C only because PR-C's outcome determines which super-admin env vars get deleted.


PR-A — Bearer-downgrade hotfix (SHIPPED)

Status: filed as PR #822, closes #821.

bearer-token-middleware.ts:561 switched from ?? to Boolean(...) || user.is_super_admin. The allowlist promotes; never demotes. Bearer and cookie now agree on isSuperAdmin for the same user.
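
The difference between the two forms is easy to miss, so here is a minimal sketch of it. The exact expression at line 561 is not reproduced in this plan; `UserRow`, `resolveBefore`, and `resolveAfter` are hypothetical names, and the allowlist is modeled as a plain Set.

```typescript
// Hypothetical shapes; the real middleware types are not shown in this plan.
interface UserRow {
  providerUserId: string;
  is_super_admin: boolean;
}

// Pre-#822 shape: Set.has() returns false (never null or undefined) for a user
// who is absent from a configured allowlist, so `??` never reaches the DB flag.
function resolveBefore(user: UserRow, allowlist?: ReadonlySet<string>): boolean {
  return allowlist?.has(user.providerUserId) ?? user.is_super_admin;
}

// Post-#822 shape: the allowlist can only promote, never demote.
function resolveAfter(user: UserRow, allowlist?: ReadonlySet<string>): boolean {
  return Boolean(allowlist?.has(user.providerUserId)) || user.is_super_admin;
}

const dbAdmin: UserRow = { providerUserId: "user_not_listed", is_super_admin: true };
const staffAllowlist = new Set(["user_listed"]);

console.log(resolveBefore(dbAdmin, staffAllowlist)); // false: DB super-admin silently demoted
console.log(resolveAfter(dbAdmin, staffAllowlist)); // true: agrees with the cookie path
```

`??` only falls through on null or undefined, which is why the bug appeared only once an allowlist was actually configured.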

This is the regression #816 introduced. Not part of the architectural simplification work — it's a fix for a bug that landed during the planning window. Listed first in the plan only because it's the temporal first PR; it does not advance the architectural goal.


PR-B — Delete the personal-agent bridge

Why: Six secrets, one staff member, zero non-interactive consumers confirmed across all sv0 repos and workflows. The original justification ("Telegram bots, SSH-from-anywhere") never materialized.

Prerequisite (you): explicit Slack confirmation from Ivan that no laptop-local script (cron, Telegram bridge in development, ad-hoc tooling) uses the bridge. The 800 LOC of behavioral test coverage means someone judged the bridge worth that investment, so confirm nothing still depends on it before deleting.

Changes:

  • src/api/auth/agent-clients.ts:147-203 — remove personal-agent-ivan staging+prod blocks, buildPersonalAgentIvanEntry, principalUserId field on AgentClientEntry.
  • src/api/middleware/bearer-token-middleware.ts:229-278 — remove the bridge branch in the M2M client_* path. Falls through to standard service-principal handling.
  • Test cleanup (~−800 LOC):
    • Delete test/api/auth/personal-agent-bridge.test.ts (344 LOC).
    • Delete test/api/integration/personal-agent-bridge-full-chain.test.ts (301 LOC).
    • Remove personal-agent describe blocks at test/api/auth/agent-clients.test.ts:190-342 (~150 LOC).
  • Delete scripts/validate-personal-agent-playwright.ts.
  • Workflows: remove STAGING_/PROD_PERSONAL_AGENT_IVAN_* from deploy-{dev,prod}.yml, deploy-instance.sh, docker-compose.deploy.yml.
  • GitHub: delete the 6 personal-agent secrets.
  • Docs: the deprecated provision-personal-agent runbook is already a stub pointing at agent-and-m2m-authentication. PR-B can either delete it entirely or leave the stub for one release; either is fine.

Port a small regression guard before deletion — three security invariants the bridge tests cover that, untested, could let the same bug class re-emerge:

  1. Tenant comes from JWT org_id, not from a header (currently at personal-agent-bridge.test.ts:218).
  2. Service-path preservation: a registered client_* token without principalUserId stays a machine principal, never silently becomes a delegated_agent.
  3. Scope intersection for delegated_agent (currently at personal-agent-bridge-full-chain.test.ts:234) — port to the device_code path.

Net: ~3 small new tests in bearer-token-middleware.test.ts (~50 LOC) replacing the bridge-specific files.
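
The three invariants above can be pictured as three small pure functions that the ported tests would pin down. This is a hedged model only: `TokenClaims`, `resolveTenant`, `classifyPrincipal`, and `effectiveScopes` are hypothetical names, and the assumption that device_code tokens carry a user id as `sub` is mine, not the plan's.

```typescript
// Hypothetical model of the post-PR-B bearer path. Type and function names are
// illustrative; the real ones live in bearer-token-middleware.ts.
interface TokenClaims {
  sub: string; // assumed: "client_*" for M2M tokens, a user id for device_code tokens
  org_id: string;
  scopes: string[];
}

type PrincipalKind = "machine" | "delegated_agent";

// Invariant 1: the tenant is read from the verified JWT, never from a header.
function resolveTenant(claims: TokenClaims, _headers: Record<string, string>): string {
  return claims.org_id;
}

// Invariant 2: a client_* subject is always a machine principal.
function classifyPrincipal(claims: TokenClaims): PrincipalKind {
  return claims.sub.startsWith("client_") ? "machine" : "delegated_agent";
}

// Invariant 3: a delegated agent gets only the scopes that both the token and
// the underlying user hold.
function effectiveScopes(claims: TokenClaims, userScopes: readonly string[]): string[] {
  const allowed = new Set(userScopes);
  return claims.scopes.filter((scope) => allowed.has(scope));
}
```

Each guard is a single assertion against one of these functions, which is how ~50 LOC can plausibly replace the bridge-specific files.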

LOC impact: ~−100 src, ~−800 test (with ~+50 ported), ~−50 deploy YAML.


PR-C — One canonical super-admin signal (DECIDED: Option A)

Today: three mechanisms running simultaneously: WORKOS_SUPER_ADMIN_ORG_ID (canonical per the prod gate at env.ts:240), @securityv0.com email-domain fallback (dev/test), STAFF_SUPER_ADMIN_PROVIDER_USER_IDS (added by #816).

Each was added because the prior one didn't quite fit a case at hand — exactly the pattern this plan exists to stop. PR-C ends with one mechanism live and the other two deleted.

Verification result (2026-05-08): AuthKit's authenticateWithCode returns organizationId: null even when the user is a member of the org, on personal-email logins. Confirmed empirically with Ivan's account and corroborated by the WorkOS SDK type. Option B is therefore not viable without also pinning organizationId in getLoginUrl (which would break multi-org switcher behavior for users in more than one org).

Decision: Option A. Option C (accept STAFF_SUPER_ADMIN_PROVIDER_USER_IDS as canonical) was considered and rejected: it's simpler to ship this week but every subsequent staff hire/offboarding requires updating a GitHub-secret allowlist and redeploying. The source of truth becomes a parallel manually-maintained list rather than the WorkOS org membership data the system is already maintaining for SSO and billing. Long-term simplicity wins.

Option A: implement membership lookup at callback

The canonical signal is org membership in the WorkOS org named by WORKOS_SUPER_ADMIN_ORG_ID, derived at request time from WorkOS — not from a local allowlist.

Changes:

  • src/api/auth/providers/workos-provider.ts — extend handleCallback to call GET /user_management/organization_memberships?user_id=<sub> after authenticateWithCode and return the user's org IDs as part of AuthCallbackResult. Add a small in-memory cache keyed by provider_user_id with a short TTL (5 minutes is fine — invalidation on logout/revoke is not required for super-admin computation; staleness reverts within one window).
  • src/api/routes/auth.ts:159-166 — replace the three-branch isSuperAdmin resolution with a single match: isSuperAdmin = result.organizationMemberships.some(m => m.organizationId === deps.superAdminOrgId). Drop the result.isSuperAdmin provider override path and the email-domain fallback path entirely.
  • src/api/middleware/auth-middleware.ts:160-167 (cookie path) — leave user.is_super_admin as the cached source. The DB is updated by the callback path each login from the canonical signal.
  • src/api/middleware/bearer-token-middleware.ts:556-562 — remove the superAdminProviderUserIds parameter from BearerTokenMiddlewareOptions entirely. The Boolean(...) || user.is_super_admin form from PR-A collapses to just user.is_super_admin (the override no longer exists). For delegated_agent contexts where the user record was JIT-upserted with is_super_admin: false before the membership lookup ran, the next login refreshes the DB row.
  • src/shared/config/env.ts — delete STAFF_SUPER_ADMIN_PROVIDER_USER_IDS schema entry, parseStaffSuperAdminUserIds parser, the validator that requires at least one super-admin signal in production (drop the || arm; WORKOS_SUPER_ADMIN_ORG_ID becomes unconditionally required in prod). Delete the email-domain fallback constant.
  • Workflows + docker-compose.deploy.yml — remove STAFF_SUPER_ADMIN_PROVIDER_USER_IDS env var injection from deploy-{dev,prod}.yml and deploy-instance.sh. Confirm WORKOS_SUPER_ADMIN_ORG_ID flows correctly in dev (it should already; double-check).
  • GitHub: delete STAFF_SUPER_ADMIN_PROVIDER_USER_IDS secret from both environments after a one-deploy-cycle safety window.
  • Tests:
    • Update bearer-token-middleware.test.ts super-admin override tests — most are deletable since the override is gone. Keep one test asserting cookie/bearer parity on user.is_super_admin.
    • Add a callback test: a user whose membership lookup returns the super-admin org gets is_super_admin: true upserted; a user whose memberships exclude that org gets is_super_admin: false.
  • Docs:
    • Update §16.1 row to "Decision: Option A; canonical signal is WorkOS org membership."
    • Update agent-and-m2m-authentication.md DON'T section #3 to reflect that there is now exactly one super-admin signal.

Cache TTL note: 5 minutes is the recommendation. The membership API call adds ~50-100ms to the callback path, but that's amortized over a session that lasts hours. In-process cache means no Redis dependency.
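
A minimal sketch of the lookup-plus-cache shape described above, under stated assumptions: `MembershipCache`, `FetchMemberships`, and `resolveIsSuperAdmin` are hypothetical names, the fetcher stands in for the WorkOS membership call, and the clock is injectable only so the TTL can be tested.

```typescript
// Hypothetical types; the real AuthCallbackResult shape is not shown in this plan.
interface OrganizationMembership {
  organizationId: string;
}

type FetchMemberships = (providerUserId: string) => Promise<OrganizationMembership[]>;

interface CacheEntry {
  expiresAt: number;
  memberships: OrganizationMembership[];
}

class MembershipCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly fetchMemberships: FetchMemberships;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(
    fetchMemberships: FetchMemberships,
    ttlMs: number = 5 * 60 * 1000, // 5 minutes, per the recommendation above
    now: () => number = () => Date.now(),
  ) {
    this.fetchMemberships = fetchMemberships;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  async get(providerUserId: string): Promise<OrganizationMembership[]> {
    const hit = this.entries.get(providerUserId);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.memberships; // still fresh, skip the API call
    }
    const memberships = await this.fetchMemberships(providerUserId);
    this.entries.set(providerUserId, { expiresAt: this.now() + this.ttlMs, memberships });
    return memberships;
  }
}

// The single-match rule: super-admin iff the user belongs to the configured org.
async function resolveIsSuperAdmin(
  cache: MembershipCache,
  providerUserId: string,
  superAdminOrgId: string,
): Promise<boolean> {
  const memberships = await cache.get(providerUserId);
  return memberships.some((m) => m.organizationId === superAdminOrgId);
}
```

There is no explicit invalidation here, which matches the note above: a stale entry corrects itself within one TTL window.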

Operational characteristics:

| Lifecycle event | Today (with allowlist) | After PR-C (Option A) |
| --- | --- | --- |
| New staff hire | Update STAFF_SUPER_ADMIN_PROVIDER_USER_IDS GitHub secret in dev + prod, redeploy both | Add to WorkOS org via dashboard. Done. |
| Staff offboarding | Remove from WorkOS org AND remove from secret list. Drift risk if step 2 forgotten. | Remove from WorkOS org. One step, no drift. |
| Cert / key rotation | n/a | n/a |
| Adding a new super-admin org (e.g., partner) | Not supported without code changes | Set WORKOS_SUPER_ADMIN_ORG_ID to a comma-separated list, parse, match (small follow-up if needed) |
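
If the multi-org follow-up in the last row is ever needed, the "parse, match" step could look roughly like this. It is a sketch only: `parseSuperAdminOrgIds` and `isSuperAdminForAny` are hypothetical names, and nothing in this plan commits to the comma-separated format beyond the table row.

```typescript
// Hypothetical helpers for a comma-separated WORKOS_SUPER_ADMIN_ORG_ID.
function parseSuperAdminOrgIds(raw: string | undefined): string[] {
  const ids = (raw ?? "")
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
  // Fail loud: an empty list would silently disable the only super-admin signal.
  if (ids.length === 0) {
    throw new Error("WORKOS_SUPER_ADMIN_ORG_ID must name at least one org");
  }
  return ids;
}

function isSuperAdminForAny(
  memberOrgIds: readonly string[],
  superAdminOrgIds: readonly string[],
): boolean {
  const allowed = new Set(superAdminOrgIds);
  return memberOrgIds.some((id) => allowed.has(id));
}
```

A single org ID is just the one-element case, so existing deployments would keep working unchanged under this shape.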

LOC impact: ~+30 src (membership lookup + cache), ~−60 src (delete allowlist parser, email-domain fallback, three-branch logic), ~+10 test (callback test), ~−40 test (bearer override tests delete), ~−15 deploy YAML. Net deletion.


PR-D — env.ts cleanup (S2 + S4 + S5 bundled)

Why bundle: all three changes touch the zod block at env.ts:240-291 and the workos provider config emission. Sequencing them means three PRs fighting for the same lines. One PR is the right shape.

Three things this PR deletes:

D1. OIDC provider + legacy authMiddleware + legacy JWT/ALLOWED_API_KEYS

OIDC is never deployed (oidc-provider.ts:32 throws on construction; production gate at env.ts:228 rejects AUTH_PROVIDER=oidc). Legacy authMiddleware (src/api/middleware/auth.ts) is the OLD pipeline; comment says "remove in Phase 5" — Phase 5 hasn't happened.

  • Delete src/api/auth/providers/oidc-provider.ts.
  • Delete src/api/middleware/auth.ts and JwtVerificationOptions.
  • env.ts: remove OIDC_*, JWT_*, ALLOWED_API_KEYS, API_KEY_HEADER schema entries; remove the "oidc" member of AuthProviderType; remove the OIDC validation block; remove the OIDC config emission block. Keep REQUIRE_AUTH — it gates the dev-bypass at auth-middleware.ts:76 and integration tests rely on it.
  • provider-factory.ts:63-71 — remove the oidc case.
  • app.ts:28 — remove the authMiddleware import.
  • Critical: preserve app.ts:155-171 — the #347 IDOR defense-in-depth guard (throws in prod if createApp is invoked without authProvider+storageAdapter). Test app-bearer-mount.test.ts:247 enforces it. Only remove the legacy app.use(authMiddleware(...)) call, not the prod-throw.
  • Delete .env.example OIDC section.
  • Test cleanup:
    • OIDC residue (~−40 LOC): describe blocks in workos-provider-verify.test.ts:222, provider-factory.test.ts:48, env.test.ts:98-105.
    • Legacy-auth route coverage (~−200 LOC): test/api/auth-ingest.test.ts:18, test/api/middleware/auth-jwt.test.ts:11, test/api/scan-runs-auth.test.ts:49. Read each end-to-end and decide per-file: delete (if behavior is covered elsewhere) or port (if it tests a route-level invariant the new path doesn't cover).
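
For orientation, the guard that the "Critical" item above preserves can be pictured as follows. This is a hypothetical reconstruction from the one-line description in this plan, not the code at app.ts:155-171; `CreateAppOptions` and `assertAuthWiring` are illustrative names.

```typescript
// Hypothetical shape of the #347 guard; the real one lives at app.ts:155-171.
interface CreateAppOptions {
  nodeEnv: string;
  authProvider?: unknown;
  storageAdapter?: unknown;
}

function assertAuthWiring(options: CreateAppOptions): void {
  const wired = options.authProvider !== undefined && options.storageAdapter !== undefined;
  // In production an app without both dependencies would serve routes with no
  // auth pipeline in front of tenant resolution, so refuse to construct it.
  if (options.nodeEnv === "production" && !wired) {
    throw new Error("createApp requires authProvider and storageAdapter in production");
  }
}
```

The legacy app.use(authMiddleware(...)) call and this prod-throw are separate statements, which is what makes it possible to delete one without touching the other.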

Verification — tenantContextMiddleware exposure: trace src/api/middleware/tenant-context.ts:30 and require-tenant.ts:71. Verify the new createSessionMiddleware + bearer pipeline gates everything before tenant resolution. If non-prod paths lack the equivalent gate, PR-D must add it before deleting authMiddleware.

Pre-step grep (mandatory):

grep -rn "method:.*api_key\|JwtVerificationOptions\|ALLOWED_API_KEYS\|JWT_JWKS_URI" src/ test/ scripts/

D2. Collapse WORKOS_REDIRECT_URI into WORKOS_REDIRECT_URI_ALLOWED_HOSTS

Single host is expressible as a one-element allowlist. The duplication produced #368/#810/#813.

  • env.ts: remove WORKOS_REDIRECT_URI schema entry; remove fallback branching; make WORKOS_REDIRECT_URI_ALLOWED_HOSTS required when AUTH_PROVIDER=workos.
  • workos-provider.ts — remove the legacy redirectUri field; always derive per-request from request host validated against allowlist.
  • Workflows: drop WORKOS_REDIRECT_URI. Allowlists:
    • Prod: app.securityv0.com
    • Dev: *.securityv0.com (depth-1 wildcard, matches pr-N-dev.securityv0.com). Keep this — Cloudflare free Universal SSL covers depth-1 only; *.dev.securityv0.com (depth-2) silently breaks in browsers.
  • deploy-instance.sh:227,228 — remove WORKOS_REDIRECT_URI line.
  • WorkOS AuthKit dashboard: no change needed if it currently registers https://*.securityv0.com/auth/callback.
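
A sketch of per-request derivation against the allowlist, assuming depth-1 wildcard semantics as described above. `hostMatches` and `deriveRedirectUri` are hypothetical names, the host is assumed to arrive without a port, and the /auth/callback path is taken from the dashboard registration mentioned in the last bullet.

```typescript
// Hypothetical matcher: "*.example.com" matches exactly one extra label.
function hostMatches(host: string, pattern: string): boolean {
  const h = host.toLowerCase();
  const p = pattern.toLowerCase();
  if (!p.startsWith("*.")) {
    return h === p; // exact-host entry, e.g. app.securityv0.com
  }
  const suffix = p.slice(1); // ".securityv0.com"
  if (!h.endsWith(suffix)) {
    return false;
  }
  const label = h.slice(0, h.length - suffix.length);
  // Depth-1 only: one non-empty label with no further dots.
  return label.length > 0 && !label.includes(".");
}

function deriveRedirectUri(requestHost: string, allowedHosts: readonly string[]): string {
  if (!allowedHosts.some((pattern) => hostMatches(requestHost, pattern))) {
    // Fail loud instead of falling back to a static redirect URI.
    throw new Error(`Host not in WORKOS_REDIRECT_URI_ALLOWED_HOSTS: ${requestHost}`);
  }
  return `https://${requestHost}/auth/callback`;
}
```

With this shape the prod allowlist is simply the one-element case, which is the sense in which the single-host variable is redundant.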

D3. Collapse WORKOS_COOKIE_PASSWORD into SESSION_COOKIE_PASSWORD

The two env vars are read by different code paths (see "Why this matters" above). The split exists for the OIDC path that no longer exists after D1.

  • env.ts:81, 70, 207-215 — remove WORKOS_COOKIE_PASSWORD, keep SESSION_COOKIE_PASSWORD as the single source. Required for any non-dev provider; dev auto-generates.
  • Workflows: rename secret WORKOS_COOKIE_PASSWORD → SESSION_COOKIE_PASSWORD per environment.
  • workos-provider.ts — read the cookie password from the unified slot.
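
The single-slot rule above (required for any non-dev provider, auto-generated in dev) might be sketched like this. `resolveSessionCookiePassword` is a hypothetical name, the generator is injected to keep the sketch free of runtime imports, and the 32-character minimum reflects iron-session's password length requirement.

```typescript
// Hypothetical resolver for the unified cookie-password slot.
type Env = Record<string, string | undefined>;

const MIN_PASSWORD_LENGTH = 32; // iron-session rejects shorter passwords

function resolveSessionCookiePassword(
  env: Env,
  authProvider: string,
  generateDevPassword: () => string, // assumption: caller supplies a secure generator
): string {
  const configured = env["SESSION_COOKIE_PASSWORD"];
  if (configured !== undefined && configured !== "") {
    if (configured.length < MIN_PASSWORD_LENGTH) {
      throw new Error(`SESSION_COOKIE_PASSWORD must be at least ${MIN_PASSWORD_LENGTH} characters`);
    }
    return configured;
  }
  if (authProvider === "dev") {
    return generateDevPassword(); // dev only: sessions need not survive a restart
  }
  // Any non-dev provider: no fallback, no second slot.
  throw new Error("SESSION_COOKIE_PASSWORD is required when AUTH_PROVIDER is not dev");
}
```

Both iron-session and the WorkOS provider would then receive the value this one function returns, so the two secrets cannot diverge.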

LOC impact (PR-D total): ~−250 src, ~−240 test (after the per-file delete-vs-port decisions in D1), ~−30 YAML.


PR-E (OPTIONAL) — Drop STAGING_/PROD_ prefix on agent-client envs

Park unless deploy YAML is already being touched for another reason. This is the lowest-leverage step in the original draft and the most operationally noisy.

The benefit (real): removes the optionalNonEmpty shim at agent-clients.ts:60-63 and the isProd branch in loadAgentClientRegistry. Eliminates a class of "accidentally added a STAGING_X because that's how the others looked" bugs.

The cost (also real): rename GitHub Secrets in two environments, update scripts/cli/auth.ts and scripts/lib/api-client.ts (both read prefixed names from a developer-local .env), ask every developer to rotate their local .env in a Slack message, plus the rollback trap requiring "wait one full prod-deploy cycle before deleting old secret names."

For a team of fewer than 10 contributors, the rename cost roughly equals the simplification benefit. Defer until a future PR is already in the same files and the prefix removal is incremental work. Don't run it as a standalone simplification PR.

If/when this lands, the changes are:

  • agent-clients.ts — collapse AGENT_CLIENT_ENV_SCHEMA to one set, delete optionalNonEmpty shim and isProd branching.
  • provider-factory.ts:20-26: readAgentCredential reads unprefixed names.
  • Workflows + docker-compose.deploy.yml + deploy-instance.sh — inject unprefixed names from the per-environment GitHub secrets.
  • scripts/cli/auth.ts:76-85, scripts/lib/api-client.ts:70-78 — read unprefixed names.
  • GitHub: rename secrets per environment; keep both names alive through one deploy cycle as rollback safety.
  • deploy-prod.yml preflight at L123-124 must validate the new unprefixed names. Don't ship without this.
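
To make the before/after concrete, here is a hedged sketch of the credential read with and without the prefix. `readPrefixedCredential` is a hypothetical stand-in for today's behavior, and the exact secret variable name is an assumption, since this plan abbreviates the pair as CLIENT_ID/SECRET.

```typescript
type Env = Record<string, string | undefined>;

interface AgentCredential {
  clientId: string;
  clientSecret: string;
}

// Assumed variable names; the plan only gives the abbreviated form
// WORKOS_APP_CLAUDECODE_CLIENT_ID/SECRET.
const CLIENT_ID_VAR = "WORKOS_APP_CLAUDECODE_CLIENT_ID";
const CLIENT_SECRET_VAR = "WORKOS_APP_CLAUDECODE_CLIENT_SECRET";

// Sketch of today's shape: the prefix depends on a runtime branch.
function readPrefixedCredential(env: Env, isProd: boolean): AgentCredential | undefined {
  const prefix = isProd ? "PROD_" : "STAGING_";
  const clientId = env[prefix + CLIENT_ID_VAR];
  const clientSecret = env[prefix + CLIENT_SECRET_VAR];
  // The unused-prefix pair can arrive as "", which is why a shim exists today.
  if (!clientId || !clientSecret) {
    return undefined;
  }
  return { clientId, clientSecret };
}

// Sketch of the post-PR-E shape: one pair, no branch, fail loud.
function readAgentCredential(env: Env): AgentCredential {
  const clientId = env[CLIENT_ID_VAR];
  const clientSecret = env[CLIENT_SECRET_VAR];
  if (!clientId || !clientSecret) {
    throw new Error(`${CLIENT_ID_VAR} and ${CLIENT_SECRET_VAR} must both be set`);
  }
  return { clientId, clientSecret };
}
```

The preflight in deploy-prod.yml would need to check the same two unprefixed names that the second function reads.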

Total impact

After PR-A through PR-D:

  • Container env vars: 14 → ~8.
  • Distinct WorkOS Apps in use: 6 → 4 (main, claude-code-staging, claude-code-prod, ci-staging-m2m). Three with PR-B applied if you also retire one of the claude-code variants (separate decision, not in this plan).
  • LOC delta: ~−400 src, ~−1,060 test, ~−100 deploy YAML.
  • Eliminated bug classes: the duplications that produced #732, #368, #810, #813. The ??-vs-|| divergence (#821) gone via PR-A. Bearer/cookie super-admin agreement enforced via PR-C.

Acceptance test for the whole plan

After PR-A through PR-D land:

  1. List secrets in GitHub Environment "dev". Count ≤ 8 auth-related entries.
  2. List secrets in "prod". Same count, parallel names.
  3. grep -rn "OIDC\|oidc" src/api → zero hits.
  4. grep -rn "personal-agent" src/ → zero hits (PR-B).
  5. grep -rn "WORKOS_REDIRECT_URI[^_]" src/ → zero hits (PR-D D2; only WORKOS_REDIRECT_URI_ALLOWED_HOSTS survives).
  6. grep -rn "WORKOS_COOKIE_PASSWORD" src/ → zero hits (PR-D D3).
  7. STAFF_SUPER_ADMIN_PROVIDER_USER_IDS is gone from env.ts; WORKOS_SUPER_ADMIN_ORG_ID is the only super-admin signal left (PR-C, Option A).
  8. app.ts:155-171 IDOR guard still present; app-bearer-mount.test.ts:247 still passes.
  9. Bearer and cookie auth agree on isSuperAdmin for the same user: after PR-C both read user.is_super_admin, with no allowlist override on either path.
  10. tenantContextMiddleware non-prod path verified gated by auth.
  11. docker-compose.deploy.yml references no removed env vars; container boots clean.
  12. Boot prod against the new env. Login works for a staff super-admin (PR-C outcome verified). M2M JWT works. Connector API key works.
  13. The three legacy-auth route tests are either deleted or ported.
  14. The three ported regression guards exercise the device_code/standard machine paths and survive PR-B.
  15. provision-personal-agent.md is either deleted or remains as a stub redirecting to agent-and-m2m-authentication.md.
  16. The "Today" inventory in architecture/13 §16.1 is updated to reflect the post-plan state.
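
The grep-based items (3 through 6) could also be expressed as one residue scan, for example inside a CI check. This is a sketch under assumptions: `scanSources` and `Finding` are hypothetical names, and the caller is expected to load the right file set (src/api for item 3, src/ for the rest).

```typescript
interface Finding {
  file: string;
  line: number; // 1-based
  rule: string;
}

const FORBIDDEN: ReadonlyArray<{ rule: string; pattern: RegExp }> = [
  { rule: "OIDC residue (item 3)", pattern: /OIDC|oidc/ },
  { rule: "personal-agent residue (item 4)", pattern: /personal-agent/ },
  // (?!_) also catches an occurrence at end of line, which the grep form
  // WORKOS_REDIRECT_URI[^_] would miss because [^_] must consume a character.
  { rule: "legacy WORKOS_REDIRECT_URI (item 5)", pattern: /WORKOS_REDIRECT_URI(?!_)/ },
  { rule: "WORKOS_COOKIE_PASSWORD (item 6)", pattern: /WORKOS_COOKIE_PASSWORD/ },
];

// files maps a path to that file's full text; reading from disk is left to the caller.
function scanSources(files: ReadonlyMap<string, string>): Finding[] {
  const findings: Finding[] = [];
  for (const [file, content] of files) {
    content.split("\n").forEach((text, index) => {
      for (const { rule, pattern } of FORBIDDEN) {
        if (pattern.test(text)) {
          findings.push({ file, line: index + 1, rule });
        }
      }
    });
  }
  return findings;
}
```

An empty result corresponds to "zero hits" on all four greps; any finding names the file, line, and checklist item it violates.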

Tracking

| PR | Description | Status |
| --- | --- | --- |
| PR-A | #822: `??` → `Boolean(...) \|\| user.is_super_admin` | Shipped |
| PR-B | Delete personal-agent bridge | Blocked on Ivan's Slack confirmation |
| PR-C | One super-admin signal: Option A (WorkOS membership lookup at callback) | Verification done 2026-05-08; B ruled out empirically; A chosen for long-term simplicity. Ready to implement. |
| PR-D | env.ts cleanup (OIDC + legacy auth + REDIRECT_URI + COOKIE_PASSWORD) | Blocked on PR-C |
| PR-E | OPTIONAL: STAGING/PROD prefix collapse | Parked |