SecurityV0 — Comprehensive Architecture & Security Audit Report

Executive Summary

SecurityV0 is a well-conceived Autonomous Execution Exposure Management platform built on sound architectural principles: deterministic findings, evidence-grade audit trails, and temporal drift detection. The core pipeline is solid. However, the audit uncovered 2 critical security vulnerabilities, 7 ship-blocking implementation bugs, and significant scalability risks that must be addressed before the platform can claim production readiness.

0. Architecture Decisions — Critical Review

This section reviews the strongest criticisms of each major architectural decision. Some decisions hold up under scrutiny; others have real structural problems.

0.1 MongoDB for Graph Storage — The Evidence Immutability Claim Is False

The ADRs claim: MongoDB stores immutable evidence packs via SHA256 hashes.

The reality: The SHA256 hash is stored in the same mutable MongoDB collection as the content it is supposed to protect. A database administrator — or a compromised service account with write access — can modify both the content and the hash in a single operation. MongoDB has no append-only collection mode, no WORM storage, and no write-once semantics. The immutability is a convention enforced by application code, not by the database.

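Why this matters for a security product: When a customer challenges the integrity of a finding during an incident response, SecurityV0's answer is "trust us." There is no cryptographic proof that the finding wasn't modified after the fact. NIST 800-53 AU-10 (non-repudiation) requires exactly this kind of proof, and SOC 2 auditors reviewing evidence integrity are likely to ask for an equivalent control.

Fix: Record evidence pack hashes in a store that the application's database credentials cannot rewrite. An append-only PostgreSQL table with triggers preventing UPDATE/DELETE is the cheapest starting point, but a DBA who can disable triggers can bypass it; Amazon QLDB or storage-level WORM gives stronger guarantees (the options are ranked in §9.1). Even the weakest option costs minimal operational effort and materially improves the compliance posture.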

0.2 Materialized Paths — The Write Amplification Is Worse Than Documented

The ADRs claim: Materialized paths provide O(1) blast radius queries. The scaling ceiling is ~10K identities.

The reality: The write cost is O(I × R × P × Res), where I = identities holding the changed role, R = roles per identity, P = permissions per role, and Res = resources per permission. When a role held by 3,000 identities changes its permissions, the materializer issues ~3.6M read operations (using the §9.1 example of 10 roles, 20 permissions per role, and 5 resources per permission) and rewrites accessible_by arrays across hundreds of resource documents, all non-atomically. A failure midway leaves the graph in an inconsistent state. The ADR treats "eventual consistency" as acceptable; for a security product whose core value proposition is blast radius queries, inconsistent state during sync means incorrect answers to the CISO's primary question.

The trigger for this breaking: It is not raw entity count. It is role fan-out. A single highly-shared role (like "Developer" held by 3,000 engineers) changing permissions triggers the storm. This happens at much smaller tenant sizes than 10K total identities.

The ADR should lower the Neo4j/Kuzu trigger from 10,000 identities to 5,000 — or more precisely, to any role with >1,000 holders.

0.3 Stateless Sessions for a Security Platform — Structurally Wrong

The ADR claims: iron-session provides secure, stateless encrypted cookies. The design is provider-independent.

The reality: iron-session stores an encrypted, self-contained session payload in a cookie — this is genuinely stateless from the server's perspective (no server-side lookup to decode the session). However, the middleware hits MongoDB on every request anyway to validate the user's current membership and permissions, so the server-side lookup is happening regardless. In that context, iron-session's statelessness provides no performance benefit, and it removes the ability to revoke an individual session: fire an employee, deactivate their WorkOS account, and their sv0_session cookie remains valid for up to 7 days.

Additional gap — logout is a no-op: workos-provider.ts:74 has an empty logout() method. Clearing the cookie does not revoke the session on WorkOS's side. A cookie exfiltrated before logout remains valid.

The 7-day TTL is inappropriate. AWS Console sessions are 1–12 hours. Security tooling industry practice is 8–24 hours for human sessions. A security platform storing CISO-grade findings should not have sessions that outlive most employees' work weeks.
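
Because the middleware already loads the user from MongoDB on every request, per-user revocation can ride on that lookup at no extra cost. The following is a minimal sketch, not the platform's implementation: it assumes the cookie payload carries an issuedAt timestamp and the user document carries a sessionsRevokedAt field. Both names and the 12-hour limit are hypothetical.

```typescript
// Hypothetical shapes: the decoded cookie payload and the user document.
interface SessionPayload { userId: string; issuedAt: number } // epoch ms
interface UserRecord { id: string; active: boolean; sessionsRevokedAt?: number }

// Assumed policy value; the text above suggests 8-24 hours for human sessions.
const MAX_SESSION_AGE_MS = 12 * 60 * 60 * 1000;

function isSessionValid(session: SessionPayload, user: UserRecord | undefined, now: number): boolean {
  // Deprovisioned or unknown users are rejected regardless of cookie validity.
  if (!user || !user.active || user.id !== session.userId) return false;
  // Hard upper bound on session age.
  if (now - session.issuedAt > MAX_SESSION_AGE_MS) return false;
  // Any session issued at or before the revocation timestamp is dead.
  if (user.sessionsRevokedAt !== undefined && session.issuedAt <= user.sessionsRevokedAt) return false;
  return true;
}
```

Setting sessionsRevokedAt to the current time on deprovisioning or logout would then invalidate every outstanding cookie for that user on its next request.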

0.4 The In-Process Job Queue — Production Incident Waiting to Happen

The ADR claims: The in-process FIFO queue is sufficient for MVP scale.

The reality: The WorkerJob[] array at runtime.ts:26 is unbounded (no backpressure), not persisted (lost on restart), and not recoverable (no dead letter queue). The shutdown() handler sets a flag and exits — if a sync is mid-flight at step 6 of 11 when the container restarts (deployment, OOM kill, crash), the sync stays in "running" status forever. There is no detection, no alerting, no recovery path.

Neither the event loop nor the MongoDB connection pool is the real problem. Sequential await calls release connections back to the pool between operations; they do not hold a connection across the full path materialization loop. Pool saturation would require multiple syncs running concurrently (each holding its own connections simultaneously), and the current serial in-process queue prevents exactly that by serializing syncs. The correct argument for replacing the queue is persistence and recovery: jobs are lost on restart and there is no dead letter queue.

The Express 4 async bug is real: Express 4 does not observe the promise returned by an async route handler, so a rejection never reaches the error-handling middleware. The request hangs until the client times out, and on Node 15+ an unhandled rejection terminates the process unless an unhandledRejection handler is registered. ingest.ts:160 has exactly this pattern. Express 5 forwards rejected promises to the error handler natively.
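
Until the Express 5 migration lands, the usual stopgap is a small wrapper that forwards rejections to next(). The sketch below uses minimal structural types instead of the express package so that it stands alone; asyncHandler is an illustrative name, not an existing helper in the codebase.

```typescript
// Minimal stand-ins for Express types so the sketch compiles without the package.
type Next = (err?: unknown) => void;
type AsyncHandler<Req, Res> = (req: Req, res: Res, next: Next) => Promise<unknown>;

function asyncHandler<Req, Res>(handler: AsyncHandler<Req, Res>) {
  return (req: Req, res: Res, next: Next): Promise<void> =>
    // Starting from Promise.resolve() also captures synchronous throws
    // that happen before the handler's first await.
    Promise.resolve()
      .then(() => handler(req, res, next))
      .then(() => undefined)
      // Forward any failure to the error-handling middleware.
      .catch((err: unknown) => next(err));
}
```

A route would then be registered as router.post("/path", asyncHandler(async (req, res) => { ... })), which keeps the handler body free of repetitive try/catch blocks.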

0.5 The ELK.js Web Worker — ADR and Code Are Contradictory

ADR-011 states explicitly: "The layout uses the Web Worker variant from day one (elkjs/lib/elk-worker.min.js) — since the API is async either way, using the worker costs no extra complexity and keeps the UI thread free for all graph sizes."

The actual code at layout.ts:1:

import ELK from "elkjs/lib/elk.bundled.js";  // main thread — blocks UI

This is not a gray area or a judgment call. The ADR says use the worker variant. The implementation uses the main thread variant. At 200+ nodes, layout computation freezes the UI for 150-400ms. The spinner overlay that displays during layout may not even render before the thread locks.

This is the easiest fix in the entire audit: a change of a few lines in one file, verifiable against the ADR.

0.6 SSH Deployment Key — Docker Group Membership Is Root

The deployment docs claim: Deployment uses a restricted deploy user for security.

The reality: The deploy user is in the Docker group (deployment.md:307-308: sudo usermod -aG docker deploy). Docker group membership is functionally equivalent to root — docker run -v /:/host ubuntu chroot /host gives a root shell on the host. A compromised DEPLOY_SSH_KEY (which is exposed to every GitHub Actions runner that touches this repo) gives the attacker full root access to the production server, MongoDB included.

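The kill chain is not theoretical. GitHub-hosted runners are ephemeral, but every step in a job that receives the secret can read it, including third-party actions and dependency install scripts. A supply chain attack on any dependency in the CI pipeline, or a compromised runner, exposes the key.

The fix is specific: Remove deploy from the Docker group. Use sudo with an allowlist of exactly two commands: docker compose pull and docker compose up -d in the platform directory. Nothing else. The compose file and its directory must be owned by root and not writable by deploy; otherwise the deploy user can edit the compose file to mount the host filesystem, and the allowlist is bypassed.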

0.7 `REQUIRE_AUTH` Defaults to False — The Insecure Default Ships

The deployment compose claims: Authentication is configurable.

The reality: docker-compose.deploy.yml:46 has REQUIRE_AUTH: "${REQUIRE_AUTH:-false}". The default is the insecure value. A deployment that forgets to set this environment variable — or a new engineer who spins up an instance following the compose file — gets a fully unauthenticated API where any caller can inject data into any tenant by setting the X-Tenant-Id header.

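The Zod schema in env.ts:18 defaults to "true", but that default only applies when the variable is absent. The compose substitution ${REQUIRE_AUTH:-false} always passes an explicit value, so in a compose deployment the Zod default never takes effect; it only protects deployments that bypass this compose file. Two defaults in two files that contradict each other amount to defense by accident. The compose file default should be true. Secure defaults must not require active choices.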
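
A fail-closed parser makes the intended behavior explicit. This is a minimal sketch in plain TypeScript rather than the Zod schema in env.ts; parseRequireAuth is a hypothetical name.

```typescript
// Hypothetical helper: authentication is only disabled by an explicit "false".
function parseRequireAuth(raw: string | undefined): boolean {
  // Unset or empty means secure: authentication stays on.
  if (raw === undefined || raw.trim() === "") return true;
  const value = raw.trim().toLowerCase();
  if (value === "true") return true;
  if (value === "false") return false;
  // Reject typos such as "flase" or "0" instead of guessing.
  throw new Error(`REQUIRE_AUTH must be "true" or "false", got "${raw}"`);
}
```

The parser alone does not close the gap: as long as the compose file substitutes ${REQUIRE_AUTH:-false}, the process receives an explicit "false". The compose default has to change as well.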

0.8 What the ADRs Got Right (Genuinely)

Several decisions hold up under scrutiny:

Python for connectors: Correct. boto3/msgraph-sdk ecosystem advantage is real. Go/TypeScript/Rust offer no practical advantage for I/O-bound batch API scanning. The GIL is irrelevant.
Docker Compose for current scale: Correct for now. The deploy-instance.sh multi-instance orchestration with Caddy hot-reload is well-engineered for the current 2-server footprint. It becomes a ceiling when rolling deployments, multi-host scaling, or cell provisioning automation are required — see §13.
WorkOS selection: Correct. Provider abstraction (AuthProvider interface) is clean. Exit path exists. Admin Portal alone justifies the choice for enterprise SSO onboarding.
10-type entity model: Correct. Not too universal. Adding a 4th connector is 90% connector work. The subtype system handles cloud-specific variation without fracturing the evaluator rules.
Rejecting Apache AGE (ADR-003): Correct. Variable-length path exponential blowup is architectural. AWS RDS still doesn't support it. Decision remains valid.
StorageAdapter abstraction: The single best decision in the codebase. 60+ methods behind a clean interface makes Kuzu, Neo4j, or any future migration feasible without touching connectors or evaluator rules.

1. Architecture Overview

What it is: A system of record for Non-Human Identity (NHI) execution authority. Answers the CISO question: "What can this automation actually do, who owns it, and what happened to its access?"

Stack: Node.js/TypeScript API + React 19 frontend + MongoDB + Python connectors (entra-servicenow, azure-foundry, aws)

Pipeline: 3-job sequential: sync_ingestion → evaluate_findings → build_evidence_pack (SHA256-sealed, immutable)

Entity model: 10 types — workload, connection, credential, identity, role, permission_set, permission, resource, owner, execution_evidence

15 finding rules in the evaluator (orphaned_ownership, scope_drift, dormant_authority, reachability_drift, llm_egress, etc.)

Architecture maturity: ~75% — pipeline solid, auth transition in-progress, SCIM/OAA deferred, ~40% of docs stale vs. current implementation.

2. Gray Zone #1 — Graph Storage & Scalability

MongoDB: Adequate for MVP, Breaking Point at ~10K Identities

The architecture uses materialized execution paths (pre-computed at sync time, O(1) blast radius queries) rather than real-time graph traversal. No $graphLookup anywhere — application-level BFS only. This is a deliberate, documented trade-off.

Scale ceiling:

Scenario	Latency	Limiting Factor
MVP (<1K identities, 2-3 connectors)	<100ms	—
Growth (5K identities)	100-500ms	Path recompute bottleneck
Scale (10K+ identities)	500ms-2s	Breaking point
Production (50K identities)	2-10s	Query timeouts, incomplete results

On graph database alternatives: The decision to stay on MongoDB is correct for now, but the reasoning in the ADRs is partially wrong.

The claim that "Neo4j is bad at rich document storage" is not a valid reason to avoid it — Neo4j handles property maps on nodes/edges adequately, and more importantly, PostgreSQL with JSONB handles document storage extremely well and is fast. A PostgreSQL-based alternative covers document storage, temporal queries (range types, tstzrange), and graph traversal (recursive CTEs or Apache AGE extension) in a single engine. AGE was rejected (ADR-003) for exponential blowup on variable-length paths — a real limitation — but that's a specific traversal argument, not a document storage argument.

The correct reasons to stay on MongoDB at current scale:

The StorageAdapter abstraction already makes migration low-cost whenever the trigger is hit
MongoDB is sufficient for <10K identities with the current materialized path model
Adding a graph engine before the breaking point is premature

When the 10K identity breaking point approaches, the real options are:

Option	Graph	Documents	Temporal	Ops Cost
Kuzu (embedded)	Cypher, fast analytics	Via MongoDB (hybrid)	Via MongoDB	Near-zero — no new service
PostgreSQL + AGE	OpenCypher, exponential blowup risk on deep paths	JSONB, excellent	Native range types	One service replaces MongoDB
PostgreSQL (recursive CTEs)	Depth-limited traversal only	JSONB, excellent	Native range types	One service replaces MongoDB
Neo4j	Best-in-class graph	Property maps (adequate)	Temporal plugin needed	High — dedicated server
Neptune	Gremlin/SPARQL	External only	External only	AWS lock-in

PostgreSQL is a legitimate and underrated option — it is not on the ADR radar but should be. A single Postgres instance with JSONB columns replaces MongoDB entirely, handles temporal queries natively, and graph traversal via recursive CTEs works for depth-limited paths (which is all SecurityV0 needs at MAX_AUTH_CHAIN_DEPTH). The StorageAdapter abstraction makes this migration just as feasible as Neo4j. Kuzu remains the lowest-friction first step (embedded, no new service, Cypher queries replace BFS loops).

Critical Code Bugs

Severity	Issue	File:Line
CRITICAL	Reverse-lookup BFS has no document limit — can pull 50K+ docs into memory	`subgraph-adapter.ts:158`
HIGH	Unbounded frontier growth — exponential blowup on high-degree nodes	`subgraph-adapter.ts:35`
HIGH	Stale execution paths when role GRANTS change — affected identities not re-materialized	`path-materializer.ts:40`
HIGH	`MAX_AUTH_CHAIN_DEPTH=1` — any 3-system chain (Entra → SN → Slack) is missed	`path-materializer.ts:17`
MEDIUM	Blast radius endpoint returns all paths with no pagination	`paths.ts:14`
MEDIUM	Visited set in DFS causes path aliasing via shared state across branches	`path-materializer.ts:110`
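
The path aliasing bug in the last row is typical of a DFS that shares one visited set across sibling branches: a node reached through the first branch is skipped in every later branch, so distinct paths to the same resource collapse into one. A minimal sketch of the per-branch alternative follows; it uses a plain adjacency map with hypothetical names and is not the code in path-materializer.ts.

```typescript
type Adjacency = Map<string, string[]>;

// Enumerates every simple path from start, up to maxDepth hops.
function enumeratePaths(graph: Adjacency, start: string, maxDepth: number): string[][] {
  const paths: string[][] = [];
  const walk = (node: string, path: string[], onPath: Set<string>): void => {
    // Only nodes already on THIS path are excluded, which prevents cycles
    // without hiding nodes that a sibling branch has already reached.
    const next = (graph.get(node) ?? []).filter((n) => !onPath.has(n));
    if (next.length === 0 || path.length - 1 >= maxDepth) {
      paths.push(path);
      return;
    }
    for (const n of next) {
      // Copy the path state per branch instead of mutating shared state.
      walk(n, [...path, n], new Set(onPath).add(n));
    }
  };
  walk(start, [start], new Set([start]));
  return paths;
}

// Diamond: A reaches D through both B and C.
const diamond: Adjacency = new Map([["A", ["B", "C"]], ["B", ["D"]], ["C", ["D"]]]);
console.log(enumeratePaths(diamond, "A", 5)); // [["A","B","D"], ["A","C","D"]]
```

Copying the set per branch costs memory proportional to path depth, which is small at the bounded depths SecurityV0 traverses.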

3. Gray Zone #2 — Data Model Universality

Verdict: NOT Too Universal

The 10-type model is well-differentiated along three orthogonal axes: functional role, scope binding, temporal nature. The permission_set type (ADR-014) correctly distinguishes IAM policy documents (ceiling constraints) from role grants.

However: 4 deterministic, silent failures (F1–F4 below) make key AWS features completely non-functional:

Ship-Blocking Bugs in AWS Connector

F1 — privilege_justification_gap returns 0 findings on all AWS data

path.resource_id is a MongoDB hex hash, never matches an ARN
Rule matching branch always fails for AWS sources
File: src/evaluator/rules/privilege-justification-gap.ts:48-50

F2 — CloudTrail extractor doesn't exist

cloudtrail_evidence initialized to [] in cli/main.py:146, never populated
dormant_authority rule fires on 100% of Lambda functions (no evidence ever found)
_transform_cloudtrail_evidence() exists but receives empty input, discards request_parameters and resources anyway
Tracked: sv0-connectors#31

F3 — Assumed-role ARN parser returns None for 80-90% of real AWS events

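Lambda, ECS, Step Functions, Bedrock all produce assumed-role ARNs of the form arn:aws:sts::<account-id>:assumed-role/RoleName/SessionName
Parser only handles the IAM shapes arn:aws:iam::<account-id>:role/... and arn:aws:iam::<account-id>:user/...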
All assumed-role evidence lands with entity_id: "" — ungroupable by workload
Fix is 5 lines: add elif ":assumed-role/" in arn: branch at transformer.py:1768
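
The parsing logic is easiest to see in isolation. The actual fix belongs in the Python transformer; the sketch below expresses the same branching in TypeScript for consistency with the platform code, and the Principal shape and function name are hypothetical.

```typescript
interface Principal {
  kind: "role" | "user" | "assumed-role";
  accountId: string;
  name: string;
  sessionName?: string;
}

function parsePrincipalArn(arn: string): Principal | null {
  // General shape: arn:partition:service:region:account-id:resource
  const parts = arn.split(":");
  if (parts.length < 6 || parts[0] !== "arn") return null;
  const service = parts[2];
  const accountId = parts[4] ?? "";
  const resource = parts.slice(5).join(":");
  const [type, ...rest] = resource.split("/");

  if (service === "iam" && rest.length > 0 && (type === "role" || type === "user")) {
    // IAM names may carry a path (role/service-role/MyRole); the name is the last segment.
    return { kind: type, accountId, name: rest[rest.length - 1] ?? "" };
  }
  if (service === "sts" && type === "assumed-role" && rest.length >= 2) {
    // assumed-role/RoleName/SessionName: group evidence by RoleName, keep the session.
    return { kind: "assumed-role", accountId, name: rest[0] ?? "", sessionName: rest.slice(1).join("/") };
  }
  return null;
}
```

Grouping assumed-role evidence by the role name rather than the session name is what would let execution evidence attach to the owning workload.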

F4 — AWS connector never sets normalized_action — all AWS execution path actions are "unknown"

path-materializer.ts:147 reads perm.properties.normalized_action to populate the actions array on every execution path
The Entra-ServiceNow and Azure-Foundry connectors both set normalized_action ("read", "write", "admin", "execute")
The AWS connector sets only properties.action (raw IAM string: iam:PassRole, iam:CreateRole, etc.) and never sets normalized_action
Result: every AWS execution path has actions: ["unknown"] — the raw IAM action is silently discarded by the materializer
Second reason F1 is broken: even after the resource_id matching fix, privilege_justification_gap's write-level action mismatch check (hasWriteActions()) would still never trigger on AWS data because it checks for "write", "admin", "delete" — not "unknown"
Blocks escalation detection: a future escalation_capable rule checking for IAM privilege-escalation actions (iam:PassRole, iam:CreateRole, sts:AssumeRole*) cannot work until this is fixed
scope_drift is NOT affected — it checks role additions against domain sensitivity, never reads path.actions
Not caught by any test: AWS connector tests only assert node/edge counts and subtype == "iam_permission", never check normalized_action; all path materializer and evaluator tests use hand-crafted entra_id fixtures with normalized_action explicitly set; no seed data includes AWS-sourced entities
Files: sv0-connectors/integrations/aws/src/sv0_aws/core/transformer.py:1619–1628 (sets action, not normalized_action), sv0-platform/src/ingestion/path-materializer.ts:147 (reads normalized_action with ?? "unknown" fallback, no attempt to read properties.action)
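
The primary fix is for the AWS connector to set normalized_action. As a defensive fallback, the materializer could also derive it from the raw action instead of defaulting straight to "unknown". The verb-prefix heuristic below is an assumption made for illustration, not an authoritative classification of IAM actions, and both function names are hypothetical.

```typescript
type NormalizedAction = "read" | "write" | "admin" | "execute" | "unknown";

// Heuristic mapping from a raw IAM action such as "iam:PassRole".
function normalizeIamAction(action: string | undefined): NormalizedAction {
  if (!action) return "unknown";
  const [service, verb] = action.split(":");
  if (!service || !verb) return "unknown";
  if (verb === "*") return "admin";
  if (/^(Get|List|Describe)/.test(verb)) return "read";
  if (/^(Invoke|Start|Run|Execute)/.test(verb)) return "execute";
  // Mutating IAM actions and role assumption are treated as admin-level.
  if (service === "iam" || (service === "sts" && verb.startsWith("AssumeRole"))) return "admin";
  return "write";
}

// Prefer the connector-supplied value; fall back to the raw action.
function resolveAction(props: { normalized_action?: string; action?: string }): string {
  return props.normalized_action ?? normalizeIamAction(props.action);
}
```

Whatever mapping is chosen, a connector test asserting that normalized_action is present on AWS permission nodes would close the test gap described above.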

Additional gaps:

permission_set platform materializer not updated — still traverses HAS_ROLE for AWS paths → incorrect via_roles on all AWS authority paths
Ownership mapping from AWS resource tags never implemented → all AWS identities ownership_state: unknown
resource_name never populated on AWS resource nodes
AWS IAM condition keys detected but not evaluated → authority paths over-report reachability; no conditions_not_evaluated flag on ExecutionPath to surface this

On "metadata-only vs. code analysis":

Structural authorization (what roles can reach what): ✅ works for Entra/ServiceNow; on AWS it is only accurate once F4 and the permission_set materializer gap are fixed
Behavioral (is identity actually used): ⚠️ blocked by F2/F3
Code vulnerability (injection, hardcoded secrets): ❌ out of scope, needs SAST/SCA connectors (future additive connector)

4. Gray Zone #3 — Connector Rate Limiting

Overall Risk: HIGH — Inconsistent throttling resilience across connectors

Connector	Risk	Primary Issue
AWS	MEDIUM	Good botocore adaptive retry — missing jitter
Azure Entra	HIGH	Sequential-only (12+ min for 500 SPs at 2 RPS); no explicit `Retry-After`
ServiceNow	CRITICAL	Offset pagination `break`s on 429 — no cursor resume
Azure Foundry	MEDIUM	Relies on SDK defaults — behavior unclear

Critical code findings:

servicenow_client.py:421 — if response.status_code != 200: break silently drops remainder of pagination on any 429
aws_client.py:276 — wait_time = 2**retry_count with no jitter → synchronized retry storms across tenants
No global rate-quota tracker — one large tenant's scan blocks others
No per-resource skip logic — one failed get_policy() fails the entire scan
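
The first two findings share a remedy: retry the same page on a 429, honor Retry-After when the server sends it, and add jitter to the fallback backoff. The connectors themselves are Python; the sketch below shows the control flow in TypeScript with injected fetchPage and sleep functions, all of which are hypothetical names.

```typescript
// Full jitter: a uniformly random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000, random: () => number = Math.random): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

interface Page<T> { status: number; items: T[]; retryAfterSec?: number }

async function fetchAllPages<T>(
  fetchPage: (offset: number, limit: number) => Promise<Page<T>>,
  sleep: (ms: number) => Promise<void>,
  limit = 100,
  maxRetries = 5,
): Promise<T[]> {
  const all: T[] = [];
  let offset = 0;
  let attempt = 0;
  for (;;) {
    const page = await fetchPage(offset, limit);
    if (page.status === 429) {
      // Fail loudly rather than return a silently truncated result set.
      if (attempt >= maxRetries) throw new Error(`rate limited at offset ${offset} after ${maxRetries} retries`);
      // Prefer the server's Retry-After hint; otherwise back off with jitter.
      await sleep(page.retryAfterSec !== undefined ? page.retryAfterSec * 1000 : backoffDelayMs(attempt));
      attempt++;
      continue; // retry the SAME offset
    }
    if (page.status !== 200) throw new Error(`unexpected status ${page.status} at offset ${offset}`);
    attempt = 0;
    all.push(...page.items);
    if (page.items.length < limit) return all; // a short page marks the end
    offset += limit;
  }
}
```

Throwing on exhaustion matters as much as the retry itself: a scan that ends early without an error is indistinguishable from a scan of a smaller tenant, which is how partial data ends up in a baseline.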

Rate limit exposure at medium scale (500 resources):

Service	Limit	Calls/Scan	Risk at 10K resources
AWS IAM	~20 RPS	500-1500	~15min sustained, retries cascade
Azure Graph API	2 RPS	600+	12+ min serial, any 429 stalls all
ServiceNow	2-4 RPS	200-250	No recovery on 429
Azure Foundry ARM	4 RPS	150	Unclear retry behavior

5. The Blocker — AI Agent Permissions & PII Access Graph

"Show new permissions graph when deploying AI agent with MCP servers, flag PII access"

What SecurityV0 Already Has

ai_agent workload subtype — already in entity model
5-level sensitivity classification propagates through authority paths
reachability_drift, scope_drift evaluator rules detect changes since baseline
reachable_sensitive_domain finding fires on PII-classified resource access
Deployment approval fully designed (research docs 2026-04-07-mcp-agentic-deployment-approval-research.md, 12-deployment-approval.md)

What's Missing — Implementation, Not Design

Gap	Notes
`mcp_tool` entity type + `DECLARES_TOOL` relationship	Tools currently invisible in graph
`data_domain` entity type + `ACCESSES` relationship	Business domain classification needed
MCP manifest parser (`mcp.json` → NormalizedGraph)	No parser exists
Graph projection algorithm (merge manifest → run materializer on projected state)	Core "what-if" engine
`POST /api/v1/deployment/preview` endpoint	Designed, not coded
PII output schema tracking on tool declarations	Resource-level exists; tool output level missing
Approval record storage + UI	Operating layer not built

Hard Problems (No Easy Solution)

MCP tool opacity — tools are blackboxes; declared ≠ actual. Mitigation: cryptographic manifest attestation, grade as "C" until runtime evidence
One identity per MCP server — all tools share service principal blast radius. SV0 detects; application architecture must fix
PII exfiltration tracking — tool output schema declaration partially solves; runtime inspection required for full coverage

6. Platform Security Audit

Critical Vulnerabilities

CRITICAL — Cross-Tenant IDOR via REQUIRE_AUTH Bypass

When REQUIRE_AUTH=false (the development default, and also the docker-compose.deploy.yml default per §0.7):

auth.ts:62-70 — sets req.auth = { tenantId: attacker-controlled }
tenant-context.ts:12-14 — reads tenant from auth, no membership validation
A connector can POST /api/v1/ingest/normalized-graph with X-Tenant-Id: victim-tenant and inject data into any tenant

The new auth-middleware.ts with WorkOS membership validation fixes this, but has not been deployed (app.ts:26-29 TODO).

CRITICAL — DevAuthProvider Has No Production Gate

dev-provider.ts:100-108 — returns valid super-admin session for any token when AUTH_PROVIDER=dev. If set in production, auth is completely bypassed.

Fix: provider-factory.ts must throw on AUTH_PROVIDER=dev && NODE_ENV=production.
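
A minimal sketch of that guard, written as a standalone function with a hypothetical name; where it is called from provider-factory.ts is left to the implementation.

```typescript
// Hypothetical guard: call once at startup, before any provider is constructed.
function assertProviderAllowed(provider: string, nodeEnv: string | undefined): void {
  if (provider === "dev" && nodeEnv === "production") {
    throw new Error("AUTH_PROVIDER=dev is not allowed when NODE_ENV=production");
  }
}
```

A stricter variant would allow the dev provider only when NODE_ENV is explicitly "development" or "test", which also covers deployments where NODE_ENV is unset.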

Full Severity Table

Severity	Issue	File:Line	Fix
CRITICAL	Cross-tenant IDOR via REQUIRE_AUTH bypass	`auth.ts:62-70`	Deploy new auth-middleware with membership check
CRITICAL	DevAuthProvider: no production gate	`dev-provider.ts:100-108`	Throw if `AUTH_PROVIDER=dev && NODE_ENV=production`
HIGH	Ingest: no cycle detection — evaluator infinite-loop risk	`ingest.ts:121-152`	DFS cycle check; max 100K nodes
HIGH	Connector reports: `.passthrough()` allows field injection	`ingest.ts:65-73`	Remove passthrough; ban `_`-prefixed fields
MEDIUM	Rate limiting per-tenant only — bypass by rotating tenant IDs	`rate-limit.ts:14-16`	Key on `${tenantId}:${principalId}`
MEDIUM	Path evaluator: no depth limit on ownership chain traversal	`path-evaluator.ts:127`	Max 10 levels; fail with `unresolved_ownership_depth`
MEDIUM	Session: no refresh token; 7-day TTL forces full re-auth	`session.ts:56-68`	Add `POST /auth/refresh`; 24h sliding window
MEDIUM	Silent entity overwrite without idempotency warning	`ingest.ts:160-206`	Warn if nodeIds exist in prior syncs
~~LOW~~	~~`q` search param not verified escaped before MongoDB regex~~	~~`entities.ts:48-50`~~	Finding retracted — `escapeRegex()` exists in `entity-adapter.ts` and is applied before every `$regex` query. No injection risk.
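
The cycle check recommended for ingest can be done with an iterative depth-first search, which avoids call-stack overflow on graphs near the proposed 100K-node cap. This is a sketch over a generic edge list; the Edge shape and function name are assumptions, not the ingest schema.

```typescript
interface Edge { from: string; to: string }

function hasCycle(nodeIds: string[], edges: Edge[]): boolean {
  const out = new Map<string, string[]>();
  for (const id of nodeIds) out.set(id, []);
  for (const e of edges) {
    const list = out.get(e.from);
    if (list) list.push(e.to);
    else out.set(e.from, [e.to]);
  }
  // Absent = unvisited, 1 = on the current DFS stack, 2 = fully explored.
  const state = new Map<string, 1 | 2>();
  for (const root of out.keys()) {
    if (state.has(root)) continue;
    const stack: Array<{ id: string; next: number }> = [{ id: root, next: 0 }];
    state.set(root, 1);
    while (stack.length > 0) {
      const frame = stack[stack.length - 1]!;
      const targets = out.get(frame.id) ?? [];
      if (frame.next >= targets.length) {
        state.set(frame.id, 2);
        stack.pop();
        continue;
      }
      const target = targets[frame.next++]!;
      const seen = state.get(target);
      if (seen === 1) return true; // back edge to a node on the current stack
      if (seen === undefined) {
        state.set(target, 1);
        stack.push({ id: target, next: 0 });
      }
    }
  }
  return false;
}
```

The check runs in time linear in nodes plus edges, so it can sit in ingest validation without changing the latency profile of a sync.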

Positive: Helmet enabled, CORS explicit, x-powered-by disabled, 5MB body limit, no hardcoded secrets, Zod validation throughout.

7. Master Weakness Table

See §12 (Updated Master Priority Table) for the complete, reconciled finding list. §12 supersedes this section and includes findings from the full technology validation in §9–11. The table below is an early-pass summary retained for cross-reference with the section findings above.

Pri	Category	Issue	Status
1	Security	Cross-tenant IDOR via REQUIRE_AUTH=false	Ship blocker
2	Security	DevAuthProvider no production gate	Ship blocker
2a	Security	`verifyM2MToken()` returns null — every Bearer-token M2M auth path is completely unenforced	Ship blocker
3	AWS Connector	CloudTrail extractor not implemented	Ship blocker
4	AWS Connector	Assumed-role ARN parsing broken (80-90% events)	Ship blocker
5	AWS Connector	`privilege_justification_gap` always 0 on AWS	Ship blocker
5a	AWS Connector	`normalized_action` never set — all AWS execution path actions are `"unknown"`	Ship blocker
6	AWS Platform	`permission_set` materializer not updated	Ship blocker
7	Graph DB	BFS reverse lookup: no document limit	Pre-scale blocker
8	Connector	ServiceNow pagination: no cursor resume on 429 — corrupts baselines permanently	Ship blocker
9	Graph DB	Stale paths on role GRANTS change	Correctness gap
10	Graph DB	MAX_AUTH_CHAIN_DEPTH=1 — 3-system chains missed	Feature gap
11	Security	Ingest: no cycle detection	Hardening
12	Security	`.passthrough()` allows field injection	Hardening
13	Connector	AWS backoff: no jitter	Pre-scale hardening
14	AWS Connector	Ownership not mapped from resource tags	Feature gap
15	AWS Connector	IAM conditions not evaluated; no caveat flag	Feature gap
16	MCP Feature	`mcp_tool`, manifest parser, graph projection missing	Phase 1 feature
17	Evaluator	No escalation/impersonation detection — roles with `iam:PassRole`, `roleAssignments/write`, `actAs` are invisible	Feature gap
18	Security	Rate limiting per-tenant only	Hardening
19	Docs	~40% of architecture docs stale	Operational risk

8. Prioritized Action Plan

Critical — Security and Data Integrity

Deploy auth-middleware.ts pipeline — fixes IDOR
Add AUTH_PROVIDER=dev && NODE_ENV=production guard
Fix assumed-role ARN parser — 5-line fix at transformer.py:1768
Fix ServiceNow pagination cursor resume

AWS Connector

Implement CloudTrail extractor (sv0-connectors#31)
Fix _transform_cloudtrail_evidence to preserve request_parameters + resources
Update platform materializer for HAS_PERMISSION_SET traversal on AWS
Add .limit(query.limit) to BFS reverse lookup

Correctness and Hardening

Ownership mapping from AWS resource tags
conditions_not_evaluated caveat flag on ExecutionPath
Cycle detection in ingest schema validation
Jitter on AWS backoff; MAX_AUTH_CHAIN_DEPTH → 2
Rate limit key: ${tenantId}:${principalId}

MCP / AI Agent Feature

mcp_tool entity + DECLARES_TOOL / ACCESSES relationships
MCP manifest parser
Graph projection algorithm + POST /api/v1/deployment/preview
Approval record storage

Parallel Track

Event-driven delta sync (CloudTrail streaming, Entra odata.deltaLink)
Session refresh token endpoint
Documentation refresh for 00-overview, 04-api, 07-ui

9. Technology Validation — Architecture Review (April 12, 2026)

9.1 Graph & Database Layer

Verdict: MongoDB + ADRs validated for current scale, with two key gaps: evidence immutability is structurally broken; Kuzu is a viable in-process alternative for path queries that hasn't been evaluated.

Confirmed / Adjusted

Decision	Status	Adjustment
MongoDB for MVP	VALIDATED	Correct for current scale
Single entities collection (ADR-002)	VALIDATED WITH CAVEAT	Implement `accessible_by` overflow collection before 5K identities/tenant
Materialized paths strategy	VALIDATED WITH CAVEAT	Write amplification is O(I × R × P × Res) — role fan-out is the real scaling cliff, not raw entity count
No $graphLookup	VALIDATED	Application-level BFS is the documented trade-off; `$graphLookup` has depth limits, no shortest-path support, and does not address SecurityV0's bounded-hop traversal pattern better than the materialized path approach
Reject Apache AGE (ADR-003)	VALIDATED	Variable-length path exponential blowup is architectural; AWS RDS still unsupported
Neo4j trigger threshold	LOWER to 5K	Original ADR said 10K; role fan-out write amplification hits at ~3K identities sharing a common role

New Findings

Write amplification formula: When a role held by I identities changes permissions, with each identity holding R roles, P permissions per role, and Res resources per permission:

Read operations: I × (1 + R + R×P + R×P×Res)
Write operations: I writes (identity docs) + R×P×Res writes (resource accessible_by arrays)
At 3,000 identities × 10 roles × 20 permissions × 5 resources: 3,000 × 1,211 ≈ 3.6M read ops and 3,000 + 1,000 = 4,000 write ops per role change
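
The two formulas are easy to evaluate for other tenant shapes. The helper below is a hypothetical calculator, not platform code; evaluated directly for the example inputs it yields about 3.6M reads and 4,000 writes.

```typescript
interface FanOut {
  identities: number;             // I: identities holding the changed role
  rolesPerIdentity: number;       // R
  permissionsPerRole: number;     // P
  resourcesPerPermission: number; // Res
}

function estimateMaterializationCost(f: FanOut): { reads: number; writes: number } {
  const { identities: I, rolesPerIdentity: R, permissionsPerRole: P, resourcesPerPermission: Res } = f;
  // Reads per identity: the identity, its roles, their permissions, their resources.
  const readsPerIdentity = 1 + R + R * P + R * P * Res;
  return {
    reads: I * readsPerIdentity,
    // One write per identity document plus one per touched accessible_by array.
    writes: I + R * P * Res,
  };
}

const cost = estimateMaterializationCost({
  identities: 3000,
  rolesPerIdentity: 10,
  permissionsPerRole: 20,
  resourcesPerPermission: 5,
});
console.log(cost); // { reads: 3633000, writes: 4000 }
```

Reads scale linearly with I, which is why a single widely held role dominates the cost regardless of total tenant size.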

Critical evidence immutability gap (HIGH for a security product):

SHA-256 hash stored alongside mutable data in the same MongoDB collection
Database admin can modify both content and hash in one operation
No chain-of-custody linking evidence records
No external trust anchor (Merkle tree, blockchain timestamp, signed receipt)
Mitigation options (in order of trustworthiness): (1) Sigstore transparency log or Amazon QLDB — cryptographic proof that a log entry existed at a specific time, verifiable by third parties, not bypassable by a DBA; (2) S3 Object Lock (WORM) — append-only at the storage layer, independent of application code; (3) append-only PostgreSQL trigger table — weakest option because a DBA with DISABLE TRIGGER permission can bypass it; application-enforced immutability has the same trust level as MongoDB convention
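
The chain-of-custody gap in the list above can be illustrated with a hash-linked log, in which each entry commits to its predecessor. This is a minimal in-memory sketch with hypothetical names (EvidenceLogEntry, appendEntry, verifyChain), not a storage design.

```typescript
import { createHash } from "node:crypto";

interface EvidenceLogEntry {
  seq: number;
  packHash: string;  // SHA-256 of the evidence pack content
  prevHash: string;  // entryHash of the previous entry ("" for the first)
  entryHash: string; // SHA-256 over seq, packHash and prevHash
}

const sha256 = (s: string): string => createHash("sha256").update(s).digest("hex");

function appendEntry(log: EvidenceLogEntry[], packContent: string): EvidenceLogEntry {
  const prevHash = log[log.length - 1]?.entryHash ?? "";
  const seq = log.length;
  const packHash = sha256(packContent);
  const entry = { seq, packHash, prevHash, entryHash: sha256(`${seq}|${packHash}|${prevHash}`) };
  log.push(entry);
  return entry;
}

function verifyChain(log: EvidenceLogEntry[]): boolean {
  let prevHash = "";
  for (const [i, e] of log.entries()) {
    // Each entry must sit at its own index and point at its predecessor.
    if (e.seq !== i || e.prevHash !== prevHash) return false;
    // Recompute the entry hash to detect in-place edits.
    if (e.entryHash !== sha256(`${e.seq}|${e.packHash}|${e.prevHash}`)) return false;
    prevHash = e.entryHash;
  }
  return true;
}
```

A chain on its own only makes tampering detectable by whoever holds a trusted copy of the latest entryHash; an administrator who can rewrite the whole log can rebuild the chain. That is why the head hash still needs one of the external anchors listed above.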

Kuzu as embedded Neo4j alternative:

Kuzu is an embedded in-process OLAP graph database (like DuckDB for graphs) with OpenCypher support
Zero additional infrastructure — embedded library, ~50MB binary addition
Eliminates path materialization write amplification by computing paths at query time
Would replace execution_paths + accessible_by embedded arrays entirely
The StorageAdapter abstraction already enables this via a MongoKuzuStorageAdapter implementation
2026 maturity: Adequate for analytics workloads; not recommended as primary transactional store
Verdict: Worth prototyping as an analytics layer over MongoDB for path queries — see Section 11 gray zone analysis

Bi-temporal gap: The platform tracks valid time (valid_at/expired_at) but transaction time is implicit. This matters for "did we know about this identity BEFORE the breach?" queries — a common compliance requirement.

9.2 API Runtime & Job Queue

Verdict: Replace in-memory queue with BullMQ + Redis. Upgrade Node 20 → 22. Express 5 for async error handling. The primary risk is OOM from the unbounded queue and lost jobs on restart — not event loop blocking or connection pool starvation. (Sequential awaits release connections between operations; pool saturation would require concurrent syncs, which the serial queue prevents.)

Critical Bugs Found

Severity	Issue	File	Fix
HIGH	Async route handlers in Express 4 have no try/catch — unhandled rejection hangs request indefinitely	`ingest.ts:160`	Add try/catch or `express-async-errors`
HIGH	In-memory queue: `WorkerJob[]` unbounded, lost on process restart	`runtime.ts:26`	Bounded queue + job persistence
MEDIUM	Shutdown handler kills mid-flight jobs — 30s timeout + `process.exit(1)` can corrupt sync state	`index.ts:103-138`	Drain queue before shutdown
MEDIUM	`processedSyncIds` Set is in-memory — lost on restart, re-processing risk	`ingest-service.ts:14`	Persist to MongoDB. Note: the cross-tenant blocking concern requires an engineered UUID collision (1/2^122 probability for UUIDv4) — not a realistic attack vector; the in-memory/restart concern is the actual issue here.

Architecture Decisions

Node.js version: Node 20 reaches end-of-life on April 30, 2026. Node 22 LTS is the correct version; the change touches two lines in the Dockerfile (Dockerfile:1,9). Node 22 brings V8 12.4 improvements and no breaking changes for this stack.

Job queue: BullMQ is the right direction. A lower-complexity alternative: MongoDB-backed job persistence (write jobs to worker_jobs collection before acknowledging, recover on startup) avoids adding Redis as a new stateful dependency. Decision table:

Approach	Durability	Ops Complexity	Recommended
Current (in-memory)	None	Minimal	No
MongoDB-backed job store	At-least-once	Zero (uses existing Mongo)	Yes — lower complexity
BullMQ + Redis	At-least-once + advanced features	Adds Redis service	Yes — when queue needs grow
Temporal.io	Exactly-once + saga	Very high	No (overkill for 3-step pipeline)
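
Whichever durable store is chosen, the missing piece today is startup recovery for syncs stuck in "running". The sketch below shows that decision as a pure function over job records; in a MongoDB-backed store it would become an update on the worker_jobs collection. The heartbeatAt and attempts fields and both thresholds are assumptions.

```typescript
type JobStatus = "queued" | "running" | "done" | "failed";

interface JobRecord {
  id: string;
  status: JobStatus;
  attempts: number;    // how many times a worker has picked the job up
  heartbeatAt: number; // epoch ms of the last worker heartbeat
}

// Assumed policy values for illustration.
const STALE_AFTER_MS = 5 * 60 * 1000;
const MAX_ATTEMPTS = 3;

function recoverOnStartup(jobs: JobRecord[], now: number): JobRecord[] {
  return jobs.map((job): JobRecord => {
    // Only "running" jobs whose heartbeat has gone stale need recovery.
    if (job.status !== "running" || now - job.heartbeatAt < STALE_AFTER_MS) return job;
    return job.attempts >= MAX_ATTEMPTS
      ? { ...job, status: "failed" }  // acts as the dead-letter state
      : { ...job, status: "queued" }; // safe to retry
  });
}
```

Requeueing is only safe if the pipeline steps are idempotent, which is the same property an at-least-once queue such as BullMQ would require.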

Container memory: Increase from 512MB → 1GB. A large tenant sync (5MB JSON graph + entity arrays + path materialization) can push 300-400MB leaving insufficient headroom.

Express version: Express 4 → 5 migration is low-risk and fixes async error handling natively.

9.3 Frontend Stack

Verdict: Stack is correct. Two bugs require immediate fixes. React Compiler should be enabled. Strategic concern: Graph Explorer as a primary view may not match CISO workflow — Wiz and Orca both use graphs as drill-down from findings, not standalone pages.

Critical Bug: ELK.js Running on Main Thread

File: ui/src/components/graph/layout.ts:1

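// CURRENT (wrong: layout runs on the main thread):
import ELK from "elkjs/lib/elk.bundled.js";

// SHOULD BE (ADR-011 requires the Web Worker variant):
// elk-worker.min.js is the worker script, not the ELK class. Import the API
// and point it at the worker; the workerUrl value depends on the bundler setup.
import ELK from "elkjs/lib/elk-api";
const elk = new ELK({ workerUrl: "/elk-worker.min.js" }); // assumed public path

ADR-011 explicitly requires the Web Worker variant. The bundled variant blocks the UI thread for 150-400ms at 200 nodes and 500ms-2s at 500 nodes. The spinner overlay is misleading because it may not even paint before the thread freezes. The fix is confined to this one file.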

Performance Ceilings

Component	Safe	Warning	Breaking
@xyflow/react nodes	<200 (60fps)	200-500 (30fps)	500+ (<15fps)
ELK layout (Web Worker)	<100ms (<100 nodes)	100ms-2s (100-500 nodes)	2s+ (>500 nodes)
ELK layout (main thread — current)	<50ms	50-500ms (jank)	500ms+ (frozen)
Selection highlight re-render	<200 nodes	200-500 (O(n) spread)	500+

Additional Findings

ELK.js not lazy-loaded: 1.4MB loaded on every page including Dashboard and Findings. Should be dynamic import — only GraphCanvas and MiniGraph need it.
styledNodes memo defeated: GraphCanvas.tsx:102 creates new object references for all nodes on every selection change, defeating memo() on EntityNode. Fix: CSS class toggle instead of style spread.
React Compiler: Enable via babel-plugin-react-compiler in vite.config.ts. Eliminates manual useMemo/useCallback overhead across 6+ graph components.
Strategic: Graph Explorer as a primary UI view may not match CISO workflow. Wiz/Orca both use graphs as drill-down from findings, not standalone pages. Consider making Graph Explorer seed-anchored (always starts from a finding or entity) to cap graph size and align with CISO workflow.

9.4 Authentication Stack

Verdict: WorkOS + iron-session architecture is sound. Ship the new auth middleware. Critical hardening required: no instant session revocation, super-admin email-domain check is a security bug, logout is a no-op.

Security Bugs Found

Severity	Issue	File	Fix
HIGH	Super-admin determined by email domain string match	`auth.ts:76`	Use WorkOS Organization membership check
HIGH	`logout()` is a no-op — cookie cleared but WorkOS session NOT revoked	`workos-provider.ts:74`	Call `workos.userManagement.revokeSession()`
HIGH	`verifySession()`, `verifyApiKey()`, `verifyM2MToken()` all return null	`workos-provider.ts:78-92`	Three auth sources documented; only one implemented
MEDIUM	Sessions cannot be instantly revoked — stateless cookie survives deprovisioning	`session.ts`	Add `sessions_revoked_at` timestamp to user documents
MEDIUM	Rolling refresh not implemented despite being documented	`session.ts:41-43`	Implement TTL extension on each request
MEDIUM	7-day TTL inappropriate for a security platform	`session.ts:18`	Reduce to 24h users / 8h super-admins
LOW	`listActiveConnections` called per-request for SSO-enforced tenants	`workos-provider.ts`	Cache with 60-second TTL per `provider_org_id`
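
The caching fix in the last row could look roughly like the sketch below, written in Python for illustration. `TTLCache` and `get_or_load` are hypothetical names; the loader passed in stands for whatever function wraps the real `listActiveConnections` call, and the clock is injectable so expiry can be tested without sleeping.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class TTLCache:
    """Caches one value per key for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, object]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], object]) -> object:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the upstream call
        value = loader()   # expired or missing: call upstream once and store
        self._entries[key] = (now, value)
        return value
```

Keying the cache on `provider_org_id` keeps one tenant's SSO configuration from being served to another.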

Architecture Assessment

WorkOS vendor selection (ADR-017) confirmed correct. Provider abstraction (AuthProvider interface) is well-designed — migration to Clerk or self-hosted is a backend-only change. Cloudflare Access as defense-in-depth is appropriate architecture.

iron-session assessment: In practice the cookie functions as a session ID into MongoDB, because the middleware hits the DB on every request anyway. Consider formalizing this: either accept the server-side lookup and add explicit revocation support, or move to short-lived JWTs (15 min) + refresh tokens for clean stateless/revocable semantics.
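
If the server-side lookup is accepted, the revocation check and the shorter TTLs from the table above reduce to a small predicate. The sketch below is in Python for illustration; `session_is_valid` and its parameters are hypothetical, and it assumes the session cookie carries an issue timestamp that can be compared with the user's `sessions_revoked_at`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

USER_TTL = timedelta(hours=24)        # from the recommended 24h for users
SUPER_ADMIN_TTL = timedelta(hours=8)  # from the recommended 8h for super-admins


def session_is_valid(
    issued_at: datetime,
    now: datetime,
    sessions_revoked_at: Optional[datetime],
    is_super_admin: bool = False,
) -> bool:
    ttl = SUPER_ADMIN_TTL if is_super_admin else USER_TTL
    if now - issued_at >= ttl:
        return False  # expired by age
    if sessions_revoked_at is not None and issued_at <= sessions_revoked_at:
        return False  # issued before the user's sessions were revoked
    return True
```

Setting `sessions_revoked_at` to the current time on deprovisioning then invalidates every outstanding cookie for that user on its next request.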

9.5 Infrastructure & Deployment

Verdict: Python correct. Docker Compose correct for current scale. Two critical security issues: REQUIRE_AUTH default and SSH deploy key blast radius. Dead code needs cleanup.

Critical Security Issues

Issue	Evidence	Fix
`REQUIRE_AUTH` defaults to `false` in deploy compose	`docker-compose.deploy.yml:46`	Change default to `true`; fail loudly if not set
SSH deploy key grants Docker-group-equivalent-root to production	`deployment.md:307-308`	Restrict `deploy` user via sudoers to specific compose commands only; remove Docker group membership
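
The "fail loudly if not set" behavior for REQUIRE_AUTH can be sketched as a startup check. This is a Python illustration only; `require_auth_enabled` is a hypothetical helper, and the accepted spellings ("true"/"1", "false"/"0") are an assumption about how the flag is written.

```python
from typing import Mapping


def require_auth_enabled(env: Mapping[str, str]) -> bool:
    """Parse REQUIRE_AUTH strictly; refuse to start on missing or odd values."""
    raw = env.get("REQUIRE_AUTH")
    if raw is None:
        raise RuntimeError("REQUIRE_AUTH is not set; refusing to start")
    value = raw.strip().lower()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise RuntimeError(f"REQUIRE_AUTH has unrecognized value {raw!r}; refusing to start")
```

The point of the strict parse is that a typo or an empty variable stops the process instead of silently disabling authentication.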

Operational Gaps

Issue	Impact	Fix
No external monitoring after deployment	Outage invisible until customer reports it	Add UptimeRobot / Cloudflare health check
Backup never tested; no restore runbook	6-hour RPO with no recovery confidence	Test restore once, document procedure
Same SSH key for dev and prod	Dev compromise = prod access	Separate keys, rotate quarterly
No `/ready` endpoint checking MongoDB	Health check misleads Docker	Add MongoDB connectivity check
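
A readiness check that includes MongoDB connectivity can be sketched as below, in Python for illustration. `readiness` and `ping_mongo` are hypothetical names; `ping_mongo` stands for any callable that raises when the database is unreachable.

```python
from typing import Callable, Dict, Tuple


def readiness(ping_mongo: Callable[[], None]) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, body) for a /ready handler."""
    try:
        ping_mongo()
    except Exception as exc:
        # Report not-ready so the orchestrator stops treating the container as healthy.
        return 503, {"status": "unavailable", "reason": type(exc).__name__}
    return 200, {"status": "ready"}
```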

Infrastructure Gaps

Issue	Impact	Fix
MongoDB without replica set: 6-hour RPO	Hardware failure = data loss up to last backup	Add replica set (even single-node for oplog)
Single-server SPOF: 100% downtime on host failure	SOC 2 story has single point of failure	Add second server with hot standby
Connectors require Python 3.11 at customer site	Support burden; installation friction	Ship as Docker images (`docker run ghcr.io/sv0/sv0-aws:latest`)

Python Connectors: Validated

Python 3.11 + boto3 + msgraph-sdk is correct for I/O-bound batch API scanning workloads. The GIL is irrelevant (I/O-bound). Go/TypeScript/Rust offer no practical advantage for this workload. One improvement: add concurrent.futures.ThreadPoolExecutor to AWS region scanning loop for parallel region extraction (20-line change, not a language migration).
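
A minimal sketch of the parallel region loop, assuming the connector already has a per-region scan function. `scan_regions` and the `scan_region` callable are hypothetical names, and `max_workers=8` is an arbitrary illustrative value.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List


def scan_regions(
    regions: Iterable[str],
    scan_region: Callable[[str], List[dict]],
    max_workers: int = 8,
) -> Dict[str, List[dict]]:
    """Run one scan per region in a thread pool and collect results by region."""
    results: Dict[str, List[dict]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scan_region, region): region for region in regions}
        for future in as_completed(futures):
            # result() re-raises any exception from the worker, so a failed
            # region fails the scan instead of being silently dropped.
            results[futures[future]] = future.result()
    return results
```

Letting a single region failure abort the whole scan is a deliberate choice in this sketch: a partial result that looks complete would feed the same phantom-drift problem described for ServiceNow pagination.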

Dead Code Cleanup

Remove: docker-compose.prod.yml (legacy Certbot overlay), ui/nginx-ssl.conf (superseded by Caddy). Both create confusion about the active architecture.

Infrastructure Maturity Triggers

Trigger	Action
First paying enterprise customer	2-server MongoDB replica set, test restore, separate SSH keys
Contractual uptime SLA ≥99.95%	k3s or managed Kubernetes, Atlas managed MongoDB, Docker-based connectors

10. High-Confidence Findings

Priority	Finding	File
1	ELK.js running on main thread (should be Web Worker)	`layout.ts:1`
2	In-memory job queue will lose data on restart — needs persistence	`runtime.ts:26`
3	REQUIRE_AUTH=false is a critical default in deploy compose	`docker-compose.deploy.yml:46`
4	SSH deploy key blast radius too broad	`deployment.md:307-308`
5	Super-admin via email domain string match is a security bug	`auth.ts:76`
6	Node 20 is EOL; upgrade to Node 22	`Dockerfile:1,9`
7	Logout is a no-op — WorkOS session not revoked	`workos-provider.ts:74`
8	Evidence immutability: hash stored alongside mutable data	`evidence_packs` schema
9	Role fan-out write amplification is the real MongoDB scaling cliff	`path-materializer.ts`
10	Neo4j trigger threshold should be 5K, not 10K	ADR-001

11. Gray Zone Deep-Dive

11.1 Gray Zone 2: Data Model Universality — VERDICT: RIGHT-SIZED

Agent findings (deep code analysis across all 3 connectors + all 15 evaluator rules):

Verdict: The 10-entity-type model is NOT too universal. It is the correct abstraction level. Evidence:

3 connectors map to it with zero forced compromises (when model is followed correctly)
All 15 evaluator rules operate against universal types and work across all connectors without cloud-specific branching
The path materializer traverses the graph without cloud-specific conditionals
Adding a 4th connector (GitHub, Okta, Salesforce) would be ~90% connector work, <10% platform work — no new entity types needed

Mapping fidelity by connector:

Entra/ServiceNow: Clean 1:1 mapping. No compromises.
AWS: Functional but carrying ADR-014 implementation debt (see below)
Bedrock AI agents: Clean — workload subtype bedrock_agent, RUNS_AS IAM role, INVOKES Lambda action groups

Two critical seams:

Seam	Issue	Impact	Fix
ADR-014 implementation gap	AWS connector emits `HAS_ROLE` / `nodeType: "role"` for IAM Managed Policies instead of `HAS_PERMISSION_SET` / `permission_set`	Platform-side types already support `permission_set`. Path materializer traverses `HAS_ROLE` but NOT `HAS_PERMISSION_SET` → all AWS authority paths via managed policies are incorrect	Update AWS transformer line 499 + materializer to traverse `HAS_PERMISSION_SET`
Resource key migration	`resource-key.ts` is comprehensive for AWS but `privilege_justification_gap` needs `resource_key` on evidence records	CloudTrail evidence has no `resource_key` → rule returns false negatives until CloudTrail extractor is implemented	Populate `resource_key` on evidence during CloudTrail extraction

privilege_justification_gap bug is implementation, NOT model: The resource-key.ts module correctly handles all AWS ARN formats (S3, Lambda, DynamoDB, SecretsManager, SSM, ECR, ECS, IAM, Bedrock, SNS, SFN, EventBridge, SQS). The matching logic is correct. The problem is that CloudTrail evidence records don't have resource_key populated yet (because CloudTrail extractor doesn't exist — F2).

MCP/AI agent model fit: Good. The existing model handles AI agents via ai_agent workload subtype. What's missing for pre-deployment preview is not an entity type but a behavioral distinction: "configured to invoke" (current INVOKES edge) vs. "has exercised" (needs runtime evidence). The graph already has the right edges; additional evaluator rules needed for authority preview.

_type_provisional: true was never implemented (searched codebase — zero occurrences). ADR-014 mentioned it as a migration strategy that was never built.

11.2 Gray Zone 1: Graph Alternatives to Neo4j — VERDICT: KUZU

Agent findings (deep research across 6 alternatives with full code context):

The team's hesitation about Neo4j is justified on operational grounds — and there is a better answer that avoids adding any new infrastructure.

Recommended: MongoDB + Kuzu (Embedded Analytics Layer)

Kuzu is an embedded in-process graph database (like DuckDB, but for graphs). Native Cypher support. Node.js/TypeScript bindings. MIT licensed. Zero additional infrastructure — runs inside the sv0-platform process.

Why Kuzu is the right answer for SecurityV0:

MongoDB (14 collections)           Source of truth: entities, versions, events,
        |                          findings, evidence packs, temporal history
  sync completes
        |
Kuzu (in-process)                  Graph projection: nodes + typed edges
        |                          for path traversal queries
        |
   Cypher queries
   /      |      \
blast   subgraph  chain
radius  explore   assembly

Kuzu replaces the 3 most problematic application-level BFS implementations:

path-materializer.ts:computePaths() — recursive MongoDB-per-hop query storm → single Cypher query
chain-builder.ts:bfsCollectChain() → Cypher MATCH (w)-[*1..5]->(r)
subgraph-adapter.ts:neighborhoodBFS() → Cypher MATCH (n)-[*1..2]-(m) WHERE n.id = $seed
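
For intuition, the depth-bounded neighborhood that a pattern like MATCH (n)-[*1..2]-(m) describes can be sketched in plain Python. This is an illustration, not the Kuzu or SubgraphAdapter implementation: `neighborhood` is a hypothetical function, the adjacency dict is assumed to list edges in both directions, and the `max_nodes` cap is added here to show the kind of bound the current BFS lacks.

```python
from collections import deque
from typing import Dict, List, Set


def neighborhood(
    adj: Dict[str, List[str]],
    seed: str,
    max_depth: int = 2,
    max_nodes: int = 1000,
) -> Set[str]:
    """Return the seed plus every node within max_depth hops, capped at max_nodes."""
    seen: Set[str] = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth bound
        for nxt in adj.get(node, []):
            if nxt in seen:
                continue
            if len(seen) >= max_nodes:
                return seen  # hard cap: high-degree nodes cannot blow up the result
            seen.add(nxt)
            frontier.append((nxt, depth + 1))
    return seen
```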

This eliminates:

The BFS document limit bug (subgraph-adapter.ts:158 — no .limit())
Unbounded frontier growth in high-degree nodes
Stale execution paths when role GRANTS change (Kuzu recomputes at query time)
The write amplification problem (no accessible_by arrays to maintain)

Migration is incremental — zero storage risk:

Start: Kuzu handles getSubgraph() queries only (replace SubgraphAdapter BFS)
Next: Kuzu generates execution_paths[] instead of path-materializer.ts
Then: Kuzu handles chain assembly (chain-builder.ts)
Later: Temporal graph queries via historical entity snapshots loaded into a temporary Kuzu instance

The StorageAdapter interface does NOT change. All 60 methods stay as-is. MongoDB schema unchanged. Evaluator rules unchanged. Connector interface unchanged.

Alternative Comparison

Option	Fit Score	Key Verdict
Kuzu (embedded)	8/10	Best fit. Zero infra, native Cypher, MIT license, incremental migration
XTDB v2	6/10	Excellent bi-temporal but NO graph traversal; JVM service required
TerminusDB	5/10	Git-like immutability interesting but Prolog query language + project risk
FalkorDB	4/10	Fastest BFS but Redis AOF persistence = disqualifying for evidence-grade requirements
TypeDB	5/10	Inference rules compelling but TypeQL proprietary, JVM service, no temporal
Memgraph	5/10	Neo4j-like quality but BSL license + same operational cost as Neo4j

When Kuzu stops being sufficient (trigger for Neo4j or Memgraph):

50K+ entities where graph rebuild time exceeds acceptable sync latency
Multi-process/multi-service need to query the same graph (Kuzu is in-process only)
Real-time graph mutations needed (Kuzu is batch-rebuild-oriented)
Geographic distribution requirements

On the bi-temporal gap: SecurityV0 already has a working bi-temporal model (entity_versions with valid_at/expired_at + events with transaction timestamps). XTDB's native bi-temporal is elegant but solves a problem that is already solved adequately. For historical graph queries, the right approach is: load historical entity snapshots from entity_versions into a temporary Kuzu instance and traverse that. This is the "git checkout past commit" pattern.
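
The snapshot selection step can be sketched as a filter over version records. This Python illustration assumes each record in entity_versions carries valid_at and expired_at as described, that expired_at is None for the current version, and that validity is a half-open interval; `snapshot_at` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Iterable, List


def snapshot_at(versions: Iterable[dict], as_of: datetime) -> List[dict]:
    """Pick the entity versions that were valid at the given instant."""
    return [
        v
        for v in versions
        if v["valid_at"] <= as_of and (v["expired_at"] is None or as_of < v["expired_at"])
    ]
```

The resulting list is what would be loaded into the temporary graph instance before running traversal queries against the past state.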

On evidence immutability: Neither Kuzu nor any graph DB solves the evidence hash-colocation problem. This must be solved separately, and PostgreSQL triggers are the weakest option because a DBA with DISABLE TRIGGER permission can bypass them. Prefer Sigstore transparency logs or S3 Object Lock (WORM) for genuinely tamper-evident storage. Amazon QLDB would also have fit, but AWS ended support for it in July 2025, so it is not an option for new deployments. These choices are independent of which graph layer is chosen.

11.3 Gray Zone 3: Connector Depth + Rate Limiting

Part A: Metadata-Only Scanning — VERDICT: RIGHT STRATEGY FOR V1

What metadata scanning concretely delivers (from all 15 evaluator rules):

Ownership governance (orphaned, degraded, drifted, ambiguous, unknown)
Authority hygiene (dormant, scope drift, reachability drift, privilege justification gap)
Identity binding (unproven execution, unknown binding, unresolved cross-system auth)
Egress/data flow (LLM egress, external egress, reachable sensitive domain)

This is authorization graph analysis with temporal drift detection — a capability combination that existing tools address only partially or not at all.

What metadata misses: Hardcoded secrets in code, injection vulnerabilities, dependency CVEs, logic vulnerabilities, runtime behavioral anomalies, CSPM-style resource misconfiguration checks (S3 bucket ACLs vs. CIS benchmarks).

Code analysis path (additive, not a redesign): The NormalizedGraph schema already accommodates it. A sv0-code-scanner connector would: (1) fetch code artifacts linked to known entities, (2) run lightweight checks (regex for secrets, SBOM extraction, trufflehog), (3) emit NormalizedGraph additions. This is additive — no connector architecture change needed. The ServiceNow connector already parses script bodies (analyze_script_mutations(), analyze_script_queries()).

Verdict: Metadata-only is fully defensible for V1. Strategic risk is customers expecting CSPM-style findings alongside the authorization graph — that's a breadth gap where CSPM-first tools have an advantage.

Part B: Rate Limiting — CRITICAL FINDINGS

ServiceNow 429 bug is a data integrity crisis, not a UX issue:

The break at servicenow_client.py:421 on any non-200 response causes silent partial data ingestion. Blast radius:

Scan returns 400/2000 records as if complete
Downstream evaluator computes massive phantom ownership_drift and scope_drift — entities "disappeared"
Phantom-truncated scan becomes the new baseline — subsequent full scans show phantom "new" entities
Temporal drift detection becomes unreliable

This is not a "fix later" issue. This calcifies baselines. Every scan run with this bug creates corrupted baselines that compound. Must fix before production.

Fix — ServiceNow cursor resume on 429:

Note: urllib3 retry logic at the adapter level handles transient TCP/TLS failures before the pagination loop sees a status code. The bug is what happens after urllib3 retries are exhausted: the 429 bubbles up to application code and the break at line 421 exits the pagination loop without resuming the cursor. The fix is at the application level, not the adapter level:

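# Inside the pagination loop. Requires `import random` and `import time` at module level.
if response.status_code == 429:
    if retry_count >= max_retries:
        raise ConnectorError(f"ServiceNow rate limit exceeded after {max_retries} retries; pagination cursor at offset {offset}")
    try:
        # Retry-After may be missing, or an HTTP-date rather than a number of seconds
        retry_after = min(int(response.headers.get("Retry-After", 0)), 300)  # cap at 5 min
    except ValueError:
        retry_after = 0
    wait_time = max(retry_after, 2 ** retry_count)
    wait_time *= random.uniform(1.0, 1.25)  # up to +25% jitter; never sleeps less than Retry-After
    time.sleep(wait_time)
    retry_count += 1
    continue  # NOT break: retry the SAME offset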

Fix — AWS full jitter (1 line):

# Line 276 of aws_client.py — replace:
wait_time = 2**retry_count
# With (AWS Architecture Blog "full jitter" pattern):
wait_time = random.uniform(0, 2**retry_count)

Recommended rate limiting architecture (current stage):

Per-connector AdaptiveTokenBucket per API endpoint
Respects Retry-After headers (ServiceNow, Azure Graph both send these)
Full jitter on all retry delays
rateLimitConfig in connector contract (05-connectors.md:122-126) is the right interface — configure max RPS per connector
Later: Redis-backed cross-tenant quota tracker when concurrent multi-tenant scans are needed
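
A minimal sketch of the per-endpoint bucket, in Python. The document names AdaptiveTokenBucket but does not specify how it adapts, so the policy here (halve the refill rate after a throttled response, with a floor) is an assumption, as are the method names `try_acquire` and `on_throttled`. The clock is injectable for testing.

```python
import time
from typing import Callable


class AdaptiveTokenBucket:
    """Token bucket whose refill rate drops when the upstream API throttles."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        min_rate: float = 0.1,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.min_rate = min_rate
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; the caller waits and retries on False."""
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False

    def on_throttled(self, factor: float = 0.5) -> None:
        """Call after a 429: slow down future requests (assumed policy)."""
        self.rate = max(self.min_rate, self.rate * factor)
```

One bucket per connector and API endpoint, configured from rateLimitConfig, would keep a slow endpoint from consuming the budget of a fast one.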

Reversibility assessment:

Decision	Reversible?	Notes
Metadata-only scanning	Fully reversible	Code analysis connectors are additive
ServiceNow break-on-429	Calcifying	Fix before any production customer
AWS no-jitter	Easily reversible	1-line fix, low complexity
No global quota tracker	Reversible but expensive later	Design interface now, implement when multi-tenant orchestration built

11.4 MCP Blocker: AI Agent Pre-Deployment PII Access Graph

Agent findings (deep architecture review of 12-deployment-approval.md + full codebase analysis):

What the architecture team already knows well

The design docs are thorough: three modes (post-deploy detection, pre-deploy preview, deployment gate) are correctly separated. Five approaches were evaluated. Platform capabilities inventory is accurate.

Hard unsolved problems (not yet designed)

Problem	Why Hard
Graph projection algorithm	"Run materializer on projected state" is stated but the how is not designed
Path-level diff engine	Current `diff-engine.ts` diffs EntityDoc only — no AuthorityPathDoc comparison
Cross-connector entity correlation	Prerequisite for cross-system authority chains (Entra→SN→HR DB) — not yet built
MCP tool-to-data-domain mapping	Tool declarations are free-text black boxes; classification is unsolved
`data_domain` as first-class entity type	Not yet in the entity model

MCP Opacity Mitigation (layered, honest approach)

The fundamental problem: MCP tool declarations (tools/list) show name + description + input schema. They don't reveal what databases the tool queries, what data it returns, or what its blast radius is.

Recommended mitigation layers:

Layer	What It Provides	Evidence Grade	Build Now?
1 — Identity-bounded authority	The identity's IAM permissions ARE the worst-case blast radius	C (inferred)	Yes — already modeled
2 — Manifest-declared intent	Parse `mcp.json` for env vars, tool names, resource URIs	C (inferred)	Yes — build now
3 — Tool description parsing	NLP/regex on tool descriptions for domain hints	C (inferred)	Caution — conflicts with "no ML/heuristics" policy
4 — Runtime observation	Actual network/DB calls after deployment	A (proven)	Future

Honest framing for clients: "We show you the identity's authority boundary. The tool may exercise all, some, or none of that authority. The boundary is the worst case."

Graph Projection Algorithm: Recommended Design

Rejected options:

Option A (clone to MongoDB + materializer): Write amplification, cleanup complexity, persistence risk
Option C (what-if tenant namespace): Cross-tenant reference failures, tenant semantics broken

Recommended: In-memory ProjectionStorageAdapter

mcp.json
   ↓ MCP manifest parser
NormalizedGraph (mcp_server, mcp_tool, identity nodes)
   ↓ graph-transformer.ts (existing)
EntityDoc[] (projected entities)
   ↓ inject into
ProjectionStorageAdapter (Map<string, EntityDoc> backed)
   ↑ seeded from real tenant subgraph via getSubgraph(identity, depth=3)
   ↓ materializeExecutionPaths() + materializeAuthorityPaths() (unchanged)
projected AuthorityPathDoc[]
   ↓ evaluateSinglePath() (unchanged)
ProjectedFindingCandidate[]
   ↓ diff against current MongoDB authority paths
Authority Delta: new/removed/changed paths + new sensitive domains reached

This works because the StorageAdapter interface is already the abstraction boundary. A ProjectionStorageAdapter implementing ~8-10 methods (getEntity, upsertEntity, queryEntities, getEntitiesByIds, queryAuthorityPaths, upsertAuthorityPaths, markAuthorityPathsRemoved, countAuthorityPaths) runs the entire materializer + evaluator pipeline with zero MongoDB writes.
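
The shape of such an adapter can be sketched with a dict-backed store. This is a Python illustration of the idea only: it covers four of the listed methods under snake_case names, the entity documents are plain dicts with an assumed `id` field, and the real interface signatures are not reproduced here.

```python
import copy
from typing import Callable, Dict, Iterable, List, Optional

Entity = Dict[str, object]


class ProjectionStorageAdapter:
    """In-memory store seeded from a real subgraph; writes never leave memory."""

    def __init__(self, seed: Iterable[Entity]) -> None:
        # Deep-copy the seed so projected changes cannot touch the real documents.
        self._entities: Dict[str, Entity] = {str(e["id"]): copy.deepcopy(e) for e in seed}

    def get_entity(self, entity_id: str) -> Optional[Entity]:
        return self._entities.get(entity_id)

    def upsert_entity(self, entity: Entity) -> None:
        self._entities[str(entity["id"])] = copy.deepcopy(entity)

    def query_entities(self, predicate: Callable[[Entity], bool]) -> List[Entity]:
        return [e for e in self._entities.values() if predicate(e)]

    def get_entities_by_ids(self, ids: Iterable[str]) -> List[Entity]:
        return [self._entities[i] for i in ids if i in self._entities]
```

Discarding the adapter instance after the preview is the whole cleanup step, which is the main advantage over cloning into MongoDB.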

New Entity Types & Schema

Add to entity types: mcp_tool, data_domain

Add to edge types: DECLARES_TOOL (mcp_server → mcp_tool), ACCESSES (mcp_tool → data_domain), PROJECTED_FROM (projected entity → manifest source)

Add to workload subtypes: mcp_server (already has ai_agent, bedrock_agent)

Data domain classification (3 tiers, in priority order):

Tier 1 — Resource name pattern matching (deterministic, build now): hr.*|employee.* → domain: "hr", sensitivity: "confidential". This is consistent with the "no ML/heuristics" policy — it's a curated registry.
Tier 2 — Operator tagging via API/UI: Security team manually classifies resources. Stored as data_domain entities with ACCESSES relationships.
Tier 3 — Tool description NLP: Skip for now — conflicts with determinism policy.
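
Tier 1 can be sketched as an ordered registry of patterns where the first match wins. In this Python illustration only the hr rule comes from the text above; the finance rule is a hypothetical second entry added to show ordering, and `classify_resource` is a hypothetical function name. Patterns are treated as regular expressions anchored at the start of the lowercased resource name, which is an assumption about the intended matching semantics.

```python
import re
from typing import Dict, List, Optional, Tuple

DOMAIN_REGISTRY: List[Tuple[re.Pattern, Dict[str, str]]] = [
    (re.compile(r"hr.*|employee.*"), {"domain": "hr", "sensitivity": "confidential"}),
    # Hypothetical entry, included only to illustrate priority order.
    (re.compile(r"payroll.*|invoice.*"), {"domain": "finance", "sensitivity": "confidential"}),
]


def classify_resource(name: str) -> Optional[Dict[str, str]]:
    """Return the first matching label, or None when the registry has no rule."""
    lowered = name.lower()
    for pattern, label in DOMAIN_REGISTRY:
        if pattern.match(lowered):  # match() anchors at the start of the name
            return dict(label)
    return None
```

Returning None rather than a guessed label keeps unclassified resources visible for Tier 2 operator tagging.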

Evidence Grading for Projected State

All projected paths carry: claim_type: "capability_inferred", evidence_strength: "inferred" (weakest grade, rank 3). In the UI: dashed edges, "PROJECTED" badge, distinct color. Projected findings do NOT count toward active posture score — advisory only.

Post-deployment upgrade path: projected → structural (after first scan confirms configuration) → correlated (after execution evidence accumulates) → deterministic (proven in production).

Approval Record Schema (minimal)

interface DeploymentPreviewRequestDoc {
  _id: string; tenant_id: string;
  requested_by: string; requested_at: Date;
  source_type: "mcp_manifest" | "cloudformation" | "arm_template";
  source_manifest?: Record<string, unknown>;
  projected_paths: {
    new_paths: number; new_sensitive_paths: number;
    new_domains_reached: string[];            // e.g., ["hr", "finance"]
  };
  projected_findings: ProjectedFindingSummary[];
  projected_authority_paths: AuthorityPathDoc[];
  status: "pending" | "approved" | "rejected" | "expired";
  reviewed_by?: string; reviewed_at?: Date; review_notes?: string;
  conditions?: string[];
  // Post-deployment accuracy tracking
  projection_accuracy?: {
    paths_matched: number; paths_unexpected: number; paths_missing: number;
  };
}

Closest Analogues and Gaps

Tool	What It Does	Gap
OAuth consent screens	Shows flat permission list	No authority graph, no cross-system chains
`terraform plan`	Projects infrastructure state	No authority implications of infra changes
AWS IAM Access Analyzer	Checks single policy for public access	Not a graph, not pre-deployment, not cross-system
Microsoft Agent Governance Toolkit	Runtime policy enforcement	No pre-deployment preview, no authority graph
Wiz AI-SPM	Cloud security posture for AI	Runtime/post-deployment only, no authority graph

Delivery Sequence

Phase	Component	Output
1	In-memory ProjectionStorageAdapter (~8-10 methods)	Foundation for all projection
2	MCP manifest parser → NormalizedGraph	mcp.json input accepted
3	`POST /api/v1/deployment/preview` endpoint	Working projection pipeline
4	Approval record schema + `PATCH` endpoint	Approve/reject workflow
5	Resource-name data domain classifier (Tier 1)	PII domain detection

After initial delivery: Path-level diff engine + full graph snapshot (prerequisite for "did reality match projection?")
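
Once path-level comparison exists, filling the projection_accuracy fields is set arithmetic. The Python sketch below assumes each authority path can be reduced to a stable string key, and it interprets paths_unexpected as "observed but not projected" and paths_missing as "projected but not observed"; both the key scheme and that interpretation are assumptions.

```python
from typing import Dict, Set


def projection_accuracy(projected: Set[str], observed: Set[str]) -> Dict[str, int]:
    """Compare projected path keys with those found by the first real scan."""
    return {
        "paths_matched": len(projected & observed),
        "paths_unexpected": len(observed - projected),
        "paths_missing": len(projected - observed),
    }
```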

Genuine hard problems not in scope yet: Cross-connector entity correlation, CloudFormation/ARM/Terraform parsers, multi-environment tenant model, what-if simulation UI.

Biggest implementation risk: ProjectionStorageAdapter must handle edge cases in the materializer (circuit breakers, deletion thresholds, AP_REMOVAL_THRESHOLD safety net). Medium risk — methods are well-defined but materializer edge cases will surface during integration testing.

12. Updated Master Priority Table

Pri	Category	Issue	Severity
1	Security	Cross-tenant IDOR via REQUIRE_AUTH bypass	Ship blocker
2	Security	REQUIRE_AUTH defaults to false in deploy compose	P0
2a	Security	`verifyM2MToken()` returns null — Bearer-token M2M auth is completely unenforced	CRITICAL
3	Security	DevAuthProvider: no production gate	Ship blocker
4	Security	SSH deploy key → Docker group = root access	P0
5	Security	Super-admin via email domain string (not org membership)	HIGH
6	AWS Connector	CloudTrail extractor not implemented	Ship blocker
7	AWS Connector	Assumed-role ARN parsing broken (80-90% events)	Ship blocker
8	AWS Connector	`privilege_justification_gap` always 0 on AWS	Ship blocker
9	AWS Platform	`permission_set` materializer not updated	Ship blocker
10	Frontend	ELK.js running on main thread (not Web Worker)	HIGH
11	Runtime	Async route handler no try/catch — hangs request	HIGH
12	Runtime	Job queue unbounded + no persistence = data loss	HIGH
13	Auth	Logout is a no-op (WorkOS session not revoked)	HIGH
14	Node.js	Node 20 EOL → upgrade to Node 22	HIGH
15	Graph DB	BFS reverse lookup: no document limit	Pre-scale blocker
16	Connector	ServiceNow pagination: break on 429 — corrupts baselines	Ship blocker
17	Auth	7-day session TTL; rolling refresh not implemented	MEDIUM
18	Auth	Iron-session: no instant revocation on deprovisioning	MEDIUM
19	Infra	No external monitoring; backup untested	P1
20	Graph DB	Stale paths on role GRANTS change	Correctness gap
21	Graph DB	MAX_AUTH_CHAIN_DEPTH=1 — 3-system chains missed	Feature gap
22	Evidence	Immutability: hash stored in mutable collection	Compliance gap
23	Frontend	ELK.js not lazy-loaded (1.4MB on every page)	MEDIUM
24	Graph DB	Neo4j trigger lower to 5K; monitor role fan-out	Planning
25	MCP Feature	`mcp_tool`, manifest parser, graph projection	Feature — not yet built

13. Infrastructure: Docker Compose Is a Dead End

Docker Compose is a development and single-host orchestration tool. SecurityV0 runs it in production for both app.securityv0.com and dev.securityv0.com. This is a structural ceiling — not a configuration gap, an architectural one.

What Docker Compose cannot do:

Capability	Docker Compose	Required for Scale
Horizontal scaling (multiple hosts)	No — single host only	Yes, for any cell model
Rolling deployments	No — `up` restarts all containers, causing downtime	Yes, for zero-downtime deploys
Health-based routing	No — failed containers removed from routing manually	Yes, for resilience
Cross-node service discovery	No	Yes, for cell provisioning
Autoscaling	No	Yes, for variable sync load
Resource enforcement	Per-container cgroup limits only; no cluster-level quotas	Yes, for noisy-neighbor isolation
Secret management	`.env` files on disk	Yes, must use vault

The consequence for cell architecture: Cell provisioning automation — the core operational requirement for cells — is impossible on Docker Compose. "Provisioning a new cell" on Docker Compose means SSH-ing into a server and running docker compose up manually. This defeats the purpose.

The right migration path:

Current: Docker Compose (CPX21 Hetzner, single host)
         ↓
Step 1:  k3s on Hetzner
         Single-node Kubernetes — identical Hetzner hardware, same Docker images
         No operational cost increase; enables everything below
         ↓
Step 2:  Helm charts per service
         Parameterized deployment: one Helm chart = one cell
         Rolling deployments, health checks, resource quotas — free
         ↓
Step 3:  Cell provisioning via Helm (when triggered by scale)
         `helm install cell-eu-02 ./charts/sv0-cell --set tenants=...`
         New cell live in 15 minutes, zero downtime for existing cells

k3s is the correct migration path: same Hetzner infrastructure, same container images, same Docker workflows for developers — production-grade runtime that enables the full cell model when needed.

Note on scope: The async route handler bug, unbounded job queue, and ELK.js Web Worker issue are independent code bugs — Docker Compose did not cause them and k3s would not fix them. The argument for migrating is forward-looking: Docker Compose cannot support rolling deployments, multi-host scaling, or cell provisioning automation, all of which become necessary as the platform grows. Fix the bugs separately; migrate the runtime to unlock the scaling model.

14. Cell Architecture vs. Current Architecture — Full Comparison

What Cell Architecture Means for SecurityV0

A cell is a complete, independently deployed replica of the platform stack, permanently assigned a bounded set of tenants, such that the failure or resource exhaustion of any component in that cell has zero runtime effect on any other cell.

One SecurityV0 cell contains:

┌─────────────────────────────────────────────────────────────┐
│  CELL A  (tenants T001, T047, T203 ... T035)                │
│                                                             │
│  Express API pods (3×) ─── Redis ─── BFS Workers (4×)      │
│                                  │                          │
│                          MongoDB Replica Set                 │
│                      (IAM graphs, findings, BFS paths       │
│                       for THIS cell's tenants ONLY)         │
└─────────────────────────────────────────────────────────────┘

CONTROL PLANE (global, not a cell):
  Cell Router │ Auth Service │ Billing │ Tenant Registry
  Maps tenant_id → cell. Never holds IAM graph data.
  Must NOT be in the hot path — cells operate independently
  if control plane goes down.

Connectors use transparent routing: they always call api.securityv0.com. The cell router maps tenant_id → cell from a cached registry and proxies the request. Connectors never need to know which cell they're in — no reconfiguration when cells are added or tenants migrated.
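
The routing step can be sketched as a lookup against a locally cached copy of the tenant registry. This Python illustration uses hypothetical names throughout (`CellRouter`, `refresh`, `upstream_for`), and the internal hostname format is invented for the example.

```python
from typing import Dict


class CellRouter:
    """Maps tenant_id to a cell using a locally cached registry snapshot."""

    def __init__(self, registry: Dict[str, str]) -> None:
        self._registry = dict(registry)

    def refresh(self, registry: Dict[str, str]) -> None:
        # Called when the control plane publishes a new tenant-to-cell mapping.
        self._registry = dict(registry)

    def upstream_for(self, tenant_id: str) -> str:
        cell = self._registry.get(tenant_id)
        if cell is None:
            raise LookupError(f"no cell assigned for tenant {tenant_id!r}")
        return f"https://{cell}.cells.internal.example"  # hypothetical host scheme


router = CellRouter({"T001": "cell-eu-01"})
before = router.upstream_for("T001")
router.refresh({"T001": "cell-eu-02"})  # tenant migrated; connector config unchanged
after = router.upstream_for("T001")
```

Because the router serves from its cached snapshot, a control plane outage delays registry updates but does not stop requests to already-known tenants.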

Scalability Comparison

Architecture A (Current) — Binding Constraints (from code analysis):

Critical correction from code audit: The production worker runtime is NOT BullMQ. It is a plain JavaScript array (private readonly queue: WorkerJob[] = [] at runtime.ts:26) inside the API process, draining one job at a time, sequentially. There is no separate worker process. BullMQ exists in documentation, not in the running code.

A full tenant sync cycle is 3 sequential jobs: sync_ingestion → evaluate_findings → build_evidence_pack.

For a medium tenant (5,000 entities): 85–240 seconds total.

Tenants	Sync frequency	Worker queue drain time	Outcome
5	Hourly	~22 min	Drains before next sync
10	Hourly	~45 min	Queue backs up permanently
35	Daily	~105–180 min	Barely drains before next daily window
50	Daily	~225–375 min	Queue never empties
500	Daily	37+ hours	Architecture collapses

Sequential breaking points (in order they bite):

Worker queue saturation — ~10 tenants (hourly) / 35 tenants (daily)
MongoDB working set overflow — ~60–70 tenants × 5K entities (WiredTiger cache is 256MB from --wiredTigerCacheSizeGB 0.25 in compose; total working set exceeds it at low tenant counts)
Node.js OOM — one 50K-entity sync calls queryEntities(limit:0), loads 500MB of entity docs into 512MB container; immediate OOM kill; all tenants dark
Express latency degradation — ~100+ tenants with concurrent dashboard load

Architecture A single-event total-outage scenario: One enterprise customer runs a 50K-entity sync during business hours. Path materialization triggers 3.2M sequential MongoDB reads (~9 hours). Evaluator calls queryEntities(limit:0) on 50K entities → ~500MB heap → OOM kill. Container restarts. The stalled sync is permanently stuck at "running" in MongoDB. All other tenant syncs are blocked for the duration. No alerting fires — the process crash is not surfaced as a sync failure. All tenants on the platform go dark.

Architecture B (Cell) — Scaling Characteristics:

Cell capacity: 25–35 tenants per cell (MongoDB M20, 4 parallel workers)
New cell provisioning: 12–18 minutes, zero downtime for other cells
Same 50K-entity OOM scenario in Cell B: one cell degraded, 25–35 tenants affected, all other cells continue normally
Vertical scaling: eliminated — add cells, not bigger servers
Geographic cells: US customers on US cell (<20ms RTT vs 120–220ms from Nuremberg); APAC (<30ms vs 350ms)

APAC dashboard latency on Architecture A (350ms per interaction) crosses the threshold where users perceive the product as slow. Regional cells eliminate this entirely.

Metric	Architecture A	Architecture B
Daily sync saturation	35 tenants	25–35 per cell, unlimited cells
Hourly sync saturation	10 tenants	25–35 per cell
Total-outage trigger	One 50K-entity sync	Atlas AZ outage (30–60s failover)
Noisy tenant blast radius	All tenants on platform	25–35 tenants in one cell
Scaling action downtime	5–15 min (vertical resize)	Zero (new cell)
APAC dashboard RTT	350ms	<30ms (APAC cell)

Security Comparison

Security scorecard:

Attack Vector	Architecture A	Architecture B
Tenant data isolation	CRITICAL — `tenant_id` field only; one missing filter exposes all tenant data	LOW — per-cell MongoDB; missing filter leaks within 25–35 tenant cell only
Noisy tenant / resource exhaustion	HIGH — one tenant starves all; no per-tenant limits at any layer	LOW inter-cell / MEDIUM intra-cell
Auth bypass blast radius	CRITICAL — `REQUIRE_AUTH=false` exposes 100% of tenants simultaneously	HIGH — one cell exposed; others protected by independent auth
Cross-tenant IDOR	CRITICAL — MongoDB ObjectIDs from shared DB are time-ordered and estimable; one bug in any of 50+ query paths leaks cross-tenant	LOW — ObjectIDs from other cells do not exist in this cell's DB; physically absent, not just filtered
Session compromise blast radius	CRITICAL — stolen `@securityv0.com` admin session has 7-day unrestricted access to all tenants; no revocation	HIGH — control plane admin / MEDIUM — cell-scoped
Database breach blast radius	CRITICAL — one MongoDB breach delivers full IAM graph of every customer; complete cloud attack kit	HIGH per-cell / LOW platform-wide — independent credentials per cell
Connector push forgery	HIGH — `verifyM2MToken()` returns null; `tenant_id` in payload is attacker-controlled	MEDIUM — cell URL discovery required; wrong-cell push rejected at routing layer
Super-admin escalation	CRITICAL — email domain string match for all `@securityv0.com` accounts; no revocation	HIGH — same fragility, bounded blast radius
Compliance (SOC 2 Type II)	Blocked — CC6.1 (logical access only), CC6.3 (no revocation), IdP stubs	Achievable with remaining auth work
Compliance (FedRAMP Moderate)	Explicitly blocked — SC-4 requires DB-level isolation; field-level filtering fails this control	Eligible path — single-tenant government cells satisfy SC-4
GDPR / Data Residency	High risk — EU tenant data co-mingles with US tenant data at storage layer	Strong — EU cell on EU infrastructure; no cross-jurisdiction data residency risk

Six security fixes required regardless of architecture choice (Architecture B reduces blast radius but does not fix these):

verifySession(), verifyApiKey(), verifyM2MToken() returning null — this is an active auth bypass on those paths, not a stub
REQUIRE_AUTH=false as default in docker-compose.deploy.yml — must be inverted; opt-out for dev, not opt-in for prod
Iron-session server-side revocation — Redis-backed session store with immediate invalidation capability
Super-admin email domain check — replace with explicit RBAC membership from WorkOS org claims + user ID allowlist
BFS document limit — hard cap on traversal depth and result count per request
DevAuthProvider production gate — startup crash (not silent fallthrough) if NODE_ENV=production and DevAuthProvider is active
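Two of the fixes above (the inverted `REQUIRE_AUTH` default and the DevAuthProvider production gate) can be enforced together at startup. The sketch below is one possible shape, not the project's actual configuration code. `AUTH_PROVIDER`, its values `"dev"` and `"workos"`, and the `AuthConfig` type are hypothetical names introduced for illustration; only `REQUIRE_AUTH` and `NODE_ENV` come from the report.

```typescript
// Hypothetical config shape. Only REQUIRE_AUTH and NODE_ENV appear in the report.
interface AuthConfig { requireAuth: boolean; authProvider: string }

// Fail closed: auth is required unless explicitly disabled, and it can never
// be disabled (or served by the dev provider) when NODE_ENV=production.
function resolveAuthConfig(env: Record<string, string | undefined>): AuthConfig {
  const isProd = env.NODE_ENV === "production";
  // Unset, empty, or any value other than the literal "false" means auth is on.
  const requireAuth = env.REQUIRE_AUTH !== "false";
  // AUTH_PROVIDER is an assumed variable name used only for this sketch.
  const authProvider = env.AUTH_PROVIDER ?? "workos";

  if (isProd && !requireAuth) {
    throw new Error("REQUIRE_AUTH=false is not permitted when NODE_ENV=production");
  }
  if (isProd && authProvider === "dev") {
    throw new Error("DevAuthProvider must not be active when NODE_ENV=production");
  }
  return { requireAuth, authProvider };
}
```

Calling this once with `process.env` before the server binds its port turns a misconfiguration into a startup crash rather than a silent fallthrough.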

Customer Isolation Comparison

Architecture A — Isolation Reality (from code):

All 23 MongoDB collections are shared. The only isolation boundary is the tenant_id field predicate in application queries. MongoDB has no row-level security; the application is the sole enforcement point. Additional isolation failures found in code:

InMemoryFindingsStore is shared across all tenants — if keying is not tenant-scoped internally, connector report findings from Tenant A are visible to Tenant B
IngestService.processedSyncIds is a global Set<string> — not tenant-scoped; the practical risk is re-processing on restart (Set is lost), not cross-tenant blocking (UUIDv4 collision probability is 1/2^122)
A stuck sync job (infinite path materialization loop) has no per-job timeout or watchdog; it occupies the entire worker indefinitely, blocking all other tenants' pipelines
The auto-join domain-match feature (in the new, not-yet-mounted middleware) adds users to any tenant matching their email domain without explicit invitation — a multi-tenant implicit membership risk
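The stuck-job item above, together with the sync left permanently at "running" after a crash, points at a missing watchdog. A minimal sketch of the sweep step follows, assuming each sync record carries a heartbeat timestamp that the worker refreshes while it makes progress. The `SyncRecord` fields and the 30-minute threshold in the usage note are hypothetical. The report does not describe the actual sync document schema.

```typescript
// Hypothetical record shape: the real sync document schema is not shown in this report.
type SyncStatus = "queued" | "running" | "succeeded" | "failed";
interface SyncRecord {
  syncId: string;
  tenantId: string;
  status: SyncStatus;
  lastHeartbeatMs: number; // epoch milliseconds of the worker's last progress signal
}

// Returns a copy of the records in which every "running" sync that has been
// silent for longer than maxSilenceMs is marked "failed". Other records are
// returned unchanged, so the caller can persist and alert on the difference.
function sweepStaleSyncs(
  records: SyncRecord[],
  nowMs: number,
  maxSilenceMs: number,
): SyncRecord[] {
  return records.map((r) =>
    r.status === "running" && nowMs - r.lastHeartbeatMs > maxSilenceMs
      ? { ...r, status: "failed" as const }
      : r,
  );
}
```

Run on a timer and once at process start, for example `sweepStaleSyncs(records, Date.now(), 30 * 60_000)`, this would surface a crashed or looping job as a sync failure instead of leaving it at "running" indefinitely.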

Architecture B — Isolation Reality:

IAM graph data for Tenant A physically does not exist in Cell B's database — cross-cell IDOR requires control plane compromise + cell credential forgery
Worker exhaustion, OOM, stuck jobs: bounded to the cell (25–35 tenants), not the platform
Enterprise single-tenant cells: zero cross-tenant data at any layer; database breach exposes exactly one customer
GDPR data residency: EU cell on EU Hetzner region + EU Atlas region; US tenant data never touches EU infrastructure

Pros and Cons

Current Architecture (Shared Multi-Tenant)

Pros:

Simple to operate at current scale (single compose stack, one MongoDB)
Low infrastructure cost ($0.74/tenant at 50 tenants)
StorageAdapter abstraction provides a clean migration path to per-tenant collections without touching connectors or API routes
Fast iteration — one deployment target

Cons:

Worker queue blocks all tenants for the duration of any single sync job
One large-tenant OOM kills the API process for all tenants simultaneously
tenant_id field isolation is the only data boundary — one missing filter in any of 50+ query paths is a platform-wide cross-tenant breach
FedRAMP (NIST SP 800-53 SC-4), ISO 27001, and GDPR data residency compliance are structurally blocked
Write amplification from path materialization (O(I×R×P×Res) MongoDB reads) is a shared-instance bottleneck
No horizontal scaling path without rewrite
Docker Compose provides no rolling deployments, no health-based routing, no autoscaling
APAC dashboard unusable (350ms+ RTT from Nuremberg)

Cell Architecture

Pros:

Any single-cell failure (OOM, MongoDB, stuck job) affects 25–35 tenants, not the entire platform
FedRAMP Moderate eligible via single-tenant government cells
GDPR data residency: EU customers on EU cells, US customers on US cells — provable in procurement
Geographic cells eliminate APAC latency penalty
Delivers the enterprise isolation that CISO-grade buyers require for compliance (FedRAMP, GDPR, contractual)
Cell provisioning via Helm is 12–18 minutes, zero downtime
Per-cell MongoDB credentials — one cell's database breach does not cascade

Cons:

Control plane is a new single point of failure; must be built to higher availability than data plane
Cell-to-cell tenant migration requires quiesce-export-import-verify-flip procedure (~30 min, coordination risk)
Cell sprawl: 10 cells = 10 MongoDB instances to patch, 10 Redis instances to monitor, 10 deployment rollbacks per release
$5.20/tenant at 50 tenants vs. $0.74 — 7× cost premium at low scale
Significant engineering investment for control plane, provisioning, cell-aware routing — time not spent on the AWS connector or MCP feature
Requires k3s or ECS as prerequisite — Docker Compose is incompatible with cell provisioning automation
Intra-cell isolation within a 25–35 tenant cell still requires tenant_id field discipline; Architecture B reduces the blast radius but does not change the isolation mechanism

Cost Model

Scale	Arch A Infrastructure	Arch A $/tenant	Cells Needed	Arch B Infrastructure	Arch B $/tenant
50 tenants	2× CPX21 = €22/mo + Redis	$0.74	2 cells	$260/mo	$5.20
200 tenants	CPX51 + Atlas M20 = ~$170/mo	$0.85	7 cells	$910/mo	$4.55
500 tenants	CPX51 + Atlas M50 = ~$470/mo	$1.10	17 cells	$2,210/mo	$4.42

At 200+ tenants, Architecture A requires a dedicated DBA and constant capacity management; the staffing cost delta alone exceeds the $3.70/tenant infrastructure premium of Architecture B.
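The Architecture B columns of the cost table can be reproduced with simple arithmetic. The two constants below are inferred from the table rather than stated in it: about $130 per cell per month ($260 for 2 cells) and 30 tenants per cell (the midpoint of the 25–35 range). Treat both as assumptions.

```typescript
// Both constants are inferred from the cost table, not stated in the report.
const TENANTS_PER_CELL = 30;   // midpoint of the 25–35 tenants-per-cell range
const COST_PER_CELL_USD = 130; // $260/mo divided by 2 cells

interface CellCost { cells: number; monthlyUsd: number; perTenantUsd: number }

function archBCost(tenants: number): CellCost {
  const cells = Math.ceil(tenants / TENANTS_PER_CELL);
  const monthlyUsd = cells * COST_PER_CELL_USD;
  return { cells, monthlyUsd, perTenantUsd: monthlyUsd / tenants };
}

// archBCost(50)  → { cells: 2,  monthlyUsd: 260,  perTenantUsd: 5.2 }
// archBCost(200) → { cells: 7,  monthlyUsd: 910,  perTenantUsd: 4.55 }
// archBCost(500) → { cells: 17, monthlyUsd: 2210, perTenantUsd: 4.42 }
```

The $3.70/tenant premium quoted above is the 200-tenant row: $4.55 for Architecture B minus $0.85 for Architecture A.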

The Verdict: When to Invest in Cell Architecture

Cell architecture is the correct long-term direction. It is the wrong immediate investment.

There is no evidence that SecurityV0 has any problem at its current scale that cell architecture would solve. The AWS connector produces no reliable execution evidence. Authentication is mid-migration. The worker queue is a JS array. Before rearchitecting for scale, the product must work.

Triggers that justify the cell investment (all must be true):

100+ tenants with active sync workloads
Demonstrated requirement for physical data isolation (not just field-level tenant_id discipline)
Measured noisy-neighbor degradation — not theoretical; actual P95 latency correlation between one tenant's sync load and another's dashboard latency
All items 1–9 from the existing priority table are closed
WorkOS auth migration is complete and deployed
Operational capacity to maintain multiple independent MongoDB instances, Redis instances, and Helm deployments

The incremental path — no big-bang rewrite:

Step A:  Per-tenant MongoDB collections via StorageAdapter
         Add tenantId → collectionName routing inside the adapter
         Delivers collection-level isolation; maps cleanly to cell extraction later
         Application code: unchanged. Connectors: unchanged.

Step B:  Persistent job queue
         Replace WorkerJob[] array with durable queue (MongoDB-backed or BullMQ)
         Enables parallel workers, per-tenant priority lanes, job recovery

Step C:  Per-tenant API rate limiting
         Token bucket keyed by tenantId in middleware
         Eliminates noisy-neighbor at the API layer

When triggered:  First enterprise customer requiring contractual isolation
                 Extract them to a dedicated single-tenant cell
                 One cell, one Terraform module, no generalized control plane yet

When triggered:  Measured queue degradation across tenants
                 General cell model: control plane, provisioning automation, cell router
                 Steps A–C are already done; the migration is additive, not a rewrite
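Step A's tenantId → collectionName routing could look roughly like the following. The naming scheme (`t_<tenantId>__<base>`) and the tenant ID pattern are hypothetical choices made for this sketch. The report does not specify either. The validation matters because MongoDB collection names may not contain `$`, may not be empty, and may not begin with `system.`, and because an unvalidated ID could otherwise route into another tenant's collection.

```typescript
// Hypothetical constraints: lowercase alphanumerics, underscore, and hyphen, 1–63 chars.
const TENANT_ID_PATTERN = /^[a-z0-9][a-z0-9_-]{0,62}$/;

// Maps (tenantId, base collection) to a per-tenant collection name.
// Rejects any tenantId that could escape its own namespace.
function collectionNameFor(tenantId: string, baseCollection: string): string {
  if (!TENANT_ID_PATTERN.test(tenantId)) {
    throw new Error("invalid tenantId: " + JSON.stringify(tenantId));
  }
  return "t_" + tenantId + "__" + baseCollection;
}
```

If the adapter is the only caller, for example `collectionNameFor("acme", "entities")` resolving to `t_acme__entities`, application code and connectors keep using base collection names and remain unchanged, as Step A requires.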

This path avoids the big-bang rewrite. Each step is independently justified by a confirmed problem. The architecture evolves toward cells driven by real customer requirements, not hypothetical scale.
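Step C's token bucket keyed by tenantId can be sketched as below. This is an in-memory illustration under stated assumptions: the class name, the capacity and refill parameters, and the injectable clock are all hypothetical, and a deployment with more than one API process would need a shared store for the buckets instead of a local `Map`.

```typescript
interface Bucket { tokens: number; lastRefillMs: number }

// One token bucket per tenant. Each request costs one token; tokens refill
// continuously at refillPerSecond up to capacity.
class TenantRateLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  // Returns true if the tenant's request may proceed.
  // nowMs is injectable so the behaviour can be tested without real time passing.
  tryConsume(tenantId: string, nowMs: number = Date.now()): boolean {
    const bucket = this.buckets.get(tenantId) ?? { tokens: this.capacity, lastRefillMs: nowMs };
    const elapsedSec = Math.max(0, nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSec * this.refillPerSecond);
    bucket.lastRefillMs = nowMs;
    const allowed = bucket.tokens >= 1;
    if (allowed) {
      bucket.tokens -= 1;
    }
    this.buckets.set(tenantId, bucket);
    return allowed;
  }
}
```

In Express middleware, a `false` result would typically map to an HTTP 429 response for that tenant only, which is what keeps one tenant's burst from consuming another tenant's API capacity.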
