
SecurityV0 — Comprehensive Architecture & Security Audit Report


Executive Summary

SecurityV0 is a well-conceived Autonomous Execution Exposure Management platform with sound architectural principles: deterministic findings, evidence-grade audit trails, temporal drift detection. The core pipeline is production-ready. However, the audit uncovered 2 critical security vulnerabilities, 7 ship-blocking implementation bugs, and significant scalability risks that must be addressed before claiming production readiness.


0. Architecture Decisions — Critical Review

This section reviews the strongest criticisms of each major architectural decision. Some decisions hold up under scrutiny; others have real structural problems.


0.1 MongoDB for Graph Storage — The Evidence Immutability Claim Is False

The ADRs claim: MongoDB stores immutable evidence packs via SHA256 hashes.

The reality: The SHA256 hash is stored in the same mutable MongoDB collection as the content it is supposed to protect. A database administrator — or a compromised service account with write access — can modify both the content and the hash in a single operation. MongoDB has no append-only collection mode, no WORM storage, and no write-once semantics. The immutability is a convention enforced by application code, not by the database.

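Why this matters for a security product: When a customer challenges the integrity of a finding during an incident response, SecurityV0's answer is "trust us." There is no cryptographic proof that the finding wasn't modified after the fact. NIST SP 800-53 AU-10 (non-repudiation) asks for exactly this kind of proof, and a SOC 2 auditor reviewing evidence integrity will raise the same question (AU-10 is a NIST control identifier; SOC 2 does not define a control by that name).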

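Fix: Anchor evidence pack hashes outside the mutable collection. The cheapest starting point is an append-only PostgreSQL table with triggers preventing UPDATE/DELETE, although a DBA who can disable triggers can still bypass it; a ledger such as Amazon QLDB, a transparency log, or WORM storage gives a stronger guarantee (§9.1 ranks the options). Even the weakest option costs little operational effort and materially improves the compliance posture.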


0.2 Materialized Paths — The Write Amplification Is Worse Than Documented

The ADRs claim: Materialized paths provide O(1) blast radius queries. The scaling ceiling is ~10K identities.

The reality: The re-materialization cost is O(I × R × P × Res), where I = identities holding the changed role, R = roles per identity, P = permissions per role, and Res = resources per permission. When a role held by 3,000 identities changes its permissions, the materializer issues millions of read operations (§9.1 works through the numbers) and writes updated accessible_by arrays across hundreds of resource documents, all non-atomically. A failure midway leaves the graph in an inconsistent state. The ADR says "eventual consistency" as if it's acceptable; for a security product where blast radius queries are the core value proposition, inconsistent state during sync means incorrect answers to the CISO's primary question.

The trigger for this breaking: It is not raw entity count. It is role fan-out. A single highly-shared role (like "Developer" held by 3,000 engineers) changing permissions triggers the storm. This happens at much smaller tenant sizes than 10K total identities.

The ADR should lower the Neo4j/Kuzu trigger from 10,000 identities to 5,000 — or more precisely, to any role with >1,000 holders.


0.3 Stateless Sessions for a Security Platform — Structurally Wrong

The ADR claims: iron-session provides secure, stateless encrypted cookies. The design is provider-independent.

The reality: iron-session stores an encrypted, self-contained session payload in a cookie — this is genuinely stateless from the server's perspective (no server-side lookup to decode the session). However, the middleware hits MongoDB on every request anyway to validate the user's current membership and permissions, so the server-side lookup is happening regardless. In that context, iron-session's statelessness provides no performance benefit, and it removes the ability to revoke an individual session: fire an employee, deactivate their WorkOS account, and their sv0_session cookie remains valid for up to 7 days.

Additional gap — logout is a no-op: workos-provider.ts:74 has an empty logout() method. Clearing the cookie does not revoke the session on WorkOS's side. A cookie exfiltrated before logout remains valid.

The 7-day TTL is inappropriate. AWS Console sessions are 1–12 hours. Security tooling industry practice is 8–24 hours for human sessions. A security platform storing CISO-grade findings should not have sessions that outlive most employees' work weeks.
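Because the middleware already loads the user record on every request, per-user revocation can piggyback on that lookup at no extra cost. The sketch below assumes a sessionsRevokedAt timestamp on the user record (the sessions_revoked_at idea from §9.4) and an issuedAt field sealed into the cookie; the type names, field names, and the 24-hour cap are illustrative assumptions, not the current session.ts.

```typescript
// Hypothetical shapes: the sealed cookie payload and the user document.
interface SessionPayload {
  userId: string;
  issuedAt: number; // epoch milliseconds, set when the cookie is sealed
}

interface UserRecord {
  id: string;
  sessionsRevokedAt?: number; // epoch milliseconds, set on deprovisioning or logout
}

// Assumed cap: 24 hours, the upper end of the range cited above.
const MAX_SESSION_AGE_MS = 24 * 60 * 60 * 1000;

// Returns true only if the session belongs to the user, is younger than the
// cap, and was issued after the user's most recent revocation timestamp.
function isSessionAcceptable(session: SessionPayload, user: UserRecord, now: number): boolean {
  if (session.userId !== user.id) return false;
  if (now - session.issuedAt > MAX_SESSION_AGE_MS) return false;
  if (user.sessionsRevokedAt !== undefined && session.issuedAt <= user.sessionsRevokedAt) {
    return false;
  }
  return true;
}
```

With this check in place, deactivating a user only requires stamping sessionsRevokedAt; every cookie issued before that moment is rejected on its next request.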


0.4 The In-Process Job Queue — Production Incident Waiting to Happen

The ADR claims: The in-process FIFO queue is sufficient for MVP scale.

The reality: The WorkerJob[] array at runtime.ts:26 is unbounded (no backpressure), not persisted (lost on restart), and not recoverable (no dead letter queue). The shutdown() handler sets a flag and exits — if a sync is mid-flight at step 6 of 11 when the container restarts (deployment, OOM kill, crash), the sync stays in "running" status forever. There is no detection, no alerting, no recovery path.

The event loop concern is a red herring. The real risk is MongoDB connection pool pressure under concurrent syncs. Sequential await calls release connections back to the pool between operations — they do not hold a connection across the full path materialization loop. Pool saturation occurs when multiple syncs run concurrently (each holding its own connections simultaneously). The current serial in-process queue actually prevents this specific problem by serializing syncs. The correct argument for replacing it is persistence and recovery (lost jobs on restart, no dead letter queue) — not connection pool saturation.

The Express 4 async bug is real: When an async route handler in Express 4 rejects, the rejection is never forwarded to the error-handling middleware, so no response is sent and the request hangs until the client or a proxy times out. ingest.ts:160 has exactly this pattern. Express 5 forwards rejected promises to the error handler natively.
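Until the Express 5 migration lands, the usual stopgap is a wrapper that forwards rejections to next(). A minimal sketch follows; Req, Res, and Next are simplified stand-ins for Express's own types, and asyncHandler is a hypothetical helper name rather than something that exists in the codebase.

```typescript
// Simplified stand-ins for Express's Request, Response, and NextFunction
// (assumptions for illustration; the real types come from "express").
type Req = { body?: unknown };
type Res = { status: (code: number) => Res; json: (payload: unknown) => void };
type Next = (err?: unknown) => void;
type AsyncRoute = (req: Req, res: Res, next: Next) => Promise<void>;

// Wraps an async route so that a rejected promise reaches the error
// middleware through next(err) instead of leaving the request hanging.
function asyncHandler(route: AsyncRoute) {
  return (req: Req, res: Res, next: Next): void => {
    route(req, res, next).catch(next);
  };
}
```

Each async route would then be registered as asyncHandler(async (req, res) => { ... }), which keeps the handler bodies free of repeated try/catch blocks.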


0.5 The ELK.js Web Worker — ADR and Code Are Contradictory

ADR-011 states explicitly: "The layout uses the Web Worker variant from day one (elkjs/lib/elk-worker.min.js) — since the API is async either way, using the worker costs no extra complexity and keeps the UI thread free for all graph sizes."

The actual code at layout.ts:1:

import ELK from "elkjs/lib/elk.bundled.js";  // main thread — blocks UI

This is not a gray area or a judgment call. The ADR says use the worker variant. The implementation uses the main thread variant. At 200+ nodes, layout computation freezes the UI for 150-400ms. The spinner overlay that displays during layout may not even render before the thread locks.

This is the easiest fix in the entire audit: a small change confined to layout.ts, verified against the ADR.


0.6 SSH Deployment Key — Docker Group Membership Is Root

The deployment docs claim: Deployment uses a restricted deploy user for security.

The reality: The deploy user is in the Docker group (deployment.md:307-308: sudo usermod -aG docker deploy). Docker group membership is functionally equivalent to root: docker run -it -v /:/host ubuntu chroot /host gives a root shell on the host. A compromised DEPLOY_SSH_KEY (which is exposed to every GitHub Actions runner that touches this repo) gives the attacker full root access to the production server, MongoDB included.

The kill chain is not theoretical. Any step in a workflow job that receives the secret can read it, so a supply chain attack on a dependency or third-party action in the CI pipeline, or a compromised runner, exposes the key.

The fix is specific: Remove deploy from the Docker group. Use sudo with an allowlist of exactly two commands: docker compose pull and docker compose up -d in the platform directory. Nothing else.


0.7 REQUIRE_AUTH Defaults to False — The Insecure Default Ships

The deployment compose claims: Authentication is configurable.

The reality: docker-compose.deploy.yml:46 has REQUIRE_AUTH: "${REQUIRE_AUTH:-false}". The default is the insecure value. A deployment that forgets to set this environment variable — or a new engineer who spins up an instance following the compose file — gets a fully unauthenticated API where any caller can inject data into any tenant by setting the X-Tenant-Id header.

The Zod schema in env.ts:18 defaults to "true", which partially saves production. But this is defense by accident — two defaults in two files that contradict each other. The compose file default should be true. Secure defaults must not require active choices.
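One way to make the secure default unambiguous is to resolve the flag in a single place where an unset value means "on". The sketch below is an assumption-laden illustration: resolveRequireAuth is a hypothetical helper, and refusing REQUIRE_AUTH=false under NODE_ENV=production is an extra hardening step this report does not mandate.

```typescript
type Env = Record<string, string | undefined>;

// Resolves REQUIRE_AUTH with a secure default: unset means authentication is
// required, invalid values are rejected, and an explicit "false" is refused
// in production (the production refusal is an assumption of this sketch).
function resolveRequireAuth(env: Env): boolean {
  const raw = (env["REQUIRE_AUTH"] ?? "true").trim().toLowerCase();
  if (raw !== "true" && raw !== "false") {
    throw new Error(`REQUIRE_AUTH must be "true" or "false", got "${raw}"`);
  }
  if (raw === "false" && env["NODE_ENV"] === "production") {
    throw new Error("REQUIRE_AUTH=false is not allowed when NODE_ENV=production");
  }
  return raw === "true";
}
```

Called once at startup with process.env, this fails fast instead of silently serving an unauthenticated API.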


0.8 What the ADRs Got Right (Genuinely)

Several decisions hold up under scrutiny:

  • Python for connectors: Correct. boto3/msgraph-sdk ecosystem advantage is real. Go/TypeScript/Rust offer no practical advantage for I/O-bound batch API scanning. The GIL is irrelevant.
  • Docker Compose for current scale: Correct for now. The deploy-instance.sh multi-instance orchestration with Caddy hot-reload is well-engineered for the current 2-server footprint. It becomes a ceiling when rolling deployments, multi-host scaling, or cell provisioning automation are required — see §13.
  • WorkOS selection: Correct. Provider abstraction (AuthProvider interface) is clean. Exit path exists. Admin Portal alone justifies the choice for enterprise SSO onboarding.
  • 10-entity type model: Correct. Not too universal. Adding a 4th connector is 90% connector work. The subtype system handles cloud-specific variation without fracturing the evaluator rules.
  • Rejecting Apache AGE (ADR-003): Correct. Variable-length path exponential blowup is architectural. AWS RDS still doesn't support it. Decision remains valid.
  • StorageAdapter abstraction: The single best decision in the codebase. 60+ methods behind a clean interface makes Kuzu, Neo4j, or any future migration feasible without touching connectors or evaluator rules.

1. Architecture Overview

What it is: A system of record for Non-Human Identity (NHI) execution authority. Answers the CISO question: "What can this automation actually do, who owns it, and what happened to its access?"

Stack: Node.js/TypeScript API + React 19 frontend + MongoDB + Python connectors (entra-servicenow, azure-foundry, aws)

Pipeline: 3-job sequential: sync_ingestion → evaluate_findings → build_evidence_pack (SHA256-sealed, immutable)

Entity model: 10 types — workload, connection, credential, identity, role, permission_set, permission, resource, owner, execution_evidence

15 finding rules in the evaluator (orphaned_ownership, scope_drift, dormant_authority, reachability_drift, llm_egress, etc.)

Architecture maturity: ~75% — pipeline solid, auth transition in-progress, SCIM/OAA deferred, ~40% of docs stale vs. current implementation.


2. Gray Zone #1 — Graph Storage & Scalability

MongoDB: Adequate for MVP, Breaking Point at ~10K Identities

The architecture uses materialized execution paths (pre-computed at sync time, O(1) blast radius queries) rather than real-time graph traversal. No $graphLookup anywhere — application-level BFS only. This is a deliberate, documented trade-off.

Scale ceiling:

| Scenario | Latency | Breaks At |
| --- | --- | --- |
| MVP (<1K identities, 2-3 connectors) | <100ms | |
| Growth (5K identities) | 100-500ms | Path recompute bottleneck |
| Scale (10K+ identities) | 500ms-2s | Breaking point |
| Production (50K identities) | 2-10s | Query timeouts, incomplete results |

On graph database alternatives: The decision to stay on MongoDB is correct for now, but the reasoning in the ADRs is partially wrong.

The claim that "Neo4j is bad at rich document storage" is not a valid reason to avoid it — Neo4j handles property maps on nodes/edges adequately, and more importantly, PostgreSQL with JSONB handles document storage extremely well and is fast. A PostgreSQL-based alternative covers document storage, temporal queries (range types, tstzrange), and graph traversal (recursive CTEs or Apache AGE extension) in a single engine. AGE was rejected (ADR-003) for exponential blowup on variable-length paths — a real limitation — but that's a specific traversal argument, not a document storage argument.

The correct reasons to stay on MongoDB at current scale:

  • The StorageAdapter abstraction already makes migration low-cost whenever the trigger is hit
  • MongoDB is sufficient for <10K identities with the current materialized path model
  • Adding a graph engine before the breaking point is premature

When the 10K identity breaking point approaches, the real options are:

| Option | Graph | Documents | Temporal | Ops Cost |
| --- | --- | --- | --- | --- |
| Kuzu (embedded) | Cypher, fast analytics | Via MongoDB (hybrid) | Via MongoDB | Near-zero: no new service |
| PostgreSQL + AGE | OpenCypher, exponential blowup risk on deep paths | JSONB, excellent | Native range types | One service replaces MongoDB |
| PostgreSQL (recursive CTEs) | Depth-limited traversal only | JSONB, excellent | Native range types | One service replaces MongoDB |
| Neo4j | Best-in-class graph | Property maps (adequate) | Temporal plugin needed | High: dedicated server |
| Neptune | Gremlin/SPARQL | External only | External only | AWS lock-in |

PostgreSQL is a legitimate and underrated option — it is not on the ADR radar but should be. A single Postgres instance with JSONB columns replaces MongoDB entirely, handles temporal queries natively, and graph traversal via recursive CTEs works for depth-limited paths (which is all SecurityV0 needs at MAX_AUTH_CHAIN_DEPTH). The StorageAdapter abstraction makes this migration just as feasible as Neo4j. Kuzu remains the lowest-friction first step (embedded, no new service, Cypher queries replace BFS loops).

Critical Code Bugs

| Severity | Issue | File:Line |
| --- | --- | --- |
| CRITICAL | Reverse-lookup BFS has no document limit; can pull 50K+ docs into memory | subgraph-adapter.ts:158 |
| HIGH | Unbounded frontier growth; exponential blowup on high-degree nodes | subgraph-adapter.ts:35 |
| HIGH | Stale execution paths when role GRANTS change; affected identities not re-materialized | path-materializer.ts:40 |
| HIGH | MAX_AUTH_CHAIN_DEPTH=1; any 3-system chain (Entra → SN → Slack) is missed | path-materializer.ts:17 |
| MEDIUM | Blast radius endpoint returns all paths with no pagination | paths.ts:14 |
| MEDIUM | Visited set in DFS causes path aliasing via shared state across branches | path-materializer.ts:110 |
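Two of the findings above (unbounded growth and the shared visited set) have the same shape of fix: bound the traversal and give each branch its own state. The sketch below is one reading of those findings, not the code in path-materializer.ts; enumeratePaths, maxDepth, and maxPaths are hypothetical names.

```typescript
type Graph = Map<string, string[]>; // adjacency list: node id -> neighbor ids

// Enumerates simple paths from `start` with at most `maxDepth` edges and
// stops once `maxPaths` paths are collected, so a high-degree node cannot
// exhaust memory.
function enumeratePaths(graph: Graph, start: string, maxDepth: number, maxPaths: number): string[][] {
  const paths: string[][] = [];

  const walk = (node: string, trail: string[], onTrail: Set<string>): void => {
    if (paths.length >= maxPaths) return;
    if (trail.length > 1) paths.push(trail);
    if (trail.length - 1 >= maxDepth) return;
    for (const next of graph.get(node) ?? []) {
      if (onTrail.has(next)) continue; // cycle guard for this branch only
      // Copy the trail and its set so sibling branches never share state.
      walk(next, [...trail, next], new Set(onTrail).add(next));
    }
  };

  walk(start, [start], new Set([start]));
  return paths;
}
```

On a diamond (A to B and C, both to D) this reports both A→B→D and A→C→D. A single visited set shared across branches would mark D after the first branch and drop the second path.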

3. Gray Zone #2 — Data Model Universality

Verdict: NOT Too Universal

The 10-type model is well-differentiated along three orthogonal axes: functional role, scope binding, temporal nature. The permission_set type (ADR-014) correctly distinguishes IAM policy documents (ceiling constraints) from role grants.

However: four deterministic, silent failures (F1–F4 below) make key AWS features completely non-functional:

Ship-Blocking Bugs in AWS Connector

F1 — privilege_justification_gap returns 0 findings on all AWS data

  • path.resource_id is a MongoDB hex hash, never matches an ARN
  • Rule matching branch always fails for AWS sources
  • File: src/evaluator/rules/privilege-justification-gap.ts:48-50

F2 — CloudTrail extractor doesn't exist

  • cloudtrail_evidence initialized to [] in cli/main.py:146, never populated
  • dormant_authority rule fires on 100% of Lambda functions (no evidence ever found)
  • _transform_cloudtrail_evidence() exists but receives empty input, discards request_parameters and resources anyway
  • Tracked: sv0-connectors#31

F3 — Assumed-role ARN parser returns None for 80-90% of real AWS events

  • Lambda, ECS, Step Functions, Bedrock all produce sts:assumed-role/RoleName/session ARNs
  • Parser only handles iam:role/ and iam:user/ shapes
  • All assumed-role evidence lands with entity_id: "" — ungroupable by workload
  • Fix is 5 lines: add elif ":assumed-role/" in arn: branch at transformer.py:1768
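The missing branch in F3 is small enough to show in full. The actual fix belongs in the Python transformer; the sketch below expresses the same parsing logic in TypeScript only for consistency with the other examples in this report, and parsePrincipalArn and ParsedPrincipal are hypothetical names.

```typescript
interface ParsedPrincipal {
  kind: "role" | "user" | "assumed-role";
  name: string;
  session?: string;
}

// ARN layout: arn:partition:service:region:account-id:resource
function parsePrincipalArn(arn: string): ParsedPrincipal | null {
  const parts = arn.split(":");
  if (parts.length < 6 || parts[0] !== "arn") return null;
  const service = parts[2];
  const segments = parts.slice(5).join(":").split("/");
  const type = segments[0];

  if (service === "iam" && (type === "role" || type === "user") && segments.length >= 2) {
    // IAM paths can add segments (role/service-role/Name); the name is last.
    return { kind: type, name: segments[segments.length - 1] };
  }
  if (service === "sts" && type === "assumed-role" && segments.length >= 3) {
    // Assumed-role ARNs have the form assumed-role/RoleName/SessionName.
    return { kind: "assumed-role", name: segments[1], session: segments[2] };
  }
  return null;
}
```

Grouping by the returned role name rather than the session name is what lets evidence from Lambda, ECS, and similar services attach to a workload.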

F4 — AWS connector never sets normalized_action — all AWS execution path actions are "unknown"

  • path-materializer.ts:147 reads perm.properties.normalized_action to populate the actions array on every execution path
  • The Entra-ServiceNow and Azure-Foundry connectors both set normalized_action ("read", "write", "admin", "execute")
  • The AWS connector sets only properties.action (raw IAM string: iam:PassRole, iam:CreateRole, etc.) and never sets normalized_action
  • Result: every AWS execution path has actions: ["unknown"] — the raw IAM action is silently discarded by the materializer
  • Second reason F1 is broken: even after the resource_id matching fix, privilege_justification_gap's write-level action mismatch check (hasWriteActions()) would still never trigger on AWS data because it checks for "write", "admin", "delete" — not "unknown"
  • Blocks escalation detection: a future escalation_capable rule checking for IAM privilege-escalation actions (iam:PassRole, iam:CreateRole, sts:AssumeRole*) cannot work until this is fixed
  • scope_drift is NOT affected — it checks role additions against domain sensitivity, never reads path.actions
  • Not caught by any test: AWS connector tests only assert node/edge counts and subtype == "iam_permission", never check normalized_action; all path materializer and evaluator tests use hand-crafted entra_id fixtures with normalized_action explicitly set; no seed data includes AWS-sourced entities
  • Files: sv0-connectors/integrations/aws/src/sv0_aws/core/transformer.py:1619–1628 (sets action, not normalized_action), sv0-platform/src/ingestion/path-materializer.ts:147 (reads normalized_action with ?? "unknown" fallback, no attempt to read properties.action)
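A materializer-side fallback would stop the raw IAM action from being silently discarded even before the connector is fixed. The sketch below is illustrative only: the verb-prefix heuristics and the escalation list are assumptions made for this example (a production mapping would need per-service review), and normalizeIamAction and resolveAction are hypothetical names.

```typescript
type NormalizedAction = "read" | "write" | "admin" | "execute" | "unknown";

// Hypothetical lookup tables for illustration only.
const ESCALATION_ACTIONS = ["iam:passrole", "iam:createrole", "sts:assumerole"];
const VERB_PREFIXES: Array<[string, NormalizedAction]> = [
  ["get", "read"], ["list", "read"], ["describe", "read"],
  ["create", "write"], ["put", "write"], ["update", "write"], ["delete", "write"],
  ["invoke", "execute"], ["start", "execute"], ["run", "execute"],
];

// Maps a raw IAM action such as "s3:GetObject" to a normalized action.
function normalizeIamAction(rawAction: string): NormalizedAction {
  const action = rawAction.toLowerCase();
  if (ESCALATION_ACTIONS.some((candidate) => action.startsWith(candidate))) return "admin";
  const verb = action.split(":")[1] ?? "";
  const match = VERB_PREFIXES.find(([prefix]) => verb.startsWith(prefix));
  return match ? match[1] : "unknown";
}

// Prefer the connector's value, then derive one from the raw action, and
// only then fall back to "unknown".
function resolveAction(properties: { normalized_action?: NormalizedAction; action?: string }): NormalizedAction {
  if (properties.normalized_action) return properties.normalized_action;
  return properties.action ? normalizeIamAction(properties.action) : "unknown";
}
```

The durable fix is still for the AWS connector to set normalized_action itself, as the other two connectors do; the fallback only narrows the window in which paths report "unknown".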

Additional gaps:

  • permission_set platform materializer not updated — still traverses HAS_ROLE for AWS paths → incorrect via_roles on all AWS authority paths
  • Ownership mapping from AWS resource tags never implemented → all AWS identities ownership_state: unknown
  • resource_name never populated on AWS resource nodes
  • AWS IAM condition keys detected but not evaluated → authority paths over-report reachability; no conditions_not_evaluated flag on ExecutionPath to surface this

On "metadata-only vs. code analysis":

  • Structural authorization (what roles can reach what): ✅ works when CloudTrail bugs fixed
  • Behavioral (is identity actually used): ⚠️ blocked by F2/F3
  • Code vulnerability (injection, hardcoded secrets): ❌ out of scope, needs SAST/SCA connectors (future additive connector)

4. Gray Zone #3 — Connector Rate Limiting

Overall Risk: HIGH — Inconsistent throttling resilience across connectors

| Connector | Risk | Primary Issue |
| --- | --- | --- |
| AWS | MEDIUM | Good botocore adaptive retry; missing jitter |
| Azure Entra | HIGH | Sequential-only (12+ min for 500 SPs at 2 RPS); no explicit Retry-After |
| ServiceNow | CRITICAL | Offset pagination breaks on 429; no cursor resume |
| Azure Foundry | MEDIUM | Relies on SDK defaults; behavior unclear |

Critical code findings:

  • servicenow_client.py:421: the check "if response.status_code != 200: break" silently drops the remainder of pagination on any 429
  • aws_client.py:276: "wait_time = 2**retry_count" has no jitter → synchronized retry storms across tenants
  • No global rate-quota tracker — one large tenant's scan blocks others
  • No per-resource skip logic — one failed get_policy() fails the entire scan
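The first two findings above share a remedy: on a 429, wait (honoring Retry-After when present, otherwise backing off with jitter) and retry the same offset rather than abandoning the scan. The connectors themselves are Python; the sketch below shows the control flow in TypeScript for consistency with the other examples here. Page, FetchPage, fetchAllPages, and all default values are assumptions for illustration.

```typescript
interface Page<T> {
  status: number;
  items: T[];
  retryAfterSeconds?: number; // parsed from a Retry-After header, if any
}
type FetchPage<T> = (offset: number, limit: number) => Promise<Page<T>>;
type Sleep = (ms: number) => Promise<void>;

// "Full jitter": a uniform random delay in [0, min(cap, base * 2^attempt)),
// which de-synchronizes clients that were throttled at the same moment.
function backoffWithJitter(
  attempt: number,
  baseMs = 1000,
  capMs = 60_000,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * Math.min(capMs, baseMs * 2 ** attempt));
}

async function fetchAllPages<T>(
  fetchPage: FetchPage<T>,
  sleep: Sleep,
  limit = 100,
  maxRetries = 5,
): Promise<T[]> {
  const all: T[] = [];
  let offset = 0;
  let attempt = 0;
  for (;;) {
    const page = await fetchPage(offset, limit);
    if (page.status === 429) {
      if (attempt >= maxRetries) {
        throw new Error(`rate limited at offset ${offset} after ${maxRetries} retries`);
      }
      // Prefer the server's hint; otherwise back off with jitter.
      const delayMs = page.retryAfterSeconds !== undefined
        ? page.retryAfterSeconds * 1000
        : backoffWithJitter(attempt);
      await sleep(delayMs);
      attempt += 1;
      continue; // retry the same offset instead of dropping the remainder
    }
    if (page.status !== 200) {
      throw new Error(`unexpected status ${page.status} at offset ${offset}`);
    }
    attempt = 0;
    all.push(...page.items);
    if (page.items.length < limit) return all;
    offset += page.items.length;
  }
}
```

Throwing on exhaustion matters as much as retrying: a scan that fails loudly cannot be mistaken for a complete baseline, whereas a silent break can.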

Rate limit exposure at medium scale (500 resources):

| Service | Limit | Calls/Scan | Risk at 10K resources |
| --- | --- | --- | --- |
| AWS IAM | ~20 RPS | 500-1500 | ~15min sustained, retries cascade |
| Azure Graph API | 2 RPS | 600+ | 12+ min serial, any 429 stalls all |
| ServiceNow | 2-4 RPS | 200-250 | No recovery on 429 |
| Azure Foundry ARM | 4 RPS | 150 | Unclear retry behavior |

5. The Blocker — AI Agent Permissions & PII Access Graph

"Show new permissions graph when deploying AI agent with MCP servers, flag PII access"

What SecurityV0 Already Has

  • ai_agent workload subtype — already in entity model
  • 5-level sensitivity classification propagates through authority paths
  • reachability_drift, scope_drift evaluator rules detect changes since baseline
  • reachable_sensitive_domain finding fires on PII-classified resource access
  • Deployment approval fully designed (research docs 2026-04-07-mcp-agentic-deployment-approval-research.md, 12-deployment-approval.md)

What's Missing — Implementation, Not Design

| Gap | Notes |
| --- | --- |
| mcp_tool entity type + DECLARES_TOOL relationship | Tools currently invisible in graph |
| data_domain entity type + ACCESSES relationship | Business domain classification needed |
| MCP manifest parser (mcp.json → NormalizedGraph) | No parser exists |
| Graph projection algorithm (merge manifest → run materializer on projected state) | Core "what-if" engine |
| POST /api/v1/deployment/preview endpoint | Designed, not coded |
| PII output schema tracking on tool declarations | Resource-level exists; tool output level missing |
| Approval record storage + UI | Operating layer not built |

Hard Problems (No Easy Solution)

  • MCP tool opacity — tools are blackboxes; declared ≠ actual. Mitigation: cryptographic manifest attestation, grade as "C" until runtime evidence
  • One identity per MCP server — all tools share service principal blast radius. SV0 detects; application architecture must fix
  • PII exfiltration tracking — tool output schema declaration partially solves; runtime inspection required for full coverage

6. Platform Security Audit

Critical Vulnerabilities

CRITICAL — Cross-Tenant IDOR via REQUIRE_AUTH Bypass

When REQUIRE_AUTH=false (the default in docker-compose.deploy.yml, see §0.7):

  1. auth.ts:62-70 — sets req.auth = { tenantId: attacker-controlled }
  2. tenant-context.ts:12-14 — reads tenant from auth, no membership validation
  3. A connector can POST /api/v1/ingest/normalized-graph with X-Tenant-Id: victim-tenant and inject data into any tenant

The new auth-middleware.ts with WorkOS membership validation fixes this, but has not been deployed (app.ts:26-29 TODO).

CRITICAL — DevAuthProvider Has No Production Gate

dev-provider.ts:100-108 — returns valid super-admin session for any token when AUTH_PROVIDER=dev. If set in production, auth is completely bypassed.

Fix: provider-factory.ts must throw on AUTH_PROVIDER=dev && NODE_ENV=production.
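The guard is a few lines at provider selection time. A minimal sketch, assuming the factory receives the environment as a plain object; selectAuthProvider and the "workos" default are hypothetical and not taken from provider-factory.ts.

```typescript
type AuthProviderName = "workos" | "dev";

// Refuses to start with the dev provider in production, so a stray
// AUTH_PROVIDER=dev cannot silently disable authentication.
function selectAuthProvider(env: Record<string, string | undefined>): AuthProviderName {
  const provider = env["AUTH_PROVIDER"] ?? "workos"; // assumed default
  if (provider !== "workos" && provider !== "dev") {
    throw new Error(`Unknown AUTH_PROVIDER "${provider}"`);
  }
  if (provider === "dev" && env["NODE_ENV"] === "production") {
    throw new Error("AUTH_PROVIDER=dev is not allowed when NODE_ENV=production");
  }
  return provider;
}
```

Failing at startup is deliberate: a crashed container is visible immediately, while a bypassed auth layer is not.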

Full Severity Table

| Severity | Issue | File:Line | Fix |
| --- | --- | --- | --- |
| CRITICAL | Cross-tenant IDOR via REQUIRE_AUTH bypass | auth.ts:62-70 | Deploy new auth-middleware with membership check |
| CRITICAL | DevAuthProvider: no production gate | dev-provider.ts:100-108 | Throw if AUTH_PROVIDER=dev && NODE_ENV=production |
| HIGH | Ingest: no cycle detection; evaluator infinite-loop risk | ingest.ts:121-152 | DFS cycle check; max 100K nodes |
| HIGH | Connector reports: .passthrough() allows field injection | ingest.ts:65-73 | Remove passthrough; ban _-prefixed fields |
| MEDIUM | Rate limiting per-tenant only; bypass by rotating tenant IDs | rate-limit.ts:14-16 | Key on ${tenantId}:${principalId} |
| MEDIUM | Path evaluator: no depth limit on ownership chain traversal | path-evaluator.ts:127 | Max 10 levels; fail with unresolved_ownership_depth |
| MEDIUM | Session: no refresh token; 7-day TTL forces full re-auth | session.ts:56-68 | Add POST /auth/refresh; 24h sliding window |
| MEDIUM | Silent entity overwrite without idempotency warning | ingest.ts:160-206 | Warn if nodeIds exist in prior syncs |
| LOW | q search param not verified escaped before MongoDB regex | entities.ts:48-50 | Finding retracted: escapeRegex() exists in entity-adapter.ts and is applied before every $regex query. No injection risk. |

Positive: Helmet enabled, CORS explicit, x-powered-by disabled, 5MB body limit, no hardcoded secrets, Zod validation throughout.
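The DFS cycle check and node cap recommended for the ingest path can be sketched as follows. This is an illustration under assumed shapes: Edge, hasCycle, and the flat nodeIds/edges inputs are hypothetical and do not mirror the NormalizedGraph schema. The traversal is iterative so that a graph near the 100K-node cap cannot overflow the call stack.

```typescript
interface Edge {
  from: string;
  to: string;
}

const MAX_NODES = 100_000;

// Iterative DFS with "visiting"/"done" marks; returns true if the directed
// graph contains a cycle (an edge back to a node still on the DFS stack).
function hasCycle(nodeIds: string[], edges: Edge[]): boolean {
  if (nodeIds.length > MAX_NODES) {
    throw new Error(`graph exceeds ${MAX_NODES} nodes`);
  }
  const adjacency = new Map<string, string[]>();
  for (const id of nodeIds) adjacency.set(id, []);
  for (const { from, to } of edges) adjacency.get(from)?.push(to);

  const state = new Map<string, "visiting" | "done">();
  for (const root of nodeIds) {
    if (state.has(root)) continue;
    // Each frame holds a node and the index of its next unexplored neighbor.
    const stack: Array<{ node: string; next: number }> = [{ node: root, next: 0 }];
    state.set(root, "visiting");
    while (stack.length > 0) {
      const frame = stack[stack.length - 1];
      const neighbors = adjacency.get(frame.node) ?? [];
      if (frame.next >= neighbors.length) {
        state.set(frame.node, "done");
        stack.pop();
        continue;
      }
      const neighbor = neighbors[frame.next++];
      const seen = state.get(neighbor);
      if (seen === "visiting") return true; // back edge
      if (seen === undefined) {
        state.set(neighbor, "visiting");
        stack.push({ node: neighbor, next: 0 });
      }
    }
  }
  return false;
}
```

Whether a detected cycle should reject the payload or merely flag it is a product decision; the check only makes the condition visible before the evaluator runs.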


7. Master Weakness Table

See §12 (Updated Master Priority Table) for the complete, reconciled finding list. §12 supersedes this section and includes findings from the full technology validation in §9–11. The table below is an early-pass summary retained for cross-reference with the section findings above.

| Pri | Category | Issue | Status |
| --- | --- | --- | --- |
| 1 | Security | Cross-tenant IDOR via REQUIRE_AUTH=false | Ship blocker |
| 2 | Security | DevAuthProvider no production gate | Ship blocker |
| 2a | Security | verifyM2MToken() returns null; every Bearer-token M2M auth path is completely unenforced | Ship blocker |
| 3 | AWS Connector | CloudTrail extractor not implemented | Ship blocker |
| 4 | AWS Connector | Assumed-role ARN parsing broken (80-90% events) | Ship blocker |
| 5 | AWS Connector | privilege_justification_gap always 0 on AWS | Ship blocker |
| 5a | AWS Connector | normalized_action never set; all AWS execution path actions are "unknown" | Ship blocker |
| 6 | AWS Platform | permission_set materializer not updated | Ship blocker |
| 7 | Graph DB | BFS reverse lookup: no document limit | Pre-scale blocker |
| 8 | Connector | ServiceNow pagination: no cursor resume on 429; corrupts baselines permanently | Ship blocker |
| 9 | Graph DB | Stale paths on role GRANTS change | Correctness gap |
| 10 | Graph DB | MAX_AUTH_CHAIN_DEPTH=1; 3-system chains missed | Feature gap |
| 11 | Security | Ingest: no cycle detection | Hardening |
| 12 | Security | .passthrough() allows field injection | Hardening |
| 13 | Connector | AWS backoff: no jitter | Pre-scale hardening |
| 14 | AWS Connector | Ownership not mapped from resource tags | Feature gap |
| 15 | AWS Connector | IAM conditions not evaluated; no caveat flag | Feature gap |
| 16 | MCP Feature | mcp_tool, manifest parser, graph projection missing | Phase 1 feature |
| 17 | Evaluator | No escalation/impersonation detection; roles with iam:PassRole, roleAssignments/write, actAs are invisible | Feature gap |
| 18 | Security | Rate limiting per-tenant only | Hardening |
| 19 | Docs | ~40% of architecture docs stale | Operational risk |

8. Prioritized Action Plan

Critical — Security and Data Integrity

  1. Deploy auth-middleware.ts pipeline — fixes IDOR
  2. Add AUTH_PROVIDER=dev && NODE_ENV=production guard
  3. Fix assumed-role ARN parser — 5-line fix at transformer.py:1768
  4. Fix ServiceNow pagination cursor resume

AWS Connector

  1. Implement CloudTrail extractor (sv0-connectors#31)
  2. Fix _transform_cloudtrail_evidence to preserve request_parameters + resources
  3. Update platform materializer for HAS_PERMISSION_SET traversal on AWS
  4. Add .limit(query.limit) to BFS reverse lookup

Correctness and Hardening

  1. Ownership mapping from AWS resource tags
  2. conditions_not_evaluated caveat flag on ExecutionPath
  3. Cycle detection in ingest schema validation
  4. Jitter on AWS backoff; MAX_AUTH_CHAIN_DEPTH → 2
  5. Rate limit key: ${tenantId}:${principalId}

MCP / AI Agent Feature

  1. mcp_tool entity + DECLARES_TOOL / ACCESSES relationships
  2. MCP manifest parser
  3. Graph projection algorithm + POST /api/v1/deployment/preview
  4. Approval record storage

Parallel Track

  1. Event-driven delta sync (CloudTrail streaming, Entra odata.deltaLink)
  2. Session refresh token endpoint
  3. Documentation refresh for 00-overview, 04-api, 07-ui

9. Technology Validation — Architecture Review (April 12, 2026)


9.1 Graph & Database Layer

Verdict: MongoDB + ADRs validated for current scale, with two key gaps: evidence immutability is structurally broken; Kuzu is a viable in-process alternative for path queries that hasn't been evaluated.

Confirmed / Adjusted

| Decision | Status | Adjustment |
| --- | --- | --- |
| MongoDB for MVP | VALIDATED | Correct for current scale |
| Single entities collection (ADR-002) | VALIDATED WITH CAVEAT | Implement accessible_by overflow collection before 5K identities/tenant |
| Materialized paths strategy | VALIDATED WITH CAVEAT | Write amplification is O(I × R × P × Res); role fan-out is the real scaling cliff, not raw entity count |
| No $graphLookup | VALIDATED | Application-level BFS is the documented trade-off; $graphLookup has depth limits, no shortest-path support, and does not address SecurityV0's bounded-hop traversal pattern better than the materialized path approach |
| Reject Apache AGE (ADR-003) | VALIDATED | Variable-length path exponential blowup is architectural; AWS RDS still unsupported |
| Neo4j trigger threshold | LOWER to 5K | Original ADR said 10K; role fan-out write amplification hits at ~3K identities sharing a common role |

New Findings

Write amplification formula: Suppose a role held by I identities changes its permissions, and each of those identities holds R roles with P permissions per role and Res resources per permission:

  • Read operations: I × (1 + R + R×P + R×P×Res)
  • Write operations: I writes (identity docs) + R×P×Res writes (resource accessible_by arrays)
  • At 3,000 identities × 10 roles × 20 permissions × 5 resources: 3,000 × 1,211 ≈ 3.6M read ops and 3,000 + 1,000 = 4,000 write ops per role change
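The two formulas are easy to evaluate for other tenant shapes. The helper below simply transcribes them; FanOut and materializationCost are hypothetical names introduced for this example.

```typescript
interface FanOut {
  identities: number;             // I: identities holding the changed role
  rolesPerIdentity: number;       // R
  permissionsPerRole: number;     // P
  resourcesPerPermission: number; // Res
}

// Transcribes the read and write formulas from the bullets above.
function materializationCost(f: FanOut): { reads: number; writes: number } {
  const { identities: I, rolesPerIdentity: R, permissionsPerRole: P, resourcesPerPermission: Res } = f;
  const reads = I * (1 + R + R * P + R * P * Res);
  const writes = I + R * P * Res;
  return { reads, writes };
}
```

For the inputs in the bullet above (3,000 identities, 10 roles, 20 permissions, 5 resources), the formulas as written evaluate to 3,633,000 reads and 4,000 writes. Because reads scale linearly with I, halving the number of holders of the shared role halves the read storm.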

Critical evidence immutability gap (HIGH for a security product):

  • SHA-256 hash stored alongside mutable data in the same MongoDB collection
  • Database admin can modify both content and hash in one operation
  • No chain-of-custody linking evidence records
  • No external trust anchor (Merkle tree, blockchain timestamp, signed receipt)
  • Mitigation options (in order of trustworthiness): (1) Sigstore transparency log or Amazon QLDB — cryptographic proof that a log entry existed at a specific time, verifiable by third parties, not bypassable by a DBA; (2) S3 Object Lock (WORM) — append-only at the storage layer, independent of application code; (3) append-only PostgreSQL trigger table — weakest option because a DBA with DISABLE TRIGGER permission can bypass it; application-enforced immutability has the same trust level as MongoDB convention
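The missing chain-of-custody link can be illustrated with a hash chain, in which each entry commits to the previous one. This is a sketch, not a complete remedy: a chain stored only in the same database can be recomputed end to end by a DBA, so it helps only when the latest entry hash is also anchored somewhere the DBA cannot write (one of the external options listed above). ChainEntry, appendEntry, and verifyChain are hypothetical names.

```typescript
import { createHash } from "node:crypto";

interface ChainEntry {
  index: number;
  evidenceHash: string;  // SHA-256 of the evidence pack content
  prevEntryHash: string; // entryHash of the previous entry, or GENESIS
  entryHash: string;     // SHA-256 over index, evidenceHash, prevEntryHash
}

const GENESIS = "0".repeat(64);

const sha256 = (input: string): string =>
  createHash("sha256").update(input).digest("hex");

// Returns a new chain with one more entry linked to the current head.
function appendEntry(chain: ChainEntry[], evidenceHash: string): ChainEntry[] {
  const prevEntryHash = chain.length > 0 ? chain[chain.length - 1].entryHash : GENESIS;
  const index = chain.length;
  const entryHash = sha256(`${index}:${evidenceHash}:${prevEntryHash}`);
  return [...chain, { index, evidenceHash, prevEntryHash, entryHash }];
}

// Recomputes every link; editing, reordering, or dropping an entry in the
// middle breaks verification for the chain.
function verifyChain(chain: ChainEntry[]): boolean {
  let prev = GENESIS;
  return chain.every((entry, i) => {
    const ok =
      entry.index === i &&
      entry.prevEntryHash === prev &&
      entry.entryHash === sha256(`${i}:${entry.evidenceHash}:${prev}`);
    prev = entry.entryHash;
    return ok;
  });
}
```

Publishing the head entryHash after each sync to an external anchor is what turns this from convention into evidence.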

Kuzu as embedded Neo4j alternative:

  • Kuzu is an embedded in-process OLAP graph database (like DuckDB for graphs) with OpenCypher support
  • Zero additional infrastructure — embedded library, ~50MB binary addition
  • Eliminates path materialization write amplification by computing paths at query time
  • Would replace execution_paths + accessible_by embedded arrays entirely
  • The StorageAdapter abstraction already enables this via a MongoKuzuStorageAdapter implementation
  • 2026 maturity: Adequate for analytics workloads; not recommended as primary transactional store
  • Verdict: Worth prototyping as an analytics layer over MongoDB for path queries — see Section 11 gray zone analysis

Bi-temporal gap: The platform tracks valid time (valid_at/expired_at) but transaction time is implicit. This matters for "did we know about this identity BEFORE the breach?" queries — a common compliance requirement.


9.2 API Runtime & Job Queue

Verdict: Replace in-memory queue with BullMQ + Redis. Upgrade Node 20 → 22. Express 5 for async error handling. The primary risk is OOM from the unbounded queue and lost jobs on restart — not event loop blocking or connection pool starvation. (Sequential awaits release connections between operations; pool saturation would require concurrent syncs, which the serial queue prevents.)

Critical Bugs Found

| Severity | Issue | File | Fix |
| --- | --- | --- | --- |
| HIGH | Async route handlers in Express 4 have no try/catch; unhandled rejection hangs request indefinitely | ingest.ts:160 | Add try/catch or express-async-errors |
| HIGH | In-memory queue: WorkerJob[] unbounded, lost on process restart | runtime.ts:26 | Bounded queue + job persistence |
| MEDIUM | Shutdown handler kills mid-flight jobs; 30s timeout + process.exit(1) can corrupt sync state | index.ts:103-138 | Drain queue before shutdown |
| MEDIUM | processedSyncIds Set is in-memory; lost on restart, re-processing risk | ingest-service.ts:14 | Persist to MongoDB. Note: the cross-tenant blocking concern requires an engineered UUID collision (1/2^122 probability for UUIDv4), which is not a realistic attack vector; the in-memory/restart concern is the actual issue here. |

Architecture Decisions

Node.js version: Node 20 reaches end-of-life at the end of April 2026. Node 22 LTS is the correct version, and the upgrade is a one-line change in Dockerfile:1,9. It brings V8 12.4 improvements with no breaking changes for this stack.

Job queue: BullMQ is the right direction. A lower-complexity alternative: MongoDB-backed job persistence (write jobs to worker_jobs collection before acknowledging, recover on startup) avoids adding Redis as a new stateful dependency. Decision table:

| Approach | Durability | Ops Complexity | Recommended |
| --- | --- | --- | --- |
| Current (in-memory) | None | Minimal | No |
| MongoDB-backed job store | At-least-once | Zero (uses existing Mongo) | Yes: lower complexity |
| BullMQ + Redis | At-least-once + advanced features | Adds Redis service | Yes: when queue needs grow |
| Temporal.io | Exactly-once + saga | Very high | No (overkill for 3-step pipeline) |
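Whichever durable store is chosen, the recovery rule on startup is the same: a job still marked "running" with a stale heartbeat is either re-queued or, after too many attempts, parked as failed (a dead-letter state). The sketch below applies that rule to an in-memory array; in practice it would be an update against the worker_jobs collection. JobRecord, recoverStaleJobs, and the heartbeat field are assumptions for illustration.

```typescript
type JobStatus = "queued" | "running" | "done" | "failed";

interface JobRecord {
  id: string;
  status: JobStatus;
  attempts: number;
  heartbeatAt: number; // epoch milliseconds of the worker's last heartbeat
}

// Re-queues "running" jobs whose heartbeat is older than staleAfterMs and
// marks jobs that already used maxAttempts as failed. Other jobs are
// returned unchanged.
function recoverStaleJobs(
  jobs: JobRecord[],
  now: number,
  staleAfterMs: number,
  maxAttempts: number,
): JobRecord[] {
  return jobs.map((job): JobRecord => {
    if (job.status !== "running" || now - job.heartbeatAt < staleAfterMs) return job;
    return job.attempts >= maxAttempts
      ? { ...job, status: "failed" }
      : { ...job, status: "queued", attempts: job.attempts + 1 };
  });
}
```

This gives at-least-once semantics, so each pipeline step must tolerate being run twice for the same sync.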

Container memory: Increase from 512MB → 1GB. A large tenant sync (5MB JSON graph + entity arrays + path materialization) can push 300-400MB leaving insufficient headroom.

Express version: Express 4 → 5 migration is low-risk and fixes async error handling natively.


9.3 Frontend Stack

Verdict: Stack is correct. Two bugs require immediate fixes. React Compiler should be enabled. Strategic concern: Graph Explorer as a primary view may not match CISO workflow — Wiz and Orca both use graphs as drill-down from findings, not standalone pages.

Critical Bug: ELK.js Running on Main Thread

File: ui/src/components/graph/layout.ts:1

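// CURRENT (wrong: runs layout on the main thread):
import ELK from "elkjs/lib/elk.bundled.js";

// SHOULD BE (ADR-011 explicit requirement): load the API shim and let it run
// the layout engine in a Web Worker. The worker URL is an assumption; it must
// point at wherever the build serves elkjs/lib/elk-worker.min.js.
import ELK from "elkjs/lib/elk-api.js";
const elk = new ELK({ workerUrl: "/elk-worker.min.js" });

ADR-011 explicitly requires the Web Worker variant. Note that elk-worker.min.js is the worker script itself, not a drop-in default export: the elkjs API is imported from elk-api.js and pointed at the worker. The bundled variant blocks the UI thread for 150-400ms at 200 nodes and 500ms-2s at 500 nodes. The spinner overlay is misleading, because it may not even paint before the thread freezes. The fix is confined to this one file.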

Performance Ceilings

Component | Safe | Warning | Breaking
@xyflow/react nodes | <200 (60fps) | 200-500 (30fps) | 500+ (<15fps)
ELK layout (Web Worker) | <100ms (<100 nodes) | 100ms-2s (100-500 nodes) | 2s+ (>500 nodes)
ELK layout (main thread, current) | <50ms | 50-500ms (jank) | 500ms+ (frozen)
Selection highlight re-render | <200 nodes | 200-500 (O(n) spread) | 500+

Additional Findings

  • ELK.js not lazy-loaded: 1.4MB loaded on every page including Dashboard and Findings. Should be dynamic import — only GraphCanvas and MiniGraph need it.
  • styledNodes memo defeated: GraphCanvas.tsx:102 creates new object references for all nodes on every selection change, defeating memo() on EntityNode. Fix: CSS class toggle instead of style spread.
  • React Compiler: Enable via babel-plugin-react-compiler in vite.config.ts. Eliminates manual useMemo/useCallback overhead across 6+ graph components.
  • Strategic: Graph Explorer as a primary UI view may not match CISO workflow. Wiz/Orca both use graphs as drill-down from findings, not standalone pages. Consider making Graph Explorer seed-anchored (always starts from a finding or entity) to cap graph size and align with CISO workflow.

9.4 Authentication Stack

Verdict: WorkOS + iron-session architecture is sound. Ship the new auth middleware. Critical hardening required: no instant session revocation, super-admin email-domain check is a security bug, logout is a no-op.

Security Bugs Found

Severity | Issue | File | Fix
HIGH | Super-admin determined by email domain string match | auth.ts:76 | Use WorkOS Organization membership check
HIGH | logout() is a no-op: cookie cleared but WorkOS session NOT revoked | workos-provider.ts:74 | Call workos.userManagement.revokeSession()
HIGH | verifySession(), verifyApiKey(), verifyM2MToken() all return null | workos-provider.ts:78-92 | Three auth sources documented; only one implemented
MEDIUM | Sessions cannot be instantly revoked: stateless cookie survives deprovisioning | session.ts | Add sessions_revoked_at timestamp to user documents
MEDIUM | Rolling refresh not implemented despite being documented | session.ts:41-43 | Implement TTL extension on each request
MEDIUM | 7-day TTL inappropriate for a security platform | session.ts:18 | Reduce to 24h users / 8h super-admins
LOW | listActiveConnections called per-request for SSO-enforced tenants | workos-provider.ts | Cache with 60-second TTL per provider_org_id
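The sessions_revoked_at fix in the table reduces to one extra comparison per request. A minimal sketch of that check, written in Python purely to illustrate the logic (the platform is TypeScript): the function name, the issued_at field, and the 24-hour default are assumptions drawn from the table's recommendations rather than existing code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def session_is_valid(
    issued_at: datetime,
    sessions_revoked_at: Optional[datetime],
    now: datetime,
    ttl: timedelta = timedelta(hours=24),
) -> bool:
    """Accept a session only if it is unexpired and was issued after the last revocation."""
    if now - issued_at > ttl:
        return False  # past the TTL
    if sessions_revoked_at is not None and issued_at <= sessions_revoked_at:
        return False  # issued before (or at) the revocation cut-off
    return True
```

Setting sessions_revoked_at to the current time on deprovisioning would then invalidate every cookie issued earlier, without needing to enumerate sessions.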

Architecture Assessment

WorkOS vendor selection (ADR-017) confirmed correct. Provider abstraction (AuthProvider interface) is well-designed — migration to Clerk or self-hosted is a backend-only change. Cloudflare Access as defense-in-depth is appropriate architecture.

iron-session assessment: Functions as a session ID into MongoDB (middleware hits DB on every request anyway). Consider formalizing: either accept the server-side lookup and add explicit revocation support, or move to short-lived JWTs (15 min) + refresh tokens for clean stateless/revocable semantics.


9.5 Infrastructure & Deployment

Verdict: Python correct. Docker Compose correct for current scale. Two critical security issues: REQUIRE_AUTH default and SSH deploy key blast radius. Dead code needs cleanup.

Critical Security Issues

Issue | Evidence | Fix
REQUIRE_AUTH defaults to false in deploy compose | docker-compose.deploy.yml:46 | Change default to true; fail loudly if not set
SSH deploy key grants Docker-group-equivalent-root to production | deployment.md:307-308 | Restrict deploy user via sudoers to specific compose commands only; remove Docker group membership

Operational Gaps

Issue | Impact | Fix
No external monitoring after deployment | Outage invisible until customer reports it | Add UptimeRobot / Cloudflare health check
Backup never tested; no restore runbook | 6-hour RPO with no recovery confidence | Test restore once, document procedure
Same SSH key for dev and prod | Dev compromise = prod access | Separate keys, rotate quarterly
No /ready endpoint checking MongoDB | Health check misleads Docker | Add MongoDB connectivity check

Infrastructure Gaps

Issue | Impact | Fix
MongoDB without replica set: 6-hour RPO | Hardware failure = data loss up to last backup | Add replica set (even single-node for oplog)
Single-server SPOF: 100% downtime on host failure | SOC 2 story has single point of failure | Add second server with hot standby
Connectors require Python 3.11 at customer site | Support burden; installation friction | Ship as Docker images (docker run ghcr.io/sv0/sv0-aws:latest)

Python Connectors: Validated

Python 3.11 + boto3 + msgraph-sdk is correct for I/O-bound batch API scanning workloads. The GIL is irrelevant (I/O-bound). Go/TypeScript/Rust offer no practical advantage for this workload. One improvement: add concurrent.futures.ThreadPoolExecutor to AWS region scanning loop for parallel region extraction (20-line change, not a language migration).
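The region-parallel change could look roughly like the sketch below. It assumes a per-region extractor callable (here the hypothetical `scan_region`) that builds its own client for the region it is given; the function names and the worker count are illustrative, not taken from the connector.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List


def scan_regions(
    regions: Iterable[str],
    scan_region: Callable[[str], List[dict]],
    max_workers: int = 8,
) -> Dict[str, List[dict]]:
    """Run scan_region(region) for every region on a thread pool.

    scan_region is a hypothetical per-region extractor. The first exception
    raised by any region propagates, so a partial scan is never returned silently.
    """
    results: Dict[str, List[dict]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scan_region, region): region for region in regions}
        for future in as_completed(futures):
            results[futures[future]] = future.result()  # re-raises on failure
    return results
```

Letting a failed region raise, rather than returning what succeeded, matters here for the same reason as the ServiceNow pagination issue: a silently truncated scan would become the next drift baseline.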

Dead Code Cleanup

Remove: docker-compose.prod.yml (legacy Certbot overlay), ui/nginx-ssl.conf (superseded by Caddy). Both create confusion about the active architecture.

Infrastructure Maturity Triggers

Trigger | Action
First paying enterprise customer | 2-server MongoDB replica set, test restore, separate SSH keys
Contractual uptime SLA ≥99.95% | k3s or managed Kubernetes, Atlas managed MongoDB, Docker-based connectors

10. High-Confidence Findings

Priority | Finding | File
1 | ELK.js running on main thread (should be Web Worker) | layout.ts:1
2 | In-memory job queue will lose data on restart; needs persistence | runtime.ts:26
3 | REQUIRE_AUTH=false is a critical default in deploy compose | docker-compose.deploy.yml:46
4 | SSH deploy key blast radius too broad | deployment.md:307-308
5 | Super-admin via email domain string match is a security bug | auth.ts:76
6 | Node 20 is EOL; upgrade to Node 22 | Dockerfile:1,9
7 | Logout is a no-op: WorkOS session not revoked | workos-provider.ts:74
8 | Evidence immutability: hash stored alongside mutable data | evidence_packs schema
9 | Role fan-out write amplification is the real MongoDB scaling cliff | path-materializer.ts
10 | Neo4j trigger threshold should be 5K, not 10K | ADR-001

11. Gray Zone Deep-Dive


11.1 Gray Zone 2: Data Model Universality — VERDICT: RIGHT-SIZED

Agent findings (deep code analysis across all 3 connectors + all 15 evaluator rules):

Verdict: The 10-entity-type model is NOT too universal. It is the correct abstraction level. Evidence:

  • 3 connectors map to it with zero forced compromises (when model is followed correctly)
  • All 15 evaluator rules operate against universal types and work across all connectors without cloud-specific branching
  • The path materializer traverses the graph without cloud-specific conditionals
  • Adding a 4th connector (GitHub, Okta, Salesforce) would be ~90% connector work, <10% platform work — no new entity types needed

Mapping fidelity by connector:

  • Entra/ServiceNow: Clean 1:1 mapping. No compromises.
  • AWS: Functional but carrying ADR-014 implementation debt (see below)
  • Bedrock AI agents: Clean — workload subtype bedrock_agent, RUNS_AS IAM role, INVOKES Lambda action groups

Two critical seams:

Seam | Issue | Impact | Fix
ADR-014 implementation gap | AWS connector emits HAS_ROLE / nodeType: "role" for IAM Managed Policies instead of HAS_PERMISSION_SET / permission_set | Platform-side types already support permission_set. Path materializer traverses HAS_ROLE but NOT HAS_PERMISSION_SET → all AWS authority paths via managed policies are incorrect | Update AWS transformer line 499 + materializer to traverse HAS_PERMISSION_SET
Resource key migration | resource-key.ts is comprehensive for AWS but privilege_justification_gap needs resource_key on evidence records | CloudTrail evidence has no resource_key → rule returns false negatives until CloudTrail extractor is implemented | Populate resource_key on evidence during CloudTrail extraction

privilege_justification_gap bug is implementation, NOT model: The resource-key.ts module correctly handles all AWS ARN formats (S3, Lambda, DynamoDB, SecretsManager, SSM, ECR, ECS, IAM, Bedrock, SNS, SFN, EventBridge, SQS). The matching logic is correct. The problem is that CloudTrail evidence records don't have resource_key populated yet (because CloudTrail extractor doesn't exist — F2).

MCP/AI agent model fit: Good. The existing model handles AI agents via ai_agent workload subtype. What's missing for pre-deployment preview is not an entity type but a behavioral distinction: "configured to invoke" (current INVOKES edge) vs. "has exercised" (needs runtime evidence). The graph already has the right edges; additional evaluator rules needed for authority preview.

_type_provisional: true was never implemented (searched codebase — zero occurrences). ADR-014 mentioned it as a migration strategy that was never built.


11.2 Gray Zone 1: Graph Alternatives to Neo4j — VERDICT: KUZU

Agent findings (deep research across 6 alternatives with full code context):

The team's hesitation about Neo4j is justified on operational grounds — and there is a better answer that avoids adding any new infrastructure.

Kuzu is an embedded in-process graph database (like DuckDB, but for graphs). Native Cypher support. Node.js/TypeScript bindings. MIT licensed. Zero additional infrastructure — runs inside the sv0-platform process.

Why Kuzu is the right answer for SecurityV0:

MongoDB (14 collections)    Source of truth: entities, versions, events,
        |                   findings, evidence packs, temporal history
  sync completes
        |
Kuzu (in-process)           Graph projection: nodes + typed edges
        |                   for path traversal queries
  Cypher queries
    /   |   \
blast  subgraph  chain
radius explore   assembly

Kuzu replaces the 3 most problematic application-level BFS implementations:

  • path-materializer.ts:computePaths() — recursive MongoDB-per-hop query storm → single Cypher query
  • chain-builder.ts:bfsCollectChain() → Cypher MATCH (w)-[*1..5]->(r)
  • subgraph-adapter.ts:neighborhoodBFS() → Cypher MATCH (n)-[*1..2]-(m) WHERE n.id = $seed

This eliminates:

  • The BFS document limit bug (subgraph-adapter.ts:158 — no .limit())
  • Unbounded frontier growth in high-degree nodes
  • Stale execution paths when role GRANTS change (Kuzu recomputes at query time)
  • The write amplification problem (no accessible_by arrays to maintain)

Migration is incremental — zero storage risk:

  1. Start: Kuzu handles getSubgraph() queries only (replace SubgraphAdapter BFS)
  2. Next: Kuzu generates execution_paths[] instead of path-materializer.ts
  3. Then: Kuzu handles chain assembly (chain-builder.ts)
  4. Later: Temporal graph queries via historical entity snapshots loaded into temporary Kuzu instance

The StorageAdapter interface does NOT change. All 60 methods stay as-is. MongoDB schema unchanged. Evaluator rules unchanged. Connector interface unchanged.

Alternative Comparison

Option | Fit Score | Key Verdict
Kuzu (embedded) | 8/10 | Best fit. Zero infra, native Cypher, MIT license, incremental migration
XTDB v2 | 6/10 | Excellent bi-temporal but NO graph traversal; JVM service required
TerminusDB | 5/10 | Git-like immutability interesting but Prolog query language + project risk
FalkorDB | 4/10 | Fastest BFS but Redis AOF persistence = disqualifying for evidence-grade requirements
TypeDB | 5/10 | Inference rules compelling but TypeQL proprietary, JVM service, no temporal
Memgraph | 5/10 | Neo4j-like quality but BSL license + same operational cost as Neo4j

When Kuzu stops being sufficient (trigger for Neo4j or Memgraph):

  • 50K+ entities where graph rebuild time exceeds acceptable sync latency
  • Multi-process/multi-service need to query the same graph (Kuzu is in-process only)
  • Real-time graph mutations needed (Kuzu is batch-rebuild-oriented)
  • Geographic distribution requirements

On the bi-temporal gap: SecurityV0 already has a working bi-temporal model (entity_versions with valid_at/expired_at + events with transaction timestamps). XTDB's native bi-temporal is elegant but solves a problem that is already solved adequately. For historical graph queries, the right approach is: load historical entity snapshots from entity_versions into a temporary Kuzu instance and traverse that. This is the "git checkout past commit" pattern.

On evidence immutability: Neither Kuzu nor any graph DB solves the evidence hash-colocation problem. This must be solved separately — and PostgreSQL triggers are the weakest option because a DBA with DISABLE TRIGGER permission can bypass them. Prefer Sigstore transparency logs, Amazon QLDB, or S3 Object Lock (WORM) for genuinely tamper-evident storage. These are independent of which graph layer is chosen.


11.3 Gray Zone 3: Connector Depth + Rate Limiting

Part A: Metadata-Only Scanning — VERDICT: RIGHT STRATEGY FOR V1

What metadata scanning concretely delivers (from all 15 evaluator rules):

  • Ownership governance (orphaned, degraded, drifted, ambiguous, unknown)
  • Authority hygiene (dormant, scope drift, reachability drift, privilege justification gap)
  • Identity binding (unproven execution, unknown binding, unresolved cross-system auth)
  • Egress/data flow (LLM egress, external egress, reachable sensitive domain)

This is authorization graph analysis with temporal drift detection — a capability combination that existing tools address only partially or not at all.

What metadata misses: Hardcoded secrets in code, injection vulnerabilities, dependency CVEs, logic vulnerabilities, runtime behavioral anomalies, CSPM-style resource misconfiguration checks (S3 bucket ACLs vs. CIS benchmarks).

Code analysis path (additive, not a redesign): The NormalizedGraph schema already accommodates it. A sv0-code-scanner connector would: (1) fetch code artifacts linked to known entities, (2) run lightweight checks (regex for secrets, SBOM extraction, trufflehog), (3) emit NormalizedGraph additions. This is additive — no connector architecture change needed. The ServiceNow connector already parses script bodies (analyze_script_mutations(), analyze_script_queries()).

Verdict: Metadata-only is fully defensible for V1. Strategic risk is customers expecting CSPM-style findings alongside the authorization graph — that's a breadth gap where CSPM-first tools have an advantage.

Part B: Rate Limiting — CRITICAL FINDINGS

ServiceNow 429 bug is a data integrity crisis, not a UX issue:

The break at servicenow_client.py:421 on any non-200 response causes silent partial data ingestion. Blast radius:

  1. Scan returns 400/2000 records as if complete
  2. Downstream evaluator computes massive phantom ownership_drift and scope_drift — entities "disappeared"
  3. Phantom-truncated scan becomes the new baseline — subsequent full scans show phantom "new" entities
  4. Temporal drift detection becomes unreliable

This is not a "fix later" issue. This calcifies baselines. Every scan run with this bug creates corrupted baselines that compound. Must fix before production.

Fix — ServiceNow cursor resume on 429:

Note: urllib3 retry logic at the adapter level handles transient TCP/TLS failures before the pagination loop sees a status code. The bug is what happens after urllib3 retries are exhausted: the 429 bubbles up to application code and the break at line 421 exits the pagination loop without resuming the cursor. The fix is at the application level, not the adapter level:

# Inside the pagination loop; requires `import random` and `import time`
if response.status_code == 429:
    if retry_count >= max_retries:
        raise ConnectorError(f"ServiceNow rate limit exceeded after {max_retries} retries; pagination cursor at offset {offset}")
    # Assumes Retry-After is given in seconds; an HTTP-date value would need separate parsing
    retry_after = min(int(response.headers.get("Retry-After", 0)), 300)  # cap at 5 min
    wait_time = max(retry_after, 2 ** retry_count)
    wait_time *= random.uniform(0.75, 1.25)  # +/-25% jitter (not full jitter) so the delay stays near Retry-After
    time.sleep(wait_time)
    retry_count += 1
    continue  # NOT break: retry the SAME offset
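For context, the retry branch above sits inside a pagination loop along the following lines. This is a sketch under stated assumptions, not the connector's code: `fetch_page`, its `(status, headers, records)` return shape, the page size, and the short-page end condition are all hypothetical, and `sleep` is injectable only so the loop can be exercised without waiting.

```python
import random
import time
from typing import Callable, List, Tuple


class ConnectorError(Exception):
    """Raised when a scan cannot complete; callers must discard partial results."""


def fetch_all(
    fetch_page: Callable[[int, int], Tuple[int, dict, List[dict]]],
    page_size: int = 200,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Page through a table; fetch_page(offset, limit) returns (status, headers, records)."""
    records: List[dict] = []
    offset = 0
    retry_count = 0
    while True:
        status, headers, page = fetch_page(offset, page_size)
        if status == 429:
            if retry_count >= max_retries:
                raise ConnectorError(f"rate limited at offset {offset}")
            # Assumes Retry-After is in seconds (it may also be an HTTP date).
            retry_after = min(int(headers.get("Retry-After", 0)), 300)
            wait = max(retry_after, 2 ** retry_count) * random.uniform(0.75, 1.25)
            sleep(wait)
            retry_count += 1
            continue  # retry the SAME offset
        if status != 200:
            # Any other failure aborts the scan instead of returning partial data.
            raise ConnectorError(f"HTTP {status} at offset {offset}")
        retry_count = 0
        records.extend(page)
        if len(page) < page_size:
            return records  # a short page marks the end
        offset += page_size
```

The important property is that every exit path is either a complete result or an exception; there is no path that returns a truncated list as if it were the full table.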

Fix — AWS full jitter (1 line):

# Line 276 of aws_client.py — replace:
wait_time = 2**retry_count
# With (AWS Architecture Blog "full jitter" pattern):
wait_time = random.uniform(0, 2**retry_count)

Recommended rate limiting architecture (current stage):

  • Per-connector AdaptiveTokenBucket per API endpoint
  • Respects Retry-After headers (ServiceNow, Azure Graph both send these)
  • Full jitter on all retry delays
  • rateLimitConfig in connector contract (05-connectors.md:122-126) is the right interface — configure max RPS per connector
  • Later: Redis-backed cross-tenant quota tracker when concurrent multi-tenant scans are needed
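One possible shape for the per-endpoint AdaptiveTokenBucket named above is sketched here. Only the class name comes from the report; the halve-on-throttle and 5%-step recovery policy, the method names, and the injectable clock are assumptions chosen for illustration.

```python
import time
from typing import Callable


class AdaptiveTokenBucket:
    """Token bucket whose refill rate halves on throttling and recovers on success."""

    def __init__(self, max_rps: float, burst: int = 1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_rps = max_rps       # configured ceiling (e.g. from rateLimitConfig)
        self.rate = max_rps          # current refill rate, tokens per second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; otherwise the caller sleeps and retries."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def on_throttled(self) -> None:
        """Call after a 429: back off multiplicatively, never below 5% of the ceiling."""
        self._refill()
        self.rate = max(self.rate / 2, self.max_rps * 0.05)

    def on_success(self) -> None:
        """Call after a 2xx: recover additively toward the configured ceiling."""
        self.rate = min(self.max_rps, self.rate + self.max_rps * 0.05)
```

A connector would hold one bucket per API endpoint, call `try_acquire()` before each request, and report the outcome through `on_success()` or `on_throttled()`; honoring Retry-After and jitter would remain the job of the retry loop.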

Reversibility assessment:

Decision | Reversible? | Notes
Metadata-only scanning | Fully reversible | Code analysis connectors are additive
ServiceNow break-on-429 | Calcifying | Fix before any production customer
AWS no-jitter | Easily reversible | 1-line fix, low complexity
No global quota tracker | Reversible but expensive later | Design interface now, implement when multi-tenant orchestration built

11.4 MCP Blocker: AI Agent Pre-Deployment PII Access Graph

Agent findings (deep architecture review of 12-deployment-approval.md + full codebase analysis):

What the architecture team already knows well

The design docs are thorough: three modes (post-deploy detection, pre-deploy preview, deployment gate) are correctly separated. Five approaches were evaluated. Platform capabilities inventory is accurate.

Hard unsolved problems (not yet designed)

Problem | Why Hard
Graph projection algorithm | "Run materializer on projected state" is stated but the how is not designed
Path-level diff engine | Current diff-engine.ts diffs EntityDoc only; no AuthorityPathDoc comparison
Cross-connector entity correlation | Prerequisite for cross-system authority chains (Entra→SN→HR DB); not yet built
MCP tool-to-data-domain mapping | Tool declarations are free-text black boxes; classification is unsolved
data_domain as first-class entity type | Not yet in the entity model

MCP Opacity Mitigation (layered, honest approach)

The fundamental problem: MCP tool declarations (tools/list) show name + description + input schema. They don't reveal what databases the tool queries, what data it returns, or what its blast radius is.

Recommended mitigation layers:

Layer | What It Provides | Evidence Grade | Build Now?
1: Identity-bounded authority | The identity's IAM permissions ARE the worst-case blast radius | C (inferred) | Yes (already modeled)
2: Manifest-declared intent | Parse mcp.json for env vars, tool names, resource URIs | C (inferred) | Yes (build now)
3: Tool description parsing | NLP/regex on tool descriptions for domain hints | C (inferred) | Caution (conflicts with "no ML/heuristics" policy)
4: Runtime observation | Actual network/DB calls after deployment | A (proven) | Future

Honest framing for clients: "We show you the identity's authority boundary. The tool may exercise all, some, or none of that authority. The boundary is the worst case."

Rejected options:

  • Option A (clone to MongoDB + materializer): Write amplification, cleanup complexity, persistence risk
  • Option C (what-if tenant namespace): Cross-tenant reference failures, tenant semantics broken

Recommended: In-memory ProjectionStorageAdapter

mcp.json
↓ MCP manifest parser
NormalizedGraph (mcp_server, mcp_tool, identity nodes)
↓ graph-transformer.ts (existing)
EntityDoc[] (projected entities)
↓ inject into
ProjectionStorageAdapter (Map<string, EntityDoc> backed)
↑ seeded from real tenant subgraph via getSubgraph(identity, depth=3)
↓ materializeExecutionPaths() + materializeAuthorityPaths() (unchanged)
projected AuthorityPathDoc[]
↓ evaluateSinglePath() (unchanged)
ProjectedFindingCandidate[]
↓ diff against current MongoDB authority paths
Authority Delta: new/removed/changed paths + new sensitive domains reached

This works because the StorageAdapter interface is already the abstraction boundary. A ProjectionStorageAdapter implementing ~8-10 methods (getEntity, upsertEntity, queryEntities, getEntitiesByIds, queryAuthorityPaths, upsertAuthorityPaths, markAuthorityPathsRemoved, countAuthorityPaths) runs the entire materializer + evaluator pipeline with zero MongoDB writes.
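The idea can be illustrated with a small in-memory adapter. The sketch below is in Python and covers only three of the methods listed (rendered as get_entity, upsert_entity, query_entities); the real interface is the TypeScript StorageAdapter, and the entity shape, the `_id` key, and the predicate-based query are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional

Entity = Dict[str, object]


class ProjectionStorageAdapter:
    """In-memory stand-in for the storage layer; nothing here touches MongoDB."""

    def __init__(self, seed: Iterable[Entity]) -> None:
        # Copy each seeded entity so projected writes never mutate the caller's data.
        self._entities: Dict[str, Entity] = {str(e["_id"]): dict(e) for e in seed}

    def get_entity(self, entity_id: str) -> Optional[Entity]:
        return self._entities.get(entity_id)

    def upsert_entity(self, entity: Entity) -> None:
        self._entities[str(entity["_id"])] = dict(entity)

    def query_entities(self, predicate: Callable[[Entity], bool]) -> List[Entity]:
        return [e for e in self._entities.values() if predicate(e)]
```

The adapter would be seeded from the real tenant subgraph, receive the projected entities through `upsert_entity`, and then be discarded after the run, which is what keeps the projection free of cleanup and persistence risk.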

New Entity Types & Schema

Add to entity types: mcp_tool, data_domain

Add to edge types: DECLARES_TOOL (mcp_server → mcp_tool), ACCESSES (mcp_tool → data_domain), PROJECTED_FROM (projected entity → manifest source)

Add to workload subtypes: mcp_server (already has ai_agent, bedrock_agent)

Data domain classification (3 tiers, in priority order):

  1. Tier 1 — Resource name pattern matching (deterministic, build now): hr.*|employee.* → domain: "hr", sensitivity: "confidential". This is consistent with the "no ML/heuristics" policy — it's a curated registry.
  2. Tier 2 — Operator tagging via API/UI: Security team manually classifies resources. Stored as data_domain entities with ACCESSES relationships.
  3. Tier 3 — Tool description NLP: Skip for now — conflicts with determinism policy.
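Tier 1 can be sketched as an ordered registry of compiled patterns. The hr rule follows the example above, anchored to a name-segment boundary so that unrelated names starting with "hr" do not match (that tightening is an assumption); the finance rule, the type names, and the function name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class DataDomain(NamedTuple):
    domain: str
    sensitivity: str


# Curated registry, checked in order; first match wins.
# Only the hr rule comes from the report; the finance rule is a hypothetical entry.
DOMAIN_RULES = [
    (re.compile(r"^(hr|employee)([._\-/]|$)", re.IGNORECASE), DataDomain("hr", "confidential")),
    (re.compile(r"^(finance|payroll)([._\-/]|$)", re.IGNORECASE), DataDomain("finance", "confidential")),
]


def classify_resource(name: str) -> Optional[DataDomain]:
    """Return the first matching domain, or None when no rule applies (left to Tier 2 tagging)."""
    for pattern, domain in DOMAIN_RULES:
        if pattern.search(name):
            return domain
    return None
```

The same input always yields the same classification, and every result traces back to a single registry entry, which is what keeps this tier within the determinism policy.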

Evidence Grading for Projected State

All projected paths carry: claim_type: "capability_inferred", evidence_strength: "inferred" (weakest grade, rank 3). In the UI: dashed edges, "PROJECTED" badge, distinct color. Projected findings do NOT count toward active posture score — advisory only.

Post-deployment upgrade path: projected → structural (after first scan confirms configuration) → correlated (after execution evidence accumulates) → deterministic (proven in production).

Approval Record Schema (minimal)

interface DeploymentPreviewRequestDoc {
  _id: string;
  tenant_id: string;
  requested_by: string;
  requested_at: Date;
  source_type: "mcp_manifest" | "cloudformation" | "arm_template";
  source_manifest?: Record<string, unknown>;
  projected_paths: {
    new_paths: number;
    new_sensitive_paths: number;
    new_domains_reached: string[]; // e.g., ["hr", "finance"]
  };
  projected_findings: ProjectedFindingSummary[];
  projected_authority_paths: AuthorityPathDoc[];
  status: "pending" | "approved" | "rejected" | "expired";
  reviewed_by?: string;
  reviewed_at?: Date;
  review_notes?: string;
  conditions?: string[];
  // Post-deployment accuracy tracking
  projection_accuracy?: {
    paths_matched: number;
    paths_unexpected: number;
    paths_missing: number;
  };
}

Closest Analogues and Gaps

Tool | What It Does | Gap
OAuth consent screens | Shows flat permission list | No authority graph, no cross-system chains
terraform plan | Projects infrastructure state | No authority implications of infra changes
AWS IAM Access Analyzer | Checks single policy for public access | Not a graph, not pre-deployment, not cross-system
Microsoft Agent Governance Toolkit | Runtime policy enforcement | No pre-deployment preview, no authority graph
Wiz AI-SPM | Cloud security posture for AI | Runtime/post-deployment only, no authority graph

Delivery Sequence

Phase | Component | Output
1 | In-memory ProjectionStorageAdapter (~8-10 methods) | Foundation for all projection
2 | MCP manifest parser → NormalizedGraph | mcp.json input accepted
3 | POST /api/v1/deployment/preview endpoint | Working projection pipeline
4 | Approval record schema + PATCH endpoint | Approve/reject workflow
5 | Resource-name data domain classifier (Tier 1) | PII domain detection

After initial delivery: Path-level diff engine + full graph snapshot (prerequisite for "did reality match projection?")

Genuine hard problems not in scope yet: Cross-connector entity correlation, CloudFormation/ARM/Terraform parsers, multi-environment tenant model, what-if simulation UI.

Biggest implementation risk: ProjectionStorageAdapter must handle edge cases in the materializer (circuit breakers, deletion thresholds, AP_REMOVAL_THRESHOLD safety net). Medium risk — methods are well-defined but materializer edge cases will surface during integration testing.


12. Updated Master Priority Table

Pri | Category | Issue | Severity
1 | Security | Cross-tenant IDOR via REQUIRE_AUTH bypass | Ship blocker
2 | Security | REQUIRE_AUTH defaults to false in deploy compose | P0
2a | Security | verifyM2MToken() returns null; Bearer-token M2M auth is completely unenforced | CRITICAL
3 | Security | DevAuthProvider: no production gate | Ship blocker
4 | Security | SSH deploy key → Docker group = root access | P0
5 | Security | Super-admin via email domain string (not org membership) | HIGH
6 | AWS Connector | CloudTrail extractor not implemented | Ship blocker
7 | AWS Connector | Assumed-role ARN parsing broken (80-90% events) | Ship blocker
8 | AWS Connector | privilege_justification_gap always 0 on AWS | Ship blocker
9 | AWS Platform | permission_set materializer not updated | Ship blocker
10 | Frontend | ELK.js running on main thread (not Web Worker) | HIGH
11 | Runtime | Async route handler has no try/catch; request hangs | HIGH
12 | Runtime | Job queue unbounded + no persistence = data loss | HIGH
13 | Auth | Logout is a no-op (WorkOS session not revoked) | HIGH
14 | Node.js | Node 20 EOL → upgrade to Node 22 | HIGH
15 | Graph DB | BFS reverse lookup: no document limit | Pre-scale blocker
16 | Connector | ServiceNow pagination: break on 429 corrupts baselines | Ship blocker
17 | Auth | 7-day session TTL; rolling refresh not implemented | MEDIUM
18 | Auth | Iron-session: no instant revocation on deprovisioning | MEDIUM
19 | Infra | No external monitoring; backup untested | P1
20 | Graph DB | Stale paths on role GRANTS change | Correctness gap
21 | Graph DB | MAX_AUTH_CHAIN_DEPTH=1: 3-system chains missed | Feature gap
22 | Evidence | Immutability: hash stored in mutable collection | Compliance gap
23 | Frontend | ELK.js not lazy-loaded (1.4MB on every page) | MEDIUM
24 | Graph DB | Neo4j trigger lower to 5K; monitor role fan-out | Planning
25 | MCP Feature | mcp_tool, manifest parser, graph projection | Feature (not yet built)

13. Infrastructure: Docker Compose Is a Dead End

Docker Compose is a development and single-host orchestration tool. SecurityV0 runs it in production for both app.securityv0.com and dev.securityv0.com. This is a structural ceiling — not a configuration gap, an architectural one.

What Docker Compose cannot do:

Capability | Docker Compose | Required for Scale
Horizontal scaling (multiple hosts) | No (single host only) | Yes, for any cell model
Rolling deployments | No (up restarts all containers, causing downtime) | Yes, for zero-downtime deploys
Health-based routing | No (failed containers removed from routing manually) | Yes, for resilience
Cross-node service discovery | No | Yes, for cell provisioning
Autoscaling | No | Yes, for variable sync load
Resource enforcement | Soft limits only | Yes, for noisy-neighbor isolation
Secret management | .env files on disk | Yes, must use vault

The consequence for cell architecture: Cell provisioning automation — the core operational requirement for cells — is impossible on Docker Compose. "Provisioning a new cell" on Docker Compose means SSH-ing into a server and running docker compose up manually. This defeats the purpose.

The right migration path:

Current: Docker Compose (CPX21 Hetzner, single host)

Step 1: k3s on Hetzner
Single-node Kubernetes — identical Hetzner hardware, same Docker images
No operational cost increase; enables everything below

Step 2: Helm charts per service
Parameterized deployment: one Helm chart = one cell
Rolling deployments, health checks, resource quotas — free

Step 3: Cell provisioning via Helm (when triggered by scale)
`helm install cell-eu-02 ./charts/sv0-cell --set tenants=...`
New cell live in 15 minutes, zero downtime for existing cells

k3s is the correct migration path: same Hetzner infrastructure, same container images, same Docker workflows for developers — production-grade runtime that enables the full cell model when needed.

Note on scope: The async route handler bug, unbounded job queue, and ELK.js Web Worker issue are independent code bugs — Docker Compose did not cause them and k3s would not fix them. The argument for migrating is forward-looking: Docker Compose cannot support rolling deployments, multi-host scaling, or cell provisioning automation, all of which become necessary as the platform grows. Fix the bugs separately; migrate the runtime to unlock the scaling model.


14. Cell Architecture vs. Current Architecture — Full Comparison

What Cell Architecture Means for SecurityV0

A cell is a complete, independently deployed replica of the platform stack, permanently assigned a bounded set of tenants, such that the failure or resource exhaustion of any component in that cell has zero runtime effect on any other cell.

One SecurityV0 cell contains:

┌─────────────────────────────────────────────────────────────┐
│ CELL A (tenants T001, T047, T203 ... T035) │
│ │
│ Express API pods (3×) ─── Redis ─── BFS Workers (4×) │
│ │ │
│ MongoDB Replica Set │
│ (IAM graphs, findings, BFS paths │
│ for THIS cell's tenants ONLY) │
└─────────────────────────────────────────────────────────────┘

CONTROL PLANE (global, not a cell):
Cell Router │ Auth Service │ Billing │ Tenant Registry
Maps tenant_id → cell. Never holds IAM graph data.
Must NOT be in the hot path — cells operate independently
if control plane goes down.

Connectors use transparent routing: they always call api.securityv0.com. The cell router maps tenant_id → cell from a cached registry and proxies the request. Connectors never need to know which cell they're in — no reconfiguration when cells are added or tenants migrated.


Scalability Comparison

Architecture A (Current) — Binding Constraints (from code analysis):

Critical correction from code audit: The production worker runtime is NOT BullMQ. It is a plain JavaScript array (private readonly queue: WorkerJob[] = [] at runtime.ts:26) inside the API process, draining one job at a time, sequentially. There is no separate worker process. BullMQ exists in documentation, not in the running code.

A full tenant sync cycle is 3 sequential jobs: sync_ingestion → evaluate_findings → build_evidence_pack.

For a medium tenant (5,000 entities): 85–240 seconds total.

Tenants | Sync frequency | Worker queue drain time | Outcome
5 | Hourly | ~22 min | Drains before next sync
10 | Hourly | ~45 min | Queue backs up permanently
35 | Daily | ~105–180 min | Barely drains before next daily window
50 | Daily | ~225–375 min | Queue never empties
500 | Daily | 37+ hours | Architecture collapses

Sequential breaking points (in order they bite):

  1. Worker queue saturation — ~10 tenants (hourly) / 35 tenants (daily)
  2. MongoDB working set overflow — ~60–70 tenants × 5K entities (WiredTiger cache is 256MB from --wiredTigerCacheSizeGB 0.25 in compose; total working set exceeds it at low tenant counts)
  3. Node.js OOM — one 50K-entity sync calls queryEntities(limit:0), loads 500MB of entity docs into 512MB container; immediate OOM kill; all tenants dark
  4. Express latency degradation — ~100+ tenants with concurrent dashboard load

Architecture A single-event total-outage scenario: One enterprise customer runs a 50K-entity sync during business hours. Path materialization triggers 3.2M sequential MongoDB reads (~9 hours). Evaluator calls queryEntities(limit:0) on 50K entities → ~500MB heap → OOM kill. Container restarts. The stalled sync is permanently stuck at "running" in MongoDB. All other tenant syncs are blocked for the duration. No alerting fires — the process crash is not surfaced as a sync failure. All tenants on the platform go dark.

Architecture B (Cell) — Scaling Characteristics:

  • Cell capacity: 25–35 tenants per cell (MongoDB M20, 4 parallel workers)
  • New cell provisioning: 12–18 minutes, zero downtime for other cells
  • Same 50K-entity OOM scenario under Architecture B: one cell degraded, 25–35 tenants affected, all other cells continue normally
  • Vertical scaling: eliminated — add cells, not bigger servers
  • Geographic cells: US customers on US cell (<20ms RTT vs 120–220ms from Nuremberg); APAC (<30ms vs 350ms)

APAC dashboard latency on Architecture A (350ms per interaction) crosses the threshold where users perceive the product as slow. Regional cells eliminate this entirely.

Metric | Architecture A | Architecture B
Daily sync saturation | 35 tenants | 25–35 per cell, unlimited cells
Hourly sync saturation | 10 tenants | 25–35 per cell
Total-outage trigger | One 50K-entity sync | Atlas AZ outage (30–60s failover)
Noisy tenant blast radius | All tenants on platform | 25–35 tenants in one cell
Scaling action downtime | 5–15 min (vertical resize) | Zero (new cell)
APAC dashboard RTT | 350ms | <30ms (APAC cell)

Security Comparison

Security scorecard:

Attack Vector | Architecture A | Architecture B
Tenant data isolation | CRITICAL: tenant_id field only; one missing filter exposes all tenant data | LOW: per-cell MongoDB; missing filter leaks within 25–35 tenant cell only
Noisy tenant / resource exhaustion | HIGH: one tenant starves all; no per-tenant limits at any layer | LOW inter-cell / MEDIUM intra-cell
Auth bypass blast radius | CRITICAL: REQUIRE_AUTH=false exposes 100% of tenants simultaneously | HIGH: one cell exposed; others protected by independent auth
Cross-tenant IDOR | CRITICAL: MongoDB ObjectIDs from shared DB are time-ordered and estimable; one bug in any of 50+ query paths leaks cross-tenant | LOW: ObjectIDs from other cells do not exist in this cell's DB; physically absent, not just filtered
Session compromise blast radius | CRITICAL: stolen @securityv0.com admin session has 7-day unrestricted access to all tenants; no revocation | HIGH: control plane admin / MEDIUM: cell-scoped
Database breach blast radius | CRITICAL: one MongoDB breach delivers full IAM graph of every customer; complete cloud attack kit | HIGH per-cell / LOW platform-wide: independent credentials per cell
Connector push forgery | HIGH: verifyM2MToken() returns null; tenant_id in payload is attacker-controlled | MEDIUM: cell URL discovery required; wrong-cell push rejected at routing layer
Super-admin escalation | CRITICAL: email domain string match for all @securityv0.com accounts; no revocation | HIGH: same fragility, bounded blast radius
Compliance (SOC 2 Type II) | Blocked: CC6.1 (logical access only), CC6.3 (no revocation), IdP stubs | Achievable with remaining auth work
Compliance (FedRAMP Moderate) | Explicitly blocked: SC-4 requires DB-level isolation; field-level filtering fails this control | Eligible path: single-tenant government cells satisfy SC-4
GDPR / Data Residency | High risk: EU tenant data co-mingles with US tenant data at storage layer | Strong: EU cell on EU infrastructure; no cross-jurisdiction data residency risk

Six security fixes required regardless of architecture choice (Architecture B reduces blast radius but does not fix these):

  1. verifySession(), verifyApiKey(), verifyM2MToken() returning null — this is an active auth bypass on those paths, not a stub
  2. REQUIRE_AUTH=false as default in docker-compose.deploy.yml — must be inverted; opt-out for dev, not opt-in for prod
  3. Iron-session server-side revocation — Redis-backed session store with immediate invalidation capability
  4. Super-admin email domain check — replace with explicit RBAC membership from WorkOS org claims + user ID allowlist
  5. BFS document limit — hard cap on traversal depth and result count per request
  6. DevAuthProvider production gate — startup crash (not silent fallthrough) if NODE_ENV=production and DevAuthProvider is active
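
Fixes 2 and 6 both amount to failing closed at startup. The following is a minimal sketch of that idea, not the project's actual configuration code: `REQUIRE_AUTH`, `NODE_ENV`, and DevAuthProvider are names from this report, while `AUTH_PROVIDER`, its `"dev"` and `"workos"` values, and `resolveAuthConfig` are assumptions introduced for illustration.

```typescript
// Hypothetical config shape. AUTH_PROVIDER is an assumed variable that
// selects the provider ("dev" standing in for DevAuthProvider).
interface AuthEnv {
  NODE_ENV?: string;
  REQUIRE_AUTH?: string;
  AUTH_PROVIDER?: string;
}

interface AuthConfig {
  requireAuth: boolean;
  provider: string;
}

// Auth is required unless it is explicitly disabled, and disabling it
// (or selecting the dev provider) in production aborts startup instead of
// silently falling through.
function resolveAuthConfig(env: AuthEnv): AuthConfig {
  const isProd = env.NODE_ENV === "production";
  const requireAuth = env.REQUIRE_AUTH !== "false"; // unset means required
  const provider = env.AUTH_PROVIDER ?? "workos"; // assumed default

  if (isProd && !requireAuth) {
    throw new Error("REQUIRE_AUTH=false is not allowed when NODE_ENV=production");
  }
  if (isProd && provider === "dev") {
    throw new Error("DevAuthProvider must not be active when NODE_ENV=production");
  }
  return { requireAuth, provider };
}
```

Calling a function like this once during boot, before any route is mounted, turns a misconfigured deployment into a crash loop that is visible immediately rather than an open API.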

Customer Isolation Comparison

Architecture A — Isolation Reality (from code):

All 23 MongoDB collections are shared. The only isolation boundary is the tenant_id field predicate in application queries. MongoDB has no row-level security; the application is the sole enforcement point. Additional isolation failures found in code:

  • InMemoryFindingsStore is shared across all tenants — if keying is not tenant-scoped internally, connector report findings from Tenant A are visible to Tenant B
  • IngestService.processedSyncIds is a global Set<string> — not tenant-scoped; the practical risk is re-processing on restart (Set is lost), not cross-tenant blocking (UUIDv4 collision probability is 1/2^122)
  • A stuck sync job (infinite path materialization loop) has no per-job timeout or watchdog; it occupies the entire worker indefinitely, blocking all other tenants' pipelines
  • The auto-join domain-match feature (in the new, not-yet-mounted middleware) adds users to any tenant matching their email domain without explicit invitation — a multi-tenant implicit membership risk
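
Because the application is the sole enforcement point, one common mitigation is to make the tenant predicate impossible to forget by moving it into a wrapper. The sketch below uses an in-memory stand-in for a collection; `InMemoryCollection` and `TenantScopedCollection` are hypothetical names and this is not the MongoDB driver's API.

```typescript
// Minimal stand-in for a document collection, for illustration only.
type Doc = Record<string, unknown> & { tenant_id: string };
type Filter = Record<string, unknown>;

class InMemoryCollection {
  private docs: Doc[];
  constructor(docs: Doc[]) {
    this.docs = docs;
  }
  // Equality-only matching on every key in the filter.
  find(filter: Filter): Doc[] {
    return this.docs.filter((d) =>
      Object.entries(filter).every(([key, value]) => d[key] === value),
    );
  }
}

// Constructed per request from the authenticated tenant. Callers never write
// the tenant predicate themselves; it is merged in last, so a caller-supplied
// tenant_id cannot override it.
class TenantScopedCollection {
  private inner: InMemoryCollection;
  private tenantId: string;
  constructor(inner: InMemoryCollection, tenantId: string) {
    if (!tenantId) throw new Error("tenantId is required");
    this.inner = inner;
    this.tenantId = tenantId;
  }
  find(filter: Filter = {}): Doc[] {
    return this.inner.find({ ...filter, tenant_id: this.tenantId });
  }
}
```

A wrapper of this kind reduces the number of places where a filter can be missed from every query path to one, but it remains application-level isolation.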

Architecture B — Isolation Reality:

  • IAM graph data for Tenant A physically does not exist in Cell B's database — cross-cell IDOR requires control plane compromise + cell credential forgery
  • Worker exhaustion, OOM, stuck jobs: bounded to the cell (25–35 tenants), not the platform
  • Enterprise single-tenant cells: zero cross-tenant data at any layer; database breach exposes exactly one customer
  • GDPR data residency: EU cell on EU Hetzner region + EU Atlas region; US tenant data never touches EU infrastructure

Pros and Cons

Current Architecture (Shared Multi-Tenant)

Pros:

  • Simple to operate at current scale (single compose stack, one MongoDB)
  • Low infrastructure cost ($0.74/tenant at 50 tenants)
  • StorageAdapter abstraction provides a clean migration path to per-tenant collections without touching connectors or API routes
  • Fast iteration — one deployment target

Cons:

  • Worker queue blocks all tenants for the duration of any single sync job
  • One large-tenant OOM kills the API process for all tenants simultaneously
  • tenant_id field isolation is the only data boundary — one missing filter in any of 50+ query paths is a platform-wide cross-tenant breach
  • FedRAMP (NIST 800-53 SC-4), ISO 27001, and GDPR data residency compliance are structurally blocked
  • Write amplification from path materialization (O(I×R×P×Res) MongoDB reads) is a shared-instance bottleneck
  • No horizontal scaling path without rewrite
  • Docker Compose provides no rolling deployments, no health-based routing, no autoscaling
  • APAC dashboard unusable (350ms+ RTT from Nuremberg)

Cell Architecture

Pros:

  • Any single-cell failure (OOM, MongoDB, stuck job) affects 25–35 tenants, not the entire platform
  • FedRAMP Moderate eligible via single-tenant government cells
  • GDPR data residency: EU customers on EU cells, US customers on US cells — provable in procurement
  • Geographic cells eliminate APAC latency penalty
  • Meets the isolation that CISO-grade buyers require for compliance (FedRAMP, GDPR, contractual)
  • Cell provisioning via Helm is 12–18 minutes, zero downtime
  • Per-cell MongoDB credentials — one cell's database breach does not cascade

Cons:

  • Control plane is a new single point of failure; must be built to higher availability than data plane
  • Cell-to-cell tenant migration requires quiesce-export-import-verify-flip procedure (~30 min, coordination risk)
  • Cell sprawl: 10 cells = 10 MongoDB instances to patch, 10 Redis instances to monitor, 10 deployment rollbacks per release
  • $5.20/tenant at 50 tenants vs. $0.74 — 7× cost premium at low scale
  • Significant engineering investment for control plane, provisioning, cell-aware routing — time not spent on the AWS connector or MCP feature
  • Requires k3s or ECS as prerequisite — Docker Compose is incompatible with cell provisioning automation
  • Intra-cell isolation within a 25–35 tenant cell still requires tenant_id field discipline; Architecture B reduces the blast radius but does not change the isolation mechanism

Cost Model

| Scale | Arch A Infrastructure | Arch A $/tenant | Cells Needed | Arch B Infrastructure | Arch B $/tenant |
| --- | --- | --- | --- | --- | --- |
| 50 tenants | 2× CPX21 = €22/mo + Redis | $0.74 | 2 cells | $260/mo | $5.20 |
| 200 tenants | CPX51 + Atlas M20 = ~$170/mo | $0.85 | 7 cells | $910/mo | $4.55 |
| 500 tenants | CPX51 + Atlas M50 = ~$470/mo | $1.10 | 17 cells | $2,210/mo | $4.42 |

At 200+ tenants, Architecture A requires a dedicated DBA and constant capacity management; the staffing cost delta alone exceeds the $3.70/tenant infrastructure premium of Architecture B.
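
The Architecture B columns of the cost table can be reproduced with simple arithmetic. The two constants below are inferred from the table rows rather than stated in the report: all three rows are consistent with roughly 30 tenants per cell and about $130 per cell per month.

```typescript
// Inferred from the cost table, not stated by the report.
const TENANTS_PER_CELL = 30;
const COST_PER_CELL_USD = 130;

function cellCost(tenants: number): { cells: number; monthlyUsd: number; perTenantUsd: number } {
  const cells = Math.ceil(tenants / TENANTS_PER_CELL);
  const monthlyUsd = cells * COST_PER_CELL_USD;
  return { cells, monthlyUsd, perTenantUsd: monthlyUsd / tenants };
}

for (const tenants of [50, 200, 500]) {
  const { cells, monthlyUsd, perTenantUsd } = cellCost(tenants);
  console.log(
    `${tenants} tenants: ${cells} cells, $${monthlyUsd}/mo, $${perTenantUsd.toFixed(2)}/tenant`,
  );
}
// 50 tenants: 2 cells, $260/mo, $5.20/tenant
// 200 tenants: 7 cells, $910/mo, $4.55/tenant
// 500 tenants: 17 cells, $2210/mo, $4.42/tenant
```

The $3.70/tenant premium quoted above is the 200-tenant row: $4.55 minus $0.85. The 7× figure in the cons list is the 50-tenant row: $5.20 divided by $0.74.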


The Verdict: When to Invest in Cell Architecture

Cell architecture is the correct long-term direction. It is the wrong immediate investment.

SecurityV0 has no evidence that any problem cell architecture would solve exists at its current scale. The AWS connector produces no reliable execution evidence. Authentication is mid-migration. The worker queue is a JS array. Before rearchitecting for scale, the product must work.

Triggers that justify the cell investment (all must be true):

  1. 100+ tenants with active sync workloads
  2. Demonstrated requirement for physical data isolation (not just field-level tenant_id discipline)
  3. Measured noisy-neighbor degradation — not theoretical; actual P95 latency correlation between one tenant's sync load and another's dashboard latency
  4. All items 1–9 from the existing priority table are closed
  5. WorkOS auth migration is complete and deployed
  6. Operational capacity to maintain multiple independent MongoDB instances, Redis instances, and Helm deployments
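
Trigger 3 asks for a measured correlation rather than a theoretical one. A minimal sketch of that measurement is a Pearson coefficient over two aligned time series, for example one tenant's per-minute sync load and another tenant's per-minute dashboard P95 latency. The choice of series and any threshold applied to the result are assumptions, and a high coefficient indicates association, not proof of cause.

```typescript
// Pearson correlation coefficient of two equally long samples.
// Returns NaN when either series has zero variance.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two series of equal length with at least 2 points");
  }
  const mean = (v: number[]): number => v.reduce((sum, x) => sum + x, 0) / v.length;
  const meanX = mean(xs);
  const meanY = mean(ys);
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < xs.length; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}
```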

The incremental path — no big-bang rewrite:

Step A: Per-tenant MongoDB collections via StorageAdapter
Add tenantId → collectionName routing inside the adapter
Delivers collection-level isolation; maps cleanly to cell extraction later
Application code: unchanged. Connectors: unchanged.
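
A minimal sketch of the tenantId → collectionName routing in Step A might look like the following. The naming scheme (`t_<tenantId>.<base>`), the validation patterns, and the `collectionNameFor` name are assumptions; the report only states that the routing lives inside the adapter.

```typescript
// Assumed validation rules. Tenant ids may not contain "." or "$", so the
// first dot in the resulting name always separates tenant from collection.
const TENANT_ID_PATTERN = /^[a-z0-9][a-z0-9_-]{0,39}$/;
const BASE_NAME_PATTERN = /^[a-z][a-z0-9_]{0,39}$/;

function collectionNameFor(tenantId: string, base: string): string {
  if (!TENANT_ID_PATTERN.test(tenantId)) {
    throw new Error(`invalid tenantId: ${JSON.stringify(tenantId)}`);
  }
  if (!BASE_NAME_PATTERN.test(base)) {
    throw new Error(`invalid collection name: ${JSON.stringify(base)}`);
  }
  // The "t_" prefix keeps every tenant collection out of MongoDB's reserved
  // "system." namespace, even for a tenant whose id is "system".
  return `t_${tenantId}.${base}`;
}
```

With routing like this inside the adapter, a query that forgets its tenant filter can only read the collection of the tenant the adapter was built for.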

Step B: Persistent job queue
Replace WorkerJob[] array with durable queue (MongoDB-backed or BullMQ)
Enables parallel workers, per-tenant priority lanes, job recovery
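
The job recovery mentioned in Step B usually rests on claim-with-lease semantics: a worker claims a job for a bounded time, and an unfinished job becomes claimable again once the lease expires. The sketch below shows only those semantics with an in-memory stand-in; `LeaseQueue` and its fields are hypothetical, and a durable implementation would perform `claim` as a single atomic database operation.

```typescript
// In-memory stand-in used only to illustrate the claim/lease semantics.
interface QueuedJob {
  id: string;
  tenantId: string;
  leaseUntil: number; // epoch ms; 0 means never claimed
  done: boolean;
}

class LeaseQueue {
  private jobs: QueuedJob[] = [];

  enqueue(id: string, tenantId: string): void {
    this.jobs.push({ id, tenantId, leaseUntil: 0, done: false });
  }

  // Claims the oldest job that is neither finished nor under a live lease.
  // A job whose worker died becomes claimable again after its lease expires.
  claim(now: number, leaseMs: number): QueuedJob | null {
    const job = this.jobs.find((j) => !j.done && j.leaseUntil <= now);
    if (!job) return null;
    job.leaseUntil = now + leaseMs;
    return job;
  }

  complete(id: string): void {
    const job = this.jobs.find((j) => j.id === id);
    if (job) job.done = true;
  }
}
```

Several workers calling `claim` against the same store is what enables parallelism; the lease duration bounds how long a crashed worker can hold a job.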

Step C: Per-tenant API rate limiting
Token bucket keyed by tenantId in middleware
Eliminates noisy-neighbor at the API layer
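
A token bucket keyed by tenantId, as described in Step C, can be sketched in a few lines. The class name, the capacity and refill parameters, and the injected clock are assumptions for illustration; middleware would typically answer a denied request with HTTP 429.

```typescript
interface Bucket {
  tokens: number;
  updatedAt: number; // epoch ms of the last refill calculation
}

class TenantRateLimiter {
  private buckets = new Map<string, Bucket>();
  private capacity: number;
  private refillPerSec: number;

  constructor(capacity: number, refillPerSec: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
  }

  // Returns true and consumes one token if the tenant has budget left.
  // `now` (epoch ms) is passed in so behaviour is deterministic in tests.
  tryConsume(tenantId: string, now: number): boolean {
    const bucket = this.buckets.get(tenantId) ?? { tokens: this.capacity, updatedAt: now };
    const elapsedSec = Math.max(0, now - bucket.updatedAt) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSec * this.refillPerSec);
    bucket.updatedAt = now;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(tenantId, bucket);
    return allowed;
  }
}
```

Each tenant draws from its own bucket, so one tenant exhausting its budget does not affect another tenant's requests. An in-process map like this one is per instance; sharing limits across several API processes would need a shared store.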

When triggered: First enterprise customer requiring contractual isolation
Extract them to a dedicated single-tenant cell
One cell, one Terraform module, no generalized control plane yet

When triggered: Measured queue degradation across tenants
General cell model: control plane, provisioning automation, cell router
Steps A–C are already done; the migration is additive, not a rewrite

This path avoids the big-bang rewrite. Each step is independently justified by a confirmed problem. The architecture evolves toward cells driven by real customer requirements, not hypothetical scale.