SecurityV0 — Comprehensive Architecture & Security Audit Report
Executive Summary
SecurityV0 is a well-conceived Autonomous Execution Exposure Management platform with sound architectural principles: deterministic findings, evidence-grade audit trails, temporal drift detection. The core pipeline is solid. However, the audit uncovered 2 critical security vulnerabilities, 7 ship-blocking implementation bugs, and significant scalability risks that must be addressed before claiming production readiness.
0. Architecture Decisions — Critical Review
This section reviews the strongest criticisms of each major architectural decision. Some decisions hold up under scrutiny; others have real structural problems.
0.1 MongoDB for Graph Storage — The Evidence Immutability Claim Is False
The ADRs claim: MongoDB stores immutable evidence packs via SHA256 hashes.
The reality: The SHA256 hash is stored in the same mutable MongoDB collection as the content it is supposed to protect. A database administrator — or a compromised service account with write access — can modify both the content and the hash in a single operation. MongoDB has no append-only collection mode, no WORM storage, and no write-once semantics. The immutability is a convention enforced by application code, not by the database.
Why this matters for a security product: When a customer challenges the integrity of a finding during an incident response, SecurityV0's answer is "trust us." There is no cryptographic proof that the finding wasn't modified after the fact. NIST SP 800-53 AU-10 (non-repudiation) calls for exactly this kind of proof. AU-10 is a NIST control identifier rather than a SOC 2 one, but SOC 2 auditors reviewing evidence integrity are likely to raise the same question.
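Fix: Anchor evidence pack hashes outside the mutable collection. The lowest-effort option is an append-only PostgreSQL table with triggers preventing UPDATE/DELETE; it costs minimal operational effort and materially improves the compliance posture, although a DBA who can disable triggers can still bypass it. Amazon QLDB is the managed-ledger alternative, but AWS has announced end of support for QLDB, so check its availability before planning around it. §9.1 ranks the stronger options.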
0.2 Materialized Paths — The Write Amplification Is Worse Than Documented
The ADRs claim: Materialized paths provide O(1) blast radius queries. The scaling ceiling is ~10K identities.
The reality: The write cost is O(I × R × P × Res) where I=identities holding a changed role, R=roles, P=permissions/role, Res=resources/permission. When a role held by 3,000 identities changes its permissions (assuming 10 roles, 20 permissions per role, and 5 resources per permission, the figures used in §9.1), the materializer issues roughly 3.6M read operations and writes updated accessible_by arrays across hundreds of resource documents, all non-atomically. A failure midway leaves the graph in an inconsistent state. The ADR says "eventual consistency" as if it's acceptable; for a security product where blast radius queries are the core value proposition, inconsistent state during sync means incorrect answers to the CISO's primary question.
The trigger for this breaking: It is not raw entity count. It is role fan-out. A single highly-shared role (like "Developer" held by 3,000 engineers) changing permissions triggers the storm. This happens at much smaller tenant sizes than 10K total identities.
The ADR should lower the Neo4j/Kuzu trigger from 10,000 identities to 5,000 — or more precisely, to any role with >1,000 holders.
0.3 Stateless Sessions for a Security Platform — Structurally Wrong
The ADR claims: iron-session provides secure, stateless encrypted cookies. The design is provider-independent.
The reality: iron-session stores an encrypted, self-contained session payload in a cookie — this is genuinely stateless from the server's perspective (no server-side lookup to decode the session). However, the middleware hits MongoDB on every request anyway to validate the user's current membership and permissions, so the server-side lookup is happening regardless. In that context, iron-session's statelessness provides no performance benefit, and it removes the ability to revoke an individual session: fire an employee, deactivate their WorkOS account, and their sv0_session cookie remains valid for up to 7 days.
Additional gap — logout is a no-op: workos-provider.ts:74 has an empty logout() method. Clearing the cookie does not revoke the session on WorkOS's side. A cookie exfiltrated before logout remains valid.
The 7-day TTL is inappropriate. AWS Console sessions are 1–12 hours. Security tooling industry practice is 8–24 hours for human sessions. A security platform storing CISO-grade findings should not have sessions that outlive most employees' work weeks.
0.4 The In-Process Job Queue — Production Incident Waiting to Happen
The ADR claims: The in-process FIFO queue is sufficient for MVP scale.
The reality: The WorkerJob[] array at runtime.ts:26 is unbounded (no backpressure), not persisted (lost on restart), and not recoverable (no dead letter queue). The shutdown() handler sets a flag and exits — if a sync is mid-flight at step 6 of 11 when the container restarts (deployment, OOM kill, crash), the sync stays in "running" status forever. There is no detection, no alerting, no recovery path.
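The missing detection path could look roughly like the sketch below: at worker startup, any sync still marked "running" whose last heartbeat is older than a threshold is treated as orphaned. The SyncRecord shape, the heartbeatAt field, and findOrphanedSyncs are hypothetical names for illustration and are not taken from the codebase.

```typescript
// Hypothetical record shape; the real sync document may differ.
interface SyncRecord {
  syncId: string;
  status: "queued" | "running" | "completed" | "failed";
  heartbeatAt: number; // epoch ms of the last progress update (assumed field)
}

// Returns the running syncs whose heartbeat is older than maxSilenceMs.
// A worker would call this once at startup and mark the results as failed
// (or re-enqueue them) instead of leaving them in "running" forever.
function findOrphanedSyncs(
  syncs: SyncRecord[],
  nowMs: number,
  maxSilenceMs: number,
): SyncRecord[] {
  return syncs.filter(
    (s) => s.status === "running" && nowMs - s.heartbeatAt > maxSilenceMs,
  );
}

const nowMs = 1_000_000;
const orphanedSyncs = findOrphanedSyncs(
  [
    { syncId: "a", status: "running", heartbeatAt: nowMs - 600_000 },
    { syncId: "b", status: "running", heartbeatAt: nowMs - 5_000 },
    { syncId: "c", status: "completed", heartbeatAt: nowMs - 900_000 },
  ],
  nowMs,
  60_000,
);
console.log(orphanedSyncs.map((s) => s.syncId)); // only "a" is both running and stale
```

This only covers detection. Durable recovery still needs the job itself to be persisted somewhere that survives the restart.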
The event loop concern is a red herring. The real risk is MongoDB connection pool pressure under concurrent syncs. Sequential await calls release connections back to the pool between operations — they do not hold a connection across the full path materialization loop. Pool saturation occurs when multiple syncs run concurrently (each holding its own connections simultaneously). The current serial in-process queue actually prevents this specific problem by serializing syncs. The correct argument for replacing it is persistence and recovery (lost jobs on restart, no dead letter queue) — not connection pool saturation.
The Express 4 async bug is real: Every async route handler in Express 4 that throws an unhandled rejection hangs the request indefinitely — it does not route to the error handler. ingest.ts:160 has exactly this pattern. Express 5 fixes this natively.
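Until the Express 5 migration lands, the usual stopgap is a small wrapper that forwards rejections to next(err). The sketch below uses local stand-in types instead of importing express so that it is self-contained; wrapAsync is a hypothetical helper name, and the real handler signatures would come from the express typings.

```typescript
// Minimal local stand-ins for Express's request/response/next types so the
// sketch is self-contained; in the real app these come from "express".
type Req = Record<string, unknown>;
type Res = Record<string, unknown>;
type Next = (err?: unknown) => void;
type AsyncHandler = (req: Req, res: Res, next: Next) => Promise<unknown>;

// Wraps an async handler so a rejected promise is forwarded to next(err),
// which is how Express 4 routes errors to its error-handling middleware.
function wrapAsync(handler: AsyncHandler) {
  return (req: Req, res: Res, next: Next): Promise<void> =>
    Promise.resolve()
      .then(() => handler(req, res, next))
      .then(() => undefined)
      .catch((err) => next(err));
}

// Usage: a handler that rejects no longer hangs the request.
let forwardedError: unknown;
const failingRoute = wrapAsync(async () => {
  throw new Error("ingest failed");
});
const routeSettled = failingRoute({}, {}, (err) => {
  forwardedError = err;
});
```

Returning the promise from the wrapper is a convenience for testing; Express itself ignores the return value.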
0.5 The ELK.js Web Worker — ADR and Code Are Contradictory
ADR-011 states explicitly: "The layout uses the Web Worker variant from day one (elkjs/lib/elk-worker.min.js) — since the API is async either way, using the worker costs no extra complexity and keeps the UI thread free for all graph sizes."
The actual code at layout.ts:1:
import ELK from "elkjs/lib/elk.bundled.js"; // main thread — blocks UI
This is not a gray area or a judgment call. The ADR says use the worker variant. The implementation uses the main thread variant. At 200+ nodes, layout computation freezes the UI for 150-400ms. The spinner overlay that displays during layout may not even render before the thread locks.
This is the easiest fix in the entire audit: one line, verified against the ADR.
0.6 SSH Deployment Key — Docker Group Membership Is Root
The deployment docs claim: Deployment uses a restricted deploy user for security.
The reality: The deploy user is in the Docker group (deployment.md:307-308: sudo usermod -aG docker deploy). Docker group membership is functionally equivalent to root — docker run -v /:/host ubuntu chroot /host gives a root shell on the host. A compromised DEPLOY_SSH_KEY (which is exposed to every GitHub Actions runner that touches this repo) gives the attacker full root access to the production server, MongoDB included.
The kill chain is not theoretical. GitHub Actions runners are shared VMs. A supply chain attack on any dependency in the CI pipeline, or a compromised runner, exposes the key.
The fix is specific: Remove deploy from the Docker group. Use sudo with an allowlist of exactly two commands: docker compose pull and docker compose up -d in the platform directory. Nothing else.
0.7 REQUIRE_AUTH Defaults to False — The Insecure Default Ships
The deployment compose claims: Authentication is configurable.
The reality: docker-compose.deploy.yml:46 has REQUIRE_AUTH: "${REQUIRE_AUTH:-false}". The default is the insecure value. A deployment that forgets to set this environment variable — or a new engineer who spins up an instance following the compose file — gets a fully unauthenticated API where any caller can inject data into any tenant by setting the X-Tenant-Id header.
The Zod schema in env.ts:18 defaults to "true", which partially saves production. But this is defense by accident — two defaults in two files that contradict each other. The compose file default should be true. Secure defaults must not require active choices.
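A fail-closed parse makes the intent explicit regardless of which file supplies the value. The sketch below is a plain-TypeScript illustration, not the actual Zod schema in env.ts; parseRequireAuth is a hypothetical helper name.

```typescript
// Parses REQUIRE_AUTH so that the secure value is the default: auth is
// disabled only when the variable is explicitly set to "false".
function parseRequireAuth(env: Record<string, string | undefined>): boolean {
  const raw = env["REQUIRE_AUTH"];
  if (raw === undefined || raw.trim() === "") return true; // unset means secure
  return raw.trim().toLowerCase() !== "false";
}

console.log(parseRequireAuth({}));                        // true
console.log(parseRequireAuth({ REQUIRE_AUTH: "false" })); // false
console.log(parseRequireAuth({ REQUIRE_AUTH: "yes" }));   // true
```

With this shape, an operator has to take a deliberate action to turn authentication off, which is the property the compose default currently inverts.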
0.8 What the ADRs Got Right (Genuinely)
Several decisions hold up under scrutiny:
- Python for connectors: Correct. boto3/msgraph-sdk ecosystem advantage is real. Go/TypeScript/Rust offer no practical advantage for I/O-bound batch API scanning. The GIL is irrelevant.
- Docker Compose for current scale: Correct for now. The deploy-instance.sh multi-instance orchestration with Caddy hot-reload is well-engineered for the current 2-server footprint. It becomes a ceiling when rolling deployments, multi-host scaling, or cell provisioning automation are required (see §13).
- WorkOS selection: Correct. Provider abstraction (AuthProvider interface) is clean. Exit path exists. Admin Portal alone justifies the choice for enterprise SSO onboarding.
- 10-entity type model: Correct. Not too universal. Adding a 4th connector is 90% connector work. The subtype system handles cloud-specific variation without fracturing the evaluator rules.
- Rejecting Apache AGE (ADR-003): Correct. Variable-length path exponential blowup is architectural. AWS RDS still doesn't support it. Decision remains valid.
- StorageAdapter abstraction: The single best decision in the codebase. 60+ methods behind a clean interface makes Kuzu, Neo4j, or any future migration feasible without touching connectors or evaluator rules.
1. Architecture Overview
What it is: A system of record for Non-Human Identity (NHI) execution authority. Answers the CISO question: "What can this automation actually do, who owns it, and what happened to its access?"
Stack: Node.js/TypeScript API + React 19 frontend + MongoDB + Python connectors (entra-servicenow, azure-foundry, aws)
Pipeline: 3-job sequential: sync_ingestion → evaluate_findings → build_evidence_pack (SHA256-sealed, immutable)
Entity model: 10 types — workload, connection, credential, identity, role, permission_set, permission, resource, owner, execution_evidence
15 finding rules in the evaluator (orphaned_ownership, scope_drift, dormant_authority, reachability_drift, llm_egress, etc.)
Architecture maturity: ~75% — pipeline solid, auth transition in-progress, SCIM/OAA deferred, ~40% of docs stale vs. current implementation.
2. Gray Zone #1 — Graph Storage & Scalability
MongoDB: Adequate for MVP, Breaking Point at ~10K Identities
The architecture uses materialized execution paths (pre-computed at sync time, O(1) blast radius queries) rather than real-time graph traversal. No $graphLookup anywhere — application-level BFS only. This is a deliberate, documented trade-off.
Scale ceiling:
| Scenario | Latency | Breaks At |
|---|---|---|
| MVP (<1K identities, 2-3 connectors) | <100ms | — |
| Growth (5K identities) | 100-500ms | Path recompute bottleneck |
| Scale (10K+ identities) | 500ms-2s | Breaking point |
| Production (50K identities) | 2-10s | Query timeouts, incomplete results |
On graph database alternatives: The decision to stay on MongoDB is correct for now, but the reasoning in the ADRs is partially wrong.
The claim that "Neo4j is bad at rich document storage" is not a valid reason to avoid it — Neo4j handles property maps on nodes/edges adequately, and more importantly, PostgreSQL with JSONB handles document storage extremely well and is fast. A PostgreSQL-based alternative covers document storage, temporal queries (range types, tstzrange), and graph traversal (recursive CTEs or Apache AGE extension) in a single engine. AGE was rejected (ADR-003) for exponential blowup on variable-length paths — a real limitation — but that's a specific traversal argument, not a document storage argument.
The correct reasons to stay on MongoDB at current scale:
- The StorageAdapter abstraction already makes migration low-cost whenever the trigger is hit
- MongoDB is sufficient for <10K identities with the current materialized path model
- Adding a graph engine before the breaking point is premature
When the 10K identity breaking point approaches, the real options are:
| Option | Graph | Documents | Temporal | Ops Cost |
|---|---|---|---|---|
| Kuzu (embedded) | Cypher, fast analytics | Via MongoDB (hybrid) | Via MongoDB | Near-zero — no new service |
| PostgreSQL + AGE | OpenCypher, exponential blowup risk on deep paths | JSONB, excellent | Native range types | One service replaces MongoDB |
| PostgreSQL (recursive CTEs) | Depth-limited traversal only | JSONB, excellent | Native range types | One service replaces MongoDB |
| Neo4j | Best-in-class graph | Property maps (adequate) | Temporal plugin needed | High — dedicated server |
| Neptune | Gremlin/SPARQL | External only | External only | AWS lock-in |
PostgreSQL is a legitimate and underrated option — it is not on the ADR radar but should be. A single Postgres instance with JSONB columns replaces MongoDB entirely, handles temporal queries natively, and graph traversal via recursive CTEs works for depth-limited paths (which is all SecurityV0 needs at MAX_AUTH_CHAIN_DEPTH). The StorageAdapter abstraction makes this migration just as feasible as Neo4j. Kuzu remains the lowest-friction first step (embedded, no new service, Cypher queries replace BFS loops).
Critical Code Bugs
| Severity | Issue | File:Line |
|---|---|---|
| CRITICAL | Reverse-lookup BFS has no document limit — can pull 50K+ docs into memory | subgraph-adapter.ts:158 |
| HIGH | Unbounded frontier growth — exponential blowup on high-degree nodes | subgraph-adapter.ts:35 |
| HIGH | Stale execution paths when role GRANTS change — affected identities not re-materialized | path-materializer.ts:40 |
| HIGH | MAX_AUTH_CHAIN_DEPTH=1 — any 3-system chain (Entra → SN → Slack) is missed | path-materializer.ts:17 |
| MEDIUM | Blast radius endpoint returns all paths with no pagination | paths.ts:14 |
| MEDIUM | Visited set in DFS causes path aliasing via shared state across branches | path-materializer.ts:110 |
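The first two rows share a root cause: traversal has no hard ceiling. A minimal sketch of a bounded breadth-first search, assuming an in-memory adjacency list in place of the real MongoDB lookups; boundedBfs, BfsLimits, and the truncated flag are illustrative names, not the subgraph-adapter.ts API.

```typescript
type Graph = Map<string, string[]>; // adjacency list: node id to neighbor ids

interface BfsLimits {
  maxDepth: number; // hop limit
  maxNodes: number; // hard cap on visited nodes (documents pulled into memory)
}

interface BfsResult {
  visited: string[];
  truncated: boolean; // true when the node cap stopped traversal early
}

function boundedBfs(graph: Graph, start: string, limits: BfsLimits): BfsResult {
  const visited = new Set<string>([start]);
  let frontier = [start];

  for (let depth = 0; depth < limits.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const neighbor of graph.get(node) ?? []) {
        if (visited.has(neighbor)) continue;
        if (visited.size >= limits.maxNodes) {
          // Stop instead of growing the frontier without bound.
          return { visited: Array.from(visited), truncated: true };
        }
        visited.add(neighbor);
        next.push(neighbor);
      }
    }
    frontier = next;
  }
  return { visited: Array.from(visited), truncated: false };
}

// A high-degree hub: without the cap, every leaf would be loaded.
const hubGraph: Graph = new Map([["hub", ["l1", "l2", "l3", "l4", "l5"]]]);
const capped = boundedBfs(hubGraph, "hub", { maxDepth: 3, maxNodes: 3 });
console.log(capped.visited.length, capped.truncated); // 3 true
```

Surfacing truncated to the caller matters as much as the cap itself: a silently clipped blast radius would be another incorrect answer.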
3. Gray Zone #2 — Data Model Universality
Verdict: NOT Too Universal
The 10-type model is well-differentiated along three orthogonal axes: functional role, scope binding, temporal nature. The permission_set type (ADR-014) correctly distinguishes IAM policy documents (ceiling constraints) from role grants.
However: 3 deterministic, silent failures make key AWS features completely non-functional:
Ship-Blocking Bugs in AWS Connector
F1 — privilege_justification_gap returns 0 findings on all AWS data
- path.resource_id is a MongoDB hex hash, never matches an ARN
- Rule matching branch always fails for AWS sources
- File: src/evaluator/rules/privilege-justification-gap.ts:48-50
F2 — CloudTrail extractor doesn't exist
- cloudtrail_evidence initialized to [] in cli/main.py:146, never populated
- dormant_authority rule fires on 100% of Lambda functions (no evidence ever found)
- _transform_cloudtrail_evidence() exists but receives empty input, discards request_parameters and resources anyway
- Tracked: sv0-connectors#31
F3 — Assumed-role ARN parser returns None for 80-90% of real AWS events
- Lambda, ECS, Step Functions, Bedrock all produce sts:assumed-role/RoleName/session ARNs
- Parser only handles iam:role/ and iam:user/ shapes
- All assumed-role evidence lands with entity_id: "" and is therefore ungroupable by workload
- Fix is 5 lines: add an elif ":assumed-role/" in arn: branch at transformer.py:1768
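The parsing logic the missing branch needs is small. The connector itself is Python; the sketch below expresses the same logic in TypeScript purely to illustrate the ARN shapes involved, and parsePrincipalArn and ParsedPrincipal are hypothetical names rather than anything in transformer.py.

```typescript
interface ParsedPrincipal {
  kind: "role" | "user" | "assumed-role";
  name: string;         // role or user name
  sessionName?: string; // only present for assumed-role ARNs
}

function parsePrincipalArn(arn: string): ParsedPrincipal | null {
  // ARN layout: arn:partition:service:region:account-id:resource
  const parts = arn.split(":");
  if (parts.length < 6 || parts[0] !== "arn") return null;
  const service = parts[2];
  const resource = parts.slice(5).join(":");
  const segments = resource.split("/");

  if (service === "sts" && segments[0] === "assumed-role" && segments.length >= 3) {
    // assumed-role/<RoleName>/<SessionName>
    return {
      kind: "assumed-role",
      name: segments[1],
      sessionName: segments.slice(2).join("/"),
    };
  }
  if (service === "iam" && segments[0] === "role" && segments.length >= 2) {
    // IAM role ARNs may include a path; the role name is the last segment.
    return { kind: "role", name: segments[segments.length - 1] };
  }
  if (service === "iam" && segments[0] === "user" && segments.length >= 2) {
    return { kind: "user", name: segments[segments.length - 1] };
  }
  return null;
}

const parsedSession = parsePrincipalArn(
  "arn:aws:sts::123456789012:assumed-role/OrderProcessor/lambda-session",
);
console.log(parsedSession); // kind "assumed-role", name "OrderProcessor"
```

Keying evidence on the role name rather than the session name is what lets assumed-role events group back to the workload's role.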
F4 — AWS connector never sets normalized_action — all AWS execution path actions are "unknown"
- path-materializer.ts:147 reads perm.properties.normalized_action to populate the actions array on every execution path
- The Entra-ServiceNow and Azure-Foundry connectors both set normalized_action ("read", "write", "admin", "execute")
- The AWS connector sets only properties.action (raw IAM string: iam:PassRole, iam:CreateRole, etc.) and never sets normalized_action
- Result: every AWS execution path has actions: ["unknown"]; the raw IAM action is silently discarded by the materializer
- Second reason F1 is broken: even after the resource_id matching fix, privilege_justification_gap's write-level action mismatch check (hasWriteActions()) would still never trigger on AWS data because it checks for "write", "admin", "delete", not "unknown"
- Blocks escalation detection: a future escalation_capable rule checking for IAM privilege-escalation actions (iam:PassRole, iam:CreateRole, sts:AssumeRole*) cannot work until this is fixed
- scope_drift is NOT affected: it checks role additions against domain sensitivity, never reads path.actions
- Not caught by any test: AWS connector tests only assert node/edge counts and subtype == "iam_permission", never check normalized_action; all path materializer and evaluator tests use hand-crafted entra_id fixtures with normalized_action explicitly set; no seed data includes AWS-sourced entities
- Files: sv0-connectors/integrations/aws/src/sv0_aws/core/transformer.py:1619–1628 (sets action, not normalized_action), sv0-platform/src/ingestion/path-materializer.ts:147 (reads normalized_action with ?? "unknown" fallback, no attempt to read properties.action)
Additional gaps:
- permission_set platform materializer not updated: still traverses HAS_ROLE for AWS paths → incorrect via_roles on all AWS authority paths
- Ownership mapping from AWS resource tags never implemented → all AWS identities ownership_state: unknown
- resource_name never populated on AWS resource nodes
- AWS IAM condition keys detected but not evaluated → authority paths over-report reachability; no conditions_not_evaluated flag on ExecutionPath to surface this
On "metadata-only vs. code analysis":
- Structural authorization (what roles can reach what): ✅ works when CloudTrail bugs fixed
- Behavioral (is identity actually used): ⚠️ blocked by F2/F3
- Code vulnerability (injection, hardcoded secrets): ❌ out of scope, needs SAST/SCA connectors (future additive connector)
4. Gray Zone #3 — Connector Rate Limiting
Overall Risk: HIGH — Inconsistent throttling resilience across connectors
| Connector | Risk | Primary Issue |
|---|---|---|
| AWS | MEDIUM | Good botocore adaptive retry — missing jitter |
| Azure Entra | HIGH | Sequential-only (12+ min for 500 SPs at 2 RPS); no explicit Retry-After |
| ServiceNow | CRITICAL | Offset pagination breaks on 429 — no cursor resume |
| Azure Foundry | MEDIUM | Relies on SDK defaults — behavior unclear |
Critical code findings:
- servicenow_client.py:421 (if response.status_code != 200: break) silently drops the remainder of pagination on any 429
- aws_client.py:276 (wait_time = 2**retry_count) has no jitter → synchronized retry storms across tenants
- No global rate-quota tracker: one large tenant's scan blocks others
- No per-resource skip logic: one failed get_policy() fails the entire scan
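The jitter gap is the cheapest of these to close. The connectors are Python; the sketch below shows the "full jitter" variant of exponential backoff in TypeScript only to illustrate the technique. The function names, the 1-second base, and the 60-second cap are assumptions, not values from the connector code.

```typescript
// Full-jitter exponential backoff: pick a uniform random delay in
// [0, min(capMs, baseMs * 2^attempt)) instead of a fixed 2^attempt wait,
// so that tenants throttled at the same moment do not retry in lockstep.
function backoffWithJitter(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * ceiling;
}

// A Retry-After value (in seconds), when the server sends one, takes
// precedence over the computed backoff.
function nextDelayMs(attempt: number, retryAfterSeconds?: number): number {
  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0) {
    return retryAfterSeconds * 1000;
  }
  return backoffWithJitter(attempt, 1000, 60_000);
}

console.log(backoffWithJitter(3, 1000, 60_000, () => 0.5)); // 4000
console.log(nextDelayMs(2, 7)); // 7000
```

Honoring Retry-After first also addresses the Azure Entra row in the table, where the header is currently not read explicitly.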
Rate limit exposure at medium scale (500 resources):
| Service | Limit | Calls/Scan | Risk at 10K resources |
|---|---|---|---|
| AWS IAM | ~20 RPS | 500-1500 | ~15min sustained, retries cascade |
| Azure Graph API | 2 RPS | 600+ | 12+ min serial, any 429 stalls all |
| ServiceNow | 2-4 RPS | 200-250 | No recovery on 429 |
| Azure Foundry ARM | 4 RPS | 150 | Unclear retry behavior |
5. The Blocker — AI Agent Permissions & PII Access Graph
"Show new permissions graph when deploying AI agent with MCP servers, flag PII access"
What SecurityV0 Already Has
- ai_agent workload subtype: already in entity model
- 5-level sensitivity classification propagates through authority paths
- reachability_drift, scope_drift evaluator rules detect changes since baseline
- reachable_sensitive_domain finding fires on PII-classified resource access
- Deployment approval fully designed (research docs 2026-04-07-mcp-agentic-deployment-approval-research.md, 12-deployment-approval.md)
What's Missing — Implementation, Not Design
| Gap | Notes |
|---|---|
| mcp_tool entity type + DECLARES_TOOL relationship | Tools currently invisible in graph |
| data_domain entity type + ACCESSES relationship | Business domain classification needed |
| MCP manifest parser (mcp.json → NormalizedGraph) | No parser exists |
| Graph projection algorithm (merge manifest → run materializer on projected state) | Core "what-if" engine |
| POST /api/v1/deployment/preview endpoint | Designed, not coded |
| PII output schema tracking on tool declarations | Resource-level exists; tool output level missing |
| Approval record storage + UI | Operating layer not built |
Hard Problems (No Easy Solution)
- MCP tool opacity — tools are blackboxes; declared ≠ actual. Mitigation: cryptographic manifest attestation, grade as "C" until runtime evidence
- One identity per MCP server — all tools share service principal blast radius. SV0 detects; application architecture must fix
- PII exfiltration tracking — tool output schema declaration partially solves; runtime inspection required for full coverage
6. Platform Security Audit
Critical Vulnerabilities
CRITICAL — Cross-Tenant IDOR via REQUIRE_AUTH Bypass
When REQUIRE_AUTH=false (development default):
- auth.ts:62-70: sets req.auth = { tenantId: attacker-controlled }
- tenant-context.ts:12-14: reads tenant from auth, no membership validation
- A connector can POST /api/v1/ingest/normalized-graph with X-Tenant-Id: victim-tenant and inject data into any tenant
The new auth-middleware.ts with WorkOS membership validation fixes this, but has not been deployed (app.ts:26-29 TODO).
CRITICAL — DevAuthProvider Has No Production Gate
dev-provider.ts:100-108 — returns valid super-admin session for any token when AUTH_PROVIDER=dev. If set in production, auth is completely bypassed.
Fix: provider-factory.ts must throw on AUTH_PROVIDER=dev && NODE_ENV=production.
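A minimal sketch of that guard, assuming it runs once at startup before any provider is constructed. assertSafeAuthConfig is a hypothetical name; the real check would live wherever provider-factory.ts selects the provider.

```typescript
// Throws at startup when the dev auth provider is configured in production.
function assertSafeAuthConfig(env: Record<string, string | undefined>): void {
  const isProduction = env["NODE_ENV"] === "production";
  if (isProduction && env["AUTH_PROVIDER"] === "dev") {
    throw new Error("AUTH_PROVIDER=dev is not allowed when NODE_ENV=production");
  }
}

// Allowed: dev provider outside production.
assertSafeAuthConfig({ NODE_ENV: "development", AUTH_PROVIDER: "dev" });

// Rejected: dev provider in production.
let guardMessage = "";
try {
  assertSafeAuthConfig({ NODE_ENV: "production", AUTH_PROVIDER: "dev" });
} catch (err) {
  guardMessage = err instanceof Error ? err.message : String(err);
}
console.log(guardMessage);
```

Failing at process start, rather than on the first request, means a misconfigured deployment never begins serving traffic.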
Full Severity Table
| Severity | Issue | File:Line | Fix |
|---|---|---|---|
| CRITICAL | Cross-tenant IDOR via REQUIRE_AUTH bypass | auth.ts:62-70 | Deploy new auth-middleware with membership check |
| CRITICAL | DevAuthProvider: no production gate | dev-provider.ts:100-108 | Throw if AUTH_PROVIDER=dev && NODE_ENV=production |
| HIGH | Ingest: no cycle detection — evaluator infinite-loop risk | ingest.ts:121-152 | DFS cycle check; max 100K nodes |
| HIGH | Connector reports: .passthrough() allows field injection | ingest.ts:65-73 | Remove passthrough; ban _-prefixed fields |
| MEDIUM | Rate limiting per-tenant only — bypass by rotating tenant IDs | rate-limit.ts:14-16 | Key on ${tenantId}:${principalId} |
| MEDIUM | Path evaluator: no depth limit on ownership chain traversal | path-evaluator.ts:127 | Max 10 levels; fail with unresolved_ownership_depth |
| MEDIUM | Session: no refresh token; 7-day TTL forces full re-auth | session.ts:56-68 | Add POST /auth/refresh; 24h sliding window |
| MEDIUM | Silent entity overwrite without idempotency warning | ingest.ts:160-206 | Warn if nodeIds exist in prior syncs |
| RETRACTED | q search param not verified escaped before MongoDB regex | entities.ts:48-50 | Finding retracted: escapeRegex() exists in entity-adapter.ts and is applied before every $regex query. No injection risk. |
Positive: Helmet enabled, CORS explicit, x-powered-by disabled, 5MB body limit, no hardcoded secrets, Zod validation throughout.
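The ingest cycle check recommended in the table can be sketched as an iterative depth-first search over the submitted edges. The Edge shape and hasCycle are hypothetical names for illustration; the real schema in ingest.ts may carry edges differently, and the 100K node cap from the table would be enforced separately.

```typescript
interface Edge {
  from: string;
  to: string;
}

// Returns true if the directed graph described by `edges` contains a cycle.
// Iterative three-state DFS, so deep graphs cannot overflow the call stack.
function hasCycle(edges: Edge[]): boolean {
  const adjacency = new Map<string, string[]>();
  for (const { from, to } of edges) {
    if (!adjacency.has(from)) adjacency.set(from, []);
    if (!adjacency.has(to)) adjacency.set(to, []);
    adjacency.get(from)!.push(to);
  }

  const IN_PROGRESS = 1;
  const DONE = 2;
  const state = new Map<string, number>();

  for (const root of Array.from(adjacency.keys())) {
    if (state.has(root)) continue;
    // Each stack frame is [node, index of the next neighbor to visit].
    const stack: Array<[string, number]> = [[root, 0]];
    state.set(root, IN_PROGRESS);
    while (stack.length > 0) {
      const frame = stack[stack.length - 1];
      const neighbors = adjacency.get(frame[0])!;
      if (frame[1] < neighbors.length) {
        const next = neighbors[frame[1]++];
        const nextState = state.get(next);
        if (nextState === IN_PROGRESS) return true; // back edge, so a cycle exists
        if (nextState === undefined) {
          state.set(next, IN_PROGRESS);
          stack.push([next, 0]);
        }
      } else {
        state.set(frame[0], DONE);
        stack.pop();
      }
    }
  }
  return false;
}

console.log(hasCycle([{ from: "a", to: "b" }, { from: "b", to: "c" }, { from: "c", to: "a" }])); // true
console.log(hasCycle([{ from: "a", to: "b" }, { from: "a", to: "c" }, { from: "b", to: "c" }])); // false
```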
7. Master Weakness Table
See §12 (Updated Master Priority Table) for the complete, reconciled finding list. §12 supersedes this section and includes findings from the full technology validation in §9–11. The table below is an early-pass summary retained for cross-reference with the section findings above.
| Pri | Category | Issue | Status |
|---|---|---|---|
| 1 | Security | Cross-tenant IDOR via REQUIRE_AUTH=false | Ship blocker |
| 2 | Security | DevAuthProvider no production gate | Ship blocker |
| 2a | Security | verifyM2MToken() returns null — every Bearer-token M2M auth path is completely unenforced | Ship blocker |
| 3 | AWS Connector | CloudTrail extractor not implemented | Ship blocker |
| 4 | AWS Connector | Assumed-role ARN parsing broken (80-90% events) | Ship blocker |
| 5 | AWS Connector | privilege_justification_gap always 0 on AWS | Ship blocker |
| 5a | AWS Connector | normalized_action never set — all AWS execution path actions are "unknown" | Ship blocker |
| 6 | AWS Platform | permission_set materializer not updated | Ship blocker |
| 7 | Graph DB | BFS reverse lookup: no document limit | Pre-scale blocker |
| 8 | Connector | ServiceNow pagination: no cursor resume on 429 — corrupts baselines permanently | Ship blocker |
| 9 | Graph DB | Stale paths on role GRANTS change | Correctness gap |
| 10 | Graph DB | MAX_AUTH_CHAIN_DEPTH=1 — 3-system chains missed | Feature gap |
| 11 | Security | Ingest: no cycle detection | Hardening |
| 12 | Security | .passthrough() allows field injection | Hardening |
| 13 | Connector | AWS backoff: no jitter | Pre-scale hardening |
| 14 | AWS Connector | Ownership not mapped from resource tags | Feature gap |
| 15 | AWS Connector | IAM conditions not evaluated; no caveat flag | Feature gap |
| 16 | MCP Feature | mcp_tool, manifest parser, graph projection missing | Phase 1 feature |
| 17 | Evaluator | No escalation/impersonation detection — roles with iam:PassRole, roleAssignments/write, actAs are invisible | Feature gap |
| 18 | Security | Rate limiting per-tenant only | Hardening |
| 19 | Docs | ~40% of architecture docs stale | Operational risk |
8. Prioritized Action Plan
Critical — Security and Data Integrity
- Deploy auth-middleware.ts pipeline (fixes IDOR)
- Add AUTH_PROVIDER=dev && NODE_ENV=production guard
- Fix assumed-role ARN parser (5-line fix at transformer.py:1768)
- Fix ServiceNow pagination cursor resume
AWS Connector
- Implement CloudTrail extractor (sv0-connectors#31)
- Fix _transform_cloudtrail_evidence to preserve request_parameters + resources
- Update platform materializer for HAS_PERMISSION_SET traversal on AWS
- Add .limit(query.limit) to BFS reverse lookup
Correctness and Hardening
- Ownership mapping from AWS resource tags
- conditions_not_evaluated caveat flag on ExecutionPath
- Cycle detection in ingest schema validation
- Jitter on AWS backoff; MAX_AUTH_CHAIN_DEPTH → 2
- Rate limit key: ${tenantId}:${principalId}
MCP / AI Agent Feature
- mcp_tool entity + DECLARES_TOOL / ACCESSES relationships
- MCP manifest parser
- Graph projection algorithm + POST /api/v1/deployment/preview
- Approval record storage
Parallel Track
- Event-driven delta sync (CloudTrail streaming, Entra odata.deltaLink)
- Session refresh token endpoint
- Documentation refresh for 00-overview, 04-api, 07-ui
9. Technology Validation — Architecture Review (April 12, 2026)
9.1 Graph & Database Layer
Verdict: MongoDB + ADRs validated for current scale, with two key gaps: evidence immutability is structurally broken; Kuzu is a viable in-process alternative for path queries that hasn't been evaluated.
Confirmed / Adjusted
| Decision | Status | Adjustment |
|---|---|---|
| MongoDB for MVP | VALIDATED | Correct for current scale |
| Single entities collection (ADR-002) | VALIDATED WITH CAVEAT | Implement accessible_by overflow collection before 5K identities/tenant |
| Materialized paths strategy | VALIDATED WITH CAVEAT | Write amplification is O(I × R × P × Res) — role fan-out is the real scaling cliff, not raw entity count |
| No $graphLookup | VALIDATED | Application-level BFS is the documented trade-off; $graphLookup has depth limits, no shortest-path support, and does not address SecurityV0's bounded-hop traversal pattern better than the materialized path approach |
| Reject Apache AGE (ADR-003) | VALIDATED | Variable-length path exponential blowup is architectural; AWS RDS still unsupported |
| Neo4j trigger threshold | LOWER to 5K | Original ADR said 10K; role fan-out write amplification hits at ~3K identities sharing a common role |
New Findings
Write amplification formula: When a role held by I identities changes permissions across R roles, P permissions/role, Res resources/permission:
- Read operations: I × (1 + R + R×P + R×P×Res)
- Write operations: I writes (identity docs) + R×P×Res writes (resource accessible_by arrays)
- At 3,000 identities × 10 roles × 20 permissions × 5 resources: 3,000 × (1 + 10 + 200 + 1,000) = 3,633,000 (~3.6M) read ops, plus 3,000 + 1,000 = 4,000 write ops per role change
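Plugging the example parameters into the two formulas as written can be done mechanically, which also makes it easy to test other fan-out shapes. The FanOut field names and materializationCost are illustrative only.

```typescript
interface FanOut {
  identities: number;             // I: identities holding the changed role
  roles: number;                  // R: roles per identity
  permissionsPerRole: number;     // P
  resourcesPerPermission: number; // Res
}

// Reads:  I × (1 + R + R×P + R×P×Res)
// Writes: I + R×P×Res
function materializationCost(f: FanOut): { reads: number; writes: number } {
  const rp = f.roles * f.permissionsPerRole;
  const rpRes = rp * f.resourcesPerPermission;
  return {
    reads: f.identities * (1 + f.roles + rp + rpRes),
    writes: f.identities + rpRes,
  };
}

const roleChangeCost = materializationCost({
  identities: 3000,
  roles: 10,
  permissionsPerRole: 20,
  resourcesPerPermission: 5,
});
console.log(roleChangeCost); // { reads: 3633000, writes: 4000 }
```

Because reads scale linearly in I while the per-identity term is fixed by R, P, and Res, halving the number of holders of a shared role halves the read storm, which is why role fan-out rather than tenant size is the trigger.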
Critical evidence immutability gap (HIGH for a security product):
- SHA-256 hash stored alongside mutable data in the same MongoDB collection
- Database admin can modify both content and hash in one operation
- No chain-of-custody linking evidence records
- No external trust anchor (Merkle tree, blockchain timestamp, signed receipt)
- Mitigation options (in order of trustworthiness): (1) Sigstore transparency log or Amazon QLDB, which give cryptographic proof that a log entry existed at a specific time, verifiable by third parties, not bypassable by a DBA; (2) S3 Object Lock (WORM), which is append-only at the storage layer, independent of application code; (3) append-only PostgreSQL trigger table, the weakest option because a DBA with DISABLE TRIGGER permission can bypass it; application-enforced immutability has the same trust level as MongoDB convention
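The chain-of-custody gap can be illustrated with a simple hash chain, where each entry commits to the previous one. This is a sketch under stated assumptions: ChainEntry, appendEntry, and verifyChain are hypothetical names, and a chain stored in the same database is still only tamper-evident if its latest entry hash is periodically published to one of the external anchors listed above.

```typescript
import { createHash } from "node:crypto";

interface ChainEntry {
  evidenceHash: string;  // SHA-256 of the evidence pack content
  prevEntryHash: string; // entryHash of the previous entry ("" for the first)
  entryHash: string;     // SHA-256 over prevEntryHash + evidenceHash
}

const sha256 = (data: string): string =>
  createHash("sha256").update(data).digest("hex");

function appendEntry(chain: ChainEntry[], evidenceContent: string): ChainEntry[] {
  const prevEntryHash = chain.length > 0 ? chain[chain.length - 1].entryHash : "";
  const evidenceHash = sha256(evidenceContent);
  const entryHash = sha256(prevEntryHash + evidenceHash);
  return [...chain, { evidenceHash, prevEntryHash, entryHash }];
}

// Recomputes every link; any edited or reordered entry breaks verification.
function verifyChain(chain: ChainEntry[]): boolean {
  let prev = "";
  for (const entry of chain) {
    if (entry.prevEntryHash !== prev) return false;
    if (entry.entryHash !== sha256(prev + entry.evidenceHash)) return false;
    prev = entry.entryHash;
  }
  return true;
}

let evidenceChain: ChainEntry[] = [];
evidenceChain = appendEntry(evidenceChain, "pack-1");
evidenceChain = appendEntry(evidenceChain, "pack-2");
console.log(verifyChain(evidenceChain)); // true

// Rewriting the first pack's hash invalidates the chain.
const tamperedChain = evidenceChain.map((e, i) =>
  i === 0 ? { ...e, evidenceHash: sha256("edited") } : e,
);
console.log(verifyChain(tamperedChain)); // false
```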
Kuzu as embedded Neo4j alternative:
- Kuzu is an embedded in-process OLAP graph database (like DuckDB for graphs) with OpenCypher support
- Zero additional infrastructure — embedded library, ~50MB binary addition
- Eliminates path materialization write amplification by computing paths at query time
- Would replace execution_paths + accessible_by embedded arrays entirely
- The StorageAdapter abstraction already enables this via a MongoKuzuStorageAdapter implementation
- 2026 maturity: Adequate for analytics workloads; not recommended as primary transactional store
- Verdict: Worth prototyping as an analytics layer over MongoDB for path queries (see Section 11 gray zone analysis)
Bi-temporal gap:
The platform tracks valid time (valid_at/expired_at) but transaction time is implicit. This matters for "did we know about this identity BEFORE the breach?" queries — a common compliance requirement.
9.2 API Runtime & Job Queue
Verdict: Replace in-memory queue with BullMQ + Redis. Upgrade Node 20 → 22. Express 5 for async error handling. The primary risk is OOM from the unbounded queue and lost jobs on restart — not event loop blocking or connection pool starvation. (Sequential awaits release connections between operations; pool saturation would require concurrent syncs, which the serial queue prevents.)
Critical Bugs Found
| Severity | Issue | File | Fix |
|---|---|---|---|
| HIGH | Async route handlers in Express 4 have no try/catch — unhandled rejection hangs request indefinitely | ingest.ts:160 | Add try/catch or express-async-errors |
| HIGH | In-memory queue: WorkerJob[] unbounded, lost on process restart | runtime.ts:26 | Bounded queue + job persistence |
| MEDIUM | Shutdown handler kills mid-flight jobs — 30s timeout + process.exit(1) can corrupt sync state | index.ts:103-138 | Drain queue before shutdown |
| MEDIUM | processedSyncIds Set is in-memory — lost on restart, re-processing risk | ingest-service.ts:14 | Persist to MongoDB. Note: the cross-tenant blocking concern requires an engineered UUID collision (1/2^122 probability for UUIDv4) — not a realistic attack vector; the in-memory/restart concern is the actual issue here. |
Architecture Decisions
Node.js version: Node 20 reaches end-of-life at the end of April 2026, within weeks of this review. Node 22 LTS is the correct version: a one-line change in Dockerfile:1,9. V8 12.4 improvements, no breaking changes for this stack.
Job queue: BullMQ is the right direction. A lower-complexity alternative: MongoDB-backed job persistence (write jobs to worker_jobs collection before acknowledging, recover on startup) avoids adding Redis as a new stateful dependency. Decision table:
| Approach | Durability | Ops Complexity | Recommended |
|---|---|---|---|
| Current (in-memory) | None | Minimal | No |
| MongoDB-backed job store | At-least-once | Zero (uses existing Mongo) | Yes — lower complexity |
| BullMQ + Redis | At-least-once + advanced features | Adds Redis service | Yes — when queue needs grow |
| Temporal.io | Exactly-once + saga | Very high | No (overkill for 3-step pipeline) |
Container memory: Increase from 512MB → 1GB. A large tenant sync (5MB JSON graph + entity arrays + path materialization) can push 300-400MB leaving insufficient headroom.
Express version: Express 4 → 5 migration is low-risk and fixes async error handling natively.
9.3 Frontend Stack
Verdict: Stack is correct. Two bugs require immediate fixes. React Compiler should be enabled. Strategic concern: Graph Explorer as a primary view may not match CISO workflow — Wiz and Orca both use graphs as drill-down from findings, not standalone pages.
Critical Bug: ELK.js Running on Main Thread
File: ui/src/components/graph/layout.ts:1
// CURRENT (wrong — blocks main thread):
import ELK from "elkjs/lib/elk.bundled.js";
// SHOULD BE (ADR-011 explicit requirement):
import ELK from "elkjs/lib/elk-worker.min.js";
ADR-011 explicitly requires the Web Worker variant. The bundled variant blocks the UI thread for 150-400ms at 200 nodes, 500ms-2s at 500 nodes. The spinner overlay misleads — the spinner may not even paint before the thread freezes. One-line fix.
Performance Ceilings
| Component | Safe | Warning | Breaking |
|---|---|---|---|
| @xyflow/react nodes | <200 (60fps) | 200-500 (30fps) | 500+ (<15fps) |
| ELK layout (Web Worker) | <100ms (<100 nodes) | 100ms-2s (100-500 nodes) | 2s+ (>500 nodes) |
| ELK layout (main thread — current) | <50ms | 50-500ms (jank) | 500ms+ (frozen) |
| Selection highlight re-render | <200 nodes | 200-500 (O(n) spread) | 500+ |
Additional Findings
- ELK.js not lazy-loaded: 1.4MB loaded on every page including Dashboard and Findings. Should be dynamic import — only GraphCanvas and MiniGraph need it.
- styledNodes memo defeated: GraphCanvas.tsx:102 creates new object references for all nodes on every selection change, defeating memo() on EntityNode. Fix: CSS class toggle instead of style spread.
- React Compiler: Enable via babel-plugin-react-compiler in vite.config.ts. Eliminates manual useMemo/useCallback overhead across 6+ graph components.
- Strategic: Graph Explorer as a primary UI view may not match CISO workflow. Wiz/Orca both use graphs as drill-down from findings, not standalone pages. Consider making Graph Explorer seed-anchored (always starts from a finding or entity) to cap graph size and align with CISO workflow.
9.4 Authentication Stack
Verdict: WorkOS + iron-session architecture is sound. Ship the new auth middleware. Critical hardening required: no instant session revocation, super-admin email-domain check is a security bug, logout is a no-op.
Security Bugs Found
| Severity | Issue | File | Fix |
|---|---|---|---|
| HIGH | Super-admin determined by email domain string match | auth.ts:76 | Use WorkOS Organization membership check |
| HIGH | logout() is a no-op — cookie cleared but WorkOS session NOT revoked | workos-provider.ts:74 | Call workos.userManagement.revokeSession() |
| HIGH | verifySession(), verifyApiKey(), verifyM2MToken() all return null | workos-provider.ts:78-92 | Three auth sources documented; only one implemented |
| MEDIUM | Sessions cannot be instantly revoked — stateless cookie survives deprovisioning | session.ts | Add sessions_revoked_at timestamp to user documents |
| MEDIUM | Rolling refresh not implemented despite being documented | session.ts:41-43 | Implement TTL extension on each request |
| MEDIUM | 7-day TTL inappropriate for a security platform | session.ts:18 | Reduce to 24h users / 8h super-admins |
| LOW | listActiveConnections called per-request for SSO-enforced tenants | workos-provider.ts | Cache with 60-second TTL per provider_org_id |
Architecture Assessment
WorkOS vendor selection (ADR-017) confirmed correct. Provider abstraction (AuthProvider interface) is well-designed — migration to Clerk or self-hosted is a backend-only change. Cloudflare Access as defense-in-depth is appropriate architecture.
iron-session assessment: Functions as a session ID into MongoDB (middleware hits DB on every request anyway). Consider formalizing: either accept the server-side lookup and add explicit revocation support, or move to short-lived JWTs (15 min) + refresh tokens for clean stateless/revocable semantics.
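The sessions_revoked_at fix from the table above reduces to one comparison per request. A minimal sketch, in Python for brevity (the middleware itself is TypeScript); the function name and parameters are hypothetical, and the middleware is assumed to already have the user document in hand.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def session_is_valid(
    issued_at: datetime,
    now: datetime,
    ttl: timedelta,
    sessions_revoked_at: Optional[datetime],
) -> bool:
    """Reject expired sessions and any session issued before the revocation mark."""
    if now - issued_at >= ttl:
        return False
    if sessions_revoked_at is not None and issued_at <= sessions_revoked_at:
        return False
    return True
```

Deprovisioning a user would then set sessions_revoked_at to the current time, which invalidates every cookie issued earlier without tracking individual sessions.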
9.5 Infrastructure & Deployment
Verdict: Python correct. Docker Compose correct for current scale. Two critical security issues: REQUIRE_AUTH default and SSH deploy key blast radius. Dead code needs cleanup.
Critical Security Issues
| Issue | Evidence | Fix |
|---|---|---|
| REQUIRE_AUTH defaults to false in deploy compose | docker-compose.deploy.yml:46 | Change default to true; fail loudly if not set |
| SSH deploy key grants Docker-group-equivalent-root to production | deployment.md:307-308 | Restrict deploy user via sudoers to specific compose commands only; remove Docker group membership |
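"Fail loudly if not set" for REQUIRE_AUTH could look like the following startup check. This is a sketch in Python for brevity (the platform is TypeScript); the function name is hypothetical, and accepting only the literal strings "true" and "false" is an assumption.

```python
from typing import Mapping


def resolve_require_auth(env: Mapping[str, str]) -> bool:
    """Return the REQUIRE_AUTH setting; refuse to start if it is unset or ambiguous."""
    raw = env.get("REQUIRE_AUTH")
    if raw is None:
        raise RuntimeError("REQUIRE_AUTH is not set; refusing to start")
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise RuntimeError(f"REQUIRE_AUTH must be 'true' or 'false', got {raw!r}")
```

Called once at boot with the process environment, this turns a silent insecure default into a startup failure.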
Operational Gaps
| Issue | Impact | Fix |
|---|---|---|
| No external monitoring after deployment | Outage invisible until customer reports it | Add UptimeRobot / Cloudflare health check |
| Backup never tested; no restore runbook | 6-hour RPO with no recovery confidence | Test restore once, document procedure |
| Same SSH key for dev and prod | Dev compromise = prod access | Separate keys, rotate quarterly |
| No /ready endpoint checking MongoDB | Health check misleads Docker | Add MongoDB connectivity check |
Infrastructure Gaps
| Issue | Impact | Fix |
|---|---|---|
| MongoDB without replica set: 6-hour RPO | Hardware failure = data loss up to last backup | Add replica set (even single-node for oplog) |
| Single-server SPOF: 100% downtime on host failure | SOC 2 story has single point of failure | Add second server with hot standby |
| Connectors require Python 3.11 at customer site | Support burden; installation friction | Ship as Docker images (docker run ghcr.io/sv0/sv0-aws:latest) |
Python Connectors: Validated
Python 3.11 + boto3 + msgraph-sdk is correct for I/O-bound batch API scanning workloads. The GIL is irrelevant (I/O-bound). Go/TypeScript/Rust offer no practical advantage for this workload. One improvement: add concurrent.futures.ThreadPoolExecutor to AWS region scanning loop for parallel region extraction (20-line change, not a language migration).
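A sketch of the suggested ThreadPoolExecutor change is below. The scan_region callable is a hypothetical stand-in for the connector's existing per-region extraction; the worker count is an assumption and would need tuning against AWS API throttling.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def scan_regions(
    regions: Iterable[str],
    scan_region: Callable[[str], List[dict]],
    max_workers: int = 8,
) -> Tuple[Dict[str, List[dict]], Dict[str, Exception]]:
    """Run scan_region for every region in parallel; collect results and failures."""
    results: Dict[str, List[dict]] = {}
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scan_region, region): region for region in regions}
        for future in as_completed(futures):
            region = futures[future]
            try:
                results[region] = future.result()
            except Exception as exc:  # keep scanning the other regions
                failures[region] = exc
    return results, failures
```

Returning failures separately lets the caller mark the scan as incomplete rather than ingesting a partial graph as if it were a full one.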
Dead Code Cleanup
Remove: docker-compose.prod.yml (legacy Certbot overlay), ui/nginx-ssl.conf (superseded by Caddy). Both create confusion about the active architecture.
Infrastructure Maturity Triggers
| Trigger | Action |
|---|---|
| First paying enterprise customer | 2-server MongoDB replica set, test restore, separate SSH keys |
| Contractual uptime SLA ≥99.95% | k3s or managed Kubernetes, Atlas managed MongoDB, Docker-based connectors |
10. High-Confidence Findings
| Priority | Finding | File |
|---|---|---|
| 1 | ELK.js running on main thread (should be Web Worker) | layout.ts:1 |
| 2 | In-memory job queue will lose data on restart — needs persistence | runtime.ts:26 |
| 3 | REQUIRE_AUTH=false is a critical default in deploy compose | docker-compose.deploy.yml:46 |
| 4 | SSH deploy key blast radius too broad | deployment.md:307-308 |
| 5 | Super-admin via email domain string match is a security bug | auth.ts:76 |
| 6 | Node 20 is EOL; upgrade to Node 22 | Dockerfile:1,9 |
| 7 | Logout is a no-op — WorkOS session not revoked | workos-provider.ts:74 |
| 8 | Evidence immutability: hash stored alongside mutable data | evidence_packs schema |
| 9 | Role fan-out write amplification is the real MongoDB scaling cliff | path-materializer.ts |
| 10 | Neo4j trigger threshold should be 5K, not 10K | ADR-001 |
11. Gray Zone Deep-Dive
11.1 Gray Zone 2: Data Model Universality — VERDICT: RIGHT-SIZED
Agent findings (deep code analysis across all 3 connectors + all 15 evaluator rules):
Verdict: The 10-entity-type model is NOT too universal. It is the correct abstraction level. Evidence:
- 3 connectors map to it with zero forced compromises (when model is followed correctly)
- All 15 evaluator rules operate against universal types and work across all connectors without cloud-specific branching
- The path materializer traverses the graph without cloud-specific conditionals
- Adding a 4th connector (GitHub, Okta, Salesforce) would be ~90% connector work, <10% platform work — no new entity types needed
Mapping fidelity by connector:
- Entra/ServiceNow: Clean 1:1 mapping. No compromises.
- AWS: Functional but carrying ADR-014 implementation debt (see below)
- Bedrock AI agents: Clean (workload subtype bedrock_agent, RUNS_AS IAM role, INVOKES Lambda action groups)
Two critical seams:
| Seam | Issue | Impact | Fix |
|---|---|---|---|
| ADR-014 implementation gap | AWS connector emits HAS_ROLE / nodeType: "role" for IAM Managed Policies instead of HAS_PERMISSION_SET / permission_set | Platform-side types already support permission_set. Path materializer traverses HAS_ROLE but NOT HAS_PERMISSION_SET → all AWS authority paths via managed policies are incorrect | Update AWS transformer line 499 + materializer to traverse HAS_PERMISSION_SET |
| Resource key migration | resource-key.ts is comprehensive for AWS but privilege_justification_gap needs resource_key on evidence records | CloudTrail evidence has no resource_key → rule returns false negatives until CloudTrail extractor is implemented | Populate resource_key on evidence during CloudTrail extraction |
privilege_justification_gap bug is implementation, NOT model: The resource-key.ts module correctly handles all AWS ARN formats (S3, Lambda, DynamoDB, SecretsManager, SSM, ECR, ECS, IAM, Bedrock, SNS, SFN, EventBridge, SQS). The matching logic is correct. The problem is that CloudTrail evidence records don't have resource_key populated yet (because CloudTrail extractor doesn't exist — F2).
MCP/AI agent model fit: Good. The existing model handles AI agents via ai_agent workload subtype. What's missing for pre-deployment preview is not an entity type but a behavioral distinction: "configured to invoke" (current INVOKES edge) vs. "has exercised" (needs runtime evidence). The graph already has the right edges; additional evaluator rules needed for authority preview.
_type_provisional: true was never implemented (searched codebase — zero occurrences). ADR-014 mentioned it as a migration strategy that was never built.
11.2 Gray Zone 1: Graph Alternatives to Neo4j — VERDICT: KUZU
Agent findings (deep research across 6 alternatives with full code context):
The team's hesitation about Neo4j is justified on operational grounds — and there is a better answer that avoids adding any new infrastructure.
Recommended: MongoDB + Kuzu (Embedded Analytics Layer)
Kuzu is an embedded in-process graph database (like DuckDB, but for graphs). Native Cypher support. Node.js/TypeScript bindings. MIT licensed. Zero additional infrastructure — runs inside the sv0-platform process.
Why Kuzu is the right answer for SecurityV0:
MongoDB (14 collections)    Source of truth: entities, versions, events,
        |                   findings, evidence packs, temporal history
  sync completes
        |
Kuzu (in-process)           Graph projection: nodes + typed edges
        |                   for path traversal queries
        |
  Cypher queries
   /    |    \
blast  subgraph  chain
radius explore   assembly
Kuzu replaces the 3 most problematic application-level BFS implementations:
- path-materializer.ts:computePaths() (recursive MongoDB-per-hop query storm) → single Cypher query
- chain-builder.ts:bfsCollectChain() → Cypher MATCH (w)-[*1..5]->(r)
- subgraph-adapter.ts:neighborhoodBFS() → Cypher MATCH (n)-[*1..2]-(m) WHERE n.id = $seed
This eliminates:
- The BFS document limit bug (subgraph-adapter.ts:158, no .limit())
- Unbounded frontier growth in high-degree nodes
- Stale execution paths when role GRANTS change (Kuzu recomputes at query time)
- The write amplification problem (no accessible_by arrays to maintain)
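For readers less familiar with variable-length Cypher patterns, the neighborhood pattern MATCH (n)-[*1..2]-(m) corresponds roughly to a depth-bounded, undirected BFS from the seed. The sketch below shows that traversal in plain Python with an explicit node cap, the bound the current BFS is reported to lack. The function name, the edge-list input, and the cap value are illustrative assumptions, not the project's code, and Cypher's exact path semantics differ in edge cases such as cycles back to the seed.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def neighborhood(
    edges: Iterable[Tuple[str, str]],
    seed: str,
    max_hops: int = 2,
    max_nodes: int = 500,
) -> Set[str]:
    """Nodes within max_hops of seed over undirected edges, capped at max_nodes."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {seed}
    found: Set[str] = set()
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in sorted(adjacency.get(node, ())):
            if nxt in seen:
                continue
            if len(found) >= max_nodes:
                return found  # explicit cap: the frontier cannot grow unbounded
            seen.add(nxt)
            found.add(nxt)
            frontier.append((nxt, depth + 1))
    return found
```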
Migration is incremental — zero storage risk:
- Start: Kuzu handles getSubgraph() queries only (replace SubgraphAdapter BFS)
- Next: Kuzu generates execution_paths[] instead of path-materializer.ts
- Then: Kuzu handles chain assembly (chain-builder.ts)
- Later: Temporal graph queries via historical entity snapshots loaded into temporary Kuzu instance
The StorageAdapter interface does NOT change. All 60 methods stay as-is. MongoDB schema unchanged. Evaluator rules unchanged. Connector interface unchanged.
Alternative Comparison
| Option | Fit Score | Key Verdict |
|---|---|---|
| Kuzu (embedded) | 8/10 | Best fit. Zero infra, native Cypher, MIT license, incremental migration |
| XTDB v2 | 6/10 | Excellent bi-temporal but NO graph traversal; JVM service required |
| TerminusDB | 5/10 | Git-like immutability interesting but Prolog query language + project risk |
| FalkorDB | 4/10 | Fastest BFS but Redis AOF persistence = disqualifying for evidence-grade requirements |
| TypeDB | 5/10 | Inference rules compelling but TypeQL proprietary, JVM service, no temporal |
| Memgraph | 5/10 | Neo4j-like quality but BSL license + same operational cost as Neo4j |
When Kuzu stops being sufficient (trigger for Neo4j or Memgraph):
- 50K+ entities where graph rebuild time exceeds acceptable sync latency
- Multi-process/multi-service need to query the same graph (Kuzu is in-process only)
- Real-time graph mutations needed (Kuzu is batch-rebuild-oriented)
- Geographic distribution requirements
On the bi-temporal gap: SecurityV0 already has a working bi-temporal model (entity_versions with valid_at/expired_at + events with transaction timestamps). XTDB's native bi-temporal is elegant but solves a problem that is already solved adequately. For historical graph queries, the right approach is: load historical entity snapshots from entity_versions into a temporary Kuzu instance and traverse that. This is the "git checkout past commit" pattern.
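The snapshot-selection step of that "git checkout past commit" pattern can be sketched as follows. Only valid_at and expired_at come from the text; the entity_id and data field names, and the half-open interval [valid_at, expired_at), are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, Iterable, Optional, TypedDict


class EntityVersion(TypedDict):
    entity_id: str
    valid_at: datetime
    expired_at: Optional[datetime]  # None means still current
    data: dict


def snapshot_at(versions: Iterable[EntityVersion], as_of: datetime) -> Dict[str, dict]:
    """Return entity_id -> data for the version of each entity valid at as_of."""
    snapshot: Dict[str, dict] = {}
    for v in versions:
        started = v["valid_at"] <= as_of
        not_ended = v["expired_at"] is None or as_of < v["expired_at"]
        if started and not_ended:
            snapshot[v["entity_id"]] = v["data"]
    return snapshot
```

The resulting dictionary is what would be loaded into the temporary graph instance before running traversal queries against the historical state.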
On evidence immutability: Neither Kuzu nor any graph DB solves the evidence hash-colocation problem. This must be solved separately — and PostgreSQL triggers are the weakest option because a DBA with DISABLE TRIGGER permission can bypass them. Prefer Sigstore transparency logs, Amazon QLDB, or S3 Object Lock (WORM) for genuinely tamper-evident storage. These are independent of which graph layer is chosen.
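The hash-colocation problem is easy to demonstrate. In the sketch below a dict stands in for an evidence pack document and a plain variable stands in for whichever write-once store is chosen; the record layout and function names are hypothetical.

```python
import hashlib
import json


def digest(content: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the evidence content."""
    encoded = json.dumps(content, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def verify_colocated(record: dict) -> bool:
    """Check content against the hash stored in the same record."""
    return digest(record["content"]) == record["sha256"]


def verify_anchored(record: dict, anchored_hash: str) -> bool:
    """Check content against a hash held in a separate write-once store."""
    return digest(record["content"]) == anchored_hash


original = {"finding": "orphaned_identity", "severity": "high"}
record = {"content": dict(original), "sha256": digest(original)}
anchored = record["sha256"]  # copy written once to the external store

# An attacker with write access rewrites the content and the co-located hash.
record["content"]["severity"] = "low"
record["sha256"] = digest(record["content"])

colocated_ok = verify_colocated(record)          # tampering goes unnoticed
anchored_ok = verify_anchored(record, anchored)  # tampering is detected
```

The co-located check still passes after tampering, while the check against the externally held hash fails, which is the property an auditor needs.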
11.3 Gray Zone 3: Connector Depth + Rate Limiting
Part A: Metadata-Only Scanning — VERDICT: RIGHT STRATEGY FOR V1
What metadata scanning concretely delivers (from all 15 evaluator rules):
- Ownership governance (orphaned, degraded, drifted, ambiguous, unknown)
- Authority hygiene (dormant, scope drift, reachability drift, privilege justification gap)
- Identity binding (unproven execution, unknown binding, unresolved cross-system auth)
- Egress/data flow (LLM egress, external egress, reachable sensitive domain)
This is authorization graph analysis with temporal drift detection — a capability combination that existing tools address only partially or not at all.
What metadata misses: Hardcoded secrets in code, injection vulnerabilities, dependency CVEs, logic vulnerabilities, runtime behavioral anomalies, CSPM-style resource misconfiguration checks (S3 bucket ACLs vs. CIS benchmarks).
Code analysis path (additive, not a redesign): The NormalizedGraph schema already accommodates it. A sv0-code-scanner connector would: (1) fetch code artifacts linked to known entities, (2) run lightweight checks (regex for secrets, SBOM extraction, trufflehog), (3) emit NormalizedGraph additions. This is additive — no connector architecture change needed. The ServiceNow connector already parses script bodies (analyze_script_mutations(), analyze_script_queries()).
Verdict: Metadata-only is fully defensible for V1. Strategic risk is customers expecting CSPM-style findings alongside the authorization graph — that's a breadth gap where CSPM-first tools have an advantage.
Part B: Rate Limiting — CRITICAL FINDINGS
ServiceNow 429 bug is a data integrity crisis, not a UX issue:
The break at servicenow_client.py:421 on any non-200 response causes silent partial data ingestion. Blast radius:
- Scan returns 400/2000 records as if complete
- Downstream evaluator computes massive phantom ownership_drift and scope_drift (entities "disappeared")
- Phantom-truncated scan becomes the new baseline; subsequent full scans show phantom "new" entities
- Temporal drift detection becomes unreliable
This is not a "fix later" issue. This calcifies baselines. Every scan run with this bug creates corrupted baselines that compound. Must fix before production.
Fix — ServiceNow cursor resume on 429:
Note: urllib3 retry logic at the adapter level handles transient TCP/TLS failures before the pagination loop sees a status code. The bug is what happens after urllib3 retries are exhausted: the 429 bubbles up to application code and the break at line 421 exits the pagination loop without resuming the cursor. The fix is at the application level, not the adapter level:
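# Inside the pagination loop; requires `import random` and `import time` at module top.
if response.status_code == 429:
    if retry_count >= max_retries:
        raise ConnectorError(
            f"ServiceNow rate limit exceeded after {max_retries} retries; "
            f"pagination cursor at offset {offset}"
        )
    # Retry-After may be missing or an HTTP-date; treat anything non-numeric as 0
    header = response.headers.get("Retry-After", "0")
    retry_after = min(int(header), 300) if header.isdigit() else 0  # cap at 5 min
    # Full jitter on the exponential backoff, but never below the server's Retry-After
    wait_time = max(retry_after, random.uniform(0, 2 ** retry_count))
    time.sleep(wait_time)
    retry_count += 1
    continue  # NOT break: retry the SAME offset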
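To show how that branch sits inside the pagination loop, here is a self-contained sketch. The fetch_page callable, the Page tuple shape, and the injected sleep parameter are hypothetical stand-ins for the real client so that the loop can be exercised without a ServiceNow instance.

```python
import random
import time
from typing import Callable, Dict, List, Tuple

Page = Tuple[int, Dict[str, str], List[dict]]  # (status, headers, records)


class ConnectorError(Exception):
    pass


def fetch_all(
    fetch_page: Callable[[int, int], Page],
    page_size: int = 100,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Page through a table; on 429 retry the SAME offset, never return partial data."""
    records: List[dict] = []
    offset, retry_count = 0, 0
    while True:
        status, headers, batch = fetch_page(offset, page_size)
        if status == 429:
            if retry_count >= max_retries:
                raise ConnectorError(f"rate limited at offset {offset}")
            header = headers.get("Retry-After", "0")
            retry_after = min(int(header), 300) if header.isdigit() else 0
            sleep(max(retry_after, random.uniform(0, 2 ** retry_count)))
            retry_count += 1
            continue
        if status != 200:
            raise ConnectorError(f"HTTP {status} at offset {offset}")
        records.extend(batch)
        if len(batch) < page_size:
            return records
        offset, retry_count = offset + page_size, 0
```

The key property is that every exit path either returns the complete record set or raises; there is no path that returns a truncated list as if it were complete.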
Fix — AWS full jitter (1 line):
# Line 276 of aws_client.py — replace:
wait_time = 2**retry_count
# With (AWS Architecture Blog "full jitter" pattern):
wait_time = random.uniform(0, 2**retry_count)
Recommended rate limiting architecture (current stage):
- Per-connector AdaptiveTokenBucket per API endpoint
- Respects Retry-After headers (ServiceNow, Azure Graph both send these)
- Full jitter on all retry delays
- rateLimitConfig in connector contract (05-connectors.md:122-126) is the right interface: configure max RPS per connector
- Later: Redis-backed cross-tenant quota tracker when concurrent multi-tenant scans are needed
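One possible shape for the AdaptiveTokenBucket named above is sketched here. Only the class name comes from the text; the constructor parameters, the halve-on-429 and 10%-recovery policy, and the injectable clock are assumptions chosen to keep the example small and testable.

```python
import time
from typing import Callable


class AdaptiveTokenBucket:
    """Token bucket whose refill rate drops on throttling and recovers on success."""

    def __init__(self, max_rps: float, burst: int = 1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_rps = max_rps
        self.rate = max_rps
        self.capacity = float(burst)
        self.tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; the caller waits and retries otherwise."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def on_throttled(self) -> None:
        """Called on HTTP 429: halve the rate, with a small floor."""
        self._refill()
        self.rate = max(self.rate / 2, 0.1)

    def on_success(self) -> None:
        """Called on HTTP 200: recover slowly toward the configured max RPS."""
        self._refill()
        self.rate = min(self.max_rps, self.rate * 1.1)
```

The max_rps argument is where a per-connector rateLimitConfig value would plug in.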
Reversibility assessment:
| Decision | Reversible? | Notes |
|---|---|---|
| Metadata-only scanning | Fully reversible | Code analysis connectors are additive |
| ServiceNow break-on-429 | Calcifying | Fix before any production customer |
| AWS no-jitter | Easily reversible | 1-line fix, low complexity |
| No global quota tracker | Reversible but expensive later | Design interface now, implement when multi-tenant orchestration built |
11.4 MCP Blocker: AI Agent Pre-Deployment PII Access Graph
Agent findings (deep architecture review of 12-deployment-approval.md + full codebase analysis):
What the architecture team already knows well
The design docs are thorough: three modes (post-deploy detection, pre-deploy preview, deployment gate) are correctly separated. Five approaches were evaluated. Platform capabilities inventory is accurate.
Hard unsolved problems (not yet designed)
| Problem | Why Hard |
|---|---|
| Graph projection algorithm | "Run materializer on projected state" is stated but the how is not designed |
| Path-level diff engine | Current diff-engine.ts diffs EntityDoc only — no AuthorityPathDoc comparison |
| Cross-connector entity correlation | Prerequisite for cross-system authority chains (Entra→SN→HR DB) — not yet built |
| MCP tool-to-data-domain mapping | Tool declarations are free text blackboxes — classification is unsolved |
| data_domain as first-class entity type | Not yet in the entity model |
MCP Opacity Mitigation (layered, honest approach)
The fundamental problem: MCP tool declarations (tools/list) show name + description + input schema. They don't reveal what databases the tool queries, what data it returns, or what its blast radius is.
Recommended mitigation layers:
| Layer | What It Provides | Evidence Grade | Build Now? |
|---|---|---|---|
| 1 — Identity-bounded authority | The identity's IAM permissions ARE the worst-case blast radius | C (inferred) | Yes — already modeled |
| 2 — Manifest-declared intent | Parse mcp.json for env vars, tool names, resource URIs | C (inferred) | Yes — build now |
| 3 — Tool description parsing | NLP/regex on tool descriptions for domain hints | C (inferred) | Caution — conflicts with "no ML/heuristics" policy |
| 4 — Runtime observation | Actual network/DB calls after deployment | A (proven) | Future |
Honest framing for clients: "We show you the identity's authority boundary. The tool may exercise all, some, or none of that authority. The boundary is the worst case."
Graph Projection Algorithm: Recommended Design
Rejected options:
- Option A (clone to MongoDB + materializer): Write amplification, cleanup complexity, persistence risk
- Option C (what-if tenant namespace): Cross-tenant reference failures, tenant semantics broken
Recommended: In-memory ProjectionStorageAdapter
mcp.json
↓ MCP manifest parser
NormalizedGraph (mcp_server, mcp_tool, identity nodes)
↓ graph-transformer.ts (existing)
EntityDoc[] (projected entities)
↓ inject into
ProjectionStorageAdapter (Map<string, EntityDoc> backed)
↑ seeded from real tenant subgraph via getSubgraph(identity, depth=3)
↓ materializeExecutionPaths() + materializeAuthorityPaths() (unchanged)
projected AuthorityPathDoc[]
↓ evaluateSinglePath() (unchanged)
ProjectedFindingCandidate[]
↓ diff against current MongoDB authority paths
Authority Delta: new/removed/changed paths + new sensitive domains reached
This works because the StorageAdapter interface is already the abstraction boundary. A ProjectionStorageAdapter implementing ~8-10 methods (getEntity, upsertEntity, queryEntities, getEntitiesByIds, queryAuthorityPaths, upsertAuthorityPaths, markAuthorityPathsRemoved, countAuthorityPaths) runs the entire materializer + evaluator pipeline with zero MongoDB writes.
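A reduced sketch of the in-memory adapter idea follows, in Python for brevity (the real StorageAdapter interface is TypeScript). The snake_case methods mirror a subset of the names listed above; the _id key and dict-shaped documents are assumptions.

```python
from typing import Dict, List, Optional


class ProjectionStorageAdapter:
    """In-memory overlay: seeded from a real subgraph, writes never leave memory."""

    def __init__(self, seed_entities: List[dict]) -> None:
        # Copy the seed so the projection cannot mutate the caller's documents.
        self._entities: Dict[str, dict] = {e["_id"]: dict(e) for e in seed_entities}
        self._authority_paths: Dict[str, dict] = {}

    def get_entity(self, entity_id: str) -> Optional[dict]:
        return self._entities.get(entity_id)

    def upsert_entity(self, entity: dict) -> None:
        self._entities[entity["_id"]] = dict(entity)

    def get_entities_by_ids(self, ids: List[str]) -> List[dict]:
        return [self._entities[i] for i in ids if i in self._entities]

    def upsert_authority_paths(self, paths: List[dict]) -> None:
        for path in paths:
            self._authority_paths[path["_id"]] = dict(path)

    def count_authority_paths(self) -> int:
        return len(self._authority_paths)
```

Because the adapter is discarded after the preview request, there is no cleanup step and no risk of projected entities leaking into tenant data.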
New Entity Types & Schema
Add to entity types: mcp_tool, data_domain
Add to edge types: DECLARES_TOOL (mcp_server → mcp_tool), ACCESSES (mcp_tool → data_domain), PROJECTED_FROM (projected entity → manifest source)
Add to workload subtypes: mcp_server (already has ai_agent, bedrock_agent)
Data domain classification (3 tiers, in priority order):
- Tier 1 (resource name pattern matching; deterministic, build now): hr.*|employee.* → domain: "hr", sensitivity: "confidential". This is consistent with the "no ML/heuristics" policy because it is a curated registry.
- Tier 2 (operator tagging via API/UI): Security team manually classifies resources. Stored as data_domain entities with ACCESSES relationships.
- Tier 3 (tool description NLP): Skip for now; conflicts with determinism policy.
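Tier 1 can be sketched as an ordered registry of compiled patterns. Only the hr row comes from the text above; the finance row, the lowercase normalization, and the first-match-wins rule are assumptions added for illustration.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class DataDomain(NamedTuple):
    domain: str
    sensitivity: str


# Curated registry, checked in order; first match wins. Only the "hr" row comes
# from the report; the "finance" row is a made-up illustration.
REGISTRY: List[Tuple[re.Pattern, DataDomain]] = [
    (re.compile(r"hr.*|employee.*"), DataDomain("hr", "confidential")),
    (re.compile(r"finance.*|invoice.*"), DataDomain("finance", "confidential")),
]


def classify(resource_name: str) -> Optional[DataDomain]:
    """Return the first registry entry whose pattern matches the start of the name."""
    name = resource_name.lower()
    for pattern, domain in REGISTRY:
        if pattern.match(name):
            return domain
    return None
```

Bare prefixes such as hr.* will also match unrelated names that happen to start with "hr", so a production registry would likely want delimiter-aware patterns.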
Evidence Grading for Projected State
All projected paths carry: claim_type: "capability_inferred", evidence_strength: "inferred" (weakest grade, rank 3). In the UI: dashed edges, "PROJECTED" badge, distinct color. Projected findings do NOT count toward active posture score — advisory only.
Post-deployment upgrade path: projected → structural (after first scan confirms configuration) → correlated (after execution evidence accumulates) → deterministic (proven in production).
Approval Record Schema (minimal)
interface DeploymentPreviewRequestDoc {
  _id: string; tenant_id: string;
  requested_by: string; requested_at: Date;
  source_type: "mcp_manifest" | "cloudformation" | "arm_template";
  source_manifest?: Record<string, unknown>;
  projected_paths: {
    new_paths: number; new_sensitive_paths: number;
    new_domains_reached: string[]; // e.g., ["hr", "finance"]
  };
  projected_findings: ProjectedFindingSummary[];
  projected_authority_paths: AuthorityPathDoc[];
  status: "pending" | "approved" | "rejected" | "expired";
  reviewed_by?: string; reviewed_at?: Date; review_notes?: string;
  conditions?: string[];
  // Post-deployment accuracy tracking
  projection_accuracy?: {
    paths_matched: number; paths_unexpected: number; paths_missing: number;
  };
}
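The projection_accuracy counters could be filled in by a set comparison once the first post-deployment scan completes. The sketch below assumes each authority path has a stable identifier, reads "unexpected" as observed but not projected, and reads "missing" as projected but not observed; none of these interpretations is confirmed by the schema itself.

```python
from typing import Dict, Iterable


def projection_accuracy(projected: Iterable[str], observed: Iterable[str]) -> Dict[str, int]:
    """Compare projected path identifiers with those found by the first real scan."""
    projected_set, observed_set = set(projected), set(observed)
    return {
        "paths_matched": len(projected_set & observed_set),
        "paths_unexpected": len(observed_set - projected_set),
        "paths_missing": len(projected_set - observed_set),
    }
```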
Closest Analogues and Gaps
| Tool | What It Does | Gap |
|---|---|---|
| OAuth consent screens | Shows flat permission list | No authority graph, no cross-system chains |
| terraform plan | Projects infrastructure state | No authority implications of infra changes |
| AWS IAM Access Analyzer | Checks single policy for public access | Not a graph, not pre-deployment, not cross-system |
| Microsoft Agent Governance Toolkit | Runtime policy enforcement | No pre-deployment preview, no authority graph |
| Wiz AI-SPM | Cloud security posture for AI | Runtime/post-deployment only, no authority graph |
Delivery Sequence
| Phase | Component | Output |
|---|---|---|
| 1 | In-memory ProjectionStorageAdapter (~8-10 methods) | Foundation for all projection |
| 2 | MCP manifest parser → NormalizedGraph | mcp.json input accepted |
| 3 | POST /api/v1/deployment/preview endpoint | Working projection pipeline |
| 4 | Approval record schema + PATCH endpoint | Approve/reject workflow |
| 5 | Resource-name data domain classifier (Tier 1) | PII domain detection |
After initial delivery: Path-level diff engine + full graph snapshot (prerequisite for "did reality match projection?")
Genuine hard problems not in scope yet: Cross-connector entity correlation, CloudFormation/ARM/Terraform parsers, multi-environment tenant model, what-if simulation UI.
Biggest implementation risk: ProjectionStorageAdapter must handle edge cases in the materializer (circuit breakers, deletion thresholds, AP_REMOVAL_THRESHOLD safety net). Medium risk — methods are well-defined but materializer edge cases will surface during integration testing.
12. Updated Master Priority Table
| Pri | Category | Issue | Severity |
|---|---|---|---|
| 1 | Security | Cross-tenant IDOR via REQUIRE_AUTH bypass | Ship blocker |
| 2 | Security | REQUIRE_AUTH defaults to false in deploy compose | P0 |
| 2a | Security | verifyM2MToken() returns null — Bearer-token M2M auth is completely unenforced | CRITICAL |
| 3 | Security | DevAuthProvider: no production gate | Ship blocker |
| 4 | Security | SSH deploy key → Docker group = root access | P0 |
| 5 | Security | Super-admin via email domain string (not org membership) | HIGH |
| 6 | AWS Connector | CloudTrail extractor not implemented | Ship blocker |
| 7 | AWS Connector | Assumed-role ARN parsing broken (80-90% events) | Ship blocker |
| 8 | AWS Connector | privilege_justification_gap always 0 on AWS | Ship blocker |
| 9 | AWS Platform | permission_set materializer not updated | Ship blocker |
| 10 | Frontend | ELK.js running on main thread (not Web Worker) | HIGH |
| 11 | Runtime | Async route handler no try/catch — hangs request | HIGH |
| 12 | Runtime | Job queue unbounded + no persistence = data loss | HIGH |
| 13 | Auth | Logout is a no-op (WorkOS session not revoked) | HIGH |
| 14 | Node.js | Node 20 EOL → upgrade to Node 22 | HIGH |
| 15 | Graph DB | BFS reverse lookup: no document limit | Pre-scale blocker |
| 16 | Connector | ServiceNow pagination: break on 429 — corrupts baselines | Ship blocker |
| 17 | Auth | 7-day session TTL; rolling refresh not implemented | MEDIUM |
| 18 | Auth | Iron-session: no instant revocation on deprovisioning | MEDIUM |
| 19 | Infra | No external monitoring; backup untested | P1 |
| 20 | Graph DB | Stale paths on role GRANTS change | Correctness gap |
| 21 | Graph DB | MAX_AUTH_CHAIN_DEPTH=1 — 3-system chains missed | Feature gap |
| 22 | Evidence | Immutability: hash stored in mutable collection | Compliance gap |
| 23 | Frontend | ELK.js not lazy-loaded (1.4MB on every page) | MEDIUM |
| 24 | Graph DB | Neo4j trigger lower to 5K; monitor role fan-out | Planning |
| 25 | MCP Feature | mcp_tool, manifest parser, graph projection | Feature — not yet built |
13. Infrastructure: Docker Compose Is a Dead End
Docker Compose is a development and single-host orchestration tool. SecurityV0 runs it in production for both app.securityv0.com and dev.securityv0.com. This is a structural ceiling — not a configuration gap, an architectural one.
What Docker Compose cannot do:
| Capability | Docker Compose | Required for Scale |
|---|---|---|
| Horizontal scaling (multiple hosts) | No — single host only | Yes, for any cell model |
| Rolling deployments | No — up restarts all containers, causing downtime | Yes, for zero-downtime deploys |
| Health-based routing | No — failed containers removed from routing manually | Yes, for resilience |
| Cross-node service discovery | No | Yes, for cell provisioning |
| Autoscaling | No | Yes, for variable sync load |
| Resource enforcement | Soft limits only | Yes, for noisy-neighbor isolation |
| Secret management | .env files on disk | Yes, must use vault |
The consequence for cell architecture: Cell provisioning automation — the core operational requirement for cells — is impossible on Docker Compose. "Provisioning a new cell" on Docker Compose means SSH-ing into a server and running docker compose up manually. This defeats the purpose.
The right migration path:
Current: Docker Compose (CPX21 Hetzner, single host)
↓
Step 1: k3s on Hetzner
Single-node Kubernetes — identical Hetzner hardware, same Docker images
No operational cost increase; enables everything below
↓
Step 2: Helm charts per service
Parameterized deployment: one Helm chart = one cell
Rolling deployments, health checks, resource quotas — free
↓
Step 3: Cell provisioning via Helm (when triggered by scale)
`helm install cell-eu-02 ./charts/sv0-cell --set tenants=...`
New cell live in 15 minutes, zero downtime for existing cells
k3s is the correct migration path: same Hetzner infrastructure, same container images, same Docker workflows for developers — production-grade runtime that enables the full cell model when needed.
Note on scope: The async route handler bug, unbounded job queue, and ELK.js Web Worker issue are independent code bugs — Docker Compose did not cause them and k3s would not fix them. The argument for migrating is forward-looking: Docker Compose cannot support rolling deployments, multi-host scaling, or cell provisioning automation, all of which become necessary as the platform grows. Fix the bugs separately; migrate the runtime to unlock the scaling model.
14. Cell Architecture vs. Current Architecture — Full Comparison
What Cell Architecture Means for SecurityV0
A cell is a complete, independently deployed replica of the platform stack, permanently assigned a bounded set of tenants, such that the failure or resource exhaustion of any component in that cell has zero runtime effect on any other cell.
One SecurityV0 cell contains:
┌─────────────────────────────────────────────────────────────┐
│ CELL A (tenants T001, T047, T203 ... T035) │
│ │
│ Express API pods (3×) ─── Redis ─── BFS Workers (4×) │
│ │ │
│ MongoDB Replica Set │
│ (IAM graphs, findings, BFS paths │
│ for THIS cell's tenants ONLY) │
└─────────────────────────────────────────────────────────────┘
CONTROL PLANE (global, not a cell):
Cell Router │ Auth Service │ Billing │ Tenant Registry
Maps tenant_id → cell. Never holds IAM graph data.
Must NOT be in the hot path — cells operate independently
if control plane goes down.
Connectors use transparent routing: they always call api.securityv0.com. The cell router maps tenant_id → cell from a cached registry and proxies the request. Connectors never need to know which cell they're in — no reconfiguration when cells are added or tenants migrated.
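The cached tenant_id → cell lookup can be sketched as follows. The CellRouter class, the lookup callable standing in for the tenant registry, the 60-second TTL, and any cell URLs used with it are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CellRouter:
    """Resolve tenant_id -> cell base URL from a registry, with a short-lived cache."""

    def __init__(self, lookup: Callable[[str], Optional[str]], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup  # stand-in for the tenant registry query
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, tenant_id: str) -> str:
        now = self._clock()
        cached = self._cache.get(tenant_id)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]
        cell = self._lookup(tenant_id)
        if cell is None:
            raise KeyError(f"unknown tenant {tenant_id}")
        self._cache[tenant_id] = (cell, now)
        return cell
```

With a short TTL, a tenant migration becomes visible to connectors within one cache period and requires no change on the connector side.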
Scalability Comparison
Architecture A (Current) — Binding Constraints (from code analysis):
Critical correction from code audit: The production worker runtime is NOT BullMQ. It is a plain JavaScript array (private readonly queue: WorkerJob[] = [] at runtime.ts:26) inside the API process, draining one job at a time, sequentially. There is no separate worker process. BullMQ exists in documentation, not in the running code.
A full tenant sync cycle is 3 sequential jobs: sync_ingestion → evaluate_findings → build_evidence_pack.
For a medium tenant (5,000 entities): 85–240 seconds total.
| Tenants | Sync frequency | Worker queue drain time | Outcome |
|---|---|---|---|
| 5 | Hourly | ~22 min | Drains before next sync |
| 10 | Hourly | ~45 min | Queue backs up permanently |
| 35 | Daily | ~105–180 min | Barely drains before next daily window |
| 50 | Daily | ~225–375 min | Queue never empties |
| 500 | Daily | 37+ hours | Architecture collapses |
Sequential breaking points (in order they bite):
- Worker queue saturation: ~10 tenants (hourly) / 35 tenants (daily)
- MongoDB working set overflow: ~60–70 tenants × 5K entities (WiredTiger cache is 256MB from --wiredTigerCacheSizeGB 0.25 in compose; total working set exceeds it at low tenant counts)
- Node.js OOM: one 50K-entity sync calls queryEntities(limit:0), loads 500MB of entity docs into 512MB container; immediate OOM kill; all tenants dark
- Express latency degradation: ~100+ tenants with concurrent dashboard load
Architecture A single-event total-outage scenario: One enterprise customer runs a 50K-entity sync during business hours. Path materialization triggers 3.2M sequential MongoDB reads (~9 hours). Evaluator calls queryEntities(limit:0) on 50K entities → ~500MB heap → OOM kill. Container restarts. The stalled sync is permanently stuck at "running" in MongoDB. All other tenant syncs are blocked for the duration. No alerting fires — the process crash is not surfaced as a sync failure. All tenants on the platform go dark.
Architecture B (Cell) — Scaling Characteristics:
- Cell capacity: 25–35 tenants per cell (MongoDB M20, 4 parallel workers)
- New cell provisioning: 12–18 minutes, zero downtime for other cells
- Same 50K-entity OOM scenario in Cell B: one cell degraded, 25–35 tenants affected, all other cells continue normally
- Vertical scaling: eliminated — add cells, not bigger servers
- Geographic cells: US customers on US cell (<20ms RTT vs 120–220ms from Nuremberg); APAC (<30ms vs 350ms)
APAC dashboard latency on Architecture A (350ms per interaction) crosses the threshold where users perceive the product as slow. Regional cells eliminate this entirely.
| Metric | Architecture A | Architecture B |
|---|---|---|
| Daily sync saturation | 35 tenants | 25–35 per cell, unlimited cells |
| Hourly sync saturation | 10 tenants | 25–35 per cell |
| Total-outage trigger | One 50K-entity sync | Atlas AZ outage (30–60s failover) |
| Noisy tenant blast radius | All tenants on platform | 25–35 tenants in one cell |
| Scaling action downtime | 5–15 min (vertical resize) | Zero (new cell) |
| APAC dashboard RTT | 350ms | <30ms (APAC cell) |
Security Comparison
Security scorecard:
| Attack Vector | Architecture A | Architecture B |
|---|---|---|
| Tenant data isolation | CRITICAL — tenant_id field only; one missing filter exposes all tenant data | LOW — per-cell MongoDB; missing filter leaks within 25–35 tenant cell only |
| Noisy tenant / resource exhaustion | HIGH — one tenant starves all; no per-tenant limits at any layer | LOW inter-cell / MEDIUM intra-cell |
| Auth bypass blast radius | CRITICAL — REQUIRE_AUTH=false exposes 100% of tenants simultaneously | HIGH — one cell exposed; others protected by independent auth |
| Cross-tenant IDOR | CRITICAL — MongoDB ObjectIDs from shared DB are time-ordered and estimable; one bug in any of 50+ query paths leaks cross-tenant | LOW — ObjectIDs from other cells do not exist in this cell's DB; physically absent, not just filtered |
| Session compromise blast radius | CRITICAL — stolen @securityv0.com admin session has 7-day unrestricted access to all tenants; no revocation | HIGH — control plane admin / MEDIUM — cell-scoped |
| Database breach blast radius | CRITICAL — one MongoDB breach delivers full IAM graph of every customer; complete cloud attack kit | HIGH per-cell / LOW platform-wide — independent credentials per cell |
| Connector push forgery | HIGH — verifyM2MToken() returns null; tenant_id in payload is attacker-controlled | MEDIUM — cell URL discovery required; wrong-cell push rejected at routing layer |
| Super-admin escalation | CRITICAL — email domain string match for all @securityv0.com accounts; no revocation | HIGH — same fragility, bounded blast radius |
| Compliance (SOC 2 Type II) | Blocked — CC6.1 (logical access only), CC6.3 (no revocation), IdP stubs | Achievable with remaining auth work |
| Compliance (FedRAMP Moderate) | Explicitly blocked — SC-4 requires DB-level isolation; field-level filtering fails this control | Eligible path — single-tenant government cells satisfy SC-4 |
| GDPR / Data Residency | High risk — EU tenant data co-mingles with US tenant data at storage layer | Strong — EU cell on EU infrastructure; no cross-jurisdiction data residency risk |
Six security fixes required regardless of architecture choice (Architecture B reduces blast radius but does not fix these):
- verifySession(), verifyApiKey(), verifyM2MToken() returning null — this is an active auth bypass on those paths, not a stub
- REQUIRE_AUTH=false as default in docker-compose.deploy.yml — must be inverted; opt-out for dev, not opt-in for prod
- Iron-session server-side revocation — Redis-backed session store with immediate invalidation capability
- Super-admin email domain check — replace with explicit RBAC membership from WorkOS org claims + user ID allowlist
- BFS document limit — hard cap on traversal depth and result count per request
- DevAuthProvider production gate — startup crash (not silent fallthrough) if NODE_ENV=production and DevAuthProvider is active
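Two of the fixes above (the inverted REQUIRE_AUTH default and the DevAuthProvider production gate) can be combined into one startup check. The sketch below is an assumption about shape, not the codebase's actual API: `assertSafeAuthConfig` and the idea of identifying the active provider by name are hypothetical, and refusing `REQUIRE_AUTH=false` outright in production goes one step beyond what the list requires.

```typescript
type Env = Record<string, string | undefined>;

// Returns true unless auth has been explicitly disabled. This inverts the
// current default: a missing variable means auth is ON.
function isAuthRequired(env: Env): boolean {
  return env["REQUIRE_AUTH"] !== "false";
}

// Hypothetical startup gate. Call once before the HTTP server starts
// listening, so a bad configuration crashes the process instead of
// silently falling through to an unauthenticated path.
function assertSafeAuthConfig(env: Env, activeProviderName: string): void {
  const isProduction = env["NODE_ENV"] === "production";
  if (!isProduction) {
    return; // dev and test may opt out of auth or use the dev provider
  }
  if (activeProviderName === "DevAuthProvider") {
    throw new Error("DevAuthProvider must not be active when NODE_ENV=production");
  }
  if (!isAuthRequired(env)) {
    // Assumption: production never runs with auth disabled.
    throw new Error("REQUIRE_AUTH=false is not allowed when NODE_ENV=production");
  }
}
```

In a Node.js entry point this would be invoked as `assertSafeAuthConfig(process.env, provider.constructor.name)` or similar, depending on how the provider is actually wired.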
Customer Isolation Comparison
Architecture A — Isolation Reality (from code):
All 23 MongoDB collections are shared. The only isolation boundary is the tenant_id field predicate in application queries. MongoDB has no row-level security; the application is the sole enforcement point. Additional isolation failures found in code:
- InMemoryFindingsStore is shared across all tenants — if keying is not tenant-scoped internally, connector report findings from Tenant A are visible to Tenant B
- IngestService.processedSyncIds is a global Set<string> — not tenant-scoped; the practical risk is re-processing on restart (Set is lost), not cross-tenant blocking (UUIDv4 collision probability is 1/2^122)
- A stuck sync job (infinite path materialization loop) has no per-job timeout or watchdog; it occupies the entire worker indefinitely, blocking all other tenants' pipelines
- The auto-join domain-match feature (in the new, not-yet-mounted middleware) adds users to any tenant matching their email domain without explicit invitation — a multi-tenant implicit membership risk
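For the in-memory store finding above, one way to make tenant scoping structural rather than a convention is to require a tenant ID on every read and write. The class below is a hypothetical illustration (the real InMemoryFindingsStore interface is not shown in this report, and `Finding` is a placeholder shape).

```typescript
// Placeholder shape; the real finding type is assumed to be richer.
interface Finding {
  id: string;
  title: string;
}

class TenantScopedFindingsStore {
  // Null-prototype object: tenant IDs never collide with inherited keys.
  private readonly byTenant: Record<string, Finding[]> = Object.create(null);

  add(tenantId: string, finding: Finding): void {
    this.requireTenant(tenantId);
    const existing = this.byTenant[tenantId];
    if (existing === undefined) {
      this.byTenant[tenantId] = [finding];
    } else {
      existing.push(finding);
    }
  }

  // There is deliberately no "list everything" method: every read names
  // exactly one tenant, so a forgotten filter cannot leak other tenants.
  list(tenantId: string): Finding[] {
    this.requireTenant(tenantId);
    const existing = this.byTenant[tenantId];
    return existing === undefined ? [] : existing.slice(); // defensive copy
  }

  private requireTenant(tenantId: string): void {
    if (tenantId.length === 0) {
      throw new Error("tenantId is required");
    }
  }
}
```

The same keying pattern would apply to processedSyncIds, for example one set of sync IDs per tenant instead of a single global set.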
Architecture B — Isolation Reality:
- IAM graph data for Tenant A physically does not exist in Cell B's database — cross-cell IDOR requires control plane compromise + cell credential forgery
- Worker exhaustion, OOM, stuck jobs: bounded to the cell (25–35 tenants), not the platform
- Enterprise single-tenant cells: zero cross-tenant data at any layer; database breach exposes exactly one customer
- GDPR data residency: EU cell on EU Hetzner region + EU Atlas region; US tenant data never touches EU infrastructure
Pros and Cons
Current Architecture (Shared Multi-Tenant)
Pros:
- Simple to operate at current scale (single compose stack, one MongoDB)
- Low infrastructure cost ($0.74/tenant at 50 tenants)
- StorageAdapter abstraction provides a clean migration path to per-tenant collections without touching connectors or API routes
- Fast iteration — one deployment target
Cons:
- Worker queue blocks all tenants for the duration of any single sync job
- One large-tenant OOM kills the API process for all tenants simultaneously
- tenant_id field isolation is the only data boundary — one missing filter in any of 50+ query paths is a platform-wide cross-tenant breach
- FedRAMP, ISO 27001 SC-4, and GDPR data residency compliance are structurally blocked
- Write amplification from path materialization (O(I×R×P×Res) MongoDB reads) is a shared-instance bottleneck
- No horizontal scaling path without rewrite
- Docker Compose provides no rolling deployments, no health-based routing, no autoscaling
- APAC dashboard unusable (350ms+ RTT from Nuremberg)
Cell Architecture
Pros:
- Any single-cell failure (OOM, MongoDB, stuck job) affects 25–35 tenants, not the entire platform
- FedRAMP Moderate eligible via single-tenant government cells
- GDPR data residency: EU customers on EU cells, US customers on US cells — provable in procurement
- Geographic cells eliminate APAC latency penalty
- Enterprise isolation is a compliance requirement for CISO-grade buyers (FedRAMP, GDPR, contractual)
- Cell provisioning via Helm is 12–18 minutes, zero downtime
- Per-cell MongoDB credentials — one cell's database breach does not cascade
Cons:
- Control plane is a new single point of failure; must be built to higher availability than data plane
- Cell-to-cell tenant migration requires quiesce-export-import-verify-flip procedure (~30 min, coordination risk)
- Cell sprawl: 10 cells = 10 MongoDB instances to patch, 10 Redis instances to monitor, 10 deployment rollbacks per release
- $5.20/tenant at 50 tenants vs. $0.74 — 7× cost premium at low scale
- Significant engineering investment for control plane, provisioning, cell-aware routing — time not spent on the AWS connector or MCP feature
- Requires k3s or ECS as prerequisite — Docker Compose is incompatible with cell provisioning automation
- Intra-cell isolation within a 25–35 tenant cell still requires tenant_id field discipline; Architecture B reduces blast radius, not isolation mechanism
Cost Model
| Scale | Arch A Infrastructure | Arch A $/tenant | Cells Needed | Arch B Infrastructure | Arch B $/tenant |
|---|---|---|---|---|---|
| 50 tenants | 2× CPX21 = €22/mo + Redis | $0.74 | 2 cells | $260/mo | $5.20 |
| 200 tenants | CPX51 + Atlas M20 = ~$170/mo | $0.85 | 7 cells | $910/mo | $4.55 |
| 500 tenants | CPX51 + Atlas M50 = ~$470/mo | $1.10 | 17 cells | $2,210/mo | $4.42 |
At 200+ tenants, Architecture A requires a dedicated DBA and constant capacity management; the staffing cost delta alone exceeds the $3.70/tenant infrastructure premium of Architecture B.
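The Architecture B columns in the table above can be reproduced with simple arithmetic, which makes the model easy to re-run at other tenant counts. Two inputs are inferred rather than stated: a flat $130 per cell per month (consistent with $260/2, $910/7, and $2,210/17) and 30 tenants per cell (the midpoint of the 25–35 range). Both are assumptions of this sketch.

```typescript
// Assumption: midpoint of the 25-35 tenants-per-cell range.
const TENANTS_PER_CELL = 30;
// Inferred from the table: 260 / 2 = 910 / 7 = 2210 / 17 = 130.
const USD_PER_CELL_PER_MONTH = 130;

interface CellCost {
  cells: number;
  monthlyUsd: number;
  usdPerTenant: number;
}

// Cost of running `tenants` tenants on Architecture B under the
// assumptions above.
function archBCost(tenants: number): CellCost {
  if (tenants <= 0) {
    throw new Error("tenants must be positive");
  }
  const cells = Math.ceil(tenants / TENANTS_PER_CELL);
  const monthlyUsd = cells * USD_PER_CELL_PER_MONTH;
  return { cells, monthlyUsd, usdPerTenant: monthlyUsd / tenants };
}

// archBCost(50)  -> 2 cells,  $260/mo,   $5.20 per tenant
// archBCost(200) -> 7 cells,  $910/mo,   $4.55 per tenant
// archBCost(500) -> 17 cells, $2,210/mo, $4.42 per tenant
// Premium at 200 tenants: 4.55 - 0.85 (Arch A, from the table) = $3.70
```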
The Verdict: When to Invest in Cell Architecture
Cell architecture is the correct long-term direction. It is the wrong immediate investment.
SecurityV0 has no evidence any cell-architecture-solvable problem exists at its current scale. The AWS connector produces no reliable execution evidence. Authentication is mid-migration. The worker queue is a JS array. Before rearchitecting for scale, the product must work.
Triggers that justify the cell investment (all must be true):
- 100+ tenants with active sync workloads
- Demonstrated requirement for physical data isolation (not just field-level tenant_id discipline)
- Measured noisy-neighbor degradation — not theoretical; actual P95 latency correlation between one tenant's sync load and another's dashboard latency
- All items 1–9 from the existing priority table are closed
- WorkOS auth migration is complete and deployed
- Operational capacity to maintain multiple independent MongoDB instances, Redis instances, and Helm deployments
The incremental path — no big-bang rewrite:
Step A: Per-tenant MongoDB collections via StorageAdapter
Add tenantId → collectionName routing inside the adapter
Delivers collection-level isolation; maps cleanly to cell extraction later
Application code: unchanged. Connectors: unchanged.
Step B: Persistent job queue
Replace WorkerJob[] array with durable queue (MongoDB-backed or BullMQ)
Enables parallel workers, per-tenant priority lanes, job recovery
Step C: Per-tenant API rate limiting
Token bucket keyed by tenantId in middleware
Eliminates noisy-neighbor at the API layer
When triggered: First enterprise customer requiring contractual isolation
Extract them to a dedicated single-tenant cell
One cell, one Terraform module, no generalized control plane yet
When triggered: Measured queue degradation across tenants
General cell model: control plane, provisioning automation, cell router
Steps A–C are already done; the migration is additive, not a rewrite
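Step A's tenantId → collectionName routing could be as small as one pure function inside the adapter. The naming scheme (`base__tenantId`), the allowed character set, and the 64-character limit below are assumptions chosen for illustration; the only hard constraints relied on are MongoDB's own rules that collection names must not contain `$` or start with `system.`.

```typescript
// Conservative allowlist: letters, digits, underscore, hyphen. This
// excludes "$" and "." by construction, so a tenant ID can never produce
// an invalid or "system."-prefixed collection name.
const TENANT_ID_PATTERN = /^[A-Za-z0-9_-]{1,64}$/;

// Hypothetical helper: maps a logical collection (e.g. "findings") and a
// tenant to the physical per-tenant collection name.
function collectionNameFor(baseName: string, tenantId: string): string {
  if (!TENANT_ID_PATTERN.test(tenantId)) {
    throw new Error(`invalid tenantId for collection routing: ${tenantId}`);
  }
  return `${baseName}__${tenantId}`;
}

// collectionNameFor("findings", "acme-01") -> "findings__acme-01"
```

Rejecting unexpected tenant IDs, rather than sanitizing them, keeps two different IDs from ever mapping to the same collection.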
This path avoids the big-bang rewrite. Each step is independently justified by a confirmed problem. The architecture evolves toward cells driven by real customer requirements, not hypothetical scale.
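As an illustration of Step C, a token bucket keyed by tenantId fits in a few lines. This is a single-process sketch under assumed parameters (`capacity`, `refillPerSecond`); the class name and the in-memory storage are hypothetical, and a multi-instance deployment would need shared state (for example in Redis) instead.

```typescript
interface Bucket {
  tokens: number;
  lastRefillMs: number;
}

class TenantRateLimiter {
  private readonly buckets: Record<string, Bucket> = Object.create(null);

  constructor(
    private readonly capacity: number, // maximum burst size
    private readonly refillPerSecond: number, // sustained request rate
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Returns true if the request may proceed, false if this tenant is
  // over its limit. Other tenants' buckets are unaffected.
  tryConsume(tenantId: string): boolean {
    const nowMs = this.now();
    let bucket = this.buckets[tenantId];
    if (bucket === undefined) {
      bucket = { tokens: this.capacity, lastRefillMs: nowMs };
      this.buckets[tenantId] = bucket;
    }
    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSec = (nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSec * this.refillPerSecond,
    );
    bucket.lastRefillMs = nowMs;
    if (bucket.tokens < 1) {
      return false;
    }
    bucket.tokens -= 1;
    return true;
  }
}
```

In middleware, a `false` result would typically translate to an HTTP 429 response for that tenant only.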