2026-06-09 Architecture Review — Enterprise Survivability, Current Limitations, Evolution Path
Tracking issue: SecurityV0/sv0-documentation#373
Scope and method
Third review round, commissioned by Ivan (CTO) on 2026-06-09. Question asked: will the architecture survive and prosper in real enterprises; where are the limitations today; how do we get to better options.
Five parallel review passes, synthesized here:
- Prior-review follow-through — the 2026-04-12 external audit (Ilya) + proposed-changes, the 2026-04-21 pre-client readiness review, the March 2026 platform review; status of every major finding verified against issues/PRs/code at today's HEAD.
- Scalability position — the sizing doc, the graph strategy doc, the data-model position memo, ADR-031 and the founding ADRs (001/002/003/020).
- Platform code audit: sv0-platform at main HEAD 1b0a46c, focused on enterprise data volumes and operational robustness, every claim cited file:line.
- Connector audit: sv0-connectors, all 5 integrations, against the connector contract and runtime architecture docs.
- Enterprise ops readiness — tenant isolation, SSO/SCIM/RBAC, audit logging, residency, deployment models, DR, observability, compliance, team-size operational maturity.
A sixth pass ran after synthesis: an adversarial verification agent (Opus) spot-checked 24 file:line citations and 6 issue/ADR/prod facts against both repos. Verdict: UPHOLD-WITH-CORRECTIONS — the P0 finding and the four ceilings verified exactly; one finding (posture truncation) was overstated and has been corrected below; its identified scope limits are recorded in "Scope limits of this review."
Everything below distinguishes verified (code/issue evidence) from projected (the docs' own forward estimates). Customer archetypes A–E are as defined in scalability-sizing-and-decision-points.md (A ≈ 150 NHIs demo … C ≈ 25K NHIs mid-market … D ≈ 150K NHIs Fortune-1000 … E ≈ 600K regulated mega).
Executive verdict
Survive: the architecture survives archetypes A and B today, and one archetype-C tenant marginally. It does not survive a real Fortune-1000 estate (archetype D) in its current form — and the docs already say so honestly. The strategy (the migration staircase, trigger-gated steps, Mongo-now substrate per ADR-031) is sound and was re-validated this round. The problem is not the plan.
The problem is execution lag on the write/compute pipeline. The graph-traversal read path (BFS caps, index discipline, cursor pagination) has been through real adversarial hardening and would degrade gracefully at enterprise volumes — verified for traversal; list-API payloads and UI render budgets were not load-checked (see Scope limits). The write/compute pipeline — queue → ingest → diff → materialize → chains → evaluate → stitch — is still single-node, in-memory, sequential, and full-tenant-per-sync. The single most-flagged finding across all prior reviews (the in-memory job queue, #388) has a merged design doc from 2026-04-16 and zero implementation eight weeks later. Four independent, code-verified ceilings sit between today and a 100K-entity tenant.
Prosper: prosperity in enterprises is currently blocked less by scale than by operational/procurement gaps: no persisted audit log (ADR-021 still Proposed; prod audit signal lives in a Docker log buffer), no prod log/metric shipping, no compliance program, EU data processed on US compute, and ADR-018's own standing condition — "no production customer tenants until migrated" off the risk-accepted Hetzner prod — which the Azure cutover (ADR-022 Phase 3c) has not yet discharged.
New highest-severity finding this round: a throttled or access-denied connector cell can cause the platform to silently delete previously-ingested entities and close findings — the connector emits partial-status evidence precisely so this won't happen, but the platform's diff engine never reads it (detail in Part 3, finding 1). This is a correctness bug that converts an AWS API throttle into vanished inventory, and it gets more likely at enterprise scale, not less.
One overarching observation: every claim above archetype B is a projection. No cliff has ever been load-tested; no tenant above ~3K entities has ever run. The sizing docs admit this. Converting one projection into an observation (a synthetic 25K-entity tenant) is cheap and would either validate or re-price the entire staircase.
Part 1 — Follow-through since April 2026
The April audit and pre-client review were taken seriously: the security wave is essentially closed, the product-legibility wave shipped, the runtime wave stalled.
| Theme (≥2 reviews flagged it) | Status 2026-06-09 |
|---|---|
| Auth hardening (IDOR, DevAuth gate, bearer middleware, super-admin, logout, session TTL) | CLOSED — verified in code (provider-factory.ts:43-48, bearer-token-middleware.ts, zero endsWith("@securityv0 hits). One survivor: rate limiting still keyed per-tenant only (rate-limit.ts:15) |
| AWS evidence chain (CloudTrail extractor, assumed-role parsing, normalized_action, justification-gap) | Code-complete, operationally unproven: all phases closed, but sv0-platform#393 (run the first CloudTrail-enabled scan, Wave-1 exit criterion) is still open. The fix chain has never been proven end-to-end on a real scan |
| Proven-vs-inferred evidence legibility (Sergey's standard) | SHIPPED — EvidenceClaim classification, v0.6 north-star clauses |
| Graph as seed-anchored drill-down, bounded render | Converged — ADR-030/031 D3/D4; predicate query layer (#1306 Phase 1) still pending |
| In-memory job queue | OPEN — src/workers/runtime.ts:34 is still private readonly queue: WorkerJob[] = []. #388 untouched since the design doc merged (PR #402, 2026-04-16). Flagged by three independent passes since April; the most-flagged, least-fixed item in the corpus |
| Evidence immutability / tamper-evidence | OPEN, zero progress — flagged three times (audit §0.1/§9.1, proposed-changes §8, pre-client HMAC item). No ADR, no issue, no code. Events still TTL-expire at 2 years; no sequence numbers; evidence-pack event-range binding never started |
| Tenant-isolation blast radius (cells, per-tenant collections) | Deliberately deferred with triggers — consistent; no trigger fired |
| Epic #174 | Wave 1: 10/10 closed. Wave 2: 8/9. Wave 3 (Kuzu prototype, ADR-001 amendment, cell triggers): 0/4 — absorbed informally into ADR-031 rather than tracked to closure |
Open strategic-decision hygiene: #255 was partially answered by ADR-031 (authorization-graph positioning rejected, full-chain direction committed, federated edge deferred) but the issue was never updated to record this; decisions 3 (one-product vs two-product) and 4 (investor artifact shape) remain genuinely unresolved.
Part 2 — Survivability by archetype
| Archetype | Verdict today (code-verified where marked) |
|---|---|
| A (~150 NHIs) / B (~5K) | Survives with large headroom. This is the only regime ever observed in production (~3K-entity tenants — stitch-ingestion.ts:870 records the measured baseline) |
| C (~25K NHIs, ~3M edges) | One tenant: marginal. Evaluator + sync wall-time becomes hours (verified mechanics, projected duration); a second concurrent C tenant breaks the shared FIFO queue. Sizing doc itself recommends M30 + Step 1 first |
| D (~150K) | Does not survive. Four independent verified ceilings (below) + connector full-rescan model. Sizing doc agrees: "cannot run safely without Step 1 [+ Step 2 + Tier 3a/3b]" |
| E (~600K, regulated) | Cannot onboard — by design, pending full staircase. Also blocked by procurement gaps regardless of scale |
The verified ceilings (platform, at HEAD 1b0a46c)
- Whole graph in memory, twice. /api/v1/ingest accepts 200MB JSON (app.ts:97); the parsed graph then sits in the in-memory job queue as payload.graph (ingest-service.ts:60-67). The chunked path re-assembles the full graph at commit (ingest-chunked.ts:293-330), so it solves the HTTP limit, not the memory ceiling. A 100K-node payload OOMs API + worker together (they share one process).
- Per-entity N+1 sync ingestion. 3 Mongo round trips per entity in the cited hot loop (read, atomic upsert, re-read; sync-ingestion.ts:240-326), plus further per-entity writes in the downstream version/soft-delete/path steps; the batched upsertEntities (entity-adapter.ts:431-487) exists and is unused by it. 100K entities ≈ 300K+ sequential ops ≈ tens of minutes during which the FIFO queue serves no other tenant.
- Evaluator re-runs the whole tenant on every sync of any connector. All entities, all paths, all findings loaded with limit: 0 (evaluator/index.ts:51,103,591), with an awaited per-candidate getFinding round trip. Enqueued on every acceptGraph and post-stitch.
- 16MB BSON wall on hub entities, warn-only. Embedded relationships + mirrored inbound edges + accessible_by/execution_paths fan-out arrays; the only guard is a warning at 8MB explicitly marked "a soft signal, not a cap" (entity-adapter.ts:236-244). An AWS admin role with O(10–50K) edges fails the write mid-sync, unhandled. (This is also the ADR-002 vs ADR-031 contradiction; see Part 4.)
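The N+1 ceiling is mostly arithmetic, and a small sketch makes the gap concrete. The code below is illustrative only: the `Entity` shape, the batch size of 1,000, and the helper names are assumptions, and it does not reproduce the platform's actual upsertEntities. The operation objects follow the `updateOne` / `upsert` shape that the MongoDB Node driver's `bulkWrite` accepts.

```typescript
// Hypothetical entity shape; the real schema lives in sv0-platform and is not reproduced here.
interface Entity {
  tenantId: string;
  entityId: string;
  properties: Record<string, unknown>;
}

// One upsert inside a bulkWrite batch (updateOne with upsert: true).
interface UpsertOp {
  updateOne: {
    filter: { tenantId: string; entityId: string };
    update: { $set: { properties: Record<string, unknown> } };
    upsert: true;
  };
}

// Split a list into fixed-size chunks; the last chunk may be shorter.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new Error("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build one array of upsert operations per batch; each inner array is one round trip.
function buildUpsertBatches(entities: readonly Entity[], batchSize = 1000): UpsertOp[][] {
  return chunk(entities, batchSize).map((batch) =>
    batch.map((e) => ({
      updateOne: {
        filter: { tenantId: e.tenantId, entityId: e.entityId },
        update: { $set: { properties: e.properties } },
        upsert: true as const,
      },
    })),
  );
}

// Round trips for the per-entity loop (read, upsert, re-read) versus the batched path.
function roundTrips(entityCount: number, batchSize = 1000): { perEntity: number; batched: number } {
  return { perEntity: entityCount * 3, batched: Math.ceil(entityCount / batchSize) };
}
```

Under these assumptions a 100K-entity sync drops from 300,000 sequential operations to 100 batched calls; the real gain depends on document size and on what the downstream version and soft-delete steps still do per entity.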
Supporting ceilings: whole-tenant in-memory stitcher hard-capped at 100K entities, silently truncated above (stitch-ingestion.ts:871, load loop :862-884); tenant-wide chain rebuild with uncapped BFS on every sync (sync-ingestion.ts:498, chain-builder.ts:210-299); unanchored $regex entity search (entity-adapter.ts:563-569); a subset of posture/risk-cluster aggregates computed per request over a 5,000-path slice (posture-service.ts:52-53 + 7 sites in risk-cluster-service.ts) — adversarial verification showed the headline counts use server-side count* and stay correct, and the truncation is flagged in response meta (posture-service.ts:203); the residual risk is the UI not rendering that flag (see finding 9). >5K paths is realistic for a single large AWS account.
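The uncapped chain-builder BFS named above contrasts with the budgeted traversal the read path already uses. A minimal sketch of the budgeted pattern follows, assuming a plain adjacency map and a single node budget; `boundedBfs` and its types are hypothetical and are not the subgraph adapter's actual code.

```typescript
// Hypothetical in-memory adjacency list: node id -> outgoing neighbour ids.
type Adjacency = ReadonlyMap<string, readonly string[]>;

interface BfsResult {
  visited: string[];  // node ids in visit order
  truncated: boolean; // true when the budget stopped the walk early
}

// Breadth-first walk from a seed that stops once maxNodes have been visited.
function boundedBfs(graph: Adjacency, seed: string, maxNodes: number): BfsResult {
  const visited: string[] = [];
  const seen = new Set<string>([seed]);
  const queue: string[] = [seed];
  let head = 0;
  while (head < queue.length) {
    // Budget check comes first so the caller can tell "finished" from "cut off".
    if (visited.length >= maxNodes) {
      return { visited, truncated: true };
    }
    const node = queue[head];
    head += 1;
    if (node === undefined) break;
    visited.push(node);
    // Sort neighbours so the same graph always truncates at the same place.
    const next = [...(graph.get(node) ?? [])].sort();
    for (const n of next) {
      if (!seen.has(n)) {
        seen.add(n);
        queue.push(n);
      }
    }
  }
  return { visited, truncated: false };
}
```

Returning the `truncated` flag alongside the partial result is what lets a caller surface the cap instead of silently presenting a partial chain as complete.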
Robustness (crash / partial failure / multi-instance)
- Crash mid-sync: queued jobs (including 200MB payloads) lost; failed jobs dropped with no retry (runtime.ts:139-149); connector_syncs stuck in running forever (no reaper); 13-step pipeline is non-transactional with no resume marker, so recovery depends entirely on the connector re-sending a full scan.
- No job timeout: one hung Mongo call blocks the entire queue for all tenants (runtime.ts:131).
- Multi-instance deploy silently corrupts data. The code documents its own single-node assumption (sync-ingestion.ts:228-234); property pre-merge and accessible_by updates are non-atomic read-modify-write. Nothing prevents someone from scaling the API horizontally and silently reintroducing the #459 clobber class.
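The three gaps above (no retry, no timeout, no reaper) share one remedy in most persistent queues: a lease on each running job plus a bounded attempt count. The sketch below shows only that state machine as pure functions. Every name in it (`JobRecord`, `QueuePolicy`, `claim`, `fail`, `reap`) is hypothetical and is not taken from the #388 design doc; in a MongoDB-backed store the claim step would additionally need to be a single atomic update so that two workers cannot claim the same job.

```typescript
type JobStatus = "queued" | "running" | "succeeded" | "dead";

// Hypothetical persisted job record.
interface JobRecord {
  id: string;
  tenantId: string;
  status: JobStatus;
  attempts: number;              // how many times the job has been claimed
  leaseExpiresAt: number | null; // epoch ms; set only while running
}

interface QueuePolicy {
  maxAttempts: number;
  leaseMs: number;
}

// Claim a queued job: mark it running and start its lease.
function claim(job: JobRecord, now: number, policy: QueuePolicy): JobRecord {
  if (job.status !== "queued") {
    throw new Error(`job ${job.id} is not claimable`);
  }
  return {
    ...job,
    status: "running",
    attempts: job.attempts + 1,
    leaseExpiresAt: now + policy.leaseMs,
  };
}

// Record a failed attempt: requeue while attempts remain, otherwise dead-letter.
function fail(job: JobRecord, policy: QueuePolicy): JobRecord {
  const status: JobStatus = job.attempts < policy.maxAttempts ? "queued" : "dead";
  return { ...job, status, leaseExpiresAt: null };
}

// Reaper: a running job whose lease has expired counts as a failed attempt.
function reap(jobs: readonly JobRecord[], now: number, policy: QueuePolicy): JobRecord[] {
  return jobs.map((j) =>
    j.status === "running" && j.leaseExpiresAt !== null && j.leaseExpiresAt <= now
      ? fail(j, policy)
      : j,
  );
}
```

With a lease in place, a hung Mongo call stops being a permanent queue block: the lease expires, the reaper requeues or dead-letters the job, and the record never stays in running indefinitely.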
What the code does well (credit where due)
Subgraph BFS hardening (MAX_REVERSE_LOOKUP_DOCS=5000, frontier/edge caps, deterministic truncation surfaced to the UI); 12 tenant-prefixed compound indexes with boot-time verification; correct cursor pagination with clamped limits; deletion circuit breakers (global + per-type) on entities, authority paths, and chains; atomic server-side relationship merge; finding change-detection gating evidence-pack rebuilds; scheduler with atomic claim, cooldown, and SIGTERM drain. The read side would survive enterprise volumes with graceful truncation. The write side would not.
Connectors at enterprise scale
- Incremental sync: zero implementation anywhere. The contract specs extractIncremental/delta tokens; the platform has an unused sync_cursors collection with adapter plumbing (sync-adapter.ts:59-71); no connector and no worker consumes it. Every scan is a full re-extract. At 100K identities / 500 AWS accounts that's a 12+-hour single-process scan (AWS: ~6,000 cells, 8 workers) or days (Entra: ~4–6 sequential Graph calls per service principal, no Graph-specific throttle budget), with no cross-run checkpoint: the failure mode is "start over."
- AWS is the most mature (multi-account org discovery, cell isolation, jittered backoff with in-run pagination resume, per-cell evidence keys) and genuinely pilot-ready for a tens-of-accounts org.
- Scheduled scans against real tenants are not possible yet — the env-broker phase of the connector runtime is unmerged; all live scans so far ran from a developer laptop.
Part 3 — Current limitations, ranked
Severity × likelihood at a real enterprise tenant. P0 = will cause a customer-visible incident or kill a deal; P1 = blocks the next archetype; P2 = structural debt with a defined trigger.
| # | Finding | Class | Evidence |
|---|---|---|---|
| 1 | Failed/throttled connector cell → silent entity deletion and finding closure. Failed AWS cells still contribute their namespace to scanScope.sourceSystems (integrations/aws/sv0_aws/core/transformer.py:344-345); the platform's deletion detection treats that namespace as scanned (diff-engine.ts:336-360) and the absent entities become deletion candidates. The connector emits two signals specifically so this won't happen — per-cell "partial" evidence status (aws cli/main.py:1244-1250) and scanScope.errors (transformer.py:347-351) — and the platform ignores both: neither sync-ingestion.ts nor diff-engine.ts reads them (grep: zero references). One failed account in 500 ≈ 0.2% deletion ratio — sails under the 30–60% circuit breaker. Upheld by adversarial verification | P0 correctness — NEW this round | sv0-platform#1513 |
| 2 | In-memory, no-retry, no-timeout job queue. Most-flagged, least-fixed finding since April. Everything else on the runtime list (retries, resume, timeouts, multi-worker, payload externalization) hangs off it | P0 ops | #388 open; runtime.ts:34,131,139 |
| 3 | Full-graph-in-memory ingest (200MB body → queue payload; chunked path re-assembles) | P0 at C+, P1 today | app.ts:97, ingest-service.ts:60-67, ingest-chunked.ts:293-330 |
| 4 | No persisted audit log; no prod log/metric shipping. ADR-021 still Proposed; prod = Docker log buffer on Hetzner; Alloy runs on dev/staging only. Fails the most universal procurement question; gates SOC 2 CC6/CC7. Detection of a prod outage exists (external probes); explanation does not | P0 procurement | ADR-021; environments-and-ops-links.md; no audit_logs in collections.ts |
| 5 | ADR-018's standing condition vs reality. Prod runs on the risk-accepted Hetzner box whose ADR says "no production customer tenants until migrated" — root-equivalent deploy user, single VM, US location, no log shipping. The Azure prod cutover is the single event that discharges this, the EU-residency caveat, and the prod-observability gap | P0 procurement | ADR-018/022; prod also ~1 month behind dev |
| 6 | Evaluator full-tenant re-run per sync (per-entity awaits) | P1 throughput wall | evaluator/index.ts:51,69-93,103,591 |
| 7 | Per-entity N+1 sync ingestion; batched writer exists unused | P1 | sync-ingestion.ts:240-326 |
| 8 | 16MB BSON wall on hub entities, warn-only at 8MB | P1 (hard failure, no runway) | entity-adapter.ts:236-244; contradicts ADR-002 |
| 9 | Posture/risk-cluster truncation above 5,000 paths may be invisible in the UI. (Corrected after adversarial verification — originally rated P1 "silent falsification".) Headline counts (active_paths, dormant_paths, ownership counts) use server-side count* and stay correct at any scale; only the in-memory slice aggregates (e.g. total_executions_30d) are capped, and the cap is surfaced as meta.truncated (posture-service.ts:55,73-74,203; same pattern at 4 sites in risk-cluster-service.ts). Residual risk: the UI may not render the flag, and which aggregates degrade is undocumented | P2 product trust (UI rendering) | posture-service.ts:203 |
| 10 | No incremental sync, no cross-run checkpoint in any connector; sync_cursors infra unused; Entra per-SP sequential enrichment where throttling can manufacture ownership-decay findings (integrations/entra-servicenow/.../cli/main.py:103-124) | P1 | sv0-connectors, as cited |
| 11 | Multi-instance deploy unsafe (lost-update races; duplicated Scheduler/WorkerRuntime per process); nothing prevents horizontal scaling today | P1 latent data-loss | sync-ingestion.ts:228-310, index.ts:79,158 |
| 12 | Identity lifecycle automation unwired: no WorkOS webhooks, no reconciliation, no SCIM enabled, manual sso_enforced flips; real deprovisioning latency = "next login" | P1 procurement | doc 13 §15 |
| 13 | Evidence immutability — flagged 3×, zero progress; events TTL 2y; no event-range binding on packs | P2 compliance | proposed-changes §8; W3.2 |
| 14 | Connector credential posture: all-tenant secrets in VM env (self-acknowledged SOC 2 failure, doc 15 §136-139); static AWS bootstrap keys; ServiceNow operator-mediated secret handoff | P2 procurement | doc 15 |
| 15 | No compliance program (no controls inventory, no docs/compliance/ despite ADR-020 committing to a residency evidence pack) — "start at first compliance ask" serializes a multi-month scramble behind the first serious prospect | P2 | ADR-020 §6 |
| 16 | Sev1 handling is one human, best-effort; no incident-response runbook (docs/operations/ planned in #369, not yet created), no customer-comms process | P2 | resiliency plan |
Two stale-control findings worth a same-day look: (a) deploy-prod.yml:35-51 still runs a pre-deploy mongodump against a local container while prod data lives in Atlas — the compensating control ADR-018 cites is backing up a vestigial store; (b) rate limiting keyed per-tenant is an intentional anti-starvation design (rate-limit.ts:13) — the actual gap is the absence of a sub-tenant (per-principal/IP) limit.
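Finding 1 in the table above comes down to which namespaces count as "scanned" when deletion candidates are computed. A minimal sketch of the safer rule follows. The field names sourceSystems and errors come from the finding itself; the per-error `sourceSystem` field, the `KnownEntity` shape, and both function names are assumptions made for illustration and are not the diff engine's actual types.

```typescript
// Hypothetical scan-scope shape (see the assumptions in the paragraph above).
interface ScanScope {
  sourceSystems: string[];
  errors: { sourceSystem: string; message: string }[];
}

// Hypothetical previously-ingested entity.
interface KnownEntity {
  id: string;
  sourceSystem: string;
}

// Namespaces that were scanned completely: declared in scope and free of errors.
function completeNamespaces(scope: ScanScope): Set<string> {
  const failed = new Set(scope.errors.map((e) => e.sourceSystem));
  return new Set(scope.sourceSystems.filter((ns) => !failed.has(ns)));
}

// Deletion candidates: known entities absent from this scan,
// restricted to namespaces that were scanned completely.
function deletionCandidates(
  known: readonly KnownEntity[],
  seenIds: ReadonlySet<string>,
  scope: ScanScope,
): KnownEntity[] {
  const complete = completeNamespaces(scope);
  return known.filter((e) => complete.has(e.sourceSystem) && !seenIds.has(e.id));
}
```

The exclusion has to happen before any deletion ratio is computed. One failed account in 500 is a 0.2% ratio, so a breaker tuned to 30–60% will not catch it however the threshold is adjusted.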
Part 4 — Strategy quality: what holds, what has drifted
What holds. The staircase (Step 0 chain contract → Step 1 pipeline stabilization → Step 2 embedded graph engine → Tier 3a/3b per-tenant isolation → Step 4 cells) was re-derived independently by this round's code audit — the cliffs it names are the cliffs the code has. ADR-031's "pipeline cliffs bite before the graph cliff" is exactly right: nothing in this review says "move off MongoDB"; everything says "finish Step 1." The docs' honesty discipline (self-declared guessed numbers, code-cited cliffs, trigger-gated deferrals) is itself a procurement asset.
What has drifted:
- ADR-002 vs ADR-031 on the 16MB limit. ADR-002 (Feb): "Document Size Not a Concern." ADR-031 D5.2 (May): unbounded fan-out arrays are "a hard write cliff with no runway today." ADR-002 was never amended.
- Three different graph-engine triggers coexist — ADR-001 (10K identities → Neo4j), strategy doc Step 2 (1K-holder role → embedded engine), ADR-031 (50–100K identities uneconomic window). A first D-class deal satisfies ADR-001's trigger on day one. Needs one canonical trigger set.
- Triggers were unmeasurable until weeks ago. ADR-031 D0 named the dependency inversion; #1326 (doc-size/write-amp monitoring) merged, but cap-hit rate, reverse fan-in distribution, per-tenant identity counts, and chain-rebuild duration are still not dashboards. "Trigger-driven, not deadline-driven" only works if the triggers are observable.
- Status rot. ADR-020 frontmatter says proposed, body says Accepted; doc 14 carries a stale reverted-status note; the two strategy docs are draft yet treated as canonical; ADR-031 is Proposed with only D3 ratified; #255 and #1306 bodies carry pre-correction claims.
- Multi-account topology is absent from the model. Archetypes count clouds, never AWS account counts; a 500-account org appears nowhere in the sizing; cross-account TRUSTS (the primary enterprise lateral-movement edge, per ADR-031 D5.3 itself) is dropped from multiple traversal vocabularies.
- Nothing above B was ever observed. All C/D/E numbers are ratio-derived from vendor benchmarks. The sizing doc says so; the consequence is that Step-1 scoping is being done against projections.
Part 5 — Path forward (sequenced)
Phase NOW — before the first paying customer (≈ the next 4–6 weeks of platform work)
- #388 persistent job queue: the first domino. Design is approved (MongoDB-backed store, PR #402). Implementation unlocks retries, timeouts, resume, payload externalization (fixes finding 3's queue residency), multi-worker safety, and per-tenant fair-share. Nearly every other runtime fix is blocked behind it.
- Fix deletion-vs-staleness (finding 1). Make diff-engine.ts consume evidence-completeness statuses and exclude failed-cell namespaces from deletion scope. Small, contained, prevents the worst customer-visible failure this review found.
- Posture truncation honesty (finding 9, corrected scope). The backend already returns meta.truncated; the work is to render it in the UI (badge on Overview/posture/cluster pages) and document which aggregates degrade above the 5K cap. Hours of work; protects product trust.
- Audit-log persistence: ADR-021's own first PR (schema lock + fail-loud sink + Loki shipper on prod). Days of work, removes the #1 procurement blocker. Caveat from adversarial verification: shipping prod logs to Grafana Cloud adds a new egress path and a new prod credential, so pick the Loki region deliberately and route the token through the existing 3-tier secrets model, or this fix partially re-opens findings 5 and 14.
- Azure prod cutover (ADR-022 Phase 3c) — discharges ADR-018's condition, fixes EU-on-US-compute, and brings Alloy/observability to prod in one move. This is the critical path for "prosper."
- One synthetic load test: a 25K-entity / 3M-edge archetype-C tenant — run against the prod Atlas tier (M10), not just a staging clone. The envelope's binding constraints are tier-specific (M10 ≈ 2GB RAM, connection and IOPS caps); a test on a bigger cluster validates nothing about prod. Converts the C-tier envelope from projection to observation; either validates the staircase pricing or re-prices it before a real prospect does. (Per the determinism rules this is a fixture-generation script, not ML.)
- Run #393 (first CloudTrail-enabled scan) — close the April audit's Wave-1 exit criterion that everything else already assumes.
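The synthetic load test in the list above needs a fixture that is reproducible run to run. One way to get that is a seeded generator, sketched below. Everything here is an assumption for illustration: the mulberry32-style PRNG, the squared-draw skew that produces a handful of hub entities, and the typed-array edge layout. None of it is an existing script, and a real fixture would still have to be mapped onto the ingest payload schema.

```typescript
// Small seeded PRNG (mulberry32-style); returns floats in [0, 1).
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface GraphFixture {
  entityCount: number;
  from: Uint32Array; // edge i runs from[i] -> to[i], as entity indexes
  to: Uint32Array;
}

// Generate a directed graph; the same arguments always yield the same edges.
function generateFixture(entityCount: number, edgeCount: number, seed: number): GraphFixture {
  if (entityCount < 2) throw new Error("need at least two entities");
  const rand = seededRandom(seed);
  const from = new Uint32Array(edgeCount);
  const to = new Uint32Array(edgeCount);
  for (let i = 0; i < edgeCount; i++) {
    const src = Math.floor(rand() * entityCount);
    // Squaring the draw skews targets toward low indexes, giving a few hub entities.
    const r = rand();
    let dst = Math.floor(r * r * entityCount);
    if (dst === src) dst = (dst + 1) % entityCount; // avoid self-loops
    from[i] = src;
    to[i] = dst;
  }
  return { entityCount, from, to };
}
```

A call such as `generateFixture(25000, 3000000, 1)` would match the archetype-C shape of roughly 120 edges per entity. Because low-index entities become hubs under this skew, the same fixture also exercises the fan-out arrays behind the 16MB ceiling.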
Phase NEXT — triggered by the first C-archetype prospect (Step 1 of the staircase, made concrete)
- Batch the sync hot path: route ingestion through the existing upsertEntities bulkWrite; kill the unbounded Promise.all re-reads. Likely a 10–50× sync-duration win for code that already exists.
- Scope the evaluator: evaluate affected entities per sync (the diff engine already knows the changed set), full-tenant only on rule-version change.
- Cap and scope chain/stitch work — node budget for chain-builder BFS (the subgraph adapter is the template); paginate the stitcher; scope chain rebuilds to affected components.
- 8MB → hard cap with overflow strategy for fan-out arrays (spill to a side collection or truncate-with-marker; ADR needed — this amends ADR-002).
- First incremental sync: Entra delta queries (the API supports it natively, sync_cursors plumbing exists) + cross-run checkpointing for AWS cells.
- ADR-031 Phase 1: predicate query layer (#1306/#1324), browse-mode deprecation.
- Connector runtime env-broker phase (per-tenant Key Vault resolution) — unblocks scheduled scans against real tenants and removes the self-acknowledged SOC 2 failure.
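The evaluator-scoping item in the list above is a small planning decision made before any evaluation runs. A minimal sketch, assuming the diff exposes changed entity ids and each authority path lists the entities it touches; `SyncDiff`, `AuthorityPath`, `EvaluationPlan`, and `planEvaluation` are hypothetical names, not the evaluator's actual interfaces.

```typescript
// Hypothetical inputs: what the diff engine changed, and which entities each path touches.
interface SyncDiff {
  changedEntityIds: string[];
}

interface AuthorityPath {
  id: string;
  entityIds: string[];
}

type EvaluationPlan =
  | { mode: "full"; reason: string }
  | { mode: "scoped"; entityIds: string[]; pathIds: string[] };

// Decide what the evaluator has to look at after a sync.
function planEvaluation(
  diff: SyncDiff,
  paths: readonly AuthorityPath[],
  lastRulesVersion: string,
  currentRulesVersion: string,
): EvaluationPlan {
  // A rule change can alter any finding, so only then is a full-tenant run needed.
  if (lastRulesVersion !== currentRulesVersion) {
    return { mode: "full", reason: "rule version changed" };
  }
  const changed = new Set(diff.changedEntityIds);
  // A path is affected when it touches at least one changed entity.
  const affected = paths.filter((p) => p.entityIds.some((id) => changed.has(id)));
  // Re-evaluate the changed entities plus every entity on an affected path.
  const entityIds = new Set(diff.changedEntityIds);
  affected.forEach((p) => p.entityIds.forEach((id) => entityIds.add(id)));
  return {
    mode: "scoped",
    entityIds: Array.from(entityIds).sort(),
    pathIds: affected.map((p) => p.id).sort(),
  };
}
```

Whether "affected" must also include neighbours of changed entities depends on how individual rules read the graph; that closure rule is a design decision this sketch does not settle.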
Phase LATER — keep trigger-gated, as designed (do not pull forward)
Step 2 embedded graph engine; Tier 3a per-tenant Atlas (sellable as Premium isolation SKU); Tier 3b worker pools; Step 4 cells; multi-region phases per ADR-020. The triggers are now (mostly) measurable — finish the D0 dashboards so they're actually monitored.
Documentation hygiene (cheap, do alongside)
- Amend ADR-001/ADR-002 with the claims ADR-031 invalidated; pick one canonical graph-engine trigger set.
- Ratify ADR-031 fully or mark which Ds remain proposed; fix ADR-020/doc-14 status rot.
- Close out #255 with what ADR-031 answered; leave decisions 3 and 4 explicitly open.
- Create docs/compliance/ with the ADR-020 residency evidence pack; reconcile deploy-prod.yml mongodump vs Atlas reality.
- Quarterly "claims vs reality" pass over ADR risk-acceptance conditions (ADR-018's condition would have been caught by this).
Scope limits of this review (from adversarial verification)
Four areas this round did not examine; treat the verdict as silent on them, not positive:
- Atlas tier operational limits. "Mongo-now" was validated at the data-model level, but M10-specific ceilings (≈2GB RAM, connection caps, IOPS) were never exercised against a realistic edge count — hence the prod-tier requirement on the load test (NOW #6).
- Read-API payloads and UI render budget. Graph traversal is verified-hardened; list endpoints at capped-but-large pages and the React/@xyflow render budget at thousands of nodes were not checked. "Read side survives" is half-verified.
- Per-tenant cost economics. Full-rescan-every-sync at 100K identities is treated as a time problem here; Graph API / CloudTrail / egress cost per tenant is a pricing input nobody has computed.
- Second-order effects of the recommendations themselves — partially addressed inline (NOW #4's residency/credential caveat); a proper pass should review each NOW item against findings 5/13/14 before implementation.
Bottom line
The April review asked "is the architecture sound?" and the answer was yes-with-fixes; the fixes shipped everywhere except the runtime. This round asked "will it survive real enterprises?" and the answer is: the strategy will; the current pipeline implementation will not, and it doesn't need to yet — but the gap between decided and built is now the dominant risk. The platform has roughly one archetype of headroom (B → one C) before the four verified ceilings bite, and the procurement blockers (audit log, prod posture, compliance trajectory) will bite before the scale ceilings do on any real enterprise deal. The good news: the first five items in Phase NOW are all small-to-medium, fully designed, and none requires new architecture — they require finishing what's already decided.