ADR-031: Graph Query Architecture — Predicate Query Layer, Materialization Policy, and Substrate Stance

Status

Proposed — 2026-05-23. Captures two adversarial architecture reviews of the graph stack (substrate/scale, and queryability) commissioned after the cross-system bridge work, then pressure-tested by three independent models (Codex, Opus, Sonnet). The strategic spine survived all three; this revision incorporates their convergent findings — most importantly that the query layer (D3) is a new query engine, not a BFS tweak, and that observability is a Phase-0 prerequisite the original draft buried. It operationalizes the existing graph-scalability research (sv0-documentation#251, merged — position memo + migration staircase + sizing) rather than re-deciding it; see the Context subsection "Relationship to the existing graph-scalability research." Implementation lands in separate sv0-platform PRs that cite this ADR. Umbrella: sv0-platform#1306. Predecessor context: the correlation-blind audit #1289, the read-time correlation merge #1292, and the guided default #1303.

Update 2026-06-02 — partial ratification. The D3 substrate fork (Mongo-now vs graph-index-now) is resolved to Mongo-now (CTO call) — see "Ratification decision" below and the Alternatives entry. The ADR otherwise remains Proposed pending full ratification of the remaining decisions (D0–D6) and their phased sequencing under #1306.

Ratification decision — the D3 substrate fork (resolved)

The one genuine fork the reviews surfaced — build the D3 query layer on Mongo now, or on a graph index from the start? — is resolved: Mongo-now (CTO call, 2026-06-02). D3 ships on Mongo facets/multikey indexes; the graph index stays deferred to D1's triggers. Rationale, accepted: the staircase mandates Step 1 (pipeline stabilization) before Step 2 (graph engine) regardless — a graph engine on an unstabilized pipeline produces faster stale answers — so D3 lands on Mongo first under either choice; the residual "build twice" risk is contained by keeping D3 behind the StorageAdapter seam so its read path can re-target the Step-2 engine later, and the Mongo predicate layer is the cheapest way to learn the real query patterns before committing to a second datastore. No early additional investment in a graph index. This decision retires the fork in Alternatives below; D1/D3 stand as written.


Context

The graph is the product's headline surface ("no single system shows the complete execution-authority path — SecurityV0 does"). Two questions were raised about whether the current implementation survives a Fortune-500 customer rather than a demo tenant:

  1. Is the data model — described as "edge on edge, custom bridging" — survivable long-term and at enterprise scale?
  2. Is the graph actually queryable and filterable? We cannot and should not draw 1000+ nodes (a human cannot comprehend them), but an analyst must be able to ask the graph a question and get a comprehensible answer.

What exists today

Substrate. MongoDB, per ADR-001 (Mongo-only) and ADR-003 (reject Apache AGE). Not a graph database. The graph is materialized in several shapes:

  • entity.relationships[] — connector-native, intra-system edges embedded on each entity document (written by a per-connector atomic array merge in entity-adapter.ts).
  • entity.execution_paths[] — embedded materialized authority paths.
  • authority_paths, stitched_paths, execution_chains — derived collections.
  • correlations — synthesized cross-system / cross-identity edges (BRIDGES_TO, SAME_ENTITY), produced by the stitcher.
  • A read-time correlation merge (subgraph-adapter.ts, shipped in #1292) that composes correlations onto the subgraph at query time.

Traversal. Application-side BFS in subgraph-adapter.ts, with deterministic safety caps (MAX_REVERSE_LOOKUP_DOCS=5000, frontier and per-entity fanout caps), per-hop MongoDB round-trips, and a multikey reverse index (tenant_rel_target).
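As an illustration of what "deterministic safety caps" means for an application-side BFS, here is a minimal sketch. It is not the code in subgraph-adapter.ts: the in-memory adjacency map stands in for the per-hop MongoDB reads, and the cap names and values (MAX_FRONTIER, MAX_FANOUT_PER_ENTITY) are hypothetical. The point it shows is that sorting before truncating makes a capped traversal return the same subset every time, and that each cap hit can be reported rather than silently applied.

```typescript
// Hypothetical in-memory adjacency list standing in for per-hop MongoDB reads.
type Adjacency = Record<string, string[]>;

interface BfsResult {
  visited: string[]; // node ids in deterministic visit order
  capsHit: string[]; // names of the caps that truncated the walk
}

// Illustrative values only; not the production constants.
const MAX_FRONTIER = 3;
const MAX_FANOUT_PER_ENTITY = 2;

function boundedBfs(graph: Adjacency, seed: string, maxDepth: number): BfsResult {
  const seen: Record<string, boolean> = {};
  seen[seed] = true;
  const visited: string[] = [seed];
  const capsHit: string[] = [];
  const noteCap = (cap: string): void => {
    if (capsHit.indexOf(cap) === -1) capsHit.push(cap);
  };

  let frontier: string[] = [seed];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const candidates: string[] = [];
    for (const id of frontier) {
      // Sort so that truncation always drops the same neighbors.
      const neighbors = (graph[id] || []).slice().sort();
      if (neighbors.length > MAX_FANOUT_PER_ENTITY) noteCap("per_entity_fanout");
      for (const n of neighbors.slice(0, MAX_FANOUT_PER_ENTITY)) {
        if (!seen[n] && candidates.indexOf(n) === -1) candidates.push(n);
      }
    }
    if (candidates.length > MAX_FRONTIER) noteCap("frontier");
    frontier = candidates.slice(0, MAX_FRONTIER);
    for (const id of frontier) {
      seen[id] = true;
      visited.push(id);
    }
  }
  return { visited, capsHit };
}
```

Returning capsHit alongside the nodes is the design choice worth noting: it lets a caller distinguish a complete neighborhood from a truncated one.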

Query/render. The graph endpoint /api/v1/graph/subgraph is seed-anchored: it requires a seed_id and answers "neighborhood / execution-flow of this entity at depth N". The default Explorer (browse mode) loads roughly 200 entities (/api/v1/entities) and applies every filter client-side. The render half (bounded BFS, correlation merge, deterministic ELK layout per ADR-011, client aggregation) is solid.

Cross-view navigation. The forward hops that exist today (finding / exposure → graph) pass a single seed_id + a fixed depth (FindingDetail.tsx, ExposureDetailPage.tsx), so the graph re-derives a neighborhood rather than projecting the path's exact slice — and there is no reverse hop (graph → access-paths). After #1292 both views read the same connected data, but the hop between them is lossy and non-deterministic. This is D6, grounded on live nimbus-cloud Jira→AWS data.

Constraints. All logic is deterministic — no ML, no probabilistic ranking. Connectors are read-only. Tenants are isolated.

What the reviews found (verdicts)

  • Substrate: SURVIVABLE WITH CHANGES. The "built-for-demo" worry is half right. The multiplicity of materialized stores is real and is the root of the #1289 drift class. But the substrate (Mongo + materialization) is a defensible, deliberate choice for a read-mostly, deterministic, tenant-isolated, bounded-query product, and the team's own 03-database.md already documents the scale thresholds and a Neo4j-as-thin-index migration. The #1292 read-time merge is the correct direction, not the disease.
  • Queryability: NOT THERE. The graph supports "draw the neighborhood of a seed you already named" plus "load a 200-node page and filter it in the browser" — the exact load-then-filter-client model that does not scale. It does not support predicate questions ("all cross-system write paths", "everything identity X can reach", "all bridges into S3 in account Y"). The facet/index half already exists (rich facets on /entities and /authority-paths; reverse-reachability indexes tenant_exec_paths_resource and tenant_accessible_by_accessor that no query path is written to exploit). What is missing is the query layer joining them.

The honest framing: this is not a re-platform. It is (a) operational fixes to make the substrate safe for a real tenant, and (b) one net-new query layer built on foundations that already exist.

Relationship to the existing graph-scalability research (the canonical strategic frame)

This ADR does not re-decide the substrate/scaling strategy from scratch — that work already exists as a founder/investor-level position and an engineering staircase, merged in sv0-documentation#251:

  • Graph Data Model — Position — the company-defining statement: SecurityV0 is the authority system of record for non-human identities, and full-chain reasoning is the product architecture; bounded-hop materialized paths are today's MVP implementation, not the destination. The hop ceiling is MAX_AUTH_CHAIN_DEPTH = 2 today (bumped 1→2 by sv0-platform#1100) — a three-system chain (e.g. Entra → ServiceNow → AWS) is reachable, but a four-system chain still truncates, and more importantly the chain is stored without ordered, edge-bearing provenance (the real Step-0 gap — see below).
  • Graph Scalability and Migration Strategy — the migration staircase (Step 0 → 1 → 2 → 3 → Tier 3a/3b → 4, plus a parallel intra-tenant-partitioning track), each step trigger-driven not deadline-driven.
  • Scalability Sizing and Decision Points — five customer archetypes A–E (A ≈ 150 NHIs demo → D ≈ 150k Fortune-1000 → E ≈ 600k regulated mega) with per-archetype cutoff thresholds.

Three load-bearing facts from that research that this ADR adopts wholesale rather than restating:

  1. The cliffs arrive in order, and the pipeline cliffs come before the graph cliff. Cliff 1 = full-tenant evaluator read (queryEntities(tenantId, { limit: 0 })); Cliff 2 = synchronous evidence aggregation inside the sync loop; Cliff 3 = in-memory FIFO job queue (no persistence, no fair-share); then Cliff 4 = role fan-out write amplification (the graph cliff investors ask about). Investing in a graph engine before stabilizing the pipeline buys "faster stale answers." This re-orders D5 below.
  2. Step 0 — the chain contract — precedes everything, including this ADR's query layer. Promoting AuthorityChain to a first-class persisted entity with ordered chain_steps (today only auth_chain_depth / via_identity are stored) is the change that makes full-chain / multi-hop reasoning meaningful regardless of engine. D3's predicate queries cannot return an evidence-grade multi-system chain until Step 0 lands — not because of the depth ceiling (now = 2), but because the materialized path stores no ordered, edge-bearing step provenance to return.
  3. The seam is real for reads, not a containment boundary for writes. The StorageAdapter seam insulates traversal queries from the engine; the materializer/evaluator/evidence surfaces consume the materialized shape and refactor surface-by-surface during any engine swap. This is the same nuance the multi-model review raised as the dual-write/consistency gap.

This ADR's net-new contribution beyond that research: the staircase is about scaling the existing blast-radius query (and the engine under it); it does not add an analyst predicate-query layer (D3) — "ask the graph a question" — nor does it incorporate the post-research operational learnings from #1289 / #1292. ADR-031 operationalizes the staircase's near-term steps (Step 0 + Step 1) as its Phase 0, and adds the queryability axis on top.


Decision

D0 — Observability first (Phase 0 prerequisite, before D3)

The multi-model review found a dependency inversion: D1's deferral triggers and D3's "bounded answers are honest" claim both depend on telemetry that does not exist. The BFS safety caps fire only a console.warn; there is no metric for cap-hit rate, reverse fan-in distribution, per-tenant identity count, query latency, or chain-rebuild duration. Therefore: before the query layer ships, instrument the graph as structured metrics with alert thresholds — BFS cap-hit rates (frontier, reverse-lookup, correlation, per-entity fanout), query-latency histograms, slow-query logging, per-tenant identity/edge counts, and path/chain recompute duration. This is the cheapest item, it converts D1's "defer until a trigger fires" from aspiration into a measurable gate, and it is the only way to know whether D3's bounded results are complete or silently truncated. Observability is Phase 0; nothing else in this ADR is trustworthy without it.
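A minimal sketch of the cap-hit-rate metric described above, assuming an in-process counter keyed by tenant. The class name GraphCapMetrics, its methods, and the threshold semantics are hypothetical; a real implementation would presumably export to whatever metrics backend the platform uses rather than hold counts in memory.

```typescript
type CapName = "frontier" | "reverse_lookup" | "correlation" | "per_entity_fanout";

interface CapStats {
  queries: number;
  hits: Record<CapName, number>;
}

class GraphCapMetrics {
  private byTenant: Record<string, CapStats> = {};

  // Record one finished query and the distinct caps it hit.
  recordQuery(tenantId: string, capsHit: CapName[]): void {
    let stats = this.byTenant[tenantId];
    if (!stats) {
      stats = {
        queries: 0,
        hits: { frontier: 0, reverse_lookup: 0, correlation: 0, per_entity_fanout: 0 },
      };
      this.byTenant[tenantId] = stats;
    }
    stats.queries += 1;
    for (const cap of capsHit) stats.hits[cap] += 1;
  }

  // Fraction of this tenant's queries that hit the given cap.
  hitRate(tenantId: string, cap: CapName): number {
    const stats = this.byTenant[tenantId];
    return stats && stats.queries > 0 ? stats.hits[cap] / stats.queries : 0;
  }

  // Caps whose hit rate exceeds an alert threshold (for example 0.05).
  breaches(tenantId: string, threshold: number): CapName[] {
    const caps: CapName[] = ["frontier", "reverse_lookup", "correlation", "per_entity_fanout"];
    return caps.filter((cap) => this.hitRate(tenantId, cap) > threshold);
  }
}
```

With a counter like this in place of console.warn, "the reverse-lookup cap is routinely exceeded" becomes a number that can gate D1's triggers.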

D1 — Substrate: keep MongoDB + materialization; defer the graph index to explicit triggers

Reaffirm ADR-001 / ADR-003. Do not introduce a graph database now. The product profile (read-mostly, deterministic, point-in-time history via entity_versions, tenant-isolated, queries bounded/seeded) genuinely favors Mongo + O(1) materialized reads over a graph store.

The documented endgame is the migration staircase's Step 2 — native graph engine as a read model (an embedded graph engine alongside Mongo; Mongo stays the system of record; only traversal moves to the index, behind the StorageAdapter seam). Per the staircase, Step 1 (pipeline stabilization) must precede Step 2 — a graph engine fed by an unstabilized pipeline produces faster stale answers. Step 2 is deferred until its trigger fires:

  • a single role held by ~1,000+ identities (the staircase's Step 2 trigger — role fan-out is the graph cliff, Cliff 4), or
  • reverse fan-in for common queries routinely exceeds the MAX_REVERSE_LOOKUP_DOCS cap (an account/shared-role hub silently truncates blast radius — now measurable via D0), or
  • query-depth requirements exceed what the materializer can pre-compute economically.
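The three triggers above can be read as a predicate over per-tenant telemetry. The sketch below is one hedged way to express that gate. The field names, the trigger labels, and the 5% threshold used to approximate "routinely" are assumptions made for illustration; only the ~1,000-identity role fan-out figure comes from the staircase.

```typescript
// Hypothetical per-tenant telemetry snapshot, as D0 instrumentation might produce it.
interface TenantTelemetry {
  maxIdentitiesPerRole: number;        // largest number of identities holding one role
  reverseLookupCapHitRate: number;     // fraction of queries hitting the reverse-lookup cap, 0..1
  requiredQueryDepth: number;          // deepest traversal analysts need
  economicalMaterializedDepth: number; // deepest traversal the materializer can pre-compute affordably
}

const ROLE_FANOUT_TRIGGER = 1000;   // from the staircase: "~1,000+ identities"
const ROUTINE_CAP_HIT_RATE = 0.05;  // assumption: what counts as "routinely"

type Step2Trigger = "role_fanout" | "reverse_fan_in" | "query_depth";

function firedStep2Triggers(t: TenantTelemetry): Step2Trigger[] {
  const fired: Step2Trigger[] = [];
  if (t.maxIdentitiesPerRole >= ROLE_FANOUT_TRIGGER) fired.push("role_fanout");
  if (t.reverseLookupCapHitRate > ROUTINE_CAP_HIT_RATE) fired.push("reverse_fan_in");
  if (t.requiredQueryDepth > t.economicalMaterializedDepth) fired.push("query_depth");
  return fired;
}
```

An empty result means Step 2 stays deferred; any non-empty result is a prompt to revisit the decision, not an automatic migration.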

Mapped to the archetypes: an Archetype-C tenant (~25k NHIs, mid-market) sits on the production baseline today; Archetype-D (~150k NHIs, Fortune-1000) needs all of Step 1 to run at all; pre-compute becomes uneconomic between ~50k–100k identities, which is the Step 2/3 window. Steps beyond — Step 3 (managed multi-process graph service), Tier 3a/3b (per-tenant storage/worker isolation, independently sellable), Step 4 (cells), and the parallel intra-tenant-partitioning + federated/edge track for whale/regulated estates — are all trigger-driven per the staircase, not pulled forward. Until Step 2's trigger, Mongo + materialization is the substrate.

D2 — Materialization policy: freeze new materialized graph stores; read-time composition is canonical

The same graph is already materialized six ways (the Context list above), and that multiplicity — not the substrate — is the root of the drift defects in #1289 (e.g. execution_chains stale after stitch). Therefore:

  • No new materialized graph-edge stores. New graph-derived surfaces are produced by read-time composition over the canonical sources, following the pattern proven in #1292 (mergeCorrelationEdges): query the source at read time, bounded and index-backed, deterministic.
  • Scope of the freeze (review carve-out). The freeze applies to full-graph edge projections that duplicate the canonical edge stores. It explicitly does not forbid: (a) new indexes on existing collections (e.g. the partial correlation index in D5), or (b) a bounded query-result cache keyed on (tenant, predicate-hash, sync_version) — which read-time composition will need at scale and which is invalidated by sync version, not a parallel source of truth.
  • Carve-out (2026-06-02): one chain-of-record collection. The Step-0 chain contract (D5 / sv0-platform#1353) introduces exactly one new collection, authority_chains, holding the ordered, edge-bearing execution chain (chain_steps[]) that no current store holds — the gap proven on live data in D6. This is permitted because it is not "yet another parallel projection": it is the topology source-of-record from which the existing flat path topology is derived, governed by strict drift control — (i) a single writer (the path-materializer; nothing else writes chain topology), (ii) a chain_contract_version startup gate + re-materialization worker mirroring the existing chain_builder_version machinery, and (iii) derive-everything-else: authority_paths is not duplicated but derived in topology from the chain in the same materialization pass (it continues to own only its evidence-derived current_state, rotation-stable path_lineage_id, and role/action composition_hash, which are not graph topology and cannot live in the chain). Net effect on drift: the chain reduces the number of independent traversals (it is the seed for the D5 shared-traversal layer), rather than adding a fourth divergent one. Any future graph-store proposal must clear this same single-writer + version-gate + derive-downstream bar or fall under the freeze.
  • Cost honesty. Read-time composition trades write-time amortization for per-query latency; it is not the "O(1) materialized read" of D1. Every D3 query pays predicate-resolution + BFS + correlation-merge each time. This is acceptable only with the D3 cost bounds (maxTimeMS, seed-set cap) and the result cache above; the cost must be measured (D0), not assumed.
  • Read-time composition is currently aspirational, not implemented. Three consumers traverse the graph independently today (chain-builder, path-materializer, subgraph-adapter) — "single source of truth" is the goal, not the state. A shared traversal layer consumed by all three (the D5 edge-traversal registry is its seed) is the precondition for this decision to be real.
  • The existing materialized stores are frozen in count. Their drift is closed by D5, not by adding more.
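To make the "(tenant, predicate-hash, sync_version)" cache key from the freeze carve-out concrete, here is a minimal sketch. The function names are hypothetical, and the FNV-1a hash is a stand-in chosen to keep the example dependency-free; a production key would more plausibly use a cryptographic digest. The property being illustrated is that logically identical predicates map to the same key regardless of property order, and that a new sync version changes every key, so stale entries are never served.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Serialize with sorted object keys so equivalent predicates produce identical text.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map(canonicalize).join(",") + "]";
  const obj = value as { [key: string]: Json };
  const keys = Object.keys(obj).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k] as Json)).join(",") + "}";
}

// 32-bit FNV-1a; illustrative stand-in for a real digest.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

function resultCacheKey(tenantId: string, predicate: Json, syncVersion: number): string {
  return tenantId + ":" + fnv1a(canonicalize(predicate)) + ":" + String(syncVersion);
}
```

Because the sync version is part of the key, invalidation needs no separate bookkeeping: entries written under an older version simply stop being looked up.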

D3 — Query model: a deterministic predicate query layer (a new query engine, scoped honestly)

Introduce the missing query half so the graph answers questions, not just "neighborhood of X". The review's correction, accepted: this is a new query-execution component, not a small extension of the existing single-seed BFS. The current SubgraphQuery is seed-only, the route hard-requires seed_id, the existing indexes cover resource_id/accessor_id but not sensitivity/account/source-system predicates, and entity text search today is an unanchored regex (a full collection scan). D3 must therefore ship with an explicit contract, not "reuse the BFS unchanged":

  • New endpoint POST /api/v1/graph/query taking a deterministic predicate: entity facets (the ones /entities already supports server-side), edge-verb constraints (e.g. traverse only write / escalation edges), and target-class reachability (resource sensitivity, account, source system). It returns a bounded result (graph, or — see D4 — an aggregate summary).
  • A formal query contract is part of D3, not a follow-up: a predicate grammar; a maximum seed-set size (if a predicate resolves to more than the cap, return a deterministic representative subset with truncation_reason: "predicate_set_too_large" — never fan BFS out from all N seeds); a hop bound; a per-query maxTimeMS and cost budget; result semantics (rows vs graph vs aggregate); and cursor pagination.
  • Heterogeneous seed-set traversal must be designed, not assumed: the current executionFlowTraversal branches its reverse-edge vocabulary on workload-vs-identity, so a mixed seed-set has no single traversal vocabulary. The BFS engine's safety caps and correlation merge are reusable; the per-seed traversal-mode selection and cross-seed deduplication (a shared visited set across seeds) are new.
  • New index work is required despite "no schema migration": the analyst's entry point (type a partial ARN/account/property → render as graph) needs a real text or anchored-prefix index — the existing unanchored $regex on display_name/source_id does not use an index and is a full tenant scan at F500 size.
  • A search / "render as graph" entry from findings / exposures / authority-paths, so an analyst goes from a question to a bounded graph instead of having to already know the seed.

All deterministic: facet filters, path search, deterministic truncation — no relevance ranking. The seed-set cap + per-hop caps + maxTimeMS together close the I/O-amplification (DoS) surface the reviews flagged.
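The seed-set cap in the query contract can be sketched as follows. This is an assumption-laden illustration, not the endpoint's implementation: the function and field names other than truncation_reason are hypothetical, and "first N by sorted id" is only one possible reading of "deterministic representative subset".

```typescript
interface SeedResolution {
  seeds: string[];   // the seeds BFS is allowed to start from
  matched: number;   // how many entities the predicate actually matched
  truncation_reason?: "predicate_set_too_large";
}

// Cap the seed set deterministically: same input set, same output subset.
function capSeedSet(matchedIds: string[], maxSeeds: number): SeedResolution {
  const sorted = matchedIds.slice().sort();
  if (sorted.length <= maxSeeds) {
    return { seeds: sorted, matched: sorted.length };
  }
  return {
    seeds: sorted.slice(0, maxSeeds),
    matched: sorted.length,
    truncation_reason: "predicate_set_too_large",
  };
}
```

Reporting matched separately from seeds.length keeps the truncation visible to the caller, which is what prevents the BFS from ever fanning out from all N seeds without the analyst knowing a cap applied.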

Dependency on Step 0. A predicate can return a seed-set and its bounded neighborhood today, but a genuinely multi-hop, cross-system answer ("everything reachable from external systems into restricted S3, with the full chain") cannot be returned with ordered, edge-bearing provenance until the Step 0 chain contract (D5) lands (the depth ceiling is MAX_AUTH_CHAIN_DEPTH = 2; provenance, not depth, is the binding limit). D3 ships its single-/bounded-hop form on the current substrate; its full multi-hop promise is gated on Step 0, not on a graph engine.

D4 — Rendering invariant: bound the rendered node set; comprehensibility is a query concern

The graph never renders 1000+ nodes. A comprehensible view (target on the order of a few hundred nodes, then client aggregation/grouping for density) is produced by server-side query/filter to a bounded set, not by loading a capped page and filtering in the browser. Filtering moves server-side; the client renders the already-bounded result.

But "don't render 1000" is not "the answer is small" (review correction). When a legitimate query's true answer exceeds the render budget (e.g. "all cross-system write paths" in a 50k-entity tenant is genuinely thousands of nodes), truncating to N and rendering a silent subset is the same "draw what you found" failure D3 is meant to cure, relocated to the answer side. Therefore the query layer must, when the bounded result would exceed the render budget, return a server-side aggregate summary instead of a truncated node list: true total count, counts grouped by account / sensitivity / source-system / finding-type, representative exemplar paths, and a drill-down token to expand a cluster into a bounded subgraph. The response always reports the true result size (not nodes.length) so the analyst can tell "all of it" from "150 of 8,000." Client aggregation handles density within the rendered set; server aggregation handles the beyond-budget case.
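The switch from a node list to an aggregate summary can be sketched as a single shaping step on the server. The types and names below are hypothetical, and the sketch simplifies in two labeled ways: exemplars are node ids rather than paths, and the drill-down token is omitted.

```typescript
interface ResultNode {
  id: string;
  account: string;
  sensitivity: string;
}

interface GraphResponse {
  kind: "graph";
  total: number; // true result size
  nodes: ResultNode[];
}

interface AggregateResponse {
  kind: "aggregate";
  total: number; // true result size, not the number of exemplars
  byAccount: Record<string, number>;
  bySensitivity: Record<string, number>;
  exemplarIds: string[];
}

function countBy(nodes: ResultNode[], key: "account" | "sensitivity"): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const node of nodes) counts[node[key]] = (counts[node[key]] || 0) + 1;
  return counts;
}

// Return the full graph when it fits the render budget, otherwise a summary.
function shapeResponse(
  nodes: ResultNode[],
  renderBudget: number,
  exemplarCount: number
): GraphResponse | AggregateResponse {
  const sorted = nodes.slice().sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  if (sorted.length <= renderBudget) {
    return { kind: "graph", total: sorted.length, nodes: sorted };
  }
  return {
    kind: "aggregate",
    total: sorted.length,
    byAccount: countBy(sorted, "account"),
    bySensitivity: countBy(sorted, "sensitivity"),
    exemplarIds: sorted.slice(0, exemplarCount).map((n) => n.id),
  };
}
```

In both branches total carries the true result size, so the client can always tell a complete answer from a summarized one.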

D5 — Operational must-fixes before a Fortune-500 tenant (in cliff-arrival order)

Phase 0 is the staircase's Step 0 (chain contract) + Step 1 (pipeline stabilization), then the #1289/#1292-era fixes. Per the research, the pipeline cliffs bite before the graph cliff, so they lead — and the chain contract is foundational because it unblocks both full-chain reasoning (the product architecture) and D3's multi-hop predicate answers.

Step 0 — chain contract (foundational): promote AuthorityChain to a first-class persisted entity with ordered chain_steps (today only auth_chain_depth / via_identity are stored); authority_paths' topology becomes derived from the chain (it retains its evidence-derived current_state, rotation-stable path_lineage_id, and role/action composition_hash — see the D2 carve-out). Lift the MAX_AUTH_CHAIN_DEPTH = 2 hop ceiling further behind a per-tenant flag after the ordered-steps contract lands (deeper paths without ordered steps are not evidence-grade). Convert the materializer's silent path-removal safety breaker to an operator-visible signal. D3's multi-hop predicates cannot return evidence-grade chains until this lands. Tracked: sv0-platform#1353.
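One hedged way to picture the chain contract is as a record whose ordered steps each carry their own edge and source system, from which a flat node path can be derived. Only the names AuthorityChain and chain_steps come from this ADR; every other field name and both helper functions are hypothetical.

```typescript
interface ChainStep {
  order: number;         // position in the chain, starting at 0
  from_entity: string;
  to_entity: string;
  edge_type: string;     // the edge that justifies this hop
  source_system: string; // system that reported the edge
}

interface AuthorityChain {
  chain_id: string;
  chain_steps: ChainStep[];
}

function orderedSteps(chain: AuthorityChain): ChainStep[] {
  return chain.chain_steps.slice().sort((a, b) => a.order - b.order);
}

// A chain is usable as evidence only if each step starts where the previous one ended.
function isContiguous(chain: AuthorityChain): boolean {
  const steps = orderedSteps(chain);
  for (let i = 1; i < steps.length; i++) {
    if (steps[i].from_entity !== steps[i - 1].to_entity) return false;
  }
  return steps.length > 0;
}

// Derive the flat node path (the kind of topology a path record needs) from the chain.
function deriveNodePath(chain: AuthorityChain): string[] {
  const steps = orderedSteps(chain);
  if (steps.length === 0) return [];
  return [steps[0].from_entity].concat(steps.map((s) => s.to_entity));
}
```

The derivation runs in one direction only, chain to path, which mirrors the single-writer, derive-everything-else rule in the D2 carve-out.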

Step 1 — pipeline stabilization (Cliffs 1–3, the first to bite):

  • Cliff 1 — bound the full-tenant evaluator reads. Replace queryEntities(tenantId, { limit: 0 }) and the active-authority-path full read (both load the whole tenant into memory each evaluation cycle) with streamed/partitioned reads.
  • Cliff 2 — decouple evidence aggregation from the sync critical path. Move the 30/60-day execution-evidence rollups out of inline materialization into an async projection, so sync time stops scaling with evidence volume.
  • Cliff 3 — durable, fair-share job queue. Replace the in-memory FIFO with a persistent queue + per-tenant lanes — this also subsumes the #1289 E2 debouncer-durability gap and the multi-tenant fairness item below.
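The "per-tenant lanes" part of Cliff 3 can be illustrated with a round-robin scheduler. The sketch below shows only the fair-share policy and is deliberately in-memory, so it does not address the durability half of the fix; the class and field names are hypothetical.

```typescript
interface Job {
  tenantId: string;
  name: string;
}

class FairShareQueue {
  private lanes: Record<string, Job[]> = {};
  private order: string[] = []; // tenants in first-seen order
  private cursor = 0;           // index of the lane to serve next

  enqueue(job: Job): void {
    if (!this.lanes[job.tenantId]) {
      this.lanes[job.tenantId] = [];
      this.order.push(job.tenantId);
    }
    this.lanes[job.tenantId].push(job);
  }

  // Serve lanes round-robin so one tenant's backlog cannot starve the others.
  dequeue(): Job | undefined {
    for (let i = 0; i < this.order.length; i++) {
      const idx = (this.cursor + i) % this.order.length;
      const lane = this.lanes[this.order[idx]];
      if (lane.length > 0) {
        this.cursor = (idx + 1) % this.order.length;
        return lane.shift();
      }
    }
    return undefined;
  }
}
```

Compared with a single FIFO, a tenant that enqueues a large burst waits behind its own jobs only, while other tenants keep being served in turn.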

#1289 / #1292-era operational fixes:

  1. Incremental execution_chains rebuild + re-run on stitch. Today chain assembly rebuilds the entire tenant on every sync (assembleExecutionChains takes no affected-set; findEntryPoints uses limit:0) and is not re-run by the stitcher (the largest scale risk and a live #1289 defect). Scope the rebuild to an affected-entity set (the stitcher already computes changedEntityIds) — this is a structural chain-builder API change, not just a new trigger. Resolve the double-rebuild hazard: chains run in sync step 9 and would re-run post-stitch; specify that the post-stitch run is incremental over affected entities only, and does not duplicate the sync-time run. Extends ADR-026 (which governs when chains re-materialize) with how much.
  2. Document-size / write-amplification monitoring. Implement the instrumentation 03-database.md already specifies (BSON size tracking via Object.bsonsize(), warn at 8MB / 30s) and remove the phantom "doc-size guard" comment in path-materializer.ts that describes code which does not exist. The 16MB document limit on a high-fan-out role or account (unbounded execution_paths[] / accessible_by[]) is a hard write cliff with no runway today.
  3. One tested edge-traversal registry — sequenced after (2). Multiple hand-maintained traversal vocabularies have drifted and drop TRUSTS (cross-account AssumeRole — the primary enterprise lateral-movement edge) in both graph and path layers. Unify behind one registry consumed by chain-builder, path-materializer, and subgraph (and reconciled with the domain RELATIONSHIP_TYPES, ingest accepted-types, and the UI blast access-class map), with a test asserting they agree. Sequencing: adding TRUSTS to traversal expands execution_paths[] for every cross-account assume-role identity, so it must land after (2)'s monitoring to catch any entity crossing the doc-size warn threshold.
  4. Correlation garbage collection. Correlations are soft-deprecated and never removed; the tenant_entity_ids index has no partial filter, so the read-time merge scans deprecated rows and discards them in memory (a D2 cost). Correction (#1325): a partial index keyed on absence of deprecated_at is not directly expressible — MongoDB partialFilterExpression does not support $exists: false (only $exists: true, equality, ranges, $type), and correlations carry no positive "active" marker (status is unused/undefined in live data). So the index optimization requires either adding a positive active marker field to the write path to index on, or — simpler — relying on the hard-delete below to keep the full tenant_entity_ids index small. Implement the hard-delete of correlations deprecated beyond a retention window (e.g. 90 days) via a scheduled off-host job; treat the partial index as a follow-on only if the GC proves insufficient.
  5. Restore the #1289 defects this list originally dropped. The review noted D5 was a strict subset of the #1289 audit. Re-include: (a) the in-memory stitch debouncer durability gap (a process restart loses pending stitch timers → stitched_paths stale, #1289 E2); (b) the reachability_drift baseline reading raw execution_paths including scope: ids without concreteResourcePaths() → false drift findings (#1289 F); (c) the evidence-pack cross_system_auth section is correlation-blind (evidence/sections.ts) — this is the auditor-facing artifact for an enterprise customer, arguably higher priority than the Explorer render, and the Linkage-Proof card has the same gap.
  6. Query operational guardrails (prerequisite for D3 in production). Per-query maxTimeMS on every BFS/predicate Mongo call (none today; socketTimeoutMS is the only bound and a depth-10 query can hold a pooled connection for 30s); cursor pagination for large bounded results; per-tenant query concurrency limits (noisy-neighbor: one tenant's broad-predicate BFS storm must not starve others on the shared API process); and a decision on RBAC / authorization scope for predicate queries (whether all authenticated tenant users may enumerate "everything identity X can reach").
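For item 3, the "test asserting they agree" could take roughly the following shape. This is a sketch under assumptions: the registry contents are placeholders built from edge names this ADR mentions, the consumer names are used as plain labels, and registryDrift is a hypothetical helper rather than existing code.

```typescript
// Placeholder registry; the real traversable vocabulary is larger.
const EDGE_REGISTRY: string[] = ["BRIDGES_TO", "SAME_ENTITY", "TRUSTS"];

interface Drift {
  missing: string[]; // in the registry but not traversed by the consumer
  extra: string[];   // traversed by the consumer but not in the registry
}

// Report every consumer whose traversal vocabulary disagrees with the registry.
function registryDrift(
  registry: string[],
  consumers: Record<string, string[]>
): Record<string, Drift> {
  const report: Record<string, Drift> = {};
  for (const name of Object.keys(consumers).sort()) {
    const vocab = consumers[name];
    const missing = registry.filter((e) => vocab.indexOf(e) === -1).sort();
    const extra = vocab.filter((e) => registry.indexOf(e) === -1).sort();
    if (missing.length > 0 || extra.length > 0) report[name] = { missing, extra };
  }
  return report;
}
```

An empty report is the passing condition; a consumer that silently drops TRUSTS would appear with that edge under missing.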

D6 — Deterministic cross-view navigation: render the path's slice, not a re-derived neighborhood

The path/chain views and the graph view are two perspectives on the same connected data, but moving between them is lossy and non-deterministic: the existing hops pass a single seed_id + a fixed depth, so the graph re-derives a BFS neighborhood instead of projecting the exact slice the analyst came from. Verified on live nimbus-cloud Jira→AWS data (tracking issue #1324): the one cross-system path (stitched_paths 51da3f9a…, jira_cloud → aws_lambda → aws_iam) is a 6-node slice, but no single store holds it — the authority_path node set excludes the bridge endpoints, the correlation holds only the bridge endpoints, and the full slice requires a UNION across stitched_paths + authority_paths + correlations with edges re-derived. The "view this path in the graph" hop from the Jira-side finding renders {workload, connection, lambda-leaf} and dead-ends at the bridge (bridge peers are pulled as leaves, not re-expanded — subgraph-adapter.ts:160), omitting the IAM identity, permission, and terminal scope — i.e. the path's actual destination is invisible. The Lambda-side finding renders a different partial slice. Neither equals the path.

Therefore, the deterministic-hop / induced-subgraph contract:

  • The graph accepts a slice, not just a seed. Generalize the subgraph entry (or fold into the D3 POST /graph/query) to take { seed_ids[] | path_id } and return the induced subgraph — exactly the slice's nodes + the edges among them + the contributing correlation edges. The adapter already operates on a node set internally (seedIds = [...nodeMap.keys()]), so this is an entry-point generalization, not a new engine — it is the seed-set form of D3.
  • Forward hops carry the slice. Replace seed=<single>&depth=N in the finding / exposure / access-path views with the path's path_id (or its resolved entity-id set), so the destination renders the exact path.
  • A reverse hop (graph node / selection → "access-paths containing this entity") via an entity_id → path_ids reverse index — none exists today.
  • Edges must agree across views (the D5 edge-traversal registry) and the slice becomes a stored shared object once the Step 0 chain contract (D5) persists ordered chain_steps — at which point the hop is simply "render the chain," with no re-derivation.
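The induced-subgraph contract in the first bullet can be stated in a few lines: given a set of node ids and a pool of candidate edges, keep exactly the edges whose endpoints are both in the set. The sketch below assumes edges have already been gathered from relationships and correlations into one list; the type and function names are hypothetical.

```typescript
interface SliceEdge {
  from: string;
  to: string;
  type: string;
  origin: "relationship" | "correlation"; // which store contributed the edge
}

interface InducedSubgraph {
  nodes: string[];
  edges: SliceEdge[];
}

// Project a slice: its nodes plus only the edges that connect two of them.
function inducedSubgraph(sliceNodeIds: string[], candidateEdges: SliceEdge[]): InducedSubgraph {
  const inSlice: Record<string, boolean> = {};
  for (const id of sliceNodeIds) inSlice[id] = true;

  const nodes = Object.keys(inSlice).sort();
  const edges = candidateEdges
    .filter((e) => inSlice[e.from] === true && inSlice[e.to] === true)
    .sort((a, b) => {
      const ka = a.from + "|" + a.to + "|" + a.type;
      const kb = b.from + "|" + b.to + "|" + b.type;
      return ka < kb ? -1 : ka > kb ? 1 : 0;
    });
  return { nodes, edges };
}
```

Unlike a seed-plus-depth BFS, this never pulls in off-path neighbors and never drops a node the slice names, so the rendered result depends only on the slice itself.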

Dependency / sequencing. D6 is the seed-set form of D3 and is unblocked end-to-end by D5 (Step 0 chain contract + edge registry). The induced-subgraph form ships on the current substrate for already-materialized paths before full multi-hop — giving deterministic hops now and exercising the seed-set traversal D3 needs anyway. The slice identity is already deterministic (content-addressed correlation._id / stitched_paths._id / authority_path.composition_hash); what is missing is projecting it into the graph and one shared edge definition.


Consequences

Enables. An F500 analyst can ask the graph predicate questions and get bounded, comprehensible answers; filtering scales because it runs server-side over indexes rather than over a 200-row client page; the substrate is safe for a real enterprise tenant; the drift class from #1289 is closed at the source (no new stores, unified traversal, incremental rebuild). An analyst can move deterministically between the path/chain and graph perspectives: clicking through from a cross-system finding renders that exact path, not a lossy neighborhood (D6, #1324).

Costs (revised after review — the original draft understated this). D3 is a new query-execution engine, not a BFS tweak: a predicate grammar, a query planner, heterogeneous seed-set traversal with cross-seed dedup, cost bounds, pagination, and result-aggregation. It needs at least one new index (a text/prefix index for ARN/account/property search — the existing facet indexes do not cover the analyst's entry point). Read-time composition (D2) shifts cost to per-query latency and requires a result cache to stay acceptable. Observability (D0) is a hard prerequisite, not optional. The operational fixes (D5) include structural API changes (incremental chains) and re-included #1289 defects. Net: still not a re-platform, but a substantial query-layer build plus a real operational hardening pass — sequenced D0 → D5 → D3.

Explicitly out of scope (now). No graph database (D1 — deferred, not rejected). No ML or relevance ranking (determinism constraint). No new materialized stores (D2). No data-model rewrite — #1289 already concluded no storage migration is needed and #1292 proved the read-time pattern.

Risks if not done. The substrate hits the 16MB cliff or the chain-rebuild bottleneck on the first large tenant (D5), and the graph stays a "draw what you already found" tool that an F500 analyst cannot drive (D3), undercutting the headline value proposition exactly when it matters most.


Alternatives considered

  • Re-platform to a graph database now (Neo4j / Neptune). Rejected for now. The product is read-mostly and deterministic with bounded queries; Mongo + materialized reads serve the common case well, and history / point-in-time are things a graph store does worse. The thin-index migration is deferred to D1's triggers, not pulled forward. Caveat from review: the Neo4j-as-thin-index endgame reintroduces a write-side dual-write/consistency problem (the very thing D2 avoids), made harder by the split source (embedded relationships[] + the correlations collection). The StorageAdapter seam swaps reads; the write-side sync/backfill/reconciliation is unspecified and must be designed before the migration, not at trigger time.
  • Build the D3 query layer on a graph index now instead of on Mongo. Rejected: Mongo-now is ratified (CTO call, 2026-06-02; see Status). The argument for it was real (building D3 on Mongo risks building it twice if D1's trigger fires, which an F500 tenant may satisfy on day one, per 03-database.md's own 10k-identity threshold), but it was outweighed: a graph index now front-loads operational risk (a second datastore, dual-write) before the query model has proven its shape, and the Mongo predicate layer is the cheapest way to learn the real query patterns. The build-twice risk is contained by the StorageAdapter seam (D3's read path re-targets the Step-2 engine later). If the D1 triggers fire and the data shows a graph index is warranted, this ADR is revised before that build.
  • Keep seed-only queries and improve client-side filters. Rejected. It cannot scale past a capped page and is the model the rendering invariant (D4) explicitly forbids.
  • Keep the single-seed + fixed-depth cross-view hop (status quo). Rejected (D6). Verified on nimbus-cloud data to be lossy and non-deterministic — it re-derives a neighborhood that omits the path's terminal and includes off-path nodes. The induced-subgraph (seed-set / path_id) hop renders the exact slice and is the seed-set form of D3, not new machinery.
  • Denormalize correlations into entity.relationships[] (so all consumers "just see" bridges). Rejected in #1289/#1292 — three verified failure modes (sync clobbering, soft-delete overwrite, deprecation zombies). Read-time composition (D2) is the chosen pattern.
  • Add yet another materialized projection for each new graph surface. Rejected (D2) — that multiplicity is the root cause of the drift defects this ADR aims to stop.
  • Adopt a tuple/relationship authorization engine (OpenFGA / SpiceDB / Cedar) as the core store. Rejected per the position memo. Those engines answer "can subject X access object Y right now?"; SecurityV0 is broader — stateful exposure, ownership decay, drift over time, source-system provenance, and SIEM-grade evidence. Their tuple vocabulary is a useful interop/export model (cf. ADR-009 OAA projection), not the core store.
  • Federated / edge processing now. Deferred, not rejected. For mega / regulated estates (tens of millions of resources, data-residency procurement, ephemeral Kubernetes) the position memo names federated/edge processing — the customer holds raw graph state locally; SecurityV0 receives findings, posture summaries, evidence hashes, and selected projections — as the primary enterprise path for the largest tier, not a near-term replatform. It presupposes stable resource identity, durable event semantics, and a customer-side agent the architecture does not yet have; it sits beyond Step 4 / on the parallel track.