Graph Scalability and Migration Strategy

:::tip Reading order
If you are a founder, investor, or executive buyer asking "what company is SecurityV0 building, and what is the data-model commitment?", read the one-page memo first: Graph Data Model — Position. This document is the engineering execution path for the position the memo takes: the four cliffs, the migration staircase, the trigger thresholds. The sizing numbers live in the companion sizing document.
:::

Part I — For the Founder, the Investor, and the Buyer

The question we have to answer

Every CISO buyer eventually asks the same question, in different words:

"When we plug in our full estate — millions of resources, tens of thousands of identities, AI agents spawning workloads continuously — does the answer to 'what can this thing reach?' still come back in seconds?"

This is the question that decides whether we ship a posture tool, or whether we ship the system of record for non-human identity authority. Two recent independent reviews of the platform converged on the same concern:

An architecture audit (April 2026) flagged the data model as the most consequential long-term decision, and called out the specific scaling cliffs to watch for.
A security investor advisory call (May 2026) raised the same concern from a market-pattern perspective, citing failure modes seen in earlier graph-based security products: short-hop reasoning works, deeper multi-hop reasoning becomes either expensive or limiting. The strategic warning was: design for full-chain reasoning early, accept bounded-hop limitations honestly, or expect a major re-architecture later.

The one-paragraph answer

We made a deliberate trade-off: pre-compute the authority graph at ingestion time so that read queries are sub-second at MVP scale. This works because the storage layer is a real seam — connectors and evaluation rules go through a business-shaped interface rather than the database directly. The migration to a native graph engine is incremental and architecturally prepared, but it is not free above the storage seam. Some surfaces — evaluator, authority-path materialization, evidence pack assembly, and the read/traversal surfaces (graph, blast, chains) — consume the materialized shape today and will be refactored alongside any engine swap. The migration is bounded work, not a rewrite, and is staged so each step is independently justified.

Why we believe the answer holds up

Three things are simultaneously true, and all three matter:

Today's model is honestly bounded. It works at MVP scale. It does not pretend to scale infinitely.
The cliffs are identified and named. They come in a specific order, and the first ones to bite are pipeline cliffs — full-tenant evaluation reads, synchronous evidence aggregation inside the sync loop, and an in-memory job queue. The graph cliff — role fan-out — comes after those.
The escape route is staged. The storage seam is a real boundary for traversal queries. It is not a complete containment boundary for the pipeline that feeds the graph. The migration plan separates pipeline work from engine work and orders them correctly.

What's honestly still open

We do not yet have an answer for the following, and we should say so:

Three pipeline cliffs arrive before the graph cliff. The evaluation phase loads the full tenant entity set into memory each cycle. Authority-path materialization runs heavy evidence aggregations synchronously during sync. The job queue is in-memory and not persistent. Any one of these will bite before role fan-out does on a moderately active tenant.
A read-side cliff the original four miss entirely. Every cliff above is a write/infra cliff (memory, sync window, queue). There is a distinct read-side limit: graph queryability today is seed-anchored — the graph query layer requires a starting entity and has no predicate/browse query path, so the analyst-facing browse surface loads a capped entity inventory and filters in the browser. At mid-market scale (Archetype C, ~3M edges) and above, that read model is inadequate for an analyst before write-side role fan-out bites. The fix — a predicate query layer plus large-answer aggregation — lives in ADR-031, not in this paper.
Middle-layer API correlation — when a service receives a call, then makes outbound calls, our graph models the identities and resources but does not yet model the inbound-to-outbound call linkage. This is a future modeling gap, not a scaling one.
Kubernetes cardinality — we do not have a Kubernetes connector yet. When we build one, we will need to decide whether to model individual pods (cardinality explosion under autoscaling) or the workload abstractions above them (Deployments, StatefulSets).
Cross-system chain provenance — promoted out of "open" into Step 0. The materialized path record today carries depth and via-identity but not the full ordered identity/credential/system chain. The chain contract addition described in Part III, Step 0 is the planned fix.

For the per-customer-archetype sizing, decision-point grid, and cutoff thresholds, see the companion document: Scalability Sizing and Decision Points. Everything else in this document is forward design backed by an architecture already in place.

Terms used in this document

Brief definitions for a first-time reader; each is expanded where it first matters in the body.

Non-human identity (NHI): service principals, IAM roles, OAuth apps, automation accounts, AI agents — every identity that is not a human user. Estate size in this paper is measured in NHIs, not headcount.
The staircase: the migration plan in Part III. Five steps (0–4) plus two intermediate isolation tiers (3a, 3b) and one parallel workstream. Each step is independently triggered by a real signal, not a calendar date.
The four cliffs: the named scaling failure modes the architecture must navigate. Three are pipeline cliffs (Cliffs 1–3, inside the platform's own runtime); one is the graph cliff (Cliff 4, write amplification on the materialized graph). See Part II for the numbered list.
The seam: the storage abstraction layer. Every read and write goes through one business-shaped interface rather than through the database directly. The seam is what makes engine migration bounded — but it is not a complete containment boundary (see Part IV).
The chain contract: the data-model commitment that AuthorityChain is a first-class persisted entity with ordered chain_steps, and that bounded materialized authority_paths is a derived projection of it. Today, ordered chain steps are not persisted; promoting them is the Step 0 contract change and the architecture-level commitment described in the position memo.
The hop ceiling: the maximum authority-chain depth the materializer pre-computes today. Currently 2 (raised from 1) — enough to reach a three-system chain. Lifting it further is part of Step 0, but only after the chain contract is in place — deeper paths without persisted ordered steps produce results the rest of the platform cannot reason over.
Customer archetypes (A–E): five named customer profiles ranging from a 50-employee demo tenant (Archetype A, ~150 NHIs) to a 50,000-employee regulated enterprise (Archetype E, ~600K NHIs). Defined in the companion sizing document.
Production baseline: MongoDB Atlas M10 in aws:eu-west-1 plus two Azure Standard_B2s VMs — the shape ADR-020 and ADR-022 commit to. Dev / PR-preview environments run on Docker Compose with intentionally smaller footprints and are not a production reference.

Part II — The Architectural Trade-Off

Why graph queries are different from document queries

The platform answers two kinds of questions, and they have very different cost structures:

| Question | Shape | Cost in a Document Store | Cost in a Graph Engine |
| --- | --- | --- | --- |
| "Show me this entity and its properties" | Point lookup | Trivial | Trivial |
| "Show me everything within N hops of this entity" | Traversal | Increasingly expensive with N | Designed for this |

A document database is excellent at the first question and forced into awkward patterns for the second. A native graph database is the inverse. The platform's value proposition — "what can this identity actually reach, and through what chain?" — is the second kind of question.

There are two ways to make a document database serve graph queries:

Recursive query at read time. Walk the graph hop-by-hop on every request. Predictable cost on small graphs. Becomes a query storm on deep traversals. This is the pattern that competitor products have publicly struggled with on large estates.
Materialize the answers at write time. Pre-compute the reachable-from sets when data arrives, store them as fields on the entity, and serve read queries as O(1) lookups. Predictable read cost at any scale. Cost moves to write time and grows with the complexity of how shared a given role is — and with how much downstream pipeline work depends on the materialized shape.
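The two patterns can be contrasted in a minimal sketch. The node names, the `EDGES` map, and the helper functions are hypothetical, and a hop is simplified here to mean one edge, which may not match how the platform counts depth.

```python
from collections import deque

MAX_AUTH_CHAIN_DEPTH = 2  # hop ceiling named in this document

# Hypothetical adjacency map: node -> nodes it can directly reach.
EDGES = {
    "svc-a": ["role-b"],
    "role-b": ["db-c"],
    "db-c": ["vault-d"],
    "vault-d": [],
}

def reachable(edges, start, max_depth):
    """Pattern 1: walk hop by hop at read time, bounded by max_depth."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # ceiling reached; do not expand further
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                frontier.append((nxt, depth + 1))
    return found

def materialize(edges, max_depth):
    """Pattern 2: run the walk once per node at write time and store the result."""
    return {node: reachable(edges, node, max_depth) for node in edges}

REACH = materialize(EDGES, MAX_AUTH_CHAIN_DEPTH)  # cost paid at ingestion

def blast_radius(entity_id):
    """Read path: a single dictionary lookup, no traversal."""
    return REACH[entity_id]
```

In this sketch `blast_radius("svc-a")` stops at `db-c` because of the depth bound; `vault-d` only appears if the ceiling is raised, which is the trade the second pattern makes.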

We chose the second pattern as the spine — but the live architecture is honestly a hybrid, and the hybrid is a strength worth stating plainly. Intra-system reach is materialized at write time. Cross-system reach — identities that are the same actor across systems, and bridges between separate systems — is composed at read time from a separate cross-system identity store (shipped 2026-05), merged into the answer on each request and bounded by explicit caps. This is a deliberate read-time merge, not a denormalization onto every entity. So the accurate claim is not "we rejected read-time traversal wholesale"; it is "we pre-compute the expensive intra-system traversal and compose the cross-system layer at read time, under caps." That the seam already supports a bounded read-time composition is the earliest evidence that the migration to a richer read model is real, not aspirational. (Note: "cross-system identity correlation" here is distinct from the "middle-layer API correlation" gap discussed later — the former links the same or bridged identities across systems; the latter links a service's inbound calls to its outbound calls.)
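A rough sketch of the hybrid read path, under stated assumptions: the two dictionaries stand in for the materialized entity fields and the cross-system identity store, and `CROSS_SYSTEM_CAP`, the identifiers, and the `reach` function are all hypothetical.

```python
CROSS_SYSTEM_CAP = 3  # hypothetical cap on linked identities merged per request

# Written at ingestion: intra-system reach stored per entity.
MATERIALIZED_REACH = {
    "aws:role/deploy": ["aws:s3/artifacts", "aws:lambda/build"],
    "azure:sp/deploy": ["azure:kv/secrets"],
}

# Separate store: the same or a bridged actor in another system.
CROSS_SYSTEM_LINKS = {
    "aws:role/deploy": ["azure:sp/deploy"],
}

def reach(entity_id, cap=CROSS_SYSTEM_CAP):
    """Merge stored intra-system reach with capped cross-system links at read time."""
    result = list(MATERIALIZED_REACH.get(entity_id, []))
    links = CROSS_SYSTEM_LINKS.get(entity_id, [])
    for linked in links[:cap]:
        for resource in MATERIALIZED_REACH.get(linked, []):
            if resource not in result:
                result.append(resource)
    # The flag tells the caller that the cap cut the answer short.
    return {"resources": result, "truncated": len(links) > cap}
```

Nothing is written back onto the entity here; the cross-system layer is composed per request, which is the distinction the paragraph above draws from denormalization.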

What that choice buys us

Sub-second blast radius queries at MVP scale, regardless of authority chain depth within the materialization window.
Deterministic outputs that are evidence-grade — a finding generated today and the same finding regenerated tomorrow are byte-identical, because both come from a stable pre-computed snapshot.
Operational simplicity — one storage engine to run, one set of credentials, one backup story.
A real seam at the storage layer — connectors, evaluator rules, and the API never construct database queries directly. They go through a business-shaped interface.

What that choice costs us

A hop ceiling — but a less binding one than before. Multi-system authority chains are bounded by how many hops we materialize. That bound is now MAX_AUTH_CHAIN_DEPTH = 2 (raised from 1) — two hops, enough to reach a three-system chain such as Jira-webhook → AWS-Lambda → IAM-role → resource. The earlier framing that a three-system chain is "silently truncated" no longer holds; depth is no longer the headline limitation. The remaining honest gap is provenance, not depth: the materialized path records depth and via-identity but not the ordered chain steps, so the platform cannot yet reason end-to-end over the chain it traverses (the Step 0 chain-contract work). Lifting the ceiling further is still not a free knob — depth changes traversal cost and downstream evidence/finding semantics simultaneously.
Write amplification under role fan-out. When a permission attached to a widely-held role changes, the materialization has to update every identity that holds that role and every resource that role can now reach. A role held by thousands of identities reaching dozens of resources via tens of permissions can drive millions of read operations per change.
Sync-time work, not query-time work. The cost has moved, not disappeared. The sync cycle and the work that runs inside it become the bottleneck rather than the read path.
A seam, not a boundary. The storage abstraction insulates traversal queries from the engine underneath. It does not insulate the materializer, the evaluator, or evidence assembly from the shape of materialized data. Those surfaces consume embedded path arrays directly. Changing the engine without refactoring those surfaces preserves the shape; refactoring those surfaces is real work, not zero work.
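The fan-out figure can be made concrete with back-of-envelope arithmetic. The multiplicative model below is an assumption (one read per identity, resource, permission triple), and the inputs are illustrative rather than measured.

```python
def fanout_reads(holders, resources, permissions):
    """Upper-bound estimate: one read per (identity, resource, permission) triple."""
    return holders * resources * permissions

# Illustrative inputs: thousands of holders, dozens of resources, tens of permissions.
estimate = fanout_reads(holders=5_000, resources=24, permissions=10)
print(f"{estimate:,} reads for one role change")  # 1,200,000 reads for one role change
```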

Where the cliffs actually are

The single most important claim in this section: the cliff is not one cliff. It is four, and they arrive in order.

In arrival order:

Cliff 1 — full-tenant evaluator read. Each evaluation cycle loads the entire tenant's entities and active authority paths into memory.
Cliff 2 — synchronous evidence aggregation. The materializer aggregates 30- and 60-day execution evidence inline per path, per sync.
Cliff 3 — in-memory job queue. A single-process FIFO queue (first-in, first-out, in-memory) with no persistence and no fair-share scheduling between tenants.
Cliff 4 — role fan-out. A single widely-held role changing permissions triggers materialization across thousands of identities.

The first three are pipeline cliffs — they live inside the platform's own runtime, before storage. The fourth is the graph cliff — write amplification on the materialized graph. The graph cliff is the one investors and reviewers asked about.

Cliff 4 is, in truth, two-faced, and only the write face arrives last. The write-amplification face — recomputing materialized paths across every identity holding a changed role — arrives latest, around enterprise scale (Archetype D). But the same high-holder role also has a read/render face: a traversal or blast-radius answer for a role held by hundreds of identities is too large for a human to read, and that bites earlier, around mid-market scale (Archetype C). The platform already handles this, but bluntly: server-side traversal safety caps with explicit truncation flags, and client-side aggregation (group, supergroup, and overflow nodes) so the canvas stays legible. What it does not yet do is server-side aggregation with true counts — answer the question "this role reaches 4,000 resources" without shipping 4,000 nodes. That refinement lives in ADR-031. Results are capped and aggregated, not returned raw — so the gap is the quality of handling, not its absence.
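The difference between capping and aggregating with true counts can be sketched as follows. The response shapes, `NODE_CAP`, and both function names are hypothetical and are not the ADR-031 design.

```python
from collections import Counter

NODE_CAP = 5  # hypothetical per-response node cap

def capped_answer(resources, cap=NODE_CAP):
    """Capped shape: a truncated node list plus an explicit truncation flag."""
    return {"nodes": resources[:cap], "truncated": len(resources) > cap}

def aggregated_answer(resources, cap=NODE_CAP):
    """Aggregated shape: true totals and per-type counts, with only a sample of nodes."""
    by_type = Counter(r["type"] for r in resources)
    return {
        "total": len(resources),
        "by_type": dict(by_type),
        "sample": resources[:cap],
    }

# Synthetic data: a role that reaches 4,000 resources.
resources = [
    {"id": f"r{i}", "type": "bucket" if i % 4 else "queue"} for i in range(4000)
]
summary = aggregated_answer(resources)
```

The capped answer tells the analyst that something was cut; the aggregated answer says how much exists, which is the part the text identifies as missing.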

| Order | Cliff | Failure mode | Where it lives |
| --- | --- | --- | --- |
| 1 | Full-tenant evaluation read | Each evaluation cycle loads the entire tenant entity set into memory. On a large tenant, this is an O(N) memory spike per cycle. | Evaluation phase |
| 2 | Synchronous evidence aggregation | The materializer aggregates 30- and 60-day execution evidence inline per path, per sync. Cost scales with evidence volume, not entity count, and inflates the sync window on active estates. | Sync / materialization phase |
| 3 | In-memory job queue | A single-process FIFO with no persistence. Lost on restart; no fair-share scheduling between tenants; one large tenant's sync starves everyone else's. | Worker runtime |
| 4 | Role fan-out | A single widely-held role changing permissions triggers materialization across thousands of identities. The graph-shaped cliff the advisor warned about. | Path materialization |
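Cliff 3's starvation mode, and the fair-share lanes that Step 1 proposes for it, can be illustrated with a small round-robin sketch. The class and tenant names are hypothetical, and the sketch is deliberately in-memory: it shows the scheduling idea only, not the persistence that Step 1 also requires.

```python
from collections import OrderedDict, deque

class FairShareQueue:
    """Round-robin across tenants; each tenant keeps FIFO order in its own lane."""

    def __init__(self):
        self._lanes = OrderedDict()  # tenant_id -> deque of pending jobs

    def put(self, tenant_id, job):
        self._lanes.setdefault(tenant_id, deque()).append(job)

    def get(self):
        """Return (tenant_id, job) from the tenant at the front, or None if empty."""
        if not self._lanes:
            return None
        tenant_id, lane = next(iter(self._lanes.items()))
        job = lane.popleft()
        del self._lanes[tenant_id]
        if lane:
            self._lanes[tenant_id] = lane  # rotate this tenant to the back
        return tenant_id, job

q = FairShareQueue()
for job in ("b1", "b2", "b3"):
    q.put("big", job)       # a large tenant enqueues three jobs first
q.put("small", "s1")        # a small tenant arrives afterwards
order = [q.get() for _ in range(4)]
```

With a single FIFO the small tenant's job would run fourth; with per-tenant lanes it runs second.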

The implication for the staircase is important: investing in a graph engine before stabilizing the pipeline gets us a faster traversal layer sitting on top of an unreliable feed. Order matters.

Mapping the advisor's pain points to the model

| Pain point | How today's model handles it | What's needed at the next step |
| --- | --- | --- |
| Deep multi-hop reasoning becomes expensive | Bounded by the materialization depth; we accept this honestly. | Native graph engine: arbitrary depth at query time, no materialization debt. |
| Middle-layer API correlation (inbound ↔ outbound) | Not modeled. We track identities and resources, not call relationships between services. | New entity and edge types: a modeling extension, separate from scaling. |
| Serverless / ephemeral workloads | Modeled as workload subtypes (functions, agents, jobs). The model handles this correctly when the connector emits the right entities. | The storage layer is fine. The connector evidence path is what determines accuracy. |
| Kubernetes labels and autoscaling | No connector yet. When built, must aggregate at the workload level, not the pod level. | Connector design choice; defer until it lands. |
| Cloud-native estates with millions of resources | Today's pre-compute model becomes uneconomic somewhere between 50,000 and 100,000 identities. | Native graph engine; then intra-tenant partitioning for the largest single-tenant estates; then cells for cross-tenant isolation. |
| SIEM point-in-time alerts vs. stateful exposure | The platform tracks valid time only (`valid_at` / `expired_at` on entity versions); there is no separate transaction-time axis today. SIEM alerts can be consumed as evidence under the current model. | SIEM-as-temporal-correction (late-arriving alerts about past state that should retro-update the platform's posture) requires a transaction-time axis on entity versions. Additive schema change, not a rewrite, but it is not something the data model "already supports." |
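The last row turns on the difference between valid time and transaction time. A minimal sketch follows, assuming a hypothetical `recorded_at` field as the transaction-time axis; `valid_at` and `expired_at` are the fields named above, and everything else is invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass(frozen=True)
class EntityVersion:
    entity_id: str
    state: str
    valid_at: datetime              # valid-time start (field named in the text)
    expired_at: Optional[datetime]  # valid-time end; None means still valid
    recorded_at: datetime           # HYPOTHETICAL transaction-time field

def as_of(versions: List[EntityVersion], valid_time: datetime, known_at: datetime):
    """State valid at valid_time, using only what had been recorded by known_at."""
    candidates = [
        v for v in versions
        if v.recorded_at <= known_at
        and v.valid_at <= valid_time
        and (v.expired_at is None or valid_time < v.expired_at)
    ]
    # The latest-recorded version wins: a late correction overrides the earlier belief.
    return max(candidates, key=lambda v: v.recorded_at, default=None)

versions = [
    EntityVersion("key-1", "active", datetime(2026, 1, 1), None, datetime(2026, 1, 1)),
    # Late SIEM alert, recorded on Jan 10, says the key was compromised from Jan 5.
    EntityVersion("key-1", "compromised", datetime(2026, 1, 5), None, datetime(2026, 1, 10)),
]

before = as_of(versions, datetime(2026, 1, 6), known_at=datetime(2026, 1, 7))
after = as_of(versions, datetime(2026, 1, 6), known_at=datetime(2026, 1, 11))
```

Both queries ask about the same moment (Jan 6), but the answer depends on when the question is asked. Without the second axis, the platform cannot distinguish "what was true" from "what we believed at the time".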

Part III — The Migration Staircase

The migration is designed as five steps, plus two intermediate isolation tiers (Tier 3a and Tier 3b — see §Step 3.5) and one parallel workstream. Each step is independently triggered by a real signal, not a calendar date. Each step is additive — it does not invalidate prior work. Step 1 (pipeline stabilization) sits before the graph engine deliberately: without it, a graph engine inherits a fragile feed.

Step 0 — Promote the chain contract; stop hiding the ceiling

Three near-term changes: one is a small contract addition; two stop hiding existing limits. None move us off the document store.

Promote AuthorityChain with ordered chain_steps to a first-class persisted entity. Today, materialized authority paths record depth and the via-identity, but not the full ordered identity → credential → system → role → permission → resource chain. The chain contract adds that ordered record as a first-class entity; the existing bounded materialized authority_paths becomes a derived projection. This is the contract change that makes full-chain reasoning possible regardless of storage engine — without it, lifting the hop ceiling produces deeper paths the rest of the platform cannot reason over end-to-end. The position memo (Graph Data Model — Position) treats this as the architecture-level commitment that precedes the operational staircase.
Lift the hop limit further as needed, behind a feature flag. The depth bound has already moved from 1 to 2 — enough to reach a three-system authority chain such as Jira-webhook → AWS-Lambda → IAM-role → resource. Raising it beyond 2 should be treated as a contained change, not a knob, because depth changes traversal cost and the semantics of downstream evidence and finding rules. Feature-flagged rollout per tenant; observed before generalized. The chain contract (sub-step 1) is the prerequisite — without ordered chain steps persisted, deeper paths lose their explanatory value.
Surface, don't hide, the safety breakers. The materializer has a guard that silently blocks state updates when too many paths would be removed in one sync — protective against accidents, but indistinguishable to an operator from "the platform is broken." Convert it from silent block to operator-visible signal with a clear remediation path.
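The chain contract in the first item can be sketched as a data shape. `AuthorityChain` and `chain_steps` are the names used in this document; `ChainStep`, its fields, `project_path`, and the depth definition are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ChainStep:
    position: int  # order within the chain
    kind: str      # "identity", "credential", "system", "role", "permission", "resource"
    ref: str       # identifier of the entity at this step

@dataclass(frozen=True)
class AuthorityChain:
    chain_id: str
    chain_steps: List[ChainStep]

def project_path(chain: AuthorityChain) -> dict:
    """Derive the bounded path record from the ordered chain (a one-way projection)."""
    steps = sorted(chain.chain_steps, key=lambda s: s.position)
    identities = [s.ref for s in steps if s.kind == "identity"]
    return {
        "source": steps[0].ref,
        "target": steps[-1].ref,
        "depth": len(steps) - 1,  # ASSUMPTION: depth counted as number of links
        "via_identity": identities[-1] if identities else None,
    }

chain = AuthorityChain("chain-1", [
    ChainStep(0, "identity", "svc-a"),
    ChainStep(1, "credential", "key-1"),
    ChainStep(2, "system", "aws"),
    ChainStep(3, "role", "deploy"),
    ChainStep(4, "permission", "s3:GetObject"),
    ChainStep(5, "resource", "bucket-x"),
])
path = project_path(chain)
```

The projection discards the ordered steps, so it cannot be inverted. That is why the ordered record has to be the persisted entity and the path record the derived one.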

The chain contract is the only Step 0 change that is structurally additive; the other two are guardrail and observability changes. All three close gaps the review flagged as immediate.
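For the safety-breaker change in Step 0, a sketch of what "operator-visible" could look like. The threshold, the result type, and the remediation wording are all hypothetical; the point is only that a blocked sync returns a reason instead of silently doing nothing.

```python
from dataclasses import dataclass

MAX_REMOVAL_RATIO = 0.5  # hypothetical threshold for "too many paths removed"

@dataclass
class BreakerResult:
    applied: bool
    reason: str
    remediation: str

def apply_sync(current_paths: set, incoming_paths: set,
               max_ratio: float = MAX_REMOVAL_RATIO) -> BreakerResult:
    """Block mass removals, but return a signal an operator can see and act on."""
    removed = current_paths - incoming_paths
    ratio = len(removed) / len(current_paths) if current_paths else 0.0
    if ratio > max_ratio:
        return BreakerResult(
            applied=False,
            reason=f"sync would remove {len(removed)} of {len(current_paths)} paths",
            remediation="verify connector credentials and scope, then re-run or override",
        )
    return BreakerResult(applied=True, reason="within removal threshold", remediation="")

blocked = apply_sync({"a", "b", "c", "d"}, {"a"})
allowed = apply_sync({"a", "b"}, {"a", "b", "c"})
```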

Step 1 — Stabilize the pipeline before changing the engine

Trigger: any of the first three cliffs is observed in production, or the first tenant approaches roughly 25,000 entities, whichever comes first. The work should land before a tenant crosses that size.

What changes (four sub-steps, ordered):

Persistent job queue with fair-share tenant scheduling. Replace the in-memory FIFO with a durable queue. Add per-tenant priority lanes so one large tenant cannot starve others. Lost-job recovery becomes possible. Restart resilience becomes possible.
Formalize the event log. The platform already records typed change events. Two corrections promote them from a working log to a real event store: remove the time-based retention so the log is permanent, and add a per-tenant monotonic sequence number so events can be replayed in order and gaps can be detected. Without this, any downstream projection — including a graph engine read model — cannot prove its freshness or replay safely after a failure.
Decouple evidence aggregation from the sync critical path. The heavy 30- and 60-day evidence rollups currently run inside materialization. Move them to a separate, asynchronous projection driven by the event log, so sync cycle time stops scaling with evidence volume. The projection also publishes freshness markers that the evaluator can check before consuming.
Bound the evaluation reads. Replace the full-tenant in-memory reads — both the entity read and the active-authority-path read — with streamed or partitioned reads. This is a pre-condition for any tenant exceeding the working memory of a single process. Both reads matter; fixing one without the other only delays the cliff.
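The bounded-read sub-step can be sketched by contrasting a full load with batched iteration. The cursor, the rule, and the batch size are stand-ins; the real reads go through the storage interface, which is not shown here.

```python
from itertools import islice

def load_all(cursor):
    """Cliff 1 shape: the whole tenant is held in memory at once."""
    return list(cursor)

def iter_batches(cursor, batch_size):
    """Bounded shape: at most batch_size entities are held at a time."""
    it = iter(cursor)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def evaluate(entities):
    """Hypothetical rule: count entities flagged as over-privileged."""
    return sum(1 for e in entities if e["over_privileged"])

def fake_cursor(n):
    """Stand-in for a storage cursor; yields entities lazily."""
    for i in range(n):
        yield {"id": i, "over_privileged": i % 10 == 0}

# Same result as a full load, with peak memory bounded by the batch size.
total = sum(evaluate(batch) for batch in iter_batches(fake_cursor(1000), 100))
```

This only works for rules that can be evaluated per batch. Rules that need a whole-tenant view would need partitioning by a key the rule respects, which is a design question this sketch does not answer.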

What does not change: the connector contract, the API surface, the data model. This step is internal to the platform.

What this buys: the pipeline becomes capable of feeding a graph engine reliably. Without this step, a graph engine becomes another inconsistent projection on top of a fragile feed.

Why it must come before Step 2: the graph engine is a consumer of the pipeline. A faster traversal layer sitting on top of an unreliable feed produces faster wrong answers.

Step 2 — Native graph engine as a read model

Trigger: the first tenant approaching the role fan-out threshold (a single role held by ~1,000+ identities), or the first customer demanding query depth beyond what the materializer can pre-compute economically. Step 1 must be substantially in place.

What changes: an embedded, in-process graph engine is added alongside the existing document store. The document store remains the system of record for entities, evidence, findings, and audit history. The graph engine receives a projection and serves traversal queries.

What does not change:

The connector contract.
The API surface customers integrate against.
The managed-service footprint. No new managed service is added at this step; the graph engine runs in-process. The operational concerns it does introduce (per-tenant memory footprint, per-process cache warmup, restart-rebuild time) are real but bounded, and addressed below.

What does change above the storage seam: the materializer, authority-path assembly, and evidence pack assembly currently consume embedded path arrays directly. The refactor scope is broader than those three write/finding consumers, though: the read and structural surfaces — graph subgraph queries, blast-radius, execution chains, cross-system paths, the evidence cross-system-auth section, and the entity/exposure mini-graphs — each bind to the materialized shape and traverse independently, so each is a refactor surface too. Migrating them all to consume graph-derived deltas is a phased refactor — done one surface at a time, behind feature flags, with the document-store implementation kept as fallback until each surface is proven.

What this buys: the hop ceiling is gone for traversal queries. Write amplification on traversals is gone, because the graph engine computes at query time. The most common pattern that today requires thousands of small reads against the document store collapses to a single graph query. Write amplification on materialized downstream projections (authority paths, findings, evidence) remains until those surfaces are refactored, which is why this is phased, not flipped. Note that lifting the traversal hop ceiling does not by itself give us evidence-grade arbitrary-depth chain reasoning; that requires preserving the full ordered chain in the path record (the chain-contract change in Step 0, described in Part I under cross-system chain provenance).

What it costs: two stores to keep in sync, projection lifecycle to manage, per-process state to operate. Restart-rebuild is supported by the event log formalized in Step 1 — replay from the last known sequence number rather than full rebuild from the document source. Per-tenant memory and cache warmup become operational concerns to monitor.
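Replay from the last known sequence number can be sketched as below. The event shape, the event type names, and `GapError` are hypothetical; the only property taken from the text is a per-tenant monotonic sequence number that allows ordered replay and gap detection.

```python
class GapError(Exception):
    """Raised when the event sequence skips a number."""

def replay(events, projection, last_seq):
    """Apply events with seq > last_seq in order; fail loudly on a gap."""
    expected = last_seq + 1
    for event in sorted(events, key=lambda e: e["seq"]):
        if event["seq"] <= last_seq:
            continue  # already applied before the restart
        if event["seq"] != expected:
            raise GapError(f"expected seq {expected}, got {event['seq']}")
        if event["type"] == "edge_added":
            projection.setdefault(event["src"], set()).add(event["dst"])
        elif event["type"] == "edge_removed":
            projection.get(event["src"], set()).discard(event["dst"])
        expected += 1
    return expected - 1  # new high-water mark

events = [
    {"seq": 1, "type": "edge_added", "src": "a", "dst": "b"},
    {"seq": 2, "type": "edge_added", "src": "a", "dst": "c"},
    {"seq": 3, "type": "edge_removed", "src": "a", "dst": "b"},
]
projection = {"a": {"b"}}  # state after seq 1, as it stood before the restart
high_water = replay(events, projection, last_seq=1)
```

A projection that stores its high-water mark can state its own freshness, and a gap surfaces as an error rather than as a silently stale graph.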

Step 3 — Multi-process graph engine

Trigger: the embedded engine becomes insufficient. Most plausible reasons: queries from multiple services need to share the same graph state, geographic distribution requires graph state in multiple regions, or the working-set size for a single tenant grows beyond what an in-process engine can hold.

What changes: the embedded engine is replaced with a managed graph database that runs as its own service. The same storage abstraction routes to it. The same business interfaces stay above it.

What this buys: the next order of magnitude of scale. Multi-tenant routing. Geographic distribution becomes tractable.

What it costs: real operational overhead — a second stateful service to back up, monitor, patch, and scale.

Step 3.5 — Intermediate isolation tiers (Tier 3a, Tier 3b)

Naming note: "3.5" reflects the position between Step 3 and Step 4 in the staircase's arrival order — Tier 3a and Tier 3b are not sub-steps of Step 3, and they are triggered independently of each other and of the surrounding steps.

Step 4 (cells) is a large investment. Two compositional tiers sit between Step 3 and Step 4 — not as separate steps in the staircase, but as independently sellable refinements that unlock specific deal classes long before full cells are warranted.

Tier 3a — Per-tenant storage cluster, shared API and workers. The storage layer routes a tenant's reads and writes to a cluster reserved for that tenant. Small tenants pool onto shared clusters; mid-market tenants get a dedicated cluster sized for their workload; whale tenants get over-provisioned dedicated capacity. The application stays shared.
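At its simplest, the Tier 3a routing decision is a lookup inside the storage layer. The cluster and tenant names below are invented, and a real routing table would presumably also carry region and sizing metadata.

```python
SHARED_POOL = "cluster-shared-1"  # hypothetical pooled cluster for small tenants

# Hypothetical routing table: only tenants with a dedicated cluster are listed.
DEDICATED = {
    "tenant-midmarket": "cluster-mm-eu-1",
    "tenant-whale": "cluster-whale-eu-1",
}

def cluster_for(tenant_id: str) -> str:
    """Route a tenant's reads and writes; unlisted tenants fall back to the pool."""
    return DEDICATED.get(tenant_id, SHARED_POOL)
```

Because the decision sits behind the storage interface, the shared API and workers call the same operations for every tenant and never see which cluster answered.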

What this unlocks:

A premium pricing tier — "your data on its own dedicated cluster, in your selected region, with its own backups" — that is a one-sentence CISO-readable contract clause.
GDPR data residency at storage level (per-region clusters).
The contractual "no shared database" language that mid-market and enterprise security reviews ask for.
Bounded write amplification — a fan-out storm on one tenant's cluster does not slow another tenant's reads.

What this does not unlock: any of the three pipeline cliffs. The evaluator still loads the full tenant entity set into the shared API process; the synchronous evidence aggregation still inflates the sync window; the in-memory job queue still serializes. Tier 3a is storage-side isolation, not pipeline-side isolation.

Tier 3b — Per-tenant worker pool. A tenant's sync, evaluation, and evidence-pack jobs run in a worker pool reserved to that tenant. Half a cell — isolates the pipeline cliffs per tenant without paying the full control-plane cost.

What this unlocks: predictable sync time per tenant regardless of what other tenants are doing. The natural follow-on for any deal where the buyer's contract specifies a sync-window SLA.

What this does not unlock: storage residency or storage isolation. Those are Tier 3a's job.

Order of arrival. Tier 3a typically lands first, sold as a Premium isolation SKU once Step 1 + Step 2 are operating. Tier 3b lands when one tenant alone produces enough work to monopolize the shared queue despite Step 1's fair-share lanes. Step 4 (cells) lands only when a regulated buyer demands compute and storage isolation, or when 100+ tenant operational scale forces the cell model.

For the per-archetype decision grid (which tier is required at which customer profile, with verdicts of Required / Recommended / Comfortable), see the companion sizing document.

Step 4 — Cell architecture for physical tenant isolation

Trigger: the first customer with a contractual isolation requirement that field-level tenant filtering cannot satisfy. In practice this means FedRAMP Moderate, GDPR data residency commitments stronger than "logical separation," or enterprise contracts with explicit physical-isolation language. Other triggers: tens of tenants where a single noisy tenant degrades the platform for everyone, or 100+ tenants where the shared-database model becomes a single failure domain.

What changes: the platform is provisioned as multiple complete, independent stacks ("cells"), each handling a bounded set of tenants. A thin global routing layer maps tenants to cells. Connectors continue to call the same endpoint; the router transparently directs to the right cell.

What this buys: physical data isolation, regulatory eligibility, geographic latency optimization, cross-tenant blast-radius reduction.

What it does NOT buy: relief for a single hot tenant. A tenant with ~100,000 identities, 1M+ resources, 10M+ edges, and high daily churn will still overwhelm intra-tenant systems regardless of cell architecture. That case requires a separate workstream: intra-tenant partitioning by account, region, or source, with delta processing rather than full re-projection. This is not one of the five staircase steps; it is the parallel track, and it becomes necessary as soon as the platform lands its first whale tenant.

What it costs: real engineering investment in the global routing layer, cell provisioning automation, and the operational discipline to run many smaller stacks instead of one large one.

The intermediate work that makes cells additive rather than a rewrite — per-tenant collections within a single store, the persistent job queue from Step 1, per-tenant rate limiting — buys meaningful isolation along the way.

Part IV — Why the Migration Is Bounded

The reason this staircase is honest, and not the kind of "we'll fix it later" promise that founders give investors and then can't deliver on, is one architectural decision made early: the storage layer is a real seam, even if it is not a complete containment boundary.

The platform consists of three rings:

┌──────────────────────────────────────────────────────────┐
│  CONNECTORS                                              │
│  Read-only. Emit a normalized graph in a stable shape.   │
│  Don't know which database is underneath.                │
└──────────────────────────────────────────────────────────┘
                           │
┌──────────────────────────────────────────────────────────┐
│  PIPELINE, EVALUATION & API                              │
│  Materialization, findings, evidence packs, queries.     │
│  Operate against a business interface, not a database.   │
│  Consume materialized shapes today; refactored per       │
│  surface during engine migration.                        │
└──────────────────────────────────────────────────────────┘
                           │
┌──────────────────────────────────────────────────────────┐
│  STORAGE ABSTRACTION (the seam)                          │
│  Translates business operations to physical reads/writes.│
│  Today: document store with materialized paths.          │
│  Tomorrow: hybrid document + graph engine.               │
│  Later: graph service. Eventually: cells.                │
└──────────────────────────────────────────────────────────┘

The connector ring is engine-agnostic, full stop. The middle ring is engine-agnostic for queries, materialized-shape-aware for writes. When the storage layer changes:

Connectors don't ship new versions.
API contracts don't change.
Read paths route transparently.
Write paths and materialization surfaces refactor surface-by-surface, behind feature flags, with the prior path retained as fallback until each surface is proven.
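The read-path half of this claim can be sketched with a structural interface. `AuthorityStore`, both backends, and `report` are hypothetical names; the sketch covers only why read callers are unaffected, not the write-side refactor described in the last item.

```python
from typing import Dict, List, Protocol

class AuthorityStore(Protocol):
    """Hypothetical business-shaped interface; callers never build database queries."""
    def blast_radius(self, entity_id: str) -> List[str]: ...

class DocumentStoreBackend:
    """Today: serves the answer from a materialized field on the entity."""
    def __init__(self, docs: Dict[str, dict]):
        self._docs = docs

    def blast_radius(self, entity_id: str) -> List[str]:
        return sorted(self._docs[entity_id]["reachable"])

class GraphBackend:
    """Later: computes the same answer by traversal at query time."""
    def __init__(self, edges: Dict[str, List[str]]):
        self._edges = edges

    def blast_radius(self, entity_id: str) -> List[str]:
        seen, stack = set(), [entity_id]
        while stack:
            for nxt in self._edges.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return sorted(seen)

def report(store: AuthorityStore, entity_id: str) -> str:
    """Caller code is identical regardless of the backend behind the seam."""
    return f"{entity_id} reaches {len(store.blast_radius(entity_id))} resources"

docs = {"svc-a": {"reachable": ["db-c", "role-b"]}}
edges = {"svc-a": ["role-b"], "role-b": ["db-c"]}
today = report(DocumentStoreBackend(docs), "svc-a")
later = report(GraphBackend(edges), "svc-a")
```

Any surface that reads `docs[...]["reachable"]` directly, rather than calling the interface, is bound to the materialized shape. Those are the surfaces that need the phased refactor.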

This is what makes the staircase real and honest. Most companies that promise "we can swap our database when we grow" cannot, because the database leaks into every layer. We have a seam. We have not yet promoted that seam to a complete containment boundary; that promotion is part of the migration cost. Calling it "free" would be selling fiction. Calling it "bounded and staged" is what the code actually supports.

A related point worth naming directly: the most load-bearing abstraction in the picture above is not the storage seam itself but the normalized graph contract that connectors emit at the top. That contract has held up across three connector additions (Azure/Entra, ServiceNow, AWS) without schema change. Each new connector is roughly 90% connector work and under 10% platform work — and the evaluation rules above never need to know which source produced a given identity, role, or permission. The storage seam protects the engine swap during the staircase. The connector schema protects the engine that generates revenue — the ability to add new sources of authority data without redesigning the platform. If we had to name a single architectural decision as the most consequential one already proven by production data, it would be the normalized graph contract, not the storage abstraction.
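
A minimal sketch of what a normalized graph contract of this kind might look like follows. The field names, the `HAS_ROLE` edge type, and the sample connectors' output are assumptions made for illustration; the real contract is defined in the data model document.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    id: str
    kind: str    # "identity", "role", or "permission"
    source: str  # emitting connector; kept for provenance, not for rule logic


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    type: str    # hypothetical edge vocabulary, e.g. "HAS_ROLE"


@dataclass
class NormalizedGraph:
    nodes: list[Node] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

    def merged_with(self, other: "NormalizedGraph") -> "NormalizedGraph":
        return NormalizedGraph(self.nodes + other.nodes, self.edges + other.edges)


def identities_holding_roles(graph: NormalizedGraph) -> set[str]:
    """Source-agnostic rule: identities with at least one HAS_ROLE edge to a role."""
    kind = {n.id: n.kind for n in graph.nodes}
    return {
        e.src
        for e in graph.edges
        if e.type == "HAS_ROLE"
        and kind.get(e.src) == "identity"
        and kind.get(e.dst) == "role"
    }


# Two hypothetical connectors emit the same shape from different sources.
aws = NormalizedGraph(
    [Node("aws:svc", "identity", "aws"), Node("aws:admin", "role", "aws")],
    [Edge("aws:svc", "aws:admin", "HAS_ROLE")],
)
entra = NormalizedGraph(
    [Node("entra:app", "identity", "entra"), Node("entra:reader", "role", "entra")],
    [Edge("entra:app", "entra:reader", "HAS_ROLE")],
)
print(sorted(identities_holding_roles(aws.merged_with(entra))))
# ['aws:svc', 'entra:app']
```

The rule never reads `Node.source`, which is the sense in which evaluation does not need to know which connector produced a given node.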

One precise caveat keeps that claim honest: what is proven is the node/edge shape — the structural contract. The semantics of edges are still converging. Whether a given edge "forwards authority across systems" is currently decided by per-surface logic rather than one shared rule: cross-account AssumeRole (TRUSTS) is followed as authority-forwarding on some surfaces but not yet on the path-materialization surface, for example. This is in-progress correctness work, not a scaling concern — the shape contract holds; unifying edge-traversal semantics across surfaces is the remaining convergence.
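
One way to picture the convergence target is a single shared predicate that every surface consults, instead of each surface keeping its own list of edges to follow. The sketch below assumes hypothetical names throughout (`AUTHORITY_FORWARDING`, `forwards_authority`, `materialize_paths`, `blast_radius`); only the `TRUSTS` edge type comes from the text above, and `HAS_ROLE` and `TAGGED` are invented for the example.

```python
from collections import deque

Edge = tuple[str, str, str]  # (src, edge_type, dst)

# Single source of truth for "does this edge forward authority?".
AUTHORITY_FORWARDING = frozenset({"TRUSTS", "HAS_ROLE"})


def forwards_authority(edge_type: str) -> bool:
    return edge_type in AUTHORITY_FORWARDING


def materialize_paths(edges: list[Edge], start: str) -> list[list[str]]:
    """Breadth-first simple paths from start over authority-forwarding edges."""
    paths: list[list[str]] = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for src, edge_type, dst in edges:
            if src == path[-1] and forwards_authority(edge_type) and dst not in path:
                new_path = path + [dst]
                paths.append(new_path)
                queue.append(new_path)
    return paths


def blast_radius(edges: list[Edge], start: str) -> set[str]:
    """A second surface that reuses the same traversal rule."""
    return {path[-1] for path in materialize_paths(edges, start)}


edges = [
    ("svc", "HAS_ROLE", "role-a"),
    ("role-a", "TRUSTS", "role-b"),   # cross-account AssumeRole
    ("role-b", "TAGGED", "tag"),      # does not forward authority
]
print(materialize_paths(edges, "svc"))
# [['svc', 'role-a'], ['svc', 'role-a', 'role-b']]
```

Because both functions go through `forwards_authority`, a decision about `TRUSTS` is made once and cannot diverge between surfaces. The path enumeration here is deliberately naive and is not meant to reflect production traversal cost.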

Part V — What This Means for the Next Two Quarters

This document is a position paper, not an implementation plan. The work it commits us to over the next two quarters, in order:

Close Step 0. Lift the hop limit behind a feature flag, with downstream effects in scope. Convert the materializer safety breakers from silent to operator-visible.
Open the design for Step 1 — pipeline stabilization. Persistent queue with fair-share scheduling, asynchronous evidence aggregation, bounded evaluator reads. This is the highest-leverage next investment. The graph engine work in Step 2 depends on it.
Prototype Step 2 in parallel, but only at the read-model layer. Pick the single most expensive traversal path in production today and shadow-route it through an embedded graph engine. Treat this as a tracked architecture decision with explicit go/no-go criteria. Do not extend the prototype until Step 1 is reliable.
Design the Tier 3a routing change ahead of the first deal that requires it. This is a one-time, bounded change to the storage layer; the engineering work is small and the premium pricing tier it unlocks is meaningful. See the companion sizing document for the per-archetype decision grid and the cost-vs-uplift arithmetic.
Defer Tier 3b and Step 4 until their triggers fire. Document the triggers explicitly so the conversation about "is it time?" has agreed criteria.
Acknowledge the modeling gaps as parallel workstreams. Middle-layer API correlation, Kubernetes connector design, and cross-system chain provenance are not part of the storage migration. They deserve their own design documents when they become a priority.
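
The shadow-routing idea in the Step 2 prototype item can be sketched as follows. Everything here is an assumption for illustration: `ShadowStats`, `shadow_read`, and the zero-divergence go/no-go rule are hypothetical, and the real criteria would be set in the tracked architecture decision.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ShadowStats:
    calls: int = 0
    mismatches: int = 0
    shadow_errors: int = 0

    def go(self, min_calls: int) -> bool:
        """Hypothetical go/no-go rule: enough traffic seen, zero divergence."""
        return (
            self.calls >= min_calls
            and self.mismatches == 0
            and self.shadow_errors == 0
        )


def shadow_read(
    primary: Callable[[str], set[str]],
    shadow: Callable[[str], set[str]],
    stats: ShadowStats,
    key: str,
) -> set[str]:
    result = primary(key)          # the caller always gets the primary answer
    stats.calls += 1
    try:
        if shadow(key) != result:
            stats.mismatches += 1
    except Exception:
        stats.shadow_errors += 1   # shadow failures never reach the caller
    return result


stats = ShadowStats()
current = lambda key: {"role-a", "role-b"}     # stand-in for today's read path
candidate = lambda key: {"role-a", "role-b"}   # stand-in for the graph engine
shadow_read(current, candidate, stats, "svc")
print(stats.go(min_calls=1))  # True
```

The production answer is never taken from the shadow path, so the prototype can run against real traffic without putting read correctness at risk.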

Cutoff thresholds at a glance

The companion sizing document carries the full twelve-cutoff list, anchored on the production baseline (Atlas + Azure VMs, not the dev Docker stack). For reference, the five customer archetypes:

Archetype	Profile	NHIs
A	Demo / single-cloud seed-stage	~150
B	Series-A SaaS, one cloud + Entra + ServiceNow	~5,000
C	Mid-market multi-cloud (2,500-employee fintech)	~25,000
D	Enterprise multi-cloud (Fortune 1000)	~150,000
E	Regulated mega (FedRAMP / GDPR-strict pharma)	~600,000

Five load-bearing thresholds for sales conversations:

A single Archetype-C tenant (mid-market, ~25,000 NHIs) sits comfortably on the production baseline today — no staircase work strictly required for the first commercial-scale deal.
Concurrent commercial-scale tenants need Step 1. The cheap sub-step that addresses this is fair-share lanes, because queue starvation bites before memory pressure does.
Archetype-D tenants (enterprise, ~150,000 NHIs) need all four sub-steps of Step 1 before they can be onboarded at all.
Dedicated-storage / GDPR-residency contract clauses unlock at Tier 3a (per-tenant Atlas cluster) — independent of Step 4 (full cells).
FedRAMP Moderate (US federal cloud security authorization) requires Step 4 under the strict reading of SC-4 / SC-7 (isolation and boundary protection). Tier 3a delivers "FedRAMP-aligned" data handling under the permissive reading that commercial regulated buyers accept.

The strategic posture to hold with investors and enterprise buyers:

"We chose a sync-time pre-compute model that gives us sub-second answers at MVP scale and bounded hop depth today. We have mapped four scaling cliffs in the order they arrive — three pipeline cliffs first, then the graph cliff. We have a migration plan that addresses them in order, with each step and tier independently triggered by a real signal. The plan is incremental, not a rewrite. Premium isolation is a Tier 3a refinement that lands well before any cell architecture investment. Above the storage seam, some surfaces refactor as part of the migration — we are honest about that, because the fact that we can be specific about it is itself a function of the architectural choices we made first."

References

Founder/investor position memo — Graph Data Model — Position
Sizing companion — Scalability Sizing and Decision Points
Production database strategy — ADR-020: Multi-region MongoDB Strategy
Production compute landing zone — ADR-022: Azure Compute Landing Zone
Read/query architecture (predicate query layer, large-answer aggregation) — ADR-031: Graph Query Architecture
Independent architecture reviews of the platform (April and May 2026), with companion remediation plans. See docs/architecture/reviews/.
Platform overview — 00 — Overview
Data model — 01 — Data Model
Database design — 03 — Database
