Cross-Connector Graph Stitching Architecture

TL;DR

Today's pipeline ingests each connector's NormalizedGraph independently and merges them only where two connectors happen to emit the same (source_system, source_id) tuple. That produces a "half-stitched" graph: an Entra service principal, the AWS IAM role that trusts it via OIDC, and the ServiceNow OAuth client that uses its client_id render as three islands even when they are the same identity. This proposal extends the adopted 2026-02-26 correlation research (Phase A still required as schema bedrock; Phases B-E need re-scoping for AWS, multi-account, and Question-B identity reconciliation) with an explicit reconciliation phase (Option C from sv0-platform#486) that runs after a per-tenant stitch group settles. The phase: (1) applies a deterministic correlation rule registry against post-upsert entities, (2) materializes correlations linking records and an optional canonical entity, (3) re-runs path materialization scoped to the closure of changed correlations, and (4) is fully auditable per analyst click. No ML, no fuzzy matching, MongoDB-only.

Problem

The Foundry demo on 2026-04-21 only worked because PR #459 and PR #461 patched two cross-connector data-shape bugs in the diff engine eight days before Sergey's call. The patches were tactical: relationship-level provenance plus a scoped diff. The structural problem — the platform has no place that owns the question "is identity A in Entra the same identity as IAM role X in AWS and OAuth client C in ServiceNow?" — remains. Concrete symptoms today:

Foundry case (closed by #459/#461): same entra-sp-{principal_id} was emitted by both Entra-ServiceNow and Azure Foundry; the connector-side shared node_ids.py library de-duplicated by source-id agreement. The platform never stitched anything; it simply got lucky that two connectors agreed on a tuple. The seed script can render the full path because it builds it manually; live connectors cannot.
AWS-Entra federation case (NOT closed): Entra emits entra-sp-{principal_id} with source_system=entra_id. AWS emits aws-iam-role-{account}-{name} with source_system=aws_iam. They have different (source_system, source_id) tuples and therefore different _id values from buildStableEntityId. The trust policy on the AWS role names the Entra SP via OIDC subject — but no platform code reads that policy and produces a link. They render as two nodes.
ServiceNow-Entra OAuth case: handled today only because the entra-servicenow connector internally correlates SN OAuth → Entra SP by client_id and emits a CORRELATED edge before submitting the graph (sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/correlator.py:440). When AWS or any other connector enters the picture, no equivalent stitch exists.
Lab 2 (Nimbus Enterprise) is gated by exactly this gap (docs/plans/2026-04-08-demo-lab-plan.md:431,504). The plan calls out: "building it earlier produces a half-stitched demo that undersells the product." MediaPro Lab 2 is the same shape.

The platform's path-materializer is already source-system-agnostic — it follows edges by entity _id regardless of which connector created them (src/ingestion/path-materializer.ts:108). The bottleneck is upstream of materialization: nothing forces those _id values to converge.

Current state

What `2026-02-26-cross-connector-entity-correlation-research.md` proposed and what shipped

The 2026-02-26 doc proposed five phases:

Phase	Description	Status today
A	Multi-connector entity ownership (`connector_owners[]`) + relationship partitioning (`source_connector_id` per edge, atomic pipeline upsert)	Partially shipped. `EntityRelationship.source_connector_id` landed in #459. Atomic pipeline upsert and `connector_owners[]` did not ship — the read-merge-write at `sync-ingestion.ts:124-140` is still racy under concurrency, and `connector_id` is still singular (#488).
B	Connector-declared `correlationKeys[]` on `NormalizedNode` (e.g., `endpoint_uri`, `entra_principal_id`)	Did not ship. Endpoint URLs are still stored as plain properties; no platform code consumes them as match keys.
C	`entity-correlator.ts` runs after upsert, before path materialization	Did not ship. No `correlations` collection, no correlator.
D	Extend shared `node_ids.py` for ARM resources	Partial — Entra SP / user covered (the de-dup that saved the Foundry demo); ARM resources, AWS roles, Logic Apps not covered.
E	Materializer extension to follow `CONNECTS_TO` for connection-to-resource bridging	Did not ship. `FORWARDING_EDGE_TYPES` at `path-materializer.ts:108` is still `{CALLS, INVOKES, USES, AUTHENTICATES_AS, AUTHENTICATES_VIA}`.

The only pieces of the 2026-02-26 research that landed are part of Phase A (relationship provenance) and the part of Phase D that covers Entra principals. Everything that requires platform-side correlation logic is unbuilt.

What #459 and #461 fixed tactically

#459 added EntityRelationship.source_connector_id, taught the graph transformer to stamp it, taught sync-ingestion.ts to merge cross-connector relationships before upsert (mergeRelationships at sync-ingestion.ts:26-37), and added getEntitiesWithRelationshipTo so a Foundry sync that adds HAS_ROLE to a shared SP also re-materializes upstream Entra-ServiceNow workloads via inbound RUNS_AS. None of this stitches across (source_system, source_id) tuples — it merely fixes the wholesale-overwrite bug for the cases where two connectors already agree on the tuple.
#461 scoped diffRelationships to relationships where source_connector_id === connectorId or source_connector_id === undefined, and filtered inbound mirrors. This closed the spurious-event class for the same already-merged-by-tuple-agreement case.

Both fixes are correct and load-bearing, but they only operate inside an entity that two connectors happen to claim with identical (source_system, source_id). They do nothing for the Entra-SP-vs-AWS-IAM-role case, which is the demo-killer.

What's still missing

Issue	Class	Why it blocks stitching
#486 (epic)	Architectural	No reconciliation phase exists. Every per-field cross-connector bug is patched in the surface where it appears (diff, merge, materializer); no layer owns the canonical state.
#491 (investigation)	Re-materialization scope	When a second connector adds a relationship that unlocks a longer path through a workload from a prior connector's sync window, the upstream re-materialization fix from #459 may not cover the case. The Sergey-flow replay test in PR #484 narrows its assertion because of this.
#488	Schema	`EntityDoc.connector_id` is scalar — last-writer-wins. Deletion detection scoped by this field cannot see shared entities owned by another connector. Stitching makes shared entities the norm, not the exception.
#485	Diff scope	`diffProperties` compares wholesale; cross-connector property differences fire spurious `entity_versions`. Once stitching produces shared entities at scale, this turns from "P0 with one symptom" into "the diff engine is structurally broken."
#383	Type system	AWS `human_identity` nodes are silently retyped to `owner` by `graph-transformer.ts:45`. With stitching, the same human identity could be claimed by Okta (as `human_identity`), AWS (silently retyped to `owner`), and Entra (as `human_identity`). The reconciler needs to own type.
`sv0-connectors#79` (Phase A–E from 2026-02-26 research)	Connector-side	`correlationKeys[]` declarations on `NormalizedNode` never landed; the platform has no inputs to correlate from.

Schema-level blockers

EntityDoc.connector_id: string (src/domain/entities/types.ts:67) — must become connector_owners: string[].
EntityDoc.properties: Record<string, unknown> — must gain property_provenance map (Option A from sv0-platform#486 plan) so per-property survivorship works.
EntityDoc has no correlations reference — needs a way to point at the correlations collection (or an embedded linked_entity_ids[] for cheap reverse lookup).
NormalizedNode has no correlationKeys[] — connectors cannot declare match keys (research doc Phase B).
Connectors do not emit OIDC subject / federated principal as a structured field; they bury it in trust-policy properties.

Design proposal

Position in the pipeline

sync_ingestion (per connector, runs as today through step 7)
   1. insert ConnectorSyncDoc
   2. transformGraph
   3. computeDiff (per-connector scoped, post #461)
   4. mergeRelationships + atomic upsertEntity   [needs Option-A schema fixes; today is racy]
   5. insertEvents
   6. insertEntityVersion
   7. soft-delete absent entities
   ── per-connector sync ENDS ─────────────────────────────────────────────
        ▼
   ▶▶▶ NEW: enqueue stitch_run for tenant T (debounced, see "Trigger semantics") ◀◀◀
        ▼
stitch_ingestion (NEW — runs once per stitch_run, NOT per connector)
   S1. Fetch correlation rule set for tenant
   S2. Compute candidate set: changed entities since last stitch_run +
       any entity transitively reachable from one via existing correlations
   S3. Apply correlation rule registry → propose CorrelationDoc records
   S4. Validate proposals against tenant opt-out + collision policy
   S5. Persist correlations (upsert + soft-deprecate stale)
   S6. Compute re-materialization closure (workloads RUNS_AS any newly
       linked identity, or workloads transitively reachable through
       newly bridged edges)
   S7. Re-run materializeExecutionPaths + materializeAuthorityPaths
       scoped to that closure
   S8. Emit `stitch_completed` event; update StitchRunDoc with metrics
        ▼
evaluate_findings (existing, runs per-tenant when stitch_run completes)
build_evidence_pack (existing, fan-out per changed finding)

Why post-transform / post-upsert, NOT per-connector inline:

Stitching needs the post-merge entity state. Running inline would mean every connector handler has to reason about every other connector's data — that defeats the read-only-per-connector design.
Stitching needs output from at least two connectors to produce useful links. Embedding it in connector A's handler means connector B's data may not exist yet.
The Stream-1 ScanRun schema lets us treat "connector A finished" and "connector B finished" as independent events; the stitcher debounces and runs once per quiet window per tenant.
It must run before evaluate_findings because findings (reachable_sensitive_domain, external_egress) consume the materialized authority paths. Stitched paths must exist before evaluation runs, otherwise the CISO sees findings that disappear and reappear when the next connector lands.

Sync vs async: Async. Stitch runs are kicked off via the worker queue. The HTTP /ingest/normalized-graph endpoint already returns 202 today; nothing changes there. The per-connector sync handler enqueues a stitch_run job at the end (or refreshes an existing pending one for the same tenant — see debounce).

Batched vs streaming: Batched, debounced per tenant. A connector that finishes its sync enqueues a stitch_run job for (tenant_id) with a debounce timer (default 60 s; configurable per tenant). If another stitch_run arrives during the debounce window, the timer resets. This collapses bursts (e.g., when all four connectors complete their Sunday-night syncs within minutes of each other) into one stitch pass — important for cost and for avoiding partial-stitch states visible to the UI.
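
The debounce behaviour described above can be sketched as a pure state transition. This is a minimal illustration, not the platform's worker code: the names DebounceState, onStitchTrigger, and isStitchDue are hypothetical, and only the 60 s default comes from the text.

```typescript
// Hypothetical sketch of the per-tenant debounce; all names are illustrative.
interface DebounceState {
  tenantId: string;
  /** Epoch millis at which the pending stitch_run becomes runnable. */
  fireAtMs: number;
  /** How many triggers were collapsed into this one pending run. */
  collapsedTriggers: number;
}

const DEFAULT_DEBOUNCE_MS = 60000; // 60 s default, per the paragraph above

/** A trigger creates a pending run, or resets the timer of the existing one. */
function onStitchTrigger(
  pending: DebounceState | undefined,
  tenantId: string,
  nowMs: number,
  debounceMs: number = DEFAULT_DEBOUNCE_MS,
): DebounceState {
  const alreadyCollapsed = pending === undefined ? 0 : pending.collapsedTriggers;
  return {
    tenantId,
    fireAtMs: nowMs + debounceMs, // the quiet window restarts on every trigger
    collapsedTriggers: alreadyCollapsed + 1,
  };
}

/** The run is due once the quiet window has elapsed with no further triggers. */
function isStitchDue(pending: DebounceState, nowMs: number): boolean {
  return nowMs >= pending.fireAtMs;
}
```

Two connectors finishing 30 s apart therefore produce one pending run that fires 60 s after the second completion.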

Trigger semantics (consumed from Stream 1's ScanRun):

A ScanRun transitioning to status=completed enqueues stitch_run(tenant_id).
A ScanRun transitioning to status=failed does not trigger a stitch — partial data could falsely deprecate links.
Manual trigger: POST /api/v1/admin/stitch-runs with {tenant_id, scope: "full" | "incremental"} — required for opt-out toggling and rule registry changes.
Backfill trigger: when a tenant first enables a new correlation rule, a one-time stitch_run(tenant_id, mode: "full_rescan") runs against all entities, not just the changed set.

Idempotency: The job key is (tenant_id, debounce_window_id). If a stitch run is already running for the tenant, new triggers wait for completion and then enqueue at most one follow-up. There is never more than one in-flight stitch run per tenant. This is enforced at the worker layer via Mongo-backed leader-election on a stitch_runs collection insert.
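
The "one in-flight run, at most one follow-up" rule can be modelled as a small state machine. The sketch below is an in-memory illustration only; the text describes the real enforcement as Mongo-backed, and GateState, onTriggerArrived, and onRunCompleted are hypothetical names.

```typescript
// Hypothetical in-memory model of the single-flight rule; not the worker implementation.
type GateState = { running: boolean; followUpQueued: boolean };
type GateDecision = "start" | "queued_follow_up" | "coalesced";

/** Decide what a new trigger does, given the tenant's current gate state. */
function onTriggerArrived(gates: Map<string, GateState>, tenantId: string): GateDecision {
  const gate = gates.get(tenantId) ?? { running: false, followUpQueued: false };
  gates.set(tenantId, gate);
  if (!gate.running) {
    gate.running = true;
    return "start";
  }
  if (!gate.followUpQueued) {
    gate.followUpQueued = true;
    return "queued_follow_up";
  }
  return "coalesced"; // a follow-up already exists, so extra triggers add nothing
}

/** Called when the in-flight run finishes; returns true if a follow-up starts now. */
function onRunCompleted(gates: Map<string, GateState>, tenantId: string): boolean {
  const gate = gates.get(tenantId);
  if (gate === undefined || !gate.running) return false;
  if (gate.followUpQueued) {
    gate.followUpQueued = false; // the follow-up becomes the new in-flight run
    return true;
  }
  gate.running = false;
  return false;
}
```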

Correlation rule registry

Rule schema

// src/domain/correlations/types.ts (NEW)

export type CorrelationKind =
  | "SAME_ENTITY"      // Two source records describe the same identity. Materialize a canonical link.
  | "BRIDGES_TO"       // Two source records are different entities but related via an edge. Add edge, do not merge.
  | "AUTHENTICATES_TO";// Specialization of BRIDGES_TO for cross-system identity hops (kept for evaluator clarity).

export type CorrelationConfidence = "HIGH" | "MEDIUM";  // No LOW. Determinism is non-negotiable.

export type CollisionPolicy =
  | "first_match_wins"   // If multiple A-side rows correlate to the same B-side row, only the first (by deterministic order) is kept.
  | "all_match"          // All matches are emitted as separate CorrelationDoc records (use for fan-out edges).
  | "drop_all_ambiguous";// If >1 candidate, emit zero correlations and log to stitch_audit for operator review.

export interface CorrelationRule {
  /** Stable rule ID, e.g. "aws-oidc-federation-to-entra-sp". Must not change after introduction. */
  rule_id: string;
  /** Human-readable description shown in the debuggability surface. */
  description: string;
  /** Rule version. Bumped if the predicate changes. Stored on every CorrelationDoc for replay/debug. */
  version: number;
  /** Source-A side: which entities are eligible to match. */
  source_a: EntityPredicate;
  /** Source-B side: which entities are eligible to match. */
  source_b: EntityPredicate;
  /** The deterministic match key extractor — produces a string from each side that must match exactly. */
  match_key: MatchKeyExtractor;
  /** What kind of correlation this produces. */
  kind: CorrelationKind;
  /** Confidence — a compile-time property of the rule, NOT a runtime score. */
  confidence: CorrelationConfidence;
  /** What to do when multiple candidates match. */
  on_collision: CollisionPolicy;
  /** Whether the rule is enabled by default. Tenants can override. */
  default_enabled: boolean;
  /** Documentation link explaining the underlying real-world relationship. */
  doc_url?: string;
}

export interface EntityPredicate {
  source_systems: string[];           // e.g. ["aws_iam"]
  entity_types: EntityType[];         // e.g. ["identity", "role"]
  required_properties?: string[];     // properties that must be non-null on the entity
  property_filters?: Record<string, unknown>; // exact-match filters on properties
}

export interface MatchKeyExtractor {
  /** Pure function name registered in `correlation-key-extractors.ts`. NOT arbitrary code. */
  extractor_id: string;
  /** Path or property name(s) the extractor operates on. */
  inputs: string[];
}

The rule registry is a TypeScript array of literal objects, defined in src/ingestion/stitching/rules/registry.ts. Rules are not hot-loaded; they ship with the platform binary. This guarantees determinism across deploys and makes every rule grep-able and version-controlled.
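
As an illustration of what one registry literal could look like, the sketch below encodes the human-identity-by-email rule using trimmed copies of the types above so that it compiles on its own. The source_systems values, the inputs names, and the findDuplicateRuleIds helper are assumptions made for the example, not part of the proposal.

```typescript
// Trimmed copies of the schema types so the sketch is self-contained.
type CorrelationKind = "SAME_ENTITY" | "BRIDGES_TO" | "AUTHENTICATES_TO";
type CorrelationConfidence = "HIGH" | "MEDIUM";
type CollisionPolicy = "first_match_wins" | "all_match" | "drop_all_ambiguous";
interface EntityPredicate { source_systems: string[]; entity_types: string[] }
interface MatchKeyExtractor { extractor_id: string; inputs: string[] }
interface CorrelationRule {
  rule_id: string;
  description: string;
  version: number;
  source_a: EntityPredicate;
  source_b: EntityPredicate;
  match_key: MatchKeyExtractor;
  kind: CorrelationKind;
  confidence: CorrelationConfidence;
  on_collision: CollisionPolicy;
  default_enabled: boolean;
}

const humanIdentityPredicate: EntityPredicate = {
  source_systems: ["entra_id", "aws_iam"], // illustrative values only
  entity_types: ["human_identity"],
};

const CORRELATION_RULES: CorrelationRule[] = [
  {
    rule_id: "human-identity-by-email",
    description: "Links human identities from different connectors that share an email address.",
    version: 1,
    source_a: humanIdentityPredicate,
    source_b: humanIdentityPredicate,
    match_key: { extractor_id: "email_lower", inputs: ["email", "upn"] }, // input names assumed
    kind: "SAME_ENTITY",
    confidence: "HIGH",
    on_collision: "drop_all_ambiguous",
    default_enabled: true,
  },
];

/** Hypothetical startup check: rule IDs must be unique because they are stable keys. */
function findDuplicateRuleIds(rules: CorrelationRule[]): string[] {
  const seen = new Set<string>();
  const dupes: string[] = [];
  for (const rule of rules) {
    if (seen.has(rule.rule_id)) dupes.push(rule.rule_id);
    seen.add(rule.rule_id);
  }
  return dupes;
}
```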

Tenant overrides (enable/disable, parameter tweaks) live in a tenant_correlation_settings collection (see Tenant opt-out below).

Match-key extractors (the only "logic" in a rule)

Extractors are pure functions registered by string ID. They are the single place where structural parsing happens (parsing an ARN, extracting an OIDC subject, normalizing an email). Each extractor:

Takes an EntityDoc and the rule's inputs array.
Returns either a string (match key) or null (entity not eligible).
Is unit-tested per extractor with a fixture.
Has zero side effects.

Initial extractors:

Extractor ID	Purpose	Output
`entra_sp_object_id`	Returns `properties.principal_id` (or `object_id`) from an Entra SP.	`"abc-123-def"`
`aws_role_oidc_trust_subject`	Parses `properties.trust_policy.Statement[].Principal.Federated` looking for `sts.windows.net/<tenant>` and reads `Condition.StringEquals['sts.windows.net/<tenant>:sub']` to return the Entra SP object ID. Returns `null` if no OIDC trust or trust is not Entra-issued.	`"abc-123-def"` (matches above)
`aws_role_saml_trust_subject`	Same shape, for SAML federation. Reads `Principal.Federated` of form `arn:aws:iam::<acct>:saml-provider/<name>`.	depends on provider
`oauth_client_id_lower`	Lowercases `properties.client_id` or `properties.app_id`.	`"560ad26b-..."`
`arn_canonical`	Lowercases an ARN; strips trailing slashes; preserves account-id and region.	`"arn:aws:iam::123456789012:role/foo"`
`email_lower`	Lowercases `properties.email` or `properties.upn`.	`"alice@example.com"`
`external_principal_arn`	Extracts ARN from `properties.trust_policy.Statement[].Principal.AWS` array. Emits one match key per entry (used with `on_collision: all_match`).	`"arn:aws:iam::987654321098:role/x"`
`endpoint_uri_normalized`	Parses URL; returns `host + path` lowercased; strips query/fragment.	`"prod-28.eastus.logic.azure.com/workflows/abc/triggers/manual/invoke"`
`entra_app_id_lower`	Lowercases `properties.app_id`.	`"560ad26b-..."`

Extractors are the only place where source-system-specific parsing lives in the stitching layer. Adding a new extractor is a code change reviewed like any other deterministic rule.
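
A few of the simpler extractors could look roughly like the sketch below. It assumes a minimal entity shape and treats inputs as plain property names. The URL handling is a deliberately simplified regex that does not strip ports or userinfo, so it should be read as an illustration of the contract (EntityDoc in, string or null out) rather than as a finished parser.

```typescript
// Hypothetical minimal entity shape; the real EntityDoc has more fields.
interface ExtractorEntity { properties: Record<string, unknown> }
type ExtractorFn = (entity: ExtractorEntity, inputs: string[]) => string | null;

/** Returns the first input property holding a non-empty string, trimmed, or null. */
function firstString(entity: ExtractorEntity, inputs: string[]): string | null {
  for (const key of inputs) {
    const value = entity.properties[key];
    if (typeof value === "string" && value.trim() !== "") return value.trim();
  }
  return null;
}

const emailLower: ExtractorFn = (entity, inputs) => {
  const email = firstString(entity, inputs);
  return email === null ? null : email.toLowerCase();
};

const arnCanonical: ExtractorFn = (entity, inputs) => {
  const arn = firstString(entity, inputs);
  // Lowercase and strip trailing slashes; account-id and region are untouched.
  return arn === null ? null : arn.toLowerCase().replace(/\/+$/, "");
};

const endpointUriNormalized: ExtractorFn = (entity, inputs) => {
  const raw = firstString(entity, inputs);
  if (raw === null) return null;
  // scheme://authority/path?query#fragment -> keep authority + path only.
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^\/?#]+)([^?#]*)/i.exec(raw);
  if (match === null) return null;
  return ((match[1] ?? "") + (match[2] ?? "")).toLowerCase();
};

// Registration by string ID, mirroring MatchKeyExtractor.extractor_id.
const EXTRACTORS: Record<string, ExtractorFn> = {
  email_lower: emailLower,
  arn_canonical: arnCanonical,
  endpoint_uri_normalized: endpointUriNormalized,
};
```

Returning null for a missing or non-string property keeps the "entity not eligible" case explicit instead of matching on an empty key.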

Initial correlation rule set

`rule_id`	source_a	source_b	match_key	kind	confidence	on_collision	enabled
`aws-oidc-federation-to-entra-sp`	AWS role with OIDC trust on Entra issuer	Entra SP	A: `aws_role_oidc_trust_subject`; B: `entra_sp_object_id`	SAME_ENTITY	HIGH	first_match_wins	yes
`aws-saml-federation-to-entra-sp`	AWS role with SAML trust on Entra	Entra SP	A: `aws_role_saml_trust_subject`; B: `entra_sp_object_id`	SAME_ENTITY	HIGH	first_match_wins	yes
`servicenow-oauth-to-entra-sp`	ServiceNow OAuth client	Entra SP	A: `oauth_client_id_lower`; B: `entra_app_id_lower`	SAME_ENTITY	HIGH	first_match_wins	yes
`aws-cross-account-role-trust`	AWS role with explicit AWS principal in trust	AWS role/user (target account)	A: `external_principal_arn`; B: `arn_canonical`	BRIDGES_TO (new edge `TRUSTED_BY`)	HIGH	all_match	yes
`human-identity-by-email`	Any `human_identity` (post-#383 fix)	Any other `human_identity` from a different connector	A,B: `email_lower`	SAME_ENTITY	HIGH	drop_all_ambiguous	yes
`connection-endpoint-bridge`	`connection` from any connector	`resource` (e.g. Logic App) from any connector	A,B: `endpoint_uri_normalized`	BRIDGES_TO (new edge `CONNECTS_TO`)	MEDIUM	drop_all_ambiguous	yes
`mcp-server-to-entra-sp`	AWS Lambda or workload labeled MCP host	Entra SP referenced via env var `ENTRA_CLIENT_ID`	A: extractor reads `properties.environment.ENTRA_CLIENT_ID`; B: `entra_app_id_lower`	BRIDGES_TO (`AUTHENTICATES_TO`)	MEDIUM	first_match_wins	no (opt-in per tenant; high false-merge risk)

Confidence semantics:

HIGH rules auto-link. The match key is structurally guaranteed to identify the same entity (OIDC subject IS the Entra SP object ID; OAuth client_id IS the Entra appId).
MEDIUM rules emit links only if there is exactly one candidate; otherwise they drop and log to stitch_audit. Endpoint URLs match this profile — a host can serve many resources.
There is no LOW. If a rule cannot decide deterministically with at most one operator-policy parameter, it does not enter the registry.

Versioning: Every CorrelationDoc records rule_id and rule_version. When a rule's predicate or extractor changes, version bumps. Old correlations remain valid until the next stitch run, which re-evaluates them under the new rule version.

Determinism guarantee: Same set of entities + same enabled rules + same rule versions = same set of correlations. Order independence is achieved by sorting the candidate set by (entity._id) lexicographically before evaluation.
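
One way to read the collision policies together with the determinism guarantee is the sketch below: both sides are grouped by match key, each group is sorted by _id, and the policy decides which pairs survive. The Keyed and ProposedLink shapes and the proposeLinks function are hypothetical, and the interpretation of first_match_wins (keep only the lowest-_id A-side row per key) is an assumption drawn from the policy comment.

```typescript
type CollisionPolicy = "first_match_wins" | "all_match" | "drop_all_ambiguous";
/** An entity id paired with the match key its extractor produced (null = not eligible). */
interface Keyed { _id: string; key: string | null }
interface ProposedLink { a: string; b: string; matchKey: string }

/** Groups entity ids by match key; ids are sorted so arrival order is irrelevant. */
function groupByKey(rows: Keyed[]): Map<string, string[]> {
  const out = new Map<string, string[]>();
  for (const row of rows) {
    if (row.key === null) continue;
    const ids = out.get(row.key) ?? [];
    ids.push(row._id);
    out.set(row.key, ids);
  }
  out.forEach((ids) => ids.sort());
  return out;
}

function proposeLinks(sideA: Keyed[], sideB: Keyed[], policy: CollisionPolicy): ProposedLink[] {
  const aByKey = groupByKey(sideA);
  const bByKey = groupByKey(sideB);
  const links: ProposedLink[] = [];
  for (const key of Array.from(aByKey.keys()).sort()) {
    const aIds = aByKey.get(key) ?? [];
    const bIds = bByKey.get(key) ?? [];
    if (bIds.length === 0) continue; // no counterpart on the B side
    const ambiguous = aIds.length > 1 || bIds.length > 1;
    if (policy === "drop_all_ambiguous" && ambiguous) continue; // would be logged to stitch_audit
    const keptA = policy === "first_match_wins" ? aIds.slice(0, 1) : aIds;
    for (const a of keptA) {
      for (const b of bIds) links.push({ a, b, matchKey: key });
    }
  }
  return links;
}
```

Because every group is sorted before a policy is applied, feeding the same rows in a different order yields the same links.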

Merge semantics

Correlations are stored as first-class records, not implicit. The platform never hard-merges entities — it links them and computes a canonical view on read (or on demand for the materializer).

`CorrelationDoc` schema (new collection `correlations`)

export interface CorrelationDoc {
  _id: string;                   // sha256(tenant_id + sorted(entity_ids) + rule_id) — stable & idempotent
  tenant_id: string;
  rule_id: string;
  rule_version: number;
  kind: CorrelationKind;
  confidence: CorrelationConfidence;
  /** The set of entity IDs linked by this correlation. Always sorted. For SAME_ENTITY, all members are aliases. */
  entity_ids: string[];
  /** Match key value used (for debugging — e.g. the OIDC subject). */
  match_key_value: string;
  /** When this correlation first appeared in a stitch run. */
  created_at: Date;
  /** When this correlation was last confirmed by a stitch run. */
  last_confirmed_at: Date;
  /** Stitch run that created this correlation. */
  created_by_stitch_run_id: string;
  /** Set when the rule no longer fires for this candidate set. Soft-deprecation. */
  deprecated_at?: Date;
  deprecated_by_stitch_run_id?: string;
  /** Per-source provenance for the inputs that produced this match. */
  source_records: CorrelationSourceRecord[];
}

export interface CorrelationSourceRecord {
  entity_id: string;
  source_system: string;
  source_id: string;
  /** The connector_id that contributed this source record. */
  connector_id: string;
  /** When this source-record was last observed by its contributing connector. */
  observed_at: Date;
  /** The actual property values the extractor read, captured for audit. */
  extracted_value: string;
}
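
The _id comment above describes a hash over tenant, sorted entity IDs, and rule ID. A minimal sketch of building the hash input is shown below. The JSON encoding is an assumption added here to keep field boundaries unambiguous, and correlationIdPreimage is a hypothetical helper; the hashing step itself (sha256) is left out.

```typescript
/**
 * Hypothetical helper: builds the exact string that would be fed to sha256.
 * Sorting makes the result independent of the order in which entity ids arrive.
 */
function correlationIdPreimage(tenantId: string, entityIds: string[], ruleId: string): string {
  const sorted = entityIds.slice().sort();
  // JSON encoding delimits every field; plain string concatenation would let
  // ("ab", ["c"]) and ("a", ["bc"]) produce the same input and therefore the same _id.
  return JSON.stringify([tenantId, sorted, ruleId]);
}
```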

Canonical EntityDoc model

SAME_ENTITY correlations form an equivalence class. The platform exposes both:

The contributing entities as-is, unchanged in the entities collection. UI widgets that need the per-source view (e.g., "show me what AWS sees vs what Entra sees") read these directly. No data is destroyed.
A canonical_identity_id field added to each contributing entity, pointing at the lexicographically lowest entity._id in its equivalence class. This is recomputed on every stitch run for entities in the affected set.

The path materializer is taught (small change) to traverse equivalence classes via canonical_identity_id: when computing paths from workload W → identity I, it includes paths through any entity I' where I.canonical_identity_id === I'.canonical_identity_id. This converts "three islands" into "one identity with three source records and a unified outbound edge set" without rewriting the entity store.
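
Computing canonical_identity_id amounts to finding connected components over SAME_ENTITY correlations and taking the lowest _id in each. A minimal union-find sketch follows; computeCanonicalIds is a hypothetical name and the entity IDs in the usage note are made up for illustration.

```typescript
/**
 * Maps every entity id to the lexicographically lowest id in its SAME_ENTITY
 * equivalence class. Each inner array is the entity_ids of one correlation.
 */
function computeCanonicalIds(entityIds: string[], sameEntityGroups: string[][]): Map<string, string> {
  const parent = new Map<string, string>();
  const find = (id: string): string => {
    let root = id;
    let next = parent.get(root);
    while (next !== undefined && next !== root) {
      root = next;
      next = parent.get(root);
    }
    return root;
  };
  const union = (x: string, y: string): void => {
    const rx = find(x);
    const ry = find(y);
    if (rx === ry) return;
    // Keep the smaller id as root, so the root of a class is always its canonical id.
    if (rx < ry) parent.set(ry, rx);
    else parent.set(rx, ry);
  };
  for (const id of entityIds) parent.set(id, id);
  for (const group of sameEntityGroups) {
    const first = group[0];
    if (first === undefined) continue;
    for (const other of group) {
      if (!parent.has(other)) parent.set(other, other);
      union(first, other);
    }
  }
  const canonical = new Map<string, string>();
  parent.forEach((_value, id) => canonical.set(id, find(id)));
  return canonical;
}
```

With groups [["entra-sp-1", "aws-iam-role-1"], ["sn-oauth-1", "entra-sp-1"]], all three entities resolve to "aws-iam-role-1" regardless of the order in which the groups are supplied.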

For BRIDGES_TO correlations, no canonical merge happens. Instead, a synthetic edge of the rule's declared type is materialized into the source entity's relationships[] array, tagged with source_connector_id = "stitcher" and properties: {via_correlation_id: <CorrelationDoc._id>}. This reuses the existing relationship-based traversal (mergeRelationships already preserves edges from "other connectors" — stitcher is just another connector ID).

Per-property source-of-truth precedence

When the same logical attribute exists on multiple linked entities, the on-read merge uses a declared survivorship policy from field-policies.ts:

Property class	Policy	Rationale
`display_name`	most-recently-updated non-empty	Different connectors invent different display names; the most recent observation is usually the most useful.
`properties.principal_id` / `app_id` / `client_id`	authoritative-source: Entra wins	Entra is the system of record for these IDs.
`properties.trust_policy`	source-system-only (not merged across)	Each connector's view of trust is local to its system; never merge.
`properties.email` / `upn`	authoritative-source: Entra > AWS Identity Center > Okta > others	Per-tenant override allowed.
`properties.tags`	set-union (deduplicated)	Tags are additive metadata; never lose a tag.
`entity_type`	authoritative-source: identity > workload > connection > credential > owner; ties broken by lexicographic source_system	Consistent with #383 — explicit type-survivorship rule.
`resource_key`	first non-null wins; flag conflict if two non-null differ	Drift here means the canonical key is genuinely contested; surface to operators.

The merge function is computeCanonicalView(entityIds, fieldPolicies, storage), which is pure given the entities and the policy table. The policy-table version is recorded in the stitch_run metadata so a re-stitch can be reproduced identically.
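
Three of the survivorship policies in the table can be sketched as a single per-field resolver. This is an illustration under assumptions: Observation, FieldPolicy, and resolveField are hypothetical names, timestamps are plain numbers, and ties are broken by source_system so that input order never changes the result.

```typescript
interface Observation { value: unknown; source_system: string; observed_at: number }

type FieldPolicy =
  | { kind: "most_recent_non_empty" }
  | { kind: "authoritative_source"; ranking: string[] }
  | { kind: "set_union" };

function compareSource(x: Observation, y: Observation): number {
  return x.source_system < y.source_system ? -1 : x.source_system > y.source_system ? 1 : 0;
}

/** Resolves one logical attribute across linked entities according to its policy. */
function resolveField(policy: FieldPolicy, observations: Observation[]): unknown {
  const present = observations.filter(
    (o) => o.value !== null && o.value !== undefined && o.value !== "",
  );
  if (policy.kind === "set_union") {
    const merged = new Set<string>();
    for (const o of present) {
      if (Array.isArray(o.value)) {
        for (const tag of o.value) merged.add(String(tag));
      }
    }
    return Array.from(merged).sort(); // deduplicated and deterministically ordered
  }
  let ordered: Observation[];
  if (policy.kind === "authoritative_source") {
    const ranking = policy.ranking;
    // Unranked systems sort after every ranked one.
    const rank = (s: string): number => {
      const i = ranking.indexOf(s);
      return i === -1 ? ranking.length : i;
    };
    ordered = present.slice().sort(
      (x, y) => rank(x.source_system) - rank(y.source_system) || compareSource(x, y),
    );
  } else {
    ordered = present.slice().sort(
      (x, y) => y.observed_at - x.observed_at || compareSource(x, y),
    );
  }
  const winner = ordered[0];
  return winner === undefined ? undefined : winner.value;
}
```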

Source lineage

Every property in the canonical view carries (connector_id, source_record_id, observed_at, contributing_entity_id) provenance, surfaced via the GET /api/v1/identities/:id/lineage endpoint (see Debuggability). The on-disk shape uses the existing property_provenance: Record<string, ConnectorProvenance> field added in Option-A from sv0-platform#486. Stitching extends provenance to include the other entities that contributed via correlation (so an analyst clicks one identity and sees all three source records).

Type reconciliation (#383)

The human_identity → owner retype in graph-transformer.ts:45 is removed. EntityType gains human_identity as a first-class type. The reconciler then applies the type-survivorship rule (above) to canonicalize across linked entities. Migration backfills existing owner rows that originated from human_identity nodes (identifiable by properties.subtype or by source-id pattern).

Re-materialization

What triggers re-stitch

A stitch run computes a change set at the start:

ChangedEntities = entities upserted/changed since last stitch_run.last_completed_at
AffectedCorrelations = correlations whose entity_ids ∩ ChangedEntities ≠ ∅
TransitivelyAffectedEntities = entities reachable from ChangedEntities via existing correlations + AffectedCorrelations

Only TransitivelyAffectedEntities are evaluated by the rule registry. A full-tenant re-stitch is gated behind the manual mode: "full_rescan" trigger.
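
The change-set computation is a breadth-first walk over correlations starting from the changed entities. A minimal in-memory sketch, with hypothetical names (CorrelationLite, transitivelyAffected) and correlations passed in as a plain array rather than read from storage:

```typescript
interface CorrelationLite { _id: string; entity_ids: string[] }

/** Returns the changed entities plus everything reachable from them via correlations, sorted. */
function transitivelyAffected(changed: string[], correlations: CorrelationLite[]): string[] {
  // Index: entity id -> correlations that mention it.
  const byEntity = new Map<string, CorrelationLite[]>();
  for (const c of correlations) {
    for (const id of c.entity_ids) {
      const list = byEntity.get(id) ?? [];
      list.push(c);
      byEntity.set(id, list);
    }
  }
  const seen = new Set<string>(changed);
  const queue = changed.slice();
  while (queue.length > 0) {
    const current = queue.shift();
    if (current === undefined) break;
    for (const c of byEntity.get(current) ?? []) {
      for (const peer of c.entity_ids) {
        if (!seen.has(peer)) {
          seen.add(peer);
          queue.push(peer);
        }
      }
    }
  }
  return Array.from(seen).sort();
}
```

Entities that share no correlation chain with a changed entity are never touched, which is what keeps the incremental run cheaper than a full rescan.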

What re-materialization looks like

Closure expansion (closes #491):

M = ∅  // entities whose execution_paths must be re-materialized
for each entity e in TransitivelyAffectedEntities where entity_type ∈ {identity, workload}:
    M.add(e._id)
    M.add(canonical_identity_id of e)
    // Walk inbound RUNS_AS and add upstream workloads (Fix B from #459)
    for each w in storage.getEntitiesWithRelationshipTo(e._id, "RUNS_AS"):
        M.add(w._id)
    // NEW: walk equivalence-class peers and their upstream workloads
    for each peer p with canonical_identity_id == e.canonical_identity_id:
        for each w in storage.getEntitiesWithRelationshipTo(p._id, "RUNS_AS"):
            M.add(w._id)
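// NEW: walk new BRIDGES_TO edges added this run and re-materialize sources.
// Runs once, outside the per-entity loop, so bridge sources are added even
// when no identity or workload is in the affected set.
for each new bridge edge (s -> t) added this run:
    M.add(s._id)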

materializeExecutionPaths(M)
materializeAuthorityPaths(workloads in M)

The "walk equivalence-class peers" step is the structural fix for #491 (and the architectural successor to #459's Fix B). The replay test in PR #484 that currently .skips assertion (b) becomes green once this lands.
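
The closure expansion can also be expressed as a runnable in-memory sketch. The shapes below (ClosureEntity, RunsAsEdge, BridgeEdge) and the function name are hypothetical stand-ins for storage lookups such as getEntitiesWithRelationshipTo. In this sketch the bridge-edge loop sits outside the per-entity loop so that bridge sources are included even when no identity or workload changed.

```typescript
interface ClosureEntity { _id: string; entity_type: string; canonical_identity_id?: string }
interface RunsAsEdge { workload_id: string; identity_id: string }
interface BridgeEdge { source_id: string; target_id: string }

/** Computes M, the sorted set of entity ids whose paths must be re-materialized. */
function rematerializationSet(
  affected: ClosureEntity[],
  allEntities: ClosureEntity[],
  runsAs: RunsAsEdge[],
  newBridges: BridgeEdge[],
): string[] {
  const m = new Set<string>();
  // Stand-in for the inbound RUNS_AS lookup against storage.
  const upstreamWorkloads = (identityId: string): string[] =>
    runsAs.filter((edge) => edge.identity_id === identityId).map((edge) => edge.workload_id);

  for (const e of affected) {
    if (e.entity_type !== "identity" && e.entity_type !== "workload") continue;
    const canonical = e.canonical_identity_id ?? e._id;
    m.add(e._id);
    m.add(canonical);
    // Inbound RUNS_AS on the entity itself (Fix B from #459).
    for (const w of upstreamWorkloads(e._id)) m.add(w);
    // Equivalence-class peers and their upstream workloads.
    for (const p of allEntities) {
      if ((p.canonical_identity_id ?? p._id) !== canonical) continue;
      for (const w of upstreamWorkloads(p._id)) m.add(w);
    }
  }
  // Sources of bridge edges added this run, independent of entity type.
  for (const bridge of newBridges) m.add(bridge.source_id);
  return Array.from(m).sort();
}
```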

The materialized-paths collection (Option C from #486)

A new collection stitched_paths holds just the cross-connector segments discovered by the stitcher — not duplicating authority_paths, but giving the UI a fast lookup for "show me only paths that span two or more source systems." Schema:

export interface StitchedPathDoc {
  _id: string;                        // sha256(tenant + workload_id + canonical_identity_id + dest_resource_key)
  tenant_id: string;
  workload_id: string;                // entry-point workload
  canonical_identity_id: string;      // the bridging identity
  contributing_correlation_ids: string[];
  source_systems_traversed: string[]; // ordered list of distinct source systems on the path
  authority_path_id: string;          // pointer into existing authority_paths collection
  computed_at: Date;
  computed_by_stitch_run_id: string;
}

This is a denormalized index, not a new source of truth. UI queries like "show me all stitched paths in the Foundry-Entra-ServiceNow trio" become O(index lookup) instead of O(scan + filter).
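
As a small illustration of how source_systems_traversed might be derived, the sketch below walks the hops of a path and keeps each source system the first time it appears. PathHop, sourceSystemsTraversed, and isCrossConnectorPath are hypothetical names; how hops are obtained from authority_paths is not specified here.

```typescript
interface PathHop { entity_id: string; source_system: string }

/** Ordered list of distinct source systems, in first-seen order along the path. */
function sourceSystemsTraversed(hops: PathHop[]): string[] {
  const seen = new Set<string>();
  const ordered: string[] = [];
  for (const hop of hops) {
    if (!seen.has(hop.source_system)) {
      seen.add(hop.source_system);
      ordered.push(hop.source_system);
    }
  }
  return ordered;
}

/** A path belongs in stitched_paths only if it spans two or more source systems. */
function isCrossConnectorPath(hops: PathHop[]): boolean {
  return sourceSystemsTraversed(hops).length >= 2;
}
```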

Idempotency & ordering

Two scans landing simultaneously

Per-connector sync_ingestion jobs serialize via the existing per-tenant queue (no change). At most one connector handler runs at a time per tenant.
A new stitch_run is enqueued at the end of each connector handler. The debouncer collapses bursts.
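The stitch_runs collection enforces "at most one in-flight per tenant" via a partial unique index on tenant_id that is restricted to documents with status: "running" (MongoDB partialFilterExpression). A plain unique index on (tenant_id, status) would also forbid a second completed run for the same tenant.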
If a stitch run is in progress when a new sync_ingestion lands, the new sync runs to completion, then triggers a fresh stitch run after the current one finishes.

Connector A then B vs B then A — same canonical graph

This is the load-bearing invariant. Achieved by:

Correlation rules are pure functions over the post-merge entity set. Order of arrival of source records does not affect rule evaluation, because rules read from a settled snapshot.
The candidate set is sorted lexicographically by entity._id before evaluation, so first_match_wins is order-independent.
The canonical-identity-id is the lexicographically lowest entity._id in the equivalence class — order-independent.
Survivorship rules ("authoritative-source: Entra wins") are deterministic functions of (value, source_system) tuples — order-independent.

Property tested: permute(connector_completion_order) × replay(same_entities) → identical correlations and identical canonical_identity_id per entity.

Replay semantics

A stitch run can be deterministically re-run by:

Reading the stitch_runs doc (which records entity-set hashes, rule-set version hash, policy-table version).
Re-applying the rule registry at that version against the entity snapshot at that timestamp.
Asserting the produced correlations match what was persisted.

This is the foundation of the "why was this merged?" debuggability surface and of the integration replay tests (extending PR #484's harness).
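
The final replay step, asserting that the produced correlations match what was persisted, reduces to a set comparison over correlation IDs. A minimal sketch with hypothetical names (ReplayDiff, compareReplay, replayMatches):

```typescript
interface ReplayDiff {
  /** Persisted correlations that the replay did not reproduce. */
  missing: string[];
  /** Correlations the replay produced that were never persisted. */
  unexpected: string[];
}

function compareReplay(persistedIds: string[], replayedIds: string[]): ReplayDiff {
  const persisted = new Set<string>(persistedIds);
  const replayed = new Set<string>(replayedIds);
  return {
    missing: persistedIds.filter((id) => !replayed.has(id)).sort(),
    unexpected: replayedIds.filter((id) => !persisted.has(id)).sort(),
  };
}

/** A replay is faithful only when both lists are empty. */
function replayMatches(diff: ReplayDiff): boolean {
  return diff.missing.length === 0 && diff.unexpected.length === 0;
}
```

Reporting both directions separately helps distinguish a rule that stopped firing from one that started firing where it previously did not.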

Tenant opt-out + debuggability

Per-tenant per-rule disable

tenant_correlation_settings collection:

export interface TenantCorrelationSettingsDoc {
  _id: string;                              // tenant_id
  tenant_id: string;
  /** Per-rule enable/disable. Absence = use rule's default_enabled. */
  rules: Record<string, { enabled: boolean; reason?: string; updated_by: string; updated_at: Date }>;
  /** Per-rule parameter overrides (e.g. authoritative-source hierarchy for emails). */
  rule_params: Record<string, Record<string, unknown>>;
  /** Force-disabled correlations: never auto-stitch these entity pairs. */
  blocklist: Array<{ entity_a: string; entity_b: string; reason: string; created_at: Date }>;
  /** Operator-confirmed correlations not produced by any rule. */
  manual_links: Array<{ entity_ids: string[]; kind: CorrelationKind; reason: string; created_by: string; created_at: Date }>;
}

Disabling a rule triggers a stitch_run(tenant_id, mode: "full_rescan") so existing correlations from that rule get deprecated.
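
Resolving the effective state of a rule for a tenant, and checking the blocklist, could look like the sketch below. SettingsLite is a trimmed, hypothetical view of TenantCorrelationSettingsDoc, and treating blocklist pairs as unordered is an assumption of this example.

```typescript
interface RuleToggle { enabled: boolean }
interface BlockedPair { entity_a: string; entity_b: string }
interface SettingsLite {
  rules: Record<string, RuleToggle | undefined>;
  blocklist: BlockedPair[];
}

/** Absence of an override means "use the rule's default_enabled". */
function isRuleEnabled(
  ruleId: string,
  defaultEnabled: boolean,
  settings: SettingsLite | undefined,
): boolean {
  if (settings === undefined) return defaultEnabled;
  const override = settings.rules[ruleId];
  return override === undefined ? defaultEnabled : override.enabled;
}

/** True if the tenant has force-disabled stitching for this pair, in either order. */
function isBlocked(a: string, b: string, settings: SettingsLite | undefined): boolean {
  if (settings === undefined) return false;
  return settings.blocklist.some(
    (pair) =>
      (pair.entity_a === a && pair.entity_b === b) ||
      (pair.entity_a === b && pair.entity_b === a),
  );
}
```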

"Why was this merged?" surface

API:

GET /api/v1/identities/:id/lineage — returns the canonical view + every contributing source record + every correlation that linked them + the rule(s) that fired (with rule_id, rule_version, match_key_value, confidence, created_at).
GET /api/v1/correlations/:id — the raw CorrelationDoc with all source records and match key values.
GET /api/v1/identities/:id/correlation-history — full history of correlations that have ever linked this entity, including deprecated ones.

UI surface (described for Stream-3 completeness; not in this stream's implementation):

Identity card has a "Linked across N systems" badge. Click → expanding panel listing each contributing entity, the rule that linked them, the match-key value (e.g., "matched on Entra SP object ID 8a0cb6c3..."), and a "Disable this link" button that adds to the tenant blocklist.

Audit log

stitch_audit collection records every rule firing decision, including:

Rule fired and produced a new correlation.
Rule fired but on_collision: drop_all_ambiguous discarded the result (operator review queue).
Rule was skipped because tenant disabled it.
Existing correlation was confirmed (no change).
Existing correlation was deprecated (rule no longer fires).

Indexed by (tenant_id, stitch_run_id) and (tenant_id, entity_id).
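One way to model the five decision kinds listed above is a string union plus a small record builder. The decision labels, the StitchAuditRecord fields beyond the two indexed keys, and the choice to write one record per affected entity (so the `(tenant_id, entity_id)` index can answer "what happened to this entity?") are all assumptions of this sketch.

```typescript
// Hypothetical labels for the five audit outcomes described in the text.
type StitchDecision =
  | "fired_new"            // rule fired and produced a new correlation
  | "dropped_ambiguous"    // rule fired, drop_all_ambiguous discarded the result
  | "skipped_disabled"     // tenant disabled the rule
  | "confirmed"            // existing correlation still holds
  | "deprecated";          // rule no longer fires for this correlation

interface StitchAuditRecord {
  tenant_id: string;
  stitch_run_id: string;
  entity_id: string;
  rule_id: string;
  decision: StitchDecision;
  correlation_id?: string;
  recorded_at: Date;
}

// Fans one rule decision out into one append-only record per affected entity.
function auditRecords(
  run: { tenant_id: string; stitch_run_id: string },
  ruleId: string,
  entityIds: string[],
  decision: StitchDecision,
  correlationId?: string,
  now: Date = new Date(),
): StitchAuditRecord[] {
  return entityIds.map((entity_id) => ({
    tenant_id: run.tenant_id,
    stitch_run_id: run.stitch_run_id,
    entity_id,
    rule_id: ruleId,
    decision,
    ...(correlationId !== undefined ? { correlation_id: correlationId } : {}),
    recorded_at: now,
  }));
}
```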

Schema migrations required

This is the critical-path call-out for downstream streams. The following must land before the stitcher can be implemented; Stream-2 (multi-account AWS connector) and Stream-4 (Lab 2) consume these.

Migration	What changes	Why
M1: `connector_id` → `connector_owners[]`	`EntityDoc.connector_id: string` → `EntityDoc.connector_owners: string[]`. Backfill from existing scalar via one-time `updateMany`. Deletion detection in `diff-engine.ts:318-323` switches to `connector_owners: connectorId` filter; entity is only fully deleted when ALL owning connectors have marked it absent.	Closes #488. Blocks stitching: shared entities are the norm under stitching, so multi-owner deletion is required.
M2: `property_provenance` map	`EntityDoc.property_provenance: Record<string, { connector_id: string; observed_at: Date }>`. `diffProperties` compares only keys whose `property_provenance[key]` is missing or has `connector_id === connectorId`.	Closes #485. Required for per-property survivorship (canonical view).
M3: Atomic upsert via aggregation pipeline	`entity-adapter.ts upsertEntity` uses MongoDB aggregation pipeline updates (`$filter` + `$concatArrays`) so `mergeRelationships` is collapsed into one round-trip. Removes the read-merge-write race that exists today and is masked by single-worker serialization.	Closes #487. Required because stitcher writes during a window where a per-connector sync may also be writing.
M4: `entity_type=human_identity`	Add `"human_identity"` to `ENTITY_TYPES` in `src/domain/entities/types.ts`. Remove the silent retype at `graph-transformer.ts:45`. Backfill existing `owner` rows that originated from `human_identity` nodes. Update Identities page filter and Graph Explorer legend.	Closes #383. Required because cross-connector human-identity correlation is a P0 stitching rule.
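M5: `canonical_identity_id` on EntityDoc	New optional, nullable string field `canonical_identity_id` on `EntityDoc`. Set by stitch runs; `undefined` until a stitch run first evaluates the entity, `null` afterwards for entities not part of any equivalence class. Traversal gates must treat `null` and `undefined` alike.	Required for path-materializer equivalence-class traversal.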
M6: `correlations` collection	New collection. Indexes on `(tenant_id, entity_ids)` (multikey), `(tenant_id, rule_id, deprecated_at)`, `(tenant_id, last_confirmed_at)`. Schema as described above.	Storage for `CorrelationDoc`.
M7: `stitch_runs` collection	New collection. Partial unique index on `tenant_id`, restricted to documents with `status: "running"` (a `partialFilterExpression`), for at-most-one-in-flight enforcement. Schema includes entity-set hash, rule-version hash, policy-version hash, started_at/completed_at, metrics.	Required for replay determinism + concurrency control.
M8: `stitch_audit` collection	New collection. Append-only. Indexed by `(tenant_id, stitch_run_id)` and `(tenant_id, entity_id)`.	Required for "why was this merged?" surface.
M9: `tenant_correlation_settings` collection	New collection (described above).	Required for tenant opt-out.
M10: `stitched_paths` collection	New collection (described above).	Required for fast UI lookup of cross-connector paths.
M11: `NormalizedNode.correlationKeys?[]`	Add to `src/ingestion/types.ts`. Optional. Connectors that don't emit it are still supported (rules fall back to extracting from `properties`). Connectors that do emit it get faster, declarative correlation.	Required to cleanly express AWS OIDC subjects, ServiceNow OAuth client IDs, and federated-principal ARNs without spelunking through trust policies in extractors.
M12: `NormalizedNode.lineage_records?[]`	Add a stable per-source-record provenance block for fields the rule registry needs to attribute.	Required so source lineage in the canonical view is precise — the canonical view shows which connector contributed which property.
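The M1 and M2 semantics in the table can be illustrated with two pure functions. This is a sketch under stated assumptions: EntityLite is a hypothetical cut-down EntityDoc, "marked absent" is modelled as removing the connector from connector_owners, and markAbsent and diffableKeys are illustrative names rather than existing platform functions.

```typescript
interface Provenance {
  connector_id: string;
  observed_at: Date;
}

// Hypothetical cut-down EntityDoc with only the M1/M2 fields.
interface EntityLite {
  connector_owners: string[];
  properties: Record<string, unknown>;
  property_provenance: Record<string, Provenance | undefined>;
}

// M1: a connector that stops reporting an entity gives up only its own
// ownership; the entity is deleted once no owner remains.
function markAbsent(
  entity: EntityLite,
  connectorId: string,
): { owners: string[]; deleteEntity: boolean } {
  const owners = entity.connector_owners.filter((owner) => owner !== connectorId);
  return { owners, deleteEntity: owners.length === 0 };
}

// M2: the property keys a given connector's sync is allowed to diff.
// Keys with no provenance yet are treated as diffable by any connector.
function diffableKeys(entity: EntityLite, connectorId: string): string[] {
  return Object.keys(entity.properties).filter((key) => {
    const provenance = entity.property_provenance[key];
    return provenance === undefined || provenance.connector_id === connectorId;
  });
}
```

Scoping the diff this way is what prevents a no-op re-sync from one connector from emitting spurious entity_versions for properties another connector wrote.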

Migration / backward compat

Existing per-connector graphs → stitched graph:

M1–M4 land first (Option A from #486 — closes the schema bugs). Each is independently shippable.
M5–M12 land in a single PR series with the stitcher disabled by default (STITCHER_ENABLED=false env flag).
A one-time backfill stitch run executes on each tenant when STITCHER_ENABLED=true is flipped. The first run is mode: "full_rescan" and may be expensive on large tenants (see the cost estimates below); it runs out-of-band, off the request path.
The materializer change to traverse equivalence classes is gated on canonical_identity_id !== undefined. Pre-stitch entities have it undefined and traversal behaves identically to today.

How existing UI / queries continue to work during migration:

entities collection remains the source of truth. UI reads EntityDoc as before.
Authority paths are still materialized into authority_paths collection. The stitcher only adds to the path set; it never deletes paths the existing materializer would have produced.
stitched_paths is a new index, not a new source — the existing /authority-paths/grouped endpoint becomes "include stitched paths in grouping" rather than a new endpoint.
The connector_id field is retained as a deprecated mirror of connector_owners[0] for one quarter to give downstream readers (analytics, manual scripts) time to migrate.

Re-stitch existing data: cost, time:

Production tenant default (~3,000 entities, 4 connectors): expected initial full stitch < 60 s.
Demo-w1 (~200 entities): < 5 s.
demo-nimbus (~300 AWS entities, single connector): < 5 s.
Subsequent incremental stitches (per debounce window): < 2 s for typical change sets (~10–100 affected entities).

These are estimates derived from the rule-evaluation cost, O(rules × candidates × log(candidates)), where the log factor comes from the index lookup. They will be benchmarked in Phase 4.
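To make the complexity expression concrete, the snippet below plugs in the tenant sizes listed above. It assumes the 7 rules of the initial rule set and a base-2 logarithm; the result is a count of index-lookup steps, not a timing, and estimatedLookupSteps is a name invented for this sketch.

```typescript
// rules × candidates × log2(candidates): an order-of-magnitude work estimate.
function estimatedLookupSteps(rules: number, candidates: number): number {
  if (candidates <= 1) return 0; // log2(1) = 0, nothing to look up against
  return rules * candidates * Math.log2(candidates);
}

const productionDefault = estimatedLookupSteps(7, 3000); // on the order of 2.4e5 steps
const demoW1 = estimatedLookupSteps(7, 200);             // on the order of 1e4 steps

console.log({
  productionDefault: Math.round(productionDefault),
  demoW1: Math.round(demoW1),
});
```

Wall-clock time still depends on per-lookup latency and write volume, which is what the Phase 4 benchmark would measure.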

Implementation plan (writing-plans format)

All tasks live in sv0-platform unless noted. Each task is bite-sized (≤1 day for one engineer), has a clear acceptance criterion, and follows TDD: write the failing test first, then make it pass.

Phase 1 — Schema migrations (unblock the rest)

Goal: ship Option A from sv0-platform#486 plus the stitching-specific schema additions. Each PR is independently revertible.

M1: connector_owners[] migration — Add connector_owners: string[] to EntityDoc; teach entity-adapter.ts to $addToSet on upsert; ship a one-time backfill script scripts/migrations/2026-04-backfill-connector-owners.ts; flip deletion detection to filter by connector_owners. Acceptance: integration test where two connectors write the same entity → both appear in connector_owners; only fully-absent entities are deleted.
M2: property_provenance map — Add property_provenance to EntityDoc; teach graph-transformer.ts to stamp it; teach diffProperties to filter by it. Acceptance: regression test for #485 (no spurious entity_versions on no-op cross-connector re-sync).
M3: Atomic aggregation-pipeline upsert — Replace read-merge-write at sync-ingestion.ts:124-140 with an atomic aggregation-pipeline updateOne in entity-adapter.ts. Acceptance: property-test where two connectors interleave reads/writes → no relationships are lost.
M4: entity_type=human_identity + #383 fix — Add human_identity to ENTITY_TYPES; remove the retype at graph-transformer.ts:45; backfill existing owner rows. Acceptance: GET /api/v1/entities?entity_type=human_identity returns the 4 SSO users on demo-nimbus.
M5: canonical_identity_id field — Add optional field to EntityDoc; index (tenant_id, canonical_identity_id). No write logic yet — placeholder for Phase 4. Acceptance: index exists; field accepts null; existing tests pass.
M6–M10: Stitching collections + indexes — Create correlations, stitch_runs, stitch_audit, tenant_correlation_settings, stitched_paths collections via the storage adapter. Add MongoDB indexes. Acceptance: storage-adapter tests for each collection's CRUD methods pass.
M11–M12: NormalizedNode.correlationKeys + lineage_records — Add optional fields to src/ingestion/types.ts. Acceptance: existing connectors continue to work without emitting these (backwards compatible).
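For M3, the merge can be expressed as a single aggregation-pipeline update so the server performs the filter and concatenation atomically. The builder below is a sketch only: it assumes each relationship carries a source_connector_id (the relationship-level provenance field referenced elsewhere in this proposal), and the RelationshipLite shape and buildUpsertPipeline name are hypothetical. `$set`, `$concatArrays`, `$filter`, `$ifNull`, `$setUnion`, and `$literal` are standard MongoDB aggregation operators.

```typescript
// Hypothetical relationship shape; only source_connector_id matters here.
interface RelationshipLite {
  type: string;
  target_id: string;
  source_connector_id: string;
}

// Builds a pipeline-style update that, in one round-trip:
//  1. drops the relationships previously written by this connector,
//  2. appends this connector's fresh relationships,
//  3. records the connector in connector_owners (M1).
function buildUpsertPipeline(
  connectorId: string,
  incoming: RelationshipLite[],
): Array<Record<string, unknown>> {
  return [
    {
      $set: {
        relationships: {
          $concatArrays: [
            {
              $filter: {
                input: { $ifNull: ["$relationships", []] }, // missing on first insert
                as: "rel",
                cond: { $ne: ["$$rel.source_connector_id", { $literal: connectorId }] },
              },
            },
            // $literal stops the server from interpreting "$"-prefixed strings as field paths.
            { $literal: incoming },
          ],
        },
        connector_owners: {
          $setUnion: [{ $ifNull: ["$connector_owners", []] }, { $literal: [connectorId] }],
        },
      },
    },
  ];
}
```

The returned array would be passed as the update argument of `updateOne` with `upsert: true`; pipeline-style updates are available from MongoDB 4.2. Note that `$setUnion` does not guarantee element order, so anything that reads connector_owners[0] should not rely on insertion order.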

Phase 2 — Correlation rule engine + initial rule set

Rule schema + registry skeleton — Add src/domain/correlations/types.ts with the schemas above; create src/ingestion/stitching/rules/registry.ts exporting an empty CorrelationRule[]. Acceptance: types compile, registry is iterable.
Match-key extractor framework — Add src/ingestion/stitching/extractors/index.ts with the Extractor interface and a registry. Implement entra_sp_object_id, entra_app_id_lower, oauth_client_id_lower, email_lower, arn_canonical. One file per extractor. Acceptance: each extractor has unit-test fixtures with positive and negative cases.
Extractor: aws_role_oidc_trust_subject — Parse trust-policy JSON; return Entra SP object ID for OIDC trusts on sts.windows.net/<tenant>. Acceptance: fixture from a real Lab-1 / Lab-2 AWS role yields the correct subject; non-Entra federations return null.
Extractor: aws_role_saml_trust_subject — Same shape for SAML. Acceptance: Lab-2 fixture passes.
Extractor: external_principal_arn — Iterates Principal.AWS entries; emits one match key per ARN. Acceptance: cross-account-trust fixture yields N match keys.
Extractor: endpoint_uri_normalized — URL parse; lowercase host+path; strip query/fragment. Acceptance: matches Foundry connection endpoint vs ServiceNow REST message endpoint.
Define initial rule set — Add the 7 rules from "Initial correlation rule set" table to registry.ts. Each rule has a unit test verifying its predicate selects only the intended entity classes. Acceptance: rule registry exports 7 rules; per-rule tests pass.
Rule executor — Add src/ingestion/stitching/rule-executor.ts that, given a rule and a candidate set of entities, returns a list of proposed CorrelationDoc records. Pure function. Acceptance: per-rule executor test produces expected correlations against a fixture.
Collision policies — Implement first_match_wins, all_match, drop_all_ambiguous in the executor. Acceptance: collision-policy tests pass with multi-candidate fixtures.
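Two of the extractors above are small enough to sketch. The first follows the endpoint_uri_normalized description (lowercase host and path, drop query and fragment); leaving the scheme out of the key is an assumption. The second approximates aws_role_oidc_trust_subject: it looks for an Allow statement whose federated principal is an `sts.windows.net` OIDC provider and returns the value of the matching `:sub` condition key. The type names, regexes, and the decision to return null when the subject is not a single value are illustrative choices, not the behaviour of trust_policy_parser.py.

```typescript
// endpoint_uri_normalized: lowercase host + path, strip query and fragment.
// Assumption: the scheme is not part of the match key.
function endpointUriNormalized(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null; // not an absolute URL, so no match key
  }
  return `${url.host}${url.pathname}`.toLowerCase();
}

interface TrustStatement {
  Effect: string;
  Principal?: { Federated?: string | string[] };
  Condition?: { StringEquals?: Record<string, string | string[]> };
}

function asList(value: string | string[] | undefined): string[] {
  if (value === undefined) return [];
  return typeof value === "string" ? [value] : value;
}

const ENTRA_PROVIDER = /:oidc-provider\/sts\.windows\.net\//i;
const ENTRA_SUB_KEY = /^sts\.windows\.net\/[0-9a-f-]+\/?:sub$/i;

// aws_role_oidc_trust_subject: the Entra SP object ID named by an OIDC trust,
// or null for non-Entra federations and ambiguous (multi-valued) subjects.
function awsRoleOidcTrustSubject(trustPolicyJson: string): string | null {
  const policy = JSON.parse(trustPolicyJson) as {
    Statement?: TrustStatement | TrustStatement[];
  };
  const raw = policy.Statement;
  const statements = raw === undefined ? [] : Array.isArray(raw) ? raw : [raw];
  for (const stmt of statements) {
    if (stmt.Effect !== "Allow") continue;
    const federated = asList(stmt.Principal?.Federated);
    if (!federated.some((arn) => ENTRA_PROVIDER.test(arn))) continue;
    const equals = stmt.Condition?.StringEquals ?? {};
    for (const key of Object.keys(equals)) {
      if (!ENTRA_SUB_KEY.test(key)) continue;
      const subjects = asList(equals[key]);
      const only = subjects[0];
      if (subjects.length === 1 && only !== undefined) return only;
    }
  }
  return null;
}
```

Keeping both functions pure (string in, key or null out) is what makes the per-extractor positive and negative fixtures cheap to write.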

Phase 3 — Stitcher pipeline integration

StitchRunDoc lifecycle — Add storage-adapter methods to insert/update stitch_runs with at-most-one-in-flight enforcement. Acceptance: integration test where two stitch_run inserts race → second one waits.
Debounced stitch_run trigger — Add a STITCH_DEBOUNCE_MS env (default 60 000); modify the worker handler to enqueue a debounced stitch_run at end of each sync_ingestion. Acceptance: integration test where two sync_ingestions land within 60 s → one stitch_run executes.
Stitcher worker handler stitch_ingestion — New file src/workers/handlers/stitch-ingestion.ts. Implements steps S1–S8 from the pipeline diagram. Reads tenant settings, computes change set, applies rule executor, writes correlations + canonical_identity_id, emits audit. Acceptance: single-rule integration test (Foundry replay fixture) produces a correlation between Entra SP and the AWS role that trusts it.
Tenant opt-out wiring — Read tenant_correlation_settings at the start of each stitch_run; honor disabled rules and the blocklist. Acceptance: integration test where rule is disabled per-tenant → no correlation produced.
Audit logging — Every rule decision (fired/skipped/dropped/confirmed/deprecated) writes a stitch_audit record. Acceptance: audit query returns one record per rule per candidate per stitch run.
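The debounce behaviour (several sync_ingestions inside one window collapse into a single stitch_run) can be sketched without timers by passing the clock in explicitly, which also keeps it testable. StitchDebouncer and its methods are hypothetical names, and resetting the deadline on every sync (trailing-edge debounce) is an assumption; a fixed window from the first sync would satisfy the acceptance test equally well.

```typescript
// Tracks, per tenant, when the next stitch_run becomes due.
class StitchDebouncer {
  private readonly dueAt = new Map<string, number>();
  private readonly debounceMs: number;

  constructor(debounceMs: number = 60_000) {
    this.debounceMs = debounceMs; // mirrors the STITCH_DEBOUNCE_MS default
  }

  // Call at the end of each sync_ingestion; pushes the tenant's deadline out.
  noteSyncCompleted(tenantId: string, nowMs: number): void {
    this.dueAt.set(tenantId, nowMs + this.debounceMs);
  }

  // Returns (and forgets) the tenants whose debounce window has closed.
  takeDue(nowMs: number): string[] {
    const due: string[] = [];
    this.dueAt.forEach((deadline, tenantId) => {
      if (deadline <= nowMs) due.push(tenantId);
    });
    for (const tenantId of due) this.dueAt.delete(tenantId);
    return due.sort();
  }
}
```

A worker loop would call `takeDue(Date.now())` periodically and enqueue one stitch_run per returned tenant.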

Phase 4 — Re-materialization

Equivalence-class traversal in path materializer — Modify path-materializer.ts so workload-to-identity edges traverse canonical_identity_id peers. Gated on canonical_identity_id !== undefined. Acceptance: integration test where Entra SP and AWS role share a canonical ID → workload RUNS_AS Entra SP produces an authority path through the AWS role's HAS_ROLE edges.
Re-materialization closure — In stitch-ingestion.ts, compute M per the pseudocode in "Re-materialization", call materializeExecutionPaths(M) and materializeAuthorityPaths(workloads in M). Acceptance: PR #484's .skip'd assertion (b) becomes green; #491 closes.
stitched_paths index materialization — After authority paths are computed, write StitchedPathDoc records for any path whose source_systems_traversed.length > 1. Acceptance: the Foundry-Entra-ServiceNow path appears in stitched_paths with three source systems.
BRIDGES_TO edge materialization — For BRIDGES_TO correlations, write a synthetic edge into the source entity's relationships[] with source_connector_id = "stitcher". Acceptance: connection-endpoint-bridge rule produces a CONNECTS_TO edge between Foundry connection and Logic App resource.
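The equivalence-class traversal can be illustrated with a breadth-first walk whose visited set is keyed on the class rather than on the entity `_id`. This sketch ignores edge types (the real materializer only follows specific ones such as RUNS_AS and HAS_ROLE), and GraphNode, classKey, and reachableEntities are hypothetical names. Entities with a null or undefined canonical_identity_id form a class of one.

```typescript
interface GraphNode {
  _id: string;
  canonical_identity_id?: string | null;
  edges: string[]; // target entity _ids; edge types omitted in this sketch
}

// Class key: the canonical ID when set, otherwise the entity's own _id.
function classKey(node: GraphNode): string {
  return node.canonical_identity_id ?? node._id;
}

// All entities reachable from startId, hopping across equivalence-class peers.
function reachableEntities(startId: string, nodes: Map<string, GraphNode>): string[] {
  const peers = new Map<string, GraphNode[]>();
  nodes.forEach((node) => {
    const key = classKey(node);
    const list = peers.get(key) ?? [];
    list.push(node);
    peers.set(key, list);
  });

  const start = nodes.get(startId);
  if (start === undefined) return [];

  const visited = new Set<string>(); // keyed on class, which also breaks cycles
  const reached: string[] = [];
  const queue: string[] = [classKey(start)];
  while (queue.length > 0) {
    const key = queue.shift() as string;
    if (visited.has(key)) continue;
    visited.add(key);
    for (const member of peers.get(key) ?? []) {
      reached.push(member._id);
      for (const targetId of member.edges) {
        const target = nodes.get(targetId);
        if (target !== undefined) queue.push(classKey(target));
      }
    }
  }
  return reached.sort();
}
```

Because pre-stitch entities fall back to their own `_id` as the class key, the walk degenerates to an ordinary per-entity traversal when no canonical IDs are set.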

Phase 5 — Debuggability + opt-out

/api/v1/identities/:id/lineage endpoint — Returns canonical view + contributing source records + correlations + rule firings. Acceptance: API test against Foundry replay fixture returns 3 source records, 1 SAME_ENTITY correlation, 1 rule firing.
/api/v1/correlations/:id endpoint — Returns the full CorrelationDoc with source records. Acceptance: API test passes.
/api/v1/admin/stitch-runs POST endpoint — Manual trigger for full-rescan and per-tenant settings updates. Acceptance: POST with mode: "full_rescan" re-stitches the tenant; sync resolves with stitch_run summary.
Tenant opt-out admin endpoints — PUT /api/v1/admin/tenants/:id/correlation-settings to disable rules / blocklist correlations / add manual links. Acceptance: API test where a rule is disabled then a stitch run is triggered → existing correlations from that rule are deprecated.

Phase 6 — UI surface

Stitched-identity card — UI component on the Identity Detail page showing "Linked across N systems" badge, expanding panel with per-source-record breakdown, rule provenance per link. Acceptance: visual QA on Foundry replay shows 3 source records on servicenow-openai-client identity card.

Total: 30 tasks across 6 phases. Phase 1 is independently shippable and unblocks the rest. Phases 2–4 are sequential. Phases 5–6 can parallelize after Phase 4 lands.

Validation criteria

Per-phase acceptance

Phase	Validation
1 (schema)	`npm run ci` passes. Backfills idempotent. Existing replay test (`test/integration/replay/sergey-demo.test.ts`) remains green.
2 (rules)	All 7 rules have unit tests. All extractors have positive + negative fixtures. Rule executor is order-independent (property test).
3 (pipeline)	After ingesting Entra-ServiceNow + Azure Foundry fixtures from PR #484: a single `CorrelationDoc` exists linking the OAuth client to the Entra SP. `connector_owners` on the SP includes both connectors. No spurious events on no-op re-sync (#485 fully closed).
4 (re-materialization)	After ingesting Entra-ServiceNow + AWS connector outputs where AWS role X trusts Entra SP Y via OIDC: one canonical identity exists (`canonical_identity_id` shared), `source_record_count = 2`, and one authority path of length ≥ 4 spans both source systems. PR #484 assertion (b) un-skipped and green.
5 (debuggability)	`GET /api/v1/identities/:id/lineage` returns ≥ 2 source records for any stitched identity, with rule provenance per link.
6 (UI)	Stitched-identity card visible on Identity Detail page; visual QA passes per platform standards.

MediaPro Lab 2 validation contract (delivered to Stream 4)

When MediaPro Lab 2 runs end-to-end (Stream 4 builds the demo; this stream owns the data-shape acceptance), the platform must produce exactly these stitched paths for the demo to count as validated:

Bedrock-agent → Lambda → MCP-server → Entra-SP → ServiceNow-OAuth-app → HR-table
- Resolves to one AuthorityPathDoc of length 6 (or 7 if MCP server emits a separate identity).
- source_systems_traversed = ["aws_iam", "entra_id", "servicenow"] (3 distinct).
- contributing_correlation_ids includes:
  - One aws-oidc-federation-to-entra-sp correlation (or mcp-server-to-entra-sp if the MCP server uses env-var auth).
  - One servicenow-oauth-to-entra-sp correlation linking the SN OAuth client to the Entra SP.
- canonical_identity_id on the Entra SP, AWS role, and SN OAuth client all match.
- GET /authority-paths/grouped?identity=<canonical_identity_id> returns the workload + this path.
Bedrock-agent → cross-account assume-role → S3 PII bucket
- Resolves to one AuthorityPathDoc of length ≥ 3.
- One aws-cross-account-role-trust correlation links the source role to the target role.
- Bridges nimbus-workloads and nimbus-data accounts in the same canonical AWS-org context.
Foundry-agent → Logic-App → ServiceNow-incident-table (Lab 2 Phase B)
- Resolves to one AuthorityPathDoc with connection-endpoint-bridge correlation in contributing_correlation_ids.
- Logic App appears as a single resource entity (not duplicated across Foundry and Entra source records).
- source_systems_traversed includes both azure_foundry and servicenow.
No duplicate entities for the OAuth client: the demo screen shows ONE node for servicenow-openai-client even though Entra (SP), Foundry (managed identity), and ServiceNow (OAuth client) all contribute.
Lineage panel on the canonical identity shows ≥ 3 source records with their contributing connectors and rule firings.
Order independence: re-running connectors in any order (AWS first then Entra then SN; SN then Entra then AWS; etc.) produces identical correlations and identical canonical_identity_ids.
Tenant opt-out works: disabling aws-oidc-federation-to-entra-sp for the demo tenant and triggering a manual mode: "full_rescan" removes the AWS-Entra link; the AWS role and Entra SP render as separate entities again.
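The order-independence requirement can be made concrete with a union-find over SAME_ENTITY pairs in which the lexicographically smallest member always becomes the class root (the convention noted under "Open questions"). This is an illustrative sketch: canonicalIds is a hypothetical helper, and real correlations may link more than two entities at once.

```typescript
// Maps every entity that appears in a SAME_ENTITY pair to its canonical ID.
// Invariant: the root of each class is its lexicographically smallest member,
// so the result does not depend on the order in which pairs arrive.
function canonicalIds(pairs: Array<[string, string]>): Map<string, string> {
  const parent = new Map<string, string>();

  const find = (id: string): string => {
    if (!parent.has(id)) parent.set(id, id);
    let root = id;
    while (parent.get(root) !== root) root = parent.get(root) as string;
    parent.set(id, root); // light path compression
    return root;
  };

  for (const [a, b] of pairs) {
    const rootA = find(a);
    const rootB = find(b);
    if (rootA === rootB) continue;
    // Attach the larger root under the smaller one.
    if (rootA < rootB) parent.set(rootB, rootA);
    else parent.set(rootA, rootB);
  }

  const canonical = new Map<string, string>();
  Array.from(parent.keys()).forEach((id) => canonical.set(id, find(id)));
  return canonical;
}
```

Feeding the same pairs in AWS-first or ServiceNow-first order yields the same map, which is the property the order-independence check above asserts.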

Non-goals (explicit)

This stream does not specify the Lab 2 demo narrative or visual flow — that is Stream 4's deliverable.
This stream does not extend connectors beyond emitting the optional correlationKeys[] and lineage_records[] (M11–M12). The deeper trust-policy parsing for AWS lives in extractors, not in connectors. Stream 2 owns AWS-side node shapes.
This stream does not build the ScanRun schema (Stream 1 owns it); it consumes the completed event.

Open questions

Should BRIDGES_TO correlations participate in the equivalence-class merge? Currently no — they only add an edge, not a canonical link. But the MCP-server-to-Entra-SP rule blurs this: if an MCP server's only identity is its Entra SP, are they conceptually the same identity or just bridged? Recommendation: keep BRIDGES_TO strictly edge-additive; promote a rule to SAME_ENTITY only when the structural relationship is unambiguous (OIDC subject IS the Entra principal ID).
What happens if a SAME_ENTITY correlation links entities of different entity_type? Example: Entra emits identity, AWS emits role. The type-survivorship rule resolves canonical type, but the contributing entities keep their original types. Does the UI show "this entity is sometimes a role, sometimes an identity"? Recommendation: yes; lineage panel shows per-source type. Findings evaluator reads the canonical type.
How should the path materializer handle equivalence classes with > 2 members under cycle detection? Today's materializer has cycle detection via visited sets keyed on entity ID. With equivalence-class traversal, the visited set must be keyed on canonical_identity_id, not _id. Edge case: an entity's canonical_identity_id == entity._id (the lex-smallest member). Acceptance test required.
Does the stitcher need its own circuit breaker like the diff-engine deletion breaker? If a buggy rule starts producing cross-tenant correlations or a bad extractor merges hundreds of unrelated entities, the system needs a halt. Recommendation: add a stitch-level breaker that halts if a single stitch_run would touch >X% of the tenant's entities (configurable, default 50%).
Should stitched_paths be the source of truth for the UI for cross-system paths, or just an index? If source-of-truth, authority_paths becomes a per-source-system view. If just-an-index, authority_paths continues to be the canonical store and stitched_paths is denormalized for fast lookup. Recommendation: just-an-index for now; reconsider after Lab 2 ships if query patterns demand it.
For Stream 4: what happens when a stitched path's contributing connectors disagree on intervals? Example: AWS reports the assume-role grant as continuously active; Entra reports the SP was disabled for two weeks last quarter. The path should reflect the gap. This is a finding-layer concern, not a stitching concern, but it affects what Stream 4's demo can claim. Flagged here so Stream 4 designs around it.
Should rules be able to reference other rules' outputs? Example: "if aws-oidc-federation-to-entra-sp linked entities X and Y, and the Entra SP Y has an OAuth client correlation, transitively bridge X to the OAuth client." Today this works because path-materializer traverses the graph. But if rules could chain, more compact correlations would be possible. Recommendation: do not allow rule chaining initially (kills order-independence). Re-evaluate after Phase 4.
What is the SLO for stitch runs? Default debounce is 60 s; first stitch after a connector sync should complete within 5 minutes for tenants ≤ 10 K entities. Beyond that, large-tenant performance needs benchmarking (Phase 4 acceptance criterion).
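The stitch-level circuit breaker recommended above could look roughly like the check below, run after rule evaluation and before any write. The 50% default comes from the recommendation; ProposedCorrelation, BreakerVerdict, checkStitchBreaker, and the additional cross-tenant guard are assumptions of this sketch.

```typescript
interface ProposedCorrelation {
  tenant_id: string;
  entity_ids: string[];
}

interface BreakerVerdict {
  halt: boolean;
  reason?: string;
}

// Halts the run if any proposal crosses tenants, or if the run would touch
// more than maxTouchedFraction of the tenant's entities.
function checkStitchBreaker(
  runTenantId: string,
  proposals: ProposedCorrelation[],
  tenantEntityCount: number,
  maxTouchedFraction: number = 0.5,
): BreakerVerdict {
  const touched = new Set<string>();
  for (const proposal of proposals) {
    if (proposal.tenant_id !== runTenantId) {
      return { halt: true, reason: "cross-tenant correlation proposed" };
    }
    for (const id of proposal.entity_ids) touched.add(id);
  }
  if (tenantEntityCount > 0 && touched.size / tenantEntityCount > maxTouchedFraction) {
    return {
      halt: true,
      reason: `run would touch ${touched.size} of ${tenantEntityCount} entities`,
    };
  }
  return { halt: false };
}
```

A first full_rescan on a heavily federated tenant could legitimately exceed the threshold, so the limit would likely need to be overridable per run.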

References

Internal — research and plans

2026-02-26-cross-connector-entity-correlation-research.md — Phase A–E foundation that this proposal extends.
docs/plans/2026-04-21-multi-connector-reconciliation.md (in multi-connector-reconciliation worktree) — the Option A/B/C/D analysis that selected Option C as the eventual target.
docs/plans/2026-04-08-demo-lab-plan.md — Lab 2 dependency on stitching; multi-connector demo requirements.
docs/session-notes/2026-04-20-foundry-demo-resolution-session-handoff.md (in sv0-platform) — what #459 / #461 patched and what remained.

Internal — issues

sv0-platform#300 — feat(ingestion): cross-connector graph stitching for shared identities (the original ask).
sv0-platform#486 — epic: multi-connector reconciliation phase (Option C).
sv0-platform#491 — investigation: cross-sync re-materialization gap.
sv0-platform#488 — bug: EntityDoc.connector_id is singular.
sv0-platform#487 — bug: concurrent upsertEntity race via path-materializer.
sv0-platform#485 — bug: diffProperties not connector-scoped.
sv0-platform#383 — bug: human_identity silently retyped to owner.
sv0-platform#459 (merged) — fix: cross-connector relationship merge + upstream re-materialize.
sv0-platform#461 (merged) — fix: scope diff-engine relationship comparison to current connector.
sv0-connectors#79 — feat: cross-connector entity correlation Phase A–E (partially shipped).

Internal — architecture docs

01-data-model.md — entity / relationship schema being extended.
02-processing-pipeline.md — pipeline being extended with stitch_run phase.
03-database.md — collection schemas; new collections added here.
05-connectors.md — NormalizedGraph contract being extended (M11, M12).

Internal — source code

src/ingestion/types.ts — NormalizedNode, NormalizedEdge, NormalizedGraph, ScanScope.
src/domain/entities/types.ts — EntityDoc, EntityRelationship, ExecutionPath, EntityVersionDoc.
src/ingestion/graph-transformer.ts — buildStableEntityId, mapNodeType (#383 lives here at line 45), reclassifyBySubtype.
src/ingestion/diff-engine.ts — diffProperties (#485), diffRelationships (post-#460), computeDiff deletion scope (#488 root cause at line 318-323).
src/ingestion/path-materializer.ts — materializeExecutionPaths, FORWARDING_EDGE_TYPES (line 108).
src/ingestion/authority-path-materializer.ts — authority-path materialization, removal circuit breaker.
src/storage/storage-adapter.ts — getEntityBySourceId, upsertEntity, getEntitiesWithRelationshipTo.
src/storage/mongo/adapters/entity-adapter.ts — upsertEntities non-atomic $set (M3 root cause).
src/workers/handlers/sync-ingestion.ts — 12-step per-connector handler; mergeRelationships at lines 26-37; read-merge-write at 124-140; upstream re-materialize Fix B at 218-231.
sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/correlator.py — connector-side OAuth-client-id ↔ Entra-app-id correlation (the model the platform stitcher generalizes).
sv0-connectors/integrations/aws/src/sv0_aws/core/trust_policy_parser.py — AWS trust-policy parsing the aws_role_oidc_trust_subject extractor mirrors (read-only).
sv0-connectors/shared/sv0_azure/sv0_azure/node_ids.py — shared node ID generators (Phase D from 2026-02-26 doc; only Entra principals covered today).

External

Veza OAA cross-service connections (deterministic exact-match identity correlation).
Wiz unified-graph + query-time path discovery (the architectural shape this proposal converges on).
SailPoint correlation rules + authoritative-source policies (the survivorship-rules model).
Neo4j entity-resolution patterns (linking-relationships pattern this proposal adopts).
MongoDB aggregation pipeline updates (docs) — used for atomic merge upserts (M3).
