Cross-Connector Graph Stitching Architecture
TL;DR
Today's pipeline ingests each connector's NormalizedGraph independently and merges them only where two connectors happen to emit the same (source_system, source_id) tuple. That produces a "half-stitched" graph: an Entra service principal, the AWS IAM role that trusts it via OIDC, and the ServiceNow OAuth client that uses its client_id render as three islands even when they are the same identity. This proposal extends the adopted 2026-02-26 correlation research (Phase A still required as schema bedrock; Phases B-E need re-scoping for AWS, multi-account, and Question-B identity reconciliation) with an explicit reconciliation phase (Option C from sv0-platform#486) that runs after a per-tenant stitch group settles. The phase: (1) applies a deterministic correlation rule registry against post-upsert entities, (2) materializes correlations linking records and an optional canonical entity, (3) re-runs path materialization scoped to the closure of changed correlations, and (4) is fully auditable per analyst click. No ML, no fuzzy matching, MongoDB-only.
Problem
The Foundry demo on 2026-04-21 only worked because PR #459 and PR #461 patched two cross-connector data-shape bugs in the diff engine eight days before Sergey's call. The patches were tactical: relationship-level provenance plus a scoped diff. The structural problem — the platform has no place that owns the question "is identity A in Entra the same identity as IAM role X in AWS and OAuth client C in ServiceNow?" — remains. Concrete symptoms today:
- Foundry case (closed by #459/#461): same entra-sp-{principal_id} was emitted by both Entra-ServiceNow and Azure Foundry; the connector-side shared node_ids.py library de-duplicated by source-id agreement. The platform never stitched anything; it simply got lucky that two connectors agreed on a tuple. The seed script can render the full path because it builds it manually; live connectors cannot.
- AWS-Entra federation case (NOT closed): Entra emits entra-sp-{principal_id} with source_system=entra_id. AWS emits aws-iam-role-{account}-{name} with source_system=aws_iam. They have different (source_system, source_id) tuples and therefore different _id values from buildStableEntityId. The trust policy on the AWS role names the Entra SP via OIDC subject, but no platform code reads that policy and produces a link. They render as two nodes.
- ServiceNow-Entra OAuth case: handled today only because the entra-servicenow connector internally correlates SN OAuth → Entra SP by client_id and emits a CORRELATED edge before submitting the graph (sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/correlator.py:440). When AWS or any other connector enters the picture, no equivalent stitch exists.
- Lab 2 (Nimbus Enterprise) is gated by exactly this gap (docs/plans/2026-04-08-demo-lab-plan.md:431,504). The plan calls out: "building it earlier produces a half-stitched demo that undersells the product." MediaPro Lab 2 is the same shape.
The platform's path-materializer is already source-system-agnostic — it follows edges by entity _id regardless of which connector created them (src/ingestion/path-materializer.ts:108). The bottleneck is upstream of materialization: nothing forces those _id values to converge.
Current state
What 2026-02-26-cross-connector-entity-correlation-research.md proposed and what shipped
The 2026-02-26 doc proposed five phases:
| Phase | Description | Status today |
|---|---|---|
| A | Multi-connector entity ownership (connector_owners[]) + relationship partitioning (source_connector_id per edge, atomic pipeline upsert) | Partially shipped. EntityRelationship.source_connector_id landed in #459. Atomic pipeline upsert and connector_owners[] did not ship — the read-merge-write at sync-ingestion.ts:124-140 is still racy under concurrency, and connector_id is still singular (#488). |
| B | Connector-declared correlationKeys[] on NormalizedNode (e.g., endpoint_uri, entra_principal_id) | Did not ship. Endpoint URLs are still stored as plain properties; no platform code consumes them as match keys. |
| C | entity-correlator.ts runs after upsert, before path materialization | Did not ship. No correlations collection, no correlator. |
| D | Extend shared node_ids.py for ARM resources | Partial — Entra SP / user covered (the de-dup that saved the Foundry demo); ARM resources, AWS roles, Logic Apps not covered. |
| E | Materializer extension to follow CONNECTS_TO for connection-to-resource bridging | Did not ship. FORWARDING_EDGE_TYPES at path-materializer.ts:108 is still {CALLS, INVOKES, USES, AUTHENTICATES_AS, AUTHENTICATES_VIA}. |
So the only pieces of the 2026-02-26 research that landed are part of Phase A (relationship provenance) and the Entra-principal slice of Phase D. Everything that requires platform-side correlation logic is unbuilt.
What #459 and #461 fixed tactically
- #459 added EntityRelationship.source_connector_id, taught the graph transformer to stamp it, taught sync-ingestion.ts to merge cross-connector relationships before upsert (mergeRelationships at sync-ingestion.ts:26-37), and added getEntitiesWithRelationshipTo so a Foundry sync that adds HAS_ROLE to a shared SP also re-materializes upstream Entra-ServiceNow workloads via inbound RUNS_AS. None of this stitches across (source_system, source_id) tuples; it merely fixes the wholesale-overwrite bug for the cases where two connectors already agree on the tuple.
- #461 scoped diffRelationships to source_connector_id === connectorId || === undefined and filtered inbound mirrors. Closed the spurious-event class for the same already-merged-by-tuple-agreement case.
Both fixes are correct and load-bearing, but they only operate inside an entity that two connectors happen to claim with identical (source_system, source_id). They do nothing for the Entra-SP-vs-AWS-IAM-role case, which is the demo-killer.
What's still missing
| Issue | Class | Why it blocks stitching |
|---|---|---|
| #486 (epic) | Architectural | No reconciliation phase exists. Every per-field cross-connector bug is patched in the surface where it appears (diff, merge, materializer); no layer owns the canonical state. |
| #491 (investigation) | Re-materialization scope | When a second connector adds a relationship that unlocks a longer path through a workload from a prior connector's sync window, the upstream re-materialization fix from #459 may not cover the case. The Sergey-flow replay test in PR #484 narrows its assertion because of this. |
| #488 | Schema | EntityDoc.connector_id is scalar — last-writer-wins. Deletion detection scoped by this field cannot see shared entities owned by another connector. Stitching makes shared entities the norm, not the exception. |
| #485 | Diff scope | diffProperties compares wholesale; cross-connector property differences fire spurious entity_versions. Once stitching produces shared entities at scale, this turns from "P0 with one symptom" into "the diff engine is structurally broken." |
| #383 | Type system | AWS human_identity nodes are silently retyped to owner by graph-transformer.ts:45. With stitching, the same human identity could be claimed by Okta (as human_identity), AWS (silently retyped to owner), and Entra (as human_identity). The reconciler needs to own type. |
| sv0-connectors#79 (Phase A–E from 2026-02-26 research) | Connector-side | correlationKeys[] declarations on NormalizedNode never landed; the platform has no inputs to correlate from. |
Schema-level blockers
- EntityDoc.connector_id: string (src/domain/entities/types.ts:67) must become connector_owners: string[].
- EntityDoc.properties: Record<string, unknown> must gain a property_provenance map (Option A from sv0-platform#486 plan) so per-property survivorship works.
- EntityDoc has no correlations reference; it needs a way to point at the correlations collection (or an embedded linked_entity_ids[] for cheap reverse lookup).
- NormalizedNode has no correlationKeys[], so connectors cannot declare match keys (research doc Phase B).
- Connectors do not emit OIDC subject / federated principal as a structured field; they bury it in trust-policy properties.
Design proposal
Position in the pipeline
sync_ingestion (per connector, runs as today through step 7)
1. insert ConnectorSyncDoc
2. transformGraph
3. computeDiff (per-connector scoped, post #461)
4. mergeRelationships + atomic upsertEntity [needs Option-A schema fixes; today is racy]
5. insertEvents
6. insertEntityVersion
7. soft-delete absent entities
── per-connector sync ENDS ─────────────────────────────────────────────
▼
▶▶▶ NEW: enqueue stitch_run for tenant T (debounced, see "Trigger semantics") ◀◀◀
▼
stitch_ingestion (NEW — runs once per stitch_run, NOT per connector)
S1. Fetch correlation rule set for tenant
S2. Compute candidate set: changed entities since last stitch_run +
any entity transitively reachable from one via existing correlations
S3. Apply correlation rule registry → propose CorrelationDoc records
S4. Validate proposals against tenant opt-out + collision policy
S5. Persist correlations (upsert + soft-deprecate stale)
S6. Compute re-materialization closure (workloads RUNS_AS any newly
linked identity, or workloads transitively reachable through
newly bridged edges)
S7. Re-run materializeExecutionPaths + materializeAuthorityPaths
scoped to that closure
S8. Emit `stitch_completed` event; update StitchRunDoc with metrics
▼
evaluate_findings (existing, runs per-tenant when stitch_run completes)
build_evidence_pack (existing, fan-out per changed finding)
Why post-transform / post-upsert, NOT per-connector inline:
- Stitching needs the post-merge entity state. Running inline would mean every connector handler has to reason about every other connector's data — that defeats the read-only-per-connector design.
- Stitching consumes from at minimum two connectors' outputs to produce useful links. Embedding it in connector A's handler means connector B's data may not exist yet.
- The Stream-1 ScanRun schema lets us treat "connector A finished" and "connector B finished" as independent events; the stitcher debounces and runs once per quiet window per tenant.
- It must run before evaluate_findings because findings (reachable_sensitive_domain, external_egress) consume the materialized authority paths. Stitched paths must exist before evaluation runs, otherwise the CISO sees findings that disappear and reappear when the next connector lands.
Sync vs async: Async. Stitch runs are kicked off via the worker queue. The HTTP /ingest/normalized-graph endpoint already returns 202 today; nothing changes there. The per-connector sync handler enqueues a stitch_run job at the end (or refreshes an existing pending one for the same tenant — see debounce).
Batched vs streaming: Batched, debounced per tenant. A connector that finishes its sync enqueues a stitch_run job for (tenant_id) with a debounce timer (default 60 s; configurable per tenant). If another stitch_run arrives during the debounce window, the timer resets. This collapses bursts (e.g., when all four connectors complete their Sunday-night syncs within minutes of each other) into one stitch pass — important for cost and for avoiding partial-stitch states visible to the UI.
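The debounce behaviour described above can be sketched as follows. This is a minimal in-memory illustration, not the worker-queue implementation: StitchDebouncer, Scheduler, and FakeScheduler are hypothetical names introduced here, and a production version would presumably persist the pending timer in the job queue rather than in process memory.

```typescript
// Hypothetical sketch: StitchDebouncer, Scheduler and FakeScheduler are
// illustrative names, not platform APIs.
interface Scheduler {
  setTimeout(fn: () => void, ms: number): number;
  clearTimeout(handle: number): void;
}

class StitchDebouncer {
  private readonly pending = new Map<string, number>();
  private readonly onFire: (tenantId: string) => void;
  private readonly scheduler: Scheduler;
  private readonly windowMs: number;

  constructor(onFire: (tenantId: string) => void, scheduler: Scheduler, windowMs = 60_000) {
    this.onFire = onFire;
    this.scheduler = scheduler;
    this.windowMs = windowMs;
  }

  /** Called when a connector sync completes; restarts the tenant's quiet window. */
  trigger(tenantId: string): void {
    const existing = this.pending.get(tenantId);
    if (existing !== undefined) this.scheduler.clearTimeout(existing);
    const handle = this.scheduler.setTimeout(() => {
      this.pending.delete(tenantId);
      this.onFire(tenantId); // one stitch_run for the whole burst
    }, this.windowMs);
    this.pending.set(tenantId, handle);
  }
}

/** Manual clock so the behaviour can be exercised without real timers. */
class FakeScheduler implements Scheduler {
  private now = 0;
  private nextHandle = 1;
  private readonly timers = new Map<number, { at: number; fn: () => void }>();

  setTimeout(fn: () => void, ms: number): number {
    const handle = this.nextHandle++;
    this.timers.set(handle, { at: this.now + ms, fn });
    return handle;
  }
  clearTimeout(handle: number): void {
    this.timers.delete(handle);
  }
  advance(ms: number): void {
    this.now += ms;
    Array.from(this.timers.entries()).forEach(([handle, timer]) => {
      if (timer.at <= this.now) {
        this.timers.delete(handle);
        timer.fn();
      }
    });
  }
}
```

With the default 60 s window, two triggers 30 s apart yield a single run 60 s after the second trigger, which is the burst-collapsing property the Sunday-night example relies on.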
Trigger semantics (consumed from Stream 1's ScanRun):
- A ScanRun transitioning to status=completed enqueues stitch_run(tenant_id).
- A ScanRun transitioning to status=failed does not trigger a stitch, because partial data could falsely deprecate links.
- Manual trigger: POST /api/v1/admin/stitch-runs with {tenant_id, scope: "full" | "incremental"}; required for opt-out toggling and rule registry changes.
- Backfill trigger: when a tenant first enables a new correlation rule, a one-time stitch_run(tenant_id, mode: "full_rescan") runs against all entities, not just the changed set.
Idempotency: The job key is (tenant_id, debounce_window_id). If a stitch run is already running for the tenant, new triggers wait for completion and then enqueue at most one follow-up. There is never more than one in-flight stitch run per tenant. This is enforced at the worker layer via Mongo-backed leader-election on a stitch_runs collection insert.
Correlation rule registry
Rule schema
// src/domain/correlations/types.ts (NEW)
export type CorrelationKind =
| "SAME_ENTITY" // Two source records describe the same identity. Materialize a canonical link.
| "BRIDGES_TO" // Two source records are different entities but related via an edge. Add edge, do not merge.
| "AUTHENTICATES_TO";// Specialization of BRIDGES_TO for cross-system identity hops (kept for evaluator clarity).
export type CorrelationConfidence = "HIGH" | "MEDIUM"; // No LOW. Determinism is non-negotiable.
export type CollisionPolicy =
| "first_match_wins" // If multiple A-side rows correlate to the same B-side row, only the first (by deterministic order) is kept.
| "all_match" // All matches are emitted as separate CorrelationDoc records (use for fan-out edges).
| "drop_all_ambiguous";// If >1 candidate, emit zero correlations and log to stitch_audit for operator review.
export interface CorrelationRule {
/** Stable rule ID, e.g. "aws-oidc-federation-to-entra-sp". Must not change after introduction. */
rule_id: string;
/** Human-readable description shown in the debuggability surface. */
description: string;
/** Rule version. Bumped if the predicate changes. Stored on every CorrelationDoc for replay/debug. */
version: number;
/** Source-A side: which entities are eligible to match. */
source_a: EntityPredicate;
/** Source-B side: which entities are eligible to match. */
source_b: EntityPredicate;
/** The deterministic match key extractor — produces a string from each side that must match exactly. */
match_key: MatchKeyExtractor;
/** What kind of correlation this produces. */
kind: CorrelationKind;
/** Confidence — a compile-time property of the rule, NOT a runtime score. */
confidence: CorrelationConfidence;
/** What to do when multiple candidates match. */
on_collision: CollisionPolicy;
/** Whether the rule is enabled by default. Tenants can override. */
default_enabled: boolean;
/** Documentation link explaining the underlying real-world relationship. */
doc_url?: string;
}
export interface EntityPredicate {
source_systems: string[]; // e.g. ["aws_iam"]
entity_types: EntityType[]; // e.g. ["identity", "role"]
required_properties?: string[]; // properties that must be non-null on the entity
property_filters?: Record<string, unknown>; // exact-match filters on properties
}
export interface MatchKeyExtractor {
/** Pure function name registered in `correlation-key-extractors.ts`. NOT arbitrary code. */
extractor_id: string;
/** Path or property name(s) the extractor operates on. */
inputs: string[];
}
The rule registry is a TypeScript array of literal objects, defined in src/ingestion/stitching/rules/registry.ts. Rules are not hot-loaded; they ship with the platform binary. This guarantees determinism across deploys and makes every rule grep-able and version-controlled.
Tenant overrides (enable/disable, parameter tweaks) live in a tenant_correlation_settings collection (see Tenant opt-out below).
Match-key extractors (the only "logic" in a rule)
Extractors are pure functions registered by string ID. They are the single place where structural parsing happens (parsing an ARN, extracting an OIDC subject, normalizing an email). Each extractor:
- Takes an EntityDoc and the rule's inputs array.
- Returns either a string (match key) or null (entity not eligible).
- Is unit-tested per extractor with a fixture.
- Has zero side effects.
Initial extractors:
| Extractor ID | Purpose | Output |
|---|---|---|
| entra_sp_object_id | Returns properties.principal_id (or object_id) from an Entra SP. | "abc-123-def" |
| aws_role_oidc_trust_subject | Parses properties.trust_policy.Statement[].Principal.Federated looking for sts.windows.net/<tenant> and reads Condition.StringEquals['sts.windows.net/<tenant>:sub'] to return the Entra SP object ID. Returns null if no OIDC trust or trust is not Entra-issued. | "abc-123-def" (matches above) |
| aws_role_saml_trust_subject | Same shape, for SAML federation. Reads Principal.Federated of form arn:aws:iam::<acct>:saml-provider/<name>. | depends on provider |
| oauth_client_id_lower | Lowercases properties.client_id or properties.app_id. | "560ad26b-..." |
| arn_canonical | Lowercases an ARN; strips trailing slashes; preserves account-id and region. | "arn:aws:iam::123456789012:role/foo" |
| email_lower | Lowercases properties.email or properties.upn. | "alice@example.com" |
| external_principal_arn | Extracts ARN from properties.trust_policy.Statement[].Principal.AWS array. Emits one match key per entry (used with on_collision: all_match). | "arn:aws:iam::987654321098:role/x" |
| endpoint_uri_normalized | Parses URL; returns host + path lowercased; strips query/fragment. | "prod-28.eastus.logic.azure.com/workflows/abc/triggers/manual/invoke" |
| entra_app_id_lower | Lowercases properties.app_id. | "560ad26b-..." |
Extractors are the only place where source-system-specific parsing lives in the stitching layer. Adding a new extractor is a code change reviewed like any other deterministic rule.
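To make the extractor contract concrete, here is a small sketch of a registry keyed by extractor_id, covering three of the simpler extractors from the table. The EntityLike shape, the EXTRACTORS map, and the extractMatchKey helper are assumptions made for illustration; the real EntityDoc type and the layout of correlation-key-extractors.ts may differ.

```typescript
// Assumed minimal entity shape; the real EntityDoc has more fields.
interface EntityLike {
  properties: Record<string, unknown>;
}

type Extractor = (entity: EntityLike, inputs: string[]) => string | null;

/** Returns the first input property that holds a non-empty string. */
function firstNonEmptyString(entity: EntityLike, inputs: string[]): string | null {
  for (const name of inputs) {
    const value = entity.properties[name];
    if (typeof value === "string" && value !== "") return value;
  }
  return null;
}

const lowerOfFirst: Extractor = (entity, inputs) => {
  const value = firstNonEmptyString(entity, inputs);
  return value === null ? null : value.toLowerCase();
};

// Hypothetical registry: extractors are looked up by string ID, never eval'd.
const EXTRACTORS = new Map<string, Extractor>([
  ["oauth_client_id_lower", lowerOfFirst],
  ["email_lower", lowerOfFirst],
  [
    "endpoint_uri_normalized",
    (entity, inputs) => {
      const raw = firstNonEmptyString(entity, inputs);
      if (raw === null) return null;
      try {
        const url = new URL(raw);
        // host + path, lowercased; query string and fragment are dropped.
        return (url.host + url.pathname).toLowerCase();
      } catch {
        return null; // unparseable URL: entity is not eligible
      }
    },
  ],
]);

function extractMatchKey(extractorId: string, entity: EntityLike, inputs: string[]): string | null {
  const extractor = EXTRACTORS.get(extractorId);
  if (extractor === undefined) throw new Error(`unknown extractor_id: ${extractorId}`);
  return extractor(entity, inputs);
}
```

Throwing on an unknown extractor_id, rather than returning null, is a design choice of this sketch: a typo in a rule definition then fails loudly at evaluation time instead of silently producing zero correlations.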
Initial correlation rule set
| rule_id | source_a | source_b | match_key | kind | confidence | on_collision | enabled |
|---|---|---|---|---|---|---|---|
| aws-oidc-federation-to-entra-sp | AWS role with OIDC trust on Entra issuer | Entra SP | A: aws_role_oidc_trust_subject; B: entra_sp_object_id | SAME_ENTITY | HIGH | first_match_wins | yes |
| aws-saml-federation-to-entra-sp | AWS role with SAML trust on Entra | Entra SP | A: aws_role_saml_trust_subject; B: entra_sp_object_id | SAME_ENTITY | HIGH | first_match_wins | yes |
| servicenow-oauth-to-entra-sp | ServiceNow OAuth client | Entra SP | A: oauth_client_id_lower; B: entra_app_id_lower | SAME_ENTITY | HIGH | first_match_wins | yes |
| aws-cross-account-role-trust | AWS role with explicit AWS principal in trust | AWS role/user (target account) | A: external_principal_arn; B: arn_canonical | BRIDGES_TO (new edge TRUSTED_BY) | HIGH | all_match | yes |
| human-identity-by-email | Any human_identity (post-#383 fix) | Any other human_identity from a different connector | A,B: email_lower | SAME_ENTITY | HIGH | drop_all_ambiguous | yes |
| connection-endpoint-bridge | connection from any connector | resource (e.g. Logic App) from any connector | A,B: endpoint_uri_normalized | BRIDGES_TO (new edge CONNECTS_TO) | MEDIUM | drop_all_ambiguous | yes |
| mcp-server-to-entra-sp | AWS Lambda or workload labeled MCP host | Entra SP referenced via env var ENTRA_CLIENT_ID | A: extractor reads properties.environment.ENTRA_CLIENT_ID; B: entra_app_id_lower | BRIDGES_TO (AUTHENTICATES_TO) | MEDIUM | first_match_wins | no (opt-in per tenant; high false-merge risk) |
Confidence semantics:
- HIGH rules auto-link. The match key is structurally guaranteed to identify the same entity (OIDC subject IS the Entra SP object ID; OAuth client_id IS the Entra appId).
- MEDIUM rules emit links only if there is exactly one candidate; otherwise they drop and log to stitch_audit. Endpoint URLs match this profile: a host can serve many resources.
- There is no LOW. If a rule cannot decide deterministically with at most one operator-policy parameter, it does not enter the registry.
Versioning: Every CorrelationDoc records rule_id and rule_version. When a rule's predicate or extractor changes, version bumps. Old correlations remain valid until the next stitch run, which re-evaluates them under the new rule version.
Determinism guarantee: Same set of entities + same enabled rules + same rule versions = same set of correlations. Order independence is achieved by sorting the candidate set by (entity._id) lexicographically before evaluation.
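The interaction between collision policies and the sort-before-evaluate guarantee can be illustrated with a sketch like the one below. Candidate, Pair, and proposePairs are hypothetical names; the sketch assumes match keys have already been extracted for both sides and reads first_match_wins literally as "for each B-side row, keep only the lowest-sorted A-side row".

```typescript
type CollisionPolicy = "first_match_wins" | "all_match" | "drop_all_ambiguous";

interface Candidate {
  entityId: string;
  matchKey: string;
}

interface Pair {
  a: string;
  b: string;
  matchKey: string;
}

/** Groups entity IDs by match key; each group is sorted for determinism. */
function groupByKey(candidates: Candidate[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const c of candidates) {
    const ids = groups.get(c.matchKey) ?? [];
    ids.push(c.entityId);
    groups.set(c.matchKey, ids);
  }
  groups.forEach((ids) => ids.sort());
  return groups;
}

function proposePairs(
  sideA: Candidate[],
  sideB: Candidate[],
  policy: CollisionPolicy,
): { pairs: Pair[]; dropped: string[] } {
  const groupsA = groupByKey(sideA);
  const groupsB = groupByKey(sideB);
  const pairs: Pair[] = [];
  const dropped: string[] = []; // match keys that would go to stitch_audit
  const sharedKeys = Array.from(groupsA.keys()).filter((k) => groupsB.has(k)).sort();

  for (const key of sharedKeys) {
    const aIds = groupsA.get(key) as string[];
    const bIds = groupsB.get(key) as string[];
    if (policy === "drop_all_ambiguous" && (aIds.length > 1 || bIds.length > 1)) {
      dropped.push(key);
      continue;
    }
    // all_match keeps every A-side row; the other policies keep the first.
    const keptA = policy === "all_match" ? aIds : [aIds[0]];
    for (const bId of bIds) {
      for (const aId of keptA) pairs.push({ a: aId, b: bId, matchKey: key });
    }
  }
  return { pairs, dropped };
}
```

Because both the groups and the shared keys are sorted before any pair is emitted, feeding the same candidates in a different arrival order produces an identical result, which is the property the determinism guarantee depends on.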
Merge semantics
Correlations are stored as first-class records, not implicit. The platform never hard-merges entities — it links them and computes a canonical view on read (or on demand for the materializer).
CorrelationDoc schema (new collection correlations)
export interface CorrelationDoc {
_id: string; // sha256(tenant_id + sorted(entity_ids) + rule_id) — stable & idempotent
tenant_id: string;
rule_id: string;
rule_version: number;
kind: CorrelationKind;
confidence: CorrelationConfidence;
/** The set of entity IDs linked by this correlation. Always sorted. For SAME_ENTITY, all members are aliases. */
entity_ids: string[];
/** Match key value used (for debugging — e.g. the OIDC subject). */
match_key_value: string;
/** When this correlation first appeared in a stitch run. */
created_at: Date;
/** When this correlation was last confirmed by a stitch run. */
last_confirmed_at: Date;
/** Stitch run that created this correlation. */
created_by_stitch_run_id: string;
/** Set when the rule no longer fires for this candidate set. Soft-deprecation. */
deprecated_at?: Date;
deprecated_by_stitch_run_id?: string;
/** Per-source provenance for the inputs that produced this match. */
source_records: CorrelationSourceRecord[];
}
export interface CorrelationSourceRecord {
entity_id: string;
source_system: string;
source_id: string;
/** The connector_id that contributed this source record. */
connector_id: string;
/** When this source-record was last observed by its contributing connector. */
observed_at: Date;
/** The actual property values the extractor read, captured for audit. */
extracted_value: string;
}
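A possible way to derive the stable _id described in the schema comment is sketched below. The proposal only specifies sha256 over tenant_id, the sorted entity_ids, and rule_id; serializing those inputs as a JSON array before hashing is an assumption of this sketch, chosen so that plain string concatenation cannot make two different inputs collide. The function name correlationId is likewise hypothetical.

```typescript
import { createHash } from "node:crypto";

/**
 * Derives a deterministic correlation ID.
 * Assumption: inputs are JSON-encoded before hashing; the proposal does not
 * fix the exact byte layout.
 */
function correlationId(tenantId: string, entityIds: string[], ruleId: string): string {
  const sortedIds = [...entityIds].sort(); // order of discovery must not matter
  const payload = JSON.stringify([tenantId, sortedIds, ruleId]);
  return createHash("sha256").update(payload).digest("hex");
}
```

Because the ID depends only on the tenant, the member set, and the rule, re-running a stitch pass upserts the same document instead of creating a duplicate.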
Canonical EntityDoc model
SAME_ENTITY correlations form an equivalence class. The platform exposes both:
- The contributing entities as-is, unchanged in the entities collection. UI widgets that need the per-source view (e.g., "show me what AWS sees vs what Entra sees") read these directly. No data is destroyed.
- A canonical_identity_id field added to each contributing entity, pointing at the lexicographically lowest entity._id in its equivalence class. This is recomputed on every stitch run for entities in the affected set.
The path materializer is taught (small change) to traverse equivalence classes via canonical_identity_id: when computing paths from workload W → identity I, it includes paths through any entity I' where I.canonical_identity_id === I'.canonical_identity_id. This converts "three islands" into "one identity with three source records and a unified outbound edge set" without rewriting the entity store.
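One way to compute canonical_identity_id for every member of an equivalence class is a union-find pass over the SAME_ENTITY correlations, always keeping the lexicographically lower ID as the root. The sketch below is illustrative only: computeCanonicalIds is a hypothetical helper and it takes each correlation simply as its entity_ids array.

```typescript
/**
 * Maps every entity that appears in a SAME_ENTITY group to the
 * lexicographically lowest entity ID in its equivalence class.
 */
function computeCanonicalIds(sameEntityGroups: string[][]): Map<string, string> {
  const parent = new Map<string, string>();

  const find = (id: string): string => {
    if (!parent.has(id)) parent.set(id, id);
    let root = id;
    while (parent.get(root) !== root) root = parent.get(root) as string;
    parent.set(id, root); // light path compression
    return root;
  };

  const union = (x: string, y: string): void => {
    const rootX = find(x);
    const rootY = find(y);
    if (rootX === rootY) return;
    // The lower ID always becomes the root, so each root is its class minimum.
    if (rootX < rootY) parent.set(rootY, rootX);
    else parent.set(rootX, rootY);
  };

  for (const group of sameEntityGroups) {
    for (let i = 1; i < group.length; i++) union(group[0], group[i]);
  }

  const canonical = new Map<string, string>();
  Array.from(parent.keys()).forEach((id) => canonical.set(id, find(id)));
  return canonical;
}
```

For the three-island example, correlations linking the Entra SP to the AWS role and the ServiceNow OAuth client to the Entra SP leave all three entities with the same canonical ID, regardless of the order in which the correlations are processed.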
For BRIDGES_TO correlations, no canonical merge happens. Instead, a synthetic edge of the rule's declared type is materialized into the source entity's relationships[] array, tagged with source_connector_id = "stitcher" and properties: {via_correlation_id: <CorrelationDoc._id>}. This reuses the existing relationship-based traversal (mergeRelationships already preserves edges from "other connectors" — stitcher is just another connector ID).
Per-property source-of-truth precedence
When the same logical attribute exists on multiple linked entities, the on-read merge uses a declared survivorship policy from field-policies.ts:
| Property class | Policy | Rationale |
|---|---|---|
| display_name | most-recently-updated non-empty | Different connectors invent different display names; the most recent observation is usually the most useful. |
| properties.principal_id / app_id / client_id | authoritative-source: Entra wins | Entra is the system of record for these IDs. |
| properties.trust_policy | source-system-only (not merged across) | Each connector's view of trust is local to its system; never merge. |
| properties.email / upn | authoritative-source: Entra > AWS Identity Center > Okta > others | Per-tenant override allowed. |
| properties.tags | set-union (deduplicated) | Tags are additive metadata; never lose a tag. |
| entity_type | authoritative-source: identity > workload > connection > credential > owner; ties broken by lexicographic source_system | Consistent with #383: an explicit type-survivorship rule. |
| resource_key | first non-null wins; flag conflict if two non-null differ | Drift here means the canonical key is genuinely contested; surface to operators. |
The merge function is computeCanonicalView(entityIds, fieldPolicies, storage) — pure given the entities and the policy table. Versioning the policy table is part of the stitch_run metadata so a re-stitch can be reproduced identically.
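The per-field resolution inside such a merge could look roughly like the sketch below, which covers three of the policies from the table. Observation, FieldPolicy, and resolveField are hypothetical names, and the tie-breaking by source_system is an assumption added so the result stays deterministic when two observations are otherwise equal.

```typescript
interface Observation {
  value: unknown;
  sourceSystem: string;
  observedAt: Date;
}

type FieldPolicy =
  | { kind: "most_recent_non_empty" }
  | { kind: "authoritative_source"; order: string[] }
  | { kind: "set_union" };

const byString = (x: string, y: string): number => (x < y ? -1 : x > y ? 1 : 0);

/** Pure survivorship: same observations + same policy = same result. */
function resolveField(observations: Observation[], policy: FieldPolicy): unknown {
  const present = observations.filter(
    (o) => o.value !== null && o.value !== undefined && o.value !== "",
  );
  if (present.length === 0) return undefined;

  if (policy.kind === "most_recent_non_empty") {
    const sorted = [...present].sort(
      (x, y) =>
        y.observedAt.getTime() - x.observedAt.getTime() ||
        byString(x.sourceSystem, y.sourceSystem),
    );
    return sorted[0].value;
  }

  if (policy.kind === "authoritative_source") {
    const order = policy.order;
    // Systems missing from the hierarchy rank after every listed system.
    const rank = (s: string): number => (order.indexOf(s) < 0 ? order.length : order.indexOf(s));
    const sorted = [...present].sort(
      (x, y) =>
        rank(x.sourceSystem) - rank(y.sourceSystem) ||
        byString(x.sourceSystem, y.sourceSystem),
    );
    return sorted[0].value;
  }

  // set_union: deduplicate across all contributing entities.
  const union = new Set<string>();
  present.forEach((o) => {
    if (Array.isArray(o.value)) o.value.forEach((v) => union.add(String(v)));
  });
  return Array.from(union).sort();
}
```

Keeping the policy as plain data (a discriminated union) rather than a callback is what allows the policy table to be versioned and hashed into the stitch_run metadata.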
Source lineage
Every property in the canonical view carries (connector_id, source_record_id, observed_at, contributing_entity_id) provenance, surfaced via the GET /api/v1/identities/:id/lineage endpoint (see Debuggability). The on-disk shape uses the existing property_provenance: Record<string, ConnectorProvenance> field added in Option-A from sv0-platform#486. Stitching extends provenance to include the other entities that contributed via correlation (so an analyst clicks one identity and sees all three source records).
Type reconciliation (#383)
The human_identity → owner retype in graph-transformer.ts:45 is removed. EntityType gains human_identity as a first-class type. The reconciler then applies the type-survivorship rule (above) to canonicalize across linked entities. Migration backfills existing owner rows that originated from human_identity nodes (identifiable by properties.subtype or by source-id pattern).
Re-materialization
What triggers re-stitch
A stitch run computes a change set at the start:
- ChangedEntities = entities upserted/changed since last stitch_run.last_completed_at
- AffectedCorrelations = correlations whose entity_ids ∩ ChangedEntities ≠ ∅
- TransitivelyAffectedEntities = entities reachable from ChangedEntities via existing correlations + AffectedCorrelations
Only TransitivelyAffectedEntities are evaluated by the rule registry. A full-tenant re-stitch is gated behind the manual mode: "full_rescan" trigger.
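The change-set expansion amounts to a breadth-first walk over the correlation graph. The sketch below is a simplified in-memory version: transitivelyAffectedEntities is a hypothetical helper, and it represents each existing correlation by its entity_ids array alone.

```typescript
/**
 * Expands the changed set to every entity reachable through existing
 * correlations. Each correlation is given as its entity_ids array.
 */
function transitivelyAffectedEntities(changed: string[], correlations: string[][]): string[] {
  // Index: entity ID -> correlations that mention it.
  const byEntity = new Map<string, string[][]>();
  for (const group of correlations) {
    for (const id of group) {
      const groups = byEntity.get(id) ?? [];
      groups.push(group);
      byEntity.set(id, groups);
    }
  }

  const seen = new Set<string>(changed);
  const queue = [...changed];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const group of byEntity.get(current) ?? []) {
      for (const peer of group) {
        if (!seen.has(peer)) {
          seen.add(peer);
          queue.push(peer);
        }
      }
    }
  }
  return Array.from(seen).sort();
}
```

For example, with correlations [a, b], [b, c], and [x, y], a change to a alone pulls in b and c but leaves x and y out of the evaluation set.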
What re-materialization looks like
Closure expansion (closes #491):
M = ∅   // entities whose execution_paths must be re-materialized
for each entity e in TransitivelyAffectedEntities where entity_type ∈ {identity, workload}:
  M.add(e._id)
  M.add(canonical_identity_id of e)
  // Walk inbound RUNS_AS and add upstream workloads (Fix B from #459)
  for each w in storage.getEntitiesWithRelationshipTo(e._id, "RUNS_AS"):
    M.add(w._id)
  // NEW: walk equivalence-class peers and their upstream workloads
  for each peer p with canonical_identity_id == e.canonical_identity_id:
    for each w in storage.getEntitiesWithRelationshipTo(p._id, "RUNS_AS"):
      M.add(w._id)
// NEW: walk new BRIDGES_TO edges added this run and re-materialize sources
for each new bridge edge (s -> t) added this run:
  M.add(s._id)
materializeExecutionPaths(M)
materializeAuthorityPaths(workloads in M)
The "walk equivalence-class peers" step is the structural fix for #491 (and the architectural successor to #459's Fix B). The replay test in PR #484 that currently .skips assertion (b) becomes green once this lands.
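For readers who prefer typed code over pseudocode, the closure expansion might be expressed along these lines. This is a synchronous in-memory sketch under stated assumptions: the GraphReader interface, its peersOf method, and the string-array return types are hypothetical, and the real storage.getEntitiesWithRelationshipTo is presumably asynchronous and returns entity documents rather than IDs.

```typescript
interface AffectedEntity {
  _id: string;
  entity_type: string;
  canonical_identity_id?: string;
}

// Hypothetical read-only view over storage, simplified to return IDs.
interface GraphReader {
  getEntitiesWithRelationshipTo(targetId: string, relationshipType: string): string[];
  peersOf(canonicalIdentityId: string): string[];
}

function rematerializationClosure(
  affected: AffectedEntity[],
  newBridgeSourceIds: string[],
  graph: GraphReader,
): string[] {
  const closure = new Set<string>();
  const addUpstreamWorkloads = (identityId: string): void => {
    graph.getEntitiesWithRelationshipTo(identityId, "RUNS_AS").forEach((w) => closure.add(w));
  };

  for (const e of affected) {
    if (e.entity_type !== "identity" && e.entity_type !== "workload") continue;
    closure.add(e._id);
    addUpstreamWorkloads(e._id);
    if (e.canonical_identity_id !== undefined) {
      closure.add(e.canonical_identity_id);
      // Equivalence-class peers: their upstream workloads gain new paths too.
      graph.peersOf(e.canonical_identity_id).forEach((peerId) => addUpstreamWorkloads(peerId));
    }
  }

  // Sources of bridge edges added in this run.
  newBridgeSourceIds.forEach((sourceId) => closure.add(sourceId));
  return Array.from(closure).sort();
}
```

The sketch skips the canonical ID when an entity has none, which the pseudocode leaves implicit for entities outside any equivalence class.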
The materialized-paths collection (Option C from #486)
A new collection stitched_paths holds just the cross-connector segments discovered by the stitcher — not duplicating authority_paths, but giving the UI a fast lookup for "show me only paths that span two or more source systems." Schema:
export interface StitchedPathDoc {
_id: string; // sha256(tenant + workload_id + canonical_identity_id + dest_resource_key)
tenant_id: string;
workload_id: string; // entry-point workload
canonical_identity_id: string; // the bridging identity
contributing_correlation_ids: string[];
source_systems_traversed: string[]; // ordered list of distinct source systems on the path
authority_path_id: string; // pointer into existing authority_paths collection
computed_at: Date;
computed_by_stitch_run_id: string;
}
This is a denormalized index, not a new source of truth. UI queries like "show me all stitched paths in the Foundry-Entra-ServiceNow trio" become O(index lookup) instead of O(scan + filter).
Idempotency & ordering
Two scans landing simultaneously
- Per-connector sync_ingestion jobs serialize via the existing per-tenant queue (no change). At-most-one connector handler runs at a time per tenant.
- A new stitch_run is enqueued at the end of each connector handler. The debouncer collapses bursts.
- The stitch_runs collection enforces "at most one in-flight per tenant" via a unique index on (tenant_id, status: "running"), that is, a partial unique index on tenant_id that covers only documents whose status is "running", so completed runs do not collide.
- If a stitch run is in progress when a new sync_ingestion lands, the new sync runs to completion, then triggers a fresh stitch run after the current one finishes.
Connector A then B vs B then A — same canonical graph
This is the load-bearing invariant. Achieved by:
- Correlation rules are pure functions over the post-merge entity set. Order of arrival of source records does not affect rule evaluation, because rules read from a settled snapshot.
- The candidate set is sorted lexicographically by entity._id before evaluation, so first_match_wins is order-independent.
- The canonical-identity-id is the lexicographically lowest entity._id in the equivalence class, which is order-independent.
- Survivorship rules ("authoritative-source: Entra wins") are deterministic functions of (value, source_system) tuples, which are order-independent.
Property tested: permute(connector_completion_order) × replay(same_entities) → identical correlations and identical canonical_identity_id per entity.
Replay semantics
A stitch run can be deterministically re-run by:
- Reading the stitch_runs doc (which records entity-set hashes, rule-set version hash, policy-table version).
- Re-applying the rule registry at that version against the entity snapshot at that timestamp.
- Asserting the produced correlations match what was persisted.
This is the foundation of the "why was this merged?" debuggability surface and of the integration replay tests (extending PR #484's harness).
Tenant opt-out + debuggability
Per-tenant per-rule disable
tenant_correlation_settings collection:
export interface TenantCorrelationSettingsDoc {
_id: string; // tenant_id
tenant_id: string;
/** Per-rule enable/disable. Absence = use rule's default_enabled. */
rules: Record<string, { enabled: boolean; reason?: string; updated_by: string; updated_at: Date }>;
/** Per-rule parameter overrides (e.g. authoritative-source hierarchy for emails). */
rule_params: Record<string, Record<string, unknown>>;
/** Force-disabled correlations: never auto-stitch these entity pairs. */
blocklist: Array<{ entity_a: string; entity_b: string; reason: string; created_at: Date }>;
/** Operator-confirmed correlations not produced by any rule. */
manual_links: Array<{ entity_ids: string[]; kind: CorrelationKind; reason: string; created_by: string; created_at: Date }>;
}
Disabling a rule triggers a stitch_run(tenant_id, mode: "full_rescan") so existing correlations from that rule get deprecated.
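How the stitcher might consult these settings during step S4 is sketched below. The trimmed-down RuleLike and SettingsLike shapes and both helper names are assumptions for illustration; treating the blocklist as symmetric (a pair blocks in either order) is also an assumption, since the schema does not say whether entity_a and entity_b are ordered.

```typescript
interface RuleLike {
  rule_id: string;
  default_enabled: boolean;
}

// Only the fields this sketch needs from TenantCorrelationSettingsDoc.
interface SettingsLike {
  rules: Record<string, { enabled: boolean }>;
  blocklist: Array<{ entity_a: string; entity_b: string }>;
}

/** Absence of a per-tenant override means the rule's default applies. */
function isRuleEnabled(rule: RuleLike, settings?: SettingsLike): boolean {
  if (settings !== undefined && Object.prototype.hasOwnProperty.call(settings.rules, rule.rule_id)) {
    return settings.rules[rule.rule_id].enabled;
  }
  return rule.default_enabled;
}

/** Assumption: the blocklist is symmetric in (entity_a, entity_b). */
function isBlocked(entityA: string, entityB: string, settings?: SettingsLike): boolean {
  const blocklist = settings === undefined ? [] : settings.blocklist;
  return blocklist.some(
    (entry) =>
      (entry.entity_a === entityA && entry.entity_b === entityB) ||
      (entry.entity_a === entityB && entry.entity_b === entityA),
  );
}
```

A tenant with no settings document therefore gets exactly the registry defaults, which keeps the opt-in rule (mcp-server-to-entra-sp) off until someone explicitly enables it.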
"Why was this merged?" surface
API:
- GET /api/v1/identities/:id/lineage: returns the canonical view + every contributing source record + every correlation that linked them + the rule(s) that fired (with rule_id, rule_version, match_key_value, confidence, created_at).
- GET /api/v1/correlations/:id: the raw CorrelationDoc with all source records and match key values.
- GET /api/v1/identities/:id/correlation-history: full history of correlations that have ever linked this entity, including deprecated ones.
UI surface (described for Stream-3 completeness; not in this stream's implementation):
- Identity card has a "Linked across N systems" badge. Click → expanding panel listing each contributing entity, the rule that linked them, the match-key value (e.g., "matched on Entra SP object ID 8a0cb6c3..."), and a "Disable this link" button that adds to the tenant blocklist.
Audit log
stitch_audit collection records every rule firing decision, including:
- Rule fired and produced a new correlation.
- Rule fired but on_collision: drop_all_ambiguous discarded the result (operator review queue).
- Rule was skipped because tenant disabled it.
- Existing correlation was confirmed (no change).
- Existing correlation was deprecated (rule no longer fires).
Indexed by (tenant_id, stitch_run_id) and (tenant_id, entity_id).
Schema migrations required
This is the critical-path call-out for downstream streams. The following must land before the stitcher can be implemented; Stream-2 (multi-account AWS connector) and Stream-4 (Lab 2) consume these.
| Migration | What changes | Why |
|---|---|---|
| M1: connector_id → connector_owners[] | EntityDoc.connector_id: string → EntityDoc.connector_owners: string[]. Backfill from existing scalar via one-time updateMany. Deletion detection in diff-engine.ts:318-323 switches to connector_owners: connectorId filter; entity is only fully deleted when ALL owning connectors have marked it absent. | Closes #488. Blocks stitching: shared entities are the norm under stitching, so multi-owner deletion is required. |
| M2: property_provenance map | EntityDoc.property_provenance: Record<string, { connector_id: string; observed_at: Date }>. diffProperties keeps a key only when property_provenance[key] is undefined or property_provenance[key].connector_id === connectorId. | Closes #485. Required for per-property survivorship (canonical view). |
| M3: Atomic upsert via aggregation pipeline | entity-adapter.ts upsertEntity uses MongoDB aggregation pipeline updates ($filter + $concatArrays) so mergeRelationships is collapsed into one round-trip. Removes the read-merge-write race that exists today and is masked by single-worker serialization. | Closes #487. Required because stitcher writes during a window where a per-connector sync may also be writing. |
| M4: entity_type=human_identity | Add "human_identity" to ENTITY_TYPES in src/domain/entities/types.ts. Remove the silent retype at graph-transformer.ts:45. Backfill existing owner rows that originated from human_identity nodes. Update Identities page filter and Graph Explorer legend. | Closes #383. Required because cross-connector human-identity correlation is a P0 stitching rule. |
| M5: canonical_identity_id on EntityDoc | New optional field canonical_identity_id?: string on EntityDoc. Set by stitch runs; null for entities not part of any equivalence class. | Required for path-materializer equivalence-class traversal. |
| M6: correlations collection | New collection. Indexes on (tenant_id, entity_ids) (multikey), (tenant_id, rule_id, deprecated_at), (tenant_id, last_confirmed_at). Schema as described above. | Storage for CorrelationDoc. |
| M7: stitch_runs collection | New collection. Unique index on (tenant_id, status: "running") for at-most-one-in-flight enforcement. Schema includes entity-set hash, rule-version hash, policy-version hash, started_at/completed_at, metrics. | Required for replay determinism + concurrency control. |
| M8: stitch_audit collection | New collection. Append-only. Indexed by (tenant_id, stitch_run_id) and (tenant_id, entity_id). | Required for "why was this merged?" surface. |
| M9: tenant_correlation_settings collection | New collection (described above). | Required for tenant opt-out. |
| M10: stitched_paths collection | New collection (described above). | Required for fast UI lookup of cross-connector paths. |
| M11: NormalizedNode.correlationKeys?[] | Add to src/ingestion/types.ts. Optional. Connectors that don't emit it are still supported (rules fall back to extracting from properties). Connectors that do emit it get faster, declarative correlation. | Required to cleanly express AWS OIDC subjects, ServiceNow OAuth client IDs, and federated-principal ARNs without spelunking through trust policies in extractors. |
| M12: NormalizedNode.lineage_records?[] | Add a stable per-source-record provenance block for fields the rule registry needs to attribute. | Required so source lineage in the canonical view is precise: the canonical view shows which connector contributed which property. |
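The M1 semantics in the table (backfill from the scalar, then delete only when every owner has marked the entity absent) can be illustrated with an in-memory sketch. The `OwnedEntity` shape and the helpers `backfillOwners` and `markAbsent` are hypothetical; the real change would live in the diff engine and a MongoDB updateMany.

```typescript
// Hypothetical minimal entity shape: legacy scalar plus the new owners array.
interface OwnedEntity { _id: string; connector_id?: string; connector_owners?: string[] }

// One-time backfill: seed connector_owners from the legacy scalar.
// Idempotent: entities that already have connector_owners are returned unchanged.
function backfillOwners(e: OwnedEntity): OwnedEntity {
  if (e.connector_owners !== undefined) return e;
  return { ...e, connector_owners: e.connector_id !== undefined ? [e.connector_id] : [] };
}

// A connector reports the entity absent: drop that owner, and flag the entity
// for full deletion only when no owning connector remains.
function markAbsent(e: OwnedEntity, connectorId: string): { entity: OwnedEntity; fullyDeleted: boolean } {
  const owners = (e.connector_owners ?? []).filter((o) => o !== connectorId);
  return { entity: { ...e, connector_owners: owners }, fullyDeleted: owners.length === 0 };
}
```

With two owners, the first `markAbsent` call only removes one owner; the entity survives until the second connector also reports it absent.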
Migration / backward compat
Existing per-connector graphs → stitched graph:
- M1–M4 land first (Option A from #486, which closes the schema bugs). Each is independently shippable.
- M5–M12 land in a single PR series with the stitcher disabled by default (STITCHER_ENABLED=false env flag).
- A one-time backfill stitch run executes on each tenant when STITCHER_ENABLED=true is flipped. The first run is mode: "full_rescan" and may be expensive (typically minutes for production tenants); it runs out-of-band off the request path.
- The materializer change to traverse equivalence classes is gated on canonical_identity_id !== undefined. Pre-stitch entities have it undefined and traversal behaves identically to today.
How existing UI / queries continue to work during migration:
- entities collection remains the source of truth. UI reads EntityDoc as before.
- Authority paths are still materialized into authority_paths collection. The stitcher only adds to the path set; it never deletes paths the existing materializer would have produced.
- stitched_paths is a new index, not a new source: the existing /authority-paths/grouped endpoint becomes "include stitched paths in grouping" rather than a new endpoint.
- The connector_id field is retained as a deprecated mirror of connector_owners[0] for one quarter to give downstream readers (analytics, manual scripts) time to migrate.
Re-stitch existing data: cost, time:
- Production tenant default (~3,000 entities, 4 connectors): expected initial full stitch < 60 s.
- Demo-w1 (~200 entities): < 5 s.
- demo-nimbus (~300 AWS entities, single connector): < 5 s.
- Subsequent incremental stitches (per debounce window): < 2 s for typical change sets (~10–100 affected entities).
These are estimates derived from the rule-evaluation cost (O(rules × candidates × log(candidates)) for the index lookup). They will be benchmarked in Phase 4.
Implementation plan (writing-plans format)
All tasks live in sv0-platform unless noted. Each task is bite-sized (≤1 day for one engineer), has a clear acceptance criterion, and follows TDD: write the failing test first, then make it pass.
Phase 1 — Schema migrations (unblock the rest)
Goal: ship Option A from sv0-platform#486 plus the stitching-specific schema additions. Each PR is independently revertible.
- M1: connector_owners[] migration. Add connector_owners: string[] to EntityDoc; teach entity-adapter.ts to $addToSet on upsert; ship a one-time backfill script scripts/migrations/2026-04-backfill-connector-owners.ts; flip deletion detection to filter by connector_owners. Acceptance: integration test where two connectors write the same entity → both appear in connector_owners; only fully-absent entities are deleted.
- M2: property_provenance map. Add property_provenance to EntityDoc; teach graph-transformer.ts to stamp it; teach diffProperties to filter by it. Acceptance: regression test for #485 (no spurious entity_versions on no-op cross-connector re-sync).
- M3: Atomic aggregation-pipeline upsert. Replace read-merge-write at sync-ingestion.ts:124-140 with an atomic aggregation-pipeline updateOne in entity-adapter.ts. Acceptance: property-test where two connectors interleave reads/writes → no relationships are lost.
- M4: entity_type=human_identity + #383 fix. Add human_identity to ENTITY_TYPES; remove the retype at graph-transformer.ts:45; backfill existing owner rows. Acceptance: GET /api/v1/entities?entity_type=human_identity returns the 4 SSO users on demo-nimbus.
- M5: canonical_identity_id field. Add optional field to EntityDoc; index (tenant_id, canonical_identity_id). No write logic yet; placeholder for Phase 4. Acceptance: index exists; field accepts null; existing tests pass.
- M6–M10: Stitching collections + indexes. Create correlations, stitch_runs, stitch_audit, tenant_correlation_settings, stitched_paths collections via the storage adapter. Add MongoDB indexes. Acceptance: storage-adapter tests for each collection's CRUD methods pass.
- M11–M12: NormalizedNode.correlationKeys + lineage_records. Add optional fields to src/ingestion/types.ts. Acceptance: existing connectors continue to work without emitting these (backwards compatible).
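For the M3 task, the single-round-trip merge could be expressed as a MongoDB aggregation-pipeline update. The builder below is a hedged sketch, not the adapter's actual code: `buildUpsertPipeline` and the `Rel` shape are hypothetical, and it assumes the merge rule is "replace this connector's relationships, keep every other connector's". Because the `$addToSet` update operator is not available inside a pipeline update, the sketch uses `$setUnion` for connector_owners instead.

```typescript
// Hypothetical relationship shape; source_connector_id is the per-relationship provenance field.
interface Rel { type: string; target_id: string; source_connector_id: string }

// Builds a pipeline-style update: one $set stage that merges owners and relationships atomically.
// $literal keeps user-supplied values from being parsed as field paths or expressions.
function buildUpsertPipeline(connectorId: string, incoming: Rel[]): Record<string, unknown>[] {
  return [
    {
      $set: {
        connector_owners: {
          $setUnion: [{ $ifNull: ["$connector_owners", []] }, [{ $literal: connectorId }]],
        },
        relationships: {
          $concatArrays: [
            {
              // Keep relationships contributed by other connectors.
              $filter: {
                input: { $ifNull: ["$relationships", []] },
                as: "rel",
                cond: { $ne: ["$$rel.source_connector_id", { $literal: connectorId }] },
              },
            },
            // Append this connector's fresh relationships.
            { $literal: incoming },
          ],
        },
      },
    },
  ];
}
```

With the Node.js driver, a pipeline like this would be passed as the update argument to `updateOne(filter, pipeline, { upsert: true })`; the server evaluates the whole stage against the current document, so no client-side read is needed.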
Phase 2 — Correlation rule engine + initial rule set
- Rule schema + registry skeleton. Add src/domain/correlations/types.ts with the schemas above; create src/ingestion/stitching/rules/registry.ts exporting an empty CorrelationRule[]. Acceptance: types compile, registry is iterable.
- Match-key extractor framework. Add src/ingestion/stitching/extractors/index.ts with the Extractor interface and a registry. Implement entra_sp_object_id, entra_app_id_lower, oauth_client_id_lower, email_lower, arn_canonical. One file per extractor. Acceptance: each extractor has unit-test fixtures with positive and negative cases.
- Extractor: aws_role_oidc_trust_subject. Parse trust-policy JSON; return Entra SP object ID for OIDC trusts on sts.windows.net/<tenant>. Acceptance: fixture from a real Lab-1 / Lab-2 AWS role yields the correct subject; non-Entra federations return null.
- Extractor: aws_role_saml_trust_subject. Same shape for SAML. Acceptance: Lab-2 fixture passes.
- Extractor: external_principal_arn. Iterates Principal.AWS entries; emits one match key per ARN. Acceptance: cross-account-trust fixture yields N match keys.
- Extractor: endpoint_uri_normalized. URL parse; lowercase host+path; strip query/fragment. Acceptance: matches Foundry connection endpoint vs ServiceNow REST message endpoint.
- Define initial rule set. Add the 7 rules from "Initial correlation rule set" table to registry.ts. Each rule has a unit test verifying its predicate selects only the intended entity classes. Acceptance: rule registry exports 7 rules; per-rule tests pass.
- Rule executor. Add src/ingestion/stitching/rule-executor.ts that, given a rule and a candidate set of entities, returns a list of proposed CorrelationDoc records. Pure function. Acceptance: per-rule executor test produces expected correlations against a fixture.
- Collision policies. Implement first_match_wins, all_match, drop_all_ambiguous in the executor. Acceptance: collision-policy tests pass with multi-candidate fixtures.
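To make the executor and collision-policy tasks concrete, here is a minimal pure-function sketch. The semantics are assumptions inferred from the policy names: a "collision" is one left-side entity matching more than one right-side entity on the same key; first_match_wins keeps the lexicographically first match, all_match keeps every match, and drop_all_ambiguous emits nothing for that entity. `SketchRule`, `Candidate`, and `executeRule` are hypothetical names, not the planned interfaces.

```typescript
type CollisionPolicy = "first_match_wins" | "all_match" | "drop_all_ambiguous";

interface Candidate { _id: string; entity_type: string; properties: Record<string, string> }

// Hypothetical rule shape: each side extracts a match key, or null when the entity is out of scope.
interface SketchRule {
  rule_id: string;
  on_collision: CollisionPolicy;
  leftKey: (e: Candidate) => string | null;
  rightKey: (e: Candidate) => string | null;
}

interface ProposedCorrelation { rule_id: string; entity_ids: [string, string]; match_key_value: string }

function executeRule(rule: SketchRule, candidates: Candidate[]): ProposedCorrelation[] {
  // Sort by _id so the output does not depend on connector arrival order.
  const sorted = candidates.slice().sort((a, b) => (a._id < b._id ? -1 : a._id > b._id ? 1 : 0));
  const rightByKey = new Map<string, Candidate[]>();
  for (const e of sorted) {
    const key = rule.rightKey(e);
    if (key === null) continue;
    const bucket = rightByKey.get(key) || [];
    bucket.push(e);
    rightByKey.set(key, bucket);
  }
  const proposed: ProposedCorrelation[] = [];
  for (const left of sorted) {
    const key = rule.leftKey(left);
    if (key === null) continue;
    const matches = (rightByKey.get(key) || []).filter((r) => r._id !== left._id);
    if (matches.length === 0) continue;
    // Ambiguous match under drop_all_ambiguous: emit nothing for this entity.
    if (matches.length > 1 && rule.on_collision === "drop_all_ambiguous") continue;
    const chosen = rule.on_collision === "first_match_wins" ? matches.slice(0, 1) : matches;
    for (const right of chosen) {
      proposed.push({ rule_id: rule.rule_id, entity_ids: [left._id, right._id], match_key_value: key });
    }
  }
  return proposed;
}
```

Sorting candidates up front is one simple way to get the order independence that the Phase 2 validation asks for; a production executor would more likely rely on an index lookup than an in-memory bucket map.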
Phase 3 — Stitcher pipeline integration
- StitchRunDoc lifecycle. Add storage-adapter methods to insert/update stitch_runs with at-most-one-in-flight enforcement. Acceptance: integration test where two stitch_run inserts race → second one waits.
- Debounced stitch_run trigger. Add a STITCH_DEBOUNCE_MS env (default 60 000); modify the worker handler to enqueue a debounced stitch_run at end of each sync_ingestion. Acceptance: integration test where two sync_ingestions land within 60 s → one stitch_run executes.
- Stitcher worker handler stitch_ingestion. New file src/workers/handlers/stitch-ingestion.ts. Implements steps S1–S8 from the pipeline diagram. Reads tenant settings, computes change set, applies rule executor, writes correlations + canonical_identity_id, emits audit. Acceptance: single-rule integration test (Foundry replay fixture) produces a correlation between Entra SP and the AWS role that trusts it.
- Tenant opt-out wiring. Read tenant_correlation_settings at the start of each stitch_run; honor disabled rules and the blocklist. Acceptance: integration test where rule is disabled per-tenant → no correlation produced.
- Audit logging. Every rule decision (fired/skipped/dropped/confirmed/deprecated) writes a stitch_audit record. Acceptance: audit query returns one record per rule per candidate per stitch run.
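The debounced trigger in Phase 3 can be modeled without timers by tracking a per-tenant due time against an injected clock, which also keeps it easy to test. This is a sketch under assumptions: `StitchDebouncer` is a hypothetical class, and it assumes trailing-edge debounce (each new sync_ingestion resets the window), which the plan does not specify.

```typescript
// Hypothetical debouncer: the caller supplies the current time so the logic stays pure.
class StitchDebouncer {
  private dueAt = new Map<string, number>();
  private debounceMs: number;

  constructor(debounceMs: number = 60000) {
    this.debounceMs = debounceMs;
  }

  // Called at the end of each sync_ingestion; restarts the tenant's debounce window.
  request(tenantId: string, nowMs: number): void {
    this.dueAt.set(tenantId, nowMs + this.debounceMs);
  }

  // Returns tenants whose window has elapsed and clears them, so each burst
  // of syncs yields exactly one stitch_run.
  takeDue(nowMs: number): string[] {
    const ready: string[] = [];
    this.dueAt.forEach((due, tenant) => {
      if (due <= nowMs) ready.push(tenant);
    });
    for (const t of ready) this.dueAt.delete(t);
    return ready.sort();
  }
}
```

A worker loop would call `takeDue(Date.now())` periodically and enqueue one stitch_run per returned tenant; at-most-one-in-flight enforcement would still come from the stitch_runs unique index.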
Phase 4 — Re-materialization
- Equivalence-class traversal in path materializer. Modify path-materializer.ts so workload-to-identity edges traverse canonical_identity_id peers. Gated on canonical_identity_id !== undefined. Acceptance: integration test where Entra SP and AWS role share a canonical ID → workload RUNS_AS Entra SP produces an authority path through the AWS role's HAS_ROLE edges.
- Re-materialization closure. In stitch-ingestion.ts, compute M per the pseudocode in "Re-materialization", call materializeExecutionPaths(M) and materializeAuthorityPaths(workloads in M). Acceptance: PR #484's .skip'd assertion (b) becomes green; #491 closes.
- stitched_paths index materialization. After authority paths are computed, write StitchedPathDoc records for any path whose source_systems_traversed.length > 1. Acceptance: the Foundry-Entra-ServiceNow path appears in stitched_paths with three source systems.
- BRIDGES_TO edge materialization. For BRIDGES_TO correlations, write a synthetic edge into the source entity's relationships[] with source_connector_id = "stitcher". Acceptance: connection-endpoint-bridge rule produces a CONNECTS_TO edge between Foundry connection and Logic App resource.
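The equivalence-class traversal task can be sketched as a graph walk whose visited set is keyed on canonical_identity_id when present and on _id otherwise. This is an illustrative sketch only: `GraphEntity` and `reachable` are hypothetical, and the real materializer builds paths rather than a flat reachable set.

```typescript
interface GraphEntity {
  _id: string;
  canonical_identity_id?: string;
  relationships: { type: string; target_id: string }[];
}

// Returns the sorted ids of every entity reachable from startId, treating all
// members of an equivalence class as one node.
function reachable(startId: string, entities: GraphEntity[]): string[] {
  const byId = new Map<string, GraphEntity>();
  const peers = new Map<string, GraphEntity[]>();
  for (const e of entities) {
    byId.set(e._id, e);
    if (e.canonical_identity_id !== undefined) {
      const list = peers.get(e.canonical_identity_id) || [];
      list.push(e);
      peers.set(e.canonical_identity_id, list);
    }
  }
  const visited = new Set<string>();
  const found: string[] = [];
  const stack: string[] = [startId];
  while (stack.length > 0) {
    const id = stack.pop() as string;
    const e = byId.get(id);
    if (e === undefined) continue;
    // Gate: unstitched entities fall back to their own _id, so behavior is unchanged for them.
    const key = e.canonical_identity_id !== undefined ? e.canonical_identity_id : e._id;
    if (visited.has(key)) continue;
    visited.add(key);
    const members = e.canonical_identity_id !== undefined ? peers.get(e.canonical_identity_id) || [e] : [e];
    for (const m of members) {
      found.push(m._id);
      for (const r of m.relationships) stack.push(r.target_id);
    }
  }
  return found.sort();
}
```

Because the whole class is expanded on first visit, a workload that RUNS_AS the Entra SP also picks up the AWS role's outgoing edges, and cycles back into the class terminate on the canonical key.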
Phase 5 — Debuggability + opt-out
- /api/v1/identities/:id/lineage endpoint. Returns canonical view + contributing source records + correlations + rule firings. Acceptance: API test against Foundry replay fixture returns 3 source records, 1 SAME_ENTITY correlation, 1 rule firing.
- /api/v1/correlations/:id endpoint. Returns the full CorrelationDoc with source records. Acceptance: API test passes.
- /api/v1/admin/stitch-runs POST endpoint. Manual trigger for full-rescan and per-tenant settings updates. Acceptance: POST with mode: "full_rescan" re-stitches the tenant; sync resolves with stitch_run summary.
- Tenant opt-out admin endpoints. PUT /api/v1/admin/tenants/:id/correlation-settings to disable rules / blocklist correlations / add manual links. Acceptance: API test where a rule is disabled then a stitch run is triggered → existing correlations from that rule are deprecated.
Phase 6 — UI surface
- Stitched-identity card. UI component on the Identity Detail page showing "Linked across N systems" badge, expanding panel with per-source-record breakdown, rule provenance per link. Acceptance: visual QA on Foundry replay shows 3 source records on servicenow-openai-client identity card.
Total: 30 tasks across 6 phases. Phase 1 is independently shippable and unblocks the rest. Phases 2–4 are sequential. Phases 5–6 can parallelize after Phase 4 lands.
Validation criteria
Per-phase acceptance
| Phase | Validation |
|---|---|
| 1 (schema) | npm run ci passes. Backfills idempotent. Existing replay test (test/integration/replay/sergey-demo.test.ts) remains green. |
| 2 (rules) | All 7 rules have unit tests. All extractors have positive + negative fixtures. Rule executor is order-independent (property test). |
| 3 (pipeline) | After ingesting Entra-ServiceNow + Azure Foundry fixtures from PR #484: a single CorrelationDoc exists linking the OAuth client to the Entra SP. connector_owners on the SP includes both connectors. No spurious events on no-op re-sync (#485 fully closed). |
| 4 (re-materialization) | After ingesting Entra-ServiceNow + AWS connector outputs where AWS role X trusts Entra SP Y via OIDC: one canonical identity exists (canonical_identity_id shared), source_record_count = 2, and one authority path of length ≥ 4 spans both source systems. PR #484 assertion (b) un-skipped and green. |
| 5 (debuggability) | GET /api/v1/identities/:id/lineage returns ≥ 2 source records for any stitched identity, with rule provenance per link. |
| 6 (UI) | Stitched-identity card visible on Identity Detail page; visual QA passes per platform standards. |
MediaPro Lab 2 validation contract (delivered to Stream 4)
When MediaPro Lab 2 runs end-to-end (Stream 4 builds the demo; this stream owns the data-shape acceptance), the platform must produce exactly these stitched paths for the demo to count as validated:
- Bedrock-agent → Lambda → MCP-server → Entra-SP → ServiceNow-OAuth-app → HR-table
  - Resolves to one AuthorityPathDoc of length 6 (or 7 if MCP server emits a separate identity).
  - source_systems_traversed = ["aws_iam", "entra_id", "servicenow"] (3 distinct).
  - contributing_correlation_ids includes:
    - One aws-oidc-federation-to-entra-sp correlation (or mcp-server-to-entra-sp if the MCP server uses env-var auth).
    - One servicenow-oauth-to-entra-sp correlation linking the SN OAuth client to the Entra SP.
  - canonical_identity_id on the Entra SP, AWS role, and SN OAuth client all match.
  - GET /authority-paths/grouped?identity=<canonical_identity_id> returns the workload + this path.
- Bedrock-agent → cross-account assume-role → S3 PII bucket
  - Resolves to one AuthorityPathDoc of length ≥ 3.
  - One aws-cross-account-role-trust correlation links the source role to the target role.
  - Bridges nimbus-workloads and nimbus-data accounts in the same canonical AWS-org context.
- Foundry-agent → Logic-App → ServiceNow-incident-table (Lab 2 Phase B)
  - Resolves to one AuthorityPathDoc with connection-endpoint-bridge correlation in contributing_correlation_ids.
  - Logic App appears as a single resource entity (not duplicated across Foundry and Entra source records).
  - source_systems_traversed includes both azure_foundry and servicenow.
- No duplicate entities for the OAuth client: the demo screen shows ONE node for servicenow-openai-client even though Entra (SP), Foundry (managed identity), and ServiceNow (OAuth client) all contribute.
- Lineage panel on the canonical identity shows ≥ 3 source records with their contributing connectors and rule firings.
- Order independence: re-running connectors in any order (AWS first then Entra then SN; SN then Entra then AWS; etc.) produces identical correlations and identical canonical_identity_ids.
- Tenant opt-out works: disabling aws-oidc-federation-to-entra-sp for the demo tenant and triggering a manual mode: "full_rescan" removes the AWS-Entra link; the AWS role and Entra SP render as separate entities again.
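The order-independence requirement in the contract above follows naturally if canonical_identity_id is a deterministic function of class membership. The sketch below uses a union-find in which the root of each class is always its lexicographically smallest member. `assignCanonicalIds` is a hypothetical helper, and it assumes the input pairs come only from SAME_ENTITY correlations.

```typescript
// Each pair links two entity ids that a SAME_ENTITY correlation declared equivalent.
function assignCanonicalIds(pairs: [string, string][]): Map<string, string> {
  const parent = new Map<string, string>();

  // Find the root of x's class, registering x as its own class on first sight.
  const find = (x: string): string => {
    if (!parent.has(x)) parent.set(x, x);
    let root = x;
    while (parent.get(root) !== root) root = parent.get(root) as string;
    return root;
  };

  for (const [a, b] of pairs) {
    const ra = find(a);
    const rb = find(b);
    if (ra === rb) continue;
    // Always attach the larger root under the smaller one, so every root
    // stays the lexicographic minimum of its class regardless of merge order.
    if (ra < rb) parent.set(rb, ra);
    else parent.set(ra, rb);
  }

  const canonical = new Map<string, string>();
  parent.forEach((_p, id) => canonical.set(id, find(id)));
  return canonical;
}
```

Since the canonical id depends only on which entities end up in the same class, running AWS then Entra then ServiceNow yields the same ids as any other connector order.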
Non-goals (explicit)
- This stream does not specify the Lab 2 demo narrative or visual flow — that is Stream 4's deliverable.
- This stream does not extend connectors beyond emitting the optional correlationKeys[] and lineage_records[] (M11–M12). The deeper trust-policy parsing for AWS lives in extractors, not in connectors. Stream 2 owns AWS-side node shapes.
- This stream does not build the ScanRun schema (Stream 1 owns it); it consumes the completed event.
Open questions
- Should BRIDGES_TO correlations participate in the equivalence-class merge? Currently no: they only add an edge, not a canonical link. But the MCP-server-to-Entra-SP rule blurs this: if an MCP server's only identity is its Entra SP, are they conceptually the same identity or just bridged? Recommendation: keep BRIDGES_TO strictly edge-additive; promote a rule to SAME_ENTITY only when the structural relationship is unambiguous (OIDC subject IS the Entra principal ID).
- What happens if a SAME_ENTITY correlation links entities of different entity_type? Example: Entra emits identity, AWS emits role. The type-survivorship rule resolves canonical type, but the contributing entities keep their original types. Does the UI show "this entity is sometimes a role, sometimes an identity"? Recommendation: yes; lineage panel shows per-source type. Findings evaluator reads the canonical type.
- How should the path materializer handle equivalence classes with > 2 members under cycle detection? Today's materializer has cycle detection via visited sets keyed on entity ID. With equivalence-class traversal, the visited set must be keyed on canonical_identity_id, not _id. Edge case: an entity's canonical_identity_id == entity._id (the lex-smallest member). Acceptance test required.
- Does the stitcher need its own circuit breaker like the diff-engine deletion breaker? If a buggy rule starts producing cross-tenant correlations or a bad extractor merges hundreds of unrelated entities, the system needs a halt. Recommendation: add a stitch-level breaker that halts if a single stitch_run would touch >X% of the tenant's entities (configurable, default 50%).
- Should stitched_paths be the source of truth for the UI for cross-system paths, or just an index? If source-of-truth, authority_paths becomes a per-source-system view. If just-an-index, authority_paths continues to be the canonical store and stitched_paths is denormalized for fast lookup. Recommendation: just-an-index for now; reconsider after Lab 2 ships if query patterns demand it.
- For Stream 4: what happens when a stitched path's contributing connectors disagree on intervals? Example: AWS reports the assume-role grant as continuously active; Entra reports the SP was disabled for two weeks last quarter. The path should reflect the gap. This is a finding-layer concern, not a stitching concern, but it affects what Stream 4's demo can claim. Flagged here so Stream 4 designs around it.
- Should rules be able to reference other rules' outputs? Example: "if aws-oidc-federation-to-entra-sp linked entities X and Y, and the Entra SP Y has an OAuth client correlation, transitively bridge X to the OAuth client." Today this works because path-materializer traverses the graph. But if rules could chain, more compact correlations would be possible. Recommendation: do not allow rule chaining initially (kills order-independence). Re-evaluate after Phase 4.
- What is the SLO for stitch runs? Default debounce is 60 s; first stitch after a connector sync should complete within 5 minutes for tenants ≤ 10 K entities. Beyond that, large-tenant performance needs benchmarking (Phase 4 acceptance criterion).
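The stitch-level circuit breaker recommended in the open questions could be as small as the check below. This is a sketch of the recommendation only: `shouldHaltStitch` is a hypothetical name, the 0.5 default mirrors the "default 50%" in the text, and halting on an empty tenant with a non-empty change set is an added assumption.

```typescript
// Returns true when a stitch run should be halted before writing.
// touched: entities the run would modify; total: entities in the tenant.
function shouldHaltStitch(touched: number, total: number, maxFraction: number = 0.5): boolean {
  // Assumption: touching anything in a tenant that reports zero entities is treated as anomalous.
  if (total <= 0) return touched > 0;
  // Strictly greater than the threshold, matching ">X%" in the recommendation.
  return touched / total > maxFraction;
}
```

For a tenant with 3,000 entities and the default threshold, a run that would touch 1,501 entities halts while one touching exactly 1,500 proceeds.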
References
Internal — research and plans
- 2026-02-26-cross-connector-entity-correlation-research.md: Phase A–E foundation that this proposal extends.
- docs/plans/2026-04-21-multi-connector-reconciliation.md (in multi-connector-reconciliation worktree): the Option A/B/C/D analysis that selected Option C as the eventual target.
- docs/plans/2026-04-08-demo-lab-plan.md: Lab 2 dependency on stitching; multi-connector demo requirements.
- docs/session-notes/2026-04-20-foundry-demo-resolution-session-handoff.md (in sv0-platform): what #459 / #461 patched and what remained.
Internal — issues
- sv0-platform#300: feat(ingestion): cross-connector graph stitching for shared identities (the original ask).
- sv0-platform#486: epic: multi-connector reconciliation phase (Option C).
- sv0-platform#491: investigation: cross-sync re-materialization gap.
- sv0-platform#488: bug: EntityDoc.connector_id is singular.
- sv0-platform#487: bug: concurrent upsertEntity race via path-materializer.
- sv0-platform#485: bug: diffProperties not connector-scoped.
- sv0-platform#383: bug: human_identity silently retyped to owner.
- sv0-platform#459 (merged): fix: cross-connector relationship merge + upstream re-materialize.
- sv0-platform#461 (merged): fix: scope diff-engine relationship comparison to current connector.
- sv0-connectors#79: feat: cross-connector entity correlation Phase A–E (partially shipped).
Internal — architecture docs
- 01-data-model.md: entity / relationship schema being extended.
- 02-processing-pipeline.md: pipeline being extended with stitch_run phase.
- 03-database.md: collection schemas; new collections added here.
- 05-connectors.md: NormalizedGraph contract being extended (M11, M12).
Internal — source code
- src/ingestion/types.ts: NormalizedNode, NormalizedEdge, NormalizedGraph, ScanScope.
- src/domain/entities/types.ts: EntityDoc, EntityRelationship, ExecutionPath, EntityVersionDoc.
- src/ingestion/graph-transformer.ts: buildStableEntityId, mapNodeType (#383 lives here at line 45), reclassifyBySubtype.
- src/ingestion/diff-engine.ts: diffProperties (#485), diffRelationships (post-#460), computeDiff deletion scope (#488 root cause at line 318-323).
- src/ingestion/path-materializer.ts: materializeExecutionPaths, FORWARDING_EDGE_TYPES (line 108).
- src/ingestion/authority-path-materializer.ts: authority-path materialization, removal circuit breaker.
- src/storage/storage-adapter.ts: getEntityBySourceId, upsertEntity, getEntitiesWithRelationshipTo.
- src/storage/mongo/adapters/entity-adapter.ts: upsertEntities non-atomic $set (M3 root cause).
- src/workers/handlers/sync-ingestion.ts: 12-step per-connector handler; mergeRelationships at lines 26-37; read-merge-write at 124-140; upstream re-materialize Fix B at 218-231.
- sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/correlator.py: connector-side OAuth-client-id ↔ Entra-app-id correlation (the model the platform stitcher generalizes).
- sv0-connectors/integrations/aws/src/sv0_aws/core/trust_policy_parser.py: AWS trust-policy parsing the aws_role_oidc_trust_subject extractor mirrors (read-only).
- sv0-connectors/shared/sv0_azure/sv0_azure/node_ids.py: shared node ID generators (Phase D from 2026-02-26 doc; only Entra principals covered today).
External
- Veza OAA cross-service connections (deterministic exact-match identity correlation).
- Wiz unified-graph + query-time path discovery (the architectural shape this proposal converges on).
- SailPoint correlation rules + authoritative-source policies (the survivorship-rules model).
- Neo4j entity-resolution patterns (linking-relationships pattern this proposal adopts).
- MongoDB aggregation pipeline updates (docs) — used for atomic merge upserts (M3).