
Cross-Connector Entity Correlation Research

Date: 2026-02-26
Status: Draft v2, revised per review findings (8 items addressed)
Scope: sv0-platform (ingestion pipeline), sv0-connectors (shared libraries, entra-servicenow, azure-foundry)
Trigger: ProvisionUser Agent scenario revealed that authority paths cannot be fully reconstructed from live connector data when execution crosses platform boundaries (Foundry → Logic App → ServiceNow)


1. Problem Statement

SecurityV0 discovers entities and relationships through multiple connectors:

  • entra-servicenow — Entra ID service principals, ServiceNow workloads, OAuth credentials, REST Messages, execution chains
  • azure-foundry — Foundry AI agents, connections, managed identities, ARM role assignments

Each connector produces a NormalizedGraph that is ingested independently. The platform's path materializer builds authority paths by traversing the chain:

workload → RUNS_AS → identity → HAS_ROLE → role → GRANTS → permission → APPLIES_TO → resource

The materializer is source-system-agnostic — it follows edges by entity _id regardless of which connector created them. However, no mechanism exists to correlate entities discovered by different connectors into unified cross-platform paths.

Concrete Example: ProvisionUser Agent

The Foundry ProvisionUser Agent executes across three systems:

Foundry Agent → (SAS token) → Azure Logic App → (HTTP) → ServiceNow Incident Table

Neither hop uses ARM RBAC. The Foundry connector discovers the agent, its connection (with the Logic App endpoint URL), and the SAS credential. The entra-servicenow connector discovers the ServiceNow side. But nothing links them:

| What Exists | What's Missing |
|---|---|
| Agent → RUNS_AS → managed identity | Role/permission chain → Logic App (SAS auth bypasses ARM) |
| Identity → HAS_ROLE → ARM roles (Azure resources) | Role/permission chain → ServiceNow (external system) |
| Connection node with endpoint: "https://prod-28.eastus.logic.azure.com/..." | Correlation to Logic App resource entity |
| SN REST Message node with endpoint_url: "https://prod-28.eastus.logic.azure.com/..." | Correlation to same Logic App from ServiceNow side |
| INVOKES → connection → USES → credential | These edges are decorative (not in the authority path chain) |

Result: The seed script can manually create the full path, but live connector scans cannot reconstruct it. Authority paths are limited to ARM-scoped Azure resources.


2. Industry Research

2.1 Terminology

| Term | Used By | Scope |
|---|---|---|
| Identity Correlation | SailPoint, Broadcom IGA | Matching accounts from different sources to a single identity |
| Entity Resolution | Neo4j, data engineering | Determining when different records refer to the same real-world entity |
| Graph Stitching | Veza, security graph platforms | Linking authorization graph nodes across application boundaries |
| Cross-Domain Correlation | CrowdStrike, XDR platforms | Correlating signals across endpoint, identity, cloud, and SaaS domains |
| Unified Entity Abstraction | Wiz, CSPM platforms | Normalizing provider-specific entities into common types |

Recommended terminology for SecurityV0: "Cross-source entity correlation" (for the matching step) and "graph stitching" (for the result of linking nodes across connector boundaries).

2.2 Platform-by-Platform Analysis

SailPoint IdentityNow — Account Correlation

SailPoint treats correlation as a first-class ingest-time operation. Every source has a correlation configuration — ordered attribute pairings like (identityAttribute, accountAttribute).

  • Happens at ingest time during account aggregation
  • Deterministic exact match only (Equals operator, case-insensitive)
  • Declarative (configured attribute pairings) with programmatic escape hatch (Java "Cloud Rules")
  • Fallback: accountName == identity.name
  • Up to 100 accounts correlated to a single identity
  • Does NOT model cross-application execution paths — only "who has access to what" per application

Reference: SailPoint Correlation Documentation, Correlation Rule Developer Guide
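The ordered attribute pairings described above can be sketched in a few lines. This is a minimal Python illustration of the pattern (ordered pairings, exact case-insensitive equality, first match wins), not SailPoint's implementation; the function name `correlate_account` and the record shapes are hypothetical.

```python
from typing import Optional

def correlate_account(
    account: dict,
    identities: list[dict],
    pairings: list[tuple[str, str]],
) -> Optional[dict]:
    """Return the first identity matched by an ordered list of
    (identity_attribute, account_attribute) pairings, or None."""
    for identity_attr, account_attr in pairings:
        account_value = account.get(account_attr)
        if account_value is None:
            continue  # this pairing cannot apply; try the next one
        for identity in identities:
            identity_value = identity.get(identity_attr)
            # Exact, case-insensitive equality only; no fuzzy matching.
            if identity_value is not None and str(identity_value).lower() == str(account_value).lower():
                return identity
    return None

identities = [
    {"name": "jdoe", "email": "Jane.Doe@example.com"},
    {"name": "asmith", "email": "a.smith@example.com"},
]
# Pairings are tried in order; the second acts as the name-based fallback.
pairings = [("email", "mail"), ("name", "accountName")]

matched = correlate_account({"mail": "jane.doe@EXAMPLE.com", "accountName": "x"}, identities, pairings)
fallback = correlate_account({"accountName": "ASMITH"}, identities, pairings)
```

Because the pairings are ordered, a more trustworthy attribute can be tried before a weaker fallback such as the account name.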

Veza — OAA Cross-Service Connections

Veza builds an Authorization Metadata Graph. Cross-system linking is achieved through explicit identity references in the OAA payload:

{
  "local_users": [
    {
      "name": "svc-foundry",
      "identities": ["svc-foundry@tenant.onmicrosoft.com"],
      "groups": ["admins"]
    }
  ]
}

  • Each local user carries an identities[] array with stable cross-system identifiers
  • Matching at ingest time against IdP entities by email, UPN, or IdP ID
  • Deterministic exact match — no fuzzy matching
  • Declarative — connector author explicitly provides identity references
  • No built-in URL-to-resource matching — connector author must provide explicit references

Reference: Veza OAA Guide, Cross-Service Connections

Wiz — Security Graph Unified Entity Model

Wiz builds a single unified graph in Amazon Neptune normalizing all cloud entities:

| Cloud Concept | AWS | Azure | GCP | Wiz Common Model |
|---|---|---|---|---|
| Identity | IAM User/Role | AD Principal | Service Account | Identity Entity |
| Permissions | IAM Policies | RBAC Roles | IAM Roles | Normalized Permission (5 levels) |

  • Entity normalization at ingest time per cloud provider
  • Cross-cloud attack paths discovered at query time via graph traversal
  • Deterministic — provider entities mapped through known schema translations
  • Permission normalization to 5 universal levels: List, Read, Write, High Privilege, Admin

Reference: Wiz Security Graph, Attack Path Analysis, AWS Case Study — Neptune

CrowdStrike — Enterprise Graph

CrowdStrike unifies multiple specialized graphs (Threat Graph, Asset Graph, Intel Graph) into the Enterprise Graph:

  • Correlation at ingest time (continuous streaming)
  • Cross-domain correlation via shared identity anchors (SID, UPN, IP address, session tokens)
  • Emphasis on behavioral correlation (linking events across domains)
  • Agent-based (Falcon sensor) for endpoint + API-based connectors for cloud/identity

Reference: CrowdStrike Enterprise Graph, CNAPP with CIEM

Neo4j — Entity Resolution Patterns

Neo4j provides three approaches:

  1. Deterministic (rule-based): Exact match on specific fields. High precision, explainable, fast.
  2. Probabilistic (similarity-based): String similarity algorithms with weighted scores. Handles data quality issues.
  3. Graph-enhanced: Uses shared relationships and neighbors to improve matching confidence.

Key recommendation from Neo4j community: Do NOT merge duplicate nodes. Create linking relationships (SAME_AS) or intermediate grouping nodes that both source nodes connect to. This preserves provenance while enabling cross-source traversal.

Reference: Neo4j Entity Resolution, Entity Resolution Example
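The "link, do not merge" recommendation can be illustrated with a small in-memory sketch. This is plain Python rather than Cypher, and the node shapes and the helper `link_same_as` are hypothetical; it only shows that SAME_AS edges can be derived from a shared key while every source node stays intact.

```python
from collections import defaultdict

def link_same_as(nodes: list[dict], key: str) -> list[tuple[str, str, str]]:
    """Return (source_id, "SAME_AS", target_id) edges between nodes that
    share the same value for `key`. Nodes are never merged or mutated."""
    by_value = defaultdict(list)
    for node in nodes:
        value = node.get(key)
        if value is not None:
            by_value[value].append(node["id"])

    edges = []
    for ids in by_value.values():
        ordered = sorted(ids)
        # Link every other node in the group to the first one.
        for other in ordered[1:]:
            edges.append((ordered[0], "SAME_AS", other))
    return edges

nodes = [
    {"id": "crm-17", "source": "crm", "email": "jane@example.com"},
    {"id": "hr-42", "source": "hr", "email": "jane@example.com"},
    {"id": "hr-43", "source": "hr", "email": "bob@example.com"},
]
edges = link_same_as(nodes, "email")
```

Each node keeps its own `source` field, so provenance survives and the link can be removed later without reconstructing the originals.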

SCIM Standard

The externalId attribute is the primary cross-system correlation key:

  • Issued by the provisioning client, stored by the receiving system alongside its internal ID
  • Enables bidirectional lookup across system boundaries
  • Best practice: use stable identifiers (objectId, not email)
  • SCIM is a provisioning protocol — assumes the mapping is known, doesn't do discovery

Reference: RFC 7643 — SCIM Core Schema

2.3 Industry Patterns (not consensus — approaches vary)

Review finding addressed (Medium #8): The v1 draft overstated certainty ("all major platforms", "exact only") and contained an internal contradiction: SailPoint's Java Cloud Rules are programmatic/heuristic, contradicting the "deterministic only" claim. Wiz discovers attack paths at query time, contradicting "ingest time only". This section is revised to present patterns with nuance.

| Question | Predominant Pattern | Exceptions / Nuance |
|---|---|---|
| Ingest time or query time? | Mostly ingest time for entity linking. SailPoint, Veza, CrowdStrike all establish links during ingestion. | Wiz normalizes at ingest but discovers attack paths at query time via graph traversal. The boundary is blurry: linking is ingest-time, path discovery is query-time. |
| Fuzzy or exact matching? | Predominantly exact for identity/security graphs. SailPoint uses Equals operator; Veza matches on email/UPN/IdP ID; SCIM uses externalId. | SailPoint's Java Cloud Rules allow arbitrary programmatic logic: the connector author can implement fuzzy matching, regex, or database lookups. Neo4j community explicitly supports probabilistic matching for data quality scenarios. SecurityV0 should start with exact matching but the architecture should not preclude rule-based extensions. |
| Declarative or automatic? | Mix. SailPoint: declarative config + Java escape hatch. Veza: connector-declared identities[]. SCIM: admin-configured attribute mapping. | Wiz and CrowdStrike are more automatic (built-in normalization per cloud provider, not admin-configured). The "declarative vs automatic" axis depends on whether the platform knows the integration topology in advance. |
| Connection URL → resource matching? | No universal mechanism. All platforms handle this as connector-specific or provider-specific logic. | This falls in a gap between identity correlation (well-standardized) and infrastructure topology discovery (provider-specific). Azure Logic App callback URLs are identified by host + path + trigger + query signature, not host alone (Microsoft docs). |

3. SecurityV0 Current Infrastructure Audit

3.1 What Already Works

Shared Node ID Generators

File: sv0-connectors/shared/sv0_azure/sv0_azure/node_ids.py

def sp_node_id(principal_id: str) -> str:
    return f"entra-sp-{principal_id}"

def owner_node_id(object_id: str) -> str:
    return f"entra-user-{object_id}"

Both connectors import and use these:

  • entra-servicenow: from sv0_azure.node_ids import sp_node_id, owner_node_id (transformer.py:27-28)
  • azure-foundry: from sv0_azure.node_ids import sp_node_id, owner_node_id (edge_resolver.py:24)

When both connectors discover the same managed identity (principal ID abc-123), they produce:

  • Same nodeId: "entra-sp-abc-123"
  • Same sourceSystem: "entra_id"
  • Same sourceId: "abc-123"

The platform's buildStableEntityId() hashes tenantId + sourceSystem + sourceId into the same entity _id. The upsert merges them into one entity.
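The effect of hashing the three-part key can be sketched as follows. This is a Python illustration only: the real buildStableEntityId() lives in the TypeScript platform, and the hash algorithm, separator, and output format used here are assumptions, as is the name `stable_entity_id`.

```python
import hashlib

def stable_entity_id(tenant_id: str, source_system: str, source_id: str) -> str:
    """Derive a deterministic ID from (tenant, source system, source ID).
    SHA-256 and the unit-separator delimiter are assumptions for illustration."""
    key = "\x1f".join([tenant_id, source_system, source_id])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Two connectors reporting the same managed identity produce the same ID.
from_servicenow_connector = stable_entity_id("tenant-1", "entra_id", "abc-123")
from_foundry_connector = stable_entity_id("tenant-1", "entra_id", "abc-123")

# A different tenant never collides with it by construction of the key.
other_tenant = stable_entity_id("tenant-2", "entra_id", "abc-123")
```

The connector name is deliberately absent from the key, which is why both connectors land on one entity and why the upsert behavior described in Gap 1 matters.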

Source-System-Agnostic Path Traversal

File: sv0-platform/src/ingestion/path-materializer.ts

The path materializer follows edges by entity _id regardless of sourceSystem. It traverses:

  • RUNS_AS (workload → identity binding)
  • HAS_ROLE → GRANTS → APPLIES_TO (permission chain)
  • CALLS, INVOKES, USES, AUTHENTICATES_AS, AUTHENTICATES_VIA (forwarding)
  • AUTHENTICATES_TO (cross-system auth, depth-limited to 1)

If connector A creates node X and connector B creates node Y, and an edge X→Y exists, the materializer will follow it.
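A source-agnostic traversal of this kind can be sketched as a walk keyed only by entity ID. This Python sketch is not the materializer's code; the entity shapes, the `reachable` helper, and the `azure_arm` source-system label are hypothetical.

```python
def reachable(start: str, entities: dict[str, dict]) -> list[str]:
    """Depth-first walk over outbound relationships, keyed only by entity ID.
    The source_system of each entity is never consulted."""
    seen, order, stack = set(), [], [start]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        for rel in entities.get(current, {}).get("relationships", []):
            stack.append(rel["target_id"])
    return order

entities = {
    "x": {"source_system": "azure_foundry", "relationships": [{"type": "RUNS_AS", "target_id": "y"}]},
    "y": {"source_system": "entra_id", "relationships": [{"type": "HAS_ROLE", "target_id": "z"}]},
    "z": {"source_system": "azure_arm", "relationships": []},
    # No edge points at "w", so it is unreachable even though it exists.
    "w": {"source_system": "servicenow", "relationships": []},
}
path = reachable("x", entities)
```

The unreachable node `w` is the situation in the ProvisionUser example: the entity is in the graph, but without a bridging edge no traversal can arrive at it.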

Cross-Connector Deletion Protection

File: sv0-platform/src/ingestion/diff-engine.ts (line 281)

Deletion detection is scoped by connectorId. Connector A's sync cannot delete connector B's entities.

Endpoint URL Properties

Both connectors store endpoint URLs on connection nodes:

| Connector | Property Name | Example Value |
|---|---|---|
| azure-foundry | endpoint | https://prod-28.eastus.logic.azure.com/... |
| entra-servicenow | endpoint_url | https://prod-28.eastus.logic.azure.com/... |

Same URL, different property names, different nodes, no linkage. This is the unexploited correlation key.

3.2 Critical Gaps

Gap 1: Last-Writer-Wins Entity Overwrite

File: sv0-platform/src/storage/mongo/adapters/entity-adapter.ts (line 36)

The entity upsert uses $set for the full document including relationships: [...]. When two connectors emit the same entity:

  1. entra-servicenow ingests → SP has AUTHENTICATES_AS, OWNED_BY relationships
  2. azure-foundry ingests → SP overwritten with only RUNS_AS, HAS_ROLE relationships
  3. entra-servicenow relationships are lost until next ingestion

This is the most critical blocker. The unified graph's correctness depends on ingestion order.
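The overwrite sequence can be reproduced with a dictionary standing in for the entities collection. This is an illustrative Python simulation, not the adapter's code; `upsert_full_replace` is a hypothetical stand-in for a $set of the full document.

```python
def upsert_full_replace(store: dict, entity_id: str, doc: dict) -> None:
    """Mimic a $set of the whole document: the incoming doc replaces the stored one."""
    store[entity_id] = dict(doc)

store: dict = {}
sp_id = "entra-sp-abc-123"

# 1. entra-servicenow ingests first.
upsert_full_replace(store, sp_id, {
    "connector_id": "entra-servicenow",
    "relationships": [{"type": "AUTHENTICATES_AS"}, {"type": "OWNED_BY"}],
})
# 2. azure-foundry ingests second and overwrites the relationships array.
upsert_full_replace(store, sp_id, {
    "connector_id": "azure-foundry",
    "relationships": [{"type": "RUNS_AS"}, {"type": "HAS_ROLE"}],
})

surviving_types = [r["type"] for r in store[sp_id]["relationships"]]
```

Swapping the two calls leaves only the entra-servicenow relationships instead, which is the ingestion-order dependency described in Gap 4.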

Gap 2: No Endpoint-Based Correlation

ServiceNow REST Messages and Foundry connections both contain endpoint URLs pointing to the same external services. There is no mechanism to detect this match and create bridging edges.

Gap 3: No Shared Node IDs for Non-Entra Entities

sv0_azure.node_ids only covers Entra SPs and users. No shared ID scheme for:

  • Azure resources (Storage, Key Vault, Logic Apps) discovered by multiple connectors
  • ARM role assignments
  • Connections/credentials representing the same external service

Gap 4: Ingestion Order Dependency

Because relationships are not additively merged, the correctness of the unified graph depends on which connector ingests last. No mechanism exists to replay or recompose the merged entity from both connectors' contributions.

Gap 5: Property Name Inconsistency

The endpoint URL property is named endpoint_url in entra-servicenow and endpoint in azure-foundry. A future correlation mechanism must handle both.
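Until the property name is standardized, a reader that tolerates both names is straightforward. A minimal Python sketch, assuming the properties arrive as a flat dict; `read_endpoint`, `ENDPOINT_PROPERTY_NAMES`, and the example URLs are hypothetical.

```python
from typing import Optional

# Both names currently in use: azure-foundry first, then entra-servicenow.
ENDPOINT_PROPERTY_NAMES = ("endpoint", "endpoint_url")

def read_endpoint(properties: dict) -> Optional[str]:
    """Return the endpoint URL stored under either known property name, or None."""
    for name in ENDPOINT_PROPERTY_NAMES:
        value = properties.get(name)
        if value:
            return value
    return None

foundry_props = {"endpoint": "https://example.invalid/workflows/a"}
servicenow_props = {"endpoint_url": "https://example.invalid/workflows/a"}
```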


4. Recommended Approach

Based on industry research and the existing infrastructure audit, the recommended approach is a platform-level correlation phase that runs after entity upsert and uses deterministic exact matching. This aligns with the predominant industry patterns summarized in Section 2.3 (Veza's declarative model, SailPoint's ingest-time correlation, Neo4j's linking relationships).

Design Principles

  1. Deterministic only. No fuzzy matching — false positive links create incorrect authority paths.
  2. Ingest-time execution. Correlation runs as part of the sync ingestion pipeline, after entity upsert but before path materialization.
  3. Additive, not destructive. Correlation adds edges between existing entities; it never merges or deletes nodes. Following Neo4j's recommendation: preserve provenance, create linking relationships.
  4. Declarative correlation keys. Connectors declare what keys to correlate on — the platform executes the matching.

Phase A: Multi-Connector Entity Ownership & Relationship Partitioning (Prerequisite)

Review finding addressed (Critical #1): The v1 draft proposed a simple "additive merge" of relationships. This breaks three existing mechanisms: (1) diffRelationships() in diff-engine.ts:189 computes added/removed/modified relationships by comparing existing vs incoming — if the existing set includes another connector's relationships, they appear as "removed" in the diff, generating false events and incorrect version history. (2) insertEntityVersion() in sync-ingestion.ts:89 stores entity.relationships as the full state — merged relationships would make version snapshots connector-impure. (3) The connector_id field on EntityDoc is singular (types.ts:56) — deletion detection scopes by this single value (diff-engine.ts:279), which is inconsistent when multiple connectors contribute to the same entity.

Review finding addressed (Critical #2): The v1 draft proposed read-merge-write (read existing → merge → write). Under concurrent connector syncs, two connectors could both read the existing entity, both compute their merge independently, and the last writer's $set would overwrite the other's merged relationships. MongoDB warns about lost updates when concurrent updates use broad filters and $set (MongoDB Atomicity Docs). The solution must use atomic operations or compound operations (MongoDB Compound Operations).

Review finding addressed (High #3): The entity ownership model is currently single-writer (connector_id: string). Correlation requires multi-connector ownership. Without this change, deletion logic will be inconsistent — if connector A's sync doesn't see a shared entity, it would mark it deleted even though connector B still reports it.

Problem: Three interrelated issues:

  1. Single-writer entity ownership: EntityDoc.connector_id is a single string. When two connectors emit the same entity, the second overwrites the first's connector_id, and the first connector's deletion detection can no longer find it.
  2. Relationship overwrite: $set replaces the entire relationships array. Cross-connector relationships are lost.
  3. Race condition: Concurrent syncs from different connectors can both read/merge/write, causing lost updates.

Solution: Partition entity contributions by connector using connector-scoped sub-documents and atomic MongoDB operations.

A1. Multi-Connector Entity Ownership

Replace single connector_id with connector_owners:

// EntityDoc changes:
export interface EntityDoc {
  // ... existing fields
  connector_id?: string;      // DEPRECATED: kept for migration compat, set to last writer
  connector_owners: string[]; // NEW: all connectors that have contributed to this entity
  // ...
}

Atomic update: When upserting, use $addToSet to register the connector:

// In entity-adapter.ts upsertEntity():
await this.c.entities.updateOne(
  { tenant_id, source_system, source_id },
  {
    $set: { ...connectorOwnedFields, connector_id: connectorId, updated_at: now },
    $addToSet: { connector_owners: connectorId },
    $setOnInsert: { _id, created_at },
  },
  { upsert: true }
);

Deletion detection change: diff-engine.ts:279 currently filters by connector_id. With multi-ownership:

  • An entity is only eligible for deletion by connector A if connector_owners includes A
  • An entity is only fully deleted when ALL owning connectors have marked it absent
  • If only one connector marks it absent, the entity remains active (other connectors still report it)

// diff-engine.ts deletion detection:
// Before: filter.connector_id = connectorId
// After:  filter.connector_owners = connectorId  // equality on an array field matches any element
// Deletion only proceeds if this is the LAST remaining owner
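The multi-owner rules above can be sketched as a small state transition. This Python sketch is illustrative only: it assumes that a connector which no longer sees the entity is removed from connector_owners, which is one possible way to track "marked absent" and is not specified by this document; `mark_absent` and the return labels are hypothetical.

```python
def mark_absent(entity: dict, connector_id: str) -> str:
    """Apply one connector's "not seen in this sync" signal.
    Returns "not_owner", "owner_removed", or "deleted"."""
    owners = entity["connector_owners"]
    if connector_id not in owners:
        # A connector cannot delete an entity it never contributed to.
        return "not_owner"
    owners.remove(connector_id)
    if owners:
        # Another connector still reports the entity, so it stays active.
        return "owner_removed"
    entity["deleted"] = True
    return "deleted"

entity = {"connector_owners": ["entra-servicenow", "azure-foundry"], "deleted": False}

first = mark_absent(entity, "azure-foundry")
unrelated = mark_absent(entity, "some-other-connector")
second = mark_absent(entity, "entra-servicenow")
```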

A2. Connector-Partitioned Relationships

Instead of storing a flat relationships: EntityRelationship[] that gets overwritten, partition relationships by connector:

// New field on EntityRelationship:
export interface EntityRelationship {
  type: string;
  target_id: string;
  properties: Record<string, unknown>;
  source_connector_id: string; // NEW: which connector created this relationship
}

Atomic update strategy: Use a two-step atomic pipeline update (MongoDB 4.2+) to replace only the current connector's relationships while preserving others:

// In entity-adapter.ts: atomic pipeline update, no read-merge-write race.
await this.c.entities.updateOne(
  { tenant_id, source_system, source_id },
  [
    // Step 1: Remove all relationships from this connector.
    // $ifNull guards against a missing array, for which $filter would return null.
    {
      $set: {
        relationships: {
          $filter: {
            input: { $ifNull: ["$relationships", []] },
            cond: { $ne: ["$$this.source_connector_id", { $literal: connectorId }] }
          }
        }
      }
    },
    // Step 2: Append this connector's new relationships.
    // $literal keeps string values that start with "$" from being parsed as expressions.
    {
      $set: {
        relationships: { $concatArrays: ["$relationships", { $literal: newRelationships }] }
      }
    }
  ]
);

This is a single atomic operation with no read-merge-write race: MongoDB applies a write to a single document atomically, and the pipeline stages run in order against that document.
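The data transformation performed by the two stages (filter out this connector's relationships, then append its new ones) can be checked in memory. This Python sketch mirrors only the array logic and says nothing about atomicity, which comes from MongoDB itself; `replace_connector_relationships` is a hypothetical helper.

```python
def replace_connector_relationships(existing, connector_id, new_relationships):
    """In-memory equivalent of the two stages: drop relationships owned by
    `connector_id`, then append that connector's new relationships."""
    kept = [r for r in (existing or []) if r.get("source_connector_id") != connector_id]
    return kept + list(new_relationships)

existing = [
    {"type": "AUTHENTICATES_AS", "target_id": "t1", "source_connector_id": "entra-servicenow"},
    {"type": "RUNS_AS", "target_id": "t2", "source_connector_id": "azure-foundry"},
]
incoming = [
    {"type": "HAS_ROLE", "target_id": "t3", "source_connector_id": "azure-foundry"},
]

merged = replace_connector_relationships(existing, "azure-foundry", incoming)
# Re-running the same sync yields the same result, so the update is idempotent.
merged_again = replace_connector_relationships(merged, "azure-foundry", incoming)
```

The entra-servicenow relationship survives the azure-foundry sync, and the stale azure-foundry RUNS_AS edge is dropped because it is absent from the incoming set.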

A3. Diff Engine and Version History Compatibility

Diff engine (diff-engine.ts:189): diffRelationships() must compare only relationships from the current connector:

// Before: diffRelationships(existing.relationships, incoming.relationships)
// After:
const existingForConnector = existing.relationships.filter(
  r => r.source_connector_id === connectorId
);
diffRelationships(existingForConnector, incoming.relationships);

This ensures that another connector's relationships don't appear as "removed" in the diff.
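The difference between a scoped and an unscoped diff can be shown with set arithmetic. This is a Python sketch of the idea, not the diffRelationships() implementation; it identifies a relationship by (type, target_id), which is an assumption, and `diff_for_connector` is a hypothetical name.

```python
def diff_for_connector(existing, incoming, connector_id):
    """Return (added, removed) relationship keys, considering only the
    relationships that `connector_id` previously contributed."""
    def key(rel):
        return (rel["type"], rel["target_id"])

    before = {key(r) for r in existing if r.get("source_connector_id") == connector_id}
    after = {key(r) for r in incoming}
    return sorted(after - before), sorted(before - after)

existing = [
    {"type": "AUTHENTICATES_AS", "target_id": "t1", "source_connector_id": "entra-servicenow"},
    {"type": "RUNS_AS", "target_id": "t2", "source_connector_id": "azure-foundry"},
]
incoming = [
    {"type": "RUNS_AS", "target_id": "t2"},
    {"type": "HAS_ROLE", "target_id": "t3"},
]

added, removed = diff_for_connector(existing, incoming, "azure-foundry")
```

Without the connector filter, the AUTHENTICATES_AS edge from entra-servicenow would be reported as removed on every azure-foundry sync.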

Version history (sync-ingestion.ts:89): Entity versions should store the full merged relationship set (snapshot of the entity at that point in time), not just the current connector's contribution. This is correct behavior — the version represents the entity's complete state, which includes all connector contributions.

// insertEntityVersion stores the entity as-is after the atomic update.
// The merged relationships from all connectors are the true state.
// No change needed here — the atomic update already produces the correct merged state.

A4. Path Materializer Impact

The path materializer (path-materializer.ts:101) reads entity.relationships and filters by direction !== "inbound". It currently assumes all relationships are from a single connector. With partitioned relationships:

  • The materializer sees the union of all connector contributions (correct — this is the desired behavior)
  • The source_connector_id field is ignored by the materializer (it doesn't need to know provenance)
  • No materializer changes are needed

Files changed:

  • src/domain/entities/types.ts — add connector_owners: string[] to EntityDoc, add source_connector_id to EntityRelationship
  • src/storage/mongo/adapters/entity-adapter.ts — atomic pipeline update for relationships, $addToSet for connector_owners
  • src/ingestion/graph-transformer.ts — stamp source_connector_id on every relationship
  • src/ingestion/diff-engine.ts — filter by source_connector_id in diffRelationships(), multi-owner deletion logic
  • src/workers/handlers/sync-ingestion.ts — no changes to version insertion (stores full merged state)
  • src/ingestion/types.ts — add sourceConnectorId to NormalizedEdge
  • Migration script — backfill connector_owners: [connector_id] and source_connector_id on existing relationships

Estimated effort: 3-4 days (including migration, diff engine changes, and testing)

Impact: Enables all subsequent phases. Without this, cross-connector paths are unreliable and the ownership model breaks under multi-connector scenarios.

Phase B: Connector-Declared Correlation Keys

Inspiration: Veza's identities[] array pattern.

Review finding addressed (High #4): The v1 draft used endpoint_host as the correlation key type. Host alone is too coarse — a single host (e.g., prod-28.eastus.logic.azure.com) can serve multiple Logic App workflows, each with a different path and trigger signature (Microsoft Logic Apps HTTP endpoint docs). The callback URL identity is host + path + trigger, not just host. Correlation keys are revised to support multi-part compound keys with explicit confidence modeling.

Add a correlationKeys field to NormalizedNode:

export interface NormalizedNode {
  // ... existing fields
  correlationKeys?: CorrelationKey[];
}

export interface CorrelationKey {
  /** The type of correlation key */
  keyType: "endpoint_uri" | "endpoint_host" | "entra_principal_id" | "entra_app_id" |
    "arm_resource_id" | "oauth_client_id" | "custom";

  /** The correlation value (exact match, normalized) */
  value: string;

  /** How precise this key is for matching (affects whether we auto-link or flag for review) */
  specificity: "exact" | "host_only";

  /** Optional: restrict matching to specific target node types */
  targetNodeTypes?: string[];
}

Key type specificity:

| Key Type | Value Format | Specificity | Example |
|---|---|---|---|
| endpoint_uri | host + path (no query params/signature) | exact | prod-28.eastus.logic.azure.com/workflows/abc123/triggers/manual/invoke |
| endpoint_host | hostname only | host_only | prod-28.eastus.logic.azure.com |
| arm_resource_id | full ARM resource ID, lowercased | exact | /subscriptions/.../microsoft.logic/workflows/... |
| entra_principal_id | GUID | exact | abc-123-def-456 |
| oauth_client_id | GUID or string | exact | abc-123-def-456 |

Matching behavior by specificity:

  • exact matches → create linking edge automatically
  • host_only matches → create linking edge only if there's exactly one candidate per host, otherwise flag as ambiguous for operator review

Connector-side changes:

azure-foundry connection nodes would emit:

from urllib.parse import urlparse

# Strip query params (contains SAS signature) but keep host + path.
# Lowercase the path so the emitted value is already normalized for exact matching.
parsed = urlparse(conn.endpoint)
endpoint_uri = f"{parsed.hostname}{parsed.path.lower()}"

# Fragment of the emitted node dict:
"correlationKeys": [
    {"keyType": "endpoint_uri", "value": endpoint_uri, "specificity": "exact"},
    {"keyType": "endpoint_host", "value": parsed.hostname, "specificity": "host_only"},
]

entra-servicenow REST Message nodes would emit:

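from urllib.parse import urlparse

# Same normalization as azure-foundry: host + lowercased path, no query string.
parsed = urlparse(endpoint_url)
endpoint_uri = f"{parsed.hostname}{parsed.path.lower()}"

# Fragment of the emitted node dict:
"correlationKeys": [
    {"keyType": "endpoint_uri", "value": endpoint_uri, "specificity": "exact"},
    {"keyType": "endpoint_host", "value": parsed.hostname, "specificity": "host_only"},
]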

URI normalization: Both connectors strip query parameters (which contain signatures, API keys, etc.) and retain host + path. This ensures the Logic App callback URL matches across connectors even if SAS signatures differ. Path comparison is case-insensitive (Azure resource names are case-insensitive per ARM naming rules).
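A shared normalization helper would keep both connectors in agreement. The following is a minimal sketch using only the standard library; the function name `endpoint_correlation_uri` is hypothetical, the example URLs are made up, and dropping the port and any trailing slash are additional assumptions beyond what the text specifies.

```python
from urllib.parse import urlsplit

def endpoint_correlation_uri(url: str) -> str:
    """Return lowercased host + path, dropping scheme, port, query, and fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    path = parts.path.rstrip("/").lower()
    return f"{host}{path}"

# Same workflow reached with different casing, port, and SAS signature.
a = endpoint_correlation_uri(
    "https://Prod-28.eastus.logic.azure.com:443/workflows/ABC123/triggers/manual/paths/invoke"
    "?api-version=2016-10-01&sig=first"
)
b = endpoint_correlation_uri(
    "https://prod-28.eastus.logic.azure.com/workflows/abc123/triggers/manual/paths/invoke?sig=second"
)
```

Placing this helper in the shared connector library, next to the node ID generators, would avoid the two transformers drifting apart.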

Files changed:

  • src/ingestion/types.ts — add correlationKeys to NormalizedNode
  • Both connector transformers — emit correlation keys with endpoint_uri (not just endpoint_host)
  • Standardize endpoint property to endpoint across both connectors

Estimated effort: 1 day

Phase C: Platform Correlator

New file: src/ingestion/entity-correlator.ts

After all entities are upserted (but before path materialization), the correlator scans for matchable entities:

interface CorrelationResult {
  sourceEntityId: string;
  targetEntityId: string;
  matchedKeyType: string;
  matchedValue: string;
  edgeType: string; // e.g., "TARGETS_SAME_SERVICE"
  specificity: "exact" | "host_only";
  autoLinked: boolean; // true if exact, false if flagged for review
}

async function correlateEntities(
  tenantId: string,
  syncId: string,
  storage: StorageAdapter,
): Promise<CorrelationResult[]> {
  // 1. Fetch entities with correlationKeys touched by this sync
  //    (incremental: only re-correlate entities whose keys changed)
  // 2. Group by (keyType, value)
  // 3. For each group with >1 entity from different connectors:
  //    a. If specificity === "exact" → auto-create linking edge
  //    b. If specificity === "host_only" AND exactly 1 candidate → auto-create
  //    c. If specificity === "host_only" AND multiple candidates → flag as ambiguous
  // 4. Return results for logging/observability
}
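Steps 2 and 3 of the outline can be sketched in memory. This is a Python illustration under stated assumptions, not the planned TypeScript correlator: entities are plain dicts carrying `id` and `connector_id`, storage access and incremental fetching are omitted, and "exactly 1 candidate" is interpreted as exactly one cross-connector pair sharing the host. The name `correlate` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def correlate(entities: list[dict]) -> list[dict]:
    # Step 2: group entities by (keyType, value, specificity).
    groups = defaultdict(list)
    for entity in entities:
        for key in entity.get("correlationKeys", []):
            groups[(key["keyType"], key["value"], key["specificity"])].append(entity)

    results = []
    for (key_type, value, specificity), members in groups.items():
        # Step 3: only pair entities contributed by different connectors.
        pairs = [
            (a, b) for a, b in combinations(members, 2)
            if a["connector_id"] != b["connector_id"]
        ]
        # Exact keys always auto-link; host_only keys only when unambiguous.
        auto = specificity == "exact" or len(pairs) == 1
        for a, b in pairs:
            results.append({
                "sourceEntityId": a["id"], "targetEntityId": b["id"],
                "matchedKeyType": key_type, "matchedValue": value,
                "edgeType": "TARGETS_SAME_SERVICE",
                "specificity": specificity, "autoLinked": auto,
            })
    return results

def keys(uri: str, host: str) -> list[dict]:
    return [
        {"keyType": "endpoint_uri", "value": uri, "specificity": "exact"},
        {"keyType": "endpoint_host", "value": host, "specificity": "host_only"},
    ]

entities = [
    {"id": "f1", "connector_id": "azure-foundry", "correlationKeys": keys("h/workflows/a", "h")},
    {"id": "s1", "connector_id": "entra-servicenow", "correlationKeys": keys("h/workflows/a", "h")},
    {"id": "s2", "connector_id": "entra-servicenow", "correlationKeys": keys("h/workflows/b", "h")},
]
results = correlate(entities)
```

In this example the exact URI match links f1 and s1 automatically, while the host-only key sees two ServiceNow candidates for the same host and flags both pairs for review rather than linking them.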

Correlation rules:

| Key Type | Source Entity | Target Entity | Edge Created | Behavior |
|---|---|---|---|---|
| endpoint_uri (exact) | connection (azure_foundry) | connection (servicenow) | TARGETS_SAME_SERVICE | Auto-link (host+path match) |
| endpoint_host (host_only) | connection (any) | connection (any) | TARGETS_SAME_SERVICE | Auto-link only if 1 candidate; flag ambiguous if multiple |
| arm_resource_id (exact) | resource (azure_foundry) | resource (entra_servicenow) | Already merged if same node_ids | Extend node_ids.py |
| entra_principal_id | identity (azure_foundry) | identity (entra_servicenow) | Already merged via shared node ID | N/A (already works) |
| oauth_client_id | credential (servicenow) | identity (entra_id) | Already handled within entra-servicenow | N/A |

Pipeline integration:

sync_ingestion steps:
1. Transform (existing)
2. Upsert entities (existing, with merge fix from Phase A)
3. ▶ Correlate entities (NEW — Phase C)
4. Compute execution paths (existing)
5. Materialize authority paths (existing)
6. Evaluate findings (existing)

Estimated effort: 2-3 days

Phase D: Extend Shared Node ID Library

Review finding addressed (Medium #6): The v1 draft used replace("/", "-")[:80] truncation for ARM node IDs. This has two problems: (1) Two different ARM resources with identical first 80 chars of their sanitized resource ID would collide. (2) Azure resource names are case-insensitive per ARM naming guidance — the same resource can appear as /subscriptions/.../Microsoft.Logic/workflows/... or /subscriptions/.../microsoft.logic/workflows/.... The revised approach uses content-addressed hashing (SHA-256) with a human-readable suffix, and normalizes case.

File: sv0-connectors/shared/sv0_azure/sv0_azure/node_ids.py

Add shared generators for ARM resources so both connectors produce identical node IDs:

import hashlib

def arm_resource_node_id(resource_id: str) -> str:
    """Canonical node ID for an ARM resource, shared across connectors.

    Uses SHA-256 hash of the lowercased resource ID to avoid:
    - Truncation collisions (different resources sharing a prefix)
    - Case-sensitivity issues (Azure names are case-insensitive)
    """
    normalized = resource_id.lower()
    h = hashlib.sha256(normalized.encode()).hexdigest()[:16]
    # Human-readable suffix: last path segment, lowercased, sanitized
    last_segment = normalized.rstrip("/").rsplit("/", 1)[-1].replace(" ", "-")[:30]
    return f"arm-resource-{h}-{last_segment}"

def arm_role_node_id(role_definition_id: str, scope: str) -> str:
    """Canonical node ID for an ARM role assignment, shared across connectors.

    Hash-based to avoid truncation collisions on long scope paths.
    """
    normalized_scope = scope.lower()
    normalized_role = role_definition_id.lower()
    role_guid = normalized_role.split("/")[-1]
    h = hashlib.sha256(f"{role_guid}:{normalized_scope}".encode()).hexdigest()[:16]
    return f"arm-role-{role_guid[:8]}-{h}"

Case normalization: All ARM resource IDs are lowercased before hashing. This ensures that /subscriptions/ABC/Microsoft.Logic/workflows/MyApp and /subscriptions/abc/microsoft.logic/workflows/myapp produce the same node ID.

Collision resistance: The SHA-256 digest is truncated to 16 hex chars (64 bits). At that width, the collision probability is negligible for any realistic entity count.
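The "negligible" claim can be quantified with the standard birthday-bound approximation, p ≈ n(n-1) / 2^(b+1) for n IDs and a b-bit hash. A short Python sketch; the entity counts are illustrative, not measurements from any tenant, and the approximation is only meaningful while p is small.

```python
def collision_probability(entity_count: int, hash_bits: int = 64) -> float:
    """Birthday-bound approximation of the chance that any two of
    `entity_count` uniformly random `hash_bits`-bit values collide."""
    return entity_count * (entity_count - 1) / 2 ** (hash_bits + 1)

p_million = collision_probability(1_000_000)      # roughly 2.7e-8
p_billion = collision_probability(1_000_000_000)  # roughly 0.027
```

At a million ARM resources the 64-bit prefix is comfortably safe; the probability only becomes noticeable around a billion IDs, where a longer prefix would be the simple remedy.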

Migration note: Existing azure-foundry connector uses _resource_node_id_from_assignment() and _role_node_id() with the old truncation pattern. These must be updated to use the shared library, which will change existing node IDs. A migration step is needed to update entity references.

Both connectors import these instead of defining local versions. This extends the existing pattern from sp_node_id / owner_node_id.

Estimated effort: 1 day (including migration of existing node IDs)

Phase E: Connection-to-Resource Path Bridging

For the specific pattern where a connection's endpoint URL identifies a known resource, the path materializer must be able to traverse from the connection to the target resource.

Review finding addressed (High #5): The v1 draft recommended Option E1 (synthetic RBAC chain nodes) over E2 (materializer extension). Synthetic role/permission nodes carry fabricated RBAC semantics — the findings evaluator (evaluator/rules/) inspects role names, normalized actions, and permission scopes to compute findings. A synthetic "implied-role" node with no real RBAC backing would be processed by every evaluator rule, potentially producing false findings (e.g., excessive_permissions on a fabricated role). Evidence packs would include synthetic entities with no real-world counterpart, undermining the platform's deterministic, source-of-truth model. E2 is the correct approach — it preserves entity purity and avoids semantic pollution.

Recommended approach: Option E2 — Materializer extension

Extend the path materializer to follow a new traversal pattern:

workload → INVOKES → connection → CONNECTS_TO → resource

Where:

  • INVOKES already exists (edge_resolver creates it for agent → connection)
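  • CONNECTS_TO is a new edge type that the correlator (Phase C) would create when a connection's endpoint URI matches a known resource. The Phase C rules table currently lists only TARGETS_SAME_SERVICE (connection to connection), so a connection-to-resource rule must be added there for this edge to exist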
  • The connection node acts as a transparent hop (like existing CALLS, INVOKES, USES forwarding edges)

Implementation:

// In path-materializer.ts — add CONNECTS_TO to the forwarding edge set:
const FORWARDING_EDGE_TYPES = new Set([
  "CALLS", "INVOKES", "USES",
  "AUTHENTICATES_AS", "AUTHENTICATES_VIA",
  "CONNECTS_TO", // NEW: connection endpoint correlation
]);

The materializer already follows forwarding edges transparently (path-materializer.ts:197-221). Adding CONNECTS_TO to the set means the materializer will:

  1. Follow workload → INVOKES → connection (already works)
  2. Follow connection → CONNECTS_TO → resource (new, via forwarding)
  3. The target resource becomes a reachable destination in the execution path

What this does NOT do: It does not create a fake HAS_ROLE → GRANTS → APPLIES_TO chain. The authority path's via_roles will be empty and actions will be empty for this hop. This is semantically correct — the agent reaches the resource via a connection credential (SAS token), not via an RBAC role assignment. The path accurately represents the real-world access mechanism.

Evaluator implications: Evaluator rules that require via_roles.length > 0 or check actions will correctly treat this path differently from RBAC-based paths. This is desirable — a SAS-token-based path has different security characteristics than an RBAC-based path.
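The shape of a connection-based path can be sketched as follows. This Python sketch is not the materializer's code; the entity shapes, the `follow_forwarding` helper, and the path record fields other than via_roles and actions are hypothetical.

```python
FORWARDING_EDGE_TYPES = {
    "CALLS", "INVOKES", "USES",
    "AUTHENTICATES_AS", "AUTHENTICATES_VIA",
    "CONNECTS_TO",
}

def follow_forwarding(start: str, entities: dict[str, dict]) -> list[dict]:
    """Follow forwarding edges from `start` and return one path record
    for each entity of type "resource" that is reached."""
    paths, stack, seen = [], [(start, [start])], set()
    while stack:
        current, hops = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for rel in entities[current]["relationships"]:
            if rel["type"] not in FORWARDING_EDGE_TYPES:
                continue
            target = rel["target_id"]
            next_hops = hops + [target]
            if entities[target]["type"] == "resource":
                # No RBAC chain was traversed, so roles and actions stay empty.
                paths.append({"hops": next_hops, "via_roles": [], "actions": []})
            else:
                stack.append((target, next_hops))
    return paths

entities = {
    "agent": {"type": "workload", "relationships": [{"type": "INVOKES", "target_id": "conn"}]},
    "conn": {"type": "connection", "relationships": [{"type": "CONNECTS_TO", "target_id": "logic-app"}]},
    "logic-app": {"type": "resource", "relationships": []},
}
paths = follow_forwarding("agent", entities)
```

A downstream rule can distinguish these paths by checking for an empty via_roles list, which is the hook the evaluator implications above rely on.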

Pro: No synthetic entities. No fabricated RBAC semantics. Findings and evidence remain grounded in real source data. Single-line change to the materializer forwarding set.

Con: Authority paths via connections will have empty via_roles and actions, so downstream consumers (UI, evaluator) must handle this case. New evaluator rules may be needed for connection-based paths.

Estimated effort: 1 day (the materializer change is small; testing and evaluator rule review are the bulk)


5. Implementation Priority

Review finding addressed (Medium #7): The v1 estimate of 6-7 days was understated. Phase A alone is now 3-4 days because multi-connector ownership is cross-cutting: it touches entity types, the storage adapter, the diff engine, and the graph transformer, and it requires a data migration. Phase D requires migration of existing node IDs. The total includes testing, migration scripts, and integration tests across the full ingestion pipeline.

| Phase | Description | Effort | Dependency |
| --- | --- | --- | --- |
| 0 | Update architectural documentation (source of truth) | 2 days | None — must come first |
| A | Multi-connector entity ownership + relationship partitioning | 3-4 days | Phase 0 |
| B | Connector-declared correlation keys (with URI specificity) | 1 day | Phase 0 |
| C | Platform correlator (incremental, per-sync) | 2-3 days | A, B |
| D | Extend shared node ID library (hash-based, case-normalized) | 1 day | Phase 0 |
| E | Connection-to-resource bridging (materializer extension) | 1 day | C |

Total estimated effort: 10-12 days

Phase 0 must come first — architectural docs are the source of truth and must reflect the new ingestion model before code changes begin.

Phase 0: Architectural Documentation Updates

The following canonical docs must be updated to reflect cross-connector correlation before any code is written:

| Document | Section to Update | Change |
| --- | --- | --- |
| docs/architecture/01-data-model.md | Entity schema, Relationship model | Add source_connector_id on relationships. Add correlationKeys on entities. Document additive merge semantics (replaces current implicit last-writer-wins). |
| docs/architecture/01-data-model.md | Entity types | Document correlation key types per entity type (connection: endpoint_host; identity: entra_principal_id; resource: arm_resource_id). |
| docs/architecture/02-processing-pipeline.md | Pipeline steps | Add "Entity Correlation" step between "Upsert Entities" and "Compute Execution Paths". Document the correlator's inputs, outputs, and failure modes. |
| docs/architecture/00-overview.md | System design | Add cross-connector correlation to the architecture overview. Explain how multiple connectors contribute to a unified authorization graph. |
| docs/architecture/05-connectors.md | Connector interface | Add correlationKeys to NormalizedNode schema. Add sourceConnectorId to NormalizedEdge schema. Document the contract: connectors declare correlation keys, platform executes matching. |
| docs/architecture/03-database.md | Entity collection schema | Add source_connector_id field on embedded relationships. Add correlation_keys field on entity documents. Document index requirements for correlation queries. |

Why docs first: These documents are referenced by CLAUDE.md as the authoritative source of truth for the platform's data model and pipeline. AI agents and human developers read them before making changes. If code is written before docs are updated, the docs become stale and misleading — a much harder problem to fix retroactively.

Estimated effort: 2 days

Deliverable: Updated architectural docs with the correlation model documented as the design intent, ready for code implementation.


6. What This Enables

Before (current state)

| Scenario | Authority Path Reconstructed? |
| --- | --- |
| Foundry agent → ARM-scoped Azure resources | Yes (via ARM role assignments) |
| Foundry agent → Logic App (SAS auth) | No |
| Foundry agent → ServiceNow (via Logic App) | No |
| ServiceNow workload → Entra SP (OAuth) | Yes (within entra-servicenow connector) |
| Cross-connector shared identity (same SP) | Partially — last-writer-wins overwrites relationships |

After (with Phases A-E)

| Scenario | Authority Path Reconstructed? | Mechanism |
| --- | --- | --- |
| Foundry agent → ARM-scoped Azure resources | Yes (unchanged) | ARM role assignments via HAS_ROLE → GRANTS → APPLIES_TO |
| Foundry agent → Logic App (SAS auth) | Yes | Endpoint URI correlation (CONNECTS_TO) + materializer forwarding |
| Foundry agent → ServiceNow (via Logic App) | Yes | Cross-connector CONNECTS_TO stitching |
| ServiceNow workload → Entra SP (OAuth) | Yes (unchanged) | Within entra-servicenow connector via AUTHENTICATES_TO |
| Cross-connector shared identity (same SP) | Yes | Connector-partitioned relationships (atomic merge, multi-owner) |

Note: Authority paths via CONNECTS_TO will have empty via_roles and actions (no RBAC chain). This is semantically correct — the access is credential-based (SAS token), not role-based. Evaluator rules and UI must handle this distinction.
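To make the endpoint URI correlation mechanism more concrete, here is a hedged sketch of how a connector might derive its keys from a connection endpoint. The `CorrelationKey` shape, the `endpointCorrelationKeys` helper, and the sample URL (including its path and `sig` parameter) are assumptions for illustration; only the key names `endpoint_uri` and `endpoint_host` and the specificity labels `exact` and `host_only` come from this document.

```typescript
import { URL } from "node:url";

// Hypothetical key shape; the connector schema in this proposal may differ.
type CorrelationKey = {
  type: "endpoint_uri" | "endpoint_host";
  value: string;
  specificity: "exact" | "host_only";
};

// Derive correlation keys from an endpoint URL. The query string is dropped
// so that SAS signatures never end up in a correlation key.
function endpointCorrelationKeys(endpoint: string): CorrelationKey[] {
  const url = new URL(endpoint);
  // URL parsing already lowercases http(s) hosts; the explicit call documents intent.
  const host = url.host.toLowerCase();
  const keys: CorrelationKey[] = [];
  if (url.pathname !== "/") {
    keys.push({ type: "endpoint_uri", value: host + url.pathname, specificity: "exact" });
  }
  keys.push({ type: "endpoint_host", value: host, specificity: "host_only" });
  return keys;
}

// Made-up endpoint for illustration.
const keys = endpointCorrelationKeys(
  "https://PROD-28.eastus.logic.azure.com/workflows/abc123?sig=secret",
);
console.log(keys);
```

Whether the path component should also be case-normalized depends on the target service, so the sketch leaves it untouched.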


7. Open Questions

  1. Should entity ownership become connector_owners[] before any correlation work? (From reviewer.) Answer: Yes. This is now Phase A and is a prerequisite for all other phases. The multi-owner model is required for correct deletion semantics when multiple connectors contribute to the same entity.

  2. Should correlation operate incrementally per changed keys instead of tenant-wide scans each sync? (From reviewer.) Recommendation: Yes, incremental. The correlator should only re-correlate entities whose correlationKeys changed in the current sync, not scan all entities tenant-wide. This bounds the cost to O(changed entities) rather than O(all entities). Phase C is updated to reflect this.

  3. Should we preserve strict RBAC semantics (E2) or allow synthetic inferred privilege objects (E1)? (From reviewer.) Answer: E2 (materializer extension). Synthetic nodes create semantic drift in findings and evidence. Phase E is revised to recommend E2.

  4. How to handle endpoint_host ambiguity? When multiple resources share a hostname (e.g., API gateway, load balancer), host_only specificity should only auto-link if there's exactly one candidate. Multiple candidates are flagged as ambiguous for operator review.

  5. What other correlation keys exist beyond endpoint URLs? ServiceNow sys_id for users could correlate to Entra objectId if the ServiceNow instance uses SAML/OIDC with Azure AD. Should we pre-define these or discover them?

  6. Should source_connector_id on relationships use connector ID or sync ID? Recommendation: connector ID. Sync ID is too granular — each sync from the same connector would create a new partition, and the "remove old connector relationships" step would need to track the latest sync ID per connector. Connector ID is simpler and matches the ownership model.

  7. Migration path for existing data: Changing node ID generation (Phase D) and adding connector_owners / source_connector_id (Phase A) requires a data migration for existing entities. Should this be a script or an online migration during the next sync?
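Questions 2 and 4 combine naturally, and the sketch below shows one possible shape for that combination: only entities changed in the current sync are visited, and a CONNECTS_TO link is created only when exactly one candidate matches. All type names, the `correlateChanged` helper, the in-memory `resourceIndex`, and the sample IDs are hypothetical. For brevity the exactly-one rule is applied to every key here, not only to `host_only` keys.

```typescript
// Hypothetical shapes; keys are flattened to plain strings for brevity.
type Entity = { _id: string; correlationKeys: string[] };
type Link = { from: string; type: "CONNECTS_TO"; to: string };
type Ambiguity = { entityId: string; key: string; candidates: string[] };
type CorrelationResult = { links: Link[]; ambiguous: Ambiguity[] };

// `resourceIndex` maps a correlation key to the resource IDs that declare it.
// Cost is proportional to the number of changed entities, not to the tenant size.
function correlateChanged(
  changed: Entity[],
  resourceIndex: Map<string, string[]>,
): CorrelationResult {
  const result: CorrelationResult = { links: [], ambiguous: [] };
  for (const entity of changed) {
    for (const key of entity.correlationKeys) {
      const candidates = (resourceIndex.get(key) ?? []).filter((id) => id !== entity._id);
      const [first, ...rest] = candidates;
      if (first === undefined) continue; // no match for this key
      if (rest.length === 0) {
        result.links.push({ from: entity._id, type: "CONNECTS_TO", to: first });
      } else {
        // More than one candidate: flag for operator review, do not auto-link.
        result.ambiguous.push({ entityId: entity._id, key, candidates });
      }
    }
  }
  return result;
}

// Illustrative data only.
const resourceIndex = new Map<string, string[]>([
  ["prod-28.eastus.logic.azure.com/workflows/abc123", ["resource:logic-app"]],
  ["gateway.example.com", ["resource:api-a", "resource:api-b"]],
]);

const changed: Entity[] = [
  { _id: "connection:logic-app", correlationKeys: ["prod-28.eastus.logic.azure.com/workflows/abc123"] },
  { _id: "connection:gateway", correlationKeys: ["gateway.example.com"] },
];

const outcome = correlateChanged(changed, resourceIndex);
console.log(outcome);
```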

Resolved Questions (from review)

| Question | Resolution |
| --- | --- |
| Can we use read-merge-write for relationship merge? | No. Race-prone under concurrent syncs. Use MongoDB atomic pipeline updates ($filter + $concatArrays). |
| Is endpoint_host sufficient for correlation? | No. Host alone is too coarse. Use endpoint_uri (host+path) with exact specificity; fall back to endpoint_host with host_only specificity. |
| Should we create synthetic RBAC nodes (E1)? | No. Fabricated role/permission nodes cause semantic drift in findings and evidence. Use materializer extension (E2) instead. |
| Is the ARM node ID replace + [:80] pattern safe? | No. Truncation collisions and case-sensitivity issues. Use SHA-256 hash with lowercased input. |
| Is 6-7 days a realistic effort estimate? | No. Cross-cutting type/storage/traversal/index/test changes require 10-12 days. |
| Is the industry consensus "exact matching only"? | Overstated. SailPoint supports programmatic Cloud Rules (arbitrary logic). Wiz does query-time path discovery. Pattern is "predominantly exact" with escape hatches. |
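For the first row, an atomic pipeline update could look roughly like the sketch below. It only builds the update pipeline; the `relationships` array name, the `Relationship` shape, and the `buildRelationshipMergePipeline` helper are assumptions, while `$set`, `$filter`, `$concatArrays`, `$ifNull`, and `$literal` are standard MongoDB aggregation operators.

```typescript
// Hypothetical embedded relationship shape; only source_connector_id is from this document.
type Relationship = { type: string; target_id: string; source_connector_id: string };

// Build an update pipeline that replaces one connector's partition of the
// embedded relationships array in a single atomic document update:
// keep every relationship owned by other connectors, then append the incoming set.
function buildRelationshipMergePipeline(
  connectorId: string,
  incoming: Relationship[],
): object[] {
  return [
    {
      $set: {
        relationships: {
          $concatArrays: [
            {
              $filter: {
                // Treat a missing array as empty so $concatArrays does not yield null.
                input: { $ifNull: ["$relationships", []] },
                as: "rel",
                cond: { $ne: ["$$rel.source_connector_id", { $literal: connectorId }] },
              },
            },
            // $literal keeps values that start with "$" from being read as field paths.
            { $literal: incoming },
          ],
        },
      },
    },
  ];
}

// Illustrative values only.
const pipeline = buildRelationshipMergePipeline("azure-foundry", [
  { type: "INVOKES", target_id: "connection:logic-app", source_connector_id: "azure-foundry" },
]);
console.log(JSON.stringify(pipeline, null, 2));
```

A pipeline like this would typically be passed as the update argument of a single-document update call, so no separate read is needed before the write.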

8. References

Internal

  • sv0-connectors/shared/sv0_azure/sv0_azure/node_ids.py — Existing shared node ID generators
  • sv0-platform/src/ingestion/path-materializer.ts — Path traversal logic (source-system-agnostic)
  • sv0-platform/src/ingestion/graph-transformer.ts — buildStableEntityId() function
  • sv0-platform/src/storage/mongo/adapters/entity-adapter.ts — Entity upsert (last-writer-wins)
  • sv0-platform/src/ingestion/diff-engine.ts — Cross-connector deletion protection
  • docs/product/scenario-setup/foundry-logic-app-servicenow.md — ProvisionUser Agent scenario
  • docs/product/notion-synced/foundry-agent-llm-azure-app-logic-servicenow.md — Scenario implementation details

Industry


Appendix A: Review Findings Traceability

All 8 review findings from the v1 draft review have been addressed in this v2 revision.

| # | Severity | Finding | Resolution | Section |
| --- | --- | --- | --- | --- |
| 1 | Critical | Phase A additive merge breaks diff/version semantics (diffRelationships false events, impure version snapshots) | Connector-partitioned relationships with source_connector_id. Diff engine filters by connector before comparison. Versions store full merged state (correct composite snapshot). | Phase A (A2, A3) |
| 2 | Critical | Read-merge-write race under concurrent connector syncs | Replaced with MongoDB atomic pipeline update ($filter + $concatArrays). No read-merge-write. Single document lock. | Phase A (A2) |
| 3 | High | Single-writer connector_id breaks deletion logic for shared entities | Added connector_owners: string[] with $addToSet. Deletion only when ALL owning connectors mark entity absent. | Phase A (A1) |
| 4 | High | endpoint_host exact match too coarse; host maps multiple workflows | Added endpoint_uri (host+path) with exact specificity. endpoint_host demoted to host_only with ambiguity handling. | Phase B |
| 5 | High | Synthetic RBAC nodes (E1) cause semantic drift in findings/evidence | Reversed recommendation to E2 (materializer extension). CONNECTS_TO added to forwarding edge set. No synthetic entities. | Phase E |
| 6 | Medium | ARM node ID replace + [:80] truncation causes collisions; no case normalization | SHA-256 hash with lowercased input. Human-readable suffix. No truncation collisions. | Phase D |
| 7 | Medium | Effort estimate understated (6-7 days) | Revised to 10-12 days. Phase A alone is 3-4 days due to cross-cutting changes. | Section 5 |
| 8 | Medium | Industry consensus overstated; internal contradiction (SailPoint rules vs "exact only", Wiz query-time vs "ingest time only") | Section retitled "Industry Patterns" with nuance. SailPoint Cloud Rules acknowledged. Wiz query-time path discovery noted. | Section 2.3 |
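As an illustration of finding 6, the sketch below hashes a lowercased ARM resource ID and appends a human-readable suffix. The shared library referenced in this document is Python (node_ids.py); TypeScript is used here only to match the other snippet in this document. The `armNodeId` helper, the `arm:` prefix, the ID layout, and the sample resource IDs are all hypothetical.

```typescript
import { createHash } from "node:crypto";

// Build a node ID from an ARM resource ID:
// - lowercase first, so IDs that differ only in case map to the same node
// - use the full SHA-256 digest, so nothing is truncated
// - append the last path segment as a human-readable suffix
function armNodeId(armResourceId: string): string {
  const normalized = armResourceId.trim().toLowerCase();
  const digest = createHash("sha256").update(normalized, "utf8").digest("hex");
  const segments = normalized.split("/").filter((segment) => segment.length > 0);
  const suffix = segments[segments.length - 1] ?? "unknown";
  return `arm:${digest}:${suffix}`;
}

// Made-up resource IDs that differ only in letter case.
const idA = armNodeId(
  "/subscriptions/0000/resourceGroups/RG-Prod/providers/Microsoft.Logic/workflows/ProvisionUser",
);
const idB = armNodeId(
  "/SUBSCRIPTIONS/0000/resourcegroups/rg-prod/providers/microsoft.logic/workflows/provisionuser",
);
const idC = armNodeId(
  "/subscriptions/0000/resourceGroups/RG-Prod/providers/Microsoft.Logic/workflows/OtherFlow",
);
console.log(idA === idB, idA === idC);
```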

Reviewer Open Questions Addressed

| Question | Answer |
| --- | --- |
| Should entity ownership become connector_owners[] before correlation work? | Yes. Prerequisite. Phase A. |
| E1 or E2 for path bridging? | E2. Materializer extension preserves entity purity. |
| Incremental or tenant-wide correlation? | Incremental. Per-sync, only re-correlate changed keys. |
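The multi-owner deletion rule behind the first answer can be sketched as a pure function. The `connector_owners` field is from this document; the `OwnedEntity` and `AbsenceOutcome` types, the `onEntityAbsent` helper, and the sample entity are assumptions, and a real implementation would apply the change in the storage adapter instead of in memory.

```typescript
// Hypothetical entity shape; only connector_owners is named in this document.
type OwnedEntity = { _id: string; connector_owners: string[] };

type AbsenceOutcome =
  | { action: "delete" }
  | { action: "keep"; connector_owners: string[] };

// Called when `reportingConnector` no longer sees the entity in its sync.
// The entity is deleted only when no other connector still owns it.
function onEntityAbsent(entity: OwnedEntity, reportingConnector: string): AbsenceOutcome {
  const remaining = entity.connector_owners.filter((id) => id !== reportingConnector);
  return remaining.length === 0
    ? { action: "delete" }
    : { action: "keep", connector_owners: remaining };
}

// A service principal seen by both connectors (illustrative ID).
const sharedSp: OwnedEntity = {
  _id: "identity:shared-sp",
  connector_owners: ["entra-servicenow", "azure-foundry"],
};

const afterFoundry = onEntityAbsent(sharedSp, "azure-foundry");
const afterBoth =
  afterFoundry.action === "keep"
    ? onEntityAbsent({ _id: sharedSp._id, connector_owners: afterFoundry.connector_owners }, "entra-servicenow")
    : afterFoundry;
console.log(afterFoundry.action, afterBoth.action);
```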

Next Action

Status: adopted — implementation planned

Decision: Proceed with Phase A–E implementation. SAME_AS edges not adopted; existing AUTHENTICATES_TO handles current cross-system identity linking. New CONNECTS_TO edge to be added for endpoint-URL-based correlation (created by the correlator in Phase C, traversed by the materializer in Phase E).

Implementation tracked in:

  • Phase 0 (docs first): sv0-documentation #78 — Update 01-data-model, 02-processing-pipeline, 00-overview, 05-connectors, 03-database
  • Phase A–E (platform): sv0-platform #79 — Multi-connector ownership, correlator, shared node IDs, path bridging

No further research needed. Implementation may begin after Phase 0 docs are merged.