Cross-Connector Entity Correlation Research
Date: 2026-02-26
Status: Draft v2, revised per review findings (8 items addressed)
Scope: sv0-platform (ingestion pipeline), sv0-connectors (shared libraries, entra-servicenow, azure-foundry)
Trigger: The ProvisionUser Agent scenario revealed that authority paths cannot be fully reconstructed from live connector data when execution crosses platform boundaries (Foundry → Logic App → ServiceNow)
1. Problem Statement
SecurityV0 discovers entities and relationships through multiple connectors:
- entra-servicenow — Entra ID service principals, ServiceNow workloads, OAuth credentials, REST Messages, execution chains
- azure-foundry — Foundry AI agents, connections, managed identities, ARM role assignments
Each connector produces a NormalizedGraph that is ingested independently. The platform's path materializer builds authority paths by traversing the chain:
workload → RUNS_AS → identity → HAS_ROLE → role → GRANTS → permission → APPLIES_TO → resource
The materializer is source-system-agnostic — it follows edges by entity _id regardless of which connector created them. However, no mechanism exists to correlate entities discovered by different connectors into unified cross-platform paths.
Concrete Example: ProvisionUser Agent
The Foundry ProvisionUser Agent executes across three systems:
Foundry Agent → (SAS token) → Azure Logic App → (HTTP) → ServiceNow Incident Table
Neither hop uses ARM RBAC. The Foundry connector discovers the agent, its connection (with the Logic App endpoint URL), and the SAS credential. The entra-servicenow connector discovers the ServiceNow side. But nothing links them:
| What Exists | What's Missing |
|---|---|
| Agent → RUNS_AS → managed identity | Role/permission chain → Logic App (SAS auth bypasses ARM) |
| Identity → HAS_ROLE → ARM roles (Azure resources) | Role/permission chain → ServiceNow (external system) |
| Connection node with endpoint: "https://prod-28.eastus.logic.azure.com/..." | Correlation to Logic App resource entity |
| SN REST Message node with endpoint_url: "https://prod-28.eastus.logic.azure.com/..." | Correlation to same Logic App from ServiceNow side |
| INVOKES → connection → USES → credential | These edges are decorative — not in the authority path chain |
Result: The seed script can manually create the full path, but live connector scans cannot reconstruct it. Authority paths are limited to ARM-scoped Azure resources.
2. Industry Research
2.1 Terminology
| Term | Used By | Scope |
|---|---|---|
| Identity Correlation | SailPoint, Broadcom IGA | Matching accounts from different sources to a single identity |
| Entity Resolution | Neo4j, data engineering | Determining when different records refer to the same real-world entity |
| Graph Stitching | Veza, security graph platforms | Linking authorization graph nodes across application boundaries |
| Cross-Domain Correlation | CrowdStrike, XDR platforms | Correlating signals across endpoint, identity, cloud, and SaaS domains |
| Unified Entity Abstraction | Wiz, CSPM platforms | Normalizing provider-specific entities into common types |
Recommended terminology for SecurityV0: "Cross-source entity correlation" (for the matching step) and "graph stitching" (for the result of linking nodes across connector boundaries).
2.2 Platform-by-Platform Analysis
SailPoint IdentityNow — Account Correlation
SailPoint treats correlation as a first-class ingest-time operation. Every source has a correlation configuration — ordered attribute pairings like (identityAttribute, accountAttribute).
- Happens at ingest time during account aggregation
- Deterministic exact match only (Equals operator, case-insensitive)
- Declarative (configured attribute pairings) with programmatic escape hatch (Java "Cloud Rules")
- Fallback: accountName == identity.name
- Up to 100 accounts correlated to a single identity
- Does NOT model cross-application execution paths — only "who has access to what" per application
Reference: SailPoint Correlation Documentation, Correlation Rule Developer Guide
Veza — OAA Cross-Service Connections
Veza builds an Authorization Metadata Graph. Cross-system linking is achieved through explicit identity references in the OAA payload:
{
  "local_users": [
    {
      "name": "svc-foundry",
      "identities": ["svc-foundry@tenant.onmicrosoft.com"],
      "groups": ["admins"]
    }
  ]
}
- Each local user carries an identities[] array with stable cross-system identifiers
- Matching at ingest time against IdP entities by email, UPN, or IdP ID
- Deterministic exact match — no fuzzy matching
- Declarative — connector author explicitly provides identity references
- No built-in URL-to-resource matching — connector author must provide explicit references
Reference: Veza OAA Guide, Cross-Service Connections
Wiz — Security Graph Unified Entity Model
Wiz builds a single unified graph in Amazon Neptune normalizing all cloud entities:
| Cloud Concept | AWS | Azure | GCP | Wiz Common Model |
|---|---|---|---|---|
| Identity | IAM User/Role | AD Principal | Service Account | Identity Entity |
| Permissions | IAM Policies | RBAC Roles | IAM Roles | Normalized Permission (5 levels) |
- Entity normalization at ingest time per cloud provider
- Cross-cloud attack paths discovered at query time via graph traversal
- Deterministic — provider entities mapped through known schema translations
- Permission normalization to 5 universal levels: List, Read, Write, High Privilege, Admin
Reference: Wiz Security Graph, Attack Path Analysis, AWS Case Study — Neptune
CrowdStrike — Enterprise Graph
CrowdStrike unifies multiple specialized graphs (Threat Graph, Asset Graph, Intel Graph) into the Enterprise Graph:
- Correlation at ingest time (continuous streaming)
- Cross-domain correlation via shared identity anchors (SID, UPN, IP address, session tokens)
- Emphasis on behavioral correlation (linking events across domains)
- Agent-based (Falcon sensor) for endpoint + API-based connectors for cloud/identity
Reference: CrowdStrike Enterprise Graph, CNAPP with CIEM
Neo4j — Entity Resolution Patterns
Neo4j provides three approaches:
- Deterministic (rule-based): Exact match on specific fields. High precision, explainable, fast.
- Probabilistic (similarity-based): String similarity algorithms with weighted scores. Handles data quality issues.
- Graph-enhanced: Uses shared relationships and neighbors to improve matching confidence.
Key recommendation from Neo4j community: Do NOT merge duplicate nodes. Create linking relationships (SAME_AS) or intermediate grouping nodes that both source nodes connect to. This preserves provenance while enabling cross-source traversal.
Reference: Neo4j Entity Resolution, Entity Resolution Example
SCIM Standard
The externalId attribute is the primary cross-system correlation key:
- Issued by the provisioning client, stored by the receiving system alongside its internal ID
- Enables bidirectional lookup across system boundaries
- Best practice: use stable identifiers (objectId, not email)
- SCIM is a provisioning protocol — assumes the mapping is known, doesn't do discovery
Reference: RFC 7643 — SCIM Core Schema
2.3 Industry Patterns (not consensus — approaches vary)
Review finding addressed (Medium #8): The v1 draft overstated certainty ("all major platforms", "exact only") and contained an internal contradiction: SailPoint's Java Cloud Rules are programmatic/heuristic, contradicting the "deterministic only" claim. Wiz discovers attack paths at query time, contradicting "ingest time only". This section is revised to present patterns with nuance.
| Question | Predominant Pattern | Exceptions / Nuance |
|---|---|---|
| Ingest time or query time? | Mostly ingest time for entity linking. SailPoint, Veza, CrowdStrike all establish links during ingestion. | Wiz normalizes at ingest but discovers attack paths at query time via graph traversal. The boundary is blurry — linking is ingest-time, path discovery is query-time. |
| Fuzzy or exact matching? | Predominantly exact for identity/security graphs. SailPoint uses Equals operator; Veza matches on email/UPN/IdP ID; SCIM uses externalId. | SailPoint's Java Cloud Rules allow arbitrary programmatic logic — the connector author can implement fuzzy matching, regex, or database lookups. Neo4j community explicitly supports probabilistic matching for data quality scenarios. SecurityV0 should start with exact matching but the architecture should not preclude rule-based extensions. |
| Declarative or automatic? | Mix. SailPoint: declarative config + Java escape hatch. Veza: connector-declared identities[]. SCIM: admin-configured attribute mapping. | Wiz and CrowdStrike are more automatic — built-in normalization per cloud provider, not admin-configured. The "declarative vs automatic" axis depends on whether the platform knows the integration topology in advance. |
| Connection URL → resource matching? | No universal mechanism. All platforms handle this as connector-specific or provider-specific logic. | This falls in a gap between identity correlation (well-standardized) and infrastructure topology discovery (provider-specific). Azure Logic App callback URLs are identified by host + path + trigger + query signature, not host alone (Microsoft docs). |
3. SecurityV0 Current Infrastructure Audit
3.1 What Already Works
Shared Node ID Generators
File: sv0-connectors/shared/sv0_azure/sv0_azure/node_ids.py
def sp_node_id(principal_id: str) -> str:
    return f"entra-sp-{principal_id}"

def owner_node_id(object_id: str) -> str:
    return f"entra-user-{object_id}"
Both connectors import and use these:
- entra-servicenow: from sv0_azure.node_ids import sp_node_id, owner_node_id (transformer.py:27-28)
- azure-foundry: from sv0_azure.node_ids import sp_node_id, owner_node_id (edge_resolver.py:24)
When both connectors discover the same managed identity (principal ID abc-123), they produce:
- Same nodeId: "entra-sp-abc-123"
- Same sourceSystem: "entra_id"
- Same sourceId: "abc-123"
The platform's buildStableEntityId() hashes tenantId + sourceSystem + sourceId into the same entity _id. The upsert merges them into one entity.
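The merge-by-identity behaviour can be illustrated with a minimal Python sketch. The function below is a hypothetical stand-in for buildStableEntityId(): the hash algorithm (SHA-256) and the separator are assumptions, since the source only states that tenantId, sourceSystem, and sourceId are hashed together.

```python
import hashlib


def build_stable_entity_id(tenant_id: str, source_system: str, source_id: str) -> str:
    """Hypothetical stand-in for buildStableEntityId(); the real hash may differ."""
    # Join with a control-character separator (an assumption) so that
    # ("a", "bc") and ("ab", "c") do not produce the same hash input.
    key = "\x1f".join((tenant_id, source_system, source_id))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Both connectors emit the same (sourceSystem, sourceId) for the shared identity,
# so they resolve to the same entity _id regardless of which connector ran.
id_from_servicenow = build_stable_entity_id("tenant-1", "entra_id", "abc-123")
id_from_foundry = build_stable_entity_id("tenant-1", "entra_id", "abc-123")
id_other = build_stable_entity_id("tenant-1", "entra_id", "def-456")
```

Because the connector name is not part of the hash input, the upsert key is identical for both connectors, which is what makes the merge (and the overwrite problem in Gap 1) possible.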
Source-System-Agnostic Path Traversal
File: sv0-platform/src/ingestion/path-materializer.ts
The path materializer follows edges by entity _id regardless of sourceSystem. It traverses:
- RUNS_AS (workload → identity binding)
- HAS_ROLE → GRANTS → APPLIES_TO (permission chain)
- CALLS, INVOKES, USES, AUTHENTICATES_AS, AUTHENTICATES_VIA (forwarding)
- AUTHENTICATES_TO (cross-system auth, depth-limited to 1)
If connector A creates node X and connector B creates node Y, and an edge X→Y exists, the materializer will follow it.
Cross-Connector Deletion Protection
File: sv0-platform/src/ingestion/diff-engine.ts (line 281)
Deletion detection is scoped by connectorId. Connector A's sync cannot delete connector B's entities.
Endpoint URL Properties
Both connectors store endpoint URLs on connection nodes:
| Connector | Property Name | Example Value |
|---|---|---|
| azure-foundry | endpoint | https://prod-28.eastus.logic.azure.com/... |
| entra-servicenow | endpoint_url | https://prod-28.eastus.logic.azure.com/... |
Same URL, different property names, different nodes, no linkage. This is the unexploited correlation key.
3.2 Critical Gaps
Gap 1: Last-Writer-Wins Entity Overwrite
File: sv0-platform/src/storage/mongo/adapters/entity-adapter.ts (line 36)
The entity upsert uses $set for the full document including relationships: [...]. When two connectors emit the same entity:
- entra-servicenow ingests → SP has AUTHENTICATES_AS, OWNED_BY relationships
- azure-foundry ingests → SP overwritten with only RUNS_AS, HAS_ROLE relationships
- entra-servicenow relationships are lost until next ingestion
This is the most critical blocker. The unified graph's correctness depends on ingestion order.
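The overwrite can be reproduced with a small in-memory simulation. This is only a sketch: a plain dict stands in for the MongoDB collection, and the target IDs are made up for the example.

```python
def upsert_last_writer_wins(store: dict, entity_id: str, relationships: list) -> None:
    """Mimic a $set of the full document: the relationships array is replaced."""
    store[entity_id] = {"relationships": list(relationships)}


store: dict = {}
sp = "entra-sp-abc-123"

# entra-servicenow syncs first (target IDs are hypothetical).
upsert_last_writer_wins(store, sp, [
    {"type": "AUTHENTICATES_AS", "target_id": "sn-credential-1"},
    {"type": "OWNED_BY", "target_id": "entra-user-owner-1"},
])
# azure-foundry syncs second and replaces the whole array.
upsert_last_writer_wins(store, sp, [
    {"type": "RUNS_AS", "target_id": "foundry-agent-1"},
    {"type": "HAS_ROLE", "target_id": "arm-role-1"},
])

surviving = sorted(r["type"] for r in store[sp]["relationships"])
print(surviving)  # only the azure-foundry relationship types remain
```

Swapping the order of the two calls leaves only the entra-servicenow relationships, which is the ingestion-order dependency described in Gap 4.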
Gap 2: No Endpoint-Based Correlation
ServiceNow REST Messages and Foundry connections both contain endpoint URLs pointing to the same external services. There is no mechanism to detect this match and create bridging edges.
Gap 3: No Shared Node IDs for Non-Entra Entities
sv0_azure.node_ids only covers Entra SPs and users. No shared ID scheme for:
- Azure resources (Storage, Key Vault, Logic Apps) discovered by multiple connectors
- ARM role assignments
- Connections/credentials representing the same external service
Gap 4: Ingestion Order Dependency
Because relationships are not additively merged, the correctness of the unified graph depends on which connector ingests last. No mechanism exists to replay or recompose the merged entity from both connectors' contributions.
Gap 5: Property Name Inconsistency
The endpoint URL property is named endpoint_url in entra-servicenow and endpoint in azure-foundry. A future correlation mechanism must handle both.
4. Recommended Approach: Platform-Level Correlation Phase
Based on industry research and the existing infrastructure audit, the recommended approach is a platform-level post-ingestion correlation phase using deterministic exact matching. This aligns with patterns observed in the platforms surveyed (Veza's declarative model, SailPoint's ingest-time correlation, Neo4j's linking relationships).
Design Principles
- Deterministic only. No fuzzy matching — false positive links create incorrect authority paths.
- Ingest-time execution. Correlation runs as part of the sync ingestion pipeline, after entity upsert but before path materialization.
- Additive, not destructive. Correlation adds edges between existing entities; it never merges or deletes nodes. Following Neo4j's recommendation: preserve provenance, create linking relationships.
- Declarative correlation keys. Connectors declare what keys to correlate on — the platform executes the matching.
Phase A: Multi-Connector Entity Ownership & Relationship Partitioning (Prerequisite)
Review finding addressed (Critical #1): The v1 draft proposed a simple "additive merge" of relationships. This breaks three existing mechanisms: (1) diffRelationships() in diff-engine.ts:189 computes added/removed/modified relationships by comparing existing vs incoming; if the existing set includes another connector's relationships, they appear as "removed" in the diff, generating false events and incorrect version history. (2) insertEntityVersion() in sync-ingestion.ts:89 stores entity.relationships as the full state, so merged relationships would make version snapshots connector-impure. (3) The connector_id field on EntityDoc is singular (types.ts:56), and deletion detection scopes by this single value (diff-engine.ts:279), which is inconsistent when multiple connectors contribute to the same entity.
Review finding addressed (Critical #2): The v1 draft proposed read-merge-write (read existing → merge → write). Under concurrent connector syncs, two connectors could both read the existing entity, both compute their merge independently, and the last writer's $set would overwrite the other's merged relationships. MongoDB warns about lost updates when concurrent updates use broad filters and $set (MongoDB Atomicity Docs). The solution must use atomic operations or compound operations (MongoDB Compound Operations).
Review finding addressed (High #3): The entity ownership model is currently single-writer (connector_id: string). Correlation requires multi-connector ownership. Without this change, deletion logic will be inconsistent: if connector A's sync doesn't see a shared entity, it would mark it deleted even though connector B still reports it.
Problem: Three interrelated issues:
- Single-writer entity ownership: EntityDoc.connector_id is a single string. When two connectors emit the same entity, the second overwrites the first's connector_id, and the first connector's deletion detection can no longer find it.
- Relationship overwrite: $set replaces the entire relationships array. Cross-connector relationships are lost.
- Race condition: Concurrent syncs from different connectors can both read/merge/write, causing lost updates.
Solution: Partition entity contributions by connector using connector-scoped sub-documents and atomic MongoDB operations.
A1. Multi-Connector Entity Ownership
Replace single connector_id with connector_owners:
// EntityDoc changes:
export interface EntityDoc {
  // ... existing fields
  connector_id?: string;      // DEPRECATED: kept for migration compat, set to last writer
  connector_owners: string[]; // NEW: all connectors that have contributed to this entity
  // ...
}
Atomic update: When upserting, use $addToSet to register the connector:
// In entity-adapter.ts upsertEntity():
await this.c.entities.updateOne(
  { tenant_id, source_system, source_id },
  {
    $set: { ...connectorOwnedFields, connector_id: connectorId, updated_at: now },
    $addToSet: { connector_owners: connectorId },
    $setOnInsert: { _id, created_at },
  },
  { upsert: true }
);
Deletion detection change: diff-engine.ts:279 currently filters by connector_id. With multi-ownership:
- An entity is only eligible for deletion by connector A if connector_owners includes A
- An entity is only fully deleted when ALL owning connectors have marked it absent
- If only one connector marks it absent, the entity remains active (other connectors still report it)
// diff-engine.ts deletion detection:
// Before: filter.connector_id = connectorId
// After:  filter.connector_owners = connectorId  // equality on an array field matches any element
// Deletion only proceeds if this is the LAST remaining owner
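The multi-owner deletion rule can be sketched in Python as follows. This is an illustration under one assumption the source does not spell out: a connector that no longer sees the entity is removed from connector_owners (for example via a $pull), and deletion proceeds only when the list becomes empty. The mark_absent helper is hypothetical.

```python
def mark_absent(entity: dict, connector_id: str) -> bool:
    """Record that connector_id no longer reports the entity.

    Returns True only when no owning connector remains, i.e. the entity
    is now eligible for deletion.
    """
    owners = entity["connector_owners"]
    if connector_id not in owners:
        return False  # this connector never owned the entity; nothing to do
    owners.remove(connector_id)
    return not owners


shared_sp = {
    "_id": "entra-sp-abc-123",
    "connector_owners": ["entra-servicenow", "azure-foundry"],
}

first = mark_absent(shared_sp, "azure-foundry")      # one owner remains
second = mark_absent(shared_sp, "entra-servicenow")  # last owner is gone
```

Here the first call leaves the entity active because entra-servicenow still reports it; only the second call makes it eligible for deletion.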
A2. Connector-Partitioned Relationships
Instead of storing a flat relationships: EntityRelationship[] that gets overwritten, partition relationships by connector:
// New field on EntityRelationship:
export interface EntityRelationship {
  type: string;
  target_id: string;
  properties: Record<string, unknown>;
  source_connector_id: string; // NEW: which connector created this relationship
}
Atomic update strategy: Use a two-step atomic pipeline update (MongoDB 4.2+) to replace only the current connector's relationships while preserving others:
// In entity-adapter.ts: atomic pipeline update, no read-merge-write race
await this.c.entities.updateOne(
  { tenant_id, source_system, source_id },
  [
    // Step 1: Remove all relationships from this connector.
    // $ifNull guards against documents that have no relationships array yet.
    {
      $set: {
        relationships: {
          $filter: {
            input: { $ifNull: ["$relationships", []] },
            cond: { $ne: ["$$this.source_connector_id", connectorId] }
          }
        }
      }
    },
    // Step 2: Append this connector's new relationships.
    // $literal keeps stored values from being parsed as pipeline expressions.
    {
      $set: {
        relationships: { $concatArrays: ["$relationships", { $literal: newRelationships }] }
      }
    }
  ]
);
This is a single atomic operation — no read-merge-write race. MongoDB executes the aggregation pipeline stages sequentially within a single document lock.
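The effect of the two pipeline stages can be shown with a pure-Python equivalent. This sketch only models the resulting array (filter out this connector's entries, then append the new ones); it does not model MongoDB's atomicity, and the relationship values are hypothetical.

```python
def replace_connector_relationships(existing, connector_id, new_relationships):
    """Drop connector_id's old relationships, keep everyone else's, append the new ones."""
    kept = [
        r for r in (existing or [])  # a missing array is treated as empty
        if r.get("source_connector_id") != connector_id
    ]
    return kept + list(new_relationships)


existing = [
    {"type": "AUTHENTICATES_AS", "target_id": "sn-credential-1",
     "source_connector_id": "entra-servicenow"},
    {"type": "HAS_ROLE", "target_id": "arm-role-old",
     "source_connector_id": "azure-foundry"},
]
incoming = [
    {"type": "HAS_ROLE", "target_id": "arm-role-new",
     "source_connector_id": "azure-foundry"},
]

merged = replace_connector_relationships(existing, "azure-foundry", incoming)
```

The entra-servicenow relationship survives the azure-foundry sync, while the stale azure-foundry relationship is replaced.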
A3. Diff Engine and Version History Compatibility
Diff engine (diff-engine.ts:189): diffRelationships() must compare only relationships from the current connector:
// Before: diffRelationships(existing.relationships, incoming.relationships)
// After:
const existingForConnector = existing.relationships.filter(
  r => r.source_connector_id === connectorId
);
diffRelationships(existingForConnector, incoming.relationships);
This ensures that another connector's relationships don't appear as "removed" in the diff.
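A small Python sketch makes the difference between the unscoped and scoped diff concrete. The diff_relationships function here is a simplified, hypothetical stand-in for diffRelationships(): it compares only (type, target_id) pairs and ignores modified properties.

```python
def diff_relationships(existing, incoming):
    """Return (added, removed) as sets of (type, target_id) pairs."""
    before = {(r["type"], r["target_id"]) for r in existing}
    after = {(r["type"], r["target_id"]) for r in incoming}
    return after - before, before - after


def diff_for_connector(existing, incoming, connector_id):
    """Diff only against the relationships this connector previously wrote."""
    scoped = [r for r in existing if r.get("source_connector_id") == connector_id]
    return diff_relationships(scoped, incoming)


stored = [
    {"type": "AUTHENTICATES_AS", "target_id": "sn-credential-1",
     "source_connector_id": "entra-servicenow"},
    {"type": "HAS_ROLE", "target_id": "arm-role-1",
     "source_connector_id": "azure-foundry"},
]
# azure-foundry re-reports its single, unchanged relationship.
incoming = [{"type": "HAS_ROLE", "target_id": "arm-role-1"}]

unscoped_added, unscoped_removed = diff_relationships(stored, incoming)
scoped_added, scoped_removed = diff_for_connector(stored, incoming, "azure-foundry")
```

The unscoped diff reports the entra-servicenow relationship as removed even though nothing changed; the scoped diff reports no changes.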
Version history (sync-ingestion.ts:89): Entity versions should store the full merged relationship set (snapshot of the entity at that point in time), not just the current connector's contribution. This is correct behavior — the version represents the entity's complete state, which includes all connector contributions.
// insertEntityVersion stores the entity as-is after the atomic update.
// The merged relationships from all connectors are the true state.
// No change needed here — the atomic update already produces the correct merged state.
A4. Path Materializer Impact
The path materializer (path-materializer.ts:101) reads entity.relationships and filters by direction !== "inbound". It currently assumes all relationships are from a single connector. With partitioned relationships:
- The materializer sees the union of all connector contributions (correct — this is the desired behavior)
- The source_connector_id field is ignored by the materializer (it doesn't need to know provenance)
- No materializer changes are needed
Files changed:
- src/domain/entities/types.ts: add connector_owners: string[] to EntityDoc, add source_connector_id to EntityRelationship
- src/storage/mongo/adapters/entity-adapter.ts: atomic pipeline update for relationships, $addToSet for connector_owners
- src/ingestion/graph-transformer.ts: stamp source_connector_id on every relationship
- src/ingestion/diff-engine.ts: filter by source_connector_id in diffRelationships(), multi-owner deletion logic
- src/workers/handlers/sync-ingestion.ts: no changes to version insertion (stores full merged state)
- src/ingestion/types.ts: add sourceConnectorId to NormalizedEdge
- Migration script: backfill connector_owners: [connector_id] and source_connector_id on existing relationships
Estimated effort: 3-4 days (including migration, diff engine changes, and testing)

Impact: Enables all subsequent phases. Without this, cross-connector paths are unreliable and the ownership model breaks under multi-connector scenarios.
Phase B: Connector-Declared Correlation Keys
Inspiration: Veza's identities[] array pattern.
Review finding addressed (High #4): The v1 draft used endpoint_host as the correlation key type. Host alone is too coarse: a single host (e.g., prod-28.eastus.logic.azure.com) can serve multiple Logic App workflows, each with a different path and trigger signature (Microsoft Logic Apps HTTP endpoint docs). The callback URL identity is host + path + trigger, not just host. Correlation keys are revised to support multi-part compound keys with explicit confidence modeling.
Add a correlationKeys field to NormalizedNode:
export interface NormalizedNode {
  // ... existing fields
  correlationKeys?: CorrelationKey[];
}

export interface CorrelationKey {
  /** The type of correlation key */
  keyType: "endpoint_uri" | "endpoint_host" | "entra_principal_id" | "entra_app_id" |
           "arm_resource_id" | "oauth_client_id" | "custom";
  /** The correlation value (exact match, normalized) */
  value: string;
  /** How precise this key is for matching (affects whether we auto-link or flag for review) */
  specificity: "exact" | "host_only";
  /** Optional: restrict matching to specific target node types */
  targetNodeTypes?: string[];
}
Key type specificity:
| Key Type | Value Format | Specificity | Example |
|---|---|---|---|
| endpoint_uri | host + path (no query params/signature) | exact | prod-28.eastus.logic.azure.com/workflows/abc123/triggers/manual/invoke |
| endpoint_host | hostname only | host_only | prod-28.eastus.logic.azure.com |
| arm_resource_id | full ARM resource ID, lowercased | exact | /subscriptions/.../microsoft.logic/workflows/... |
| entra_principal_id | GUID | exact | abc-123-def-456 |
| oauth_client_id | GUID or string | exact | abc-123-def-456 |
Matching behavior by specificity:
- exact matches → create linking edge automatically
- host_only matches → create linking edge only if there's exactly one candidate per host, otherwise flag as ambiguous for operator review
Connector-side changes:
azure-foundry connection nodes would emit:
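from urllib.parse import urlparse

# Strip query params (they contain the SAS signature) but keep host + path.
# Lowercase the result so the comparison is case-insensitive.
parsed = urlparse(conn.endpoint)
endpoint_uri = f"{parsed.hostname}{parsed.path}".lower()
correlation_keys = [
    {"keyType": "endpoint_uri", "value": endpoint_uri, "specificity": "exact"},
    {"keyType": "endpoint_host", "value": parsed.hostname, "specificity": "host_only"},
]
# Emitted on the connection node as "correlationKeys": correlation_keys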
entra-servicenow REST Message nodes would emit:
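from urllib.parse import urlparse

# Same normalization as azure-foundry: host + path, lowercased, no query string
parsed = urlparse(endpoint_url)
endpoint_uri = f"{parsed.hostname}{parsed.path}".lower()
correlation_keys = [
    {"keyType": "endpoint_uri", "value": endpoint_uri, "specificity": "exact"},
    {"keyType": "endpoint_host", "value": parsed.hostname, "specificity": "host_only"},
]
# Emitted on the REST Message node as "correlationKeys": correlation_keys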
URI normalization: Both connectors strip query parameters (which contain signatures, API keys, etc.) and retain host + path. This ensures the Logic App callback URL matches across connectors even if SAS signatures differ. Path comparison is case-insensitive (Azure resource names are case-insensitive per ARM naming rules).
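As a sketch of this normalization, the helper below builds both keys from a raw URL. The function name endpoint_correlation_keys is hypothetical, and the two URLs are invented examples (the real callback paths are elided in this document); they differ in port, host and path casing, and query string.

```python
from urllib.parse import urlparse


def endpoint_correlation_keys(url: str) -> list:
    """Build endpoint_uri and endpoint_host keys from a raw endpoint URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # hostname is lowercased and excludes the port
    # Drop the query string (signatures, API keys) and lowercase host + path.
    endpoint_uri = f"{host}{parsed.path}".lower()
    return [
        {"keyType": "endpoint_uri", "value": endpoint_uri, "specificity": "exact"},
        {"keyType": "endpoint_host", "value": host, "specificity": "host_only"},
    ]


# Hypothetical URLs as each connector might record them.
foundry_url = ("https://prod-28.eastus.logic.azure.com:443/workflows/ABC123"
               "/triggers/manual/paths/invoke?api-version=2016-10-01&sig=foundry-signature")
servicenow_url = ("https://PROD-28.eastus.logic.azure.com/workflows/abc123"
                  "/triggers/manual/paths/invoke?sig=servicenow-signature")

foundry_keys = endpoint_correlation_keys(foundry_url)
servicenow_keys = endpoint_correlation_keys(servicenow_url)
```

Both URLs yield identical keys, so an exact match on endpoint_uri links the two nodes even though the raw strings differ.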
Files changed:
- src/ingestion/types.ts: add correlationKeys to NormalizedNode
- Both connector transformers: emit correlation keys with endpoint_uri (not just endpoint_host)
- Standardize endpoint property to endpoint across both connectors
Estimated effort: 1 day
Phase C: Platform Correlator
New file: src/ingestion/entity-correlator.ts
After all entities are upserted (but before path materialization), the correlator scans for matchable entities:
interface CorrelationResult {
  sourceEntityId: string;
  targetEntityId: string;
  matchedKeyType: string;
  matchedValue: string;
  edgeType: string;              // e.g., "TARGETS_SAME_SERVICE"
  specificity: "exact" | "host_only";
  autoLinked: boolean;           // true if exact, false if flagged for review
}

async function correlateEntities(
  tenantId: string,
  syncId: string,
  storage: StorageAdapter,
): Promise<CorrelationResult[]> {
  // 1. Fetch entities with correlationKeys touched by this sync
  //    (incremental: only re-correlate entities whose keys changed)
  // 2. Group by (keyType, value)
  // 3. For each group with >1 entity from different connectors:
  //    a. If specificity === "exact" → auto-create linking edge
  //    b. If specificity === "host_only" AND exactly 1 candidate → auto-create
  //    c. If specificity === "host_only" AND multiple candidates → flag as ambiguous
  // 4. Return results for logging/observability
}
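The grouping and linking steps can be sketched in Python as below. This is a simplified in-memory illustration, not the platform implementation: entities are plain dicts with hypothetical id and connector fields, the incremental fetch in step 1 is omitted, and "exactly 1 candidate" is interpreted as exactly one cross-connector pair in the group.

```python
from collections import defaultdict
from itertools import combinations


def correlate(entities: list) -> list:
    """Group entities by correlation key and link matches across connectors."""
    groups = defaultdict(list)
    for entity in entities:
        for key in entity.get("correlationKeys", []):
            groups[(key["keyType"], key["value"], key["specificity"])].append(entity)

    results = []
    for (key_type, value, specificity), members in groups.items():
        # Only pairs contributed by different connectors are candidates.
        pairs = [(a, b) for a, b in combinations(members, 2)
                 if a["connector"] != b["connector"]]
        # exact keys always auto-link; host_only keys need a single candidate.
        auto_linked = specificity == "exact" or len(pairs) == 1
        for a, b in pairs:
            results.append({
                "sourceEntityId": a["id"], "targetEntityId": b["id"],
                "matchedKeyType": key_type, "matchedValue": value,
                "specificity": specificity, "autoLinked": auto_linked,
            })
    return results


HOST = "prod-28.eastus.logic.azure.com"


def keys(path: str) -> list:
    return [
        {"keyType": "endpoint_uri", "value": HOST + path, "specificity": "exact"},
        {"keyType": "endpoint_host", "value": HOST, "specificity": "host_only"},
    ]


entities = [
    {"id": "foundry-conn-1", "connector": "azure-foundry", "correlationKeys": keys("/workflows/a/invoke")},
    {"id": "sn-rest-1", "connector": "entra-servicenow", "correlationKeys": keys("/workflows/a/invoke")},
    {"id": "sn-rest-2", "connector": "entra-servicenow", "correlationKeys": keys("/workflows/b/invoke")},
]
results = correlate(entities)
auto = [(r["sourceEntityId"], r["targetEntityId"], r["matchedKeyType"])
        for r in results if r["autoLinked"]]
ambiguous = [r for r in results if not r["autoLinked"]]
```

In this example the Foundry connection auto-links to the one REST Message with the same host + path, while the host-only match is flagged as ambiguous because two ServiceNow nodes share the host.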
Correlation rules:
| Key Type | Source Entity | Target Entity | Edge Created | Behavior |
|---|---|---|---|---|
| endpoint_uri (exact) | connection (azure_foundry) | connection (servicenow) | TARGETS_SAME_SERVICE | Auto-link (host+path match) |
| endpoint_host (host_only) | connection (any) | connection (any) | TARGETS_SAME_SERVICE | Auto-link only if 1 candidate; flag ambiguous if multiple |
| arm_resource_id (exact) | resource (azure_foundry) | resource (entra_servicenow) | Already merged if same node_ids | Extend node_ids.py |
| entra_principal_id | identity (azure_foundry) | identity (entra_servicenow) | Already merged via shared node ID | N/A (already works) |
| oauth_client_id | credential (servicenow) | identity (entra_id) | Already handled within entra-servicenow | N/A |
Pipeline integration:
sync_ingestion steps:
1. Transform (existing)
2. Upsert entities (existing, with merge fix from Phase A)
3. ▶ Correlate entities (NEW — Phase C)
4. Compute execution paths (existing)
5. Materialize authority paths (existing)
6. Evaluate findings (existing)
Estimated effort: 2-3 days
Phase D: Extend Shared Node ID Library
Review finding addressed (Medium #6): The v1 draft used replace("/", "-")[:80] truncation for ARM node IDs. This has two problems: (1) Two different ARM resources with identical first 80 chars of their sanitized resource ID would collide. (2) Azure resource names are case-insensitive per ARM naming guidance, so the same resource can appear as /subscriptions/.../Microsoft.Logic/workflows/... or /subscriptions/.../microsoft.logic/workflows/.... The revised approach uses content-addressed hashing (SHA-256) with a human-readable suffix, and normalizes case.
File: sv0-connectors/shared/sv0_azure/sv0_azure/node_ids.py
Add shared generators for ARM resources so both connectors produce identical node IDs:
import hashlib

def arm_resource_node_id(resource_id: str) -> str:
    """Canonical node ID for an ARM resource, shared across connectors.

    Uses SHA-256 hash of the lowercased resource ID to avoid:
    - Truncation collisions (different resources sharing a prefix)
    - Case-sensitivity issues (Azure names are case-insensitive)
    """
    normalized = resource_id.lower()
    h = hashlib.sha256(normalized.encode()).hexdigest()[:16]
    # Human-readable suffix: last path segment, lowercased, sanitized
    last_segment = normalized.rstrip("/").rsplit("/", 1)[-1].replace(" ", "-")[:30]
    return f"arm-resource-{h}-{last_segment}"

def arm_role_node_id(role_definition_id: str, scope: str) -> str:
    """Canonical node ID for an ARM role assignment, shared across connectors.

    Hash-based to avoid truncation collisions on long scope paths.
    """
    normalized_scope = scope.lower()
    normalized_role = role_definition_id.lower()
    role_guid = normalized_role.split("/")[-1]
    h = hashlib.sha256(f"{role_guid}:{normalized_scope}".encode()).hexdigest()[:16]
    return f"arm-role-{role_guid[:8]}-{h}"
Case normalization: All ARM resource IDs are lowercased before hashing. This ensures that /subscriptions/ABC/Microsoft.Logic/workflows/MyApp and /subscriptions/abc/microsoft.logic/workflows/myapp produce the same node ID.
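Collision resistance: The node ID keeps the first 16 hex chars (64 bits) of the SHA-256 digest. By the birthday bound, a collision only becomes likely once the number of hashed IDs is on the order of 2^32 (billions), so the probability is negligible for any realistic entity count.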
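To put a number on "negligible", the standard birthday-bound approximation can be evaluated directly. The sketch below assumes the truncated hash behaves like a uniform 64-bit value; the entity counts are illustrative only.

```python
import math


def collision_probability(entity_count: int, hash_bits: int = 64) -> float:
    """Birthday-bound approximation: p ≈ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = entity_count * (entity_count - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)


p_million = collision_probability(1_000_000)      # one million ARM resources
p_billion = collision_probability(1_000_000_000)  # one billion ARM resources
```

At one million resources the probability of any collision is a few parts in a hundred million; it only reaches the percent range at around a billion resources per ID namespace.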
Migration note: Existing azure-foundry connector uses _resource_node_id_from_assignment() and _role_node_id() with the old truncation pattern. These must be updated to use the shared library, which will change existing node IDs. A migration step is needed to update entity references.
Both connectors import these instead of defining local versions. This extends the existing pattern from sp_node_id / owner_node_id.
Estimated effort: 1 day (including migration of existing node IDs)
Phase E: Connection-to-Resource Path Bridging
For the specific pattern where a connection's endpoint URL identifies a known resource, the path materializer must be able to traverse from the connection to the target resource.
Review finding addressed (High #5): The v1 draft recommended Option E1 (synthetic RBAC chain nodes) over E2 (materializer extension). Synthetic role/permission nodes carry fabricated RBAC semantics: the findings evaluator (evaluator/rules/) inspects role names, normalized actions, and permission scopes to compute findings. A synthetic "implied-role" node with no real RBAC backing would be processed by every evaluator rule, potentially producing false findings (e.g., excessive_permissions on a fabricated role). Evidence packs would include synthetic entities with no real-world counterpart, undermining the platform's deterministic, source-of-truth model. E2 is the correct approach because it preserves entity purity and avoids semantic pollution.
Recommended approach: Option E2 — Materializer extension
Extend the path materializer to follow a new traversal pattern:
workload → INVOKES → connection → CONNECTS_TO → resource
Where:
- INVOKES already exists (edge_resolver creates it for agent → connection)
- CONNECTS_TO is the new edge type created by the correlator (Phase C) when endpoint URIs match
- The connection node acts as a transparent hop (like existing CALLS, INVOKES, USES forwarding edges)
Implementation:
// In path-materializer.ts: add CONNECTS_TO to the forwarding edge set
const FORWARDING_EDGE_TYPES = new Set([
  "CALLS", "INVOKES", "USES",
  "AUTHENTICATES_AS", "AUTHENTICATES_VIA",
  "CONNECTS_TO",  // NEW: connection endpoint correlation
]);
The materializer already follows forwarding edges transparently (path-materializer.ts:197-221). Adding CONNECTS_TO to the set means the materializer will:
- Follow workload → INVOKES → connection (already works)
- Follow connection → CONNECTS_TO → resource (new, via forwarding)
- Treat the target resource as a reachable destination in the execution path
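The forwarding behaviour can be sketched in Python with a small adjacency map. This is a simplified illustration of transparent-hop traversal, not the materializer's actual algorithm; the node names, the node_types map, and the reachable_resources helper are hypothetical.

```python
FORWARDING_EDGE_TYPES = {
    "CALLS", "INVOKES", "USES",
    "AUTHENTICATES_AS", "AUTHENTICATES_VIA",
    "CONNECTS_TO",  # new forwarding edge created by the correlator
}


def reachable_resources(edges: dict, node_types: dict, start: str) -> list:
    """Depth-first walk along forwarding edges, collecting resource nodes."""
    seen, stack, found = {start}, [start], []
    while stack:
        node = stack.pop()
        for edge_type, target in edges.get(node, []):
            if edge_type not in FORWARDING_EDGE_TYPES or target in seen:
                continue
            seen.add(target)
            if node_types.get(target) == "resource":
                found.append(target)  # destination reached; do not forward further
            else:
                stack.append(target)  # transparent hop, keep walking
    return found


edges = {
    "foundry-agent": [("INVOKES", "foundry-connection")],
    "foundry-connection": [("CONNECTS_TO", "logic-app"), ("USES", "sas-credential")],
}
node_types = {
    "foundry-agent": "workload",
    "foundry-connection": "connection",
    "logic-app": "resource",
    "sas-credential": "credential",
}

found = reachable_resources(edges, node_types, "foundry-agent")
```

With CONNECTS_TO in the forwarding set, the Logic App is reachable from the agent through the connection; no role or permission nodes are involved in the walk.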
What this does NOT do: It does not create a fake HAS_ROLE → GRANTS → APPLIES_TO chain. The authority path's via_roles will be empty and actions will be empty for this hop. This is semantically correct — the agent reaches the resource via a connection credential (SAS token), not via an RBAC role assignment. The path accurately represents the real-world access mechanism.
Evaluator implications: Evaluator rules that require via_roles.length > 0 or check actions will correctly treat this path differently from RBAC-based paths. This is desirable — a SAS-token-based path has different security characteristics than an RBAC-based path.
Pro: No synthetic entities. No fabricated RBAC semantics. Findings and evidence remain grounded in real source data. Single line change to the materializer forwarding set.
Con: Authority paths via connections will have empty via_roles and actions — downstream consumers (UI, evaluator) must handle this case. New evaluator rules may be needed for connection-based paths.
Estimated effort: 1 day (the materializer change is small; testing and evaluator rule review are the bulk)
5. Implementation Priority
Review finding addressed (Medium #7): The v1 estimate of 6-7 days was understated. Phase A alone is now 3-4 days due to the cross-cutting nature of multi-connector ownership (touches entity types, storage adapter, diff engine, graph transformer, and requires a data migration). Phase D requires migration of existing node IDs. The total includes testing, migration scripts, and integration tests across the full ingestion pipeline.
| Phase | Description | Effort | Dependency |
|---|---|---|---|
| 0 | Update architectural documentation (source of truth) | 2 days | None — must come first |
| A | Multi-connector entity ownership + relationship partitioning | 3-4 days | Phase 0 |
| B | Connector-declared correlation keys (with URI specificity) | 1 day | Phase 0 |
| C | Platform correlator (incremental, per-sync) | 2-3 days | A, B |
| D | Extend shared node ID library (hash-based, case-normalized) | 1 day | Phase 0 |
| E | Connection-to-resource bridging (materializer extension) | 1 day | C |
Total estimated effort: 10-12 days
Phase 0 must come first — architectural docs are the source of truth and must reflect the new ingestion model before code changes begin.
Phase 0: Architectural Documentation Updates
The following canonical docs must be updated to reflect cross-connector correlation before any code is written:
| Document | Section to Update | Change |
|---|---|---|
| docs/architecture/01-data-model.md | Entity schema, Relationship model | Add source_connector_id on relationships. Add correlationKeys on entities. Document additive merge semantics (replaces current implicit last-writer-wins). |
| docs/architecture/01-data-model.md | Entity types | Document correlation key types per entity type (connection: endpoint_host; identity: entra_principal_id; resource: arm_resource_id). |
| docs/architecture/02-processing-pipeline.md | Pipeline steps | Add "Entity Correlation" step between "Upsert Entities" and "Compute Execution Paths". Document the correlator's inputs, outputs, and failure modes. |
| docs/architecture/00-overview.md | System design | Add cross-connector correlation to the architecture overview. Explain how multiple connectors contribute to a unified authorization graph. |
| docs/architecture/05-connectors.md | Connector interface | Add correlationKeys to NormalizedNode schema. Add sourceConnectorId to NormalizedEdge schema. Document the contract: connectors declare correlation keys, platform executes matching. |
| docs/architecture/03-database.md | Entity collection schema | Add source_connector_id field on embedded relationships. Add correlation_keys field on entity documents. Document index requirements for correlation queries. |
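The connector-interface change in the table above might look roughly like this. It is a sketch under stated assumptions: `correlationKeys` and `sourceConnectorId` are the field names from this document, while every other field, the `CorrelationKey` shape, and the sample values (including the `.test` hostname) are hypothetical.

```typescript
// Key types listed in this document; the union itself is an assumed encoding.
type CorrelationKeyType =
  | "endpoint_uri"
  | "endpoint_host"
  | "entra_principal_id"
  | "arm_resource_id";
type Specificity = "exact" | "host_only";

interface CorrelationKey {
  type: CorrelationKeyType;
  value: string;
  specificity: Specificity;
}

// Hypothetical minimal node/edge shapes with the two new fields.
interface NormalizedNode {
  id: string;
  type: string;
  correlationKeys?: CorrelationKey[]; // declared by the connector, matched by the platform
}

interface NormalizedEdge {
  from: string;
  to: string;
  type: string;
  sourceConnectorId: string; // which connector contributed this relationship
}

const connectionNode: NormalizedNode = {
  id: "conn:logic-app",
  type: "connection",
  correlationKeys: [
    { type: "endpoint_uri", value: "logic.example.test/workflows/demo", specificity: "exact" },
    { type: "endpoint_host", value: "logic.example.test", specificity: "host_only" },
  ],
};

const invokesEdge: NormalizedEdge = {
  from: "agent:provision-user",
  to: "conn:logic-app",
  type: "INVOKES",
  sourceConnectorId: "azure-foundry",
};
```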
Why docs first: These documents are referenced by CLAUDE.md as the authoritative source of truth for the platform's data model and pipeline. AI agents and human developers read them before making changes. If code is written before docs are updated, the docs become stale and misleading — a much harder problem to fix retroactively.
Estimated effort: 2 days
Deliverable: Updated architectural docs with the correlation model documented as the design intent, ready for code implementation.
6. What This Enables
Before (current state)
| Scenario | Authority Path Reconstructed? |
|---|---|
| Foundry agent → ARM-scoped Azure resources | Yes (via ARM role assignments) |
| Foundry agent → Logic App (SAS auth) | No |
| Foundry agent → ServiceNow (via Logic App) | No |
| ServiceNow workload → Entra SP (OAuth) | Yes (within entra-servicenow connector) |
| Cross-connector shared identity (same SP) | Partially — last-writer-wins overwrites relationships |
After (with Phases A-E)
| Scenario | Authority Path Reconstructed? | Mechanism |
|---|---|---|
| Foundry agent → ARM-scoped Azure resources | Yes (unchanged) | ARM role assignments via HAS_ROLE → GRANTS → APPLIES_TO |
| Foundry agent → Logic App (SAS auth) | Yes | Endpoint URI correlation (CONNECTS_TO) + materializer forwarding |
| Foundry agent → ServiceNow (via Logic App) | Yes | Cross-connector CONNECTS_TO stitching |
| ServiceNow workload → Entra SP (OAuth) | Yes (unchanged) | Within entra-servicenow connector via AUTHENTICATES_TO |
| Cross-connector shared identity (same SP) | Yes | Connector-partitioned relationships (atomic merge, multi-owner) |
Note: Authority paths via CONNECTS_TO will have empty via_roles and actions (no RBAC chain). This is semantically correct — the access is credential-based (SAS token), not role-based. Evaluator rules and UI must handle this distinction.
7. Open Questions
- Should entity ownership become connector_owners[] before any correlation work? (From reviewer.) Answer: Yes. This is now Phase A and is a prerequisite for all other phases. The multi-owner model is required for correct deletion semantics when multiple connectors contribute to the same entity.
- Should correlation operate incrementally per changed keys instead of tenant-wide scans each sync? (From reviewer.) Recommendation: Yes, incremental. The correlator should only re-correlate entities whose correlationKeys changed in the current sync, not scan all entities tenant-wide. This bounds the cost to O(changed entities) rather than O(all entities). Phase C is updated to reflect this.
- Should we preserve strict RBAC semantics (E2) or allow synthetic inferred privilege objects (E1)? (From reviewer.) Answer: E2 (materializer extension). Synthetic nodes create semantic drift in findings and evidence. Phase E is revised to recommend E2.
- How to handle endpoint_host ambiguity? When multiple resources share a hostname (e.g., API gateway, load balancer), host_only specificity should only auto-link if there's exactly one candidate. Multiple candidates are flagged as ambiguous for operator review.
- What other correlation keys exist beyond endpoint URLs? ServiceNow sys_id for users could correlate to Entra objectId if the ServiceNow instance uses SAML/OIDC with Azure AD. Should we pre-define these or discover them?
- Should source_connector_id on relationships use connector ID or sync ID? Recommendation: connector ID. Sync ID is too granular: each sync from the same connector would create a new partition, and the "remove old connector relationships" step would need to track the latest sync ID per connector. Connector ID is simpler and matches the ownership model.
- Migration path for existing data: Changing node ID generation (Phase D) and adding connector_owners / source_connector_id (Phase A) requires a data migration for existing entities. Should this be a script or an online migration during the next sync?
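The incremental correlation and host_only ambiguity rules discussed in the questions above can be sketched together. Everything here is illustrative: the `Entity` and `CorrelationResult` shapes, the `correlateChanged` function, the `type|value` index layout, and the `.test` hostnames are assumptions. The sketch also assumes that an exact match takes precedence over the host_only fallback.

```typescript
type Specificity = "exact" | "host_only";
interface CorrelationKey { type: string; value: string; specificity: Specificity }
interface Entity { id: string; correlationKeys: CorrelationKey[] }

type CorrelationResult =
  | { kind: "linked"; from: string; to: string }
  | { kind: "ambiguous"; from: string; candidates: string[] };

// Correlates only the entities whose keys changed in this sync.
// `index` maps "type|value" to the IDs of resources that declare that key.
function correlateChanged(changed: Entity[], index: Map<string, string[]>): CorrelationResult[] {
  const results: CorrelationResult[] = [];
  for (const entity of changed) {
    const exact: string[] = [];
    const hostOnly: string[] = [];
    for (const key of entity.correlationKeys) {
      const candidates = (index.get(`${key.type}|${key.value}`) ?? [])
        .filter((id) => id !== entity.id);
      (key.specificity === "exact" ? exact : hostOnly).push(...candidates);
    }
    if (exact.length > 0) {
      // Exact URI matches win; the host_only fallback is not consulted.
      for (const to of exact) results.push({ kind: "linked", from: entity.id, to });
    } else if (hostOnly.length === 1) {
      results.push({ kind: "linked", from: entity.id, to: hostOnly[0] });
    } else if (hostOnly.length > 1) {
      // More than one resource shares the host: flag for operator review.
      results.push({ kind: "ambiguous", from: entity.id, candidates: hostOnly });
    }
  }
  return results;
}

const keyIndex = new Map<string, string[]>([
  ["endpoint_uri|logic.example.test/workflows/demo", ["res:logic-app-demo"]],
  ["endpoint_host|gateway.example.test", ["res:api-a", "res:api-b"]],
]);
const changedEntities: Entity[] = [
  { id: "conn:1", correlationKeys: [
    { type: "endpoint_uri", value: "logic.example.test/workflows/demo", specificity: "exact" },
  ] },
  { id: "conn:2", correlationKeys: [
    { type: "endpoint_host", value: "gateway.example.test", specificity: "host_only" },
  ] },
];
const correlation = correlateChanged(changedEntities, keyIndex);
```

Because the loop runs over `changed` only, the work per sync scales with the number of changed entities, not with the size of the tenant.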
Resolved Questions (from review)
| Question | Resolution |
|---|---|
| Can we use read-merge-write for relationship merge? | No. Race-prone under concurrent syncs. Use MongoDB atomic pipeline updates ($filter + $concatArrays). |
| Is endpoint_host sufficient for correlation? | No. Host alone is too coarse. Use endpoint_uri (host+path) with exact specificity; fall back to endpoint_host with host_only specificity. |
| Should we create synthetic RBAC nodes (E1)? | No. Fabricated role/permission nodes cause semantic drift in findings and evidence. Use materializer extension (E2) instead. |
| Is the ARM node ID replace + [:80] pattern safe? | No. Truncation collisions and case-sensitivity issues. Use SHA-256 hash with lowercased input. |
| Is 6-7 days a realistic effort estimate? | No. Cross-cutting type/storage/traversal/index/test changes require 10-12 days. |
| Is the industry consensus "exact matching only"? | Overstated. SailPoint supports programmatic Cloud Rules (arbitrary logic). Wiz does query-time path discovery. Pattern is "predominantly exact" with escape hatches. |
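The atomic merge from the first row can be sketched as an aggregation-pipeline update, which MongoDB accepts as the update argument of `updateOne` from server version 4.2 onward. The builder below only constructs the pipeline document; the `relationships` field name, the `EmbeddedRelationship` shape, and the function name are assumptions, while `source_connector_id`, `$filter`, and `$concatArrays` come from this document.

```typescript
// Hypothetical embedded relationship shape.
interface EmbeddedRelationship {
  type: string;
  target_id: string;
  source_connector_id: string;
}

// Builds a single-stage update pipeline that drops this connector's old
// relationships and appends its new ones in one atomic document update.
function buildRelationshipReplacePipeline(
  connectorId: string,
  incoming: EmbeddedRelationship[],
): object[] {
  return [
    {
      $set: {
        relationships: {
          $concatArrays: [
            {
              // Keep relationships contributed by every other connector.
              $filter: {
                input: { $ifNull: ["$relationships", []] },
                as: "rel",
                cond: { $ne: ["$$rel.source_connector_id", { $literal: connectorId }] },
              },
            },
            // $literal stops values from being parsed as expressions or field paths.
            { $literal: incoming },
          ],
        },
      },
    },
  ];
}

const relationshipPipeline = buildRelationshipReplacePipeline("azure-foundry", [
  { type: "INVOKES", target_id: "conn:logic-app", source_connector_id: "azure-foundry" },
]);
```

Since the filter and the append happen inside one update, there is no window in which a concurrent sync from another connector can read a half-merged array.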
8. References
Internal
- sv0-connectors/shared/sv0_azure/sv0_azure/node_ids.py - Existing shared node ID generators
- sv0-platform/src/ingestion/path-materializer.ts - Path traversal logic (source-system-agnostic)
- sv0-platform/src/ingestion/graph-transformer.ts - buildStableEntityId() function
- sv0-platform/src/storage/mongo/adapters/entity-adapter.ts - Entity upsert (last-writer-wins)
- sv0-platform/src/ingestion/diff-engine.ts - Cross-connector deletion protection
- docs/product/scenario-setup/foundry-logic-app-servicenow.md - ProvisionUser Agent scenario
- docs/product/notion-synced/foundry-agent-llm-azure-app-logic-servicenow.md - Scenario implementation details
Industry
- SailPoint: Account Correlation, Correlation Rules
- Veza: OAA Guide, Cross-Service Connections, Modeling Users and Permissions
- Wiz: Security Graph, Attack Path Analysis, OCI IAM Support, AWS Neptune Case Study
- CrowdStrike: Enterprise Graph, CNAPP with CIEM
- Neo4j: Entity Resolution, Entity Resolution Example
- SCIM: RFC 7643 — Core Schema, RFC 7644 — Protocol
- Cy5: Entity-Driven Cloud Security Architecture
- Prisma Cloud: Resource drift alerting when >30% of resources disappear in single scan
- ServiceNow CMDB: IRE staging area with reconciliation rules for cross-source entity matching
- MongoDB: Write Operations Atomicity, Compound Operations
- Azure: Logic Apps HTTP Endpoint, ARM Resource Name Rules
Appendix A: Review Findings Traceability
All 8 review findings from the v1 draft review have been addressed in this v2 revision.
| # | Severity | Finding | Resolution | Section |
|---|---|---|---|---|
| 1 | Critical | Phase A additive merge breaks diff/version semantics (diffRelationships false events, impure version snapshots) | Connector-partitioned relationships with source_connector_id. Diff engine filters by connector before comparison. Versions store full merged state (correct composite snapshot). | Phase A (A2, A3) |
| 2 | Critical | Read-merge-write race under concurrent connector syncs | Replaced with MongoDB atomic pipeline update ($filter + $concatArrays). No read-merge-write. Single document lock. | Phase A (A2) |
| 3 | High | Single-writer connector_id breaks deletion logic for shared entities | Added connector_owners: string[] with $addToSet. Deletion only when ALL owning connectors mark entity absent. | Phase A (A1) |
| 4 | High | endpoint_host exact match too coarse; host maps multiple workflows | Added endpoint_uri (host+path) with exact specificity. endpoint_host demoted to host_only with ambiguity handling. | Phase B |
| 5 | High | Synthetic RBAC nodes (E1) cause semantic drift in findings/evidence | Reversed recommendation to E2 (materializer extension). CONNECTS_TO added to forwarding edge set. No synthetic entities. | Phase E |
| 6 | Medium | ARM node ID replace + [:80] truncation causes collisions; no case normalization | SHA-256 hash with lowercased input. Human-readable suffix. No truncation collisions. | Phase D |
| 7 | Medium | Effort estimate understated (6-7 days) | Revised to 10-12 days. Phase A alone is 3-4 days due to cross-cutting changes. | Section 5 |
| 8 | Medium | Industry consensus overstated; internal contradiction (SailPoint rules vs "exact only", Wiz query-time vs "ingest time only") | Section retitled "Industry Patterns" with nuance. SailPoint Cloud Rules acknowledged. Wiz query-time path discovery noted. | Section 2.3 |
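The resolution for finding 6 (hash-based, case-normalized node IDs) can be illustrated with a short sketch. The shared library referenced in this document is Python (node_ids.py); the TypeScript below only demonstrates the idea. The `arm:` prefix, the ID layout, the function name, and the sample resource IDs are all hypothetical.

```typescript
import { createHash } from "crypto";

// Derives a stable node ID from an ARM resource ID:
// lowercase first, hash the full string, then append a readable suffix.
function armNodeId(armResourceId: string): string {
  const normalized = armResourceId.trim().toLowerCase();
  const digest = createHash("sha256").update(normalized, "utf8").digest("hex");
  const segments = normalized.split("/").filter((s) => s.length > 0);
  // Last path segment (the resource name) serves as the human-readable suffix.
  const name = segments.length > 0 ? segments[segments.length - 1] : "unknown";
  return `arm:${digest}:${name}`;
}

// Same resource reported with different casing by two connectors (illustrative IDs).
const idMixedCase = armNodeId(
  "/subscriptions/0000/resourceGroups/RG-Demo/providers/Microsoft.Logic/workflows/ProvisionUser",
);
const idLowerCase = armNodeId(
  "/subscriptions/0000/resourcegroups/rg-demo/providers/microsoft.logic/workflows/provisionuser",
);
const idOther = armNodeId(
  "/subscriptions/0000/resourceGroups/RG-Demo/providers/Microsoft.Logic/workflows/OtherFlow",
);
```

The full digest is kept so that uniqueness rests on the hash of the whole resource ID, not on a truncated prefix of it; the suffix is for readability only.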
Reviewer Open Questions Addressed
| Question | Answer |
|---|---|
| Should entity ownership become connector_owners[] before correlation work? | Yes. Prerequisite. Phase A. |
| E1 or E2 for path bridging? | E2. Materializer extension preserves entity purity. |
| Incremental or tenant-wide correlation? | Incremental. Per-sync, only re-correlate changed keys. |
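The multi-owner deletion rule behind the first answer can be sketched as below. This is one possible reading, in which a connector that reports an entity absent is removed from `connector_owners` and the entity is deleted once no owners remain. The `OwnedEntity` shape and `releaseOwnership` function are hypothetical.

```typescript
// Hypothetical entity shape; connector_owners is the field named in this document.
interface OwnedEntity {
  id: string;
  connector_owners: string[];
}

// Called when `absentConnectorId` no longer reports the entity in its sync.
// The entity is only eligible for deletion once every owner has released it.
function releaseOwnership(
  entity: OwnedEntity,
  absentConnectorId: string,
): { entity: OwnedEntity; shouldDelete: boolean } {
  const remaining = entity.connector_owners.filter((id) => id !== absentConnectorId);
  return {
    entity: { ...entity, connector_owners: remaining },
    shouldDelete: remaining.length === 0,
  };
}

// A service principal discovered by both connectors.
const sharedSp: OwnedEntity = {
  id: "id:shared-sp",
  connector_owners: ["entra-servicenow", "azure-foundry"],
};
const afterFirstRelease = releaseOwnership(sharedSp, "entra-servicenow");
const afterSecondRelease = releaseOwnership(afterFirstRelease.entity, "azure-foundry");
```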
Next Action
Status: adopted — implementation planned
Decision: Proceed with Phase A–E implementation. SAME_AS edges not adopted; existing AUTHENTICATES_TO handles current cross-system identity linking. New CONNECTS_TO edge to be added for endpoint-URL-based correlation (created by the correlator in Phase C, traversed by the materializer in Phase E).
Implementation tracked in:
- Phase 0 (docs first): sv0-documentation #78 - Update 01-data-model, 02-processing-pipeline, 00-overview, 05-connectors, 03-database
- Phase A–E (platform): sv0-platform #79 - Multi-connector ownership, correlator, shared node IDs, path bridging
No further research needed. Implementation may begin after Phase 0 docs are merged.