Skip to main content

Database Architecture


Context

SecurityV0 has two data workloads:

  1. Rich entity storage — Full policy documents, allowlisted source metadata, versioned history, events, evidence packs
  2. Graph traversal — Execution path queries: "What can this identity reach through which roles?"

For the MVP, both workloads run on MongoDB. Graph traversal uses application-level path computation with materialized execution paths on identity documents.


Decision

MongoDB as the single database for the MVP. All data — entities, relationships, versions, events, findings, evidence — lives in MongoDB.

Graph queries are handled through:

  • Materialized execution paths — pre-computed on each identity during sync
  • Denormalized reverse lookupsaccessible_by arrays on resource documents
  • Application-level traversal — follow relationship references for on-demand path queries

When This Is Sufficient

  • < 1,000 identities per tenant
  • 2-3 hop paths (Identity → Role → Permission → Resource)
  • Materialized paths recomputed on sync (acceptable latency)
  • 2-3 connectors per tenant
  • Historical queries prioritized over deep graph discovery

When to Add Neo4j (Future)

  • 10,000+ identities per tenant with 5+ connectors
  • 5+ hop transitive chains (cross-system paths)
  • Real-time reverse queries at scale
  • Path recomputation becomes the sync bottleneck

See "Future: Scaling with Neo4j" section at the end of this document.


MongoDB Schema

Collections Overview

CollectionPurposeRetention
entitiesCurrent state of all graph objectsPermanent
entity_versionsPoint-in-time snapshots (baseline-derived)Configurable (default: 2 years)
eventsCanonical audit log eventsTTL (default: 2 years)
baseline_metadataBaseline headers (counts, integrity hash, schedule)Configurable (default: 1 year)
baseline_entitiesPer-entity baseline snapshotsConfigurable (default: 1 year)
execution_evidenceFirst-class execution proof recordsConfigurable (default: 2 years)
execution_chainsPlatform-computed execution chain trackingPermanent
execution_chain_versionsChain state snapshots for temporal comparison (Phase 2)Configurable (default: 2 years)
execution_chain_eventsDiscrete events for chain lifecycle (Phase 2)Configurable (default: 2 years)
sync_cursorsAudit log sync progress trackingPermanent
connector_syncsSync execution recordsPermanent
findingsTriggered detectionsPermanent
evidence_packsSealed finding artifactsPermanent

Single entities Collection

All entity types (identity, workload, connection, credential, owner, role, permission, resource, execution_evidence) are stored in a single entities collection, discriminated by the entity_type field. The workload type replaces the former automation type (see ADR-010); the platform accepts both values during the migration period. This is the idiomatic MongoDB pattern for polymorphic documents — it minimizes query round-trips, simplifies operations, and aligns with the materialized execution paths strategy.

Full rationale: See ADR-002: Single Entities Collection


entities — Current State

Every node in the execution/authority graph is an entity document. The entity_type field determines the shape of properties.

{
_id: "uuid-...", // Platform-generated UUID
tenant_id: "uuid-...",
entity_type: "identity", // identity | workload | connection | credential | owner | role | permission | resource | execution_evidence (deprecated: "automation" accepted as alias for "workload")
source_system: "entra_id",
source_id: "sp-abc123", // ID in source system

// Type-specific data (flexible schema per entity_type)
properties: {
identity_type: "service_principal",
display_name: "SN-Integration-Prod",
execution_mode: "autonomous",
status: "active",
last_activity_at: ISODate("2026-01-22T14:00:00Z"),

// Allowlisted source metadata (full response hashed, not stored)
source_metadata: {
appId: "a1b2c3d4-...",
servicePrincipalType: "Application",
signInAudience: "AzureADMyOrg",
appOwnerOrganizationId: "tenant-xyz"
},
source_metadata_hash: "sha256:e5f6a7b8...",

oauth_config: {
grant_type: "client_credentials",
resource: "https://instance.service-now.com",
permissions_granted: ["Application.ReadWrite.All", "User.Read.All"]
},
credentials: [{
type: "client_secret",
expires_at: ISODate("2026-06-15"),
key_id: "key-abc",
created_at: ISODate("2024-06-15")
}]
},

// Relationships to other entities
relationships: [
{
type: "OWNED_BY",
target_id: "uuid-owner-sarah", // Owner entity (owner_type: "human")
properties: {
since: ISODate("2024-06-15"),
until: ISODate("2025-11-15"),
status: "decayed",
ownership_level: "primary", // primary | secondary | inherited
assigned_by: "admin@corp.com"
}
},
{
type: "OWNED_BY",
target_id: "uuid-owner-it-automation-team", // Owner entity (owner_type: "team")
properties: {
since: ISODate("2024-06-15"),
status: "active",
ownership_level: "secondary",
assigned_by: "admin@corp.com"
}
},
{
type: "AUTHENTICATES_TO",
target_id: "uuid-identity-sn-integration-user", // Identity in target system
properties: {
via_credential_id: "uuid-cred-client-secret",
auth_protocol: "oauth2",
target_system: "servicenow",
trust_chain_position: 0,
established_at: ISODate("2024-06-15"),
last_used_at: ISODate("2026-01-22T14:00:00Z"),
// Cross-system linkage proof (full issuer/tenant/instance context)
evidence_references: {
issuing_system_id: "a1b2c3d4-...", // Entra SP appId
issuing_tenant_id: "72f988bf-...", // Entra tenant ID
target_system_id: "a1b2c3d4-...", // ServiceNow oauth_entity.client_id
target_instance_id: "https://corp.service-now.com", // ServiceNow instance
target_record_sys_id: "oauth-entity-xyz", // oauth_entity sys_id
matching_field: "client_id",
matching_value: "a1b2c3d4-...",
target_user_binding: "oauth_entity.user -> sys_user.user_name"
}
}
},
{
type: "HAS_ROLE",
target_id: "uuid-role-itil",
properties: {
granted_at: ISODate("2024-06-15"),
granted_by: "admin@corp.com"
}
},
{
type: "HAS_ROLE",
target_id: "uuid-role-hr-admin",
properties: {
granted_at: ISODate("2025-03-15"),
granted_by: "bob.chen@corp.com"
}
}
],

// Materialized execution paths (pre-computed during sync)
// For cross-system paths, includes source_system, auth_chain_depth, and via_identity
execution_paths: [
{
resource_id: "uuid-res-hr-case",
resource_name: "hr_case",
business_domain: "hr",
sensitivity: "confidential",
via_roles: ["hr_admin"],
actions: ["create", "read", "update", "delete"],
source_system: "servicenow", // Which system the resource lives in
auth_chain_depth: 1, // 1 = reached via AUTHENTICATES_TO hop
via_identity: "uuid-identity-sn-integration-user", // Target system identity
computed_at: ISODate("2026-01-22T15:00:00Z")
},
{
resource_id: "uuid-res-incident",
resource_name: "incident",
business_domain: "it_ops",
sensitivity: "internal",
via_roles: ["itil"],
actions: ["read", "update"],
source_system: "servicenow",
auth_chain_depth: 1,
via_identity: "uuid-identity-sn-integration-user",
computed_at: ISODate("2026-01-22T15:00:00Z")
}
],

// Sync metadata
sync_version: 42,
last_synced_at: ISODate("2026-01-22T15:00:00Z"),
created_at: ISODate("2024-06-15T10:00:00Z"),
updated_at: ISODate("2026-01-22T15:00:00Z")
}

Resource entity (with reverse lookup)

{
_id: "uuid-res-hr-case",
tenant_id: "uuid-...",
entity_type: "resource",
source_system: "servicenow",
source_id: "table-hr_case",

properties: {
resource_type: "table",
resource_name: "hr_case",
business_domain: "hr",
sensitivity: "confidential",
contains_pii: true,

// Allowlisted source metadata
source_metadata: {
name: "hr_case",
label: "HR Case",
super_class: "task",
is_extendable: true
},
source_metadata_hash: "sha256:b2c3d4e5...",
acl_details: [{ /* individual ACL records */ }]
},

// Relationships
relationships: [
{ type: "ACCESSIBLE_VIA", target_id: "uuid-perm-hr-case-write", properties: {} }
],

// Denormalized reverse lookup: which identities can reach this resource
accessible_by: [
{
identity_id: "uuid-sp-sn-integration",
identity_name: "SN-Integration-Prod",
via_roles: ["hr_admin"],
actions: ["create", "read", "update", "delete"],
computed_at: ISODate("2026-01-22T15:00:00Z")
}
],

sync_version: 40,
last_synced_at: ISODate("2026-01-22T15:00:00Z"),
created_at: ISODate("2024-01-10T00:00:00Z"),
updated_at: ISODate("2026-01-22T15:00:00Z")
}

entity_versions — Point-in-Time Snapshots (Reduced Role)

With audit-log-based change detection, entity_versions is no longer populated on every sync. Instead, versions are stored:

  1. From baselines: When a baseline snapshot is created
  2. On significant changes: Changes that trigger findings (ownership decay, scope drift)
  3. On explicit request: When users request point-in-time snapshots for investigations

Most historical queries are served by baseline + event replay. This collection acts as a cache for frequently-accessed historical states.

{
_id: ObjectId,
entity_id: "uuid-...", // References entities._id
tenant_id: "uuid-...",
valid_at: ISODate("2025-07-22T00:00:00Z"), // When this version became active
expired_at: ISODate("2025-11-15T00:00:00Z"), // When superseded (null = current)
sync_version: 38,

// Full snapshot of entity at this point in time
entity_type: "identity",
properties: { /* full properties as they were */ },
relationships: [ /* relationships as they were */ ],
execution_paths: [ /* paths as they were */ ],

// Version source tracking
baseline_id: ObjectId("..."), // If derived from a baseline
derived_from_events: true, // If reconstructed via event replay
reconstruction_timestamp: ISODate("2026-01-22T10:00:00Z") // When reconstructed
}

Point-in-time query — hybrid approach:

// "What did this identity look like 6 months ago?"
// 1. Fast path: check for direct snapshot
const snapshot = db.entity_versions.findOne({
entity_id: "uuid-sp-abc123",
tenant_id: tenantId,
valid_at: { $lte: sixMonthsAgo },
expired_at: { $gt: sixMonthsAgo }
});

// 2. Slow path: find baseline + replay events
if (!snapshot) {
const baselineMeta = db.baseline_metadata.findOne({
tenant_id: tenantId,
source_system: "entra_id",
created_at: { $lte: sixMonthsAgo }
}, { sort: { created_at: -1 } });

const baselineEntity = db.baseline_entities.findOne({
baseline_id: baselineMeta._id,
entity_source_id: "sp-abc123"
});

const events = db.events.find({
entity_id: "uuid-sp-abc123",
timestamp: { $gt: baselineMeta.created_at, $lte: sixMonthsAgo }
}).sort({ timestamp: 1 });

// Replay events onto baseline entity to reconstruct state
}

baseline_metadata + baseline_entities — Periodic Snapshots

Baselines capture the complete state of all entities at a point in time. Created weekly (default) or on-demand. Stored as two collections to avoid MongoDB's 16MB document limit — even modest tenants (500 identities, 200 roles, 1000 permissions) can exceed this when relationships and properties are included.

baseline_metadata — Baseline headers

{
_id: ObjectId,
tenant_id: "uuid-...",
baseline_type: "scheduled", // initial | scheduled | manual | pre_retention_expiry
created_at: ISODate("2026-01-22T00:00:00Z"),
source_system: "entra_id", // One baseline per source system

// Summary statistics
entity_counts: {
identities: 150,
automations: 92,
connections: 15,
credentials: 75,
owners: 85,
roles: 42,
permissions: 210,
resources: 180,
execution_evidence: 340
},

// Baseline chain
previous_baseline_id: ObjectId("..."), // For diff computation
schedule_id: "weekly-entra", // If from scheduled job

// Integrity verification (hash of all entity_hashes)
integrity_hash: "sha256:a4f8c2e1...",
schema_version: "1.0"
}

baseline_entities — Per-entity snapshots

One document per entity per baseline. Enables partial retrieval (diff a single entity without loading the full baseline).

{
_id: ObjectId,
baseline_id: ObjectId("..."), // References baseline_metadata._id
tenant_id: "uuid-...",
entity_source_id: "sp-abc123", // Source system entity ID
entity_id: "uuid-...", // References entities._id
entity_type: "identity",

// Full snapshot of this entity at baseline time
properties: { /* full properties */ },
relationships: [ /* all relationships */ ],
execution_paths: [ /* computed paths, if identity */ ],

// Checksum for integrity verification
entity_hash: "sha256:..."
}

sync_cursors — Audit Log Sync Progress

Tracks where each connector left off when fetching audit logs, enabling incremental syncs.

{
_id: ObjectId,
tenant_id: "uuid-...",
source_system: "entra_id",
cursor_type: "audit_log",

// Source-specific cursor state
cursor_state: {
// For Azure Entra ID (directoryAudits)
last_activity_datetime: ISODate("2026-01-22T15:00:00Z"),
skip_token: "...", // For pagination

// For ServiceNow (sys_audit)
last_sys_updated_on: ISODate("2026-01-22T15:00:00Z"),
last_sys_id: "audit-record-id"
},

// Sync metadata
last_successful_sync: ISODate("2026-01-22T16:00:00Z"),
last_sync_id: "uuid-sync-...",
events_fetched_count: 147,

// Retention tracking
retention_warning_sent: false,
retention_expires_at: ISODate("2026-04-22T00:00:00Z") // ~90 days from oldest unfetched
}

events — Canonical Audit Log

The events collection is the canonical source of truth for changes. Events are sourced from audit log APIs (Azure directoryAudits, ServiceNow sys_audit) rather than computed by diffing snapshots.

{
_id: ObjectId,
tenant_id: "uuid-...",
timestamp: ISODate("2026-01-22T15:00:00Z"),
entity_id: "uuid-...",
entity_type: "identity",

// Event classification
event_type: "role_assigned", // See Event Types below

// Audit log provenance (source system attribution)
audit_source: {
source_api: "directoryAudits", // Azure: 'directoryAudits' | ServiceNow: 'sys_audit'
source_record_id: "audit-abc123", // Original audit record ID in source
source_timestamp: ISODate("2026-01-22T15:00:00Z"), // When change actually occurred
fetched_at: ISODate("2026-01-22T16:00:00Z"), // When we retrieved this
retention_expires_at: ISODate("2026-04-22T00:00:00Z") // When source will purge (~90 days)
},

// Actor who made the change
actor: {
actor_id: "user-uuid-...",
actor_type: "user", // user | service_principal | system | unknown
actor_display_name: "bob.chen@corp.com",
actor_source_system: "entra_id"
},

// Change details
change_details: {
operation: "Add member to role", // Source-specific operation name
field_changes: [ // For property changes
{
field_name: "relationships",
old_value: null,
new_value: { role_id: "uuid-role-hr-admin", role_name: "hr_admin" }
}
],
target_resources: [ // Affected entities
{
id: "uuid-role-hr-admin",
type: "role",
display_name: "hr_admin"
}
],
additional_details: {
granted_by: "bob.chen@corp.com",
approval_ticket: "JIRA-IT-4521" // If available
}
},

// Sync context
source_connector: "entra_id",
sync_id: "uuid-sync-..."
}

Event Types

Event TypeDescriptionTriggers
createdEntity createdFinding evaluation
updatedProperty changed-
deletedEntity removedFinding evaluation
role_assignedRole granted to identityScope drift detection
role_revokedRole removed from identity-
permission_grantedDirect permission grantedScope drift detection
permission_revokedDirect permission removed-
owner_assignedOwner added to identityOwnership tracking
owner_removedOwner removed from identityOrphaned ownership detection
status_changedEntity status transition (active → disabled)Ownership decay detection
credential_createdNew credential issued-
credential_rotatedCredential refreshed-
credential_expiredCredential expiredCredential hygiene detection
credential_deletedCredential removed-
relationship_addedGeneric relationship added-
relationship_removedGeneric relationship removed-

connector_syncs — Sync Records

{
_id: "uuid-sync-...",
tenant_id: "uuid-...",
connector_type: "entra_id",
started_at: ISODate,
completed_at: ISODate,
status: "completed", // running | completed | failed | partial

// Sync mode determines data source
sync_mode: "audit_log", // full | incremental | audit_log

// For audit_log mode
audit_log_window: {
since: ISODate("2026-01-15T00:00:00Z"),
until: ISODate("2026-01-22T15:00:00Z"),
source_api: "directoryAudits"
},

metrics: {
// Audit log metrics
audit_records_fetched: 147,
events_created: 89,
events_deduplicated: 12, // Already had these events

// Entity metrics
entities_affected: 23,
entities_created: 3,
entities_updated: 20,
relationships_added: 15,
relationships_removed: 4,

// Path metrics
paths_recomputed: 18
},

error: null,
cursor_state_after: { // For cursor tracking
last_activity_datetime: ISODate("2026-01-22T15:00:00Z")
}
}

findings — Triggered Detections

{
_id: "uuid-finding-...",
tenant_id: "uuid-...",
finding_type: "orphaned_ownership",
status: "active", // active | acknowledged | remediated | false_positive
detected_at: ISODate,
last_evaluated_at: ISODate,

entity_id: "uuid-sp-abc123",
affected_resources: ["uuid-res-hr-case", "uuid-res-incident"],

deterministic_explanation: "Service principal 'SN-Integration-Prod' has no active owner. Previous owner Jane Doe (jane.doe@corp.com) was disabled on 2025-11-15. SP continues executing with 3 roles granting access to HR and IT domains.",

evidence_refs: {
ownership_events: ["event-id-1", "event-id-2"],
current_paths: 2,
last_activity: ISODate("2026-01-22T14:00:00Z")
},

evidence_pack_id: "uuid-pack-..."
}

evidence_packs — Sealed Artifacts

{
_id: "uuid-pack-...",
tenant_id: "uuid-...",
finding_id: "uuid-finding-...",
created_at: ISODate,
sealed_at: ISODate,
schema_version: "1.0",

content: {
identity_summary: { /* ... */ },
authority_snapshot: { /* ... */ },
ownership_timeline: [ /* ... */ ],
blast_radius: { /* ... */ },
remediation: { /* ... */ }
},

content_markdown: "...",
integrity_hash: "sha256:...",
previous_pack_id: null
}

Integrity model — current scope and known gap: The integrity_hash is a SHA-256 digest of the pack's content (including tenant_id, finding_id, schema_version, and all content fields). It provides tamper detection — any accidental corruption or replay inconsistency is detectable by recomputing the hash. However, since the hash is stored alongside the content in the same mutable MongoDB collection, it does not provide non-repudiation — an actor with database write access could rewrite both content and hash. For compliance-grade deployments (NIST 800-53 AU-10, SI-7), evidence packs should be signed with a KMS/HSM-backed key (detached signature + certificate chain) and stored in immutable/WORM storage. This is deferred until customer compliance requirements drive it — see Open Question #6.


execution_evidence — First-Class Execution Proof

Immutable records proving that an autonomous identity actually executed an action. Links source system log entries to the identity graph. Without these records, EXECUTES_ON edges are claims without proof.

{
_id: ObjectId,
tenant_id: "uuid-...",
entity_id: "uuid-...", // Identity that executed
source_system: "servicenow",

// Source provenance
source_table: "syslog_transaction", // Where evidence came from
source_record_id: "txn-abc123", // ID in source table (for verification)
source_timestamp: ISODate("2026-01-22T14:00:00Z"), // When execution occurred

// Evidence content
evidence_type: "api_call", // api_call | flow_execution | scheduled_job | sign_in
action: "POST /api/now/table/incident",
target_resource: "incident",
outcome: "success", // success | failure | unknown

// Integrity (no raw data stored — hash proves content hasn't changed)
payload_hash: "sha256:...",

// Sync context
fetched_at: ISODate("2026-01-22T16:00:00Z"),
sync_id: "uuid-sync-...",
sync_version: "sync-v4"
}

Evidence type examples:

TypeSource TableWhat It Proves
api_callsyslog_transactionIdentity executed inbound REST API call
flow_executionsys_flow_contextFlow Designer automation ran
scheduled_jobsys_triggerScheduled job executed
sign_inAzure signInsService principal authenticated

execution_chains — Platform-Computed Chain Tracking

Execution chains are platform-computed ordered sets of entities from trigger to destination. Chain identity is anchored to the entry point entity (e.g., a Business Rule sys_id), which means chains survive entity rotation (OAuth client_id changes, SP credential rotation). The composition_hash is a SHA256 of sorted entity_id:role pairs — when this hash changes between syncs, the chain structure has changed.

W1 scope note: W1 (Exposure wedge) does NOT require execution_chains persistence, chain versioning, composition fingerprints, or temporal drift analysis. W1 derives Authority Paths and Exposures from entity relationships and execution_paths[] directly. The execution_chains collection is used for operational tracking and future wedges. See 01-data-model.md W1 Derived Concepts.

// execution_chains — platform-computed execution chain tracking
{
_id: "chain-uuid",
tenant_id: String,
name: String,
anchor_entity_id: String, // Entry point entity ID (stable root)
entity_refs: [{
entity_id: String,
entity_type: String, // identity | workload | connection | credential | owner | role | permission | resource | execution_evidence
role: String // entry_point | code_component | outbound_target | auth_credential | destination_identity | trigger_resource | data_resource
}],
summary: {
trigger: String,
destination: String,
egress_category: String,
blast_radius_domains: [String],
ownership_status: String,
total_roles: Number,
max_sensitivity: String,
canonical_permissions: {
reads: [String],
writes: [String]
}
},
composition_hash: String, // SHA256 fingerprint for change detection
first_detected_at: Date,
last_seen_at: Date,
sync_version: Number
}

execution_chain_versions — Chain State Snapshots (Phase 2)

Phase 2 deliverable. This collection will be implemented during Phase 2 temporal chain tracking (see implementation plan B.6). Section header reserved during Phase 1.

// execution_chain_versions — chain state snapshots for temporal comparison
{
_id: "version-uuid",
chain_id: String, // FK to execution_chains._id
tenant_id: String,
version_number: Number,
entity_refs: [{ entity_id: String, entity_type: String, role: String }],
summary: { /* snapshot at this version */ },
composition_hash: String,
created_at: Date,
sync_version: Number,
diff_from_previous: {
entities_added: [{ entity_id: String, entity_type: String, role: String }],
entities_removed: [{ entity_id: String, entity_type: String, role: String }],
summary_changes: Object
}
}

execution_chain_events — Chain Lifecycle Events (Phase 2)

Phase 2 deliverable. This collection will be implemented during Phase 2 temporal chain tracking (see implementation plan B.6). Section header reserved during Phase 1.

// execution_chain_events — discrete events for chain lifecycle
{
_id: "event-uuid",
chain_id: String,
tenant_id: String,
event_type: String, // chain_created | chain_entity_added | chain_entity_removed
// | chain_blast_radius_changed | chain_ownership_changed
timestamp: Date,
details: Object,
sync_version: Number
}

Indexes

// === entities ===
db.entities.createIndex({ tenant_id: 1, entity_type: 1, "properties.status": 1 });
db.entities.createIndex({ tenant_id: 1, source_system: 1, source_id: 1 }, { unique: true });
db.entities.createIndex({ tenant_id: 1, entity_type: 1, "properties.last_activity_at": -1 });
db.entities.createIndex({ tenant_id: 1, "relationships.target_id": 1 });
db.entities.createIndex({ tenant_id: 1, entity_type: 1, "properties.identity_type": 1 });

// For reverse path queries
db.entities.createIndex({ tenant_id: 1, "execution_paths.resource_id": 1 });
db.entities.createIndex({ tenant_id: 1, "accessible_by.identity_id": 1 });

// For orphaned ownership trigger (checks all ownership levels)
db.entities.createIndex({
tenant_id: 1,
entity_type: 1,
"relationships.type": 1,
"relationships.properties.status": 1,
"relationships.properties.ownership_level": 1
});

// For owner hierarchy queries (team → BU → org)
db.entities.createIndex({
tenant_id: 1,
entity_type: 1,
"properties.owner_type": 1,
"properties.status": 1
});

// For cross-system auth path queries
db.entities.createIndex({
tenant_id: 1,
"relationships.type": 1,
"relationships.properties.target_system": 1
});

// === entity_versions ===
db.entity_versions.createIndex({ entity_id: 1, valid_at: 1, expired_at: 1 });
db.entity_versions.createIndex({ tenant_id: 1, entity_id: 1, sync_version: -1 });

// === events ===
db.events.createIndex({ tenant_id: 1, entity_id: 1, timestamp: -1 });
db.events.createIndex({ tenant_id: 1, event_type: 1, timestamp: -1 });
db.events.createIndex({ sync_id: 1 });
db.events.createIndex({ timestamp: 1 }, { expireAfterSeconds: 63072000 }); // 2-year TTL

// Audit log provenance queries
db.events.createIndex({ tenant_id: 1, "audit_source.source_api": 1, timestamp: -1 });
db.events.createIndex({ tenant_id: 1, "actor.actor_id": 1, timestamp: -1 });
db.events.createIndex({ "audit_source.source_record_id": 1 }, { unique: true });

// Retention management
db.events.createIndex({ "audit_source.retention_expires_at": 1 });

// === baseline_metadata ===
db.baseline_metadata.createIndex({ tenant_id: 1, source_system: 1, created_at: -1 });
db.baseline_metadata.createIndex({ tenant_id: 1, baseline_type: 1, created_at: -1 });
db.baseline_metadata.createIndex({ integrity_hash: 1 });

// === baseline_entities ===
db.baseline_entities.createIndex({ baseline_id: 1, entity_source_id: 1 }, { unique: true });
db.baseline_entities.createIndex({ baseline_id: 1, entity_type: 1 });
db.baseline_entities.createIndex({ tenant_id: 1, entity_id: 1, baseline_id: -1 });

// === execution_evidence ===
db.execution_evidence.createIndex({ tenant_id: 1, entity_id: 1, source_timestamp: -1 });
db.execution_evidence.createIndex({ tenant_id: 1, source_system: 1, source_record_id: 1 }, { unique: true });
db.execution_evidence.createIndex({ tenant_id: 1, evidence_type: 1, source_timestamp: -1 });

// === sync_cursors ===
db.sync_cursors.createIndex(
{ tenant_id: 1, source_system: 1, cursor_type: 1 },
{ unique: true }
);

// === connector_syncs ===
db.connector_syncs.createIndex({ tenant_id: 1, connector_type: 1, started_at: -1 });

// === findings ===
db.findings.createIndex({ tenant_id: 1, finding_type: 1, status: 1 });
db.findings.createIndex({ tenant_id: 1, entity_id: 1 });

// === evidence_packs ===
db.evidence_packs.createIndex({ tenant_id: 1, finding_id: 1, created_at: -1 });

// === execution_chains ===
db.execution_chains.createIndex({ tenant_id: 1, anchor_entity_id: 1 }, { unique: true });
db.execution_chains.createIndex({ tenant_id: 1, "entity_refs.entity_id": 1 });
db.execution_chains.createIndex({ tenant_id: 1, "summary.ownership_status": 1 });
db.execution_chains.createIndex({ tenant_id: 1, composition_hash: 1 });
db.execution_chains.createIndex({ tenant_id: 1, last_seen_at: 1 });

// === execution_chain_versions (Phase 2) ===
db.execution_chain_versions.createIndex({ chain_id: 1, version_number: -1 });
db.execution_chain_versions.createIndex({ tenant_id: 1, chain_id: 1, created_at: -1 });

// === execution_chain_events (Phase 2) ===
db.execution_chain_events.createIndex({ tenant_id: 1, chain_id: 1, timestamp: -1 });
db.execution_chain_events.createIndex({ tenant_id: 1, event_type: 1, timestamp: -1 });

Graph Query Patterns (Application-Level)

Pattern 1: Blast Radius (Materialized Paths)

Fast — pre-computed during sync, direct document read:

// "What can this identity reach?"
const identity = await db.entities.findOne({
_id: identityId,
tenant_id: tenantId
});

// execution_paths already contains the answer
const blastRadius = identity.execution_paths;
// Group by business_domain, sensitivity for risk assessment

Pattern 2: Reverse Query (Denormalized)

"All identities that can reach this resource":

const resource = await db.entities.findOne({
_id: resourceId,
tenant_id: tenantId
});

// accessible_by already contains the answer
const identitiesWithAccess = resource.accessible_by;

Pattern 3: On-Demand Path Computation

For paths not yet materialized or for ad-hoc queries:

async function computeExecutionPaths(identityId, tenantId, db) {
const identity = await db.entities.findOne({
_id: identityId,
tenant_id: tenantId
});

const roleIds = identity.relationships
.filter(r => r.type === "HAS_ROLE")
.map(r => r.target_id);

// Load roles
const roles = await db.entities.find({
_id: { $in: roleIds },
tenant_id: tenantId
}).toArray();

const paths = [];
for (const role of roles) {
const permissionIds = role.relationships
.filter(r => r.type === "GRANTS")
.map(r => r.target_id);

const permissions = await db.entities.find({
_id: { $in: permissionIds },
tenant_id: tenantId
}).toArray();

for (const perm of permissions) {
const resourceIds = perm.relationships
.filter(r => r.type === "APPLIES_TO")
.map(r => r.target_id);

const resources = await db.entities.find({
_id: { $in: resourceIds },
tenant_id: tenantId
}).toArray();

for (const resource of resources) {
paths.push({
resource_id: resource._id,
resource_name: resource.properties.resource_name,
business_domain: resource.properties.business_domain,
sensitivity: resource.properties.sensitivity,
via_roles: [role.properties.role_name],
actions: [perm.properties.normalized_action],
computed_at: new Date()
});
}
}
}

return paths;
}

Pattern 4: Point-in-Time Comparison

// "How has this identity's access changed in 6 months?"
const now = await db.entities.findOne({ _id: identityId, tenant_id: tenantId });
const then = await db.entity_versions.findOne({
entity_id: identityId,
tenant_id: tenantId,
valid_at: { $lte: sixMonthsAgo },
expired_at: { $gt: sixMonthsAgo }
});

// Diff execution_paths for scope drift detection
const addedPaths = now.execution_paths.filter(
p => !then.execution_paths.some(tp => tp.resource_id === p.resource_id)
);

Pattern 5: Drift Timeline

// "All role changes for identity X over 12 months"
const driftEvents = await db.events.find({
tenant_id: tenantId,
entity_id: identityId,
event_type: { $in: ["role_assigned", "role_revoked"] },
timestamp: { $gte: twelveMonthsAgo }
}).sort({ timestamp: 1 }).toArray();

Temporal Strategy

Audit-Log-Based Change Detection

Instead of detecting changes by diffing snapshots on every sync, the platform queries source system audit logs directly. This approach:

  • Reduces storage: Only stores actual changes (events), not full snapshots on every sync
  • Provides richer provenance: Captures who made each change, not just what changed
  • Handles infrequent scans: Works even with weekly or monthly sync schedules
  • Leverages source truth: Audit logs are the authoritative record of changes

How History Works

AUDIT-LOG SYNC LIFECYCLE
========================

1. CHECK BASELINE STATE
└── First sync? → Take initial baseline first
└── Retention approaching? → Archive/baseline before source purges

2. LOAD SYNC CURSOR
└── Get last successful sync timestamp
└── Calculate fetch window: cursor → now

3. EXTRACT AUDIT LOGS
Azure: GET /auditLogs/directoryAudits?$filter=activityDateTime ge {since}
ServiceNow: GET /api/now/table/sys_audit?sysparm_query=sys_updated_on>{since}

4. TRANSFORM TO EVENTS
└── Map source operations to EventType
└── Extract actor, change details, timestamps

5. CORRELATE TO ENTITIES
└── Match audit records to existing entities
└── Update relationships based on events
└── Mark identities for path recomputation

6. PERSIST
└── Append events (canonical record)
└── Update sync cursor
└── Recompute paths for affected identities

7. TRIGGER EVALUATION
└── Run findings against event patterns

Data Flow

  • Current state: Direct read from entities
  • Point-in-time state: Baseline + event replay (or cached entity_versions)
  • Change history: Query events by entity_id + time range (canonical source)
  • Drift timeline: Aggregate events by type and time window

Baseline Strategy

Baselines are taken:

  • Initial: On first sync for a tenant/source
  • Scheduled: Weekly by default (configurable)
  • Pre-retention: Before source audit logs expire (~90 days for Azure)
  • Manual: On-demand for investigations

Baselines enable efficient point-in-time reconstruction without replaying months of events.

Retention Management

Source systems have limited audit log retention (Azure: ~90 days, ServiceNow: configurable). The platform:

  1. Tracks retention window via sync_cursors.retention_expires_at
  2. Runs daily check for approaching retention limits
  3. Triggers baseline capture before window closes
  4. Archives old events to cold storage after 90 days (remains in MongoDB for 2 years via TTL)

Path Materialization

Execution paths are recomputed when events indicate relationship changes:

Event received (role_assigned | role_revoked | permission_granted | ...)
→ Identify affected identity
→ Traverse: identity → roles → permissions → resources
→ Update: identity.execution_paths array
→ Update: resource.accessible_by arrays (reverse lookup)

Path queries remain O(1) document reads. The cost is paid at sync time, not query time.

Externalization Strategy for Embedded Arrays

execution_paths and accessible_by are embedded arrays on entity documents for O(1) reads. At MVP scale this is safe, but high-fan-out entities could theoretically approach MongoDB's 16MB document limit.

Size analysis: Each ExecutionPath entry is ~200-500 bytes. An entity would need >30,000 paths to approach 16MB. At MVP scale (<1,000 identities), this is unrealistic even for admin-level service accounts.

Monitoring — size-based triggers: Track document sizes via Object.bsonsize() in sync post-processing. Log warnings when:

  • Any entity document exceeds 8MB (50% of limit)
  • Any entity has >1,000 execution_paths or accessible_by entries

Monitoring — operational triggers: Write amplification and sync latency can become bottlenecks before size limits are hit. Also monitor:

  • Path recompute duration per sync (p95) — warn at >30s for a single entity's paths
  • Total sync duration increase attributable to path materialization — warn at >2x baseline
  • Write amplification ratio — warn at >10 resource documents updated per identity path change (high fan-out indicator)

Either size-based or operational triggers can initiate the migration to overflow collections.

Fallback design (implement when any trigger fires):

  • execution_paths_overflow collection keyed by (tenant_id, identity_id) — holds the full paths array
  • accessible_by_overflow collection keyed by (tenant_id, resource_id) — holds the full reverse lookup array
  • Entity document retains a paths_externalized: true flag and path_count field
  • StorageAdapter transparently reads from overflow collection when flag is set

No implementation needed for MVP. The design is documented here so the team can act quickly when monitoring indicates need.


Multi-Tenancy

All collections use tenant_id as the leading field in compound indexes. Application layer injects tenant_id filter on every query.

Scaling Path

ScaleStrategy
< 50 tenantsShared cluster, field-based isolation
50-500 tenantsShard by tenant_id
500+ tenantsDatabase-per-tenant

Operational Considerations

Backups

  • MongoDB Atlas continuous backup, or mongodump for self-hosted
  • Point-in-time recovery for operational errors

Monitoring

  • Query performance (slow query log)
  • Collection sizes and index sizes
  • Replication lag (if replica set)
  • TTL deletion rate on events collection
  • Path recomputation duration per sync

Local Development

  • Docker Compose with single MongoDB container
  • Seed script populates mock data (NormalizedGraph snapshots)
  • No external dependencies beyond MongoDB

StorageAdapter Interface

The connector framework is database-agnostic. A StorageAdapter abstracts the storage layer, enabling future migration to a hybrid architecture without changing connectors:

interface StorageAdapter {
// Write path
applyChanges(tenantId: string, changes: ChangeSet): Promise<SyncResult>;

// Entity queries
getEntity(tenantId: string, entityId: string): Promise<Entity>;
getEntityVersion(tenantId: string, entityId: string, asOf: Date): Promise<Entity>;
getEvents(tenantId: string, query: EventQuery): Promise<Event[]>;

// Path queries
getExecutionPaths(tenantId: string, identityId: string): Promise<ExecutionPath[]>;
getAccessibleBy(tenantId: string, resourceId: string): Promise<AccessEntry[]>;
queryPaths(tenantId: string, query: PathQuery): Promise<Path[]>;
}

// MVP implementation
class MongoStorageAdapter implements StorageAdapter {
// All methods query MongoDB directly
// Path queries read from materialized execution_paths / accessible_by
}

// Future implementation (when scale requires it)
class MongoNeo4jStorageAdapter implements StorageAdapter {
// Entity/version/event queries → MongoDB
// Path queries → Neo4j traversal
// Write path → dual-write (MongoDB first, Neo4j second)
}

Future: Scaling with Neo4j

When the MongoDB-only approach hits its limits, add Neo4j as a thin graph index over MongoDB:

Migration Triggers

SignalThresholdWhat breaks
Identities per tenant> 10,000Path recomputation too slow
Path depth> 3-4 hopsApplication-level traversal impractical
Connectors per tenant> 5Cross-system paths create fan-out
Reverse query latency> acceptable SLAFull collection scans on accessible_by

What Changes

  • Neo4j stores thin nodes (id, type, status) and edges (relationship types)
  • MongoDB remains source of truth (no data removed)
  • Path queries route to Neo4j instead of reading materialized arrays
  • Materialized execution_paths and accessible_by can be dropped (Neo4j replaces them)
  • Write path becomes dual-write: MongoDB first, Neo4j second

What Stays the Same

  • StorageAdapter interface (connectors unchanged)
  • API contracts (consumers unchanged)
  • MongoDB schema (add Neo4j alongside, don't remove anything)
  • Temporal queries (still MongoDB entity_versions)

Migration Steps

  1. Deploy Neo4j alongside MongoDB
  2. Implement MongoNeo4jStorageAdapter
  3. Backfill Neo4j from current MongoDB entities + relationships
  4. Switch StorageAdapter implementation
  5. Validate path query results match materialized paths
  6. Optionally drop materialized execution_paths arrays (or keep as cache)

Open Questions

  1. Path recomputation scope: Recompute all tenant paths on any change, or only affected subgraph? Decided for MVP: Recompute paths for all identities present in the sync payload (not all tenant identities). This scopes recompute to affected entities per sync. Reverse-impact propagation (e.g., a role's GRANTS change → find all identities holding that role → recompute their paths) is deferred until monitoring shows path staleness causing incorrect findings. When path recompute duration exceeds the p95 threshold (>30s per entity), implement incremental subgraph recompute.
  2. MongoDB deployment: Atlas (managed, simpler) vs self-hosted (more control)?
  3. Version retention: Keep all versions forever, or prune after configurable window?
  4. Events TTL: 2 years default — configurable per tenant?
  5. Materialized path staleness: Acceptable between syncs? Between-sync staleness is accepted for MVP. Paths refresh on next connector sync. Real-time recompute on individual relationship changes is deferred.
  6. Evidence pack signing: Current integrity_hash provides tamper detection but not non-repudiation. KMS/HSM-backed detached signatures + WORM storage needed for NIST 800-53 AU-10/SI-7 compliance. Deferred until customer compliance requirements drive it.