Database Architecture

Context

SecurityV0 has two data workloads:

Rich entity storage — Full policy documents, allowlisted source metadata, versioned history, events, evidence packs
Graph traversal — Execution path queries: "What can this identity reach through which roles?"

For the MVP, both workloads run on MongoDB. Graph traversal uses application-level path computation with materialized execution paths on identity documents.

Decision

MongoDB as the single database for the MVP. All data — entities, relationships, versions, events, findings, evidence — lives in MongoDB.

Graph queries are handled through:

Materialized execution paths — pre-computed on each identity during sync
Denormalized reverse lookups — accessible_by arrays on resource documents
Application-level traversal — follow relationship references for on-demand path queries

When This Is Sufficient

< 1,000 identities per tenant
2-3 hop paths (Identity → Role → Permission → Resource)
Materialized paths recomputed on sync (acceptable latency)
2-3 connectors per tenant
Historical queries prioritized over deep graph discovery

When to Add Neo4j (Future)

10,000+ identities per tenant with 5+ connectors
5+ hop transitive chains (cross-system paths)
Real-time reverse queries at scale
Path recomputation becomes the sync bottleneck

See "Future: Scaling with Neo4j" section at the end of this document.

MongoDB Schema

Collections Overview

Collection	Purpose	Retention
`entities`	Current state of all graph objects	Permanent
`entity_versions`	Point-in-time snapshots (baseline-derived)	Configurable (default: 2 years)
`events`	Canonical audit log events	TTL (default: 2 years)
`baseline_metadata`	Baseline headers (counts, integrity hash, schedule)	Configurable (default: 1 year)
`baseline_entities`	Per-entity baseline snapshots	Configurable (default: 1 year)
`execution_evidence`	First-class execution proof records	Configurable (default: 2 years)
`execution_chains`	Platform-computed execution chain tracking	Permanent
`execution_chain_versions`	Chain state snapshots for temporal comparison (Phase 2)	Configurable (default: 2 years)
`execution_chain_events`	Discrete events for chain lifecycle (Phase 2)	Configurable (default: 2 years)
`sync_cursors`	Audit log sync progress tracking	Permanent
`connector_syncs`	Sync execution records	Permanent
`findings`	Triggered detections	Permanent
`evidence_packs`	Sealed finding artifacts	Permanent

Single `entities` Collection

All entity types (identity, workload, connection, credential, owner, role, permission, resource, execution_evidence) are stored in a single entities collection, discriminated by the entity_type field. The workload type replaces the former automation type (see ADR-010); the platform accepts both values during the migration period. This is the idiomatic MongoDB pattern for polymorphic documents — it minimizes query round-trips, simplifies operations, and aligns with the materialized execution paths strategy.

Full rationale: See ADR-002: Single Entities Collection

`entities` — Current State

Every node in the execution/authority graph is an entity document. The entity_type field determines the shape of properties.

{
    _id: "uuid-...",                     // Platform-generated UUID
    tenant_id: "uuid-...",
    entity_type: "identity",             // identity | workload | connection | credential | owner | role | permission | resource | execution_evidence (deprecated: "automation" accepted as alias for "workload")
    source_system: "entra_id",
    source_id: "sp-abc123",              // ID in source system

    // Type-specific data (flexible schema per entity_type)
    properties: {
        identity_type: "service_principal",
        display_name: "SN-Integration-Prod",
        execution_mode: "autonomous",
        status: "active",
        last_activity_at: ISODate("2026-01-22T14:00:00Z"),

        // Allowlisted source metadata (full response hashed, not stored)
        source_metadata: {
            appId: "a1b2c3d4-...",
            servicePrincipalType: "Application",
            signInAudience: "AzureADMyOrg",
            appOwnerOrganizationId: "tenant-xyz"
        },
        source_metadata_hash: "sha256:e5f6a7b8...",

        oauth_config: {
            grant_type: "client_credentials",
            resource: "https://instance.service-now.com",
            permissions_granted: ["Application.ReadWrite.All", "User.Read.All"]
        },
        credentials: [{
            type: "client_secret",
            expires_at: ISODate("2026-06-15"),
            key_id: "key-abc",
            created_at: ISODate("2024-06-15")
        }]
    },

    // Relationships to other entities
    relationships: [
        {
            type: "OWNED_BY",
            target_id: "uuid-owner-sarah",        // Owner entity (owner_type: "human")
            properties: {
                since: ISODate("2024-06-15"),
                until: ISODate("2025-11-15"),
                status: "decayed",
                ownership_level: "primary",       // primary | secondary | inherited
                assigned_by: "admin@corp.com"
            }
        },
        {
            type: "OWNED_BY",
            target_id: "uuid-owner-it-automation-team",  // Owner entity (owner_type: "team")
            properties: {
                since: ISODate("2024-06-15"),
                status: "active",
                ownership_level: "secondary",
                assigned_by: "admin@corp.com"
            }
        },
        {
            type: "AUTHENTICATES_TO",
            target_id: "uuid-identity-sn-integration-user",  // Identity in target system
            properties: {
                via_credential_id: "uuid-cred-client-secret",
                auth_protocol: "oauth2",
                target_system: "servicenow",
                trust_chain_position: 0,
                established_at: ISODate("2024-06-15"),
                last_used_at: ISODate("2026-01-22T14:00:00Z"),
                // Cross-system linkage proof (full issuer/tenant/instance context)
                evidence_references: {
                    issuing_system_id: "a1b2c3d4-...",           // Entra SP appId
                    issuing_tenant_id: "72f988bf-...",           // Entra tenant ID
                    target_system_id: "a1b2c3d4-...",            // ServiceNow oauth_entity.client_id
                    target_instance_id: "https://corp.service-now.com",  // ServiceNow instance
                    target_record_sys_id: "oauth-entity-xyz",    // oauth_entity sys_id
                    matching_field: "client_id",
                    matching_value: "a1b2c3d4-...",
                    target_user_binding: "oauth_entity.user -> sys_user.user_name"
                }
            }
        },
        {
            type: "HAS_ROLE",
            target_id: "uuid-role-itil",
            properties: {
                granted_at: ISODate("2024-06-15"),
                granted_by: "admin@corp.com"
            }
        },
        {
            type: "HAS_ROLE",
            target_id: "uuid-role-hr-admin",
            properties: {
                granted_at: ISODate("2025-03-15"),
                granted_by: "bob.chen@corp.com"
            }
        }
    ],

    // Materialized execution paths (pre-computed during sync)
    // For cross-system paths, includes source_system, auth_chain_depth, and via_identity
    execution_paths: [
        {
            resource_id: "uuid-res-hr-case",
            resource_name: "hr_case",
            business_domain: "hr",
            sensitivity: "confidential",
            via_roles: ["hr_admin"],
            actions: ["create", "read", "update", "delete"],
            source_system: "servicenow",           // Which system the resource lives in
            auth_chain_depth: 1,                   // 1 = reached via AUTHENTICATES_TO hop
            via_identity: "uuid-identity-sn-integration-user",  // Target system identity
            computed_at: ISODate("2026-01-22T15:00:00Z")
        },
        {
            resource_id: "uuid-res-incident",
            resource_name: "incident",
            business_domain: "it_ops",
            sensitivity: "internal",
            via_roles: ["itil"],
            actions: ["read", "update"],
            source_system: "servicenow",
            auth_chain_depth: 1,
            via_identity: "uuid-identity-sn-integration-user",
            computed_at: ISODate("2026-01-22T15:00:00Z")
        }
    ],

    // Sync metadata
    sync_version: 42,
    last_synced_at: ISODate("2026-01-22T15:00:00Z"),
    created_at: ISODate("2024-06-15T10:00:00Z"),
    updated_at: ISODate("2026-01-22T15:00:00Z")
}

Resource entity (with reverse lookup)

{
    _id: "uuid-res-hr-case",
    tenant_id: "uuid-...",
    entity_type: "resource",
    source_system: "servicenow",
    source_id: "table-hr_case",

    properties: {
        resource_type: "table",
        resource_name: "hr_case",
        business_domain: "hr",
        sensitivity: "confidential",
        contains_pii: true,

        // Allowlisted source metadata
        source_metadata: {
            name: "hr_case",
            label: "HR Case",
            super_class: "task",
            is_extendable: true
        },
        source_metadata_hash: "sha256:b2c3d4e5...",
        acl_details: [{ /* individual ACL records */ }]
    },

    // Relationships
    relationships: [
        { type: "ACCESSIBLE_VIA", target_id: "uuid-perm-hr-case-write", properties: {} }
    ],

    // Denormalized reverse lookup: which identities can reach this resource
    accessible_by: [
        {
            identity_id: "uuid-sp-sn-integration",
            identity_name: "SN-Integration-Prod",
            via_roles: ["hr_admin"],
            actions: ["create", "read", "update", "delete"],
            computed_at: ISODate("2026-01-22T15:00:00Z")
        }
    ],

    sync_version: 40,
    last_synced_at: ISODate("2026-01-22T15:00:00Z"),
    created_at: ISODate("2024-01-10T00:00:00Z"),
    updated_at: ISODate("2026-01-22T15:00:00Z")
}

`entity_versions` — Point-in-Time Snapshots (Reduced Role)

With audit-log-based change detection, entity_versions is no longer populated on every sync. Instead, versions are stored:

From baselines: When a baseline snapshot is created
On significant changes: Changes that trigger findings (ownership decay, scope drift)
On explicit request: When users request point-in-time snapshots for investigations

Most historical queries are served by baseline + event replay. This collection acts as a cache for frequently-accessed historical states.

{
    _id: ObjectId,
    entity_id: "uuid-...",               // References entities._id
    tenant_id: "uuid-...",
    valid_at: ISODate("2025-07-22T00:00:00Z"),   // When this version became active
    expired_at: ISODate("2025-11-15T00:00:00Z"), // When superseded (null = current)
    sync_version: 38,

    // Full snapshot of entity at this point in time
    entity_type: "identity",
    properties: { /* full properties as they were */ },
    relationships: [ /* relationships as they were */ ],
    execution_paths: [ /* paths as they were */ ],

    // Version source tracking
    baseline_id: ObjectId("..."),        // If derived from a baseline
    derived_from_events: true,           // If reconstructed via event replay
    reconstruction_timestamp: ISODate("2026-01-22T10:00:00Z")  // When reconstructed
}

Point-in-time query — hybrid approach:

// "What did this identity look like 6 months ago?"
// 1. Fast path: check for direct snapshot
const snapshot = db.entity_versions.findOne({
    entity_id: "uuid-sp-abc123",
    tenant_id: tenantId,
    valid_at: { $lte: sixMonthsAgo },
    expired_at: { $gt: sixMonthsAgo }
});

// 2. Slow path: find baseline + replay events
if (!snapshot) {
    const baselineMeta = db.baseline_metadata.findOne({
        tenant_id: tenantId,
        source_system: "entra_id",
        created_at: { $lte: sixMonthsAgo }
    }, { sort: { created_at: -1 } });

    const baselineEntity = db.baseline_entities.findOne({
        baseline_id: baselineMeta._id,
        entity_source_id: "sp-abc123"
    });

    const events = db.events.find({
        entity_id: "uuid-sp-abc123",
        timestamp: { $gt: baselineMeta.created_at, $lte: sixMonthsAgo }
    }).sort({ timestamp: 1 });

    // Replay events onto baseline entity to reconstruct state
}

`baseline_metadata` + `baseline_entities` — Periodic Snapshots

Baselines capture the complete state of all entities at a point in time. Created weekly (default) or on-demand. Stored as two collections to avoid MongoDB's 16MB document limit — even modest tenants (500 identities, 200 roles, 1000 permissions) can exceed this when relationships and properties are included.

`baseline_metadata` — Baseline headers

{
    _id: ObjectId,
    tenant_id: "uuid-...",
    baseline_type: "scheduled",          // initial | scheduled | manual | pre_retention_expiry
    created_at: ISODate("2026-01-22T00:00:00Z"),
    source_system: "entra_id",           // One baseline per source system

    // Summary statistics
    entity_counts: {
        identities: 150,
        automations: 92,
        connections: 15,
        credentials: 75,
        owners: 85,
        roles: 42,
        permissions: 210,
        resources: 180,
        execution_evidence: 340
    },

    // Baseline chain
    previous_baseline_id: ObjectId("..."),  // For diff computation
    schedule_id: "weekly-entra",            // If from scheduled job

    // Integrity verification (hash of all entity_hashes)
    integrity_hash: "sha256:a4f8c2e1...",
    schema_version: "1.0"
}

`baseline_entities` — Per-entity snapshots

One document per entity per baseline. Enables partial retrieval (diff a single entity without loading the full baseline).

{
    _id: ObjectId,
    baseline_id: ObjectId("..."),        // References baseline_metadata._id
    tenant_id: "uuid-...",
    entity_source_id: "sp-abc123",       // Source system entity ID
    entity_id: "uuid-...",               // References entities._id
    entity_type: "identity",

    // Full snapshot of this entity at baseline time
    properties: { /* full properties */ },
    relationships: [ /* all relationships */ ],
    execution_paths: [ /* computed paths, if identity */ ],

    // Checksum for integrity verification
    entity_hash: "sha256:..."
}

`sync_cursors` — Audit Log Sync Progress

Tracks where each connector left off when fetching audit logs, enabling incremental syncs.

{
    _id: ObjectId,
    tenant_id: "uuid-...",
    source_system: "entra_id",
    cursor_type: "audit_log",

    // Source-specific cursor state
    cursor_state: {
        // For Azure Entra ID (directoryAudits)
        last_activity_datetime: ISODate("2026-01-22T15:00:00Z"),
        skip_token: "...",               // For pagination

        // For ServiceNow (sys_audit)
        last_sys_updated_on: ISODate("2026-01-22T15:00:00Z"),
        last_sys_id: "audit-record-id"
    },

    // Sync metadata
    last_successful_sync: ISODate("2026-01-22T16:00:00Z"),
    last_sync_id: "uuid-sync-...",
    events_fetched_count: 147,

    // Retention tracking
    retention_warning_sent: false,
    retention_expires_at: ISODate("2026-04-22T00:00:00Z")  // ~90 days from oldest unfetched
}

`events` — Canonical Audit Log

The events collection is the canonical source of truth for changes. Events are sourced from audit log APIs (Azure directoryAudits, ServiceNow sys_audit) rather than computed by diffing snapshots.

{
    _id: ObjectId,
    tenant_id: "uuid-...",
    timestamp: ISODate("2026-01-22T15:00:00Z"),
    entity_id: "uuid-...",
    entity_type: "identity",

    // Event classification
    event_type: "role_assigned",         // See Event Types below

    // Audit log provenance (source system attribution)
    audit_source: {
        source_api: "directoryAudits",   // Azure: 'directoryAudits' | ServiceNow: 'sys_audit'
        source_record_id: "audit-abc123", // Original audit record ID in source
        source_timestamp: ISODate("2026-01-22T15:00:00Z"),  // When change actually occurred
        fetched_at: ISODate("2026-01-22T16:00:00Z"),        // When we retrieved this
        retention_expires_at: ISODate("2026-04-22T00:00:00Z") // When source will purge (~90 days)
    },

    // Actor who made the change
    actor: {
        actor_id: "user-uuid-...",
        actor_type: "user",              // user | service_principal | system | unknown
        actor_display_name: "bob.chen@corp.com",
        actor_source_system: "entra_id"
    },

    // Change details
    change_details: {
        operation: "Add member to role",  // Source-specific operation name
        field_changes: [                  // For property changes
            {
                field_name: "relationships",
                old_value: null,
                new_value: { role_id: "uuid-role-hr-admin", role_name: "hr_admin" }
            }
        ],
        target_resources: [               // Affected entities
            {
                id: "uuid-role-hr-admin",
                type: "role",
                display_name: "hr_admin"
            }
        ],
        additional_details: {
            granted_by: "bob.chen@corp.com",
            approval_ticket: "JIRA-IT-4521"  // If available
        }
    },

    // Sync context
    source_connector: "entra_id",
    sync_id: "uuid-sync-..."
}

Event Types

Event Type	Description	Triggers
`created`	Entity created	Finding evaluation
`updated`	Property changed	-
`deleted`	Entity removed	Finding evaluation
`role_assigned`	Role granted to identity	Scope drift detection
`role_revoked`	Role removed from identity	-
`permission_granted`	Direct permission granted	Scope drift detection
`permission_revoked`	Direct permission removed	-
`owner_assigned`	Owner added to identity	Ownership tracking
`owner_removed`	Owner removed from identity	Orphaned ownership detection
`status_changed`	Entity status transition (active → disabled)	Ownership decay detection
`credential_created`	New credential issued	-
`credential_rotated`	Credential refreshed	-
`credential_expired`	Credential expired	Credential hygiene detection
`credential_deleted`	Credential removed	-
`relationship_added`	Generic relationship added	-
`relationship_removed`	Generic relationship removed	-

`connector_syncs` — Sync Records

{
    _id: "uuid-sync-...",
    tenant_id: "uuid-...",
    connector_type: "entra_id",
    started_at: ISODate,
    completed_at: ISODate,
    status: "completed",                 // running | completed | failed | partial

    // Sync mode determines data source
    sync_mode: "audit_log",              // full | incremental | audit_log

    // For audit_log mode
    audit_log_window: {
        since: ISODate("2026-01-15T00:00:00Z"),
        until: ISODate("2026-01-22T15:00:00Z"),
        source_api: "directoryAudits"
    },

    metrics: {
        // Audit log metrics
        audit_records_fetched: 147,
        events_created: 89,
        events_deduplicated: 12,         // Already had these events

        // Entity metrics
        entities_affected: 23,
        entities_created: 3,
        entities_updated: 20,
        relationships_added: 15,
        relationships_removed: 4,

        // Path metrics
        paths_recomputed: 18
    },

    error: null,
    cursor_state_after: {                // For cursor tracking
        last_activity_datetime: ISODate("2026-01-22T15:00:00Z")
    }
}

`findings` — Triggered Detections

{
    _id: "uuid-finding-...",
    tenant_id: "uuid-...",
    finding_type: "orphaned_ownership",
    status: "active",                    // active | acknowledged | remediated | false_positive
    detected_at: ISODate,
    last_evaluated_at: ISODate,

    entity_id: "uuid-sp-abc123",
    affected_resources: ["uuid-res-hr-case", "uuid-res-incident"],

    deterministic_explanation: "Service principal 'SN-Integration-Prod' has no active owner. Previous owner Jane Doe (jane.doe@corp.com) was disabled on 2025-11-15. SP continues executing with 3 roles granting access to HR and IT domains.",

    evidence_refs: {
        ownership_events: ["event-id-1", "event-id-2"],
        current_paths: 2,
        last_activity: ISODate("2026-01-22T14:00:00Z")
    },

    evidence_pack_id: "uuid-pack-..."
}

`evidence_packs` — Sealed Artifacts

{
    _id: "uuid-pack-...",
    tenant_id: "uuid-...",
    finding_id: "uuid-finding-...",
    created_at: ISODate,
    sealed_at: ISODate,
    schema_version: "1.0",

    content: {
        identity_summary: { /* ... */ },
        authority_snapshot: { /* ... */ },
        ownership_timeline: [ /* ... */ ],
        blast_radius: { /* ... */ },
        remediation: { /* ... */ }
    },

    content_markdown: "...",
    integrity_hash: "sha256:...",
    previous_pack_id: null
}

Integrity model — current scope and known gap: The integrity_hash is a SHA-256 digest of the pack's content (including tenant_id, finding_id, schema_version, and all content fields). It provides tamper detection — any accidental corruption or replay inconsistency is detectable by recomputing the hash. However, since the hash is stored alongside the content in the same mutable MongoDB collection, it does not provide non-repudiation — an actor with database write access could rewrite both content and hash. For compliance-grade deployments (NIST 800-53 AU-10, SI-7), evidence packs should be signed with a KMS/HSM-backed key (detached signature + certificate chain) and stored in immutable/WORM storage. This is deferred until customer compliance requirements drive it — see Open Question #6.

`execution_evidence` — First-Class Execution Proof

Immutable records proving that an autonomous identity actually executed an action. Links source system log entries to the identity graph. Without these records, EXECUTES_ON edges are claims without proof.

{
    _id: ObjectId,
    tenant_id: "uuid-...",
    entity_id: "uuid-...",              // Identity that executed
    source_system: "servicenow",

    // Source provenance
    source_table: "syslog_transaction", // Where evidence came from
    source_record_id: "txn-abc123",     // ID in source table (for verification)
    source_timestamp: ISODate("2026-01-22T14:00:00Z"),  // When execution occurred

    // Evidence content
    evidence_type: "api_call",          // api_call | flow_execution | scheduled_job | sign_in
    action: "POST /api/now/table/incident",
    target_resource: "incident",
    outcome: "success",                 // success | failure | unknown

    // Integrity (no raw data stored — hash proves content hasn't changed)
    payload_hash: "sha256:...",

    // Sync context
    fetched_at: ISODate("2026-01-22T16:00:00Z"),
    sync_id: "uuid-sync-...",
    sync_version: "sync-v4"
}

Evidence type examples:

Type	Source Table	What It Proves
`api_call`	`syslog_transaction`	Identity executed inbound REST API call
`flow_execution`	`sys_flow_context`	Flow Designer automation ran
`scheduled_job`	`sys_trigger`	Scheduled job executed
`sign_in`	Azure `signIns`	Service principal authenticated

`execution_chains` — Platform-Computed Chain Tracking

Execution chains are platform-computed ordered sets of entities from trigger to destination. Chain identity is anchored to the entry point entity (e.g., a Business Rule sys_id), which means chains survive entity rotation (OAuth client_id changes, SP credential rotation). The composition_hash is a SHA256 of sorted entity_id:role pairs — when this hash changes between syncs, the chain structure has changed.

W1 scope note: W1 (Exposure wedge) does NOT require execution_chains persistence, chain versioning, composition fingerprints, or temporal drift analysis. W1 derives Authority Paths and Exposures from entity relationships and execution_paths[] directly. The execution_chains collection is used for operational tracking and future wedges. See 01-data-model.md W1 Derived Concepts.

// execution_chains — platform-computed execution chain tracking
{
  _id: "chain-uuid",
  tenant_id: String,
  name: String,
  anchor_entity_id: String,          // Entry point entity ID (stable root)
  entity_refs: [{
    entity_id: String,
    entity_type: String,             // identity | workload | connection | credential | owner | role | permission | resource | execution_evidence
    role: String                      // entry_point | code_component | outbound_target | auth_credential | destination_identity | trigger_resource | data_resource
  }],
  summary: {
    trigger: String,
    destination: String,
    egress_category: String,
    blast_radius_domains: [String],
    ownership_status: String,
    total_roles: Number,
    max_sensitivity: String,
    canonical_permissions: {
      reads: [String],
      writes: [String]
    }
  },
  composition_hash: String,          // SHA256 fingerprint for change detection
  first_detected_at: Date,
  last_seen_at: Date,
  sync_version: Number
}

`execution_chain_versions` — Chain State Snapshots (Phase 2)

Phase 2 deliverable. This collection will be implemented during Phase 2 temporal chain tracking (see implementation plan B.6). Section header reserved during Phase 1.

// execution_chain_versions — chain state snapshots for temporal comparison
{
  _id: "version-uuid",
  chain_id: String,                      // FK to execution_chains._id
  tenant_id: String,
  version_number: Number,
  entity_refs: [{ entity_id: String, entity_type: String, role: String }],
  summary: { /* snapshot at this version */ },
  composition_hash: String,
  created_at: Date,
  sync_version: Number,
  diff_from_previous: {
    entities_added: [{ entity_id: String, entity_type: String, role: String }],
    entities_removed: [{ entity_id: String, entity_type: String, role: String }],
    summary_changes: Object
  }
}

`execution_chain_events` — Chain Lifecycle Events (Phase 2)

Phase 2 deliverable. This collection will be implemented during Phase 2 temporal chain tracking (see implementation plan B.6). Section header reserved during Phase 1.

// execution_chain_events — discrete events for chain lifecycle
{
  _id: "event-uuid",
  chain_id: String,
  tenant_id: String,
  event_type: String,   // chain_created | chain_entity_added | chain_entity_removed
                        // | chain_blast_radius_changed | chain_ownership_changed
  timestamp: Date,
  details: Object,
  sync_version: Number
}

Indexes

// === entities ===
db.entities.createIndex({ tenant_id: 1, entity_type: 1, "properties.status": 1 });
db.entities.createIndex({ tenant_id: 1, source_system: 1, source_id: 1 }, { unique: true });
db.entities.createIndex({ tenant_id: 1, entity_type: 1, "properties.last_activity_at": -1 });
db.entities.createIndex({ tenant_id: 1, "relationships.target_id": 1 });
db.entities.createIndex({ tenant_id: 1, entity_type: 1, "properties.identity_type": 1 });

// For reverse path queries
db.entities.createIndex({ tenant_id: 1, "execution_paths.resource_id": 1 });
db.entities.createIndex({ tenant_id: 1, "accessible_by.identity_id": 1 });

// For orphaned ownership trigger (checks all ownership levels)
db.entities.createIndex({
    tenant_id: 1,
    entity_type: 1,
    "relationships.type": 1,
    "relationships.properties.status": 1,
    "relationships.properties.ownership_level": 1
});

// For owner hierarchy queries (team → BU → org)
db.entities.createIndex({
    tenant_id: 1,
    entity_type: 1,
    "properties.owner_type": 1,
    "properties.status": 1
});

// For cross-system auth path queries
db.entities.createIndex({
    tenant_id: 1,
    "relationships.type": 1,
    "relationships.properties.target_system": 1
});

// === entity_versions ===
db.entity_versions.createIndex({ entity_id: 1, valid_at: 1, expired_at: 1 });
db.entity_versions.createIndex({ tenant_id: 1, entity_id: 1, sync_version: -1 });

// === events ===
db.events.createIndex({ tenant_id: 1, entity_id: 1, timestamp: -1 });
db.events.createIndex({ tenant_id: 1, event_type: 1, timestamp: -1 });
db.events.createIndex({ sync_id: 1 });
db.events.createIndex({ timestamp: 1 }, { expireAfterSeconds: 63072000 }); // 2-year TTL

// Audit log provenance queries
db.events.createIndex({ tenant_id: 1, "audit_source.source_api": 1, timestamp: -1 });
db.events.createIndex({ tenant_id: 1, "actor.actor_id": 1, timestamp: -1 });
db.events.createIndex({ "audit_source.source_record_id": 1 }, { unique: true });

// Retention management
db.events.createIndex({ "audit_source.retention_expires_at": 1 });

// === baseline_metadata ===
db.baseline_metadata.createIndex({ tenant_id: 1, source_system: 1, created_at: -1 });
db.baseline_metadata.createIndex({ tenant_id: 1, baseline_type: 1, created_at: -1 });
db.baseline_metadata.createIndex({ integrity_hash: 1 });

// === baseline_entities ===
db.baseline_entities.createIndex({ baseline_id: 1, entity_source_id: 1 }, { unique: true });
db.baseline_entities.createIndex({ baseline_id: 1, entity_type: 1 });
db.baseline_entities.createIndex({ tenant_id: 1, entity_id: 1, baseline_id: -1 });

// === execution_evidence ===
db.execution_evidence.createIndex({ tenant_id: 1, entity_id: 1, source_timestamp: -1 });
db.execution_evidence.createIndex({ tenant_id: 1, source_system: 1, source_record_id: 1 }, { unique: true });
db.execution_evidence.createIndex({ tenant_id: 1, evidence_type: 1, source_timestamp: -1 });

// === sync_cursors ===
db.sync_cursors.createIndex(
    { tenant_id: 1, source_system: 1, cursor_type: 1 },
    { unique: true }
);

// === connector_syncs ===
db.connector_syncs.createIndex({ tenant_id: 1, connector_type: 1, started_at: -1 });

// === findings ===
db.findings.createIndex({ tenant_id: 1, finding_type: 1, status: 1 });
db.findings.createIndex({ tenant_id: 1, entity_id: 1 });

// === evidence_packs ===
db.evidence_packs.createIndex({ tenant_id: 1, finding_id: 1, created_at: -1 });

// === execution_chains ===
db.execution_chains.createIndex({ tenant_id: 1, anchor_entity_id: 1 }, { unique: true });
db.execution_chains.createIndex({ tenant_id: 1, "entity_refs.entity_id": 1 });
db.execution_chains.createIndex({ tenant_id: 1, "summary.ownership_status": 1 });
db.execution_chains.createIndex({ tenant_id: 1, composition_hash: 1 });
db.execution_chains.createIndex({ tenant_id: 1, last_seen_at: 1 });

// === execution_chain_versions (Phase 2) ===
db.execution_chain_versions.createIndex({ chain_id: 1, version_number: -1 });
db.execution_chain_versions.createIndex({ tenant_id: 1, chain_id: 1, created_at: -1 });

// === execution_chain_events (Phase 2) ===
db.execution_chain_events.createIndex({ tenant_id: 1, chain_id: 1, timestamp: -1 });
db.execution_chain_events.createIndex({ tenant_id: 1, event_type: 1, timestamp: -1 });

Graph Query Patterns (Application-Level)

Pattern 1: Blast Radius (Materialized Paths)

Fast — pre-computed during sync, direct document read:

// "What can this identity reach?"
const identity = await db.entities.findOne({
    _id: identityId,
    tenant_id: tenantId
});

// execution_paths already contains the answer
const blastRadius = identity.execution_paths;
// Group by business_domain, sensitivity for risk assessment

Pattern 2: Reverse Query (Denormalized)

"All identities that can reach this resource":

const resource = await db.entities.findOne({
    _id: resourceId,
    tenant_id: tenantId
});

// accessible_by already contains the answer
const identitiesWithAccess = resource.accessible_by;

Pattern 3: On-Demand Path Computation

For paths not yet materialized or for ad-hoc queries:

async function computeExecutionPaths(identityId, tenantId, db) {
    const identity = await db.entities.findOne({
        _id: identityId,
        tenant_id: tenantId
    });

    const roleIds = identity.relationships
        .filter(r => r.type === "HAS_ROLE")
        .map(r => r.target_id);

    // Load roles
    const roles = await db.entities.find({
        _id: { $in: roleIds },
        tenant_id: tenantId
    }).toArray();

    const paths = [];
    for (const role of roles) {
        const permissionIds = role.relationships
            .filter(r => r.type === "GRANTS")
            .map(r => r.target_id);

        const permissions = await db.entities.find({
            _id: { $in: permissionIds },
            tenant_id: tenantId
        }).toArray();

        for (const perm of permissions) {
            const resourceIds = perm.relationships
                .filter(r => r.type === "APPLIES_TO")
                .map(r => r.target_id);

            const resources = await db.entities.find({
                _id: { $in: resourceIds },
                tenant_id: tenantId
            }).toArray();

            for (const resource of resources) {
                paths.push({
                    resource_id: resource._id,
                    resource_name: resource.properties.resource_name,
                    business_domain: resource.properties.business_domain,
                    sensitivity: resource.properties.sensitivity,
                    via_roles: [role.properties.role_name],
                    actions: [perm.properties.normalized_action],
                    computed_at: new Date()
                });
            }
        }
    }

    return paths;
}

Pattern 4: Point-in-Time Comparison

// "How has this identity's access changed in 6 months?"
const now = await db.entities.findOne({ _id: identityId, tenant_id: tenantId });
const then = await db.entity_versions.findOne({
    entity_id: identityId,
    tenant_id: tenantId,
    valid_at: { $lte: sixMonthsAgo },
    expired_at: { $gt: sixMonthsAgo }
});

// Diff execution_paths for scope drift detection
const addedPaths = now.execution_paths.filter(
    p => !then.execution_paths.some(tp => tp.resource_id === p.resource_id)
);

Pattern 5: Drift Timeline

// "All role changes for identity X over 12 months"
const driftEvents = await db.events.find({
    tenant_id: tenantId,
    entity_id: identityId,
    event_type: { $in: ["role_assigned", "role_revoked"] },
    timestamp: { $gte: twelveMonthsAgo }
}).sort({ timestamp: 1 }).toArray();

Temporal Strategy

Audit-Log-Based Change Detection

Instead of detecting changes by diffing snapshots on every sync, the platform queries source system audit logs directly. This approach:

Reduces storage: Only stores actual changes (events), not full snapshots on every sync
Provides richer provenance: Captures who made each change, not just what changed
Handles infrequent scans: Works even with weekly or monthly sync schedules
Leverages source truth: Audit logs are the authoritative record of changes

How History Works

AUDIT-LOG SYNC LIFECYCLE
========================

1. CHECK BASELINE STATE
   └── First sync? → Take initial baseline first
   └── Retention approaching? → Archive/baseline before source purges

2. LOAD SYNC CURSOR
   └── Get last successful sync timestamp
   └── Calculate fetch window: cursor → now

3. EXTRACT AUDIT LOGS
   Azure:      GET /auditLogs/directoryAudits?$filter=activityDateTime ge {since}
   ServiceNow: GET /api/now/table/sys_audit?sysparm_query=sys_updated_on>{since}

4. TRANSFORM TO EVENTS
   └── Map source operations to EventType
   └── Extract actor, change details, timestamps

5. CORRELATE TO ENTITIES
   └── Match audit records to existing entities
   └── Update relationships based on events
   └── Mark identities for path recomputation

6. PERSIST
   └── Append events (canonical record)
   └── Update sync cursor
   └── Recompute paths for affected identities

7. TRIGGER EVALUATION
   └── Run findings against event patterns

Data Flow

Current state: Direct read from entities
Point-in-time state: Baseline + event replay (or cached entity_versions)
Change history: Query events by entity_id + time range (canonical source)
Drift timeline: Aggregate events by type and time window

Baseline Strategy

Baselines are taken:

Initial: On first sync for a tenant/source
Scheduled: Weekly by default (configurable)
Pre-retention: Before source audit logs expire (~90 days for Azure)
Manual: On-demand for investigations

Baselines enable efficient point-in-time reconstruction without replaying months of events.

Retention Management

Source systems have limited audit log retention (Azure: ~90 days, ServiceNow: configurable). The platform:

Tracks retention window via sync_cursors.retention_expires_at
Runs daily check for approaching retention limits
Triggers baseline capture before window closes
Archives old events to cold storage after 90 days (remains in MongoDB for 2 years via TTL)

Path Materialization

Execution paths are recomputed when events indicate relationship changes:

Event received (role_assigned | role_revoked | permission_granted | ...)
    → Identify affected identity
    → Traverse: identity → roles → permissions → resources
    → Update: identity.execution_paths array
    → Update: resource.accessible_by arrays (reverse lookup)

Path queries remain O(1) document reads. The cost is paid at sync time, not query time.

Externalization Strategy for Embedded Arrays

execution_paths and accessible_by are embedded arrays on entity documents for O(1) reads. At MVP scale this is safe, but high-fan-out entities could theoretically approach MongoDB's 16MB document limit.

Size analysis: Each ExecutionPath entry is ~200-500 bytes. An entity would need >30,000 paths to approach 16MB. At MVP scale (<1,000 identities), this is unrealistic even for admin-level service accounts.

Monitoring — size-based triggers: Track document sizes via Object.bsonsize() in sync post-processing. Log warnings when:

Any entity document exceeds 8MB (50% of limit)
Any entity has >1,000 execution_paths or accessible_by entries

Monitoring — operational triggers: Write amplification and sync latency can become bottlenecks before size limits are hit. Also monitor:

Path recompute duration per sync (p95) — warn at >30s for a single entity's paths
Total sync duration increase attributable to path materialization — warn at >2x baseline
Write amplification ratio — warn at >10 resource documents updated per identity path change (high fan-out indicator)

Either size-based or operational triggers can initiate the migration to overflow collections.

Fallback design (implement when any trigger fires):

execution_paths_overflow collection keyed by (tenant_id, identity_id) — holds the full paths array
accessible_by_overflow collection keyed by (tenant_id, resource_id) — holds the full reverse lookup array
Entity document retains a paths_externalized: true flag and path_count field
StorageAdapter transparently reads from overflow collection when flag is set

No implementation needed for MVP. The design is documented here so the team can act quickly when monitoring indicates need.

Multi-Tenancy

All collections use tenant_id as the leading field in compound indexes. Application layer injects tenant_id filter on every query.

Scaling Path

Scale	Strategy
< 50 tenants	Shared cluster, field-based isolation
50-500 tenants	Shard by tenant_id
500+ tenants	Database-per-tenant

Operational Considerations

Backups

MongoDB Atlas continuous backup, or mongodump for self-hosted
Point-in-time recovery for operational errors

Monitoring

Query performance (slow query log)
Collection sizes and index sizes
Replication lag (if replica set)
TTL deletion rate on events collection
Path recomputation duration per sync

Local Development

Docker Compose with single MongoDB container
Seed script populates mock data (NormalizedGraph snapshots)
No external dependencies beyond MongoDB

StorageAdapter Interface

The connector framework is database-agnostic. A StorageAdapter abstracts the storage layer, enabling future migration to a hybrid architecture without changing connectors:

interface StorageAdapter {
    // Write path
    applyChanges(tenantId: string, changes: ChangeSet): Promise<SyncResult>;

    // Entity queries
    getEntity(tenantId: string, entityId: string): Promise<Entity>;
    getEntityVersion(tenantId: string, entityId: string, asOf: Date): Promise<Entity>;
    getEvents(tenantId: string, query: EventQuery): Promise<Event[]>;

    // Path queries
    getExecutionPaths(tenantId: string, identityId: string): Promise<ExecutionPath[]>;
    getAccessibleBy(tenantId: string, resourceId: string): Promise<AccessEntry[]>;
    queryPaths(tenantId: string, query: PathQuery): Promise<Path[]>;
}

// MVP implementation
class MongoStorageAdapter implements StorageAdapter {
    // All methods query MongoDB directly
    // Path queries read from materialized execution_paths / accessible_by
}

// Future implementation (when scale requires it)
class MongoNeo4jStorageAdapter implements StorageAdapter {
    // Entity/version/event queries → MongoDB
    // Path queries → Neo4j traversal
    // Write path → dual-write (MongoDB first, Neo4j second)
}

Future: Scaling with Neo4j

When the MongoDB-only approach hits its limits, add Neo4j as a thin graph index over MongoDB:

Migration Triggers

Signal	Threshold	What breaks
Identities per tenant	> 10,000	Path recomputation too slow
Path depth	> 3-4 hops	Application-level traversal impractical
Connectors per tenant	> 5	Cross-system paths create fan-out
Reverse query latency	> acceptable SLA	Full collection scans on `accessible_by`

What Changes

Neo4j stores thin nodes (id, type, status) and edges (relationship types)
MongoDB remains source of truth (no data removed)
Path queries route to Neo4j instead of reading materialized arrays
Materialized execution_paths and accessible_by can be dropped (Neo4j replaces them)
Write path becomes dual-write: MongoDB first, Neo4j second

What Stays the Same

StorageAdapter interface (connectors unchanged)
API contracts (consumers unchanged)
MongoDB schema (add Neo4j alongside, don't remove anything)
Temporal queries (still MongoDB entity_versions)

Migration Steps

Deploy Neo4j alongside MongoDB
Implement MongoNeo4jStorageAdapter
Backfill Neo4j from current MongoDB entities + relationships
Switch StorageAdapter implementation
Validate path query results match materialized paths
Optionally drop materialized execution_paths arrays (or keep as cache)

Open Questions

Path recomputation scope: ~~Recompute all tenant paths on any change, or only affected subgraph?~~ Decided for MVP: Recompute paths for all identities present in the sync payload (not all tenant identities). This scopes recompute to affected entities per sync. Reverse-impact propagation (e.g., a role's GRANTS change → find all identities holding that role → recompute their paths) is deferred until monitoring shows path staleness causing incorrect findings. When path recompute duration exceeds the p95 threshold (>30s per entity), implement incremental subgraph recompute.
MongoDB deployment: Atlas (managed, simpler) vs self-hosted (more control)?
Version retention: Keep all versions forever, or prune after configurable window?
Events TTL: 2 years default — configurable per tenant?
Materialized path staleness: Acceptable between syncs? Between-sync staleness is accepted for MVP. Paths refresh on next connector sync. Real-time recompute on individual relationship changes is deferred.
Evidence pack signing: Current integrity_hash provides tamper detection but not non-repudiation. KMS/HSM-backed detached signatures + WORM storage needed for NIST 800-53 AU-10/SI-7 compliance. Deferred until customer compliance requirements drive it.

Context​

Decision​

When This Is Sufficient​

When to Add Neo4j (Future)​

MongoDB Schema​

Collections Overview​

Single entities Collection​

entities — Current State​

Resource entity (with reverse lookup)​

entity_versions — Point-in-Time Snapshots (Reduced Role)​

baseline_metadata + baseline_entities — Periodic Snapshots​

baseline_metadata — Baseline headers​

baseline_entities — Per-entity snapshots​

sync_cursors — Audit Log Sync Progress​

events — Canonical Audit Log​

Event Types​

connector_syncs — Sync Records​

findings — Triggered Detections​

evidence_packs — Sealed Artifacts​

execution_evidence — First-Class Execution Proof​

execution_chains — Platform-Computed Chain Tracking​

execution_chain_versions — Chain State Snapshots (Phase 2)​

execution_chain_events — Chain Lifecycle Events (Phase 2)​

Indexes​

Graph Query Patterns (Application-Level)​

Pattern 1: Blast Radius (Materialized Paths)​

Pattern 2: Reverse Query (Denormalized)​

Pattern 3: On-Demand Path Computation​

Pattern 4: Point-in-Time Comparison​

Pattern 5: Drift Timeline​

Temporal Strategy​

Audit-Log-Based Change Detection​

How History Works​

Data Flow​

Baseline Strategy​

Retention Management​

Path Materialization​

Externalization Strategy for Embedded Arrays​

Multi-Tenancy​

Scaling Path​

Operational Considerations​

Backups​

Monitoring​

Local Development​

StorageAdapter Interface​

Future: Scaling with Neo4j​

Migration Triggers​

What Changes​

What Stays the Same​

Migration Steps​

Open Questions​