
Implementation Plan: Scan Safety, Data Loss Prevention & Connector Observability

Date: 2026-02-26
Status: Draft v2, revised per review findings (8 items addressed)
Scope: sv0-platform (ingestion pipeline, API, UI), sv0-connectors (entra-servicenow)

Core Assumption (non-negotiable): The platform must never perform automatic irreversible deletes and must never perform automatic large soft-removals from a single suspect scan. All destructive operations must be gated, observable, and reversible.

Trigger: Production incident in which a fresh connector scan removed all 5 authority paths for the default tenant.


1. Incident Summary

On 2026-02-26, a fresh entra-servicenow connector scan (syncId: cebc3162) removed all 5 existing authority paths for the default tenant. The production UI showed zero authority paths where 5 had been present.

Impact: Complete loss of authority path visibility for the real (non-demo) tenant.

Root cause: Two compounding failures:

  1. Connector bug (primary): Commit f322e48 ("discover all outbound REST Messages, not just Azure") removed the Azure endpoint filter from the get_outbound_rest_messages() query. The _get_table() method has a 100-record default limit. With the filter removed, the ServiceNow instance returned 100+ generic REST Messages, pushing the Azure-specific ones outside the window. Result: discover_execution_chains() returned 0 chains, and the graph output contained 0 chain workloads (business rules, script includes, scheduled jobs).

  2. Platform design gap (amplifier): The platform uses a full-replacement model for sync processing. When the diff engine (diff-engine.ts:267-305) detected that 5 previously-ingested workloads were absent from the incoming graph, it marked them as deleted. The authority path materializer (sync-ingestion.ts:166-181) then soft-removed all authority paths for those deleted workloads.

Key metric from the failing sync:

pathsComputed: 1, authorityPathsCreated: 0, authorityPathsRemoved: 5

Resolution: Authority paths manually restored via MongoDB updateMany (status: "removed" → "active"). Possible because the platform uses soft-delete (markAuthorityPathsRemoved sets status: "removed", does not physically delete).


2. Design Principles

Based on industry research across identity governance (SailPoint, Veza), CSPM (Wiz, Prisma Cloud), SIEM (Splunk, Sentinel), and sensor platforms (CrowdStrike Falcon):

  1. Never trust a single scan to be complete. Connectors can fail partially — API limits, permission revocations, timeouts. The platform must treat incoming data as potentially incomplete.

  2. Absence ≠ deletion. An entity missing from one scan should not be immediately removed. Use "last seen" tracking with grace periods (industry standard: 1-7 days depending on entity type).

  3. Protect high-value data with circuit breakers. If a sync would remove a significant portion of existing data, halt and quarantine rather than apply. Thresholds vary: 30% for identities, 50% for resources (SailPoint/Veza pattern).

  4. Connectors must declare scope. A connector scanning only Function Apps should not trigger deletion of unrelated ServiceNow workloads. Scan scope must be explicit in the payload.

  5. Make all destructive operations observable and reversible. Operators must be able to see what each scan changed, detect anomalies, and roll back bad syncs.


3. Implementation Plan

Phase 0: Connector Fix (Immediate — blocks further scans)

Goal: Fix the entra-servicenow connector so scans produce complete graphs.

| Task | File | Change |
| --- | --- | --- |
| Use paginated query for REST Messages | servicenow_client.py → get_outbound_rest_messages() | Replace _get_table() with _get_table_paginated() to fetch all REST Messages regardless of count |
| Add self-validation before submission | cli/main.py → submit logic | Log warning if chain discovery returns 0 chains when prior scan had >0; optionally abort submission |

Estimated effort: 1 hour
Prevents recurrence of this specific incident: Yes
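The connector itself is Python, so the following TypeScript sketch only illustrates the two techniques named in the table: paging until the source is exhausted, and flagging a scan that finds zero chains when the previous scan found some. `fetchAllPages`, `FetchPage`, and `isSuspectScan` are hypothetical names, and the page size of 100 simply mirrors the default limit described in the incident summary.

```typescript
// Hypothetical page fetcher: returns up to `limit` records starting at `offset`.
type FetchPage<T> = (offset: number, limit: number) => Promise<T[]>;

// Keep requesting pages until a short (or empty) page signals the end of the table.
async function fetchAllPages<T>(fetchPage: FetchPage<T>, pageSize = 100): Promise<T[]> {
  const all: T[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = await fetchPage(offset, pageSize);
    all.push(...page);
    if (page.length < pageSize) break;
  }
  return all;
}

// Self-validation: a scan is suspect if it found no chains but the prior scan found some.
function isSuspectScan(currentChains: number, previousChains: number): boolean {
  return currentChains === 0 && previousChains > 0;
}
```

With a single unpaginated request, any record beyond the first 100 is silently dropped; the loop above removes that dependence on result ordering.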


Phase 1: Platform Circuit Breaker (P0 — 1 day)

Goal: Prevent any single sync from causing mass entity deletion or authority path removal, regardless of connector bugs.

Review finding addressed (Critical #1): The original per-workload AP breaker with existingActive.length >= 3 minimum would allow a full wipe for small tenants (e.g., 2 workloads with 2 paths each → 100% removal allowed). The breaker now operates at the global/tenant level across all entities and paths for the entire sync, with no minimum floor — even removing 1 of 1 paths triggers evaluation.
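The small-tenant case can be worked through in a few lines. This is a sketch under stated assumptions: `perWorkloadBlocks` and `globalBlocks` are hypothetical helpers, and the 50% threshold is assumed for both variants (the original per-workload threshold is not stated here).

```typescript
interface WorkloadPaths {
  existingActive: number; // active authority paths for one workload
  toRemove: number;       // paths the sync would remove for that workload
}

const THRESHOLD = 0.5; // assumed for illustration

// Original draft: per-workload check, evaluated only when a workload has >= 3 active paths.
function perWorkloadBlocks(w: WorkloadPaths, minFloor = 3): boolean {
  if (w.existingActive < minFloor) return false; // below the floor the breaker never evaluates
  return w.toRemove / w.existingActive > THRESHOLD;
}

// Revised: one check across the whole sync, with no floor.
function globalBlocks(workloads: WorkloadPaths[]): boolean {
  const existing = workloads.reduce((n, w) => n + w.existingActive, 0);
  const removing = workloads.reduce((n, w) => n + w.toRemove, 0);
  return existing > 0 && removing / existing > THRESHOLD;
}

// 2 workloads with 2 paths each, all 4 paths about to be removed.
const smallTenant: WorkloadPaths[] = [
  { existingActive: 2, toRemove: 2 },
  { existingActive: 2, toRemove: 2 },
];
const perWorkloadResult = smallTenant.some((w) => perWorkloadBlocks(w)); // false: full wipe allowed
const globalResult = globalBlocks(smallTenant);                          // true: 4/4 = 100% > 50%
```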

Review finding addressed (Critical #2): Entity deletion and authority path removal are two separate destructive operations, but they are causally linked: deleted entities → missing execution_paths → materializer removes authority paths. The circuit breaker now gates the entire destructive pipeline (entity deletion + execution path materialization + authority path removal) as a single unit. If the entity deletion breaker fires, the materializer is also blocked from removing paths.

1a. Global Entity Deletion Threshold

Before the diff engine marks entities as deleted, compare the total deletion count against total existing entity count across all source systems in the sync.

File: src/ingestion/diff-engine.ts (around line 267-305)

// Proposed logic: GLOBAL breaker (runs once for the entire sync, not per-workload)
interface DeletionBreaker {
  totalToDelete: number;
  totalExisting: number;
  deletionRatio: number;
  triggered: boolean;
  blockedEntityIds: string[];
}

function evaluateDeletionBreaker(
  allToDelete: EntityDoc[],
  allExistingForSyncSources: EntityDoc[],
): DeletionBreaker {
  const totalToDelete = allToDelete.length;
  const totalExisting = allExistingForSyncSources.length;
  const deletionRatio = totalExisting > 0 ? totalToDelete / totalExisting : 0;
  const threshold = 0.50; // 50% global threshold, no minimum floor

  // Zero existing = first scan, no breaker needed
  if (totalExisting === 0) {
    return { totalToDelete, totalExisting, deletionRatio, triggered: false, blockedEntityIds: [] };
  }

  // Special case: every existing entity would be deleted (e.g., the incoming scan is empty) → always block
  if (totalToDelete === totalExisting) {
    return {
      totalToDelete, totalExisting, deletionRatio: 1.0, triggered: true,
      blockedEntityIds: allToDelete.map(e => e.node_id),
    };
  }

  const triggered = deletionRatio > threshold;
  return {
    totalToDelete, totalExisting, deletionRatio, triggered,
    blockedEntityIds: triggered ? allToDelete.map(e => e.node_id) : [],
  };
}

Per-type thresholds (applied as a secondary check within global breaker):

| Entity Type (runtime) | Threshold | Rationale |
| --- | --- | --- |
| identity (service principals) | 30% | Anchor authority paths; rarely mass-deleted legitimately |
| workload | 40% | Core platform object; client config changes are incremental |
| role / permission | 40% | Role structures are relatively stable |
| resource | 60% | Cloud resources are more volatile (scale-up/down) |
| owner | 50% | People join/leave; moderate volatility |
| Default | 50% | Safe middle ground |
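One way the secondary per-type check could look, as a minimal sketch. `TypedEntity`, `countByType`, and `typesOverThreshold` are hypothetical names, and the sketch assumes each entity exposes its runtime type as `nodeType`; only the threshold values come from the table.

```typescript
// Thresholds from the table above, keyed by runtime node type.
const TYPE_THRESHOLDS: Record<string, number> = {
  identity: 0.3, workload: 0.4, role: 0.4, permission: 0.4, resource: 0.6, owner: 0.5,
};
const DEFAULT_THRESHOLD = 0.5;

interface TypedEntity {
  node_id: string;
  nodeType: string; // assumed field name for the runtime type
}

function countByType(items: TypedEntity[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) counts[item.nodeType] = (counts[item.nodeType] ?? 0) + 1;
  return counts;
}

// Returns the entity types whose deletion ratio exceeds their own threshold.
function typesOverThreshold(toDelete: TypedEntity[], existing: TypedEntity[]): string[] {
  const existingCounts = countByType(existing);
  const deletingCounts = countByType(toDelete);
  return Object.keys(deletingCounts).filter((type) => {
    const total = existingCounts[type] ?? 0;
    if (total === 0) return false; // nothing existed for this type, so no ratio to evaluate
    const threshold = TYPE_THRESHOLDS[type] ?? DEFAULT_THRESHOLD;
    return deletingCounts[type] / total > threshold;
  });
}
```

A non-empty result would be treated as a breaker trip even when the global ratio stays under 50%, for example when 4 of 10 identities disappear in an otherwise healthy sync.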

Review finding addressed (Medium #8): Entity type names in thresholds now use runtime types from the graph transformer (e.g., owner not human_identity), matching NormalizedNode.nodeType values that actually appear in the pipeline.

No minimum entity floor. Previous draft required existingForSource.length >= 5 — this is removed. A tenant with 2 entities losing both is a 100% drop and must be caught.

1b. Cascading Pipeline Gate

When the entity deletion breaker fires, the entire destructive pipeline for this sync is halted — not just entity deletion.

File: src/workers/handlers/sync-ingestion.ts

// After entity diff:
const deletionBreaker = evaluateDeletionBreaker(toDelete, existingForSource);

if (deletionBreaker.triggered) {
  logger.warn("Circuit breaker triggered: blocking ALL destructive operations", {
    syncId, tenantId,
    wouldDelete: deletionBreaker.totalToDelete,
    existing: deletionBreaker.totalExisting,
    ratio: deletionBreaker.deletionRatio,
  });

  // Skip: entity deletions
  // Skip: execution path re-materialization for deleted entities
  // Skip: authority path removal (markAuthorityPathsRemoved)
  // Continue: entity creates/updates (additive operations are safe)
  // Continue: findings evaluation, evidence packs, posture snapshot

  syncMetrics.circuit_breaker_triggered = true;
  syncMetrics.deletions_blocked = deletionBreaker.totalToDelete;
  syncMetrics.authority_paths_removal_blocked = /* count from materializer */ 0;
}

Key insight: The materializer at authority-path-materializer.ts:115-127 removes paths when execution_paths is empty for a workload. If we mark workloads as deleted (removing their execution_paths), the materializer will cascade-remove their authority paths even without an explicit AP breaker. Therefore, the entity deletion breaker must gate the materializer as well — if deletions are blocked, the materializer runs with the pre-existing entity set (as if the deletions never happened).
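The "as if the deletions never happened" rule can be sketched as a pure function that decides which entity set the materializer receives. `entitiesForMaterializer` and the minimal `Entity` shape are hypothetical; the real handler presumably works against storage rather than in-memory arrays.

```typescript
interface Entity {
  node_id: string;
}

// When the breaker fired, no deletions are applied, so the materializer sees the
// pre-existing set plus this sync's creates and updates.
function entitiesForMaterializer(
  existing: Entity[],
  incoming: Entity[],
  toDelete: Entity[],
  breakerTriggered: boolean,
): Entity[] {
  const deleted: Record<string, boolean> = {};
  if (!breakerTriggered) {
    for (const e of toDelete) deleted[e.node_id] = true;
  }
  const merged: Record<string, Entity> = {};
  for (const e of existing) {
    if (!deleted[e.node_id]) merged[e.node_id] = e;
  }
  for (const e of incoming) merged[e.node_id] = e; // additive operations always apply
  return Object.keys(merged).map((id) => merged[id]);
}
```

Because blocked workloads stay in the set, their execution_paths are not empty and the materializer has no reason to cascade-remove their authority paths.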

1c. Authority Path Removal Safety Net (Independent)

A secondary check at the authority path level, as defense-in-depth for cases where entity deletion proceeds but materializer behavior is anomalous.

File: src/ingestion/authority-path-materializer.ts (around line 115-127)

// GLOBAL authority path breaker, NOT per-workload
const totalExistingPaths = allExistingActivePaths.length;
const totalToRemove = allPathsToRemove.length;
const removalRatio = totalExistingPaths > 0 ? totalToRemove / totalExistingPaths : 0;
const AP_REMOVAL_THRESHOLD = 0.50;

// No minimum floor: even 1 of 1 is 100% and triggers
if (removalRatio > AP_REMOVAL_THRESHOLD && totalExistingPaths > 0) {
  logger.warn("Authority path removal blocked by circuit breaker", {
    syncId, tenantId,
    wouldRemove: totalToRemove,
    existing: totalExistingPaths,
    ratio: removalRatio,
  });
  syncMetrics.authority_paths_removal_blocked = totalToRemove;
  // Skip all path removals (return early)
}

1d. Sync Status: Keep "completed" + Flag

Review finding addressed (Critical #3): The original plan proposed adding "degraded" to SyncStatus. This would break downstream processing: evaluate-findings.ts:21 gates on sync.status !== "completed" — a "degraded" status would cause the findings evaluator, evidence pack builder, and posture snapshot to all be skipped. This is worse than the original problem.

Solution: Keep status: "completed" so downstream processing runs normally. Add a circuit_breaker_triggered: boolean flag on the sync metrics to indicate that destructive operations were blocked.

File: src/domain/syncs/types.ts

Do NOT add "degraded" to SyncStatus. Instead, add metrics fields:

// Added to ConnectorSyncMetrics (existing interface):
deletions_blocked?: number;
authority_paths_removal_blocked?: number;
circuit_breaker_triggered?: boolean;
circuit_breaker_details?: {
  entity_deletion_ratio: number;
  entity_deletion_threshold: number;
  ap_removal_ratio: number;
  ap_removal_threshold: number;
};

The sync status remains "completed". The UI and alerting system check circuit_breaker_triggered to surface warnings. Findings evaluation, evidence packs, and posture snapshots all proceed normally against the non-destructed data.
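A minimal sketch of how a consumer might read the flag while the status gate stays untouched. `SyncView`, `downstreamShouldRun`, and `breakerWarning` are hypothetical; the only status value the sketch relies on is "completed".

```typescript
interface SyncView {
  status: string; // "completed" is the only value this sketch relies on
  metrics?: {
    circuit_breaker_triggered?: boolean;
    deletions_blocked?: number;
    authority_paths_removal_blocked?: number;
  };
}

// Mirrors the downstream gate described for evaluate-findings.ts: only completed syncs proceed.
function downstreamShouldRun(sync: SyncView): boolean {
  return sync.status === "completed";
}

// Hypothetical UI/alerting helper: warning text when destructive operations were blocked, else null.
function breakerWarning(sync: SyncView): string | null {
  const m = sync.metrics;
  if (!m || !m.circuit_breaker_triggered) return null;
  const entities = m.deletions_blocked ?? 0;
  const paths = m.authority_paths_removal_blocked ?? 0;
  return `Circuit breaker blocked ${entities} entity deletions and ${paths} authority path removals`;
}
```

The two checks are independent, which is the point of the design: a tripped breaker produces a warning without changing what the findings evaluator, evidence pack builder, or posture snapshot see.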

Estimated effort: 1 day
Prevents recurrence: Yes. Any connector bug that produces incomplete data triggers the circuit breaker, blocking all destructive operations while preserving downstream processing.


Phase 2: Scan Scope Declaration & Rollback Determinism (P0 — 1-2 days)

Goal: Let connectors declare what they scanned so the platform only removes entities within that scope. Make rollback deterministic by tracking which sync caused each removal.

Review finding addressed (Critical #4): The original rollback mechanism used removed_at ± 5s timestamp matching — this is non-deterministic and could restore paths from unrelated operations. Rollback must use removed_by_sync_id for exact causality tracking.

Review finding addressed (High #5): The original plan trusted connector self-reported completeness.expectedNodeCount to relax circuit breaker thresholds. This is dangerous — a buggy connector could self-report "I expected 0 nodes" and bypass all safety. All safety decisions must be platform-derived from historical baselines.

Review finding addressed (High #7): The current codebase has two gaps: (1) the ingest route does not validate or accept scanScope in the payload, and (2) sync-ingestion.ts:33 hardcodes sync_mode: "full". Both must be addressed for scan scope to function.

2a. NormalizedGraph Schema Extension

File: src/ingestion/types.ts

export interface ScanScope {
  /** What this scan covers: only entities matching this scope are eligible for deletion */
  mode: "full" | "incremental" | "targeted";

  /** Source systems included in this scan (e.g., ["servicenow", "entra_id"]) */
  sourceSystems?: string[];

  /** Entity types included (e.g., ["workload", "identity"]). If omitted, all types in scope. */
  scannedEntityTypes?: string[];

  /** Connector self-reported errors, used ONLY for logging/observability, NOT for safety decisions */
  errors?: {
    errorsEncountered?: number;
    partialFailures?: string[];
    permissionDenied?: string[];
  };
}

export interface NormalizedGraph {
  // ... existing fields
  scanScope?: ScanScope;
}

Important: The completeness.expectedNodeCount field from the original draft is removed. The platform must never trust connector self-reported node counts for circuit breaker override decisions. All safety thresholds are derived from the platform's own historical baseline (previous successful sync for the same connector + tenant).

2b. Scope-Aware Deletion

File: src/ingestion/diff-engine.ts

When scanScope.mode === "incremental", skip deletion detection entirely (additive-only). When scanScope.scannedEntityTypes is provided, only consider entities of those types for deletion.

2c. Ingest Route Validation

File: src/api/routes/ingest.ts

The ingest route must accept and validate the scanScope field from the payload:

// Add to existing NormalizedGraph validation schema:
scanScope: z.object({
  mode: z.enum(["full", "incremental", "targeted"]),
  sourceSystems: z.array(z.string()).optional(),
  scannedEntityTypes: z.array(z.string()).optional(),
  errors: z.object({
    errorsEncountered: z.number().optional(),
    partialFailures: z.array(z.string()).optional(),
    permissionDenied: z.array(z.string()).optional(),
  }).optional(),
}).optional(),

2d. Sync Mode from Payload (Remove Hardcoded Default)

File: src/workers/handlers/sync-ingestion.ts (line 33)

Currently: sync_mode: "full" is hardcoded. Change to derive from the incoming payload:

// Before (hardcoded):
sync_mode: "full",

// After (from payload, with safe default):
sync_mode: graph.scanScope?.mode ?? "full",

When mode is "incremental", the diff engine skips deletion detection. When "targeted", only scannedEntityTypes are eligible for deletion. When "full" (or absent), current behavior is preserved (all entity types eligible).
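The mode rules above reduce to a small filter over the entities that are absent from the incoming graph. This is a sketch, not the diff engine's actual code: `deletionCandidates`, `ScanScopeLite`, and the `nodeType` field are hypothetical names. It follows 2b in applying scannedEntityTypes whenever the list is provided, and treats an omitted list as "all types in scope".

```typescript
type ScanMode = "full" | "incremental" | "targeted";

interface ScanScopeLite {
  mode: ScanMode;
  scannedEntityTypes?: string[];
}

interface TypedEntity {
  node_id: string;
  nodeType: string; // assumed field name for the runtime type
}

// Given entities missing from the incoming graph, return those eligible for deletion.
function deletionCandidates(absent: TypedEntity[], scope?: ScanScopeLite): TypedEntity[] {
  const mode: ScanMode = scope?.mode ?? "full"; // absent scope preserves current behavior
  if (mode === "incremental") return [];         // additive-only: never delete
  const types = scope?.scannedEntityTypes;
  if (types === undefined) return absent;        // no type list: all types in scope
  return absent.filter((e) => types.indexOf(e.nodeType) !== -1);
}
```

The candidates returned here would still pass through the circuit breaker; scope narrows what may be deleted, it does not approve the deletion.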

2e. Deterministic Rollback: removed_by_sync_id

File: src/domain/authority-paths/types.ts

Add removed_by_sync_id to the authority path schema:

// Added to AuthorityPathDoc:
removed_by_sync_id?: string; // The sync that caused this path to be removed

File: src/ingestion/authority-path-materializer.ts

When removing authority paths, stamp them with the sync ID:

// In markAuthorityPathsRemoved:
await collection.updateMany(
  { _id: { $in: pathIdsToRemove } },
  {
    $set: {
      status: "removed",
      removed_at: new Date(),
      removed_by_sync_id: syncId, // NEW: deterministic rollback key
    },
  },
);

File: src/domain/entities/types.ts

Add removed_by_sync_id to entity schema for the same reason:

// Added to EntityDoc:
removed_by_sync_id?: string; // The sync that caused this entity to be deleted

This enables the rollback in Phase 6 to use an exact match (removed_by_sync_id === targetSyncId) instead of the non-deterministic removed_at ± 5s window.
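An in-memory comparison shows why the timestamp window is unsafe. The selectors and the sample records below are hypothetical; timestamps are plain epoch milliseconds for brevity.

```typescript
interface RemovedPath {
  id: string;
  status: "active" | "removed";
  removed_at?: number; // epoch milliseconds, for brevity
  removed_by_sync_id?: string;
}

// Original draft: anything removed within ±5s of the sync's removal time.
function selectByTimestampWindow(paths: RemovedPath[], syncRemovedAt: number, toleranceMs = 5000): string[] {
  return paths
    .filter((p) => p.status === "removed" && p.removed_at !== undefined
      && Math.abs(p.removed_at - syncRemovedAt) <= toleranceMs)
    .map((p) => p.id);
}

// Revised: exact match on the sync that caused the removal.
function selectBySyncId(paths: RemovedPath[], syncId: string): string[] {
  return paths
    .filter((p) => p.status === "removed" && p.removed_by_sync_id === syncId)
    .map((p) => p.id);
}

const samplePaths: RemovedPath[] = [
  { id: "ap-1", status: "removed", removed_at: 10000, removed_by_sync_id: "sync-bad" },
  { id: "ap-2", status: "removed", removed_at: 12000, removed_by_sync_id: "sync-other" },
  { id: "ap-3", status: "active" },
];
const byWindow = selectByTimestampWindow(samplePaths, 10000); // ap-1 and ap-2
const bySyncId = selectBySyncId(samplePaths, "sync-bad");     // ap-1 only
```

Here ap-2 was removed by an unrelated sync two seconds later, and the window-based selector would restore it anyway.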

2f. Connector-Side Scope Declaration

Files: Both connector transformers should include scanScope in their output.

The entra-servicenow connector already implicitly knows its scope — it scans ["servicenow", "entra_id"] entities of types ["workload", "identity", "role", "permission", "resource", "connection"]. Making this explicit prevents the class of bug where a partial failure in one subsystem (chain discovery) causes deletions across all entity types.

Estimated effort: 1-2 days
Prevents recurrence: Yes. Even if chain discovery fails, the scope declaration tells the platform which entity types were actually scanned. A scan that produces 0 workloads but declares scannedEntityTypes: ["workload"] would still trigger the circuit breaker. A scan that only scans flows but NOT business rules would omit "workload" from its scanned types, preventing BR deletion.


Phase 3: Soft-Delete with Grace Periods (P1 — 2-3 days)

Goal: Replace immediate removal with a multi-phase lifecycle that tolerates transient scan failures.

3a. Entity Absence Tracking

File: src/domain/entities/types.ts

Add to EntityDoc:

last_seen_sync_id?: string;
last_seen_at?: Date;
consecutive_absences?: number;
absence_since?: Date;

3b. Two-Phase Removal Lifecycle (No Hard Deletion)

Review finding addressed (High #6): The original plan included a "Tombstoned / eligible for hard deletion" phase. This introduces irreversible data loss — once tombstoned and purged, data cannot be recovered even if the deletion was caused by a connector bug discovered weeks later. The tombstone phase is removed entirely. Entities and authority paths use soft-delete indefinitely. Storage cost for soft-deleted records is negligible compared to the risk of irreversible loss.

| Phase | Condition | Effect on Authority Paths |
| --- | --- | --- |
| Active | Seen in latest scan | Normal: paths materialized |
| Stale | Missing 1 scan, < 24h | Retain paths, flag in UI with warning badge |
| Absent | Missing 2+ scans OR > 48h | Soft-remove from active paths, retain entity as status: "removed" |

There is no automatic hard deletion. Entities in "Absent" state are soft-deleted (status: "removed") and excluded from active queries, but remain in the database indefinitely. Manual purge is available via admin API for explicit operator-driven cleanup (see Phase 6).

Grace period by entity type:

| Entity Type | Stale → Absent |
| --- | --- |
| Identity (SP, managed identity) | 48h |
| Workload | 48h |
| Role / Permission | 24h |
| Resource | 24h |
| Owner | 72h |

3c. Implementation

File: src/ingestion/diff-engine.ts

Instead of adding absent entities to deletedEntityIds:

  1. Increment consecutive_absences on the entity
  2. Set absence_since if first absence
  3. Only add to deletedEntityIds when entity reaches "Absent" phase
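The three steps above can be sketched as two small pure functions over the absence-tracking fields from 3a. `markSeen`, `markAbsent`, and `ABSENT_AFTER_SCANS` are hypothetical names; the scan count of 2 comes from the lifecycle table, and the time-based condition is handled separately by the periodic cleanup step.

```typescript
interface AbsenceFields {
  last_seen_sync_id?: string;
  last_seen_at?: Date;
  consecutive_absences?: number;
  absence_since?: Date;
}

const ABSENT_AFTER_SCANS = 2; // from the lifecycle table: "Missing 2+ scans"

// Entity present in the incoming graph: reset absence tracking (absence_since is cleared).
function markSeen(syncId: string, now: Date): AbsenceFields {
  return { last_seen_sync_id: syncId, last_seen_at: now, consecutive_absences: 0 };
}

// Entity missing from the incoming graph: apply steps 1 and 2, then report the phase for step 3.
function markAbsent(
  prev: AbsenceFields,
  now: Date,
): { fields: AbsenceFields; phase: "stale" | "absent" } {
  const consecutive = (prev.consecutive_absences ?? 0) + 1;
  const fields: AbsenceFields = {
    ...prev,
    consecutive_absences: consecutive,
    absence_since: prev.absence_since ?? now, // set only on the first absence
  };
  return { fields, phase: consecutive >= ABSENT_AFTER_SCANS ? "absent" : "stale" };
}
```

Only entities whose phase comes back as "absent" would be added to deletedEntityIds, and those would still be subject to the circuit breaker from Phase 1.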

File: src/workers/handlers/sync-ingestion.ts

Add a periodic cleanup step (or separate worker job) that promotes stale → absent based on time elapsed. There is no further promotion — absent entities remain soft-deleted indefinitely.
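The time-based promotion could be a single predicate evaluated per stale entity. The grace values come from the table in 3b; `shouldPromoteToAbsent`, the `nodeType` keys, and the 48h fallback for unlisted types are assumptions of this sketch.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Stale → Absent grace periods from the table in 3b, keyed by runtime type.
const GRACE_HOURS: Record<string, number> = {
  identity: 48, workload: 48, role: 24, permission: 24, resource: 24, owner: 72,
};
const DEFAULT_GRACE_HOURS = 48; // assumption: the plan does not specify a default

// True when a stale entity has been absent longer than its type's grace period.
function shouldPromoteToAbsent(nodeType: string, absenceSince: Date | undefined, now: Date): boolean {
  if (!absenceSince) return false; // never absent, nothing to promote
  const graceMs = (GRACE_HOURS[nodeType] ?? DEFAULT_GRACE_HOURS) * HOUR_MS;
  return now.getTime() - absenceSince.getTime() > graceMs;
}
```

A worker would run this against every entity with a non-empty absence_since and soft-remove the ones that return true; nothing in this step deletes data physically.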

Estimated effort: 2-3 days
Benefit: Tolerates transient connector failures (API timeouts, temporary permission issues) without data loss. If a connector self-heals on next scan, stale entities return to active with no operator intervention. No risk of irreversible data loss from automatic purging.


Phase 4: Connector Health Metrics & Validation (P1 — 2 days)

Goal: Compute health scores per scan and reject/quarantine unhealthy scans before they cause damage.

4a. Health Score Computation

New file: src/ingestion/scan-health.ts

Compute a health score (0.0–1.0) for each incoming scan by comparing against the most recent successful sync for the same connector:

interface ScanHealthReport {
  healthScore: number; // 0.0 = critical, 1.0 = healthy
  healthStatus: "healthy" | "degraded" | "critical" | "failed";

  metrics: {
    nodeCount: number;
    edgeCount: number;
    nodeCountByType: Record<string, number>;
  };

  deviations: {
    nodeCountDeltaPercent: number;
    edgeCountDeltaPercent: number;
    missingNodeTypes: string[]; // types present before, absent now
    missingEdgeTypes: string[];
  };

  connectorReported: {
    errorsEncountered: number;
    permissionDenied: string[];
    partialFailures: string[];
  };
}

Health score formula (platform-derived only — no connector self-reported inputs):

healthScore = weighted average of:
  - volumeScore (45%): node/edge count deviation from baseline (platform-computed)
  - typeScore (35%): missing entity types penalty (platform-computed)
  - durationScore (20%): scan duration anomaly (platform-computed)

Note: Connector self-reported errors (from scanScope.errors) are stored for observability and logged in the health report, but they are not used as inputs to the health score or circuit breaker decisions. This prevents a buggy connector from self-reporting "all clear" and bypassing safety.
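The weighting itself is simple to express; what the plan leaves open is how each sub-score is derived. In the sketch below only the 45/35/20 weights come from the formula above. The linear `volumeScore` and `typeScore` mappings, and the choice to penalize increases the same as decreases, are assumptions made for illustration.

```typescript
interface SubScores {
  volumeScore: number;   // each sub-score is expected in the range 0..1
  typeScore: number;
  durationScore: number;
}

const WEIGHTS = { volumeScore: 0.45, typeScore: 0.35, durationScore: 0.2 };

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Weighted average of the three platform-computed sub-scores.
function healthScore(s: SubScores): number {
  return clamp01(
    WEIGHTS.volumeScore * s.volumeScore +
    WEIGHTS.typeScore * s.typeScore +
    WEIGHTS.durationScore * s.durationScore,
  );
}

// Assumed mapping: linear penalty on relative count deviation from the baseline sync.
function volumeScore(baselineCount: number, currentCount: number): number {
  if (baselineCount === 0) return 1; // no baseline to deviate from
  return clamp01(1 - Math.abs(currentCount - baselineCount) / baselineCount);
}

// Assumed mapping: fraction of baseline entity types still present in this scan.
function typeScore(baselineTypes: string[], currentTypes: string[]): number {
  if (baselineTypes.length === 0) return 1;
  const missing = baselineTypes.filter((t) => currentTypes.indexOf(t) === -1).length;
  return 1 - missing / baselineTypes.length;
}
```

Note that no argument in these functions originates from the connector's own error report, in line with the rule that self-reported data never influences safety decisions.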

Thresholds:

| Score Range | Status | Action |
| --- | --- | --- |
| ≥ 0.8 | Healthy | Apply normally |
| 0.5 to < 0.8 | Degraded | Apply with circuit breakers active, log warning |
| 0.2 to < 0.5 | Critical | Quarantine: do not apply destructive operations |
| < 0.2 | Failed | Reject scan, notify operators |

4b. Pre-Ingestion Gate

File: src/workers/handlers/sync-ingestion.ts

Before step 2 (transform), compute the health report. If healthStatus === "critical", quarantine the sync and skip all destructive operations; if healthStatus === "failed", reject the scan and notify operators (per the thresholds in 4a).
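The gate reduces to mapping a score onto a status and then onto an allowed action. This is a minimal sketch with hypothetical function names; it assumes each band's lower bound is inclusive (a score of exactly 0.5 is "degraded"), which matches the "below 0.5" and "< 0.2" wording used elsewhere in this plan.

```typescript
type HealthStatus = "healthy" | "degraded" | "critical" | "failed";

// Map a 0.0–1.0 health score to a status using the thresholds from 4a.
function classifyHealth(score: number): HealthStatus {
  if (score >= 0.8) return "healthy";
  if (score >= 0.5) return "degraded";
  if (score >= 0.2) return "critical";
  return "failed";
}

// Destructive operations run only for healthy or degraded scans
// (degraded scans still pass through the Phase 1 circuit breakers).
function allowsDestructiveOps(status: HealthStatus): boolean {
  return status === "healthy" || status === "degraded";
}

// Failed scans are rejected outright; everything else is ingested at least additively.
function shouldRejectScan(status: HealthStatus): boolean {
  return status === "failed";
}
```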

4c. Store Health Reports

New collection: scan_health_reports (indexed by tenant_id, sync_id)

Persist every health report for trend analysis and operator review.

Estimated effort: 2 days
Benefit: Catches degraded scans before they enter the pipeline. Provides historical health data for monitoring dashboards.


Phase 5: Observability & Admin Dashboard (P2 — 3-5 days)

Goal: Give operators visibility into connector health, scan history, and anomalies.

What Already Exists

The platform already has significant infrastructure (discovered during audit):

| Component | Status | Location |
| --- | --- | --- |
| Health endpoints (/health, /ready, /metrics, /diagnostics) | Built | src/api/routes/system.ts |
| Prometheus metrics (8+ metrics: HTTP latency, job duration, queue depth, sync age, findings count, authority path count) | Built | src/shared/metrics/metrics.ts |
| Structured JSON logging | Built | src/shared/logging/logger.ts |
| Sync history API (GET /api/v1/syncs) | Built | src/api/routes/syncs.ts |
| SyncsPage UI (table with status badges, filtering) | Built | ui/src/pages/SyncsPage.tsx |
| Worker queue depth tracking | Built | src/workers/runtime.ts |
| Connector sync metrics (entities_created/updated, paths_created/removed, etc.) | Built | src/domain/syncs/types.ts |

What's Missing

| Component | Priority | Effort |
| --- | --- | --- |
| Connector health summary (last scan per connector, trend sparklines) | P2 | 1 day |
| Scan health dashboard (entity counts over time, anomaly flags) | P2 | 2 days |
| Error visibility (display sync errors in UI, categorize by type) | P2 | 1 day |
| Authority path delta visualization (created/updated/removed per sync) | P2 | 1 day |
| Admin/operator page (multi-tenant overview, system status) | P3 | 2 days |
| Alerting framework (webhook notifications for degraded/critical scans) | P3 | 2 days |
| Operational runbooks (currently placeholder at docs/runbooks/index.md) | P3 | 1 day |

5a. Enhanced SyncsPage

File: ui/src/pages/SyncsPage.tsx

Add to existing page:

  • Health badge per sync (healthy/degraded/critical/failed) based on scan_health_reports
  • Entity delta column showing +created / −removed counts with color coding
  • Authority path delta showing paths affected
  • Error column displaying sync error messages (currently stored in DB but not shown)
  • Trend sparklines per connector (last 10 syncs entity count)

5b. Connector Health Summary API

New endpoint: GET /api/v1/connectors/health

Returns per-connector:

{
  connectorId: string;
  lastSyncAt: Date;
  lastSyncStatus: string;
  lastHealthScore: number;
  syncCount24h: number;
  failureCount24h: number;
  entityCountTrend: number[]; // last 10 syncs
  authorityPathsTrend: number[]; // last 10 syncs
}

5c. Scan Detail View

Clicking a sync in the SyncsPage opens a detail view showing:

  • Full health report (deviations, missing types, errors)
  • Entity diff summary (what was created/updated/deleted)
  • Authority paths affected
  • Action buttons: "Rollback this sync" (admin only)

5d. Alerting (Phase 5 stretch)

Architecture: Webhook-based notifications.

Events that trigger alerts:

| Event | Severity | Channel |
| --- | --- | --- |
| Scan failed completely | Critical | Webhook + in-app banner |
| Circuit breaker triggered (deletions blocked) | Alert | Webhook + in-app notification |
| Health score dropped below 0.5 | Warning | In-app notification |
| No scan received in > 24h for active connector | Warning | Webhook |
| Permission denied errors in scan | Alert | Webhook |

Webhook payload:

{
  event: "scan_degraded" | "scan_failed" | "circuit_breaker_triggered" | "scan_stale",
  severity: "info" | "warning" | "alert" | "critical",
  connectorId: string,
  syncId: string,
  tenantId: string,
  timestamp: string,
  title: string, // "Entra scan returned 45% fewer service principals"
  details: { healthScore, deviations, actionTaken, recommendedAction }
}

Configuration: POST /api/v1/settings/webhooks to register notification endpoints (Slack, Teams, email relay, PagerDuty).

Estimated effort: 3-5 days total for Phase 5
Benefit: Operators can monitor connector health without SSH access to production. Anomalies surface proactively instead of being discovered when a customer reports missing data.


Phase 6: Rollback Capability (P2 — 1 day)

Goal: Enable operators to undo the effects of a bad sync.

6a. Restore Removed Authority Paths

Review finding addressed (Critical #4): Rollback uses removed_by_sync_id for exact causality — not the non-deterministic removed_at ± 5s window from the original draft.

Since authority paths use soft-delete with removed_by_sync_id stamping (added in Phase 2e), restoration is a single deterministic query:

async restoreAuthorityPaths(
  tenantId: string,
  syncId: string, // the sync that caused the removal
): Promise<number> {
  // Exact match on the sync that caused removal; no timestamp tolerance needed
  const result = await this.c.authorityPaths.updateMany(
    {
      tenant_id: tenantId,
      status: "removed",
      removed_by_sync_id: syncId,
    },
    {
      $set: { status: "active" },
      $unset: { removed_at: "", removed_by_sync_id: "" },
    },
  );
  return result.modifiedCount;
}

async restoreDeletedEntities(
  tenantId: string,
  syncId: string,
): Promise<number> {
  const result = await this.c.entities.updateMany(
    {
      tenant_id: tenantId,
      status: "removed",
      removed_by_sync_id: syncId,
    },
    {
      $set: { status: "active" },
      $unset: { removed_at: "", removed_by_sync_id: "" },
    },
  );
  return result.modifiedCount;
}

6b. Admin API Endpoint

New endpoint: POST /api/v1/admin/syncs/:syncId/rollback

Requires admin authentication. Restores all authority paths and soft-deleted entities removed by the specified sync, matching on removed_by_sync_id.

6c. CLI Script

New file: scripts/rollback-sync.ts

npx tsx scripts/rollback-sync.ts --sync-id cebc3162-... --tenant-id default

Estimated effort: 1 day
Benefit: Recovery from bad syncs without direct MongoDB access. Can be triggered from admin UI or CLI.


4. Implementation Priority & Timeline

| Phase | Description | Priority | Effort | Cumulative |
| --- | --- | --- | --- | --- |
| 0 | Connector fix (paginated query) | P0 | 1h | 1h |
| 1 | Circuit breaker (deletion + AP thresholds) | P0 | 1 day | 1.5 days |
| 2 | Scan scope declaration | P0 | 1-2 days | 3 days |
| 3 | Soft-delete with grace periods | P1 | 2-3 days | 6 days |
| 4 | Health score computation & pre-ingestion gate | P1 | 2 days | 8 days |
| 5 | Observability dashboard & alerting | P2 | 3-5 days | 13 days |
| 6 | Rollback capability | P2 | 1 day | 14 days |

Phases 0-2 are blocking — they prevent recurrence of this class of bug. Phases 3-4 add defense-in-depth and operational intelligence. Phases 5-6 provide ongoing visibility and recovery tools.


5. What This Prevents

| Scenario | Before | After (Phase 1-2) | After (Phase 3-4) |
| --- | --- | --- | --- |
| Connector returns empty graph | All paths deleted | Circuit breaker blocks deletion | Entities marked stale, paths retained |
| Connector loses API permissions | Entities missing, paths removed | Scope-aware deletion limits impact | Health score drops, scan quarantined |
| Partial connector failure (e.g., one API times out) | Some entity types disappear | Only in-scope types considered for deletion | Grace period covers transient failures |
| Client intentionally removes configurations | Paths linger indefinitely | Circuit breaker may false-positive (needs override) | Grace period expires, paths correctly removed |
| New connector with first-time scan | N/A (creation only) | Normal operation | Health baseline established |

Handling Legitimate Removals

When a client genuinely removes configurations (e.g., decommissions an Azure SP), the grace period model (Phase 3) handles this correctly:

  1. First scan after removal: entity marked stale (paths retained, warning in UI)
  2. Second scan: entity moves to absent (paths removed)
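  3. No further automatic step: the entity remains soft-deleted (status: "removed") indefinitely. Hard deletion happens only through an explicit operator-driven purge via the admin API.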

For urgent legitimate removals, operators can manually confirm the deletion via the admin UI, bypassing the grace period.

Review finding addressed (High #5): The original plan included an override mechanism based on connector self-reported expectedNodeCount. This is removed — the platform never trusts connector self-reported completeness for safety decisions. The only override is explicit operator action via the admin UI or API. If a client genuinely scaled down and the circuit breaker fires, the operator reviews the quarantined sync and manually approves it.


6. Files Changed (Summary)

sv0-platform

| File | Phase | Change |
| --- | --- | --- |
| src/ingestion/diff-engine.ts | 1, 2 | Global deletion threshold, scope-aware deletion |
| src/ingestion/authority-path-materializer.ts | 1, 2 | Global AP removal breaker, removed_by_sync_id stamping |
| src/ingestion/types.ts | 2 | ScanScope on NormalizedGraph (no expectedNodeCount) |
| src/ingestion/scan-health.ts | 4 | New file: platform-derived health score (no connector self-report) |
| src/domain/syncs/types.ts | 1, 4 | circuit_breaker_triggered flag (NOT "degraded" status), health metrics |
| src/domain/entities/types.ts | 2, 3 | removed_by_sync_id, absence tracking fields |
| src/domain/authority-paths/types.ts | 2, 3 | removed_by_sync_id, "stale" status |
| src/workers/handlers/sync-ingestion.ts | 1, 2, 3, 4 | Cascading pipeline gate, sync_mode from payload, lifecycle |
| src/api/routes/ingest.ts | 2 | Validate scanScope in payload |
| src/storage/storage-adapter.ts | 3, 6 | New methods (stale marking, deterministic restore by sync_id) |
| src/api/routes/syncs.ts | 5, 6 | Health summary endpoint, rollback endpoint |
| ui/src/pages/SyncsPage.tsx | 5 | Health badges, circuit breaker warnings, trends, error display |

sv0-connectors

| File | Phase | Change |
| --- | --- | --- |
| entra-servicenow/.../servicenow_client.py | 0 | _get_table_paginated() for REST Messages |
| entra-servicenow/.../cli/main.py | 0, 2 | Self-validation, scanScope in output |
| azure-foundry/.../transformer.py | 2 | scanScope in output |

7. Acceptance Criteria

Phase 0-1 (blocks deployment)

  • Connector scan produces all chain workloads (no 100-record limit)
  • Sync that would remove >50% of entities triggers global circuit breaker (no per-workload minimum floor)
  • Circuit breaker gates entire destructive pipeline (entity deletion + materialization + AP removal)
  • Circuit breaker logs warning with counts and ratio
  • Sync status remains "completed" with circuit_breaker_triggered: true in metrics
  • Downstream processing (findings, evidence, posture) runs normally when breaker fires
  • Entity type threshold config uses runtime types (owner, not human_identity)
  • Existing tests pass, new unit tests for threshold logic

Phase 2

  • NormalizedGraph accepts scanScope field (validated at ingest route)
  • mode: "incremental" skips all deletion detection
  • scannedEntityTypes limits deletion scope to declared types
  • sync_mode derived from payload (no longer hardcoded to "full")
  • removed_by_sync_id stamped on all soft-deleted entities and authority paths
  • Both connectors include scanScope in output
  • No connector self-reported counts used for safety decisions

Phase 3

  • Entities track last_seen_sync_id, consecutive_absences
  • First absence marks entity as stale (not deleted)
  • Authority paths for stale entities are retained
  • Entities reaching absence threshold are soft-removed (no hard deletion)
  • No automatic tombstoning or hard deletion lifecycle

Phase 4

  • Health score computed for every incoming scan using platform-derived metrics only
  • Scans with score < 0.2 are rejected
  • Scans with score 0.2-0.5 are quarantined
  • Connector self-reported errors stored for observability but not used in score
  • Health reports stored in scan_health_reports collection

Phase 5

  • SyncsPage shows health badges, circuit breaker warnings, and error messages
  • Connector health summary endpoint returns per-connector metrics
  • Webhook notifications fire for degraded/critical/failed scans

Phase 6

  • POST /api/v1/admin/syncs/:syncId/rollback restores removed paths using removed_by_sync_id (deterministic)
  • Rollback also restores soft-deleted entities from the same sync
  • CLI script rollback-sync.ts works for manual recovery
  • Manual purge API available for operator-driven hard deletion (not automatic)
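
Deterministic rollback reduces to an exact match on removed_by_sync_id. The in-memory sketch below illustrates the selection logic only; `SoftDeletedRecord` and `rollbackSync` are hypothetical names, and the real endpoint and CLI would run the equivalent query against the database.

```typescript
// Hypothetical record shape shared by entities and authority paths.
interface SoftDeletedRecord {
  id: string;
  kind: "entity" | "authority_path";
  removed_at?: Date;
  removed_by_sync_id?: string;
}

interface RollbackResult {
  records: SoftDeletedRecord[];
  restoredIds: string[];
}

function rollbackSync(
  records: SoftDeletedRecord[],
  syncId: string,
): RollbackResult {
  const restoredIds: string[] = [];
  const next = records.map((r): SoftDeletedRecord => {
    // Exact causality match: timestamps play no part in the selection.
    if (r.removed_by_sync_id !== syncId) return r;
    restoredIds.push(r.id);
    // Restoring means dropping both removal fields.
    return { id: r.id, kind: r.kind };
  });
  return { records: next, restoredIds };
}
```

Because selection ignores removed_at, two syncs that removed records within the same second can still be rolled back independently.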

8. Open Questions

  1. Threshold tuning: Should thresholds be configurable per-tenant (multi-tenant scenario where different clients have different volatility)?
  2. Quarantine storage: Should quarantined scans be stored in a separate collection or tagged in the existing connector_syncs collection?
  3. Grace period for first scan: When a connector runs for the first time, there's no baseline. Should the circuit breaker be disabled for the first N scans?
  4. Webhook delivery guarantees: Should the alerting system guarantee at-least-once delivery (retry on failure), or is best-effort sufficient for MVP?
  5. Admin authentication: The rollback endpoint needs admin-level auth. How should this be distinguished from regular tenant auth? API key with admin scope?
  6. Storage growth: Without automatic hard deletion, soft-deleted records accumulate indefinitely. At what scale does this become a storage concern? (Likely not relevant for years at current data volumes — a single tenant's full entity set is <10MB.)

Resolved Questions (from review)

| Question | Resolution |
| --- | --- |
| Should circuit breaker use "degraded" status? | No. Keep "completed" + circuit_breaker_triggered flag. "degraded" breaks downstream gates. |
| Should we trust connector self-reported expectedNodeCount? | No. All safety decisions platform-derived. Connector errors stored for observability only. |
| Should there be automatic hard deletion (tombstoning)? | No. Soft-delete indefinitely. Manual purge via admin API only. |
| Should AP breaker be per-workload with minimum floor? | No. Global/tenant-level breaker, no minimum floor. |
| How to make rollback deterministic? | removed_by_sync_id field on entities and authority paths. |
| What entity type names for threshold config? | Runtime types from graph transformer (owner, not human_identity). |

9. Operational Monitoring & Admin Dashboard (Detailed Design)

9a. Admin Dashboard Layout

Primary view: Connector Health Cards (one per connector_type per tenant)

Each card shows:

  • Connector name and type (e.g., "Azure Entra ID", "ServiceNow", "Azure Foundry")
  • Overall status badge: Healthy / Degraded / Failed / Stale
  • Last successful sync timestamp + relative time ("2h ago")
  • Entity count from last sync with delta vs. previous ("+12" or "−45 (warning)")
  • Authority paths created/updated/removed in last sync
  • Mini sparkline: entity count trend over last 10 syncs
  • Click-through to filtered SyncsPage for that connector

Industry reference: SailPoint IdentityNow shows per-source health with Normal/Error states and aggregation troubleshooting views. Veza groups dashboards by security scenario with 90-day trend analysis.

9b. Enhanced SyncsPage

Extend the existing SyncsPage.tsx (already built with DataTable, status badges, filtering):

  • Entity delta column: +created / −removed with color coding (red for >30% drop)
  • Duration comparison: vs P50 for this connector type
  • Health badge: derived from scan health report
  • Error column: display sync.error field (stored in DB, currently hidden in UI)
  • Expandable detail: side-by-side metrics comparison with previous sync
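
The entity delta shown on the health cards and in the SyncsPage column could be derived with a small helper like the one below. `describeEntityDelta` and `EntityDelta` are hypothetical names, and measuring the 30% drop against the previous sync's entity count is an assumption about the baseline.

```typescript
interface EntityDelta {
  label: string; // e.g. "+12" or "−45 (warning)"
  dropRatio: number; // fraction of the previous count that disappeared
  warning: boolean; // true when the UI should render the delta in red
}

const DROP_WARNING_RATIO = 0.3; // from the ">30% drop" rule

function describeEntityDelta(previous: number, current: number): EntityDelta {
  const diff = current - previous;
  // A drop is only meaningful when there is a non-zero previous count.
  const dropRatio = previous > 0 && diff < 0 ? -diff / previous : 0;
  const warning = dropRatio > DROP_WARNING_RATIO;
  const sign = diff < 0 ? "\u2212" : "+"; // U+2212 minus sign, as in the mockup
  const label = `${sign}${Math.abs(diff)}${warning ? " (warning)" : ""}`;
  return { label, dropRatio, warning };
}
```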

9c. Alerting Architecture

Tiered alerts:

| Tier | Condition | Action |
| --- | --- | --- |
| P1 Critical | Sync failed; job stalled >10 min; no sync in >2× expected interval | In-app banner + webhook (Slack/PagerDuty) |
| P2 Warning | Entity count drop >30%; queue backlog >15 min; partial sync | In-app notification + webhook (Slack) |
| P3 Info | Sync completed; new finding types detected | Daily digest |

Implementation:

  • New alerts collection in MongoDB (type, severity, connector, sync_id, message, acknowledged_at)
  • Alert evaluation runs after each sync_ingestion and evaluate_findings completion
  • Notification bell in UI header with unread count
  • Webhook dispatcher: single outbound HTTP POST covers Slack, Teams, PagerDuty
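
As an illustration of the post-sync evaluation step, the sketch below maps a finished sync to a document for the alerts collection. `AlertDoc` follows the field list above; `SyncSignals` and `evaluateAlert` are hypothetical. Only conditions derivable from a single sync are covered: the stalled-job, queue-backlog, and missed-interval conditions need a timer-driven check and are left out.

```typescript
type Severity = "P1" | "P2" | "P3";

// Fields as listed for the alerts collection.
interface AlertDoc {
  type: string;
  severity: Severity;
  connector: string;
  sync_id: string;
  message: string;
  acknowledged_at: Date | null;
}

// Hypothetical summary of a finished sync.
interface SyncSignals {
  connector: string;
  syncId: string;
  failed: boolean;
  partial: boolean;
  previousEntityCount: number;
  currentEntityCount: number;
}

function evaluateAlert(s: SyncSignals): AlertDoc {
  const base = { connector: s.connector, sync_id: s.syncId, acknowledged_at: null };
  if (s.failed) {
    return { ...base, type: "sync_failed", severity: "P1", message: "Sync failed" };
  }
  // Drop is measured against the previous sync's entity count.
  const drop =
    s.previousEntityCount > 0
      ? (s.previousEntityCount - s.currentEntityCount) / s.previousEntityCount
      : 0;
  if (drop > 0.3) {
    const pct = Math.round(drop * 100);
    return {
      ...base,
      type: "entity_count_drop",
      severity: "P2",
      message: `Entity count dropped ${pct}% vs previous sync`,
    };
  }
  if (s.partial) {
    return { ...base, type: "partial_sync", severity: "P2", message: "Partial sync" };
  }
  // Everything else is informational and goes to the daily digest.
  return { ...base, type: "sync_completed", severity: "P3", message: "Sync completed" };
}
```

Returning exactly one alert per sync, with the highest tier winning, is a simplification; a real evaluator might emit several.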

Escalation pattern:

P3 → Log + daily digest
P2 → In-app + webhook; if unacknowledged 4h → escalate to P1
P1 → In-app + webhook + PagerDuty; if unacknowledged 30min → re-fire
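
The escalation timers can be expressed as a function a periodic job might call for each open alert. `OpenAlert`, `EscalationAction`, and `nextEscalation` are hypothetical names, and measuring elapsed time from the last firing rather than from creation is an assumption.

```typescript
type Severity = "P1" | "P2" | "P3";
type EscalationAction = "none" | "escalate_to_p1" | "refire";

// Hypothetical view of an alert as seen by the escalation job.
interface OpenAlert {
  severity: Severity;
  last_fired_at: Date;
  acknowledged_at: Date | null;
}

const P2_ESCALATE_AFTER_MIN = 4 * 60; // unacknowledged 4h
const P1_REFIRE_AFTER_MIN = 30; // unacknowledged 30min

function nextEscalation(alert: OpenAlert, now: Date): EscalationAction {
  // Acknowledged alerts never escalate.
  if (alert.acknowledged_at !== null) return "none";
  const minutes = (now.getTime() - alert.last_fired_at.getTime()) / 60_000;
  if (alert.severity === "P2" && minutes >= P2_ESCALATE_AFTER_MIN) {
    return "escalate_to_p1";
  }
  if (alert.severity === "P1" && minutes >= P1_REFIRE_AFTER_MIN) {
    return "refire";
  }
  // P3 alerts only feed the log and daily digest.
  return "none";
}
```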

9d. Scan Quarantine Workflow

When anomaly thresholds are breached, quarantine instead of apply:

Scan arrives → Validate schema → Check anomaly thresholds
                                   │                  │
                                   │ (normal)         │ (anomaly detected)
                                   ▼                  ▼
                          Process normally    Store as "quarantined" sync
                                              Alert P2 to admin


Admin reviews in UI:
- Previous vs current metrics side-by-side
- Actions: Approve / Reject / Re-scan

Quarantine triggers (deterministic):

  • Entity count drops >50% from previous sync
  • Entity count increases >200%
  • Zero entities returned when baseline >0
  • Scan duration < 10% of P50 (suspiciously fast → likely incomplete)

Quarantine tracking: Quarantined scans are marked status: "completed" with circuit_breaker_triggered: true and quarantined: true in sync metrics. No new sync statuses are added — this avoids breaking the evaluate-findings.ts:21 gate on status === "completed".
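
The four triggers and the tracking fields can be sketched together as follows. `ScanStats`, `Baseline`, `quarantineReasons`, and `quarantineOutcome` are hypothetical names. Two interpretations are assumed: "increases >200%" is read as growth of more than twice the previous count (current above 3× previous), and the count-based checks are skipped when there is no previous count to compare against.

```typescript
interface ScanStats {
  entityCount: number;
  durationMs: number;
}

interface Baseline {
  previousEntityCount: number;
  p50DurationMs: number; // median duration for this connector type
}

function quarantineReasons(scan: ScanStats, baseline: Baseline): string[] {
  const reasons: string[] = [];
  const prev = baseline.previousEntityCount;
  if (prev > 0) {
    // Relative change vs the previous sync: -0.6 means a 60% drop.
    const change = (scan.entityCount - prev) / prev;
    if (scan.entityCount === 0) reasons.push("zero_entities_with_baseline");
    if (change < -0.5) reasons.push("entity_drop_over_50pct");
    if (change > 2.0) reasons.push("entity_increase_over_200pct");
  }
  if (
    baseline.p50DurationMs > 0 &&
    scan.durationMs < 0.1 * baseline.p50DurationMs
  ) {
    reasons.push("duration_under_10pct_of_p50");
  }
  return reasons;
}

// No new sync status: quarantine is expressed through metrics flags only.
function quarantineOutcome(reasons: string[]) {
  const quarantined = reasons.length > 0;
  return {
    status: "completed" as const,
    metrics: {
      circuit_breaker_triggered: quarantined,
      quarantined,
      quarantine_reasons: reasons,
    },
  };
}
```

A scan returning zero entities against a non-zero baseline reports two reasons, since a 100% drop also exceeds the 50% threshold.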

9e. Operational Runbooks

Currently placeholder at docs/runbooks/index.md. Priority runbooks to write:

  1. Sync Failure Triage — classify error (connection/schema/DB/transform), fix, re-scan, verify recovery
  2. Data Freshness Outage — check connector alive, check for stalled syncs, check target system availability
  3. Delta Anomaly Triage — determine if real change vs connector bug, accept new baseline or investigate
  4. Authority Path Rollback — use admin API or CLI to restore paths removed by a bad sync

9f. Build vs Buy Decision

| Approach | Effort | Recommendation |
| --- | --- | --- |
| Custom admin panel in product UI | 5-7 days | Recommended for MVP: single deployment, customers see it too |
| Grafana + Prometheus | 2-3 days setup | Deferred: wire up existing /metrics endpoint when >5 tenants |
| Datadog / Monte Carlo | $6K+/year | Not justified at current scale |

Key insight: SecurityV0 already has Prometheus metrics at /metrics with 8 metric families. Grafana can be added in 2-3 hours when needed. The admin panel is the higher-value investment because it's customer-facing.

9g. Existing Infrastructure (Already Built)

| Component | Status | File |
| --- | --- | --- |
| Health endpoints (/health, /ready, /metrics, /diagnostics) | Built | src/api/routes/system.ts |
| Prometheus metrics (HTTP latency, job duration, queue depth, sync age, findings, authority paths) | Built | src/shared/metrics/metrics.ts |
| Structured JSON logging | Built | src/shared/logging/logger.ts |
| Syncs API (GET /api/v1/syncs) | Built | src/api/routes/syncs.ts |
| SyncsPage UI (table, filtering, status badges) | Built | ui/src/pages/SyncsPage.tsx |
| Worker queue depth tracking | Built | src/workers/runtime.ts |
| ConnectorSyncDoc with detailed metrics | Built | src/domain/syncs/types.ts |

10. References

Internal

  • Architecture docs: docs/architecture/03-database.md (connector_syncs schema, lines 518-580)
  • Processing pipeline: docs/architecture/02-processing-pipeline.md (SLIs/SLOs, alert matrix, dashboard requirements)
  • Existing infrastructure: Prometheus metrics (src/shared/metrics/metrics.ts), health endpoints (src/api/routes/system.ts), SyncsPage (ui/src/pages/SyncsPage.tsx), worker runtime (src/workers/runtime.ts)

Industry Research

  • SailPoint: Aggregation safeguards — full/delta/targeted aggregation modes; zero-account aggregation abort; uncorrelated account review workflow; per-source health notifications (docs, aggregation troubleshooting)
  • Veza: OAA provider-level granularity — failed push for one provider does not affect others; dashboard grouping by security scenario; 90-day trend analysis (product updates)
  • Wiz: Last-seen model with type-specific grace periods (24h cloud resources, 72h soft-delete, 7d identity retention); resource drift alerting
  • CrowdStrike Falcon: Sensor health model — "reduced functionality mode" retains last-known-good state; 45-minute inactive threshold before status change
  • Splunk: Event count deviation monitoring (50% of 7-day rolling average triggers alert); append-only model prevents destructive overwrites; data quarantine for suspect data
  • Microsoft Sentinel: Data connector health monitoring with configurable per-connector thresholds
  • Prisma Cloud (Palo Alto): Resource drift alerting when >30% of resources disappear in single scan
  • ServiceNow CMDB: IRE staging area with reconciliation rules; staleness thresholds (7 days cloud, 30 days on-prem); IRE batches held in staging on anomaly detection
  • Data Observability: Monte Carlo's 5 pillars (freshness, volume, schema, distribution, lineage); O'Reilly Data Quality Fundamentals ch4 (monitoring and anomaly detection for pipelines)

Appendix A: Review Findings Traceability

All 8 review findings from the v1 draft review have been addressed in this v2 revision.

| # | Severity | Finding | Resolution | Section |
| --- | --- | --- | --- | --- |
| 1 | Critical | AP breaker per-workload >= 3 allows full wipe for small tenants | Replaced with global/tenant-level breaker, no minimum floor | Phase 1a, 1b |
| 2 | Critical | Entity deletion breaker doesn't stop authority path materialization from removing paths via missing execution_paths | Cascading pipeline gate: if entity deletion is blocked, materialization also blocked | Phase 1b (cascading gate) |
| 3 | Critical | "degraded" status breaks downstream: evaluate-findings.ts:21 gates on status === "completed" | Keep "completed" status + circuit_breaker_triggered: true flag in metrics | Phase 1d |
| 4 | Critical | Rollback by removed_at ± 5s is non-deterministic | Added removed_by_sync_id field for exact causality tracking | Phase 2e, Phase 6a |
| 5 | High | Trusting connector self-reported expectedNodeCount for breaker overrides | Removed expectedNodeCount. All safety decisions platform-derived. Override is operator-only. | Phase 2a, Phase 4 |
| 6 | High | Tombstoning = irreversible loss | Removed tombstone phase entirely. Soft-delete indefinitely, manual purge via admin API. | Phase 3b |
| 7 | High | Phase 2 incomplete vs current code (ingest validation, hardcoded sync_mode) | Added ingest route validation (2c) and sync_mode derivation from payload (2d) | Phase 2c, 2d |
| 8 | Medium | Entity type policy names (human_identity) don't match runtime types (owner) | Threshold config uses runtime types from graph transformer | Phase 1a (threshold table) |

Core assumption validated: No automatic irreversible deletes and no automatic large soft-removals from a single suspect scan.