Implementation Plan: Scan Safety, Data Loss Prevention & Connector Observability

Date: 2026-02-26
Status: Draft v2, revised per review findings (8 items addressed)
Scope: sv0-platform (ingestion pipeline, API, UI), sv0-connectors (entra-servicenow)

Core Assumption (non-negotiable): The platform must never perform automatic irreversible deletes and must never perform automatic large soft-removals from a single suspect scan. All destructive operations must be gated, observable, and reversible.

Trigger: Production incident in which a fresh connector scan removed all 5 authority paths for the default tenant.

1. Incident Summary

On 2026-02-26, a fresh entra-servicenow connector scan (syncId: cebc3162) removed all 5 existing authority paths for the default tenant. The production UI showed zero authority paths where 5 had been present.

Impact: Complete loss of authority path visibility for the real (non-demo) tenant.

Root cause: Two compounding failures:

Connector bug (primary): Commit f322e48 ("discover all outbound REST Messages, not just Azure") removed the Azure endpoint filter from the get_outbound_rest_messages() query. The _get_table() method has a 100-record default limit. With the filter removed, the ServiceNow instance returned 100+ generic REST Messages, pushing the Azure-specific ones outside the window. Result: discover_execution_chains() returned 0 chains, and the graph output contained 0 chain workloads (business rules, script includes, scheduled jobs).
Platform design gap (amplifier): The platform uses a full-replacement model for sync processing. When the diff engine (diff-engine.ts:267-305) detected that 5 previously-ingested workloads were absent from the incoming graph, it marked them as deleted. The authority path materializer (sync-ingestion.ts:166-181) then soft-removed all authority paths for those deleted workloads.

Key metric from the failing sync:

pathsComputed: 1, authorityPathsCreated: 0, authorityPathsRemoved: 5

Resolution: Authority paths were manually restored via MongoDB updateMany (status: "removed" → "active"). This was possible because the platform uses soft-delete: markAuthorityPathsRemoved sets status: "removed" and does not physically delete the documents.

2. Design Principles

Based on industry research across identity governance (SailPoint, Veza), CSPM (Wiz, Prisma Cloud), SIEM (Splunk, Sentinel), and sensor platforms (CrowdStrike Falcon):

Never trust a single scan to be complete. Connectors can fail partially — API limits, permission revocations, timeouts. The platform must treat incoming data as potentially incomplete.
Absence ≠ deletion. An entity missing from one scan should not be immediately removed. Use "last seen" tracking with grace periods (industry standard: 1-7 days depending on entity type).
Protect high-value data with circuit breakers. If a sync would remove a significant portion of existing data, halt and quarantine rather than apply. Thresholds vary: 30% for identities, 50% for resources (SailPoint/Veza pattern).
Connectors must declare scope. A connector scanning only Function Apps should not trigger deletion of unrelated ServiceNow workloads. Scan scope must be explicit in the payload.
Make all destructive operations observable and reversible. Operators must be able to see what each scan changed, detect anomalies, and roll back bad syncs.

3. Implementation Plan

Phase 0: Connector Fix (Immediate — blocks further scans)

Goal: Fix the entra-servicenow connector so scans produce complete graphs.

Task	File	Change
Use paginated query for REST Messages	`servicenow_client.py` → `get_outbound_rest_messages()`	Replace `_get_table()` with `_get_table_paginated()` to fetch all REST Messages regardless of count
Add self-validation before submission	`cli/main.py` → submit logic	Log warning if chain discovery returns 0 chains when prior scan had >0; optionally abort submission

Estimated effort: 1 hour
Prevents recurrence of this specific incident: Yes
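The two Phase 0 tasks can be sketched as follows. The connector itself is Python, so this TypeScript version only illustrates the logic; `fetchPage`, `pageSize`, `maxPages`, and `chainCountLooksSuspect` are hypothetical names, not the actual signature of `_get_table_paginated()` or the CLI's submit logic.

```typescript
// Hypothetical page fetcher: returns up to `limit` records starting at `offset`.
type PageFetcher<T> = (offset: number, limit: number) => T[];

// Collect every record by walking pages until a short (or empty) page is returned.
function fetchAllRecords<T>(
  fetchPage: PageFetcher<T>,
  pageSize = 100,
  maxPages = 1000, // safety cap so a misbehaving API cannot loop forever
): T[] {
  const all: T[] = [];
  for (let page = 0; page < maxPages; page++) {
    const batch = fetchPage(page * pageSize, pageSize);
    all.push(...batch);
    if (batch.length < pageSize) return all; // last page reached
  }
  // Fail loudly rather than return a silently truncated result.
  throw new Error(`Pagination exceeded ${maxPages} pages`);
}

// Self-validation: flag a scan whose chain count collapsed to zero
// although the prior scan found chains.
function chainCountLooksSuspect(previousChains: number, currentChains: number): boolean {
  return previousChains > 0 && currentChains === 0;
}
```

Throwing when the page cap is hit, instead of returning what was collected so far, is a deliberate choice in this sketch: a truncated result is exactly the failure mode that caused the incident.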

Phase 1: Platform Circuit Breaker (P0 — 1 day)

Goal: Prevent any single sync from causing mass entity deletion or authority path removal, regardless of connector bugs.

Review finding addressed (Critical #1): The original per-workload AP breaker with existingActive.length >= 3 minimum would allow a full wipe for small tenants (e.g., 2 workloads with 2 paths each → 100% removal allowed). The breaker now operates at the global/tenant level across all entities and paths for the entire sync, with no minimum floor — even removing 1 of 1 paths triggers evaluation.

Review finding addressed (Critical #2): Entity deletion and authority path removal are two separate destructive operations, but they are causally linked: deleted entities → missing execution_paths → materializer removes authority paths. The circuit breaker now gates the entire destructive pipeline (entity deletion + execution path materialization + authority path removal) as a single unit. If the entity deletion breaker fires, the materializer is also blocked from removing paths.

1a. Global Entity Deletion Threshold

Before the diff engine marks entities as deleted, compare the total deletion count against total existing entity count across all source systems in the sync.

File: src/ingestion/diff-engine.ts (around line 267-305)

// Proposed logic — GLOBAL breaker (runs once for the entire sync, not per-workload):
interface DeletionBreaker {
  totalToDelete: number;
  totalExisting: number;
  deletionRatio: number;
  triggered: boolean;
  blockedEntityIds: string[];
}

function evaluateDeletionBreaker(
  allToDelete: EntityDoc[],
  allExistingForSyncSources: EntityDoc[],
): DeletionBreaker {
  const totalToDelete = allToDelete.length;
  const totalExisting = allExistingForSyncSources.length;
  const deletionRatio = totalExisting > 0 ? totalToDelete / totalExisting : 0;
  const threshold = 0.50; // 50% global threshold — no minimum floor

  // Zero existing = first scan, no breaker needed
  if (totalExisting === 0) {
    return { totalToDelete, totalExisting, deletionRatio, triggered: false, blockedEntityIds: [] };
  }

  // Special case: every existing entity would be deleted (e.g., the incoming scan
  // is empty while the baseline is not) → always block, whatever the threshold is.
  // totalExisting > 0 is already guaranteed by the early return above.
  if (totalToDelete === totalExisting) {
    return { totalToDelete, totalExisting, deletionRatio: 1.0, triggered: true,
             blockedEntityIds: allToDelete.map(e => e.node_id) };
  }

  const triggered = deletionRatio > threshold;
  return {
    totalToDelete, totalExisting, deletionRatio, triggered,
    blockedEntityIds: triggered ? allToDelete.map(e => e.node_id) : [],
  };
}

Per-type thresholds (applied as a secondary check within global breaker):

Entity Type (runtime)	Threshold	Rationale
`identity` (service principals)	30%	Anchor authority paths — rarely mass-deleted legitimately
`workload`	40%	Core platform object; client config changes are incremental
`role` / `permission`	40%	Role structures are relatively stable
`resource`	60%	Cloud resources are more volatile (scale-up/down)
`owner`	50%	People join/leave; moderate volatility
Default	50%	Safe middle ground

Review finding addressed (Medium #8): Entity type names in thresholds now use runtime types from the graph transformer (e.g., owner not human_identity), matching NormalizedNode.nodeType values that actually appear in the pipeline.

No minimum entity floor. Previous draft required existingForSource.length >= 5 — this is removed. A tenant with 2 entities losing both is a 100% drop and must be caught.
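The per-type secondary check is described only by the threshold table. A minimal sketch, assuming each entity carries its runtime type in a `nodeType` field and that the check returns the list of types that exceeded their threshold (`TypedEntity`, `countByType`, and `typesOverThreshold` are hypothetical names):

```typescript
// Per-type thresholds from the table above; unknown types use the default.
const TYPE_THRESHOLDS: Record<string, number> = {
  identity: 0.3,
  workload: 0.4,
  role: 0.4,
  permission: 0.4,
  resource: 0.6,
  owner: 0.5,
};
const DEFAULT_THRESHOLD = 0.5;

// Minimal shape assumed for illustration; the real EntityDoc has more fields.
interface TypedEntity {
  node_id: string;
  nodeType: string;
}

function countByType(items: TypedEntity[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) {
    counts[item.nodeType] = (counts[item.nodeType] ?? 0) + 1;
  }
  return counts;
}

// Returns the entity types whose deletion ratio exceeds their own threshold.
function typesOverThreshold(toDelete: TypedEntity[], existing: TypedEntity[]): string[] {
  const existingCounts = countByType(existing);
  const deleteCounts = countByType(toDelete);
  const tripped: string[] = [];
  for (const type of Object.keys(deleteCounts)) {
    const total = existingCounts[type] ?? 0;
    if (total === 0) continue; // nothing on record for this type
    const threshold = TYPE_THRESHOLDS[type] ?? DEFAULT_THRESHOLD;
    if ((deleteCounts[type] ?? 0) / total > threshold) tripped.push(type);
  }
  return tripped;
}
```

Under these thresholds, deleting 4 of 10 identities trips the check (40% > 30%) while deleting 5 of 10 resources does not (50% < 60%). Whether a tripped type blocks the whole sync or only deletions of that type is left open here.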

1b. Cascading Pipeline Gate

When the entity deletion breaker fires, the entire destructive pipeline for this sync is halted — not just entity deletion.

File: src/workers/handlers/sync-ingestion.ts

// After entity diff (both arrays span every source system in this sync, not a single source):
const deletionBreaker = evaluateDeletionBreaker(allToDelete, allExistingForSyncSources);

if (deletionBreaker.triggered) {
  logger.warn("Circuit breaker triggered — blocking ALL destructive operations", {
    syncId, tenantId,
    wouldDelete: deletionBreaker.totalToDelete,
    existing: deletionBreaker.totalExisting,
    ratio: deletionBreaker.deletionRatio,
  });

  // Skip: entity deletions
  // Skip: execution path re-materialization for deleted entities
  // Skip: authority path removal (markAuthorityPathsRemoved)
  // Continue: entity creates/updates (additive operations are safe)
  // Continue: findings evaluation, evidence packs, posture snapshot

  syncMetrics.circuit_breaker_triggered = true;
  syncMetrics.deletions_blocked = deletionBreaker.totalToDelete;
  syncMetrics.authority_paths_removal_blocked = /* count from materializer */ 0;
}

Key insight: The materializer at authority-path-materializer.ts:115-127 removes paths when execution_paths is empty for a workload. If we mark workloads as deleted (removing their execution_paths), the materializer will cascade-remove their authority paths even without an explicit AP breaker. Therefore, the entity deletion breaker must gate the materializer as well — if deletions are blocked, the materializer runs with the pre-existing entity set (as if the deletions never happened).
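One way to express "as if the deletions never happened" is to choose the materializer's input set based on the breaker result. This is a sketch under that assumption; `entitiesForMaterializer` and the minimal `GateEntity` shape are hypothetical and not part of the existing materializer API.

```typescript
// Minimal shape assumed for illustration.
interface GateEntity {
  node_id: string;
}

// When the breaker fires, the materializer receives the full pre-existing
// entity set. Otherwise it receives the set with the deletions applied.
function entitiesForMaterializer(
  existing: GateEntity[],
  toDelete: GateEntity[],
  breakerTriggered: boolean,
): GateEntity[] {
  if (breakerTriggered) return existing;
  const deleted: Record<string, true> = {};
  for (const e of toDelete) deleted[e.node_id] = true;
  return existing.filter((e) => !deleted[e.node_id]);
}
```

Because the workloads keep their execution_paths in the blocked case, the materializer has no empty-path workloads to act on, so no authority paths are cascade-removed.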

1c. Authority Path Removal Safety Net (Independent)

A secondary check at the authority path level, as defense-in-depth for cases where entity deletion proceeds but materializer behavior is anomalous.

File: src/ingestion/authority-path-materializer.ts (around line 115-127)

// GLOBAL authority path breaker — NOT per-workload
const totalExistingPaths = allExistingActivePaths.length;
const totalToRemove = allPathsToRemove.length;
const removalRatio = totalExistingPaths > 0 ? totalToRemove / totalExistingPaths : 0;
const AP_REMOVAL_THRESHOLD = 0.50;

// No minimum floor — even 1 of 1 is 100% and triggers
if (removalRatio > AP_REMOVAL_THRESHOLD && totalExistingPaths > 0) {
  logger.warn("Authority path removal blocked by circuit breaker", {
    syncId, tenantId,
    wouldRemove: totalToRemove,
    existing: totalExistingPaths,
    ratio: removalRatio,
  });
  syncMetrics.authority_paths_removal_blocked = totalToRemove;
  // Skip all path removals — return early
}

1d. Sync Status: Keep `"completed"` + Flag

Review finding addressed (Critical #3): The original plan proposed adding "degraded" to SyncStatus. This would break downstream processing: evaluate-findings.ts:21 gates on sync.status !== "completed" — a "degraded" status would cause the findings evaluator, evidence pack builder, and posture snapshot to all be skipped. This is worse than the original problem.

Solution: Keep status: "completed" so downstream processing runs normally. Add a circuit_breaker_triggered: boolean flag on the sync metrics to indicate that destructive operations were blocked.

File: src/domain/syncs/types.ts

Do NOT add "degraded" to SyncStatus. Instead, add metrics fields:

// Added to ConnectorSyncMetrics (existing interface):
deletions_blocked?: number;
authority_paths_removal_blocked?: number;
circuit_breaker_triggered?: boolean;
circuit_breaker_details?: {
  entity_deletion_ratio: number;
  entity_deletion_threshold: number;
  ap_removal_ratio: number;
  ap_removal_threshold: number;
};

The sync status remains "completed". The UI and alerting system check circuit_breaker_triggered to surface warnings. Findings evaluation, evidence packs, and posture snapshots all proceed normally against the preserved data, i.e., the entity and path set as it stood before the blocked deletions.
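A consumer of these metrics (UI badge or alerting hook) might derive its warning like this. The sketch assumes only the metrics fields listed above; `BreakerMetrics` and `breakerWarning` are hypothetical names.

```typescript
// Subset of ConnectorSyncMetrics relevant to the breaker (fields from the plan above).
interface BreakerMetrics {
  circuit_breaker_triggered?: boolean;
  deletions_blocked?: number;
  authority_paths_removal_blocked?: number;
}

// Returns a warning string for the UI/alerting layer, or null when the sync
// did not complete or nothing was blocked.
function breakerWarning(status: string, metrics: BreakerMetrics): string | null {
  if (status !== "completed" || !metrics.circuit_breaker_triggered) return null;
  const entities = metrics.deletions_blocked ?? 0;
  const paths = metrics.authority_paths_removal_blocked ?? 0;
  return `Circuit breaker blocked ${entities} entity deletions and ${paths} authority path removals`;
}
```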

Estimated effort: 1 day
Prevents recurrence: Yes. Any connector bug whose incomplete output would delete more than the threshold share of entities or authority paths triggers the circuit breaker, blocking all destructive operations while preserving downstream processing.

Phase 2: Scan Scope Declaration & Rollback Determinism (P0 — 1-2 days)

Goal: Let connectors declare what they scanned so the platform only removes entities within that scope. Make rollback deterministic by tracking which sync caused each removal.

Review finding addressed (Critical #4): The original rollback mechanism used removed_at ± 5s timestamp matching — this is non-deterministic and could restore paths from unrelated operations. Rollback must use removed_by_sync_id for exact causality tracking.

Review finding addressed (High #5): The original plan trusted connector self-reported completeness.expectedNodeCount to relax circuit breaker thresholds. This is dangerous — a buggy connector could self-report "I expected 0 nodes" and bypass all safety. All safety decisions must be platform-derived from historical baselines.

Review finding addressed (High #7): The current codebase has two gaps: (1) the ingest route does not validate or accept scanScope in the payload, and (2) sync-ingestion.ts:33 hardcodes sync_mode: "full". Both must be addressed for scan scope to function.

2a. NormalizedGraph Schema Extension

File: src/ingestion/types.ts

export interface ScanScope {
  /** What this scan covers — only entities matching this scope are eligible for deletion */
  mode: "full" | "incremental" | "targeted";

  /** Source systems included in this scan (e.g., ["servicenow", "entra_id"]) */
  sourceSystems?: string[];

  /** Entity types included (e.g., ["workload", "identity"]). If omitted, all types in scope. */
  scannedEntityTypes?: string[];

  /** Connector self-reported errors — used ONLY for logging/observability, NOT for safety decisions */
  errors?: {
    errorsEncountered?: number;
    partialFailures?: string[];
    permissionDenied?: string[];
  };
}

export interface NormalizedGraph {
  // ... existing fields
  scanScope?: ScanScope;
}

Important: The completeness.expectedNodeCount field from the original draft is removed. The platform must never trust connector self-reported node counts for circuit breaker override decisions. All safety thresholds are derived from the platform's own historical baseline (previous successful sync for the same connector + tenant).

2b. Scope-Aware Deletion

File: src/ingestion/diff-engine.ts

When scanScope.mode === "incremental", skip deletion detection entirely (additive-only). When scanScope.scannedEntityTypes is provided, only consider entities of those types for deletion.
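The scope rules can be sketched as a filter applied to the diff engine's deletion candidates. `ScanScopeLike`, `Candidate`, and `deletionCandidatesInScope` are hypothetical names, and the sketch assumes candidates expose their runtime type as `nodeType`.

```typescript
interface ScanScopeLike {
  mode: "full" | "incremental" | "targeted";
  scannedEntityTypes?: string[];
}

// Minimal shape assumed for illustration.
interface Candidate {
  node_id: string;
  nodeType: string;
}

// Narrow the deletion candidates to what the scan actually covered.
function deletionCandidatesInScope(
  candidates: Candidate[],
  scope?: ScanScopeLike,
): Candidate[] {
  if (!scope) return candidates;               // no scope declared: existing full-sync behavior
  if (scope.mode === "incremental") return []; // additive-only, no deletion detection
  const types = scope.scannedEntityTypes;
  if (!types) return candidates;               // all entity types are in scope
  return candidates.filter((c) => types.indexOf(c.nodeType) !== -1);
}
```

Note that an explicitly empty `scannedEntityTypes` array makes nothing eligible for deletion in this sketch, which errs on the side of retaining data.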

2c. Ingest Route Validation

File: src/api/routes/ingest.ts

The ingest route must accept and validate the scanScope field from the payload:

// Add to existing NormalizedGraph validation schema:
scanScope: z.object({
  mode: z.enum(["full", "incremental", "targeted"]),
  sourceSystems: z.array(z.string()).optional(),
  scannedEntityTypes: z.array(z.string()).optional(),
  errors: z.object({
    errorsEncountered: z.number().optional(),
    partialFailures: z.array(z.string()).optional(),
    permissionDenied: z.array(z.string()).optional(),
  }).optional(),
}).optional(),

2d. Sync Mode from Payload (Remove Hardcoded Default)

File: src/workers/handlers/sync-ingestion.ts (line 33)

Currently: sync_mode: "full" is hardcoded. Change to derive from the incoming payload:

// Before (hardcoded):
sync_mode: "full",

// After (from payload, with safe default):
sync_mode: graph.scanScope?.mode ?? "full",

When mode is "incremental", the diff engine skips deletion detection. When "targeted", only scannedEntityTypes are eligible for deletion. When "full" (or absent), current behavior is preserved (all entity types eligible).

2e. Deterministic Rollback: `removed_by_sync_id`

File: src/domain/authority-paths/types.ts

Add removed_by_sync_id to the authority path schema:

// Added to AuthorityPathDoc:
removed_by_sync_id?: string;  // The sync that caused this path to be removed

File: src/ingestion/authority-path-materializer.ts

When removing authority paths, stamp them with the sync ID:

// In markAuthorityPathsRemoved:
await collection.updateMany(
  { _id: { $in: pathIdsToRemove } },
  {
    $set: {
      status: "removed",
      removed_at: new Date(),
      removed_by_sync_id: syncId,  // NEW: deterministic rollback key
    }
  }
);

File: src/domain/entities/types.ts

Add removed_by_sync_id to entity schema for the same reason:

// Added to EntityDoc:
removed_by_sync_id?: string;  // The sync that caused this entity to be deleted

This enables the rollback in Phase 6 to use an exact match (removed_by_sync_id === targetSyncId) instead of the non-deterministic removed_at ± 5s window.

2f. Connector-Side Scope Declaration

Files: Both connector transformers should include scanScope in their output.

The entra-servicenow connector already implicitly knows its scope — it scans ["servicenow", "entra_id"] entities of types ["workload", "identity", "role", "permission", "resource", "connection"]. Making this explicit prevents the class of bug where a partial failure in one subsystem (chain discovery) causes deletions across all entity types.
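A connector could derive its declared scope from the outcome of its own discovery steps, as sketched below. The connectors are Python, so this TypeScript version only illustrates the logic; `DiscoveryStep`, its step names, and `buildScanScope` are hypothetical. A type is declared as scanned only if every step that produces it succeeded.

```typescript
interface DeclaredScope {
  mode: "full" | "incremental" | "targeted";
  sourceSystems: string[];
  scannedEntityTypes: string[];
}

interface DiscoveryStep {
  name: string;       // hypothetical step label, e.g. "business_rules"
  entityType: string; // entity type this step contributes to
  succeeded: boolean; // false if the step errored or was skipped
}

function buildScanScope(steps: DiscoveryStep[], sourceSystems: string[]): DeclaredScope {
  // Any failed step disqualifies its entity type from the declared scope.
  const failedTypes: string[] = [];
  for (const step of steps) {
    if (!step.succeeded && failedTypes.indexOf(step.entityType) === -1) {
      failedTypes.push(step.entityType);
    }
  }
  const scanned: string[] = [];
  for (const step of steps) {
    const t = step.entityType;
    if (failedTypes.indexOf(t) === -1 && scanned.indexOf(t) === -1) scanned.push(t);
  }
  return {
    mode: failedTypes.length === 0 ? "full" : "targeted",
    sourceSystems,
    scannedEntityTypes: scanned,
  };
}
```

This only covers failures the connector knows about. A step that "succeeds" but returns truncated data, as in the incident, would still declare its type as scanned, which is why the platform-side circuit breaker remains necessary.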

Estimated effort: 1-2 days
Prevents recurrence: Yes. Even if chain discovery fails, the scope declaration tells the platform which entity types were actually scanned. A scan that produces 0 workloads but declares scannedEntityTypes: ["workload"] would still trigger the circuit breaker. A scan that covers flows but NOT business rules would omit "workload" from its scanned types, preventing deletion of business-rule workloads.

Phase 3: Soft-Delete with Grace Periods (P1 — 2-3 days)

Goal: Replace immediate removal with a multi-phase lifecycle that tolerates transient scan failures.

3a. Entity Absence Tracking

File: src/domain/entities/types.ts

Add to EntityDoc:

last_seen_sync_id?: string;
last_seen_at?: Date;
consecutive_absences?: number;
absence_since?: Date;

3b. Two-Phase Removal Lifecycle (No Hard Deletion)

Review finding addressed (High #6): The original plan included a "Tombstoned / eligible for hard deletion" phase. This introduces irreversible data loss — once tombstoned and purged, data cannot be recovered even if the deletion was caused by a connector bug discovered weeks later. The tombstone phase is removed entirely. Entities and authority paths use soft-delete indefinitely. Storage cost for soft-deleted records is negligible compared to the risk of irreversible loss.

Phase	Condition	Effect on Authority Paths
Active	Seen in latest scan	Normal — paths materialized
Stale	Missing 1 scan and still within the grace period for its entity type (24-72h, see below)	Retain paths, flag in UI with warning badge
Absent	Missing 2+ scans OR grace period exceeded	Soft-remove from active paths, retain entity as `status: "removed"`

There is no automatic hard deletion. Entities in "Absent" state are soft-deleted (status: "removed") and excluded from active queries, but remain in the database indefinitely. Manual purge is available via admin API for explicit operator-driven cleanup (see Phase 6).

Grace period by entity type:

Entity Type	Stale → Absent
Identity (SP, managed identity)	48h
Workload	48h
Role / Permission	24h
Resource	24h
Owner	72h

3c. Implementation

File: src/ingestion/diff-engine.ts

Instead of adding absent entities to deletedEntityIds:

Increment consecutive_absences on the entity
Set absence_since if first absence
Only add to deletedEntityIds when entity reaches "Absent" phase

File: src/workers/handlers/sync-ingestion.ts

Add a periodic cleanup step (or separate worker job) that promotes stale → absent based on time elapsed. There is no further promotion — absent entities remain soft-deleted indefinitely.
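The lifecycle can be sketched as two pure functions: one that updates the absence counters after each scan, and one that derives the phase. This assumes the per-type grace periods are the time limit for Stale → Absent and that unknown types default to 48h; `AbsenceState`, `recordScanResult`, and `lifecycleOf` are hypothetical names.

```typescript
type Lifecycle = "active" | "stale" | "absent";

// Subset of the EntityDoc absence-tracking fields from 3a.
interface AbsenceState {
  consecutive_absences: number;
  absence_since?: Date;
}

const HOUR_MS = 60 * 60 * 1000;
const DEFAULT_GRACE_HOURS = 48; // assumption for types not in the table

// Stale → Absent grace periods (hours) from the table above.
const GRACE_HOURS: Record<string, number> = {
  identity: 48, workload: 48, role: 24, permission: 24, resource: 24, owner: 72,
};

// Update counters after a scan: a sighting resets them, an absence increments.
function recordScanResult(state: AbsenceState, seen: boolean, now: Date): AbsenceState {
  if (seen) return { consecutive_absences: 0 };
  return {
    consecutive_absences: state.consecutive_absences + 1,
    absence_since: state.absence_since ?? now,
  };
}

// Derive the lifecycle phase; only "absent" entities go into deletedEntityIds.
function lifecycleOf(state: AbsenceState, nodeType: string, now: Date): Lifecycle {
  if (state.consecutive_absences === 0 || !state.absence_since) return "active";
  const graceMs = (GRACE_HOURS[nodeType] ?? DEFAULT_GRACE_HOURS) * HOUR_MS;
  const elapsed = now.getTime() - state.absence_since.getTime();
  if (state.consecutive_absences >= 2 || elapsed > graceMs) return "absent";
  return "stale";
}
```

The periodic cleanup step would call `lifecycleOf` with the current time, which is how a stale entity becomes absent even if no further scans arrive.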

Estimated effort: 2-3 days
Benefit: Tolerates transient connector failures (API timeouts, temporary permission issues) without data loss. If a connector self-heals on next scan, stale entities return to active with no operator intervention. No risk of irreversible data loss from automatic purging.

Phase 4: Connector Health Metrics & Validation (P1 — 2 days)

Goal: Compute health scores per scan and reject/quarantine unhealthy scans before they cause damage.

4a. Health Score Computation

New file: src/ingestion/scan-health.ts

Compute a health score (0.0–1.0) for each incoming scan by comparing against the most recent successful sync for the same connector:

interface ScanHealthReport {
  healthScore: number;                    // 0.0 = critical, 1.0 = healthy
  healthStatus: "healthy" | "degraded" | "critical" | "failed";

  metrics: {
    nodeCount: number;
    edgeCount: number;
    nodeCountByType: Record<string, number>;
  };

  deviations: {
    nodeCountDeltaPercent: number;
    edgeCountDeltaPercent: number;
    missingNodeTypes: string[];           // types present before, absent now
    missingEdgeTypes: string[];
  };

  connectorReported: {
    errorsEncountered: number;
    permissionDenied: string[];
    partialFailures: string[];
  };
}

Health score formula (platform-derived only — no connector self-reported inputs):

healthScore = weighted average of:
  - volumeScore  (45%): node/edge count deviation from baseline (platform-computed)
  - typeScore    (35%): missing entity types penalty (platform-computed)
  - durationScore(20%): scan duration anomaly (platform-computed)

Note: Connector self-reported errors (from scanScope.errors) are stored for observability and logged in the health report, but they are not used as inputs to the health score or circuit breaker decisions. This prevents a buggy connector from self-reporting "all clear" and bypassing safety.
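The plan fixes the weights but not how each sub-score is computed. The sketch below fills that gap with assumptions: each sub-score is 1 minus a relative deviation from the baseline, clamped to [0, 1], and the type score is the share of baseline node types still present. `ScanCounts`, `deviation`, and `computeHealthScore` are hypothetical names.

```typescript
// Platform-computed inputs for one scan (assumed shape).
interface ScanCounts {
  nodeCount: number;
  edgeCount: number;
  nodeTypes: string[];
  durationMs: number;
}

function clamp01(x: number): number {
  return Math.max(0, Math.min(1, x));
}

// Relative deviation of `current` from `baseline`, clamped to [0, 1].
function deviation(current: number, baseline: number): number {
  if (baseline === 0) return 0; // no baseline to deviate from
  return clamp01(Math.abs(current - baseline) / baseline);
}

function computeHealthScore(current: ScanCounts, baseline: ScanCounts): number {
  // Volume: average of node and edge count deviation.
  const volumeScore =
    1 - (deviation(current.nodeCount, baseline.nodeCount) +
         deviation(current.edgeCount, baseline.edgeCount)) / 2;

  // Types: share of baseline node types still present in the current scan.
  const missing = baseline.nodeTypes.filter((t) => current.nodeTypes.indexOf(t) === -1);
  const typeScore =
    baseline.nodeTypes.length === 0 ? 1 : 1 - missing.length / baseline.nodeTypes.length;

  // Duration: deviation from the baseline scan duration.
  const durationScore = 1 - deviation(current.durationMs, baseline.durationMs);

  // Weights from the formula above.
  return 0.45 * volumeScore + 0.35 * typeScore + 0.2 * durationScore;
}
```

With these assumed sub-scores, an empty scan that takes the usual time still earns the full duration weight and scores 0.2, so the duration term may need a different definition if empty scans are meant to fall clearly into the lowest band.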

Thresholds:

Score Range	Status	Action
≥ 0.8	Healthy	Apply normally
0.5 ≤ score < 0.8	Degraded	Apply with circuit breakers active, log warning
0.2 ≤ score < 0.5	Critical	Quarantine: do not apply destructive operations
< 0.2	Failed	Reject scan, notify operators

4b. Pre-Ingestion Gate

File: src/workers/handlers/sync-ingestion.ts

Before step 2 (transform), compute the health report. If healthStatus === "critical", quarantine the sync and skip its destructive operations; if healthStatus === "failed", reject the scan and notify operators, per the thresholds in 4a.
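The threshold table and the gate can be sketched as two small mappings. Treating each band's lower bound as inclusive (so 0.5 is Degraded and 0.2 is Critical) is an assumption, and `GateAction`, `classifyHealth`, and `gateAction` are hypothetical names.

```typescript
type HealthStatus = "healthy" | "degraded" | "critical" | "failed";
type GateAction = "apply" | "apply_with_breakers" | "quarantine" | "reject";

// Map a 0.0-1.0 health score to a status band (lower bounds inclusive).
function classifyHealth(score: number): HealthStatus {
  if (score >= 0.8) return "healthy";
  if (score >= 0.5) return "degraded";
  if (score >= 0.2) return "critical";
  return "failed";
}

// Map a status to the pre-ingestion gate's action, per the thresholds table.
function gateAction(status: HealthStatus): GateAction {
  switch (status) {
    case "healthy": return "apply";
    case "degraded": return "apply_with_breakers";
    case "critical": return "quarantine";
    case "failed": return "reject";
  }
}
```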

4c. Store Health Reports

New collection: scan_health_reports (indexed by tenant_id, sync_id)

Persist every health report for trend analysis and operator review.

Estimated effort: 2 days
Benefit: Catches degraded scans before they enter the pipeline. Provides historical health data for monitoring dashboards.

Phase 5: Observability & Admin Dashboard (P2 — 3-5 days)

Goal: Give operators visibility into connector health, scan history, and anomalies.

What Already Exists

The platform already has significant infrastructure (discovered during audit):

Component	Status	Location
Health endpoints (`/health`, `/ready`, `/metrics`, `/diagnostics`)	Built	`src/api/routes/system.ts`
Prometheus metrics (8+ metrics: HTTP latency, job duration, queue depth, sync age, findings count, authority path count)	Built	`src/shared/metrics/metrics.ts`
Structured JSON logging	Built	`src/shared/logging/logger.ts`
Sync history API (`GET /api/v1/syncs`)	Built	`src/api/routes/syncs.ts`
SyncsPage UI (table with status badges, filtering)	Built	`ui/src/pages/SyncsPage.tsx`
Worker queue depth tracking	Built	`src/workers/runtime.ts`
Connector sync metrics (entities_created/updated, paths_created/removed, etc.)	Built	`src/domain/syncs/types.ts`

What's Missing

Component	Priority	Effort
Connector health summary (last scan per connector, trend sparklines)	P2	1 day
Scan health dashboard (entity counts over time, anomaly flags)	P2	2 days
Error visibility (display sync errors in UI, categorize by type)	P2	1 day
Authority path delta visualization (created/updated/removed per sync)	P2	1 day
Admin/operator page (multi-tenant overview, system status)	P3	2 days
Alerting framework (webhook notifications for degraded/critical scans)	P3	2 days
Operational runbooks (currently placeholder at `docs/runbooks/index.md`)	P3	1 day

5a. Enhanced SyncsPage

File: ui/src/pages/SyncsPage.tsx

Add to existing page:

Health badge per sync (healthy/degraded/critical/failed) based on scan_health_reports
Entity delta column showing +created / −removed counts with color coding
Authority path delta showing paths affected
Error column displaying sync error messages (currently stored in DB but not shown)
Trend sparklines per connector (last 10 syncs entity count)

5b. Connector Health Summary API

New endpoint: GET /api/v1/connectors/health

Returns per-connector:

{
  connectorId: string;
  lastSyncAt: Date;
  lastSyncStatus: string;
  lastHealthScore: number;
  syncCount24h: number;
  failureCount24h: number;
  entityCountTrend: number[];   // last 10 syncs
  authorityPathsTrend: number[]; // last 10 syncs
}
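The response above could be assembled from stored sync records roughly as follows. The `SyncRecord` shape, the `"failed"` status value, and `summarizeConnector` are assumptions for illustration; the real data would come from the syncs and scan_health_reports collections.

```typescript
// Assumed flattened view of one sync plus its health report.
interface SyncRecord {
  connectorId: string;
  completedAt: Date;
  status: string;
  healthScore: number;
  entityCount: number;
  authorityPathCount: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

function summarizeConnector(connectorId: string, syncs: SyncRecord[], now: Date) {
  // Oldest first, so the last element is the most recent sync.
  const own = syncs
    .filter((s) => s.connectorId === connectorId)
    .sort((a, b) => a.completedAt.getTime() - b.completedAt.getTime());
  const last = own[own.length - 1];
  if (!last) return null; // connector has never synced

  const recent = own.filter((s) => now.getTime() - s.completedAt.getTime() <= DAY_MS);
  const lastTen = own.slice(-10);
  return {
    connectorId,
    lastSyncAt: last.completedAt,
    lastSyncStatus: last.status,
    lastHealthScore: last.healthScore,
    syncCount24h: recent.length,
    failureCount24h: recent.filter((s) => s.status === "failed").length,
    entityCountTrend: lastTen.map((s) => s.entityCount),
    authorityPathsTrend: lastTen.map((s) => s.authorityPathCount),
  };
}
```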

5c. Scan Detail View

Clicking a sync in the SyncsPage opens a detail view showing:

Full health report (deviations, missing types, errors)
Entity diff summary (what was created/updated/deleted)
Authority paths affected
Action buttons: "Rollback this sync" (admin only)

5d. Alerting (Phase 5 stretch)

Architecture: Webhook-based notifications.

Events that trigger alerts:

Event	Severity	Channel
Scan failed completely	Critical	Webhook + in-app banner
Circuit breaker triggered (deletions blocked)	Alert	Webhook + in-app notification
Health score dropped below 0.5	Warning	In-app notification
No scan received in > 24h for active connector	Warning	Webhook
Permission denied errors in scan	Alert	Webhook

Webhook payload:

{
  event: "scan_degraded" | "scan_failed" | "circuit_breaker_triggered" | "scan_stale",
  severity: "info" | "warning" | "alert" | "critical",
  connectorId: string,
  syncId: string,
  tenantId: string,
  timestamp: string,
  title: string,       // "Entra scan returned 45% fewer service principals"
  details: { healthScore, deviations, actionTaken, recommendedAction }
}

Configuration: POST /api/v1/settings/webhooks to register notification endpoints (Slack, Teams, email relay, PagerDuty).

Estimated effort: 3-5 days total for Phase 5
Benefit: Operators can monitor connector health without SSH access to production. Anomalies surface proactively instead of being discovered when a customer reports missing data.

Phase 6: Rollback Capability (P2 — 1 day)

Goal: Enable operators to undo the effects of a bad sync.

6a. Restore Removed Authority Paths

Review finding addressed (Critical #4): Rollback uses removed_by_sync_id for exact causality — not the non-deterministic removed_at ± 5s window from the original draft.

Since authority paths use soft-delete with removed_by_sync_id stamping (added in Phase 2e), restoration is a single deterministic query:

async restoreAuthorityPaths(
  tenantId: string,
  syncId: string  // the sync that caused the removal
): Promise<number> {
  // Exact match on the sync that caused removal — no timestamp tolerance needed
  const result = await this.c.authorityPaths.updateMany(
    {
      tenant_id: tenantId,
      status: "removed",
      removed_by_sync_id: syncId,
    },
    {
      $set: { status: "active" },
      $unset: { removed_at: "", removed_by_sync_id: "" }
    }
  );
  return result.modifiedCount;
}

async restoreDeletedEntities(
  tenantId: string,
  syncId: string
): Promise<number> {
  const result = await this.c.entities.updateMany(
    {
      tenant_id: tenantId,
      status: "removed",
      removed_by_sync_id: syncId,
    },
    {
      $set: { status: "active" },
      $unset: { removed_at: "", removed_by_sync_id: "" }
    }
  );
  return result.modifiedCount;
}

6b. Admin API Endpoint

New endpoint: POST /api/v1/admin/syncs/:syncId/rollback

Requires admin authentication. Restores all authority paths and entities that were soft-removed by the specified sync, using the two restore methods in 6a.
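The endpoint's core logic can be sketched against an in-memory stand-in for the two updateMany calls in 6a. `SoftDeletable`, `restoreBySync`, and `rollbackSync` are hypothetical names, and restoring entities before authority paths is an assumed ordering rather than a stated requirement.

```typescript
// Minimal soft-deletable record shape assumed for illustration.
interface SoftDeletable {
  id: string;
  tenant_id: string;
  status: "active" | "removed";
  removed_at?: Date;
  removed_by_sync_id?: string;
}

// Restores only records removed by exactly this sync; returns how many changed.
function restoreBySync(records: SoftDeletable[], tenantId: string, syncId: string): number {
  let restored = 0;
  for (const r of records) {
    if (r.tenant_id === tenantId && r.status === "removed" && r.removed_by_sync_id === syncId) {
      r.status = "active";
      delete r.removed_at;
      delete r.removed_by_sync_id;
      restored++;
    }
  }
  return restored;
}

// Rollback restores entities first, then the authority paths that depend on them.
function rollbackSync(
  entities: SoftDeletable[],
  paths: SoftDeletable[],
  tenantId: string,
  syncId: string,
): { entitiesRestored: number; authorityPathsRestored: number } {
  const entitiesRestored = restoreBySync(entities, tenantId, syncId);
  const authorityPathsRestored = restoreBySync(paths, tenantId, syncId);
  return { entitiesRestored, authorityPathsRestored };
}
```

Because the match is on removed_by_sync_id, records removed by any other sync or by an operator stay removed, and running the rollback a second time restores nothing.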

6c. CLI Script

New file: scripts/rollback-sync.ts

npx tsx scripts/rollback-sync.ts --sync-id cebc3162-... --tenant-id default

Estimated effort: 1 day
Benefit: Recovery from bad syncs without direct MongoDB access. Can be triggered from admin UI or CLI.

4. Implementation Priority & Timeline

Phase	Description	Priority	Effort	Cumulative
0	Connector fix (paginated query)	P0	1h	1h
1	Circuit breaker (deletion + AP thresholds)	P0	1 day	1.5 days
2	Scan scope declaration	P0	1-2 days	3 days
3	Soft-delete with grace periods	P1	2-3 days	6 days
4	Health score computation & pre-ingestion gate	P1	2 days	8 days
5	Observability dashboard & alerting	P2	3-5 days	13 days
6	Rollback capability	P2	1 day	14 days

Phases 0-2 are blocking — they prevent recurrence of this class of bug. Phases 3-4 add defense-in-depth and operational intelligence. Phases 5-6 provide ongoing visibility and recovery tools.

5. What This Prevents

Scenario	Before	After (Phase 1-2)	After (Phase 3-4)
Connector returns empty graph	All paths deleted	Circuit breaker blocks deletion	Entities marked stale, paths retained
Connector loses API permissions	Entities missing, paths removed	Scope-aware deletion limits impact	Health score drops, scan quarantined
Partial connector failure (e.g., one API times out)	Some entity types disappear	Only in-scope types considered for deletion	Grace period covers transient failures
Client intentionally removes configurations	Paths removed on the next scan, indistinguishable from a connector failure	Circuit breaker may false-positive (needs override)	Grace period expires, paths correctly removed
New connector with first-time scan	N/A (creation only)	Normal operation	Health baseline established

Handling Legitimate Removals

When a client genuinely removes configurations (e.g., decommissions an Azure SP), the grace period model (Phase 3) handles this correctly:

First scan after removal: entity marked stale (paths retained, warning in UI)
Second scan: entity moves to absent (paths removed)
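After that: no further promotion. The entity stays soft-deleted (status: "removed") indefinitely; there is no tombstoning or automatic hard deletion.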

For urgent legitimate removals, operators can manually confirm the deletion via the admin UI, bypassing the grace period.

Review finding addressed (High #5): The original plan included an override mechanism based on connector self-reported expectedNodeCount. This is removed — the platform never trusts connector self-reported completeness for safety decisions. The only override is explicit operator action via the admin UI or API. If a client genuinely scaled down and the circuit breaker fires, the operator reviews the quarantined sync and manually approves it.

6. Files Changed (Summary)

sv0-platform

File	Phase	Change
`src/ingestion/diff-engine.ts`	1, 2	Global deletion threshold, scope-aware deletion
`src/ingestion/authority-path-materializer.ts`	1, 2	Global AP removal breaker, `removed_by_sync_id` stamping
`src/ingestion/types.ts`	2	`ScanScope` on `NormalizedGraph` (no `expectedNodeCount`)
`src/ingestion/scan-health.ts`	4	New file — platform-derived health score (no connector self-report)
`src/domain/syncs/types.ts`	1, 4	`circuit_breaker_triggered` flag (NOT `"degraded"` status), health metrics
`src/domain/entities/types.ts`	2, 3	`removed_by_sync_id`, absence tracking fields
`src/domain/authority-paths/types.ts`	2, 3	`removed_by_sync_id`, `"stale"` status
`src/workers/handlers/sync-ingestion.ts`	1, 2, 3, 4	Cascading pipeline gate, `sync_mode` from payload, lifecycle
`src/api/routes/ingest.ts`	2	Validate `scanScope` in payload
`src/storage/storage-adapter.ts`	3, 6	New methods (stale marking, deterministic restore by sync_id)
`src/api/routes/syncs.ts`	5, 6	Health summary endpoint, rollback endpoint
`ui/src/pages/SyncsPage.tsx`	5	Health badges, circuit breaker warnings, trends, error display

sv0-connectors

File	Phase	Change
`entra-servicenow/.../servicenow_client.py`	0	`_get_table_paginated()` for REST Messages
`entra-servicenow/.../cli/main.py`	0, 2	Self-validation, `scanScope` in output
`azure-foundry/.../transformer.py`	2	`scanScope` in output

7. Acceptance Criteria

Phase 0-1 (blocks deployment)

Connector scan produces all chain workloads (no 100-record limit)
Sync that would remove >50% of entities triggers global circuit breaker (no per-workload minimum floor)
Circuit breaker gates entire destructive pipeline (entity deletion + materialization + AP removal)
Circuit breaker logs warning with counts and ratio
Sync status remains "completed" with circuit_breaker_triggered: true in metrics
Downstream processing (findings, evidence, posture) runs normally when breaker fires
Entity type threshold config uses runtime types (owner, not human_identity)
Existing tests pass, new unit tests for threshold logic

Phase 2

NormalizedGraph accepts scanScope field (validated at ingest route)
mode: "incremental" skips all deletion detection
scannedEntityTypes limits deletion scope to declared types
sync_mode derived from payload (no longer hardcoded to "full")
removed_by_sync_id stamped on all soft-deleted entities and authority paths
Both connectors include scanScope in output
No connector self-reported counts used for safety decisions

Phase 3

Entities track last_seen_sync_id, consecutive_absences
First absence marks entity as stale (not deleted)
Authority paths for stale entities are retained
Entities reaching absence threshold are soft-removed (no hard deletion)
No automatic tombstoning or hard deletion lifecycle
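
The two-phase lifecycle can be illustrated with a small state-transition function. The field names `last_seen_sync_id`, `consecutive_absences`, and `removed_by_sync_id` come from the plan; the function name `applySyncObservation`, the `absenceThreshold` parameter, and the decision to reactivate a removed entity when it reappears are assumptions of this sketch. The threshold value itself is not specified here and is passed in by the caller.

```typescript
// Hypothetical sketch of absence tracking: active -> stale -> removed (soft), never hard-deleted.
type EntityStatus = "active" | "stale" | "removed";

interface TrackedEntity {
  id: string;
  status: EntityStatus;
  last_seen_sync_id: string;
  consecutive_absences: number;
  removed_by_sync_id?: string;
}

function applySyncObservation(
  entity: TrackedEntity,
  seen: boolean,
  syncId: string,
  absenceThreshold: number, // assumed >= 2 so the first absence only marks the entity stale
): TrackedEntity {
  if (seen) {
    // Seen again: reset the counter and clear any removal stamp (assumed behavior).
    return { id: entity.id, status: "active", last_seen_sync_id: syncId, consecutive_absences: 0 };
  }
  const absences = entity.consecutive_absences + 1;
  if (entity.status === "removed") {
    // Already soft-removed: keep the original removed_by_sync_id for deterministic rollback.
    return { ...entity, consecutive_absences: absences };
  }
  if (absences >= absenceThreshold) {
    // Threshold reached: soft-remove and stamp the sync that caused it.
    return { ...entity, status: "removed", consecutive_absences: absences, removed_by_sync_id: syncId };
  }
  // Below threshold: stale only, so authority paths are retained.
  return { ...entity, status: "stale", consecutive_absences: absences };
}
```

Because a single absence only ever yields `stale`, one bad scan cannot soft-remove an entity on its own under this model.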

Phase 4

Health score computed for every incoming scan using platform-derived metrics only
Scans with score < 0.2 are rejected
Scans with score 0.2-0.5 are quarantined
Connector self-reported errors stored for observability but not used in score
Health reports stored in scan_health_reports collection
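
As a minimal sketch, the pre-ingestion gate reduces to a threshold lookup on the platform-derived score. The function name `gateByHealthScore` is hypothetical. Two points are assumptions rather than statements from the plan: a score of exactly 0.5 is treated as accepted (the criteria give the quarantine band as 0.2-0.5 without stating whether the upper bound is inclusive), and a non-finite score fails closed.

```typescript
// Hypothetical gate over the platform-derived health score.
type GateDecision = "reject" | "quarantine" | "accept";

function gateByHealthScore(score: number): GateDecision {
  // Assumption: a score that cannot be computed fails closed.
  if (!Number.isFinite(score)) return "reject";
  if (score < 0.2) return "reject";
  // Assumption: the quarantine band is [0.2, 0.5); exactly 0.5 is accepted.
  if (score < 0.5) return "quarantine";
  return "accept";
}
```

Connector self-reported errors do not appear as an input here, which matches the criterion that they are stored for observability only.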

Phase 5

SyncsPage shows health badges, circuit breaker warnings, and error messages
Connector health summary endpoint returns per-connector metrics
Webhook notifications fire for degraded/critical/failed scans

Phase 6

POST /api/v1/admin/syncs/:syncId/rollback restores removed paths using removed_by_sync_id (deterministic)
Rollback also restores soft-deleted entities from the same sync
CLI script rollback-sync.ts works for manual recovery
Manual purge API available for operator-driven hard deletion (not automatic)
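
To illustrate why `removed_by_sync_id` makes rollback deterministic, the following in-memory sketch restores exactly the records a given sync removed and nothing else. The names `SoftDeletable` and `rollbackSync` are hypothetical; a storage-level version would presumably apply the same filter (`status: "removed"` and a matching `removed_by_sync_id`) in an update query, but that is an assumption about the adapter, not a description of it.

```typescript
// Hypothetical in-memory model of deterministic rollback by removed_by_sync_id.
interface SoftDeletable {
  id: string;
  status: "active" | "stale" | "removed";
  removed_by_sync_id?: string;
}

interface RollbackResult {
  records: SoftDeletable[];
  restoredIds: string[];
}

function rollbackSync(records: SoftDeletable[], syncId: string): RollbackResult {
  const restoredIds: string[] = [];
  const next = records.map((record) => {
    // Only records removed by this exact sync are touched; no timestamp windows.
    if (record.status === "removed" && record.removed_by_sync_id === syncId) {
      restoredIds.push(record.id);
      return { id: record.id, status: "active" as const };
    }
    return record;
  });
  return { records: next, restoredIds };
}
```

The same function can be applied to both entities and authority paths, since the rollback criterion covers records of either kind removed by the same sync. Running it twice with the same `syncId` is harmless: the second call finds nothing to restore.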

8. Open Questions

Threshold tuning: Should thresholds be configurable per-tenant (multi-tenant scenario where different clients have different volatility)?
Quarantine storage: Should quarantined scans be stored in a separate collection or tagged in the existing connector_syncs collection?
Grace period for first scan: When a connector runs for the first time, there's no baseline. Should the circuit breaker be disabled for the first N scans?
Webhook delivery guarantees: Should the alerting system guarantee at-least-once delivery (retry on failure), or is best-effort sufficient for MVP?
Admin authentication: The rollback endpoint needs admin-level auth. How should this be distinguished from regular tenant auth? API key with admin scope?
Storage growth: Without automatic hard deletion, soft-deleted records accumulate indefinitely. At what scale does this become a storage concern? (Likely not relevant for years at current data volumes — a single tenant's full entity set is <10MB.)

Resolved Questions (from review)

Question	Resolution
Should circuit breaker use `"degraded"` status?	No. Keep `"completed"` + `circuit_breaker_triggered` flag. `"degraded"` breaks downstream gates.
Should we trust connector self-reported `expectedNodeCount`?	No. All safety decisions platform-derived. Connector errors stored for observability only.
Should there be automatic hard deletion (tombstoning)?	No. Soft-delete indefinitely. Manual purge via admin API only.
Should AP breaker be per-workload with minimum floor?	No. Global/tenant-level breaker, no minimum floor.
How to make rollback deterministic?	`removed_by_sync_id` field on entities and authority paths.
What entity type names for threshold config?	Runtime types from graph transformer (`owner`, not `human_identity`).

9. Operational Monitoring & Admin Dashboard (Detailed Design)

9a. Admin Dashboard Layout

Primary view: Connector Health Cards (one per connector_type per tenant)

Each card shows:

Connector name and type (e.g., "Azure Entra ID", "ServiceNow", "Azure Foundry")
Overall status badge: Healthy / Degraded / Failed / Stale
Last successful sync timestamp + relative time ("2h ago")
Entity count from last sync with delta vs. previous ("+12" or "−45 (warning)")
Authority paths created/updated/removed in last sync
Mini sparkline: entity count trend over last 10 syncs
Click-through to filtered SyncsPage for that connector

Industry reference: SailPoint IdentityNow shows per-source health with Normal/Error states and aggregation troubleshooting views. Veza groups dashboards by security scenario with 90-day trend analysis.

9b. Enhanced SyncsPage

Extend the existing SyncsPage.tsx (already built with DataTable, status badges, filtering):

Entity delta column: +created / −removed with color coding (red for >30% drop)
Duration comparison: vs P50 for this connector type
Health badge: derived from scan health report
Error column: display sync.error field (stored in DB, currently hidden in UI)
Expandable detail: side-by-side metrics comparison with previous sync
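
One possible shape for the entity delta column, as a sketch only. The helper name `formatEntityDelta` and the `EntityDelta` fields are hypothetical, and "drop" is interpreted here as the net decrease (removed minus created) relative to the previous sync's total, which is an assumption.

```typescript
// Hypothetical formatter for the "+created / −removed" column.
interface EntityDelta {
  created: number;
  removed: number;
  previousTotal: number;
}

interface DeltaCell {
  label: string;
  color: "red" | "neutral";
}

const DROP_WARNING_RATIO = 0.3; // "red for >30% drop"

function formatEntityDelta(delta: EntityDelta): DeltaCell {
  const netDrop = Math.max(0, delta.removed - delta.created);
  // Without a previous total there is no ratio to compare against.
  const dropRatio = delta.previousTotal > 0 ? netDrop / delta.previousTotal : 0;
  return {
    label: "+" + delta.created + " / \u2212" + delta.removed,
    color: dropRatio > DROP_WARNING_RATIO ? "red" : "neutral",
  };
}
```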

9c. Alerting Architecture

Tiered alerts:

Tier	Condition	Action
P1 Critical	Sync failed; job stalled >10min; no sync in >2× expected interval	In-app banner + webhook (Slack/PagerDuty)
P2 Warning	Entity count drop >30%; queue backlog >15min; partial sync	In-app notification + webhook (Slack)
P3 Info	Sync completed; new finding types detected	Daily digest

Implementation:

New alerts collection in MongoDB (type, severity, connector, sync_id, message, acknowledged_at)
Alert evaluation runs after each sync_ingestion and evaluate_findings completion
Notification bell in UI header with unread count
Webhook dispatcher: single outbound HTTP POST covers Slack, Teams, PagerDuty
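
A sketch of what an alert document and the tier classification might look like, assuming the fields listed above map directly onto a document type. `AlertDoc` mirrors the listed fields; `SyncSignals`, `classifySeverity`, `buildAlert`, and the `"sync_health"` type value are hypothetical names introduced for illustration. The numeric limits are taken from the tiered alerts table.

```typescript
// Hypothetical alert document and severity classification.
type Severity = "P1" | "P2" | "P3";

interface AlertDoc {
  type: string;
  severity: Severity;
  connector: string;
  sync_id: string;
  message: string;
  acknowledged_at: Date | null;
}

interface SyncSignals {
  syncFailed: boolean;
  jobStalledMinutes: number;
  minutesSinceLastSync: number;
  expectedIntervalMinutes: number;
  entityDropRatio: number; // 0.3 means a 30% drop
  queueBacklogMinutes: number;
  partialSync: boolean;
}

function classifySeverity(s: SyncSignals): Severity {
  // P1: sync failed; job stalled >10min; no sync in >2x expected interval.
  if (s.syncFailed || s.jobStalledMinutes > 10 || s.minutesSinceLastSync > 2 * s.expectedIntervalMinutes) {
    return "P1";
  }
  // P2: entity count drop >30%; queue backlog >15min; partial sync.
  if (s.entityDropRatio > 0.3 || s.queueBacklogMinutes > 15 || s.partialSync) {
    return "P2";
  }
  // Everything else is informational and goes to the daily digest.
  return "P3";
}

function buildAlert(connector: string, syncId: string, signals: SyncSignals, message: string): AlertDoc {
  return {
    type: "sync_health", // hypothetical type value
    severity: classifySeverity(signals),
    connector,
    sync_id: syncId,
    message,
    acknowledged_at: null,
  };
}
```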

Escalation pattern:

P3 → Log + daily digest
P2 → In-app + webhook; if unacknowledged 4h → escalate to P1
P1 → In-app + webhook + PagerDuty; if unacknowledged 30min → re-fire
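
The escalation pattern can be expressed as a function of alert age and acknowledgement state. This is a minimal sketch under stated assumptions: `OpenAlert`, `nextEscalation`, and the action names are hypothetical, times are epoch milliseconds, and the age is measured from when the alert last fired.

```typescript
// Hypothetical escalation check, intended to run on a periodic timer.
type Tier = "P1" | "P2" | "P3";
type EscalationAction = "none" | "escalate_to_p1" | "refire";

interface OpenAlert {
  tier: Tier;
  firedAtMs: number;
  acknowledgedAtMs?: number;
}

const MINUTE_MS = 60 * 1000;
const P2_ESCALATE_AFTER_MS = 4 * 60 * MINUTE_MS; // unacknowledged 4h
const P1_REFIRE_AFTER_MS = 30 * MINUTE_MS; // unacknowledged 30min

function nextEscalation(alert: OpenAlert, nowMs: number): EscalationAction {
  // Acknowledged alerts never escalate.
  if (alert.acknowledgedAtMs !== undefined) return "none";
  const ageMs = nowMs - alert.firedAtMs;
  if (alert.tier === "P2" && ageMs >= P2_ESCALATE_AFTER_MS) return "escalate_to_p1";
  if (alert.tier === "P1" && ageMs >= P1_REFIRE_AFTER_MS) return "refire";
  // P3 alerts only feed the log and the daily digest.
  return "none";
}
```

If a re-fired P1 resets `firedAtMs`, the same function yields a repeat every 30 minutes until someone acknowledges the alert; whether repeats should be capped is left open here.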

9d. Scan Quarantine Workflow

When anomaly thresholds are breached, the scan is quarantined rather than applied:

Scan arrives → Validate schema → Check anomaly thresholds
  │                                    │
  │ (normal)                          │ (anomaly detected)
  ▼                                    ▼
  Process normally               Store as "quarantined" sync
                                 Alert P2 to admin
                                       │
                                       ▼
                                 Admin reviews in UI:
                                   - Previous vs current metrics side-by-side
                                   - Actions: Approve / Reject / Re-scan

Quarantine triggers (deterministic):

Entity count drops >50% from previous sync
Entity count increases >200%
Zero entities returned when baseline >0
Scan duration < 10% of P50 (suspiciously fast → likely incomplete)
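
The four triggers above are simple enough to check with one pure function. The sketch below uses hypothetical names (`ScanSnapshot`, `quarantineReasons`, and the reason strings). It reads "increases >200%" as growth of more than 200% over the previous count (more than triple), which is an interpretation, and it skips the ratio checks when there is no baseline.

```typescript
// Hypothetical deterministic quarantine check; an empty result means "process normally".
interface ScanSnapshot {
  entityCount: number;
  durationMs: number;
}

type QuarantineReason = "zero_entities" | "entity_drop" | "entity_spike" | "too_fast";

function quarantineReasons(
  previous: ScanSnapshot,
  current: ScanSnapshot,
  p50DurationMs: number,
): QuarantineReason[] {
  const reasons: QuarantineReason[] = [];
  if (previous.entityCount > 0) {
    // Zero entities returned when baseline > 0.
    if (current.entityCount === 0) reasons.push("zero_entities");
    const change = (current.entityCount - previous.entityCount) / previous.entityCount;
    // Entity count drops >50% from previous sync.
    if (change < -0.5) reasons.push("entity_drop");
    // Entity count increases >200% (assumed to mean growth, not final size).
    if (change > 2.0) reasons.push("entity_spike");
  }
  // Scan duration < 10% of P50: suspiciously fast, likely incomplete.
  if (current.durationMs < 0.1 * p50DurationMs) reasons.push("too_fast");
  return reasons;
}
```

Returning every matching reason, rather than stopping at the first, gives the admin review screen more context for the approve, reject, or re-scan decision.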

Quarantine tracking: Quarantined scans are marked status: "completed" with circuit_breaker_triggered: true and quarantined: true in sync metrics. No new sync statuses are added — this avoids breaking the evaluate-findings.ts:21 gate on status === "completed".

9e. Operational Runbooks

The runbook index at docs/runbooks/index.md is currently a placeholder. Priority runbooks to write:

Sync Failure Triage — classify error (connection/schema/DB/transform), fix, re-scan, verify recovery
Data Freshness Outage — check connector alive, check for stalled syncs, check target system availability
Delta Anomaly Triage — determine if real change vs connector bug, accept new baseline or investigate
Authority Path Rollback — use admin API or CLI to restore paths removed by a bad sync

9f. Build vs Buy Decision

Approach	Effort	Recommendation
Custom admin panel in product UI	5-7 days	Recommended for MVP — single deployment, customers see it too
Grafana + Prometheus	2-3 days setup	Deferred — wire up existing `/metrics` endpoint when >5 tenants
Datadog / Monte Carlo	$6K+/year	Not justified at current scale

Key insight: SecurityV0 already has Prometheus metrics at /metrics with 8 metric families. Grafana can be added in 2-3 hours when needed. The admin panel is the higher-value investment because it's customer-facing.

9g. Existing Infrastructure (Already Built)

Component	Status	File
Health endpoints (`/health`, `/ready`, `/metrics`, `/diagnostics`)	Built	`src/api/routes/system.ts`
Prometheus metrics (HTTP latency, job duration, queue depth, sync age, findings, authority paths)	Built	`src/shared/metrics/metrics.ts`
Structured JSON logging	Built	`src/shared/logging/logger.ts`
Syncs API (`GET /api/v1/syncs`)	Built	`src/api/routes/syncs.ts`
SyncsPage UI (table, filtering, status badges)	Built	`ui/src/pages/SyncsPage.tsx`
Worker queue depth tracking	Built	`src/workers/runtime.ts`
ConnectorSyncDoc with detailed metrics	Built	`src/domain/syncs/types.ts`

10. References

Internal

Architecture docs: docs/architecture/03-database.md (connector_syncs schema, lines 518-580)
Processing pipeline: docs/architecture/02-processing-pipeline.md (SLIs/SLOs, alert matrix, dashboard requirements)
Existing infrastructure: Prometheus metrics (src/shared/metrics/metrics.ts), health endpoints (src/api/routes/system.ts), SyncsPage (ui/src/pages/SyncsPage.tsx), worker runtime (src/workers/runtime.ts)

Industry Research

SailPoint: Aggregation safeguards — full/delta/targeted aggregation modes; zero-account aggregation abort; uncorrelated account review workflow; per-source health notifications (docs, aggregation troubleshooting)
Veza: OAA provider-level granularity — failed push for one provider does not affect others; dashboard grouping by security scenario; 90-day trend analysis (product updates)
Wiz: Last-seen model with type-specific grace periods (24h cloud resources, 72h soft-delete, 7d identity retention); resource drift alerting
CrowdStrike Falcon: Sensor health model — "reduced functionality mode" retains last-known-good state; 45-minute inactive threshold before status change
Splunk: Event count deviation monitoring (50% of 7-day rolling average triggers alert); append-only model prevents destructive overwrites; data quarantine for suspect data
Microsoft Sentinel: Data connector health monitoring with configurable per-connector thresholds
Prisma Cloud (Palo Alto): Resource drift alerting when >30% of resources disappear in single scan
ServiceNow CMDB: IRE staging area with reconciliation rules; staleness thresholds (7 days cloud, 30 days on-prem); IRE batches held in staging on anomaly detection
Data Observability: Monte Carlo's 5 pillars (freshness, volume, schema, distribution, lineage); O'Reilly Data Quality Fundamentals ch4 (monitoring and anomaly detection for pipelines)

Appendix A: Review Findings Traceability

All 8 review findings from the v1 draft review have been addressed in this v2 revision.

#	Severity	Finding	Resolution	Section
1	Critical	AP breaker per-workload `>= 3` allows full wipe for small tenants	Replaced with global/tenant-level breaker, no minimum floor	Phase 1a, 1b
2	Critical	Entity deletion breaker doesn't stop authority path materialization from removing paths via missing `execution_paths`	Cascading pipeline gate: if entity deletion is blocked, materialization also blocked	Phase 1b (cascading gate)
3	Critical	`"degraded"` status breaks downstream — `evaluate-findings.ts:21` gates on `status === "completed"`	Keep `"completed"` status + `circuit_breaker_triggered: true` flag in metrics	Phase 1d
4	Critical	Rollback by `removed_at ± 5s` is non-deterministic	Added `removed_by_sync_id` field for exact causality tracking	Phase 2e, Phase 6a
5	High	Trusting connector self-reported `expectedNodeCount` for breaker overrides	Removed `expectedNodeCount`. All safety decisions platform-derived. Override is operator-only.	Phase 2a, Phase 4
6	High	Tombstoning = irreversible loss	Removed tombstone phase entirely. Soft-delete indefinitely, manual purge via admin API.	Phase 3b
7	High	Phase 2 incomplete vs current code (ingest validation, hardcoded `sync_mode`)	Added ingest route validation (2c) and `sync_mode` derivation from payload (2d)	Phase 2c, 2d
8	Medium	Entity type policy names (`human_identity`) don't match runtime types (`owner`)	Threshold config uses runtime types from graph transformer	Phase 1a (threshold table)

Core assumption validated: No automatic irreversible deletes and no automatic large soft-removals from a single suspect scan.
