Implementation Plan: Scan Safety, Data Loss Prevention & Connector Observability
Date: 2026-02-26
Status: Draft v2, revised per review findings (8 items addressed)
Scope: sv0-platform (ingestion pipeline, API, UI), sv0-connectors (entra-servicenow)
Core Assumption (non-negotiable): The platform must never perform automatic irreversible deletes and must never perform automatic large soft-removals from a single suspect scan. All destructive operations must be gated, observable, and reversible.
Trigger: Production incident in which a fresh connector scan removed all 5 authority paths for the default tenant.
1. Incident Summary
On 2026-02-26, a fresh entra-servicenow connector scan (syncId: cebc3162) removed all 5 existing authority paths for the default tenant. The production UI showed zero authority paths where 5 had been present.
Impact: Complete loss of authority path visibility for the real (non-demo) tenant.
Root cause: Two compounding failures:
- Connector bug (primary): Commit f322e48 ("discover all outbound REST Messages, not just Azure") removed the Azure endpoint filter from the get_outbound_rest_messages() query. The _get_table() method has a 100-record default limit. With the filter removed, the ServiceNow instance returned 100+ generic REST Messages, pushing the Azure-specific ones outside the window. Result: discover_execution_chains() returned 0 chains, and the graph output contained 0 chain workloads (business rules, script includes, scheduled jobs).
- Platform design gap (amplifier): The platform uses a full-replacement model for sync processing. When the diff engine (diff-engine.ts:267-305) detected that 5 previously-ingested workloads were absent from the incoming graph, it marked them as deleted. The authority path materializer (sync-ingestion.ts:166-181) then soft-removed all authority paths for those deleted workloads.
Key metric from the failing sync:
pathsComputed: 1, authorityPathsCreated: 0, authorityPathsRemoved: 5
Resolution: Authority paths manually restored via MongoDB updateMany (status: "removed" → "active"). Possible because the platform uses soft-delete (markAuthorityPathsRemoved sets status: "removed", does not physically delete).
2. Design Principles
Based on industry research across identity governance (SailPoint, Veza), CSPM (Wiz, Prisma Cloud), SIEM (Splunk, Sentinel), and sensor platforms (CrowdStrike Falcon):
- Never trust a single scan to be complete. Connectors can fail partially (API limits, permission revocations, timeouts). The platform must treat incoming data as potentially incomplete.
- Absence ≠ deletion. An entity missing from one scan should not be immediately removed. Use "last seen" tracking with grace periods (industry standard: 1-7 days depending on entity type).
- Protect high-value data with circuit breakers. If a sync would remove a significant portion of existing data, halt and quarantine rather than apply. Thresholds vary: 30% for identities, 50% for resources (SailPoint/Veza pattern).
- Connectors must declare scope. A connector scanning only Function Apps should not trigger deletion of unrelated ServiceNow workloads. Scan scope must be explicit in the payload.
- Make all destructive operations observable and reversible. Operators must be able to see what each scan changed, detect anomalies, and roll back bad syncs.
3. Implementation Plan
Phase 0: Connector Fix (Immediate — blocks further scans)
Goal: Fix the entra-servicenow connector so scans produce complete graphs.
| Task | File | Change |
|---|---|---|
| Use paginated query for REST Messages | servicenow_client.py → get_outbound_rest_messages() | Replace _get_table() with _get_table_paginated() to fetch all REST Messages regardless of count |
| Add self-validation before submission | cli/main.py → submit logic | Log warning if chain discovery returns 0 chains when prior scan had >0; optionally abort submission |
Estimated effort: 1 hour
Prevents recurrence of this specific incident: Yes
Phase 1: Platform Circuit Breaker (P0 — 1 day)
Goal: Prevent any single sync from causing mass entity deletion or authority path removal, regardless of connector bugs.
Review finding addressed (Critical #1): The original per-workload AP breaker with an existingActive.length >= 3 minimum would allow a full wipe for small tenants (e.g., 2 workloads with 2 paths each → 100% removal allowed). The breaker now operates at the global/tenant level across all entities and paths for the entire sync, with no minimum floor: even removing 1 of 1 paths triggers evaluation.
Review finding addressed (Critical #2): Entity deletion and authority path removal are two separate destructive operations, but they are causally linked: deleted entities → missing execution_paths → materializer removes authority paths. The circuit breaker now gates the entire destructive pipeline (entity deletion + execution path materialization + authority path removal) as a single unit. If the entity deletion breaker fires, the materializer is also blocked from removing paths.
1a. Global Entity Deletion Threshold
Before the diff engine marks entities as deleted, compare the total deletion count against total existing entity count across all source systems in the sync.
File: src/ingestion/diff-engine.ts (around line 267-305)
// Proposed logic: GLOBAL breaker (runs once for the entire sync, not per-workload)
interface DeletionBreaker {
  totalToDelete: number;
  totalExisting: number;
  deletionRatio: number;
  triggered: boolean;
  blockedEntityIds: string[];
}
function evaluateDeletionBreaker(
  allToDelete: EntityDoc[],
  allExistingForSyncSources: EntityDoc[],
): DeletionBreaker {
  const totalToDelete = allToDelete.length;
  const totalExisting = allExistingForSyncSources.length;
  const deletionRatio = totalExisting > 0 ? totalToDelete / totalExisting : 0;
  const threshold = 0.50; // 50% global threshold, no minimum floor
  // Zero existing = first scan, no breaker needed
  if (totalExisting === 0) {
    return { totalToDelete, totalExisting, deletionRatio, triggered: false, blockedEntityIds: [] };
  }
  // Special case: every existing entity would be deleted (e.g., the incoming scan is empty) → always block,
  // independent of the configured threshold. totalExisting > 0 is guaranteed by the early return above.
  if (totalToDelete === totalExisting) {
    return { totalToDelete, totalExisting, deletionRatio: 1.0, triggered: true,
      blockedEntityIds: allToDelete.map(e => e.node_id) };
  }
  const triggered = deletionRatio > threshold;
  return {
    totalToDelete, totalExisting, deletionRatio, triggered,
    blockedEntityIds: triggered ? allToDelete.map(e => e.node_id) : [],
  };
}
Per-type thresholds (applied as a secondary check within global breaker):
| Entity Type (runtime) | Threshold | Rationale |
|---|---|---|
| identity (service principals) | 30% | Anchor authority paths; rarely mass-deleted legitimately |
| workload | 40% | Core platform object; client config changes are incremental |
| role / permission | 40% | Role structures are relatively stable |
| resource | 60% | Cloud resources are more volatile (scale-up/down) |
| owner | 50% | People join/leave; moderate volatility |
| Default | 50% | Safe middle ground |
Review finding addressed (Medium #8): Entity type names in thresholds now use runtime types from the graph transformer (e.g., owner not human_identity), matching NormalizedNode.nodeType values that actually appear in the pipeline.
No minimum entity floor. Previous draft required existingForSource.length >= 5 — this is removed. A tenant with 2 entities losing both is a 100% drop and must be caught.
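The per-type secondary check could be sketched as follows. This is a minimal illustration, not existing platform code: it assumes each entity exposes a `nodeType` matching the runtime types in the table above, and the names `EntityLike`, `TYPE_THRESHOLDS`, and `evaluatePerTypeBreaker` are hypothetical.

```typescript
// Hypothetical sketch of the per-type secondary check.
// Threshold values mirror the table above; all names are illustrative.
interface EntityLike {
  node_id: string;
  nodeType: string;
}

interface TypeBreakerResult {
  nodeType: string;
  toDelete: number;
  existing: number;
  ratio: number;
  threshold: number;
  triggered: boolean;
}

const DEFAULT_THRESHOLD = 0.5;
const TYPE_THRESHOLDS: Record<string, number> = {
  identity: 0.3,
  workload: 0.4,
  role: 0.4,
  permission: 0.4,
  resource: 0.6,
  owner: 0.5,
};

// Count entities per runtime type.
function countByType(entities: EntityLike[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const e of entities) {
    counts[e.nodeType] = (counts[e.nodeType] ?? 0) + 1;
  }
  return counts;
}

// Evaluate one breaker per entity type present in the existing baseline.
// No minimum floor: a type with 1 existing entity losing it is a 100% drop.
function evaluatePerTypeBreaker(
  toDelete: EntityLike[],
  existing: EntityLike[],
): TypeBreakerResult[] {
  const existingCounts = countByType(existing);
  const deleteCounts = countByType(toDelete);
  return Object.keys(existingCounts).map((nodeType) => {
    const existingCount = existingCounts[nodeType];
    const deleteCount = deleteCounts[nodeType] ?? 0;
    const threshold = TYPE_THRESHOLDS[nodeType] ?? DEFAULT_THRESHOLD;
    const ratio = deleteCount / existingCount; // existingCount >= 1 by construction
    return {
      nodeType,
      toDelete: deleteCount,
      existing: existingCount,
      ratio,
      threshold,
      triggered: ratio > threshold,
    };
  });
}
```

In this sketch, a sync would be treated as blocked if the global breaker fires or if any per-type result has `triggered: true`; how the two checks are combined is a design choice the plan leaves open.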
1b. Cascading Pipeline Gate
When the entity deletion breaker fires, the entire destructive pipeline for this sync is halted — not just entity deletion.
File: src/workers/handlers/sync-ingestion.ts
// After entity diff:
const deletionBreaker = evaluateDeletionBreaker(toDelete, existingForSource);
if (deletionBreaker.triggered) {
  logger.warn("Circuit breaker triggered: blocking ALL destructive operations", {
    syncId, tenantId,
    wouldDelete: deletionBreaker.totalToDelete,
    existing: deletionBreaker.totalExisting,
    ratio: deletionBreaker.deletionRatio,
  });
  // Skip: entity deletions
  // Skip: execution path re-materialization for deleted entities
  // Skip: authority path removal (markAuthorityPathsRemoved)
  // Continue: entity creates/updates (additive operations are safe)
  // Continue: findings evaluation, evidence packs, posture snapshot
  syncMetrics.circuit_breaker_triggered = true;
  syncMetrics.deletions_blocked = deletionBreaker.totalToDelete;
  syncMetrics.authority_paths_removal_blocked = /* count from materializer */ 0;
}
Key insight: The materializer at authority-path-materializer.ts:115-127 removes paths when execution_paths is empty for a workload. If we mark workloads as deleted (removing their execution_paths), the materializer will cascade-remove their authority paths even without an explicit AP breaker. Therefore, the entity deletion breaker must gate the materializer as well — if deletions are blocked, the materializer runs with the pre-existing entity set (as if the deletions never happened).
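One way to picture "the materializer runs with the pre-existing entity set" is a helper that computes the entity set handed to the materializer. The sketch below is illustrative only; `effectiveEntitySet` and `EntityRef` are hypothetical names, and the real pipeline may merge entities differently.

```typescript
// Hypothetical sketch: which entities the materializer should see.
// When the breaker fires, planned deletions are ignored, while additive
// creates/updates from the incoming scan are still applied.
interface EntityRef {
  node_id: string;
}

function effectiveEntitySet<T extends EntityRef>(
  existing: T[],
  incoming: T[],
  toDelete: T[],
  breakerTriggered: boolean,
): T[] {
  const deletedIds: Record<string, boolean> = {};
  if (!breakerTriggered) {
    // Deletions only take effect when the breaker did NOT fire.
    for (const e of toDelete) deletedIds[e.node_id] = true;
  }
  const byId: Record<string, T> = {};
  for (const e of existing) {
    if (!deletedIds[e.node_id]) byId[e.node_id] = e;
  }
  // Incoming entities overwrite existing ones (create/update).
  for (const e of incoming) byId[e.node_id] = e;
  return Object.keys(byId).map((id) => byId[id]);
}
```

With the breaker triggered, workloads that were absent from the incoming graph stay in the set, so their execution paths are not emptied and the materializer has no reason to remove their authority paths.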
1c. Authority Path Removal Safety Net (Independent)
A secondary check at the authority path level, as defense-in-depth for cases where entity deletion proceeds but materializer behavior is anomalous.
File: src/ingestion/authority-path-materializer.ts (around line 115-127)
// GLOBAL authority path breaker, NOT per-workload
const totalExistingPaths = allExistingActivePaths.length;
const totalToRemove = allPathsToRemove.length;
const removalRatio = totalExistingPaths > 0 ? totalToRemove / totalExistingPaths : 0;
const AP_REMOVAL_THRESHOLD = 0.50;
// No minimum floor: even 1 of 1 is 100% and triggers
if (removalRatio > AP_REMOVAL_THRESHOLD && totalExistingPaths > 0) {
  logger.warn("Authority path removal blocked by circuit breaker", {
    syncId, tenantId,
    wouldRemove: totalToRemove,
    existing: totalExistingPaths,
    ratio: removalRatio,
  });
  syncMetrics.authority_paths_removal_blocked = totalToRemove;
  // Skip all path removals (return early)
}
1d. Sync Status: Keep "completed" + Flag
Review finding addressed (Critical #3): The original plan proposed adding "degraded" to SyncStatus. This would break downstream processing: evaluate-findings.ts:21 gates on sync.status !== "completed", so a "degraded" status would cause the findings evaluator, evidence pack builder, and posture snapshot to all be skipped. This is worse than the original problem.
Solution: Keep status: "completed" so downstream processing runs normally. Add a circuit_breaker_triggered: boolean flag on the sync metrics to indicate that destructive operations were blocked.
File: src/domain/syncs/types.ts
Do NOT add "degraded" to SyncStatus. Instead, add metrics fields:
// Added to ConnectorSyncMetrics (existing interface):
deletions_blocked?: number;
authority_paths_removal_blocked?: number;
circuit_breaker_triggered?: boolean;
circuit_breaker_details?: {
  entity_deletion_ratio: number;
  entity_deletion_threshold: number;
  ap_removal_ratio: number;
  ap_removal_threshold: number;
};
The sync status remains "completed". The UI and alerting system check circuit_breaker_triggered to surface warnings. Findings evaluation, evidence packs, and posture snapshots all proceed normally against the non-destructed data.
Estimated effort: 1 day
Prevents recurrence: Yes. Any connector bug that produces incomplete data triggers the circuit breaker, blocking all destructive operations while preserving downstream processing.
Phase 2: Scan Scope Declaration & Rollback Determinism (P0 — 1-2 days)
Goal: Let connectors declare what they scanned so the platform only removes entities within that scope. Make rollback deterministic by tracking which sync caused each removal.
Review finding addressed (Critical #4): The original rollback mechanism used removed_at ± 5s timestamp matching, which is non-deterministic and could restore paths from unrelated operations. Rollback must use removed_by_sync_id for exact causality tracking.
Review finding addressed (High #5): The original plan trusted connector self-reported completeness.expectedNodeCount to relax circuit breaker thresholds. This is dangerous: a buggy connector could self-report "I expected 0 nodes" and bypass all safety. All safety decisions must be platform-derived from historical baselines.
Review finding addressed (High #7): The current codebase has two gaps: (1) the ingest route does not validate or accept scanScope in the payload, and (2) sync-ingestion.ts:33 hardcodes sync_mode: "full". Both must be addressed for scan scope to function.
2a. NormalizedGraph Schema Extension
File: src/ingestion/types.ts
export interface ScanScope {
  /** What this scan covers. Only entities matching this scope are eligible for deletion. */
  mode: "full" | "incremental" | "targeted";
  /** Source systems included in this scan (e.g., ["servicenow", "entra_id"]) */
  sourceSystems?: string[];
  /** Entity types included (e.g., ["workload", "identity"]). If omitted, all types in scope. */
  scannedEntityTypes?: string[];
  /** Connector self-reported errors. Used ONLY for logging/observability, NOT for safety decisions. */
  errors?: {
    errorsEncountered?: number;
    partialFailures?: string[];
    permissionDenied?: string[];
  };
}
export interface NormalizedGraph {
  // ... existing fields
  scanScope?: ScanScope;
}
Important: The completeness.expectedNodeCount field from the original draft is removed. The platform must never trust connector self-reported node counts for circuit breaker override decisions. All safety thresholds are derived from the platform's own historical baseline (previous successful sync for the same connector + tenant).
2b. Scope-Aware Deletion
File: src/ingestion/diff-engine.ts
When scanScope.mode === "incremental", skip deletion detection entirely (additive-only).
When scanScope.scannedEntityTypes is provided, only consider entities of those types for deletion.
2c. Ingest Route Validation
File: src/api/routes/ingest.ts
The ingest route must accept and validate the scanScope field from the payload:
// Add to existing NormalizedGraph validation schema:
scanScope: z.object({
  mode: z.enum(["full", "incremental", "targeted"]),
  sourceSystems: z.array(z.string()).optional(),
  scannedEntityTypes: z.array(z.string()).optional(),
  errors: z.object({
    errorsEncountered: z.number().optional(),
    partialFailures: z.array(z.string()).optional(),
    permissionDenied: z.array(z.string()).optional(),
  }).optional(),
}).optional(),
2d. Sync Mode from Payload (Remove Hardcoded Default)
File: src/workers/handlers/sync-ingestion.ts (line 33)
Currently: sync_mode: "full" is hardcoded. Change to derive from the incoming payload:
// Before (hardcoded):
sync_mode: "full",
// After (from payload, with safe default):
sync_mode: graph.scanScope?.mode ?? "full",
When mode is "incremental", the diff engine skips deletion detection. When "targeted", only scannedEntityTypes are eligible for deletion. When "full" (or absent), current behavior is preserved (all entity types eligible).
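The mode rules above can be sketched as a filter over entities that were absent from the incoming graph. This is a hedged illustration: `deletionCandidates`, `ScanScopeLike`, and the `sourceSystem` field on entities are hypothetical, and two behaviors are assumptions rather than something the plan specifies: a "targeted" scan with no `scannedEntityTypes` is treated as having nothing eligible, and `sourceSystems`, when present, also narrows eligibility.

```typescript
// Hypothetical sketch of scope-aware deletion eligibility.
type ScanMode = "full" | "incremental" | "targeted";

interface ScanScopeLike {
  mode: ScanMode;
  sourceSystems?: string[];
  scannedEntityTypes?: string[];
}

interface ScopedEntity {
  node_id: string;
  nodeType: string;
  sourceSystem: string; // assumed field, not confirmed by the plan
}

function deletionCandidates(
  absent: ScopedEntity[],
  scope?: ScanScopeLike,
): ScopedEntity[] {
  // Absent scope preserves current behavior: treat the scan as "full".
  const mode: ScanMode = scope?.mode ?? "full";

  // Incremental scans are additive-only: nothing is eligible for deletion.
  if (mode === "incremental") return [];

  const types = scope?.scannedEntityTypes;
  const sources = scope?.sourceSystems;

  // Assumption: a targeted scan that declares no types deletes nothing.
  if (mode === "targeted" && types === undefined) return [];

  return absent.filter((e) => {
    const typeInScope = types === undefined || types.indexOf(e.nodeType) !== -1;
    const sourceInScope = sources === undefined || sources.indexOf(e.sourceSystem) !== -1;
    return typeInScope && sourceInScope;
  });
}
```

The output of this filter would then feed the circuit breaker from Phase 1, so scope narrowing and threshold checks compose rather than replace each other.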
2e. Deterministic Rollback: removed_by_sync_id
File: src/domain/authority-paths/types.ts
Add removed_by_sync_id to the authority path schema:
// Added to AuthorityPathDoc:
removed_by_sync_id?: string; // The sync that caused this path to be removed
File: src/ingestion/authority-path-materializer.ts
When removing authority paths, stamp them with the sync ID:
// In markAuthorityPathsRemoved:
await collection.updateMany(
  { _id: { $in: pathIdsToRemove } },
  {
    $set: {
      status: "removed",
      removed_at: new Date(),
      removed_by_sync_id: syncId, // NEW: deterministic rollback key
    }
  }
);
File: src/domain/entities/types.ts
Add removed_by_sync_id to entity schema for the same reason:
// Added to EntityDoc:
removed_by_sync_id?: string; // The sync that caused this entity to be deleted
This enables the rollback in Phase 6 to use an exact match (removed_by_sync_id === targetSyncId) instead of the non-deterministic removed_at ± 5s window.
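To illustrate why the sync ID makes rollback exact, here is a small in-memory sketch (no database involved). `PathRecord` and `rollbackSync` are hypothetical names used only for this example; the actual restore would go through the storage adapter.

```typescript
// Hypothetical in-memory sketch of deterministic rollback by sync ID.
interface PathRecord {
  id: string;
  status: "active" | "removed";
  removed_at?: Date;
  removed_by_sync_id?: string;
}

// Restores only the records removed by the given sync and returns the count.
// Records removed by other syncs are untouched, even if their removed_at
// timestamps fall within seconds of each other.
function rollbackSync(paths: PathRecord[], syncId: string): number {
  let restored = 0;
  for (const p of paths) {
    if (p.status === "removed" && p.removed_by_sync_id === syncId) {
      p.status = "active";
      delete p.removed_at;
      delete p.removed_by_sync_id;
      restored++;
    }
  }
  return restored;
}
```

Two syncs finishing within the same few seconds would be indistinguishable under a timestamp window, but they remain separable here because each removal carries its own `removed_by_sync_id`.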
2f. Connector-Side Scope Declaration
Files: Both connector transformers should include scanScope in their output.
The entra-servicenow connector already implicitly knows its scope — it scans ["servicenow", "entra_id"] entities of types ["workload", "identity", "role", "permission", "resource", "connection"]. Making this explicit prevents the class of bug where a partial failure in one subsystem (chain discovery) causes deletions across all entity types.
Estimated effort: 1-2 days
Prevents recurrence: Yes — even if chain discovery fails, the scope declaration tells the platform which entity types were actually scanned. A scan that produces 0 workloads but declares scannedEntityTypes: ["workload"] would still trigger the circuit breaker. A scan that only scans flows but NOT business rules would omit "workload" from its scanned types, preventing BR deletion.
Phase 3: Soft-Delete with Grace Periods (P1 — 2-3 days)
Goal: Replace immediate removal with a multi-phase lifecycle that tolerates transient scan failures.
3a. Entity Absence Tracking
File: src/domain/entities/types.ts
Add to EntityDoc:
last_seen_sync_id?: string;
last_seen_at?: Date;
consecutive_absences?: number;
absence_since?: Date;
3b. Two-Phase Removal Lifecycle (No Hard Deletion)
Review finding addressed (High #6): The original plan included a "Tombstoned / eligible for hard deletion" phase. This introduces irreversible data loss — once tombstoned and purged, data cannot be recovered even if the deletion was caused by a connector bug discovered weeks later. The tombstone phase is removed entirely. Entities and authority paths use soft-delete indefinitely. Storage cost for soft-deleted records is negligible compared to the risk of irreversible loss.
| Phase | Condition | Effect on Authority Paths |
|---|---|---|
| Active | Seen in latest scan | Normal — paths materialized |
| Stale | Missing 1 scan and still within the grace period for its entity type | Retain paths, flag in UI with warning badge |
| Absent | Missing 2+ scans OR grace period exceeded | Soft-remove from active paths, retain entity as status: "removed" |
There is no automatic hard deletion. Entities in "Absent" state are soft-deleted (status: "removed") and excluded from active queries, but remain in the database indefinitely. Manual purge is available via admin API for explicit operator-driven cleanup (see Phase 6).
Grace period by entity type:
| Entity Type | Stale → Absent |
|---|---|
| Identity (SP, managed identity) | 48h |
| Workload | 48h |
| Role / Permission | 24h |
| Resource | 24h |
| Owner | 72h |
3c. Implementation
File: src/ingestion/diff-engine.ts
Instead of adding absent entities to deletedEntityIds:
- Increment consecutive_absences on the entity
- Set absence_since if first absence
- Only add to deletedEntityIds when entity reaches "Absent" phase
File: src/workers/handlers/sync-ingestion.ts
Add a periodic cleanup step (or separate worker job) that promotes stale → absent based on time elapsed. There is no further promotion — absent entities remain soft-deleted indefinitely.
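The lifecycle could be expressed as a pure classification function over the absence-tracking fields from 3a. The sketch below is one possible reading of the lifecycle table and the per-type grace periods, not a confirmed design: it treats an entity as absent after 2 or more missed scans or once the per-type grace period has elapsed. `classifyPhase`, `AbsenceState`, and the 48h default for unlisted types are assumptions.

```typescript
// Hypothetical sketch: classify an entity's lifecycle phase.
type LifecyclePhase = "active" | "stale" | "absent";

interface AbsenceState {
  nodeType: string;
  consecutive_absences?: number;
  absence_since?: Date;
}

const HOUR_MS = 60 * 60 * 1000;
const DEFAULT_GRACE_HOURS = 48; // assumption for types not listed in the plan
const GRACE_PERIOD_HOURS: Record<string, number> = {
  identity: 48,
  workload: 48,
  role: 24,
  permission: 24,
  resource: 24,
  owner: 72,
};

function classifyPhase(entity: AbsenceState, now: Date): LifecyclePhase {
  const absences = entity.consecutive_absences ?? 0;
  const since = entity.absence_since;
  // Seen in the latest scan (or never marked absent): active.
  if (absences === 0 || since === undefined) return "active";

  const graceHours = GRACE_PERIOD_HOURS[entity.nodeType] ?? DEFAULT_GRACE_HOURS;
  const elapsedMs = now.getTime() - since.getTime();

  // Absent: missed 2+ scans, or the grace period has run out.
  if (absences >= 2 || elapsedMs > graceHours * HOUR_MS) return "absent";

  // Otherwise the entity is stale: paths are retained and flagged in the UI.
  return "stale";
}
```

Because the function takes `now` as a parameter, the same logic can be called from the diff engine during a sync and from a periodic cleanup job that promotes stale entities on elapsed time alone.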
Estimated effort: 2-3 days
Benefit: Tolerates transient connector failures (API timeouts, temporary permission issues) without data loss. If a connector self-heals on next scan, stale entities return to active with no operator intervention. No risk of irreversible data loss from automatic purging.
Phase 4: Connector Health Metrics & Validation (P1 — 2 days)
Goal: Compute health scores per scan and reject/quarantine unhealthy scans before they cause damage.
4a. Health Score Computation
New file: src/ingestion/scan-health.ts
Compute a health score (0.0–1.0) for each incoming scan by comparing against the most recent successful sync for the same connector:
interface ScanHealthReport {
  healthScore: number; // 0.0 = critical, 1.0 = healthy
  healthStatus: "healthy" | "degraded" | "critical" | "failed";
  metrics: {
    nodeCount: number;
    edgeCount: number;
    nodeCountByType: Record<string, number>;
  };
  deviations: {
    nodeCountDeltaPercent: number;
    edgeCountDeltaPercent: number;
    missingNodeTypes: string[]; // types present before, absent now
    missingEdgeTypes: string[];
  };
  connectorReported: {
    errorsEncountered: number;
    permissionDenied: string[];
    partialFailures: string[];
  };
}
Health score formula (platform-derived only — no connector self-reported inputs):
healthScore = weighted average of:
- volumeScore (45%): node/edge count deviation from baseline (platform-computed)
- typeScore (35%): missing entity types penalty (platform-computed)
- durationScore (20%): scan duration anomaly (platform-computed)
Note: Connector self-reported errors (from scanScope.errors) are stored for observability and logged in the health report, but they are not used as inputs to the health score or circuit breaker decisions. This prevents a buggy connector from self-reporting "all clear" and bypassing safety.
Thresholds:
| Score Range | Status | Action |
|---|---|---|
| ≥ 0.8 | Healthy | Apply normally |
| 0.5 ≤ score < 0.8 | Degraded | Apply with circuit breakers active, log warning |
| 0.2 ≤ score < 0.5 | Critical | Quarantine: do not apply destructive operations |
| < 0.2 | Failed | Reject scan, notify operators |
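The weighting and the status bands can be sketched as two small functions. The plan does not define how the component scores are derived, so this example takes them as inputs in the range 0.0 to 1.0; `ComponentScores`, `computeHealthScore`, and `classifyHealth` are hypothetical names, and treating each band's lower bound as inclusive is an assumption.

```typescript
// Hypothetical sketch: weighted health score and status classification.
type HealthStatus = "healthy" | "degraded" | "critical" | "failed";

interface ComponentScores {
  volumeScore: number;   // node/edge count deviation from baseline, 0..1
  typeScore: number;     // missing entity types penalty, 0..1
  durationScore: number; // scan duration anomaly, 0..1
}

// Weights from the formula above: 45% volume, 35% type, 20% duration.
function computeHealthScore(s: ComponentScores): number {
  const raw = 0.45 * s.volumeScore + 0.35 * s.typeScore + 0.2 * s.durationScore;
  // Clamp to guard against out-of-range inputs and floating-point drift.
  return Math.min(1, Math.max(0, raw));
}

// Bands from the thresholds table; lower bounds are treated as inclusive.
function classifyHealth(score: number): HealthStatus {
  if (score >= 0.8) return "healthy";
  if (score >= 0.5) return "degraded";
  if (score >= 0.2) return "critical";
  return "failed";
}
```

As a rough illustration, a scan resembling the incident (most volume gone, an entire entity type missing, normal duration) might score around 0.45 × 0.1 + 0.35 × 0 + 0.2 × 1 = 0.245 under these weights, which falls in the critical band and would be quarantined.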
4b. Pre-Ingestion Gate
File: src/workers/handlers/sync-ingestion.ts
Before step 2 (transform), compute the health report. If healthStatus === "critical", quarantine the sync and skip all destructive operations. If healthStatus === "failed", reject the scan and notify operators, matching the thresholds in 4a.
4c. Store Health Reports
New collection: scan_health_reports (indexed by tenant_id, sync_id)
Persist every health report for trend analysis and operator review.
Estimated effort: 2 days
Benefit: Catches degraded scans before they enter the pipeline. Provides historical health data for monitoring dashboards.
Phase 5: Observability & Admin Dashboard (P2 — 3-5 days)
Goal: Give operators visibility into connector health, scan history, and anomalies.
What Already Exists
The platform already has significant infrastructure (discovered during audit):
| Component | Status | Location |
|---|---|---|
| Health endpoints (/health, /ready, /metrics, /diagnostics) | Built | src/api/routes/system.ts |
| Prometheus metrics (8+ metrics: HTTP latency, job duration, queue depth, sync age, findings count, authority path count) | Built | src/shared/metrics/metrics.ts |
| Structured JSON logging | Built | src/shared/logging/logger.ts |
| Sync history API (GET /api/v1/syncs) | Built | src/api/routes/syncs.ts |
| SyncsPage UI (table with status badges, filtering) | Built | ui/src/pages/SyncsPage.tsx |
| Worker queue depth tracking | Built | src/workers/runtime.ts |
| Connector sync metrics (entities_created/updated, paths_created/removed, etc.) | Built | src/domain/syncs/types.ts |
What's Missing
| Component | Priority | Effort |
|---|---|---|
| Connector health summary (last scan per connector, trend sparklines) | P2 | 1 day |
| Scan health dashboard (entity counts over time, anomaly flags) | P2 | 2 days |
| Error visibility (display sync errors in UI, categorize by type) | P2 | 1 day |
| Authority path delta visualization (created/updated/removed per sync) | P2 | 1 day |
| Admin/operator page (multi-tenant overview, system status) | P3 | 2 days |
| Alerting framework (webhook notifications for degraded/critical scans) | P3 | 2 days |
| Operational runbooks (currently placeholder at docs/runbooks/index.md) | P3 | 1 day |
5a. Enhanced SyncsPage
File: ui/src/pages/SyncsPage.tsx
Add to existing page:
- Health badge per sync (healthy/degraded/critical/failed) based on scan_health_reports
- Entity delta column showing +created / −removed counts with color coding
- Authority path delta showing paths affected
- Error column displaying sync error messages (currently stored in DB but not shown)
- Trend sparklines per connector (last 10 syncs entity count)
5b. Connector Health Summary API
New endpoint: GET /api/v1/connectors/health
Returns per-connector:
{
  connectorId: string;
  lastSyncAt: Date;
  lastSyncStatus: string;
  lastHealthScore: number;
  syncCount24h: number;
  failureCount24h: number;
  entityCountTrend: number[]; // last 10 syncs
  authorityPathsTrend: number[]; // last 10 syncs
}
5c. Scan Detail View
Clicking a sync in the SyncsPage opens a detail view showing:
- Full health report (deviations, missing types, errors)
- Entity diff summary (what was created/updated/deleted)
- Authority paths affected
- Action buttons: "Rollback this sync" (admin only)
5d. Alerting (Phase 5 stretch)
Architecture: Webhook-based notifications.
Events that trigger alerts:
| Event | Severity | Channel |
|---|---|---|
| Scan failed completely | Critical | Webhook + in-app banner |
| Circuit breaker triggered (deletions blocked) | Alert | Webhook + in-app notification |
| Health score dropped below 0.5 | Warning | In-app notification |
| No scan received in > 24h for active connector | Warning | Webhook |
| Permission denied errors in scan | Alert | Webhook |
Webhook payload:
{
  event: "scan_degraded" | "scan_failed" | "circuit_breaker_triggered" | "scan_stale",
  severity: "info" | "warning" | "alert" | "critical",
  connectorId: string,
  syncId: string,
  tenantId: string,
  timestamp: string,
  title: string, // "Entra scan returned 45% fewer service principals"
  details: { healthScore, deviations, actionTaken, recommendedAction }
}
Configuration: POST /api/v1/settings/webhooks to register notification endpoints (Slack, Teams, email relay, PagerDuty).
Estimated effort: 3-5 days total for Phase 5
Benefit: Operators can monitor connector health without SSH access to production. Anomalies surface proactively instead of being discovered when a customer reports missing data.
Phase 6: Rollback Capability (P2 — 1 day)
Goal: Enable operators to undo the effects of a bad sync.
6a. Restore Removed Authority Paths
Review finding addressed (Critical #4): Rollback uses removed_by_sync_id for exact causality, not the non-deterministic removed_at ± 5s window from the original draft.
Since authority paths use soft-delete with removed_by_sync_id stamping (added in Phase 2e), restoration is a single deterministic query:
async restoreAuthorityPaths(
  tenantId: string,
  syncId: string // the sync that caused the removal
): Promise<number> {
  // Exact match on the sync that caused removal; no timestamp tolerance needed
  const result = await this.c.authorityPaths.updateMany(
    {
      tenant_id: tenantId,
      status: "removed",
      removed_by_sync_id: syncId,
    },
    {
      $set: { status: "active" },
      $unset: { removed_at: "", removed_by_sync_id: "" }
    }
  );
  return result.modifiedCount;
}
async restoreDeletedEntities(
  tenantId: string,
  syncId: string
): Promise<number> {
  const result = await this.c.entities.updateMany(
    {
      tenant_id: tenantId,
      status: "removed",
      removed_by_sync_id: syncId,
    },
    {
      $set: { status: "active" },
      $unset: { removed_at: "", removed_by_sync_id: "" }
    }
  );
  return result.modifiedCount;
}
6b. Admin API Endpoint
New endpoint: POST /api/v1/admin/syncs/:syncId/rollback
Requires admin authentication. Restores all authority paths and entities that were soft-deleted by the specified sync, matched on removed_by_sync_id.
6c. CLI Script
New file: scripts/rollback-sync.ts
npx tsx scripts/rollback-sync.ts --sync-id cebc3162-... --tenant-id default
Estimated effort: 1 day
Benefit: Recovery from bad syncs without direct MongoDB access. Can be triggered from admin UI or CLI.
4. Implementation Priority & Timeline
| Phase | Description | Priority | Effort | Cumulative |
|---|---|---|---|---|
| 0 | Connector fix (paginated query) | P0 | 1h | 1h |
| 1 | Circuit breaker (deletion + AP thresholds) | P0 | 1 day | ~1 day |
| 2 | Scan scope declaration | P0 | 1-2 days | 3 days |
| 3 | Soft-delete with grace periods | P1 | 2-3 days | 6 days |
| 4 | Health score computation & pre-ingestion gate | P1 | 2 days | 8 days |
| 5 | Observability dashboard & alerting | P2 | 3-5 days | 13 days |
| 6 | Rollback capability | P2 | 1 day | 14 days |
Phases 0-2 are blocking — they prevent recurrence of this class of bug. Phases 3-4 add defense-in-depth and operational intelligence. Phases 5-6 provide ongoing visibility and recovery tools.
5. What This Prevents
| Scenario | Before | After (Phase 1-2) | After (Phase 3-4) |
|---|---|---|---|
| Connector returns empty graph | All paths deleted | Circuit breaker blocks deletion | Entities marked stale, paths retained |
| Connector loses API permissions | Entities missing, paths removed | Scope-aware deletion limits impact | Health score drops, scan quarantined |
| Partial connector failure (e.g., one API times out) | Some entity types disappear | Only in-scope types considered for deletion | Grace period covers transient failures |
| Client intentionally removes configurations | Paths linger indefinitely | Circuit breaker may false-positive (needs override) | Grace period expires, paths correctly removed |
| New connector with first-time scan | N/A (creation only) | Normal operation | Health baseline established |
Handling Legitimate Removals
When a client genuinely removes configurations (e.g., decommissions an Azure SP), the grace period model (Phase 3) handles this correctly:
- First scan after removal: entity marked stale (paths retained, warning in UI)
- Second scan: entity moves to absent (paths removed)
- Thereafter: entity remains soft-deleted (status: "removed") indefinitely; there is no automatic tombstoning or hard deletion
For urgent legitimate removals, operators can manually confirm the deletion via the admin UI, bypassing the grace period.
Review finding addressed (High #5): The original plan included an override mechanism based on connector self-reported expectedNodeCount. This is removed: the platform never trusts connector self-reported completeness for safety decisions. The only override is explicit operator action via the admin UI or API. If a client genuinely scaled down and the circuit breaker fires, the operator reviews the quarantined sync and manually approves it.
6. Files Changed (Summary)
sv0-platform
| File | Phase | Change |
|---|---|---|
| src/ingestion/diff-engine.ts | 1, 2 | Global deletion threshold, scope-aware deletion |
| src/ingestion/authority-path-materializer.ts | 1, 2 | Global AP removal breaker, removed_by_sync_id stamping |
| src/ingestion/types.ts | 2 | ScanScope on NormalizedGraph (no expectedNodeCount) |
| src/ingestion/scan-health.ts | 4 | New file: platform-derived health score (no connector self-report) |
| src/domain/syncs/types.ts | 1, 4 | circuit_breaker_triggered flag (NOT "degraded" status), health metrics |
| src/domain/entities/types.ts | 2, 3 | removed_by_sync_id, absence tracking fields |
| src/domain/authority-paths/types.ts | 2, 3 | removed_by_sync_id, "stale" status |
| src/workers/handlers/sync-ingestion.ts | 1, 2, 3, 4 | Cascading pipeline gate, sync_mode from payload, lifecycle |
| src/api/routes/ingest.ts | 2 | Validate scanScope in payload |
| src/storage/storage-adapter.ts | 3, 6 | New methods (stale marking, deterministic restore by sync_id) |
| src/api/routes/syncs.ts | 5, 6 | Health summary endpoint, rollback endpoint |
| ui/src/pages/SyncsPage.tsx | 5 | Health badges, circuit breaker warnings, trends, error display |
sv0-connectors
| File | Phase | Change |
|---|---|---|
| entra-servicenow/.../servicenow_client.py | 0 | _get_table_paginated() for REST Messages |
| entra-servicenow/.../cli/main.py | 0, 2 | Self-validation, scanScope in output |
| azure-foundry/.../transformer.py | 2 | scanScope in output |
7. Acceptance Criteria
Phase 0-1 (blocks deployment)
- Connector scan produces all chain workloads (no 100-record limit)
- Sync that would remove >50% of entities triggers global circuit breaker (no per-workload minimum floor)
- Circuit breaker gates entire destructive pipeline (entity deletion + materialization + AP removal)
- Circuit breaker logs warning with counts and ratio
- Sync status remains "completed" with circuit_breaker_triggered: true in metrics
- Downstream processing (findings, evidence, posture) runs normally when breaker fires
- Entity type threshold config uses runtime types (owner, not human_identity)
- Existing tests pass, new unit tests for threshold logic
Phase 2
- NormalizedGraph accepts scanScope field (validated at ingest route)
- mode: "incremental" skips all deletion detection
- scannedEntityTypes limits deletion scope to declared types
- sync_mode derived from payload (no longer hardcoded to "full")
- removed_by_sync_id stamped on all soft-deleted entities and authority paths
- Both connectors include scanScope in output
- No connector self-reported counts used for safety decisions
Phase 3
- Entities track `last_seen_sync_id`, `consecutive_absences`
- First absence marks entity as stale (not deleted)
- Authority paths for stale entities are retained
- Entities reaching absence threshold are soft-removed (no hard deletion)
- No automatic tombstoning or hard deletion lifecycle
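The Phase 3 lifecycle can be sketched as a pure state transition per entity per sync. The field names `last_seen_sync_id`, `consecutive_absences`, and `removed_by_sync_id` come from this plan; the `TrackedEntity` shape, the `applySyncObservation` helper, and the threshold value of 3 are assumptions made only for illustration.

```typescript
type EntityStatus = "active" | "stale" | "removed";

// Hypothetical entity record carrying the absence tracking fields.
interface TrackedEntity {
  id: string;
  status: EntityStatus;
  last_seen_sync_id: string;
  consecutive_absences: number;
  removed_by_sync_id: string | null;
}

// Assumed value; the real threshold would come from configuration.
const ABSENCE_THRESHOLD = 3;

function applySyncObservation(
  entity: TrackedEntity,
  syncId: string,
  seenInSync: boolean
): TrackedEntity {
  if (seenInSync) {
    // Any sighting resets the counter and returns the entity to active.
    return {
      ...entity,
      status: "active",
      last_seen_sync_id: syncId,
      consecutive_absences: 0,
      removed_by_sync_id: null,
    };
  }
  const absences = entity.consecutive_absences + 1;
  if (absences >= ABSENCE_THRESHOLD) {
    // Soft-removal only: the record is kept and stamped with the causing sync.
    return {
      ...entity,
      status: "removed",
      consecutive_absences: absences,
      removed_by_sync_id: syncId,
    };
  }
  // First and subsequent absences below the threshold only mark the entity stale.
  return { ...entity, status: "stale", consecutive_absences: absences };
}
```

The function never drops a record, which mirrors the "no hard deletion lifecycle" criterion: the worst outcome of repeated absence is a reversible status change.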
Phase 4
- Health score computed for every incoming scan using platform-derived metrics only
- Scans with score < 0.2 are rejected
- Scans with score 0.2-0.5 are quarantined
- Connector self-reported errors stored for observability but not used in score
- Health reports stored in `scan_health_reports` collection
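The Phase 4 score bands map to three dispositions. The sketch below takes the platform-derived score as an input rather than computing it; the `ScanHealthReport` shape, the helper names, the assumed 0 to 1 score range, and the choice to treat exactly 0.5 as quarantined are all assumptions.

```typescript
type ScanDisposition = "rejected" | "quarantined" | "accepted";

// Hypothetical document shape for the scan_health_reports collection.
interface ScanHealthReport {
  sync_id: string;
  score: number; // platform-derived; assumed to lie in [0, 1]
  disposition: ScanDisposition;
  connector_reported_errors: string[]; // stored for observability only
}

function classifyScore(score: number): ScanDisposition {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`health score out of range: ${score}`);
  }
  if (score < 0.2) return "rejected";
  // Assumption: the 0.2-0.5 band is inclusive at both ends.
  if (score <= 0.5) return "quarantined";
  return "accepted";
}

function buildHealthReport(
  syncId: string,
  platformScore: number,
  connectorErrors: string[]
): ScanHealthReport {
  return {
    sync_id: syncId,
    score: platformScore,
    // The disposition depends on the platform score alone; connector
    // self-reported errors are copied through but never consulted.
    disposition: classifyScore(platformScore),
    connector_reported_errors: [...connectorErrors],
  };
}
```

Keeping `connectorErrors` out of `classifyScore` makes the "not used in score" criterion checkable in a unit test: the same score must yield the same disposition whatever the connector reports.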
Phase 5
- SyncsPage shows health badges, circuit breaker warnings, and error messages
- Connector health summary endpoint returns per-connector metrics
- Webhook notifications fire for degraded/critical/failed scans
Phase 6
- `POST /api/v1/admin/syncs/:syncId/rollback` restores removed paths using `removed_by_sync_id` (deterministic)
- Rollback also restores soft-deleted entities from the same sync
- CLI script `rollback-sync.ts` works for manual recovery
- Manual purge API available for operator-driven hard deletion (not automatic)
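The deterministic rollback in Phase 6 amounts to selecting records by the sync that removed them rather than by timestamp. The following in-memory sketch shows that selection; `SoftDeletedRecord` and `rollbackSync` are hypothetical names, and a storage-backed version would presumably apply the same filter (`status: "removed"` and a matching `removed_by_sync_id`) in a database update.

```typescript
// Hypothetical common shape for soft-deletable entities and authority paths.
interface SoftDeletedRecord {
  id: string;
  status: "active" | "stale" | "removed";
  removed_by_sync_id: string | null;
}

interface RollbackResult {
  restoredIds: string[];
  records: SoftDeletedRecord[];
}

function rollbackSync(
  records: SoftDeletedRecord[],
  syncId: string
): RollbackResult {
  const restoredIds: string[] = [];
  const next = records.map((record): SoftDeletedRecord => {
    // Exact causality match: only records removed by this sync are restored.
    if (record.status === "removed" && record.removed_by_sync_id === syncId) {
      restoredIds.push(record.id);
      return { ...record, status: "active", removed_by_sync_id: null };
    }
    return record;
  });
  return { restoredIds, records: next };
}
```

Running the same rollback twice restores nothing the second time, since restored records no longer carry the sync id. That idempotence is what a time-window match cannot guarantee.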
8. Open Questions
- Threshold tuning: Should thresholds be configurable per-tenant (multi-tenant scenario where different clients have different volatility)?
- Quarantine storage: Should quarantined scans be stored in a separate collection or tagged in the existing `connector_syncs` collection?
- Grace period for first scan: When a connector runs for the first time, there's no baseline. Should the circuit breaker be disabled for the first N scans?
- Webhook delivery guarantees: Should the alerting system guarantee at-least-once delivery (retry on failure), or is best-effort sufficient for MVP?
- Admin authentication: The rollback endpoint needs admin-level auth. How should this be distinguished from regular tenant auth? API key with admin scope?
- Storage growth: Without automatic hard deletion, soft-deleted records accumulate indefinitely. At what scale does this become a storage concern? (Likely not relevant for years at current data volumes — a single tenant's full entity set is <10MB.)
Resolved Questions (from review)
| Question | Resolution |
|---|---|
| Should circuit breaker use `"degraded"` status? | No. Keep `"completed"` + `circuit_breaker_triggered` flag. `"degraded"` breaks downstream gates. |
| Should we trust connector self-reported `expectedNodeCount`? | No. All safety decisions platform-derived. Connector errors stored for observability only. |
| Should there be automatic hard deletion (tombstoning)? | No. Soft-delete indefinitely. Manual purge via admin API only. |
| Should AP breaker be per-workload with minimum floor? | No. Global/tenant-level breaker, no minimum floor. |
| How to make rollback deterministic? | removed_by_sync_id field on entities and authority paths. |
| What entity type names for threshold config? | Runtime types from graph transformer (owner, not human_identity). |
9. Operational Monitoring & Admin Dashboard (Detailed Design)
9a. Admin Dashboard Layout
Primary view: Connector Health Cards (one per connector_type per tenant)
Each card shows:
- Connector name and type (e.g., "Azure Entra ID", "ServiceNow", "Azure Foundry")
- Overall status badge: Healthy / Degraded / Failed / Stale
- Last successful sync timestamp + relative time ("2h ago")
- Entity count from last sync with delta vs. previous ("+12" or "−45 (warning)")
- Authority paths created/updated/removed in last sync
- Mini sparkline: entity count trend over last 10 syncs
- Click-through to filtered SyncsPage for that connector
Industry reference: SailPoint IdentityNow shows per-source health with Normal/Error states and aggregation troubleshooting views. Veza groups dashboards by security scenario with 90-day trend analysis.
9b. Enhanced SyncsPage
Extend the existing SyncsPage.tsx (already built with DataTable, status badges, filtering):
- Entity delta column: `+created / −removed` with color coding (red for >30% drop)
- Duration comparison: vs P50 for this connector type
- Health badge: derived from scan health report
- Error column: display `sync.error` field (stored in DB, currently hidden in UI)
- Expandable detail: side-by-side metrics comparison with previous sync
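One way to derive the entity delta column is a small formatter that returns both the label and its color class. This is an illustrative sketch: `formatEntityDelta` is a hypothetical helper, and reading ">30% drop" as the net decrease relative to the previous sync's entity count is an assumption.

```typescript
interface EntityDeltaCell {
  label: string; // e.g. "+12 / −45"
  severity: "neutral" | "red";
}

// Assumed interpretation of ">30% drop": net loss over the previous count.
const DROP_WARNING_RATIO = 0.3;

function formatEntityDelta(
  created: number,
  removed: number,
  previousCount: number
): EntityDeltaCell {
  const netLoss = Math.max(0, removed - created);
  // Without a baseline there is no meaningful drop ratio.
  const dropRatio = previousCount > 0 ? netLoss / previousCount : 0;
  return {
    label: `+${created} / −${removed}`,
    severity: dropRatio > DROP_WARNING_RATIO ? "red" : "neutral",
  };
}
```

For a previous count of 100, a sync with 12 created and 45 removed nets a 33% loss and renders red, while 5 created and 10 removed stays neutral.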
9c. Alerting Architecture
Tiered alerts:
| Tier | Condition | Action |
|---|---|---|
| P1 Critical | Sync failed; job stalled >10min; no sync in >2× expected interval | In-app banner + webhook (Slack/PagerDuty) |
| P2 Warning | Entity count drop >30%; queue backlog >15min; partial sync | In-app notification + webhook (Slack) |
| P3 Info | Sync completed; new finding types detected | Daily digest |
Implementation:
- New `alerts` collection in MongoDB (type, severity, connector, sync_id, message, acknowledged_at)
- Alert evaluation runs after each `sync_ingestion` and `evaluate_findings` completion
- Notification bell in UI header with unread count
- Webhook dispatcher: single outbound HTTP POST covers Slack, Teams, PagerDuty
Escalation pattern:
P3 → Log + daily digest
P2 → In-app + webhook; if unacknowledged 4h → escalate to P1
P1 → In-app + webhook + PagerDuty; if unacknowledged 30min → re-fire
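The escalation pattern can be expressed as a function evaluated periodically over unacknowledged alerts. The sketch below is hedged: the `AlertRecord` shape only loosely follows the `alerts` collection fields, and `last_fired_at` and `nextEscalation` are hypothetical additions needed to measure the 4h and 30min windows.

```typescript
type AlertTier = "P1" | "P2" | "P3";
type EscalationAction = "none" | "escalate_to_p1" | "refire";

// Hypothetical subset of an alerts document; timestamps are epoch milliseconds.
interface AlertRecord {
  tier: AlertTier;
  last_fired_at: number; // assumption: updated on each fire or escalation
  acknowledged_at: number | null;
}

const P2_ESCALATION_MS = 4 * 60 * 60 * 1000; // 4h unacknowledged
const P1_REFIRE_MS = 30 * 60 * 1000; // 30min unacknowledged

function nextEscalation(alert: AlertRecord, nowMs: number): EscalationAction {
  // Acknowledged alerts never escalate.
  if (alert.acknowledged_at !== null) {
    return "none";
  }
  const elapsed = nowMs - alert.last_fired_at;
  if (alert.tier === "P2" && elapsed >= P2_ESCALATION_MS) {
    return "escalate_to_p1";
  }
  if (alert.tier === "P1" && elapsed >= P1_REFIRE_MS) {
    return "refire";
  }
  // P3 alerts only feed the log and daily digest, so they never escalate.
  return "none";
}
```

Keeping the decision pure (time is passed in as `nowMs`) makes the windows straightforward to unit test without mocking a clock.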
9d. Scan Quarantine Workflow
When anomaly thresholds are breached, quarantine instead of apply:
    Scan arrives → Validate schema → Check anomaly thresholds
                                       │                 │
                                       │ (normal)        │ (anomaly detected)
                                       ▼                 ▼
                               Process normally    Store as "quarantined" sync
                                                   Alert P2 to admin
                                                         │
                                                         ▼
                                                   Admin reviews in UI:
                                                   - Previous vs current metrics side-by-side
                                                   - Actions: Approve / Reject / Re-scan
Quarantine triggers (deterministic):
- Entity count drops >50% from previous sync
- Entity count increases >200%
- Zero entities returned when baseline >0
- Scan duration < 10% of P50 (suspiciously fast → likely incomplete)
Quarantine tracking: Quarantined scans are marked status: "completed" with circuit_breaker_triggered: true and quarantined: true in sync metrics. No new sync statuses are added — this avoids breaking the evaluate-findings.ts:21 gate on status === "completed".
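The four quarantine triggers and the tracking flags can be sketched together. This is a minimal illustration under assumptions: `ScanStats`, `ScanBaseline`, `quarantineReasons`, `buildSyncMetrics`, and the reason strings are hypothetical, and an increase of ">200%" is read as more than three times the previous count.

```typescript
interface ScanStats {
  entityCount: number;
  durationMs: number;
}

// Hypothetical baseline derived from earlier syncs of the same connector.
interface ScanBaseline {
  previousEntityCount: number;
  p50DurationMs: number;
}

function quarantineReasons(scan: ScanStats, baseline: ScanBaseline): string[] {
  const reasons: string[] = [];
  const prev = baseline.previousEntityCount;
  if (prev > 0) {
    if (scan.entityCount === 0) reasons.push("zero_entities");
    // A drop of more than 50% leaves fewer than half the previous entities.
    if (scan.entityCount < prev * 0.5) reasons.push("entity_drop_gt_50pct");
    // An increase of more than 200% means more than 3x the previous count.
    if (scan.entityCount > prev * 3) reasons.push("entity_increase_gt_200pct");
  }
  if (baseline.p50DurationMs > 0 && scan.durationMs < baseline.p50DurationMs * 0.1) {
    reasons.push("suspiciously_fast");
  }
  return reasons;
}

// Quarantine is recorded in metrics only; the status value is unchanged.
function buildSyncMetrics(reasons: string[]) {
  const quarantined = reasons.length > 0;
  return {
    status: "completed" as const,
    metrics: {
      circuit_breaker_triggered: quarantined,
      quarantined,
      quarantine_reasons: reasons, // hypothetical field for the review UI
    },
  };
}
```

A zero-entity scan against a non-empty baseline reports both `zero_entities` and `entity_drop_gt_50pct`, which gives the reviewing admin the specific cause as well as the general one.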
9e. Operational Runbooks
Currently placeholder at docs/runbooks/index.md. Priority runbooks to write:
- Sync Failure Triage — classify error (connection/schema/DB/transform), fix, re-scan, verify recovery
- Data Freshness Outage — check connector alive, check for stalled syncs, check target system availability
- Delta Anomaly Triage — determine if real change vs connector bug, accept new baseline or investigate
- Authority Path Rollback — use admin API or CLI to restore paths removed by a bad sync
9f. Build vs Buy Decision
| Approach | Effort | Recommendation |
|---|---|---|
| Custom admin panel in product UI | 5-7 days | Recommended for MVP — single deployment, customers see it too |
| Grafana + Prometheus | 2-3 days setup | Deferred — wire up existing /metrics endpoint when >5 tenants |
| Datadog / Monte Carlo | $6K+/year | Not justified at current scale |
Key insight: SecurityV0 already has Prometheus metrics at /metrics with 8 metric families. Grafana can be added in 2-3 hours when needed. The admin panel is the higher-value investment because it's customer-facing.
9g. Existing Infrastructure (Already Built)
| Component | Status | File |
|---|---|---|
| Health endpoints (`/health`, `/ready`, `/metrics`, `/diagnostics`) | Built | src/api/routes/system.ts |
| Prometheus metrics (HTTP latency, job duration, queue depth, sync age, findings, authority paths) | Built | src/shared/metrics/metrics.ts |
| Structured JSON logging | Built | src/shared/logging/logger.ts |
| Syncs API (`GET /api/v1/syncs`) | Built | src/api/routes/syncs.ts |
| SyncsPage UI (table, filtering, status badges) | Built | ui/src/pages/SyncsPage.tsx |
| Worker queue depth tracking | Built | src/workers/runtime.ts |
| ConnectorSyncDoc with detailed metrics | Built | src/domain/syncs/types.ts |
10. References
Internal
- Architecture docs: `docs/architecture/03-database.md` (connector_syncs schema, lines 518-580)
- Processing pipeline: `docs/architecture/02-processing-pipeline.md` (SLIs/SLOs, alert matrix, dashboard requirements)
- Existing infrastructure: Prometheus metrics (`src/shared/metrics/metrics.ts`), health endpoints (`src/api/routes/system.ts`), SyncsPage (`ui/src/pages/SyncsPage.tsx`), worker runtime (`src/workers/runtime.ts`)
Industry Research
- SailPoint: Aggregation safeguards — full/delta/targeted aggregation modes; zero-account aggregation abort; uncorrelated account review workflow; per-source health notifications (docs, aggregation troubleshooting)
- Veza: OAA provider-level granularity — failed push for one provider does not affect others; dashboard grouping by security scenario; 90-day trend analysis (product updates)
- Wiz: Last-seen model with type-specific grace periods (24h cloud resources, 72h soft-delete, 7d identity retention); resource drift alerting
- CrowdStrike Falcon: Sensor health model — "reduced functionality mode" retains last-known-good state; 45-minute inactive threshold before status change
- Splunk: Event count deviation monitoring (50% of 7-day rolling average triggers alert); append-only model prevents destructive overwrites; data quarantine for suspect data
- Microsoft Sentinel: Data connector health monitoring with configurable per-connector thresholds
- Prisma Cloud (Palo Alto): Resource drift alerting when >30% of resources disappear in single scan
- ServiceNow CMDB: IRE staging area with reconciliation rules; staleness thresholds (7 days cloud, 30 days on-prem); IRE batches held in staging on anomaly detection
- Data Observability: Monte Carlo's 5 pillars (freshness, volume, schema, distribution, lineage); O'Reilly Data Quality Fundamentals ch4 (monitoring and anomaly detection for pipelines)
Appendix A: Review Findings Traceability
All 8 review findings from the v1 draft review have been addressed in this v2 revision.
| # | Severity | Finding | Resolution | Section |
|---|---|---|---|---|
| 1 | Critical | AP breaker per-workload >= 3 allows full wipe for small tenants | Replaced with global/tenant-level breaker, no minimum floor | Phase 1a, 1b |
| 2 | Critical | Entity deletion breaker doesn't stop authority path materialization from removing paths via missing execution_paths | Cascading pipeline gate: if entity deletion is blocked, materialization also blocked | Phase 1b (cascading gate) |
| 3 | Critical | "degraded" status breaks downstream — evaluate-findings.ts:21 gates on status === "completed" | Keep "completed" status + circuit_breaker_triggered: true flag in metrics | Phase 1d |
| 4 | Critical | Rollback by removed_at ± 5s is non-deterministic | Added removed_by_sync_id field for exact causality tracking | Phase 2e, Phase 6a |
| 5 | High | Trusting connector self-reported expectedNodeCount for breaker overrides | Removed expectedNodeCount. All safety decisions platform-derived. Override is operator-only. | Phase 2a, Phase 4 |
| 6 | High | Tombstoning = irreversible loss | Removed tombstone phase entirely. Soft-delete indefinitely, manual purge via admin API. | Phase 3b |
| 7 | High | Phase 2 incomplete vs current code (ingest validation, hardcoded sync_mode) | Added ingest route validation (2c) and sync_mode derivation from payload (2d) | Phase 2c, 2d |
| 8 | Medium | Entity type policy names (human_identity) don't match runtime types (owner) | Threshold config uses runtime types from graph transformer | Phase 1a (threshold table) |
Core assumption validated: No automatic irreversible deletes and no automatic large soft-removals from a single suspect scan.