Automated connector pipeline — current-state audit + gap analysis
Issue: sv0-platform#1185 (implementation umbrella) · sv0-documentation#302 (this audit + ADR-027) Drives: ADR-027
Next Action
Decide on ADR-027: adopt the seven-slice migration as proposed, or amend the decision shape (broker interface, scan_runs as pipeline-run root, deploy-gate rematerialize job) before any implementation lands.
- Adopt → ADR-027 transitions to
Accepted; this audit transitions toadopted; cut PRs against sv0-platform#1185 starting with Slice 1 (credential broker, ~150 LOC). - Amend → comment on ADR-027 with the delta; this audit stays
research-completeuntil amendment is merged. - Defer / reject → ADR-027 closes; this audit transitions to
deferredorrejectedaccordingly.
TL;DR
The framing "today, zero automated connector infrastructure exists" — common in conversation — is partially correct but undersells what already shipped. Stream-1 Phases 1–3 (per 2026-04-22-connector-control-execution-architecture.md) delivered a working in-process scheduler, an execute_scan worker, a connector-driver seam, and a chain of jobs that auto-cascades sync → evaluate → evidence-pack. The data model for ConnectorInstance / ScanScope / ScanRun is there. The cron-style scheduler ticks every 30 s and atomically claims due scopes (src/workers/scheduler.ts:362-369).
What is genuinely missing:
- Credential broker.
ConnectorInstance.credentials_refexists as a typed field (src/domain/connector-instances/types.ts:79) but no code resolves it at runtime. TheInProcessSubprocessDriveris constructed withenv: undefinedatsrc/index.ts:98, so it can't actually invoke a real connector — only the test fake. - Operator UX surface.
scan_runsis written by the scheduler/handler but no UI route renders it. There is noScanRunsPage.tsx(verified —grep -r "ScanRunsPage" ui/src/ → 0 results). - Pipeline-run root entity. A "scan #N → ingest #M → evaluate #O" causal chain is reconstructable today only by joining
scan_runs.sync_id→connector_syncs._id→ log lines. No single doc holds the multi-stage view. - Deploy-gate triggers (chain-builder version bumps per ADR-026; same shape for
stitched_paths). - Retry / backoff / circuit-breaker policy at the pipeline level (per-stage retry is non-existent; the worker just records
failedand drops). - Idempotency keys for cross-stage replay.
- Real tenant ↔ scope binding for production. All the plumbing exists; nobody has created
ConnectorInstance+ScanScoperows for real customer tenants and the credential broker (item 1) gates this anyway.
Recommendation: finish the in-place architecture rather than build a new one. Add a credential broker, a small UI shell for scan_runs / scan_scopes, a deploy-gate post-deploy worker, and idempotent retry helpers. Don't introduce a new pipelines / pipeline_runs collection — scan_runs is already the pipeline-run root. The decision shape is captured in ADR-027.
1. Current-state audit
1.1 Job runtime
- WorkerRuntime at
src/workers/runtime.ts:31-153. In-process FIFO queue, single consumer (processingflag at line 35), no persistence (jobs lost on restart). Registered handlers route onWorkerJobType. - WorkerJobType enum at
src/workers/runtime.ts:5-11. Six values:sync_ingestion,evaluate_findings,build_evidence_pack,generate_report,execute_scan,stitch_ingestion. There is nojobsMongo collection — jobs live only in memory. - Handler registration happens in
src/index.ts:86-105. The standalonesrc/workers/index.tsis documented as a STUB (file-level comment lines 1-19); no deploy uses it. - Dispatch model:
enqueue()pushes into the in-memory array (runtime.ts:64);processQueue()loops while items exist. Failure path incrementsfailedJobsand logs but does not retry (line 137-147).
Job creation paths (every workerRuntime.enqueue call site):
src/ingestion/transport/ingest-service.ts:60— enqueuessync_ingestionafteracceptGraph.src/ingestion/transport/ingest-service.ts:69— enqueuesevaluate_findingsafter sync_ingestion is queued.src/api/routes/reports.ts:87— enqueuesgenerate_reportfrom the reports API.src/workers/scheduler.ts:290— enqueuesexecute_scanfrom the scheduler tick.src/workers/handlers/evaluate-findings.ts:48— enqueuesbuild_evidence_packper created/updated finding.src/workers/handlers/stitch-ingestion.ts:409— enqueues a post-stitchevaluate_findings.src/services/stitching/stitch-debounce.ts:166— enqueuesstitch_ingestion(debounced).
1.2 Scheduler (Stream-1 Phase 3, already shipped)
- Scheduler class at
src/workers/scheduler.ts:313-582. Per-process, 30 s tick (DEFAULT_TICK_INTERVAL_MS = 30_000at line 89), 50-scope batch claim (DEFAULT_CLAIM_BATCH_SIZE = 50at line 90). - Tick body (
runTickBody,scheduler.ts:434-472): atomically claims due scopes viastorage.claimDueScopes, then for each: enforces concurrency cap (runtime.tsdoesn't track this — the scope'sbudget.max_concurrent_runsis enforced by countingrunningruns inscan_runsatscheduler.ts:241-244), enforces cooldown after failure (scheduler.ts:523-545), then callstriggerScanRun(). triggerScanRun()(scheduler.ts:215-306) inserts ascan_runsdoc withstatus: "running"(line 285) and enqueuesexecute_scan(line 290) atomically (same primitive used by the manual API atsrc/api/routes/scan-runs.ts).- Atomic claim at
src/storage/mongo/adapters/control-plane-adapter.ts:128-181. Single-docfindOneAndUpdateadvancesnext_run_atand stampslast_claimed_at. Cadence support:intervalonly.cronis documented but the$condreturnsnullfor cron (line 167) — cron scopes never auto-fire today. - Boot wire-up at
src/index.ts:116-127. Mounted whenstorageAdapterandenv.schedulerEnabledare true. Default-on;SV0_SCHEDULER_ENABLED=falseopts out.
1.3 execute_scan handler — connector invocation
- Handler at
src/workers/handlers/execute-scan.ts:234-377. Readsscan_runs(pre-created by the scheduler), drives theConnectorDriver, validates output, submits the NormalizedGraph viaIngestService, callscompleteScanRun. Detailed contract documented atexecute-scan.ts:1-29. - ConnectorDriver interface at
src/workers/connector-driver.ts:104-106. Production implementationInProcessSubprocessDriver(connector-driver.ts:201-449):- Spawns the binary (default
sv0-awson PATH, line 157) withscan --scope-id X --run-id Y --category-results-out CR --graph-json G --scope-json S. - Inherits ONLY
PATH/HOME/TMPDIR/LANG/LC_ALLfrom the platform env (FORWARDED_ENV_KEYSat line 171). Platform secrets are intentionally NOT bled through to the connector (security comment atconnector-driver.ts:131-145). extraEnvis the credential pipe — but it comes fromInProcessSubprocessDriverOptions.envpassed at construction time, which atsrc/index.ts:98is undefined. So in production today, the connector subprocess gets zero AWS / Entra / SN credentials and would fail every scan.- This is the single biggest gap and the reason "scans don't actually run from the platform yet." All the orchestration code paths exist; the credentials don't reach the connector.
- Spawns the binary (default
1.4 sync_ingestion — 13-step pipeline
src/workers/handlers/sync-ingestion.ts:125-617. Numbered steps in the source:
- Create
ConnectorSyncDocwithstatus: "running"(line 141-163). - Transform graph → entities + evidence (line 164-173).
- Diff against existing state → events (line 175-203).
- Upsert entities atomically with M2/M3 cross-connector merge (line 205-269).
- Insert events (line 278).
- Entity versioning (changed + deleted + ownership-released paths) (line 281-370).
- Upsert execution evidence (line 372-377).
- Materialize execution paths (line 379-439).
- Assemble execution chains (line 441-446 — calls
assembleExecutionChains). - Materialize authority paths (line 448-493).
- Compute resource_key coverage metrics (line 495-506).
- Run post-ingest validators (line 508-531).
- Update
ConnectorSyncDocwith metrics +status: "completed"(line 533-576).
- Then: schedule a debounced
stitch_ingestionjob (line 601-603) which will later enqueue a post-stitchevaluate_findings.
Contract with connectors: the platform consumes NormalizedGraph only; the connector decides shape. Submitted via either HTTP (/api/v1/ingest/normalized-graph) or in-process from execute_scan (IngestService.submit).
1.5 Evaluator + downstream stages
- EvaluatorService at
src/evaluator/index.ts:36. Triggered byevaluate_findingsworker handler (src/workers/handlers/evaluate-findings.ts:16-103). Reads entities + paths, writes findings. Enqueuesbuild_evidence_packper created/updated finding (line 48-53). - Trigger paths:
IngestService.acceptGraph→ unconditionally enqueues anevaluate_findingswithtrigger="sync"(ingest-service.ts:69). Handler gates onconnector_syncs.status === "completed"(evaluate-findings.ts:28-38).stitch_ingestionhandler → enqueues withtrigger="post_stitch"(stitch-ingestion.ts:401-417). This second pass exists because stitched-path rules need stitched data that wasn't visible in the first eval.
- Chaining today is purely fire-and-forget through the in-memory queue: enqueue is synchronous, but if the worker process crashes between sync_ingestion completing and evaluate_findings starting, the evaluate is lost forever (no persistence —
runtime.ts:32).
1.6 Scan-run tracking
scan_runscollection — schema atsrc/domain/scan-runs/types.ts:104-141. Written by:triggerScanRun(scheduler / manual API path) — insertsrunning.execute_scanhandler —updateScanRunCategoryResultper category, thencompleteScanRunwith terminal status.
connector_syncscollection — schema atsrc/domain/syncs/types.ts. Written bysync_ingestionstep 1 + step 13.- Linkage —
scan_runs.sync_idpoints to the resultingconnector_syncs._id(stamped byexecute_scanafterIngestService.submitreturns,execute-scan.ts:311-314). When a connector pushes via the HTTP/api/v1/ingest/...route (no scheduler involvement), noscan_runrow exists — only aconnector_sync. This is the seed-script path today. - API surface —
src/api/routes/scan-runs.ts:7-9:POST /api/v1/scan-runs— manual trigger.GET /api/v1/scan-runs/:id— fetch one.GET /api/v1/scan-runs?scope_id=…— list (paginated).
- UI surface — none.
grep -r "ScanRunsPage\|scan-runs" ui/src/ → 0 hits. The operator-facing run history exists only in the database and via direct curl.
1.7 Credential storage today
The data model has credentials_ref: { provider: "op" | "env" | "aws_secretsmanager" | "azure_keyvault", ref: string } (src/domain/connector-instances/types.ts:28-40) — documented as "Phase 1 supports op and env; vault back-ends reserved for future" (lines 30-32).
Reality (src/api/routes/admin/connector-instances.ts:18-21):
credentials_ref?: { provider, ref } — defaults to {env, ""}; the field is schema-required but **inert today** (no credential broker resolves op:// refs at runtime).
There is no broker. Confirmed by:
- No file matches
credential.*broker,resolveCredentials, orCredentialBrokeranywhere insrc/. InProcessSubprocessDrivertakesextraEnvfrom constructor options (connector-driver.ts:147); the constructor is called with noenvatsrc/index.ts:98.
Operationally, this is consistent with the laptop-only workflow today: a developer runs connectors with a local .env file (the connector loads it via python-dotenv at entra_servicenow/config.py:9; for AWS at aws/src/sv0_aws/cli/main.py:1581). For production tenants this is a hard blocker — there is no path to get credentials into a worker process today.
1.8 Connector → platform contract
Both audited connectors (sv0-aws and entra-servicenow) use the same shape:
- Auth to source systems: environment variables loaded from a sibling
.envfile scoped to the connector dir (entra_servicenow/config.py:9-19; AWS usesboto3default chain +AWS_PROFILEenv or--accountsflag,aws/cli/main.py:2280-2305). - Auth to platform:
PLATFORM_API_KEYenv (aws/cli/main.py:2229, line 2353) — a per-tenant connector API key, scope-restricted to/api/v1/ingest/*. Minted via the admin route atsrc/api/routes/admin/connector-api-keys.ts(delegated-agent rejected; interactive only) OR by direct Mongo insert for agent sessions. - Submit shape:
PlatformClient.submit_normalized_graph(graph)→POST {PLATFORM_URL}/api/v1/ingest/normalized-graphwithX-Tenant-Id+X-API-Keyheaders. - Driver invocation shape: when invoked from the platform's
InProcessSubprocessDriver, the connector receivesscan --scope-id X --run-id Y --category-results-out file --graph-json file --scope-json file. It writes per-category outcomes + the graph atomically to those files and exits. The platform reads them off disk.
1.9 What the seed-demo scripts actually do
scripts/seed-demo-w1.ts (representative): synthesises a NormalizedGraph in-process and POSTs it to /api/v1/ingest/normalized-graph (seed-demo-w1.ts:1676), then polls /api/v1/syncs/:id until completed (:1695). This path bypasses execute_scan entirely — no scan_run is created. Findings + chains + authority paths are produced because IngestService.acceptGraph still enqueues sync_ingestion + evaluate_findings.
The seed-demo path is the production-shaped ingest path minus the connector subprocess. It is the proof that everything from sync_ingestion onward works unattended.
2. Gap analysis
| # | Gap | Concrete consequence today |
|---|---|---|
| 1 | No credential broker — credentials_ref is inert | Cannot run a real (non-seed) scan against any tenant from the platform process. AWS/Entra creds only live on a developer laptop. Blocks every other gap. |
| 2 | Cron cadence unwired | claimDueScopes advances next_run_at = null for cron scopes (control-plane-adapter.ts:167); they never auto-fire. Interval-only today. |
| 3 | No operator UX — no ScanRunsPage / ScopesPage / InstancesPage in UI | Operators cannot see "did the 2 AM scan succeed?" without mongosh. Failures are buried in worker logs. |
| 4 | No deploy-gate triggers (ADR-026 path b, plus the same shape for stitched_paths) | Code-side changes to chain-builder.ts / stitched-path-materializer.ts don't propagate until the next per-tenant sync. Cost already paid twice on chains. |
| 5 | Volatile job queue — in-memory FIFO, no persistence | Worker process crash between sync completion and evaluate dispatch = silent lost work. Today's blast radius is small (one node, restarts are rare), but is a hard floor for any prod claim. |
| 6 | No retry / backoff at the pipeline level | A transient Mongo blip during evaluate_findings causes the whole evaluate to drop. The execute_scan handler has a cooldown-after-failure budget at the scope level (scheduler.ts:531-545), but ingest/evaluate stages have no equivalent. |
| 7 | No multi-stage pipeline-run view | "Why did Tuesday's brief for tenant T show old data?" requires joining scan_runs.sync_id → connector_syncs._id → logs across two services. The causal thread is reconstructable, not surfaced. |
| 8 | No tenant ↔ instance ↔ scope provisioning for real customer tenants | Even with a broker, there are zero ConnectorInstance / ScanScope rows for any non-seed tenant. Real demo tenants are currently fed by the developer-runs-from-laptop path. |
| 9 | No idempotency keys at the stage boundary | Re-driving an evaluate_findings manually for the same syncId is safe (the gate on sync.status === "completed" covers it), but re-driving rematerialize (per ADR-026 path c) has no documented re-entry contract because the job kind doesn't exist yet. |
| 10 | No failure escalation | A run stuck in running for >max_runtime_seconds is the documented "reaper" responsibility (scan-scopes/types.ts:46-53) — the reaper code doesn't exist yet (verified by grep — no file matches reaper outside comments). |
| 11 | Stitch ingestion is connector-completion-driven, not pipeline-driven | StitchDebouncer.schedule(tenantId) fires only at the end of sync_ingestion. A run that succeeded via execute_scan but failed mid-sync_ingestion (step 4 entity upsert failure, say) leaves the tenant in a half-stitched state until the next sync. |
Of these: #1 (credential broker) is the only true new-architecture gap. #2, #3, #4, #5, #10 are wiring/UX gaps on existing primitives. #6, #7, #8, #9, #11 are policy decisions that don't need new collections.
3. Architecture proposal
3.1 Pipeline primitive: keep scan_runs as the root, do NOT introduce pipeline_runs
scan_runs is already the canonical "did the work happen?" entity. It has status, started_at/ended_at, category_results, trigger, and most importantly a tenant-scoped _id (scan_runs.types.ts:104-141). Introducing a new pipeline_runs collection on top of it would duplicate every field, add a join, and conflict with the operator's mental model (which is "the scan ran, here's what it did").
Decision: extend scan_runs.category_results with downstream stage results. The category_results map is already a Record<string, CategoryResult> (line 88-95). Every cell — connector category OR platform stage — satisfies the existing CategoryResult = { status, items_scanned, started_at, ended_at, errors } shape. No schema change. Stage-specific semantics are encoded by what items_scanned counts:
| Cell key | items_scanned semantics |
|---|---|
iam / lambda / … | connector items scanned (existing) |
__sync | entities upserted by sync_ingestion |
__eval | findings created + updated + resolved |
__chain | execution chains created + updated |
__stitch | stitched paths materialized |
__evidence | evidence packs built |
Stage-specific detail beyond the cell (sync_id, finding ids, chain ids) lives in linked records, not in the cell — scan_runs.sync_id already pins the connector_sync; findings and chains are tenant-scoped queries from there. Cells carry no extra fields; the type-checker enforces this against CategoryResult.
The __ prefix collides with the existing "__driver__" sentinel (execute-scan.ts:209) so it's already in the validation allowlist concept. Tighten the regex at execute-scan.ts:80 to accept ^(?:[a-z][a-z0-9_]{0,63}|__[a-z][a-z0-9_]{0,30})$.
Tradeoff: keeps one row per pipeline run, no joins for the UI's "last scan summary" view. Downside: rows for ingest-only (seed-script) submissions don't get a scan_run parent today. Fix that by synthesising a scan_run row inside IngestService.acceptGraph for non-execute_scan submissions, with trigger.type = "manual_api" (already a valid variant, scan-runs/types.ts:28-34). That makes the data model uniform.
3.2 Trigger model: hybrid — scheduled + deploy-gate + manual API + sync-cascade
Four trigger sources, all leading to the same scan_runs-rooted execution:
- Scheduled — the existing scheduler tick (
Scheduler.tickOnce). Covers normal periodic scans. - Manual API —
POST /api/v1/scan-runs(scan-runs.ts:60). Already there. Operator UI button binds to this. - Deploy-gate — new worker job kind
rematerialize. Triggered by a small CI step aftermaindeploys, when the deploy diff toucheschain-builder.ts,stitched-path-materializer.ts, or related schema files. The CI step callsPOST /api/v1/admin/rematerializewith the list of stages and a list of tenants (default: allstatus="active"tenants). This is ADR-026 path (b) generalised. - Sync-cascade — already exists.
sync_ingestion→evaluate_findings;evaluate_findings→build_evidence_pack. Keep as-is.
Webhook triggers from connectors (event-driven push when source changes) are rejected for now: the connectors are full-state extractors per architecture doc 2026-04-22 §"Scheduling", and "delta-event consumers" was an explicitly considered-and-rejected alternative in the scheduler design (scheduler.ts:485-489). Revisit when a connector actually streams change events.
3.3 Credential model: two-tier interface, vendor-agnostic
The interface, not the vendor, is the architecture commitment.
// src/credentials/broker.ts (NEW)
export interface CredentialBroker {
/** Resolve a CredentialsRef to a flat env-var map for connector consumption. */
resolve(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>>;
}
Implementations (pluggable, picked by env.credentialProvider):
EnvCredentialBroker—provider: "env"resolves toprocess.env[ref.ref + "_*"](the broker prefix-matches; e.g.ref: "AWS_DEV"→{AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, ...}pulled fromAWS_DEV_*platform-env vars). Useful for the single-tenant dev VM where creds still live in the deploy's.env.OnePasswordCredentialBroker—provider: "op"— rejected for runtime use. 1Password remains bootstrap-only. The schema admits this provider but the broker hard-errors if asked to resolve it at runtime, until a service-token-backed integration exists.AzureKeyVaultCredentialBroker—provider: "azure_keyvault"— Tier-3 (managed Azure VMs per ADR-022) gets per-VM managed-identity access to a Key Vault that holds per-tenant secret bundles. Ref shape:"vault://kv-sv0-{env}/tenants/{tenant_id}/connector/{kind}". The broker uses@azure/identity+@azure/keyvault-secrets, mints a 5-minute env bundle, hands it to the driver. Recommended path for staging+prod.AwsSecretsManagerCredentialBroker—provider: "aws_secretsmanager"— symmetric for AWS-hosted environments. Reserved; not Tier-1 today.
Security boundary:
- The broker is the ONLY path that can read tenant credentials at runtime. Worker handlers never touch raw secret values.
- Tenant namespace is derived from
tenantId, not from the ref string. Each broker constructs the lookup path itself: env broker usesSV0_TENANT_${tenantSlug}_${refSuffix}_*; Key Vault broker usestenants/${tenantId}/connector/${kind}/*. The user-controlledref.refonly contributes the trailing connector-specific suffix (AWS_PRIMARY,ENTRA_DEFAULT, …) — never the tenant segment. A misconfiguredrefpointing at another tenant's path is structurally impossible because the tenant prefix is computed. The broker also rejects anyref.refcontaining path separators or the literal substringtenant. - The driver receives a one-shot env bundle that the broker mints fresh per
execute_scanrun. The bundle is in the subprocess env only — never logged, never written to Mongo, never propagated out of the runtime. - Per ADR-018 (docker-group decision), the platform worker process already has docker permissions; the broker model doesn't change that.
- Per ADR-023 (four-tier auth), the broker uses tier-appropriate auth: managed identity on Azure VMs (Tier-3 dev/prod), service-token on managed PaaS (Tier-2 if ever introduced), env-vars on local dev (Tier-1).
Wiring change: src/index.ts:98 constructs InProcessSubprocessDriver with a closure that calls broker.resolve(tenantId, instance.credentials_ref). The driver invokes the broker per-run, not at boot, so credentials can rotate without restart.
3.4 Stage chaining: keep current enqueue cascade, add explicit pipeline state on scan_runs
Current chain (preserved):
execute_scan → sync_ingestion → evaluate_findings → build_evidence_pack
↘ stitch_ingestion (debounced) → evaluate_findings (post_stitch)
Add: each handler stamps its stage outcome onto scan_runs.category_results.__stage. This is the only schema change. Handler responsibilities:
execute_scan(existing) — writes per-category +__driver__(already does this).sync_ingestion(existing) — at end of step 13, ifjob.payload.scanRunIdis present, callstorage.updateScanRunCategoryResult(tenantId, scanRunId, "__sync", { status: deriveStatus(), …metrics }). New code: ~10 lines. ThescanRunIdpropagates fromexecute_scan'sIngestService.submit→acceptGraph→ enqueue payload. Today the payload only has{graph, authMethod}atingest-service.ts:62-66; addscanRunId(nullable for HTTP-direct seed paths).evaluate_findings— same. Stamp__eval. ~10 lines.build_evidence_pack— stamp__evidence(roll-up; the handler runs per finding, so it has to upsert with$inc-style metrics).stitch_ingestion— stamp__stitch. Tricky because the stitch debouncer collapses N sync completions into 1 stitch — pick the most-recentscan_runfrom the participating syncs.rematerialize(NEW worker job kind, per ADR-026 path b/c generalised) — stamps__chainand/or__stitchdepending on the requested stages list. When triggered by deploy-gate (no parent scan), creates a syntheticscan_runwithtrigger.type = "manual_api"(or introduce a new"deploy_gate"variant — see Open Q).
Idempotency: every stage's "did I already do this work?" gate is keyed by either scanRunId + stage or syncId + stage. Re-driving a stage:
sync_ingestion: duplicate-key onconnector_syncs._idis already handled (sync-ingestion.ts:153-163).evaluate_findings: already gated onconnector_syncs.status === "completed"(evaluate-findings.ts:28-38); re-runs upsert findings deterministically.rematerialize: idempotent per ADR-026 §Consequences ("Re-runningassemble_chainsagainst an unchanged graph upserts the samecomposition_hashand produces no net mutation beyondlast_seen_at"). The same shape applies to stitched-path re-materialization.build_evidence_pack: change-detection already short-circuits unchanged findings.stitch_ingestion: stitched paths are upserted by(tenant_id, path_signature); re-runs are no-ops.
Failure modes (per stage):
| Stage | Transient failure | Persistent failure |
|---|---|---|
execute_scan | Driver SIGKILL on timeout → scan_run.status = "timeout"; cooldown enforced next tick. | Bad creds / scope unreachable → failed; same cooldown gate. |
sync_ingestion | Mongo blip → connector_sync.status = "failed"; lost (no retry). Gap #6. | Schema validation failure → failed. |
evaluate_findings | Skipped if sync.status !== "completed" (degrades gracefully). | Rule code throws → currently kills the whole evaluation; needs per-rule isolation. |
rematerialize | Same as sync_ingestion. | Same. |
build_evidence_pack | Per-finding; one failure doesn't block siblings. | — |
To close #6, add a stage-retry helper (not full Temporal/Celery — overkill for one node): wrap each handler in a withRetry decorator that retries up to N times on a structured TransientError thrown by storage/network code. Persistent errors propagate. This is ~30 lines.
3.5 Observability
scan_runs becomes the pipeline-run root, with category_results.__* keys exposing every downstream stage. UI implications:
/operations/runspage (new) — table ofscan_runsacross the operator's tenant scope (super-admin sees all); columns: tenant, scope, started, ended, status, per-stage statuses (sparkline of__sync/__eval/__chain/__stitch/__evidence)./operations/runs/:id— single-run detail; per-category breakdown + per-stage panel; "Re-run" button posts to a newPOST /api/v1/scan-runs/:id/retryroute that creates atrigger.type = "retry"child run (already a typed variant,scan-runs/types.ts:32)./operations/scopes— schedule editor; create/pause/resume scopes. CRUD already exists insrc/api/routes/admin/scan-scopes.ts; needs a thin UI./operations/instances—ConnectorInstanceCRUD already exists atsrc/api/routes/admin/connector-instances.ts; needs UI.
Total UI: 3 pages + ~5 hooks. Smallest deliverable that closes the operator-UX gap.
3.6 Anti-goals
- No ML / no probabilistic stage decisions. Retries are deterministic counts, not adaptive backoff curves. Stage-gating is exact-match status, not confidence scores.
- No connector writes to source systems. Confirmed at the driver layer (
connector-driver.ts:13-15). The broker model preserves this — the connector receives source-system credentials and an instruction to extract; no write path is enabled. - No tenant-isolation bypass. Every handler key is already
(tenantId, ...). The broker takestenantIdas an argument (broker.resolve(tenantId, ref)) so a misconfiguredrefcan't accidentally grant cross-tenant read. - No second pipeline-run collection.
scan_runsIS the pipeline run.
4. Alternatives considered (and rejected)
Recorded in ADR-027 §Alternatives. Briefly:
- Plan B — separate
pipeline_runscollection. Rejected for field duplication, UI join cost, and mental-model mismatch. - Plan C — event-driven webhooks. Rejected pending a connector that actually exposes change events.
- Plan D — Temporal / Celery / Bull. Rejected for the first slice as overkill against one VM; revisit when fanning out.
- Plan E — lazy on-demand pipeline state. Rejected for read-amplification and log-mining cost.
5. Migration plan
Each slice is one PR (or one small umbrella with N PRs). All slices preserve the existing seed-script workflow until the final slice.
Slice 0 — wire the existing Scheduler to one demo tenant on staging
Size: ~50 LOC + a manual provisioning step. No new code. A one-off admin script (or three admin-API curl calls):
POST /api/v1/admin/connector-instancesfordemo-nimbuswithconnector_kind: "aws",credentials_ref: { provider: "env", ref: "AWS_DEMO" }.POST /api/v1/admin/scan-scopesfor the instance withcadence: "interval",interval_seconds: 86400,service_categories: [...].- Add
AWS_DEMO_ACCESS_KEY_ID/AWS_DEMO_SECRET_ACCESS_KEYto staging's.env. Blocked on: Slice 1 (credential broker), because without it the driver gets no creds.
Slice 1 — credential broker, env provider only
Size: ~150 LOC + tests.
PR: feat(workers): credential broker for connector driver
src/credentials/broker.tsinterface.src/credentials/env-broker.tsimpl (prefix-matchprocess.env).InProcessSubprocessDriveraccepts acredentialBrokeroption; at run time callsbroker.resolve(tenantId, instance.credentials_ref)and passes the result asextraEnv.src/index.ts:98constructs the broker and wires it.- Decision gate: unblocks Slice 0. Smallest unit of value.
Slice 2 — operator UI shell (runs / scopes / instances)
Size: ~600 LOC across 3 pages + 5 hooks. No backend changes (admin APIs already exist).
PR: feat(ui): operations pages — scan runs, scopes, connector instances
ui/src/pages/operations/RunsListPage.tsx,RunDetailPage.tsx,ScopesPage.tsx,InstancesPage.tsx.- TanStack Query hooks.
- Route auth: super-admin only initially.
Slice 3 — pipeline state on scan_runs
Size: ~200 LOC + tests.
PR: feat(workers): stage outcome tracking on scan_runs.category_results
- Plumb
scanRunIdthroughIngestService.submit→ enqueue payloads. - Each handler stamps its
__stageoutcome. IngestService.acceptGraph(HTTP path) synthesises ascan_runfor non-execute_scansubmissions so seed scripts get pipeline visibility for free.- UI run-detail page reads the
__stagepanels.
Slice 4 — deploy-gate rematerialize job
Size: ~250 LOC + CI step.
PR: feat(workers): rematerialize job + deploy-gate trigger (ADR-026 path b)
- New worker handler
rematerialize.ts. Stages list (initially["chains", "stitched_paths"]); reusesassembleExecutionChainsand the stitched-path materializer. - New admin route
POST /api/v1/admin/rematerialize. - CI step in
deploy-prod.yml/deploy-dev.yml: when the merge diff toucheschain-builder.ts(or a tracked schema constant), curl the admin route for all active tenants. - Extend to
stitched-path-materializer.tsin the same trigger; the ADR-026 §"stitched_paths shares the same vulnerability" caveat is closed here.
Slice 5 — Azure Key Vault credential broker
Size: ~250 LOC + Bicep.
PR: feat(credentials): Azure Key Vault broker for staging/prod
src/credentials/akv-broker.ts.- Bicep / Terraform delta in sv0-infrastructure to mint the Key Vault and grant managed-identity access from
vm-sv0-dev-1(Tier-3 dev) and the future prod VM. - Tenant provisioning step: deposit
tenants/{tenant_id}/connector/{kind}secret bundle into the vault. - Promotes Slice 0 from env-creds-on-VM to vault-resolved-per-run.
Slice 6 — retire seed-demo-* for non-mock tenants
Size: ~50 LOC of deletion.
PR: chore(scripts): seed-demo scripts now mock-only; real tenants run via scheduler
- Document that
seed-demo-*is for synthetic mock data only (already true). - Real demo tenants (e.g.
enterprise-nimbus) getConnectorInstance+ScanScoperows and are driven by the scheduler.
6. Open questions
rematerializetrigger type. Add a newScanRunTriggerTypevalue"deploy_gate", or reuse"manual_api"with a special user-id sentinel? Prefer adding the variant — it survives audit-log inspection cleanly.- Stitch-debounce ↔ scan_run linkage. When N syncs collapse into 1 stitch, which
scan_runowns the__stitchcell? Two candidates: pick the newest participating scan, OR fan out and write__stitchto all N. Recommend fanning out for the audit trail. - Cron cadence vs interval.
claimDueScopesshould wire cron evaluation (usecron-parseror similar). Today it silently leavesnext_run_at = nullfor cron scopes. Track as a separate small follow-up; not part of this proposal's first slice. - Reaper for orphaned
runningruns. Thelast_claimed_atfield is stamped specifically for this (scan-scopes/types.ts:46-53). A periodic reaper is documented as the responsibility of "an operational reaper" (control-plane-adapter.ts:124-126) but doesn't exist. Add as Slice 7 or fold into Slice 4. - Job-queue persistence. In-memory queue is fine for one VM. If/when the platform fans out to N workers, Mongo-backed jobs become required. Not part of this proposal — file as future tech-debt.
References
src/workers/scheduler.ts:215-306,313-582—triggerScanRun+Scheduler.src/workers/runtime.ts:31-153—WorkerRuntime.src/workers/connector-driver.ts:104-449—ConnectorDriverinterface +InProcessSubprocessDriver.src/workers/handlers/execute-scan.ts:234-377—execute_scanhandler.src/workers/handlers/sync-ingestion.ts:125-617— 13-step ingest pipeline.src/workers/handlers/evaluate-findings.ts:16-103— evaluate handler + stitch-cascade trigger.src/storage/mongo/adapters/control-plane-adapter.ts:128-181— atomic scope claim.src/api/routes/admin/connector-instances.ts:1-33— admin CRUD; "broker is inert today" note.src/api/routes/admin/connector-api-keys.ts— connector API key mint.src/api/routes/scan-runs.ts:7-9— manual scan trigger API.src/domain/scan-runs/types.ts:104-141—ScanRunDocschema.src/domain/scan-scopes/types.ts:33-65—ScanScopeSchedule,ScanScopeBudget.src/domain/connector-instances/types.ts:28-92—CredentialsRef+ConnectorInstanceDoc.repos/sv0-connectors/integrations/aws/src/sv0_aws/cli/main.py:2229,2353— AWS connector platform-submit shape.repos/sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/config.py:1-50— Entra connector env config.- ADR-018 — deploy security (docker-group decision).
- ADR-022 — Azure compute landing zone (Tier-3 dev/prod VMs).
- ADR-023 — authentication target architecture (four-tier).
- ADR-024 — Azure deploy lifecycle.
- ADR-026 — chain re-materialization triggers (path b/c framing reused here).
- ADR-027 — the decision shape this audit drives.
2026-04-22-connector-control-execution-architecture.md— original architecture doc that delivered Stream-1 Phases 1–3.
Honored North Star clauses
- C-13 (SIEM landing supported, not a SIEM console —
north-star.md:405). A scheduled scan that fails silently breaks the SIEM-cold landing flow because the brief / chain shown to the analyst is stale. Slice 0 + Slice 2 close this: scans run on cadence, failures are visible to operators within minutes. - C-15 (path differentiability —
north-star.md:377). Stale stitched-paths from astitched-path-materializer.tsdeploy without re-materialization break per-path content. Slice 4 closes this for both chains and stitched paths.