
Automated connector pipeline — current-state audit + gap analysis

Issue: sv0-platform#1185 (implementation umbrella) · sv0-documentation#302 (this audit + ADR-027)
Drives: ADR-027

Next Action

Decide on ADR-027: adopt the seven-slice migration as proposed, or amend the decision shape (broker interface, scan_runs as pipeline-run root, deploy-gate rematerialize job) before any implementation lands.

  • Adopt → ADR-027 transitions to Accepted; this audit transitions to adopted; cut PRs against sv0-platform#1185 starting with Slice 1 (credential broker, ~150 LOC).
  • Amend → comment on ADR-027 with the delta; this audit stays research-complete until amendment is merged.
  • Defer / reject → ADR-027 closes; this audit transitions to deferred or rejected accordingly.

TL;DR

The framing "today, zero automated connector infrastructure exists" — common in conversation — is partially correct but undersells what already shipped. Stream-1 Phases 1–3 (per 2026-04-22-connector-control-execution-architecture.md) delivered a working in-process scheduler, an execute_scan worker, a connector-driver seam, and a chain of jobs that auto-cascades sync → evaluate → evidence-pack. The data model for ConnectorInstance / ScanScope / ScanRun is there. The cron-style scheduler ticks every 30 s and atomically claims due scopes (src/workers/scheduler.ts:362-369).

What is genuinely missing:

  1. Credential broker. ConnectorInstance.credentials_ref exists as a typed field (src/domain/connector-instances/types.ts:79) but no code resolves it at runtime. The InProcessSubprocessDriver is constructed with env: undefined at src/index.ts:98, so it can't actually invoke a real connector — only the test fake.
  2. Operator UX surface. scan_runs is written by the scheduler/handler but no UI route renders it. There is no ScanRunsPage.tsx (verified — grep -r "ScanRunsPage" ui/src/ → 0 results).
  3. Pipeline-run root entity. A "scan #N → ingest #M → evaluate #O" causal chain is reconstructable today only by joining scan_runs.sync_id → connector_syncs._id → log lines. No single doc holds the multi-stage view.
  4. Deploy-gate triggers (chain-builder version bumps per ADR-026; same shape for stitched_paths).
  5. Retry / backoff / circuit-breaker policy at the pipeline level (per-stage retry is non-existent; the worker just records failed and drops).
  6. Idempotency keys for cross-stage replay.
  7. Real tenant ↔ scope binding for production. All the plumbing exists, but nobody has created ConnectorInstance + ScanScope rows for real customer tenants, and the missing credential broker (item 1) blocks this anyway.

Recommendation: finish the in-place architecture rather than build a new one. Add a credential broker, a small UI shell for scan_runs / scan_scopes, a deploy-gate post-deploy worker, and idempotent retry helpers. Don't introduce a new pipelines / pipeline_runs collection — scan_runs is already the pipeline-run root. The decision shape is captured in ADR-027.


1. Current-state audit

1.1 Job runtime

  • WorkerRuntime at src/workers/runtime.ts:31-153. In-process FIFO queue, single consumer (processing flag at line 35), no persistence (jobs lost on restart). Registered handlers route on WorkerJobType.
  • WorkerJobType enum at src/workers/runtime.ts:5-11. Six values: sync_ingestion, evaluate_findings, build_evidence_pack, generate_report, execute_scan, stitch_ingestion. There is no jobs Mongo collection — jobs live only in memory.
  • Handler registration happens in src/index.ts:86-105. The standalone src/workers/index.ts is documented as a STUB (file-level comment lines 1-19); no deploy uses it.
  • Dispatch model: enqueue() pushes into the in-memory array (runtime.ts:64); processQueue() loops while items exist. Failure path increments failedJobs and logs but does not retry (line 137-147).
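The dispatch model described above can be illustrated with a minimal sketch. It is not the actual WorkerRuntime: the class name, the two-value job type, and the counters are hypothetical stand-ins, and it assumes the caller drives processQueue() explicitly. It only shows the three properties the audit relies on: jobs live in an array in process memory, a processing flag keeps a single consumer, and a failed job is counted and dropped.

```typescript
// Hypothetical, simplified stand-in for WorkerRuntime; all names are illustrative.
type JobType = "sync_ingestion" | "evaluate_findings";

interface Job {
  type: JobType;
  payload: unknown;
}

type Handler = (job: Job) => Promise<void>;

class InMemoryRuntime {
  private readonly queue: Job[] = []; // lives only in process memory
  private readonly handlers = new Map<JobType, Handler>();
  private processing = false; // single-consumer guard
  failedJobs = 0;
  completedJobs = 0;

  register(type: JobType, handler: Handler): void {
    this.handlers.set(type, handler);
  }

  enqueue(job: Job): void {
    this.queue.push(job);
  }

  /** Drains the queue in FIFO order; a failed job is counted and dropped, never retried. */
  async processQueue(): Promise<void> {
    if (this.processing) return;
    this.processing = true;
    try {
      while (this.queue.length > 0) {
        const job = this.queue.shift() as Job;
        const handler = this.handlers.get(job.type);
        try {
          if (handler === undefined) throw new Error(`no handler for ${job.type}`);
          await handler(job);
          this.completedJobs += 1;
        } catch {
          this.failedJobs += 1; // no retry and no dead-letter record
        }
      }
    } finally {
      this.processing = false;
    }
  }
}
```

Because the queue is a plain array, anything still queued when the process exits is gone, which is the volatility the gap analysis returns to.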

Job creation paths (every workerRuntime.enqueue call site):

  • src/ingestion/transport/ingest-service.ts:60 — enqueues sync_ingestion after acceptGraph.
  • src/ingestion/transport/ingest-service.ts:69 — enqueues evaluate_findings after sync_ingestion is queued.
  • src/api/routes/reports.ts:87 — enqueues generate_report from the reports API.
  • src/workers/scheduler.ts:290 — enqueues execute_scan from the scheduler tick.
  • src/workers/handlers/evaluate-findings.ts:48 — enqueues build_evidence_pack per created/updated finding.
  • src/workers/handlers/stitch-ingestion.ts:409 — enqueues a post-stitch evaluate_findings.
  • src/services/stitching/stitch-debounce.ts:166 — enqueues stitch_ingestion (debounced).

1.2 Scheduler (Stream-1 Phase 3, already shipped)

  • Scheduler class at src/workers/scheduler.ts:313-582. Per-process, 30 s tick (DEFAULT_TICK_INTERVAL_MS = 30_000 at line 89), 50-scope batch claim (DEFAULT_CLAIM_BATCH_SIZE = 50 at line 90).
  • Tick body (runTickBody, scheduler.ts:434-472): atomically claims due scopes via storage.claimDueScopes, then for each: enforces concurrency cap (runtime.ts doesn't track this — the scope's budget.max_concurrent_runs is enforced by counting running runs in scan_runs at scheduler.ts:241-244), enforces cooldown after failure (scheduler.ts:523-545), then calls triggerScanRun().
  • triggerScanRun() (scheduler.ts:215-306) inserts a scan_runs doc with status: "running" (line 285) and enqueues execute_scan (line 290) atomically (same primitive used by the manual API at src/api/routes/scan-runs.ts).
  • Atomic claim at src/storage/mongo/adapters/control-plane-adapter.ts:128-181. Single-doc findOneAndUpdate advances next_run_at and stamps last_claimed_at. Cadence support: interval only. cron is documented but the $cond returns null for cron (line 167) — cron scopes never auto-fire today.
  • Boot wire-up at src/index.ts:116-127. Mounted when storageAdapter and env.schedulerEnabled are true. Default-on; SV0_SCHEDULER_ENABLED=false opts out.
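The per-scope gating in the tick body can be sketched as a pure function. This is an illustration under assumptions: the field names (maxConcurrentRuns, cooldownSeconds, lastFailureAt) and the Decision values are hypothetical, and the real scheduler reads these from Mongo rather than receiving them as arguments. The check order follows the description above: concurrency cap first, then cooldown after failure.

```typescript
// Illustrative names only; these are not the platform's actual types.
interface ScopeState {
  maxConcurrentRuns: number; // mirrors budget.max_concurrent_runs
  cooldownSeconds: number; // hypothetical cooldown-after-failure budget
  lastFailureAt: Date | null;
}

type Decision = "trigger" | "skip_concurrency" | "skip_cooldown";

/** Decides whether a claimed scope may start a run right now. */
function decideTrigger(scope: ScopeState, runningRuns: number, now: Date): Decision {
  // Concurrency cap: count of runs currently in "running" for this scope.
  if (runningRuns >= scope.maxConcurrentRuns) return "skip_concurrency";
  // Cooldown: hold off while the last failure is still recent.
  if (scope.lastFailureAt !== null) {
    const elapsedMs = now.getTime() - scope.lastFailureAt.getTime();
    if (elapsedMs < scope.cooldownSeconds * 1000) return "skip_cooldown";
  }
  return "trigger";
}
```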

1.3 execute_scan handler — connector invocation

  • Handler at src/workers/handlers/execute-scan.ts:234-377. Reads scan_runs (pre-created by the scheduler), drives the ConnectorDriver, validates output, submits the NormalizedGraph via IngestService, calls completeScanRun. Detailed contract documented at execute-scan.ts:1-29.
  • ConnectorDriver interface at src/workers/connector-driver.ts:104-106. Production implementation InProcessSubprocessDriver (connector-driver.ts:201-449):
    • Spawns the binary (default sv0-aws on PATH, line 157) with scan --scope-id X --run-id Y --category-results-out CR --graph-json G --scope-json S.
    • Inherits ONLY PATH/HOME/TMPDIR/LANG/LC_ALL from the platform env (FORWARDED_ENV_KEYS at line 171). Platform secrets are intentionally NOT bled through to the connector (security comment at connector-driver.ts:131-145).
    • extraEnv is the credential pipe — but it comes from InProcessSubprocessDriverOptions.env passed at construction time, which at src/index.ts:98 is undefined. So in production today, the connector subprocess gets zero AWS / Entra / SN credentials and would fail every scan.
    • This is the single biggest gap and the reason "scans don't actually run from the platform yet." All the orchestration code paths exist; the credentials don't reach the connector.
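The environment handling described above can be sketched as follows. The function name buildConnectorEnv is hypothetical; the forwarded key list is the one quoted from FORWARDED_ENV_KEYS, and extraEnv stands for the constructor option that is currently undefined. The sketch shows why the connector sees no credentials today: only the allowlisted keys survive, and the only other input is empty.

```typescript
// Key list as quoted in the audit; the function itself is an illustrative sketch.
const FORWARDED_ENV_KEYS = ["PATH", "HOME", "TMPDIR", "LANG", "LC_ALL"] as const;

function buildConnectorEnv(
  platformEnv: Record<string, string | undefined>,
  extraEnv: Record<string, string> | undefined,
): Record<string, string> {
  const childEnv: Record<string, string> = {};
  // Forward only the allowlisted keys so platform secrets never reach the connector.
  for (const key of FORWARDED_ENV_KEYS) {
    const value = platformEnv[key];
    if (value !== undefined) childEnv[key] = value;
  }
  // extraEnv is the only channel for connector credentials.
  return { ...childEnv, ...(extraEnv ?? {}) };
}
```

With extraEnv undefined the result contains at most the five forwarded keys, so a connector that needs AWS or Entra credentials would have nothing to authenticate with.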

1.4 sync_ingestion — 13-step pipeline

src/workers/handlers/sync-ingestion.ts:125-617. Numbered steps in the source:

  1. Create ConnectorSyncDoc with status: "running" (line 141-163).
  2. Transform graph → entities + evidence (line 164-173).
  3. Diff against existing state → events (line 175-203).
  4. Upsert entities atomically with M2/M3 cross-connector merge (line 205-269).
  5. Insert events (line 278).
  6. Entity versioning (changed + deleted + ownership-released paths) (line 281-370).
  7. Upsert execution evidence (line 372-377).
  8. Materialize execution paths (line 379-439).
  9. Assemble execution chains (line 441-446 — calls assembleExecutionChains).
  10. Materialize authority paths (line 448-493).
  11. Compute resource_key coverage metrics (line 495-506).
  12. Run post-ingest validators (line 508-531).
  13. Update ConnectorSyncDoc with metrics + status: "completed" (line 533-576).
  • Then: schedule a debounced stitch_ingestion job (line 601-603) which will later enqueue a post-stitch evaluate_findings.

Contract with connectors: the platform consumes NormalizedGraph only; the connector decides shape. Submitted via either HTTP (/api/v1/ingest/normalized-graph) or in-process from execute_scan (IngestService.submit).

1.5 Evaluator + downstream stages

  • EvaluatorService at src/evaluator/index.ts:36. Triggered by evaluate_findings worker handler (src/workers/handlers/evaluate-findings.ts:16-103). Reads entities + paths, writes findings. Enqueues build_evidence_pack per created/updated finding (line 48-53).
  • Trigger paths:
    • IngestService.acceptGraph → unconditionally enqueues an evaluate_findings with trigger="sync" (ingest-service.ts:69). Handler gates on connector_syncs.status === "completed" (evaluate-findings.ts:28-38).
    • stitch_ingestion handler → enqueues with trigger="post_stitch" (stitch-ingestion.ts:401-417). This second pass exists because stitched-path rules need stitched data that wasn't visible in the first eval.
  • Chaining today is purely fire-and-forget through the in-memory queue: enqueue is synchronous, but if the worker process crashes between sync_ingestion completing and evaluate_findings starting, the evaluate is lost forever (no persistence — runtime.ts:32).

1.6 Scan-run tracking

  • scan_runs collection — schema at src/domain/scan-runs/types.ts:104-141. Written by:
    • triggerScanRun (scheduler / manual API path) — inserts running.
    • execute_scan handler — updateScanRunCategoryResult per category, then completeScanRun with terminal status.
  • connector_syncs collection — schema at src/domain/syncs/types.ts. Written by sync_ingestion step 1 + step 13.
  • Linkage: scan_runs.sync_id points to the resulting connector_syncs._id (stamped by execute_scan after IngestService.submit returns, execute-scan.ts:311-314). When a connector pushes via the HTTP /api/v1/ingest/... route (no scheduler involvement), no scan_run row exists, only a connector_sync. This is the seed-script path today.
  • API surface (src/api/routes/scan-runs.ts:7-9):
    • POST /api/v1/scan-runs — manual trigger.
    • GET /api/v1/scan-runs/:id — fetch one.
    • GET /api/v1/scan-runs?scope_id=… — list (paginated).
  • UI surface: none. grep -r "ScanRunsPage\|scan-runs" ui/src/ → 0 hits. The operator-facing run history exists only in the database and via direct curl.

1.7 Credential storage today

The data model has credentials_ref: { provider: "op" | "env" | "aws_secretsmanager" | "azure_keyvault", ref: string } (src/domain/connector-instances/types.ts:28-40) — documented as "Phase 1 supports op and env; vault back-ends reserved for future" (lines 30-32).

Reality (src/api/routes/admin/connector-instances.ts:18-21):

credentials_ref?: { provider, ref } — defaults to {env, ""}; the field is schema-required but **inert today** (no credential broker resolves op:// refs at runtime).

There is no broker. Confirmed by:

  • No file matches credential.*broker, resolveCredentials, or CredentialBroker anywhere in src/.
  • InProcessSubprocessDriver takes extraEnv from constructor options (connector-driver.ts:147); the constructor is called with no env at src/index.ts:98.

Operationally, this is consistent with the laptop-only workflow today: a developer runs connectors with a local .env file (the connector loads it via python-dotenv at entra_servicenow/config.py:9; for AWS at aws/src/sv0_aws/cli/main.py:1581). For production tenants this is a hard blocker — there is no path to get credentials into a worker process today.

1.8 Connector → platform contract

Both audited connectors (sv0-aws and entra-servicenow) use the same shape:

  • Auth to source systems: environment variables loaded from a sibling .env file scoped to the connector dir (entra_servicenow/config.py:9-19; AWS uses boto3 default chain + AWS_PROFILE env or --accounts flag, aws/cli/main.py:2280-2305).
  • Auth to platform: PLATFORM_API_KEY env (aws/cli/main.py:2229, line 2353) — a per-tenant connector API key, scope-restricted to /api/v1/ingest/*. Minted via the admin route at src/api/routes/admin/connector-api-keys.ts (delegated-agent rejected; interactive only) OR by direct Mongo insert for agent sessions.
  • Submit shape: PlatformClient.submit_normalized_graph(graph) → POST {PLATFORM_URL}/api/v1/ingest/normalized-graph with X-Tenant-Id + X-API-Key headers.
  • Driver invocation shape: when invoked from the platform's InProcessSubprocessDriver, the connector receives scan --scope-id X --run-id Y --category-results-out file --graph-json file --scope-json file. It writes per-category outcomes + the graph atomically to those files and exits. The platform reads them off disk.
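The driver invocation shape can be written down as an argument-vector builder. The flag names are the ones quoted above; the ScanInvocation type and the function name are hypothetical, and the sketch assumes the driver passes an argument array to a spawn-style API rather than building a shell string.

```typescript
// Hypothetical container for one invocation; flag names are taken from the audit text.
interface ScanInvocation {
  scopeId: string;
  runId: string;
  categoryResultsPath: string;
  graphJsonPath: string;
  scopeJsonPath: string;
}

/** Builds the argument vector for `<connector-binary> scan ...`. */
function buildScanArgv(inv: ScanInvocation): string[] {
  return [
    "scan",
    "--scope-id", inv.scopeId,
    "--run-id", inv.runId,
    "--category-results-out", inv.categoryResultsPath,
    "--graph-json", inv.graphJsonPath,
    "--scope-json", inv.scopeJsonPath,
  ];
}
```

Keeping each value as its own array element means a scope or run id containing spaces or shell metacharacters is passed through verbatim instead of being reinterpreted.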

1.9 What the seed-demo scripts actually do

scripts/seed-demo-w1.ts (representative): synthesises a NormalizedGraph in-process and POSTs it to /api/v1/ingest/normalized-graph (seed-demo-w1.ts:1676), then polls /api/v1/syncs/:id until completed (:1695). This path bypasses execute_scan entirely — no scan_run is created. Findings + chains + authority paths are produced because IngestService.acceptGraph still enqueues sync_ingestion + evaluate_findings.

The seed-demo path is the production-shaped ingest path minus the connector subprocess. It is the proof that everything from sync_ingestion onward works unattended.


2. Gap analysis

| # | Gap | Concrete consequence today |
| --- | --- | --- |
| 1 | No credential broker: credentials_ref is inert | Cannot run a real (non-seed) scan against any tenant from the platform process. AWS/Entra creds only live on a developer laptop. Blocks every other gap. |
| 2 | Cron cadence unwired | claimDueScopes advances next_run_at = null for cron scopes (control-plane-adapter.ts:167); they never auto-fire. Interval-only today. |
| 3 | No operator UX: no ScanRunsPage / ScopesPage / InstancesPage in UI | Operators cannot see "did the 2 AM scan succeed?" without mongosh. Failures are buried in worker logs. |
| 4 | No deploy-gate triggers (ADR-026 path b, plus the same shape for stitched_paths) | Code-side changes to chain-builder.ts / stitched-path-materializer.ts don't propagate until the next per-tenant sync. Cost already paid twice on chains. |
| 5 | Volatile job queue: in-memory FIFO, no persistence | Worker process crash between sync completion and evaluate dispatch = silent lost work. Today's blast radius is small (one node, restarts are rare), but is a hard floor for any prod claim. |
| 6 | No retry / backoff at the pipeline level | A transient Mongo blip during evaluate_findings causes the whole evaluate to drop. The execute_scan handler has a cooldown-after-failure budget at the scope level (scheduler.ts:531-545), but ingest/evaluate stages have no equivalent. |
| 7 | No multi-stage pipeline-run view | "Why did Tuesday's brief for tenant T show old data?" requires joining scan_runs.sync_id → connector_syncs._id → logs across two services. The causal thread is reconstructable, not surfaced. |
| 8 | No tenant ↔ instance ↔ scope provisioning for real customer tenants | Even with a broker, there are zero ConnectorInstance / ScanScope rows for any non-seed tenant. Real demo tenants are currently fed by the developer-runs-from-laptop path. |
| 9 | No idempotency keys at the stage boundary | Re-driving an evaluate_findings manually for the same syncId is safe (the gate on sync.status === "completed" covers it), but re-driving rematerialize (per ADR-026 path c) has no documented re-entry contract because the job kind doesn't exist yet. |
| 10 | No failure escalation | A run stuck in running for >max_runtime_seconds is the documented "reaper" responsibility (scan-scopes/types.ts:46-53), but the reaper code doesn't exist yet (verified by grep: no file matches reaper outside comments). |
| 11 | Stitch ingestion is connector-completion-driven, not pipeline-driven | StitchDebouncer.schedule(tenantId) fires only at the end of sync_ingestion. A run that succeeded via execute_scan but failed mid-sync_ingestion (step 4 entity upsert failure, say) leaves the tenant in a half-stitched state until the next sync. |

Of these: #1 (credential broker) is the only true new-architecture gap. #2, #3, #4, #5, #10 are wiring/UX gaps on existing primitives. #6, #7, #8, #9, #11 are policy decisions that don't need new collections.


3. Architecture proposal

3.1 Pipeline primitive: keep scan_runs as the root, do NOT introduce pipeline_runs

scan_runs is already the canonical "did the work happen?" entity. It has status, started_at/ended_at, category_results, trigger, and most importantly a tenant-scoped _id (scan-runs/types.ts:104-141). Introducing a new pipeline_runs collection on top of it would duplicate every field, add a join, and conflict with the operator's mental model (which is "the scan ran, here's what it did").

Decision: extend scan_runs.category_results with downstream stage results. The category_results map is already a Record<string, CategoryResult> (line 88-95). Every cell — connector category OR platform stage — satisfies the existing CategoryResult = { status, items_scanned, started_at, ended_at, errors } shape. No schema change. Stage-specific semantics are encoded by what items_scanned counts:

| Cell key | items_scanned semantics |
| --- | --- |
| iam / lambda / … | connector items scanned (existing) |
| __sync | entities upserted by sync_ingestion |
| __eval | findings created + updated + resolved |
| __chain | execution chains created + updated |
| __stitch | stitched paths materialized |
| __evidence | evidence packs built |

Stage-specific detail beyond the cell (sync_id, finding ids, chain ids) lives in linked records, not in the cell — scan_runs.sync_id already pins the connector_sync; findings and chains are tenant-scoped queries from there. Cells carry no extra fields; the type-checker enforces this against CategoryResult.

The __ prefix matches the convention of the existing "__driver__" sentinel (execute-scan.ts:209), so it already fits the validation allowlist concept. Tighten the regex at execute-scan.ts:80 to accept ^(?:[a-z][a-z0-9_]{0,63}|__[a-z][a-z0-9_]{0,30})$.
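A quick way to sanity-check the proposed pattern is to run it against the keys this section cares about. The pattern below is copied from the text; the constant and function names are hypothetical.

```typescript
// Pattern as proposed in the text; names are illustrative.
const CATEGORY_KEY_PATTERN = /^(?:[a-z][a-z0-9_]{0,63}|__[a-z][a-z0-9_]{0,30})$/;

/** True for connector categories (iam, lambda, ...) and for __-prefixed platform stage cells. */
function isValidCategoryKey(key: string): boolean {
  return CATEGORY_KEY_PATTERN.test(key);
}

const accepted = ["iam", "lambda", "__sync", "__eval", "__driver__"].filter(isValidCategoryKey);
const rejected = ["__", "_sync", "Iam", "__Sync", "iam-role", ""].filter((k) => !isValidCategoryKey(k));
```

Every key in the first list passes and every key in the second is rejected, so the existing "__driver__" sentinel and the new stage cells share one rule while a single leading underscore or an uppercase letter remains invalid.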

Tradeoff: keeps one row per pipeline run, no joins for the UI's "last scan summary" view. Downside: rows for ingest-only (seed-script) submissions don't get a scan_run parent today. Fix that by synthesising a scan_run row inside IngestService.acceptGraph for non-execute_scan submissions, with trigger.type = "manual_api" (already a valid variant, scan-runs/types.ts:28-34). That makes the data model uniform.

3.2 Trigger model: hybrid — scheduled + deploy-gate + manual API + sync-cascade

Four trigger sources, all leading to the same scan_runs-rooted execution:

  1. Scheduled — the existing scheduler tick (Scheduler.tickOnce). Covers normal periodic scans.
  2. Manual API: POST /api/v1/scan-runs (scan-runs.ts:60). Already there. Operator UI button binds to this.
  3. Deploy-gate — new worker job kind rematerialize. Triggered by a small CI step after main deploys, when the deploy diff touches chain-builder.ts, stitched-path-materializer.ts, or related schema files. The CI step calls POST /api/v1/admin/rematerialize with the list of stages and a list of tenants (default: all status="active" tenants). This is ADR-026 path (b) generalised.
  4. Sync-cascade: already exists. sync_ingestion → evaluate_findings; evaluate_findings → build_evidence_pack. Keep as-is.
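For the deploy-gate trigger, the CI-side decision can be sketched as a mapping from changed files to stages. Everything here is an assumption for illustration: the request body shape (stages, tenant_ids), the "all_active" sentinel, and the function names are hypothetical, and only the two tracked file names and the two stage names come from this document.

```typescript
type Stage = "chains" | "stitched_paths";

// Hypothetical mapping from tracked source files to the stage they invalidate.
const STAGE_BY_FILE: ReadonlyArray<[string, Stage]> = [
  ["chain-builder.ts", "chains"],
  ["stitched-path-materializer.ts", "stitched_paths"],
];

/** Returns the stages a deploy diff invalidates, de-duplicated and sorted. */
function stagesForDiff(changedFiles: string[]): Stage[] {
  const stages = new Set<Stage>();
  for (const file of changedFiles) {
    for (const [suffix, stage] of STAGE_BY_FILE) {
      if (file.endsWith(suffix)) stages.add(stage);
    }
  }
  return Array.from(stages).sort();
}

// Hypothetical body for POST /api/v1/admin/rematerialize.
interface RematerializeRequest {
  stages: Stage[];
  tenant_ids: string[] | "all_active";
}

/** Returns null when the diff touches no tracked file, so CI can skip the call. */
function buildRematerializeRequest(changedFiles: string[]): RematerializeRequest | null {
  const stages = stagesForDiff(changedFiles);
  return stages.length === 0 ? null : { stages, tenant_ids: "all_active" };
}
```

Returning null for an unrelated diff keeps ordinary deploys from fanning out a rematerialize across every tenant.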

Webhook triggers from connectors (event-driven push when source changes) are rejected for now: the connectors are full-state extractors per architecture doc 2026-04-22 §"Scheduling", and "delta-event consumers" was an explicitly considered-and-rejected alternative in the scheduler design (scheduler.ts:485-489). Revisit when a connector actually streams change events.

3.3 Credential model: two-tier interface, vendor-agnostic

The interface, not the vendor, is the architecture commitment.

// src/credentials/broker.ts (NEW)
import type { CredentialsRef } from "../domain/connector-instances/types";

export interface CredentialBroker {
  /** Resolve a CredentialsRef to a flat env-var map for connector consumption. */
  resolve(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>>;
}

Implementations (pluggable, picked by env.credentialProvider):

  • EnvCredentialBroker (provider: "env"): resolves to process.env[ref.ref + "_*"] (the broker prefix-matches; e.g. ref: "AWS_DEV" → {AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, ...} pulled from AWS_DEV_* platform-env vars). Useful for the single-tenant dev VM where creds still live in the deploy's .env.
  • OnePasswordCredentialBroker (provider: "op"): rejected for runtime use. 1Password remains bootstrap-only. The schema admits this provider, but the broker hard-errors if asked to resolve it at runtime until a service-token-backed integration exists.
  • AzureKeyVaultCredentialBroker (provider: "azure_keyvault"): Tier-3 (managed Azure VMs per ADR-022) gets per-VM managed-identity access to a Key Vault that holds per-tenant secret bundles. Ref shape: "vault://kv-sv0-{env}/tenants/{tenant_id}/connector/{kind}". The broker uses @azure/identity + @azure/keyvault-secrets, mints a 5-minute env bundle, and hands it to the driver. Recommended path for staging+prod.
  • AwsSecretsManagerCredentialBroker (provider: "aws_secretsmanager"): symmetric for AWS-hosted environments. Reserved; not Tier-1 today.

Security boundary:

  • The broker is the ONLY path that can read tenant credentials at runtime. Worker handlers never touch raw secret values.
  • Tenant namespace is derived from tenantId, not from the ref string. Each broker constructs the lookup path itself: env broker uses SV0_TENANT_${tenantSlug}_${refSuffix}_*; Key Vault broker uses tenants/${tenantId}/connector/${kind}/*. The user-controlled ref.ref only contributes the trailing connector-specific suffix (AWS_PRIMARY, ENTRA_DEFAULT, …) — never the tenant segment. A misconfigured ref pointing at another tenant's path is structurally impossible because the tenant prefix is computed. The broker also rejects any ref.ref containing path separators or the literal substring tenant.
  • The driver receives a one-shot env bundle that the broker mints fresh per execute_scan run. The bundle is in the subprocess env only — never logged, never written to Mongo, never propagated out of the runtime.
  • Per ADR-018 (docker-group decision), the platform worker process already has docker permissions; the broker model doesn't change that.
  • Per ADR-023 (four-tier auth), the broker uses tier-appropriate auth: managed identity on Azure VMs (Tier-3 dev/prod), service-token on managed PaaS (Tier-2 if ever introduced), env-vars on local dev (Tier-1).
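The tenant-namespace rule can be illustrated with a sketch of the env-backed lookup. It follows the SV0_TENANT_${tenantSlug}_${refSuffix}_* scheme from the security boundary above and makes several assumptions that the source does not specify: the slug is upper-cased with non-alphanumerics mapped to underscores, the "tenant" check is case-insensitive, the returned keys are the variable names with the prefix stripped, and the function name resolveEnvBundle is hypothetical. A real broker would wrap this behind the async resolve() interface.

```typescript
interface CredentialsRef {
  provider: "op" | "env" | "aws_secretsmanager" | "azure_keyvault";
  ref: string;
}

/** Resolves an env-provider ref to a flat bundle; the tenant segment is always computed. */
function resolveEnvBundle(
  env: Record<string, string | undefined>,
  tenantSlug: string,
  ref: CredentialsRef,
): Record<string, string> {
  if (ref.provider !== "env") {
    throw new Error(`provider "${ref.provider}" is not resolvable by the env broker`);
  }
  // The user-controlled suffix may never smuggle in a path or a tenant segment.
  if (/[\/\\]/.test(ref.ref) || /tenant/i.test(ref.ref)) {
    throw new Error("ref contains a path separator or the substring 'tenant'");
  }
  // Assumption: slugs are normalised to env-var-safe upper case.
  const slug = tenantSlug.toUpperCase().replace(/[^A-Z0-9]/g, "_");
  const prefix = `SV0_TENANT_${slug}_${ref.ref}_`;
  const bundle: Record<string, string> = {};
  for (const name of Object.keys(env)) {
    const value = env[name];
    if (value !== undefined && name.startsWith(prefix)) {
      bundle[name.slice(prefix.length)] = value; // strip the tenant-scoped prefix
    }
  }
  return bundle;
}
```

In this sketch a ref can only ever select variables under the calling tenant's computed prefix. A production version would also need slug and suffix formats that cannot overlap across the underscore boundary, which this sketch does not enforce.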

Wiring change: src/index.ts:98 constructs InProcessSubprocessDriver with a closure that calls broker.resolve(tenantId, instance.credentials_ref). The driver invokes the broker per-run, not at boot, so credentials can rotate without restart.
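The per-run resolution in this wiring change can be sketched as a driver that holds a resolver closure instead of a static env map. SketchDriver and EnvResolver are hypothetical names, and the run method returns the env it would hand to the subprocess rather than spawning anything; the point is only that the broker is consulted on every run.

```typescript
interface CredentialsRef {
  provider: "op" | "env" | "aws_secretsmanager" | "azure_keyvault";
  ref: string;
}

interface CredentialBroker {
  resolve(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>>;
}

type EnvResolver = (tenantId: string, ref: CredentialsRef) => Promise<Record<string, string>>;

// Hypothetical stand-in for InProcessSubprocessDriver, reduced to the credential path.
class SketchDriver {
  constructor(private readonly resolveEnv: EnvResolver) {}

  /** Resolves credentials at run time; a real driver would spawn the connector with them. */
  async run(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>> {
    const extraEnv = await this.resolveEnv(tenantId, ref);
    return extraEnv;
  }
}

/** Boot-time wiring: the closure captures the broker, not a resolved secret. */
function createDriver(broker: CredentialBroker): SketchDriver {
  return new SketchDriver((tenantId, ref) => broker.resolve(tenantId, ref));
}
```

Since nothing is resolved at construction, a secret rotated between two runs is picked up by the second run without restarting the process.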

3.4 Stage chaining: keep current enqueue cascade, add explicit pipeline state on scan_runs

Current chain (preserved):

execute_scan → sync_ingestion → evaluate_findings → build_evidence_pack
                             ↘ stitch_ingestion (debounced) → evaluate_findings (post_stitch)

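Add: each handler stamps its stage outcome onto scan_runs.category_results under its own __<stage> key (__sync, __eval, and so on). This is the only data-model change, and it adds keys to an existing map rather than altering the CategoryResult shape. Handler responsibilities: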

  • execute_scan (existing) — writes per-category + __driver__ (already does this).
  • sync_ingestion (existing): at end of step 13, if job.payload.scanRunId is present, call storage.updateScanRunCategoryResult(tenantId, scanRunId, "__sync", { status: deriveStatus(), ...metrics }). New code: ~10 lines. The scanRunId propagates from execute_scan's IngestService.submit → acceptGraph → enqueue payload. Today the payload only has {graph, authMethod} at ingest-service.ts:62-66; add scanRunId (nullable for HTTP-direct seed paths).
  • evaluate_findings — same. Stamp __eval. ~10 lines.
  • build_evidence_pack — stamp __evidence (roll-up; the handler runs per finding, so it has to upsert with $inc-style metrics).
  • stitch_ingestion — stamp __stitch. Tricky because the stitch debouncer collapses N sync completions into 1 stitch — pick the most-recent scan_run from the participating syncs.
  • rematerialize (NEW worker job kind, per ADR-026 path b/c generalised) — stamps __chain and/or __stitch depending on the requested stages list. When triggered by deploy-gate (no parent scan), creates a synthetic scan_run with trigger.type = "manual_api" (or introduce a new "deploy_gate" variant — see Open Q).
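The stamping responsibilities above can be sketched against an in-memory store. The store class, stampStage, and rollUpStage are hypothetical; the CategoryResult fields come from the shape quoted in §3.1, but their concrete types and the two status values are assumptions. The roll-up uses a read-modify-write for brevity, where a Mongo-backed version would presumably use an atomic $inc-style update.

```typescript
// Field names from the CategoryResult description; the field types are assumptions.
interface CategoryResult {
  status: "completed" | "failed";
  items_scanned: number;
  started_at: string; // ISO-8601 UTC, so string comparison orders correctly
  ended_at: string;
  errors: string[];
}

type StageKey = "__sync" | "__eval" | "__chain" | "__stitch" | "__evidence";

// Hypothetical in-memory stand-in for the storage adapter.
class InMemoryScanRunStore {
  private readonly runs = new Map<string, Record<string, CategoryResult>>();

  updateScanRunCategoryResult(tenantId: string, scanRunId: string, key: string, result: CategoryResult): void {
    const id = `${tenantId}/${scanRunId}`;
    const cells = this.runs.get(id) ?? {};
    cells[key] = result;
    this.runs.set(id, cells);
  }

  getCell(tenantId: string, scanRunId: string, key: string): CategoryResult | undefined {
    return this.runs.get(`${tenantId}/${scanRunId}`)?.[key];
  }
}

/** Stamps one stage cell; returns false when there is no parent scan run (HTTP-direct path). */
function stampStage(
  store: InMemoryScanRunStore, tenantId: string, scanRunId: string | null,
  key: StageKey, result: CategoryResult,
): boolean {
  if (scanRunId === null) return false;
  store.updateScanRunCategoryResult(tenantId, scanRunId, key, result);
  return true;
}

/** Roll-up for per-finding stages such as __evidence: accumulate instead of overwrite. */
function rollUpStage(
  store: InMemoryScanRunStore, tenantId: string, scanRunId: string,
  key: StageKey, delta: CategoryResult,
): void {
  const prev = store.getCell(tenantId, scanRunId, key);
  const merged: CategoryResult = prev === undefined ? delta : {
    status: prev.status === "failed" || delta.status === "failed" ? "failed" : "completed",
    items_scanned: prev.items_scanned + delta.items_scanned,
    started_at: prev.started_at < delta.started_at ? prev.started_at : delta.started_at,
    ended_at: prev.ended_at > delta.ended_at ? prev.ended_at : delta.ended_at,
    errors: [...prev.errors, ...delta.errors],
  };
  store.updateScanRunCategoryResult(tenantId, scanRunId, key, merged);
}
```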

Idempotency: every stage's "did I already do this work?" gate is keyed by either scanRunId + stage or syncId + stage. Re-driving a stage:

  • sync_ingestion: duplicate-key on connector_syncs._id is already handled (sync-ingestion.ts:153-163).
  • evaluate_findings: already gated on connector_syncs.status === "completed" (evaluate-findings.ts:28-38); re-runs upsert findings deterministically.
  • rematerialize: idempotent per ADR-026 §Consequences ("Re-running assemble_chains against an unchanged graph upserts the same composition_hash and produces no net mutation beyond last_seen_at"). The same shape applies to stitched-path re-materialization.
  • build_evidence_pack: change-detection already short-circuits unchanged findings.
  • stitch_ingestion: stitched paths are upserted by (tenant_id, path_signature); re-runs are no-ops.

Failure modes (per stage):

| Stage | Transient failure | Persistent failure |
| --- | --- | --- |
| execute_scan | Driver SIGKILL on timeout → scan_run.status = "timeout"; cooldown enforced next tick. | Bad creds / scope unreachable → failed; same cooldown gate. |
| sync_ingestion | Mongo blip → connector_sync.status = "failed"; lost (no retry). Gap #6. | Schema validation failure → failed. |
| evaluate_findings | Skipped if sync.status !== "completed" (degrades gracefully). | Rule code throws → currently kills the whole evaluation; needs per-rule isolation. |
| rematerialize | Same as sync_ingestion. | Same. |
| build_evidence_pack | Per-finding; one failure doesn't block siblings. | |

To close #6, add a stage-retry helper (not full Temporal/Celery — overkill for one node): wrap each handler in a withRetry decorator that retries up to N times on a structured TransientError thrown by storage/network code. Persistent errors propagate. This is ~30 lines.
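A minimal sketch of such a helper, under assumptions: the TransientError class and the withRetry signature are illustrative rather than existing platform code, and retries happen immediately with a fixed attempt count (no delay or adaptive backoff), in line with the deterministic-retry stance in §3.6.

```typescript
/** Marker error that storage/network code would throw for retryable conditions. */
class TransientError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TransientError";
    // Keeps instanceof working when compiled to older JS targets.
    Object.setPrototypeOf(this, TransientError.prototype);
  }
}

type JobHandler<J> = (job: J) => Promise<void>;

/**
 * Wraps a handler so that a TransientError is retried up to maxAttempts total
 * attempts; any other error, or a TransientError on the last attempt, propagates.
 */
function withRetry<J>(handler: JobHandler<J>, maxAttempts: number): JobHandler<J> {
  return async (job: J): Promise<void> => {
    for (let attempt = 1; ; attempt += 1) {
      try {
        await handler(job);
        return;
      } catch (err) {
        const retryable = err instanceof TransientError && attempt < maxAttempts;
        if (!retryable) throw err;
      }
    }
  };
}
```

A handler would be registered as withRetry(handler, 3) (the count is an arbitrary example), so the existing failure path still records the job as failed once the attempts are exhausted.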

3.5 Observability

scan_runs becomes the pipeline-run root, with category_results.__* keys exposing every downstream stage. UI implications:

  • /operations/runs page (new) — table of scan_runs across the operator's tenant scope (super-admin sees all); columns: tenant, scope, started, ended, status, per-stage statuses (sparkline of __sync / __eval / __chain / __stitch / __evidence).
  • /operations/runs/:id — single-run detail; per-category breakdown + per-stage panel; "Re-run" button posts to a new POST /api/v1/scan-runs/:id/retry route that creates a trigger.type = "retry" child run (already a typed variant, scan-runs/types.ts:32).
  • /operations/scopes — schedule editor; create/pause/resume scopes. CRUD already exists in src/api/routes/admin/scan-scopes.ts; needs a thin UI.
  • /operations/instances: ConnectorInstance CRUD already exists at src/api/routes/admin/connector-instances.ts; needs UI.

Total UI: 3 pages + ~5 hooks. Smallest deliverable that closes the operator-UX gap.

3.6 Anti-goals

  • No ML / no probabilistic stage decisions. Retries are deterministic counts, not adaptive backoff curves. Stage-gating is exact-match status, not confidence scores.
  • No connector writes to source systems. Confirmed at the driver layer (connector-driver.ts:13-15). The broker model preserves this — the connector receives source-system credentials and an instruction to extract; no write path is enabled.
  • No tenant-isolation bypass. Every handler key is already (tenantId, ...). The broker takes tenantId as an argument (broker.resolve(tenantId, ref)) so a misconfigured ref can't accidentally grant cross-tenant read.
  • No second pipeline-run collection. scan_runs IS the pipeline run.

4. Alternatives considered (and rejected)

Recorded in ADR-027 §Alternatives. Briefly:

  • Plan B — separate pipeline_runs collection. Rejected for field duplication, UI join cost, and mental-model mismatch.
  • Plan C — event-driven webhooks. Rejected pending a connector that actually exposes change events.
  • Plan D — Temporal / Celery / Bull. Rejected for the first slice as overkill against one VM; revisit when fanning out.
  • Plan E — lazy on-demand pipeline state. Rejected for read-amplification and log-mining cost.

5. Migration plan

Each slice is one PR (or one small umbrella with N PRs). All slices preserve the existing seed-script workflow until the final slice.

Slice 0 — wire the existing Scheduler to one demo tenant on staging

Size: ~50 LOC + a manual provisioning step. No new platform code, only a one-off admin script (or three admin-API curl calls):

  1. POST /api/v1/admin/connector-instances for demo-nimbus with connector_kind: "aws", credentials_ref: { provider: "env", ref: "AWS_DEMO" }.
  2. POST /api/v1/admin/scan-scopes for the instance with cadence: "interval", interval_seconds: 86400, service_categories: [...].
  3. Add AWS_DEMO_ACCESS_KEY_ID / AWS_DEMO_SECRET_ACCESS_KEY to staging's .env.

Blocked on: Slice 1 (credential broker), because without it the driver gets no creds.

Slice 1 — credential broker, env provider only

Size: ~150 LOC + tests. PR: feat(workers): credential broker for connector driver

  • src/credentials/broker.ts interface.
  • src/credentials/env-broker.ts impl (prefix-match process.env).
  • InProcessSubprocessDriver accepts a credentialBroker option; at run time calls broker.resolve(tenantId, instance.credentials_ref) and passes the result as extraEnv.
  • src/index.ts:98 constructs the broker and wires it.
  • Decision gate: unblocks Slice 0. Smallest unit of value.

Slice 2 — operator UI shell (runs / scopes / instances)

Size: ~600 LOC across 3 pages + 5 hooks. No backend changes (admin APIs already exist). PR: feat(ui): operations pages — scan runs, scopes, connector instances

  • ui/src/pages/operations/RunsListPage.tsx, RunDetailPage.tsx, ScopesPage.tsx, InstancesPage.tsx.
  • TanStack Query hooks.
  • Route auth: super-admin only initially.

Slice 3 — pipeline state on scan_runs

Size: ~200 LOC + tests. PR: feat(workers): stage outcome tracking on scan_runs.category_results

  • Plumb scanRunId through IngestService.submit → enqueue payloads.
  • Each handler stamps its __stage outcome.
  • IngestService.acceptGraph (HTTP path) synthesises a scan_run for non-execute_scan submissions so seed scripts get pipeline visibility for free.
  • UI run-detail page reads the __stage panels.

Slice 4 — deploy-gate rematerialize job

Size: ~250 LOC + CI step. PR: feat(workers): rematerialize job + deploy-gate trigger (ADR-026 path b)

  • New worker handler rematerialize.ts. Stages list (initially ["chains", "stitched_paths"]); reuses assembleExecutionChains and the stitched-path materializer.
  • New admin route POST /api/v1/admin/rematerialize.
  • CI step in deploy-prod.yml / deploy-dev.yml: when the merge diff touches chain-builder.ts (or a tracked schema constant), curl the admin route for all active tenants.
  • Extend to stitched-path-materializer.ts in the same trigger; the ADR-026 §"stitched_paths shares the same vulnerability" caveat is closed here.

Slice 5 — Azure Key Vault credential broker

Size: ~250 LOC + Bicep. PR: feat(credentials): Azure Key Vault broker for staging/prod

  • src/credentials/akv-broker.ts.
  • Bicep / Terraform delta in sv0-infrastructure to mint the Key Vault and grant managed-identity access from vm-sv0-dev-1 (Tier-3 dev) and the future prod VM.
  • Tenant provisioning step: deposit tenants/{tenant_id}/connector/{kind} secret bundle into the vault.
  • Promotes Slice 0 from env-creds-on-VM to vault-resolved-per-run.

Slice 6 — retire seed-demo-* for non-mock tenants

Size: ~50 LOC of deletion. PR: chore(scripts): seed-demo scripts now mock-only; real tenants run via scheduler

  • Document that seed-demo-* is for synthetic mock data only (already true).
  • Real demo tenants (e.g. enterprise-nimbus) get ConnectorInstance + ScanScope rows and are driven by the scheduler.

6. Open questions

  1. rematerialize trigger type. Add a new ScanRunTriggerType value "deploy_gate", or reuse "manual_api" with a special user-id sentinel? Prefer adding the variant — it survives audit-log inspection cleanly.
  2. Stitch-debounce ↔ scan_run linkage. When N syncs collapse into 1 stitch, which scan_run owns the __stitch cell? Two candidates: pick the newest participating scan, OR fan out and write __stitch to all N. Recommend fanning out for the audit trail.
  3. Cron cadence vs interval. claimDueScopes should wire cron evaluation (use cron-parser or similar). Today it silently leaves next_run_at = null for cron scopes. Track as a separate small follow-up; not part of this proposal's first slice.
  4. Reaper for orphaned running runs. The last_claimed_at field is stamped specifically for this (scan-scopes/types.ts:46-53). A periodic reaper is documented as the responsibility of "an operational reaper" (control-plane-adapter.ts:124-126) but doesn't exist. Add as Slice 7 or fold into Slice 4.
  5. Job-queue persistence. In-memory queue is fine for one VM. If/when the platform fans out to N workers, Mongo-backed jobs become required. Not part of this proposal — file as future tech-debt.
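For open question 4, the selection step of a reaper can be sketched as a pure function. The RunningRun shape and the function name are hypothetical, and the sketch deliberately stops at identifying overdue runs: which terminal status a reaped run should receive is not settled in this document.

```typescript
// Hypothetical projection of a scan_run still in "running", joined with its scope budget.
interface RunningRun {
  id: string;
  startedAt: Date;
  maxRuntimeSeconds: number; // mirrors the scope's max_runtime_seconds
}

/** Returns the ids of runs that have been running longer than their runtime budget. */
function findOrphanedRuns(running: RunningRun[], now: Date): string[] {
  return running
    .filter((run) => now.getTime() - run.startedAt.getTime() > run.maxRuntimeSeconds * 1000)
    .map((run) => run.id);
}
```

A periodic job could call this once per tick and hand the resulting ids to whatever terminal-state transition is eventually chosen.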

References

  • src/workers/scheduler.ts:215-306,313-582: triggerScanRun + Scheduler.
  • src/workers/runtime.ts:31-153: WorkerRuntime.
  • src/workers/connector-driver.ts:104-449: ConnectorDriver interface + InProcessSubprocessDriver.
  • src/workers/handlers/execute-scan.ts:234-377: execute_scan handler.
  • src/workers/handlers/sync-ingestion.ts:125-617 — 13-step ingest pipeline.
  • src/workers/handlers/evaluate-findings.ts:16-103 — evaluate handler + stitch-cascade trigger.
  • src/storage/mongo/adapters/control-plane-adapter.ts:128-181 — atomic scope claim.
  • src/api/routes/admin/connector-instances.ts:1-33 — admin CRUD; "broker is inert today" note.
  • src/api/routes/admin/connector-api-keys.ts — connector API key mint.
  • src/api/routes/scan-runs.ts:7-9 — manual scan trigger API.
  • src/domain/scan-runs/types.ts:104-141: ScanRunDoc schema.
  • src/domain/scan-scopes/types.ts:33-65: ScanScopeSchedule, ScanScopeBudget.
  • src/domain/connector-instances/types.ts:28-92: CredentialsRef + ConnectorInstanceDoc.
  • repos/sv0-connectors/integrations/aws/src/sv0_aws/cli/main.py:2229,2353 — AWS connector platform-submit shape.
  • repos/sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/config.py:1-50 — Entra connector env config.
  • ADR-018 — deploy security (docker-group decision).
  • ADR-022 — Azure compute landing zone (Tier-3 dev/prod VMs).
  • ADR-023 — authentication target architecture (four-tier).
  • ADR-024 — Azure deploy lifecycle.
  • ADR-026 — chain re-materialization triggers (path b/c framing reused here).
  • ADR-027 — the decision shape this audit drives.
  • 2026-04-22-connector-control-execution-architecture.md — original architecture doc that delivered Stream-1 Phases 1–3.

Honored North Star clauses

  • C-13 (SIEM landing supported, not a SIEM console — north-star.md:405). A scheduled scan that fails silently breaks the SIEM-cold landing flow because the brief / chain shown to the analyst is stale. Slice 0 + Slice 2 close this: scans run on cadence, failures are visible to operators within minutes.
  • C-15 (path differentiability — north-star.md:377). Stale stitched-paths from a stitched-path-materializer.ts deploy without re-materialization break per-path content. Slice 4 closes this for both chains and stitched paths.