Automated connector pipeline — current-state audit + gap analysis

Issue: sv0-platform#1185 (implementation umbrella) · sv0-documentation#302 (this audit + ADR-027) Drives: ADR-027

Next Action

Decide on ADR-027: adopt the seven-slice migration as proposed, or amend the decision shape (broker interface, scan_runs as pipeline-run root, deploy-gate rematerialize job) before any implementation lands.

Adopt → ADR-027 transitions to Accepted; this audit transitions to adopted; cut PRs against sv0-platform#1185 starting with Slice 1 (credential broker, ~150 LOC).
Amend → comment on ADR-027 with the delta; this audit stays research-complete until amendment is merged.
Defer / reject → ADR-027 closes; this audit transitions to deferred or rejected accordingly.

TL;DR

The framing "today, zero automated connector infrastructure exists" — common in conversation — is partially correct but undersells what already shipped. Stream-1 Phases 1–3 (per 2026-04-22-connector-control-execution-architecture.md) delivered a working in-process scheduler, an execute_scan worker, a connector-driver seam, and a chain of jobs that auto-cascades sync → evaluate → evidence-pack. The data model for ConnectorInstance / ScanScope / ScanRun is there. The cron-style scheduler ticks every 30 s and atomically claims due scopes (src/workers/scheduler.ts:362-369).

What is genuinely missing:

Credential broker. ConnectorInstance.credentials_ref exists as a typed field (src/domain/connector-instances/types.ts:79) but no code resolves it at runtime. The InProcessSubprocessDriver is constructed with env: undefined at src/index.ts:98, so it can't actually invoke a real connector — only the test fake.
Operator UX surface. scan_runs is written by the scheduler/handler but no UI route renders it. There is no ScanRunsPage.tsx (verified — grep -r "ScanRunsPage" ui/src/ → 0 results).
Pipeline-run root entity. A "scan #N → ingest #M → evaluate #O" causal chain is reconstructable today only by joining scan_runs.sync_id → connector_syncs._id → log lines. No single doc holds the multi-stage view.
Deploy-gate triggers (chain-builder version bumps per ADR-026; same shape for stitched_paths).
Retry / backoff / circuit-breaker policy at the pipeline level (per-stage retry is non-existent; the worker just records failed and drops).
Idempotency keys for cross-stage replay.
Real tenant ↔ scope binding for production. All the plumbing exists; nobody has created ConnectorInstance + ScanScope rows for real customer tenants and the credential broker (item 1) gates this anyway.

Recommendation: finish the in-place architecture rather than build a new one. Add a credential broker, a small UI shell for scan_runs / scan_scopes, a deploy-gate post-deploy worker, and idempotent retry helpers. Don't introduce a new pipelines / pipeline_runs collection — scan_runs is already the pipeline-run root. The decision shape is captured in ADR-027.

1. Current-state audit

1.1 Job runtime

WorkerRuntime at src/workers/runtime.ts:31-153. In-process FIFO queue, single consumer (processing flag at line 35), no persistence (jobs lost on restart). Registered handlers route on WorkerJobType.
WorkerJobType enum at src/workers/runtime.ts:5-11. Six values: sync_ingestion, evaluate_findings, build_evidence_pack, generate_report, execute_scan, stitch_ingestion. There is no jobs Mongo collection — jobs live only in memory.
Handler registration happens in src/index.ts:86-105. The standalone src/workers/index.ts is documented as a STUB (file-level comment lines 1-19); no deploy uses it.
Dispatch model: enqueue() pushes into the in-memory array (runtime.ts:64); processQueue() loops while items exist. Failure path increments failedJobs and logs but does not retry (line 137-147).

Job creation paths (every workerRuntime.enqueue call site):

src/ingestion/transport/ingest-service.ts:60 — enqueues sync_ingestion after acceptGraph.
src/ingestion/transport/ingest-service.ts:69 — enqueues evaluate_findings after sync_ingestion is queued.
src/api/routes/reports.ts:87 — enqueues generate_report from the reports API.
src/workers/scheduler.ts:290 — enqueues execute_scan from the scheduler tick.
src/workers/handlers/evaluate-findings.ts:48 — enqueues build_evidence_pack per created/updated finding.
src/workers/handlers/stitch-ingestion.ts:409 — enqueues a post-stitch evaluate_findings.
src/services/stitching/stitch-debounce.ts:166 — enqueues stitch_ingestion (debounced).

1.2 Scheduler (Stream-1 Phase 3, already shipped)

Scheduler class at src/workers/scheduler.ts:313-582. Per-process, 30 s tick (DEFAULT_TICK_INTERVAL_MS = 30_000 at line 89), 50-scope batch claim (DEFAULT_CLAIM_BATCH_SIZE = 50 at line 90).
Tick body (runTickBody, scheduler.ts:434-472): atomically claims due scopes via storage.claimDueScopes, then for each: enforces concurrency cap (runtime.ts doesn't track this — the scope's budget.max_concurrent_runs is enforced by counting running runs in scan_runs at scheduler.ts:241-244), enforces cooldown after failure (scheduler.ts:523-545), then calls triggerScanRun().
triggerScanRun() (scheduler.ts:215-306) inserts a scan_runs doc with status: "running" (line 285) and enqueues execute_scan (line 290) atomically (same primitive used by the manual API at src/api/routes/scan-runs.ts).
Atomic claim at src/storage/mongo/adapters/control-plane-adapter.ts:128-181. Single-doc findOneAndUpdate advances next_run_at and stamps last_claimed_at. Cadence support: interval only. cron is documented but the $cond returns null for cron (line 167) — cron scopes never auto-fire today.
Boot wire-up at src/index.ts:116-127. Mounted when storageAdapter and env.schedulerEnabled are true. Default-on; SV0_SCHEDULER_ENABLED=false opts out.

1.3 `execute_scan` handler — connector invocation

Handler at src/workers/handlers/execute-scan.ts:234-377. Reads scan_runs (pre-created by the scheduler), drives the ConnectorDriver, validates output, submits the NormalizedGraph via IngestService, calls completeScanRun. Detailed contract documented at execute-scan.ts:1-29.
ConnectorDriver interface at src/workers/connector-driver.ts:104-106. Production implementation InProcessSubprocessDriver (connector-driver.ts:201-449):
- Spawns the binary (default sv0-aws on PATH, line 157) with scan --scope-id X --run-id Y --category-results-out CR --graph-json G --scope-json S.
- Inherits ONLY PATH/HOME/TMPDIR/LANG/LC_ALL from the platform env (FORWARDED_ENV_KEYS at line 171). Platform secrets are intentionally NOT bled through to the connector (security comment at connector-driver.ts:131-145).
- extraEnv is the credential pipe — but it comes from InProcessSubprocessDriverOptions.env passed at construction time, which at src/index.ts:98 is undefined. So in production today, the connector subprocess gets zero AWS / Entra / SN credentials and would fail every scan.
- This is the single biggest gap and the reason "scans don't actually run from the platform yet." All the orchestration code paths exist; the credentials don't reach the connector.

1.4 `sync_ingestion` — 13-step pipeline

src/workers/handlers/sync-ingestion.ts:125-617. Numbered steps in the source:

Create ConnectorSyncDoc with status: "running" (line 141-163).
Transform graph → entities + evidence (line 164-173).
Diff against existing state → events (line 175-203).
Upsert entities atomically with M2/M3 cross-connector merge (line 205-269).
Insert events (line 278).
Entity versioning (changed + deleted + ownership-released paths) (line 281-370).
Upsert execution evidence (line 372-377).
Materialize execution paths (line 379-439).
Assemble execution chains (line 441-446 — calls assembleExecutionChains).
Materialize authority paths (line 448-493).
Compute resource_key coverage metrics (line 495-506).
Run post-ingest validators (line 508-531).
Update ConnectorSyncDoc with metrics + status: "completed" (line 533-576).

Then: schedule a debounced stitch_ingestion job (line 601-603) which will later enqueue a post-stitch evaluate_findings.

Contract with connectors: the platform consumes NormalizedGraph only; the connector decides shape. Submitted via either HTTP (/api/v1/ingest/normalized-graph) or in-process from execute_scan (IngestService.submit).

1.5 Evaluator + downstream stages

EvaluatorService at src/evaluator/index.ts:36. Triggered by evaluate_findings worker handler (src/workers/handlers/evaluate-findings.ts:16-103). Reads entities + paths, writes findings. Enqueues build_evidence_pack per created/updated finding (line 48-53).
Trigger paths:
- IngestService.acceptGraph → unconditionally enqueues an evaluate_findings with trigger="sync" (ingest-service.ts:69). Handler gates on connector_syncs.status === "completed" (evaluate-findings.ts:28-38).
- stitch_ingestion handler → enqueues with trigger="post_stitch" (stitch-ingestion.ts:401-417). This second pass exists because stitched-path rules need stitched data that wasn't visible in the first eval.
Chaining today is purely fire-and-forget through the in-memory queue: enqueue is synchronous, but if the worker process crashes between sync_ingestion completing and evaluate_findings starting, the evaluate is lost forever (no persistence — runtime.ts:32).

1.6 Scan-run tracking

scan_runs collection — schema at src/domain/scan-runs/types.ts:104-141. Written by:
- triggerScanRun (scheduler / manual API path) — inserts running.
- execute_scan handler — updateScanRunCategoryResult per category, then completeScanRun with terminal status.
connector_syncs collection — schema at src/domain/syncs/types.ts. Written by sync_ingestion step 1 + step 13.
Linkage — scan_runs.sync_id points to the resulting connector_syncs._id (stamped by execute_scan after IngestService.submit returns, execute-scan.ts:311-314). When a connector pushes via the HTTP /api/v1/ingest/... route (no scheduler involvement), no scan_run row exists — only a connector_sync. This is the seed-script path today.
API surface — src/api/routes/scan-runs.ts:7-9:
- POST /api/v1/scan-runs — manual trigger.
- GET /api/v1/scan-runs/:id — fetch one.
- GET /api/v1/scan-runs?scope_id=… — list (paginated).
UI surface — none. grep -r "ScanRunsPage\|scan-runs" ui/src/ → 0 hits. The operator-facing run history exists only in the database and via direct curl.

1.7 Credential storage today

The data model has credentials_ref: { provider: "op" | "env" | "aws_secretsmanager" | "azure_keyvault", ref: string } (src/domain/connector-instances/types.ts:28-40) — documented as "Phase 1 supports op and env; vault back-ends reserved for future" (lines 30-32).

Reality (src/api/routes/admin/connector-instances.ts:18-21):

credentials_ref?: { provider, ref } — defaults to {env, ""}; the field is schema-required but **inert today** (no credential broker resolves op:// refs at runtime).

There is no broker. Confirmed by:

No file matches credential.*broker, resolveCredentials, or CredentialBroker anywhere in src/.
InProcessSubprocessDriver takes extraEnv from constructor options (connector-driver.ts:147); the constructor is called with no env at src/index.ts:98.

Operationally, this is consistent with the laptop-only workflow today: a developer runs connectors with a local .env file (the connector loads it via python-dotenv at entra_servicenow/config.py:9; for AWS at aws/src/sv0_aws/cli/main.py:1581). For production tenants this is a hard blocker — there is no path to get credentials into a worker process today.

1.8 Connector → platform contract

Both audited connectors (sv0-aws and entra-servicenow) use the same shape:

Auth to source systems: environment variables loaded from a sibling .env file scoped to the connector dir (entra_servicenow/config.py:9-19; AWS uses boto3 default chain + AWS_PROFILE env or --accounts flag, aws/cli/main.py:2280-2305).
Auth to platform: PLATFORM_API_KEY env (aws/cli/main.py:2229, line 2353) — a per-tenant connector API key, scope-restricted to /api/v1/ingest/*. Minted via the admin route at src/api/routes/admin/connector-api-keys.ts (delegated-agent rejected; interactive only) OR by direct Mongo insert for agent sessions.
Submit shape: PlatformClient.submit_normalized_graph(graph) → POST {PLATFORM_URL}/api/v1/ingest/normalized-graph with X-Tenant-Id + X-API-Key headers.
Driver invocation shape: when invoked from the platform's InProcessSubprocessDriver, the connector receives scan --scope-id X --run-id Y --category-results-out file --graph-json file --scope-json file. It writes per-category outcomes + the graph atomically to those files and exits. The platform reads them off disk.

1.9 What the seed-demo scripts actually do

scripts/seed-demo-w1.ts (representative): synthesises a NormalizedGraph in-process and POSTs it to /api/v1/ingest/normalized-graph (seed-demo-w1.ts:1676), then polls /api/v1/syncs/:id until completed (:1695). This path bypasses execute_scan entirely — no scan_run is created. Findings + chains + authority paths are produced because IngestService.acceptGraph still enqueues sync_ingestion + evaluate_findings.

The seed-demo path is the production-shaped ingest path minus the connector subprocess. It is the proof that everything from sync_ingestion onward works unattended.

2. Gap analysis

#	Gap	Concrete consequence today
1	No credential broker — `credentials_ref` is inert	Cannot run a real (non-seed) scan against any tenant from the platform process. AWS/Entra creds only live on a developer laptop. Blocks every other gap.
2	Cron cadence unwired	`claimDueScopes` advances `next_run_at = null` for cron scopes (`control-plane-adapter.ts:167`); they never auto-fire. Interval-only today.
3	No operator UX — no `ScanRunsPage` / `ScopesPage` / `InstancesPage` in UI	Operators cannot see "did the 2 AM scan succeed?" without `mongosh`. Failures are buried in worker logs.
4	No deploy-gate triggers (ADR-026 path b, plus the same shape for `stitched_paths`)	Code-side changes to `chain-builder.ts` / `stitched-path-materializer.ts` don't propagate until the next per-tenant sync. Cost already paid twice on chains.
5	Volatile job queue — in-memory FIFO, no persistence	Worker process crash between sync completion and evaluate dispatch = silent lost work. Today's blast radius is small (one node, restarts are rare), but is a hard floor for any prod claim.
6	No retry / backoff at the pipeline level	A transient Mongo blip during `evaluate_findings` causes the whole evaluate to drop. The `execute_scan` handler has a cooldown-after-failure budget at the scope level (scheduler.ts:531-545), but ingest/evaluate stages have no equivalent.
7	No multi-stage pipeline-run view	"Why did Tuesday's brief for tenant T show old data?" requires joining `scan_runs.sync_id` → `connector_syncs._id` → logs across two services. The causal thread is reconstructable, not surfaced.
8	No tenant ↔ instance ↔ scope provisioning for real customer tenants	Even with a broker, there are zero `ConnectorInstance` / `ScanScope` rows for any non-seed tenant. Real demo tenants are currently fed by the developer-runs-from-laptop path.
9	No idempotency keys at the stage boundary	Re-driving an `evaluate_findings` manually for the same `syncId` is safe (the gate on `sync.status === "completed"` covers it), but re-driving `rematerialize` (per ADR-026 path c) has no documented re-entry contract because the job kind doesn't exist yet.
10	No failure escalation	A run stuck in `running` for >max_runtime_seconds is the documented "reaper" responsibility (`scan-scopes/types.ts:46-53`) — the reaper code doesn't exist yet (verified by grep — no file matches `reaper` outside comments).
11	Stitch ingestion is connector-completion-driven, not pipeline-driven	`StitchDebouncer.schedule(tenantId)` fires only at the end of `sync_ingestion`. A run that succeeded via `execute_scan` but failed mid-`sync_ingestion` (step 4 entity upsert failure, say) leaves the tenant in a half-stitched state until the next sync.

Of these: #1 (credential broker) is the only true new-architecture gap. #2, #3, #4, #5, #10 are wiring/UX gaps on existing primitives. #6, #7, #8, #9, #11 are policy decisions that don't need new collections.

3. Architecture proposal

3.1 Pipeline primitive: keep `scan_runs` as the root, do NOT introduce `pipeline_runs`

scan_runs is already the canonical "did the work happen?" entity. It has status, started_at/ended_at, category_results, trigger, and most importantly a tenant-scoped _id (scan_runs.types.ts:104-141). Introducing a new pipeline_runs collection on top of it would duplicate every field, add a join, and conflict with the operator's mental model (which is "the scan ran, here's what it did").

Decision: extend scan_runs.category_results with downstream stage results. The category_results map is already a Record<string, CategoryResult> (line 88-95). Every cell — connector category OR platform stage — satisfies the existing CategoryResult = { status, items_scanned, started_at, ended_at, errors } shape. No schema change. Stage-specific semantics are encoded by what items_scanned counts:

Cell key	`items_scanned` semantics
`iam` / `lambda` / …	connector items scanned (existing)
`__sync`	entities upserted by `sync_ingestion`
`__eval`	findings created + updated + resolved
`__chain`	execution chains created + updated
`__stitch`	stitched paths materialized
`__evidence`	evidence packs built

Stage-specific detail beyond the cell (sync_id, finding ids, chain ids) lives in linked records, not in the cell — scan_runs.sync_id already pins the connector_sync; findings and chains are tenant-scoped queries from there. Cells carry no extra fields; the type-checker enforces this against CategoryResult.

The __ prefix collides with the existing "__driver__" sentinel (execute-scan.ts:209) so it's already in the validation allowlist concept. Tighten the regex at execute-scan.ts:80 to accept ^(?:[a-z][a-z0-9_]{0,63}|__[a-z][a-z0-9_]{0,30})$.

Tradeoff: keeps one row per pipeline run, no joins for the UI's "last scan summary" view. Downside: rows for ingest-only (seed-script) submissions don't get a scan_run parent today. Fix that by synthesising a scan_run row inside IngestService.acceptGraph for non-execute_scan submissions, with trigger.type = "manual_api" (already a valid variant, scan-runs/types.ts:28-34). That makes the data model uniform.

3.2 Trigger model: hybrid — scheduled + deploy-gate + manual API + sync-cascade

Four trigger sources, all leading to the same scan_runs-rooted execution:

Scheduled — the existing scheduler tick (Scheduler.tickOnce). Covers normal periodic scans.
Manual API — POST /api/v1/scan-runs (scan-runs.ts:60). Already there. Operator UI button binds to this.
Deploy-gate — new worker job kind rematerialize. Triggered by a small CI step after main deploys, when the deploy diff touches chain-builder.ts, stitched-path-materializer.ts, or related schema files. The CI step calls POST /api/v1/admin/rematerialize with the list of stages and a list of tenants (default: all status="active" tenants). This is ADR-026 path (b) generalised.
Sync-cascade — already exists. sync_ingestion → evaluate_findings; evaluate_findings → build_evidence_pack. Keep as-is.

Webhook triggers from connectors (event-driven push when source changes) are rejected for now: the connectors are full-state extractors per architecture doc 2026-04-22 §"Scheduling", and "delta-event consumers" was an explicitly considered-and-rejected alternative in the scheduler design (scheduler.ts:485-489). Revisit when a connector actually streams change events.

3.3 Credential model: two-tier interface, vendor-agnostic

The interface, not the vendor, is the architecture commitment.

// src/credentials/broker.ts (NEW)
export interface CredentialBroker {
  /** Resolve a CredentialsRef to a flat env-var map for connector consumption. */
  resolve(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>>;
}

Implementations (pluggable, picked by env.credentialProvider):

EnvCredentialBroker — provider: "env" resolves to process.env[ref.ref + "_*"] (the broker prefix-matches; e.g. ref: "AWS_DEV" → {AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, ...} pulled from AWS_DEV_* platform-env vars). Useful for the single-tenant dev VM where creds still live in the deploy's .env.
OnePasswordCredentialBroker — provider: "op" — rejected for runtime use. 1Password remains bootstrap-only. The schema admits this provider but the broker hard-errors if asked to resolve it at runtime, until a service-token-backed integration exists.
AzureKeyVaultCredentialBroker — provider: "azure_keyvault" — Tier-3 (managed Azure VMs per ADR-022) gets per-VM managed-identity access to a Key Vault that holds per-tenant secret bundles. Ref shape: "vault://kv-sv0-{env}/tenants/{tenant_id}/connector/{kind}". The broker uses @azure/identity + @azure/keyvault-secrets, mints a 5-minute env bundle, hands it to the driver. Recommended path for staging+prod.
AwsSecretsManagerCredentialBroker — provider: "aws_secretsmanager" — symmetric for AWS-hosted environments. Reserved; not Tier-1 today.

Security boundary:

The broker is the ONLY path that can read tenant credentials at runtime. Worker handlers never touch raw secret values.
Tenant namespace is derived from tenantId, not from the ref string. Each broker constructs the lookup path itself: env broker uses SV0_TENANT_${tenantSlug}_${refSuffix}_*; Key Vault broker uses tenants/${tenantId}/connector/${kind}/*. The user-controlled ref.ref only contributes the trailing connector-specific suffix (AWS_PRIMARY, ENTRA_DEFAULT, …) — never the tenant segment. A misconfigured ref pointing at another tenant's path is structurally impossible because the tenant prefix is computed. The broker also rejects any ref.ref containing path separators or the literal substring tenant.
The driver receives a one-shot env bundle that the broker mints fresh per execute_scan run. The bundle is in the subprocess env only — never logged, never written to Mongo, never propagated out of the runtime.
Per ADR-018 (docker-group decision), the platform worker process already has docker permissions; the broker model doesn't change that.
Per ADR-023 (four-tier auth), the broker uses tier-appropriate auth: managed identity on Azure VMs (Tier-3 dev/prod), service-token on managed PaaS (Tier-2 if ever introduced), env-vars on local dev (Tier-1).

Wiring change: src/index.ts:98 constructs InProcessSubprocessDriver with a closure that calls broker.resolve(tenantId, instance.credentials_ref). The driver invokes the broker per-run, not at boot, so credentials can rotate without restart.

3.4 Stage chaining: keep current enqueue cascade, add explicit pipeline state on `scan_runs`

Current chain (preserved):

execute_scan → sync_ingestion → evaluate_findings → build_evidence_pack
                              ↘ stitch_ingestion (debounced) → evaluate_findings (post_stitch)

Add: each handler stamps its stage outcome onto scan_runs.category_results.__stage. This is the only schema change. Handler responsibilities:

execute_scan (existing) — writes per-category + __driver__ (already does this).
sync_ingestion (existing) — at end of step 13, if job.payload.scanRunId is present, call storage.updateScanRunCategoryResult(tenantId, scanRunId, "__sync", { status: deriveStatus(), …metrics }). New code: ~10 lines. The scanRunId propagates from execute_scan's IngestService.submit → acceptGraph → enqueue payload. Today the payload only has {graph, authMethod} at ingest-service.ts:62-66; add scanRunId (nullable for HTTP-direct seed paths).
evaluate_findings — same. Stamp __eval. ~10 lines.
build_evidence_pack — stamp __evidence (roll-up; the handler runs per finding, so it has to upsert with $inc-style metrics).
stitch_ingestion — stamp __stitch. Tricky because the stitch debouncer collapses N sync completions into 1 stitch — pick the most-recent scan_run from the participating syncs.
rematerialize (NEW worker job kind, per ADR-026 path b/c generalised) — stamps __chain and/or __stitch depending on the requested stages list. When triggered by deploy-gate (no parent scan), creates a synthetic scan_run with trigger.type = "manual_api" (or introduce a new "deploy_gate" variant — see Open Q).

Idempotency: every stage's "did I already do this work?" gate is keyed by either scanRunId + stage or syncId + stage. Re-driving a stage:

sync_ingestion: duplicate-key on connector_syncs._id is already handled (sync-ingestion.ts:153-163).
evaluate_findings: already gated on connector_syncs.status === "completed" (evaluate-findings.ts:28-38); re-runs upsert findings deterministically.
rematerialize: idempotent per ADR-026 §Consequences ("Re-running assemble_chains against an unchanged graph upserts the same composition_hash and produces no net mutation beyond last_seen_at"). The same shape applies to stitched-path re-materialization.
build_evidence_pack: change-detection already short-circuits unchanged findings.
stitch_ingestion: stitched paths are upserted by (tenant_id, path_signature); re-runs are no-ops.

Failure modes (per stage):

Stage	Transient failure	Persistent failure
`execute_scan`	Driver SIGKILL on timeout → `scan_run.status = "timeout"`; cooldown enforced next tick.	Bad creds / scope unreachable → `failed`; same cooldown gate.
`sync_ingestion`	Mongo blip → `connector_sync.status = "failed"`; lost (no retry). Gap #6.	Schema validation failure → `failed`.
`evaluate_findings`	Skipped if `sync.status !== "completed"` (degrades gracefully).	Rule code throws → currently kills the whole evaluation; needs per-rule isolation.
`rematerialize`	Same as sync_ingestion.	Same.
`build_evidence_pack`	Per-finding; one failure doesn't block siblings.	—

To close #6, add a stage-retry helper (not full Temporal/Celery — overkill for one node): wrap each handler in a withRetry decorator that retries up to N times on a structured TransientError thrown by storage/network code. Persistent errors propagate. This is ~30 lines.

3.5 Observability

scan_runs becomes the pipeline-run root, with category_results.__* keys exposing every downstream stage. UI implications:

/operations/runs page (new) — table of scan_runs across the operator's tenant scope (super-admin sees all); columns: tenant, scope, started, ended, status, per-stage statuses (sparkline of __sync / __eval / __chain / __stitch / __evidence).
/operations/runs/:id — single-run detail; per-category breakdown + per-stage panel; "Re-run" button posts to a new POST /api/v1/scan-runs/:id/retry route that creates a trigger.type = "retry" child run (already a typed variant, scan-runs/types.ts:32).
/operations/scopes — schedule editor; create/pause/resume scopes. CRUD already exists in src/api/routes/admin/scan-scopes.ts; needs a thin UI.
/operations/instances — ConnectorInstance CRUD already exists at src/api/routes/admin/connector-instances.ts; needs UI.

Total UI: 3 pages + ~5 hooks. Smallest deliverable that closes the operator-UX gap.

3.6 Anti-goals

No ML / no probabilistic stage decisions. Retries are deterministic counts, not adaptive backoff curves. Stage-gating is exact-match status, not confidence scores.
No connector writes to source systems. Confirmed at the driver layer (connector-driver.ts:13-15). The broker model preserves this — the connector receives source-system credentials and an instruction to extract; no write path is enabled.
No tenant-isolation bypass. Every handler key is already (tenantId, ...). The broker takes tenantId as an argument (broker.resolve(tenantId, ref)) so a misconfigured ref can't accidentally grant cross-tenant read.
No second pipeline-run collection. scan_runs IS the pipeline run.

4. Alternatives considered (and rejected)

Recorded in ADR-027 §Alternatives. Briefly:

Plan B — separate pipeline_runs collection. Rejected for field duplication, UI join cost, and mental-model mismatch.
Plan C — event-driven webhooks. Rejected pending a connector that actually exposes change events.
Plan D — Temporal / Celery / Bull. Rejected for the first slice as overkill against one VM; revisit when fanning out.
Plan E — lazy on-demand pipeline state. Rejected for read-amplification and log-mining cost.

5. Migration plan

Each slice is one PR (or one small umbrella with N PRs). All slices preserve the existing seed-script workflow until the final slice.

Slice 0 — wire the existing Scheduler to one demo tenant on staging

Size: ~50 LOC + a manual provisioning step. No new code. A one-off admin script (or three admin-API curl calls):

POST /api/v1/admin/connector-instances for demo-nimbus with connector_kind: "aws", credentials_ref: { provider: "env", ref: "AWS_DEMO" }.
POST /api/v1/admin/scan-scopes for the instance with cadence: "interval", interval_seconds: 86400, service_categories: [...].
Add AWS_DEMO_ACCESS_KEY_ID / AWS_DEMO_SECRET_ACCESS_KEY to staging's .env. Blocked on: Slice 1 (credential broker), because without it the driver gets no creds.

Slice 1 — credential broker, env provider only

Size: ~150 LOC + tests. PR: feat(workers): credential broker for connector driver

src/credentials/broker.ts interface.
src/credentials/env-broker.ts impl (prefix-match process.env).
InProcessSubprocessDriver accepts a credentialBroker option; at run time calls broker.resolve(tenantId, instance.credentials_ref) and passes the result as extraEnv.
src/index.ts:98 constructs the broker and wires it.
Decision gate: unblocks Slice 0. Smallest unit of value.

Slice 2 — operator UI shell (runs / scopes / instances)

Size: ~600 LOC across 3 pages + 5 hooks. No backend changes (admin APIs already exist). PR: feat(ui): operations pages — scan runs, scopes, connector instances

ui/src/pages/operations/RunsListPage.tsx, RunDetailPage.tsx, ScopesPage.tsx, InstancesPage.tsx.
TanStack Query hooks.
Route auth: super-admin only initially.

Slice 3 — pipeline state on `scan_runs`

Size: ~200 LOC + tests. PR: feat(workers): stage outcome tracking on scan_runs.category_results

Plumb scanRunId through IngestService.submit → enqueue payloads.
Each handler stamps its __stage outcome.
IngestService.acceptGraph (HTTP path) synthesises a scan_run for non-execute_scan submissions so seed scripts get pipeline visibility for free.
UI run-detail page reads the __stage panels.

Slice 4 — deploy-gate `rematerialize` job

Size: ~250 LOC + CI step. PR: feat(workers): rematerialize job + deploy-gate trigger (ADR-026 path b)

New worker handler rematerialize.ts. Stages list (initially ["chains", "stitched_paths"]); reuses assembleExecutionChains and the stitched-path materializer.
New admin route POST /api/v1/admin/rematerialize.
CI step in deploy-prod.yml / deploy-dev.yml: when the merge diff touches chain-builder.ts (or a tracked schema constant), curl the admin route for all active tenants.
Extend to stitched-path-materializer.ts in the same trigger; the ADR-026 §"stitched_paths shares the same vulnerability" caveat is closed here.

Slice 5 — Azure Key Vault credential broker

Size: ~250 LOC + Bicep. PR: feat(credentials): Azure Key Vault broker for staging/prod

src/credentials/akv-broker.ts.
Bicep / Terraform delta in sv0-infrastructure to mint the Key Vault and grant managed-identity access from vm-sv0-dev-1 (Tier-3 dev) and the future prod VM.
Tenant provisioning step: deposit tenants/{tenant_id}/connector/{kind} secret bundle into the vault.
Promotes Slice 0 from env-creds-on-VM to vault-resolved-per-run.

Slice 6 — retire `seed-demo-*` for non-mock tenants

Size: ~50 LOC of deletion. PR: chore(scripts): seed-demo scripts now mock-only; real tenants run via scheduler

Document that seed-demo-* is for synthetic mock data only (already true).
Real demo tenants (e.g. enterprise-nimbus) get ConnectorInstance + ScanScope rows and are driven by the scheduler.

6. Open questions

rematerialize trigger type. Add a new ScanRunTriggerType value "deploy_gate", or reuse "manual_api" with a special user-id sentinel? Prefer adding the variant — it survives audit-log inspection cleanly.
Stitch-debounce ↔ scan_run linkage. When N syncs collapse into 1 stitch, which scan_run owns the __stitch cell? Two candidates: pick the newest participating scan, OR fan out and write __stitch to all N. Recommend fanning out for the audit trail.
Cron cadence vs interval. claimDueScopes should wire cron evaluation (use cron-parser or similar). Today it silently leaves next_run_at = null for cron scopes. Track as a separate small follow-up; not part of this proposal's first slice.
Reaper for orphaned running runs. The last_claimed_at field is stamped specifically for this (scan-scopes/types.ts:46-53). A periodic reaper is documented as the responsibility of "an operational reaper" (control-plane-adapter.ts:124-126) but doesn't exist. Add as Slice 7 or fold into Slice 4.
Job-queue persistence. In-memory queue is fine for one VM. If/when the platform fans out to N workers, Mongo-backed jobs become required. Not part of this proposal — file as future tech-debt.

References

src/workers/scheduler.ts:215-306,313-582 — triggerScanRun + Scheduler.
src/workers/runtime.ts:31-153 — WorkerRuntime.
src/workers/connector-driver.ts:104-449 — ConnectorDriver interface + InProcessSubprocessDriver.
src/workers/handlers/execute-scan.ts:234-377 — execute_scan handler.
src/workers/handlers/sync-ingestion.ts:125-617 — 13-step ingest pipeline.
src/workers/handlers/evaluate-findings.ts:16-103 — evaluate handler + stitch-cascade trigger.
src/storage/mongo/adapters/control-plane-adapter.ts:128-181 — atomic scope claim.
src/api/routes/admin/connector-instances.ts:1-33 — admin CRUD; "broker is inert today" note.
src/api/routes/admin/connector-api-keys.ts — connector API key mint.
src/api/routes/scan-runs.ts:7-9 — manual scan trigger API.
src/domain/scan-runs/types.ts:104-141 — ScanRunDoc schema.
src/domain/scan-scopes/types.ts:33-65 — ScanScopeSchedule, ScanScopeBudget.
src/domain/connector-instances/types.ts:28-92 — CredentialsRef + ConnectorInstanceDoc.
repos/sv0-connectors/integrations/aws/src/sv0_aws/cli/main.py:2229,2353 — AWS connector platform-submit shape.
repos/sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/config.py:1-50 — Entra connector env config.
ADR-018 — deploy security (docker-group decision).
ADR-022 — Azure compute landing zone (Tier-3 dev/prod VMs).
ADR-023 — authentication target architecture (four-tier).
ADR-024 — Azure deploy lifecycle.
ADR-026 — chain re-materialization triggers (path b/c framing reused here).
ADR-027 — the decision shape this audit drives.
2026-04-22-connector-control-execution-architecture.md — original architecture doc that delivered Stream-1 Phases 1–3.

Honored North Star clauses

C-13 (SIEM landing supported, not a SIEM console — north-star.md:405). A scheduled scan that fails silently breaks the SIEM-cold landing flow because the brief / chain shown to the analyst is stale. Slice 0 + Slice 2 close this: scans run on cadence, failures are visible to operators within minutes.
C-15 (path differentiability — north-star.md:377). Stale stitched-paths from a stitched-path-materializer.ts deploy without re-materialization break per-path content. Slice 4 closes this for both chains and stitched paths.

Next Action​

TL;DR​

1. Current-state audit​

1.1 Job runtime​

1.2 Scheduler (Stream-1 Phase 3, already shipped)​

1.3 execute_scan handler — connector invocation​

1.4 sync_ingestion — 13-step pipeline​

1.5 Evaluator + downstream stages​

1.6 Scan-run tracking​

1.7 Credential storage today​

1.8 Connector → platform contract​

1.9 What the seed-demo scripts actually do​

2. Gap analysis​

3. Architecture proposal​

3.1 Pipeline primitive: keep scan_runs as the root, do NOT introduce pipeline_runs​

3.2 Trigger model: hybrid — scheduled + deploy-gate + manual API + sync-cascade​

3.3 Credential model: two-tier interface, vendor-agnostic​

3.4 Stage chaining: keep current enqueue cascade, add explicit pipeline state on scan_runs​

3.5 Observability​

3.6 Anti-goals​

4. Alternatives considered (and rejected)​

5. Migration plan​

Slice 0 — wire the existing Scheduler to one demo tenant on staging​

Slice 1 — credential broker, env provider only​

Slice 2 — operator UI shell (runs / scopes / instances)​

Slice 3 — pipeline state on scan_runs​

Slice 4 — deploy-gate rematerialize job​

Slice 5 — Azure Key Vault credential broker​

Slice 6 — retire seed-demo-* for non-mock tenants​

6. Open questions​

References​

Honored North Star clauses​