ADR-027: Automated Connector Pipeline — credential broker, pipeline-run root, deploy-gate
Status
Draft — 2026-05-19. Captures the decision shape for sv0-platform#1185. Implementation lands across 7 PRs (slices, see §Migration).
Companion research: 2026-05-19-automated-connector-pipeline-audit — current-state audit with file:line references, full gap analysis, alternatives considered, migration plan in detail.
Context
Stream-1 of the 2026-04-22 connector-control-execution architecture already shipped: a 30 s scheduler tick (src/workers/scheduler.ts:362-369), atomic scope claiming (src/storage/mongo/adapters/control-plane-adapter.ts:128-181), an execute_scan worker (src/workers/handlers/execute-scan.ts:234-377), a connector-driver seam (src/workers/connector-driver.ts:104-449), and a cascade from sync_ingestion → evaluate_findings → build_evidence_pack. Per-tenant ConnectorInstance and ScanScope data models exist (src/domain/connector-instances/types.ts, src/domain/scan-scopes/types.ts).
The following gaps prevent the platform from running an unattended scan against a real tenant today:
- credentials_ref is inert. The ConnectorInstance.credentials_ref field carries { provider, ref } but no runtime code resolves it. The InProcessSubprocessDriver is constructed at src/index.ts:98 with no env, so its subprocess gets only the OS allowlist (PATH/HOME/…) and zero connector credentials. Connectors run only from a developer laptop today.
- No operator UX for scan_runs. The DB is the source of truth; there is no UI. Failures live in worker logs.
- No pipeline-run root. scan_runs and connector_syncs are joined by scan_runs.sync_id; downstream stages (evaluate, chain assembly, stitch, evidence) have no first-class outcome record on either.
- No deploy-gate rematerialization. ADR-026 documented path (b) for chains but the job kind does not exist yet. stitched_paths has the same gap (called out in ADR-026 §Consequences).
- In-memory job queue. Nothing persists across a worker restart; transient failures lose work silently.
A new pipeline-run collection (pipeline_runs) was considered. Rejected — scan_runs already has every field needed and the operator's mental model is "the scan ran." Duplicating it adds a join with no expressive gain.
Decision
(a) scan_runs is the pipeline-run root
Stage outcomes are recorded under reserved __-prefixed keys in scan_runs.category_results. Every cell — connector or platform-stage — satisfies the existing CategoryResult shape (src/domain/scan-runs/types.ts:88-95: { status, items_scanned, started_at, ended_at, errors }). No schema change. Stage-specific semantics are encoded by what items_scanned counts:
| Cell key | items_scanned semantics |
|---|---|
| __sync | entities upserted by sync_ingestion |
| __eval | findings created + updated + resolved |
| __chain | execution chains created + updated |
| __stitch | stitched paths materialized |
| __evidence | evidence packs built |
Stage-specific detail beyond the cell (sync_id, finding ids, chain ids) lives in the linked records — scan_runs.sync_id already pins the connector_sync, and findings/chains are tenant-scoped queries from there. The cell carries no extra fields; the type-checker enforces this.
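As a minimal sketch of the reserved-key convention, a handler could stamp its stage cell roughly as follows. The status literals, the ISO-string timestamps, and the string[] error type are assumptions (the shape above names the fields but not their types), and stampStageCell and the iam_users category are hypothetical names used only for illustration.

```typescript
// Field names follow the CategoryResult shape quoted above. The status
// literals, ISO-string timestamps, and string[] errors are assumptions.
interface CategoryResult {
  status: "success" | "failed";
  items_scanned: number;
  started_at: string;
  ended_at: string;
  errors: string[];
}

// Reserved platform-stage keys from the table above.
type StageKey = "__sync" | "__eval" | "__chain" | "__stitch" | "__evidence";

// Hypothetical helper: returns a copy of category_results with one stage
// cell set, leaving connector cells untouched.
function stampStageCell(
  categoryResults: Record<string, CategoryResult>,
  stage: StageKey,
  cell: CategoryResult,
): Record<string, CategoryResult> {
  return { ...categoryResults, [stage]: cell };
}

// A connector cell (hypothetical category name) before any stage has run.
const connectorOnly: Record<string, CategoryResult> = {
  iam_users: {
    status: "success",
    items_scanned: 42,
    started_at: "2026-05-19T10:00:00Z",
    ended_at: "2026-05-19T10:00:05Z",
    errors: [],
  },
};

// The evaluate stage stamps its own cell on completion.
const withEval = stampStageCell(connectorOnly, "__eval", {
  status: "success",
  items_scanned: 7, // findings created + updated + resolved
  started_at: "2026-05-19T10:00:06Z",
  ended_at: "2026-05-19T10:00:08Z",
  errors: [],
});
```

Adding a field such as finding_ids to either object literal would fail TypeScript's excess-property check, which is the type-level guarantee the paragraph above relies on.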
__ prefix is reserved in the category-name regex (src/workers/handlers/execute-scan.ts:80), aligning with the existing "__driver__" sentinel. The regex is tightened to admit ^(?:[a-z][a-z0-9_]{0,63}|__[a-z][a-z0-9_]{0,30})$.
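To make the tightened pattern concrete, the sketch below runs it against a handful of names. The regex is copied from the text above; isReservedStageKey and the sample names are hypothetical.

```typescript
// Pattern quoted from the ADR text above.
const CATEGORY_NAME = /^(?:[a-z][a-z0-9_]{0,63}|__[a-z][a-z0-9_]{0,30})$/;

// Hypothetical helper: true when a name is well-formed and uses the reserved
// "__" prefix (platform stages and the "__driver__" sentinel).
function isReservedStageKey(name: string): boolean {
  return name.startsWith("__") && CATEGORY_NAME.test(name);
}

const samples = ["iam_users", "__eval", "__driver__", "_private", "__", "Users", "__9x"];

const accepted = samples.filter((name) => CATEGORY_NAME.test(name));
// accepted: ["iam_users", "__eval", "__driver__"]

const reserved = accepted.filter(isReservedStageKey);
// reserved: ["__eval", "__driver__"]
```

Because the first alternative must start with a lowercase letter, a name can only begin with an underscore by matching the second alternative, which caps reserved names at 33 characters including the prefix.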
Handlers stamp their own stage cell on completion. scanRunId is propagated through enqueue payloads from execute_scan → IngestService.submit → downstream jobs. The HTTP /api/v1/ingest/* path (seed scripts) synthesises a scan_run with trigger.type = "manual_api" so the data model is uniform.
(b) CredentialBroker is the only runtime path that resolves CredentialsRef
A typed interface, vendor-agnostic:
// src/credentials/broker.ts
export interface CredentialBroker {
  resolve(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>>;
}
Implementations:
- env: prefix-matches process.env. Single-tenant dev / staging-with-laptop-creds.
- azure_keyvault: managed-identity-authenticated, per-tenant secret bundles at vault://kv-sv0-{env}/tenants/{tenant}/connector/{kind}. Recommended for Tier-3 dev (ADR-022) and prod.
- aws_secretsmanager: symmetric, reserved.
- op: hard-error at runtime. 1Password is bootstrap-only per existing operational rule. Schema admits the variant; the broker refuses to resolve it.
The InProcessSubprocessDriver calls broker.resolve(tenantId, instance.credentials_ref) at run time and passes the result as extraEnv. Bundles are subprocess-scoped, never logged, never written to Mongo.
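One way to picture the dispatch and the extraEnv layering is the sketch below. It is illustrative only: the Provider union is inferred from the implementations list, and RoutingBroker and buildSubprocessEnv are hypothetical names, not code that exists in the repository.

```typescript
// CredentialsRef carries { provider, ref } per this ADR; the provider union
// is inferred from the implementations list and is an assumption.
type Provider = "env" | "azure_keyvault" | "aws_secretsmanager" | "op";

interface CredentialsRef {
  provider: Provider;
  ref: string;
}

interface CredentialBroker {
  resolve(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>>;
}

// Hypothetical router: delegates by provider and refuses "op" outright.
class RoutingBroker implements CredentialBroker {
  private readonly brokers: Partial<Record<Provider, CredentialBroker>>;

  constructor(brokers: Partial<Record<Provider, CredentialBroker>>) {
    this.brokers = brokers;
  }

  async resolve(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>> {
    if (ref.provider === "op") {
      throw new Error("op credentials are bootstrap-only and are not resolved at runtime");
    }
    const broker = this.brokers[ref.provider];
    if (!broker) {
      throw new Error(`no broker configured for provider "${ref.provider}"`);
    }
    return broker.resolve(tenantId, ref);
  }
}

// OS-env allowlist as named elsewhere in this ADR.
const OS_ENV_ALLOWLIST = ["PATH", "HOME", "TMPDIR", "LANG", "LC_ALL"];

// Hypothetical helper: allowlisted OS variables first, broker bundle on top.
function buildSubprocessEnv(
  osEnv: Record<string, string | undefined>,
  extraEnv: Record<string, string>,
): Record<string, string> {
  const base: Record<string, string> = {};
  for (const name of OS_ENV_ALLOWLIST) {
    const value = osEnv[name];
    if (value !== undefined) base[name] = value;
  }
  return { ...base, ...extraEnv };
}
```

Taking the OS environment as a parameter rather than reading process.env directly keeps the helper testable and makes it obvious that nothing outside the allowlist and the resolved bundle reaches the subprocess.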
Security boundary:
- Workers never see raw secret values; only the broker does.
- Tenant namespace is derived from tenantId, not from the ref string. Each broker constructs the lookup path itself: env broker uses SV0_TENANT_${tenantSlug}_${refSuffix}_*; Key Vault broker uses tenants/${tenantId}/connector/${kind}/*. The user-controlled ref.ref only contributes the trailing connector-specific suffix (AWS_PRIMARY, ENTRA_DEFAULT, …), never the tenant segment. A misconfigured ref pointing at another tenant's path is structurally impossible because the tenant prefix is computed, not read from the ref. The broker also rejects any ref.ref that contains path separators or the literal substring tenant.
- The platform's OS-env allowlist for the subprocess (PATH/HOME/TMPDIR/LANG/LC_ALL) is unchanged; the broker bundle is layered on top.
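The tenant-prefix rule for the env broker can be sketched as a few pure functions. Several details are assumptions because the ADR does not fix them: how tenantSlug is derived from tenantId, whether the tenant substring check is case-sensitive, and whether the prefix is stripped from the returned keys. All function names here are hypothetical.

```typescript
// Assumed slug rule: upper-case the tenant id and replace anything outside
// [A-Z0-9] with "_". The ADR does not define how tenantSlug is derived.
function toTenantSlug(tenantId: string): string {
  return tenantId.toUpperCase().replace(/[^A-Z0-9]/g, "_");
}

// Rejects suffixes containing path separators or the substring "tenant".
// Matching case-insensitively is an assumption.
function assertSafeRefSuffix(refSuffix: string): void {
  if (/[\/\\]/.test(refSuffix) || refSuffix.toLowerCase().includes("tenant")) {
    throw new Error("credentials_ref.ref must be a bare connector suffix");
  }
}

// The tenant segment is computed from tenantId; refSuffix only supplies the tail.
function envPrefix(tenantId: string, refSuffix: string): string {
  assertSafeRefSuffix(refSuffix);
  return `SV0_TENANT_${toTenantSlug(tenantId)}_${refSuffix}_`;
}

// Hypothetical env lookup: collect matching variables and strip the prefix so
// the subprocess sees connector-native names (stripping is an assumption).
function resolveFromEnv(
  env: Record<string, string | undefined>,
  tenantId: string,
  refSuffix: string,
): Record<string, string> {
  const prefix = envPrefix(tenantId, refSuffix);
  const bundle: Record<string, string> = {};
  for (const key of Object.keys(env)) {
    const value = env[key];
    if (key.startsWith(prefix) && value !== undefined) {
      bundle[key.slice(prefix.length)] = value;
    }
  }
  return bundle;
}
```

Since the caller never supplies the tenant segment, a ref written for one tenant cannot select another tenant's variables; the worst a bad suffix can do is match nothing.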
(c) Deploy-gate rematerialization is a worker job kind
New WorkerJobType: "rematerialize". Generalises ADR-026 path (b):
- Triggered by a CI step in the deploy workflow when the merge diff touches chain-builder.ts, stitched-path-materializer.ts, or tracked schema files.
- The CI step calls POST /api/v1/admin/rematerialize with a stages list (["chains", "stitched_paths"] initially) and a tenants list (default: all status="active").
- Synthesises a scan_run per tenant with a new trigger.type = "deploy_gate" variant. Stamps __chain / __stitch cells as the relevant materializers complete.
- Idempotent: chains and stitched paths upsert by content hash; re-running on an unchanged graph produces no net mutation beyond last_seen_at.
This closes ADR-026's deferred "stitched_paths shares the same vulnerability" caveat in the same trigger model.
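The idempotency property can be illustrated with an in-memory stand-in for the upsert. This is a sketch under assumptions: the request field names stages and tenants mirror the prose but are not confirmed, the Map stands in for the Mongo collection, and upsertByContentHash is a hypothetical name. A real implementation would also scope the key by tenant.

```typescript
// Request body for POST /api/v1/admin/rematerialize as described above; the
// field names stages and tenants are assumptions.
interface RematerializeRequest {
  stages: Array<"chains" | "stitched_paths">;
  tenants?: string[]; // omitted means every tenant with status="active"
}

interface MaterializedDoc {
  content_hash: string;
  last_seen_at: string;
}

type UpsertOutcome = "created" | "touched";

// Hypothetical in-memory stand-in for the Mongo upsert, keyed by content hash.
function upsertByContentHash(
  store: Map<string, MaterializedDoc>,
  contentHash: string,
  nowIso: string,
): UpsertOutcome {
  const existing = store.get(contentHash);
  if (existing) {
    existing.last_seen_at = nowIso; // the only mutation on an unchanged graph
    return "touched";
  }
  store.set(contentHash, { content_hash: contentHash, last_seen_at: nowIso });
  return "created";
}

const request: RematerializeRequest = { stages: ["chains", "stitched_paths"] };

// Two deploy-gate runs over the same graph (placeholder hashes).
const chains = new Map<string, MaterializedDoc>();
const firstRun = ["hash-a", "hash-b"].map((h) =>
  upsertByContentHash(chains, h, "2026-05-19T10:00:00Z"),
);
const secondRun = ["hash-a", "hash-b"].map((h) =>
  upsertByContentHash(chains, h, "2026-05-19T11:00:00Z"),
);
// firstRun: ["created", "created"]; secondRun: ["touched", "touched"]
```

After the second run the store still holds two documents, and only last_seen_at has moved.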
(d) Stage-retry helper, deterministic and bounded
Wrap each handler with a withRetry(handler, { maxRetries: 3 }) decorator that retries only on a structured TransientError thrown by storage/network primitives. Persistent errors propagate to the existing failed recording path. No probabilistic backoff — fixed 1s/2s/4s schedule. ML-style adaptive backoff is explicitly out of scope.
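A minimal sketch of such a decorator is shown below. The TransientError class shape, the Handler type, and the injectable sleep option are assumptions added for illustration and testability; only the maxRetries option and the fixed 1s/2s/4s schedule come from the text above.

```typescript
// Assumed shape for the structured error thrown by storage/network primitives.
class TransientError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TransientError";
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, TransientError.prototype);
  }
}

type Handler<J, R> = (job: J) => Promise<R>;

interface RetryOptions {
  maxRetries: number;
  // Injectable for tests; defaults to a real timer. Not part of the ADR text.
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Fixed schedule: 1s, 2s, 4s. No jitter, no adaptation.
function retryDelayMs(attempt: number): number {
  return 1000 * Math.pow(2, Math.min(attempt, 2));
}

function withRetry<J, R>(handler: Handler<J, R>, opts: RetryOptions): Handler<J, R> {
  const sleep = opts.sleep ?? defaultSleep;
  return async (job) => {
    for (let attempt = 0; ; attempt++) {
      try {
        return await handler(job);
      } catch (err) {
        const retryable = err instanceof TransientError && attempt < opts.maxRetries;
        if (!retryable) throw err; // persistent errors reach the failed path
        await sleep(retryDelayMs(attempt));
      }
    }
  };
}
```

With maxRetries: 3 a handler runs at most four times. Any error that is not a TransientError is rethrown on the first attempt, so the existing failed recording path sees it unchanged.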
(e) Operator UX
Three new UI areas (four routes), all backed by existing admin APIs:
- /operations/runs lists scan_runs with per-stage status sparklines.
- /operations/runs/:id shows single-run detail with a per-stage panel.
- /operations/scopes is the schedule editor (cron / interval / pause).
- /operations/instances provides ConnectorInstance CRUD.
Super-admin only initially.
The run-detail "Re-run" affordance is not in this slice. The current POST /api/v1/scan-runs route accepts only manual_api and iac_lifecycle triggers (src/api/routes/scan-runs.ts) — retry requires either widening that route or adding POST /api/v1/scan-runs/:id/retry. Land that with Slice 4 (pipeline state), which already opens the route layer.
Alternatives considered
Plan B — introduce a separate pipeline_runs collection
A new top-level entity, with scan_runs and connector_syncs as children. Every multi-stage outcome lives there; scan_runs reverts to "just the connector subprocess outcome."
Rejected. Three reasons:
- Duplicate fields. pipeline_runs would re-declare tenant_id, status, started_at, ended_at, trigger, all of which are already on scan_runs.
- UI cost. Every operator query becomes a $lookup from pipeline_runs to scan_runs to connector_syncs. Three collections deep.
- Mental-model mismatch. The user's primitive is "the scan." pipeline_runs is platform-internal vocabulary; operators don't think in terms of "Tuesday's pipeline run #4 included scan #N."
Stamping stage cells onto scan_runs.category_results achieves the same observability without a new collection. The cost is a tighter regex on category names — already required for security (execute-scan.ts:80).
Plan C — event-driven (webhook) connector triggers
Connectors push delta events to the platform; ingest is reactive instead of scheduled.
Rejected (for now). The connector architecture is full-state extraction (scheduler.ts:485-489). Source-system event streams (CloudTrail, Entra audit logs) are heterogeneous and would require per-connector buffering. Revisit when a connector actually exposes incremental change events and the platform has a use case for sub-minute freshness.
Plan D — Temporal / Celery / Bull-style external job runtime
Replace the in-memory queue with a persistent durable execution engine.
Rejected for the first slice. The in-memory queue is fine for one VM. Persistence becomes required when the platform fans out to N workers — file as a follow-up. The retry helper from (d) closes the most common failure shape (transient Mongo blip) without the operational overhead of a new dependency.
Plan E — Lazy on-demand pipeline state computation
Compute pipeline state at read time by joining scan_runs ⨝ connector_syncs ⨝ logs.
Rejected. Pushes complexity to every UI query. Requires reading log lines from a sink (or stamping more metadata onto syncs). Worst of both worlds — no normalized data, expensive reads.
Consequences
Positive
- One pipeline-run root — operators see scan + every downstream stage on one row.
- Credential boundary is explicit — the broker is the only thing that touches secret values at runtime. Auditable, swappable.
- Deploy-gate covers chains and stitched paths with one job kind. ADR-026's deferred caveat is closed.
- Migration is incremental — seven slices, each one PR-sized, each delivering value without breaking the seed-script workflow.
- Existing code reused — scheduler, driver, handlers, admin APIs all stay. The deltas are a credential broker, a per-stage stamp call, and three UI pages.
Negative
- __-prefixed stage cells couple connector-category vocabulary to platform-internal stages. A future renaming would touch both layers. Mitigated by reserving the __ prefix at the regex layer.
- Single-VM job queue is still in-memory. A worker crash mid-pipeline still loses queued downstream stages until the next sync. Acceptable today (one VM, infrequent restarts), tracked as follow-up.
- Azure Key Vault is the recommended prod broker but not bootstrapped. The Azure Key Vault broker slice (see §Migration) stands up the vault; until then prod runs on the env provider with secrets in the deploy's .env. ADR-024's deploy mechanism already places that .env on the VM securely.
- op provider remains rejected at runtime. The schema admits it; the broker hard-errors on attempt. This is a documented gap; closing it requires a service-token model with 1Password Connect.
Neutral
- No scan_runs schema change. Reserved key prefix on an existing Record<string, CategoryResult> map.
- No change to connector contracts. Connectors continue to receive --scope-json and write --category-results-out and --graph-json. They are agnostic to whether credentials came from .env or a broker.
Open questions
- trigger.type = "deploy_gate": add this enum value, or reuse manual_api with a sentinel triggered_by_user_id? Recommend adding the variant for clean audit-log inspection. Touches one type file + one switch.
- Stitch fan-out semantics. When N completing syncs collapse into 1 stitch, write __stitch to all N participating scan_runs, or only the newest? Recommend fan-out for the audit trail.
- Cron scope wiring. claimDueScopes sets next_run_at = null for cron scopes today; cron evaluation is documented as deferred to Phase 2. Track as a separate small follow-up; not part of this proposal's first slice.
- Reaper for orphaned running runs. last_claimed_at is stamped specifically for this (scan-scopes/types.ts:46-53). Add a 5-minute periodic check that marks running runs older than budget.max_runtime_seconds as timeout. Fold into Slice 4 or stand alone.
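The reaper in the last question could look roughly like the sketch below. It is hedged heavily: RunView is a hypothetical flattened view (the real claim stamp and budget live on separate documents), the status literals other than running and timeout are assumptions, and measuring age from the claim stamp is one reading of the proposal.

```typescript
// Hypothetical flattened view of a run joined with its scope's claim stamp
// and budget; the real fields live on separate documents.
interface RunView {
  id: string;
  status: "running" | "success" | "failed" | "timeout";
  last_claimed_at_ms: number; // epoch milliseconds
  max_runtime_seconds: number; // from budget.max_runtime_seconds
}

// The proposed cadence for the periodic check.
const REAPER_INTERVAL_MS = 5 * 60 * 1000;

// Ids of runs that are still "running" past their runtime budget.
function findOrphanedRuns(runs: RunView[], nowMs: number): string[] {
  return runs
    .filter((run) => run.status === "running")
    .filter((run) => nowMs - run.last_claimed_at_ms > run.max_runtime_seconds * 1000)
    .map((run) => run.id);
}

// Returns a new list with orphaned runs marked "timeout"; inputs are not mutated.
function reapOrphanedRuns(runs: RunView[], nowMs: number): RunView[] {
  const orphaned = findOrphanedRuns(runs, nowMs);
  return runs.map((run): RunView =>
    orphaned.indexOf(run.id) >= 0 ? { ...run, status: "timeout" } : run,
  );
}
```

A scheduler would call reapOrphanedRuns every REAPER_INTERVAL_MS and persist the changed rows; keeping the selection logic pure makes the budget comparison easy to test without a database.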
Migration
Seven slices, each ~1 PR:
- Credential broker (env provider only): unblocks Slice 0 below.
- Slice 0 wire-up: provision one demo tenant with ConnectorInstance + ScanScope; scheduler starts running real scans.
- Operator UX: three UI pages, read-only against existing admin APIs (no retry).
- Pipeline state on scan_runs: propagate scanRunId, stamp __stage cells, add POST /api/v1/scan-runs/:id/retry and wire the run-detail re-run button.
- Deploy-gate rematerialize job: ADR-026 path (b) generalised to chains + stitched paths.
- Azure Key Vault broker: Tier-3 dev + prod cutover from env-on-VM to vault-resolved-per-run.
- Retire seed-demo for real tenants: seed-demo-* becomes mock-only; real tenants run via scheduler.
Each slice preserves the existing seed-script workflow until the final slice. The seed scripts continue to use the HTTP /api/v1/ingest/normalized-graph path; once the pipeline-state slice lands they get a synthesised scan_run and full pipeline-state observability for free.
Full slice-by-slice line-count estimates and PR shapes live in the companion research audit §5.
References
- src/workers/scheduler.ts:313-582: existing scheduler.
- src/workers/connector-driver.ts:104-449: driver + env allowlist + extraEnv pipe.
- src/workers/handlers/execute-scan.ts:80,209: category-name regex (to be tightened) + __driver__ precedent.
- src/workers/handlers/sync-ingestion.ts:125-617: 13-step ingest pipeline (stage stamp lands at step 13).
- src/storage/mongo/adapters/control-plane-adapter.ts:128-181: atomic claim.
- src/api/routes/admin/connector-instances.ts:18-21: explicit acknowledgement that broker is inert today.
- src/domain/scan-runs/types.ts:104-141: ScanRunDoc schema (extended by reserved-key convention, no schema change).
- src/domain/connector-instances/types.ts:28-92: CredentialsRef + ConnectorInstanceDoc.
- ADR-018: deploy security posture (docker-group accepted pre-managed-platform).
- ADR-022: Azure compute landing zone (Tier-3 dev/prod VMs).
- ADR-023: authentication target architecture (four-tier).
- ADR-024: Azure deploy lifecycle.
- ADR-026: chain re-materialization triggers (path b/c generalised here).
- sv0-platform#1185: implementation umbrella.
- 2026-05-19-automated-connector-pipeline-audit.md: companion research audit.
Honored North Star clauses
- C-13 (SIEM landing supported, not a SIEM console; north-star.md:405). Scheduled scans + visible failures keep the brief / chain fresh for the SIEM-cold analyst landing.
- C-15 (LOCKED-IN-CODE path differentiability; north-star.md:377). Deploy-gate rematerialization (decision (c)) extends ADR-026's chain coverage to stitched_paths, preventing the same dead-end-page failure shape on a stitched-path-materializer.ts deploy.