ADR-027: Automated Connector Pipeline — credential broker, pipeline-run root, deploy-gate

Status

Draft — 2026-05-19. Captures the decision shape for sv0-platform#1185. Implementation lands across 7 PRs (slices, see §Migration).

Companion research: 2026-05-19-automated-connector-pipeline-audit — current-state audit with file:line references, full gap analysis, alternatives considered, migration plan in detail.


Context

Stream-1 of the 2026-04-22 connector-control-execution architecture already shipped: a 30 s scheduler tick (src/workers/scheduler.ts:362-369), atomic scope claiming (src/storage/mongo/adapters/control-plane-adapter.ts:128-181), an execute_scan worker (src/workers/handlers/execute-scan.ts:234-377), a connector-driver seam (src/workers/connector-driver.ts:104-449), and a cascade from sync_ingestion → evaluate_findings → build_evidence_pack. Per-tenant ConnectorInstance and ScanScope data models exist (src/domain/connector-instances/types.ts, src/domain/scan-scopes/types.ts).

What is missing prevents the platform from running an unattended scan against a real tenant today:

  1. credentials_ref is inert. The ConnectorInstance.credentials_ref field carries { provider, ref } but no runtime code resolves it. The InProcessSubprocessDriver is constructed at src/index.ts:98 with no env, so its subprocess gets only the OS allowlist (PATH/HOME/…) and zero connector credentials. Connectors run only from a developer laptop today.
  2. No operator UX for scan_runs. The DB is the source of truth; UI is absent. Failures live in worker logs.
  3. No pipeline-run root. scan_runs and connector_syncs are joined by scan_runs.sync_id; downstream stages (evaluate, chain assembly, stitch, evidence) have no first-class outcome record on either.
  4. No deploy-gate rematerialization. ADR-026 documented path (b) for chains but the job kind does not exist yet. stitched_paths has the same gap (called out in ADR-026 §Consequences).
  5. In-memory job queue. Nothing persists across a worker restart; transient failures lose work silently.

A new pipeline-run collection (pipeline_runs) was considered. Rejected — scan_runs already has every field needed and the operator's mental model is "the scan ran." Duplicating it adds a join with no expressive gain.


Decision

(a) scan_runs is the pipeline-run root

Stage outcomes are recorded under reserved __-prefixed keys in scan_runs.category_results. Every cell — connector or platform-stage — satisfies the existing CategoryResult shape (src/domain/scan-runs/types.ts:88-95: { status, items_scanned, started_at, ended_at, errors }). No schema change. Stage-specific semantics are encoded by what items_scanned counts:

Cell key    | items_scanned semantics
__sync      | entities upserted by sync_ingestion
__eval      | findings created + updated + resolved
__chain     | execution chains created + updated
__stitch    | stitched paths materialized
__evidence  | evidence packs built

Stage-specific detail beyond the cell (sync_id, finding ids, chain ids) lives in the linked records — scan_runs.sync_id already pins the connector_sync, and findings/chains are tenant-scoped queries from there. The cell carries no extra fields; the type-checker enforces this.
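
As an illustration of the convention, the sketch below stamps a stage cell into a category_results map. It is not taken from the codebase: the field types on CategoryResult, the status values, the StageKey union, and the stampStageCell helper are all assumptions made for the example; only the five field names and the five reserved keys come from this ADR.

```typescript
// Assumed shape: the ADR names these five fields, but their exact types
// (status union, ISO-8601 strings, string[] errors) are guesses.
type CategoryStatus = "succeeded" | "failed" | "skipped";

interface CategoryResult {
  status: CategoryStatus;
  items_scanned: number;
  started_at: string;
  ended_at: string;
  errors: string[];
}

// The five reserved stage keys listed above.
type StageKey = "__sync" | "__eval" | "__chain" | "__stitch" | "__evidence";

// Hypothetical helper: returns a new map with the stage cell set and
// every connector category cell left untouched.
function stampStageCell(
  categoryResults: Readonly<Record<string, CategoryResult>>,
  stage: StageKey,
  cell: CategoryResult,
): Record<string, CategoryResult> {
  return { ...categoryResults, [stage]: cell };
}

// A connector cell ("iam" is an invented category name) next to a stage cell.
const connectorOnly: Record<string, CategoryResult> = {
  iam: {
    status: "succeeded",
    items_scanned: 120,
    started_at: "2026-05-19T10:00:00Z",
    ended_at: "2026-05-19T10:00:40Z",
    errors: [],
  },
};

const withEvalStage = stampStageCell(connectorOnly, "__eval", {
  status: "succeeded",
  items_scanned: 17, // findings created + updated + resolved
  started_at: "2026-05-19T10:01:00Z",
  ended_at: "2026-05-19T10:01:05Z",
  errors: [],
});
```

Because the stage cell is typed as CategoryResult, adding an extra field such as a sync_id to the object literal would fail type-checking, which is the enforcement the paragraph above relies on.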

__ prefix is reserved in the category-name regex (src/workers/handlers/execute-scan.ts:80), aligning with the existing "__driver__" sentinel. The regex is tightened to admit ^(?:[a-z][a-z0-9_]{0,63}|__[a-z][a-z0-9_]{0,30})$.
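
The tightened pattern can be exercised directly. The sketch below uses the regex exactly as quoted above; the constant and function names are hypothetical, and the real check at execute-scan.ts:80 may be structured differently.

```typescript
// Regex copied from the ADR text. First branch: connector category names
// (1 to 64 chars). Second branch: reserved "__" names (3 to 33 chars).
const CATEGORY_NAME_RE = /^(?:[a-z][a-z0-9_]{0,63}|__[a-z][a-z0-9_]{0,30})$/;

// Hypothetical helper names, for illustration only.
function isValidCategoryName(name: string): boolean {
  return CATEGORY_NAME_RE.test(name);
}

function isReservedStageKey(name: string): boolean {
  return name.startsWith("__") && isValidCategoryName(name);
}

// "iam_roles" is an invented connector category name.
const samples = ["iam_roles", "__sync", "__driver__", "_sync", "__Sync", "__", ""];
for (const name of samples) {
  console.log(JSON.stringify(name), isValidCategoryName(name), isReservedStageKey(name));
}
```

A single leading underscore, an uppercase letter, or a bare "__" all fail to match, so a connector cannot accidentally produce a name that collides with a platform stage key.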

Handlers stamp their own stage cell on completion. scanRunId is propagated through enqueue payloads from execute_scan → IngestService.submit → downstream jobs. The HTTP /api/v1/ingest/* path (seed scripts) synthesises a scan_run with trigger.type = "manual_api" so the data model is uniform.

(b) CredentialBroker is the only runtime path that resolves CredentialsRef

A typed interface, vendor-agnostic:

// src/credentials/broker.ts
export interface CredentialBroker {
  resolve(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>>;
}

Implementations:

  • env — prefix-matches process.env. Single-tenant dev / staging-with-laptop-creds.
  • azure_keyvault — managed-identity-authenticated, per-tenant secret bundles at vault://kv-sv0-{env}/tenants/{tenant}/connector/{kind}. Recommended for Tier-3 dev (ADR-022) and prod.
  • aws_secretsmanager — symmetric, reserved.
  • op — hard-error at runtime. 1Password is bootstrap-only per existing operational rule. Schema admits the variant; the broker refuses to resolve it.

The InProcessSubprocessDriver calls broker.resolve(tenantId, instance.credentials_ref) at run time and passes the result as extraEnv. Bundles are subprocess-scoped, never logged, never written to Mongo.

Security boundary:

  • Workers never see raw secret values; only the broker does.
  • Tenant namespace is derived from tenantId, not from the ref string. Each broker constructs the lookup path itself: env broker uses SV0_TENANT_${tenantSlug}_${refSuffix}_*; Key Vault broker uses tenants/${tenantId}/connector/${kind}/*. The user-controlled ref.ref only contributes the trailing connector-specific suffix (AWS_PRIMARY, ENTRA_DEFAULT, …) — never the tenant segment. A misconfigured ref pointing at another tenant's path is structurally impossible because the tenant prefix is computed, not read from the ref. The broker also rejects any ref.ref that contains path separators or the literal substring tenant.
  • The platform's OS-env allowlist for the subprocess (PATH/HOME/TMPDIR/LANG/LC_ALL) is unchanged; the broker bundle is layered on top.
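
A minimal sketch of how the env provider could apply these rules follows. Several details are assumptions not stated in this ADR: the slug derivation in toTenantSlug, stripping the prefix from returned keys, the case-insensitive match on "tenant", and the names resolveFromEnv and EnvCredentialBroker. The env map is passed in explicitly so the example does not read the real process environment.

```typescript
type CredentialsProvider = "env" | "azure_keyvault" | "aws_secretsmanager" | "op";

interface CredentialsRef {
  provider: CredentialsProvider;
  ref: string;
}

interface CredentialBroker {
  resolve(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>>;
}

type EnvMap = Record<string, string | undefined>;

// Assumption: slug is the tenant id upper-cased, non-alphanumerics mapped to "_".
function toTenantSlug(tenantId: string): string {
  return tenantId.toUpperCase().replace(/[^A-Z0-9]/g, "_");
}

// The ref only contributes a trailing suffix; reject anything that looks
// like an attempt to steer the tenant segment.
function assertSafeRefSuffix(suffix: string): void {
  if (/[\/\\]/.test(suffix) || /tenant/i.test(suffix)) {
    throw new Error("credentials ref suffix rejected");
  }
}

function resolveFromEnv(env: EnvMap, tenantId: string, ref: CredentialsRef): Record<string, string> {
  if (ref.provider !== "env") {
    throw new Error(`env broker cannot resolve provider "${ref.provider}"`);
  }
  assertSafeRefSuffix(ref.ref);
  // The tenant segment is computed from tenantId, never read from the ref.
  const prefix = `SV0_TENANT_${toTenantSlug(tenantId)}_${ref.ref}_`;
  const bundle: Record<string, string> = {};
  for (const [key, value] of Object.entries(env)) {
    if (value !== undefined && key.startsWith(prefix) && key.length > prefix.length) {
      bundle[key.slice(prefix.length)] = value; // assumption: prefix is stripped
    }
  }
  return bundle;
}

class EnvCredentialBroker implements CredentialBroker {
  private readonly env: EnvMap;
  constructor(env: EnvMap) {
    this.env = env;
  }
  async resolve(tenantId: string, ref: CredentialsRef): Promise<Record<string, string>> {
    return resolveFromEnv(this.env, tenantId, ref);
  }
}

// Invented demo values; no real credentials.
const demoEnv: EnvMap = {
  PATH: "/usr/bin",
  SV0_TENANT_ACME_AWS_PRIMARY_AWS_ACCESS_KEY_ID: "example-key-id",
  SV0_TENANT_ACME_AWS_PRIMARY_AWS_SECRET_ACCESS_KEY: "example-secret",
  SV0_TENANT_OTHER_AWS_PRIMARY_AWS_ACCESS_KEY_ID: "other-key-id",
};
const acmeBundle = resolveFromEnv(demoEnv, "acme", { provider: "env", ref: "AWS_PRIMARY" });
```

In this sketch the bundle for tenant "acme" contains only the two ACME-prefixed entries, and a ref such as "../tenants/other" is refused before any lookup happens.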

(c) Deploy-gate rematerialization is a worker job kind

New WorkerJobType: "rematerialize". Generalises ADR-026 path (b):

  • Triggered by a CI step in the deploy workflow when the merge diff touches chain-builder.ts, stitched-path-materializer.ts, or tracked schema files.
  • The CI step calls POST /api/v1/admin/rematerialize with a stages list (["chains", "stitched_paths"] initially) and a tenants list (default: all status="active").
  • Synthesises a scan_run per tenant with a new trigger.type = "deploy_gate" variant. Stamps __chain / __stitch cells as the relevant materializers complete.
  • Idempotent — chains and stitched paths upsert by content hash; re-running on an unchanged graph produces no net mutation beyond last_seen_at.

This closes ADR-026's deferred "stitched_paths shares the same vulnerability" caveat in the same trigger model.
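
The idempotency property can be sketched with an in-memory store. Everything here is illustrative: MaterializedDoc, upsertByContentHash, and the first_seen_at field are hypothetical names, and the FNV-1a function is only a dependency-free stand-in for whatever content hash the materializers actually use.

```typescript
interface MaterializedDoc {
  content_hash: string;
  payload: string;
  first_seen_at: string; // assumed field, used here to show what does NOT change
  last_seen_at: string;
}

// Stand-in hash (32-bit FNV-1a over UTF-16 code units). A real system
// would more likely use a cryptographic digest of canonicalised content.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

type UpsertOutcome = "created" | "touched";

// Upsert keyed by content hash: unchanged content only refreshes last_seen_at.
function upsertByContentHash(
  store: Map<string, MaterializedDoc>,
  payload: string,
  now: string,
): UpsertOutcome {
  const key = fnv1a(payload);
  const existing = store.get(key);
  if (existing) {
    existing.last_seen_at = now;
    return "touched";
  }
  store.set(key, { content_hash: key, payload, first_seen_at: now, last_seen_at: now });
  return "created";
}

// Two rematerialize runs over the same (invented) chain content.
const chainStore = new Map<string, MaterializedDoc>();
const chainPayload = JSON.stringify({ steps: ["role-a", "policy-b", "bucket-c"] });
const firstRun = upsertByContentHash(chainStore, chainPayload, "2026-05-19T10:00:00Z");
const secondRun = upsertByContentHash(chainStore, chainPayload, "2026-05-19T11:00:00Z");
console.log(firstRun, secondRun, chainStore.size);
```

The second run finds the same hash, so the store still holds one document and only last_seen_at moves forward, which matches the "no net mutation beyond last_seen_at" behaviour described above.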

(d) Stage-retry helper, deterministic and bounded

Wrap each handler with a withRetry(handler, { maxRetries: 3 }) decorator that retries only on a structured TransientError thrown by storage/network primitives. Persistent errors propagate to the existing failed recording path. No probabilistic backoff — fixed 1s/2s/4s schedule. ML-style adaptive backoff is explicitly out of scope.
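
One possible shape for the decorator is sketched below. The handler signature, the RetryOptions type, and the injectable sleep parameter (added so the schedule can be observed without real waiting) are assumptions; the ADR only fixes the maxRetries option, the TransientError trigger, and the 1s/2s/4s schedule.

```typescript
// Structured error thrown by storage/network primitives for retryable faults.
class TransientError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TransientError";
  }
}

type Handler<A, R> = (arg: A) => Promise<R>;

interface RetryOptions {
  maxRetries: number;
  // Assumption: injectable for tests; defaults to a real timer.
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

function withRetry<A, R>(handler: Handler<A, R>, options: RetryOptions): Handler<A, R> {
  const sleep = options.sleep ?? defaultSleep;
  return async (arg: A): Promise<R> => {
    for (let attempt = 0; ; attempt++) {
      try {
        return await handler(arg);
      } catch (err) {
        // Only TransientError is retried, and only while retries remain.
        const retryable = err instanceof TransientError && attempt < options.maxRetries;
        if (!retryable) throw err; // falls through to the existing failed path
        await sleep(1000 * 2 ** attempt); // fixed schedule: 1s, 2s, 4s
      }
    }
  };
}

// Usage sketch with an invented handler:
// const handler = withRetry(async (jobId: string) => runStage(jobId), { maxRetries: 3 });
```

With maxRetries: 3 the handler runs at most four times in total. The delays are deterministic, so a test can pass a recording sleep function and assert the exact sequence.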

(e) Operator UX

Four new UI pages, all backed by existing admin APIs:

  • /operations/runs — scan_runs list with per-stage status sparklines.
  • /operations/runs/:id — single-run detail; per-stage panel.
  • /operations/scopes — schedule editor (cron / interval / pause).
  • /operations/instances — ConnectorInstance CRUD.

Super-admin only initially.

The run-detail "Re-run" affordance is not in this slice. The current POST /api/v1/scan-runs route accepts only manual_api and iac_lifecycle triggers (src/api/routes/scan-runs.ts) — retry requires either widening that route or adding POST /api/v1/scan-runs/:id/retry. Land that with Slice 4 (pipeline state), which already opens the route layer.


Alternatives considered

Plan B — introduce a separate pipeline_runs collection

A new top-level entity, with scan_runs and connector_syncs as children. Every multi-stage outcome lives there; scan_runs reverts to "just the connector subprocess outcome."

Rejected. Three reasons:

  1. Duplicate fields. pipeline_runs would re-declare tenant_id, status, started_at, ended_at, trigger — all already on scan_runs.
  2. UI cost. Every operator query becomes a $lookup from pipeline_runs to scan_runs to connector_syncs. Three collections deep.
  3. Mental-model mismatch. The user's primitive is "the scan." pipeline_runs is platform-internal vocabulary — operators don't think in terms of "Tuesday's pipeline run #4 included scan #N."

Stamping stage cells onto scan_runs.category_results achieves the same observability without a new collection. The cost is a tighter regex on category names — already required for security (execute-scan.ts:80).

Plan C — event-driven (webhook) connector triggers

Connectors push delta events to the platform; ingest is reactive instead of scheduled.

Rejected (for now). The connector architecture is full-state extraction (scheduler.ts:485-489). Source-system event streams (CloudTrail, Entra audit logs) are heterogeneous and would require per-connector buffering. Revisit when a connector actually exposes incremental change events and the platform has a use case for sub-minute freshness.

Plan D — Temporal / Celery / Bull-style external job runtime

Replace the in-memory queue with a persistent durable execution engine.

Rejected for the first slice. The in-memory queue is fine for one VM. Persistence becomes required when the platform fans out to N workers — file as a follow-up. The retry helper from (d) closes the most common failure shape (transient Mongo blip) without the operational overhead of a new dependency.

Plan E — Lazy on-demand pipeline state computation

Compute pipeline state at read time by joining scan_runs ⨝ connector_syncs ⨝ logs.

Rejected. Pushes complexity to every UI query. Requires reading log lines from a sink (or stamping more metadata onto syncs). Worst of both worlds — no normalized data, expensive reads.


Consequences

Positive

  • One pipeline-run root — operators see scan + every downstream stage on one row.
  • Credential boundary is explicit — the broker is the only thing that touches secret values at runtime. Auditable, swappable.
  • Deploy-gate covers chains and stitched paths with one job kind. ADR-026's deferred caveat is closed.
  • Migration is incremental — seven slices, each one PR-sized, each delivering value without breaking the seed-script workflow.
  • Existing code reused — scheduler, driver, handlers, admin APIs all stay. The deltas are a credential broker, a per-stage stamp call, and the new operator UI pages.

Negative

  • __-prefixed stage cells couple connector-category vocabulary to platform-internal stages. A future renaming would touch both layers. Mitigated by reserving the __ prefix at the regex layer.
  • Single-VM job queue is still in-memory. A worker crash mid-pipeline still loses queued downstream stages until the next sync. Acceptable today (one VM, infrequent restarts), tracked as follow-up.
  • Azure Key Vault is the recommended prod broker but not bootstrapped. Slice 5 stands up the vault; until then prod runs on the env provider with secrets in the deploy's .env. ADR-024's deploy mechanism already places that .env on the VM securely.
  • op provider remains rejected at runtime. The schema admits it; the broker hard-errors on attempt. This is a documented gap; closing it requires a service-token model with 1Password Connect.

Neutral

  • No scan_runs schema change. Reserved key prefix on an existing Record<string, CategoryResult> map.
  • No change to connector contracts. Connectors continue to receive --scope-json + write --category-results-out + --graph-json. They are agnostic to whether credentials came from .env or a broker.

Open questions

  1. trigger.type = "deploy_gate" — add this enum value, or reuse manual_api with a sentinel triggered_by_user_id? Recommend adding the variant for clean audit-log inspection. Touches one type file + one switch.
  2. Stitch fan-out semantics. When N completing syncs collapse into 1 stitch, write __stitch to all N participating scan_runs, or only the newest? Recommend fan-out for the audit trail.
  3. Cron scope wiring. claimDueScopes sets next_run_at = null for cron scopes today; cron evaluation is documented as deferred to Phase 2. Track as a separate small follow-up; not part of this proposal's first slice.
  4. Reaper for orphaned running runs. last_claimed_at is stamped specifically for this (scan-scopes/types.ts:46-53). Add a 5-minute periodic check that marks running runs older than budget.max_runtime_seconds as timeout. Fold into Slice 4 or stand alone.
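
For the reaper in question 4, the selection step might look like the sketch below. It is only an illustration of the proposed check: the RunSummary shape, the use of started_at as the age reference, and having max_runtime_seconds available directly on the run are assumptions, and the function name is hypothetical.

```typescript
type RunStatus = "running" | "succeeded" | "failed" | "timeout";

// Assumed projection of a scan_run for the reaper; real field layout may differ.
interface RunSummary {
  id: string;
  status: RunStatus;
  started_at: string; // ISO-8601
  max_runtime_seconds: number; // from the scope's budget
}

// Returns ids of running runs whose age exceeds their runtime budget.
function findTimedOutRuns(runs: readonly RunSummary[], now: Date): string[] {
  return runs
    .filter(
      (run) =>
        run.status === "running" &&
        now.getTime() - Date.parse(run.started_at) > run.max_runtime_seconds * 1000,
    )
    .map((run) => run.id);
}

// Invented sample data: one stale run, one fresh run, one finished run.
const sampleRuns: RunSummary[] = [
  { id: "run-stale", status: "running", started_at: "2026-05-19T09:00:00Z", max_runtime_seconds: 1800 },
  { id: "run-fresh", status: "running", started_at: "2026-05-19T09:50:00Z", max_runtime_seconds: 1800 },
  { id: "run-done", status: "succeeded", started_at: "2026-05-19T08:00:00Z", max_runtime_seconds: 1800 },
];
const toReap = findTimedOutRuns(sampleRuns, new Date("2026-05-19T10:00:00Z"));
console.log(toReap);
```

A periodic job would then mark each returned id as timeout; that write is omitted here because it depends on the storage adapter.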

Migration

Seven slices, each ~1 PR:

  1. Credential broker (env provider only) — unblocks Slice 0 below.
  2. Slice 0 wire-up — provision one demo tenant with ConnectorInstance + ScanScope; scheduler starts running real scans.
  3. Operator UX — the UI pages from decision (e), read-only against existing admin APIs (no retry).
  4. Pipeline state on scan_runs — propagate scanRunId, stamp __stage cells, add POST /api/v1/scan-runs/:id/retry and wire the run-detail re-run button.
  5. Deploy-gate rematerialize job — ADR-026 path (b) generalised to chains + stitched paths.
  6. Azure Key Vault broker — Tier-3 dev + prod cutover from env-on-VM to vault-resolved-per-run.
  7. Retire seed-demo for real tenants — seed-demo-* becomes mock-only; real tenants run via scheduler.

Each slice preserves the existing seed-script workflow until the final slice. The seed scripts continue to use the HTTP /api/v1/ingest/normalized-graph path; with Slice 4 they get a synthesised scan_run and full pipeline-state observability for free.

Full slice-by-slice line-count estimates and PR shapes live in the companion research audit §5.


References

  • src/workers/scheduler.ts:313-582 — existing scheduler.
  • src/workers/connector-driver.ts:104-449 — driver + env allowlist + extraEnv pipe.
  • src/workers/handlers/execute-scan.ts:80,209 — category-name regex (to be tightened) + __driver__ precedent.
  • src/workers/handlers/sync-ingestion.ts:125-617 — 13-step ingest pipeline (stage stamp lands at step 13).
  • src/storage/mongo/adapters/control-plane-adapter.ts:128-181 — atomic claim.
  • src/api/routes/admin/connector-instances.ts:18-21 — explicit acknowledgement that broker is inert today.
  • src/domain/scan-runs/types.ts:104-141 — ScanRunDoc schema (extended by reserved-key convention, no schema change).
  • src/domain/connector-instances/types.ts:28-92 — CredentialsRef + ConnectorInstanceDoc.
  • ADR-018 — deploy security posture (docker-group accepted pre-managed-platform).
  • ADR-022 — Azure compute landing zone (Tier-3 dev/prod VMs).
  • ADR-023 — authentication target architecture (four-tier).
  • ADR-024 — Azure deploy lifecycle.
  • ADR-026 — chain re-materialization triggers (path b/c generalised here).
  • sv0-platform#1185 — implementation umbrella.
  • 2026-05-19-automated-connector-pipeline-audit.md — companion research audit.

Honored North Star clauses

  • C-13 (SIEM landing supported, not a SIEM console — north-star.md:405). Scheduled scans + visible failures keep the brief / chain fresh for the SIEM-cold analyst landing.
  • C-15 (LOCKED-IN-CODE path differentiability — north-star.md:377). Deploy-gate rematerialization (decision (c)) extends ADR-026's chain coverage to stitched_paths, preventing the same dead-end-page failure shape on a stitched-path-materializer.ts deploy.