Connector Runtime Architecture
How a scheduled connector run actually executes inside SecurityV0 — from a timer tick on our Azure VM, through credential resolution, to the customer's API. Complements 05-connectors.md (the connector interface contract, aimed at connector authors); this doc is the runtime / infrastructure view, aimed at stakeholders, reviewers, and anyone debugging "why didn't the scan run." It describes architecture and decisions, not implementation.
Bottom line. A connector run is a process on our Azure VM that calls the customer's API over the public internet — outbound only, no inbound, no access to customer networks. Customer credentials are issued to us once at onboarding, live in our Key Vault, and reach the run as scoped secrets resolved per-run by a credential broker. Tenant isolation is enforced in code (the broker derives the tenant from the verified database chain, not from the job message), not by network or vault access policy. Today, scheduled scans against real tenants cannot run — the runtime is wired with no credentials. The change that unblocks this is the env-broker phase (ADR-027 Slice 1).
Three phases, referenced throughout. Every section distinguishes what runs now from what changes next. These names are used in place of the ADR's "slice" numbers so the doc reads on its own.
- Shipped today — merged and running as of 2026-05-19. The runtime spawns connector runs with no credentials, so scheduled scans against real tenants cannot run yet.
- Env-broker phase — in flight, not merged. A credential broker resolves each tenant's secrets from the VM's environment, per run. This is the change that unblocks real scans. (ADR-027 Slice 1.)
- Key Vault broker phase — deferred, no work in progress. The broker fetches per-tenant secrets from Key Vault at run time, so the VM never holds the full set of secrets at rest. The production-grade target. (ADR-027 Slice 5.)
1. Big picture
A connector run is a process spawned on our Azure VM that calls a customer's source-system API over the public internet. The customer issues credentials to us out of band (an IAM key, an Entra app-registration secret, a ServiceNow OAuth client) and we store them on our side. Nothing about the connector model requires access to the customer's VMs or private networks — every connector is API-first.
Two things the diagram can't show:
- We pull; the customer never pushes. Customer credentials live in our infrastructure even though we're scanning their environment — issued to us once at onboarding.
- Outbound only, via one egress IP. All three customer-API surfaces are reached from a single shared outbound address — the NAT egress IP that customers allowlist in their firewalls (see the Azure VM landing-zone runbook). There is no inbound path: customer firewalls only need to permit our egress IP to call their public APIs.
What lives where
| Thing | Where it lives | Sensitivity |
|---|---|---|
| Customer's source-system credentials (AWS key, Entra app secret, ServiceNow creds) | Key Vault → VM environment | High — per-tenant secret |
| Our platform's own secrets (WorkOS key, database URI, …) | Same Key Vault → VM environment | High — platform-wide |
| Per-tenant connector config (which scopes, schedule, allow-list) | Database (operational config) | Low — not secret |
| Per-tenant credential reference (a pointer: provider + name) | Database (connector config) | Low — just a pointer |
| Per-tenant credential value | Never in the database. Only in Key Vault → VM environment → the run, for the duration of one scan. | High |
The reference/value split is what makes the design safe to audit: the database can be read by every platform process, but the credential value can be read only by the broker, only inside the run, only for the duration of one scan.
2. Credential delivery chain
This is the section the rest of the doc anchors to — it's the question that keeps coming up. Both the env-broker phase and the Key Vault broker phase use the same Key Vault; they differ on when the secret leaves it and which component reads it.
2a. Env-broker phase — deploy-time delivery (in flight)
Key properties of the env-broker phase:
- Application logic never touches raw secret values; only the broker does. The broker resolves credentials inside the run's driver, so the scan handler, HTTP paths, and evaluators never see them. The precise boundary: the driver runs in-process with the worker, so the credential bundle does sit in the worker's memory for one run's duration. The guarantee is "never logged, never persisted, never exposed to application/HTTP code, never crosses tenants" — not "no process ever holds it." A worker compromised mid-run can still observe that one run's bundle.
- Security caveat — this phase materializes ALL tenant secrets in the VM environment at once. The broker reads the VM's environment, which a boot-time secrets-fetcher (a startup script that copies Key Vault secrets into the VM's environment) fills with every tenant's credentials. A crash dump, debug endpoint, or process compromise exposes every tenant's credentials, not just the one being scanned.
- Posture: acceptable for dev / demo / a few design partners only. Not a production-grade control.
- Why it fails audit: SOC 2 CC6.1/CC6.6 and ISO 27001:2022 A.8.15, A.5.15/A.5.17/A.5.18 all require a per-tenant access boundary; a flat all-tenant environment bundle has none.
- Production bar: the Key Vault broker phase or federation (runtime per-tenant resolution, so the VM environment never holds the full set) plus an audit-log entry per resolve.
- A customer-facing security statement must say plainly that this flat-environment delivery is non-production.
- The run's environment is the credential boundary toward the connector. The connector run gets a small OS allowlist (PATH, HOME, locale) plus the resolved bundle — nothing else from the platform's own environment crosses the boundary. So a compromised connector can't read the platform's own secrets (its WorkOS key, its database URI, and so on).
- The broker computes the lookup key from the verified tenant, not from the customer-supplied reference. A misconfigured reference cannot accidentally cross tenants — the tenant portion of the key is derived from the verified tenant; the customer-supplied reference only contributes the trailing name. See §4 (and its caveat for federation providers).
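The last two properties above can be sketched as follows. This is a minimal illustration, not the broker's actual code: the key format `SV0__<tenant>__<name>`, the exact allowlist contents, and the function names are all assumptions made for the example.

```python
from typing import Iterable, Mapping

# Hypothetical OS allowlist; the real list (PATH, HOME, locale) may differ.
OS_ALLOWLIST = ("PATH", "HOME", "LANG", "LC_ALL")


def lookup_key(verified_tenant: str, ref_name: str) -> str:
    """Build the environment key. The tenant part comes only from the
    verified tenant; the reference contributes just the trailing name."""
    return f"SV0__{verified_tenant}__{ref_name}"  # assumed key format


def build_run_env(
    platform_env: Mapping[str, str],
    verified_tenant: str,
    ref_names: Iterable[str],
) -> dict[str, str]:
    """Return the environment handed to one connector run."""
    # Start from the allowlist only, so platform secrets never cross over.
    run_env = {k: platform_env[k] for k in OS_ALLOWLIST if k in platform_env}
    for name in ref_names:
        key = lookup_key(verified_tenant, name)
        if key not in platform_env:
            # No fallback to another tenant's entry.
            raise KeyError(f"no secret for reference {name!r}")
        run_env[name] = platform_env[key]
    return run_env
```

Because the run environment is built up from an empty dict rather than filtered down from the platform's environment, a newly added platform secret is excluded by default.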
2b. Key Vault broker phase — runtime delivery (deferred)
What changes in the Key Vault broker phase:
- Secrets do not sit in the VM environment at rest. A compromise of the VM file system never reveals customer credentials — only an active run holds them, and only for one scan.
- Rotation takes effect on the next scan (no redeploy).
- The Key Vault layout is a structured per-tenant path — operationally more expensive (one path and access policy per tenant per connector kind) but with structural tenant isolation at the vault layer.
- The platform code path that resolves credentials is identical between the two phases — the broker interface absorbs the difference.
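One way to picture "the broker interface absorbs the difference" is a single interface with two implementations. The sketch below is hypothetical throughout: the class names, the `resolve` signature, the key and path formats, and the injected `fetch_bundle` callable (standing in for a real Key Vault client) are assumptions, not the ADR-027 interface.

```python
from typing import Callable, Mapping, Protocol


class CredentialBroker(Protocol):
    def resolve(self, verified_tenant: str, connector_kind: str, ref_name: str) -> str:
        ...


class EnvBroker:
    """Env-broker phase: secrets were copied into the environment at deploy time."""

    def __init__(self, environ: Mapping[str, str]) -> None:
        self._environ = environ

    def resolve(self, verified_tenant: str, connector_kind: str, ref_name: str) -> str:
        # Assumed flat key format; the tenant part is the verified tenant.
        return self._environ[f"SV0__{verified_tenant}__{connector_kind}__{ref_name}"]


class VaultBroker:
    """Key Vault broker phase: the bundle is fetched at run time."""

    def __init__(self, fetch_bundle: Callable[[str], Mapping[str, str]]) -> None:
        self._fetch_bundle = fetch_bundle  # hypothetical vault read, injected

    def resolve(self, verified_tenant: str, connector_kind: str, ref_name: str) -> str:
        # The path is keyed by tenant and connector kind only; the reference
        # selects within the returned bundle and never shapes the path.
        bundle = self._fetch_bundle(f"tenants/{verified_tenant}/{connector_kind}")
        return bundle[ref_name]


def credential_for_run(
    broker: CredentialBroker, tenant: str, kind: str, ref_name: str
) -> str:
    """Platform code path: identical regardless of which broker is wired in."""
    return broker.resolve(tenant, kind, ref_name)
```

Swapping phases then means changing which broker is constructed at startup, while `credential_for_run` and its callers stay untouched.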
2c. Shipped today — what actually runs now
Today the runtime starts a run with no credential source, so it gets only the OS allowlist and zero source-system credentials. Scheduled scans cannot run against any real tenant today — this is the literal hard blocker that the env-broker phase closes. The demo seed path sidesteps this by submitting a pre-extracted graph over HTTP from a developer laptop, where the connector ran with local credentials.
3. Per-scan runtime sequence
What happens between "scheduler tick" and "the customer's API returns data" — for one scan, end-to-end. This is the cascade that ADR-027 turns into a fully observable pipeline.
The output contract: connectors write their graph and per-category results to a temp directory the driver provides; stdout/stderr are captured for logs only, never parsed for data.
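The output contract might look like this from the driver's side. It is a sketch under stated assumptions: passing the temp directory as the last command-line argument, the `graph.json` file name, and the 60-second timeout are all hypothetical choices for the example.

```python
import json
import subprocess
import tempfile
from pathlib import Path
from typing import Mapping, Optional, Sequence, Tuple


def run_connector(
    argv: Sequence[str], run_env: Mapping[str, str]
) -> Tuple[Optional[dict], str]:
    """Run one connector; return (graph, captured log text)."""
    with tempfile.TemporaryDirectory() as out_dir:
        proc = subprocess.run(
            [*argv, out_dir],      # assumed: output dir passed as last argument
            env=dict(run_env),     # only the allowlist plus the resolved bundle
            capture_output=True,
            text=True,
            timeout=60,            # hypothetical max-runtime budget
        )
        graph_path = Path(out_dir) / "graph.json"  # assumed file name
        # Data comes only from the file the connector wrote.
        graph = json.loads(graph_path.read_text()) if graph_path.exists() else None
    # stdout/stderr are kept for logs and never parsed for data.
    return graph, proc.stdout + proc.stderr
```

Reading results from a file rather than stdout means a stray debug print in a connector cannot corrupt the graph.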
Concurrency in one sentence: at most one scan is in flight per (tenant, scope) at a time. One worker holds a given scope, while multiple tenants run in parallel up to the in-memory queue's capacity. Today there is one VM and one worker process, so the in-memory queue is sufficient. ADR-022 specifies an HA prod fleet across two zones; once the platform runs on more than one worker, the in-memory queue no longer suffices and a database-backed job queue becomes required (filed as a follow-up).
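The one-in-flight rule can be sketched with an in-memory claim set. The class and method names are hypothetical, and the sketch only holds for a single worker process, which matches the current deployment described above.

```python
import threading
from typing import Set, Tuple


class InFlightScopes:
    """Tracks which (tenant, scope) pairs currently have a running scan."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._held: Set[Tuple[str, str]] = set()

    def try_claim(self, tenant: str, scope: str) -> bool:
        """Return True if the caller now holds the scope, False if it is busy."""
        with self._lock:
            key = (tenant, scope)
            if key in self._held:
                return False
            self._held.add(key)
            return True

    def release(self, tenant: str, scope: str) -> None:
        """Free the scope once the run reaches a terminal status."""
        with self._lock:
            self._held.discard((tenant, scope))
```

State held in process memory like this is exactly what stops working with a second worker, which is why a database-backed queue becomes necessary at that point.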
4. Tenant isolation model
The structurally hard question: "can a misconfigured credential reference on tenant A accidentally read tenant B's credentials?" The answer is no, by construction — and the construction is small enough to walk through in one diagram.
Invariants the broker enforces:
- The tenant is re-derived from the database relation chain, not trusted from the job message. The enqueued scan message carries a tenant, but the handler must not pass it to the broker blindly — a poisoned or buggy message would otherwise resolve another tenant's namespace. Instead every lookup is scoped by the verified tenant: the connector instance, scan scope, and scan run are each read tenant-scoped, so a row whose tenant differs is treated as not-found before any credential resolution. The message tenant is a hint; the database chain is authoritative.
- The tenant portion of any lookup key is derived from the verified tenant — the row's tenant, not the reference.
- The reference name is validated against a strict character class and rejected if it contains path separators or the literal substring `tenant`.
- Only the matched secrets are returned — no fallback to "if this tenant's secret is missing, try another's."
- The bundle is returned by reference but never persisted to the database, logs, or files. It exists only in the platform process's memory and the run's environment, both bounded by the scan's lifetime.
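A minimal sketch of the first and third invariants follows. The table and field names, the character class, and the choice to treat a disagreeing message tenant as not-found are assumptions for illustration; the source specifies only that the database chain is authoritative.

```python
import re
from typing import Mapping

# Hypothetical strict character class; it excludes "/" and "\" by construction.
REF_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def validate_ref_name(name: str) -> str:
    """Reject names outside the character class or containing 'tenant'."""
    if not REF_NAME.fullmatch(name) or "tenant" in name:
        raise ValueError(f"invalid credential reference name: {name!r}")
    return name


def derive_verified_tenant(db: Mapping[str, Mapping[str, dict]], message: dict) -> str:
    """Walk run -> scope -> instance; every row must carry the same tenant."""
    run = db["scan_runs"].get(message["scan_run_id"])
    if run is None:
        raise LookupError("scan run not found")
    tenant = run["tenant"]
    scope = db["scan_scopes"].get(run["scope_id"])
    if scope is None or scope["tenant"] != tenant:
        raise LookupError("scan scope not found")  # tenant mismatch == not found
    instance = db["connector_instances"].get(scope["instance_id"])
    if instance is None or instance["tenant"] != tenant:
        raise LookupError("connector instance not found")
    # Assumed policy: a message whose hint disagrees with the chain is rejected.
    if message.get("tenant") != tenant:
        raise LookupError("message tenant does not match database chain")
    return tenant
```

The value returned here, never `message["tenant"]`, is what would be passed to the broker.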
Equivalent invariants for the Key Vault broker phase: same construction, applied to vault paths. The path is keyed by the verified tenant and connector kind (the reference selects within the returned bundle; it does not shape the path), consistent with §2b. Important: vault access policy is not, by itself, tenant isolation here. The VM's single Managed Identity can read every tenant's secrets (one identity, all tenants), so a path-construction bug could still read any tenant's secret — the path convention is the only thing keeping them apart. Real vault-layer isolation would require per-tenant vaults or per-tenant managed identities; that is a deliberate non-goal at current scale, called out so no one mistakes the shared-identity design for hard isolation.
Federation providers — the lookup-key argument does NOT apply. Federation is where the customer grants us a role or app identity instead of handing over a secret (AWS assume-role, an Entra multi-tenant app, a GitHub App). There's no key to derive; the "lookup" is which role to assume, which Entra tenant to mint a token for, or which app installation to act as. The dangerous value is the binding (the role identifier, the Entra tenant ID, the app installation ID) stored on the connector instance. Isolation rests on two things instead (this is the canonical statement; the credential-exchange research doc points here):
- Server-controlled binding selection. The broker picks the binding by verified tenant (re-derived per invariant 1), not by a customer-supplied reference — so a customer cannot point their instance at another customer's role and have it resolve.
- Customer-side trust scoping. The customer's role trusts our account only with their external ID; their Entra admin-consent grants only their tenant; their GitHub install covers only their org.
The diagram above is the environment / vault path; federation is governed by these two plus invariant 1.
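For the AWS assume-role case, server-controlled binding selection could be sketched as below. The binding store shape, field names, and session-name format are hypothetical; `RoleArn`, `ExternalId`, and `RoleSessionName` are the parameter names of the STS AssumeRole call. Note that the function accepts no customer-supplied reference at all.

```python
from typing import Dict, Mapping


def assume_role_params(
    bindings_by_tenant: Mapping[str, Mapping[str, str]],
    verified_tenant: str,
) -> Dict[str, str]:
    """Pick the role binding by verified tenant and build AssumeRole arguments."""
    binding = bindings_by_tenant.get(verified_tenant)
    if binding is None:
        raise LookupError("no federation binding for this tenant")
    return {
        "RoleArn": binding["role_arn"],
        # The customer's trust policy accepts our account only with this value.
        "ExternalId": binding["external_id"],
        "RoleSessionName": f"sv0-scan-{verified_tenant}",  # assumed naming
    }
```

Even if a binding row were wrong, the second control still applies: the other customer's role would refuse the call because the external ID would not match its trust policy.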
What we explicitly do NOT rely on for isolation:
- Network segmentation between tenants (there isn't any — all tenants share the same VM).
- Run sandboxing (a connector run can read its own environment; it just doesn't have access to other tenants' secrets because the broker never put them there).
- Vault access policy with the shared Managed Identity (see above — it's a path convention, not a per-tenant grant).
- Per-tenant credentials baked into the binary (the binary is shared; isolation is at the credential layer above it).
5. Failure & retry topology
What happens when something breaks. The platform's failure modes split cleanly into transient (network blip, rate limit, 5xx) and persistent (bad creds, scope unreachable, code error) — and ADR-027 wraps each pipeline stage with a deterministic retry on the transient class only.
The diagram tracks one scan run (the scope stays active throughout — see §3). The run statuses the runtime emits are running / succeeded / partial / failed / timeout; there is no ready or claimed status.
What each failure class produces:
| Class | Resulting status | Examples | Behavior | Operator action |
|---|---|---|---|---|
| Transient | running → (retry) → failed | DB blip, AWS throttle, HTTP 5xx | retry 1s/2s/4s, then failed if still failing | Usually none — auto-recovers next tick |
| Partial | partial | Some categories extracted, others failed | Run completes; the failed categories are recorded; downstream stages proceed on what succeeded | Inspect per-category errors; usually a scope-permission gap on one service |
| Persistent | failed | Bad creds, scope unreachable, permission denied | Immediate failed; cooldown gate prevents immediate retry next tick | Fix creds / scope; manual retry via UI (a deferred follow-up) |
| Timeout | timeout | Run exceeded its max-runtime budget | Killed; status set to timeout | Check connector logs; tighten the budget if the run is pathological |
| Orphan (a condition, not a status) | stuck at running | Worker died mid-scan | Reaper sweeps after the runtime budget and marks timeout (not yet built) | None once the reaper ships; until then, manual cleanup |
| Code error | failed | A rule throws during evaluation | Propagates as failed; rule isolation is a follow-up | File bug; re-run after fix |
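The transient and persistent rows above can be sketched as a small retry wrapper. The exception classes and function name are hypothetical; the fixed 1s/2s/4s schedule and the rule that only transient failures are retried come from the table.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRY_DELAYS_SECONDS = (1, 2, 4)  # fixed schedule, no jitter or adaptation


class TransientError(Exception):
    """Network blip, rate limit, HTTP 5xx."""


class PersistentError(Exception):
    """Bad credentials, unreachable scope, permission denied."""


def run_stage(stage: Callable[[], T], sleep: Callable[[float], None] = time.sleep) -> T:
    """Run a pipeline stage, retrying transient failures on the fixed schedule."""
    for delay in RETRY_DELAYS_SECONDS:
        try:
            return stage()
        except TransientError:
            sleep(delay)
        # PersistentError (and anything else) propagates immediately.
    # Final attempt after the last delay; a failure here marks the run failed.
    return stage()
```

Injecting `sleep` keeps the schedule testable without real waiting, which fits the determinism rule mentioned in this section.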
What we deliberately don't do:
- Adaptive / probabilistic backoff. Fixed 1s/2s/4s schedule. The platform's determinism rule applies — no ML-style retry curves.
- Cross-tenant impact analysis. A failed scan on tenant A does not back off scans for tenant B; each `(tenant, scope)` has its own cooldown timer.
- In-process recovery of evaluator code errors. The evaluator currently fails the whole evaluation if a rule throws — per-rule isolation is tracked as a follow-up, not in any current phase.
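The per-scope cooldown could be sketched as a gate keyed by the pair. The class name and the cooldown length are assumptions (the source does not state a duration); the only property carried over from the text is that one pair's failure never delays another pair.

```python
from typing import Dict, Tuple


class CooldownGate:
    """Blocks re-running a (tenant, scope) for a while after a persistent failure."""

    def __init__(self, cooldown_seconds: float) -> None:
        self._cooldown = cooldown_seconds  # hypothetical, e.g. several ticks
        self._blocked_until: Dict[Tuple[str, str], float] = {}

    def record_persistent_failure(self, tenant: str, scope: str, now: float) -> None:
        self._blocked_until[(tenant, scope)] = now + self._cooldown

    def may_run(self, tenant: str, scope: str, now: float) -> bool:
        # Keyed by the pair, so tenant A's failure never touches tenant B.
        return now >= self._blocked_until.get((tenant, scope), 0.0)
```

Passing `now` explicitly, rather than reading the clock inside the gate, keeps the behavior deterministic for a given sequence of ticks.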
6. Cloud topology
This section is reserved for an Excalidraw diagram showing the spatial layout of resources — Azure subscription boundaries, resource groups, the platform VM, Key Vault, NAT egress IP, Cloudflare tunnel, target customer cloud accounts — at a level a CTO can scan in 15 seconds.
Status: deferred to a follow-up PR. The shape will be a SecurityV0 cloud-topology Excalidraw (docs/architecture/diagrams/connector-runtime-topology.excalidraw + exported .png), following the repo convention used by sv0-system-overview.excalidraw. Mermaid flowcharts don't render spatial layout well enough for this — it's the one place the existing Excalidraw + PNG pair pays off.
Reference inputs for the eventual diagram:
- Resource group layout — runbook 12 §"Resource groups" (`rg-sv0-bootstrap`, `rg-sv0-shared`, `rg-sv0-staging`, `rg-sv0-prod`, `rg-sv0-dev`).
- Key Vault inventory — `kv-sv0-staging`, `kv-sv0-prod`, `kv-sv0-dev` (all in `rg-sv0-shared`).
- Network topology — single VNet, public NAT egress IP (the one customers allowlist), Cloudflare Tunnel for inbound platform traffic (`app.securityv0.com`, `staging.securityv0.com`, `dev-azure.securityv0.com`).
- VM identity — a User-Assigned Managed Identity that can read per-tenant secrets (the flat per-tenant entries in the env-broker phase, and the structured per-tenant paths in the Key Vault broker phase).
- Customer side — for each tenant: their cloud accounts / Entra tenant / ServiceNow instance, reached only over public APIs from our NAT egress IP.
7. Cross-cuts
A few questions that don't belong to any one diagram but recur in reviews.
7a. Why is this Azure-centric? Can we move to AWS later?
The runtime infrastructure is Azure (per ADR-022). The connector model itself is cloud-agnostic — connectors call public APIs over HTTPS. If we moved the platform to AWS:
- Key Vault → AWS Secrets Manager. ADR-027 already reserves a Secrets Manager provider in the broker interface; the implementation is a small adapter mirroring the Key Vault broker.
- Managed Identity → an IAM role for the VM/Lambda/ECS task. Same shape (workload-bound identity, no static credentials).
- The driver, scheduler, ingest pipeline, and connector binaries don't change.
The Azure-specific surface is small enough that "cloud-portable by construction" (ADR-023 §1) holds at the connector layer too.
7b. Why not WorkOS for connector auth?
WorkOS is the IdP for Layer-2 platform-user auth — who calls our API. It mints tokens that we verify. It is not a credential store for third-party APIs: WorkOS cannot issue tokens that AWS, Microsoft Graph, or ServiceNow will accept. Each source system has its own IdP (AWS IAM, Entra, ServiceNow), and the customer must hand us source-system-issued credentials. WorkOS and the connector credential broker solve disjoint problems.
7c. Read-only invariant
Connectors never write to source systems — by convention at the driver/binary contract, and because the connector code only issues read calls. But the real boundary is the scope of the customer-issued credential, not our code. Our "read-only" promise is a property of connector behavior; a compromised connector holding a write-capable credential is not stopped by any platform control. The durable control is least-privilege on the customer side, enforced in the onboarding templates (which should be linted to stay read-only):
- AWS role → read-only managed policies; session policy caps where supported.
- Entra app → only read-scoped Graph permissions.
- GitHub App → only read permissions.
Per-connector minimum scopes are in the credential-exchange research doc §6.
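A lint for the AWS onboarding template might look like the sketch below. It is an assumption-heavy illustration: the read-prefix heuristic (`Get`, `List`, `Describe`) is a simplification chosen for the example, it ignores `NotAction`, and its case-sensitive match can flag valid lowercase read actions. The `Statement`, `Effect`, and `Action` keys are the standard IAM policy document elements.

```python
from typing import List

# Assumed heuristic: an action counts as read-only if its verb starts with one of these.
READ_PREFIXES = ("Get", "List", "Describe")


def non_read_actions(policy: dict) -> List[str]:
    """Return every allowed action that does not look read-only."""
    flagged: List[str] = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        for action in actions:
            verb = action.split(":", 1)[-1]  # "s3:PutObject" -> "PutObject"
            if not verb.startswith(READ_PREFIXES):
                flagged.append(action)  # also catches "*" and "s3:*"
    return flagged
```

A template check would fail the build when the returned list is non-empty, erring toward false positives rather than letting a write action through.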
Carve-out: platform-initiated remediation tickets (Jira, GitHub, ServiceNow ticket creation) are a separate outbound path that does not go through the connector framework — see ADR-019. Same VM, different code path, different credentials.
7d. What does the operator have to do per new tenant?
Once the env-broker phase lands, onboarding a tenant is three steps:
- The operator gets the customer to provision access the right way per connector. For AWS / Entra / GitHub, the customer runs the federation bootstrap (CloudFormation / admin-consent / app install) and only an identifier comes back (the role identifier, tenant ID, installation ID) — nothing secret. For connectors that genuinely need a secret (ServiceNow, etc.), the secret should reach us via the paste-into-portal flow (credential-exchange research §4), not email or Slack. The interim design-partner reality is that an operator may still handle a secret directly during onboarding — see the honesty note below.
- The credential material lands: an identifier on the connector instance for federation; or, for secret-bearing connectors, the value written to Key Vault (delivered into the VM environment at deploy in the env-broker phase; resolved at run time in the Key Vault broker phase).
- The operator registers the connector instance and a scan scope via the admin API to schedule it.
Honesty note on "no staff sees the secret." That guarantee is a property of the paste-into-portal flow (research §4), which assumes a customer-authenticated admin. During operator-mediated design-partner onboarding — where an SV0 operator sets up the instance on the customer's behalf — the operator may handle a ServiceNow-style secret directly, and the no-staff-visibility claim does not apply to that phase. Federation connectors (AWS / Entra / GitHub) avoid this entirely because no secret is exchanged. Closing the gap for secret-bearing connectors requires a customer-authenticated one-time credential-intake link; tracked as a follow-up.
In the Key Vault broker phase, step 2 becomes "write the secret to Key Vault under the tenant's path" with no redeploy, and the broker picks up the new path on the next scan. The other steps are unchanged.
References
Architecture
- `05-connectors.md` — connector interface contract (the WHAT).
- `08-reference-impl-entra-servicenow.md` — reference connector walk-through.
- `13-authentication-and-user-management.md` — Layer-2 platform-user auth (disjoint from this doc).
Decisions
- ADR-018 — deploy security posture (docker-group accepted pre-managed-platform).
- ADR-022 — Azure compute landing zone (Tier-3 dev/prod VMs).
- ADR-023 — authentication target architecture (four-tier).
- ADR-024 — Azure deploy lifecycle (how secrets reach the VM).
- ADR-027 — automated connector pipeline (credential broker, pipeline-run root, deploy-gate). The "slice" numbering this doc translates into phase names comes from here.
Runbooks
- `12-azure-vm-landing-zone.md` — Key Vault provisioning, Managed Identity setup, secrets-fetcher pattern.
Research
- `2026-04-22-connector-control-execution-architecture.md` — original Stream-1 architecture (scheduler/driver/queue).
- `2026-05-19-automated-connector-pipeline-audit.md` — companion audit to ADR-027.
Honored North Star clauses
- C-13 (SIEM landing supported, not a SIEM console). The credential model documented here is what makes "scheduled scans against real tenants" possible at all — without the broker, the SIEM-cold analyst lands on stale data because no scan has ever run.
- C-15 (`LOCKED-IN-CODE` path differentiability). The runtime infrastructure documented here is the deterministic substrate the deploy-gate (ADR-027) operates on — visible, observable, idempotent.