Autonomous Scans + Built-in Validation — Strategy
Date: 2026-04-23
Tracking issue: sv0-documentation#200
Scope: sv0-platform, sv0-connectors, sv0-documentation, sv0-demo-labs
PR shape: this doc ships on its own. Implementation PRs are sequenced against the issues listed in §7 (New GitHub issues).
0. Relationship to #199 (source-of-truth for streams)
sv0-documentation#199 (the multi-account-e2e umbrella + four substream docs dated 2026-04-22) is the source-of-truth for the sequencing of:
- Stream 1: per-tenant connector control (`ConnectorInstance`, first-class persisted `ScanScope`, `ScanRun` history, in-process Mongo-claim scheduler, HTTP API for `up`/`scan`/`teardown`).
- Stream 2: multi-account AWS connector (Organizations auto vs `--accounts`, 12 service categories, `(account × category)` unit-of-work).
- Stream 3: cross-connector graph stitching (new `stitch_ingestion` worker that runs post-sync, pre-evaluate, debounced).
- Stream 4: MediaPro Lab 2 (validation gate).
This document (#201) does not re-scope any of the above. It adds only three net-new things on top of #199:
- In-process validator module: a deterministic set of invariants called from the ingestion pipeline. Reuses #199 Stream 1's `ScanRunDoc` for persistence (no parallel `validation_findings` collection); moves into #199 Stream 3's `stitch_ingestion` worker once that lands.
- Canary tenant: a reserved `tenant_id = "canary"` with a frozen fixture, scheduled post-deploy ingest, and structural-diff vs a checked-in golden graph. Catches the regression class that invariants can't, because no one wrote a rule for the field that silently got renamed.
- Azure VM as T3 compute target: commits the Azure-or-AWS fork from sv0-platform#493 to Azure VM; keeps Docker-Compose-on-Linux-VM so future cloud flips stay a DNS + SSH-target change.
If this doc and #199 disagree on scheduling, scan-scope abstractions, or worker pipeline order — #199 wins. Any reference here to ScanScopeDoc, ScanRunDoc, ConnectorInstanceDoc, or stitch_ingestion is a reuse, not a redefinition.
1. Why now
Three forces are converging before the MediaPro pilot (early May):
- Manual scans don't survive the first real tenant. Every connector today is a human-invoked CLI (`sv0-aws scan --submit`, `entra-servicenow --all --json`). The platform's worker queue is in-memory: restarts lose idempotency state (the `processedSyncIds` Set in `src/ingestion/transport/ingest-service.ts`), and there's no retry, no dead-letter, no tenant fan-out. A paying client cannot depend on a human to remember to run a CLI every 30 minutes.
- We can't self-check our own output. Sergey flagged that per-path `execution_30d` numbers were overstated because the platform pulled workload-wide evidence on non-AWS paths. The fix landed in `f6fb686` (closing sv0-platform#497 and #498); sv0-platform#501 (GROUND_TRUTH-vs-proxy tier selection) and sv0-documentation#196 (fidelity-doc reconciliation) remain open. Nothing in the platform cross-checks "UI says 8, evidence tier is not GROUND_TRUTH, connector reported 0 per-destination rows", so the bug only surfaced when a human read the numbers. The next such bug will hit a paying client first.
- An Azure hosting lane is available for free. Microsoft for Startups / Azure credits are already active. Mercury → AWS Activate is still in evaluation. Running two candidate hosting lanes doubles infra-engineering load for no benefit; we commit to Azure VM as the platform-hosting target and close the AWS-or-Azure fork in sv0-platform#493. The AWS sandbox account stays, but only as a target the AWS connector scans, not as a hosting option.
This document sequences in-flight work across three pillars and names the one net-new component: a deterministic cross-validation layer that would have caught the execution_30d bug automatically.
2. Three pillars
Pillar A — Autonomous scan operations
Connectors run on a schedule without human invocation, per tenant, with retries and idempotency that survive process restarts.
Scheduling, scan-scope persistence, and the up/scan/teardown HTTP surface are owned by #199 Stream 1 — this doc does not redesign them. Systemd timers, connector_schedules collections, and any parallel scheduler are explicitly out of scope here.
| Component | Status | Reference |
|---|---|---|
| Containerize each connector (Dockerfile + GHCR publish) | New | New infra issue in this doc |
| In-process Mongo-claim scheduler (`ScanScopeDoc.next_run_at`, atomic `findOneAndUpdate` as lock) | Designed | #199 Stream 1: `2026-04-22-connector-control-execution-architecture.md` |
| `ConnectorInstanceDoc` + `ScanScopeDoc` + `ScanRunDoc` schema | Designed | Same |
| HTTP API for `up` / `scan` / `teardown` (tenant onboarding as an API call, not SSH + templated unit files) | Designed | Same |
| Tiered secrets (1Password Business via Mercury → SOPS+age fallback) | Planned | `2026-03-31-infrastructure-strategy.md` §3 Secret Management |
| Persistent idempotency + retry + DLQ (replace in-memory `processedSyncIds` Set at `src/ingestion/transport/ingest-service.ts`) | New | Rolled into the scheduler work under #199 Stream 1 |
| Circuit breaker + scan scope + rollback (global deletion threshold; `removed_by_sync_id`) | Designed | `2026-02-26-scan-safety-and-observability.md` Phases 1–2, 6 |
| Non-AWS canonical `target_resource_key` on execution evidence | Open | sv0-connectors#91 |
Why Mongo-claim and not systemd timers (decided by #199 Stream 1; captured here for the reader):
- Multi-host growth. Azure VM cutover may grow to 2+ hosts (at minimum host + warm standby). systemd timers don't coordinate across hosts; MongoDB's atomic `findOneAndUpdate` is already the lock.
- Tenant onboarding as an API call. Adding a tenant should be `POST /api/v1/connector-instances`, not SSH + template a unit file + `systemctl enable`. Cadence changes are a DB update, not editing files on the host.
- Composes with #199's HTTP scan API. Stream 1 designs `up`/`scan`/`teardown` as HTTP endpoints; systemd has no HTTP surface.
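The claim step argued for above can be sketched as one conditional update. This is a minimal illustration, not the design from #199 Stream 1: only `next_run_at` is named in this doc, while `claimed_by`, `claimed_until`, the lease length, and both helper names are hypothetical. `buildClaim` shows the filter and update one might pass to the driver's `findOneAndUpdate`; `claimNext` is an in-memory stand-in that mimics the same match-then-update semantics so the behaviour can be exercised without a database.

```typescript
// Hypothetical document shape: only next_run_at comes from this doc.
interface ScanScopeClaim {
  _id: string;
  next_run_at: Date;
  claimed_by?: string;   // assumption: worker currently holding the lease
  claimed_until?: Date;  // assumption: lease expiry, so a crashed worker frees the scope
}

// Filter + update that a scheduler loop could hand to findOneAndUpdate(filter, update).
function buildClaim(now: Date, workerId: string, leaseMs: number) {
  return {
    filter: {
      next_run_at: { $lte: now },
      $or: [{ claimed_until: { $exists: false } }, { claimed_until: { $lte: now } }],
    },
    update: {
      $set: { claimed_by: workerId, claimed_until: new Date(now.getTime() + leaseMs) },
    },
  };
}

// In-memory stand-in: find the first due, unleased scope and mark it claimed.
// On MongoDB the match and the update happen atomically on the server.
function claimNext(
  scopes: ScanScopeClaim[],
  now: Date,
  workerId: string,
  leaseMs: number,
): ScanScopeClaim | null {
  const nowMs = now.getTime();
  const due = scopes.find(
    (s) =>
      s.next_run_at.getTime() <= nowMs &&
      (s.claimed_until === undefined || s.claimed_until.getTime() <= nowMs),
  );
  if (!due) return null;
  due.claimed_by = workerId;
  due.claimed_until = new Date(nowMs + leaseMs);
  return due;
}
```

With two workers polling the same collection, only the first claim matches; the second sees `claimed_until` in the future and gets nothing back until the lease expires. Advancing `next_run_at` after a completed run is left out of the sketch.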
What lives where after T1 completes:
deploy host (Hetzner → Azure VM at T3)
├── docker-compose.deploy.yml (unchanged across cloud cutover)
│ ├── api :3000 — scheduler loop, scope-aware diff, circuit breaker,
│ │ in-process claim of ScanScopeDocs whose next_run_at <= now
│ ├── ui :8080 — same image
│ └── mongo :27017 (dev + QA only after T3 Phase 1 — Atlas takes prod/pre-prod)
│
└── connectors run as short-lived containers invoked by the api service
via Docker Engine socket (or `docker run` from the api container itself),
scheduled by the Mongo-claim loop — not by systemd, not by cron.
Docker Compose stays. No PaaS (no App Service, no AKS, no ECS). Future cloud flips are a DNS + host-key change, not a rewrite.
Pillar B — Built-in validation, QA, observability
Four sub-pillars. B.1 and B.2 are already-planned work being sequenced. B.3 and B.4 are new.
B.1 Observability stack (existing plan — sequence, don't re-plan)
Owned by sv0-platform#494 — Grafana Cloud + BetterStack + grafana/mcp-grafana, 5-day rollout, $0 at pilot scale. Deliverables:
- Grafana Alloy on the deploy host streams Prom scrape + Loki logs + node_exporter.
- BetterStack monitors `app.securityv0.com/api/v1/health` and `dev.securityv0.com/api/v1/health`.
- Read-only Grafana token wired as MCP server into `.claude/settings.json`, so agent sessions can query logs/metrics directly.
- Alerts: error rate > 5% for 5min, `sv0_queue_depth` > 100 for 10min, `sv0_sync_age_minutes` > 2× schedule, `/ready` 503 for 2min.
- Portable across the Hetzner → Azure VM cutover (Alloy moves as a systemd service; dashboards + alerts unchanged).
B.2 Scan safety (existing plan — ship it)
Owned by 2026-02-26-scan-safety-and-observability.md Phases 1, 4. Designed, reviewed, not shipped:
- Global (tenant-level) circuit breaker — no per-workload minimum floor; gates the entire destructive pipeline (entity deletion + materialization + authority path removal) as a single unit.
- Scan-health score (platform-derived from baseline — no connector self-reported inputs).
- `removed_by_sync_id` field for deterministic rollback.
- `scanScope` on `NormalizedGraph`; `sync_mode` derived from payload instead of hardcoded `"full"`.
T1 ships Phases 0–2. T2 ships Phase 4 (health score) alongside B.3.
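As a rough illustration of the global (tenant-level) breaker, the sketch below compares the entity ids already stored for a tenant against the ids in an incoming graph and trips when too many are absent. The function name, the `BreakerDecision` shape, and the default threshold of 0.5 are assumptions; the threshold is chosen only so that the §6 acceptance case holds (100% absence trips, 49% does not). The real threshold and baseline logic live in `2026-02-26-scan-safety-and-observability.md`.

```typescript
interface BreakerDecision {
  absenceRatio: number; // fraction of baseline entities missing from the incoming graph
  tripped: boolean;     // true => skip the whole destructive pipeline for this sync
}

// threshold is a hypothetical default, not a value taken from the scan-safety plan.
function evaluateGlobalBreaker(
  baselineIds: string[],
  incomingIds: string[],
  threshold = 0.5,
): BreakerDecision {
  // Nothing stored yet means nothing can be wrongly deleted.
  if (baselineIds.length === 0) return { absenceRatio: 0, tripped: false };

  const incoming = new Set(incomingIds);
  let absent = 0;
  for (const id of baselineIds) {
    if (!incoming.has(id)) absent++;
  }
  const absenceRatio = absent / baselineIds.length;
  return { absenceRatio, tripped: absenceRatio >= threshold };
}
```

Because the decision is tenant-wide, a tripped breaker gates entity deletion, materialization, and authority path removal together rather than per workload.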
B.3 Data cross-validation layer (NEW — the missing piece)
A deterministic set of invariants runs against every real ingest (every tenant, not just fixtures) and records pass/fail verdicts on the ScanRunDoc from #199 Stream 1. Default mode is warn (record + emit Prom counter + log event); per-rule, per-tenant opt-in to fail mode exists, so a single bad rule cannot block a prod scan.
Module layout:
- `src/ingestion/validators/`: pure functions `(NormalizedGraph, PriorScan?) => ValidationResult[]`. Each rule has `id`, `severity`, `mode: warn | fail` (default `warn`). No external framework (no Great Expectations, no SODA Core, no dbt: all assume a SQL warehouse context; our graph lives in Node memory pre-commit). Borrow the ideas (severity levels, rule catalog, per-rule verdicts) and write the code ourselves.
- Plug point today: called from `src/workers/handlers/sync-ingestion.ts` immediately before the final `logger.info("Sync ingestion completed", ...)` call at line 363, inside the same `try` block so any thrown validator error is caught by the existing handler at line 379.
- Plug point once #199 Stream 3 lands: moves into the `stitch_ingestion` worker as a step after materialization, before the `evaluate_findings` enqueue. Natural post-sync / pre-evaluate placement, debounced at the same 60s as stitching.
- Persistence: results live on `ScanRunDoc.validationResults[]` from #199 Stream 1. No parallel `validation_findings` collection, no net-new admin endpoint; `GET /api/v1/scan-runs/:id` surfaces them for free.
- Metrics: Prometheus counter `sv0_validation_findings_total{check_id,severity,tenant_id}`. We accept the `tenant_id` cardinality cost here intentionally, same pattern #494 is already working through for scan-age metrics. Grafana alert fires when `severity="critical"` rate > 0.
- No UI surface in this doc. Internal ops uses Grafana. A client-facing connector-status UI is a real product surface but belongs to a later product-scoped PR.
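One possible shape for the module, sketched under assumptions: the rule fields `id`, `severity`, and `mode` come from the list above, but the severity levels, the `ValidationResult` fields, the `modeOverrides` map (standing in for per-tenant opt-in), and the name `runValidators` are all hypothetical. The sketch shows the warn/fail split: every verdict is recorded, and only a failed verdict from a rule running in `fail` mode marks the run as blocked.

```typescript
type Severity = "info" | "warning" | "critical"; // assumption: levels are not enumerated in this doc
type Mode = "warn" | "fail";

interface ValidationResult {
  check_id: string;
  severity: Severity;
  passed: boolean;
  detail?: string;
}

interface Rule<G> {
  id: string;
  severity: Severity;
  mode: Mode; // "warn" unless a rule is explicitly promoted
  check: (graph: G, prior?: G) => ValidationResult[];
}

interface RunOutcome {
  results: ValidationResult[]; // would be written to ScanRunDoc.validationResults[]
  blocked: boolean;            // true only if a fail-mode rule reported a violation
}

function runValidators<G>(
  rules: Rule<G>[],
  graph: G,
  prior?: G,
  modeOverrides: Partial<Record<string, Mode>> = {}, // hypothetical per-tenant opt-in
): RunOutcome {
  const results: ValidationResult[] = [];
  let blocked = false;
  for (const rule of rules) {
    const mode = modeOverrides[rule.id] ?? rule.mode;
    try {
      const ruleResults = rule.check(graph, prior);
      results.push(...ruleResults);
      if (mode === "fail" && ruleResults.some((r) => !r.passed)) blocked = true;
    } catch (err) {
      // A crashing validator is recorded as a failed verdict but never blocks the scan.
      results.push({
        check_id: rule.id,
        severity: rule.severity,
        passed: false,
        detail: `validator threw: ${String(err)}`,
      });
    }
  }
  return { results, blocked };
}
```

Treating a thrown validator as non-blocking is a design choice of this sketch, made to match the stated goal that a single bad rule cannot block a prod scan.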
Invariants at T2 launch — each is a pure function over the just-ingested state. The execution_30d invariant uses actual schema field names and is specifically constructed to fail against the pre-fix data in test/integration/ingestion/non-aws-path-scoping.test.ts (the fixture that lands with the #497 fix). A pure existence check would incorrectly pass on that data, because the bad execution_30d counter was summed from real evidence rows attributed to the wrong paths — the bug is mis-scoping, not missing evidence.
# rule: execution_30d must be scoped to a destination and count-match
# severity: critical
# purpose: catches the exec_30d misattribution class (sv0-platform#497/498/501,
# sv0-documentation#196). Replay gate: must fire on the pre-fix fixture
# in non-aws-path-scoping.test.ts (pre-fix execution_30d=8 across
# destinations A+B, post-fix=3 per destination).
for each AuthorityPathDoc P where P.current_state.execution_30d > 0:
assert P.destination_resource_key IS NOT NULL # (scoping required)
let bucket_sum = SUM(e.execution_count)
for ExecutionEvidenceDoc e where
e.tenant_id = P.tenant_id
AND e.entity_id = P.workload_id # join key: workload, not path
AND e.resource_key = P.destination_resource_key # per-destination scoping
AND e.confidence IN (GROUND_TRUTH-tier confidence values per #501) # see note below
AND e.source_timestamp >= now - INTERVAL 30 days
assert bucket_sum = P.current_state.execution_30d # exact match, not existence
Schema notes (confirmed against src/domain/authority-paths/types.ts and src/domain/evidence/types.ts):
- `AuthorityPathDoc.destination_resource_key: string | undefined` exists as of the #497 fix and is the canonical destination key the materializer uses for per-path scoping.
- `ExecutionEvidenceDoc` links to entities via `entity_id`, not to paths; there is no `path_id` field. Join from path to evidence goes `P.workload_id → e.entity_id`, filtered by `e.resource_key = P.destination_resource_key`.
- `ExecutionEvidenceDoc` has `confidence: "DETERMINISTIC" | "TEMPORAL_INFERRED" | "STRUCTURAL"`, not a `tier` enum. The GROUND_TRUTH-vs-proxy distinction sv0-platform#501 is tracking is what will harden this rule's RHS. Until that lands, the rule uses `confidence = "DETERMINISTIC"` and will be tightened as #501 defines the canonical tier enum.
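The predicate and schema notes above translate fairly directly into a pure function. The sketch below is an illustration only: the interfaces carry just the fields the predicate reads, the `path_id` field on the path, the `ScopingViolation` shape, and the function name are assumptions, and the `confidence === "DETERMINISTIC"` filter is the interim stand-in described above until #501 defines the tier enum.

```typescript
type Confidence = "DETERMINISTIC" | "TEMPORAL_INFERRED" | "STRUCTURAL";

interface AuthorityPathDoc {
  path_id: string; // assumption: name of the path identifier
  tenant_id: string;
  workload_id: string;
  destination_resource_key?: string;
  current_state: { execution_30d: number };
}

interface ExecutionEvidenceDoc {
  tenant_id: string;
  entity_id: string;
  resource_key: string;
  confidence: Confidence;
  execution_count: number;
  source_timestamp: Date;
}

interface ScopingViolation {
  path_id: string;
  reason: "missing_destination" | "count_mismatch";
  claimed: number;       // value stored on the path
  evidence_sum?: number; // per-destination sum, when it could be computed
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function checkExecution30dScoping(
  paths: AuthorityPathDoc[],
  evidence: ExecutionEvidenceDoc[],
  now: Date,
): ScopingViolation[] {
  const cutoff = now.getTime() - THIRTY_DAYS_MS;
  const violations: ScopingViolation[] = [];
  for (const p of paths) {
    const claimed = p.current_state.execution_30d;
    if (claimed <= 0) continue;
    const dest = p.destination_resource_key;
    if (dest === undefined) {
      violations.push({ path_id: p.path_id, reason: "missing_destination", claimed });
      continue;
    }
    // Join workload -> evidence, scoped to this path's destination only.
    let bucketSum = 0;
    for (const e of evidence) {
      if (
        e.tenant_id === p.tenant_id &&
        e.entity_id === p.workload_id &&
        e.resource_key === dest &&
        e.confidence === "DETERMINISTIC" &&
        e.source_timestamp.getTime() >= cutoff
      ) {
        bucketSum += e.execution_count;
      }
    }
    // Exact match, not existence: mis-scoped sums still have real evidence behind them.
    if (bucketSum !== claimed) {
      violations.push({ path_id: p.path_id, reason: "count_mismatch", claimed, evidence_sum: bucketSum });
    }
  }
  return violations;
}
```

A path whose counter was summed across two destinations of the same workload fails with `count_mismatch`, even though evidence rows exist for it, which is the case a pure existence check would miss.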
Other invariants at T2 launch:
| Invariant | Violation example (real) | File |
|---|---|---|
| UI-facing aggregate count == list-endpoint count for the same query | Posture summary shows 29 active + 3 dormant = 32; list endpoint returns 30 (DQ1) | src/ingestion/validators/aggregate-consistency.ts |
| Every `AuthorityPath` with `execution_30d > 0` has a non-null `destination_resource_key` AND sum-matches evidence (predicate above) | Pre-fix fixture: path shows `execution_30d: 8`, real per-destination evidence sums to 3 | `src/ingestion/validators/execution-30d-scoping.ts` |
| `scanScope.scannedEntityTypes` ⊇ entity types actually present in the graph | Connector declares it scanned `["workload"]` but graph contains `identity` entities | `src/ingestion/validators/scope-vs-graph.ts` |
| Every `authority_path.workload_id` / `.identity_id` / `.destination_id` resolves to a live `entities` document | Dangling ID after a soft-delete cascade | `src/ingestion/validators/referential-integrity.ts` |
| Pagination-bounded counts match total counts (findings list breakdowns) | Severity/type breakdowns reflect current page, not total dataset (DQ2) | src/ingestion/validators/pagination-totals.ts |
No ML. No heuristics. No learned thresholds. Every check is a pure deterministic predicate (per AGENTS.md). "Threshold violations" (e.g., delta > X% from baseline) belong in the scan-safety circuit breaker, not here — this layer checks invariants, not anomalies.
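Two of the structural invariants from the table, scope-vs-graph and referential integrity, reduce to set operations. The sketch below is one way to write them as deterministic predicates; the `Entity` shape (including the `deleted` flag used to model "live"), the `PathRefs` type, and both function names are assumptions made for illustration.

```typescript
// Assumed minimal shapes; the real NormalizedGraph types will differ.
interface Entity {
  id: string;
  type: string;
  deleted?: boolean; // assumption: how a soft-deleted (non-live) entity is marked
}

interface PathRefs {
  workload_id: string;
  identity_id: string;
  destination_id: string;
}

// scope-vs-graph: entity types present in the graph that the connector never declared.
function undeclaredEntityTypes(scannedEntityTypes: string[], entities: Entity[]): string[] {
  const declared = new Set(scannedEntityTypes);
  const present = new Set(entities.map((e) => e.type));
  return Array.from(present).filter((t) => !declared.has(t)).sort();
}

// referential-integrity: ids referenced by paths that do not resolve to a live entity.
function danglingRefs(paths: PathRefs[], entities: Entity[]): string[] {
  const live = new Set(entities.filter((e) => !e.deleted).map((e) => e.id));
  const missing = new Set<string>();
  for (const p of paths) {
    for (const id of [p.workload_id, p.identity_id, p.destination_id]) {
      if (!live.has(id)) missing.add(id);
    }
  }
  return Array.from(missing).sort();
}
```

Both functions return an empty array when the invariant holds, so a non-empty result maps onto a failed verdict with the offending types or ids as detail.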
B.4 Canary tenant (pipeline-regression line of defense)
A dedicated qa.dev.securityv0.com env was an earlier version of this section. Two problems killed it:
- Cloudflare wildcard certs don't cover two-level subdomains. Our certs are `*.securityv0.com` and `*.dev.securityv0.com` as separate wildcards; `qa.dev.securityv0.com` sits one level deeper and would need its own cert or Access app per env. If we ever need a seed-corpus-before-dev promotion env, the naming convention is `qa1.securityv0.com` at the top level, matching the multi-instance-dev pattern already in use. Open an issue then; don't scope it here.
- A QA env catches synthetic-corpus bugs, not live-client-data bugs. The `execution_30d` class of regression only surfaces on a real connector + real tenant. Validation has to run on prod scans, not just staging fixtures. B.3 already does that.
What goes in B.4 instead is a canary tenant — synthetic monitoring / SRE canary applied to our actual pipeline:
- Reserved `tenant_id = "canary"`. Distinct DB documents but served by the same prod API and workers.
- Frozen fixture checked in under `test/canary/<fixture-name>.json`: a `NormalizedGraph` that exercises at least one path through each active rule family and tier.
- Post-deploy scheduled job (triggered by the Grafana Cloud alert rule already set up for deploy events in #494, or a lightweight GitHub Action) ingests the fixture through the real pipeline by POSTing to `/api/v1/ingest/normalized-graph` with `X-Tenant-Id: canary`.
- Structural-diff the resulting entities, paths, and evidence-pack sections against a checked-in golden under `test/canary/golden/<fixture-name>.json`. Any non-trivial diff is an incident: a field got renamed, a migration dropped data, a rule silently changed semantics, a dependency update shifted a computation.
- Emits `sv0_canary_drift_total{check}`; paged when > 0.
Canary is what invariants can't be: invariants only catch violations of rules we wrote; canary catches "we renamed a field and every tenant's count silently halved equally, uniformly, in-spec". No one writes a rule for that. The frozen golden is the rule.
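The structural diff itself can be small. The following is a minimal sketch, assuming the pipeline output and the golden file are both plain JSON with deterministic array ordering; the `Json` type, the function names, and the diff message format are hypothetical. It walks both trees and reports every path where a key is missing, a key is unexpected, an array length changed, or a leaf value differs.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isObject(v: Json): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns one human-readable line per difference; an empty array means no drift.
function structuralDiff(golden: Json, actual: Json, path = "$"): string[] {
  if (Array.isArray(golden) && Array.isArray(actual)) {
    const diffs: string[] = [];
    if (golden.length !== actual.length) {
      diffs.push(`${path}: length ${golden.length} -> ${actual.length}`);
    }
    const n = Math.min(golden.length, actual.length);
    for (let i = 0; i < n; i++) {
      diffs.push(...structuralDiff(golden[i], actual[i], `${path}[${i}]`));
    }
    return diffs;
  }
  if (isObject(golden) && isObject(actual)) {
    const diffs: string[] = [];
    const keys = Array.from(new Set(Object.keys(golden).concat(Object.keys(actual)))).sort();
    for (const k of keys) {
      if (!(k in actual)) diffs.push(`${path}.${k}: missing`);
      else if (!(k in golden)) diffs.push(`${path}.${k}: unexpected`);
      else diffs.push(...structuralDiff(golden[k], actual[k], `${path}.${k}`));
    }
    return diffs;
  }
  // Leaves, or a type change (e.g. object became array): compare directly.
  return golden === actual
    ? []
    : [`${path}: ${JSON.stringify(golden)} -> ${JSON.stringify(actual)}`];
}
```

A renamed field shows up as one `missing` line and one `unexpected` line at the same path, which is exactly the silent-rename case no invariant covers. The number of returned lines could feed the `sv0_canary_drift_total` counter.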
Promotion flow for seed corpora used by demos/labs remains: authors run `make validate` locally against a dev API (see the sv0-demo-labs entry in §4), which exercises the same validator module over the seeded state.
Pillar C — Azure VM pivot (committed for T3, portable by design)
Decision: T3 moves platform hosting to Azure VM. We do not lock in Azure at the platform level — the whole point of staying on Docker-Compose-on-a-plain-Linux-VM is that flipping to AWS VM (or any other IaaS) later is a DNS + SSH-target + MONGODB_URI change, not a rewrite. "Committed" here means "we pick one lane now so we're not maintaining two deploy paths in parallel", not "locked in for life".
Why Azure first:
- Azure credits are already usable. Mercury → AWS Activate is still in evaluation; that grant is probabilistic and delays T3 if we wait.
- Managing two candidate hosting lanes doubles deploy scripts, CI matrices, and on-call muscle memory for no strategic benefit at pilot scale.
Why Docker Compose on a VM (and not Azure PaaS / AKS / ECS / Fargate):
- The shape is identical across clouds. `docker-compose.deploy.yml` doesn't change when we move from Hetzner → Azure VM, nor would it change if we later move Azure VM → AWS VM.
- Any cloud that gives us an SSH-reachable Linux host with Docker installed is a valid target. This is a deliberate counter to vendor-lock-in: when credits expire, AWS gets competitive, or a client requires a specific region/jurisdiction, the migration cost is measured in hours, not sprints.
- No managed-PaaS primitives (App Service, Container Apps, AKS, ECS, Fargate, Cosmos DB for MongoDB) — each of those would bind us to its control plane and break this portability story.
What changes at T3 (split per environment — Mongo tiering matters):
| Artifact | prod / pre-prod | dev / QA |
|---|---|---|
| Deploy SSH target | deploy@<azure-vm-prod> | deploy@<azure-vm-dev> |
| `DEPLOY_HOST_KEY` secret | Azure prod VM host key | Azure dev VM host key |
| DNS | app.securityv0.com → prod VM | dev.securityv0.com, pr-N.dev.securityv0.com → dev VM |
| `MONGODB_URI` | `mongodb+srv://...atlas-m10-frankfurt...` (per #493 Phase 1) | `mongodb://mongo:27017/sv0_<instance>` (self-hosted in-compose, unchanged from today) |
| Backup | Atlas PITR + daily snapshot retention | Existing mongo-backup compose service (6h mongodump, local volume) |
| `docker-compose.deploy.yml`, Caddy, Alloy | Unchanged | Unchanged |
Atlas is the right fit for prod + pre-prod: managed auth, Point-in-Time Recovery, offsite backups, monitoring — all things a paying client will (reasonably) ask about. Self-hosted Mongo on the same Azure VM is the right fit for dev + QA: Azure credits cover the VM compute, durability is good-enough for non-prod, and keeping the compose mongo service in the deploy path exercises it as the authoritative shape (so dev ≈ prod in structure, differing only in the MONGODB_URI).
AWS is not going away — just not for platform hosting:
- `sv0-security-tooling` AWS account stays (per `2026-03-31-infrastructure-strategy.md` §4). The AWS connector needs a real AWS environment to scan; that's what the sandbox account is for.
- `sv0-demo` AWS account stays for demo scenarios.
- The management / billing / SCP work stays; those are target-side, not hosting-side.
sv0-platform#493's body is updated (separate small edit, tracked in the follow-up list) to drop the Azure-or-AWS fork and name Azure VM as the sole compute target.
3. Sequencing
T1 → T2 → T3 is an ordering, not a calendar. Each item is sized:
- S — bolt-on, a single focused push. One PR, scoped, reviewable in an hour.
- M — new subsystem or multi-PR chain. Multiple focused pushes, integration seams, tests to author.
Each tranche has a done signal that gates the next.
T1 — Unblock autonomy
Goal: connectors run on a schedule without human invocation, and a single buggy scan cannot wipe a tenant's authority paths.
Rough shape: 2×S + 2×M.
- S: Dockerfiles + GHCR publishing CI for `entra-servicenow`, `azure-foundry`, `jira-cloud`, `aws`.
- M: #199 Stream 1: `ConnectorInstanceDoc` + `ScanScopeDoc` + `ScanRunDoc` schema + in-process Mongo-claim scheduler + `up`/`scan`/`teardown` HTTP API.
- M: Ship scan-safety Phases 0–2 (global circuit breaker, `scanScope` + scope-aware diff, `removed_by_sync_id` on entities and paths).
- S: Replace in-memory `processedSyncIds` Set with a Mongo-backed idempotency store (collapses into the scheduler PR's workspace).
- Done signal: a `POST /api/v1/connector-instances` call creates a tenant's connector config; the scheduler claims and runs the next due `ScanScopeDoc`; a synthetic "empty graph" scan hits the global circuit breaker and the UI still shows the pre-existing entities.
T2 — Built-in validation + observability live
Goal: the execution_30d class of bug cannot ship undetected — on any tenant, not just fixtures. Operators have dashboards and alerts.
Rough shape: 2×M + 2×S.
- M: Ship `src/ingestion/validators/` + the five invariants from B.3, persisting verdicts to `ScanRunDoc.validationResults[]`. Gating test: the `execution-30d-scoping` validator fires on the pre-fix fixture in `test/integration/ingestion/non-aws-path-scoping.test.ts` (execution_30d=8, workload-wide misattribution) and passes on the post-fix fixture (execution_30d=3, per-destination scoped).
- M: Execute the 5-day rollout in sv0-platform#494 (Grafana Cloud + BetterStack + `grafana/mcp-grafana`).
- S: Canary tenant: fixture + golden committed under `test/canary/`; post-deploy job ingests through real pipeline; `sv0_canary_drift_total` metric wired.
- S: Ship scan-safety Phase 4 (platform-derived health score, no connector self-report).
- Done signal: replaying the pre-fix fixture produces exactly one `critical`-severity validation result on the associated `ScanRunDoc` with the predicate from B.3 above matching the actual violation; `curl` output from `GET /api/v1/scan-runs/:id` attached to the PR; Grafana shows the `sv0_validation_findings_total{severity="critical"}` counter increment; BetterStack pings `app` and (once it exists) the Azure dev VM.
T3 — Azure VM cutover (gated on T1+T2 stable on Hetzner)
Goal: platform hosting runs on Azure. Prod data in Atlas, dev/QA data self-hosted on the Azure VM.
Rough shape: 1×M + 1×S.
- M: Execute #493 Phase 1 (Atlas cutover for prod/pre-prod): externalize `MONGODB_URI`, stand up M10 in Frankfurt, 30-min maintenance window, `mongodump → mongorestore`, flip URI, smoke test.
- S: Execute #493 Phase 2 (compute migration): provision Azure VMs (prod + dev), install Docker + Caddy + Alloy, update `DEPLOY_HOST_KEY` + DNS, demote Hetzner to off.
- Done signal: `app.securityv0.com` resolves to Azure prod VM; canary ingestion runs end-to-end; Grafana/BetterStack continuity unbroken across cutover.
4. What stays vs what changes per repo
| Repo | Changes |
|---|---|
| sv0-connectors | Dockerfiles + CI to publish to GHCR; emit `scanScope` on every graph; emit canonical `target_resource_key` (#91). No scan-logic changes. |
| sv0-platform | #199 Stream 1 scheduler + HTTP API; persistent idempotency; scope-aware diff; circuit breaker; `src/ingestion/validators/` module + verdicts on `ScanRunDoc.validationResults[]`; `test/canary/` fixtures + golden + post-deploy ingest job; Prom counters (`sv0_validation_findings_total`, `sv0_canary_drift_total`). No new admin endpoint, no parallel `validation_findings` collection, no new UI surface. |
| sv0-documentation | This doc; `mkdocs.yml` nav update; callout in `2026-03-31-infrastructure-strategy.md` naming Azure VM as platform-hosting target (AWS-sandbox rationale preserved). |
| sv0-demo-labs | `make validate` target in each lab that POSTs the seeded corpus to a running dev API and reads back `GET /api/v1/scan-runs/:id` to get the validator verdicts: thumbs-up/down in <30s. |
5. Non-goals
- Kubernetes (any flavor).
- Temporal.io or similar workflow engines — current workload is simple claim-and-run, not DAGs.
- AWS Lambda for connectors — revisit at 10+ connectors or per-customer secret isolation.
- Azure-native PaaS (App Service, AKS, Container Apps, Cosmos DB for MongoDB).
- Private Link / private endpoints — deferred to post-pilot hardening.
- Systemd timers / cron-per-connector-per-tenant on the deploy host — superseded by #199 Stream 1's Mongo-claim scheduler.
- Parallel `validation_findings` collection, or a new admin endpoint for validation results. Results live on `ScanRunDoc.validationResults[]`.
- `qa.dev.securityv0.com` as a dedicated fail-loud env: replaced by canary tenant on the prod pipeline. If a seed-corpus-before-dev promotion env is needed later, open an issue using `qa1.securityv0.com` naming.
- In-platform ops dashboard for validation/canary metrics. Internal ops uses Grafana. A client-facing connector-status UI is a real future product surface but out of scope here.
- External data-quality frameworks (Great Expectations, SODA Core, dbt tests, Pandera) — they assume a SQL warehouse; our graph is in Node memory pre-commit.
- Learned thresholds / historical-distribution anomaly detection — violates AGENTS.md's deterministic-only rule, hides bugs instead of surfacing them. Static per-tenant thresholds are fine; auto-tuned ones are not.
- Any ML, probabilistic scoring, or heuristic validation.
6. Verification (for the implementation tranches, not this doc)
Each tranche has a done signal above. Additional acceptance:
- T1: unit test in `test/ingestion/circuit-breaker.test.ts` proves that a graph with 100% entity absence triggers the global breaker; a graph with 49% absence does not. `POST /api/v1/connector-instances` round-trip captured in the PR description; scheduler loop log excerpt showing a claim + run + verdict write to `ScanRunDoc`.
- T2, gating test for the validator PR: replay the `test/integration/ingestion/non-aws-path-scoping.test.ts` pre-fix fixture (execution_30d = 8 misattributed across destinations A + B) through the `execution-30d-scoping` validator; verdict must be `fail` with the scoping+count predicate and identified path_ids. Replay the post-fix fixture (execution_30d = 3 per destination); verdict must be `pass`. Both outputs attached to the PR. If either is wrong, the predicate needs another pass before merge.
- T2, canary: checked-in golden diff run against a clean prod pipeline ingestion of the canary fixture returns zero structural diffs.
- T3: BetterStack status-page screenshot showing uninterrupted monitoring across the DNS flip. `mongosh` against the Atlas URI from the Azure prod VM returns the expected tenant count; self-hosted Mongo on the Azure dev VM serves a `pr-N.dev.securityv0.com` preview end-to-end.
For this doc itself:
- `mkdocs serve` in `sv0-documentation` renders the page with no broken links.
- All cross-repo issue refs resolve (checked via `gh issue view`).
- A reader unfamiliar with the codebase can answer three questions after reading §1–§3: (a) what triggers a scan today vs after T1? (b) how would the exec_30d bug be caught automatically after T2? (c) what changes on Azure VM cutover and what stays?
7. New GitHub issues
Created alongside this doc:
| Title | Repo | Pillar |
|---|---|---|
| `docs: autonomous scans + built-in validation strategy` (#200) | sv0-documentation | This doc |
To open after doc sign-off (narrower than the original draft; most of Pillar A folds into #199 Stream 1 tasks, not standalone issues):
| Title | Repo | Pillar |
|---|---|---|
| `feat(validators): src/ingestion/validators/ module + ScanRunDoc.validationResults[] persistence` | sv0-platform | B.3 |
| `infra(connectors): Dockerfiles + GHCR publishing CI for all connectors` | sv0-platform (CI) + sv0-connectors (Dockerfiles) | A |
| `feat(canary): tenant_id=canary fixture + golden + post-deploy ingest job + drift counter` | sv0-platform | B.4 |
To update (not new issues):
- sv0-platform#493 body: drop the Azure-VM-or-AWS-VM fork; name Azure VM as the sole compute target. Split the Phase 1 MongoDB story into prod/pre-prod (Atlas) vs dev/QA (self-hosted on the VM).
- `2026-03-31-infrastructure-strategy.md`: header callout referencing this plan, naming Azure VM as the platform-hosting target (shipped with this PR).
Everything else is tracked under #199's substream issues or the existing plans.
8. References
- sv0-documentation#199 + the four substream docs dated 2026-04-22 — source-of-truth for connector control, multi-account AWS, graph stitching, and the MediaPro Lab 2 demo.
- `2026-02-26-scan-safety-and-observability.md`: canonical source for circuit breaker, scan scope, rollback.
- `2026-03-31-infrastructure-strategy.md`: secret tiers, AWS Organization layout. Scheduler design in §3 Phase 1 is superseded by #199 Stream 1.
- sv0-platform#493: Atlas cutover + compute migration.
- sv0-platform#494 — Observability rollout.
- sv0-platform#497, #498, #501: the `execution_30d` bug trail that motivates B.3.
- sv0-connectors#91: canonical `target_resource_key` on execution evidence.
- sv0-documentation#195: MediaPro pilot umbrella (the deadline behind T1 and T2).
- sv0-documentation#196 — fidelity doc reconciliation for the per-path proxy counts.