Autonomous Scans + Built-in Validation — Strategy

Date: 2026-04-23
Tracking issue: sv0-documentation#200
Scope: sv0-platform, sv0-connectors, sv0-documentation, sv0-demo-labs
PR shape: this doc ships on its own. Implementation PRs are sequenced against the issues listed under §7 (New GitHub issues).

0. Relationship to #199 (source-of-truth for streams)

sv0-documentation#199 (the multi-account-e2e umbrella + four substream docs dated 2026-04-22) is the source-of-truth for the sequencing of:

Stream 1 — per-tenant connector control (ConnectorInstance, first-class persisted ScanScope, ScanRun history, in-process Mongo-claim scheduler, HTTP API for up/scan/teardown).
Stream 2 — multi-account AWS connector (Organizations auto vs --accounts, 12 service categories, (account × category) unit-of-work).
Stream 3 — cross-connector graph stitching (new stitch_ingestion worker that runs post-sync, pre-evaluate, debounced).
Stream 4 — MediaPro Lab 2 (validation gate).

This document (#201) does not re-scope any of the above. It adds only three net-new things on top of #199:

In-process validator module — a deterministic set of invariants called from the ingestion pipeline. Reuses #199 Stream 1's ScanRunDoc for persistence (no parallel validation_findings collection); moves into #199 Stream 3's stitch_ingestion worker once that lands.
Canary tenant — a reserved tenant_id = "canary" with a frozen fixture, scheduled post-deploy ingest, and structural-diff vs a checked-in golden graph. Catches the regression class that invariants can't, because no one wrote a rule for the field that silently got renamed.
Azure VM as T3 compute target — commits the Azure-or-AWS fork from sv0-platform#493 to Azure VM; keeps Docker-Compose-on-Linux-VM so future cloud flips stay a DNS + SSH-target change.

If this doc and #199 disagree on scheduling, scan-scope abstractions, or worker pipeline order — #199 wins. Any reference here to ScanScopeDoc, ScanRunDoc, ConnectorInstanceDoc, or stitch_ingestion is a reuse, not a redefinition.

1. Why now

Three forces are converging before the MediaPro pilot (early May):

Manual scans don't survive the first real tenant. Every connector today is a human-invoked CLI (sv0-aws scan --submit, entra-servicenow --all --json). The platform's worker queue is in-memory — restarts lose idempotency state (src/ingestion/transport/ingest-service.ts processedSyncIds Set), there's no retry, no dead-letter, no tenant fan-out. A paying client cannot depend on a human to remember to run a CLI every 30 minutes.
We can't self-check our own output. Sergey flagged that per-path execution_30d numbers were overstated because the platform pulled workload-wide evidence on non-AWS paths. The fix landed in f6fb686 (closing sv0-platform#497 and #498); sv0-platform#501 (GROUND_TRUTH-vs-proxy tier selection) and sv0-documentation#196 (fidelity-doc reconciliation) remain open. Nothing in the platform cross-checks "UI says 8, evidence tier is not GROUND_TRUTH, connector reported 0 per-destination rows" — so the bug only surfaced when a human read the numbers. The next such bug will hit a paying client first.
An Azure hosting lane is available for free. Microsoft for Startups / Azure credits are already active. Mercury → AWS Activate is still in evaluation. Running two candidate hosting lanes doubles infra-engineering load for no benefit; we commit to Azure VM as the platform-hosting target and close the AWS-or-Azure fork in sv0-platform#493. The AWS sandbox account stays, but only as a target the AWS connector scans — not as a hosting option.

This document sequences in-flight work across three pillars and names the central net-new component: a deterministic cross-validation layer that would have caught the execution_30d bug automatically.

2. Three pillars

Pillar A — Autonomous scan operations

Connectors run on a schedule without human invocation, per tenant, with retries and idempotency that survive process restarts.

Scheduling, scan-scope persistence, and the up/scan/teardown HTTP surface are owned by #199 Stream 1 — this doc does not redesign them. Systemd timers, connector_schedules collections, and any parallel scheduler are explicitly out of scope here.

Component	Status	Reference
Containerize each connector (Dockerfile + GHCR publish)	New	New infra issue in this doc
In-process Mongo-claim scheduler (`ScanScopeDoc.next_run_at`, atomic `findOneAndUpdate` as lock)	Designed	#199 Stream 1 — `2026-04-22-connector-control-execution-architecture.md`
`ConnectorInstanceDoc` + `ScanScopeDoc` + `ScanRunDoc` schema	Designed	Same
HTTP API for `up / scan / teardown` (tenant onboarding as an API call, not SSH + templated unit files)	Designed	Same
Tiered secrets (1Password Business via Mercury → SOPS+age fallback)	Planned	`2026-03-31-infrastructure-strategy.md` §3 Secret Management
Persistent idempotency + retry + DLQ (replace in-memory `processedSyncIds` Set at `src/ingestion/transport/ingest-service.ts`)	New	Rolled into the scheduler work under #199 Stream 1
Circuit breaker + scan scope + rollback (global deletion threshold; `removed_by_sync_id`)	Designed	`2026-02-26-scan-safety-and-observability.md` Phases 1–2, 6
Non-AWS canonical `target_resource_key` on execution evidence	Open	sv0-connectors#91
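
The persistent-idempotency row above replaces the in-memory processedSyncIds Set. A minimal sketch of the contract such a store might expose follows. The IdempotencyStore interface, the processOnce wrapper, and the in-memory stand-in are all hypothetical names, not taken from sv0-platform; a Mongo-backed version would presumably lean on a unique index over (tenant_id, sync_id) so that the insert itself is the atomic check.

```typescript
// Hypothetical contract: names are illustrative, not from the codebase.
type ClaimOutcome = "claimed" | "duplicate";

interface IdempotencyStore {
  // Returns "claimed" exactly once per (tenantId, syncId), "duplicate" afterwards.
  claim(tenantId: string, syncId: string): ClaimOutcome;
}

// In-memory stand-in, used only to demonstrate the contract. A persistent
// implementation would insert into a collection with a unique index on
// (tenant_id, sync_id) and map a duplicate-key error to "duplicate".
class InMemoryIdempotencyStore implements IdempotencyStore {
  private readonly seen = new Set<string>();

  claim(tenantId: string, syncId: string): ClaimOutcome {
    const key = JSON.stringify([tenantId, syncId]);
    if (this.seen.has(key)) return "duplicate";
    this.seen.add(key);
    return "claimed";
  }
}

// Runs the handler at most once per (tenantId, syncId), even on redelivery.
function processOnce(
  store: IdempotencyStore,
  tenantId: string,
  syncId: string,
  handler: () => void,
): boolean {
  if (store.claim(tenantId, syncId) === "duplicate") return false;
  handler();
  return true;
}

const dedupeStore = new InMemoryIdempotencyStore();
let handlerRuns = 0;
const countRun = (): void => { handlerRuns += 1; };

const firstDelivery = processOnce(dedupeStore, "tenant-a", "sync-1", countRun);
const redelivery = processOnce(dedupeStore, "tenant-a", "sync-1", countRun);
const otherTenantDelivery = processOnce(dedupeStore, "tenant-b", "sync-1", countRun);
```

Keying on the tenant as well as the sync ID keeps two tenants that happen to reuse an ID from suppressing each other. Claim-before-run means a handler that throws is not retried by this wrapper alone; pairing it with the retry + DLQ work is what closes that gap.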

Why Mongo-claim and not systemd timers (decided by #199 Stream 1; captured here for the reader):

Multi-host growth. Azure VM cutover may grow to 2+ hosts (at minimum host + warm standby). systemd timers don't coordinate across hosts; MongoDB's atomic findOneAndUpdate is already the lock.
Tenant onboarding as an API call. Adding a tenant should be POST /api/v1/connector-instances, not SSH + template a unit file + systemctl enable. Cadence changes are a DB update, not editing files on the host.
Composes with #199's HTTP scan API. Stream 1 designs up/scan/teardown as HTTP endpoints; systemd has no HTTP surface.
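
The claim semantics behind the first point can be sketched without a database. The block below is an in-memory emulation, not the #199 Stream 1 design: only next_run_at is named in this doc, while claimed_by, lease_expires_at, cadence_minutes, and the class itself are assumptions made for illustration. In a real deployment the body of claimNextDue would be a single findOneAndUpdate whose filter matches due, unclaimed scopes and whose update sets the claim.

```typescript
// Hypothetical shapes: only next_run_at is named in this doc. claimed_by,
// lease_expires_at, and cadence_minutes are assumptions for the sketch.
interface ScanScopeDoc {
  _id: string;
  tenant_id: string;
  next_run_at: number; // epoch ms
  cadence_minutes: number;
  claimed_by: string | null;
  lease_expires_at: number; // epoch ms, 0 when unclaimed
}

// In-memory stand-in for a collection. claimNextDue plays the role of one
// findOneAndUpdate call: match and update happen as a single step, so two
// hosts polling at the same instant cannot both win the same scope.
class ScanScopeCollection {
  private readonly docs: ScanScopeDoc[];

  constructor(docs: ScanScopeDoc[]) {
    this.docs = docs;
  }

  claimNextDue(workerId: string, now: number, leaseMs: number): ScanScopeDoc | null {
    const due = this.docs
      .filter((d) => d.next_run_at <= now && (d.claimed_by === null || d.lease_expires_at <= now))
      .sort((a, b) => a.next_run_at - b.next_run_at)[0];
    if (due === undefined) return null;
    due.claimed_by = workerId;
    due.lease_expires_at = now + leaseMs; // a crashed worker's claim expires
    return { ...due };
  }

  // Releases the claim and pushes next_run_at forward by the scope's cadence.
  complete(id: string, workerId: string, now: number): boolean {
    const doc = this.docs.find((d) => d._id === id && d.claimed_by === workerId);
    if (doc === undefined) return false;
    doc.claimed_by = null;
    doc.lease_expires_at = 0;
    doc.next_run_at = now + doc.cadence_minutes * 60000;
    return true;
  }
}

const t0 = 1000000;
const scopes = new ScanScopeCollection([
  { _id: "scope-1", tenant_id: "tenant-a", next_run_at: t0 - 1, cadence_minutes: 30, claimed_by: null, lease_expires_at: 0 },
]);

const claimByHost1 = scopes.claimNextDue("host-1", t0, 60000); // wins the scope
const claimByHost2 = scopes.claimNextDue("host-2", t0, 60000); // null: lease is held
const released = scopes.complete("scope-1", "host-1", t0 + 5000);
const claimAfterRun = scopes.claimNextDue("host-2", t0 + 6000, 60000); // null: next run is 30 minutes out
```

The lease expiry is the piece that matters for the host + warm standby case: if the claiming host dies mid-scan, the scope becomes claimable again once the lease lapses, without any cross-host coordination beyond the database.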

What lives where after T1 completes:

deploy host (Hetzner → Azure VM at T3)
├── docker-compose.deploy.yml                (unchanged across cloud cutover)
│   ├── api :3000   — scheduler loop, scope-aware diff, circuit breaker,
│   │                  in-process claim of ScanScopeDocs whose next_run_at <= now
│   ├── ui  :8080   — same image
│   └── mongo :27017 (dev + QA only after T3 Phase 1 — Atlas takes prod/pre-prod)
│
└── connectors run as short-lived containers invoked by the api service
    via Docker Engine socket (or `docker run` from the api container itself),
    scheduled by the Mongo-claim loop — not by systemd, not by cron.

Docker Compose stays. No PaaS (no App Service, no AKS, no ECS). Future cloud flips are a DNS + host-key change, not a rewrite.

Pillar B — Built-in validation, QA, observability

Four sub-pillars. B.1 and B.2 are already-planned work being sequenced. B.3 and B.4 are new.

B.1 Observability stack (existing plan — sequence, don't re-plan)

Owned by sv0-platform#494 — Grafana Cloud + BetterStack + grafana/mcp-grafana, 5-day rollout, $0 at pilot scale. Deliverables:

Grafana Alloy on the deploy host streams Prom scrape + Loki logs + node_exporter.
BetterStack monitors app.securityv0.com/api/v1/health and dev.securityv0.com/api/v1/health.
Read-only Grafana token wired as MCP server into .claude/settings.json — agent sessions can query logs/metrics directly.
Alerts: error rate > 5% for 5min, sv0_queue_depth > 100 for 10min, sv0_sync_age_minutes > 2× schedule, /ready 503 for 2min.
Portable across the Hetzner → Azure VM cutover (Alloy moves as a systemd service; dashboards + alerts unchanged).

B.2 Scan safety (existing plan — ship it)

Owned by 2026-02-26-scan-safety-and-observability.md Phases 1, 4. Designed, reviewed, not shipped:

Global (tenant-level) circuit breaker — no per-workload minimum floor; gates the entire destructive pipeline (entity deletion + materialization + authority path removal) as a single unit.
Scan-health score (platform-derived from baseline — no connector self-reported inputs).
removed_by_sync_id field for deterministic rollback.
scanScope on NormalizedGraph; sync_mode derived from payload instead of hardcoded "full".

T1 ships Phases 0–2. T2 ships Phase 4 (health score) alongside B.3.
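
A minimal sketch of how the global breaker and the removed_by_sync_id rollback might fit together, under stated assumptions: the type names, the step names, and the 0.5 threshold are hypothetical (the threshold is chosen only because it separates the 100% and 49% cases used in §6), and the canonical design remains the scan-safety doc.

```typescript
// Hypothetical sketch: field names and the threshold value are assumptions.
interface BreakerInput {
  priorEntityCount: number; // entities known for the tenant before this sync
  absentEntityCount: number; // prior entities missing from the incoming graph
}

type BreakerVerdict =
  | { tripped: false }
  | { tripped: true; absentRatio: number };

// Tenant-level check: no per-workload floor, one ratio for the whole sync.
function evaluateGlobalBreaker(input: BreakerInput, maxAbsentRatio: number): BreakerVerdict {
  if (input.priorEntityCount === 0) return { tripped: false }; // first scan, nothing to delete
  const absentRatio = input.absentEntityCount / input.priorEntityCount;
  if (absentRatio <= maxAbsentRatio) return { tripped: false };
  return { tripped: true, absentRatio };
}

// The destructive pipeline is gated as a single unit: all steps or none.
function planDestructiveSteps(verdict: BreakerVerdict): string[] {
  return verdict.tripped
    ? []
    : ["entity_deletion", "materialization", "authority_path_removal"];
}

interface SoftDeletable {
  id: string;
  removed_by_sync_id: string | null;
}

// Deterministic rollback: restore exactly what one sync removed.
function rollbackSync(docs: SoftDeletable[], syncId: string): SoftDeletable[] {
  return docs.map((d) =>
    d.removed_by_sync_id === syncId ? { id: d.id, removed_by_sync_id: null } : d,
  );
}

const emptyGraphScan = evaluateGlobalBreaker({ priorEntityCount: 100, absentEntityCount: 100 }, 0.5);
const partialScan = evaluateGlobalBreaker({ priorEntityCount: 100, absentEntityCount: 49 }, 0.5);

const restored = rollbackSync(
  [
    { id: "e1", removed_by_sync_id: "sync-9" },
    { id: "e2", removed_by_sync_id: "sync-7" },
  ],
  "sync-9",
);
```

Tagging removals with the sync that caused them is what makes the rollback deterministic: undoing sync-9 touches only the documents sync-9 removed and leaves earlier removals alone.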

B.3 Data cross-validation layer (NEW — the missing piece)

A deterministic set of invariants runs against every real ingest (every tenant, not just fixtures) and records pass/fail verdicts on the ScanRunDoc from #199 Stream 1. The default mode is warn (record + emit Prom counter + log event), so a single bad rule cannot block a prod scan; fail mode is a per-rule, per-tenant opt-in.

Module layout:

src/ingestion/validators/ — pure functions (NormalizedGraph, PriorScan?) => ValidationResult[]. Each rule has id, severity, mode: warn | fail (default warn). No external framework (no Great Expectations, no SODA Core, no dbt — all assume a SQL warehouse context; our graph lives in Node memory pre-commit). Borrow the ideas (severity levels, rule catalog, per-rule verdicts) — write the code ourselves.
Plug point today: called from src/workers/handlers/sync-ingestion.ts immediately before the final logger.info("Sync ingestion completed", ...) call at line 363, inside the same try block so any thrown validator error is caught by the existing handler at line 379.
Plug point once #199 Stream 3 lands: moves into the stitch_ingestion worker as a step after materialization, before the evaluate_findings enqueue. Natural post-sync / pre-evaluate placement, debounced at the same 60s as stitching.
Persistence: results live on ScanRunDoc.validationResults[] from #199 Stream 1. No parallel validation_findings collection, no net-new admin endpoint — GET /api/v1/scan-runs/:id surfaces them for free.
Metrics: Prometheus counter sv0_validation_findings_total{check_id,severity,tenant_id} — we accept the tenant_id cardinality cost here intentionally, same pattern #494 is already working through for scan-age metrics. Grafana alert fires when severity="critical" rate > 0.
No UI surface in this doc. Internal ops uses Grafana. A client-facing connector-status UI is a real product surface but belongs to a later product-scoped PR.
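
The module layout above could take roughly the following shape. This is a sketch under assumptions, not the module's actual types: the doc fixes only that rules are pure functions over (NormalizedGraph, PriorScan?) with an id, a severity, and a warn/fail mode. Everything else here (field names, the opt-in key format, the example rule, the label-string format for the counter) is hypothetical.

```typescript
// Hypothetical types for illustration only.
type Severity = "info" | "warning" | "critical";

interface NormalizedGraph {
  tenantId: string;
  entities: { id: string; type: string }[];
}
type PriorScan = NormalizedGraph;

interface ValidationResult {
  ruleId: string;
  severity: Severity;
  passed: boolean;
  message: string;
}

interface ValidationRule {
  id: string;
  severity: Severity;
  check: (graph: NormalizedGraph, prior?: PriorScan) => ValidationResult[];
}

interface ValidationRun {
  results: ValidationResult[]; // would be persisted on ScanRunDoc.validationResults[]
  blockIngest: boolean; // true only when a failing rule is opted in to fail mode
  counterIncrements: { [labels: string]: number };
}

// failOptIns holds "<tenantId>:<ruleId>" entries; anything absent stays in warn mode.
function runValidators(
  graph: NormalizedGraph,
  rules: ValidationRule[],
  failOptIns: Set<string>,
  prior?: PriorScan,
): ValidationRun {
  const run: ValidationRun = { results: [], blockIngest: false, counterIncrements: {} };
  for (const rule of rules) {
    for (const result of rule.check(graph, prior)) {
      run.results.push(result);
      if (result.passed) continue;
      const labels = `check_id=${rule.id},severity=${result.severity},tenant_id=${graph.tenantId}`;
      run.counterIncrements[labels] = (run.counterIncrements[labels] ?? 0) + 1;
      if (failOptIns.has(`${graph.tenantId}:${rule.id}`)) run.blockIngest = true;
    }
  }
  return run;
}

// Example rule (hypothetical): an ingest should contain at least one entity.
const graphNonEmpty: ValidationRule = {
  id: "graph-non-empty",
  severity: "critical",
  check: (graph) => [
    {
      ruleId: "graph-non-empty",
      severity: "critical",
      passed: graph.entities.length > 0,
      message: `graph has ${graph.entities.length} entities`,
    },
  ],
};

const emptyIngest: NormalizedGraph = { tenantId: "tenant-a", entities: [] };
const warnRun = runValidators(emptyIngest, [graphNonEmpty], new Set<string>());
const failRun = runValidators(emptyIngest, [graphNonEmpty], new Set(["tenant-a:graph-non-empty"]));
```

Both runs record the same failing verdict and the same counter increment; the only difference is blockIngest, which is why a misbehaving rule in warn mode costs an alert rather than a blocked scan.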

Invariants at T2 launch — each is a pure function over the just-ingested state. The execution_30d invariant uses actual schema field names and is specifically constructed to fail against the pre-fix data in test/integration/ingestion/non-aws-path-scoping.test.ts (the fixture that lands with the #497 fix). A pure existence check would incorrectly pass on that data, because the bad execution_30d counter was summed from real evidence rows attributed to the wrong paths — the bug is mis-scoping, not missing evidence.

# rule: execution_30d must be scoped to a destination and count-match
# severity: critical
# purpose: catches the exec_30d misattribution class (sv0-platform#497/498/501,
#          sv0-documentation#196). Replay gate: must fire on the pre-fix fixture
#          in non-aws-path-scoping.test.ts (pre-fix execution_30d=8 across
#          destinations A+B, post-fix=3 per destination).

for each AuthorityPathDoc P where P.current_state.execution_30d > 0:
  assert P.destination_resource_key IS NOT NULL                                   # (scoping required)

  let bucket_sum = SUM(e.execution_count)
    for ExecutionEvidenceDoc e where
      e.tenant_id         = P.tenant_id
      AND e.entity_id     = P.workload_id                                         # join key: workload, not path
      AND e.resource_key  = P.destination_resource_key                            # per-destination scoping
      AND e.confidence    IN (GROUND_TRUTH-tier confidence values per #501)       # see note below
      AND e.source_timestamp >= now - INTERVAL 30 days

  assert bucket_sum = P.current_state.execution_30d                               # exact match, not existence

Schema notes (confirmed against src/domain/authority-paths/types.ts and src/domain/evidence/types.ts):

AuthorityPathDoc.destination_resource_key: string | undefined exists as of the #497 fix and is the canonical destination key the materializer uses for per-path scoping.
ExecutionEvidenceDoc links to entities via entity_id, not to paths — there is no path_id field. Join from path to evidence goes P.workload_id → e.entity_id, filtered by e.resource_key = P.destination_resource_key.
ExecutionEvidenceDoc has confidence: "DETERMINISTIC" | "TEMPORAL_INFERRED" | "STRUCTURAL", not a tier enum. The GROUND_TRUTH-vs-proxy distinction sv0-platform#501 is tracking is what will harden this rule's RHS — until that lands, the rule uses confidence = "DETERMINISTIC" and will be tightened as #501 defines the canonical tier enum.
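
The predicate and schema notes above translate fairly directly into a pure function. The sketch below is one possible reading, not the shipped validator: path_id, the Violation shape, and the epoch-millisecond source_timestamp are assumptions, and the sample numbers are illustrative rather than the contents of the real fixture. It uses confidence = "DETERMINISTIC" as the interim stand-in described above.

```typescript
type Confidence = "DETERMINISTIC" | "TEMPORAL_INFERRED" | "STRUCTURAL";

interface AuthorityPathDoc {
  path_id: string; // assumed identifier name
  tenant_id: string;
  workload_id: string;
  destination_resource_key?: string;
  current_state: { execution_30d: number };
}

interface ExecutionEvidenceDoc {
  tenant_id: string;
  entity_id: string; // joins to the workload, not the path
  resource_key: string;
  confidence: Confidence;
  execution_count: number;
  source_timestamp: number; // epoch ms in this sketch; real type not confirmed
}

interface Violation {
  path_id: string;
  reason: "missing_destination" | "count_mismatch";
  claimed: number;
  evidenced?: number;
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function checkExecution30dScoping(
  paths: AuthorityPathDoc[],
  evidence: ExecutionEvidenceDoc[],
  now: number,
): Violation[] {
  const violations: Violation[] = [];
  for (const p of paths) {
    const claimed = p.current_state.execution_30d;
    if (claimed <= 0) continue;
    const destination = p.destination_resource_key;
    if (destination === undefined) {
      violations.push({ path_id: p.path_id, reason: "missing_destination", claimed });
      continue;
    }
    let evidenced = 0;
    for (const e of evidence) {
      const inScope =
        e.tenant_id === p.tenant_id &&
        e.entity_id === p.workload_id &&
        e.resource_key === destination &&
        e.confidence === "DETERMINISTIC" &&
        e.source_timestamp >= now - THIRTY_DAYS_MS;
      if (inScope) evidenced += e.execution_count;
    }
    // Exact match, not existence: workload-wide sums must fail here.
    if (evidenced !== claimed) {
      violations.push({ path_id: p.path_id, reason: "count_mismatch", claimed, evidenced });
    }
  }
  return violations;
}

// Illustrative data: workload w1 ran 3 times against A and 5 times against B.
const nowMs = 1800000000000;
const sampleEvidence: ExecutionEvidenceDoc[] = [
  { tenant_id: "t1", entity_id: "w1", resource_key: "dest-A", confidence: "DETERMINISTIC", execution_count: 3, source_timestamp: nowMs - 86400000 },
  { tenant_id: "t1", entity_id: "w1", resource_key: "dest-B", confidence: "DETERMINISTIC", execution_count: 5, source_timestamp: nowMs - 86400000 },
];
const misScopedPath: AuthorityPathDoc = { path_id: "p-A", tenant_id: "t1", workload_id: "w1", destination_resource_key: "dest-A", current_state: { execution_30d: 8 } };
const scopedPath: AuthorityPathDoc = { path_id: "p-A", tenant_id: "t1", workload_id: "w1", destination_resource_key: "dest-A", current_state: { execution_30d: 3 } };

const misScopedVerdict = checkExecution30dScoping([misScopedPath], sampleEvidence, nowMs);
const scopedVerdict = checkExecution30dScoping([scopedPath], sampleEvidence, nowMs);
```

The mis-scoped path claims 8 (the workload-wide total) while only 3 executions are evidenced for its destination, so the rule reports a count_mismatch. An existence check would have passed, since evidence rows do exist for the path's workload.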

Other invariants at T2 launch:

Invariant	Violation example (real)	File
UI-facing aggregate count == list-endpoint count for the same query	Posture summary shows 29 active + 3 dormant = 32; list endpoint returns 30 (DQ1)	`src/ingestion/validators/aggregate-consistency.ts`
Every AuthorityPath with `execution_30d > 0` has a non-null `destination_resource_key` AND sum-matches evidence (predicate above)	Pre-fix fixture: path shows `execution_30d: 8`, real per-destination evidence sums to 3	`src/ingestion/validators/execution-30d-scoping.ts`
`scanScope.scannedEntityTypes` ⊇ entity types actually present in the graph	Connector declares it scanned `["workload"]` but graph contains `identity` entities	`src/ingestion/validators/scope-vs-graph.ts`
Every `authority_path.workload_id` / `.identity_id` / `.destination_id` resolves to a live `entities` document	Dangling ID after a soft-delete cascade	`src/ingestion/validators/referential-integrity.ts`
Pagination-bounded counts match total counts (findings list breakdowns)	Severity/type breakdowns reflect current page, not total dataset (DQ2)	`src/ingestion/validators/pagination-totals.ts`
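
Two of the invariants in the table are simple enough to sketch as pure set checks. The entity and path shapes below are assumptions (in particular the deleted flag used to model a soft-delete), and the function names are hypothetical; only the field names workload_id, identity_id, destination_id, and scannedEntityTypes come from this doc.

```typescript
// Hypothetical shapes for illustration.
interface EntityDoc {
  id: string;
  type: string;
  deleted?: boolean; // assumed soft-delete marker
}

interface PathRefs {
  path_id: string;
  workload_id: string;
  identity_id: string;
  destination_id: string;
}

interface DanglingRef {
  path_id: string;
  field: string;
  missing_id: string;
}

// scope-vs-graph: every entity type present must have been declared as scanned.
function undeclaredEntityTypes(scannedEntityTypes: string[], entities: EntityDoc[]): string[] {
  const declared = new Set(scannedEntityTypes);
  const present = new Set(entities.map((e) => e.type));
  return Array.from(present).filter((t) => !declared.has(t)).sort();
}

// referential-integrity: every path reference must resolve to a live entity.
function danglingReferences(paths: PathRefs[], entities: EntityDoc[]): DanglingRef[] {
  const live = new Set(entities.filter((e) => e.deleted !== true).map((e) => e.id));
  const dangling: DanglingRef[] = [];
  for (const p of paths) {
    const refs: [string, string][] = [
      ["workload_id", p.workload_id],
      ["identity_id", p.identity_id],
      ["destination_id", p.destination_id],
    ];
    for (const [field, id] of refs) {
      if (!live.has(id)) dangling.push({ path_id: p.path_id, field, missing_id: id });
    }
  }
  return dangling;
}

const sampleEntities: EntityDoc[] = [
  { id: "w1", type: "workload" },
  { id: "i1", type: "identity", deleted: true },
  { id: "d1", type: "workload" },
];
const samplePaths: PathRefs[] = [
  { path_id: "p1", workload_id: "w1", identity_id: "i1", destination_id: "d1" },
];

const undeclared = undeclaredEntityTypes(["workload"], sampleEntities);
const danglingRefs = danglingReferences(samplePaths, sampleEntities);
```

With this sample, the scope check reports identity as undeclared (the connector said it scanned only workload), and the integrity check reports p1.identity_id as dangling because i1 was soft-deleted.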

No ML. No heuristics. No learned thresholds. Every check is a pure deterministic predicate (per AGENTS.md). "Threshold violations" (e.g., delta > X% from baseline) belong in the scan-safety circuit breaker, not here — this layer checks invariants, not anomalies.

B.4 Canary tenant (pipeline-regression line of defense)

An earlier version of this section proposed a dedicated qa.dev.securityv0.com env. Two problems killed it:

Cloudflare wildcard certs don't cover two-level subdomains. Our certs are *.securityv0.com and *.dev.securityv0.com as separate wildcards; qa.dev.securityv0.com sits one level deeper and would need its own cert or Access app per env. If we ever need a seed-corpus-before-dev promotion env, the naming convention is qa1.securityv0.com at the top level — matching the multi-instance-dev pattern already in use. Open an issue then; don't scope it here.
A QA env catches synthetic-corpus bugs, not live-client-data bugs. The execution_30d class of regression only surfaces on a real connector + real tenant. Validation has to run on prod scans, not just staging fixtures. B.3 already does that.

What goes in B.4 instead is a canary tenant — synthetic monitoring / SRE canary applied to our actual pipeline:

Reserved tenant_id = "canary". Distinct DB documents but served by the same prod API and workers.
Frozen fixture checked in under test/canary/<fixture-name>.json — a NormalizedGraph that exercises at least one path through each active rule family and tier.
Post-deploy scheduled job (triggered by the Grafana Cloud alert rule already set up for deploy events in #494, or a lightweight GitHub Action) ingests the fixture through the real pipeline by POSTing to /api/v1/ingest/normalized-graph with X-Tenant-Id: canary.
Structural-diff the resulting entities, paths, and evidence-pack sections against a checked-in golden under test/canary/golden/<fixture-name>.json. Any non-trivial diff is an incident: a field got renamed, a migration dropped data, a rule silently changed semantics, a dependency update shifted a computation.
Emits sv0_canary_drift_total{check} — paged when > 0.

Canary is what invariants can't be: invariants only catch violations of rules we wrote; canary catches "we renamed a field and every tenant's count silently halved equally, uniformly, in-spec". No one writes a rule for that. The frozen golden is the rule.
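
A structural diff of this kind can be a small recursive walk over two JSON values. The sketch below is one way to do it, assuming a hypothetical structuralDiff helper and a hypothetical ignore list for fields expected to vary between runs (such as timestamps); the doc does not say which fields count as trivial, so that list is an assumption to be settled in the canary PR.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isRecord(value: Json): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns one entry per differing location; an empty array means no drift.
function structuralDiff(golden: Json, actual: Json, ignoreKeys: Set<string>, at: string = "$"): string[] {
  const diffs: string[] = [];
  if (Array.isArray(golden) && Array.isArray(actual)) {
    if (golden.length !== actual.length) {
      diffs.push(`${at}: length ${golden.length} -> ${actual.length}`);
    }
    const shared = Math.min(golden.length, actual.length);
    for (let i = 0; i < shared; i++) {
      diffs.push(...structuralDiff(golden[i], actual[i], ignoreKeys, `${at}[${i}]`));
    }
    return diffs;
  }
  if (isRecord(golden) && isRecord(actual)) {
    const keys = Array.from(new Set(Object.keys(golden).concat(Object.keys(actual)))).sort();
    for (const key of keys) {
      if (ignoreKeys.has(key)) continue;
      const inGolden = Object.prototype.hasOwnProperty.call(golden, key);
      const inActual = Object.prototype.hasOwnProperty.call(actual, key);
      if (inGolden && !inActual) diffs.push(`${at}.${key}: removed`);
      else if (!inGolden && inActual) diffs.push(`${at}.${key}: added`);
      else diffs.push(...structuralDiff(golden[key], actual[key], ignoreKeys, `${at}.${key}`));
    }
    return diffs;
  }
  // Primitives, or a type change such as array -> object.
  if (golden !== actual) diffs.push(`${at}: changed`);
  return diffs;
}

// Illustrative golden vs. a run where a field was silently renamed.
const goldenGraph: Json = {
  entities: [{ id: "e1", type: "workload" }],
  paths: [{ path_id: "p1", execution_30d: 4 }],
};
const ingestedGraph: Json = {
  ingested_at: "2026-04-23T00:00:00Z",
  entities: [{ id: "e1", type: "workload" }],
  paths: [{ path_id: "p1", executions_30d: 4 }],
};

const drift = structuralDiff(goldenGraph, ingestedGraph, new Set(["ingested_at"]));
// drift: ["$.paths[0].execution_30d: removed", "$.paths[0].executions_30d: added"]
```

The length of the returned list is a natural value to feed into sv0_canary_drift_total, and the entries themselves point at the exact location that moved, which no invariant would have flagged because the renamed field is still internally consistent.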

The promotion flow for seed corpora used by demos/labs is unchanged: authors run make validate locally against a dev API (see the sv0-demo-labs entry in §4), which exercises the same validator module over the seeded state.

Pillar C — Azure VM pivot (committed for T3, portable by design)

Decision: T3 moves platform hosting to Azure VM. We do not lock in Azure at the platform level — the whole point of staying on Docker-Compose-on-a-plain-Linux-VM is that flipping to AWS VM (or any other IaaS) later is a DNS + SSH-target + MONGODB_URI change, not a rewrite. "Committed" here means "we pick one lane now so we're not maintaining two deploy paths in parallel", not "locked in for life".

Why Azure first:

Azure credits are already usable. Mercury → AWS Activate is still in evaluation; that grant is probabilistic and delays T3 if we wait.
Managing two candidate hosting lanes doubles deploy scripts, CI matrices, and on-call muscle memory for no strategic benefit at pilot scale.

Why Docker Compose on a VM (and not Azure PaaS / AKS / ECS / Fargate):

The shape is identical across clouds. docker-compose.deploy.yml doesn't change when we move from Hetzner → Azure VM, nor would it change if we later move Azure VM → AWS VM.
Any cloud that gives us an SSH-reachable Linux host with Docker installed is a valid target. This is a deliberate counter to vendor-lock-in: when credits expire, AWS gets competitive, or a client requires a specific region/jurisdiction, the migration cost is measured in hours, not sprints.
No managed-PaaS primitives (App Service, Container Apps, AKS, ECS, Fargate, Cosmos DB for MongoDB) — each of those would bind us to its control plane and break this portability story.

What changes at T3 (split per environment — Mongo tiering matters):

Artifact	prod / pre-prod	dev / QA
Deploy SSH target	`deploy@<azure-vm-prod>`	`deploy@<azure-vm-dev>`
`DEPLOY_HOST_KEY` secret	Azure prod VM host key	Azure dev VM host key
DNS	`app.securityv0.com` → prod VM	`dev.securityv0.com`, `pr-N.dev.securityv0.com` → dev VM
`MONGODB_URI`	`mongodb+srv://...atlas-m10-frankfurt...` (per #493 Phase 1)	`mongodb://mongo:27017/sv0_<instance>` (self-hosted in-compose, unchanged from today)
Backup	Atlas PITR + daily snapshot retention	Existing `mongo-backup` compose service (6h `mongodump`, local volume)
`docker-compose.deploy.yml`, Caddy, Alloy	Unchanged	Unchanged

Atlas is the right fit for prod + pre-prod: managed auth, Point-in-Time Recovery, offsite backups, monitoring — all things a paying client will (reasonably) ask about. Self-hosted Mongo on the same Azure VM is the right fit for dev + QA: Azure credits cover the VM compute, durability is good-enough for non-prod, and keeping the compose mongo service in the deploy path exercises it as the authoritative shape (so dev ≈ prod in structure, differing only in the MONGODB_URI).
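
Since the environments differ only in a handful of values, the split can be captured as a small typed loader with one guard. This is a sketch under assumptions: DEPLOY_SSH_TARGET is a hypothetical variable name, inferring the Atlas tier from the mongodb+srv:// scheme is a simplification made for illustration, and the sample values are placeholders rather than real hosts or URIs.

```typescript
type Environment = "prod" | "pre-prod" | "dev" | "qa";

interface DeployTarget {
  environment: Environment;
  sshTarget: string;
  hostKey: string;
  mongoUri: string;
  mongoTier: "atlas" | "self-hosted";
}

type EnvVars = { [name: string]: string | undefined };

function requireVar(env: EnvVars, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`missing required variable ${name}`);
  }
  return value;
}

function loadDeployTarget(environment: Environment, env: EnvVars): DeployTarget {
  const mongoUri = requireVar(env, "MONGODB_URI");
  // Assumption for this sketch: an SRV URI means Atlas, anything else is in-compose.
  const mongoTier = mongoUri.indexOf("mongodb+srv://") === 0 ? "atlas" : "self-hosted";
  const needsAtlas = environment === "prod" || environment === "pre-prod";
  if (needsAtlas && mongoTier !== "atlas") {
    throw new Error(`${environment} must point at Atlas, got a self-hosted URI`);
  }
  return {
    environment,
    sshTarget: requireVar(env, "DEPLOY_SSH_TARGET"), // hypothetical variable name
    hostKey: requireVar(env, "DEPLOY_HOST_KEY"),
    mongoUri,
    mongoTier,
  };
}

// Placeholder values only.
const devTarget = loadDeployTarget("dev", {
  MONGODB_URI: "mongodb://mongo:27017/sv0_dev",
  DEPLOY_SSH_TARGET: "deploy@dev-vm.example",
  DEPLOY_HOST_KEY: "placeholder-dev-host-key",
});

let prodGuardMessage = "";
try {
  loadDeployTarget("prod", {
    MONGODB_URI: "mongodb://mongo:27017/sv0_prod",
    DEPLOY_SSH_TARGET: "deploy@prod-vm.example",
    DEPLOY_HOST_KEY: "placeholder-prod-host-key",
  });
} catch (err) {
  prodGuardMessage = err instanceof Error ? err.message : String(err);
}
```

The guard is the useful part: a prod deploy that accidentally inherits the in-compose URI fails loudly at load time instead of quietly writing client data to a VM-local volume.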

AWS is not going away — just not for platform hosting:

sv0-security-tooling AWS account stays (per 2026-03-31-infrastructure-strategy.md §4). The AWS connector needs a real AWS environment to scan; that's what the sandbox account is for.
sv0-demo AWS account stays for demo scenarios.
The management / billing / SCP work stays — those are target-side, not hosting-side.

sv0-platform#493's body is updated (separate small edit, tracked in the follow-up list) to drop the Azure-or-AWS fork and name Azure VM as the sole compute target.

3. Sequencing

T1 → T2 → T3 is an ordering, not a calendar. Each item is sized:

S — bolt-on, a single focused push. One PR, scoped, reviewable in an hour.
M — new subsystem or multi-PR chain. Multiple focused pushes, integration seams, tests to author.

Each tranche has a done signal that gates the next.

T1 — Unblock autonomy

Goal: connectors run on a schedule without human invocation, and a single buggy scan cannot wipe a tenant's authority paths.

Rough shape: 2×S + 2×M.

S — Dockerfiles + GHCR publishing CI for entra-servicenow, azure-foundry, jira-cloud, aws.
M — #199 Stream 1: ConnectorInstanceDoc + ScanScopeDoc + ScanRunDoc schema + in-process Mongo-claim scheduler + up/scan/teardown HTTP API.
M — Ship scan-safety Phases 0–2 (global circuit breaker, scanScope + scope-aware diff, removed_by_sync_id on entities and paths).
S — Replace in-memory processedSyncIds Set with a Mongo-backed idempotency store (collapses into the scheduler PR's workspace).
Done signal: a POST /api/v1/connector-instances call creates a tenant's connector config; the scheduler claims and runs the next due ScanScopeDoc; a synthetic "empty graph" scan hits the global circuit breaker and the UI still shows the pre-existing entities.

T2 — Built-in validation + observability live

Goal: the execution_30d class of bug cannot ship undetected — on any tenant, not just fixtures. Operators have dashboards and alerts.

Rough shape: 2×M + 2×S.

M — Ship src/ingestion/validators/ + the five invariants from B.3, persisting verdicts to ScanRunDoc.validationResults[]. Gating test: the execution-30d-scoping validator fires on the pre-fix fixture in test/integration/ingestion/non-aws-path-scoping.test.ts (execution_30d=8, workload-wide misattribution) and passes on the post-fix fixture (execution_30d=3, per-destination scoped).
M — Execute the 5-day rollout in sv0-platform#494 (Grafana Cloud + BetterStack + grafana/mcp-grafana).
S — Canary tenant: fixture + golden committed under test/canary/; post-deploy job ingests through real pipeline; sv0_canary_drift_total metric wired.
S — Ship scan-safety Phase 4 (platform-derived health score, no connector self-report).
Done signal: replaying the pre-fix fixture produces exactly one critical-severity validation result on the associated ScanRunDoc with the predicate from B.3 above matching the actual violation; curl output from GET /api/v1/scan-runs/:id attached to the PR; Grafana shows the sv0_validation_findings_total{severity="critical"} counter increment; BetterStack pings app and (once it exists) the Azure dev VM.

T3 — Azure VM cutover (gated on T1+T2 stable on Hetzner)

Goal: platform hosting runs on Azure. Prod data in Atlas, dev/QA data self-hosted on the Azure VM.

Rough shape: 1×M + 1×S.

M — Execute #493 Phase 1 (Atlas cutover for prod/pre-prod): externalize MONGODB_URI, stand up M10 in Frankfurt, 30-min maintenance window, mongodump → mongorestore, flip URI, smoke test.
S — Execute #493 Phase 2 (compute migration): provision Azure VMs (prod + dev), install Docker + Caddy + Alloy, update DEPLOY_HOST_KEY + DNS, demote Hetzner to off.
Done signal: app.securityv0.com resolves to Azure prod VM; canary ingestion runs end-to-end; Grafana/BetterStack continuity unbroken across cutover.

4. What stays vs what changes per repo

Repo	Changes
`sv0-connectors`	Dockerfiles + CI to publish to GHCR; emit `scanScope` on every graph; emit canonical `target_resource_key` (#91). No scan-logic changes.
`sv0-platform`	#199 Stream 1 scheduler + HTTP API; persistent idempotency; scope-aware diff; circuit breaker; `src/ingestion/validators/` module + verdicts on `ScanRunDoc.validationResults[]`; `test/canary/` fixtures + golden + post-deploy ingest job; Prom counters (`sv0_validation_findings_total`, `sv0_canary_drift_total`). No new admin endpoint, no parallel `validation_findings` collection, no new UI surface.
`sv0-documentation`	This doc; `mkdocs.yml` nav update; callout in `2026-03-31-infrastructure-strategy.md` naming Azure VM as platform-hosting target (AWS-sandbox rationale preserved).
`sv0-demo-labs`	`make validate` target in each lab that POSTs the seeded corpus to a running dev API and reads back `GET /api/v1/scan-runs/:id` to get the validator verdicts: a thumbs-up/down in <30s.
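
For the sv0-demo-labs row, the step that turns a scan-run response into a thumbs-up/down could be as small as the function below. It is a sketch under assumptions: only validationResults[] is named in this doc, so the response shape, the rule IDs (borrowed from the validator file names in B.3), and the choice to treat any failed check as a thumbs-down are all hypothetical.

```typescript
// Hypothetical response shape for GET /api/v1/scan-runs/:id.
interface ValidatorVerdict {
  ruleId: string;
  severity: "info" | "warning" | "critical";
  passed: boolean;
}

interface ScanRunResponse {
  id: string;
  validationResults: ValidatorVerdict[];
}

interface LabVerdict {
  ok: boolean;
  failedRuleIds: string[];
  exitCode: 0 | 1; // what a make target would exit with
  summary: string;
}

function summarizeScanRun(run: ScanRunResponse): LabVerdict {
  const total = run.validationResults.length;
  const failedRuleIds = run.validationResults
    .filter((v) => !v.passed)
    .map((v) => v.ruleId)
    .sort();
  const ok = failedRuleIds.length === 0;
  const summary = ok
    ? `scan-run ${run.id}: all ${total} checks passed`
    : `scan-run ${run.id}: ${failedRuleIds.length} of ${total} checks failed (${failedRuleIds.join(", ")})`;
  return { ok, failedRuleIds, exitCode: ok ? 0 : 1, summary };
}

const labRun: ScanRunResponse = {
  id: "run-42",
  validationResults: [
    { ruleId: "referential-integrity", severity: "critical", passed: true },
    { ruleId: "execution-30d-scoping", severity: "critical", passed: false },
  ],
};

const labVerdict = summarizeScanRun(labRun);
// labVerdict.summary: "scan-run run-42: 1 of 2 checks failed (execution-30d-scoping)"
```

Keeping this step pure means the make target only has to do the POST, the GET, and print the summary; the pass/fail logic stays testable without a running API.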

5. Non-goals

Kubernetes (any flavor).
Temporal.io or similar workflow engines — current workload is simple claim-and-run, not DAGs.
AWS Lambda for connectors — revisit at 10+ connectors or per-customer secret isolation.
Azure-native PaaS (App Service, AKS, Container Apps, Cosmos DB for MongoDB).
Private Link / private endpoints — deferred to post-pilot hardening.
Systemd timers / cron-per-connector-per-tenant on the deploy host — superseded by #199 Stream 1's Mongo-claim scheduler.
Parallel validation_findings collection, or a new admin endpoint for validation results — results live on ScanRunDoc.validationResults[].
qa.dev.securityv0.com as a dedicated fail-loud env — replaced by canary tenant on the prod pipeline. If a seed-corpus-before-dev promotion env is needed later, open an issue using qa1.securityv0.com naming.
In-platform ops dashboard for validation/canary metrics — internal ops uses Grafana. A client-facing connector-status UI is a real future product surface but out of scope here.
External data-quality frameworks (Great Expectations, SODA Core, dbt tests, Pandera) — they assume a SQL warehouse; our graph is in Node memory pre-commit.
Learned thresholds / historical-distribution anomaly detection — violates AGENTS.md's deterministic-only rule, hides bugs instead of surfacing them. Static per-tenant thresholds are fine; auto-tuned ones are not.
Any ML, probabilistic scoring, or heuristic validation.

6. Verification (for the implementation tranches, not this doc)

Each tranche has a done signal above. Additional acceptance:

T1: unit test in test/ingestion/circuit-breaker.test.ts proves that a graph with 100% entity absence triggers the global breaker; a graph with 49% absence does not. POST /api/v1/connector-instances round-trip captured in the PR description; scheduler loop log excerpt showing a claim + run + verdict write to ScanRunDoc.
T2 — gating test for the validator PR: replay test/integration/ingestion/non-aws-path-scoping.test.ts pre-fix fixture (execution_30d = 8 misattributed across destinations A + B) through the execution-30d-scoping validator; verdict must be fail with the scoping+count predicate and identified path_ids. Replay the post-fix fixture (execution_30d = 3 per destination); verdict must be pass. Both outputs attached to the PR. If either is wrong, the predicate needs another pass before merge.
T2 — canary: checked-in golden diff run against a clean prod pipeline ingestion of the canary fixture returns zero structural diffs.
T3: BetterStack status-page screenshot showing uninterrupted monitoring across the DNS flip. mongosh against the Atlas URI from the Azure prod VM returns the expected tenant count; self-hosted Mongo on the Azure dev VM serves a pr-N.dev.securityv0.com preview end-to-end.

For this doc itself:

mkdocs serve in sv0-documentation renders the page with no broken links.
All cross-repo issue refs resolve (checked via gh issue view).
A reader unfamiliar with the codebase can answer three questions after reading §1–§3: (a) what triggers a scan today vs after T1? (b) how would the execution_30d bug be caught automatically after T2? (c) what changes on Azure VM cutover and what stays?

7. New GitHub issues

Created alongside this doc:

Title	Repo	Pillar
`docs: autonomous scans + built-in validation strategy` (#200)	sv0-documentation	This doc

To open after doc sign-off (narrower than the original draft; most of Pillar A folds into #199 Stream 1 tasks, not standalone issues):

Title	Repo	Pillar
`feat(validators): src/ingestion/validators/ module + ScanRunDoc.validationResults[] persistence`	sv0-platform	B.3
`infra(connectors): Dockerfiles + GHCR publishing CI for all connectors`	sv0-platform (CI) + sv0-connectors (Dockerfiles)	A
`feat(canary): tenant_id=canary fixture + golden + post-deploy ingest job + drift counter`	sv0-platform	B.4

To update (not new issues):

sv0-platform#493 body: drop the Azure-VM-or-AWS-VM fork; name Azure VM as the sole compute target. Split the Phase 1 MongoDB story into prod/pre-prod (Atlas) vs dev/QA (self-hosted on the VM).
2026-03-31-infrastructure-strategy.md: header callout referencing this plan, naming Azure VM as the platform-hosting target (shipped with this PR).

Everything else is tracked under #199's substream issues or the existing plans.

8. References

sv0-documentation#199 + the four substream docs dated 2026-04-22 — source-of-truth for connector control, multi-account AWS, graph stitching, and the MediaPro Lab 2 demo.
2026-02-26-scan-safety-and-observability.md — canonical source for circuit breaker, scan scope, rollback.
2026-03-31-infrastructure-strategy.md — secret tiers, AWS Organization layout. Scheduler design in §3 Phase 1 is superseded by #199 Stream 1.
sv0-platform#493 — Atlas cutover + compute migration.
sv0-platform#494 — Observability rollout.
sv0-platform#497, #498, #501 — the execution_30d bug trail that motivates B.3.
sv0-connectors#91 — canonical target_resource_key on execution evidence.
sv0-documentation#195 — MediaPro pilot umbrella (the deadline behind T1 and T2).
sv0-documentation#196 — fidelity doc reconciliation for the per-path proxy counts.
