ADR-021: Delegated-Agent Audit Log Storage

Status

Proposed — 2026-05-04

Follows sv0-platform PR #788 (in-process emit fix for #719) and issue #790 (the persistence half). The issue explicitly asks for an ADR before implementation.

Context

PR #788 closed the emit half of the gap: every delegated_agent request's success log line now carries agentClientId and userId (and, on the bridge path, bridgedToUserId / originalSubjectId). The forensic question "all Ivan actions via Claude Code in the last 24h" now has the right fields in the payload.

The persistence half is unsolved. Today, all audit-grade signal lives in src/shared/logging/logger.ts — a hand-rolled JSON logger (~80 lines, no pino, no shipper) that writes to stdout / stderr. Docker's container log buffer is the only retention. There is no audit_logs collection, no Loki, no Logpush sink, no SIEM. Lines older than the buffer rotation are gone.

This is a posture decision, not a tech-detail decision. The choice determines:

Whether we can answer a customer security review's "show me an audit query for principal X over date range Y" with anything other than "we are working on it."
Whether SOC 2 readiness work has a place to point at for control evidence around privileged-access activity (CC6, CC7).
Whether the ops burden of audit retention scales with customer count or stays flat.
Whether the audit write path fights the request-handling write path on the same Mongo cluster under load.

The forensic query is also not the only consumer. Any future "agent activity" UI surface, any rate-limit-by-agent enforcement, any incident-response export to a customer all read from the same store.

Decision drivers

In rough order of weight:

Customer-facing answerability. "Give me an audit query" must work without an engineer hand-crafting a mongosh aggregation. This is the difference between an enterprise prospect's security review getting a "yes, here's how" vs. an excuse.
Retention. Pilot: 30 days non-negotiable. Enterprise prospect ask: 90 days minimum, 1 year likely. Drives storage volume and TTL design.
Write-path cost. The audit write happens on every authenticated request. It must not block, must not back up the response, and must not contend with the ingestion-write path on the platform's primary Mongo cluster.
Ops burden. New infrastructure surfaces add quarterly maintenance (provider rotations, retention recomputations, schema migrations). Tax compounds with every new surface.
Query expressiveness. "Group by agentClientId over time, filtered by userId" should be a one-liner, not a multi-stage aggregation.
Multi-tenant isolation. Today single-tenant in Mongo, but multi-tenant audit query is a likely-soon ask. Whichever option we choose must support per-tenant scoping at query time without full-collection scans.
Fail-loud at boot. Per Ivan's preference and the global memory entry "Fail loud over silent fallback" — if the audit sink is misconfigured, the API must refuse to start, not silently drop audit writes.

Options considered

Option A — Mongo `audit_logs` collection

A new collection in the existing primary cluster. Per-request middleware pushes a doc on res.on("finish") for any request whose req.authContext.principal === "delegated_agent". Compound index (tenant_id, user_id, agent_client_id, created_at) plus a TTL index on created_at for retention. New AuditLogAdapter follows the existing pattern under src/storage/mongo/adapters/ (24 sibling adapters already; mechanical).
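As a rough sketch of the shape described above, the two indexes and the middleware's selection rule might look like the following. The constant and function names and the `retentionDays` parameter are illustrative assumptions, not existing sv0-platform code; the key specs are plain objects in the form that MongoDB's `createIndex` accepts.

```typescript
// Hypothetical sketch: index specs for an audit_logs collection.
// All names are illustrative; nothing here exists in the codebase today.

// Compound index backing tenant-scoped forensic queries.
const AUDIT_QUERY_INDEX = {
  tenant_id: 1,
  user_id: 1,
  agent_client_id: 1,
  created_at: 1,
} as const;

// MongoDB TTL indexes must be single-field, so retention gets its own index.
function auditTtlIndex(retentionDays: number) {
  if (!Number.isInteger(retentionDays) || retentionDays <= 0) {
    throw new Error(`retentionDays must be a positive integer, got ${retentionDays}`);
  }
  return {
    key: { created_at: 1 } as const,
    options: { expireAfterSeconds: retentionDays * 24 * 60 * 60 },
  };
}

// Minimal stand-in for req.authContext; the real type lives in the app.
interface AuthContextLike {
  principal: string;
}

// The middleware's selection rule: only delegated_agent requests are audited.
function shouldAudit(authContext: AuthContextLike | undefined): boolean {
  return authContext?.principal === "delegated_agent";
}
```

Keeping the TTL on its own single-field index means the retention window can change (30, 90, or 365 days) without touching the query index.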

Pros

Zero new infrastructure surface — same cluster, same backups (mongodump on every prod deploy), same Atlas project once we migrate (ADR-019 Phase 4).
Tenant scoping is the same tenant_id filter we apply everywhere else; reuses the cross-tenant lookup invariants from MongoStorageAdapter.
Customer-facing query API is straightforward — findAuditLogs({ user_id, agent_client_id, since, until }) is six lines on top of the adapter.
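Strong consistency. The audit doc and the request's own writes live on the same cluster, so there is no cross-system reconciliation. Proving "if the response went out, the audit entry was durable" additionally requires awaiting the audit write before responding; a write issued from res.on("finish") happens after the response has already been sent.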
exactOptionalPropertyTypes and the existing bulkWrite / $setOnInsert patterns apply unchanged.
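To illustrate the "six lines on top of the adapter" claim, the filter behind a findAuditLogs call could be built roughly as below. This is a sketch under assumptions: the `AuditLogQuery` and `AuditLogFilter` types and the `buildAuditLogFilter` helper are hypothetical, and making `tenant_id` mandatory is an assumption that follows the tenant-scoping point above rather than a confirmed signature.

```typescript
// Hypothetical sketch of the filter behind findAuditLogs(); not existing code.
interface AuditLogQuery {
  tenant_id: string; // assumed mandatory so no query can span tenants
  user_id?: string;
  agent_client_id?: string;
  since?: Date;
  until?: Date;
}

interface AuditLogFilter {
  tenant_id: string;
  user_id?: string;
  agent_client_id?: string;
  created_at?: { $gte?: Date; $lt?: Date };
}

function buildAuditLogFilter(q: AuditLogQuery): AuditLogFilter {
  if (!q.tenant_id) throw new Error("tenant_id is required for audit queries");
  const filter: AuditLogFilter = { tenant_id: q.tenant_id };
  // Only set keys that were provided, which keeps the object valid under
  // exactOptionalPropertyTypes and lets the compound index prefix apply.
  if (q.user_id !== undefined) filter.user_id = q.user_id;
  if (q.agent_client_id !== undefined) filter.agent_client_id = q.agent_client_id;
  if (q.since !== undefined || q.until !== undefined) {
    const range: { $gte?: Date; $lt?: Date } = {};
    if (q.since !== undefined) range.$gte = q.since;
    if (q.until !== undefined) range.$lt = q.until;
    filter.created_at = range;
  }
  return filter;
}
```

The returned object is a plain MongoDB query document, so the adapter method would pass it straight to `find` on the collection.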

Cons

Write contention with the ingestion hot path on the same cluster. A burst of agent traffic shares IO with NormalizedGraph upserts and finding writes. Mitigable with writeConcern: { w: 1 } and async fire-and-forget writes, but the contention model is real.
Mongo is not a great log query engine. Histograms over time, regex searches, and free-text grep over path / message bodies are doable but awkward compared to Loki's LogQL.
Retention beyond ~90 days starts to bloat the working set and the backup tarball. TTL handles the deletion, but the cluster sizing has to budget for the steady-state volume.

Effort estimate. 2 days. New AuditLogAdapter, middleware change, tests, runbook entry, fail-loud boot check that the index exists. Migrations live under scripts/migrations/.

Option B — Grafana Cloud Loki (already adopted by ADR-019)

A vector (or promtail) shipper on each deploy host tails the docker JSON logs, applies a small filter that keeps only entries with provenance: "delegated_agent" (or whatever audit shape we land on), and ships them to Grafana Cloud Loki with structured labels. LogQL queries from grafana/mcp-grafana (already on the agentic ops roadmap) answer the forensic questions directly.

ADR-019 already adopted Grafana Cloud free + BetterStack free + grafana/mcp-grafana as the observability stack (Phase 4 of the IaC rollout). The 2026-04-22 observability research doc explicitly evaluated this stack and picked it. Loki is not new infra — it is infra we are already standing up; this ADR would land on it sooner.

Pros

LogQL is the right query language for "all entries where userId == X and agentClientId == Y over time". One line. Histogram over time? count_over_time(...). Group by agent? sum by (agent_client_id). The query expressiveness gap vs. Mongo is large.
Free tier covers 50 GB logs / 14-day retention / unlimited dashboards / 3 users. Pilot fits inside this trivially. Paid tier $19 + $8/50GB beyond. Even at 100 enterprise customers, audit logs alone are unlikely to push past the free tier.
No added write-path cost on the request handler: the shipper tails stdout asynchronously, so the API thread performs the same log write it does today and never waits on the audit store.
Native multi-tenant query via Loki labels: {tenant_id="..."} is a free dimension, no full-scan.
30-day / 90-day / 13-month retention is a config knob, not a TTL job we have to defend in a backup story.
grafana/mcp-grafana makes audit queries directly callable from Claude Code agent sessions — useful for incident response and for the "the agent investigated its own activity" loop.
Survives every infra migration on the roadmap. Hetzner today, Atlas + AWS/Azure later — the Loki endpoint is the same.
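As an illustration of the one-line query claimed above, a small helper could render the "group by agent, filtered by user" question as LogQL. This is a sketch under assumptions: it treats `tenant_id` as a Loki stream label and expects `user_id` and `agent_client_id` to be extracted from the JSON log line by the `json` parser stage; the helper names are hypothetical and the accepted duration units are a deliberately conservative subset.

```typescript
// Hypothetical helper that renders the forensic query as a LogQL string.
// Assumes tenant_id is a stream label; other fields come from `| json`.
interface AgentActivityQuery {
  tenantId: string;
  userId: string;
  window: string; // range duration, e.g. "1h"
}

// Escape backslashes and double quotes for a LogQL string literal.
function logqlString(value: string): string {
  return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

function agentActivityLogQL(q: AgentActivityQuery): string {
  // Conservative subset of duration units accepted by this sketch.
  if (!/^\d+(s|m|h|d)$/.test(q.window)) {
    throw new Error(`invalid duration: ${q.window}`);
  }
  const selector = `{tenant_id=${logqlString(q.tenantId)}}`;
  const pipeline = `| json | user_id=${logqlString(q.userId)}`;
  return `sum by (agent_client_id) (count_over_time(${selector} ${pipeline} [${q.window}]))`;
}
```

For `{ tenantId: "acme", userId: "ivan", window: "1h" }` this yields `sum by (agent_client_id) (count_over_time({tenant_id="acme"} | json | user_id="ivan" [1h]))`. Escaping the inputs matters if the helper ever sits behind a customer-facing surface.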

Cons

New shipper to operate (vector or promtail). Has to be installed on each deploy host (today: 2 VMs; post-migration: a managed-platform sidecar). Adds one more thing to keep running.
"Customer-facing audit query API" doesn't exist yet. If a customer wants a self-serve audit page in the sv0 UI, we either (a) call Grafana's HTTP API server-side and re-render, or (b) accept that the audit answer-ability story is "we run the query for you and export," or (c) build the small Mongo collection (Option A) on top just for that surface and let Loki carry the bulk retention.
Shipper config drift: the filter that decides "this line is an audit entry" lives outside the app code. If we change the audit-entry shape in the app and forget to update the shipper, audit entries silently start missing the store. Must be tested and runbook-documented.
Free tier has 14-day retention. To meet the 90-day enterprise minimum we need the $19+/mo Pro tier (still cheap, but not free). Free tier is enough for dev, but it covers neither the pilot's 30-day requirement nor an enterprise contract.
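The config-drift risk in the list above could be caught by a CI check that compares the fields the app emits with the fields the shipper filter references. The sketch below is hypothetical: the function names are illustrative, and how the shipper's field list is obtained (for example, parsed from vector.toml in a CI step) is left as an assumption.

```typescript
// Hypothetical CI drift check between the app's audit shape and the shipper filter.
// Field names follow the audit-entry shape listed in this ADR.
const APP_AUDIT_FIELDS: readonly string[] = [
  "request_id", "ts", "tenant_id", "user_id", "agent_client_id",
  "provenance", "path", "method", "status", "duration_ms",
];

interface ShapeDrift {
  missingInShipper: string[]; // fields the app emits but the filter ignores
  unknownToApp: string[];     // fields the filter expects but the app never emits
}

function findShapeDrift(
  appFields: readonly string[],
  shipperFields: readonly string[],
): ShapeDrift {
  const app = new Set(appFields);
  const shipper = new Set(shipperFields);
  return {
    missingInShipper: appFields.filter((f) => !shipper.has(f)),
    unknownToApp: shipperFields.filter((f) => !app.has(f)),
  };
}

// A CI step would call this and let the thrown error fail the build.
function assertNoShapeDrift(shipperFields: readonly string[]): void {
  const drift = findShapeDrift(APP_AUDIT_FIELDS, shipperFields);
  if (drift.missingInShipper.length > 0 || drift.unknownToApp.length > 0) {
    throw new Error(`audit shape drift: ${JSON.stringify(drift)}`);
  }
}
```

Reporting both directions of the mismatch makes the failure message actionable whether the app or the shipper config changed first.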

Effort estimate. 3-4 days, gated on ADR-019 Phase 4 actually landing. Includes Grafana Cloud signup (currently "pending signup" per ADR-019), vector.toml for both Hetzner VMs, Loki dashboards, a fail-loud health check that makes the API refuse to start if LOKI_SHIPPER_HEALTHY=false, and the runbook entry.

Option C — Cloudflare Logpush → R2

Cloudflare ingests the request log at the edge before it reaches our API. Logpush ships those edge logs to an R2 bucket on a schedule. A separate Lambda (or scheduled GitHub Action) parses the R2 objects, extracts the audit-grade fields, and indexes them somewhere queryable.

Pros

Zero application change. The audit data is already at the edge.
R2 is cheap object storage with no per-request cost.
Keeps the audit data physically separate from the platform Mongo — strong blast-radius separation if the platform Mongo is compromised.

Cons

Cloudflare Logpush is Enterprise-plan-only (per the 2026-04-22 observability research). We are on the free / Pro tier. This is a multi-thousand-dollar-per-month tier upgrade, well past "rounding error against ACV."
Edge logs don't have userId / agentClientId. Those are resolved inside the API after JWT introspection. The edge sees the bearer token but not the resolved principal, so the structured audit shape we built in PR #788 cannot be reconstructed at the edge. This is the dealbreaker for the actual #790 query.
Even if we worked around it (e.g., by having the API echo a structured X-Audit-* response header that Cloudflare's logs pick up), we would then have a bespoke header schema, an extractor pipeline, an index store, and a query UI to build. That's a quarter of work, not a week.
Slow query path. R2 → extractor → index → query is hours-to-days behind real time. Useless for incident response.

Effort estimate. Not pursued — the Enterprise-plan gate alone disqualifies it pre-revenue, and the principal-resolution gap disqualifies it on functional grounds even if we paid.

Recommendation

Adopt Option B (Grafana Cloud Loki) as the primary audit store, and add a small Option-A audit_logs Mongo collection only when the first customer asks for an in-product audit query.

Three reasons drive this:

Loki is already on the roadmap. ADR-019 picked it in Phase 4. Doing audit storage on Loki is not a new infra decision — it is using the infra we already committed to, slightly earlier. Picking Mongo here would either duplicate the storage substrate or pre-empt the ADR-019 rollout for no reason.
LogQL is the right query language for the forensic question. "Group by agentClientId over time, filtered by userId" is a one-line LogQL query but a multi-stage Mongo aggregation. The acceptance test from #719 ("all Ivan via Claude Code in last 24h returns rows") is trivially expressible.
Write-path separation. The audit write does not contend with ingestion or finding writes on the primary cluster. The shipper is async and external to the request lifecycle. Even at 10× current request volume, the audit path adds no work to the API thread beyond the stdout write it already performs.

The Mongo audit_logs collection is not rejected — it is deferred until a customer asks for an in-product audit page. At that point we add a small adapter that writes the audit doc in addition to the stdout emit, indexed by (tenant_id, user_id, agent_client_id, created_at), scoped to the last 30-90 days so the working set stays small. Loki still carries the bulk and the long retention. This dual-write pattern is well-understood; the marginal cost when we get there is ~1 day.

Required behaviors regardless of option

Fail-loud boot. If the audit sink is configured but unreachable, the API must refuse to start. Specifically: if AUDIT_SINK=loki and the Loki health endpoint does not return 200 within the boot timeout, exit 1. No silent fallback to "stdout only." (Per Ivan's stated preference and the feedback_fail_loud_over_silent_fallback memory.)
No silent shipper drop. The shipper must emit metrics (sv0_audit_ship_lag_seconds, sv0_audit_ship_dropped_total), and BetterStack must alert when sv0_audit_ship_dropped_total is above 0 for more than 5 minutes.
Schema lock. The audit-entry shape (request_id, ts, tenant_id, user_id, agent_client_id, provenance, path, method, status, duration_ms, plus bridge-path extras) must be defined as a TypeScript type with a runtime validator. The shipper's filter rule must reference the same shape; CI must fail if they drift.
Tenant scoping at query time. Loki labels include tenant_id. Mongo path uses the existing tenant invariants. No global-scan path on either side.
Read-only against source systems unchanged. This decision is platform-internal observability; connectors stay read-only.
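The fail-loud boot rule above can be sketched as a pure decision function that the startup code calls after probing the sink. `AUDIT_SINK` and the "200 within the boot timeout" rule come from this ADR; the type names, the `decideAuditBoot` function, and the choice to treat an unset or unrecognized sink as misconfiguration are assumptions made for illustration.

```typescript
// Hypothetical fail-loud boot decision. AUDIT_SINK comes from this ADR;
// every other name here is an illustrative assumption.
type ProbeOutcome =
  | { kind: "status"; code: number }
  | { kind: "timeout" }
  | { kind: "error"; message: string };

type BootDecision = { ok: true } | { ok: false; reason: string };

function decideAuditBoot(
  auditSink: string | undefined,
  probe: ProbeOutcome,
): BootDecision {
  // Assumption: an unset or unrecognized sink counts as misconfiguration.
  if (auditSink !== "loki") {
    return { ok: false, reason: `unsupported AUDIT_SINK: ${auditSink ?? "(unset)"}` };
  }
  switch (probe.kind) {
    case "status":
      return probe.code === 200
        ? { ok: true }
        : { ok: false, reason: `Loki health endpoint returned ${probe.code}` };
    case "timeout":
      return { ok: false, reason: "Loki health check exceeded the boot timeout" };
    case "error":
      return { ok: false, reason: `Loki health check failed: ${probe.message}` };
  }
}
```

At startup the caller would log `reason` and exit with code 1 whenever `ok` is false, with no fallback to stdout-only. Keeping the decision separate from the network probe makes every failure branch unit-testable.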

Open questions / next step

Confirm Grafana Cloud signup is unblocked. ADR-019 lists it as "pending signup." If the signup is not complete, the first PR is "stand up Grafana Cloud free tier and a vector shipper on dev" — not the audit code. This needs to land before the Option-B implementation can.
Decide retention tier now or defer? Free tier (14-day) is fine for dev. If we want the prospect-credible 90-day number on day one, we sign up for Pro ($19/mo + $8/50GB) immediately. Recommend: free tier for dev, Pro on prod from day one — the cost is below the threshold of any budget review and removes a "we'll get to it" footnote from security questionnaires.
What is the audit-entry shape in code? Today the emit is an ad-hoc object literal in bearer-token-middleware.ts:222 and :267. Before the next PR we extract a DelegatedAgentAuditEntry type in src/domain/audit/types.ts, with a Zod (or hand-rolled) validator, and refactor both call sites to construct it. This is the structural prerequisite to Option B and to any future Option-A overlay.
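A minimal sketch of what the extracted type and a hand-rolled validator might look like, assuming the field list from the schema-lock requirement above. The snake_case names for the bridge-path extras, the ISO-8601 reading of `ts`, and the helper names are assumptions; the real shape should be taken from the two emit sites.

```typescript
// Hypothetical sketch of DelegatedAgentAuditEntry with a hand-rolled validator.
interface DelegatedAgentAuditEntry {
  request_id: string;
  ts: string; // assumed to be an ISO-8601 timestamp
  tenant_id: string;
  user_id: string;
  agent_client_id: string;
  provenance: "delegated_agent";
  path: string;
  method: string;
  status: number;
  duration_ms: number;
  // Bridge-path extras (names assumed), present only on bridged requests.
  bridged_to_user_id?: string;
  original_subject_id?: string;
}

const REQUIRED_STRINGS = [
  "request_id", "ts", "tenant_id", "user_id", "agent_client_id", "path", "method",
] as const;
const OPTIONAL_STRINGS = ["bridged_to_user_id", "original_subject_id"] as const;

// Returns a list of problems; an empty list means the entry is valid.
function validateAuditEntry(input: unknown): string[] {
  if (typeof input !== "object" || input === null) return ["entry must be an object"];
  const rec = input as Record<string, unknown>;
  const errors: string[] = [];
  for (const field of REQUIRED_STRINGS) {
    const value = rec[field];
    if (typeof value !== "string" || value === "") {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  for (const field of OPTIONAL_STRINGS) {
    const value = rec[field];
    if (value !== undefined && typeof value !== "string") {
      errors.push(`${field} must be a string when present`);
    }
  }
  if (rec["provenance"] !== "delegated_agent") {
    errors.push('provenance must be "delegated_agent"');
  }
  if (!Number.isInteger(rec["status"])) errors.push("status must be an integer");
  const duration = rec["duration_ms"];
  if (typeof duration !== "number" || !Number.isFinite(duration) || duration < 0) {
    errors.push("duration_ms must be a non-negative number");
  }
  return errors;
}

function isAuditEntry(input: unknown): input is DelegatedAgentAuditEntry {
  return validateAuditEntry(input).length === 0;
}
```

Returning the full list of problems rather than a boolean makes a rejected entry debuggable from a single log line, and both emit sites can construct the type and run the same check.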

Concrete first PR after this ADR is accepted: "feat(audit): extract DelegatedAgentAuditEntry type and validator; wire fail-loud AUDIT_SINK config check at boot; emit sv0_audit_emit_total Prometheus counter." No shipper, no Loki yet — just the schema lock, the boot check, and the emit metric. That PR is the foundation for whatever ships next; it gates both options and is the right thing to land regardless of when ADR-019 Phase 4 completes.

References

sv0-platform#790 — issue this ADR responds to
sv0-platform#788 — the in-process emit fix for #719
sv0-platform#719 — original audit-attribution gap
ADR-019: Infrastructure-as-Code Strategy — adopts Grafana Cloud + BetterStack + grafana/mcp-grafana
Observability stack research, 2026-04-22 — the matrix that picked Grafana Cloud Loki
src/api/middleware/bearer-token-middleware.ts:216-228, 262-280 — the two emit sites whose payload this ADR persists
src/shared/logging/logger.ts — current hand-rolled stdout logger; unchanged by this ADR
