Observability Stack Research — Pre-MediaPro Pilot

Summary

Pick: Grafana Cloud free tier + BetterStack free tier + grafana/mcp-grafana. Two engineer-days. $0/mo at pilot scale. Paid-tier exit ramp $30-60/mo. Survives every migration on our roadmap.

Agentic ops was the decision-driver. grafana/mcp-grafana (official Grafana Labs MCP server, Apache-2.0) is the most complete agent-queryable observability surface available today — logs (Loki), metrics (Prometheus), dashboards, alerts, and traces (Tempo) all callable from Claude Code sessions.

1. Current state of instrumentation

Better than "nothing," worse than "ready."

Logs. Hand-rolled structured JSON logger at sv0-platform/src/shared/logging/logger.ts (~80 lines, no dependencies). Emits { ts, level, message, ...context } JSON lines to stdout/stderr. Supports child() context binding. Used across ~20 modules (routes, workers, ingestion, risk-clusters, auth providers). Docker captures the output on Hetzner; nothing ships it anywhere; retention is the Docker log buffer.
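For readers who have not opened logger.ts, the shape described above can be sketched roughly as follows. This is a hypothetical reconstruction from the description only (JSON lines, `{ ts, level, message, ...context }`, `child()` binding); the names `createLogger`, `Sink`, and the sample fields are assumptions, and the real file may differ.

```typescript
// Hypothetical sketch of a dependency-free structured logger.
// It is NOT the contents of logger.ts; names and fields are illustrative.
type Level = "debug" | "info" | "warn" | "error";
type Context = Record<string, unknown>;
type Sink = (line: string) => void;

interface Logger {
  log(level: Level, message: string, context?: Context): void;
  child(bound: Context): Logger;
}

function createLogger(sink: Sink, bound: Context = {}): Logger {
  return {
    log(level, message, context = {}) {
      // Later keys win: call-site context overrides context bound via child().
      const entry = { ts: new Date().toISOString(), level, message, ...bound, ...context };
      sink(JSON.stringify(entry));
    },
    child(extra) {
      // A child logger shares the sink and carries the merged context.
      return createLogger(sink, { ...bound, ...extra });
    },
  };
}

// Usage: collect lines in memory here; a real sink would write to stdout.
const lines: string[] = [];
const root = createLogger((line) => lines.push(line));
const jobLog = root.child({ tenant_id: "t-123", job_type: "ingest" });
jobLog.log("info", "job finished", { duration_ms: 842 });
```

Because every line is a single JSON object, any collector that can tail stdout can index the fields without application changes.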

Metrics. Real Prometheus instrumentation via prom-client ^15.1.3 at sv0-platform/src/shared/metrics/metrics.ts:

sv0_http_request_duration_seconds (histogram, labels: method/route/status_code)
sv0_http_requests_total
sv0_job_duration_seconds — labels include tenant_id, which is the P0-5 leak
sv0_job_total, sv0_queue_depth
sv0_sync_age_minutes (connector_id), sv0_findings_total (status/severity), sv0_authority_paths_total
Node defaults via collectDefaultMetrics()

Exposed at GET /metrics (sv0-platform/src/api/routes/system.ts:52). Currently in PUBLIC_PATHS at sv0-platform/src/api/middleware/auth-middleware.ts:17.

Health/readiness. /health, /ready, /diagnostics all wired. /ready validates Mongo + worker.

Traces. None. No OTEL, Sentry, or Datadog tracer in package.json.

Connectors (Python). Standard logging module writing to stdout; systemd captures the output in journald.

Net: we already emit Prometheus-shaped metrics and structured JSON logs. The missing layer is collection, storage, query, and alerting, not instrumentation. Any stack we pick needs to (a) scrape /metrics and (b) ingest stdout JSON logs. No application rewrite required.

2. Options matrix

Free-tier numbers from each vendor's 2026-04 public pricing. "MCP" = Model Context Protocol server availability (official or actively-maintained community).

Option	Free tier	First paid tier (pilot scale ~10k series + 100GB logs)	MCP / agent access	Self-host burden	Migration portability
1. Grafana Cloud free	10k active series, 50 GB logs, 50 GB traces, 14-day retention, 3 users, unlimited dashboards	Pro $19/mo base + $8/50GB log overage + $16/1k extra series → ~$60-80/mo at scale	Official MCP (`grafana/mcp-grafana`, Apache-2.0); LogCLI; HTTP API per-datasource; API tokens scopable	Zero	Fully portable
2. Self-host LGTM on a €10/mo VPS	N/A	€10-20/mo (VPS + backup storage)	Same MCP works against self-hosted	We run Loki, Prom, Grafana, Alertmanager, manage TSDB retention. 1-2 days stand up, ongoing babysit	Fully portable but we carry the state
3. Datadog free	5 hosts, 1-day log retention, no APM	$15/host/mo + $1.70/1M events + $0.10/GB retained → ~$100-200/mo minimum	No official MCP; community MCPs immature; HTTP API + `datadog` CLI	Zero	Locked — Datadog agent everywhere
4. BetterStack (Logtail + Uptime)	1 GB logs/mo, 3-day retention, 10 uptime monitors, unlimited incidents/status pages	Teams $25/mo (30-day retention, 30 GB), then $0.25/GB	No MCP. Good REST API + CLI. Uptime monitors trivially scriptable	Zero	Portable
5. Axiom	500 GB/mo free, 30-day retention, no credit card	$25/mo base + $0.25/GB beyond 500GB	Official MCP (`axiomhq/mcp-server-axiom`); APL (Kusto-like) queries	Zero	Portable
6. SigNoz cloud	14-day trial only	Teams $199/mo base incl. 50GB	Community MCP; ClickHouse-backed	Zero (cloud) or heavy (self-host)	Portable via OTEL
7. New Relic free	100 GB ingest/mo, 8-day retention, 1 full user	$0.35/extra GB + $99/full user	No official MCP; NRQL-heavy, SDK-heavy	Zero	Locked
8. Honeycomb free	20M events/mo, 60-day retention	Pro $100/mo → 100M events	No official MCP. Events-oriented (traces)	Zero	Portable via OTEL
9. Cloudflare Logpush + R2	Enterprise-plan-only; Workers Logs Engine free only for Workers	n/a	No MCP	Zero	Cloudflare-locked
10. Stitched minimum	UptimeRobot free + self-scraped `/metrics` + `docker logs` + BetterStack for SMS	~$0 until outgrown	SSH + grep; works but primitive	SSH hygiene, manual rotation	Fully portable

Security: `tenant_id` label leak kills several options if not fixed first

sv0_job_duration_seconds carries tenant_id. Today, anyone past Cloudflare Access can scrape /metrics and enumerate tenants (P0-5 in readiness review §3.1). The fix — strip tenant_id from metric labels, keep per-tenant context in logs — must land before we point any external scraper at /metrics. The scraper must also authenticate. Grafana Cloud / Datadog / Axiom all support per-agent API tokens. With Cloudflare Access we can additionally gate by service-token.
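The label-strip fix described above amounts to routing identifier-like keys to the log context instead of the metric labels. A minimal sketch, assuming a hypothetical helper `splitLabels` and hypothetical label names `job_type` and `status` (only `tenant_id` is confirmed by the text):

```typescript
// Hypothetical helper: keeps identifier-like keys out of metric labels
// and hands them back as log context instead. Not taken from metrics.ts.
type Labels = Record<string, string>;

// Keys that must never become metric labels (tenant_id per P0-5).
const LOG_ONLY_KEYS = new Set<string>(["tenant_id"]);

function splitLabels(input: Labels): { metricLabels: Labels; logContext: Labels } {
  const metricLabels: Labels = {};
  const logContext: Labels = {};
  for (const [key, value] of Object.entries(input)) {
    if (LOG_ONLY_KEYS.has(key)) {
      logContext[key] = value;
    } else {
      metricLabels[key] = value;
    }
  }
  return { metricLabels, logContext };
}

// Usage: observe the histogram with metricLabels, log the duration with logContext.
const { metricLabels, logContext } = splitLabels({
  job_type: "ingest",
  status: "ok",
  tenant_id: "t-123",
});
```

Keeping the deny-list in one place makes it cheap to add other identifiers later without touching each call site.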

3. Agentic access — the decision driver

Three options have real agent stories; everyone else is HTTP-API-and-hope.

Grafana Cloud + grafana/mcp-grafana (https://github.com/grafana/mcp-grafana). Official, Grafana Labs–maintained, Apache-2.0. Covers Prometheus queries, Loki log queries, dashboard listing/search/panel-data, alert list/ack/silence, datasource list, Tempo trace queries. Most complete MCP surface among observability vendors. Works against both Grafana Cloud and self-hosted Grafana. Read-only scoped API tokens trivial.

Axiom + axiomhq/mcp-server-axiom (https://github.com/axiomhq/mcp-server-axiom). Official, Axiom-maintained. APL-based queries (Kusto-like, loosely SQL-ish). Narrower (logs only, no alerts) but 500 GB/mo free is the most generous free allowance in the matrix above, and APL is pleasant for LLMs to generate.

Grafana self-hosted with the same MCP. Same story but we run it. logcli is JSON-output-friendly and works as an agent escape hatch.

All others (Datadog, New Relic, Honeycomb, BetterStack, SigNoz) have HTTP APIs but no first-class MCP. Claude Code can still use them via shell + curl, but it's "works" not "works well."

4. Recommendation

Primary: Grafana Cloud free + BetterStack free + `grafana/mcp-grafana`

Cost: $0/mo. Two signups.

Why:

Free allowances (10k series, 50 GB logs, 14-day retention) cover a pilot and several customers after. Our current metric cardinality is in the hundreds of series while tenant_id is a label and drops to roughly 30 after the label-strip fix.
Best-in-class agentic MCP. Claude Code sessions can query logs, check alerts, ack silences, read dashboards, inspect traces without bespoke tooling.
Zero ops burden. We already operate Hetzner + Atlas + WorkOS; a fifth thing to operate is declined.
Survives Hetzner → Azure VM / AWS VM migration unchanged. Alloy (Grafana's collector) moves as a systemd service.
BetterStack free covers the external uptime monitor requirement from readiness review §3.1 today (60s checks, SMS alerts, 10 monitors; enough for /health, /ready, and the app.securityv0.com landing page).
Security clean: Alloy pushes /metrics outbound to Grafana Cloud (no public scrape endpoint). We remove /metrics from PUBLIC_PATHS simultaneously. Tenant labels only land where we control read access.

When we outgrow it: Grafana Cloud Pro = $19/mo base, $8/50GB log overage. At realistic pilot scale we stay free 6-12 months. First paid bill will be $30-60/mo.
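The paid-tier arithmetic can be sanity-checked with a few lines. This sketch uses only the prices quoted in the options matrix ($19 base, $8 per 50 GB of log overage, $16 per 1k extra series) and makes two assumptions that are not confirmed by the source: Pro includes the same allowances as the free tier, and overage is billed in whole increments.

```typescript
// Rough Pro-tier estimate from the prices quoted in the options matrix.
// Assumptions (not confirmed): Pro includes the free-tier allowances,
// and overage is billed in whole blocks.
const BASE_USD = 19;
const INCLUDED_SERIES = 10000;
const INCLUDED_LOG_GB = 50;
const USD_PER_1K_SERIES = 16;
const USD_PER_50GB_LOGS = 8;

function estimateProMonthlyUsd(activeSeries: number, logGb: number): number {
  const extraSeriesBlocks = Math.ceil(Math.max(0, activeSeries - INCLUDED_SERIES) / 1000);
  const extraLogBlocks = Math.ceil(Math.max(0, logGb - INCLUDED_LOG_GB) / 50);
  return BASE_USD + extraSeriesBlocks * USD_PER_1K_SERIES + extraLogBlocks * USD_PER_50GB_LOGS;
}

// Hypothetical usage levels, chosen only to exercise the formula.
const slightlyOver = estimateProMonthlyUsd(11000, 60);  // 19 + 16 + 8 = 43
const doubledLogs = estimateProMonthlyUsd(12000, 100);  // 19 + 32 + 8 = 59
```

Under these assumptions, modest overage on both axes lands inside the $30-60/mo range quoted above.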

Fallback / complement: Axiom free for logs-heavy workflows

When: if Grafana Loki's 50 GB feels tight (unlikely pre-pilot), or we want SQL-style agent log queries specifically. Complement, not replacement.

Explicitly rejected

Datadog — free tier is a trap (1-day retention, 5 hosts, no APM). First paid bill at scale is $100+. Vendor-lockiest.
Self-host LGTM — fine at our scale but we cannot afford 1-2 days of setup + ongoing retention management during a 10-14-day pilot sprint. Revisit post-pilot if costs bite.
New Relic — great free tier, bad agent story, sticky SDK.
Honeycomb — excellent at traces, we emit no traces today. Premature.
Cloudflare Logpush — Enterprise-only, not available to us.

5. One-week rollout

Fits inside Track A of the readiness plan. Two engineer-days total, parallelizable.

Day 1 — security + uptime (~1 hour)

Close P0-5. Strip tenant_id from sv0_job_duration_seconds labels (src/shared/metrics/metrics.ts:25); update call sites (src/workers/runtime.ts:125,133). Per-tenant duration survives in logs as a context field, not a metric label.
Lock down /metrics and /diagnostics. Remove from PUBLIC_PATHS (src/api/middleware/auth-middleware.ts:17). Make /metrics bearer-token-only via existing M2M auth. Provision one scrape token.
BetterStack. Sign up free. Add monitors for /api/v1/health on app + dev, 60s interval, SMS + email. Add CF-Access-Client-Id / CF-Access-Client-Secret headers so checks bypass the Cloudflare Access login page (service tokens already exist for CI).
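The bearer-token gate on /metrics can be illustrated with a small check. This is a stand-in sketch, not the existing M2M auth: the function name `isAuthorizedScrape` and the way the expected token is supplied are assumptions. It relies on Node's `crypto.timingSafeEqual`, which requires equal-length buffers, hence the hashing step.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both sides so the buffers always have equal length,
// which timingSafeEqual requires.
function sha256(value: string): Buffer {
  return createHash("sha256").update(value).digest();
}

// Hypothetical check for the scrape token; the real route would delegate
// to the existing M2M auth rather than a hand-rolled comparison.
function isAuthorizedScrape(authorizationHeader: string | undefined, expectedToken: string): boolean {
  if (!authorizationHeader || expectedToken.length === 0) return false;
  const match = /^Bearer (.+)$/.exec(authorizationHeader);
  if (!match) return false;
  const presented = match[1] ?? "";
  return timingSafeEqual(sha256(presented), sha256(expectedToken));
}
```

A route handler would return 401 whenever this check fails, so an unauthenticated request never reaches the metrics registry.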

Day 2 — Grafana Cloud setup (~2 hours)

Sign up for Grafana Cloud free. Create stack in Frankfurt (EMEA-aligned for MediaPro).
Install Grafana Alloy on the Hetzner host. One apt install, one config file. Configure:
- Prometheus remote_write to Grafana Cloud, scraping API /metrics with the bearer token
- Loki push from /var/lib/docker/containers/*/*-json.log (labels: service, container_name)
- node_exporter for host metrics
Verify in Grafana Cloud UI. Seed 3 dashboards: API overview (RED), worker queue, connector freshness (from sv0_sync_age_minutes).
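One detail worth knowing before wiring the Loki push: files under /var/lib/docker/containers are written by Docker's json-file driver, so each application JSON line arrives wrapped in an outer object with `log`, `stream`, and `time` keys. The collector has to peel that wrapper off before the logger's fields become queryable. The sketch below only illustrates the two layers; in practice the collector's pipeline does this, and the function name `unwrapDockerLine` is hypothetical.

```typescript
// Shape of one line written by Docker's json-file logging driver.
interface DockerLogLine {
  log: string;
  stream: string;
  time: string;
}

interface UnwrappedEntry {
  stream: string;
  dockerTime: string;
  raw: string;
  // Parsed application fields, or null when the app line was not a JSON object.
  fields: Record<string, unknown> | null;
}

function unwrapDockerLine(line: string): UnwrappedEntry {
  const outer = JSON.parse(line) as DockerLogLine;
  const raw = outer.log.replace(/\n$/, ""); // the driver keeps the trailing newline
  let fields: Record<string, unknown> | null = null;
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
      fields = parsed as Record<string, unknown>;
    }
  } catch {
    // Not JSON (for example a stack trace line); keep the raw text only.
  }
  return { stream: outer.stream, dockerTime: outer.time, raw, fields };
}

// Usage with a made-up sample line.
const sampleLine = JSON.stringify({
  log: '{"ts":"2026-04-01T00:00:00.000Z","level":"error","message":"boom"}\n',
  stream: "stderr",
  time: "2026-04-01T00:00:00.123456789Z",
});
const entry = unwrapDockerLine(sampleLine);
```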

Day 3 — alerts

Grafana Cloud Alerts: error rate >5% for 5min; sv0_queue_depth >100 for 10min; sv0_sync_age_minutes >2× schedule for any connector; /ready 503 for 2min. Email + Slack webhook.
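Three of the four alerts above map directly onto the existing metrics; a rough sketch of the PromQL follows, written as TypeScript constants so the thresholds stay reviewable in code. Assumptions: `sv0_http_requests_total` carries a `status_code` label like the duration histogram does, the `AlertRule` shape and rule names are hypothetical, and the connector ID and schedule passed to `staleSync` are made-up values. The /ready check is omitted because it needs a probe metric that the source does not describe.

```typescript
// Hypothetical rule shape; Grafana's own provisioning format differs.
interface AlertRule {
  name: string;
  expr: string;        // PromQL expression that is true while the alert condition holds
  forDuration: string; // how long the condition must hold before firing
}

// Assumes sv0_http_requests_total has a status_code label.
const highErrorRate: AlertRule = {
  name: "HighErrorRate",
  expr:
    'sum(rate(sv0_http_requests_total{status_code=~"5.."}[5m])) ' +
    "/ sum(rate(sv0_http_requests_total[5m])) > 0.05",
  forDuration: "5m",
};

const queueBacklog: AlertRule = {
  name: "QueueBacklog",
  expr: "max(sv0_queue_depth) > 100",
  forDuration: "10m",
};

// "More than 2x schedule" needs the schedule per connector, supplied by the caller.
function staleSync(connectorId: string, scheduleMinutes: number): AlertRule {
  return {
    name: `StaleSync_${connectorId}`,
    expr: `sv0_sync_age_minutes{connector_id="${connectorId}"} > ${2 * scheduleMinutes}`,
    forDuration: "0m",
  };
}

const exampleStaleRule = staleSync("example-connector", 60);
```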

Day 4 — agent access

Create read-only Grafana Cloud API token scoped to Loki + Prom + Alerts.
Install grafana/mcp-grafana as an MCP server in our .claude/settings.json. One permission entry.
Verify round-trip: ask a Claude Code session "show me 5xx responses in the last hour on dev," confirm it works.

Day 5 — connector logs

Point Alloy at Hetzner's journald so systemd connector runs stream to Loki. Label by unit name. ~0 cost at connector volume.

Days 6-7 — buffer

Tuning cardinality, dashboards, MediaPro-specific alerts.

Post-pilot forward-port

When compute migrates to Azure VM or AWS VM (sv0-platform#493 Phase 2), Alloy moves with it. Grafana Cloud stack, tokens, dashboards, alerts unchanged. Migration is a systemctl + DNS change, not a rewrite.

6. Risks

Cardinality blowup. tenant_id on metrics is today's obvious hazard. Future labels entity_id, finding_id, user_id are similarly dangerous. Discipline: IDs in logs, metrics labeled by shape (job_type, status, connector_id).
Free-tier drift. Grafana Cloud has raised prices before. If the free tier shrinks, we port to self-host LGTM on a €10/mo VPS (option 2 in the matrix) in a day or two; the architecture (Alloy + prom-client + JSON logs) is unchanged.
MCP maturity. grafana/mcp-grafana is active but young. Fallback: agents use logcli and curl against HTTP API — workable, chattier.
No traces. Deferred explicitly. The regret scenario: connector hangs mid-ingest, we can't see where. Mitigation: structured log correlation IDs already in the logger; OTEL-instrument later without throwing anything away.
BetterStack / Grafana overlap. Grafana Cloud's higher tiers include synthetic monitors. Post-pilot we consolidate. Free SMS alerting is worth the duplication for now.
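The cardinality risk at the top of this list is easy to quantify: the worst-case series count for a metric is the product of the distinct values of each label, multiplied by the number of series each label set emits (a histogram emits one series per bucket plus sum and count). The sketch below uses made-up label counts and an assumed multiplier of 14 purely to show the effect of adding one ID-like label.

```typescript
// Worst-case series for one metric: product of distinct values per label,
// times the number of series emitted per label set (1 for a counter or gauge).
function worstCaseSeries(distinctValuesPerLabel: Record<string, number>, seriesPerLabelSet = 1): number {
  return Object.values(distinctValuesPerLabel).reduce((acc, n) => acc * n, seriesPerLabelSet);
}

// Hypothetical counts: 5 job types, 3 statuses, 40 tenants.
const shapeOnly = worstCaseSeries({ job_type: 5, status: 3 });                 // 15
const withTenant = worstCaseSeries({ job_type: 5, status: 3, tenant_id: 40 }); // 600

// Same labels on a histogram, assuming 14 series per label set.
const histogramWithTenant = worstCaseSeries({ job_type: 5, status: 3, tenant_id: 40 }, 14); // 8400
```

With these illustrative numbers, a single histogram labeled by tenant would approach the 10k-series free allowance on its own, which is why IDs belong in logs.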

7. Regional strategy — addendum (2026-05-03)

§5 picks Frankfurt in one line. This section captures the why so the question doesn't have to be re-answered every time a US client lands.

Telemetry region ≠ customer region

The data inside Grafana Cloud is metrics and logs about the platform, not customer data. tenant_id appears as a log field (post-#763 it is no longer a metric label), but it is an identifier, not the tenant's data. Customer-data residency conversations are about Atlas region + platform compute region — not about Grafana Cloud.

Telemetry mirrors deployment geography: it lives where the compute that emits the signals runs. SecurityV0 currently runs from one place (Hetzner DE) and is moving to one Azure region (EU/Frankfurt-aligned per pre-client readiness review §3). One deployment = one Grafana stack.

Triggers that justify a second stack

Trigger	Real or theoretical	Action
Second platform deployment in another region (e.g., separate `app-us.securityv0.com` on a US Azure VM with its own Atlas in `us-east-1`)	Real, ~12+ months out at earliest	Provision paired US stack at the same time as the deployment
Enterprise contract requires telemetry (not customer data) to stay in-region	Rare; most clauses target customer data which lives in Atlas + platform storage	Address per-contract; do not preempt
<50ms query response from a specific geography for human operators	Negligible — operators click dashboards async, 100ms WAN is invisible	Ignore
Free-tier volume limits hit (10k series / 50 GB logs / 14-day retention)	The real upgrade trigger; fires at moderate customer scale	Upgrade to Pro on the same stack ($19/mo base + overage); region does not change

The trigger most people imagine — "we have US customers, we need a US stack" — is not on this list.

What the free tier actually allows

Grafana Cloud's free tier is 1 stack per organization, not per region. Two free stacks would require two separate organizations, fully isolated: separate logins, dashboards, alert rules, API tokens, MCP server configs. You'd operate two consoles to monitor what is fundamentally one product, with no cost saving past the volume cap (the cap is per-stack, not per-org). The pain is real, the math doesn't work.

Decision

One stack, in Frankfurt, indefinitely. No regional splitting until the platform itself splits.

Operational details:

Owner email: company-owned address (ivan@securityv0.com or shared ops@securityv0.com), never personal Gmail. The org owns the data; if the seat changes hands, ownership transfers cleanly.
Stack name: sv0-prod. Single stack — dev + prod both push to it, distinguished via an env label, not separate stacks. Free tier is too small to split by env.
Region: Frankfurt — aligns with Hetzner today, MediaPro tomorrow, the planned Azure region per the readiness review, and EU GDPR posture by default.
Upgrade path: outgrow free → ~$30-60/mo Pro on the same stack. Migration is a billing-page click; no Alloy reconfiguration, no token rotation.
Re-evaluation trigger: when (not if) we commission a US-region SecurityV0 deployment. That day, provision a paired US stack — not before. Will be tracked alongside whatever issue eventually scopes the US deployment.
