Observability Stack Research — Pre-MediaPro Pilot
Summary
Pick: Grafana Cloud free tier + BetterStack free tier + grafana/mcp-grafana. Two engineer-days. $0/mo at pilot scale. Paid-tier exit ramp $30-60/mo. Survives every migration on our roadmap.
Agentic ops was the decision-driver. grafana/mcp-grafana (official Grafana Labs MCP server, Apache-2.0) is the most complete agent-queryable observability surface available today — logs (Loki), metrics (Prometheus), dashboards, alerts, and traces (Tempo) all callable from Claude Code sessions.
1. Current state of instrumentation
Better than "nothing," worse than "ready."
Logs. Hand-rolled structured JSON logger at sv0-platform/src/shared/logging/logger.ts (~80 lines, no dependencies). Emits { ts, level, message, ...context } JSON lines to stdout/stderr. Supports child() context binding. Used across ~20 modules (routes, workers, ingestion, risk-clusters, auth providers). Docker captures the output on Hetzner; nothing ships it anywhere; retention is the Docker log buffer.
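For readers who have not opened logger.ts, the shape described above can be sketched roughly as follows. This is an illustrative reconstruction, not the real file: the names `createLogger` and `Sink`, and the choice to route warn/error to stderr, are assumptions.

```typescript
// Hypothetical sketch of a dependency-free structured logger; NOT the real logger.ts.
type Level = "debug" | "info" | "warn" | "error";
type Context = Record<string, unknown>;
type Sink = (line: string) => void;

interface Logger {
  log(level: Level, message: string, context?: Context): void;
  child(bound: Context): Logger;
}

function createLogger(
  bound: Context = {},
  out: Sink = (line) => console.log(line),
  err: Sink = (line) => console.error(line),
): Logger {
  return {
    log(level, message, context = {}) {
      // Context is spread first so it cannot overwrite ts, level, or message.
      const line = JSON.stringify({
        ...bound,
        ...context,
        ts: new Date().toISOString(),
        level,
        message,
      });
      // Assumption: warn and error go to stderr, everything else to stdout.
      const sink = level === "error" || level === "warn" ? err : out;
      sink(line);
    },
    child(extra) {
      // A child inherits the parent's bound context and adds its own.
      return createLogger({ ...bound, ...extra }, out, err);
    },
  };
}

// Usage: one JSON line per call, with the bound context on every line.
const workerLog = createLogger({ service: "worker" }).child({ module: "ingestion" });
workerLog.log("info", "sync started", { connector_id: "example-connector" });
```

Because every line is a single JSON object, a collector can ingest the Docker log stream without a custom parser.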
Metrics. Real Prometheus instrumentation via prom-client ^15.1.3 at sv0-platform/src/shared/metrics/metrics.ts:
- sv0_http_request_duration_seconds (histogram, labels: method/route/status_code)
- sv0_http_requests_total
- sv0_job_duration_seconds: labels include tenant_id, which is the P0-5 leak
- sv0_job_total, sv0_queue_depth
- sv0_sync_age_minutes (connector_id), sv0_findings_total (status/severity), sv0_authority_paths_total
- Node defaults via collectDefaultMetrics()
Exposed at GET /metrics (sv0-platform/src/api/routes/system.ts:52). Currently in PUBLIC_PATHS at sv0-platform/src/api/middleware/auth-middleware.ts:17.
Health/readiness. /health, /ready, /diagnostics all wired. /ready validates Mongo + worker.
Traces. None. No OTEL, Sentry, or Datadog tracer in package.json.
Connectors (Python). Standard logging module; output goes to stdout and is captured by systemd (journald).
Net: we already emit Prometheus-shaped metrics and structured JSON logs. The missing layer is collection, storage, query, and alerting, not instrumentation. Any stack we pick needs to (a) scrape /metrics and (b) ingest stdout JSON logs. No application rewrite required.
2. Options matrix
Free-tier numbers from each vendor's 2026-04 public pricing. "MCP" = Model Context Protocol server availability (official or actively-maintained community).
| Option | Free tier | First paid tier (pilot scale ~10k series + 100GB logs) | MCP / agent access | Self-host burden | Migration portability |
|---|---|---|---|---|---|
| 1. Grafana Cloud free | 10k active series, 50 GB logs, 50 GB traces, 14-day retention, 3 users, unlimited dashboards | Pro $19/mo base + $8/50GB log overage + $16/1k extra series → ~$60-80/mo at scale | Official MCP (grafana/mcp-grafana, Apache-2.0); LogCLI; HTTP API per-datasource; API tokens scopable | Zero | Fully portable |
| 2. Self-host LGTM on a €10/mo VPS | N/A | €10-20/mo (VPS + backup storage) | Same MCP works against self-hosted | We run Loki, Prom, Grafana, Alertmanager, manage TSDB retention. 1-2 days stand up, ongoing babysit | Fully portable but we carry the state |
| 3. Datadog free | 5 hosts, 1-day log retention, no APM | $15/host/mo + $1.70/1M events + $0.10/GB retained → ~$100-200/mo minimum | No official MCP; community MCPs immature; HTTP API + datadog CLI | Zero | Locked — Datadog agent everywhere |
| 4. BetterStack (Logtail + Uptime) | 1 GB logs/mo, 3-day retention, 10 uptime monitors, unlimited incidents/status pages | Teams $25/mo (30-day retention, 30 GB), then $0.25/GB | No MCP. Good REST API + CLI. Uptime monitors trivially scriptable | Zero | Portable |
| 5. Axiom | 500 GB/mo free, 30-day retention, no credit card | $25/mo base + $0.25/GB beyond 500GB | Official MCP (axiomhq/mcp-server-axiom); APL (Kusto-like) queries | Zero | Portable |
| 6. SigNoz cloud | 14-day trial only | Teams $199/mo base incl. 50GB | Community MCP; ClickHouse-backed | Zero (cloud) or heavy (self-host) | Portable via OTEL |
| 7. New Relic free | 100 GB ingest/mo, 8-day retention, 1 full user | $0.35/extra GB + $99/full user | No official MCP; NRQL-heavy, SDK-heavy | Zero | Locked |
| 8. Honeycomb free | 20M events/mo, 60-day retention | Pro $100/mo → 100M events | No official MCP. Events-oriented (traces) | Zero | Portable via OTEL |
| 9. Cloudflare Logpush + R2 | Enterprise-plan-only; Workers Logs Engine free only for Workers | n/a | No MCP | Zero | Cloudflare-locked |
| 10. Stitched minimum | UptimeRobot free + self-scraped /metrics + docker logs + BetterStack for SMS | ~$0 until outgrown | SSH + grep; works but primitive | SSH hygiene, manual rotation | Fully portable |
Security: tenant_id label leak kills several options if not fixed first
sv0_job_duration_seconds carries tenant_id. Today, anyone past Cloudflare Access can scrape /metrics and enumerate tenants (P0-5 in readiness review §3.1). The fix — strip tenant_id from metric labels, keep per-tenant context in logs — must land before we point any external scraper at /metrics. The scraper must also authenticate. Grafana Cloud / Datadog / Axiom all support per-agent API tokens. With Cloudflare Access we can additionally gate by service-token.
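The "strip from labels, keep in logs" rule can be sketched as a small helper that splits one set of job fields into metric labels and log context. The helper name `splitTelemetry` and the contents of `ID_KEYS` are hypothetical; the real change lives in metrics.ts and its call sites.

```typescript
// Hypothetical helper; the key list is an illustrative assumption.
const ID_KEYS = new Set<string>(["tenant_id"]);

type Fields = Record<string, string>;

function splitTelemetry(fields: Fields): { labels: Fields; logContext: Fields } {
  const labels: Fields = {};
  for (const [key, value] of Object.entries(fields)) {
    // Identifier-like keys never become metric labels.
    if (!ID_KEYS.has(key)) labels[key] = value;
  }
  // Logs keep every field, so per-tenant questions remain answerable there.
  return { labels, logContext: { ...fields } };
}

const { labels, logContext } = splitTelemetry({
  job_type: "ingest",
  status: "ok",
  tenant_id: "t-123",
});
// labels     -> { job_type: "ingest", status: "ok" }
// logContext -> { job_type: "ingest", status: "ok", tenant_id: "t-123" }
```

With this split, a scraper that reads /metrics never sees tenant identifiers, while the log line for the same job still carries them.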
3. Agentic access — the decision driver
Three options have real agent stories; everyone else is HTTP-API-and-hope.
Grafana Cloud + grafana/mcp-grafana (https://github.com/grafana/mcp-grafana). Official, Grafana Labs–maintained, Apache-2.0. Covers Prometheus queries, Loki log queries, dashboard listing/search/panel-data, alert list/ack/silence, datasource list, Tempo trace queries. Most complete MCP surface among observability vendors. Works against both Grafana Cloud and self-hosted Grafana. Read-only scoped API tokens trivial.
Axiom + axiomhq/mcp-server-axiom (https://github.com/axiomhq/mcp-server-axiom). Official, Axiom-maintained. APL-based queries (SQL-ish). Narrower (logs only, no alerts) but 500 GB/mo free is the most generous in the industry, and APL is pleasant for LLMs to generate.
Grafana self-hosted with the same MCP. Same story but we run it. logcli is JSON-output-friendly and works as an agent escape hatch.
All others (Datadog, New Relic, Honeycomb, BetterStack, SigNoz) have HTTP APIs but no first-class MCP. Claude Code can still use them via shell + curl, but it's "works" not "works well."
4. Recommendation
Primary: Grafana Cloud free + BetterStack free + grafana/mcp-grafana
Cost: $0/mo. Two signups.
Why:
- Free allowances (10k series, 50 GB logs, 14-day retention) cover a pilot and several customers after. Our current metric cardinality is ~30 series before tenant_id, hundreds after the label-strip fix.
- Best-in-class agentic MCP. Claude Code sessions can query logs, check alerts, ack silences, read dashboards, inspect traces without bespoke tooling.
- Zero ops burden. We already operate Hetzner + Atlas + WorkOS; a fifth thing to operate is declined.
- Survives Hetzner → Azure VM / AWS VM migration unchanged. Alloy (Grafana's collector) moves as a systemd service.
- BetterStack free covers §3.1's external uptime monitor requirement today (60s checks, SMS alerts, 10 monitors; enough for /health, /ready, app.securityv0.com landing).
- Security clean: Alloy pushes /metrics outbound to Grafana Cloud (no public scrape endpoint). We remove /metrics from PUBLIC_PATHS simultaneously. Tenant labels only land where we control read access.
When we outgrow it: Grafana Cloud Pro = $19/mo base, $8/50GB log overage. At realistic pilot scale we stay free 6-12 months. First paid bill will be $30-60/mo.
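The paid-tier arithmetic can be made explicit with a rough estimator. It uses the Pro figures from the options matrix ($19 base, $8 per 50 GB of logs, $16 per 1k series) and assumes, without confirmation from Grafana's billing docs, that overage is charged in whole blocks above the free allowances.

```typescript
// Figures copied from the options matrix; billing granularity is an assumption.
const PRO_BASE_USD = 19;
const LOG_BLOCK_GB = 50;
const LOG_BLOCK_USD = 8;
const SERIES_BLOCK = 1000;
const SERIES_BLOCK_USD = 16;
const FREE_LOGS_GB = 50;
const FREE_SERIES = 10000;

function estimateMonthlyUsd(logsGb: number, activeSeries: number): number {
  // Inside both free allowances there is no bill at all.
  if (logsGb <= FREE_LOGS_GB && activeSeries <= FREE_SERIES) return 0;
  const logBlocks = Math.ceil(Math.max(0, logsGb - FREE_LOGS_GB) / LOG_BLOCK_GB);
  const seriesBlocks = Math.ceil(Math.max(0, activeSeries - FREE_SERIES) / SERIES_BLOCK);
  return PRO_BASE_USD + logBlocks * LOG_BLOCK_USD + seriesBlocks * SERIES_BLOCK_USD;
}

const pilotScale = estimateMonthlyUsd(100, 10000); // 19 + 1 * 8 = 27
const grownScale = estimateMonthlyUsd(150, 11000); // 19 + 2 * 8 + 1 * 16 = 51
```

Under these assumptions the first bill lands at or just under the $30-60 range quoted above; the real invoice depends on how Grafana Cloud meters overage.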
Fallback / complement: Axiom free for logs-heavy workflows
When: if Grafana Loki's 50 GB feels tight (unlikely pre-pilot), or we want SQL-style agent log queries specifically. Complement, not replacement.
Explicitly rejected
- Datadog — free tier is a trap (1-day retention, 5 hosts, no APM). First paid bill at scale is $100+. Vendor-lockiest.
- Self-host LGTM — fine at our scale but we cannot afford 1-2 days of setup + ongoing retention management during a 10-14-day pilot sprint. Revisit post-pilot if costs bite.
- New Relic — great free tier, bad agent story, sticky SDK.
- Honeycomb — excellent at traces, we emit no traces today. Premature.
- Cloudflare Logpush — Enterprise-only, not available to us.
5. One-week rollout
Fits inside Track A of the readiness plan. Two engineer-days total, parallelizable.
Day 1 — security + uptime (~1 hour)
- Close P0-5. Strip tenant_id from sv0_job_duration_seconds labels (src/shared/metrics/metrics.ts:25); update call sites (src/workers/runtime.ts:125,133). Per-tenant duration survives in logs as a context field, not a metric label.
- Lock down /metrics and /diagnostics. Remove from PUBLIC_PATHS (src/api/middleware/auth-middleware.ts:17). Make /metrics bearer-token-only via existing M2M auth. Provision one scrape token.
- BetterStack. Sign up free. Add monitors for /api/v1/health on app + dev, 60s interval, SMS + email. Add CF-Access-Client-Id / CF-Access-Client-Secret headers so checks bypass the Cloudflare Access login page (service tokens already exist for CI).
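As a rough illustration of the bearer-token gate on /metrics, the check below compares a presented token against one expected scrape token. It is a stand-in for the existing M2M auth, whose internals this document does not describe; `isAuthorizedScrape` and the way the expected token is supplied are assumptions.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical guard for GET /metrics; a stand-in for the real M2M auth path.
function isAuthorizedScrape(
  authorizationHeader: string | undefined,
  expectedToken: string,
): boolean {
  if (!authorizationHeader) return false;
  const match = /^Bearer (.+)$/.exec(authorizationHeader);
  const token = match?.[1];
  if (!token) return false;
  // Hash both sides so the buffers have equal length, then compare in constant time.
  const presented = createHash("sha256").update(token).digest();
  const expected = createHash("sha256").update(expectedToken).digest();
  return timingSafeEqual(presented, expected);
}
```

A route handler would call this before serializing the registry and return 401 otherwise. The constant-time comparison avoids leaking how much of the token matched.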
Day 2 — Grafana Cloud setup (~2 hours)
- Sign up for Grafana Cloud free. Create stack in Frankfurt (EMEA-aligned for MediaPro).
- Install Grafana Alloy on the Hetzner host. One apt install, one config file. Configure:
  - Prometheus remote_write to Grafana Cloud, scraping API /metrics with the bearer token
  - Loki push from /var/lib/docker/containers/*/*-json.log (labels: service, container_name)
  - node_exporter for host metrics
- Verify in Grafana Cloud UI. Seed 3 dashboards: API overview (RED), worker queue, connector freshness (from sv0_sync_age_minutes).
Day 3 — alerts
- Grafana Cloud Alerts: error rate >5% for 5min; sv0_queue_depth >100 for 10min; sv0_sync_age_minutes >2× schedule for any connector; /ready 503 for 2min. Email + Slack webhook.
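Each rule above has the form "value above a threshold for a sustained period". The sketch below illustrates that semantics on a list of samples. It is not how Grafana evaluates rules internally, and `firesAfter` and `Sample` are hypothetical names used only for illustration.

```typescript
interface Sample {
  tsSeconds: number; // Unix timestamp of the evaluation
  value: number; // e.g. 5xx ratio, queue depth, or sync age
}

// Returns true when the newest samples have stayed above `threshold`
// continuously for at least `forSeconds`. Samples must be sorted oldest first.
function firesAfter(samples: Sample[], threshold: number, forSeconds: number): boolean {
  let breachStart: number | null = null;
  let latest = 0;
  for (const sample of samples) {
    latest = sample.tsSeconds;
    if (sample.value > threshold) {
      if (breachStart === null) breachStart = sample.tsSeconds;
    } else {
      breachStart = null; // any recovery resets the clock
    }
  }
  if (breachStart === null) return false;
  return latest - breachStart >= forSeconds;
}

// Error ratio sampled once a minute, above 5% from t=0 through t=300.
const errorRatio: Sample[] = [0, 60, 120, 180, 240, 300].map((t) => ({
  tsSeconds: t,
  value: 0.08,
}));
const firing = firesAfter(errorRatio, 0.05, 5 * 60); // true
```

The reset on recovery is what keeps a single slow minute from paging anyone: a brief spike that clears before the window elapses never fires.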
Day 4 — agent access
- Create read-only Grafana Cloud API token scoped to Loki + Prom + Alerts.
- Install grafana/mcp-grafana as an MCP server in our .claude/settings.json. One permission entry.
- Verify round-trip: ask a Claude Code session "show me 5xx responses in the last hour on dev," confirm it works.
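To sanity-check what the agent returns, the same question can be asked of Loki's HTTP API directly. The sketch below only builds the request URL for the `query_range` endpoint. The base URL, the `env` label value, and the `status_code` log field are assumptions about our setup, and authentication is left out.

```typescript
// Hypothetical values: base URL, label names, and the status_code field are assumptions.
function buildLokiRangeQuery(
  baseUrl: string,
  logql: string,
  nowMs: number,
  lookbackMs: number,
): string {
  const url = new URL("/loki/api/v1/query_range", baseUrl);
  url.searchParams.set("query", logql);
  // Loki accepts start/end as Unix epoch nanoseconds; append six zeros to milliseconds.
  url.searchParams.set("start", `${nowMs - lookbackMs}000000`);
  url.searchParams.set("end", `${nowMs}000000`);
  url.searchParams.set("limit", "100");
  return url.toString();
}

const fiveXxLastHour = buildLokiRangeQuery(
  "https://loki.example.test",
  '{service="api", env="dev"} | json | status_code >= 500',
  Date.now(),
  60 * 60 * 1000,
);
```

If the MCP round-trip and this query disagree, the likely culprits are label names or the token's datasource scope rather than the agent.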
Day 5 — connector logs
- Point Alloy at Hetzner's journald so systemd connector runs stream to Loki. Label by unit name. ~0 cost at connector volume.
Days 6-7 — buffer
Tuning cardinality, dashboards, MediaPro-specific alerts.
Post-pilot forward-port
When compute migrates to Azure VM or AWS VM (sv0-platform#493 Phase 2), Alloy moves with it. Grafana Cloud stack, tokens, dashboards, alerts unchanged. Migration is a systemctl + DNS change, not a rewrite.
6. Risks
- Cardinality blowup. tenant_id on metrics is today's obvious hazard. Future labels entity_id, finding_id, user_id are similarly dangerous. Discipline: IDs in logs, metrics labeled by shape (job_type, status, connector_id).
- Free-tier drift. Grafana Cloud has raised prices before. If the free tier shrinks, we port to self-host LGTM on a €5 VPS in a day; the architecture (Alloy + prom-client + JSON logs) is unchanged.
- MCP maturity. grafana/mcp-grafana is active but young. Fallback: agents use logcli and curl against the HTTP API (workable, chattier).
- No traces. Deferred explicitly. The regret scenario: connector hangs mid-ingest, we can't see where. Mitigation: structured log correlation IDs already in the logger; OTEL-instrument later without throwing anything away.
- BetterStack / Grafana overlap. Grafana Cloud's higher tiers include synthetic monitors. Post-pilot we consolidate. Free SMS alerting is worth the duplication for now.
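The cardinality risk above is multiplicative, which a back-of-envelope estimate makes concrete. The distinct-value counts below are illustrative guesses, not measurements from sv0-platform, and the result is an upper bound that assumes every label combination actually occurs.

```typescript
// Upper-bound series count: product of distinct values per label, times the
// number of series each label set produces (more than 1 for histograms).
function estimateSeries(
  distinctValuesPerLabel: Record<string, number>,
  seriesPerLabelSet = 1,
): number {
  return Object.values(distinctValuesPerLabel).reduce(
    (product, n) => product * n,
    seriesPerLabelSet,
  );
}

// Illustrative numbers only.
const byShape = estimateSeries({ job_type: 6, status: 2 }); // 12
const withTenant = estimateSeries({ job_type: 6, status: 2, tenant_id: 50 }); // 600
```

One ID-shaped label turns a dozen series into hundreds, and the factor grows with every new tenant, which is why IDs belong in logs.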
7. Regional strategy — addendum (2026-05-03)
§5 picks Frankfurt in one line. This section captures the why so the question doesn't have to be re-answered every time a US client lands.
Telemetry region ≠ customer region
The data inside Grafana Cloud is metrics and logs about the platform, not customer data. tenant_id appears as a log field (post-#763 it is no longer a metric label), but it is an identifier, not the tenant's data. Customer-data residency conversations are about Atlas region + platform compute region — not about Grafana Cloud.
Telemetry mirrors deployment geography: where the compute is that emits the signals. SecurityV0 currently runs from one place (Hetzner DE), moving to one Azure region (EU/Frankfurt-aligned per pre-client readiness review §3). One deployment = one Grafana stack.
Triggers that justify a second stack
| Trigger | Real or theoretical | Action |
|---|---|---|
| Second platform deployment in another region (e.g., separate app-us.securityv0.com on a US Azure VM with its own Atlas in us-east-1) | Real, ~12+ months out at earliest | Provision paired US stack at the same time as the deployment |
| Enterprise contract requires telemetry (not customer data) to stay in-region | Rare; most clauses target customer data which lives in Atlas + platform storage | Address per-contract; do not preempt |
| <50ms query response from a specific geography for human operators | Negligible — operators click dashboards async, 100ms WAN is invisible | Ignore |
| Free-tier volume limits hit (10k series / 50 GB logs / 14-day retention) | The real upgrade trigger; fires at moderate customer scale | Upgrade to Pro on the same stack ($19/mo base + overage); region does not change |
The trigger most people imagine — "we have US customers, we need a US stack" — is not on this list.
What the free tier actually allows
Grafana Cloud's free tier is 1 stack per organization, not per region. Two free stacks would require two separate organizations, fully isolated: separate logins, dashboards, alert rules, API tokens, MCP server configs. You'd operate two consoles to monitor what is fundamentally one product, with no cost saving past the volume cap (the cap is per-stack, not per-org). The pain is real and the math doesn't work.
Decision
One stack, in Frankfurt, indefinitely. No regional splitting until the platform itself splits.
Operational details:
- Owner email: company-owned address (ivan@securityv0.com or shared ops@securityv0.com), never personal Gmail. The org owns the data; if the seat changes hands, ownership transfers cleanly.
- Stack name: sv0-prod. Single stack: dev + prod both push to it, distinguished via an env label, not separate stacks. Free tier is too small to split by env.
- Region: Frankfurt. Aligns with Hetzner today, MediaPro tomorrow, the planned Azure region per the readiness review, and EU GDPR posture by default.
- Upgrade path: outgrow free → ~$30-60/mo Pro on the same stack. Migration is a billing-page click; no Alloy reconfiguration, no token rotation.
- Re-evaluation trigger: when (not if) we commission a US-region SecurityV0 deployment. That day, provision a paired US stack — not before. Will be tracked alongside whatever issue eventually scopes the US deployment.