Observability Stack Research — Pre-MediaPro Pilot

Summary

Pick: Grafana Cloud free tier + BetterStack free tier + grafana/mcp-grafana. Two engineer-days. $0/mo at pilot scale. Paid-tier exit ramp $30-60/mo. Survives every migration on our roadmap.

Agentic ops was the decision-driver. grafana/mcp-grafana (official Grafana Labs MCP server, Apache-2.0) is the most complete agent-queryable observability surface available today — logs (Loki), metrics (Prometheus), dashboards, alerts, and traces (Tempo) all callable from Claude Code sessions.


1. Current state of instrumentation

Better than "nothing," worse than "ready."

Logs. Hand-rolled structured JSON logger at sv0-platform/src/shared/logging/logger.ts (~80 lines, no dependencies). Emits { ts, level, message, ...context } JSON lines to stdout/stderr. Supports child() context binding. Used across ~20 modules (routes, workers, ingestion, risk-clusters, auth providers). Docker captures the output on Hetzner; nothing ships it anywhere; retention is the Docker log buffer.
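
To make the log shape concrete, here is a minimal sketch of a logger with the described `{ ts, level, message, ...context }` output and `child()` binding. It is not the contents of logger.ts: the names `createLogger`, `Sink`, and the injectable clock are assumptions made for illustration, and the sink is passed in so the sketch has no runtime dependencies.

```typescript
// Hypothetical sketch of the described logger shape; not the real logger.ts.
type Level = "debug" | "info" | "warn" | "error";
type Context = Record<string, unknown>;
type Sink = (line: string) => void;

interface Logger {
  log(level: Level, message: string, context?: Context): void;
  child(bound: Context): Logger;
}

function createLogger(
  sink: Sink,
  bound: Context = {},
  now: () => Date = () => new Date(),
): Logger {
  return {
    log(level, message, context = {}) {
      // Later spreads win: per-call context overrides bound context.
      // A production logger may also want to protect ts/level/message.
      const record = { ts: now().toISOString(), level, message, ...bound, ...context };
      sink(JSON.stringify(record)); // one JSON object per line
    },
    child(extra) {
      // Children inherit the parent's bound context and add their own.
      return createLogger(sink, { ...bound, ...extra }, now);
    },
  };
}
```

A module would typically call `child({ module: "ingestion" })` once and log through the returned instance, so every line carries the module name without repeating it at each call site.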

Metrics. Real Prometheus instrumentation via prom-client ^15.1.3 at sv0-platform/src/shared/metrics/metrics.ts:

  • sv0_http_request_duration_seconds (histogram, labels: method/route/status_code)
  • sv0_http_requests_total
  • sv0_job_duration_seconds (labels include tenant_id, which is the P0-5 leak)
  • sv0_job_total, sv0_queue_depth
  • sv0_sync_age_minutes (connector_id), sv0_findings_total (status/severity), sv0_authority_paths_total
  • Node defaults via collectDefaultMetrics()

Exposed at GET /metrics (sv0-platform/src/api/routes/system.ts:52). Currently in PUBLIC_PATHS at sv0-platform/src/api/middleware/auth-middleware.ts:17.

Health/readiness. /health, /ready, /diagnostics all wired. /ready validates Mongo + worker.

Traces. None. No OTEL, Sentry, or Datadog tracer in package.json.

Connectors (Python). Standard logging module; systemd captures stdout/stderr into journald.

Net: we already emit Prometheus-shaped metrics and structured JSON logs. The missing layer is collection, storage, query, and alerting, not instrumentation. Any stack we pick needs to (a) scrape /metrics and (b) ingest stdout JSON logs. No application rewrite required.


2. Options matrix

Free-tier numbers from each vendor's 2026-04 public pricing. "MCP" = Model Context Protocol server availability (official or actively-maintained community).

| Option | Free tier | First paid tier (pilot scale ~10k series + 100GB logs) | MCP / agent access | Self-host burden | Migration portability |
|---|---|---|---|---|---|
| 1. Grafana Cloud free | 10k active series, 50 GB logs, 50 GB traces, 14-day retention, 3 users, unlimited dashboards | Pro $19/mo base + $8/50GB log overage + $16/1k extra series → ~$60-80/mo at scale | Official MCP (grafana/mcp-grafana, Apache-2.0); LogCLI; HTTP API per-datasource; API tokens scopable | Zero | Fully portable |
| 2. Self-host LGTM on a €10/mo VPS | N/A | €10-20/mo (VPS + backup storage) | Same MCP works against self-hosted | We run Loki, Prom, Grafana, Alertmanager, manage TSDB retention. 1-2 days stand up, ongoing babysit | Fully portable but we carry the state |
| 3. Datadog free | 5 hosts, 1-day log retention, no APM | $15/host/mo + $1.70/1M events + $0.10/GB retained → ~$100-200/mo minimum | No official MCP; community MCPs immature; HTTP API + datadog CLI | Zero | Locked: Datadog agent everywhere |
| 4. BetterStack (Logtail + Uptime) | 1 GB logs/mo, 3-day retention, 10 uptime monitors, unlimited incidents/status pages | Teams $25/mo (30-day retention, 30 GB), then $0.25/GB | No MCP. Good REST API + CLI. Uptime monitors trivially scriptable | Zero | Portable |
| 5. Axiom | 500 GB/mo free, 30-day retention, no credit card | $25/mo base + $0.25/GB beyond 500GB | Official MCP (axiomhq/mcp-server-axiom); APL (Kusto-like) queries | Zero | Portable |
| 6. SigNoz cloud | 14-day trial only | Teams $199/mo base incl. 50GB | Community MCP; ClickHouse-backed | Zero (cloud) or heavy (self-host) | Portable via OTEL |
| 7. New Relic free | 100 GB ingest/mo, 8-day retention, 1 full user | $0.35/extra GB + $99/full user | No official MCP; NRQL-heavy, SDK-heavy | Zero | Locked |
| 8. Honeycomb free | 20M events/mo, 60-day retention | Pro $100/mo → 100M events | No official MCP. Events-oriented (traces) | Zero | Portable via OTEL |
| 9. Cloudflare Logpush + R2 | Enterprise-plan-only; Workers Logs Engine free only for Workers | n/a | No MCP | Zero | Cloudflare-locked |
| 10. Stitched minimum | UptimeRobot free + self-scraped /metrics + docker logs + BetterStack for SMS | ~$0 until outgrown | SSH + grep; works but primitive | SSH hygiene, manual rotation | Fully portable |

Security: tenant_id label leak kills several options if not fixed first

sv0_job_duration_seconds carries tenant_id. Today, anyone past Cloudflare Access can scrape /metrics and enumerate tenants (P0-5 in readiness review §3.1). The fix — strip tenant_id from metric labels, keep per-tenant context in logs — must land before we point any external scraper at /metrics. The scraper must also authenticate. Grafana Cloud / Datadog / Axiom all support per-agent API tokens. With Cloudflare Access we can additionally gate by service-token.
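
The label-strip fix described above amounts to sending the same job event to two places with different fields. The sketch below shows that split under assumed names: `JobEvent`, `Observation`, and `observeJob` are hypothetical and do not reflect the real call sites, and the label set `job_type`/`status` is an assumption about what remains after tenant_id is removed.

```typescript
// Hypothetical shapes for illustration; the real call sites are not shown here.
interface JobEvent {
  jobType: string;
  status: "success" | "failure";
  tenantId: string;
  durationSeconds: number;
}

interface Observation {
  // Labels for the histogram: bounded value sets only, no identifiers.
  metricLabels: { job_type: string; status: string };
  // Context for the JSON log line: per-tenant detail survives here.
  logContext: {
    tenant_id: string;
    job_type: string;
    status: string;
    duration_seconds: number;
  };
}

function observeJob(event: JobEvent): Observation {
  return {
    metricLabels: { job_type: event.jobType, status: event.status },
    logContext: {
      tenant_id: event.tenantId,
      job_type: event.jobType,
      status: event.status,
      duration_seconds: event.durationSeconds,
    },
  };
}
```

With this split, a scraper that reads /metrics sees job types and outcomes but cannot enumerate tenants; per-tenant durations stay queryable from logs.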


3. Agentic access — the decision driver

Three options have real agent stories; everyone else is HTTP-API-and-hope.

Grafana Cloud + grafana/mcp-grafana (https://github.com/grafana/mcp-grafana). Official, Grafana Labs–maintained, Apache-2.0. Covers Prometheus queries, Loki log queries, dashboard listing/search/panel-data, alert list/ack/silence, datasource list, Tempo trace queries. Most complete MCP surface among observability vendors. Works against both Grafana Cloud and self-hosted Grafana. Read-only scoped API tokens trivial.

Axiom + axiomhq/mcp-server-axiom (https://github.com/axiomhq/mcp-server-axiom). Official, Axiom-maintained. APL-based queries (Kusto-like, piped). Narrower (logs only, no alerts), but its 500 GB/mo free tier is the most generous in the matrix above, and APL is pleasant for LLMs to generate.

Grafana self-hosted with the same MCP. Same story but we run it. logcli is JSON-output-friendly and works as an agent escape hatch.

All others (Datadog, New Relic, Honeycomb, BetterStack, SigNoz) have HTTP APIs but no first-class MCP. Claude Code can still use them via shell + curl, but it's "works" not "works well."


4. Recommendation

Primary: Grafana Cloud free + BetterStack free + grafana/mcp-grafana

Cost: $0/mo. Two signups.

Why:

  • Free allowances (10k series, 50 GB logs, 14-day retention) cover a pilot and several customers after. Our metric cardinality is small either way: ~30 series without the tenant_id dimension, hundreds with it, both far below the 10k cap.
  • Best-in-class agentic MCP. Claude Code sessions can query logs, list and ack or silence alerts, read dashboards, and inspect traces without bespoke tooling.
  • Zero ops burden. We already operate Hetzner + Atlas + WorkOS; we decline to take on another system to operate.
  • Survives Hetzner → Azure VM / AWS VM migration unchanged. Alloy (Grafana's collector) moves as a systemd service.
  • BetterStack free covers §3.1's external uptime monitor requirement today (60s checks, SMS alerts, 10 monitors — enough for /health, /ready, app.securityv0.com landing).
  • Security clean: Alloy scrapes /metrics on the host and pushes samples outbound to Grafana Cloud, so no public scrape endpoint is needed. We remove /metrics from PUBLIC_PATHS at the same time. Tenant identifiers remain only in logs, which land where we control read access.

When we outgrow it: Grafana Cloud Pro = $19/mo base, $8/50GB log overage. At realistic pilot scale we stay free 6-12 months. First paid bill will be $30-60/mo.
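
The paid-tier arithmetic can be sketched as a small estimator. The prices are the ones quoted in this document; everything else is an assumption: that Pro includes the same 10k series and 50 GB of logs as the free tier, and that overage is billed in whole blocks rather than prorated. `PRO` and `estimateProMonthlyUsd` are hypothetical names.

```typescript
// Prices as quoted in this document; billing rules below are assumptions.
const PRO = {
  baseUsd: 19,
  includedSeries: 10000, // assumption: Pro includes the free-tier series allowance
  includedLogGb: 50,     // assumption: Pro includes the free-tier log allowance
  usdPer1kSeries: 16,
  usdPer50GbLogs: 8,
};

function estimateProMonthlyUsd(activeSeries: number, logGb: number): number {
  // Assumption: overage is charged per started block (1k series, 50 GB logs).
  const seriesBlocks = Math.ceil(Math.max(0, activeSeries - PRO.includedSeries) / 1000);
  const logBlocks = Math.ceil(Math.max(0, logGb - PRO.includedLogGb) / 50);
  return PRO.baseUsd + seriesBlocks * PRO.usdPer1kSeries + logBlocks * PRO.usdPer50GbLogs;
}
```

Under these assumptions, 11k series with 150 GB of logs comes to $51/mo and 12k series with 200 GB comes to $75/mo. Actual invoices depend on how the vendor meters and prorates usage, so treat the output as an order-of-magnitude check.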

Fallback / complement: Axiom free for logs-heavy workflows

When: if Grafana Loki's 50 GB feels tight (unlikely pre-pilot), or we specifically want APL-style (Kusto-like) agent log queries. Complement, not replacement.

Explicitly rejected

  • Datadog — free tier is a trap (1-day retention, 5 hosts, no APM). First paid bill at scale is $100+. Vendor-lockiest.
  • Self-host LGTM — fine at our scale but we cannot afford 1-2 days of setup + ongoing retention management during a 10-14-day pilot sprint. Revisit post-pilot if costs bite.
  • New Relic — great free tier, bad agent story, sticky SDK.
  • Honeycomb — excellent at traces, we emit no traces today. Premature.
  • Cloudflare Logpush — Enterprise-only, not available to us.

5. One-week rollout

Fits inside Track A of the readiness plan. Two engineer-days total, parallelizable.

Day 1 — security + uptime (~1 hour)

  • Close P0-5. Strip tenant_id from sv0_job_duration_seconds labels (src/shared/metrics/metrics.ts:25); update call sites (src/workers/runtime.ts:125,133). Per-tenant duration survives in logs as a context field, not a metric label.
  • Lock down /metrics and /diagnostics. Remove from PUBLIC_PATHS (src/api/middleware/auth-middleware.ts:17). Make /metrics bearer-token-only via existing M2M auth. Provision one scrape token.
  • BetterStack. Sign up free. Add monitors for /api/v1/health on app + dev, 60s interval, SMS + email. Add CF-Access-Client-Id / CF-Access-Client-Secret headers so checks bypass the Cloudflare Access login page (service tokens already exist for CI).
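
The /metrics lockdown in the steps above can be sketched as a path gate plus a bearer-token check. This is an illustration, not the existing M2M auth: the contents of `PUBLIC_PATHS` shown here are assumed, and `authorizeRequest` and `constantTimeEqual` are hypothetical helpers.

```typescript
// Assumed contents after the change; the real list lives in auth-middleware.ts.
const PUBLIC_PATHS: readonly string[] = ["/health", "/ready"];

// Compares every position even after a mismatch, so response timing does not
// reveal how long the matching prefix is. Token length is still observable.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

type Decision = "allow" | "deny";

function authorizeRequest(
  path: string,
  authorization: string | undefined,
  scrapeToken: string,
): Decision {
  if (PUBLIC_PATHS.indexOf(path) !== -1) return "allow";
  if (path === "/metrics") {
    const prefix = "Bearer ";
    if (authorization === undefined) return "deny";
    if (authorization.slice(0, prefix.length) !== prefix) return "deny";
    const presented = authorization.slice(prefix.length);
    return constantTimeEqual(presented, scrapeToken) ? "allow" : "deny";
  }
  // Stand-in: a real middleware would continue to the normal auth path here.
  return "deny";
}
```

The scrape token would be provisioned once and given only to the collector, so rotating it does not affect user sessions.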

Day 2 — Grafana Cloud setup (~2 hours)

  • Sign up for Grafana Cloud free. Create stack in Frankfurt (EMEA-aligned for MediaPro).
  • Install Grafana Alloy on the Hetzner host. One apt install, one config file. Configure:
    • Prometheus remote_write to Grafana Cloud, scraping API /metrics with the bearer token
    • Loki push from /var/lib/docker/containers/*/*-json.log (labels: service, container_name)
    • node_exporter for host metrics
  • Verify in Grafana Cloud UI. Seed 3 dashboards: API overview (RED), worker queue, connector freshness (from sv0_sync_age_minutes).
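
The Docker log files named above wrap each application line in an outer JSON object with `log`, `stream`, and `time` fields (the json-file driver format). The sketch below shows the unwrapping a log pipeline has to perform before the platform's own JSON fields become queryable. It only illustrates the data shape and is not a replacement for the collector's built-in parsing; `parseDockerLine` and `ParsedEntry` are hypothetical names.

```typescript
// Outer envelope written by Docker's json-file logging driver.
interface DockerLogLine {
  log: string; // the container's output line, usually ending in "\n"
  stream: "stdout" | "stderr";
  time: string; // RFC 3339 timestamp assigned by Docker
}

interface ParsedEntry {
  timestamp: string;
  stream: string;
  fields: Record<string, unknown>;
}

function parseDockerLine(raw: string): ParsedEntry {
  const outer = JSON.parse(raw) as DockerLogLine;
  const text = outer.log.replace(/\n$/, "");
  let fields: Record<string, unknown>;
  try {
    const inner: unknown = JSON.parse(text);
    fields =
      typeof inner === "object" && inner !== null && !Array.isArray(inner)
        ? (inner as Record<string, unknown>)
        : { message: text };
  } catch {
    // Non-JSON output (for example a raw stack trace line) is kept as a message.
    fields = { message: text };
  }
  // Prefer the application's own ts field; fall back to Docker's timestamp.
  const appTs = fields["ts"];
  return {
    timestamp: typeof appTs === "string" ? appTs : outer.time,
    stream: outer.stream,
    fields,
  };
}
```

Keeping the inner fields as parsed data rather than index labels is consistent with the "IDs in logs" rule: tenant and entity identifiers stay searchable without inflating label cardinality.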

Day 3 — alerts

  • Grafana Cloud Alerts: error rate >5% for 5min; sv0_queue_depth >100 for 10min; sv0_sync_age_minutes >2× schedule for any connector; /ready 503 for 2min. Email + Slack webhook.
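
Each rule above pairs a threshold with a duration ("for 5min"). The sketch below illustrates that pending-period behaviour in simplified form: the alert fires only when the condition has held without interruption for the full duration. It is a model for reasoning about the rules, not Grafana's evaluator; `Sample` and `isFiring` are hypothetical names.

```typescript
interface Sample {
  atMs: number; // evaluation time in milliseconds
  value: number; // e.g. error ratio, queue depth, sync age
}

// True when `value > threshold` has held continuously for at least `forMs`
// up to the most recent sample. Samples must be sorted by time ascending.
function isFiring(samples: Sample[], threshold: number, forMs: number): boolean {
  let breachStart: number | null = null;
  let lastAt = 0;
  for (const s of samples) {
    lastAt = s.atMs;
    if (s.value > threshold) {
      if (breachStart === null) breachStart = s.atMs;
    } else {
      breachStart = null; // any dip below the threshold resets the timer
    }
  }
  return breachStart !== null && lastAt - breachStart >= forMs;
}
```

For the error-rate rule, a single one-minute spike above 5% never fires, while a breach sustained across five minutes of evaluations does. That is the behaviour the duration is meant to buy: fewer pages for transient blips.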

Day 4 — agent access

  • Create read-only Grafana Cloud API token scoped to Loki + Prom + Alerts.
  • Install grafana/mcp-grafana as an MCP server in our .claude/settings.json. One permission entry.
  • Verify round-trip: ask a Claude Code session "show me 5xx responses in the last hour on dev," confirm it works.
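
For reference, the round-trip question above roughly corresponds to a Loki range query. The sketch below builds a request URL for Loki's `query_range` HTTP endpoint, which is also what the curl fallback would call. The base URL is a placeholder, the `service` and `env` labels follow the labelling described in this document, and the `status_code` log field is an assumption about what the API logs contain. `buildLokiQueryUrl` is a hypothetical helper.

```typescript
// Assumed LogQL: relies on a status_code field being present in the JSON logs.
const FIVE_XX_ON_DEV = '{service="api", env="dev"} | json | status_code >= 500';

function buildLokiQueryUrl(
  baseUrl: string,
  logql: string,
  endMs: number,
  windowMs: number,
  limit: number = 100,
): string {
  // Loki accepts nanosecond Unix epochs; appending six zeros converts
  // positive integer milliseconds to nanoseconds without needing BigInt.
  const toNs = (ms: number): string => `${Math.floor(ms)}000000`;
  const params: Array<[string, string]> = [
    ["query", logql],
    ["start", toNs(endMs - windowMs)],
    ["end", toNs(endMs)],
    ["limit", String(limit)],
    ["direction", "backward"], // newest entries first
  ];
  const qs = params.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join("&");
  return `${baseUrl.replace(/\/+$/, "")}/loki/api/v1/query_range?${qs}`;
}
```

Calling `buildLokiQueryUrl(base, FIVE_XX_ON_DEV, Date.now(), 60 * 60 * 1000)` yields the last hour of matching lines; authentication headers are left out because they depend on how the read-only token is issued.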

Day 5 — connector logs

  • Point Alloy at Hetzner's journald so systemd connector runs stream to Loki. Label by unit name. ~0 cost at connector volume.

Days 6-7 — buffer

Tuning cardinality, dashboards, MediaPro-specific alerts.

Post-pilot forward-port

When compute migrates to Azure VM or AWS VM (sv0-platform#493 Phase 2), Alloy moves with it. Grafana Cloud stack, tokens, dashboards, alerts unchanged. Migration is a systemctl + DNS change, not a rewrite.


6. Risks

  • Cardinality blowup. tenant_id on metrics is today's obvious hazard. Future labels entity_id, finding_id, user_id are similarly dangerous. Discipline: IDs in logs, metrics labeled by shape (job_type, status, connector_id).
  • Free-tier drift. Grafana Cloud has raised prices before. If the free tier shrinks, we port to self-host LGTM on a small VPS (option 2 in §2: €10-20/mo, 1-2 days to stand up); the architecture (Alloy + prom-client + JSON logs) is unchanged.
  • MCP maturity. grafana/mcp-grafana is active but young. Fallback: agents use logcli and curl against HTTP API — workable, chattier.
  • No traces. Deferred explicitly. The regret scenario: connector hangs mid-ingest, we can't see where. Mitigation: structured log correlation IDs already in the logger; OTEL-instrument later without throwing anything away.
  • BetterStack / Grafana overlap. Grafana Cloud's higher tiers include synthetic monitors. Post-pilot we consolidate. Free SMS alerting is worth the duplication for now.
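
The cardinality risk above is multiplicative, which a worst-case estimate makes visible. In Prometheus, a histogram exposes one series per bucket (including +Inf) plus `_sum` and `_count` for every label combination. The sketch below applies that rule; the label counts in the usage note are invented for illustration, and `MetricShape` and `estimateSeries` are hypothetical names.

```typescript
interface MetricShape {
  kind: "counter" | "gauge" | "histogram";
  labelCardinalities: number[]; // distinct values per label
  buckets?: number; // explicit upper bounds, histograms only (excluding +Inf)
}

// Worst case: assumes every label combination is actually observed.
function estimateSeries(m: MetricShape): number {
  const combos = m.labelCardinalities.reduce((acc, n) => acc * n, 1);
  if (m.kind !== "histogram") return combos;
  // Per combination: one series per bucket, one for +Inf, plus _sum and _count.
  const perCombo = (m.buckets ?? 0) + 1 + 2;
  return combos * perCombo;
}
```

As an illustration with made-up numbers, a job-duration histogram with 5 job types, 2 statuses, and 11 buckets tops out at 140 series. Adding a tenant_id label with 50 tenants raises the ceiling to 7,000, most of the 10k free-tier allowance from a single metric.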

7. Regional strategy — addendum (2026-05-03)

§5 (Day 2) picks Frankfurt in one line. This section captures the why so the question doesn't have to be re-answered every time a US client lands.

Telemetry region ≠ customer region

The data inside Grafana Cloud is metrics and logs about the platform, not customer data. tenant_id appears as a log field (post-#763 it is no longer a metric label), but it is an identifier, not the tenant's data. Customer-data residency conversations are about Atlas region + platform compute region — not about Grafana Cloud.

Telemetry region follows deployment geography: the stack lives where the compute that emits the signals runs. SecurityV0 currently runs from one place (Hetzner DE) and is moving to one Azure region (EU/Frankfurt-aligned per pre-client readiness review §3). One deployment = one Grafana stack.

Triggers that justify a second stack

| Trigger | Real or theoretical | Action |
|---|---|---|
| Second platform deployment in another region (e.g., separate app-us.securityv0.com on a US Azure VM with its own Atlas in us-east-1) | Real, ~12+ months out at earliest | Provision paired US stack at the same time as the deployment |
| Enterprise contract requires telemetry (not customer data) to stay in-region | Rare; most clauses target customer data which lives in Atlas + platform storage | Address per-contract; do not preempt |
| <50ms query response from a specific geography for human operators | Negligible: operators click dashboards async, 100ms WAN is invisible | Ignore |
| Free-tier volume limits hit (10k series / 50 GB logs / 14-day retention) | The real upgrade trigger; fires at moderate customer scale | Upgrade to Pro on the same stack ($19/mo base + overage); region does not change |

The trigger most people imagine — "we have US customers, we need a US stack" — is not on this list.

What the free tier actually allows

Grafana Cloud's free tier is 1 stack per organization, not per region. Two free stacks would require two separate organizations, fully isolated: separate logins, dashboards, alert rules, API tokens, MCP server configs. You'd operate two consoles to monitor what is fundamentally one product, with no cost saving past the volume cap (the cap is per-stack, not per-org). The pain is real, the math doesn't work.

Decision

One stack, in Frankfurt, indefinitely. No regional splitting until the platform itself splits.

Operational details:

  • Owner email: company-owned address (ivan@securityv0.com or shared ops@securityv0.com), never personal Gmail. The org owns the data; if the seat changes hands, ownership transfers cleanly.
  • Stack name: sv0-prod. Single stack — dev + prod both push to it, distinguished via an env label, not separate stacks. Free tier is too small to split by env.
  • Region: Frankfurt — aligns with Hetzner today, MediaPro tomorrow, the planned Azure region per the readiness review, and EU GDPR posture by default.
  • Upgrade path: outgrow free → ~$30-60/mo Pro on the same stack. Migration is a billing-page click; no Alloy reconfiguration, no token rotation.
  • Re-evaluation trigger: when (not if) we commission a US-region SecurityV0 deployment. That day, provision a paired US stack, not before. This will be tracked alongside whatever issue eventually scopes the US deployment.