
Operational Resiliency Plan

Problem Statement

SecurityV0 has a strong foundation — structured JSON logging, Prometheus /metrics endpoint, Docker health checks, and a token-health GitHub Actions workflow. But the pieces aren't connected:

  • Logs vanish when containers restart — no aggregation, no search
  • Errors are invisible — logged but nobody is notified
  • Connector staleness is undetectable — metric defined but never wired
  • No shared visibility — Sergey (different country) can't see logs or metrics without SSH access to the Mac mini
  • Multiple environments (prod, dev, future instances) have no unified view

This plan connects the dots with a middle-ground approach: Grafana Cloud for shared multi-environment visibility, GitHub Actions for external probing, Slack for notifications. One new container per environment (Alloy). Everything else deferred.

Design Principles

  1. Shared visibility is a collaboration requirement. Two developers in different countries need a single URL to see logs, metrics, and health across all environments. This is not premature — it's table stakes for a distributed team.

  2. External probes, not internal ones. Uptime monitors on the same host they monitor can't detect the failures that matter most. Probes run on GitHub Actions (external).

  3. One new container per environment, not four. Alloy ships logs and scrapes metrics — that's the minimum to make Grafana Cloud useful. MongoDB exporter, Uptime Kuma, and custom dashboards are deferred.

  4. Fix the code first. Unhandled async errors crash the API silently. Route handlers bypass the structured logger. These are bugs, not observability gaps.

  5. Operating model before alert rules. Every alert has a clear owner, response expectation, and remediation path.

Current State Audit

What Exists

Area | What's There
Health endpoints | /health (liveness), /ready (MongoDB + worker), /metrics (Prometheus), /diagnostics
Structured logging | JSON { ts, level, message, ...meta }, child loggers, configurable LOG_LEVEL
Request tracking | UUID x-request-id on every request, propagated to logs
Prometheus metrics | 8 custom metrics: HTTP duration/count, worker job duration/count, queue depth, sync age, findings/paths gauges
Docker health checks | MongoDB mongosh ping, API wget /health, resource limits (512MB API, 512MB Mongo)
Token monitoring | Weekly Cloudflare token expiry check, auto-creates GitHub Issues
Global error handler | Express 4-arg error middleware, logs with request ID
Graceful shutdown | 4-phase shutdown (stop accepting → stop worker → drain → disconnect)

What's Missing

Gap | Impact | Priority
No uncaughtException / unhandledRejection handlers | Async errors crash API silently, no trace | Critical
Route handlers use console.error() not structured logger | Errors bypass logging pipeline | High
No log aggregation | Logs lost on restart, no search, Sergey can't access | High
No shared multi-environment visibility | Only accessible via SSH to Mac mini | High
No external uptime probe | Can't detect host/Docker/tunnel failures | High
No Slack notifications | Nobody knows about failures until manual check | High
sv0_sync_age_minutes gauge never updated | Connector freshness invisible | Medium

Operating Model

Alert Tiers

Tier | Meaning | Response | Example
P1: Prod down | Production health check failing | Ivan responds ASAP (within 1 hour) | API unreachable, MongoDB down
P2: Degraded | Production up but errors spiking or connectors stale | Ivan responds within business hours | 5xx error rate >5%, connector stale >24h
P3: Informational | Deploy succeeded, credential expiry warning | Acknowledge, schedule fix | Token expires in 30 days

Notification Routing

Tier | Channel | Behavior
P1 | Slack #sv0-alerts | Immediate
P2 | Slack #sv0-alerts | Business hours
P3 | Slack #sv0-deploys | Informational
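
The routing above is small enough to live in code as a lookup table, which keeps workflow scripts and alert rules from drifting apart. The sketch below is only an illustration; the `ROUTES` constant and `routeAlert` helper are hypothetical names, not part of the existing codebase.

```typescript
// Alert tiers and their Slack routing, copied from the tables above.
type Tier = "P1" | "P2" | "P3";

interface Route {
  channel: string;
  behavior: string;
}

const ROUTES: Record<Tier, Route> = {
  P1: { channel: "#sv0-alerts", behavior: "Immediate" },
  P2: { channel: "#sv0-alerts", behavior: "Business hours" },
  P3: { channel: "#sv0-deploys", behavior: "Informational" },
};

// Hypothetical helper: look up where an alert of a given tier should go.
function routeAlert(tier: Tier): Route {
  return ROUTES[tier];
}

console.log(routeAlert("P1").channel);
```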

Explicit non-goal: No 24/7 on-call for a 2-person pre-revenue team. P1 outside business hours is best-effort until there are design partner SLAs.


Implementation: Three Phases

Phase 1: Fix the Code (2-3 hours)

No infrastructure changes. Fix bugs in the existing codebase.

1.1 Add Process Error Handlers

In src/index.ts, before the server starts:

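// Synchronous errors that nothing else caught: log, then exit, because
// process state is unreliable after this point.
process.on("uncaughtException", (error) => {
  logger.error("Uncaught exception, shutting down", {
    error: error.message,
    stack: error.stack,
  });
  process.exit(1);
});

// Rejected promises with no .catch(). Registering this handler replaces
// Node's default (crash on unhandled rejection), so the process keeps running.
process.on("unhandledRejection", (reason) => {
  logger.error("Unhandled rejection", {
    reason: reason instanceof Error ? reason.message : String(reason),
    stack: reason instanceof Error ? reason.stack : undefined,
  });
});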
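
Both handlers turn an unknown thrown value into log metadata, and the route handlers in 1.2 need the same conversion. A minimal sketch of a shared helper follows; `toErrorMeta` and `ErrorMeta` are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper: normalize any thrown or rejected value into log metadata.
interface ErrorMeta {
  message: string;
  stack?: string;
}

function toErrorMeta(reason: unknown): ErrorMeta {
  if (reason instanceof Error) {
    const meta: ErrorMeta = { message: reason.message };
    if (reason.stack !== undefined) {
      meta.stack = reason.stack;
    }
    return meta;
  }
  // Promises can reject with strings, numbers, or undefined; stringify those.
  return { message: String(reason) };
}

// Intended use inside the handlers above (logger is the existing structured logger):
//   process.on("unhandledRejection", (reason) =>
//     logger.error("Unhandled rejection", toErrorMeta(reason)));
console.log(toErrorMeta("timeout").message);
```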

1.2 Replace console.error() with Structured Logger

Grep and replace in all route handlers:

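// Before
console.error("Unexpected error in GET /api/v1/entities:", error);

// After: same information, but routed through the structured logger
deps.logger.error("GET /api/v1/entities failed", {
  requestId: req.requestId,
  error: error instanceof Error ? error.message : "Unknown error",
  stack: error instanceof Error ? error.stack : undefined,
});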
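
Express 4 does not forward rejected promises from async handlers to the error middleware on its own, so an alternative to per-route try/catch is a small wrapper. The sketch below assumes nothing about the real route signatures: `asyncRoute` is a hypothetical name, and the `Next` and `Handler` types are minimal stand-ins for the Express types so the example compiles by itself.

```typescript
// Minimal stand-ins for the Express types (assumption: real code would import them).
type Next = (err?: unknown) => void;
type Handler<Req, Res> = (req: Req, res: Res, next: Next) => unknown;

// Hypothetical wrapper: forward synchronous throws and rejected promises to
// next(), so the global 4-arg error middleware logs them with the request ID.
function asyncRoute<Req, Res>(handler: Handler<Req, Res>): Handler<Req, Res> {
  return (req, res, next) => {
    try {
      const result = handler(req, res, next);
      if (result instanceof Promise) {
        result.catch(next);
      }
    } catch (err) {
      next(err);
    }
  };
}

// Usage sketch: the thrown error reaches next() instead of escaping the handler.
const wrapped = asyncRoute<object, object>(() => {
  throw new Error("lookup failed");
});
wrapped({}, {}, (err) => console.log("forwarded:", err instanceof Error));
```

With a wrapper like this, route handlers stay free of logging code and the error middleware remains the single place where failures are logged.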

1.3 Wire Sync Freshness Metric

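The sv0_sync_age_minutes gauge exists but is never set. Record lastSyncTimestamp when each sync completes, then recompute the age on a timer or at scrape time. A value computed only at the moment a sync completes is always close to zero, so it would never cross the 24h staleness threshold: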

const ageMinutes = (Date.now() - lastSyncTimestamp) / 60_000;
syncAgeMinutes.set({ connector_id: connectorId }, ageMinutes);
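
One way to keep the gauge meaningful between syncs is to store the last sync time per connector and refresh every age on a schedule. This is a sketch under assumptions: `GaugeLike` is a hypothetical stand-in for the existing gauge object, and `recordSync` and `refreshSyncAges` are illustrative names.

```typescript
// Hypothetical stand-in for the existing gauge, so the sketch is self-contained.
interface GaugeLike {
  set(labels: { connector_id: string }, value: number): void;
}

// Last successful sync time per connector, in epoch milliseconds.
const lastSyncAt = new Map<string, number>();

// Call when a sync finishes successfully.
function recordSync(connectorId: string, finishedAtMs: number): void {
  lastSyncAt.set(connectorId, finishedAtMs);
}

// Recompute every connector's age; run this on a timer or just before a scrape.
function refreshSyncAges(gauge: GaugeLike, nowMs: number): void {
  lastSyncAt.forEach((syncedAtMs, connectorId) => {
    gauge.set({ connector_id: connectorId }, (nowMs - syncedAtMs) / 60_000);
  });
}

// Usage sketch with a gauge that just prints what it receives.
const printingGauge: GaugeLike = {
  set: (labels, value) => console.log(labels.connector_id, value),
};
recordSync("connector-a", Date.now() - 5 * 60_000);
refreshSyncAges(printingGauge, Date.now());
```

In the API process, something like `setInterval(() => refreshSyncAges(syncAgeMinutes, Date.now()), 60_000)` would keep the value current between syncs.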

Phase 2: Grafana Cloud + Alloy (3-4 hours)

Goal: Open one URL, see logs and metrics from all environments. Sergey and Ivan both have access.

2.1 Set Up Grafana Cloud Free Account

  1. Sign up at grafana.com (free tier: 50GB logs/mo, 10k metric series, 14-day retention, 3 users)
  2. Create a Grafana Cloud API key with push permissions
  3. Note the Loki push URL and Prometheus remote-write URL
  4. Store credentials in 1Password: op://sv0-bots/grafana-cloud/
  5. Add as GitHub secrets: GRAFANA_CLOUD_LOKI_URL, GRAFANA_CLOUD_LOKI_USER, GRAFANA_CLOUD_LOKI_TOKEN, GRAFANA_CLOUD_PROM_URL, GRAFANA_CLOUD_PROM_USER, GRAFANA_CLOUD_PROM_TOKEN

Free tier limits vs our usage:

Resource | Free Limit | Our Usage (estimate) | Headroom
Log ingest | 50 GB/month | ~2-5 GB (2-3 environments, low traffic) | 10-25x
Metric series | 10,000 | ~200 (8 custom + Node.js defaults × 2-3 envs) | 50x
Retention | 14 days | Sufficient for debugging | -
Users | 3 | 2 (Ivan + Sergey) | 1 spare

Lock-in mitigation: Alloy uses standard protocols (Loki push API, Prometheus remote-write). If we outgrow the free tier, we can point Alloy at a self-hosted Loki/Mimir instance or switch to any OpenTelemetry-compatible backend. The config change is 2 lines (URL + credentials).

2.2 Add Grafana Alloy to Docker Compose

Add to docker-compose.deploy.yml:

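  alloy:
    image: grafana/alloy:latest
    restart: unless-stopped
    mem_limit: 128m
    # Pass through the variables that alloy-config.alloy reads via env()
    environment:
      - SV0_ENVIRONMENT
      - GRAFANA_LOKI_URL
      - GRAFANA_LOKI_USER
      - GRAFANA_LOKI_TOKEN
      - GRAFANA_PROM_URL
      - GRAFANA_PROM_USER
      - GRAFANA_PROM_TOKEN
    volumes:
      - ./deploy/alloy-config.alloy:/etc/alloy/config.alloy:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    command: run /etc/alloy/config.alloy
    depends_on:
      api:
        condition: service_healthy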

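Security note on Docker socket: Alloy needs Docker socket access to discover containers and collect their logs. This is a sensitive surface. Mitigations and caveats:

  • Mounted :ro, which only stops the container from replacing or deleting the socket file. It does not restrict Docker API calls: any process that can connect to the socket can still create, stop, or modify containers, so treat Alloy as having host-level privileges
  • Do not assume Alloy runs as non-root. The official image has historically run as root by default, so confirm the effective user for the version actually deployed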
  • The alternative (no log aggregation) means Sergey has zero visibility and logs are lost on restart — that's a worse operational risk for a 2-person distributed team
  • If this remains uncomfortable, an alternative is to use Docker's json-file log driver with a shared volume that Alloy reads (no socket needed, but loses container metadata labels)

2.3 Alloy Configuration

Create deploy/alloy-config.alloy (~60 lines):

// Discover Docker containers
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

// Relabel: extract service name and environment
discovery.relabel "containers" {
  targets = discovery.docker.containers.targets

  rule {
    source_labels = ["__meta_docker_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_service"]
    target_label  = "service"
  }
}

// Ship logs to Grafana Cloud Loki
loki.source.docker "logs" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.relabel.containers.output
  forward_to = [loki.write.grafana_cloud.receiver]
}

loki.write "grafana_cloud" {
  endpoint {
    url = env("GRAFANA_LOKI_URL")
    basic_auth {
      username = env("GRAFANA_LOKI_USER")
      password = env("GRAFANA_LOKI_TOKEN")
    }
  }
  external_labels = {
    environment = env("SV0_ENVIRONMENT"), // "production" or "dev"
    host        = env("HOSTNAME"),
  }
}

// Scrape Prometheus metrics from API
prometheus.scrape "api" {
  targets         = [{ __address__ = "api:3000" }]
  metrics_path    = "/metrics"
  scrape_interval = "30s"
  forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = env("GRAFANA_PROM_URL")
    basic_auth {
      username = env("GRAFANA_PROM_USER")
      password = env("GRAFANA_PROM_TOKEN")
    }
  }
  external_labels = {
    environment = env("SV0_ENVIRONMENT"),
  }
}

Environment variables added to the deploy script per instance:

SV0_ENVIRONMENT=production  # or "dev"
GRAFANA_LOKI_URL=https://logs-prod-...grafana.net/loki/api/v1/push
GRAFANA_LOKI_USER=...
GRAFANA_LOKI_TOKEN=...
GRAFANA_PROM_URL=https://prometheus-prod-...grafana.net/api/prom/push
GRAFANA_PROM_USER=...
GRAFANA_PROM_TOKEN=...
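
If one of these variables is missing, Alloy may start but push nowhere, so a preflight check in the deploy script could fail fast instead. The sketch below is only one way to do that; `REQUIRED_ALLOY_ENV` and `missingEnv` are hypothetical names, and the list simply mirrors the variables above.

```typescript
// The variables listed above; HOSTNAME is left out because the container
// runtime normally provides it.
const REQUIRED_ALLOY_ENV: readonly string[] = [
  "SV0_ENVIRONMENT",
  "GRAFANA_LOKI_URL",
  "GRAFANA_LOKI_USER",
  "GRAFANA_LOKI_TOKEN",
  "GRAFANA_PROM_URL",
  "GRAFANA_PROM_USER",
  "GRAFANA_PROM_TOKEN",
];

// Hypothetical helper: return the names that are unset or blank.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ALLOY_ENV.filter((name) => (env[name] ?? "").trim() === "");
}

// Usage sketch: in a deploy script this would receive process.env.
const missing = missingEnv({ SV0_ENVIRONMENT: "dev" });
console.log(missing.length === 0 ? "ok" : `missing: ${missing.join(", ")}`);
```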

2.4 What You Get in Grafana Cloud

Once Alloy is running in both environments:

Logs (Loki):

  • {environment="production", service="api"} — search all prod API logs
  • {environment="dev", service="api"} |= "error" — find errors in dev
  • {service="api"} | json | level="error" — structured JSON parsing
  • Side-by-side prod vs dev log streams

Metrics (Mimir/Prometheus):

  • sv0_http_requests_total{environment="production"} — prod request count
  • rate(sv0_http_requests_total{status_code=~"5.."}[5m]) — error rate
  • sv0_queue_depth{environment="dev"} — dev worker queue
  • sv0_sync_age_minutes — connector freshness (once Phase 1.3 is wired)

Grafana Cloud includes pre-built explore views — no custom dashboards needed initially. Use Explore (Loki) for log search and Explore (Metrics) for metric queries. Custom dashboards can be added later when patterns emerge.

2.5 Grafana Alert Rules

Configure 3 high-value alerts in Grafana Cloud (free tier supports 500 rules):

Alert | Query | Fires When | Tier
High error rate | sum(rate(sv0_http_requests_total{status_code=~"5..", environment="production"}[5m])) / sum(rate(sv0_http_requests_total{environment="production"}[5m])) > 0.05 | >5% of prod requests are 5xx for 5 min | P2
Worker queue backing up | sv0_queue_depth{environment="production"} > 10 | Queue depth >10 for 5 min | P2
Connector stale | sv0_sync_age_minutes > 1440 | Any connector hasn't synced in 24h | P2
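
The error-rate rule compares total 5xx traffic against total traffic over the same window, summed across all routes and status codes. The sketch below reproduces that arithmetic on two counter snapshots taken 5 minutes apart, which can help sanity-check the threshold before creating the rule. `CounterSnapshot` and `errorRatio` are hypothetical names, and the sample numbers are made up.

```typescript
// Request counts by status code at one point in time (counters only increase).
type CounterSnapshot = Record<string, number>;

// Fraction of requests between the two snapshots that returned a 5xx status.
function errorRatio(before: CounterSnapshot, after: CounterSnapshot): number {
  let total = 0;
  let errors = 0;
  for (const code of Object.keys(after)) {
    const delta = (after[code] ?? 0) - (before[code] ?? 0);
    total += delta;
    if (/^5\d\d$/.test(code)) {
      errors += delta;
    }
  }
  // No traffic in the window means no error rate to alert on.
  return total === 0 ? 0 : errors / total;
}

// Made-up sample: 90 successful and 10 failed requests in the window.
const ratio = errorRatio({ "200": 100, "500": 0 }, { "200": 190, "500": 10 });
console.log(ratio > 0.05 ? "would fire" : "would not fire");
```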

Contact point: Slack #sv0-alerts incoming webhook.

Note: These are P2 alerts (business hours). P1 (prod down) is handled by the GitHub Actions external probe in Phase 3, because Grafana can't detect "host is dead" — it only sees "metrics stopped arriving," which has a delay.


Phase 3: External Probing + Slack Notifications (2-3 hours)

Runs entirely on GitHub Actions. No containers.

3.1 Platform Health Probe (GitHub Actions Cron)

New workflow: .github/workflows/platform-health.yml

name: platform-health
on:
  schedule:
    - cron: "*/5 * * * *" # Every 5 minutes
  workflow_dispatch: {}

jobs:
  probe:
    runs-on: ubuntu-latest
    steps:
      - name: Check production
        id: prod
        run: |
          # No -f here: with -f, curl exits non-zero on HTTP errors and a fallback
          # echo would be appended to the code -w already printed (e.g. "503000").
          # On connection failure -w prints 000 by itself.
          HTTP_CODE=$(curl -s --max-time 30 -o /tmp/prod-health.json -w "%{http_code}" \
            -H "CF-Access-Client-Id: ${{ secrets.CF_ACCESS_CLIENT_ID_DEPLOY }}" \
            -H "CF-Access-Client-Secret: ${{ secrets.CF_ACCESS_CLIENT_SECRET_DEPLOY }}" \
            "https://app.securityv0.com/ready" || true)
          echo "status=$HTTP_CODE" >> "$GITHUB_OUTPUT"
          if [ "$HTTP_CODE" = "200" ]; then
            echo "✅ Production: healthy"
          else
            echo "❌ Production: HTTP $HTTP_CODE"
          fi

      - name: Check dev
        id: dev
        run: |
          HTTP_CODE=$(curl -s --max-time 30 -o /dev/null -w "%{http_code}" \
            -H "CF-Access-Client-Id: ${{ secrets.CF_ACCESS_CLIENT_ID_DEPLOY }}" \
            -H "CF-Access-Client-Secret: ${{ secrets.CF_ACCESS_CLIENT_SECRET_DEPLOY }}" \
            "https://dev.securityv0.com/ready" || true)
          echo "status=$HTTP_CODE" >> "$GITHUB_OUTPUT"

      - name: Smoke test (verify data access)
        id: smoke
        if: steps.prod.outputs.status == '200'
        run: |
          BODY=$(curl -sf \
            -H "CF-Access-Client-Id: ${{ secrets.CF_ACCESS_CLIENT_ID_DEPLOY }}" \
            -H "CF-Access-Client-Secret: ${{ secrets.CF_ACCESS_CLIENT_SECRET_DEPLOY }}" \
            -H "X-Tenant-Id: demo-w1" \
            "https://app.securityv0.com/api/v1/findings?limit=1")
          COUNT=$(echo "$BODY" | python3 -c "import json,sys; print(json.load(sys.stdin)['meta']['total_count'])")
          echo "findings=$COUNT" >> "$GITHUB_OUTPUT"
          [ "$COUNT" -gt 0 ] && echo "✅ Data: $COUNT findings" || echo "⚠️ Data: 0 findings"

      - name: Alert Slack (prod down)
        if: steps.prod.outputs.status != '200'
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_WEBHOOK_ALERTS }}
          webhook-type: incoming-webhook
          payload: |
            {
              "text": "🚨 P1: Production health check FAILING (HTTP ${{ steps.prod.outputs.status }})",
              "blocks": [{
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": "🚨 *P1: Production Down*\n`app.securityv0.com/ready` → HTTP ${{ steps.prod.outputs.status }}\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View probe>"
                }
              }]
            }

Key decisions:

  • Probes /ready (checks MongoDB + worker), not /health (liveness only)
  • Smoke test reads actual data — catches "API up but database empty" failures
  • Runs on GitHub-hosted runners — external to Mac mini, detects host/tunnel death
  • Only alerts on failure (no alert fatigue)
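
The decisions above amount to a small classification: a non-200 /ready response means prod is down, while a 200 with zero or unreadable findings is only a warning. The sketch below expresses that logic as a pure function, which is easier to test than shell. `evaluateProbe` and `ProbeOutcome` are hypothetical names; the `meta.total_count` field is taken from the smoke test step.

```typescript
type ProbeOutcome =
  | { kind: "down"; detail: string }
  | { kind: "warn"; detail: string }
  | { kind: "healthy"; findings: number };

// Hypothetical helper: classify one probe run from the /ready status code
// and the raw body of the findings request.
function evaluateProbe(httpStatus: number, findingsBody: string): ProbeOutcome {
  if (httpStatus !== 200) {
    return { kind: "down", detail: `HTTP ${httpStatus}` };
  }
  let count: unknown;
  try {
    const parsed: unknown = JSON.parse(findingsBody);
    count = (parsed as { meta?: { total_count?: unknown } } | null)?.meta?.total_count;
  } catch {
    return { kind: "warn", detail: "findings response was not valid JSON" };
  }
  if (typeof count !== "number") {
    return { kind: "warn", detail: "meta.total_count missing" };
  }
  return count > 0
    ? { kind: "healthy", findings: count }
    : { kind: "warn", detail: "0 findings" };
}

console.log(evaluateProbe(200, '{"meta":{"total_count":3}}').kind);
```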

3.2 Slack Notifications on Existing Workflows

Add failure notification to deploy-prod.yml, deploy-dev.yml, ci.yml, token-health.yml:

- name: Notify Slack on failure
  if: failure()
  uses: slackapi/slack-github-action@v2
  with:
    webhook: ${{ secrets.SLACK_WEBHOOK_ALERTS }}
    webhook-type: incoming-webhook
    payload: |
      {
        "text": "❌ ${{ github.workflow }} failed on ${{ github.ref_name }}",
        "blocks": [{
          "type": "section",
          "text": {
            "type": "mrkdwn",
            "text": "❌ *${{ github.workflow }}* failed on `${{ github.ref_name }}`\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View run>"
          }
        }]
      }

Add success notification to deploy-prod.yml only (P3):

- name: Notify Slack (deploy success)
  if: success()
  uses: slackapi/slack-github-action@v2
  with:
    webhook: ${{ secrets.SLACK_WEBHOOK_DEPLOYS }}
    webhook-type: incoming-webhook
    payload: |
      { "text": "✅ Production deployed: ${{ inputs.image_tag }}" }

3.3 Slack Setup

Secret | Channel | Purpose
SLACK_WEBHOOK_ALERTS | #sv0-alerts | P1/P2: prod down, errors, credential expiry
SLACK_WEBHOOK_DEPLOYS | #sv0-deploys | P3: deploy success/failure notifications

What We Explicitly Defer

Item | Why | Revisit When
Custom Grafana dashboards | Explore view is sufficient for now. Dashboards earn their place when we know what to monitor. | After 2+ weeks of Grafana Cloud usage, once patterns emerge
Uptime Kuma | Internal prober can't detect host death. GitHub Actions probe is external and free. | Never for this architecture
mongodb_exporter | Single MongoDB instance, mongosh ping catches failures. Detailed metrics (connections, op counters) matter under load. | When MongoDB moves to Atlas or gets a replica set
Connector health API | No live customer connectors. Wire sv0_sync_age_minutes gauge instead. | 3+ connectors on a schedule with customer data
Sentry | uncaughtException handlers + structured logs + Grafana Loki cover error visibility. | After an incident where log search was insufficient
Claude agent /ops-health skill | Both developers can access Grafana Cloud directly. | When the team grows or ops checks become routine enough to automate

Exit Criteria for Deferred Dashboards

Build custom Grafana dashboards when any of:

  1. The same Explore query is run 3+ times in a week
  2. A design partner asks "what's the uptime?"
  3. Debugging an incident takes >30 min because the right metric wasn't visible

Security Considerations

Risk | Mitigation
Alloy Docker socket access | Mounted :ro, but that only protects the socket file; it does not restrict Docker API calls, so Alloy should be treated as having control of the Docker daemon. Alternative: json-file log driver with shared volume (loses container labels).
Grafana Cloud credentials on Mac mini | Stored as env vars in deploy config, same security model as existing MONGODB_URI. Future: rotate via 1Password CLI.
CF Access tokens in GitHub Secrets | Ephemeral runners only. Never persisted beyond workflow execution.
Slack webhook URLs | GitHub Secrets. Worst case: attacker can post to Slack, not access the platform.
Grafana Cloud data exposure | Logs may contain request paths, tenant IDs, error messages. Grafana Cloud is SOC2 compliant. No PII in logs (verified: logger doesn't log request bodies).

Architecture

┌──────────────────────────────────────────────┐
│ Mac Mini: Docker Compose (per environment)   │
│                                              │
│  ┌────────┐  ┌──────┐  ┌───────┐             │
│  │ API    │  │ UI   │  │ Mongo │             │
│  │ :3000  │  │ :8080│  │ :27017│             │
│  │ /ready │  │      │  │       │             │
│  │/metrics│  │      │  │       │             │
│  └───┬────┘  └──────┘  └───────┘             │
│      │                                       │
│  ┌───┴──────────────┐                        │
│  │ Grafana Alloy    │ ← 1 new container      │
│  │ - ship logs      │                        │
│  │ - scrape metrics │                        │
│  └────────┬─────────┘                        │
└───────────┼──────────────────────────────────┘
            │
            ▼
┌────────────────┐          ┌───────────────┐
│ Grafana Cloud  │──alerts─▶│ Slack         │
│ (Free Tier)    │          │ #sv0-alerts   │
│                │          │ #sv0-deploys  │
│ Loki (logs)    │          └──────▲────────┘
│ Mimir (metrics)│                 │
│ Alerting rules │          ┌──────┴────────┐
└────────────────┘          │ GitHub Actions│
        ▲                   │ - health probe│
        │                   │ - deploy notif│
  Ivan + Sergey             │ - token health│
  (shared access)           └───────────────┘

Summary

Phase | Focus | Where | New Containers | Effort
1 | Fix code: error handlers, structured logging, sync metric | sv0-platform | 0 | 2-3 hours
2 | Grafana Cloud + Alloy for shared log/metric visibility | Mac mini + SaaS | 1 per env | 3-4 hours
3 | External probing + Slack notifications | GitHub Actions | 0 | 2-3 hours

Total: ~7-10 hours. 1 new container per environment. $0 cost.


Decisions Needed

Decision | Owner | Notes
Grafana Cloud account setup | Ivan | 15 min: sign up, create API key, invite Sergey
Slack channels + webhooks | Ivan | 5 min: #sv0-alerts, #sv0-deploys, 2 incoming webhooks
Health probe frequency | Ivan | 5-min cron proposed. ~8,640 GitHub Actions min/month (12 runs/hr × 24h × 30d × ~1 min/run), which is within Enterprise limits but exceeds the Free tier (2,000 min/month).
Docker socket comfort level | Ivan | If uncomfortable: use json-file log driver + shared volume instead
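
To weigh the probe frequency decision, the minutes estimate can be recomputed for other intervals. The sketch below assumes each run is billed as whole minutes rounded up per job; `monthlyProbeMinutes` is a hypothetical helper, and the 2,000-minute figure is the Free tier limit quoted in the table.

```typescript
// Hypothetical helper: estimated GitHub Actions minutes per month for a cron probe.
function monthlyProbeMinutes(
  intervalMinutes: number,
  minutesPerRun: number,
  days: number = 30,
): number {
  const runsPerDay = (24 * 60) / intervalMinutes;
  // Assumption: billing rounds each job up to a whole minute.
  return runsPerDay * days * Math.ceil(minutesPerRun);
}

const FREE_TIER_MINUTES = 2000; // figure quoted in the table above

for (const interval of [5, 15, 30]) {
  const minutes = monthlyProbeMinutes(interval, 1);
  const verdict = minutes <= FREE_TIER_MINUTES ? "fits" : "exceeds";
  console.log(`every ${interval} min: ${minutes} min/month (${verdict} free tier)`);
}
```

Under these assumptions, the 5-minute and 15-minute schedules both exceed 2,000 minutes a month, while a 30-minute schedule stays under it at the cost of slower detection.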

Next Action

Status: research-complete

Decision needed from: Ivan (Grafana Cloud account, Slack setup)

Sequencing:

  1. Phase 1 can start immediately — no decisions needed, purely code fixes
  2. Phase 2 requires Grafana Cloud account (15 min) + deploy config update
  3. Phase 3 requires Slack webhooks (5 min) + new GitHub Actions workflow

GitHub Issue: To be created after plan approval