Operational Resiliency Plan

Problem Statement

SecurityV0 has a strong foundation — structured JSON logging, Prometheus /metrics endpoint, Docker health checks, and a token-health GitHub Actions workflow. But the pieces aren't connected:

Logs vanish when containers restart — no aggregation, no search
Errors are invisible — logged but nobody is notified
Connector staleness is undetectable — metric defined but never wired
No shared visibility — Sergey (different country) can't see logs or metrics without SSH access to the Mac mini
Multiple environments (prod, dev, future instances) have no unified view

This plan connects the dots with a middle-ground approach: Grafana Cloud for shared multi-environment visibility, GitHub Actions for external probing, Slack for notifications. One new container per environment (Alloy). Everything else deferred.

Design Principles

Shared visibility is a collaboration requirement. Two developers in different countries need a single URL to see logs, metrics, and health across all environments. This is not premature — it's table stakes for a distributed team.
External probes, not internal ones. An uptime monitor running on the host it monitors can't detect the failures that matter most: when the host goes down, the monitor goes down with it. Probes run on GitHub Actions (external).
One new container per environment, not four. Alloy ships logs and scrapes metrics — that's the minimum to make Grafana Cloud useful. MongoDB exporter, Uptime Kuma, and custom dashboards are deferred.
Fix the code first. Unhandled async errors crash the API silently. Route handlers bypass the structured logger. These are bugs, not observability gaps.
Operating model before alert rules. Every alert has a clear owner, response expectation, and remediation path.

Current State Audit

What Exists

Area	What's There
Health endpoints	`/health` (liveness), `/ready` (MongoDB + worker), `/metrics` (Prometheus), `/diagnostics`
Structured logging	JSON `{ ts, level, message, ...meta }`, child loggers, configurable `LOG_LEVEL`
Request tracking	UUID `x-request-id` on every request, propagated to logs
Prometheus metrics	8 custom metrics: HTTP duration/count, worker job duration/count, queue depth, sync age, findings/paths gauges
Docker health checks	MongoDB `mongosh ping`, API `wget /health`, resource limits (512MB API, 512MB Mongo)
Token monitoring	Weekly Cloudflare token expiry check, auto-creates GitHub Issues
Global error handler	Express 4-arg error middleware, logs with request ID
Graceful shutdown	4-phase shutdown (stop accepting → stop worker → drain → disconnect)

What's Missing

Gap	Impact	Priority
No `uncaughtException` / `unhandledRejection` handlers	Async errors crash API silently, no trace	Critical
Route handlers use `console.error()` not structured logger	Errors bypass logging pipeline	High
No log aggregation	Logs lost on restart, no search, Sergey can't access	High
No shared multi-environment visibility	Only accessible via SSH to Mac mini	High
No external uptime probe	Can't detect host/Docker/tunnel failures	High
No Slack notifications	Nobody knows about failures until manual check	High
`sv0_sync_age_minutes` gauge never updated	Connector freshness invisible	Medium

Operating Model

Alert Tiers

Tier	Meaning	Response	Example
P1 — Prod down	Production health check failing	Ivan responds ASAP (within 1 hour)	API unreachable, MongoDB down
P2 — Degraded	Production up but errors spiking or connectors stale	Ivan responds within business hours	5xx error rate >5%, connector stale >24h
P3 — Informational	Deploy succeeded, credential expiry warning	Acknowledge, schedule fix	Token expires in 30 days

Notification Routing

Tier	Channel	Behavior
P1	Slack `#sv0-alerts`	Immediate
P2	Slack `#sv0-alerts`	Business hours
P3	Slack `#sv0-deploys`	Informational

Explicit non-goal: No 24/7 on-call for a 2-person pre-revenue team. P1 outside business hours is best-effort until there are design partner SLAs.
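
The tiers and routing above could be captured as a small typed lookup so that probe and workflow code agree on where each alert goes. This is a minimal sketch, not existing SecurityV0 code: `Tier`, `Route`, `ROUTING`, and `routeAlert` are hypothetical names, and only the channel and behavior values come from the tables above.

```typescript
// Hypothetical names throughout; only the tier/channel values come from the plan.
type Tier = "P1" | "P2" | "P3";

interface Route {
  channel: string;  // Slack channel from the Notification Routing table
  behavior: string; // delivery expectation from the same table
}

const ROUTING: Record<Tier, Route> = {
  P1: { channel: "#sv0-alerts", behavior: "Immediate" },
  P2: { channel: "#sv0-alerts", behavior: "Business hours" },
  P3: { channel: "#sv0-deploys", behavior: "Informational" },
};

// Look up where an alert of the given tier should be posted.
function routeAlert(tier: Tier): Route {
  return ROUTING[tier];
}

console.log(routeAlert("P1").channel);
```

Keeping the mapping in one place means a change to the routing table is a one-line change in code.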

Implementation: Three Phases

Phase 1: Fix the Code (2-3 hours)

No infrastructure changes. Fix bugs in the existing codebase.

1.1 Add Process Error Handlers

In src/index.ts, before the server starts:

process.on("uncaughtException", (error) => {
  logger.error("Uncaught exception — shutting down", {
    error: error.message,
    stack: error.stack,
  });
  process.exit(1);
});

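// Log and keep running. Registering this handler replaces Node's default,
// which since Node 15 is to crash the process on an unhandled rejection.
process.on("unhandledRejection", (reason) => {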
  logger.error("Unhandled rejection", {
    reason: reason instanceof Error ? reason.message : String(reason),
    stack: reason instanceof Error ? reason.stack : undefined,
  });
});

1.2 Replace `console.error()` with Structured Logger

Grep and replace in all route handlers:

// Before
console.error("Unexpected error in GET /api/v1/entities:", error);

// After
deps.logger.error("GET /api/v1/entities failed", {
  requestId: req.requestId,
  error: error instanceof Error ? error.message : "Unknown error",
});
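
Both the process handlers in 1.1 and the route handlers here turn an unknown thrown value into log fields. A shared helper could keep that conversion consistent and preserve the stack trace. This is a sketch under assumptions: `toErrorMeta` and `ErrorMeta` are hypothetical names, not part of the current codebase.

```typescript
// Hypothetical helper: converts any thrown value into structured log fields.
interface ErrorMeta {
  error: string;
  stack?: string;
}

function toErrorMeta(reason: unknown): ErrorMeta {
  if (reason instanceof Error) {
    return { error: reason.message, stack: reason.stack };
  }
  // Non-Error throws (strings, numbers, plain objects) are stringified;
  // plain objects become "[object Object]", so throwing Error instances is preferable.
  return { error: String(reason) };
}

console.log(toErrorMeta(new Error("boom")).error);
console.log(toErrorMeta("plain string").error);
```

A route handler could then spread the result into its log call, for example `deps.logger.error("GET /api/v1/entities failed", { requestId: req.requestId, ...toErrorMeta(error) })`.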

1.3 Wire Sync Freshness Metric

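The sv0_sync_age_minutes gauge is defined but never set. Record each connector's last successful sync time, then recompute the age on a timer or at scrape time. Setting the gauge only when a sync completes would freeze it near zero, so a connector that stops syncing would never cross the 24h staleness threshold:

// Run periodically (for example once a minute), not only when a sync completes.
// lastSyncTimestamp is the stored completion time (ms) of the connector's last sync.
const ageMinutes = (Date.now() - lastSyncTimestamp) / 60_000;
syncAgeMinutes.set({ connector_id: connectorId }, ageMinutes);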
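
One way to keep the gauge meaningful when syncs stop is to store the last completion time per connector and derive the age whenever it is read. The sketch below assumes nothing about the existing worker code: `SyncFreshness`, `recordSync`, and `ages` are hypothetical names, and the connector ID is made up.

```typescript
// Hypothetical sketch: track the last successful sync per connector and
// compute age on demand, so the value keeps growing when syncs stop.
class SyncFreshness {
  private lastSync: Record<string, number> = {};

  // Call when a connector sync completes successfully.
  recordSync(connectorId: string, atMs: number = Date.now()): void {
    this.lastSync[connectorId] = atMs;
  }

  // Age in minutes for every known connector, evaluated at `nowMs`.
  ages(nowMs: number = Date.now()): Record<string, number> {
    const result: Record<string, number> = {};
    for (const id of Object.keys(this.lastSync)) {
      result[id] = (nowMs - this.lastSync[id]) / 60_000;
    }
    return result;
  }
}

const freshness = new SyncFreshness();
freshness.recordSync("connector-a", 0);
// 25 hours after the last sync the age is 1500 minutes, above the 24h (1440 min) threshold.
console.log(freshness.ages(25 * 60 * 60_000)["connector-a"]); // 1500
```

Each value from `ages()` would then be written to the gauge on a timer or when metrics are scraped.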

Phase 2: Grafana Cloud + Alloy (3-4 hours)

Goal: Open one URL, see logs and metrics from all environments. Sergey and Ivan both have access.

2.1 Set Up Grafana Cloud Free Account

Sign up at grafana.com (free tier: 50GB logs/mo, 10k metric series, 14-day retention, 3 users)
Create a Grafana Cloud API key with push permissions
Note the Loki push URL and Prometheus remote-write URL
Store credentials in 1Password: op://sv0-bots/grafana-cloud/
Add as GitHub secrets: GRAFANA_CLOUD_LOKI_URL, GRAFANA_CLOUD_LOKI_USER, GRAFANA_CLOUD_LOKI_TOKEN, GRAFANA_CLOUD_PROM_URL, GRAFANA_CLOUD_PROM_USER, GRAFANA_CLOUD_PROM_TOKEN

Free tier limits vs our usage:

Resource	Free Limit	Our Usage (estimate)	Headroom
Log ingest	50 GB/month	~2-5 GB (2-3 environments, low traffic)	10-25x
Metric series	10,000	~200 (8 custom + Node.js defaults × 2-3 envs)	50x
Retention	14 days	Sufficient for debugging	—
Users	3	2 (Ivan + Sergey)	1 spare
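
The headroom column is the free-tier limit divided by the estimated usage. A short check of that arithmetic, using only the figures from the table (the `headroom` function name is illustrative):

```typescript
// Headroom = free-tier limit divided by estimated usage.
function headroom(limit: number, usage: number): number {
  return limit / usage;
}

console.log(headroom(50, 5));       // 10 (log ingest, high estimate of 5 GB/month)
console.log(headroom(50, 2));       // 25 (log ingest, low estimate of 2 GB/month)
console.log(headroom(10_000, 200)); // 50 (metric series)
```

The usage figures are estimates, so the ratios are worth re-running against the actual ingest numbers once Alloy has been shipping data for a week or two.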

Lock-in mitigation: Alloy uses standard protocols (Loki push API, Prometheus remote-write). If we outgrow the free tier, we can point Alloy at a self-hosted Loki/Mimir instance or switch to any OpenTelemetry-compatible backend. The config change is 2 lines (URL + credentials).

2.2 Add Grafana Alloy to Docker Compose

Add to docker-compose.deploy.yml:

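alloy:
  image: grafana/alloy:latest  # consider pinning a specific version tag for reproducible deploys
  restart: unless-stopped
  mem_limit: 128m
  environment:
    # Passed through from the deploy environment; alloy-config.alloy reads these via env()
    - SV0_ENVIRONMENT
    - GRAFANA_LOKI_URL
    - GRAFANA_LOKI_USER
    - GRAFANA_LOKI_TOKEN
    - GRAFANA_PROM_URL
    - GRAFANA_PROM_USER
    - GRAFANA_PROM_TOKEN
  volumes:
    - ./deploy/alloy-config.alloy:/etc/alloy/config.alloy:ro
    - /var/run/docker.sock:/var/run/docker.sock:ro
  command: run /etc/alloy/config.alloy
  depends_on:
    api:
      condition: service_healthy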

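Security note on Docker socket: Alloy needs Docker socket access to discover containers and collect their logs. This is a sensitive surface. Points to weigh:

The :ro mount flag only prevents changes to the socket file itself. It does not make the Docker API read-only: a process that can connect to the socket can still create, stop, or modify containers, so a compromised Alloy container has effectively full control of Docker on the host
Check which user Alloy runs as inside its container; the socket is normally accessible only to root or members of the docker group, so a non-root Alloy user needs group access
The alternative (no log aggregation) means Sergey has zero visibility and logs are lost on restart, which is a worse operational risk for a 2-person distributed team
If this remains uncomfortable, an alternative is to use Docker's json-file log driver with a shared volume that Alloy reads (no socket needed, but loses container metadata labels)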

2.3 Alloy Configuration

Create deploy/alloy-config.alloy (~40 lines):

// Discover Docker containers
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

// Relabel: extract service name and environment
discovery.relabel "containers" {
  targets = discovery.docker.containers.targets

  rule {
    source_labels = ["__meta_docker_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_service"]
    target_label  = "service"
  }
}

// Ship logs to Grafana Cloud Loki
loki.source.docker "logs" {
  host    = "unix:///var/run/docker.sock"
  targets = discovery.relabel.containers.output
  forward_to = [loki.write.grafana_cloud.receiver]
}

loki.write "grafana_cloud" {
  endpoint {
    url = env("GRAFANA_LOKI_URL")
    basic_auth {
      username = env("GRAFANA_LOKI_USER")
      password = env("GRAFANA_LOKI_TOKEN")
    }
  }
  external_labels = {
    environment = env("SV0_ENVIRONMENT"),  // "production" or "dev"
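    host        = env("HOSTNAME"),  // inside the container this is the container's hostname, not the Mac mini's, unless set explicitly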
  }
}

// Scrape Prometheus metrics from API
prometheus.scrape "api" {
  targets = [{ __address__ = "api:3000" }]
  metrics_path = "/metrics"
  scrape_interval = "30s"
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = env("GRAFANA_PROM_URL")
    basic_auth {
      username = env("GRAFANA_PROM_USER")
      password = env("GRAFANA_PROM_TOKEN")
    }
  }
  external_labels = {
    environment = env("SV0_ENVIRONMENT"),
  }
}

Environment variables added to the deploy script per instance:

SV0_ENVIRONMENT=production  # or "dev"
GRAFANA_LOKI_URL=https://logs-prod-...grafana.net/loki/api/v1/push
GRAFANA_LOKI_USER=...
GRAFANA_LOKI_TOKEN=...
GRAFANA_PROM_URL=https://prometheus-prod-...grafana.net/api/prom/push
GRAFANA_PROM_USER=...
GRAFANA_PROM_TOKEN=...
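
A missing or empty variable is easy to overlook when several instances are configured by hand. A pre-deploy check along the following lines could catch that early. This is a sketch in TypeScript under assumptions: the deploy script itself may be written in another language, and `REQUIRED_ALLOY_ENV`, `missingEnv`, and `exampleEnv` are hypothetical names. The variable names are the ones listed above.

```typescript
// Variable names taken from the list above; everything else is illustrative.
const REQUIRED_ALLOY_ENV: string[] = [
  "SV0_ENVIRONMENT",
  "GRAFANA_LOKI_URL",
  "GRAFANA_LOKI_USER",
  "GRAFANA_LOKI_TOKEN",
  "GRAFANA_PROM_URL",
  "GRAFANA_PROM_USER",
  "GRAFANA_PROM_TOKEN",
];

// Returns the names that are unset or blank in the given environment map.
function missingEnv(
  required: string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Example input: one variable set, one blank, the rest absent.
const exampleEnv: Record<string, string | undefined> = {
  SV0_ENVIRONMENT: "dev",
  GRAFANA_LOKI_URL: "",
};
console.log(missingEnv(REQUIRED_ALLOY_ENV, exampleEnv));
```

In a real script the second argument would be the process environment, and a non-empty result would abort the deploy before Alloy starts.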

2.4 What You Get in Grafana Cloud

Once Alloy is running in both environments:

Logs (Loki):

{environment="production", service="api"} — search all prod API logs
{environment="dev", service="api"} |= "error" — find errors in dev
{service="api"} | json | level="error" — structured JSON parsing
Side-by-side prod vs dev log streams

Metrics (Mimir/Prometheus):

sv0_http_requests_total{environment="production"} — prod request count
rate(sv0_http_requests_total{status_code=~"5.."}[5m]) — error rate
sv0_queue_depth{environment="dev"} — dev worker queue
sv0_sync_age_minutes — connector freshness (once Phase 1.3 is wired)

Grafana Cloud includes pre-built explore views — no custom dashboards needed initially. Use Explore (Loki) for log search and Explore (Metrics) for metric queries. Custom dashboards can be added later when patterns emerge.

2.5 Grafana Alert Rules

Configure 3 high-value alerts in Grafana Cloud (free tier supports 500 rules):

Alert	Query	Fires When	Tier
High error rate	`sum(rate(sv0_http_requests_total{status_code=~"5..", environment="production"}[5m])) / sum(rate(sv0_http_requests_total{environment="production"}[5m])) > 0.05`	>5% of prod requests are 5xx for 5 min	P2
Worker queue backing up	`sv0_queue_depth{environment="production"} > 10`	Queue depth >10 for 5 min	P2
Connector stale	`sv0_sync_age_minutes > 1440`	Any connector hasn't synced in 24h	P2

Contact point: Slack #sv0-alerts incoming webhook.

Note: These are P2 alerts (business hours). P1 (prod down) is handled by the GitHub Actions external probe in Phase 3, because Grafana can't detect "host is dead" — it only sees "metrics stopped arriving," which has a delay.
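
The intent of the high error rate rule is the share of production requests that return 5xx. The same calculation over a handful of per-series rates is sketched below. Rates are summed across series before dividing, because a series-by-series division would only ever compare each 5xx series with itself. The `SeriesRate` shape, the `errorRatio` function, and the sample values are made up for illustration.

```typescript
interface SeriesRate {
  statusCode: string; // value of the status_code label
  perSecond: number;  // per-second request rate for that series
}

// Fraction of all requests that returned a 5xx status.
function errorRatio(series: SeriesRate[]): number {
  let total = 0;
  let errors = 0;
  for (const s of series) {
    total += s.perSecond;
    if (/^5\d\d$/.test(s.statusCode)) {
      errors += s.perSecond;
    }
  }
  return total === 0 ? 0 : errors / total;
}

// Illustrative sample: 10 requests/s in total, 0.6 of them 5xx.
const sample: SeriesRate[] = [
  { statusCode: "200", perSecond: 9.0 },
  { statusCode: "404", perSecond: 0.4 },
  { statusCode: "500", perSecond: 0.6 },
];
console.log(errorRatio(sample) > 0.05); // true: roughly 0.06, above the 5% threshold
```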

Phase 3: External Probing + Slack Notifications (2-3 hours)

Runs entirely on GitHub Actions. No containers.

3.1 Platform Health Probe (GitHub Actions Cron)

New workflow: .github/workflows/platform-health.yml

name: platform-health
on:
  schedule:
    - cron: "*/5 * * * *"  # Every 5 minutes (GitHub's shortest interval; scheduled runs can start late under load)
  workflow_dispatch: {}

jobs:
  probe:
    runs-on: ubuntu-latest
    steps:
      - name: Check production
        id: prod
        run: |
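          # No -f here: with -f, curl exits non-zero on an HTTP error after -w has
          # already printed the code, so an "|| echo 000" fallback would yield e.g. "503000".
          # On a connection failure -w prints 000 by itself.
          HTTP_CODE=$(curl -s -o /tmp/prod-health.json -w "%{http_code}" --max-time 15 \
            -H "CF-Access-Client-Id: ${{ secrets.CF_ACCESS_CLIENT_ID_DEPLOY }}" \
            -H "CF-Access-Client-Secret: ${{ secrets.CF_ACCESS_CLIENT_SECRET_DEPLOY }}" \
            "https://app.securityv0.com/ready" || true)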
          echo "status=$HTTP_CODE" >> "$GITHUB_OUTPUT"
          if [ "$HTTP_CODE" = "200" ]; then
            echo "✅ Production: healthy"
          else
            echo "❌ Production: HTTP $HTTP_CODE"
          fi

      - name: Check dev
        id: dev
        run: |
          # No -f and no echo fallback: -w already prints the status (000 on connection failure)
          HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 15 \
            -H "CF-Access-Client-Id: ${{ secrets.CF_ACCESS_CLIENT_ID_DEPLOY }}" \
            -H "CF-Access-Client-Secret: ${{ secrets.CF_ACCESS_CLIENT_SECRET_DEPLOY }}" \
            "https://dev.securityv0.com/ready" || true)
          echo "status=$HTTP_CODE" >> "$GITHUB_OUTPUT"

      - name: Smoke test — verify data access
        id: smoke
        if: steps.prod.outputs.status == '200'
        run: |
          BODY=$(curl -sf \
            -H "CF-Access-Client-Id: ${{ secrets.CF_ACCESS_CLIENT_ID_DEPLOY }}" \
            -H "CF-Access-Client-Secret: ${{ secrets.CF_ACCESS_CLIENT_SECRET_DEPLOY }}" \
            -H "X-Tenant-Id: demo-w1" \
            "https://app.securityv0.com/api/v1/findings?limit=1")
          COUNT=$(echo "$BODY" | python3 -c "import json,sys; print(json.load(sys.stdin)['meta']['total_count'])")
          echo "findings=$COUNT" >> "$GITHUB_OUTPUT"
          [ "$COUNT" -gt 0 ] && echo "✅ Data: $COUNT findings" || echo "⚠️ Data: 0 findings"

      - name: Alert Slack — prod down
        if: steps.prod.outputs.status != '200'
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_WEBHOOK_ALERTS }}
          webhook-type: incoming-webhook
          payload: |
            {
              "text": "🚨 P1: Production health check FAILING (HTTP ${{ steps.prod.outputs.status }})",
              "blocks": [{
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": "🚨 *P1 — Production Down*\n`app.securityv0.com/ready` → HTTP ${{ steps.prod.outputs.status }}\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View probe>"
                }
              }]
            }

Key decisions:

Probes /ready (checks MongoDB + worker), not /health (liveness only)
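Smoke test reads actual data: a failed request fails the workflow run, while a findings count of 0 ("API up but database empty") is only logged as a warning and does not post to Slack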
Runs on GitHub-hosted runners — external to Mac mini, detects host/tunnel death
Only alerts on failure (no alert fatigue)
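
The probe's decision logic can be summarized as a small pure function. This is an illustrative sketch only: the workflow above implements the checks in shell, and `ProbeOutcome` and `classifyProbe` are hypothetical names.

```typescript
type ProbeOutcome = "healthy" | "no-data" | "down";

// status: HTTP status returned by /ready (0 when the request itself failed).
// findings: total_count from the smoke test, if it ran.
function classifyProbe(status: number, findings?: number): ProbeOutcome {
  if (status !== 200) {
    return "down"; // maps to the P1 Slack alert
  }
  if (findings !== undefined && findings <= 0) {
    return "no-data"; // API is up but the smoke test found no findings
  }
  return "healthy";
}

console.log(classifyProbe(200, 12)); // healthy
console.log(classifyProbe(200, 0));  // no-data
console.log(classifyProbe(503));     // down
```

Separating the classification from the HTTP calls makes the thresholds easy to unit test if the probe is ever moved out of shell.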

3.2 Slack Notifications on Existing Workflows

Add failure notification to deploy-prod.yml, deploy-dev.yml, ci.yml, token-health.yml:

- name: Notify Slack on failure
  if: failure()
  uses: slackapi/slack-github-action@v2
  with:
    webhook: ${{ secrets.SLACK_WEBHOOK_ALERTS }}
    webhook-type: incoming-webhook
    payload: |
      {
        "text": "❌ ${{ github.workflow }} failed on ${{ github.ref_name }}",
        "blocks": [{
          "type": "section",
          "text": {
            "type": "mrkdwn",
            "text": "❌ *${{ github.workflow }}* failed on `${{ github.ref_name }}`\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View run>"
          }
        }]
      }

Add success notification to deploy-prod.yml only (P3):

- name: Notify Slack — deploy success
  if: success()
  uses: slackapi/slack-github-action@v2
  with:
    webhook: ${{ secrets.SLACK_WEBHOOK_DEPLOYS }}
    webhook-type: incoming-webhook
    payload: |
      { "text": "✅ Production deployed: ${{ inputs.image_tag }}" }

3.3 Slack Setup

Secret	Channel	Purpose
`SLACK_WEBHOOK_ALERTS`	`#sv0-alerts`	P1/P2: prod down, errors, workflow failures (deploy, CI, token health)
`SLACK_WEBHOOK_DEPLOYS`	`#sv0-deploys`	P3: deploy success notifications

What We Explicitly Defer

Item	Why	Revisit When
Custom Grafana dashboards	Explore view is sufficient for now. Dashboards earn their place when we know what to monitor.	After 2+ weeks of Grafana Cloud usage — patterns will emerge
Uptime Kuma	Internal prober can't detect host death. GitHub Actions probe is external and free.	Never for this architecture
mongodb_exporter	Single MongoDB instance, `mongosh ping` catches failures. Detailed metrics (connections, op counters) matter under load.	When MongoDB moves to Atlas or gets a replica set
Connector health API	No live customer connectors. Wire `sv0_sync_age_minutes` gauge instead.	3+ connectors on a schedule with customer data
Sentry	`uncaughtException` handlers + structured logs + Grafana Loki cover error visibility.	After an incident where log search was insufficient
Claude agent `/ops-health` skill	Both developers can access Grafana Cloud directly.	When the team grows or ops checks become routine enough to automate

Exit Criteria for Deferred Dashboards

Build custom Grafana dashboards when any of:

The same Explore query is run 3+ times in a week
A design partner asks "what's the uptime?"
Debugging an incident takes >30 min because the right metric wasn't visible

Security Considerations

Risk	Mitigation
Alloy Docker socket access	The `:ro` mount only protects the socket file; it does not restrict Docker API calls, so Alloy must be treated as having full Docker access on the host. Alternative: `json-file` log driver with shared volume (loses container labels).
Grafana Cloud credentials on Mac mini	Stored as env vars in deploy config, same security model as existing MONGODB_URI. Future: rotate via 1Password CLI.
CF Access tokens in GitHub Secrets	Ephemeral runners only. Never persisted beyond workflow execution.
Slack webhook URLs	GitHub Secrets. Worst case: attacker can post to Slack, not access the platform.
Grafana Cloud data exposure	Logs may contain request paths, tenant IDs, error messages. Grafana Cloud is SOC2 compliant. No PII in logs (verified: logger doesn't log request bodies).

Architecture

┌──────────────────────────────────────────────┐
│  Mac Mini — Docker Compose (per environment) │
│                                              │
│  ┌───────┐  ┌──────┐  ┌───────┐             │
│  │  API  │  │  UI  │  │ Mongo │             │
│  │ :3000 │  │:8080 │  │:27017 │             │
│  │/ready │  │      │  │       │             │
│  │/metrics│ │      │  │       │             │
│  └───┬───┘  └──────┘  └───────┘             │
│      │                                       │
│  ┌───┴──────────────┐                        │
│  │  Grafana Alloy   │  ← 1 new container     │
│  │  - ship logs     │                        │
│  │  - scrape metrics│                        │
│  └────────┬─────────┘                        │
└───────────┼──────────────────────────────────┘
            │
            ▼
   ┌────────────────┐         ┌──────────────┐
   │ Grafana Cloud  │──alerts─▶│    Slack     │
   │  (Free Tier)   │         │ #sv0-alerts  │
   │                │         │ #sv0-deploys │
   │ Loki (logs)    │         └──────▲───────┘
   │ Mimir (metrics)│                │
   │ Alerting rules │         ┌──────┴───────┐
   └────────────────┘         │GitHub Actions │
        ▲                     │ - health probe│
        │                     │ - deploy notif│
   Ivan + Sergey              │ - token health│
   (shared access)            └──────────────┘

Summary

Phase	Focus	Where	New Containers	Effort
1	Fix code: error handlers, structured logging, sync metric	sv0-platform	0	2-3 hours
2	Grafana Cloud + Alloy for shared log/metric visibility	Mac mini + SaaS	1 per env	3-4 hours
3	External probing + Slack notifications	GitHub Actions	0	2-3 hours

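Total: ~7-10 hours (the sum of the three phase estimates). 1 new container per environment. $0 cost, as long as the health probe's GitHub Actions minutes fit within the plan's included quota (see Decisions Needed).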

Decisions Needed

Decision	Owner	Notes
Grafana Cloud account setup	Ivan	15 min — sign up, create API key, invite Sergey
Slack channels + webhooks	Ivan	5 min — `#sv0-alerts`, `#sv0-deploys`, 2 incoming webhooks
Health probe frequency	Ivan	5-min cron proposed. ~8,640 GitHub Actions min/month (12 runs/hr × 24h × 30d × ~1 min/run) — within Enterprise limits, but exceeds Free tier (2,000 min/month).
Docker socket comfort level	Ivan	If uncomfortable: use `json-file` log driver + shared volume instead
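
The probe-frequency estimate follows from runs per day times days times minutes per run. The sketch below reproduces the 5-minute figure and shows two slower intervals for comparison. It keeps the table's assumption of about 1 billed minute per run and a 30-day month; `monthlyMinutes` is an illustrative name.

```typescript
// Estimated GitHub Actions minutes per month for a cron-triggered probe.
function monthlyMinutes(
  intervalMinutes: number,
  minutesPerRun: number = 1, // assumption from the table: ~1 min per run
  days: number = 30,
): number {
  const runsPerDay = (24 * 60) / intervalMinutes;
  return runsPerDay * days * minutesPerRun;
}

console.log(monthlyMinutes(5));  // 8640
console.log(monthlyMinutes(15)); // 2880
console.log(monthlyMinutes(30)); // 1440
```

Under these assumptions, only the 30-minute interval stays below a 2,000 min/month quota, at the cost of slower detection of a prod outage.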

Next Action

Status: research-complete

Decision needed from: Ivan (Grafana Cloud account, Slack setup)

Sequencing:

Phase 1 can start immediately — no decisions needed, purely code fixes
Phase 2 requires Grafana Cloud account (15 min) + deploy config update
Phase 3 requires Slack webhooks (5 min) + new GitHub Actions workflow

GitHub Issue: To be created after plan approval
