Operational Resiliency Plan
Problem Statement
SecurityV0 has a strong foundation — structured JSON logging, Prometheus /metrics endpoint, Docker health checks, and a token-health GitHub Actions workflow. But the pieces aren't connected:
- Logs vanish when containers restart — no aggregation, no search
- Errors are invisible — logged but nobody is notified
- Connector staleness is undetectable — metric defined but never wired
- No shared visibility — Sergey (different country) can't see logs or metrics without SSH access to the Mac mini
- Multiple environments (prod, dev, future instances) have no unified view
This plan connects the dots with a middle-ground approach: Grafana Cloud for shared multi-environment visibility, GitHub Actions for external probing, Slack for notifications. One new container per environment (Alloy). Everything else deferred.
Design Principles
- Shared visibility is a collaboration requirement. Two developers in different countries need a single URL to see logs, metrics, and health across all environments. This is not premature; it's table stakes for a distributed team.
- External probes, not internal ones. Uptime monitors on the same host they monitor can't detect the failures that matter most. Probes run on GitHub Actions (external).
- One new container per environment, not four. Alloy ships logs and scrapes metrics, which is the minimum to make Grafana Cloud useful. MongoDB exporter, Uptime Kuma, and custom dashboards are deferred.
- Fix the code first. Unhandled async errors crash the API silently. Route handlers bypass the structured logger. These are bugs, not observability gaps.
- Operating model before alert rules. Every alert has a clear owner, response expectation, and remediation path.
Current State Audit
What Exists
| Area | What's There |
|---|---|
| Health endpoints | /health (liveness), /ready (MongoDB + worker), /metrics (Prometheus), /diagnostics |
| Structured logging | JSON { ts, level, message, ...meta }, child loggers, configurable LOG_LEVEL |
| Request tracking | UUID x-request-id on every request, propagated to logs |
| Prometheus metrics | 8 custom metrics: HTTP duration/count, worker job duration/count, queue depth, sync age, findings/paths gauges |
| Docker health checks | MongoDB mongosh ping, API wget /health, resource limits (512MB API, 512MB Mongo) |
| Token monitoring | Weekly Cloudflare token expiry check, auto-creates GitHub Issues |
| Global error handler | Express 4-arg error middleware, logs with request ID |
| Graceful shutdown | 4-phase shutdown (stop accepting → stop worker → drain → disconnect) |
What's Missing
| Gap | Impact | Priority |
|---|---|---|
| No uncaughtException / unhandledRejection handlers | Async errors crash API silently, no trace | Critical |
| Route handlers use console.error() not structured logger | Errors bypass logging pipeline | High |
| No log aggregation | Logs lost on restart, no search, Sergey can't access | High |
| No shared multi-environment visibility | Only accessible via SSH to Mac mini | High |
| No external uptime probe | Can't detect host/Docker/tunnel failures | High |
| No Slack notifications | Nobody knows about failures until manual check | High |
| sv0_sync_age_minutes gauge never updated | Connector freshness invisible | Medium |
Operating Model
Alert Tiers
| Tier | Meaning | Response | Example |
|---|---|---|---|
| P1 — Prod down | Production health check failing | Ivan responds ASAP (within 1 hour) | API unreachable, MongoDB down |
| P2 — Degraded | Production up but errors spiking or connectors stale | Ivan responds within business hours | 5xx error rate >5%, connector stale >24h |
| P3 — Informational | Deploy succeeded, credential expiry warning | Acknowledge, schedule fix | Token expires in 30 days |
Notification Routing
| Tier | Channel | Behavior |
|---|---|---|
| P1 | Slack #sv0-alerts | Immediate |
| P2 | Slack #sv0-alerts | Business hours |
| P3 | Slack #sv0-deploys | Informational |
Explicit non-goal: No 24/7 on-call for a 2-person pre-revenue team. P1 outside business hours is best-effort until there are design partner SLAs.
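The tiers and routing above can be captured as a small typed lookup so that notification code and this table do not drift apart. This is only a sketch: `AlertTier`, `Route`, `ROUTES`, and `routeFor` are hypothetical names, not existing code in the repo.

```typescript
// Hypothetical names throughout; values are copied from the routing table.
type AlertTier = "P1" | "P2" | "P3";

interface Route {
  channel: string; // Slack channel for this tier
  behavior: string; // delivery expectation for this tier
}

const ROUTES: Record<AlertTier, Route> = {
  P1: { channel: "#sv0-alerts", behavior: "Immediate" },
  P2: { channel: "#sv0-alerts", behavior: "Business hours" },
  P3: { channel: "#sv0-deploys", behavior: "Informational" },
};

// Look up where an alert of the given tier should be delivered.
function routeFor(tier: AlertTier): Route {
  return ROUTES[tier];
}
```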
Implementation: Three Phases
Phase 1: Fix the Code (2-3 hours)
No infrastructure changes. Fix bugs in the existing codebase.
1.1 Add Process Error Handlers
In src/index.ts, before the server starts:
    process.on("uncaughtException", (error) => {
      logger.error("Uncaught exception, shutting down", {
        error: error.message,
        stack: error.stack,
      });
      process.exit(1);
    });
    process.on("unhandledRejection", (reason) => {
      logger.error("Unhandled rejection", {
        reason: reason instanceof Error ? reason.message : String(reason),
        stack: reason instanceof Error ? reason.stack : undefined,
      });
    });
1.2 Replace console.error() with Structured Logger
Grep and replace in all route handlers:
    // Before
    console.error("Unexpected error in GET /api/v1/entities:", error);
    // After
    deps.logger.error("GET /api/v1/entities failed", {
      requestId: req.requestId,
      error: error instanceof Error ? error.message : "Unknown error",
    });
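Both 1.1 and 1.2 convert an unknown thrown value into log fields by hand. One way to keep that consistent is a shared helper, sketched below. `toErrorFields` and `ErrorFields` are hypothetical names, not part of the existing logger.

```typescript
// Hypothetical helper: turns any thrown value into flat log fields.
interface ErrorFields {
  error: string;
  stack?: string;
}

function toErrorFields(reason: unknown): ErrorFields {
  if (reason instanceof Error) {
    // Real Error objects carry a message and (usually) a stack trace.
    return { error: reason.message, stack: reason.stack };
  }
  // Non-Error throws (strings, numbers, plain objects) have no stack.
  return { error: typeof reason === "string" ? reason : String(reason) };
}
```

A route handler could then log `{ requestId: req.requestId, ...toErrorFields(error) }`, which also preserves the stack trace that the "After" snippet drops.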
1.3 Wire Sync Freshness Metric
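The sv0_sync_age_minutes gauge exists but is never set. Record each connector's last successful sync time and recompute the age on a timer or at scrape time. If the gauge were set only at the moment a sync completes, it would stay near zero and the 24h staleness alert could never fire:
    const ageMinutes = (Date.now() - lastSyncTimestamp) / 60_000;
    syncAgeMinutes.set({ connector_id: connectorId }, ageMinutes);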
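A minimal sketch of the bookkeeping behind that gauge, assuming the last successful sync time is kept in memory per connector. `SyncFreshness` is a hypothetical class, not existing code. The point it illustrates is that the age is derived when it is read, so it keeps growing between syncs.

```typescript
// Hypothetical tracker: stores the last successful sync per connector and
// derives the age on demand instead of at sync completion.
class SyncFreshness {
  private lastSync = new Map<string, number>(); // connector_id -> epoch ms

  // Call when a sync finishes successfully.
  markSynced(connectorId: string, nowMs: number = Date.now()): void {
    this.lastSync.set(connectorId, nowMs);
  }

  // Age in minutes for every known connector at the given instant.
  ages(nowMs: number = Date.now()): Array<{ connectorId: string; ageMinutes: number }> {
    const out: Array<{ connectorId: string; ageMinutes: number }> = [];
    this.lastSync.forEach((ts, connectorId) => {
      out.push({ connectorId, ageMinutes: (nowMs - ts) / 60_000 });
    });
    return out;
  }
}
```

The loop over `ages()` could run on a timer, or inside the gauge's `collect` callback if the installed prom-client version supports it, calling `syncAgeMinutes.set({ connector_id }, ageMinutes)` for each entry.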
Phase 2: Grafana Cloud + Alloy (3-4 hours)
Goal: Open one URL, see logs and metrics from all environments. Sergey and Ivan both have access.
2.1 Set Up Grafana Cloud Free Account
- Sign up at grafana.com (free tier: 50GB logs/mo, 10k metric series, 14-day retention, 3 users)
- Create a Grafana Cloud API key with push permissions
- Note the Loki push URL and Prometheus remote-write URL
- Store credentials in 1Password: op://sv0-bots/grafana-cloud/
- Add as GitHub secrets: GRAFANA_CLOUD_LOKI_URL, GRAFANA_CLOUD_LOKI_USER, GRAFANA_CLOUD_LOKI_TOKEN, GRAFANA_CLOUD_PROM_URL, GRAFANA_CLOUD_PROM_USER, GRAFANA_CLOUD_PROM_TOKEN
Free tier limits vs our usage:
| Resource | Free Limit | Our Usage (estimate) | Headroom |
|---|---|---|---|
| Log ingest | 50 GB/month | ~2-5 GB (2-3 environments, low traffic) | 10-25x |
| Metric series | 10,000 | ~200 (8 custom + Node.js defaults × 2-3 envs) | 50x |
| Retention | 14 days | Sufficient for debugging | — |
| Users | 3 | 2 (Ivan + Sergey) | 1 spare |
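The Headroom column is the free limit divided by the estimated usage. A short worked check, using only the figures from the table (`headroom` is an illustrative helper):

```typescript
// Headroom = free-tier limit divided by estimated usage.
function headroom(limit: number, usage: number): number {
  return limit / usage;
}

const logHeadroomLow = headroom(50, 5); // 50 GB limit, 5 GB/month estimate -> 10
const logHeadroomHigh = headroom(50, 2); // 50 GB limit, 2 GB/month estimate -> 25
const seriesHeadroom = headroom(10_000, 200); // 10,000 series, ~200 in use -> 50
```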
Lock-in mitigation: Alloy uses standard protocols (Loki push API, Prometheus remote-write). If we outgrow the free tier, we can point Alloy at a self-hosted Loki/Mimir instance or switch to any OpenTelemetry-compatible backend. The config change is 2 lines (URL + credentials).
2.2 Add Grafana Alloy to Docker Compose
Add to docker-compose.deploy.yml:
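    alloy:
      image: grafana/alloy:latest
      restart: unless-stopped
      mem_limit: 128m
      volumes:
        - ./deploy/alloy-config.alloy:/etc/alloy/config.alloy:ro
        - /var/run/docker.sock:/var/run/docker.sock:ro
      command: run /etc/alloy/config.alloy
      environment:
        # Assumption: the deploy script exports these, so Compose passes them through
        - SV0_ENVIRONMENT
        - GRAFANA_LOKI_URL
        - GRAFANA_LOKI_USER
        - GRAFANA_LOKI_TOKEN
        - GRAFANA_PROM_URL
        - GRAFANA_PROM_USER
        - GRAFANA_PROM_TOKEN
      depends_on:
        api:
          condition: service_healthy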
Security note on Docker socket: Alloy needs Docker socket access to discover containers and collect their logs. This is a sensitive surface. Mitigations and caveats:
- Mounted :ro. This protects the socket file itself but does not restrict Docker API calls made over the socket, so the Alloy container should still be treated as having Docker daemon access
- Alloy runs as a non-root user inside its container
- The alternative (no log aggregation) means Sergey has zero visibility and logs are lost on restart, which is a worse operational risk for a 2-person distributed team
- If this remains uncomfortable, an alternative is to use Docker's json-file log driver with a shared volume that Alloy reads (no socket needed, but loses container metadata labels)
2.3 Alloy Configuration
Create deploy/alloy-config.alloy (~40 lines):
    // Discover Docker containers
    discovery.docker "containers" {
      host = "unix:///var/run/docker.sock"
    }
    // Relabel: extract service name and environment
    discovery.relabel "containers" {
      targets = discovery.docker.containers.targets
      rule {
        source_labels = ["__meta_docker_container_name"]
        target_label  = "container"
      }
      rule {
        source_labels = ["__meta_docker_container_label_com_docker_compose_service"]
        target_label  = "service"
      }
    }
    // Ship logs to Grafana Cloud Loki
    loki.source.docker "logs" {
      host       = "unix:///var/run/docker.sock"
      targets    = discovery.relabel.containers.output
      forward_to = [loki.write.grafana_cloud.receiver]
    }
    loki.write "grafana_cloud" {
      endpoint {
        url = env("GRAFANA_LOKI_URL")
        basic_auth {
          username = env("GRAFANA_LOKI_USER")
          password = env("GRAFANA_LOKI_TOKEN")
        }
      }
      external_labels = {
        environment = env("SV0_ENVIRONMENT"), // "production" or "dev"
        host        = env("HOSTNAME"),
      }
    }
    // Scrape Prometheus metrics from API
    prometheus.scrape "api" {
      targets         = [{ __address__ = "api:3000" }]
      metrics_path    = "/metrics"
      scrape_interval = "30s"
      forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
    }
    prometheus.remote_write "grafana_cloud" {
      endpoint {
        url = env("GRAFANA_PROM_URL")
        basic_auth {
          username = env("GRAFANA_PROM_USER")
          password = env("GRAFANA_PROM_TOKEN")
        }
      }
      external_labels = {
        environment = env("SV0_ENVIRONMENT"),
      }
    }
Environment variables added to the deploy script per instance:
SV0_ENVIRONMENT=production # or "dev"
GRAFANA_LOKI_URL=https://logs-prod-...grafana.net/loki/api/v1/push
GRAFANA_LOKI_USER=...
GRAFANA_LOKI_TOKEN=...
GRAFANA_PROM_URL=https://prometheus-prod-...grafana.net/api/prom/push
GRAFANA_PROM_USER=...
GRAFANA_PROM_TOKEN=...
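Alloy will start with empty URLs or credentials if any of these variables is unset, and the failure then only shows up as missing data in Grafana Cloud. A small pre-deploy check can catch that earlier. The sketch below is hypothetical (`REQUIRED_VARS` and `missingVars` are not existing code); the variable names are the ones listed above.

```typescript
// Hypothetical pre-deploy check for the variables the Alloy config reads.
const REQUIRED_VARS = [
  "SV0_ENVIRONMENT",
  "GRAFANA_LOKI_URL",
  "GRAFANA_LOKI_USER",
  "GRAFANA_LOKI_TOKEN",
  "GRAFANA_PROM_URL",
  "GRAFANA_PROM_USER",
  "GRAFANA_PROM_TOKEN",
] as const;

// Returns the names that are unset or blank in the given environment.
function missingVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}
```

A deploy script could call `missingVars(process.env)` and abort when the result is non-empty.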
2.4 What You Get in Grafana Cloud
Once Alloy is running in both environments:
Logs (Loki):
- {environment="production", service="api"}: search all prod API logs
- {environment="dev", service="api"} |= "error": find errors in dev
- {service="api"} | json | level="error": structured JSON parsing
- Side-by-side prod vs dev log streams
Metrics (Mimir/Prometheus):
- sv0_http_requests_total{environment="production"}: prod request count
- rate(sv0_http_requests_total{status_code=~"5.."}[5m]): error rate
- sv0_queue_depth{environment="dev"}: dev worker queue
- sv0_sync_age_minutes: connector freshness (once Phase 1.3 is wired)
Grafana Cloud includes pre-built explore views — no custom dashboards needed initially. Use Explore (Loki) for log search and Explore (Metrics) for metric queries. Custom dashboards can be added later when patterns emerge.
2.5 Grafana Alert Rules
Configure 3 high-value alerts in Grafana Cloud (free tier supports 500 rules):
| Alert | Query | Fires When | Tier |
|---|---|---|---|
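| High error rate | sum(rate(sv0_http_requests_total{status_code=~"5..", environment="production"}[5m])) / sum(rate(sv0_http_requests_total{environment="production"}[5m])) > 0.05 | >5% of prod requests are 5xx for 5 min (both sides are summed so the ratio is computed across all routes and status codes rather than per label set) | P2 |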
| Worker queue backing up | sv0_queue_depth{environment="production"} > 10 | Queue depth >10 for 5 min | P2 |
| Connector stale | sv0_sync_age_minutes > 1440 | Any connector hasn't synced in 24h | P2 |
Contact point: Slack #sv0-alerts incoming webhook.
Note: These are P2 alerts (business hours). P1 (prod down) is handled by the GitHub Actions external probe in Phase 3, because Grafana can't detect "host is dead" — it only sees "metrics stopped arriving," which has a delay.
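The high-error-rate rule is a ratio test: 5xx requests per second divided by all requests per second, compared against 5%. The sketch below shows that arithmetic in isolation. `errorRateExceeded` is an illustrative function, not part of Grafana or the codebase, and the zero-traffic guard is an assumption about the desired behavior (no traffic, no alert).

```typescript
// Returns true when the share of 5xx responses exceeds the threshold.
function errorRateExceeded(
  fiveXxPerSec: number,
  totalPerSec: number,
  threshold: number = 0.05,
): boolean {
  // With no traffic the ratio is undefined; treat that as "do not fire".
  if (totalPerSec <= 0) return false;
  return fiveXxPerSec / totalPerSec > threshold;
}
```

For example, 6 failing requests out of 100 per second fires, while exactly 5 out of 100 does not, because the comparison is strictly greater than.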
Phase 3: External Probing + Slack Notifications (2-3 hours)
Runs entirely on GitHub Actions. No containers.
3.1 Platform Health Probe (GitHub Actions Cron)
New workflow: .github/workflows/platform-health.yml
    name: platform-health
    on:
      schedule:
        - cron: "*/5 * * * *" # Every 5 minutes
      workflow_dispatch: {}
    jobs:
      probe:
        runs-on: ubuntu-latest
        steps:
          - name: Check production
            id: prod
            run: |
              # No -f here: with -f, an HTTP error prints its status code and a
              # "|| echo 000" fallback would append a second value to it.
              # curl already prints 000 when no HTTP response is received.
              HTTP_CODE=$(curl -s -o /tmp/prod-health.json -w "%{http_code}" \
                -H "CF-Access-Client-Id: ${{ secrets.CF_ACCESS_CLIENT_ID_DEPLOY }}" \
                -H "CF-Access-Client-Secret: ${{ secrets.CF_ACCESS_CLIENT_SECRET_DEPLOY }}" \
                "https://app.securityv0.com/ready" || true)
              echo "status=$HTTP_CODE" >> "$GITHUB_OUTPUT"
              if [ "$HTTP_CODE" = "200" ]; then
                echo "✅ Production: healthy"
              else
                echo "❌ Production: HTTP $HTTP_CODE"
              fi
          - name: Check dev
            id: dev
            run: |
              HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
                -H "CF-Access-Client-Id: ${{ secrets.CF_ACCESS_CLIENT_ID_DEPLOY }}" \
                -H "CF-Access-Client-Secret: ${{ secrets.CF_ACCESS_CLIENT_SECRET_DEPLOY }}" \
                "https://dev.securityv0.com/ready" || true)
              echo "status=$HTTP_CODE" >> "$GITHUB_OUTPUT"
          - name: Smoke test (verify data access)
            id: smoke
            if: steps.prod.outputs.status == '200'
            run: |
              BODY=$(curl -sf \
                -H "CF-Access-Client-Id: ${{ secrets.CF_ACCESS_CLIENT_ID_DEPLOY }}" \
                -H "CF-Access-Client-Secret: ${{ secrets.CF_ACCESS_CLIENT_SECRET_DEPLOY }}" \
                -H "X-Tenant-Id: demo-w1" \
                "https://app.securityv0.com/api/v1/findings?limit=1")
              COUNT=$(echo "$BODY" | python3 -c "import json,sys; print(json.load(sys.stdin)['meta']['total_count'])")
              echo "findings=$COUNT" >> "$GITHUB_OUTPUT"
              [ "$COUNT" -gt 0 ] && echo "✅ Data: $COUNT findings" || echo "⚠️ Data: 0 findings"
          - name: Alert Slack (prod down)
            if: steps.prod.outputs.status != '200'
            uses: slackapi/slack-github-action@v2
            with:
              webhook: ${{ secrets.SLACK_WEBHOOK_ALERTS }}
              webhook-type: incoming-webhook
              payload: |
                {
                  "text": "🚨 P1: Production health check FAILING (HTTP ${{ steps.prod.outputs.status }})",
                  "blocks": [{
                    "type": "section",
                    "text": {
                      "type": "mrkdwn",
                      "text": "🚨 *P1: Production Down*\n`app.securityv0.com/ready` → HTTP ${{ steps.prod.outputs.status }}\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View probe>"
                    }
                  }]
                }
Key decisions:
- Probes /ready (checks MongoDB + worker), not /health (liveness only)
- Smoke test reads actual data, which catches "API up but database empty" failures
- Runs on GitHub-hosted runners, external to the Mac mini, so it detects host/tunnel death
- Only alerts on failure (no alert fatigue)
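The decision logic of the probe is small enough to state directly: only an HTTP 200 from /ready counts as healthy, and a request that never produces a response is treated like curl's 000. The sketch below restates that in TypeScript with an injected fetch function so it can be exercised without network access. `probe`, `FetchStatus`, and `ProbeResult` are hypothetical names; the workflow itself uses curl.

```typescript
// Hypothetical restatement of the probe step's decision logic.
type FetchStatus = (url: string) => Promise<number>;

interface ProbeResult {
  url: string;
  status: number; // 0 when the request never produced an HTTP response
  healthy: boolean; // true only for HTTP 200
}

async function probe(url: string, fetchStatus: FetchStatus): Promise<ProbeResult> {
  let status = 0;
  try {
    status = await fetchStatus(url);
  } catch {
    // DNS failure, refused connection, or timeout: no HTTP status available.
    status = 0;
  }
  return { url, status, healthy: status === 200 };
}
```

Anything other than `healthy: true` for production maps to the P1 Slack alert step.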
3.2 Slack Notifications on Existing Workflows
Add failure notification to deploy-prod.yml, deploy-dev.yml, ci.yml, token-health.yml:
    - name: Notify Slack on failure
      if: failure()
      uses: slackapi/slack-github-action@v2
      with:
        webhook: ${{ secrets.SLACK_WEBHOOK_ALERTS }}
        webhook-type: incoming-webhook
        payload: |
          {
            "text": "❌ ${{ github.workflow }} failed on ${{ github.ref_name }}",
            "blocks": [{
              "type": "section",
              "text": {
                "type": "mrkdwn",
                "text": "❌ *${{ github.workflow }}* failed on `${{ github.ref_name }}`\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View run>"
              }
            }]
          }
Add success notification to deploy-prod.yml only (P3):
    - name: Notify Slack (deploy success)
      if: success()
      uses: slackapi/slack-github-action@v2
      with:
        webhook: ${{ secrets.SLACK_WEBHOOK_DEPLOYS }}
        webhook-type: incoming-webhook
        payload: |
          { "text": "✅ Production deployed: ${{ inputs.image_tag }}" }
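The failure payload has two parts: a plain `text` fallback and one `section` block with mrkdwn text that links back to the run. If the same message ever needs to be produced outside a workflow (for example from a script), it could be built as below. `failurePayload` and `SlackPayload` are hypothetical names; the shape mirrors the YAML payload above.

```typescript
// Hypothetical builder mirroring the failure payload used in the workflows.
interface SlackPayload {
  text: string; // plain-text fallback shown in notifications
  blocks: Array<{ type: "section"; text: { type: "mrkdwn"; text: string } }>;
}

function failurePayload(workflow: string, ref: string, runUrl: string): SlackPayload {
  // Slack link syntax in mrkdwn is <url|label>.
  const body = `❌ *${workflow}* failed on \`${ref}\`\n<${runUrl}|View run>`;
  return {
    text: `❌ ${workflow} failed on ${ref}`,
    blocks: [{ type: "section", text: { type: "mrkdwn", text: body } }],
  };
}
```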
3.3 Slack Setup
| Secret | Channel | Purpose |
|---|---|---|
| SLACK_WEBHOOK_ALERTS | #sv0-alerts | P1/P2: prod down, errors, credential expiry, workflow failures |
| SLACK_WEBHOOK_DEPLOYS | #sv0-deploys | P3: deploy success notifications |
What We Explicitly Defer
| Item | Why | Revisit When |
|---|---|---|
| Custom Grafana dashboards | Explore view is sufficient for now. Dashboards earn their place when we know what to monitor. | After 2+ weeks of Grafana Cloud usage — patterns will emerge |
| Uptime Kuma | Internal prober can't detect host death. GitHub Actions probe is external and free. | Never for this architecture |
| mongodb_exporter | Single MongoDB instance, mongosh ping catches failures. Detailed metrics (connections, op counters) matter under load. | When MongoDB moves to Atlas or gets a replica set |
| Connector health API | No live customer connectors. Wire sv0_sync_age_minutes gauge instead. | 3+ connectors on a schedule with customer data |
| Sentry | uncaughtException handlers + structured logs + Grafana Loki cover error visibility. | After an incident where log search was insufficient |
| Claude agent /ops-health skill | Both developers can access Grafana Cloud directly. | When the team grows or ops checks become routine enough to automate |
Exit Criteria for Deferred Dashboards
Build custom Grafana dashboards when any of:
- The same Explore query is run 3+ times in a week
- A design partner asks "what's the uptime?"
- Debugging an incident takes >30 min because the right metric wasn't visible
Security Considerations
| Risk | Mitigation |
|---|---|
| Alloy Docker socket access | Mounted :ro, which protects the socket file but does not restrict Docker API calls made over it; treat the Alloy container as having Docker daemon access. Alternative: json-file log driver with shared volume (loses container labels). |
| Grafana Cloud credentials on Mac mini | Stored as env vars in deploy config, same security model as existing MONGODB_URI. Future: rotate via 1Password CLI. |
| CF Access tokens in GitHub Secrets | Ephemeral runners only. Never persisted beyond workflow execution. |
| Slack webhook URLs | GitHub Secrets. Worst case: attacker can post to Slack, not access the platform. |
| Grafana Cloud data exposure | Logs may contain request paths, tenant IDs, error messages. Grafana Cloud is SOC2 compliant. No PII in logs (verified: logger doesn't log request bodies). |
Architecture
    ┌──────────────────────────────────────────────┐
    │ Mac Mini: Docker Compose (per environment)   │
    │                                              │
    │  ┌────────┐  ┌──────┐  ┌───────┐             │
    │  │  API   │  │  UI  │  │ Mongo │             │
    │  │ :3000  │  │:8080 │  │:27017 │             │
    │  │/ready  │  │      │  │       │             │
    │  │/metrics│  │      │  │       │             │
    │  └───┬────┘  └──────┘  └───────┘             │
    │      │                                       │
    │  ┌───┴──────────────┐                        │
    │  │ Grafana Alloy    │ ← 1 new container      │
    │  │ - ship logs      │                        │
    │  │ - scrape metrics │                        │
    │  └────────┬─────────┘                        │
    └───────────┼──────────────────────────────────┘
                │
                ▼
        ┌────────────────┐          ┌──────────────┐
        │ Grafana Cloud  │──alerts─▶│ Slack        │
        │ (Free Tier)    │          │ #sv0-alerts  │
        │                │          │ #sv0-deploys │
        │ Loki (logs)    │          └──────▲───────┘
        │ Mimir (metrics)│                 │
        │ Alerting rules │          ┌──────┴─────────┐
        └────────────────┘          │ GitHub Actions │
                ▲                   │ - health probe │
                │                   │ - deploy notif │
          Ivan + Sergey             │ - token health │
         (shared access)            └────────────────┘
Summary
| Phase | Focus | Where | New Containers | Effort |
|---|---|---|---|---|
| 1 | Fix code: error handlers, structured logging, sync metric | sv0-platform | 0 | 2-3 hours |
| 2 | Grafana Cloud + Alloy for shared log/metric visibility | Mac mini + SaaS | 1 per env | 3-4 hours |
| 3 | External probing + Slack notifications | GitHub Actions | 0 | 2-3 hours |
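Total: 7-10 hours (the sum of the three phase estimates). 1 new container per environment. $0 cost, provided the health probe's GitHub Actions minutes fit within the plan's included allowance (see Decisions Needed).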
Decisions Needed
| Decision | Owner | Notes |
|---|---|---|
| Grafana Cloud account setup | Ivan | 15 min — sign up, create API key, invite Sergey |
| Slack channels + webhooks | Ivan | 5 min — #sv0-alerts, #sv0-deploys, 2 incoming webhooks |
| Health probe frequency | Ivan | 5-min cron proposed. ~8,640 GitHub Actions min/month (12 runs/hr × 24h × 30d × ~1 min/run) — within Enterprise limits, but exceeds Free tier (2,000 min/month). |
| Docker socket comfort level | Ivan | If uncomfortable: use json-file log driver + shared volume instead |
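The probe-frequency estimate in the table is runs per day times days times billed minutes per run. The sketch below reproduces it and shows how the total changes with the interval. `monthlyMinutes` is an illustrative helper, and the one-minute-per-run figure is the same assumption the table makes.

```typescript
// Billed minutes per month for a cron probe, assuming each run is billed as
// minutesPerRun whole minutes and a 30-day month.
function monthlyMinutes(
  intervalMinutes: number,
  minutesPerRun: number = 1,
  days: number = 30,
): number {
  const runsPerDay = (24 * 60) / intervalMinutes;
  return runsPerDay * days * minutesPerRun;
}

const everyFive = monthlyMinutes(5); // 288 runs/day * 30 days -> 8640
const everyFifteen = monthlyMinutes(15); // 96 runs/day * 30 days -> 2880
const everyThirty = monthlyMinutes(30); // 48 runs/day * 30 days -> 1440
```

Under these assumptions only the 30-minute interval stays below 2,000 minutes/month, at the cost of slower P1 detection. The comparison is included to inform the decision, not to recommend an interval.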
Next Action
Status: research-complete
Decision needed from: Ivan (Grafana Cloud account, Slack setup)
Sequencing:
- Phase 1 can start immediately — no decisions needed, purely code fixes
- Phase 2 requires Grafana Cloud account (15 min) + deploy config update
- Phase 3 requires Slack webhooks (5 min) + new GitHub Actions workflow
GitHub Issue: To be created after plan approval