ADR-018: Deploy-Server Security Posture Before Managed-Platform Migration
Status
Accepted — 2026-04-20
Supersedes the sudoers-allowlist mitigation shipped in sv0-platform PR #401 (2026-04-16). Reverts the server-side change and the script/workflow sudo prefixes that #401 added; keeps the sudo-elimination work from sv0-platform commit 1b9f425 intact.
Context
SecurityV0 currently deploys to two Hetzner VPS hosts (dev.securityv0.com, app.securityv0.com) over SSH, using a dedicated deploy OS user with key-based auth. Images come from GHCR (ADR-013). Docker Compose orchestrates three containers per instance (api, ui, mongo). There are no customer tenants in production — current traffic is the founding team plus invited pilot viewers behind Cloudflare Access.
What audit #392 identified
The 2026-04-12 architecture review flagged that the deploy user was a member of the docker group on both VMs. Docker-group membership is root-equivalent: any member can run
docker run --rm -v /:/host alpine chroot /host bash
and obtain a root shell on the host. The exposed attack path was:
- Attacker obtains DEPLOY_SSH_KEY (GitHub Actions secret, reachable by any CI workflow and by repo admins).
- Attacker SSHes to the deploy VM as deploy.
- Attacker escapes the container boundary via the Docker socket and owns the host.
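Whether a host is exposed to this path can be checked from the deploy account's group list. The sketch below is a minimal POSIX sh illustration; the helper name has_docker_group is hypothetical and not part of the sv0-platform scripts.

```shell
# Hypothetical helper: succeeds if a space-separated group list
# (as printed by `id -nG <user>`) contains the docker group.
has_docker_group() {
    case " $1 " in
        *" docker "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a deploy VM (not executed here):
#   if has_docker_group "$(id -nG deploy)"; then
#       echo "deploy is root-equivalent via the Docker socket"
#   fi
```

Padding the list with spaces on both sides keeps the match exact, so a group such as dockerd does not count as docker.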
What PR #401 tried to do about it
PR #401 (sv0-platform#401, merged 2026-04-16) removed deploy from the docker group and installed a narrow sudoers allowlist at /etc/sudoers.d/sv0-deploy permitting only the specific docker compose, docker exec, docker image prune, docker builder prune, docker login, and pinned docker run … command shapes that the deploy workflow invokes. All deploy scripts and both deploy workflows were updated to prefix those calls with sudo.
The security rationale was sound: it narrowed the blast radius of a leaked DEPLOY_SSH_KEY from "full host compromise" to "attacker can run exactly the allowlisted compose/exec/prune/login shapes against the allowlisted compose file paths."
Why we are reversing it
In the four days since #401 merged, the allowlist has created a recurring operational failure pattern that, in its own way, reintroduces the privileged access it was meant to narrow:
| Date | Event |
|---|---|
| 2026-04-18 | sv0-platform#412 ships \|\| true on docker image prune calls because "an out-of-date sudoers allowlist doesn't block deploys" (explicit admission in the PR body) |
| 2026-04-18 | sv0-platform d9c2bd5 — "align prod image prune with sudoers allowlist (-af)" — second drift fix |
| 2026-04-20 | sv0-platform#455 — deploy-dev fails 5+ times with sudo: a password is required at the docker login step; dev stuck on pre-merge SHA |
| 2026-04-20 18:42 → 18:52 UTC | Dev deploy fails; 10 minutes later next deploy succeeds — someone with real root SSHed in and hand-edited /etc/sudoers.d/sv0-deploy |
The root-cause design smells:
- Two-tier source of truth. deploy/scripts/*.sh and .github/workflows/deploy-*.yml live in git and ship on every merge. /etc/sudoers.d/sv0-deploy is edited by hand. Every shape change to a sudo invocation silently drifts the allowlist out of sync.
- No CI signal for drift. The first warning is a broken deploy with an ambiguous sudo: a password is required message.
- Only recoverable by true root. The deploy user can't edit its own allowlist. Every drift incident is fixed by a human with standing root SSH access editing the file, which is the exact privileged-access pattern #401 was meant to reduce, just concentrated in a smaller set of people.
- Brittle pattern surface. The allowlist enumerates every compose invocation form (prod bare -f, dev short -p sv0-*, dev full-path /home/deploy/instances/*/…) plus pinned docker run argument orderings. Reordering a flag or moving a directory breaks the exact match.
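To make the first two points concrete, the sketch below shows roughly what a CI drift signal could look like. It is illustrative only: find_allowlist_drift is a hypothetical helper, and it compares just the first two words of each sudo docker call, which is far coarser than sudoers' own argument matching.

```shell
# Hypothetical drift check: print every "docker <subcommand>" that a script
# invokes through sudo but that appears nowhere in the allowlist file.
#   $1 = deploy script path
#   $2 = sudoers allowlist path
# Assumption: only the first two words of each command are compared.
find_allowlist_drift() {
    grep -o 'sudo docker [a-z]*' "$1" | sed 's/^sudo //' | sort -u |
    while read -r shape; do
        # Report the shape when no allowlist line mentions it.
        grep -q "$shape" "$2" || echo "$shape"
    done
}

# Usage in CI (not executed here):
#   drift=$(find_allowlist_drift deploy/scripts/deploy-instance.sh sv0-deploy.sudoers)
#   [ -z "$drift" ] || { echo "allowlist drift: $drift"; exit 1; }
```

A check of this kind only works if the allowlist itself lives in git, which is exactly what the hand-edited /etc/sudoers.d/sv0-deploy did not do.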
The question #462 posed
sv0-platform#462 proposed three ways forward: rootless Docker (eliminate sudo end-to-end), git-managed allowlist with drift detection, or a preflight guard. All three are defensible engineering investments.
None of them are the right investment right now, for one reason: we are pre-client, and the deploy target is likely to change.
The strategic frame
SecurityV0 is not going to stay on hand-managed Hetzner VPS hosts. The platform roadmap points at a managed container runtime (AWS ECS or Fargate, Azure Container Apps or AKS, or equivalent) within the next 3–6 months, driven by a combination of:
- Enterprise-client expectations (IAM-based access, regional tenancy, SOC 2-mappable audit trails)
- Operational scale (running per-tenant instances without hand-managing Caddy site files and instance.conf ports on a 75 GB VPS)
- Security posture upgrades that come for free with managed platforms (no shared deploy user, no SSH keys, no host-level Docker socket, identity-federated pipelines)
Any sudoers-allowlist hardening we do on Hetzner is throwaway work once that migration happens. Rootless Docker on the current VPS is 2–3 days of engineering plus a maintenance window on prod, and it would have to be validated against MongoDB WiredTiger behavior under rootless uid mapping. All of that gets deleted when we cut over to a managed platform.
Against that, the actual expected loss from the current docker-group risk is:
- Probability of DEPLOY_SSH_KEY leak: low but non-zero (it has the attack surface of any GHA secret).
- Probability of exploit before detection: further reduced by the secret being useless without knowing the deploy-host address and having network reachability.
- Value at stake: no customer data. Pilot and demo tenants only. Mongo backups are bind-mounted and taken pre-deploy; full rebuild from GHCR images is a ~5 minute operation.
- Compensating controls: Cloudflare Access gates all HTTPS ingress at the edge; SSH is key-only; DEPLOY_SSH_KEY is rotatable.
The cost/benefit supports accepting the risk explicitly rather than paying the operational tax until migration.
Decision
Accept audit #392's finding as a known risk through the managed-platform migration.
Concretely:
- Revert the server-side change from PR #401. Restore deploy as a member of the docker group on both VMs. Remove /etc/sudoers.d/sv0-deploy.
- Revert the sudoers-allowlist edits in the repo. Strip sudo prefixes from deploy/scripts/{deploy,teardown,cleanup}-instance.sh and both .github/workflows/deploy-{dev,prod}.yml files. Rewrite the server-setup section of docs/deploy/deployment.md to describe docker-group membership and link to this ADR for the rationale.
- Keep the sudo-elimination pattern from sv0-platform commit 1b9f425. Caddy admin API at localhost:2019 replaces sudo caddy reload. A docker run --rm alpine rm -rf sidecar replaces sudo rm -rf on MongoDB-owned data directories. Both patterns work cleanly under docker-group membership and survive migration.
- Close sv0-platform#462 with a link back to this ADR.
- Close sv0-platform#455 as resolved by revert.
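The revert and the two retained patterns can be sketched as the shell helpers below. All function names, the /work mount point, and the Caddyfile path argument are assumptions for illustration; the actual scripts from commit 1b9f425 may differ. The Caddy call assumes the admin API's POST /load endpoint with a Caddyfile request body.

```shell
# Server-side revert, run once per VM by an operator with root.
revert_server_side() {
    # Restore docker-group membership for the deploy user.
    usermod -aG docker deploy || return 1
    # Remove the hand-edited allowlist.
    rm -f /etc/sudoers.d/sv0-deploy
}

# Print a script with "sudo " stripped from its docker calls.
#   $1 = script path
strip_sudo_prefix() {
    sed 's/sudo docker /docker /g' "$1"
}

# Reload Caddy through its admin API instead of `sudo caddy reload`.
#   $1 = Caddyfile path (hypothetical location)
reload_caddy() {
    curl -fsS -X POST "http://localhost:2019/load" \
        -H "Content-Type: text/caddyfile" \
        --data-binary @"$1"
}

# Delete a directory owned by the MongoDB container's uid by running
# rm inside a throwaway container instead of `sudo rm -rf`.
#   $1 = absolute path of the parent directory
#   $2 = name of the child directory to delete
remove_owned_dir() {
    docker run --rm -v "$1":/work alpine rm -rf "/work/$2"
}
```

Neither retained helper needs sudo: the first talks to a localhost HTTP endpoint, and the second relies only on access to the Docker socket, which docker-group membership restores.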
Consequences
Positive
- Deploys stop failing on allowlist drift. Immediate operational recovery — no more 10-minute outages while someone with root access hand-edits sudoers.
- No more manual root SSH sessions to repair the deploy path. Eliminates the exact anti-pattern #401 was meant to reduce. Human standing root access becomes genuinely exceptional.
- Clean slate for the managed-platform migration. No custom sudoers glob patterns to unwind. No rootless-Docker uid-mapping migration that gets deleted on cutover.
- The 1b9f425 reductions survive. PR-instance scripts keep running with zero privilege escalation via Caddy admin API + docker sidecar. That pattern is still a net improvement and carries forward.
- The decision is recorded. This ADR, not a silent revert, is what a future auditor or acquirer will see.
Negative
- Audit finding #392 reopens as an explicitly accepted risk until migration. We must be able to defend this tradeoff in writing (this ADR is that defense) if asked by a pilot-customer security review.
- A compromised DEPLOY_SSH_KEY gives root on the deploy VM. Not theoretical: this is the pre-#401 threat model we are voluntarily re-entering. Depends on compensating controls (key rotation on any suspicion of leak, Cloudflare Access at the edge, no production customer data).
- We lose the reflexive claim of "least-privilege deploy user" in security questionnaires. Any such questionnaire answer must now reference this ADR and the migration timeline.
Compensating controls retained
- Cloudflare Access gates all HTTPS ingress to both VMs. An attacker with DEPLOY_SSH_KEY would still need independent SSH network reachability (no direct public SSH; firewall rules).
- MongoDB auth + bind-mounted backups. Pre-deploy mongodump archives to data/backups on every prod deploy (deploy-prod.yml:35). BACKUP_RETAIN_DAYS=14 on prod.
- GHCR-only image source. Deploy scripts only pull from ghcr.io/securityv0/sv0-platform/…. Images are tagged immutably by commit SHA on prod (no :latest).
- Rotatable credential. DEPLOY_SSH_KEY lives in GitHub Actions secrets and can be rotated in one PR + one sudo-less authorized_keys update on each VM.
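The authorized_keys half of a rotation could look like the sketch below. The helper name swap_authorized_key, the ed25519 key type, and the use of the GitHub CLI in the usage notes are assumptions, not the team's documented procedure.

```shell
# Hypothetical helper: print an authorized_keys file with the old public
# key line dropped and the new public key line appended.
#   $1 = authorized_keys path
#   $2 = old public key line (exact match)
#   $3 = new public key line
swap_authorized_key() {
    [ -r "$1" ] || return 1
    # grep exits non-zero when every line was removed; that is not an error here.
    grep -vxF -- "$2" "$1" || true
    printf '%s\n' "$3"
}

# Usage (not executed here):
#   ssh-keygen -t ed25519 -N "" -C "sv0-deploy" -f ./deploy_key
#   gh secret set DEPLOY_SSH_KEY < ./deploy_key
#   swap_authorized_key ~/.ssh/authorized_keys "$OLD_PUB" "$(cat ./deploy_key.pub)" \
#       > authorized_keys.new && mv authorized_keys.new ~/.ssh/authorized_keys
```

Writing to a new file and moving it into place avoids truncating authorized_keys while it is still being read.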
Required mitigations while this ADR is active
| Control | Owner | Cadence |
|---|---|---|
| DEPLOY_SSH_KEY rotation on any suspected leak (repo access change, CI workflow compromise, contractor offboarding) | Ivan | Event-driven |
| MongoDB backup retention check on prod (ls -lh ~/sv0-platform/data/backups) | Deploy owner | Before every prod release |
| Host patching (apt upgrade) on both VMs | Ivan | Monthly |
| No production customer tenants until migrated or until this ADR is superseded | Product / Ivan | Ongoing |
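The backup retention check in the table is a manual ls. A scripted variant might look like the sketch below; has_recent_backup is a hypothetical helper, and treating any regular file in the directory as a backup is an assumption about its contents.

```shell
# Hypothetical helper: succeeds if directory $1 contains at least one
# regular file modified within the last $2 days.
has_recent_backup() {
    find "$1" -type f -mtime "-$2" | grep -q .
}

# Usage before a prod release (not executed here):
#   has_recent_backup ~/sv0-platform/data/backups 1 ||
#       { echo "no backup from the last day; stop the release"; exit 1; }
```

The check fails closed: a missing or empty directory produces no find output, so grep reports failure.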
When to reconsider
This ADR is time-limited by design. Revisit when any of the following happens:
- First real customer tenant is ready to onboard to the managed environment: the migration must be complete (or explicitly scoped to happen before tenant data lands) before this ADR can remain active.
- Managed-platform decision is made (ECS, ACI, AKS, EKS, Fargate, or equivalent). The migration plan supersedes this ADR and writes its own security posture doc.
- Credible leak signal on DEPLOY_SSH_KEY: rotate immediately; if rotation isn't fast enough, reinstate the allowlist (Option B from #462) as a bridge control.
- Regulatory scope changes: if we need SOC 2 / ISO 27001 attestation before the migration lands, re-open this and ship Option B with a canonical git-managed allowlist + CI drift check.
- Deploy frequency increases 5×: the operational-cost side of the tradeoff weakens if we are deploying many times per day to many VMs; at that point the allowlist-drift tax grows relative to the migration horizon.
References
- sv0-platform#401 — original sudoers-allowlist PR (the edits this ADR reverses)
- sv0-platform commit 1b9f425 — sudo-elimination via Caddy admin API + docker sidecar (kept)
- sv0-platform#412 — prune-on-failure + disk pre-check (allowlist-drift tolerance)
- sv0-platform commit d9c2bd5 — allowlist alignment for prod prune
- sv0-platform#455 — 2026-04-20 dev-deploy outage (resolved by this revert)
- sv0-platform#462 — proposal issue that surfaced this decision
- sv0-documentation#174 — 2026-04-12 architecture review (source of audit finding #392)
- sv0-platform#392 — original audit finding (reopened as accepted risk by this ADR)