ADR-018: Deploy-Server Security Posture Before Managed-Platform Migration

Status

Accepted — 2026-04-20

Supersedes the sudoers-allowlist mitigation shipped in sv0-platform PR #401 (2026-04-16). Reverts the server-side change and the script/workflow sudo prefixes that #401 added; keeps the sudo-elimination work from sv0-platform commit 1b9f425 intact.

Context

SecurityV0 currently deploys to two Hetzner VPS hosts (dev.securityv0.com, app.securityv0.com) over SSH, using a dedicated deploy OS user with key-based auth. Images come from GHCR (ADR-013). Docker Compose orchestrates three containers per instance (api, ui, mongo). There are no customer tenants in production — current traffic is the founding team plus invited pilot viewers behind Cloudflare Access.

What audit #392 identified

The 2026-04-12 architecture review flagged that the deploy user was a member of the docker group on both VMs. Docker-group membership is root-equivalent: any member can run

docker run --rm -it -v /:/host alpine chroot /host bash

and obtain a root shell on the host. The exposed attack path was:

Attacker obtains DEPLOY_SSH_KEY (GitHub Actions secret, reachable by any CI workflow and by repo admins).
Attacker SSHes to the deploy VM as deploy.
Attacker escapes the container boundary via the Docker socket and owns the host.
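The precondition for the attack path above is docker-group membership, which can be checked directly on each VM. The following is a minimal sketch, not part of the deploy tooling; the function names are hypothetical, and it relies only on the standard id utility.

```shell
#!/bin/sh
# Hypothetical helpers for auditing docker-group membership on a host.

# Return 0 if USER is a member of GROUP, 1 otherwise.
in_group() {
    user=$1
    group=$2
    # `id -nG` prints the user's group names separated by spaces.
    for g in $(id -nG "$user" 2>/dev/null); do
        [ "$g" = "$group" ] && return 0
    done
    return 1
}

# Report whether an account is root-equivalent through the docker group.
report_docker_access() {
    if in_group "$1" docker; then
        echo "$1: docker group member (root-equivalent)"
    else
        echo "$1: not in docker group"
    fi
}
```

Running report_docker_access deploy on a VM states which side of the audit #392 finding that host is on.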

What PR #401 tried to do about it

PR #401 (sv0-platform#401, merged 2026-04-16) removed deploy from the docker group and installed a narrow sudoers allowlist at /etc/sudoers.d/sv0-deploy permitting only the specific docker compose, docker exec, docker image prune, docker builder prune, docker login, and pinned docker run … command shapes that the deploy workflow invokes. All deploy scripts and both deploy workflows were updated to prefix those calls with sudo.

The security rationale was sound: it narrowed the blast radius of a leaked DEPLOY_SSH_KEY from "full host compromise" to "attacker can run exactly the allowlisted compose/exec/prune/login shapes against the allowlisted compose file paths."

Why we are reversing it

In the four days since #401 merged, the allowlist has created a recurring operational failure pattern that, in its own way, reintroduces the privileged access it was meant to narrow:

Date	Event
2026-04-18	sv0-platform#412 ships `|| true` on `docker image prune` calls because "an out-of-date sudoers allowlist doesn't block deploys" (an explicit admission in the PR body)
2026-04-18	sv0-platform d9c2bd5 — "align prod image prune with sudoers allowlist (-af)" — second drift fix
2026-04-20	sv0-platform#455 — `deploy-dev` fails 5+ times with `sudo: a password is required` at the `docker login` step; dev stuck on pre-merge SHA
2026-04-20 18:42 → 18:52 UTC	Dev deploy fails; 10 minutes later next deploy succeeds — someone with real root SSHed in and hand-edited `/etc/sudoers.d/sv0-deploy`

The root-cause design smells:

Two-tier source of truth. deploy/scripts/*.sh and .github/workflows/deploy-*.yml live in git and ship on every merge. /etc/sudoers.d/sv0-deploy is edited by hand. Every shape change to a sudo invocation silently drifts the allowlist out of sync.
No CI signal for drift. The first warning is a broken deploy with an ambiguous sudo: a password is required message.
Only recoverable by true root. The deploy user can't edit its own allowlist. Every drift incident is fixed by a human with standing root SSH access editing the file — which is the exact privileged-access pattern #401 was meant to reduce, just concentrated in a smaller set of people.
Brittle pattern surface. The allowlist enumerates every compose invocation form (prod bare -f, dev short -p sv0-*, dev full-path /home/deploy/instances/*/…) plus pinned docker run argument orderings. Reordering a flag or moving a directory breaks the exact-match.

The question #462 posed

sv0-platform#462 proposed three ways forward: rootless Docker (eliminate sudo end-to-end), git-managed allowlist with drift detection, or a preflight guard. All three are defensible engineering investments.

None of them is the right investment right now, for one reason: we are pre-client, and the deploy target is likely to change.

The strategic frame

SecurityV0 is not going to stay on hand-managed Hetzner VPS hosts. The platform roadmap points at a managed container runtime (AWS ECS or Fargate, Azure Container Apps or AKS, or equivalent) within the next 3–6 months, driven by a combination of:

Enterprise-client expectations (IAM-based access, regional tenancy, SOC 2-mappable audit trails)
Operational scale (running per-tenant instances without hand-managing Caddy site files and instance.conf ports on a 75 GB VPS)
Security posture upgrades that come for free with managed platforms (no shared deploy user, no SSH keys, no host-level Docker socket, identity-federated pipelines)

Any sudoers-allowlist hardening we do on Hetzner is throwaway work against that migration. Rootless Docker on the current VPS is 2–3 days of engineering plus a maintenance window on prod, validated against MongoDB WiredTiger behavior under rootless uid mapping, and all of it gets deleted when we cut over to a managed platform.

Against that, the actual expected loss from the current docker-group risk is:

Probability of DEPLOY_SSH_KEY leak: low but non-zero (it has the attack surface of any GHA secret).
Probability of exploit before detection: further reduced by the secret being useless without knowing the deploy-host address and having network reachability.
Value at stake: no customer data. Pilot and demo tenants only. Mongo backups are bind-mounted and taken pre-deploy; full rebuild from GHCR images is a ~5 minute operation.
Compensating controls: Cloudflare Access gates all HTTPS ingress at the edge; SSH is key-only; DEPLOY_SSH_KEY is rotatable.

The cost/benefit supports accepting the risk explicitly rather than paying the operational tax until migration.

Decision

Accept audit #392's finding as a known risk through the managed-platform migration.

Concretely:

Revert the server-side change from PR #401. Restore deploy as a member of the docker group on both VMs. Remove /etc/sudoers.d/sv0-deploy.
Revert the sudoers-allowlist edits in the repo. Strip sudo prefixes from deploy/scripts/{deploy,teardown,cleanup}-instance.sh and both .github/workflows/deploy-{dev,prod}.yml files. Rewrite the server-setup section of docs/deploy/deployment.md to describe docker-group membership and link to this ADR for the rationale.
Keep the sudo-elimination pattern from sv0-platform commit 1b9f425. Caddy admin API at localhost:2019 replaces sudo caddy reload. A docker run --rm alpine rm -rf sidecar replaces sudo rm -rf on MongoDB-owned data directories. Both patterns work cleanly under docker-group membership and survive migration.
Close sv0-platform#462 with a link back to this ADR.
Close sv0-platform#455 as resolved by revert.
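The host revert and the two retained sudo-free patterns can be sketched in shell. This is an illustration under assumptions, not the contents of commit 1b9f425: the function names, the Caddyfile path argument, and the way the sidecar mounts the parent directory are all hypothetical. The Caddy call uses the admin API's documented /load endpoint with the text/caddyfile content type.

```shell
#!/bin/sh
# Hypothetical sketch of the Decision steps; names and paths are assumptions.

# One-time host revert, run as root on each VM.
revert_host() {
    usermod -aG docker deploy        # restore docker-group membership
    rm -f /etc/sudoers.d/sv0-deploy  # remove the hand-edited allowlist
}

# Reload Caddy through its admin API instead of `sudo caddy reload`.
# $1 is the path to the Caddyfile to load.
reload_caddy() {
    curl -fsS -X POST "http://localhost:2019/load" \
        -H "Content-Type: text/caddyfile" \
        --data-binary @"$1"
}

# Remove a directory owned by a container uid without sudo: mount its
# parent into a throwaway alpine container and delete it from inside.
# $1 must be an absolute path, because docker -v needs one.
remove_owned_dir() {
    target=${1%/}
    case $target in
        ''|/) echo "refusing to remove '$1'" >&2; return 1 ;;
    esac
    parent=$(dirname "$target")
    name=$(basename "$target")
    docker run --rm -v "$parent:/work" alpine rm -rf "/work/$name"
}
```

Mounting the parent rather than the target itself is a design choice in this sketch: a bind-mounted directory cannot be removed from inside the container that mounts it, but its children can.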

Consequences

Positive

Deploys stop failing on allowlist drift. Immediate operational recovery — no more 10-minute outages while someone with root access hand-edits sudoers.
No more manual root SSH sessions to repair the deploy path. Eliminates the exact anti-pattern #401 was meant to reduce. Human standing root access becomes genuinely exceptional.
Clean slate for the managed-platform migration. No custom sudoers glob patterns to unwind. No rootless-Docker uid-mapping migration that gets deleted on cutover.
The 1b9f425 reductions survive. PR-instance scripts keep running with zero privilege escalation via Caddy admin API + docker sidecar. That pattern is still a net improvement and carries forward.
The decision is recorded. This ADR, not a silent revert, is what a future auditor or acquirer will see.

Negative

Audit finding #392 reopens as an explicitly accepted risk until migration. We must be able to defend this tradeoff in writing (this ADR is that defense) if asked by a pilot-customer security review.
A compromised DEPLOY_SSH_KEY gives root on the deploy VM. Not theoretical — this is the pre-#401 threat model we are voluntarily re-entering. Depends on compensating controls (key rotation on any suspicion of leak, Cloudflare Access at the edge, no production customer data).
We lose the reflexive claim of "least-privilege deploy user" in security questionnaires. Any such questionnaire answer must now reference this ADR and the migration timeline.

Compensating controls retained

Cloudflare Access gates all HTTPS ingress to both VMs. An attacker with DEPLOY_SSH_KEY would still need independent SSH network reachability (no direct public SSH; firewall rules).
MongoDB auth + bind-mounted backups. Pre-deploy mongodump archives to data/backups on every prod deploy (deploy-prod.yml:35). BACKUP_RETAIN_DAYS=14 on prod.
GHCR-only image source. Deploy scripts only pull from ghcr.io/securityv0/sv0-platform/…. Images are tagged immutably by commit SHA on prod (no :latest).
Rotatable credential. DEPLOY_SSH_KEY lives in GitHub Actions secrets and can be rotated in one PR + one sudo-less authorized_keys update on each VM.
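The rotation path for DEPLOY_SSH_KEY can be sketched as two steps: generate and upload a new key, then swap the public key on each VM. This is a hedged sketch; the key comment, file paths, and function names are hypothetical, and using the GitHub CLI (gh secret set, reading the value from stdin) is an assumption about tooling rather than the team's documented procedure.

```shell
#!/bin/sh
# Hypothetical rotation helpers; paths, comments, and names are assumptions.

# Generate a new ed25519 key pair and store the private half as the
# DEPLOY_SSH_KEY Actions secret. $1 = key file, $2 = key comment, $3 = owner/repo.
rotate_deploy_key() {
    ssh-keygen -t ed25519 -N "" -C "$2" -f "$1" &&
        gh secret set DEPLOY_SSH_KEY --repo "$3" < "$1"
}

# Replace every authorized_keys entry whose trailing comment equals
# KEY_COMMENT with NEW_PUBKEY. Runs as the deploy user, no sudo needed.
# Usage: replace_authorized_key FILE KEY_COMMENT NEW_PUBKEY
replace_authorized_key() {
    file=$1
    comment=$2
    new_key=$3
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    # Keep lines whose last whitespace-separated field is not the comment.
    awk -v c="$comment" '{ n = split($0, f, " "); if (f[n] != c) print }' \
        "$file" > "$tmp"
    printf '%s\n' "$new_key" >> "$tmp"
    chmod 600 "$tmp"
    mv "$tmp" "$file"
}
```

Writing to a temporary file and renaming it keeps authorized_keys from ever being observed half-written, which matters because a broken file would lock the deploy user out.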

Required mitigations while this ADR is active

Control	Owner	Cadence
`DEPLOY_SSH_KEY` rotation on any suspected leak (repo access change, CI workflow compromise, contractor offboarding)	Ivan	Event-driven
MongoDB backup retention check on prod (`ls -lh ~/sv0-platform/data/backups`)	Deploy owner	Before every prod release
Host patching (`apt upgrade`) on both VMs	Ivan	Monthly
No production customer tenants until migrated or until this ADR is superseded	Product / Ivan	Ongoing
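The manual ls check in the table could be backed by a small freshness test. The sketch below is an assumption-laden illustration: the function name and the default 24-hour threshold are hypothetical, and it relies on find's -mmin option, which GNU and BSD find support but POSIX does not require.

```shell
#!/bin/sh
# Hypothetical pre-release check; threshold and name are assumptions.

# Succeed only if DIR holds at least one regular file modified within
# the last MAX_MIN minutes (default 1440, i.e. 24 hours).
# Usage: check_recent_backup DIR [MAX_MIN]
check_recent_backup() {
    dir=$1
    max_min=${2:-1440}
    [ -d "$dir" ] || { echo "missing backup dir: $dir" >&2; return 1; }
    recent=$(find "$dir" -type f -mmin "-$max_min" | head -n 1)
    if [ -n "$recent" ]; then
        echo "ok: recent backup $recent"
    else
        echo "no backup newer than $max_min minutes in $dir" >&2
        return 1
    fi
}
```

A release checklist could call check_recent_backup ~/sv0-platform/data/backups and stop on a non-zero exit status.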

When to reconsider

This ADR is time-limited by design. Revisit when any of the following happens:

First real customer tenant is ready to onboard to the managed environment. For this ADR to remain active, the migration must be complete (or explicitly scoped to happen before tenant data lands).
Managed-platform decision is made (ECS, ACI, AKS, EKS, Fargate, or equivalent). The migration plan supersedes this ADR and writes its own security posture doc.
Credible leak signal on DEPLOY_SSH_KEY — rotate immediately; if rotation isn't fast enough, reinstate the allowlist (Option B from #462) as a bridge control.
Regulatory scope changes — if we need SOC 2 / ISO 27001 attestation before the migration lands, re-open this and ship Option B with a canonical git-managed allowlist + CI drift check.
Deploy frequency increases 5× — the operational-cost side of the tradeoff weakens if we are deploying many times per day to many VMs; at that point the allowlist-drift tax grows relative to the migration horizon.
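If Option B is ever reinstated as a bridge control, the CI drift check it calls for might look like the sketch below. It is deliberately simplified and hypothetical: it assumes a git-tracked file listing one permitted command per line and compares sudo invocations literally, whereas real sudoers rules use command specifications with wildcards that this comparison does not model.

```shell
#!/bin/sh
# Hypothetical drift check; the allowlist file format is an assumption.

# Print every `sudo ...` command found in the given scripts that is not
# listed verbatim in ALLOWLIST, and return non-zero if any are found.
# Usage: check_sudo_drift ALLOWLIST SCRIPT...
check_sudo_drift() {
    allowlist=$1
    shift
    drift=$(
        # -h drops filenames, -o prints only the matched part of each line;
        # the match stops at the first shell operator (|, ;, &).
        grep -hoE '(^|[[:space:]])sudo [^|;&]*' "$@" |
            sed 's/^[[:space:]]*sudo //; s/[[:space:]]*$//' |
            sort -u |
            grep -Fxv -f "$allowlist" || true
    )
    if [ -n "$drift" ]; then
        printf '%s\n' "$drift" | sed 's/^/not in allowlist: /' >&2
        return 1
    fi
}
```

Run in CI against deploy/scripts/*.sh, a check of this kind would turn a flag change into a failed pull request instead of a failed deploy.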

References

sv0-platform#401 — original sudoers-allowlist PR (the edits this ADR reverses)
sv0-platform commit 1b9f425 — sudo-elimination via Caddy admin API + docker sidecar (kept)
sv0-platform#412 — prune-on-failure + disk pre-check (allowlist-drift tolerance)
sv0-platform commit d9c2bd5 — allowlist alignment for prod prune
sv0-platform#455 — 2026-04-20 dev-deploy outage (resolved by this revert)
sv0-platform#462 — proposal issue that surfaced this decision
sv0-documentation#174 — 2026-04-12 architecture review (source of audit finding #392)
sv0-platform#392 — original audit finding (reopened as accepted risk by this ADR)
