IaC Drift and Emergency Changes

Date: 2026-05-05
Status: Active

Problem Statement

The point of Terraform isn't to forbid dashboard clicks — it's to ensure every dashboard click that lasts more than a few hours becomes visible in code. When an incident hits and the fastest path to a fix is the Cloudflare or Atlas dashboard, you take that path. The discipline is making sure the change comes back into HCL within hours, not weeks.

This runbook codifies three things:

Three reconciliation paths when drift is detected
The emergency dashboard playbook for incident-time changes
What's intentionally drift-immune (resources with lifecycle.ignore_changes) and why

It is the operational companion to ADR-019, which defines why we use IaC. This document is what to do when reality and HCL diverge.

The Three Reconciliation Paths

When terraform plan shows drift on a Terraform-managed resource, exactly one of these is the right action. The wrong move is usually a blind terraform apply: it overwrites the dashboard change and may bring back the incident that change fixed.

1. Accept — dashboard change is correct, HCL was wrong

When: an operator made a deliberate change in the dashboard during an incident, or a misconfiguration in HCL is now corrected by a manual fix.

Steps:

1. Open a feature branch on the relevant infra repo (sv0-infrastructure).
2. Edit the HCL to match what the dashboard now shows.
3. Open a PR titled fix(<vendor>): reconcile <thing> from incident <ID>. Reference the incident in the PR description.
4. Confirm that the speculative plan in Terraform Cloud (TFC) shows No changes. That is the gate confirming HCL faithfully describes reality.
5. Merge, then confirm the apply in the TFC UI (no resource changes will run, because HCL now matches what is already deployed).
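
The PR title convention in the steps above can be checked mechanically. A minimal sketch, assuming vendor names are lowercase tokens and that any whitespace-free token counts as an incident ID; the helper name is_reconcile_title is hypothetical and not part of sv0-infrastructure:

```shell
# Hypothetical helper: succeed only if a PR title follows
#   fix(<vendor>): reconcile <thing> from incident <ID>
# The vendor and ID character classes are assumptions, not a documented rule.
is_reconcile_title() {
  printf '%s\n' "$1" |
    grep -Eq '^fix\([a-z0-9_-]+\): reconcile .+ from incident [^[:space:]]+$'
}

# Usage:
#   is_reconcile_title "fix(cloudflare): reconcile access policy from incident INC-42" \
#     && echo "title ok"
```

A check like this could run in CI on the PR title, but it only guards the naming convention. The speculative plan remains the real gate.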

2. Revert — dashboard change was wrong, HCL is correct

When: an unauthorized change, a typo, or an experiment that should be undone.

Steps:

1. Trigger a real plan in TFC (not a speculative plan from a PR; speculative plans cannot be applied). Two ways:
   - Re-run the latest queued run from the workspace UI ("Actions → Re-run with the latest configuration").
   - Or push an empty commit to main to re-trigger.
2. Read the plan carefully. The values being "set" should match HCL.
3. Confirm the apply in the TFC UI. The HCL value is pushed back over the dashboard change.
4. Verify that the subsequent plan shows No changes.

Risk: if you blindly apply a revert without understanding the dashboard change, you may bring back the incident the operator was trying to fix. Always read the diff before clicking Confirm.
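
The empty-commit option can be sketched as follows. This assumes the workspace queues a run on every push to main, as the steps above imply. The helper name retrigger_commit and the chore: message prefix are illustrative assumptions:

```shell
# Hypothetical helper: create an empty commit whose only purpose is to
# re-trigger a run. It does not push; run it on an up-to-date main.
retrigger_commit() {
  reason=$1
  # --allow-empty permits a commit with no file changes
  git commit --allow-empty -m "chore: re-trigger plan (${reason})"
}

# Usage (the push is what starts the run; it is shown here, not executed):
#   git checkout main && git pull --ff-only
#   retrigger_commit "revert drift on <thing>"
#   git push origin main
```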

3. Allowlist — known external system mutates this field, intentionally ignore

When: an external system (Cloudflare bootstrap script, automated key rotation, vendor-managed values) writes to a specific field on a Terraform-managed resource. The resource itself stays in TF, but specific attributes are explicitly ignored.

Two mechanisms; pick one depending on scope:

Per-field on a single resource — lifecycle.ignore_changes:

resource "cloudflare_zero_trust_access_policy" "ci_cd_bot_access" {
  # ... usual fields ...

  lifecycle {
    ignore_changes = [include]  # service token UUIDs rotate via GHA secret rotation
  }
}

Per-resource one-time exception — drift-allowlist.md in the env directory:

# envs/shared/drift-allowlist.md

## cloudflare_record.foo (added 2026-05-12 by @ivanfofanov)
Reason: WorkOS reissued our domain verification token during their migration.
Action: leave HCL as-is for 30 days; reconcile after WorkOS confirms the new token is final.
Review by: 2026-06-12.

The drift-allowlist.md is not enforced by code — it's a structured note that future-you will read during the next drift triage and either delete (when reconciled) or extend (if still needed). Date-stamp every entry; revisit monthly.
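
Because the allowlist is not enforced by code, the monthly revisit is easy to miss. A minimal sketch that lists entries whose review date has passed, assuming every entry uses the "## " heading and "Review by: YYYY-MM-DD." line shown in the example; the helper name overdue_allowlist_entries is hypothetical:

```shell
# Hypothetical helper: print "<review date>  <entry heading>" for every
# allowlist entry whose "Review by" date is earlier than today.
# Arguments: <allowlist file> [today as YYYY-MM-DD, defaults to the current date]
overdue_allowlist_entries() {
  file=$1
  today=${2:-$(date +%Y-%m-%d)}
  awk -v today="$today" '
    /^## /        { entry = substr($0, 4) }       # remember the current heading
    /^Review by:/ {
      d = $3
      sub(/\.$/, "", d)                           # strip the trailing period
      # ISO dates compare correctly as plain strings
      if ((d "") < (today "")) print d "  " entry
    }
  ' "$file"
}

# Usage:
#   overdue_allowlist_entries envs/shared/drift-allowlist.md
```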

Tradeoff to know: every ignore_changes block is a blind spot — TF won't detect drift on those fields, including malicious changes. Add them sparingly, and rely on vendor audit logs (Cloudflare Logpush, Atlas Audit) for changes inside the blind spots.

Detection — How Drift Surfaces

Today (Phase 1-3 of the IaC rollout)

Manual weekly terraform plan. Run this from a clean checkout of main, in each env directory that has live state:

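cd ~/dev/securityv0/repos/sv0-infrastructure
git fetch && git checkout main && git pull --ff-only

for env in envs/*/; do
  echo "=== $env ==="
  # -chdir runs terraform inside the env directory without cd bookkeeping
  terraform -chdir="$env" init -input=false > /dev/null
  # Capture the output first so $? is terraform's exit code, not tail's
  out=$(terraform -chdir="$env" plan -no-color -detailed-exitcode 2>&1)
  rc=$?
  echo "$out" | tail -3
  echo "exit code: $rc"
done

-detailed-exitcode returns 0 (no changes), 1 (error), or 2 (changes present, which on a clean checkout of main means drift). The loop prints the exit code explicitly because piping plan straight into tail would discard it. Eyeball the output, then reconcile via path 1, 2, or 3.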
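
The exit-code convention can be turned into a one-word triage label per env. A minimal sketch; the helper name classify_plan_exit is hypothetical:

```shell
# Map a `terraform plan -detailed-exitcode` status to a triage label.
classify_plan_exit() {
  case $1 in
    0) echo "clean" ;;   # plan succeeded, no changes
    2) echo "drift" ;;   # plan succeeded, changes present
    *) echo "error" ;;   # 1 or anything unexpected
  esac
}

# Usage inside the weekly loop, where $env is an env directory:
#   terraform -chdir="$env" plan -no-color -detailed-exitcode > /dev/null 2>&1
#   echo "$env: $(classify_plan_exit $?)"
```

Treating every status other than 0 and 2 as an error keeps a failed init or an auth problem from being mistaken for a clean plan.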

Cadence: every Monday morning, before the week's IaC work begins. Calendar reminder lives on the founder's calendar.

Phase 4+ — automated (planned)

A scheduled GitHub Actions workflow in sv0-infrastructure runs terraform plan daily on a cron and opens a GitHub Issue labeled drift + <workspace> when the plan is non-empty. The issue body includes the diff. Triage routes through the three paths above.
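
The issue-opening step of that workflow might look like the sketch below. It is not the planned implementation: the helper names, the issue title, and the idea of saving the plan output to a file are assumptions. The gh issue create flags used (--title, --label, --body-file) are standard GitHub CLI options, and both labels would need to exist in the repository already.

```shell
# Hypothetical helper: print the `gh issue create` command for a drifted
# workspace. Printing keeps the sketch side-effect free; a real job would
# execute the command instead of echoing it.
drift_issue_cmd() {
  printf 'gh issue create --title "Drift detected: %s" --label drift --label "%s" --body-file "%s"\n' \
    "$1" "$1" "$2"
}

# Hypothetical driver: emit the command only when the plan exit code is 2.
# Arguments: <workspace> <plan exit code> <file holding the plan output>
report_drift() {
  if [ "$2" -eq 2 ]; then
    drift_issue_cmd "$1" "$3"
  fi
}

# Usage:
#   report_drift shared 2 plan.txt
```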

This is the free-tier equivalent of HCP Terraform's paid Health Assessments feature. No need to land it before drift actually bites — defer until Phase 4 of the IaC rollout.

Vendor audit logs as a backstop

Both Cloudflare and Atlas emit per-change audit records. For resources where TF has lifecycle.ignore_changes, the vendor audit log is the only signal that something changed. Sketch:

Vendor	Audit feature	Cost	Current practice
Cloudflare	Audit Log Export (Logpush)	Paid (Enterprise/Business)	Manual review in dashboard
MongoDB Atlas	Database Auditing + Project Audit	Included on M10+	Manual review

When Phase 2 (Atlas IaC) ships, wire Atlas Audit to a notification channel. Cloudflare Logpush is deferred until we are on an Enterprise plan or a real incident proves the manual cadence isn't enough.

Emergency Dashboard Playbook

When an incident forces you to click in a vendor dashboard, follow this sequence. The order matters: incident first, IaC second.

1. Make the fix in the vendor dashboard.
   - Cloudflare:  https://dash.cloudflare.com
   - Atlas:       https://cloud.mongodb.com
2. In your incident notes / Slack channel, write down EXACTLY what you changed:
   - Resource type, name, identifier
   - Before value, after value
   - Why (one-sentence rationale)
3. Confirm the incident is resolved.

   ─── then within ~4 hours ───

4. Open a feature branch on the relevant infra repo:
     git fetch && git worktree add .claude/worktrees/incident-NNN -b fix/incident-NNN main
5. Update HCL to match the dashboard change.
6. Open a PR titled:
     fix(<vendor>): reconcile <thing> from incident <ID>
   Reference the incident in the PR body.
7. Speculative plan must show "No changes" before merge — that's the gate.
8. Merge → click Confirm in TFC UI (apply will be a no-op, just normalizes state).
9. Verify that subsequent plans are clean.
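
The change record from step 2 is easier to reconcile later if it always has the same shape. A minimal sketch of a note formatter; the helper name incident_note and the field labels are assumptions, and the values in the usage line are placeholders:

```shell
# Hypothetical helper: print a change record to paste into the incident
# channel or notes doc.
# Arguments: <resource type, name, identifier> <before> <after> <why>
incident_note() {
  printf 'Resource: %s\nBefore:   %s\nAfter:    %s\nWhy:      %s\n' \
    "$1" "$2" "$3" "$4"
}

# Usage:
#   incident_note "cloudflare_record.foo (TXT)" "<old value>" "<new value>" \
#     "<one-sentence rationale>"
```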

Anti-patterns (don't do these):

❌ Make the dashboard change, "I'll fix HCL tomorrow." Tomorrow becomes never; the next drift PR finds 5 stale changes and nobody remembers what each one was for.
❌ Make the dashboard change, run terraform apply to "fix" the drift without updating HCL — this reverts your incident fix.
❌ Open a PR but skip the speculative-plan gate. If plan shows changes, your HCL doesn't match reality and merging just relocates the drift.
❌ Document the change in a private Slack DM. Use the incident channel or notes doc — future operators need to find it.

Quick Decision Tree

Is the change time-critical (active incident, paying customer affected)?
├─ YES  → click dashboard, then PR within 4h to reconcile (Path 1: Accept)
└─ NO   → use the IaC path:
            1. checkout feature branch
            2. edit HCL
            3. push → open PR
            4. read TFC speculative plan in PR comment
            5. merge → click Confirm & Apply in TFC UI
            (typical end-to-end: ~10-15 min)

The IaC path is fast enough for non-emergencies. Reserve the dashboard for genuine "bytes are on fire" moments.

What's Drift-Immune (`lifecycle.ignore_changes`)

These resources have specific fields explicitly ignored. Drift on the ignored fields will NOT appear in terraform plan, will NOT trigger reconciliation, will NOT be visible in PR review. Every entry below is a deliberate blind spot.

`sv0-infrastructure` / Cloudflare

Resource	Ignored fields	Why
`cloudflare_zero_trust_access_application.securityv0_pr_previews`	`policies`, `allowed_idps`	`bootstrap-cf-access-pr-previews.sh` is a parallel writer. Will be retired in a follow-up; this is defense-in-depth until then.
`cloudflare_zero_trust_access_policy.securityv0_pr_previews_securityv0_team`	`include`, `exclude`, `require`	Same script is also a parallel writer for this policy's identity rules.
`cloudflare_zero_trust_access_policy.securityv0_pr_previews_ci_cd_bot_access`	`include`	Service token UUIDs rotate operationally via GHA secret rotation.
`cloudflare_zero_trust_access_policy.sv0_website_reviews_cloudflare_pages_visual_review_bot_service_token`	`include`	Same as above.
`cloudflare_zero_trust_access_policy.sv0_reviews_cloudflare_pages_visual_review_bot_service_token`	`include`	Same.
`cloudflare_zero_trust_access_policy.securityv0_dev_ci_cd_bot_access`	`include`	Same.
`cloudflare_zero_trust_access_policy.sv0_docs_cloudflare_pages_service_auth_bots`	`include`	Same.
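
Since every ignore_changes block should have a row in the table above, a quick search of the repo helps keep the two in sync. A minimal sketch, assuming all HCL lives in *.tf files and that downloaded modules under .terraform should be skipped; the helper name list_ignore_changes is hypothetical:

```shell
# Hypothetical audit helper: list every ignore_changes line under a
# directory as file:line:text, for comparison against the runbook table.
list_ignore_changes() {
  # /dev/null forces grep to print file names even for a single match file
  find "${1:-.}" -type f -name '*.tf' ! -path '*/.terraform/*' \
    -exec grep -n 'ignore_changes' /dev/null {} +
}

# Usage, from the root of the infra repo:
#   list_ignore_changes envs
```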

Resources excluded from TF entirely (not even imported)

These resources EXIST in Cloudflare but are NOT Terraform-managed. Their lifecycle is owned by another system. See scripts/cloudflare-inventory.sh → DNS_IGNORE_PATTERNS:

Pattern	Owner / why
`pr-*-dev.securityv0.com` CNAMEs	Auto-managed by `sv0-platform/deploy-dev.yml` on PR open. Cleanup is best-effort and outside Terraform's scope.
`google._domainkey.securityv0.com` (DKIM TXT)	Google Workspace admin console rotates the key. Every rotation would create a noise PR.
`_domainconnect.securityv0.com` (CNAME)	IONOS DomainConnect protocol-defined record. IONOS owns the protocol.

If you want to extend the exclusion list, edit scripts/cloudflare-inventory.sh, document the reason in a comment next to the pattern, and open a PR. The pattern stops the inventory script from re-introducing the resource on future runs.
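
To illustrate how such patterns behave, here is a minimal sketch of a matcher built from the three rows in the table above. It is not the contents of scripts/cloudflare-inventory.sh; the real DNS_IGNORE_PATTERNS may be expressed differently, and the helper name is_dns_ignored is hypothetical:

```shell
# Hypothetical matcher: succeed if a DNS record name falls under one of the
# exclusion patterns listed in the runbook table.
is_dns_ignored() {
  case $1 in
    pr-*-dev.securityv0.com)          return 0 ;;  # PR preview CNAMEs
    google._domainkey.securityv0.com) return 0 ;;  # Google Workspace DKIM
    _domainconnect.securityv0.com)    return 0 ;;  # IONOS DomainConnect
    *)                                return 1 ;;
  esac
}

# Usage:
#   is_dns_ignored "pr-123-dev.securityv0.com" && echo "skip: owned elsewhere"
```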

When to Update This Runbook

New lifecycle.ignore_changes block added → add a row to the table above.
New exclusion pattern added → add a row to the second table above.
New IaC env (Atlas in Phase 2, AWS compute in Phase 3) → add a section describing its env-specific reconciliation paths.
Drift-detection cadence changes (manual → cron, cron → Health Assessments) → update the Detection section.

The runbook is canonical. If reality diverges from this doc, fix the doc in the same PR as the change.

Related

ADR-019: Infrastructure-as-Code Strategy — why we use Terraform, how the secrets boundary works, and the multi-tier review model.
IaC Rollout Plan — phased delivery of IaC across the platform.
Git Workflow, Branching, and Worktrees — how to set up feature branches and worktrees referenced by this runbook's reconciliation paths.
sv0-infrastructure/scripts/cloudflare-inventory.sh — the operational tool that materializes Cloudflare state into HCL.
