Recovery-credential patterns
The 2026-05-13 Phase 3a-4 work designed (and then cancelled) a sv0-azure-backup-owner UAA service principal as Tier-3 account-lockout rollback. The SP itself wasn't built: adding a 2nd human Owner was cheaper than building an SP whose own sunset condition was the arrival of a 2nd Owner. The design knowledge is the lasting asset, captured here for the next time the trigger condition genuinely cannot be satisfied.
When to reach for this: post-10-staff, first compliance ask, or any scenario where adding a second human with the privileged role is genuinely blocked.
When NOT to reach for this: at 1-2 operator scale, when the same blast radius can be covered by a 5-minute org-level action. The rule is: if the design's sunset trigger is cheaper than the design itself, skip the bridge and do the trigger.
Pattern 1 — Role: User Access Administrator, not Owner
If the recovery use case is "re-grant role assignment to a locked-out human," User Access Administrator is sufficient.
- UAA = Microsoft.Authorization/* + */read + Microsoft.Support/*. No dataActions. No resource mutation.
- Owner = actions: ["*"]. Everything.
UAA can self-grant Owner via UAA's own Microsoft.Authorization/roleAssignments/write when a recovery step genuinely needs resource mutation. Each self-elevation is a discrete audited Activity Log event — closer to PIM's audit framing than permanent Owner.
Why this matters: least privilege at standing-state, with on-demand elevation as the audited recovery action. The SP is a less-attractive compromise target.
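The standing-state/elevation split can be sketched as a small set of shell helpers. This is a minimal sketch, not the cancelled design's script: the function names and their arguments are hypothetical, and it assumes the caller is already logged in as the recovery SP (Pattern 3). The az role assignment create and delete subcommands and flags are standard Azure CLI.

```shell
# Hypothetical helpers: elevate the recovery SP to Owner for one recovery
# step, then drop back to the UAA-only standing state.

elevate_to_owner() {
  # $1 = SP object ID, $2 = scope (e.g. /subscriptions/<sub-id>)
  az role assignment create \
    --assignee-object-id "$1" \
    --assignee-principal-type ServicePrincipal \
    --role Owner \
    --scope "$2"
}

drop_owner() {
  # Same arguments as elevate_to_owner.
  az role assignment delete \
    --assignee "$1" \
    --role Owner \
    --scope "$2"
}

with_owner() {
  # Usage: with_owner <sp-object-id> <scope> <command> [args...]
  # Runs the command between an elevate and a drop, and returns the
  # command's exit status.
  wo_sp_id=$1; wo_scope=$2; shift 2
  elevate_to_owner "$wo_sp_id" "$wo_scope" || return 1
  "$@"; wo_rc=$?
  drop_owner "$wo_sp_id" "$wo_scope" \
    || echo "WARNING: Owner still assigned to $wo_sp_id" >&2
  return "$wo_rc"
}
```

Each with_owner call should produce one role-assignment create and one delete in the Activity Log, which is the audit pairing this pattern relies on. New role assignments can take some time to propagate, so a real script would likely need a retry or wait between the elevation and the recovery command.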
Pattern 2 — Credential out-of-band, never in TF state
The SP credential MUST NOT enter Terraform state. Terraform state is plaintext and replicates to whatever vault holds state backups. If the same vault holds the credential + state, the vault becomes a shared failure domain — not defense-in-depth.
Pattern:
- Terraform manages only the immutable Entra/RBAC objects: azuread_application, azuread_service_principal, azurerm_role_assignment. NO azuread_application_password. NO time_rotating.
- An operator script (e.g., scripts/provision-azure-<sp-name>.sh) mints the credential via az ad app credential reset --id <app-obj-id> --append true --years 1, captures stdout in a shell variable, pipes to 1Password via stdin (op item create … password=- or op item edit … password=-).
- The secret never crosses argv of any process, never lands in a file (except briefly under a mktemp -d directory (0700) + umask 077 for the verification login), never persists in shell history.
- Rotation: re-run the script with --rotate. Pre-mint key-ID snapshot enables clean-up of old credentials after verification of the new one.
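The rotation order described above (snapshot, mint with append, store, verify, delete old) can be sketched as follows. This is an illustration under stated assumptions, not the reference script: rotate_credential, store_secret and verify_secret are hypothetical names. store_secret stands in for the 1Password write and verify_secret for the verification login; both receive the secret on stdin. The az ad app credential list, reset and delete subcommands are standard Azure CLI.

```shell
# Hypothetical rotation flow. store_secret and verify_secret must be defined
# by the caller; each reads the new secret from stdin.

rotate_credential() {
  rc_app_id=$1

  # 1. Snapshot the key IDs that exist before minting.
  rc_old_keys=$(az ad app credential list --id "$rc_app_id" \
    --query "[].keyId" -o tsv) || return 1

  # 2. Mint a new credential alongside the old ones; hold it in a variable.
  rc_secret=$(az ad app credential reset --id "$rc_app_id" --append --years 1 \
    --query password -o tsv) || return 1

  # 3. Store, then verify. printf is a shell builtin, so the secret is not
  #    passed through the argv of an external process here.
  printf '%s' "$rc_secret" | store_secret || return 1
  printf '%s' "$rc_secret" | verify_secret || return 1

  # 4. Only after verification, delete the credentials captured in step 1.
  for rc_key_id in $rc_old_keys; do
    az ad app credential delete --id "$rc_app_id" --key-id "$rc_key_id" \
      || return 1
  done
  unset rc_secret
}
```

If storage or verification fails, the function returns before step 4, so the old credentials stay valid and the operator can retry.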
Reference implementation: Phase 3a-2's provision-vm-emergency-key.sh (~410 LOC) is the canonical pattern. The cancelled provision-azure-backup-owner.sh followed the same shape. Reuse.
Pattern 3 — Activation: file-pipe, never argv
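az login --service-principal -p <secret> is the wrong pattern. The -p value lands in /proc/<pid>/cmdline (Linux) for as long as az runs, and in the calling shell's history (zsh with INC_APPEND_HISTORY or SHARE_HISTORY, or bash with history -a in PROMPT_COMMAND, writes HISTFILE as each command is committed, before history -c can run).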
Right pattern:
tmpf=$(mktemp); chmod 600 "$tmpf"
op read op://sv0-infra/<item>/password > "$tmpf"
az login --service-principal -u <client_id> --password "@$tmpf" --tenant <tenant_id>
rm -f "$tmpf"
# … recovery action …
az logout
Azure CLI's --password flag treats values starting with @ as file paths; if the file is missing, az exits with CLIError rather than treating the literal string as a password. Safe.
Document this exact invocation in the 1Password item's notesPlain field so the stressed operator who copy-pastes during a real incident gets the safe pattern.
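A slightly hardened variant of the same invocation, as a sketch: the temp file is removed by an EXIT trap even when op or az fails partway. The function name sp_login_via_file and its argument order are hypothetical; the op read and az login calls are the ones shown above.

```shell
# Hypothetical wrapper around the file-pipe login.
# $1 = 1Password secret reference, $2 = client ID, $3 = tenant ID

sp_login_via_file() {
  (
    umask 077                        # owner-only permissions from creation
    tmpf=$(mktemp) || exit 1
    trap 'rm -f "$tmpf"' EXIT        # runs when this subshell exits
    op read "$1" > "$tmpf" || exit 1
    az login --service-principal -u "$2" --password "@$tmpf" --tenant "$3"
  )
}
```

Running the body in a subshell keeps the trap and the tmpf variable out of the operator's interactive shell. The function's exit status is that of az login, or non-zero if the secret could not be read.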
Pattern 4a — Verify subscription state before any design work
Before any recovery-credential design begins, run the verification:
az role assignment list \
--scope "/subscriptions/<sub-id>" \
--role Owner \
--include-inherited \
-o table
If the answer shows ≥2 distinct human principals, the Tier-3 SPOF is already closed by org-level setup. No SP, no PIM, no special design. The 2026-05-13 Phase 3a-4 work is the worked example of what happens when this step is skipped: ADR text claiming "today: Ivan only" drove a multi-hour design exercise + adversarial review cycle, but az showed Sergey had been Owner since 2026-01-04 the whole time. The verifiable premise was never verified.
ADR text describing "who holds Owner today" is documentation; az is state. When they disagree, trust the CLI and update the ADR.
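The check can be turned into a scriptable gate. A minimal sketch, assuming the principalType and principalId fields that az role assignment list returns; the function names are hypothetical.

```shell
# Hypothetical pre-design gate: count distinct human (User) principals that
# hold Owner at subscription scope, including inherited assignments.

count_human_owners() {
  # $1 = subscription ID
  az role assignment list \
    --scope "/subscriptions/$1" \
    --role Owner \
    --include-inherited \
    --query "[?principalType=='User'].principalId" \
    -o tsv | sort -u | grep -c .
}

spof_already_closed() {
  # Succeeds when at least two distinct humans hold Owner.
  [ "$(count_human_owners "$1")" -ge 2 ]
}
```

Usage: spof_already_closed "<sub-id>" && echo "Tier-3 SPOF already closed; skip the SP design". The filter counts only direct User assignments; Owner granted through a group shows up as principalType Group and is not expanded here, so group membership still needs a manual look.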
Pattern 4 — Sunset condition is mandatory
Every recovery SP must declare its sunset trigger. Examples:
- "Delete when a 2nd human Owner is provisioned."
- "Delete when PIM is adopted and the human Owner is converted to Eligible."
- "Delete when staff count exceeds 5 and per-operator SPs replace the shared one."
If the sunset condition itself is cheaper than the SP, don't build the SP — do the trigger directly. The 2026-05-13 cancelled sv0-azure-backup-owner is the worked example: trigger = "add Sergey as 2nd Owner" was a 5-minute Entra portal action; the SP was ~400 LOC + rotation + Activity Log GHA + cert-migration + annual deletion ceremony. Trigger wins.
Pattern 5 — prevent_destroy = true on the rollback path
If the SP genuinely needs to exist, protect the Terraform resources from accidental destruction:
resource "azuread_application" "<name>" {
  # …
  lifecycle {
    prevent_destroy = true
  }
}

resource "azurerm_role_assignment" "<name>" {
  # …
  lifecycle {
    prevent_destroy = true
  }
}
Deletion is a deliberate two-step: remove the lifecycle block in one apply, then destroy. Defense against terraform destroy typos and refactor renames.
Exception: a credential resource such as azuread_application_password would NOT get prevent_destroy, because rotation requires replacement. Under Pattern 2 the question does not arise: the password is rotated out-of-band and never becomes a Terraform resource.
Pattern 6 — Attribution: per-operator SPs at scale, shared SP at 1-operator
A single static-credential SP shared between operators (Ivan + Sergey) loses attribution at the Azure Activity Log layer (action shows principalId = <sp_object_id>, not the human who retrieved the credential from 1Password). Correlation requires matching 1Password Activity Log retrieval time with Azure Activity Log action time — possible, brittle.
Rule of thumb:
- 1 operator → shared SP is fine (no attribution dispute possible).
- 2+ operators with same recovery scope → per-operator SPs, or accept attribution loss as a documented post-mortem artifact.
The 2026-05-13 cancelled SP would have been Ivan+Sergey-shared; the cancellation eliminates the question.
Pattern 7 — Document the 1Password lockout SPOF with an explicit RTO
If the recovery credential lives in 1Password, then 1Password is the lockout vector. Be honest about the fallback:
- Azure tenant root-account password reset via Microsoft support
- Realistic RTO: days for free-tier subscriptions (no emergency support SLA)
- Acceptable at 1-2 operator scale; the real closure is multi-human-Owner
Don't hand-wave with "Microsoft support handles it." State the RTO. State the scale at which the RTO is acceptable. State the closure trigger.
Anti-patterns
- Don't use --append false on rotation. It wipes existing credentials atomically, leaving a window where the SP has no valid credential before the new one is in 1Password. Use --append true, write to 1P, verify, then delete old credentials by keyId.
- Don't pass the secret via jq --arg. --arg puts the value in jq's argv. Build the template with placeholders and pipe the secret separately to op via stdin.
- Don't store time_rotating for a credential the script manages. TF will plan changes when the secret rotates out-of-band; just don't make the secret a TF resource at all.
- Don't claim "calendar reminder" as a compensating control. It's theater. If activity logging matters, automate it (scheduled GHA → Slack post → issue assignment).
Related
- ADR-023 §6.2 step 12 — the use case this pattern was designed for (currently resolved by sv0-infrastructure#60 rather than this pattern).
- Sunset-trigger rule (saved as an operator memory): before building a sophisticated workaround with a documented sunset condition, ask whether satisfying the trigger directly is cheaper. The 2026-05-13 Phase 3a-4 cancellation is the worked example.
- sv0-infrastructure/scripts/provision-vm-emergency-key.sh — reference implementation of patterns 2-3 for the Tier-1.5 per-VM key.