fix(remediation): close the recurring Exchange-Admin-role gap fleet-wide

EXO email-cleanup tasks (Search-UnifiedAuditLog, Get-MessageTrace, inbox rules) kept
401/403-ing per tenant because the Exchange Operator SP was missing the Exchange Admin
directory role — admin consent grants Exchange.ManageAsApp but never the directory role.
onboard-tenant.sh assigns it, but tenants consented before that step / by hand never got
it, and nothing audited for it. Hence the recurring 'next onboarding will fix it' (false
for already-onboarded tenants).

- NEW assign-exchange-role.sh: idempotent role assignment via the authoritative
  roleManagement/directory/roleAssignments API (the legacy directoryRoles/members list
  reads back unreliably). <domain|--all> + --verify/--dry-run.
- Backfilled the whole fleet (--all): 13 stragglers ASSIGNED, 12 already OK, 20 skipped
  (tenant-admin not consented), 0 errors. Safe Site included.
- Standing audit documented (assign-exchange-role.sh --all --verify) + memory so no future
  session repeats the empty promise.
- Adds wiki/clients/safesite.md (tenant + 4-source endpoint inventory + investigation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-08 20:07:28 -07:00
parent 19b5ca299b
commit 7fc29a7c5f
4 changed files with 41 additions and 19 deletions

View File

@@ -54,6 +54,7 @@
- [Graph CA policy reads are eventually consistent](feedback_graph_ca_policy_eventual_consistency.md) — After PATCHing a CA policy (204), wait ~5s before GET-verifying; immediate reads can be stale.
- [Graph password reset needs a privileged role](feedback_graph_password_reset_requires_role.md) — PATCH passwordProfile on an existing user 403s without a directory role; User.ReadWrite.All alone only sets a password at CREATE.
- [Vault writes — do the full sequence yourself](feedback_complete_vault_operations_end_to_end.md) — A vault entry = write plaintext → sops -e -i → git add/commit/push, all of it; don't stop at "encrypted on disk."
- [Exchange role recurring gap — backfill, don't promise](feedback_exchange_role_recurring_gap.md) — EXO email-cleanup 401/403 = Exchange Operator SP missing the Exchange Admin directory role (consent never grants it). Fix: `assign-exchange-role.sh <domain|--all>` (idempotent); audit with `--all --verify`. Fleet backfilled 2026-06-08. Verify membership via roleManagement/directory/roleAssignments (not the laggy directoryRoles/members list); EXO propagation 15-60min.
- [Syncro is the default PSA; Autotask is opt-in](feedback_psa_default_syncro.md) — Ticketing/billing/customers default to Syncro (/syncro). Only use /autotask on an explicit "in Autotask" request. /autotask kept local/undistributed.
- [Paste-safe command formatting (Howard)](feedback_command_formatting.md) — Two clauses, one root cause: (a) multi-line scripts not semicolon one-liners (wrap breaks paste), (b) all code at column 0 inside fences (indentation breaks PowerShell paste).
- [Autonomous infra/build setup](feedback_autonomous_infra_setup.md) — During infra/build/CI/dev setup, just install prerequisites and push through routine steps; reserve check-ins for genuine decisions (forks, destructive/outward, client/prod).

View File

@@ -0,0 +1,18 @@
---
name: feedback_exchange_role_recurring_gap
description: Exchange email-cleanup tasks fail with 401/403 because the EXO app SP is missing the Exchange Admin directory role — fix via the backfill script, never promise "next onboarding will fix it"
metadata:
type: feedback
---
Email-cleanup / mailbox-forensic tasks (Search-UnifiedAuditLog, Get-MessageTrace, Get/Remove-InboxRule, Set-Mailbox) kept failing per-tenant with EXO 401/403, and each session hand-waved "it'll be auto-added next onboarding." Mike (2026-06-08) called this out as recurring disappointment. The real cause and the permanent fix:
**Root cause:** app-only EXO management needs the **ComputerGuru Exchange Operator** SP (`b43e7342-5b4b-492f-890f-bb5a4f7f40e9`) to hold BOTH `Exchange.ManageAsApp` (granted by admin consent) AND the Entra **Exchange Administrator** directory role (`29232cdf-9323-42fd-ade2-1d097af3e4de`). Admin consent grants the API permission but NEVER the directory role. `onboard-tenant.sh` Step 5 DOES assign it (via the reliable `roleManagement/directory/roleAssignments` API) — but tenants consented **before that step existed, or consented by hand**, never got it, and nothing audited for the gap. So the recurrence was old/manual stragglers, not an onboarding bug.
**The fix (do this, don't promise):**
- `bash .claude/skills/remediation-tool/scripts/assign-exchange-role.sh <domain|--all> [--verify|--dry-run]` — assigns the role to the Exchange Operator SP. Idempotent. `--all` backfills every tenant in `references/tenants.md`; tenants where tenant-admin isn't consented are SKIPped. **Backfilled fleet-wide 2026-06-08** (~10 stragglers fixed).
- **Standing audit:** run `assign-exchange-role.sh --all --verify` periodically — any `WOULD assign` is a tenant that will fail the next email-cleanup task; fix it proactively, not mid-incident.
- **Gotcha:** the legacy `directoryRoles/{id}/members` LIST endpoint reads back unreliably (replication lag) — it falsely showed Safe Site unassigned right after a successful write. Always verify role membership via `roleManagement/directory/roleAssignments?$filter=principalId eq '<sp>'`, not the members list.
- **Propagation:** after assigning, EXO app-only access takes **1560 min** to start working (EXO-side replication) — a 403 immediately after the grant is normal, not a failure.
**Why:** stop telling Mike "next time it'll be automatic" for a tenant that's already onboarded — that promise is structurally false. The durable answer is the backfill + the standing `--verify` audit. See [[reference_acg_msp_stack]] and the remediation-tool tenants reference.