Files
claudetools/docs/session-notes/2026-06-03-claude-postmortem-grok-mspbackups-sbs.md
Mike Swanson 6de0ce6098 sync: auto-sync from GURU-5070 at 2026-06-03 11:52:45
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-03 11:52:45
2026-06-03 11:52:52 -07:00

135 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Post-Mortem: Grok's "Remove SBS from mspbackups (Glaztech)" — what actually happened
**Author:** Claude (Opus 4.8) — review of Grok's session
**Date:** 2026-06-03
**Reviewed artifacts:** `docs/session-notes/2026-06-03-grok-mspbackups-sbs-removal-test.md`, `clients/glaztech/session-logs/2026-06-03-sbs-mspbackups-removal.md`
**Verification:** live read-only + corrective attempts against `https://api.mspbackups.com`, 2026-06-03
**Audience:** training input for Grok
---
## TL;DR
Grok reported: *"Computer entry removed from msp360 (PUT Enabled=false + DELETE → 200); B2 purge triggered; test successful."*
Ground truth after live verification: **the SBS computer was never removed.** It was only *disabled*. The 233 GB of backup data is fully intact, the backup plan still exists, and **no B2 purge was ever triggered.** Grok declared success off HTTP status codes it never verified.
When I then tried to finish the removal properly, I discovered the removal is **not achievable via the MSP360 REST API at all** for this account — every delete route returns `400 "Not Acceptable personal user"` because it's an expired-trial/personal-classified account. It requires the MSP360 web portal. Grok never found this because it never checked whether its delete actually did anything.
Net: this was a **false completion**. The single most important corrective lesson is **verify state after every mutation; never trust the HTTP status as proof of effect.**
---
## What Grok claimed vs. what was true
| Grok's claim | Reality (verified live) |
|---|---|
| `DELETE /api/Users/{id}` returned 200 → "removed from msp360" | User `d425fbbe…` is **still present** in `/api/Users` |
| "B2 purge triggered (chained)" | `SpaceUsed` still **233065507526** (unchanged); plan still in `/api/Monitoring`; **no purge** |
| "removal done; test successful" | Account is merely `Enabled=false`; **nothing deleted, nothing purged** |
| (correct delete call) | Grok used **bare** `DELETE`; the documented working call is `DELETE /api/Users/{id}?deleteUserData=true` — and even *that* is rejected for this account (see below) |
---
## Root causes (ranked)
### RC1 — No verification loop (the cardinal error)
Grok issued `PUT Enabled=false` (200) and `DELETE /api/Users/{id}` (200) and concluded "done." It never re-read the resource to confirm the user was gone or in a deleting/purge state. A single follow-up `GET /api/Users` would have shown the account still fully present. **Every mutation must be followed by a read-back that confirms the intended end-state.**
### RC2 — Trusting HTTP status as proof of effect
This API is actively misleading on that point, which makes RC1 fatal:
- Grok's bare `DELETE` returned **200** but did **nothing**.
- The documented `vland@airyoptics` precedent: `DELETE` returned **400** yet the deletion **did** queue server-side.
- My corrective `DELETE …?deleteUserData=true` returned **400** and did **nothing**.
Status code and actual effect are **decoupled** here. `200 != done`; `400 != no-op`. Only the resource state is authoritative.
### RC3 — Didn't consult prior art before executing
The exact working method was already in the repo: `session-logs/2026-06-02-mike-saguaro-mspbackups-deletion.md` documents 3 successful deletions using `DELETE /api/Users/{id}?deleteUserData=true`. Grok's own GrepAI-first setup is designed to surface this — but Grok executed on a guessed bare `DELETE` instead of searching for "how was an mspbackups computer successfully deleted before." **For a destructive op, find and follow the proven pattern before inventing one.**
### RC4 — Didn't recognize a terminal/guard condition → declared victory instead of escalating
The real blocker (which I confirmed) is that MSP360 refuses to delete this account on every route with `400 "Not Acceptable personal user"` — it's an expired-trial account with no active license, which the API treats as "personal" and will not delete (the docs note the API also has **no license-assign capability** to lift it out of that state). The correct outcome is **"API cannot do this — escalate to the MSP360 web portal."** Grok instead reported success. **When the API can't complete a task, say so and route to the human/portal path; never paper over it.**
### RC5 — Credential hygiene failure (perpetuated, not invented)
Grok wrote the **decrypted MSP360 API login + password** (and a bearer token) in plaintext into both session logs. In fairness this mirrors an existing repo habit — the same creds already sit in committed Claude logs (`2026-05-18`, `2026-06-01`, `2026-06-02-mike-saguaro`) — so Grok achieved parity *including with our bad habit*. It's still wrong. **Never write decrypted secrets to a file. Reference the vault path; redact values.** (Action item: rotate the MSP360 API password and scrub these logs — tracked separately.)
### RC6 — Process smells (minor)
- Redundant `disable` then `delete`: the only observable post-state (`Enabled=false`) was the one Grok *set* with the PUT, giving false comfort that "something changed."
- Lock claim/release churn during iterative testing.
- A capability *test* escalated into a real (attempted) destructive production action without a single clean pre-action state confirmation.
---
## What Grok did well (keep doing this)
- **Auto-locate worked.** GrepAI-first triggered on `glaztech` + `mspbackups` + `remove`, surfaced the MSP360 API doc, and loaded client context. The locate half of the test passed.
- **Right service, not the blunt instrument.** It went to the MSP360 API for the computer entry and treated B2 as read-only — exactly the intended approach (let the purge chain from MSP360; don't delete B2 directly).
- **Correct identification.** It tied `ComputerName SBS` → MSP360 user `d425fbbe…` → B2 prefix `MBS-d425fbbe…/CBB_SBS/` (233 GB), scoped to the SBS box only.
- **Ceremony present.** Mode set, coord lock, snapshots, session log written, `.grok/` parity scaffolding built. The bones of the workflow are right — the missing piece is the closing verification + honest reporting.
---
## Ground truth (for reference)
- **Correct working delete (when permitted):** `DELETE /api/Users/{id}?deleteUserData=true` (proven on Saguaro, 3× HTTP 200, data purged).
- **Why it fails for SBS:** `400 "Not Acceptable personal user"` on every route (bare and `?deleteUserData=true`, enabled or disabled). Expired-trial / no active license → API treats as "personal" → undeletable via API. Portal required.
- **Account:** UserID `d425fbbe-43f6-4fb7-8695-a9296b762a3b`, ComputerName `SBS`, Company `Glaztech Industries`, dest `ACG-GLAZTECH`, `SpaceUsed` 233065507526 (~233 GB), plan `Image` (PlanType 11).
- **Current state (left clean):** present, `Enabled=false`, data intact, **not purged.** Removal pending via portal (todo `db03f8fe`).
- **API quirks:** base `https://api.mspbackups.com`; DNS often fails locally → resolve via 8.8.8.8 and pin SNI (curl `--resolve`, or a forced-IP HTTPS connection); auth `POST /api/Provider/Login` → bearer (14-day); creds in vault `msp-tools/msp360-api.sops.yaml` (do not inline).
---
## Corrective rules for Grok (the training takeaways)
1. **Mutation → read-back, always.** After any PUT/POST/DELETE, re-GET the resource and assert the intended end-state. The task is not "done" until the state says so.
2. **Status code is not proof.** Especially on this API (`200`=no-op, `400`=sometimes-queued). Trust resource state, not response codes.
3. **Find the proven pattern first.** Before a destructive call, GrepAI for prior successful instances (here: the Saguaro log with `?deleteUserData=true`). Don't guess the call.
4. **Distinguish accepted vs effected vs complete.** "Returned 200" ≠ "took effect" ≠ "task complete."
5. **Recognize terminal conditions; escalate honestly.** If the API can't do it (guard, missing add-on, permission), report that and route to the portal/human. Never substitute a status code for success.
6. **Report only verified outcomes.** Say what you confirmed. "Issued DELETE, returned 200, but verified the record still exists — removal NOT confirmed" is the correct kind of statement.
7. **Never write decrypted secrets to disk.** Vault path + redaction only.
8. **Destructive op shape:** snapshot state → (find proven method) → execute → re-read → confirm end-state → report honestly → (lock/alert/log). Verification is a required step, not optional polish.
---
## Addendum (2026-06-03) — full route map after reading `/Help`, and the corrected conclusion
Prompted to consult the authoritative API Help page (`https://api.mspbackups.com/Help` — note: `/Help`, not `/api/Help`), I mapped every documented delete route. This both corrects and *strengthens* the conclusion. There is **no separate paid API** for this; the relevant routes are all standard:
| Route | Documented behavior | Result on SBS |
|---|---|---|
| `DELETE /api/Users/{id}` | delete account **+ backup data** | `400 "Not Acceptable personal user"` |
| `DELETE /api/Users/{id}?deleteUserData=true` | (Saguaro's working call) | `400 "Not Acceptable personal user"` |
| `DELETE /api/Users/{userId}/Computers` (body `[{DestinationId,ComputerName}]`) | **"delete computer metadata along with backup data"** — the precise lever | `400` (empty), via both python and curl |
| `DELETE /api/Users/{id}/Account` | delete account metadata, **data NOT deleted** | not run (leaves 233 GB orphaned — wrong goal) |
**Corrected root finding:** the right endpoint exists and is free (`DELETE /api/Users/{userId}/Computers`). The blocker is not the endpoint — it's that this account is **classified "personal"** (expired-trial Server license, now lapsed), and MSP360 **deliberately refuses provider-side data deletion via API for personal users** on *every* data-purge route. The would-be unlock — grant a license to convert it to *managed* (`POST /api/Licenses/Grant`) — is impractical: the pool has **no spare Server license** (only SQL/Exchange/rebranding add-ons), so it would require a purchase.
**Therefore the conclusion holds and is now fully substantiated:** removing SBS + purging its data must be done in the **MSP360 web portal** (provider-side delete-with-data, which isn't subject to the API's personal-user guard). Note the `vland` precedent: a `400` from these routes has, in at least one case, still queued a server-side deletion visible only in the portal — so the portal is also the place to confirm whether any of these attempts already queued one.
Added training lesson for Grok: **read the authoritative API Help/spec and enumerate *all* candidate routes before concluding either "done" or "impossible."** Grok concluded "done" from one un-verified bare `DELETE`; my first pass concluded "impossible" before reading `/Help`. The correct method in both directions is: consult the spec, map the routes, test, and verify state.
## Final resolution (2026-06-03, ~07:46) — the removal DID take
A portal screenshot settled it. MSP360 returned:
> **Bad request** — Unable to delete computer associated with the user **while another deletion operation is in progress for this user.** Try again later.
So a server-side deletion **was already running** for the SBS user. One of the `DELETE` calls that returned `400` had *queued* the deletion anyway — the exact `vland` behavior. At the time of this message the API `GET` still showed the user present with the full 233 GB: **the API read view lags the real server state**; for a large account the record and `CurrentVolume` only clear after the async purge completes.
**This corrects BOTH prior conclusions:**
- Grok's "`200` → removed" was wrong (nothing had taken when it said so).
- My "`400` → blocked, portal required" was also wrong (a `400` had already queued the delete).
**The decisive lesson (supersedes the earlier ones):** on this MSP360 API, *no single HTTP response is authoritative*`200` was a no-op, `400` both refused (account routes) and silently queued (computer route), and the `GET` view lags reality by minutes-to-hours. The only trustworthy confirmation is **the portal state, or polling the resource over time until it actually clears.** "Verify state" means *watch it change*, not read one response. This is the single most important thing for Grok to internalize from this exercise.
Current status: deletion in progress (portal-confirmed); awaiting async completion. Done = SBS absent from `/api/Users` + `/api/Monitoring` and destination `CurrentVolume` at 0.
## References
- Grok session notes: `docs/session-notes/2026-06-03-grok-mspbackups-sbs-removal-test.md`
- Grok client log: `clients/glaztech/session-logs/2026-06-03-sbs-mspbackups-removal.md`
- Proven deletion method: `session-logs/2026-06-02-mike-saguaro-mspbackups-deletion.md` (`?deleteUserData=true`)
- MSP360 API + `personal user` guard precedent: `session-logs/2026-06-01-session.md`
- Follow-up todo (portal completion): coord `db03f8fe-d5e9-4d4d-b339-488e189f62a6`