harness: fleet-wide functional-error + correction + friction logging

Add .claude/scripts/log-skill-error.sh — the canonical agent error log helper
(writes errorlog.md in DATE | MACHINE | skill | [type] error format, soft-fails).
Three categories: execution failures (default), user corrections (--correction),
and preventable self-inflicted friction (--friction; cite ref= when it repeats a
documented gotcha). Goal: stop paying tokens twice for the same avoidable mistake.
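
For illustration only, the entry layout described above can be sketched as a small
shell function. This is not the helper's actual implementation: `format_entry` and its
positional parameters are hypothetical names chosen for this sketch, and the real
script presumably derives the date and machine name itself rather than taking them as
arguments.

```shell
# Hypothetical sketch of the errorlog.md line layout; NOT the real log-skill-error.sh.
# Args: date machine skill brief [type] [ctx]
# The function body runs in a subshell, so its variables do not leak to the caller.
format_entry() (
  date=$1; machine=$2; skill=$3; brief=$4; type=${5:-}; ctx=${6:-}
  line="$date | $machine | $skill | "
  # An empty type means a plain execution failure, which carries no [type] tag.
  if [ -n "$type" ]; then line="${line}[$type] "; fi
  line="${line}${brief}"
  # Optional context, e.g. ref=<documented gotcha> for repeated friction.
  if [ -n "$ctx" ]; then line="${line} [ctx: $ctx]"; fi
  printf '%s\n' "$line"
)

format_entry "2026-06-15" "GURU-5070" "bash/tmp-path" \
  "jq could not read /tmp/x.json" friction "ref=feedback_tmp_path_windows"
```

Leaving the type empty yields the untagged form used for execution failures, which
matches the three-category scheme in this commit.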

- CLAUDE.md: make logging mandatory for all skills + corrections + friction.
- skill-creator: new skills must wire in the helper (guidance + checklist).
- Retrofit every skill script's genuine failure branches to call the helper
  (b2/bitdefender/mailprotector/packetdial/coord python CLIs; remediation-tool
  + onboard365 bash; vault, rmm-auth, post-bot-alert, agy, grok, 1password,
  run-onboarding-diagnostic). Handled conditions + self-tests left alone.
- errorlog.md: broaden header to cover skills + harness + corrections; seed this
  session's corrections (INKY, Mail.Send token-audience, omnibox-strictness) and
  friction (git-bash /tmp, env-persistence, argv-limit, PowerShell var-case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 11:39:43 -07:00
parent 927a06a0cf
commit 9960da5f9a
29 changed files with 388 additions and 36 deletions


@@ -1,14 +1,37 @@
 # Error Log
-Brief records of task-execution errors across the fleet, used to improve skills and the
-command harness. Append newest entries at the top. Keep each entry to 1-2 lines.
+Brief records of preventable, pattern-worthy events across the fleet used to improve
+skills, write better CLAUDE.md rules, and clean stale/misleading memory. The aim: never
+pay tokens twice for the same avoidable mistake. Append newest at the top; keep entries to
+1-2 lines. **Always write via the helper, never by hand:**
+`bash .claude/scripts/log-skill-error.sh "<skill/context>" "<brief>" [--correction|--friction] [--context "k=v"]`
-Format: `YYYY-MM-DD | MACHINE | command/skill | error (brief)`
+Format: `YYYY-MM-DD | MACHINE | command/skill/context | [type] error (brief) [ctx: ...]`
+Categories (the `[type]` tag): _(none)_ = skill/command execution failure ·
+`[correction]` = user corrected an improper assumption I made ·
+`[friction]` = preventable self-inflicted token-waste (harness/env/tool misuse; cite a
+`ref=` in ctx when it repeats a documented gotcha — that flags a rule/memory to strengthen).
 ---
 <!-- Append entries below this line -->
+2026-06-15 | GURU-5070 | powershell/var-case | [friction] PowerShell vars are case-INSENSITIVE: $gUid silently overwrote $guid (GPO id), Set-ADObject hit a bad DN and left GPT.ini/AD versionNumber inconsistent until fixed. Never rely on case to distinguish PS variables
+2026-06-15 | GURU-5070 | python/argv-limit | [friction] passed full /api/agents JSON (248 agents) as a python CLI arg -> 'Argument list too long' on Windows. Pipe large payloads via stdin, not argv
+2026-06-15 | GURU-5070 | bash/env-persist | [friction] re-derived RMM token every call after $TOKEN/$RMM vanished between Bash tool calls - shell env does NOT persist across calls; must re-eval auth (or chain) in the same command
+2026-06-15 | GURU-5070 | bash/tmp-path | [friction] wrote curl -o /tmp/x.json then jq read it back and failed (No such file) - Git-Bash vs Write/tool /tmp resolve differently. Pipe directly or use repo-relative paths. REPEAT of documented gotcha [ctx: ref=feedback_tmp_path_windows]
+2026-06-15 | GURU-5070 | DMARC / DNS | [correction] assumed ACG's own INKY rua convention (reports-sg.inkydmarc.com) applied to a client domain; only use the INKY rua if THAT client is onboarded to INKY - otherwise plain p=none or a real mailbox
+2026-06-15 | GURU-5070 | remediation-tool (sendMail) | [correction] assumed none of the consented apps could send mail and started granting Graph Mail.Send; the Exchange Operator app ALREADY had Graph Mail.Send - I was decoding the EXO-audience token, not a Graph-audience token. Mint a Graph token for the app before concluding a permission is missing
+2026-06-15 | GURU-5070 | rmm-search | [correction] assumed the CLI search must replicate the UI Omnibox scoreMatch exactly; user wants a FLEXIBLE forgiving multi-field search optimized for first-try correctness, not UI parity
 2026-06-15 | GURU-BEAST-ROG | /syncro (comment edit) | Syncro API does not expose a comment-edit or comment-delete endpoint — once posted, comments can only be modified via the GUI. Bot posted an internal resolution note with an unwanted "Performed by: ClaudeTools Discord Bot" line and could not remove it programmatically. Remediation needed: either suppress bot-attribution lines from internal notes by default, or add a GUI-edit step to the workflow when the note needs correction.
 2026-06-14 | GURU-5070 | mailbox skill (Graph token) | FABB app `fabb3421` (Claude-MSP-Access / "Cloud MSP Access") token request returned AADSTS700016 — app/SP no longer present in azcomputerguru.com tenant (deleted; gotchas.md already marked it deprecated). Blocks /mailbox + the M365 contacts task. Verified the remediation suite (live, ACG tenant) carries NO Mail.Send/Mail.ReadWrite/Contacts scopes (investigator has Mail.Read only) — so a straight repoint can't restore mailbox-send/contacts. Pending Mike decision: stand up a single-tenant ACG-internal mailbox app vs. add scopes to a suite tier. [2026-06-15] Docs hardened — gotchas.md now marks fabb3421 DELETED with the Mail/Contacts-scope blast radius + flags the 3 legacy "old app only" tenants (Valleywide/Dataforth/Cascades) as now having NO working remediation app (migration URGENT); mailbox.md carries a BLOCKED/AADSTS700016 banner. DECISION 2026-06-15 (Mike): Mail.Send goes into the suite (Exchange Operator tier) since its real use is IR victim-notification during mailbox takeovers; add Mail.Send to the exchange-op manifest + consent, repoint mailbox.md to exchange-op. Implementation not yet executed (production app change, needs go).