harness: fleet-wide functional-error + correction + friction logging

Add .claude/scripts/log-skill-error.sh — the canonical agent error log helper
(writes errorlog.md in DATE | MACHINE | skill | [type] error format, soft-fails).
Three categories: execution failures (default), user corrections (--correction),
and preventable self-inflicted friction (--friction; cite ref= when it repeats a
documented gotcha). Goal: stop paying tokens twice for the same avoidable mistake.

- CLAUDE.md: make logging mandatory for all skills + corrections + friction.
- skill-creator: new skills must wire in the helper (guidance + checklist).
- Retrofit every skill script's genuine failure branches to call the helper
  (b2/bitdefender/mailprotector/packetdial/coord python CLIs; remediation-tool
  + onboard365 bash; vault, rmm-auth, post-bot-alert, agy, grok, 1password,
  run-onboarding-diagnostic). Handled conditions + self-tests left alone.
- errorlog.md: broaden header to cover skills + harness + corrections; seed this
  session's corrections (INKY, Mail.Send token-audience, omnibox-strictness) and
  friction (git-bash /tmp, env-persistence, argv-limit, PowerShell var-case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-15 11:39:43 -07:00
parent 927a06a0cf
commit 9960da5f9a
29 changed files with 388 additions and 36 deletions

View File

@@ -104,6 +104,8 @@ MODE="${1:-}"; shift 2>/dev/null || true
TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
PF="$TMP/prompt.txt"; OUT="$TMP/out.txt"; ERR="$TMP/err.txt"
REPO_ROOT="${CLAUDETOOLS_ROOT:-$(cd "$SCRIPT_DIR/../../../.." 2>/dev/null && pwd)}"
# Functional-error logger (skill name "agy"); soft-fails, never breaks the caller.
_logerr() { bash "$REPO_ROOT/.claude/scripts/log-skill-error.sh" "agy" "$@" >/dev/null 2>&1 || true; }
# gtimeout on macOS (brew coreutils), timeout elsewhere.
TIMEOUT_CMD="timeout"
@@ -191,6 +193,7 @@ emit_or_fail() { # print .response, or retry once on a transient empty turn, el
# Auth failures won't be fixed by a retry — report immediately.
if auth_failed; then
echo "[$SELF] Gemini auth error — run 'gemini' interactively and choose 'Login with Google', then retry." >&2
_logerr "gemini auth/login failure" --context "mode=$MODE"
exit 1
fi
# Gemini occasionally returns an empty turn (or absorbs a 429 backoff into the
@@ -202,11 +205,13 @@ emit_or_fail() { # print .response, or retry once on a transient empty turn, el
if [ -n "$txt" ]; then printf '%s\n' "$txt"; return 0; fi
if auth_failed; then
echo "[$SELF] Gemini auth error — run 'gemini' interactively and choose 'Login with Google', then retry." >&2
_logerr "gemini auth/login failure (after retry)" --context "mode=$MODE"
exit 1
fi
fi
echo "[$SELF] no response from gemini. stderr tail:" >&2
tail -3 "$ERR" >&2 2>/dev/null || true
_logerr "gemini returned no response (empty after retry)" --context "mode=$MODE err=$(tail -1 "$ERR" 2>/dev/null | tr -d '\n' | cut -c1-80)"
exit 1
}