diff --git a/session-logs/2026-05-27-guru-kali-coord-hook-fix.md b/session-logs/2026-05-27-guru-kali-coord-hook-fix.md
new file mode 100644
index 0000000..5a40416
--- /dev/null
+++ b/session-logs/2026-05-27-guru-kali-coord-hook-fix.md
@@ -0,0 +1,84 @@
+# Session Log — Coord UserPromptSubmit hook fix (broadcasts + JSON robustness)
+
+Distinct work follow-on to today's earlier `2026-05-27-guru-kali-identity-phase2-followups.md`. The hooks coord-todo (`c28d1baa`) plus the Mac-confirmed install-hooks status from earlier today are unchanged.
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-KALI
+- **Role:** admin
+- **Session span:** 2026-05-27 ~17:14 MST through ~20:40 MST
+
+## Session Summary
+
+A run of routine syncs through the afternoon and evening pulled a busy day of fleet activity (30 commits, then 3, then 2) — `/mailbox` skill, Autotask, RMM Phase-2 deploy, `rmm-audit` Agent F, GuruScan module refactor, Quantum M365 onboarding, Syncro delivery-channel rule, and three vault entries. The Mac picked up my Phase-2 step-4 hand-off cleanly (`2c12bd2` wired sync.sh + syncro.md to read from identity.json, `0e2629a` cleaned CLAUDE.md). Each sync.sh rebase merged my attribution guards intact with the Mac's concurrent rewrites — verified after every round.
+
+Mike flagged the coord UserPromptSubmit hook was broken: through several syncs neither he nor I had received any coord messages despite real fleet activity. Diagnosis: the hook (`.claude/scripts/check-messages.sh`) only queried `to_session=$SESSION` and `to_session=$USER_ALIAS`. It never asked for `to_session=ALL_SESSIONS`, which the fleet uses for broadcasts. Four real broadcasts were sitting unread (LHM/WinRing0 in-flight then done, SPEC-010 features+bugs, SPEC-011 ARP). The hook ran exit 0 every time and printed nothing — the gap was coverage, not error.
+
+Implemented the fix in three rounds, each surfacing a new bug. First pass added an `ALL_SESSIONS` query and a per-machine seen-file (`.claude/coord-broadcasts-seen`, gitignored) so the hook never PUTs `/read` on broadcasts and never re-surfaces ones already shown. Testing showed silent failure: `jq` aborted with "Invalid string: control characters from U+0000 through U+001F must be escaped". The coord API emits some message bodies with raw unescaped control chars — invalid per RFC 8259. Added a `sanitize_json` helper that round-trips through `python3 json.loads(strict=False)` (which accepts them) and re-emits valid JSON.
+
+Second round still silent: `bash echo "$json_var"` was interpreting backslash escapes inside the JSON (e.g. `\n` in body text becoming a literal newline) and corrupting it before jq saw it. Replaced every `echo "$VAR" | jq` with `printf '%s' "$VAR" | jq` throughout — the original script had the same latent bug, untriggered only because personal messages were usually empty. Third round still empty seen-file: the jq filter `($seen | index(.id))` evaluated `.id` against `$seen` (the array) not against the current message — classic jq scoping. Fixed by binding the message: `map(. as $m | select(($seen | index($m.id)) == null) ...)`.
+
+Also fixed an interim bug introduced by my own refactor: `${result:-{}}` as a bash default — the first `}` closes the parameter expansion early, producing `{...}}` and breaking jq. Replaced with an explicit `[ -z "$result_safe" ] && result_safe='{"messages":[]}'` guard. End-to-end test: run 1 surfaces 4 broadcasts and writes 4 UUIDs to the seen-file; run 2 is silent; server still shows 4 unread (other machines unaffected). Code Review Agent reviewed the whole change and returned APPROVED (no defects, only cosmetic observations). Shipped as `a35b583`.
+
+## Key Decisions
+
+- Per-machine local seen-tracking for broadcasts, NOT a server-side `read_at` flip — the schema has one read_at field, so PUT /read on a broadcast would clobber it for every other recipient that hasn't seen it.
+- Sanitize JSON client-side via python `strict=False` rather than fixing the server. The API bug (unescaped control chars in bodies) is a separate concern; the hook needs to be robust regardless of server quality.
+- Convert ALL `echo "$JSON" | jq` to `printf '%s'` fleet-wide, not just the broadcast paths. The bug was latent in the original code; fixing it everywhere prevents future surprise corruption on any message with backslash escapes.
+- Mandatory code review (CLAUDE.md rule) before syncing the hook to the fleet — this script runs on every prompt on every machine, so a regression breaks coord-message surfacing globally. Review came back APPROVED.
+
+## Problems Encountered
+
+- **Hook silent despite real unread:** the hook only queried personal + alias, never ALL_SESSIONS. Fixed by adding the third query.
+- **jq parse error on broadcast bodies:** server returned message JSON with unescaped U+0001 control chars (and similar). Fixed with the `sanitize_json` python round-trip helper.
+- **echo backslash escape corruption:** bash `echo "$json"` expanded `\n`/`\"` inside JSON string values, breaking the content before jq could parse it. Fixed by switching every JSON pipe to `printf '%s'`. Latent in the original script.
+- **jq scoping bug in filter:** `($seen | index(.id))` evaluated `.id` against `$seen` not the message. Fixed by binding the message to `$m` and using `$m.id` / `$m.from_session`.
+- **Bash brace-default trap:** `${result:-{}}` parses as `${result:-{}` + `}` → extra brace → jq error. Replaced with an explicit guard.
+- **Misleading test runs:** intermediate tests showed empty output and empty seen-file. A trace via `awk`-modified script copy in `/tmp` made SCRIPT_DIR resolve to `/`, giving a misleading `SEEN_FILE=//coord-broadcasts-seen`. Clean re-test in the real script path confirmed the actual fix worked end-to-end.
+
+## Configuration Changes
+
+- `.claude/scripts/check-messages.sh` — added `sanitize_json` helper; added `ALL_SESSIONS` query + per-machine seen-file logic; converted all `echo "$JSON" | jq` to `printf '%s' "$JSON" | jq` (12 sites incl. the locks block); fixed jq scoping with `. as $m` binding; replaced `${result:-{}}` brace-trap with explicit guard.
+- `.gitignore` — added `.claude/coord-broadcasts-seen` (per-machine local seen-tracking).
+- `.claude/memory/project_mac_gururmm_setup_pending.md` — caveat updated earlier today to `[CONFIRMED PENDING 2026-05-27]` based on Mac coord reply.
+- `projects/msp-tools/guru-rmm` — submodule pointer auto-advanced by Phase 1a to `6326ec6`.
+
+## Credentials & Secrets
+
+None created or discovered.
+
+## Infrastructure & Servers
+
+- Coord API `http://172.16.3.30:8001/api/coord/messages` — broadcasts use `to_session=ALL_SESSIONS` (magic recipient string). Single server-side `read_at` field per message (no per-recipient tracking).
+- 4 unread `ALL_SESSIONS` broadcasts at fix-time: `e032f029` (LHM v0.6.46 done), `620af7f5` (LHM in flight), `7bdc6d3c` (SPEC-011 ARP), `3fe667e1` (SPEC-010 UX).
+- Hook config in `.claude/settings.json` → `UserPromptSubmit` → `bash .claude/scripts/check-messages.sh` (15s timeout). Unchanged.
+- Per-machine seen-file: `.claude/coord-broadcasts-seen` (gitignored, append-only UUIDs).
+
+## Commands & Outputs
+
+- Hook test pattern:
+  ```bash
+  rm -f .claude/coord-broadcasts-seen
+  bash .claude/scripts/check-messages.sh   # surfaces broadcasts, writes seen-file
+  bash .claude/scripts/check-messages.sh   # silent (broadcasts in seen-file)
+  curl -s "http://172.16.3.30:8001/api/coord/messages?to_session=ALL_SESSIONS&unread_only=true"
+  # → still 4 unread on server: other machines unaffected
+  ```
+- Reading API msg shape (no auth): `curl -s "http://172.16.3.30:8001/api/coord/messages?to_session=ALL_SESSIONS&unread_only=true&limit=100"` → returns `{total, skip, limit, messages:[...]}`. Bodies may contain raw control chars (RFC violation; client-side sanitize required).
+- jq scoping form: `map(. as $m | select((($seen | index($m.id)) == null) and (($m.from_session | ascii_downcase | sub("\\.local/"; "/")) != $self)))`.
+
+## Pending / Incomplete Tasks
+
+Per Mike (earlier today, 2026-05-27): these are not this instance's tasks — recorded only as a fleet reference:
+- Mac `install-hooks.sh` — owned by Mac via coord to-do `c28d1baa`.
+- GURU-5070 pubkey on Pluto, Ollama-fallback rollout to other machines — owned elsewhere.
+
+No open items for GURU-KALI.
+
+## Reference Information
+
+- Commits this round: `45126c0`/`01af931` (submodule auto-advances), `2678d38` (mac-pending caveat), `a35b583` (coord hook fix). Pulled: 30 commits + 3 vault (RMM Phase 2, /mailbox, Autotask, Quantum M365, GuruScan refactor); then 3 (LHM removal submodule advance + SPEC-011 + Howard sync); then 2 (Mac + Howard auto-sync); then 1 (own hook fix push).
+- Code Review Agent verdict on hook fix: APPROVED. Agent id `a7d01ffc8547e8dff`.
+- Files: `.claude/scripts/check-messages.sh`, `.gitignore`. Hook config: `.claude/settings.json`.
+- Coord broadcasts surfaced during fix: `e032f029`, `620af7f5`, `7bdc6d3c`, `3fe667e1`.
+- Vault pulled: `clients/quantumwms/m365-breakglass.sops.yaml` (new), `clients/sif-oidak/laptops.sops.yaml` (new), `msp-tools/autotask.sops.yaml` (updated).