claudetools/session-logs/2026-05-27-guru-kali-coord-hook-fix.md
Mike Swanson 8846ed7b05 sync: auto-sync from GURU-KALI at 2026-05-27 20:42:46
Author: Mike Swanson
Machine: GURU-KALI
Timestamp: 2026-05-27 20:42:46
2026-05-27 20:42:47 -07:00


Session Log — Coord UserPromptSubmit hook fix (broadcasts + JSON robustness)

Separate follow-on work to today's earlier log, 2026-05-27-guru-kali-identity-phase2-followups.md. The hooks coord to-do (c28d1baa) and the Mac-confirmed install-hooks status from earlier today are unchanged.

User

  • User: Mike Swanson (mike)
  • Machine: GURU-KALI
  • Role: admin
  • Session span: 2026-05-27 ~17:14 MST through ~20:40 MST

Session Summary

A run of routine syncs through the afternoon and evening pulled a busy day of fleet activity (30 commits, then 3, then 2) — /mailbox skill, Autotask, RMM Phase-2 deploy, rmm-audit Agent F, GuruScan module refactor, Quantum M365 onboarding, Syncro delivery-channel rule, and three vault entries. The Mac picked up my Phase-2 step-4 hand-off cleanly (2c12bd2 wired sync.sh + syncro.md to read from identity.json, 0e2629a cleaned CLAUDE.md). Each sync.sh rebase merged my attribution guards intact with the Mac's concurrent rewrites — verified after every round.

Mike flagged the coord UserPromptSubmit hook was broken: through several syncs neither he nor I had received any coord messages despite real fleet activity. Diagnosis: the hook (.claude/scripts/check-messages.sh) only queried to_session=$SESSION and to_session=$USER_ALIAS. It never asked for to_session=ALL_SESSIONS, which the fleet uses for broadcasts. Four real broadcasts were sitting unread (LHM/WinRing0 in-flight then done, SPEC-010 features+bugs, SPEC-011 ARP). The hook ran exit 0 every time and printed nothing — the gap was coverage, not error.
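The coverage gap amounts to a recipient list that was one entry short. A minimal sketch, assuming a hypothetical helper named coord_query_urls and recipient values that are already URL-safe. The endpoint and query parameters are the ones recorded in this log; the real script may build its requests differently.

```shell
# Endpoint as recorded in this log.
COORD_API="http://172.16.3.30:8001/api/coord/messages"

# Hypothetical helper: prints one query URL per recipient the hook should poll.
coord_query_urls() {
    local session="$1" user_alias="$2" recipient
    # The third entry is the one the original hook never asked for.
    for recipient in "$session" "$user_alias" "ALL_SESSIONS"; do
        printf '%s?to_session=%s&unread_only=true\n' "$COORD_API" "$recipient"
    done
}
```

Each printed URL would then be fetched with curl -s, as in the hook test pattern later in this log.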

Implemented the fix in three rounds, each surfacing a new bug. First pass added an ALL_SESSIONS query and a per-machine seen-file (.claude/coord-broadcasts-seen, gitignored) so the hook never PUTs /read on broadcasts and never re-surfaces ones already shown. Testing showed silent failure: jq aborted with "Invalid string: control characters from U+0000 through U+001F must be escaped". The coord API emits some message bodies with raw unescaped control chars — invalid per RFC 8259. Added a sanitize_json helper that round-trips through python3 json.loads(strict=False) (which accepts them) and re-emits valid JSON.
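The body of the real sanitize_json helper is not reproduced in this log, so the following is only a sketch of the round-trip it describes: parse leniently with strict=False, then re-serialize so control characters come out escaped. Reading from stdin and writing to stdout is an assumption about its interface.

```shell
# Sketch of a sanitize_json-style helper (interface assumed: stdin in, stdout out).
sanitize_json() {
    python3 -c '
import json, sys
# strict=False lets the decoder accept raw control characters inside strings.
data = json.loads(sys.stdin.read(), strict=False)
# Re-serializing escapes those characters, so the output is strictly valid JSON.
json.dump(data, sys.stdout)
'
}
```

A typical call would look like curl -s "$url" | sanitize_json | jq ..., where $url is a placeholder. If the input is not JSON at all, python3 exits non-zero and prints nothing to stdout, so callers still need an empty-result guard.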

Second round still silent: bash echo "$json_var" was interpreting backslash escapes inside the JSON (e.g. \n in body text becoming a literal newline) and corrupting it before jq saw it. Replaced every echo "$VAR" | jq with printf '%s' "$VAR" | jq throughout — the original script had the same latent bug, untriggered only because personal messages were usually empty. Third round still empty seen-file: the jq filter ($seen | index(.id)) evaluated .id against $seen (the array) not against the current message — classic jq scoping. Fixed by binding the message: map(. as $m | select(($seen | index($m.id)) == null) ...).
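The echo-versus-printf difference can be reproduced without the hook. In this sketch, printf '%b' stands in for any echo that expands backslash sequences (whether echo does so depends on the shell and its options), while printf '%s' copies its argument unchanged. The sample JSON is made up for illustration.

```shell
# A JSON document whose string value holds the two characters backslash + n.
json='{"body":"line1\nline2"}'

# %s copies the argument character for character.
safe="$(printf '%s' "$json")"

# %b expands backslash sequences, as an escape-expanding echo would:
# the \n becomes a raw newline, which is not valid inside a JSON string.
mangled="$(printf '%b' "$json")"

printf 'safe chars:    %s\n' "${#safe}"      # 23
printf 'mangled chars: %s\n' "${#mangled}"   # 22
```

The one-character difference is the escape sequence collapsing into a single raw newline, which a strict JSON parser rejects.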

Also fixed an interim bug introduced by my own refactor: ${result:-{}} as a bash default — the first } closes the parameter expansion early, producing {...}} and breaking jq. Replaced with an explicit [ -z "$result_safe" ] && result_safe='{"messages":[]}' guard. End-to-end test: run 1 surfaces 4 broadcasts and writes 4 UUIDs to the seen-file; run 2 is silent; server still shows 4 unread (other machines unaffected). Code Review Agent reviewed the whole change and returned APPROVED (no defects, only cosmetic observations). Shipped as a35b583.
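The brace-default trap is easy to reproduce in isolation. The sketch below uses a made-up non-empty result to show the stray brace, followed by the explicit guard in the form this log describes. Variable names follow the log, but the surrounding code is illustrative.

```shell
result='{"messages":[]}'

# Trap: the expansion ends at the FIRST closing brace, so this is
# ${result:-{} followed by a literal } character.
broken="${result:-{}}"

# Explicit guard: only substitute the default when the value is empty.
result_safe="$result"
[ -z "$result_safe" ] && result_safe='{"messages":[]}'

printf 'broken: %s\n' "$broken"        # broken: {"messages":[]}}
printf 'safe:   %s\n' "$result_safe"   # safe:   {"messages":[]}
```

When result is empty the broken form happens to yield {}, which is why the bug only shows up once the API actually returns data.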

Key Decisions

  • Per-machine local seen-tracking for broadcasts, NOT a server-side read_at flip — the schema has one read_at field, so PUT /read on a broadcast would clobber it for every other recipient that hasn't seen it.
  • Sanitize JSON client-side via python strict=False rather than fixing the server. The API bug (unescaped control chars in bodies) is a separate concern; the hook needs to be robust regardless of server quality.
  • Convert ALL echo "$JSON" | jq to printf '%s' "$JSON" | jq across the whole script (which syncs fleet-wide), not just the broadcast paths. The bug was latent in the original code; fixing it everywhere prevents future surprise corruption on any message with backslash escapes.
  • Mandatory code review (CLAUDE.md rule) before syncing the hook to the fleet — this script runs on every prompt on every machine, so a regression breaks coord-message surfacing globally. Review came back APPROVED.
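The local seen-tracking decision can be sketched as two small helpers. The function names and the one-ID-per-line file format are assumptions for illustration; only the file path and its append-only nature come from this log.

```shell
# Path as recorded in this log; overridable for testing.
SEEN_FILE="${SEEN_FILE:-.claude/coord-broadcasts-seen}"

# Hypothetical helper: succeeds if the message ID is already recorded locally.
broadcast_seen() {
    [ -f "$SEEN_FILE" ] && grep -qxF -- "$1" "$SEEN_FILE"
}

# Hypothetical helper: records an ID once; the file is only ever appended to.
mark_broadcast_seen() {
    broadcast_seen "$1" || printf '%s\n' "$1" >> "$SEEN_FILE"
}
```

Because nothing here touches the server, read_at stays untouched and other machines still see the broadcast as unread.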

Problems Encountered

  • Hook silent despite real unread: the hook only queried personal + alias, never ALL_SESSIONS. Fixed by adding the third query.
  • jq parse error on broadcast bodies: server returned message JSON with unescaped U+0001 control chars (and similar). Fixed with the sanitize_json python round-trip helper.
  • echo backslash escape corruption: bash echo "$json" expanded \n/\" inside JSON string values, breaking the content before jq could parse it. Fixed by switching every JSON pipe to printf '%s'. Latent in the original script.
  • jq scoping bug in filter: ($seen | index(.id)) evaluated .id against $seen not the message. Fixed by binding the message to $m and using $m.id / $m.from_session.
  • Bash brace-default trap: ${result:-{}} parses as ${result:-{} + } → extra brace → jq error. Replaced with an explicit guard.
  • Misleading test runs: intermediate tests showed empty output and an empty seen-file. Tracing through an awk-modified copy of the script in /tmp made SCRIPT_DIR resolve to /, giving a misleading SEEN_FILE=//coord-broadcasts-seen. A clean re-test from the real script path confirmed the fix worked end-to-end.

Configuration Changes

  • .claude/scripts/check-messages.sh — added sanitize_json helper; added ALL_SESSIONS query + per-machine seen-file logic; converted all echo "$JSON" | jq to printf '%s' "$JSON" | jq (12 sites incl. the locks block); fixed jq scoping with . as $m binding; replaced ${result:-{}} brace-trap with explicit guard.
  • .gitignore — added .claude/coord-broadcasts-seen (per-machine local seen-tracking).
  • .claude/memory/project_mac_gururmm_setup_pending.md — caveat updated earlier today to [CONFIRMED PENDING 2026-05-27] based on Mac coord reply.
  • projects/msp-tools/guru-rmm — submodule pointer auto-advanced by Phase 1a to 6326ec6.

Credentials & Secrets

None created or discovered.

Infrastructure & Servers

  • Coord API http://172.16.3.30:8001/api/coord/messages — broadcasts use to_session=ALL_SESSIONS (magic recipient string). Single server-side read_at field per message (no per-recipient tracking).
  • 4 unread ALL_SESSIONS broadcasts at fix-time: e032f029 (LHM v0.6.46 done), 620af7f5 (LHM in flight), 7bdc6d3c (SPEC-011 ARP), 3fe667e1 (SPEC-010 UX).
  • Hook config in .claude/settings.json → UserPromptSubmit → bash .claude/scripts/check-messages.sh (15s timeout). Unchanged.
  • Per-machine seen-file: .claude/coord-broadcasts-seen (gitignored, append-only UUIDs).

Commands & Outputs

  • Hook test pattern:
    rm -f .claude/coord-broadcasts-seen
    bash .claude/scripts/check-messages.sh        # surfaces broadcasts, writes seen-file
    bash .claude/scripts/check-messages.sh        # silent (broadcasts in seen-file)
    curl -s "http://172.16.3.30:8001/api/coord/messages?to_session=ALL_SESSIONS&unread_only=true"
    # → still 4 unread on server: other machines unaffected
    
  • Reading API msg shape (no auth): curl -s "http://172.16.3.30:8001/api/coord/messages?to_session=ALL_SESSIONS&unread_only=true&limit=100" → returns {total, skip, limit, messages:[...]}. Bodies may contain raw control chars (RFC violation; client-side sanitize required).
  • jq scoping form: map(. as $m | select((($seen | index($m.id)) == null) and (($m.from_session | ascii_downcase | sub("\\.local/"; "/")) != $self))).
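The scoping fix can be tried on sample data. This is a reduced sketch: filter_unseen is a hypothetical wrapper, it covers only the seen-ID half of the filter, and it takes the seen list as a JSON array via --argjson, which may differ from how the real script passes it.

```shell
# Hypothetical wrapper: keeps only messages whose id is absent from the seen list.
# stdin: {"messages":[...]}   $1: JSON array of already-seen IDs
filter_unseen() {
    jq -c --argjson seen "$1" \
        '.messages | map(. as $m | select(($seen | index($m.id)) == null))'
}

# Example with made-up IDs:
#   printf '%s' '{"messages":[{"id":"a"},{"id":"b"}]}' | filter_unseen '["a"]'
#   → [{"id":"b"}]
```

Without the . as $m binding, .id inside ($seen | index(.id)) is looked up on the $seen array rather than on the message, which is the bug described above.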

Pending / Incomplete Tasks

Per Mike (earlier today, 2026-05-27): these are not this instance's tasks — recorded only as a fleet reference:

  • Mac install-hooks.sh — owned by Mac via coord to-do c28d1baa.
  • GURU-5070 pubkey on Pluto, Ollama-fallback rollout to other machines — owned elsewhere.

No open items for GURU-KALI.

Reference Information

  • Commits this round: 45126c0/01af931 (submodule auto-advances), 2678d38 (mac-pending caveat), a35b583 (coord hook fix). Pulled: 30 commits + 3 vault (RMM Phase 2, /mailbox, Autotask, Quantum M365, GuruScan refactor); then 3 (LHM removal submodule advance + SPEC-011 + Howard sync); then 2 (Mac + Howard auto-sync); then 1 (own hook fix push).
  • Code Review Agent verdict on hook fix: APPROVED. Agent id a7d01ffc8547e8dff.
  • Files: .claude/scripts/check-messages.sh, .gitignore. Hook config: .claude/settings.json.
  • Coord broadcasts surfaced during fix: e032f029, 620af7f5, 7bdc6d3c, 3fe667e1.
  • Vault pulled: clients/quantumwms/m365-breakglass.sops.yaml (new), clients/sif-oidak/laptops.sops.yaml (new), msp-tools/autotask.sops.yaml (updated).