spec: ClaudeTools harness optimization (3-way reviewed)

Optimize the harness (not projects) for accuracy/completeness with context pressure as a first-class constraint; token efficiency secondary. Authored as a Claude+Grok+ Gemini review (see review-3way.md): P0 reliability footguns (submodule-safe sync, serialized/staged wiki synthesis, syncro SSOT, warn-only guard), P1 context diet (one-line registry descriptions, CLAUDE CORE/EXTENDED, thin save/sync), P2 delegation re-tune, P3 knowledge tiering. Adds harness VERSION marker + OOB recovery as rollout safety. Python port split to a separate future spec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 07:32:45 -07:00
parent 0f02cae98c
commit bb7dc147ca
4 changed files with 402 additions and 0 deletions
--- a/specs/claudetools-harness-optimization/plan.md
+++ b/specs/claudetools-harness-optimization/plan.md
@@ -0,0 +1,161 @@
+# ClaudeTools Harness Optimization — Implementation Plan (3-way reviewed)
+
+> Spec created: 2026-06-08 | Status: not started | Priority: P1
+> Revised per `review-3way.md` (Claude + Grok + Gemini). Fleet-wide changes: every task
+> must be reversible, version-marked, and rollout-announced via coord.
+
+## Task 0: Commit this spec
+Commit `specs/claudetools-harness-optimization/` (incl. `review-3way.md`).
+
+## Task 0.5: Harness VERSION marker  [safety prerequisite — Grok]
+Files: `.claude/harness/VERSION` (new), CLAUDE.md CORE reference.
+- Single-line version string, bumped on every fleet-visible behavioral change (each P0
+  task). Sessions read it to detect partial rollout and give clear errors ("your save is
+  on the pre-Task-2 wiki path"). Enables conditional behavior during heterogeneous rollout.
+
+## Task 0.6: Out-of-band fleet recovery  [safety prerequisite — Gemini]
+Files: `.claude/scripts/force-pull-raw.sh` (new), distributed BEFORE any P0 change.
+- A raw, un-intercepted `git fetch + reset --hard origin/main` rescue path so a node can
+  recover if a P0 change bricks `sync.sh`/`save.sh` (the normal update vector). No hooks,
+  no guards, minimal deps. Broadcast its existence to the fleet first.
+
+---
+
+## P0 — Reliability footguns
+
+## Task 1+4 (co-developed on one branch — both touch staging)
+
+### Task 1: Submodule-safe sync  [P0]
+Files: `.claude/scripts/sync.sh`.
+- Replace blanket `git add -A` with submodule-aware staging: stage changes, then unstage
+  any submodule gitlink unless `--with-submodules` is passed
+  (`git reset -q HEAD projects/msp-tools/guru-rmm projects/msp-tools/guru-connect`).
+  Operator must NEVER need to detach the submodule to its pin before `/save`.
+- Verify: submodule on `main` (ahead of pin) -> `/sync` commits log/wiki/config but NOT
+  a gitlink bump; clean status after.
+
+### Task 4: Pre-commit guard — WARN-ONLY first  [P0, highest risk]
+Files: `.claude/scripts/harness-guard.sh` (new, standalone), later wired into `sync.sh`.
+- Detects: conflict markers in tracked files; unflagged submodule gitlink change;
+  vault-plaintext / obvious-secret patterns.
+- **Mode = WARN-ONLY** for the first 7-14 days fleet-wide (logs to a local file, never
+  blocks). Explicit `SKIP_HARNESS_GUARD=1` bypass that is logged + coord-broadcast when
+  used. Test matrix: synthetic bad inputs (incl. legitimate `<<<<<<<` in test files, new
+  SOPS edge cases) + the last N real commits, zero false positives, BEFORE promoting to
+  fatal. Promote to blocking only after a clean warn window.
+- Verify: warn fires on a planted conflict marker; clean commits pass; bypass logs.
+
+## Task 2: Decouple + stage wiki synthesis  [P0]
+Files: `.claude/commands/save.md` (drop inline recompile), `.claude/commands/wiki-compile.md`,
+new staged runner.
+- Sequence (do NOT excise the old path until the new one is proven):
+  1. Build the serialized synthesis path: gated by a coord lock (`lock claim claudetools
+     wiki/<slug>`) with **orphan-lock eviction** (TTL + dead-session reclaim) and a
+     coord-down fallback (skip + warn, never block the save).
+  2. **Staging pattern** (Gemini): the job writes proposed article updates to
+     `.claude/wiki_staging/<slug>.md` + emits a notification; an operator or a focused
+     validation agent reviews the diff and applies it — NO blind background auto-merge
+     into the live article (avoids progressive structural corruption). Preserve
+     Patterns/History; update `last_verified` + `source_logs`.
+  3. Prove it under concurrent saves from 2 machines (no conflict, one clean update).
+  4. ONLY THEN remove the inline Sonnet recompile from `/save` (save = write log + push).
+- Verify: two near-simultaneous `/save`s -> no wiki conflict markers; staged diff
+  reviewed + applied cleanly.
+
+## Task 3a: Confirm billing SSOT truth  [P0 prerequisite — Grok]
+- Hard prerequisite before Task 3 edits: confirm with Mike which is operational truth —
+  `/syncro` command's `add_line_item` (current practice) vs the `time-entry-protocol`
+  standard's `timer_entry`. Picking wrong bakes a permanent contradictory money rule.
+
+## Task 3: Single source of truth — Syncro billing  [P0]
+Files: `.claude/commands/syncro.md`, `.claude/standards/syncro/time-entry-protocol.md`.
+- Per Task 3a's decision, make one the source and the other a reference; delete the
+  contradictory/duplicated prose. Add a "command-restates-standard" lint to `/self-check`.
+- Verify: one rule, two consistent references.
+
+---
+
+## P1 — Context diet
+
+## Task 5a: "Where is text injected" inventory  [P1 prerequisite — Grok]
+- One-off pass cataloguing every repeated injection: CLAUDE.md, the skill/command
+  registry, system-reminders, MEMORY.md, CONTEXT.md loads, hook-injected prose, big
+  command bodies. Measure tokens each. This scopes Tasks 5/6/7 so the diet does not
+  under-deliver. One-off measurement (Grok: skip ongoing telemetry).
+
+## Task 5: One-line registry descriptions  [P1] (biggest token win)
+Files: every `.claude/skills/*/SKILL.md` `description:`; every `.claude/commands/*.md`
+description.
+- Truncate each registry description to ONE sentence; move verbose triggers/examples into
+  the SKILL.md body (loads only on activation). Worst first: rmm-audit, gc-audit,
+  packetdial, bitdefender, mailprotector, remediation-tool.
+- Verify: registry-list injection size cut >=50% (measured per Task 5a); skills still invoke.
+
+## Task 6: CLAUDE.md -> CORE + EXTENDED  [P1]
+Files: `.claude/CLAUDE.md` (lean CORE) + `.claude/CLAUDE_EXTENDED.md` (on-demand).
+- CORE (<=1.5k tokens): identity/multi-user, security/vault, model routing, delegation
+  thresholds (Task 9), coord-message-in-system-reminder rule, harness VERSION ref.
+  Everything else -> EXTENDED, loaded when relevant. Add a discoverability pointer so
+  on-demand detail is findable (Grok).
+- Verify: CORE <=1.5k tokens; nothing safety-critical lost.
+
+## Task 7: Thin `/save` and `/sync`  [P1]
+Files: `.claude/commands/save.md`, `.claude/commands/sync.md`.
+- The command file becomes "run the script, report the summary"; the procedure lives in
+  `sync.sh`/`save.sh`. **Keep any LLM/narrative step OUT of the critical mechanical path**
+  (Grok) — log narrative generation stays optional/separate, never a new failure domain
+  in the hot save/sync loop.
+- Verify: command-body token size cut substantially; behavior unchanged; save/sync never
+  blocked by an LLM call.
+
+## Task 8: Shard big command bodies  [P2 — DEFERRED]
+- Deferred past P1 (Grok): one-line descriptions + CLAUDE split + thin wrappers deliver
+  most of the win. Revisit only with a real lazy-load gate (Gemini: otherwise models
+  glob-load the appendices anyway). Not in the P0/P1 cut.
+
+---
+
+## P2 — Coordinator model
+
+## Task 9: Re-tune delegation defaults  [P2, gated on harness VERSION]
+Files: CLAUDE.md CORE.
+- "Act directly by default; delegate only for (a) high-volume tool output, (b) blast
+  radius >3 files across layers, (c) genuine domain shift, (d) parallelizable independent
+  work." Keep single-agent-for-coupled-tasks. Behavioral-contract change -> gate on
+  fleet VERSION so behavior is consistent once all machines sync.
+
+---
+
+## P3 — Knowledge architecture
+
+## Task 10: Tier the knowledge layers  [P3]
+Files: CLAUDE.md (recall order), `session-logs/YYYY-MM/` reorg, `memory-dream` schedule,
+wiki frontmatter.
+- Recall order: scoped log grep (date-foldered) -> wiki (CONTEXT.md/article) -> memory.
+  Reorganize logs into `session-logs/YYYY-MM/` and recall via scoped grep — **no
+  monolithic INDEX** (Gemini: it recreates bloat).
+- Memory: `memory-dream --apply-safe` on a schedule; `/graduate` finished-project memory
+  into wiki/standards then delete. (No arbitrary file-count target — Grok.)
+- Wiki: `last_verified` + `source_logs`; merge via the Task 2 staging pattern.
+- Verify: a recovery query hits the right layer first; no memory restates a wiki article.
+
+---
+
+## Deferred to a SEPARATE future spec (not this one)
+- **Python port of the core harness** (sync/save/hooks). This spec lands only the
+  deterministic Bash fixes that are cheap + high-value: `jq -n --arg` for ALL shell JSON
+  (fixes `post-bot-alert.sh`), a `now-phoenix` timestamp helper (PowerShell/Python, never
+  `TZ=America/Phoenix date`), and the skills-deploy parity (already shipped 2026-06-08).
+  A full port is its own spec (Gemini).
+
+---
+
+## Task 12: Verification + fleet rollout (consolidated)
+- Single P0/P1 rollout checklist (replaces per-task verify repetition): `/self-check`
+  green; real `/save` dry-run on a scratch branch; bump `VERSION`; coord-broadcast; other
+  machines pull + re-sync.
+- Fleet-level success criteria: 2 consecutive days across ALL machines with zero conflict
+  markers and zero submodule gitlink accidents; registry-injection tokens down >=50%;
+  CLAUDE CORE <=1.5k.
+- Add harness smoke tests to `/self-check`: sync/save dry-run, skills-deploy dir
+  non-empty, registry-size budget, guard self-test.