Files
claudetools/specs/claudetools-harness-optimization/plan.md
Mike Swanson bb7dc147ca spec: ClaudeTools harness optimization (3-way reviewed)
Optimize the harness (not projects) for accuracy/completeness with context pressure
as a first-class constraint; token efficiency secondary. Authored as a Claude+Grok+
Gemini review (see review-3way.md): P0 reliability footguns (submodule-safe sync,
serialized/staged wiki synthesis, syncro SSOT, warn-only guard), P1 context diet
(one-line registry descriptions, CLAUDE CORE/EXTENDED, thin save/sync), P2 delegation
re-tune, P3 knowledge tiering. Adds harness VERSION marker + OOB recovery as rollout
safety. Python port split to a separate future spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 07:32:45 -07:00

8.7 KiB

ClaudeTools Harness Optimization — Implementation Plan (3-way reviewed)

Spec created: 2026-06-08 | Status: not started | Priority: P1 Revised per review-3way.md (Claude + Grok + Gemini). Fleet-wide changes: every task must be reversible, version-marked, and rollout-announced via coord.

Task 0: Commit this spec

Commit specs/claudetools-harness-optimization/ (incl. review-3way.md).

Task 0.5: Harness VERSION marker [safety prerequisite — Grok]

Files: .claude/harness/VERSION (new), CLAUDE.md CORE reference.

  • Single-line version string, bumped on every fleet-visible behavioral change (each P0 task). Sessions read it to detect partial rollout and give clear errors ("your save is on the pre-Task-2 wiki path"). Enables conditional behavior during heterogeneous rollout.

Task 0.6: Out-of-band fleet recovery [safety prerequisite — Gemini]

Files: .claude/scripts/force-pull-raw.sh (new), distributed BEFORE any P0 change.

  • A raw, un-intercepted git fetch + reset --hard origin/main rescue path so a node can recover if a P0 change bricks sync.sh/save.sh (the normal update vector). No hooks, no guards, minimal deps. Broadcast its existence to the fleet first.

P0 — Reliability footguns

Task 1+4 (co-developed on one branch — both touch staging)

Task 1: Submodule-safe sync [P0]

Files: .claude/scripts/sync.sh.

  • Replace blanket git add -A with submodule-aware staging: stage changes, then unstage any submodule gitlink unless --with-submodules is passed (git reset -q HEAD projects/msp-tools/guru-rmm projects/msp-tools/guru-connect). Operator must NEVER need to detach the submodule to its pin before /save.
  • Verify: submodule on main (ahead of pin) -> /sync commits log/wiki/config but NOT a gitlink bump; clean status after.

Task 4: Pre-commit guard — WARN-ONLY first [P0, highest risk]

Files: .claude/scripts/harness-guard.sh (new, standalone), later wired into sync.sh.

  • Detects: conflict markers in tracked files; unflagged submodule gitlink change; vault-plaintext / obvious-secret patterns.
  • Mode = WARN-ONLY for the first 7-14 days fleet-wide (logs to a local file, never blocks). Explicit SKIP_HARNESS_GUARD=1 bypass that is logged + coord-broadcast when used. Test matrix: synthetic bad inputs (incl. legitimate <<<<<<< in test files, new SOPS edge cases) + the last N real commits, zero false positives, BEFORE promoting to fatal. Promote to blocking only after a clean warn window.
  • Verify: warn fires on a planted conflict marker; clean commits pass; bypass logs.

Task 2: Decouple + stage wiki synthesis [P0]

Files: .claude/commands/save.md (drop inline recompile), .claude/commands/wiki-compile.md, new staged runner.

  • Sequence (do NOT excise the old path until the new one is proven):
    1. Build the serialized synthesis path: gated by a coord lock (lock claim claudetools wiki/<slug>) with orphan-lock eviction (TTL + dead-session reclaim) and a coord-down fallback (skip + warn, never block the save).
    2. Staging pattern (Gemini): the job writes proposed article updates to .claude/wiki_staging/<slug>.md + emits a notification; an operator or a focused validation agent reviews the diff and applies it — NO blind background auto-merge into the live article (avoids progressive structural corruption). Preserve Patterns/History; update last_verified + source_logs.
    3. Prove it under concurrent saves from 2 machines (no conflict, one clean update).
    4. ONLY THEN remove the inline Sonnet recompile from /save (save = write log + push).
  • Verify: two near-simultaneous /saves -> no wiki conflict markers; staged diff reviewed + applied cleanly.

Task 3a: Confirm billing SSOT truth [P0 prerequisite — Grok]

  • Hard prerequisite before Task 3 edits: confirm with Mike which is operational truth — /syncro command's add_line_item (current practice) vs the time-entry-protocol standard's timer_entry. Picking wrong bakes a permanent contradictory money rule.

Task 3: Single source of truth — Syncro billing [P0]

Files: .claude/commands/syncro.md, .claude/standards/syncro/time-entry-protocol.md.

  • Per Task 3a's decision, make one the source and the other a reference; delete the contradictory/duplicated prose. Add a "command-restates-standard" lint to /self-check.
  • Verify: one rule, two consistent references.

P1 — Context diet

Task 5a: "Where is text injected" inventory [P1 prerequisite — Grok]

  • One-off pass cataloguing every repeated injection: CLAUDE.md, the skill/command registry, system-reminders, MEMORY.md, CONTEXT.md loads, hook-injected prose, big command bodies. Measure tokens each. This scopes Tasks 5/6/7 so the diet does not under-deliver. One-off measurement (Grok: skip ongoing telemetry).

Task 5: One-line registry descriptions [P1] (biggest token win)

Files: every .claude/skills/*/SKILL.md description:; every .claude/commands/*.md description.

  • Truncate each registry description to ONE sentence; move verbose triggers/examples into the SKILL.md body (loads only on activation). Worst first: rmm-audit, gc-audit, packetdial, bitdefender, mailprotector, remediation-tool.
  • Verify: registry-list injection size cut >=50% (measured per Task 5a); skills still invoke.

Task 6: CLAUDE.md -> CORE + EXTENDED [P1]

Files: .claude/CLAUDE.md (lean CORE) + .claude/CLAUDE_EXTENDED.md (on-demand).

  • CORE (<=1.5k tokens): identity/multi-user, security/vault, model routing, delegation thresholds (Task 9), coord-message-in-system-reminder rule, harness VERSION ref. Everything else -> EXTENDED, loaded when relevant. Add a discoverability pointer so on-demand detail is findable (Grok).
  • Verify: CORE <=1.5k tokens; nothing safety-critical lost.

Task 7: Thin /save and /sync [P1]

Files: .claude/commands/save.md, .claude/commands/sync.md.

  • The command file becomes "run the script, report the summary"; the procedure lives in sync.sh/save.sh. Keep any LLM/narrative step OUT of the critical mechanical path (Grok) — log narrative generation stays optional/separate, never a new failure domain in the hot save/sync loop.
  • Verify: command-body token size cut substantially; behavior unchanged; save/sync never blocked by an LLM call.

Task 8: Shard big command bodies [P2 — DEFERRED]

  • Deferred past P1 (Grok): one-line descriptions + CLAUDE split + thin wrappers deliver most of the win. Revisit only with a real lazy-load gate (Gemini: otherwise models glob-load the appendices anyway). Not in the P0/P1 cut.

P2 — Coordinator model

Task 9: Re-tune delegation defaults [P2, gated on harness VERSION]

Files: CLAUDE.md CORE.

  • "Act directly by default; delegate only for (a) high-volume tool output, (b) blast radius >3 files across layers, (c) genuine domain shift, (d) parallelizable independent work." Keep single-agent-for-coupled-tasks. Behavioral-contract change -> gate on fleet VERSION so behavior is consistent once all machines sync.

P3 — Knowledge architecture

Task 10: Tier the knowledge layers [P3]

Files: CLAUDE.md (recall order), session-logs/YYYY-MM/ reorg, memory-dream schedule, wiki frontmatter.

  • Recall order: scoped log grep (date-foldered) -> wiki (CONTEXT.md/article) -> memory. Reorganize logs into session-logs/YYYY-MM/ and recall via scoped grep — no monolithic INDEX (Gemini: it recreates bloat).
  • Memory: memory-dream --apply-safe on a schedule; /graduate finished-project memory into wiki/standards then delete. (No arbitrary file-count target — Grok.)
  • Wiki: last_verified + source_logs; merge via the Task 2 staging pattern.
  • Verify: a recovery query hits the right layer first; no memory restates a wiki article.

Deferred to a SEPARATE future spec (not this one)

  • Python port of the core harness (sync/save/hooks). This spec lands only the deterministic Bash fixes that are cheap + high-value: jq -n --arg for ALL shell JSON (fixes post-bot-alert.sh), a now-phoenix timestamp helper (PowerShell/Python, never TZ=America/Phoenix date), and the skills-deploy parity (already shipped 2026-06-08). A full port is its own spec (Gemini).

Task 12: Verification + fleet rollout (consolidated)

  • Single P0/P1 rollout checklist (replaces per-task verify repetition): /self-check green; real /save dry-run on a scratch branch; bump VERSION; coord-broadcast; other machines pull + re-sync.
  • Fleet-level success criteria: 2 consecutive days across ALL machines with zero conflict markers and zero submodule gitlink accidents; registry-injection tokens down >=50%; CLAUDE CORE <=1.5k.
  • Add harness smoke tests to /self-check: sync/save dry-run, skills-deploy dir non-empty, registry-size budget, guard self-test.