Optimize the harness (not projects) for accuracy/completeness with context pressure as a first-class constraint; token efficiency secondary. Authored as a Claude+Grok+ Gemini review (see review-3way.md): P0 reliability footguns (submodule-safe sync, serialized/staged wiki synthesis, syncro SSOT, warn-only guard), P1 context diet (one-line registry descriptions, CLAUDE CORE/EXTENDED, thin save/sync), P2 delegation re-tune, P3 knowledge tiering. Adds harness VERSION marker + OOB recovery as rollout safety. Python port split to a separate future spec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
88 lines
5.8 KiB
Markdown
88 lines
5.8 KiB
Markdown
# ClaudeTools Harness Optimization — 3-Way Review Record
|
|
|
|
Independent reviews on 2026-06-08: Claude (firsthand session friction), Grok (xAI),
|
|
Gemini (Google). The analysis converged on the same P0/P1/P2/P3 priorities; the spec
|
|
draft was then put back to Grok and Gemini for a review pass. This file records each
|
|
model's input, the consensus, the dissent + how it was reconciled, and the resulting
|
|
revisions applied to `shape.md`/`plan.md`.
|
|
|
|
## Consensus (all three, analysis phase)
|
|
- Two core weaknesses: distributed-write safety (`git add -A` + submodule + concurrent
|
|
wiki recompiles) and context economics (repeated injection of the skill/command
|
|
registry, CLAUDE.md, big command bodies).
|
|
- P0 = reliability footguns (submodule-safe sync, decouple/serialize wiki, SSOT for
|
|
command-vs-standard, guards). P1 = context diet (one-line descriptions, CLAUDE
|
|
CORE/EXTENDED, thin save/sync). P2 = re-tune over-delegation. P3 = tier the knowledge
|
|
layers. Accuracy first; never compress safety/billing/vault rules for tokens.
|
|
|
|
## Grok — spec review highlights
|
|
- GAPS: no fleet rollout/escape-hatch under heterogeneity; no harness VERSION marker;
|
|
coord-lock failure modes unanalyzed; incomplete "where is text injected" inventory;
|
|
discoverability after the diet; no fleet-level success criteria.
|
|
- SEQUENCING: Task 3 "confirm with Mike" must be a prerequisite, not inside the edit
|
|
task. Task 2 — build the gated/serialized path and prove it under concurrent load
|
|
BEFORE excising the inline recompile. Task 1+4 touch the same staging path -> same
|
|
branch. Task 9 is a behavioral-contract change under heterogeneity.
|
|
- HIGHEST RISK: Task 4 (pre-commit guard) can fail-closed and brick every machine's
|
|
`/save`. De-risk: standalone script, WARN-ONLY for 7-14 days, `SKIP_HARNESS_GUARD=1`
|
|
bypass (logged+broadcast), test matrix, wire to fatal only after a clean warn period.
|
|
- CUT/SIMPLIFY: defer Task 8 (shard syncro/rmm) past P1; consolidate per-task verifies
|
|
into one rollout checklist; drop ongoing token telemetry (one-off before/after is
|
|
enough); drop the arbitrary 40-60 memory number.
|
|
- ADD: Task 0.5 — `.claude/harness/VERSION` + CORE reference, bumped on every
|
|
fleet-visible behavioral change, so sessions can detect partial rollout.
|
|
|
|
## Gemini — spec review highlights
|
|
- GAPS: coord-lock lifecycle/orphan-eviction undefined; the lazy-load mechanism for
|
|
sharded command files is undefined (models will glob-load them anyway).
|
|
- SEQUENCING: put the guard (Task 4) in EARLY so you don't commit the very footguns you
|
|
are fixing while developing Tasks 1-2.
|
|
- HIGHEST RISK: Task 2 "wiki merge-not-rewrite" via automation -> progressive structural
|
|
corruption / duplicated headers / hallucinated artifacts. De-risk: a STAGING pattern
|
|
— background job writes proposed updates to `.claude/wiki_staging/` + notifies; an
|
|
operator or focused validation agent reviews + applies the diff.
|
|
- CUT/SIMPLIFY: cut Task 11 (Python port) from this spec -> a separate future spec; fix
|
|
Bash determinism first. Simplify Task 10 — NO monolithic `session-logs/INDEX.md`
|
|
(recreates the bloat); use date-folder structure (`session-logs/YYYY-MM/`) + scoped
|
|
grep instead.
|
|
- ADD: an out-of-band recovery script (`force-pull-raw.sh`) distributed BEFORE P0, so a
|
|
node can rescue itself if a P0 change bricks `sync.sh`/`save.sh` (the update vector).
|
|
|
|
## Dissent + reconciliation
|
|
1. **Pre-commit guard timing** — Grok: wire late, warn-only (it can brick saves).
|
|
Gemini: put it in first (catch footguns during dev). RESOLVED: build it EARLY as a
|
|
standalone script in **WARN-ONLY** mode (present during Task 1-2 dev, non-fatal),
|
|
with a logged bypass; promote to fatal only after a clean warn period. Satisfies both.
|
|
2. **Highest-risk task** — Grok: Task 4 (guard fail-closed). Gemini: Task 2 (wiki
|
|
auto-merge corruption). RESOLVED: both are top risks; address both — guard is
|
|
warn-only, and wiki uses the staging-review pattern (no blind auto-merge).
|
|
3. **Session-log INDEX** — Claude proposed a monolithic INDEX; Gemini: that recreates
|
|
bloat. RESOLVED: drop the monolithic index; reorganize logs into `YYYY-MM/` folders
|
|
and recall via scoped grep.
|
|
4. **Python port** — Grok accepted it as a P4 stretch; Gemini: cut to a separate spec.
|
|
RESOLVED: this spec lands the deterministic Bash fixes (jq JSON, phoenix-date helper,
|
|
skills-deploy parity) only; a full Python port is a SEPARATE future spec.
|
|
|
|
## Revisions applied to the spec (see plan.md)
|
|
- NEW Task 0.5: harness `VERSION` marker (+ CORE reference) for partial-rollout detection.
|
|
- NEW Task 0.6: OOB `force-pull-raw.sh` recovery script, distributed before any P0 change.
|
|
- Task 3 split: Task 3a (confirm the billing SSOT truth with Mike) is a hard prerequisite.
|
|
- Task 2 rewritten: staging pattern (`.claude/wiki_staging/` + review/apply) + prove
|
|
under concurrent load + coord-lock lifecycle (orphan eviction, coord-down fallback)
|
|
BEFORE removing the inline recompile.
|
|
- Task 4 rewritten: standalone guard, WARN-ONLY first, bypass flag, test matrix; built
|
|
early, co-developed with Task 1 on one branch, promoted to fatal only after a clean
|
|
warn window.
|
|
- Task 7 constraint: keep any LLM/narrative step OUT of the critical mechanical save/sync
|
|
path (no new failure domain in the hot loop).
|
|
- Task 8 (shard syncro/rmm) demoted to P2/deferred; needs a real lazy-load gate to be
|
|
worth it.
|
|
- Task 9 gated on fleet harness VERSION (behavioral-contract change).
|
|
- Task 10: date-folder logs + scoped grep (no monolithic index); drop the 40-60 number;
|
|
keep graduate + memory-dream.
|
|
- Task 11 (Python port) moved OUT of this spec to a future one; deterministic Bash fixes
|
|
stay (jq, phoenix-date, skills-deploy).
|
|
- Verification consolidated into one P0/P1 rollout checklist with fleet-level success
|
|
criteria (e.g., 2 consecutive days, zero conflict markers / submodule accidents).
|
|
- Added a "where is text injected" inventory pass as a prerequisite to the diet (P1).
|