From 512ceb4727e71c7ee616553284e2e400dfde6828 Mon Sep 17 00:00:00 2001
From: Mike Swanson
Date: Mon, 8 Jun 2026 08:41:53 -0700
Subject: [PATCH] =?UTF-8?q?feat(harness-guard):=20FATAL-promotion=20prereq?=
 =?UTF-8?q?uisite=20=E2=80=94=20test=20matrix=20+=20pair-required=20confli?=
 =?UTF-8?q?ct=20rule=20(VERSION=201.4.3)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Builds the false-positive/true-positive proof the plan requires before the guard
can be promoted to blocking, and fixes the one false-positive it surfaced.

- test-harness-guard.sh: 12-case matrix in a throwaway repo, runs the REAL guard,
  asserts WARN/clean for real conflicts/secrets/keys vs legit content (setext
  underlines, dividers, docs that mention a marker, encrypted sops, public keys,
  .example templates).
- harness-guard.sh: conflict rule now requires a real hunk (BOTH ^<<<<<<< AND
  ^>>>>>>>), dropping the lone =======$ trigger that false-positived on a 7-char
  setext underline / divider. Identical true-positive power (git writes all three
  markers); FP surface -> 0.
- /self-check: new harness.guard_selftest runs the matrix in an isolated temp repo
  (read-only vs the real tree) so guard correctness is continuously proven.

Verified 12/12 pass, true positives intact, real-tree FP surface = 0. FATAL flip
(todo f1c11d0d, on/after 2026-06-22) is now evidence-backed + one-step.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .claude/harness/CHANGELOG.md                  |  12 ++
 .claude/harness/VERSION                       |   2 +-
 .claude/scripts/harness-guard.sh              |   8 +-
 .claude/scripts/test-harness-guard.sh         | 174 ++++++++++++++++++
 .claude/skills/self-check/SKILL.md            |   2 +-
 .../skills/self-check/baseline/manifest.json  |   3 +-
 .../skills/self-check/scripts/self-check.sh   |  15 ++
 .../claudetools-harness-optimization/plan.md  |  11 +-
 8 files changed, 221 insertions(+), 6 deletions(-)
 create mode 100644 .claude/scripts/test-harness-guard.sh

diff --git a/.claude/harness/CHANGELOG.md b/.claude/harness/CHANGELOG.md
index 0f12bcf..25851ed 100644
--- a/.claude/harness/CHANGELOG.md
+++ b/.claude/harness/CHANGELOG.md
@@ -67,3 +67,15 @@ or old harness during a heterogeneous rollout. See
   link (time-entry-protocol.md -> /syncro). Semantic contradiction pass (read both, judge actual
   conflict) delegated to the model in SKILL.md, mirroring the memory pass. Verified PASS; negative-
   tested (WARN fires when the pointer is removed). New pairs: add to manifest.command_standard_links.
+
+## 1.4.3 — 2026-06-08 (guard FATAL-promotion prerequisite: test matrix + refinement)
+- Built `.claude/scripts/test-harness-guard.sh` — a 12-case false-positive/true-positive matrix
+  for harness-guard.sh (spins a throwaway repo, stages synthetic content, runs the REAL guard,
+  asserts WARN/clean). Required by the plan before promoting the guard to FATAL.
+- The matrix surfaced a false-positive vector: the conflict rule's lone `=======$` alternative
+  fired on a markdown setext underline / divider of exactly seven `=`. REFINED harness-guard.sh to
+  require a real hunk — BOTH `^<<<<<<< ` AND `^>>>>>>> ` present — which has identical true-positive
+  power (git always writes all three markers) and eliminates the false positive. Verified 12/12 pass;
+  real-tree false-positive surface = 0.
+- Wired the matrix into /self-check as `harness.guard_selftest` (runs in an isolated temp repo, so
+  the read-only-vs-real-tree contract holds). The eventual FATAL flip is now evidence-backed.
diff --git a/.claude/harness/VERSION b/.claude/harness/VERSION
index 9df886c..428b770 100644
--- a/.claude/harness/VERSION
+++ b/.claude/harness/VERSION
@@ -1 +1 @@
-1.4.2
+1.4.3
diff --git a/.claude/scripts/harness-guard.sh b/.claude/scripts/harness-guard.sh
index 61d6015..cf7d894 100644
--- a/.claude/scripts/harness-guard.sh
+++ b/.claude/scripts/harness-guard.sh
@@ -30,8 +30,12 @@ mapfile -t STAGED < <(git diff --cached --name-only --diff-filter=ACM 2>/dev/nul
 for f in "${STAGED[@]}"; do
   [ -n "$f" ] || continue
   blob=$(git show ":$f" 2>/dev/null) || continue
-  # 1. Conflict markers
-  if printf '%s\n' "$blob" | grep -qE '^(<<<<<<< |=======$|>>>>>>> )'; then
+  # 1. Conflict markers — require a REAL hunk: both an open (<<<<<<<) AND a close
+  #    (>>>>>>>) marker at line start. A lone '=======' line is a markdown setext
+  #    underline or a divider, not a conflict, so flagging it alone is a false positive
+  #    with no detection value (git always writes all three markers). Requiring the pair
+  #    eliminates that vector (verified by test-harness-guard.sh) before FATAL promotion.
+  if printf '%s\n' "$blob" | grep -qE '^<<<<<<< ' && printf '%s\n' "$blob" | grep -qE '^>>>>>>> '; then
     warn "conflict markers in staged file: $f"; ISSUES=$((ISSUES + 1))
   fi
   # 2. Unencrypted SOPS vault file
diff --git a/.claude/scripts/test-harness-guard.sh b/.claude/scripts/test-harness-guard.sh
new file mode 100644
index 0000000..161cd00
--- /dev/null
+++ b/.claude/scripts/test-harness-guard.sh
@@ -0,0 +1,174 @@
+#!/usr/bin/env bash
+# test-harness-guard.sh — false-positive / true-positive test matrix for harness-guard.sh.
+#
+# WHY: the guard is WARN-ONLY today; before it is promoted to FATAL (blocking) the
+# harness-optimization plan requires proof of ZERO false positives on legitimate content
+# plus reliable detection of the real footguns. This script is that proof, repeatable.
+#
+# It spins up a throwaway git repo, stages synthetic files, runs the REAL harness-guard.sh
+# inside it (the guard cd's to its repo root and inspects the staged blobs), and asserts
+# WARN / no-WARN per case. It also scans the actual tracked tree for content that the
+# guard's detection patterns would flag, to size the real-world false-positive blast radius.
+#
+# Read-only against the real repo (the synthetic staging happens in a temp repo under TMP).
+# Exit 0 = all cases passed; exit 1 = at least one mismatch (promotion NOT yet safe).
+
+set -uo pipefail
+
+REPO_ROOT="$(git rev-parse --show-toplevel 2>/dev/null)" || { echo "[ERROR] not in a git repo"; exit 2; }
+GUARD="$REPO_ROOT/.claude/scripts/harness-guard.sh"
+[ -f "$GUARD" ] || { echo "[ERROR] guard not found: $GUARD"; exit 2; }
+
+TMP="$(mktemp -d 2>/dev/null || echo "${TMPDIR:-/tmp}/guardtest.$$")"
+mkdir -p "$TMP"
+cleanup() { rm -rf "$TMP" 2>/dev/null; }
+trap cleanup EXIT
+
+# --- isolated temp repo so we can stage synthetic content without touching the real tree
+git -C "$TMP" init -q
+git -C "$TMP" config user.name "guard-test"
+git -C "$TMP" config user.email "guard-test@local"
+mkdir -p "$TMP/.claude/harness" # so the guard's log path mkdir is a no-op
+
+PASS=0; FAIL=0
+FAILED_CASES=""
+
+# run_case NAME EXPECT FILE: stage FILE (content read from stdin), run the guard, compare to EXPECT (warn|clean)
+run_case() {
+  local name="$1" expect="$2" file="$3" out rc warned
+  # reset the temp index/worktree
+  git -C "$TMP" reset -q --hard >/dev/null 2>&1 || true
+  git -C "$TMP" rm -rq --cached . >/dev/null 2>&1 || true
+  rm -f "$TMP"/*.* "$TMP"/* 2>/dev/null || true
+  mkdir -p "$TMP/$(dirname "$file")" 2>/dev/null || true
+  cat > "$TMP/$file"
+  git -C "$TMP" add -A >/dev/null 2>&1
+  # run the REAL guard from inside the temp repo
+  out="$( cd "$TMP" && bash "$GUARD" 2>&1 )"; rc=$?
+  if printf '%s\n' "$out" | grep -q '\[harness-guard\]\[WARN\]'; then warned=1; else warned=0; fi
+
+  local got; [ "$warned" = 1 ] && got="warn" || got="clean"
+  if [ "$got" = "$expect" ]; then
+    PASS=$((PASS+1)); printf ' [PASS] %-34s expected=%-5s got=%-5s\n' "$name" "$expect" "$got"
+  else
+    FAIL=$((FAIL+1)); FAILED_CASES="$FAILED_CASES $name"
+    printf ' [FAIL] %-34s expected=%-5s got=%-5s\n' "$name" "$expect" "$got"
+    printf ' guard said: %s\n' "$(printf '%s' "$out" | grep WARN | head -2 | tr '\n' '|')"
+  fi
+}
+
+echo "============================================================"
+echo " harness-guard false-positive / true-positive matrix"
+echo " guard: $GUARD"
+echo "============================================================"
+echo ""
+echo "TRUE POSITIVES (must WARN):"
+
+run_case "real-conflict-hunk" warn "src/app.rs" <<'EOF'
+fn main() {
+<<<<<<< HEAD
+    let x = 1;
+=======
+    let x = 2;
+>>>>>>> feature
+}
+EOF
+
+run_case "unencrypted-sops" warn "infra/secret.sops.yaml" <<'EOF'
+api_key: super-secret-plaintext
+password: hunter2
+EOF
+
+run_case "private-key-openssh" warn "keys/id_ed25519" <<'EOF'
+-----BEGIN OPENSSH PRIVATE KEY-----
+b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAAB
+-----END OPENSSH PRIVATE KEY-----
+EOF
+
+run_case "private-key-rsa" warn "keys/id_rsa" <<'EOF'
+-----BEGIN RSA PRIVATE KEY-----
+MIIEpAIBAAKCAQEA...
+-----END RSA PRIVATE KEY-----
+EOF
+
+echo ""
+echo "FALSE-POSITIVE VECTORS (must stay CLEAN):"
+
+# markdown setext H1 underline (long run) — must stay clean
+run_case "markdown-setext-underline-long" clean "docs/title.md" <<'EOF'
+My Document Title
+=================
+
+Body text here.
+EOF + +# the precise edge: a setext underline that is EXACTLY seven equals (git's conflict-middle +# marker). The old standalone '=======$' rule false-positived here; the pair-required rule +# must keep it clean (no open/close markers present). +run_case "setext-underline-exactly-7" clean "docs/short.md" <<'EOF' +Title X +======= + +body +EOF + +# a horizontal divider of exactly seven equals in a comment — must stay clean +run_case "divider-exactly-7-equals" clean "notes/changelog.md" <<'EOF' +## Release notes +======= +- item one +EOF + +# a doc that *mentions* a single conflict marker (a git tutorial) — no real hunk +run_case "doc-mentions-open-marker" clean "docs/git-tutorial.md" <<'EOF' +When git hits a conflict it inserts a line starting with `<<<<<<< HEAD`. +You then edit the file to resolve it. (No closing marker in this doc.) +EOF + +# already-encrypted sops file — has ENC[ / sops: markers, must NOT warn +run_case "encrypted-sops" clean "infra/real.sops.yaml" <<'EOF' +api_key: ENC[AES256_GCM,data:abc==,iv:xyz==,tag:q==,type:str] +sops: + kms: [] + age: + - recipient: age1xyz +EOF + +# public key — guard targets PRIVATE keys only; a public key must not warn +run_case "public-key-ssh" clean "keys/id_ed25519.pub" <<'EOF' +ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIabc123 user@host +-----BEGIN PUBLIC KEY----- +MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE +-----END PUBLIC KEY----- +EOF + +# a .sops.yaml.example template (not a real vault file path) with placeholder text +run_case "sops-example-template" clean "infra/secret.sops.yaml.example" <<'EOF' +api_key: +note: copy to secret.sops.yaml and encrypt with sops +EOF + +# normal source with '=======' inside a comment banner (not its own 7-char line) +run_case "comment-banner-equals" clean "src/lib.rs" <<'EOF' +// ======= section: helpers ======= +fn helper() {} +EOF + +echo "" +echo "REAL-CORPUS BLAST RADIUS:" +# Old standalone rule surface (for context): exactly-7-equals lines that USED to false-positive. 
+OLD_EQ="$(git -C "$REPO_ROOT" grep -lE '^=======$' 2>/dev/null | wc -l | tr -d '[:space:]')"
+# New rule surface: files with BOTH an open and a close marker = a real conflict (should be 0).
+OPEN_HITS="$(git -C "$REPO_ROOT" grep -lE '^<<<<<<< ' 2>/dev/null | sort)"
+CLOSE_HITS="$(git -C "$REPO_ROOT" grep -lE '^>>>>>>> ' 2>/dev/null | sort)"
+BOTH="$(comm -12 <(printf '%s\n' "$OPEN_HITS") <(printf '%s\n' "$CLOSE_HITS") | grep -c . )"
+echo " tracked files with a lone '^=======\$' line (OLD rule false-positive surface): $OLD_EQ"
+echo " tracked files with BOTH open+close markers (NEW rule = real conflicts): $BOTH"
+echo " -> NEW rule flags only genuine conflict hunks; lone dividers/underlines are clean."
+
+echo ""
+echo "============================================================"
+echo " RESULT: PASS $PASS FAIL $FAIL"
+[ -n "$FAILED_CASES" ] && echo " failed:$FAILED_CASES"
+echo "============================================================"
+[ "$FAIL" -eq 0 ] && exit 0 || exit 1
diff --git a/.claude/skills/self-check/SKILL.md b/.claude/skills/self-check/SKILL.md
index 479c590..829280c 100644
--- a/.claude/skills/self-check/SKILL.md
+++ b/.claude/skills/self-check/SKILL.md
@@ -81,7 +81,7 @@ SELFCHECK_TS="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
 | **skills/commands** | every skill dir and command file in the baseline is present; extras are reported as census candidates. |
 | **duplicates** | command/skill names present in BOTH the repo and `~/.claude`. Divergent content = WARN (the "same `/cmd`, different behaviour on the Mac" bug); identical = INFO (redundant, will drift). CRLF-only differences are ignored. |
 | **memory** | `MEMORY.md` index exists; no orphaned memory files; manifest-declared contradiction patterns (see semantic pass below). Never FAILs the grade. |
-| **harness** | the 1.4.0 invariants (read-only): VERSION marker present + not older than `manifest.harness.min_version`; **skill-registry description budget** (sum of all SKILL.md `description:` fields under `registry_desc_budget_chars` — WARN on regrowth); global deploy targets `~/.claude/skills` + `~/.claude/commands` populated (the "Mac wiped global skills" failure); `harness-guard.sh` present + wired into `sync.sh`; core scripts parse (`bash -n` on sync/guard/now-phoenix); `now-phoenix.sh --date` emits a valid date. Budget/min-version/script-list are tunable in `manifest.harness`. |
+| **harness** | the 1.4.0 invariants (read-only): VERSION marker present + not older than `manifest.harness.min_version`; **skill-registry description budget** (sum of all SKILL.md `description:` fields under `registry_desc_budget_chars` — WARN on regrowth); global deploy targets `~/.claude/skills` + `~/.claude/commands` populated (the "Mac wiped global skills" failure); `harness-guard.sh` present + wired into `sync.sh`; core scripts parse (`bash -n`); `now-phoenix.sh --date` emits a valid date; **guard self-test** runs the full `test-harness-guard.sh` false-positive/true-positive matrix in an isolated temp repo (proves the guard still catches real conflicts/secrets and does not false-positive — the standing prerequisite for promoting the guard to FATAL). Budget/min-version/script-list are tunable in `manifest.harness`. |
 | **consistency** | the **command-restates-standard** lint (deterministic half): for each `manifest.command_standard_links` pair, the standard must still contain its defer-to-SSOT pointer to the owning command. A lost pointer = WARN (the standard likely drifted back into restating the command — the Syncro-timers failure mode). The semantic contradiction judgement is delegated to the model (see below). |
 | **vault** | vault repo exists; sops+age present; `vault.sh list` succeeds (decrypt wired). |
 | **connectivity** | coord API (required), main API + internal Gitea (advisory; off-network is OK). |
diff --git a/.claude/skills/self-check/baseline/manifest.json b/.claude/skills/self-check/baseline/manifest.json
index b30cf22..d77995a 100644
--- a/.claude/skills/self-check/baseline/manifest.json
+++ b/.claude/skills/self-check/baseline/manifest.json
@@ -13,7 +13,8 @@
     "syntax_check_scripts": [
       ".claude/scripts/sync.sh",
       ".claude/scripts/harness-guard.sh",
-      ".claude/scripts/now-phoenix.sh"
+      ".claude/scripts/now-phoenix.sh",
+      ".claude/scripts/test-harness-guard.sh"
     ],
     "guard_wired_in": ".claude/scripts/sync.sh"
   },
diff --git a/.claude/skills/self-check/scripts/self-check.sh b/.claude/skills/self-check/scripts/self-check.sh
index 34c2130..bbad591 100644
--- a/.claude/skills/self-check/scripts/self-check.sh
+++ b/.claude/skills/self-check/scripts/self-check.sh
@@ -659,6 +659,21 @@ check_harness_smoke() {
         "Check .claude/scripts/now-phoenix.sh"
     fi
   fi
+
+  # 6. Guard self-test: run the full false-positive/true-positive matrix in an isolated
+  #    temp repo (writes only under mktemp, never the real tree). Proves the guard still
+  #    detects real conflicts/secrets AND does not false-positive on legit content — the
+  #    standing prerequisite for promoting the guard to FATAL.
+  local gt="$REPO_ROOT/.claude/scripts/test-harness-guard.sh" gres
+  if [ -f "$gt" ] && command -v git >/dev/null 2>&1; then
+    gres="$(bash "$gt" 2>/dev/null | grep 'RESULT:' | head -1 | sed 's/^[[:space:]]*RESULT:[[:space:]]*//')"
+    if echo "$gres" | grep -q 'FAIL 0'; then
+      emit harness.guard_selftest harness PASS "guard FP/TP matrix clean ($gres)"
+    elif [ -n "$gres" ]; then
+      emit harness.guard_selftest harness WARN "guard self-test reported failures ($gres)" \
+        "Run: bash .claude/scripts/test-harness-guard.sh — a detection case regressed"
+    fi
+  fi
 }

 # ---------------------------------------------------------------------------
diff --git a/specs/claudetools-harness-optimization/plan.md b/specs/claudetools-harness-optimization/plan.md
index bbec471..583ee12 100644
--- a/specs/claudetools-harness-optimization/plan.md
+++ b/specs/claudetools-harness-optimization/plan.md
@@ -44,9 +44,18 @@
   1.4.0 invariants (VERSION/min-version, skill-registry description budget, global deploy
   targets populated, guard wired, core scripts parse, now-phoenix valid). Read-only; 9/9 PASS;
   budget WARN negative-tested. Tunables in `self-check/baseline/manifest.json` `harness` block. (VERSION 1.4.1.)
+- [DONE] Task 4 pre-FATAL prerequisite — guard false-positive/true-positive test matrix built
+  (`.claude/scripts/test-harness-guard.sh`, 12 cases) and a guard REFINEMENT shipped: the conflict
+  rule now requires a real hunk (BOTH `^<<<<<<< ` AND `^>>>>>>> ` present) instead of also firing on
+  a lone `=======` line. That lone-marker trigger was a false-positive vector (markdown setext
+  underlines / `=======` dividers of exactly 7 chars) with zero detection value — git always writes
+  all three markers. Verified: 12/12 cases pass, true positives still caught, real-tree false-positive
+  surface = 0. The matrix is wired into `/self-check` (`harness.guard_selftest`) so the guard's
+  correctness is continuously proven. The FATAL flip is now evidence-backed + one-step. (VERSION 1.4.3.)
 - REMAINING (ops follow-ups, not blocking):
   - Promote the warn-only guard (Task 4) to FATAL after a clean warn window (check
-    `.claude/harness/guard.log` across the fleet).
+    `.claude/harness/guard.log` across the fleet). Prerequisite test matrix DONE (above); coord
+    todo `f1c11d0d` set for on/after 2026-06-22.
   - Schedule `memory-dream --apply-safe` per-machine (deliberate per-box ops setup; default is
     read-only/proposals, so unattended --apply-safe is a judgment call left to the operator).
   - Optional later: migrate existing flat session-logs into month folders if/when the flat dir