
Mike Swanson dc5c09b40b sync: auto-sync from GURU-5070 at 2026-06-15 09:41:53

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-15 09:41:53

2026-06-15 09:42:17 -07:00


Graphifyy vs GrepAI — evaluation protocol (GURU-5070)

Goal: real, comparable data on whether Graphifyy beats the incumbent GrepAI for Mike's day-to-day in ClaudeTools, enough to make an adopt / skip / adopt-narrowly call. Decision hinges on token efficiency + retrieval quality, weighed against maintenance cost.

Tools under test

GrepAI — D:\claudetools\grepai.exe mcp-serve, exposed as mcp__grepai__* (semantic search + RPG graph: explore / trace_callers / trace_callees / trace_graph). Repo-wide index already built (.grepai/). Enabled per-machine via enabledMcpjsonServers:["grepai"] in .claude/settings.local.json.
Graphifyy — pip install graphifyy && graphify install. Local graph (NetworkX + tree-sitter + Leiden). CLI/skill: graphify <path> [--mode deep|--update], graphify query "q", graphify path "A" "B", graphify explain "C". Docs/PDF/images ingested via Claude API (token cost); code parsed locally.

Arms (run in separate sessions; MCP toggles need a restart)

A — GrepAI (baseline / "before"): grepai ON, Graphifyy not used. Run FIRST, this session.
B — Graphifyy ("after"): Graphifyy ON, grepai DISABLED (removed from enabledMcpjsonServers). New session.
C — Control (optional): both off; only grep/glob/Read. Shows whether either graph tool beats plain search.

Same model for all arms. Each query answered in a FRESH sub-agent constrained to that arm's tools, to avoid cross-arm contamination. Scoring done against the rubric, blind to arm where feasible.
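Blind scoring can be approximated by shuffling the collected answers and replacing the arm label with an opaque ID before they reach the scorer. The sketch below is one way to do that; the record fields (`query_id`, `arm`, `answer`) and the `ans-NNN` ID format are assumptions for illustration, not part of the protocol.

```python
import random

def blind_batches(records, seed=0):
    """Shuffle answers and swap arm labels for opaque IDs.

    `records` is a list of dicts with hypothetical keys
    "query_id", "arm", and "answer". Returns (blinded, key):
    `blinded` is what the scorer sees, `key` maps each blind ID
    back to (query_id, arm) and is kept away from the scorer.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    blinded, key = [], {}
    for i, rec in enumerate(shuffled):
        blind_id = f"ans-{i:03d}"
        key[blind_id] = (rec["query_id"], rec["arm"])
        blinded.append({
            "blind_id": blind_id,
            "query_id": rec["query_id"],  # scorer still needs the rubric
            "answer": rec["answer"],
        })
    return blinded, key
```

Answers that name the tool they used (for example, quoting a grepai call) still leak the arm, which is why blinding is only "where feasible".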

Fixed test corpus (both tools are evaluated on the SAME slice)

To keep it fair and bounded (not the whole repo + node_modules):

Code: projects/msp-tools/guru-rmm/ (Rust server + agent + React dashboard)
Docs: wiki/, projects/msp-pricing/, clients/kittle/, clients/dataforth/
PDF: projects/msp-pricing/marketing/The Arizona Business Owner's Guide to Choosing an MSP - Arizona Computer Guru.pdf

Note asymmetry: GrepAI's existing index is repo-wide (slight recall edge, more noise); Graphifyy indexes exactly this slice. All test queries are answerable from the slice.

Metrics (per query x arm)

Metric	How captured
`ctx_tokens`	characters of retrieved context the agent consumed, divided by 4 (a rough chars-to-tokens approximation, applied identically in every arm)
`tool_calls`	number of retrieval round-trips to reach the answer
`latency_s`	wall-clock for the query
`score`	0 = wrong/missing, 1 = partial, 2 = complete & correct (vs rubric)
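As a minimal sketch of one per-query record, the snippet below derives `ctx_tokens` from a character count and validates the 0/1/2 score. The class name, the field layout, and the choice of integer (floor) division are assumptions; the protocol only fixes the divide-by-4 approximation and the four metric names.

```python
from dataclasses import dataclass, asdict

CHARS_PER_TOKEN = 4  # the protocol's fixed approximation

def ctx_tokens(retrieved_chars: int) -> int:
    """Approximate token count; floor division is an assumption."""
    return retrieved_chars // CHARS_PER_TOKEN

@dataclass
class QueryResult:
    """One (query, arm) measurement. Field layout is hypothetical."""
    arm: str              # "A", "B", or "C"
    query_id: str         # e.g. "C1", "D3", "M1"
    retrieved_chars: int  # total chars of retrieved context
    tool_calls: int       # retrieval round-trips
    latency_s: float      # wall-clock seconds
    score: int            # 0 wrong/missing, 1 partial, 2 complete

    def __post_init__(self):
        if self.score not in (0, 1, 2):
            raise ValueError(f"score must be 0, 1, or 2, got {self.score}")

    def to_row(self) -> dict:
        """Flatten to a dict suitable for a CSV writer."""
        row = asdict(self)
        row["ctx_tokens"] = ctx_tokens(self.retrieved_chars)
        return row
```

With made-up numbers, `QueryResult("A", "C1", 12000, 3, 8.5, 2).to_row()` yields a row whose `ctx_tokens` is 3000.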

One-time / maintenance (measured once per tool):

index_build_s — full index of the test corpus (code-only, then code+docs)
reindex_s — incremental update after touching ONE file
ingest_api_tokens — Graphifyy's Claude-API tokens to ingest docs/PDF/images (GrepAI: note its embedding model/cost; LLM-ingestion ≈ 0)
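The wall-clock measurements (`index_build_s`, `reindex_s`) could be captured with a small timing wrapper around whichever command builds or updates the index. This is a sketch, not the tools' own instrumentation; the helper name `timed_run` is hypothetical.

```python
import subprocess
import time

def timed_run(cmd):
    """Run `cmd` (a list of args) to completion.

    Returns (elapsed_seconds, exit_code). Output is captured so that
    terminal rendering does not skew the timing.
    """
    start = time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, proc.returncode

# Intended use, with the CLI forms listed under "Tools under test":
#   index_build_s, rc = timed_run(["graphify", "projects/msp-tools/guru-rmm/"])
#   reindex_s, rc = timed_run(["graphify", "projects/msp-tools/guru-rmm/", "--update"])
```

Checking the exit code matters here: a command that fails fast would otherwise be logged as a suspiciously good build time.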

Test set (10 queries: code-heavy and docs-heavy, since docs are Graphifyy's claimed edge)

Each has a rubric = key facts a correct answer MUST contain.

CODE

C1: "In GuruRMM, how does the server avoid false-failing commands that were delivered but not acked? Name the mechanism + migrations." Rubric: agent CommandAck on receipt + dedup; reaper RE-DELIVERS un-acked instead of false-failing; migrations 058 acked_at / 059 delivery_attempts.
C2: "Trace where un-acked command re-delivery is handled in the RMM server and what calls it." Rubric: the reaper fn + its caller path. (grepai trace_callers vs graphify path)
C3: "Where is GuruRMM agent self-update with rollback implemented and what guards it?" Rubric: agent updater/mod.rs + watchdog.
C4: "What does GuruConnect SPEC-018 propose?" Rubric: session broker / capture worker as SYSTEM.

DOCS / KNOWLEDGE (Graphifyy's claimed strength)

D1: "GPS pricing structure (tiers + prices)?" Rubric: Basic $19 / Pro $26 / Advanced $39 per endpoint; support plans Essential/Standard/Premium/Priority.
D2: "Summarize the Kittle BEC/ACH-fraud incident and root cause." Rubric: Ken+marco+Accounting compromised; fraudulent bank-change to City of Tucson + Marana ($130K+ prevented); IC3 filed; root cause = April credential theft + incomplete remediation (password never reset, ~2mo).
D3: "Which ACG clients had M365 breach/credential incidents in 2026 and each root cause?" Rubric (relationship query): Kittle (BEC), Dataforth (2026-03-27 phishing -> MFA), mvaninc (unauthorized sign-in OKC). Partial credit per client.
D4: "List the 7 red flags of a bad MSP from the Buyers Guide." Rubric: the 7 from MSP-Buyers-Guide-Content.md (unlimited-support, high-pressure sales, offshore-only, no proactive monitoring, long lock-ins, one-size packages, no local presence). PDF/doc ingestion.
D5: "Canonical Kittle article path + what it superseded?" Rubric: clients/kittle.md canonical; kittle-design.md superseded 2026-06-09.

MIXED (code + docs)

M1: "How do new GuruRMM builds get promoted from beta to stable?" Rubric: builds tag beta; promote via POST /api/updates/rollouts/:version/promote; build-server.sh auto-deploys.

Procedure

(Arm A, now) For each query, spawn a sub-agent: tools = grepai + Read only; instruct it to use ONLY grepai for retrieval, answer, and report (answer, total retrieved chars, # grepai calls, elapsed). Log to results.csv with arm=A.
Score each answer 0/1/2 vs rubric.
Disable GrepAI (below), install + index Graphifyy, measure one-time costs.
(Arm B, new session) Same queries, sub-agent tools = Bash(graphify) + Read; use ONLY graphify for retrieval. Log arm=B. Score.
(Arm C, optional) grep/glob/Read only. Log arm=C.
Analyze: per-metric medians by arm; weight ctx_tokens + score (the day-to-day levers); factor in index/maintenance cost and the doc-vs-code split.
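The analysis step can be sketched as a per-arm median over the logged rows. The column names below (`arm`, `ctx_tokens`, `tool_calls`, `latency_s`, `score`) are an assumed schema for results.csv; the protocol names the metrics but does not fix the file layout. The sample rows are made up purely to show the call.

```python
import csv
import io
from collections import defaultdict
from statistics import median

METRICS = ("ctx_tokens", "tool_calls", "latency_s", "score")

def medians_by_arm(csv_file):
    """Return {arm: {metric: median}} from an open results.csv file."""
    by_arm = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(csv_file):
        for metric in METRICS:
            by_arm[row["arm"]][metric].append(float(row[metric]))
    return {
        arm: {metric: median(values) for metric, values in metrics.items()}
        for arm, metrics in by_arm.items()
    }

# Made-up rows, only to demonstrate the shape of the input and output.
sample = io.StringIO(
    "arm,query_id,ctx_tokens,tool_calls,latency_s,score\n"
    "A,C1,3000,3,8.5,2\n"
    "A,D1,1000,1,4.5,1\n"
    "B,C1,2000,2,6.0,2\n"
)
print(medians_by_arm(sample))
```

For the doc-vs-code split, the same function can be run on rows filtered by the query ID prefix (C, D, or M) before grouping.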

Reversible environment changes (per-machine only)

Disable GrepAI: edit .claude/settings.local.json, remove "grepai" from enabledMcpjsonServers, and restart the session. Re-enable by adding it back. Do NOT edit .mcp.json (shared/fleet).
Install Graphifyy: py -m pip install graphifyy && graphify install. Uninstall: py -m pip uninstall graphifyy, then remove its skill.
Snapshot: a copy of settings.local.json is kept at projects/graphifyy-eval/settings.local.json.bak before any edit.
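The GrepAI toggle can be scripted so the snapshot is never skipped. The sketch below assumes settings.local.json is plain JSON (no comments) and that enabledMcpjsonServers is a top-level list, as in the snippet quoted under "Tools under test"; the function name is hypothetical. A session restart is still needed afterwards.

```python
import json
import shutil
from pathlib import Path

def set_grepai_enabled(settings_path, backup_path, enabled):
    """Add or remove "grepai" in enabledMcpjsonServers.

    Copies the settings file to `backup_path` first, unless a backup
    already exists (so the original snapshot is never overwritten).
    Returns the resulting server list.
    """
    settings_path = Path(settings_path)
    backup_path = Path(backup_path)

    if not backup_path.exists():
        backup_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(settings_path, backup_path)

    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    # Drop any existing entry, then re-add once if enabling.
    servers = [s for s in settings.get("enabledMcpjsonServers", []) if s != "grepai"]
    if enabled:
        servers.append("grepai")
    settings["enabledMcpjsonServers"] = servers

    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return servers
```

Called as `set_grepai_enabled(".claude/settings.local.json", "projects/graphifyy-eval/settings.local.json.bak", False)` before Arm B, and with `True` to restore. All other keys in the file pass through untouched, although key order and whitespace may be normalized by the rewrite.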

Open setup unknowns to resolve at install

Which API key/env var Graphifyy uses for doc/PDF/image ingestion (the README didn't say; it bills itself as "a Claude Code skill"). Confirm before indexing docs so the ingest cost is attributable.
Whether graphify query itself spends LLM tokens to answer (vs returning raw graph context) — affects per-query cost comparison; measure.
