Graphifyy vs GrepAI — evaluation protocol (GURU-5070)
Goal: real, comparable data on whether Graphifyy beats the incumbent GrepAI for Mike's day-to-day in ClaudeTools, enough to make an adopt / skip / adopt-narrowly call. Decision hinges on token efficiency + retrieval quality, weighed against maintenance cost.
Tools under test
- GrepAI: `D:\claudetools\grepai.exe mcp-serve`, exposed as `mcp__grepai__*` (semantic search + RPG graph: explore / trace_callers / trace_callees / trace_graph). Repo-wide index already built (`.grepai/`). Enabled per-machine via `enabledMcpjsonServers: ["grepai"]` in `.claude/settings.local.json`.
- Graphifyy: `pip install graphifyy && graphify install`. Local graph (NetworkX + tree-sitter + Leiden). CLI/skill: `graphify <path> [--mode deep|--update]`, `graphify query "q"`, `graphify path "A" "B"`, `graphify explain "C"`. Docs/PDF/images ingested via Claude API (token cost); code parsed locally.
Arms (run in separate sessions; MCP toggles need a restart)
- A (GrepAI, baseline / "before"): grepai ON, Graphifyy not used. Run FIRST, this session.
- B (Graphifyy, "after"): Graphifyy ON, grepai DISABLED (removed from `enabledMcpjsonServers`). New session.
- C (Control, optional): both off; only `grep/glob/Read`. Shows whether either graph tool beats plain search.
Same model for all arms. Each query answered in a FRESH sub-agent constrained to that arm's tools, to avoid cross-arm contamination. Scoring done against the rubric, blind to arm where feasible.
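One way to keep scoring blind to arm is to shuffle the collected answers and hide the arm label behind an opaque ID until scoring is finished. The sketch below assumes each answer is a dict with the hypothetical keys `query_id`, `arm`, and `answer`; none of these names come from the tools themselves.

```python
import random


def blind_answers(answers, seed=None):
    """Shuffle answers and strip the arm label before scoring.

    `answers` is a list of dicts with assumed keys: query_id, arm, answer.
    Returns (blinded, key). `blinded` is what the scorer sees; `key` maps
    each opaque ID back to (query_id, arm) and stays hidden until the end.
    """
    rng = random.Random(seed)
    shuffled = list(answers)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)

    blinded, key = [], {}
    for n, item in enumerate(shuffled, start=1):
        blind_id = f"ans-{n:03d}"
        key[blind_id] = (item["query_id"], item["arm"])
        blinded.append(
            {
                "blind_id": blind_id,
                "query_id": item["query_id"],  # scorer needs this to pick the rubric
                "answer": item["answer"],
            }
        )
    return blinded, key
```

Blinding is only partial: an answer that quotes tool output (for example a `trace_callers` result) can still reveal its arm, which is why the protocol says "where feasible".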
Fixed test corpus (both tools index the SAME slice)
To keep it fair and bounded (not the whole repo + node_modules):
- Code: `projects/msp-tools/guru-rmm/` (Rust server + agent + React dashboard)
- Docs: `wiki/`, `projects/msp-pricing/`, `clients/kittle/`, `clients/dataforth/`
- PDF: `projects/msp-pricing/marketing/The Arizona Business Owner's Guide to Choosing an MSP - Arizona Computer Guru.pdf`
Note asymmetry: GrepAI's existing index is repo-wide (slight recall edge, more noise); Graphifyy indexes exactly this slice. All test queries are answerable from the slice.
Metrics (per query x arm)
| Metric | How captured |
|---|---|
| `ctx_tokens` | chars of retrieved context the agent consumed / 4 (consistent approx) |
| `tool_calls` | number of retrieval round-trips to reach the answer |
| `latency_s` | wall-clock for the query |
| `score` | 0 = wrong/missing, 1 = partial, 2 = complete & correct (vs rubric) |
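The per-query metrics can be captured as one row per (query, arm). The sketch below applies the protocol's chars / 4 approximation and appends rows to a CSV file. The column order, the `query_id` and `arm` column names, and the `append_result` helper are assumptions for illustration, not a fixed schema.

```python
import csv
from dataclasses import asdict, dataclass, fields
from pathlib import Path

CHARS_PER_TOKEN = 4  # the protocol's fixed approximation, not a real tokenizer


@dataclass
class ResultRow:
    query_id: str     # e.g. "C1" (column name is an assumption)
    arm: str          # "A", "B", or "C"
    ctx_tokens: int   # retrieved chars / 4
    tool_calls: int   # retrieval round-trips
    latency_s: float  # wall-clock seconds
    score: int        # 0, 1, or 2 against the rubric


def estimate_ctx_tokens(retrieved_chars: int) -> int:
    """Convert a character count into the protocol's token estimate."""
    return retrieved_chars // CHARS_PER_TOKEN


def append_result(path, row: ResultRow) -> None:
    """Append one row to the results file, writing the header on first use."""
    if row.score not in (0, 1, 2):
        raise ValueError("score must be 0, 1, or 2")
    path = Path(path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=[f.name for f in fields(ResultRow)])
        if write_header:
            writer.writeheader()
        writer.writerow(asdict(row))
```

Using the same divisor for every arm keeps `ctx_tokens` comparable across arms even though it is not an exact token count.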
One-time / maintenance (measured once per tool):
- `index_build_s`: full index of the test corpus (code-only, then code+docs)
- `reindex_s`: incremental update after touching ONE file
- `ingest_api_tokens`: Graphifyy's Claude-API tokens to ingest docs/PDF/images (GrepAI: note its embedding model/cost; LLM-ingestion ≈ 0)
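The two timing metrics can be taken with a small wall-clock wrapper around whatever command builds or updates the index. This is a minimal sketch; `time_command` is a hypothetical helper, and the command lines in the comments are assumptions based on the CLI forms listed under "Tools under test", to be confirmed at install.

```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run `cmd` (a list of argv strings) and return (elapsed_seconds, returncode).

    Output is captured so console printing does not skew the measurement.
    """
    start = time.perf_counter()
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, proc.returncode


# Assumed invocations, not verified against the installed CLI:
#   index_build_s: time_command(["graphify", "projects/msp-tools/guru-rmm/"])
#   reindex_s:     time_command(["graphify", "projects/msp-tools/guru-rmm/", "--update"])
```

A non-zero return code should invalidate the timing rather than be logged as a fast run.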
Test set (10 queries; code-heavy + docs-heavy, since docs is Graphifyy's claimed edge)
Each has a rubric = key facts a correct answer MUST contain.
CODE
- C1: "In GuruRMM, how does the server avoid false-failing commands that were delivered but not acked? Name the mechanism + migrations." Rubric: agent CommandAck on receipt + dedup; reaper RE-DELIVERS un-acked instead of false-failing; migrations 058 acked_at / 059 delivery_attempts.
- C2: "Trace where un-acked command re-delivery is handled in the RMM server and what calls it." Rubric: the reaper fn + its caller path. (grepai trace_callers vs graphify path)
- C3: "Where is GuruRMM agent self-update with rollback implemented and what guards it?" Rubric: agent `updater/mod.rs` + watchdog.
- C4: "What does GuruConnect SPEC-018 propose?" Rubric: session broker / capture worker as SYSTEM.
DOCS / KNOWLEDGE (Graphifyy's claimed strength)
- D1: "GPS pricing structure (tiers + prices)?" Rubric: Basic $19 / Pro $26 / Advanced $39 per endpoint; support plans Essential/Standard/Premium/Priority.
- D2: "Summarize the Kittle BEC/ACH-fraud incident and root cause." Rubric: Ken+marco+Accounting compromised; fraudulent bank-change to City of Tucson + Marana ($130K+ prevented); IC3 filed; root cause = April credential theft + incomplete remediation (password never reset, ~2mo).
- D3: "Which ACG clients had M365 breach/credential incidents in 2026 and each root cause?" Rubric (relationship query): Kittle (BEC), Dataforth (2026-03-27 phishing -> MFA), mvaninc (unauthorized sign-in OKC). Partial credit per client.
- D4: "List the 7 red flags of a bad MSP from the Buyers Guide." Rubric: the 7 from MSP-Buyers-Guide-Content.md (unlimited-support, high-pressure sales, offshore-only, no proactive monitoring, long lock-ins, one-size packages, no local presence). PDF/doc ingestion.
- D5: "Canonical Kittle article path + what it superseded?" Rubric: clients/kittle.md canonical; kittle-design.md superseded 2026-06-09.
MIXED (code + docs)
- M1: "How do new GuruRMM builds get promoted from beta to stable?" Rubric: builds tag beta; promote via POST /api/updates/rollouts/:version/promote; build-server.sh auto-deploys.
Procedure
- (Arm A, now) For each query, spawn a sub-agent: tools = grepai + Read only; instruct it to use ONLY grepai for retrieval, answer, and report (answer, total retrieved chars, # grepai calls, elapsed). Log to results.csv with arm=A.
- Score each answer 0/1/2 vs rubric.
- Disable GrepAI (below), install + index Graphifyy, measure one-time costs.
- (Arm B, new session) Same queries, sub-agent tools = Bash(graphify) + Read; use ONLY graphify for retrieval. Log arm=B. Score.
- (Arm C, optional) grep/glob/Read only. Log arm=C.
- Analyze: per-metric medians by arm; weight ctx_tokens + score (the day-to-day levers); factor in index/maintenance cost and the doc-vs-code split.
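The analysis step can be sketched as a grouped median over the logged rows. The code assumes results.csv has the columns `arm`, `query_id`, `ctx_tokens`, `tool_calls`, `latency_s`, and `score`; the first two column names and both helper functions are assumptions for illustration.

```python
import csv
from collections import defaultdict
from statistics import median

METRICS = ("ctx_tokens", "tool_calls", "latency_s", "score")


def load_rows(path):
    """Read the results file into a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def medians_by(rows, key):
    """Return {group: {metric: median}} where `key(row)` picks the group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for metric in METRICS:
            grouped[key(row)][metric].append(float(row[metric]))
    return {
        group: {metric: median(values) for metric, values in metrics.items()}
        for group, metrics in grouped.items()
    }


def by_arm(row):
    return row["arm"]


def by_arm_and_kind(row):
    # Query IDs start with C (code), D (docs), or M (mixed).
    return (row["arm"], row["query_id"][0])
```

Calling `medians_by(rows, by_arm)` gives the headline comparison, and `medians_by(rows, by_arm_and_kind)` gives the doc-vs-code split. With only 10 queries per arm, the medians are indicative rather than statistically strong.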
Reversible environment changes (per-machine only)
- Disable GrepAI: edit .claude/settings.local.json, remove "grepai" from enabledMcpjsonServers, and restart the session. Re-enable by adding it back. Do NOT edit .mcp.json (shared/fleet).
- Install Graphifyy: py -m pip install graphifyy && graphify install. Uninstall: py -m pip uninstall graphifyy, then remove its skill.
- Snapshot: a copy of settings.local.json is kept at projects/graphifyy-eval/settings.local.json.bak before any edit.
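The GrepAI toggle can be scripted so the snapshot is never skipped. The sketch below assumes `enabledMcpjsonServers` is a JSON array of server names, as in the `["grepai"]` form shown earlier; `set_mcp_server` is a hypothetical helper, and a session restart is still required afterwards.

```python
import json
import shutil
from pathlib import Path


def set_mcp_server(settings_path, backup_path, server, enabled):
    """Add or remove `server` in enabledMcpjsonServers, snapshotting first.

    The backup is written only if it does not already exist, so the original
    pre-evaluation state is preserved across repeated toggles.
    """
    settings_path, backup_path = Path(settings_path), Path(backup_path)
    if not backup_path.exists():
        backup_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(settings_path, backup_path)

    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    servers = list(settings.get("enabledMcpjsonServers", []))
    if enabled and server not in servers:
        servers.append(server)
    elif not enabled:
        servers = [s for s in servers if s != server]
    settings["enabledMcpjsonServers"] = servers

    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return servers
```

For Arm B this would be called with `.claude/settings.local.json`, `projects/graphifyy-eval/settings.local.json.bak`, `"grepai"`, and `False`. It only touches the per-machine file and never .mcp.json.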
Open setup unknowns to resolve at install
- Which API key/env var Graphifyy uses for doc/PDF/image ingestion (the README does not say; it bills itself as "a Claude Code skill"). Confirm before indexing docs so ingest cost is attributable.
- Whether `graphify query` itself spends LLM tokens to answer (vs returning raw graph context). This affects the per-query cost comparison, so measure it.