# Graphifyy vs GrepAI — evaluation protocol (GURU-5070)

Goal: collect real, comparable data on whether **Graphifyy** beats the incumbent **GrepAI** for
Mike's day-to-day work in ClaudeTools, enough to make an adopt / skip / adopt-narrowly call.
The decision hinges on token efficiency and retrieval quality, weighed against maintenance cost.

## Tools under test

- **GrepAI**: `D:\claudetools\grepai.exe mcp-serve`, exposed as `mcp__grepai__*` (semantic
  search + RPG graph: explore / trace_callers / trace_callees / trace_graph). Repo-wide index
  already built (`.grepai/`). Enabled per-machine via `enabledMcpjsonServers:["grepai"]` in
  `.claude/settings.local.json`.
- **Graphifyy**: `pip install graphifyy && graphify install`. Local graph (NetworkX +
  tree-sitter + Leiden). CLI/skill: `graphify <path> [--mode deep|--update]`,
  `graphify query "q"`, `graphify path "A" "B"`, `graphify explain "C"`. Docs/PDF/images
  ingested via Claude API (token cost); code parsed locally.

## Arms (run in separate sessions; MCP toggles need a restart)

- **A: GrepAI** (baseline / "before"): grepai ON, Graphifyy not used. Run FIRST, this session.
- **B: Graphifyy** ("after"): Graphifyy ON, grepai DISABLED (removed from
  `enabledMcpjsonServers`). New session.
- **C: Control** (optional): both off; only `grep`/`glob`/`Read`. Shows whether either graph
  tool beats plain search.

Same model for all arms. Each query is answered in a FRESH sub-agent constrained to that arm's
tools, to avoid cross-arm contamination. Scoring is done against the rubric, blind to arm where
feasible.

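One way to keep scoring blind to arm is to shuffle the collected answers and swap the arm label for an opaque ID before the scorer sees them. The sketch below assumes each answer is a dict with hypothetical keys `arm`, `query_id`, and `answer`; none of these names come from the tools themselves.

```python
import random


def blind_answers(answers, seed=0):
    """Shuffle answers and hide the arm behind an opaque ID.

    `answers` is a list of dicts with the (assumed) keys: arm, query_id, answer.
    Returns (blinded, key): `blinded` is what the scorer sees; `key` maps each
    opaque ID back to its arm so scores can be unblinded afterwards.
    """
    rng = random.Random(seed)
    shuffled = list(answers)  # copy so the caller's list keeps its order
    rng.shuffle(shuffled)

    blinded, key = [], {}
    for i, row in enumerate(shuffled):
        blind_id = f"ans-{i:03d}"
        key[blind_id] = row["arm"]
        blinded.append(
            {"blind_id": blind_id, "query_id": row["query_id"], "answer": row["answer"]}
        )
    return blinded, key
```

Answers can still leak their arm through wording (for example, by naming the tool they used), which is why blinding only holds where feasible.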
## Fixed test corpus (both tools index the SAME slice)

To keep it fair and bounded (not the whole repo + node_modules):
- Code: `projects/msp-tools/guru-rmm/` (Rust server + agent + React dashboard)
- Docs: `wiki/`, `projects/msp-pricing/`, `clients/kittle/`, `clients/dataforth/`
- PDF: `projects/msp-pricing/marketing/The Arizona Business Owner's Guide to Choosing an MSP - Arizona Computer Guru.pdf`

Note the asymmetry: GrepAI's existing index is repo-wide (slight recall edge, more noise);
Graphifyy indexes exactly this slice. All test queries are answerable from the slice.

## Metrics (per query x arm)

| Metric | How captured |
|---|---|
| `ctx_tokens` | chars of retrieved context the agent consumed / 4 (consistent approx) |
| `tool_calls` | number of retrieval round-trips to reach the answer |
| `latency_s` | wall-clock seconds for the query |
| `score` | 0 = wrong/missing, 1 = partial, 2 = complete & correct (vs rubric) |

One-time / maintenance (measured once per tool):
- `index_build_s`: full index of the test corpus (code-only, then code+docs)
- `reindex_s`: incremental update after touching ONE file
- `ingest_api_tokens`: Graphifyy's Claude-API tokens to ingest docs/PDF/images
  (GrepAI: note its embedding model/cost; LLM-ingestion ≈ 0)

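The two mechanical measurements above, the chars/4 token approximation and the wall-clock timings, can be sketched as a pair of small helpers. The function names are hypothetical, and rounding down in the division is an assumption: the protocol only asks that the approximation be applied consistently.

```python
import subprocess
import sys
import time


def approx_ctx_tokens(retrieved_text: str) -> int:
    """ctx_tokens as defined in the metrics table: characters / 4.

    Integer division (rounding down) is an assumption; any rule works as
    long as every arm uses the same one.
    """
    return len(retrieved_text) // 4


def timed_run(cmd: list[str]) -> float:
    """Run a command to completion and return its wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    print(approx_ctx_tokens("x" * 1000))  # 250
    # Placeholder command; substitute the real index or reindex invocation.
    print(f"{timed_run([sys.executable, '-c', 'pass']):.3f}s")
```

`timed_run` suits `index_build_s` and `reindex_s`, where the whole cost is one command. Per-query `latency_s` spans several tool calls, so it still comes from the sub-agent's own elapsed report.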
## Test set (10 queries; code-heavy + docs-heavy, since docs is Graphifyy's claimed edge)

Each has a rubric = key facts a correct answer MUST contain.

CODE
- C1: "In GuruRMM, how does the server avoid false-failing commands that were delivered but not
  acked? Name the mechanism + migrations." Rubric: agent CommandAck on receipt + dedup; reaper
  RE-DELIVERS un-acked instead of false-failing; migrations 058 acked_at / 059 delivery_attempts.
- C2: "Trace where un-acked command re-delivery is handled in the RMM server and what calls it."
  Rubric: the reaper fn + its caller path. (grepai trace_callers vs graphify path)
- C3: "Where is GuruRMM agent self-update with rollback implemented and what guards it?"
  Rubric: agent `updater/mod.rs` + watchdog.
- C4: "What does GuruConnect SPEC-018 propose?" Rubric: session broker / capture worker as SYSTEM.

DOCS / KNOWLEDGE (Graphifyy's claimed strength)
- D1: "GPS pricing structure (tiers + prices)?" Rubric: Basic $19 / Pro $26 / Advanced $39 per
  endpoint; support plans Essential/Standard/Premium/Priority.
- D2: "Summarize the Kittle BEC/ACH-fraud incident and root cause." Rubric: Ken+marco+Accounting
  compromised; fraudulent bank-change to City of Tucson + Marana ($130K+ prevented); IC3 filed;
  root cause = April credential theft + incomplete remediation (password never reset, ~2mo).
- D3: "Which ACG clients had M365 breach/credential incidents in 2026 and each root cause?"
  Rubric (relationship query): Kittle (BEC), Dataforth (2026-03-27 phishing -> MFA), mvaninc
  (unauthorized sign-in OKC). Partial credit per client.
- D4: "List the 7 red flags of a bad MSP from the Buyers Guide." Rubric: the 7 from
  MSP-Buyers-Guide-Content.md (unlimited-support, high-pressure sales, offshore-only, no
  proactive monitoring, long lock-ins, one-size packages, no local presence). PDF/doc ingestion.
- D5: "Canonical Kittle article path + what it superseded?" Rubric: clients/kittle.md canonical;
  kittle-design.md superseded 2026-06-09.

MIXED (code + docs)
- M1: "How do new GuruRMM builds get promoted from beta to stable?" Rubric: builds tag beta;
  promote via POST /api/updates/rollouts/:version/promote; build-server.sh auto-deploys.

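The rubrics above are lists of key facts, so the 0/1/2 score can be derived from how many of those facts an answer contains. The sketch below is one possible mapping, not a rule the protocol states: the name `rubric_score` is hypothetical, and a human scorer still decides which facts count as present and whether an answer is simply wrong.

```python
def rubric_score(facts_hit: int, facts_total: int) -> int:
    """Map rubric coverage onto the 0/1/2 scale.

    Assumed mapping: 2 = every key fact present, 1 = some but not all,
    0 = none. The scorer supplies `facts_hit` after reading the answer.
    """
    if facts_total <= 0:
        raise ValueError("rubric must list at least one key fact")
    if not 0 <= facts_hit <= facts_total:
        raise ValueError("facts_hit must be between 0 and facts_total")
    if facts_hit == facts_total:
        return 2
    return 1 if facts_hit > 0 else 0
```

For D3, whose rubric names three clients, an answer that covers two of them would score `rubric_score(2, 3)`, which is 1 (partial).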
## Procedure

1. (Arm A, now) For each query, spawn a sub-agent: tools = grepai + Read only; instruct it to
   use ONLY grepai for retrieval, answer, and report (answer, total retrieved chars, # grepai
   calls, elapsed). Log to results.csv with arm=A.
2. Score each answer 0/1/2 vs rubric.
3. Disable GrepAI (below), install + index Graphifyy, measure one-time costs.
4. (Arm B, new session) Same queries, sub-agent tools = Bash(graphify) + Read; use ONLY
   graphify for retrieval. Log arm=B. Score.
5. (Arm C, optional) grep/glob/Read only. Log arm=C.
6. Analyze: per-metric medians by arm; weight ctx_tokens + score (the day-to-day levers);
   factor in index/maintenance cost and the doc-vs-code split.

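The per-metric medians in the analysis step can be computed with the standard library alone. This is a minimal sketch that assumes results.csv has a header row with the columns `arm`, `ctx_tokens`, `tool_calls`, `latency_s`, and `score`; the protocol names the metrics but does not fix the file layout, so adjust the column names to match the actual log.

```python
import csv
from collections import defaultdict
from statistics import median

# Assumed column names, taken from the metric names in the protocol.
METRICS = ("ctx_tokens", "tool_calls", "latency_s", "score")


def medians_by_arm(csv_file):
    """Return {arm: {metric: median}} from an open results.csv-style file."""
    values = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(csv_file):
        for metric in METRICS:
            values[row["arm"]][metric].append(float(row[metric]))
    return {
        arm: {metric: median(samples) for metric, samples in per_metric.items()}
        for arm, per_metric in values.items()
    }
```

Call it with `open("results.csv", newline="")`. Adding a column for query type (code / docs / mixed) and grouping on it as well would expose the doc-vs-code split.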
## Reversible environment changes (per-machine only)

- Disable GrepAI: edit `.claude/settings.local.json`, remove "grepai" from
  `enabledMcpjsonServers`, and restart the session. Re-enable = add it back.
- **Do NOT edit `.mcp.json`** (shared/fleet).
- Install Graphifyy: `py -m pip install graphifyy && graphify install`.
- Uninstall: `py -m pip uninstall graphifyy` + remove its skill.
- Snapshot: a copy of `settings.local.json` is kept at
  `projects/graphifyy-eval/settings.local.json.bak` before any edit.

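The snapshot-then-disable step can be scripted so the backup is never skipped. The sketch below assumes `settings.local.json` is plain JSON with a top-level `enabledMcpjsonServers` array, as the GrepAI entry under "Tools under test" suggests. The function name and its arguments are hypothetical, and a session restart is still needed afterwards.

```python
import json
import shutil
from pathlib import Path


def disable_server(settings_path, backup_path, name="grepai"):
    """Snapshot the settings file, then drop `name` from enabledMcpjsonServers.

    Returns True if the entry was present and removed, False if it was
    already absent (the file is left unmodified in that case).
    """
    settings_path, backup_path = Path(settings_path), Path(backup_path)

    # Take the snapshot first, and never overwrite an existing one, so the
    # backup always reflects the pre-evaluation state.
    backup_path.parent.mkdir(parents=True, exist_ok=True)
    if not backup_path.exists():
        shutil.copy2(settings_path, backup_path)

    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    enabled = settings.get("enabledMcpjsonServers", [])
    if name not in enabled:
        return False

    settings["enabledMcpjsonServers"] = [s for s in enabled if s != name]
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return True
```

Re-enabling is the reverse: copy the `.bak` file back over `settings.local.json`, which also restores any other keys exactly as they were.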
## Open setup unknowns to resolve at install

- Which API key/env var Graphifyy uses for doc/PDF/image ingestion (README didn't say; it bills
  itself as "a Claude Code skill"). Confirm before indexing docs so ingest cost is attributable.
- Whether `graphify query` itself spends LLM tokens to answer (vs returning raw graph context).
  This affects the per-query cost comparison; measure.