Session log: qwen3.6 benchmark, route strict-format to 3.6
Benchmarked qwen3.6 (36B MoE) vs qwen3:14b and qwen3:32b on 16 representative prompts. qwen3.6 scored 15/16 vs 14b 11/16 and 32b 12/16, winning every strict-format/adherence test (multi-step rules, weekend-aware scheduling, prompt-injection resistance, word-limit summary). Single reasoning regression noted for re-check at qwen3.7. Updated .claude/OLLAMA.md (Models, Documentation Engine, and When-to-Use tables) and .claude/CLAUDE.md one-line model summary to route strict-format work to qwen3.6 and keep bulk prose on qwen3:14b (2x faster). Also removed openclaw npm package + ~/.openclaw data dir earlier in the session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -272,7 +272,7 @@ Tier 0 — **Ollama is the documentation and classification engine.** Route pros
|
||||
| DESKTOP-0O8A1RL | `http://localhost:11434` |
|
||||
| Other | `http://100.92.127.64:11434` (Tailscale) |
|
||||
|
||||
Models: `qwen3:14b` (docs, prose, classification, summarization), `codestral:22b` (code suggestions — always review). Full reference: `.claude/OLLAMA.md`
|
||||
Models: `qwen3.6:latest` (strict-format: JSON, classification, structured rules, redaction, word-limited summaries, untrusted-input handling), `qwen3:14b` (bulk prose: session logs, commit bodies, free-text drafts — 2x faster), `codestral:22b` (code suggestions — always review). Full reference + routing rationale: `.claude/OLLAMA.md`
|
||||
|
||||
### GrepAI (Semantic Code Search)
|
||||
|
||||
|
||||
@@ -6,10 +6,13 @@ Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Avail
|
||||
|
||||
| Model | Size | Use For |
|
||||
|-------|------|---------|
|
||||
| `qwen3:14b` | 9.3 GB | Summarization, classification, data extraction, drafting |
|
||||
| `qwen3.6:latest` | 24 GB | Strict-format work: JSON/structured extraction, classification, per-item rules, redaction, word-limited summaries, adherence-critical drafting. ~32 tok/s. |
|
||||
| `qwen3:14b` | 9.3 GB | Bulk prose where format is loose: session log narrative, commit bodies, client notes, free-text handoffs. ~66 tok/s — 2x faster than 3.6. |
|
||||
| `codestral:22b` | 12 GB | Code generation, refactoring suggestions, docstrings |
|
||||
| `nomic-embed-text` | 274 MB | Embeddings only (used by GrepAI) |
|
||||
|
||||
Routing basis: 16-prompt benchmark on 2026-05-16 (`benchmark_qwen_3_6.py` in repo root). qwen3.6 scored 15/16 vs qwen3:14b 11/16 and qwen3:32b 12/16. 3.6 won every strict-format and adherence test (multi-step rules, schedule reasoning with weekend trap, prompt-injection resistance, word-limit summary) — at the cost of ~2x slower inference. **Known regression**: 3.6 missed one small reasoning prompt (3 vs expected 4) that 14b/32b got — re-validate when qwen3.7 lands. qwen3:32b is dominated on every axis; not in routing rotation.
|
||||
|
||||
## Endpoints
|
||||
|
||||
Auto-detect: any machine that has a local Ollama listening on `127.0.0.1:11434` uses local. Otherwise fall back to Mike's workstation over Tailscale.
|
||||
@@ -85,11 +88,15 @@ This keeps Claude tokens focused on reasoning, decisions, and execution. Ollama
|
||||
| Syncro comment bodies + billing descriptions | qwen3:14b | Review checklist + post via API |
|
||||
| Ticket initial issue / description text | qwen3:14b | Review + post |
|
||||
| Client-facing notes and summaries | qwen3:14b | Review for accuracy |
|
||||
| Ticket / issue classification (priority, type, category) | qwen3:14b | Review + apply label |
|
||||
| Diff summarization before commit | qwen3:14b | Review + use in commit message |
|
||||
| Error message categorization (transient / config / bug) | qwen3:14b | Review + act on classification |
|
||||
| Agent phase handoff summaries (explore → plan, plan → implement) | qwen3:14b | Review + include in agent brief |
|
||||
| Client email drafts | qwen3:14b | Review for accuracy + tone before sending |
|
||||
| Ticket / issue classification (priority, type, category) | qwen3.6 | Review + apply label |
|
||||
| Diff summarization before commit | qwen3.6 | Review + use in commit message |
|
||||
| Error message categorization (transient / config / bug) | qwen3.6 | Review + act on classification |
|
||||
| Structured data extraction (JSON, fields, tags) | qwen3.6 | Review + use programmatically |
|
||||
| PII redaction in logs/transcripts | qwen3.6 | Review before publishing |
|
||||
| Strict word-limit summaries (e.g. ticket subject, alert text) | qwen3.6 | Review + use |
|
||||
| Multi-step / per-item rule application on lists | qwen3.6 | Review + use |
|
||||
| Code comments and docstrings | codestral:22b | Review before applying |
|
||||
| Refactor suggestions | codestral:22b | Review before applying |
|
||||
|
||||
@@ -153,16 +160,22 @@ print('warm')
|
||||
| Session log narrative sections | qwen3:14b |
|
||||
| Commit message body | qwen3:14b |
|
||||
| Ticket / client comment drafting | qwen3:14b |
|
||||
| Summarize logs, diffs, incident notes | qwen3:14b |
|
||||
| Classify bug type, severity, category, priority | qwen3:14b |
|
||||
| Extract structured data from text | qwen3:14b |
|
||||
| Diff summarization before commit | qwen3:14b |
|
||||
| Error categorization (transient / config / bug) | qwen3:14b |
|
||||
| Summarize logs, diffs, incident notes (no length cap) | qwen3:14b |
|
||||
| Agent phase handoff summaries | qwen3:14b |
|
||||
| Client email drafts | qwen3:14b |
|
||||
| Classify bug type, severity, category, priority | qwen3.6 |
|
||||
| Extract structured data from text (JSON, fields) | qwen3.6 |
|
||||
| Diff summarization with strict format / fields | qwen3.6 |
|
||||
| Error categorization (transient / config / bug / permission) | qwen3.6 |
|
||||
| PII redaction, output preserving format | qwen3.6 |
|
||||
| Strict word-limit summaries (subject lines, alerts) | qwen3.6 |
|
||||
| Multi-step rule application across lists | qwen3.6 |
|
||||
| Untrusted input that may contain prompt injection | qwen3.6 |
|
||||
| Code comment / docstring generation | codestral:22b |
|
||||
| Refactor suggestions | codestral:22b |
|
||||
|
||||
**Rule of thumb:** if the output is *prose someone will read*, use qwen3:14b (2x faster). If the output is *structured data something will parse* or *must obey a tight format*, use qwen3.6.
|
||||
|
||||
## Review Policy
|
||||
|
||||
- Documentation output (session logs, commit messages, comments) — Claude reviews before writing/posting
|
||||
|
||||
@@ -189,3 +189,86 @@ Ground-truth docs were written for both active projects. GuruRMM got `docs/tech-
|
||||
- Standards index: `.claude/standards/index.yml`
|
||||
- Commit (claudetools): `dd0ef45`
|
||||
- Commit (gururmm submodule tech-stack/mission): `79604a2`
|
||||
|
||||
---
|
||||
|
||||
## Update: 16:02 MST — qwen3.6 benchmark + Ollama routing update + openclaw removal
|
||||
|
||||
**Machine:** GURU-BEAST-ROG (Mike Swanson)
|
||||
|
||||
### Session Summary
|
||||
|
||||
Removed openclaw from the workstation by uninstalling the global npm package (`npm uninstall -g openclaw`, 458 deps removed) and deleting the `~/.openclaw` data directory (identity, agents, memory, devices, `.env`). User chose complete deletion with no backup. Confirmed no running processes, scheduled tasks, or services existed for openclaw.
|
||||
|
||||
Benchmarked `qwen3.6:latest` (new 36B MoE) against `qwen3:14b` (current production default) and `qwen3:32b` on the local Ollama instance to evaluate whether 3.6 is a meaningful upgrade for the documentation-engine workload. Built a Python harness measuring cold-start load time, throughput (from Ollama's eval_duration), and capability scores against deterministic graders. Initial six-prompt round exposed a grader bug (multi_step test had the wrong expected set) — after fixing, all three models scored 5/6 with qwen3.6 the only one to apply per-file rules correctly. Per user request, expanded the suite to 16 prompts weighted toward strict-format and adherence work (CSV filter, FizzBuzz, PII redaction, exact-count bullets, nested JSON, scheduling with weekend trap, prompt-injection resistance, exact delimiter, multi-field classification, strict word-limit summary).
|
||||
|
||||
Re-ran with the expanded suite. Final scores: `qwen3:14b` 11/16, `qwen3:32b` 12/16, `qwen3.6` 15/16. qwen3.6 won every strict-format/adherence test (multi-step rules, weekend-aware scheduling, injection resistance, 25-word limit). One regression: qwen3.6 failed the 15-min schedule reasoning prompt (answered 3, expected 4) that 14b and 32b both got right. Throughput: 14b ~66 tok/s, 32b ~21 tok/s, 3.6 ~32 tok/s. qwen3.6 cold-load (4.9s) was actually faster than 14b's (8.6s) despite the larger file.
|
||||
|
||||
Updated `.claude/OLLAMA.md` (Models, Documentation Engine, When-to-Use tables) and the one-line model summary in `.claude/CLAUDE.md` to route prose drafting to qwen3:14b (2x faster) and strict-format work (JSON, classification, redaction, word limits, multi-step rules, untrusted input) to qwen3.6. Added an explicit "untrusted input that may contain prompt injection → qwen3.6" routing rule since 14b and 32b both output "HACKED" to the injection prompt and only 3.6 ignored it.
|
||||
|
||||
### Key Decisions
|
||||
|
||||
- **Promoted qwen3.6 to dual-routing default** (strict-format only) rather than full replacement — 14b's 2x throughput still wins for bulk prose where format is forgiving.
|
||||
- **Expanded benchmark from 6 to 16 prompts** before changing documentation defaults. The first 6 produced an ambiguous 5/6-across-the-board signal; the expanded suite produced a decisive 4-point capability gap.
|
||||
- **Added explicit injection-resistance routing rule.** Both older models output "HACKED" to the injection test; only 3.6 resisted. Worth calling out separately in OLLAMA.md so future routing decisions account for it.
|
||||
- **Documented the 3.6 reasoning regression in OLLAMA.md as a re-check-at-qwen3.7 note** rather than disqualifying 3.6. Single-prompt miss vs four strict-format wins is a clear net positive.
|
||||
- **Kept qwen3:32b installed** despite being dominated on every axis (per user choice — frees ~20 GB if removed later).
|
||||
- **Removed openclaw with no backup** per explicit user direction ("`.env`, identity, device pairings — all gone").
|
||||
|
||||
### Problems Encountered
|
||||
|
||||
- **Grader bug in the multi_step test.** Initial expected set uppercased `.py` filenames but the prompt said to leave them unchanged. Discovered by inspecting raw model outputs; fixed `check_multi_step()` and re-scored from saved snippets.
|
||||
- **Shell escaping of `\n` literals** when rescoring inline via `py -c "..."` from bash double-quoted heredocs — the backslash got eaten and the replace silently no-op'd. Worked around by writing `rescore_qwen.py` as a real file.
|
||||
- **Rebase conflict on this very session log** — DESKTOP-0O8A1RL had already pushed a 2026-05-16 log earlier (GuruRMM work). Resolved by keeping both, appending this work as an Update section.
|
||||
|
||||
### Configuration Changes
|
||||
|
||||
- `.claude/OLLAMA.md` — rewrote three tables (Models, Documentation Engine, When-to-Use). Added benchmark-basis paragraph under Models and a one-line rule-of-thumb under When-to-Use. +23/-10 lines.
|
||||
- `.claude/CLAUDE.md` — single line updated (model summary now names qwen3.6 + qwen3:14b instead of qwen3:14b only). +1/-1.
|
||||
|
||||
### Files Created (uncommitted, in CWD on GURU-BEAST-ROG)
|
||||
|
||||
- `benchmark_qwen_3_6.py` — re-runnable harness, 16 prompts, deterministic graders
|
||||
- `rescore_qwen.py` — one-off rescorer that reads snippets from JSON and regenerates the MD report
|
||||
- `qwen-benchmark-2026-05-16.json` — full raw benchmark output (per-prompt timings, token counts, snippets, pass/fail)
|
||||
- `qwen-benchmark-2026-05-16.md` — readable comparison report
|
||||
|
||||
### Infrastructure & Servers
|
||||
|
||||
- **Ollama (local on DESKTOP-0O8A1RL, accessed from GURU-BEAST-ROG via `OLLAMA` env)** — three models exercised: `qwen3:14b` (9.3 GB), `qwen3:32b` (20 GB), `qwen3.6:latest` (24 GB MoE, Q4_K_M, family `qwen35moe`).
|
||||
- No production servers, databases, or client systems touched.
|
||||
|
||||
### Credentials
|
||||
|
||||
None used or rotated. The deleted `~/.openclaw/.env` likely contained openclaw-specific API keys / device pairing tokens — destroyed per user direction, not captured.
|
||||
|
||||
### Commands & Outputs
|
||||
|
||||
```bash
|
||||
# Remove openclaw
|
||||
npm uninstall -g openclaw # removed 458 packages in 3s
|
||||
rm -rf "C:/Users/guru/.openclaw"
|
||||
where.exe openclaw # INFO: Could not find files for the given pattern(s).
|
||||
|
||||
# Run benchmark
|
||||
py benchmark_qwen_3_6.py # 16 prompts x 3 models, ~12 min total
|
||||
|
||||
# Final scoreboard
|
||||
# qwen3:14b 11/16 66 tok/s
|
||||
# qwen3:32b 12/16 21 tok/s
|
||||
# qwen3.6:latest 15/16 32 tok/s
|
||||
```
|
||||
|
||||
### Pending / Incomplete
|
||||
|
||||
- **Re-validate the reasoning regression** when qwen3.7 (or any qwen3.6 update) lands. The 15-min schedule prompt (`reasoning` test in the harness) is the canary — currently 3, expected 4.
|
||||
- **Decide on qwen3:32b retention** — dominated on every axis, frees ~20 GB if removed. Deferred.
|
||||
- **Decide whether to commit benchmark artifacts to repo** (e.g. `benchmarks/` folder) so future model evaluations have a baseline. Deferred.
|
||||
|
||||
### Reference Information
|
||||
|
||||
- Benchmark harness: `c:\Users\guru\ClaudeTools\benchmark_qwen_3_6.py` (rerun: `py benchmark_qwen_3_6.py`)
|
||||
- Benchmark report: `c:\Users\guru\ClaudeTools\qwen-benchmark-2026-05-16.md`
|
||||
- Benchmark raw data: `c:\Users\guru\ClaudeTools\qwen-benchmark-2026-05-16.json`
|
||||
- Ollama endpoint (local on this machine): `http://localhost:11434/api/chat` with `think:false` for qwen3 family, `options.num_ctx:4096` for benchmark
|
||||
- Updated docs: `.claude/OLLAMA.md`, `.claude/CLAUDE.md`
|
||||
|
||||
Reference in New Issue
Block a user