Session log: qwen3.6 benchmark, route strict-format to 3.6
Benchmarked qwen3.6 (36B MoE) against qwen3:14b and qwen3:32b on 16 representative prompts. qwen3.6 scored 15/16 versus 11/16 for qwen3:14b and 12/16 for qwen3:32b, winning every strict-format/adherence test (multi-step rules, weekend-aware scheduling, prompt-injection resistance, word-limit summary). A single reasoning regression is noted for re-check at qwen3.7.

Updated .claude/OLLAMA.md (Models, Documentation Engine, and When-to-Use tables) and the one-line model summary in .claude/CLAUDE.md to route strict-format work to qwen3.6 and keep bulk prose on qwen3:14b (2x faster).

Also removed the openclaw npm package and the ~/.openclaw data dir earlier in the session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -6,10 +6,13 @@ Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Avail
 
 | Model | Size | Use For |
 |-------|------|---------|
-| `qwen3:14b` | 9.3 GB | Summarization, classification, data extraction, drafting |
+| `qwen3.6:latest` | 24 GB | Strict-format work: JSON/structured extraction, classification, per-item rules, redaction, word-limited summaries, adherence-critical drafting. ~32 tok/s. |
+| `qwen3:14b` | 9.3 GB | Bulk prose where format is loose: session log narrative, commit bodies, client notes, free-text handoffs. ~66 tok/s — 2x faster than 3.6. |
 | `codestral:22b` | 12 GB | Code generation, refactoring suggestions, docstrings |
 | `nomic-embed-text` | 274 MB | Embeddings only (used by GrepAI) |
 
+Routing basis: 16-prompt benchmark on 2026-05-16 (`benchmark_qwen_3_6.py` in repo root). qwen3.6 scored 15/16 vs qwen3:14b 11/16 and qwen3:32b 12/16. 3.6 won every strict-format and adherence test (multi-step rules, schedule reasoning with weekend trap, prompt-injection resistance, word-limit summary) — at the cost of ~2x slower inference. **Known regression**: 3.6 missed one small reasoning prompt (3 vs expected 4) that 14b/32b got — re-validate when qwen3.7 lands. qwen3:32b is dominated on every axis; not in routing rotation.
+
 ## Endpoints
 
 Auto-detect: any machine that has a local Ollama listening on `127.0.0.1:11434` uses local. Otherwise fall back to Mike's workstation over Tailscale.
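Reviewer's note on the auto-detect rule in the Endpoints section: the behaviour it describes could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code. The names `port_open` and `resolve_ollama_url` are hypothetical, and `FALLBACK_URL` is a placeholder because the workstation's Tailscale address does not appear in this hunk.

```python
import socket

LOCAL_HOST = "127.0.0.1"
OLLAMA_PORT = 11434  # port named in the Endpoints section

# Hypothetical placeholder: the real Tailscale address is not given in the diff.
FALLBACK_URL = "http://workstation.tailnet.example:11434"


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def resolve_ollama_url(fallback: str = FALLBACK_URL, probe=port_open) -> str:
    """Prefer a local Ollama listener; otherwise return the remote fallback URL."""
    if probe(LOCAL_HOST, OLLAMA_PORT):
        return f"http://{LOCAL_HOST}:{OLLAMA_PORT}"
    return fallback
```

The `probe` parameter is only there so the decision logic can be exercised without a live Ollama instance; a real caller would use the default.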
@@ -85,11 +88,15 @@ This keeps Claude tokens focused on reasoning, decisions, and execution. Ollama
 | Syncro comment bodies + billing descriptions | qwen3:14b | Review checklist + post via API |
 | Ticket initial issue / description text | qwen3:14b | Review + post |
 | Client-facing notes and summaries | qwen3:14b | Review for accuracy |
-| Ticket / issue classification (priority, type, category) | qwen3:14b | Review + apply label |
-| Diff summarization before commit | qwen3:14b | Review + use in commit message |
-| Error message categorization (transient / config / bug) | qwen3:14b | Review + act on classification |
 | Agent phase handoff summaries (explore → plan, plan → implement) | qwen3:14b | Review + include in agent brief |
 | Client email drafts | qwen3:14b | Review for accuracy + tone before sending |
+| Ticket / issue classification (priority, type, category) | qwen3.6 | Review + apply label |
+| Diff summarization before commit | qwen3.6 | Review + use in commit message |
+| Error message categorization (transient / config / bug) | qwen3.6 | Review + act on classification |
+| Structured data extraction (JSON, fields, tags) | qwen3.6 | Review + use programmatically |
+| PII redaction in logs/transcripts | qwen3.6 | Review before publishing |
+| Strict word-limit summaries (e.g. ticket subject, alert text) | qwen3.6 | Review + use |
+| Multi-step / per-item rule application on lists | qwen3.6 | Review + use |
 | Code comments and docstrings | codestral:22b | Review before applying |
 | Refactor suggestions | codestral:22b | Review before applying |
 
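Reviewer's note on the "Structured data extraction" row: one way the "review + use programmatically" step could look is a request to Ollama's `/api/generate` endpoint with `"format": "json"`, followed by a key check before the result is trusted. This is a sketch under assumptions. The helper names, the required-key check, and the base URL argument are illustrative and do not come from the repository.

```python
import json
import urllib.request


def build_generate_payload(prompt: str, model: str = "qwen3.6:latest") -> dict:
    """Build a non-streaming /api/generate request body that asks for JSON output."""
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_structured(reply: dict, required: set) -> dict:
    """Decode the model's JSON text and fail loudly if expected keys are missing."""
    data = json.loads(reply["response"])  # the generated text is in "response"
    if not isinstance(data, dict):
        raise ValueError("model did not return a JSON object")
    missing = required - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def generate(base_url: str, payload: dict) -> dict:
    """POST the payload to Ollama and return the decoded reply (needs a live server)."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Rejecting a reply with missing keys keeps a malformed answer from reaching whatever consumes the structured output, which matches the review column's intent for parsed data.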
@@ -153,16 +160,22 @@ print('warm')
 | Session log narrative sections | qwen3:14b |
 | Commit message body | qwen3:14b |
 | Ticket / client comment drafting | qwen3:14b |
-| Summarize logs, diffs, incident notes | qwen3:14b |
-| Classify bug type, severity, category, priority | qwen3:14b |
-| Extract structured data from text | qwen3:14b |
-| Diff summarization before commit | qwen3:14b |
-| Error categorization (transient / config / bug) | qwen3:14b |
+| Summarize logs, diffs, incident notes (no length cap) | qwen3:14b |
 | Agent phase handoff summaries | qwen3:14b |
 | Client email drafts | qwen3:14b |
+| Classify bug type, severity, category, priority | qwen3.6 |
+| Extract structured data from text (JSON, fields) | qwen3.6 |
+| Diff summarization with strict format / fields | qwen3.6 |
+| Error categorization (transient / config / bug / permission) | qwen3.6 |
+| PII redaction, output preserving format | qwen3.6 |
+| Strict word-limit summaries (subject lines, alerts) | qwen3.6 |
+| Multi-step rule application across lists | qwen3.6 |
+| Untrusted input that may contain prompt injection | qwen3.6 |
 | Code comment / docstring generation | codestral:22b |
 | Refactor suggestions | codestral:22b |
 
+**Rule of thumb:** if the output is *prose someone will read*, use qwen3:14b (2x faster). If the output is *structured data something will parse* or *must obey a tight format*, use qwen3.6.
+
 ## Review Policy
 
 - Documentation output (session logs, commit messages, comments) — Claude reviews before writing/posting
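Reviewer's note on the rule of thumb added in this hunk: it can be expressed as a small lookup. The sketch below is illustrative only. `OutputKind` and `pick_model` are hypothetical names, and sending all untrusted input to qwen3.6 regardless of output kind is an assumption extrapolated from the "Untrusted input" table row.

```python
from enum import Enum


class OutputKind(Enum):
    PROSE = "prose"    # narrative a person will read
    STRICT = "strict"  # parsed by a program or bound to a tight format
    CODE = "code"      # comments, docstrings, refactor suggestions


# Model tags as listed in the Models table of OLLAMA.md.
MODEL_FOR = {
    OutputKind.PROSE: "qwen3:14b",
    OutputKind.STRICT: "qwen3.6:latest",
    OutputKind.CODE: "codestral:22b",
}


def pick_model(kind: OutputKind, untrusted_input: bool = False) -> str:
    """Return the model tag for a task, following the rule of thumb."""
    if untrusted_input:
        # Assumption: possibly injected input always goes to qwen3.6.
        return MODEL_FOR[OutputKind.STRICT]
    return MODEL_FOR[kind]
```

For example, `pick_model(OutputKind.PROSE)` selects the faster qwen3:14b for a commit body, while a ticket classification would pass `OutputKind.STRICT`.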