feat: add qwen3:8b for DESKTOP-0O8A1RL, update Ollama routing
Benchmarked 2026-05-16 on DESKTOP-0O8A1RL (RTX 5070 Ti Laptop, 12 GB VRAM): - qwen3:8b: 100% VRAM fit (10.9/10.9 GB) -> 74-86 tok/s - qwen3:14b: 73% VRAM (11.3/15.6 GB split) -> 17-18 tok/s (4.8x slower) - qwen3.6: 41% VRAM (11.3/27.5 GB split) -> 17-19 tok/s qwen3:14b overflows 12 GB VRAM at runtime (9.3 GB GGUF = 15.6 GB loaded). qwen3:8b fits entirely in VRAM and matches the reference machine speed. Updated OLLAMA.md: added qwen3:8b to models table, per-machine routing table, benchmark results. Updated CLAUDE.md model one-liner. Routing: qwen3:8b for prose on DESKTOP-0O8A1RL, qwen3:14b everywhere else, qwen3.6 for strict-format tasks on all machines. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -272,7 +272,7 @@ Tier 0 — **Ollama is the documentation and classification engine.** Route pros
|
||||
| DESKTOP-0O8A1RL | `http://localhost:11434` |
|
||||
| Other | `http://100.92.127.64:11434` (Tailscale) |
|
||||
|
||||
Models: `qwen3.6:latest` (strict-format: JSON, classification, structured rules, redaction, word-limited summaries, untrusted-input handling), `qwen3:14b` (bulk prose: session logs, commit bodies, free-text drafts — 2x faster), `codestral:22b` (code suggestions — always review). Full reference + routing rationale: `.claude/OLLAMA.md`
|
||||
Models: `qwen3.6:latest` (strict-format: JSON, classification, structured rules, redaction, word-limited summaries), `qwen3:8b` (prose on DESKTOP-0O8A1RL — 86 tok/s, full 12 GB VRAM fit), `qwen3:14b` (prose everywhere else — ~66 tok/s), `codestral:22b` (code suggestions — always review). Full reference + per-machine routing: `.claude/OLLAMA.md`
|
||||
|
||||
### GrepAI (Semantic Code Search)
|
||||
|
||||
|
||||
@@ -6,12 +6,34 @@ Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Avail
|
||||
|
||||
| Model | Size | Use For |
|
||||
|-------|------|---------|
|
||||
| `qwen3.6:latest` | 24 GB | Strict-format work: JSON/structured extraction, classification, per-item rules, redaction, word-limited summaries, adherence-critical drafting. ~32 tok/s. |
|
||||
| `qwen3:14b` | 9.3 GB | Bulk prose where format is loose: session log narrative, commit bodies, client notes, free-text handoffs. ~66 tok/s — 2x faster than 3.6. |
|
||||
| `qwen3.6:latest` | 23 GB | Strict-format work: JSON/structured extraction, classification, per-item rules, redaction, word-limited summaries, adherence-critical drafting. 36B MoE. |
|
||||
| `qwen3:14b` | 9.3 GB | Bulk prose on machines with >16 GB VRAM: session log narrative, commit bodies, client notes, free-text handoffs. |
|
||||
| `qwen3:8b` | 5.2 GB | Bulk prose on DESKTOP-0O8A1RL (12 GB VRAM). Same role as qwen3:14b but fits fully in VRAM on that machine. |
|
||||
| `codestral:22b` | 12 GB | Code generation, refactoring suggestions, docstrings |
|
||||
| `nomic-embed-text` | 274 MB | Embeddings only (used by GrepAI) |
|
||||
|
||||
Routing basis: 16-prompt benchmark on 2026-05-16 (`benchmark_qwen_3_6.py` in repo root). qwen3.6 scored 15/16 vs qwen3:14b 11/16 and qwen3:32b 12/16. 3.6 won every strict-format and adherence test (multi-step rules, schedule reasoning with weekend trap, prompt-injection resistance, word-limit summary) — at the cost of ~2x slower inference. **Known regression**: 3.6 missed one small reasoning prompt (3 vs expected 4) that 14b/32b got — re-validate when qwen3.7 lands. qwen3:32b is dominated on every axis; not in routing rotation.
|
||||
### Routing basis
|
||||
|
||||
Quality routing: 16-prompt benchmark on 2026-05-16 (`benchmark_qwen_3_6.py` in repo root). qwen3.6 scored 15/16 vs qwen3:14b 11/16 and qwen3:32b 12/16. 3.6 won every strict-format and adherence test. **Known regression**: 3.6 missed one small reasoning prompt — re-validate when qwen3.7 lands. qwen3:32b dominated on every axis; not in rotation.
|
||||
|
||||
Speed routing: benchmarked 2026-05-16 on DESKTOP-0O8A1RL (RTX 5070 Ti Laptop, 12 GB VRAM):
|
||||
|
||||
| Model | VRAM fit | Tok/s (this machine) | Tok/s (full-VRAM ref) |
|
||||
|-------|----------|----------------------|------------------------|
|
||||
| qwen3:8b | 100% (10.9/10.9 GB) | **74-86** | ~90 |
|
||||
| qwen3:14b | 73% (11.3/15.6 GB) | 17-18 | ~66 |
|
||||
| qwen3.6 | 41% (11.3/27.5 GB) | 17-19 | ~32 |
|
||||
|
||||
qwen3:14b and qwen3.6 are CPU-bottlenecked on this machine (split mode, PCIe bandwidth limited). qwen3:8b fits entirely in VRAM and is **4.8x faster** than qwen3:14b here.
|
||||
|
||||
### Machine-specific prose model
|
||||
|
||||
| Machine | GPU VRAM | Prose model |
|
||||
|---------|----------|-------------|
|
||||
| DESKTOP-0O8A1RL | 12 GB (RTX 5070 Ti Laptop) | `qwen3:8b` |
|
||||
| Mikes-MacBook-Air | unified memory | `qwen3:14b` |
|
||||
| HOWARD-HOME | local Ollama | `qwen3:14b` |
|
||||
| Other | Tailscale fallback | `qwen3:14b` |
|
||||
|
||||
## Endpoints
|
||||
|
||||
@@ -83,13 +105,15 @@ This keeps Claude tokens focused on reasoning, decisions, and execution. Ollama
|
||||
|
||||
| Output | Model | Claude's role |
|
||||
|--------|-------|---------------|
|
||||
| Session log narrative (summary, decisions, problems) | qwen3:14b | Review + assemble with factual sections |
|
||||
| Commit message body | qwen3:14b | Review + execute git commit |
|
||||
| Syncro comment bodies + billing descriptions | qwen3:14b | Review checklist + post via API |
|
||||
| Ticket initial issue / description text | qwen3:14b | Review + post |
|
||||
| Client-facing notes and summaries | qwen3:14b | Review for accuracy |
|
||||
| Agent phase handoff summaries (explore → plan, plan → implement) | qwen3:14b | Review + include in agent brief |
|
||||
| Client email drafts | qwen3:14b | Review for accuracy + tone before sending |
|
||||
| Session log narrative (summary, decisions, problems) | qwen3:14b / qwen3:8b* | Review + assemble with factual sections |
|
||||
| Commit message body | qwen3:14b / qwen3:8b* | Review + execute git commit |
|
||||
| Syncro comment bodies + billing descriptions | qwen3:14b / qwen3:8b* | Review checklist + post via API |
|
||||
| Ticket initial issue / description text | qwen3:14b / qwen3:8b* | Review + post |
|
||||
| Client-facing notes and summaries | qwen3:14b / qwen3:8b* | Review for accuracy |
|
||||
| Agent phase handoff summaries (explore → plan, plan → implement) | qwen3:14b / qwen3:8b* | Review + include in agent brief |
|
||||
| Client email drafts | qwen3:14b / qwen3:8b* | Review for accuracy + tone before sending |
|
||||
|
||||
*Use `qwen3:8b` on DESKTOP-0O8A1RL — 4.8x faster due to full VRAM fit. Use `qwen3:14b` everywhere else.
|
||||
| Ticket / issue classification (priority, type, category) | qwen3.6 | Review + apply label |
|
||||
| Diff summarization before commit | qwen3.6 | Review + use in commit message |
|
||||
| Error message categorization (transient / config / bug) | qwen3.6 | Review + act on classification |
|
||||
@@ -157,12 +181,14 @@ print('warm')
|
||||
|
||||
| Task | Model |
|
||||
|------|-------|
|
||||
| Session log narrative sections | qwen3:14b |
|
||||
| Commit message body | qwen3:14b |
|
||||
| Ticket / client comment drafting | qwen3:14b |
|
||||
| Summarize logs, diffs, incident notes (no length cap) | qwen3:14b |
|
||||
| Agent phase handoff summaries | qwen3:14b |
|
||||
| Client email drafts | qwen3:14b |
|
||||
| Session log narrative sections | qwen3:8b* / qwen3:14b |
|
||||
| Commit message body | qwen3:8b* / qwen3:14b |
|
||||
| Ticket / client comment drafting | qwen3:8b* / qwen3:14b |
|
||||
| Summarize logs, diffs, incident notes (no length cap) | qwen3:8b* / qwen3:14b |
|
||||
| Agent phase handoff summaries | qwen3:8b* / qwen3:14b |
|
||||
| Client email drafts | qwen3:8b* / qwen3:14b |
|
||||
|
||||
*On DESKTOP-0O8A1RL only — 4.8x faster (86 tok/s vs 18 tok/s). Use qwen3:14b on all other machines.
|
||||
| Classify bug type, severity, category, priority | qwen3.6 |
|
||||
| Extract structured data from text (JSON, fields) | qwen3.6 |
|
||||
| Diff summarization with strict format / fields | qwen3.6 |
|
||||
|
||||
Reference in New Issue
Block a user