sync: auto-sync from DESKTOP-0O8A1RL at 2026-05-16 16:59:53

Author: Mike Swanson
Machine: DESKTOP-0O8A1RL
Timestamp: 2026-05-16 16:59:53
This commit is contained in:
2026-05-16 16:59:55 -07:00
parent a6fb8d2ab6
commit 7386460f55

View File

@@ -272,3 +272,59 @@ py benchmark_qwen_3_6.py # 16 prompts x 3 models, ~12 min total
- Benchmark raw data: `c:\Users\guru\ClaudeTools\qwen-benchmark-2026-05-16.json`
- Ollama endpoint (local on this machine): `http://localhost:11434/api/chat` with `think:false` for qwen3 family, `options.num_ctx:4096` for benchmark
- Updated docs: `.claude/OLLAMA.md`, `.claude/CLAUDE.md`
---
## Update: 16:30 PT -- Ollama model benchmarking + qwen3:8b routing
### Session Summary
The session covered benchmarking qwen3.6:latest against qwen3:14b on this machine (DESKTOP-0O8A1RL) to determine whether the routing table from the Mac-based benchmark was appropriate here. Initial tests showed 18-19 tok/s across both models -- far below the reference machine's 66 tok/s (qwen3:14b) and 32 tok/s (qwen3.6). All initial qwen3.6 responses were empty because a 400-token budget was exhausted entirely by internal thinking before any visible output was generated.
A revised test suite with 2000-token budgets and /no_think mode confirmed the throughput floor. The Ollama /api/ps endpoint revealed that both models were running in split CPU/GPU mode: qwen3:14b at 73% VRAM (11.3/15.6 GB), qwen3.6 at 41% (11.3/27.5 GB). Windows WMI had reported the GPU VRAM as 4095 MB due to a known 32-bit integer cap in the Win32_VideoController.AdapterRAM field. The actual GPU is an RTX 5070 Ti Laptop with 12 GB GDDR7.
A 6000-token reasoning test confirmed qwen3.6's limitation on this machine: it consumed all 6000 tokens internally and produced no visible output, while qwen3:14b answered the same problem correctly in 2409 tokens. qwen3.6 is a 36B MoE model (family: qwen35moe) -- the MoE architecture explains why it runs at the same speed as qwen3:14b despite being 2.4x larger, since only a fraction of parameters activate per token.
qwen3:8b (5.2 GB GGUF) was pulled as the candidate fix. Benchmarked at 100% VRAM utilization (10.9/10.9 GB), it ran at 74-86 tok/s -- 4.8x faster than qwen3:14b on this machine and exceeding the reference machine's qwen3:14b speed of 66 tok/s. OLLAMA.md and CLAUDE.md were updated with a per-machine routing table: qwen3:8b for prose on DESKTOP-0O8A1RL, qwen3:14b everywhere else, qwen3.6 for strict-format tasks on all machines.
### Key Decisions
- **qwen3:8b chosen over qwen3:14b for prose on this machine.** qwen3:14b's 9.3 GB GGUF expands to 15.6 GB at runtime, overflowing the 12 GB VRAM by 3.6 GB and causing split-mode slowdown. qwen3:8b fits entirely in VRAM.
- **qwen3.6 retained for strict-format tasks despite 17 tok/s.** Quality advantage from the 16-prompt benchmark holds. Short output tasks (JSON, classification) are less sensitive to throughput.
- **WMI VRAM reporting not trusted.** Win32_VideoController.AdapterRAM caps at 4 GB due to 32-bit integer overflow. Ollama /api/ps size_vram is the reliable source.
- **6000-token budget still insufficient for qwen3.6 reasoning.** Model burns all tokens on internal thinking on complex prompts. For reasoning tasks, qwen3:14b is the correct choice on this machine.
- **/api/chat with think:false is required for reliable qwen3 output.** All benchmark tests used /api/generate, which allows thinking to consume the entire budget. Production Ollama calls must use /api/chat per the existing OLLAMA.md guidance.
### Problems Encountered
- **qwen3.6 empty responses at 400 tokens.** Internal thinking consumed entire budget. Fix: larger budget (2000+) and /no_think mode for initial testing.
- **qwen3.6 empty at 6000 tokens (reasoning).** Even 6000 tokens insufficient for qwen3.6 to complete thinking + output on a multi-step reasoning problem. qwen3:14b handled it in 2409 tokens.
- **WMI reported 4095 MB VRAM.** 32-bit cap bug. Actual VRAM confirmed via Ollama /api/ps: 11.3-11.8 GB loaded = 12 GB physical.
- **Unicode encode error in benchmark script.** Arrow character in f-string failed cp1252 encoding. Fixed by removing the character.
### Configuration Changes
- `.claude/OLLAMA.md` -- added qwen3:8b to models table, per-machine routing table, benchmark results table with VRAM split analysis
- `.claude/CLAUDE.md` -- updated model one-liner to include qwen3:8b with per-machine note
- `qwen3:8b` pulled to Ollama (5.2 GB, `500a1f067a9f`)
### Infrastructure & Servers
| Machine | GPU | VRAM | qwen3:8b speed | qwen3:14b speed |
|---------|-----|------|----------------|-----------------|
| DESKTOP-0O8A1RL | RTX 5070 Ti Laptop | 12 GB GDDR7 | 74-86 tok/s (full GPU) | 17-18 tok/s (split) |
| Mikes-MacBook-Air (ref) | M-series unified | ~16-24 GB | n/a | ~66 tok/s |
### Pending / Incomplete Tasks
- Same as prior updates
- Pluto vault entry still pending
- Pluto SSH key still pending
- Confirm whether /api/chat think:false resolves qwen3.6 JSON output failures (not tested this session)
### Reference Information
- Ollama model table benchmark commit: `4aadf16`
- qwen3:8b model ID: `500a1f067a9f` (5.2 GB, Q4_K_M)
- qwen3.6 family confirmed: `qwen35moe` (36B MoE, not 6B)
- VRAM reality check: use `curl http://localhost:11434/api/ps` not WMI for VRAM readings