From 7386460f55f821303ca40a8abc73a81643b243a0 Mon Sep 17 00:00:00 2001
From: Mike Swanson
Date: Sat, 16 May 2026 16:59:55 -0700
Subject: [PATCH] sync: auto-sync from DESKTOP-0O8A1RL at 2026-05-16 16:59:53

Author: Mike Swanson
Machine: DESKTOP-0O8A1RL
Timestamp: 2026-05-16 16:59:53
---
 session-logs/2026-05-16-session.md | 56 ++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/session-logs/2026-05-16-session.md b/session-logs/2026-05-16-session.md
index 14ccfc8..e4c496a 100644
--- a/session-logs/2026-05-16-session.md
+++ b/session-logs/2026-05-16-session.md
@@ -272,3 +272,59 @@ py benchmark_qwen_3_6.py # 16 prompts x 3 models, ~12 min total
 - Benchmark raw data: `c:\Users\guru\ClaudeTools\qwen-benchmark-2026-05-16.json`
 - Ollama endpoint (local on this machine): `http://localhost:11434/api/chat` with `think:false` for qwen3 family, `options.num_ctx:4096` for benchmark
 - Updated docs: `.claude/OLLAMA.md`, `.claude/CLAUDE.md`
+
+---
+
+## Update: 16:30 PT -- Ollama model benchmarking + qwen3:8b routing
+
+### Session Summary
+
+The session covered benchmarking qwen3.6:latest against qwen3:14b on this machine (DESKTOP-0O8A1RL) to determine whether the routing table from the Mac-based benchmark was appropriate here. Initial tests showed 18-19 tok/s across both models -- far below the reference machine's 66 tok/s (qwen3:14b) and 32 tok/s (qwen3.6). All initial qwen3.6 responses were empty because a 400-token budget was exhausted entirely by internal thinking before any visible output was generated.
+
+A revised test suite with 2000-token budgets and /no_think mode confirmed the throughput floor. The Ollama /api/ps endpoint revealed that both models were running in split CPU/GPU mode: qwen3:14b at 73% VRAM (11.3/15.6 GB), qwen3.6 at 41% (11.3/27.5 GB). Windows WMI had reported the GPU VRAM as 4095 MB due to a known 32-bit integer cap in the Win32_VideoController.AdapterRAM field. The actual GPU is an RTX 5070 Ti Laptop with 12 GB GDDR7.
+
+A 6000-token reasoning test confirmed qwen3.6's limitation on this machine: it consumed all 6000 tokens internally and produced no visible output, while qwen3:14b answered the same problem correctly in 2409 tokens. qwen3.6 is a 36B MoE model (family: qwen35moe) -- the MoE architecture explains why it runs at the same speed as qwen3:14b despite being 2.4x larger, since only a fraction of parameters activate per token.
+
+qwen3:8b (5.2 GB GGUF) was pulled as the candidate fix. Fully resident in VRAM (10.9/10.9 GB per /api/ps), it ran at 74-86 tok/s -- roughly 4-5x faster than qwen3:14b's 17-18 tok/s on this machine and exceeding the reference machine's qwen3:14b speed of 66 tok/s. OLLAMA.md and CLAUDE.md were updated with a per-machine routing table: qwen3:8b for prose on DESKTOP-0O8A1RL, qwen3:14b everywhere else, qwen3.6 for strict-format tasks on all machines.
+
+### Key Decisions
+
+- **qwen3:8b chosen over qwen3:14b for prose on this machine.** qwen3:14b's 9.3 GB GGUF expands to 15.6 GB at runtime, overflowing the 12 GB VRAM by 3.6 GB and causing split-mode slowdown. qwen3:8b fits entirely in VRAM.
+- **qwen3.6 retained for strict-format tasks despite 17 tok/s.** Quality advantage from the 16-prompt benchmark holds. Short-output tasks (JSON, classification) are less sensitive to throughput.
+- **WMI VRAM reporting not trusted.** Win32_VideoController.AdapterRAM caps at 4 GB (reported here as 4095 MB) because the field is a 32-bit unsigned integer. Ollama /api/ps size_vram is the reliable source.
+- **6000-token budget still insufficient for qwen3.6 reasoning.** Model burns all tokens on internal thinking on complex prompts. For reasoning tasks, qwen3:14b is the correct choice on this machine.
+- **/api/chat with think:false is required for reliable qwen3 output.** All benchmark tests in this session used /api/generate, which allowed thinking to consume the entire budget. Production Ollama calls must use /api/chat per the existing OLLAMA.md guidance.
+
+### Problems Encountered
+
+- **qwen3.6 empty responses at 400 tokens.** Internal thinking consumed entire budget. Fix: larger budget (2000+) and /no_think mode for initial testing.
+- **qwen3.6 empty at 6000 tokens (reasoning).** Even 6000 tokens insufficient for qwen3.6 to complete thinking + output on a multi-step reasoning problem. qwen3:14b handled it in 2409 tokens.
+- **WMI reported 4095 MB VRAM.** 32-bit cap bug. Actual VRAM confirmed via Ollama /api/ps: 11.3-11.8 GB loaded, consistent with 12 GB physical.
+- **Unicode encode error in benchmark script.** Arrow character in an f-string could not be encoded as cp1252. Fixed by removing the character.
+
+### Configuration Changes
+
+- `.claude/OLLAMA.md` -- added qwen3:8b to models table, per-machine routing table, benchmark results table with VRAM split analysis
+- `.claude/CLAUDE.md` -- updated model one-liner to include qwen3:8b with per-machine note
+- `qwen3:8b` pulled to Ollama (5.2 GB, `500a1f067a9f`)
+
+### Infrastructure & Servers
+
+| Machine | GPU | VRAM | qwen3:8b speed | qwen3:14b speed |
+|---------|-----|------|----------------|-----------------|
+| DESKTOP-0O8A1RL | RTX 5070 Ti Laptop | 12 GB GDDR7 | 74-86 tok/s (full GPU) | 17-18 tok/s (split) |
+| Mikes-MacBook-Air (ref) | M-series unified | ~16-24 GB | n/a | ~66 tok/s |
+
+### Pending / Incomplete Tasks
+
+- Same as prior updates
+- Pluto vault entry still pending
+- Pluto SSH key still pending
+- Confirm whether /api/chat think:false resolves qwen3.6 JSON output failures (not tested this session)
+
+### Reference Information
+
+- Ollama model table benchmark commit: `4aadf16`
+- qwen3:8b model ID: `500a1f067a9f` (5.2 GB, Q4_K_M)
+- qwen3.6 family confirmed: `qwen35moe` (36B MoE, not 6B)
+- VRAM reality check: use `curl http://localhost:11434/api/ps` (the `size_vram` field), not WMI, for VRAM readings
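
A minimal Python sketch of the `/api/ps` VRAM check from the last reference item, offered as an illustration rather than the script used in the session. It assumes the response is a JSON object with a `models` list whose entries carry `name`, `size`, and `size_vram` in bytes. The helper names (`vram_report`, `fetch_ps`), the decimal-GB conversion, and the sample byte counts are hypothetical stand-ins rounded from the GB figures in the log, so the computed percentage can differ by a point from what the log quotes. The last lines also reproduce the WMI figure: a 32-bit unsigned byte count tops out just under 4096 MB.

```python
import json
import urllib.request

OLLAMA_PS_URL = "http://localhost:11434/api/ps"  # endpoint named in the log


def vram_report(ps_payload):
    """Summarise how much of each loaded model is resident in VRAM.

    Assumes the parsed /api/ps JSON: {"models": [{"name", "size", "size_vram"}, ...]}
    with both sizes in bytes.
    """
    rows = []
    for model in ps_payload.get("models", []):
        size = model.get("size", 0)
        size_vram = model.get("size_vram", 0)
        fraction = size_vram / size if size else 0.0
        rows.append({
            "name": model.get("name", "?"),
            "size_gb": size / 1e9,       # decimal GB (assumption)
            "vram_gb": size_vram / 1e9,
            "vram_pct": round(fraction * 100),
            # Anything short of fully resident is treated as a CPU/GPU split.
            "mode": "full GPU" if fraction >= 1.0 else "split CPU/GPU",
        })
    return rows


def fetch_ps(url=OLLAMA_PS_URL, timeout=5):
    """Fetch and parse /api/ps from a running Ollama server (not called below)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical payload: byte counts rounded from the GB figures in the log.
sample = {
    "models": [
        {"name": "qwen3:14b", "size": 15_600_000_000, "size_vram": 11_300_000_000},
        {"name": "qwen3:8b", "size": 10_900_000_000, "size_vram": 10_900_000_000},
    ]
}

for row in vram_report(sample):
    print(f"{row['name']}: {row['vram_gb']:.1f}/{row['size_gb']:.1f} GB "
          f"({row['vram_pct']}% in VRAM, {row['mode']})")

# A uint32 byte count cannot exceed 2**32 - 1, which floors to 4095 MiB.
WMI_CAP_MB = (2**32 - 1) // 2**20
print(f"WMI AdapterRAM ceiling: {WMI_CAP_MB} MB")
```

Against a live server, `vram_report(fetch_ps())` would replace the sample payload. Any model reported as split is a candidate for swapping to a smaller one, which is the reasoning behind the qwen3:8b routing decision above.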