sync: auto-sync from DESKTOP-0O8A1RL at 2026-05-16 16:59:53

Author: Mike Swanson Machine: DESKTOP-0O8A1RL Timestamp: 2026-05-16 16:59:53
2026-05-16 16:59:55 -07:00
parent a6fb8d2ab6
commit 7386460f55
1 changed files with 56 additions and 0 deletions
--- a/session-logs/2026-05-16-session.md
+++ b/session-logs/2026-05-16-session.md
@@ -272,3 +272,59 @@ py benchmark_qwen_3_6.py        # 16 prompts x 3 models, ~12 min total
 - Benchmark raw data: `c:\Users\guru\ClaudeTools\qwen-benchmark-2026-05-16.json`
 - Ollama endpoint (local on this machine): `http://localhost:11434/api/chat` with `think:false` for qwen3 family, `options.num_ctx:4096` for benchmark
 - Updated docs: `.claude/OLLAMA.md`, `.claude/CLAUDE.md`
+
+---
+
+## Update: 16:30 PT -- Ollama model benchmarking + qwen3:8b routing
+
+### Session Summary
+
+The session covered benchmarking qwen3.6:latest against qwen3:14b on this machine (DESKTOP-0O8A1RL) to determine whether the routing table from the Mac-based benchmark was appropriate here. Initial tests showed 18-19 tok/s across both models -- far below the reference machine's 66 tok/s (qwen3:14b) and 32 tok/s (qwen3.6). All initial qwen3.6 responses were empty because a 400-token budget was exhausted entirely by internal thinking before any visible output was generated.
+
+A revised test suite with 2000-token budgets and /no_think mode confirmed the throughput floor. The Ollama /api/ps endpoint revealed that both models were running in split CPU/GPU mode: qwen3:14b had 73% of its runtime footprint in VRAM (11.3 of 15.6 GB) and qwen3.6 had 41% (11.3 of 27.5 GB). Windows WMI had reported the GPU VRAM as 4095 MB due to a known 32-bit integer cap in the Win32_VideoController.AdapterRAM field. The actual GPU is an RTX 5070 Ti Laptop with 12 GB GDDR7.
+
+A 6000-token reasoning test confirmed qwen3.6's limitation on this machine: it consumed all 6000 tokens internally and produced no visible output, while qwen3:14b answered the same problem correctly in 2409 tokens. qwen3.6 is a 36B MoE model (family: qwen35moe) -- the MoE architecture explains why it runs at the same speed as qwen3:14b despite being 2.4x larger, since only a fraction of parameters activate per token.
+
+qwen3:8b (5.2 GB GGUF) was pulled as the candidate fix. Benchmarked fully resident in VRAM (10.9/10.9 GB, 100% on GPU), it ran at 74-86 tok/s -- 4.8x faster than qwen3:14b on this machine and exceeding the reference machine's qwen3:14b speed of 66 tok/s. OLLAMA.md and CLAUDE.md were updated with a per-machine routing table: qwen3:8b for prose on DESKTOP-0O8A1RL, qwen3:14b everywhere else, qwen3.6 for strict-format tasks on all machines.
+
+### Key Decisions
+
+- **qwen3:8b chosen over qwen3:14b for prose on this machine.** qwen3:14b's 9.3 GB GGUF expands to 15.6 GB at runtime, overflowing the 12 GB VRAM by 3.6 GB and causing split-mode slowdown. qwen3:8b fits entirely in VRAM.
+- **qwen3.6 retained for strict-format tasks despite 17 tok/s.** Quality advantage from the 16-prompt benchmark holds. Short output tasks (JSON, classification) are less sensitive to throughput.
+- **WMI VRAM reporting not trusted.** Win32_VideoController.AdapterRAM caps at 4 GB due to 32-bit integer overflow. Ollama /api/ps size_vram is the reliable source.
+- **6000-token budget still insufficient for qwen3.6 reasoning.** Model burns all tokens on internal thinking on complex prompts. For reasoning tasks, qwen3:14b is the correct choice on this machine.
+- **/api/chat with think:false is required for reliable qwen3 output.** All benchmark tests this session used /api/generate, which lets thinking consume the entire token budget. Production Ollama calls must use /api/chat per the existing OLLAMA.md guidance.
+
+### Problems Encountered
+
+- **qwen3.6 empty responses at 400 tokens.** Internal thinking consumed entire budget. Fix: larger budget (2000+) and /no_think mode for initial testing.
+- **qwen3.6 empty at 6000 tokens (reasoning).** Even 6000 tokens insufficient for qwen3.6 to complete thinking + output on a multi-step reasoning problem. qwen3:14b handled it in 2409 tokens.
+- **WMI reported 4095 MB VRAM.** 32-bit cap bug. Ollama /api/ps showed 11.3-11.8 GB of model data resident in VRAM, consistent with the 12 GB physical card.
+- **Unicode encode error in benchmark script.** Arrow character in f-string failed cp1252 encoding. Fixed by removing the character.
+
+### Configuration Changes
+
+- `.claude/OLLAMA.md` -- added qwen3:8b to models table, per-machine routing table, benchmark results table with VRAM split analysis
+- `.claude/CLAUDE.md` -- updated model one-liner to include qwen3:8b with per-machine note
+- `qwen3:8b` pulled to Ollama (5.2 GB, `500a1f067a9f`)
+
+### Infrastructure & Servers
+
+| Machine | GPU | VRAM | qwen3:8b speed | qwen3:14b speed |
+|---------|-----|------|----------------|-----------------|
+| DESKTOP-0O8A1RL | RTX 5070 Ti Laptop | 12 GB GDDR7 | 74-86 tok/s (full GPU) | 17-18 tok/s (split) |
+| Mikes-MacBook-Air (ref) | M-series unified | ~16-24 GB | n/a | ~66 tok/s |
+
+### Pending / Incomplete Tasks
+
+- Same as prior updates
+- Pluto vault entry still pending
+- Pluto SSH key still pending
+- Confirm whether /api/chat think:false resolves qwen3.6 JSON output failures (not tested this session)
+
+### Reference Information
+
+- Ollama model table benchmark commit: `4aadf16`
+- qwen3:8b model ID: `500a1f067a9f` (5.2 GB, Q4_K_M)
+- qwen3.6 family confirmed: `qwen35moe` (36B MoE, not 6B)
+- VRAM reality check: use `curl http://localhost:11434/api/ps`, not WMI, for VRAM readings
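
Note on the VRAM reality check: the split-mode percentages quoted in the log can be derived from the `size` and `size_vram` fields that Ollama's `/api/ps` returns for each loaded model. The following is a minimal sketch of that calculation, not the script used in the session. The function names `vram_residency` and `fetch_ps`, the 5-second timeout, and the decimal-GB divisor are assumptions made for illustration.

```python
import json
from urllib.request import urlopen


def vram_residency(ps_payload):
    """Return (name, vram_gb, total_gb, percent_in_vram) for each loaded model.

    ps_payload is the decoded JSON body of GET /api/ps. Each entry under
    "models" carries "size" (total runtime footprint in bytes) and
    "size_vram" (bytes resident on the GPU).
    """
    rows = []
    for model in ps_payload.get("models", []):
        size = model.get("size", 0)
        size_vram = model.get("size_vram", 0)
        # Guard against a zero size so a malformed entry cannot divide by zero.
        percent = 100.0 * size_vram / size if size else 0.0
        # Assumption: report decimal gigabytes (1e9 bytes).
        rows.append((model.get("name", "?"), size_vram / 1e9, size / 1e9, percent))
    return rows


def fetch_ps(base_url="http://localhost:11434"):
    """Fetch the list of currently loaded models from a local Ollama server."""
    with urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)
```

Calling `vram_residency(fetch_ps())` while a model is loaded gives one row per model. A percentage below 100 indicates the split CPU/GPU mode described in the log, and 100 indicates the model is fully resident in VRAM.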