# Ollama: Local AI Reference

Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Available to all team members via Tailscale.

## Models

| Model | Size | Use For |
|-------|------|---------|
| `qwen3.6:latest` | 23 GB | Strict-format work: JSON/structured extraction, classification, per-item rules, redaction, word-limited summaries, adherence-critical drafting. 36B MoE. |
| `qwen3:14b` | 9.3 GB | Bulk prose on machines with >16 GB VRAM: session log narrative, commit bodies, client notes, free-text handoffs. |
| `qwen3:8b` | 5.2 GB | Bulk prose on DESKTOP-0O8A1RL (12 GB VRAM). Same role as qwen3:14b but fits fully in VRAM on that machine. |
| `codestral:22b` | 12 GB | Code generation, refactoring suggestions, docstrings |
| `nomic-embed-text` | 274 MB | Embeddings only (used by GrepAI) |

### Routing basis

Quality routing: 16-prompt benchmark on 2026-05-16 (`benchmark_qwen_3_6.py` in repo root). qwen3.6 scored 15/16 vs qwen3:14b 11/16 and qwen3:32b 12/16. qwen3.6 won every strict-format and adherence test. **Known regression**: qwen3.6 missed one small reasoning prompt; re-validate when qwen3.7 lands. qwen3:32b was dominated on every axis, so it is not in rotation.

Speed routing: benchmarked 2026-05-16 on DESKTOP-0O8A1RL (RTX 5070 Ti Laptop, 12 GB VRAM):

| Model | VRAM fit | Tok/s (this machine) | Tok/s (full-VRAM ref) |
|-------|----------|----------------------|------------------------|
| qwen3:8b | 100% (10.9/10.9 GB) | **74-86** | ~90 |
| qwen3:14b | 73% (11.3/15.6 GB) | 17-18 | ~66 |
| qwen3.6 | 41% (11.3/27.5 GB) | 17-19 | ~32 |

qwen3:14b and qwen3.6 are CPU-bottlenecked on this machine (split mode, PCIe bandwidth limited). qwen3:8b fits entirely in VRAM and is **4.8x faster** than qwen3:14b here (86 tok/s vs 18 tok/s).
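
As a quick check on the speed-routing numbers, the sketch below recomputes the 8b-vs-14b speedup and each model's throughput relative to its full-VRAM reference. The dictionary and function names are illustrative only; the inputs are the tok/s figures from the table, and the headline ratio assumes the upper bound of each range (86 vs 18 tok/s).

```python
# (low, high) tok/s measured on DESKTOP-0O8A1RL, and the approximate
# full-VRAM reference figures, both copied from the speed-routing table.
measured = {"qwen3:8b": (74, 86), "qwen3:14b": (17, 18), "qwen3.6": (17, 19)}
full_vram_ref = {"qwen3:8b": 90, "qwen3:14b": 66, "qwen3.6": 32}


def speedup_range(fast, slow):
    """Worst-case and best-case ratio between two (low, high) tok/s ranges."""
    return fast[0] / slow[1], fast[1] / slow[0]


def fraction_of_reference(model):
    """Best measured throughput as a fraction of the full-VRAM reference."""
    return measured[model][1] / full_vram_ref[model]


worst, best = speedup_range(measured["qwen3:8b"], measured["qwen3:14b"])
headline = measured["qwen3:8b"][1] / measured["qwen3:14b"][1]  # 86 / 18

print(f"8b vs 14b: {worst:.1f}x to {best:.1f}x (headline {headline:.1f}x)")
for model in measured:
    print(f"{model}: {fraction_of_reference(model):.0%} of full-VRAM speed")
```

On these inputs the split-mode models run at roughly a quarter (qwen3:14b) to just over half (qwen3.6) of their full-VRAM reference speed, while qwen3:8b stays close to its reference.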
### Machine-specific prose model

| Machine | GPU VRAM | Prose model |
|---------|----------|-------------|
| DESKTOP-0O8A1RL | 12 GB (RTX 5070 Ti Laptop) | `qwen3:8b` |
| Mikes-MacBook-Air | unified memory | `qwen3:14b` |
| HOWARD-HOME | local Ollama | `qwen3:14b` |
| Other | Tailscale fallback | `qwen3:14b` |

## Endpoints

Auto-detect: any machine that has a local Ollama listening on `127.0.0.1:11434` uses local. Otherwise fall back to Mike's workstation over Tailscale.

```bash
# Preferred universal resolver: works on any machine
if curl -s -m 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
    OLLAMA="http://localhost:11434"
else
    OLLAMA="http://100.92.127.64:11434"
fi
```

Rationale:

- **Mike's workstation (DESKTOP-0O8A1RL):** local matches, no change.
- **HOWARD-HOME:** also has a local Ollama with the canonical model set (confirmed 2026-04-22). Uses local: faster, zero Tailscale hop, no load on Mike's GPU.
- **Other team machines:** no local Ollama → falls back to Mike's over Tailscale.
- **Mike's machine offline:** graceful degradation. Local users continue working; non-local users get a clean timeout.

Manual override (for testing or explicit preference): set `OLLAMA=http://100.92.127.64:11434` before the call.

Check reachability:

```bash
curl -s "$OLLAMA/api/tags" | jq -r '.models[].name'
```

If neither endpoint responds: verify Tailscale (`tailscale status`) and whether your local Ollama service is running.

## Access Control

- Port 11434 allowed ONLY from Tailscale subnet (100.0.0.0/8)
- NOT exposed to LAN, VPN, or internet
- Binding: `OLLAMA_HOST=0.0.0.0:11434` (firewall restricts)

## Calling Ollama

Use the `/api/chat` endpoint with `think:false` for qwen3 models. On qwen3, the older `/api/generate` endpoint puts output into thinking tokens that don't appear in the `response` field, so you get an empty response if you use it.
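
For Python callers, the auto-detect rule from the Endpoints section can be sketched as below. This is a minimal sketch, not an established helper in this repo: `is_up`, `resolve_ollama`, and the injectable `probe` parameter are hypothetical names introduced here. The one detail worth noting is that `urllib.request.urlopen` raises when nothing is listening instead of returning a falsy value, so the probe has to catch the exception.

```python
import os
import urllib.request

LOCAL = "http://localhost:11434"
TAILSCALE = "http://100.92.127.64:11434"  # Mike's workstation, as in the bash resolver


def is_up(base_url, timeout=2.0):
    """Return True if GET /api/tags answers within the timeout."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout):
            return True
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False


def resolve_ollama(env=None, probe=is_up):
    """Explicit OLLAMA override first, then local, then the Tailscale fallback."""
    env = os.environ if env is None else env
    override = env.get("OLLAMA")
    if override:
        return override
    return LOCAL if probe(LOCAL) else TAILSCALE
```

Calling `resolve_ollama()` with no arguments probes the local port once; passing a stub `probe` keeps the function testable without a running Ollama.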
Preferred one-liner:

```bash
python -c "
import urllib.request, json, sys, os
def local_up():
    # urlopen raises on failure instead of returning a falsy value
    try:
        urllib.request.urlopen('http://localhost:11434/api/tags', timeout=2)
        return True
    except OSError:
        return False
OLLAMA = os.environ.get('OLLAMA') or ('http://localhost:11434' if local_up() else 'http://100.92.127.64:11434')
body = json.dumps({
    'model': 'qwen3:14b',
    'messages': [{'role': 'user', 'content': sys.argv[1]}],
    'stream': False,
    'think': False
}).encode()
res = json.loads(urllib.request.urlopen(urllib.request.Request(OLLAMA + '/api/chat', body), timeout=120).read())
print(res['message']['content'])
" "Your prompt here"
```

Or set `$OLLAMA` once from bash (see auto-detect formula above) and reuse it across calls. Export it so the Python snippet can read it from the environment.

For code suggestions, swap `qwen3:14b` for `codestral:22b`. Codestral doesn't need `think:false`.

Cold-start is ~30-50s on first call per model per session. Warm calls are 1-5s.

## Documentation Engine

**Ollama is the default documentation engine for all prose output.** Any time stored text needs to be generated (session logs, commit messages, ticket comments, client notes, code docs), route it through Ollama first. Claude reviews, corrects if needed, then writes or posts.

This keeps Claude tokens focused on reasoning, decisions, and execution. Ollama handles the writing.
### What Ollama owns

| Output | Model | Claude's role |
|--------|-------|---------------|
| Session log narrative (summary, decisions, problems) | qwen3:14b / qwen3:8b* | Review + assemble with factual sections |
| Commit message body | qwen3:14b / qwen3:8b* | Review + execute git commit |
| Syncro comment bodies + billing descriptions | qwen3:14b / qwen3:8b* | Review checklist + post via API |
| Ticket initial issue / description text | qwen3:14b / qwen3:8b* | Review + post |
| Client-facing notes and summaries | qwen3:14b / qwen3:8b* | Review for accuracy |
| Agent phase handoff summaries (explore → plan, plan → implement) | qwen3:14b / qwen3:8b* | Review + include in agent brief |
| Client email drafts | qwen3:14b / qwen3:8b* | Review for accuracy + tone before sending |
| Ticket / issue classification (priority, type, category) | qwen3.6 | Review + apply label |
| Diff summarization before commit | qwen3.6 | Review + use in commit message |
| Error message categorization (transient / config / bug) | qwen3.6 | Review + act on classification |
| Structured data extraction (JSON, fields, tags) | qwen3.6 | Review + use programmatically |
| PII redaction in logs/transcripts | qwen3.6 | Review before publishing |
| Strict word-limit summaries (e.g. ticket subject, alert text) | qwen3.6 | Review + use |
| Multi-step / per-item rule application on lists | qwen3.6 | Review + use |
| Code comments and docstrings | codestral:22b | Review before applying |
| Refactor suggestions | codestral:22b | Review before applying |

\* Use `qwen3:8b` on DESKTOP-0O8A1RL: it is 4.8x faster than 14b there due to full VRAM fit. Use `qwen3:14b` on all other machines.
### What Claude always owns (never Ollama)

- Credentials, passwords, API keys: must be verbatim accurate
- Infrastructure details, IPs, hostnames: must be verbatim accurate
- Command outputs and error messages: verbatim from actual output
- Security decisions, auth review, production migrations
- Final field values on API payloads (rates, IDs, quantities)

### GrepAI config (re-apply on new machines)

`.grepai/` is gitignored (90 MB index + machine-specific timestamps). After running `grepai init` on a new machine, apply these overrides to `.grepai/config.yaml`:

**Remove the `.md` penalty** (markdown is primary content here, not docs noise):

```yaml
# DELETE this block:
- pattern: .md
  factor: 0.6
```

**Add these bonuses** under `search.boost.bonuses`:

```yaml
- pattern: session-logs/
  factor: 1.3
- pattern: .claude/
  factor: 1.2
- pattern: /clients/
  factor: 1.1
```

**Start watcher + register scheduled task:**

```bash
D:/claudetools/grepai.exe watch --background

# Then in PowerShell (admin not required):
$action = New-ScheduledTaskAction -Execute "D:\claudetools\grepai.exe" -Argument "watch --background" -WorkingDirectory "D:\claudetools"
$trigger = New-ScheduledTaskTrigger -AtLogOn -User $env:USERNAME
$settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Hours 0) -MultipleInstances IgnoreNew
Register-ScheduledTask -TaskName "GrepAI Watcher - claudetools" -Action $action -Trigger $trigger -Settings $settings -Force
```

### Warm-start and GrepAI

GrepAI uses `nomic-embed-text` for context lookups, which keeps the Ollama **service** running continuously. The 30-50s service cold-start is effectively eliminated in normal workflow. `qwen3:14b` may take ~5s to swap into VRAM if it hasn't been called recently, but that's the worst case, not 50s.
If the first Ollama call of a session needs to be fast, send a throwaway warm-up ping:

```bash
py -c "
import urllib.request, json
body = json.dumps({'model':'qwen3:14b','messages':[{'role':'user','content':'ok'}],'stream':False,'think':False}).encode()
urllib.request.urlopen(urllib.request.Request('$OLLAMA/api/chat', body), timeout=60).read()
print('warm')
"
```

## When to Use Which Model

| Task | Model |
|------|-------|
| Session log narrative sections | qwen3:8b* / qwen3:14b |
| Commit message body | qwen3:8b* / qwen3:14b |
| Ticket / client comment drafting | qwen3:8b* / qwen3:14b |
| Summarize logs, diffs, incident notes (no length cap) | qwen3:8b* / qwen3:14b |
| Agent phase handoff summaries | qwen3:8b* / qwen3:14b |
| Client email drafts | qwen3:8b* / qwen3:14b |
| Classify bug type, severity, category, priority | qwen3.6 |
| Extract structured data from text (JSON, fields) | qwen3.6 |
| Diff summarization with strict format / fields | qwen3.6 |
| Error categorization (transient / config / bug / permission) | qwen3.6 |
| PII redaction, output preserving format | qwen3.6 |
| Strict word-limit summaries (subject lines, alerts) | qwen3.6 |
| Multi-step rule application across lists | qwen3.6 |
| Untrusted input that may contain prompt injection | qwen3.6 |
| Code comment / docstring generation | codestral:22b |
| Refactor suggestions | codestral:22b |

\* On DESKTOP-0O8A1RL only: 4.8x faster (86 tok/s vs 18 tok/s). Use `qwen3:14b` on all other machines.

**Rule of thumb:** if the output is *prose someone will read*, use the per-machine prose model (qwen3:8b on DESKTOP-0O8A1RL, qwen3:14b elsewhere). If the output is *structured data something will parse* or *must obey a tight format*, use qwen3.6.
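
The rule of thumb can be expressed as a small lookup. This is an illustrative sketch only: `pick_model`, the three `kind` labels, and the hostname normalisation are assumptions introduced here, not an existing helper. The model names and the DESKTOP-0O8A1RL exception come from the tables in this document.

```python
import socket

# Per-machine prose exceptions; every other machine uses the default.
PROSE_BY_HOST = {"DESKTOP-0O8A1RL": "qwen3:8b"}
DEFAULT_PROSE = "qwen3:14b"


def pick_model(kind, hostname=None):
    """Map an output kind to a model.

    kind is 'prose' (text someone will read), 'structured' (parsed or
    strict-format output), or 'code' (comments, docstrings, refactors).
    """
    if kind == "structured":
        return "qwen3.6"
    if kind == "code":
        return "codestral:22b"
    if kind == "prose":
        # Normalise case so the lookup does not depend on how the OS reports the name.
        host = (hostname or socket.gethostname()).upper()
        return PROSE_BY_HOST.get(host, DEFAULT_PROSE)
    raise ValueError(f"unknown output kind: {kind!r}")
```

Keeping the per-machine exception in one dictionary means a new machine with limited VRAM only needs one added entry.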
## Review Policy

- Documentation output (session logs, commit messages, comments): Claude reviews before writing/posting
- Code suggestions from codestral: always review before applying
- Never use Ollama for: credentials, auth decisions, production migrations, security review, API payload field values