Files
claudetools/.claude/OLLAMA.md
Mike Swanson 2c5f10faaa Session log: qwen3.6 benchmark, route strict-format to 3.6
Benchmarked qwen3.6 (36B MoE) vs qwen3:14b and qwen3:32b on 16
representative prompts. qwen3.6 scored 15/16 vs 14b 11/16 and 32b
12/16, winning every strict-format/adherence test (multi-step rules,
weekend-aware scheduling, prompt-injection resistance, word-limit
summary). Single reasoning regression noted for re-check at qwen3.7.

Updated .claude/OLLAMA.md (Models, Documentation Engine, and
When-to-Use tables) and .claude/CLAUDE.md one-line model summary to
route strict-format work to qwen3.6 and keep bulk prose on qwen3:14b
(2x faster). Also removed openclaw npm package + ~/.openclaw data dir
earlier in the session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 16:03:07 -07:00

9.0 KiB

Ollama — Local AI Reference

Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Available to all team members via Tailscale.

Models

Model Size Use For
qwen3.6:latest 24 GB Strict-format work: JSON/structured extraction, classification, per-item rules, redaction, word-limited summaries, adherence-critical drafting. ~32 tok/s.
qwen3:14b 9.3 GB Bulk prose where format is loose: session log narrative, commit bodies, client notes, free-text handoffs. ~66 tok/s — 2x faster than 3.6.
codestral:22b 12 GB Code generation, refactoring suggestions, docstrings
nomic-embed-text 274 MB Embeddings only (used by GrepAI)

Routing basis: 16-prompt benchmark on 2026-05-16 (benchmark_qwen_3_6.py in repo root). qwen3.6 scored 15/16 vs qwen3:14b 11/16 and qwen3:32b 12/16. 3.6 won every strict-format and adherence test (multi-step rules, schedule reasoning with weekend trap, prompt-injection resistance, word-limit summary) — at the cost of ~2x slower inference. Known regression: 3.6 missed one small reasoning prompt (3 vs expected 4) that 14b/32b got — re-validate when qwen3.7 lands. qwen3:32b is dominated on every axis; not in routing rotation.

Endpoints

Auto-detect: any machine that has a local Ollama listening on 127.0.0.1:11434 uses local. Otherwise fall back to Mike's workstation over Tailscale.

# Preferred universal resolver — works on any machine
if curl -s -m 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
    OLLAMA="http://localhost:11434"
else
    OLLAMA="http://100.92.127.64:11434"
fi

Rationale:

  • Mike's workstation (DESKTOP-0O8A1RL): local matches, no change.
  • HOWARD-HOME: also has a local Ollama with the canonical model set (confirmed 2026-04-22). Uses local — faster, zero Tailscale hop, no load on Mike's GPU.
  • Other team machines: no local Ollama → falls back to Mike's over Tailscale.
  • Mike's machine offline: graceful degradation — local users continue working; non-local users get a clean timeout.

Manual override (for testing or explicit preference): set OLLAMA=http://100.92.127.64:11434 before the call.

Check reachability:

curl -s $OLLAMA/api/tags | jq -r '.models[].name'

If neither endpoint responds: verify Tailscale (tailscale status) and whether your local Ollama service is running.

Access Control

  • Port 11434 allowed ONLY from Tailscale subnet (100.0.0.0/8)
  • NOT exposed to LAN, VPN, or internet
  • Binding: OLLAMA_HOST=0.0.0.0:11434 (firewall restricts)

Calling Ollama

Use the /api/chat endpoint with think:false for qwen3 models. The older /api/generate endpoint on qwen3 puts output into thinking tokens that don't appear in the response field — you'll get an empty response if you use /api/generate.

Preferred one-liner:

python -c "
import urllib.request, json, sys, os
OLLAMA = os.environ.get('OLLAMA') or ('http://localhost:11434' if __import__('urllib.request').request.urlopen(urllib.request.Request('http://localhost:11434/api/tags'),timeout=2) else 'http://100.92.127.64:11434')
body = json.dumps({
  'model':'qwen3:14b',
  'messages':[{'role':'user','content': sys.argv[1]}],
  'stream':False,
  'think':False
}).encode()
res = json.loads(urllib.request.urlopen(urllib.request.Request(OLLAMA+'/api/chat', body), timeout=120).read())
print(res['message']['content'])
" "Your prompt here"

Or set $OLLAMA once from bash (see auto-detect formula above) and reuse it across calls.

For code suggestions, swap qwen3:14b for codestral:22b. Codestral doesn't need think:false.

Cold-start is ~30-50s on first call per model per session. Warm calls are 1-5s.

Documentation Engine

Ollama is the default documentation engine for all prose output. Any time stored text needs to be generated — session logs, commit messages, ticket comments, client notes, code docs — route it through Ollama first. Claude reviews, corrects if needed, then writes or posts.

This keeps Claude tokens focused on reasoning, decisions, and execution. Ollama handles the writing.

What Ollama owns

Output Model Claude's role
Session log narrative (summary, decisions, problems) qwen3:14b Review + assemble with factual sections
Commit message body qwen3:14b Review + execute git commit
Syncro comment bodies + billing descriptions qwen3:14b Review checklist + post via API
Ticket initial issue / description text qwen3:14b Review + post
Client-facing notes and summaries qwen3:14b Review for accuracy
Agent phase handoff summaries (explore → plan, plan → implement) qwen3:14b Review + include in agent brief
Client email drafts qwen3:14b Review for accuracy + tone before sending
Ticket / issue classification (priority, type, category) qwen3.6 Review + apply label
Diff summarization before commit qwen3.6 Review + use in commit message
Error message categorization (transient / config / bug) qwen3.6 Review + act on classification
Structured data extraction (JSON, fields, tags) qwen3.6 Review + use programmatically
PII redaction in logs/transcripts qwen3.6 Review before publishing
Strict word-limit summaries (e.g. ticket subject, alert text) qwen3.6 Review + use
Multi-step / per-item rule application on lists qwen3.6 Review + use
Code comments and docstrings codestral:22b Review before applying
Refactor suggestions codestral:22b Review before applying

What Claude always owns (never Ollama)

  • Credentials, passwords, API keys — must be verbatim accurate
  • Infrastructure details, IPs, hostnames — must be verbatim accurate
  • Command outputs and error messages — verbatim from actual output
  • Security decisions, auth review, production migrations
  • Final field values on API payloads (rates, IDs, quantities)

GrepAI config (re-apply on new machines)

.grepai/ is gitignored (90 MB index + machine-specific timestamps). After running grepai init on a new machine, apply these overrides to .grepai/config.yaml:

Remove the .md penalty (markdown is primary content here, not docs noise):

# DELETE this block:
- pattern: .md
  factor: 0.6

Add these bonuses under search.boost.bonuses:

- pattern: session-logs/
  factor: 1.3
- pattern: .claude/
  factor: 1.2
- pattern: /clients/
  factor: 1.1

Start watcher + register scheduled task:

D:/claudetools/grepai.exe watch --background
# Then in PowerShell (admin not required):
$action = New-ScheduledTaskAction -Execute "D:\claudetools\grepai.exe" -Argument "watch --background" -WorkingDirectory "D:\claudetools"
$trigger = New-ScheduledTaskTrigger -AtLogOn -User $env:USERNAME
$settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Hours 0) -MultipleInstances IgnoreNew
Register-ScheduledTask -TaskName "GrepAI Watcher - claudetools" -Action $action -Trigger $trigger -Settings $settings -Force

Warm-start and GrepAI

GrepAI uses nomic-embed-text for context lookups, which keeps the Ollama service running continuously. The 30-50s service cold-start is effectively eliminated in normal workflow. qwen3:14b may take ~5s to swap into VRAM if it hasn't been called recently, but that's the worst case — not 50s.

If the first Ollama call of a session needs to be fast, send a throwaway warm-up ping:

py -c "
import urllib.request, json
body = json.dumps({'model':'qwen3:14b','messages':[{'role':'user','content':'ok'}],'stream':False,'think':False}).encode()
urllib.request.urlopen(urllib.request.Request('$OLLAMA/api/chat', body), timeout=60).read()
print('warm')
"

When to Use Which Model

Task Model
Session log narrative sections qwen3:14b
Commit message body qwen3:14b
Ticket / client comment drafting qwen3:14b
Summarize logs, diffs, incident notes (no length cap) qwen3:14b
Agent phase handoff summaries qwen3:14b
Client email drafts qwen3:14b
Classify bug type, severity, category, priority qwen3.6
Extract structured data from text (JSON, fields) qwen3.6
Diff summarization with strict format / fields qwen3.6
Error categorization (transient / config / bug / permission) qwen3.6
PII redaction, output preserving format qwen3.6
Strict word-limit summaries (subject lines, alerts) qwen3.6
Multi-step rule application across lists qwen3.6
Untrusted input that may contain prompt injection qwen3.6
Code comment / docstring generation codestral:22b
Refactor suggestions codestral:22b

Rule of thumb: if the output is prose someone will read, use qwen3:14b (2x faster). If the output is structured data something will parse or must obey a tight format, use qwen3.6.

Review Policy

  • Documentation output (session logs, commit messages, comments) — Claude reviews before writing/posting
  • Code suggestions from codestral — always review before applying
  • Never use Ollama for: credentials, auth decisions, production migrations, security review, API payload field values