Files
claudetools/.claude/OLLAMA.md
Mike Swanson 2c5f10faaa Session log: qwen3.6 benchmark, route strict-format to 3.6
Benchmarked qwen3.6 (36B MoE) vs qwen3:14b and qwen3:32b on 16
representative prompts. qwen3.6 scored 15/16 vs 14b 11/16 and 32b
12/16, winning every strict-format/adherence test (multi-step rules,
weekend-aware scheduling, prompt-injection resistance, word-limit
summary). Single reasoning regression noted for re-check at qwen3.7.

Updated .claude/OLLAMA.md (Models, Documentation Engine, and
When-to-Use tables) and .claude/CLAUDE.md one-line model summary to
route strict-format work to qwen3.6 and keep bulk prose on qwen3:14b
(2x faster). Also removed openclaw npm package + ~/.openclaw data dir
earlier in the session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 16:03:07 -07:00

184 lines
9.0 KiB
Markdown

# Ollama — Local AI Reference
Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Available to all team members via Tailscale.
## Models
| Model | Size | Use For |
|-------|------|---------|
| `qwen3.6:latest` | 24 GB | Strict-format work: JSON/structured extraction, classification, per-item rules, redaction, word-limited summaries, adherence-critical drafting. ~32 tok/s. |
| `qwen3:14b` | 9.3 GB | Bulk prose where format is loose: session log narrative, commit bodies, client notes, free-text handoffs. ~66 tok/s — 2x faster than 3.6. |
| `codestral:22b` | 12 GB | Code generation, refactoring suggestions, docstrings |
| `nomic-embed-text` | 274 MB | Embeddings only (used by GrepAI) |
Routing basis: 16-prompt benchmark on 2026-05-16 (`benchmark_qwen_3_6.py` in repo root). qwen3.6 scored 15/16 vs qwen3:14b 11/16 and qwen3:32b 12/16. 3.6 won every strict-format and adherence test (multi-step rules, schedule reasoning with weekend trap, prompt-injection resistance, word-limit summary) — at the cost of ~2x slower inference. **Known regression**: 3.6 missed one small reasoning prompt (3 vs expected 4) that 14b/32b got — re-validate when qwen3.7 lands. qwen3:32b is dominated on every axis; not in routing rotation.
## Endpoints
Auto-detect: any machine that has a local Ollama listening on `127.0.0.1:11434` uses local. Otherwise fall back to Mike's workstation over Tailscale.
```bash
# Preferred universal resolver — works on any machine
if curl -s -m 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
OLLAMA="http://localhost:11434"
else
OLLAMA="http://100.92.127.64:11434"
fi
```
Rationale:
- **Mike's workstation (DESKTOP-0O8A1RL):** local matches, no change.
- **HOWARD-HOME:** also has a local Ollama with the canonical model set (confirmed 2026-04-22). Uses local — faster, zero Tailscale hop, no load on Mike's GPU.
- **Other team machines:** no local Ollama → falls back to Mike's over Tailscale.
- **Mike's machine offline:** graceful degradation — local users continue working; non-local users get a clean timeout.
Manual override (for testing or explicit preference): set `OLLAMA=http://100.92.127.64:11434` before the call.
Check reachability:
```bash
curl -s $OLLAMA/api/tags | jq -r '.models[].name'
```
If neither endpoint responds: verify Tailscale (`tailscale status`) and whether your local Ollama service is running.
## Access Control
- Port 11434 allowed ONLY from Tailscale subnet (100.0.0.0/8)
- NOT exposed to LAN, VPN, or internet
- Binding: `OLLAMA_HOST=0.0.0.0:11434` (firewall restricts)
## Calling Ollama
Use the `/api/chat` endpoint with `think:false` for qwen3 models. The older `/api/generate` endpoint on qwen3 puts output into thinking tokens that don't appear in the `response` field — you'll get an empty response if you use `/api/generate`.
Preferred one-liner:
```bash
python -c "
import urllib.request, json, sys, os
OLLAMA = os.environ.get('OLLAMA') or ('http://localhost:11434' if __import__('urllib.request').request.urlopen(urllib.request.Request('http://localhost:11434/api/tags'),timeout=2) else 'http://100.92.127.64:11434')
body = json.dumps({
'model':'qwen3:14b',
'messages':[{'role':'user','content': sys.argv[1]}],
'stream':False,
'think':False
}).encode()
res = json.loads(urllib.request.urlopen(urllib.request.Request(OLLAMA+'/api/chat', body), timeout=120).read())
print(res['message']['content'])
" "Your prompt here"
```
Or set `$OLLAMA` once from bash (see auto-detect formula above) and reuse it across calls.
For code suggestions, swap `qwen3:14b` for `codestral:22b`. Codestral doesn't need `think:false`.
Cold-start is ~30-50s on first call per model per session. Warm calls are 1-5s.
## Documentation Engine
**Ollama is the default documentation engine for all prose output.** Any time stored text needs to be generated — session logs, commit messages, ticket comments, client notes, code docs — route it through Ollama first. Claude reviews, corrects if needed, then writes or posts.
This keeps Claude tokens focused on reasoning, decisions, and execution. Ollama handles the writing.
### What Ollama owns
| Output | Model | Claude's role |
|--------|-------|---------------|
| Session log narrative (summary, decisions, problems) | qwen3:14b | Review + assemble with factual sections |
| Commit message body | qwen3:14b | Review + execute git commit |
| Syncro comment bodies + billing descriptions | qwen3:14b | Review checklist + post via API |
| Ticket initial issue / description text | qwen3:14b | Review + post |
| Client-facing notes and summaries | qwen3:14b | Review for accuracy |
| Agent phase handoff summaries (explore → plan, plan → implement) | qwen3:14b | Review + include in agent brief |
| Client email drafts | qwen3:14b | Review for accuracy + tone before sending |
| Ticket / issue classification (priority, type, category) | qwen3.6 | Review + apply label |
| Diff summarization before commit | qwen3.6 | Review + use in commit message |
| Error message categorization (transient / config / bug) | qwen3.6 | Review + act on classification |
| Structured data extraction (JSON, fields, tags) | qwen3.6 | Review + use programmatically |
| PII redaction in logs/transcripts | qwen3.6 | Review before publishing |
| Strict word-limit summaries (e.g. ticket subject, alert text) | qwen3.6 | Review + use |
| Multi-step / per-item rule application on lists | qwen3.6 | Review + use |
| Code comments and docstrings | codestral:22b | Review before applying |
| Refactor suggestions | codestral:22b | Review before applying |
### What Claude always owns (never Ollama)
- Credentials, passwords, API keys — must be verbatim accurate
- Infrastructure details, IPs, hostnames — must be verbatim accurate
- Command outputs and error messages — verbatim from actual output
- Security decisions, auth review, production migrations
- Final field values on API payloads (rates, IDs, quantities)
### GrepAI config (re-apply on new machines)
`.grepai/` is gitignored (90 MB index + machine-specific timestamps). After running `grepai init` on a new machine, apply these overrides to `.grepai/config.yaml`:
**Remove the `.md` penalty** (markdown is primary content here, not docs noise):
```yaml
# DELETE this block:
- pattern: .md
factor: 0.6
```
**Add these bonuses** under `search.boost.bonuses`:
```yaml
- pattern: session-logs/
factor: 1.3
- pattern: .claude/
factor: 1.2
- pattern: /clients/
factor: 1.1
```
**Start watcher + register scheduled task:**
```bash
D:/claudetools/grepai.exe watch --background
# Then in PowerShell (admin not required):
$action = New-ScheduledTaskAction -Execute "D:\claudetools\grepai.exe" -Argument "watch --background" -WorkingDirectory "D:\claudetools"
$trigger = New-ScheduledTaskTrigger -AtLogOn -User $env:USERNAME
$settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Hours 0) -MultipleInstances IgnoreNew
Register-ScheduledTask -TaskName "GrepAI Watcher - claudetools" -Action $action -Trigger $trigger -Settings $settings -Force
```
### Warm-start and GrepAI
GrepAI uses `nomic-embed-text` for context lookups, which keeps the Ollama **service** running continuously. The 30-50s service cold-start is effectively eliminated in normal workflow. `qwen3:14b` may take ~5s to swap into VRAM if it hasn't been called recently, but that's the worst case — not 50s.
If the first Ollama call of a session needs to be fast, send a throwaway warm-up ping:
```bash
py -c "
import urllib.request, json
body = json.dumps({'model':'qwen3:14b','messages':[{'role':'user','content':'ok'}],'stream':False,'think':False}).encode()
urllib.request.urlopen(urllib.request.Request('$OLLAMA/api/chat', body), timeout=60).read()
print('warm')
"
```
## When to Use Which Model
| Task | Model |
|------|-------|
| Session log narrative sections | qwen3:14b |
| Commit message body | qwen3:14b |
| Ticket / client comment drafting | qwen3:14b |
| Summarize logs, diffs, incident notes (no length cap) | qwen3:14b |
| Agent phase handoff summaries | qwen3:14b |
| Client email drafts | qwen3:14b |
| Classify bug type, severity, category, priority | qwen3.6 |
| Extract structured data from text (JSON, fields) | qwen3.6 |
| Diff summarization with strict format / fields | qwen3.6 |
| Error categorization (transient / config / bug / permission) | qwen3.6 |
| PII redaction, output preserving format | qwen3.6 |
| Strict word-limit summaries (subject lines, alerts) | qwen3.6 |
| Multi-step rule application across lists | qwen3.6 |
| Untrusted input that may contain prompt injection | qwen3.6 |
| Code comment / docstring generation | codestral:22b |
| Refactor suggestions | codestral:22b |
**Rule of thumb:** if the output is *prose someone will read*, use qwen3:14b (2x faster). If the output is *structured data something will parse* or *must obey a tight format*, use qwen3.6.
## Review Policy
- Documentation output (session logs, commit messages, comments) — Claude reviews before writing/posting
- Code suggestions from codestral — always review before applying
- Never use Ollama for: credentials, auth decisions, production migrations, security review, API payload field values