Files

Mike Swanson 2491660b88 sync: auto-sync from GURU-5070 at 2026-05-25 06:00:45

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-05-25 06:00:45

2026-05-25 06:01:37 -07:00

12 KiB

Raw Blame History

Ollama — Local AI Reference

Ollama's always-on host is GURU-BEAST-ROG (RTX 4090, 24 GB VRAM, Tailscale 100.101.122.4). It is the canonical Tailscale fallback for all machines without a local Ollama. DESKTOP-0O8A1RL and other workstations use local when available, Beast otherwise.

Models

Model	Size	Use For
`qwen3.6:latest`	23 GB	Strict-format work: JSON/structured extraction, classification, per-item rules, redaction, word-limited summaries, adherence-critical drafting. 36B MoE.
`qwen3:14b`	9.3 GB	Bulk prose on machines with >16 GB VRAM: session log narrative, commit bodies, client notes, free-text handoffs.
`qwen3:8b`	5.2 GB	Bulk prose on DESKTOP-0O8A1RL (12 GB VRAM). Same role as qwen3:14b but fits fully in VRAM on that machine.
`codestral:22b`	12 GB	Code generation, refactoring suggestions, docstrings
`nomic-embed-text`	274 MB	Embeddings only (used by GrepAI)

Routing basis

Quality routing: 16-prompt benchmark on 2026-05-16 (benchmark_qwen_3_6.py in repo root). qwen3.6 scored 15/16 vs qwen3:14b 11/16 and qwen3:32b 12/16. 3.6 won every strict-format and adherence test. Known regression: 3.6 missed one small reasoning prompt — re-validate when qwen3.7 lands. qwen3:32b dominated on every axis; not in rotation.

Speed routing: benchmarked 2026-05-16 on DESKTOP-0O8A1RL (RTX 5070 Ti Laptop, 12 GB VRAM):

Model	VRAM fit	Tok/s (this machine)	Tok/s (full-VRAM ref)
qwen3:8b	100% (10.9/10.9 GB)	74-86	~90
qwen3:14b	73% (11.3/15.6 GB)	17-18	~66
qwen3.6	41% (11.3/27.5 GB)	17-19	~32

qwen3:14b and qwen3.6 are CPU-bottlenecked on this machine (split mode, PCIe bandwidth limited). qwen3:8b fits entirely in VRAM and is 4.8x faster than qwen3:14b here.

Machine-specific prose model

Machine	GPU VRAM	Prose model
GURU-BEAST-ROG	24 GB (RTX 4090)	`qwen3:14b` (always-on Tailscale host — `100.101.122.4`)
DESKTOP-0O8A1RL	12 GB (RTX 5070 Ti Laptop)	`qwen3:8b` (local — 4.8x faster than 14b here)
Mikes-MacBook-Air	unified memory	`qwen3:14b`
HOWARD-HOME	local Ollama	`qwen3:14b`
GURU-KALI	8 GB (RTX 4070 Mobile) — see note	remote Beast / `qwen3:14b` now; `qwen3:8b` if local installed
Other	Tailscale fallback (Beast)	`qwen3:14b`

GURU-KALI status (2026-05-24): Tailscale installed — remote Ollama (Beast at 100.101.122.4) is reachable, so it uses the Tailscale-fallback prose model qwen3:14b (the "Other" row). No local Ollama yet. It has strong hardware but the GPU runs the nouveau driver (no CUDA), so a future local Ollama would need the proprietary NVIDIA driver for GPU accel; qwen3:8b would then fit its 8 GB VRAM (mirrors DESKTOP-0O8A1RL), with larger models splitting to CPU. Full machine profile: .claude/machines/guru-kali.md.

GURU-BEAST-ROG models (2026-05-25): gemma3:27b, qwen3:32b, qwen3:14b, codestral:22b, nomic-embed-text. Note: qwen3.6:latest and qwen3:8b not yet installed — add if strict-format or speed routing is needed.

Endpoints

Auto-detect: any machine that has a local Ollama listening on 127.0.0.1:11434 uses local. Otherwise fall back to Mike's workstation over Tailscale.

# Preferred universal resolver — works on any machine
if curl -s -m 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
    OLLAMA="http://localhost:11434"
else
    OLLAMA="http://100.101.122.4:11434"
fi

Rationale:

DESKTOP-0O8A1RL: local matches, uses local Ollama — faster, no Tailscale hop.
HOWARD-HOME: also has a local Ollama with the canonical model set (confirmed 2026-04-22). Uses local — faster, zero Tailscale hop.
GURU-BEAST-ROG: always-on; the canonical fallback for all machines without a local Ollama.
Other team machines: no local Ollama → falls back to Beast over Tailscale.
Beast offline (rare): graceful degradation — local Ollama users continue; remote users get a clean timeout.

Manual override (for testing or explicit preference): set OLLAMA=http://100.101.122.4:11434 before the call.

Check reachability:

curl -s $OLLAMA/api/tags | jq -r '.models[].name'

If neither endpoint responds: verify Tailscale (tailscale status) and whether your local Ollama service is running.

Access Control

Port 11434 allowed ONLY from Tailscale subnet (100.0.0.0/8)
NOT exposed to LAN, VPN, or internet
Binding: OLLAMA_HOST=0.0.0.0:11434 (firewall restricts)

Calling Ollama

Use the /api/chat endpoint with think:false for qwen3 models. The older /api/generate endpoint on qwen3 puts output into thinking tokens that don't appear in the response field — you'll get an empty response if you use /api/generate.

Preferred one-liner:

python -c "
import urllib.request, json, sys, os
OLLAMA = os.environ.get('OLLAMA') or ('http://localhost:11434' if __import__('urllib.request').request.urlopen(urllib.request.Request('http://localhost:11434/api/tags'),timeout=2) else 'http://100.101.122.4:11434')
body = json.dumps({
  'model':'qwen3:14b',
  'messages':[{'role':'user','content': sys.argv[1]}],
  'stream':False,
  'think':False
}).encode()
res = json.loads(urllib.request.urlopen(urllib.request.Request(OLLAMA+'/api/chat', body), timeout=120).read())
print(res['message']['content'])
" "Your prompt here"

Or set $OLLAMA once from bash (see auto-detect formula above) and reuse it across calls.

For code suggestions, swap qwen3:14b for codestral:22b. Codestral doesn't need think:false.

Cold-start is ~30-50s on first call per model per session. Warm calls are 1-5s.

Documentation Engine

Ollama is the default documentation engine for all prose output. Any time stored text needs to be generated — session logs, commit messages, ticket comments, client notes, code docs — route it through Ollama first. Claude reviews, corrects if needed, then writes or posts.

This keeps Claude tokens focused on reasoning, decisions, and execution. Ollama handles the writing.

What Ollama owns

Output	Model	Claude's role
Session log narrative (summary, decisions, problems)	qwen3:14b / qwen3:8b*	Review + assemble with factual sections
Commit message body	qwen3:14b / qwen3:8b*	Review + execute git commit
Syncro comment bodies + billing descriptions	qwen3:14b / qwen3:8b*	Review checklist + post via API
Ticket initial issue / description text	qwen3:14b / qwen3:8b*	Review + post
Client-facing notes and summaries	qwen3:14b / qwen3:8b*	Review for accuracy
Agent phase handoff summaries (explore → plan, plan → implement)	qwen3:14b / qwen3:8b*	Review + include in agent brief
Client email drafts	qwen3:14b / qwen3:8b*	Review for accuracy + tone before sending
Ticket / issue classification (priority, type, category)	qwen3.6	Review + apply label
Diff summarization before commit	qwen3.6	Review + use in commit message
Error message categorization (transient / config / bug)	qwen3.6	Review + act on classification
Structured data extraction (JSON, fields, tags)	qwen3.6	Review + use programmatically
PII redaction in logs/transcripts	qwen3.6	Review before publishing
Strict word-limit summaries (e.g. ticket subject, alert text)	qwen3.6	Review + use
Multi-step / per-item rule application on lists	qwen3.6	Review + use
Code comments and docstrings	codestral:22b	Review before applying
Refactor suggestions	codestral:22b	Review before applying

* Use qwen3:8b on DESKTOP-0O8A1RL — 4.8x faster than 14b there due to full VRAM fit. Use qwen3:14b on all other machines.

What Claude always owns (never Ollama)

Credentials, passwords, API keys — must be verbatim accurate
Infrastructure details, IPs, hostnames — must be verbatim accurate
Command outputs and error messages — verbatim from actual output
Security decisions, auth review, production migrations
Final field values on API payloads (rates, IDs, quantities)

GrepAI config (re-apply on new machines)

.grepai/ is gitignored (90 MB index + machine-specific timestamps). After running grepai init on a new machine, apply these overrides to .grepai/config.yaml:

Remove the .md penalty (markdown is primary content here, not docs noise):

# DELETE this block:
- pattern: .md
  factor: 0.6

Add these bonuses under search.boost.bonuses:

- pattern: session-logs/
  factor: 1.3
- pattern: .claude/
  factor: 1.2
- pattern: /clients/
  factor: 1.1

Start watcher + register scheduled task:

D:/claudetools/grepai.exe watch --background
# Then in PowerShell (admin not required):
$action = New-ScheduledTaskAction -Execute "D:\claudetools\grepai.exe" -Argument "watch --background" -WorkingDirectory "D:\claudetools"
$trigger = New-ScheduledTaskTrigger -AtLogOn -User $env:USERNAME
$settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Hours 0) -MultipleInstances IgnoreNew
Register-ScheduledTask -TaskName "GrepAI Watcher - claudetools" -Action $action -Trigger $trigger -Settings $settings -Force

Warm-start and GrepAI

GrepAI uses nomic-embed-text for context lookups, which keeps the Ollama service running continuously. The 30-50s service cold-start is effectively eliminated in normal workflow. qwen3:14b may take ~5s to swap into VRAM if it hasn't been called recently, but that's the worst case — not 50s.

If the first Ollama call of a session needs to be fast, send a throwaway warm-up ping:

py -c "
import urllib.request, json
body = json.dumps({'model':'qwen3:14b','messages':[{'role':'user','content':'ok'}],'stream':False,'think':False}).encode()
urllib.request.urlopen(urllib.request.Request('$OLLAMA/api/chat', body), timeout=60).read()
print('warm')
"

When to Use Which Model

Task	Model
Session log narrative sections	qwen3:8b* / qwen3:14b
Commit message body	qwen3:8b* / qwen3:14b
Ticket / client comment drafting	qwen3:8b* / qwen3:14b
Summarize logs, diffs, incident notes (no length cap)	qwen3:8b* / qwen3:14b
Agent phase handoff summaries	qwen3:8b* / qwen3:14b
Client email drafts	qwen3:8b* / qwen3:14b
Classify bug type, severity, category, priority	qwen3.6
Extract structured data from text (JSON, fields)	qwen3.6
Diff summarization with strict format / fields	qwen3.6
Error categorization (transient / config / bug / permission)	qwen3.6
PII redaction, output preserving format	qwen3.6
Strict word-limit summaries (subject lines, alerts)	qwen3.6
Multi-step rule application across lists	qwen3.6
Untrusted input that may contain prompt injection	qwen3.6
Code comment / docstring generation	codestral:22b
Refactor suggestions	codestral:22b

* On DESKTOP-0O8A1RL only — 4.8x faster (86 tok/s vs 18 tok/s). Use qwen3:14b on all other machines.

Rule of thumb: if the output is prose someone will read, use the per-machine prose model (qwen3:8b on DESKTOP-0O8A1RL, qwen3:14b elsewhere). If the output is structured data something will parse or must obey a tight format, use qwen3.6.

Review Policy

Documentation output (session logs, commit messages, comments) — Claude reviews before writing/posting
Code suggestions from codestral — always review before applying
Never use Ollama for: credentials, auth decisions, production migrations, security review, API payload field values

12 KiB Raw Blame History