Files

Mike Swanson 88bdc3d4c9 docs: establish Ollama as the documentation engine

Route all prose generation (session logs, commit messages, Syncro
comments, client notes, code docs) through Ollama qwen3:14b by default.
Claude reviews output and owns verbatim-accuracy sections (credentials,
IPs, command outputs). GrepAI context lookups keep the Ollama service
warm, eliminating the 30-50s cold-start in normal workflow.

Updates: OLLAMA.md (documentation engine scope + warm-start note),
CLAUDE.md (Ollama section), save.md (narrative drafting), checkpoint.md
(commit message body drafting).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-24 07:37:45 -07:00

5.7 KiB

Raw Blame History

Ollama — Local AI Reference

Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Available to all team members via Tailscale.

Models

Model	Size	Use For
`qwen3:14b`	9.3 GB	Summarization, classification, data extraction, drafting
`codestral:22b`	12 GB	Code generation, refactoring suggestions, docstrings
`nomic-embed-text`	274 MB	Embeddings only (used by GrepAI)

Endpoints

Auto-detect: any machine that has a local Ollama listening on 127.0.0.1:11434 uses local. Otherwise fall back to Mike's workstation over Tailscale.

# Preferred universal resolver — works on any machine
if curl -s -m 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
    OLLAMA="http://localhost:11434"
else
    OLLAMA="http://100.92.127.64:11434"
fi

Rationale:

Mike's workstation (DESKTOP-0O8A1RL): local matches, no change.
HOWARD-HOME: also has a local Ollama with the canonical model set (confirmed 2026-04-22). Uses local — faster, zero Tailscale hop, no load on Mike's GPU.
Other team machines: no local Ollama → falls back to Mike's over Tailscale.
Mike's machine offline: graceful degradation — local users continue working; non-local users get a clean timeout.

Manual override (for testing or explicit preference): set OLLAMA=http://100.92.127.64:11434 before the call.

Check reachability:

curl -s $OLLAMA/api/tags | jq -r '.models[].name'

If neither endpoint responds: verify Tailscale (tailscale status) and whether your local Ollama service is running.

Access Control

Port 11434 allowed ONLY from Tailscale subnet (100.0.0.0/8)
NOT exposed to LAN, VPN, or internet
Binding: OLLAMA_HOST=0.0.0.0:11434 (firewall restricts)

Calling Ollama

Use the /api/chat endpoint with think:false for qwen3 models. The older /api/generate endpoint on qwen3 puts output into thinking tokens that don't appear in the response field — you'll get an empty response if you use /api/generate.

Preferred one-liner:

python -c "
import urllib.request, json, sys, os
OLLAMA = os.environ.get('OLLAMA') or ('http://localhost:11434' if __import__('urllib.request').request.urlopen(urllib.request.Request('http://localhost:11434/api/tags'),timeout=2) else 'http://100.92.127.64:11434')
body = json.dumps({
  'model':'qwen3:14b',
  'messages':[{'role':'user','content': sys.argv[1]}],
  'stream':False,
  'think':False
}).encode()
res = json.loads(urllib.request.urlopen(urllib.request.Request(OLLAMA+'/api/chat', body), timeout=120).read())
print(res['message']['content'])
" "Your prompt here"

Or set $OLLAMA once from bash (see auto-detect formula above) and reuse it across calls.

For code suggestions, swap qwen3:14b for codestral:22b. Codestral doesn't need think:false.

Cold-start is ~30-50s on first call per model per session. Warm calls are 1-5s.

Documentation Engine

Ollama is the default documentation engine for all prose output. Any time stored text needs to be generated — session logs, commit messages, ticket comments, client notes, code docs — route it through Ollama first. Claude reviews, corrects if needed, then writes or posts.

This keeps Claude tokens focused on reasoning, decisions, and execution. Ollama handles the writing.

What Ollama owns

Output	Model	Claude's role
Session log narrative (summary, decisions, problems)	qwen3:14b	Review + assemble with factual sections
Commit message body	qwen3:14b	Review + execute git commit
Syncro comment bodies + billing descriptions	qwen3:14b	Review checklist + post via API
Ticket initial issue / description text	qwen3:14b	Review + post
Client-facing notes and summaries	qwen3:14b	Review for accuracy
Code comments and docstrings	codestral:22b	Review before applying
Refactor suggestions	codestral:22b	Review before applying

What Claude always owns (never Ollama)

Credentials, passwords, API keys — must be verbatim accurate
Infrastructure details, IPs, hostnames — must be verbatim accurate
Command outputs and error messages — verbatim from actual output
Security decisions, auth review, production migrations
Final field values on API payloads (rates, IDs, quantities)

Warm-start and GrepAI

GrepAI uses nomic-embed-text for context lookups, which keeps the Ollama service running continuously. The 30-50s service cold-start is effectively eliminated in normal workflow. qwen3:14b may take ~5s to swap into VRAM if it hasn't been called recently, but that's the worst case — not 50s.

If the first Ollama call of a session needs to be fast, send a throwaway warm-up ping:

py -c "
import urllib.request, json
body = json.dumps({'model':'qwen3:14b','messages':[{'role':'user','content':'ok'}],'stream':False,'think':False}).encode()
urllib.request.urlopen(urllib.request.Request('$OLLAMA/api/chat', body), timeout=60).read()
print('warm')
"

When to Use Which Model

Task	Model
Session log narrative sections	qwen3:14b
Commit message body	qwen3:14b
Ticket / client comment drafting	qwen3:14b
Summarize logs, diffs, incident notes	qwen3:14b
Classify bug type, severity, category	qwen3:14b
Extract structured data from text	qwen3:14b
Code comment / docstring generation	codestral:22b
Refactor suggestions	codestral:22b

Review Policy

Documentation output (session logs, commit messages, comments) — Claude reviews before writing/posting
Code suggestions from codestral — always review before applying
Never use Ollama for: credentials, auth decisions, production migrations, security review, API payload field values

5.7 KiB Raw Blame History