Index was dead since 2026-04-19 (watcher not running). Fixes: - Watcher restarted; scheduled task registered for login persistence - Removed .md 0.6x penalty — markdown is primary content in this repo - Added session-logs/ 1.3x, .claude/ 1.2x, /clients/ 1.1x relevance bonuses - CLAUDE.md: grepai_search is now the first step for any context lookup - OLLAMA.md: documents config overrides + watcher setup for new machines Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6.8 KiB
Ollama — Local AI Reference
Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Available to all team members via Tailscale.
Models
| Model | Size | Use For |
|---|---|---|
qwen3:14b |
9.3 GB | Summarization, classification, data extraction, drafting |
codestral:22b |
12 GB | Code generation, refactoring suggestions, docstrings |
nomic-embed-text |
274 MB | Embeddings only (used by GrepAI) |
Endpoints
Auto-detect: any machine that has a local Ollama listening on 127.0.0.1:11434 uses local. Otherwise fall back to Mike's workstation over Tailscale.
# Preferred universal resolver — works on any machine
if curl -s -m 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
OLLAMA="http://localhost:11434"
else
OLLAMA="http://100.92.127.64:11434"
fi
Rationale:
- Mike's workstation (DESKTOP-0O8A1RL): local matches, no change.
- HOWARD-HOME: also has a local Ollama with the canonical model set (confirmed 2026-04-22). Uses local — faster, zero Tailscale hop, no load on Mike's GPU.
- Other team machines: no local Ollama → falls back to Mike's over Tailscale.
- Mike's machine offline: graceful degradation — local users continue working; non-local users get a clean timeout.
Manual override (for testing or explicit preference): set OLLAMA=http://100.92.127.64:11434 before the call.
Check reachability:
curl -s $OLLAMA/api/tags | jq -r '.models[].name'
If neither endpoint responds: verify Tailscale (tailscale status) and whether your local Ollama service is running.
Access Control
- Port 11434 allowed ONLY from Tailscale subnet (100.0.0.0/8)
- NOT exposed to LAN, VPN, or internet
- Binding:
OLLAMA_HOST=0.0.0.0:11434(firewall restricts)
Calling Ollama
Use the /api/chat endpoint with think:false for qwen3 models. The older /api/generate endpoint on qwen3 puts output into thinking tokens that don't appear in the response field — you'll get an empty response if you use /api/generate.
Preferred one-liner:
python -c "
import urllib.request, json, sys, os
OLLAMA = os.environ.get('OLLAMA') or ('http://localhost:11434' if __import__('urllib.request').request.urlopen(urllib.request.Request('http://localhost:11434/api/tags'),timeout=2) else 'http://100.92.127.64:11434')
body = json.dumps({
'model':'qwen3:14b',
'messages':[{'role':'user','content': sys.argv[1]}],
'stream':False,
'think':False
}).encode()
res = json.loads(urllib.request.urlopen(urllib.request.Request(OLLAMA+'/api/chat', body), timeout=120).read())
print(res['message']['content'])
" "Your prompt here"
Or set $OLLAMA once from bash (see auto-detect formula above) and reuse it across calls.
For code suggestions, swap qwen3:14b for codestral:22b. Codestral doesn't need think:false.
Cold-start is ~30-50s on first call per model per session. Warm calls are 1-5s.
Documentation Engine
Ollama is the default documentation engine for all prose output. Any time stored text needs to be generated — session logs, commit messages, ticket comments, client notes, code docs — route it through Ollama first. Claude reviews, corrects if needed, then writes or posts.
This keeps Claude tokens focused on reasoning, decisions, and execution. Ollama handles the writing.
What Ollama owns
| Output | Model | Claude's role |
|---|---|---|
| Session log narrative (summary, decisions, problems) | qwen3:14b | Review + assemble with factual sections |
| Commit message body | qwen3:14b | Review + execute git commit |
| Syncro comment bodies + billing descriptions | qwen3:14b | Review checklist + post via API |
| Ticket initial issue / description text | qwen3:14b | Review + post |
| Client-facing notes and summaries | qwen3:14b | Review for accuracy |
| Code comments and docstrings | codestral:22b | Review before applying |
| Refactor suggestions | codestral:22b | Review before applying |
What Claude always owns (never Ollama)
- Credentials, passwords, API keys — must be verbatim accurate
- Infrastructure details, IPs, hostnames — must be verbatim accurate
- Command outputs and error messages — verbatim from actual output
- Security decisions, auth review, production migrations
- Final field values on API payloads (rates, IDs, quantities)
GrepAI config (re-apply on new machines)
.grepai/ is gitignored (90 MB index + machine-specific timestamps). After running grepai init on a new machine, apply these overrides to .grepai/config.yaml:
Remove the .md penalty (markdown is primary content here, not docs noise):
# DELETE this block:
- pattern: .md
factor: 0.6
Add these bonuses under search.boost.bonuses:
- pattern: session-logs/
factor: 1.3
- pattern: .claude/
factor: 1.2
- pattern: /clients/
factor: 1.1
Start watcher + register scheduled task:
D:/claudetools/grepai.exe watch --background
# Then in PowerShell (admin not required):
$action = New-ScheduledTaskAction -Execute "D:\claudetools\grepai.exe" -Argument "watch --background" -WorkingDirectory "D:\claudetools"
$trigger = New-ScheduledTaskTrigger -AtLogOn -User $env:USERNAME
$settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Hours 0) -MultipleInstances IgnoreNew
Register-ScheduledTask -TaskName "GrepAI Watcher - claudetools" -Action $action -Trigger $trigger -Settings $settings -Force
Warm-start and GrepAI
GrepAI uses nomic-embed-text for context lookups, which keeps the Ollama service running continuously. The 30-50s service cold-start is effectively eliminated in normal workflow. qwen3:14b may take ~5s to swap into VRAM if it hasn't been called recently, but that's the worst case — not 50s.
If the first Ollama call of a session needs to be fast, send a throwaway warm-up ping:
py -c "
import urllib.request, json
body = json.dumps({'model':'qwen3:14b','messages':[{'role':'user','content':'ok'}],'stream':False,'think':False}).encode()
urllib.request.urlopen(urllib.request.Request('$OLLAMA/api/chat', body), timeout=60).read()
print('warm')
"
When to Use Which Model
| Task | Model |
|---|---|
| Session log narrative sections | qwen3:14b |
| Commit message body | qwen3:14b |
| Ticket / client comment drafting | qwen3:14b |
| Summarize logs, diffs, incident notes | qwen3:14b |
| Classify bug type, severity, category | qwen3:14b |
| Extract structured data from text | qwen3:14b |
| Code comment / docstring generation | codestral:22b |
| Refactor suggestions | codestral:22b |
Review Policy
- Documentation output (session logs, commit messages, comments) — Claude reviews before writing/posting
- Code suggestions from codestral — always review before applying
- Never use Ollama for: credentials, auth decisions, production migrations, security review, API payload field values