# Ollama — Local AI Reference Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Available to all team members via Tailscale. ## Models | Model | Size | Use For | |-------|------|---------| | `qwen3:14b` | 9.3 GB | Summarization, classification, data extraction, drafting | | `codestral:22b` | 12 GB | Code generation, refactoring suggestions, docstrings | | `nomic-embed-text` | 274 MB | Embeddings only (used by GrepAI) | ## Endpoints Auto-detect: any machine that has a local Ollama listening on `127.0.0.1:11434` uses local. Otherwise fall back to Mike's workstation over Tailscale. ```bash # Preferred universal resolver — works on any machine if curl -s -m 2 http://localhost:11434/api/tags >/dev/null 2>&1; then OLLAMA="http://localhost:11434" else OLLAMA="http://100.92.127.64:11434" fi ``` Rationale: - **Mike's workstation (DESKTOP-0O8A1RL):** local matches, no change. - **HOWARD-HOME:** also has a local Ollama with the canonical model set (confirmed 2026-04-22). Uses local — faster, zero Tailscale hop, no load on Mike's GPU. - **Other team machines:** no local Ollama → falls back to Mike's over Tailscale. - **Mike's machine offline:** graceful degradation — local users continue working; non-local users get a clean timeout. Manual override (for testing or explicit preference): set `OLLAMA=http://100.92.127.64:11434` before the call. Check reachability: ```bash curl -s $OLLAMA/api/tags | jq -r '.models[].name' ``` If neither endpoint responds: verify Tailscale (`tailscale status`) and whether your local Ollama service is running. ## Access Control - Port 11434 allowed ONLY from Tailscale subnet (100.0.0.0/8) - NOT exposed to LAN, VPN, or internet - Binding: `OLLAMA_HOST=0.0.0.0:11434` (firewall restricts) ## Calling Ollama Use the `/api/chat` endpoint with `think:false` for qwen3 models. The older `/api/generate` endpoint on qwen3 puts output into thinking tokens that don't appear in the `response` field — you'll get an empty response if you use `/api/generate`. Preferred one-liner: ```bash python -c " import urllib.request, json, sys, os OLLAMA = os.environ.get('OLLAMA') or ('http://localhost:11434' if __import__('urllib.request').request.urlopen(urllib.request.Request('http://localhost:11434/api/tags'),timeout=2) else 'http://100.92.127.64:11434') body = json.dumps({ 'model':'qwen3:14b', 'messages':[{'role':'user','content': sys.argv[1]}], 'stream':False, 'think':False }).encode() res = json.loads(urllib.request.urlopen(urllib.request.Request(OLLAMA+'/api/chat', body), timeout=120).read()) print(res['message']['content']) " "Your prompt here" ``` Or set `$OLLAMA` once from bash (see auto-detect formula above) and reuse it across calls. For code suggestions, swap `qwen3:14b` for `codestral:22b`. Codestral doesn't need `think:false`. Cold-start is ~30-50s on first call per model per session. Warm calls are 1-5s. ## When to Use Which Model | Task | Model | |------|-------| | Summarize logs, diffs, session notes | qwen3:14b | | Classify bug type, severity, category | qwen3:14b | | Extract structured data from text | qwen3:14b | | Draft commit message from diff | qwen3:14b | | Suggest refactor for a function | codestral:22b | | Docstring / comment generation | codestral:22b | ## Review Policy - Low-stakes output (summary, classification, draft) — use directly - Code suggestions from codestral — always review before applying - Never use Ollama for: auth decisions, credential handling, production migrations, security review