Files

Howard Enos 7e2e3a5882 sync: auto-sync from HOWARD-HOME at 2026-04-23 06:21:23

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-04-23 06:21:23

2026-04-23 06:21:24 -07:00

3.5 KiB

Raw Blame History

Ollama — Local AI Reference

Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Available to all team members via Tailscale.

Models

Model	Size	Use For
`qwen3:14b`	9.3 GB	Summarization, classification, data extraction, drafting
`codestral:22b`	12 GB	Code generation, refactoring suggestions, docstrings
`nomic-embed-text`	274 MB	Embeddings only (used by GrepAI)

Endpoints

Auto-detect: any machine that has a local Ollama listening on 127.0.0.1:11434 uses local. Otherwise fall back to Mike's workstation over Tailscale.

# Preferred universal resolver — works on any machine
if curl -s -m 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
    OLLAMA="http://localhost:11434"
else
    OLLAMA="http://100.92.127.64:11434"
fi

Rationale:

Mike's workstation (DESKTOP-0O8A1RL): local matches, no change.
HOWARD-HOME: also has a local Ollama with the canonical model set (confirmed 2026-04-22). Uses local — faster, zero Tailscale hop, no load on Mike's GPU.
Other team machines: no local Ollama → falls back to Mike's over Tailscale.
Mike's machine offline: graceful degradation — local users continue working; non-local users get a clean timeout.

Manual override (for testing or explicit preference): set OLLAMA=http://100.92.127.64:11434 before the call.

Check reachability:

curl -s $OLLAMA/api/tags | jq -r '.models[].name'

If neither endpoint responds: verify Tailscale (tailscale status) and whether your local Ollama service is running.

Access Control

Port 11434 allowed ONLY from Tailscale subnet (100.0.0.0/8)
NOT exposed to LAN, VPN, or internet
Binding: OLLAMA_HOST=0.0.0.0:11434 (firewall restricts)

Calling Ollama

Use the /api/chat endpoint with think:false for qwen3 models. The older /api/generate endpoint on qwen3 puts output into thinking tokens that don't appear in the response field — you'll get an empty response if you use /api/generate.

Preferred one-liner:

python -c "
import urllib.request, json, sys, os
OLLAMA = os.environ.get('OLLAMA') or ('http://localhost:11434' if __import__('urllib.request').request.urlopen(urllib.request.Request('http://localhost:11434/api/tags'),timeout=2) else 'http://100.92.127.64:11434')
body = json.dumps({
  'model':'qwen3:14b',
  'messages':[{'role':'user','content': sys.argv[1]}],
  'stream':False,
  'think':False
}).encode()
res = json.loads(urllib.request.urlopen(urllib.request.Request(OLLAMA+'/api/chat', body), timeout=120).read())
print(res['message']['content'])
" "Your prompt here"

Or set $OLLAMA once from bash (see auto-detect formula above) and reuse it across calls.

For code suggestions, swap qwen3:14b for codestral:22b. Codestral doesn't need think:false.

Cold-start is ~30-50s on first call per model per session. Warm calls are 1-5s.

When to Use Which Model

Task	Model
Summarize logs, diffs, session notes	qwen3:14b
Classify bug type, severity, category	qwen3:14b
Extract structured data from text	qwen3:14b
Draft commit message from diff	qwen3:14b
Suggest refactor for a function	codestral:22b
Docstring / comment generation	codestral:22b

Review Policy

Low-stakes output (summary, classification, draft) — use directly
Code suggestions from codestral — always review before applying
Never use Ollama for: auth decisions, credential handling, production migrations, security review

3.5 KiB Raw Blame History