claudetools/.claude/memory/feedback_ollama_tier0_routing.md at 2fcdc5fb1361ba42ed867c09b54d7945ac477275

Files

Mike Swanson 262fd8de62 sync: auto-sync from GURU-KALI at 2026-05-26 20:08:37

Author: Mike Swanson
Machine: GURU-KALI
Timestamp: 2026-05-26 20:08:37

2026-05-26 20:08:39 -07:00

3.1 KiB

Raw Blame History

name, description, type

name	description	type
Route Tier-0 tasks through Ollama (Mike's ClaudeTools design intent)	Drafts, summaries, classifications, extractions MUST go through Ollama per Mike's tiered-model architecture. Don't default to Claude inference for low-stakes text generation.	feedback

Route Tier-0 tasks (summaries, classifications, drafts, extractions) through Ollama. Not optional — this is how Mike designed ClaudeTools to work.

Why: Mike built the tiered-model architecture (CLAUDE.md Model Routing section + .claude/OLLAMA.md) deliberately. Tier 0 is free + fast + private. Defaulting to Claude for every drafting task burns context window and Anthropic tokens on work that qwen3:14b does fine.

How to apply:

Drafting emails, session-log paragraphs, status-update sentences, commit-message first-drafts → qwen3:14b
Summarizing long output (Graph JSON, PowerShell transcripts, log tails) → qwen3:14b
Extracting structured data from text → qwen3:14b
Suggesting refactors / generating docstrings → codestral:22b (then review)
NEVER for: auth decisions, credential handling, production migrations, security review, citation work, production-change scripts

Endpoint resolution — machine config is centralized in .claude/identity.json ollama (Phase 2, 2026-05-26). Read the declared endpoint; no curl probe per call:

OLLAMA=$(jq -r '.ollama.endpoint // .ollama.fallback // "http://localhost:11434"' .claude/identity.json)
MODEL=$(jq -r '.ollama.prose_model // "qwen3:14b"' .claude/identity.json)

migrate-identity.sh populates the ollama object per machine: endpoint (the one to use — localhost if a local Ollama was present at migration, else the Tailscale host), fallback (backup, usually GURU-BEAST-ROG http://100.101.122.4:11434, always-on RTX 4090), and prose_model (qwen3:8b on 12 GB boxes like GURU-5070, qwen3:14b elsewhere). GURU-KALI: endpoint+fallback = Beast (no local Ollama). Re-run migrate-identity.sh to re-detect after an Ollama/network change. Don't hardcode endpoints/IPs in shared files — read identity.json. (Superseded the interim ollama_fallback field from earlier 2026-05-26.)

Call pattern for qwen3 — use /api/chat with think:false, NOT /api/generate. qwen3 on generate endpoint dumps reasoning into internal thinking tokens and returns empty response field. Chat endpoint with think:false returns clean content in message.content:

body = json.dumps({
  'model':'qwen3:14b',
  'messages':[{'role':'user','content': prompt}],
  'stream':False,
  'think':False
}).encode()
# POST to OLLAMA + '/api/chat'
# Read res['message']['content']

Codestral doesn't need think:false — just use it on /api/chat normally.

Cold-start ~30-50s on first call per model per session; warm calls 1-5s.

Incident 2026-04-22: Spent an entire Cascades rollout session (G1 hygiene, orphan cleanup, risk register, synology discovery, etc.) without routing a single task through Ollama despite many drafting opportunities (report drafts, summary text, email drafts). Howard called this out: "just make sure ollama is being used as mike has designed claudetools to work."

3.1 KiB Raw Blame History

3.1 KiB

Raw Blame History