sync: auto-sync from HOWARD-HOME at 2026-04-23 06:21:23

Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-04-23 06:21:23
2026-04-23 06:21:24 -07:00
parent abfb0a18b0
commit 7e2e3a5882
5 changed files with 118 additions and 18 deletions
--- a/.claude/OLLAMA.md
+++ b/.claude/OLLAMA.md
@@ -12,15 +12,31 @@ Ollama runs on Mike's workstation (DESKTOP-0O8A1RL) with GPU acceleration. Avail

 ## Endpoints

- **DESKTOP-0O8A1RL** (local): `http://localhost:11434`
- **Any other machine** (Tailscale required): `http://100.92.127.64:11434`
+Auto-detect: any machine that has a local Ollama listening on `127.0.0.1:11434` uses local. Otherwise fall back to Mike's workstation over Tailscale.
+
+```bash
+# Preferred universal resolver — works on any machine
+if curl -s -m 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
+    OLLAMA="http://localhost:11434"
+else
+    OLLAMA="http://100.92.127.64:11434"
+fi
+```
+
+Rationale:
+- **Mike's workstation (DESKTOP-0O8A1RL):** local matches, no change.
+- **HOWARD-HOME:** also has a local Ollama with the canonical model set (confirmed 2026-04-22). Uses local — faster, zero Tailscale hop, no load on Mike's GPU.
+- **Other team machines:** no local Ollama → falls back to Mike's over Tailscale.
+- **Mike's machine offline:** graceful degradation — local users continue working; non-local users get a clean timeout.
+
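+Manual override (for testing or explicit preference): `export OLLAMA=http://100.92.127.64:11434` before the call. The Python one-liner below honors it; the bash resolver above always overwrites `OLLAMA`, so skip the resolver when overriding.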

 Check reachability:
 ```bash
-curl -s http://100.92.127.64:11434/api/tags | jq -r '.models[].name'
+curl -s "$OLLAMA/api/tags" | jq -r '.models[].name'
 ```

-If it fails: verify Tailscale is connected (`tailscale status`) and Mike's workstation is online.
+If neither endpoint responds: check that Tailscale is connected (`tailscale status`) and that your local Ollama service is running.

 ## Access Control

@@ -30,24 +46,29 @@ If it fails: verify Tailscale is connected (`tailscale status`) and Mike's works

 ## Calling Ollama

-Resolve endpoint from identity.json first:
-```bash
-OLLAMA=$([ "$(jq -r .machine .claude/identity.json 2>/dev/null)" = "DESKTOP-0O8A1RL" ] \
-  && echo "http://localhost:11434" || echo "http://100.92.127.64:11434")
-```
+Use the `/api/chat` endpoint with `think:false` for qwen3 models. On the older `/api/generate` endpoint, qwen3 puts its output into thinking tokens that don't appear in the `response` field, so the response comes back empty.

-Preferred one-liner (avoids shell escaping):
+Preferred one-liner:
 ```bash
-py -c "
-import urllib.request, json, sys
-url = 'http://localhost:11434/api/generate'
-body = json.dumps({'model':'qwen3:14b','prompt': sys.argv[1],'stream':False}).encode()
-res = json.loads(urllib.request.urlopen(urllib.request.Request(url, body)).read())
-print(res['response'])
+python -c "
+import urllib.request, json, sys, os
+try: OLLAMA = os.environ.get('OLLAMA') or (urllib.request.urlopen('http://localhost:11434/api/tags', timeout=2) and 'http://localhost:11434')
+except OSError: OLLAMA = 'http://100.92.127.64:11434'  # local probe failed (urlopen raises, it never returns falsy): use Tailscale
+body = json.dumps({
+  'model':'qwen3:14b', 'messages':[{'role':'user','content': sys.argv[1]}],
+  'stream':False,
+  'think':False
+}).encode()
+res = json.loads(urllib.request.urlopen(urllib.request.Request(OLLAMA+'/api/chat', body), timeout=120).read())
+print(res['message']['content'])
 " "Your prompt here"
 ```

-For code suggestions, swap `qwen3:14b` for `codestral:22b`.
+Or set `$OLLAMA` once from bash (see auto-detect formula above) and reuse it across calls.
+
+For code suggestions, swap `qwen3:14b` for `codestral:22b`. Codestral doesn't need `think:false`.
+
+Cold-start is ~30-50s on first call per model per session. Warm calls are 1-5s.

 ## When to Use Which Model

--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -23,6 +23,7 @@
 - [D2TESTNAS SSH Access](feedback_d2testnas_ssh.md) - Use root@192.168.0.9 with Paper123!@#, not sysadmin
 - [Bypass Permissions Setting](feedback_bypass_permissions_setting.md) - Set permissions.defaultMode to bypassPermissions in settings.json on all machines
 - [365 Remediation Tool](feedback_365_remediation_tool.md) - Always means Graph API app fabb3421, not CIPP
+- [Ollama Tier-0 Routing](feedback_ollama_tier0_routing.md) - Route drafts/summaries/classifications through Ollama (qwen3:14b). Mike designed ClaudeTools this way — not optional.

 ## Machine
 - [ACG-5070 Workstation Setup](reference_workstation_setup.md) - Windows 11 Pro clean install 2026-03-30, replaced CachyOS. All tools installed.
--- a/.claude/memory/feedback_ollama_tier0_routing.md
+++ b/.claude/memory/feedback_ollama_tier0_routing.md
@@ -0,0 +1,46 @@
+---
+name: Route Tier-0 tasks through Ollama (Mike's ClaudeTools design intent)
+description: Drafts, summaries, classifications, extractions MUST go through Ollama per Mike's tiered-model architecture. Don't default to Claude inference for low-stakes text generation.
+type: feedback
+---
+
+Route Tier-0 tasks (summaries, classifications, drafts, extractions) through Ollama. Not optional — this is how Mike designed ClaudeTools to work.
+
+**Why:** Mike built the tiered-model architecture (`CLAUDE.md` Model Routing section + `.claude/OLLAMA.md`) deliberately. Tier 0 is free + fast + private. Defaulting to Claude for every drafting task burns context window and Anthropic tokens on work that qwen3:14b does fine.
+
+**How to apply:**
+- Drafting emails, session-log paragraphs, status-update sentences, commit-message first-drafts → qwen3:14b
+- Summarizing long output (Graph JSON, PowerShell transcripts, log tails) → qwen3:14b
+- Extracting structured data from text → qwen3:14b
+- Suggesting refactors / generating docstrings → codestral:22b (then review)
+- NEVER for: auth decisions, credential handling, production migrations, security review, citation work, production-change scripts
+
+**Endpoint resolution (updated 2026-04-22 in `.claude/OLLAMA.md`):**
+```bash
+if curl -s -m 2 http://localhost:11434/api/tags >/dev/null 2>&1; then
+    OLLAMA="http://localhost:11434"
+else
+    OLLAMA="http://100.92.127.64:11434"
+fi
+```
+
+HOWARD-HOME has the canonical models loaded locally (qwen3:14b, codestral:22b, nomic-embed-text, plus bonus qwen3-coder:30b) — so HOWARD-HOME uses local Ollama, not Mike's. Zero Tailscale hop.
+
+**Call pattern for qwen3: use `/api/chat` with `think:false`**, NOT `/api/generate`. On the generate endpoint, qwen3 dumps its reasoning into internal thinking tokens and returns an empty `response` field. The chat endpoint with `think:false` returns clean content in `message.content`:
+
+```python
+body = json.dumps({
+  'model':'qwen3:14b',
+  'messages':[{'role':'user','content': prompt}],
+  'stream':False,
+  'think':False
+}).encode()
+# POST to OLLAMA + '/api/chat'
+# Read res['message']['content']
+```
+
+Codestral doesn't need `think:false` — just use it on `/api/chat` normally.
+
+Cold-start ~30-50s on first call per model per session; warm calls 1-5s.
+
+**Incident 2026-04-22:** Spent an entire Cascades rollout session (G1 hygiene, orphan cleanup, risk register, synology discovery, etc.) without routing a single task through Ollama despite many drafting opportunities (report drafts, summary text, email drafts). Howard called this out: "just make sure ollama is being used as mike has designed claudetools to work."
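
The endpoint resolution described in this change (explicit `OLLAMA` override first, then a local probe, then the Tailscale fallback) could also be written as a small Python helper. This is a minimal sketch, not part of the commit: the names `is_up`, `resolve_ollama`, `LOCAL`, and `REMOTE` are hypothetical, and the only endpoint assumed is `GET /api/tags`, which the docs above already use as the reachability check.

```python
import os
import urllib.request

LOCAL = "http://localhost:11434"
REMOTE = "http://100.92.127.64:11434"  # Mike's workstation over Tailscale (per OLLAMA.md)


def is_up(base_url, timeout=2.0):
    """Return True if GET <base_url>/api/tags answers within the timeout."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout):
            return True
    except OSError:
        # URLError, refused connections, and socket timeouts all derive from OSError.
        return False


def resolve_ollama(env=None, probe=is_up):
    """Pick an endpoint: explicit override, then local, then the Tailscale fallback."""
    env = os.environ if env is None else env
    override = env.get("OLLAMA")
    if override:
        return override
    return LOCAL if probe(LOCAL) else REMOTE
```

Calling `resolve_ollama()` with no arguments probes the local endpoint for real. The `env` and `probe` parameters exist only so the decision logic can be exercised without a network. Like the bash resolver, this sketch does not verify that the fallback endpoint is reachable.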