sync: auto-sync from GURU-5070 at 2026-06-12 08:27:16
Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-12 08:27:16
This commit is contained in:
@@ -0,0 +1,92 @@
|
|||||||
|
# 2026-06-12 — GuruRMM log-analysis → Claude Haiku, old-VM decommission, wiki VM/build-chain sweep
|
||||||
|
|
||||||
|
## User
|
||||||
|
- **User:** Mike Swanson (mike)
|
||||||
|
- **Machine:** GURU-5070
|
||||||
|
- **Role:** admin
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
Started from a `gururmm_server::api::logs` error: "Log analysis unavailable: Ollama unreachable
|
||||||
|
(http://100.101.122.4:11434/api/chat)". Diagnosed, fixed, and shipped a cutover off Ollama; then
|
||||||
|
decommissioned the old GuruRMM VM, swept stale "VM" + Windows-build-chain framing across wiki/memory,
|
||||||
|
ran wiki-lint, compiled 5 missing wiki articles, seeded GuruConnect, and patched the wiki-compile skill.
|
||||||
|
|
||||||
|
## Root cause (the headline)
|
||||||
|
"Ollama unreachable" was a **mislabeled 120s reqwest timeout, NOT a reachability problem.**
|
||||||
|
- `100.101.122.4` already IS Beast (GURU-BEAST-ROG). The server `.30` reaches Beast fine for
|
||||||
|
`/api/tags` and short warm `/api/chat` (warm "say OK" = 1.1s), but a fleet-sized `/api/chat`
|
||||||
|
(~1500 logs) never completes — warm fleet-size curl from `.30` hit the 300s ceiling.
|
||||||
|
- Cause: qwen3:14b minutes-long inference on a big prompt over a flaky cross-LAN tailnet
|
||||||
|
(`.30` behind symmetric NAT `MappingVariesByDestIP:true`; Beast on Wi-Fi `10.2.51.228`).
|
||||||
|
reqwest's 120s `.timeout()` surfaced as "error sending request … Check Tailscale".
|
||||||
|
- Beast also had a duplicate-Ollama bind conflict (tray app's `ollama serve` couldn't bind 11434;
|
||||||
|
older PID 14144 held `0.0.0.0:11434` and served) — noisy, not the cause.
|
||||||
|
|
||||||
|
## Fix shipped — log analysis now uses Claude Haiku 4.5
|
||||||
|
- `server/src/api/logs.rs`: `analyze_logs_with_ollama` → `analyze_logs_with_claude`. POSTs
|
||||||
|
`https://api.anthropic.com/v1/messages` (plain HTTPS, no tailnet), `x-api-key` from env,
|
||||||
|
`ANTHROPIC_API_KEY` (required) + `ANTHROPIC_MODEL` (default `claude-haiku-4-5`). Structured
|
||||||
|
outputs (`output_config.format` + json_schema) → guaranteed-parseable findings JSON.
|
||||||
|
- Verified end-to-end against Haiku, then live via `/api/logs/analyze`: **1500 logs → 10 findings
|
||||||
|
in 24s** (was timing out at 120s). The findings even surfaced the old self-logged "Ollama
|
||||||
|
unreachable" errors — the problem we just fixed.
|
||||||
|
- ZDR (zero data retention) requested from Anthropic, **pending**. Test fleet OK in the meantime.
|
||||||
|
|
||||||
|
## Old VM decommissioned + mgmt IP dropped
|
||||||
|
- Migration to the physical box completed 2026-06-11; soak signed off → deleted the parked old VM.
|
||||||
|
- Jupiter (172.16.3.20): `virsh destroy GuruRMM` + `undefine` + removed `/mnt/user/domains/GuruRMM`
|
||||||
|
(vdisk1.img, 64 GB). Did NOT use `--remove-all-storage` (would have wiped the shared Ubuntu ISO).
|
||||||
|
`.46` down; Pluto (`Claude-Builder`) + ISO intact.
|
||||||
|
- `.30` netplan: removed the secondary `172.16.3.47/22` (backup saved `.bak`), `netplan apply`;
|
||||||
|
`eno1` now carries only `172.16.3.30`. Route + service intact, 216 agents.
|
||||||
|
|
||||||
|
## Wiki / knowledge sweeps
|
||||||
|
- Retired "GuruRMM VM / Linux VM on Jupiter" framing → `.30` is a **physical box** (Lenovo
|
||||||
|
ThinkCentre M83, Ubuntu 26.04): overview, index, jupiter, gururmm-build, internal-infrastructure,
|
||||||
|
POWER_FAILURE_RUNBOOK + 4 memory files. HOST_MIGRATION_RUNBOOK header flipped to COMPLETE.
|
||||||
|
- Windows build chain corrected everywhere: **Beast PRIMARY, Pluto FALLBACK**
|
||||||
|
(`attempt_build beast || attempt_build pluto`, verified in build-windows.sh): index, overview,
|
||||||
|
projects/gururmm, systems/pluto, internal-infrastructure + reference_pluto_build_server memory +
|
||||||
|
build-pipeline doc comments.
|
||||||
|
- `/wiki-lint` run: found+fixed 2 missed gaps (internal-infrastructure + pluto backlink). Pre-existing
|
||||||
|
backlog (missing/broken/index) left as-is. `guru-rmm.md` is an intentional redirect tombstone (kept);
|
||||||
|
the tailscale-enroll.ps1 "dead" link was a false positive (script exists).
|
||||||
|
- Compiled 5 missing articles via 5 parallel Sonnet sub-agents: clients gonzvar-tax-services,
|
||||||
|
tohono-oodham-doit (Syncro 33069069), tucson-golden-corral (3859123); projects gururmm-agent
|
||||||
|
(artifact-based), msp-tools (umbrella). Deduped the duplicate `system:neptune` compile-queue entry.
|
||||||
|
- Seeded **GuruConnect** wiki article (v0.3.0 production, ScreenConnect-class Rust tool; artifact-based
|
||||||
|
from guru-connect @ origin/main ded99c5). `[[guruconnect]]` backlinks now resolve.
|
||||||
|
- Per-client fixes: Gonzvar found via fuzzy `query=` ("Gonzvar Tax Service" singular, id **1830740**,
|
||||||
|
break-fix ~$175/hr, 6 assets). Golden Corral email = **Neptune Exchange** (per Mike; IX cPanel kept
|
||||||
|
as a caveat). **TGC-SERVER is colocated at ACG main office** (behind ACG office net, not a naked
|
||||||
|
public box at the restaurant).
|
||||||
|
- Patched wiki-compile Phase 2a: fuzzy `query=` + fallback ladder instead of near-exact `name=`
|
||||||
|
(root cause of the Gonzvar miss).
|
||||||
|
|
||||||
|
## Credentials / access (vault paths — no secrets inlined)
|
||||||
|
- Anthropic API key (GuruRMM log analysis): vault `projects/gururmm/anthropic-api` `credentials.api_key`.
|
||||||
|
Deployed to `.30` at `/opt/gururmm/.env` as `ANTHROPIC_API_KEY` (root-only file).
|
||||||
|
- `.30` SSH: `~/.ssh/gururmm-physical` → `guru@172.16.3.30`; sudo password = vault
|
||||||
|
`infrastructure/gururmm-server.sops.yaml` `credentials.password` (`sudo -S`).
|
||||||
|
- Jupiter: `ssh -i ~/.ssh/id_ed25519 root@172.16.3.20`.
|
||||||
|
- Beast (Windows build PRIMARY): `guru@100.101.122.4` (tailnet), key `~/.ssh/id_ed25519`.
|
||||||
|
|
||||||
|
## Infrastructure touched
|
||||||
|
- `.30` (physical GuruRMM/build host) — code deploy + .env + restart + netplan.
|
||||||
|
- Jupiter (.20) — deleted GuruRMM virsh domain.
|
||||||
|
- Beast (100.101.122.4) — inspected (duplicate Ollama).
|
||||||
|
|
||||||
|
## Files changed / pushed
|
||||||
|
- gururmm repo: `c869e4d` (logs.rs→Claude), runbook `37c8593`/`8b301bf`, build-pipeline docs `a794a7f`.
|
||||||
|
- ClaudeTools: multiple — Claude-cutover memory, VM-sweep, build-chain, wiki-lint fixes, 5 compiled
|
||||||
|
articles + index, GuruConnect, per-client fixes, wiki-compile fuzzy-search fix (through `401ed7d4`).
|
||||||
|
- vault: `c1c9744` (Anthropic key entry).
|
||||||
|
|
||||||
|
## Pending / follow-ups
|
||||||
|
- **ZDR** confirmation from Anthropic before pointing a production fleet at the key.
|
||||||
|
- Golden Corral: reconcile whether IX cPanel mail accounts/forwarders remain vs all-Neptune.
|
||||||
|
- Gonzvar: confirm contact name/email (Syncro record has phone only).
|
||||||
|
- `guru-connect` is an untracked standalone clone under `projects/msp-tools/` — consider making it a
|
||||||
|
proper submodule like guru-rmm.
|
||||||
|
- Optional: a server-side Ollama fallback in logs.rs (try Claude → fallback) — deferred; Beast path
|
||||||
|
is no longer used.
|
||||||
Reference in New Issue
Block a user