diff --git a/projects/msp-tools/guru-rmm b/projects/msp-tools/guru-rmm
index 7374e8a..0a4db53 160000
--- a/projects/msp-tools/guru-rmm
+++ b/projects/msp-tools/guru-rmm
@@ -1 +1 @@
-Subproject commit 7374e8a5c4b3719653a4693493401961634b8595
+Subproject commit 0a4db5385461f693f5e2b92fe0bd9582e3bea237
diff --git a/session-logs/2026-05-25-session.md b/session-logs/2026-05-25-session.md
index e15fdbc..998a807 100644
--- a/session-logs/2026-05-25-session.md
+++ b/session-logs/2026-05-25-session.md
@@ -581,3 +581,127 @@ git push origin main
 - Windows build table: 19045=Win10 22H2, 20348=Server 2022, 22621=Win11 22H2, 22631=Win11 23H2, 26100=Win11 24H2/Server 2025
 - macOS name table: 15=Sequoia, 14=Sonoma, 13=Ventura, 12=Monterey, 11=Big Sur
 - Code review verdict: APPROVED — no defects
+
+---
+
+## Update: 09:20 PT — GuruRMM Ollama log analysis: socat relay + findings deserialization fix
+
+### User
+- **User:** Mike Swanson (mike)
+- **Machine:** DESKTOP-0O8A1RL (GURU-5070)
+- **Role:** admin
+- **Session span:** resumed from compacted context, ~07:00–09:20 PT 2026-05-25
+
+### Session Summary
+
+Session resumed mid-work from a prior context. The goal carried over from that context was to verify end-to-end connectivity from the GuruRMM server (172.16.3.30) to Beast's Ollama instance (100.101.122.4:11434) via a socat relay running on pfsense (172.16.0.1). Prior work had already: added a pfsense firewall rule to pass 100.x traffic without the FiberGW route-to override, set up socat relay (`TCP-LISTEN:11434,reuseaddr,fork TCP:100.101.122.4:11434`) on pfsense, written a systemd drop-in at `/etc/systemd/system/gururmm-server.service.d/ollama.conf` setting `OLLAMA_URL=http://172.16.0.1:11434`, and confirmed TCP connectivity with nc.
+
+The first task was confirming the full pipeline end-to-end. Called `POST /api/logs/analyze` with agent_id ACG-DC16 (49098c52-542b-44de-bef2-93182280bdc6), received a 200 with 1817 logs analyzed and a clean summary. Socat relay confirmed working.
+
+Next, Mike asked why findings always came back empty. Reviewed `analyze_logs_with_ollama()` in `server/src/api/logs.rs`: it fetched up to 2000 logs but then called `.take(200)` before sending to Ollama — a conservative holdover from paid-API thinking with no justification for local Ollama. Also, the agent-scope path fetched all log levels (`&[]` — no filter), so the 200 lines sent to Ollama were statistically dominated by INFO/DEBUG noise rather than errors. Two fixes were applied in one commit: (1) added a severity sort (errors first, warnings second, info/debug last) before sampling, and (2) raised the sample limit from 200 to 1500.
+
+After those changes built and deployed, the analysis returned `findings: 0` despite the summary text describing three real issues (WMI failures, missing LHM executable, failed agent update). Direct testing of Ollama with a 4-line test prompt confirmed the model produces correct structured JSON with populated findings — so the model was not at fault. Root cause identified: the `Finding` struct had `pub affected_agents: Vec` without `#[serde(default)]`. Since Ollama never returns UUIDs in its findings, serde failed to deserialize every finding entry, and `unwrap_or_default()` silently returned an empty vec. A prompt-tightening pass had been started before the root cause was found — that prompt change is still in the codebase but was not the actual fix.
+
+The real fix was adding `#[serde(default)]` to `affected_agents`. After the third build+deploy cycle, the analysis returned 3 findings with correct severity, count, sample lines, and suggested actions.
+
+### Key Decisions
+
+- **Raise sample from 200 → 1500 lines, not unlimited**: qwen3:14b's default Ollama context window is ~32k tokens and 1500 log lines ≈ 45k tokens, so the window imposes a ceiling and an unlimited sample is not an option; 1500 matches the fleet-scope DB cap and is a safe, pragmatic limit.
+- **Severity sort before truncation**: Without this, agent-scope analysis (no level filter) sends INFO-heavy samples and Ollama correctly sees nothing alarming. Sort ensures errors bubble to the top so the 1500-line window is signal-dense.
+- **Prompt tightening was a red herring**: Added "for EVERY distinct issue, create ONE finding entry" language to the prompt during diagnosis. Kept it in as it's better instruction, but the actual fix was `#[serde(default)]`. Don't confuse the two.
+- **Manual `sudo /opt/gururmm/build-server.sh` required**: The Gitea webhook pipeline only rebuilds agents (linux/windows/mac via `build-linux.sh`, `build-windows.sh`, `build-mac.sh`). Server binary requires a manual `sudo /opt/gururmm/build-server.sh` on the build server. This is a gap — server changes don't auto-deploy.
+
+### Problems Encountered
+
+- **`.take(200)` discarded 90% of context**: The original code fetched 2000 logs then threw away 1800 before sending to Ollama. Fixed by raising limit to 1500 and adding severity sort.
+- **findings always empty despite correct Ollama output**: `serde_json::from_value(parsed["findings"].clone()).unwrap_or_default()` silently swallowed deserialization errors. Root cause: `affected_agents: Vec` without `#[serde(default)]` — Ollama omits this field, serde rejects the entry. Fixed with one line: `#[serde(default)]`.
+- **Pattern match failure for prompt edit via Python string replacement**: Escaping mismatch between Python double-escaped strings and the actual Rust source bytes caused the first replacement attempt to fail. Resolved by writing a patcher script to `/tmp/` on the build server and executing it via paramiko SFTP + exec_command, avoiding all local shell escaping.
+- **Three full Rust builds required**: Each of the three fixes (sample limit + sort, prompt, serde fix) required a separate build. Rust release builds on 172.16.3.30 take ~4 minutes with warm cache. Total deploy time ~12 minutes across the three cycles.
+- **Webhook pipeline does not build server**: Push to Gitea triggers agent builds only. Server must be manually rebuilt with `sudo /opt/gururmm/build-server.sh`.
+
+### Configuration Changes
+
+**`/home/guru/gururmm/server/src/api/logs.rs` (live on build server, pushed to Gitea):**
+- Added severity sort on `sorted_logs` before sampling (errors=0, warns=1, info=2)
+- Raised `.take(200)` → `.take(1500)` in `analyze_logs_with_ollama()`
+- Rewrote Ollama prompt to be more directive: "for EVERY distinct issue, create ONE finding entry; do NOT put issues only in summary"
+- Added `#[serde(default)]` to `pub affected_agents: Vec` in the `Finding` struct
+
+**`/etc/systemd/system/gururmm-server.service.d/ollama.conf` (on 172.16.3.30, already applied in prior session):**
+```ini
+[Service]
+Environment="OLLAMA_URL=http://172.16.0.1:11434"
+```
+
+**pfsense (already applied in prior session):**
+- Firewall rule: pass LAN traffic to 100.101.122.4 before FiberGW route-to rule (line 164)
+- socat relay: `/usr/local/etc/rc.d/socat_ollama` rc.d script (PID 988 at time of testing)
+- earlyshellcmd in config.xml: `/usr/local/etc/rc.d/socat_ollama start`
+
+### Credentials & Secrets
+
+No new credentials. Credentials used (existing):
+- GuruRMM API: `claude-api@azcomputerguru.com` / `ClaudeAPI2026!@#` (vault: `infrastructure/gururmm-server.sops.yaml`)
+- Build server SSH: `guru` / `Gptf*77ttb123!@#-rmm` @ 172.16.3.30:22
+
+### Infrastructure & Servers
+
+| Host | IP | Notes |
+|------|-----|-------|
+| GuruRMM server (Saturn) | 172.16.3.30:3001 | Rebuilt 3x this session; final deploy at 16:17:20 UTC |
+| Beast (Ollama host) | 100.101.122.4:11434 | RTX 4090, Tailscale peer, always-on |
+| pfsense | 172.16.0.1 (SSH :2248) | socat relay running, Tailscale 100.119.153.74 |
+
+**socat relay chain:** LAN → pfsense:11434 → Beast:100.101.122.4:11434
+**GuruRMM OLLAMA_URL:** `http://172.16.0.1:11434` (pfsense relay)
+**Model used:** qwen3:14b via Ollama `/api/chat`
+
+### Commands & Outputs
+
+```bash
+# End-to-end test confirming socat relay works
+POST http://172.16.3.30:3001/api/logs/analyze
+{"agent_id": "49098c52-542b-44de-bef2-93182280bdc6"}
+# -> 200 OK, log_count: 1817, summary: "No crashes..." (pre-fix)
+
+# Manual server build (run on 172.16.3.30 as guru via sudo)
+sudo /opt/gururmm/build-server.sh
+# Logs to /var/log/gururmm-build.log (~4 min with warm cache)
+
+# Post-fix analysis result
+POST http://172.16.3.30:3001/api/logs/analyze {} (fleet scope)
+# -> log_count: 500, findings: 3
+# [ERROR] WMI query failed due to invalid namespace (x102)
+# action: winmgmt /verifyrepository to repair WMI
+# sample: [17:57:30] WARN gururmm_agent::metrics: lhm: WMI query failed...
+# [ERROR] LibreHardwareMonitor.exe not found (x4)
+# action: reinstall LibreHardwareMonitor
+# sample: [17:57:33] WARN ...LHM: not found at "C:\Program Files\GuruRMM..."
+# [WARNING] Pending update did not apply (x1)
+# action: restart agent or system and retry
+# sample: [17:56:57] WARN ...updater: Pending update 0.6.29 -> 0.6.37 did not apply
+```
+
+**gururmm commits this session:**
+- `090774c` — perf: send up to 1500 logs to Ollama, prioritize errors/warnings
+- `3790be8` — fix: require findings entries for each identified issue in Ollama prompt
+- `e9c60aa` — fix: serde(default) on affected_agents so Ollama findings deserialize correctly
+
+### Pending / Incomplete Tasks
+
+- **Server build not in webhook pipeline**: Every server code change requires `sudo /opt/gururmm/build-server.sh` manually on 172.16.3.30. Consider adding server build to the webhook handler or a separate trigger.
+- **pfsense firewall rule matches exact host 100.101.122.4, not /8**: The intended rule was a /8 network match; pfsense's filter.inc drops the mask. Currently harmless since socat covers all Tailscale traffic via pfsense LAN IP, but the rule is technically wrong.
+- **pfsense vault MAC mismatch**: `infrastructure/pfsense-firewall.sops.yaml` needs re-encryption (MAC mismatch noted in prior session).
+- **TGC-SERVER Hyper-V disposition**: MAS90 VM running on TGC-SERVER (WS2016 DC). Customer says Hyper-V not expected there. Needs customer decision.
+- **URGENT: Neptune SSL cert expires 2026-05-31** (now today or tomorrow)
+- **URGENT: Western Tire SSL — verify AutoSSL on IX cPanel**
+
+### Reference Information
+
+- GuruRMM API base: `http://172.16.3.30:3001/api`
+- Log analysis endpoint: `POST /api/logs/analyze` (body: `{"agent_id": UUID}` optional, `{"hours": N}` optional, default 24h)
+- Analysis retrieval: `GET /api/logs/analysis` (last 20 runs)
+- Build server script: `/opt/gururmm/build-server.sh` (logs to `/var/log/gururmm-build.log`)
+- Webhook handler: `/opt/gururmm/webhook-handler.py` (port 9000, builds agents only, NOT server)
+- gururmm Gitea: `http://172.16.3.20:3000/azcomputerguru/gururmm`
+- Beast Ollama: `http://100.101.122.4:11434` (direct), `http://172.16.0.1:11434` (via socat relay from LAN)
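
Reviewer note: the "severity sort before truncation" decision recorded in the session log can be illustrated with a minimal, std-only sketch. This is not the code in `server/src/api/logs.rs`; the `LogLine` struct and the `severity_rank` and `sample_for_analysis` names are hypothetical and exist only for this illustration. The ranking (errors=0, warns=1, everything else=2) follows the log's description.

```rust
/// Hypothetical shape of one log line; the real struct in logs.rs may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct LogLine {
    pub level: String,
    pub message: String,
}

/// Rank used for ordering: errors=0, warnings=1, everything else (info/debug)=2.
pub fn severity_rank(level: &str) -> u8 {
    match level.to_ascii_uppercase().as_str() {
        "ERROR" => 0,
        "WARN" | "WARNING" => 1,
        _ => 2,
    }
}

/// Sort by severity first, then keep at most `limit` lines.
/// `sort_by_key` is a stable sort, so lines of equal severity
/// keep their original relative order.
pub fn sample_for_analysis(mut logs: Vec<LogLine>, limit: usize) -> Vec<LogLine> {
    logs.sort_by_key(|line| severity_rank(&line.level));
    logs.truncate(limit);
    logs
}
```

Calling `sample_for_analysis(logs, 1500)` on an INFO-heavy fetch would place every error and warning ahead of the informational lines before the cut-off is applied, which is the property the log describes as making the window signal-dense.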