sync: auto-sync from GURU-5070 at 2026-05-25 09:21:41

Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-05-25 09:21:41
2026-05-25 09:21:43 -07:00
parent 413df93189
commit 6945b4237e
2 changed files with 125 additions and 1 deletions
--- a/projects/msp-tools/guru-rmm
+++ b/projects/msp-tools/guru-rmm
--- a/session-logs/2026-05-25-session.md
+++ b/session-logs/2026-05-25-session.md
@@ -581,3 +581,127 @@ git push origin main
 - Windows build table: 19045=Win10 22H2, 20348=Server 2022, 22621=Win11 22H2, 22631=Win11 23H2, 26100=Win11 24H2/Server 2025
 - macOS name table: 15=Sequoia, 14=Sonoma, 13=Ventura, 12=Monterey, 11=Big Sur
 - Code review verdict: APPROVED — no defects
+
+---
+
+## Update: 09:20 PT — GuruRMM Ollama log analysis: socat relay + findings deserialization fix
+
+### User
+- **User:** Mike Swanson (mike)
+- **Machine:** DESKTOP-0O8A1RL (GURU-5070)
+- **Role:** admin
+- **Session span:** resumed from compacted context, ~07:00–09:20 PT 2026-05-25
+
+### Session Summary
+
+Session resumed mid-work from a prior context. The carried-over goal was to verify end-to-end connectivity from the GuruRMM server (172.16.3.30) to Beast's Ollama instance (100.101.122.4:11434) via a socat relay running on pfsense (172.16.0.1). Prior work had already added a pfsense firewall rule to pass 100.x traffic without the FiberGW route-to override; set up a socat relay (`TCP-LISTEN:11434,reuseaddr,fork TCP:100.101.122.4:11434`) on pfsense; written a systemd drop-in at `/etc/systemd/system/gururmm-server.service.d/ollama.conf` setting `OLLAMA_URL=http://172.16.0.1:11434`; and confirmed TCP connectivity with nc.
+
+The first task was confirming the full pipeline end-to-end. Called `POST /api/logs/analyze` with agent_id ACG-DC16 (49098c52-542b-44de-bef2-93182280bdc6), received a 200 with 1817 logs analyzed and a clean summary. Socat relay confirmed working.
+
+Next, Mike asked why findings always came back empty. Reviewed `analyze_logs_with_ollama()` in `server/src/api/logs.rs`: it fetched up to 2000 logs but then called `.take(200)` before sending to Ollama — a conservative holdover from paid-API thinking with no justification for local Ollama. Also, the agent-scope path fetched all log levels (`&[]` — no filter), so the 200 lines sent to Ollama were statistically dominated by INFO/DEBUG noise rather than errors. Two fixes were applied in one commit: (1) added a severity sort (errors first, warnings second, info/debug last) before sampling, and (2) raised the sample limit from 200 to 1500.
+
+After those changes built and deployed, the analysis returned `findings: 0` despite the summary text describing three real issues (WMI failures, missing LHM executable, failed agent update). Direct testing of Ollama with a 4-line test prompt confirmed the model produces correct structured JSON with populated findings — so the model was not at fault. Root cause identified: the `Finding` struct had `pub affected_agents: Vec<Uuid>` without `#[serde(default)]`. Since Ollama never returns UUIDs in its findings, serde failed to deserialize every finding entry, and `unwrap_or_default()` silently returned an empty vec. A prompt-tightening pass had been started before the root cause was found — that prompt change is still in the codebase but was not the actual fix.
+
+The real fix was adding `#[serde(default)]` to `affected_agents`. After the third build+deploy cycle, the analysis returned 3 findings with correct severity, count, sample lines, and suggested actions.
+
+### Key Decisions
+
+- **Raise sample from 200 → 1500 lines, not unlimited**: qwen3:14b's default Ollama context window is ~32k tokens and 1500 log lines ≈ 45k tokens, so the context window is the real ceiling and a larger sample would gain nothing. 1500 also matches the fleet-scope DB cap, which makes it a pragmatic limit.
+- **Severity sort before truncation**: Without this, agent-scope analysis (no level filter) sends INFO-heavy samples and Ollama correctly sees nothing alarming. Sort ensures errors bubble to the top so the 1500-line window is signal-dense.
+- **Prompt tightening was a red herring**: Added "for EVERY distinct issue, create ONE finding entry" language to the prompt during diagnosis. Kept it in as it's better instruction, but the actual fix was `#[serde(default)]`. Don't confuse the two.
+- **Manual `sudo /opt/gururmm/build-server.sh` required**: The Gitea webhook pipeline only rebuilds agents (linux/windows/mac via `build-linux.sh`, `build-windows.sh`, `build-mac.sh`). The server binary has to be rebuilt by hand on the build server, so server changes don't auto-deploy. This is a gap.
+
+### Problems Encountered
+
+- **`.take(200)` discarded 90% of context**: The original code fetched 2000 logs then threw away 1800 before sending to Ollama. Fixed by raising limit to 1500 and adding severity sort.
+- **findings always empty despite correct Ollama output**: `serde_json::from_value(parsed["findings"].clone()).unwrap_or_default()` silently swallowed deserialization errors. Root cause: `affected_agents: Vec<Uuid>` had no `#[serde(default)]`. Ollama omits this field, so deserializing the `findings` array failed with a missing-field error and the whole list fell back to an empty vec. Fixed with one line: `#[serde(default)]`.
+- **Pattern match failure for prompt edit via Python string replacement**: Escaping mismatch between Python double-escaped strings and the actual Rust source bytes caused the first replacement attempt to fail. Resolved by writing a patcher script to `/tmp/` on the build server and executing it via paramiko SFTP + exec_command, avoiding all local shell escaping.
+- **Three full Rust builds required**: Each of the three fixes (sample limit + sort, prompt, serde fix) required a separate build. Rust release builds on 172.16.3.30 take ~4 minutes with warm cache. Total deploy time ~12 minutes across the three cycles.
+- **Webhook pipeline does not build server**: Push to Gitea triggers agent builds only. Server must be manually rebuilt with `sudo /opt/gururmm/build-server.sh`.
+
+### Configuration Changes
+
+**`/home/guru/gururmm/server/src/api/logs.rs` (live on build server, pushed to Gitea):**
+- Added severity sort on `sorted_logs` before sampling (errors=0, warns=1, info=2)
+- Raised `.take(200)` → `.take(1500)` in `analyze_logs_with_ollama()`
+- Rewrote Ollama prompt to be more directive: "for EVERY distinct issue, create ONE finding entry; do NOT put issues only in summary"
+- Added `#[serde(default)]` to `pub affected_agents: Vec<Uuid>` in the `Finding` struct
+
+**`/etc/systemd/system/gururmm-server.service.d/ollama.conf` (on 172.16.3.30, already applied in prior session):**
+```ini
+[Service]
+Environment="OLLAMA_URL=http://172.16.0.1:11434"
+```
+
+**pfsense (already applied in prior session):**
+- Firewall rule: pass LAN traffic to 100.101.122.4 before FiberGW route-to rule (line 164)
+- socat relay: `/usr/local/etc/rc.d/socat_ollama` rc.d script (PID 988 at time of testing)
+- earlyshellcmd in config.xml: `/usr/local/etc/rc.d/socat_ollama start`
+
+### Credentials & Secrets
+
+No new credentials. Credentials used (existing):
+- GuruRMM API: `claude-api@azcomputerguru.com` / `ClaudeAPI2026!@#` (vault: `infrastructure/gururmm-server.sops.yaml`)
+- Build server SSH: `guru` / `Gptf*77ttb123!@#-rmm` @ 172.16.3.30:22
+
+### Infrastructure & Servers
+
+| Host | IP | Notes |
+|------|-----|-------|
+| GuruRMM server (Saturn) | 172.16.3.30:3001 | Rebuilt 3x this session; final deploy at 16:17:20 UTC |
+| Beast (Ollama host) | 100.101.122.4:11434 | RTX 4090, Tailscale peer, always-on |
+| pfsense | 172.16.0.1 (SSH :2248) | socat relay running, Tailscale 100.119.153.74 |
+
+**socat relay chain:** LAN → pfsense:11434 → Beast:100.101.122.4:11434
+**GuruRMM OLLAMA_URL:** `http://172.16.0.1:11434` (pfsense relay)
+**Model used:** qwen3:14b via Ollama `/api/chat`
+
+### Commands & Outputs
+
+```bash
+# End-to-end test confirming socat relay works
+POST http://172.16.3.30:3001/api/logs/analyze
+{"agent_id": "49098c52-542b-44de-bef2-93182280bdc6"}
+# -> 200 OK, log_count: 1817, summary: "No crashes..."  (pre-fix)
+
+# Manual server build (run on 172.16.3.30 as guru via sudo)
+sudo /opt/gururmm/build-server.sh
+# Logs to /var/log/gururmm-build.log (~4 min with warm cache)
+
+# Post-fix analysis result
+POST http://172.16.3.30:3001/api/logs/analyze  {}  (fleet scope)
+# -> log_count: 500, findings: 3
+#   [ERROR] WMI query failed due to invalid namespace (x102)
+#     action: winmgmt /verifyrepository to repair WMI
+#     sample: [17:57:30] WARN gururmm_agent::metrics: lhm: WMI query failed...
+#   [ERROR] LibreHardwareMonitor.exe not found (x4)
+#     action: reinstall LibreHardwareMonitor
+#     sample: [17:57:33] WARN ...LHM: not found at "C:\Program Files\GuruRMM..."
+#   [WARNING] Pending update did not apply (x1)
+#     action: restart agent or system and retry
+#     sample: [17:56:57] WARN ...updater: Pending update 0.6.29 -> 0.6.37 did not apply
+```
+
+**gururmm commits this session:**
+- `090774c` — perf: send up to 1500 logs to Ollama, prioritize errors/warnings
+- `3790be8` — fix: require findings entries for each identified issue in Ollama prompt
+- `e9c60aa` — fix: serde(default) on affected_agents so Ollama findings deserialize correctly
+
+### Pending / Incomplete Tasks
+
+- **Server build not in webhook pipeline**: Every server code change requires `sudo /opt/gururmm/build-server.sh` manually on 172.16.3.30. Consider adding server build to the webhook handler or a separate trigger.
+- **pfsense firewall rule matches exact host 100.101.122.4, not /8**: The intended rule was a /8 network match; pfsense's filter.inc drops the mask. Currently harmless because Ollama traffic reaches Beast through the socat relay on the pfsense LAN IP, but the rule does not match what was intended.
+- **pfsense vault MAC mismatch**: `infrastructure/pfsense-firewall.sops.yaml` needs re-encryption (MAC mismatch noted in prior session).
+- **TGC-SERVER Hyper-V disposition**: MAS90 VM running on TGC-SERVER (WS2016 DC). Customer says Hyper-V not expected there. Needs customer decision.
+- **URGENT: Neptune SSL cert expires 2026-05-31** (6 days after this session)
+- **URGENT: Western Tire SSL — verify AutoSSL on IX cPanel**
+
+### Reference Information
+
+- GuruRMM API base: `http://172.16.3.30:3001/api`
+- Log analysis endpoint: `POST /api/logs/analyze` (optional body fields: `agent_id` (UUID) to scope to one agent, `hours` (N) for the lookback window, default 24h; an empty body `{}` runs fleet scope)
+- Analysis retrieval: `GET /api/logs/analysis` (last 20 runs)
+- Build server script: `/opt/gururmm/build-server.sh` (logs to `/var/log/gururmm-build.log`)
+- Webhook handler: `/opt/gururmm/webhook-handler.py` (port 9000, builds agents only, NOT server)
+- gururmm Gitea: `http://172.16.3.20:3000/azcomputerguru/gururmm`
+- Beast Ollama: `http://100.101.122.4:11434` (direct), `http://172.16.0.1:11434` (via socat relay from LAN)
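
The severity sort described under Key Decisions (errors=0, warns=1, info=2, then keep the first 1500 lines) can be sketched roughly as below. This is a minimal illustration and not the code in `server/src/api/logs.rs`: the `LogLine` struct and the names `severity_rank` and `sample_for_analysis` are hypothetical, and the level strings are assumed to be `ERROR`, `WARN`/`WARNING`, and anything else.

```rust
// Hypothetical log record; the real GuruRMM type is not shown in this log.
#[derive(Debug, Clone, PartialEq)]
struct LogLine {
    level: String,
    message: String,
}

// Rank values follow the session notes: errors=0, warns=1, everything else=2.
fn severity_rank(level: &str) -> u8 {
    match level.to_ascii_uppercase().as_str() {
        "ERROR" => 0,
        "WARN" | "WARNING" => 1,
        _ => 2, // INFO, DEBUG, and unknown levels sort last
    }
}

// Sort by severity first, then truncate, so the sample keeps the
// most severe lines when there are more logs than `limit`.
fn sample_for_analysis(mut logs: Vec<LogLine>, limit: usize) -> Vec<LogLine> {
    // sort_by_key is a stable sort: lines with the same rank keep
    // their original relative order.
    logs.sort_by_key(|line| severity_rank(&line.level));
    logs.truncate(limit);
    logs
}
```

Calling `sample_for_analysis(logs, 1500)` would reproduce the limit chosen in this session. Sorting before truncating is the point of the design: truncating first would keep whatever mix of levels happened to come back from the query, which for agent-scope analysis was mostly INFO/DEBUG.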