sync: auto-sync from GURU-5070 at 2026-05-25 09:21:41

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-05-25 09:21:41
2026-05-25 09:21:43 -07:00
parent 413df93189
commit 6945b4237e
2 changed files with 125 additions and 1 deletions

@@ -581,3 +581,127 @@ git push origin main
- Windows build table: 19045=Win10 22H2, 20348=Server 2022, 22621=Win11 22H2, 22631=Win11 23H2, 26100=Win11 24H2/Server 2025
- macOS name table: 15=Sequoia, 14=Sonoma, 13=Ventura, 12=Monterey, 11=Big Sur
- Code review verdict: APPROVED — no defects
---
## Update: 09:20 PT — GuruRMM Ollama log analysis: socat relay + findings deserialization fix
### User
- **User:** Mike Swanson (mike)
- **Machine:** DESKTOP-0O8A1RL (GURU-5070)
- **Role:** admin
- **Session span:** resumed from compacted context, ~07:00 to 09:20 PT 2026-05-25
### Session Summary
Session resumed mid-work from a prior context. The goal carried over from that context was to verify end-to-end connectivity from the GuruRMM server (172.16.3.30) to Beast's Ollama instance (100.101.122.4:11434) via a socat relay running on pfsense (172.16.0.1). Prior work had already: added a pfsense firewall rule to pass 100.x traffic without the FiberGW route-to override, set up socat relay (`TCP-LISTEN:11434,reuseaddr,fork TCP:100.101.122.4:11434`) on pfsense, written a systemd drop-in at `/etc/systemd/system/gururmm-server.service.d/ollama.conf` setting `OLLAMA_URL=http://172.16.0.1:11434`, and confirmed TCP connectivity with nc.
The first task was confirming the full pipeline end-to-end. Called `POST /api/logs/analyze` with agent_id ACG-DC16 (49098c52-542b-44de-bef2-93182280bdc6), received a 200 with 1817 logs analyzed and a clean summary. Socat relay confirmed working.
Next, Mike asked why findings always came back empty. Reviewed `analyze_logs_with_ollama()` in `server/src/api/logs.rs`: it fetched up to 2000 logs but then called `.take(200)` before sending to Ollama — a conservative holdover from paid-API thinking with no justification for local Ollama. Also, the agent-scope path fetched all log levels (`&[]` — no filter), so the 200 lines sent to Ollama were statistically dominated by INFO/DEBUG noise rather than errors. Two fixes were applied in one commit: (1) added a severity sort (errors first, warnings second, info/debug last) before sampling, and (2) raised the sample limit from 200 to 1500.
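The sort-then-sample step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `server/src/api/logs.rs`: the `LogLine` struct, `severity_rank`, and `sample_for_analysis` are hypothetical names, and the level spellings are assumptions about how levels are stored.

```rust
/// Hypothetical stand-in for a stored log row; the real type in logs.rs may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct LogLine {
    pub level: String,
    pub message: String,
}

/// Rank levels so errors sort first, warnings second, everything else last
/// (errors=0, warns=1, info/debug=2). The level spellings are assumptions.
pub fn severity_rank(level: &str) -> u8 {
    match level.to_ascii_uppercase().as_str() {
        "ERROR" => 0,
        "WARN" | "WARNING" => 1,
        _ => 2,
    }
}

/// Sort by severity, then keep at most `limit` lines.
/// `sort_by_key` is a stable sort, so lines of equal severity keep their fetched order.
pub fn sample_for_analysis(mut logs: Vec<LogLine>, limit: usize) -> Vec<LogLine> {
    logs.sort_by_key(|line| severity_rank(&line.level));
    logs.into_iter().take(limit).collect()
}
```

Calling `sample_for_analysis(logs, 1500)` on an INFO-heavy batch would keep every error and warning ahead of any INFO/DEBUG line, which is the property the fix relies on.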
After those changes built and deployed, the analysis returned `findings: 0` despite the summary text describing three real issues (WMI failures, missing LHM executable, failed agent update). Direct testing of Ollama with a 4-line test prompt confirmed the model produces correct structured JSON with populated findings — so the model was not at fault. Root cause identified: the `Finding` struct had `pub affected_agents: Vec<Uuid>` without `#[serde(default)]`. Since Ollama never returns UUIDs in its findings, serde failed to deserialize every finding entry, and `unwrap_or_default()` silently returned an empty vec. A prompt-tightening pass had been started before the root cause was found — that prompt change is still in the codebase but was not the actual fix.
The real fix was adding `#[serde(default)]` to `affected_agents`. After the third build+deploy cycle, the analysis returned 3 findings with correct severity, count, sample lines, and suggested actions.
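The failure mode is easier to see in isolation. The sketch below is a std-only analogue, not the serde code itself: `RawFinding`, `parse_strict`, `parse_lenient`, and `parse_all` are hypothetical names, and `Finding` is a cut-down stand-in that uses `String` instead of `Uuid` to avoid external crates. It shows how an all-or-nothing parse combined with `unwrap_or_default()` turns one missing field into an empty list, and how defaulting that field (the role `#[serde(default)]` plays) avoids it.

```rust
/// Simplified stand-in for the server's `Finding` struct (the real one has more fields).
#[derive(Debug, Default, PartialEq)]
pub struct Finding {
    pub title: String,
    pub affected_agents: Vec<String>,
}

/// Hypothetical unvalidated entry: `None` means the model omitted the field.
pub struct RawFinding {
    pub title: Option<String>,
    pub affected_agents: Option<Vec<String>>,
}

/// Strict conversion: a missing `affected_agents` is an error
/// (analogous to a required serde field with no default).
pub fn parse_strict(raw: &RawFinding) -> Result<Finding, String> {
    Ok(Finding {
        title: raw.title.clone().ok_or("missing field `title`")?,
        affected_agents: raw
            .affected_agents
            .clone()
            .ok_or("missing field `affected_agents`")?,
    })
}

/// Lenient conversion: a missing `affected_agents` becomes an empty Vec
/// (analogous to `#[serde(default)]`).
pub fn parse_lenient(raw: &RawFinding) -> Result<Finding, String> {
    Ok(Finding {
        title: raw.title.clone().ok_or("missing field `title`")?,
        affected_agents: raw.affected_agents.clone().unwrap_or_default(),
    })
}

/// Collecting into `Result<Vec<_>, _>` is all-or-nothing: one bad entry fails the
/// whole list, and `unwrap_or_default()` then hides that failure as an empty Vec.
pub fn parse_all(
    raw: &[RawFinding],
    parse: fn(&RawFinding) -> Result<Finding, String>,
) -> Vec<Finding> {
    raw.iter()
        .map(parse)
        .collect::<Result<Vec<_>, _>>()
        .unwrap_or_default()
}
```

With entries that omit `affected_agents`, `parse_all(&raw, parse_strict)` returns an empty Vec and `parse_all(&raw, parse_lenient)` returns every entry. Logging the error before falling back (for example via `unwrap_or_else`) would likely have exposed the cause on the first run.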
### Key Decisions
- **Raise sample from 200 → 1500 lines, not unlimited**: qwen3:14b's default Ollama context window is ~32k tokens; 1500 log lines ≈ 45k tokens so there's a ceiling, but 1500 matches the fleet-scope DB cap and is a safe pragmatic limit.
- **Severity sort before truncation**: Without this, agent-scope analysis (no level filter) sends INFO-heavy samples and Ollama correctly sees nothing alarming. Sort ensures errors bubble to the top so the 1500-line window is signal-dense.
- **Prompt tightening was a red herring**: Added "for EVERY distinct issue, create ONE finding entry" language to the prompt during diagnosis. It stays in because it is a clearer instruction, but the actual fix was `#[serde(default)]`. Don't confuse the two.
- **Manual `sudo /opt/gururmm/build-server.sh` required**: The Gitea webhook pipeline only rebuilds agents (linux/windows/mac via `build-linux.sh`, `build-windows.sh`, `build-mac.sh`). The server binary is not covered and must be built by hand on the build server, so server changes don't auto-deploy. This is a gap.
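The sizing trade-off in the first decision can be checked with simple arithmetic. The sketch below assumes about 30 tokens per log line (derived from the estimate of 1500 lines ≈ 45k tokens, not measured) and treats ~32k as 32,000; the function names and the reserved-token parameter are hypothetical.

```rust
/// Estimated prompt size for `lines` log lines at `tokens_per_line` tokens each.
pub fn estimated_tokens(lines: usize, tokens_per_line: usize) -> usize {
    lines * tokens_per_line
}

/// How many log lines fit in a context window after reserving room for the
/// instructions and the model's reply. Returns 0 if nothing fits.
pub fn lines_that_fit(
    context_tokens: usize,
    reserved_tokens: usize,
    tokens_per_line: usize,
) -> usize {
    if tokens_per_line == 0 {
        return 0;
    }
    context_tokens.saturating_sub(reserved_tokens) / tokens_per_line
}
```

Under those assumptions, `estimated_tokens(1500, 30)` is 45,000 and `lines_that_fit(32_000, 0, 30)` is 1066, so a full 1500-line sample may not fit the default window unless the window is raised or real lines are shorter than estimated. Measuring actual tokens per line would settle it.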
### Problems Encountered
- **`.take(200)` discarded 90% of context**: The original code fetched 2000 logs then threw away 1800 before sending to Ollama. Fixed by raising limit to 1500 and adding severity sort.
- **findings always empty despite correct Ollama output**: `serde_json::from_value(parsed["findings"].clone()).unwrap_or_default()` silently swallowed deserialization errors. Root cause: `affected_agents: Vec<Uuid>` without `#[serde(default)]` — Ollama omits this field, serde rejects the entry. Fixed with one line: `#[serde(default)]`.
- **Pattern match failure for prompt edit via Python string replacement**: Escaping mismatch between Python double-escaped strings and the actual Rust source bytes caused the first replacement attempt to fail. Resolved by writing a patcher script to `/tmp/` on the build server and executing it via paramiko SFTP + exec_command, avoiding all local shell escaping.
- **Three full Rust builds required**: Each of the three fixes (sample limit + sort, prompt, serde fix) required a separate build. Rust release builds on 172.16.3.30 take ~4 minutes with warm cache. Total deploy time ~12 minutes across the three cycles.
- **Webhook pipeline does not build server**: Push to Gitea triggers agent builds only. Server must be manually rebuilt with `sudo /opt/gururmm/build-server.sh`.
### Configuration Changes
**`/home/guru/gururmm/server/src/api/logs.rs` (live on build server, pushed to Gitea):**
- Added severity sort on `sorted_logs` before sampling (errors=0, warns=1, info=2)
- Raised `.take(200)` → `.take(1500)` in `analyze_logs_with_ollama()`
- Rewrote Ollama prompt to be more directive: "for EVERY distinct issue, create ONE finding entry; do NOT put issues only in summary"
- Added `#[serde(default)]` to `pub affected_agents: Vec<Uuid>` in the `Finding` struct
**`/etc/systemd/system/gururmm-server.service.d/ollama.conf` (on 172.16.3.30, already applied in prior session):**
```ini
[Service]
Environment="OLLAMA_URL=http://172.16.0.1:11434"
```
**pfsense (already applied in prior session):**
- Firewall rule: pass LAN traffic to 100.101.122.4 before FiberGW route-to rule (line 164)
- socat relay: `/usr/local/etc/rc.d/socat_ollama` rc.d script (PID 988 at time of testing)
- earlyshellcmd in config.xml: `/usr/local/etc/rc.d/socat_ollama start`
### Credentials & Secrets
No new credentials. Credentials used (existing):
- GuruRMM API: `claude-api@azcomputerguru.com` / `ClaudeAPI2026!@#` (vault: `infrastructure/gururmm-server.sops.yaml`)
- Build server SSH: `guru` / `Gptf*77ttb123!@#-rmm` @ 172.16.3.30:22
### Infrastructure & Servers
| Host | IP | Notes |
|------|-----|-------|
| GuruRMM server (Saturn) | 172.16.3.30:3001 | Rebuilt 3x this session; final deploy at 16:17:20 UTC |
| Beast (Ollama host) | 100.101.122.4:11434 | RTX 4090, Tailscale peer, always-on |
| pfsense | 172.16.0.1 (SSH :2248) | socat relay running, Tailscale 100.119.153.74 |
**socat relay chain:** LAN → pfsense:11434 → Beast:100.101.122.4:11434
**GuruRMM OLLAMA_URL:** `http://172.16.0.1:11434` (pfsense relay)
**Model used:** qwen3:14b via Ollama `/api/chat`
### Commands & Outputs
```bash
# End-to-end test confirming the socat relay works (authentication not shown)
curl -X POST http://172.16.3.30:3001/api/logs/analyze \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "49098c52-542b-44de-bef2-93182280bdc6"}'
# -> 200 OK, log_count: 1817, summary: "No crashes..." (pre-fix)

# Manual server build (run on 172.16.3.30 as guru via sudo)
sudo /opt/gururmm/build-server.sh
# Logs to /var/log/gururmm-build.log (~4 min with warm cache)

# Post-fix analysis result, fleet scope (empty JSON body, authentication not shown)
curl -X POST http://172.16.3.30:3001/api/logs/analyze \
  -H 'Content-Type: application/json' \
  -d '{}'
# -> log_count: 500, findings: 3
# [ERROR] WMI query failed due to invalid namespace (x102)
# action: winmgmt /verifyrepository to repair WMI
# sample: [17:57:30] WARN gururmm_agent::metrics: lhm: WMI query failed...
# [ERROR] LibreHardwareMonitor.exe not found (x4)
# action: reinstall LibreHardwareMonitor
# sample: [17:57:33] WARN ...LHM: not found at "C:\Program Files\GuruRMM..."
# [WARNING] Pending update did not apply (x1)
# action: restart agent or system and retry
# sample: [17:56:57] WARN ...updater: Pending update 0.6.29 -> 0.6.37 did not apply
```
**gururmm commits this session:**
- `090774c` — perf: send up to 1500 logs to Ollama, prioritize errors/warnings
- `3790be8` — fix: require findings entries for each identified issue in Ollama prompt
- `e9c60aa` — fix: serde(default) on affected_agents so Ollama findings deserialize correctly
### Pending / Incomplete Tasks
- **Server build not in webhook pipeline**: Every server code change requires `sudo /opt/gururmm/build-server.sh` manually on 172.16.3.30. Consider adding server build to the webhook handler or a separate trigger.
- **pfsense firewall rule matches exact host 100.101.122.4, not /8**: The intended rule was a /8 network match; pfsense's filter.inc drops the mask. Currently harmless since socat covers all Tailscale traffic via pfsense LAN IP, but the rule is technically wrong.
- **pfsense vault MAC mismatch**: `infrastructure/pfsense-firewall.sops.yaml` needs re-encryption (MAC mismatch noted in prior session).
- **TGC-SERVER Hyper-V disposition**: MAS90 VM running on TGC-SERVER (WS2016 DC). Customer says Hyper-V not expected there. Needs customer decision.
- **URGENT: Neptune SSL cert expires 2026-05-31** (six days after this session)
- **URGENT: Western Tire SSL — verify AutoSSL on IX cPanel**
### Reference Information
- GuruRMM API base: `http://172.16.3.30:3001/api`
- Log analysis endpoint: `POST /api/logs/analyze` (body: `{"agent_id": UUID}` optional, `{"hours": N}` optional, default 24h)
- Analysis retrieval: `GET /api/logs/analysis` (last 20 runs)
- Build server script: `/opt/gururmm/build-server.sh` (logs to `/var/log/gururmm-build.log`)
- Webhook handler: `/opt/gururmm/webhook-handler.py` (port 9000, builds agents only, NOT server)
- gururmm Gitea: `http://172.16.3.20:3000/azcomputerguru/gururmm`
- Beast Ollama: `http://100.101.122.4:11434` (direct), `http://172.16.0.1:11434` (via socat relay from LAN)