docs: move RMM session log to root session-logs; update placement rules

2026-05-15 06:10:15 -07:00
parent 1ef03d7f2f
commit d52e79d1aa
4 changed files with 172 additions and 2 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -315,7 +315,8 @@ Vault structure: `infrastructure/`, `clients/`, `services/`, `projects/`, `msp-t
 - **Dataforth DOS work** → `projects/dataforth-dos/`
 - **ClaudeTools API code** → `api/`, `migrations/`
- **GuruRMM work** → `projects/msp-tools/guru-rmm/`
+- **GuruRMM work** → `projects/msp-tools/guru-rmm/` (code reference only — submodule, stale copy of `azcomputerguru/gururmm`)
 - **GuruRMM session logs** → `session-logs/` (root, in claudetools — NOT committed to the gururmm submodule)
 - **Client work** → `clients/[client-name]/`
 - **Session logs** → project or client `session-logs/` subfolder; general → root `session-logs/`
 - **Full guide:** `.claude/FILE_PLACEMENT_GUIDE.md`
--- a/.claude/FILE_PLACEMENT_GUIDE.md
+++ b/.claude/FILE_PLACEMENT_GUIDE.md
@@ -17,6 +17,7 @@
 | Client Session Logs | Support notes | `clients/[client-name]/session-logs/` |
 | ClaudeTools API Code | `*.py`, migrations | `api/`, `migrations/` (keep existing structure) |
 | ClaudeTools API Logs | Session notes | `session-logs/` (root) |
 | GuruRMM Session Logs | RMM work | `session-logs/YYYY-MM-DD-session.md` (root — NOT in gururmm submodule) |
 | General Session Logs | Mixed work | `session-logs/YYYY-MM-DD-session.md` |
 | Credentials | All credentials | `credentials.md` (root - shared) |
--- a/projects/msp-tools/guru-rmm
+++ b/projects/msp-tools/guru-rmm
--- a/session-logs/2026-05-15-session.md
+++ b/session-logs/2026-05-15-session.md
@@ -0,0 +1,168 @@
 # Session Log — 2026-05-15
 ## User
 - **User:** Mike Swanson (mike)
 - **Machine:** DESKTOP-0O8A1RL
 - **Role:** admin
 - **Session Span:** ~03:30 UTC – 06:03 UTC (continued from prior context window)
 ---
 ## Session Summary
 This session was a continuation of a prior context window that had implemented 0.6.19 agent features (extended temperature sensors, wts.rs Windows fixes, watchdog always-on policy changes). The immediate work on entry was completing the 0.6.19 fleet rollout: three agents — IMC1 (fa99e913), GND-SERVER (cd086074), and CS-SERVER (6766e973) — were stuck on 0.6.18 with dead WebSocket write halves. The server's ConnectedAgents in-memory map held stale entries: read side (heartbeats) still worked, but write side (commands) was dead, so update dispatch failed with "Agent is offline" even though DB showed them online.
 The first approach was setting those agents offline in the DB to force a reconnect. This failed because the agents were still heartbeating (the server's in-memory read task was alive), so the DB immediately got updated back to online on the next heartbeat. A server restart was needed to clear the in-memory map. After restart, all three agents reconnected with fresh connections within seconds and immediately accepted the 0.6.19 update. All completed successfully within 3 seconds of reconnection.
 During log inspection, two server bugs were identified and fixed. First: the `TemperatureSensor` struct in `server/src/ws/mod.rs` used field names `temp_celsius` and `critical_celsius`, but the agent's `SensorReading` struct serializes to `value`, `sensor_type`, `unit`, and `critical_value`. Every metrics message from any agent that included temperature readings failed to deserialize (`missing field 'temp_celsius'`); the error was logged, but the temperature data was dropped. Second: the WebSocket receive loop did not monitor the send task. When a WebSocket write failed (killing the send task), the receive loop continued running indefinitely, keeping the agent in ConnectedAgents with a dead write half. Every subsequent command dispatch attempt failed silently. The fix uses `tokio::select!` to watch both incoming messages and the send task: when the send task exits, the receive loop breaks, cleanup removes the agent from ConnectedAgents, and the agent reconnects fresh.
 Both fixes were implemented via Python patch script on the server source, compiled with cargo (4m 6s build), and deployed by stopping the service, replacing `/opt/gururmm/gururmm-server`, and restarting. The fixes were committed and pushed to Gitea as commit 56283dd. The patched server ran cleanly with no `temp_celsius` errors and no failed command dispatches in the new process's logs.
 At session end: 15 online agents on 0.6.19, AD2 on 0.6.1 (offline since April 20, requires physical/VPN access), and ~30 offline agents on older versions that will auto-update on next reconnect.
 ---
 ## Key Decisions
 - **Server restart over DB-offline trick**: Setting agents offline in the DB does not disconnect them because the server's in-memory receive loop is still running and updates `last_seen` on every heartbeat, racing with any DB status change. Only a server restart clears the in-memory ConnectedAgents map. Accepted the brief (~10s) outage of all agents.
 - **`biased` ordering in select! (send_task first)**: Could have put incoming messages first, but with `biased` the branches are polled in declaration order, so when the send task has exited and an incoming message is also ready, the dead write half is handled first instead of more messages being processed. Incoming messages are still processed on every iteration while the send task is alive.
 - **TemperatureSensor renamed to match agent**: Rather than aliasing with `#[serde(rename)]`, fully renamed the struct fields to match the agent's canonical names (`value`, `sensor_type`, `unit`, `critical_value`). Any previously stored JSON in the `temperatures` column used wrong field names and was silently unreadable, so there's no backward-compat cost to renaming.
 - **Edit directly on server vs. local + push**: Local repo is a stale copy of gururmm. Edited the live source on `/home/guru/gururmm/`, built there, deployed, then committed and pushed. Faster than any local→Gitea→pull flow, and the single file edit was low-risk.
 - **Deployed first, then pushed to Gitea**: Committed after confirming the fix worked in production. Appropriate for a targeted bugfix with no DB migrations.
 ---
 ## Problems Encountered
 - **`cp: cannot create regular file '/opt/gururmm/gururmm-server': Text file busy`**: Tried to copy the new binary while the service was running. Resolution: stop service first (`systemctl stop`), then copy, then start. Standard Linux "can't replace a running executable" behavior.
 - **Binary deployed to wrong path first**: Copied to `/usr/local/bin/gururmm-server` but systemd unit's `ExecStart` points to `/opt/gururmm/gururmm-server`. The service restarted but ran the old binary. Identified by checking `systemctl show gururmm-server --property=ExecStart`. Resolution: stop/copy to correct path/start.
 - **git push rejected (non-fast-forward)**: Remote had commits not in local. Resolution: `git pull --rebase` then `git push`.
 - **psql peer auth failed**: `psql -U gururmm gururmm` uses peer auth over the Unix socket, which requires an OS user with the same name as the database role. Resolution: ran queries as the postgres superuser via `sudo -u postgres psql -d gururmm`.
 - **`temp_celsius` errors in patched server logs**: After deploying the patch (PID 946066), still saw `temp_celsius` errors in `journalctl`. Turned out those error lines had PID 943615 or 945573 (old server instances) — the patched server produced none. Confirmed by filtering with `_PID=946066`.
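The PID check in the last item can also be applied to saved log text. This is a minimal sketch, assuming `journalctl`'s default short output, where each line carries an `identifier[PID]:` tag. The function names are hypothetical, and on a live system the native `_PID=` match is simpler.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking which server process logged an error.

# Keep only the lines written by one process. Reads log text on stdin;
# assumes the short journal format with an "identifier[PID]:" tag per line.
lines_for_pid() {
  grep -F "[$1]:"
}

# Count temp_celsius errors logged by the given PID.
# Prints 0 when there are none (grep -c exits non-zero in that case).
count_temp_errors() {
  lines_for_pid "$1" | grep -c 'temp_celsius' || true
}

# Example (not run here):
#   journalctl -u gururmm-server | count_temp_errors 946066
```

A count of 0 for the new PID and a non-zero count for the old PIDs would match what the session observed.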
 ---
 ## Configuration Changes
 ### Server Source (on `/home/guru/gururmm/`)
 **`server/src/ws/mod.rs`**: three edits (item 1 fixes the schema mismatch; items 2 and 3 together fix the dead write half):
 1. `TemperatureSensor` struct renamed to match agent:
   - `temp_celsius: f32` → `value: f32`
   - `critical_celsius: Option<f32>` → `critical_value: Option<f32>`
   - Added: `sensor_type: String`, `unit: String`
 2. `let send_task` → `let mut send_task`
 3. Receive loop changed from `while let Some(msg_result) = receiver.next().await` to `loop { tokio::select! { biased; _ = &mut send_task => { warn!(...); break; } msg_result = receiver.next() => { ... } } }`
 ### Binary Deployed
 - `/opt/gururmm/gururmm-server` — replaced with build from 2026-05-15 03:47 UTC
 ### Commits
 - Gururmm repo: `56283dd` — "fix: TemperatureSensor schema mismatch and dead write-half detection"
 ---
 ## Credentials & Secrets
 None new. API credentials used:
 - GuruRMM API login: `claude-api@azcomputerguru.com` / `ClaudeAPI2026!@#` (from vault, used to get JWT for manual update trigger attempts)
 ---
 ## Infrastructure & Servers
 - **GuruRMM Server**: `172.16.3.30:3001` — Rust/Axum, systemd unit `gururmm-server`
 - **Binary path**: `/opt/gururmm/gururmm-server`
 - **Source path**: `/home/guru/gururmm/` (git repo, remote at `172.16.3.20:azcomputerguru/gururmm.git`)
 - **Gitea**: `http://172.16.3.20:3000` (internal, not git.azcomputerguru.com which is behind Cloudflare)
 - **DB**: PostgreSQL on `172.16.3.30`, database `gururmm`, accessed via `sudo -u postgres psql -d gururmm`
 ---
 ## Commands & Outputs
 ```bash
 # Set agents offline to force reconnect (didn't work alone, needed restart too)
 sudo -u postgres psql -d gururmm -c \
  "UPDATE agents SET status='offline' WHERE hostname IN ('IMC1','GND-SERVER','CS-SERVER') RETURNING hostname, status, agent_version;"
 # Server restart (clears in-memory ConnectedAgents map)
 sudo systemctl restart gururmm-server
 # Build patched server (4m 6s)
 cd /home/guru/gururmm/server && /home/guru/.cargo/bin/cargo build --release
 # Deploy (stop-first pattern to avoid "Text file busy")
 sudo systemctl stop gururmm-server
 sudo cp /home/guru/gururmm/server/target/release/gururmm-server /opt/gururmm/gururmm-server
 sudo systemctl start gururmm-server
 # Commit and push fixes
 cd /home/guru/gururmm
 git add server/src/ws/mod.rs
 git commit -m 'fix: TemperatureSensor schema mismatch and dead write-half detection'
 git pull --rebase && git push
 # Result: 56283dd pushed to 172.16.3.20:azcomputerguru/gururmm.git
 ```
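The two deploy mistakes recorded under Problems Encountered (wrong destination path, copying over a running binary) could be guarded against with a small wrapper. This is a sketch only: `exec_start_path` and `deploy_binary` are hypothetical names, and the parser assumes the `ExecStart={ path=... ; argv[]=... ; ... }` layout that `systemctl show` prints.

```shell
#!/usr/bin/env bash
# Hypothetical deploy helpers; only the unit name and paths come from this log.

# Read `systemctl show <unit> --property=ExecStart` output on stdin and
# print the binary path the unit actually runs.
exec_start_path() {
  sed -n 's/^ExecStart={ path=\([^ ;]*\) ;.*/\1/p'
}

# Stop the unit, replace the binary, start the unit again.
# With DRY_RUN=1 the commands are printed instead of executed.
deploy_binary() {
  local unit="$1" src="$2" dest="$3"
  local run=""
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"
  $run sudo systemctl stop "$unit" &&
    $run sudo cp "$src" "$dest" &&
    $run sudo systemctl start "$unit"
}

# Example (not run here):
#   dest="$(systemctl show gururmm-server --property=ExecStart | exec_start_path)"
#   DRY_RUN=1 deploy_binary gururmm-server \
#     /home/guru/gururmm/server/target/release/gururmm-server "$dest"
```

Taking the destination from the unit itself avoids copying to a path the service never executes, and the stop-first order avoids the "Text file busy" error.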
 Key log evidence of dead write half (before fix):
 ```
 INFO  gururmm_server::ws: Dispatching update to connected agent fa99e913... on heartbeat: 0.6.18 -> 0.6.19
 ERROR gururmm_server::ws: Failed to send heartbeat update command to agent fa99e913... — rolling back pending record
 ```
 After restart + update:
 ```
 INFO  gururmm_server::ws: Received update result from agent fa99e913...: update_id=..., status=starting
 INFO  gururmm_server::ws: Agent fa99e913... reconnected after update: 0.6.18 -> 0.6.19
 ```
 ---
 ## Pending / Incomplete Tasks
 - **AD2 (0.6.1, offline since 2026-04-20)**: Requires physical or VPN access. Cannot be updated remotely. Low priority but should be investigated when accessible.
 - **BB-SERVER enrollment loop**: Repeatedly hitting `duplicate key value violates unique constraint "idx_agents_site_device"` on every WS connect attempt. Not investigated. The agent is already enrolled (row exists) but its auth flow is re-attempting first-time enrollment. Likely needs a code fix in the site-based auth logic to handle "already enrolled, just reconnecting" more gracefully.
 - **Offline agents on older versions** (will auto-update on reconnect):
  - 0.6.18: LAPTOP-8P7HDSEI, MSI, Maras-HP-Laptop
  - 0.6.3: ~14 machines (ACCT2-PC, ANN-PC, ASSISTMAN-PC, etc. — Stamback/Safesite fleet)
  - 0.6.2: NurseAssist, PST-SURFACE, StambackLaptopNew
  - 0.6.1: Mikes-MacBook-Air.local (offline)
  - 0.5.1: SL-SERVER x2 (offline, possibly abandoned)
 - **`unsupported Unicode escape sequence` on hardware inventory for IMC1**: Logged at `WARN` level after 0.6.19 update. The agent's hardware inventory JSON contains a Unicode escape sequence that PostgreSQL rejects. Likely a field value (serial number, software name, etc.) with a problematic character. Not investigated.
 - **Dead write half root cause not fully diagnosed**: We know the pattern (send_task dies, receive loop keeps running), and the fix prevents it from being persistent. But what originally causes the send_task to die (network issue? buffer full? specific message type?) is not determined. The `select!` fix means it self-heals now (agent reconnects), so this is lower priority.
 - **Policy wiring plan** (`ticklish-questing-stallman.md`): Full end-to-end policy propagation still pending. Server sends ConfigUpdate on connect (wired), but agent-side handling is not complete. Deferred.
 - **Safesite Glendale MSI machine**: Waiting for user to be away to push DisplayLink driver update.
 - **LHM bundling in MSI**: LibreHardwareMonitor files not in build pipeline; self-healing download not implemented.
 - **Build lock**: No flock on `build-agents.sh` to prevent concurrent invocations.
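For the build-lock item, one possible shape is a wrapper around `flock(1)` from util-linux. This is a sketch, not what `build-agents.sh` does today: the function name, the lock path in the example, and exit code 75 are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command only if no other holder has the lock.

with_build_lock() {
  local lock_file="$1"
  shift
  (
    # -n: fail immediately instead of waiting for the current holder.
    flock -n 9 || { echo "another build is already running" >&2; exit 75; }
    "$@"
  ) 9>"$lock_file"
}

# Example (not run here):
#   with_build_lock /tmp/build-agents.lock ./build-agents.sh
```

The lock is tied to file descriptor 9 of the subshell, so it is released when the build exits, including on failure.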
 ---
 ## Reference Information
 - **Gururmm Gitea repo**: `http://172.16.3.20:3000/azcomputerguru/gururmm`
 - **Fix commit**: `56283dd` — fix: TemperatureSensor schema mismatch and dead write-half detection
 - **Server source**: `/home/guru/gururmm/server/src/ws/mod.rs`
 - **Agent metrics struct**: `agent/src/metrics/mod.rs:17` — `SensorReading { label, value, sensor_type, unit, critical_value }`
 - **Server TemperatureSensor struct**: `server/src/ws/mod.rs:316` — now matches agent
 - **Dead write half fix**: `server/src/ws/mod.rs:679` — `let mut send_task`, receive loop at ~691 uses `tokio::select!`
 - **Plan file**: `C:\Users\guru\.claude\plans\ticklish-questing-stallman.md` (policy wiring, deferred)
 - **Fleet status as of session end**:
  - Online on 0.6.19: CS-SERVER, DESKTOP-0O8A1RL, DESKTOP-BTR2AM3, DESKTOP-DLTAGOI, DESKTOP-H6QHRR7, DESKTOP-KQSL232, DF-GAGETRAK, GND-SERVER, IMC1, LAPTOP-DRQ5L558, LAPTOP-E0STJJE8, MAINTENANCE-PC, MDIRECTOR-PC, NURSESTATION-PC, gururmm (15 agents)
  - Offline on 0.6.1: AD2 (since 2026-04-20, unreachable)