# Session Log — 2026-05-14

## User

- **User:** Mike Swanson (mike)
- **Machine:** DESKTOP-0O8A1RL
- **Role:** admin
- **Session span:** ~15:00 – 16:37 UTC (continuation from prior context)

---
## Session Summary

This session resolved the GuruRMM inventory temperature collection issue end-to-end and delivered agent version 0.6.18 to production. The work continued from a prior context in which temperatures were confirmed as collected at the agent (LHM returning cpu ~42°C) but arrived as NULL in the database.

Root cause investigation identified three compounding issues: (1) the system default policy in PostgreSQL had all metrics collection flags set to `false`, including `collect_temperatures`, so the agent's `collect_with_flags()` was zeroing temperatures before transmission; (2) the `lhm: ok` warn log fires inside `collect()` before policy is applied, making it appear that temperatures were flowing when they were not; (3) a separate, redundant metrics task in `service.rs` called raw `collect()` and logged CPU readings that were never sent to the server, creating a false impression of what data was reaching the DB.

Code fixes for 0.6.18 included: removing the redundant `service.rs` metrics task entirely, downgrading the `lhm: ok` log from WARN to DEBUG, bumping Cargo.toml to 0.6.18, and making the Pluto build script cleanup step non-fatal (it had been blocking deployment via the `&&` chain when the cleanup directory was absent). Deployment required a manual SCP of the pre-built Pluto artifacts, because two concurrent build invocations had caused a race condition on `Cargo.lock`. All artifacts were signed and deployed successfully.

With 0.6.18 deployed, the session then diagnosed why this machine (DESKTOP-0O8A1RL) was not auto-updating: (1) the system default policy had `updates.auto_update: false`, preventing the server from dispatching Update commands; (2) after enabling auto_update via a direct PostgreSQL UPDATE, the 0.6.17 agent still wasn't receiving a dispatch because a stale pending update record (0.6.7→0.6.10 from the previous day) blocked the dispatch gate. Clearing that record allowed the next heartbeat to trigger dispatch. The agent updated to 0.6.18 at 16:35 UTC, and the first post-update metric landed in the database at 16:36:10 with cpu_temp=44°C and gpu_temp=43.5°C confirmed non-NULL.

---
## Key Decisions

- **Understand before fixing**: Followed the session constraint that bugs are not fixed without understanding root cause. Traced all three contributing factors before touching code or DB.
- **Manual deployment over re-triggering CI**: The concurrent build race condition left artifacts pre-built on Pluto; rather than risk a third invocation, SCP'd directly from Pluto and ran signing/deploy steps manually.
- **Cleanup non-fatal via subshell**: Changed `cd cleanup && $CARGO build --release` to `(cd cleanup && $CARGO build --release || echo cleanup_skipped)` — isolates cleanup failure without masking the exit code of the main build chain.
- **Mark stale pending record as failed, not delete**: Used `status='failed'` with an explanatory error_message rather than deleting, preserving audit trail of the interrupted 0.6.7→0.6.10 update.
- **Policy fix via direct SQL**: Updated the system default policy directly in PostgreSQL rather than via dashboard UI — faster, auditable, and doesn't require UI to be working for infrastructure fixes.
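Marking the stale record `failed` rather than deleting it works only if the dispatch gate treats in-flight records alone as blocking. The following is a minimal Rust sketch of such a gate, not the server's actual code. `get_pending_update()` is named in this log, but its signature here, the `UpdateRecord` and `UpdateStatus` types, and `should_dispatch` are hypothetical, and versions are compared as plain strings for brevity.

```rust
/// Update lifecycle states. The first three appear in this log's SQL query
/// and `Failed` in the cleanup statement; any other states are omitted.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdateStatus {
    Pending,
    Downloading,
    Installing,
    Failed,
}

/// Hypothetical in-memory stand-in for a row of `agent_updates`.
#[derive(Debug, Clone)]
pub struct UpdateRecord {
    pub old_version: String,
    pub target_version: String,
    pub status: UpdateStatus,
}

/// Returns the first record that is still in flight, if any.
/// (Assumed behaviour; the real function presumably queries the database.)
pub fn get_pending_update(records: &[UpdateRecord]) -> Option<&UpdateRecord> {
    records.iter().find(|r| {
        matches!(
            r.status,
            UpdateStatus::Pending | UpdateStatus::Downloading | UpdateStatus::Installing
        )
    })
}

/// Dispatch only when policy allows it, the agent is behind, and no
/// earlier update is still marked as in flight.
pub fn should_dispatch(
    auto_update: bool,
    agent_version: &str,
    latest_version: &str,
    records: &[UpdateRecord],
) -> bool {
    auto_update
        && agent_version != latest_version
        && get_pending_update(records).is_none()
}
```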
---

## Problems Encountered

- **`lhm: ok` fires before policy applied**: The warn log inside `collect()` gave a false positive — temperatures were collected but then zeroed by `collect_with_flags()`. Resolution: traced the call stack, downgraded to `debug!`.
- **Redundant service.rs metrics task**: Called `collect()` directly every 60s and logged CPU readings at INFO, making it look like policy was being bypassed. Resolution: removed the task entirely in 0.6.18.
- **Concurrent build race on Pluto**: Two SSH sessions simultaneously ran `move /Y Cargo.lock Cargo.lock.stable`. Second session failed with "The system cannot find the file specified." Resolution: manual deployment + build script fix for cleanup.
- **Cleanup step blocked deployment**: `cd cleanup && $CARGO build --release` at end of Pluto `&&` chain; failure here aborted the entire SSH session AFTER successful MSI build, preventing SCP. Resolution: subshell `|| echo cleanup_skipped`.
- **Stale pending update blocked dispatch**: `get_pending_update()` returned the old 0.6.7→0.6.10 record, causing the dispatch gate to skip. Resolution: identified via `agent_updates` table query, marked record as failed.
- **PostgreSQL not externally accessible**: Port 5432 not exposed from 172.16.3.30. Resolution: SSH into server and run `PGPASSWORD=... psql -h 127.0.0.1`.
- **0.6.13 agents looping send failures**: Three agents (c778b6a3, fa99e913, cd086074) repeatedly receive heartbeat dispatch but `send_to()` returns false — write half of their WS connection is dead while read half still works. Not fixed this session (documented for future investigation).
- **BB-SERVER enrollment loop**: BB-SERVER keeps hitting `duplicate key value violates unique constraint "idx_agents_site_device"` on first WS connect. Not fixed this session.
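The first two problems share one mechanism: raw `collect()` always reads the sensors, while `collect_with_flags()` discards whatever the policy disallows, so a log line emitted inside `collect()` says nothing about what is transmitted. A minimal Rust sketch of that relationship follows. The function names come from this log, but the `Metrics` and `MetricsFlags` types, the fixed sample readings, and the use of `None` to stand for a cleared value are assumptions for illustration.

```rust
/// Hypothetical, simplified metrics payload.
#[derive(Debug, Clone, PartialEq)]
pub struct Metrics {
    pub cpu_percent: f32,
    pub cpu_temp_celsius: Option<f32>,
    pub gpu_temp_celsius: Option<f32>,
}

/// Hypothetical subset of the policy's metrics flags.
#[derive(Debug, Clone, Copy)]
pub struct MetricsFlags {
    pub collect_cpu: bool,
    pub collect_temperatures: bool,
}

/// Raw collection: reads every sensor regardless of policy.
/// Fixed sample values stand in for real sensor reads.
pub fn collect() -> Metrics {
    Metrics {
        cpu_percent: 6.42,
        cpu_temp_celsius: Some(42.0),
        gpu_temp_celsius: Some(43.5),
    }
}

/// Policy-aware collection: collects everything, then clears any field
/// the policy does not allow before the payload is sent.
pub fn collect_with_flags(flags: MetricsFlags) -> Metrics {
    let mut metrics = collect();
    if !flags.collect_cpu {
        metrics.cpu_percent = 0.0;
    }
    if !flags.collect_temperatures {
        // A cleared temperature would reach the database as NULL.
        metrics.cpu_temp_celsius = None;
        metrics.gpu_temp_celsius = None;
    }
    metrics
}
```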
---

## Configuration Changes

### GuruRMM codebase (gururmm Gitea repo — `azcomputerguru/gururmm`)

| File | Change |
|------|--------|
| `agent/Cargo.toml` | Version `0.6.17` → `0.6.18` |
| `agent/src/metrics/mod.rs` | Line 629: `warn!("lhm: ok ...")` → `debug!("lhm: ok ...")` |
| `agent/src/service.rs` | Removed redundant metrics tokio task (lines 229-246) and its `select!` arm |
| `agent/src/transport/websocket.rs` | Metrics send log changed to `info!` at websocket module path; ConfigUpdate handler wired |
| `scripts/build-agents.sh` | Cleanup step wrapped in subshell: `(cd cleanup && $CARGO build --release \|\| echo cleanup_skipped)` |

### Database (PostgreSQL @ 172.16.3.30:5432/gururmm)

| Change | SQL |
|--------|-----|
| All metrics flags enabled in system default policy | `UPDATE policies SET policy_data = jsonb_set(policy_data, '{metrics}', '{"collect_cpu":true,"collect_memory":true,"collect_disk":true,"collect_network":true,"collect_temperatures":true,"collect_user_info":true,"collect_public_ip":true}'::jsonb) WHERE is_system_default = true;` (done in prior session) |
| Watchdog enabled in system default policy | `jsonb_set(policy_data, '{watchdog}', '{"enabled":true}'::jsonb)` (done in prior session) |
| Auto-update enabled in system default policy | `UPDATE policies SET policy_data = jsonb_set(policy_data, '{updates}', '{"auto_update": true}'::jsonb) WHERE is_system_default = true;` |
| Stale pending update cleared for DESKTOP-0O8A1RL | `UPDATE agent_updates SET status='failed', error_message='stale: agent updated via MSI, record never closed', completed_at=now() WHERE id='f1e243df-73fd-48c4-9f33-62a00211d5e8';` |

### Local dev repo (this machine)

- `projects/msp-tools/guru-rmm/session-logs/2026-05-13-session.md` — appended, committed as `51c651f`
---

## Credentials & Secrets

| Resource | Value |
|----------|-------|
| PostgreSQL (gururmm) | host: 172.16.3.30, port: 5432 (localhost only), db: gururmm, user: gururmm, password: `43617ebf7eb242e814ca9988cc4df5ad` |
| SSH to build server | `guru@172.16.3.30` — key-based from this machine |

---
## Infrastructure & Servers

| Component | Details |
|-----------|---------|
| GuruRMM server | 172.16.3.30:3001, wss://rmm-api.azcomputerguru.com/ws |
| PostgreSQL | 172.16.3.30:5432, localhost-only, db=gururmm |
| Pluto (Windows build) | Administrator@172.16.3.36, C:\gururmm, cargo/wix builds |
| Downloads dir | /var/www/gururmm/downloads/ on 172.16.3.30 |
| Agent on this machine | DESKTOP-0O8A1RL, agent_id=c043d9ac-4020-4cab-a5f4-b90213d11e73, now 0.6.18 |

---
## Commands & Outputs

```bash
# Enable auto_update in system default policy
ssh guru@172.16.3.30
PGPASSWORD='43617ebf7eb242e814ca9988cc4df5ad' psql -U gururmm -d gururmm -h 127.0.0.1 -c \
  "UPDATE policies SET policy_data = jsonb_set(policy_data, '{updates}', '{\"auto_update\": true}'::jsonb) WHERE is_system_default = true;"

# Check for stale pending updates
PGPASSWORD='...' psql ... -c \
  "SELECT au.id, a.hostname, au.old_version, au.target_version, au.status FROM agent_updates au JOIN agents a ON a.id=au.agent_id WHERE au.status IN ('pending','downloading','installing') ORDER BY au.created_at DESC LIMIT 20;"
# Found: f1e243df | DESKTOP-0O8A1RL | 0.6.7 | 0.6.10 | pending (from 2026-05-13)

# Clear the stale record
PGPASSWORD='...' psql ... -c \
  "UPDATE agent_updates SET status='failed', error_message='stale: agent updated via MSI, record never closed', completed_at=now() WHERE id='f1e243df-73fd-48c4-9f33-62a00211d5e8';"

# Verify 0.6.18 metrics with temperatures in DB
PGPASSWORD='...' psql ... -c \
  "SELECT timestamp, cpu_percent, cpu_temp_celsius, gpu_temp_celsius FROM metrics WHERE agent_id='c043d9ac-4020-4cab-a5f4-b90213d11e73' AND timestamp > '2026-05-14 16:35:00' ORDER BY timestamp DESC LIMIT 3;"
# Result: 2026-05-14 16:36:10 | 6.42 | 44 | 43.5 — confirmed non-NULL
```
---

## Pending / Incomplete Tasks

- **0.6.13 agents with dead WS write half** — agents c778b6a3, fa99e913, cd086074 still unresolved
- **BB-SERVER enrollment loop** — duplicate key on `idx_agents_site_device` still unresolved
- **Stale pending update records from April 19** — ~15 records for 0.6.1→0.6.2, need bulk cleanup
- **Policy wiring plan (ticklish-questing-stallman.md)** — full policy propagation; deferred
- **Build lock to prevent concurrent invocations** — flock or similar on build-agents.sh
---

## Reference Information

- GuruRMM Gitea repo: http://172.16.3.20:3000/azcomputerguru/gururmm
- Dashboard: https://rmm.azcomputerguru.com
- Agent downloads: https://rmm-api.azcomputerguru.com/downloads/
- 0.6.18 MSI: https://rmm-api.azcomputerguru.com/downloads/gururmm-agent-base-0.6.18.msi
- Policy wiring plan: `C:\Users\guru\.claude\plans\ticklish-questing-stallman.md`
- DESKTOP-0O8A1RL agent_id: `c043d9ac-4020-4cab-a5f4-b90213d11e73`
- System default policy id: `2bbd91d8-0920-4565-b8fe-658b81ab7d08`
- Cleared stale update record id: `f1e243df-73fd-48c4-9f33-62a00211d5e8`
- Successful 0.6.18 update_id: `0d30f404-4cee-4266-bd93-4d69aa22e4c3`
- Build script fix commit: `88db2b1` (gururmm repo)
- 0.6.18 session log commit (local dev clone): `51c651f`
---

## Update: 18:00–20:00 PT — 0.6.19 build fixes, roadmap, dead reference sweep, watchdog policy cleanup

### Session Summary

This continuation session fixed three Windows compile errors blocking the 0.6.19 Pluto build, updated the feature roadmap with bulk actions, performed a dead reference sweep across ClaudeTools, and removed the watchdog enabled toggle from the policy system.

The 0.6.19 build had been triggered at the end of the prior context, but the Windows (Pluto) build failed with three errors in `agent/src/watchdog/wts.rs`. First, `SetHandleInformation` required `HANDLE_FLAGS(0)` for its third argument instead of bare `0`, and `HANDLE_FLAGS` also had to be added to the imports. Second, `ReadFile` at `windows::Win32::Storage::FileSystem` requires the `Win32_System_IO` feature flag, which was added to Cargo.toml. Third, `HANDLE(*mut c_void)` is not `Send` and cannot move into `thread::spawn` closures; this was fixed by extracting the handle as `usize` before spawning and reconstructing it inside each reader thread. Three fix commits, three re-triggered builds. The third build completed successfully in ~367s; all artifacts were signed and deployed.

A user observation about wanting bulk actions across the UI led to updating `docs/FEATURE_ROADMAP.md`, which first required finding it (CLAUDE.md had the wrong path `ROADMAP.md`). A full dead reference sweep of ClaudeTools found one systematic dead reference: `projects/claudetools-api/` appearing in FILE_PLACEMENT_GUIDE.md and CONTEXT.md, corrected to the actual root-level `api/` and `migrations/` directories. The CLAUDE.md roadmap path was also corrected.

The watchdog `enabled` field was removed from the entire policy stack. The watchdog is the agent's reliability mechanism; making it policy-toggleable would allow it to be accidentally or deliberately disabled, leaving agents unrecoverable. The field was stripped from 9 files across agent, server, and dashboard. Cross-session messaging was tested by sending messages to Howard, which revealed that his hook only queried the full session ID while messages were addressed to the short alias. Howard fixed it (commit 0352595).
### Key Decisions

- **Three separate fix commits** — each Pluto build takes ~6 minutes; iterating one fix at a time gave clear per-error feedback rather than risking a multi-fix commit that might hide secondary failures.
- **`usize` as cross-thread HANDLE carrier** — standard pattern for Windows HANDLEs across thread boundaries; `*mut c_void` is not `Send`, `usize` is.
- **`Win32_System_IO` feature addition** — compiler error "found item that was configured out" means the path is correct but gated; adding the feature is cleaner than relocating the import.
- **Watchdog enabled removed at all layers** — stripping from DB schema, merge, wire format, and agent ensures stale `"enabled": false` JSON in existing policy records has no effect on deserialization.
### Problems Encountered

- **`SetHandleInformation` type mismatch** — third arg is `HANDLE_FLAGS` newtype, not bare integer. Fix: `HANDLE_FLAGS(0)` + import.
- **`ReadFile` gated behind `Win32_System_IO`** — item exists at the declared path but requires feature. Fix: add feature to Cargo.toml.
- **`HANDLE` not `Send`** — `*mut c_void` cannot cross thread boundary. Fix: `usize` carrier, reconstruct as `HANDLE(n as *mut core::ffi::c_void)` inside closure.
- **Roadmap path wrong in CLAUDE.md** — referenced `ROADMAP.md` at repo root; actual file is `docs/FEATURE_ROADMAP.md`.
- **`projects/claudetools-api/` doesn't exist** — FILE_PLACEMENT_GUIDE.md and CONTEXT.md referenced a nonexistent directory. API code is at root `api/`/`migrations/`.
- **Cross-session messages silently skipped** — Howard's hook queried `to_session=HOWARD-HOME/claude-main` only; messages sent to `howard` alias were dropped. Howard's fix (0352595) queries both.
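The `usize` carrier fix for the non-`Send` `HANDLE` can be sketched without the `windows` crate. In the minimal example below, `Handle` is a hypothetical stand-in for `windows::Win32::Foundation::HANDLE`, and `read_on_thread` is an illustrative name; a real reader thread would call `ReadFile` on the reconstructed handle rather than returning its address. This is a sketch of the pattern, not the code in `wts.rs`.

```rust
use std::ffi::c_void;
use std::thread;

/// Hypothetical stand-in for the Windows `HANDLE` newtype. Because it
/// wraps a raw pointer, it is not `Send` and cannot be moved into a
/// `thread::spawn` closure directly.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Handle(pub *mut c_void);

/// Carries the handle across the thread boundary as a plain integer and
/// rebuilds it inside the spawned thread.
pub fn read_on_thread(handle: Handle) -> usize {
    // `usize` is `Send`, so only this integer is captured by the closure.
    let raw = handle.0 as usize;

    let worker = thread::spawn(move || {
        // Reconstruct the handle inside the thread that will use it.
        let local = Handle(raw as *mut c_void);
        // A real reader would pass `local` to `ReadFile` here; this
        // sketch just returns the address to show it survived the move.
        local.0 as usize
    });

    worker.join().expect("reader thread panicked")
}
```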
### Configuration Changes

**GuruRMM repo (`azcomputerguru/gururmm`)**

| File | Change |
|------|--------|
| `agent/Cargo.toml` | Added `Win32_System_IO` to windows crate features |
| `agent/src/watchdog/wts.rs` | `HANDLE_FLAGS` import; `SetHandleInformation` 3rd arg fixed; HANDLE cast to `usize` for thread safety |
| `agent/src/transport/mod.rs` | Removed `enabled: Option<bool>` from `WatchdogConfigUpdate` |
| `agent/src/watchdog/monitor.rs` | Removed `enabled` from `WatchdogRuntimeConfig`, poll gate, and `UpdateConfig` handler |
| `server/src/db/policies.rs` | Removed `enabled: Option<bool>` from `WatchdogConfig` |
| `server/src/policy/merge.rs` | Removed `enabled` from watchdog merge and defaults |
| `server/src/policy/effective.rs` | Assert changed to `check_interval_seconds.is_some()` |
| `server/src/policy/config_update.rs` | Removed `enabled` from `AgentWatchdogConfig` and mapping |
| `dashboard/src/api/client.ts` | Removed `enabled?: boolean` from watchdog policy interface |
| `dashboard/src/pages/Policies.tsx` | Removed all `watchdog_enabled` references; stripped outer PolicyRadio toggle from renderWatchdog |
| `dashboard/src/pages/AgentDetail.tsx` | Removed Enabled EffRow from watchdog display |
| `docs/FEATURE_ROADMAP.md` | Added bulk actions feature (34 lines) |

**ClaudeTools repo (`azcomputerguru/claudetools`)**

| File | Change |
|------|--------|
| `.claude/CLAUDE.md` | Roadmap path corrected to `docs/FEATURE_ROADMAP.md` |
| `.claude/FILE_PLACEMENT_GUIDE.md` | Removed `projects/claudetools-api/` references |
| `CONTEXT.md` | Tree diagram updated — `claudetools-api/` replaced with note about root `api/`/`migrations/` |

### Pending / Incomplete Tasks

- **0.6.13 agents with dead WS write half** — c778b6a3, fa99e913, cd086074 still unresolved
- **BB-SERVER enrollment loop** — duplicate key on `idx_agents_site_device` still unresolved
- **Safesite Glendale MSI machine** — waiting for user to be away; DisplayLink + NVIDIA TDR; plan to push driver update
- **LHM bundling in MSI** — LHM files not yet in build pipeline; self-healing download not implemented
- **Policy wiring plan** — `ticklish-questing-stallman.md`; deferred
- **Build lock** — flock on build-agents.sh to prevent concurrent runs
### Reference Information

- 0.6.19 build fix commits: `3ee988d` (HANDLE_FLAGS + ReadFile feature), `a683473` (HANDLE usize cast)
- 0.6.19 feature commit: `4493c3d`
- Watchdog cleanup commit: `d4048f2`
- Bulk actions / roadmap commit: `2d362e2` (gururmm), `6515003` (claudetools dead-ref fixes)
- Howard hook fix commit: `0352595` (gururmm, Howard's machine)
- 0.6.19 artifacts: `/var/www/gururmm/downloads/gururmm-agent-base-0.6.19.msi`
- Build log: `/tmp/build-0.6.19-v3.log` on 172.16.3.30