Files
claudetools/session-logs/2026-05-28-session.md
Mike Swanson b0695ab3a0 sync: auto-sync from GURU-5070 at 2026-05-28 07:46:44
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-05-28 07:46:44
2026-05-28 07:46:49 -07:00

133 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Session Log — 2026-05-28
## User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin
---
## Session Summary
Session opened with two unread coord messages from Howard (Howard-Home/claude-main): SPEC-010 (Agent UX Improvements & Bug Fixes, 6 items) and SPEC-011 (ARP Programs and Features registration). Both messages were marked as read. Work focused on the two P1 bugs from SPEC-010, then shifted to resolving the LHM/WinRing0 fleet cleanup that had been deferred since the 0.6.46 build.
**BUG-013** (`agent/src/metrics/mod.rs:538`): `logged_in_username()` was using `sysinfo::Users::iter().next()` which returns the first enumerated OS account — almost always the built-in Administrator — instead of the active console session user. Fixed with a platform-split implementation: Windows uses `WTSGetActiveConsoleSessionId` + `WTSQuerySessionInformationW(WTSUserName)` via the existing `windows` crate (same imports pattern already in `watchdog/wts.rs`); non-Windows uses `max_by_key(process_count)` as a better heuristic than `.next()`. **BUG-014** (`dashboard/src/pages/SiteDetail.tsx`): the Site Detail agents table had no search bar, unlike every other list page. Fixed by adding `agentSearch` state, a `filteredSiteAgents` derived array, a search `<Input>` above the table, and a "No agents match your search." empty state. Both committed together as `94234af` and pushed to gururmm main.
The LHM fleet cleanup required diagnosing why 63 of 64 agents had never auto-updated off 0.6.39. Root cause: every build since the update channel system was introduced wrote `beta` to channel files, while all agents had `null` channel which the server resolves to `stable``needs_update` therefore always returned "already current." Fixed by: (1) promoting 0.6.47 to stable via the API (5 Windows channel files updated), (2) promoting 0.6.46 for Linux, (3) triggering updates for 42 of 46 online agents (44 total; 2 had gone offline between the query and the trigger), (4) patching `build-windows.sh` and `build-linux.sh` on the server to write `stable` instead of `beta` going forward. The stray `n#` artifact on build-linux.sh line 54 was also corrected in the same pass.
To complete the LHM cleanup, `sc.exe stop WinRing0_1_2_0 & sc.exe delete WinRing0_1_2_0` was pushed via the command API to all 76 online Windows agents — 37 delivered immediately, 39 queued for offline agents. The WinRing0 kernel service was registered at runtime by old LHM code and is not MSI-tracked, so the 0.6.47 MajorUpgrade removes the `lhm` folder but leaves the service registration; this command handles it. Todo 42c08298 was closed.
Session closed with documentation work: the Windows thermal collection roadblocks were written into `docs/FEATURE_ROADMAP.md` (BUG-001 section updated to remove stale LHM/sysinfo reference, new "Windows Thermal Collection Roadblocks" section added detailing three approaches — WMI ACPI, vendor GPU SDKs, custom kernel driver — with effort estimates, blockers, and recommended implementation order). Committed as `d4a5c13`.
---
## Key Decisions
- **WTS API over sysinfo for Windows logged-in user:** `sysinfo::Users` enumerates all local/domain accounts non-deterministically; WTS console session query is the only reliable way to get the interactive user. Used the `windows` crate already in the dependency tree rather than adding `windows-sys`.
- **`#[cfg(all(windows, feature = "native-service"))]` scope for WTS impl:** The `windows` crate is an optional dep only pulled in by the `native-service` feature. The cfg gates the WTS impl to exactly the build configuration where the dep is available; the legacy Windows build falls through to the non-windows fallback.
- **Promoted 0.6.47 not 0.6.46 for Windows:** 0.6.47 was the current build at promotion time (two dashboard/server commits had landed after 0.6.46). Linux remained at 0.6.46 (no newer Linux binary had been produced yet).
- **`&` not `&&` for sc.exe chain:** `sc.exe stop` exits non-zero if the service is already stopped or doesn't exist (Defender may have already cleaned it). Using `&` ensures `sc.exe delete` runs regardless of stop's exit code.
- **Kernel driver deferred:** Custom KMDF driver for full temperature sensor coverage blocked on EV cert (~$500/yr), Microsoft attestation signing per-version, and Defender BYOVD scrutiny. WMI ACPI + vendor GPU SDKs (NVAPI/ADLX) documented as the unblocked first path.
- **Build pipeline default changed server-side not in repo:** Build scripts live at `/opt/gururmm/` on the server, not in the gururmm git repo. Changes were made directly on the server via ssh+sudo.
---
## Problems Encountered
- **`sed -i` permission denied in `/opt/gururmm/`:** `sed -i` creates a temp file in the same directory; the `guru` user lacks write permission there. Resolved by editing copies in `/tmp` then `sudo cp` back.
- **Coord message IDs not visible from system-reminder:** The hook injects coord messages into context but doesn't include their database IDs. Had to query the API to retrieve IDs before marking as read. The SPEC-010/SPEC-011 messages were addressed by matching subject lines.
- **Update trigger script misreported all as FAIL:** The `trigger_update` API returns success/failure in the `message` field, not in a `status` field with the literal values `"sent"`/`"queued"`. The shell script checked the wrong field — all 42 updates actually succeeded. Similarly, the sc.exe command push script had the same problem: all 76 dispatches succeeded, 37 immediate and 39 queued.
- **Fleet frozen — unexpected agent count on 0.6.47:** Only 1 of 64 agents was on 0.6.47 before the channel fix, despite 0.6.47 having been built. Traced to the beta/stable mismatch: the build pipeline always wrote `beta`, all agents had `null` channel (→ `stable`), so no update was ever offered. Root cause was in the build scripts, not the server or agents.
---
## Configuration Changes
- `projects/msp-tools/guru-rmm/agent/src/metrics/mod.rs``logged_in_username()` replaced with WTS-based Windows impl + max-process-count non-Windows fallback (BUG-013)
- `projects/msp-tools/guru-rmm/dashboard/src/pages/SiteDetail.tsx` — added `agentSearch` state, `filteredSiteAgents`, search Input, no-match empty state (BUG-014)
- `projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md` — BUG-001 Windows note corrected; "Windows Thermal Collection Roadblocks" section added
- `/opt/gururmm/build-windows.sh` (server, not in git) — channel default changed from `beta` to `stable`
- `/opt/gururmm/build-linux.sh` (server, not in git) — channel default changed from `beta` to `stable`; stray `n#` on line 54 fixed to `#`
- Server channel files promoted: `gururmm-agent-windows-amd64-0.6.47.exe.channel`, `gururmm-agent-windows-x86-0.6.47.exe.channel`, `gururmm-agent-windows-legacy-amd64-0.6.47.exe.channel`, `gururmm-agent-windows-legacy-x86-0.6.47.exe.channel`, `gururmm-agent-base-0.6.47.msi.channel`, `gururmm-agent-linux-amd64-0.6.46.channel` — all changed from `beta` to `stable`
---
## Credentials & Secrets
- GuruRMM API admin: `claude-api@azcomputerguru.com` / `ClaudeAPI2026!@#` (vaulted at `infrastructure/gururmm-server.sops.yaml``credentials.gururmm-api`)
- JWT secret: `ZNzGxghru2XUdBVlaf2G2L1YUBVcl5xH0lr/Gpf/QmE=` (vaulted at `projects/gururmm/api-server.sops.yaml`)
---
## Infrastructure & Servers
- GuruRMM server: `172.16.3.30:3001` (Rust/Axum API), `172.16.3.30:80/443` (nginx dashboard)
- Build scripts: `/opt/gururmm/build-windows.sh`, `/opt/gururmm/build-linux.sh` on `172.16.3.30`
- Downloads dir: `/var/www/gururmm/downloads/` on `172.16.3.30`
- Gitea (internal): `http://172.16.3.20:3000/azcomputerguru/gururmm`
- Coord API: `http://172.16.3.30:8001/api/coord`
---
## Commands & Outputs
```bash
# Promote 0.6.47 to stable (Windows)
POST http://172.16.3.30:3001/api/updates/rollouts/0.6.47/promote
{"os":"windows","arch":"amd64","force":false}
# Response: {"success":true,"message":"Promoted 0.6.47 to stable channel (5 files updated)","files_updated":5}
# Promote 0.6.46 to stable (Linux)
POST http://172.16.3.30:3001/api/updates/rollouts/0.6.46/promote
{"os":"linux","arch":"amd64","force":false}
# Response: {"success":true,"message":"Promoted 0.6.46 to stable channel (2 files updated)","files_updated":2}
# Trigger fleet update (per-agent loop, 42 of 46 online Windows agents succeeded)
POST http://172.16.3.30:3001/api/agents/<id>/update
# WinRing0 cleanup command pushed to all 76 Windows agents
POST http://172.16.3.30:3001/api/agents/<id>/command
{"command_type":"shell","command":"sc.exe stop WinRing0_1_2_0 & sc.exe delete WinRing0_1_2_0","timeout_seconds":30}
# 37 delivered immediately, 39 queued for offline agents
# Fix build pipeline channel default (on 172.16.3.30)
cp /opt/gururmm/build-windows.sh /tmp/build-windows.sh
sed -i -e 's/echo "beta"/echo "stable"/' ... /tmp/build-windows.sh
sudo cp /tmp/build-windows.sh /opt/gururmm/build-windows.sh
# Same for build-linux.sh
```
**Fleet state before this session:** 64 agents; 1 on 0.6.47, 38 on 0.6.39, remainder on 0.6.20.6.43. All on `null` channel (→ stable). All builds beta. Net: zero auto-updates ever delivered.
**Fleet state after:** 0.6.47 stable promoted. 42 updates in flight. 18 offline agents will update on reconnect. WinRing0 service deletion queued on all 76 Windows agents.
---
## Pending / Incomplete Tasks
- **18 offline agents** have updates queued (0.6.39→0.6.47) but haven't reconnected yet. Will apply automatically on reconnect.
- **39 offline Windows agents** have sc.exe WinRing0 cleanup queued. Will run on reconnect.
- **macOS agents** (0.6.41) have no newer stable build available yet. Next macOS build will promote to stable automatically with the pipeline fix.
- **SPEC-010 items D, E, C, F** remain unimplemented (logged-in user on agent cards, alert badges, process kill, inline notes). BUG-013+014 prerequisite done.
- **SPEC-011** (ARP Programs and Features registration in installer) — P2, installer-only, not started.
- **BUG-001 Windows thermal collection** — WMI ACPI + NVAPI paths documented as unblocked; implementation deferred.
- **build-linux.sh duplicate "MARK AS STABLE CHANNEL" block** (lines 55-72 appear twice) — pre-existing issue, noted in audit todo `54239760`, not addressed this session.
- **Coord message IDs for SPEC-010/SPEC-011** were from Howard-Home/claude-main `ALL_SESSIONS` broadcasts — marked read on GURU-5070 session ID but the messages were broadcast so may appear unread in other sessions.
---
## Reference Information
- gururmm commits this session:
- `94234af` — fix(agent,dashboard): fix logged-in user detection and add site agent search (BUG-013 + BUG-014)
- `d4a5c13` — docs(roadmap): document Windows thermal collection roadblocks
- Coord todos closed: `42c08298` (WinRing0 kernel-driver cleanup)
- Coord todos referenced: `bde31c52` (LHM headless temp fix — now superseded by LHM removal), `54239760` (Phase 3 audit remediation)
- Coord messages marked read: `7bdc6d3c` (SPEC-011), `3fe667e1` (SPEC-010)
- SPEC files: `docs/specs/SPEC-010-agent-ux-improvements.md`, `docs/specs/SPEC-011-arp-programs-features-registration.md`
- Fleet update API: `POST http://172.16.3.30:3001/api/agents/:id/update`
- Command dispatch API: `POST http://172.16.3.30:3001/api/agents/:id/command`
- Channel promotion API: `POST http://172.16.3.30:3001/api/updates/rollouts/:version/promote`
- Downloads dir channel files: `/var/www/gururmm/downloads/*.channel`
- Build scripts (server-local, not in git): `/opt/gururmm/build-linux.sh`, `/opt/gururmm/build-windows.sh`