# GuruRMM Session — 2026-05-13
## User
- **User:** Mike Swanson (mike)
- **Machine:** DESKTOP-0O8A1RL
- **Role:** admin
- **Session span:** ~06:00–13:00 local (approx)
---
## Session Summary
The session started with a `/sync` (clean, no changes) followed by a request to diagnose an RMM
update failure on DESKTOP-0O8A1RL. Diagnosis proceeded by inspecting the running GuruRMMAgent
service, the install directory, the ProgramData log files, and the agent source code in the
local dev clone (`projects/msp-tools/guru-rmm/`).
Initial inspection found the agent binary reporting version 0.6.4 (via `--version`) while the
registry `HKLM\SOFTWARE\GuruRMM\Version` still read 0.6.2. The `.old` backup file from a prior
update remained uncleaned in the install directory, and a `gururmm-agent.backup` file persisted
in ProgramData. Logs revealed two back-to-back updates had occurred overnight (0.6.2→0.6.3 at
00:44 UTC, 0.6.3→0.6.4 at 03:04 UTC) with restart gaps of 65 minutes and 3 hours 27 minutes
respectively — far longer than any deliberate delay.
Deep reading of `updater/mod.rs`, `watchdog/monitor.rs`, `watchdog/pipe.rs`,
`transport/websocket.rs`, and `server/src/ws/mod.rs` produced a full causal chain: the Windows
service exits with code 0 after binary replacement (bypassing SCM recovery), the detached cmd
restart helper is likely killed by the Windows job object, the `GuruRMMWatchdog` service is not
installed on this machine (so the IPC path always fails), the rollback PS1 uses `Get-Service`
which silently returns null for `GuruRMMAgent`, the watchdog reads a non-existent `agent.toml`
for the server URL, and post-update cleanup (`cancel_rollback_watchdog`, `cleanup_backup`) is
never triggered because the server sends `AuthAck` before computing the update confirmation.
Nine bugs were filed as tracked tasks (#1–#9). A design discussion followed about whether to
fix the current system (Option A) or replace it with MSI-based updates. The consensus was
Option A with an architectural direction that the watchdog will eventually take over as the
primary updater, with the main agent retaining self-update as permanent fallback. An approved
plan was written, delegated to the Coding Agent, and all changes were implemented across 7 files.
A Code Review Agent was launched (in background at session end — result pending).
---
## Key Decisions
- **Option A (fix 9 bugs) over MSI-based updates** — MSI approach is cleaner long-term but
requires significant build pipeline changes. Option A ships in one sprint. Decision to revisit
MSI for Windows updates in a future phase.
- **Watchdog owns stop+replace+start; agent owns download+verify** — When the watchdog
eventually implements `PerformUpdate`, the IPC carries a local staged path, not a URL. This
keeps the watchdog free of HTTP client logic and avoids duplicating the agent's download+
checksum machinery.
- **Main agent retains full self-update as permanent fallback** — Even with the watchdog as
primary updater, the agent falls through to its own update logic if the IPC fails. Belt and
suspenders.
- **Exit(1) not exit(0) as SCM fallback** — When both IPC paths (PerformUpdate and
RestartMainService) fail, the agent now exits with code 1 so SCM recovery fires within 10s.
The IPC-success path still exits 0 (watchdog owns restart in that case).
- **SCM recovery delay changed 60s → 10s** — The original 60s delay was unnecessarily slow
for the fallback case. 10s is aggressive enough to be useful without thrashing.
- **PerformUpdate IPC variant added now (returns not-implemented)** — Adding the command to the
protocol now means no agent-side protocol change is needed when the watchdog implements it.
The interface is locked; the implementation is deferred.
- **`complete_update_by_agent()` was already written but never called** — The DB function
existed in `server/src/db/updates.rs` exactly matching the needed signature. The fix was
purely a wiring change in `server/src/ws/mod.rs`, not new code.
- **Watchdog server URL: compile-time constant, not registry** — The watchdog is the same
binary, compiled with the same `GURURMM_SERVER_URL` env var. Reading a TOML file was
architecturally wrong and the file didn't exist anyway. The fix uses `option_env!` directly.
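
The IPC and exit-code decisions above can be sketched roughly as follows. This is a minimal, std-only illustration rather than the code in `watchdog/pipe.rs` or `updater/mod.rs`: the `WatchdogReply` struct, the `handle_command` and `exit_code_after_replace` function names, and the reply messages are assumptions made for the example. Only the variant names, the `staged_path` model, the `ok:false` reply, and the exit-code rule come from the notes.

```rust
use std::path::PathBuf;

/// Commands the agent can send to the watchdog over the IPC pipe.
/// Variant names follow the session notes; the field layout is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub enum WatchdogCommand {
    RestartMainService,
    /// Carries a local, already downloaded and verified binary, never a URL.
    PerformUpdate { staged_path: PathBuf },
}

/// Hypothetical reply shape; the notes only say the pipe server sends `ok:false`.
#[derive(Debug, Clone, PartialEq)]
pub struct WatchdogReply {
    pub ok: bool,
    pub message: String,
}

/// Watchdog side: `PerformUpdate` is part of the protocol but not implemented yet,
/// so it always answers `ok: false`. The restart arm is a stub for illustration.
pub fn handle_command(cmd: &WatchdogCommand) -> WatchdogReply {
    match cmd {
        WatchdogCommand::RestartMainService => WatchdogReply {
            ok: true,
            message: "restart scheduled".into(),
        },
        WatchdogCommand::PerformUpdate { staged_path } => WatchdogReply {
            ok: false,
            message: format!(
                "PerformUpdate not implemented (staged: {})",
                staged_path.display()
            ),
        },
    }
}

/// Agent side: choose the process exit code after the binary has been replaced.
/// Exit 0 when the watchdog owns the restart; exit 1 so SCM recovery fires otherwise.
pub fn exit_code_after_replace(perform_update_ok: bool, restart_ok: bool) -> i32 {
    if perform_update_ok || restart_ok {
        0
    } else {
        1
    }
}
```

With the watchdog absent or still returning `ok:false` for both requests, `exit_code_after_replace(false, false)` yields 1, which is the case the SCM recovery setting is meant to catch.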
---
## Problems Encountered
- **`Get-Service -Name "GuruRMMAgent"` returns nothing** — PowerShell silently fails to
enumerate the service even with the exact name; `sc.exe queryex` finds it fine. Root cause
unclear (likely a non-elevated session permission issue). Resolution: all PS1-based service
checks in the codebase replaced with `sc.exe query` equivalents.
- **Post-update cleanup path never reaches `cleanup_backup()`** — The server was sending
`AuthAck` before running the post-update check, so the agent never received any signal to
clean up. Resolution: compute `update_confirmed` before building AuthAck; include it in the
ack; agent acts on it in the AuthAck handler.
- **Detached cmd restart killed by job object** — Windows service processes are often placed
in a job object with `JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`. When the service exits, child
processes (the detached cmd) are killed before `sc.exe start` runs. Resolution: added
`CREATE_BREAKAWAY_FROM_JOB` (0x01000000) flag alongside `CREATE_NO_WINDOW`; combined value
0x09000000. An `exit(1)` fallback was also added so SCM recovery fires regardless.
- **Code review still in progress at /save time** — Code Review Agent was launched as a
background task. Result pending. Changes should not be deployed until review is clean.
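
As a quick check on the creation-flag arithmetic in the job-object item, here is a small sketch. The two flag values are the documented Win32 constants; the `RESTART_HELPER_FLAGS` name, the `spawn_restart_helper` function, and its command line are hypothetical and are not taken from `updater/mod.rs`.

```rust
/// Win32 process creation flags (values from the Windows SDK headers).
pub const CREATE_BREAKAWAY_FROM_JOB: u32 = 0x0100_0000;
pub const CREATE_NO_WINDOW: u32 = 0x0800_0000;

/// Flags for the detached restart helper: no console window, and not a
/// member of the service's job object.
pub const RESTART_HELPER_FLAGS: u32 = CREATE_BREAKAWAY_FROM_JOB | CREATE_NO_WINDOW;

/// Spawn a detached helper that starts the service again.
/// The command line is hypothetical; only the flag handling reflects the notes.
#[cfg(windows)]
pub fn spawn_restart_helper(service: &str) -> std::io::Result<std::process::Child> {
    use std::os::windows::process::CommandExt;
    std::process::Command::new("cmd")
        .args(["/C", "sc.exe", "start", service])
        .creation_flags(RESTART_HELPER_FLAGS)
        .spawn()
}
```

`CREATE_BREAKAWAY_FROM_JOB` only succeeds when the job itself allows breakaway (`JOB_OBJECT_LIMIT_BREAKAWAY_OK`); otherwise process creation fails with access denied. That is one more reason to keep the `exit(1)` fallback rather than rely on the helper alone.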
---
## Configuration Changes
### Files modified
- `agent/src/registry.rs` — added `write_version()`, `write_server_url()`, `read_server_url()` with full platform stubs
- `agent/src/service.rs` — added registry version+URL write on startup; SCM recovery delay 60s→10s; watchdog co-installation in `install_service()`
- `agent/src/updater/mod.rs` — PerformUpdate IPC attempt (step 2.5 in `do_update`); `0x09000000` creation flags; `exit(1)` final fallback; `sc.exe` in PS1 rollback template replacing `Get-Service`
- `agent/src/watchdog/pipe.rs` — `PerformUpdate` variant added to `WatchdogCommand` enum (staged_path model)
- `agent/src/watchdog/monitor.rs` — `read_server_api_url()` replaced with `server_api_url()` using compile-time constant; `PerformUpdate` match arm added (logs + no-op, pipe server sends ok:false)
- `agent/src/transport/mod.rs` — `update_confirmed: Option<Uuid>` with `#[serde(default)]` added to `AuthAckPayload`
- `agent/src/transport/websocket.rs` — cleanup block in `AuthAck` handler: calls `cancel_rollback_watchdog()` + `cleanup_backup()` when `update_confirmed` is `Some`
- `server/src/ws/mod.rs` — `update_confirmed: Option<Uuid>` added to server-side `AuthAckPayload`; `complete_update_by_agent()` wired up before `AuthAck` is sent; old post-ack update check block removed
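
The ordering change around `AuthAck` can be illustrated with a std-only sketch. It is not the real payload: the actual field is `Option<Uuid>` with `#[serde(default)]`, while this example uses `Option<String>` to avoid external crates. `build_auth_ack` and `cleanup_steps` are hypothetical helper names, and the closure stands in for `complete_update_by_agent()`.

```rust
/// Simplified stand-in for `AuthAckPayload`. `Default` gives `None`, which mirrors
/// what `#[serde(default)]` yields when an older server omits the field.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct AuthAckPayload {
    pub update_confirmed: Option<String>,
}

/// Server side: resolve the pending update first, then build the ack,
/// so the confirmation travels inside the ack itself.
pub fn build_auth_ack(complete_update: impl FnOnce() -> Option<String>) -> AuthAckPayload {
    let update_confirmed = complete_update();
    AuthAckPayload { update_confirmed }
}

/// Agent side: the cleanup steps to run for this ack, in order.
/// An ack without a confirmation triggers no cleanup.
pub fn cleanup_steps(ack: &AuthAckPayload) -> Vec<&'static str> {
    match ack.update_confirmed {
        Some(_) => vec!["cancel_rollback_watchdog", "cleanup_backup"],
        None => Vec::new(),
    }
}
```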
---
## Credentials & Secrets
None created or discovered this session.
---
## Infrastructure & Servers
- **GuruRMM server:** 172.16.3.30:3001 (Rust/Axum)
- **Build pipeline:** Gitea push → webhook-handler.py (172.16.3.30:9000) → build-agents.sh → Pluto (172.16.3.36) for Windows MSI
- **Agent on DESKTOP-0O8A1RL:**
  - Binary: `C:\Program Files\GuruRMM\gururmm-agent.exe` (4,452,648 bytes — v0.6.4)
  - Registry: `HKLM\SOFTWARE\GuruRMM` — SiteId: `d008c7d4-9e5e-4666-9fa0-b432609d54cc`, AgentKey: `agk_ybg4Ty6zXU_2Ee0ddlUUtuZdz0B9Qw4_`
  - Logs: `C:\ProgramData\GuruRMM\agent.log.YYYY-MM-DD`
  - Backup: `C:\ProgramData\GuruRMM\gururmm-agent.backup` (0.6.3 binary, 4,303,656 bytes — not yet cleaned up pending code review + deploy)
  - Service: `GuruRMMAgent` (PID 4856 at session start) — `GuruRMMWatchdog` NOT YET INSTALLED
---
## Commands & Outputs
```powershell
# Service query (use sc.exe, not Get-Service — Get-Service silently fails for GuruRMMAgent)
sc.exe queryex "GuruRMMAgent"
# → STATE: 4 RUNNING, PID: 4856
# Registry state at session start
Get-ItemProperty "HKLM:\SOFTWARE\GuruRMM"
# → Version: 0.6.2 (stale — binary was actually 0.6.4)
# → SiteId: d008c7d4-9e5e-4666-9fa0-b432609d54cc
# → AgentKey: agk_ybg4Ty6zXU_2Ee0ddlUUtuZdz0B9Qw4_
# SCM recovery (was 60s, changed to 10s in code)
sc.exe qfailure "GuruRMMAgent"
# → FAILURE_ACTIONS: RESTART/60000/RESTART/60000/RESTART/60000 (old config)
# Watchdog not installed
sc.exe queryex "GuruRMMWatchdog"
# → [SC] OpenService FAILED 1060 (service does not exist)
# Log entries confirming the two overnight updates
# 2026-05-13T00:44:24Z 0.6.2 → 0.6.3 update started; restart at 00:44:25; agent back at 01:49:47 (65 min gap)
# 2026-05-13T03:04:28Z 0.6.3 → 0.6.4 update started; restart at 03:04:29; agent back at 06:31:27 (3h27m gap)
```
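
The restart gaps quoted in the log comments can be reproduced with a little arithmetic. The helper names below are made up for this worked example; the timestamps are the ones from the log entries above.

```rust
/// Parse "HH:MM:SS" into seconds since midnight.
fn parse_hms(s: &str) -> Option<u32> {
    let mut parts = s.split(':').map(|p| p.parse::<u32>().ok());
    let (h, m, sec) = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() || h > 23 || m > 59 || sec > 59 {
        return None;
    }
    Some(h * 3600 + m * 60 + sec)
}

/// Seconds between two same-day timestamps; `None` if malformed or out of order.
pub fn gap_seconds(restart: &str, back: &str) -> Option<u32> {
    parse_hms(back)?.checked_sub(parse_hms(restart)?)
}

/// Render a gap as "XhYYmZZs".
pub fn format_gap(secs: u32) -> String {
    format!("{}h{:02}m{:02}s", secs / 3600, (secs % 3600) / 60, secs % 60)
}
```

For the first update, `gap_seconds("00:44:25", "01:49:47")` gives 3922 seconds (`1h05m22s`, the 65-minute gap). For the second, `gap_seconds("03:04:29", "06:31:27")` gives 12418 seconds (`3h26m58s`, rounded to 3h27m in the notes).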
---
## Pending / Incomplete Tasks
- **Code Review Agent result pending** — launched as background task at session end. Do NOT
build or deploy to the server until the review is clean. If issues are found, address them
in a follow-up session.
- **Build and deploy** — once code review is clean:
  1. Push to Gitea `gururmm` repo (triggers build pipeline)
  2. Server binary (v0.6.5 per coord API, `Pending build+deploy`) needs to be deployed first
  3. After server deploy, agents will receive update push to new agent version
  4. Verify on DESKTOP-0O8A1RL: registry version updated, backup cleaned, GuruRMMWatchdog installed
- **GuruRMMWatchdog not installed on existing endpoints** — the co-install logic in
`service.rs` triggers during the install flow, which existing enrolled agents won't re-run
automatically. Options: (a) add a one-time watchdog install step to the update post-restart
sequence, (b) run a remediation script via the dashboard command channel, or (c) accept that
watchdog deploys on next MSI re-run.
- **9 bug tasks (#1–#9) need status updates** — tasks filed in the task system but not yet
marked complete; should be updated once code review passes and the build is deployed.
- **Bug #2 (Get-Service returns nothing)** — root cause unresolved. The fix (replacing
Get-Service with sc.exe everywhere in PS1 templates) is in, but the underlying service
visibility issue should be investigated if time permits.
- **Server component state** — `gururmm/server` is at state `built` v0.6.5 per coord API,
pending deploy. `gururmm/agents` at state `built` v0.6.4. Both need deploy after build.
---
## Reference Information
- **Coord API component states:** `GET http://172.16.3.30:8001/api/coord/status`
- **Session plan file:** `C:\Users\guru\.claude\plans\ticklish-questing-stallman.md`
- **Relevant source files:**
  - `agent/src/updater/mod.rs` — full update flow, rollback, restart logic
  - `agent/src/watchdog/monitor.rs` — SCM health monitor, alert posting, server URL
  - `agent/src/watchdog/pipe.rs` — IPC protocol, WatchdogCommand enum
  - `agent/src/transport/websocket.rs` — WebSocket client, AuthAck handler, update trigger
  - `server/src/ws/mod.rs` — server-side WS handler, authenticate(), AuthAck, update dispatch
  - `server/src/db/updates.rs` — `complete_update_by_agent()` (was unhooked, now wired)
- **Key bug notes:**
  - `PerformUpdate` IPC: currently returns `ok:false` (not implemented in watchdog) — agent falls through to self-update. Future: watchdog implements stop+replace binary at `staged_path`+start.
  - `exit(1)` is the final SCM safety net — SCM recovery fires within 10s (after config change deploys)
  - `AuthAckPayload.update_confirmed` is `Option<Uuid>` with `#[serde(default)]` on agent side — backwards compatible with older servers
---
## Update: 14:50 PT — Update channel selection + server v0.3.1 deploy
### Session Summary
This update session began by confirming the v0.6.4 agent build (including all 9 auto-update reliability fixes and `ensure_watchdog_running`) had completed successfully on Pluto at 14:22 UTC. All Windows and Linux agent binaries were present in `/var/www/gururmm/downloads/`.
The main work was implementing the update channel selection feature: a stable/beta hierarchy allowing partners, clients, sites, and individual machines to opt-in to beta releases without affecting production machines. The design approved was DB-column + UI surface. A Coding Agent implemented all server and dashboard changes — migration 026, `resolve_agent_channel()` DB function, channel-aware `get_latest_version()`/`needs_update()` in the scanner, three new `PATCH /api/{agents,sites,clients}/:id/channel` endpoints, `GET /api/agents/:id/effective-channel`, and a `UpdateChannelSelector` React component wired into AgentDetail, SiteDetail, and ClientDetail pages. Server compiled clean (68 warnings, all pre-existing).
Deployment encountered two problems. First, migration 026 was applied manually via psql before the server binary was deployed, leaving the `_sqlx_migrations` table without a tracking row. When the new server binary started, sqlx tried to run migration 026 again and hit "column already exists". Resolution: deleted the bad row, changed the migration to use `ADD COLUMN IF NOT EXISTS`, pushed `c1b8b80`, and rebuilt. Second, the `build-server.sh` script always appends to the shared build log, and the background monitor's grep pattern matched an old "failed to start" line from the first attempt — causing a false "completed" notification. The second rebuild is currently running.
The `build-agents.sh` script was also updated to support a `--channel` flag. When invoked with `--channel beta`, it writes a `.channel` sidecar file (`"beta"`) alongside each binary; stable builds remove any existing sidecar. This is the production mechanism for tagging beta releases. Backup `build-agents.sh.bak-pre-channel` was created before the edit.
### Key Decisions
- **Never manually pre-apply migrations** — sqlx owns migration state through `_sqlx_migrations`. Manually running psql diverges the tracking table from the schema, causing checksum failures on the next startup. Correct procedure: let the server binary apply its own migrations on startup. If pre-applying is necessary (e.g., zero-downtime column add), always make migrations idempotent with `IF NOT EXISTS` from the start.
- **`build-agents.sh --channel beta` writes sidecar, stable builds remove it** — cleanup on stable builds prevents stale `.channel` files from a prior beta build being mistaken for a beta tag on the stable binary.
- **Channel resolution: agent → site → client → "stable"** — three-level inheritance. `resolve_agent_channel()` uses a single JOIN query. Beta channel gets the absolute latest binary (any channel tag); stable gets only binaries tagged "stable" (no sidecar = stable).
- **All new DB queries use `sqlx::query()` not `sqlx::query!()` macros** — avoids needing `cargo sqlx prepare` after each new query. The offline cache (`server/.sqlx/`) only needs updating for compile-time `query!` macros. This is the pattern all new code should follow to simplify builds.
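
The channel inheritance and version-selection rules described above can be modelled in a few lines. This is an in-memory sketch only: the real `resolve_agent_channel()` is a single JOIN query, and the `Channel` enum, the tuple version type, and the function names here are assumptions chosen for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Channel {
    Stable,
    Beta,
}

/// First explicit setting wins: agent, then site, then client, else stable.
pub fn resolve_channel(
    agent: Option<Channel>,
    site: Option<Channel>,
    client: Option<Channel>,
) -> Channel {
    agent.or(site).or(client).unwrap_or(Channel::Stable)
}

/// A published binary; `version` is (major, minor, patch) so tuples order naturally.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AvailableVersion {
    pub version: (u32, u32, u32),
    pub channel: Channel,
}

/// Beta sees every binary; stable sees only binaries tagged stable.
pub fn latest_for(channel: Channel, available: &[AvailableVersion]) -> Option<AvailableVersion> {
    available
        .iter()
        .copied()
        .filter(|v| channel == Channel::Beta || v.channel == Channel::Stable)
        .max_by_key(|v| v.version)
}
```

A site set to beta with no agent-level override resolves to `Channel::Beta`, and an agent with nothing set anywhere falls back to `Channel::Stable`. An explicit agent-level `Stable` also overrides a beta site, which is what lets a single production machine stay pinned.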
### Problems Encountered
- **sqlx migration conflict (column already exists)**: Manually applied migration 026 via psql left the `_sqlx_migrations` table without an entry for version 26. On server startup, sqlx attempted to run migration 026, hit "column already exists". Resolution: deleted the invalid row, updated migration to use `IF NOT EXISTS`, rebuilt.
- **Background monitor false completion**: The `until grep -q 'Server build complete\|failed to start'` polling loop matched an old "failed to start" line from the first server build attempt in the shared log file. The second rebuild was still in progress when the monitor fired. Resolution: used a line-count guard in a new background wait command; second build still in progress at save time.
- **Server restart loop during failed deploy**: The first v0.3.1 binary (with the migration conflict) caused systemd to restart the process repeatedly. The old binary was already overwritten, so the server was down until the second build completed. No data loss; agents reconnect automatically.
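
The line-count guard used to fix the false completion can be expressed as follows. The session used a shell polling loop; this Rust version is only a sketch of the same idea, with `BuildState` and `build_state` as hypothetical names. The two marker strings are taken from the grep pattern quoted above.

```rust
/// Outcome markers taken from the grep pattern in the notes.
const DONE: &str = "Server build complete";
const FAILED: &str = "failed to start";

#[derive(Debug, PartialEq, Eq)]
pub enum BuildState {
    Running,
    Complete,
    Failed,
}

/// Inspect only the lines written after `start_line`, i.e. the log's line count
/// at the moment this build was triggered. Older lines in the shared log are ignored.
pub fn build_state(log: &str, start_line: usize) -> BuildState {
    for line in log.lines().skip(start_line) {
        if line.contains(DONE) {
            return BuildState::Complete;
        }
        if line.contains(FAILED) {
            return BuildState::Failed;
        }
    }
    BuildState::Running
}
```

Recording the line count before triggering the build and passing it as `start_line` keeps a stale "failed to start" line from an earlier attempt from being read as the result of the current one.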
### Configuration Changes
**Committed to gururmm repo (4035b5c, 3df5880, c1b8b80):**
- `server/migrations/026_update_channels.sql` — new; `ADD COLUMN IF NOT EXISTS update_channel TEXT CHECK (...)` on clients, sites, agents
- `server/src/db/updates.rs` — added `resolve_agent_channel()`, `set_agent_channel()`, `set_site_channel()`, `set_client_channel()`
- `server/src/updates/scanner.rs` — `AvailableVersion.channel` field; `.channel` sidecar file read; `get_latest_version`/`needs_update` accept `channel: &str`
- `server/src/ws/mod.rs` — both `needs_update` call sites (connect + heartbeat) now resolve agent channel first
- `server/src/api/agents.rs` — `trigger_update` uses agent effective channel; added `set_agent_channel_handler`, `get_effective_channel_handler`
- `server/src/api/sites.rs` — added `set_site_channel_handler`
- `server/src/api/clients.rs` — added `set_client_channel_handler`
- `server/src/api/mod.rs` — registered 4 new routes with `patch` routing
- `server/Cargo.toml` — version bumped 0.3.0 → 0.3.1
- `dashboard/src/api/client.ts` — `UpdateChannel` type, `EffectiveChannel` interface, `channelApi`, `update_channel` field on Agent/Site/Client
- `dashboard/src/components/UpdateChannelSelector.tsx` — new component: inherit/stable/beta selector with effective-channel label
- `dashboard/src/pages/AgentDetail.tsx` — added `UpdateChannelSelector` to info section
- `dashboard/src/pages/SiteDetail.tsx` — added `UpdateChannelSelector`
- `dashboard/src/pages/ClientDetail.tsx` — added `UpdateChannelSelector`
**Modified on server directly (not in git):**
- `/opt/gururmm/build-agents.sh` — added `--channel` arg parsing; writes `.channel` sidecar for beta builds; backup at `.bak-pre-channel`
**Applied to production DB:**
- Migration 026 applied (columns already exist; the bad tracking row was deleted, and sqlx will re-insert it cleanly on the next startup with the idempotent migration)
**Dashboard:**
- Built and deployed to `/var/www/gururmm/dashboard/` from `/home/guru/gururmm/dashboard/dist/`
### Infrastructure & Servers
- **GuruRMM server:** 172.16.3.30:3001 (Rust/Axum) — v0.3.1 binary deployed, second build in progress (startup was failing due to migration conflict; now resolved with `IF NOT EXISTS`)
- **Build pipeline:** `/opt/gururmm/build-server.sh` — builds server on 172.16.3.30 directly; `/opt/gururmm/build-agents.sh` — builds agents (Linux on server, Windows on Pluto 172.16.3.36)
- **Downloads:** `/var/www/gururmm/downloads/` — v0.6.4 agent binaries (all platforms) present and ready
- **Dashboard:** `/var/www/gururmm/dashboard/` (nginx-served) — updated with channel selector UI
### Commands & Outputs
```bash
# Apply migration 026 (manual, before server had IF NOT EXISTS)
psql 'postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm' \
-f /home/guru/gururmm/server/migrations/026_update_channels.sql
# → ALTER TABLE x3
# Build and deploy dashboard
cd /home/guru/gururmm/dashboard && npm run build
# → 2559 modules, dist/assets/index-*.js 1082kB, built in 11s
sudo cp -r /home/guru/gururmm/dashboard/dist/* /var/www/gururmm/dashboard/
# Delete bad sqlx migration row (checksum mismatch fix)
echo "DELETE FROM _sqlx_migrations WHERE version = 26;" | \
psql postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm -f /dev/stdin
# → DELETE 1
# Beta build usage (once channel feature is deployed):
sudo /opt/gururmm/build-agents.sh --channel beta
# Trigger server rebuild
sudo bash -c 'nohup /opt/gururmm/build-server.sh >> /var/log/gururmm-build.log 2>&1 &'
```
### Pending / Incomplete Tasks
- **Server v0.3.1 second build in progress** — `build-server.sh` running at save time. Monitor when it completes: `tail -20 /var/log/gururmm-build.log`. If it shows "Server build complete: v0.3.1", verify `systemctl is-active gururmm-server` returns `active`.
- **Watchdog not installed on DESKTOP-0O8A1RL** — `GuruRMMWatchdog` service still not present. Will self-install via `ensure_watchdog_running()` when the agent receives and applies the v0.6.4 update push. Agent will receive the update push once the server is back online.
- **SQLX issue needs permanent fix** — see user question: "Can the SQLX issue be permanently resolved?" The fix is to always write `ADD COLUMN IF NOT EXISTS` in migrations and to never manually pre-apply them via psql. Additionally, set up a `cargo sqlx prepare` step in the build pipeline for any future migrations using `query!()` macros.
- **Verify DESKTOP-0O8A1RL watchdog after agent update** — once server is running and agent updates, confirm: `sc.exe queryex GuruRMMWatchdog` shows RUNNING, registry version updated, backup file cleaned.
- **9 bug tasks (#1–#9) still need TickTick status updates** — not yet marked complete.
### Reference Information
- **Commits:** `bdb751b` (bugs), `e52ee19` (watchdog self-install), `5b43fe6` (build fixes), `4035b5c` (channel feature), `3df5880` (version bump), `c1b8b80` (idempotent migration)
- **Server DB:** `postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm`
- **Build log:** `/var/log/gururmm-build.log` (shared between agent and server builds)
- **New API endpoints:**
  - `PATCH /api/agents/:id/channel` — set agent channel
  - `PATCH /api/sites/:id/channel` — set site channel
  - `PATCH /api/clients/:id/channel` — set client channel
  - `GET /api/agents/:id/effective-channel` — resolved channel + source
- **Scanner sidecar convention:** `<binary-filename>.channel` containing "beta"; absent = stable
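
A reader for the sidecar convention might look like the sketch below. It is not the code in `scanner.rs`: `sidecar_path` and `channel_of` are hypothetical names, and because the notes do not say whether the file holds `beta` with or without surrounding quotes, the sketch accepts both as an assumption.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Sidecar path: the binary's full file name with ".channel" appended.
pub fn sidecar_path(binary: &Path) -> PathBuf {
    let mut name = binary.as_os_str().to_os_string();
    name.push(".channel");
    PathBuf::from(name)
}

/// "beta" only when the sidecar exists and says so; a missing, unreadable,
/// or differently tagged sidecar means stable.
pub fn channel_of(binary: &Path) -> &'static str {
    match fs::read_to_string(sidecar_path(binary)) {
        Ok(text) if text.trim().trim_matches('"') == "beta" => "beta",
        _ => "stable",
    }
}
```

Treating every failure as stable matches the "absent = stable" rule and means a damaged sidecar cannot accidentally promote a binary to beta.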
---
## Update: 11:50 PT — IPC pipe DACL fix, CORS fix, dashboard clients bug
### Session Summary
This session was a continuation from a prior context window. The primary carried-over issue was the Windows agent system tray showing "disabled" — the tray icon connects to the agent service via a named pipe (`\\.\pipe\gururmm-agent`) created by SYSTEM, but user-session processes (non-elevated tray) were getting Access Denied when trying to open the pipe client end.
Four attempts were required to land the DACL fix. v0.6.8 tried `security_descriptor()` on `ServerOptions` (method doesn't exist). v0.6.9 tried `SetKernelObjectSecurity` post-creation with `DACL_SECURITY_INFORMATION` (compiled, but failed at runtime with 0x80070005 — Access Denied — because the tokio pipe handle lacks `WRITE_DAC`). v0.6.10 confirmed that approach is fundamentally broken: `CreateNamedPipeW` with `PIPE_ACCESS_DUPLEX` does not grant `WRITE_DAC`, so post-creation DACL modification is not possible via that handle. v0.6.11 switched to `CreateNamedPipeW` with a `SECURITY_ATTRIBUTES` struct holding a NULL DACL, but hit a compile error because `PIPE_ACCESS_DUPLEX` and `PIPE_TYPE_BYTE` are not exported by the `windows 0.58` crate from `Win32::System::Pipes`. v0.6.12 resolved this by using raw integer literals with the correct wrapper types: `FILE_FLAGS_AND_ATTRIBUTES(0x00000003 | 0x40000000)` and `NAMED_PIPE_MODE(0)`. That build compiled and the tray connected successfully ("IPC: tray client connected" in agent log).
A Firefox CORS bug was identified from a browser console export. The server was sending `Access-Control-Allow-Headers: *` (wildcard), and Firefox enforces that the `Authorization` header must be explicitly listed — the wildcard does not cover it. This caused Firefox to block API requests with warnings (escalating to errors in newer Firefox). The fix changed the Axum CORS layer from `.allow_headers(Any)` to `.allow_headers([AUTHORIZATION, CONTENT_TYPE, ACCEPT])`. A concurrent build error in `server/src/ws/mod.rs` was also fixed: `fail_agent_update` expects `Option<&str>` for its error message parameter, but the call site was passing `&str` directly. Both fixes were deployed as server v0.3.1.
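
The wildcard behaviour behind the CORS bug can be modelled with a small predicate. This is a simplified illustration of the rule for requests without credentials, not the browser's full preflight algorithm and not the Axum configuration; `header_allowed` is a hypothetical name.

```rust
/// Simplified model of the CORS request-header check:
/// `*` matches any request header except `Authorization`,
/// which has to be listed by name in `Access-Control-Allow-Headers`.
pub fn header_allowed(allow_headers: &[&str], request_header: &str) -> bool {
    let wanted = request_header.to_ascii_lowercase();
    let listed = allow_headers
        .iter()
        .any(|h| h.eq_ignore_ascii_case(&wanted));
    let wildcard = allow_headers.iter().any(|h| *h == "*");
    listed || (wildcard && wanted != "authorization")
}
```

Under this model `header_allowed(&["*"], "Authorization")` is false, while an explicit list such as `["authorization", "content-type", "accept"]` lets the header through. That matches the change from `Any` to the explicit header list.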
The dashboard was then reporting no clients at all. Investigation showed the server API returning 200 with 0 ms latency for `/api/clients` — indicating an immediate empty-list return with no database query. The root cause was in `authz/permissions.rs`: `accessible_client_ids()` only returns `None` (meaning all clients) for the `dev_admin` role, but all four users in the production database have `role = "admin"` (the legacy superuser role predating the multi-tenant system). With `admin` role and no org memberships in the JWT, the function returns `Some([])`, triggering the early-exit branch. The fix added `is_admin()` to `AuthContext` covering both `admin` and `dev_admin`, and updated `accessible_client_ids()`, `can_access_org()`, `is_org_admin()`, and all the org management permission methods to use it. Deployed as a server rebuild (same v0.3.1 version number, uptime reset to 27s confirmed via `/status`).
### Key Decisions
- **CreateNamedPipeW + SECURITY_ATTRIBUTES at creation time, not SetKernelObjectSecurity post-creation** — `SetKernelObjectSecurity` requires `WRITE_DAC` access on the handle, which tokio's `CreateNamedPipeW`-backed `NamedPipeServer` does not have. The only correct approach is to pass the security descriptor at pipe creation time.
- **Raw integer literals for windows 0.58 pipe constants** — `PIPE_ACCESS_DUPLEX` and `PIPE_TYPE_BYTE` are simply not re-exported by the windows crate 0.58 from `Win32::System::Pipes`. Using `FILE_FLAGS_AND_ATTRIBUTES(0x00000003 | 0x40000000)` and `NAMED_PIPE_MODE(0)` directly is correct and stable — the Win32 ABI values do not change.
- **`admin` role treated as full superuser in all permission checks** — the `admin` role is explicitly documented as the legacy superuser before `dev_admin` was introduced. Restricting it to org-membership-based access is a regression. All production users are `admin`. The correct fix is to treat `admin` the same as `dev_admin` in `is_admin()`, not to forcibly re-issue JWTs or update the DB.
- **Tray binary crash investigation deferred** — the tray was crashing/exiting every 30 seconds (detected at the 30-second `reap_dead` poll boundary). Root cause is likely the OLD tray binary on enrolled machines (never updated by the agent auto-updater, which only updates the agent binary). The `gururmm-tray-windows-amd64-{version}.exe` is built and deployed to the downloads server but never pushed to agents. This is a separate deferred issue — auto-update flow needs to also update the tray binary.
- **Stale build lock must be cleared manually** — `/var/run/gururmm-build.lock` is left by zombie build processes after failures. Added `rm -f /var/run/gururmm-build.lock` before each build trigger. The build script should defensively remove stale locks on startup (future fix).
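
The raw literals from the pipe-constants decision decompose as shown below. The names are the Win32 SDK names for these values. Reading `0x40000000` as `FILE_FLAG_OVERLAPPED` is an interpretation of the literal, not something the notes state, and `open_mode`/`pipe_mode` are hypothetical helpers.

```rust
/// Win32 values from the SDK headers, written as raw literals.
pub const PIPE_ACCESS_DUPLEX: u32 = 0x0000_0003;
pub const FILE_FLAG_OVERLAPPED: u32 = 0x4000_0000;
pub const PIPE_TYPE_BYTE: u32 = 0x0000_0000;
pub const PIPE_READMODE_BYTE: u32 = 0x0000_0000;
pub const PIPE_WAIT: u32 = 0x0000_0000;

/// Value wrapped in `FILE_FLAGS_AND_ATTRIBUTES(...)` for the open-mode argument.
pub const fn open_mode() -> u32 {
    PIPE_ACCESS_DUPLEX | FILE_FLAG_OVERLAPPED
}

/// Value wrapped in `NAMED_PIPE_MODE(...)`: byte type, byte read mode, blocking wait.
pub const fn pipe_mode() -> u32 {
    PIPE_TYPE_BYTE | PIPE_READMODE_BYTE | PIPE_WAIT
}
```

Keeping the named constants next to the literals, even if only in a comment, makes the `0x00000003 | 0x40000000` expression easier to audit later.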
### Problems Encountered
- **`SetKernelObjectSecurity` fails at runtime with 0x80070005** — post-creation DACL modification is impossible on a tokio pipe handle. Required full approach switch to `CreateNamedPipeW` with `SECURITY_ATTRIBUTES`.
- **`PIPE_ACCESS_DUPLEX` not in windows 0.58** — constant exists in Win32 SDK headers but is not re-exported by the crate. Used `FILE_FLAGS_AND_ATTRIBUTES(0x00000003 | 0x40000000)` as the solution.
- **Stale `/var/run/gururmm-build.lock`** — blocked both the agent build and server build triggers. Had to `sudo rm -f` before each trigger.
- **`fail_agent_update` type mismatch** — function signature was `Option<&str>` but call site in `ws/mod.rs` passed raw `&str`. Caught as a compile error during the CORS server build.
- **All dashboard users have `role = "admin"`** — the production DB was seeded before the `dev_admin` role was introduced. The permission system assumed admin users would have `dev_admin`. Every API endpoint that gate-checked `is_dev_admin()` only was effectively broken for the entire production user base.
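
The permission fix can be summarised in a std-only sketch. The struct layout, the `member_client_ids` field, and the `u32` id type are assumptions made for the example; only the role names and the `None` = all clients convention come from the notes.

```rust
#[derive(Debug, Clone)]
pub struct AuthContext {
    pub role: String,
    /// Client ids reachable through org memberships (empty when the JWT carries none).
    pub member_client_ids: Vec<u32>,
}

impl AuthContext {
    /// Both the legacy `admin` role and `dev_admin` count as superuser.
    pub fn is_admin(&self) -> bool {
        matches!(self.role.as_str(), "admin" | "dev_admin")
    }

    /// `None` means unrestricted; `Some(ids)` restricts queries to those clients.
    pub fn accessible_client_ids(&self) -> Option<Vec<u32>> {
        if self.is_admin() {
            None
        } else {
            Some(self.member_client_ids.clone())
        }
    }
}
```

Before the fix, an `admin` user with no memberships took the `Some(vec![])` path, which is what produced the instant empty client list.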
### Configuration Changes
**Modified in gururmm repo (committed and pushed):**
- `agent/src/ipc.rs` — `create_server_pipe()` rewritten using `CreateNamedPipeW` + NULL DACL `SECURITY_ATTRIBUTES`; added `windows::Win32::System::Pipes` and `Win32::Storage::FileSystem` feature usage
- `agent/Cargo.toml` — version 0.6.11 → 0.6.12; added `Win32_System_Pipes` and `Win32_Storage_FileSystem` to windows feature list
- `server/src/main.rs` — CORS: `.allow_headers(Any)` → `.allow_headers([AUTHORIZATION, CONTENT_TYPE, ACCEPT])`; added `use axum::http::header::{ACCEPT, AUTHORIZATION, CONTENT_TYPE}`
- `server/src/ws/mod.rs` — `fail_agent_update` call: `"send_to failed: ..."` → `Some("send_to failed: ...")`
- `server/src/authz/permissions.rs` — added `is_admin()` method; `accessible_client_ids()`, `can_access_org()`, `is_org_admin()`, `can_set_org_limits()`, `can_impersonate()`, `can_create_org()`, `can_delete_org()` all updated to use `is_admin()`
- `server/src/api/organizations.rs` — all `auth.is_dev_admin()` calls replaced with `auth.is_admin()`
**Deployed to production:**
- Agent v0.6.12 binary on download server, update scanner dispatching to all 48 enrolled agents
- Server v0.3.1 (two builds: first fixed CORS + ws/mod.rs; second fixed permissions)
### Credentials & Secrets
- **GuruRMM DB (production):** `postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm`
(rediscovered during permissions investigation — matches prior session log)
### Infrastructure & Servers
- **GuruRMM server:** 172.16.3.30:3001 — v0.3.1, uptime ~27s at last check (18:10 UTC restart)
- **Build servers:** 172.16.3.30 (Linux agent + server); 172.16.3.36 Pluto (Windows agent + tray + MSI)
- **Downloads:** `/var/www/gururmm/downloads/` — `gururmm-agent-windows-amd64-0.6.12.exe` + tray binary present
- **Enrolled agents:** 48 total, 10 online, 38 offline (at 18:10 UTC)
- **Tray binary on enrolled machines:** OLD version (never auto-updated) — crashes ~30s after connecting to new NULL-DACL pipe
### Commands & Outputs
```bash
# Check guruRMM DB users (diagnosed empty dashboard)
PGPASSWORD=43617ebf7eb242e814ca9988cc4df5ad psql -U gururmm -h localhost -d gururmm \
-c 'SELECT id, email, role FROM users;'
# → All 4 users have role="admin" (not dev_admin)
# Verify clients table has data
PGPASSWORD=... psql ... -c 'SELECT COUNT(*) FROM clients;'
# → 15
# Check server status after rebuild
curl -s https://rmm-api.azcomputerguru.com/status
# → {"version":"0.3.1","uptime_seconds":27,"agents":{"total":48,"online":10},...}
# Clear stale build lock
sudo rm -f /var/run/gururmm-build.lock
# Trigger server rebuild
nohup sudo bash /opt/gururmm/build-server.sh > /tmp/server-build-$$.log 2>&1 &
```
### Pending / Incomplete Tasks
- **Tray binary never auto-updated** — `gururmm-tray.exe` in agent install dirs on enrolled machines is the original install version. The auto-updater only replaces the agent binary. Fix: extend the update flow to also download and install the new tray binary alongside the agent binary. New server `UpdatePayload` field + agent-side tray download step needed.
- **Tray crash root cause** — the tray exits ~30s after connecting (detected at the 30-second `reap_dead` poll). Likely old binary crashing after connecting to the now-accessible pipe. Unconfirmed — no tray crash log (windows_subsystem = "windows" swallows panics). Adding file logging to the tray is the correct diagnostic step.
- **`/var/run/gururmm-build.lock` stale lock protection** — build script should `rm -f` the lock at startup to avoid manual intervention on the next build failure. Simple one-line addition to `build-agents.sh` and `build-server.sh`.
- **Plan file bugs #1–#9** — the ticklish-questing-stallman plan still has unfinished items (registry version write, watchdog co-install, etc.) that were not addressed this session since the session focused on the tray pipe DACL fix and the production dashboard break.
### Reference Information
- **Commits (gururmm repo):**
  - `a6b3174` — fix: grant legacy admin role full access in permission checks
  - `8c7380c` — fix(server): wrap fail_agent_update error_message in Some()
  - `f3d0cc0` — fix(server): explicitly list Authorization in CORS allow_headers
  - `47128b5` — fix(ipc): use FILE_FLAGS_AND_ATTRIBUTES raw values
  - `57ff059` — fix(ipc): create pipe with NULL DACL via CreateNamedPipeW+SECURITY_ATTRIBUTES
- **Key source files:**
- `agent/src/ipc.rs:156`: `create_server_pipe()` with NULL DACL (native-service feature)
- `server/src/authz/permissions.rs:35`: `is_admin()` method
- `server/src/api/clients.rs:49`: `accessible_client_ids()` usage (empty-list path)
- **Production DB connection:** `postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm`
- **Tray binary download server path:** `/var/www/gururmm/downloads/gururmm-tray-windows-amd64-{version}.exe`
---
## Update: ~17:00 PT — Policy UI defaultHint fix + full end-to-end policy wiring
### Session Summary
This session completed two major policy system items. First, a UI fix: the policy editor's "Default (ON/OFF)" radio hints were hardcoded strings that didn't reflect the actual system default policy data. After the system default policy was updated in the prior session to disable all metrics and auto-update while keeping tray settings, the hints still showed "Default (ON)" for disabled fields. The fix added a `systemDefaults?: PolicyData` prop to `PolicySectionEditor`, computed a `hints` map from the live system default policy data using a `boolHint()` helper, and replaced all 16 hardcoded `defaultHint="ON/OFF"` values with `hints.xxx` references. The system default policy data is fetched via `policiesApi.getSystemDefault()` (already in the page's queries) and passed down through `PolicyDetail` → `PolicySectionEditor`. The dashboard was built and deployed (commit `287d106`).
Second, a full audit of the policy system revealed that agents never receive policy data despite the infrastructure being in place. The `ConfigUpdate` WebSocket message existed in the protocol but was never sent by the server; `AppState` had no policy field; every subsystem (metrics, tray, updates, watchdog) used hardcoded defaults. The only wired section was thresholds, which are evaluated server-side when metrics arrive.
An implementation plan was approved covering 10 files across both the agent and server crates. A Coding Agent (Opus, 14 minutes) implemented all changes: expanded `ConfigUpdatePayload` from 2 fields to 4 nested section structs covering all policy sections; added `AppState::effective_policy: RwLock<ConfigUpdatePayload>`; replaced the ConfigUpdate stub with a real handler that stores policy and forwards the watchdog section via IPC; changed the fixed-interval metrics loop to a sleep-based loop reading interval + collect flags from policy each iteration; added `collect_with_flags()` to `MetricsCollector`; wired `GetPolicy` IPC to read from AppState instead of `default_permissive()`; added `UpdateConfig` to `WatchdogCommand` and handled it in the monitor with runtime config; added `policy_to_agent_config()` server-side helper in a new `server/src/policy/config_update.rs`; wired `ConfigUpdate` dispatch after AuthAck in `server/src/ws/mod.rs`; gated auto-update push on `updates.auto_update` from effective policy; and added `push_config_update_to_affected()` called from both assignment create and delete endpoints. Both crates built clean (agent on server via `cargo check`, server via `build-server.sh` — 68 warnings, all pre-existing). Deployed as server v0.3.1 (commit `78b6831`).
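
The sleep-based metrics loop and `collect_with_flags()` described above can be sketched roughly as follows. This is a minimal illustration, not the code from commit `78b6831`: the `MetricsPolicy` and `Sample` structs, their field names, the fixed sample values, and the `tick()` helper are all hypothetical stand-ins chosen to keep the example self-contained.

```rust
use std::sync::RwLock;
use std::time::Duration;

/// Hypothetical stand-in for the metrics section of the policy payload.
#[derive(Clone, Debug, PartialEq)]
pub struct MetricsPolicy {
    pub interval_secs: u64,
    pub collect_cpu: bool,
    pub collect_memory: bool,
    pub collect_disk: bool,
}

/// Hypothetical metrics sample; the real payload carries more fields.
#[derive(Clone, Debug, PartialEq)]
pub struct Sample {
    pub cpu_percent: f32,
    pub memory_percent: f32,
    pub disk_percent: Option<f32>,
}

/// Single-pass collection. Fixed values keep the sketch deterministic.
pub fn collect() -> Sample {
    Sample {
        cpu_percent: 42.0,
        memory_percent: 63.5,
        disk_percent: Some(71.0),
    }
}

/// Collect everything in one pass, then zero out whatever the policy disables.
pub fn collect_with_flags(policy: &MetricsPolicy) -> Sample {
    let mut sample = collect();
    if !policy.collect_cpu {
        sample.cpu_percent = 0.0;
    }
    if !policy.collect_memory {
        sample.memory_percent = 0.0;
    }
    if !policy.collect_disk {
        sample.disk_percent = None;
    }
    sample
}

/// One loop iteration: re-read the policy, collect, and report how long to
/// sleep before the next iteration. A real loop would send the sample and
/// then sleep for the returned duration.
pub fn tick(policy: &RwLock<MetricsPolicy>) -> (Sample, Duration) {
    let current = policy.read().unwrap().clone();
    let sample = collect_with_flags(&current);
    (sample, Duration::from_secs(current.interval_secs))
}
```

Because the policy is read at the top of every iteration, a `ConfigUpdate` that changes the interval or the collect flags takes effect on the next tick without restarting the loop.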
### Key Decisions
- **`hints` computed inside `PolicySectionEditor` from `systemDefaults` prop** — avoids threading per-field hint strings through the prop chain; one `boolHint()` call per field at render time. When `systemDefaults` is undefined (before query resolves), all hints default to "ON" to match the system default seeded in migration 027.
- **`systemDefaultData` passed to both `<PolicyDetail>` call sites** — the system default detail view passes `systemDefault.policy_data` to itself (showing its own values as hints); regular policies pass `systemDefault?.policy_data` (reflecting the true baseline).
- **Agent `collect_with_flags()` zeroes disabled fields rather than skipping collection** — avoids restructuring the single-pass `collect()` method. Disabled metrics send 0/None in the payload; server receives them but evaluates thresholds against 0, which never triggers alerts (effectively disabled). Simplest correct approach without breaking the existing collect architecture.
- **Server-side mirror structs in `policy/config_update.rs`, not re-using agent types** — agent and server are separate crates. The server serializes `AgentConfigUpdate` to JSON; the agent deserializes to `ConfigUpdatePayload`. JSON field names match exactly. Mirror structs keep the crates fully decoupled.
- **Single `get_effective_policy()` call per connect, result reused** — computed once after registration, used for both ConfigUpdate dispatch and auto-update gating. Avoids two round-trips to the DB on every agent connect.
- **`push_config_update_to_affected()` spawned async (non-blocking)** — assignment change response returns immediately; push happens in background. If the push fails (agent disconnected between check and send), it's a no-op — next connect will receive the current policy.
- **Watchdog `UpdateConfig` applies per-field (None = no change)** — consistent with the rest of the `ConfigUpdatePayload` design; partial updates won't reset unrelated watchdog fields.
- **Watchdog interval floored at 5 seconds** — prevents a policy misconfiguration from creating a CPU-spinning tight loop in the watchdog process.
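
The last two decisions (per-field updates where `None` means no change, and the 5-second interval floor) can be illustrated with a small sketch. `WatchdogRuntimeConfig` is named in the session notes, but the field names, the `WatchdogUpdate` struct, and the `apply()` method below are assumptions made for illustration only.

```rust
use std::time::Duration;

/// Floor taken from the decision above: never poll faster than every 5 seconds.
const MIN_INTERVAL: Duration = Duration::from_secs(5);

/// Hypothetical shape of the watchdog's runtime configuration.
#[derive(Clone, Debug, PartialEq)]
pub struct WatchdogRuntimeConfig {
    pub enabled: bool,
    pub interval: Duration,
    pub services: Vec<String>,
}

/// Hypothetical partial update: every field is optional, None = leave as-is.
#[derive(Clone, Debug, Default)]
pub struct WatchdogUpdate {
    pub enabled: Option<bool>,
    pub interval_secs: Option<u64>,
    pub services: Option<Vec<String>>,
}

impl WatchdogRuntimeConfig {
    /// Apply only the fields present in the update, clamping the interval
    /// so a misconfigured policy cannot produce a tight loop.
    pub fn apply(&mut self, update: WatchdogUpdate) {
        if let Some(enabled) = update.enabled {
            self.enabled = enabled;
        }
        if let Some(secs) = update.interval_secs {
            self.interval = Duration::from_secs(secs).max(MIN_INTERVAL);
        }
        if let Some(services) = update.services {
            self.services = services;
        }
    }
}
```

With this shape, an update that carries only `interval_secs: Some(1)` leaves `enabled` and `services` untouched and results in a 5-second interval.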
### Problems Encountered
- **Coding Agent (first attempt) timed out at 18 minutes with no files written** — was mid-read phase. Resumed via a fresh agent invocation with the same detailed prompt. Second attempt (Opus model, 14 minutes) completed all 11 file changes.
- **`cargo` not on local Windows PATH** — build had to be done remotely via SSH to the build server. The agent `cargo check` runs on Linux (non-Windows target) but catches type/import errors; Windows-specific conditional compilation (`#[cfg(windows)]`) is not checked but those paths are structurally unchanged.
- **Build triggered remotely after user corrected local build attempt** — user noted builds should go to the build server, not local. Correct pattern: push to Gitea, SSH `sudo /opt/gururmm/build-server.sh`.
### Configuration Changes
**Committed to gururmm repo:**
`287d106` — fix: policy editor defaultHint reflects actual system default values
- `dashboard/src/pages/Policies.tsx`: `PolicySectionEditorProps` gains `systemDefaults?: PolicyData`; `boolHint()`/`hints` map computed from it; all 16 hardcoded `defaultHint` strings replaced with dynamic refs; `PolicyDetailProps` gains `systemDefaultData?: PolicyData`; both `<PolicyDetail>` call sites pass it
`78b6831` — feat: wire agent policy end-to-end (metrics, tray, updates, watchdog)
- `agent/src/transport/mod.rs`: `ConfigUpdatePayload` expanded to 4 nested section structs; `WatchdogConfigUpdate` gains `services`/`processes` fields
- `agent/src/main.rs`: `AppState::effective_policy: RwLock<ConfigUpdatePayload>` added
- `agent/src/transport/websocket.rs`: `ConfigUpdate` handler implemented; metrics loop converted to sleep-based with per-iteration policy read
- `agent/src/metrics/mod.rs`: `collect_with_flags()` method added
- `agent/src/ipc.rs`: `TrayPolicy::from_config_update()` added; `GetPolicy` reads from AppState
- `agent/src/watchdog/pipe.rs`: `WatchdogCommand::UpdateConfig` variant added
- `agent/src/watchdog/monitor.rs`: `WatchdogRuntimeConfig` struct; `UpdateConfig` handler; enabled gate; configurable interval
- `server/src/policy/config_update.rs`: NEW file with `AgentConfigUpdate` mirror structs + `policy_to_agent_config()` helper
- `server/src/policy/mod.rs`: `pub mod config_update` added
- `server/src/ws/mod.rs`: `ConfigUpdate` sent after AuthAck; auto-update gated on policy
- `server/src/api/policies.rs`: `push_config_update_to_affected()` helper; called from `assign_policy` and `remove_assignment`
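
As a rough sketch of the connect-time behaviour recorded for `server/src/ws/mod.rs` (one effective-policy lookup reused for both the `ConfigUpdate` dispatch and the auto-update gate), consider the following. The `EffectivePolicy` struct, the `ServerMessage` enum, the `messages_on_connect()` function, and the plain string inequality used as a version check are all hypothetical simplifications, not the server's actual types or logic.

```rust
/// Hypothetical subset of the effective policy needed at connect time.
#[derive(Clone, Debug, PartialEq)]
pub struct EffectivePolicy {
    pub auto_update: bool,
    pub metrics_interval_secs: u64,
}

/// Hypothetical outbound messages; the real protocol has more variants.
#[derive(Debug, PartialEq)]
pub enum ServerMessage {
    AuthAck,
    ConfigUpdate(EffectivePolicy),
    Update { version: String },
}

/// Build the messages sent to an agent right after it authenticates.
/// The policy is passed in once and reused for both decisions, so the
/// caller only needs a single database lookup per connect.
pub fn messages_on_connect(
    policy: &EffectivePolicy,
    agent_version: &str,
    latest_version: &str,
) -> Vec<ServerMessage> {
    let mut out = vec![
        ServerMessage::AuthAck,
        ServerMessage::ConfigUpdate(policy.clone()),
    ];
    // Assumption: "needs update" is modelled as a simple string mismatch.
    if policy.auto_update && agent_version != latest_version {
        out.push(ServerMessage::Update {
            version: latest_version.to_string(),
        });
    }
    out
}
```

An agent whose policy has `auto_update` disabled still receives its `ConfigUpdate`, but no update push, even when a newer version is available.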
### Infrastructure & Servers
- **GuruRMM server:** 172.16.3.30:3001 — v0.3.1 rebuilt and deployed (build completed 00:00:31 UTC 2026-05-14)
- **Dashboard:** `/var/www/gururmm/dashboard/` — updated with dynamic policy hints
- **Agent crate:** `cargo check` clean on Linux (server); Windows build pending (Pluto)
### Commands & Outputs
```bash
# Remote server build (the correct build pattern)
ssh -i C:/Users/guru/.ssh/id_ed25519 guru@172.16.3.30 "sudo /opt/gururmm/build-server.sh 2>&1"
# → Compiling gururmm-server v0.3.1
# → warning: unused import: ... (68 warnings, all pre-existing)
# → Finished release profile in 4m 07s
# → === Server build complete: v0.3.1 ===
# Agent cargo check on build server (Linux target)
ssh ... guru@172.16.3.30 "cd /home/guru/gururmm/agent && /home/guru/.cargo/bin/cargo check 2>&1 | tail -5"
# → warning: struct PipeServer is never constructed (pre-existing)
# → Finished dev profile in 7.18s
# Deploy dashboard
cd D:/claudetools/projects/msp-tools/guru-rmm/dashboard && npm run build
scp -i C:/Users/guru/.ssh/id_ed25519 -r dist/* guru@172.16.3.30:/var/www/gururmm/dashboard/
```
### Pending / Incomplete Tasks
- **Windows agent build pending** — `cargo check` passed on Linux; full Windows cross-compile or Pluto build needed to confirm `#[cfg(windows)]` paths compile. Agent binary won't reflect policy wiring until built and deployed via build pipeline.
- **Network discovery feature** — next work item, starting after this save.
- **Tray binary auto-update** — still not implemented (carried over from prior session). Old tray binary on enrolled machines crashes ~30s after connecting to the NULL-DACL pipe.
- **Watchdog not installed on existing enrolled machines** — `ensure_watchdog_running()` is in the agent but existing machines won't trigger re-install until they receive a new agent binary.
### Reference Information
- **Commits:** `287d106` (defaultHint fix), `78b6831` (policy wiring)
- **Key source files:**
- `agent/src/transport/mod.rs`: `ConfigUpdatePayload`, `MetricsConfigUpdate`, `TrayConfigUpdate`, `UpdatesConfigUpdate`, `WatchdogConfigUpdate`
- `agent/src/main.rs`: `AppState::effective_policy`
- `agent/src/transport/websocket.rs`: `ConfigUpdate` handler + dynamic metrics loop
- `agent/src/metrics/mod.rs`: `collect_with_flags()`
- `agent/src/ipc.rs`: `TrayPolicy::from_config_update()`, `GetPolicy` handler
- `agent/src/watchdog/monitor.rs`: `WatchdogRuntimeConfig`, `UpdateConfig` handler
- `server/src/policy/config_update.rs`: `policy_to_agent_config()` (new file)
- `server/src/ws/mod.rs`: post-AuthAck ConfigUpdate dispatch + auto-update gate
- `server/src/api/policies.rs`: `push_config_update_to_affected()`
- **Policy wiring status after this session:**
- Metrics: WIRED (interval + collect_* flags from policy)
- Thresholds: WIRED server-side (unchanged — already worked)
- Tray: WIRED (reads from AppState, no longer hardcoded)
- Updates: WIRED (auto_update gate added server-side)
- Watchdog: WIRED (UpdateConfig IPC + runtime config applied)