azcomputerguru/claudetools

Mike Swanson 39bc5f1e86 docs: migrate all gururmm session logs to claudetools session-logs/

2026-05-15 06:13:52 -07:00

GuruRMM Session — 2026-05-13

User

User: Mike Swanson (mike)
Machine: DESKTOP-0O8A1RL
Role: admin
Session span: ~06:00–13:00 local (approx)

Session Summary

The session started with a /sync (clean, no changes) followed by a request to diagnose an RMM update failure on DESKTOP-0O8A1RL. Diagnosis proceeded by inspecting the running GuruRMMAgent service, the install directory, the ProgramData log files, and the agent source code in the local dev clone (projects/msp-tools/guru-rmm/).

Initial inspection found the agent binary reporting version 0.6.4 (via --version) while the registry HKLM\SOFTWARE\GuruRMM\Version still read 0.6.2. The .old backup file from a prior update remained uncleaned in the install directory, and a gururmm-agent.backup file persisted in ProgramData. Logs revealed two back-to-back updates had occurred overnight (0.6.2→0.6.3 at 00:44 UTC, 0.6.3→0.6.4 at 03:04 UTC) with restart gaps of 65 minutes and 3 hours 27 minutes respectively — far longer than any deliberate delay.

Deep reading of updater/mod.rs, watchdog/monitor.rs, watchdog/pipe.rs, transport/websocket.rs, and server/src/ws/mod.rs produced a full causal chain: the Windows service exits with code 0 after binary replacement (bypassing SCM recovery), the detached cmd restart helper is likely killed by the Windows job object, the GuruRMMWatchdog service is not installed on this machine (so the IPC path always fails), the rollback PS1 uses Get-Service which silently returns null for GuruRMMAgent, the watchdog reads a non-existent agent.toml for the server URL, and post-update cleanup (cancel_rollback_watchdog, cleanup_backup) is never triggered because the server sends AuthAck before computing the update confirmation.

Nine bugs were filed as tracked tasks (#1–#9). A design discussion followed about whether to fix the current system (Option A) or replace it with MSI-based updates. The consensus was Option A with an architectural direction that the watchdog will eventually take over as the primary updater, with the main agent retaining self-update as permanent fallback. An approved plan was written, delegated to the Coding Agent, and all changes were implemented across 7 files. A Code Review Agent was launched (in background at session end — result pending).

Key Decisions

Option A (fix 9 bugs) over MSI-based updates — MSI approach is cleaner long-term but requires significant build pipeline changes. Option A ships in one sprint. Decision to revisit MSI for Windows updates in a future phase.
Watchdog owns stop+replace+start; agent owns download+verify — When the watchdog eventually implements PerformUpdate, the IPC carries a local staged path, not a URL. This keeps the watchdog free of HTTP client logic and avoids duplicating the agent's download+checksum machinery.
Main agent retains full self-update as permanent fallback — Even with the watchdog as primary updater, the agent falls through to its own update logic if the IPC fails. Belt and suspenders.
Exit(1) not exit(0) as SCM fallback — When both IPC paths (PerformUpdate and RestartMainService) fail, the agent now exits with code 1 so SCM recovery fires within 10s. The IPC-success path still exits 0 (watchdog owns restart in that case).
SCM recovery delay changed 60s → 10s — The original 60s delay was unnecessarily slow for the fallback case. 10s is aggressive enough to be useful without thrashing.
PerformUpdate IPC variant added now (returns not-implemented) — Adding the command to the protocol now means no agent-side protocol change is needed when the watchdog implements it. The interface is locked; the implementation is deferred.
complete_update_by_agent() was already written but never called — The DB function existed in server/src/db/updates.rs exactly matching the needed signature. The fix was purely a wiring change in server/src/ws/mod.rs, not new code.
Watchdog server URL: compile-time constant, not registry — The watchdog is the same binary, compiled with the same GURURMM_SERVER_URL env var. Reading a TOML file was architecturally wrong and the file didn't exist anyway. The fix uses option_env! directly.
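The compile-time server URL decision could look roughly like the sketch below. It assumes only what the log states (the GURURMM_SERVER_URL env var, option_env!, and the server_api_url() name); the resolve_server_url() helper and its handling of blank values are hypothetical additions for illustration.

```rust
/// Server URL baked in at build time. `option_env!` evaluates to `None`
/// when GURURMM_SERVER_URL was not set in the build environment.
const COMPILED_SERVER_URL: Option<&str> = option_env!("GURURMM_SERVER_URL");

/// Hypothetical helper: treat an unset or blank value as "not configured".
fn resolve_server_url(compiled: Option<&'static str>) -> Option<&'static str> {
    compiled.map(str::trim).filter(|url| !url.is_empty())
}

/// What the watchdog would call instead of parsing a TOML file at runtime.
fn server_api_url() -> Option<&'static str> {
    resolve_server_url(COMPILED_SERVER_URL)
}
```

Because the value is resolved by the compiler, the agent and the watchdog cannot disagree about the URL as long as both come from the same build.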

Problems Encountered

Get-Service -Name "GuruRMMAgent" returns nothing — PowerShell silently fails to enumerate the service even with the exact name; sc.exe queryex finds it fine. Root cause unclear (likely a non-elevated session permission issue). Resolution: all PS1-based service checks in the codebase replaced with sc.exe query equivalents.
Post-update cleanup path never reaches cleanup_backup() — The server was sending AuthAck before running the post-update check, so the agent never received any signal to clean up. Resolution: compute update_confirmed before building AuthAck; include it in the ack; agent acts on it in the AuthAck handler.
Detached cmd restart killed by job object — Windows service processes are often placed in a job object with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE. When the service exits, child processes (the detached cmd) are killed before sc.exe start runs. Resolution: added CREATE_BREAKAWAY_FROM_JOB (0x01000000) flag alongside CREATE_NO_WINDOW; combined value 0x09000000. An exit(1) fallback was also added so SCM recovery fires regardless.
Code review still in progress at /save time — Code Review Agent was launched as a background task. Result pending. Changes should not be deployed until review is clean.
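The creation-flag arithmetic from the job-object item above can be checked with a short sketch. The flag values are the documented Win32 constants; the restart_helper() function and the exact cmd /C sc.exe start command shape are assumptions, not the code in updater/mod.rs.

```rust
use std::process::Command;

// Win32 process creation flags (values from the Windows SDK headers).
const CREATE_BREAKAWAY_FROM_JOB: u32 = 0x0100_0000;
const CREATE_NO_WINDOW: u32 = 0x0800_0000;

/// Combined flags for the detached restart helper.
const RESTART_HELPER_FLAGS: u32 = CREATE_NO_WINDOW | CREATE_BREAKAWAY_FROM_JOB;

/// Build (but do not spawn) a restart helper command.
/// The command line here is a hypothetical stand-in for the real helper.
fn restart_helper(service_name: &str) -> Command {
    let mut cmd = Command::new("cmd");
    cmd.args(["/C", "sc.exe", "start", service_name]);
    // creation_flags only exists on Windows, so gate it for other targets.
    #[cfg(windows)]
    {
        use std::os::windows::process::CommandExt;
        cmd.creation_flags(RESTART_HELPER_FLAGS);
    }
    cmd
}
```

One caveat worth keeping in mind: CREATE_BREAKAWAY_FROM_JOB only succeeds when the job permits breakaway, so process creation can still fail on some hosts. That is consistent with keeping exit(1) as a safety net.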

Configuration Changes

Files modified

agent/src/registry.rs — added write_version(), write_server_url(), read_server_url() with full platform stubs
agent/src/service.rs — added registry version+URL write on startup; SCM recovery delay 60s→10s; watchdog co-installation in install_service()
agent/src/updater/mod.rs — PerformUpdate IPC attempt (step 2.5 in do_update); 0x09000000 creation flags; exit(1) final fallback; sc.exe in PS1 rollback template replacing Get-Service
agent/src/watchdog/pipe.rs — PerformUpdate variant added to WatchdogCommand enum (staged_path model)
agent/src/watchdog/monitor.rs — read_server_api_url() replaced with server_api_url() using compile-time constant; PerformUpdate match arm added (logs + no-op, pipe server sends ok:false)
agent/src/transport/mod.rs — update_confirmed: Option<Uuid> with #[serde(default)] added to AuthAckPayload
agent/src/transport/websocket.rs — cleanup block in AuthAck handler: calls cancel_rollback_watchdog() + cleanup_backup() when update_confirmed is Some
server/src/ws/mod.rs — update_confirmed: Option<Uuid> added to server-side AuthAckPayload; complete_update_by_agent() wired up before AuthAck is sent; old post-ack update check block removed
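The AuthAck reordering touched by the last three files can be sketched as follows. This is a reduced model under stated assumptions: UpdateId stands in for Uuid, the payload carries only the new field, and the complete_update closure is a hypothetical placeholder for the complete_update_by_agent() DB call.

```rust
/// Stand-in for `Uuid` so this sketch needs no external crates.
type UpdateId = u128;

/// Reduced AuthAck payload: only the field added in this session is modeled.
#[derive(Debug, PartialEq)]
struct AuthAckPayload {
    update_confirmed: Option<UpdateId>,
}

/// Cleanup steps the agent runs once the server confirms the update.
#[derive(Debug, PartialEq)]
enum CleanupStep {
    CancelRollbackWatchdog,
    CleanupBackup,
}

/// Server side: run the confirmation first so its result travels in the ack.
/// `complete_update` is a hypothetical closure standing in for the DB call.
fn build_auth_ack(complete_update: impl FnOnce() -> Option<UpdateId>) -> AuthAckPayload {
    let update_confirmed = complete_update();
    AuthAckPayload { update_confirmed }
}

/// Agent side: clean up only when the ack carries a confirmation.
fn on_auth_ack(ack: &AuthAckPayload) -> Vec<CleanupStep> {
    match ack.update_confirmed {
        Some(_) => vec![CleanupStep::CancelRollbackWatchdog, CleanupStep::CleanupBackup],
        None => Vec::new(),
    }
}
```

An older server that omits the field maps to the None branch, so the agent simply skips cleanup.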

Credentials & Secrets

None created or discovered this session.

Infrastructure & Servers

GuruRMM server: 172.16.3.30:3001 (Rust/Axum)
Build pipeline: Gitea push → webhook-handler.py (172.16.3.30:9000) → build-agents.sh → Pluto (172.16.3.36) for Windows MSI
Agent on DESKTOP-0O8A1RL:
- Binary: C:\Program Files\GuruRMM\gururmm-agent.exe (4,452,648 bytes — v0.6.4)
- Registry: HKLM\SOFTWARE\GuruRMM — SiteId: d008c7d4-9e5e-4666-9fa0-b432609d54cc, AgentKey: agk_ybg4Ty6zXU_2Ee0ddlUUtuZdz0B9Qw4_
- Logs: C:\ProgramData\GuruRMM\agent.log.YYYY-MM-DD
- Backup: C:\ProgramData\GuruRMM\gururmm-agent.backup (0.6.3 binary, 4,303,656 bytes — not yet cleaned up pending code review + deploy)
- Service: GuruRMMAgent (PID 4856 at session start) — GuruRMMWatchdog NOT YET INSTALLED

Commands & Outputs

# Service query (use sc.exe, not Get-Service — Get-Service silently fails for GuruRMMAgent)
sc.exe queryex "GuruRMMAgent"
# → STATE: 4 RUNNING, PID: 4856

# Registry state at session start
Get-ItemProperty "HKLM:\SOFTWARE\GuruRMM"
# → Version: 0.6.2 (stale — binary was actually 0.6.4)
# → SiteId: d008c7d4-9e5e-4666-9fa0-b432609d54cc
# → AgentKey: agk_ybg4Ty6zXU_2Ee0ddlUUtuZdz0B9Qw4_

# SCM recovery (was 60s, changed to 10s in code)
sc.exe qfailure "GuruRMMAgent"
# → FAILURE_ACTIONS: RESTART/60000/RESTART/60000/RESTART/60000 (old config)

# Watchdog not installed
sc.exe queryex "GuruRMMWatchdog"
# → [SC] OpenService FAILED 1060 (service does not exist)

# Log entries confirming the two overnight updates
# 2026-05-13T00:44:24Z  0.6.2 → 0.6.3 update started; restart at 00:44:25; agent back at 01:49:47 (65 min gap)
# 2026-05-13T03:04:28Z  0.6.3 → 0.6.4 update started; restart at 03:04:29; agent back at 06:31:27 (3h27m gap)
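
The restart gaps quoted in the log entries above can be re-derived from the timestamps. A minimal sketch, assuming both times fall on the same UTC day; the function names are illustrative only.

```rust
/// Parse "HH:MM:SS" into seconds since midnight; `None` on malformed input.
fn parse_hms(s: &str) -> Option<u32> {
    let mut parts = s.split(':').map(|p| p.parse::<u32>().ok());
    let (h, m, sec) = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() || h > 23 || m > 59 || sec > 59 {
        return None;
    }
    Some(h * 3600 + m * 60 + sec)
}

/// Gap between restart and reconnect in minutes, rounded to the nearest minute.
/// Returns `None` if either time is malformed or the end precedes the start.
fn gap_minutes(restart_at: &str, back_at: &str) -> Option<u32> {
    let (start, end) = (parse_hms(restart_at)?, parse_hms(back_at)?);
    let secs = end.checked_sub(start)?;
    Some((secs + 30) / 60)
}
```

For the two updates this gives 65 minutes (00:44:25 to 01:49:47) and 207 minutes, i.e. 3h27m (03:04:29 to 06:31:27).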

Pending / Incomplete Tasks

Code Review Agent result pending — launched as background task at session end. Do NOT build or deploy to the server until the review is clean. If issues are found, address them in a follow-up session.
Build and deploy — once code review is clean:
1. Push to Gitea gururmm repo (triggers build pipeline)
2. Server binary (v0.6.5 per coord API, Pending build+deploy) needs to be deployed first
3. After server deploy, agents will receive update push to new agent version
4. Verify on DESKTOP-0O8A1RL: registry version updated, backup cleaned, GuruRMMWatchdog installed
GuruRMMWatchdog not installed on existing endpoints — the co-install logic in service.rs triggers during the install flow, which existing enrolled agents won't re-run automatically. Options: (a) add a one-time watchdog install step to the update post-restart sequence, (b) run a remediation script via the dashboard command channel, or (c) accept that watchdog deploys on next MSI re-run.
9 bug tasks (#1–#9) need status updates — tasks filed in the task system but not yet marked complete; should be updated once code review passes and the build is deployed.
Bug #2 (Get-Service returns nothing) — root cause unresolved. The fix (replacing Get-Service with sc.exe everywhere in PS1 templates) is in, but the underlying service visibility issue should be investigated if time permits.
Server component state — gururmm/server is at state built v0.6.5 per coord API, pending deploy. gururmm/agents at state built v0.6.4. Both need deploy after build.

Reference Information

Coord API component states: GET http://172.16.3.30:8001/api/coord/status
Session plan file: C:\Users\guru\.claude\plans\ticklish-questing-stallman.md
Relevant source files:
- agent/src/updater/mod.rs — full update flow, rollback, restart logic
- agent/src/watchdog/monitor.rs — SCM health monitor, alert posting, server URL
- agent/src/watchdog/pipe.rs — IPC protocol, WatchdogCommand enum
- agent/src/transport/websocket.rs — WebSocket client, AuthAck handler, update trigger
- server/src/ws/mod.rs — server-side WS handler, authenticate(), AuthAck, update dispatch
- server/src/db/updates.rs — complete_update_by_agent() (was unhooked, now wired)
Key bug notes:
- PerformUpdate IPC: currently returns ok:false (not implemented in watchdog) — agent falls through to self-update. Future: watchdog implements stop+replace binary at staged_path+start.
- exit(1) is the final SCM safety net — SCM recovery fires within 10s (after config change deploys)
- AuthAckPayload.update_confirmed is Option<Uuid> with #[serde(default)] on agent side — backwards compatible with older servers
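
The fallback chain in the bug notes above reduces to a small decision table. The sketch below models it as a pure function; the type and function names are hypothetical, and the two booleans stand for the outcomes of the PerformUpdate and RestartMainService IPC calls.

```rust
/// Which component ends up restarting the agent service.
#[derive(Debug, PartialEq)]
enum RestartOwner {
    Watchdog,
    Scm,
}

#[derive(Debug, PartialEq)]
struct UpdatePlan {
    /// True when the agent falls through to its own self-update logic.
    agent_replaces_binary: bool,
    restart_owner: RestartOwner,
    /// Exit code of the agent process; non-zero triggers SCM recovery.
    exit_code: i32,
}

fn plan_update(perform_update_ok: bool, restart_main_service_ok: bool) -> UpdatePlan {
    if perform_update_ok {
        // Watchdog performs stop + replace + start; the agent exits cleanly.
        UpdatePlan { agent_replaces_binary: false, restart_owner: RestartOwner::Watchdog, exit_code: 0 }
    } else if restart_main_service_ok {
        // Agent self-updates, then the watchdog restarts the service.
        UpdatePlan { agent_replaces_binary: true, restart_owner: RestartOwner::Watchdog, exit_code: 0 }
    } else {
        // Both IPC paths failed: exit(1) so SCM recovery restarts the service.
        UpdatePlan { agent_replaces_binary: true, restart_owner: RestartOwner::Scm, exit_code: 1 }
    }
}
```

With the watchdog currently returning ok:false for PerformUpdate, only the second and third branches are reachable today.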

Update: 14:50 PT — Update channel selection + server v0.3.1 deploy

Session Summary

This update session began by confirming the v0.6.4 agent build (including all 9 auto-update reliability fixes and ensure_watchdog_running) had completed successfully on Pluto at 14:22 UTC. All Windows and Linux agent binaries were present in /var/www/gururmm/downloads/.

The main work was implementing the update channel selection feature: a stable/beta hierarchy allowing partners, clients, sites, and individual machines to opt in to beta releases without affecting production machines. The approved design was DB-column + UI surface. A Coding Agent implemented all server and dashboard changes — migration 026, resolve_agent_channel() DB function, channel-aware get_latest_version()/needs_update() in the scanner, three new PATCH /api/{agents,sites,clients}/:id/channel endpoints, GET /api/agents/:id/effective-channel, and an UpdateChannelSelector React component wired into AgentDetail, SiteDetail, and ClientDetail pages. Server compiled clean (68 warnings, all pre-existing).

Deployment encountered two problems. First, migration 026 was applied manually via psql before the server binary was deployed, leaving the _sqlx_migrations table without a tracking row. When the new server binary started, sqlx tried to run migration 026 again and hit "column already exists". Resolution: deleted the bad row, changed the migration to use ADD COLUMN IF NOT EXISTS, pushed c1b8b80, and rebuilt. Second, the build-server.sh script always appends to the shared build log, and the background monitor's grep pattern matched an old "failed to start" line from the first attempt — causing a false "completed" notification. The second rebuild is currently running.

The build-agents.sh script was also updated to support a --channel flag. When invoked with --channel beta, it writes a .channel sidecar file ("beta") alongside each binary; stable builds remove any existing sidecar. This is the production mechanism for tagging beta releases. Backup build-agents.sh.bak-pre-channel was created before the edit.

Key Decisions

Never manually pre-apply migrations — sqlx owns migration state through _sqlx_migrations. Manually running psql diverges the tracking table from the schema, causing checksum failures on the next startup. Correct procedure: let the server binary apply its own migrations on startup. If pre-applying is necessary (e.g., zero-downtime column add), always make migrations idempotent with IF NOT EXISTS from the start.
build-agents.sh --channel beta writes sidecar, stable builds remove it — cleanup on stable builds prevents stale .channel files from a prior beta build being mistaken for a beta tag on the stable binary.
Channel resolution: agent → site → client → "stable" — three-level inheritance. resolve_agent_channel() uses a single JOIN query. Beta channel gets the absolute latest binary (any channel tag); stable gets only binaries tagged "stable" (no sidecar = stable).
All new DB queries use sqlx::query() not sqlx::query!() macros — avoids needing cargo sqlx prepare after each new query. The offline cache (server/.sqlx/) only needs updating for compile-time query! macros. This is the pattern all new code should follow to simplify builds.
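The channel inheritance and version selection rules above can be sketched in memory. The real resolve_agent_channel() is a single JOIN query; the types and function names below are hypothetical and only model the precedence and filtering rules as described.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Channel {
    Stable,
    Beta,
}

/// (major, minor, patch); tuples compare lexicographically.
type Version = (u32, u32, u32);

struct AvailableVersion {
    version: Version,
    channel: Channel,
}

/// agent -> site -> client -> "stable": the first explicit setting wins.
fn resolve_channel(agent: Option<Channel>, site: Option<Channel>, client: Option<Channel>) -> Channel {
    agent.or(site).or(client).unwrap_or(Channel::Stable)
}

/// Beta agents get the newest binary of any tag; stable agents only see
/// binaries tagged stable (no sidecar file means stable).
fn latest_for(channel: Channel, versions: &[AvailableVersion]) -> Option<Version> {
    versions
        .iter()
        .filter(|v| channel == Channel::Beta || v.channel == Channel::Stable)
        .map(|v| v.version)
        .max()
}
```

An explicit stable setting on an agent overrides a beta setting on its site, which is what lets a single production machine stay pinned while the rest of a site tests a beta.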

Problems Encountered

sqlx migration conflict (column already exists): Manually applied migration 026 via psql left the _sqlx_migrations table without an entry for version 26. On server startup, sqlx attempted to run migration 026, hit "column already exists". Resolution: deleted the invalid row, updated migration to use IF NOT EXISTS, rebuilt.
Background monitor false completion: The until grep -q 'Server build complete\|failed to start' polling loop matched an old "failed to start" line from the first server build attempt in the shared log file. The second rebuild was still in progress when the monitor fired. Resolution: used a line-count guard in a new background wait command; second build still in progress at save time.
Server restart loop during failed deploy: The first v0.3.1 binary (with the migration conflict) caused systemd to restart the process repeatedly. The old binary was already overwritten, so the server was down until the second build completed. No data loss; agents reconnect automatically.
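The line-count guard used for the build monitor can be illustrated with a sketch. The session used a shell polling loop; this Rust version only shows the idea of ignoring log lines that predate the current build. The two marker strings come from the log, and everything else is assumed.

```rust
#[derive(Debug, PartialEq)]
enum BuildStatus {
    Running,
    Complete,
    Failed,
}

/// Scan only the lines appended after `baseline_lines`, where the baseline is
/// the line count of the shared log recorded just before the build started.
fn status_since(log: &str, baseline_lines: usize) -> BuildStatus {
    for line in log.lines().skip(baseline_lines) {
        if line.contains("Server build complete") {
            return BuildStatus::Complete;
        }
        if line.contains("failed to start") {
            return BuildStatus::Failed;
        }
    }
    BuildStatus::Running
}
```

With a baseline of 0 the stale "failed to start" line from an earlier attempt would match immediately, which is the false notification described above.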

Configuration Changes

Committed to gururmm repo (4035b5c, 3df5880, c1b8b80):

server/migrations/026_update_channels.sql — new; ADD COLUMN IF NOT EXISTS update_channel TEXT CHECK (...) on clients, sites, agents
server/src/db/updates.rs — added resolve_agent_channel(), set_agent_channel(), set_site_channel(), set_client_channel()
server/src/updates/scanner.rs — AvailableVersion.channel field; .channel sidecar file read; get_latest_version/needs_update accept channel: &str
server/src/ws/mod.rs — both needs_update call sites (connect + heartbeat) now resolve agent channel first
server/src/api/agents.rs — trigger_update uses agent effective channel; added set_agent_channel_handler, get_effective_channel_handler
server/src/api/sites.rs — added set_site_channel_handler
server/src/api/clients.rs — added set_client_channel_handler
server/src/api/mod.rs — registered 4 new routes with patch routing
server/Cargo.toml — version bumped 0.3.0 → 0.3.1
dashboard/src/api/client.ts — UpdateChannel type, EffectiveChannel interface, channelApi, update_channel field on Agent/Site/Client
dashboard/src/components/UpdateChannelSelector.tsx — new component: inherit/stable/beta selector with effective-channel label
dashboard/src/pages/AgentDetail.tsx — added UpdateChannelSelector to info section
dashboard/src/pages/SiteDetail.tsx — added UpdateChannelSelector
dashboard/src/pages/ClientDetail.tsx — added UpdateChannelSelector

Modified on server directly (not in git):

/opt/gururmm/build-agents.sh — added --channel arg parsing; writes .channel sidecar for beta builds; backup at .bak-pre-channel

Applied to production DB:

Migration 026 applied (columns already exist; the version-26 tracking row was deleted and will be re-inserted cleanly on next startup with the idempotent migration)

Dashboard:

Built and deployed to /var/www/gururmm/dashboard/ from /home/guru/gururmm/dashboard/dist/

Infrastructure & Servers

GuruRMM server: 172.16.3.30:3001 (Rust/Axum) — v0.3.1 binary deployed, second build in progress (startup was failing due to migration conflict; now resolved with IF NOT EXISTS)
Build pipeline: /opt/gururmm/build-server.sh — builds server on 172.16.3.30 directly; /opt/gururmm/build-agents.sh — builds agents (Linux on server, Windows on Pluto 172.16.3.36)
Downloads: /var/www/gururmm/downloads/ — v0.6.4 agent binaries (all platforms) present and ready
Dashboard: /var/www/gururmm/dashboard/ (nginx-served) — updated with channel selector UI

Commands & Outputs

# Apply migration 026 (manual, before server had IF NOT EXISTS)
psql 'postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm' \
  -f /home/guru/gururmm/server/migrations/026_update_channels.sql
# → ALTER TABLE x3

# Build and deploy dashboard
cd /home/guru/gururmm/dashboard && npm run build
# → 2559 modules, dist/assets/index-*.js 1082kB, built in 11s
sudo cp -r /home/guru/gururmm/dashboard/dist/* /var/www/gururmm/dashboard/

# Delete bad sqlx migration row (checksum mismatch fix)
echo "DELETE FROM _sqlx_migrations WHERE version = 26;" | \
  psql postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm -f /dev/stdin
# → DELETE 1

# Beta build usage (once channel feature is deployed):
sudo /opt/gururmm/build-agents.sh --channel beta

# Trigger server rebuild
sudo bash -c 'nohup /opt/gururmm/build-server.sh >> /var/log/gururmm-build.log 2>&1 &'

Pending / Incomplete Tasks

Server v0.3.1 second build in progress — build-server.sh running at save time. Monitor when it completes: tail -20 /var/log/gururmm-build.log. If it shows "Server build complete: v0.3.1", verify systemctl is-active gururmm-server returns active.
Watchdog not installed on DESKTOP-0O8A1RL — GuruRMMWatchdog service still not present. Will self-install via ensure_watchdog_running() when the agent receives and applies the v0.6.4 update push. Agent will receive the update push once the server is back online.
SQLX issue needs permanent fix — see user question: "Can the SQLX issue be permanently resolved?" The fix is to always write ADD COLUMN IF NOT EXISTS in migrations and to never manually pre-apply them via psql. Additionally, set up a cargo sqlx prepare step in the build pipeline for any future migrations using query!() macros.
Verify DESKTOP-0O8A1RL watchdog after agent update — once server is running and agent updates, confirm: sc.exe queryex GuruRMMWatchdog shows RUNNING, registry version updated, backup file cleaned.
9 bug tasks (#1–#9) still need TickTick status updates — not yet marked complete.

Reference Information

Commits: bdb751b (bugs), e52ee19 (watchdog self-install), 5b43fe6 (build fixes), 4035b5c (channel feature), 3df5880 (version bump), c1b8b80 (idempotent migration)
Server DB: postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm
Build log: /var/log/gururmm-build.log (shared between agent and server builds)
New API endpoints:
- PATCH /api/agents/:id/channel — set agent channel
- PATCH /api/sites/:id/channel — set site channel
- PATCH /api/clients/:id/channel — set client channel
- GET /api/agents/:id/effective-channel — resolved channel + source
Scanner sidecar convention: <binary-filename>.channel containing "beta"; absent = stable
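
Reading the sidecar convention on the scanner side might look like the sketch below. The function names are hypothetical, and treating any content other than "beta" as stable is an assumption; the log only specifies "beta" and the absent-file case.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// `<binary-filename>.channel`, next to the binary itself.
fn sidecar_path(binary: &Path) -> PathBuf {
    let mut name = binary.as_os_str().to_owned();
    name.push(".channel");
    PathBuf::from(name)
}

/// "beta" when the sidecar exists and contains "beta"; otherwise "stable".
fn channel_for(binary: &Path) -> &'static str {
    match fs::read_to_string(sidecar_path(binary)) {
        Ok(text) if text.trim() == "beta" => "beta",
        // Missing file, unreadable file, or unknown content: assume stable.
        _ => "stable",
    }
}
```

Appending to the full file name (rather than swapping the extension) keeps the version suffix and .exe intact in the sidecar name.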

Update: 11:50 PT — IPC pipe DACL fix, CORS fix, dashboard clients bug

Session Summary

This session was a continuation from a prior context window. The primary carried-over issue was the Windows agent system tray showing "disabled" — the tray icon connects to the agent service via a named pipe (\\.\pipe\gururmm-agent) created by SYSTEM, but user-session processes (non-elevated tray) were getting Access Denied when trying to open the pipe client end.

Five builds (v0.6.8 through v0.6.12) were required to land the DACL fix. v0.6.8 tried security_descriptor() on ServerOptions (method doesn't exist). v0.6.9 tried SetKernelObjectSecurity post-creation with DACL_SECURITY_INFORMATION (compiled, but failed at runtime with 0x80070005 — Access Denied — because the tokio pipe handle lacks WRITE_DAC). v0.6.10 confirmed that approach is fundamentally broken: CreateNamedPipeW with PIPE_ACCESS_DUPLEX does not grant WRITE_DAC, so post-creation DACL modification is not possible via that handle. v0.6.11 switched to CreateNamedPipeW with a SECURITY_ATTRIBUTES struct holding a NULL DACL, but hit a compile error because PIPE_ACCESS_DUPLEX and PIPE_TYPE_BYTE are not exported by the windows 0.58 crate from Win32::System::Pipes. v0.6.12 resolved this by using raw integer literals with the correct wrapper types: FILE_FLAGS_AND_ATTRIBUTES(0x00000003 | 0x40000000) and NAMED_PIPE_MODE(0). That build compiled and the tray connected successfully ("IPC: tray client connected" in agent log).

A Firefox CORS bug was identified from a browser console export. The server was sending Access-Control-Allow-Headers: * (wildcard), and Firefox enforces that the Authorization header must be explicitly listed — the wildcard does not cover it. This caused Firefox to block API requests with warnings (escalating to errors in newer Firefox). The fix changed the Axum CORS layer from .allow_headers(Any) to .allow_headers([AUTHORIZATION, CONTENT_TYPE, ACCEPT]). A concurrent build error in server/src/ws/mod.rs was also fixed: fail_agent_update expects Option<&str> for its error message parameter, but the call site was passing &str directly. Both fixes were deployed as server v0.3.1.
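
The wildcard behaviour behind the CORS bug can be modeled in a few lines. This is a simplified sketch of the browser-side check, not Axum or Firefox code: it ignores credentials mode and only captures the rule that Authorization must be listed by name.

```rust
/// Simplified model of the Access-Control-Allow-Headers check.
/// `allow_headers` is the parsed response header value; `requested` is one
/// header the page wants to send.
fn header_allowed(allow_headers: &[&str], requested: &str) -> bool {
    let explicit = allow_headers
        .iter()
        .any(|h| h.eq_ignore_ascii_case(requested));
    let wildcard = allow_headers.iter().any(|h| *h == "*");
    // The wildcard never covers Authorization; it has to appear explicitly.
    explicit || (wildcard && !requested.eq_ignore_ascii_case("authorization"))
}
```

Under this model the old wildcard response rejects Authorization, while the explicit three-header list accepts it.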

The dashboard was then reporting no clients at all. Investigation showed the server API returning 200 with 0 ms latency for /api/clients — indicating an immediate empty-list return with no database query. The root cause was in authz/permissions.rs: accessible_client_ids() only returns None (meaning all clients) for the dev_admin role, but all four users in the production database have role = "admin" (the legacy superuser role predating the multi-tenant system). With admin role and no org memberships in the JWT, the function returns Some([]), triggering the early-exit branch. The fix added is_admin() to AuthContext covering both admin and dev_admin, and updated accessible_client_ids(), can_access_org(), is_org_admin(), and all the org management permission methods to use it. Deployed as a server rebuild (same v0.3.1 version number, uptime reset to 27s confirmed via /status).
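
The permission bug and its fix can be sketched with a reduced AuthContext. The method names follow the log; the Role::Member variant, the field layout, and the u32 client IDs are hypothetical simplifications.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    DevAdmin,
    Admin,  // legacy superuser role
    Member, // hypothetical non-admin role for illustration
}

struct AuthContext {
    role: Role,
    /// Client IDs reachable through org memberships in the JWT.
    org_client_ids: Vec<u32>,
}

impl AuthContext {
    fn is_admin(&self) -> bool {
        matches!(self.role, Role::Admin | Role::DevAdmin)
    }

    /// Behaviour before the fix: only dev_admin is unrestricted.
    fn accessible_client_ids_before_fix(&self) -> Option<Vec<u32>> {
        if self.role == Role::DevAdmin { None } else { Some(self.org_client_ids.clone()) }
    }

    /// After the fix: `None` (all clients) for both admin roles.
    fn accessible_client_ids(&self) -> Option<Vec<u32>> {
        if self.is_admin() { None } else { Some(self.org_client_ids.clone()) }
    }
}
```

A legacy admin with no org memberships gets Some([]) from the old method, which is the empty list the dashboard was showing.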

Key Decisions

CreateNamedPipeW + SECURITY_ATTRIBUTES at creation time, not SetKernelObjectSecurity post-creation — SetKernelObjectSecurity requires WRITE_DAC access on the handle, which tokio's CreateNamedPipeW-backed NamedPipeServer does not have. The only correct approach is to pass the security descriptor at pipe creation time.
Raw integer literals for windows 0.58 pipe constants — PIPE_ACCESS_DUPLEX and PIPE_TYPE_BYTE are simply not re-exported by the windows crate 0.58 from Win32::System::Pipes. Using FILE_FLAGS_AND_ATTRIBUTES(0x00000003 | 0x40000000) and NAMED_PIPE_MODE(0) directly is correct and stable — the Win32 ABI values do not change.
admin role treated as full superuser in all permission checks — the admin role is explicitly documented as the legacy superuser before dev_admin was introduced. Restricting it to org-membership-based access is a regression. All production users are admin. The correct fix is to treat admin the same as dev_admin in is_admin(), not to forcibly re-issue JWTs or update the DB.
Tray binary crash investigation deferred — the tray was crashing/exiting every 30 seconds (detected at the 30-second reap_dead poll boundary). Root cause is likely the OLD tray binary on enrolled machines (never updated by the agent auto-updater, which only updates the agent binary). The gururmm-tray-windows-amd64-{version}.exe is built and deployed to the downloads server but never pushed to agents. This is a separate deferred issue — auto-update flow needs to also update the tray binary.
Stale build lock must be cleared manually — /var/run/gururmm-build.lock is left by zombie build processes after failures. Added rm -f /var/run/gururmm-build.lock before each build trigger. The build script should defensively remove stale locks on startup (future fix).
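The raw literals in the pipe-constants decision map to named Win32 values, as the sketch below spells out. The two newtypes are local stand-ins that mimic the wrapper types named in the log, so the snippet compiles without the windows crate; the function names are hypothetical.

```rust
// Win32 values from the SDK headers.
const PIPE_ACCESS_DUPLEX: u32 = 0x0000_0003;
const FILE_FLAG_OVERLAPPED: u32 = 0x4000_0000;
const PIPE_TYPE_BYTE: u32 = 0x0000_0000;
const PIPE_READMODE_BYTE: u32 = 0x0000_0000;
const PIPE_WAIT: u32 = 0x0000_0000;

/// Local stand-in for the windows crate's FILE_FLAGS_AND_ATTRIBUTES wrapper.
#[derive(Debug, PartialEq)]
struct FileFlagsAndAttributes(u32);

/// Local stand-in for the windows crate's NAMED_PIPE_MODE wrapper.
#[derive(Debug, PartialEq)]
struct NamedPipeMode(u32);

/// Open mode: duplex access plus overlapped (async) I/O.
fn pipe_open_mode() -> FileFlagsAndAttributes {
    FileFlagsAndAttributes(PIPE_ACCESS_DUPLEX | FILE_FLAG_OVERLAPPED)
}

/// Pipe mode: byte stream, byte read mode, blocking wait. All three are zero.
fn pipe_mode() -> NamedPipeMode {
    NamedPipeMode(PIPE_TYPE_BYTE | PIPE_READMODE_BYTE | PIPE_WAIT)
}
```

Naming the constants locally keeps the intent readable even when the crate does not export them from the expected module.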

Problems Encountered

SetKernelObjectSecurity fails at runtime with 0x80070005 — post-creation DACL modification is impossible on a tokio pipe handle. Required full approach switch to CreateNamedPipeW with SECURITY_ATTRIBUTES.
PIPE_ACCESS_DUPLEX not in windows 0.58 — constant exists in Win32 SDK headers but is not re-exported by the crate. Used FILE_FLAGS_AND_ATTRIBUTES(0x00000003 | 0x40000000) as the solution.
Stale /var/run/gururmm-build.lock — blocked both the agent build and server build triggers. Had to sudo rm -f before each trigger.
fail_agent_update type mismatch — function signature was Option<&str> but call site in ws/mod.rs passed raw &str. Caught as a compile error during the CORS server build.
All dashboard users have role = "admin" — the production DB was seeded before the dev_admin role was introduced. The permission system assumed admin users would have dev_admin. Every API endpoint that gate-checked is_dev_admin() only was effectively broken for the entire production user base.

Configuration Changes

Modified in gururmm repo (committed and pushed):

agent/src/ipc.rs — create_server_pipe() rewritten using CreateNamedPipeW + NULL DACL SECURITY_ATTRIBUTES; added windows::Win32::System::Pipes and Win32::Storage::FileSystem feature usage
agent/Cargo.toml — version 0.6.11 → 0.6.12; added Win32_System_Pipes and Win32_Storage_FileSystem to windows feature list
server/src/main.rs — CORS: .allow_headers(Any) → .allow_headers([AUTHORIZATION, CONTENT_TYPE, ACCEPT]); added use axum::http::header::{ACCEPT, AUTHORIZATION, CONTENT_TYPE}
server/src/ws/mod.rs — fail_agent_update call: "send_to failed: ..." → Some("send_to failed: ...")
server/src/authz/permissions.rs — added is_admin() method; accessible_client_ids(), can_access_org(), is_org_admin(), can_set_org_limits(), can_impersonate(), can_create_org(), can_delete_org() all updated to use is_admin()
server/src/api/organizations.rs — all auth.is_dev_admin() calls replaced with auth.is_admin()

Deployed to production:

Agent v0.6.12 binary on download server, update scanner dispatching to all 48 enrolled agents
Server v0.3.1 (two builds: first fixed CORS + ws/mod.rs; second fixed permissions)

Credentials & Secrets

GuruRMM DB (production): postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm (rediscovered during permissions investigation — matches prior session log)

Infrastructure & Servers

GuruRMM server: 172.16.3.30:3001 — v0.3.1, uptime ~27s at last check (18:10 UTC restart)
Build servers: 172.16.3.30 (Linux agent + server); 172.16.3.36 Pluto (Windows agent + tray + MSI)
Downloads: /var/www/gururmm/downloads/ — gururmm-agent-windows-amd64-0.6.12.exe + tray binary present
Enrolled agents: 48 total, 10 online, 38 offline (at 18:10 UTC)
Tray binary on enrolled machines: OLD version (never auto-updated) — crashes ~30s after connecting to new NULL-DACL pipe

Commands & Outputs

# Check guruRMM DB users (diagnosed empty dashboard)
PGPASSWORD=43617ebf7eb242e814ca9988cc4df5ad psql -U gururmm -h localhost -d gururmm \
  -c 'SELECT id, email, role FROM users;'
# → All 4 users have role="admin" (not dev_admin)

# Verify clients table has data
PGPASSWORD=... psql ... -c 'SELECT COUNT(*) FROM clients;'
# → 15

# Check server status after rebuild
curl -s https://rmm-api.azcomputerguru.com/status
# → {"version":"0.3.1","uptime_seconds":27,"agents":{"total":48,"online":10},...}

# Clear stale build lock
sudo rm -f /var/run/gururmm-build.lock

# Trigger server rebuild
nohup sudo bash /opt/gururmm/build-server.sh > /tmp/server-build-$$.log 2>&1 &

Pending / Incomplete Tasks

Tray binary never auto-updated — gururmm-tray.exe in agent install dirs on enrolled machines is the original install version. The auto-updater only replaces the agent binary. Fix: extend the update flow to also download and install the new tray binary alongside the agent binary. New server UpdatePayload field + agent-side tray download step needed.
Tray crash root cause — the tray exits ~30s after connecting (detected at the 30-second reap_dead poll). Likely old binary crashing after connecting to the now-accessible pipe. Unconfirmed — no tray crash log (windows_subsystem = "windows" swallows panics). Adding file logging to the tray is the correct diagnostic step.
/var/run/gururmm-build.lock stale lock protection — build script should rm -f the lock at startup to avoid manual intervention on the next build failure. Simple one-line addition to build-agents.sh and build-server.sh.
Plan file bugs #1–#9 — the ticklish-questing-stallman plan still has unfinished items (registry version write, watchdog co-install, etc.) that were not addressed this session since the session focused on the tray pipe DACL fix and the production dashboard break.

Reference Information

Commits (gururmm repo):
- a6b3174 — fix: grant legacy admin role full access in permission checks
- 8c7380c — fix(server): wrap fail_agent_update error_message in Some()
- f3d0cc0 — fix(server): explicitly list Authorization in CORS allow_headers
- 47128b5 — fix(ipc): use FILE_FLAGS_AND_ATTRIBUTES raw values
- 57ff059 — fix(ipc): create pipe with NULL DACL via CreateNamedPipeW+SECURITY_ATTRIBUTES
Key source files:
- agent/src/ipc.rs:156 — create_server_pipe() with NULL DACL (native-service feature)
- server/src/authz/permissions.rs:35 — is_admin() method
- server/src/api/clients.rs:49 — accessible_client_ids() usage (empty-list path)
Production DB connection: postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm
Tray binary download server path: /var/www/gururmm/downloads/gururmm-tray-windows-amd64-{version}.exe

Update: ~17:00 PT — Policy UI defaultHint fix + full end-to-end policy wiring

Session Summary

This session completed two major policy system items. First, a UI fix: the policy editor's "Default (ON/OFF)" radio hints were hardcoded strings that didn't reflect the actual system default policy data. After the system default policy was updated in the prior session to disable all metrics and auto-update while keeping tray settings, the hints still showed "Default (ON)" for disabled fields. The fix added a systemDefaults?: PolicyData prop to PolicySectionEditor, computed a hints map from the live system default policy data using a boolHint() helper, and replaced all 16 hardcoded defaultHint="ON/OFF" values with hints.xxx references. The system default policy data is fetched via policiesApi.getSystemDefault() (already in the page's queries) and passed down through PolicyDetail → PolicySectionEditor. Dashboard was built and deployed (commit 287d106).

Second, a full audit of the policy system revealed that agents never receive policy data despite the infrastructure being in place. The ConfigUpdate WebSocket message existed in the protocol but was never sent by the server; AppState had no policy field; every subsystem (metrics, tray, updates, watchdog) used hardcoded defaults. The only wired section was thresholds, which are evaluated server-side when metrics arrive.

An implementation plan was approved covering 10 files across both the agent and server crates. A Coding Agent (Opus, 14 minutes) implemented all changes:
- expanded ConfigUpdatePayload from 2 fields to 4 nested section structs covering all policy sections
- added AppState::effective_policy: RwLock<ConfigUpdatePayload>
- replaced the ConfigUpdate stub with a real handler that stores policy and forwards the watchdog section via IPC
- changed the fixed-interval metrics loop to a sleep-based loop that reads the interval and collect flags from policy on each iteration
- added collect_with_flags() to MetricsCollector
- wired GetPolicy IPC to read from AppState instead of default_permissive()
- added UpdateConfig to WatchdogCommand and handled it in the monitor with runtime config
- added a server-side policy_to_agent_config() helper in a new server/src/policy/config_update.rs
- wired ConfigUpdate dispatch after AuthAck in server/src/ws/mod.rs
- gated auto-update push on updates.auto_update from effective policy
- added push_config_update_to_affected(), called from both the assignment create and delete endpoints
Both crates built clean: the agent via cargo check on the build server, the server via build-server.sh (68 warnings, all pre-existing). Deployed as server v0.3.1 (commit 78b6831).
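
The sleep-based metrics loop and collect_with_flags() described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the agent's actual code: the MetricsPolicy and Sample structs, their field names, the stand-in collect() values, and the tick() helper are all assumptions made for the example, and the real agent most likely uses an async runtime rather than blocking primitives.

```rust
use std::sync::RwLock;
use std::time::Duration;

/// Hypothetical slice of the effective policy that the metrics loop cares about.
#[derive(Debug, Clone)]
pub struct MetricsPolicy {
    pub interval_secs: u64,
    pub collect_cpu: bool,
    pub collect_memory: bool,
}

/// Hypothetical metrics payload (the real one has many more fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub cpu_percent: f32,
    pub memory_percent: f32,
}

/// Stand-in for the existing single-pass collector; values are fixed for illustration.
pub fn collect() -> Sample {
    Sample { cpu_percent: 42.0, memory_percent: 63.5 }
}

/// Collect everything in one pass, then zero the fields the policy disables.
pub fn collect_with_flags(policy: &MetricsPolicy) -> Sample {
    let mut sample = collect();
    if !policy.collect_cpu {
        sample.cpu_percent = 0.0;
    }
    if !policy.collect_memory {
        sample.memory_percent = 0.0;
    }
    sample
}

/// One loop iteration: snapshot the policy, collect, and return the sample
/// together with the delay to sleep before the next iteration.
pub fn tick(policy: &RwLock<MetricsPolicy>) -> (Sample, Duration) {
    // Clone under the read lock so the lock is not held while collecting.
    let snapshot = policy.read().unwrap().clone();
    let sample = collect_with_flags(&snapshot);
    (sample, Duration::from_secs(snapshot.interval_secs))
}
```

A caller would loop over tick(), send the sample, and sleep for the returned duration. Because the policy is re-read on every iteration, a ConfigUpdate that changes the interval takes effect on the next cycle without restarting the loop.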

Key Decisions

hints computed inside PolicySectionEditor from systemDefaults prop — avoids threading per-field hint strings through the prop chain; one boolHint() call per field at render time. When systemDefaults is undefined (before query resolves), all hints default to "ON" to match the system default seeded in migration 027.
systemDefaultData passed to both <PolicyDetail> call sites — the system default detail view passes systemDefault.policy_data to itself (showing its own values as hints); regular policies pass systemDefault?.policy_data (reflecting the true baseline).
Agent collect_with_flags() zeroes disabled fields rather than skipping collection — avoids restructuring the single-pass collect() method. Disabled metrics send 0/None in the payload; server receives them but evaluates thresholds against 0, which never triggers alerts (effectively disabled). Simplest correct approach without breaking the existing collect architecture.
Server-side mirror structs in policy/config_update.rs, not re-using agent types — agent and server are separate crates. The server serializes AgentConfigUpdate to JSON; the agent deserializes to ConfigUpdatePayload. JSON field names match exactly. Mirror structs keep the crates fully decoupled.
Single get_effective_policy() call per connect, result reused — computed once after registration, used for both ConfigUpdate dispatch and auto-update gating. Avoids two round-trips to the DB on every agent connect.
push_config_update_to_affected() spawned async (non-blocking) — assignment change response returns immediately; push happens in background. If the push fails (agent disconnected between check and send), it's a no-op — next connect will receive the current policy.
Watchdog UpdateConfig applies per-field (None = no change) — consistent with the rest of the ConfigUpdatePayload design; partial updates won't reset unrelated watchdog fields.
Watchdog interval floored at 5 seconds — prevents a policy misconfiguration from creating a CPU-spinning tight loop in the watchdog process.
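
The last two decisions (per-field UpdateConfig and the 5-second floor) might look something like the sketch below. The struct names WatchdogConfigUpdate and WatchdogRuntimeConfig come from the notes above, but the field names, types, and the apply() method are assumptions made for illustration and may not match the real code.

```rust
use std::time::Duration;

/// Lowest check interval the watchdog accepts (the 5-second floor).
const MIN_INTERVAL_SECS: u64 = 5;

/// Partial update: `None` means "leave this field unchanged".
/// Field names here are hypothetical.
#[derive(Debug, Default, Clone)]
pub struct WatchdogConfigUpdate {
    pub enabled: Option<bool>,
    pub check_interval_secs: Option<u64>,
    pub services: Option<Vec<String>>,
}

/// Runtime state held by the watchdog monitor (fields are hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct WatchdogRuntimeConfig {
    pub enabled: bool,
    pub check_interval: Duration,
    pub services: Vec<String>,
}

impl WatchdogRuntimeConfig {
    /// Apply only the fields present in the update, clamping the interval
    /// so a misconfigured policy cannot produce a tight loop.
    pub fn apply(&mut self, update: WatchdogConfigUpdate) {
        if let Some(enabled) = update.enabled {
            self.enabled = enabled;
        }
        if let Some(secs) = update.check_interval_secs {
            self.check_interval = Duration::from_secs(secs.max(MIN_INTERVAL_SECS));
        }
        if let Some(services) = update.services {
            self.services = services;
        }
    }
}
```

With this shape, an update carrying only check_interval_secs: Some(1) leaves enabled and services untouched and results in a 5-second interval.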

Problems Encountered

Coding Agent (first attempt) timed out at 18 minutes with no files written; it was still in its read phase. Work resumed via a fresh agent invocation with the same detailed prompt. The second attempt (Opus model, 14 minutes) completed all 11 file changes.
cargo not on local Windows PATH — build had to be done remotely via SSH to the build server. The agent cargo check runs on Linux (non-Windows target) but catches type/import errors; Windows-specific conditional compilation (#[cfg(windows)]) is not checked but those paths are structurally unchanged.
Build triggered remotely after user corrected local build attempt — user noted builds should go to the build server, not local. Correct pattern: push to Gitea, SSH sudo /opt/gururmm/build-server.sh.

Configuration Changes

Committed to gururmm repo:

287d106 — fix: policy editor defaultHint reflects actual system default values

dashboard/src/pages/Policies.tsx — PolicySectionEditorProps gains systemDefaults?: PolicyData; boolHint()/hints map computed from it; all 16 hardcoded defaultHint strings replaced with dynamic refs; PolicyDetailProps gains systemDefaultData?: PolicyData; both <PolicyDetail> call sites pass it

78b6831 — feat: wire agent policy end-to-end (metrics, tray, updates, watchdog)

agent/src/transport/mod.rs — ConfigUpdatePayload expanded to 4 nested section structs; WatchdogConfigUpdate gains services/processes fields
agent/src/main.rs — AppState::effective_policy: RwLock<ConfigUpdatePayload> added
agent/src/transport/websocket.rs — ConfigUpdate handler implemented; metrics loop converted to sleep-based with per-iteration policy read
agent/src/metrics/mod.rs — collect_with_flags() method added
agent/src/ipc.rs — TrayPolicy::from_config_update() added; GetPolicy reads from AppState
agent/src/watchdog/pipe.rs — WatchdogCommand::UpdateConfig variant added
agent/src/watchdog/monitor.rs — WatchdogRuntimeConfig struct; UpdateConfig handler; enabled gate; configurable interval
server/src/policy/config_update.rs — NEW: AgentConfigUpdate mirror structs + policy_to_agent_config() helper
server/src/policy/mod.rs — pub mod config_update added
server/src/ws/mod.rs — ConfigUpdate sent after AuthAck; auto-update gated on policy
server/src/api/policies.rs — push_config_update_to_affected() helper; called from assign_policy and remove_assignment

Infrastructure & Servers

GuruRMM server: 172.16.3.30:3001 — v0.3.1 rebuilt and deployed (build completed 00:00:31 UTC 2026-05-14)
Dashboard: /var/www/gururmm/dashboard/ — updated with dynamic policy hints
Agent crate: cargo check clean on Linux (server); Windows build pending (Pluto)

Commands & Outputs

# Remote server build (the correct build pattern)
ssh -i C:/Users/guru/.ssh/id_ed25519 guru@172.16.3.30 "sudo /opt/gururmm/build-server.sh 2>&1"
# → Compiling gururmm-server v0.3.1
# → warning: unused import: ... (68 warnings, all pre-existing)
# → Finished release profile in 4m 07s
# → === Server build complete: v0.3.1 ===

# Agent cargo check on build server (Linux target)
ssh ... guru@172.16.3.30 "cd /home/guru/gururmm/agent && /home/guru/.cargo/bin/cargo check 2>&1 | tail -5"
# → warning: struct PipeServer is never constructed (pre-existing)
# → Finished dev profile in 7.18s

# Deploy dashboard
cd D:/claudetools/projects/msp-tools/guru-rmm/dashboard && npm run build
scp -i C:/Users/guru/.ssh/id_ed25519 -r dist/* guru@172.16.3.30:/var/www/gururmm/dashboard/

Pending / Incomplete Tasks

Windows agent build pending — cargo check passed on Linux; full Windows cross-compile or Pluto build needed to confirm #[cfg(windows)] paths compile. Agent binary won't reflect policy wiring until built and deployed via build pipeline.
Network discovery feature — next work item, starting after this save.
Tray binary auto-update — still not implemented (carried over from prior session). Old tray binary on enrolled machines crashes ~30s after connecting to the NULL-DACL pipe.
Watchdog not installed on existing enrolled machines — ensure_watchdog_running() is in the agent but existing machines won't trigger re-install until they receive a new agent binary.

Reference Information

Commits: 287d106 (defaultHint fix), 78b6831 (policy wiring)
Key source files:
- agent/src/transport/mod.rs — ConfigUpdatePayload, MetricsConfigUpdate, TrayConfigUpdate, UpdatesConfigUpdate, WatchdogConfigUpdate
- agent/src/main.rs — AppState::effective_policy
- agent/src/transport/websocket.rs — ConfigUpdate handler + dynamic metrics loop
- agent/src/metrics/mod.rs — collect_with_flags()
- agent/src/ipc.rs — TrayPolicy::from_config_update(), GetPolicy handler
- agent/src/watchdog/monitor.rs — WatchdogRuntimeConfig, UpdateConfig handler
- server/src/policy/config_update.rs — policy_to_agent_config() (new file)
- server/src/ws/mod.rs — post-AuthAck ConfigUpdate dispatch + auto-update gate
- server/src/api/policies.rs — push_config_update_to_affected()
Policy wiring status after this session:
- Metrics: WIRED (interval + collect_* flags from policy)
- Thresholds: WIRED server-side (unchanged — already worked)
- Tray: WIRED (reads from AppState, no longer hardcoded)
- Updates: WIRED (auto_update gate added server-side)
- Watchdog: WIRED (UpdateConfig IPC + runtime config applied)
