GuruRMM Session — 2026-05-13
User
- User: Mike Swanson (mike)
- Machine: DESKTOP-0O8A1RL
- Role: admin
- Session span: ~06:00–13:00 local (approx)
Session Summary
The session started with a /sync (clean, no changes) followed by a request to diagnose an RMM
update failure on DESKTOP-0O8A1RL. Diagnosis proceeded by inspecting the running GuruRMMAgent
service, the install directory, the ProgramData log files, and the agent source code in the
local dev clone (projects/msp-tools/guru-rmm/).
Initial inspection found the agent binary reporting version 0.6.4 (via --version) while the
registry HKLM\SOFTWARE\GuruRMM\Version still read 0.6.2. The .old backup file from a prior
update remained uncleaned in the install directory, and a gururmm-agent.backup file persisted
in ProgramData. Logs revealed two back-to-back updates had occurred overnight (0.6.2→0.6.3 at
00:44 UTC, 0.6.3→0.6.4 at 03:04 UTC) with restart gaps of 65 minutes and 3 hours 27 minutes
respectively — far longer than any deliberate delay.
Deep reading of updater/mod.rs, watchdog/monitor.rs, watchdog/pipe.rs,
transport/websocket.rs, and server/src/ws/mod.rs produced a full causal chain: the Windows
service exits with code 0 after binary replacement (bypassing SCM recovery), the detached cmd
restart helper is likely killed by the Windows job object, the GuruRMMWatchdog service is not
installed on this machine (so the IPC path always fails), the rollback PS1 uses Get-Service
which silently returns null for GuruRMMAgent, the watchdog reads a non-existent agent.toml
for the server URL, and post-update cleanup (cancel_rollback_watchdog, cleanup_backup) is
never triggered because the server sends AuthAck before computing the update confirmation.
Nine bugs were filed as tracked tasks (#1–#9). A design discussion followed about whether to fix the current system (Option A) or replace it with MSI-based updates. The consensus was Option A, with the architectural direction that the watchdog will eventually take over as the primary updater while the main agent retains self-update as a permanent fallback. An approved plan was written and delegated to the Coding Agent, which implemented all changes across the 8 files listed under Configuration Changes. A Code Review Agent was launched in the background at session end; its result is pending.
Key Decisions
- Option A (fix 9 bugs) over MSI-based updates: the MSI approach is cleaner long-term but requires significant build pipeline changes. Option A ships in one sprint. MSI for Windows updates will be revisited in a future phase.
- Watchdog owns stop+replace+start; agent owns download+verify: when the watchdog eventually implements PerformUpdate, the IPC carries a local staged path, not a URL. This keeps the watchdog free of HTTP client logic and avoids duplicating the agent's download+checksum machinery.
- Main agent retains full self-update as permanent fallback: even with the watchdog as primary updater, the agent falls through to its own update logic if the IPC fails. Belt and suspenders.
- exit(1), not exit(0), as SCM fallback: when both IPC paths (PerformUpdate and RestartMainService) fail, the agent now exits with code 1 so SCM recovery fires within 10s. The IPC-success path still exits 0 (watchdog owns restart in that case).
- SCM recovery delay changed 60s → 10s: the original 60s delay was unnecessarily slow for the fallback case. 10s is aggressive enough to be useful without thrashing.
- PerformUpdate IPC variant added now (returns not-implemented): adding the command to the protocol now means no agent-side protocol change is needed when the watchdog implements it. The interface is locked; the implementation is deferred.
- complete_update_by_agent() was already written but never called: the DB function existed in server/src/db/updates.rs with exactly the needed signature. The fix was purely a wiring change in server/src/ws/mod.rs, not new code.
- Watchdog server URL is a compile-time constant, not a registry value: the watchdog is the same binary, compiled with the same GURURMM_SERVER_URL env var. Reading a TOML file was architecturally wrong, and the file didn't exist anyway. The fix uses option_env! directly.
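A minimal sketch of the compile-time URL decision, assuming only what the notes state (the GURURMM_SERVER_URL variable and option_env!). The constant name, the normalize() helper, and the trailing-slash handling are illustrative assumptions, not the actual monitor.rs code.

```rust
/// Server URL baked in at compile time from the GURURMM_SERVER_URL env var.
/// `option_env!` yields `None` when the variable was unset during the build.
pub const COMPILED_SERVER_URL: Option<&'static str> = option_env!("GURURMM_SERVER_URL");

/// Hypothetical shape of `server_api_url()`: no TOML file or registry read.
pub fn server_api_url() -> Option<String> {
    normalize(COMPILED_SERVER_URL)
}

/// Trims whitespace and a trailing slash; treats an empty value as missing.
/// (This normalization is an assumption made for the sketch.)
pub fn normalize(raw: Option<&str>) -> Option<String> {
    let url = raw?.trim().trim_end_matches('/');
    if url.is_empty() {
        None
    } else {
        Some(url.to_string())
    }
}
```

Because the value is resolved by the compiler, the agent and watchdog built from the same binary cannot disagree about the server URL at runtime.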
Problems Encountered
- Get-Service -Name "GuruRMMAgent" returns nothing: PowerShell silently fails to enumerate the service even with the exact name; sc.exe queryex finds it fine. Root cause unclear (likely a non-elevated session permission issue). Resolution: all PS1-based service checks in the codebase replaced with sc.exe query equivalents.
- Post-update cleanup path never reaches cleanup_backup(): the server was sending AuthAck before running the post-update check, so the agent never received any signal to clean up. Resolution: compute update_confirmed before building AuthAck; include it in the ack; agent acts on it in the AuthAck handler.
- Detached cmd restart killed by job object: Windows service processes are often placed in a job object with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE. When the service exits, child processes (the detached cmd) are killed before sc.exe start runs. Resolution: added the CREATE_BREAKAWAY_FROM_JOB (0x01000000) flag alongside CREATE_NO_WINDOW (0x08000000); combined value 0x09000000. Also added the exit(1) fallback so SCM recovery fires regardless.
- Code review still in progress at /save time: Code Review Agent was launched as a background task. Result pending. Changes should not be deployed until review is clean.
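The flag arithmetic from the job-object fix can be checked with a short sketch. The two constant values come from the Windows SDK headers; the spawn_restart_helper() function and its argument are hypothetical and only show where the combined value would be passed.

```rust
/// Win32 process creation flags (values from the Windows SDK headers).
pub const CREATE_BREAKAWAY_FROM_JOB: u32 = 0x0100_0000;
pub const CREATE_NO_WINDOW: u32 = 0x0800_0000;

/// Flags for the detached restart helper: no console window, and not tied
/// to the service's job object.
pub const RESTART_HELPER_FLAGS: u32 = CREATE_NO_WINDOW | CREATE_BREAKAWAY_FROM_JOB;

/// Spawns `cmd /C <script>` outside the service's job (Windows only).
/// Hypothetical helper; the real updater builds its own command line.
#[cfg(windows)]
pub fn spawn_restart_helper(script: &str) -> std::io::Result<std::process::Child> {
    use std::os::windows::process::CommandExt;
    std::process::Command::new("cmd")
        .args(["/C", script])
        .creation_flags(RESTART_HELPER_FLAGS)
        .spawn()
}
```

Breakaway only succeeds if the job permits it, so the spawn can still fail on some hosts; that is one reason the exit(1) fallback remains useful.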
Configuration Changes
Files modified
- agent/src/registry.rs: added write_version(), write_server_url(), read_server_url() with full platform stubs
- agent/src/service.rs: added registry version+URL write on startup; SCM recovery delay 60s→10s; watchdog co-installation in install_service()
- agent/src/updater/mod.rs: PerformUpdate IPC attempt (step 2.5 in do_update); 0x09000000 creation flags; exit(1) final fallback; sc.exe in PS1 rollback template replacing Get-Service
- agent/src/watchdog/pipe.rs: PerformUpdate variant added to WatchdogCommand enum (staged_path model)
- agent/src/watchdog/monitor.rs: read_server_api_url() replaced with server_api_url() using compile-time constant; PerformUpdate match arm added (logs + no-op, pipe server sends ok:false)
- agent/src/transport/mod.rs: update_confirmed: Option<Uuid> with #[serde(default)] added to AuthAckPayload
- agent/src/transport/websocket.rs: cleanup block in AuthAck handler: calls cancel_rollback_watchdog() + cleanup_backup() when update_confirmed is Some
- server/src/ws/mod.rs: update_confirmed: Option<Uuid> added to server-side AuthAckPayload; complete_update_by_agent() wired up before AuthAck is sent; old post-ack update check block removed
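The agent-side half of the AuthAck change can be sketched as below. This is a simplified stand-in, not the real handler: UpdateId replaces Uuid so no external crates are needed, the serde derives are omitted, and the UpdateCleanup trait and RecordingCleanup type are hypothetical scaffolding around the two cleanup calls named in the notes.

```rust
/// Stand-in for `Uuid` so the sketch needs no external crates.
pub type UpdateId = u128;

/// Simplified AuthAck payload; the real struct derives serde traits.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct AuthAckPayload {
    /// Set when the server has just marked this agent's pending update complete.
    pub update_confirmed: Option<UpdateId>,
}

/// Cleanup actions the agent runs after a confirmed update.
pub trait UpdateCleanup {
    fn cancel_rollback_watchdog(&mut self);
    fn cleanup_backup(&mut self);
}

/// Runs cleanup only when the ack carries a confirmation.
pub fn handle_auth_ack<C: UpdateCleanup>(ack: &AuthAckPayload, cleanup: &mut C) -> Option<UpdateId> {
    let id = ack.update_confirmed?;
    cleanup.cancel_rollback_watchdog();
    cleanup.cleanup_backup();
    Some(id)
}

/// Test double that records which cleanup steps ran, in order.
#[derive(Default)]
pub struct RecordingCleanup {
    pub calls: Vec<&'static str>,
}

impl UpdateCleanup for RecordingCleanup {
    fn cancel_rollback_watchdog(&mut self) {
        self.calls.push("cancel_rollback_watchdog");
    }
    fn cleanup_backup(&mut self) {
        self.calls.push("cleanup_backup");
    }
}
```

An ack from an older server simply has no update_confirmed value, so the handler returns early and nothing is cleaned up.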
Credentials & Secrets
None created or discovered this session.
Infrastructure & Servers
- GuruRMM server: 172.16.3.30:3001 (Rust/Axum)
- Build pipeline: Gitea push → webhook-handler.py (172.16.3.30:9000) → build-agents.sh → Pluto (172.16.3.36) for Windows MSI
- Agent on DESKTOP-0O8A1RL:
  - Binary: C:\Program Files\GuruRMM\gururmm-agent.exe (4,452,648 bytes, v0.6.4)
  - Registry: HKLM\SOFTWARE\GuruRMM, SiteId: d008c7d4-9e5e-4666-9fa0-b432609d54cc, AgentKey: agk_ybg4Ty6zXU_2Ee0ddlUUtuZdz0B9Qw4_
  - Logs: C:\ProgramData\GuruRMM\agent.log.YYYY-MM-DD
  - Backup: C:\ProgramData\GuruRMM\gururmm-agent.backup (0.6.3 binary, 4,303,656 bytes; not yet cleaned up pending code review + deploy)
  - Service: GuruRMMAgent (PID 4856 at session start); GuruRMMWatchdog NOT YET INSTALLED
Commands & Outputs
# Service query (use sc.exe, not Get-Service — Get-Service silently fails for GuruRMMAgent)
sc.exe queryex "GuruRMMAgent"
# → STATE: 4 RUNNING, PID: 4856
# Registry state at session start
Get-ItemProperty "HKLM:\SOFTWARE\GuruRMM"
# → Version: 0.6.2 (stale — binary was actually 0.6.4)
# → SiteId: d008c7d4-9e5e-4666-9fa0-b432609d54cc
# → AgentKey: agk_ybg4Ty6zXU_2Ee0ddlUUtuZdz0B9Qw4_
# SCM recovery (was 60s, changed to 10s in code)
sc.exe qfailure "GuruRMMAgent"
# → FAILURE_ACTIONS: RESTART/60000/RESTART/60000/RESTART/60000 (old config)
# Watchdog not installed
sc.exe queryex "GuruRMMWatchdog"
# → [SC] OpenService FAILED 1060 (service does not exist)
# Log entries confirming the two overnight updates
# 2026-05-13T00:44:24Z 0.6.2 → 0.6.3 update started; restart at 00:44:25; agent back at 01:49:47 (65 min gap)
# 2026-05-13T03:04:28Z 0.6.3 → 0.6.4 update started; restart at 03:04:29; agent back at 06:31:27 (3h27m gap)
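The restart gaps quoted in the log summary follow directly from the timestamps. A small sketch of that arithmetic, assuming same-day HH:MM:SS values (the function names are illustrative):

```rust
/// Parses "HH:MM:SS" into seconds since midnight.
pub fn parse_hms(s: &str) -> Option<u32> {
    let mut parts = s.split(':').map(|p| p.parse::<u32>().ok());
    let (h, m, sec) = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() || h > 23 || m > 59 || sec > 59 {
        return None;
    }
    Some(h * 3600 + m * 60 + sec)
}

/// Seconds between two same-day timestamps, or None if `end` precedes `start`.
pub fn gap_seconds(start: &str, end: &str) -> Option<u32> {
    parse_hms(end)?.checked_sub(parse_hms(start)?)
}

/// Formats a duration in seconds as "XhYYmZZs".
pub fn format_gap(secs: u32) -> String {
    format!("{}h{:02}m{:02}s", secs / 3600, (secs % 3600) / 60, secs % 60)
}
```

For the two overnight updates, gap_seconds("00:44:25", "01:49:47") gives 3922 s (1h05m22s) and gap_seconds("03:04:29", "06:31:27") gives 12418 s (3h26m58s), matching the roughly 65-minute and 3h27m gaps.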
Pending / Incomplete Tasks
- Code Review Agent result pending: launched as a background task at session end. Do NOT build or deploy to the server until the review is clean. If issues are found, address them in a follow-up session.
- Build and deploy, once code review is clean:
  - Push to Gitea gururmm repo (triggers build pipeline)
  - Server binary (v0.6.5 per coord API, Pending build+deploy) needs to be deployed first
  - After server deploy, agents will receive update push to new agent version
  - Verify on DESKTOP-0O8A1RL: registry version updated, backup cleaned, GuruRMMWatchdog installed
- GuruRMMWatchdog not installed on existing endpoints: the co-install logic in service.rs triggers during the install flow, which existing enrolled agents won't re-run automatically. Options: (a) add a one-time watchdog install step to the update post-restart sequence, (b) run a remediation script via the dashboard command channel, or (c) accept that watchdog deploys on next MSI re-run.
- 9 bug tasks (#1–#9) need status updates: tasks filed in the task system but not yet marked complete; should be updated once code review passes and the build is deployed.
- Bug #2 (Get-Service returns nothing): root cause unresolved. The fix (replacing Get-Service with sc.exe everywhere in PS1 templates) is in, but the underlying service visibility issue should be investigated if time permits.
- Server component state: gururmm/server is at state built v0.6.5 per coord API, pending deploy. gururmm/agents is at state built v0.6.4. Both need deploy after build.
Reference Information
- Coord API component states: GET http://172.16.3.30:8001/api/coord/status
- Session plan file: C:\Users\guru\.claude\plans\ticklish-questing-stallman.md
- Relevant source files:
  - agent/src/updater/mod.rs: full update flow, rollback, restart logic
  - agent/src/watchdog/monitor.rs: SCM health monitor, alert posting, server URL
  - agent/src/watchdog/pipe.rs: IPC protocol, WatchdogCommand enum
  - agent/src/transport/websocket.rs: WebSocket client, AuthAck handler, update trigger
  - server/src/ws/mod.rs: server-side WS handler, authenticate(), AuthAck, update dispatch
  - server/src/db/updates.rs: complete_update_by_agent() (was unhooked, now wired)
- Key bug notes:
  - PerformUpdate IPC currently returns ok:false (not implemented in watchdog), so the agent falls through to self-update. Future: watchdog implements stop + replace binary at staged_path + start.
  - exit(1) is the final SCM safety net: SCM recovery fires within 10s (after config change deploys)
  - AuthAckPayload.update_confirmed is Option<Uuid> with #[serde(default)] on agent side, so it is backwards compatible with older servers
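The exit-code rule in the bug notes above can be summarized as a pure function. This is a sketch under the assumption that each IPC attempt reduces to one of three outcomes; the IpcResult enum and function name are hypothetical, and the real do_update flow has more steps (download, verify, binary replacement).

```rust
/// Outcome of one IPC request to the watchdog (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum IpcResult {
    /// Watchdog accepted the command and owns the restart.
    Ok,
    /// Watchdog answered ok:false (e.g. PerformUpdate not implemented yet).
    Refused,
    /// Pipe could not be opened (e.g. GuruRMMWatchdog not installed).
    Unreachable,
}

/// Exit code for the agent process after it has finished its update step.
pub fn exit_code_after_update(perform_update: IpcResult, restart_service: IpcResult) -> i32 {
    if perform_update == IpcResult::Ok || restart_service == IpcResult::Ok {
        0 // watchdog owns the restart, so a clean exit is fine
    } else {
        1 // non-zero exit so SCM recovery restarts the service
    }
}
```

On a machine without the watchdog, both attempts are Unreachable and the agent exits 1, which is the case that previously left the service stopped for hours.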
Update: 14:50 PT — Update channel selection + server v0.3.1 deploy
Session Summary
This update session began by confirming the v0.6.4 agent build (including all 9 auto-update reliability fixes and ensure_watchdog_running) had completed successfully on Pluto at 14:22 UTC. All Windows and Linux agent binaries were present in /var/www/gururmm/downloads/.
The main work was implementing the update channel selection feature: a stable/beta hierarchy allowing partners, clients, sites, and individual machines to opt in to beta releases without affecting production machines. The approved design was DB-column + UI surface. A Coding Agent implemented all server and dashboard changes: migration 026, the resolve_agent_channel() DB function, channel-aware get_latest_version()/needs_update() in the scanner, three new PATCH /api/{agents,sites,clients}/:id/channel endpoints, GET /api/agents/:id/effective-channel, and an UpdateChannelSelector React component wired into the AgentDetail, SiteDetail, and ClientDetail pages. The server compiled clean (68 warnings, all pre-existing).
Deployment encountered two problems. First, migration 026 was applied manually via psql before the server binary was deployed, leaving the _sqlx_migrations table without a tracking row. When the new server binary started, sqlx tried to run migration 026 again and hit "column already exists". Resolution: deleted the bad version-26 row from _sqlx_migrations, changed the migration to use ADD COLUMN IF NOT EXISTS, pushed c1b8b80, and rebuilt. Second, the build-server.sh script always appends to the shared build log, and the background monitor's grep pattern matched an old "failed to start" line from the first attempt, causing a false "completed" notification. The second rebuild was still running at save time.
The build-agents.sh script was also updated to support a --channel flag. When invoked with --channel beta, it writes a .channel sidecar file ("beta") alongside each binary; stable builds remove any existing sidecar. This is the production mechanism for tagging beta releases. Backup build-agents.sh.bak-pre-channel was created before the edit.
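The reading side of the sidecar convention could look roughly like the sketch below. It assumes the sidecar name is the full binary filename with ".channel" appended and that a missing, unreadable, or empty sidecar means stable; the function names are illustrative rather than the scanner's actual API.

```rust
use std::ffi::OsString;
use std::fs;
use std::path::{Path, PathBuf};

/// Path of the sidecar next to a binary: the suffix is appended to the
/// full filename, so "agent.exe" maps to "agent.exe.channel".
pub fn sidecar_path(binary: &Path) -> PathBuf {
    let mut name: OsString = binary.as_os_str().to_owned();
    name.push(".channel");
    PathBuf::from(name)
}

/// Reads the channel tag for a binary. A missing, unreadable, or empty
/// sidecar is treated as "stable" (an assumption of this sketch).
pub fn read_channel(binary: &Path) -> String {
    match fs::read_to_string(sidecar_path(binary)) {
        Ok(text) if !text.trim().is_empty() => text.trim().to_string(),
        _ => "stable".to_string(),
    }
}
```

Removing the sidecar on stable builds, as the script does, is what keeps this default safe: an old beta tag cannot linger next to a newer stable binary.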
Key Decisions
- Never manually pre-apply migrations: sqlx owns migration state through _sqlx_migrations. Manually running psql diverges the tracking table from the schema, causing checksum failures on the next startup. Correct procedure: let the server binary apply its own migrations on startup. If pre-applying is necessary (e.g., zero-downtime column add), always make migrations idempotent with IF NOT EXISTS from the start.
- build-agents.sh --channel beta writes sidecar, stable builds remove it: cleanup on stable builds prevents stale .channel files from a prior beta build being mistaken for a beta tag on the stable binary.
- Channel resolution is agent → site → client → "stable": three-level inheritance with a "stable" default. resolve_agent_channel() uses a single JOIN query. Beta channel gets the absolute latest binary (any channel tag); stable gets only binaries tagged "stable" (no sidecar = stable).
- All new DB queries use sqlx::query(), not sqlx::query!() macros: avoids needing cargo sqlx prepare after each new query. The offline cache (server/.sqlx/) only needs updating for compile-time query! macros. This is the pattern all new code should follow to simplify builds.
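The channel resolution and version selection rules from the decisions above can be illustrated in memory. The real resolve_agent_channel() is a SQL JOIN; the types and function names here are hypothetical, and versions are modeled as plain tuples.

```rust
/// Resolves the effective channel: agent override, then site, then client,
/// then the "stable" default. Also returns which level supplied the value.
pub fn resolve_channel<'a>(
    agent: Option<&'a str>,
    site: Option<&'a str>,
    client: Option<&'a str>,
) -> (&'a str, &'static str) {
    if let Some(c) = agent {
        return (c, "agent");
    }
    if let Some(c) = site {
        return (c, "site");
    }
    if let Some(c) = client {
        return (c, "client");
    }
    ("stable", "default")
}

/// A binary found by the scanner, tagged from its sidecar (or "stable").
pub struct AvailableVersion {
    pub version: (u32, u32, u32),
    pub channel: &'static str,
}

/// Beta sees every binary; stable sees only binaries tagged "stable".
pub fn latest_for(channel: &str, versions: &[AvailableVersion]) -> Option<(u32, u32, u32)> {
    versions
        .iter()
        .filter(|v| channel == "beta" || v.channel == "stable")
        .map(|v| v.version)
        .max()
}
```

With a stable 0.6.4 and a beta 0.6.5 on disk, a machine that inherits "beta" from its site is offered 0.6.5 while everything else stays on 0.6.4.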
Problems Encountered
- sqlx migration conflict (column already exists): Manually applied migration 026 via psql left the _sqlx_migrations table without an entry for version 26. On server startup, sqlx attempted to run migration 026, hit "column already exists". Resolution: deleted the invalid row, updated migration to use IF NOT EXISTS, rebuilt.
- Background monitor false completion: The until grep -q 'Server build complete\|failed to start' polling loop matched an old "failed to start" line from the first server build attempt in the shared log file. The second rebuild was still in progress when the monitor fired. Resolution: used a line-count guard in a new background wait command; second build still in progress at save time.
- Server restart loop during failed deploy: The first v0.3.1 binary (with the migration conflict) caused systemd to restart the process repeatedly. The old binary was already overwritten, so the server was down until the second build completed. No data loss; agents reconnect automatically.
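The line-count guard used against the false completion can be sketched as follows. The idea is to record how many lines the shared log had before triggering the build and to scan only lines appended after that point. The two marker strings come from the grep pattern above; the enum and function are illustrative.

```rust
/// State of a build as inferred from the shared log.
#[derive(Debug, PartialEq)]
pub enum BuildState {
    Running,
    Succeeded,
    Failed,
}

/// Scans only the lines appended after the first `baseline` lines, so
/// markers left by earlier builds in the same file are ignored.
pub fn build_state(log: &str, baseline: usize) -> BuildState {
    for line in log.lines().skip(baseline) {
        if line.contains("Server build complete") {
            return BuildState::Succeeded;
        }
        if line.contains("failed to start") {
            return BuildState::Failed;
        }
    }
    BuildState::Running
}
```

A caller would capture the baseline (the current line count) immediately before starting the build and pass it on every poll.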
Configuration Changes
Committed to gururmm repo (4035b5c, 3df5880, c1b8b80):
- server/migrations/026_update_channels.sql: new; ADD COLUMN IF NOT EXISTS update_channel TEXT CHECK (...) on clients, sites, agents
- server/src/db/updates.rs: added resolve_agent_channel(), set_agent_channel(), set_site_channel(), set_client_channel()
- server/src/updates/scanner.rs: AvailableVersion.channel field; .channel sidecar file read; get_latest_version/needs_update accept channel: &str
- server/src/ws/mod.rs: both needs_update call sites (connect + heartbeat) now resolve agent channel first
- server/src/api/agents.rs: trigger_update uses agent effective channel; added set_agent_channel_handler, get_effective_channel_handler
- server/src/api/sites.rs: added set_site_channel_handler
- server/src/api/clients.rs: added set_client_channel_handler
- server/src/api/mod.rs: registered 4 new routes with patch routing
- server/Cargo.toml: version bumped 0.3.0 → 0.3.1
- dashboard/src/api/client.ts: UpdateChannel type, EffectiveChannel interface, channelApi, update_channel field on Agent/Site/Client
- dashboard/src/components/UpdateChannelSelector.tsx: new inherit/stable/beta selector component with effective-channel label
- dashboard/src/pages/AgentDetail.tsx: added UpdateChannelSelector to info section
- dashboard/src/pages/SiteDetail.tsx: added UpdateChannelSelector
- dashboard/src/pages/ClientDetail.tsx: added UpdateChannelSelector
Modified on server directly (not in git):
- /opt/gururmm/build-agents.sh: added --channel arg parsing; writes .channel sidecar for beta builds; backup at .bak-pre-channel
Applied to production DB:
- Migration 026 applied (columns already exist; tracking row inserted and will be re-inserted cleanly on next startup with idempotent migration)
Dashboard:
- Built and deployed to /var/www/gururmm/dashboard/ from /home/guru/gururmm/dashboard/dist/
Infrastructure & Servers
- GuruRMM server: 172.16.3.30:3001 (Rust/Axum); v0.3.1 binary deployed, second build in progress (startup was failing due to migration conflict; now resolved with IF NOT EXISTS)
- Build pipeline: /opt/gururmm/build-server.sh builds the server on 172.16.3.30 directly; /opt/gururmm/build-agents.sh builds agents (Linux on server, Windows on Pluto 172.16.3.36)
- Downloads: /var/www/gururmm/downloads/, v0.6.4 agent binaries (all platforms) present and ready
- Dashboard: /var/www/gururmm/dashboard/ (nginx-served), updated with channel selector UI
Commands & Outputs
# Apply migration 026 (manual, before server had IF NOT EXISTS)
psql 'postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm' \
-f /home/guru/gururmm/server/migrations/026_update_channels.sql
# → ALTER TABLE x3
# Build and deploy dashboard
cd /home/guru/gururmm/dashboard && npm run build
# → 2559 modules, dist/assets/index-*.js 1082kB, built in 11s
sudo cp -r /home/guru/gururmm/dashboard/dist/* /var/www/gururmm/dashboard/
# Delete bad sqlx migration row (checksum mismatch fix)
echo "DELETE FROM _sqlx_migrations WHERE version = 26;" | \
psql postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm -f /dev/stdin
# → DELETE 1
# Beta build usage (once channel feature is deployed):
sudo /opt/gururmm/build-agents.sh --channel beta
# Trigger server rebuild
sudo bash -c 'nohup /opt/gururmm/build-server.sh >> /var/log/gururmm-build.log 2>&1 &'
Pending / Incomplete Tasks
- Server v0.3.1 second build in progress: build-server.sh running at save time. Check on completion with tail -20 /var/log/gururmm-build.log. If it shows "Server build complete: v0.3.1", verify systemctl is-active gururmm-server returns active.
- Watchdog not installed on DESKTOP-0O8A1RL: GuruRMMWatchdog service still not present. Will self-install via ensure_watchdog_running() when the agent receives and applies the v0.6.4 update push. Agent will receive the update push once the server is back online.
- SQLX issue needs permanent fix: see user question "Can the SQLX issue be permanently resolved?" The fix is to always write ADD COLUMN IF NOT EXISTS in migrations and to never manually pre-apply them via psql. Additionally, set up a cargo sqlx prepare step in the build pipeline for any future migrations using query!() macros.
- Verify DESKTOP-0O8A1RL watchdog after agent update: once the server is running and the agent updates, confirm that sc.exe queryex GuruRMMWatchdog shows RUNNING, the registry version is updated, and the backup file is cleaned.
- 9 bug tasks (#1–#9) still need TickTick status updates: not yet marked complete.
Reference Information
- Commits: bdb751b (bugs), e52ee19 (watchdog self-install), 5b43fe6 (build fixes), 4035b5c (channel feature), 3df5880 (version bump), c1b8b80 (idempotent migration)
- Server DB: postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm
- Build log: /var/log/gururmm-build.log (shared between agent and server builds)
- New API endpoints:
  - PATCH /api/agents/:id/channel: set agent channel
  - PATCH /api/sites/:id/channel: set site channel
  - PATCH /api/clients/:id/channel: set client channel
  - GET /api/agents/:id/effective-channel: resolved channel + source
- Scanner sidecar convention: <binary-filename>.channel containing "beta"; absent = stable
Update: 11:50 PT — IPC pipe DACL fix, CORS fix, dashboard clients bug
Session Summary
This session was a continuation from a prior context window. The primary carried-over issue was the Windows agent system tray showing "disabled" — the tray icon connects to the agent service via a named pipe (\\.\pipe\gururmm-agent) created by SYSTEM, but user-session processes (non-elevated tray) were getting Access Denied when trying to open the pipe client end.
The DACL fix took five agent builds (v0.6.8 through v0.6.12) to land. v0.6.8 tried security_descriptor() on ServerOptions (the method doesn't exist). v0.6.9 tried SetKernelObjectSecurity post-creation with DACL_SECURITY_INFORMATION; it compiled but failed at runtime with 0x80070005 (Access Denied) because the tokio pipe handle lacks WRITE_DAC. v0.6.10 confirmed that approach is fundamentally broken: CreateNamedPipeW with PIPE_ACCESS_DUPLEX does not grant WRITE_DAC, so post-creation DACL modification is not possible via that handle. v0.6.11 switched to CreateNamedPipeW with a SECURITY_ATTRIBUTES struct holding a NULL DACL, but hit a compile error because PIPE_ACCESS_DUPLEX and PIPE_TYPE_BYTE are not exported by the windows 0.58 crate from Win32::System::Pipes. v0.6.12 resolved this by using raw integer literals with the correct wrapper types: FILE_FLAGS_AND_ATTRIBUTES(0x00000003 | 0x40000000) and NAMED_PIPE_MODE(0). That build compiled and the tray connected successfully ("IPC: tray client connected" in agent log).
A Firefox CORS bug was identified from a browser console export. The server was sending Access-Control-Allow-Headers: * (wildcard), and Firefox enforces that the Authorization header must be explicitly listed — the wildcard does not cover it. This caused Firefox to block API requests with warnings (escalating to errors in newer Firefox). The fix changed the Axum CORS layer from .allow_headers(Any) to .allow_headers([AUTHORIZATION, CONTENT_TYPE, ACCEPT]). A concurrent build error in server/src/ws/mod.rs was also fixed: fail_agent_update expects Option<&str> for its error message parameter, but the call site was passing &str directly. Both fixes were deployed as server v0.3.1.
The dashboard was then reporting no clients at all. Investigation showed the server API returning 200 with 0 ms latency for /api/clients — indicating an immediate empty-list return with no database query. The root cause was in authz/permissions.rs: accessible_client_ids() only returns None (meaning all clients) for the dev_admin role, but all four users in the production database have role = "admin" (the legacy superuser role predating the multi-tenant system). With admin role and no org memberships in the JWT, the function returns Some([]), triggering the early-exit branch. The fix added is_admin() to AuthContext covering both admin and dev_admin, and updated accessible_client_ids(), can_access_org(), is_org_admin(), and all the org management permission methods to use it. Deployed as a server rebuild (same v0.3.1 version number, uptime reset to 27s confirmed via /status).
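The permission fix described above can be sketched in a few lines. This is a simplified model: client ids are u32 stand-ins, the AuthContext fields are hypothetical, and "technician" is an invented non-admin role used only to show the restricted path.

```rust
/// Simplified auth context; the real one is built from the JWT.
#[derive(Debug, Clone, PartialEq)]
pub struct AuthContext {
    pub role: String,
    /// Client ids reachable through org memberships (stand-in id type).
    pub member_client_ids: Vec<u32>,
}

impl AuthContext {
    /// Legacy `admin` and the newer `dev_admin` are both superusers.
    pub fn is_admin(&self) -> bool {
        matches!(self.role.as_str(), "admin" | "dev_admin")
    }

    /// `None` means "all clients"; `Some(ids)` restricts to those ids.
    pub fn accessible_client_ids(&self) -> Option<Vec<u32>> {
        if self.is_admin() {
            None
        } else {
            Some(self.member_client_ids.clone())
        }
    }
}
```

Before the fix, the equivalent check matched only dev_admin, so a legacy admin with no memberships received Some(vec![]) and the clients endpoint returned an empty list without querying the database.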
Key Decisions
- CreateNamedPipeW + SECURITY_ATTRIBUTES at creation time, not SetKernelObjectSecurity post-creation: SetKernelObjectSecurity requires WRITE_DAC access on the handle, which tokio's CreateNamedPipeW-backed NamedPipeServer does not have. The only correct approach is to pass the security descriptor at pipe creation time.
- Raw integer literals for windows 0.58 pipe constants: PIPE_ACCESS_DUPLEX and PIPE_TYPE_BYTE are simply not re-exported by the windows crate 0.58 from Win32::System::Pipes. Using FILE_FLAGS_AND_ATTRIBUTES(0x00000003 | 0x40000000) and NAMED_PIPE_MODE(0) directly is correct and stable, because the Win32 ABI values do not change.
- admin role treated as full superuser in all permission checks: the admin role is explicitly documented as the legacy superuser before dev_admin was introduced. Restricting it to org-membership-based access is a regression. All production users are admin. The correct fix is to treat admin the same as dev_admin in is_admin(), not to forcibly re-issue JWTs or update the DB.
- Tray binary crash investigation deferred: the tray was crashing/exiting every 30 seconds (detected at the 30-second reap_dead poll boundary). Root cause is likely the OLD tray binary on enrolled machines (never updated by the agent auto-updater, which only updates the agent binary). The gururmm-tray-windows-amd64-{version}.exe is built and deployed to the downloads server but never pushed to agents. This is a separate deferred issue: the auto-update flow needs to also update the tray binary.
- Stale build lock must be cleared manually: /var/run/gururmm-build.lock is left by zombie build processes after failures. Added rm -f /var/run/gururmm-build.lock before each build trigger. The build script should defensively remove stale locks on startup (future fix).
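For the raw-literal decision above, the sketch below spells out which Win32 constants the numbers stand for. The values are from the Windows SDK headers; identifying 0x40000000 as FILE_FLAG_OVERLAPPED is an inference from those headers, since the notes give only the number. The two structs are local stand-ins that mimic the windows crate newtypes so the block compiles without that crate.

```rust
/// Local stand-in for the `windows` crate newtype of the same name.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct FILE_FLAGS_AND_ATTRIBUTES(pub u32);

/// Local stand-in for the `windows` crate newtype of the same name.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NAMED_PIPE_MODE(pub u32);

// Values from the Windows SDK headers.
pub const PIPE_ACCESS_DUPLEX: u32 = 0x0000_0003;
pub const FILE_FLAG_OVERLAPPED: u32 = 0x4000_0000;
pub const PIPE_TYPE_BYTE: u32 = 0x0000_0000;

/// Open mode passed to CreateNamedPipeW: duplex access, overlapped I/O.
pub fn pipe_open_mode() -> FILE_FLAGS_AND_ATTRIBUTES {
    FILE_FLAGS_AND_ATTRIBUTES(PIPE_ACCESS_DUPLEX | FILE_FLAG_OVERLAPPED)
}

/// Pipe mode passed to CreateNamedPipeW: byte-stream pipe.
pub fn pipe_mode() -> NAMED_PIPE_MODE {
    NAMED_PIPE_MODE(PIPE_TYPE_BYTE)
}
```

Naming the literals locally keeps the call site readable while still avoiding the missing re-exports.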
Problems Encountered
- SetKernelObjectSecurity fails at runtime with 0x80070005: post-creation DACL modification is impossible on a tokio pipe handle. Required a full approach switch to CreateNamedPipeW with SECURITY_ATTRIBUTES.
- PIPE_ACCESS_DUPLEX not in windows 0.58: the constant exists in Win32 SDK headers but is not re-exported by the crate. Used FILE_FLAGS_AND_ATTRIBUTES(0x00000003 | 0x40000000) as the solution.
- Stale /var/run/gururmm-build.lock: blocked both the agent build and server build triggers. Had to sudo rm -f before each trigger.
- fail_agent_update type mismatch: the function signature was Option<&str> but the call site in ws/mod.rs passed a raw &str. Caught as a compile error during the CORS server build.
- All dashboard users have role = "admin": the production DB was seeded before the dev_admin role was introduced. The permission system assumed admin users would have dev_admin. Every API endpoint that gate-checked is_dev_admin() only was effectively broken for the entire production user base.
Configuration Changes
Modified in gururmm repo (committed and pushed):
- agent/src/ipc.rs: create_server_pipe() rewritten using CreateNamedPipeW + NULL DACL SECURITY_ATTRIBUTES; added windows::Win32::System::Pipes and Win32::Storage::FileSystem feature usage
- agent/Cargo.toml: version 0.6.11 → 0.6.12; added Win32_System_Pipes and Win32_Storage_FileSystem to windows feature list
- server/src/main.rs: CORS: .allow_headers(Any) → .allow_headers([AUTHORIZATION, CONTENT_TYPE, ACCEPT]); added use axum::http::header::{ACCEPT, AUTHORIZATION, CONTENT_TYPE}
- server/src/ws/mod.rs: fail_agent_update call: "send_to failed: ..." → Some("send_to failed: ...")
- server/src/authz/permissions.rs: added is_admin() method; accessible_client_ids(), can_access_org(), is_org_admin(), can_set_org_limits(), can_impersonate(), can_create_org(), can_delete_org() all updated to use is_admin()
- server/src/api/organizations.rs: all auth.is_dev_admin() calls replaced with auth.is_admin()
Deployed to production:
- Agent v0.6.12 binary on download server, update scanner dispatching to all 48 enrolled agents
- Server v0.3.1 (two builds: first fixed CORS + ws/mod.rs; second fixed permissions)
Credentials & Secrets
- GuruRMM DB (production): postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm (rediscovered during permissions investigation; matches prior session log)
Infrastructure & Servers
- GuruRMM server: 172.16.3.30:3001 — v0.3.1, uptime ~27s at last check (18:10 UTC restart)
- Build servers: 172.16.3.30 (Linux agent + server); 172.16.3.36 Pluto (Windows agent + tray + MSI)
- Downloads: /var/www/gururmm/downloads/, gururmm-agent-windows-amd64-0.6.12.exe + tray binary present
- Enrolled agents: 48 total, 10 online, 38 offline (at 18:10 UTC)
- Tray binary on enrolled machines: OLD version (never auto-updated) — crashes ~30s after connecting to new NULL-DACL pipe
Commands & Outputs
# Check guruRMM DB users (diagnosed empty dashboard)
PGPASSWORD=43617ebf7eb242e814ca9988cc4df5ad psql -U gururmm -h localhost -d gururmm \
-c 'SELECT id, email, role FROM users;'
# → All 4 users have role="admin" (not dev_admin)
# Verify clients table has data
PGPASSWORD=... psql ... -c 'SELECT COUNT(*) FROM clients;'
# → 15
# Check server status after rebuild
curl -s https://rmm-api.azcomputerguru.com/status
# → {"version":"0.3.1","uptime_seconds":27,"agents":{"total":48,"online":10},...}
# Clear stale build lock
sudo rm -f /var/run/gururmm-build.lock
# Trigger server rebuild
nohup sudo bash /opt/gururmm/build-server.sh > /tmp/server-build-$$.log 2>&1 &
Pending / Incomplete Tasks
- Tray binary never auto-updated: gururmm-tray.exe in agent install dirs on enrolled machines is the original install version. The auto-updater only replaces the agent binary. Fix: extend the update flow to also download and install the new tray binary alongside the agent binary. New server UpdatePayload field + agent-side tray download step needed.
- Tray crash root cause: the tray exits ~30s after connecting (detected at the 30-second reap_dead poll). Likely the old binary crashing after connecting to the now-accessible pipe. Unconfirmed: there is no tray crash log (windows_subsystem = "windows" swallows panics). Adding file logging to the tray is the correct diagnostic step.
- /var/run/gururmm-build.lock stale lock protection: the build script should rm -f the lock at startup to avoid manual intervention on the next build failure. Simple one-line addition to build-agents.sh and build-server.sh.
- Plan file bugs #1–#9: the ticklish-questing-stallman plan still has unfinished items (registry version write, watchdog co-install, etc.) that were not addressed this session since the session focused on the tray pipe DACL fix and the production dashboard break.
Reference Information
- Commits (gururmm repo):
  - a6b3174 fix: grant legacy admin role full access in permission checks
  - 8c7380c fix(server): wrap fail_agent_update error_message in Some()
  - f3d0cc0 fix(server): explicitly list Authorization in CORS allow_headers
  - 47128b5 fix(ipc): use FILE_FLAGS_AND_ATTRIBUTES raw values
  - 57ff059 fix(ipc): create pipe with NULL DACL via CreateNamedPipeW+SECURITY_ATTRIBUTES
- Key source files:
  - agent/src/ipc.rs:156 (create_server_pipe() with NULL DACL, native-service feature)
  - server/src/authz/permissions.rs:35 (is_admin() method)
  - server/src/api/clients.rs:49 (accessible_client_ids() usage, empty-list path)
- Production DB connection: postgres://gururmm:43617ebf7eb242e814ca9988cc4df5ad@localhost:5432/gururmm
- Tray binary download server path: /var/www/gururmm/downloads/gururmm-tray-windows-amd64-{version}.exe
Update: ~17:00 PT — Policy UI defaultHint fix + full end-to-end policy wiring
Session Summary
This session completed two major policy system items. First, a UI fix: the policy editor's "Default (ON/OFF)" radio hints were hardcoded strings that didn't reflect the actual system default policy data. After the system default policy was updated in the prior session to disable all metrics and auto-update while keeping tray settings, the hints still showed "Default (ON)" for disabled fields. The fix added a systemDefaults?: PolicyData prop to PolicySectionEditor, computed a hints map from the live system default policy data using a boolHint() helper, and replaced all 16 hardcoded defaultHint="ON/OFF" values with hints.xxx references. The system default policy data is fetched via policiesApi.getSystemDefault() (already in the page's queries) and passed down through PolicyDetail → PolicySectionEditor. Dashboard was built and deployed (commit 287d106).
Second, a full audit of the policy system revealed that agents never receive policy data despite the infrastructure being in place. The ConfigUpdate WebSocket message existed in the protocol but was never sent by the server; AppState had no policy field; every subsystem (metrics, tray, updates, watchdog) used hardcoded defaults. The only wired section was thresholds, which are evaluated server-side when metrics arrive.
An implementation plan was approved covering 10 files across both the agent and server crates. A Coding Agent (Opus, 14 minutes) implemented all changes: expanded ConfigUpdatePayload from 2 fields to 4 nested section structs covering all policy sections; added AppState::effective_policy: RwLock<ConfigUpdatePayload>; replaced the ConfigUpdate stub with a real handler that stores policy and forwards the watchdog section via IPC; changed the fixed-interval metrics loop to a sleep-based loop reading interval + collect flags from policy each iteration; added collect_with_flags() to MetricsCollector; wired GetPolicy IPC to read from AppState instead of default_permissive(); added UpdateConfig to WatchdogCommand and handled it in the monitor with runtime config; added policy_to_agent_config() server-side helper in a new server/src/policy/config_update.rs; wired ConfigUpdate dispatch after AuthAck in server/src/ws/mod.rs; gated auto-update push on updates.auto_update from effective policy; and added push_config_update_to_affected() called from both assignment create and delete endpoints. Both crates built clean (agent on server via cargo check, server via build-server.sh — 68 warnings, all pre-existing). Deployed as server v0.3.1 (commit 78b6831).
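The sleep-based metrics loop and collect_with_flags() can be sketched together. The version below is a simplified model with hypothetical field names: it uses std::sync::RwLock as a stand-in for whatever lock AppState actually uses, zeroes disabled fields as noted under Key Decisions, and adds a 1-second floor on the interval that is an assumption of the sketch, not something the notes state.

```rust
use std::sync::RwLock;
use std::time::Duration;

/// Hypothetical metrics section of the effective policy.
#[derive(Debug, Clone, PartialEq)]
pub struct MetricsPolicy {
    pub interval_seconds: u64,
    pub collect_cpu: bool,
    pub collect_memory: bool,
}

/// Hypothetical sample; the real payload has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub cpu_percent: f32,
    pub memory_percent: f32,
}

/// Zeroes fields whose collection is disabled, leaving the rest untouched.
pub fn collect_with_flags(raw: Sample, policy: &MetricsPolicy) -> Sample {
    Sample {
        cpu_percent: if policy.collect_cpu { raw.cpu_percent } else { 0.0 },
        memory_percent: if policy.collect_memory { raw.memory_percent } else { 0.0 },
    }
}

/// One loop iteration: snapshot the policy, filter the sample, and return
/// how long to sleep before the next iteration.
pub fn run_iteration(policy: &RwLock<MetricsPolicy>, raw: Sample) -> (Sample, Duration) {
    let snapshot = policy.read().expect("policy lock poisoned").clone();
    // The 1-second floor is an assumption of this sketch.
    let sleep_for = Duration::from_secs(snapshot.interval_seconds.max(1));
    (collect_with_flags(raw, &snapshot), sleep_for)
}
```

Because the policy is re-read on every iteration, a ConfigUpdate that changes the interval takes effect on the next sleep without restarting the loop.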
Key Decisions
- hints computed inside PolicySectionEditor from the systemDefaults prop: avoids threading per-field hint strings through the prop chain; one boolHint() call per field at render time. When systemDefaults is undefined (before the query resolves), all hints default to "ON" to match the system default seeded in migration 027.
- systemDefaultData passed to both <PolicyDetail> call sites: the system default detail view passes systemDefault.policy_data to itself (showing its own values as hints); regular policies pass systemDefault?.policy_data (reflecting the true baseline).
- Agent collect_with_flags() zeroes disabled fields rather than skipping collection: avoids restructuring the single-pass collect() method. Disabled metrics send 0/None in the payload; the server receives them but evaluates thresholds against 0, which never triggers alerts (effectively disabled). Simplest correct approach without breaking the existing collect architecture.
- Server-side mirror structs in policy/config_update.rs, not re-using agent types: agent and server are separate crates. The server serializes AgentConfigUpdate to JSON; the agent deserializes to ConfigUpdatePayload. JSON field names match exactly. Mirror structs keep the crates fully decoupled.
- Single get_effective_policy() call per connect, result reused: computed once after registration, used for both ConfigUpdate dispatch and auto-update gating. Avoids two round-trips to the DB on every agent connect.
- push_config_update_to_affected() spawned async (non-blocking): the assignment change response returns immediately; the push happens in the background. If the push fails (agent disconnected between check and send), it's a no-op, and the next connect will receive the current policy.
- Watchdog UpdateConfig applies per-field (None = no change): consistent with the rest of the ConfigUpdatePayload design; partial updates won't reset unrelated watchdog fields.
- Watchdog interval floored at 5 seconds: prevents a policy misconfiguration from creating a CPU-spinning tight loop in the watchdog process.
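The last two decisions (per-field updates where None means "no change", and the 5-second interval floor) can be illustrated with a small sketch. The names RuntimeConfig, ConfigPatch and MIN_INTERVAL_SECS are hypothetical stand-ins and do not reflect the real WatchdogRuntimeConfig or WatchdogConfigUpdate definitions, whose fields are not shown in these notes.

```rust
use std::time::Duration;

/// Floor for the check interval, matching the 5-second minimum noted above.
const MIN_INTERVAL_SECS: u64 = 5;

/// Hypothetical runtime state held by the watchdog monitor.
#[derive(Debug, Clone, PartialEq)]
pub struct RuntimeConfig {
    pub enabled: bool,
    pub interval: Duration,
    pub services: Vec<String>,
}

/// Hypothetical partial update: every field is optional, None leaves the value alone.
#[derive(Debug, Default)]
pub struct ConfigPatch {
    pub enabled: Option<bool>,
    pub interval_secs: Option<u64>,
    pub services: Option<Vec<String>>,
}

impl RuntimeConfig {
    /// Apply only the fields present in the patch, clamping the interval to the floor.
    pub fn apply(&mut self, patch: ConfigPatch) {
        if let Some(enabled) = patch.enabled {
            self.enabled = enabled;
        }
        if let Some(secs) = patch.interval_secs {
            // A misconfigured 0 or 1 second interval is raised to the minimum.
            self.interval = Duration::from_secs(secs.max(MIN_INTERVAL_SECS));
        }
        if let Some(services) = patch.services {
            self.services = services;
        }
    }
}
```

With this shape, a patch that only carries interval_secs leaves enabled and services untouched, which is the behavior the per-field decision is meant to guarantee.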
Problems Encountered
- Coding Agent (first attempt) timed out at 18 minutes with no files written: it was still in its read phase. Resumed via a fresh agent invocation with the same detailed prompt. The second attempt (Opus model, 14 minutes) completed all 11 file changes.
- cargo not on local Windows PATH: the build had to be done remotely via SSH to the build server. The agent cargo check runs on Linux (non-Windows target) and catches type/import errors; Windows-specific conditional compilation (#[cfg(windows)]) is not checked, but those paths are structurally unchanged.
- Build triggered remotely after user corrected local build attempt: the user noted builds should go to the build server, not local. Correct pattern: push to Gitea, then run sudo /opt/gururmm/build-server.sh over SSH.
Configuration Changes
Committed to gururmm repo:
287d106 — fix: policy editor defaultHint reflects actual system default values
- dashboard/src/pages/Policies.tsx: PolicySectionEditorProps gains systemDefaults?: PolicyData; boolHint() / hints map computed from it; all 16 hardcoded defaultHint strings replaced with dynamic refs; PolicyDetailProps gains systemDefaultData?: PolicyData; both <PolicyDetail> call sites pass it
78b6831 — feat: wire agent policy end-to-end (metrics, tray, updates, watchdog)
- agent/src/transport/mod.rs: ConfigUpdatePayload expanded to 4 nested section structs; WatchdogConfigUpdate gains services/processes fields
- agent/src/main.rs: AppState::effective_policy: RwLock<ConfigUpdatePayload> added
- agent/src/transport/websocket.rs: ConfigUpdate handler implemented; metrics loop converted to sleep-based with per-iteration policy read
- agent/src/metrics/mod.rs: collect_with_flags() method added
- agent/src/ipc.rs: TrayPolicy::from_config_update() added; GetPolicy reads from AppState
- agent/src/watchdog/pipe.rs: WatchdogCommand::UpdateConfig variant added
- agent/src/watchdog/monitor.rs: WatchdogRuntimeConfig struct; UpdateConfig handler; enabled gate; configurable interval
- server/src/policy/config_update.rs (new file): AgentConfigUpdate mirror structs + policy_to_agent_config() helper
- server/src/policy/mod.rs: pub mod config_update added
- server/src/ws/mod.rs: ConfigUpdate sent after AuthAck; auto-update gated on policy
- server/src/api/policies.rs: push_config_update_to_affected() helper; called from assign_policy and remove_assignment
Infrastructure & Servers
- GuruRMM server: 172.16.3.30:3001, v0.3.1 rebuilt and deployed (build completed 00:00:31 UTC 2026-05-14)
- Dashboard: /var/www/gururmm/dashboard/, updated with dynamic policy hints
- Agent crate: cargo check clean on Linux (server); Windows build pending (Pluto)
Commands & Outputs
# Remote server build (the correct build pattern)
ssh -i C:/Users/guru/.ssh/id_ed25519 guru@172.16.3.30 "sudo /opt/gururmm/build-server.sh 2>&1"
# → Compiling gururmm-server v0.3.1
# → warning: unused import: ... (68 warnings, all pre-existing)
# → Finished release profile in 4m 07s
# → === Server build complete: v0.3.1 ===
# Agent cargo check on build server (Linux target)
ssh ... guru@172.16.3.30 "cd /home/guru/gururmm/agent && /home/guru/.cargo/bin/cargo check 2>&1 | tail -5"
# → warning: struct PipeServer is never constructed (pre-existing)
# → Finished dev profile in 7.18s
# Deploy dashboard
cd D:/claudetools/projects/msp-tools/guru-rmm/dashboard && npm run build
scp -i C:/Users/guru/.ssh/id_ed25519 -r dist/* guru@172.16.3.30:/var/www/gururmm/dashboard/
Pending / Incomplete Tasks
- Windows agent build pending: cargo check passed on Linux; a full Windows cross-compile or Pluto build is needed to confirm the #[cfg(windows)] paths compile. The agent binary won't reflect policy wiring until built and deployed via the build pipeline.
- Network discovery feature: next work item, starting after this save.
- Tray binary auto-update: still not implemented (carried over from prior session). The old tray binary on enrolled machines crashes ~30s after connecting to the NULL-DACL pipe.
- Watchdog not installed on existing enrolled machines: ensure_watchdog_running() is in the agent, but existing machines won't trigger a re-install until they receive a new agent binary.
Reference Information
- Commits: 287d106 (defaultHint fix), 78b6831 (policy wiring)
- Key source files:
  - agent/src/transport/mod.rs: ConfigUpdatePayload, MetricsConfigUpdate, TrayConfigUpdate, UpdatesConfigUpdate, WatchdogConfigUpdate
  - agent/src/main.rs: AppState::effective_policy
  - agent/src/transport/websocket.rs: ConfigUpdate handler + dynamic metrics loop
  - agent/src/metrics/mod.rs: collect_with_flags()
  - agent/src/ipc.rs: TrayPolicy::from_config_update(), GetPolicy handler
  - agent/src/watchdog/monitor.rs: WatchdogRuntimeConfig, UpdateConfig handler
  - server/src/policy/config_update.rs: policy_to_agent_config() (new file)
  - server/src/ws/mod.rs: post-AuthAck ConfigUpdate dispatch + auto-update gate
  - server/src/api/policies.rs: push_config_update_to_affected()
- Policy wiring status after this session:
  - Metrics: WIRED (interval + collect_* flags from policy)
  - Thresholds: WIRED server-side (unchanged, already worked)
  - Tray: WIRED (reads from AppState, no longer hardcoded)
  - Updates: WIRED (auto_update gate added server-side)
  - Watchdog: WIRED (UpdateConfig IPC + runtime config applied)