# BUG-016/017 Fix Follow-up on GURU-KALI: Unit Refresh, Override Removal, Verification, Coord Closure

## User

- **User:** Mike Swanson (mike)
- **Machine:** GURU-KALI
- **Role:** admin

## Session Summary

Short session, all of it follow-up on the BUG-016 / BUG-017 thread that the previous session opened. Three actions, in sequence.

First, removed the local override at `/etc/systemd/system/gururmm-agent.service.d/override.conf` now that upstream had fixed both bugs in `gururmm` commit `30da053`. Found a subtle gap before acting: the gururmm auto-updater had refreshed the agent binary on this box (timestamp 2026-06-01 20:24:33, build-id `4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c`, contains the new `"Failed to persist device ID (will retry on next restart)"` error string and not the old `"Failed to persist device ID: "` warn string) but **did not** touch the systemd unit file. The unit on disk was still the 2026-05-24 buggy template: `ReadWritePaths=/var/log /usr/local/bin /etc/gururmm` with no `StateDirectory=gururmm`. Removing the override without refreshing the unit would therefore have re-broken BUG-016 on this machine even though upstream was fixed. Caught that, then replaced the unit file with the upstream-fixed template (gained `StateDirectory=gururmm`), removed the override drop-in, and ran `daemon-reload` followed by a restart. Backup of the pre-fix unit kept at `gururmm-agent.service.pre-bug016-fix`.
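The gap described above (a post-fix binary running over a pre-fix unit) can be detected mechanically. The following is a minimal sketch, not the check that was actually run in this session: the function names and the state labels (`fixed`, `drift`, `pre-fix`, `binary-stale`) are hypothetical, and it assumes the post-fix error string and the `StateDirectory=gururmm` line are reliable markers of the two fixes. It uses `grep -a` on the binary instead of `strings`.

```bash
# Hypothetical helpers: classify a host by inspecting the agent binary and unit file.
# Paths are arguments so the check can be tried against copies first.

POST_FIX_STRING='Failed to persist device ID (will retry on next restart)'

# Succeeds if the binary contains the post-fix error string.
binary_is_post_fix() {
    grep -aqF -- "$POST_FIX_STRING" "$1"
}

# Succeeds if the unit file has a StateDirectory=gururmm line.
unit_is_post_fix() {
    grep -qE '^[[:space:]]*StateDirectory=gururmm[[:space:]]*$' "$1"
}

# Prints one label: fixed | drift | pre-fix | binary-stale
classify_agent() {
    agent_binary=$1
    agent_unit=$2
    if binary_is_post_fix "$agent_binary"; then
        if unit_is_post_fix "$agent_unit"; then
            echo fixed          # new binary, new unit
        else
            echo drift          # new binary over old unit (the gap found here)
        fi
    else
        if unit_is_post_fix "$agent_unit"; then
            echo binary-stale   # new unit, old binary
        else
            echo pre-fix        # neither fix present
        fi
    fi
}
```

On this machine the call would be `classify_agent /usr/local/bin/gururmm-agent /etc/systemd/system/gururmm-agent.service`; a result of `drift` means the override should stay in place until the unit has been refreshed.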

Second, ran a comprehensive verification pass on the now-refreshed agent: binary identity (post-fix string matches, pre-fix string absent), service state (active, MainPID 686646, NRestarts=0, ~5 min uptime), effective unit (`systemctl cat` shows the base file only with `StateDirectory=gururmm` present, no drop-ins), agent's mount namespace (the mountinfo entry with mount ID 535 confirms `/var/lib/gururmm` is rw-bound via StateDirectory), write probe via `nsenter` into the agent's namespace (PASS), persisted state (`.device-id` = `ec975630-d297-4df9-bcb5-a445c65b648d`, unchanged), and the post-restart agent log (zero EROFS, zero persist failures, zero ERROR-level entries; the startup sequence shows clean WebSocket auth as the keeper `agent_id 9bca5090-2d0e-40ad-9078-c11af8a435c0`).

Third, sent a closing coord message to GURU-5070/claude-main (`07e01275-f3c3-4b1c-ac66-98c805098311`) wrapping up the ghost-churn thread. The message reports the upstream fix shapes (both matched my BUG-016/017 spec recommendations exactly: `OnceLock` for the device_id cache, `StateDirectory=gururmm` for the unit), the gap where the auto-updater does not refresh the unit and what that means for the rest of the Linux fleet, the verification results, and a recommendation that any other pre-fix Linux agent should get its unit refreshed (either via a small ops script over /rmm or organically on next reinstall). The peer session no longer has any open work item for this thread.

## Key Decisions

- **Refreshed the unit BEFORE removing the override** rather than just removing the override. Mike's instruction was "remove the override now that upstream is fixed", but inspection showed the auto-update path only replaces the binary; the on-disk unit file was untouched. Doing the steps in this order keeps BUG-016 closed on this box; the alternate order would have regressed it.
- **Used the upstream-fixed template verbatim** (rather than just appending `/var/lib/gururmm` to the existing ReadWritePaths line). Reason: makes GURU-KALI's unit structurally identical to what a fresh post-fix install would produce. No drift between this machine and the canonical shape; future reinstalls won't need to re-think this.
- **Kept a backup of the pre-fix unit** at `gururmm-agent.service.pre-bug016-fix` rather than deleting it outright. Cheap insurance; takes a few KB on disk and gives an obvious rollback path if the new unit ever causes a regression.
- **Coord update was a complete close, not a status ping.** Included the agent_id + device_id + build_id so GURU-5070 has the stable identifiers in its log for future cross-referencing. Also flagged the fleet-wide drift situation (other Linux hosts have new binary + old unit) as visibility, not as a request.
- **Did NOT proactively scan/fix other Linux hosts in the fleet.** Flagged it in the coord message and left the decision to Mike (option (a) ops-script via /rmm vs option (b) wait for organic reinstalls). Acting unilaterally on other hosts would be out of session scope.

## Configuration Changes

**GURU-KALI system (not version controlled, machine-local only):**

- `/etc/systemd/system/gururmm-agent.service`: replaced with upstream-fixed template (added `StateDirectory=gururmm` directive; rest of unit unchanged). Old file backed up at `/etc/systemd/system/gururmm-agent.service.pre-bug016-fix`.
- `/etc/systemd/system/gururmm-agent.service.d/override.conf`: deleted.
- `/etc/systemd/system/gururmm-agent.service.d/`: directory removed (rmdir, was empty after override delete).
- `systemctl daemon-reload` + `systemctl restart gururmm-agent` ran.

**Repo (committed earlier on a separate /save in the prior session, mentioned here for completeness):**

- `session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md`: the broader session log that contains the parent thread of today's follow-up. This new file is a focused follow-up rather than an append, because the prior log was already pushed and the date rolled.
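
The Configuration Changes entry above states that the refreshed unit differs from the backup only by the added `StateDirectory=gururmm` directive. A minimal sketch of how that claim could be checked against the two files follows. It was not run in this session, the function names are hypothetical, and it compares whole lines only, so a reordered but otherwise identical unit would still pass.

```bash
# Hypothetical helper: print lines of file B that do not appear,
# as exact whole lines, anywhere in file A.
lines_only_in_second() {
    grep -Fxv -f "$1" -- "$2" || true   # grep exits 1 when nothing is printed
}

# Succeeds only if the new unit adds exactly one line, StateDirectory=gururmm,
# and drops no line that the old unit had.
only_state_directory_added() {
    old_unit=$1
    new_unit=$2
    added=$(lines_only_in_second "$old_unit" "$new_unit")
    removed=$(lines_only_in_second "$new_unit" "$old_unit")
    [ "$added" = "StateDirectory=gururmm" ] && [ -z "$removed" ]
}
```

Here the call would be `only_state_directory_added /etc/systemd/system/gururmm-agent.service.pre-bug016-fix /etc/systemd/system/gururmm-agent.service`. A non-zero exit status means the refreshed unit drifted from the backup in some other way and is worth a manual `diff`.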

## Credentials & Secrets

None created or rotated in this session.

GURU-KALI gururmm agent stable identifiers (referenced in the coord message, preserved across the unit refresh):

- `agent_id: 9bca5090-2d0e-40ad-9078-c11af8a435c0`
- `device_id: ec975630-d297-4df9-bcb5-a445c65b648d`

## Infrastructure & Servers

- **GURU-KALI gururmm-agent post-state:**
  - Binary: `/usr/local/bin/gururmm-agent` (4113696 bytes, mtime 2026-06-01 20:24:33)
  - Build-id: `4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c`
  - PID: 686646 (active since 2026-06-01 20:31:38 MST, NRestarts=0)
  - Unit: `/etc/systemd/system/gururmm-agent.service` (upstream-fixed shape with `StateDirectory=gururmm`)
  - No drop-ins under `/etc/systemd/system/gururmm-agent.service.d/` (directory removed)
  - Server URL: `wss://rmm-api.azcomputerguru.com/ws`
  - Mountinfo entry (mount ID 535): `535 442 259:3 /var/lib/gururmm /var/lib/gururmm rw,nosuid,relatime shared:389 master:1 - ext4 /dev/nvme0n1p2 rw,errors=remount-ro`
  - Persisted state at `/var/lib/gururmm/.device-id` = `ec975630-d297-4df9-bcb5-a445c65b648d`
- **Authentication endpoint behavior observed on this boot:** `INFO Authentication successful, agent_id: Some(9bca5090-2d0e-40ad-9078-c11af8a435c0)`. The server accepted the persisted device_id and returned the keeper agent_id without minting a new row.

## Commands & Outputs

```bash
# 1. Backup old unit + write new
sudo cp -a /etc/systemd/system/gururmm-agent.service{,.pre-bug016-fix}
sudo tee /etc/systemd/system/gururmm-agent.service > /dev/null <<'EOF'
... (full upstream-fixed unit template, including StateDirectory=gururmm)
EOF

# 2. Remove override + its dir
sudo rm -f /etc/systemd/system/gururmm-agent.service.d/override.conf
sudo rmdir /etc/systemd/system/gururmm-agent.service.d

# 3. Reload + restart
sudo systemctl daemon-reload
sudo systemctl restart gururmm-agent

# 4. Verify - binary identity
strings /usr/local/bin/gururmm-agent | grep -c 'Failed to persist device ID (will retry on next restart)'
# -> 1 (post-fix string present)
strings /usr/local/bin/gururmm-agent | grep -c '^Failed to persist device ID: '
# -> 0 (pre-fix string absent)

# 5. Verify - mount namespace
sudo grep -E '/var/lib/gururmm' /proc/$(systemctl show -p MainPID --value gururmm-agent)/mountinfo
# -> 535 442 259:3 /var/lib/gururmm /var/lib/gururmm rw,nosuid,relatime ...

# 6. Verify - write probe via nsenter (-t needs the agent's PID; same lookup as step 5)
sudo nsenter -t "$(systemctl show -p MainPID --value gururmm-agent)" -m -- sh -c 'touch /var/lib/gururmm/.verify_$$ && rm /var/lib/gururmm/.verify_$$ && echo "PASS"'
# -> PASS

# 7. Verify - persisted state unchanged
sudo cat /var/lib/gururmm/.device-id
# -> ec975630-d297-4df9-bcb5-a445c65b648d

# 8. Verify - no EROFS/persist-failure/ERROR since restart
journalctl -u gururmm-agent --since "$(systemctl show -p ActiveEnterTimestamp --value gururmm-agent)" --no-pager | grep -ciE 'Read-only|errno 30|EROFS|Failed to persist'
# -> 0
journalctl -u gururmm-agent --since "$(systemctl show -p ActiveEnterTimestamp --value gururmm-agent)" --no-pager | grep -c ' ERROR '
# -> 0
```

## Pending / Incomplete Tasks

- **Fleet-wide BUG-016 sweep on other Linux GuruRMM agents.** Every Linux host installed before `30da053` is now running new-binary-over-old-unit: the OnceLock fix masks the symptom in-process, but a restart will still mint and then lose a new id because the unit still has the bug. Two options flagged in the coord message:
  - (a) Push a small ops script (via /rmm) that refreshes the unit + removes any override + restarts the agent. Fast, deterministic.
  - (b) Wait for organic reinstall. Slower; lets each host drift until it gets touched.

  Mike's call. Not scheduled.
- **GURU-KALI `gururmm-agent.service.pre-bug016-fix` backup** lingering at `/etc/systemd/system/`. Harmless (systemd only loads files whose names end in a recognized unit suffix such as `.service`), but a candidate for cleanup once we trust the new unit through a few cycles.
- **Memory-dream cluster heuristic refinement** (carried from the prior session, coord todo `5ad05d03-74ca-491d-9e72-3a699fcd1150`). Not touched in this session.
- **Shared-drive access for Nick Pafford** on Rednour Syncro #32343 (also carried from the prior session). Not touched.

## Reference Information

**Upstream fixes consumed:**

- `gururmm` commit `30da053`: `fix(agent): resolve Linux device_id persistence issues (BUG-016, BUG-017)`. Touched `agent/src/main.rs` (systemd unit template) and `agent/src/device_id.rs` (OnceLock cache).
- `gururmm` commits `2089e89` and `0597714`: roadmap doc updates marking BUG-016 / BUG-017 fixed.

**Pre-existing IDs preserved through this session:**

- agent_id: `9bca5090-2d0e-40ad-9078-c11af8a435c0`
- device_id: `ec975630-d297-4df9-bcb5-a445c65b648d`
- build_id (current binary): `4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c`

**Coord messages sent in this session:**

- `07e01275-f3c3-4b1c-ac66-98c805098311` → GURU-5070/claude-main, project_key `gururmm`. Subject: "Ghost-churn thread fully closed on GURU-KALI - post-fix binary + post-fix unit + verified clean. Fleet drift note inside."

**Files of interest:**

- Prior session log (parent thread): `session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md`
- Roadmap entries (now marked fixed): `projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md` (search for `BUG-016` and `BUG-017`)
- Spec doc that drove the implementation pattern: `docs/specifications/SUBMODULE-IDENTITY-RECONCILE-SPEC.md` (separate item but documents the same disciplined-spec pattern used to file BUG-016/017)
- GURU-KALI-local backup: `/etc/systemd/system/gururmm-agent.service.pre-bug016-fix`
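
For option (a) under Pending / Incomplete Tasks, a dry-run planner may be a useful first step before any ops script changes a host. The sketch below is illustrative only: `plan_unit_refresh` and its step labels are hypothetical, nothing like it was pushed via /rmm in this session, and it only prints the steps it would take. It keeps the ordering from Key Decisions (refresh the unit before removing the override).

```bash
# Hypothetical dry-run planner: prints the ordered steps a host would need.
# It never modifies the unit, the drop-in, or the service.
plan_unit_refresh() {
    plan_unit=$1          # e.g. /etc/systemd/system/gururmm-agent.service
    plan_dropin_dir=$2    # e.g. /etc/systemd/system/gururmm-agent.service.d
    plan_steps=0

    # Step 1: the unit must gain StateDirectory=gururmm first.
    # A missing or unreadable unit file is also reported as needing a refresh.
    if ! grep -qE '^[[:space:]]*StateDirectory=gururmm[[:space:]]*$' "$plan_unit" 2>/dev/null; then
        echo "refresh-unit $plan_unit"
        plan_steps=$((plan_steps + 1))
    fi

    # Step 2: only after the unit is correct is the override safe to drop.
    if [ -e "$plan_dropin_dir/override.conf" ]; then
        echo "remove-override $plan_dropin_dir/override.conf"
        plan_steps=$((plan_steps + 1))
    fi

    # Step 3: reload and restart only when something would change.
    if [ "$plan_steps" -gt 0 ]; then
        echo "daemon-reload"
        echo "restart gururmm-agent"
    else
        echo "nothing-to-do"
    fi
}
```

Collecting this output per host would show how many machines are actually in the new-binary-over-old-unit state before choosing between options (a) and (b).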