diff --git a/session-logs/2026-06-02-mike-bug016-fix-followup-and-verification.md b/session-logs/2026-06-02-mike-bug016-fix-followup-and-verification.md
new file mode 100644
index 0000000..49beac8
--- /dev/null
+++ b/session-logs/2026-06-02-mike-bug016-fix-followup-and-verification.md
@@ -0,0 +1,129 @@
+# BUG-016/017 Fix Follow-up on GURU-KALI: Unit Refresh, Override Removal, Verification, Coord Closure
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-KALI
+- **Role:** admin
+
+## Session Summary
+
+Short session, all of it follow-up on the BUG-016 / BUG-017 thread that the previous session opened. Three actions, in sequence.
+
+First, removed the local override at `/etc/systemd/system/gururmm-agent.service.d/override.conf` now that upstream had fixed both bugs in `gururmm` commit `30da053`. Found a subtle gap before acting: the gururmm auto-updater had refreshed the agent binary on this box (timestamp 2026-06-01 20:24:33, build-id `4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c`, contains the new `"Failed to persist device ID (will retry on next restart)"` error string and not the old `"Failed to persist device ID: "` warn string) but **did not** touch the systemd unit file. The unit on disk was still the 2026-05-24 buggy template: `ReadWritePaths=/var/log /usr/local/bin /etc/gururmm` with no `StateDirectory=gururmm`. So removing the override without refreshing the unit would have re-broken BUG-016 on this machine even though upstream was fixed. Caught that, then replaced the unit file with the upstream-fixed template (gained `StateDirectory=gururmm`), removed the override drop-in, and ran daemon-reload + restart. Backup of the pre-fix unit kept at `/etc/systemd/system/gururmm-agent.service.pre-bug016-fix`.
+
+Second, ran a comprehensive verification pass on the now-refreshed agent: binary identity (post-fix string matches, pre-fix string absent), service state (active, MainPID 686646, NRestarts=0, ~5 min uptime), effective unit (`systemctl cat` shows the base file only with `StateDirectory=gururmm` present, no drop-ins), agent's mount namespace (mountinfo line 535 confirms `/var/lib/gururmm` is rw-bound via StateDirectory), write probe via `nsenter` into the agent's namespace (PASS), persisted state (`.device-id` = `ec975630-d297-4df9-bcb5-a445c65b648d`, unchanged), and the post-restart agent log (zero EROFS, zero persist failures, zero ERROR-level entries; the startup sequence shows clean WebSocket auth as the keeper `agent_id 9bca5090-2d0e-40ad-9078-c11af8a435c0`).
+
+Third, sent a closing coord message to GURU-5070/claude-main (`07e01275-f3c3-4b1c-ac66-98c805098311`) wrapping up the ghost-churn thread. The message reports the upstream fix shapes (both matched my BUG-016/017 spec recommendations exactly: `OnceLock` for the device_id cache, `StateDirectory=gururmm` for the unit), the auto-updater-doesn't-refresh-unit gap and what it means for the rest of the Linux fleet, the verification results, and a recommendation that any other pre-fix Linux agent should get its unit refreshed (either via a small ops script over /rmm or organically on next reinstall). The peer session no longer has any open work item for this thread.
+
+## Key Decisions
+
+- **Refreshed the unit BEFORE removing the override** rather than just removing the override. Mike's instruction was "remove the override now that upstream is fixed", but inspection showed the auto-update path only replaces the binary; the on-disk unit file was untouched. Doing the steps in this order keeps BUG-016 closed on this box; removing the override alone would have regressed it.
+- **Used the upstream-fixed template verbatim** (rather than just appending `/var/lib/gururmm` to the existing ReadWritePaths line). Reason: makes GURU-KALI's unit structurally identical to what a fresh post-fix install would produce. No drift between this machine and the canonical shape; future reinstalls won't need to re-think this.
+- **Kept a backup of the pre-fix unit** at `gururmm-agent.service.pre-bug016-fix` rather than deleting it outright. Cheap insurance; takes a few KB on disk and gives an obvious rollback path if the new unit ever causes a regression.
+- **Coord update was a complete close, not a status ping.** Included the agent_id + device_id + build_id so GURU-5070 has the stable identifiers in its log for future cross-referencing. Also flagged the fleet-wide drift situation (other Linux hosts have new binary + old unit) as visibility, not as a request.
+- **Did NOT proactively scan/fix other Linux hosts in the fleet.** Flagged it in the coord message and left the decision to Mike (option (a) ops-script via /rmm vs option (b) wait for organic reinstalls). Acting unilaterally on other hosts would be out of session scope.
+
+## Configuration Changes
+
+**GURU-KALI system (not version controlled, machine-local only):**
+- `/etc/systemd/system/gururmm-agent.service`: replaced with upstream-fixed template (added `StateDirectory=gururmm` directive; rest of unit unchanged). Old file backed up at `/etc/systemd/system/gururmm-agent.service.pre-bug016-fix`.
+- `/etc/systemd/system/gururmm-agent.service.d/override.conf`: deleted.
+- `/etc/systemd/system/gururmm-agent.service.d/`: directory removed (rmdir, was empty after override delete).
+- `systemctl daemon-reload` + `systemctl restart gururmm-agent` ran.
+
+**Repo (committed earlier on a separate /save in the prior session, mentioned here for completeness):**
+- `session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md`: the broader session log containing the parent thread of today's follow-up. This new file is a focused follow-up rather than an append, because the prior log was already pushed and the date rolled.
+
+## Credentials & Secrets
+
+None created or rotated in this session.
+
+GURU-KALI gururmm agent stable identifiers (referenced in the coord message; preserved across the unit refresh):
+- `agent_id: 9bca5090-2d0e-40ad-9078-c11af8a435c0`
+- `device_id: ec975630-d297-4df9-bcb5-a445c65b648d`
+
+## Infrastructure & Servers
+
+- **GURU-KALI gururmm-agent post-state:**
+  - Binary: `/usr/local/bin/gururmm-agent` (4113696 bytes, mtime 2026-06-01 20:24:33)
+  - Build-id: `4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c`
+  - PID: 686646 (active since 2026-06-01 20:31:38 MST, NRestarts=0)
+  - Unit: `/etc/systemd/system/gururmm-agent.service` (upstream-fixed shape with `StateDirectory=gururmm`)
+  - No drop-ins under `/etc/systemd/system/gururmm-agent.service.d/` (directory removed)
+  - Server URL: `wss://rmm-api.azcomputerguru.com/ws`
+  - Mountinfo line 535: `535 442 259:3 /var/lib/gururmm /var/lib/gururmm rw,nosuid,relatime shared:389 master:1 - ext4 /dev/nvme0n1p2 rw,errors=remount-ro`
+  - Persisted state at `/var/lib/gururmm/.device-id` = `ec975630-d297-4df9-bcb5-a445c65b648d`
+
+- **Authentication endpoint behavior observed on this boot:** `INFO Authentication successful, agent_id: Some(9bca5090-2d0e-40ad-9078-c11af8a435c0)`. The server accepted the persisted device_id and returned the keeper agent_id without minting a new row.
+
+## Commands & Outputs
+
+```bash
+# 1. Backup old unit + write new
+sudo cp -a /etc/systemd/system/gururmm-agent.service{,.pre-bug016-fix}
+sudo tee /etc/systemd/system/gururmm-agent.service > /dev/null <<'EOF'
+... (full upstream-fixed unit template, including StateDirectory=gururmm)
+EOF
+
+# 2. Remove override + its dir
+sudo rm -f /etc/systemd/system/gururmm-agent.service.d/override.conf
+sudo rmdir /etc/systemd/system/gururmm-agent.service.d
+
+# 3. Reload + restart
+sudo systemctl daemon-reload
+sudo systemctl restart gururmm-agent
+
+# 4. Verify - binary identity
+strings /usr/local/bin/gururmm-agent | grep -c 'Failed to persist device ID (will retry on next restart)'
+# -> 1 (post-fix string present)
+strings /usr/local/bin/gururmm-agent | grep -c '^Failed to persist device ID: '
+# -> 0 (pre-fix string absent)
+
+# 5. Verify - mount namespace
+sudo grep -E '/var/lib/gururmm' /proc/$(systemctl show -p MainPID --value gururmm-agent)/mountinfo
+# -> 535 442 259:3 /var/lib/gururmm /var/lib/gururmm rw,nosuid,relatime ...
+
+# 6. Verify - write probe via nsenter (target = the agent's MainPID, same lookup as step 5)
+sudo nsenter -t "$(systemctl show -p MainPID --value gururmm-agent)" -m -- sh -c 'touch /var/lib/gururmm/.verify_$$ && rm /var/lib/gururmm/.verify_$$ && echo "PASS"'
+# -> PASS
+
+# 7. Verify - persisted state unchanged
+sudo cat /var/lib/gururmm/.device-id
+# -> ec975630-d297-4df9-bcb5-a445c65b648d
+
+# 8. Verify - no EROFS/persist-failure/ERROR since restart
+journalctl -u gururmm-agent --since "$(systemctl show -p ActiveEnterTimestamp --value gururmm-agent)" --no-pager | grep -ciE 'Read-only|errno 30|EROFS|Failed to persist'
+# -> 0
+journalctl -u gururmm-agent --since "$(systemctl show -p ActiveEnterTimestamp --value gururmm-agent)" --no-pager | grep -c ' ERROR '
+# -> 0
+```
+
+## Pending / Incomplete Tasks
+
+- **Fleet-wide BUG-016 sweep on other Linux GuruRMM agents.** Every Linux host installed before `30da053` is now running new-binary-over-old-unit: the OnceLock fix masks the symptom in-process, but a restart will still mint and then lose a new id because the unit still has the bug. Two options flagged in the coord message:
+  - (a) Push a small ops script (via /rmm) that refreshes the unit + removes any override + restarts the agent. Fast, deterministic.
+  - (b) Wait for organic reinstall. Slower; lets each host drift until it gets touched.
+  Mike's call. Not scheduled.
+- **GURU-KALI `gururmm-agent.service.pre-bug016-fix` backup** lingering at `/etc/systemd/system/`. Harmless (systemd only loads unit files whose names end in a recognized unit suffix such as `.service`, so this one is ignored), but a candidate for cleanup once we trust the new unit through a few cycles.
+- **Memory-dream cluster heuristic refinement** (carried from the prior session, coord todo `5ad05d03-74ca-491d-9e72-3a699fcd1150`). Not touched in this session.
+- **Shared-drive access for Nick Pafford** on Rednour Syncro #32343 (also carried from the prior session). Not touched.
+
+## Reference Information
+
+**Upstream fixes consumed:**
+- `gururmm` commit `30da053`: `fix(agent): resolve Linux device_id persistence issues (BUG-016, BUG-017)`. Touched `agent/src/main.rs` (systemd unit template) and `agent/src/device_id.rs` (OnceLock cache).
+- `gururmm` commits `2089e89` and `0597714`: roadmap doc updates marking BUG-016 / BUG-017 fixed.
+
+**Pre-existing IDs preserved through this session:**
+- agent_id: `9bca5090-2d0e-40ad-9078-c11af8a435c0`
+- device_id: `ec975630-d297-4df9-bcb5-a445c65b648d`
+- build_id (current binary): `4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c`
+
+**Coord messages sent in this session:**
+- `07e01275-f3c3-4b1c-ac66-98c805098311` → GURU-5070/claude-main, project_key `gururmm`. Subject: "Ghost-churn thread fully closed on GURU-KALI - post-fix binary + post-fix unit + verified clean. Fleet drift note inside."
+
+**Files of interest:**
+- Prior session log (parent thread): `session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md`
+- Roadmap entries (now marked fixed): `projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md` (search `BUG-016` and `BUG-017`)
+- Spec doc that drove the implementation pattern: `docs/specifications/SUBMODULE-IDENTITY-RECONCILE-SPEC.md` (separate item but documents the same disciplined-spec pattern used to file BUG-016/017)
+- GURU-KALI-local backup: `/etc/systemd/system/gururmm-agent.service.pre-bug016-fix`
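
A minimal sketch for the fleet-wide sweep listed under Pending / Incomplete Tasks: before refreshing any host, classify its unit file as pre-fix or post-fix by checking for the `StateDirectory=gururmm` directive. The `classify_unit` function name and its three output labels are illustrative assumptions, not part of gururmm; the check only mirrors the difference between the two unit templates described in this log.

```shell
# Sketch only: classify a gururmm-agent unit file by template generation.
# classify_unit and its labels are hypothetical names chosen for illustration.
#
# Prints one of:
#   post-fix  - the unit declares StateDirectory=gururmm
#   pre-fix   - the unit exists but lacks that directive
#   missing   - no file at the given path
classify_unit() {
    unit_file="$1"
    if [ ! -f "$unit_file" ]; then
        echo "missing"
    elif grep -Eq '^[[:space:]]*StateDirectory=gururmm[[:space:]]*$' "$unit_file"; then
        echo "post-fix"
    else
        echo "pre-fix"
    fi
}
```

On a host, `classify_unit /etc/systemd/system/gururmm-agent.service` reports which template is on disk. It deliberately does not inspect drop-ins, so a host that prints `pre-fix` may still be kept working by a local override.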