sync: auto-sync from GURU-KALI at 2026-06-02 06:08:09
Author: Mike Swanson Machine: GURU-KALI Timestamp: 2026-06-02 06:08:09
@@ -0,0 +1,129 @@
# BUG-016/017 Fix Follow-up on GURU-KALI: Unit Refresh, Override Removal, Verification, Coord Closure

## User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-KALI
- **Role:** admin

## Session Summary

Short session, all of it follow-up on the BUG-016 / BUG-017 thread that the previous session opened. Three actions, in sequence.

First, removed the local override at `/etc/systemd/system/gururmm-agent.service.d/override.conf` now that upstream has fixed both bugs in `gururmm` commit `30da053`. Found a subtle gap before acting: the gururmm auto-updater had refreshed the agent binary on this box (timestamp 2026-06-01 20:24:33, build-id `4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c`; it contains the new `"Failed to persist device ID (will retry on next restart)"` error string and not the old `"Failed to persist device ID: "` warn string) but **did not** touch the systemd unit file. The unit on disk was still the buggy 2026-05-24 template: `ReadWritePaths=/var/log /usr/local/bin /etc/gururmm` with no `StateDirectory=gururmm`. Removing the override without refreshing the unit would therefore have re-broken BUG-016 on this machine even though upstream was fixed. Caught that, then replaced the unit file with the upstream-fixed template (which adds `StateDirectory=gururmm`), removed the override drop-in, and ran daemon-reload + restart. A backup of the pre-fix unit is kept at `gururmm-agent.service.pre-bug016-fix`.

Second, ran a comprehensive verification pass on the now-refreshed agent: binary identity (post-fix string matches, pre-fix string absent), service state (active, MainPID 686646, NRestarts=0, ~5 min uptime), effective unit (`systemctl cat` shows the base file only with `StateDirectory=gururmm` present, no drop-ins), agent's mount namespace (mountinfo line 535 confirms `/var/lib/gururmm` is rw-bound via StateDirectory), write probe via `nsenter` into the agent's namespace (PASS), persisted state (`.device-id` = `ec975630-d297-4df9-bcb5-a445c65b648d`, unchanged), and the post-restart agent log (zero EROFS, zero persist failures, zero ERROR-level entries; the startup sequence shows clean WebSocket auth as the keeper `agent_id 9bca5090-2d0e-40ad-9078-c11af8a435c0`).

Third, sent a closing coord message to GURU-5070/claude-main (`07e01275-f3c3-4b1c-ac66-98c805098311`) wrapping up the ghost-churn thread. The message reports the upstream fix shapes (both matched my BUG-016/017 spec recommendations exactly: `OnceLock<String>` for the device_id cache, `StateDirectory=gururmm` for the unit), the gap where the auto-updater does not refresh the unit and what that means for the rest of the Linux fleet, the verification results, and a recommendation that any other pre-fix Linux agent should get its unit refreshed (either via a small ops script over /rmm or organically on next reinstall). The peer session no longer has any open work item for this thread.
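The new-binary-over-old-unit gap described above can be detected mechanically. The following is a minimal sketch, not the check that was actually run in this session: the function name `check_agent_drift` and its three result labels are made up for illustration, and it only inspects the base unit file (it does not account for drop-ins or for `StateDirectory=` lines that list more than one directory).

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify an agent install from its binary and base unit file.
# Both paths are arguments so the check can be run against copies of the files.

POST_FIX_STRING='Failed to persist device ID (will retry on next restart)'

# check_agent_drift BINARY UNIT_FILE
#   ok      : post-fix binary, unit declares StateDirectory=gururmm
#   drift   : post-fix binary, unit still lacks StateDirectory=gururmm
#   pre-fix : binary does not contain the post-fix error string
check_agent_drift() {
    local binary="$1" unit="$2"
    # -a reads the binary as text, -F matches the string literally.
    if ! grep -aqF -- "$POST_FIX_STRING" "$binary"; then
        echo "pre-fix"
    elif grep -qE '^[[:space:]]*StateDirectory=gururmm[[:space:]]*$' "$unit"; then
        echo "ok"
    else
        echo "drift"
    fi
}
```

On a host like GURU-KALI before the refresh, `check_agent_drift /usr/local/bin/gururmm-agent /etc/systemd/system/gururmm-agent.service` would be expected to print `drift`.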

## Key Decisions

- **Refreshed the unit BEFORE removing the override** rather than just removing the override. Mike's instruction was "remove the override now that upstream is fixed", but inspection showed the auto-update path only replaces the binary and leaves the on-disk unit file untouched. Doing the steps in this order keeps BUG-016 closed on this box; the alternate order would have regressed it.
- **Used the upstream-fixed template verbatim** (rather than just appending `/var/lib/gururmm` to the existing ReadWritePaths line). Reason: it makes GURU-KALI's unit structurally identical to what a fresh post-fix install would produce. There is no drift between this machine and the canonical shape, and future reinstalls won't need to re-think this.
- **Kept a backup of the pre-fix unit** at `gururmm-agent.service.pre-bug016-fix` rather than deleting it outright. Cheap insurance: it takes a few KB on disk and gives an obvious rollback path if the new unit ever causes a regression.
- **Coord update was a complete close, not a status ping.** Included the agent_id + device_id + build_id so GURU-5070 has the stable identifiers in its log for future cross-referencing. Also flagged the fleet-wide drift situation (other Linux hosts have new binary + old unit) as visibility, not as a request.
- **Did NOT proactively scan/fix other Linux hosts in the fleet.** Flagged it in the coord message and left the decision to Mike (option (a) ops-script via /rmm vs option (b) wait for organic reinstalls). Acting unilaterally on other hosts would be out of session scope.
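The "verbatim template" decision can be expressed as a byte-for-byte comparison. This is a rough sketch under the assumption that a copy of the upstream-fixed template is available as a file; the helper name `report_unit_shape` and its output labels are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare an installed unit with a reference template.

# report_unit_shape INSTALLED TEMPLATE
# Prints "canonical" when the two files are byte-identical, "drifted" otherwise.
report_unit_shape() {
    local installed="$1" template="$2"
    # cmp -s is silent and exits 0 only when the files are identical.
    if cmp -s "$installed" "$template"; then
        echo "canonical"
    else
        echo "drifted"
    fi
}
```

A unit patched by appending `/var/lib/gururmm` to `ReadWritePaths` might well work, but this check would still report it as `drifted`, which is the situation the verbatim replacement avoids.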

## Configuration Changes

**GURU-KALI system (not version controlled; machine-local only):**
- `/etc/systemd/system/gururmm-agent.service`: replaced with upstream-fixed template (added `StateDirectory=gururmm` directive; rest of unit unchanged). Old file backed up at `/etc/systemd/system/gururmm-agent.service.pre-bug016-fix`.
- `/etc/systemd/system/gururmm-agent.service.d/override.conf`: deleted.
- `/etc/systemd/system/gururmm-agent.service.d/`: directory removed (rmdir, was empty after override delete).
- Ran `systemctl daemon-reload`, then `systemctl restart gururmm-agent`.

**Repo (committed earlier on a separate /save in the prior session, mentioned here for completeness):**
- `session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md`: the broader session log containing the parent thread of this follow-up. This new file is a focused follow-up rather than an append, because the prior log was already pushed and the date rolled.

## Credentials & Secrets

None created or rotated in this session.

GURU-KALI gururmm agent stable identifiers (referenced in the coord message and preserved across the unit refresh):
- `agent_id: 9bca5090-2d0e-40ad-9078-c11af8a435c0`
- `device_id: ec975630-d297-4df9-bcb5-a445c65b648d`

## Infrastructure & Servers

- **GURU-KALI gururmm-agent post-state:**
  - Binary: `/usr/local/bin/gururmm-agent` (4113696 bytes, mtime 2026-06-01 20:24:33)
  - Build-id: `4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c`
  - PID: 686646 (active since 2026-06-01 20:31:38 MST, NRestarts=0)
  - Unit: `/etc/systemd/system/gururmm-agent.service` (upstream-fixed shape with `StateDirectory=gururmm`)
  - No drop-ins under `/etc/systemd/system/gururmm-agent.service.d/` (directory removed)
  - Server URL: `wss://rmm-api.azcomputerguru.com/ws`
  - Mountinfo line 535: `535 442 259:3 /var/lib/gururmm /var/lib/gururmm rw,nosuid,relatime shared:389 master:1 - ext4 /dev/nvme0n1p2 rw,errors=remount-ro`
  - Persisted state at `/var/lib/gururmm/.device-id` = `ec975630-d297-4df9-bcb5-a445c65b648d`

- **Authentication endpoint behavior observed on this boot:** `INFO Authentication successful, agent_id: Some(9bca5090-2d0e-40ad-9078-c11af8a435c0)`. The server accepted the persisted device_id and returned the keeper agent_id without minting a new row.
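The mountinfo line recorded above can be read field by field: in `/proc/<pid>/mountinfo`, field 5 is the mount point and field 6 holds the per-mount options, which is where a read-only bind mount shows `ro` even when the superblock options at the end of the line say `rw`. A small sketch of that check follows; the function name `mount_is_rw` is an illustrative choice, not something used in the session.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide from mountinfo text whether a mount point is writable.

# mount_is_rw MOUNT_POINT < mountinfo
# Exits 0 when the last entry for MOUNT_POINT lists "rw" in its per-mount
# options (field 6), and 1 when it is read-only or not mounted at all.
mount_is_rw() {
    awk -v mp="$1" '
        $5 == mp {
            seen = 1
            rw = 0                      # later entries override earlier ones
            n = split($6, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "rw") rw = 1
        }
        END { exit (seen && rw) ? 0 : 1 }
    '
}
```

Usage would look like `mount_is_rw /var/lib/gururmm < /proc/686646/mountinfo && echo writable`, run with enough privilege to read the agent's mountinfo.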

## Commands & Outputs

```bash
# 1. Backup old unit + write new
sudo cp -a /etc/systemd/system/gururmm-agent.service{,.pre-bug016-fix}
sudo tee /etc/systemd/system/gururmm-agent.service > /dev/null <<'EOF'
... (full upstream-fixed unit template, including StateDirectory=gururmm)
EOF

# 2. Remove override + its dir
sudo rm -f /etc/systemd/system/gururmm-agent.service.d/override.conf
sudo rmdir /etc/systemd/system/gururmm-agent.service.d

# 3. Reload + restart
sudo systemctl daemon-reload
sudo systemctl restart gururmm-agent

# 4. Verify - binary identity
strings /usr/local/bin/gururmm-agent | grep -c 'Failed to persist device ID (will retry on next restart)'
# -> 1 (post-fix string present)
strings /usr/local/bin/gururmm-agent | grep -c '^Failed to persist device ID: '
# -> 0 (pre-fix string absent)

# 5. Verify - mount namespace
sudo grep -E '/var/lib/gururmm' /proc/$(systemctl show -p MainPID --value gururmm-agent)/mountinfo
# -> 535 442 259:3 /var/lib/gururmm /var/lib/gururmm rw,nosuid,relatime ...

# 6. Verify - write probe via nsenter
sudo nsenter -t <pid> -m -- sh -c 'touch /var/lib/gururmm/.verify_$$ && rm /var/lib/gururmm/.verify_$$ && echo "PASS"'
# -> PASS

# 7. Verify - persisted state unchanged
sudo cat /var/lib/gururmm/.device-id
# -> ec975630-d297-4df9-bcb5-a445c65b648d

# 8. Verify - no EROFS/persist-failure/ERROR since restart
journalctl -u gururmm-agent --since "$(systemctl show -p ActiveEnterTimestamp --value gururmm-agent)" --no-pager | grep -ciE 'Read-only|errno 30|EROFS|Failed to persist'
# -> 0
journalctl -u gururmm-agent --since "$(systemctl show -p ActiveEnterTimestamp --value gururmm-agent)" --no-pager | grep -c ' ERROR '
# -> 0
```
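The two log greps in step 8 could be folded into one pass that also yields a pass/fail exit code. This is only a sketch: `scan_agent_log` and its output format are invented here, and it assumes the journal text is piped in on stdin.

```bash
#!/usr/bin/env bash
# Hypothetical helper: one-pass version of the step 8 log checks.

# scan_agent_log < journal-text
# Counts lines matching the step 8 patterns and exits 1 if any were found.
scan_agent_log() {
    awk '
        # Case-insensitive match, like grep -i in step 8.
        tolower($0) ~ /read-only|errno 30|erofs|failed to persist/ { persist++ }
        # Case-sensitive match on the ERROR level token.
        / ERROR / { errors++ }
        END {
            printf "persist_or_erofs=%d error_level=%d\n", persist, errors
            exit (persist + errors > 0) ? 1 : 0
        }
    '
}
```

It would be fed the same way as the original commands, for example `journalctl -u gururmm-agent --since "$(systemctl show -p ActiveEnterTimestamp --value gururmm-agent)" --no-pager | scan_agent_log`.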

## Pending / Incomplete Tasks

- **Fleet-wide BUG-016 sweep on other Linux GuruRMM agents.** Every Linux host installed before `30da053` is now running new-binary-over-old-unit: the OnceLock fix masks the symptom in-process, but a restart will still mint and then lose a new device ID, because the old unit still does not make `/var/lib/gururmm` writable. Two options flagged in the coord message:
  - (a) Push a small ops script (via /rmm) that refreshes the unit + removes any override + restarts the agent. Fast, deterministic.
  - (b) Wait for organic reinstall. Slower; lets each host drift until it gets touched.

  Mike's call. Not scheduled.
- **GURU-KALI `gururmm-agent.service.pre-bug016-fix` backup** lingering at `/etc/systemd/system/`. Harmless (systemd only loads files with a recognized unit suffix, and `.pre-bug016-fix` is not one), but a candidate for cleanup once we trust the new unit through a few cycles.
- **Memory-dream cluster heuristic refinement** (carried from the prior session; coord todo `5ad05d03-74ca-491d-9e72-3a699fcd1150`). Not touched in this session.
- **Shared-drive access for Nick Pafford** on Rednour Syncro #32343 (also carried from the prior session). Not touched.
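For option (a) in the first item above, a dry-run planner is one possible starting point. The sketch below only prints what a sweep would do on one host and changes nothing; `plan_unit_refresh` and its output wording are hypothetical, and it assumes any `override.conf` exists solely as the BUG-016 workaround, which would need confirming per host before deleting it.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run planner for the fleet sweep. Prints steps, performs none.

# plan_unit_refresh [SYSTEMD_DIR]
# Emits the steps in the order used on GURU-KALI: refresh the unit first,
# then remove the override, then reload and restart.
plan_unit_refresh() {
    local dir="${1:-/etc/systemd/system}"
    local unit="$dir/gururmm-agent.service"
    local override="$dir/gururmm-agent.service.d/override.conf"
    local changed=0

    if [ ! -f "$unit" ]; then
        echo "skip: no unit at $unit"
        return 0
    fi
    if ! grep -qE '^[[:space:]]*StateDirectory=gururmm[[:space:]]*$' "$unit"; then
        echo "backup: $unit -> $unit.pre-bug016-fix"
        echo "write: upstream-fixed template to $unit"
        changed=1
    fi
    if [ -f "$override" ]; then
        echo "remove: $override"
        changed=1
    fi
    if [ "$changed" -eq 1 ]; then
        echo "run: systemctl daemon-reload"
        echo "run: systemctl restart gururmm-agent"
    else
        echo "noop: unit already in post-fix shape"
    fi
}
```

Keeping the planner separate from the code that applies changes lets the output be reviewed per host before anything is touched.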

## Reference Information

**Upstream fixes consumed:**
- `gururmm` commit `30da053`: `fix(agent): resolve Linux device_id persistence issues (BUG-016, BUG-017)`. Touched `agent/src/main.rs` (systemd unit template) and `agent/src/device_id.rs` (OnceLock cache).
- `gururmm` commits `2089e89` and `0597714`: roadmap doc updates marking BUG-016 / BUG-017 fixed.

**Pre-existing IDs preserved through this session:**
- agent_id: `9bca5090-2d0e-40ad-9078-c11af8a435c0`
- device_id: `ec975630-d297-4df9-bcb5-a445c65b648d`
- build_id (current binary): `4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c`

**Coord messages sent in this session:**
- `07e01275-f3c3-4b1c-ac66-98c805098311` → GURU-5070/claude-main, project_key `gururmm`. Subject: "Ghost-churn thread fully closed on GURU-KALI - post-fix binary + post-fix unit + verified clean. Fleet drift note inside."

**Files of interest:**
- Prior session log (parent thread): `session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md`
- Roadmap entries (now marked fixed): `projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md` (search `BUG-016` and `BUG-017`)
- Spec doc that drove the implementation pattern: `docs/specifications/SUBMODULE-IDENTITY-RECONCILE-SPEC.md` (separate item but documents the same disciplined-spec pattern used to file BUG-016/017)
- GURU-KALI-local backup: `/etc/systemd/system/gururmm-agent.service.pre-bug016-fix`