claudetools/session-logs/2026-06-02-mike-bug016-fix-followup-and-verification.md
Mike Swanson 609e666e50 sync: auto-sync from GURU-KALI at 2026-06-02 06:08:09
Author: Mike Swanson
Machine: GURU-KALI
Timestamp: 2026-06-02 06:08:09
2026-06-02 06:08:11 -07:00

BUG-016/017 Fix Follow-up on GURU-KALI: Unit Refresh, Override Removal, Verification, Coord Closure

User

  • User: Mike Swanson (mike)
  • Machine: GURU-KALI
  • Role: admin

Session Summary

Short session, all of it follow-up on the BUG-016 / BUG-017 thread that the previous session opened. Three actions, in sequence.

First, removed the local override at /etc/systemd/system/gururmm-agent.service.d/override.conf now that upstream had fixed both bugs in gururmm commit 30da053. Found a subtle gap before acting: the gururmm auto-updater had refreshed the agent binary on this box but did not touch the systemd unit file. The binary (timestamp 2026-06-01 20:24:33, build-id 4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c) contains the new "Failed to persist device ID (will retry on next restart)" error string and not the old "Failed to persist device ID: " warn string. The unit on disk was still the 2026-05-24 buggy template: ReadWritePaths=/var/log /usr/local/bin /etc/gururmm with no StateDirectory=gururmm. Removing the override without refreshing the unit would therefore have re-broken BUG-016 on this machine even though upstream was fixed. Caught that, then replaced the unit file with the upstream-fixed template (which adds StateDirectory=gururmm), removed the override drop-in, and ran daemon-reload + restart. Backup of the pre-fix unit kept at gururmm-agent.service.pre-bug016-fix.
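
The pre-flight check described above (is the on-disk unit already in the fixed shape?) could be sketched roughly as follows. The function names are hypothetical helpers, not part of gururmm, and the check only inspects the base unit file, not the merged view that systemctl cat would show.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the given unit file declares
# StateDirectory=gururmm (possibly among other space-separated names).
unit_has_state_dir() {
    grep -Eqs '^[[:space:]]*StateDirectory=(.*[[:space:]])?gururmm([[:space:]]|$)' "$1"
}

# Hypothetical helper: report whether the override can be dropped without
# regressing BUG-016 on a host whose binary was auto-updated.
override_removal_is_safe() {
    if unit_has_state_dir "$1"; then
        echo "safe: unit already has StateDirectory=gururmm"
    else
        echo "unsafe: refresh the unit before removing the override"
        return 1
    fi
}
```

On this machine the call would have been `override_removal_is_safe /etc/systemd/system/gururmm-agent.service`, and against the 2026-05-24 template it would take the "unsafe" branch.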

Second, ran a comprehensive verification pass on the now-refreshed agent: binary identity (post-fix string matches, pre-fix string absent), service state (active, MainPID 686646, NRestarts=0, ~5 min uptime), effective unit (systemctl cat shows the base file only with StateDirectory=gururmm present, no drop-ins), agent's mount namespace (mountinfo line 535 confirms /var/lib/gururmm is rw-bound via StateDirectory), write probe via nsenter into the agent's namespace (PASS), persisted state (.device-id = ec975630-d297-4df9-bcb5-a445c65b648d, unchanged), and the post-restart agent log (zero EROFS, zero persist failures, zero ERROR-level entries; the startup sequence shows clean WebSocket auth as the keeper agent_id 9bca5090-2d0e-40ad-9078-c11af8a435c0).
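
As an illustration of the mount-namespace check, the mountinfo line can be parsed rather than eyeballed. This is a minimal sketch: mount_is_rw is a hypothetical helper, and it relies on the mountinfo field layout documented in proc(5), where field 5 is the mount point and field 6 holds the per-mount options. Mount points containing spaces are escaped in mountinfo and are not handled here.

```shell
#!/bin/sh
# Hypothetical helper: reads mountinfo text on stdin and succeeds if the
# last mount listed at the given mount point carries the per-mount "rw" option.
mount_is_rw() {
    awk -v mp="$1" '
        $5 == mp {
            # A later mount at the same point overrides an earlier one.
            found = 0
            n = split($6, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "rw") found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Mountinfo line 535 as recorded in this log.
line='535 442 259:3 /var/lib/gururmm /var/lib/gururmm rw,nosuid,relatime shared:389 master:1 - ext4 /dev/nvme0n1p2 rw,errors=remount-ro'
printf '%s\n' "$line" | mount_is_rw /var/lib/gururmm && echo "PASS"
# -> PASS
```

A per-mount rw flag does not by itself prove the path is writable, which is why the nsenter write probe remains the stronger check.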

Third, sent a closing coord message to GURU-5070/claude-main (07e01275-f3c3-4b1c-ac66-98c805098311) wrapping up the ghost-churn thread. The message reports the upstream fix shapes (both matched my BUG-016/017 spec recommendations exactly — OnceLock<String> for the device_id cache, StateDirectory=gururmm for the unit), the auto-updater-doesn't-refresh-unit gap and what it means for the rest of the Linux fleet, the verification results, and a recommendation that any other pre-fix Linux agent should get its unit refreshed (either via a small ops script over /rmm or organically on next reinstall). The peer session no longer has any open work item for this thread.

Key Decisions

  • Refreshed the unit BEFORE removing the override rather than just removing the override. Mike's instruction was "remove the override now that upstream is fixed", but inspection showed the auto-update path only replaces the binary — the on-disk unit file was untouched. Doing the steps in this order keeps BUG-016 closed on this box; the alternate order would have regressed it.
  • Used the upstream-fixed template verbatim (rather than just appending /var/lib/gururmm to the existing ReadWritePaths line). Reason: makes GURU-KALI's unit structurally identical to what a fresh post-fix install would produce. No drift between this machine and the canonical shape; future reinstalls won't need to re-think this.
  • Kept a backup of the pre-fix unit at gururmm-agent.service.pre-bug016-fix rather than deleting it outright. Cheap insurance; takes a few KB on disk and gives an obvious rollback path if the new unit ever causes a regression.
  • Coord update was a complete close, not a status ping. Included the agent_id + device_id + build_id so GURU-5070 has the stable identifiers in its log for future cross-referencing. Also flagged the fleet-wide drift situation (other Linux hosts have new binary + old unit) as visibility, not as a request.
  • Did NOT proactively scan/fix other Linux hosts in the fleet. Flagged it in the coord message and left the decision to Mike (option (a) ops-script via /rmm vs option (b) wait for organic reinstalls). Acting unilaterally on other hosts would be out of session scope.

Configuration Changes

GURU-KALI system (not version controlled — machine-local only):

  • /etc/systemd/system/gururmm-agent.service — replaced with upstream-fixed template (added StateDirectory=gururmm directive; rest of unit unchanged). Old file backed up at /etc/systemd/system/gururmm-agent.service.pre-bug016-fix.
  • /etc/systemd/system/gururmm-agent.service.d/override.conf — deleted.
  • /etc/systemd/system/gururmm-agent.service.d/ — directory removed (rmdir, was empty after override delete).
  • Ran systemctl daemon-reload, then systemctl restart gururmm-agent.

Repo (committed earlier on a separate /save in the prior session, mentioned here for completeness):

  • session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md: the broader session log containing the parent thread of today's follow-up. This new file is a focused follow-up rather than an append, because the prior log was already pushed and the date rolled.

Credentials & Secrets

None created or rotated in this session.

GURU-KALI gururmm agent stable identifiers (referenced in the coord message — preserved across the unit refresh):

  • agent_id: 9bca5090-2d0e-40ad-9078-c11af8a435c0
  • device_id: ec975630-d297-4df9-bcb5-a445c65b648d

Infrastructure & Servers

  • GURU-KALI gururmm-agent post-state:

    • Binary: /usr/local/bin/gururmm-agent (4113696 bytes, mtime 2026-06-01 20:24:33)
    • Build-id: 4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c
    • PID: 686646 (active since 2026-06-01 20:31:38 MST, NRestarts=0)
    • Unit: /etc/systemd/system/gururmm-agent.service (upstream-fixed shape with StateDirectory=gururmm)
    • No drop-ins under /etc/systemd/system/gururmm-agent.service.d/ (directory removed)
    • Server URL: wss://rmm-api.azcomputerguru.com/ws
    • Mountinfo line 535: 535 442 259:3 /var/lib/gururmm /var/lib/gururmm rw,nosuid,relatime shared:389 master:1 - ext4 /dev/nvme0n1p2 rw,errors=remount-ro
    • Persisted state at /var/lib/gururmm/.device-id = ec975630-d297-4df9-bcb5-a445c65b648d
  • Authentication endpoint behavior observed on this boot: INFO Authentication successful, agent_id: Some(9bca5090-2d0e-40ad-9078-c11af8a435c0) — the server accepted the persisted device_id and returned the keeper agent_id without minting a new row.

Commands & Outputs

# 1. Backup old unit + write new
sudo cp -a /etc/systemd/system/gururmm-agent.service{,.pre-bug016-fix}
sudo tee /etc/systemd/system/gururmm-agent.service > /dev/null <<'EOF'
... (full upstream-fixed unit template, including StateDirectory=gururmm)
EOF

# 2. Remove override + its dir
sudo rm -f /etc/systemd/system/gururmm-agent.service.d/override.conf
sudo rmdir /etc/systemd/system/gururmm-agent.service.d

# 3. Reload + restart
sudo systemctl daemon-reload
sudo systemctl restart gururmm-agent

# 4. Verify - binary identity
strings /usr/local/bin/gururmm-agent | grep -c 'Failed to persist device ID (will retry on next restart)'
# -> 1   (post-fix string present)
strings /usr/local/bin/gururmm-agent | grep -c '^Failed to persist device ID: '
# -> 0   (pre-fix string absent)

# 5. Verify - mount namespace
sudo grep -E '/var/lib/gururmm' /proc/$(systemctl show -p MainPID --value gururmm-agent)/mountinfo
# -> 535 442 259:3 /var/lib/gururmm /var/lib/gururmm rw,nosuid,relatime ...

# 6. Verify - write probe via nsenter
sudo nsenter -t "$(systemctl show -p MainPID --value gururmm-agent)" -m -- sh -c 'touch /var/lib/gururmm/.verify_$$ && rm /var/lib/gururmm/.verify_$$ && echo "PASS"'
# -> PASS

# 7. Verify - persisted state unchanged
sudo cat /var/lib/gururmm/.device-id
# -> ec975630-d297-4df9-bcb5-a445c65b648d

# 8. Verify - no EROFS/persist-failure/ERROR since restart
journalctl -u gururmm-agent --since "$(systemctl show -p ActiveEnterTimestamp --value gururmm-agent)" --no-pager | grep -ciE 'Read-only|errno 30|EROFS|Failed to persist'
# -> 0
journalctl -u gururmm-agent --since "$(systemctl show -p ActiveEnterTimestamp --value gururmm-agent)" --no-pager | grep -c ' ERROR '
# -> 0
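
The two journal checks in step 8 could be folded into a single pass/fail helper along these lines. This is only a sketch: log_is_clean is a hypothetical name, and the patterns are the same ones used in the commands above, with the ERROR match kept case-sensitive as in the original.

```shell
#!/bin/sh
# Hypothetical helper: reads journal text on stdin and succeeds only if no
# line matches the BUG-016 failure signatures or carries an ERROR level tag.
log_is_clean() {
    log=$(cat)
    # Case-insensitive match on the read-only / persist-failure signatures.
    if printf '%s\n' "$log" | grep -qiE 'Read-only|errno 30|EROFS|Failed to persist'; then
        return 1
    fi
    # Case-sensitive match on the log level tag.
    if printf '%s\n' "$log" | grep -q ' ERROR '; then
        return 1
    fi
    return 0
}
```

It would be fed the same journalctl pipeline as step 8, for example `journalctl -u gururmm-agent --since "$(systemctl show -p ActiveEnterTimestamp --value gururmm-agent)" --no-pager | log_is_clean && echo CLEAN`.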

Pending / Incomplete Tasks

  • Fleet-wide BUG-016 sweep on other Linux GuruRMM agents. Every Linux host installed before 30da053 is now running new-binary-over-old-unit: the OnceLock fix masks the symptom in-process, but each restart will still mint a new device ID and then fail to persist it, because the unit still lacks StateDirectory=gururmm. Two options flagged in the coord message:
    • (a) Push a small ops script (via /rmm) that refreshes the unit + removes any override + restarts the agent. Fast, deterministic.
    • (b) Wait for organic reinstall. Slower; lets each host drift until it gets touched.
    Mike's call between (a) and (b). Not scheduled.
  • GURU-KALI gururmm-agent.service.pre-bug016-fix backup lingering at /etc/systemd/system/. Harmless (systemd only loads files with a recognized unit suffix such as .service, so a name ending in .pre-bug016-fix is ignored), but a candidate for cleanup once we trust the new unit through a few cycles.
  • Memory-dream cluster heuristic refinement (carried from the prior session — coord todo 5ad05d03-74ca-491d-9e72-3a699fcd1150). Not touched in this session.
  • Shared-drive access for Nick Pafford on Rednour Syncro #32343 (also carried from the prior session). Not touched.
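
Option (a) of the fleet-wide sweep item above could look roughly like the sketch below, which mirrors the order used on GURU-KALI (refresh the unit first, then remove the override, then reload and restart). ROOT, DRY_RUN, and the fixed-template argument are hypothetical knobs added for illustration; how /rmm would deliver the script and the template to each host is not covered here. It defaults to dry-run and only prints what it would do.

```shell
#!/bin/sh
# Sketch only. ROOT, DRY_RUN, and the template path are illustrative
# assumptions, not part of gururmm or /rmm.

run() {
    # Print the command in dry-run mode (the default); execute it otherwise.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

refresh_agent_unit() {
    fixed_template=$1   # local copy of the upstream-fixed unit template
    unit="${ROOT:-}/etc/systemd/system/gururmm-agent.service"
    dropin="$unit.d/override.conf"
    changed=0

    [ -f "$unit" ] || { echo "no agent unit found"; return 0; }

    # Refresh the unit first so removing the override cannot regress BUG-016.
    if ! grep -qs '^StateDirectory=gururmm' "$unit"; then
        run cp -a "$unit" "$unit.pre-bug016-fix"
        run cp "$fixed_template" "$unit"
        changed=1
    fi

    # Remove the local override and its directory (rmdir fails harmlessly
    # if other drop-ins are present).
    if [ -f "$dropin" ]; then
        run rm -f "$dropin"
        run rmdir "$unit.d"
        changed=1
    fi

    if [ "$changed" = "1" ]; then
        run systemctl daemon-reload
        run systemctl restart gururmm-agent
    else
        echo "nothing to do"
    fi
}
```

A real run would set DRY_RUN=0 and call `refresh_agent_unit` with the path to the fixed template, as root, after reviewing the dry-run output on a sample host.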

Reference Information

Upstream fixes consumed:

  • gururmm commit 30da053, fix(agent): resolve Linux device_id persistence issues (BUG-016, BUG-017). Touched agent/src/main.rs (systemd unit template) and agent/src/device_id.rs (OnceLock cache).
  • gururmm commits 2089e89 and 0597714 — roadmap doc updates marking BUG-016 / BUG-017 fixed.

Pre-existing IDs preserved through this session:

  • agent_id: 9bca5090-2d0e-40ad-9078-c11af8a435c0
  • device_id: ec975630-d297-4df9-bcb5-a445c65b648d
  • build_id (current binary): 4fb059e9f57cb64d075bdc3b5529e07b8cd53d5c

Coord messages sent in this session:

  • 07e01275-f3c3-4b1c-ac66-98c805098311 → GURU-5070/claude-main, project_key gururmm. Subject: "Ghost-churn thread fully closed on GURU-KALI - post-fix binary + post-fix unit + verified clean. Fleet drift note inside."

Files of interest:

  • Prior session log (parent thread): session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md
  • Roadmap entries (now marked fixed): projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md — search BUG-016 and BUG-017
  • Spec doc that drove the implementation pattern: docs/specifications/SUBMODULE-IDENTITY-RECONCILE-SPEC.md (separate item but documents the same disciplined-spec pattern used to file BUG-016/017)
  • GURU-KALI-local backup: /etc/systemd/system/gururmm-agent.service.pre-bug016-fix