sync: auto-sync from Mikes-MacBook-Air.local at 2026-06-01 20:26:32
Author: Mike Swanson Machine: Mikes-MacBook-Air.local Timestamp: 2026-06-01 20:26:32
session-logs/2026-06-01-bug-fixes-p1.md
@@ -0,0 +1,104 @@
# Session Log — 2026-06-01 — P1 Bug Fixes (BUG-016, BUG-017)

## User

- **User:** Mike Swanson (mike)
- **Machine:** Mikes-MacBook-Air.local
- **Role:** admin

## Session Summary

The user asked to work through the P1 bugs on the GuruRMM roadmap. A grep search identified four P1-priority bugs, of which BUG-013 and BUG-014 were already fixed in commit `94234af`. Two bugs remained open: BUG-016 (the Linux systemd unit's `ReadWritePaths` omits `/var/lib/gururmm`) and BUG-017 (the agent regenerates its device_id on persist failure instead of caching it in memory).

BUG-016 was the root cause. The systemd unit template in `agent/src/main.rs` set `ProtectSystem=strict`, which mounts nearly the entire file system read-only inside the service's namespace, and declared `ReadWritePaths=/var/log /usr/local/bin /etc/gururmm`, but omitted `/var/lib/gururmm`, where `device_id.rs` writes `.device-id` on Linux. As a result, every attempt by the agent to persist its device identifier failed with EROFS. The fix added `StateDirectory=gururmm` after line 857. This is the idiomatic systemd solution: it creates `/var/lib/gururmm` automatically with the correct ownership and makes it writable inside the service's mount namespace.
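
To make the directive change concrete, here is a minimal sketch of what a unit template with the fix could look like. Only `ProtectSystem=strict`, the `ReadWritePaths` value, and `StateDirectory=gururmm` come from this log; the function name `render_unit`, the `[Unit]` section, and the `ExecStart` path are hypothetical and do not reflect the actual contents of `agent/src/main.rs`.

```rust
/// Hypothetical stand-in for the unit template in `agent/src/main.rs`.
/// Only the three sandboxing directives are taken from the session log;
/// every other line is an illustrative assumption.
fn render_unit(exec_start: &str) -> String {
    format!(
        "[Unit]\n\
         Description=GuruRMM agent (illustrative)\n\
         \n\
         [Service]\n\
         ExecStart={exec_start}\n\
         ProtectSystem=strict\n\
         ReadWritePaths=/var/log /usr/local/bin /etc/gururmm\n\
         # Creates /var/lib/gururmm and keeps it writable for .device-id\n\
         StateDirectory=gururmm\n"
    )
}

fn main() {
    // The binary path is an assumption made for this example.
    let unit = render_unit("/usr/local/bin/gururmm-agent");
    print!("{unit}");
}
```

With `StateDirectory=` in place, the state path does not need to be listed in `ReadWritePaths=` as well.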

BUG-017 amplified the impact of BUG-016. When device_id persistence failed for any reason, every subsequent call to `get_device_id()` generated a fresh UUID v4, tried to persist it, failed, and returned the new value. The server, which has no machine_uid deduplication, recorded each new ID as a new agent row. GURU-KALI accumulated 11 distinct device_ids over 3 days of uptime, leaving ~10 ghost agent rows server-side. The fix added a `OnceLock<String>` cache that holds the device_id in memory: after the first generation, every later call returns the same cached UUID, even when persistence is broken. The persist-failure log was also escalated from `warn!` to `error!` to make it harder to miss.
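
The caching pattern described above can be sketched as follows. This is not the code from `device_id.rs`: the signature of `get_device_id`, the `generate_id` helper, and the file path are assumptions, and the ID generator is a standard-library stand-in for real UUID v4 generation so that the example has no external dependencies.

```rust
use std::collections::hash_map::RandomState;
use std::fs;
use std::hash::{BuildHasher, Hasher};
use std::path::Path;
use std::sync::OnceLock;

// Process-wide cache: initialized at most once, even under concurrent calls.
static DEVICE_ID: OnceLock<String> = OnceLock::new();

/// Stand-in for UUID v4 generation (assumption: the real agent uses a UUID).
fn generate_id() -> String {
    let a = RandomState::new().build_hasher().finish();
    let b = RandomState::new().build_hasher().finish();
    format!("{a:016x}{b:016x}")
}

/// Returns the device ID, generating and persisting it only on the first call.
fn get_device_id(path: &Path) -> &'static str {
    DEVICE_ID.get_or_init(|| {
        // Prefer an ID that was persisted by an earlier run.
        if let Ok(existing) = fs::read_to_string(path) {
            let trimmed = existing.trim();
            if !trimmed.is_empty() {
                return trimmed.to_string();
            }
        }
        let id = generate_id();
        // A persist failure is reported but no longer changes the returned ID.
        if let Err(err) = fs::write(path, &id) {
            eprintln!("error: failed to persist device_id to {}: {err}", path.display());
        }
        id
    })
}

fn main() {
    // A path that cannot be written, to simulate the EROFS condition.
    let path = Path::new("/nonexistent-dir/.device-id");
    let first = get_device_id(path).to_string();
    let second = get_device_id(path).to_string();
    assert_eq!(first, second);
    println!("device_id stays stable: {first}");
}
```

Note that a cache like this only stabilizes the ID for the lifetime of one process; a restart with persistence still broken would generate a new value.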

Both fixes were delegated to the Coding Agent and then reviewed by the Code Review Agent. The review covered thread safety (the `OnceLock` usage is correct), error handling (graceful degradation), edge cases (invalid data, read-only filesystem, concurrent access), security (no vulnerabilities found), and performance (improved, since repeated UUID generation is eliminated). One minor enhancement was applied: a comment on the `StateDirectory` directive documenting its connection to device_id persistence. Both fixes were approved as production-ready.

The Gitea Agent committed both fixes to the GuruRMM submodule (commit `30da053`), pushed to origin, updated the submodule pointer in the parent ClaudeTools repo, and pushed that as well. A second commit (`2089e89`) updated `FEATURE_ROADMAP.md` to mark BUG-016 and BUG-017 as fixed, with status lines and attribution, and was followed by a second submodule pointer update in the parent repo. All four commits (two in GuruRMM, two in ClaudeTools) were pushed successfully.

## Key Decisions

- **Used StateDirectory instead of appending to ReadWritePaths:** `StateDirectory` is the idiomatic systemd approach for persistent service state. It creates the directory automatically with correct ownership and documents intent ("this service has persistent state at `/var/lib/gururmm`"). It is simpler than managing `ReadWritePaths` by hand and cleaner across uninstall/reinstall scenarios.
- **Fixed both bugs in a single commit:** BUG-016 and BUG-017 are interrelated: BUG-016 triggers the persist failures that BUG-017 amplifies. Fixing them together provides defense in depth. The BUG-016 fix stops the bleeding for new agents going forward, and the BUG-017 fix handles any future persist failure mode (full disk, filesystem corruption, NFS mount loss).
- **Escalated persist-failure log from warn to error:** The original `warn!` log was too easy to miss. Changing it to `error!` ensures operators notice when persistence is broken, even though the in-memory cache prevents functional impact.
- **Delegated code and git operations to specialized agents:** All code changes were delegated to the Coding Agent, review to the Code Review Agent, and commits and pushes to the Gitea Agent. This preserves coordinator context and follows the delegation policy for operations over 500 tokens.

## Problems Encountered

None. All operations succeeded on first attempt.

## Configuration Changes

- Modified: `projects/msp-tools/guru-rmm/agent/src/main.rs` (line 858: added `StateDirectory=gururmm`)
- Modified: `projects/msp-tools/guru-rmm/agent/src/device_id.rs` (lines 14-53: added OnceLock import, static cache, refactored `get_device_id()` to use `get_or_init()` pattern)
- Modified: `projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md` (BUG-016 and BUG-017 status changed from "Open" to "Fixed (commit 30da053)")

## Infrastructure & Servers

- GuruRMM agent affected machines: GURU-KALI (Linux, Kali Rolling), all other Linux workstations in fleet with same latent bug
- GURU-KALI device_id before fix: 11 distinct UUIDs over 3 days, ~10 ghost agent rows server-side
- GURU-KALI workaround: drop-in override at `/etc/systemd/system/gururmm-agent.service.d/override.conf` (to be removed after testing)
- Gitea repository: `http://172.16.3.20:3000/azcomputerguru/gururmm`
- GuruRMM server: 172.16.3.30:3001 (Rust/Axum)

## Commands & Outputs

```bash
# List all bugs with priorities
cd /Users/azcomputerguru/ClaudeTools/projects/msp-tools/guru-rmm/docs
grep -A 3 "^### BUG-00[1-9] \|^### BUG-01[0-9]" FEATURE_ROADMAP.md | grep -E "^### BUG-|Priority:"
# Output: Found 19 bugs, 4 with P1 priority (BUG-013, BUG-014, BUG-016, BUG-017)

# GuruRMM submodule commits
cd /Users/azcomputerguru/ClaudeTools/projects/msp-tools/guru-rmm
git commit -m "fix(agent): resolve Linux device_id persistence issues (BUG-016, BUG-017)"
# Commit: 30da053
git push origin main
# Pushed successfully

git commit -m "docs(roadmap): mark BUG-016 and BUG-017 as fixed"
# Commit: 52c7fd3 (rebased to 2089e89)
git push origin main
# Pushed successfully

# Parent repo submodule pointer updates
cd /Users/azcomputerguru/ClaudeTools
git commit -m "chore(gururmm): update submodule to include BUG-016/017 fixes"
# Commit: fe9f4db
git push origin main
# Pushed successfully

git commit -m "chore(gururmm): update submodule (roadmap bug status)"
# Commit: 0597714
git push origin main
# Pushed successfully
```

## Pending / Incomplete Tasks

- **Testing on GURU-KALI (Mike):** Build a new Linux agent with both fixes and deploy it. Verify that the device_id persists to `/var/lib/gururmm/.device-id`, that no EROFS errors appear in the logs, and that no ghost agent rows are created after a restart. Then remove the workaround override at `/etc/systemd/system/gururmm-agent.service.d/override.conf`.
- **Fleet-wide deployment:** Once GURU-KALI testing passes, deploy to all Linux workstations in the fleet; all share the same latent bug because they use an identical systemd unit template.
- **Server-side cleanup:** Delete the ~10 ghost agent rows for GURU-KALI from the server database (device_ids that are no longer active after the fix).
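
For the persistence check in the GURU-KALI testing step, a small helper along these lines could confirm that the file exists and holds a plausible ID. This is a sketch under assumptions: it presumes `.device-id` contains a single UUID v4 in the canonical hyphenated form, which this log does not confirm, and the names `looks_like_uuid_v4` and `check_persisted_id` are hypothetical.

```rust
use std::fs;
use std::path::Path;

/// True if `s` has the canonical 8-4-4-4-12 hex layout with version nibble 4.
fn looks_like_uuid_v4(s: &str) -> bool {
    let bytes = s.as_bytes();
    bytes.len() == 36
        && bytes.iter().enumerate().all(|(i, &b)| match i {
            8 | 13 | 18 | 23 => b == b'-', // group separators
            14 => b == b'4',               // version nibble
            _ => b.is_ascii_hexdigit(),
        })
}

/// Reads the persisted ID and checks its shape.
/// Assumption: the file holds one bare UUID, optionally followed by a newline.
fn check_persisted_id(path: &Path) -> Result<String, String> {
    let raw = fs::read_to_string(path)
        .map_err(|e| format!("cannot read {}: {e}", path.display()))?;
    let id = raw.trim();
    if looks_like_uuid_v4(id) {
        Ok(id.to_string())
    } else {
        Err(format!("{} does not contain a UUID v4", path.display()))
    }
}

fn main() {
    let path = Path::new("/var/lib/gururmm/.device-id");
    match check_persisted_id(path) {
        Ok(id) => println!("persisted device_id: {id}"),
        Err(reason) => eprintln!("check failed: {reason}"),
    }
}
```

Running it before and after an agent restart and comparing the printed ID would show whether the value survives the restart.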
## Reference Information

- BUG-016 spec: `projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md:321-357`
- BUG-017 spec: `projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md:358-407`
- GuruRMM commits:
  - `30da053` - fix(agent): resolve Linux device_id persistence issues (BUG-016, BUG-017)
  - `2089e89` - docs(roadmap): mark BUG-016 and BUG-017 as fixed
- ClaudeTools commits:
  - `fe9f4db` - chore(gururmm): update submodule to include BUG-016/017 fixes
  - `0597714` - chore(gururmm): update submodule (roadmap bug status)
- Code review agent ID: `afd5ac0` (approved both fixes)
- Gitea agent IDs: `abb0f1b` (initial commits), `a2191b9` (roadmap update)
- Files modified:
  - `agent/src/main.rs:858`
  - `agent/src/device_id.rs:14-53`
  - `docs/FEATURE_ROADMAP.md`
- systemd StateDirectory documentation: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#StateDirectory=
- Rust OnceLock documentation: https://doc.rust-lang.org/std/sync/struct.OnceLock.html