GURU-KALI Ghost-Churn Fix, BUG-016/017 Filing, Memory Dream + Consolidation Collision
User
- User: Mike Swanson (mike)
- Machine: GURU-KALI
- Role: admin
Session Summary
Four substantive threads on GURU-KALI today, two of them tightly intertwined with parallel work happening on other workstations.
Thread 1 — GURU-KALI ghost-agent churn (full diagnosis + remediation + upstream fix lifecycle in one day). Coord message from GURU-5070 reported that GURU-KALI was minting ~10 ghost agent rows on the gururmm server, one ~daily. The initial diagnosis blamed a read-only root filesystem. Local check disproved that — findmnt -no OPTIONS / showed rw,relatime,errors=remount-ro on the host, no ext4 errors in the kernel log, no ro/rw transitions since the normal boot-time remount. The actual cause turned out to be gururmm-agent.service running with ProtectSystem=strict, which creates a private mount namespace where / is mounted ro for the service. The unit declared ReadWritePaths=/var/log /usr/local/bin /etc/gururmm but omitted /var/lib/gururmm where device_id.rs:get_persist_path() writes .device-id. Inside the agent's namespace, every persist attempt returned EROFS. Combined with a second bug (the agent regenerating a fresh UUID on every persist failure instead of caching in memory), this produced the ghost-row blizzard. Workaround applied: drop-in override at /etc/systemd/system/gururmm-agent.service.d/override.conf adding ReadWritePaths=/var/lib/gururmm. After daemon-reload + restart, the new agent persisted a stable device-id ec975630-d297-4df9-bcb5-a445c65b648d and zero EROFS warnings have logged since. Coord reply sent to GURU-5070 (d91406ce-c4ab-4914-b479-c1f4a948096f) — they purged the 11 ghost rows down to 1 keeper (agent_id 9bca5090-2d0e-40ad-9078-c11af8a435c0).
Thread 2 — Filed BUG-016 and BUG-017 in the gururmm roadmap, then both fixed upstream same-day. Wrote both bug entries into projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md with full root-cause, suggested fixes, and the GURU-KALI workaround. Notified Howard via coord (99162698-5439-4fcb-9c27-719a569a717c). Mike picked up both fixes on another workstation later in the day — 30da053 fix(agent): resolve Linux device_id persistence issues (BUG-016, BUG-017) shipped to gururmm/main, then 2089e89 docs(roadmap): mark BUG-016 and BUG-017 as fixed. Fix shape matched the spec recommendations exactly: unit template gained StateDirectory=gururmm (preferred over appending to ReadWritePaths), and device_id.rs:get_device_id() now uses OnceLock<String> to cache the first generated UUID even when persistence fails. Toward end of session, refreshed the GURU-KALI base unit to match the upstream-fixed template (replaced gururmm-agent.service with the new shape, removed the override drop-in, restarted) — backup of pre-fix unit saved as gururmm-agent.service.pre-bug016-fix. Verified device-id unchanged after restart, mountinfo line shows /var/lib/gururmm rw-bound via StateDirectory. The auto-update earlier in the day had refreshed the agent binary at 20:24 but NOT the unit file, so removing the override without refreshing the unit would have regressed BUG-016 on this box — caught that before acting.
Thread 3 — sync.sh hardening, three rounds across one day, and submodule identity reconcile. First round (dead-submodule-ref tolerance): a routine /sync failed because git fetch recursed into submodules and hit a transient dead ref in guru-connect history. Fix added --no-recurse-submodules to the parent fetch + pull and made the post-rebase git submodule update tolerant of per-submodule failures. Second round (coord_api lifted to identity.json): the hardcoded LAN IP http://172.16.3.30:8001 was identified in three scripts (sync.sh, check-messages.sh, check-ksteen-smartbadge.sh) — silently breaks off-LAN/VPN workstations. Lifted into .claude/identity.json as coord_api with the existing IP as fallback default; migrate-identity.sh updated to populate the field for any machine missing it. Broadcast 1d93052f-aa79-4ac3-a0e9-99f04a4695c9 told the team to run migrate-identity.sh. Dead Windows-path repo-root fallback loop at sync.sh:102 deleted. Third round (submodule identity reconcile): two youtube-sync-docker commits were authored as ComputerGuru <guru@GURU-KALI.lan> because sync.sh's reconcile_git_identity only ran on the parent repo. Wrote docs/specifications/SUBMODULE-IDENTITY-RECONCILE-SPEC.md, implemented the spec (10-line addition to Phase 1a — (cd "$ppath" && reconcile_git_identity ...) for each submodule). Empirically verified: caught real drift on this box's guru-connect submodule (unset identity → Mike Swanson), idempotent on re-runs, forced-drift test on youtube-sync-docker passed. Coord todo a176100c opened and closed in the same session.
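The third-round fix above amounts to running the parent repo's identity check inside each submodule as well. A minimal sketch of that shape follows. The log does not show the real reconcile_git_identity signature or its drift heuristic, so the version below is a simplified stand-in, and reconcile_submodules is a hypothetical wrapper name.

```shell
# Set user.name/user.email in the current repo's local config when they differ
# from the expected identity. Simplified stand-in for the sync.sh function of
# the same name; the real signature and heuristic are assumptions.
reconcile_git_identity() (
    want_name=$1
    want_email=$2
    have_name=$(git config --local user.name 2>/dev/null || true)
    have_email=$(git config --local user.email 2>/dev/null || true)
    if [ "$have_name" != "$want_name" ] || [ "$have_email" != "$want_email" ]; then
        echo "[WARN] identity drift in $(pwd): '${have_name:-unset}' -> '$want_name'" >&2
        git config --local user.name "$want_name"
        git config --local user.email "$want_email"
    fi
)

# Apply the same reconcile to every submodule path registered in .gitmodules.
# Must be run from the parent repo root.
reconcile_submodules() (
    want_name=$1
    want_email=$2
    git config --file .gitmodules --get-regexp '^submodule\..*\.path$' 2>/dev/null |
    while read -r _key ppath; do
        # Skip submodules that are registered but not checked out.
        [ -e "$ppath/.git" ] || continue
        (cd "$ppath" && reconcile_git_identity "$want_name" "$want_email")
    done
)
```

A second run against an already-reconciled tree prints no warning, which is the idempotence property the session verified empirically.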
Thread 4 — Memory dream skill collision with Mike's parallel consolidation. Tried the new memory-dream skill (landed via /sync earlier in the day). Default report-only run produced a clean report: 104 memory files, 17 orphan files needing index lines, 12 broken backlinks, 12 overlap clusters (biggest: 19 feedback_syncro_* files), 1 stale dated fact, 0 profile/repo conflicts. Ran --apply-safe to additively append the 17 orphan index lines to MEMORY.md. At nearly the same moment, Mike on GURU-BEAST-ROG had completed a thoughtful consolidation pass (0c00010 "chore(memory): consolidate scattered feedback/project/reference files") that took the store from 104 → 71 files: 19 syncro files into 3 rule files + 1 history file, per-cluster RULE/STATE/HISTORY split for GuruConnect/Dataforth/Cascades/GuruRMM, new reference_resource_map.md cheatsheet, MEMORY.md fully rewritten. Pull-rebase produced a merge conflict in MEMORY.md. Resolved by taking Mike's consolidated version (git checkout --ours .claude/memory/MEMORY.md) and discarding my orphan-fix index adds — every file my adds pointed at had been consolidated away on his side. Set-diff verified zero original lines lost. Re-ran dream against the consolidated state: 71 files, 0 orphans, 7 broken backlinks, 5 overlap clusters down from 12. Skill confirmed working against the new layout but with a false-positive that needs fixing — it flags the new intentional _history.md companion files as merge candidates against their rule-file siblings. Broadcast 6c559209-a0bb-4007-ad01-cbf07deead1a told the fleet about the consolidation, instructed each machine to /sync + re-dream locally, and warned about the false-positive merge proposals to ignore. Filed coord todo 5ad05d03-74ca-491d-9e72-3a699fcd1150 to refine the cluster heuristic.
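The set-diff check used after the conflict resolution can be sketched as below. The log does not record the exact command, so the helper name lines_lost and the choice of which file is the baseline and which is the resolved copy are assumptions.

```shell
# List lines present in a baseline file but absent from a resolved file.
# lines_lost is a hypothetical helper name, not part of the memory-dream skill.
lines_lost() (
    baseline=$1
    resolved=$2
    tmpdir=$(mktemp -d) || exit 2
    trap 'rm -rf "$tmpdir"' EXIT
    # comm needs sorted input; -u drops duplicates so this is a set comparison.
    LC_ALL=C sort -u "$baseline" > "$tmpdir/a"
    LC_ALL=C sort -u "$resolved" > "$tmpdir/b"
    # -23 hides lines unique to the resolved file and lines common to both,
    # leaving only baseline lines that did not survive.
    LC_ALL=C comm -23 "$tmpdir/a" "$tmpdir/b"
)
```

Empty output corresponds to "zero original lines lost"; any printed line is content that exists only in the baseline.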
Side threads (smaller scope but real work):
- Rednour Law M365 onboarding + Emma → Carla rename earlier in the day (this session crossed from yesterday's tail into today's UTC midnight). Bootstrapped the full ComputerGuru MSP app suite for rednourlaw.com via Tenant Admin consent + onboard-tenant.sh; renamed emma@ → carla@rednourlaw.com (Carla Skinner) with mail aliases preserved; added smtp:nick@ alias on Nick Pafford's existing npafford@ mailbox; Syncro ticket #32343 updated + 0.5h billed + marked Resolved.
- youtube-sync-docker pickup: Mike asked to pull up the YouTube downloader project. Found it as a personal Gitea repo, cloned as a submodule. Read the codebase, found a real bug (Settings page wrote to settings.json but nothing downstream read it), fixed it with apply_schedule() helper + sync.sh/entrypoint.sh changes + 9 pytest cases across two commits. Code-reviewed both rounds.
Key Decisions
- Override removal: only after unit refresh. Mike said "remove the override now that upstream is fixed", but inspection showed the agent binary was auto-updated today while the unit file on disk was still the buggy 2026-05-24 version. Removing the override alone would have regressed BUG-016 on this box. Caught that before acting and proposed refreshing the unit file first; Mike's intent was preserved by doing both steps together.
- Took ours on the MEMORY.md merge conflict. During the rebase against Mike's 0c00010 consolidation, my --apply-safe orphan-fix additions were now stale (every file they referenced had been consolidated away). Took his version and discarded my adds rather than trying to reconcile per-line; during a rebase, --ours refers to the branch being rebased onto, so it selects Mike's side. Verified set-diff showed zero original content lost.
- StateDirectory=gururmm is the right systemd directive (preferred over ReadWritePaths=/var/lib/gururmm). It auto-creates the dir with correct ownership, binds it rw in the unit's namespace, documents intent ("this service has persistent state"), and handles uninstall/reinstall cleanly. Spec recommended both options; upstream picked StateDirectory, which matched my own preference.
- Cache device_id in OnceLock<String>, not /etc/machine-id. The existing comment at device_id.rs:7-10 explicitly rejected hardware IDs because OEMs ship machines with identical hardware IDs (un-sysprepped factory images). The OnceLock approach is the right shape: it survives persist failure and doesn't depend on a hardware ID.
- Memory-dream merge proposals stay advisory, never auto-applied. The skill's _history.md false positives confirm the design choice that merges always go through human approval. Filed a heuristic-refinement todo so future reports stay actionable, but the skill is functionally correct as-is.
- Submodule identity reconcile uses Option A from the spec (extend the existing init while-loop with (cd ... && reconcile_git_identity ...)) over Option B (inline duplicate logic in submodule foreach) or Option C (factor into a sourceable library). Empirically verified the heuristic catches real drift and is idempotent.
- Two youtube-sync-docker commits with wrong author (ef903c8, fdff0a7 authored as ComputerGuru) left as-is: rewriting history would need force-push to shared remote. The reconcile fix prevents recurrence on this and every other machine.
- Override at GURU-KALI removed cleanly at end of session, replaced by the upstream-fixed base unit. Future agent reinstall would write this same shape, so no drift.
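The "override removal only after unit refresh" decision is effectively a precondition check on the base unit file. A minimal sketch of such a guard follows. unit_covers_state_dir is a hypothetical helper name; it does a plain text scan of one file and does not resolve other drop-ins, path prefixes such as "-" or "+", or systemd specifiers.

```shell
# Succeed only if the base unit itself grants write access to the state path,
# i.e. removing a ReadWritePaths override would not bring EROFS back.
# Hypothetical helper; a text check, not a substitute for systemd's own parsing.
unit_covers_state_dir() (
    unit_file=$1
    state_name=${2:-gururmm}            # StateDirectory= value, relative to /var/lib
    state_path="/var/lib/$state_name"
    awk -v name="$state_name" -v path="$state_path" '
        /^[ \t]*StateDirectory=/ {
            sub(/^[^=]*=/, "")          # keep only the value part
            n = split($0, v, " ")
            for (i = 1; i <= n; i++) if (v[i] == name) found = 1
            next
        }
        /^[ \t]*ReadWritePaths=/ {
            sub(/^[^=]*=/, "")
            n = split($0, v, " ")
            for (i = 1; i <= n; i++) if (v[i] == path) found = 1
            next
        }
        END { exit !found }
    ' "$unit_file"
)
```

Run against /etc/systemd/system/gururmm-agent.service before deleting the drop-in, a non-zero exit would have flagged the stale pre-fix unit that the binary-only auto-update left behind.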
Problems Encountered
- Initial Graph PATCH for Emma rename failed with Property 'proxyAddresses' is read-only. Graph user write doesn't include proxyAddresses even with Directory.ReadWrite.All. Split the rename into two tiers: identity via Graph, mail aliases via Exchange REST.
- Exchange REST returned HTTP 403 even though the SP was consented. The Exchange Operator SP lacked Exchange Administrator role in the rednourlaw tenant. Resolved by running the full onboarding flow.
- Stale read-after-write on Exchange Set-Mailbox and Graph PATCH. Both writes returned success codes immediately, but verification reads showed old data for ~45s. Polled for UPN convergence; converged within first/second attempt.
- sync.sh dead-submodule-ref failure on routine pull. Manual workaround was git -c submodule.recurse=false pull --rebase etc.; fix made --no-recurse-submodules the default behavior.
- Coding Agent ran sync.sh as a verification step during the submodule reconcile implementation, which auto-committed + pushed the dirty edit pre-Code-Review. Disclosed honestly by the agent. Code Review on the committed state came back CLEAN; accepted as-is.
- MEMORY.md merge conflict during the memory dream collision with Mike's consolidation pass. Resolved by taking ours (Mike's intentional change) and discarding my now-stale orphan-fix adds.
- Auto-update refreshed agent binary but NOT systemd unit file. Discovered when planning the override removal: the binary on disk was dated 20:24 today (auto-updated with the OnceLock fix) but the unit file was still dated 2026-05-24 (pre-fix template). Without manually refreshing the unit, the override removal would have re-broken BUG-016. Refreshed the unit explicitly before removing.
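The stale read-after-write item above was handled by polling. A generic retry loop for that pattern might look like the sketch below. poll_until is a hypothetical helper name, and the attempt count and delay are illustrative values rather than the ones used in the session.

```shell
# Retry a check command until it succeeds or the attempt budget runs out.
# Usage: poll_until <max_attempts> <delay_seconds> <command> [args...]
# Hypothetical helper; not part of the remediation tooling described in this log.
poll_until() (
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            echo "converged on attempt $attempt"
            exit 0
        fi
        # Sleep only between attempts, not after the final failure.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    echo "no convergence after $max_attempts attempts" >&2
    exit 1
)
```

The check command would be whatever read verifies the write, for example a function that fetches the user and compares the UPN. Treating "success code returned" and "change visible on read" as separate events is the point of the loop.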
Configuration Changes
ClaudeTools repo (committed across session):
- .claude/scripts/sync.sh: dead-submodule-ref tolerance, deleted dead Windows-path fallbacks, submodule identity reconcile in Phase 1a, coord_api read from identity.json with fallback. Multiple commits: c89f22c, 973e9db, 4c49b85.
- .claude/scripts/migrate-identity.sh: populates coord_api for any machine missing the field (commit 973e9db).
- .claude/scripts/check-messages.sh, check-ksteen-smartbadge.sh: read coord_api from identity.json with fallback (commit 973e9db).
- .claude/skills/remediation-tool/references/tenants.md: rednourlaw.com row flipped NO → YES with role summary.
- clients/rednour/reports/2026-05-31-onboard-and-rename-emma-to-carla.md: full M365 remediation audit report.
- docs/specifications/SUBMODULE-IDENTITY-RECONCILE-SPEC.md: planning artifact.
- .gitmodules: registered new submodule projects/youtube-sync-docker.
- .claude/memory/_reports/: two dream reports (2026-06-01-1525-dream.md, 2026-06-01-1526-dream.md).
- Submodule pointers advanced: guru-rmm (BUG-016/017 fixes), guru-connect (multiple SPEC-004 tasks), youtube-sync-docker (settings fix + tests at fdff0a7).
ClaudeTools machine-local (not committed; gitignored):
- .claude/identity.json: added coord_api: "http://172.16.3.30:8001" field, bumped last_updated.
- .claude/current-mode: set to dev during youtube-sync-docker work.
- All three submodules' local .git/config user.name/user.email reconciled to Mike Swanson / mike@azcomputerguru.com. guru-connect was previously unset (real drift case fixed by the new Phase 1a reconcile).
gururmm repo (commits by Mike):
- e3d6a46: BUG-016 + BUG-017 entries in docs/FEATURE_ROADMAP.md (filed by me).
- 30da053: BUG-016 + BUG-017 fixes shipped (by Mike on another machine).
- 2089e89: bug roadmap status marked fixed.
youtube-sync-docker repo (commits by Mike on this machine via Gitea Agent):
- ef903c8: settings-not-applied fix + 3 tests (note: authored as ComputerGuru due to pre-reconcile drift).
- fdff0a7: apply_schedule tests + .gitignore python exclusions.
GURU-KALI system (not version controlled):
- /etc/systemd/system/gururmm-agent.service: replaced with upstream-fixed template (gained StateDirectory=gururmm). Old version backed up as gururmm-agent.service.pre-bug016-fix.
- /etc/systemd/system/gururmm-agent.service.d/: directory + override.conf removed (no longer needed).
Credentials & Secrets
rednourlaw.com (4a4ca18a-f516-478b-99da-2e0722c5dc18):
- Tenant Admin SP 671a2ace-be9e-440c-a7d6-5ff982e4500c: Conditional Access Administrator
- Security Investigator SP 704da463-7f4e-484c-b1da-40e447615d52: Exchange Administrator
- Exchange Operator SP 59a68ba9-5e1e-4a56-92ae-507a9a669a79: Exchange Administrator
- User Manager SP dc3b79a2-638b-42fe-8ecb-51592db7d40f: User Administrator + Authentication Administrator
- Defender Add-on SP 052da8aa-1ca5-4f60-b9c5-7aafcb74264b: no roles (no MDE in tenant)
Users renamed/touched:
- 93074d1a-6db2-4794-8f7d-c84a619e4494: emma@ → carla@rednourlaw.com (Carla Skinner). Sessions revoked, password unchanged.
- fe859088-bcbc-49dc-aaea-4c6e68f7d5bb: npafford@ (Nick Pafford); added smtp:nick@rednourlaw.com alias.
Syncro:
- Ticket #32343 (id 111409967): comments 415513323 (internal) + 415514647 (customer-visible); line item 42654682 (0.5h remote, $75.00, attributed to Mike user_id 1735). Status: Resolved.
Infrastructure & Servers
- GURU-KALI gururmm agent post-fix state: PID 686646, device_id ec975630-d297-4df9-bcb5-a445c65b648d, base unit /etc/systemd/system/gururmm-agent.service (refreshed today), no override drop-ins, mountinfo line 535 shows /var/lib/gururmm rw-bound via StateDirectory=gururmm.
- Coord API still at http://172.16.3.30:8001/api/coord, now configurable per machine via identity.json coord_api field.
- rednourlaw.com tenant: Global Admin is Carrie Rednour (also reachable via sysadmin@rednourlaw.com).
- gururmm server-side ghost-row purge complete: 11 rows → 1 keeper (agent_id 9bca5090-2d0e-40ad-9078-c11af8a435c0).
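The per-machine coord_api lookup with a LAN fallback can be sketched as follows. resolve_coord_api is a hypothetical helper name and the use of jq is an assumption. The log says only that the scripts read the field from identity.json with the existing IP as the default, not how they parse it.

```shell
# Resolve the coord API base URL from identity.json, falling back to the LAN default.
# Hypothetical helper; assumes jq is installed. Without jq, or without the file
# or the field, the default is returned.
DEFAULT_COORD_API="http://172.16.3.30:8001"

resolve_coord_api() (
    identity_file=${1:-.claude/identity.json}
    value=""
    if [ -r "$identity_file" ] && command -v jq >/dev/null 2>&1; then
        # '// empty' produces no output when the field is missing or null.
        value=$(jq -r '.coord_api // empty' "$identity_file" 2>/dev/null) || value=""
    fi
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf '%s\n' "$DEFAULT_COORD_API"
    fi
)
```

Keeping the old IP as the default means on-LAN machines behave the same before and after running migrate-identity.sh, which matches the note that they work without it.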
Commands & Outputs
# Diagnostic that revealed process-scoped ro
grep ' / ' /proc/$AGENT_PID/mountinfo
# 447 404 259:3 / / ro,nosuid,relatime ... <- agent ns
# Host's /proc/mounts and findmnt showed rw the whole time.
# Workaround applied early
sudo tee /etc/systemd/system/gururmm-agent.service.d/override.conf > /dev/null <<'EOF'
[Service]
ReadWritePaths=/var/lib/gururmm
EOF
sudo systemctl daemon-reload && sudo systemctl restart gururmm-agent
# End-of-session: unit file refreshed to upstream-fixed template, override removed
sudo cp -a /etc/systemd/system/gururmm-agent.service{,.pre-bug016-fix}
# (wrote new unit with StateDirectory=gururmm)
sudo rm -f /etc/systemd/system/gururmm-agent.service.d/override.conf
sudo rmdir /etc/systemd/system/gururmm-agent.service.d
sudo systemctl daemon-reload && sudo systemctl restart gururmm-agent
# Sync.sh runs
bash .claude/scripts/sync.sh # multiple times, each pulling Mike's parallel work
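The mountinfo diagnostic above relies on field 5 (mount point) and field 6 (per-mount options) of each line. A small helper that extracts the ro/rw state for one mount point, so the host view and the service view can be compared directly, might look like this. mount_rw_state is a hypothetical name, not an existing tool.

```shell
# Print "ro" or "rw" for a mount point as seen in a mountinfo-format file.
# mountinfo fields: 1=mount ID 2=parent ID 3=major:minor 4=root
#                   5=mount point 6=per-mount options
# Hypothetical helper; exits non-zero if the mount point is not listed.
mount_rw_state() (
    mountinfo_file=$1
    mount_point=$2
    awk -v mp="$mount_point" '
        $5 == mp { opts = $6 }          # keep the last match: later mounts shadow earlier ones
        END {
            if (opts == "") exit 1      # mount point not present in this namespace
            n = split(opts, parts, ",")
            for (i = 1; i <= n; i++)
                if (parts[i] == "ro" || parts[i] == "rw") { print parts[i]; exit 0 }
            exit 1
        }
    ' "$mountinfo_file"
)
```

Comparing mount_rw_state /proc/self/mountinfo / against mount_rw_state "/proc/$AGENT_PID/mountinfo" / shows the split this session hit: rw on the host, ro inside the ProtectSystem=strict namespace.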
Pending / Incomplete Tasks
- Memory-dream cluster heuristic refinement: coord todo 5ad05d03-74ca-491d-9e72-3a699fcd1150, open. Either skip clusters containing _history.md files or honor frontmatter merge_locked: true.
- Shared-drive access for Nick Pafford on Rednour ticket #32343: deferred to a separate workflow per Mike's instruction.
- Other workstations need migrate-identity.sh to pick up the new coord_api field. Broadcast sent; on-LAN machines work without it.
- Other workstations' submodule git identities will auto-correct on next /sync (one-time warning per drifted submodule).
- Two youtube-sync-docker commits authored as ComputerGuru: leaving history alone.
- TZ change via Settings UI still requires container restart on youtube-sync-docker (tzdata locked in at process start). Not in scope to fix.
- Sync.sh's Phase 1a now skips submodule advance by default (per Mike's later change on another machine); pass --with-submodules to fetch+advance. Already worked into the new sync.sh by Mike, so no action.
Reference Information
Commits on the main ClaudeTools branch from this session (Mike, GURU-KALI):
- c89f22c: sync: dead-submodule-ref tolerance in sync.sh
- 973e9db: coord_api lift + identity.json + migrate-identity update + Windows-path cleanup
- 4c49b85: submodule identity reconcile in sync.sh Phase 1a
- 14341d1 (or c37fd11 post-rebase): bundle: tenants.md flip + Rednour report + submodule reg + spec doc
- 805b902: youtube-sync-docker submodule pointer at fdff0a7
- 633c3fc: session log + final state
- 805b902 (post-rebase to current HEAD): completed
Submodule HEADs at end of session:
- gururmm: 2089e89 (BUG-016/017 marked fixed; latest)
- guru-connect: at the SPEC-004 Task 9 TOFU provisioning spec point
- youtube-sync-docker: fdff0a7 (settings fix + apply_schedule tests)
Coord messages I sent today (GURU-KALI/claude-main):
- 1d93052f (broadcast): alert routing change (initiated by GURU-5070, I just re-echoed)
- (deprecated) coord-message about migrate-identity.sh
- 99162698 (to Howard-Home/claude-main): BUG-016 + BUG-017 filed
- d91406ce (to GURU-5070/claude-main): ghost-fix complete with stable device-id
- 6c559209 (broadcast): memory consolidation + re-dream + ignore _history.md merge proposals
Coord todos I created today:
- a176100c-6de5-4e3b-8c1c-8291a2aa6ff0: submodule identity reconcile in sync.sh (DONE)
- 5ad05d03-74ca-491d-9e72-3a699fcd1150: refine memory-dream cluster heuristic (open)
M365 stable identifiers:
- rednourlaw tenant: 4a4ca18a-f516-478b-99da-2e0722c5dc18
- Carla user object: 93074d1a-6db2-4794-8f7d-c84a619e4494
- Nick user object: fe859088-bcbc-49dc-aaea-4c6e68f7d5bb
GuruRMM stable identifiers:
- GURU-KALI agent (post-fix keeper): agent_id 9bca5090-2d0e-40ad-9078-c11af8a435c0, device_id ec975630-d297-4df9-bcb5-a445c65b648d
Files of interest left for future sessions:
- clients/rednour/reports/2026-05-31-onboard-and-rename-emma-to-carla.md: full Rednour audit
- docs/specifications/SUBMODULE-IDENTITY-RECONCILE-SPEC.md: written spec (now implemented)
- .claude/memory/_reports/2026-06-01-1525-dream.md and 2026-06-01-1526-dream.md: dream reports
- /etc/systemd/system/gururmm-agent.service.pre-bug016-fix: backup of pre-fix unit on this machine (not in repo)
Raw API artifacts (machine-local, not in repo):
- /tmp/remediation-tool/4a4ca18a-f516-478b-99da-2e0722c5dc18/rednour-rename/: pre/post Set-Mailbox + Get-Mailbox JSON for both Carla rename and Nick alias add