diff --git a/session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md b/session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md
new file mode 100644
index 0000000..fdd0b1f
--- /dev/null
+++ b/session-logs/2026-06-01-mike-guru-kali-ghost-fix-and-memory-dream.md
@@ -0,0 +1,177 @@
+# GURU-KALI Ghost-Churn Fix, BUG-016/017 Filing, Memory Dream + Consolidation Collision
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-KALI
+- **Role:** admin
+
+## Session Summary
+
+Four substantive threads on GURU-KALI today, two of them tightly intertwined with parallel work happening on other workstations.
+
+**Thread 1 — GURU-KALI ghost-agent churn (full diagnosis + remediation + upstream fix lifecycle in one day).** Coord message from GURU-5070 reported that GURU-KALI was minting ~10 ghost agent rows on the gururmm server, roughly one per day. The initial diagnosis blamed a read-only root filesystem. Local check disproved that — `findmnt -no OPTIONS /` showed `rw,relatime,errors=remount-ro` on the host, no ext4 errors in the kernel log, no ro/rw transitions since the normal boot-time remount. The actual cause turned out to be `gururmm-agent.service` running with `ProtectSystem=strict`, which creates a private mount namespace where `/` is mounted ro for the service. The unit declared `ReadWritePaths=/var/log /usr/local/bin /etc/gururmm` but omitted `/var/lib/gururmm` where `device_id.rs:get_persist_path()` writes `.device-id`. Inside the agent's namespace, every persist attempt returned EROFS. Combined with a second bug (the agent regenerating a fresh UUID on every persist failure instead of caching in memory), this produced the ghost-row blizzard. Workaround applied: drop-in override at `/etc/systemd/system/gururmm-agent.service.d/override.conf` adding `ReadWritePaths=/var/lib/gururmm`. After `daemon-reload` + restart, the new agent persisted a stable device-id `ec975630-d297-4df9-bcb5-a445c65b648d` and zero EROFS warnings have been logged since. Coord reply sent to GURU-5070 (`d91406ce-c4ab-4914-b479-c1f4a948096f`) — they purged the 11 ghost rows down to 1 keeper (agent_id `9bca5090-2d0e-40ad-9078-c11af8a435c0`).
+
+**Thread 2 — Filed BUG-016 and BUG-017 in the gururmm roadmap, then both fixed upstream same-day.** Wrote both bug entries into `projects/msp-tools/guru-rmm/docs/FEATURE_ROADMAP.md` with full root-cause, suggested fixes, and the GURU-KALI workaround. Notified Howard via coord (`99162698-5439-4fcb-9c27-719a569a717c`). Mike picked up both fixes on another workstation later in the day — `30da053 fix(agent): resolve Linux device_id persistence issues (BUG-016, BUG-017)` shipped to gururmm/main, then `2089e89 docs(roadmap): mark BUG-016 and BUG-017 as fixed`. Fix shape matched the spec recommendations exactly: unit template gained `StateDirectory=gururmm` (preferred over appending to `ReadWritePaths`), and `device_id.rs:get_device_id()` now uses `OnceLock` to cache the first generated UUID even when persistence fails. Toward end of session, refreshed the GURU-KALI base unit to match the upstream-fixed template (replaced `gururmm-agent.service` with the new shape, removed the override drop-in, restarted) — backup of pre-fix unit saved as `gururmm-agent.service.pre-bug016-fix`. Verified device-id unchanged after restart, mountinfo line shows `/var/lib/gururmm` rw-bound via StateDirectory. The auto-update earlier in the day had refreshed the agent binary at 20:24 but NOT the unit file, so removing the override without refreshing the unit would have regressed BUG-016 on this box — caught that before acting.
+
+**Thread 3 — sync.sh hardening, three rounds across one day, and submodule identity reconcile.** First round (dead-submodule-ref tolerance): a routine `/sync` failed because `git fetch` recursed into submodules and hit a transient dead ref in `guru-connect` history. Fix added `--no-recurse-submodules` to the parent fetch + pull and made the post-rebase `git submodule update` tolerant of per-submodule failures. Second round (`coord_api` lifted to identity.json): the hardcoded LAN IP `http://172.16.3.30:8001` was identified in three scripts (sync.sh, check-messages.sh, check-ksteen-smartbadge.sh) — silently breaks off-LAN/VPN workstations. Lifted into `.claude/identity.json` as `coord_api` with the existing IP as fallback default; `migrate-identity.sh` updated to populate the field for any machine missing it. Broadcast `1d93052f-aa79-4ac3-a0e9-99f04a4695c9` told the team to run `migrate-identity.sh`. Dead Windows-path repo-root fallback loop at sync.sh:102 deleted. Third round (submodule identity reconcile): two youtube-sync-docker commits were authored as `ComputerGuru ` because sync.sh's `reconcile_git_identity` only ran on the parent repo. Wrote `docs/specifications/SUBMODULE-IDENTITY-RECONCILE-SPEC.md`, implemented the spec (10-line addition to Phase 1a — `(cd "$ppath" && reconcile_git_identity ...)` for each submodule). Empirically verified: caught real drift on this box's `guru-connect` submodule (unset identity → Mike Swanson), idempotent on re-runs, forced-drift test on youtube-sync-docker passed. Coord todo `a176100c` opened and closed in the same session.
+
+**Thread 4 — Memory dream skill collision with Mike's parallel consolidation.** Tried the new memory-dream skill (landed via `/sync` earlier in the day). Default report-only run produced a clean report: 104 memory files, 17 orphan files needing index lines, 12 broken backlinks, 12 overlap clusters (biggest: 19 `feedback_syncro_*` files), 1 stale dated fact, 0 profile/repo conflicts. Ran `--apply-safe` to additively append the 17 orphan index lines to `MEMORY.md`. At nearly the same moment, Mike on GURU-BEAST-ROG had completed a thoughtful consolidation pass (`0c00010` "chore(memory): consolidate scattered feedback/project/reference files") that took the store from 104 → 71 files: 19 syncro files into 3 rule files + 1 history file, per-cluster RULE/STATE/HISTORY split for GuruConnect/Dataforth/Cascades/GuruRMM, new `reference_resource_map.md` cheatsheet, MEMORY.md fully rewritten. Pull-rebase produced a merge conflict in MEMORY.md. Resolved by taking Mike's consolidated version (`git checkout --ours .claude/memory/MEMORY.md`; during a rebase, `--ours` is the branch being rebased onto, here Mike's) and discarding my orphan-fix index adds — every file my adds pointed at had been consolidated away on his side. Set-diff verified zero original lines lost. Re-ran dream against the consolidated state: 71 files, 0 orphans, 7 broken backlinks, **5 overlap clusters down from 12**. Skill confirmed working against the new layout but with a false positive that needs fixing — it flags the new intentional `_history.md` companion files as merge candidates against their rule-file siblings. Broadcast `6c559209-a0bb-4007-ad01-cbf07deead1a` told the fleet about the consolidation, instructed each machine to `/sync` + re-dream locally, and warned about the false-positive merge proposals to ignore. Filed coord todo `5ad05d03-74ca-491d-9e72-3a699fcd1150` to refine the cluster heuristic.
+
+**Side threads (smaller scope but real work):**
+- **Rednour Law M365 onboarding + Emma → Carla rename** earlier in the day (this session crossed from yesterday's tail into today's UTC midnight). Bootstrapped the full ComputerGuru MSP app suite for `rednourlaw.com` via Tenant Admin consent + `onboard-tenant.sh`; renamed `emma@` → `carla@rednourlaw.com` (Carla Skinner) with mail aliases preserved; added `smtp:nick@` alias on Nick Pafford's existing `npafford@` mailbox; Syncro ticket #32343 updated + 0.5h billed + marked Resolved.
+- **youtube-sync-docker pickup**: Mike asked to pull up the YouTube downloader project. Found it as a personal Gitea repo, cloned as a submodule. Read the codebase, found a real bug (Settings page wrote to `settings.json` but nothing downstream read it), fixed it with `apply_schedule()` helper + sync.sh/entrypoint.sh changes + 9 pytest cases across two commits. Code-reviewed both rounds.
+
+## Key Decisions
+
+- **Override removal: only after unit refresh.** Mike said "remove the override now that upstream is fixed", but inspection showed the agent binary was auto-updated today while the unit file on disk was still the buggy 2026-05-24 version. Removing the override alone would have regressed BUG-016 on this box. Caught that before acting and proposed refreshing the unit file first; Mike's intent was preserved by doing both steps together.
+- **Took ours on the MEMORY.md merge conflict.** During the rebase against Mike's `0c00010` consolidation, my `--apply-safe` orphan-fix additions were now stale (every file they referenced had been consolidated away). Took his version and discarded my adds rather than trying to reconcile per-line. Verified set-diff showed zero original content lost.
+- **`StateDirectory=gururmm` is the right systemd directive (preferred over `ReadWritePaths=/var/lib/gururmm`).** It auto-creates the dir with correct ownership, binds it rw in the unit's namespace, documents intent ("this service has persistent state"), and handles uninstall/reinstall cleanly. Spec recommended both options; upstream picked `StateDirectory` which matched my own preference.
+- **Cache device_id in `OnceLock`, not `/etc/machine-id`.** The existing comment at `device_id.rs:7-10` explicitly rejected hardware IDs because OEMs ship machines with identical hardware IDs (un-sysprepped factory images). The OnceLock approach is the right shape — survives persist failure, doesn't depend on hardware ID.
+- **Memory-dream merge proposals stay advisory, never auto-applied.** The skill's `_history.md` false positives confirm the design choice that merges always go through human approval. Filed a heuristic-refinement todo so future reports stay actionable, but the skill is functionally correct as-is.
+- **Submodule identity reconcile uses Option A from the spec** (extend the existing init while-loop with `(cd ... && reconcile_git_identity ...)`) over Option B (inline duplicate logic in `submodule foreach`) or Option C (factor into a sourceable library). Empirically verified the heuristic catches real drift and is idempotent.
+- **Two youtube-sync-docker commits with wrong author** (`ef903c8`, `fdff0a7` authored as `ComputerGuru`) left as-is — rewriting history would need force-push to shared remote. The reconcile fix prevents recurrence on this and every other machine.
+- **Override at GURU-KALI removed cleanly at end of session**, replaced by the upstream-fixed base unit. Future agent reinstall would write this same shape — no drift.
+
+## Problems Encountered
+
+- **Initial Graph PATCH for Emma rename failed with `Property 'proxyAddresses' is read-only`.** Graph user write doesn't include `proxyAddresses` even with `Directory.ReadWrite.All`. Split the rename into two tiers: identity via Graph, mail aliases via Exchange REST.
+- **Exchange REST returned HTTP 403** even though the SP was consented. The Exchange Operator SP lacked Exchange Administrator role in the rednourlaw tenant. Resolved by running the full onboarding flow.
+- **Stale read-after-write on Exchange Set-Mailbox and Graph PATCH.** Both writes returned success codes immediately, but verification reads showed old data for ~45s. Polled for UPN convergence; converged within first/second attempt.
+- **sync.sh dead-submodule-ref failure** on routine pull. Manual workaround was `git -c submodule.recurse=false pull --rebase` etc.; fix made `--no-recurse-submodules` the default behavior.
+- **Coding Agent ran sync.sh as a verification step** during the submodule reconcile implementation, which auto-committed + pushed the dirty edit pre-Code-Review. Disclosed honestly by the agent. Code Review on the committed state came back CLEAN; accepted as-is.
+- **MEMORY.md merge conflict** during the memory dream collision with Mike's consolidation pass. Resolved by taking ours (Mike's intentional change) and discarding my now-stale orphan-fix adds.
+- **Auto-update refreshed agent binary but NOT systemd unit file.** Discovered when planning the override removal — the binary on disk was dated 20:24 today (auto-updated with the OnceLock fix) but the unit file was still dated 2026-05-24 (pre-fix template). Without manually refreshing the unit, the override removal would have re-broken BUG-016. Refreshed the unit explicitly before removing.
+
+## Configuration Changes
+
+**ClaudeTools repo (committed across session):**
+- `.claude/scripts/sync.sh` — dead-submodule-ref tolerance, deleted dead Windows-path fallbacks, submodule identity reconcile in Phase 1a, coord_api read from identity.json with fallback. Multiple commits: `c89f22c`, `973e9db`, `4c49b85`.
+- `.claude/scripts/migrate-identity.sh` — populates `coord_api` for any machine missing the field (commit `973e9db`).
+- `.claude/scripts/check-messages.sh`, `check-ksteen-smartbadge.sh` — read `coord_api` from identity.json with fallback (commit `973e9db`).
+- `.claude/skills/remediation-tool/references/tenants.md` — rednourlaw.com row flipped NO → YES with role summary.
+- `clients/rednour/reports/2026-05-31-onboard-and-rename-emma-to-carla.md` — full M365 remediation audit report.
+- `docs/specifications/SUBMODULE-IDENTITY-RECONCILE-SPEC.md` — planning artifact.
+- `.gitmodules` — registered new submodule `projects/youtube-sync-docker`.
+- `.claude/memory/_reports/` — two dream reports (`2026-06-01-1525-dream.md`, `2026-06-01-1526-dream.md`).
+- Submodule pointers advanced: guru-rmm (BUG-016/017 fixes), guru-connect (multiple SPEC-004 tasks), youtube-sync-docker (settings fix + tests at `fdff0a7`).
+
+**ClaudeTools machine-local (not committed; gitignored):**
+- `.claude/identity.json` — added `coord_api: "http://172.16.3.30:8001"` field, bumped `last_updated`.
+- `.claude/current-mode` — set to `dev` during youtube-sync-docker work.
+- All three submodules' local `.git/config` user.name/user.email reconciled to `Mike Swanson / mike@azcomputerguru.com`. `guru-connect` was previously unset (real drift case fixed by the new Phase 1a reconcile).
+
+**gururmm repo (commits by Mike):**
+- `e3d6a46` — BUG-016 + BUG-017 entries in `docs/FEATURE_ROADMAP.md` (filed by me).
+- `30da053` — BUG-016 + BUG-017 fixes shipped (by Mike on another machine).
+- `2089e89` — bug roadmap status marked fixed.
+
+**youtube-sync-docker repo (commits by Mike on this machine via Gitea Agent):**
+- `ef903c8` — settings-not-applied fix + 3 tests (note: authored as `ComputerGuru` due to pre-reconcile drift).
+- `fdff0a7` — apply_schedule tests + `.gitignore` python exclusions.
+
+**GURU-KALI system (not version controlled):**
+- `/etc/systemd/system/gururmm-agent.service` — replaced with upstream-fixed template (gained `StateDirectory=gururmm`). Old version backed up as `gururmm-agent.service.pre-bug016-fix`.
+- `/etc/systemd/system/gururmm-agent.service.d/` — directory + `override.conf` removed (no longer needed).
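The unit-file change above comes down to one question: does the unit give the agent a writable `/var/lib/gururmm`? A minimal sketch of that check follows, assuming a hypothetical helper named `unit_grants_state_dir` that greps a single unit or drop-in file. It is not part of the gururmm tooling, and it does not merge drop-ins or expand specifiers the way systemd itself does.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): succeeds when the given unit or
# drop-in file makes /var/lib/<name> writable for the service, either through
# StateDirectory=<name> or an explicit ReadWritePaths= entry.
unit_grants_state_dir() {
  local unit_file="$1" name="$2"
  # StateDirectory=<name> asks systemd to create /var/lib/<name> and keep it
  # writable even under ProtectSystem=strict.
  grep -Eq "^StateDirectory=(.* )?${name}( |\$)" "$unit_file" && return 0
  # Otherwise accept ReadWritePaths= listing the same directory.
  grep -Eq "^ReadWritePaths=(.* )?/var/lib/${name}( |\$)" "$unit_file"
}
```

Run against the pre-fix unit (`ReadWritePaths=/var/log /usr/local/bin /etc/gururmm`) this returns non-zero. It returns zero for the workaround drop-in and for the refreshed unit with `StateDirectory=gururmm`.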
+
+## Credentials & Secrets
+
+**rednourlaw.com (4a4ca18a-f516-478b-99da-2e0722c5dc18):**
+- Tenant Admin SP `671a2ace-be9e-440c-a7d6-5ff982e4500c` — Conditional Access Administrator
+- Security Investigator SP `704da463-7f4e-484c-b1da-40e447615d52` — Exchange Administrator
+- Exchange Operator SP `59a68ba9-5e1e-4a56-92ae-507a9a669a79` — Exchange Administrator
+- User Manager SP `dc3b79a2-638b-42fe-8ecb-51592db7d40f` — User Administrator + Authentication Administrator
+- Defender Add-on SP `052da8aa-1ca5-4f60-b9c5-7aafcb74264b` — no roles (no MDE in tenant)
+
+**Users renamed/touched:**
+- `93074d1a-6db2-4794-8f7d-c84a619e4494`: emma@ → carla@rednourlaw.com (Carla Skinner). Sessions revoked, password unchanged.
+- `fe859088-bcbc-49dc-aaea-4c6e68f7d5bb`: npafford@ (Nick Pafford); added `smtp:nick@rednourlaw.com` alias.
+
+**Syncro:**
+- Ticket #32343 (id 111409967): comments `415513323` (internal) + `415514647` (customer-visible); line item `42654682` (0.5h remote, $75.00, attributed to Mike user_id 1735). Status: Resolved.
+
+## Infrastructure & Servers
+
+- **GURU-KALI gururmm agent** post-fix state: PID `686646`, device_id `ec975630-d297-4df9-bcb5-a445c65b648d`, base unit `/etc/systemd/system/gururmm-agent.service` (refreshed today), no override drop-ins, mountinfo line 535 shows `/var/lib/gururmm` rw-bound via `StateDirectory=gururmm`.
+- **Coord API** still at `http://172.16.3.30:8001/api/coord` — now configurable per machine via `identity.json` `coord_api` field.
+- **rednourlaw.com tenant**: Global Admin is Carrie Rednour (also reachable via `sysadmin@rednourlaw.com`).
+- **gururmm server-side ghost-row purge complete** — 11 rows → 1 keeper (`agent_id 9bca5090-2d0e-40ad-9078-c11af8a435c0`).
+
+## Commands & Outputs
+
+```bash
+# Diagnostic that revealed process-scoped ro
+grep ' / ' /proc/$AGENT_PID/mountinfo
+# 447 404 259:3 / / ro,nosuid,relatime ... <- agent ns
+# Host's /proc/mounts and findmnt showed rw the whole time.
+
+# Workaround applied early
+sudo tee /etc/systemd/system/gururmm-agent.service.d/override.conf > /dev/null <<'EOF'
+[Service]
+ReadWritePaths=/var/lib/gururmm
+EOF
+sudo systemctl daemon-reload && sudo systemctl restart gururmm-agent
+
+# End-of-session: unit file refreshed to upstream-fixed template, override removed
+sudo cp -a /etc/systemd/system/gururmm-agent.service{,.pre-bug016-fix}
+# (wrote new unit with StateDirectory=gururmm)
+sudo rm -f /etc/systemd/system/gururmm-agent.service.d/override.conf
+sudo rmdir /etc/systemd/system/gururmm-agent.service.d
+sudo systemctl daemon-reload && sudo systemctl restart gururmm-agent
+
+# Sync.sh runs
+bash .claude/scripts/sync.sh # multiple times, each pulling Mike's parallel work
+```
+
+## Pending / Incomplete Tasks
+
+- **Memory-dream cluster heuristic refinement** — coord todo `5ad05d03-74ca-491d-9e72-3a699fcd1150`, open. Either skip clusters containing `_history.md` files or honor frontmatter `merge_locked: true`.
+- **Shared-drive access for Nick Pafford** on Rednour ticket #32343 — deferred to a separate workflow per Mike's instruction.
+- **Other workstations need `migrate-identity.sh`** to pick up the new `coord_api` field. Broadcast sent; on-LAN machines work without it.
+- **Other workstations' submodule git identities** will auto-correct on next `/sync` (one-time warning per drifted submodule).
+- **Two youtube-sync-docker commits authored as `ComputerGuru`** — leaving history alone.
+- **TZ change via Settings UI still requires container restart on youtube-sync-docker** — tzdata locked in at process start. Not in scope to fix.
+- **Sync.sh's Phase 1a now skips submodule advance by default** (per Mike's later change on another machine); pass `--with-submodules` to fetch+advance. Already worked into the new sync.sh by Mike — no action.
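On-LAN machines keep working without `migrate-identity.sh` because the scripts fall back to the old address when `coord_api` is absent. A rough sketch of that lookup is below, assuming a flat string field in `identity.json`. The function name `resolve_coord_api` and the `sed` extraction are illustrative assumptions; the real scripts may parse the file differently, for example with `jq`.

```shell
#!/usr/bin/env bash
# Sketch only: resolve the coord API base URL from identity.json, falling back
# to the LAN default when the file or the field is missing.
DEFAULT_COORD_API="http://172.16.3.30:8001"

resolve_coord_api() {
  local identity_file="$1" value=""
  if [ -r "$identity_file" ]; then
    # Naive extraction of a top-level string field; assumes the value sits on
    # one line and contains no escaped quotes.
    value=$(sed -n 's/.*"coord_api"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$identity_file" | head -n 1)
  fi
  # Use the configured value when present, otherwise the fallback default.
  printf '%s\n' "${value:-$DEFAULT_COORD_API}"
}
```

A caller would use it as `COORD_API=$(resolve_coord_api .claude/identity.json)`, so an off-LAN workstation only needs its own `coord_api` value in the machine-local file.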
+
+## Reference Information
+
+**Commits on the main ClaudeTools branch from this session (Mike, GURU-KALI):**
+- `c89f22c` — sync: dead-submodule-ref tolerance in sync.sh
+- `973e9db` — coord_api lift + identity.json + migrate-identity update + Windows-path cleanup
+- `4c49b85` — submodule identity reconcile in sync.sh Phase 1a
+- `14341d1` (or `c37fd11` post-rebase) — bundle: tenants.md flip + Rednour report + submodule reg + spec doc
+- `805b902` — youtube-sync-docker submodule pointer at `fdff0a7`
+- `633c3fc` — session log + final state
+- `805b902` (post-rebase to current HEAD) — completed
+
+**Submodule HEADs at end of session:**
+- gururmm: `2089e89` (BUG-016/017 marked fixed; latest)
+- guru-connect: at the SPEC-004 Task 9 TOFU provisioning spec point
+- youtube-sync-docker: `fdff0a7` (settings fix + apply_schedule tests)
+
+**Coord messages I sent today (GURU-KALI/claude-main):**
+- `1d93052f` — broadcast: alert routing change (initiated by GURU-5070, I just re-echoed)
+- (deprecated) coord-message about migrate-identity.sh
+- `99162698` — to Howard-Home/claude-main: BUG-016 + BUG-017 filed
+- `d91406ce` — to GURU-5070/claude-main: ghost-fix complete with stable device-id
+- `6c559209` — broadcast: memory consolidation + re-dream + ignore _history.md merge proposals
+
+**Coord todos I created today:**
+- `a176100c-6de5-4e3b-8c1c-8291a2aa6ff0` — submodule identity reconcile in sync.sh (DONE)
+- `5ad05d03-74ca-491d-9e72-3a699fcd1150` — refine memory-dream cluster heuristic (open)
+
+**M365 stable identifiers:**
+- rednourlaw tenant: `4a4ca18a-f516-478b-99da-2e0722c5dc18`
+- Carla user object: `93074d1a-6db2-4794-8f7d-c84a619e4494`
+- Nick user object: `fe859088-bcbc-49dc-aaea-4c6e68f7d5bb`
+
+**GuruRMM stable identifiers:**
+- GURU-KALI agent (post-fix keeper): `agent_id 9bca5090-2d0e-40ad-9078-c11af8a435c0`, `device_id ec975630-d297-4df9-bcb5-a445c65b648d`
+
+**Files of interest left for future sessions:**
+- `clients/rednour/reports/2026-05-31-onboard-and-rename-emma-to-carla.md` — full Rednour audit
+- `docs/specifications/SUBMODULE-IDENTITY-RECONCILE-SPEC.md` — written spec (now implemented)
+- `.claude/memory/_reports/2026-06-01-1525-dream.md` and `2026-06-01-1526-dream.md` — dream reports
+- `/etc/systemd/system/gururmm-agent.service.pre-bug016-fix` — backup of pre-fix unit on this machine (not in repo)
+
+**Raw API artifacts (machine-local, not in repo):**
+- `/tmp/remediation-tool/4a4ca18a-f516-478b-99da-2e0722c5dc18/rednour-rename/` — pre/post Set-Mailbox + Get-Mailbox JSON for both Carla rename and Nick alias add