sync: auto-sync from GURU-5070 at 2026-06-01 06:57:20

Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-01 06:57:20
2026-06-01 06:57:27 -07:00
parent ba7aeebf9e
commit 501f3eb130
4 changed files with 212 additions and 0 deletions
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -1,6 +1,7 @@
 # Memory Index

 ## Reference
+- [RMM agent runs in systemd sandbox](reference_rmm_agent_runs_in_systemd_sandbox.md) — Commands dispatched via the GuruRMM agent run inside its ProtectSystem=strict namespace (/ is ro there); fs/mount probes show the agent's view, NOT the host's. For host truth, SSH in or read /proc/<host-pid>/mountinfo. (lesson 2026-06-01, GURU-KALI ghost churn)
 - [GURU-5070 Rust toolchain](reference_guru5070_rust_toolchain.md) — GURU-5070 now has cargo + MSVC + protoc; build/clippy/test guru-connect LOCALLY (set PROTOC to the winget path) instead of the build host. CI only clippy-checks the Linux server, not the Windows agent.
 - [ACG Office Network Infrastructure](infra_office_network.md) — IPs/hosts/roles for pfSense/Jupiter/VMs/Docker. Check before assuming; .21 (Uranus) is storage.
 - [Power Failure Runbook](../POWER_FAILURE_RUNBOOK.md) — Recovery order after a power event: Tailscale routes, libvirt/VMs, Seafile, NPM/DNS.
@@ -21,6 +22,7 @@
 - [GuruRMM user_session command context](reference_gururmm_user_session_context.md) — command API `context=user_session` runs as the logged-on user (WTS); does interactive-only cmds that fail as SYSTEM. Needs an active (admin) user.
 - [Pluto Build Server](reference_pluto_build_server.md) — Windows build VM: hostname PLUTO = Unraid VM "Claude-Builder" = 172.16.3.36 (all the same box). MSVC + WiX. No `pluto` vault entry. Drive via /rmm (agent enrolls as PLUTO) when SSH key isn't authorized.
 - [Coord /messages API shape](reference_coord_messages_api_shape.md) — GET /api/coord/messages returns {total,skip,limit,messages[]} NOT a bare array; parse .messages[], strip control chars, read flag may be null.
+- [GuruRMM pipeline vendored](reference_gururmm_pipeline_vendored.md) — RMM build scripts version-controlled at gururmm `deploy/build-pipeline/` (2026-06-01); build-shared.sh auto-syncs them to /opt/gururmm each build. Edit-in-repo + push = live, EXCEPT build-shared.sh + webhook-handler.py (manual cp).
 - [Gitea API credential](reference_gitea_api_credential.md) — Gitea API (PRs/merges) as howard uses services/gitea-howard.sops.yaml password on internal http://172.16.3.20:3000; NOT the gururmm-server SSH password.

 ## Users
--- a/.claude/memory/reference_gururmm_pipeline_vendored.md
+++ b/.claude/memory/reference_gururmm_pipeline_vendored.md
@@ -0,0 +1,29 @@
+---
+name: reference_gururmm_pipeline_vendored
+description: GuruRMM build-pipeline scripts are now version-controlled at deploy/build-pipeline/ in the gururmm repo (2026-06-01); build-shared.sh auto-syncs them to /opt/gururmm each build, so edit-in-repo + push = live — EXCEPT build-shared.sh + webhook-handler.py, which need a manual cp.
+metadata:
+  type: reference
+---
+
+The GuruRMM build/CI pipeline runs at **`/opt/gururmm/`** on the gururmm server (172.16.3.30,
+root-owned, hand-maintained). Those scripts had silently diverged from the repo's older `scripts/`
+generation (that drift caused the BUG-015 Windows build-gate gap). Reconciled 2026-06-01:
+
+- **Source of truth:** the live scripts are vendored into the gururmm repo at
+  **`deploy/build-pipeline/`** (build-{windows,linux,mac,agents,server,shared}.sh, sign-windows.sh,
+  webhook-handler.py + README). Commit `2bf539e`.
+- **Drift-stop (commit `24b5daf`):** `build-shared.sh` (runs first in every build, after
+  `git reset --hard origin/main`) now `install -m 0755`-syncs the other 6 build scripts from
+  `deploy/build-pipeline/` → `/opt/gururmm/` on each build. So to change a GuruRMM build script:
+  **edit it in `deploy/build-pipeline/`, push to gururmm main — the next build runs it.** No manual
+  copy, no restart.
+- **Two exceptions need a manual `sudo cp` on change** (neither can overwrite itself mid-run):
+  `build-shared.sh` (the running puller) and `webhook-handler.py` (the persistent HTTP server,
+  which also needs `sudo systemctl restart gururmm-webhook` to reload). Both change rarely. See
+  `deploy/build-pipeline/README.md`.
+
+The webhook still INVOKES the `/opt/gururmm` copies (not the repo copies directly); the sync keeps
+them current. The repo's older `scripts/webhook-handler.py` and `scripts/build-agents.sh` are a prior,
+superseded generation. `build-windows.sh`'s change gate watches `agent/` and `installer/` (the
+BUG-015 fix: installer-only `.wxs`/`.ico` changes now rebuild the MSI). This supersedes the "repo
+copy is stale, don't redeploy" caveat in [[project_rmm_webhook_docs_guard]] for the build scripts
+(not for webhook-handler.py).
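The drift-stop step described in `reference_gururmm_pipeline_vendored.md` amounts to a copy loop over the freshly reset checkout. Below is a minimal shell sketch of that step, not the actual `build-shared.sh` code. The function name `sync_build_scripts` and its two arguments (checkout root, install target) are hypothetical. The sketch also assumes the synced set is every `*.sh` in `deploy/build-pipeline/` except `build-shared.sh`.

```shell
# Hypothetical sketch of the build-shared.sh drift-stop step (not the real script).
# Copies each build script from the repo's deploy/build-pipeline/ into the live
# pipeline directory with mode 0755. It skips build-shared.sh (the script that is
# running, manual cp only). webhook-handler.py is never matched by the *.sh glob.

# sync_build_scripts REPO_DIR DEST_DIR
sync_build_scripts() {
  src="$1/deploy/build-pipeline"
  dest="$2"
  [ -d "$src" ] || { echo "missing $src" >&2; return 1; }
  for f in "$src"/*.sh; do
    [ -e "$f" ] || continue                      # glob matched nothing
    name=$(basename "$f")
    [ "$name" = build-shared.sh ] && continue    # running puller: manual cp only
    # install(1) writes a fresh copy with the given mode, replacing any stale one.
    install -m 0755 "$f" "$dest/$name" || return 1
    echo "synced $name"
  done
}

# Example (on the build server, after git reset --hard origin/main):
#   sync_build_scripts /path/to/gururmm/checkout /opt/gururmm
```

Skipping `build-shared.sh` inside the loop mirrors the manual-`cp` exception: the copy that is currently executing is left untouched.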
--- a/.claude/memory/reference_rmm_agent_runs_in_systemd_sandbox.md
+++ b/.claude/memory/reference_rmm_agent_runs_in_systemd_sandbox.md
@@ -0,0 +1,36 @@
+---
+name: reference_rmm_agent_runs_in_systemd_sandbox
+description: Commands dispatched via the GuruRMM agent execute INSIDE the agent's systemd sandbox (ProtectSystem=strict) — fs/mount observations reflect the agent's private namespace, NOT the host. For host truth, SSH directly or read /proc/<host-pid>/mountinfo.
+metadata:
+  type: reference
+---
+
+The GuruRMM Linux agent runs as a systemd service (`gururmm-agent.service`) hardened with
+**`ProtectSystem=strict`**, which gives the agent process a **private mount namespace where `/`
+is mounted read-only**, with only `ReadWritePaths=` entries writable. **Any command you dispatch
+through the RMM agent (`/rmm shell`, probes) runs inside that namespace** — so `findmnt /`,
+`touch`, `/proc/mounts` etc. report the **agent's sandboxed view, not the host's actual state**.
+
+**Trap (hit 2026-06-01, GURU-KALI):** I misdiagnosed "host root filesystem is read-only" because
+RMM-dispatched `touch /var/lib/gururmm` returned EROFS (os error 30) and `findmnt /` showed `ro`.
+The host root was **rw the entire time** (SMART PASSED, ext4 clean, no kernel remount-ro). The real
+cause: the unit's
+`ReadWritePaths=/var/log /usr/local/bin /etc/gururmm` **omitted `/var/lib/gururmm`**, so the agent
+couldn't persist `/var/lib/gururmm/.device-id` → it re-minted a device_id on each daily
+identity refresh → the server (no machine_uid dedup) filed a new agent row each time (~11 ghosts).
+
+**How to get host truth instead of the sandbox view:**
+- SSH to the host directly (commands there run in the host namespace), OR
+- Read the agent PID's mount namespace explicitly: `cat /proc/<agent_pid>/mountinfo`. A
+  process-scoped `ro` on `/` there, while the host's `findmnt /` shows `rw`, is the tell that it's
+  the sandbox, not the host.
+- `errors=remount-ro` in a mount line is just the stock default mount option, NOT evidence that an
+  error fired. Confirm an actual remount-ro with kernel `EXT4-fs error` logs plus the `dumpe2fs -h`
+  error count, not the mount option alone.
+
+**The fix pattern** (durable, additive): a drop-in
+`/etc/systemd/system/gururmm-agent.service.d/override.conf` containing a `[Service]` section with
+`ReadWritePaths=/var/lib/gururmm` (systemd merges ReadWritePaths additively across drop-ins), then
+`systemctl daemon-reload` + `systemctl restart gururmm-agent`.
+Better upstream fix: `StateDirectory=gururmm` (handles dir creation + perms + RW bind in one
+directive). **Fleet implication:** every systemd-installed GuruRMM Linux agent with this unit shape
+has the same latent bug until the installer is fixed. See filed todos (agent ReadWritePaths/
+StateDirectory + server machine_uid dedup).
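The sandbox-vs-host check and the drop-in fix from `reference_rmm_agent_runs_in_systemd_sandbox.md` can be scripted. The minimal shell sketch below assumes it runs from a host shell over SSH, and all helper names are hypothetical.

It relies on the `/proc/<pid>/mountinfo` layout:

- Field 5 is the mount point.
- Field 6 holds the per-mount options.
- The superblock options come after the lone `-` separator, following the filesystem type and source.

A `ProtectSystem=strict` sandbox makes `/` read-only per mount while the superblock stays `rw`. A kernel error remount would instead show `ro` in the superblock options. Treat the result as a hint, and still confirm a real remount-ro through kernel logs and `dumpe2fs -h`.

```shell
# Hypothetical helpers; run from a host shell (SSH), not through the RMM agent.

# root_mount_opts MOUNTINFO_FILE
# Prints "<per-mount-options> <superblock-options>" for the last "/" entry.
# mountinfo fields: 5 = mount point, 6 = per-mount options, then optional
# fields, a lone "-", filesystem type, mount source, superblock options.
root_mount_opts() {
  awk '$5 == "/" {
         for (i = 7; i <= NF; i++) if ($i == "-") break
         mnt = $6; sup = $(i + 3)
       }
       END { print mnt, sup }' "$1"
}

# has_opt OPTION_LIST OPTION: succeeds only on an exact comma-separated entry,
# so "errors=remount-ro" does NOT count as "ro".
has_opt() {
  case ",$1," in *",$2,"*) return 0 ;; *) return 1 ;; esac
}

# classify_root_ro MOUNTINFO_FILE -> prints sandbox-ro | fs-ro | rw | unknown
classify_root_ro() {
  set -- $(root_mount_opts "$1")
  [ $# -eq 2 ] || { echo unknown; return 1; }   # no "/" entry found
  if has_opt "$2" ro; then
    echo fs-ro        # superblock itself is ro (e.g. a real error remount)
  elif has_opt "$1" ro; then
    echo sandbox-ro   # only this mount is ro; the filesystem is still rw
  else
    echo rw
  fi
}

# write_rw_dropin DROPIN_DIR
# Writes the additive ReadWritePaths= override. On a real host DROPIN_DIR would be
# /etc/systemd/system/gururmm-agent.service.d, followed by
#   systemctl daemon-reload && systemctl restart gururmm-agent
write_rw_dropin() {
  mkdir -p "$1" || return 1
  printf '[Service]\nReadWritePaths=/var/lib/gururmm\n' > "$1/override.conf"
}

# Example, on the host:
#   pid=$(systemctl show -p MainPID --value gururmm-agent)
#   classify_root_ro "/proc/$pid/mountinfo"   # agent's sandboxed view
#   classify_root_ro /proc/self/mountinfo     # host's view
```

`write_rw_dropin` only writes the file; the reload and restart still have to follow. Per the note, `StateDirectory=gururmm` remains the better upstream fix.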