sync: auto-sync from GURU-5070 at 2026-06-01 06:57:20

Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-01 06:57:20
2026-06-01 06:57:27 -07:00
parent ba7aeebf9e
commit 501f3eb130
4 changed files with 212 additions and 0 deletions
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -1,6 +1,7 @@
 # Memory Index

 ## Reference
+- [RMM agent runs in systemd sandbox](reference_rmm_agent_runs_in_systemd_sandbox.md) — Commands dispatched via the GuruRMM agent run inside its ProtectSystem=strict namespace (/ is ro there); fs/mount probes show the agent's view NOT the host. SSH or read /proc/<pid>/mountinfo for host truth. (lesson 2026-06-01, GURU-KALI ghost churn)
 - [GURU-5070 Rust toolchain](reference_guru5070_rust_toolchain.md) — GURU-5070 now has cargo + MSVC + protoc; build/clippy/test guru-connect LOCALLY (set PROTOC to the winget path) instead of the build host. CI only clippy-checks the Linux server, not the Windows agent.
 - [ACG Office Network Infrastructure](infra_office_network.md) — IPs/hosts/roles for pfSense/Jupiter/VMs/Docker. Check before assuming; .21 (Uranus) is storage.
 - [Power Failure Runbook](../POWER_FAILURE_RUNBOOK.md) — Recovery order after a power event: Tailscale routes, libvirt/VMs, Seafile, NPM/DNS.
@@ -21,6 +22,7 @@
 - [GuruRMM user_session command context](reference_gururmm_user_session_context.md) — command API `context=user_session` runs as the logged-on user (WTS); does interactive-only cmds that fail as SYSTEM. Needs an active (admin) user.
 - [Pluto Build Server](reference_pluto_build_server.md) — Windows build VM: hostname PLUTO = Unraid VM "Claude-Builder" = 172.16.3.36 (all the same box). MSVC + WiX. No `pluto` vault entry. Drive via /rmm (agent enrolls as PLUTO) when SSH key isn't authorized.
 - [Coord /messages API shape](reference_coord_messages_api_shape.md) — GET /api/coord/messages returns {total,skip,limit,messages[]} NOT a bare array; parse .messages[], strip control chars, read flag may be null.
+- [GuruRMM pipeline vendored](reference_gururmm_pipeline_vendored.md) — RMM build scripts version-controlled at gururmm `deploy/build-pipeline/` (2026-06-01); build-shared.sh auto-syncs them to /opt/gururmm each build. Edit-in-repo + push = live, EXCEPT build-shared.sh + webhook-handler.py (manual cp).
 - [Gitea API credential](reference_gitea_api_credential.md) — Gitea API (PRs/merges) as howard uses services/gitea-howard.sops.yaml password on internal http://172.16.3.20:3000; NOT the gururmm-server SSH password.

 ## Users
--- a/.claude/memory/reference_gururmm_pipeline_vendored.md
+++ b/.claude/memory/reference_gururmm_pipeline_vendored.md
@@ -0,0 +1,29 @@
+---
+name: reference_gururmm_pipeline_vendored
+description: GuruRMM build-pipeline scripts are now version-controlled at deploy/build-pipeline/ in the gururmm repo (2026-06-01); build-shared.sh auto-syncs them to /opt/gururmm each build, so edit-in-repo + push = live — EXCEPT build-shared.sh + webhook-handler.py, which need a manual cp.
+metadata:
+  type: reference
+---
+
+The GuruRMM build/CI pipeline runs at **`/opt/gururmm/`** on the gururmm server (172.16.3.30,
+root-owned, hand-maintained). Those scripts had silently diverged from the repo's older `scripts/`
+generation (that drift caused the BUG-015 Windows build-gate gap). Reconciled 2026-06-01:
+
+- **Source of truth:** the live scripts are vendored into the gururmm repo at
+  **`deploy/build-pipeline/`** (build-{windows,linux,mac,agents,server,shared}.sh, sign-windows.sh,
+  webhook-handler.py + README). Commit `2bf539e`.
+- **Drift-stop (commit `24b5daf`):** `build-shared.sh` (runs first every build, after
+  `git reset --hard origin/main`) now `install -m 0755`-syncs the 6 build scripts from
+  `deploy/build-pipeline/` → `/opt/gururmm/` each build. So to change a GuruRMM build script:
+  **edit it in `deploy/build-pipeline/`, push to gururmm main — the next build runs it.** No manual
+  copy, no restart.
+- **Two exceptions — need a manual `sudo cp` on change** (they can't self-overwrite mid-run):
+  `build-shared.sh` (the running puller) and `webhook-handler.py` (the persistent HTTP server;
+  also needs `sudo systemctl restart gururmm-webhook` to reload). They change rarely. See
+  `deploy/build-pipeline/README.md`.
+
+Webhook still INVOKES the `/opt/gururmm` copies (not the repo copies directly) — the sync keeps
+them current. The repo's older `scripts/webhook-handler.py` + `scripts/build-agents.sh` are a prior
+generation, superseded. The `build-windows.sh` change-gate watches `agent/` and `installer/` (BUG-015
+fix: installer-only `.wxs`/`.ico` changes rebuild the MSI). Supersedes the "repo copy is stale, don't
+redeploy" caveat in [[project_rmm_webhook_docs_guard]] for the build scripts (not webhook-handler.py).
--- a/.claude/memory/reference_rmm_agent_runs_in_systemd_sandbox.md
+++ b/.claude/memory/reference_rmm_agent_runs_in_systemd_sandbox.md
@@ -0,0 +1,36 @@
+---
+name: reference_rmm_agent_runs_in_systemd_sandbox
+description: Commands dispatched via the GuruRMM agent execute INSIDE the agent's systemd sandbox (ProtectSystem=strict) — fs/mount observations reflect the agent's private namespace, NOT the host. For host truth, SSH directly or read /proc/<host-pid>/mountinfo.
+metadata:
+  type: reference
+---
+
+The GuruRMM Linux agent runs as a systemd service (`gururmm-agent.service`) hardened with
+**`ProtectSystem=strict`**, which gives the agent process a **private mount namespace where `/`
+is mounted read-only**, with only `ReadWritePaths=` entries writable. **Any command you dispatch
+through the RMM agent (`/rmm shell`, probes) runs inside that namespace** — so `findmnt /`,
+`touch`, `/proc/mounts` etc. report the **agent's sandboxed view, not the host's actual state**.
+
+**Trap (hit 2026-06-01, GURU-KALI):** I diagnosed "host root filesystem is read-only" because
+RMM-dispatched `touch /var/lib/gururmm` returned EROFS (os error 30) and `findmnt /` showed `ro`.
+The host root was **rw the entire time** (SMART PASSED, ext4 clean, no kernel remount-ro — all
+consistent with the host being fine). The real cause: the unit's
+`ReadWritePaths=/var/log /usr/local/bin /etc/gururmm` **omitted `/var/lib/gururmm`**, so the agent
+couldn't persist `/var/lib/gururmm/.device-id` → it re-minted a device_id on each daily
+identity refresh → the server (no machine_uid dedup) filed a new agent row each time (~11 ghosts).
+
+**How to get host truth instead of the sandbox view:**
+- SSH to the host directly (commands there run in the host namespace), OR
+- Read the agent PID's namespace explicitly: `cat /proc/<agent_pid>/mountinfo` — the process-scoped
+  `ro` on `/` is the tell that it's sandbox, not host. Compare against the host's `findmnt`.
+- `errors=remount-ro` in a mount line is just the stock default mount option — NOT evidence an
+  error fired. Confirm an actual remount-ro with kernel `EXT4-fs error` logs + `dumpe2fs -h` error
+  count, not the mount option alone.
+
+**The fix pattern** (durable, additive): drop-in
+`/etc/systemd/system/gururmm-agent.service.d/override.conf` with `[Service]\nReadWritePaths=/var/lib/gururmm`
+(systemd merges ReadWritePaths additively across drop-ins), then `systemctl daemon-reload` + `systemctl restart gururmm-agent`.
+Better upstream fix: `StateDirectory=gururmm` (handles dir creation + perms + RW bind in one
+directive). **Fleet implication:** every systemd-installed GuruRMM Linux agent with this unit shape
+has the same latent bug until the installer is fixed. See filed todos (agent ReadWritePaths/
+StateDirectory + server machine_uid dedup).
--- a/wiki/systems/jupiter-docker-templating.md
+++ b/wiki/systems/jupiter-docker-templating.md
@@ -0,0 +1,145 @@
+# Jupiter — Docker → Unraid Template/Compose Adoption Plan
+
+**System:** Jupiter (Unraid primary container host) — `172.16.3.20`, SSH `root@:22`
+(creds: `infrastructure/jupiter-unraid-primary.sops.yaml`).
+**Goal:** make every container show Unraid's UI features (WebUI button, icon, update-check,
+rich Edit form) by giving it the `net.unraid.docker.*` labels it currently lacks.
+**Status:** INSPECTION + PLANNING complete (2026-06-01). **Target #1 (gururmm-agent) DONE
+2026-06-01** — workflow validated. Remaining recreates HELD for a maintenance window.
+
+## Execution log
+- **2026-06-01 — gururmm-agent [DONE].** Wrote `templates-user/my-gururmm-agent.xml`; backed up
+  full inspect to `/root/gururmm-agent.inspect.bak.json`; stopped+rm'd, recreated via `docker run`
+  with `-l net.unraid.docker.managed=dockerman` + the captured spec. Verified: label=dockerman,
+  config faithful, agent re-authenticated and resumed metrics/inventory/check polling. One clean
+  exit-0 restart at startup = the agent's self-update finalize (cleaned rollback artifacts), then
+  stable. Now shows as a managed container in the Unraid Docker tab with the rich Edit form.
+  - **device-id persistence [FIXED 2026-06-01]:** the agent's device-id lived at
+    `/var/lib/gururmm/.device-id` *inside* the container (ephemeral). Fix: `docker cp`'d the live
+    `/var/lib/gururmm/.` out to `/mnt/user/appdata/gururmm/lib/` (preserving device-id
+    `88abeef0-cb3a-4c3f-9353-61fedcdf587d`), added a `-v /mnt/user/appdata/gururmm/lib:/var/lib/gururmm`
+    mapping + matching template Config, and recreated. Verified the agent reused the same device-id
+    (no "Persisting new device ID") and the **same agent_id `443bfabb`** — identity now durable
+    across recreates/updates. Enrollment identity also persists via `config.toml` in `/config`.
+  - **Ghost check [CLEAN]:** GuruRMM `/api/agents` shows exactly one Jupiter row (`443bfabb`,
+    last-seen current). The three recreates created no duplicate. **Incidental:** `GURU-KALI` has
+    ~11 duplicate agent rows (v0.6.46/0.6.50, stale) — same ephemeral-identity pattern on a
+    frequently-reinstalled box; cleanup candidate, out of scope for this task.
+
+---
+
+## Why the features are missing
+
+Unraid's per-container UI (WebUI/icon/update-check/Edit) is driven by **container labels**
+(`net.unraid.docker.managed`, `.webui`, `.icon`) + a template XML in
+`/boot/config/plugins/dockerMan/templates-user/`. Those labels are **immutable on a running
+container** — they're baked in at `docker create` time. Containers started by raw
+`docker run` / `docker-compose` CLI (instead of Unraid's "Add Container" form or the Compose
+Manager plugin) never get them. **The only fix is to RECREATE each container** through the
+proper mechanism. Data in mapped volumes is untouched by a recreate; the risk is downtime +
+getting the recreate config exactly right.
+
+Two correct mechanisms (both yield the full UI feature set):
+- **dockerman template** — for single CLI containers. Unraid "Add Container" → template.
+- **composeman (Compose Manager plugin)** — for multi-container compose stacks. Adopt the
+  existing `docker-compose.yml` into the plugin so the whole stack gets the labels while
+  keeping its compose orchestration + private network. Plugin IS installed (only "RustDesk"
+  registered today).
+
+---
+
+## Inventory (21 containers; 14 raw / `managed=NONE`)
+
+### Already templated (managed=dockerman) — no action
+DockerUISP, Seerr, qbittorrent, binhex-emby, binhex-sabnzbd, binhex-plexpass, rsync-server
+
+### Raw (managed=NONE) — the targets, grouped by disposition
+
+| Container | Image | Created via | Existing template | Disposition | Risk |
+|---|---|---|---|---|---|
+| **gururmm-agent** | localhost:3000/azcomputerguru/gururmm-agent:latest | CLI, net=host | none | NEW dockerman template | LOW |
+| youtube-sync-test | azcomputerguru/youtube-sync:latest | CLI | none | NEW template (or retire — "test") | LOW |
+| binhex-radarr | binhex/arch-radarr | CLI | my-binhex-radarr.xml | reconcile + recreate from template | MED |
+| binhex-sonarr | binhex/arch-sonarr | CLI | my-binhex-sonarr.xml | reconcile + recreate from template | MED |
+| MariaDB-Official | mariadb:latest | CLI | my-MariaDB-Official.xml | reconcile + recreate (snapshot appdata first) | MED (DB) |
+| **seafile** + seafile-mysql + seafile-memcached + seafile-elasticsearch | seafileltd/seafile-pro-mc:12.0 / mariadb:10.6 / memcached:1.6.18 / elasticsearch:7.17.26 | compose `dockercompose` @ /mnt/user0/SeaFile/DockerCompose/docker-compose.yml | partial (my-SeaFile*.xml, my-memcached.xml) | **adopt stack into Compose Manager** | MED-HIGH |
+| **gitea** + **gitea-db** | gitea/gitea:latest / mysql:8 | compose `gitea` @ /mnt/cache/appdata/gitea/docker-compose.yml | none | **adopt stack into Compose Manager** | **HIGH** (repos + GuruRMM build pipeline) |
+| **npm** | jc21/nginx-proxy-manager:latest | CLI | my-NginxProxyManager.xml | reconcile + recreate from template | **HIGH** (public reverse proxy) |
+| app (Discourse) | local_discourse/app | Discourse `./launcher` (no compose file) | none | **LEAVE AS-IS** — self-managed; templating breaks `./launcher rebuild` | n/a |
+| radio-archive | radio-archive:latest | compose `app` | none | tied to Discourse project — leave with app | LOW |
+
+Note: several CLI containers (npm, radarr, sonarr, MariaDB-Official) already HAVE a matching
+template XML — the running container just isn't linked to it (recreated via CLI later, which
+stripped the managed label). For these, recreate-from-existing-template is the easy path, but
+**verify the template's ports/paths/env still match the live container** before applying.
+
+---
+
+## Recreate sequencing (least → most critical)
+
+Do them one at a time, and verify each comes back healthy before starting the next.
+
+1. **gururmm-agent** — LOW. Local image, net=host, no public dependents. Proves the workflow.
+   Spec captured below.
+2. youtube-sync-test, radio-archive — LOW. (Confirm youtube-sync-test isn't disposable first.)
+3. binhex-radarr, binhex-sonarr — MED. Media, non-critical, templates already exist.
+4. MariaDB-Official — MED. **Snapshot `/mnt/.../appdata` (or mysqldump) first.**
+5. seafile stack — MED-HIGH. Adopt into Compose Manager. **Backup first.** `down` → register → `up`.
+6. **gitea + gitea-db** — HIGH, dedicated window. **Backup gitea appdata + `mysqldump` gitea-db
+   first.** Pausing Gitea stops repo access AND the GuruRMM webhook build pipeline. Adopt the
+   existing compose into Compose Manager.
+7. **npm** — HIGH, schedule with comms. Recreating drops the public reverse proxy → all proxied
+   public services (connect., rmm., git., community., seafile.) briefly down. **Backup `/data` +
+   `/etc/letsencrypt` first.** Recreate from my-NginxProxyManager.xml (verify port maps:
+   80→1880, 81→7818, 443→18443).
+8. Discourse (app) — LEAVE.
+
+---
+
+## Captured recreate spec — gururmm-agent (target #1)
+
+```
+Image:      localhost:3000/azcomputerguru/gururmm-agent:latest
+Network:    host
+Restart:    unless-stopped
+Privileged: false   CapAdd: none   Devices: none (kvm passed as a bind mount)
+Entrypoint: /usr/local/bin/gururmm-agent
+Cmd:        run
+Env:        GURURMM_CONFIG=/config/config.toml
+Volumes:
+  /dev/kvm                          -> /dev/kvm                       (ro)
+  /proc                             -> /proc                          (ro)
+  /sys                              -> /sys                           (ro)
+  /var/run/docker.sock              -> /var/run/docker.sock           (rw)
+  /var/run/libvirt/libvirt-sock     -> /var/run/libvirt/libvirt-sock  (ro)
+  /mnt/user/appdata/gururmm         -> /config                        (rw)
+```
+
+Equivalent `docker run` (what the dockerman template encodes):
+```bash
+docker run -d --name gururmm-agent \
+  --network host --restart unless-stopped \
+  -e GURURMM_CONFIG=/config/config.toml \
+  -v /dev/kvm:/dev/kvm:ro \
+  -v /proc:/proc:ro \
+  -v /sys:/sys:ro \
+  -v /var/run/docker.sock:/var/run/docker.sock \
+  -v /var/run/libvirt/libvirt-sock:/var/run/libvirt/libvirt-sock:ro \
+  -v /mnt/user/appdata/gururmm:/config \
+  --entrypoint /usr/local/bin/gururmm-agent \
+  localhost:3000/azcomputerguru/gururmm-agent:latest run
+```
+Recreate path: build the template in Unraid "Add Container" (Repository, Network=host, the 6
+path mappings above plus the `/var/lib/gururmm` mapping added by the device-id fix, the env var,
+Extra Params `--entrypoint /usr/local/bin/gururmm-agent`, Post Arguments `run`), run
+`docker stop gururmm-agent && docker rm gururmm-agent`, then apply the template. Note: it's a
+localhost-registry image, so update-check won't be meaningful, but icon/Edit form come back (WebUI n/a).
+
+---
+
+## Open items before execution
+- Confirm `youtube-sync-test` is keep-or-retire (the "-test" name suggests disposable).
+- For each "template exists" container (npm/radarr/sonarr/MariaDB-Official): diff the template
+  XML against the live `docker inspect` (ports/paths/env) so the recreate doesn't lose config.
+- Pick the maintenance window(s). Suggest: a low-risk batch (1-4) any time; seafile its own slot;
+  gitea + npm each in a dedicated announced window, backup-first.