From 501f3eb1300054518867f0a0744a11ef08bae1b3 Mon Sep 17 00:00:00 2001 From: Mike Swanson Date: Mon, 1 Jun 2026 06:57:27 -0700 Subject: [PATCH] sync: auto-sync from GURU-5070 at 2026-06-01 06:57:20 Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-01 06:57:20 --- .claude/memory/MEMORY.md | 2 + .../reference_gururmm_pipeline_vendored.md | 29 ++++ ...rence_rmm_agent_runs_in_systemd_sandbox.md | 36 +++++ wiki/systems/jupiter-docker-templating.md | 145 ++++++++++++++++++ 4 files changed, 212 insertions(+) create mode 100644 .claude/memory/reference_gururmm_pipeline_vendored.md create mode 100644 .claude/memory/reference_rmm_agent_runs_in_systemd_sandbox.md create mode 100644 wiki/systems/jupiter-docker-templating.md diff --git a/.claude/memory/MEMORY.md b/.claude/memory/MEMORY.md index c279d9c..f45f3cb 100644 --- a/.claude/memory/MEMORY.md +++ b/.claude/memory/MEMORY.md @@ -1,6 +1,7 @@ # Memory Index ## Reference +- [RMM agent runs in systemd sandbox](reference_rmm_agent_runs_in_systemd_sandbox.md) — Commands dispatched via the GuruRMM agent run inside its ProtectSystem=strict namespace (/ is ro there); fs/mount probes show the agent's view NOT the host. SSH or read /proc//mountinfo for host truth. (lesson 2026-06-01, GURU-KALI ghost churn) - [GURU-5070 Rust toolchain](reference_guru5070_rust_toolchain.md) — GURU-5070 now has cargo + MSVC + protoc; build/clippy/test guru-connect LOCALLY (set PROTOC to the winget path) instead of the build host. CI only clippy-checks the Linux server, not the Windows agent. - [ACG Office Network Infrastructure](infra_office_network.md) — IPs/hosts/roles for pfSense/Jupiter/VMs/Docker. Check before assuming; .21 (Uranus) is storage. - [Power Failure Runbook](../POWER_FAILURE_RUNBOOK.md) — Recovery order after a power event: Tailscale routes, libvirt/VMs, Seafile, NPM/DNS. 
@@ -21,6 +22,7 @@ - [GuruRMM user_session command context](reference_gururmm_user_session_context.md) — command API `context=user_session` runs as the logged-on user (WTS); does interactive-only cmds that fail as SYSTEM. Needs an active (admin) user. - [Pluto Build Server](reference_pluto_build_server.md) — Windows build VM: hostname PLUTO = Unraid VM "Claude-Builder" = 172.16.3.36 (all the same box). MSVC + WiX. No `pluto` vault entry. Drive via /rmm (agent enrolls as PLUTO) when SSH key isn't authorized. - [Coord /messages API shape](reference_coord_messages_api_shape.md) — GET /api/coord/messages returns {total,skip,limit,messages[]} NOT a bare array; parse .messages[], strip control chars, read flag may be null. +- [GuruRMM pipeline vendored](reference_gururmm_pipeline_vendored.md) — RMM build scripts version-controlled at gururmm `deploy/build-pipeline/` (2026-06-01); build-shared.sh auto-syncs them to /opt/gururmm each build. Edit-in-repo + push = live, EXCEPT build-shared.sh + webhook-handler.py (manual cp). - [Gitea API credential](reference_gitea_api_credential.md) — Gitea API (PRs/merges) as howard uses services/gitea-howard.sops.yaml password on internal http://172.16.3.20:3000; NOT the gururmm-server SSH password. ## Users diff --git a/.claude/memory/reference_gururmm_pipeline_vendored.md b/.claude/memory/reference_gururmm_pipeline_vendored.md new file mode 100644 index 0000000..f5d29ac --- /dev/null +++ b/.claude/memory/reference_gururmm_pipeline_vendored.md @@ -0,0 +1,29 @@ +--- +name: reference_gururmm_pipeline_vendored +description: GuruRMM build-pipeline scripts are now version-controlled at deploy/build-pipeline/ in the gururmm repo (2026-06-01); build-shared.sh auto-syncs them to /opt/gururmm each build, so edit-in-repo + push = live — EXCEPT build-shared.sh + webhook-handler.py, which need a manual cp. 
+metadata: + type: reference +--- + +The GuruRMM build/CI pipeline runs at **`/opt/gururmm/`** on the gururmm server (172.16.3.30, +root-owned, hand-maintained). Those scripts had silently diverged from the repo's older `scripts/` +generation (that drift caused the BUG-015 Windows build-gate gap). Reconciled 2026-06-01: + +- **Source of truth:** the live scripts are vendored into the gururmm repo at + **`deploy/build-pipeline/`** (build-{windows,linux,mac,agents,server,shared}.sh, sign-windows.sh, + webhook-handler.py + README). Commit `2bf539e`. +- **Drift-stop (commit `24b5daf`):** `build-shared.sh` (runs first every build, after + `git reset --hard origin/main`) now `install -m 0755`-syncs the 6 build scripts from + `deploy/build-pipeline/` → `/opt/gururmm/` each build. So to change a GuruRMM build script: + **edit it in `deploy/build-pipeline/`, push to gururmm main — the next build runs it.** No manual + copy, no restart. +- **Two exceptions — need a manual `sudo cp` on change** (they can't self-overwrite mid-run): + `build-shared.sh` (the running puller) and `webhook-handler.py` (the persistent HTTP server; + also needs `sudo systemctl restart gururmm-webhook` to reload). They change rarely. See + `deploy/build-pipeline/README.md`. + +Webhook still INVOKES the `/opt/gururmm` copies (not the repo copies directly) — the sync keeps +them current. The repo's older `scripts/webhook-handler.py` + `scripts/build-agents.sh` are a prior +generation, superseded. Build-windows.sh's change-gate watches `agent/ installer/` (BUG-015 fix — +installer-only `.wxs`/`.ico` changes rebuild the MSI). Supersedes the "repo copy is stale, don't +redeploy" caveat in [[project_rmm_webhook_docs_guard]] for the build scripts (not webhook-handler.py). 
diff --git a/.claude/memory/reference_rmm_agent_runs_in_systemd_sandbox.md b/.claude/memory/reference_rmm_agent_runs_in_systemd_sandbox.md new file mode 100644 index 0000000..6e52cb0 --- /dev/null +++ b/.claude/memory/reference_rmm_agent_runs_in_systemd_sandbox.md @@ -0,0 +1,36 @@ +--- +name: reference_rmm_agent_runs_in_systemd_sandbox +description: Commands dispatched via the GuruRMM agent execute INSIDE the agent's systemd sandbox (ProtectSystem=strict) — fs/mount observations reflect the agent's private namespace, NOT the host. For host truth, SSH directly or read /proc//mountinfo. +metadata: + type: reference +--- + +The GuruRMM Linux agent runs as a systemd service (`gururmm-agent.service`) hardened with +**`ProtectSystem=strict`**, which gives the agent process a **private mount namespace where `/` +is mounted read-only**, with only `ReadWritePaths=` entries writable. **Any command you dispatch +through the RMM agent (`/rmm shell`, probes) runs inside that namespace** — so `findmnt /`, +`touch`, `/proc/mounts` etc. report the **agent's sandboxed view, not the host's actual state**. + +**Trap (hit 2026-06-01, GURU-KALI):** I diagnosed "host root filesystem is read-only" because +RMM-dispatched `touch /var/lib/gururmm` returned EROFS (os error 30) and `findmnt /` showed `ro`. +The host root was **rw the entire time** (SMART PASSED, ext4 clean, no kernel remount-ro — all +consistent with the host being fine). The real cause: the unit's +`ReadWritePaths=/var/log /usr/local/bin /etc/gururmm` **omitted `/var/lib/gururmm`**, so the agent +couldn't persist `/var/lib/gururmm/.device-id` → it re-minted a device_id on each daily +identity refresh → the server (no machine_uid dedup) filed a new agent row each time (~11 ghosts). 
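
The trap above reduces to a path-coverage question: under `ProtectSystem=strict`, the agent can write a path only if it sits under a `ReadWritePaths=` entry. A minimal Python sketch of that check follows. It is a simplified model, not systemd's actual logic: it ignores other writable exceptions (for example the `/dev`, `/proc` and `/sys` API filesystems, or directories granted by `StateDirectory=`). The helper name `is_writable_in_sandbox` is hypothetical; the path lists are the ones quoted in this note.

```python
from pathlib import PurePosixPath


def is_writable_in_sandbox(path: str, read_write_paths: list[str]) -> bool:
    """Simplified model of ProtectSystem=strict (hypothetical helper).

    A path counts as writable only if it equals, or lies beneath, one of the
    unit's ReadWritePaths= entries. Real systemd has further exceptions that
    are deliberately not modeled here.
    """
    target = PurePosixPath(path)
    for entry in read_write_paths:
        rw = PurePosixPath(entry)
        if target == rw or rw in target.parents:
            return True
    return False


# ReadWritePaths= from the unit as shipped (quoted in this note).
shipped = ["/var/log", "/usr/local/bin", "/etc/gururmm"]
# The same list with the missing state directory added.
patched = shipped + ["/var/lib/gururmm"]

device_id = "/var/lib/gururmm/.device-id"
print(is_writable_in_sandbox(device_id, shipped))  # False
print(is_writable_in_sandbox(device_id, patched))  # True
```

Running the same check against a unit's effective `ReadWritePaths=` before trusting an EROFS result would have pointed at the unit file rather than the disk.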
+ +**How to get host truth instead of the sandbox view:** +- SSH to the host directly (commands there run in the host namespace), OR +- Read the agent PID's namespace explicitly: `cat /proc//mountinfo` — the process-scoped + `ro` on `/` is the tell that it's sandbox, not host. Compare against the host's `findmnt`. +- `errors=remount-ro` in a mount line is just the stock default mount option — NOT evidence an + error fired. Confirm an actual remount-ro with kernel `EXT4-fs error` logs + `dumpe2fs -h` error + count, not the mount option alone. + +**The fix pattern** (durable, additive): drop-in +`/etc/systemd/system/gururmm-agent.service.d/override.conf` with `[Service]\nReadWritePaths=/var/lib/gururmm` +(systemd merges ReadWritePaths additively across drop-ins), then `daemon-reload` + `restart`. +Better upstream fix: `StateDirectory=gururmm` (handles dir creation + perms + RW bind in one +directive). **Fleet implication:** every systemd-installed GuruRMM Linux agent with this unit shape +has the same latent bug until the installer is fixed. See filed todos (agent ReadWritePaths/ +StateDirectory + server machine_uid dedup). diff --git a/wiki/systems/jupiter-docker-templating.md b/wiki/systems/jupiter-docker-templating.md new file mode 100644 index 0000000..ad521f5 --- /dev/null +++ b/wiki/systems/jupiter-docker-templating.md @@ -0,0 +1,145 @@ +# Jupiter — Docker → Unraid Template/Compose Adoption Plan + +**System:** Jupiter (Unraid primary container host) — `172.16.3.20`, SSH `root@:22` +(creds: `infrastructure/jupiter-unraid-primary.sops.yaml`). +**Goal:** make every container show Unraid's UI features (WebUI button, icon, update-check, +rich Edit form) by giving it the `net.unraid.docker.*` labels it currently lacks. +**Status:** INSPECTION + PLANNING complete (2026-06-01). **Target #1 (gururmm-agent) DONE +2026-06-01** — workflow validated. Remaining recreates HELD for a maintenance window. 
+ +## Execution log +- **2026-06-01 — gururmm-agent [DONE].** Wrote `templates-user/my-gururmm-agent.xml`; backed up + full inspect to `/root/gururmm-agent.inspect.bak.json`; stopped+rm'd, recreated via `docker run` + with `-l net.unraid.docker.managed=dockerman` + the captured spec. Verified: label=dockerman, + config faithful, agent re-authenticated and resumed metrics/inventory/check polling. One clean + exit-0 restart at startup = the agent's self-update finalize (cleaned rollback artifacts), then + stable. Now shows as a managed container in the Unraid Docker tab with the rich Edit form. + - **device-id persistence [FIXED 2026-06-01]:** the agent's device-id lived at + `/var/lib/gururmm/.device-id` *inside* the container (ephemeral). Fix: `docker cp`'d the live + `/var/lib/gururmm/.` out to `/mnt/user/appdata/gururmm/lib/` (preserving device-id + `88abeef0-cb3a-4c3f-9353-61fedcdf587d`), added a `-v /mnt/user/appdata/gururmm/lib:/var/lib/gururmm` + mapping + matching template Config, and recreated. Verified the agent reused the same device-id + (no "Persisting new device ID") and the **same agent_id `443bfabb`** — identity now durable + across recreates/updates. Enrollment identity also persists via `config.toml` in `/config`. + - **Ghost check [CLEAN]:** GuruRMM `/api/agents` shows exactly one Jupiter row (`443bfabb`, + last-seen current). The three recreates created no duplicate. **Incidental:** `GURU-KALI` has + ~11 duplicate agent rows (v0.6.46/0.6.50, stale) — same ephemeral-identity pattern on a + frequently-reinstalled box; cleanup candidate, out of scope for this task. + +--- + +## Why the features are missing + +Unraid's per-container UI (WebUI/icon/update-check/Edit) is driven by **container labels** +(`net.unraid.docker.managed`, `.webui`, `.icon`) + a template XML in +`/boot/config/plugins/dockerMan/templates-user/`. Those labels are **immutable on a running +container** — they're baked in at `docker create` time. 
Containers started by raw +`docker run` / `docker-compose` CLI (instead of Unraid's "Add Container" form or the Compose +Manager plugin) never get them. **The only fix is to RECREATE each container** through the +proper mechanism. Data in mapped volumes is untouched by a recreate; the risk is downtime + +getting the recreate config exactly right. + +Two correct mechanisms (both yield the full UI feature set): +- **dockerman template** — for single CLI containers. Unraid "Add Container" → template. +- **composeman (Compose Manager plugin)** — for multi-container compose stacks. Adopt the + existing `docker-compose.yml` into the plugin so the whole stack gets the labels while + keeping its compose orchestration + private network. Plugin IS installed (only "RustDesk" + registered today). + +--- + +## Inventory (21 containers; 14 raw / `managed=NONE`) + +### Already templated (managed=dockerman) — no action +DockerUISP, Seerr, qbittorrent, binhex-emby, binhex-sabnzbd, binhex-plexpass, rsync-server + +### Raw (managed=NONE) — the targets, grouped by disposition + +| Container | Image | Created via | Existing template | Disposition | Risk | +|---|---|---|---|---|---| +| **gururmm-agent** | localhost:3000/azcomputerguru/gururmm-agent:latest | CLI, net=host | none | NEW dockerman template | LOW | +| youtube-sync-test | azcomputerguru/youtube-sync:latest | CLI | none | NEW template (or retire — "test") | LOW | +| binhex-radarr | binhex/arch-radarr | CLI | my-binhex-radarr.xml | reconcile + recreate from template | MED | +| binhex-sonarr | binhex/arch-sonarr | CLI | my-binhex-sonarr.xml | reconcile + recreate from template | MED | +| MariaDB-Official | mariadb:latest | CLI | my-MariaDB-Official.xml | reconcile + recreate (snapshot appdata first) | MED (DB) | +| **seafile** + seafile-mysql + seafile-memcached + seafile-elasticsearch | seafileltd/seafile-pro-mc:12.0 / mariadb:10.6 / memcached:1.6.18 / elasticsearch:7.17.26 | compose `dockercompose` @ 
/mnt/user0/SeaFile/DockerCompose/docker-compose.yml | partial (my-SeaFile*.xml, my-memcached.xml) | **adopt stack into Compose Manager** | MED-HIGH | +| **gitea** + **gitea-db** | gitea/gitea:latest / mysql:8 | compose `gitea` @ /mnt/cache/appdata/gitea/docker-compose.yml | none | **adopt stack into Compose Manager** | **HIGH** (repos + GuruRMM build pipeline) | +| **npm** | jc21/nginx-proxy-manager:latest | CLI | my-NginxProxyManager.xml | reconcile + recreate from template | **HIGH** (public reverse proxy) | +| app (Discourse) | local_discourse/app | Discourse `./launcher` (no compose file) | none | **LEAVE AS-IS** — self-managed; templating breaks `./launcher rebuild` | n/a | +| radio-archive | radio-archive:latest | compose `app` | none | tied to Discourse project — leave with app | LOW | + +Note: several CLI containers (npm, radarr, sonarr, MariaDB-Official) already HAVE a matching +template XML — the running container just isn't linked to it (recreated via CLI later, which +stripped the managed label). For these, recreate-from-existing-template is the easy path, but +**verify the template's ports/paths/env still match the live container** before applying. + +--- + +## Recreate sequencing (least → most critical) + +Do them one at a time, verify each comes back healthy before the next. + +1. **gururmm-agent** — LOW. Local image, net=host, no public dependents. Proves the workflow. + Spec captured below. +2. youtube-sync-test, radio-archive — LOW. (Confirm youtube-sync-test isn't disposable first.) +3. binhex-radarr, binhex-sonarr — MED. Media, non-critical, templates already exist. +4. MariaDB-Official — MED. **Snapshot `/mnt/.../appdata` (or mysqldump) first.** +5. seafile stack — MED-HIGH. Adopt into Compose Manager. **Backup first.** `down` → register → `up`. +6. **gitea + gitea-db** — HIGH, dedicated window. **Backup gitea appdata + `mysqldump` gitea-db + first.** Pausing Gitea stops repo access AND the GuruRMM webhook build pipeline. 
Adopt the + existing compose into Compose Manager. +7. **npm** — HIGH, schedule with comms. Recreating drops the public reverse proxy → all proxied + public services (connect., rmm., git., community., seafile.) briefly down. **Backup `/data` + + `/etc/letsencrypt` first.** Recreate from my-NginxProxyManager.xml (verify port maps: + 80→1880, 81→7818, 443→18443). +8. Discourse (app) — LEAVE. + +--- + +## Captured recreate spec — gururmm-agent (target #1) + +``` +Image: localhost:3000/azcomputerguru/gururmm-agent:latest +Network: host +Restart: unless-stopped +Privileged: false CapAdd: none Devices: none (kvm passed as a bind mount) +Entrypoint: /usr/local/bin/gururmm-agent +Cmd: run +Env: GURURMM_CONFIG=/config/config.toml +Volumes: + /dev/kvm -> /dev/kvm (ro) + /proc -> /proc (ro) + /sys -> /sys (ro) + /var/run/docker.sock -> /var/run/docker.sock (rw) + /var/run/libvirt/libvirt-sock -> /var/run/libvirt/libvirt-sock (ro) + /mnt/user/appdata/gururmm -> /config (rw) +``` + +Equivalent `docker run` (what the dockerman template encodes): +```bash +docker run -d --name gururmm-agent \ + --network host --restart unless-stopped \ + -e GURURMM_CONFIG=/config/config.toml \ + -v /dev/kvm:/dev/kvm:ro \ + -v /proc:/proc:ro \ + -v /sys:/sys:ro \ + -v /var/run/docker.sock:/var/run/docker.sock \ + -v /var/run/libvirt/libvirt-sock:/var/run/libvirt/libvirt-sock:ro \ + -v /mnt/user/appdata/gururmm:/config \ + --entrypoint /usr/local/bin/gururmm-agent \ + localhost:3000/azcomputerguru/gururmm-agent:latest run +``` +Recreate path: build the template in Unraid "Add Container" (Repository, Network=host, the 6 +path mappings, the env var, Extra Params `--entrypoint /usr/local/bin/gururmm-agent`, Post +Arguments `run`), `docker stop && docker rm gururmm-agent`, then apply the template. Note: it's +a localhost-registry image, so Unraid update-check won't be meaningful — but WebUI(n/a)/icon/Edit +form all come back. 
+ +--- + +## Open items before execution +- Confirm `youtube-sync-test` is keep-or-retire (the "-test" name suggests disposable). +- For each "template exists" container (npm/radarr/sonarr/MariaDB-Official): diff the template + XML against the live `docker inspect` (ports/paths/env) so the recreate doesn't lose config. +- Pick the maintenance window(s). Suggest: a low-risk batch (1-4) any time; seafile its own slot; + gitea + npm each in a dedicated announced window, backup-first.
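
The template-versus-live comparison called for in the open items could be sketched as below, limited to path mappings for brevity (ports and env would follow the same pattern). This is a rough sketch under assumptions: it assumes the dockerMan template stores each mapping as a `<Config Type="Path" Target="...">host path</Config>` element, which should be confirmed against a real `my-*.xml` file before relying on it. All function names are hypothetical, and the sample XML and inspect data are illustrative, not copied from Jupiter.

```python
import xml.etree.ElementTree as ET


def template_binds(xml_text: str) -> set[tuple[str, str]]:
    """(host_path, container_path) pairs from Type="Path" Config elements.

    Assumes the template layout described in the lead-in.
    """
    root = ET.fromstring(xml_text)
    return {
        ((cfg.text or "").strip(), cfg.get("Target", ""))
        for cfg in root.iter("Config")
        if cfg.get("Type") == "Path"
    }


def live_binds(inspect: dict) -> set[tuple[str, str]]:
    """(host_path, container_path) pairs from HostConfig.Binds, mode dropped."""
    pairs = set()
    for bind in inspect["HostConfig"].get("Binds") or []:
        host_path, container_path = bind.split(":")[:2]
        pairs.add((host_path, container_path))
    return pairs


def diff_binds(xml_text: str, inspect: dict) -> dict[str, list[tuple[str, str]]]:
    """Report mappings present on only one side of the comparison."""
    tpl, live = template_binds(xml_text), live_binds(inspect)
    return {
        "only_in_template": sorted(tpl - live),
        "only_in_live": sorted(live - tpl),
    }


# Illustrative inputs: the template lacks a mapping the live container has.
sample_xml = """
<Container version="2">
  <Name>gururmm-agent</Name>
  <Config Name="Config" Target="/config" Mode="rw" Type="Path">/mnt/user/appdata/gururmm</Config>
  <Config Name="Cfg" Target="GURURMM_CONFIG" Type="Variable">/config/config.toml</Config>
</Container>
"""
sample_inspect = {
    "HostConfig": {
        "Binds": [
            "/mnt/user/appdata/gururmm:/config",
            "/mnt/user/appdata/gururmm/lib:/var/lib/gururmm",
        ]
    }
}

result = diff_binds(sample_xml, sample_inspect)
print(result)
```

An empty result on both sides is the signal that a recreate from the existing template should not lose path mappings; any entry under `only_in_live` is config the template would drop.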