sync: auto-sync from GURU-5070 at 2026-06-11 11:20:07

Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-11 11:20:07
2026-06-11 11:20:20 -07:00
parent e3459260ec
commit f90110d8e8
2 changed files with 125 additions and 0 deletions
--- a/session-logs/2026-06/2026-06-11-mike-gururmm-migration-update-outage-fix.md
+++ b/session-logs/2026-06/2026-06-11-mike-gururmm-migration-update-outage-fix.md
@@ -0,0 +1,71 @@
+# GuruRMM post-migration agent-update outage — diagnosed + fixed (root, not the WebSocket)
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-5070
+- **Role:** admin
+
+## Session Summary
+
+Mike noticed that RMM commands were no longer firing instantly: commands to PST/Safesite agents queued or timed out where they used to execute immediately. The initial hypothesis was that today's GuruRMM host migration (VM → physical `172.16.3.30`, cutover ~14:10 UTC, 07:10 local) broke the agent WebSocket. Early evidence partly supported that (`is_connected` null for all 214 agents), but a server-side investigation overturned it.
+
+SSH'd into the new host (`guru@172.16.3.30`, key `~/.ssh/gururmm-physical` — password auth is disabled, publickey-only; the vaulted sudo password was stale). Found `gururmm-server.service` ("API and WebSocket") healthy and **179/214 agents WS-connected** with clean `/ws` 101 upgrades and live heartbeats — the WebSocket was never the problem. `is_connected` is an unpopulated API field, not a disconnect signal. The real signal in the logs: a `0.6.59 → 0.6.61` agent auto-update looping with `status=failed: Failed to download binary` every 30s.
+
+Root cause: the **0.6.61 agent binaries in `/var/www/gururmm/downloads/` were written `root:root 700`** (the rebuild on the new host at 16:50, run as root with a restrictive umask), so **nginx (www-data) returned HTTP 403** on the update download. Every agent on 0.6.59 thrashed in a failing-update loop, which is what presented as "commands don't fire instantly." Confirmed with a serve test: 0.6.61 → 403, a `644` reference file → 200.
+
+Applied the immediate fix (`chmod 644` on the six 0.6.61 payloads, using the now-correct sudo password Mike supplied) and verified that the serve test flipped 403→200 and update results flipped `failed → starting` (download succeeding). Then fixed the permanent cause: added an explicit `chmod 644` of published artifacts to `build-windows.sh` (per-artifact in `deploy_and_sign()` plus a belt-and-suspenders `find -exec chmod 644` phase) and to `build-linux.sh`, mirrored the fixed scripts to the server's `/opt/gururmm/` copies, added runbook gap #4 and a Phase-3 perms check, reconciled the stale vaulted sudo password, and committed/pushed everything. Earlier in the session, also triaged the coord inbox and re-armed the Safesite forensic sweep.
+
+## Key Decisions
+
+- Overturned my own initial "WebSocket broke" diagnosis after server evidence (179/214 connected, 101 upgrades) — the symptom was the broken auto-update, not WS. Documented this explicitly so future sessions don't chase `is_connected`.
+- Applied the production `chmod` fix directly (GuruRMM is our product; chmod is non-destructive/reversible) rather than just reporting.
+- Fixed the permanent cause in THREE places — repo source, the live `/opt/gururmm` server copies (so the next build is correct regardless of re-vendor timing), and the migration runbook — because a root-context build silently reintroduces it.
+- Routed the chmod via SSH sudo (not the root RMM agent) because the agent command path was itself degraded by the very outage being fixed.
+
+## Problems Encountered
+
+- **Self-inflicted misdiagnosis:** the fleet-wide `is_connected: null` looked like a WS break, but it is an unpopulated field. Server logs (179 WS conns, 101 upgrades) corrected it.
+- **SSH access:** password auth disabled on the new host (publickey-only). Used `~/.ssh/gururmm-physical` (created during the migration). Root SSH is locked to `guru`.
+- **Stale vault sudo password:** `infrastructure/gururmm-server.sops.yaml` → `credentials.password` had drifted (`Authentication failed`). Mike supplied the current one (`Paper123!@#-rmm`); reconciled + pushed to vault.
+- **Repo edits reverted mid-task:** a submodule sync/checkout reverted my build-script edits once; re-applied and committed immediately to lock them in. (Server `/opt/gururmm` copies already had the fix from the earlier scp.)
+- **PST/Safesite agents WS-disconnected** (alive heartbeat, no persistent socket) — needed 1800s command timeouts; their recon/sweep commands queue until reconnect.
+
+## Configuration Changes
+
+- **Server `172.16.3.30`:** `chmod 644 /var/www/gururmm/downloads/*0.6.61*` (6 payloads). Overwrote `/opt/gururmm/build-windows.sh` + `/opt/gururmm/build-linux.sh` with the fixed versions (chmod 755, root).
+- **guru-rmm repo `4b5ed30`** (pushed origin/main): `deploy/build-pipeline/build-windows.sh` (chmod 644 in `deploy_and_sign()` + `find -exec chmod 644` phase), `deploy/build-pipeline/build-linux.sh` (chmod 644), `docs/HOST_MIGRATION_RUNBOOK.md` (gap #4 + Phase-3 perms check).
+- **ClaudeTools:** submodule pointer bumped → `4b5ed30` (commit `e3459260`); vault `infrastructure/gururmm-server.sops.yaml` password reconciled (committed in vault repo).
+- **coord:** marked 30 personal messages read (`PUT /messages/{id}/read`); 26 broadcasts added to local seen-file; broadcast `0fdd4502` (Safesite status correction); dev-alert `1514691349113344091`.
+
+## Credentials & Secrets
+
+- `guru@172.16.3.30` SSH/sudo password = `Paper123!@#-rmm` → vaulted `infrastructure/gururmm-server.sops.yaml` `credentials.password` (was stale `<old>`; reconciled 2026-06-11). SSH itself is key-only (`~/.ssh/gururmm-physical`).
+
+## Infrastructure & Servers
+
+- GuruRMM host: `172.16.3.30` (hostname `gururmm`), Ubuntu, kernel 7.0.0-22. Services: `gururmm-server.service` (API+WebSocket, pid restarted 14:10 UTC = cutover), `gururmm-agent.service` (own root agent), `gururmm-webhook.service`, `guruconnect.service`. Listening: `:3001` (API+WS), `:3002`, `:3000`, `:8001` (coord), `:80` (nginx, workers=www-data), `:9090/:9100` (prometheus/node_exporter), pg `:5432`, mariadb `:3306`.
+- Build pipeline: webhook runs as **root**; `build-windows.sh` builds on Pluto (`Administrator@172.16.3.36`), scp's to `/tmp`, `deploy_and_sign()` → `/var/www/gururmm/downloads/` (FLAT layout, platform in filename). `SIGN_SCRIPT=/opt/gururmm/sign-windows.sh`.
+- Migration: `172.16.3.30` was the VM and is now the physical host; the old VM is parked on `.46` for rollback.
+- Server's own root agent: `gururmm` = `5e5a7ebc-95ea-40c8-b965-6ec15d63e157`.
+
+## Commands & Outputs
+
+```bash
+ssh -i ~/.ssh/gururmm-physical guru@172.16.3.30   # key auth; password disabled
+# WS health: 179/214 established on :3001; logs show /ws 101 upgrades + heartbeats (WS fine)
+# Root cause: ls -l /var/www/gururmm/downloads/*0.6.61* -> -rwx------ root (700) ; curl localhost/downloads/..0.6.61.exe -> 403
+# Fix: printf '%s\n' "$PW" | ssh ... 'sudo -S -p "" chmod 644 /var/www/gururmm/downloads/*0.6.61*'  -> serve 200, update status failed->starting
+```
+
+## Pending / Incomplete Tasks
+
+- **Fleet settle:** agents are still finishing the 0.6.61 update; command latency should normalize as they reconnect. Re-verify later that a normal command fires instantly.
+- **Safesite sweep (#32395 / todo `5766a59f`):** DESKTOP-3USU20B sweep already queued correctly (cmd `7e835f51`, pending, 1800s, live agent `f9a86061`); DESKTOP-LOPKB4G not enrolled. Both offline; runs on reconnect. Fleet broadcast `0fdd4502` corrected the stale ids.
+- **Peaceful Spirit recon** (PST-SERVER `421a4904` / PST-SERVER2 `2ffc4f54`) queued; will complete once those agents settle post-update. See the PS planning log.
+- Optional: the build pipeline is single-threaded and CPU-bound on Pluto (runbook gap #3); parallelizing the 6 cargo builds would help, but is separate from this fix.
+
+## Reference Information
+
+- guru-rmm commit `4b5ed30` (origin/main); ClaudeTools `e3459260` (pointer bump).
+- Runbook: `projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md` gap #4.
+- The memory note written this session about the false WS symptom is folded into the runbook; see also `feedback_rmm_user_session_smb_false_negative.md` (a separate RMM-impersonation gotcha).