sync: auto-sync from GURU-5070 at 2026-06-11 11:20:07

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-11 11:20:07
This commit is contained in:
2026-06-11 11:20:20 -07:00
parent e3459260ec
commit f90110d8e8
2 changed files with 125 additions and 0 deletions

View File

@@ -0,0 +1,71 @@
# GuruRMM post-migration agent-update outage — diagnosed + fixed (root, not the WebSocket)
## User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin
## Session Summary
Mike noticed RMM commands were no longer firing instantly (commands to PST/Safesite agents queued/timed out where they used to execute immediately). Initial hypothesis was that today's GuruRMM host migration (VM → physical `172.16.3.30`, cutover ~07:10 UTC) broke the agent WebSocket. Evidence partly supported that (`is_connected` null for all 214 agents) but a server-side investigation overturned it.
SSH'd into the new host (`guru@172.16.3.30`, key `~/.ssh/gururmm-physical` — password auth is disabled, publickey-only; the vaulted sudo password was stale). Found `gururmm-server.service` ("API and WebSocket") healthy and **179/214 agents WS-connected** with clean `/ws` 101 upgrades and live heartbeats — the WebSocket was never the problem. `is_connected` is an unpopulated API field, not a disconnect signal. The real signal in the logs: a `0.6.59 → 0.6.61` agent auto-update looping with `status=failed: Failed to download binary` every 30s.
Root cause: the **0.6.61 agent binaries in `/var/www/gururmm/downloads/` were written `root:root 700`** (the rebuild on the new host at 16:50, run as root with a restrictive umask), so **nginx (www-data) returned HTTP 403** on the update download. Every agent on 0.6.59 thrashed in a failing-update loop, which is what presented as "commands don't fire instantly." Confirmed with a serve test: 0.6.61 → 403, a `644` reference file → 200.
Applied the immediate fix (`chmod 644` the six 0.6.61 payloads via the now-correct sudo password Mike supplied), verified serve test flipped 403→200 and update results flipped `failed → starting` (download succeeding). Then fixed the permanent cause: added explicit `chmod 644` of published artifacts to `build-windows.sh` (per-artifact in `deploy_and_sign()` + a belt-and-suspenders `find -exec chmod 644` phase) and `build-linux.sh`, mirrored the fixed scripts to the server's `/opt/gururmm/` copies, added a runbook gap #4 + Phase-3 perms check, reconciled the stale vaulted sudo password, and committed/pushed everything. Also (earlier in the session) triaged the coord inbox and re-armed the Safesite forensic sweep.
## Key Decisions
- Overturned my own initial "WebSocket broke" diagnosis after server evidence (179/214 connected, 101 upgrades) — the symptom was the broken auto-update, not WS. Documented this explicitly so future sessions don't chase `is_connected`.
- Applied the production `chmod` fix directly (GuruRMM is our product; chmod is non-destructive/reversible) rather than just reporting.
- Fixed the permanent cause in THREE places — repo source, the live `/opt/gururmm` server copies (so the next build is correct regardless of re-vendor timing), and the migration runbook — because a root-context build silently reintroduces it.
- Routed the chmod via SSH sudo (not the root RMM agent) because the agent command path was itself degraded by the very outage being fixed.
## Problems Encountered
- **Self-inflicted misdiagnosis:** the `is_connected: null` fleet-wide looked like a WS break; it's an unpopulated field. Server logs (179 WS conns, 101 upgrades) corrected it.
- **SSH access:** password auth disabled on the new host (publickey-only). Used `~/.ssh/gururmm-physical` (created during the migration). Root SSH is locked to `guru`.
- **Stale vault sudo password:** `infrastructure/gururmm-server.sops.yaml``credentials.password` was drifted (`Authentication failed`). Mike supplied the current one (`Paper123!@#-rmm`); reconciled + pushed to vault.
- **Repo edits reverted mid-task:** a submodule sync/checkout reverted my build-script edits once; re-applied and committed immediately to lock them in. (Server `/opt/gururmm` copies already had the fix from the earlier scp.)
- **PST/Safesite agents WS-disconnected** (alive heartbeat, no persistent socket) — needed 1800s command timeouts; their recon/sweep commands queue until reconnect.
## Configuration Changes
- **Server `172.16.3.30`:** `chmod 644 /var/www/gururmm/downloads/*0.6.61*` (6 payloads). Overwrote `/opt/gururmm/build-windows.sh` + `/opt/gururmm/build-linux.sh` with the fixed versions (chmod 755, root).
- **guru-rmm repo `4b5ed30`** (pushed origin/main): `deploy/build-pipeline/build-windows.sh` (chmod 644 in `deploy_and_sign()` + `find -exec chmod 644` phase), `deploy/build-pipeline/build-linux.sh` (chmod 644), `docs/HOST_MIGRATION_RUNBOOK.md` (gap #4 + Phase-3 perms check).
- **ClaudeTools:** submodule pointer bumped → `4b5ed30` (commit `e3459260`); vault `infrastructure/gururmm-server.sops.yaml` password reconciled (committed in vault repo).
- coord: marked 30 personal messages read (PUT /messages/{id}/read); 26 broadcasts added to local seen-file; broadcast `0fdd4502` (Safesite status correction); dev-alert `1514691349113344091`.
## Credentials & Secrets
- `guru@172.16.3.30` SSH/sudo password = `Paper123!@#-rmm` → vaulted `infrastructure/gururmm-server.sops.yaml` `credentials.password` (was stale `<old>`; reconciled 2026-06-11). SSH itself is key-only (`~/.ssh/gururmm-physical`).
## Infrastructure & Servers
- GuruRMM host: `172.16.3.30` (hostname `gururmm`), Ubuntu, kernel 7.0.0-22. Services: `gururmm-server.service` (API+WebSocket, pid restarted 14:10 UTC = cutover), `gururmm-agent.service` (own root agent), `gururmm-webhook.service`, `guruconnect.service`. Listening: `:3001` (API+WS), `:3002`, `:3000`, `:8001` (coord), `:80` (nginx, workers=www-data), `:9090/:9100` (prometheus/node_exporter), pg `:5432`, mariadb `:3306`.
- Build pipeline: webhook runs as **root**; `build-windows.sh` builds on Pluto (`Administrator@172.16.3.36`), scp's to `/tmp`, `deploy_and_sign()``/var/www/gururmm/downloads/` (FLAT layout, platform in filename). `SIGN_SCRIPT=/opt/gururmm/sign-windows.sh`.
- Migration: old VM parked on `172.16.3.30`→ now physical; old VM on `.46` for rollback.
- Server's own root agent: `gururmm` = `5e5a7ebc-95ea-40c8-b965-6ec15d63e157`.
## Commands & Outputs
```bash
ssh -i ~/.ssh/gururmm-physical guru@172.16.3.30 # key auth; password disabled
# WS health: 179/214 established on :3001; logs show /ws 101 upgrades + heartbeats (WS fine)
# Root cause: ls -l /var/www/gururmm/downloads/*0.6.61* -> -rwx------ root (700) ; curl localhost/downloads/..0.6.61.exe -> 403
# Fix: printf '%s\n' "$PW" | ssh ... 'sudo -S -p "" chmod 644 /var/www/gururmm/downloads/*0.6.61*' -> serve 200, update status failed->starting
```
## Pending / Incomplete Tasks
- **Fleet settle:** agents finishing the 0.6.61 update over the next while; command latency normalizes as they reconnect. Re-verify a normal command fires instantly later.
- **Safesite sweep (#32395 / todo `5766a59f`):** DESKTOP-3USU20B sweep already queued correctly (cmd `7e835f51`, pending, 1800s, live agent `f9a86061`); DESKTOP-LOPKB4G not enrolled. Both offline; runs on reconnect. Fleet broadcast `0fdd4502` corrected the stale ids.
- **Peaceful Spirit recon** (PST-SERVER `421a4904` / PST-SERVER2 `2ffc4f54`) queued; will complete once those agents settle post-update. See the PS planning log.
- Optional: the build-pipeline is single-thread CPU-bound on Pluto (runbook gap #3) — parallelize the 6 cargo builds (separate from this fix).
## Reference Information
- guru-rmm commit `4b5ed30` (origin/main); ClaudeTools `e3459260` (pointer bump).
- Runbook: `projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md` gap #4.
- Memory written this session re: the false-WS-symptom is folded into the runbook; see also `feedback_rmm_user_session_smb_false_negative.md` (separate RMM-impersonation gotcha).