sync: auto-sync from GURU-5070 at 2026-06-11 11:20:07

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-11 11:20:07
2026-06-11 11:20:20 -07:00
parent e3459260ec
commit f90110d8e8
2 changed files with 125 additions and 0 deletions


@@ -0,0 +1,54 @@
# Peaceful Spirit — multi-site resilience plan (DFS + second DC) — PLAN ONLY
## User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin
## Session Summary
Planning session (no build) for making the two Peaceful Spirit sites resilient to a site-to-site VPN outage. Goal as Mike stated it: PST-SERVER2 (North West) should be a **DFS** replication partner of PST-SERVER (Country Club) so each site holds a local copy of the data (machines pull local, not over VPN), with **active failover if the S2S VPN drops**. Mike confirmed the direction is **DFS + a second Domain Controller** (he initially typed "WDS"; clarified to DFS).
Established the environment from the wiki: domain `PEACEFULSPIRIT.local`, single DC = PST-SERVER (Country Club, 192.168.0.2, **Windows Server 2016 Essentials**) doing DC/DNS/RRAS-VPN/NPS/Enterprise-Root-CA; PST-SERVER2 (North West) = **Windows Server 2019 Standard**. Flagged the key constraint: Server *Essentials* expects to own all FSMO roles + ~25-user/50-device cap + is awkward with additional DCs, and 2016 hits end-of-support Jan 2027 (Essentials edition discontinued).
Recommended architecture mapped to the three goals: (1) promote **PST-SERVER2 as an additional DC** (AD DS + DNS + GC) → local auth/DNS survives a VPN outage; (2) **AD Sites & Services** (Country-Club 192.168.0.0/24 + North-West subnet, site link) → clients use their local DC/target; (3) **DFS Namespace (domain-based) + DFS-R** with a folder target on each server → local file copies, auto-replicated, site-aware referrals. Surfaced the dependency that matters: DFS gives local *files* but a domain share still needs a reachable DC to *authenticate*, so DFS-only would leave the NW copy unusable during an outage — hence DFS **must** pair with the local DC.
Attempted read-only recon of PST-SERVER and PST-SERVER2 via GuruRMM to scope the data + PST-SERVER2's domain-join state; both are WS-disconnected PST agents so the commands queued (re-armed at 1800s, `421a4904` PST-SERVER data, `2ffc4f54` PST-SERVER2 state — pending). The session then pivoted to fixing a fleet-wide GuruRMM update outage (separate log), which is why those recons are still outstanding.
## Key Decisions
- **DFS + second DC** (not DFS-only): DFS-only meets "local copies" but not "works when VPN down" — a domain DFS namespace/share needs a DC to authenticate, so NW needs a local DC. PST-SERVER2 = both DC and DFS target (standard combo).
- **Keep all FSMO on PST-SERVER (Essentials)** and do NOT route the replicated data through Essentials' Shared-Folders/Anywhere-Access features — use plain DFS-R — to avoid Essentials' single-DC assumptions.
- Recommend a **domain-based** DFS namespace (site-aware referrals + failover), not standalone.
- Recommend a **full writable DC** at NW over an RODC (trusted small office; RODC complicates DFS writes).
- Flagged 2016 Essentials EOL (Jan 2027) as a decision point: lean into it as-is vs. plan its replacement (2022/2025 Standard, plain AD DS).
## Problems Encountered
- PST recon commands queued (PST agents WS-disconnected; need long timeouts). Compounded by the concurrent GuruRMM update outage; recons left pending.
## Configuration Changes
- None — plan only. (RMM recon commands dispatched read-only: `421a4904`, `2ffc4f54`.)
## Infrastructure & Servers
- **PST-SERVER** (Country Club): 192.168.0.2, **Server 2016 Essentials**, DC/DNS/RRAS(L2TP)/NPS/Enterprise-Root-CA. Domain `PEACEFULSPIRIT.local`. RMM `87293069-33b6-45e8-a68f-6811216cdb96`.
- **PST-SERVER2** (North West): **Server 2019 Standard**. RMM `5d2d7ba0-3903-4aa3-9e97-6ca4424ffe65`. Domain-join state TBD (recon pending).
- LAN (Country Club): 192.168.0.0/24; WAN 98.190.129.150 (UCG Ultra). North West: separate UCG (subnet TBD; previously had OpenVPN at 64.139.88.249:1194). S2S VPN existence between the two UCGs = open question.
## Pending / Incomplete Tasks
**Open questions to firm the plan (the deciders):**
1. **What file data must be local at each site, and where is it now?** Share on PST-SERVER? Essentials redirected folders? A line-of-business app (scheduling/QuickBooks) and where it runs? (Mara uses personal OneDrive heavily — may be little on-prem file data.) → drives DFS scope + PST-SERVER2 storage sizing. (Recon `421a4904` will report shares/sizes once PST-SERVER picks it up.)
2. **Is there already a site-to-site VPN between the two UCGs (UniFi Site Magic), or build it?** The whole resilience story rides on this link.
3. **PST-SERVER2 current state** — blank Standard box vs already domain-joined (recon `2ffc4f54`).
4. **DFS-R conflict tolerance** — do both sites edit the same files (last-writer-wins conflict copies) or mostly separate data?
5. **2016 Essentials longevity** — keep as-is vs plan replacement (EOL Jan 2027).
**Next:** once questions 1-5 are answered, write a design doc + rollout runbook under `clients/peaceful-spirit/` (promote DC, DNS/GC, AD Sites & Services, DFS-N/DFS-R, per-site DHCP/DNS). Build only on explicit go.
## Reference Information
- Wiki: `wiki/clients/peaceful-spirit.md`. Syncro customer `278525`. Domain `PEACEFULSPIRIT.local`.
- RMM recon cmds (pending): PST-SERVER `421a4904`, PST-SERVER2 `2ffc4f54`.


@@ -0,0 +1,71 @@
# GuruRMM post-migration agent-update outage — diagnosed + fixed (root, not the WebSocket)
## User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin
## Session Summary
Mike noticed RMM commands were no longer firing instantly (commands to PST/Safesite agents queued/timed out where they used to execute immediately). Initial hypothesis was that today's GuruRMM host migration (VM → physical `172.16.3.30`, cutover ~14:10 UTC, i.e. 07:10 local) broke the agent WebSocket. Evidence partly supported that (`is_connected` null for all 214 agents) but a server-side investigation overturned it.
SSH'd into the new host (`guru@172.16.3.30`, key `~/.ssh/gururmm-physical` — password auth is disabled, publickey-only; the vaulted sudo password was stale). Found `gururmm-server.service` ("API and WebSocket") healthy and **179/214 agents WS-connected** with clean `/ws` 101 upgrades and live heartbeats — the WebSocket was never the problem. `is_connected` is an unpopulated API field, not a disconnect signal. The real signal in the logs: a `0.6.59 → 0.6.61` agent auto-update looping with `status=failed: Failed to download binary` every 30s.
Root cause: the **0.6.61 agent binaries in `/var/www/gururmm/downloads/` were written `root:root 700`** (the rebuild on the new host at 16:50, run as root with a restrictive umask), so **nginx (www-data) returned HTTP 403** on the update download. Every agent on 0.6.59 thrashed in a failing-update loop, which is what presented as "commands don't fire instantly." Confirmed with a serve test: 0.6.61 → 403, a `644` reference file → 200.
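A minimal sketch of a check that would flag this condition, assuming "servable by nginx" can be approximated as "world-readable" (true here because the files were root-owned and the nginx workers run as www-data). `list_unservable` is a hypothetical helper name, not part of the build pipeline.

```shell
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: a file is servable if its "other" read bit is set.
# This does not check directory traverse bits, ACLs, or nginx config.

# list_unservable DIR
# Prints every regular file directly inside DIR that lacks the o+r bit.
list_unservable() {
    local dir="$1"
    # -perm -004 matches files whose mode includes o+r; "!" inverts it.
    find "$dir" -maxdepth 1 -type f ! -perm -004 -print | sort
}
```

Running `list_unservable /var/www/gururmm/downloads` should print nothing once permissions are correct. It complements the serve test (403 vs 200) rather than replacing it.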
Applied the immediate fix (`chmod 644` the six 0.6.61 payloads via the now-correct sudo password Mike supplied), verified serve test flipped 403→200 and update results flipped `failed → starting` (download succeeding). Then fixed the permanent cause: added explicit `chmod 644` of published artifacts to `build-windows.sh` (per-artifact in `deploy_and_sign()` + a belt-and-suspenders `find -exec chmod 644` phase) and `build-linux.sh`, mirrored the fixed scripts to the server's `/opt/gururmm/` copies, added a runbook gap #4 + Phase-3 perms check, reconciled the stale vaulted sudo password, and committed/pushed everything. Also (earlier in the session) triaged the coord inbox and re-armed the Safesite forensic sweep.
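The belt-and-suspenders phase could look roughly like the sketch below. It illustrates the `find -exec chmod 644` idea only and is not the actual contents of `build-windows.sh`; the function name `normalize_artifact_perms` and its optional glob argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: published artifacts sit flat in one directory
# and should all end up mode 644 regardless of the builder's umask.

# normalize_artifact_perms DIR [GLOB]
# Sets mode 644 on regular files directly inside DIR whose names match GLOB
# (default: every file). Subdirectories are left alone.
normalize_artifact_perms() {
    local dir="$1" pattern="${2:-*}"
    find "$dir" -maxdepth 1 -type f -name "$pattern" -exec chmod 644 {} +
}
```

Called as the last step of a publish, e.g. `normalize_artifact_perms /var/www/gururmm/downloads '*0.6.61*'`, it makes the result independent of whether the build ran as root with a restrictive umask.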
## Key Decisions
- Overturned my own initial "WebSocket broke" diagnosis after server evidence (179/214 connected, 101 upgrades) — the symptom was the broken auto-update, not WS. Documented this explicitly so future sessions don't chase `is_connected`.
- Applied the production `chmod` fix directly (GuruRMM is our product; chmod is non-destructive/reversible) rather than just reporting.
- Fixed the permanent cause in THREE places — repo source, the live `/opt/gururmm` server copies (so the next build is correct regardless of re-vendor timing), and the migration runbook — because a root-context build silently reintroduces it.
- Routed the chmod via SSH sudo (not the root RMM agent) because the agent command path was itself degraded by the very outage being fixed.
## Problems Encountered
- **Self-inflicted misdiagnosis:** the `is_connected: null` fleet-wide looked like a WS break; it's an unpopulated field. Server logs (179 WS conns, 101 upgrades) corrected it.
- **SSH access:** password auth disabled on the new host (publickey-only). Used `~/.ssh/gururmm-physical` (created during the migration). Root SSH is locked to `guru`.
- **Stale vault sudo password:** `infrastructure/gururmm-server.sops.yaml` `credentials.password` had drifted (`Authentication failed`). Mike supplied the current one (`Paper123!@#-rmm`); reconciled + pushed to vault.
- **Repo edits reverted mid-task:** a submodule sync/checkout reverted my build-script edits once; re-applied and committed immediately to lock them in. (Server `/opt/gururmm` copies already had the fix from the earlier scp.)
- **PST/Safesite agents WS-disconnected** (alive heartbeat, no persistent socket) — needed 1800s command timeouts; their recon/sweep commands queue until reconnect.
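Since `is_connected` turned out not to be a usable signal, a quicker first check is to count the failing-update lines in the server log. A minimal sketch, assuming log lines contain the literal text `Failed to download binary` quoted in the summary; the helper name `count_failed_downloads` is hypothetical and the real log format may differ.

```shell
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: each failed update attempt logs one line that
# contains the fixed string "Failed to download binary".

# count_failed_downloads
# Reads log text on stdin and prints how many lines report a failed download.
count_failed_downloads() {
    # -F: fixed string, -c: count matching lines.
    # grep exits non-zero when the count is 0, so mask that for callers using set -e.
    grep -c -F 'Failed to download binary' || true
}
```

Something like `journalctl -u gururmm-server.service --since "1 hour ago" | count_failed_downloads` would feed it, assuming the service logs to the journal. A count that climbs every 30s points at the update loop rather than the WebSocket.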
## Configuration Changes
- **Server `172.16.3.30`:** `chmod 644 /var/www/gururmm/downloads/*0.6.61*` (6 payloads). Overwrote `/opt/gururmm/build-windows.sh` + `/opt/gururmm/build-linux.sh` with the fixed versions (chmod 755, root).
- **guru-rmm repo `4b5ed30`** (pushed origin/main): `deploy/build-pipeline/build-windows.sh` (chmod 644 in `deploy_and_sign()` + `find -exec chmod 644` phase), `deploy/build-pipeline/build-linux.sh` (chmod 644), `docs/HOST_MIGRATION_RUNBOOK.md` (gap #4 + Phase-3 perms check).
- **ClaudeTools:** submodule pointer bumped → `4b5ed30` (commit `e3459260`); vault `infrastructure/gururmm-server.sops.yaml` password reconciled (committed in vault repo).
- coord: marked 30 personal messages read (PUT /messages/{id}/read); 26 broadcasts added to local seen-file; broadcast `0fdd4502` (Safesite status correction); dev-alert `1514691349113344091`.
## Credentials & Secrets
- `guru@172.16.3.30` SSH/sudo password = `Paper123!@#-rmm` → vaulted `infrastructure/gururmm-server.sops.yaml` `credentials.password` (was stale `<old>`; reconciled 2026-06-11). SSH itself is key-only (`~/.ssh/gururmm-physical`).
## Infrastructure & Servers
- GuruRMM host: `172.16.3.30` (hostname `gururmm`), Ubuntu, kernel 7.0.0-22. Services: `gururmm-server.service` (API+WebSocket, pid restarted 14:10 UTC = cutover), `gururmm-agent.service` (own root agent), `gururmm-webhook.service`, `guruconnect.service`. Listening: `:3001` (API+WS), `:3002`, `:3000`, `:8001` (coord), `:80` (nginx, workers=www-data), `:9090/:9100` (prometheus/node_exporter), pg `:5432`, mariadb `:3306`.
- Build pipeline: webhook runs as **root**; `build-windows.sh` builds on Pluto (`Administrator@172.16.3.36`), scp's to `/tmp`, then `deploy_and_sign()` → `/var/www/gururmm/downloads/` (FLAT layout, platform in filename). `SIGN_SCRIPT=/opt/gururmm/sign-windows.sh`.
- Migration: `172.16.3.30` was the old VM's address → now the physical host; old VM parked on `.46` for rollback.
- Server's own root agent: `gururmm` = `5e5a7ebc-95ea-40c8-b965-6ec15d63e157`.
## Commands & Outputs
```bash
ssh -i ~/.ssh/gururmm-physical guru@172.16.3.30 # key auth; password disabled
# WS health: 179/214 established on :3001; logs show /ws 101 upgrades + heartbeats (WS fine)
# Root cause: ls -l /var/www/gururmm/downloads/*0.6.61* -> -rwx------ root (700) ; curl localhost/downloads/..0.6.61.exe -> 403
# Fix: printf '%s\n' "$PW" | ssh ... 'sudo -S -p "" chmod 644 /var/www/gururmm/downloads/*0.6.61*' -> serve 200, update status failed->starting
```
## Pending / Incomplete Tasks
- **Fleet settle:** agents are still finishing the 0.6.61 update; command latency should normalize as they reconnect. Re-verify later that a normal command fires instantly.
- **Safesite sweep (#32395 / todo `5766a59f`):** DESKTOP-3USU20B sweep already queued correctly (cmd `7e835f51`, pending, 1800s, live agent `f9a86061`); DESKTOP-LOPKB4G not enrolled. Both offline; runs on reconnect. Fleet broadcast `0fdd4502` corrected the stale ids.
- **Peaceful Spirit recon** (PST-SERVER `421a4904` / PST-SERVER2 `2ffc4f54`) queued; will complete once those agents settle post-update. See the PS planning log.
- Optional: the build-pipeline is single-thread CPU-bound on Pluto (runbook gap #3) — parallelize the 6 cargo builds (separate from this fix).
## Reference Information
- guru-rmm commit `4b5ed30` (origin/main); ClaudeTools `e3459260` (pointer bump).
- Runbook: `projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md` gap #4.
- Memory written this session re: the false-WS-symptom is folded into the runbook; see also `feedback_rmm_user_session_smb_false_negative.md` (separate RMM-impersonation gotcha).