diff --git a/clients/peaceful-spirit/session-logs/2026-06/2026-06-11-mike-multisite-dfs-dc-plan.md b/clients/peaceful-spirit/session-logs/2026-06/2026-06-11-mike-multisite-dfs-dc-plan.md
new file mode 100644
index 0000000..4cc7401
--- /dev/null
+++ b/clients/peaceful-spirit/session-logs/2026-06/2026-06-11-mike-multisite-dfs-dc-plan.md
@@ -0,0 +1,54 @@
+# Peaceful Spirit — multi-site resilience plan (DFS + second DC) — PLAN ONLY
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-5070
+- **Role:** admin
+
+## Session Summary
+
+Planning session (no build) for making the two Peaceful Spirit sites resilient to a site-to-site VPN outage. Goal as Mike stated it: PST-SERVER2 (North West) should be a **DFS** replication partner of PST-SERVER (Country Club) so each site holds a local copy of the data (machines pull local, not over VPN), with **active failover if the S2S VPN drops**. Mike confirmed the direction is **DFS + a second Domain Controller** (he initially typed "WDS"; clarified to DFS).
+
+Established the environment from the wiki: domain `PEACEFULSPIRIT.local`, single DC = PST-SERVER (Country Club, 192.168.0.2, **Windows Server 2016 Essentials**) doing DC/DNS/RRAS-VPN/NPS/Enterprise-Root-CA; PST-SERVER2 (North West) = **Windows Server 2019 Standard**. Flagged the key constraint: Server *Essentials* expects to own all FSMO roles + ~25-user/50-device cap + is awkward with additional DCs, and 2016 hits end-of-support Jan 2027 (Essentials edition discontinued).
+
+Recommended architecture mapped to the three goals: (1) promote **PST-SERVER2 as an additional DC** (AD DS + DNS + GC) → local auth/DNS survives a VPN outage; (2) **AD Sites & Services** (Country-Club 192.168.0.0/24 + North-West subnet, site link) → clients use their local DC/target; (3) **DFS Namespace (domain-based) + DFS-R** with a folder target on each server → local file copies, auto-replicated, site-aware referrals. Surfaced the dependency that matters: DFS gives local *files* but a domain share still needs a reachable DC to *authenticate*, so DFS-only would leave the NW copy unusable during an outage — hence DFS **must** pair with the local DC.
+
+Attempted read-only recon of PST-SERVER and PST-SERVER2 via GuruRMM to scope the data + PST-SERVER2's domain-join state; both are WS-disconnected PST agents so the commands queued (re-armed at 1800s, `421a4904` PST-SERVER data, `2ffc4f54` PST-SERVER2 state — pending). The session then pivoted to fixing a fleet-wide GuruRMM update outage (separate log), which is why those recons are still outstanding.
+
+## Key Decisions
+
+- **DFS + second DC** (not DFS-only): DFS-only meets "local copies" but not "works when VPN down" — a domain DFS namespace/share needs a DC to authenticate, so NW needs a local DC. PST-SERVER2 = both DC and DFS target (standard combo).
+- **Keep all FSMO on PST-SERVER (Essentials)** and do NOT route the replicated data through Essentials' Shared-Folders/Anywhere-Access features — use plain DFS-R — to avoid Essentials' single-DC assumptions.
+- Recommend a **domain-based** DFS namespace (site-aware referrals + failover), not standalone.
+- Recommend a **full writable DC** at NW over an RODC (trusted small office; RODC complicates DFS writes).
+- Flagged 2016 Essentials EOL (Jan 2027) as a decision point: lean into it as-is vs. plan its replacement (2022/2025 Standard, plain AD DS).
+
+## Problems Encountered
+
+- PST recon commands queued (PST agents WS-disconnected; need long timeouts). Compounded by the concurrent GuruRMM update outage; recons left pending.
+
+## Configuration Changes
+
+- None — plan only. (RMM recon commands dispatched read-only: `421a4904`, `2ffc4f54`.)
+
+## Infrastructure & Servers
+
+- **PST-SERVER** (Country Club): 192.168.0.2, **Server 2016 Essentials**, DC/DNS/RRAS(L2TP)/NPS/Enterprise-Root-CA. Domain `PEACEFULSPIRIT.local`. RMM `87293069-33b6-45e8-a68f-6811216cdb96`.
+- **PST-SERVER2** (North West): **Server 2019 Standard**. RMM `5d2d7ba0-3903-4aa3-9e97-6ca4424ffe65`. Domain-join state TBD (recon pending).
+- LAN (Country Club): 192.168.0.0/24; WAN 98.190.129.150 (UCG Ultra). North West: separate UCG (subnet TBD; previously had OpenVPN at 64.139.88.249:1194). S2S VPN existence between the two UCGs = open question.
+
+## Pending / Incomplete Tasks
+
+**Open questions to firm the plan (the deciders):**
+1. **What file data must be local at each site, and where is it now?** Share on PST-SERVER? Essentials redirected folders? A line-of-business app (scheduling/QuickBooks) and where it runs? (Mara uses personal OneDrive heavily — may be little on-prem file data.) → drives DFS scope + PST-SERVER2 storage sizing. (Recon `421a4904` will report shares/sizes once PST-SERVER picks it up.)
+2. **Is there already a site-to-site VPN between the two UCGs (UniFi Site Magic), or build it?** The whole resilience story rides on this link.
+3. **PST-SERVER2 current state** — blank Standard box vs already domain-joined (recon `2ffc4f54`).
+4. **DFS-R conflict tolerance** — do both sites edit the same files (last-writer-wins conflict copies) or mostly separate data?
+5. **2016 Essentials longevity** — keep as-is vs plan replacement (EOL Jan 2027).
+
+**Next:** once 1–5 are answered, write a design doc + rollout runbook under `clients/peaceful-spirit/` (promote DC, DNS/GC, AD Sites & Services, DFS-N/DFS-R, per-site DHCP/DNS). Build only on explicit go.
+
+## Reference Information
+
+- Wiki: `wiki/clients/peaceful-spirit.md`. Syncro customer `278525`. Domain `PEACEFULSPIRIT.local`.
+- RMM recon cmds (pending): PST-SERVER `421a4904`, PST-SERVER2 `2ffc4f54`.
diff --git a/session-logs/2026-06/2026-06-11-mike-gururmm-migration-update-outage-fix.md b/session-logs/2026-06/2026-06-11-mike-gururmm-migration-update-outage-fix.md
new file mode 100644
index 0000000..7046a78
--- /dev/null
+++ b/session-logs/2026-06/2026-06-11-mike-gururmm-migration-update-outage-fix.md
@@ -0,0 +1,71 @@
+# GuruRMM post-migration agent-update outage — diagnosed + fixed (root, not the WebSocket)
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-5070
+- **Role:** admin
+
+## Session Summary
+
+Mike noticed RMM commands were no longer firing instantly (commands to PST/Safesite agents queued/timed out where they used to execute immediately). Initial hypothesis was that today's GuruRMM host migration (VM → physical `172.16.3.30`, cutover ~07:10 UTC) broke the agent WebSocket. Evidence partly supported that (`is_connected` null for all 214 agents) but a server-side investigation overturned it.
+
+SSH'd into the new host (`guru@172.16.3.30`, key `~/.ssh/gururmm-physical` — password auth is disabled, publickey-only; the vaulted sudo password was stale). Found `gururmm-server.service` ("API and WebSocket") healthy and **179/214 agents WS-connected** with clean `/ws` 101 upgrades and live heartbeats — the WebSocket was never the problem. `is_connected` is an unpopulated API field, not a disconnect signal. The real signal in the logs: a `0.6.59 → 0.6.61` agent auto-update looping with `status=failed: Failed to download binary` every 30s.
+
+Root cause: the **0.6.61 agent binaries in `/var/www/gururmm/downloads/` were written `root:root 700`** (the rebuild on the new host at 16:50, run as root with a restrictive umask), so **nginx (www-data) returned HTTP 403** on the update download. Every agent on 0.6.59 thrashed in a failing-update loop, which is what presented as "commands don't fire instantly." Confirmed with a serve test: 0.6.61 → 403, a `644` reference file → 200.
+
+Applied the immediate fix (`chmod 644` the six 0.6.61 payloads via the now-correct sudo password Mike supplied), verified serve test flipped 403→200 and update results flipped `failed → starting` (download succeeding). Then fixed the permanent cause: added explicit `chmod 644` of published artifacts to `build-windows.sh` (per-artifact in `deploy_and_sign()` + a belt-and-suspenders `find -exec chmod 644` phase) and `build-linux.sh`, mirrored the fixed scripts to the server's `/opt/gururmm/` copies, added a runbook gap #4 + Phase-3 perms check, reconciled the stale vaulted sudo password, and committed/pushed everything. Also (earlier in the session) triaged the coord inbox and re-armed the Safesite forensic sweep.
+
+## Key Decisions
+
+- Overturned my own initial "WebSocket broke" diagnosis after server evidence (179/214 connected, 101 upgrades) — the symptom was the broken auto-update, not WS. Documented this explicitly so future sessions don't chase `is_connected`.
+- Applied the production `chmod` fix directly (GuruRMM is our product; chmod is non-destructive/reversible) rather than just reporting.
+- Fixed the permanent cause in THREE places — repo source, the live `/opt/gururmm` server copies (so the next build is correct regardless of re-vendor timing), and the migration runbook — because a root-context build silently reintroduces it.
+- Routed the chmod via SSH sudo (not the root RMM agent) because the agent command path was itself degraded by the very outage being fixed.
+
+## Problems Encountered
+
+- **Self-inflicted misdiagnosis:** the `is_connected: null` fleet-wide looked like a WS break; it's an unpopulated field. Server logs (179 WS conns, 101 upgrades) corrected it.
+- **SSH access:** password auth disabled on the new host (publickey-only). Used `~/.ssh/gururmm-physical` (created during the migration). Root SSH is locked to `guru`.
+- **Stale vault sudo password:** `infrastructure/gururmm-server.sops.yaml` → `credentials.password` had drifted (`Authentication failed`). Mike supplied the current one (`Paper123!@#-rmm`); reconciled + pushed to vault.
+- **Repo edits reverted mid-task:** a submodule sync/checkout reverted my build-script edits once; re-applied and committed immediately to lock them in. (Server `/opt/gururmm` copies already had the fix from the earlier scp.)
+- **PST/Safesite agents WS-disconnected** (alive heartbeat, no persistent socket) — needed 1800s command timeouts; their recon/sweep commands queue until reconnect.
+
+## Configuration Changes
+
+- **Server `172.16.3.30`:** `chmod 644 /var/www/gururmm/downloads/*0.6.61*` (6 payloads). Overwrote `/opt/gururmm/build-windows.sh` + `/opt/gururmm/build-linux.sh` with the fixed versions (chmod 755, root).
+- **guru-rmm repo `4b5ed30`** (pushed origin/main): `deploy/build-pipeline/build-windows.sh` (chmod 644 in `deploy_and_sign()` + `find -exec chmod 644` phase), `deploy/build-pipeline/build-linux.sh` (chmod 644), `docs/HOST_MIGRATION_RUNBOOK.md` (gap #4 + Phase-3 perms check).
+- **ClaudeTools:** submodule pointer bumped → `4b5ed30` (commit `e3459260`); vault `infrastructure/gururmm-server.sops.yaml` password reconciled (committed in vault repo).
+- coord: marked 30 personal messages read (PUT /messages/{id}/read); 26 broadcasts added to local seen-file; broadcast `0fdd4502` (Safesite status correction); dev-alert `1514691349113344091`.
+
+## Credentials & Secrets
+
+- `guru@172.16.3.30` SSH/sudo password = `Paper123!@#-rmm` → vaulted `infrastructure/gururmm-server.sops.yaml` `credentials.password` (was stale ``; reconciled 2026-06-11). SSH itself is key-only (`~/.ssh/gururmm-physical`).
+
+## Infrastructure & Servers
+
+- GuruRMM host: `172.16.3.30` (hostname `gururmm`), Ubuntu, kernel 7.0.0-22. Services: `gururmm-server.service` (API+WebSocket, pid restarted 14:10 UTC = cutover), `gururmm-agent.service` (own root agent), `gururmm-webhook.service`, `guruconnect.service`. Listening: `:3001` (API+WS), `:3002`, `:3000`, `:8001` (coord), `:80` (nginx, workers=www-data), `:9090/:9100` (prometheus/node_exporter), pg `:5432`, mariadb `:3306`.
+- Build pipeline: webhook runs as **root**; `build-windows.sh` builds on Pluto (`Administrator@172.16.3.36`), scp's to `/tmp`, `deploy_and_sign()` → `/var/www/gururmm/downloads/` (FLAT layout, platform in filename). `SIGN_SCRIPT=/opt/gururmm/sign-windows.sh`.
+- Migration: `172.16.3.30` (formerly the VM) is now the physical host; the old VM is parked on `.46` for rollback.
+- Server's own root agent: `gururmm` = `5e5a7ebc-95ea-40c8-b965-6ec15d63e157`.
+
+## Commands & Outputs
+
+```bash
+ssh -i ~/.ssh/gururmm-physical guru@172.16.3.30 # key auth; password disabled
+# WS health: 179/214 established on :3001; logs show /ws 101 upgrades + heartbeats (WS fine)
+# Root cause: ls -l /var/www/gururmm/downloads/*0.6.61* -> -rwx------ root (700) ; curl localhost/downloads/..0.6.61.exe -> 403
+# Fix: printf '%s\n' "$PW" | ssh ... 'sudo -S -p "" chmod 644 /var/www/gururmm/downloads/*0.6.61*' -> serve 200, update status failed->starting
+```
+
+## Pending / Incomplete Tasks
+
+- **Fleet settle:** agents finishing the 0.6.61 update over the next while; command latency normalizes as they reconnect. Re-verify a normal command fires instantly later.
+- **Safesite sweep (#32395 / todo `5766a59f`):** DESKTOP-3USU20B sweep already queued correctly (cmd `7e835f51`, pending, 1800s, live agent `f9a86061`); DESKTOP-LOPKB4G not enrolled. Both offline; runs on reconnect. Fleet broadcast `0fdd4502` corrected the stale ids.
+- **Peaceful Spirit recon** (PST-SERVER `421a4904` / PST-SERVER2 `2ffc4f54`) queued; will complete once those agents settle post-update. See the PS planning log.
+- Optional: the build-pipeline is single-thread CPU-bound on Pluto (runbook gap #3) — parallelize the 6 cargo builds (separate from this fix).
+
+## Reference Information
+
+- guru-rmm commit `4b5ed30` (origin/main); ClaudeTools `e3459260` (pointer bump).
+- Runbook: `projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md` gap #4.
+- Memory written this session re: the false-WS-symptom is folded into the runbook; see also `feedback_rmm_user_session_smb_false_negative.md` (separate RMM-impersonation gotcha).
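
The permanent fix recorded in the second log (an explicit `chmod 644` on each published artifact, plus a permissions check over the downloads directory) might look roughly like the sketch below. This is not the actual `deploy_and_sign()` code from `build-windows.sh`, which the diff does not show. The function names `publish_artifact` and `find_unreadable`, the temporary directory, and the demo file names are all hypothetical. The demo simulates a build running under a restrictive umask, then shows that the explicit `chmod` makes the published copy world-readable while a plain copy stays unreadable to the web server user.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only: names and paths are illustrative, not taken from the repo.
set -euo pipefail

# Copy one artifact into the downloads directory and force mode 644,
# so the result does not depend on the umask of the calling process.
publish_artifact() {
    local src="$1" dest_dir="$2"
    local dest
    dest="$dest_dir/$(basename "$src")"
    cp -- "$src" "$dest"
    chmod 644 "$dest"
}

# Print every regular file directly inside a directory that lacks the
# world-read bit. Prints nothing when all files are readable by others.
find_unreadable() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f ! -perm -o=r -print
}

# Demo in a throwaway directory, removed when the script exits.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
mkdir "$demo_dir/build" "$demo_dir/downloads"

(
    # Simulate a root-context build with a restrictive umask.
    umask 077
    echo "payload" > "$demo_dir/build/agent-demo.bin"
    # A plain copy inherits the restrictive mode (the failure case).
    cp "$demo_dir/build/agent-demo.bin" "$demo_dir/downloads/agent-raw.bin"
)

# The explicit chmod in publish_artifact corrects the mode.
publish_artifact "$demo_dir/build/agent-demo.bin" "$demo_dir/downloads"

# Lists only the plain copy, which a non-owner such as www-data could not read.
find_unreadable "$demo_dir/downloads"
```

A check like `find_unreadable` is one way a post-build or migration step could catch the problem before agents start failing downloads; whether the runbook's Phase-3 perms check works this way is an assumption.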