diff --git a/.claude/memory/gururmm-physical-server-storage.md b/.claude/memory/gururmm-physical-server-storage.md
index b05b0c4..c109370 100644
--- a/.claude/memory/gururmm-physical-server-storage.md
+++ b/.claude/memory/gururmm-physical-server-storage.md
@@ -1,10 +1,30 @@
 ---
 name: gururmm-physical-server-storage
-description: New physical GuruRMM server (172.16.1.231) storage layout + hot/cold tiering plan for the migration off 172.16.3.30
+description: Physical GuruRMM server (now IS 172.16.3.30) storage layout + hot/cold tiering; host migration COMPLETE 2026-06-11
 metadata:
   type: project
 ---
+**MIGRATION COMPLETE (2026-06-11 ~07:20 MST).** The physical box now IS 172.16.3.30 and runs the
+full stack: gururmm-server :3001, guruconnect :3002, coord/claudetools-api :8001, webhook :9000,
+nginx :80, PostgreSQL 18, MariaDB 11.8, Grafana :3000, Prometheus :9090. Cred-decrypt verified
+(MSP360 sync 62/0). Agents reconnected (162/212 within 15 min). SSH: `~/.ssh/gururmm-physical`
+(alias `gururmm-new` -> .231 was the temp DHCP; box is now .30). sudo password = the vault `guru`
+password, piped via `echo "$P" | sudo -S -p ""` (a bare `sudo -u postgres` with no prior sudo in
+the SSH session fails with "a terminal is required").
+**Cutover gotchas that bit us (see runbook):** (1) the box's nginx loaded a STALE config missing
+`location /ws` -> agents got 404 on /ws -> `systemctl reload nginx` fixed it (always reload after
+config placement). (2) Public ingress/TLS is **Nginx Proxy Manager on Jupiter 172.16.3.20**, NOT
+local nginx (which is :80-only) -> NPM forwards to .30:80, no reconfig needed since .30 preserved.
+(3) Prometheus TSDB WAL was copied mid-write -> `segments are not sequential` -> moved
+`/var/lib/prometheus/metrics2/wal` aside (lost ~2h, blocks intact). (4) the `.30` IP swap used a
+self-confirming detached netplan apply + a fresh `.47` mgmt IP (no stale-ARP baggage like `.30`);
+the VM kept `.46` as an independent channel and released `.30`.
+**Still pending post-cutover:** 7-day metrics/agent_logs backfill; Gitea runner -> Jupiter Docker
+(Workstream B); drop `.47` (new box) + `.46` (VM) mgmt IPs; decommission the old VM after a
+stability soak (VM is parked on .46, powered on, DATA PRISTINE for rollback -- do NOT delete yet).
+
+
 The GuruRMM server/build-pipeline is being migrated from the VM (172.16.3.30, slow rotational-backed disk — the cause of the WAL-fsync pool timeouts) to a **physical box**.