---
name: gururmm-physical-server-storage
description: Physical GuruRMM server (now IS 172.16.3.30) storage layout + hot/cold tiering; host migration COMPLETE 2026-06-11
metadata:
  type: project
---

**MIGRATION COMPLETE (2026-06-11 ~07:20 MST).** The physical box now IS 172.16.3.30 and runs the full stack: gururmm-server :3001, guruconnect :3002, coord/claudetools-api :8001, webhook :9000, nginx :80, PostgreSQL 18, MariaDB 11.8, Grafana :3000, Prometheus :9090. Cred-decrypt verified (MSP360 sync 62/0). Agents reconnected (162/212 within 15 min).

SSH: `~/.ssh/gururmm-physical` (the `gururmm-new` alias pointed at .231, the temp DHCP address; the box is now .30). sudo password = the vault `guru` password, piped via `echo "$P" | sudo -S -p ""` (a bare `sudo -u postgres` with no prior sudo in the SSH session fails with "a terminal is required").

**Cutover gotchas that bit us (see runbook):**

1. The box's nginx loaded a STALE config missing `location /ws` -> agents got 404 on /ws -> `systemctl reload nginx` fixed it (always reload after config placement).
2. Public ingress/TLS is **Nginx Proxy Manager on Jupiter 172.16.3.20**, NOT local nginx (which is :80-only) -> NPM forwards to .30:80, no reconfig needed since .30 was preserved.
3. Prometheus TSDB WAL was copied mid-write -> `segments are not sequential` -> moved `/var/lib/prometheus/metrics2/wal` aside (lost ~2h, blocks intact).
4. The `.30` IP swap used a self-confirming detached netplan apply + a fresh `.47` mgmt IP (no stale-ARP baggage like `.30`); the VM kept `.46` as an independent channel and released `.30`.

**Post-cutover DONE:** 7-day metrics/agent_logs backfill (2026-06-11), streamed VM->new box direct (id-range filtered, .pgpass): 3.46M rows / ~3.4 GB in ~2.5 min, lossless (id-range counts match VM<->new box: metrics 1,189,924; agent_logs 2,262,938).
**Perf proof:** SSD sustained 186-214 MB/s writes, w_await 0.7-3.2 ms, fsync ~3 ms, peak %util ~65% (headroom), and ZERO pool-timeouts under the bulk load + 212 live agents. The rotational-VM WAL-fsync root cause is fixed.

**Workstream B DONE (2026-06-11):** jupiter-runner (act_runner v0.6.1, labels ubuntu-latest/22.04) online on Jupiter .20 Docker; VM's gitea-runner DISABLED (kept registered for rollback). Build env provisioned on the new box: source repo /home/guru/gururmm @ main 7c2f20e (rsync'd from VM, `target/` + `node_modules` excluded), last-built-commit baselines copied, Rust 1.96.0 + Node v20.20.2/npm 10.8.2, Pluto (Administrator@172.16.3.36) SSH auth OK for Windows builds. NOTE: gururmm has NO .gitea/workflows; builds run via the **webhook-handler path** (Gitea webhook http://172.16.3.30/webhook/build -> nginx :80 /webhook/ -> :9000 -> build-*.sh on the server), NOT Gitea Actions. Pipeline wired end-to-end; not yet exercised by a real build.

**Post-cutover cleanup DONE (2026-06-12):** old VM `GuruRMM` decommissioned after the soak: `virsh destroy` + `undefine`, `vdisk1.img` deleted, `.46` released; `.47` mgmt IP dropped from the physical box's netplan (eno1 now carries only `172.16.3.30`). The rollback anchor was intentionally retired; there is no longer a parked VM.

**History (pre-cutover; now DONE, retained for context).** The GuruRMM server/build-pipeline ran on a **VM** at 172.16.3.30 (slow rotational-backed disk, the WAL-fsync pool-timeout cause) and was migrated to a **physical box**, which took over the 172.16.3.30 IP at cutover (2026-06-11). During provisioning (2026-06-10) the physical box was briefly at temp DHCP IP **172.16.1.231**; that IP is no longer used. Hostname `gururmm`, **Ubuntu 26.04 LTS**. SSH: dedicated ed25519 key `~/.ssh/gururmm-physical` to `guru@172.16.3.30`, vault `infrastructure/gururmm-server-physical` (SSH key + initial `guru` password). sudo needs that password (`sudo -S`), not passwordless.
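
The `sudo -S` requirement above can be sketched as a small helper. This is a minimal sketch, not the tooling actually used: the function names `build_ssh_sudo_argv` and `run_with_sudo` are hypothetical, and it feeds the password over the SSH session's stdin rather than using the `echo "$P" |` form from this note, on the assumption that `ssh` is run without a TTY so stdin reaches `sudo -S`. The host, key path, and `sudo -S -p ""` flags come from this note.

```python
import os
import shlex
import subprocess
from typing import List, Optional

HOST = "guru@172.16.3.30"          # from this note
KEY = "~/.ssh/gururmm-physical"    # from this note


def build_ssh_sudo_argv(remote_cmd: List[str], run_as: Optional[str] = None) -> List[str]:
    """Build the local ssh argv that runs remote_cmd under `sudo -S -p ""`."""
    # -S: read the password from stdin; -p "": suppress the prompt text.
    sudo = ["sudo", "-S", "-p", ""]
    if run_as:
        sudo += ["-u", run_as]  # e.g. "postgres"
    # Quote once so the remote shell sees the intended words.
    remote = shlex.join(sudo + remote_cmd)
    return ["ssh", "-i", os.path.expanduser(KEY), HOST, remote]


def run_with_sudo(password: str, remote_cmd: List[str], run_as: Optional[str] = None) -> str:
    """Run remote_cmd via sudo on the server, feeding the password over stdin."""
    argv = build_ssh_sudo_argv(remote_cmd, run_as)
    result = subprocess.run(
        argv,
        input=password + "\n",   # sudo -S consumes this line as the password
        text=True,
        capture_output=True,
        check=True,              # raise CalledProcessError on non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the remote command line; nothing is executed on the server.
    print(build_ssh_sudo_argv(["systemctl", "reload", "nginx"])[-1])
```

Because the password travels on stdin, it does not appear in the remote command line; the called command itself therefore gets no further stdin, which is fine for one-shot commands such as `systemctl reload nginx`.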
**Drives (storage optimized 2026-06-10):**

- **SSD `sda`** (Samsung 860, 929 GB) = HOT tier. Installer had left root at only 100 GB; extended the LV into the full VG → **root is now ~915 GB**. Holds: OS, Postgres DEFAULT tablespace (live/recent data) + WAL, cargo build targets, `/opt/gururmm`. Fast fsync here is the real fix for the pool-timeout root cause (could even revert to `synchronous_commit=on`).
- **HDD `sdb`** (WD 1 TB, spinning) = COLD tier. Old NTFS "Data2" (504 GB, user confirmed already backed up) wiped → **ext4, mounted at `/data`** (fstab by UUID, `noatime`). Dirs: `/data/gururmm/{pgcold, downloads, backups, archive}`.

**Cold-storage isolation (built at migration; needs PG running):**

- `CREATE TABLESPACE gururmm_cold LOCATION '/data/gururmm/pgcold'` (chown the dir postgres:postgres first).
- Time-partition `agent_logs` (by month). Recent partitions stay on the SSD default tablespace (hot write path: the batched multi-row INSERT + heartbeats). A nightly job runs `ALTER TABLE agent_logs_YYYYMM SET TABLESPACE gururmm_cold` to age old partitions onto the HDD (still queryable for signatures/build-correlation). Past the retention horizon: pg_dump the partition to `/data/gururmm/archive` (compressed), then DROP.
- `downloads` (build artifacts, served by nginx + written by pipeline) and `backups` (nightly pg_dump) also live on `/data`.

This is the concrete answer to the deferred "#3 log retention/archival" discussion. See [[rmm-agent-update-model]] (the downloads dir is the update artifact source) and the WAL-fix context (synchronous_commit=off + pool→30 applied to the OLD VM).

**Migration architecture (ratified 2026-06-10, via a 2-round Gemini+Grok panel).** The VM `172.16.3.30` is a kitchen-sink host (GuruRMM + GuruConnect + coord API :8001 + Gitea runner + Grafana/Prometheus + MariaDB; PG 14, 5.4 GB gururmm DB). Decision: physical box **becomes `172.16.3.30`** and runs **everything EXCEPT the Gitea runner** (which becomes a Docker container on Jupiter `.20`); VM retired.
(MariaDB MIGRATES: Gate-A found it backs the coord API's `claudetools` DB at localhost:3306, so it is NOT droppable.) Keeping `.30` + coord on physical means NO fleet-wide re-point (the `http://172.16.3.30:8001` refs + Cloudflare→pfSense→.30 path are unchanged). PG via `pg_dumpall --globals-only` + `pg_dump -Fc`/`pg_restore -j` (14→16, schema as-is; storage tiering is a SEPARATE later task).

Full runbook (Gate-A pre-flight, cutover from CONSOLE, ARP flush, credential-decrypt gate, PONR=first-agent-reconnect, rollback): `projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md`. EXECUTED and COMPLETE 2026-06-11 (see the top of this note).
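
Returning to the nightly aging job described under Cold-storage isolation: a minimal sketch of the planning step, assuming partitions are named `agent_logs_YYYYMM` as in this note. The function names and the `hot_months` / `retention_months` thresholds are hypothetical placeholders; this note does not state the actual horizons. The sketch only classifies partitions and emits the `ALTER TABLE ... SET TABLESPACE gururmm_cold` statements; it does not connect to PostgreSQL or perform the pg_dump/DROP step.

```python
import re
from datetime import date
from typing import Dict, Iterable, List

# Partition naming from this note: agent_logs_YYYYMM
PARTITION_RE = re.compile(r"^agent_logs_(\d{4})(\d{2})$")


def months_old(name: str, today: date) -> int:
    """Whole calendar months between a partition's month and today's month."""
    m = PARTITION_RE.match(name)
    if not m:
        raise ValueError(f"not an agent_logs partition: {name!r}")
    year, month = int(m.group(1)), int(m.group(2))
    return (today.year - year) * 12 + (today.month - month)


def plan_tiers(
    partitions: Iterable[str],
    today: date,
    hot_months: int = 2,          # hypothetical: months kept on the SSD
    retention_months: int = 12,   # hypothetical: retention horizon
) -> Dict[str, List[str]]:
    """Sort partitions into hot (stay), cold (move to HDD), archive (dump + drop)."""
    plan: Dict[str, List[str]] = {"hot": [], "cold": [], "archive": []}
    for name in sorted(partitions):
        age = months_old(name, today)
        if age < hot_months:
            plan["hot"].append(name)
        elif age < retention_months:
            plan["cold"].append(name)
        else:
            plan["archive"].append(name)
    return plan


def cold_sql(plan: Dict[str, List[str]]) -> List[str]:
    """SQL that ages each 'cold' partition onto the gururmm_cold tablespace."""
    return [f"ALTER TABLE {name} SET TABLESPACE gururmm_cold;" for name in plan["cold"]]


if __name__ == "__main__":
    example = ["agent_logs_202604", "agent_logs_202605", "agent_logs_202606"]
    for stmt in cold_sql(plan_tiers(example, date(2026, 6, 12))):
        print(stmt)
```

Two PostgreSQL behaviors are worth keeping in mind when wiring this into a real job: `ALTER TABLE ... SET TABLESPACE` takes an `ACCESS EXCLUSIVE` lock while it rewrites the table, and it does not move the table's indexes, which need their own `ALTER INDEX ... SET TABLESPACE`. Re-running the statement on a partition already in `gururmm_cold` is harmless, so the job does not need to track state.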