| name | description | metadata |
|---|---|---|
| gururmm-physical-server-storage | Physical GuruRMM server (now IS 172.16.3.30) storage layout + hot/cold tiering; host migration COMPLETE 2026-06-11 | |
MIGRATION COMPLETE (2026-06-11 ~07:20 MST). The physical box now IS 172.16.3.30 and runs the
full stack: gururmm-server :3001, guruconnect :3002, coord/claudetools-api :8001, webhook :9000,
nginx :80, PostgreSQL 18, MariaDB 11.8, Grafana :3000, Prometheus :9090. Cred-decrypt verified
(MSP360 sync 62/0). Agents reconnected (162/212 within 15 min). SSH key: ~/.ssh/gururmm-physical
(the alias gururmm-new targeted .231, which was the temporary DHCP address; the box is now .30).
The sudo password is the vault guru password, piped via echo "$P" | sudo -S -p "" (a bare
sudo -u postgres with no prior sudo in the SSH session fails with "a terminal is required").
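A minimal Python sketch of the same sudo -S pattern, for scripted use. It feeds the password on ssh's stdin instead of through echo, which is a swapped-in variation and not what was run at cutover. The function names are hypothetical, and the host alias and password are supplied by the caller (the password should be fetched from the vault at call time).

```python
import shlex
import subprocess


def build_remote_sudo(argv):
    """Build the remote command line: sudo reads the password from stdin (-S)
    with an empty prompt (-p ""), then runs the given sudo arguments."""
    return shlex.join(["sudo", "-S", "-p", "", *argv])


def run_remote_sudo(host, password, argv, timeout=30):
    """Run argv under sudo on host over ssh, feeding the password on stdin.

    host is an ssh alias or address chosen by the caller (hypothetical here).
    Nothing is raised on a non-zero exit; inspect returncode and stderr.
    """
    return subprocess.run(
        ["ssh", host, build_remote_sudo(argv)],
        input=password + "\n",  # sudo -S consumes the first line of stdin
        text=True,
        capture_output=True,
        timeout=timeout,
        check=False,
    )
```

Usage would look like run_remote_sudo(host, password, ["-u", "postgres", "psql", "-c", "select 1"]). Because no TTY is involved, this avoids the "a terminal is required" failure the same way the echo pipe does.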
Cutover gotchas that bit us (see runbook): (1) the box's nginx loaded a STALE config missing
location /ws -> agents got 404 on /ws -> systemctl reload nginx fixed it (always reload after
config placement). (2) Public ingress/TLS is Nginx Proxy Manager on Jupiter 172.16.3.20, NOT
local nginx (which is :80-only) -> NPM forwards to .30:80, no reconfig needed since .30 preserved.
(3) Prometheus TSDB WAL was copied mid-write -> segments are not sequential -> moved
/var/lib/prometheus/metrics2/wal aside (lost ~2h, blocks intact). (4) the .30 IP swap used a
self-confirming detached netplan apply + a fresh .47 mgmt IP (no stale-ARP baggage like .30);
the VM kept .46 as an independent channel and released .30.
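For gotcha (3), a small pre-start check can flag a non-sequential WAL before Prometheus refuses it. This is a sketch under the assumption that WAL segments are the purely numeric, zero-padded file names in the wal directory and that anything else (such as checkpoint.* directories) can be ignored; the function name is hypothetical.

```python
import re

_NUMERIC = re.compile(r"[0-9]+")


def wal_gaps(names):
    """Return (previous, next) index pairs where numbered WAL segments skip.

    names: directory entries from the WAL dir. Non-numeric entries are ignored.
    An empty result means the numbered segments are contiguous.
    """
    indexes = sorted(int(n) for n in names if _NUMERIC.fullmatch(n))
    return [(a, b) for a, b in zip(indexes, indexes[1:]) if b != a + 1]
```

Calling wal_gaps(os.listdir("/var/lib/prometheus/metrics2/wal")) after a copy, with os imported by the caller, would show whether the directory was captured mid-write.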
Still pending post-cutover: 7-day metrics/agent_logs backfill; Gitea runner -> Jupiter Docker
(Workstream B); drop .47 (new box) + .46 (VM) mgmt IPs; decommission the old VM after a
stability soak (VM is parked on .46, powered on, DATA PRISTINE for rollback -- do NOT delete yet).
The GuruRMM server/build-pipeline is being migrated from the VM (172.16.3.30, slow rotational-backed disk — the cause of the WAL-fsync pool timeouts) to a physical box.
New box (as of 2026-06-10): 172.16.1.231 (TEMPORARY IP — will become 172.16.3.30 at
cutover), hostname gururmm, Ubuntu 26.04 LTS. SSH: dedicated ed25519 key
~/.ssh/gururmm-physical (alias gururmm-new), vault infrastructure/gururmm-server-physical
(also holds the initial guru password). sudo needs that password (sudo -S), not passwordless.
Drives (storage optimized 2026-06-10):
- SSD sda (Samsung 860, 929 GB) = HOT tier. Installer had left root at only 100 GB; extended the LV into the full VG → root is now ~915 GB. Holds: OS, Postgres DEFAULT tablespace (live/recent data) + WAL, cargo build targets, /opt/gururmm. Fast fsync here is the real fix for the pool-timeout root cause (could even revert synchronous_commit=on).
- HDD sdb (WD 1 TB, spinning) = COLD tier. Old NTFS "Data2" (504 GB, user confirmed already backed up) wiped → ext4, mounted at /data (fstab by UUID, noatime). Dirs: /data/gururmm/{pgcold, downloads, backups, archive}.
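The cold-tier mount can be sanity-checked from the kernel's mount table. The sketch below parses text in /proc/mounts format and confirms that /data is ext4 with noatime; the function names are hypothetical, and any device name is whatever the table reports.

```python
def mount_entry(mounts_text, mountpoint):
    """Return (device, fstype, options) for mountpoint, or None if not mounted.

    mounts_text uses the /proc/mounts layout: device, mountpoint, fstype, options.
    The last matching line wins, since a later mount shadows an earlier one.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            found = (fields[0], fields[2], set(fields[3].split(",")))
    return found


def cold_tier_ok(mounts_text, mountpoint="/data"):
    """True when mountpoint is mounted as ext4 with the noatime option."""
    entry = mount_entry(mounts_text, mountpoint)
    return entry is not None and entry[1] == "ext4" and "noatime" in entry[2]
```

On the box this would be called as cold_tier_ok(open("/proc/mounts").read()), for example before the nightly jobs that write to /data.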
Cold-storage isolation (built at migration — needs PG running):
- CREATE TABLESPACE gururmm_cold LOCATION '/data/gururmm/pgcold' (chown the dir postgres:postgres first).
- Time-partition agent_logs (by month). Recent partitions on SSD default tablespace (hot write path: the batched multi-row INSERT + heartbeats). Nightly job ALTER TABLE agent_logs_YYYYMM SET TABLESPACE gururmm_cold ages old partitions onto the HDD (still queryable for signatures/build-correlation). Past retention horizon: pg_dump partition to /data/gururmm/archive (compressed) then DROP.
- downloads (build artifacts, served by nginx + written by pipeline) and backups (nightly pg_dump) also live on /data.
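The nightly aging step could be sketched as a pure function that picks which monthly partitions have left the hot window and emits the ALTER TABLE statements for them. The hot_months value and the function names are assumptions, not taken from these notes; the partition naming (agent_logs_YYYYMM) and tablespace name follow the plan above.

```python
from datetime import date


def month_index(year, month):
    """Map a calendar month to a single increasing integer."""
    return year * 12 + (month - 1)


def cold_moves(partitions, today, hot_months=3):
    """Return ALTER TABLE statements for partitions older than the hot window.

    partitions: table names shaped like agent_logs_YYYYMM.
    hot_months: months (including the current one) kept on SSD. ASSUMED value.
    """
    cutoff = month_index(today.year, today.month) - hot_months
    statements = []
    for name in sorted(partitions):
        suffix = name.rsplit("_", 1)[-1]
        if len(suffix) != 6 or not suffix.isdigit():
            continue  # not a monthly partition, leave it alone
        if month_index(int(suffix[:4]), int(suffix[4:])) <= cutoff:
            statements.append(f"ALTER TABLE {name} SET TABLESPACE gururmm_cold;")
    return statements
```

Two things to keep in mind when wiring this into a job: SET TABLESPACE rewrites the table under an exclusive lock, so it belongs in a quiet window, and a table's indexes are not moved with it (they need their own ALTER INDEX ... SET TABLESPACE if they should go cold too).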
This is the concrete answer to the deferred "#3 log retention/archival" discussion. See rmm-agent-update-model (the downloads dir is the update artifact source) and the WAL-fix context (synchronous_commit=off + pool→30 applied to the OLD VM).
Migration architecture (ratified 2026-06-10, via a 2-round Gemini+Grok panel). The VM
172.16.3.30 is a kitchen-sink host (GuruRMM + GuruConnect + coord API :8001 + Gitea runner +
Grafana/Prometheus + MariaDB; PG 14, 5.4 GB gururmm DB). Decision: physical box becomes
172.16.3.30 and runs everything EXCEPT the Gitea runner (which becomes a Docker container
on Jupiter .20); VM retired. (MariaDB MIGRATES — Gate-A found it backs the coord API's claudetools
DB at localhost:3306, NOT droppable.) Keeping .30 + coord on physical means NO fleet-wide
re-point (the http://172.16.3.30:8001 refs + Cloudflare→pfSense→.30 path are unchanged). PG via
pg_dumpall --globals-only + pg_dump -Fc/pg_restore -j (14→16, schema as-is — storage tiering
is a SEPARATE later task). Full runbook (Gate-A pre-flight, cutover from CONSOLE, ARP flush,
credential-decrypt gate, PONR=first-agent-reconnect, rollback): projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md. NOT yet executed; needs a window + the Gate-A unknowns closed.
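The PG dump/restore path named above can be laid out as an ordered list of commands. This is a sketch, not the runbook's procedure: the work directory, job count, the createdb step, and the function name are assumptions, and the runbook stays the authority on ordering and gates.

```python
def migration_commands(db="gururmm", jobs=4, workdir="/tmp/pgmig"):
    """Return argv lists for the dump/restore sequence, in run order.

    jobs and workdir are placeholder values (ASSUMED, not from the runbook).
    The first two commands run on the old host, the rest on the new one.
    """
    globals_sql = f"{workdir}/globals.sql"
    dump_file = f"{workdir}/{db}.dump"
    return [
        # Old host: roles and other cluster-wide objects only.
        ["pg_dumpall", "--globals-only", "-f", globals_sql],
        # Old host: custom-format archive, which pg_restore -j requires.
        ["pg_dump", "-Fc", "-f", dump_file, db],
        # New host: replay globals before any object that references a role.
        ["psql", "-f", globals_sql, "postgres"],
        # New host: create the empty target database, then restore in parallel.
        ["createdb", db],
        ["pg_restore", "-j", str(jobs), "-d", db, dump_file],
    ]
```

Each argv list can be handed to subprocess.run(..., check=True) on the appropriate host. Keeping the sequence as data makes it easy to print for a dry run during the Gate-A pre-flight.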