| name | description | metadata |
|---|---|---|
| gururmm-physical-server-storage | Physical GuruRMM server (now IS 172.16.3.30) storage layout + hot/cold tiering; host migration COMPLETE 2026-06-11 | |

MIGRATION COMPLETE (2026-06-11 ~07:20 MST). The physical box now IS 172.16.3.30 and runs the
full stack: gururmm-server :3001, guruconnect :3002, coord/claudetools-api :8001, webhook :9000,
nginx :80, PostgreSQL 18, MariaDB 11.8, Grafana :3000, Prometheus :9090. Cred-decrypt verified
(MSP360 sync 62/0). Agents reconnected (162/212 within 15 min). SSH: ~/.ssh/gururmm-physical
(the gururmm-new alias pointed at .231, the temporary DHCP address; the box is now .30). sudo needs
the vault guru password, piped via echo "$P" | sudo -S -p "" (a bare sudo -u postgres with no prior
sudo in the SSH session fails with "a terminal is required" because there is no TTY to prompt on).
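
The non-interactive sudo pattern above can be sketched as a small wrapper. This is a minimal sketch, not the script actually used: the login user guru, the function names, and the convention that $P already holds the vault password are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes: login user "guru", key ~/.ssh/gururmm-physical,
# and the vault password already exported in $P. Function names are hypothetical.
SSH_KEY="${SSH_KEY:-$HOME/.ssh/gururmm-physical}"
SSH_HOST="${SSH_HOST:-guru@172.16.3.30}"

# Build the remote command line. sudo -S reads the password from stdin and
# -p "" blanks the prompt so it does not pollute the command output.
# $1 = user to run as, $2 = the command as ONE pre-quoted string.
remote_sudo_cmd() {
  printf 'sudo -S -p "" -u %s %s' "$1" "$2"
}

# Feed the password through SSH stdin to the remote sudo; the password is
# sent on stdin rather than placed on a command line.
run_remote_as() {
  printf '%s\n' "$P" | ssh -i "$SSH_KEY" "$SSH_HOST" "$(remote_sudo_cmd "$1" "$2")"
}
```

Usage, for example: run_remote_as postgres 'psql -Atc "select version()"'. Because stdin is consumed by the password, the remote command itself cannot also read input from the pipe.
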
Cutover gotchas that bit us (see runbook):
- nginx on the box had loaded a STALE config missing location /ws -> agents got 404 on /ws -> systemctl reload nginx fixed it (always reload after config placement).
- Public ingress/TLS is Nginx Proxy Manager on Jupiter 172.16.3.20, NOT local nginx (which is :80-only) -> NPM forwards to .30:80, no reconfig needed since .30 was preserved.
- Prometheus TSDB WAL was copied mid-write -> segments were not sequential -> moved /var/lib/prometheus/metrics2/wal aside (lost ~2h of metrics, blocks intact).
- The .30 IP swap used a self-confirming detached netplan apply + a fresh .47 mgmt IP (no stale-ARP baggage, unlike .30); the VM kept .46 as an independent channel and released .30.
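
The "always reload after config placement" rule from the nginx gotcha can be turned into a check. This is a hedged sketch: the function names are hypothetical, and it assumes local nginx answers on 127.0.0.1:80 and that the caller can sudo.

```shell
#!/usr/bin/env bash
# Sketch only. Function names are hypothetical; assumes nginx on 127.0.0.1:80.

# Succeeds if the nginx config text on stdin declares a location for /ws
# (plain, "=" exact-match, or "^~" prefix form).
has_ws_location() {
  grep -Eq '^[[:space:]]*location[[:space:]]+(=[[:space:]]*|\^~[[:space:]]*)?/ws([[:space:]/{]|$)'
}

# Validate the on-disk config, reload, then check the files and the live
# listener. nginx -T dumps the on-disk config, not the running one, so the
# reload comes first to bring the running workers in line with the files.
reload_and_verify_nginx() {
  sudo nginx -t || return 1
  sudo systemctl reload nginx || return 1
  sudo nginx -T 2>/dev/null | has_ws_location || return 1
  # A plain GET is not a WebSocket handshake, so any answer other than
  # 404 (route missing) or 000 (no connection) is taken as "route exists".
  code="$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1/ws)"
  [ "$code" != "404" ] && [ "$code" != "000" ]
}
```

Running reload_and_verify_nginx right after placing a config would have caught the stale-config case before agents started receiving 404s.
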
Post-cutover DONE: 7-day metrics/agent_logs backfill (2026-06-11) -- streamed VM->new box
direct (id-range filtered, .pgpass), 3.45M rows / ~3.4 GB in ~2.5 min, lossless (id-range counts
match VM<->new box: metrics 1,189,924; agent_logs 2,262,938). Perf proof: SSD sustained
186-214 MB/s writes, w_await 0.7-3.2 ms, fsync ~3 ms, peak %util ~65% (headroom), and ZERO
pool-timeouts under the bulk load + 212 live agents -- the rotational-VM WAL-fsync root cause is fixed.
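
One way an id-range filtered stream plus count check like the backfill above could look is sketched below. The exact commands used are not recorded here, so this is an assumption: it pipes COPY between two psql sessions, takes SRC_HOST, DST_HOST and DB_USER as placeholders, relies on ~/.pgpass for passwords, and assumes both sides share the same column layout.

```shell
#!/usr/bin/env bash
# Sketch only. SRC_HOST, DST_HOST and DB_USER are placeholders to be set by
# the caller; passwords are expected to come from ~/.pgpass.
DB_NAME="${DB_NAME:-gururmm}"

# SQL that streams one id range of a table to stdout. Args: table lo hi.
copy_out_sql() {
  printf 'COPY (SELECT * FROM %s WHERE id BETWEEN %d AND %d) TO STDOUT' "$1" "$2" "$3"
}

# SQL that counts rows in the same id range. Args: table lo hi.
count_sql() {
  printf 'SELECT count(*) FROM %s WHERE id BETWEEN %d AND %d' "$1" "$2" "$3"
}

# Stream source -> target directly, with no intermediate dump file.
stream_range() {
  psql -h "$SRC_HOST" -U "$DB_USER" -d "$DB_NAME" -c "$(copy_out_sql "$1" "$2" "$3")" \
    | psql -h "$DST_HOST" -U "$DB_USER" -d "$DB_NAME" -c "COPY $1 FROM STDIN"
}

# Lossless check: the id-range counts must match on both hosts.
verify_range() {
  src="$(psql -h "$SRC_HOST" -U "$DB_USER" -d "$DB_NAME" -Atc "$(count_sql "$1" "$2" "$3")")" || return 1
  dst="$(psql -h "$DST_HOST" -U "$DB_USER" -d "$DB_NAME" -Atc "$(count_sql "$1" "$2" "$3")")" || return 1
  printf '%s: source=%s target=%s\n' "$1" "$src" "$dst"
  [ "$src" = "$dst" ]
}
```

Usage would be along the lines of stream_range metrics LO HI && verify_range metrics LO HI, with LO and HI being the id bounds of the 7-day window.
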
Workstream B DONE (2026-06-11): jupiter-runner (act_runner v0.6.1, labels ubuntu-latest/22.04)
online on Jupiter .20 Docker; VM's gitea-runner DISABLED (kept registered for rollback). Build env
provisioned on the new box: source repo /home/guru/gururmm @ main 7c2f20e (rsync'd from VM, target/
+node_modules excluded), last-built-commit baselines copied, Rust 1.96.0 + Node v20.20.2/npm 10.8.2,
Pluto (Administrator@172.16.3.36) SSH auth OK for Windows builds. NOTE: gururmm has NO .gitea/workflows
-- builds run via the webhook-handler path (Gitea webhook http://172.16.3.30/webhook/build ->
nginx :80 /webhook/ -> :9000 -> build-*.sh on the server), NOT Gitea Actions. Pipeline wired end-to-end;
not yet exercised by a real build. Still pending post-cutover: drop .47 (new box) + .46 (VM)
mgmt IPs; decommission the old VM after a stability soak (VM parked on .46, powered on, DATA PRISTINE
for rollback -- do NOT delete yet).
The GuruRMM server/build-pipeline is being migrated from the VM (172.16.3.30, slow rotational-backed disk — the cause of the WAL-fsync pool timeouts) to a physical box.
New box (as of 2026-06-10): 172.16.1.231 (TEMPORARY IP — will become 172.16.3.30 at
cutover), hostname gururmm, Ubuntu 26.04 LTS. SSH: dedicated ed25519 key
~/.ssh/gururmm-physical (alias gururmm-new), vault infrastructure/gururmm-server-physical
(also holds the initial guru password). sudo needs that password (sudo -S), not passwordless.
Drives (storage optimized 2026-06-10):
- SSD sda (Samsung 860, 929 GB) = HOT tier. Installer had left root at only 100 GB; extended the LV into the full VG → root is now ~915 GB. Holds: OS, Postgres DEFAULT tablespace (live/recent data) + WAL, cargo build targets, /opt/gururmm. Fast fsync here is the real fix for the pool-timeout root cause (could even revert to synchronous_commit=on).
- HDD sdb (WD 1 TB, spinning) = COLD tier. Old NTFS "Data2" (504 GB, user confirmed already backed up) wiped → ext4, mounted at /data (fstab by UUID, noatime). Dirs: /data/gururmm/{pgcold, downloads, backups, archive}.
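
The drive layout above can be reproduced roughly as follows. This is a sketch under stated assumptions, not a record of the commands that were run: the LV path /dev/ubuntu-vg/ubuntu-lv (the usual Ubuntu installer default) and the partition /dev/sdb1 are guesses that should be checked with lvs and lsblk first, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. ROOT_LV and COLD_DEV are assumptions; verify with lvs/lsblk.
ROOT_LV="${ROOT_LV:-/dev/ubuntu-vg/ubuntu-lv}"
COLD_DEV="${COLD_DEV:-/dev/sdb1}"

# Print an fstab entry that mounts by UUID with noatime. Args: uuid mountpoint.
fstab_line() {
  printf 'UUID=%s %s ext4 defaults,noatime 0 2\n' "$1" "$2"
}

# Grow the root LV into all free space in the VG; -r also resizes the
# filesystem on top of it.
extend_root() {
  sudo lvextend -r -l +100%FREE "$ROOT_LV"
}

# DESTRUCTIVE: formats COLD_DEV. Defined here but never called automatically.
prepare_cold_disk() {
  sudo mkfs.ext4 "$COLD_DEV" || return 1
  uuid="$(sudo blkid -s UUID -o value "$COLD_DEV")" || return 1
  sudo mkdir -p /data
  fstab_line "$uuid" /data | sudo tee -a /etc/fstab >/dev/null
  sudo mount /data || return 1
  for d in pgcold downloads backups archive; do
    sudo mkdir -p "/data/gururmm/$d"
  done
  # PostgreSQL must own the directory that will back the cold tablespace.
  sudo chown postgres:postgres /data/gururmm/pgcold
}
```

Mounting by UUID keeps the entry stable if the kernel enumerates the disks in a different order after a reboot.
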
Cold-storage isolation (built at migration — needs PG running):
- CREATE TABLESPACE gururmm_cold LOCATION '/data/gururmm/pgcold' (chown the dir postgres:postgres first).
- Time-partition agent_logs (by month). Recent partitions on SSD default tablespace (hot write path: the batched multi-row INSERT + heartbeats). Nightly job ALTER TABLE agent_logs_YYYYMM SET TABLESPACE gururmm_cold ages old partitions onto the HDD (still queryable for signatures/build-correlation). Past retention horizon: pg_dump partition to /data/gururmm/archive (compressed) then DROP.
- downloads (build artifacts, served by nginx + written by pipeline) and backups (nightly pg_dump) also live on /data.
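
The nightly aging job described above could be sketched like this. The number of months kept hot is not stated in these notes, so HOT_MONTHS is a placeholder, and the function names are hypothetical. The month arithmetic is done in plain shell so it does not depend on any particular date implementation.

```shell
#!/usr/bin/env bash
# Sketch only. HOT_MONTHS is an assumed policy value, not a decided one.
DB_NAME="${DB_NAME:-gururmm}"
HOT_MONTHS="${HOT_MONTHS:-3}"

# Print the YYYYMM that lies N months before the given YYYYMM.
months_before() {
  y="${1%??}"
  m="${1#????}"
  m="${m#0}"                          # strip leading zero to avoid octal parsing
  total=$(( y * 12 + m - 1 - $2 ))    # months since year 0, zero-based
  printf '%04d%02d' $(( total / 12 )) $(( total % 12 + 1 ))
}

# SQL that moves one monthly partition onto the cold tablespace.
age_partition_sql() {
  printf 'ALTER TABLE agent_logs_%s SET TABLESPACE gururmm_cold;' "$1"
}

# Nightly entry point: move the partition that just left the hot window.
age_old_partition() {
  target="$(months_before "$(date +%Y%m)" "$HOT_MONTHS")"
  psql -d "$DB_NAME" -c "$(age_partition_sql "$target")"
}
```

Two PostgreSQL behaviours are worth keeping in mind: ALTER TABLE ... SET TABLESPACE takes an ACCESS EXCLUSIVE lock while it copies the data files, and it does not move the partition's indexes, which need their own ALTER INDEX ... SET TABLESPACE.
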
This is the concrete answer to the deferred "#3 log retention/archival" discussion. See rmm-agent-update-model (the downloads dir is the update artifact source) and the WAL-fix context (synchronous_commit=off + pool→30 applied to the OLD VM).
Migration architecture (ratified 2026-06-10, via a 2-round Gemini+Grok panel). The VM
172.16.3.30 is a kitchen-sink host (GuruRMM + GuruConnect + coord API :8001 + Gitea runner +
Grafana/Prometheus + MariaDB; PG 14, 5.4 GB gururmm DB). Decision: physical box becomes
172.16.3.30 and runs everything EXCEPT the Gitea runner (which becomes a Docker container
on Jupiter .20); VM retired. (MariaDB MIGRATES — Gate-A found it backs the coord API's claudetools
DB at localhost:3306, NOT droppable.) Keeping .30 + coord on physical means NO fleet-wide
re-point (the http://172.16.3.30:8001 refs + Cloudflare→pfSense→.30 path are unchanged). PG via
pg_dumpall --globals-only + pg_dump -Fc/pg_restore -j (14→16, schema as-is — storage tiering
is a SEPARATE later task). Full runbook (Gate-A pre-flight, cutover from CONSOLE, ARP flush,
credential-decrypt gate, PONR=first-agent-reconnect, rollback): projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md. NOT yet executed; needs a window + the Gate-A unknowns closed.
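
The dump/restore path named above (globals first, then a custom-format dump restored in parallel) might be scripted as below. This is a sketch, not the runbook: DUMP_DIR and JOBS are placeholder values, the function names are hypothetical, and it defaults to a dry run that only prints the commands. When crossing major versions, the PostgreSQL documentation recommends running the newer release's pg_dump against the old server.

```shell
#!/usr/bin/env bash
# Sketch only. DUMP_DIR and JOBS are placeholders; DRY_RUN=1 (default)
# prints each command instead of executing it.
DB_NAME="${DB_NAME:-gururmm}"
DUMP_DIR="${DUMP_DIR:-/tmp/pgmigrate}"
JOBS="${JOBS:-4}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# On the old server: roles and tablespaces, then the database itself in
# custom format (-Fc), which pg_restore can load in parallel.
dump_old() {
  run pg_dumpall --globals-only -f "$DUMP_DIR/globals.sql"
  run pg_dump -Fc -d "$DB_NAME" -f "$DUMP_DIR/$DB_NAME.dump"
}

# On the new server: globals first so object owners exist, then create the
# empty database and restore with JOBS parallel workers.
restore_new() {
  run psql -d postgres -f "$DUMP_DIR/globals.sql"
  run createdb "$DB_NAME"
  run pg_restore -j "$JOBS" -d "$DB_NAME" "$DUMP_DIR/$DB_NAME.dump"
}
```

Calling dump_old and restore_new with the default DRY_RUN=1 gives a reviewable command list for the maintenance window; setting DRY_RUN=0 executes the same steps.
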