claudetools/.claude/memory/gururmm-physical-server-storage.md at 543228fdba640cd12d4bb6bdd6c959f100e2ecbc

Mike Swanson 6bd3210e21 sync: auto-sync from GURU-5070 at 2026-06-11 08:01:12

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-11 08:01:12

2026-06-11 08:01:27 -07:00

name: gururmm-physical-server-storage
description: Physical GuruRMM server (now IS 172.16.3.30) storage layout + hot/cold tiering; host migration COMPLETE 2026-06-11
metadata:
  type: project

MIGRATION COMPLETE (2026-06-11 ~07:20 MST). The physical box now IS 172.16.3.30 and runs the full stack: gururmm-server :3001, guruconnect :3002, coord/claudetools-api :8001, webhook :9000, nginx :80, PostgreSQL 18, MariaDB 11.8, Grafana :3000, Prometheus :9090. Cred-decrypt verified (MSP360 sync 62/0). Agents reconnected (162/212 within 15 min).

SSH: ~/.ssh/gururmm-physical (alias gururmm-new -> .231 was the temp DHCP; box is now .30). sudo password = the vault guru password, piped via echo "$P" | sudo -S -p "" (a bare sudo -u postgres with no prior sudo in the SSH session fails with "a terminal is required").

Cutover gotchas that bit us (see runbook):

(1) The box's nginx loaded a STALE config missing location /ws -> agents got 404 on /ws -> systemctl reload nginx fixed it (always reload after config placement).
(2) Public ingress/TLS is Nginx Proxy Manager on Jupiter 172.16.3.20, NOT local nginx (which is :80-only) -> NPM forwards to .30:80, no reconfig needed since .30 preserved.
(3) Prometheus TSDB WAL was copied mid-write -> segments are not sequential -> moved /var/lib/prometheus/metrics2/wal aside (lost ~2h, blocks intact).
(4) The .30 IP swap used a self-confirming detached netplan apply + a fresh .47 mgmt IP (no stale-ARP baggage like .30); the VM kept .46 as an independent channel and released .30.

Post-cutover DONE: 7-day metrics/agent_logs backfill (2026-06-11) -- streamed VM->new box direct (id-range filtered, .pgpass), 3.46M rows / ~3.4 GB in ~2.5 min, lossless (id-range counts match VM<->new box: metrics 1,189,924; agent_logs 2,262,938).

Perf proof: SSD sustained 186-214 MB/s writes, w_await 0.7-3.2 ms, fsync ~3 ms, peak %util ~65% (headroom), and ZERO pool-timeouts under the bulk load + 212 live agents -- the rotational-VM WAL-fsync root cause is fixed.

Workstream B DONE (2026-06-11): jupiter-runner (act_runner v0.6.1, labels ubuntu-latest/22.04) online on Jupiter .20 Docker; VM's gitea-runner DISABLED (kept registered for rollback).

Build env provisioned on the new box: source repo /home/guru/gururmm @ main 7c2f20e (rsync'd from VM, target/ + node_modules excluded), last-built-commit baselines copied, Rust 1.96.0 + Node v20.20.2/npm 10.8.2, Pluto (Administrator@172.16.3.36) SSH auth OK for Windows builds.

NOTE: gururmm has NO .gitea/workflows -- builds run via the webhook-handler path (Gitea webhook http://172.16.3.30/webhook/build -> nginx :80 /webhook/ -> :9000 -> build-*.sh on the server), NOT Gitea Actions. Pipeline wired end-to-end; not yet exercised by a real build.

Still pending post-cutover: drop .47 (new box) + .46 (VM) mgmt IPs; decommission the old VM after a stability soak (VM parked on .46, powered on, DATA PRISTINE for rollback -- do NOT delete yet).
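
A possible guard for the stale-nginx gotcha, sketched in bash: probe /ws after placing a config and reload only when the running nginx answers 404. The function names and the default base URL http://127.0.0.1 are assumptions for illustration, not part of the runbook; the status a healthy /ws returns to a plain GET depends on the backend, so only 404 is treated as the stale-config symptom.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the default base URL are hypothetical.

# Return success (0) when the HTTP status for /ws indicates a missing
# `location /ws` in the running nginx, i.e. a 404.
ws_needs_reload() {
  local code="$1"
  [ "$code" = "404" ]
}

# Probe the local nginx and reload it if /ws is missing.
# Assumes sudo already works in this session.
check_ws_and_reload() {
  local base_url="${1:-http://127.0.0.1}"
  local code
  code="$(curl -s -o /dev/null -w '%{http_code}' "$base_url/ws")"
  if ws_needs_reload "$code"; then
    echo "/ws returned 404, reloading nginx"
    sudo systemctl reload nginx
  else
    echo "/ws returned $code, no reload"
  fi
}
```

Usage would be `check_ws_and_reload` on the box itself right after copying a config into place.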

The GuruRMM server/build-pipeline is being migrated from the VM (172.16.3.30, slow rotational-backed disk — the cause of the WAL-fsync pool timeouts) to a physical box.

New box (as of 2026-06-10): 172.16.1.231 (TEMPORARY IP — will become 172.16.3.30 at cutover), hostname gururmm, Ubuntu 26.04 LTS. SSH: dedicated ed25519 key ~/.ssh/gururmm-physical (alias gururmm-new), vault infrastructure/gururmm-server-physical (also holds the initial guru password). sudo needs that password (sudo -S), not passwordless.
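
The sudo -S requirement can be wrapped in a small helper so the password is fed on stdin and no prompt is printed. This is a minimal sketch: the name sudo_pipe is hypothetical, and the password is assumed to have been fetched from the vault into a variable beforehand.

```bash
#!/usr/bin/env bash
# Sketch only: sudo_pipe is a hypothetical helper name.

# Run a command under sudo, feeding the password on stdin.
#   -S     read the password from stdin instead of a terminal
#   -p ""  suppress the password prompt text
# Remaining arguments are passed to sudo unchanged.
sudo_pipe() {
  local pw="$1"
  shift
  printf '%s\n' "$pw" | sudo -S -p "" "$@"
}

# Example (not executed here), with $P holding the vault password:
#   sudo_pipe "$P" -u postgres psql -c 'SELECT version();'
```

Note that the wrapped command sees an exhausted stdin, so this pattern suits non-interactive commands only.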

Drives (storage optimized 2026-06-10):

SSD sda (Samsung 860, 929 GB) = HOT tier. Installer had left root at only 100 GB; extended the LV into the full VG → root is now ~915 GB. Holds: OS, Postgres DEFAULT tablespace (live/recent data) + WAL, cargo build targets, /opt/gururmm. Fast fsync here is the real fix for the pool-timeout root cause (could even revert synchronous_commit=on).
HDD sdb (WD 1 TB, spinning) = COLD tier. Old NTFS "Data2" (504 GB, user confirmed already backed up) wiped → ext4, mounted at /data (fstab by UUID, noatime). Dirs: /data/gururmm/{pgcold, downloads, backups, archive}.
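
The cold-tier layout above can be sketched as two bash helpers: one prints an fstab entry for /data, the other creates the four directories. Only "by UUID" and noatime come from the notes; the function names, the `defaults` option, and the `0 2` dump/pass fields are assumptions, and the UUID must be read from blkid on the real box.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and the non-noatime fstab fields are assumptions.

# Print an fstab line mounting an ext4 filesystem by UUID with noatime.
fstab_line() {
  local uuid="$1" mountpoint="${2:-/data}"
  printf 'UUID=%s %s ext4 defaults,noatime 0 2\n' "$uuid" "$mountpoint"
}

# Create the cold-tier directory tree under the given root (default /data).
make_cold_dirs() {
  local root="${1:-/data}"
  mkdir -p "$root"/gururmm/{pgcold,downloads,backups,archive}
}
```

The pgcold directory still needs chown postgres:postgres before it can back a tablespace.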

Cold-storage isolation (built at migration — needs PG running):

CREATE TABLESPACE gururmm_cold LOCATION '/data/gururmm/pgcold' (chown the dir postgres:postgres first).
Time-partition agent_logs (by month). Recent partitions on SSD default tablespace (hot write path: the batched multi-row INSERT + heartbeats). Nightly job ALTER TABLE agent_logs_YYYYMM SET TABLESPACE gururmm_cold ages old partitions onto the HDD (still queryable for signatures/build-correlation). Past retention horizon: pg_dump partition to /data/gururmm/archive (compressed) then DROP.
downloads (build artifacts, served by nginx + written by pipeline) and backups (nightly pg_dump) also live on /data.
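
A dry-run sketch of the nightly aging job described above, in bash: it derives the agent_logs_YYYYMM partition name for a month N months back and prints the statements rather than running them. The helper names and any month threshold passed in are illustrative assumptions; the tablespace name, archive path, and gururmm database name come from these notes. It relies on GNU date, as shipped on Ubuntu.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; nothing here touches the database.

# Print YYYYMM for the month that is $1 months before the month of $2
# (default: today). Anchoring on the first of the month avoids
# end-of-month rollover surprises in relative date arithmetic.
partition_suffix() {
  local months_ago="$1" ref="${2:-$(date -u +%F)}"
  local first
  first="$(date -u -d "$ref" +%Y-%m-01)"
  date -u -d "$first $months_ago months ago" +%Y%m
}

# Print the statement that moves one monthly partition to the cold tablespace.
cold_move_sql() {
  printf 'ALTER TABLE agent_logs_%s SET TABLESPACE gururmm_cold;\n' "$1"
}

# Print the archive-then-drop commands for a partition past retention.
# pg_dump -Fc writes a compressed custom-format archive.
archive_cmds() {
  local part="agent_logs_$1"
  printf 'pg_dump -Fc -t %s -f /data/gururmm/archive/%s.dump gururmm\n' "$part" "$part"
  printf 'psql -d gururmm -c "DROP TABLE %s;"\n' "$part"
}
```

For example, `cold_move_sql "$(partition_suffix 3)"` prints the ALTER TABLE for the partition three months back; piping that into psql as postgres would perform the move, which rewrites the partition and holds a lock on it while it runs.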

This is the concrete answer to the deferred "#3 log retention/archival" discussion. See rmm-agent-update-model (the downloads dir is the update artifact source) and the WAL-fix context (synchronous_commit=off + pool→30 applied to the OLD VM).

Migration architecture (ratified 2026-06-10, via a 2-round Gemini+Grok panel). The VM 172.16.3.30 is a kitchen-sink host (GuruRMM + GuruConnect + coord API :8001 + Gitea runner + Grafana/Prometheus + MariaDB; PG 14, 5.4 GB gururmm DB).

Decision: physical box becomes 172.16.3.30 and runs everything EXCEPT the Gitea runner (which becomes a Docker container on Jupiter .20); VM retired. (MariaDB MIGRATES: Gate-A found it backs the coord API's claudetools DB at localhost:3306, NOT droppable.) Keeping .30 + coord on physical means NO fleet-wide re-point (the http://172.16.3.30:8001 refs + Cloudflare→pfSense→.30 path are unchanged).

PG via pg_dumpall --globals-only + pg_dump -Fc/pg_restore -j (14→16, schema as-is; storage tiering is a SEPARATE later task). Full runbook (Gate-A pre-flight, cutover from CONSOLE, ARP flush, credential-decrypt gate, PONR=first-agent-reconnect, rollback): projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md. NOT yet executed: needs a window + the Gate-A unknowns closed.
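
The PG transfer named in the plan (pg_dumpall --globals-only, then pg_dump -Fc and a parallel pg_restore -j) could look roughly like the sequence this bash helper prints. It is a dry-run sketch, not the runbook's actual commands: the function name, file names, default job count of 4, and the createdb step are assumptions, and authentication options are omitted.

```bash
#!/usr/bin/env bash
# Sketch only: prints a plausible command sequence, executes nothing.

# Args: source host, database name (default gururmm), restore jobs (default 4).
pg_migration_plan() {
  local src_host="$1" db="${2:-gururmm}" jobs="${3:-4}"
  cat <<EOF
pg_dumpall -h $src_host --globals-only -f globals.sql
pg_dump -h $src_host -Fc -f $db.dump $db
psql -f globals.sql postgres
createdb $db
pg_restore -j $jobs -d $db $db.dump
EOF
}
```

Roles and other globals are restored first so object ownership in the custom-format dump resolves; -j only works with the custom or directory formats, which is why the dump uses -Fc.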
