claudetools/.claude/memory/gururmm-physical-server-storage.md at 470a8e7eb1253eea8b9cd48ace21ebd33c48e2f2


Mike Swanson 470a8e7eb1 rmm: host-migration runbook + ratified architecture (memory + pointer)

Bump guru-rmm pointer (host-migration runbook). Record the migration architecture
decision in memory: physical box becomes .30 (all-but-Gitea-runner), VM retired,
MariaDB migrates (backs the coord claudetools DB per Gate-A).

2026-06-10 18:40:07 -07:00


---
name: gururmm-physical-server-storage
description: New physical GuruRMM server (172.16.1.231) storage layout + hot/cold tiering plan for the migration off 172.16.3.30
metadata:
  type: project
---

The GuruRMM server/build-pipeline is being migrated from the VM (172.16.3.30, slow rotational-backed disk — the cause of the WAL-fsync pool timeouts) to a physical box.

New box (as of 2026-06-10): 172.16.1.231 (TEMPORARY IP — will become 172.16.3.30 at cutover), hostname gururmm, Ubuntu 26.04 LTS. SSH: dedicated ed25519 key ~/.ssh/gururmm-physical (alias gururmm-new), vault infrastructure/gururmm-server-physical (also holds the initial guru password). sudo needs that password (sudo -S), not passwordless.

Drives (storage optimized 2026-06-10):

- SSD sda (Samsung 860, 929 GB) = HOT tier. The installer had left root at only 100 GB; extended the LV into the full VG → root is now ~915 GB. Holds: OS, Postgres DEFAULT tablespace (live/recent data) + WAL, cargo build targets, /opt/gururmm. Fast fsync here is the real fix for the pool-timeout root cause (could even revert to synchronous_commit=on).
- HDD sdb (WD 1 TB, spinning) = COLD tier. Old NTFS "Data2" (504 GB, user confirmed already backed up) wiped → ext4, mounted at /data (fstab by UUID, noatime). Dirs: /data/gururmm/{pgcold, downloads, backups, archive}.
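The cold-tier mount described above can be sketched as a small helper that renders the fstab line and the directory list. This is only an illustration: the UUID is a placeholder to be read from `blkid`, and the trailing `0 2` (dump flag and fsck pass) is an assumed conventional value, since the note records only "by UUID, noatime".

```python
# Sketch only: renders the cold-tier fstab line and directory layout.
# Function names are hypothetical; the "0 2" fields are an assumption.

COLD_DIRS = ["pgcold", "downloads", "backups", "archive"]


def fstab_entry(uuid, mountpoint="/data", fstype="ext4", options="noatime"):
    """Return one /etc/fstab line that mounts a filesystem by UUID."""
    # Fields: device, mountpoint, type, options, dump flag, fsck pass.
    return f"UUID={uuid} {mountpoint} {fstype} {options} 0 2"


def cold_tier_dirs(base="/data/gururmm"):
    """Expand /data/gururmm/{pgcold, downloads, backups, archive}."""
    return [f"{base}/{name}" for name in COLD_DIRS]


if __name__ == "__main__":
    print(fstab_entry("<uuid-from-blkid>"))
    for path in cold_tier_dirs():
        print(path)
```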

Cold-storage isolation (built at migration — needs PG running):

- CREATE TABLESPACE gururmm_cold LOCATION '/data/gururmm/pgcold' (chown the dir postgres:postgres first).
- Time-partition agent_logs (by month). Recent partitions on SSD default tablespace (hot write path: the batched multi-row INSERT + heartbeats). Nightly job ALTER TABLE agent_logs_YYYYMM SET TABLESPACE gururmm_cold ages old partitions onto the HDD (still queryable for signatures/build-correlation). Past retention horizon: pg_dump partition to /data/gururmm/archive (compressed) then DROP.
- downloads (build artifacts, served by nginx + written by pipeline) and backups (nightly pg_dump) also live on /data.
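The nightly aging job could plan its work along these lines. This is a minimal sketch, not the actual job: the `hot_months=3` and `retention_months=24` thresholds are hypothetical (the note does not fix them), the database name `gururmm` is assumed, and the function names are illustrative. It only emits the statements; running them is left to the caller.

```python
from datetime import date

# Assumptions (not in the note): 3 hot months, 24-month retention horizon,
# and a database named "gururmm".


def partition_name(year, month):
    """Monthly partition name in the agent_logs_YYYYMM scheme."""
    return f"agent_logs_{year:04d}{month:02d}"


def age_in_months(year, month, today):
    """Whole calendar months between the partition's month and today."""
    return (today.year - year) * 12 + (today.month - month)


def tier_for(year, month, today, hot_months=3, retention_months=24):
    """Classify a partition as 'hot', 'cold', or 'expired'."""
    age = age_in_months(year, month, today)
    if age >= retention_months:
        return "expired"
    if age >= hot_months:
        return "cold"
    return "hot"


def actions_for(year, month, today, hot_months=3, retention_months=24):
    """Return (kind, command) pairs for one partition; empty if it stays hot."""
    name = partition_name(year, month)
    tier = tier_for(year, month, today, hot_months, retention_months)
    if tier == "cold":
        return [("sql", f"ALTER TABLE {name} SET TABLESPACE gururmm_cold;")]
    if tier == "expired":
        dump = f"/data/gururmm/archive/{name}.dump"
        return [
            # -Fc is pg_dump's custom format, which is compressed by default.
            ("shell", f"pg_dump -Fc -t {name} -f {dump} gururmm"),
            ("sql", f"DROP TABLE {name};"),
        ]
    return []


if __name__ == "__main__":
    today = date(2026, 6, 10)
    for ym in [(2026, 5), (2026, 3), (2024, 6)]:
        print(ym, actions_for(*ym, today))
```

Two caveats when turning this into a real job: `ALTER TABLE ... SET TABLESPACE` rewrites the table under an exclusive lock, and a partition's indexes stay where they are unless moved with their own `ALTER INDEX ... SET TABLESPACE`.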

This is the concrete answer to the deferred "#3 log retention/archival" discussion. See rmm-agent-update-model (the downloads dir is the update artifact source) and the WAL-fix context (synchronous_commit=off + pool→30 applied to the OLD VM).

Migration architecture (ratified 2026-06-10, via a 2-round Gemini+Grok panel). The VM 172.16.3.30 is a kitchen-sink host (GuruRMM + GuruConnect + coord API :8001 + Gitea runner + Grafana/Prometheus + MariaDB; PG 14, 5.4 GB gururmm DB). Decision: physical box becomes 172.16.3.30 and runs everything EXCEPT the Gitea runner (which becomes a Docker container on Jupiter .20); VM retired. (MariaDB MIGRATES: Gate-A found it backs the coord API's claudetools DB at localhost:3306, so it is NOT droppable.) Keeping .30 + coord on physical means NO fleet-wide re-point (the http://172.16.3.30:8001 refs + Cloudflare→pfSense→.30 path are unchanged).

PG moves via pg_dumpall --globals-only + pg_dump -Fc/pg_restore -j (14→16, schema as-is; storage tiering is a SEPARATE later task).

Full runbook (Gate-A pre-flight, cutover from CONSOLE, ARP flush, credential-decrypt gate, PONR=first-agent-reconnect, rollback): projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md. NOT yet executed: needs a window + the Gate-A unknowns closed.
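The dump/restore sequence named above can be laid out as an ordered command list. This is a sketch under assumptions, not the runbook's procedure: the dump directory, the job count, the `createdb` step, and running the dump remotely against the old host (which needs `pg_hba.conf` access) are all illustrative choices, and authentication options are omitted. The runbook remains authoritative.

```python
import shlex

# Assumptions (not in the note): dump files land in /data/gururmm/backups,
# 4 parallel restore jobs, and the target database is created with createdb.


def migration_commands(old_host, db="gururmm",
                       dump_dir="/data/gururmm/backups", jobs=4):
    """Return the dump/restore steps as argv lists, in execution order."""
    globals_sql = f"{dump_dir}/globals.sql"
    dump_file = f"{dump_dir}/{db}.dump"
    return [
        # Roles and tablespaces only; plain SQL output.
        ["pg_dumpall", "-h", old_host, "--globals-only", "-f", globals_sql],
        # Custom-format archive, required for parallel pg_restore.
        ["pg_dump", "-h", old_host, "-Fc", "-f", dump_file, db],
        # On the new server: replay globals, create the DB, restore in parallel.
        ["psql", "-f", globals_sql, "postgres"],
        ["createdb", db],
        ["pg_restore", "-j", str(jobs), "-d", db, dump_file],
    ]


if __name__ == "__main__":
    for argv in migration_commands("172.16.3.30"):
        print(shlex.join(argv))
```

For a major-version upgrade, the PostgreSQL documentation recommends using the newer release's `pg_dump` and `pg_dumpall` binaries against the old server, which is why the sketch passes `-h` rather than dumping on the VM itself.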
