---
name: gururmm-physical-server-storage
description: Physical GuruRMM server (now IS 172.16.3.30) storage layout + hot/cold tiering; host migration COMPLETE 2026-06-11
metadata:
type: project
---
**MIGRATION COMPLETE (2026-06-11 ~07:20 MST).** The physical box now IS 172.16.3.30 and runs the
full stack: gururmm-server :3001, guruconnect :3002, coord/claudetools-api :8001, webhook :9000,
nginx :80, PostgreSQL 18, MariaDB 11.8, Grafana :3000, Prometheus :9090. Cred-decrypt verified
(MSP360 sync 62/0). Agents reconnected (162/212 within 15 min). SSH key: `~/.ssh/gururmm-physical`
(the `gururmm-new` alias pointed at .231, which was the temp DHCP address; the box is now .30). The
sudo password is the vault `guru` password, piped via `echo "$P" | sudo -S -p ""`; a bare
`sudo -u postgres` with no prior sudo in the SSH session fails with "a terminal is required".
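
A minimal sketch of the sudo priming described above, for use inside a non-interactive SSH session. The helper name `sudo_as` is hypothetical, and `$P` is assumed to already hold the vault `guru` password.

```bash
# Hypothetical helper: run one command as another user without a TTY.
# Assumes $P already holds the vault `guru` password (never hard-code it).
sudo_as() {
  user="$1"; shift
  # -S reads the password from stdin; -p "" suppresses the prompt text.
  printf '%s\n' "$P" | sudo -S -p "" -u "$user" "$@"
}

# Example (not run here): sudo_as postgres psql -Atc 'SELECT version();'
```

Because stdin carries the password, pass SQL with `-c` or `-f` rather than piping it into the wrapped command.
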
**Cutover gotchas that bit us (see runbook):** (1) the box's nginx loaded a STALE config missing
`location /ws` -> agents got 404 on /ws -> `systemctl reload nginx` fixed it (always reload after
config placement). (2) Public ingress/TLS is **Nginx Proxy Manager on Jupiter 172.16.3.20**, NOT
local nginx (which is :80-only) -> NPM forwards to .30:80, no reconfig needed since .30 preserved.
(3) Prometheus TSDB WAL was copied mid-write -> `segments are not sequential` -> moved
`/var/lib/prometheus/metrics2/wal` aside (lost ~2h, blocks intact). (4) the `.30` IP swap used a
self-confirming detached netplan apply + a fresh `.47` mgmt IP (no stale-ARP baggage like `.30`);
the VM kept `.46` as an independent channel and released `.30`.
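
Gotcha (3) can be checked for before Prometheus is started on a copied data directory. A minimal sketch, assuming the WAL directory holds segment files named as zero-padded integers and that a gap between consecutive numbers is what produces `segments are not sequential`. The function name is hypothetical.

```bash
# Succeeds if the numeric segment files in a WAL directory are consecutive.
# Assumption: segments are zero-padded integer file names; anything else
# (for example checkpoint.* directories) is ignored.
wal_segments_sequential() {
  dir="$1"
  prev=""
  for f in "$dir"/[0-9]*; do
    [ -f "$f" ] || continue
    name=${f##*/}
    case "$name" in *[!0-9]*) continue ;; esac
    # Strip leading zeros so 00000008 is not parsed as an octal literal.
    n=$(printf '%s' "$name" | sed 's/^0*//')
    n=${n:-0}
    if [ -n "$prev" ] && [ "$n" -ne $(($prev + 1)) ]; then
      echo "gap: segment $name does not follow $prev" >&2
      return 1
    fi
    prev=$n
  done
  return 0
}
```

On the server this would be called as `wal_segments_sequential /var/lib/prometheus/metrics2/wal` with Prometheus stopped. A non-zero exit suggests the copy caught the WAL mid-write.
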
**Post-cutover DONE:** 7-day metrics/agent_logs backfill (2026-06-11) -- streamed VM->new box
direct (id-range filtered, .pgpass), ~3.45M rows / ~3.4 GB in ~2.5 min, lossless (id-range counts
match VM<->new box: metrics 1,189,924; agent_logs 2,262,938). **Perf proof:** SSD sustained
186-214 MB/s writes, w_await 0.7-3.2 ms, fsync ~3 ms, peak %util ~65% (headroom), and ZERO
pool-timeouts under the bulk load + 212 live agents -- the rotational-VM WAL-fsync root cause is fixed.
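
The exact backfill commands are not recorded in this note. One way to stream an id range directly between two hosts is a `COPY` piped between two `psql` sessions, sketched below. The function names, the `SRC_HOST`/`DST_HOST` variables, and the `gururmm` role and database names are assumptions; passwords are expected to come from `~/.pgpass`.

```bash
# Build the COPY-out statement for one id range (string building only).
copy_out_sql() {
  printf 'COPY (SELECT * FROM %s WHERE id BETWEEN %s AND %s) TO STDOUT' "$1" "$2" "$3"
}

# Stream one id range from SRC_HOST to DST_HOST without a temp file.
# Role and database names are placeholders for illustration.
backfill_range() {
  table="$1"; lo="$2"; hi="$3"
  psql -h "$SRC_HOST" -U gururmm -d gururmm -v ON_ERROR_STOP=1 \
       -c "$(copy_out_sql "$table" "$lo" "$hi")" \
    | psql -h "$DST_HOST" -U gururmm -d gururmm -v ON_ERROR_STOP=1 \
       -c "COPY $table FROM STDIN"
}

# Count rows in the same id range on one host, for the lossless check.
range_count() {
  psql -h "$1" -U gururmm -d gururmm -Atc \
    "SELECT count(*) FROM $2 WHERE id BETWEEN $3 AND $4"
}
```

Comparing `range_count` on both hosts for the same table and range is the check behind "id-range counts match"; a failure on the sending side of the pipe is otherwise easy to miss.
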
**Workstream B DONE (2026-06-11):** jupiter-runner (act_runner v0.6.1, labels ubuntu-latest/22.04)
online on Jupiter .20 Docker; VM's gitea-runner DISABLED (kept registered for rollback). Build env
provisioned on the new box: source repo /home/guru/gururmm @ main 7c2f20e (rsync'd from VM, target/
+node_modules excluded), last-built-commit baselines copied, Rust 1.96.0 + Node v20.20.2/npm 10.8.2,
Pluto (Administrator@172.16.3.36) SSH auth OK for Windows builds. NOTE: gururmm has NO .gitea/workflows
-- builds run via the **webhook-handler path** (Gitea webhook http://172.16.3.30/webhook/build ->
nginx :80 /webhook/ -> :9000 -> build-*.sh on the server), NOT Gitea Actions. Pipeline wired end-to-end;
not yet exercised by a real build.
**Post-cutover cleanup DONE (2026-06-12):** old VM `GuruRMM`
decommissioned after the soak — `virsh destroy`+`undefine`, `vdisk1.img` deleted, `.46` released;
`.47` mgmt IP dropped from the physical box's netplan (eno1 now carries only `172.16.3.30`). The
rollback anchor was intentionally retired; there is no longer a parked VM.
**History (pre-cutover — now DONE, retained for context).** The GuruRMM server/build-pipeline
ran on a **VM** at 172.16.3.30 (slow rotational-backed disk — the WAL-fsync pool-timeout cause)
and was migrated to a **physical box**, which took over the 172.16.3.30 IP at cutover
(2026-06-11). During provisioning (2026-06-10) the physical box was briefly at temp DHCP IP
**172.16.1.231**; that IP is no longer used. Hostname `gururmm`, **Ubuntu 26.04 LTS**. SSH:
dedicated ed25519 key `~/.ssh/gururmm-physical` to `guru@172.16.3.30`, vault
`infrastructure/gururmm-server-physical` (SSH key + initial `guru` password). sudo requires that
password (`sudo -S`); it is not passwordless.
**Drives (storage optimized 2026-06-10):**
- **SSD `sda`** (Samsung 860, 929 GB) = HOT tier. Installer had left root at only 100 GB;
extended the LV into the full VG → **root is now ~915 GB**. Holds: OS, Postgres DEFAULT
tablespace (live/recent data) + WAL, cargo build targets, `/opt/gururmm`. Fast fsync here is
the real fix for the pool-timeout root cause (could even revert to `synchronous_commit=on`).
- **HDD `sdb`** (WD 1 TB, spinning) = COLD tier. Old NTFS "Data2" (504 GB, user confirmed
already backed up) wiped → **ext4, mounted at `/data`** (fstab by UUID, `noatime`). Dirs:
`/data/gururmm/{pgcold, downloads, backups, archive}`.
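
The cold-tier mount above can be sketched as a small helper that builds the fstab entry. The `defaults` option, the `0 2` dump/fsck fields, and the `/dev/sdb1` partition name are assumptions; the note only records ext4, `/data`, mount by UUID, and `noatime`.

```bash
# Build the /etc/fstab entry for the cold tier from a filesystem UUID.
# Fields: device, mount point, type, options, dump, fsck pass.
fstab_line() {
  printf 'UUID=%s %s ext4 defaults,noatime 0 2\n' "$1" "$2"
}

# Example (not run here; /dev/sdb1 is an assumed partition name):
#   fstab_line "$(sudo blkid -s UUID -o value /dev/sdb1)" /data | sudo tee -a /etc/fstab
#   sudo mount -a && findmnt /data
```
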
**Cold-storage isolation (built at migration — needs PG running):**
- `CREATE TABLESPACE gururmm_cold LOCATION '/data/gururmm/pgcold'` (chown the dir
postgres:postgres first).
- Time-partition `agent_logs` (by month). Recent partitions stay on the SSD default tablespace (hot
write path: the batched multi-row INSERT + heartbeats). A nightly job runs `ALTER TABLE
agent_logs_YYYYMM SET TABLESPACE gururmm_cold` to age old partitions onto the HDD (still
queryable for signatures/build-correlation). Past the retention horizon: pg_dump the partition to
`/data/gururmm/archive` (compressed), then DROP it.
- `downloads` (build artifacts, served by nginx + written by pipeline) and `backups`
(nightly pg_dump) also live on `/data`.
This tiering scheme is the concrete answer to the deferred "#3 log retention/archival" discussion. See
[[rmm-agent-update-model]] (the downloads dir is the update artifact source) and the WAL-fix
context (synchronous_commit=off + pool→30 applied to the OLD VM).
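
The nightly aging step from the cold-storage list can be sketched as two small helpers that pick the partition label and emit the statement. The helper names and the three-month hot window in the example are assumptions; the note does not fix a hot-tier horizon.

```bash
# Print the YYYYMM label that lies $3 months before year $1, month $2.
months_before() {
  y="$1"; m="${2#0}"; back="$3"   # strip one leading zero: 08/09 are not octal
  total=$(( $y * 12 + ($m - 1) - $back ))
  printf '%04d%02d\n' $(( $total / 12 )) $(( $total % 12 + 1 ))
}

# Emit the statement that moves one monthly partition to the cold tablespace.
cold_move_sql() {
  printf 'ALTER TABLE agent_logs_%s SET TABLESPACE gururmm_cold;\n' "$1"
}

# Example nightly call (not run here; 3 months is an assumed hot window):
#   cold_move_sql "$(months_before "$(date +%Y)" "$(date +%m)" 3)" \
#     | psql -d gururmm -v ON_ERROR_STOP=1
```

`ALTER TABLE ... SET TABLESPACE` rewrites the partition under an exclusive lock and does not move its indexes, so a real job would likely run off-peak and move indexes separately with `ALTER INDEX ... SET TABLESPACE`.
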
**Migration architecture (ratified 2026-06-10, via a 2-round Gemini+Grok panel).** The VM
`172.16.3.30` is a kitchen-sink host (GuruRMM + GuruConnect + coord API :8001 + Gitea runner +
Grafana/Prometheus + MariaDB; PG 14, 5.4 GB gururmm DB). Decision: physical box **becomes
`172.16.3.30`** and runs **everything EXCEPT the Gitea runner** (which becomes a Docker container
on Jupiter `.20`); VM retired. (MariaDB MIGRATES — Gate-A found it backs the coord API's `claudetools`
DB at localhost:3306, NOT droppable.) Keeping `.30` + coord on physical means NO fleet-wide
re-point (the `http://172.16.3.30:8001` refs + Cloudflare→pfSense→.30 path are unchanged). PG via
`pg_dumpall --globals-only` + `pg_dump -Fc`/`pg_restore -j` (14→16, schema as-is — storage tiering
is a SEPARATE later task). Full runbook (Gate-A pre-flight, cutover from CONSOLE, ARP flush,
credential-decrypt gate, PONR=first-agent-reconnect, rollback): `projects/msp-tools/guru-rmm/docs/
HOST_MIGRATION_RUNBOOK.md`. EXECUTED and COMPLETE 2026-06-11 (see the top of this note).
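
The dump/restore path named above can be laid out as a reviewable plan before anything runs. This is a sketch, not the runbook's script: the function name, the scratch directory, and the job count are hypothetical, and the commands are assumed to run as the `postgres` OS user.

```bash
# Print the dump/restore sequence for review; nothing is executed.
# $1 = database name, $2 = scratch directory, $3 = parallel restore jobs.
migration_plan() {
  db="$1"; dir="$2"; jobs="$3"
  cat <<EOF
# on the old VM
pg_dumpall --globals-only -f $dir/globals.sql
pg_dump -Fc -d $db -f $dir/$db.dump
# on the physical box, after copying $dir across
psql -d postgres -f $dir/globals.sql
createdb $db
pg_restore -j $jobs -d $db $dir/$db.dump
EOF
}

# Example: migration_plan gururmm /tmp/mig 4
```

Roles are restored from the globals file first so that object ownership resolves during `pg_restore`. The custom format (`-Fc`) is what makes the parallel `-j` restore possible.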