---
name: gururmm-physical-server-storage
description: Physical GuruRMM server (now IS 172.16.3.30) storage layout + hot/cold tiering; host migration COMPLETE 2026-06-11
metadata:
  type: project
---

**MIGRATION COMPLETE (2026-06-11 ~07:20 MST).** The physical box now IS 172.16.3.30 and runs the
full stack: gururmm-server :3001, guruconnect :3002, coord/claudetools-api :8001, webhook :9000,
nginx :80, PostgreSQL 18, MariaDB 11.8, Grafana :3000, Prometheus :9090. Cred-decrypt verified
(MSP360 sync 62/0). Agents reconnected (162/212 within 15 min). SSH: `~/.ssh/gururmm-physical`
(the `gururmm-new` alias pointed at the temp DHCP address .231; the box is now .30). sudo password
= the vault `guru` password, piped via `echo "$P" | sudo -S -p ""` (a bare `sudo -u postgres` with
no prior sudo in the SSH session fails with "a terminal is required").

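The sudo-over-SSH pattern above can be wrapped in a small helper. This is a minimal sketch, not part of the runbook: `remote_sudo` and `GURU_SUDO_PW` are hypothetical names, and it assumes the vault `guru` password has already been loaded into that variable.

```shell
# Hypothetical helper (not from the runbook): run one command under sudo on the
# physical box, feeding the password on stdin so sudo never asks for a terminal.
# Assumes GURU_SUDO_PW already holds the vault `guru` password.
remote_sudo() {
  # $1 = the remote command line, appended after "sudo -S -p ''"
  printf '%s\n' "$GURU_SUDO_PW" |
    ssh -i ~/.ssh/gururmm-physical guru@172.16.3.30 "sudo -S -p '' $1"
}
```

Usage would look like `remote_sudo 'systemctl reload nginx'` or `remote_sudo '-u postgres psql -c "SELECT 1"'`; in both cases `-S` is what lets sudo read the password from the pipe.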
**Cutover gotchas that bit us (see runbook):**
1. The box's nginx loaded a STALE config missing `location /ws` -> agents got 404 on /ws ->
   `systemctl reload nginx` fixed it (always reload after config placement).
2. Public ingress/TLS is **Nginx Proxy Manager on Jupiter 172.16.3.20**, NOT local nginx (which
   is :80-only) -> NPM forwards to .30:80, no reconfig needed since .30 was preserved.
3. Prometheus TSDB WAL was copied mid-write -> `segments are not sequential` -> moved
   `/var/lib/prometheus/metrics2/wal` aside (lost ~2h of data, blocks intact).
4. The `.30` IP swap used a self-confirming detached netplan apply + a fresh `.47` mgmt IP (no
   stale-ARP baggage like `.30`); the VM kept `.46` as an independent channel and released `.30`.

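Gotcha 1 lends itself to a pre-flight check. The following is a sketch only: `has_ws_location` and `reload_nginx_checked` are hypothetical names, and the config path is whatever file the deployment actually places (this note does not name it).

```shell
# Hypothetical pre-flight for gotcha 1: refuse to reload unless the placed
# config declares a /ws location. The config path is supplied by the caller.
has_ws_location() {
  # Matches lines such as "location /ws {", "location = /ws {" or "location /ws/ {".
  grep -Eq '^[[:space:]]*location[[:space:]]+(=[[:space:]]+)?/ws([/[:space:]{]|$)' "$1"
}

reload_nginx_checked() {
  conf="$1"
  has_ws_location "$conf" || { echo "missing 'location /ws' in $conf" >&2; return 1; }
  # nginx -t validates the on-disk config; reload only if it parses,
  # so the running workers pick up the file that was just checked.
  sudo nginx -t && sudo systemctl reload nginx
}
```

The grep only proves the directive exists in the file; it says nothing about what the running nginx has loaded, which is why the reload stays mandatory.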
**Post-cutover DONE:** 7-day metrics/agent_logs backfill (2026-06-11) -- streamed VM->new box
direct (id-range filtered, .pgpass), 3.46M rows / ~3.4 GB in ~2.5 min, lossless (id-range counts
match VM<->new box: metrics 1,189,924; agent_logs 2,262,938). **Perf proof:** SSD sustained
186-214 MB/s writes, w_await 0.7-3.2 ms, fsync ~3 ms, peak %util ~65% (headroom), and ZERO
pool-timeouts under the bulk load + 212 live agents -- the rotational-VM WAL-fsync root cause is fixed.

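One way an id-range-filtered, host-to-host stream like the backfill above can be done is with two `psql` processes joined by a pipe. This is a sketch under assumptions, not the command that was actually run: `SRC_HOST`, `DST_HOST`, `DB_NAME`, `DB_USER` and the function names are placeholders, and it assumes both sides have identical column order.

```shell
# Build the COPY statement for one inclusive id range (pure string helper).
copy_out_sql() {
  # $1 = table, $2 = lowest id, $3 = highest id
  printf 'COPY (SELECT * FROM %s WHERE id BETWEEN %s AND %s) TO STDOUT' "$1" "$2" "$3"
}

# Row count for the same range; run it against both hosts and compare.
count_sql() {
  printf 'SELECT count(*) FROM %s WHERE id BETWEEN %s AND %s' "$1" "$2" "$3"
}

# Stream one range source -> destination without staging a file.
# SRC_HOST / DST_HOST / DB_NAME / DB_USER are placeholders, not values from
# this note; passwords come from ~/.pgpass so neither psql prompts.
stream_range() {
  psql -h "$SRC_HOST" -U "$DB_USER" -d "$DB_NAME" -c "$(copy_out_sql "$1" "$2" "$3")" |
    psql -h "$DST_HOST" -U "$DB_USER" -d "$DB_NAME" -c "COPY $1 FROM STDIN"
}
```

A source-side failure in a plain pipe just produces a short stream, so comparing `count_sql` results on both hosts afterwards is what establishes that the copy was lossless.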
**Workstream B DONE (2026-06-11):** jupiter-runner (act_runner v0.6.1, labels ubuntu-latest/22.04)
online on Jupiter .20 Docker; VM's gitea-runner DISABLED (kept registered for rollback). Build env
provisioned on the new box: source repo /home/guru/gururmm @ main 7c2f20e (rsync'd from VM,
`target/` + `node_modules` excluded), last-built-commit baselines copied, Rust 1.96.0 + Node
v20.20.2/npm 10.8.2, Pluto (Administrator@172.16.3.36) SSH auth OK for Windows builds. NOTE:
gururmm has NO `.gitea/workflows` -- builds run via the **webhook-handler path** (Gitea webhook
http://172.16.3.30/webhook/build -> nginx :80 /webhook/ -> :9000 -> `build-*.sh` on the server),
NOT Gitea Actions. Pipeline wired end-to-end; not yet exercised by a real build.

**Post-cutover cleanup DONE (2026-06-12):** old VM `GuruRMM` decommissioned after the soak:
`virsh destroy`+`undefine`, `vdisk1.img` deleted, `.46` released; `.47` mgmt IP dropped from the
physical box's netplan (eno1 now carries only `172.16.3.30`). The rollback anchor was
intentionally retired; there is no longer a parked VM.

**History (pre-cutover, now DONE; retained for context).** The GuruRMM server/build-pipeline
ran on a **VM** at 172.16.3.30 (slow rotational-backed disk, the cause of the WAL-fsync
pool-timeouts) and was migrated to a **physical box**, which took over the 172.16.3.30 IP at
cutover (2026-06-11). During provisioning (2026-06-10) the physical box was briefly at temp DHCP
IP **172.16.1.231**; that IP is no longer used. Hostname `gururmm`, **Ubuntu 26.04 LTS**. SSH:
dedicated ed25519 key `~/.ssh/gururmm-physical` to `guru@172.16.3.30`; vault entry
`infrastructure/gururmm-server-physical` (SSH key + initial `guru` password). sudo needs that
password (`sudo -S`); it is not passwordless.

**Drives (storage optimized 2026-06-10):**
- **SSD `sda`** (Samsung 860, 929 GB) = HOT tier. Installer had left root at only 100 GB;
  extended the LV into the full VG → **root is now ~915 GB**. Holds: OS, Postgres DEFAULT
  tablespace (live/recent data) + WAL, cargo build targets, `/opt/gururmm`. Fast fsync here is
  the real fix for the pool-timeout root cause (could even revert to `synchronous_commit=on`).
- **HDD `sdb`** (WD 1 TB, spinning) = COLD tier. Old NTFS "Data2" (504 GB, user confirmed
  already backed up) wiped → **ext4, mounted at `/data`** (fstab by UUID, `noatime`). Dirs:
  `/data/gururmm/{pgcold, downloads, backups, archive}`.

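The cold-tier mount and directory layout described above can be sketched as two small helpers. The function names are hypothetical, and `defaults` plus the trailing `0 2` are conventional fstab choices assumed here; the note itself only specifies "by UUID, `noatime`".

```shell
# Emit an /etc/fstab entry that mounts a filesystem by UUID with noatime.
# `defaults` and "0 2" (no dump, fsck after root) are assumptions, see above.
fstab_line() {
  # $1 = filesystem UUID, $2 = mount point, $3 = filesystem type
  printf 'UUID=%s %s %s defaults,noatime 0 2\n' "$1" "$2" "$3"
}

# Create the cold-tier directory layout under a given mount point.
make_cold_dirs() {
  for d in pgcold downloads backups archive; do
    mkdir -p "$1/gururmm/$d"
  done
}
```

For example, `fstab_line "$(blkid -s UUID -o value /dev/sdb1)" /data ext4` prints the line to append to `/etc/fstab`; the partition name `/dev/sdb1` is an assumption, since the note only names the disk `sdb`.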
**Cold-storage isolation (built at migration; needs PG running):**
- `CREATE TABLESPACE gururmm_cold LOCATION '/data/gururmm/pgcold'` (chown the dir to
  postgres:postgres first).
- Time-partition `agent_logs` (by month). Recent partitions on SSD default tablespace (hot
  write path: the batched multi-row INSERT + heartbeats). Nightly job
  `ALTER TABLE agent_logs_YYYYMM SET TABLESPACE gururmm_cold` ages old partitions onto the HDD
  (still queryable for signatures/build-correlation). Past retention horizon: pg_dump partition
  to `/data/gururmm/archive` (compressed) then DROP.
- `downloads` (build artifacts, served by nginx + written by pipeline) and `backups`
  (nightly pg_dump) also live on `/data`.

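The nightly aging step needs to work out which monthly partition has just left the hot window. A minimal sketch follows, assuming the `agent_logs_YYYYMM` naming above; `months_before`, `age_sql` and the `HOT_MONTHS` setting are hypothetical, since the note does not say how many months stay hot.

```shell
# Print the YYYYMM that lies $2 whole months before the YYYYMM given in $1.
months_before() {
  y=${1%??}      # drop the last two characters -> year
  m=${1#????}    # drop the first four characters -> month, maybe zero-padded
  m=${m#0}       # strip the pad so 08/09 are not parsed as octal
  total=$(( y * 12 + m - 1 - $2 ))
  printf '%04d%02d\n' $(( total / 12 )) $(( total % 12 + 1 ))
}

# SQL that moves one monthly partition onto the cold tablespace.
age_sql() {
  printf 'ALTER TABLE agent_logs_%s SET TABLESPACE gururmm_cold;\n' "$1"
}
```

A cron entry could then run something like `age_sql "$(months_before "$(date +%Y%m)" "$HOT_MONTHS")" | sudo -u postgres psql -d gururmm`. Note that `ALTER TABLE ... SET TABLESPACE` does not move the partition's indexes; those need their own `ALTER INDEX ... SET TABLESPACE` if they should leave the SSD as well.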
This is the concrete answer to the deferred "#3 log retention/archival" discussion. See
[[rmm-agent-update-model]] (the downloads dir is the update artifact source) and the WAL-fix
context (synchronous_commit=off + pool→30 applied to the OLD VM).

**Migration architecture (ratified 2026-06-10, via a 2-round Gemini+Grok panel).** The VM
`172.16.3.30` was a kitchen-sink host (GuruRMM + GuruConnect + coord API :8001 + Gitea runner +
Grafana/Prometheus + MariaDB; PG 14, 5.4 GB gururmm DB). Decision: physical box **becomes
`172.16.3.30`** and runs **everything EXCEPT the Gitea runner** (which becomes a Docker container
on Jupiter `.20`); VM retired. (MariaDB MIGRATES: Gate-A found it backs the coord API's
`claudetools` DB at localhost:3306, so it is NOT droppable.) Keeping `.30` + coord on physical
means NO fleet-wide re-point (the `http://172.16.3.30:8001` refs + Cloudflare→pfSense→.30 path
are unchanged). PG via `pg_dumpall --globals-only` + `pg_dump -Fc`/`pg_restore -j` (ratified as
14→16, though per the top of this note the box as built runs PostgreSQL 18; schema as-is, storage
tiering is a SEPARATE later task). Full runbook (Gate-A pre-flight, cutover from CONSOLE, ARP
flush, credential-decrypt gate, PONR=first-agent-reconnect, rollback):
`projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md`. EXECUTED and COMPLETE 2026-06-11
(see the top of this note).
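The dump/restore sequence named above can be laid out as a dry-run planner that prints the commands in order rather than executing them. This is a sketch, not the runbook's script: `pg_migration_plan`, the host arguments, the file names and the job count are all placeholders, and connection users are omitted.

```shell
# Dry-run planner: prints the dump/restore commands in order, runs nothing.
# Hosts, file names and job count are placeholders; gururmm is the DB named in this note.
pg_migration_plan() {
  old="$1"; new="$2"; jobs="$3"
  # 1. Cluster-wide objects (roles, tablespaces) that pg_dump does not carry.
  printf 'pg_dumpall -h %s --globals-only > globals.sql\n' "$old"
  # 2. Custom-format dump, which parallel pg_restore requires.
  printf 'pg_dump -h %s -Fc -f gururmm.dump gururmm\n' "$old"
  # 3. Load globals first so object owners exist on the new cluster.
  printf 'psql -h %s -f globals.sql postgres\n' "$new"
  # 4. Create the database (-C) and restore into it with parallel jobs (-j).
  printf 'pg_restore -h %s -j %s -C -d postgres gururmm.dump\n' "$new" "$jobs"
}
```

Calling `pg_migration_plan old-vm 172.16.3.30 4` prints four lines that can be reviewed, then pasted or piped to a shell; `old-vm` and `4` are illustrative values.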