rmm: host-migration runbook + ratified architecture (memory + pointer)

Bump guru-rmm pointer (host-migration runbook). Record the migration architecture decision in memory: physical box becomes .30 (all-but-Gitea-runner), VM retired, MariaDB migrates (backs the coord claudetools DB per Gate-A).
2026-06-10 18:40:07 -07:00
parent 0455472c70
commit 470a8e7eb1
3 changed files with 52 additions and 1 deletions
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -23,6 +23,7 @@
 - [Gitea git-op latency](reference_gitea_git_op_latency.md) — SSH (.20:2222) is SLOWEST (~1.5s); internal HTTP+token ~0.55s; SOPS lookup only ~0.33s. Don't switch to SSH for speed. Gitea SSH is .20:2222 (API ssh_url .21 is wrong).
 - [GuruRMM technical reference](reference_gururmm.md) — Server (172.16.3.30) layout + downloads dir `/var/www/gururmm/downloads` + `.channel` sidecar rollout control (stable/beta) + privileged server access via the server's OWN root RMM agent (hostname `gururmm`, no SSH needed; plink fallback) + API + `context=user_session` (WTS impersonation) + build-pipeline vendoring at `deploy/build-pipeline/` + Linux agent systemd sandbox trap.
 - [RMM agent update model](rmm-agent-update-model.md) — Agent updates are server-PUSH on heartbeat (no self-poll); available versions = filesystem scan needing a `.sha256`; promote flips `.channel` sidecars beta→stable globally. Two stranders: beta-first freezes stable until an explicit promote; agents older than ~0.6.50 re-enroll with a NEW device_id/agent row when updated.
+- [GuruRMM physical server storage](gururmm-physical-server-storage.md) — New box 172.16.1.231 (temp IP→will be .30), Ubuntu 26.04, ssh key `gururmm-physical`/alias `gururmm-new`. SSD (915G root) = HOT (PG default tablespace + WAL + builds); HDD ext4 at `/data` = COLD (`gururmm_cold` PG tablespace for aged `agent_logs` partitions + downloads + backups + archive). The #3 retention answer.
 - [Trebesch DESKTOP-QNP3ON5 shell replacement](reference_trebesch_qnp3on5.md) — AT Trebesch box runs an Explorer shell replacement; explorer.exe owner check returns blank — use Win32_ComputerSystem.UserName. GuruRMM SWIFT-LION-2892.

 ## Users
--- a/.claude/memory/gururmm-physical-server-storage.md
+++ b/.claude/memory/gururmm-physical-server-storage.md
@@ -0,0 +1,50 @@
+---
+name: gururmm-physical-server-storage
+description: New physical GuruRMM server (172.16.1.231) storage layout + hot/cold tiering plan for the migration off 172.16.3.30
+metadata:
+  type: project
+---
+
+The GuruRMM server/build-pipeline is being migrated from the VM (172.16.3.30, slow
+rotational-backed disk — the cause of the WAL-fsync pool timeouts) to a **physical box**.
+
+New box (as of 2026-06-10): **172.16.1.231** (TEMPORARY IP — will become 172.16.3.30 at
+cutover), hostname `gururmm`, **Ubuntu 26.04 LTS**. SSH: dedicated ed25519 key
+`~/.ssh/gururmm-physical` (alias `gururmm-new`), vault `infrastructure/gururmm-server-physical`
+(also holds the initial `guru` password). sudo needs that password (`sudo -S`), not passwordless.
+
+**Drives (storage optimized 2026-06-10):**
+- **SSD `sda`** (Samsung 860, 929 GB) = HOT tier. Installer had left root at only 100 GB;
+  extended the LV into the full VG → **root is now ~915 GB**. Holds: OS, Postgres DEFAULT
+  tablespace (live/recent data) + WAL, cargo build targets, `/opt/gururmm`. Fast fsync here is
+  the real fix for the pool-timeout root cause (could even revert to `synchronous_commit=on`).
+- **HDD `sdb`** (WD 1 TB, spinning) = COLD tier. Old NTFS "Data2" (504 GB, user confirmed
+  already backed up) wiped → **ext4, mounted at `/data`** (fstab by UUID, `noatime`). Dirs:
+  `/data/gururmm/{pgcold, downloads, backups, archive}`.
+
+**Cold-storage isolation (built at migration — needs PG running):**
+- `CREATE TABLESPACE gururmm_cold LOCATION '/data/gururmm/pgcold'` (chown the dir
+  postgres:postgres first).
+- Time-partition `agent_logs` (by month). Recent partitions on SSD default tablespace (hot
+  write path: the batched multi-row INSERT + heartbeats). Nightly job `ALTER TABLE
+  agent_logs_YYYYMM SET TABLESPACE gururmm_cold` ages old partitions onto the HDD (still
+  queryable for signatures/build-correlation). Past retention horizon: pg_dump partition to
+  `/data/gururmm/archive` (compressed) then DROP.
+- `downloads` (build artifacts, served by nginx + written by pipeline) and `backups`
+  (nightly pg_dump) also live on `/data`.
+
+This is the concrete answer to the deferred "#3 log retention/archival" discussion. See
+[[rmm-agent-update-model]] (the downloads dir is the update artifact source) and the WAL-fix
+context (synchronous_commit=off + pool→30 applied to the OLD VM).
+
+**Migration architecture (ratified 2026-06-10, via a 2-round Gemini+Grok panel).** The VM
+`172.16.3.30` is a kitchen-sink host (GuruRMM + GuruConnect + coord API :8001 + Gitea runner +
+Grafana/Prometheus + MariaDB; PG 14, 5.4 GB gururmm DB). Decision: physical box **becomes
+`172.16.3.30`** and runs **everything EXCEPT the Gitea runner** (which becomes a Docker container
+on Jupiter `.20`); VM retired. (MariaDB MIGRATES — Gate-A found it backs the coord API's `claudetools`
+DB at localhost:3306, NOT droppable.) Keeping `.30` + coord on physical means NO fleet-wide
+re-point (the `http://172.16.3.30:8001` refs + Cloudflare→pfSense→.30 path are unchanged). PG via
+`pg_dumpall --globals-only` + `pg_dump -Fc`/`pg_restore -j` (14→16, schema as-is — storage tiering
+is a SEPARATE later task). Full runbook (Gate-A pre-flight, cutover from CONSOLE, ARP flush,
+credential-decrypt gate, PONR=first-agent-reconnect, rollback):
+`projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md`. NOT yet executed; needs a window + the Gate-A unknowns closed.
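
Reviewer sketch (not part of the committed file): the nightly aging step described under "Cold-storage isolation" in `gururmm-physical-server-storage.md` could be planned roughly as below. It only builds the list of statements; it does not connect to Postgres. The `agent_logs_YYYYMM` naming and the `gururmm_cold` tablespace come from the note. The function name `plan_partition_moves` and the `hot_months=3` / `retention_months=24` thresholds are assumptions made for illustration, since the note does not state a hot window or a retention horizon.

```python
import re
from datetime import date

# Partition naming from the note: agent_logs_YYYYMM (monthly partitions).
PARTITION_RE = re.compile(r"^agent_logs_(\d{4})(\d{2})$")


def plan_partition_moves(partitions, today, hot_months=3, retention_months=24):
    """Split monthly partitions into cold-move statements and archive candidates.

    hot_months and retention_months are ASSUMED values, not taken from the note.
    Returns (cold_sql, archive_names):
      cold_sql      - ALTER TABLE statements for partitions past the hot window
      archive_names - partitions past the retention horizon (dump, then DROP)
    """
    cold_sql, archive_names = [], []
    for name in sorted(partitions):
        match = PARTITION_RE.match(name)
        if not match:
            continue  # parent table or unrelated relation: leave alone
        year, month = int(match.group(1)), int(match.group(2))
        if not 1 <= month <= 12:
            continue  # malformed suffix
        age_months = (today.year - year) * 12 + (today.month - month)
        if age_months >= retention_months:
            archive_names.append(name)
        elif age_months >= hot_months:
            cold_sql.append(f"ALTER TABLE {name} SET TABLESPACE gururmm_cold;")
    return cold_sql, archive_names


if __name__ == "__main__":
    names = ["agent_logs", "agent_logs_202405", "agent_logs_202601",
             "agent_logs_202603", "agent_logs_202606"]
    cold, archive = plan_partition_moves(names, date(2026, 6, 10))
    for statement in cold:
        print(statement)
    print("archive candidates:", archive)
```

Two things a real job would likely need beyond this sketch: `ALTER TABLE ... SET TABLESPACE` moves the table's data files but not its indexes (those need `ALTER INDEX ... SET TABLESPACE`), and the move holds an exclusive lock on the partition while the data is copied, so it should only ever target partitions that are no longer on the hot write path.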
--- a/projects/msp-tools/guru-rmm
+++ b/projects/msp-tools/guru-rmm