rmm: host-migration runbook + ratified architecture (memory + pointer)
Bump guru-rmm pointer (host-migration runbook). Record the migration architecture decision in memory: physical box becomes .30 (all-but-Gitea-runner), VM retired, MariaDB migrates (backs the coord claudetools DB per Gate-A).
@@ -23,6 +23,7 @@
- [Gitea git-op latency](reference_gitea_git_op_latency.md) — SSH (.20:2222) is SLOWEST (~1.5s); internal HTTP+token ~0.55s; SOPS lookup only ~0.33s. Don't switch to SSH for speed. Gitea SSH is .20:2222 (API ssh_url .21 is wrong).
- [GuruRMM technical reference](reference_gururmm.md) — Server (172.16.3.30) layout + downloads dir `/var/www/gururmm/downloads` + `.channel` sidecar rollout control (stable/beta) + privileged server access via the server's OWN root RMM agent (hostname `gururmm`, no SSH needed; plink fallback) + API + `context=user_session` (WTS impersonation) + build-pipeline vendoring at `deploy/build-pipeline/` + Linux agent systemd sandbox trap.
- [RMM agent update model](rmm-agent-update-model.md) — Agent updates are server-PUSH on heartbeat (no self-poll); available versions = filesystem scan needing a `.sha256`; promote flips `.channel` sidecars beta→stable globally. Two stranders: beta-first freezes stable until an explicit promote; agents older than ~0.6.50 re-enroll with a NEW device_id/agent row when updated.
- [GuruRMM physical server storage](gururmm-physical-server-storage.md) — New box 172.16.1.231 (temp IP→will be .30), Ubuntu 26.04, ssh key `gururmm-physical`/alias `gururmm-new`. SSD (915G root) = HOT (PG default tablespace + WAL + builds); HDD ext4 at `/data` = COLD (`gururmm_cold` PG tablespace for aged `agent_logs` partitions + downloads + backups + archive). The #3 retention answer.
- [Trebesch DESKTOP-QNP3ON5 shell replacement](reference_trebesch_qnp3on5.md) — AT Trebesch box runs an Explorer shell replacement; explorer.exe owner check returns blank — use Win32_ComputerSystem.UserName. GuruRMM SWIFT-LION-2892.

## Users
50
.claude/memory/gururmm-physical-server-storage.md
Normal file
@@ -0,0 +1,50 @@
---
name: gururmm-physical-server-storage
description: New physical GuruRMM server (172.16.1.231) storage layout + hot/cold tiering plan for the migration off 172.16.3.30
metadata:
  type: project
---

The GuruRMM server/build-pipeline is being migrated from the VM (172.16.3.30, slow
rotational-backed disk — the cause of the WAL-fsync pool timeouts) to a **physical box**.

New box (as of 2026-06-10): **172.16.1.231** (TEMPORARY IP — will become 172.16.3.30 at
cutover), hostname `gururmm`, **Ubuntu 26.04 LTS**. SSH: dedicated ed25519 key
`~/.ssh/gururmm-physical` (alias `gururmm-new`), vault `infrastructure/gururmm-server-physical`
(also holds the initial `guru` password). sudo needs that password (`sudo -S`), not passwordless.

**Drives (storage optimized 2026-06-10):**
- **SSD `sda`** (Samsung 860, 929 GB) = HOT tier. Installer had left root at only 100 GB;
extended the LV into the full VG → **root is now ~915 GB**. Holds: OS, Postgres DEFAULT
tablespace (live/recent data) + WAL, cargo build targets, `/opt/gururmm`. Fast fsync here is
the real fix for the pool-timeout root cause (could even revert to `synchronous_commit=on`).
- **HDD `sdb`** (WD 1 TB, spinning) = COLD tier. Old NTFS "Data2" (504 GB, user confirmed
already backed up) wiped → **ext4, mounted at `/data`** (fstab by UUID, `noatime`). Dirs:
`/data/gururmm/{pgcold, downloads, backups, archive}`.

**Cold-storage isolation (built at migration — needs PG running):**
- `CREATE TABLESPACE gururmm_cold LOCATION '/data/gururmm/pgcold'` (chown the dir
postgres:postgres first).
- Time-partition `agent_logs` (by month). Recent partitions on SSD default tablespace (hot
write path: the batched multi-row INSERT + heartbeats). Nightly job `ALTER TABLE
agent_logs_YYYYMM SET TABLESPACE gururmm_cold` ages old partitions onto the HDD (still
queryable for signatures/build-correlation). Past retention horizon: pg_dump partition to
`/data/gururmm/archive` (compressed) then DROP.
- `downloads` (build artifacts, served by nginx + written by pipeline) and `backups`
(nightly pg_dump) also live on `/data`.
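
The nightly aging step described above could be planned roughly as follows. This is a minimal sketch, not the actual job: the hot window (`HOT_MONTHS`), the retention horizon (`RETENTION_MONTHS`), the database name `gururmm`, and the `.dump` file naming are all assumptions made for illustration. Only the partition naming pattern, the tablespace name, and the archive directory come from the note.

```python
from datetime import date

HOT_MONTHS = 2          # assumption: months that stay on the SSD default tablespace
RETENTION_MONTHS = 12   # assumption: age at which a partition is archived and dropped
COLD_TABLESPACE = "gururmm_cold"
ARCHIVE_DIR = "/data/gururmm/archive"
DB_NAME = "gururmm"     # assumption: database name


def months_old(partition: str, today: date) -> int:
    """Age in whole calendar months of a partition named agent_logs_YYYYMM."""
    stamp = partition.rsplit("_", 1)[1]
    year, month = int(stamp[:4]), int(stamp[4:])
    return (today.year - year) * 12 + (today.month - month)


def plan(partitions: list[str], today: date) -> list[str]:
    """Return the shell/SQL actions for one nightly run, oldest partition first."""
    actions = []
    for name in sorted(partitions):
        age = months_old(name, today)
        if age >= RETENTION_MONTHS:
            # Past the retention horizon: compressed custom-format dump, then DROP.
            actions.append(
                f"pg_dump -Fc -t {name} -f {ARCHIVE_DIR}/{name}.dump {DB_NAME}"
            )
            actions.append(f"DROP TABLE {name};")
        elif age >= HOT_MONTHS:
            # Aged but still queryable: move onto the HDD tablespace.
            actions.append(f"ALTER TABLE {name} SET TABLESPACE {COLD_TABLESPACE};")
        # Younger partitions stay on the SSD (hot write path).
    return actions


if __name__ == "__main__":
    sample = ["agent_logs_202606", "agent_logs_202603", "agent_logs_202505"]
    for action in plan(sample, date(2026, 6, 10)):
        print(action)
```

A real job would probably also check `pg_tables.tablespace` so that partitions already on `gururmm_cold` are skipped, and would verify the dump before issuing the `DROP`.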

This is the concrete answer to the deferred "#3 log retention/archival" discussion. See
[[rmm-agent-update-model]] (the downloads dir is the update artifact source) and the WAL-fix
context (synchronous_commit=off + pool→30 applied to the OLD VM).

**Migration architecture (ratified 2026-06-10, via a 2-round Gemini+Grok panel).** The VM
`172.16.3.30` is a kitchen-sink host (GuruRMM + GuruConnect + coord API :8001 + Gitea runner +
Grafana/Prometheus + MariaDB; PG 14, 5.4 GB gururmm DB). Decision: physical box **becomes
`172.16.3.30`** and runs **everything EXCEPT the Gitea runner** (which becomes a Docker container
on Jupiter `.20`); VM retired. (MariaDB MIGRATES — Gate-A found it backs the coord API's `claudetools`
DB at localhost:3306, NOT droppable.) Keeping `.30` + coord on physical means NO fleet-wide
re-point (the `http://172.16.3.30:8001` refs + Cloudflare→pfSense→.30 path are unchanged). PG via
`pg_dumpall --globals-only` + `pg_dump -Fc`/`pg_restore -j` (14→16, schema as-is — storage tiering
is a SEPARATE later task). Full runbook (Gate-A pre-flight, cutover from CONSOLE, ARP flush,
credential-decrypt gate, PONR=first-agent-reconnect, rollback):
`projects/msp-tools/guru-rmm/docs/HOST_MIGRATION_RUNBOOK.md`. NOT yet executed — needs a window + the Gate-A unknowns closed.
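
The PG 14→16 move named above could be laid out as an ordered command list like the one below. This is only a sketch under stated assumptions and does not reproduce the runbook: the file names `globals.sql` and `gururmm.dump`, the database name `gururmm`, the `psql`/`createdb` steps, and the job count are hypothetical. The note itself only names `pg_dumpall --globals-only`, `pg_dump -Fc`, and `pg_restore -j`.

```python
import shlex

DB_NAME = "gururmm"           # assumption: database name
GLOBALS_FILE = "globals.sql"  # hypothetical file name
DUMP_FILE = "gururmm.dump"    # hypothetical file name
RESTORE_JOBS = 4              # assumption: parallel restore workers


def migration_commands() -> list[list[str]]:
    """Ordered argv lists: the first two run on the old VM, the rest on the new box."""
    return [
        # Roles and tablespace definitions only; no table data.
        ["pg_dumpall", "--globals-only", "-f", GLOBALS_FILE],
        # Custom-format dump, which is what allows a parallel restore.
        ["pg_dump", "-Fc", "-f", DUMP_FILE, DB_NAME],
        # Replay the globals, create the empty database, then restore into it.
        ["psql", "-f", GLOBALS_FILE, "postgres"],
        ["createdb", DB_NAME],
        ["pg_restore", "-j", str(RESTORE_JOBS), "-d", DB_NAME, DUMP_FILE],
    ]


if __name__ == "__main__":
    for argv in migration_commands():
        print(shlex.join(argv))
```

Building argv lists instead of shell strings keeps quoting out of the picture if the steps are later run with `subprocess.run`. The authoritative order, gates, and rollback remain whatever the runbook specifies.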
Submodule projects/msp-tools/guru-rmm updated: f68bbbe8c0...12a644548f