sync: auto-sync from GURU-5070 at 2026-06-13 06:16:25
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-13 06:16:25
@@ -26,6 +26,7 @@
- [GuruRMM physical server storage](gururmm-physical-server-storage.md) — New box 172.16.1.231 (temp IP→will be .30), Ubuntu 26.04, ssh key `gururmm-physical`/alias `gururmm-new`. SSD (915G root) = HOT (PG default tablespace + WAL + builds); HDD ext4 at `/data` = COLD (`gururmm_cold` PG tablespace for aged `agent_logs` partitions + downloads + backups + archive). The #3 retention answer.
- [Trebesch DESKTOP-QNP3ON5 shell replacement](reference_trebesch_qnp3on5.md) — AT Trebesch box runs an Explorer shell replacement; explorer.exe owner check returns blank — use Win32_ComputerSystem.UserName. GuruRMM SWIFT-LION-2892.
- [reference_backblaze_storage_rate](reference_backblaze_storage_rate.md) -- ACG's Backblaze B2 storage cost rate ($0.00695/GB) for the GuruRMM mspbackups storage-cost calculation
- [Unraid VM no-IP causes](unraid-windows-vm-virtio-no-ip.md) — PRIMARY (general "new VMs stopped getting IPs lately"): Docker sets bridge-nf-call-iptables=1, so br0 VM DHCP OFFERs hit DOCKER-FORWARD (no br0 ACCEPT) and get dropped; new VMs can't complete DORA (existing renew via ESTABLISHED). Fix `=0` runtime (needs persistent post-Docker hook; not yet persisted on Jupiter). SECONDARY (Windows VM): virtio-net has no in-box driver -> use e1000 or virtio-win. Diagnose: tcpdump DHCP on pfSense; /sys vnetN rx_packets.
- [reference_sqlx_migrations_immutable](reference_sqlx_migrations_immutable.md) -- NEVER edit an already-applied sqlx migration file — even a comment. sqlx::migrate! checksums each file at compile time and validates against _sqlx_migrations at startup; a changed checksum crash-loops the server with "migration N was previously applied but has been modified". Code review MUST flag any edit to an applied migration.
## Users
@@ -137,3 +138,4 @@
- [GuruRMM log analysis -> Claude Haiku](gururmm-log-analysis-claude-cutover.md) — cut over from Ollama-on-Beast (timed out on fleet-sized prompts; "unreachable" was a mislabeled 120s timeout) to Anthropic API Haiku 4.5 w/ structured outputs; key at vault `projects/gururmm/anthropic-api`; ZDR pending; deploy needs root on .30 (.env + restart)
- [IX WHM API access = 'ClaudeTools' token, not password](ix-whm-dns-api-access.md) — IX cPanel/WHM (ix.azcomputerguru.com:2087) DNS + all API work uses the FULL-ACCESS-root WHM API token at vault `infrastructure/ix-server` `credentials.whm-api-token` via header `Authorization: whm root:<token>` (force curl -4). Password basic-auth on legacy json-api now 403s. Public NS ns1/ns2.acghosting.com = 52.52.94.202.
- [Vault EVERY credential surfaced in-session](feedback-vault-every-credential.md) — any cred (pasted/created/discovered) -> store via the vault skill + document purpose & exact usage immediately; it's a standing job rule (reinforced in CORE CLAUDE.md). Lost IX creds wasted ~1h on 2026-06-12.
- [GuruRMM install-report v1: reuse endpoint + failed-install agent](gururmm-install-report-failed-agent-v1.md) — legacy NSIS installer reuses /api/install-report (machine info + logs, success+fail); server upserts a visible "failed-install" device on failure reports (Mike: in v1); verify-connect-before-success; trend/near-fail analytics. Server side is a separate sequential SPEC after the legacy-agent branch lands.
36
.claude/memory/gururmm-install-report-failed-agent-v1.md
Normal file
@@ -0,0 +1,36 @@
---
name: gururmm-install-report-failed-agent-v1
description: GuruRMM legacy-installer v1 must reuse /api/install-report AND create a visible "failed-install agent" server-side (Mike, 2026-06-12)
metadata:
  type: project
---

For the SPEC-029 legacy-fleet build, Mike decided (2026-06-12) the observable-installer
requirement is satisfied by the EXISTING `install-report` channel, extended:

- **Reuse `/api/install-report`** (do NOT invent a new beacon). The MSI already POSTs rich
  machine info + event/agent logs + service status there, success AND fail (`InstallReportCA` +
  `installer/install-report.ps1` → `server/src/api/install_report.rs`, recorded to `install_reports`).
  The **new NSIS 32-bit/legacy installer must POST the same payload**, which finally covers the
  legacy tier (today it has no installer → zero install-reports = the biggest blind spot).
- **Failed-install agent IN v1 (Mike's call):** on a report indicating failure (service not Running
  after poll / no enrollment / connect-verification failed), the server **upserts a visible
  "failed-install" device record**, keyed by hostname + machine fingerprint (so retries update one
  record, no spam), carrying machine info + failed-step/reason + log refs + attempt count. Shows in
  the dashboard as FAILED-INSTALL (distinct from healthy agents), triage-able + alertable. **Reconcile**
  if the box later enrolls for real (don't leave a ghost). Success reports don't create a failed agent
  but still feed trend/near-fail analytics (failure-rate by OS/arch/version: a build-shaping signal that
  mirrors SPEC-022 §5e patch telemetry).
- Installer must **verify enroll/connect before declaring success** ("don't terminate until success")
  and emit a meaningful exit code + a local diagnostic bundle on fail.
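
The upsert-and-reconcile behavior described in the list above can be sketched as follows. This is only an illustration in Python, not the server code (the real handler lives in `server/src/api/install_report.rs`); the class and field names (`FailedInstallStore`, `FailedInstall`, `record_report`, `reconcile_enrolled`) are hypothetical, and lower-casing the hostname in the key is an assumption, not something the notes specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (hostname, machine fingerprint)


@dataclass
class FailedInstall:
    """One visible FAILED-INSTALL device record (hypothetical shape)."""
    hostname: str
    fingerprint: str
    failed_step: str
    reason: str
    log_refs: List[str] = field(default_factory=list)
    attempts: int = 0


class FailedInstallStore:
    """In-memory stand-in for the server-side table."""

    def __init__(self) -> None:
        self._records: Dict[Key, FailedInstall] = {}

    @staticmethod
    def _key(hostname: str, fingerprint: str) -> Key:
        # Assumption: hostnames compare case-insensitively.
        return (hostname.lower(), fingerprint)

    def record_report(self, hostname: str, fingerprint: str, success: bool,
                      failed_step: str = "", reason: str = "",
                      log_ref: Optional[str] = None) -> Optional[FailedInstall]:
        """Upsert on failure; success reports never create a record."""
        if success:
            return None
        key = self._key(hostname, fingerprint)
        rec = self._records.get(key)
        if rec is None:
            rec = FailedInstall(hostname, fingerprint, failed_step, reason)
            self._records[key] = rec
        # A retry updates the same record instead of adding a new one.
        rec.failed_step, rec.reason = failed_step, reason
        rec.attempts += 1
        if log_ref is not None:
            rec.log_refs.append(log_ref)
        return rec

    def reconcile_enrolled(self, hostname: str, fingerprint: str) -> bool:
        """Drop the failed-install record once the box enrolls for real."""
        return self._records.pop(self._key(hostname, fingerprint), None) is not None

    def visible(self) -> List[FailedInstall]:
        return list(self._records.values())
```

Keying on the (hostname, fingerprint) pair is what keeps repeated failures from spamming the dashboard: each retry bumps `attempts` on the one existing record. Trend analytics would read every report, success or fail, from a separate path that this sketch leaves out.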

Scope split: the running legacy-agent Coding Agent does the agent + NSIS installer (+ the install-report
POST). The **server-side failed-install-agent + trend analytics is a separate, sequential** work item
(can't run a 2nd agent in the same submodule checkout concurrently) → its own SPEC after the first
branch lands. See [[gururmm-log-analysis-claude-cutover]] for the server deploy shape.

**Note (Mike, 2026-06-12):** the legacy build must eventually be folded into the MAIN
production builds for **agent parity** (not a separate side-build). `build-windows.sh` already
emits legacy-x86/amd64 in WAVE 2, but the legacy INSTALLER + the SPEC-029 §12 fixes need to
become first-class in the promoted pipeline. For now, scoped TEST artifacts off the
`fix/legacy-32bit-agent` branch are fine (Mike OK'd); productionize after the Win7 VM proof.
39
.claude/memory/unraid-windows-vm-virtio-no-ip.md
Normal file
@@ -0,0 +1,39 @@
---
name: unraid-windows-vm-virtio-no-ip
description: Unraid VMs fail to get a DHCP IP - PRIMARY cause is Docker setting bridge-nf-call-iptables=1 (drops new-VM DHCP OFFERs on br0); secondary is virtio-net having no in-box Windows driver
metadata:
  type: reference
---

Two distinct causes make Unraid/KVM VMs come up with **no DHCP IP**. Confirmed 2026-06-12/13 on
Jupiter (`172.16.3.20`, Unraid 6.12.85; host creds vault `infrastructure/jupiter-unraid-primary`).

## PRIMARY (the "VMs generally stopped getting IPs lately" cause): bridge-nf-call-iptables
Docker sets `net.bridge.bridge-nf-call-iptables=1`, which routes **bridged** VM traffic on `br0`
through the iptables FORWARD chain. Docker's `DOCKER-FORWARD` chain only ACCEPTs the docker
bridges (`br-*`, `docker0`) and has **no ACCEPT for `br0`** (the VM bridge), so it drops new
unmatched inbound flows. Effect:
- The VM's DHCP DISCOVER (broadcast) egresses fine and pfSense/Kea sends an OFFER...
- ...but the inbound **OFFER (new unicast flow to an unassigned IP) is dropped** before reaching
  the VM tap. The VM never completes DORA -> APIPA 169.254.x. Symptom in tcpdump on the DHCP
  server: VM re-DISCOVERs with 3s/8s/15s backoff, server keeps OFFERing fresh IPs, never an ACK.
- **Existing** VMs survive because lease RENEWALS are ESTABLISHED flows (pass); only NEW/rebooted
  VMs (fresh DISCOVER) break. This explains both "lately" (a Docker/Unraid update) and "all new VMs".
- **Fix (runtime, reversible):** `echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables` (and
  `bridge-nf-call-ip6tables`). Bridged frames then bypass iptables entirely. **Caveat: Docker
  re-sets it to 1 on daemon restart** -> needs a PERSISTENT post-Docker hook (User Scripts "At
  Array Start", or a delayed setter in `/boot/config/go`) to truly fix it fleet-wide. NOT yet
  made persistent on Jupiter as of 2026-06-13 (pending Mike's OK for the prod boot config).
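
The runtime fix above is simple enough to express as an idempotent setter that a post-Docker hook could call. The sketch below is only an illustration: the function name `disable_bridge_nf` and the `base` parameter are hypothetical, and it assumes Python 3 is available on the host, which stock Unraid may not provide (the shell equivalent is just the two `echo 0 >` writes). It needs root to write under `/proc/sys`.

```python
from pathlib import Path
from typing import Dict

BRIDGE_NF_DIR = Path("/proc/sys/net/bridge")
KEYS = ("bridge-nf-call-iptables", "bridge-nf-call-ip6tables")


def disable_bridge_nf(base: Path = BRIDGE_NF_DIR) -> Dict[str, bool]:
    """Write 0 to each bridge-nf key that is not already 0.

    Returns {key: changed}. `base` is overridable so the logic can be
    exercised against a scratch directory instead of the live /proc tree.
    """
    changed: Dict[str, bool] = {}
    for key in KEYS:
        path = base / key
        if not path.exists():
            # The sysctl only exists while br_netfilter is loaded.
            changed[key] = False
            continue
        if path.read_text().strip() != "0":
            path.write_text("0\n")
            changed[key] = True
        else:
            changed[key] = False
    return changed
```

Because Docker re-sets the value on daemon restart, a hook would have to run this after Docker is up (and again after any later restart). Returning which keys actually changed gives the hook something to log, so it is visible when Docker has flipped the setting back.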

## SECONDARY (per-VM, Windows-specific): virtio-net has no in-box Windows driver
A Windows VM whose NIC model is the Unraid default `virtio-net` has a **dead NIC** (Windows has
no in-box virtio driver; the guest sends 0 packets). Linux VMs are fine (in-kernel virtio).
The "Windows 11" VM worked because it was set to **e1000**. Fix: NIC model `e1000` (in-box Win7/
Server2003 driver; set via `virsh edit` or the Unraid template dropdown) OR install virtio-win NetKVM
(ISO on Jupiter: `/mnt/user/isos/virtio-win-0.1.271-1.iso`). Diagnose without tcpdump: sample
`/sys/class/net/<vnetN>/statistics/rx_packets` twice -> flat = dead NIC (driver), climbing = NIC
works (then look at the bridge-nf cause above).
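
The two-sample `rx_packets` check can be sketched like this. The helper names (`read_rx_packets`, `classify_nic`) and the 5-second default interval are assumptions made for illustration; only the sysfs path comes from the notes.

```python
import time
from pathlib import Path
from typing import Callable


def read_rx_packets(iface: str, sys_net: Path = Path("/sys/class/net")) -> int:
    """Read the rx_packets counter for a host-side tap such as vnet0."""
    return int((sys_net / iface / "statistics" / "rx_packets").read_text().strip())


def classify_nic(read_counter: Callable[[], int],
                 interval_s: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Sample the counter twice and report 'flat' or 'climbing'.

    flat     -> the guest is sending nothing (suspect the NIC driver)
    climbing -> the NIC transmits (look at the bridge-nf cause instead)
    """
    first = read_counter()
    sleep(interval_s)
    second = read_counter()
    return "climbing" if second > first else "flat"
```

On the host this would be called as `classify_nic(lambda: read_rx_packets("vnet0"))`, with `vnet0` replaced by the VM's actual tap name. A guest that is idle rather than broken could also read as flat over a short interval, so a longer interval, or triggering a DHCP renew in the guest while sampling, makes the result more trustworthy.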

Diagnosis order: confirm NIC model first (e1000 vs virtio), then if the NIC transmits but the VM
still gets no IP, suspect bridge-nf-call-iptables. Related: [[gururmm-install-report-failed-agent-v1]]
(WIN7TEST is the SPEC-029 legacy-32bit-agent test VM, static IP 172.16.2.55, NIC now e1000).