sync: auto-sync from GURU-5070 at 2026-06-13 06:16:25

Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-13 06:16:25
2026-06-13 06:16:42 -07:00
parent 53e43deea7
commit be8604b4fb
3 changed files with 77 additions and 0 deletions
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -26,6 +26,7 @@
 - [GuruRMM physical server storage](gururmm-physical-server-storage.md) — New box 172.16.1.231 (temp IP→will be .30), Ubuntu 26.04, ssh key `gururmm-physical`/alias `gururmm-new`. SSD (915G root) = HOT (PG default tablespace + WAL + builds); HDD ext4 at `/data` = COLD (`gururmm_cold` PG tablespace for aged `agent_logs` partitions + downloads + backups + archive). The #3 retention answer.
 - [Trebesch DESKTOP-QNP3ON5 shell replacement](reference_trebesch_qnp3on5.md) — AT Trebesch box runs an Explorer shell replacement; explorer.exe owner check returns blank — use Win32_ComputerSystem.UserName. GuruRMM SWIFT-LION-2892.
 - [reference_backblaze_storage_rate](reference_backblaze_storage_rate.md) -- ACG's Backblaze B2 storage cost rate ($0.00695/GB) for the GuruRMM mspbackups storage-cost calculation
+- [Unraid VM no-IP causes](unraid-windows-vm-virtio-no-ip.md) — PRIMARY (general "new VMs stopped getting IPs lately"): Docker sets bridge-nf-call-iptables=1, so br0 VM DHCP OFFERs hit DOCKER-FORWARD (no br0 ACCEPT) and get dropped; new VMs can't complete DORA (existing renew via ESTABLISHED). Fix `=0` runtime (needs persistent post-Docker hook; not yet persisted on Jupiter). SECONDARY (Windows VM): virtio-net has no in-box driver -> use e1000 or virtio-win. Diagnose: tcpdump DHCP on pfSense; /sys vnetN rx_packets.
 - [reference_sqlx_migrations_immutable](reference_sqlx_migrations_immutable.md) -- NEVER edit an already-applied sqlx migration file — even a comment. sqlx::migrate! checksums each file at compile time and validates against _sqlx_migrations at startup; a changed checksum crash-loops the server with "migration N was previously applied but has been modified". Code review MUST flag any edit to an applied migration.

 ## Users
@@ -137,3 +138,4 @@
 - [GuruRMM log analysis -> Claude Haiku](gururmm-log-analysis-claude-cutover.md) — cut over from Ollama-on-Beast (timed out on fleet-sized prompts; "unreachable" was a mislabeled 120s timeout) to Anthropic API Haiku 4.5 w/ structured outputs; key at vault `projects/gururmm/anthropic-api`; ZDR pending; deploy needs root on .30 (.env + restart)
 - [IX WHM API access = 'ClaudeTools' token, not password](ix-whm-dns-api-access.md) — IX cPanel/WHM (ix.azcomputerguru.com:2087) DNS + all API work uses the FULL-ACCESS-root WHM API token at vault `infrastructure/ix-server` `credentials.whm-api-token` via header `Authorization: whm root:<token>` (force curl -4). Password basic-auth on legacy json-api now 403s. Public NS ns1/ns2.acghosting.com = 52.52.94.202.
 - [Vault EVERY credential surfaced in-session](feedback-vault-every-credential.md) — any cred (pasted/created/discovered) -> store via the vault skill + document purpose & exact usage immediately; it's a standing job rule (reinforced in CORE CLAUDE.md). Lost IX creds wasted ~1h on 2026-06-12.
+- [GuruRMM install-report v1: reuse endpoint + failed-install agent](gururmm-install-report-failed-agent-v1.md) — legacy NSIS installer reuses /api/install-report (machine info + logs, success+fail); server upserts a visible "failed-install" device on failure reports (Mike: in v1); verify-connect-before-success; trend/near-fail analytics. Server side is a separate sequential SPEC after the legacy-agent branch lands.
--- a/.claude/memory/gururmm-install-report-failed-agent-v1.md
+++ b/.claude/memory/gururmm-install-report-failed-agent-v1.md
@@ -0,0 +1,36 @@
+---
+name: gururmm-install-report-failed-agent-v1
+description: GuruRMM legacy-installer v1 must reuse /api/install-report AND create a visible "failed-install agent" server-side (Mike, 2026-06-12)
+metadata:
+  type: project
+---
+
+For the SPEC-029 legacy-fleet build, Mike decided (2026-06-12) the observable-installer
+requirement is satisfied by the EXISTING `install-report` channel, extended:
+
+- **Reuse `/api/install-report`** (do NOT invent a new beacon). The MSI already POSTs rich
+  machine info + event/agent logs + service status there, success AND fail (`InstallReportCA` +
+  `installer/install-report.ps1` → `server/src/api/install_report.rs`, recorded to `install_reports`).
+  The **new NSIS 32-bit/legacy installer must POST the same payload** — this finally covers the
+  legacy tier (today it has no installer → zero install-reports = the biggest blind spot).
+- **Failed-install agent IN v1 (Mike's call):** on a report indicating failure (service not Running
+  after poll / no enrollment / connect-verification failed), the server **upserts a visible
+  "failed-install" device record** — keyed by hostname + machine fingerprint (so retries update one
+  record, no spam), carrying machine info + failed-step/reason + log refs + attempt count. Shows in
+  the dashboard as FAILED-INSTALL (distinct from healthy agents), triage-able + alertable. **Reconcile**
+  if the box later enrolls for real (don't leave a ghost). Success reports don't create a failed agent
+  but still feed trend/near-fail analytics (failure-rate by OS/arch/version — build-shaping signal,
+  mirrors SPEC-022 §5e patch telemetry).
+- Installer must **verify enroll/connect before declaring success** ("don't terminate until success")
+  and emit a meaningful exit + a local diagnostic bundle on fail.
+
+Scope split: the running legacy-agent Coding Agent does the agent + NSIS installer (+ the install-report
+POST). The **server-side failed-install-agent + trend analytics is a separate, sequential** work item
+(can't run a 2nd agent in the same submodule checkout concurrently) → its own SPEC after the first
+branch lands. See [[gururmm-log-analysis-claude-cutover]] for the server deploy shape.
+
+**Note (Mike, 2026-06-12):** the legacy build must eventually be folded into the MAIN
+production builds for **agent parity** (not a separate side-build). build-windows.sh already
+emits legacy-x86/amd64 in WAVE 2, but the legacy INSTALLER + the SPEC-029 §12 fixes need to
+become first-class in the promoted pipeline. For now, scoped TEST artifacts off the
+`fix/legacy-32bit-agent` branch are fine (Mike OK'd) — productionize after the Win7 VM proof.
--- a/.claude/memory/unraid-windows-vm-virtio-no-ip.md
+++ b/.claude/memory/unraid-windows-vm-virtio-no-ip.md
@@ -0,0 +1,39 @@
+---
+name: unraid-windows-vm-virtio-no-ip
+description: Unraid VMs fail to get a DHCP IP - PRIMARY cause is Docker setting bridge-nf-call-iptables=1 (drops new-VM DHCP OFFERs on br0); secondary is virtio-net having no in-box Windows driver
+metadata:
+  type: reference
+---
+
+Two distinct causes make Unraid/KVM VMs come up with **no DHCP IP**. Confirmed 2026-06-12/13 on
+Jupiter (`172.16.3.20`, Unraid 6.12.85; host creds vault `infrastructure/jupiter-unraid-primary`).
+
+## PRIMARY (the "VMs generally stopped getting IPs lately" cause): bridge-nf-call-iptables
+Docker sets `net.bridge.bridge-nf-call-iptables=1`, which routes **bridged** VM traffic on `br0`
+through the iptables FORWARD chain. Docker's `DOCKER-FORWARD` chain only ACCEPTs the docker
+bridges (`br-*`, `docker0`) and has **no ACCEPT for `br0`** (the VM bridge), so it drops new
+unmatched inbound flows. Effect:
+- The VM's DHCP DISCOVER (broadcast) egresses fine and pfSense/Kea sends an OFFER...
+- ...but the inbound **OFFER (new unicast flow to an unassigned IP) is dropped** before reaching
+  the VM tap. The VM never completes DORA -> APIPA 169.254.x. Symptom in tcpdump on the DHCP
+  server: VM re-DISCOVERs with 3s/8s/15s backoff, server keeps OFFERing fresh IPs, never an ACK.
+- **Existing** VMs survive because lease RENEWALS are ESTABLISHED flows (pass); only NEW/rebooted
+  VMs (fresh DISCOVER) break. = "lately" (a Docker/Unraid update) + "all new VMs".
+- **Fix (runtime, reversible):** `echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables` (and
+  `bridge-nf-call-ip6tables`). Bridged frames then bypass iptables entirely. **Caveat: Docker
+  re-sets it to 1 on daemon restart** -> needs a PERSISTENT post-Docker hook (User Scripts "At
+  Array Start", or a delayed setter in `/boot/config/go`) to truly fix it fleet-wide. NOT yet
+  made persistent on Jupiter as of 2026-06-13 (pending Mike's OK for the prod boot config).
+
+## SECONDARY (per-VM, Windows-specific): virtio-net has no in-box Windows driver
+A Windows VM whose NIC model is the Unraid default `virtio-net` has a **dead NIC** (Windows has
+no in-box virtio driver; the guest sends 0 packets). Linux VMs are fine (in-kernel virtio).
+The "Windows 11" VM worked because it was set to **e1000**. Fix: NIC model `e1000` (in-box Win7/
+Server2003 driver, `virsh edit`/Unraid template dropdown) OR install virtio-win NetKVM (ISOs on
+Jupiter `/mnt/user/isos/virtio-win-0.1.271-1.iso`). Diagnose without tcpdump: sample
+`/sys/class/net/<vnetN>/statistics/rx_packets` twice -> flat = dead NIC (driver), climbing = NIC
+works (then look at the bridge-nf cause above).
+
+Diagnosis order: confirm NIC model first (e1000 vs virtio), then if the NIC transmits but no IP,
+suspect bridge-nf-call-iptables. Related: [[gururmm-install-report-failed-agent-v1]]
+(WIN7TEST is the SPEC-029 legacy-32bit-agent test VM, static IP 172.16.2.55, NIC now e1000).