From be8604b4fbbb1b97caf972877ea7c4a62ba8f06b Mon Sep 17 00:00:00 2001
From: Mike Swanson
Date: Sat, 13 Jun 2026 06:16:42 -0700
Subject: [PATCH] sync: auto-sync from GURU-5070 at 2026-06-13 06:16:25

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-13 06:16:25
---
 .claude/memory/MEMORY.md                           |  2 +
 .../gururmm-install-report-failed-agent-v1.md      | 36 +++++++++++++++++
 .../memory/unraid-windows-vm-virtio-no-ip.md       | 39 +++++++++++++++++++
 3 files changed, 77 insertions(+)
 create mode 100644 .claude/memory/gururmm-install-report-failed-agent-v1.md
 create mode 100644 .claude/memory/unraid-windows-vm-virtio-no-ip.md

diff --git a/.claude/memory/MEMORY.md b/.claude/memory/MEMORY.md
index 9a2de67..00b34d7 100644
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -26,6 +26,7 @@
 - [GuruRMM physical server storage](gururmm-physical-server-storage.md) — New box 172.16.1.231 (temp IP→will be .30), Ubuntu 26.04, ssh key `gururmm-physical`/alias `gururmm-new`. SSD (915G root) = HOT (PG default tablespace + WAL + builds); HDD ext4 at `/data` = COLD (`gururmm_cold` PG tablespace for aged `agent_logs` partitions + downloads + backups + archive). The #3 retention answer.
 - [Trebesch DESKTOP-QNP3ON5 shell replacement](reference_trebesch_qnp3on5.md) — AT Trebesch box runs an Explorer shell replacement; explorer.exe owner check returns blank — use Win32_ComputerSystem.UserName. GuruRMM SWIFT-LION-2892.
 - [reference_backblaze_storage_rate](reference_backblaze_storage_rate.md) -- ACG's Backblaze B2 storage cost rate ($0.00695/GB) for the GuruRMM mspbackups storage-cost calculation
+- [Unraid VM no-IP causes](unraid-windows-vm-virtio-no-ip.md) — PRIMARY (general "new VMs stopped getting IPs lately"): Docker sets bridge-nf-call-iptables=1, so br0 VM DHCP OFFERs hit DOCKER-FORWARD (no br0 ACCEPT) and get dropped; new VMs can't complete DORA (existing renew via ESTABLISHED). Fix `=0` runtime (needs persistent post-Docker hook; not yet persisted on Jupiter). SECONDARY (Windows VM): virtio-net has no in-box driver -> use e1000 or virtio-win. Diagnose: tcpdump DHCP on pfSense; /sys vnetN rx_packets.
 - [reference_sqlx_migrations_immutable](reference_sqlx_migrations_immutable.md) -- NEVER edit an already-applied sqlx migration file — even a comment. sqlx::migrate! checksums each file at compile time and validates against _sqlx_migrations at startup; a changed checksum crash-loops the server with "migration N was previously applied but has been modified". Code review MUST flag any edit to an applied migration.
 
 ## Users
@@ -137,3 +138,4 @@
 - [GuruRMM log analysis -> Claude Haiku](gururmm-log-analysis-claude-cutover.md) — cut over from Ollama-on-Beast (timed out on fleet-sized prompts; "unreachable" was a mislabeled 120s timeout) to Anthropic API Haiku 4.5 w/ structured outputs; key at vault `projects/gururmm/anthropic-api`; ZDR pending; deploy needs root on .30 (.env + restart)
 - [IX WHM API access = 'ClaudeTools' token, not password](ix-whm-dns-api-access.md) — IX cPanel/WHM (ix.azcomputerguru.com:2087) DNS + all API work uses the FULL-ACCESS-root WHM API token at vault `infrastructure/ix-server` `credentials.whm-api-token` via header `Authorization: whm root:` (force curl -4). Password basic-auth on legacy json-api now 403s. Public NS ns1/ns2.acghosting.com = 52.52.94.202.
 - [Vault EVERY credential surfaced in-session](feedback-vault-every-credential.md) — any cred (pasted/created/discovered) -> store via the vault skill + document purpose & exact usage immediately; it's a standing job rule (reinforced in CORE CLAUDE.md). Lost IX creds wasted ~1h on 2026-06-12.
+- [GuruRMM install-report v1: reuse endpoint + failed-install agent](gururmm-install-report-failed-agent-v1.md) — legacy NSIS installer reuses /api/install-report (machine info + logs, success+fail); server upserts a visible "failed-install" device on failure reports (Mike: in v1); verify-connect-before-success; trend/near-fail analytics. Server side is a separate sequential SPEC after the legacy-agent branch lands.
diff --git a/.claude/memory/gururmm-install-report-failed-agent-v1.md b/.claude/memory/gururmm-install-report-failed-agent-v1.md
new file mode 100644
index 0000000..67ccdb1
--- /dev/null
+++ b/.claude/memory/gururmm-install-report-failed-agent-v1.md
@@ -0,0 +1,36 @@
+---
+name: gururmm-install-report-failed-agent-v1
+description: GuruRMM legacy-installer v1 must reuse /api/install-report AND create a visible "failed-install agent" server-side (Mike, 2026-06-12)
+metadata:
+  type: project
+---
+
+For the SPEC-029 legacy-fleet build, Mike decided (2026-06-12) the observable-installer
+requirement is satisfied by the EXISTING `install-report` channel, extended:
+
+- **Reuse `/api/install-report`** (do NOT invent a new beacon). The MSI already POSTs rich
+  machine info + event/agent logs + service status there, success AND fail (`InstallReportCA` +
+  `installer/install-report.ps1` → `server/src/api/install_report.rs`, recorded to `install_reports`).
+  The **new NSIS 32-bit/legacy installer must POST the same payload** — this finally covers the
+  legacy tier (today it has no installer → zero install-reports = the biggest blind spot).
+- **Failed-install agent IN v1 (Mike's call):** on a report indicating failure (service not Running
+  after poll / no enrollment / connect-verification failed), the server **upserts a visible
+  "failed-install" device record** — keyed by hostname + machine fingerprint (so retries update one
+  record, no spam), carrying machine info + failed-step/reason + log refs + attempt count. Shows in
+  the dashboard as FAILED-INSTALL (distinct from healthy agents), triage-able + alertable. **Reconcile**
+  if the box later enrolls for real (don't leave a ghost). Success reports don't create a failed agent
+  but still feed trend/near-fail analytics (failure-rate by OS/arch/version — build-shaping signal,
+  mirrors SPEC-022 §5e patch telemetry).
+- Installer must **verify enroll/connect before declaring success** ("don't terminate until success")
+  and emit a meaningful exit + a local diagnostic bundle on fail.
+
+Scope split: the running legacy-agent Coding Agent does the agent + NSIS installer (+ the install-report
+POST). The **server-side failed-install-agent + trend analytics is a separate, sequential** work item
+(can't run a 2nd agent in the same submodule checkout concurrently) → its own SPEC after the first
+branch lands. See [[gururmm-log-analysis-claude-cutover]] for the server deploy shape.
+
+**Note (Mike, 2026-06-12):** the legacy build must eventually be folded into the MAIN
+production builds for **agent parity** (not a separate side-build). build-windows.sh already
+emits legacy-x86/amd64 in WAVE 2, but the legacy INSTALLER + the SPEC-029 §12 fixes need to
+become first-class in the promoted pipeline. For now, scoped TEST artifacts off the
+`fix/legacy-32bit-agent` branch are fine (Mike OK'd) — productionize after the Win7 VM proof.
diff --git a/.claude/memory/unraid-windows-vm-virtio-no-ip.md b/.claude/memory/unraid-windows-vm-virtio-no-ip.md
new file mode 100644
index 0000000..b00e7d8
--- /dev/null
+++ b/.claude/memory/unraid-windows-vm-virtio-no-ip.md
@@ -0,0 +1,39 @@
+---
+name: unraid-windows-vm-virtio-no-ip
+description: Unraid VMs fail to get a DHCP IP - PRIMARY cause is Docker setting bridge-nf-call-iptables=1 (drops new-VM DHCP OFFERs on br0); secondary is virtio-net having no in-box Windows driver
+metadata:
+  type: reference
+---
+
+Two distinct causes make Unraid/KVM VMs come up with **no DHCP IP**. Confirmed 2026-06-12/13 on
+Jupiter (`172.16.3.20`, Unraid 6.12.85; host creds vault `infrastructure/jupiter-unraid-primary`).
+
+## PRIMARY (the "VMs generally stopped getting IPs lately" cause): bridge-nf-call-iptables
+Docker sets `net.bridge.bridge-nf-call-iptables=1`, which routes **bridged** VM traffic on `br0`
+through the iptables FORWARD chain. Docker's `DOCKER-FORWARD` chain only ACCEPTs the docker
+bridges (`br-*`, `docker0`) and has **no ACCEPT for `br0`** (the VM bridge), so it drops new
+unmatched inbound flows. Effect:
+- The VM's DHCP DISCOVER (broadcast) egresses fine and pfSense/Kea sends an OFFER...
+- ...but the inbound **OFFER (new unicast flow to an unassigned IP) is dropped** before reaching
+  the VM tap. The VM never completes DORA -> APIPA 169.254.x. Symptom in tcpdump on the DHCP
+  server: VM re-DISCOVERs with 3s/8s/15s backoff, server keeps OFFERing fresh IPs, never an ACK.
+- **Existing** VMs survive because lease RENEWALS are ESTABLISHED flows (pass); only NEW/rebooted
+  VMs (fresh DISCOVER) break. = "lately" (a Docker/Unraid update) + "all new VMs".
+- **Fix (runtime, reversible):** `echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables` (and
+  `bridge-nf-call-ip6tables`). Bridged frames then bypass iptables entirely. **Caveat: Docker
+  re-sets it to 1 on daemon restart** -> needs a PERSISTENT post-Docker hook (User Scripts "At
+  Array Start", or a delayed setter in `/boot/config/go`) to truly fix it fleet-wide. NOT yet
+  made persistent on Jupiter as of 2026-06-13 (pending Mike's OK for the prod boot config).
+
+## SECONDARY (per-VM, Windows-specific): virtio-net has no in-box Windows driver
+A Windows VM whose NIC model is the Unraid default `virtio-net` has a **dead NIC** (Windows has
+no in-box virtio driver; the guest sends 0 packets). Linux VMs are fine (in-kernel virtio).
+The "Windows 11" VM worked because it was set to **e1000**. Fix: NIC model `e1000` (in-box Win7/
+Server2003 driver, `virsh edit`/Unraid template dropdown) OR install virtio-win NetKVM (ISOs on
+Jupiter `/mnt/user/isos/virtio-win-0.1.271-1.iso`). Diagnose without tcpdump: sample
+`/sys/class/net//statistics/rx_packets` twice -> flat = dead NIC (driver), climbing = NIC
+works (then look at the bridge-nf cause above).
+
+Diagnosis order: confirm NIC model first (e1000 vs virtio), then if the NIC transmits but no IP,
+suspect bridge-nf-call-iptables. Related: [[gururmm-install-report-failed-agent-v1]]
+(WIN7TEST is the SPEC-029 legacy-32bit-agent test VM, static IP 172.16.2.55, NIC now e1000).
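
Reviewer sketch (not part of the commit): the diagnosis order in the Unraid note could be scripted roughly as below. This is a minimal illustration under stated assumptions, not something the patch ships. The function names, the `vnet0` tap name, and the 5-second sampling interval are hypothetical. Only the `rx_packets` counter path layout and the `/proc/sys/net/bridge/bridge-nf-call-iptables` path come from the note itself.

```python
import time
from pathlib import Path
from typing import Optional


def read_rx_packets(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the host-side rx_packets counter for one interface (e.g. a VM tap)."""
    counter = Path(sysfs_root) / iface / "statistics" / "rx_packets"
    return int(counter.read_text().strip())


def classify_nic(first: int, second: int) -> str:
    """Flat counter means the guest sent nothing; a climbing one means the NIC works."""
    if second > first:
        return "nic-transmits: suspect bridge-nf-call-iptables next"
    return "dead-nic: check NIC model / guest driver first"


def bridge_nf_enabled(
    path: str = "/proc/sys/net/bridge/bridge-nf-call-iptables",
) -> Optional[bool]:
    """True if bridged frames traverse iptables, False if not, None if unreadable."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        return None


def diagnose(iface: str, interval_s: float = 5.0) -> str:
    """Sample rx_packets twice, then report which cause to look at."""
    first = read_rx_packets(iface)
    time.sleep(interval_s)  # hypothetical interval; the note only says "twice"
    verdict = classify_nic(first, read_rx_packets(iface))
    if verdict.startswith("nic-transmits") and bridge_nf_enabled():
        verdict += " (currently set to 1)"
    return verdict
```

On the Unraid host this might be called as `diagnose("vnet0")`, with the tap name taken from the VM's definition. The sysctl check is read-only here. Changing the value is left to the manual `echo 0` step described in the note, since making it persistent is still pending approval.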