From 47894763075a2516d8751255709c977471461a5e Mon Sep 17 00:00:00 2001 From: Mike Swanson Date: Tue, 2 Jun 2026 07:26:21 -0700 Subject: [PATCH] sync: auto-sync from GURU-5070 at 2026-06-02 07:26:17 Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-02 07:26:17 --- ...-06-02-mike-bsod-detection-and-pipeline.md | 101 ++++++++++++++++++ 1 file changed, 101 insertions(+) create mode 100644 session-logs/2026-06-02-mike-bsod-detection-and-pipeline.md diff --git a/session-logs/2026-06-02-mike-bsod-detection-and-pipeline.md b/session-logs/2026-06-02-mike-bsod-detection-and-pipeline.md new file mode 100644 index 0000000..3b760b7 --- /dev/null +++ b/session-logs/2026-06-02-mike-bsod-detection-and-pipeline.md @@ -0,0 +1,101 @@ +## User +- **User:** Mike Swanson (mike) +- **Machine:** GURU-5070 +- **Role:** admin + +## Session Summary + +GURU-5070 bluescreened, which seeded a two-part session: forensically determine whether GuruConnect caused the crash, and build the long-planned BSOD detection feature for GuruRMM. The crash was identified as `VIDEO_TDR_FAILURE (0x116)` in `nvlddmkm.sys` (NVIDIA driver TDR), confirmed by running `cdb !analyze -v` on the minidump as SYSTEM through the local RMM agent. GuruConnect was cleared on three independent grounds: it was not running at crash time (only `agent.toml` present, untouched ~7.5h prior), a user-mode app cannot raise a kernel `0x116`, and the dump names `nvlddmkm.sys` on the `System` process. Root cause: NVIDIA driver `32.0.15.9201` on the new RTX 5070 Ti Laptop GPU; one-off (only event in 21 days). + +The BSOD detection feature was taken from spec to production. A shape-spec was written (`specs/bsod-detection/`), then implemented across agent and server: a Windows-only `agent/src/bsod.rs` that polls `C:\Windows\Minidump`, parses the kernel dump header (the `minidump` crate only reads Breakpad MDMP, so the agent reads `DUMP_HEADER64`/`DUMP_HEADER32` at fixed offsets — bugcheck code @0x38, 4 params @0x40/48/50/58, FILETIME @0xFA8), cross-references the System event log for the Report Id/faulting driver, dedups via a `bsod-seen.json` watermark, and sends a new `AgentMessage::BsodEvent`. Server-side: migration `048_bsod_events.sql`, `db/bsod_events.rs`, and a `ws/mod.rs` handler that inserts the row and raises an alert. Code Review (APPROVE-WITH-NITS) found a real missed-alert-on-boot bug (watermark persisted before send) and a non-atomic watermark write; both were fixed. The offset parser was empirically validated against the real `0x116` dump (every field matched cdb exactly). Severity was set to always-Critical per Mike (overriding the roadmap's graduated rule). + +The feature was merged to `main` (`0ec55cf`) and the agent versioned to 0.6.51. Two infrastructure gaps surfaced during rollout. First, the build pipeline tagged all new builds `stable` by default — Mike clarified this was a bug (he wanted agents to default to the stable *channel*, not builds classified stable). The Windows/Linux build scripts were fixed to default new builds to `beta` (macOS already did this); stable is now an explicit promote step. Second, the Gitea webhook builds only agents, never the server — `build-shared.sh` detected the server change and bumped the version, but `webhook-handler.py` never dispatched `build-server.sh`. The server (`0.3.37`, migration 048) had to be built and deployed by hand via `build-server.sh` through the server's own root RMM agent. + +Task 7 verification passed end-to-end on GURU-5070's real `0x116`: clearing the watermark seen-list and restarting the agent produced exactly one `bsod_events` row (code 278, all params matching cdb, correct dump path/sha) and one Critical alert, correctly deduped. The Windows fleet had largely auto-updated to 0.6.51 before the beta re-tag took effect (the soak was effectively lost for already-updated agents); since 0.6.51 was verified and rollback was impossible (the 0.6.50 binary was cleaned up), Mike chose to promote 0.6.51 to stable and converge the fleet. + +Finally, the webhook server-build gap was closed: `build-server.sh` got a `server/` change-gate (with a `last-built-commit-server` marker) plus binary backup + auto-rollback, and `webhook-handler.py` now dispatches it alongside the agent builds. Deployed live and validated — the gate correctly skips on no-change. + +## Key Decisions + +- **Minidump parsing by fixed `DUMP_HEADER64` offsets, not the `minidump` crate** — the crate only parses Breakpad MDMP, not Windows kernel PAGEDU64 dumps. Offsets validated against the real dump (sig `PAGEDU64`, code@0x38=0x116, params matching cdb). +- **Always-Critical severity** (Mike) instead of the roadmap's 1/24h=warning, >=2=critical. The 24h count is retained as informational text in the alert ("N BSOD in 24h"). +- **Dedup by dump sha256** (survives re-reads/re-enroll) rather than Report Id; server has a unique index `(agent_id, dump_sha256)` + alert `dedup_key` as backstop. +- **First-run watermark baselines existing dumps as seen and alerts on none** — suppresses historical crashes on a fresh install. +- **Build channel: new builds default to beta** — the stable-by-default tagging was a bug; promotion to stable is now explicit. Distinct from agents defaulting to the stable channel (correct, unchanged). +- **Promote 0.6.51 to stable / converge the fleet** — beta-first soak was lost for already-updated agents; 0.6.51 was verified and rollback impossible, so converging was the right call. +- **Wire server builds into the webhook with a change-gate + backup/rollback** — auto-deploying the server without those would rebuild on every push and risk the BUG-003 no-rollback outage. +- **GURU-5070 left on the beta channel** as the permanent canary (now meaningful, since builds default beta going forward). + +## Problems Encountered + +- **Server's RMM agent appeared to "not see" `/var/www/downloads`** — actually the wrong path; the real dir is `/var/www/gururmm/downloads` (`DOWNLOADS_DIR` env override). The agent runs as root and can read/write it fine; the systemd sandbox bites only on mount observations / non-ReadWritePaths writes. +- **No `sshpass` on Windows GURU-5070** — the ix-server memory's sshpass note doesn't apply here. `plink`/`pscp` are available (`C:\Program Files\PuTTY\`). Ultimately used the server's own root RMM agent instead of SSH. +- **`pgrep -f 'build-server.sh'` self-matched** the RMM command's own shell text, causing false "already running" aborts. Resolved by launching unconditionally and relying on internal locking. +- **Beta re-tag raced the auto-update** — the Windows fleet auto-updated to 0.6.51 during the ~20 min between stable publish (02:30 UTC) and the re-tag. Mitigated by promoting 0.6.51 to stable and converging. +- **Server never auto-deployed** — webhook builds agents only; `build-server.sh` is manual. Deployed by hand, then wired into the webhook to prevent recurrence. +- **Code Review caught SF-1** (watermark persisted before WS send → alert dropped if disconnected at boot, the feature's primary scenario) and SF-2 (non-atomic watermark write). Both fixed before merge. + +## Configuration Changes + +Created (GuruRMM submodule): +- `specs/bsod-detection/{plan,shape,references,standards}.md` +- `agent/src/bsod.rs` +- `server/migrations/048_bsod_events.sql` +- `server/src/db/bsod_events.rs` + +Modified (GuruRMM submodule): +- `agent/src/transport/mod.rs`, `agent/src/transport/websocket.rs` — `BsodEvent` message +- `agent/src/main.rs`, `agent/src/service.rs` — spawn detector (both entry points) +- `agent/Cargo.toml` — version 0.6.50 -> 0.6.51 (also CI auto-bump) +- `server/src/db/mod.rs`, `server/src/ws/mod.rs` — register module + handler (always-Critical) +- `deploy/build-pipeline/build-windows.sh`, `build-linux.sh` — default new builds to beta +- `deploy/build-pipeline/build-server.sh` — change-gate + marker + backup/rollback; log to `/var/log/gururmm-build-server.log` +- `deploy/build-pipeline/webhook-handler.py` — dispatch `build-server.sh` (`SERVER` entry; `PLATFORMS + [SERVER]`) +- `docs/FEATURE_ROADMAP.md` — issue #10 Agent/Server bullets checked, notes added + +Created (ClaudeTools repo): +- `.claude/memory/feedback_gururmm_build_channel_default.md` (+ MEMORY.md index) +- Updated `.claude/memory/reference_gururmm.md` (downloads dir, channel control, root-agent access, plink) + MEMORY.md index + +Live server changes (172.16.3.30): +- `/opt/gururmm/gururmm-server` -> v0.3.37 (migration 048 applied) +- `/opt/gururmm/webhook-handler.py` -> server-build dispatch version (+ `systemctl restart gururmm-webhook`) +- `/opt/gururmm/build-server.sh` -> gated/rollback version +- `/opt/gururmm/last-built-commit-server` -> seeded to `faf6b27` +- `/var/www/gururmm/downloads/gururmm-agent-windows-*-0.6.51.exe.channel` -> beta then stable (promote) + +## Credentials & Secrets + +- No new secrets created. Used existing vault entries: + - `infrastructure/gururmm-server.sops.yaml` — `credentials.gururmm-api.admin-email/admin-password` (RMM API JWT); `credentials.username`=guru / `credentials.password` (SSH, sudo = SSH pw); `host`=172.16.3.30, `port`=22. + +## Infrastructure & Servers + +- **GuruRMM server (Saturn):** 172.16.3.30 — API `:3001`, coord API `:8001`, webhook handler `127.0.0.1:9000`, Postgres (DATABASE_URL via process environ), downloads dir `/var/www/gururmm/downloads`, repo `/home/guru/gururmm`, pipeline `/opt/gururmm/`. OS Ubuntu. Runs its own GuruRMM Linux agent AS ROOT (hostname `gururmm`, agent id `5e5a7ebc-95ea-40c8-b965-6ec15d63e157` on this date — resolve live). +- **GURU-5070:** Mike's box; the crashed machine; agent id `c043d9ac-4020-4cab-a5f4-b90213d11e73`; on beta channel; NVIDIA RTX 5070 Ti Laptop GPU + Intel iGPU; driver 32.0.15.9201. +- **Internal Gitea:** http://172.16.3.20:3000 (git/API). + +## Commands & Outputs + +- BSOD analysis (as SYSTEM via RMM): `cdb -z C:\WINDOWS\Minidump\060126-16718-01.dmp -y srv*C:\symbols*https://msdl.microsoft.com/download/symbols -c "!analyze -v;q"` -> `FAILURE_BUCKET_ID: 0x116_IMAGE_nvlddmkm.sys`, `PROCESS_NAME: System`. +- Offset validation (read dump header bytes): sig=`PAGEDU64`, code@0x38=`0x00000116`, p1=`FFFFE28CA1998050`, p2=`FFFFF8016FD61050`, p3=`FFFFFFFFC000009A`, p4=`4`, FILETIME@0xFA8 -> `2026-06-01T23:57:03Z`. +- Task 7 result: `bsod_events` 1 row `code=278 text=VIDEO_TDR_FAILURE` params matching; alert `critical | active | BSOD: VIDEO_TDR_FAILURE (0x116) on GURU-5070`; dedup held (1 row / 1 alert after restart). +- Server deploy: `build-server.sh` -> `Server build complete: v0.3.37` (4m55s); `to_regclass('public.bsod_events')`=bsod_events; `_sqlx_migrations` version 48 success=t. +- Webhook gate test: `build-server.sh` -> `No server/ changes since faf6b27 -- skipping server build` (server untouched). +- Channel rollout control: `echo beta|stable > /var/www/gururmm/downloads/.channel`. + +## Pending / Incomplete Tasks + +- BSOD Phase 2/3 (deferred, in roadmap issue #10): `faulting_module` is frequently null (event-log xref doesn't carry the driver) — optional cdb `!analyze` enrichment; dashboard "Crashes" tab + BSOD in Alerts stream; `fetch_bsod_dump` on-demand upload; full ~350-entry bugcheck name table (Phase 1 ships a 10-code map). +- Fleet convergence to 0.6.51 in progress (stragglers update as they reconnect). +- Optional future hardening: pre-stop migration validation in `build-server.sh` (backup/rollback covers the binary, not partial migrations). +- `webhook-handler.py.pre-serverbuild.bak` left on server as a rollback copy. + +## Reference Information + +- Spec: `projects/msp-tools/guru-rmm/specs/bsod-detection/` (plan.md is the source of truth; Tasks 1-6 marked DONE). +- GuruRMM commits: `0ec55cf` (feature), `a8d336a` (CI bump 0.6.51/0.3.37), `c1bdc1e`/`a494f22` (build default-beta), `faf6b27` (webhook server-build wiring). +- Parent ClaudeTools bumps: `1f817f5`, `13c7ad3` (submodule pointer). +- Coord todo: `b2e6b994-bb1e-48ef-b42b-a70d1626a535` (webhook server-build gap) — CLOSED. +- Roadmap: `docs/FEATURE_ROADMAP.md` "Crash Dump Detection & Analysis (BSOD)" / Gitea issue #10. +- Crash fixture: `C:\Windows\Minidump\060126-16718-01.dmp`, sha256 `0b490e2755b40e5726bac69ed99ec95bac8c52e6bf309b3233fc286f01cc2534`, WER Report Id `1c76ccf7-ce26-4eb1-ae1c-ab6ef0adbc17`.