User
- User: Mike Swanson (mike)
- Machine: GURU-5070
- Role: admin
Session Summary
GURU-5070 bluescreened, which seeded a two-part session: forensically determine whether GuruConnect caused the crash, and build the long-planned BSOD detection feature for GuruRMM. The crash was identified as VIDEO_TDR_FAILURE (0x116) in nvlddmkm.sys (NVIDIA driver TDR), confirmed by running cdb !analyze -v on the minidump as SYSTEM through the local RMM agent. GuruConnect was cleared on three independent grounds: it was not running at crash time (only agent.toml present, untouched ~7.5h prior), a user-mode app cannot raise a kernel 0x116, and the dump names nvlddmkm.sys on the System process. Root cause: NVIDIA driver 32.0.15.9201 on the new RTX 5070 Ti Laptop GPU; one-off (only event in 21 days).
The BSOD detection feature was taken from spec to production. A shape-spec was written (specs/bsod-detection/), then implemented across agent and server: a Windows-only agent/src/bsod.rs that polls C:\Windows\Minidump, parses the kernel dump header (the minidump crate only reads Breakpad MDMP, so the agent reads DUMP_HEADER64/DUMP_HEADER32 at fixed offsets — bugcheck code @0x38, 4 params @0x40/48/50/58, FILETIME @0xFA8), cross-references the System event log for the Report Id/faulting driver, dedups via a bsod-seen.json watermark, and sends a new AgentMessage::BsodEvent. Server-side: migration 048_bsod_events.sql, db/bsod_events.rs, and a ws/mod.rs handler that inserts the row and raises an alert. Code Review (APPROVE-WITH-NITS) found a real missed-alert-on-boot bug (watermark persisted before send) and a non-atomic watermark write; both were fixed. The offset parser was empirically validated against the real 0x116 dump (every field matched cdb exactly). Severity was set to always-Critical per Mike (overriding the roadmap's graduated rule).
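The fixed-offset header read described above can be sketched as follows. This is an illustrative Python sketch, not the Rust code in agent/src/bsod.rs: the function and constant names are hypothetical, and only the offsets and the PAGEDU64 signature come from the notes. It assumes little-endian fields, a u32 bugcheck code, four u64 parameters, and a u64 FILETIME.

```python
import struct
from datetime import datetime, timedelta, timezone

# Offsets quoted in the session notes for DUMP_HEADER64.
SIG_OFFSET = 0x00        # "PAGE" followed by "DU64"
BUGCHECK_OFFSET = 0x38   # u32 bugcheck code
PARAMS_OFFSET = 0x40     # four consecutive u64 parameters (0x40/48/50/58)
FILETIME_OFFSET = 0xFA8  # u64 FILETIME: 100 ns ticks since 1601-01-01 UTC
MIN_HEADER_LEN = FILETIME_OFFSET + 8

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(ticks: int) -> datetime:
    """Convert a Windows FILETIME tick count to an aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


def parse_dump_header64(data: bytes) -> dict:
    """Read bugcheck code, parameters, and crash time from a 64-bit kernel dump header.

    Hypothetical helper name; raises ValueError on a short buffer or wrong signature.
    """
    if len(data) < MIN_HEADER_LEN:
        raise ValueError("buffer too short for DUMP_HEADER64")
    if data[SIG_OFFSET:SIG_OFFSET + 8] != b"PAGEDU64":
        raise ValueError("not a 64-bit kernel dump (expected PAGEDU64)")
    (code,) = struct.unpack_from("<I", data, BUGCHECK_OFFSET)
    params = struct.unpack_from("<4Q", data, PARAMS_OFFSET)
    (ticks,) = struct.unpack_from("<Q", data, FILETIME_OFFSET)
    return {
        "bugcheck_code": code,
        "params": list(params),
        "crash_time": filetime_to_datetime(ticks),
    }
```

Rejecting anything that does not start with PAGEDU64 keeps Breakpad MDMP files and 32-bit dumps from being misread at the 64-bit offsets; a DUMP_HEADER32 reader would need its own offset table.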
The feature was merged to main (0ec55cf) and the agent versioned to 0.6.51. Two infrastructure gaps surfaced during rollout. First, the build pipeline tagged all new builds stable by default. Mike clarified this was a bug: he wanted agents to default to the stable channel, not new builds to be classified as stable. The Windows/Linux build scripts were fixed to default new builds to beta (macOS already did this); stable is now an explicit promote step. Second, the Gitea webhook built only agents, never the server: build-shared.sh detected the server change and bumped the version, but webhook-handler.py never dispatched build-server.sh. The server (0.3.37, migration 048) had to be built and deployed by hand via build-server.sh through the server's own root RMM agent.
Task 7 verification passed end-to-end on GURU-5070's real 0x116: clearing the watermark seen-list and restarting the agent produced exactly one bsod_events row (code 278, all params matching cdb, correct dump path/sha) and one Critical alert, correctly deduped. The Windows fleet had largely auto-updated to 0.6.51 before the beta re-tag took effect (the soak was effectively lost for already-updated agents); since 0.6.51 was verified and rollback was impossible (the 0.6.50 binary was cleaned up), Mike chose to promote 0.6.51 to stable and converge the fleet.
Finally, the webhook server-build gap was closed: build-server.sh got a server/ change-gate (with a last-built-commit-server marker) plus binary backup + auto-rollback, and webhook-handler.py now dispatches it alongside the agent builds. Deployed live and validated — the gate correctly skips on no-change.
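The change-gate decision can be sketched as a small pure function. This is a Python illustration under stated assumptions, not the logic in build-server.sh (a shell script): the function names are hypothetical, and building when no marker exists is an assumed policy, since the notes only say the marker was seeded by hand.

```python
import subprocess
from typing import Callable, Iterable, List, Optional


def git_changed_paths(repo: str, old: str, new: str) -> List[str]:
    """List paths that differ between two commits (thin wrapper over git diff)."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-only", f"{old}..{new}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def server_build_needed(
    last_built: Optional[str],
    head: str,
    changed_paths: Callable[[str, str], Iterable[str]],
    prefix: str = "server/",
) -> bool:
    """Decide whether the server should be rebuilt for this push."""
    if not last_built:
        return True   # assumption: no marker yet, so build once to seed it
    if last_built == head:
        return False  # nothing new since the last server build
    return any(path.startswith(prefix) for path in changed_paths(last_built, head))
```

Passing the diff function in as a parameter keeps the gate testable without a repository; in real use it would be called with something like `lambda old, new: git_changed_paths("/home/guru/gururmm", old, new)`.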
Key Decisions
- Minidump parsing by fixed DUMP_HEADER64 offsets, not the minidump crate: the crate only parses Breakpad MDMP, not Windows kernel PAGEDU64 dumps. Offsets validated against the real dump (sig PAGEDU64, code@0x38=0x116, params matching cdb).
- Always-Critical severity (Mike) instead of the roadmap's 1/24h=warning, >=2=critical. The 24h count is retained as informational text in the alert ("N BSOD in 24h").
- Dedup by dump sha256 (survives re-reads/re-enroll) rather than Report Id; server has a unique index (agent_id, dump_sha256) + alert dedup_key as backstop.
- First-run watermark baselines existing dumps as seen and alerts on none: suppresses historical crashes on a fresh install.
- Build channel: new builds default to beta — the stable-by-default tagging was a bug; promotion to stable is now explicit. Distinct from agents defaulting to the stable channel (correct, unchanged).
- Promote 0.6.51 to stable / converge the fleet — beta-first soak was lost for already-updated agents; 0.6.51 was verified and rollback impossible, so converging was the right call.
- Wire server builds into the webhook with a change-gate + backup/rollback — auto-deploying the server without those would rebuild on every push and risk the BUG-003 no-rollback outage.
- GURU-5070 left on the beta channel as the permanent canary (now meaningful, since builds default beta going forward).
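The sha256 dedup and first-run baseline decisions above can be sketched as one scan-planning function. This is a Python illustration with hypothetical names, not the agent's Rust code; it assumes the seen-set holds hex sha256 digests and that "no watermark yet" is represented as None.

```python
import hashlib
from typing import Dict, List, Optional, Set, Tuple


def sha256_hex(data: bytes) -> str:
    """Hex sha256 of a dump's bytes, used as the dedup key."""
    return hashlib.sha256(data).hexdigest()


def plan_scan(
    dumps: Dict[str, bytes], seen: Optional[Set[str]]
) -> Tuple[List[Tuple[str, str]], Set[str]]:
    """Return (dumps to report as (path, sha) pairs, seen-set to persist now).

    seen=None models a first run: every existing dump is baselined as seen
    and nothing is reported, which suppresses historical crashes.
    """
    hashes = {path: sha256_hex(data) for path, data in dumps.items()}
    if seen is None:
        return [], set(hashes.values())
    to_report: List[Tuple[str, str]] = []
    batch: Set[str] = set()
    for path, digest in sorted(hashes.items()):
        # Keyed on content, so a re-read or renamed copy is not reported twice.
        if digest not in seen and digest not in batch:
            to_report.append((path, digest))
            batch.add(digest)
    return to_report, set(seen)
```

On later runs the returned seen-set is unchanged on purpose: a hash is only added once its event has actually been sent.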
Problems Encountered
- Server's RMM agent appeared to "not see" /var/www/downloads: actually the wrong path; the real dir is /var/www/gururmm/downloads (DOWNLOADS_DIR env override). The agent runs as root and can read/write it fine; the systemd sandbox bites only on mount observations / non-ReadWritePaths writes.
- No sshpass on Windows GURU-5070: the ix-server memory's sshpass note doesn't apply here. plink/pscp are available (C:\Program Files\PuTTY\). Ultimately used the server's own root RMM agent instead of SSH.
- pgrep -f 'build-server.sh' self-matched the RMM command's own shell text, causing false "already running" aborts. Resolved by launching unconditionally and relying on internal locking.
- Beta re-tag raced the auto-update: the Windows fleet auto-updated to 0.6.51 during the ~20 min between stable publish (02:30 UTC) and the re-tag. Mitigated by promoting 0.6.51 to stable and converging.
- Server never auto-deployed: webhook builds agents only; build-server.sh is manual. Deployed by hand, then wired into the webhook to prevent recurrence.
- Code Review caught SF-1 (watermark persisted before WS send -> alert dropped if disconnected at boot, the feature's primary scenario) and SF-2 (non-atomic watermark write). Both fixed before merge.
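The two review findings (SF-1 ordering and SF-2 atomicity) can be sketched together. This is a Python illustration, not the Rust fix that was merged: the function names are hypothetical, and a JSON array of sha256 strings is an assumed file format for bsod-seen.json.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Set


def write_watermark_atomic(path: str, seen: Set[str]) -> None:
    """Write the seen-list to a temp file, then rename it over the target.

    A crash mid-write leaves the previous file intact rather than a torn one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".bsod-seen-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(sorted(seen), handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)  # atomic swap within the same directory
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def report_then_persist(
    path: str,
    seen: Set[str],
    pending: Iterable[str],
    send: Callable[[str], bool],
) -> Set[str]:
    """Mark a dump as seen only after send() confirms it went out."""
    for digest in pending:
        if digest in seen:
            continue
        if not send(digest):
            break  # e.g. socket not connected at boot: retry on the next scan
        seen = seen | {digest}
        write_watermark_atomic(path, seen)
    return seen
```

Because the watermark only advances after a successful send, an agent that boots before its WebSocket connects reports the crash on a later scan instead of silently dropping it.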
Configuration Changes
Created (GuruRMM submodule):
- specs/bsod-detection/{plan,shape,references,standards}.md
- agent/src/bsod.rs
- server/migrations/048_bsod_events.sql
- server/src/db/bsod_events.rs
Modified (GuruRMM submodule):
- agent/src/transport/mod.rs, agent/src/transport/websocket.rs: BsodEvent message
- agent/src/main.rs, agent/src/service.rs: spawn detector (both entry points)
- agent/Cargo.toml: version 0.6.50 -> 0.6.51 (also CI auto-bump)
- server/src/db/mod.rs, server/src/ws/mod.rs: register module + handler (always-Critical)
- deploy/build-pipeline/build-windows.sh, build-linux.sh: default new builds to beta
- deploy/build-pipeline/build-server.sh: change-gate + marker + backup/rollback; log to /var/log/gururmm-build-server.log
- deploy/build-pipeline/webhook-handler.py: dispatch build-server.sh (SERVER entry; PLATFORMS + [SERVER])
- docs/FEATURE_ROADMAP.md: issue #10 Agent/Server bullets checked, notes added
Created (ClaudeTools repo):
- .claude/memory/feedback_gururmm_build_channel_default.md (+ MEMORY.md index)
- Updated .claude/memory/reference_gururmm.md (downloads dir, channel control, root-agent access, plink) + MEMORY.md index
Live server changes (172.16.3.30):
- /opt/gururmm/gururmm-server -> v0.3.37 (migration 048 applied)
- /opt/gururmm/webhook-handler.py -> server-build dispatch version (+ systemctl restart gururmm-webhook)
- /opt/gururmm/build-server.sh -> gated/rollback version
- /opt/gururmm/last-built-commit-server -> seeded to faf6b27
- /var/www/gururmm/downloads/gururmm-agent-windows-*-0.6.51.exe.channel -> beta then stable (promote)
Credentials & Secrets
- No new secrets created. Used existing vault entries in infrastructure/gururmm-server.sops.yaml: credentials.gururmm-api.admin-email / admin-password (RMM API JWT); credentials.username=guru / credentials.password (SSH, sudo = SSH pw); host=172.16.3.30, port=22.
Infrastructure & Servers
- GuruRMM server (Saturn): 172.16.3.30. API :3001, coord API :8001, webhook handler 127.0.0.1:9000, Postgres (DATABASE_URL via process environ), downloads dir /var/www/gururmm/downloads, repo /home/guru/gururmm, pipeline /opt/gururmm/. OS Ubuntu. Runs its own GuruRMM Linux agent AS ROOT (hostname gururmm, agent id 5e5a7ebc-95ea-40c8-b965-6ec15d63e157 on this date; resolve live).
- GURU-5070: Mike's box; the crashed machine; agent id c043d9ac-4020-4cab-a5f4-b90213d11e73; on beta channel; NVIDIA RTX 5070 Ti Laptop GPU + Intel iGPU; driver 32.0.15.9201.
- Internal Gitea: http://172.16.3.20:3000 (git/API).
Commands & Outputs
- BSOD analysis (as SYSTEM via RMM): cdb -z C:\WINDOWS\Minidump\060126-16718-01.dmp -y srv*C:\symbols*https://msdl.microsoft.com/download/symbols -c "!analyze -v;q" -> FAILURE_BUCKET_ID: 0x116_IMAGE_nvlddmkm.sys, PROCESS_NAME: System.
- Offset validation (read dump header bytes): sig=PAGEDU64, code@0x38=0x00000116, p1=FFFFE28CA1998050, p2=FFFFF8016FD61050, p3=FFFFFFFFC000009A, p4=4, FILETIME@0xFA8 -> 2026-06-01T23:57:03Z.
- Task 7 result: bsod_events 1 row, code=278 text=VIDEO_TDR_FAILURE, params matching; alert critical | active | BSOD: VIDEO_TDR_FAILURE (0x116) on GURU-5070; dedup held (1 row / 1 alert after restart).
- Server deploy: build-server.sh -> Server build complete: v0.3.37 (4m55s); to_regclass('public.bsod_events') = bsod_events; _sqlx_migrations version 48 success=t.
- Webhook gate test: build-server.sh -> No server/ changes since faf6b27 -- skipping server build (server untouched).
- Channel rollout control: echo beta|stable > /var/www/gururmm/downloads/<binary>.channel.
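The alert title in the Task 7 result pairs a decimal code from the database (278) with a name and a hex rendering (0x116). A small Python sketch of that formatting follows. The names in the table are standard Windows bugcheck names, but which ten codes Phase 1 actually ships is not stated in these notes, so the selection here is an assumption, as are the function names and the fallback label.

```python
# Illustrative subset of Windows bugcheck codes (assumed, not the shipped map).
BUGCHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x3B: "SYSTEM_SERVICE_EXCEPTION",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x7E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
    0x9F: "DRIVER_POWER_STATE_FAILURE",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
    0xEF: "CRITICAL_PROCESS_DIED",
    0x116: "VIDEO_TDR_FAILURE",
    0x124: "WHEA_UNCORRECTABLE_ERROR",
}


def bugcheck_name(code: int) -> str:
    """Look up a bugcheck name; the fallback label is a hypothetical placeholder."""
    return BUGCHECK_NAMES.get(code, "UNKNOWN_BUGCHECK")


def alert_title(code: int, hostname: str) -> str:
    """Format the title in the shape seen in the Task 7 alert."""
    return f"BSOD: {bugcheck_name(code)} (0x{code:X}) on {hostname}"


def alert_detail(count_24h: int) -> str:
    """Informational 24h count; severity itself is always Critical per the notes."""
    return f"{count_24h} BSOD in 24h"


print(alert_title(278, "GURU-5070"))  # BSOD: VIDEO_TDR_FAILURE (0x116) on GURU-5070
```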
Pending / Incomplete Tasks
- BSOD Phase 2/3 (deferred, in roadmap issue #10): faulting_module is frequently null (event-log xref doesn't carry the driver), hence the optional cdb !analyze enrichment; dashboard "Crashes" tab + BSOD in Alerts stream; fetch_bsod_dump on-demand upload; full ~350-entry bugcheck name table (Phase 1 ships a 10-code map).
- Fleet convergence to 0.6.51 in progress (stragglers update as they reconnect).
- Optional future hardening: pre-stop migration validation in build-server.sh (backup/rollback covers the binary, not partial migrations).
- webhook-handler.py.pre-serverbuild.bak left on server as a rollback copy.
Reference Information
- Spec: projects/msp-tools/guru-rmm/specs/bsod-detection/ (plan.md is the source of truth; Tasks 1-6 marked DONE).
- GuruRMM commits: 0ec55cf (feature), a8d336a (CI bump 0.6.51/0.3.37), c1bdc1e/a494f22 (build default-beta), faf6b27 (webhook server-build wiring).
- Parent ClaudeTools bumps: 1f817f5, 13c7ad3 (submodule pointer).
- Coord todo: b2e6b994-bb1e-48ef-b42b-a70d1626a535 (webhook server-build gap), CLOSED.
- Roadmap: docs/FEATURE_ROADMAP.md "Crash Dump Detection & Analysis (BSOD)" / Gitea issue #10.
- Crash fixture: C:\Windows\Minidump\060126-16718-01.dmp, sha256 0b490e2755b40e5726bac69ed99ec95bac8c52e6bf309b3233fc286f01cc2534, WER Report Id 1c76ccf7-ce26-4eb1-ae1c-ab6ef0adbc17.