From 69d4dd3e98832ccc06f37f7fa5d5b039e1964b93 Mon Sep 17 00:00:00 2001 From: Mike Swanson Date: Sun, 31 May 2026 09:41:08 -0700 Subject: [PATCH] sync: auto-sync from GURU-5070 at 2026-05-31 09:41:03 Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-05-31 09:41:03 --- ...026-05-31-mike-guruconnect-h264-decoder.md | 89 +++++++++++++++++++ 1 file changed, 89 insertions(+) create mode 100644 session-logs/2026-05-31-mike-guruconnect-h264-decoder.md diff --git a/session-logs/2026-05-31-mike-guruconnect-h264-decoder.md b/session-logs/2026-05-31-mike-guruconnect-h264-decoder.md new file mode 100644 index 0000000..22a837e --- /dev/null +++ b/session-logs/2026-05-31-mike-guruconnect-h264-decoder.md @@ -0,0 +1,89 @@ +## User +- **User:** Mike Swanson (mike) +- **Machine:** GURU-5070 +- **Role:** admin + +## Session Summary + +Resolved the GuruConnect H.264 "blank white viewer" bug end to end. The session began continuing a pre-compaction investigation whose working theory was that the agent's hardware H.264 encoder emitted no frames (Task 8 async-MFT). That theory was disproven: it was a measurement artifact. The agent ignores `RUST_LOG` and only enables debug logging via a `--verbose` CLI flag, so every `tracing::debug!("[diag] ...")` line added earlier was silently filtered, making the capture/encode loop look dead. Compounding this, the loopback test had hardcoded a stale `session_id`; after an agent restart the relay minted a new session, so the viewer's StartStream was delivered to a dead session and never reached the live agent. + +After fixing the test harness (run agent with `--verbose`, always fetch the live `session_id` from the relay journal after the agent connects), a clean loopback run on session `fe0e6dd3` showed the encoder was correct all along: 900+ H.264 frames produced and sent, with a NAL-type dump confirming proper structure — first frame `[9,7,8,6,5]` (AUD+SPS+PPS+SEI+IDR), recurring `[9,7,8,5]` IDR keyframes (~1s GOP), and `[9,1]` P-frames. The bug was entirely on the viewer side, in the never-before-run native viewer ("COMPILE-VERIFIED ONLY" decoder). + +Three stacked viewer bugs were found and fixed in sequence, each revealed only after fixing the prior one, by progressively instrumenting and re-running the 5070 same-machine loopback: +1. Decoder mishandled `MF_E_NOTACCEPTING` (0xC00D36B5) — fed input without draining output and treated back-pressure as fatal, so after the first GOP filled the MFT it rejected every subsequent frame. +2. Decoder never negotiated its output type (`MF_E_TRANSFORM_TYPE_NOT_SET`, 0xC00D6D60) — it hand-built a bare NV12 type with no frame size instead of enumerating the MFT's available output types via `GetOutputAvailableType`. +3. The winit event loop froze after one iteration — a manual `input::pump_messages()` (PeekMessage/DispatchMessage) call inside `about_to_wait` stole winit's own thread messages and re-entered its window proc, halting the loop. This is what kept the screen blank even after bugs 1 and 2 were fixed. + +Bugs 1 and 2 were fixed via the Coding Agent (two passes); bug 3 was fixed directly (one-line removal) to verify the hypothesis before formalizing. The final instrumented run proved the full pipeline: `about_to_wait` 1526 iterations (was 1), `process_frame` 360+ paints (was 0), `render present` 1740 blits (was 0), frames at 2560x1600 BGRA (`bytes=16384000`). The decoder/render fixes plus diagnostics remain uncommitted in the GC submodule working tree pending Code Review and diagnostic cleanup. A `/sync` mid-session pulled 12 commits and confirmed the uncommitted GC work survived a submodule pointer advance (b3e8f32 -> afbf0d8). + +## Key Decisions + +- Kept the test as a 5070 same-machine loopback (agent = viewer = GURU-5070) rather than beast->5070 for root-causing, because co-locating the encoder and its logs makes instrumentation far easier; a cross-GPU beast->5070 run is deferred to fix-validation. +- Fixed the test harness blindness (no `--verbose`, stale session_id) before drawing any further conclusions — the prior "no frames" diagnosis was untrustworthy until the measurement was sound. +- Confirmed encoder correctness via a NAL-unit-type dump (added `nal_unit_types()` Annex-B scanner) rather than assuming, which redirected the entire investigation from encoder to viewer. +- Made bug-3 (event-loop) fix directly and verified it with instrumentation BEFORE routing through Code Review, to avoid spending a delegate cycle on an unverified hypothesis. Bugs 1-2 went through the Coding Agent since they were larger and the diagnosis was already firm. +- Left all GC changes uncommitted in the submodule working tree; did not commit/push because (a) diagnostics must be stripped, (b) fixes need Code Review, (c) pushing GC main triggers CI. The `SYNCMFT` software-encoder diagnostic must be reverted — the hardware async-MFT path is still unvalidated. +- Wrote this session log to the root `session-logs/` (not the GC submodule), consistent with the GuruRMM convention that submodule repos do not carry our session logs. + +## Problems Encountered + +- Agent debug logs invisible: `FmtSubscriber::with_max_level(INFO)` in `agent/src/main.rs` ignores `RUST_LOG`; debug only via `--verbose`/`-v` (global clap flag). Resolved by launching the agent with `--verbose`. +- Stale session_id in loopback: agent restart mints a new relay session; viewer hit a dead session and StartStream never reached the agent. Resolved by fetching the live `Session created: ` from the relay journal after each agent connect. +- 18-second StartStream latency: the agent took ~18s to process a queued StartStream — its `try_recv` message pump (`block_in_place`/`block_on` in `transport/websocket.rs`) is sluggish. Worked around in testing by sustaining viewer activity 45-55s so streaming overlapped; logged as a real secondary bug to fix later. +- `MF_E_NOTACCEPTING` (0xC00D36B5): decoder rejected frames; fixed by draining all outputs and retrying the unconsumed sample on back-pressure. +- `MF_E_TRANSFORM_TYPE_NOT_SET` (0xC00D6D60): output type never set; fixed by `GetOutputAvailableType` enumeration selecting the NV12 type on STREAM_CHANGE/TYPE_NOT_SET. +- winit loop froze after one iteration: manual `pump_messages()` in `about_to_wait` conflicted with winit's own pump; removed it (winit's pump already services the low-level keyboard hook). +- PowerShell protective hook misfired on a `\d+` literal in `Remove-Item` args; reran using `[0-9]+` and dropping the Remove-Item. +- Build "Access is denied" on the agent .exe when a `guruconnect.exe` was still running; resolved by `Get-Process guruconnect | Stop-Process -Force` before rebuilds. + +## Configuration Changes + +GuruConnect submodule (`projects/msp-tools/guru-connect/`) — UNCOMMITTED in working tree, pending Code Review + diagnostic strip: +- `agent/src/viewer/decoder.rs` — FIX: `decode()` returns `Vec`, handles `MF_E_NOTACCEPTING` (drain+retry), drains all outputs; `negotiate_output_type()` enumerates `GetOutputAvailableType` for NV12; `drain_one()` handles `MF_E_TRANSFORM_TYPE_NOT_SET`/`STREAM_CHANGE` with an anti-spin guard. DIAG: `[diag]` negotiation + ProcessOutput nv12_len logs. +- `agent/src/viewer/mod.rs` — FIX: decode worker iterates the `Vec` and forwards each frame. DIAG: `[diag] decode worker in/produced` counters. +- `agent/src/viewer/render.rs` — FIX: removed `input::pump_messages()` from `about_to_wait` (the event-loop freeze). DIAG: `[diag]` counters in `process_frame`, `render` present, `window_event`, `about_to_wait`. +- `agent/src/encoder/h264.rs` — DIAG ONLY (must revert): `MFT_ENUM_FLAG_SYNCMFT` software-encoder override in `activate_hw_encoder`; `nal_unit_types()` Annex-B scanner + NAL-type log in `drain_one_output`. +- `agent/src/session/mod.rs` — DIAG ONLY (must revert): `[diag]` capture/encode logging in the streaming loop. + +Main repo: no committed changes from this session (sync auto-commit `caa7c2c` was dropped as redundant during rebase). + +## Credentials & Secrets + +- GC relay JWT signing secret (used to forge viewer tokens for loopback testing): `KfPrjjC3J6YMx9q1yjPxZAYkHLM2JdFy1XRxHJ9oPnw0NU3xH074ufHk7fj++e8BJEqRQ5k4zlWD+1iDwlLP4w==` (HS256). This is the deployed server `JWT_SECRET` on 172.16.3.30. +- GC test agent shared key (deprecated `AGENT_API_KEY` fallback, server WARN-logs its use): `x7m9p2k8v4n1q5w3r6t0y2u8i5o3l7m9p2k8`. +- Deploy host SSH: `guru@172.16.3.30`, password `Gptf*77ttb123!@#-rmm` (used via plink for `journalctl`). + +## Infrastructure & Servers + +- GuruConnect relay: `172.16.3.30:3002`, systemd `guruconnect.service`, PID 1464259 this session. WS endpoints `/ws/agent` and `/ws/viewer`. Public host connect.azcomputerguru.com (NPM-proxied). +- Test agent identity: `agent_id=h264sync-5070`, hostname GURU-5070, reattaches to persistent session `fe0e6dd3-acd8-4172-a469-dd7aed2fffa5`. +- Test config: `%TEMP%\gc-h264-test\guruconnect.toml` — `server_url=ws://172.16.3.30:3002/ws/agent`, `api_key=x7m9p2k8v4n1q5w3r6t0y2u8i5o3l7m9p2k8`, `tags=["h264-test"]` (triggers `prefer_h264`). +- Agent binary: `D:\claudetools\projects\msp-tools\guru-connect\target\x86_64-pc-windows-msvc\release\guruconnect.exe` (v0.2.0). + +## Commands & Outputs + +- Build: `cd D:\claudetools\projects\msp-tools\guru-connect; cargo build -p guruconnect --release --target x86_64-pc-windows-msvc` (~36-43s; cargo writes progress to stderr — the PowerShell NativeCommandError wrapper is benign). +- Run tagged agent (debug): `guruconnect.exe agent --verbose` from the test workdir. +- Forge viewer JWT: HS256 over `{sub,session_id,tenant_id,access:"control",purpose:"viewer",exp,iat}` with the server secret; run viewer `guruconnect.exe view --server ws://172.16.3.30:3002/ws/viewer --api-key --verbose`. +- Live session lookup: `journalctl -u guruconnect.service --since '20 sec ago' | grep -oE 'Session created: [a-f0-9-]+' | tail -1`. +- Final proof counts (verbose viewer): about_to_wait=1526, window_event=3813, process_frame=360+, render present=1740, frame 2560x1600 BGRA bytes=16384000, decode errors=0. +- Encoder NAL dump: `[9,7,8,6,5]` x1 (IDR+SPS+PPS), `[9,7,8,5]` x32 (IDR), `[9,1]` x936 (P-frames). +- Decoder errors observed and cleared: 0xC00D36B5 (NOTACCEPTING) -> 0xC00D6D60 (TYPE_NOT_SET) -> 0 errors. + +## Pending / Incomplete Tasks + +- Route the three viewer fixes (decoder NOTACCEPTING, decoder output-type negotiation, render event-loop pump removal) through Code Review, then commit to the GC submodule and bump the parent pointer. Pushing GC main triggers CI. +- Strip ALL `[diag]` instrumentation: `decoder.rs`, `render.rs`, `viewer/mod.rs`, `encoder/h264.rs`, `session/mod.rs`. Clear the leftover build warning. +- Revert the `MFT_ENUM_FLAG_SYNCMFT` software-encoder diagnostic in `h264.rs`; the hardware async-MFT path still needs its own live validation (Task 8 / #17). +- Fix the ~18s StartStream latency (agent `try_recv` message pump in `transport/websocket.rs`). +- Run a real beast->5070 cross-GPU validation before flipping `DEFAULT_PREFER_H264` to true; keep it false until signed off. +- Visual confirmation from Mike still pending at session end (instrumentation proved frames reach present()). +- Tasks #4, #16, #17 remain in_progress; #3,#5,#7,#10,#13,#14,#15 pending. + +## Reference Information + +- GC submodule HEAD after sync: `afbf0d8` (SPEC-015 Notification Overlay); was `b3e8f32` (SPEC-004 Task 1 machine_uid). +- Main repo HEAD after sync: `80af6eb` fix: improve git hook JSON escaping. +- Coding Agent IDs (resumable): decoder fix #1 `aa02e8c4144895542`; decoder fix #2 `a7fd5547a8731bfd4`. +- Key files: `agent/src/main.rs:188` (verbose gate), `agent/src/viewer/decoder.rs` (decode/drain/negotiate), `agent/src/viewer/render.rs:464` (about_to_wait), `agent/src/viewer/input.rs:242` (pump_messages), `agent/src/transport/websocket.rs:104` (try_recv 1ms-timeout pump). +- Error code refs: 0xC00D36B5 = MF_E_NOTACCEPTING; 0xC00D6D60 = MF_E_TRANSFORM_TYPE_NOT_SET; 0xC00D6D61 = MF_E_TRANSFORM_STREAM_CHANGE.