Files
claudetools/session-logs/2026-05-31-mike-guruconnect-h264-decoder.md
Mike Swanson 69d4dd3e98 sync: auto-sync from GURU-5070 at 2026-05-31 09:41:03
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-05-31 09:41:03
2026-05-31 09:41:08 -07:00

11 KiB

User

  • User: Mike Swanson (mike)
  • Machine: GURU-5070
  • Role: admin

Session Summary

Resolved the GuruConnect H.264 "blank white viewer" bug end to end. The session began continuing a pre-compaction investigation whose working theory was that the agent's hardware H.264 encoder emitted no frames (Task 8 async-MFT). That theory was disproven: it was a measurement artifact. The agent ignores RUST_LOG and only enables debug logging via a --verbose CLI flag, so every tracing::debug!("[diag] ...") line added earlier was silently filtered, making the capture/encode loop look dead. Compounding this, the loopback test had hardcoded a stale session_id; after an agent restart the relay minted a new session, so the viewer's StartStream was delivered to a dead session and never reached the live agent.

After fixing the test harness (run agent with --verbose, always fetch the live session_id from the relay journal after the agent connects), a clean loopback run on session fe0e6dd3 showed the encoder was correct all along: 900+ H.264 frames produced and sent, with a NAL-type dump confirming proper structure — first frame [9,7,8,6,5] (AUD+SPS+PPS+SEI+IDR), recurring [9,7,8,5] IDR keyframes (~1s GOP), and [9,1] P-frames. The bug was entirely on the viewer side, in the never-before-run native viewer ("COMPILE-VERIFIED ONLY" decoder).

Three stacked viewer bugs were found and fixed in sequence, each revealed only after fixing the prior one, by progressively instrumenting and re-running the 5070 same-machine loopback:

  1. Decoder mishandled MF_E_NOTACCEPTING (0xC00D36B5) — fed input without draining output and treated back-pressure as fatal, so after the first GOP filled the MFT it rejected every subsequent frame.
  2. Decoder never negotiated its output type (MF_E_TRANSFORM_TYPE_NOT_SET, 0xC00D6D60) — it hand-built a bare NV12 type with no frame size instead of enumerating the MFT's available output types via GetOutputAvailableType.
  3. The winit event loop froze after one iteration — a manual input::pump_messages() (PeekMessage/DispatchMessage) call inside about_to_wait stole winit's own thread messages and re-entered its window proc, halting the loop. This is what kept the screen blank even after bugs 1 and 2 were fixed.

Bugs 1 and 2 were fixed via the Coding Agent (two passes); bug 3 was fixed directly (one-line removal) to verify the hypothesis before formalizing. The final instrumented run proved the full pipeline: about_to_wait 1526 iterations (was 1), process_frame 360+ paints (was 0), render present 1740 blits (was 0), frames at 2560x1600 BGRA (bytes=16384000). The decoder/render fixes plus diagnostics remain uncommitted in the GC submodule working tree pending Code Review and diagnostic cleanup. A /sync mid-session pulled 12 commits and confirmed the uncommitted GC work survived a submodule pointer advance (b3e8f32 -> afbf0d8).

Key Decisions

  • Kept the test as a 5070 same-machine loopback (agent = viewer = GURU-5070) rather than beast->5070 for root-causing, because co-locating the encoder and its logs makes instrumentation far easier; a cross-GPU beast->5070 run is deferred to fix-validation.
  • Fixed the test harness blindness (no --verbose, stale session_id) before drawing any further conclusions — the prior "no frames" diagnosis was untrustworthy until the measurement was sound.
  • Confirmed encoder correctness via a NAL-unit-type dump (added nal_unit_types() Annex-B scanner) rather than assuming, which redirected the entire investigation from encoder to viewer.
  • Made bug-3 (event-loop) fix directly and verified it with instrumentation BEFORE routing through Code Review, to avoid spending a delegate cycle on an unverified hypothesis. Bugs 1-2 went through the Coding Agent since they were larger and the diagnosis was already firm.
  • Left all GC changes uncommitted in the submodule working tree; did not commit/push because (a) diagnostics must be stripped, (b) fixes need Code Review, (c) pushing GC main triggers CI. The SYNCMFT software-encoder diagnostic must be reverted — the hardware async-MFT path is still unvalidated.
  • Wrote this session log to the root session-logs/ (not the GC submodule), consistent with the GuruRMM convention that submodule repos do not carry our session logs.

Problems Encountered

  • Agent debug logs invisible: FmtSubscriber::with_max_level(INFO) in agent/src/main.rs ignores RUST_LOG; debug only via --verbose/-v (global clap flag). Resolved by launching the agent with --verbose.
  • Stale session_id in loopback: agent restart mints a new relay session; viewer hit a dead session and StartStream never reached the agent. Resolved by fetching the live Session created: <id> from the relay journal after each agent connect.
  • 18-second StartStream latency: the agent took ~18s to process a queued StartStream — its try_recv message pump (block_in_place/block_on in transport/websocket.rs) is sluggish. Worked around in testing by sustaining viewer activity 45-55s so streaming overlapped; logged as a real secondary bug to fix later.
  • MF_E_NOTACCEPTING (0xC00D36B5): decoder rejected frames; fixed by draining all outputs and retrying the unconsumed sample on back-pressure.
  • MF_E_TRANSFORM_TYPE_NOT_SET (0xC00D6D60): output type never set; fixed by GetOutputAvailableType enumeration selecting the NV12 type on STREAM_CHANGE/TYPE_NOT_SET.
  • winit loop froze after one iteration: manual pump_messages() in about_to_wait conflicted with winit's own pump; removed it (winit's pump already services the low-level keyboard hook).
  • PowerShell protective hook misfired on a \d+ literal in Remove-Item args; reran using [0-9]+ and dropping the Remove-Item.
  • Build "Access is denied" on the agent .exe when a guruconnect.exe was still running; resolved by Get-Process guruconnect | Stop-Process -Force before rebuilds.

Configuration Changes

GuruConnect submodule (projects/msp-tools/guru-connect/) — UNCOMMITTED in working tree, pending Code Review + diagnostic strip:

  • agent/src/viewer/decoder.rs — FIX: decode() returns Vec<DecodedFrame>, handles MF_E_NOTACCEPTING (drain+retry), drains all outputs; negotiate_output_type() enumerates GetOutputAvailableType for NV12; drain_one() handles MF_E_TRANSFORM_TYPE_NOT_SET/STREAM_CHANGE with an anti-spin guard. DIAG: [diag] negotiation + ProcessOutput nv12_len logs.
  • agent/src/viewer/mod.rs — FIX: decode worker iterates the Vec and forwards each frame. DIAG: [diag] decode worker in/produced counters.
  • agent/src/viewer/render.rs — FIX: removed input::pump_messages() from about_to_wait (the event-loop freeze). DIAG: [diag] counters in process_frame, render present, window_event, about_to_wait.
  • agent/src/encoder/h264.rs — DIAG ONLY (must revert): MFT_ENUM_FLAG_SYNCMFT software-encoder override in activate_hw_encoder; nal_unit_types() Annex-B scanner + NAL-type log in drain_one_output.
  • agent/src/session/mod.rs — DIAG ONLY (must revert): [diag] capture/encode logging in the streaming loop.

Main repo: no committed changes from this session (sync auto-commit caa7c2c was dropped as redundant during rebase).

Credentials & Secrets

  • GC relay JWT signing secret (used to forge viewer tokens for loopback testing): KfPrjjC3J6YMx9q1yjPxZAYkHLM2JdFy1XRxHJ9oPnw0NU3xH074ufHk7fj++e8BJEqRQ5k4zlWD+1iDwlLP4w== (HS256). This is the deployed server JWT_SECRET on 172.16.3.30.
  • GC test agent shared key (deprecated AGENT_API_KEY fallback, server WARN-logs its use): x7m9p2k8v4n1q5w3r6t0y2u8i5o3l7m9p2k8.
  • Deploy host SSH: guru@172.16.3.30, password Gptf*77ttb123!@#-rmm (used via plink for journalctl).

Infrastructure & Servers

  • GuruConnect relay: 172.16.3.30:3002, systemd guruconnect.service, PID 1464259 this session. WS endpoints /ws/agent and /ws/viewer. Public host connect.azcomputerguru.com (NPM-proxied).
  • Test agent identity: agent_id=h264sync-5070, hostname GURU-5070, reattaches to persistent session fe0e6dd3-acd8-4172-a469-dd7aed2fffa5.
  • Test config: %TEMP%\gc-h264-test\guruconnect.tomlserver_url=ws://172.16.3.30:3002/ws/agent, api_key=x7m9p2k8v4n1q5w3r6t0y2u8i5o3l7m9p2k8, tags=["h264-test"] (triggers prefer_h264).
  • Agent binary: D:\claudetools\projects\msp-tools\guru-connect\target\x86_64-pc-windows-msvc\release\guruconnect.exe (v0.2.0).

Commands & Outputs

  • Build: cd D:\claudetools\projects\msp-tools\guru-connect; cargo build -p guruconnect --release --target x86_64-pc-windows-msvc (~36-43s; cargo writes progress to stderr — the PowerShell NativeCommandError wrapper is benign).
  • Run tagged agent (debug): guruconnect.exe agent --verbose from the test workdir.
  • Forge viewer JWT: HS256 over {sub,session_id,tenant_id,access:"control",purpose:"viewer",exp,iat} with the server secret; run viewer guruconnect.exe view <sid> --server ws://172.16.3.30:3002/ws/viewer --api-key <token> --verbose.
  • Live session lookup: journalctl -u guruconnect.service --since '20 sec ago' | grep -oE 'Session created: [a-f0-9-]+' | tail -1.
  • Final proof counts (verbose viewer): about_to_wait=1526, window_event=3813, process_frame=360+, render present=1740, frame 2560x1600 BGRA bytes=16384000, decode errors=0.
  • Encoder NAL dump: [9,7,8,6,5] x1 (IDR+SPS+PPS), [9,7,8,5] x32 (IDR), [9,1] x936 (P-frames).
  • Decoder errors observed and cleared: 0xC00D36B5 (NOTACCEPTING) -> 0xC00D6D60 (TYPE_NOT_SET) -> 0 errors.

Pending / Incomplete Tasks

  • Route the three viewer fixes (decoder NOTACCEPTING, decoder output-type negotiation, render event-loop pump removal) through Code Review, then commit to the GC submodule and bump the parent pointer. Pushing GC main triggers CI.
  • Strip ALL [diag] instrumentation: decoder.rs, render.rs, viewer/mod.rs, encoder/h264.rs, session/mod.rs. Clear the leftover build warning.
  • Revert the MFT_ENUM_FLAG_SYNCMFT software-encoder diagnostic in h264.rs; the hardware async-MFT path still needs its own live validation (Task 8 / #17).
  • Fix the ~18s StartStream latency (agent try_recv message pump in transport/websocket.rs).
  • Run a real beast->5070 cross-GPU validation before flipping DEFAULT_PREFER_H264 to true; keep it false until signed off.
  • Visual confirmation from Mike still pending at session end (instrumentation proved frames reach present()).
  • Tasks #4, #16, #17 remain in_progress; #3,#5,#7,#10,#13,#14,#15 pending.

Reference Information

  • GC submodule HEAD after sync: afbf0d8 (SPEC-015 Notification Overlay); was b3e8f32 (SPEC-004 Task 1 machine_uid).
  • Main repo HEAD after sync: 80af6eb fix: improve git hook JSON escaping.
  • Coding Agent IDs (resumable): decoder fix #1 aa02e8c4144895542; decoder fix #2 a7fd5547a8731bfd4.
  • Key files: agent/src/main.rs:188 (verbose gate), agent/src/viewer/decoder.rs (decode/drain/negotiate), agent/src/viewer/render.rs:464 (about_to_wait), agent/src/viewer/input.rs:242 (pump_messages), agent/src/transport/websocket.rs:104 (try_recv 1ms-timeout pump).
  • Error code refs: 0xC00D36B5 = MF_E_NOTACCEPTING; 0xC00D6D60 = MF_E_TRANSFORM_TYPE_NOT_SET; 0xC00D6D61 = MF_E_TRANSFORM_STREAM_CHANGE.