11 KiB
User
- User: Mike Swanson (mike)
- Machine: GURU-5070
- Role: admin
Session Summary
Resolved the GuruConnect H.264 "blank white viewer" bug end to end. The session began continuing a pre-compaction investigation whose working theory was that the agent's hardware H.264 encoder emitted no frames (Task 8 async-MFT). That theory was disproven: it was a measurement artifact. The agent ignores RUST_LOG and only enables debug logging via a --verbose CLI flag, so every tracing::debug!("[diag] ...") line added earlier was silently filtered, making the capture/encode loop look dead. Compounding this, the loopback test had hardcoded a stale session_id; after an agent restart the relay minted a new session, so the viewer's StartStream was delivered to a dead session and never reached the live agent.
After fixing the test harness (run agent with --verbose, always fetch the live session_id from the relay journal after the agent connects), a clean loopback run on session fe0e6dd3 showed the encoder was correct all along: 900+ H.264 frames produced and sent, with a NAL-type dump confirming proper structure — first frame [9,7,8,6,5] (AUD+SPS+PPS+SEI+IDR), recurring [9,7,8,5] IDR keyframes (~1s GOP), and [9,1] P-frames. The bug was entirely on the viewer side, in the never-before-run native viewer ("COMPILE-VERIFIED ONLY" decoder).
Three stacked viewer bugs were found and fixed in sequence, each revealed only after fixing the prior one, by progressively instrumenting and re-running the 5070 same-machine loopback:
- Decoder mishandled
MF_E_NOTACCEPTING(0xC00D36B5) — fed input without draining output and treated back-pressure as fatal, so after the first GOP filled the MFT it rejected every subsequent frame. - Decoder never negotiated its output type (
MF_E_TRANSFORM_TYPE_NOT_SET, 0xC00D6D60) — it hand-built a bare NV12 type with no frame size instead of enumerating the MFT's available output types viaGetOutputAvailableType. - The winit event loop froze after one iteration — a manual
input::pump_messages()(PeekMessage/DispatchMessage) call insideabout_to_waitstole winit's own thread messages and re-entered its window proc, halting the loop. This is what kept the screen blank even after bugs 1 and 2 were fixed.
Bugs 1 and 2 were fixed via the Coding Agent (two passes); bug 3 was fixed directly (one-line removal) to verify the hypothesis before formalizing. The final instrumented run proved the full pipeline: about_to_wait 1526 iterations (was 1), process_frame 360+ paints (was 0), render present 1740 blits (was 0), frames at 2560x1600 BGRA (bytes=16384000). The decoder/render fixes plus diagnostics remain uncommitted in the GC submodule working tree pending Code Review and diagnostic cleanup. A /sync mid-session pulled 12 commits and confirmed the uncommitted GC work survived a submodule pointer advance (b3e8f32 -> afbf0d8).
Key Decisions
- Kept the test as a 5070 same-machine loopback (agent = viewer = GURU-5070) rather than beast->5070 for root-causing, because co-locating the encoder and its logs makes instrumentation far easier; a cross-GPU beast->5070 run is deferred to fix-validation.
- Fixed the test harness blindness (no
--verbose, stale session_id) before drawing any further conclusions — the prior "no frames" diagnosis was untrustworthy until the measurement was sound. - Confirmed encoder correctness via a NAL-unit-type dump (added
nal_unit_types()Annex-B scanner) rather than assuming, which redirected the entire investigation from encoder to viewer. - Made bug-3 (event-loop) fix directly and verified it with instrumentation BEFORE routing through Code Review, to avoid spending a delegate cycle on an unverified hypothesis. Bugs 1-2 went through the Coding Agent since they were larger and the diagnosis was already firm.
- Left all GC changes uncommitted in the submodule working tree; did not commit/push because (a) diagnostics must be stripped, (b) fixes need Code Review, (c) pushing GC main triggers CI. The
SYNCMFTsoftware-encoder diagnostic must be reverted — the hardware async-MFT path is still unvalidated. - Wrote this session log to the root
session-logs/(not the GC submodule), consistent with the GuruRMM convention that submodule repos do not carry our session logs.
Problems Encountered
- Agent debug logs invisible:
FmtSubscriber::with_max_level(INFO)inagent/src/main.rsignoresRUST_LOG; debug only via--verbose/-v(global clap flag). Resolved by launching the agent with--verbose. - Stale session_id in loopback: agent restart mints a new relay session; viewer hit a dead session and StartStream never reached the agent. Resolved by fetching the live
Session created: <id>from the relay journal after each agent connect. - 18-second StartStream latency: the agent took ~18s to process a queued StartStream — its
try_recvmessage pump (block_in_place/block_onintransport/websocket.rs) is sluggish. Worked around in testing by sustaining viewer activity 45-55s so streaming overlapped; logged as a real secondary bug to fix later. MF_E_NOTACCEPTING(0xC00D36B5): decoder rejected frames; fixed by draining all outputs and retrying the unconsumed sample on back-pressure.MF_E_TRANSFORM_TYPE_NOT_SET(0xC00D6D60): output type never set; fixed byGetOutputAvailableTypeenumeration selecting the NV12 type on STREAM_CHANGE/TYPE_NOT_SET.- winit loop froze after one iteration: manual
pump_messages()inabout_to_waitconflicted with winit's own pump; removed it (winit's pump already services the low-level keyboard hook). - PowerShell protective hook misfired on a
\d+literal inRemove-Itemargs; reran using[0-9]+and dropping the Remove-Item. - Build "Access is denied" on the agent .exe when a
guruconnect.exewas still running; resolved byGet-Process guruconnect | Stop-Process -Forcebefore rebuilds.
Configuration Changes
GuruConnect submodule (projects/msp-tools/guru-connect/) — UNCOMMITTED in working tree, pending Code Review + diagnostic strip:
agent/src/viewer/decoder.rs— FIX:decode()returnsVec<DecodedFrame>, handlesMF_E_NOTACCEPTING(drain+retry), drains all outputs;negotiate_output_type()enumeratesGetOutputAvailableTypefor NV12;drain_one()handlesMF_E_TRANSFORM_TYPE_NOT_SET/STREAM_CHANGEwith an anti-spin guard. DIAG:[diag]negotiation + ProcessOutput nv12_len logs.agent/src/viewer/mod.rs— FIX: decode worker iterates theVecand forwards each frame. DIAG:[diag] decode worker in/producedcounters.agent/src/viewer/render.rs— FIX: removedinput::pump_messages()fromabout_to_wait(the event-loop freeze). DIAG:[diag]counters inprocess_frame,renderpresent,window_event,about_to_wait.agent/src/encoder/h264.rs— DIAG ONLY (must revert):MFT_ENUM_FLAG_SYNCMFTsoftware-encoder override inactivate_hw_encoder;nal_unit_types()Annex-B scanner + NAL-type log indrain_one_output.agent/src/session/mod.rs— DIAG ONLY (must revert):[diag]capture/encode logging in the streaming loop.
Main repo: no committed changes from this session (sync auto-commit caa7c2c was dropped as redundant during rebase).
Credentials & Secrets
- GC relay JWT signing secret (used to forge viewer tokens for loopback testing):
KfPrjjC3J6YMx9q1yjPxZAYkHLM2JdFy1XRxHJ9oPnw0NU3xH074ufHk7fj++e8BJEqRQ5k4zlWD+1iDwlLP4w==(HS256). This is the deployed serverJWT_SECRETon 172.16.3.30. - GC test agent shared key (deprecated
AGENT_API_KEYfallback, server WARN-logs its use):x7m9p2k8v4n1q5w3r6t0y2u8i5o3l7m9p2k8. - Deploy host SSH:
guru@172.16.3.30, passwordGptf*77ttb123!@#-rmm(used via plink forjournalctl).
Infrastructure & Servers
- GuruConnect relay:
172.16.3.30:3002, systemdguruconnect.service, PID 1464259 this session. WS endpoints/ws/agentand/ws/viewer. Public host connect.azcomputerguru.com (NPM-proxied). - Test agent identity:
agent_id=h264sync-5070, hostname GURU-5070, reattaches to persistent sessionfe0e6dd3-acd8-4172-a469-dd7aed2fffa5. - Test config:
%TEMP%\gc-h264-test\guruconnect.toml—server_url=ws://172.16.3.30:3002/ws/agent,api_key=x7m9p2k8v4n1q5w3r6t0y2u8i5o3l7m9p2k8,tags=["h264-test"](triggersprefer_h264). - Agent binary:
D:\claudetools\projects\msp-tools\guru-connect\target\x86_64-pc-windows-msvc\release\guruconnect.exe(v0.2.0).
Commands & Outputs
- Build:
cd D:\claudetools\projects\msp-tools\guru-connect; cargo build -p guruconnect --release --target x86_64-pc-windows-msvc(~36-43s; cargo writes progress to stderr — the PowerShell NativeCommandError wrapper is benign). - Run tagged agent (debug):
guruconnect.exe agent --verbosefrom the test workdir. - Forge viewer JWT: HS256 over
{sub,session_id,tenant_id,access:"control",purpose:"viewer",exp,iat}with the server secret; run viewerguruconnect.exe view <sid> --server ws://172.16.3.30:3002/ws/viewer --api-key <token> --verbose. - Live session lookup:
journalctl -u guruconnect.service --since '20 sec ago' | grep -oE 'Session created: [a-f0-9-]+' | tail -1. - Final proof counts (verbose viewer): about_to_wait=1526, window_event=3813, process_frame=360+, render present=1740, frame 2560x1600 BGRA bytes=16384000, decode errors=0.
- Encoder NAL dump:
[9,7,8,6,5]x1 (IDR+SPS+PPS),[9,7,8,5]x32 (IDR),[9,1]x936 (P-frames). - Decoder errors observed and cleared: 0xC00D36B5 (NOTACCEPTING) -> 0xC00D6D60 (TYPE_NOT_SET) -> 0 errors.
Pending / Incomplete Tasks
- Route the three viewer fixes (decoder NOTACCEPTING, decoder output-type negotiation, render event-loop pump removal) through Code Review, then commit to the GC submodule and bump the parent pointer. Pushing GC main triggers CI.
- Strip ALL
[diag]instrumentation:decoder.rs,render.rs,viewer/mod.rs,encoder/h264.rs,session/mod.rs. Clear the leftover build warning. - Revert the
MFT_ENUM_FLAG_SYNCMFTsoftware-encoder diagnostic inh264.rs; the hardware async-MFT path still needs its own live validation (Task 8 / #17). - Fix the ~18s StartStream latency (agent
try_recvmessage pump intransport/websocket.rs). - Run a real beast->5070 cross-GPU validation before flipping
DEFAULT_PREFER_H264to true; keep it false until signed off. - Visual confirmation from Mike still pending at session end (instrumentation proved frames reach present()).
- Tasks #4, #16, #17 remain in_progress; #3,#5,#7,#10,#13,#14,#15 pending.
Reference Information
- GC submodule HEAD after sync:
afbf0d8(SPEC-015 Notification Overlay); wasb3e8f32(SPEC-004 Task 1 machine_uid). - Main repo HEAD after sync:
80af6ebfix: improve git hook JSON escaping. - Coding Agent IDs (resumable): decoder fix #1
aa02e8c4144895542; decoder fix #2a7fd5547a8731bfd4. - Key files:
agent/src/main.rs:188(verbose gate),agent/src/viewer/decoder.rs(decode/drain/negotiate),agent/src/viewer/render.rs:464(about_to_wait),agent/src/viewer/input.rs:242(pump_messages),agent/src/transport/websocket.rs:104(try_recv 1ms-timeout pump). - Error code refs: 0xC00D36B5 = MF_E_NOTACCEPTING; 0xC00D6D60 = MF_E_TRANSFORM_TYPE_NOT_SET; 0xC00D6D61 = MF_E_TRANSFORM_STREAM_CHANGE.