sync: auto-sync from GURU-5070 at 2026-05-31 09:41:03
Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-05-31 09:41:03
This commit is contained in:
89
session-logs/2026-05-31-mike-guruconnect-h264-decoder.md
Normal file
89
session-logs/2026-05-31-mike-guruconnect-h264-decoder.md
Normal file
@@ -0,0 +1,89 @@
|
|||||||
|
## User
|
||||||
|
- **User:** Mike Swanson (mike)
|
||||||
|
- **Machine:** GURU-5070
|
||||||
|
- **Role:** admin
|
||||||
|
|
||||||
|
## Session Summary
|
||||||
|
|
||||||
|
Resolved the GuruConnect H.264 "blank white viewer" bug end to end. The session began continuing a pre-compaction investigation whose working theory was that the agent's hardware H.264 encoder emitted no frames (Task 8 async-MFT). That theory was disproven: it was a measurement artifact. The agent ignores `RUST_LOG` and only enables debug logging via a `--verbose` CLI flag, so every `tracing::debug!("[diag] ...")` line added earlier was silently filtered, making the capture/encode loop look dead. Compounding this, the loopback test had hardcoded a stale `session_id`; after an agent restart the relay minted a new session, so the viewer's StartStream was delivered to a dead session and never reached the live agent.
|
||||||
|
|
||||||
|
After fixing the test harness (run agent with `--verbose`, always fetch the live `session_id` from the relay journal after the agent connects), a clean loopback run on session `fe0e6dd3` showed the encoder was correct all along: 900+ H.264 frames produced and sent, with a NAL-type dump confirming proper structure — first frame `[9,7,8,6,5]` (AUD+SPS+PPS+SEI+IDR), recurring `[9,7,8,5]` IDR keyframes (~1s GOP), and `[9,1]` P-frames. The bug was entirely on the viewer side, in the never-before-run native viewer ("COMPILE-VERIFIED ONLY" decoder).
|
||||||
|
|
||||||
|
Three stacked viewer bugs were found and fixed in sequence, each revealed only after fixing the prior one, by progressively instrumenting and re-running the 5070 same-machine loopback:
|
||||||
|
1. Decoder mishandled `MF_E_NOTACCEPTING` (0xC00D36B5) — fed input without draining output and treated back-pressure as fatal, so after the first GOP filled the MFT it rejected every subsequent frame.
|
||||||
|
2. Decoder never negotiated its output type (`MF_E_TRANSFORM_TYPE_NOT_SET`, 0xC00D6D60) — it hand-built a bare NV12 type with no frame size instead of enumerating the MFT's available output types via `GetOutputAvailableType`.
|
||||||
|
3. The winit event loop froze after one iteration — a manual `input::pump_messages()` (PeekMessage/DispatchMessage) call inside `about_to_wait` stole winit's own thread messages and re-entered its window proc, halting the loop. This is what kept the screen blank even after bugs 1 and 2 were fixed.
|
||||||
|
|
||||||
|
Bugs 1 and 2 were fixed via the Coding Agent (two passes); bug 3 was fixed directly (one-line removal) to verify the hypothesis before formalizing. The final instrumented run proved the full pipeline: `about_to_wait` 1526 iterations (was 1), `process_frame` 360+ paints (was 0), `render present` 1740 blits (was 0), frames at 2560x1600 BGRA (`bytes=16384000`). The decoder/render fixes plus diagnostics remain uncommitted in the GC submodule working tree pending Code Review and diagnostic cleanup. A `/sync` mid-session pulled 12 commits and confirmed the uncommitted GC work survived a submodule pointer advance (b3e8f32 -> afbf0d8).
|
||||||
|
|
||||||
|
## Key Decisions
|
||||||
|
|
||||||
|
- Kept the test as a 5070 same-machine loopback (agent = viewer = GURU-5070) rather than beast->5070 for root-causing, because co-locating the encoder and its logs makes instrumentation far easier; a cross-GPU beast->5070 run is deferred to fix-validation.
|
||||||
|
- Fixed the test harness blindness (no `--verbose`, stale session_id) before drawing any further conclusions — the prior "no frames" diagnosis was untrustworthy until the measurement was sound.
|
||||||
|
- Confirmed encoder correctness via a NAL-unit-type dump (added `nal_unit_types()` Annex-B scanner) rather than assuming, which redirected the entire investigation from encoder to viewer.
|
||||||
|
- Made bug-3 (event-loop) fix directly and verified it with instrumentation BEFORE routing through Code Review, to avoid spending a delegate cycle on an unverified hypothesis. Bugs 1-2 went through the Coding Agent since they were larger and the diagnosis was already firm.
|
||||||
|
- Left all GC changes uncommitted in the submodule working tree; did not commit/push because (a) diagnostics must be stripped, (b) fixes need Code Review, (c) pushing GC main triggers CI. The `SYNCMFT` software-encoder diagnostic must be reverted — the hardware async-MFT path is still unvalidated.
|
||||||
|
- Wrote this session log to the root `session-logs/` (not the GC submodule), consistent with the GuruRMM convention that submodule repos do not carry our session logs.
|
||||||
|
|
||||||
|
## Problems Encountered
|
||||||
|
|
||||||
|
- Agent debug logs invisible: `FmtSubscriber::with_max_level(INFO)` in `agent/src/main.rs` ignores `RUST_LOG`; debug only via `--verbose`/`-v` (global clap flag). Resolved by launching the agent with `--verbose`.
|
||||||
|
- Stale session_id in loopback: agent restart mints a new relay session; viewer hit a dead session and StartStream never reached the agent. Resolved by fetching the live `Session created: <id>` from the relay journal after each agent connect.
|
||||||
|
- 18-second StartStream latency: the agent took ~18s to process a queued StartStream — its `try_recv` message pump (`block_in_place`/`block_on` in `transport/websocket.rs`) is sluggish. Worked around in testing by sustaining viewer activity 45-55s so streaming overlapped; logged as a real secondary bug to fix later.
|
||||||
|
- `MF_E_NOTACCEPTING` (0xC00D36B5): decoder rejected frames; fixed by draining all outputs and retrying the unconsumed sample on back-pressure.
|
||||||
|
- `MF_E_TRANSFORM_TYPE_NOT_SET` (0xC00D6D60): output type never set; fixed by `GetOutputAvailableType` enumeration selecting the NV12 type on STREAM_CHANGE/TYPE_NOT_SET.
|
||||||
|
- winit loop froze after one iteration: manual `pump_messages()` in `about_to_wait` conflicted with winit's own pump; removed it (winit's pump already services the low-level keyboard hook).
|
||||||
|
- PowerShell protective hook misfired on a `\d+` literal in `Remove-Item` args; reran using `[0-9]+` and dropping the Remove-Item.
|
||||||
|
- Build "Access is denied" on the agent .exe when a `guruconnect.exe` was still running; resolved by `Get-Process guruconnect | Stop-Process -Force` before rebuilds.
|
||||||
|
|
||||||
|
## Configuration Changes
|
||||||
|
|
||||||
|
GuruConnect submodule (`projects/msp-tools/guru-connect/`) — UNCOMMITTED in working tree, pending Code Review + diagnostic strip:
|
||||||
|
- `agent/src/viewer/decoder.rs` — FIX: `decode()` returns `Vec<DecodedFrame>`, handles `MF_E_NOTACCEPTING` (drain+retry), drains all outputs; `negotiate_output_type()` enumerates `GetOutputAvailableType` for NV12; `drain_one()` handles `MF_E_TRANSFORM_TYPE_NOT_SET`/`STREAM_CHANGE` with an anti-spin guard. DIAG: `[diag]` negotiation + ProcessOutput nv12_len logs.
|
||||||
|
- `agent/src/viewer/mod.rs` — FIX: decode worker iterates the `Vec` and forwards each frame. DIAG: `[diag] decode worker in/produced` counters.
|
||||||
|
- `agent/src/viewer/render.rs` — FIX: removed `input::pump_messages()` from `about_to_wait` (the event-loop freeze). DIAG: `[diag]` counters in `process_frame`, `render` present, `window_event`, `about_to_wait`.
|
||||||
|
- `agent/src/encoder/h264.rs` — DIAG ONLY (must revert): `MFT_ENUM_FLAG_SYNCMFT` software-encoder override in `activate_hw_encoder`; `nal_unit_types()` Annex-B scanner + NAL-type log in `drain_one_output`.
|
||||||
|
- `agent/src/session/mod.rs` — DIAG ONLY (must revert): `[diag]` capture/encode logging in the streaming loop.
|
||||||
|
|
||||||
|
Main repo: no committed changes from this session (sync auto-commit `caa7c2c` was dropped as redundant during rebase).
|
||||||
|
|
||||||
|
## Credentials & Secrets
|
||||||
|
|
||||||
|
- GC relay JWT signing secret (used to forge viewer tokens for loopback testing): `KfPrjjC3J6YMx9q1yjPxZAYkHLM2JdFy1XRxHJ9oPnw0NU3xH074ufHk7fj++e8BJEqRQ5k4zlWD+1iDwlLP4w==` (HS256). This is the deployed server `JWT_SECRET` on 172.16.3.30.
|
||||||
|
- GC test agent shared key (deprecated `AGENT_API_KEY` fallback, server WARN-logs its use): `x7m9p2k8v4n1q5w3r6t0y2u8i5o3l7m9p2k8`.
|
||||||
|
- Deploy host SSH: `guru@172.16.3.30`, password `Gptf*77ttb123!@#-rmm` (used via plink for `journalctl`).
|
||||||
|
|
||||||
|
## Infrastructure & Servers
|
||||||
|
|
||||||
|
- GuruConnect relay: `172.16.3.30:3002`, systemd `guruconnect.service`, PID 1464259 this session. WS endpoints `/ws/agent` and `/ws/viewer`. Public host connect.azcomputerguru.com (NPM-proxied).
|
||||||
|
- Test agent identity: `agent_id=h264sync-5070`, hostname GURU-5070, reattaches to persistent session `fe0e6dd3-acd8-4172-a469-dd7aed2fffa5`.
|
||||||
|
- Test config: `%TEMP%\gc-h264-test\guruconnect.toml` — `server_url=ws://172.16.3.30:3002/ws/agent`, `api_key=x7m9p2k8v4n1q5w3r6t0y2u8i5o3l7m9p2k8`, `tags=["h264-test"]` (triggers `prefer_h264`).
|
||||||
|
- Agent binary: `D:\claudetools\projects\msp-tools\guru-connect\target\x86_64-pc-windows-msvc\release\guruconnect.exe` (v0.2.0).
|
||||||
|
|
||||||
|
## Commands & Outputs
|
||||||
|
|
||||||
|
- Build: `cd D:\claudetools\projects\msp-tools\guru-connect; cargo build -p guruconnect --release --target x86_64-pc-windows-msvc` (~36-43s; cargo writes progress to stderr — the PowerShell NativeCommandError wrapper is benign).
|
||||||
|
- Run tagged agent (debug): `guruconnect.exe agent --verbose` from the test workdir.
|
||||||
|
- Forge viewer JWT: HS256 over `{sub,session_id,tenant_id,access:"control",purpose:"viewer",exp,iat}` with the server secret; run viewer `guruconnect.exe view <sid> --server ws://172.16.3.30:3002/ws/viewer --api-key <token> --verbose`.
|
||||||
|
- Live session lookup: `journalctl -u guruconnect.service --since '20 sec ago' | grep -oE 'Session created: [a-f0-9-]+' | tail -1`.
|
||||||
|
- Final proof counts (verbose viewer): about_to_wait=1526, window_event=3813, process_frame=360+, render present=1740, frame 2560x1600 BGRA bytes=16384000, decode errors=0.
|
||||||
|
- Encoder NAL dump: `[9,7,8,6,5]` x1 (IDR+SPS+PPS), `[9,7,8,5]` x32 (IDR), `[9,1]` x936 (P-frames).
|
||||||
|
- Decoder errors observed and cleared: 0xC00D36B5 (NOTACCEPTING) -> 0xC00D6D60 (TYPE_NOT_SET) -> 0 errors.
|
||||||
|
|
||||||
|
## Pending / Incomplete Tasks
|
||||||
|
|
||||||
|
- Route the three viewer fixes (decoder NOTACCEPTING, decoder output-type negotiation, render event-loop pump removal) through Code Review, then commit to the GC submodule and bump the parent pointer. Pushing GC main triggers CI.
|
||||||
|
- Strip ALL `[diag]` instrumentation: `decoder.rs`, `render.rs`, `viewer/mod.rs`, `encoder/h264.rs`, `session/mod.rs`. Clear the leftover build warning.
|
||||||
|
- Revert the `MFT_ENUM_FLAG_SYNCMFT` software-encoder diagnostic in `h264.rs`; the hardware async-MFT path still needs its own live validation (Task 8 / #17).
|
||||||
|
- Fix the ~18s StartStream latency (agent `try_recv` message pump in `transport/websocket.rs`).
|
||||||
|
- Run a real beast->5070 cross-GPU validation before flipping `DEFAULT_PREFER_H264` to true; keep it false until signed off.
|
||||||
|
- Visual confirmation from Mike still pending at session end (instrumentation proved frames reach present()).
|
||||||
|
- Tasks #4, #16, #17 remain in_progress; #3,#5,#7,#10,#13,#14,#15 pending.
|
||||||
|
|
||||||
|
## Reference Information
|
||||||
|
|
||||||
|
- GC submodule HEAD after sync: `afbf0d8` (SPEC-015 Notification Overlay); was `b3e8f32` (SPEC-004 Task 1 machine_uid).
|
||||||
|
- Main repo HEAD after sync: `80af6eb` fix: improve git hook JSON escaping.
|
||||||
|
- Coding Agent IDs (resumable): decoder fix #1 `aa02e8c4144895542`; decoder fix #2 `a7fd5547a8731bfd4`.
|
||||||
|
- Key files: `agent/src/main.rs:188` (verbose gate), `agent/src/viewer/decoder.rs` (decode/drain/negotiate), `agent/src/viewer/render.rs:464` (about_to_wait), `agent/src/viewer/input.rs:242` (pump_messages), `agent/src/transport/websocket.rs:104` (try_recv 1ms-timeout pump).
|
||||||
|
- Error code refs: 0xC00D36B5 = MF_E_NOTACCEPTING; 0xC00D6D60 = MF_E_TRANSFORM_TYPE_NOT_SET; 0xC00D6D61 = MF_E_TRANSFORM_STREAM_CHANGE.
|
||||||
Reference in New Issue
Block a user