From 89a862c993551edb02e8f465be68811b118e720e Mon Sep 17 00:00:00 2001 From: Mike Swanson Date: Sat, 21 Mar 2026 17:55:29 -0700 Subject: [PATCH] Session log: GPU diagnosis, Mac handoff, CLAUDE.md case fix - Deep diagnosis of RTX 5070 Ti GSP firmware crash (NVIDIA bug #5953411) - Power management workarounds ineffective, confirmed known Blackwell issue - Created MAC_BUILD_TASK.md handoff for M4 to do transcription - Fixed critical CLAUDE.md case sensitivity bug (lowercase never loaded on Linux) - Created Linux workstation machine spec Co-Authored-By: Claude Opus 4.6 (1M context) --- session-logs/2026-03-21-session.md | 150 +++++++++++++++++++++++++++++ 1 file changed, 150 insertions(+) diff --git a/session-logs/2026-03-21-session.md b/session-logs/2026-03-21-session.md index a9297ec..62ccd99 100644 --- a/session-logs/2026-03-21-session.md +++ b/session-logs/2026-03-21-session.md @@ -820,3 +820,153 @@ Invoke-Command -VMName 'VWP-FILES' -Credential ... -ScriptBlock { Start-Process - `projects/radio-show/audio-processor/gpu_debug_transcribe.py` — GPU diagnostic batch transcription script - `~/.claude/projects/-home-guru-ClaudeTools/memory/reference_dataforth_contact.md` — AJ/dataforthgit memory - `~/.claude/projects/-home-guru-ClaudeTools/memory/MEMORY.md` — Updated index + +--- + +## Update: 17:50 — GPU Diagnosis, Mac Handoff, CLAUDE.md Case Sensitivity Fix + +### Session Summary + +Deep diagnosis of RTX 5070 Ti GPU firmware crash during voice training transcription. Confirmed the issue is a known NVIDIA Blackwell GSP firmware bug. Created handoff document for Mac M4 to build native audio processor. Discovered and fixed critical config bug: `claude.md` (lowercase) was never being loaded on Linux due to case-sensitive filesystem. + +### Key Decisions + +1. **GPU workarounds exhausted** — Tried disabling Runtime D3, enabling persistence mode, locking GPU clocks. None prevented the GSP crash. Power management is not the cause. +2. **Offload to Mac** — Rather than cross-building, created a detailed handoff doc (`MAC_BUILD_TASK.md`) for the Mac's Claude instance to build natively for M4 hardware. +3. **CLAUDE.md fix** — Renamed `claude.md` to `CLAUDE.md` in the repo. This was the root cause of behavioral differences between Mac and Linux instances. + +### GPU Diagnosis Details + +**Problem:** RTX 5070 Ti enters full ERR! state after ~3-5 minutes of sustained GPU compute (Whisper transcription). + +**Error:** `NVRM: _issueRpcLarge: rpcSendMessage failed with status 0x00000062 for fn 76!` + +**Timeline of test run:** +- 17:03: Started transcription of `2011-06-04-hr1.mp3` +- 17:03:59: GPU loaded model, 3908MB VRAM, idle +- 17:04:14: GPU at 52C, 72W, 97% utilization, 5394MB VRAM — processing +- 17:04:59: GPU at 56C, 72W, 86% utilization — normal operation +- 17:07: nvidia-smi timed out (5s) — GPU stopped responding to queries +- 17:07:20: ctranslate2 crashed: `IndexError: vector::_M_range_check` (244 segments completed) +- 17:21-17:24: dmesg flooded with NVRM rpcSendMessage 0x00000062 errors +- GPU in full ERR! state, nvidia-smi shows ERR! across all fields + +**What we tried (all ineffective):** +```bash +# Disable Runtime D3 power management +sudo bash -c 'echo on > /sys/bus/pci/devices/0000:02:00.0/power/control' + +# Enable persistence mode +sudo nvidia-smi -pm 1 + +# Lock GPU clocks +sudo nvidia-smi -lgc 300,2100 +``` + +**Root cause:** Known NVIDIA GSP (GPU System Processor) firmware bug on all RTX 50-series (Blackwell) GPUs under Linux. Affects drivers 580.x, 590.x, 595.x. NVIDIA internal bug #5953411 filed. Cannot disable GSP on 50-series (open kernel module required). + +**Key research findings:** +- RTX 50-series requires open kernel module — proprietary driver won't work +- Cannot set `NVreg_EnableGpuFirmware=0` (only worked on 30/40-series) +- ASUS ROG Zephyrus G14 with same RTX 5070 Mobile has identical issue (NVIDIA forums) +- BIOS firmware updates helped some users +- CachyOS forum has multiple threads with same symptoms +- Temperature was fine (55C), power normal (73W), VRAM not full (5.4/12GB) + +**Relevant links:** +- https://discuss.cachyos.org/t/rtx-5070-blackwell-errors/24323 +- https://discuss.cachyos.org/t/rtx-5070-gpu-crashing-regularly/22786 +- https://forums.developer.nvidia.com/t/rtx-5070-mobile-blackwell-gsp-timeouts-0x0000ca7d-xid-79-on-kernel-6-17-driver-580-126-ubuntu-24-04-4/360897 +- https://github.com/NVIDIA/open-gpu-kernel-modules/issues/1045 + +### Reboot Behavior Note + +Both times GPU error required reboot, warm reboot hung with spinning symbol — required hard power-off. Saved to workstation memory. + +### Venv Setup for Audio Processor + +Created `.venv` with `--system-site-packages` (to reuse system torch 2.10.0 and avoid re-downloading 2.5GB of CUDA packages): +```bash +cd projects/radio-show/audio-processor +python3 -m venv --system-site-packages .venv +.venv/bin/pip install --no-deps -e . +.venv/bin/pip install rich faster-whisper pyannote.audio pydub librosa scikit-learn ollama pyyaml +``` + +**CUDA library path issue:** `faster-whisper`'s ctranslate2 needs `libcublas.so.12` but system has CUDA 13.2. The `src/gpu.py` sets `LD_LIBRARY_PATH` at runtime, but this is too late — must be set before Python launch: +```bash +LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v12:$LD_LIBRARY_PATH .venv/bin/python ... +``` + +### Mac Handoff + +**File created:** `projects/radio-show/audio-processor/MAC_BUILD_TASK.md` + +Comprehensive handoff document for Mac Claude instance to build a native version of the audio processor on M4 hardware. Includes: +- Full context of the GPU bug and why Mac is needed +- Architecture reference from existing code +- Key thresholds and config values +- What already works (voice profiles, 1 transcript) +- What needs doing (8 remaining episode transcriptions) +- Mac-specific build notes (CPU/MPS backend, no CUDA) +- Session log references for full context + +**Mac specs** (from `.claude/machines/mikes-macbook-air.md`): +- MacBook Air M4, 10-core CPU, 16GB unified memory +- macOS 26.3.1, ClaudeTools repo cloned at `/Users/azcomputerguru/ClaudeTools` +- Ollama running with qwen3:14b, codestral:22b, nomic-embed-text +- Reachable on local network at 10.3.231.192 (SSH not enabled) + +### CLAUDE.md Case Sensitivity Fix (CRITICAL) + +**Root cause of behavioral differences between Mac and Linux:** + +The file `.claude/claude.md` (lowercase) was committed to git. Claude Code auto-loads `CLAUDE.md` (uppercase) at startup. On macOS (case-insensitive APFS), both resolve to the same file — directives always loaded. On Linux (case-sensitive ext4/btrfs), `CLAUDE.md` doesn't exist, so **all project directives were silently missing** on the Linux workstation. + +This explains: +- Asking for context already available in credentials.md +- Not following delegation rules +- Confusing client locations/networks +- Behavioral inconsistency vs the Mac + +**Fix:** `git mv .claude/claude.md .claude/CLAUDE.md` — committed and pushed. + +### Machine Spec Created + +**File created:** `.claude/machines/acg-guru-5070.md` + +Linux workstation spec matching the format the Mac uses: +- Lenovo Legion Pro 7 16IAX10H +- Intel Core Ultra 9 275HX (24 cores, 5.4 GHz) +- 32GB DDR5, RTX 5070 Ti (12GB VRAM) +- Dual 954GB NVMe (btrfs root + ext4 /home) +- CachyOS, Kernel 6.19.9-1, KDE Plasma 6.6.3 (Wayland) +- NVIDIA driver 595.45.04 (open kernel module) +- Documents GPU firmware bug and custom audio kernel + +### Workstation Memory Updated + +Updated `reference_workstation_setup.md` with known issue: warm reboot hangs requiring hard power-off. + +### Infrastructure + +- **Gitea:** git.azcomputerguru.com (HTTPS, user: mike@azcomputerguru.com, pass: Gptf*77ttb123!@#-git) +- **NVIDIA Driver:** 595.45.04 (open kernel module) +- **CUDA:** 13.2 (system), 12 libs at `/usr/local/lib/ollama/cuda_v12/` +- **PyTorch:** 2.10.0 (system-wide), CUDA 13.1 + +### Files Created/Modified This Update + +- `projects/radio-show/audio-processor/MAC_BUILD_TASK.md` — Mac build handoff document +- `projects/radio-show/audio-processor/.venv/` — Python venv with audio processor deps +- `.claude/machines/acg-guru-5070.md` — Linux workstation machine spec +- `.claude/CLAUDE.md` — Renamed from `claude.md` (case-sensitivity fix) +- `~/.claude/projects/-home-guru-ClaudeTools/memory/reference_workstation_setup.md` — Added reboot hang note + +### Pending/Next Steps + +1. **Mac builds audio processor** — Pull from Gitea, read MAC_BUILD_TASK.md, build natively, transcribe 8 episodes +2. **BIOS update** — Check for ASUS BIOS update for Legion Pro 7 16IAX10H (may help GPU stability) +3. **GPU after reboot** — Currently in ERR! state, needs hard power-off to recover +4. **Verify CLAUDE.md loads** — Next session on this machine should auto-load directives correctly +5. **Enable SSH on Mac** — Would make cross-machine workflow smoother (System Settings > Sharing > Remote Login)