Session log: GPU diagnosis, Mac handoff, CLAUDE.md case fix

- Deep diagnosis of RTX 5070 Ti GSP firmware crash (NVIDIA bug #5953411)
- Power management workarounds ineffective, confirmed known Blackwell issue
- Created MAC_BUILD_TASK.md handoff for M4 to do transcription
- Fixed critical CLAUDE.md case sensitivity bug (lowercase never loaded on Linux)
- Created Linux workstation machine spec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 17:55:29 -07:00
parent 5362dc780a
commit 89a862c993


@@ -820,3 +820,153 @@ Invoke-Command -VMName 'VWP-FILES' -Credential ... -ScriptBlock { Start-Process
- `projects/radio-show/audio-processor/gpu_debug_transcribe.py` — GPU diagnostic batch transcription script
- `~/.claude/projects/-home-guru-ClaudeTools/memory/reference_dataforth_contact.md` — AJ/dataforthgit memory
- `~/.claude/projects/-home-guru-ClaudeTools/memory/MEMORY.md` — Updated index
---
## Update: 17:50 — GPU Diagnosis, Mac Handoff, CLAUDE.md Case Sensitivity Fix
### Session Summary
Deep diagnosis of RTX 5070 Ti GPU firmware crash during voice training transcription. Confirmed the issue is a known NVIDIA Blackwell GSP firmware bug. Created handoff document for Mac M4 to build native audio processor. Discovered and fixed critical config bug: `claude.md` (lowercase) was never being loaded on Linux due to case-sensitive filesystem.
### Key Decisions
1. **GPU workarounds exhausted** — Tried disabling Runtime D3, enabling persistence mode, locking GPU clocks. None prevented the GSP crash. Power management is not the cause.
2. **Offload to Mac** — Rather than cross-building, created a detailed handoff doc (`MAC_BUILD_TASK.md`) for the Mac's Claude instance to build natively for M4 hardware.
3. **CLAUDE.md fix** — Renamed `claude.md` to `CLAUDE.md` in the repo. This was the root cause of behavioral differences between Mac and Linux instances.
### GPU Diagnosis Details
**Problem:** RTX 5070 Ti enters full ERR! state after ~3-5 minutes of sustained GPU compute (Whisper transcription).
**Error:** `NVRM: _issueRpcLarge: rpcSendMessage failed with status 0x00000062 for fn 76!`
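A minimal sketch for spotting this error signature in captured `dmesg` output. The regex is modeled only on the single NVRM line quoted above, so other driver versions may word the message differently; `find_rpc_failures` is a hypothetical helper, not part of the audio processor.

```python
import re

# Pattern modeled on the one NVRM line quoted in this log (assumption:
# other driver versions may format the message differently).
NVRM_RPC_FAILURE = re.compile(
    r"NVRM: (?P<func>\w+): rpcSendMessage failed with status "
    r"(?P<status>0x[0-9a-fA-F]+) for fn (?P<fn>\d+)"
)


def find_rpc_failures(log_text):
    """Return (status, fn) tuples for every GSP RPC failure found in log_text."""
    return [
        (m.group("status"), int(m.group("fn")))
        for m in NVRM_RPC_FAILURE.finditer(log_text)
    ]
```

Feeding it the text of `dmesg` (for example, captured to a file) gives a quick count of how many RPC failures were logged and which status codes appeared.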
**Timeline of test run:**
- 17:03: Started transcription of `2011-06-04-hr1.mp3`
- 17:03:59: GPU loaded model, 3908MB VRAM, idle
- 17:04:14: GPU at 52C, 72W, 97% utilization, 5394MB VRAM — processing
- 17:04:59: GPU at 56C, 72W, 86% utilization — normal operation
- 17:07: nvidia-smi timed out (5s) — GPU stopped responding to queries
- 17:07:20: ctranslate2 crashed: `IndexError: vector::_M_range_check` (244 segments completed)
- 17:21-17:24: dmesg flooded with NVRM rpcSendMessage 0x00000062 errors
- GPU in full ERR! state, nvidia-smi shows ERR! across all fields
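The timeline above came from polling `nvidia-smi` during the run. A rough sketch of that kind of watchdog follows, assuming the same 5-second timeout; `parse_sample` and `poll_gpu` are hypothetical names, and the query fields are standard `nvidia-smi --query-gpu` properties.

```python
import subprocess

QUERY_FIELDS = "temperature.gpu,power.draw,utilization.gpu,memory.used"


def parse_sample(csv_line):
    """Parse one CSV line of nvidia-smi output into a dict of floats.

    Raises ValueError if a field is not numeric (for example "ERR!").
    """
    temp, power, util, mem = (float(v.strip()) for v in csv_line.split(","))
    return {"temp_c": temp, "power_w": power, "util_pct": util, "mem_mib": mem}


def poll_gpu(timeout_s=5.0):
    """Return one sample dict, or None if nvidia-smi hangs, fails, or reports ERR!."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={QUERY_FIELDS}",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
        # Only the first GPU is read; this workstation has a single NVIDIA GPU.
        return parse_sample(result.stdout.strip().splitlines()[0])
    except (
        subprocess.TimeoutExpired,
        subprocess.CalledProcessError,
        FileNotFoundError,
        ValueError,
        IndexError,
    ):
        return None
```

Calling `poll_gpu()` every few seconds and stopping the transcription job on the first `None` would at least record the last good sample before the GPU stops answering.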
**What we tried (all ineffective):**
```bash
# Disable Runtime D3 power management
sudo bash -c 'echo on > /sys/bus/pci/devices/0000:02:00.0/power/control'
# Enable persistence mode
sudo nvidia-smi -pm 1
# Lock GPU clocks
sudo nvidia-smi -lgc 300,2100
```
**Root cause:** Known NVIDIA GSP (GPU System Processor) firmware bug on all RTX 50-series (Blackwell) GPUs under Linux. Affects drivers 580.x, 590.x, 595.x. NVIDIA internal bug #5953411 filed. Cannot disable GSP on 50-series (open kernel module required).
**Key research findings:**
- RTX 50-series requires open kernel module — proprietary driver won't work
- Cannot set `NVreg_EnableGpuFirmware=0` (only worked on 30/40-series)
- ASUS ROG Zephyrus G14 with an RTX 5070 Mobile shows the same issue (NVIDIA forums)
- BIOS firmware updates helped some users
- CachyOS forum has multiple threads with same symptoms
- Temperature was fine (55C), power normal (73W), VRAM not full (5.4/12GB)
**Relevant links:**
- https://discuss.cachyos.org/t/rtx-5070-blackwell-errors/24323
- https://discuss.cachyos.org/t/rtx-5070-gpu-crashing-regularly/22786
- https://forums.developer.nvidia.com/t/rtx-5070-mobile-blackwell-gsp-timeouts-0x0000ca7d-xid-79-on-kernel-6-17-driver-580-126-ubuntu-24-04-4/360897
- https://github.com/NVIDIA/open-gpu-kernel-modules/issues/1045
### Reboot Behavior Note
Both times the GPU error forced a reboot, the warm reboot hung on the spinning symbol and required a hard power-off. Saved to workstation memory.
### Venv Setup for Audio Processor
Created `.venv` with `--system-site-packages` (to reuse system torch 2.10.0 and avoid re-downloading 2.5GB of CUDA packages):
```bash
cd projects/radio-show/audio-processor
python3 -m venv --system-site-packages .venv
.venv/bin/pip install --no-deps -e .
.venv/bin/pip install rich faster-whisper pyannote.audio pydub librosa scikit-learn ollama pyyaml
```
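A small sketch for confirming that the venv really was created with `--system-site-packages`, by reading the `include-system-site-packages` key that `python3 -m venv` writes to `pyvenv.cfg`. Both function names are hypothetical.

```python
from pathlib import Path


def uses_system_site_packages(cfg_text):
    """Return True if pyvenv.cfg text enables system site-packages."""
    for line in cfg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "include-system-site-packages":
            return value.strip().lower() == "true"
    return False


def venv_uses_system_site_packages(venv_dir=".venv"):
    """Read <venv_dir>/pyvenv.cfg and report the include-system-site-packages flag."""
    return uses_system_site_packages((Path(venv_dir) / "pyvenv.cfg").read_text())
```

If this returns `False`, the venv will not see the system torch install and pip would pull the CUDA packages again.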
**CUDA library path issue:** `faster-whisper`'s ctranslate2 needs `libcublas.so.12`, but the system has CUDA 13.2. `src/gpu.py` sets `LD_LIBRARY_PATH` at runtime, but this is too late: the dynamic loader reads `LD_LIBRARY_PATH` once at process start, so it must be set before Python launches:
```bash
LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v12:$LD_LIBRARY_PATH .venv/bin/python ...
```
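An alternative to prefixing every command is to have the entry point re-exec itself once with the corrected environment. This is only a sketch of that common pattern, not what `src/gpu.py` currently does; `env_with_lib_dir` and `ensure_cuda12_on_path` are hypothetical names, and the directory is the one recorded in this log.

```python
import os
import sys

# Path taken from this log; adjust for other machines.
CUDA12_LIB_DIR = "/usr/local/lib/ollama/cuda_v12"


def env_with_lib_dir(environ, lib_dir):
    """Return a copy of environ with lib_dir prepended to LD_LIBRARY_PATH,
    or None if lib_dir is already on the path."""
    current = environ.get("LD_LIBRARY_PATH", "")
    parts = [p for p in current.split(":") if p]
    if lib_dir in parts:
        return None
    new_env = dict(environ)
    new_env["LD_LIBRARY_PATH"] = ":".join([lib_dir] + parts)
    return new_env


def ensure_cuda12_on_path():
    """Re-exec the interpreter once so the dynamic loader sees the new path.

    Does nothing when the directory is already on LD_LIBRARY_PATH, which is
    what stops the re-exec from looping.
    """
    new_env = env_with_lib_dir(os.environ, CUDA12_LIB_DIR)
    if new_env is not None:
        os.execve(sys.executable, [sys.executable] + sys.argv, new_env)
```

`ensure_cuda12_on_path()` would need to run at the very top of the script, before `ctranslate2` or `faster_whisper` is imported. The sketch assumes the program is started as a plain script; launching with `python -m` changes `sys.argv` and would need separate handling.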
### Mac Handoff
**File created:** `projects/radio-show/audio-processor/MAC_BUILD_TASK.md`
Comprehensive handoff document for Mac Claude instance to build a native version of the audio processor on M4 hardware. Includes:
- Full context of the GPU bug and why Mac is needed
- Architecture reference from existing code
- Key thresholds and config values
- What already works (voice profiles, 1 transcript)
- What needs doing (8 remaining episode transcriptions)
- Mac-specific build notes (CPU/MPS backend, no CUDA)
- Session log references for full context
**Mac specs** (from `.claude/machines/mikes-macbook-air.md`):
- MacBook Air M4, 10-core CPU, 16GB unified memory
- macOS 26.3.1, ClaudeTools repo cloned at `/Users/azcomputerguru/ClaudeTools`
- Ollama running with qwen3:14b, codestral:22b, nomic-embed-text
- Reachable on local network at 10.3.231.192 (SSH not enabled)
### CLAUDE.md Case Sensitivity Fix (CRITICAL)
**Root cause of behavioral differences between Mac and Linux:**
The file `.claude/claude.md` (lowercase) was committed to git. Claude Code auto-loads `CLAUDE.md` (uppercase) at startup. On macOS (case-insensitive APFS), both resolve to the same file — directives always loaded. On Linux (case-sensitive ext4/btrfs), `CLAUDE.md` doesn't exist, so **all project directives were silently missing** on the Linux workstation.
This explains:
- Asking for context already available in credentials.md
- Not following delegation rules
- Confusing client locations/networks
- Behavioral inconsistency vs the Mac
**Fix:** `git mv .claude/claude.md .claude/CLAUDE.md` — committed and pushed.
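To catch this class of bug on any machine, a check like the following could flag files whose names match an expected name only when case is ignored. It is a sketch with hypothetical helper names. Because `os.listdir` returns names as stored, it reports `claude.md` even on a case-insensitive macOS volume.

```python
import os


def case_mismatches(names, expected="CLAUDE.md"):
    """Return names that equal expected only when case is ignored."""
    return [n for n in names if n != expected and n.lower() == expected.lower()]


def check_directory(path, expected="CLAUDE.md"):
    """List entries in path that a case-sensitive filesystem would not treat as expected."""
    return case_mismatches(os.listdir(path), expected)
```

Running `check_directory(".claude")` from the repo root before the rename would have returned `["claude.md"]`; after the fix it should return an empty list.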
### Machine Spec Created
**File created:** `.claude/machines/acg-guru-5070.md`
Linux workstation spec matching the format the Mac uses:
- Lenovo Legion Pro 7 16IAX10H
- Intel Core Ultra 9 275HX (24 cores, 5.4 GHz)
- 32GB DDR5, RTX 5070 Ti (12GB VRAM)
- Dual 954GB NVMe (btrfs root + ext4 /home)
- CachyOS, Kernel 6.19.9-1, KDE Plasma 6.6.3 (Wayland)
- NVIDIA driver 595.45.04 (open kernel module)
- Documents GPU firmware bug and custom audio kernel
### Workstation Memory Updated
Updated `reference_workstation_setup.md` with known issue: warm reboot hangs requiring hard power-off.
### Infrastructure
- **Gitea:** git.azcomputerguru.com (HTTPS, user: mike@azcomputerguru.com, pass: Gptf*77ttb123!@#-git)
- **NVIDIA Driver:** 595.45.04 (open kernel module)
- **CUDA:** 13.2 (system), CUDA 12 libs at `/usr/local/lib/ollama/cuda_v12/`
- **PyTorch:** 2.10.0 (system-wide), CUDA 13.1
### Files Created/Modified This Update
- `projects/radio-show/audio-processor/MAC_BUILD_TASK.md` — Mac build handoff document
- `projects/radio-show/audio-processor/.venv/` — Python venv with audio processor deps
- `.claude/machines/acg-guru-5070.md` — Linux workstation machine spec
- `.claude/CLAUDE.md` — Renamed from `claude.md` (case-sensitivity fix)
- `~/.claude/projects/-home-guru-ClaudeTools/memory/reference_workstation_setup.md` — Added reboot hang note
### Pending/Next Steps
1. **Mac builds audio processor** — Pull from Gitea, read MAC_BUILD_TASK.md, build natively, transcribe 8 episodes
2. **BIOS update** — Check for a Lenovo BIOS update for the Legion Pro 7 16IAX10H (may help GPU stability)
3. **GPU after reboot** — Currently in ERR! state, needs hard power-off to recover
4. **Verify CLAUDE.md loads** — Next session on this machine should auto-load directives correctly
5. **Enable SSH on Mac** — Would make cross-machine workflow smoother (System Settings > Sharing > Remote Login)