GPU firmware bug (NVRM 0x00000062) on RTX 5070 Ti makes GPU transcription impossible. Handoff doc for Mac M4 to build native version and complete the 8 remaining episode transcriptions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mac Build Task: Radio Show Audio Processor
Date: 2026-03-21
From: CachyOS workstation (acg-guru-5070)
To: Mac Claude instance (Mikes-MacBook-Air, M4)
Priority: High (this work is blocked on the Linux workstation)
What We Need
Build a Mac-native version of the radio show audio processor that can run the transcription and voice profiling pipeline on Apple Silicon (M4). The Linux workstation's RTX 5070 Ti has a known GPU firmware bug that crashes the GPU after ~3-5 minutes of sustained compute, making GPU-accelerated transcription impossible until NVIDIA fixes it.
Why the Mac
The CachyOS workstation's NVIDIA RTX 5070 Ti Laptop GPU hits a GSP (GPU System Processor) firmware crash under sustained load. This is a known, unresolved bug across all RTX 50-series (Blackwell) GPUs on Linux:
- Error: NVRM: _issueRpcLarge: rpcSendMessage failed with status 0x00000062
- Triggers after ~3-5 minutes of continuous GPU compute
- GPU enters full ERR! state, requires hard reboot (warm reboot hangs)
- Cannot disable GSP on 50-series (open kernel module required, no NVreg_EnableGpuFirmware=0)
- NVIDIA internal bug #5953411 filed, no fix available
- Affects drivers 580.x, 590.x, 595.x (current: 595.45.04)
- Power management tweaks, persistence mode, clock locking: none helped
- See: session logs 2026-03-21-session.md and 2026-03-20-session.md for full diagnosis
The M4 MacBook Air with 16GB unified memory can run this workload on CPU or MPS backend without driver issues.
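As a minimal sketch of the CPU-or-MPS choice described above, the helper below picks a PyTorch device string at runtime. The function name `pick_torch_device` is hypothetical and not part of the existing code. It only covers the PyTorch side of the pipeline (pyannote.audio, WavLM); ctranslate2 stays on CPU regardless.

```python
def pick_torch_device() -> str:
    """Return "mps" when PyTorch reports a usable Metal backend, else "cpu"."""
    try:
        # Imported lazily so the helper still works where torch is not installed.
        import torch
    except ImportError:
        return "cpu"
    # Older torch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(f"Using torch device: {pick_torch_device()}")
```

Falling back to "cpu" rather than raising keeps the same script usable on machines without Metal support.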
The Project
Location in repo: projects/radio-show/audio-processor/
Goal: Automated pipeline for processing "The Computer Guru Show" radio recordings. The immediate task is voice training — transcribing 9 archive episodes and building speaker embeddings to identify the host (Mike Swanson) vs. callers/guests/commercials.
What Already Works (built on Linux, may need Mac adaptation)
- Voice Profiler (src/voice_profiler.py): Uses WavLM (Microsoft microsoft/wavlm-base-sv) for speaker verification via x-vector embeddings. WORKING: 180 embeddings generated, composite built. Host voice scores 0.90-0.98 similarity, non-host 0.53-0.65. Threshold tuned to 0.83.
- Transcriber (src/transcriber.py): Uses faster-whisper with the large-v3 model. On Linux it was configured for CUDA. Only 1 of 9 episodes transcribed before the GPU died.
- Config (config.yaml): All pipeline settings, thresholds, paths.
- Voice profiles: voice-profiles/mike-swanson/ has 180 .npy embedding files + composite.npy + profiles.json. These are numpy arrays, platform-independent.
- One completed transcript: training-data/transcripts/2010-10-02-hr1/ (534 segments, transcript.json + .srt + .txt)
What Needs Doing on Mac
Primary task: Transcribe the remaining 8 episodes:
training-data/episodes/2011-06-04-hr1.mp3 (7.4MB, ~43 min)
training-data/episodes/2011-09-10-hr1.mp3 (11MB)
training-data/episodes/2014-s6e05.mp3 (9.5MB)
training-data/episodes/2015-s7e30.mp3 (9.0MB)
training-data/episodes/2016-s8e42.mp3 (19MB)
training-data/episodes/2017-s9e26.mp3 (48MB)
training-data/episodes/2018-s10e17.mp3 (21MB)
training-data/episodes/2018-s10e21.mp3 (21MB)
Output each to: training-data/transcripts/{episode-stem}/transcript.json (+ .srt, .txt)
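The episode-to-output mapping above can be sketched as a small helper that lists which episodes still lack a transcript.json. The function names are hypothetical. The directory constants mirror the paths given in this document and assume the script runs from projects/radio-show/audio-processor/.

```python
from pathlib import Path
from typing import List

# Paths as listed in this document, relative to the audio-processor directory.
EPISODES_DIR = Path("training-data/episodes")
TRANSCRIPTS_DIR = Path("training-data/transcripts")


def transcript_path(episode: Path, transcripts_dir: Path = TRANSCRIPTS_DIR) -> Path:
    """Map episodes/<stem>.mp3 to transcripts/<stem>/transcript.json."""
    return transcripts_dir / episode.stem / "transcript.json"


def pending_episodes(episodes_dir: Path = EPISODES_DIR,
                     transcripts_dir: Path = TRANSCRIPTS_DIR) -> List[Path]:
    """Return episode files with no transcript.json yet, sorted by file name."""
    return [
        mp3
        for mp3 in sorted(episodes_dir.glob("*.mp3"))
        if not transcript_path(mp3, transcripts_dir).exists()
    ]


if __name__ == "__main__":
    for episode in pending_episodes():
        print(f"{episode} -> {transcript_path(episode)}")
```

Skipping episodes that already have output makes the run resumable if a long CPU transcription is interrupted.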
Secondary: Verify voice profiles work on Mac — load the existing .npy embeddings and run similarity checks against the new transcripts.
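One way to sanity-check the profiles on the Mac is to score each stored embedding against the composite. The sketch below rests on two assumptions: that the similarity metric is cosine similarity (this document does not name the metric), and that composite.npy holds a single vector with the same dimension as each embedding. `check_profile` and `cosine_similarity` are hypothetical names, not functions from src/voice_profiler.py.

```python
import math
from pathlib import Path
from typing import Sequence

HOST_THRESHOLD = 0.83  # host match threshold tuned on the Linux workstation


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def check_profile(profile_dir: Path, threshold: float = HOST_THRESHOLD) -> dict:
    """Score every stored embedding against composite.npy (assumed layout)."""
    # numpy is imported here so the scoring helper above has no dependencies.
    import numpy as np

    composite = np.load(profile_dir / "composite.npy").ravel().tolist()
    scores = [
        cosine_similarity(np.load(path).ravel().tolist(), composite)
        for path in sorted(profile_dir.glob("*.npy"))
        if path.name != "composite.npy"
    ]
    if not scores:
        raise ValueError(f"no embeddings found in {profile_dir}")
    return {
        "count": len(scores),
        "min": min(scores),
        "max": max(scores),
        "below_threshold": sum(score < threshold for score in scores),
    }


if __name__ == "__main__":
    print(check_profile(Path("voice-profiles/mike-swanson")))
```

If the files load correctly, the count should match the 180 embeddings reported from Linux, and most host embeddings should score above the 0.83 threshold.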
Mac-Specific Build Notes
- Do NOT try to port the Linux code directly. Build fresh for Mac hardware. The existing code has CUDA-specific paths (src/gpu.py sets LD_LIBRARY_PATH for CUDA 12), NVIDIA-specific device selection, etc. It's cleaner to build natively.
- Reference the existing code for architecture and logic, especially:
  - src/voice_profiler.py: the WavLM embedding approach, similarity thresholds, profile structure
  - src/transcriber.py: the Whisper pipeline stages, output format (TranscriptSegment dataclass)
  - config.yaml: all the tuned parameters
  - README.md: full architecture doc for the 6-stage pipeline
- Hardware target: Apple M4, 16GB unified memory
  - faster-whisper + ctranslate2 supports CPU on macOS (no MPS backend for ctranslate2)
  - large-v3 model needs ~3GB RAM, which fits easily in 16GB
  - Expected speed: ~1x realtime on M4 CPU (43-min episode takes ~43 min)
  - Consider the medium model if large-v3 is too slow; the tradeoff is accuracy
  - PyTorch MPS backend works for pyannote.audio and WavLM (transformers)
- Dependencies for Mac:

      brew install ffmpeg
      python3 -m venv .venv
      source .venv/bin/activate
      pip install faster-whisper pyannote.audio torch torchaudio pydub librosa scikit-learn ollama rich pyyaml

  - No CUDA packages needed: pip will pull CPU-only torch or MPS-enabled torch for macOS
  - pyannote.audio requires a HuggingFace token (accept model license first): https://huggingface.co/pyannote/speaker-diarization-3.1
- Ollama models available on Mac (per machine spec):
  - qwen3:14b: use for content analysis (Stage 6)
  - nomic-embed-text: for grepai, not needed for audio processing
- Output compatibility: Keep the same output format (JSON with segments, timestamps, speaker labels) so the Linux workstation can consume the results after git pull.
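The CPU transcription path in the notes above might look roughly like the sketch below. It uses the faster-whisper `WhisperModel` API with `device="cpu"`. The `compute_type="int8"` setting, the function names, and the plain-dict segment shape are all assumptions made for illustration. The real output must follow the TranscriptSegment dataclass in src/transcriber.py and the existing transcript.json.

```python
from pathlib import Path
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Render segments (dicts with start, end, text) as SRT blocks."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        span = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{span}\n{seg['text']}\n")
    return "\n".join(blocks)


def transcribe_cpu(audio_path: Path, model_size: str = "large-v3") -> List[Dict]:
    """Transcribe one episode on CPU and return plain dict segments."""
    # Imported lazily so the formatting helpers work without the model stack.
    from faster_whisper import WhisperModel

    # compute_type="int8" is an assumption, chosen to reduce CPU memory and time.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(str(audio_path), language="en")
    # segments is a generator; decoding happens as it is consumed.
    return [
        {"start": seg.start, "end": seg.end, "text": seg.text.strip()}
        for seg in segments
    ]
```

Passing `model_size="medium"` gives the faster, less accurate fallback mentioned in the hardware notes.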
Architecture Reference
Raw MP3 → 1. Transcribe (Whisper) → 2. Diarize (pyannote) → 3. Detect Segments
→ 4. Remove Commercials → 5. Split Segments → 6. Analyze (Ollama)
For now, only Stages 1-2 matter. Stages 3-6 can wait.
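Combining Stages 1 and 2 means attaching a speaker label to each transcript segment. A common approach, sketched here as an assumption rather than as the existing implementation, is to give each segment the speaker whose diarization turns overlap it the most. The `Turn` tuple shape and labels such as SPEAKER_00 are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (start_seconds, end_seconds, speaker_label): hypothetical diarization turn shape.
Turn = Tuple[float, float, str]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(seg_start: float, seg_end: float,
                   turns: List[Turn]) -> Optional[str]:
    """Return the speaker with the most total overlap, or None if no turn overlaps."""
    totals: Dict[str, float] = {}
    for t_start, t_end, speaker in turns:
        shared = overlap(seg_start, seg_end, t_start, t_end)
        if shared > 0:
            totals[speaker] = totals.get(speaker, 0.0) + shared
    if not totals:
        return None
    return max(totals, key=totals.get)


if __name__ == "__main__":
    turns = [(0.0, 5.0, "SPEAKER_00"), (5.0, 12.0, "SPEAKER_01")]
    print(assign_speaker(4.0, 8.0, turns))
```

Summing overlap per speaker, rather than taking the single longest turn, handles segments that span several short turns by the same speaker.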
Key Thresholds (from working Linux version)
- Whisper model: large-v3
- Whisper language: en
- Voice profile host match threshold: 0.83
- Min/max speakers for diarization: 1-6
- WavLM model: microsoft/wavlm-base-sv (speaker verification, x-vector embeddings)
Data Flow
- Training episodes are in training-data/episodes/ (already in git, 151MB total)
- Voice profiles are in voice-profiles/mike-swanson/ (already in git)
- Transcripts go to training-data/transcripts/{episode-stem}/
- After Mac completes transcription, commit + push to Gitea
- Linux workstation pulls results and continues with Stages 3-6
Session Logs for Context
Read these for the full story of what was built and why:
- session-logs/2026-03-21-session.md: Voice profiling results, GPU errors, transcription attempts, diagnosis
- session-logs/2026-03-20-session.md: Earlier session (may have additional audio processor context)
Success Criteria
- All 8 remaining episodes transcribed with timestamps and segments
- Transcripts in the same JSON format as training-data/transcripts/2010-10-02-hr1/transcript.json
- Voice profiles load and produce reasonable similarity scores on Mac
- Results committed to Gitea so Linux workstation can pull them