GPU firmware bug (NVRM 0x00000062) on RTX 5070 Ti makes GPU transcription impossible. Handoff doc for Mac M4 to build native version and complete the 8 remaining episode transcriptions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Mac Build Task: Radio Show Audio Processor

**Date:** 2026-03-21
**From:** CachyOS workstation (acg-guru-5070)
**To:** Mac Claude instance (Mikes-MacBook-Air, M4)
**Priority:** High (this is blocked on the Linux workstation)

---

## What We Need

Build a Mac-native version of the radio show audio processor that can run the **transcription and voice profiling pipeline** on Apple Silicon (M4). The Linux workstation's RTX 5070 Ti has a known GPU firmware bug that crashes the GPU after ~3-5 minutes of sustained compute, making GPU-accelerated transcription impossible until NVIDIA fixes it.

## Why the Mac

The CachyOS workstation's NVIDIA RTX 5070 Ti Laptop GPU hits a **GSP (GPU System Processor) firmware crash** under sustained load. This is a known, unresolved bug across all RTX 50-series (Blackwell) GPUs on Linux:

- Error: `NVRM: _issueRpcLarge: rpcSendMessage failed with status 0x00000062`
- Triggers after ~3-5 minutes of continuous GPU compute
- GPU enters a full ERR! state and requires a hard reboot (a warm reboot hangs)
- GSP cannot be disabled on the 50-series: the open kernel module is required, so `NVreg_EnableGpuFirmware=0` is not available
- NVIDIA internal bug #5953411 filed, no fix available
- Affects drivers 580.x, 590.x, 595.x (current: 595.45.04)
- Power management tweaks, persistence mode, and clock locking did not help
- See session logs `2026-03-21-session.md` and `2026-03-20-session.md` for the full diagnosis

The M4 MacBook Air with 16GB unified memory can run this workload without driver issues: transcription on the CPU, and WavLM and `pyannote.audio` on the PyTorch MPS backend.

## The Project

**Location in repo:** `projects/radio-show/audio-processor/`

**Goal:** Automated pipeline for processing "The Computer Guru Show" radio recordings. The immediate task is **voice training**: transcribing 9 archive episodes and building speaker embeddings to distinguish the host (Mike Swanson) from callers, guests, and commercials.

### What Already Works (built on Linux, may need Mac adaptation)

1. **Voice Profiler** (`src/voice_profiler.py`): Uses WavLM (`microsoft/wavlm-base-sv`) for speaker verification via x-vector embeddings. **WORKING:** 180 embeddings generated, composite built. The host's voice scores 0.90-0.98 similarity; non-host voices score 0.53-0.65. Threshold tuned to 0.83.

2. **Transcriber** (`src/transcriber.py`): Uses `faster-whisper` with the `large-v3` model. On Linux it was configured for CUDA. Only 1 of 9 episodes was transcribed before the GPU crashed.

3. **Config** (`config.yaml`): All pipeline settings, thresholds, and paths.

4. **Voice profiles:** `voice-profiles/mike-swanson/` has 180 `.npy` embedding files + `composite.npy` + `profiles.json`. These are numpy arrays and are platform-independent.

5. **One completed transcript:** `training-data/transcripts/2010-10-02-hr1/` (534 segments, transcript.json + .srt + .txt)

### What Needs Doing on Mac

**Primary task: Transcribe the remaining 8 episodes:**
```
training-data/episodes/2011-06-04-hr1.mp3 (7.4MB, ~43 min)
training-data/episodes/2011-09-10-hr1.mp3 (11MB)
training-data/episodes/2014-s6e05.mp3 (9.5MB)
training-data/episodes/2015-s7e30.mp3 (9.0MB)
training-data/episodes/2016-s8e42.mp3 (19MB)
training-data/episodes/2017-s9e26.mp3 (48MB)
training-data/episodes/2018-s10e17.mp3 (21MB)
training-data/episodes/2018-s10e21.mp3 (21MB)
```

Output each to: `training-data/transcripts/{episode-stem}/transcript.json` (+ .srt, .txt)

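As an illustration of the `.srt` output, here is a minimal sketch of SRT formatting. The `Segment` class is a hypothetical stand-in for the `TranscriptSegment` dataclass in `src/transcriber.py`; its field names (`start`, `end`, `text`) are assumptions and should be checked against the existing code and the completed `2010-10-02-hr1` transcript.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for TranscriptSegment; field names are assumed."""
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp form used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)
```

The same segment list can feed the `.txt` output (text only) and the JSON output, so all three files stay in sync.
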
**Secondary: Verify voice profiles work on Mac:** load the existing `.npy` embeddings and run similarity checks against the new transcripts.

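A similarity check could look like the sketch below. It assumes the scores quoted in this document are cosine similarities and that `composite.npy` holds a single 1-D embedding; both are assumptions to confirm against `src/voice_profiler.py`. The helper names are hypothetical. The arithmetic is plain Python so the comparison logic can be tested without numpy, which is only imported to read the `.npy` files.

```python
import math
from pathlib import Path
from typing import Sequence

HOST_THRESHOLD = 0.83  # host match threshold quoted in this document


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def load_embedding(path: Path) -> list[float]:
    """Read a .npy embedding and flatten it to a list of floats."""
    import numpy as np  # imported lazily; only needed to read .npy files

    return np.load(path).astype("float64").ravel().tolist()


def is_host(embedding: Sequence[float], composite: Sequence[float],
            threshold: float = HOST_THRESHOLD) -> bool:
    """True when the embedding matches the composite at or above the threshold."""
    return cosine_similarity(embedding, composite) >= threshold
```

A quick sanity check on the Mac is to score each of the 180 stored embeddings against `composite.npy`. If the assumptions hold, the scores should be comparable to the 0.90-0.98 host range seen on Linux.
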
### Mac-Specific Build Notes

1. **Do NOT try to port the Linux code directly.** Build fresh for Mac hardware. The existing code has CUDA-specific paths (`src/gpu.py` sets `LD_LIBRARY_PATH` for CUDA 12), NVIDIA-specific device selection, etc. It's cleaner to build natively.

2. **Reference the existing code for architecture and logic**, especially:
   - `src/voice_profiler.py`: the WavLM embedding approach, similarity thresholds, profile structure
   - `src/transcriber.py`: the Whisper pipeline stages, output format (TranscriptSegment dataclass)
   - `config.yaml`: all the tuned parameters
   - `README.md`: full architecture doc for the 6-stage pipeline

3. **Hardware target: Apple M4, 16GB unified memory**
   - `faster-whisper` (via ctranslate2) runs on the CPU on macOS; ctranslate2 has no MPS backend
   - `large-v3` model needs ~3GB RAM, which fits easily in 16GB
   - Expected speed: ~1x realtime on M4 CPU (43-min episode takes ~43 min)
   - Consider the `medium` model if `large-v3` is too slow; the tradeoff is accuracy
   - PyTorch MPS backend works for `pyannote.audio` and WavLM (transformers)

4. **Dependencies for Mac:**
   ```bash
   brew install ffmpeg
   python3 -m venv .venv
   source .venv/bin/activate
   pip install faster-whisper pyannote.audio torch torchaudio pydub librosa scikit-learn ollama rich pyyaml
   ```
   - No CUDA packages needed: on macOS, pip installs a torch build with MPS support and no CUDA dependencies
   - `pyannote.audio` requires a HuggingFace token (accept the model license first): https://huggingface.co/pyannote/speaker-diarization-3.1

5. **Ollama models available on Mac** (per machine spec):
   - `qwen3:14b`: use for content analysis (Stage 6)
   - `nomic-embed-text`: for grepai, not needed for audio processing

6. **Output compatibility:** Keep the same output format (JSON with segments, timestamps, speaker labels) so the Linux workstation can consume the results after git pull.

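The hardware notes above could translate into something like the following sketch. It uses the public `faster_whisper.WhisperModel` API on the CPU and picks `mps` for PyTorch-based stages when available. The `int8` compute type and the output dictionary keys are assumptions for illustration, not settings taken from `config.yaml`, and the function names are hypothetical.

```python
def whisper_settings() -> dict:
    """Settings for faster-whisper on the Mac: CPU only, since ctranslate2 has no MPS backend."""
    return {
        "model_size": "large-v3",
        "device": "cpu",
        "compute_type": "int8",  # assumption: quantized inference to save RAM and time
        "language": "en",
    }


def torch_device() -> str:
    """Return "mps" when PyTorch can use Apple's Metal backend, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "mps" if torch.backends.mps.is_available() else "cpu"


def transcribe(audio_path: str) -> list[dict]:
    """Transcribe one episode and return plain dicts (keys here are illustrative)."""
    from faster_whisper import WhisperModel  # imported lazily; heavy dependency

    cfg = whisper_settings()
    model = WhisperModel(
        cfg["model_size"], device=cfg["device"], compute_type=cfg["compute_type"]
    )
    # transcribe() returns a lazy generator; iterating it performs the actual work.
    segments, _info = model.transcribe(audio_path, language=cfg["language"])
    return [
        {"start": seg.start, "end": seg.end, "text": seg.text.strip()}
        for seg in segments
    ]
```

Keeping the Whisper device fixed to `cpu` while letting the PyTorch stages (WavLM, `pyannote.audio`) use `torch_device()` mirrors the split described in item 3. If accuracy matters more than speed, a different `compute_type` may be worth testing.
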
### Architecture Reference

```
Raw MP3 → 1. Transcribe (Whisper) → 2. Diarize (pyannote) → 3. Detect Segments
        → 4. Remove Commercials → 5. Split Segments → 6. Analyze (Ollama)
```

For now, only Stages 1-2 matter. Stages 3-6 can wait.

### Key Thresholds (from working Linux version)

- Whisper model: `large-v3`
- Whisper language: `en`
- Voice profile host match threshold: `0.83`
- Min/max speakers for diarization: 1-6
- WavLM model: `microsoft/wavlm-base-sv` (speaker verification, x-vector embeddings)

### Data Flow

1. Training episodes are in `training-data/episodes/` (already in git, 151MB total)
2. Voice profiles are in `voice-profiles/mike-swanson/` (already in git)
3. Transcripts go to `training-data/transcripts/{episode-stem}/`
4. After Mac completes transcription, commit + push to Gitea
5. Linux workstation pulls results and continues with Stages 3-6

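Because a 43-minute episode may take about as long to transcribe, a batch run benefits from being resumable. The sketch below lists episodes that do not yet have a `transcript.json` under the layout described above. The function name `pending_episodes` is hypothetical.

```python
from pathlib import Path


def pending_episodes(episodes_dir: Path, transcripts_dir: Path) -> list[Path]:
    """Return MP3 files whose {episode-stem}/transcript.json does not exist yet."""
    pending = []
    for mp3 in sorted(episodes_dir.glob("*.mp3")):
        transcript = transcripts_dir / mp3.stem / "transcript.json"
        if not transcript.exists():
            pending.append(mp3)
    return pending
```

Run from the project root with `Path("training-data/episodes")` and `Path("training-data/transcripts")`, this should skip `2010-10-02-hr1` and return the 8 remaining episodes, assuming the repository matches the description in this document.
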
## Session Logs for Context

Read these for the full story of what was built and why:

- `session-logs/2026-03-21-session.md`: Voice profiling results, GPU errors, transcription attempts, diagnosis
- `session-logs/2026-03-20-session.md`: Earlier session (may have additional audio processor context)

## Success Criteria

1. All 8 remaining episodes transcribed with timestamps and segments
2. Transcripts in the same JSON format as `training-data/transcripts/2010-10-02-hr1/transcript.json`
3. Voice profiles load and produce reasonable similarity scores on Mac
4. Results committed to Gitea so Linux workstation can pull them

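One way to check the "same JSON format" criterion without knowing the schema in advance is to compare the structure of a new transcript against the reference one. This is a rough sketch with hypothetical function names. It compares key names and value kinds only, and it inspects only the first element of each list, so optional or nullable fields (for example a speaker label that is sometimes missing) could cause false mismatches.

```python
import json
from pathlib import Path


def shape(value):
    """Reduce a JSON value to its structure: keys for objects, element shape for arrays."""
    if isinstance(value, dict):
        return {key: shape(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [shape(value[0])] if value else []
    if isinstance(value, bool):  # checked before numbers because bool is an int subclass
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if value is None:
        return "null"
    return type(value).__name__


def same_format(reference_path: Path, candidate_path: Path) -> bool:
    """True when two JSON files share the same structural shape."""
    reference = json.loads(reference_path.read_text(encoding="utf-8"))
    candidate = json.loads(candidate_path.read_text(encoding="utf-8"))
    return shape(reference) == shape(candidate)
```

Comparing each new `transcript.json` against the `2010-10-02-hr1` reference before committing gives an early warning if the Mac build drifts from the format the Linux stages expect.
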