Audio processor: add Mac build task for voice training

GPU firmware bug (NVRM 0x00000062) on RTX 5070 Ti makes GPU transcription impossible. Handoff doc for Mac M4 to build native version and complete the 8 remaining episode transcriptions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 17:44:43 -07:00
parent 395333c85c
commit 122b87a1d6
1 changed files with 135 additions and 0 deletions
--- a/projects/radio-show/audio-processor/MAC_BUILD_TASK.md
+++ b/projects/radio-show/audio-processor/MAC_BUILD_TASK.md
@@ -0,0 +1,135 @@
 # Mac Build Task: Radio Show Audio Processor
 **Date:** 2026-03-21
 **From:** CachyOS workstation (acg-guru-5070)
 **To:** Mac Claude instance (Mikes-MacBook-Air, M4)
 **Priority:** High — this is blocked on the Linux workstation
 ---
 ## What We Need
 Build a Mac-native version of the radio show audio processor that can run the **transcription and voice profiling pipeline** on Apple Silicon (M4). The Linux workstation's RTX 5070 Ti has a known GPU firmware bug that crashes the GPU after ~3-5 minutes of sustained compute, making GPU-accelerated transcription impossible until NVIDIA fixes it.
 ## Why the Mac
 The CachyOS workstation's NVIDIA RTX 5070 Ti Laptop GPU hits a **GSP (GPU System Processor) firmware crash** under sustained load. This is a known, unresolved bug across all RTX 50-series (Blackwell) GPUs on Linux:
 - Error: `NVRM: _issueRpcLarge: rpcSendMessage failed with status 0x00000062`
 - Triggers after ~3-5 minutes of continuous GPU compute
 - GPU enters full ERR! state, requires hard reboot (warm reboot hangs)
 - Cannot disable GSP on 50-series (open kernel module required, no `NVreg_EnableGpuFirmware=0`)
 - NVIDIA internal bug #5953411 filed, no fix available
 - Affects drivers 580.x, 590.x, 595.x (current: 595.45.04)
 - Power management tweaks, persistence mode, clock locking — none helped
 - See: session logs `2026-03-21-session.md` and `2026-03-20-session.md` for full diagnosis
 The M4 MacBook Air with 16GB unified memory can run this workload without driver issues: transcription on CPU, and the PyTorch models (WavLM, `pyannote.audio`) on CPU or the MPS backend.
 ## The Project
 **Location in repo:** `projects/radio-show/audio-processor/`
 **Goal:** Automated pipeline for processing "The Computer Guru Show" radio recordings. The immediate task is **voice training** — transcribing 9 archive episodes and building speaker embeddings to identify the host (Mike Swanson) vs. callers/guests/commercials.
 ### What Already Works (built on Linux, may need Mac adaptation)
 1. **Voice Profiler** (`src/voice_profiler.py`) — Uses WavLM (Microsoft `microsoft/wavlm-base-sv`) for speaker verification via x-vector embeddings. **WORKING** — 180 embeddings generated, composite built. Host voice scores 0.90-0.98 similarity, non-host 0.53-0.65. Threshold tuned to 0.83.
 2. **Transcriber** (`src/transcriber.py`) — Uses `faster-whisper` with `large-v3` model. On Linux it was configured for CUDA. Only 1 of 9 episodes transcribed before GPU died.
 3. **Config** (`config.yaml`) — All pipeline settings, thresholds, paths.
 4. **Voice profiles** — `voice-profiles/mike-swanson/` has 180 `.npy` embedding files + `composite.npy` + `profiles.json`. These are numpy arrays, platform-independent.
 5. **One completed transcript** — `training-data/transcripts/2010-10-02-hr1/` (534 segments, transcript.json + .srt + .txt)
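The `.srt` file that sits next to `transcript.json` uses SubRip timestamps (`HH:MM:SS,mmm`). A minimal sketch of rendering segments to SRT is shown here, assuming each segment carries a start time, an end time (both in seconds) and text. The `Segment` dataclass is a hypothetical stand-in, not the project's actual `TranscriptSegment` definition; check `src/transcriber.py` for the real field names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for the project's TranscriptSegment."""
    start: float  # seconds from the start of the episode
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        span = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        cues.append(f"{index}\n{span}\n{seg.text.strip()}\n")
    return "\n".join(cues)
```

Comparing the output of `to_srt` against the existing `2010-10-02-hr1` `.srt` file is a quick way to confirm the Mac build matches the Linux formatting.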
 ### What Needs Doing on Mac
 **Primary task: Transcribe the remaining 8 episodes:**
 ```
 training-data/episodes/2011-06-04-hr1.mp3    (7.4MB, ~43 min)
 training-data/episodes/2011-09-10-hr1.mp3    (11MB)
 training-data/episodes/2014-s6e05.mp3        (9.5MB)
 training-data/episodes/2015-s7e30.mp3        (9.0MB)
 training-data/episodes/2016-s8e42.mp3        (19MB)
 training-data/episodes/2017-s9e26.mp3        (48MB)
 training-data/episodes/2018-s10e17.mp3       (21MB)
 training-data/episodes/2018-s10e21.mp3       (21MB)
 ```
 Output each to: `training-data/transcripts/{episode-stem}/transcript.json` (+ .srt, .txt)
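The output layout above can be sketched as a small path helper. This assumes the `.srt` and `.txt` files are named `transcript.srt` and `transcript.txt` next to `transcript.json` (suggested by the completed `2010-10-02-hr1` directory, but worth confirming); the function names are illustrative only.

```python
from pathlib import Path

EPISODES_DIR = Path("training-data/episodes")
TRANSCRIPTS_DIR = Path("training-data/transcripts")


def output_paths(episode: Path, transcripts_dir: Path = TRANSCRIPTS_DIR) -> dict[str, Path]:
    """Map an episode file to its transcript.json/.srt/.txt targets."""
    out_dir = transcripts_dir / episode.stem
    # Assumed naming: transcript.<ext> for each of the three formats.
    return {ext: out_dir / f"transcript.{ext}" for ext in ("json", "srt", "txt")}


def pending_episodes(
    episodes_dir: Path = EPISODES_DIR,
    transcripts_dir: Path = TRANSCRIPTS_DIR,
) -> list[Path]:
    """Return episodes that have no transcript.json yet, in name order."""
    return [
        mp3
        for mp3 in sorted(episodes_dir.glob("*.mp3"))
        if not output_paths(mp3, transcripts_dir)["json"].exists()
    ]
```

Skipping episodes that already have a `transcript.json` makes a long CPU run safe to interrupt and resume.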
 **Secondary: Verify voice profiles work on Mac** — load the existing `.npy` embeddings and run similarity checks against the new transcripts.
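A rough sketch of the similarity check follows. It assumes the score is cosine similarity between a new embedding and `composite.npy`; this doc only says "similarity", so confirm the actual metric in `src/voice_profiler.py`. The function names are hypothetical. The math is kept in pure Python so it can be sanity-checked before numpy is installed; `load_embedding` needs numpy.

```python
import math
from typing import Sequence

HOST_THRESHOLD = 0.83  # host match threshold tuned on the Linux workstation


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_host(
    embedding: Sequence[float],
    composite: Sequence[float],
    threshold: float = HOST_THRESHOLD,
) -> bool:
    """True when the embedding scores at or above the host threshold."""
    return cosine_similarity(embedding, composite) >= threshold


def load_embedding(path: str) -> list[float]:
    """Load a .npy embedding as a flat list of floats (requires numpy)."""
    import numpy as np  # imported lazily so the functions above have no dependencies

    return np.load(path).astype(float).ravel().tolist()
```

If the profiles load correctly on the Mac, the existing host embeddings scored against the composite should land near the 0.90-0.98 range reported from Linux.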
 ### Mac-Specific Build Notes
 1. **Do NOT try to port the Linux code directly.** Build fresh for Mac hardware. The existing code has CUDA-specific paths (`src/gpu.py` sets `LD_LIBRARY_PATH` for CUDA 12), NVIDIA-specific device selection, etc. It's cleaner to build natively.
 2. **Reference the existing code for architecture and logic**, especially:
   - `src/voice_profiler.py` — the WavLM embedding approach, similarity thresholds, profile structure
   - `src/transcriber.py` — the Whisper pipeline stages, output format (TranscriptSegment dataclass)
   - `config.yaml` — all the tuned parameters
   - `README.md` — full architecture doc for the 6-stage pipeline
 3. **Hardware target: Apple M4, 16GB unified memory**
   - `faster-whisper` (via ctranslate2) runs on CPU on macOS; ctranslate2 has no MPS backend
   - `large-v3` model needs ~3GB RAM — fits easily in 16GB
   - Expected speed: ~1x realtime on M4 CPU (43-min episode takes ~43 min)
   - Consider `medium` model if `large-v3` is too slow — tradeoff is accuracy
   - PyTorch MPS backend works for `pyannote.audio` and WavLM (transformers)
 4. **Dependencies for Mac:**
   ```bash
   brew install ffmpeg
   python3 -m venv .venv
   source .venv/bin/activate
   pip install faster-whisper pyannote.audio torch torchaudio pydub librosa scikit-learn ollama rich pyyaml
   ```
   - No CUDA packages needed: on Apple Silicon, pip installs the standard macOS arm64 torch wheel, which includes MPS support
   - `pyannote.audio` requires HuggingFace token (accept model license first): https://huggingface.co/pyannote/speaker-diarization-3.1
 5. **Ollama models available on Mac** (per machine spec):
   - `qwen3:14b` — use for content analysis (Stage 6)
   - `nomic-embed-text` — for grepai, not needed for audio processing
 6. **Output compatibility:** Keep the same output format (JSON with segments, timestamps, speaker labels) so the Linux workstation can consume the results after git pull.
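The hardware notes above can be sketched as two helpers: one picks the PyTorch device for WavLM and `pyannote.audio`, the other runs `faster-whisper` on CPU. This is an illustration, not the existing `src/transcriber.py`. The `int8` compute type is an assumption chosen to reduce memory use and is not taken from `config.yaml`, and the returned dict keys are hypothetical; match them to the existing `transcript.json` before committing output.

```python
def pick_torch_device() -> str:
    """Return "mps" when PyTorch reports it available, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    return "mps" if mps is not None and mps.is_available() else "cpu"


def transcribe_cpu(audio_path: str, model_size: str = "large-v3") -> list[dict]:
    """Transcribe one file with faster-whisper on CPU and return plain dicts."""
    from faster_whisper import WhisperModel  # lazy import: heavy dependency

    # compute_type="int8" is an assumed setting, not from config.yaml.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, language="en")
    # `segments` is a generator; the transcription runs as it is consumed.
    return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]
```

Keeping the two device decisions separate reflects the split described above: ctranslate2 stays on CPU while the PyTorch models can use MPS.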
 ### Architecture Reference
 ```
 Raw MP3 → 1. Transcribe (Whisper) → 2. Diarize (pyannote) → 3. Detect Segments
        → 4. Remove Commercials → 5. Split Segments → 6. Analyze (Ollama)
 ```
 For now, only Stages 1-2 matter. Stages 3-6 can wait.
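One way to keep the Mac build limited to Stages 1-2 while leaving room for the rest is to name the stages and gate on the last one to run. This is a hypothetical sketch; the `Stage` enum and `stages_to_run` are illustrative names, not part of the existing code.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The six pipeline stages, numbered as in the architecture diagram."""
    TRANSCRIBE = 1
    DIARIZE = 2
    DETECT_SEGMENTS = 3
    REMOVE_COMMERCIALS = 4
    SPLIT_SEGMENTS = 5
    ANALYZE = 6


def stages_to_run(last: Stage = Stage.DIARIZE) -> list[Stage]:
    """Return the stages to execute, in order, up to and including `last`."""
    return [stage for stage in Stage if stage <= last]
```

With the default, only `TRANSCRIBE` and `DIARIZE` are returned; passing `Stage.ANALYZE` later enables the full pipeline without restructuring.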
 ### Key Thresholds (from working Linux version)
 - Whisper model: `large-v3`
 - Whisper language: `en`
 - Voice profile host match threshold: `0.83`
 - Min/max speakers for diarization: 1-6
 - WavLM model: `microsoft/wavlm-base-sv` (speaker verification, x-vector embeddings)
 ### Data Flow
 1. Training episodes are in `training-data/episodes/` (already in git, 151MB total)
 2. Voice profiles are in `voice-profiles/mike-swanson/` (already in git)
 3. Transcripts go to `training-data/transcripts/{episode-stem}/`
 4. After Mac completes transcription, commit + push to Gitea
 5. Linux workstation pulls results and continues with Stages 3-6
 ## Session Logs for Context
 Read these for the full story of what was built and why:
 - `session-logs/2026-03-21-session.md` — Voice profiling results, GPU errors, transcription attempts, diagnosis
 - `session-logs/2026-03-20-session.md` — Earlier session (may have additional audio processor context)
 ## Success Criteria
 1. All 8 remaining episodes transcribed with timestamps and segments
 2. Transcripts in the same JSON format as `training-data/transcripts/2010-10-02-hr1/transcript.json`
 3. Voice profiles load and produce reasonable similarity scores on Mac
 4. Results committed to Gitea so Linux workstation can pull them
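Before committing, criterion 1 can be checked with a short script. This sketch only verifies that each `transcript.json` exists, parses as JSON, and is non-empty; it does not validate the schema, since the exact format is defined by the existing `2010-10-02-hr1` transcript. The function name is hypothetical.

```python
import json
from pathlib import Path

# Stems of the 8 remaining episodes listed in this document.
REMAINING = [
    "2011-06-04-hr1",
    "2011-09-10-hr1",
    "2014-s6e05",
    "2015-s7e30",
    "2016-s8e42",
    "2017-s9e26",
    "2018-s10e17",
    "2018-s10e21",
]


def missing_transcripts(transcripts_dir: Path, stems: list[str] = REMAINING) -> list[str]:
    """Return stems that lack a parseable, non-empty transcript.json."""
    missing = []
    for stem in stems:
        path = transcripts_dir / stem / "transcript.json"
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            missing.append(stem)
            continue
        if not data:
            missing.append(stem)
    return missing
```

Running `missing_transcripts(Path("training-data/transcripts"))` from the project directory should return an empty list once all 8 episodes are done.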