
Mike Swanson 122b87a1d6 Audio processor: add Mac build task for voice training

GPU firmware bug (NVRM 0x00000062) on RTX 5070 Ti makes
GPU transcription impossible. Handoff doc for Mac M4 to
build native version and complete the 8 remaining episode
transcriptions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-21 17:44:52 -07:00


Mac Build Task: Radio Show Audio Processor

Date: 2026-03-21
From: CachyOS workstation (acg-guru-5070)
To: Mac Claude instance (Mikes-MacBook-Air, M4)
Priority: High (this work is blocked on the Linux workstation)

What We Need

Build a Mac-native version of the radio show audio processor that can run the transcription and voice profiling pipeline on Apple Silicon (M4). The Linux workstation's RTX 5070 Ti has a known GPU firmware bug that crashes the GPU after ~3-5 minutes of sustained compute, making GPU-accelerated transcription impossible until NVIDIA fixes it.

Why the Mac

The CachyOS workstation's NVIDIA RTX 5070 Ti Laptop GPU hits a GSP (GPU System Processor) firmware crash under sustained load. This is a known, unresolved bug across all RTX 50-series (Blackwell) GPUs on Linux:

Error: NVRM: _issueRpcLarge: rpcSendMessage failed with status 0x00000062
Triggers after ~3-5 minutes of continuous GPU compute
GPU enters full ERR! state, requires hard reboot (warm reboot hangs)
Cannot disable GSP on 50-series (open kernel module required, no NVreg_EnableGpuFirmware=0)
NVIDIA internal bug #5953411 filed, no fix available
Affects drivers 580.x, 590.x, 595.x (current: 595.45.04)
Power management tweaks, persistence mode, clock locking — none helped
See: session logs 2026-03-21-session.md and 2026-03-20-session.md for full diagnosis

The M4 MacBook Air with 16GB unified memory can run this workload without driver issues: transcription runs on the CPU (faster-whisper), and the PyTorch models (WavLM, pyannote.audio) can use the MPS backend.

The Project

Location in repo: projects/radio-show/audio-processor/

Goal: Automated pipeline for processing "The Computer Guru Show" radio recordings. The immediate task is voice training — transcribing 9 archive episodes and building speaker embeddings to identify the host (Mike Swanson) vs. callers/guests/commercials.

What Already Works (built on Linux, may need Mac adaptation)

Voice Profiler (src/voice_profiler.py): Uses Microsoft's WavLM model (microsoft/wavlm-base-sv) for speaker verification via x-vector embeddings. WORKING: 180 embeddings generated and a composite built. The host's voice scores 0.90-0.98 similarity; non-host voices score 0.53-0.65. Threshold tuned to 0.83.
Transcriber (src/transcriber.py): Uses faster-whisper with the large-v3 model. On Linux it was configured for CUDA. Only 1 of 9 episodes was transcribed before the GPU crashed.
Config (config.yaml): All pipeline settings, thresholds, paths.
Voice profiles: voice-profiles/mike-swanson/ has 180 .npy embedding files + composite.npy + profiles.json. The .npy files are NumPy arrays and are platform-independent.
One completed transcript: training-data/transcripts/2010-10-02-hr1/ (534 segments, transcript.json + .srt + .txt)
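
The completed transcript ships as JSON, SRT, and plain text. As a rough illustration of the SRT side, here is a minimal sketch that renders timed segments as SRT cues. The `Segment` dataclass below is a hypothetical stand-in, not the `TranscriptSegment` dataclass from src/transcriber.py, whose fields are not shown in this document.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a transcript segment (times in seconds)."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)
```

Compare the output against the existing .srt file in training-data/transcripts/2010-10-02-hr1/ before relying on it, since the Linux version may format cues differently.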

What Needs Doing on Mac

Primary task: Transcribe the remaining 8 episodes:

training-data/episodes/2011-06-04-hr1.mp3    (7.4MB, ~43 min)
training-data/episodes/2011-09-10-hr1.mp3    (11MB)
training-data/episodes/2014-s6e05.mp3        (9.5MB)
training-data/episodes/2015-s7e30.mp3        (9.0MB)
training-data/episodes/2016-s8e42.mp3        (19MB)
training-data/episodes/2017-s9e26.mp3        (48MB)
training-data/episodes/2018-s10e17.mp3       (21MB)
training-data/episodes/2018-s10e21.mp3       (21MB)

Output each to: training-data/transcripts/{episode-stem}/transcript.json (+ .srt, .txt)
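
A long CPU run may be interrupted, so it can help to derive the work queue from what is already on disk. The sketch below maps each episode to its output folder and skips episodes that already have a transcript.json. It assumes the script runs from the repo root and that the episode stem is the file name without the .mp3 extension; the function names are illustrative only.

```python
from pathlib import Path

EPISODES_DIR = Path("training-data/episodes")
TRANSCRIPTS_DIR = Path("training-data/transcripts")


def transcript_dir(episode: Path, transcripts_root: Path = TRANSCRIPTS_DIR) -> Path:
    """Return the output folder for an episode, e.g. .../transcripts/2014-s6e05."""
    return transcripts_root / episode.stem


def pending_episodes(
    episodes_root: Path = EPISODES_DIR,
    transcripts_root: Path = TRANSCRIPTS_DIR,
) -> list[Path]:
    """List episodes (sorted by name) that do not have a transcript.json yet."""
    pending = []
    for episode in sorted(episodes_root.glob("*.mp3")):
        marker = transcript_dir(episode, transcripts_root) / "transcript.json"
        if not marker.exists():
            pending.append(episode)
    return pending
```

With the repo in its current state, `pending_episodes()` should return the 8 episodes listed above, because only 2010-10-02-hr1 has a transcript.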

Secondary: Verify voice profiles work on Mac — load the existing .npy embeddings and run similarity checks against the new transcripts.
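
A minimal sketch of such a similarity check follows. It assumes the scores are cosine similarities, which is the usual measure for x-vector embeddings; the actual metric in src/voice_profiler.py may differ, so check there first. The function names are hypothetical, and 0.83 is the host threshold quoted in this document.

```python
import math
from typing import Sequence

HOST_THRESHOLD = 0.83  # host match threshold tuned on the Linux workstation


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero-norm embedding")
    return dot / (norm_a * norm_b)


def is_host(
    embedding: Sequence[float],
    composite: Sequence[float],
    threshold: float = HOST_THRESHOLD,
) -> bool:
    """True when the embedding matches the host composite at or above the threshold."""
    return cosine_similarity(embedding, composite) >= threshold


def load_embedding(path: str) -> list[float]:
    """Load a .npy embedding as a flat list of floats."""
    # Imported lazily so the helpers above need only the standard library.
    import numpy as np

    return np.load(path).astype(float).ravel().tolist()
```

A quick sanity check on the Mac is to load voice-profiles/mike-swanson/composite.npy and any of the 180 embedding files with `load_embedding` and confirm that `is_host` returns True for them.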

Mac-Specific Build Notes

Do NOT try to port the Linux code directly. Build fresh for Mac hardware. The existing code has CUDA-specific paths (src/gpu.py sets LD_LIBRARY_PATH for CUDA 12), nvidia-specific device selection, etc. It's cleaner to build natively.
Reference the existing code for architecture and logic, especially:
- src/voice_profiler.py — the WavLM embedding approach, similarity thresholds, profile structure
- src/transcriber.py — the Whisper pipeline stages, output format (TranscriptSegment dataclass)
- config.yaml — all the tuned parameters
- README.md — full architecture doc for the 6-stage pipeline
Hardware target: Apple M4, 16GB unified memory
- faster-whisper + ctranslate2 supports CPU on macOS (no MPS backend for ctranslate2)
- large-v3 model needs ~3GB RAM — fits easily in 16GB
- Expected speed: ~1x realtime on M4 CPU (43-min episode takes ~43 min)
- Consider medium model if large-v3 is too slow — tradeoff is accuracy
- PyTorch MPS backend works for pyannote.audio and WavLM (transformers)
Dependencies for Mac:
```
brew install ffmpeg
python3 -m venv .venv
source .venv/bin/activate
pip install faster-whisper pyannote.audio torch torchaudio pydub librosa scikit-learn ollama rich pyyaml
```
- No CUDA packages needed: on macOS, pip installs the standard arm64 torch wheel, which includes MPS support
- pyannote.audio requires HuggingFace token (accept model license first): https://huggingface.co/pyannote/speaker-diarization-3.1
Ollama models available on Mac (per machine spec):
- qwen3:14b — use for content analysis (Stage 6)
- nomic-embed-text — for grepai, not needed for audio processing
Output compatibility: Keep the same output format (JSON with segments, timestamps, speaker labels) so the Linux workstation can consume the results after git pull.
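
The hardware notes above translate roughly into the following sketch: faster-whisper pinned to the CPU, and MPS used for the PyTorch models when it is available. `WhisperModel`, its `transcribe` method, and `torch.backends.mps.is_available()` are real library calls. The helper names, the `compute_type="int8"` choice, and the dictionary keys are assumptions, and the keys should be matched to the existing transcript.json before use.

```python
from typing import Any


def load_whisper(model_size: str = "large-v3"):
    """Load faster-whisper on the CPU (ctranslate2 has no MPS backend)."""
    from faster_whisper import WhisperModel  # lazy: heavy import

    # compute_type="int8" is an assumption, chosen to reduce memory and CPU time.
    return WhisperModel(model_size, device="cpu", compute_type="int8")


def torch_device() -> str:
    """Pick MPS for PyTorch models (WavLM, pyannote.audio) when it is available."""
    import torch

    return "mps" if torch.backends.mps.is_available() else "cpu"


def transcribe(model: Any, audio_path: str, language: str = "en") -> list[dict]:
    """Run one episode through the model and collect plain-dict segments."""
    segments, _info = model.transcribe(audio_path, language=language)
    # faster-whisper yields segments lazily; iterating performs the decoding.
    return [
        {"start": seg.start, "end": seg.end, "text": seg.text.strip()}
        for seg in segments
    ]
```

Passing the model in as an argument keeps `transcribe` easy to exercise with a stub, and it means the large-v3 weights are loaded once for all 8 episodes rather than once per file.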

Architecture Reference

Raw MP3 → 1. Transcribe (Whisper) → 2. Diarize (pyannote) → 3. Detect Segments
        → 4. Remove Commercials → 5. Split Segments → 6. Analyze (Ollama)

For now, only Stages 1-2 matter. Stages 3-6 can wait.
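
One way to keep the Mac build scoped to Stages 1-2 without losing the overall ordering is to name the stages explicitly. This is only a sketch; the enum and function names are hypothetical and do not come from the existing code.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The six pipeline stages, numbered as in the diagram above."""
    TRANSCRIBE = 1
    DIARIZE = 2
    DETECT_SEGMENTS = 3
    REMOVE_COMMERCIALS = 4
    SPLIT_SEGMENTS = 5
    ANALYZE = 6


def stages_to_run(last: Stage = Stage.DIARIZE) -> list[Stage]:
    """Return every stage up to and including `last`, in pipeline order."""
    return [stage for stage in Stage if stage <= last]
```

The default stops after diarization; passing `Stage.ANALYZE` would enable the full pipeline once the later stages are picked up again on Linux.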

Key Thresholds (from working Linux version)

Whisper model: large-v3
Whisper language: en
Voice profile host match threshold: 0.83
Min/max speakers for diarization: 1-6
WavLM model: microsoft/wavlm-base-sv (speaker verification, x-vector embeddings)

Data Flow

Training episodes are in training-data/episodes/ (already in git, 151MB total)
Voice profiles are in voice-profiles/mike-swanson/ (already in git)
Transcripts go to training-data/transcripts/{episode-stem}/
After Mac completes transcription, commit + push to Gitea
Linux workstation pulls results and continues with Stages 3-6

Session Logs for Context

Read these for the full story of what was built and why:

session-logs/2026-03-21-session.md — Voice profiling results, GPU errors, transcription attempts, diagnosis
session-logs/2026-03-20-session.md — Earlier session (may have additional audio processor context)

Success Criteria

All 8 remaining episodes transcribed with timestamps and segments
Transcripts in the same JSON format as training-data/transcripts/2010-10-02-hr1/transcript.json
Voice profiles load and produce reasonable similarity scores on Mac
Results committed to Gitea so Linux workstation can pull them
