
# Mac Build Task: Radio Show Audio Processor

**Date:** 2026-03-21
**From:** CachyOS workstation (acg-guru-5070)
**To:** Mac Claude instance (Mikes-MacBook-Air, M4)
**Priority:** High — this is blocked on the Linux workstation

---

## What We Need

Build a Mac-native version of the radio show audio processor that can run the **transcription and voice profiling pipeline** on Apple Silicon (M4). The Linux workstation's RTX 5070 Ti has a known GPU firmware bug that crashes the GPU after ~3-5 minutes of sustained compute, so GPU-accelerated transcription on that machine is blocked until NVIDIA ships a fix.

## Why the Mac

The CachyOS workstation's NVIDIA RTX 5070 Ti Laptop GPU hits a **GSP (GPU System Processor) firmware crash** under sustained load. This is a known, unresolved bug across all RTX 50-series (Blackwell) GPUs on Linux:

- Error: `NVRM: _issueRpcLarge: rpcSendMessage failed with status 0x00000062`
- Triggers after ~3-5 minutes of continuous GPU compute
- GPU enters full ERR! state, requires hard reboot (warm reboot hangs)
- GSP cannot be disabled on 50-series: these GPUs require the open kernel module, which always runs the GSP firmware, so `NVreg_EnableGpuFirmware=0` is not an option
- NVIDIA internal bug #5953411 filed, no fix available
- Affects drivers 580.x, 590.x, 595.x (current: 595.45.04)
- Power management tweaks, persistence mode, clock locking — none helped
- See: session logs `2026-03-21-session.md` and `2026-03-20-session.md` for full diagnosis

The M4 MacBook Air with 16GB unified memory can run this workload without these driver issues: Whisper transcription on CPU, and the PyTorch models (WavLM, pyannote) on CPU or the MPS backend.

## The Project

**Location in repo:** `projects/radio-show/audio-processor/`

**Goal:** Automated pipeline for processing "The Computer Guru Show" radio recordings. The immediate task is **voice training** — transcribing 9 archive episodes and building speaker embeddings to identify the host (Mike Swanson) vs. callers/guests/commercials.

### What Already Works (built on Linux, may need Mac adaptation)

1. **Voice Profiler** (`src/voice_profiler.py`) — Uses WavLM (Microsoft `microsoft/wavlm-base-sv`) for speaker verification via x-vector embeddings. **WORKING** — 180 embeddings generated, composite built. Host voice scores 0.90-0.98 similarity, non-host 0.53-0.65. Threshold tuned to 0.83.

2. **Transcriber** (`src/transcriber.py`) — Uses `faster-whisper` with `large-v3` model. On Linux it was configured for CUDA. Only 1 of 9 episodes transcribed before GPU died.

3. **Config** (`config.yaml`) — All pipeline settings, thresholds, paths.

4. **Voice profiles** — `voice-profiles/mike-swanson/` has 180 `.npy` embedding files + `composite.npy` + `profiles.json`. These are numpy arrays, platform-independent.

5. **One completed transcript** — `training-data/transcripts/2010-10-02-hr1/` (534 segments, transcript.json + .srt + .txt)

### What Needs Doing on Mac

**Primary task: Transcribe the remaining 8 episodes:**
```
training-data/episodes/2011-06-04-hr1.mp3    (7.4MB, ~43 min)
training-data/episodes/2011-09-10-hr1.mp3    (11MB)
training-data/episodes/2014-s6e05.mp3        (9.5MB)
training-data/episodes/2015-s7e30.mp3        (9.0MB)
training-data/episodes/2016-s8e42.mp3        (19MB)
training-data/episodes/2017-s9e26.mp3        (48MB)
training-data/episodes/2018-s10e17.mp3       (21MB)
training-data/episodes/2018-s10e21.mp3       (21MB)
```

Output each to: `training-data/transcripts/{episode-stem}/transcript.json` (+ .srt, .txt)
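
The episode-to-output mapping above can be sketched as a small helper that lists which episodes still lack a `transcript.json`. This is a minimal sketch, not the existing pipeline code: the function names `transcript_dir` and `pending_episodes` are hypothetical, and it assumes the repo layout described in this document.

```python
from pathlib import Path


def transcript_dir(episode: Path, transcripts_root: Path) -> Path:
    """Map episodes/2011-06-04-hr1.mp3 to transcripts/2011-06-04-hr1/."""
    return transcripts_root / episode.stem


def pending_episodes(episodes_root: Path, transcripts_root: Path) -> list[Path]:
    """Return episodes that have no transcript.json yet, sorted by file name."""
    return [
        mp3
        for mp3 in sorted(episodes_root.glob("*.mp3"))
        if not (transcript_dir(mp3, transcripts_root) / "transcript.json").exists()
    ]


if __name__ == "__main__":
    root = Path("training-data")  # run from projects/radio-show/audio-processor/
    for mp3 in pending_episodes(root / "episodes", root / "transcripts"):
        print(mp3.name, "->", transcript_dir(mp3, root / "transcripts"))
```

Skipping episodes that already have output makes the run resumable, which helps if a long CPU transcription is interrupted.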

**Secondary: Verify voice profiles work on Mac.** Load the existing `.npy` embeddings and run similarity checks against audio segments from the newly transcribed episodes.
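
A first sanity check for the voice profiles could look like the sketch below: load every stored embedding and score it against `composite.npy`. Several things here are assumptions rather than facts from `src/voice_profiler.py`: that each `.npy` file holds a single embedding vector, that the similarity metric is cosine similarity, and that a score at or above the threshold counts as a host match. The function names are hypothetical.

```python
import math
from pathlib import Path
from typing import Sequence

HOST_THRESHOLD = 0.83  # host match threshold tuned on the Linux workstation


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_host(similarity: float, threshold: float = HOST_THRESHOLD) -> bool:
    """Assumption: scores at or above the threshold count as the host."""
    return similarity >= threshold


def check_profiles(profile_dir: Path) -> dict[str, float]:
    """Score every stored embedding against composite.npy."""
    import numpy as np  # imported lazily so the helpers above stay stdlib-only

    composite = np.load(profile_dir / "composite.npy").ravel().tolist()
    scores = {}
    for path in sorted(profile_dir.glob("*.npy")):
        if path.name == "composite.npy":
            continue
        # Assumption: one embedding vector per file.
        scores[path.name] = cosine_similarity(np.load(path).ravel().tolist(), composite)
    return scores
```

If the files load and most of the 180 embeddings score well above 0.83 against the composite, the profiles survived the move between platforms.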

### Mac-Specific Build Notes

1. **Do NOT try to port the Linux code directly.** Build fresh for Mac hardware. The existing code has CUDA-specific paths (`src/gpu.py` sets `LD_LIBRARY_PATH` for CUDA 12), nvidia-specific device selection, etc. It's cleaner to build natively.

2. **Reference the existing code for architecture and logic**, especially:
   - `src/voice_profiler.py` — the WavLM embedding approach, similarity thresholds, profile structure
   - `src/transcriber.py` — the Whisper pipeline stages, output format (TranscriptSegment dataclass)
   - `config.yaml` — all the tuned parameters
   - `README.md` — full architecture doc for the 6-stage pipeline

3. **Hardware target: Apple M4, 16GB unified memory**
   - `faster-whisper` runs on ctranslate2, which is CPU-only on macOS (ctranslate2 has no MPS backend)
   - `large-v3` model needs ~3GB RAM — fits easily in 16GB
   - Expected speed: ~1x realtime on M4 CPU (43-min episode takes ~43 min)
   - Consider `medium` model if `large-v3` is too slow — tradeoff is accuracy
   - PyTorch MPS backend works for `pyannote.audio` and WavLM (transformers)

4. **Dependencies for Mac:**
   ```bash
   brew install ffmpeg
   python3 -m venv .venv
   source .venv/bin/activate
   pip install faster-whisper pyannote.audio torch torchaudio pydub librosa scikit-learn ollama rich pyyaml
   ```
   - No CUDA packages needed: on Apple Silicon, pip installs the macOS arm64 torch wheel, which includes MPS support
   - `pyannote.audio` requires a Hugging Face token (accept the model license first): https://huggingface.co/pyannote/speaker-diarization-3.1

5. **Ollama models available on Mac** (per machine spec):
   - `qwen3:14b` — use for content analysis (Stage 6)
   - `nomic-embed-text` — for grepai, not needed for audio processing

6. **Output compatibility:** Keep the same output format (JSON with segments, timestamps, speaker labels) so the Linux workstation can consume the results after git pull.
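
Before starting a long run, a preflight check along these lines may catch a missing dependency early. It is a sketch under assumptions: the helper names are hypothetical, and the module list is the import-name equivalent of the pip packages in the build notes above (for example `sklearn` for `scikit-learn` and `yaml` for `pyyaml`).

```python
import importlib.util
import shutil

# Import names corresponding to the pip packages in the build notes.
REQUIRED_MODULES = [
    "faster_whisper", "pyannote.audio", "torch", "torchaudio", "pydub",
    "librosa", "sklearn", "ollama", "rich", "yaml",
]


def missing_modules(names: list[str]) -> list[str]:
    """Return the modules that cannot be located, without importing them fully."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):  # e.g. parent package is absent
            found = False
        if not found:
            missing.append(name)
    return missing


def pick_torch_device() -> str:
    """Prefer MPS for PyTorch models (WavLM, pyannote); fall back to CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "mps" if torch.backends.mps.is_available() else "cpu"


if __name__ == "__main__":
    print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)
    print("missing modules:", missing_modules(REQUIRED_MODULES) or "none")
    print("torch device:", pick_torch_device())
```

The device returned here applies only to the PyTorch models; `faster-whisper` stays on CPU regardless.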

### Architecture Reference

```
Raw MP3 → 1. Transcribe (Whisper) → 2. Diarize (pyannote) → 3. Detect Segments
        → 4. Remove Commercials → 5. Split Segments → 6. Analyze (Ollama)
```

For now, only Stages 1-2 matter. Stages 3-6 can wait.
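
One way to combine Stages 1 and 2 is to label each transcript segment with the diarization turn it overlaps most. The sketch below illustrates that idea and is not taken from the existing code: the segment keys (`start`, `end`, `text`, `speaker`) and function names are assumptions, and the `diarize` helper is written against the `pyannote.audio` 3.x API (`use_auth_token`, `itertracks`), which may differ in other releases.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: list[dict], turns: list[tuple]) -> list[dict]:
    """Label each segment with the speaker whose turn overlaps it the most.

    `turns` is a list of (start, end, speaker) tuples; segments with no
    overlapping turn get speaker=None.
    """
    labeled = []
    for seg in segments:
        best_label, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(seg["start"], seg["end"], t_start, t_end)
            if ov > best_overlap:
                best_label, best_overlap = speaker, ov
        labeled.append({**seg, "speaker": best_label})
    return labeled


def diarize(audio_path: str, hf_token: str) -> list[tuple]:
    """Run pyannote diarization and return (start, end, speaker) turns."""
    import torch  # heavy imports kept local to this function
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    if torch.backends.mps.is_available():
        pipeline.to(torch.device("mps"))
    diarization = pipeline(audio_path, min_speakers=1, max_speakers=6)
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]
```

The pyannote labels (`SPEAKER_00`, `SPEAKER_01`, ...) are anonymous; mapping one of them to the host is the job of the voice profile comparison.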

### Key Thresholds (from working Linux version)

- Whisper model: `large-v3`
- Whisper language: `en`
- Voice profile host match threshold: `0.83`
- Min/max speakers for diarization: 1-6
- WavLM model: `microsoft/wavlm-base-sv` (speaker verification, x-vector embeddings)
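
With these settings, a CPU transcription call on the Mac might look like the following sketch. The `WhisperModel` constructor and `transcribe` call are the standard `faster-whisper` API; `compute_type="int8"` is an assumption chosen for CPU speed and does not come from `config.yaml`. The output keys (`start`, `end`, `text`) are also assumptions and should be matched against the existing `transcript.json` before use.

```python
from pathlib import Path
from typing import Any


def segment_to_dict(seg: Any) -> dict:
    """Keep the timing and text fields of a faster-whisper segment."""
    return {
        "start": round(seg.start, 3),
        "end": round(seg.end, 3),
        "text": seg.text.strip(),
    }


def transcribe_episode(mp3: Path, model_size: str = "large-v3") -> list[dict]:
    """Transcribe one episode on CPU and return a list of segment dicts."""
    from faster_whisper import WhisperModel  # heavy import kept local

    # Constructing the model downloads the weights on first use.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # `segments` is a generator: decoding happens while it is consumed.
    segments, _info = model.transcribe(str(mp3), language="en")
    return [segment_to_dict(seg) for seg in segments]
```

Passing `model_size="medium"` is the fallback mentioned in the build notes if `large-v3` turns out to be too slow.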

### Data Flow

1. Training episodes are in `training-data/episodes/` (already in git, 151MB total)
2. Voice profiles are in `voice-profiles/mike-swanson/` (already in git)
3. Transcripts go to `training-data/transcripts/{episode-stem}/`
4. After Mac completes transcription, commit + push to Gitea
5. Linux workstation pulls results and continues with Stages 3-6
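
Step 3 of the flow above could be implemented roughly as below. This is a sketch, not the existing writer: the JSON wrapper (`{"segments": [...]}`), the segment keys, and the file names `transcript.srt` and `transcript.txt` are assumptions, so compare against `training-data/transcripts/2010-10-02-hr1/` and adjust before committing results. The SRT timestamp layout (`HH:MM:SS,mmm`) is the standard one.

```python
import json
from pathlib import Path


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = [
        f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n{seg['text']}\n"
        for i, seg in enumerate(segments, start=1)
    ]
    return "\n".join(cues)


def write_outputs(segments: list[dict], out_dir: Path) -> None:
    """Write transcript.json, transcript.srt and transcript.txt into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: top-level object with a "segments" list.
    (out_dir / "transcript.json").write_text(
        json.dumps({"segments": segments}, indent=2), encoding="utf-8"
    )
    (out_dir / "transcript.srt").write_text(to_srt(segments), encoding="utf-8")
    # Assumption: one segment per line in the plain-text transcript.
    (out_dir / "transcript.txt").write_text(
        "\n".join(seg["text"] for seg in segments) + "\n", encoding="utf-8"
    )
```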

## Session Logs for Context

Read these for the full story of what was built and why:

- `session-logs/2026-03-21-session.md` — Voice profiling results, GPU errors, transcription attempts, diagnosis
- `session-logs/2026-03-20-session.md` — Earlier session (may have additional audio processor context)

## Success Criteria

1. All 8 remaining episodes transcribed with timestamps and segments
2. Transcripts in the same JSON format as `training-data/transcripts/2010-10-02-hr1/transcript.json`
3. Voice profiles load and produce reasonable similarity scores on Mac
4. Results committed to Gitea so Linux workstation can pull them
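
Criteria 1 and 2 can be spot-checked before committing with a script like this sketch. The helper names are hypothetical, the `.srt` and `.txt` file names are assumed to follow `transcript.json`, and the format check only compares top-level JSON keys against the reference transcript, so it is a coarse check rather than a full schema validation.

```python
import json
from pathlib import Path

EXPECTED_FILES = ("transcript.json", "transcript.srt", "transcript.txt")

# The eight episodes still to be transcribed, by file stem.
REMAINING = [
    "2011-06-04-hr1", "2011-09-10-hr1", "2014-s6e05", "2015-s7e30",
    "2016-s8e42", "2017-s9e26", "2018-s10e17", "2018-s10e21",
]


def missing_outputs(transcripts_root: Path, stems: list[str]) -> dict[str, list[str]]:
    """Map each incomplete episode stem to the output files it lacks."""
    report = {}
    for stem in stems:
        missing = [
            name for name in EXPECTED_FILES
            if not (transcripts_root / stem / name).is_file()
        ]
        if missing:
            report[stem] = missing
    return report


def same_top_level_keys(reference: Path, candidate: Path) -> bool:
    """Coarse format check: both JSON files share the same top-level shape."""
    ref = json.loads(reference.read_text(encoding="utf-8"))
    cand = json.loads(candidate.read_text(encoding="utf-8"))
    if isinstance(ref, dict) and isinstance(cand, dict):
        return ref.keys() == cand.keys()
    return type(ref) is type(cand)


if __name__ == "__main__":
    root = Path("training-data/transcripts")
    for stem, files in missing_outputs(root, REMAINING).items():
        print(f"{stem}: missing {', '.join(files)}")
```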