diff --git a/projects/radio-show/audio-processor/MAC_BUILD_TASK.md b/projects/radio-show/audio-processor/MAC_BUILD_TASK.md
new file mode 100644
index 0000000..58c6e03
--- /dev/null
+++ b/projects/radio-show/audio-processor/MAC_BUILD_TASK.md
@@ -0,0 +1,135 @@
+# Mac Build Task: Radio Show Audio Processor
+
+**Date:** 2026-03-21
+**From:** CachyOS workstation (acg-guru-5070)
+**To:** Mac Claude instance (Mikes-MacBook-Air, M4)
+**Priority:** High — this is blocked on the Linux workstation
+
+---
+
+## What We Need
+
+Build a Mac-native version of the radio show audio processor that can run the **transcription and voice profiling pipeline** on Apple Silicon (M4). The Linux workstation's RTX 5070 Ti has a known GPU firmware bug that crashes after ~3 minutes of sustained compute, making GPU-accelerated transcription impossible until NVIDIA fixes it.
+
+## Why the Mac
+
+The CachyOS workstation's NVIDIA RTX 5070 Ti Laptop GPU hits a **GSP (GPU System Processor) firmware crash** under sustained load. This is a known, unresolved bug across all RTX 50-series (Blackwell) GPUs on Linux:
+
+- Error: `NVRM: _issueRpcLarge: rpcSendMessage failed with status 0x00000062`
+- Triggers after ~3-5 minutes of continuous GPU compute
+- GPU enters full ERR! state, requires hard reboot (warm reboot hangs)
+- Cannot disable GSP on 50-series (open kernel module required, no `NVreg_EnableGpuFirmware=0`)
+- NVIDIA internal bug #5953411 filed, no fix available
+- Affects drivers 580.x, 590.x, 595.x (current: 595.45.04)
+- Power management tweaks, persistence mode, clock locking — none helped
+- See: session logs `2026-03-21-session.md` and `2026-03-20-session.md` for full diagnosis
+
+The M4 MacBook Air with 16GB unified memory can run this workload on CPU or MPS backend without driver issues.
+
+## The Project
+
+**Location in repo:** `projects/radio-show/audio-processor/`
+
+**Goal:** Automated pipeline for processing "The Computer Guru Show" radio recordings. The immediate task is **voice training** — transcribing 9 archive episodes and building speaker embeddings to identify the host (Mike Swanson) vs. callers/guests/commercials.
+
+### What Already Works (built on Linux, may need Mac adaptation)
+
+1. **Voice Profiler** (`src/voice_profiler.py`) — Uses WavLM (Microsoft `microsoft/wavlm-base-sv`) for speaker verification via x-vector embeddings. **WORKING** — 180 embeddings generated, composite built. Host voice scores 0.90-0.98 similarity, non-host 0.53-0.65. Threshold tuned to 0.83.
+
+2. **Transcriber** (`src/transcriber.py`) — Uses `faster-whisper` with `large-v3` model. On Linux it was configured for CUDA. Only 1 of 9 episodes transcribed before GPU died.
+
+3. **Config** (`config.yaml`) — All pipeline settings, thresholds, paths.
+
+4. **Voice profiles** — `voice-profiles/mike-swanson/` has 180 `.npy` embedding files + `composite.npy` + `profiles.json`. These are numpy arrays, platform-independent.
+
+5. **One completed transcript** — `training-data/transcripts/2010-10-02-hr1/` (534 segments, transcript.json + .srt + .txt)
+
+### What Needs Doing on Mac
+
+**Primary task: Transcribe the remaining 8 episodes:**
+```
+training-data/episodes/2011-06-04-hr1.mp3 (7.4MB, ~43 min)
+training-data/episodes/2011-09-10-hr1.mp3 (11MB)
+training-data/episodes/2014-s6e05.mp3 (9.5MB)
+training-data/episodes/2015-s7e30.mp3 (9.0MB)
+training-data/episodes/2016-s8e42.mp3 (19MB)
+training-data/episodes/2017-s9e26.mp3 (48MB)
+training-data/episodes/2018-s10e17.mp3 (21MB)
+training-data/episodes/2018-s10e21.mp3 (21MB)
+```
+
+Output each to: `training-data/transcripts/{episode-stem}/transcript.json` (+ .srt, .txt)
+
+**Secondary: Verify voice profiles work on Mac** — load the existing `.npy` embeddings and run similarity checks against the new transcripts.
+
+### Mac-Specific Build Notes
+
+1. **Do NOT try to port the Linux code directly.** Build fresh for Mac hardware. The existing code has CUDA-specific paths (`src/gpu.py` sets `LD_LIBRARY_PATH` for CUDA 12), nvidia-specific device selection, etc. It's cleaner to build natively.
+
+2. **Reference the existing code for architecture and logic**, especially:
+   - `src/voice_profiler.py` — the WavLM embedding approach, similarity thresholds, profile structure
+   - `src/transcriber.py` — the Whisper pipeline stages, output format (TranscriptSegment dataclass)
+   - `config.yaml` — all the tuned parameters
+   - `README.md` — full architecture doc for the 6-stage pipeline
+
+3. **Hardware target: Apple M4, 16GB unified memory**
+   - `faster-whisper` + ctranslate2 supports CPU on macOS (no MPS backend for ctranslate2)
+   - `large-v3` model needs ~3GB RAM — fits easily in 16GB
+   - Expected speed: ~1x realtime on M4 CPU (43-min episode takes ~43 min)
+   - Consider `medium` model if `large-v3` is too slow — tradeoff is accuracy
+   - PyTorch MPS backend works for `pyannote.audio` and WavLM (transformers)
+
+4. **Dependencies for Mac:**
+   ```bash
+   brew install ffmpeg
+   python3 -m venv .venv
+   source .venv/bin/activate
+   pip install faster-whisper pyannote.audio torch torchaudio pydub librosa scikit-learn ollama rich pyyaml
+   ```
+   - No CUDA packages needed — pip will pull CPU-only torch or MPS-enabled torch for macOS
+   - `pyannote.audio` requires HuggingFace token (accept model license first): https://huggingface.co/pyannote/speaker-diarization-3.1
+
+5. **Ollama models available on Mac** (per machine spec):
+   - `qwen3:14b` — use for content analysis (Stage 6)
+   - `nomic-embed-text` — for grepai, not needed for audio processing
+
+6. **Output compatibility:** Keep the same output format (JSON with segments, timestamps, speaker labels) so the Linux workstation can consume the results after git pull.
+
+### Architecture Reference
+
+```
+Raw MP3 → 1. Transcribe (Whisper) → 2. Diarize (pyannote) → 3. Detect Segments
+        → 4. Remove Commercials → 5. Split Segments → 6. Analyze (Ollama)
+```
+
+For now, only Stages 1-2 matter. Stages 3-6 can wait.
+
+### Key Thresholds (from working Linux version)
+
+- Whisper model: `large-v3`
+- Whisper language: `en`
+- Voice profile host match threshold: `0.83`
+- Min/max speakers for diarization: 1-6
+- WavLM model: `microsoft/wavlm-base-sv` (speaker verification, x-vector embeddings)
+
+### Data Flow
+
+1. Training episodes are in `training-data/episodes/` (already in git, 151MB total)
+2. Voice profiles are in `voice-profiles/mike-swanson/` (already in git)
+3. Transcripts go to `training-data/transcripts/{episode-stem}/`
+4. After Mac completes transcription, commit + push to Gitea
+5. Linux workstation pulls results and continues with Stages 3-6
+
+## Session Logs for Context
+
+Read these for the full story of what was built and why:
+
+- `session-logs/2026-03-21-session.md` — Voice profiling results, GPU errors, transcription attempts, diagnosis
+- `session-logs/2026-03-20-session.md` — Earlier session (may have additional audio processor context)
+
+## Success Criteria
+
+1. All 8 remaining episodes transcribed with timestamps and segments
+2. Transcripts in the same JSON format as `training-data/transcripts/2010-10-02-hr1/transcript.json`
+3. Voice profiles load and produce reasonable similarity scores on Mac
+4. Results committed to Gitea so Linux workstation can pull them
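
Review note: success criteria 1 and 3 can be spot-checked with a small standard-library script before committing. The sketch below is not part of the existing code. The function names `pending_episodes`, `cosine_similarity`, and `is_host` are hypothetical, and it assumes the "similarity" quoted in the brief is cosine similarity, which the brief does not state. The directory layout and the `0.83` threshold are taken from the brief.

```python
"""Hedged sketch: spot-check success criteria 1 and 3 with the standard library."""
import math
from pathlib import Path

# Host match threshold quoted in the task brief.
HOST_THRESHOLD = 0.83


def pending_episodes(root):
    """Return stems of episodes under root that have no transcript.json yet."""
    root = Path(root)
    episodes = sorted((root / "training-data" / "episodes").glob("*.mp3"))
    transcripts = root / "training-data" / "transcripts"
    return [
        ep.stem
        for ep in episodes
        if not (transcripts / ep.stem / "transcript.json").is_file()
    ]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (lists or 1-D arrays)."""
    a = [float(x) for x in a]
    b = [float(x) for x in b]
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embedding has zero magnitude")
    return sum(x * y for x, y in zip(a, b)) / norm


def is_host(embedding, composite, threshold=HOST_THRESHOLD):
    """True when the embedding scores at or above the host threshold.

    Assumption: scores are cosine similarities against the composite profile.
    """
    return cosine_similarity(embedding, composite) >= threshold
```

Run from `projects/radio-show/audio-processor/`, `pending_episodes(".")` should return an empty list once all eight transcripts exist. For the profile check, the `.npy` files could be loaded with `numpy.load` and passed straight to `is_host`, since the helper only iterates over its inputs.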