# Radio Show Audio Processor

Automated pipeline for processing The Computer Guru Show recordings into podcast-ready audio, transcripts, and segmented clips.

## What It Does

```
Raw MP3 (full broadcast with commercials)
    │
    ├── 1. Transcribe (Whisper + GPU)
    │   └── Full transcript with timestamps
    │
    ├── 2. Speaker Diarization (pyannote)
    │   └── Who said what (host vs. callers vs. guests)
    │
    ├── 3. Segment Detection
    │   ├── Identify show segments vs. commercials
    │   ├── Detect music/jingles (known + discovered)
    │   └── Map segments to show prep structure
    │
    ├── 4. Commercial Removal
    │   └── Clean episode MP3 (show content only)
    │
    ├── 5. Segment Splitting
    │   ├── Individual segment MP3s (for social media)
    │   └── Chapter markers (for podcast players)
    │
    └── 6. Content Analysis (Ollama)
        ├── Episode summary
        ├── Topic extraction
        ├── Key quotes
        └── Auto-populate post-show debrief
```

## Architecture

### Pipeline Stages

#### Stage 1: Transcription — `faster-whisper` (GPU)

- **Model:** `large-v3` (best accuracy, ~3GB VRAM)
- **Why faster-whisper:** CTranslate2 backend, up to 4x faster than OpenAI's reference Whisper implementation, lower VRAM use
- **Output:** Word-level timestamps, language detection
- **Hardware:** RTX 5070 Ti (16GB VRAM), plenty for large-v3

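
As a rough illustration of Stage 1, the sketch below runs `faster-whisper` and renders the result in the SRT layout used for `transcript.srt`. The function names (`transcribe`, `to_srt`, `srt_timestamp`) and the `float16` compute type are assumptions made for this example, not part of the tool. The `faster_whisper` import is deferred so the formatting helpers can be used without a GPU.

```python
from typing import Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as the body of an .srt file."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


def transcribe(path: str) -> list[tuple[float, float, str]]:
    """Transcribe one file with faster-whisper (needs the package and a CUDA GPU)."""
    # Imported lazily so the helpers above work without faster-whisper installed.
    from faster_whisper import WhisperModel

    # compute_type="float16" is an assumption; int8 variants trade accuracy for VRAM.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path, language="en", word_timestamps=True)
    # `segments` is a generator: transcription happens as it is consumed.
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

With this in place, `to_srt(transcribe("episode.mp3"))` would yield subtitle text ready to write to disk.
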
#### Stage 2: Speaker Diarization — `pyannote.audio`

- **Model:** `pyannote/speaker-diarization-3.1`
- **Purpose:** Identify speaker turns (host, caller 1, caller 2, etc.)
- **Voice enrollment:** Bootstrapped from archive (hundreds of hours of host speech)
- **Output:** Speaker segments with timestamps

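
One way to get from speaker turns to "who said what" is to give each transcript segment the speaker whose turn overlaps it the most. The sketch below assumes that approach; `label_segments` and `diarize` are hypothetical helper names, and the `use_auth_token` keyword is the one used by pyannote.audio 3.1 (other releases may name it differently).

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(transcript, turns):
    """Attach a speaker label to each transcript segment.

    transcript: list of (start, end, text)
    turns:      list of (start, end, speaker) from diarization
    """
    labelled = []
    for start, end, text in transcript:
        best_speaker, best_overlap = "unknown", 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labelled.append((start, end, best_speaker, text))
    return labelled


def diarize(path: str, hf_token: str):
    """Run the pyannote pipeline and return (start, end, speaker) turns."""
    # Imported lazily: needs pyannote.audio and an accepted model license.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarization = pipeline(path, min_speakers=1, max_speakers=6)
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]
```

Segments that overlap no speaker turn at all (music, jingles) keep the label `unknown`, which is convenient input for segment detection.
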
#### Stage 3: Segment Detection — Multi-Signal Classifier

Commercial and segment detection combines multiple signals, because not all show production elements are in the archive: bumpers, stingers, and jingles vary across stations and eras.

**Signal 1: Known element fingerprints (seed library)**

- Fingerprint the production elements we DO have (intros, outros, bumpers from archive)
- Match against episodes to detect known boundaries
- Partial coverage — some elements won't match

**Signal 2: Unknown element discovery**

- Detect short non-speech audio segments (music, jingles, produced audio) that don't match any known fingerprint
- Cluster unknown elements across episodes — if the same 5-second clip appears in 30 episodes, it's a show element
- Flag new clusters for host review and naming
- Discovered elements get added to the fingerprint library automatically

**Signal 3: Speaker identity**

- Host voice present = show content
- Non-host voices with commercial audio characteristics (compressed, produced, different acoustic environment) = ads
- Host voice absent for extended periods (>30s) = likely commercial break

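
The host-absence rule in Signal 3 can be sketched as a scan over the host's speaking turns. This is a minimal illustration that assumes the host's turns have already been identified; `host_absent_gaps` is a hypothetical name.

```python
def host_absent_gaps(host_turns, episode_end: float, min_gap_s: float = 30.0):
    """Return (start, end) spans where the host is silent for more than min_gap_s.

    host_turns: list of (start, end) intervals in seconds, in any order.
    """
    gaps = []
    cursor = 0.0  # end of the latest host speech seen so far
    for start, end in sorted(host_turns):
        if start - cursor > min_gap_s:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if episode_end - cursor > min_gap_s:
        gaps.append((cursor, episode_end))
    return gaps
```
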

**Signal 4: Audio characteristics**

- Volume/loudness shifts (commercials often have different LUFS profiles)
- Spectral characteristics (produced/compressed commercial audio vs. live studio mic)
- Silence gaps (dead air between show and ads)
- Audio environment changes (room tone, background noise differences)

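
For the silence-gap part of Signal 4, a simple approach is to look for runs of frames whose level stays below a threshold. The sketch assumes per-frame levels in dBFS have already been computed (for example with `librosa`); the 100 ms frame size and 500 ms minimum gap are illustrative assumptions, while -40 dB matches `silence_threshold_db` in the configuration.

```python
def silence_gaps(levels_db, frame_ms: int = 100, threshold_db: float = -40.0,
                 min_gap_ms: int = 500):
    """Return (start_s, end_s) spans where the level stays below threshold_db.

    levels_db: one loudness value per frame, in dBFS.
    """
    gaps = []
    run_start = None  # index of the first frame of the current silent run

    def close_run(end_index: int) -> None:
        # Keep the run only if it lasted at least min_gap_ms.
        if (end_index - run_start) * frame_ms >= min_gap_ms:
            gaps.append((run_start * frame_ms / 1000, end_index * frame_ms / 1000))

    for i, level in enumerate(levels_db):
        if level < threshold_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            close_run(i)
            run_start = None
    if run_start is not None:
        close_run(len(levels_db))
    return gaps
```
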

**Signal 5: Learned break patterns (from archive)**

- HR1/HR2 file boundaries = confirmed commercial break locations
- Train a classifier on the audio features at these known boundaries
- Generalize to detect similar patterns within single-file recordings

**Signal 6: Structural heuristics**

- Commercial breaks are typically 2-5 minutes
- Shows typically break every 12-20 minutes
- Transition phrases in transcript ("We'll be right back", "Welcome back")

**Combined scoring:** Each signal contributes a confidence score. A segment is classified as commercial when the combined score exceeds a threshold. This is resilient to missing fingerprints — even without a known bumper match, the other signals can still identify breaks.

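
A minimal sketch of the combined scoring, assuming it is a plain weighted sum of per-signal confidences in the range 0 to 1. The weights and threshold are the values from the `segment_detection` block of `config.yaml`; the function names and the choice to treat a missing signal as 0 are assumptions.

```python
WEIGHTS = {
    "fingerprint_match": 0.30,
    "speaker_identity": 0.25,
    "audio_characteristics": 0.20,
    "break_pattern": 0.15,
    "structural_heuristic": 0.10,
}
CONFIDENCE_THRESHOLD = 0.70


def commercial_score(signals: dict[str, float]) -> float:
    """Weighted sum of signal confidences; signals not supplied count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def is_commercial(signals: dict[str, float]) -> bool:
    """Classify a segment as commercial when the combined score reaches the threshold."""
    return commercial_score(signals) >= CONFIDENCE_THRESHOLD
```

One consequence worth checking when tuning: with these weights, a segment with no fingerprint match can score at most 0.70, so under a plain weighted sum every other signal has to be near certain. If that proves too strict in practice, lowering the threshold or renormalising over the signals that actually fired are two possible adjustments.
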
#### Stage 4: Commercial Removal — `ffmpeg`

- Stitch show segments together with crossfades
- Normalize audio levels (EBU R128 loudness standard)
- Output clean podcast-ready MP3

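
The stitching step could be expressed as a single `ffmpeg` filter graph: trim each show interval, chain the pieces with `acrossfade`, then apply `loudnorm` (ffmpeg's EBU R128 filter). The sketch below only builds the command; the helper names, the -16 LUFS target, and the 44.1 kHz output rate are assumptions, while the 0.5 s crossfade and 192k bitrate come from `config.yaml`.

```python
def show_intervals(duration_s: float, breaks):
    """Return the parts of [0, duration_s] not covered by commercial breaks."""
    keep, cursor = [], 0.0
    for start, end in sorted(breaks):
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration_s:
        keep.append((cursor, duration_s))
    return keep


def build_filter_graph(keep, crossfade_s: float = 0.5) -> str:
    """Build a filter_complex string that trims, crossfades and normalizes."""
    if not keep:
        raise ValueError("no show content to keep")
    parts = [
        f"[0:a]atrim=start={start}:end={end},asetpts=PTS-STARTPTS[s{i}]"
        for i, (start, end) in enumerate(keep)
    ]
    last = "s0"
    for i in range(1, len(keep)):
        # Each acrossfade joins the running result with the next piece.
        parts.append(f"[{last}][s{i}]acrossfade=d={crossfade_s}[x{i}]")
        last = f"x{i}"
    # Loudness target values are assumptions, not taken from the project.
    parts.append(f"[{last}]loudnorm=I=-16:TP=-1.5:LRA=11[out]")
    return ";".join(parts)


def ffmpeg_command(src: str, dst: str, keep) -> list[str]:
    """Argument list suitable for subprocess.run()."""
    return ["ffmpeg", "-i", src, "-filter_complex", build_filter_graph(keep),
            "-map", "[out]", "-ar", "44100", "-b:a", "192k", dst]
```

For a 30-minute file with one break from 12:00 to 15:00, `show_intervals(1800, [(720, 900)])` keeps the two spans on either side, and `ffmpeg_command` turns them into one invocation.
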
#### Stage 5: Segment Splitting — `ffmpeg`

- Export individual segments as separate MP3s
- Apply fade in/out
- Add ID3 tags (show name, segment title, date)
- Generate chapter markers file (for podcast apps)

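
Segment file names such as `03-jensens-trillion-dollar-bet.mp3` suggest a zero-padded index plus a slug of the segment title. The sketch below assumes that naming rule and a simple JSON layout for `chapters.json`; the schema, the helper names, and the example titles are all hypothetical.

```python
import json
import re


def slugify(title: str) -> str:
    """Lower-case the title, drop apostrophes, and hyphenate everything else."""
    cleaned = title.lower().replace("'", "").replace("\u2019", "")
    return re.sub(r"[^a-z0-9]+", "-", cleaned).strip("-")


def segment_filename(index: int, title: str) -> str:
    """Zero-padded index plus slug, e.g. 00-intro.mp3."""
    return f"{index:02d}-{slugify(title)}.mp3"


def chapters_json(segments) -> str:
    """Serialize (title, start_s, end_s) tuples, timed against the clean episode."""
    chapters = [
        {
            "index": i,
            "title": title,
            "start": start,
            "end": end,
            "file": segment_filename(i, title),
        }
        for i, (title, start, end) in enumerate(segments)
    ]
    return json.dumps({"chapters": chapters}, indent=2)
```
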
#### Stage 6: Content Analysis — `Ollama` (qwen3:14b or codestral)

- Feed transcript + speaker labels to local LLM
- Generate:
  - Episode summary (2-3 paragraphs)
  - Per-segment summaries
  - Key quotes with speaker attribution
  - Topic tags
  - Suggested blog post topics
  - Auto-filled post-show debrief template

### Audio Element Library

The element library is a **learning system**, not a static collection.

```
element-library/
  fingerprints.db              # SQLite database of audio fingerprints
  known/                       # Source files we have
    intros/
    outros/
    bumpers/
    promos/
  discovered/                  # Elements found by the discovery system
    cluster-001.mp3            # Unknown element, appears in 47 episodes
    cluster-002.mp3            # Unknown element, appears in 12 episodes
    ...
  metadata.json                # Names, categories, date ranges for each element
```

**Lifecycle of a discovered element:**

1. Processor detects non-speech audio that doesn't match any known fingerprint
2. Audio clip is extracted and stored as a candidate
3. Candidate is compared against all other candidates across episodes
4. Matches are clustered — same audio in multiple episodes = confirmed element
5. New element is fingerprinted and added to the library as "unnamed"
6. Host reviews unnamed elements periodically and assigns names/categories
7. Named elements improve future detection accuracy

**Element categories:**

- `show-intro` — Full show opening
- `show-outro` — Full show closing
- `segment-bumper` — Music between show segments
- `break-bumper` — Music going into/out of commercial breaks
- `station-id` — Station identification (legal requirement, consistent per station)
- `promo` — Show promo or cross-promotion
- `stinger` — Short audio effect (sound effect, catchphrase)
- `unknown` — Not yet categorized

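
Steps 3 and 4 of the discovery lifecycle above amount to clustering candidate clips and keeping clusters seen in enough distinct episodes. The sketch below uses greedy clustering on cosine similarity between feature vectors, which is an assumption made for illustration: real audio fingerprints would need their own comparison function. The defaults mirror `cluster_similarity_threshold` and `min_cluster_occurrences` from `config.yaml`.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_candidates(candidates, threshold: float = 0.90, min_occurrences: int = 3):
    """Group candidate clips and keep clusters seen in enough distinct episodes.

    candidates: list of (episode_id, feature_vector).
    Returns a list of {"seed": vector, "episodes": set_of_episode_ids}.
    """
    clusters = []
    for episode, vector in candidates:
        for cluster in clusters:
            # Compare against the first clip of each cluster (its seed).
            if cosine(vector, cluster["seed"]) >= threshold:
                cluster["episodes"].add(episode)
                break
        else:
            clusters.append({"seed": vector, "episodes": {episode}})
    return [c for c in clusters if len(c["episodes"]) >= min_occurrences]
```

Clusters that fall short of `min_occurrences` are simply dropped here; a real implementation would probably retain them as candidates so they can be confirmed once more episodes are processed.
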
### Voice Profile System

**Bootstrapped from 579-episode archive**, not a single enrollment sample.

```
voice-profiles/
  host-mike-swanson/
    embedding-composite.npy    # Average embedding across all eras
    embedding-2010.npy         # Era-specific (voice changes over time)
    embedding-2014.npy
    embedding-2018.npy
    embedding-2026.npy
    metadata.json              # Speaker name, role, episode count
  guests/
    [name].npy                 # Named guest embeddings (built over time)
  callers/
    regular-001.npy            # Unnamed repeat caller
    regular-002.npy
  unknown/
    cluster-[id].npy           # Voices that appear multiple times, not yet named
```

**Bootstrap process:**

1. Diarize 10 diverse archive episodes (different years)
2. Dominant speaker in each = host (by far the most speaking time)
3. Extract host-only segments, generate embeddings
4. Create per-era profiles (voice may change over 8+ years)
5. Composite embedding = average across all eras

**Continuous improvement:**

- Each processed episode refines the host embedding
- Repeat non-host voices are clustered across episodes
- Host reviews clusters: "This voice appears in 47 episodes — who is this?"
- Named profiles improve future speaker labeling

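
The bootstrap steps above reduce to three small operations: pick the speaker with the most talk time, average the era embeddings, and compare new voices against the composite. The sketch below uses plain Python lists to stay dependency-free (the real profiles are `.npy` arrays, so NumPy would be the natural choice); the function names are hypothetical, and 0.75 is `host_match_threshold` from `config.yaml`.

```python
import math
from collections import defaultdict


def dominant_speaker(turns) -> str:
    """Return the label with the most total speaking time.

    turns: list of (start, end, speaker) from diarization.
    """
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return max(totals, key=totals.get)


def mean_embedding(embeddings):
    """Element-wise average of equal-length embedding vectors."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_host(embedding, composite, threshold: float = 0.75) -> bool:
    """True when a voice embedding is close enough to the host composite."""
    return cosine_similarity(embedding, composite) >= threshold
```
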
## Dependencies

```bash
# System packages
sudo pacman -S python-pip ffmpeg

# Python packages (in a venv)
python3 -m venv ~/.local/share/radio-processor
source ~/.local/share/radio-processor/bin/activate

pip install faster-whisper       # Transcription (CTranslate2 + CUDA)
pip install pyannote.audio       # Speaker diarization
pip install torch torchaudio     # PyTorch (CUDA)
pip install silero-vad           # Voice activity detection
pip install pydub                # Audio manipulation
pip install librosa              # Audio analysis / spectral features
pip install chromaprint          # Audio fingerprinting (or use dejavu)
pip install scikit-learn         # Break pattern classifier
pip install ollama               # Local LLM API
pip install rich                 # CLI progress display
```

### pyannote.audio Access

pyannote requires accepting the model licenses on HuggingFace:

1. Create account at huggingface.co
2. Accept license at https://huggingface.co/pyannote/speaker-diarization-3.1 and at https://huggingface.co/pyannote/segmentation-3.0 (the diarization pipeline depends on the segmentation model)
3. Generate access token
4. `huggingface-cli login`

## Usage

```bash
# Full pipeline (new episode)
radio-process episode.mp3 --show-prep episodes/2026-03-21-who-controls-your-tech/show-prep.md

# Just transcribe
radio-process episode.mp3 --transcribe-only

# Process archive episode (training mode — learns elements + voices)
radio-process episode-hr1.mp3 episode-hr2.mp3 --archive-mode --date 2016-03-15

# Batch process archive for training
radio-process --batch-train archive/2016/ --output training-data/

# Enroll host voice from archive (bootstrap)
radio-process --bootstrap-voice archive/ --speaker-name "Mike Swanson" --role host

# Review discovered elements
radio-process --review-elements

# Review unknown speaker clusters
radio-process --review-speakers
```

## Output Structure

```
episodes/YYYY-MM-DD-topic/
  show-prep.md                 # Pre-show (existing)
  post-show-debrief.md         # Auto-generated draft
  raw/
    full-broadcast.mp3         # Original recording
  processed/
    transcript.json            # Full transcript with timestamps + speakers
    transcript.txt             # Plain text transcript
    transcript.srt             # Subtitle format
    podcast-episode.mp3        # Clean episode (commercials removed)
    chapters.json              # Chapter markers
    detection-report.json      # What was detected as commercial/show, confidence scores
  segments/
    00-intro.mp3
    01-the-week-that-was.mp3
    02-the-government-wants-in.mp3
    03-jensens-trillion-dollar-bet.mp3
    04-apple-gives-google-the-keys.mp3
    05-a-petabyte-of-your-data-gone.mp3
    06-right-to-repair.mp3
    07-outro.mp3
  generated/
    episode-post.md            # For website
    forum-thread.md            # For community forum
    blog-topic-1.md            # Deep-dive article
    blog-topic-2.md            # Deep-dive article
    analysis.json              # LLM analysis output
```

## Configuration

```yaml
# config.yaml
show:
  name: "The Computer Guru Show"
  host: "Mike Swanson"
  typical_duration_minutes: 120        # 2-hour broadcast
  segment_count: 6
  has_commercials: true

audio:
  whisper_model: "large-v3"
  whisper_language: "en"
  output_format: "mp3"
  output_bitrate: "192k"
  normalize: true                      # EBU R128
  crossfade_ms: 500                    # Between stitched segments

segment_detection:
  # Fingerprint matching
  fingerprint_db: "element-library/fingerprints.db"
  fingerprint_match_threshold: 0.85    # Minimum similarity for a match

  # Element discovery
  discover_unknown_elements: true
  min_element_duration_s: 1.0          # Shortest element to detect
  max_element_duration_s: 30.0         # Longest (full intro might be 20-30s)
  cluster_similarity_threshold: 0.90   # How similar clips must be to cluster
  min_cluster_occurrences: 3           # Must appear in 3+ episodes to be an element

  # Commercial classification
  min_break_duration_s: 30             # Minimum commercial break length
  max_break_duration_s: 300            # Maximum (5 min)
  silence_threshold_db: -40            # Silence detection threshold
  confidence_threshold: 0.70           # Combined score to classify as commercial

  # Signal weights (tune based on accuracy)
  weights:
    fingerprint_match: 0.30            # Known element detected
    speaker_identity: 0.25             # Host voice absent
    audio_characteristics: 0.20        # Production style differs
    break_pattern: 0.15                # Matches trained break pattern
    structural_heuristic: 0.10         # Duration/timing rules

diarization:
  min_speakers: 1
  max_speakers: 6
  voice_profiles_dir: "voice-profiles/"
  host_match_threshold: 0.75           # Similarity to host embedding

llm:
  model: "qwen3:14b"                   # Ollama model for analysis
  ollama_host: "http://localhost:11434"

paths:
  episodes_dir: "episodes/"
  voice_profiles: "voice-profiles/"
  element_library: "element-library/"
  output_dir: "processed/"

archive:
  server: "172.16.3.10"
  path: "/home/gurushow/public_html/archive/"
  elements_path: "/home/gurushow/public_html/archive/Radio/Elements/"
```

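
Because the signal weights are meant to be tuned by hand, a small sanity check on the `segment_detection` block can catch mistakes early. The sketch below assumes the YAML has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`, which is not in the dependency list above) and that the weights are intended to sum to 1, as the shipped values do. `validate_segment_detection` is a hypothetical helper.

```python
import math


def validate_segment_detection(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    total = sum(cfg["weights"].values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        problems.append(f"weights sum to {total:.2f}, expected 1.00")

    for key in ("fingerprint_match_threshold",
                "cluster_similarity_threshold",
                "confidence_threshold"):
        if not 0.0 <= cfg[key] <= 1.0:
            problems.append(f"{key} must be between 0 and 1")

    if cfg["min_break_duration_s"] >= cfg["max_break_duration_s"]:
        problems.append("min_break_duration_s must be below max_break_duration_s")
    if cfg["min_element_duration_s"] >= cfg["max_element_duration_s"]:
        problems.append("min_element_duration_s must be below max_element_duration_s")

    return problems
```
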
## Training Data: 579-Episode Archive

The archive on IX server (172.16.3.10) contains 579 MP3 files spanning 2010-2018:

| Year | Files | Size | Notes |
|------|-------|------|-------|
| 2010 | 43 | 664MB | Season 7 start |
| 2011 | 200 | 1.9GB | Peak output |
| 2012 | 98 | 1.2GB | |
| 2014 | 81 | 783MB | Season 6 (new station) |
| 2015 | 50 | 461MB | |
| 2016 | 54 | 1.2GB | |
| 2017 | 41 | 1.5GB | |
| 2018 | 5 | 101MB | Final season 10 episodes |
| Elements | 7 MP3 + 18 WAV | 203MB | Partial production library |

Episodes are split into HR 1 / HR 2 files. The HR boundary is a confirmed commercial break point — used for training the break detection classifier.

**Important:** Not all production elements are in the archive. Bumpers, stingers, and jingles varied across stations and time periods. The element discovery system handles this by detecting and clustering unknown elements across episodes.

## Future Enhancements

1. **Audiogram generator** — Create video clips with waveform animation + captions for social media
2. **Highlight reel** — Auto-detect the most engaging 60-90 seconds (high energy, laughter, emphasis)
3. **Show notes generator** — Generate timestamped show notes in podcast standard format
4. **RSS feed integration** — Auto-publish processed episodes to podcast RSS feed
5. **Sentiment analysis** — Track audience engagement topics over time
6. **Topic continuity** — Link topics across episodes ("Last week we talked about X, this week...")
7. **Live processing** — Real-time transcription during broadcast for immediate post-show turnaround
8. **Cross-episode search** — Full-text search across all transcripts ("When did we talk about net neutrality?")