- Build Tom (co-host) voice profile (44 embeddings, 0.698 similarity to Mike)
- diarizer.py: add CO-HOST speaker label for cohost-role profiles
- voice_profiler.py: emit "Cohost: <name>" label for cohost role
- qa_extractor.py: overlap resolution at load time (midpoint boundary split), 4s CALLER-preference threshold, turn-based caller-intro lookback (2 HOST turns), _preceded_by_caller_intro() helper, _PHONE_GREETING pattern, 751-1041 + "we'll get your problem solved" promo signatures
- benchmark.py: use src.transcriber.transcribe with batch_size=16
- add index_test_episodes.py and build_cohost_profile.py scripts
- add .gitignore (exclude episodes, transcripts, *.db, .venv)
- session log: 2026-04-27-qa-extraction-cohost-indexing.md
Result: 2016-s8e43 drops from 12 false-positive Q&A pairs to 2 real caller pairs.
archive.db: 6 episodes, 762 segments, 10 Q&A pairs, FTS5 search verified.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Radio Show Audio Processor
Automated pipeline for processing The Computer Guru Show recordings into podcast-ready audio, transcripts, and segmented clips.
What It Does
Raw MP3 (full broadcast with commercials)
│
├── 1. Transcribe (Whisper + GPU)
│   └── Full transcript with timestamps
│
├── 2. Speaker Diarization (pyannote)
│   └── Who said what (host vs. callers vs. guests)
│
├── 3. Segment Detection
│   ├── Identify show segments vs. commercials
│   ├── Detect music/jingles (known + discovered)
│   └── Map segments to show prep structure
│
├── 4. Commercial Removal
│   └── Clean episode MP3 (show content only)
│
├── 5. Segment Splitting
│   ├── Individual segment MP3s (for social media)
│   └── Chapter markers (for podcast players)
│
└── 6. Content Analysis (Ollama)
    ├── Episode summary
    ├── Topic extraction
    ├── Key quotes
    └── Auto-populate post-show debrief
Architecture
Pipeline Stages
Stage 1: Transcription — faster-whisper (GPU)
- Model: large-v3 (best accuracy, ~3GB VRAM)
- Why faster-whisper: CTranslate2 backend, up to 4x faster than the reference OpenAI Whisper implementation at the same accuracy, lower VRAM
- Output: Word-level timestamps, language detection
- Hardware: RTX 5070 Ti (12GB VRAM) — plenty for large-v3
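A minimal sketch of this stage, assuming the `faster-whisper` package is installed with CUDA support. The function names `transcribe` and `format_srt_time` are illustrative and not taken from the project code; `compute_type="float16"` is an assumed setting.

```python
def format_srt_time(seconds: float) -> str:
    """Render seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def transcribe(path: str, model_size: str = "large-v3", language=None) -> list:
    """Transcribe one file and return segments with word-level timestamps."""
    # Imported lazily so the helper above works without the GPU stack installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # language=None lets the model detect the language; pass "en" to skip detection.
    segments, info = model.transcribe(path, language=language, word_timestamps=True)
    print(f"Language: {info.language}")
    return [
        {
            "start": seg.start,
            "end": seg.end,
            "text": seg.text.strip(),
            "words": [
                {"start": w.start, "end": w.end, "word": w.word} for w in seg.words
            ],
        }
        for seg in segments
    ]
```

The returned dictionaries carry enough information to write `transcript.json` directly, and `format_srt_time` covers the timestamp format needed for `transcript.srt`.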
Stage 2: Speaker Diarization — pyannote.audio
- Model: pyannote/speaker-diarization-3.1
- Purpose: Identify speaker turns (host, caller 1, caller 2, etc.)
- Voice enrollment: Bootstrapped from archive (hundreds of hours of host speech)
- Output: Speaker segments with timestamps
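Once diarization has produced speaker turns, they have to be merged with the transcript from Stage 1. One simple way to do that, sketched here under the assumption that both are lists of dictionaries with `start` and `end` in seconds, is to give each transcript segment the speaker whose turn overlaps it the most. The names `label_segments` and `overlap` are hypothetical.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time spans (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(transcript: list, turns: list) -> list:
    """Attach a 'speaker' key to each transcript segment by maximum overlap."""
    labeled = []
    for seg in transcript:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


transcript = [
    {"start": 0.0, "end": 4.0, "text": "Welcome to the show."},
    {"start": 4.0, "end": 9.0, "text": "Hi, I have a question about my router."},
]
turns = [
    {"start": 0.0, "end": 4.5, "speaker": "HOST"},
    {"start": 4.5, "end": 10.0, "speaker": "CALLER_1"},
]
labeled = label_segments(transcript, turns)
```

Segments that overlap no turn at all keep the `UNKNOWN` label, which makes gaps in the diarization visible instead of silently assigning them to the host.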
Stage 3: Segment Detection — Multi-Signal Classifier
Commercial and segment detection uses multiple signals combined, because not all show production elements are in the archive — bumpers, stingers, and jingles vary across stations and eras.
Signal 1: Known element fingerprints (seed library)
- Fingerprint the production elements we DO have (intros, outros, bumpers from archive)
- Match against episodes to detect known boundaries
- Partial coverage — some elements won't match
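As an illustration of fingerprint matching, the sketch below assumes fingerprints are stored as lists of 32-bit integers (the shape of raw Chromaprint output) and scores a match by the fraction of identical bits. The functions `bit_similarity` and `find_element` are hypothetical, and the 0.85 default mirrors `fingerprint_match_threshold` from the configuration.

```python
def bit_similarity(a: list, b: list) -> float:
    """Fraction of identical bits between two equal-length 32-bit fingerprints."""
    if len(a) != len(b) or not a:
        raise ValueError("fingerprints must be non-empty and of equal length")
    differing = sum(bin((x ^ y) & 0xFFFFFFFF).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (32 * len(a))


def find_element(episode: list, element: list, threshold: float = 0.85):
    """Slide the element over the episode fingerprint.

    Returns (frame_offset, score) for the best window at or above the
    threshold, or None when no window qualifies.
    """
    best = None
    for offset in range(len(episode) - len(element) + 1):
        window = episode[offset:offset + len(element)]
        score = bit_similarity(window, element)
        if score >= threshold and (best is None or score > best[1]):
            best = (offset, score)
    return best


bumper = [0xAAAA5555, 0x12345678, 0xFFFF0000]
episode = [1, 2, 3] + bumper + [4, 5]
match = find_element(episode, bumper)
```

A linear scan like this is quadratic in the worst case; a real library would more likely index fingerprints in `fingerprints.db` and only verify candidate offsets.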
Signal 2: Unknown element discovery
- Detect short non-speech audio segments (music, jingles, produced audio) that don't match any known fingerprint
- Cluster unknown elements across episodes — if the same 5-second clip appears in 30 episodes, it's a show element
- Flag new clusters for host review and naming
- Discovered elements get added to the fingerprint library automatically
Signal 3: Speaker identity
- Host voice present = show content
- Non-host voices with commercial audio characteristics (compressed, produced, different acoustic environment) = ads
- Host voice absent for extended periods (>30s) = likely commercial break
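The host-absence rule can be sketched as a scan over diarization turns. This assumes turns are dictionaries with `start`, `end`, and `speaker` keys and that the host is labeled `HOST`; both are illustrative conventions, as is the function name.

```python
def host_absence_gaps(turns: list, total_duration: float,
                      host: str = "HOST", min_gap_s: float = 30.0) -> list:
    """Return (start, end) spans where the host does not speak for more than min_gap_s."""
    host_turns = sorted((t["start"], t["end"]) for t in turns if t["speaker"] == host)
    gaps, cursor = [], 0.0
    for start, end in host_turns:
        if start - cursor > min_gap_s:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Trailing gap after the host's last turn.
    if total_duration - cursor > min_gap_s:
        gaps.append((cursor, total_duration))
    return gaps


turns = [
    {"start": 0, "end": 600, "speaker": "HOST"},
    {"start": 200, "end": 260, "speaker": "CALLER_1"},
    {"start": 600, "end": 780, "speaker": "SPEAKER_07"},
    {"start": 780, "end": 1200, "speaker": "HOST"},
]
gaps = host_absence_gaps(turns, total_duration=1200)
```

Each gap is only a candidate break. A long caller monologue would also produce a gap, which is why this signal is combined with the others rather than used alone.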
Signal 4: Audio characteristics
- Volume/loudness shifts (commercials often have different LUFS profiles)
- Spectral characteristics (produced/compressed commercial audio vs. live studio mic)
- Silence gaps (dead air between show and ads)
- Audio environment changes (room tone, background noise differences)
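A rough way to spot level shifts is to compare RMS levels of consecutive windows. The sketch below uses plain RMS in dBFS as a stand-in; it is not a LUFS measurement (that would need the K-weighting filter from ITU-R BS.1770), and the 6 dB jump threshold and function names are assumptions.

```python
import math


def window_levels_db(samples: list, sample_rate: int, window_s: float = 1.0) -> list:
    """RMS level in dBFS for each non-overlapping window of float samples in [-1, 1]."""
    size = max(1, int(sample_rate * window_s))
    levels = []
    for i in range(0, len(samples) - size + 1, size):
        window = samples[i:i + size]
        rms = math.sqrt(sum(s * s for s in window) / size)
        levels.append(20 * math.log10(max(rms, 1e-10)))  # floor avoids log(0)
    return levels


def loudness_shifts(levels: list, min_jump_db: float = 6.0) -> list:
    """Indices of windows whose level differs from the previous window by min_jump_db or more."""
    return [
        i for i in range(1, len(levels))
        if abs(levels[i] - levels[i - 1]) >= min_jump_db
    ]


# Toy signal: two quiet windows followed by two loud ones (4 samples per second).
samples = [0.1] * 8 + [0.5] * 8
levels = window_levels_db(samples, sample_rate=4)
shifts = loudness_shifts(levels)
```

The same per-window levels can be compared against `silence_threshold_db` to find dead air between show content and ads.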
Signal 5: Learned break patterns (from archive)
- HR1/HR2 file boundaries = confirmed commercial break locations
- Train a classifier on the audio features at these known boundaries
- Generalize to detect similar patterns within single-file recordings
Signal 6: Structural heuristics
- Commercial breaks are typically 2-5 minutes
- Shows typically break every 12-20 minutes
- Transition phrases in transcript ("We'll be right back", "Welcome back")
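The transcript-based cue and the duration rule are easy to express directly. In this sketch the phrase list contains only the two examples given above, and the duration bounds default to `min_break_duration_s` and `max_break_duration_s` from the configuration; the function names are hypothetical.

```python
import re

# Only the two phrases named in this README; a real list would likely be longer.
TRANSITION = re.compile(r"\b(we'll be right back|welcome back)\b", re.IGNORECASE)


def transition_cues(segments: list) -> list:
    """Return (timestamp, phrase) for transcript segments that contain a break cue."""
    cues = []
    for seg in segments:
        found = TRANSITION.search(seg["text"])
        if found:
            cues.append((seg["start"], found.group(1).lower()))
    return cues


def plausible_break(start_s: float, end_s: float,
                    min_s: float = 30, max_s: float = 300) -> bool:
    """True when a candidate break has a duration inside the configured bounds."""
    return min_s <= (end_s - start_s) <= max_s


segments = [
    {"start": 880.0, "text": "That's a great question."},
    {"start": 900.0, "text": "Stay tuned, we'll be right back."},
    {"start": 1100.0, "text": "Welcome back to the show."},
]
cues = transition_cues(segments)
```

A "right back" cue followed by a "welcome back" cue brackets a candidate break, which `plausible_break` can then sanity-check by length.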
Combined scoring: Each signal contributes a confidence score. A segment is classified as commercial when the combined score exceeds a threshold. This is resilient to missing fingerprints — even without a known bumper match, the other signals can still identify breaks.
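The sketch below shows one possible way to combine the signals, using the weights and threshold from the configuration section. One design choice here is an assumption and not something the README specifies: a signal reported as `None` is treated as unavailable and the remaining weights are renormalized. With a plain weighted sum and the listed weights, the score cannot rise above 0.70 when the fingerprint signal contributes nothing, so renormalizing is one way to keep detection workable without a bumper match.

```python
WEIGHTS = {
    "fingerprint_match": 0.30,
    "speaker_identity": 0.25,
    "audio_characteristics": 0.20,
    "break_pattern": 0.15,
    "structural_heuristic": 0.10,
}


def combined_score(signals: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the available signal confidences (each in [0, 1])."""
    available = {
        name: value for name, value in signals.items()
        if value is not None and name in weights
    }
    total_weight = sum(weights[name] for name in available)
    if total_weight == 0:
        return 0.0
    return sum(weights[name] * value for name, value in available.items()) / total_weight


def is_commercial(signals: dict, threshold: float = 0.70) -> bool:
    """Classify a segment as commercial when the combined score exceeds the threshold."""
    return combined_score(signals) > threshold


# No fingerprint available for this segment, but the other signals agree.
signals = {
    "fingerprint_match": None,
    "speaker_identity": 0.9,
    "audio_characteristics": 0.8,
    "break_pattern": 0.7,
    "structural_heuristic": 0.6,
}
score = combined_score(signals)
```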
Stage 4: Commercial Removal — ffmpeg
- Stitch show segments together with crossfades
- Normalize audio levels (EBU R128 loudness standard)
- Output clean podcast-ready MP3
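One way to do the stitching in a single ffmpeg call is a filter graph built from `atrim`, `acrossfade`, and `loudnorm`. The helper below only builds the argument list and does not run ffmpeg. The function name and the loudness targets (`I=-16`, `TP=-1.5`, `LRA=11`) are assumptions, not values from this project.

```python
def build_ffmpeg_command(src: str, spans: list, dst: str,
                         crossfade_ms: int = 500, bitrate: str = "192k") -> list:
    """Build an ffmpeg command that keeps the given (start_s, end_s) spans of src."""
    if not spans:
        raise ValueError("at least one show span is required")
    parts = []
    # Cut each show span out of the single input and reset its timestamps.
    for i, (start, end) in enumerate(spans):
        parts.append(f"[0:a]atrim=start={start}:end={end},asetpts=PTS-STARTPTS[s{i}]")
    # Chain the spans together pairwise with a crossfade.
    last = "s0"
    for i in range(1, len(spans)):
        parts.append(f"[{last}][s{i}]acrossfade=d={crossfade_ms / 1000}[x{i}]")
        last = f"x{i}"
    # Single-pass EBU R128 normalization on the stitched result.
    parts.append(f"[{last}]loudnorm=I=-16:TP=-1.5:LRA=11[out]")
    return [
        "ffmpeg", "-i", src,
        "-filter_complex", ";".join(parts),
        "-map", "[out]",
        "-c:a", "libmp3lame", "-b:a", bitrate,
        dst,
    ]


cmd = build_ffmpeg_command("full-broadcast.mp3", [(0, 600), (780, 1200)], "podcast-episode.mp3")
```

The list can be passed to `subprocess.run(cmd, check=True)`. Single-pass `loudnorm` is approximate; a two-pass run that measures first and then applies the measured values would be more accurate.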
Stage 5: Segment Splitting — ffmpeg
- Export individual segments as separate MP3s
- Apply fade in/out
- Add ID3 tags (show name, segment title, date)
- Generate chapter markers file (for podcast apps)
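The README does not say which chapter format is used. As one option, ffmpeg's FFMETADATA format can carry chapters, and the sketch below renders it from a list of `(title, start_s, end_s)` tuples. The function names are hypothetical.

```python
def escape_ffmetadata(value: str) -> str:
    """Backslash-escape the characters that are special in FFMETADATA files."""
    for ch in ("\\", "=", ";", "#", "\n"):  # backslash first to avoid double escaping
        value = value.replace(ch, "\\" + ch)
    return value


def ffmetadata_chapters(chapters: list) -> str:
    """Render (title, start_s, end_s) tuples as FFMETADATA chapter sections."""
    lines = [";FFMETADATA1"]
    for title, start_s, end_s in chapters:
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START and END are in milliseconds
            f"START={int(round(start_s * 1000))}",
            f"END={int(round(end_s * 1000))}",
            f"title={escape_ffmetadata(title)}",
        ]
    return "\n".join(lines) + "\n"


text = ffmetadata_chapters([
    ("Intro", 0, 90),
    ("The Week That Was", 90, 600),
])
```

The resulting file can be supplied to ffmpeg as a second input together with `-map_metadata 1` to embed the chapters in the output file.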
Stage 6: Content Analysis — Ollama (qwen3:14b or codestral)
- Feed transcript + speaker labels to local LLM
- Generate:
  - Episode summary (2-3 paragraphs)
  - Per-segment summaries
  - Key quotes with speaker attribution
  - Topic tags
  - Suggested blog post topics
  - Auto-filled post-show debrief template
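A minimal sketch of the analysis call, using only the standard library against Ollama's `/api/generate` endpoint. The prompt wording and the function names `build_prompt` and `analyze` are illustrative, and the segment shape (`start`, `speaker`, `text`) is an assumption about the Stage 1 and Stage 2 output.

```python
import json
import urllib.request


def build_prompt(segments: list) -> str:
    """Format speaker-labeled transcript segments into a single analysis prompt."""
    lines = [
        f"[{seg['start']:.0f}s] {seg['speaker']}: {seg['text']}" for seg in segments
    ]
    return (
        "Summarize this radio show transcript in 2-3 paragraphs, then list "
        "key quotes with speaker attribution and topic tags.\n\n" + "\n".join(lines)
    )


def analyze(segments: list, model: str = "qwen3:14b",
            host: str = "http://localhost:11434") -> str:
    """Send the prompt to a local Ollama server and return the generated text."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(segments), "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


prompt = build_prompt([{"start": 12.4, "speaker": "HOST", "text": "Welcome to the show."}])
```

A two-hour transcript may not fit in the model's context window, so a real run would probably summarize per segment first and then combine the results.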
Audio Element Library
The element library is a learning system, not a static collection.
element-library/
  fingerprints.db          # SQLite database of audio fingerprints
  known/                   # Source files we have
    intros/
    outros/
    bumpers/
    promos/
  discovered/              # Elements found by the discovery system
    cluster-001.mp3        # Unknown element, appears in 47 episodes
    cluster-002.mp3        # Unknown element, appears in 12 episodes
    ...
  metadata.json            # Names, categories, date ranges for each element
Lifecycle of a discovered element:
- Processor detects non-speech audio that doesn't match any known fingerprint
- Audio clip is extracted and stored as a candidate
- Candidate is compared against all other candidates across episodes
- Matches are clustered — same audio in multiple episodes = confirmed element
- New element is fingerprinted and added to the library as "unnamed"
- Host reviews unnamed elements periodically and assigns names/categories
- Named elements improve future detection accuracy
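The compare-and-cluster steps of this lifecycle can be sketched as a greedy pass over candidates. This assumes each candidate has already been reduced to a numeric feature vector and uses cosine similarity as a stand-in for whatever fingerprint comparison the project actually uses. The defaults mirror `cluster_similarity_threshold` and `min_cluster_occurrences` from the configuration.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_candidates(candidates: list, threshold: float = 0.90,
                       min_occurrences: int = 3) -> list:
    """Greedily group (episode_id, vector) candidates; keep clusters seen in enough episodes."""
    clusters = []
    for episode_id, vector in candidates:
        for cluster in clusters:
            # Compare against the first member of each cluster (its exemplar).
            if cosine(cluster["exemplar"], vector) >= threshold:
                cluster["episodes"].add(episode_id)
                break
        else:
            clusters.append({"exemplar": vector, "episodes": {episode_id}})
    return [c for c in clusters if len(c["episodes"]) >= min_occurrences]


candidates = [
    ("ep1", [1.0, 0.0, 0.0]),
    ("ep2", [0.99, 0.05, 0.0]),
    ("ep3", [0.98, 0.0, 0.1]),
    ("ep1", [0.0, 1.0, 0.0]),
    ("ep2", [0.0, 0.9, 0.1]),
]
confirmed = cluster_candidates(candidates)
```

Counting distinct episodes instead of raw matches keeps a clip that repeats many times within one episode from being promoted to a show element.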
Element categories:
- show-intro: Full show opening
- show-outro: Full show closing
- segment-bumper: Music between show segments
- break-bumper: Music going into/out of commercial breaks
- station-id: Station identification (legal requirement, consistent per station)
- promo: Show promo or cross-promotion
- stinger: Short audio effect (sound effect, catchphrase)
- unknown: Not yet categorized
Voice Profile System
Bootstrapped from 579-episode archive, not a single enrollment sample.
voice-profiles/
  host-mike-swanson/
    embedding-composite.npy  # Average embedding across all eras
    embedding-2010.npy       # Era-specific (voice changes over time)
    embedding-2014.npy
    embedding-2018.npy
    embedding-2026.npy
    metadata.json            # Speaker name, role, episode count
  guests/
    [name].npy               # Named guest embeddings (built over time)
  callers/
    regular-001.npy          # Unnamed repeat caller
    regular-002.npy
  unknown/
    cluster-[id].npy         # Voices that appear multiple times, not yet named
Bootstrap process:
- Diarize 10 diverse archive episodes (different years)
- Dominant speaker in each = host (by far the most speaking time)
- Extract host-only segments, generate embeddings
- Create per-era profiles (voice may change over 8+ years)
- Composite embedding = average across all eras
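The last two bootstrap steps amount to averaging the per-era embeddings and comparing new voices against the result. The sketch below uses plain Python lists to stay dependency-free (the real profiles are `.npy` arrays, so NumPy would be the natural choice). The function names are hypothetical and the 0.75 default mirrors `host_match_threshold`.

```python
import math


def mean_vector(vectors: list) -> list:
    """Element-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_host(embedding: list, era_profiles: dict, threshold: float = 0.75) -> bool:
    """Compare an embedding against the composite (average) of all era profiles."""
    composite = mean_vector(list(era_profiles.values()))
    return cosine_similarity(embedding, composite) >= threshold


# Toy 2-dimensional embeddings; real speaker embeddings have hundreds of dimensions.
era_profiles = {"2010": [1.0, 0.0], "2018": [0.8, 0.6]}
composite = mean_vector(list(era_profiles.values()))
```

When the recording date is known, comparing against the nearest era profile instead of the composite may give a tighter match; that choice is not specified above.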
Continuous improvement:
- Each processed episode refines the host embedding
- Repeat non-host voices are clustered across episodes
- Host reviews clusters: "This voice appears in 47 episodes — who is this?"
- Named profiles improve future speaker labeling
Dependencies
# System packages
sudo pacman -S python-pip ffmpeg
# Python packages (in a venv)
python3 -m venv ~/.local/share/radio-processor
source ~/.local/share/radio-processor/bin/activate
pip install faster-whisper # Transcription (CTranslate2 + CUDA)
pip install pyannote.audio # Speaker diarization
pip install torch torchaudio # PyTorch (CUDA)
pip install silero-vad # Voice activity detection
pip install pydub # Audio manipulation
pip install librosa # Audio analysis / spectral features
pip install chromaprint # Audio fingerprinting (or use dejavu)
pip install scikit-learn # Break pattern classifier
pip install ollama # Local LLM API
pip install rich # CLI progress display
pyannote.audio Access
pyannote requires accepting the model license on HuggingFace:
- Create account at huggingface.co
- Accept license at https://huggingface.co/pyannote/speaker-diarization-3.1
- Generate access token
huggingface-cli login
Usage
# Full pipeline (new episode)
radio-process episode.mp3 --show-prep episodes/2026-03-21-who-controls-your-tech/show-prep.md
# Just transcribe
radio-process episode.mp3 --transcribe-only
# Process archive episode (training mode — learns elements + voices)
radio-process episode-hr1.mp3 episode-hr2.mp3 --archive-mode --date 2016-03-15
# Batch process archive for training
radio-process --batch-train archive/2016/ --output training-data/
# Enroll host voice from archive (bootstrap)
radio-process --bootstrap-voice archive/ --speaker-name "Mike Swanson" --role host
# Review discovered elements
radio-process --review-elements
# Review unknown speaker clusters
radio-process --review-speakers
Output Structure
episodes/YYYY-MM-DD-topic/
  show-prep.md                 # Pre-show (existing)
  post-show-debrief.md         # Auto-generated draft
  raw/
    full-broadcast.mp3         # Original recording
  processed/
    transcript.json            # Full transcript with timestamps + speakers
    transcript.txt             # Plain text transcript
    transcript.srt             # Subtitle format
    podcast-episode.mp3        # Clean episode (commercials removed)
    chapters.json              # Chapter markers
    detection-report.json      # What was detected as commercial/show, confidence scores
    segments/
      00-intro.mp3
      01-the-week-that-was.mp3
      02-the-government-wants-in.mp3
      03-jensens-trillion-dollar-bet.mp3
      04-apple-gives-google-the-keys.mp3
      05-a-petabyte-of-your-data-gone.mp3
      06-right-to-repair.mp3
      07-outro.mp3
  generated/
    episode-post.md            # For website
    forum-thread.md            # For community forum
    blog-topic-1.md            # Deep-dive article
    blog-topic-2.md            # Deep-dive article
    analysis.json              # LLM analysis output
Configuration
# config.yaml
show:
  name: "The Computer Guru Show"
  host: "Mike Swanson"
  typical_duration_minutes: 120  # 2-hour broadcast
  segment_count: 6
  has_commercials: true
audio:
  whisper_model: "large-v3"
  whisper_language: "en"
  output_format: "mp3"
  output_bitrate: "192k"
  normalize: true    # EBU R128
  crossfade_ms: 500  # Between stitched segments
segment_detection:
  # Fingerprint matching
  fingerprint_db: "element-library/fingerprints.db"
  fingerprint_match_threshold: 0.85   # Minimum similarity for a match
  # Element discovery
  discover_unknown_elements: true
  min_element_duration_s: 1.0         # Shortest element to detect
  max_element_duration_s: 30.0        # Longest (full intro might be 20-30s)
  cluster_similarity_threshold: 0.90  # How similar clips must be to cluster
  min_cluster_occurrences: 3          # Must appear in 3+ episodes to be an element
  # Commercial classification
  min_break_duration_s: 30            # Minimum commercial break length
  max_break_duration_s: 300           # Maximum (5 min)
  silence_threshold_db: -40           # Silence detection threshold
  confidence_threshold: 0.70          # Combined score to classify as commercial
  # Signal weights (tune based on accuracy)
  weights:
    fingerprint_match: 0.30      # Known element detected
    speaker_identity: 0.25       # Host voice absent
    audio_characteristics: 0.20  # Production style differs
    break_pattern: 0.15          # Matches trained break pattern
    structural_heuristic: 0.10   # Duration/timing rules
diarization:
  min_speakers: 1
  max_speakers: 6
  voice_profiles_dir: "voice-profiles/"
  host_match_threshold: 0.75  # Similarity to host embedding
llm:
  model: "qwen3:14b"  # Ollama model for analysis
  ollama_host: "http://localhost:11434"
paths:
  episodes_dir: "episodes/"
  voice_profiles: "voice-profiles/"
  element_library: "element-library/"
  output_dir: "processed/"
archive:
  server: "172.16.3.10"
  path: "/home/gurushow/public_html/archive/"
  elements_path: "/home/gurushow/public_html/archive/Radio/Elements/"
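Because the signal weights are meant to be tuned by hand, a small sanity check on the `segment_detection` section can catch mistakes early. The sketch below works on an already-parsed dictionary (for example from a YAML loader). The function name is hypothetical, and the rule that the weights should sum to 1.0 is an assumption based on the listed values, not a documented requirement.

```python
import math

EXPECTED_WEIGHTS = {
    "fingerprint_match",
    "speaker_identity",
    "audio_characteristics",
    "break_pattern",
    "structural_heuristic",
}


def validate_detection_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means the section looks sane."""
    errors = []
    weights = cfg.get("weights", {})
    missing = EXPECTED_WEIGHTS - weights.keys()
    if missing:
        errors.append(f"missing weights: {sorted(missing)}")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        errors.append("weights do not sum to 1.0")
    if not 0.0 < cfg.get("confidence_threshold", 0.0) <= 1.0:
        errors.append("confidence_threshold must be in (0, 1]")
    if cfg.get("min_break_duration_s", 0) >= cfg.get("max_break_duration_s", 0):
        errors.append("min_break_duration_s must be below max_break_duration_s")
    return errors


segment_detection = {
    "min_break_duration_s": 30,
    "max_break_duration_s": 300,
    "confidence_threshold": 0.70,
    "weights": {
        "fingerprint_match": 0.30,
        "speaker_identity": 0.25,
        "audio_characteristics": 0.20,
        "break_pattern": 0.15,
        "structural_heuristic": 0.10,
    },
}
problems = validate_detection_config(segment_detection)
```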
Training Data: 579-Episode Archive
The archive on the IX server (172.16.3.10) contains 579 MP3 files spanning 2010-2018. The table below accounts for them as 572 episode files plus 7 element MP3s:
| Year | Files | Size | Notes |
|---|---|---|---|
| 2010 | 43 | 664MB | Season 7 start |
| 2011 | 200 | 1.9GB | Peak output |
| 2012 | 98 | 1.2GB | |
| 2014 | 81 | 783MB | Season 6 (new station) |
| 2015 | 50 | 461MB | |
| 2016 | 54 | 1.2GB | |
| 2017 | 41 | 1.5GB | |
| 2018 | 5 | 101MB | Final season 10 episodes |
| Elements | 7 MP3 + 18 WAV | 203MB | Partial production library |
Episodes are split into HR 1 / HR 2 files. The HR boundary is a confirmed commercial break point — used for training the break detection classifier.
Important: Not all production elements are in the archive. Bumpers, stingers, and jingles varied across stations and time periods. The element discovery system handles this by detecting and clustering unknown elements across episodes.
Future Enhancements
- Audiogram generator — Create video clips with waveform animation + captions for social media
- Highlight reel — Auto-detect the most engaging 60-90 seconds (high energy, laughter, emphasis)
- Show notes generator — Generate timestamped show notes in podcast standard format
- RSS feed integration — Auto-publish processed episodes to podcast RSS feed
- Sentiment analysis — Track audience engagement topics over time
- Topic continuity — Link topics across episodes ("Last week we talked about X, this week...")
- Live processing — Real-time transcription during broadcast for immediate post-show turnaround
- Cross-episode search — Full-text search across all transcripts ("When did we talk about net neutrality?")