Files
claudetools/projects/radio-show/audio-processor
Mike Swanson 6cc9043b8e Audio processor: validated voice profiling accuracy, tuned threshold
- Fine-grained speaker analysis (3s windows, 1s hop) across 42min episode
- Host voice: 0.90-0.98 similarity (clear positive match)
- Callers: 0.65-0.68 (correctly below threshold)
- Produced audio/clips: 0.53-0.65 (correctly identified as non-host)
- Co-host/other speakers: 0.56-0.62 (correctly identified)
- Tuned host_match_threshold from 0.75 to 0.83 based on empirical data
- Cross-referenced dips with transcript: correctly identifies callers,
  show intros, played audio clips, and station breaks
- Batch transcription of 7 additional training episodes in progress

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 12:48:25 -07:00
..

Radio Show Audio Processor

Automated pipeline for processing The Computer Guru Show recordings into podcast-ready audio, transcripts, and segmented clips.

What It Does

Raw MP3 (full broadcast with commercials)
  │
  ├── 1. Transcribe (Whisper + GPU)
  │     └── Full transcript with timestamps
  │
  ├── 2. Speaker Diarization (pyannote)
  │     └── Who said what (host vs. callers vs. guests)
  │
  ├── 3. Segment Detection
  │     ├── Identify show segments vs. commercials
  │     ├── Detect music/jingles (known + discovered)
  │     └── Map segments to show prep structure
  │
  ├── 4. Commercial Removal
  │     └── Clean episode MP3 (show content only)
  │
  ├── 5. Segment Splitting
  │     ├── Individual segment MP3s (for social media)
  │     └── Chapter markers (for podcast players)
  │
  └── 6. Content Analysis (Ollama)
        ├── Episode summary
        ├── Topic extraction
        ├── Key quotes
        └── Auto-populate post-show debrief

Architecture

Pipeline Stages

Stage 1: Transcription — faster-whisper (GPU)

  • Model: large-v3 (best accuracy, ~3GB VRAM)
  • Why faster-whisper: CTranslate2 backend, 4x faster than OpenAI whisper, lower VRAM
  • Output: Word-level timestamps, language detection
  • Hardware: RTX 5070 Ti (12GB VRAM) — plenty for large-v3

Stage 2: Speaker Diarization — pyannote.audio

  • Model: pyannote/speaker-diarization-3.1
  • Purpose: Identify speaker turns (host, caller 1, caller 2, etc.)
  • Voice enrollment: Bootstrapped from archive (hundreds of hours of host speech)
  • Output: Speaker segments with timestamps

Stage 3: Segment Detection — Multi-Signal Classifier

Commercial and segment detection uses multiple signals combined, because not all show production elements are in the archive — bumpers, stingers, and jingles vary across stations and eras.

Signal 1: Known element fingerprints (seed library)

  • Fingerprint the production elements we DO have (intros, outros, bumpers from archive)
  • Match against episodes to detect known boundaries
  • Partial coverage — some elements won't match

Signal 2: Unknown element discovery

  • Detect short non-speech audio segments (music, jingles, produced audio) that don't match any known fingerprint
  • Cluster unknown elements across episodes — if the same 5-second clip appears in 30 episodes, it's a show element
  • Flag new clusters for host review and naming
  • Discovered elements get added to the fingerprint library automatically

Signal 3: Speaker identity

  • Host voice present = show content
  • Non-host voices with commercial audio characteristics (compressed, produced, different acoustic environment) = ads
  • Host voice absent for extended periods (>30s) = likely commercial break

Signal 4: Audio characteristics

  • Volume/loudness shifts (commercials often have different LUFS profiles)
  • Spectral characteristics (produced/compressed commercial audio vs. live studio mic)
  • Silence gaps (dead air between show and ads)
  • Audio environment changes (room tone, background noise differences)

Signal 5: Learned break patterns (from archive)

  • HR1/HR2 file boundaries = confirmed commercial break locations
  • Train a classifier on the audio features at these known boundaries
  • Generalize to detect similar patterns within single-file recordings

Signal 6: Structural heuristics

  • Commercial breaks are typically 2-5 minutes
  • Shows typically break every 12-20 minutes
  • Transition phrases in transcript ("We'll be right back", "Welcome back")

Combined scoring: Each signal contributes a confidence score. A segment is classified as commercial when the combined score exceeds a threshold. This is resilient to missing fingerprints — even without a known bumper match, the other signals can still identify breaks.

Stage 4: Commercial Removal — ffmpeg

  • Stitch show segments together with crossfades
  • Normalize audio levels (EBU R128 loudness standard)
  • Output clean podcast-ready MP3

Stage 5: Segment Splitting — ffmpeg

  • Export individual segments as separate MP3s
  • Apply fade in/out
  • Add ID3 tags (show name, segment title, date)
  • Generate chapter markers file (for podcast apps)

Stage 6: Content Analysis — Ollama (qwen3:14b or codestral)

  • Feed transcript + speaker labels to local LLM
  • Generate:
    • Episode summary (2-3 paragraphs)
    • Per-segment summaries
    • Key quotes with speaker attribution
    • Topic tags
    • Suggested blog post topics
    • Auto-filled post-show debrief template

Audio Element Library

The element library is a learning system, not a static collection.

element-library/
  fingerprints.db               # SQLite database of audio fingerprints
  known/                        # Source files we have
    intros/
    outros/
    bumpers/
    promos/
  discovered/                   # Elements found by the discovery system
    cluster-001.mp3             # Unknown element, appears in 47 episodes
    cluster-002.mp3             # Unknown element, appears in 12 episodes
    ...
  metadata.json                 # Names, categories, date ranges for each element

Lifecycle of a discovered element:

  1. Processor detects non-speech audio that doesn't match any known fingerprint
  2. Audio clip is extracted and stored as a candidate
  3. Candidate is compared against all other candidates across episodes
  4. Matches are clustered — same audio in multiple episodes = confirmed element
  5. New element is fingerprinted and added to the library as "unnamed"
  6. Host reviews unnamed elements periodically and assigns names/categories
  7. Named elements improve future detection accuracy

Element categories:

  • show-intro — Full show opening
  • show-outro — Full show closing
  • segment-bumper — Music between show segments
  • break-bumper — Music going into/out of commercial breaks
  • station-id — Station identification (legal requirement, consistent per station)
  • promo — Show promo or cross-promotion
  • stinger — Short audio effect (sound effect, catchphrase)
  • unknown — Not yet categorized

Voice Profile System

Bootstrapped from 579-episode archive, not a single enrollment sample.

voice-profiles/
  host-mike-swanson/
    embedding-composite.npy     # Average embedding across all eras
    embedding-2010.npy          # Era-specific (voice changes over time)
    embedding-2014.npy
    embedding-2018.npy
    embedding-2026.npy
    metadata.json               # Speaker name, role, episode count
  guests/
    [name].npy                  # Named guest embeddings (built over time)
  callers/
    regular-001.npy             # Unnamed repeat caller
    regular-002.npy
  unknown/
    cluster-[id].npy            # Voices that appear multiple times, not yet named

Bootstrap process:

  1. Diarize 10 diverse archive episodes (different years)
  2. Dominant speaker in each = host (by far the most speaking time)
  3. Extract host-only segments, generate embeddings
  4. Create per-era profiles (voice may change over 8+ years)
  5. Composite embedding = average across all eras

Continuous improvement:

  • Each processed episode refines the host embedding
  • Repeat non-host voices are clustered across episodes
  • Host reviews clusters: "This voice appears in 47 episodes — who is this?"
  • Named profiles improve future speaker labeling

Dependencies

# System packages
sudo pacman -S python-pip ffmpeg

# Python packages (in a venv)
python3 -m venv ~/.local/share/radio-processor
source ~/.local/share/radio-processor/bin/activate

pip install faster-whisper          # Transcription (CTranslate2 + CUDA)
pip install pyannote.audio          # Speaker diarization
pip install torch torchaudio        # PyTorch (CUDA)
pip install silero-vad              # Voice activity detection
pip install pydub                   # Audio manipulation
pip install librosa                 # Audio analysis / spectral features
pip install chromaprint             # Audio fingerprinting (or use dejavu)
pip install scikit-learn            # Break pattern classifier
pip install ollama                  # Local LLM API
pip install rich                    # CLI progress display

pyannote.audio Access

pyannote requires accepting the model license on HuggingFace:

  1. Create account at huggingface.co
  2. Accept license at https://huggingface.co/pyannote/speaker-diarization-3.1
  3. Generate access token
  4. huggingface-cli login

Usage

# Full pipeline (new episode)
radio-process episode.mp3 --show-prep episodes/2026-03-21-who-controls-your-tech/show-prep.md

# Just transcribe
radio-process episode.mp3 --transcribe-only

# Process archive episode (training mode — learns elements + voices)
radio-process episode-hr1.mp3 episode-hr2.mp3 --archive-mode --date 2016-03-15

# Batch process archive for training
radio-process --batch-train archive/2016/ --output training-data/

# Enroll host voice from archive (bootstrap)
radio-process --bootstrap-voice archive/ --speaker-name "Mike Swanson" --role host

# Review discovered elements
radio-process --review-elements

# Review unknown speaker clusters
radio-process --review-speakers

Output Structure

episodes/YYYY-MM-DD-topic/
  show-prep.md                          # Pre-show (existing)
  post-show-debrief.md                  # Auto-generated draft
  raw/
    full-broadcast.mp3                  # Original recording
  processed/
    transcript.json                     # Full transcript with timestamps + speakers
    transcript.txt                      # Plain text transcript
    transcript.srt                      # Subtitle format
    podcast-episode.mp3                 # Clean episode (commercials removed)
    chapters.json                       # Chapter markers
    detection-report.json               # What was detected as commercial/show, confidence scores
    segments/
      00-intro.mp3
      01-the-week-that-was.mp3
      02-the-government-wants-in.mp3
      03-jensens-trillion-dollar-bet.mp3
      04-apple-gives-google-the-keys.mp3
      05-a-petabyte-of-your-data-gone.mp3
      06-right-to-repair.mp3
      07-outro.mp3
  generated/
    episode-post.md                     # For website
    forum-thread.md                     # For community forum
    blog-topic-1.md                     # Deep-dive article
    blog-topic-2.md                     # Deep-dive article
    analysis.json                       # LLM analysis output

Configuration

# config.yaml
show:
  name: "The Computer Guru Show"
  host: "Mike Swanson"
  typical_duration_minutes: 120        # 2-hour broadcast
  segment_count: 6
  has_commercials: true

audio:
  whisper_model: "large-v3"
  whisper_language: "en"
  output_format: "mp3"
  output_bitrate: "192k"
  normalize: true                      # EBU R128
  crossfade_ms: 500                    # Between stitched segments

segment_detection:
  # Fingerprint matching
  fingerprint_db: "element-library/fingerprints.db"
  fingerprint_match_threshold: 0.85    # Minimum similarity for a match

  # Element discovery
  discover_unknown_elements: true
  min_element_duration_s: 1.0          # Shortest element to detect
  max_element_duration_s: 30.0         # Longest (full intro might be 20-30s)
  cluster_similarity_threshold: 0.90   # How similar clips must be to cluster
  min_cluster_occurrences: 3           # Must appear in 3+ episodes to be an element

  # Commercial classification
  min_break_duration_s: 30             # Minimum commercial break length
  max_break_duration_s: 300            # Maximum (5 min)
  silence_threshold_db: -40            # Silence detection threshold
  confidence_threshold: 0.70           # Combined score to classify as commercial

  # Signal weights (tune based on accuracy)
  weights:
    fingerprint_match: 0.30            # Known element detected
    speaker_identity: 0.25             # Host voice absent
    audio_characteristics: 0.20        # Production style differs
    break_pattern: 0.15                # Matches trained break pattern
    structural_heuristic: 0.10         # Duration/timing rules

diarization:
  min_speakers: 1
  max_speakers: 6
  voice_profiles_dir: "voice-profiles/"
  host_match_threshold: 0.75           # Similarity to host embedding

llm:
  model: "qwen3:14b"                  # Ollama model for analysis
  ollama_host: "http://localhost:11434"

paths:
  episodes_dir: "episodes/"
  voice_profiles: "voice-profiles/"
  element_library: "element-library/"
  output_dir: "processed/"

archive:
  server: "172.16.3.10"
  path: "/home/gurushow/public_html/archive/"
  elements_path: "/home/gurushow/public_html/archive/Radio/Elements/"

Training Data: 579-Episode Archive

The archive on IX server (172.16.3.10) contains 579 MP3 files spanning 2010-2018:

Year Files Size Notes
2010 43 664MB Season 7 start
2011 200 1.9GB Peak output
2012 98 1.2GB
2014 81 783MB Season 6 (new station)
2015 50 461MB
2016 54 1.2GB
2017 41 1.5GB
2018 5 101MB Final season 10 episodes
Elements 7 MP3 + 18 WAV 203MB Partial production library

Episodes are split into HR 1 / HR 2 files. The HR boundary is a confirmed commercial break point — used for training the break detection classifier.

Important: Not all production elements are in the archive. Bumpers, stingers, and jingles varied across stations and time periods. The element discovery system handles this by detecting and clustering unknown elements across episodes.

Future Enhancements

  1. Audiogram generator — Create video clips with waveform animation + captions for social media
  2. Highlight reel — Auto-detect the most engaging 60-90 seconds (high energy, laughter, emphasis)
  3. Show notes generator — Generate timestamped show notes in podcast standard format
  4. RSS feed integration — Auto-publish processed episodes to podcast RSS feed
  5. Sentiment analysis — Track audience engagement topics over time
  6. Topic continuity — Link topics across episodes ("Last week we talked about X, this week...")
  7. Live processing — Real-time transcription during broadcast for immediate post-show turnaround
  8. Cross-episode search — Full-text search across all transcripts ("When did we talk about net neutrality?")