Files

Mike Swanson 6cc9043b8e Audio processor: validated voice profiling accuracy, tuned threshold

- Fine-grained speaker analysis (3s windows, 1s hop) across 42min episode
- Host voice: 0.90-0.98 similarity (clear positive match)
- Callers: 0.65-0.68 (correctly below threshold)
- Produced audio/clips: 0.53-0.65 (correctly identified as non-host)
- Co-host/other speakers: 0.56-0.62 (correctly identified)
- Tuned host_match_threshold from 0.75 to 0.83 based on empirical data
- Cross-referenced dips with transcript: correctly identifies callers,
  show intros, played audio clips, and station breaks
- Batch transcription of 7 additional training episodes in progress

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-21 12:48:25 -07:00

src

Audio processor: working voice profiler with WavLM speaker embeddings

2026-03-21 12:19:13 -07:00

test-data

Audio processor: validated voice profiling accuracy, tuned threshold

2026-03-21 12:48:25 -07:00

training-data

Audio processor: validated voice profiling accuracy, tuned threshold

2026-03-21 12:48:25 -07:00

voice-profiles

Audio processor: validated voice profiling accuracy, tuned threshold

2026-03-21 12:48:25 -07:00

config.yaml

Audio processor: validated voice profiling accuracy, tuned threshold

2026-03-21 12:48:25 -07:00

pyproject.toml

Add radio show audio processor and post-show workflow

2026-03-21 11:51:59 -07:00

README.md

Add radio show audio processor and post-show workflow

2026-03-21 11:51:59 -07:00

training-plan.md

Add radio show audio processor and post-show workflow

2026-03-21 11:51:59 -07:00

README.md

Radio Show Audio Processor

Automated pipeline for processing The Computer Guru Show recordings into podcast-ready audio, transcripts, and segmented clips.

What It Does

Raw MP3 (full broadcast with commercials)
  │
  ├── 1. Transcribe (Whisper + GPU)
  │     └── Full transcript with timestamps
  │
  ├── 2. Speaker Diarization (pyannote)
  │     └── Who said what (host vs. callers vs. guests)
  │
  ├── 3. Segment Detection
  │     ├── Identify show segments vs. commercials
  │     ├── Detect music/jingles (known + discovered)
  │     └── Map segments to show prep structure
  │
  ├── 4. Commercial Removal
  │     └── Clean episode MP3 (show content only)
  │
  ├── 5. Segment Splitting
  │     ├── Individual segment MP3s (for social media)
  │     └── Chapter markers (for podcast players)
  │
  └── 6. Content Analysis (Ollama)
        ├── Episode summary
        ├── Topic extraction
        ├── Key quotes
        └── Auto-populate post-show debrief

Architecture

Pipeline Stages

Stage 1: Transcription — `faster-whisper` (GPU)

Model: large-v3 (best accuracy, ~3GB VRAM)
Why faster-whisper: CTranslate2 backend, 4x faster than OpenAI whisper, lower VRAM
Output: Word-level timestamps, language detection
Hardware: RTX 5070 Ti (12GB VRAM) — plenty for large-v3

Stage 2: Speaker Diarization — `pyannote.audio`

Model: pyannote/speaker-diarization-3.1
Purpose: Identify speaker turns (host, caller 1, caller 2, etc.)
Voice enrollment: Bootstrapped from archive (hundreds of hours of host speech)
Output: Speaker segments with timestamps

Stage 3: Segment Detection — Multi-Signal Classifier

Commercial and segment detection uses multiple signals combined, because not all show production elements are in the archive — bumpers, stingers, and jingles vary across stations and eras.

Signal 1: Known element fingerprints (seed library)

Fingerprint the production elements we DO have (intros, outros, bumpers from archive)
Match against episodes to detect known boundaries
Partial coverage — some elements won't match

Signal 2: Unknown element discovery

Detect short non-speech audio segments (music, jingles, produced audio) that don't match any known fingerprint
Cluster unknown elements across episodes — if the same 5-second clip appears in 30 episodes, it's a show element
Flag new clusters for host review and naming
Discovered elements get added to the fingerprint library automatically

Signal 3: Speaker identity

Host voice present = show content
Non-host voices with commercial audio characteristics (compressed, produced, different acoustic environment) = ads
Host voice absent for extended periods (>30s) = likely commercial break

Signal 4: Audio characteristics

Volume/loudness shifts (commercials often have different LUFS profiles)
Spectral characteristics (produced/compressed commercial audio vs. live studio mic)
Silence gaps (dead air between show and ads)
Audio environment changes (room tone, background noise differences)

Signal 5: Learned break patterns (from archive)

HR1/HR2 file boundaries = confirmed commercial break locations
Train a classifier on the audio features at these known boundaries
Generalize to detect similar patterns within single-file recordings

Signal 6: Structural heuristics

Commercial breaks are typically 2-5 minutes
Shows typically break every 12-20 minutes
Transition phrases in transcript ("We'll be right back", "Welcome back")

Combined scoring: Each signal contributes a confidence score. A segment is classified as commercial when the combined score exceeds a threshold. This is resilient to missing fingerprints — even without a known bumper match, the other signals can still identify breaks.

Stage 4: Commercial Removal — `ffmpeg`

Stitch show segments together with crossfades
Normalize audio levels (EBU R128 loudness standard)
Output clean podcast-ready MP3

Stage 5: Segment Splitting — `ffmpeg`

Export individual segments as separate MP3s
Apply fade in/out
Add ID3 tags (show name, segment title, date)
Generate chapter markers file (for podcast apps)

Stage 6: Content Analysis — `Ollama` (qwen3:14b or codestral)

Feed transcript + speaker labels to local LLM
Generate:
- Episode summary (2-3 paragraphs)
- Per-segment summaries
- Key quotes with speaker attribution
- Topic tags
- Suggested blog post topics
- Auto-filled post-show debrief template

Audio Element Library

The element library is a learning system, not a static collection.

element-library/
  fingerprints.db               # SQLite database of audio fingerprints
  known/                        # Source files we have
    intros/
    outros/
    bumpers/
    promos/
  discovered/                   # Elements found by the discovery system
    cluster-001.mp3             # Unknown element, appears in 47 episodes
    cluster-002.mp3             # Unknown element, appears in 12 episodes
    ...
  metadata.json                 # Names, categories, date ranges for each element

Lifecycle of a discovered element:

Processor detects non-speech audio that doesn't match any known fingerprint
Audio clip is extracted and stored as a candidate
Candidate is compared against all other candidates across episodes
Matches are clustered — same audio in multiple episodes = confirmed element
New element is fingerprinted and added to the library as "unnamed"
Host reviews unnamed elements periodically and assigns names/categories
Named elements improve future detection accuracy

Element categories:

show-intro — Full show opening
show-outro — Full show closing
segment-bumper — Music between show segments
break-bumper — Music going into/out of commercial breaks
station-id — Station identification (legal requirement, consistent per station)
promo — Show promo or cross-promotion
stinger — Short audio effect (sound effect, catchphrase)
unknown — Not yet categorized

Voice Profile System

Bootstrapped from 579-episode archive, not a single enrollment sample.

voice-profiles/
  host-mike-swanson/
    embedding-composite.npy     # Average embedding across all eras
    embedding-2010.npy          # Era-specific (voice changes over time)
    embedding-2014.npy
    embedding-2018.npy
    embedding-2026.npy
    metadata.json               # Speaker name, role, episode count
  guests/
    [name].npy                  # Named guest embeddings (built over time)
  callers/
    regular-001.npy             # Unnamed repeat caller
    regular-002.npy
  unknown/
    cluster-[id].npy            # Voices that appear multiple times, not yet named

Bootstrap process:

Diarize 10 diverse archive episodes (different years)
Dominant speaker in each = host (by far the most speaking time)
Extract host-only segments, generate embeddings
Create per-era profiles (voice may change over 8+ years)
Composite embedding = average across all eras

Continuous improvement:

Each processed episode refines the host embedding
Repeat non-host voices are clustered across episodes
Host reviews clusters: "This voice appears in 47 episodes — who is this?"
Named profiles improve future speaker labeling

Dependencies

# System packages
sudo pacman -S python-pip ffmpeg

# Python packages (in a venv)
python3 -m venv ~/.local/share/radio-processor
source ~/.local/share/radio-processor/bin/activate

pip install faster-whisper          # Transcription (CTranslate2 + CUDA)
pip install pyannote.audio          # Speaker diarization
pip install torch torchaudio        # PyTorch (CUDA)
pip install silero-vad              # Voice activity detection
pip install pydub                   # Audio manipulation
pip install librosa                 # Audio analysis / spectral features
pip install chromaprint             # Audio fingerprinting (or use dejavu)
pip install scikit-learn            # Break pattern classifier
pip install ollama                  # Local LLM API
pip install rich                    # CLI progress display

pyannote.audio Access

pyannote requires accepting the model license on HuggingFace:

Create account at huggingface.co
Accept license at https://huggingface.co/pyannote/speaker-diarization-3.1
Generate access token
huggingface-cli login

Usage

# Full pipeline (new episode)
radio-process episode.mp3 --show-prep episodes/2026-03-21-who-controls-your-tech/show-prep.md

# Just transcribe
radio-process episode.mp3 --transcribe-only

# Process archive episode (training mode — learns elements + voices)
radio-process episode-hr1.mp3 episode-hr2.mp3 --archive-mode --date 2016-03-15

# Batch process archive for training
radio-process --batch-train archive/2016/ --output training-data/

# Enroll host voice from archive (bootstrap)
radio-process --bootstrap-voice archive/ --speaker-name "Mike Swanson" --role host

# Review discovered elements
radio-process --review-elements

# Review unknown speaker clusters
radio-process --review-speakers

Output Structure

episodes/YYYY-MM-DD-topic/
  show-prep.md                          # Pre-show (existing)
  post-show-debrief.md                  # Auto-generated draft
  raw/
    full-broadcast.mp3                  # Original recording
  processed/
    transcript.json                     # Full transcript with timestamps + speakers
    transcript.txt                      # Plain text transcript
    transcript.srt                      # Subtitle format
    podcast-episode.mp3                 # Clean episode (commercials removed)
    chapters.json                       # Chapter markers
    detection-report.json               # What was detected as commercial/show, confidence scores
    segments/
      00-intro.mp3
      01-the-week-that-was.mp3
      02-the-government-wants-in.mp3
      03-jensens-trillion-dollar-bet.mp3
      04-apple-gives-google-the-keys.mp3
      05-a-petabyte-of-your-data-gone.mp3
      06-right-to-repair.mp3
      07-outro.mp3
  generated/
    episode-post.md                     # For website
    forum-thread.md                     # For community forum
    blog-topic-1.md                     # Deep-dive article
    blog-topic-2.md                     # Deep-dive article
    analysis.json                       # LLM analysis output

Configuration

# config.yaml
show:
  name: "The Computer Guru Show"
  host: "Mike Swanson"
  typical_duration_minutes: 120        # 2-hour broadcast
  segment_count: 6
  has_commercials: true

audio:
  whisper_model: "large-v3"
  whisper_language: "en"
  output_format: "mp3"
  output_bitrate: "192k"
  normalize: true                      # EBU R128
  crossfade_ms: 500                    # Between stitched segments

segment_detection:
  # Fingerprint matching
  fingerprint_db: "element-library/fingerprints.db"
  fingerprint_match_threshold: 0.85    # Minimum similarity for a match

  # Element discovery
  discover_unknown_elements: true
  min_element_duration_s: 1.0          # Shortest element to detect
  max_element_duration_s: 30.0         # Longest (full intro might be 20-30s)
  cluster_similarity_threshold: 0.90   # How similar clips must be to cluster
  min_cluster_occurrences: 3           # Must appear in 3+ episodes to be an element

  # Commercial classification
  min_break_duration_s: 30             # Minimum commercial break length
  max_break_duration_s: 300            # Maximum (5 min)
  silence_threshold_db: -40            # Silence detection threshold
  confidence_threshold: 0.70           # Combined score to classify as commercial

  # Signal weights (tune based on accuracy)
  weights:
    fingerprint_match: 0.30            # Known element detected
    speaker_identity: 0.25             # Host voice absent
    audio_characteristics: 0.20        # Production style differs
    break_pattern: 0.15                # Matches trained break pattern
    structural_heuristic: 0.10         # Duration/timing rules

diarization:
  min_speakers: 1
  max_speakers: 6
  voice_profiles_dir: "voice-profiles/"
  host_match_threshold: 0.75           # Similarity to host embedding

llm:
  model: "qwen3:14b"                  # Ollama model for analysis
  ollama_host: "http://localhost:11434"

paths:
  episodes_dir: "episodes/"
  voice_profiles: "voice-profiles/"
  element_library: "element-library/"
  output_dir: "processed/"

archive:
  server: "172.16.3.10"
  path: "/home/gurushow/public_html/archive/"
  elements_path: "/home/gurushow/public_html/archive/Radio/Elements/"

Training Data: 579-Episode Archive

The archive on IX server (172.16.3.10) contains 579 MP3 files spanning 2010-2018:

Year	Files	Size	Notes
2010	43	664MB	Season 7 start
2011	200	1.9GB	Peak output
2012	98	1.2GB
2014	81	783MB	Season 6 (new station)
2015	50	461MB
2016	54	1.2GB
2017	41	1.5GB
2018	5	101MB	Final season 10 episodes
Elements	7 MP3 + 18 WAV	203MB	Partial production library

Episodes are split into HR 1 / HR 2 files. The HR boundary is a confirmed commercial break point — used for training the break detection classifier.

Important: Not all production elements are in the archive. Bumpers, stingers, and jingles varied across stations and time periods. The element discovery system handles this by detecting and clustering unknown elements across episodes.

Future Enhancements

Audiogram generator — Create video clips with waveform animation + captions for social media
Highlight reel — Auto-detect the most engaging 60-90 seconds (high energy, laughter, emphasis)
Show notes generator — Generate timestamped show notes in podcast standard format
RSS feed integration — Auto-publish processed episodes to podcast RSS feed
Sentiment analysis — Track audience engagement topics over time
Topic continuity — Link topics across episodes ("Last week we talked about X, this week...")
Live processing — Real-time transcription during broadcast for immediate post-show turnaround
Cross-episode search — Full-text search across all transcripts ("When did we talk about net neutrality?")

README.md

Radio Show Audio Processor

What It Does

Architecture

Pipeline Stages

Stage 1: Transcription — faster-whisper (GPU)

Stage 2: Speaker Diarization — pyannote.audio

Stage 3: Segment Detection — Multi-Signal Classifier

Stage 4: Commercial Removal — ffmpeg

Stage 5: Segment Splitting — ffmpeg

Stage 6: Content Analysis — Ollama (qwen3:14b or codestral)

Audio Element Library

Voice Profile System

Dependencies

pyannote.audio Access

Usage

Output Structure

Configuration

Training Data: 579-Episode Archive

Future Enhancements

Stage 1: Transcription — `faster-whisper` (GPU)

Stage 2: Speaker Diarization — `pyannote.audio`

Stage 4: Commercial Removal — `ffmpeg`

Stage 5: Segment Splitting — `ffmpeg`

Stage 6: Content Analysis — `Ollama` (qwen3:14b or codestral)