# Training Plan: Using the 579-Episode Archive

## Available Training Data

### Episode Archive
- **Location:** `/home/gurushow/public_html/archive/` on IX server (172.16.3.10)
- **Count:** 579 MP3 files, 7.8GB
- **Span:** 2010-2018 (Seasons 6-10)
- **Format:** Split into "HR 1" / "HR 2" per episode (2-hour shows)
- **Year breakdown:**
  - 2010: 43 files (664MB)
  - 2011: 200 files (1.9GB)
  - 2012: 98 files (1.2GB)
  - 2014: 81 files (783MB)
  - 2015: 50 files (461MB)
  - 2016: 54 files (1.2GB)
  - 2017: 41 files (1.5GB)
  - 2018: 5 files (101MB)
- **Note:** The per-year counts above sum to 572 files, 7 short of the 579 total, and no 2013 files are listed. Recount on the server before relying on the breakdown.

### Show Production Elements
- **Location:** `/home/gurushow/public_html/archive/Radio/Elements/`
- **Intros:** 5 WAV files (show intro variations, beast intro, kick back intro, streaming intro)
- **Outros:** 2 WAV files
- **Bumpers:** 7 files (MP3 + WAV), music stingers for transitions
- **Promos:** 2 WAV files (promo windows, show spot)
- **Corrected versions:** Separate folder with phone-number-corrected versions

---

## Phase 1: Audio Element Library (Seed + Discover)

### Purpose
Build a library of all show production elements (intros, outros, bumpers, stingers, station IDs) for reliable segment boundary detection. The archive contains only some of these elements, because different stations and eras used different production elements.

### Step 1: Seed with known elements
1. Download all files from `Radio/Elements/` on IX server (7 MP3 + 18 WAV)
2. Convert WAVs to a consistent format (mono, 16kHz for fingerprinting)
3. Generate chromaprint fingerprints for each element
4. Store in `element-library/fingerprints.db` (SQLite)
5. Categorize: show-intro, show-outro, segment-bumper, break-bumper, promo

### Step 2: Discover unknown elements from archive
1. Process episodes through the pipeline
2. Detect short non-speech audio segments (music, jingles, produced audio)
3. Extract each detected clip
4. Compare against known fingerprints; if there is no match, store the clip as a candidate
5. Compare candidates against each other across episodes
6. Cluster: same audio appearing in 3+ episodes = confirmed show element
7. Add to fingerprint database as "unnamed" element

### Step 3: Host review
- Present discovered clusters: "This 4-second audio clip appears in 38 episodes between 2015 and 2017. What is it?"
- Host names and categorizes each cluster
- Named elements improve future detection accuracy

### What This Enables
- **Known elements:** Exact boundary detection when a fingerprinted intro/bumper is detected
- **Unknown elements:** Even without the source file, if the same jingle appears repeatedly, we know it marks a boundary
- **Era awareness:** Elements used in 2011 may differ from those used in 2016, so the library tracks date ranges
- **New show elements:** When the show returns in 2026 with a new station, new bumpers get discovered automatically after a few episodes

### Tools
- `chromaprint` / `fpcalc` for audio fingerprinting
- `librosa` for spectral analysis and non-speech detection
- `dejavu` (Python audio fingerprinting library) or custom matching
- SQLite for fingerprint storage and lookup

---

## Phase 2: Host Voice Profile (Bootstrapped from Archive)

### Purpose
Build an extremely robust speaker embedding for Mike's voice using hundreds of hours of confirmed speech.

### Method

#### Step 1: Bootstrap from clean segments
The show intros typically have the host speaking directly. Use a handful of episodes where the host is the only speaker for the first few minutes:
1. Transcribe 10 diverse episodes (different years, different energy levels)
2. Run pyannote diarization
3. The dominant speaker in each episode = the host (by far the most speaking time)
4. Extract host-only segments from each episode
5. Generate embeddings from all host segments
6. Average/cluster to create a robust reference embedding

#### Step 2: Validate across eras
The host's voice may have changed subtly over 8 years. Generate per-year embeddings, for example:
- 2010 voice profile
- 2014 voice profile
- 2018 voice profile
- 2026 voice profile (from new episodes)

Store all as the same speaker with temporal metadata. The matching algorithm checks against all variants.

#### Step 3: Continuous improvement
Each newly processed episode refines the host embedding (confirmed host segments get folded back in).

---

## Phase 3: Commercial Break Pattern Training

### Purpose
Learn the specific audio patterns that signal commercial breaks. Because not all production elements are in the archive, the detector must combine multiple signals rather than relying solely on fingerprint matching.

### Method: Multi-Signal Classifier

The classifier combines all available signals with weighted scoring. No single signal is required; the system degrades gracefully when some signals are unavailable.

#### Signal 1: Known + discovered element fingerprints
- Match detected audio against the element library (Phase 1)
- If a known break-bumper is detected, high confidence of a break boundary
- If no match, other signals still contribute
- **Availability:** Partial for archive episodes (incomplete element library), improves over time via discovery

#### Signal 2: Speaker identity (from Phase 2)
- Host voice present = show content (high confidence)
- Host voice absent for >30 seconds = possible break
- Multiple unfamiliar voices in quick succession with produced audio = commercial cluster
- **Availability:** High, because the host voice profile is built from hundreds of hours of speech

#### Signal 3: Audio characteristics
- Extract per-segment features: MFCC, spectral centroid, zero-crossing rate, loudness (LUFS), dynamic range
- Commercials typically: higher loudness, more compression, different spectral profile, different room tone
- Show content typically: consistent room tone, natural dynamic range, live mic characteristics
- **Availability:** Always available, since it is inherent to the audio

#### Signal 4: HR 1/HR 2 boundary training
Since archive episodes are split into Hour 1 and Hour 2, the end of HR 1 and the start of HR 2 always contain a commercial break boundary. This gives us 194+ confirmed break points.
1. Take the last 5 minutes of every HR 1 file and the first 5 minutes of every HR 2 file
2. Analyze the audio feature transition at the show→commercial boundary (end of HR 1) and at the return from the break (start of HR 2)
3. Train a Random Forest classifier on these labeled transitions
4. Apply the learned transition pattern to detect similar boundaries within single-file recordings
- **Availability:** Training data from archive; model applies to new episodes

#### Signal 5: Structural heuristics
- Commercial breaks are typically 2-5 minutes
- Shows typically break every 12-20 minutes
- Transition phrases in transcript ("We'll be right back", "Welcome back", "Stay tuned")
- Silence gaps >1 second often bookend breaks
- **Availability:** Always available

#### Combined scoring
Each signal produces a confidence value (0.0-1.0). A weighted sum determines the classification:
```
score = (w1 * fingerprint_match)
      + (w2 * speaker_absence)
      + (w3 * audio_characteristics)
      + (w4 * break_pattern_match)
      + (w5 * structural_heuristic)

if score > threshold: classify as commercial
```

Default weights (tunable after validation):
- Fingerprint match: 0.30 (strongest when available, but often unavailable)
- Speaker identity: 0.25 (very reliable)
- Audio characteristics: 0.20 (always available)
- Break pattern: 0.15 (learned from archive)
- Structural: 0.10 (least reliable alone, but useful confirmation)

#### Self-calibration
After processing a batch of archive episodes:
1. Compare detected breaks against HR1/HR2 boundaries (known ground truth)
2. Auto-tune weights to maximize accuracy on held-out episodes
3. Report accuracy metrics

### Expected Accuracy
- With all signals available (including fingerprint match): >95%
- Without fingerprint matches (new station, new elements): >85%
- Improves over time as element discovery adds to the fingerprint library

---

## Phase 4: Repeat Speaker Detection

### Purpose
Identify co-hosts, regular callers, and guests across the archive.

### Method
1. Diarize a representative sample (20-30 episodes across all years)
2. For each episode, extract embeddings for all non-host speakers
3. Cluster all non-host embeddings across all episodes
4. Clusters that appear in multiple episodes = repeat speakers
5. Present clusters to the host for naming: "This voice appears in 47 episodes. Who is this?"
6. Save named speaker profiles

### Known Speakers to Look For
- Co-hosts (Harry mentioned in early episodes)
- Regular callers
- Recurring guests

---

## Phase 5: Batch Processing Pipeline

### Purpose
Process the full archive to build the training dataset and generate transcripts.

### Approach: Incremental, not all-at-once

**Batch 1: Training set (10 episodes)**
- Select 10 episodes spanning different years
- Full transcription + diarization
- Manual review to validate accuracy
- Use results to tune parameters

**Batch 2: Element fingerprinting**
- Download and fingerprint all show elements
- Test detection against Batch 1 episodes

**Batch 3: Commercial detection training**
- Process 50 HR1/HR2 pairs
- Train break detection classifier
- Validate against held-out episodes

**Batch 4: Full archive (optional, on demand)**
- Process remaining episodes as background task
- Each episode: ~5-10 minutes to transcribe on RTX 5070 Ti
- Full archive: ~50-100 hours of compute time
- Run overnight in batches

### Storage Requirements
- Transcripts (JSON): ~500KB per episode × 194 = ~100MB
- Speaker embeddings: negligible
- Processed audio (if re-encoding): skip unless needed
- Total new storage: < 500MB for all metadata

---

## Implementation Priority

1. **Set up Python environment:** venv with faster-whisper, pyannote, torch CUDA
2. **Download show elements:** Fingerprint the known intros/outros/bumpers (seed library)
3. **Process 3-5 archive episodes:** Validate transcription + diarization quality
4. **Build host voice profile:** Bootstrap from initial batch
5. **Run element discovery on initial batch:** Find unknown elements, begin clustering
6. **Train commercial detector:** Using HR1/HR2 boundaries + all available signals
7. **Process 20-30 more episodes:** Expand element library, refine classifier weights, discover repeat speakers
8. **Host review session:** Name discovered elements and speaker clusters
9. **Build the CLI tool:** Wire it all together with config file
10. **Process a new 2026 episode end-to-end:** Full pipeline test with new station's elements
11. **Batch process remaining archive:** Background task, overnight

---

## Disk Space Plan

The archive is 7.8GB on IX server. Options:
1. **Stream from server:** Process one file at a time via SSH/SCP, don't store locally
2. **Download subset:** Training set only (~500MB for 10 episodes + elements)
3. **Download all:** 7.8GB to local disk (easy, NVMe has plenty of space)
4. **NFS/SSHFS mount:** Mount the IX server directory, process in place

Recommendation: Download the elements + 10-episode training set first. Download the full archive only when ready for batch processing.
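
## Example Sketch: Combined Scoring (Phase 3)

The weighted scoring rule from Phase 3 can be sketched in plain Python as below. The default weights are the ones listed in the plan. Everything else is an assumption made for illustration: the function names `combined_score` and `is_commercial`, the 0.5 threshold, and in particular the choice to renormalize over the signals that are present. The plan only says the system should degrade gracefully when a signal is unavailable; renormalizing is one possible way to do that, not a settled design.

```python
from typing import Mapping, Optional

# Default weights from the plan (tunable after validation).
DEFAULT_WEIGHTS = {
    "fingerprint_match": 0.30,
    "speaker_absence": 0.25,
    "audio_characteristics": 0.20,
    "break_pattern_match": 0.15,
    "structural_heuristic": 0.10,
}


def combined_score(
    signals: Mapping[str, Optional[float]],
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted average over the signals that are actually available.

    A signal that is missing or None is treated as unavailable, and the
    remaining weights are renormalized. ASSUMPTION: renormalization is an
    illustrative choice, not something the plan specifies.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for name, weight in weights.items():
        value = signals.get(name)
        if value is None:
            continue  # unavailable signal: skip it and its weight
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
        weighted_sum += weight * value
        total_weight += weight
    if total_weight == 0.0:
        return 0.0  # no signals at all: no evidence of a commercial
    return weighted_sum / total_weight


def is_commercial(
    signals: Mapping[str, Optional[float]],
    threshold: float = 0.5,  # hypothetical value; the plan leaves it open
) -> bool:
    """Classify a segment as commercial when the score exceeds the threshold."""
    return combined_score(signals) > threshold
```

With this sketch, a segment from a new station (no fingerprint signal yet) is still scored on the remaining four signals, and the self-calibration step would tune both `DEFAULT_WEIGHTS` and `threshold` against the HR1/HR2 ground truth.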
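
## Example Sketch: Confirming Discovered Elements (Phase 1)

Phase 1, Step 2 treats a candidate as a confirmed show element once the same audio appears in 3 or more episodes. A minimal sketch of that confirmation rule follows. It assumes an earlier stage has already compared fingerprints and assigned each candidate clip a cluster ID; the `(cluster_id, episode_id)` pair format and the name `confirmed_elements` are hypothetical and not part of the plan.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

MIN_EPISODES = 3  # "3+ episodes" threshold from Phase 1, Step 2


def confirmed_elements(
    matches: Iterable[Tuple[str, str]],
    min_episodes: int = MIN_EPISODES,
) -> Dict[str, int]:
    """Return {cluster_id: distinct_episode_count} for confirmed clusters.

    `matches` yields one (cluster_id, episode_id) pair per detected clip.
    Repeats of a clip within one episode count only once, so a bumper
    played five times in a single show is not confirmed on that basis.
    """
    episodes_by_cluster: Dict[str, Set[str]] = defaultdict(set)
    for cluster_id, episode_id in matches:
        episodes_by_cluster[cluster_id].add(episode_id)
    return {
        cluster_id: len(episodes)
        for cluster_id, episodes in episodes_by_cluster.items()
        if len(episodes) >= min_episodes
    }
```

The episode counts returned here are the numbers that would feed the host review prompt ("This clip appears in N episodes").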