
Mike Swanson 79abef9dc9 radio: diarization pipeline fixes, benchmark setup, test episode set

- Fix voice_profiler threshold bug (HOST label overwrote Unknown unconditionally)
- Audio preload optimization: single ffmpeg per episode, 149.5x realtime on 5070 Ti
- WavLM threshold raised to 0.85 (Mike 0.90-0.99, callers 0.46-0.83)
- Promo/bumper filter: weighted signature scoring, 42->27 clean Q&A pairs
- Text-only Q&A fallback for episodes with no CALLER diarization labels
- TRANSFORMERS_OFFLINE=1 to skip HuggingFace freshness checks
- Add diarize_2018.py for targeted re-run + FTS5 rebuild
- Add benchmark.py + BENCH_SETUP.md for GURU-BEAST-ROG (RTX 4090) comparison
- Commit 9-episode training diarization.json outputs
- Session log: 2026-04-27-diarization-pipeline.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-27 13:20:40 -07:00

5.5 KiB


Session Log — 2026-04-27

Project: The Computer Guru Show — Archive Mining System
Goal: Build searchable transcript archive of 579 episodes (2010-2018) with caller Q&A extraction for "then vs now" show prep
Machine: DESKTOP-0O8A1RL
User: Mike Swanson (mike)

Work Completed

Critical Bug Fix — `voice_profiler.py` `identify_speakers()`

identify_speakers() was unconditionally labeling all windows as HOST regardless of similarity score. The host-role label assignment ran after the threshold check and overwrote it. Fixed by gating the "Host:" label inside the best_score >= threshold branch.
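
A minimal sketch of the corrected control flow. `label_window` and the label strings are hypothetical stand-ins for illustration, not the actual code in `voice_profiler.py`.

```python
def label_window(best_score: float, threshold: float, host_label: str = "HOST") -> str:
    """Return the speaker label for one window (hypothetical helper)."""
    label = "Unknown"
    if best_score >= threshold:
        # The host label is assigned only inside the threshold branch.
        # The bug was equivalent to assigning it after this block,
        # which overwrote "Unknown" for every window.
        label = host_label
    return label
```

With the gate in place, a window scoring below the threshold keeps its "Unknown" label instead of being promoted to the host.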

Threshold Tuning

Raised similarity threshold from 0.70 to 0.85. Diagnostic run on 2010-10-02-hr1.mp3 confirmed clean separation:

Mike's voice: scores 0.90-0.99
Caller windows: scores 0.46-0.83
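
The ranges above show why 0.70 was too low: caller windows scoring up to 0.83 would have passed it. A quick check, using only the score ranges from this one diagnostic episode (`separates` is a hypothetical helper, and other episodes may need a different margin):

```python
HOST_RANGE = (0.90, 0.99)    # observed scores for Mike's voice
CALLER_RANGE = (0.46, 0.83)  # observed scores for caller windows

def separates(threshold: float,
              host_range: tuple = HOST_RANGE,
              caller_range: tuple = CALLER_RANGE) -> bool:
    """True if every caller score is below the threshold and every host score is at or above it."""
    return caller_range[1] < threshold <= host_range[0]

print(separates(0.70))  # False
print(separates(0.85))  # True
```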

Audio Preload Optimization

identify_speakers() previously spawned approximately 500 ffmpeg subprocesses per episode (one per 10-second window). Rewrote to load full audio once via _load_full_audio() and slice in-memory numpy arrays per window.
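
The in-memory slicing step can be sketched as follows. This is an illustration only: `iter_windows` is a hypothetical name, and the 16 kHz mono signal and non-overlapping windows are assumptions, not details confirmed by `_load_full_audio()`.

```python
import numpy as np

def iter_windows(audio: np.ndarray, sample_rate: int, window_s: float = 10.0):
    """Yield (start_seconds, samples) for consecutive windows of a preloaded mono signal."""
    step = int(window_s * sample_rate)
    for start in range(0, len(audio), step):
        # Basic slicing returns a view, so no audio data is copied per window.
        yield start / sample_rate, audio[start:start + step]

# Stand-in for decoded audio: 25 seconds of silence at 16 kHz.
audio = np.zeros(25 * 16000, dtype=np.float32)
windows = list(iter_windows(audio, 16000))
```

The last window is shorter than the rest when the duration is not a multiple of the window length. Whether the real code pads or drops it is not stated in this log.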

Result: 149.5x realtime on RTX 5070 Ti
Measured: 10,600 seconds of audio processed in 70.9 seconds.
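
The realtime factor is audio duration divided by wall-clock processing time. A small helper (hypothetical name) for reproducing the figure when benchmarking other machines:

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / wall_seconds

rtf = realtime_factor(10_600, 70.9)
print(f"{rtf:.1f}x realtime")  # 149.5x realtime
```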

Promo/Bumper Filter — `qa_extractor.py`

Added _is_promo_or_bumper() with weighted signature scoring:

Score 2 = highly distinctive phrase
Score 1 = semi-generic phrase
Threshold = 2

Filters show promos such as "Computer running slow? Has your machine somehow acquired a life of its own?" out of the Q&A pairs. Removing these false positives reduced the extracted pairs from 42 to 27 across the 9 training episodes.
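
A rough sketch of how weighted signature scoring of this kind can work. The phrase lists are assumptions built from the promo quoted above, and which tier each phrase belongs to is a guess. The real signatures and matching logic live in `_is_promo_or_bumper()`.

```python
# Hypothetical signature lists, for illustration only.
DISTINCTIVE = ["has your machine somehow acquired a life of its own"]  # score 2 each
SEMI_GENERIC = ["computer running slow"]                               # score 1 each
PROMO_THRESHOLD = 2

def promo_score(text: str) -> int:
    """Sum the weights of every signature phrase found in the text."""
    lowered = text.lower()
    score = 2 * sum(phrase in lowered for phrase in DISTINCTIVE)
    score += sum(phrase in lowered for phrase in SEMI_GENERIC)
    return score

def is_promo_or_bumper(text: str) -> bool:
    return promo_score(text) >= PROMO_THRESHOLD
```

With a threshold of 2, one distinctive phrase is enough to flag a segment, while a single semi-generic phrase is not. That lets a caller say "computer running slow" without being filtered.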

2018 Episode Re-diarization

Episodes 2018-s10e17 and 2018-s10e21 had stale all-HOST diarization from an aborted earlier run. Re-diarized correctly:

2018-s10e17: 49 turns / 775s caller
2018-s10e21: 110 turns / 1175s caller

Text-Only Q&A Fallback

Added _extract_qa_text_only() to handle cases where diarization produces no CALLER labels. Uses question-pattern signals and caller-intro phrase detection. Automatically triggered when all segments are labeled HOST.
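
A sketch of the trigger condition and one possible question signal. The segment shape (`{"speaker": ...}`), the helper names, and the regex are all assumptions. The actual patterns in `_extract_qa_text_only()` are not recorded in this log.

```python
import re

# Hypothetical question signal: a question mark or a leading question word.
QUESTION_RE = re.compile(r"\?|^\s*(how|why|what|can|should|is|do|does)\b", re.IGNORECASE)

def needs_text_only_fallback(segments: list[dict]) -> bool:
    """True when diarization produced no CALLER labels (every segment is HOST)."""
    return bool(segments) and all(seg["speaker"] == "HOST" for seg in segments)

def looks_like_question(text: str) -> bool:
    return QUESTION_RE.search(text) is not None
```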

TRANSFORMERS_OFFLINE=1

Set in diarize_training.py and diarize_2018.py to prevent HuggingFace freshness checks on the cached WavLM model.
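
The usual pattern is to set the variable at the top of the script, before `transformers` is imported, since the library generally reads it at import time. A minimal sketch (the commented import is a placeholder, not the scripts' actual import line):

```python
import os

# Set before importing transformers so cached models load without network checks.
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# from transformers import ...  # model loading would follow here
```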

HuggingFace / Model Note

WavLM (microsoft/wavlm-base-sv) is ungated and sufficient for speaker verification. pyannote was evaluated but not needed.

Key Files Modified

File	Change
`src/voice_profiler.py`	Threshold bug fix, audio preload optimization, `_embed_audio_np()`, `_load_full_audio()`
`src/qa_extractor.py`	Promo filter (`_is_promo_or_bumper()`), text-only fallback (`_extract_qa_text_only()`)
`src/diarizer.py`	Default threshold raised to 0.85
`diarize_training.py`	TRANSFORMERS_OFFLINE=1, threshold=0.85
`diarize_2018.py`	New targeted script for 2018 re-diarization and DB patch
`check_scores.py`	Diagnostic script — keep for future threshold tuning

Training Set (archive/archive.db)

9 episodes, 17,555 segments, 27 Q&A pairs total.

Episode ID	File	Duration	Caller Segs	Q&A Pairs
2010-10-02	2010-10-02-hr1.mp3	44m36s	79	5
2011-06-04	2011-06-04-hr1.mp3	42m42s	31	1
2011-09-10	2011-09-10-hr1.mp3	41m46s	4	0
2014-s6e05	2014-s6e05.mp3	47m27s	153	3
2015-s7e30	2015-s7e30.mp3	45m21s	105	5
2016-s8e42	2016-s8e42.mp3	90m24s	227	5
2017-s9e26	2017-s9e26.mp3	89m25s	374	5
2018-s10e17	2018-s10e17.mp3	88m22s	816	0
2018-s10e21	2018-s10e21.mp3	88m20s	454	3

Test Set — Downloaded, Not Yet Transcribed

Files saved to test-data/episodes/.

Local Filename	Source Path on IX (172.16.3.10)	Size	Notes
2011-03-12-hr1.mp3	/home/gurushow/public_html/archive/2011/3-12-11 HR 1.mp3	8.8MB	2011 unseen date
2012-03-10-hr1.mp3	/home/gurushow/public_html/archive/2012/3 - March/3-10-12HR1.mp3	11.7MB	2012 — completely untrained year
2012-06-09-hr1.mp3	/home/gurushow/public_html/archive/2012/6 - June/6-9-12-HR1.mp3	12.2MB	2012 — completely untrained year
2014-s6e19.mp3	/home/gurushow/public_html/archive/2014/06/s6e19.mp3	10.3MB	2014 different episode
2016-s8e43.mp3	/home/gurushow/public_html/archive/2016/06/s8e43.mp3	18.0MB	2016 different episode
2017-s9e30.mp3	/home/gurushow/public_html/archive/2017/04/s9e30.mp3	48.2MB	2017 different episode

Next Steps

Transcribe test episodes: py src/cli.py batch --transcribe-only test-data/episodes/
Diarize test episodes: run diarize script targeting test-data/episodes/
Extract Q&A pairs from test set
Compare Q&A quality vs training set
Performance comparison vs RTX 4090 (separate session on that machine)

RTX 4090 Performance Comparison (Separate Machine)

The 4090 machine needs:

Full repo clone from Gitea
voice-profiles/ directory (contains mike-swanson composite + 180 embeddings)
The 6 test episode MP3s from test-data/episodes/
Run: TRANSFORMERS_OFFLINE=1 py diarize_2018.py against test episodes, record realtime factor
Compare to 5070 Ti baseline: 149.5x realtime (10,600s audio in 70.9s)

Infrastructure Notes

Archive server:

IX server: 172.16.3.10 (see vault: infrastructure/ix-server.sops.yaml)
SSH blocked from command line due to key agent interference — use Python paramiko with look_for_keys=False, allow_agent=False
Tailscale must be running for 172.16.3.x access
Full archive: 579 MP3 files across /home/gurushow/public_html/archive/{2010,2011,2012,2014,2015,2016,2017,2018}/
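
The paramiko workaround noted above might look like the sketch below. Password authentication and both helper names are assumptions. The real credentials live in the vault file referenced above, and paramiko must be installed separately.

```python
def ssh_connect_kwargs(host: str, username: str, password: str) -> dict:
    """Build connect() arguments that bypass the local key agent and key files."""
    return {
        "hostname": host,
        "username": username,
        "password": password,
        "look_for_keys": False,  # do not search ~/.ssh for private keys
        "allow_agent": False,    # do not talk to the SSH agent
    }

def open_ssh(host: str, username: str, password: str):
    import paramiko  # third-party; imported lazily so the helper above has no dependency
    client = paramiko.SSHClient()
    client.load_system_host_keys()  # host must already be in known_hosts
    client.connect(**ssh_connect_kwargs(host, username, password))
    return client
```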
