radio: diarization pipeline fixes, benchmark setup, test episode set
- Fix voice_profiler threshold bug (HOST label overwrote Unknown unconditionally)
- Audio preload optimization: single ffmpeg per episode, 149.5x realtime on 5070 Ti
- WavLM threshold raised to 0.85 (Mike 0.90-0.99, callers 0.46-0.83)
- Promo/bumper filter: weighted signature scoring, 42->27 clean Q&A pairs
- Text-only Q&A fallback for episodes with no CALLER diarization labels
- TRANSFORMERS_OFFLINE=1 to skip HuggingFace freshness checks
- Add diarize_2018.py for targeted re-run + FTS5 rebuild
- Add benchmark.py + BENCH_SETUP.md for GURU-BEAST-ROG (RTX 4090) comparison
- Commit 9-episode training diarization.json outputs
- Session log: 2026-04-27-diarization-pipeline.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,135 @@
# Session Log: 2026-04-27

**Project:** The Computer Guru Show: Archive Mining System
**Goal:** Build searchable transcript archive of 579 episodes (2010-2018) with caller Q&A extraction for "then vs now" show prep
**Machine:** DESKTOP-0O8A1RL
**User:** Mike Swanson (mike)

---
## Work Completed

### Critical Bug Fix: `voice_profiler.py` `identify_speakers()`

`identify_speakers()` labeled every window as HOST regardless of similarity score: the host-role label assignment ran after the threshold check and overwrote its result. Fixed by moving the "Host:" label assignment inside the `best_score >= threshold` branch.
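A minimal sketch of the corrected control flow, not the actual `voice_profiler.py` code. The function name `label_window` and the shape of `scores` (profile name mapped to similarity) are assumptions for illustration; only the "Host:" label, the "Unknown" fallback, and the `best_score >= threshold` gate come from the notes above.

```python
from typing import Dict, Tuple


def label_window(scores: Dict[str, float], threshold: float = 0.85) -> Tuple[str, float]:
    """Return (label, best_score) for one audio window.

    `scores` is a hypothetical mapping of profile name -> similarity
    for this window; the real data structure may differ.
    """
    if not scores:
        return "Unknown", 0.0

    best_name, best_score = max(scores.items(), key=lambda item: item[1])

    if best_score >= threshold:
        # The host label is assigned only inside the threshold branch,
        # so a low-scoring window can no longer be overwritten as HOST.
        return f"Host: {best_name}", best_score

    return "Unknown", best_score
```

With the default threshold, a window scoring 0.93 against the host profile is labeled as host, while one scoring 0.83 stays "Unknown".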
### Threshold Tuning

Raised the similarity threshold from 0.70 to 0.85. A diagnostic run on `2010-10-02-hr1.mp3` confirmed clean separation:

- Mike's voice: scores 0.90-0.99
- Caller windows: scores 0.46-0.83

The new threshold sits in the gap between the two ranges, whereas 0.70 fell inside the caller range.
### Audio Preload Optimization

`identify_speakers()` previously spawned approximately 500 ffmpeg subprocesses per episode (one per 10-second window). Rewrote it to load the full audio once via `_load_full_audio()` and slice in-memory numpy arrays per window.

**Result: 149.5x realtime on RTX 5070 Ti**
Measured: 10,600 seconds of audio processed in 70.9 seconds.
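The load-once, slice-per-window pattern can be sketched as follows. This is an illustration under assumptions, not the real `_load_full_audio()`: the 16 kHz mono float32 format, the function names, and the exact ffmpeg arguments are guesses. The realtime factor helper just reproduces the arithmetic behind the figure above.

```python
import subprocess

SAMPLE_RATE = 16_000   # assumed sample rate; the real pipeline may differ
WINDOW_SECONDS = 10    # window length from the notes above


def load_full_audio(path, sample_rate=SAMPLE_RATE):
    """Decode the whole file to mono float32 PCM with a single ffmpeg call."""
    import numpy as np  # local import: only this function needs numpy

    cmd = [
        "ffmpeg", "-v", "error", "-i", path,
        "-f", "f32le", "-ac", "1", "-ar", str(sample_rate), "-",
    ]
    raw = subprocess.run(cmd, check=True, capture_output=True).stdout
    return np.frombuffer(raw, dtype=np.float32)


def iter_windows(audio, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Yield (start_seconds, samples) slices of an in-memory sample sequence."""
    step = sample_rate * window_seconds
    for start in range(0, len(audio), step):
        # Slicing a numpy array returns a view, so no audio is copied.
        yield start / sample_rate, audio[start:start + step]


def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / wall_seconds
```

Applied to the measurement above, `realtime_factor(10_600, 70.9)` is about 149.5. The final window may be shorter than 10 seconds; whether the real code pads or drops it is not recorded here.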
### Promo/Bumper Filter: `qa_extractor.py`

Added `_is_promo_or_bumper()` with weighted signature scoring:

- Score 2 = highly distinctive phrase
- Score 1 = semi-generic phrase
- Threshold = 2

Filters show promos such as "Computer running slow? Has your machine somehow acquired a life of its own?" out of the Q&A pairs. Removing these false positives reduced the extracted pairs from 42 to 27 across the 9 training episodes.
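A rough sketch of the weighted scoring idea, assuming plain substring matching and a flag when the summed score reaches the threshold. The signature table is hypothetical: the two weight-2 phrases are lifted from the promo quoted above, the weight-1 phrases are invented placeholders, and the real `_is_promo_or_bumper()` may match differently.

```python
from typing import Dict

# Hypothetical signature table: phrase -> weight.
PROMO_SIGNATURES: Dict[str, int] = {
    "computer running slow": 2,        # highly distinctive (from the quoted promo)
    "acquired a life of its own": 2,   # highly distinctive (from the quoted promo)
    "call us": 1,                      # semi-generic placeholder
    "visit us online": 1,              # semi-generic placeholder
}
PROMO_THRESHOLD = 2


def promo_score(text: str) -> int:
    """Sum the weights of every signature phrase found in the text."""
    lowered = text.lower()
    return sum(weight for phrase, weight in PROMO_SIGNATURES.items() if phrase in lowered)


def is_promo_or_bumper(text: str, threshold: int = PROMO_THRESHOLD) -> bool:
    """Flag the text when its summed signature score reaches the threshold."""
    return promo_score(text) >= threshold
```

Under these assumptions one distinctive phrase is enough to flag a segment, while a single semi-generic phrase is not; it takes two of those to reach the threshold.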
### 2018 Episode Re-diarization

Episodes `2018-s10e17` and `2018-s10e21` had stale all-HOST diarization from an aborted earlier run. Re-diarized correctly:

- `2018-s10e17`: 49 turns / 775s caller
- `2018-s10e21`: 110 turns / 1175s caller
### Text-Only Q&A Fallback

Added `_extract_qa_text_only()` to handle cases where diarization produces no CALLER labels. It uses question-pattern signals and caller-intro phrase detection, and is triggered automatically when all segments are labeled HOST.
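One way the trigger and the two text signals could fit together, sketched under assumptions. The segment shape (`speaker` and `text` keys), both regular expressions, and the function names are hypothetical; the real `_extract_qa_text_only()` likely uses richer patterns.

```python
import re
from typing import Dict, List

# Hypothetical patterns, for illustration only.
QUESTION_PATTERN = re.compile(
    r"\?|^\s*(how|why|what|can|should|is|do|does)\b", re.IGNORECASE
)
CALLER_INTRO_PATTERN = re.compile(
    r"\b(you're on the air|let's go to|welcome to the show)\b", re.IGNORECASE
)


def needs_text_fallback(segments: List[Dict[str, str]]) -> bool:
    """True when diarization labeled every segment HOST (no CALLER labels)."""
    return bool(segments) and all(seg["speaker"] == "HOST" for seg in segments)


def find_question_candidates(segments: List[Dict[str, str]]) -> List[int]:
    """Return indexes of segments that look like a caller question.

    A segment qualifies when it matches the question pattern and follows
    a caller-intro phrase that has not yet been paired with a question.
    """
    candidates = []
    caller_expected = False
    for index, seg in enumerate(segments):
        text = seg["text"]
        if CALLER_INTRO_PATTERN.search(text):
            caller_expected = True
            continue
        if caller_expected and QUESTION_PATTERN.search(text):
            candidates.append(index)
            caller_expected = False
    return candidates
```

Requiring an intro phrase before accepting a question is a design choice in this sketch: it keeps the host's own rhetorical questions from being treated as caller questions.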
### TRANSFORMERS_OFFLINE=1

Set in `diarize_training.py` and `diarize_2018.py` to prevent HuggingFace freshness checks on the cached WavLM model.
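A sketch of how a script can set the variable itself, assuming it is assigned before `transformers` is imported (the safe ordering, since the library may read it at import time). The `load_wavlm` helper is hypothetical and how the two scripts actually load the model is not recorded here.

```python
import os

# Set before any `transformers` import so cached models are used as-is.
os.environ["TRANSFORMERS_OFFLINE"] = "1"


def load_wavlm(model_name: str = "microsoft/wavlm-base-sv"):
    """Load the cached speaker-verification model (hypothetical helper)."""
    # Imported lazily so the environment variable is already in place.
    from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

    extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
    model = WavLMForXVector.from_pretrained(model_name)
    return extractor, model
```

Setting the variable inside the script means the run does not depend on the shell's syntax for per-command environment variables.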
### HuggingFace / Model Note

WavLM (`microsoft/wavlm-base-sv`) is ungated and sufficient for speaker verification. pyannote was evaluated but not needed.

---
## Key Files Modified

| File | Change |
|---|---|
| `src/voice_profiler.py` | Threshold bug fix, audio preload optimization, `_embed_audio_np()`, `_load_full_audio()` |
| `src/qa_extractor.py` | Promo filter (`_is_promo_or_bumper()`), text-only fallback (`_extract_qa_text_only()`) |
| `src/diarizer.py` | Default threshold raised to 0.85 |
| `diarize_training.py` | TRANSFORMERS_OFFLINE=1, threshold=0.85 |
| `diarize_2018.py` | New targeted script for 2018 re-diarization and DB patch |
| `check_scores.py` | Diagnostic script; keep for future threshold tuning |

---
## Training Set (archive/archive.db)

9 episodes, 17,555 segments, 27 Q&A pairs total.

| Episode ID | File | Duration | Caller Segs | Q&A Pairs |
|---|---|---|---|---|
| 2010-10-02 | 2010-10-02-hr1.mp3 | 44m36s | 79 | 5 |
| 2011-06-04 | 2011-06-04-hr1.mp3 | 42m42s | 31 | 1 |
| 2011-09-10 | 2011-09-10-hr1.mp3 | 41m46s | 4 | 0 |
| 2014-s6e05 | 2014-s6e05.mp3 | 47m27s | 153 | 3 |
| 2015-s7e30 | 2015-s7e30.mp3 | 45m21s | 105 | 5 |
| 2016-s8e42 | 2016-s8e42.mp3 | 90m24s | 227 | 5 |
| 2017-s9e26 | 2017-s9e26.mp3 | 89m25s | 374 | 5 |
| 2018-s10e17 | 2018-s10e17.mp3 | 88m22s | 816 | 0 |
| 2018-s10e21 | 2018-s10e21.mp3 | 88m20s | 454 | 3 |

---
## Test Set: Downloaded, Not Yet Transcribed

Files saved to `test-data/episodes/`.

| Local Filename | Source Path on IX (172.16.3.10) | Size | Notes |
|---|---|---|---|
| 2011-03-12-hr1.mp3 | /home/gurushow/public_html/archive/2011/3-12-11 HR 1.mp3 | 8.8MB | 2011 unseen date |
| 2012-03-10-hr1.mp3 | /home/gurushow/public_html/archive/2012/3 - March/3-10-12HR1.mp3 | 11.7MB | 2012, completely untrained year |
| 2012-06-09-hr1.mp3 | /home/gurushow/public_html/archive/2012/6 - June/6-9-12-HR1.mp3 | 12.2MB | 2012, completely untrained year |
| 2014-s6e19.mp3 | /home/gurushow/public_html/archive/2014/06/s6e19.mp3 | 10.3MB | 2014 different episode |
| 2016-s8e43.mp3 | /home/gurushow/public_html/archive/2016/06/s8e43.mp3 | 18.0MB | 2016 different episode |
| 2017-s9e30.mp3 | /home/gurushow/public_html/archive/2017/04/s9e30.mp3 | 48.2MB | 2017 different episode |

---
## Next Steps

1. Transcribe test episodes: `py src/cli.py batch --transcribe-only test-data/episodes/`
2. Diarize test episodes: run diarize script targeting `test-data/episodes/`
3. Extract Q&A pairs from test set
4. Compare Q&A quality vs training set
5. Performance comparison vs RTX 4090 (separate session on that machine)

---
## RTX 4090 Performance Comparison (Separate Machine)

The 4090 machine needs:

- Full repo clone from Gitea
- `voice-profiles/` directory (contains mike-swanson composite + 180 embeddings)
- The 6 test episode MP3s from `test-data/episodes/`
- Run: `TRANSFORMERS_OFFLINE=1 py diarize_2018.py` against test episodes, record realtime factor
- Compare to 5070 Ti baseline: **149.5x realtime** (10,600s audio in 70.9s)

---
## Infrastructure Notes

**Archive server:**
- IX server: 172.16.3.10 (see vault: `infrastructure/ix-server.sops.yaml`)
- SSH from the command line is blocked by key agent interference; use Python paramiko with `look_for_keys=False, allow_agent=False`
- Tailscale must be running for 172.16.3.x access
- Full archive: 579 MP3 files across `/home/gurushow/public_html/archive/{2010,2011,2012,2014,2015,2016,2017,2018}/`
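
A minimal sketch of the paramiko workaround, assuming credentials are supplied explicitly (for example from the vault entry above). The helper names, the username, and the choice of key file versus password are placeholders; only `look_for_keys=False` and `allow_agent=False` come from these notes.

```python
from typing import Any, Dict, Optional


def build_connect_kwargs(
    host: str,
    username: str,
    key_filename: Optional[str] = None,
    password: Optional[str] = None,
) -> Dict[str, Any]:
    """Build SSHClient.connect() arguments that bypass the key agent."""
    kwargs: Dict[str, Any] = {
        "hostname": host,
        "username": username,
        "look_for_keys": False,  # do not scan ~/.ssh for keys
        "allow_agent": False,    # do not talk to the SSH agent
    }
    if key_filename:
        kwargs["key_filename"] = key_filename
    if password:
        kwargs["password"] = password
    return kwargs


def fetch_file(remote_path: str, local_path: str, **connect_kwargs: Any) -> None:
    """Download one file over SFTP (hypothetical helper)."""
    import paramiko  # third-party; imported lazily

    client = paramiko.SSHClient()
    client.load_system_host_keys()  # rely on known_hosts for host verification
    try:
        client.connect(**connect_kwargs)
        sftp = client.open_sftp()
        try:
            sftp.get(remote_path, local_path)
        finally:
            sftp.close()
    finally:
        client.close()
```

With both agent and key discovery disabled, paramiko authenticates only with the key file or password passed in, which sidesteps the agent interference seen on the command line.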