radio: diarization pipeline fixes, benchmark setup, test episode set

- Fix voice_profiler threshold bug (HOST label overwrote Unknown unconditionally)
- Audio preload optimization: single ffmpeg per episode, 149.5x realtime on 5070 Ti
- WavLM threshold raised to 0.85 (observed scores: Mike 0.90-0.99, callers 0.46-0.83)
- Promo/bumper filter: weighted signature scoring, 42->27 clean Q&A pairs
- Text-only Q&A fallback for episodes with no CALLER diarization labels
- TRANSFORMERS_OFFLINE=1 to skip HuggingFace freshness checks
- Add diarize_2018.py for targeted re-run + FTS5 rebuild
- Add benchmark.py + BENCH_SETUP.md for GURU-BEAST-ROG (RTX 4090) comparison
- Commit 9-episode training diarization.json outputs
- Session log: 2026-04-27-diarization-pipeline.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 13:20:10 -07:00
parent 206cd2f929
commit 79abef9dc9
21 changed files with 4720 additions and 202 deletions

@@ -0,0 +1,39 @@
{
  "num_speakers": 2,
  "speaker_map": {
    "HOST": "HOST",
    "CALLER": "CALLER"
  },
  "turns": [
    {
      "speaker": "HOST",
      "start": 0.0,
      "end": 20.0,
      "confidence": 0.9
    },
    {
      "speaker": "CALLER",
      "start": 15.0,
      "end": 25.0,
      "confidence": 0.83
    },
    {
      "speaker": "HOST",
      "start": 20.0,
      "end": 1855.0,
      "confidence": 0.86
    },
    {
      "speaker": "CALLER",
      "start": 1850.0,
      "end": 1860.0,
      "confidence": 0.78
    },
    {
      "speaker": "HOST",
      "start": 1855.0,
      "end": 2505.0,
      "confidence": 0.93
    }
  ]
}
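
A minimal sketch of how a diarization output shaped like the file above might be read back and sanity-checked. It is not part of this commit: `summarize_turns` is a hypothetical helper, and the only assumptions about the schema are the fields visible in the diff (`turns`, `speaker`, `start`, `end`). It totals talk time per speaker and lists where consecutive turns overlap, which is one way to spot the 5-second HOST/CALLER overlaps present in this file.

```python
import json
from collections import defaultdict

# Sample input copied from the diarization.json shown in the diff above.
SAMPLE = json.loads("""
{
  "num_speakers": 2,
  "speaker_map": {"HOST": "HOST", "CALLER": "CALLER"},
  "turns": [
    {"speaker": "HOST",   "start": 0.0,    "end": 20.0,   "confidence": 0.9},
    {"speaker": "CALLER", "start": 15.0,   "end": 25.0,   "confidence": 0.83},
    {"speaker": "HOST",   "start": 20.0,   "end": 1855.0, "confidence": 0.86},
    {"speaker": "CALLER", "start": 1850.0, "end": 1860.0, "confidence": 0.78},
    {"speaker": "HOST",   "start": 1855.0, "end": 2505.0, "confidence": 0.93}
  ]
}
""")


def summarize_turns(doc):
    """Hypothetical helper: per-speaker talk time and overlaps between consecutive turns.

    Returns (talk_time, overlaps), where talk_time maps speaker -> seconds and
    overlaps is a list of (earlier_speaker, later_speaker, overlap_start, seconds).
    """
    turns = sorted(doc["turns"], key=lambda t: t["start"])

    # Sum raw turn durations; overlapping seconds are counted for both speakers.
    talk_time = defaultdict(float)
    for turn in turns:
        talk_time[turn["speaker"]] += turn["end"] - turn["start"]

    # Compare each turn only with the one that starts next.
    overlaps = []
    for prev, cur in zip(turns, turns[1:]):
        shared = min(prev["end"], cur["end"]) - cur["start"]
        if shared > 0:
            overlaps.append((prev["speaker"], cur["speaker"], cur["start"], shared))

    return dict(talk_time), overlaps


if __name__ == "__main__":
    talk_time, overlaps = summarize_turns(SAMPLE)
    for speaker, seconds in talk_time.items():
        print(f"{speaker}: {seconds:.1f}s")
    for earlier, later, start, seconds in overlaps:
        print(f"{earlier} -> {later} overlap at {start:.1f}s for {seconds:.1f}s")
```

Because overlapping seconds are counted for both speakers, the per-speaker totals can add up to more than the episode length; subtracting the listed overlaps gives a rough correction if that matters.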