- Build Tom (co-host) voice profile (44 embeddings, 0.698 similarity to Mike)
- diarizer.py: add CO-HOST speaker label for cohost-role profiles
- voice_profiler.py: emit "Cohost: <name>" label for cohost role
- qa_extractor.py: overlap resolution at load time (midpoint boundary split),
  4s CALLER-preference threshold, turn-based caller-intro lookback (2 HOST turns),
  _preceded_by_caller_intro() helper, _PHONE_GREETING pattern,
  751-1041 + "we'll get your problem solved" promo signatures
- benchmark.py: use src.transcriber.transcribe with batch_size=16
- add index_test_episodes.py and build_cohost_profile.py scripts
- add .gitignore (exclude episodes, transcripts, *.db, .venv)
- session log: 2026-04-27-qa-extraction-cohost-indexing.md

Result: 2016-s8e43 drops from 12 false-positive Q&A pairs to 2 real caller pairs.
archive.db: 6 episodes, 762 segments, 10 Q&A pairs, FTS5 search verified.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
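
The "overlap resolution (midpoint boundary split)" item for qa_extractor.py could be sketched roughly as below. This is only an illustration of the general technique: the `Segment` dataclass, its field names, and `resolve_overlaps` are hypothetical and are not taken from the repository, and the handling of fully swallowed segments is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical diarized segment; the real schema may differ."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def resolve_overlaps(segments: List[Segment]) -> List[Segment]:
    """Split each overlap between consecutive segments at its midpoint."""
    ordered = sorted(segments, key=lambda s: s.start)
    result: List[Segment] = []
    for seg in ordered:
        if result and seg.start < result[-1].end:
            prev = result[-1]
            midpoint = (seg.start + prev.end) / 2
            if seg.end <= midpoint:
                # The later segment would be left empty after the split;
                # this sketch simply drops it (an assumption).
                continue
            # Previous segment ends at the midpoint, current one starts there.
            result[-1] = replace(prev, end=midpoint)
            seg = replace(seg, start=midpoint)
        result.append(seg)
    return result


example = resolve_overlaps([
    Segment("HOST", 0.0, 10.0),
    Segment("CALLER", 8.0, 15.0),
])
```

Here the 8.0 to 10.0 overlap is split at 9.0, so the HOST segment ends and the CALLER segment begins at the same boundary.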
34 lines
733 B
JSON
{
  "Mike Swanson": {
    "role": "host",
    "num_samples": 180,
    "source_episodes": [
      "2010-10-02-hr1.mp3",
      "2011-06-04-hr1.mp3",
      "2011-09-10-hr1.mp3",
      "2014-s6e05.mp3",
      "2015-s7e30.mp3",
      "2016-s8e42.mp3",
      "2017-s9e26.mp3",
      "2018-s10e17.mp3",
      "2018-s10e21.mp3",
      "2010-10-02-hr1.mp3",
      "2011-06-04-hr1.mp3",
      "2011-09-10-hr1.mp3",
      "2014-s6e05.mp3",
      "2015-s7e30.mp3",
      "2016-s8e42.mp3",
      "2017-s9e26.mp3",
      "2018-s10e17.mp3",
      "2018-s10e21.mp3"
    ]
  },
  "Tom": {
    "role": "cohost",
    "num_samples": 44,
    "source_episodes": [
      "2014-s6e19.mp3",
      "2016-s8e43.mp3"
    ]
  }
}
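
A minimal sketch of how a profile file shaped like the one above might be read. It assumes only the keys visible in the JSON (`role`, `num_samples`, `source_episodes`). The helper names `profile_label` and `unique_episodes` are hypothetical, the embedded data is an abbreviated excerpt, and the "Host" prefix is an assumption (the commit message only names the "Cohost: <name>" label).

```python
import json

# Abbreviated excerpt of the profile JSON, embedded so the sketch is self-contained.
PROFILES_JSON = """
{
  "Mike Swanson": {
    "role": "host",
    "num_samples": 180,
    "source_episodes": ["2010-10-02-hr1.mp3", "2011-06-04-hr1.mp3", "2010-10-02-hr1.mp3"]
  },
  "Tom": {
    "role": "cohost",
    "num_samples": 44,
    "source_episodes": ["2014-s6e19.mp3", "2016-s8e43.mp3"]
  }
}
"""

# "Cohost" comes from the commit message; "Host" is assumed for illustration.
ROLE_PREFIX = {"host": "Host", "cohost": "Cohost"}


def profile_label(name: str, profile: dict) -> str:
    """Build a display label such as 'Cohost: Tom' from a profile's role."""
    prefix = ROLE_PREFIX.get(profile["role"], profile["role"].title())
    return f"{prefix}: {name}"


def unique_episodes(profile: dict) -> list:
    """Return source episodes with repeats removed, keeping first-seen order."""
    return list(dict.fromkeys(profile["source_episodes"]))


profiles = json.loads(PROFILES_JSON)
labels = {name: profile_label(name, p) for name, p in profiles.items()}
```

Because the host's `source_episodes` list contains each file twice, a helper like `unique_episodes` can be useful when counting how many distinct episodes contributed to a profile.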