radio show: co-host voice profile, Q&A extraction fixes, archive index
- Build Tom (co-host) voice profile (44 embeddings, 0.698 similarity to Mike)
- diarizer.py: add CO-HOST speaker label for cohost-role profiles
- voice_profiler.py: emit "Cohost: <name>" label for cohost role
- qa_extractor.py: overlap resolution at load time (midpoint boundary split),
  4s CALLER-preference threshold, turn-based caller-intro lookback (2 HOST turns),
  _preceded_by_caller_intro() helper, _PHONE_GREETING pattern,
  751-1041 + "we'll get your problem solved" promo signatures
- benchmark.py: use src.transcriber.transcribe with batch_size=16
- add index_test_episodes.py and build_cohost_profile.py scripts
- add .gitignore (exclude episodes, transcripts, *.db, .venv)
- session log: 2026-04-27-qa-extraction-cohost-indexing.md

Result: 2016-s8e43 drops from 12 false-positive Q&A pairs to 2 real caller pairs.
archive.db: 6 episodes, 762 segments, 10 Q&A pairs, FTS5 search verified.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,251 @@
# Session Log: Q&A Extraction — Co-Host Profile + Archive Indexing
**Date:** 2026-04-27
**Project:** Radio Show Archive Mining — Computer Guru Show

---

## User
- **User:** Mike Swanson (mike)
- **Machine:** DESKTOP-0O8A1RL
- **Role:** admin

---

## Session Summary

The session resumed after a benchmark run showed batched Whisper transcription reaching 63.8x real-time with `int8_float16` compute. Focus then shifted to evaluating Q&A extraction quality across the six test episodes, which exposed a critical source of false positives: co-host Tom was being labeled CALLER because his voice scored below the host-match similarity threshold.

A co-host voice profile for Tom was built from 44 embeddings drawn from two episodes (2014-s6e19 and 2016-s8e43). Its cosine similarity to Mike's profile is 0.698, well below the 0.85 match threshold, so the two profiles separate cleanly. `voice_profiler.py` and `diarizer.py` were updated to emit "Cohost: Tom" labels and map them to a new "CO-HOST" speaker tag. Re-diarizing the two co-host-era episodes cleaned up the Q&A results substantially: 2016-s8e43 went from 12 pairs, 11 of them false positives, to 2 real WiFi caller pairs.

Several bugs in `qa_extractor.py` were fixed: overlap resolution for sliding-window diarization boundaries, a CALLER-preference threshold for long batch-transcript segments, and a turn-based caller-intro lookback that replaces an ineffective 120s time window. Phone-greeting detection and new promo signatures were added. The final Q&A count is 10 pairs across 6 episodes, with 2014-s6e19 correctly yielding 0 (a gaming co-host episode with no actual callers).

`archive.db` was created with the ArchiveIndex schema (episodes, segments, segments_fts, qa_pairs, qa_fts). All 6 test episodes were indexed: 762 segments and 10 Q&A pairs. FTS5 search was verified with the queries "router", "Windows 10", "Internet Explorer", "antivirus", and "connect".

---

## Key Decisions

- **Co-host threshold uses same 0.85 bar as host**: Tom scores 0.698 against Mike's profile. Any voice scoring >= 0.85 against Tom's composite is labeled CO-HOST. A single threshold applies to all profiles instead of per-profile thresholds.
- **Turn-based lookback for caller-intro (2 HOST turns, not 120s)**: Long HOST monologue blocks (8-10 min) in big show segments meant a time-based lookback missed the caller introduction. Checking the previous 2 HOST turns catches it regardless of block length.
- **CALLER-preference at 4s minimum overlap**: Batch transcription produces ~26s segments, while diarization CALLER windows are ~10s, so pure majority vote always gave HOST. Requiring at least 4s of CALLER coverage labels the segment CALLER without being overly aggressive on co-host episodes.
- **Midpoint boundary resolution at load time**: Rather than re-diarizing everything, the sliding-window overlap is resolved in `load_diarized_transcript()`, so the fix applies retroactively to all saved diarization files without touching the JSON.
- **751-1041 added as promo signal**: This is the earlier Tucson show number (later seasons use 790-2040). It carries weight 1, so a second semi-generic signal is needed before a segment is filtered.
- **Tom's windows sourced from first 60 min of co-host episodes**: Real callers don't call in during the first hour of a 2-hour show (only exceptions: very end of show), so first-hour CALLER windows are safely all Tom.
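
The turn-based lookback decision above can be sketched as follows. This is a minimal illustration and not the code in `qa_extractor.py`: the `Turn` dataclass, the `CALLER_INTRO` phrases, and `preceded_by_caller_intro_sketch()` are hypothetical names, and the real `_preceded_by_caller_intro()` helper may use different patterns and data structures. The point it shows is that counting HOST turns reaches an intro at the end of a 9-minute monologue, where a 120s window measured from the turn start would not.

```python
import re
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "HOST", "CO-HOST", or "CALLER"
    start: float  # seconds
    end: float    # seconds
    text: str


# Hypothetical intro phrases; the real pattern list is not shown in this log.
CALLER_INTRO = re.compile(
    r"let's go to the phones|you're on the air|go ahead, caller", re.IGNORECASE
)


def preceded_by_caller_intro_sketch(turns, idx, max_host_turns=2):
    """True if one of the last `max_host_turns` HOST turns before turns[idx]
    contains a caller introduction. Non-HOST turns are skipped, not counted."""
    seen = 0
    for turn in reversed(turns[:idx]):
        if turn.speaker != "HOST":
            continue
        if CALLER_INTRO.search(turn.text):
            return True
        seen += 1
        if seen >= max_host_turns:
            break
    return False


turns = [
    Turn("HOST", 0.0, 540.0, "Long monologue... let's go to the phones."),
    Turn("CALLER", 540.0, 545.0, "Hi Mike."),
    Turn("HOST", 545.0, 550.0, "What's going on?"),
    Turn("CALLER", 550.0, 570.0, "My WiFi keeps dropping. What should I check?"),
]
print(preceded_by_caller_intro_sketch(turns, 3))  # True
```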

---

## Problems Encountered

- **2016-s8e43 had 12 Q&A pairs, 11 false positives**: The root cause was Tom (co-host) being labeled CALLER throughout. Fixed by building Tom's voice profile and re-diarizing.
- **2014-s6e19 had 2 Q&A pairs from gaming discussion**: Same co-host issue. After re-diarization: 0 pairs (correct, since that gaming special had no actual callers).
- **2012-03-10 yielded 0 segments labeled CALLER**: Midpoint assignment landed on HOST turns (with HOST at 0-20s and CALLER at 15-30s, a midpoint of 15.1s falls in HOST). Fixed by overlap-preference assignment with a 4s CALLER minimum.
- **Real WiFi caller (2016, ~4794s) was missing after first fix attempt**: The aggressive time-based lookback (120s), combined with short CALLER turns from sliding-window diarization, caused the caller question to land in a HOST segment. Fixed by the turn-based lookback plus the co-host profile, which eliminated Tom noise and let real caller windows survive.
- **2012-Jun pair at 1325s was a promo**: "The Computer Guru. We'll get your problem solved. Call 751-1041 today" passed the promo filter. Fixed by adding 751-1041 and "we'll get your problem solved" as promo signatures.
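
The promo fix in the last item relies on weighted signature scoring against `PROMO_SCORE_THRESHOLD = 2`. A rough sketch of that idea follows. Only the threshold and the weight of 751-1041 come from this log; the other weights, the regexes, and the function names `promo_score()` and `is_promo()` are assumptions for illustration. With these assumed weights, the 2012-Jun promo text scores 2 and is filtered, while a caller who merely mentions the phone number scores 1 and is kept.

```python
import re

PROMO_SCORE_THRESHOLD = 2  # value listed under "Key thresholds" in this log

# (pattern, weight). Only the weight of 751-1041 is stated in this log;
# the remaining weights are assumed for the sake of the example.
PROMO_SIGNATURES = [
    (re.compile(r"751[-\s]?1041"), 1),
    (re.compile(r"790[-\s]?2040"), 1),                                   # assumed weight
    (re.compile(r"we'll get your problem solved", re.IGNORECASE), 1),    # assumed weight
]


def promo_score(text):
    """Sum the weights of every signature that appears in the text."""
    return sum(weight for pattern, weight in PROMO_SIGNATURES if pattern.search(text))


def is_promo(text):
    return promo_score(text) >= PROMO_SCORE_THRESHOLD


promo = "The Computer Guru. We'll get your problem solved. Call 751-1041 today"
caller = "I wrote down 751-1041 but my router still won't connect."
print(is_promo(promo), is_promo(caller))  # True False
```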

---

## Files Created / Modified

### New files
```
projects/radio-show/audio-processor/build_cohost_profile.py
projects/radio-show/audio-processor/index_test_episodes.py
projects/radio-show/audio-processor/archive.db
projects/radio-show/audio-processor/voice-profiles/tom/
projects/radio-show/audio-processor/voice-profiles/profiles.json  (updated: Tom added)
projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md  (this file)
```

### Modified
```
src/voice_profiler.py  — emit "Cohost: <name>" label for cohost role
src/diarizer.py        — map "Cohost:" prefix to "CO-HOST" speaker
src/qa_extractor.py    — overlap resolution, CALLER-preference, turn-based
                         caller-intro lookback, _preceded_by_caller_intro(),
                         _PHONE_GREETING, 751-1041 + promo sig additions
test-data/transcripts/2014-s6e19/diarization.json  (re-diarized with Tom profile)
test-data/transcripts/2016-s8e43/diarization.json  (re-diarized with Tom profile)
```

---

## Benchmark Results (from previous run — baseline for BEAST comparison)

**Machine:** DESKTOP-0O8A1RL — NVIDIA GeForce RTX 5070 Ti Laptop GPU

| Episode | Audio | Wall (diarize) | RTF |
|---------|-------|----------------|-----|
| 2011-03-12-hr1 | 2509s | 15.1s | 166.1x |
| 2012-03-10-hr1 | 2634s | 12.2s | 215.5x |
| 2012-06-09-hr1 | 2648s | 12.2s | 216.8x |
| 2014-s6e19 | 2914s | 13.4s | 216.9x |
| 2016-s8e43 | 5326s | 24.2s | 219.6x |
| 2017-s9e30 | 5343s | 24.7s | 216.4x |
| **TOTAL** | **21374s** | **101.9s** | **209.7x** |

- Transcription (batched Whisper large-v3): 63.8x realtime
- Diarization: 209.7x realtime
- Diarization vs DESKTOP-0O8A1RL baseline (149.5x): **+60.2x (+40.3%)**

---
## Archive DB State

**Path:** `projects/radio-show/audio-processor/archive.db`

```
Episodes : 6
Segments : 762
Q&A pairs: 10
```

**Q&A pairs by episode:**

| Episode | Pairs | Notes |
|---------|-------|-------|
| 2011-03-12-hr1 | 3 | IE lockout call, cloud computing, ghost hunting caller |
| 2012-03-10-hr1 | 1 | iPad 3 discussion |
| 2012-06-09-hr1 | 1 | Windows repair feature call |
| 2014-s6e19 | 0 | Gaming co-host special — no actual callers |
| 2016-s8e43 | 2 | WiFi connectivity caller (2 turns of same call) |
| 2017-s9e30 | 3 | Software control, Cat5 cabling (Charlie), WiFi ports |

---
## Voice Profiles State

**Path:** `projects/radio-show/audio-processor/voice-profiles/`

| Name | Role | Embeddings | Source Episodes |
|------|------|-----------|-----------------|
| Mike Swanson | host | 180 | 9 episodes (2010-2018) |
| Tom | cohost | 44 | 2014-s6e19, 2016-s8e43 |

Tom vs Mike cosine similarity: **0.698** (well-separated at 0.85 threshold)

**Tom's source windows used:**
- 2014-s6e19: 195-260s, 320-425s, 600-650s, 675-710s
- 2016-s8e43: 100-115s, 135-160s, 270-295s, 575-605s, 1185-1235s, 1790-1870s, 2020-2055s
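
To make the role of the 0.85 threshold concrete, here is a small sketch of labeling one embedding window against profile composites by cosine similarity. It is an illustration under stated assumptions: the composite is taken to be the mean of a profile's embeddings, the vectors are toy 3-dimensional values (real WavLM embeddings are far larger), and `cosine()`, `composite()`, and `label_window()` are hypothetical names, not functions from `voice_profiler.py` or `diarizer.py`. A window that clears the threshold for no profile falls through to CALLER, which is how an unprofiled co-host ends up mislabeled.

```python
import math

MATCH_THRESHOLD = 0.85  # same bar for every profile, per this log


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def composite(embeddings):
    """Element-wise mean of a profile's embeddings (assumed composite method)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def label_window(embedding, profiles):
    """profiles maps a speaker label to its composite vector.
    Returns the best label scoring >= MATCH_THRESHOLD, else "CALLER"."""
    best_label, best_score = "CALLER", MATCH_THRESHOLD
    for label, comp in profiles.items():
        score = cosine(embedding, comp)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Toy vectors for illustration only.
profiles = {
    "HOST": composite([[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]]),
    "CO-HOST": composite([[0.6, 0.8, 0.0], [0.7, 0.7, 0.1]]),
}
print(label_window([0.95, 0.15, 0.0], profiles))   # HOST
print(label_window([0.65, 0.75, 0.05], profiles))  # CO-HOST
print(label_window([0.0, 0.1, 1.0], profiles))     # CALLER
```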

---

## Co-Host Era Notes

Tom was the regular in-studio co-host/board-op roughly 2013-2016. His voice is in episodes from at least 2014 through 2016 (confirmed from test set). The 2011 and 2012 episodes are pure call-in format with no co-host.

If there are occasional guest co-hosts or fill-in hosts in other years, they would still be labeled CALLER until profiled. These would be rare and would likely not form question patterns that survive the caller-intro gate.

---
## Pending Tasks for BEAST (GURU-BEAST-ROG)

### 1. Run benchmark.py to establish RTX 4090 baseline

```bash
cd D:/claudetools/projects/radio-show/audio-processor
.venv/Scripts/python benchmark.py 2>&1 | tee bench-4090.txt
```

BENCH_SETUP.md has all setup steps. The voice profiles are in `voice-profiles/` (already copied or available via Tailscale/robocopy from DESKTOP-0O8A1RL). Test episodes go in `test-data/episodes/`.

Expected: diarization RTF should be ~250-300x on RTX 4090 (vs 209.7x on laptop 5070 Ti). Transcription should be ~70-80x.

Update `benchmark.py` line 27 after measuring:
```python
BASELINE_RTF = 209.7  # current laptop 5070 Ti baseline
```
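
Once the RTX 4090 numbers are in, comparing them against the baseline is simple arithmetic. The sketch below reproduces the figures already reported in this log. The helper names `rtf()` and `delta_vs_baseline()` are hypothetical and are not taken from `benchmark.py`.

```python
def rtf(audio_s, wall_s):
    """Real-time factor: seconds of audio processed per second of wall time."""
    return audio_s / wall_s


def delta_vs_baseline(measured, baseline):
    """Return (absolute difference, percent change) relative to the baseline."""
    diff = measured - baseline
    return diff, 100.0 * diff / baseline


# Totals from the benchmark table: 21374s of audio in 101.9s of diarization.
total_rtf = rtf(21374, 101.9)

# Reported diarization RTF (209.7x) against the earlier baseline (149.5x).
diff, pct = delta_vs_baseline(209.7, 149.5)
print(f"+{diff:.1f}x (+{pct:.1f}%)")  # +60.2x (+40.3%)
```

Recomputing the RTF from the rounded wall times in the table lands within about 0.1x of the reported total, so small per-row differences are rounding, not measurement error.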

### 2. Download full archive from IX server (172.16.3.10)

Use paramiko (SSH with key agent disabled):
```python
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
# Password auth only: skip local key files and the SSH agent.
ssh.connect("172.16.3.10", username="gurushow", password="<from vault>",
            look_for_keys=False, allow_agent=False)
```

- Archive path: `/home/gurushow/public_html/archive/Radio/`
- Episode count: 579 MP3s across 2010-2018 (no 2013 season)
- Approximate total size: 30-40 GB

Download script skeleton in prior session log: `2026-04-27-diarization-pipeline.md`

**Tailscale required**: the IX server at 172.16.3.10 is reachable only over the VPN.
### 3. Full archive processing

Once episodes are downloaded:

```bash
# Transcribe + diarize all episodes
cd D:/claudetools/projects/radio-show/audio-processor
.venv/Scripts/python diarize_training.py   # or a new batch_process_all.py

# Index everything into archive.db
.venv/Scripts/python index_test_episodes.py   # modify to point at full episodes dir
```

The pipeline is idempotent: `add_segments()` skips episodes already indexed.
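
One way such a skip can work is sketched below with the standard-library `sqlite3` module. The two-table schema, the column names, and `add_segments_sketch()` are assumptions for illustration; the real `add_segments()` in `src/indexer.py` is not shown in this log and may detect already-indexed episodes differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE episodes (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE segments (episode_id TEXT, start_s REAL, text TEXT)")


def add_segments_sketch(conn, episode_id, segments):
    """Insert an episode's segments once; return the number of rows added."""
    already = conn.execute(
        "SELECT 1 FROM episodes WHERE id = ?", (episode_id,)
    ).fetchone()
    if already:
        return 0  # episode was indexed on an earlier run
    with conn:  # single transaction: episode row and segments land together
        conn.execute("INSERT INTO episodes (id) VALUES (?)", (episode_id,))
        conn.executemany(
            "INSERT INTO segments (episode_id, start_s, text) VALUES (?, ?, ?)",
            [(episode_id, start, text) for start, text in segments],
        )
    return len(segments)


# Invented demo rows, not real transcript content.
segs = [(0.0, "Welcome to the show."), (12.5, "First up today...")]
print(add_segments_sketch(conn, "demo-ep1", segs))  # 2
print(add_segments_sketch(conn, "demo-ep1", segs))  # 0
```

Re-running the indexer over a partially processed archive then only does work for episodes it has not seen.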

### 4. Verify co-host era episodes

Episodes from the 2013-2016 era should now correctly separate Tom (CO-HOST) from actual callers. Spot-check a few 2015 episodes after processing to confirm Tom's profile generalizes well.

If any 2015/2016 episodes show too many CALLER turns that are clearly Tom (his voice may have changed slightly over the years), re-run `build_cohost_profile.py` with windows from that episode added to the `TOM_WINDOWS` dict.

---
## Technical Reference

### Key thresholds

```python
host_match_threshold = 0.85    # WavLM cosine similarity — applied to ALL profiles
CALLER_MIN_S = 4.0             # min CALLER coverage in transcript segment to label CALLER
PROMO_SCORE_THRESHOLD = 2      # weighted promo signature score
MIN_QUESTION_DURATION = 5.0    # seconds
MIN_ANSWER_DURATION = 15.0     # seconds
MAX_GAP_BETWEEN_QA = 30.0      # seconds
```
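
The CALLER-preference rule behind `CALLER_MIN_S` can be sketched as below. This is an assumption-laden illustration, not the code in `qa_extractor.py`: `overlap_s()` and `label_segment()` are hypothetical names, and turns are assumed to be `(speaker, start, end)` tuples whose boundaries no longer overlap.

```python
CALLER_MIN_S = 4.0  # from the thresholds above


def overlap_s(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segment(seg_start, seg_end, turns):
    """Label a transcript segment from diarization turns.
    CALLER wins with at least CALLER_MIN_S seconds of coverage, even when
    another speaker covers more of the segment."""
    coverage = {}
    for speaker, start, end in turns:
        secs = overlap_s(seg_start, seg_end, start, end)
        if secs > 0:
            coverage[speaker] = coverage.get(speaker, 0.0) + secs
    if coverage.get("CALLER", 0.0) >= CALLER_MIN_S:
        return "CALLER"
    # Otherwise fall back to the speaker with the most coverage.
    return max(coverage, key=coverage.get) if coverage else None


turns = [("HOST", 0.0, 17.5), ("CALLER", 17.5, 27.5), ("HOST", 27.5, 40.0)]
print(label_segment(5.0, 31.0, turns))  # CALLER (10s CALLER vs 16s HOST)
print(label_segment(0.0, 20.0, turns))  # HOST (only 2.5s of CALLER)
```

The first call shows why majority vote failed on ~26s batch segments: HOST covers 16s of the segment, yet the 10s CALLER window is the part that matters for Q&A extraction.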

### Diarization sliding window

```python
window_s = 10.0   # 10s embedding windows
hop_s = 5.0       # 5s hop → overlapping boundaries (resolved at load time)
```
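
A 10s window with a 5s hop means neighboring windows share 5s, so consecutive turns can overlap at their edges. The sketch below shows one plausible reading of the load-time fix, where each overlap is split at its midpoint. `sliding_windows()` and `resolve_overlaps()` are hypothetical names, and the actual logic in `load_diarized_transcript()` may differ in detail.

```python
def sliding_windows(duration_s, window_s=10.0, hop_s=5.0):
    """Yield (start, end) windows; the final window is clipped to the audio length."""
    start = 0.0
    while start < duration_s:
        yield (start, min(start + window_s, duration_s))
        if start + window_s >= duration_s:
            break
        start += hop_s


def resolve_overlaps(turns):
    """Split the overlap between consecutive turns at its midpoint.
    turns: list of (speaker, start, end) sorted by start time."""
    resolved = [list(turn) for turn in turns]
    for prev, cur in zip(resolved, resolved[1:]):
        if cur[1] < prev[2]:  # cur starts before prev ends
            midpoint = (cur[1] + prev[2]) / 2.0
            prev[2] = midpoint
            cur[1] = midpoint
    return [tuple(turn) for turn in resolved]


print(list(sliding_windows(22.0)))
# [(0.0, 10.0), (5.0, 15.0), (10.0, 20.0), (15.0, 22.0)]
print(resolve_overlaps([("HOST", 0.0, 20.0), ("CALLER", 15.0, 30.0)]))
# [('HOST', 0.0, 17.5), ('CALLER', 17.5, 30.0)]
```

Because the split happens when turns are loaded, saved `diarization.json` files stay untouched and the fix applies to every previously diarized episode.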

### Transcription (batch mode)

```python
model_size = "large-v3"
compute_type = "int8_float16"
batch_size = 16
# No word timestamps in batch mode (not needed for search/diarization)
```

### DB search examples

```python
from pathlib import Path

from src.indexer import ArchiveIndex

with ArchiveIndex(Path("archive.db")) as idx:
    # Segment search
    results = idx.search("router", limit=20)
    results = idx.search("Windows 10", speaker_filter="HOST", limit=10)

    # Q&A search
    qa = idx.search_qa("antivirus", limit=10)
    qa = idx.search_qa("wifi connect", limit=10)
```
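
For readers without the project checked out, the kind of FTS5 query these calls wrap can be sketched with the standard-library `sqlite3` module. Everything here is an assumption for illustration: the column names, the external-content layout, and the `search()` helper are not the real ArchiveIndex schema (only the table names `segments` and `segments_fts` come from this log), and the rows are invented demo text. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE segments (
    id INTEGER PRIMARY KEY,
    episode TEXT NOT NULL,
    speaker TEXT NOT NULL,
    start_s REAL NOT NULL,
    text TEXT NOT NULL
);
-- External-content FTS5 table: indexes segments.text without a second copy.
CREATE VIRTUAL TABLE segments_fts
    USING fts5(text, content='segments', content_rowid='id');
""")

# Invented demo rows, not real transcript content.
rows = [
    ("demo-ep1", "CALLER", 10.0, "My laptop will not connect to the WiFi router."),
    ("demo-ep1", "HOST", 25.0, "Let's check the router settings first."),
    ("demo-ep2", "HOST", 5.0, "Internet Explorer locked him out."),
]
for row in rows:
    cur = conn.execute(
        "INSERT INTO segments (episode, speaker, start_s, text) VALUES (?, ?, ?, ?)",
        row,
    )
    conn.execute(
        "INSERT INTO segments_fts (rowid, text) VALUES (?, ?)",
        (cur.lastrowid, row[3]),
    )


def search(query, speaker_filter=None, limit=20):
    """Full-text search over segments, best matches first."""
    sql = (
        "SELECT s.episode, s.speaker, s.start_s, s.text "
        "FROM segments_fts JOIN segments s ON s.id = segments_fts.rowid "
        "WHERE segments_fts MATCH ?"
    )
    params = [query]
    if speaker_filter:
        sql += " AND s.speaker = ?"
        params.append(speaker_filter)
    sql += " ORDER BY rank LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


print(search("router"))
print(search("router", speaker_filter="HOST"))
```

A multi-word query such as `"wifi connect"` is treated by FTS5 as an implicit AND of both terms, which matches how the Q&A search example above is phrased.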

### Archive server

```
Host: 172.16.3.10 (requires Tailscale)
User: gurushow
Archive root: /home/gurushow/public_html/archive/Radio/
SSH: paramiko with look_for_keys=False, allow_agent=False
```