Mike confirmed there is no co-host named "Tom": the voice in 2014-s6e19 and 2016-s8e43 is Tara. The 5070 Ti session fabricated the Tom identity. The voice profile itself (44 embeddings, 0.698 cosine vs Mike) is correct; only the human label was wrong.

Rename swept:
- voice-profiles/tom/ -> voice-profiles/tara/ (git mv preserves all .npy)
- voice-profiles/profiles.json: "Tom" key -> "Tara"
- build_cohost_profile.py: TOM_WINDOWS -> TARA_WINDOWS, COHOST_NAME, comments
- 2026-04-27-qa-extraction-cohost-indexing.md: correction header + body sweep
- 2026-04-27-4090-benchmark-and-test-set.md: closure note
- .claude/memory/radio_show_no_cohost_named_tom.md: resolution + speaker roster

Diarization re-run after rename so speaker_map emits "Cohost: Tara". Q&A counts unchanged (rename is label-only): 9 pairs across 6 test episodes.

Tara distribution from the post-rename diarization (per-episode % of audio):
- 2011-03-12-hr1: 140s, 5.6%, likely false positive (call-in only)
- 2012-03-10-hr1: 30s, 1.1%, likely false positive (call-in only)
- 2012-06-09-hr1: 340s, 12.8%, suspicious, pending Mike confirm
- 2014-s6e19: 680s, 23.3%, confirmed
- 2016-s8e43: 1890s, 35.5%, confirmed
- 2017-s9e30: 610s, 11.4%, plausible, pending Mike confirm

Broader speaker-roster context Mike provided this session (saved to memory): the show has had multiple co-hosts (Tara, Randall, Rob) plus producers/board ops (Andrew, Shannon, Ken, others) who would sometimes go on-air. Only Tara has a profile so far. Every other speaker is currently labeled CALLER, which means small CO-HOST attributions in unexpected episodes (e.g. 2011/2012) may actually be a producer rather than a false positive. Mike to spot-check.

Action item before full-archive run: build profiles for Randall, Rob, and the named producers to avoid systematic Q&A false positives in early-years and 2018/2019 episodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Session Log — 2026-04-27 (continuation)
Project: The Computer Guru Show, Archive Mining System
Goal: RTX 4090 perf comparison + run unseen test episodes through full pipeline (transcribe / diarize / Q&A)
Machine: GURU-BEAST-ROG (RTX 4090, 24GB)
User: Mike Swanson (mike)
Companion to:
- `2026-04-27-diarization-pipeline.md` (DESKTOP-0O8A1RL, RTX 5070 Ti: initial diarization fixes)
- `2026-04-27-qa-extraction-cohost-indexing.md` (DESKTOP-0O8A1RL: co-host profile, batched Whisper, Q&A overhaul)
This run uses the post-overhaul code (commit e9ac607): batched Whisper transcription, co-host-aware diarizer, revised Q&A extractor.
Headline
| Metric | 5070 Ti baseline | RTX 4090 | Delta |
|---|---|---|---|
| Diarization | 209.7x realtime | 338.1x | +128.4x (+61.2%) |
| Transcription (batched, large-v3 int8_float16) | 63.8x | 94.8x | +31.0x (+48.6%) |
| Q&A pairs (6 test episodes) | 10 | 9 | within noise |
21,374s of audio (5h 56m) end-to-end on the 4090: 225.5s transcription + 63.2s diarization + Q&A extraction.
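The realtime factors and percentage deltas in the headline table can be re-derived from the audio and wall-clock totals. A minimal sketch; the function names are illustrative and not taken from `benchmark.py`:

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


def percent_gain(baseline_rtf: float, new_rtf: float) -> float:
    """Relative improvement of new_rtf over baseline_rtf, in percent."""
    return (new_rtf / baseline_rtf - 1.0) * 100.0


# Totals reported in this log for the 4090 run.
transcription_rtf = realtime_factor(21374, 225.5)

# 5070 Ti baseline vs RTX 4090, as listed in the headline table.
diarization_gain = percent_gain(209.7, 338.1)
transcription_gain = percent_gain(63.8, 94.8)

print(f"transcription RTF: {transcription_rtf:.1f}x")
print(f"diarization gain: {diarization_gain:+.1f}%")
print(f"transcription gain: {transcription_gain:+.1f}%")
```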
Co-host identity correction — Tara, not Tom
The 5070 Ti session fabricated a co-host named "Tom" — Mike confirmed there is no such person on the show. After listening to the source windows, Mike identified the voice in both 2014-s6e19 and 2016-s8e43 as Tara (a real co-host; the show has had multiple over the years).
Rename swept this session:
- `voice-profiles/tom/` → `voice-profiles/tara/` (git mv, all 44 embeddings + composite preserved)
- `voice-profiles/profiles.json`: `"Tom"` key → `"Tara"`
- `build_cohost_profile.py`: docstring, `TOM_WINDOWS` → `TARA_WINDOWS`, `COHOST_NAME = "Tara"`, console output strings
- `projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md`: correction header added, all body references updated
- `.claude/memory/radio_show_no_cohost_named_tom.md`: resolution recorded
- Diarization re-run post-rename so `speaker_map` in each `diarization.json` emits `Cohost: Tara`
The 5070 Ti session log's claim of "Tom was the regular co-host roughly 2013-2016" carried two errors: the wrong name AND an unverified tenure window. The corrected log notes Tara appears in 2014-s6e19 and 2016-s8e43 only — generalizing to the full 2013-2016 era hasn't been confirmed.
Setup notes (for next machine)
- ffmpeg/ffprobe is required on PATH: the voice profiler shells out to ffprobe for audio duration, and the pipeline crashes on the first diarize call without it. It was missing on this machine; installed via `winget install Gyan.FFmpeg`. BENCH_SETUP.md updated to call this out as a Step-2 prereq.
- `.gitignore` (added in `e9ac607`) excludes `episodes/`, `transcripts/`, `*.db`, `.venv`. The test MP3s + transcripts I committed earlier in `2c06e72` are still tracked from before the gitignore arrived; can be `git rm --cached`-ed in a follow-up cleanup.
- All voice profiles, training data, and test MP3s were already on this machine via prior auto-sync.
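A preflight check would surface the missing-ffprobe problem before the first diarize call rather than as a mid-pipeline crash. A minimal sketch using only the standard library; `missing_tools` is a hypothetical helper, not something that exists in the repo:

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def require_tools(tools=("ffmpeg", "ffprobe")):
    """Exit with a readable message if any required executable is missing."""
    missing = missing_tools(tools)
    if missing:
        raise SystemExit(
            "Missing on PATH: " + ", ".join(missing) + " (see BENCH_SETUP.md, Step 2)"
        )
```

Calling `require_tools()` at the top of the benchmark entry point would be one place to put it.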
Phase 1 — Whisper Transcription (large-v3, batched, int8_float16, batch_size=16)
| Episode | Audio | Wall | RTF |
|---|---|---|---|
| 2011-03-12-hr1 | 2509s | 29.7s | 84.6x |
| 2012-03-10-hr1 | 2634s | 30.3s | 87.0x |
| 2012-06-09-hr1 | 2648s | 33.6s | 78.8x |
| 2014-s6e19 | 2914s | 30.2s | 96.6x |
| 2016-s8e43 | 5326s | 49.2s | 108.2x |
| 2017-s9e30 | 5343s | 52.5s | 101.8x |
| Total | 21374s | 225.5s | 94.8x |
vs 5070 Ti's 63.8x: +48.6%.
Batching is doing real work here. The pre-batched code path on this same hardware (first benchmark run earlier today) was 14.8x — batching gave a 6.4× speedup on the 4090.
Phase 2 — Diarization (with co-host profile applied)
| Episode | Audio | Wall | RTF | Turns | HOST | CALLER |
|---|---|---|---|---|---|---|
| 2011-03-12-hr1 | 2509s | 9.1s | 275.0x | 25 | 2455s | 70s |
| 2012-03-10-hr1 | 2634s | 7.6s | 348.3x | 22 | 2615s | 90s |
| 2012-06-09-hr1 | 2648s | 7.7s | 343.1x | 13 | 2500s | 10s |
| 2014-s6e19 | 2914s | 8.3s | 352.6x | 31 | 2625s | 30s |
| 2016-s8e43 | 5326s | 15.1s | 353.6x | 134 | 4615s | 140s |
| 2017-s9e30 | 5343s | 15.5s | 345.1x | 69 | 4945s | 350s |
| Total | 21374s | 63.2s | 338.1x | 294 | 19755s | 690s |
vs 5070 Ti baseline: 209.7x → 338.1x (+61.2%).
Per-episode RTFs cluster tightly at 343-354x for warm episodes (5/6); episode 1 carries the cold-start penalty at 275.0x. Apples-to-apples vs the 5070 Ti measurement which also includes a cold start.
Aggregate CALLER time dropped from 2665s (pre-co-host pipeline, run earlier today) to 690s. That ~2000s delta is the second-voice signal correctly being routed away from the CALLER bucket. The benchmark table only sums HOST + CALLER, so CO-HOST seconds aren't shown in the totals — present in the per-episode diarization.json files.
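To see CO-HOST seconds alongside HOST and CALLER, the per-label totals can be summed directly from the turns. A minimal sketch: the turn schema here (`label`, `start`, `end`) is an assumption for illustration, and the real `diarization.json` layout may differ.

```python
from collections import defaultdict


def seconds_by_label(turns):
    """Sum turn durations (end - start) per speaker label."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn["label"]] += turn["end"] - turn["start"]
    return dict(totals)


# Hypothetical turns, not taken from any real diarization.json.
example_turns = [
    {"label": "HOST", "start": 0.0, "end": 120.0},
    {"label": "CO-HOST", "start": 120.0, "end": 150.0},
    {"label": "CALLER", "start": 150.0, "end": 160.0},
    {"label": "HOST", "start": 160.0, "end": 200.0},
]

example_totals = seconds_by_label(example_turns)
print(example_totals)
```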
Phase 3 — Q&A Extraction (post-overhaul: turn-based lookback, 4s CALLER preference, expanded promo signatures)
| Episode | 4090 Q&A pairs | 5070 Ti reference | Note |
|---|---|---|---|
| 2011-03-12-hr1 | 1 | 3 | -2 |
| 2012-03-10-hr1 | 2 | 1 | +1 |
| 2012-06-09-hr1 | 0 | 1 | -1 |
| 2014-s6e19 | 0 | 0 | match (gaming, no callers) |
| 2016-s8e43 | 2 | 2 | match (WiFi caller) |
| 2017-s9e30 | 4 | 3 | +1 |
| Total | 9 | 10 | -1 |
Differences are within noise. Likely sources:
- Whisper batched inference produces slightly different segment boundaries on identical audio under different GPU schedule orderings.
- Sliding-window diarization midpoint resolution can put a borderline segment in either bucket on different runs.
- Q&A extraction thresholds are sensitive to small boundary shifts.
The two structural correctness signals match: 2014 = 0 (no callers in gaming special) and 2016 = 2 (real WiFi caller, two-turn). That's the meaningful test. Aggregate ±1 across six episodes is acceptable run-to-run drift.
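The per-episode and net differences in the table above can be recomputed as follows. A minimal sketch using the counts from this log; the variable names are illustrative:

```python
# Q&A pair counts per episode, copied from the Phase 3 table.
pairs_4090 = {
    "2011-03-12-hr1": 1,
    "2012-03-10-hr1": 2,
    "2012-06-09-hr1": 0,
    "2014-s6e19": 0,
    "2016-s8e43": 2,
    "2017-s9e30": 4,
}
pairs_5070ti = {
    "2011-03-12-hr1": 3,
    "2012-03-10-hr1": 1,
    "2012-06-09-hr1": 1,
    "2014-s6e19": 0,
    "2016-s8e43": 2,
    "2017-s9e30": 3,
}

# Signed difference per episode (4090 minus 5070 Ti).
deltas = {ep: pairs_4090[ep] - pairs_5070ti[ep] for ep in pairs_4090}
net_delta = sum(deltas.values())
matching = sorted(ep for ep, d in deltas.items() if d == 0)

print(deltas)
print("net:", net_delta, "matching episodes:", matching)
```

Reporting the signed per-episode values next to the net keeps offsetting changes visible, since a small net can come from larger movements in opposite directions.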
Files written / modified
- `test-data/transcripts/<stem>/transcript.json` (6, regenerated with batched Whisper)
- `test-data/transcripts/<stem>/diarization.json` (6, regenerated with co-host-aware diarizer)
- `benchmark.py` line 27: `BASELINE_RTF` updated 149.5 → 209.7
- `BENCH_SETUP.md`: added ffmpeg prereq to Step 2
- `.claude/memory/radio_show_no_cohost_named_tom.md` (new, project memory)
- `.claude/memory/MEMORY.md` (index updated)
archive.db is not on this machine — index update happens on DESKTOP-0O8A1RL.
Tara distribution across the test set (post-rename diarization)
After the rename, the diarizer's per-episode speaker_map shows Tara in all 6 test episodes — well beyond the 2014+2016 the 5070 Ti session log claimed.
| Episode | Tara (seconds) | % of audio | Read |
|---|---|---|---|
| 2011-03-12-hr1 | 140s (2:20) | 5.6% | likely false positive — Mike confirms 2011 was pure call-in |
| 2012-03-10-hr1 | 30s (0:30) | 1.1% | likely false positive — 2012 was pure call-in |
| 2012-06-09-hr1 | 340s (5:40) | 12.8% | suspicious — too much for noise; awaiting Mike confirm |
| 2014-s6e19 | 680s (11:20) | 23.3% | confirmed (Mike) |
| 2016-s8e43 | 1890s (31:30) | 35.5% | confirmed (Mike) |
| 2017-s9e30 | 610s (10:10) | 11.4% | plausible — pending Mike confirm; 5070 Ti log only listed Tara in 2014+2016 |
Mike's broader correction (2026-04-27):
- Co-hosts rotated through over the years. Confirmed: Tara, Randall (early years), Rob (early years + occasional 2018/2019).
- Producers / board ops would sometimes go on-air. Named so far: Andrew, Shannon, Ken, plus "a couple more" Mike doesn't recall off-hand.
Of all these, only Tara has a voice profile. Every other co-host AND every producer-on-air moment in the archive is currently being labeled CALLER, which inflates Q&A false positives in those eras and episodes.
The small Tara percentages in 2011/2012 (1-13%) most likely reflect the 0.85 cosine threshold hitting on a similar-sounding speaker that isn't actually Tara — could be a producer (Andrew/Shannon/Ken/etc) or another early-years voice we haven't catalogued. Worth Mike sampling these short windows to identify before assuming false positive vs producer.
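The threshold behaviour described above can be illustrated with a toy matcher: any embedding whose cosine similarity to a stored profile clears the threshold takes that profile's name, whether or not it is the same person. This is a sketch under stated assumptions, not the diarizer's actual code. Plain lists stand in for the `.npy` embeddings, and `label_for` and its `CALLER` fallback are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_for(embedding, profiles, threshold=0.85, fallback="CALLER"):
    """Return the best-matching profile name at or above threshold, else fallback."""
    best_name, best_score = fallback, threshold
    for name, reference in profiles.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy 2-D vectors; real speaker embeddings have many more dimensions.
profiles = {"Tara": [1.0, 0.0]}
similar_voice = [0.9, 0.1]   # close to the profile, clears the threshold
distinct_voice = [0.5, 0.5]  # well below the threshold

print(label_for(similar_voice, profiles))
print(label_for(distinct_voice, profiles))
```

With only one profile loaded there is no competing candidate, so a close-enough voice has nowhere else to go. Adding profiles for the other co-hosts and producers gives the matcher alternatives to choose from.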
Implication for full-archive runs: before processing the 579-episode archive in earnest, build profiles for at least Randall, Rob, and the named producers. Otherwise the Q&A extraction across early-years and 2018/2019 episodes will inherit the same false-positive pattern that originally produced 12 bogus pairs in 2016-s8e43.
Pending work (from 5070 Ti session, still unblocked)
- Resolve "Tom" identity: Mike to confirm who the second voice is in 2014-s6e19 and 2016-s8e43. Then rename `voice-profiles/tom/`, update `profiles.json`, fix labels in code. Until then, voice-profile data is correct but mislabeled. (Resolved this session: the voice is Tara and the rename is done; see "Co-host identity correction" above.)
- Full archive download: 579 MP3s from IX server (~30-40GB). 4090 + Tailscale ready.
- Full pipeline run on archive: at 338x diarization + 95x transcription, total wall time for ~30h of audio extrapolates to roughly 5 minutes diarization + 19 minutes transcription. Disk I/O may dominate.
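The extrapolation in the last item is audio duration divided by realtime factor. A minimal sketch that takes the ~30h figure and the measured RTFs from this log at face value; `wall_minutes` is an illustrative name:

```python
def wall_minutes(audio_hours: float, rtf: float) -> float:
    """Estimated wall-clock minutes to process audio_hours at a given realtime factor."""
    audio_seconds = audio_hours * 3600.0
    return audio_seconds / rtf / 60.0


# Measured 4090 realtime factors from this session.
diarization_minutes = wall_minutes(30, 338.1)
transcription_minutes = wall_minutes(30, 94.8)

print(f"diarization: {diarization_minutes:.1f} min")
print(f"transcription: {transcription_minutes:.1f} min")
```

The estimate covers GPU compute only; it does not model disk I/O, model load, or per-episode cold starts.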
Note for Mike
- "Tom" is wrong: see callout above. Tell me who that is and I'll do the rename in one pass (directory, `profiles.json`, `build_cohost_profile.py`, the 5070 Ti session log, and a fresh diarization pass to update `speaker_map`). (Done this session: Mike identified the voice as Tara.)
- BENCH_SETUP.md got a one-paragraph ffmpeg prereq added at the top of Step 2.