claudetools/projects/radio-show/audio-processor/session-logs/2026-04-27-4090-benchmark-and-test-set.md
Mike Swanson a4f527f31e radio: per-year test set (one episode per year, 2010-2018)
Added 2010, 2015, 2018 test episodes to round out the test set to one
per available year:
- 2010-05-08-hr1 (May 2010, earliest available; pre-Tara era)
- 2015-s7e19 (Jan 2015, avoids training's s7e30)
- 2018-s10e18 (only 3 non-training 2018 episodes exist)

Archive has no 2019 directory — Rob's "2018/2019 appearances" are
constrained to the 5 available 2018 episodes only.

Per-year diarization summary (Tara presence, post-rename):
  2010-05-08    30s   1.2%   likely false positive (pre-Tara)
  2011-03-12   140s   5.6%   likely false positive (call-in only)
  2012-03-10    30s   1.1%   likely false positive (call-in only)
  2012-06-09   340s  12.8%   suspicious — Mike to confirm
  2014-s6e19   680s  23.3%   confirmed
  2015-s7e19   280s   9.9%   plausible — Mike to confirm
  2016-s8e43  1890s  35.5%   confirmed
  2017-s9e30   610s  11.4%   plausible
  2018-s10e18  880s  17.1%   COULD BE ROB — Mike flagged Rob for
                              2018/2019 appearances; cosine threshold may
                              be hitting on Rob being acoustically similar
                              to Tara

Total Tara across 9 episodes: 1h 21m / 8h 52m audio (15.3%).

Q&A counts (still suspect — every voice that isn't Mike-or-Tara is
labeled CALLER, so Randall/Rob/producers inflate the bucket):
  2010=4, 2011=1, 2012a=2, 2012b=0, 2014=0, 2015=1, 2016=2, 2017=4, 2018=3
  Total: 17 pairs across 9 episodes

4090 perf on the expanded set:
- Diarization: 31928s in 121.5s = 262.7x realtime (vs 209.7x on 5070 Ti, +25.3%)
- Transcription (3 new episodes only): 10554s in 112.4s = 93.9x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:20:09 -07:00

# Session Log — 2026-04-27 (continuation)
**Project:** The Computer Guru Show — Archive Mining System
**Goal:** RTX 4090 perf comparison + run unseen test episodes through full pipeline (transcribe / diarize / Q&A)
**Machine:** GURU-BEAST-ROG (RTX 4090, 24GB)
**User:** Mike Swanson (mike)
Companion to:
- `2026-04-27-diarization-pipeline.md` (DESKTOP-0O8A1RL, RTX 5070 Ti — initial diarization fixes)
- `2026-04-27-qa-extraction-cohost-indexing.md` (DESKTOP-0O8A1RL — co-host profile, batched Whisper, Q&A overhaul)
This run uses the post-overhaul code (commit `e9ac607`): batched Whisper transcription, co-host-aware diarizer, revised Q&A extractor.
---
## Headline
| Metric | 5070 Ti baseline | RTX 4090 | Delta |
|---|---|---|---|
| Diarization | 209.7x realtime | **338.1x** | +128.4x (+61.2%) |
| Transcription (batched, large-v3 int8_float16) | 63.8x | **94.8x** | +31.0x (+48.6%) |
| Q&A pairs (6 test episodes) | 10 | 9 | within noise |
21,374s of audio (5h 56m) end-to-end on the 4090: **225.5s transcription + 63.2s diarization + Q&A extraction**.
---
## Co-host identity correction — Tara, not Tom
The 5070 Ti session fabricated a co-host named "Tom" — Mike confirmed there is no such person on the show. After listening to the source windows, Mike identified the voice in both 2014-s6e19 and 2016-s8e43 as **Tara** (a real co-host; the show has had multiple over the years).
Rename swept this session:
- `voice-profiles/tom/` → `voice-profiles/tara/` (git mv, all 44 embeddings + composite preserved)
- `voice-profiles/profiles.json`: `"Tom"` key → `"Tara"`
- `build_cohost_profile.py`: docstring, `TOM_WINDOWS` → `TARA_WINDOWS`, `COHOST_NAME = "Tara"`, console output strings
- `projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md`: correction header added, all body references updated
- `.claude/memory/radio_show_no_cohost_named_tom.md`: resolution recorded
- Diarization re-run post-rename so `speaker_map` in each `diarization.json` emits `Cohost: Tara`
The 5070 Ti session log's claim of "Tom was the regular co-host roughly 2013-2016" carried two errors: the wrong name AND an unverified tenure window. The corrected log notes Tara appears in 2014-s6e19 and 2016-s8e43 only — generalizing to the full 2013-2016 era hasn't been confirmed.
---
## Setup notes (for next machine)
- ffmpeg/ffprobe is required on PATH — the voice profiler shells out to ffprobe for audio duration and the pipeline crashes on the first diarize call without it. Was missing on this machine; installed via `winget install Gyan.FFmpeg`. BENCH_SETUP.md updated to call this out as a Step-2 prereq.
- `.gitignore` (added in `e9ac607`) excludes `episodes/`, `transcripts/`, `*.db`, `.venv`. The test MP3s + transcripts I committed earlier in `2c06e72` are still tracked from before the gitignore arrived; can be `git rm --cached`-ed in a follow-up cleanup.
- All voice profiles, training data, and test MP3s were already on this machine via prior auto-sync.
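A preflight check along these lines would surface the missing-ffmpeg problem before the first diarize call. This is a minimal sketch, not code from the repo: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the only library call used is the standard-library `shutil.which`.

```python
import shutil

# Executables the setup note says must be on PATH
REQUIRED_TOOLS = ("ffmpeg", "ffprobe")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of required executables that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print(f"Missing on PATH: {', '.join(missing)} (install ffmpeg first)")
else:
    print("ffmpeg and ffprobe found on PATH")
```

Running a check like this at the top of the benchmark script would replace a mid-pipeline crash with a clear message.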
---
## Phase 1 — Whisper Transcription (large-v3, batched, int8_float16, batch_size=16)
| Episode | Audio | Wall | RTF |
|---|---|---|---|
| 2011-03-12-hr1 | 2509s | 29.7s | 84.6x |
| 2012-03-10-hr1 | 2634s | 30.3s | 87.0x |
| 2012-06-09-hr1 | 2648s | 33.6s | 78.8x |
| 2014-s6e19 | 2914s | 30.2s | 96.6x |
| 2016-s8e43 | 5326s | 49.2s | 108.2x |
| 2017-s9e30 | 5343s | 52.5s | 101.8x |
| **Total** | **21374s** | **225.5s** | **94.8x** |
vs 5070 Ti's 63.8x: **+48.6%**.
Batching is doing real work here. The pre-batched code path on this same hardware (first benchmark run earlier today) was 14.8x — batching gave a 6.4× speedup on the 4090.
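For reference, the aggregate figure in the table is total audio divided by total wall time, not the mean of the per-episode ratios. A short sketch of that arithmetic using the table's numbers (the function names are illustrative and are not taken from `benchmark.py`):

```python
# (audio seconds, wall seconds) per episode, copied from the Phase 1 table
PHASE1 = {
    "2011-03-12-hr1": (2509, 29.7),
    "2012-03-10-hr1": (2634, 30.3),
    "2012-06-09-hr1": (2648, 33.6),
    "2014-s6e19": (2914, 30.2),
    "2016-s8e43": (5326, 49.2),
    "2017-s9e30": (5343, 52.5),
}


def realtime_speed(audio_s, wall_s):
    """Seconds of audio processed per second of wall time."""
    return audio_s / wall_s


def percent_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


total_audio = sum(audio for audio, _ in PHASE1.values())
total_wall = sum(wall for _, wall in PHASE1.values())
aggregate = realtime_speed(total_audio, total_wall)

print(f"aggregate: {aggregate:.1f}x realtime")
print(f"vs 5070 Ti (63.8x): {percent_gain(aggregate, 63.8):+.1f}%")
print(f"vs pre-batched (14.8x): {aggregate / 14.8:.1f}x speedup")
```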
---
## Phase 2 — Diarization (with co-host profile applied)
| Episode | Audio | Wall | RTF | Turns | HOST | CALLER |
|---|---|---|---|---|---|---|
| 2011-03-12-hr1 | 2509s | 9.1s | 275.0x | 25 | 2455s | 70s |
| 2012-03-10-hr1 | 2634s | 7.6s | 348.3x | 22 | 2615s | 90s |
| 2012-06-09-hr1 | 2648s | 7.7s | 343.1x | 13 | 2500s | 10s |
| 2014-s6e19 | 2914s | 8.3s | 352.6x | 31 | 2625s | 30s |
| 2016-s8e43 | 5326s | 15.1s | 353.6x | 134 | 4615s | 140s |
| 2017-s9e30 | 5343s | 15.5s | 345.1x | 69 | 4945s | 350s |
| **Total** | **21374s** | **63.2s** | **338.1x** | 294 | 19755s | 690s |
**vs 5070 Ti baseline: 209.7x → 338.1x (+61.2%).**
Per-episode RTFs cluster tightly at 343-354x for the five warm episodes; the first episode (2011-03-12-hr1) carries the cold-start penalty at 275.0x. The comparison is still apples-to-apples, because the 5070 Ti measurement also includes a cold start.
Aggregate CALLER time dropped from 2665s (pre-co-host pipeline, run earlier today) to 690s. That 1975s delta is the second-voice signal being routed away from the CALLER bucket. The benchmark table only sums HOST + CALLER, so CO-HOST seconds aren't shown in the totals; they are recorded in the per-episode `diarization.json` files.
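To see how much the cold start costs, the aggregate can be recomputed with and without the first episode. A sketch using the Phase 2 table's figures; note the per-episode wall times are rounded to 0.1s, so they sum to 63.3s here rather than the logged 63.2s. `aggregate_speed` is an illustrative helper, not a function from the repo.

```python
# (audio seconds, wall seconds) in run order, from the Phase 2 table
RUNS = [
    (2509, 9.1),   # first episode, carries the cold-start penalty
    (2634, 7.6),
    (2648, 7.7),
    (2914, 8.3),
    (5326, 15.1),
    (5343, 15.5),
]


def aggregate_speed(runs):
    """Total audio seconds divided by total wall seconds."""
    audio = sum(a for a, _ in runs)
    wall = sum(w for _, w in runs)
    return audio / wall


with_cold = aggregate_speed(RUNS)
warm_only = aggregate_speed(RUNS[1:])
print(f"all six episodes: {with_cold:.0f}x")
print(f"warm five only:   {warm_only:.0f}x")
```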
---
## Phase 3 — Q&A Extraction (post-overhaul: turn-based lookback, 4s CALLER preference, expanded promo signatures)
| Episode | 4090 Q&A pairs | 5070 Ti reference | Note |
|---|---|---|---|
| 2011-03-12-hr1 | 1 | 3 | -2 |
| 2012-03-10-hr1 | 2 | 1 | +1 |
| 2012-06-09-hr1 | 0 | 1 | -1 |
| 2014-s6e19 | 0 | 0 | match (gaming, no callers) |
| 2016-s8e43 | 2 | 2 | match (WiFi caller) |
| 2017-s9e30 | 4 | 3 | +1 |
| **Total** | **9** | **10** | **-1** |
Differences are within noise. Likely sources:
- Whisper batched inference produces slightly different segment boundaries on identical audio when GPU scheduling order differs between runs.
- Sliding-window diarization midpoint resolution can put a borderline segment in either bucket on different runs.
- Q&A extraction thresholds are sensitive to small boundary shifts.
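The boundary sensitivity listed above can be illustrated with a toy example. The sketch assumes a transcript segment is assigned to whichever diarization turn contains its midpoint; the log only names "midpoint resolution", so the actual implementation may differ. The turn data and the helpers `label_at` and `label_segment` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical diarization turns: (start_s, end_s, label), sorted, non-overlapping
TURNS = [
    (0.0, 12.0, "HOST"),
    (12.0, 15.5, "CALLER"),
    (15.5, 40.0, "HOST"),
]


def label_at(turns, t):
    """Label of the turn containing time t, or None if t falls outside every turn."""
    starts = [start for start, _, _ in turns]
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < turns[i][1]:
        return turns[i][2]
    return None


def label_segment(turns, seg_start, seg_end):
    """Assign a transcript segment to the speaker active at its midpoint."""
    return label_at(turns, (seg_start + seg_end) / 2.0)


# Same utterance, with the segment start differing by 0.4 s between two runs
print(label_segment(TURNS, 9.0, 14.8))  # midpoint 11.9 s
print(label_segment(TURNS, 9.4, 14.8))  # midpoint 12.1 s
```

In this toy case the 0.4 s shift moves the midpoint across a turn boundary, so the same utterance lands in HOST on one run and CALLER on the other, which is enough to add or remove a Q&A pair.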
**The two structural correctness signals match**: 2014 = 0 (no callers in gaming special) and 2016 = 2 (real WiFi caller, two-turn). That's the meaningful test. Aggregate ±1 across six episodes is acceptable run-to-run drift.
---
## Files written / modified
- `test-data/transcripts/<stem>/transcript.json` (6, regenerated with batched Whisper)
- `test-data/transcripts/<stem>/diarization.json` (6, regenerated with co-host-aware diarizer)
- `benchmark.py` line 27 — `BASELINE_RTF` updated 149.5 → 209.7
- `BENCH_SETUP.md` — added ffmpeg prereq to Step 2
- `.claude/memory/radio_show_no_cohost_named_tom.md` (new, project memory)
- `.claude/memory/MEMORY.md` (index updated)
archive.db is not on this machine — index update happens on DESKTOP-0O8A1RL.
---
## Per-year test set (one episode per year, expanded)
Mike asked to expand from the original 6 to one episode per year. Added:
- 2010: `2010-05-08-hr1.mp3` (May 2010, earliest available; avoids training's Oct 2)
- 2015: `2015-s7e19.mp3` (Jan 2015; avoids training's s7e30)
- 2018: `2018-s10e18.mp3` (only 3 non-training episodes exist for 2018)
Archive has no 2019 directory (years 2010-2018, no 2013 either). Rob's "2018/2019 appearances" are constrained to the 5 available 2018 episodes only.
### Diarization across all 9 episodes
| Year | Episode | Audio (mm:ss) | Tara (mm:ss) | Tara % | HOST | CALLER (suspect) | Q&A pairs |
|---|---|---|---|---|---|---|---|
| 2010 | 05-08-hr1 | 42:57 | 0:30 | 1.2% | 2325s | **355s** | 4 |
| 2011 | 03-12-hr1 | 41:49 | 2:20 | 5.6% | 2455s | 70s | 1 |
| 2012 | 03-10-hr1 | 43:54 | 0:30 | 1.1% | 2615s | 90s | 2 |
| 2012 | 06-09-hr1 | 44:08 | 5:40 | 12.8% | 2500s | 10s | 0 |
| 2014 | s6e19 | 48:34 | 11:20 | 23.3% | 2625s | 30s | 0 |
| 2015 | s7e19 | 47:13 | 4:40 | 9.9% | 2690s | 45s | 1 |
| 2016 | s8e43 | 88:46 | 31:30 | 35.5% | 4615s | 140s | 2 |
| 2017 | s9e30 | 89:03 | 10:10 | 11.4% | 4945s | 350s | 4 |
| 2018 | s10e18 | 85:45 | 14:40 | 17.1% | 4745s | 230s | 3 |
| **Total** | | **8h 52m** | **1h 21m** (15.3%) | | | **1320s** | **17** |
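The totals row can be reproduced from the per-episode Tara column. A small sketch; the 31928s denominator is the diarized audio total reported for the expanded set, and the helper names are illustrative.

```python
# Tara time per episode as m:ss, copied from the table above
TARA = ["0:30", "2:20", "0:30", "5:40", "11:20", "4:40", "31:30", "10:10", "14:40"]
TOTAL_AUDIO_S = 31928  # diarized audio across all 9 episodes


def to_seconds(clock):
    """Convert 'm:ss' (minutes may exceed 59) to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


def to_hms(total_s):
    """Format a whole number of seconds as 'Hh MMm SSs'."""
    hours, rem = divmod(total_s, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


tara_s = sum(to_seconds(c) for c in TARA)
print(tara_s, to_hms(tara_s), f"{tara_s / TOTAL_AUDIO_S:.1%}")
```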
### Read on each row
| Episode | Tara reading |
|---|---|
| 2010-05-08-hr1 | likely false positive (30s); 2010 was pre-Tara; could be Randall or a producer |
| 2011-03-12-hr1 | likely false positive; 2011 was pure call-in per Mike |
| 2012-03-10-hr1 | likely false positive; 2012 was pure call-in per Mike |
| 2012-06-09-hr1 | suspicious (5:40 is too much for noise); pending Mike spot-check |
| 2014-s6e19 | confirmed Tara |
| 2015-s7e19 | substantial (4:40) — plausibly Tara was on early 2015; Mike to confirm |
| 2016-s8e43 | confirmed Tara |
| 2017-s9e30 | plausible Tara (or another co-host); Mike to confirm |
| 2018-s10e18 | **could be Rob, not Tara** — Mike flagged Rob for 2018/2019 appearances. The cosine threshold may be hitting because the two co-hosts have similar acoustic properties. Worth Mike sampling. |
### Q&A counts caveat
The Q&A column is still suspect because **every voice that isn't Mike-or-Tara is labeled CALLER**, including Randall, Rob, and any on-air producer (Andrew/Shannon/Ken/etc). The 2010 episode in particular shows 355s CALLER and 4 Q&A — but per Mike's roster, that CALLER bucket likely includes a co-host or producer, not real callers. Spot-check before treating early-years Q&A as ground truth.
**Mike's broader correction (2026-04-27):**
- **Co-hosts** rotated through over the years. Confirmed: Tara, Randall (early years), Rob (early years + occasional 2018/2019).
- **Producers / board ops** would sometimes go on-air. Named so far: Andrew, Shannon, Ken, plus "a couple more" Mike doesn't recall off-hand.
Of all these, only Tara has a voice profile. Every other co-host AND every producer-on-air moment in the archive is currently being labeled CALLER, which inflates Q&A false positives in those eras and episodes.
The small Tara percentages in 2011/2012 (1-13%) most likely reflect the 0.85 cosine threshold hitting on a similar-sounding speaker that isn't actually Tara — could be a producer (Andrew/Shannon/Ken/etc) or another early-years voice we haven't catalogued. Worth Mike sampling these short windows to identify before assuming false positive vs producer.
**Implication for full-archive runs:** before processing the 579-episode archive in earnest, build profiles for at least Randall, Rob, and the named producers. Otherwise the Q&A extraction across early-years and 2018/2019 episodes will inherit the same false-positive pattern that originally produced 12 bogus pairs in 2016-s8e43.
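Why extra profiles help can be shown with a toy matcher. This is a sketch under stated assumptions, not the diarizer's code: it assumes each turn's embedding is compared to every profile's composite vector by cosine similarity, that the best score at or above 0.85 wins, and that anything else falls through to CALLER. The 3-d vectors are made-up stand-ins for real speaker embeddings, and `label_speaker` is a hypothetical name.

```python
import math

THRESHOLD = 0.85  # cosine threshold quoted in the log


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_speaker(embedding, profiles, threshold=THRESHOLD):
    """Best-matching profile name, or 'CALLER' if no profile clears the threshold."""
    best_name, best_score = "CALLER", threshold
    for name, composite in profiles.items():
        score = cosine(embedding, composite)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy vectors (illustrative values only)
tara = [1.0, 0.0, 0.0]
rob = [0.9, 0.4, 0.0]       # deliberately close to `tara`
voice = [0.92, 0.35, 0.0]   # a turn that is really "rob"

print(label_speaker(voice, {"Tara": tara}))
print(label_speaker(voice, {"Tara": tara, "Rob": rob}))
```

With only a Tara profile, the similar-sounding voice clears the threshold and is labeled Tara; once a Rob profile exists, it wins the comparison. A voice that matches no profile still falls back to CALLER, which is the bucket the Q&A extractor reads from.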
---
## Pending work (from 5070 Ti session, still unblocked)
1. **Resolve "Tom" identity** (resolved this session): Mike identified the second voice in 2014-s6e19 and 2016-s8e43 as Tara. `voice-profiles/tom/` was renamed, `profiles.json` updated, and labels fixed in code; see "Co-host identity correction" above.
2. **Full archive download** — 579 MP3s from IX server (~30-40GB). 4090 + Tailscale ready.
3. **Full pipeline run on archive**: at 338x diarization + 95x transcription, total wall time for ~30h of audio extrapolates to roughly 5 minutes diarization + 19 minutes transcription. Disk I/O may dominate.
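The extrapolation in item 3 is audio duration divided by the realtime multiple. A sketch using the measured speeds from this session (the 30h figure is the log's own estimate; `wall_minutes` is an illustrative helper):

```python
def wall_minutes(audio_hours, speed_x):
    """Estimated wall time in minutes for `audio_hours` of audio at `speed_x` realtime."""
    return audio_hours * 60.0 / speed_x


AUDIO_HOURS = 30  # archive size estimate quoted in item 3
for stage, speed in (("diarization", 338.1), ("transcription", 94.8)):
    print(f"{stage}: {wall_minutes(AUDIO_HOURS, speed):.1f} min")
```

Swapping in the real archive duration once the download completes gives a firmer estimate, before accounting for disk I/O.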
---
## Note for Mike
- "Tom" was wrong; see the co-host identity correction above. The voice is Tara, and the rename was done in one pass (directory, profiles.json, build_cohost_profile.py, the 5070 Ti session log, and a fresh diarization pass to update `speaker_map`).
- BENCH_SETUP.md got a one-paragraph ffmpeg prereq added at the top of Step 2.