claudetools/projects/radio-show/audio-processor/session-logs/2026-04-27-4090-benchmark-and-test-set.md
Mike Swanson a4f527f31e radio: per-year test set (one episode per year, 2010-2018)
Added 2010, 2015, 2018 test episodes to round out the test set to one
per available year:
- 2010-05-08-hr1 (May 2010, earliest available; pre-Tara era)
- 2015-s7e19 (Jan 2015, avoids training's s7e30)
- 2018-s10e18 (only 3 non-training 2018 episodes exist)

Archive has no 2019 directory — Rob's "2018/2019 appearances" are
constrained to the 5 available 2018 episodes only.

Per-year diarization summary (Tara presence, post-rename):
  2010-05-08    30s   1.2%   likely false positive (pre-Tara)
  2011-03-12   140s   5.6%   likely false positive (call-in only)
  2012-03-10    30s   1.1%   likely false positive (call-in only)
  2012-06-09   340s  12.8%   suspicious — Mike to confirm
  2014-s6e19   680s  23.3%   confirmed
  2015-s7e19   280s   9.9%   plausible — Mike to confirm
  2016-s8e43  1890s  35.5%   confirmed
  2017-s9e30   610s  11.4%   plausible
  2018-s10e18  880s  17.1%   COULD BE ROB — Mike flagged Rob for
                              2018/2019 appearances; cosine threshold may
                              be hitting on Rob being acoustically similar
                              to Tara

Total Tara across 9 episodes: 1h 21m / 8h 52m audio (15.3%).

Q&A counts (still suspect — every voice that isn't Mike-or-Tara is
labeled CALLER, so Randall/Rob/producers inflate the bucket):
  2010=4, 2011=1, 2012a=2, 2012b=0, 2014=0, 2015=1, 2016=2, 2017=4, 2018=3
  Total: 17 pairs across 9 episodes

4090 perf on the expanded set:
- Diarization: 31928s in 121.5s = 262.7x realtime (vs 209.7x on 5070 Ti, +25.3%)
- Transcription (3 new episodes only): 10554s in 112.4s = 93.9x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:20:09 -07:00


Session Log — 2026-04-27 (continuation)

Project: The Computer Guru Show — Archive Mining System
Goal: RTX 4090 perf comparison + run unseen test episodes through full pipeline (transcribe / diarize / Q&A)
Machine: GURU-BEAST-ROG (RTX 4090, 24GB)
User: Mike Swanson (mike)

Companion to:

  • 2026-04-27-diarization-pipeline.md (DESKTOP-0O8A1RL, RTX 5070 Ti — initial diarization fixes)
  • 2026-04-27-qa-extraction-cohost-indexing.md (DESKTOP-0O8A1RL — co-host profile, batched Whisper, Q&A overhaul)

This run uses the post-overhaul code (commit e9ac607): batched Whisper transcription, co-host-aware diarizer, revised Q&A extractor.


Headline

| Metric | 5070 Ti baseline | RTX 4090 | Delta |
| --- | --- | --- | --- |
| Diarization | 209.7x realtime | 338.1x | +128.4x (+61.2%) |
| Transcription (batched, large-v3 int8_float16) | 63.8x | 94.8x | +31.0x (+48.6%) |
| Q&A pairs (6 test episodes) | 10 | 9 | within noise |

21,374s of audio (5h 56m) end-to-end on the 4090: 225.5s transcription + 63.2s diarization + Q&A extraction.


Co-host identity correction — Tara, not Tom

The 5070 Ti session fabricated a co-host named "Tom" — Mike confirmed there is no such person on the show. After listening to the source windows, Mike identified the voice in both 2014-s6e19 and 2016-s8e43 as Tara (a real co-host; the show has had multiple over the years).

Rename swept this session:

  • voice-profiles/tom/ → voice-profiles/tara/ (git mv, all 44 embeddings + composite preserved)
  • voice-profiles/profiles.json: "Tom" key → "Tara"
  • build_cohost_profile.py: docstring, TOM_WINDOWS → TARA_WINDOWS, COHOST_NAME = "Tara", console output strings
  • projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md: correction header added, all body references updated
  • .claude/memory/radio_show_no_cohost_named_tom.md: resolution recorded
  • Diarization re-run post-rename so speaker_map in each diarization.json emits Cohost: Tara

The 5070 Ti session log's claim of "Tom was the regular co-host roughly 2013-2016" carried two errors: the wrong name AND an unverified tenure window. The corrected log notes Tara appears in 2014-s6e19 and 2016-s8e43 only — generalizing to the full 2013-2016 era hasn't been confirmed.


Setup notes (for next machine)

  • ffmpeg/ffprobe is required on PATH — the voice profiler shells out to ffprobe for audio duration and the pipeline crashes on the first diarize call without it. Was missing on this machine; installed via winget install Gyan.FFmpeg. BENCH_SETUP.md updated to call this out as a Step-2 prereq.
  • .gitignore (added in e9ac607) excludes episodes/, transcripts/, *.db, .venv. The test MP3s + transcripts I committed earlier in 2c06e72 are still tracked because they were committed before the .gitignore arrived; they can be untracked with git rm --cached in a follow-up cleanup.
  • All voice profiles, training data, and test MP3s were already on this machine via prior auto-sync.
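
A preflight check along these lines would surface the missing dependency before the first diarize call. This is a minimal sketch, not code from the pipeline: `missing_tools` and `require_tools` are hypothetical helpers, and the only assumption is that `ffmpeg` and `ffprobe` must resolve on PATH.

```python
import shutil


def missing_tools(names=("ffmpeg", "ffprobe")):
    """Return the tool names that cannot be resolved on PATH."""
    return [name for name in names if shutil.which(name) is None]


def require_tools(names=("ffmpeg", "ffprobe")):
    """Raise early, with an actionable message, if any tool is missing."""
    missing = missing_tools(names)
    if missing:
        raise RuntimeError(
            "Missing on PATH: " + ", ".join(missing)
            + " (on Windows: winget install Gyan.FFmpeg)"
        )
```

Calling `require_tools()` at the top of the benchmark script would turn the mid-run crash into an immediate, readable error.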

Phase 1 — Whisper Transcription (large-v3, batched, int8_float16, batch_size=16)

| Episode | Audio | Wall | RTF |
| --- | --- | --- | --- |
| 2011-03-12-hr1 | 2509s | 29.7s | 84.6x |
| 2012-03-10-hr1 | 2634s | 30.3s | 87.0x |
| 2012-06-09-hr1 | 2648s | 33.6s | 78.8x |
| 2014-s6e19 | 2914s | 30.2s | 96.6x |
| 2016-s8e43 | 5326s | 49.2s | 108.2x |
| 2017-s9e30 | 5343s | 52.5s | 101.8x |
| Total | 21374s | 225.5s | 94.8x |

vs 5070 Ti's 63.8x: +48.6%.

Batching is doing real work here. The pre-batched code path on this same hardware (first benchmark run earlier today) was 14.8x — batching gave a 6.4× speedup on the 4090.
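
The realtime factor (RTF) here is audio duration divided by wall-clock time, and the aggregate is total audio over total wall time rather than the mean of the per-episode values. A small sketch of that arithmetic using the Phase 1 numbers; the function names are illustrative and not taken from benchmark.py.

```python
def rtf(audio_s, wall_s):
    """Realtime factor: seconds of audio processed per second of wall time."""
    return audio_s / wall_s


def pct_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# (audio seconds, wall seconds) per episode, from the Phase 1 table.
phase1 = [
    (2509, 29.7), (2634, 30.3), (2648, 33.6),
    (2914, 30.2), (5326, 49.2), (5343, 52.5),
]

total_audio = sum(audio for audio, _ in phase1)
total_wall = sum(wall for _, wall in phase1)
aggregate = rtf(total_audio, total_wall)

print(f"aggregate RTF: {aggregate:.1f}x")                       # 94.8x
print(f"vs 5070 Ti (63.8x): {pct_gain(aggregate, 63.8):+.1f}%")  # +48.6%
print(f"vs pre-batched (14.8x): {aggregate / 14.8:.1f}x faster")  # 6.4x
```

Wall times in the table are rounded to 0.1s, so per-episode RTFs recomputed this way can differ from the logged values in the last digit.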


Phase 2 — Diarization (with co-host profile applied)

| Episode | Audio | Wall | RTF | Turns | HOST | CALLER |
| --- | --- | --- | --- | --- | --- | --- |
| 2011-03-12-hr1 | 2509s | 9.1s | 275.0x | 25 | 2455s | 70s |
| 2012-03-10-hr1 | 2634s | 7.6s | 348.3x | 22 | 2615s | 90s |
| 2012-06-09-hr1 | 2648s | 7.7s | 343.1x | 13 | 2500s | 10s |
| 2014-s6e19 | 2914s | 8.3s | 352.6x | 31 | 2625s | 30s |
| 2016-s8e43 | 5326s | 15.1s | 353.6x | 134 | 4615s | 140s |
| 2017-s9e30 | 5343s | 15.5s | 345.1x | 69 | 4945s | 350s |
| Total | 21374s | 63.2s | 338.1x | 294 | 19755s | 690s |

vs 5070 Ti baseline: 209.7x → 338.1x (+61.2%).

Per-episode RTFs cluster tightly at 343-354x for the five warm episodes; the first episode carries the cold-start penalty at 275.0x. The comparison is still apples-to-apples, because the 5070 Ti measurement also includes a cold start.

Aggregate CALLER time dropped from 2665s (pre-co-host pipeline, run earlier today) to 690s. That ~2000s delta is the second-voice signal correctly being routed away from the CALLER bucket. The benchmark table only sums HOST + CALLER, so CO-HOST seconds aren't shown in the totals — present in the per-episode diarization.json files.
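
The co-host seconds can be recovered by summing turn durations per label. The sketch below shows the idea under loud assumptions: the real diarization.json schema is not shown in this log, so the `speaker`, `start`, and `end` field names and the `COHOST` label are hypothetical, and the turns are synthetic rather than taken from any episode.

```python
from collections import defaultdict


def seconds_by_label(turns):
    """Sum turn durations per speaker label (assumed schema: speaker/start/end)."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn["speaker"]] += turn["end"] - turn["start"]
    return dict(totals)


# Synthetic turns for illustration only; not real episode data.
turns = [
    {"speaker": "HOST", "start": 0.0, "end": 40.0},
    {"speaker": "COHOST", "start": 40.0, "end": 55.0},
    {"speaker": "HOST", "start": 55.0, "end": 90.0},
    {"speaker": "CALLER", "start": 90.0, "end": 100.0},
]

totals = seconds_by_label(turns)
# A table that sums only HOST + CALLER would omit the COHOST seconds.
host_plus_caller = totals["HOST"] + totals["CALLER"]
print(totals, host_plus_caller)
```

Adding a per-label column for every label found in the file, instead of a fixed HOST + CALLER pair, would keep the benchmark totals from hiding a bucket.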


Phase 3 — Q&A Extraction (post-overhaul: turn-based lookback, 4s CALLER preference, expanded promo signatures)

| Episode | 4090 Q&A pairs | 5070 Ti reference | Note |
| --- | --- | --- | --- |
| 2011-03-12-hr1 | 1 | 3 | -2 |
| 2012-03-10-hr1 | 2 | 1 | +1 |
| 2012-06-09-hr1 | 0 | 1 | -1 |
| 2014-s6e19 | 0 | 0 | match (gaming, no callers) |
| 2016-s8e43 | 2 | 2 | match (WiFi caller) |
| 2017-s9e30 | 4 | 3 | +1 |
| Total | 9 | 10 | -1 |

Differences are within noise. Likely sources:

  • Whisper batched inference produces slightly different segment boundaries on identical audio under different GPU schedule orderings.
  • Sliding-window diarization midpoint resolution can put a borderline segment in either bucket on different runs.
  • Q&A extraction thresholds are sensitive to small boundary shifts.

The two structural correctness signals match: 2014 = 0 (no callers in gaming special) and 2016 = 2 (real WiFi caller, two-turn). That's the meaningful test. Aggregate ±1 across six episodes is acceptable run-to-run drift.
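
For future run-to-run comparisons it may help to look at per-episode deltas as well as the net total, since offsetting shifts cancel in the aggregate. A short sketch using the counts from the table above; the variable names are illustrative.

```python
# Q&A pair counts per episode, from the Phase 3 table.
qa_4090 = {
    "2011-03-12-hr1": 1, "2012-03-10-hr1": 2, "2012-06-09-hr1": 0,
    "2014-s6e19": 0, "2016-s8e43": 2, "2017-s9e30": 4,
}
qa_5070ti = {
    "2011-03-12-hr1": 3, "2012-03-10-hr1": 1, "2012-06-09-hr1": 1,
    "2014-s6e19": 0, "2016-s8e43": 2, "2017-s9e30": 3,
}

deltas = {ep: qa_4090[ep] - qa_5070ti[ep] for ep in qa_4090}
net = sum(deltas.values())                        # signed total
gross = sum(abs(d) for d in deltas.values())      # total movement
matches = [ep for ep, d in deltas.items() if d == 0]

print(f"net {net:+d}, gross {gross}, matching episodes: {matches}")
```

On these numbers the net change is -1 while the gross movement is 5 pairs across four episodes, so the two structural matches (2014 and 2016) carry more weight than the aggregate.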


Files written / modified

  • test-data/transcripts/<stem>/transcript.json (6, regenerated with batched Whisper)
  • test-data/transcripts/<stem>/diarization.json (6, regenerated with co-host-aware diarizer)
  • benchmark.py line 27 — BASELINE_RTF updated 149.5 → 209.7
  • BENCH_SETUP.md — added ffmpeg prereq to Step 2
  • .claude/memory/radio_show_no_cohost_named_tom.md (new, project memory)
  • .claude/memory/MEMORY.md (index updated)

archive.db is not on this machine — index update happens on DESKTOP-0O8A1RL.


Per-year test set (one episode per year, expanded)

Mike asked to expand from the original 6 to one episode per year. Added:

  • 2010: 2010-05-08-hr1.mp3 (May 2010, earliest available; avoids training's Oct 2)
  • 2015: 2015-s7e19.mp3 (Jan 2015; avoids training's s7e30)
  • 2018: 2018-s10e18.mp3 (only 3 non-training episodes exist for 2018)

Archive has no 2019 directory (years 2010-2018, no 2013 either). Rob's "2018/2019 appearances" are constrained to the 5 available 2018 episodes only.

Diarization across all 9 episodes

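| Year | Episode | Audio | Tara | % | HOST | CALLER (suspect) | Q&A |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2010 | 05-08-hr1 | 42:57 | 0:30 | 1.2% | 2325s | 355s | 4 |
| 2011 | 03-12-hr1 | 41:49 | 2:20 | 5.6% | 2455s | 70s | 1 |
| 2012 | 03-10-hr1 | 43:54 | 0:30 | 1.1% | 2615s | 90s | 2 |
| 2012 | 06-09-hr1 | 44:08 | 5:40 | 12.8% | 2500s | 10s | 0 |
| 2014 | s6e19 | 48:34 | 11:20 | 23.3% | 2625s | 30s | 0 |
| 2015 | s7e19 | 47:13 | 4:40 | 9.9% | 2690s | 45s | 1 |
| 2016 | s8e43 | 88:46 | 31:30 | 35.5% | 4615s | 140s | 2 |
| 2017 | s9e30 | 89:03 | 10:10 | 11.4% | 4945s | 350s | 4 |
| 2018 | s10e18 | 85:45 | 14:40 | 17.1% | 4745s | 230s | 3 |
| Total | | 8h 52m | 1h 21m | 15.3% | | 1320s | 17 |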
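
The Tara percentage in each row is Tara time divided by episode audio, both given as mm:ss. A minimal sketch of that calculation using the values above; the helper names are illustrative.

```python
def to_seconds(mmss):
    """Convert an 'mm:ss' string (minutes may exceed 59) to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def share_pct(part_mmss, whole_mmss):
    """Percentage of `whole` taken up by `part`."""
    return 100.0 * to_seconds(part_mmss) / to_seconds(whole_mmss)


# (audio, Tara) per episode, in table order.
rows = [
    ("42:57", "0:30"), ("41:49", "2:20"), ("43:54", "0:30"),
    ("44:08", "5:40"), ("48:34", "11:20"), ("47:13", "4:40"),
    ("88:46", "31:30"), ("89:03", "10:10"), ("85:45", "14:40"),
]

total_audio = sum(to_seconds(audio) for audio, _ in rows)
total_tara = sum(to_seconds(tara) for _, tara in rows)

print(f"2016-s8e43: {share_pct('31:30', '88:46'):.1f}%")      # 35.5%
print(f"overall: {100.0 * total_tara / total_audio:.1f}%")    # 15.3%
```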

Read on each row

| Episode | Tara reading |
| --- | --- |
| 2010-05-08-hr1 | likely false positive (30s); 2010 was pre-Tara; could be Randall or a producer |
| 2011-03-12-hr1 | likely false positive; 2011 was pure call-in per Mike |
| 2012-03-10-hr1 | likely false positive; 2012 was pure call-in per Mike |
| 2012-06-09-hr1 | suspicious (5:40 is too much for noise); pending Mike spot-check |
| 2014-s6e19 | confirmed Tara |
| 2015-s7e19 | substantial (4:40); plausibly Tara was on early 2015; Mike to confirm |
| 2016-s8e43 | confirmed Tara |
| 2017-s9e30 | plausible Tara (or another co-host); Mike to confirm |
| 2018-s10e18 | could be Rob, not Tara; Mike flagged Rob for 2018/2019 appearances. The cosine threshold may be hitting because the two co-hosts have similar acoustic properties. Worth Mike sampling. |

Q&A counts caveat

The Q&A column is still suspect because every voice that isn't Mike-or-Tara is labeled CALLER, including Randall, Rob, and any on-air producer (Andrew/Shannon/Ken/etc). The 2010 episode in particular shows 355s CALLER and 4 Q&A — but per Mike's roster, that CALLER bucket likely includes a co-host or producer, not real callers. Spot-check before treating early-years Q&A as ground truth.

Mike's broader correction (2026-04-27):

  • Co-hosts rotated through over the years. Confirmed: Tara, Randall (early years), Rob (early years + occasional 2018/2019).
  • Producers / board ops would sometimes go on-air. Named so far: Andrew, Shannon, Ken, plus "a couple more" Mike doesn't recall off-hand.

Of all these, only Tara has a voice profile. Every other co-host AND every producer-on-air moment in the archive is currently being labeled CALLER, which inflates Q&A false positives in those eras and episodes.

The small Tara percentages in 2011/2012 (1-13%) most likely reflect the 0.85 cosine threshold hitting on a similar-sounding speaker that isn't actually Tara — could be a producer (Andrew/Shannon/Ken/etc) or another early-years voice we haven't catalogued. Worth Mike sampling these short windows to identify before assuming false positive vs producer.
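
To make the failure mode concrete, here is a toy sketch of threshold matching on cosine similarity. Only the 0.85 threshold comes from this log; the comparison against a single profile vector, the `>=` test, and the 3-dimensional vectors are assumptions for illustration, not the diarizer's actual code or real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_profile(embedding, profile, threshold=0.85):
    """True when the segment embedding is close enough to the profile."""
    return cosine_similarity(embedding, profile) >= threshold


# Toy vectors: a profile, a similar-sounding other speaker, a distinct voice.
profile = [1.0, 0.0, 0.0]
similar_other_speaker = [0.9, 0.3, 0.0]
distinct_voice = [0.5, 0.8, 0.0]

print(matches_profile(similar_other_speaker, profile))  # True
print(matches_profile(distinct_voice, profile))         # False
```

A speaker with no profile of their own can only be compared against the profiles that exist, so a close-enough voice lands in the Tara bucket; adding profiles for the other co-hosts and producers gives the matcher a better alternative than raising the threshold alone.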

Implication for full-archive runs: before processing the 579-episode archive in earnest, build profiles for at least Randall, Rob, and the named producers. Otherwise the Q&A extraction across early-years and 2018/2019 episodes will inherit the same false-positive pattern that originally produced 12 bogus pairs in 2016-s8e43.


Pending work (from 5070 Ti session, still unblocked)

  1. Resolve "Tom" identity: resolved this session. Mike identified the second voice in 2014-s6e19 and 2016-s8e43 as Tara; voice-profiles/tom/ was renamed, profiles.json updated, and the labels fixed in code (see "Co-host identity correction" above).
  2. Full archive download — 579 MP3s from IX server (~30-40GB). 4090 + Tailscale ready.
  3. Full pipeline run on archive: at 338x diarization + 95x transcription, total wall time for ~30h of audio extrapolates to roughly 5 minutes diarization + 19 minutes transcription. Disk I/O may dominate.
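
The extrapolation in item 3 is audio duration divided by the realtime factor. A small sketch of that arithmetic, using the ~30h figure and the rounded rates quoted in the item; `wall_minutes` is an illustrative helper.

```python
def wall_minutes(audio_hours, rtf):
    """Estimated wall-clock minutes to process `audio_hours` at a given RTF."""
    return audio_hours * 60.0 / rtf


diarization_min = wall_minutes(30, 338)
transcription_min = wall_minutes(30, 95)

print(f"diarization:   {diarization_min:.1f} min")    # 5.3 min
print(f"transcription: {transcription_min:.1f} min")  # 18.9 min
```

The estimate scales linearly with audio duration, so it is worth re-running with the measured total once the 579 MP3s are downloaded.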

Note for Mike

  • "Tom" is wrong; see the co-host identity correction above. Mike identified the voice as Tara, and the rename was done in one pass (directory, profiles.json, build_cohost_profile.py, the 5070 Ti session log, and a fresh diarization pass to update speaker_map).
  • BENCH_SETUP.md got a one-paragraph ffmpeg prereq added at the top of Step 2.