scc: 4090 benchmark with new code state — 338.1x diarize, 94.8x transcribe

Re-ran benchmark.py on GURU-BEAST-ROG against the post-overhaul code (co-host profile, batched Whisper int8_float16, revised Q&A extractor).

Results vs 5070 Ti baseline:
- Diarization: 209.7x -> 338.1x (+61.2%)
- Transcription: 63.8x -> 94.8x (+48.6%)
- Q&A pairs: 9 vs 10 (within run-to-run noise; structural correctness matches: 2014 = 0 callers, 2016 = 2 WiFi caller pairs)

Setup change: BENCH_SETUP.md now lists ffmpeg as a Step-2 prereq (winget install Gyan.FFmpeg). Was missing on this machine and the pipeline fails silently at the first diarize call without ffprobe.

Code change: benchmark.py BASELINE_RTF updated 149.5 -> 209.7 to reflect the 5070 Ti's post-overhaul measurement (ca698d4).

Data: 6 test episode transcripts and diarizations regenerated under the new code path (batched Whisper output + co-host-aware speaker_map).

Correction memory: voice-profiles/tom/ directory + 5070 Ti session log fabricated a co-host named "Tom" — Mike confirms no such person exists on the show. The audio profile is real and the diarization separation is sound, but the human identity attached to it is wrong. Saved under .claude/memory/radio_show_no_cohost_named_tom.md pending Mike providing the correct name for rename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:54:07 -07:00
parent 03b930a83b
commit d412495d5c
5 changed files with 113 additions and 49 deletions
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -41,3 +41,4 @@
 - [Neptune SBR Email Routing Setup](project_neptune_sbr_email_routing.md) - Full SBR routing chain, config file locations, MailProtector integration, access methods
 - [Dataforth Test Datasheet Pipeline](project_datasheet_pipeline.md) - Full pipeline rebuilt 2026-03-27. Server-side generation replaces DFWDS/Uploader. Website upload still broken.
 - [Dataforth Security Incident](project_dataforth_incident_2026-03-27.md) - DF-JOEL2 compromised, MFA deployed, IC3 filed. CA policies enforce April 4.
 - [Radio show — no co-host named Tom](radio_show_no_cohost_named_tom.md) — voice profile is real, name is hallucinated. Do not propagate "Tom" as a show member; ask Mike for correct identity.
--- a/.claude/memory/radio_show_no_cohost_named_tom.md
+++ b/.claude/memory/radio_show_no_cohost_named_tom.md
@@ -0,0 +1,24 @@
 ---
 name: Radio show — "Tom" is not a real co-host
 description: Correction to a fabricated co-host identity in the Computer Guru Show diarization pipeline; the voice exists but the name "Tom" is wrong
 type: project
 ---
 There is no co-host named **Tom** on The Computer Guru Show. Mike Swanson confirmed this directly on 2026-04-27.
 The 5070 Ti session (`projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md`) and corresponding code/data on disk fabricated this identity:
 - `voice-profiles/tom/` — directory with 44 embeddings labeled as "Tom"
 - `voice-profiles/profiles.json` — entry naming the profile "Tom"
 - `build_cohost_profile.py` — references TOM_WINDOWS dict
 - The session log claims "Tom was the regular in-studio co-host/board-op roughly 2013-2016" — this is hallucinated
 The underlying voice profile **is technically valid** — there is a real second voice in 2014-s6e19 and 2016-s8e43 that is not Mike and not a caller, and the cosine separation (0.698 vs Mike's 0.85) is sound. The bug is identity assignment: the wrong human name was attached to a real audio signature, and Mike has not yet supplied the correct one.
 **Why:** This will re-surface every time a future conversation reads the session log, the directory tree, or `profiles.json`. The wrongness is non-obvious from code review — the math works, only the label is bogus.
 **How to apply:**
 - Do not refer to "Tom" as a member of the show.
 - If asked to extend or use the co-host profile, ask Mike for the correct identity before writing the name anywhere.
 - Anywhere "Tom" appears in commit history, session logs, or code, treat it as a placeholder pending rename — do not propagate.
 - When summarizing the diarization pipeline, describe the profile as "second-speaker / co-host era voice (identity TBD)" until Mike provides the real name.
--- a/projects/radio-show/audio-processor/BENCH_SETUP.md
+++ b/projects/radio-show/audio-processor/BENCH_SETUP.md
@@ -25,6 +25,15 @@ cd D:\claudetools\projects\radio-show\audio-processor
 Requires Python 3.11+. Use `py` launcher on Windows.
 ffmpeg/ffprobe must be on PATH — the voice profiler shells out for audio duration. Without it the pipeline crashes on the first diarize call.
 ```powershell
 # Install ffmpeg if not already present
 winget install --id=Gyan.FFmpeg -e --accept-source-agreements --accept-package-agreements
 # Open a new shell so the new PATH takes effect, then verify
 ffprobe -version
 ```
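A preflight check can turn the missing-ffprobe failure into an explicit error before any diarize call. The following is a minimal sketch, not code from `benchmark.py`; the helper names `missing_tools` and `require_tools` are hypothetical, and only the standard-library `shutil.which` lookup is assumed.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def require_tools(tools=("ffmpeg", "ffprobe")):
    """Raise a clear error if any required executable is missing (hypothetical helper)."""
    missing = missing_tools(tools)
    if missing:
        raise RuntimeError(
            "Missing on PATH: " + ", ".join(missing) + " (see BENCH_SETUP.md Step 2)"
        )
```

Calling `require_tools()` at the top of a benchmark run would stop early with a readable message rather than failing at the first diarize call.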
 ```powershell
 cd D:\claudetools\projects\radio-show\audio-processor
--- a/projects/radio-show/audio-processor/benchmark.py
+++ b/projects/radio-show/audio-processor/benchmark.py
@@ -24,7 +24,7 @@ from rich.table import Table
 console = Console()
 BASELINE_RTX = "RTX 5070 Ti (DESKTOP-0O8A1RL)"
-BASELINE_RTF  = 149.5  # realtime factor measured 2026-04-27
+BASELINE_RTF  = 209.7  # realtime factor measured 2026-04-27 (post co-host + batched Whisper)
 BASE       = Path(__file__).parent
 EPISODES   = sorted((BASE / "test-data" / "episodes").glob("*.mp3"))
--- a/projects/radio-show/audio-processor/session-logs/2026-04-27-4090-benchmark-and-test-set.md
+++ b/projects/radio-show/audio-processor/session-logs/2026-04-27-4090-benchmark-and-test-set.md
@@ -5,95 +5,125 @@
 **Machine:** GURU-BEAST-ROG (RTX 4090, 24GB)
 **User:** Mike Swanson (mike)
-Companion to `2026-04-27-diarization-pipeline.md` (DESKTOP-0O8A1RL, RTX 5070 Ti).
+Companion to:
 - `2026-04-27-diarization-pipeline.md` (DESKTOP-0O8A1RL, RTX 5070 Ti — initial diarization fixes)
 - `2026-04-27-qa-extraction-cohost-indexing.md` (DESKTOP-0O8A1RL — co-host profile, batched Whisper, Q&A overhaul)
 This run uses the post-overhaul code (commit `e9ac607`): batched Whisper transcription, co-host-aware diarizer, revised Q&A extractor.
 ---
 ## Headline
-**Diarization on RTX 4090: 308.9x realtime — 2.07x the RTX 5070 Ti baseline (149.5x).**
+| Metric | 5070 Ti baseline | RTX 4090 | Delta |
 |---|---|---|---|
 | Diarization | 209.7x realtime | **338.1x** | +128.4x (+61.2%) |
 | Transcription (batched, large-v3 int8_float16) | 63.8x | **94.8x** | +31.0x (+48.6%) |
 | Q&A pairs (6 test episodes) | 10 | 9 | within noise |
-21,374s of audio across 6 unseen test episodes diarized in 69.2s wall time.
+21,374s of audio (5h 56m) end-to-end on the 4090: **225.5s transcription + 63.2s diarization + Q&A extraction**.
 ---
-## Setup Notes
+## Important — "Tom" co-host name is wrong
- ffmpeg/ffprobe not present on GURU-BEAST-ROG. Installed `Gyan.FFmpeg 8.1` via winget. The voice profiler shells out to ffprobe for duration; without it the pipeline crashes on the first episode.
+The 5070 Ti session built a voice profile labeled `voice-profiles/tom/` and described it in the session log as "Tom, regular in-studio co-host/board-op roughly 2013-2016." Mike confirmed in this session: **there is no co-host named Tom**. The voice profile is real (clean cosine separation, 0.698 vs Mike) and the diarization correctly identifies the second speaker, but the human identity attached to it is hallucinated.
- The repo already contained `benchmark.py` (transcribe + diarize + Q&A on `test-data/episodes/`, hardcoded 5070 Ti baseline). Used as-is. (BENCH_SETUP.md should mention ffmpeg as a prereq.)
+
- Voice profiles, training data, and test MP3s were already synced to this machine via the prior auto-sync.
+The directory, `profiles.json` entry, `build_cohost_profile.py` references, and the 5070 Ti session log all carry the bogus name. Identity TBD pending Mike confirming who that voice actually is.
 Memory entry added: `.claude/memory/radio_show_no_cohost_named_tom.md`. The profile will be renamed once Mike provides the correct identity.
 ---
-## Phase 1 — Whisper Transcription (large-v3, faster-whisper)
+## Setup notes (for next machine)
 - ffmpeg/ffprobe is required on PATH — the voice profiler shells out to ffprobe for audio duration and the pipeline crashes on the first diarize call without it. Was missing on this machine; installed via `winget install Gyan.FFmpeg`. BENCH_SETUP.md updated to call this out as a Step-2 prereq.
 - `.gitignore` (added in `e9ac607`) excludes `episodes/`, `transcripts/`, `*.db`, `.venv`. The test MP3s + transcripts I committed earlier in `2c06e72` are still tracked from before the gitignore arrived; can be `git rm --cached`-ed in a follow-up cleanup.
 - All voice profiles, training data, and test MP3s were already on this machine via prior auto-sync.
 ---
 ## Phase 1 — Whisper Transcription (large-v3, batched, int8_float16, batch_size=16)
 | Episode | Audio | Wall | RTF |
 |---|---|---|---|
-| 2011-03-12-hr1 | 2509s | 198.2s | 12.7x |
+| 2011-03-12-hr1 | 2509s | 29.7s | 84.6x |
-| 2012-03-10-hr1 | 2634s | 208.7s | 12.6x |
+| 2012-03-10-hr1 | 2634s | 30.3s | 87.0x |
-| 2012-06-09-hr1 | 2648s | 192.5s | 13.8x |
+| 2012-06-09-hr1 | 2648s | 33.6s | 78.8x |
-| 2014-s6e19     | 2914s | 167.0s | 17.5x |
+| 2014-s6e19     | 2914s | 30.2s | 96.6x |
-| 2016-s8e43     | 5326s | 339.1s | 15.7x |
+| 2016-s8e43     | 5326s | 49.2s | 108.2x |
-| 2017-s9e30     | 5343s | 341.2s | 15.7x |
+| 2017-s9e30     | 5343s | 52.5s | 101.8x |
-| **Total**      | **21374s** | **1446.6s** | **14.8x** |
+| **Total**      | **21374s** | **225.5s** | **94.8x** |
-Faster-whisper large-v3, beam_size=5, fp16 on the 4090.
+vs 5070 Ti's 63.8x: **+48.6%**.
 Batching is doing real work here. The pre-batched code path on this same hardware (first benchmark run earlier today) was 14.8x — batching gave a 6.4× speedup on the 4090.
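The realtime factor (RTF) used throughout is audio duration divided by wall time. As a minimal sketch (the names `PHASE1` and `realtime_factor` are illustrative and not taken from `benchmark.py`), the batched totals and the speedup over the 14.8x unbatched run can be recomputed from the table values:

```python
# (audio seconds, wall seconds) per episode, copied from the batched Phase 1 table
PHASE1 = {
    "2011-03-12-hr1": (2509, 29.7),
    "2012-03-10-hr1": (2634, 30.3),
    "2012-06-09-hr1": (2648, 33.6),
    "2014-s6e19": (2914, 30.2),
    "2016-s8e43": (5326, 49.2),
    "2017-s9e30": (5343, 52.5),
}

UNBATCHED_RTF = 14.8  # pre-batched run on the same 4090, per the log


def realtime_factor(audio_s, wall_s):
    """Seconds of audio processed per second of wall time."""
    return audio_s / wall_s


total_audio = sum(audio for audio, _ in PHASE1.values())
total_wall = sum(wall for _, wall in PHASE1.values())
batched_rtf = realtime_factor(total_audio, total_wall)
speedup = batched_rtf / UNBATCHED_RTF

print(f"{total_audio}s audio / {total_wall:.1f}s wall = {batched_rtf:.1f}x")
print(f"batched vs unbatched: {speedup:.1f}x")
```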
 ---
-## Phase 2 — Diarization
+## Phase 2 — Diarization (with co-host profile applied)
 | Episode | Audio | Wall | RTF | Turns | HOST | CALLER |
 |---|---|---|---|---|---|---|
-| 2011-03-12-hr1 | 2509s | 16.1s | 155.6x | 19 | 2470s | 125s |
+| 2011-03-12-hr1 | 2509s | 9.1s  | 275.0x | 25  | 2455s | 70s  |
-| 2012-03-10-hr1 | 2634s | 7.3s  | 361.6x | 19 | 2615s | 105s |
+| 2012-03-10-hr1 | 2634s | 7.6s  | 348.3x | 22  | 2615s | 90s  |
-| 2012-06-09-hr1 | 2648s | 7.8s  | 338.3x | 11 | 2500s | 195s |
+| 2012-06-09-hr1 | 2648s | 7.7s  | 343.1x | 13  | 2500s | 10s  |
-| 2014-s6e19     | 2914s | 8.3s  | 352.6x | 28 | 2635s | 410s |
+| 2014-s6e19     | 2914s | 8.3s  | 352.6x | 31  | 2625s | 30s  |
-| 2016-s8e43     | 5326s | 14.7s | 361.8x | 112 | 4710s | 1170s |
+| 2016-s8e43     | 5326s | 15.1s | 353.6x | 134 | 4615s | 140s |
-| 2017-s9e30     | 5343s | 15.0s | 356.9x | 55 | 4950s | 660s |
+| 2017-s9e30     | 5343s | 15.5s | 345.1x | 69  | 4945s | 350s |
-| **Total**      | **21374s** | **69.2s** | **308.9x** | 244 | 19880s | 2665s |
+| **Total**      | **21374s** | **63.2s** | **338.1x** | 294 | 19755s | 690s |
-**vs RTX 5070 Ti baseline: 149.5x → 308.9x (+159.4x, +106.6%).**
+**vs 5070 Ti baseline: 209.7x → 338.1x (+61.2%).**
-Episode 1 carries the cold-start penalty (CUDA init + WavLM load): 155.6x. Warm episodes 2-6 cluster at 338-362x. The total averages 308.9x because the 5070 Ti measurement also included its first-episode cold start, so this is a fair comparison.
+Per-episode RTFs cluster tightly at 343-354x for warm episodes (5/6); episode 1 carries the cold-start penalty at 275.0x. Apples-to-apples vs the 5070 Ti measurement which also includes a cold start.
 Aggregate CALLER time dropped from 2665s (pre-co-host pipeline, run earlier today) to 690s. That ~2000s delta is the second-voice signal correctly being routed away from the CALLER bucket. The benchmark table only sums HOST + CALLER, so CO-HOST seconds aren't shown in the totals — present in the per-episode `diarization.json` files.
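The baseline comparisons are plain relative deltas against the 5070 Ti figures. A short sketch of that arithmetic, using only numbers reported in this log (the function name `delta_vs_baseline` is illustrative):

```python
def delta_vs_baseline(baseline_rtf, measured_rtf):
    """Return (absolute delta, percent delta) of a measured RTF vs a baseline RTF."""
    abs_delta = measured_rtf - baseline_rtf
    return abs_delta, 100.0 * abs_delta / baseline_rtf


diar_abs, diar_pct = delta_vs_baseline(209.7, 338.1)   # diarization
trans_abs, trans_pct = delta_vs_baseline(63.8, 94.8)   # transcription
caller_drop_s = 2665 - 690  # CALLER seconds moved out of the bucket by the co-host profile

print(f"diarization:   +{diar_abs:.1f}x ({diar_pct:+.1f}%)")
print(f"transcription: +{trans_abs:.1f}x ({trans_pct:+.1f}%)")
print(f"CALLER time drop: {caller_drop_s}s")
```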
 ---
-## Phase 3 — Q&A Extraction
+## Phase 3 — Q&A Extraction (post-overhaul: turn-based lookback, 4s CALLER preference, expanded promo signatures)
-| Episode | Q&A pairs |
+| Episode | 4090 Q&A pairs | 5070 Ti reference | Note |
-|---|---|
+|---|---|---|---|
-| 2011-03-12-hr1 | 3 |
+| 2011-03-12-hr1 | 1 | 3 | -2 |
-| 2012-03-10-hr1 | 2 |
+| 2012-03-10-hr1 | 2 | 1 | +1 |
-| 2012-06-09-hr1 | 3 |
+| 2012-06-09-hr1 | 0 | 1 | -1 |
-| 2014-s6e19     | 1 |
+| 2014-s6e19     | 0 | 0 | match (gaming, no callers) |
-| 2016-s8e43     | 5 |
+| 2016-s8e43     | 2 | 2 | match (WiFi caller) |
-| 2017-s9e30     | 5 |
+| 2017-s9e30     | 4 | 3 | +1 |
-| **Total**      | **19** |
+| **Total**      | **9** | **10** | **-1** |
-Density: **3.2 pairs/episode** on the unseen test set vs **3.0 pairs/episode** on the 9-episode training set (27 pairs). Pair count generalizes — no evidence of overfitting, and the promo/bumper filter from the earlier session continues to suppress false positives on unseen content.
+Differences are within noise. Likely sources:
 - Whisper batched inference produces slightly different segment boundaries on identical audio under different GPU schedule orderings.
 - Sliding-window diarization midpoint resolution can put a borderline segment in either bucket on different runs.
 - Q&A extraction thresholds are sensitive to small boundary shifts.
-The 2014-s6e19 outlier (1 pair / 410s caller time) likely reflects show content rather than a pipeline issue — caller segments don't always parse as cleanly into Q-then-A structure. Worth ear-checking that one before drawing conclusions.
+**The two structural correctness signals match**: 2014 = 0 (no callers in gaming special) and 2016 = 2 (real WiFi caller, two-turn). That's the meaningful test. Aggregate ±1 across six episodes is acceptable run-to-run drift.
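The comparison in this section can be expressed as a per-episode diff plus a check on the two episodes treated as structural signals. This is a sketch over the table values only; the dictionary layout and the names `QA_PAIRS` and `STRUCTURAL` are assumptions made for illustration.

```python
# episode -> (4090 Q&A pairs, 5070 Ti reference pairs), from the Phase 3 table
QA_PAIRS = {
    "2011-03-12-hr1": (1, 3),
    "2012-03-10-hr1": (2, 1),
    "2012-06-09-hr1": (0, 1),
    "2014-s6e19": (0, 0),
    "2016-s8e43": (2, 2),
    "2017-s9e30": (4, 3),
}

# Episodes whose counts are treated as the structural correctness signals
STRUCTURAL = ("2014-s6e19", "2016-s8e43")

diffs = {ep: new - ref for ep, (new, ref) in QA_PAIRS.items()}
total_4090 = sum(new for new, _ in QA_PAIRS.values())
total_5070ti = sum(ref for _, ref in QA_PAIRS.values())
structural_match = all(QA_PAIRS[ep][0] == QA_PAIRS[ep][1] for ep in STRUCTURAL)

print(diffs)
print(f"totals: {total_4090} vs {total_5070ti}, structural match: {structural_match}")
```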
 ---
-## Generalization Findings
+## Files written / modified
- **Untrained year:** The two 2012 episodes (year never seen during training) produced clean HOST/CALLER labels and reasonable Q&A counts. Voice profile composite generalizes across the production-era boundary.
+- `test-data/transcripts/<stem>/transcript.json` (6, regenerated with batched Whisper)
- **No all-HOST failures:** Every test episode hit caller segments. The 0.85 threshold + identification fix from the prior session hold up on unseen content.
+- `test-data/transcripts/<stem>/diarization.json` (6, regenerated with co-host-aware diarizer)
- **Show duration scaling:** Both 89-minute episodes (s8e43, s9e30) hit ~360x realtime, indicating diarization wall time is dominated by audio duration, not turn count.
+- `benchmark.py` line 27 — `BASELINE_RTF` updated 149.5 → 209.7
 - `BENCH_SETUP.md` — added ffmpeg prereq to Step 2
 - `.claude/memory/radio_show_no_cohost_named_tom.md` (new, project memory)
 - `.claude/memory/MEMORY.md` (index updated)
 archive.db is not on this machine — index update happens on DESKTOP-0O8A1RL.
 ---
-## Files Written
+## Pending work (from 5070 Ti session, still open)
- `test-data/transcripts/<stem>/transcript.json` (6 files)
+1. **Resolve "Tom" identity** — Mike to confirm who the second voice is in 2014-s6e19 and 2016-s8e43. Then rename `voice-profiles/tom/`, update `profiles.json`, fix labels in code. Until then, voice-profile data is correct but mislabeled.
- `test-data/transcripts/<stem>/diarization.json` (6 files)
+2. **Full archive download** — 579 MP3s from IX server (~30-40GB). 4090 + Tailscale ready.
-
+3. **Full pipeline run on archive** — at 338x diarization + 95x transcription, total wall time for ~30h of audio extrapolates to roughly 5 minutes diarization + 19 minutes transcription. Disk I/O may dominate.
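The wall-time extrapolation in item 3 is audio duration divided by the realtime factor. A minimal sketch, taking the ~30h audio estimate and the rounded rates from that item as given (the function name `wall_minutes` is illustrative):

```python
def wall_minutes(audio_hours, rtf):
    """Estimated wall-clock minutes to process `audio_hours` of audio at a given RTF."""
    return audio_hours * 60.0 / rtf


AUDIO_HOURS = 30  # archive size estimate quoted in item 3, not independently measured

diarize_min = wall_minutes(AUDIO_HOURS, 338)
transcribe_min = wall_minutes(AUDIO_HOURS, 95)

print(f"diarization:   {diarize_min:.1f} min")
print(f"transcription: {transcribe_min:.1f} min")
```

These figures cover GPU compute only; download and disk I/O time are not modeled.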
 No archive DB on this machine — test-set diarization is not patched anywhere. If we want the test episodes searchable in `archive.db`, that would happen on DESKTOP-0O8A1RL where the index lives.
 ---
 ## Note for Mike
-`BENCH_SETUP.md` Step 2 (Python environment) should add `winget install Gyan.FFmpeg` (or equivalent) — the script silently fails at the first diarize call without ffprobe on PATH. Easy doc fix; flagging here so it doesn't get lost.
+- "Tom" is wrong — see callout above. Tell me who that is and I'll do the rename in one pass (directory, profiles.json, build_cohost_profile.py, the 5070 Ti session log, and a fresh diarization pass to update `speaker_map`).
 - BENCH_SETUP.md got a one-paragraph ffmpeg prereq added at the top of Step 2.