Commit Graph

7 Commits

Author SHA1 Message Date
c760e430c0 radio: bumper detection in diarizer + full archive download script
Adds a transcript-driven bumper filter to the diarization pipeline. When
a transcript segment matches qa_extractor's promo/bumper signatures, the
overlapping audio windows are labeled BUMPER and the WavLM cosine match
is skipped. Prevents music/promo from being matched against speaker
profiles (the failure mode Mike caught in 2018-s10e18 @ 09:20-10:05).

Code changes:
- src/voice_profiler.py: identify_speakers() takes optional skip_ranges
  parameter; windows whose midpoint falls in a skip range get labeled
  "[bumper]" and skip cosine match
- src/diarizer.py: diarize() takes optional transcript_path; pre-computes
  bumper time ranges via qa_extractor._is_promo_or_bumper, passes to
  identify_speakers; adds BUMPER speaker label
- benchmark.py: passes transcript_path to diarize()
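
The midpoint rule described above can be sketched roughly as follows. This
is an illustration, not the code in src/voice_profiler.py: the helper names
bumper_ranges and label_windows, the segment dict shape (start/end/text),
the half-open range test, and the stand-in predicate are all assumptions.
The sample transcript text is made up.

```python
from typing import Callable, List, Optional, Tuple

Range = Tuple[float, float]  # (start_sec, end_sec)


def bumper_ranges(segments: List[dict], is_bumper: Callable[[str], bool]) -> List[Range]:
    """Collect the time ranges of transcript segments the predicate flags."""
    return [(s["start"], s["end"]) for s in segments if is_bumper(s["text"])]


def label_windows(windows: List[Range],
                  skip_ranges: Optional[List[Range]] = None) -> List[Optional[str]]:
    """Return "[bumper]" for windows whose midpoint lies in a skip range.

    None means "not skipped", i.e. the window would go on to the cosine match.
    """
    skip_ranges = skip_ranges or []
    labels: List[Optional[str]] = []
    for start, end in windows:
        mid = (start + end) / 2.0
        # Assumption: ranges are half-open, [lo, hi).
        in_skip = any(lo <= mid < hi for lo, hi in skip_ranges)
        labels.append("[bumper]" if in_skip else None)
    return labels


# Hypothetical stand-in for qa_extractor._is_promo_or_bumper.
def looks_like_bumper(text: str) -> bool:
    return "back after this" in text.lower()


# Made-up segments; 560s-605s corresponds to 09:20-10:05.
segments = [
    {"start": 540.0, "end": 560.0, "text": "So what router are you using?"},
    {"start": 560.0, "end": 605.0, "text": "We'll be back after this."},
]
skip = bumper_ranges(segments, looks_like_bumper)

windows = [(555.0, 558.0), (558.0, 561.0), (559.0, 562.0),
           (603.0, 606.0), (604.0, 607.0)]
labels = label_windows(windows, skip)
```

Testing the midpoint rather than any overlap means a window that only
clips the edge of a bumper is still matched against the speaker profiles.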

Aggregate impact across 9-episode test set:
  Tara attribution: 4880s -> 3680s  (-1200s / -25%)
  Q&A pairs: 17 -> 19 (+2)
    (bumper-flagged segments had been disrupting conversation detection
     in 2017-s9e30 and 2018-s10e18)
  CALLER total: 1320s -> 1190s  (bumpers previously labeled CALLER moved)
  Per-episode bumpers caught: 1-8, total ~165 bumper segments across set

Remaining Tara false positives are real callers acoustically similar to
Tara (Christopher in 2018, Kay in 2012, William and Charles in 2015) and
guest Clay in 2015-s7e19. Fixing those requires a Tara profile rebuild
plus a new Clay profile, not bumper filtering.

Adds download_full_archive.py, a resumable mirror-style downloader that
walks IX server's /home/gurushow/public_html/archive/{year}/ and copies
all MP3s to archive-data/episodes/. The run is in progress (~589 files,
~10-15GB). The archive will be used to source clean profile windows for
the remaining co-hosts (Tara rebuild, Clay, Tony, Rob, Randall, producers).
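
A minimal sketch of the "resumable mirror" idea, assuming the source tree
is reachable as a local path (the actual transport used by
download_full_archive.py is not stated here). The function name
mirror_mp3s, the size-based completeness check, the ".part" temp suffix,
and the preserved per-year subdirectories are all assumptions.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def mirror_mp3s(src_root: Path, dst_root: Path) -> List[Path]:
    """Copy every .mp3 under src_root to dst_root, skipping finished files."""
    copied: List[Path] = []
    for src in sorted(src_root.rglob("*.mp3")):
        dst = dst_root / src.relative_to(src_root)
        # Resume support: a file already present with the same size is done.
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Copy to a temp name first so an interrupted run never leaves a
        # truncated file under the final name.
        tmp = dst.with_name(dst.name + ".part")
        shutil.copyfile(src, tmp)
        tmp.replace(dst)
        copied.append(dst)
    return copied


# Tiny demonstration with throwaway directories.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_dir = Path(tmp_dir, "archive")
    dst_dir = Path(tmp_dir, "episodes")
    (src_dir / "2018").mkdir(parents=True)
    (src_dir / "2018" / "a.mp3").write_bytes(b"x" * 10)
    first_run = len(mirror_mp3s(src_dir, dst_dir))   # copies the file
    second_run = len(mirror_mp3s(src_dir, dst_dir))  # nothing left to copy
```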

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:17:50 -07:00
a4f527f31e radio: per-year test set (one episode per year, 2010-2018)
Added 2010, 2015, 2018 test episodes to round out the test set to one
per available year:
- 2010-05-08-hr1 (May 2010, earliest available; pre-Tara era)
- 2015-s7e19 (Jan 2015, avoids training's s7e30)
- 2018-s10e18 (only 3 non-training 2018 episodes exist)

The archive has no 2019 directory, so Rob's "2018/2019 appearances" can
only be checked against the 5 available 2018 episodes.

Per-year diarization summary (Tara presence, post-rename):
  2010-05-08    30s   1.2%   likely false positive (pre-Tara)
  2011-03-12   140s   5.6%   likely false positive (call-in only)
  2012-03-10    30s   1.1%   likely false positive (call-in only)
  2012-06-09   340s  12.8%   suspicious — Mike to confirm
  2014-s6e19   680s  23.3%   confirmed
  2015-s7e19   280s   9.9%   plausible — Mike to confirm
  2016-s8e43  1890s  35.5%   confirmed
  2017-s9e30   610s  11.4%   plausible
  2018-s10e18  880s  17.1%   COULD BE ROB. Mike flagged Rob for
                              2018/2019 appearances; the cosine match may
                              be firing on Rob because he is acoustically
                              similar to Tara

Total Tara across 9 episodes: 1h 21m / 8h 52m audio (15.3%).
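
The total can be re-derived from the per-year rows above. A quick check,
using the 8h 52m figure from this message as the denominator (the variable
names are illustrative only):

```python
# Tara seconds per episode, copied from the per-year summary.
tara_seconds = {
    "2010-05-08": 30,
    "2011-03-12": 140,
    "2012-03-10": 30,
    "2012-06-09": 340,
    "2014-s6e19": 680,
    "2015-s7e19": 280,
    "2016-s8e43": 1890,
    "2017-s9e30": 610,
    "2018-s10e18": 880,
}

total_tara_s = sum(tara_seconds.values())
total_audio_s = 8 * 3600 + 52 * 60  # 8h 52m of audio

hours, rem = divmod(total_tara_s, 3600)
minutes = rem // 60
share_pct = round(100.0 * total_tara_s / total_audio_s, 1)

print(f"{total_tara_s}s = {hours}h {minutes}m, {share_pct}% of audio")
```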

Q&A counts (still suspect — every voice that isn't Mike-or-Tara is
labeled CALLER, so Randall/Rob/producers inflate the bucket):
  2010=4, 2011=1, 2012a=2, 2012b=0, 2014=0, 2015=1, 2016=2, 2017=4, 2018=3
  Total: 17 pairs across 9 episodes

4090 perf on the expanded set:
- Diarization: 31928s in 121.5s = 262.7x realtime (vs 209.7x on 5070 Ti, +25.3%)
- Transcription (3 new episodes only): 10554s in 112.4s = 93.9x
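
For reference, the figures above follow the usual realtime-factor
arithmetic: audio seconds divided by wall-clock seconds. The helper names
below are illustrative, not functions from benchmark.py.

```python
def realtime_factor(audio_s: float, wall_s: float) -> float:
    """How many seconds of audio are processed per second of wall time."""
    return audio_s / wall_s


def pct_change(new: float, base: float) -> float:
    """Relative change of new versus base, in percent."""
    return 100.0 * (new - base) / base


diarize_rtf = realtime_factor(31928, 121.5)
transcribe_rtf = realtime_factor(10554, 112.4)
# Uses the reported 262.7x and the 5070 Ti baseline of 209.7x.
speedup_pct = pct_change(262.7, 209.7)

print(f"diarize {diarize_rtf:.1f}x, transcribe {transcribe_rtf:.1f}x, "
      f"{speedup_pct:+.1f}% vs 5070 Ti")
```

The recomputed diarization factor lands within 0.1x of the reported value;
the small difference is presumably due to the wall time being rounded to
121.5s in this message.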

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:20:09 -07:00
fb683d6a05 radio: rename Tom -> Tara, expand speaker roster
Mike confirmed there is no co-host named "Tom" — the voice in 2014-s6e19
and 2016-s8e43 is Tara. The 5070 Ti session fabricated the Tom identity.
The voice profile itself (44 embeddings, 0.698 cosine vs Mike) is correct;
only the human label was wrong.

Rename swept:
- voice-profiles/tom/ -> voice-profiles/tara/ (git mv preserves all .npy)
- voice-profiles/profiles.json: "Tom" key -> "Tara"
- build_cohost_profile.py: TOM_WINDOWS -> TARA_WINDOWS, COHOST_NAME, comments
- 2026-04-27-qa-extraction-cohost-indexing.md: correction header + body sweep
- 2026-04-27-4090-benchmark-and-test-set.md: closure note
- .claude/memory/radio_show_no_cohost_named_tom.md: resolution + speaker roster

Diarization re-run after rename so speaker_map emits "Cohost: Tara".
Q&A counts unchanged (rename is label-only): 9 pairs across 6 test episodes.

Tara distribution from the post-rename diarization (per-episode % of audio):
  2011-03-12-hr1   140s   5.6%   likely false positive (call-in only)
  2012-03-10-hr1    30s   1.1%   likely false positive (call-in only)
  2012-06-09-hr1   340s  12.8%   suspicious — pending Mike confirm
  2014-s6e19       680s  23.3%   confirmed
  2016-s8e43      1890s  35.5%   confirmed
  2017-s9e30       610s  11.4%   plausible — pending Mike confirm

Broader speaker-roster context Mike provided this session (saved to
memory): the show has had multiple co-hosts (Tara, Randall, Rob) plus
producers/board ops (Andrew, Shannon, Ken, others) who would sometimes
go on-air. Only Tara has a profile so far. Every other speaker is
currently labeled CALLER, which means small CO-HOST attributions in
unexpected episodes (e.g. 2011/2012) may actually be a producer rather
than a false positive — Mike to spot-check.

Action item before full-archive run: build profiles for Randall, Rob,
and the named producers to avoid systematic Q&A false positives in
early-years and 2018/2019 episodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:11:03 -07:00
b9a4bb8807 scc: 4090 benchmark with new code state — 338.1x diarize, 94.8x transcribe
Re-ran benchmark.py on GURU-BEAST-ROG against the post-overhaul code
(co-host profile, batched Whisper int8_float16, revised Q&A extractor).

Results vs 5070 Ti baseline:
- Diarization: 209.7x -> 338.1x (+61.2%)
- Transcription: 63.8x -> 94.8x (+48.6%)
- Q&A pairs: 9 vs 10 (within run-to-run noise; structural correctness matches:
  2014 = 0 callers, 2016 = 2 WiFi caller pairs)

Setup change: BENCH_SETUP.md now lists ffmpeg as a Step-2 prereq
(winget install Gyan.FFmpeg). ffmpeg was missing on this machine, and
without ffprobe the pipeline fails silently at the first diarize call.

Code change: benchmark.py BASELINE_RTF updated 149.5 -> 209.7 to reflect
the 5070 Ti's post-overhaul measurement (e9ac607).

Data: 6 test episode transcripts and diarizations regenerated under the
new code path (batched Whisper output + co-host-aware speaker_map).

Correction memory: the voice-profiles/tom/ directory and the 5070 Ti
session log refer to a co-host named "Tom", an identity that session
fabricated. Mike confirms no such person exists on the show. The audio
profile is real and the diarization separation is sound, but the human
identity attached to it is wrong. Saved under
.claude/memory/radio_show_no_cohost_named_tom.md pending Mike providing
the correct name for the rename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:54:07 -07:00
7bb683a3ed sync: auto-sync from GURU-BEAST-ROG at 2026-04-27 14:42:18
Author: Mike Swanson
Machine: GURU-BEAST-ROG
Timestamp: 2026-04-27 14:42:18
2026-04-27 14:42:25 -07:00
6cc9043b8e Audio processor: validated voice profiling accuracy, tuned threshold
- Fine-grained speaker analysis (3s windows, 1s hop) across 42min episode
- Host voice: 0.90-0.98 similarity (clear positive match)
- Callers: 0.65-0.68 (correctly below threshold)
- Produced audio/clips: 0.53-0.65 (correctly identified as non-host)
- Co-host/other speakers: 0.56-0.62 (correctly identified)
- Tuned host_match_threshold from 0.75 to 0.83 based on empirical data
- Cross-referenced dips with transcript: correctly identifies callers,
  show intros, played audio clips, and station breaks
- Batch transcription of 7 additional training episodes in progress
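
A rough sketch of the windowing and threshold check described above,
assuming each window has already been turned into an embedding vector.
The function names and the toy 2-D vectors are illustrative only; real
embeddings come from the profiling model and are not reproduced here.

```python
import numpy as np


def sliding_windows(duration_s: float, win_s: float = 3.0, hop_s: float = 1.0):
    """Return (start, end) pairs for fixed-size windows with a fixed hop."""
    if duration_s < win_s:
        return []
    count = int((duration_s - win_s) // hop_s) + 1
    return [(i * hop_s, i * hop_s + win_s) for i in range(count)]


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_host(embedding: np.ndarray, host_profile: np.ndarray,
            threshold: float = 0.83) -> bool:
    """True when the window is similar enough to the host profile."""
    return cosine(embedding, host_profile) >= threshold


windows = sliding_windows(10.0)  # 3s windows, 1s hop over 10s of audio

# Toy unit vectors built to have a chosen similarity to the host profile.
host_profile = np.array([1.0, 0.0])
host_like = np.array([0.90, np.sqrt(1 - 0.90 ** 2)])    # similarity 0.90
caller_like = np.array([0.66, np.sqrt(1 - 0.66 ** 2)])  # similarity 0.66

host_match = is_host(host_like, host_profile)
caller_match = is_host(caller_like, host_profile)
```

With the measured ranges above, a 0.83 threshold sits between the highest
non-host score (0.68) and the lowest host score (0.90).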

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 12:48:25 -07:00
a1e0442d8b Add radio show audio processor and post-show workflow
- Audio processor CLI tool with 6-stage pipeline: transcribe (faster-whisper GPU),
  diarize (pyannote), detect segments (multi-signal classifier), remove commercials,
  split segments, analyze content (Ollama)
- Post-show workflow doc for episode posts, forum threads, deep-dive blog posts
- Training plan for using 579-episode archive for voice profiles and commercial detection
- Successful test: 45min episode transcribed in 2:37 on RTX 5070 Ti
- Sample transcript output from S7E30 (March 2015)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 11:51:59 -07:00