9 Commits

Author SHA1 Message Date
82940d96d7 radio: utf-8 transcript writes + sqlite archive importer + session log
- src/transcriber.py: open transcript.{json,txt,srt} with encoding="utf-8".
  Windows cp1252 default crashed on Whisper output containing U+2044.
- import_to_sqlite.py: new. Walks archive-data/transcripts, builds
  archive.db (5 tables + 2 FTS5 virtual tables, sha256-keyed idempotency).
  20.5 MB / 208 episodes at smoke-test time, 1.9s rebuild.
- batch_process.py: tracked from prior session — full-archive batch with
  resumable transcribe/diarize/intros/qa pipeline.
- .gitignore: archive-data/ and logs/.
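
A minimal sketch of the encoding fix in the first bullet, not the actual src/transcriber.py code. write_transcript_text and the sample text are hypothetical. It shows why the explicit encoding matters: U+2044 (FRACTION SLASH) has no cp1252 mapping, so a file opened with the Windows default encoding fails on write, while a UTF-8 file round-trips the text.

```python
import os
import tempfile


def write_transcript_text(path, text):
    """Write transcript text with an explicit encoding (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


# Made-up sample containing U+2044 FRACTION SLASH.
sample = "about 1\u20442 of the drive"

# cp1252 cannot represent U+2044, which is the failure mode described above.
try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# The same text writes and reads back cleanly as UTF-8.
out_path = os.path.join(tempfile.mkdtemp(), "transcript.txt")
write_transcript_text(out_path, sample)
with open(out_path, encoding="utf-8") as fh:
    round_trip = fh.read()
```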

Session log: 2026-04-27-archive-batch-and-sqlite-import.md.
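
A rough sketch of what "sha256-keyed idempotency" in the import_to_sqlite.py bullet could look like. The single-table schema and the import_transcript helper are assumptions for illustration; the real archive.db has 5 tables plus 2 FTS5 virtual tables, which are not reproduced here. The idea shown is only that the content hash is the primary key, so re-running the importer over the same file adds nothing.

```python
import hashlib
import sqlite3


def import_transcript(conn, name, text):
    """Insert one transcript keyed by the sha256 of its content.

    Returns True if a row was added, False if that content was already present.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO transcripts (sha256, name, body) VALUES (?, ?, ?)",
        (digest, name, text),
    )
    # total_changes only advances when a row was actually written.
    return conn.total_changes > before


conn = sqlite3.connect(":memory:")
# Hypothetical one-table schema, not the real archive.db layout.
conn.execute(
    "CREATE TABLE transcripts (sha256 TEXT PRIMARY KEY, name TEXT, body TEXT)"
)

first = import_transcript(conn, "episode-a", "hello caller")
second = import_transcript(conn, "episode-a", "hello caller")  # re-run, ignored
row_count = conn.execute("SELECT COUNT(*) FROM transcripts").fetchone()[0]
```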

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 19:38:02 -07:00
488bf5849e radio: attach caller names to Q&A pairs from transcript intros
QAPair gets caller_name and caller_role fields, populated by a new
attach_caller_names(pairs, transcript_segments) helper. For each pair, the
helper finds the opening intro active at question_start (8s forward
tolerance, no backward limit, because a call can run for 10+ minutes
while the intro happens once at the start) and attaches the speaker name.
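
A simplified sketch of the attachment step, under stated assumptions: the Intro dataclass, the active_intro helper, and the idea of passing already-extracted intros (rather than raw transcript_segments) are illustrative and not taken from the repository. Only the 8s forward tolerance and the absence of a backward limit come from the description above. The sample times are made up apart from Charlie's 70:21 intro.

```python
from dataclasses import dataclass
from typing import List, Optional

FORWARD_TOLERANCE_S = 8.0  # from the commit message


@dataclass
class Intro:  # hypothetical shape of an extracted intro
    time: float
    name: str
    role: str = "caller"


@dataclass
class QAPair:  # only the fields relevant to this sketch
    question_start: float
    caller_name: Optional[str] = None
    caller_role: Optional[str] = None


def active_intro(t: float, intros: List[Intro]) -> Optional[Intro]:
    """Latest intro at or before t (plus tolerance); no backward limit."""
    candidates = [i for i in intros if i.time <= t + FORWARD_TOLERANCE_S]
    return max(candidates, key=lambda i: i.time) if candidates else None


def attach_caller_names(pairs: List[QAPair], intros: List[Intro]) -> List[QAPair]:
    """Fill caller_name/caller_role in place for pairs that have an active intro."""
    for pair in pairs:
        intro = active_intro(pair.question_start, intros)
        if intro is not None:
            pair.caller_name = intro.name
            pair.caller_role = intro.role
    return pairs


intros = [Intro(100.0, "Denise"), Intro(4221.0, "Charlie")]
pairs = attach_caller_names(
    [QAPair(50.0), QAPair(700.0), QAPair(4215.0)], intros
)
names = [p.caller_name for p in pairs]
```

The second pair starts ten minutes after Denise's intro and still resolves to her; the third starts 6s before Charlie's intro and is caught by the forward tolerance.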

Validation on 9-episode test set:
  19/19 Q&A pairs (100%) now have caller names attached.

Examples of corrections from oracle attribution:
  2018-s10e18 @ 73:36  Christopher (was misattributed to "Tara")
  2015-s7e19 @ 35:45   William     (was misattributed to "Tara")
  2010-05-08-hr1       Jackie x3, Bruce
  2012-03-10-hr1       Adam x2
  2016-s8e43           John, Doug
  2017-s9e30           Tom, Denise x3, Charlie

speaker_oracle.py: adds speaker_at(time, intros) helper used both by the
existing resolve_speakers() and the new caller-name attachment. Also
adds the "let's fit/bring/put X in/on" intro pattern variant (caught
Charlie at 70:21 in 2017-s9e30 that "talk to X" missed).

download_full_archive.py: SSH keepalive every 30s plus per-file retry on
failure (up to 3 attempts with reconnect). An earlier run hung on a dead
connection at file 109 of 589 with no recovery; the restarted run is now
transferring at ~10 MB/s vs ~2-3 MB/s before.
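
A generic sketch of the per-file retry with reconnect, not the actual download_full_archive.py code. The connect and fetch callables, the fake connection, and the file name are all hypothetical stand-ins so the example runs without a server. The keepalive itself is not shown, since the commit does not say which SSH library is used; if it were paramiko, Transport.set_keepalive(30) would be the relevant call.

```python
import time

MAX_ATTEMPTS = 3  # from the commit message


def download_with_retry(connect, fetch, remote_path,
                        max_attempts=MAX_ATTEMPTS, delay_s=0.0):
    """Call fetch(conn, remote_path), reconnecting after each failure.

    Re-raises the last error once max_attempts is exhausted.
    """
    conn = connect()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(conn, remote_path)
        except OSError as exc:  # covers ConnectionResetError, TimeoutError, ...
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_s)
                conn = connect()  # drop the dead connection, open a fresh one
    raise last_error


# Simulated transport: the first two fetches fail, the third succeeds.
calls = {"connects": 0, "fetches": 0}


def fake_connect():
    calls["connects"] += 1
    return object()


def flaky_fetch(conn, path):
    calls["fetches"] += 1
    if calls["fetches"] < 3:
        raise ConnectionResetError("simulated dead connection")
    return "downloaded " + path


result = download_with_retry(fake_connect, flaky_fetch, "example.mp3")
```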

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:55:31 -07:00
1b574caba4 radio: transcript-driven speaker name resolution (oracle)
New module src/speaker_oracle.py extracts speaker introductions from
transcripts ("let's talk to William", "we have Clay from the Nerd Junkies",
"in Tara's place, we have Clay", "thanks for the call <name>") and binds
them to non-HOST diarization turns. Pure post-pass on diarization JSONs,
no audio processing — corrects audio-only cosine errors using Mike's
deterministic on-air announcements.
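
An illustrative sketch of the intro extraction, assuming patterns modeled on the quoted phrases. The pattern list, the blacklist entries, and the extract_intros name and return shape are assumptions, not the contents of src/speaker_oracle.py. The case-strict name match, the blacklist, and the 5s same-name dedup follow the Algorithm notes in this message.

```python
import re

# Only the trigger phrase is case-insensitive; the name must be capitalized.
NAME = r"(?P<name>[A-Z][a-z]+)"
INTRO_PATTERNS = [  # hypothetical patterns based on the quoted examples
    ("caller_open", re.compile(r"(?i:let's talk to)\s+" + NAME)),
    ("guest_open", re.compile(r"(?i:we have)\s+" + NAME)),
    ("caller_close", re.compile(r"(?i:thanks for the call),?\s+" + NAME)),
]
BLACKLIST = {"The", "Some", "Our"}  # illustrative false-positive words
DEDUP_WINDOW_S = 5.0  # from the commit message


def extract_intros(segments):
    """segments: iterable of (start_seconds, text). Returns a list of dicts."""
    intros = []
    for start, text in segments:
        for kind, pattern in INTRO_PATTERNS:
            for match in pattern.finditer(text):
                name = match.group("name")
                if name in BLACKLIST:
                    continue
                # Skip a repeat of the same name within the dedup window.
                duplicate = any(
                    i["name"] == name and abs(i["time"] - start) <= DEDUP_WINDOW_S
                    for i in intros
                )
                if not duplicate:
                    intros.append({"time": start, "name": name, "kind": kind})
    return intros


segments = [
    (10.0, "Let's talk to William, you're on the air."),
    (12.0, "OK, let's talk to William."),           # duplicate within 5s
    (300.0, "in Tara's place, we have Clay from the Nerd Junkies"),
    (400.0, "we have some great calls coming up"),  # lowercase, rejected
    (500.0, "We have The best listeners"),          # blacklisted
    (900.0, "Thanks for the call, Kay."),
]
found = extract_intros(segments)
```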

Algorithm:
- Extract intros: regex patterns for caller pickups, guest intros,
  fill-in announcements, caller closes. Case-strict (rejects mid-sentence
  lowercase matches), with a blacklist of common false-positive words.
  Deduplicates same-name intros within 5s.
- Resolve speakers: for each non-HOST turn, find the LATEST opening intro
  at or before turn.start (with 8s forward tolerance for boundary slop).
  Later intros implicitly close earlier callers, so the most recent
  intro wins. No artificial lookback limit (callers can talk for 10+ min).
- Falls back to caller_close patterns within 30s after a turn ends.
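
The resolution rules above can be sketched as follows. This is an assumption-laden illustration, not the real resolve_speakers(): the dict shapes, the resolve_turn name, and the sample times are invented. The 8s forward tolerance, the most-recent-intro-wins rule, and the 30s caller_close fallback come from the list above.

```python
FORWARD_TOLERANCE_S = 8.0  # boundary slop, from the commit message
CLOSE_FALLBACK_S = 30.0    # caller_close window after a turn ends


def resolve_turn(turn, intros):
    """Return a name for one diarization turn, or None.

    turn: dict with "start", "end", "speaker".
    intros: dicts with "time", "name", "kind" (hypothetical shapes).
    """
    if turn["speaker"] == "HOST":
        return None  # only non-HOST turns are resolved

    # Latest opening intro at or before the turn start (plus tolerance).
    openings = [
        i for i in intros
        if i["kind"] != "caller_close"
        and i["time"] <= turn["start"] + FORWARD_TOLERANCE_S
    ]
    if openings:
        return max(openings, key=lambda i: i["time"])["name"]

    # Fallback: first caller_close within 30s after the turn ends.
    closes = [
        i for i in intros
        if i["kind"] == "caller_close"
        and turn["end"] <= i["time"] <= turn["end"] + CLOSE_FALLBACK_S
    ]
    if closes:
        return min(closes, key=lambda i: i["time"])["name"]
    return None


intros = [
    {"time": 130.0, "name": "Kay", "kind": "caller_close"},
    {"time": 200.0, "name": "William", "kind": "caller_open"},
    {"time": 900.0, "name": "Denise", "kind": "caller_open"},
]
turns = [
    {"start": 60.0, "end": 120.0, "speaker": "CALLER"},     # close fallback
    {"start": 195.0, "end": 230.0, "speaker": "CALLER"},    # forward tolerance
    {"start": 1500.0, "end": 1560.0, "speaker": "CALLER"},  # latest intro wins
    {"start": 1560.0, "end": 1600.0, "speaker": "HOST"},
]
resolved = [resolve_turn(t, intros) for t in turns]
```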

Validation on 9-episode test set:
  2018-s10e18: Christopher 190s correctly named (was mislabeled "Tara")
  2012-06-09 : Kay 160s correctly named (was mislabeled "Tara")
  2015-s7e19 : Clay 45s as fill-in for Tara, William 40s as caller
  2016-s8e43 : Charles 630s, Bruce 210s, John 205s — most callers named
  2017-s9e30 : Denise 295s, Tom 115s, Elaine 85s, Jeff 10s
  Many other callers across all episodes correctly named.

Remaining unnamed CO-HOST/CALLER turns (~5-10% of non-HOST time) are
genuine co-host banter or callers Mike never introduced by name.

benchmark.py: adds Phase 2.5 "Name Resolution" between diarization and
Q&A extraction. Prints named-speaker breakdown per episode. Doesn't
modify diarization JSONs (resolution is computed on demand).

Next step: feed named turns into qa_extractor so each Q&A pair gets a
caller name attached for searchability. Also: bootstrap recurring-speaker
profiles (Tara, Tony, Rob, Randall, producers) by accumulating
intro-tagged windows across the full archive once the download completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:48:16 -07:00
c760e430c0 radio: bumper detection in diarizer + full archive download script
Adds a transcript-driven bumper filter to the diarization pipeline. When
a transcript segment matches qa_extractor's promo/bumper signatures, the
overlapping audio windows are labeled BUMPER and the WavLM cosine match
is skipped. Prevents music/promo from being matched against speaker
profiles (the failure mode Mike caught in 2018-s10e18 @ 09:20-10:05).

Code changes:
- src/voice_profiler.py: identify_speakers() takes optional skip_ranges
  parameter; windows whose midpoint falls in a skip range get labeled
  "[bumper]" and skip cosine match
- src/diarizer.py: diarize() takes optional transcript_path; pre-computes
  bumper time ranges via qa_extractor._is_promo_or_bumper, passes to
  identify_speakers; adds BUMPER speaker label
- benchmark.py: passes transcript_path to diarize()
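
A small sketch of the skip_ranges behavior described in the voice_profiler.py bullet. The label_windows function, the window tuples, and the stand-in matcher are hypothetical; the real identify_speakers() signature is not shown in this commit. What it illustrates is the midpoint test: a window whose midpoint lies inside a bumper range is labeled "[bumper]" and never reaches the cosine match.

```python
def label_windows(windows, skip_ranges, match_speaker):
    """Label (start, end) windows, skipping the matcher inside skip ranges."""
    labels = []
    for start, end in windows:
        midpoint = (start + end) / 2.0
        if any(lo <= midpoint < hi for lo, hi in skip_ranges):
            labels.append("[bumper]")  # no cosine match for this window
        else:
            labels.append(match_speaker(start, end))
    return labels


matched = []  # records which windows actually reached the matcher


def fake_match(start, end):
    """Stand-in for the WavLM cosine match (assumption)."""
    matched.append((start, end))
    return "HOST"


# 09:20-10:05 expressed in seconds, as in the 2018-s10e18 example.
skip_ranges = [(560.0, 605.0)]
windows = [(540.0, 550.0), (560.0, 570.0), (595.0, 605.0), (605.0, 615.0)]
labels = label_windows(windows, skip_ranges, fake_match)
```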

Aggregate impact across 9-episode test set:
  Tara attribution: 4880s -> 3680s  (-1200s / -25%)
  Q&A pairs: 17 -> 19 (+2)
    (bumper-flagged segments had been disrupting conversation detection
     in 2017-s9e30 and 2018-s10e18)
  CALLER total: 1320s -> 1190s  (bumpers previously labeled CALLER moved)
  Per-episode bumpers caught: 1-8, total ~165 bumper segments across set

Remaining Tara false positives are real callers acoustically similar to
Tara (Christopher in 2018, Kay in 2012, William and Charles in 2015) and
guest Clay in 2015-s7e19 — those need profile rebuild + Clay profile,
not bumper filtering.

Adds download_full_archive.py, a resumable mirror-style downloader that
walks IX server's /home/gurushow/public_html/archive/{year}/ and copies
all MP3s to archive-data/episodes/. The run is in progress (~589 files,
~10-15GB). The downloaded episodes will supply clean profile windows for
the remaining co-hosts (Tara rebuild, Clay, Tony, Rob, Randall, producers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:17:50 -07:00
e9ac607500 radio show: co-host voice profile, Q&A extraction fixes, archive index
- Build Tom (co-host) voice profile (44 embeddings, 0.698 similarity to Mike)
- diarizer.py: add CO-HOST speaker label for cohost-role profiles
- voice_profiler.py: emit "Cohost: <name>" label for cohost role
- qa_extractor.py: overlap resolution at load time (midpoint boundary split),
  4s CALLER-preference threshold, turn-based caller-intro lookback (2 HOST turns),
  _preceded_by_caller_intro() helper, _PHONE_GREETING pattern,
  751-1041 + "we'll get your problem solved" promo signatures
- benchmark.py: use src.transcriber.transcribe with batch_size=16
- add index_test_episodes.py and build_cohost_profile.py scripts
- add .gitignore (exclude episodes, transcripts, *.db, .venv)
- session log: 2026-04-27-qa-extraction-cohost-indexing.md
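
One possible reading of "midpoint boundary split" from the qa_extractor.py bullet, sketched under assumptions: the split_overlaps name and the turn dict shape are invented, and the real loader may handle nested or multi-way overlaps differently. Here, when two consecutive turns overlap, both are trimmed to the midpoint of the overlapping span.

```python
def split_overlaps(turns):
    """Return copies of turns, sorted by start, with consecutive overlaps
    resolved by moving both boundaries to the midpoint of the overlap."""
    ordered = [dict(t) for t in sorted(turns, key=lambda t: t["start"])]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["start"] < prev["end"]:
            overlap_end = min(prev["end"], cur["end"])
            boundary = (cur["start"] + overlap_end) / 2.0
            prev["end"] = boundary
            cur["start"] = boundary
    return ordered


turns = [
    {"start": 0.0, "end": 12.0, "speaker": "HOST"},
    {"start": 10.0, "end": 20.0, "speaker": "CALLER"},  # overlaps 10-12
    {"start": 20.0, "end": 30.0, "speaker": "HOST"},    # touches, no overlap
]
clean = split_overlaps(turns)
bounds = [(t["start"], t["end"]) for t in clean]
```

This simple version truncates the tail of an earlier turn that fully contains a later one, which may or may not match the actual behavior.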

Result: 2016-s8e43 drops from 12 false-positive Q&A pairs to 2 real caller pairs.
archive.db: 6 episodes, 762 segments, 10 Q&A pairs, FTS5 search verified.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 14:41:04 -07:00
79abef9dc9 radio: diarization pipeline fixes, benchmark setup, test episode set
- Fix voice_profiler threshold bug (HOST label overwrote Unknown unconditionally)
- Audio preload optimization: single ffmpeg per episode, 149.5x realtime on 5070 Ti
- WavLM threshold raised to 0.85 (Mike 0.90-0.99, callers 0.46-0.83)
- Promo/bumper filter: weighted signature scoring, 42->27 clean Q&A pairs
- Text-only Q&A fallback for episodes with no CALLER diarization labels
- TRANSFORMERS_OFFLINE=1 to skip HuggingFace freshness checks
- Add diarize_2018.py for targeted re-run + FTS5 rebuild
- Add benchmark.py + BENCH_SETUP.md for GURU-BEAST-ROG (RTX 4090) comparison
- Commit 9-episode training diarization.json outputs
- Session log: 2026-04-27-diarization-pipeline.md
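
A toy sketch of the thresholded labeling behind the first and third bullets. The function names and the 3-dimensional vectors are made up for illustration (real WavLM embeddings are much larger), and this is not the voice_profiler code. It shows the corrected behavior: HOST is assigned only when the cosine score clears 0.85; otherwise the window stays Unknown.

```python
import math

HOST_THRESHOLD = 0.85  # from the commit message


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_window(embedding, host_profile, threshold=HOST_THRESHOLD):
    """Return (label, score); HOST only when the score clears the threshold."""
    score = cosine(embedding, host_profile)
    label = "HOST" if score >= threshold else "Unknown"
    return label, score


host_profile = [1.0, 0.0, 0.0]        # toy profile vector
host_like = [0.95, 0.10, 0.05]        # scores about 0.99
caller_like = [0.60, 0.80, 0.00]      # scores 0.60

host_label, host_score = label_window(host_like, host_profile)
caller_label, caller_score = label_window(caller_like, host_profile)
```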

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 13:20:40 -07:00
826141a319 Audio processor: working voice profiler with WavLM speaker embeddings
- Voice profiler using microsoft/wavlm-base-sv (512-dim x-vector embeddings)
- Bootstrap from archive: 180 embeddings from 9 episodes across 2010-2018
- Host identification accuracy: 0.87-0.98 similarity for live speech,
  0.60-0.64 for non-host audio (produced intros, co-host)
- Dropped speechbrain dependency (requires torchaudio, CUDA version conflicts)
- Patched torchaudio CUDA 12.8/13.1 version check (warning instead of error)
- Profile stored in voice-profiles/mike-swanson/ with per-chunk embeddings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 12:19:13 -07:00
87f5a9306a Audio processor: fix segment detection with transcript-driven breaks
- Add transcript break phrase detection (going_to_break/coming_back cues)
- Create segments from transcript breaks with silence boundary snapping
- Fix segment dedup in merge_adjacent (handle overlapping segments)
- Add CUDA 12 library path fix (gpu.py + venv activate hook)
- Auto-load existing transcript in detect command
- Tested on 2011-03-05 HR1: correctly identifies commercial break at 34:38
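
A rough sketch of transcript-driven break detection with silence snapping, under stated assumptions: the cue phrases, function names, the 5s snap limit, and the sample segments and silences are all illustrative, not taken from the audio processor. It finds going_to_break/coming_back cues in transcript text and moves each cue time to the midpoint of the nearest detected silence.

```python
import re

BREAK_CUES = {  # hypothetical cue phrases
    "going_to_break": re.compile(r"\b(we'll be right back|after the break)\b", re.I),
    "coming_back": re.compile(r"\b(welcome back|and we're back)\b", re.I),
}


def find_break_cues(segments):
    """segments: (start, end, text). Returns a list of (kind, time) cues."""
    cues = []
    for start, end, text in segments:
        for kind, pattern in BREAK_CUES.items():
            if pattern.search(text):
                # A break begins after the outro line and ends before the return line.
                cues.append((kind, end if kind == "going_to_break" else start))
    return cues


def snap_to_silence(t, silences, max_shift_s=5.0):
    """Snap t to the midpoint of the nearest silence within max_shift_s."""
    best = None
    for s_start, s_end in silences:
        mid = (s_start + s_end) / 2.0
        if abs(mid - t) <= max_shift_s and (best is None or abs(mid - t) < abs(best - t)):
            best = mid
    return t if best is None else best


segments = [
    (2070.0, 2077.0, "Stay with us, we'll be right back."),
    (2260.0, 2265.0, "Welcome back to the show."),
]
silences = [(2077.5, 2078.5), (2259.0, 2260.0)]

cues = find_break_cues(segments)
snapped = [(kind, snap_to_silence(t, silences)) for kind, t in cues]
unsnapped = snap_to_silence(100.0, silences)  # no silence nearby, unchanged
```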

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 11:59:54 -07:00
a1e0442d8b Add radio show audio processor and post-show workflow
- Audio processor CLI tool with 6-stage pipeline: transcribe (faster-whisper GPU),
  diarize (pyannote), detect segments (multi-signal classifier), remove commercials,
  split segments, analyze content (Ollama)
- Post-show workflow doc for episode posts, forum threads, deep-dive blog posts
- Training plan for using 579-episode archive for voice profiles and commercial detection
- Successful test: 45min episode transcribed in 2:37 on RTX 5070 Ti
- Sample transcript output from S7E30 (March 2015)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 11:51:59 -07:00