claudetools

azcomputerguru/claudetools

Commit Graph

Author	SHA1	Message	Date

Mike Swanson

488bf5849e

radio: attach caller names to Q&A pairs from transcript intros

QAPair gets caller_name and caller_role fields populated by a new
attach_caller_names(pairs, transcript_segments) helper. For each pair,
finds the active opening intro at the question_start time (8s forward
tolerance, no backward limit — a caller's call can run for 10+ minutes
and the intro happens once at the start) and attaches the speaker name.
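A minimal sketch of the lookup described above, assuming the intros have already been extracted into (time, name, role) records. The commit's helper takes transcript segments instead; the Intro and QAPair shapes, the role labels, and the function name attach_names_from_intros are hypothetical. Only the 8s forward tolerance and the absence of a backward limit come from the message.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional

FORWARD_TOLERANCE_S = 8.0  # from the commit message


@dataclass
class Intro:  # hypothetical record shape
    time: float  # seconds from episode start
    name: str
    role: str    # assumed labels such as "caller" or "guest"


@dataclass
class QAPair:  # only the fields this sketch needs
    question_start: float
    caller_name: Optional[str] = None
    caller_role: Optional[str] = None


def speaker_at(time: float, intros: list[Intro]) -> Optional[Intro]:
    """Latest intro at or before time + tolerance; intros must be time-sorted."""
    times = [i.time for i in intros]
    idx = bisect_right(times, time + FORWARD_TOLERANCE_S)
    # No backward limit: an intro from many minutes earlier still applies.
    return intros[idx - 1] if idx else None


def attach_names_from_intros(pairs: list[QAPair], intros: list[Intro]) -> list[QAPair]:
    ordered = sorted(intros, key=lambda i: i.time)
    for pair in pairs:
        intro = speaker_at(pair.question_start, ordered)
        if intro is not None:
            pair.caller_name = intro.name
            pair.caller_role = intro.role
    return pairs
```

Because the lookup always takes the latest intro, a later caller's intro replaces the earlier one without any explicit "call ended" marker.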

Validation on 9-episode test set:
  19/19 Q&A pairs (100%) now have caller names attached.

Examples of corrections from oracle attribution:
  2018-s10e18 @ 73:36  Christopher (was misattributed to "Tara")
  2015-s7e19 @ 35:45   William     (was misattributed to "Tara")
  2010-05-08-hr1       Jackie x3, Bruce
  2012-03-10-hr1       Adam x2
  2016-s8e43           John, Doug
  2017-s9e30           Tom, Denise x3, Charlie

speaker_oracle.py: adds speaker_at(time, intros) helper used both by the
existing resolve_speakers() and the new caller-name attachment. Also
adds the "let's fit/bring/put X in/on" intro pattern variant (caught
Charlie at 70:21 in 2017-s9e30 that "talk to X" missed).
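The new pattern variant might look roughly like the following. The actual regexes in speaker_oracle.py are not shown in this commit, so both patterns and the find_intro_name helper are assumptions written only to illustrate why "talk to X" alone would miss "let's fit Charlie in".

```python
import re

# Hypothetical patterns; a capitalised name is required (no IGNORECASE on the name).
NAME = r"(?P<name>[A-Z][a-z]+)"
TALK_TO = re.compile(rf"\b[Ll]et's talk to {NAME}\b")
FIT_IN = re.compile(rf"\b[Ll]et's (?:fit|bring|put) {NAME} (?:in|on)\b")


def find_intro_name(text: str):
    """Return the first introduced name in text, or None if no pattern matches."""
    for pattern in (TALK_TO, FIT_IN):
        match = pattern.search(text)
        if match:
            return match.group("name")
    return None
```

Requiring the capitalised name keeps phrases such as "let's put it on hold" from matching.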

download_full_archive.py: adds an SSH keepalive every 30s and per-file
retry on failure (up to 3 attempts with reconnect). The earlier run hung
on a dead connection at file 109 of 589 with no recovery; the restarted
run is transferring at ~10 MB/s vs ~2-3 MB/s before.
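The per-file retry can be sketched without committing to an SSH library, which the message does not name. Here connect and fetch are hypothetical callables supplied by the caller, and treating OSError as the failure signal is an assumption. If Paramiko were the client, the keepalive could be requested inside connect via Transport.set_keepalive(30).

```python
import time


def download_with_retry(connect, fetch, remote_path, conn=None,
                        max_attempts=3, backoff_s=0.0):
    """Fetch one file, reconnecting after each failure.

    connect() -> connection and fetch(connection, remote_path) are
    hypothetical callables. Returns the live connection so the caller
    can reuse it for the next file.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            if conn is None:
                conn = connect()  # a keepalive would be configured here
            fetch(conn, remote_path)
            return conn
        except OSError as exc:  # includes ConnectionResetError and TimeoutError
            last_error = exc
            conn = None  # drop the dead connection; next attempt reconnects
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)
    raise RuntimeError(
        f"gave up on {remote_path} after {max_attempts} attempts"
    ) from last_error
```

Returning the connection lets a loop over all 589 files reuse one session until it actually fails.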

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 16:55:31 -07:00

Mike Swanson

1b574caba4

radio: transcript-driven speaker name resolution (oracle)

New module src/speaker_oracle.py extracts speaker introductions from
transcripts ("let's talk to William", "we have Clay from the Nerd Junkies",
"in Tara's place, we have Clay", "thanks for the call <name>") and binds
them to non-HOST diarization turns. Pure post-pass on diarization JSONs,
no audio processing — corrects audio-only cosine errors using Mike's
deterministic on-air announcements.

Algorithm:
- Extract intros: regex patterns for caller pickups, guest intros,
  fill-in announcements, caller closes. Case-strict (rejects mid-sentence
  lowercase matches), with a blacklist of common false-positive words.
  Deduplicates same-name intros within 5s.
- Resolve speakers: for each non-HOST turn, find the LATEST opening intro
  at or before turn.start (with 8s forward tolerance for boundary slop).
  Later intros implicitly close earlier callers, so the most recent
  intro wins. No artificial lookback limit (callers can talk for 10+ min).
- Falls back to caller_close patterns within 30s after a turn ends.
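The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the patterns, the blacklist contents, the (start, text) segment shape, and the names extract_intros and resolve_turn are hypothetical. The 5s dedupe window, 8s forward tolerance, and 30s close window are taken from the message.

```python
import re
from dataclasses import dataclass


@dataclass
class Intro:  # hypothetical record shape
    time: float
    name: str
    kind: str  # "open" or "close"


# Hypothetical patterns; names must be capitalised (case-strict).
OPEN_PATTERNS = [
    re.compile(r"\b[Ll]et's talk to (?P<name>[A-Z][a-z]+)"),
    re.compile(r"\b[Ww]e have (?P<name>[A-Z][a-z]+)"),
]
CLOSE_PATTERNS = [
    re.compile(r"\b[Tt]hanks for the call,? (?P<name>[A-Z][a-z]+)"),
]
BLACKLIST = {"The", "This", "That", "Some"}  # assumed false-positive words


def extract_intros(segments):
    """segments: iterable of (start_seconds, text) pairs (assumed shape)."""
    intros = []
    for start, text in segments:
        for kind, patterns in (("open", OPEN_PATTERNS), ("close", CLOSE_PATTERNS)):
            for pattern in patterns:
                for match in pattern.finditer(text):
                    name = match.group("name")
                    if name in BLACKLIST:
                        continue
                    # Drop a same-name intro within 5s of one already kept.
                    if any(i.name == name and abs(i.time - start) <= 5.0
                           for i in intros):
                        continue
                    intros.append(Intro(start, name, kind))
    return sorted(intros, key=lambda i: i.time)


def resolve_turn(turn_start, turn_end, intros):
    """Name for one non-HOST turn, or None. intros must be time-sorted."""
    opens = [i for i in intros
             if i.kind == "open" and i.time <= turn_start + 8.0]
    if opens:
        return opens[-1].name  # latest opening intro wins, no lookback limit
    closes = [i for i in intros
              if i.kind == "close" and turn_end <= i.time <= turn_end + 30.0]
    return closes[0].name if closes else None
```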

Validation on 9-episode test set:
  2018-s10e18: Christopher 190s correctly named (was mislabeled "Tara")
  2012-06-09 : Kay 160s correctly named (was mislabeled "Tara")
  2015-s7e19 : Clay 45s as fill-in for Tara, William 40s as caller
  2016-s8e43 : Charles 630s, Bruce 210s, John 205s — most callers named
  2017-s9e30 : Denise 295s, Tom 115s, Elaine 85s, Jeff 10s
  Many other callers across all episodes correctly named.

Remaining unnamed CO-HOST/CALLER (~5-10% of non-HOST time) are real
co-host banter or callers without explicit Mike-introductions.

benchmark.py: adds Phase 2.5 "Name Resolution" between diarization and
Q&A extraction. Prints named-speaker breakdown per episode. Doesn't
modify diarization JSONs (resolution is computed on demand).
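A per-episode breakdown of this kind could be computed on demand along these lines. The turn dictionary keys (speaker, start, end) and the resolve callback are assumptions about the diarization JSON, which the commit does not show. The input turns are read but never modified.

```python
from collections import defaultdict


def named_speaker_breakdown(turns, resolve):
    """Total seconds per resolved name across non-HOST turns.

    turns: list of dicts with "speaker", "start", "end" (assumed schema).
    resolve: callable returning a name for a turn, or None if unresolved.
    """
    totals = defaultdict(float)
    for turn in turns:
        if turn["speaker"] == "HOST":
            continue
        # Unresolved turns keep their diarization label (e.g. "CALLER").
        name = resolve(turn) or turn["speaker"]
        totals[name] += turn["end"] - turn["start"]
    # Longest-speaking names first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```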

Next step: feed named turns into qa_extractor so Q&A pairs get caller
name attached for searchability. Also: bootstrap recurring-speaker
profiles (Tara, Tony, Rob, Randall, producers) by accumulating
intro-tagged windows across the full archive once download completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 16:48:16 -07:00

2 Commits