radio: rename Tom -> Tara, expand speaker roster

Mike confirmed there is no co-host named "Tom" — the voice in 2014-s6e19
and 2016-s8e43 is Tara. The 5070 Ti session fabricated the Tom identity.
The voice profile itself (44 embeddings, 0.698 cosine vs Mike) is correct;
only the human label was wrong.

Rename swept:
- voice-profiles/tom/ -> voice-profiles/tara/ (git mv preserves all .npy)
- voice-profiles/profiles.json: "Tom" key -> "Tara"
- build_cohost_profile.py: TOM_WINDOWS -> TARA_WINDOWS, COHOST_NAME, comments
- 2026-04-27-qa-extraction-cohost-indexing.md: correction header + body sweep
- 2026-04-27-4090-benchmark-and-test-set.md: closure note
- .claude/memory/radio_show_no_cohost_named_tom.md: resolution + speaker roster

Diarization re-run after rename so speaker_map emits "Cohost: Tara".
Q&A counts unchanged (rename is label-only): 9 pairs across 6 test episodes.

Tara distribution from the post-rename diarization (per-episode % of audio):
  2011-03-12-hr1   140s   5.6%   likely false positive (call-in only)
  2012-03-10-hr1    30s   1.1%   likely false positive (call-in only)
  2012-06-09-hr1   340s  12.8%   suspicious — pending Mike confirm
  2014-s6e19       680s  23.3%   confirmed
  2016-s8e43      1890s  35.5%   confirmed
  2017-s9e30       610s  11.4%   plausible — pending Mike confirm

Broader speaker-roster context Mike provided this session (saved to
memory): the show has had multiple co-hosts (Tara, Randall, Rob) plus
producers/board ops (Andrew, Shannon, Ken, others) who would sometimes
go on-air. Only Tara has a profile so far. Every other speaker is
currently labeled CALLER, which means small CO-HOST attributions in
unexpected episodes (e.g. 2011/2012) may actually be a producer rather
than a false positive — Mike to spot-check.

Action item before full-archive run: build profiles for Randall, Rob,
and the named producers to avoid systematic Q&A false positives in
early-years and 2018/2019 episodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:11:03 -07:00
parent b9a4bb8807
commit fb683d6a05
55 changed files with 122 additions and 53 deletions

@@ -1,5 +1,5 @@
"""
Build voice profile for Tom (co-host) from known co-host speech windows.
Build voice profile for Tara (co-host) from known co-host speech windows.
Uses CALLER-labeled windows from the first 60 min of co-host-era episodes,
before any real callers would have called in.
@@ -32,10 +32,10 @@ console.print(f"Device: {device}")
profiler = VoiceProfiler(PROFILES_DIR, device=device)
# Tom's known speech windows per episode
# Tara's known speech windows per episode
# CALLER turns from diarization that are in the first 60 min (before real callers)
# Windows at 0-40s excluded (promo/jingle, not Tom's voice)
TOM_WINDOWS = {
# Windows at 0-40s excluded (promo/jingle, not Tara's voice)
TARA_WINDOWS = {
"2014-s6e19.mp3": [
(195, 260),
(320, 425),
@@ -53,7 +53,7 @@ TOM_WINDOWS = {
],
}
COHOST_NAME = "Tom"
COHOST_NAME = "Tara"
if COHOST_NAME not in profiler.profiles:
profiler.profiles[COHOST_NAME] = SpeakerProfile(
@@ -66,7 +66,7 @@ if COHOST_NAME not in profiler.profiles:
profile = profiler.profiles[COHOST_NAME]
console.print(f"\n[bold]Building co-host profile for: {COHOST_NAME}[/bold]")
for ep_name, windows in TOM_WINDOWS.items():
for ep_name, windows in TARA_WINDOWS.items():
ep_path = EPISODES_DIR / ep_name
if not ep_path.exists():
console.print(f"[yellow] Skipping {ep_name} — not found[/yellow]")
@@ -101,7 +101,7 @@ if not profile.embeddings:
sys.exit(1)
profile.compute_composite()
console.print(f"\n[green]Tom profile built: {profile.num_samples} embeddings "
console.print(f"\n[green]Tara profile built: {profile.num_samples} embeddings "
f"from {len(profile.source_episodes)} episodes[/green]")
# Verify: check cosine similarity vs Mike to ensure separation
@@ -109,7 +109,7 @@ mike = profiler.profiles.get("Mike Swanson")
if mike and mike.composite_embedding is not None and profile.composite_embedding is not None:
sim = float(np.dot(mike.composite_embedding, profile.composite_embedding) /
(np.linalg.norm(mike.composite_embedding) * np.linalg.norm(profile.composite_embedding) + 1e-8))
console.print(f"Tom vs Mike similarity: {sim:.3f} (lower is better separation)")
console.print(f"Tara vs Mike similarity: {sim:.3f} (lower is better separation)")
profiler.save_profiles()
console.print("[bold green]Profile saved.[/bold green]")

@@ -25,13 +25,19 @@ This run uses the post-overhaul code (commit `e9ac607`): batched Whisper transcr
---
## Important — "Tom" co-host name is wrong
## Co-host identity correction — Tara, not Tom
The 5070 Ti session built a voice profile labeled `voice-profiles/tom/` and described it in the session log as "Tom, regular in-studio co-host/board-op roughly 2013-2016." Mike confirmed on this session: **there is no co-host named Tom**. The voice profile is real (clean cosine separation, 0.698 vs Mike) and the diarization correctly identifies the second speaker, but the human identity attached to it is hallucinated.
The 5070 Ti session fabricated a co-host named "Tom" — Mike confirmed there is no such person on the show. After listening to the source windows, Mike identified the voice in both 2014-s6e19 and 2016-s8e43 as **Tara** (a real co-host; the show has had multiple over the years).
The directory, `profiles.json` entry, `build_cohost_profile.py` references, and the 5070 Ti session log all carry the bogus name. Identity TBD pending Mike confirming who that voice actually is.
Rename swept this session:
- `voice-profiles/tom/` → `voice-profiles/tara/` (git mv, all 44 embeddings + composite preserved)
- `voice-profiles/profiles.json`: `"Tom"` key → `"Tara"`
- `build_cohost_profile.py`: docstring, `TOM_WINDOWS` → `TARA_WINDOWS`, `COHOST_NAME = "Tara"`, console output strings
- `projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md`: correction header added, all body references updated
- `.claude/memory/radio_show_no_cohost_named_tom.md`: resolution recorded
- Diarization re-run post-rename so `speaker_map` in each `diarization.json` emits `Cohost: Tara`
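The `profiles.json` key rename in the list above can be sketched as follows. This is a minimal illustration, not the code used in this session: `rename_profile_key` is a hypothetical helper, and the sample dict is made up (the `"host"` role for Mike is an assumption; only `role: cohost` and `num_samples: 44` appear in the real file).

```python
def rename_profile_key(profiles: dict, old: str, new: str) -> dict:
    """Return a copy of `profiles` with key `old` renamed to `new`.

    Key order is preserved so the resulting JSON diff stays a one-line change.
    """
    if old not in profiles:
        raise KeyError(f"no profile named {old!r}")
    if new in profiles:
        raise ValueError(f"profile {new!r} already exists")
    return {(new if name == old else name): data for name, data in profiles.items()}


# Illustrative data only; the real profiles.json carries more fields.
profiles = {
    "Mike Swanson": {"role": "host"},  # role value assumed for the example
    "Tom": {"role": "cohost", "num_samples": 44},
}
renamed = rename_profile_key(profiles, "Tom", "Tara")
```

Returning a new dict rather than mutating in place keeps the original available for a before/after comparison before the file is written back.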
Memory entry added: `.claude/memory/radio_show_no_cohost_named_tom.md`. The profile will be renamed once Mike provides the correct identity.
The 5070 Ti session log's claim of "Tom was the regular co-host roughly 2013-2016" carried two errors: the wrong name AND an unverified tenure window. The corrected log notes Tara appears in 2014-s6e19 and 2016-s8e43 only — generalizing to the full 2013-2016 era hasn't been confirmed.
---
@@ -115,6 +121,31 @@ archive.db is not on this machine — index update happens on DESKTOP-0O8A1RL.
---
## Tara distribution across the test set (post-rename diarization)
After the rename, the diarizer's per-episode `speaker_map` shows Tara in **all 6** test episodes — well beyond the 2014+2016 the 5070 Ti session log claimed.
| Episode | Tara (seconds) | % of audio | Read |
|---|---|---|---|
| 2011-03-12-hr1 | 140s (2:20) | 5.6% | likely false positive — Mike confirms 2011 was pure call-in |
| 2012-03-10-hr1 | 30s (0:30) | 1.1% | likely false positive — 2012 was pure call-in |
| 2012-06-09-hr1 | 340s (5:40) | 12.8% | suspicious — too much for noise; awaiting Mike confirm |
| 2014-s6e19 | 680s (11:20) | 23.3% | confirmed (Mike) |
| 2016-s8e43 | 1890s (31:30) | 35.5% | confirmed (Mike) |
| 2017-s9e30 | 610s (10:10) | 11.4% | plausible — pending Mike confirm; 5070 Ti log only listed Tara in 2014+2016 |
**Mike's broader correction (2026-04-27):**
- **Co-hosts** rotated through over the years. Confirmed: Tara, Randall (early years), Rob (early years + occasional 2018/2019).
- **Producers / board ops** would sometimes go on-air. Named so far: Andrew, Shannon, Ken, plus "a couple more" Mike doesn't recall off-hand.
Of all these, only Tara has a voice profile. Every other co-host AND every producer-on-air moment in the archive is currently being labeled CALLER, which inflates Q&A false positives in those eras and episodes.
The small Tara percentages in 2011/2012 (1-13%) most likely reflect the 0.85 cosine threshold hitting on a similar-sounding speaker that isn't actually Tara — could be a producer (Andrew/Shannon/Ken/etc) or another early-years voice we haven't catalogued. Worth Mike sampling these short windows to identify before assuming false positive vs producer.
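To make the threshold behaviour above concrete, here is a minimal sketch of cosine-threshold matching. It assumes the diarizer compares each segment embedding against the profile's composite embedding and accepts a match at 0.85 or above. `matches_profile` and the three-dimensional toy vectors are hypothetical and not taken from the pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (epsilon guards zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)


def matches_profile(segment, composite, threshold=0.85):
    """True when a segment embedding is close enough to a profile composite."""
    return cosine_similarity(segment, composite) >= threshold


# Toy embeddings; real speaker embeddings have far more dimensions.
tara_composite = [1.0, 0.0, 0.0]
similar_voice = [0.9, 0.3, 0.0]    # a different speaker with a nearby embedding
different_voice = [0.5, 0.8, 0.0]  # a clearly distinct speaker
```

With these toy vectors, `similar_voice` clears the threshold and would be labeled Tara even though it stands in for a different speaker, which is the failure mode suspected in the 2011/2012 episodes. `different_voice` falls well below 0.85 and is rejected.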
**Implication for full-archive runs:** before processing the 579-episode archive in earnest, build profiles for at least Randall, Rob, and the named producers. Otherwise the Q&A extraction across early-years and 2018/2019 episodes will inherit the same false-positive pattern that originally produced 12 bogus pairs in 2016-s8e43.
---
## Pending work (from 5070 Ti session, still unblocked)
1. **Resolve "Tom" identity** — Mike to confirm who the second voice is in 2014-s6e19 and 2016-s8e43. Then rename `voice-profiles/tom/`, update `profiles.json`, fix labels in code. Until then, voice-profile data is correct but mislabeled.

@@ -2,8 +2,8 @@
"num_speakers": 3,
"speaker_map": {
"HOST": "HOST",
"CALLER": "CALLER",
"CO-HOST": "CO-HOST"
"CO-HOST": "CO-HOST",
"CALLER": "CALLER"
},
"turns": [
{

@@ -2,8 +2,8 @@
"num_speakers": 3,
"speaker_map": {
"HOST": "HOST",
"CALLER": "CALLER",
"CO-HOST": "CO-HOST"
"CO-HOST": "CO-HOST",
"CALLER": "CALLER"
},
"turns": [
{

@@ -2,8 +2,8 @@
"num_speakers": 3,
"speaker_map": {
"HOST": "HOST",
"CALLER": "CALLER",
"CO-HOST": "CO-HOST"
"CO-HOST": "CO-HOST",
"CALLER": "CALLER"
},
"turns": [
{

@@ -2,8 +2,8 @@
"num_speakers": 3,
"speaker_map": {
"HOST": "HOST",
"CALLER": "CALLER",
"CO-HOST": "CO-HOST"
"CO-HOST": "CO-HOST",
"CALLER": "CALLER"
},
"turns": [
{

@@ -23,7 +23,7 @@
"2018-s10e21.mp3"
]
},
"Tom": {
"Tara": {
"role": "cohost",
"num_samples": 44,
"source_episodes": [