radio: rename Tom -> Tara, expand speaker roster

Mike confirmed there is no co-host named "Tom"; the voice in 2014-s6e19 and
2016-s8e43 is Tara. The 5070 Ti session fabricated the Tom identity. The voice
profile itself (44 embeddings, 0.698 cosine vs Mike) is correct; only the human
label was wrong.

Rename swept:
- voice-profiles/tom/ -> voice-profiles/tara/ (git mv preserves all .npy)
- voice-profiles/profiles.json: "Tom" key -> "Tara"
- build_cohost_profile.py: TOM_WINDOWS -> TARA_WINDOWS, COHOST_NAME, comments
- 2026-04-27-qa-extraction-cohost-indexing.md: correction header + body sweep
- 2026-04-27-4090-benchmark-and-test-set.md: closure note
- .claude/memory/radio_show_no_cohost_named_tom.md: resolution + speaker roster

Diarization re-run after rename so speaker_map emits "Cohost: Tara". Q&A counts
unchanged (rename is label-only): 9 pairs across 6 test episodes.

Tara distribution from the post-rename diarization (per-episode % of audio):

  2011-03-12-hr1    140s   5.6%  likely false positive (call-in only)
  2012-03-10-hr1     30s   1.1%  likely false positive (call-in only)
  2012-06-09-hr1    340s  12.8%  suspicious, pending Mike confirm
  2014-s6e19        680s  23.3%  confirmed
  2016-s8e43       1890s  35.5%  confirmed
  2017-s9e30        610s  11.4%  plausible, pending Mike confirm

Broader speaker-roster context Mike provided this session (saved to memory):
the show has had multiple co-hosts (Tara, Randall, Rob) plus producers/board
ops (Andrew, Shannon, Ken, others) who would sometimes go on-air. Only Tara has
a profile so far. Every other speaker is currently labeled CALLER, which means
small CO-HOST attributions in unexpected episodes (e.g. 2011/2012) may actually
be a producer rather than a false positive. Mike to spot-check.

Action item before full-archive run: build profiles for Randall, Rob, and the
named producers to avoid systematic Q&A false positives in early-years and
2018/2019 episodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:11:03 -07:00
parent b9a4bb8807
commit fb683d6a05
55 changed files with 122 additions and 53 deletions
--- a/projects/radio-show/audio-processor/build_cohost_profile.py
+++ b/projects/radio-show/audio-processor/build_cohost_profile.py
@@ -1,5 +1,5 @@
 """
-Build voice profile for Tom (co-host) from known co-host speech windows.
+Build voice profile for Tara (co-host) from known co-host speech windows.

 Uses CALLER-labeled windows from the first 60 min of co-host-era episodes,
 before any real callers would have called in.
@@ -32,10 +32,10 @@ console.print(f"Device: {device}")

 profiler = VoiceProfiler(PROFILES_DIR, device=device)

-# Tom's known speech windows per episode
+# Tara's known speech windows per episode
 # CALLER turns from diarization that are in the first 60 min (before real callers)
-# Windows at 0-40s excluded (promo/jingle, not Tom's voice)
-TOM_WINDOWS = {
+# Windows at 0-40s excluded (promo/jingle, not Tara's voice)
+TARA_WINDOWS = {
    "2014-s6e19.mp3": [
        (195, 260),
        (320, 425),
@@ -53,7 +53,7 @@ TOM_WINDOWS = {
    ],
 }

-COHOST_NAME = "Tom"
+COHOST_NAME = "Tara"

 if COHOST_NAME not in profiler.profiles:
    profiler.profiles[COHOST_NAME] = SpeakerProfile(
@@ -66,7 +66,7 @@ if COHOST_NAME not in profiler.profiles:
 profile = profiler.profiles[COHOST_NAME]
 console.print(f"\n[bold]Building co-host profile for: {COHOST_NAME}[/bold]")

-for ep_name, windows in TOM_WINDOWS.items():
+for ep_name, windows in TARA_WINDOWS.items():
    ep_path = EPISODES_DIR / ep_name
    if not ep_path.exists():
        console.print(f"[yellow]  Skipping {ep_name} — not found[/yellow]")
@@ -101,7 +101,7 @@ if not profile.embeddings:
    sys.exit(1)

 profile.compute_composite()
-console.print(f"\n[green]Tom profile built: {profile.num_samples} embeddings "
+console.print(f"\n[green]Tara profile built: {profile.num_samples} embeddings "
              f"from {len(profile.source_episodes)} episodes[/green]")

 # Verify: check cosine similarity vs Mike to ensure separation
@@ -109,7 +109,7 @@ mike = profiler.profiles.get("Mike Swanson")
 if mike and mike.composite_embedding is not None and profile.composite_embedding is not None:
    sim = float(np.dot(mike.composite_embedding, profile.composite_embedding) /
                (np.linalg.norm(mike.composite_embedding) * np.linalg.norm(profile.composite_embedding) + 1e-8))
-    console.print(f"Tom vs Mike similarity: {sim:.3f} (lower is better separation)")
+    console.print(f"Tara vs Mike similarity: {sim:.3f} (lower is better separation)")

 profiler.save_profiles()
 console.print("[bold green]Profile saved.[/bold green]")
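
The separation check at the end of this script, combined with the single 0.85 threshold described in the session log, can be sketched in isolation. This is a minimal illustration and not the project's `VoiceProfiler` code: `label_embedding`, the toy 3-dimensional vectors, and the fallback to `CALLER` are assumptions made for the example. Only the 0.85 threshold, the `1e-8` guard, and the `Cohost:` label prefix come from the source.

```python
import numpy as np

THRESHOLD = 0.85  # single bar used for every profile, per the session log


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity with the same 1e-8 guard as the script above."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def label_embedding(embedding, composites, threshold=THRESHOLD):
    """Return the best-matching profile label, or "CALLER" if none clears the bar.

    `composites` maps a label to its composite embedding (hypothetical helper).
    """
    best_label, best_sim = "CALLER", threshold
    for label, composite in composites.items():
        sim = cosine(embedding, composite)
        if sim >= best_sim:
            best_label, best_sim = label, sim
    return best_label


# Toy vectors (made up), chosen only so the two composites are clearly separated.
composites = {
    "Mike Swanson": np.array([1.0, 0.0, 0.0]),
    "Cohost: Tara": np.array([0.6, 0.8, 0.0]),
}

near_tara = np.array([0.5, 0.85, 0.1])  # close to the Tara composite
unknown = np.array([0.0, 0.2, 1.0])     # close to neither composite

print(label_embedding(near_tara, composites))
print(label_embedding(unknown, composites))
```

With one shared threshold, any voice that matches no profile falls through to `CALLER`, which is why unprofiled co-hosts and producers end up in the caller bucket.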
--- a/projects/radio-show/audio-processor/session-logs/2026-04-27-4090-benchmark-and-test-set.md
+++ b/projects/radio-show/audio-processor/session-logs/2026-04-27-4090-benchmark-and-test-set.md
@@ -25,13 +25,19 @@ This run uses the post-overhaul code (commit `e9ac607`): batched Whisper transcr

 ---

-## Important — "Tom" co-host name is wrong
+## Co-host identity correction — Tara, not Tom

-The 5070 Ti session built a voice profile labeled `voice-profiles/tom/` and described it in the session log as "Tom, regular in-studio co-host/board-op roughly 2013-2016." Mike confirmed on this session: **there is no co-host named Tom**. The voice profile is real (clean cosine separation, 0.698 vs Mike) and the diarization correctly identifies the second speaker, but the human identity attached to it is hallucinated.
+The 5070 Ti session fabricated a co-host named "Tom" — Mike confirmed there is no such person on the show. After listening to the source windows, Mike identified the voice in both 2014-s6e19 and 2016-s8e43 as **Tara** (a real co-host; the show has had multiple over the years).

-The directory, `profiles.json` entry, `build_cohost_profile.py` references, and the 5070 Ti session log all carry the bogus name. Identity TBD pending Mike confirming who that voice actually is.
+Rename swept this session:
+- `voice-profiles/tom/` → `voice-profiles/tara/` (git mv, all 44 embeddings + composite preserved)
+- `voice-profiles/profiles.json`: `"Tom"` key → `"Tara"`
+- `build_cohost_profile.py`: docstring, `TOM_WINDOWS` → `TARA_WINDOWS`, `COHOST_NAME = "Tara"`, console output strings
+- `projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md`: correction header added, all body references updated
+- `.claude/memory/radio_show_no_cohost_named_tom.md`: resolution recorded
+- Diarization re-run post-rename so `speaker_map` in each `diarization.json` emits `Cohost: Tara`

-Memory entry added: `.claude/memory/radio_show_no_cohost_named_tom.md`. The profile will be renamed once Mike provides the correct identity.
+The 5070 Ti session log's claim of "Tom was the regular co-host roughly 2013-2016" carried two errors: the wrong name AND an unverified tenure window. The corrected log notes Tara appears in 2014-s6e19 and 2016-s8e43 only — generalizing to the full 2013-2016 era hasn't been confirmed.

 ---

@@ -115,6 +121,31 @@ archive.db is not on this machine — index update happens on DESKTOP-0O8A1RL.

 ---

+## Tara distribution across the test set (post-rename diarization)
+
+After the rename, the diarizer's per-episode `speaker_map` shows Tara in **all 6** test episodes — well beyond the 2014+2016 the 5070 Ti session log claimed.
+
+| Episode | Tara (seconds) | % of audio | Read |
+|---|---|---|---|
+| 2011-03-12-hr1 | 140s (2:20) | 5.6% | likely false positive — Mike confirms 2011 was pure call-in |
+| 2012-03-10-hr1 | 30s (0:30) | 1.1% | likely false positive — 2012 was pure call-in |
+| 2012-06-09-hr1 | 340s (5:40) | 12.8% | suspicious — too much for noise; awaiting Mike confirm |
+| 2014-s6e19     | 680s (11:20) | 23.3% | confirmed (Mike) |
+| 2016-s8e43     | 1890s (31:30) | 35.5% | confirmed (Mike) |
+| 2017-s9e30     | 610s (10:10) | 11.4% | plausible — pending Mike confirm; 5070 Ti log only listed Tara in 2014+2016 |
+
+**Mike's broader correction (2026-04-27):**
+- **Co-hosts** rotated through over the years. Confirmed: Tara, Randall (early years), Rob (early years + occasional 2018/2019).
+- **Producers / board ops** would sometimes go on-air. Named so far: Andrew, Shannon, Ken, plus "a couple more" Mike doesn't recall off-hand.
+
+Of all these, only Tara has a voice profile. Every other co-host AND every producer-on-air moment in the archive is currently being labeled CALLER, which inflates Q&A false positives in those eras and episodes.
+
+The small Tara percentages in 2011/2012 (1-13%) most likely reflect the 0.85 cosine threshold hitting on a similar-sounding speaker that isn't actually Tara — could be a producer (Andrew/Shannon/Ken/etc) or another early-years voice we haven't catalogued. Worth Mike sampling these short windows to identify before assuming false positive vs producer.
+
+**Implication for full-archive runs:** before processing the 579-episode archive in earnest, build profiles for at least Randall, Rob, and the named producers. Otherwise the Q&A extraction across early-years and 2018/2019 episodes will inherit the same false-positive pattern that originally produced 12 bogus pairs in 2016-s8e43.
+
+---
+
 ## Pending work (from 5070 Ti session, still unblocked)

 1. **Resolve "Tom" identity** — Mike to confirm who the second voice is in 2014-s6e19 and 2016-s8e43. Then rename `voice-profiles/tom/`, update `profiles.json`, fix labels in code. Until then, voice-profile data is correct but mislabeled.
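
The "% of audio" column in the Tara distribution table is the CO-HOST speaking time divided by the episode length. Below is a minimal sketch of that arithmetic. It assumes each entry in `turns` carries `speaker`, `start`, and `end` fields; the diff shows only the opening of the `turns` array, so those field names are a guess. The turns and the 3,000-second episode length are made up and not taken from any `diarization.json`.

```python
def speaker_share(turns, speaker, total_seconds):
    """Return (seconds, percent of audio) attributed to `speaker`."""
    seconds = sum(t["end"] - t["start"] for t in turns if t["speaker"] == speaker)
    return seconds, 100.0 * seconds / total_seconds


def mmss(seconds):
    """Format a duration in seconds as m:ss, the way the table does."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"


# Made-up turns for a hypothetical 3,000-second episode.
turns = [
    {"speaker": "HOST", "start": 0.0, "end": 1000.0},
    {"speaker": "CO-HOST", "start": 1000.0, "end": 1090.0},
    {"speaker": "CALLER", "start": 1090.0, "end": 2000.0},
    {"speaker": "CO-HOST", "start": 2000.0, "end": 2060.0},
    {"speaker": "HOST", "start": 2060.0, "end": 3000.0},
]

seconds, percent = speaker_share(turns, "CO-HOST", total_seconds=3000.0)
print(f"{seconds:.0f}s ({mmss(seconds)}) {percent:.1f}%")
```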
--- a/projects/radio-show/audio-processor/test-data/transcripts/2011-03-12-hr1/diarization.json
+++ b/projects/radio-show/audio-processor/test-data/transcripts/2011-03-12-hr1/diarization.json
@@ -2,8 +2,8 @@
  "num_speakers": 3,
  "speaker_map": {
    "HOST": "HOST",
-    "CALLER": "CALLER",
-    "CO-HOST": "CO-HOST"
+    "CO-HOST": "CO-HOST",
+    "CALLER": "CALLER"
  },
  "turns": [
    {
--- a/projects/radio-show/audio-processor/test-data/transcripts/2012-03-10-hr1/diarization.json
+++ b/projects/radio-show/audio-processor/test-data/transcripts/2012-03-10-hr1/diarization.json
@@ -2,8 +2,8 @@
  "num_speakers": 3,
  "speaker_map": {
    "HOST": "HOST",
-    "CALLER": "CALLER",
-    "CO-HOST": "CO-HOST"
+    "CO-HOST": "CO-HOST",
+    "CALLER": "CALLER"
  },
  "turns": [
    {
--- a/projects/radio-show/audio-processor/test-data/transcripts/2012-06-09-hr1/diarization.json
+++ b/projects/radio-show/audio-processor/test-data/transcripts/2012-06-09-hr1/diarization.json
@@ -2,8 +2,8 @@
  "num_speakers": 3,
  "speaker_map": {
    "HOST": "HOST",
-    "CALLER": "CALLER",
-    "CO-HOST": "CO-HOST"
+    "CO-HOST": "CO-HOST",
+    "CALLER": "CALLER"
  },
  "turns": [
    {
--- a/projects/radio-show/audio-processor/test-data/transcripts/2017-s9e30/diarization.json
+++ b/projects/radio-show/audio-processor/test-data/transcripts/2017-s9e30/diarization.json
@@ -2,8 +2,8 @@
  "num_speakers": 3,
  "speaker_map": {
    "HOST": "HOST",
-    "CALLER": "CALLER",
-    "CO-HOST": "CO-HOST"
+    "CO-HOST": "CO-HOST",
+    "CALLER": "CALLER"
  },
  "turns": [
    {
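
In the four `diarization.json` hunks shown here, `speaker_map` gains no new entries; the `CALLER` and `CO-HOST` keys only swap positions. A quick check, using the two versions copied from the hunks above, confirms that the old and new maps are equal once parsed:

```python
import json

old_map = json.loads('{"HOST": "HOST", "CALLER": "CALLER", "CO-HOST": "CO-HOST"}')
new_map = json.loads('{"HOST": "HOST", "CO-HOST": "CO-HOST", "CALLER": "CALLER"}')

# dict equality ignores insertion order, so a reorder-only change compares equal.
same_content = old_map == new_map
# The key order itself did change, which is what the textual diff reports.
order_changed = list(old_map) != list(new_map)

print(same_content, order_changed)
```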
--- a/projects/radio-show/audio-processor/voice-profiles/profiles.json
+++ b/projects/radio-show/audio-processor/voice-profiles/profiles.json
@@ -23,7 +23,7 @@
      "2018-s10e21.mp3"
    ]
  },
-  "Tom": {
+  "Tara": {
    "role": "cohost",
    "num_samples": 44,
    "source_episodes": [
--- a/projects/radio-show/audio-processor/voice-profiles/tara/composite.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/composite.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0000.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0000.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0001.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0001.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0002.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0002.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0003.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0003.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0004.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0004.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0005.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0005.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0006.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0006.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0007.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0007.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0008.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0008.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0009.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0009.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0010.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0010.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0011.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0011.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0012.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0012.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0013.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0013.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0014.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0014.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0015.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0015.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0016.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0016.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0017.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0017.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0018.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0018.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0019.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0019.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0020.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0020.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0021.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0021.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0022.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0022.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0023.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0023.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0024.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0024.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0025.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0025.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0026.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0026.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0027.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0027.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0028.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0028.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0029.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0029.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0030.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0030.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0031.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0031.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0032.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0032.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0033.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0033.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0034.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0034.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0035.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0035.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0036.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0036.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0037.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0037.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0038.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0038.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0039.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0039.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0040.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0040.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0041.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0041.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0042.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0042.npy
--- a/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0043.npy
+++ b/projects/radio-show/audio-processor/voice-profiles/tara/embedding_0043.npy
--- a/projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md
+++ b/projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md
@@ -2,6 +2,14 @@
 **Date:** 2026-04-27
 **Project:** Radio Show Archive Mining — Computer Guru Show

+> **Correction (2026-04-27, GURU-BEAST-ROG session):** This log was originally
+> written referring to the co-host as "Tom." Mike confirmed there is no co-host
+> by that name; the voice in 2014-s6e19 and 2016-s8e43 is **Tara**. The voice
+> profile is correct (clean 0.698 cosine separation from Mike), only the human
+> identity attached to it was wrong. All references below have been updated
+> Tom → Tara. There have been multiple co-hosts on the show over the years;
+> Tara is one of them.
+
 ---

 ## User
@@ -13,9 +21,9 @@

 ## Session Summary

-The session began with resuming work following a benchmark run that demonstrated a significant performance improvement in Whisper transcription, achieving 63.8x real-time speed with batched inference and int8_float16 settings. Next, the focus shifted to evaluating the quality of Q&A extraction across six test episodes, revealing a critical issue with false positives due to co-host Tom being mislabeled as CALLER based on a voice similarity threshold.
+The session began with resuming work following a benchmark run that demonstrated a significant performance improvement in Whisper transcription, achieving 63.8x real-time speed with batched inference and int8_float16 settings. Next, the focus shifted to evaluating the quality of Q&A extraction across six test episodes, revealing a critical issue with false positives due to co-host Tara being mislabeled as CALLER based on a voice similarity threshold.

-A co-host voice profile for Tom was constructed using 44 embeddings from two specific episodes (2014-s6e19 and 2016-s8e43), producing a cosine similarity of 0.698 against Mike — well below Mike's 0.85 threshold, giving clean separation. Code was updated in `voice_profiler.py` and `diarizer.py` to correctly emit "Cohost: Tom" labels and map them to a new "CO-HOST" speaker tag. Re-diarizing the two co-host-era episodes dramatically cleaned up Q&A results: 2016 went from 12 false positives to 2 real WiFi caller pairs.
+A co-host voice profile for Tara was constructed using 44 embeddings from two specific episodes (2014-s6e19 and 2016-s8e43), producing a cosine similarity of 0.698 against Mike — well below Mike's 0.85 threshold, giving clean separation. Code was updated in `voice_profiler.py` and `diarizer.py` to correctly emit "Cohost: Tara" labels and map them to a new "CO-HOST" speaker tag. Re-diarizing the two co-host-era episodes dramatically cleaned up Q&A results: 2016 went from 12 false positives to 2 real WiFi caller pairs.

 Several bugs in `qa_extractor.py` were fixed: overlap resolution for sliding-window diarization boundaries, CALLER-preference threshold for long batch transcript segments, and a turn-based caller-intro lookback to replace an ineffective 120s time window. Phone-greeting detection and new promo signatures were added. The final Q&A count landed at 10 pairs across 6 episodes, with 2014 correctly yielding 0 (gaming co-host episode with no actual callers).

@@ -25,21 +33,21 @@ Several bugs in `qa_extractor.py` were fixed: overlap resolution for sliding-win

 ## Key Decisions

-- **Co-host threshold uses same 0.85 bar as host**: Tom scores 0.698 vs Mike. Any voice >= 0.85 against Tom's composite gets labeled CO-HOST. Keeps the same single threshold for all profiles rather than per-profile thresholds.
+- **Co-host threshold uses same 0.85 bar as host**: Tara scores 0.698 vs Mike. Any voice >= 0.85 against Tara's composite gets labeled CO-HOST. Keeps the same single threshold for all profiles rather than per-profile thresholds.
 - **Turn-based lookback for caller-intro (2 HOST turns, not 120s)**: Long HOST monologue blocks (8-10 min) in big show segments meant time-based lookback missed the caller introduction. Previous 2 HOST turns always catches it regardless of block length.
 - **CALLER-preference at 4s minimum overlap**: Batch transcription produces ~26s segments; diarization CALLER windows are ~10s. Pure majority-vote always gave HOST. 4s minimum CALLER coverage labels the segment CALLER without being overly aggressive for co-host episodes.
 - **Midpoint boundary resolution at load time**: Rather than re-diarizing everything, the sliding-window overlap is resolved in `load_diarized_transcript()` so it applies retroactively to all saved diarization files without touching the JSON.
 - **751-1041 added as promo signal**: Earlier Tucson show number (vs 790-2040 in later seasons). Weighted 1 (needs a second semi-generic signal to filter).
-- **Tom's windows sourced from first 60 min of co-host episodes**: Real callers don't call in during the first hour of a 2-hour show (only exceptions: very end of show). First-hour CALLER windows are safely all Tom.
+- **Tara's windows sourced from first 60 min of co-host episodes**: Real callers don't call in during the first hour of a 2-hour show (only exceptions: very end of show). First-hour CALLER windows are safely all Tara.

 ---

 ## Problems Encountered

-- **2016-s8e43 had 12 Q&A pairs, 11 false positives**: Root cause was Tom (co-host) labeled CALLER throughout. Fixed by building Tom's voice profile and re-diarizing.
+- **2016-s8e43 had 12 Q&A pairs, 11 false positives**: Root cause was Tara (co-host) labeled CALLER throughout. Fixed by building Tara's voice profile and re-diarizing.
 - **2014-s6e19 had 2 Q&A pairs from gaming discussion**: Same co-host issue. After re-diarization: 0 pairs (correct — no actual callers in that gaming special).
 - **2012-03-10 yielded 0 segments labeled CALLER**: Midpoint assignment hit HOST turns (HOST 0-20s and CALLER 15-30s — midpoint 15.1s falls in HOST). Fixed by overlap-preference assignment with 4s CALLER minimum.
-- **Real WiFi caller (2016, ~4794s) was missing after first fix attempt**: Aggressive time-based lookback (120s) combined with short CALLER turns from sliding-window diarization caused the caller question to land in a HOST segment. Fixed by turn-based lookback + co-host profile (eliminated Tom noise, letting real caller windows survive).
+- **Real WiFi caller (2016, ~4794s) was missing after first fix attempt**: Aggressive time-based lookback (120s) combined with short CALLER turns from sliding-window diarization caused the caller question to land in a HOST segment. Fixed by turn-based lookback + co-host profile (eliminated Tara noise, letting real caller windows survive).
 - **2012-Jun pair at 1325s was a promo**: "The Computer Guru. We'll get your problem solved. Call 751-1041 today" passed promo filter. Fixed by adding 751-1041 and "we'll get your problem solved" as promo signatures.

 ---
@@ -51,8 +59,8 @@ Several bugs in `qa_extractor.py` were fixed: overlap resolution for sliding-win
 projects/radio-show/audio-processor/build_cohost_profile.py
 projects/radio-show/audio-processor/index_test_episodes.py
 projects/radio-show/audio-processor/archive.db
-projects/radio-show/audio-processor/voice-profiles/tom/
-projects/radio-show/audio-processor/voice-profiles/profiles.json  (updated: Tom added)
+projects/radio-show/audio-processor/voice-profiles/tara/
+projects/radio-show/audio-processor/voice-profiles/profiles.json  (updated: Tara added)
 projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md  (this file)
 ```

@@ -63,8 +71,8 @@ src/diarizer.py             — map "Cohost:" prefix to "CO-HOST" speaker
 src/qa_extractor.py         — overlap resolution, CALLER-preference, turn-based
                              caller-intro lookback, _preceded_by_caller_intro(),
                              _PHONE_GREETING, 751-1041 + promo sig additions
-test-data/transcripts/2014-s6e19/diarization.json   (re-diarized with Tom profile)
-test-data/transcripts/2016-s8e43/diarization.json   (re-diarized with Tom profile)
+test-data/transcripts/2014-s6e19/diarization.json   (re-diarized with Tara profile)
+test-data/transcripts/2016-s8e43/diarization.json   (re-diarized with Tara profile)
 ```

 ---
@@ -118,11 +126,11 @@ Q&A pairs: 10
 | Name | Role | Embeddings | Source Episodes |
 |------|------|-----------|-----------------|
 | Mike Swanson | host | 180 | 9 episodes (2010-2018) |
-| Tom | cohost | 44 | 2014-s6e19, 2016-s8e43 |
+| Tara | cohost | 44 | 2014-s6e19, 2016-s8e43 |

-Tom vs Mike cosine similarity: **0.698** (well-separated at 0.85 threshold)
+Tara vs Mike cosine similarity: **0.698** (well-separated at 0.85 threshold)

-**Tom's source windows used:**
+**Tara's source windows used:**
 - 2014-s6e19: 195-260s, 320-425s, 600-650s, 675-710s
 - 2016-s8e43: 100-115s, 135-160s, 270-295s, 575-605s, 1185-1235s, 1790-1870s, 2020-2055s

@@ -130,7 +138,7 @@ Tom vs Mike cosine similarity: **0.698** (well-separated at 0.85 threshold)

 ## Co-Host Era Notes

-Tom was the regular in-studio co-host/board-op roughly 2013-2016. His voice is in episodes from at least 2014 through 2016 (confirmed from test set). The 2011 and 2012 episodes are pure call-in format with no co-host.
+Tara was an in-studio co-host whose voice appears in 2014-s6e19 and 2016-s8e43 (confirmed by Mike). The 2011 and 2012 episodes are pure call-in format with no co-host. Mike notes the show has had multiple co-hosts over the years; Tara's exact tenure isn't fixed from the original 2013-2016 assumption — that should be verified before generalizing the profile across the full archive.

 If there are occasional guest co-hosts or fill-in hosts in other years, they would still be labeled CALLER until profiled. These would be rare and would likely not form question patterns that survive the caller-intro gate.

@@ -190,9 +198,9 @@ The pipeline is idempotent — `add_segments()` skips episodes already indexed.

 ### 4. Verify co-host era episodes

-2013-2016 era episodes should now correctly separate Tom (CO-HOST) from actual callers. Spot-check a few 2015 episodes after processing to confirm Tom's profile generalizes well.
+2013-2016 era episodes should now correctly separate Tara (CO-HOST) from actual callers. Spot-check a few 2015 episodes after processing to confirm Tara's profile generalizes well.

-If any 2015/2016 episodes show too many CALLER turns that are clearly Tom (voice changed slightly over years), re-run `build_cohost_profile.py` with windows from that episode added to TOM_WINDOWS dict.
+If any 2015/2016 episodes show too many CALLER turns that are clearly Tara (voice changed slightly over years), re-run `build_cohost_profile.py` with windows from that episode added to TARA_WINDOWS dict.

 ---
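
The "Midpoint boundary resolution" and "CALLER-preference at 4s minimum overlap" decisions in the session log can be illustrated with the 2012-03-10 case it cites (HOST 0-20s, CALLER 15-30s). The sketch below is not the code in `qa_extractor.py`. The function names, the tuple layout for turns, and the 0.2-30.0s transcript segment are assumptions chosen for the example.

```python
def resolve_overlaps(turns):
    """Split each overlap between consecutive turns at its midpoint.

    `turns` is a list of (speaker, start, end) tuples sorted by start time.
    Assumes overlaps are partial (no turn fully contained in another).
    """
    resolved = [list(t) for t in turns]
    for prev, cur in zip(resolved, resolved[1:]):
        if cur[1] < prev[2]:  # current turn starts before the previous one ends
            mid = (cur[1] + prev[2]) / 2.0
            prev[2] = mid
            cur[1] = mid
    return [tuple(t) for t in resolved]


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segment(seg_start, seg_end, turns, caller_min=4.0):
    """Label a transcript segment, preferring CALLER once it covers caller_min seconds."""
    totals = {}
    for speaker, start, end in turns:
        totals[speaker] = totals.get(speaker, 0.0) + overlap(seg_start, seg_end, start, end)
    if totals.get("CALLER", 0.0) >= caller_min:
        return "CALLER"
    # Otherwise fall back to whichever speaker covers most of the segment.
    return max(totals, key=totals.get) if totals else "UNKNOWN"


# The overlap 15-20s is split at 17.5s: HOST 0-17.5, CALLER 17.5-30.
turns = resolve_overlaps([("HOST", 0.0, 20.0), ("CALLER", 15.0, 30.0)])

# A long batch segment whose midpoint (15.1s) lies in the HOST turn.
label = label_segment(0.2, 30.0, turns)
print(turns, label)
```

In this example HOST covers 17.3s of the segment and CALLER 12.5s, so both a midpoint rule and a majority vote would pick HOST, while the 4s CALLER minimum labels the segment CALLER.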