radio: rename Tom -> Tara, expand speaker roster
Mike confirmed there is no co-host named "Tom": the voice in 2014-s6e19 and 2016-s8e43 is Tara. The 5070 Ti session fabricated the Tom identity. The voice profile itself (44 embeddings, 0.698 cosine vs Mike) is correct; only the human label was wrong.

Rename swept:
- voice-profiles/tom/ -> voice-profiles/tara/ (git mv preserves all .npy)
- voice-profiles/profiles.json: "Tom" key -> "Tara"
- build_cohost_profile.py: TOM_WINDOWS -> TARA_WINDOWS, COHOST_NAME, comments
- 2026-04-27-qa-extraction-cohost-indexing.md: correction header + body sweep
- 2026-04-27-4090-benchmark-and-test-set.md: closure note
- .claude/memory/radio_show_no_cohost_named_tom.md: resolution + speaker roster

Diarization re-run after rename so speaker_map emits "Cohost: Tara". Q&A counts unchanged (rename is label-only): 9 pairs across 6 test episodes.

Tara distribution from the post-rename diarization (per-episode % of audio):

  2011-03-12-hr1    140s    5.6%   likely false positive (call-in only)
  2012-03-10-hr1     30s    1.1%   likely false positive (call-in only)
  2012-06-09-hr1    340s   12.8%   suspicious, pending Mike confirm
  2014-s6e19        680s   23.3%   confirmed
  2016-s8e43       1890s   35.5%   confirmed
  2017-s9e30        610s   11.4%   plausible, pending Mike confirm

Broader speaker-roster context Mike provided this session (saved to memory): the show has had multiple co-hosts (Tara, Randall, Rob) plus producers/board ops (Andrew, Shannon, Ken, others) who would sometimes go on-air. Only Tara has a profile so far. Every other speaker is currently labeled CALLER, which means small CO-HOST attributions in unexpected episodes (e.g. 2011/2012) may actually be a producer rather than a false positive; Mike to spot-check.

Action item before full-archive run: build profiles for Randall, Rob, and the named producers to avoid systematic Q&A false positives in early-years and 2018/2019 episodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -41,4 +41,4 @@
|
|||||||
- [Neptune SBR Email Routing Setup](project_neptune_sbr_email_routing.md) - Full SBR routing chain, config file locations, MailProtector integration, access methods
|
- [Neptune SBR Email Routing Setup](project_neptune_sbr_email_routing.md) - Full SBR routing chain, config file locations, MailProtector integration, access methods
|
||||||
- [Dataforth Test Datasheet Pipeline](project_datasheet_pipeline.md) - Full pipeline rebuilt 2026-03-27. Server-side generation replaces DFWDS/Uploader. Website upload still broken.
|
- [Dataforth Test Datasheet Pipeline](project_datasheet_pipeline.md) - Full pipeline rebuilt 2026-03-27. Server-side generation replaces DFWDS/Uploader. Website upload still broken.
|
||||||
- [Dataforth Security Incident](project_dataforth_incident_2026-03-27.md) - DF-JOEL2 compromised, MFA deployed, IC3 filed. CA policies enforce April 4.
|
- [Dataforth Security Incident](project_dataforth_incident_2026-03-27.md) - DF-JOEL2 compromised, MFA deployed, IC3 filed. CA policies enforce April 4.
|
||||||
- [Radio show — no co-host named Tom](radio_show_no_cohost_named_tom.md) — voice profile is real, name is hallucinated. Do not propagate "Tom" as a show member; ask Mike for correct identity.
|
- [Radio show co-host — Tara, not Tom](radio_show_no_cohost_named_tom.md) — Co-host in 2014-s6e19 and 2016-s8e43 is Tara. "Tom" was hallucinated; rename complete. Multiple co-hosts have rotated through the show.
|
||||||
|
|||||||
@@ -1,24 +1,54 @@
|
|||||||
---
|
---
|
||||||
name: Radio show — "Tom" is not a real co-host
|
name: Radio show — co-host roster (Randall, Rob, Tara, others)
|
||||||
description: Correction to a fabricated co-host identity in the Computer Guru Show diarization pipeline; the voice exists but the name "Tom" is wrong
|
description: The Computer Guru Show has had multiple co-hosts over the years. The fabricated "Tom" was actually Tara. Track known co-hosts here as Mike confirms identities.
|
||||||
type: project
|
type: project
|
||||||
---
|
---
|
||||||
|
|
||||||
There is no co-host named **Tom** on The Computer Guru Show. Mike Swanson confirmed this directly on 2026-04-27.
|
The Computer Guru Show has had **multiple co-hosts** rotating through over the years. Mike Swanson is the only constant host.
|
||||||
|
|
||||||
The 5070 Ti session (`projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md`) and corresponding code/data on disk fabricated this identity:
|
## Known speaker roster (per Mike, 2026-04-27)
|
||||||
|
|
||||||
- `voice-profiles/tom/` — directory with 44 embeddings labeled as "Tom"
|
The show has had multiple **co-hosts** rotating through, plus **producers / board ops** who would sometimes go on-air. Both groups need separate voice profiles to avoid being mislabeled as callers.
|
||||||
- `voice-profiles/profiles.json` — entry naming the profile "Tom"
|
|
||||||
- `build_cohost_profile.py` — references TOM_WINDOWS dict
|
|
||||||
- The session log claims "Tom was the regular in-studio co-host/board-op roughly 2013-2016" — this is hallucinated
|
|
||||||
|
|
||||||
The underlying voice profile **is technically valid** — there is a real second voice in 2014-s6e19 and 2016-s8e43 that is not Mike and not a caller, and the cosine separation (0.698 vs Mike's 0.85) is sound. The bug is identity assignment: someone (Mike doesn't have a name in mind yet) attached the wrong human name to a real audio signature.
|
### Co-hosts
|
||||||
|
| Co-host | Era | Confirmed in audio | Profile built |
|
||||||
|
|---|---|---|---|
|
||||||
|
| **Randall** | early years | not yet | no |
|
||||||
|
| **Rob** | early years + appearances in 2018/2019 (Mike unsure of exact dates) | not yet | no |
|
||||||
|
| **Tara** | confirmed 2014-s6e19, 2016-s8e43; diarizer also found her in 2017-s9e30 (610s/11.4%) — pending Mike spot-check | yes | yes — `voice-profiles/tara/` (44 embeddings) |
|
||||||
|
|
||||||
**Why:** This will re-surface every time a future conversation reads the session log, the directory tree, or `profiles.json`. The wrongness is non-obvious from code review — the math works, only the label is bogus.
|
### Producers / board ops (sometimes on-air)
|
||||||
|
| Person | Profile built |
|
||||||
|
|---|---|
|
||||||
|
| **Andrew** | no |
|
||||||
|
| **Shannon** | no |
|
||||||
|
| **Ken** | no |
|
||||||
|
| (Mike: "a couple more" he doesn't recall off-hand) | no |
|
||||||
|
|
||||||
**How to apply:**
|
Mike: "The 'producer' (board op) would also be on-air sometimes." Anywhere a producer's voice appears, they're currently being labeled CALLER, which inflates Q&A false positives. Same problem as unprofiled co-hosts.
|
||||||
- Do not refer to "Tom" as a member of the show.
|
|
||||||
- If asked to extend or use the co-host profile, ask Mike for the correct identity before writing the name anywhere.
|
The 2011 and 2012 episodes are pure call-in format with no co-host present (per Mike). However, a producer could still have been on-air, so even small CO-HOST attributions in 2011/2012 (1-13% of audio) may be capturing a producer rather than being false positives.
|
||||||
- Anywhere "Tom" appears in commit history, session logs, or code, treat it as a placeholder pending rename — do not propagate.
|
|
||||||
- When summarizing the diarization pipeline, describe the profile as "second-speaker / co-host era voice (identity TBD)" until Mike provides the real name.
|
## "Tom" was hallucinated
|
||||||
|
|
||||||
|
The 5070 Ti session (`2026-04-27-qa-extraction-cohost-indexing.md`) originally fabricated a co-host named "Tom" and described them as "regular in-studio co-host/board-op roughly 2013-2016." That entire identity was invented by the prior conversation. The voice profile was technically valid (real human voice, clean cosine separation from Mike at 0.698) but the human attached to it was wrong.
|
||||||
|
|
||||||
|
**Resolution applied 2026-04-27 (GURU-BEAST-ROG session):**
|
||||||
|
- `voice-profiles/tom/` renamed to `voice-profiles/tara/`
|
||||||
|
- `voice-profiles/profiles.json`: key `Tom` → `Tara`
|
||||||
|
- `build_cohost_profile.py`: `TOM_WINDOWS` → `TARA_WINDOWS`, `COHOST_NAME = "Tara"`
|
||||||
|
- Both relevant session logs updated; correction header preserves the history
|
||||||
|
- Diarization re-run; `speaker_map` now emits `Cohost: Tara`
|
||||||
|
|
||||||
|
## Implications for the archive pipeline
|
||||||
|
|
||||||
|
Co-hosts without a built profile get labeled CALLER, which inflates Q&A false positives in those eras:
|
||||||
|
- **Early-years archive (~2010-2013):** Randall and Rob are present but unprofiled — caller-labeled audio in this era is suspect.
|
||||||
|
- **2018/2019:** Rob makes appearances — same issue.
|
||||||
|
- **2017:** Diarization just found Tara at 610s (11.4% of audio) in `2017-s9e30`; the 5070 Ti session log claimed Tara was only in 2014/2016. Pending Mike's confirmation that the 2017 attribution is correct.
|
||||||
|
|
||||||
|
## How to apply
|
||||||
|
- When diarizing a new episode and a CALLER cluster looks too long / too prominent / too consistent, suspect an unprofiled co-host before assuming a real caller.
|
||||||
|
- Don't extend Tara's profile across the full 2013-2017 window without Mike confirming each year. She may not have been in every episode.
|
||||||
|
- Build separate profiles for Randall and Rob from clearly-attributed windows (Mike to provide source episodes/timestamps).
|
||||||
|
- Never invent a co-host name from voice signature alone — ask Mike.
|
||||||
|
|||||||
@@ -1,5 +1,5 @@
|
|||||||
"""
|
"""
|
||||||
Build voice profile for Tom (co-host) from known co-host speech windows.
|
Build voice profile for Tara (co-host) from known co-host speech windows.
|
||||||
|
|
||||||
Uses CALLER-labeled windows from the first 60 min of co-host-era episodes,
|
Uses CALLER-labeled windows from the first 60 min of co-host-era episodes,
|
||||||
before any real callers would have called in.
|
before any real callers would have called in.
|
||||||
@@ -32,10 +32,10 @@ console.print(f"Device: {device}")
|
|||||||
|
|
||||||
profiler = VoiceProfiler(PROFILES_DIR, device=device)
|
profiler = VoiceProfiler(PROFILES_DIR, device=device)
|
||||||
|
|
||||||
# Tom's known speech windows per episode
|
# Tara's known speech windows per episode
|
||||||
# CALLER turns from diarization that are in the first 60 min (before real callers)
|
# CALLER turns from diarization that are in the first 60 min (before real callers)
|
||||||
# Windows at 0-40s excluded (promo/jingle, not Tom's voice)
|
# Windows at 0-40s excluded (promo/jingle, not Tara's voice)
|
||||||
TOM_WINDOWS = {
|
TARA_WINDOWS = {
|
||||||
"2014-s6e19.mp3": [
|
"2014-s6e19.mp3": [
|
||||||
(195, 260),
|
(195, 260),
|
||||||
(320, 425),
|
(320, 425),
|
||||||
@@ -53,7 +53,7 @@ TOM_WINDOWS = {
|
|||||||
],
|
],
|
||||||
}
|
}
|
||||||
|
|
||||||
COHOST_NAME = "Tom"
|
COHOST_NAME = "Tara"
|
||||||
|
|
||||||
if COHOST_NAME not in profiler.profiles:
|
if COHOST_NAME not in profiler.profiles:
|
||||||
profiler.profiles[COHOST_NAME] = SpeakerProfile(
|
profiler.profiles[COHOST_NAME] = SpeakerProfile(
|
||||||
@@ -66,7 +66,7 @@ if COHOST_NAME not in profiler.profiles:
|
|||||||
profile = profiler.profiles[COHOST_NAME]
|
profile = profiler.profiles[COHOST_NAME]
|
||||||
console.print(f"\n[bold]Building co-host profile for: {COHOST_NAME}[/bold]")
|
console.print(f"\n[bold]Building co-host profile for: {COHOST_NAME}[/bold]")
|
||||||
|
|
||||||
for ep_name, windows in TOM_WINDOWS.items():
|
for ep_name, windows in TARA_WINDOWS.items():
|
||||||
ep_path = EPISODES_DIR / ep_name
|
ep_path = EPISODES_DIR / ep_name
|
||||||
if not ep_path.exists():
|
if not ep_path.exists():
|
||||||
console.print(f"[yellow] Skipping {ep_name} — not found[/yellow]")
|
console.print(f"[yellow] Skipping {ep_name} — not found[/yellow]")
|
||||||
@@ -101,7 +101,7 @@ if not profile.embeddings:
|
|||||||
sys.exit(1)
|
sys.exit(1)
|
||||||
|
|
||||||
profile.compute_composite()
|
profile.compute_composite()
|
||||||
console.print(f"\n[green]Tom profile built: {profile.num_samples} embeddings "
|
console.print(f"\n[green]Tara profile built: {profile.num_samples} embeddings "
|
||||||
f"from {len(profile.source_episodes)} episodes[/green]")
|
f"from {len(profile.source_episodes)} episodes[/green]")
|
||||||
|
|
||||||
# Verify: check cosine similarity vs Mike to ensure separation
|
# Verify: check cosine similarity vs Mike to ensure separation
|
||||||
@@ -109,7 +109,7 @@ mike = profiler.profiles.get("Mike Swanson")
|
|||||||
if mike and mike.composite_embedding is not None and profile.composite_embedding is not None:
|
if mike and mike.composite_embedding is not None and profile.composite_embedding is not None:
|
||||||
sim = float(np.dot(mike.composite_embedding, profile.composite_embedding) /
|
sim = float(np.dot(mike.composite_embedding, profile.composite_embedding) /
|
||||||
(np.linalg.norm(mike.composite_embedding) * np.linalg.norm(profile.composite_embedding) + 1e-8))
|
(np.linalg.norm(mike.composite_embedding) * np.linalg.norm(profile.composite_embedding) + 1e-8))
|
||||||
console.print(f"Tom vs Mike similarity: {sim:.3f} (lower is better separation)")
|
console.print(f"Tara vs Mike similarity: {sim:.3f} (lower is better separation)")
|
||||||
|
|
||||||
profiler.save_profiles()
|
profiler.save_profiles()
|
||||||
console.print("[bold green]Profile saved.[/bold green]")
|
console.print("[bold green]Profile saved.[/bold green]")
|
||||||
|
|||||||
@@ -25,13 +25,19 @@ This run uses the post-overhaul code (commit `e9ac607`): batched Whisper transcr
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Important — "Tom" co-host name is wrong
|
## Co-host identity correction — Tara, not Tom
|
||||||
|
|
||||||
The 5070 Ti session built a voice profile labeled `voice-profiles/tom/` and described it in the session log as "Tom, regular in-studio co-host/board-op roughly 2013-2016." Mike confirmed on this session: **there is no co-host named Tom**. The voice profile is real (clean cosine separation, 0.698 vs Mike) and the diarization correctly identifies the second speaker, but the human identity attached to it is hallucinated.
|
The 5070 Ti session fabricated a co-host named "Tom" — Mike confirmed there is no such person on the show. After listening to the source windows, Mike identified the voice in both 2014-s6e19 and 2016-s8e43 as **Tara** (a real co-host; the show has had multiple over the years).
|
||||||
|
|
||||||
The directory, `profiles.json` entry, `build_cohost_profile.py` references, and the 5070 Ti session log all carry the bogus name. Identity TBD pending Mike confirming who that voice actually is.
|
Rename swept this session:
|
||||||
|
- `voice-profiles/tom/` → `voice-profiles/tara/` (git mv, all 44 embeddings + composite preserved)
|
||||||
|
- `voice-profiles/profiles.json`: `"Tom"` key → `"Tara"`
|
||||||
|
- `build_cohost_profile.py`: docstring, `TOM_WINDOWS` → `TARA_WINDOWS`, `COHOST_NAME = "Tara"`, console output strings
|
||||||
|
- `projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md`: correction header added, all body references updated
|
||||||
|
- `.claude/memory/radio_show_no_cohost_named_tom.md`: resolution recorded
|
||||||
|
- Diarization re-run post-rename so `speaker_map` in each `diarization.json` emits `Cohost: Tara`
|
||||||
|
|
||||||
Memory entry added: `.claude/memory/radio_show_no_cohost_named_tom.md`. The profile will be renamed once Mike provides the correct identity.
|
The 5070 Ti session log's claim of "Tom was the regular co-host roughly 2013-2016" carried two errors: the wrong name AND an unverified tenure window. The corrected log notes Tara appears in 2014-s6e19 and 2016-s8e43 only — generalizing to the full 2013-2016 era hasn't been confirmed.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -115,6 +121,31 @@ archive.db is not on this machine — index update happens on DESKTOP-0O8A1RL.
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Tara distribution across the test set (post-rename diarization)
|
||||||
|
|
||||||
|
After the rename, the diarizer's per-episode `speaker_map` shows Tara in **all 6** test episodes — well beyond the 2014+2016 the 5070 Ti session log claimed.
|
||||||
|
|
||||||
|
| Episode | Tara (seconds) | % of audio | Read |
|
||||||
|
|---|---|---|---|
|
||||||
|
| 2011-03-12-hr1 | 140s (2:20) | 5.6% | likely false positive — Mike confirms 2011 was pure call-in |
|
||||||
|
| 2012-03-10-hr1 | 30s (0:30) | 1.1% | likely false positive — 2012 was pure call-in |
|
||||||
|
| 2012-06-09-hr1 | 340s (5:40) | 12.8% | suspicious — too much for noise; awaiting Mike confirm |
|
||||||
|
| 2014-s6e19 | 680s (11:20) | 23.3% | confirmed (Mike) |
|
||||||
|
| 2016-s8e43 | 1890s (31:30) | 35.5% | confirmed (Mike) |
|
||||||
|
| 2017-s9e30 | 610s (10:10) | 11.4% | plausible — pending Mike confirm; 5070 Ti log only listed Tara in 2014+2016 |
|
||||||
|
|
||||||
|
**Mike's broader correction (2026-04-27):**
|
||||||
|
- **Co-hosts** rotated through over the years. Confirmed: Tara, Randall (early years), Rob (early years + occasional 2018/2019).
|
||||||
|
- **Producers / board ops** would sometimes go on-air. Named so far: Andrew, Shannon, Ken, plus "a couple more" Mike doesn't recall off-hand.
|
||||||
|
|
||||||
|
Of all these, only Tara has a voice profile. Every other co-host AND every producer-on-air moment in the archive is currently being labeled CALLER, which inflates Q&A false positives in those eras and episodes.
|
||||||
|
|
||||||
|
The small Tara percentages in 2011/2012 (1-13%) most likely reflect the 0.85 cosine threshold hitting on a similar-sounding speaker that isn't actually Tara — could be a producer (Andrew/Shannon/Ken/etc) or another early-years voice we haven't catalogued. Worth Mike sampling these short windows to identify before assuming false positive vs producer.
|
||||||
|
|
||||||
|
**Implication for full-archive runs:** before processing the 579-episode archive in earnest, build profiles for at least Randall, Rob, and the named producers. Otherwise the Q&A extraction across early-years and 2018/2019 episodes will inherit the same false-positive pattern that originally produced 12 bogus pairs in 2016-s8e43.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Pending work (from 5070 Ti session, still unblocked)
|
## Pending work (from 5070 Ti session, still unblocked)
|
||||||
|
|
||||||
1. **Resolve "Tom" identity** — Mike to confirm who the second voice is in 2014-s6e19 and 2016-s8e43. Then rename `voice-profiles/tom/`, update `profiles.json`, fix labels in code. Until then, voice-profile data is correct but mislabeled.
|
1. **Resolve "Tom" identity** — Mike to confirm who the second voice is in 2014-s6e19 and 2016-s8e43. Then rename `voice-profiles/tom/`, update `profiles.json`, fix labels in code. Until then, voice-profile data is correct but mislabeled.
|
||||||
|
|||||||
@@ -2,8 +2,8 @@
|
|||||||
"num_speakers": 3,
|
"num_speakers": 3,
|
||||||
"speaker_map": {
|
"speaker_map": {
|
||||||
"HOST": "HOST",
|
"HOST": "HOST",
|
||||||
"CALLER": "CALLER",
|
"CO-HOST": "CO-HOST",
|
||||||
"CO-HOST": "CO-HOST"
|
"CALLER": "CALLER"
|
||||||
},
|
},
|
||||||
"turns": [
|
"turns": [
|
||||||
{
|
{
|
||||||
|
|||||||
@@ -2,8 +2,8 @@
|
|||||||
"num_speakers": 3,
|
"num_speakers": 3,
|
||||||
"speaker_map": {
|
"speaker_map": {
|
||||||
"HOST": "HOST",
|
"HOST": "HOST",
|
||||||
"CALLER": "CALLER",
|
"CO-HOST": "CO-HOST",
|
||||||
"CO-HOST": "CO-HOST"
|
"CALLER": "CALLER"
|
||||||
},
|
},
|
||||||
"turns": [
|
"turns": [
|
||||||
{
|
{
|
||||||
|
|||||||
@@ -2,8 +2,8 @@
|
|||||||
"num_speakers": 3,
|
"num_speakers": 3,
|
||||||
"speaker_map": {
|
"speaker_map": {
|
||||||
"HOST": "HOST",
|
"HOST": "HOST",
|
||||||
"CALLER": "CALLER",
|
"CO-HOST": "CO-HOST",
|
||||||
"CO-HOST": "CO-HOST"
|
"CALLER": "CALLER"
|
||||||
},
|
},
|
||||||
"turns": [
|
"turns": [
|
||||||
{
|
{
|
||||||
|
|||||||
@@ -2,8 +2,8 @@
|
|||||||
"num_speakers": 3,
|
"num_speakers": 3,
|
||||||
"speaker_map": {
|
"speaker_map": {
|
||||||
"HOST": "HOST",
|
"HOST": "HOST",
|
||||||
"CALLER": "CALLER",
|
"CO-HOST": "CO-HOST",
|
||||||
"CO-HOST": "CO-HOST"
|
"CALLER": "CALLER"
|
||||||
},
|
},
|
||||||
"turns": [
|
"turns": [
|
||||||
{
|
{
|
||||||
|
|||||||
@@ -23,7 +23,7 @@
|
|||||||
"2018-s10e21.mp3"
|
"2018-s10e21.mp3"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"Tom": {
|
"Tara": {
|
||||||
"role": "cohost",
|
"role": "cohost",
|
||||||
"num_samples": 44,
|
"num_samples": 44,
|
||||||
"source_episodes": [
|
"source_episodes": [
|
||||||
|
|||||||
@@ -2,6 +2,14 @@
|
|||||||
**Date:** 2026-04-27
|
**Date:** 2026-04-27
|
||||||
**Project:** Radio Show Archive Mining — Computer Guru Show
|
**Project:** Radio Show Archive Mining — Computer Guru Show
|
||||||
|
|
||||||
|
> **Correction (2026-04-27, GURU-BEAST-ROG session):** This log was originally
|
||||||
|
> written referring to the co-host as "Tom." Mike confirmed there is no co-host
|
||||||
|
> by that name; the voice in 2014-s6e19 and 2016-s8e43 is **Tara**. The voice
|
||||||
|
> profile is correct (clean 0.698 cosine separation from Mike), only the human
|
||||||
|
> identity attached to it was wrong. All references below have been updated
|
||||||
|
> Tom → Tara. There have been multiple co-hosts on the show over the years;
|
||||||
|
> Tara is one of them.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## User
|
## User
|
||||||
@@ -13,9 +21,9 @@
|
|||||||
|
|
||||||
## Session Summary
|
## Session Summary
|
||||||
|
|
||||||
The session began with resuming work following a benchmark run that demonstrated a significant performance improvement in Whisper transcription, achieving 63.8x real-time speed with batched inference and int8_float16 settings. Next, the focus shifted to evaluating the quality of Q&A extraction across six test episodes, revealing a critical issue with false positives due to co-host Tom being mislabeled as CALLER based on a voice similarity threshold.
|
The session began with resuming work following a benchmark run that demonstrated a significant performance improvement in Whisper transcription, achieving 63.8x real-time speed with batched inference and int8_float16 settings. Next, the focus shifted to evaluating the quality of Q&A extraction across six test episodes, revealing a critical issue with false positives due to co-host Tara being mislabeled as CALLER based on a voice similarity threshold.
|
||||||
|
|
||||||
A co-host voice profile for Tom was constructed using 44 embeddings from two specific episodes (2014-s6e19 and 2016-s8e43), producing a cosine similarity of 0.698 against Mike — well below Mike's 0.85 threshold, giving clean separation. Code was updated in `voice_profiler.py` and `diarizer.py` to correctly emit "Cohost: Tom" labels and map them to a new "CO-HOST" speaker tag. Re-diarizing the two co-host-era episodes dramatically cleaned up Q&A results: 2016 went from 12 false positives to 2 real WiFi caller pairs.
|
A co-host voice profile for Tara was constructed using 44 embeddings from two specific episodes (2014-s6e19 and 2016-s8e43), producing a cosine similarity of 0.698 against Mike — well below Mike's 0.85 threshold, giving clean separation. Code was updated in `voice_profiler.py` and `diarizer.py` to correctly emit "Cohost: Tara" labels and map them to a new "CO-HOST" speaker tag. Re-diarizing the two co-host-era episodes dramatically cleaned up Q&A results: 2016 went from 12 false positives to 2 real WiFi caller pairs.
|
||||||
|
|
||||||
Several bugs in `qa_extractor.py` were fixed: overlap resolution for sliding-window diarization boundaries, CALLER-preference threshold for long batch transcript segments, and a turn-based caller-intro lookback to replace an ineffective 120s time window. Phone-greeting detection and new promo signatures were added. The final Q&A count landed at 10 pairs across 6 episodes, with 2014 correctly yielding 0 (gaming co-host episode with no actual callers).
|
Several bugs in `qa_extractor.py` were fixed: overlap resolution for sliding-window diarization boundaries, CALLER-preference threshold for long batch transcript segments, and a turn-based caller-intro lookback to replace an ineffective 120s time window. Phone-greeting detection and new promo signatures were added. The final Q&A count landed at 10 pairs across 6 episodes, with 2014 correctly yielding 0 (gaming co-host episode with no actual callers).
|
||||||
|
|
||||||
@@ -25,21 +33,21 @@ Several bugs in `qa_extractor.py` were fixed: overlap resolution for sliding-win
|
|||||||
|
|
||||||
## Key Decisions
|
## Key Decisions
|
||||||
|
|
||||||
- **Co-host threshold uses same 0.85 bar as host**: Tom scores 0.698 vs Mike. Any voice >= 0.85 against Tom's composite gets labeled CO-HOST. Keeps the same single threshold for all profiles rather than per-profile thresholds.
|
- **Co-host threshold uses same 0.85 bar as host**: Tara scores 0.698 vs Mike. Any voice >= 0.85 against Tara's composite gets labeled CO-HOST. Keeps the same single threshold for all profiles rather than per-profile thresholds.
|
||||||
- **Turn-based lookback for caller-intro (2 HOST turns, not 120s)**: Long HOST monologue blocks (8-10 min) in big show segments meant time-based lookback missed the caller introduction. Previous 2 HOST turns always catches it regardless of block length.
|
- **Turn-based lookback for caller-intro (2 HOST turns, not 120s)**: Long HOST monologue blocks (8-10 min) in big show segments meant time-based lookback missed the caller introduction. Previous 2 HOST turns always catches it regardless of block length.
|
||||||
- **CALLER-preference at 4s minimum overlap**: Batch transcription produces ~26s segments; diarization CALLER windows are ~10s. Pure majority-vote always gave HOST. 4s minimum CALLER coverage labels the segment CALLER without being overly aggressive for co-host episodes.
|
- **CALLER-preference at 4s minimum overlap**: Batch transcription produces ~26s segments; diarization CALLER windows are ~10s. Pure majority-vote always gave HOST. 4s minimum CALLER coverage labels the segment CALLER without being overly aggressive for co-host episodes.
|
||||||
- **Midpoint boundary resolution at load time**: Rather than re-diarizing everything, the sliding-window overlap is resolved in `load_diarized_transcript()` so it applies retroactively to all saved diarization files without touching the JSON.
|
- **Midpoint boundary resolution at load time**: Rather than re-diarizing everything, the sliding-window overlap is resolved in `load_diarized_transcript()` so it applies retroactively to all saved diarization files without touching the JSON.
|
||||||
- **751-1041 added as promo signal**: Earlier Tucson show number (vs 790-2040 in later seasons). Weighted 1 (needs a second semi-generic signal to filter).
|
- **751-1041 added as promo signal**: Earlier Tucson show number (vs 790-2040 in later seasons). Weighted 1 (needs a second semi-generic signal to filter).
|
||||||
- **Tom's windows sourced from first 60 min of co-host episodes**: Real callers don't call in during the first hour of a 2-hour show (only exceptions: very end of show). First-hour CALLER windows are safely all Tom.
|
- **Tara's windows sourced from first 60 min of co-host episodes**: Real callers don't call in during the first hour of a 2-hour show (only exceptions: very end of show). First-hour CALLER windows are safely all Tara.
|
||||||
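The turn-based lookback decision above can be sketched roughly as follows. This is a minimal illustration, not the code in `qa_extractor.py`: the function name `previous_host_turns` and the turn fields `speaker`, `start`, and `end` are assumptions made for the example.

```python
from typing import Dict, List

Turn = Dict[str, object]  # assumed shape: {"speaker": str, "start": float, "end": float}


def previous_host_turns(turns: List[Turn], index: int, count: int = 2) -> List[Turn]:
    """Return up to `count` HOST turns that precede turns[index], oldest first.

    The lookback counts turns, not seconds, so a long HOST monologue
    cannot push the caller introduction out of reach.
    """
    found: List[Turn] = []
    for turn in reversed(turns[:index]):
        if turn["speaker"] == "HOST":
            found.append(turn)
            if len(found) == count:
                break
    return list(reversed(found))


# Hypothetical episode fragment: a short intro, a co-host aside,
# a ~9 minute HOST block, then the caller.
EXAMPLE_TURNS: List[Turn] = [
    {"speaker": "HOST", "start": 0.0, "end": 30.0},
    {"speaker": "CO-HOST", "start": 30.0, "end": 40.0},
    {"speaker": "HOST", "start": 40.0, "end": 600.0},
    {"speaker": "CALLER", "start": 600.0, "end": 610.0},
]

lookback = previous_host_turns(EXAMPLE_TURNS, index=3)
```

In this toy fragment the first HOST turn ends 570s before the caller starts, so a 120s time window would never reach it, while the two-turn lookback still returns it.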
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Problems Encountered
|
## Problems Encountered
|
||||||
|
|
||||||
- **2016-s8e43 had 12 Q&A pairs, 11 false positives**: Root cause was Tom (co-host) labeled CALLER throughout. Fixed by building Tom's voice profile and re-diarizing.
|
- **2016-s8e43 had 12 Q&A pairs, 11 false positives**: Root cause was Tara (co-host) labeled CALLER throughout. Fixed by building Tara's voice profile and re-diarizing.
|
||||||
- **2014-s6e19 had 2 Q&A pairs from gaming discussion**: Same co-host issue. After re-diarization: 0 pairs (correct — no actual callers in that gaming special).
|
- **2014-s6e19 had 2 Q&A pairs from gaming discussion**: Same co-host issue. After re-diarization: 0 pairs (correct — no actual callers in that gaming special).
|
||||||
- **2012-03-10 yielded 0 segments labeled CALLER**: Midpoint assignment hit HOST turns (HOST 0-20s and CALLER 15-30s — midpoint 15.1s falls in HOST). Fixed by overlap-preference assignment with 4s CALLER minimum.
|
- **2012-03-10 yielded 0 segments labeled CALLER**: Midpoint assignment hit HOST turns (HOST 0-20s and CALLER 15-30s — midpoint 15.1s falls in HOST). Fixed by overlap-preference assignment with 4s CALLER minimum.
|
||||||
- **Real WiFi caller (2016, ~4794s) was missing after first fix attempt**: Aggressive time-based lookback (120s) combined with short CALLER turns from sliding-window diarization caused the caller question to land in a HOST segment. Fixed by turn-based lookback + co-host profile (eliminated Tom noise, letting real caller windows survive).
|
- **Real WiFi caller (2016, ~4794s) was missing after first fix attempt**: Aggressive time-based lookback (120s) combined with short CALLER turns from sliding-window diarization caused the caller question to land in a HOST segment. Fixed by turn-based lookback + co-host profile (eliminated Tara noise, letting real caller windows survive).
|
||||||
- **2012-Jun pair at 1325s was a promo**: "The Computer Guru. We'll get your problem solved. Call 751-1041 today" passed promo filter. Fixed by adding 751-1041 and "we'll get your problem solved" as promo signatures.
|
- **2012-Jun pair at 1325s was a promo**: "The Computer Guru. We'll get your problem solved. Call 751-1041 today" passed promo filter. Fixed by adding 751-1041 and "we'll get your problem solved" as promo signatures.
|
||||||
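The 2012-03-10 problem above (midpoint assignment landing in HOST) can be illustrated with a small sketch. It is not the project's implementation: the helper names, the `(start, end, speaker)` tuple layout, and the majority-overlap fallback when CALLER coverage is under 4s are all assumptions chosen for the example. Only the 4s CALLER minimum and the HOST 0-20s / CALLER 15-30s turns come from the log.

```python
from typing import Dict, List, Tuple

Turn = Tuple[float, float, str]  # (start_s, end_s, speaker label)


def overlap_seconds(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_by_midpoint(seg_start: float, seg_end: float, turns: List[Turn]) -> str:
    """Old behaviour: label the segment with the first turn containing its midpoint."""
    mid = (seg_start + seg_end) / 2.0
    for start, end, speaker in turns:
        if start <= mid < end:
            return speaker
    return "UNKNOWN"


def label_by_overlap(seg_start: float, seg_end: float, turns: List[Turn],
                     caller_min_s: float = 4.0) -> str:
    """Prefer CALLER once it covers at least `caller_min_s` seconds of the segment."""
    totals: Dict[str, float] = {}
    for start, end, speaker in turns:
        totals[speaker] = totals.get(speaker, 0.0) + overlap_seconds(
            seg_start, seg_end, start, end)
    if totals.get("CALLER", 0.0) >= caller_min_s:
        return "CALLER"
    if not totals or max(totals.values()) <= 0.0:
        return "UNKNOWN"
    # Assumed fallback: the speaker with the most overlap wins.
    return max(totals, key=lambda name: totals[name])


# Overlapping sliding-window turns as described in the log entry.
TURNS: List[Turn] = [(0.0, 20.0, "HOST"), (15.0, 30.0, "CALLER")]
```

For a transcript segment spanning 4.0-26.2s (midpoint 15.1s), `label_by_midpoint` picks HOST while `label_by_overlap` picks CALLER, because the CALLER turn covers 11.2s of the segment.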
|
|
||||||
---
|
---
|
||||||
@@ -51,8 +59,8 @@ Several bugs in `qa_extractor.py` were fixed: overlap resolution for sliding-win
|
|||||||
projects/radio-show/audio-processor/build_cohost_profile.py
|
projects/radio-show/audio-processor/build_cohost_profile.py
|
||||||
projects/radio-show/audio-processor/index_test_episodes.py
|
projects/radio-show/audio-processor/index_test_episodes.py
|
||||||
projects/radio-show/audio-processor/archive.db
|
projects/radio-show/audio-processor/archive.db
|
||||||
projects/radio-show/audio-processor/voice-profiles/tom/
|
projects/radio-show/audio-processor/voice-profiles/tara/
|
||||||
projects/radio-show/audio-processor/voice-profiles/profiles.json (updated: Tom added)
|
projects/radio-show/audio-processor/voice-profiles/profiles.json (updated: Tara added)
|
||||||
projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md (this file)
|
projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md (this file)
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -63,8 +71,8 @@ src/diarizer.py — map "Cohost:" prefix to "CO-HOST" speaker
|
|||||||
src/qa_extractor.py — overlap resolution, CALLER-preference, turn-based
|
src/qa_extractor.py — overlap resolution, CALLER-preference, turn-based
|
||||||
caller-intro lookback, _preceded_by_caller_intro(),
|
caller-intro lookback, _preceded_by_caller_intro(),
|
||||||
_PHONE_GREETING, 751-1041 + promo sig additions
|
_PHONE_GREETING, 751-1041 + promo sig additions
|
||||||
test-data/transcripts/2014-s6e19/diarization.json (re-diarized with Tom profile)
|
test-data/transcripts/2014-s6e19/diarization.json (re-diarized with Tara profile)
|
||||||
test-data/transcripts/2016-s8e43/diarization.json (re-diarized with Tom profile)
|
test-data/transcripts/2016-s8e43/diarization.json (re-diarized with Tara profile)
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -118,11 +126,11 @@ Q&A pairs: 10
|
|||||||
| Name | Role | Embeddings | Source Episodes |
|
| Name | Role | Embeddings | Source Episodes |
|
||||||
|------|------|-----------|-----------------|
|
|------|------|-----------|-----------------|
|
||||||
| Mike Swanson | host | 180 | 9 episodes (2010-2018) |
|
| Mike Swanson | host | 180 | 9 episodes (2010-2018) |
|
||||||
| Tom | cohost | 44 | 2014-s6e19, 2016-s8e43 |
|
| Tara | cohost | 44 | 2014-s6e19, 2016-s8e43 |
|
||||||
|
|
||||||
Tom vs Mike cosine similarity: **0.698** (well-separated at 0.85 threshold)
|
Tara vs Mike cosine similarity: **0.698** (well-separated at 0.85 threshold)
|
||||||
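A minimal sketch of how a single shared 0.85 threshold separates profiles, under stated assumptions: the function `label_embedding`, the toy 2-D vectors, and the fallback to `CALLER` are illustrative only and are not taken from `voice_profiler.py`. The cosine formula mirrors the one in `build_cohost_profile.py`, written here with the standard library instead of NumPy so the example has no dependencies.

```python
import math
from typing import Dict, List, Sequence

THRESHOLD = 0.85  # single bar shared by all profiles, per the log


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def label_embedding(embedding: Sequence[float],
                    composites: Dict[str, List[float]],
                    threshold: float = THRESHOLD) -> str:
    """Return the best-matching profile name, or CALLER if nothing clears the bar."""
    best_name, best_sim = "CALLER", -1.0
    for name, composite in composites.items():
        sim = cosine(embedding, composite)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else "CALLER"


# Toy composites chosen so that their mutual similarity is about 0.698.
COMPOSITES: Dict[str, List[float]] = {
    "Mike Swanson": [1.0, 0.0],
    "Tara": [0.698, 0.716],
}
```

Because the two composites sit at roughly 0.698 from each other, a voice that matches one of them at 0.85 or higher cannot be an exact copy of the other composite. A voice that clears neither bar stays CALLER, which is how unprofiled co-hosts and producers end up in that bucket.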
|
|
||||||
**Tom's source windows used:**
|
**Tara's source windows used:**
|
||||||
- 2014-s6e19: 195-260s, 320-425s, 600-650s, 675-710s
|
- 2014-s6e19: 195-260s, 320-425s, 600-650s, 675-710s
|
||||||
- 2016-s8e43: 100-115s, 135-160s, 270-295s, 575-605s, 1185-1235s, 1790-1870s, 2020-2055s
|
- 2016-s8e43: 100-115s, 135-160s, 270-295s, 575-605s, 1185-1235s, 1790-1870s, 2020-2055s
|
||||||
|
|
||||||
@@ -130,7 +138,7 @@ Tom vs Mike cosine similarity: **0.698** (well-separated at 0.85 threshold)
|
|||||||
|
|
||||||
## Co-Host Era Notes
|
## Co-Host Era Notes
|
||||||
|
|
||||||
Tom was the regular in-studio co-host/board-op roughly 2013-2016. His voice is in episodes from at least 2014 through 2016 (confirmed from test set). The 2011 and 2012 episodes are pure call-in format with no co-host.
|
Tara was an in-studio co-host whose voice appears in 2014-s6e19 and 2016-s8e43 (confirmed by Mike). The 2011 and 2012 episodes are pure call-in format with no co-host. Mike notes the show has had multiple co-hosts over the years. Tara's exact tenure is unconfirmed: the original 2013-2016 window was an assumption and should be verified before generalizing the profile across the full archive.
|
||||||
|
|
||||||
If there are occasional guest co-hosts or fill-in hosts in other years, they would still be labeled CALLER until profiled. These would be rare and would likely not form question patterns that survive the caller-intro gate.
|
If there are occasional guest co-hosts or fill-in hosts in other years, they would still be labeled CALLER until profiled. These would be rare and would likely not form question patterns that survive the caller-intro gate.
|
||||||
|
|
||||||
@@ -190,9 +198,9 @@ The pipeline is idempotent — `add_segments()` skips episodes already indexed.
|
|||||||
|
|
||||||
### 4. Verify co-host era episodes
|
### 4. Verify co-host era episodes
|
||||||
|
|
||||||
2013-2016 era episodes should now correctly separate Tom (CO-HOST) from actual callers. Spot-check a few 2015 episodes after processing to confirm Tom's profile generalizes well.
|
2013-2016 era episodes should now correctly separate Tara (CO-HOST) from actual callers. Spot-check a few 2015 episodes after processing to confirm Tara's profile generalizes well.
|
||||||
|
|
||||||
If any 2015/2016 episodes show too many CALLER turns that are clearly Tom (voice changed slightly over years), re-run `build_cohost_profile.py` with windows from that episode added to TOM_WINDOWS dict.
|
If any 2015/2016 episodes show too many CALLER turns that are clearly Tara (voice changed slightly over years), re-run `build_cohost_profile.py` with windows from that episode added to TARA_WINDOWS dict.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|||||||