claudetools/projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md
Mike Swanson fb683d6a05 radio: rename Tom -> Tara, expand speaker roster
Mike confirmed there is no co-host named "Tom" — the voice in 2014-s6e19
and 2016-s8e43 is Tara. The 5070 Ti session fabricated the Tom identity.
The voice profile itself (44 embeddings, 0.698 cosine vs Mike) is correct;
only the human label was wrong.

Rename swept:
- voice-profiles/tom/ -> voice-profiles/tara/ (git mv preserves all .npy)
- voice-profiles/profiles.json: "Tom" key -> "Tara"
- build_cohost_profile.py: TOM_WINDOWS -> TARA_WINDOWS, COHOST_NAME, comments
- 2026-04-27-qa-extraction-cohost-indexing.md: correction header + body sweep
- 2026-04-27-4090-benchmark-and-test-set.md: closure note
- .claude/memory/radio_show_no_cohost_named_tom.md: resolution + speaker roster

Diarization re-run after rename so speaker_map emits "Cohost: Tara".
Q&A counts unchanged (rename is label-only): 9 pairs across 6 test episodes.

Tara distribution from the post-rename diarization (per-episode % of audio):
  2011-03-12-hr1   140s   5.6%   likely false positive (call-in only)
  2012-03-10-hr1    30s   1.1%   likely false positive (call-in only)
  2012-06-09-hr1   340s  12.8%   suspicious — pending Mike confirm
  2014-s6e19       680s  23.3%   confirmed
  2016-s8e43      1890s  35.5%   confirmed
  2017-s9e30       610s  11.4%   plausible — pending Mike confirm

Broader speaker-roster context Mike provided this session (saved to
memory): the show has had multiple co-hosts (Tara, Randall, Rob) plus
producers/board ops (Andrew, Shannon, Ken, others) who would sometimes
go on-air. Only Tara has a profile so far. Every other speaker is
currently labeled CALLER, which means small CO-HOST attributions in
unexpected episodes (e.g. 2011/2012) may actually be a producer rather
than a false positive — Mike to spot-check.

Action item before full-archive run: build profiles for Randall, Rob,
and the named producers to avoid systematic Q&A false positives in
early-years and 2018/2019 episodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:11:03 -07:00


Session Log: Q&A Extraction — Co-Host Profile + Archive Indexing

Date: 2026-04-27
Project: Radio Show Archive Mining — Computer Guru Show

Correction (2026-04-27, GURU-BEAST-ROG session): This log was originally written referring to the co-host as "Tom." Mike confirmed there is no co-host by that name; the voice in 2014-s6e19 and 2016-s8e43 is Tara. The voice profile is correct (clean 0.698 cosine separation from Mike), only the human identity attached to it was wrong. All references below have been updated Tom → Tara. There have been multiple co-hosts on the show over the years; Tara is one of them.


User

  • User: Mike Swanson (mike)
  • Machine: DESKTOP-0O8A1RL
  • Role: admin

Session Summary

The session resumed after a benchmark run that showed a significant speedup in Whisper transcription: 63.8x real-time with batched inference and int8_float16. The focus then shifted to evaluating Q&A extraction quality across six test episodes, which exposed a critical false-positive problem: co-host Tara was labeled CALLER because her voice scored below the 0.85 host-match threshold and no co-host profile existed yet.

A co-host voice profile for Tara was constructed using 44 embeddings from two specific episodes (2014-s6e19 and 2016-s8e43), producing a cosine similarity of 0.698 against Mike — well below Mike's 0.85 threshold, giving clean separation. Code was updated in voice_profiler.py and diarizer.py to correctly emit "Cohost: Tara" labels and map them to a new "CO-HOST" speaker tag. Re-diarizing the two co-host-era episodes dramatically cleaned up Q&A results: 2016 went from 12 false positives to 2 real WiFi caller pairs.
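The labeling rule described above can be sketched as follows. This is a minimal illustration, not the code in voice_profiler.py or diarizer.py: the profile layout, the function names, and the "Host: <name>" label are assumptions, and the two-dimensional vectors are toy stand-ins for real embeddings. Only the 0.85 threshold, the "Cohost: <name>" label, and the CO-HOST tag come from this log.

```python
import math

HOST_MATCH_THRESHOLD = 0.85  # single bar applied to every profile (from this log)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def label_window(embedding, profiles, threshold=HOST_MATCH_THRESHOLD):
    """Label one diarization window against every known profile.

    profiles maps name -> (role, composite_embedding); this layout is
    hypothetical. Windows below the threshold for all profiles are CALLER.
    """
    best_name, best_role, best_score = None, None, -1.0
    for name, (role, composite) in profiles.items():
        score = cosine(embedding, composite)
        if score > best_score:
            best_name, best_role, best_score = name, role, score
    if best_score < threshold:
        return "CALLER"
    prefix = "Cohost" if best_role == "cohost" else "Host"  # "Host:" is assumed
    return f"{prefix}: {best_name}"


def to_speaker_tag(label):
    """Map a profile label to the speaker tag used downstream."""
    if label.startswith("Cohost:"):
        return "CO-HOST"
    if label.startswith("Host:"):
        return "HOST"
    return "CALLER"
```

With toy profiles whose mutual similarity is roughly 0.698, a window close to Tara's composite maps to CO-HOST, while a window that matches neither profile stays CALLER.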

Several bugs in qa_extractor.py were fixed: overlap resolution for sliding-window diarization boundaries, CALLER-preference threshold for long batch transcript segments, and a turn-based caller-intro lookback to replace an ineffective 120s time window. Phone-greeting detection and new promo signatures were added. The final Q&A count landed at 10 pairs across 6 episodes, with 2014 correctly yielding 0 (gaming co-host episode with no actual callers).
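A turn-based lookback of the kind described here might look like the sketch below. The function body and the greeting phrases are hypothetical; qa_extractor.py defines _preceded_by_caller_intro() and _PHONE_GREETING, but their contents are not shown in this log. The point is that counting HOST turns, rather than seconds, makes the check independent of how long a HOST block runs.

```python
import re

# Hypothetical greeting phrases; the real _PHONE_GREETING pattern is not shown here.
PHONE_GREETING = re.compile(
    r"you're on the air|thanks for calling|let's go to the phones", re.IGNORECASE
)


def preceded_by_caller_intro(turns, index, host_turns_back=2):
    """Return True if one of the last `host_turns_back` HOST turns before
    turns[index] contains a phone greeting.

    turns is a list of (speaker, text) tuples in chronological order.
    """
    seen = 0
    for speaker, text in reversed(turns[:index]):
        if speaker != "HOST":
            continue  # only HOST turns count toward the lookback
        if PHONE_GREETING.search(text):
            return True
        seen += 1
        if seen >= host_turns_back:
            return False
    return False
```

A HOST monologue of any duration still counts as one turn, so a greeting two HOST turns back is found even when it sits several minutes earlier.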

archive.db was created with the ArchiveIndex schema (episodes, segments, segments_fts, qa_pairs, qa_fts). All 6 test episodes were indexed: 762 segments, 10 Q&A pairs. FTS5 search verified working for "router", "Windows 10", "Internet Explorer", "antivirus", and "connect" queries.
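For readers unfamiliar with FTS5, the sketch below shows one way a segments table and its full-text index can fit together in SQLite. The column names and the two sample rows are made up for illustration; the actual ArchiveIndex schema is not reproduced in this log, and the sketch assumes the Python build of SQLite includes FTS5.

```python
import sqlite3


def build_demo_index():
    """Create an in-memory database with a segments table and an FTS5 index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE segments (
            id         INTEGER PRIMARY KEY,
            episode_id TEXT NOT NULL,
            speaker    TEXT NOT NULL,
            text       TEXT NOT NULL
        );
        -- External-content FTS5 table: holds the index, not a second text copy.
        CREATE VIRTUAL TABLE segments_fts USING fts5(
            text, content='segments', content_rowid='id'
        );
        """
    )
    demo_rows = [  # invented sample text, not real transcript content
        ("2016-s8e43", "CALLER", "My router keeps dropping the WiFi connection."),
        ("2017-s9e30", "HOST", "Windows 10 handles that driver automatically."),
    ]
    for episode_id, speaker, text in demo_rows:
        cur = conn.execute(
            "INSERT INTO segments (episode_id, speaker, text) VALUES (?, ?, ?)",
            (episode_id, speaker, text),
        )
        # Keep the index in sync by hand; triggers are the usual alternative.
        conn.execute(
            "INSERT INTO segments_fts (rowid, text) VALUES (?, ?)",
            (cur.lastrowid, text),
        )
    return conn


def search(conn, query, speaker=None):
    """Full-text search over segments, optionally filtered by speaker."""
    sql = (
        "SELECT s.episode_id, s.speaker FROM segments_fts "
        "JOIN segments AS s ON s.id = segments_fts.rowid "
        "WHERE segments_fts MATCH ?"
    )
    params = [query]
    if speaker is not None:
        sql += " AND s.speaker = ?"
        params.append(speaker)
    return conn.execute(sql, params).fetchall()
```

Multi-word queries such as "Windows 10" are safest passed as a quoted FTS5 phrase, for example `search(conn, '"Windows 10"', speaker="HOST")`.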


Key Decisions

  • Co-host threshold uses same 0.85 bar as host: Tara scores 0.698 vs Mike. Any voice >= 0.85 against Tara's composite gets labeled CO-HOST. Keeps the same single threshold for all profiles rather than per-profile thresholds.
  • Turn-based lookback for caller-intro (2 HOST turns, not 120s): Long HOST monologue blocks (8-10 min) in big show segments meant time-based lookback missed the caller introduction. Previous 2 HOST turns always catches it regardless of block length.
  • CALLER-preference at 4s minimum overlap: Batch transcription produces ~26s segments; diarization CALLER windows are ~10s. Pure majority-vote always gave HOST. 4s minimum CALLER coverage labels the segment CALLER without being overly aggressive for co-host episodes.
  • Midpoint boundary resolution at load time: Rather than re-diarizing everything, the sliding-window overlap is resolved in load_diarized_transcript() so it applies retroactively to all saved diarization files without touching the JSON.
  • 751-1041 added as promo signal: Earlier Tucson show number (vs 790-2040 in later seasons). Weighted 1 (needs a second semi-generic signal to filter).
  • Tara's windows sourced from first 60 min of co-host episodes: Real callers don't call in during the first hour of a 2-hour show (only exceptions: very end of show). First-hour CALLER windows are safely all Tara.
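The CALLER-preference decision can be illustrated with a small overlap calculation. This is a sketch under assumptions: the function names, the tuple layout, and the HOST fallback for uncovered segments are hypothetical, and only the 4s minimum comes from the log.

```python
CALLER_MIN_S = 4.0  # minimum CALLER coverage needed to label a segment CALLER


def overlap_s(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segment(seg_start, seg_end, turns, caller_min_s=CALLER_MIN_S):
    """Pick a speaker for one transcript segment.

    turns is a list of (start, end, speaker) diarization turns. CALLER wins
    whenever it covers at least caller_min_s seconds of the segment;
    otherwise the speaker with the most coverage wins.
    """
    coverage = {}
    for start, end, speaker in turns:
        covered = overlap_s(seg_start, seg_end, start, end)
        if covered > 0.0:
            coverage[speaker] = coverage.get(speaker, 0.0) + covered
    if coverage.get("CALLER", 0.0) >= caller_min_s:
        return "CALLER"
    # Falling back to HOST for uncovered segments is an assumption.
    return max(coverage, key=coverage.get, default="HOST")
```

For a 26s segment covering 17.5s of HOST and 8.5s of CALLER, a pure majority vote returns HOST, while this rule returns CALLER because 8.5s exceeds the 4s minimum.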

Problems Encountered

  • 2016-s8e43 had 12 Q&A pairs, 11 false positives: Root cause was Tara (co-host) labeled CALLER throughout. Fixed by building Tara's voice profile and re-diarizing.
  • 2014-s6e19 had 2 Q&A pairs from gaming discussion: Same co-host issue. After re-diarization: 0 pairs (correct — no actual callers in that gaming special).
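  • 2012-03-10 yielded 0 segments labeled CALLER: Midpoint assignment kept landing on HOST turns. For example, with HOST at 0-20s and CALLER at 15-30s, a transcript segment whose midpoint is 15.1s falls in HOST even though the segment also overlaps the CALLER turn. Fixed by overlap-preference assignment with a 4s CALLER minimum.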
  • Real WiFi caller (2016, ~4794s) was missing after first fix attempt: Aggressive time-based lookback (120s) combined with short CALLER turns from sliding-window diarization caused the caller question to land in a HOST segment. Fixed by turn-based lookback + co-host profile (eliminated Tara noise, letting real caller windows survive).
  • 2012-Jun pair at 1325s was a promo: "The Computer Guru. We'll get your problem solved. Call 751-1041 today" passed promo filter. Fixed by adding 751-1041 and "we'll get your problem solved" as promo signatures.
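The weighted promo filter from the last item can be sketched as below. The two signatures and the threshold of 2 come from this log; the weight of 1 for "we'll get your problem solved", the substring matching, and the `>=` comparison are assumptions.

```python
PROMO_SCORE_THRESHOLD = 2

# Signature -> weight. The 751-1041 weight is from the log; the phrase
# weight is assumed for illustration.
PROMO_SIGNATURES = {
    "751-1041": 1,
    "we'll get your problem solved": 1,
}


def promo_score(text):
    """Sum the weights of every promo signature found in the text."""
    lowered = text.lower()
    return sum(weight for sig, weight in PROMO_SIGNATURES.items() if sig in lowered)


def is_promo(text):
    """Treat text as a promo once its score reaches the threshold."""
    return promo_score(text) >= PROMO_SCORE_THRESHOLD
```

With weight-1 signatures, a caller who merely mentions the phone number scores 1 and is kept, while the promo that contains both signatures scores 2 and is filtered.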

Files Created / Modified

New files

projects/radio-show/audio-processor/build_cohost_profile.py
projects/radio-show/audio-processor/index_test_episodes.py
projects/radio-show/audio-processor/archive.db
projects/radio-show/audio-processor/voice-profiles/tara/
projects/radio-show/audio-processor/voice-profiles/profiles.json  (updated: Tara added)
projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md  (this file)

Modified

src/voice_profiler.py       — emit "Cohost: <name>" label for cohost role
src/diarizer.py             — map "Cohost:" prefix to "CO-HOST" speaker
src/qa_extractor.py         — overlap resolution, CALLER-preference, turn-based
                              caller-intro lookback, _preceded_by_caller_intro(),
                              _PHONE_GREETING, 751-1041 + promo sig additions
test-data/transcripts/2014-s6e19/diarization.json   (re-diarized with Tara profile)
test-data/transcripts/2016-s8e43/diarization.json   (re-diarized with Tara profile)

Benchmark Results (from previous run — baseline for BEAST comparison)

Machine: DESKTOP-0O8A1RL — NVIDIA GeForce RTX 5070 Ti Laptop GPU

Episode          Audio     Wall (diarize)   RTF
2011-03-12-hr1    2509s     15.1s           166.1x
2012-03-10-hr1    2634s     12.2s           215.5x
2012-06-09-hr1    2648s     12.2s           216.8x
2014-s6e19        2914s     13.4s           216.9x
2016-s8e43        5326s     24.2s           219.6x
2017-s9e30        5343s     24.7s           216.4x
TOTAL            21374s    101.9s           209.7x

Transcription (batched Whisper large-v3): 63.8x realtime
Diarization: 209.7x realtime
vs DESKTOP-0O8A1RL baseline (149.5x): +60.2x (+40.3%)
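The figures above follow from two simple ratios, sketched here with hypothetical helper names. RTF is audio duration divided by wall-clock time. Recomputing from the rounded wall times in the table lands within about 0.1x of the reported values, which suggests the reported figures used unrounded timings.

```python
def realtime_factor(audio_s, wall_s):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_s / wall_s


def improvement(new_rtf, baseline_rtf):
    """Return (absolute gain, percentage gain) of new_rtf over baseline_rtf."""
    delta = new_rtf - baseline_rtf
    return delta, delta / baseline_rtf * 100.0


total_rtf = realtime_factor(21374, 101.9)     # close to the reported 209.7x
delta, percent = improvement(209.7, 149.5)    # about +60.2x and +40.3%
```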


Archive DB State

Path: projects/radio-show/audio-processor/archive.db

Episodes : 6
Segments : 762
Q&A pairs: 10

Q&A pairs by episode:

Episode          Pairs   Notes
2011-03-12-hr1   3       IE lockout call, cloud computing, ghost hunting caller
2012-03-10-hr1   1       iPad 3 discussion
2012-06-09-hr1   1       Windows repair feature call
2014-s6e19       0       Gaming co-host special — no actual callers
2016-s8e43       2       WiFi connectivity caller (2 turns of same call)
2017-s9e30       3       Software control, Cat5 cabling (Charlie), WiFi ports

Voice Profiles State

Path: projects/radio-show/audio-processor/voice-profiles/

Name           Role     Embeddings   Source Episodes
Mike Swanson   host     180          9 episodes (2010-2018)
Tara           cohost   44           2014-s6e19, 2016-s8e43

Tara vs Mike cosine similarity: 0.698 (well-separated at 0.85 threshold)

Tara's source windows used:

  • 2014-s6e19: 195-260s, 320-425s, 600-650s, 675-710s
  • 2016-s8e43: 100-115s, 135-160s, 270-295s, 575-605s, 1185-1235s, 1790-1870s, 2020-2055s

Co-Host Era Notes

Tara was an in-studio co-host whose voice appears in 2014-s6e19 and 2016-s8e43 (confirmed by Mike). The 2011 and 2012 episodes are pure call-in format with no co-host. Mike notes the show has had multiple co-hosts over the years. Tara's exact tenure is not established: the original 2013-2016 range was an assumption and should be verified before generalizing the profile across the full archive.

If there are occasional guest co-hosts or fill-in hosts in other years, they would still be labeled CALLER until profiled. These would be rare and would likely not form question patterns that survive the caller-intro gate.


Pending Tasks for BEAST (GURU-BEAST-ROG)

1. Run benchmark.py to establish RTX 4090 baseline

cd D:/claudetools/projects/radio-show/audio-processor
.venv/Scripts/python benchmark.py 2>&1 | tee bench-4090.txt

BENCH_SETUP.md has all setup steps. The voice profiles are in voice-profiles/ (already copied or available via Tailscale/robocopy from DESKTOP-0O8A1RL). Test episodes go in test-data/episodes/.

Expected: diarization RTF should be ~250-300x on RTX 4090 (vs 209.7x on laptop 5070 Ti). Transcription should be ~70-80x.

Update benchmark.py line 27 after measuring:

BASELINE_RTF  = 209.7  # current laptop 5070 Ti baseline

2. Download full archive from IX server (172.16.3.10)

Use paramiko (SSH with key agent disabled):

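import paramiko

ssh = paramiko.SSHClient()
# AutoAddPolicy accepts unknown host keys on first connect
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
# look_for_keys=False and allow_agent=False force password auth (no key files, no SSH agent)
ssh.connect("172.16.3.10", username="gurushow", password="<from vault>",
            look_for_keys=False, allow_agent=False)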

Archive path: /home/gurushow/public_html/archive/Radio/
Episode count: 579 MP3s across 2010-2018 (no 2013 season)
Approximate total size: ~30-40 GB

Download script skeleton in prior session log: 2026-04-27-diarization-pipeline.md

Tailscale required: the IX server (172.16.3.10) is reachable only over the VPN.
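A resumable download loop could look like the sketch below. It is not the skeleton from the prior session log. It assumes a flat remote directory (the real archive may use per-year subfolders) and takes any object that behaves like paramiko's SFTPClient, using only its listdir_attr() and get() methods. The function name and the size-based skip rule are hypothetical.

```python
import os


def download_missing(sftp, remote_dir, local_dir):
    """Fetch every .mp3 in remote_dir that is absent or size-mismatched locally.

    sftp is expected to behave like paramiko.SFTPClient. Returns the list of
    filenames that were downloaded on this call.
    """
    os.makedirs(local_dir, exist_ok=True)
    fetched = []
    for entry in sftp.listdir_attr(remote_dir):
        name = entry.filename
        if not name.lower().endswith(".mp3"):
            continue
        local_path = os.path.join(local_dir, name)
        # Skip files already present with the same size, so reruns resume.
        if os.path.exists(local_path) and os.path.getsize(local_path) == entry.st_size:
            continue
        sftp.get(remote_dir.rstrip("/") + "/" + name, local_path)
        fetched.append(name)
    return fetched
```

With the connection from step 2, a call might read `download_missing(ssh.open_sftp(), "/home/gurushow/public_html/archive/Radio/", "test-data/episodes")`. Comparing sizes only catches truncated files, not corrupted ones.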

3. Full archive processing

Once episodes are downloaded:

# Transcribe + diarize all episodes
cd D:/claudetools/projects/radio-show/audio-processor
.venv/Scripts/python diarize_training.py  # or a new batch_process_all.py

# Index everything into archive.db
.venv/Scripts/python index_test_episodes.py  # modify to point at full episodes dir

The pipeline is idempotent — add_segments() skips episodes already indexed.
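One way to get that idempotency is to check the episodes table before inserting, as in this sketch. The signature, schema, and return value are assumptions; only the add_segments() name and its skip-if-indexed behavior come from this log.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with a simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE episodes (id TEXT PRIMARY KEY);
        CREATE TABLE segments (
            episode_id TEXT, start_s REAL, end_s REAL, speaker TEXT, text TEXT
        );
        """
    )
    return conn


def add_segments(conn, episode_id, segments):
    """Insert an episode and its segments once.

    segments is a list of (start_s, end_s, speaker, text) tuples.
    Returns False without writing anything if the episode is already indexed.
    """
    row = conn.execute("SELECT 1 FROM episodes WHERE id = ?", (episode_id,)).fetchone()
    if row is not None:
        return False
    with conn:  # one transaction: episode and segments commit together
        conn.execute("INSERT INTO episodes (id) VALUES (?)", (episode_id,))
        conn.executemany(
            "INSERT INTO segments VALUES (?, ?, ?, ?, ?)",
            [(episode_id, *segment) for segment in segments],
        )
    return True
```

Because the episode row and its segments commit in one transaction, an interrupted run leaves no half-indexed episode that a rerun would then skip.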

4. Verify co-host era episodes

Episodes from the assumed 2013-2016 co-host era (tenure still unverified, and the archive has no 2013 season) should now correctly separate Tara (CO-HOST) from actual callers. Spot-check a few 2015 episodes after processing to confirm Tara's profile generalizes well.

If any 2015/2016 episodes show too many CALLER turns that are clearly Tara (voice changed slightly over years), re-run build_cohost_profile.py with windows from that episode added to TARA_WINDOWS dict.


Technical Reference

Key thresholds

host_match_threshold = 0.85    # WavLM cosine similarity — applied to ALL profiles
CALLER_MIN_S = 4.0             # min CALLER coverage in transcript segment to label CALLER
PROMO_SCORE_THRESHOLD = 2      # weighted promo signature score
MIN_QUESTION_DURATION = 5.0    # seconds
MIN_ANSWER_DURATION = 15.0     # seconds
MAX_GAP_BETWEEN_QA = 30.0      # seconds
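The three Q&A limits can be read as a gate on adjacent CALLER and HOST turns. The sketch below is one plausible reading, not the logic in qa_extractor.py, which also applies promo and caller-intro checks. The function name and tuple layout are hypothetical.

```python
MIN_QUESTION_DURATION = 5.0   # seconds
MIN_ANSWER_DURATION = 15.0    # seconds
MAX_GAP_BETWEEN_QA = 30.0     # seconds


def pair_questions_with_answers(turns):
    """Pair each CALLER turn with the HOST turn that immediately follows it.

    turns is a chronological list of (speaker, start_s, end_s) tuples.
    A pair is kept only if the question and answer are long enough and the
    silence between them does not exceed MAX_GAP_BETWEEN_QA.
    """
    pairs = []
    for question, answer in zip(turns, turns[1:]):
        q_speaker, q_start, q_end = question
        a_speaker, a_start, a_end = answer
        if q_speaker != "CALLER" or a_speaker != "HOST":
            continue
        if q_end - q_start < MIN_QUESTION_DURATION:
            continue
        if a_end - a_start < MIN_ANSWER_DURATION:
            continue
        if a_start - q_end > MAX_GAP_BETWEEN_QA:
            continue
        pairs.append((question, answer))
    return pairs
```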

Diarization sliding window

window_s = 10.0   # 10s embedding windows
hop_s = 5.0       # 5s hop → overlapping boundaries (resolved at load time)
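With a 10s window and a 5s hop, consecutive windows overlap by 5s, so adjacent turns can claim the same audio. The sketch below generates window start times and resolves each overlap at its midpoint. The function names are hypothetical, and it assumes turns are sorted by start time with no turn fully contained in another.

```python
def window_starts(duration_s, window_s=10.0, hop_s=5.0):
    """Start times of every full window that fits inside duration_s."""
    starts = []
    t = 0.0
    while t + window_s <= duration_s:
        starts.append(t)
        t += hop_s
    return starts


def resolve_overlaps(turns):
    """Cut overlapping consecutive turns at the midpoint of their overlap.

    turns is a list of (start_s, end_s, speaker) tuples sorted by start time.
    """
    resolved = [list(turn) for turn in turns]
    for prev, cur in zip(resolved, resolved[1:]):
        if cur[0] < prev[1]:  # cur starts before prev ends
            midpoint = (cur[0] + prev[1]) / 2.0
            prev[1] = midpoint
            cur[0] = midpoint
    return [tuple(turn) for turn in resolved]
```

For HOST at 0-20s and CALLER at 15-30s, the boundary moves to 17.5s, giving HOST 0-17.5s and CALLER 17.5-30s. Doing this at load time leaves the saved JSON untouched.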

Transcription (batch mode)

model_size = "large-v3"
compute_type = "int8_float16"
batch_size = 16
# No word timestamps in batch mode (not needed for search/diarization)

DB search examples

from src.indexer import ArchiveIndex
from pathlib import Path

with ArchiveIndex(Path("archive.db")) as idx:
    # Segment search
    results = idx.search("router", limit=20)
    results = idx.search("Windows 10", speaker_filter="HOST", limit=10)

    # Q&A search
    qa = idx.search_qa("antivirus", limit=10)
    qa = idx.search_qa("wifi connect", limit=10)

Archive server

Host: 172.16.3.10 (requires Tailscale)
User: gurushow
Archive root: /home/gurushow/public_html/archive/Radio/
SSH: paramiko with look_for_keys=False, allow_agent=False