radio: attach caller names to Q&A pairs from transcript intros

QAPair gets caller_name and caller_role fields populated by a new
attach_caller_names(pairs, transcript_segments) helper. For each pair,
finds the active opening intro at the question_start time (8s forward
tolerance, no backward limit: a caller's call can run for 10+ minutes
and the intro happens once at the start) and attaches the speaker name.

Validation on 9-episode test set: 19/19 Q&A pairs (100%) now have
caller names attached. Examples of corrections from oracle attribution:

  2018-s10e18 @ 73:36  Christopher (was misattributed to "Tara")
  2015-s7e19 @ 35:45   William (was misattributed to "Tara")
  2010-05-08-hr1       Jackie x3, Bruce
  2012-03-10-hr1       Adam x2
  2016-s8e43           John, Doug
  2017-s9e30           Tom, Denise x3, Charlie

speaker_oracle.py: adds speaker_at(time, intros) helper used both by
the existing resolve_speakers() and the new caller-name attachment.
Also adds the "let's fit/bring/put X in/on" intro pattern variant
(caught Charlie at 70:21 in 2017-s9e30 that "talk to X" missed).

download_full_archive.py: SSH keepalive every 30s + per-file
retry-on-failure (up to 3 attempts with reconnect). Earlier run hung on
a dead connection at file 109 of 589 with no recovery; restarted run is
now running at ~10 MB/s vs ~2-3 MB/s before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
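
The intro lookup described in the commit message can be sketched roughly as follows. This is an illustration only, not the code in speaker_oracle.py or qa_extractor.py: the Intro and QAPair classes, the forward_tolerance parameter, and attach_from_intros are hypothetical stand-ins, and the sketch takes already-extracted intros rather than raw transcript segments. Only the 8s forward tolerance and the absence of a backward limit come from the message.

```python
from dataclasses import dataclass
from typing import List, Optional

FORWARD_TOLERANCE_S = 8.0  # value stated in the commit message


@dataclass
class Intro:
    """Hypothetical record: when a caller was introduced, and their name."""
    time: float
    name: str


@dataclass
class QAPair:
    """Hypothetical minimal stand-in for the real QAPair."""
    question_start: float
    caller_name: Optional[str] = None


def speaker_at(time: float, intros: List[Intro],
               forward_tolerance: float = FORWARD_TOLERANCE_S) -> Optional[str]:
    """Return the name from the latest intro at or before time + tolerance.

    There is no backward limit: an intro from many minutes earlier stays
    active until a later intro replaces it.
    """
    active = None
    for intro in sorted(intros, key=lambda i: i.time):
        if intro.time <= time + forward_tolerance:
            active = intro
        else:
            break
    return active.name if active else None


def attach_from_intros(pairs: List[QAPair], intros: List[Intro]) -> List[QAPair]:
    """Set caller_name on each pair from the intro active at question_start."""
    for pair in pairs:
        pair.caller_name = speaker_at(pair.question_start, intros)
    return pairs


intros = [Intro(100.0, "Jackie"), Intro(700.0, "Bruce")]
pairs = attach_from_intros(
    [QAPair(50.0), QAPair(95.0), QAPair(650.0), QAPair(695.0)], intros
)
names = [p.caller_name for p in pairs]
```

In this sketch a question starting 5s before Jackie's intro still resolves to Jackie (forward tolerance), a question 550s after it also resolves to Jackie (no backward limit), and a question before any intro is left unnamed.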
@@ -17,7 +17,7 @@ ensure_cuda_libs()
 import torch
 from src.config import load_config
 from src.diarizer import diarize, VoiceProfileStore
-from src.qa_extractor import load_diarized_transcript, extract_qa_pairs
+from src.qa_extractor import load_diarized_transcript, extract_qa_pairs, attach_caller_names
 from rich.console import Console
 from rich.table import Table

@@ -180,8 +180,16 @@ for ep, transcript_path, audio_dur, _ in trans_results:
     diarization_path = trans_ep_dir / "diarization.json"
     segments = load_diarized_transcript(transcript_path, diarization_path)
     pairs = extract_qa_pairs(segments)
+
+    # Attach caller names from transcript intros
+    with open(transcript_path) as f:
+        td = _json.load(f)
+    attach_caller_names(pairs, td.get("segments", []))
+
+    named = sum(1 for p in pairs if p.caller_name)
+    name_str = ", ".join(p.caller_name for p in pairs if p.caller_name) or "—"
     qa_rows.append((ep.stem, len(pairs)))
-    console.print(f" {ep.stem}: {len(pairs)} Q&A pairs")
+    console.print(f" {ep.stem}: {len(pairs)} pairs ({named} named: {name_str})")

 # ── Summary ────────────────────────────────────────────────────────────────

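
The per-file retry that the commit message describes for download_full_archive.py (up to 3 attempts with reconnect) is not part of the diff shown here. A minimal sketch of that pattern, assuming the transfer and the reconnect step are passed in as plain callables, might look like this. The names retry_with_reconnect, action, and reconnect are hypothetical, the choice to catch OSError is an assumption, and the 30s SSH keepalive is not modeled.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_reconnect(action: Callable[[], T],
                         reconnect: Callable[[], None],
                         attempts: int = 3,
                         delay_s: float = 0.0) -> T:
    """Run action(); on OSError, reconnect and try again, up to `attempts` times.

    Hypothetical helper: the real script's error types and reconnect logic
    are not shown in this commit page.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception = OSError("no attempt made")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                reconnect()          # re-establish the connection before retrying
                time.sleep(delay_s)  # optional pause between attempts
    raise last_error


# Usage with a fake transfer that fails twice on a "dead connection".
state = {"calls": 0, "reconnects": 0}


def fake_download() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise OSError("connection dropped")
    return "ok"


def fake_reconnect() -> None:
    state["reconnects"] += 1


result = retry_with_reconnect(fake_download, fake_reconnect)
```

Reconnecting only between attempts, and not after the final failure, keeps the last error visible to the caller so a batch loop can log the file and move on.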