radio show: co-host voice profile, Q&A extraction fixes, archive index
- Build Tom (co-host) voice profile (44 embeddings, 0.698 similarity to Mike) - diarizer.py: add CO-HOST speaker label for cohost-role profiles - voice_profiler.py: emit "Cohost: <name>" label for cohost role - qa_extractor.py: overlap resolution at load time (midpoint boundary split), 4s CALLER-preference threshold, turn-based caller-intro lookback (2 HOST turns), _preceded_by_caller_intro() helper, _PHONE_GREETING pattern, 751-1041 + "we'll get your problem solved" promo signatures - benchmark.py: use src.transcriber.transcribe with batch_size=16 - add index_test_episodes.py and build_cohost_profile.py scripts - add .gitignore (exclude episodes, transcripts, *.db, .venv) - session log: 2026-04-27-qa-extraction-cohost-indexing.md Result: 2016-s8e43 drops from 12 false-positive Q&A pairs to 2 real caller pairs. archive.db: 6 episodes, 762 segments, 10 Q&A pairs, FTS5 search verified. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -57,13 +57,15 @@ trans_results = []
|
||||
trans_total_audio = 0.0
|
||||
trans_total_wall = 0.0
|
||||
|
||||
import json
|
||||
from src.transcriber import transcribe as _transcribe
|
||||
|
||||
for ep in EPISODES:
|
||||
trans_ep_dir = TRANS_DIR / ep.stem
|
||||
trans_ep_dir.mkdir(parents=True, exist_ok=True)
|
||||
transcript_path = trans_ep_dir / "transcript.json"
|
||||
|
||||
if transcript_path.exists():
|
||||
import json
|
||||
with open(transcript_path) as f:
|
||||
td = json.load(f)
|
||||
dur = td.get("duration", 0)
|
||||
@@ -74,30 +76,15 @@ for ep in EPISODES:
|
||||
console.print(f" Transcribing {ep.name}...")
|
||||
t0 = time.monotonic()
|
||||
|
||||
from faster_whisper import WhisperModel
|
||||
if not hasattr(sys, "_whisper_model"):
|
||||
console.print(" [dim]Loading Whisper large-v3...[/dim]")
|
||||
sys._whisper_model = WhisperModel("large-v3", device=device, compute_type="float16")
|
||||
|
||||
model = sys._whisper_model
|
||||
segments_iter, info = model.transcribe(str(ep), language="en", beam_size=5)
|
||||
|
||||
import json
|
||||
segs = []
|
||||
for seg in segments_iter:
|
||||
segs.append({"id": seg.id, "start": seg.start, "end": seg.end, "text": seg.text})
|
||||
|
||||
duration = info.duration
|
||||
transcript = _transcribe(ep, model_size="large-v3", device=device, batch_size=16)
|
||||
wall = time.monotonic() - t0
|
||||
rtf = duration / wall
|
||||
rtf = transcript.duration / wall
|
||||
|
||||
result = {"duration": duration, "language": "en", "segments": segs}
|
||||
with open(transcript_path, "w") as f:
|
||||
json.dump(result, f)
|
||||
transcript.save(trans_ep_dir)
|
||||
|
||||
console.print(f" [green]{ep.stem}: {duration:.0f}s audio in {wall:.1f}s = {rtf:.1f}x realtime[/green]")
|
||||
trans_results.append((ep, transcript_path, duration, wall))
|
||||
trans_total_audio += duration
|
||||
console.print(f" [green]{ep.stem}: {transcript.duration:.0f}s audio in {wall:.1f}s = {rtf:.1f}x realtime[/green]")
|
||||
trans_results.append((ep, transcript_path, transcript.duration, wall))
|
||||
trans_total_audio += transcript.duration
|
||||
trans_total_wall += wall
|
||||
|
||||
if trans_total_wall > 0:
|
||||
|
||||
Reference in New Issue
Block a user