Add radio show audio processor and post-show workflow

- Audio processor CLI tool with 6-stage pipeline: transcribe (faster-whisper GPU),
  diarize (pyannote), detect segments (multi-signal classifier), remove commercials,
  split segments, analyze content (Ollama)
- Post-show workflow doc for episode posts, forum threads, deep-dive blog posts
- Training plan for using 579-episode archive for voice profiles and commercial detection
- Successful test: 45min episode transcribed in 2:37 on RTX 5070 Ti
- Sample transcript output from S7E30 (March 2015)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 11:51:59 -07:00
parent a8c8c6b7b6
commit a1e0442d8b
17 changed files with 58344 additions and 0 deletions

.gitignore

@@ -63,3 +63,7 @@ api/.env
.mcp.json
Pictures/
.grepai/
# Radio processor
projects/radio-show/audio-processor/test-data/*.mp3
projects/radio-show/audio-processor/*.egg-info/


@@ -0,0 +1,365 @@
# Radio Show Audio Processor
Automated pipeline for processing The Computer Guru Show recordings into podcast-ready audio, transcripts, and segmented clips.
## What It Does
```
Raw MP3 (full broadcast with commercials)
├── 1. Transcribe (Whisper + GPU)
│   └── Full transcript with timestamps
├── 2. Speaker Diarization (pyannote)
│   └── Who said what (host vs. callers vs. guests)
├── 3. Segment Detection
│   ├── Identify show segments vs. commercials
│   ├── Detect music/jingles (known + discovered)
│   └── Map segments to show prep structure
├── 4. Commercial Removal
│   └── Clean episode MP3 (show content only)
├── 5. Segment Splitting
│   ├── Individual segment MP3s (for social media)
│   └── Chapter markers (for podcast players)
└── 6. Content Analysis (Ollama)
    ├── Episode summary
    ├── Topic extraction
    ├── Key quotes
    └── Auto-populate post-show debrief
```
## Architecture
### Pipeline Stages
#### Stage 1: Transcription — `faster-whisper` (GPU)
- **Model:** `large-v3` (best accuracy, ~3GB VRAM)
- **Why faster-whisper:** CTranslate2 backend, up to 4x faster than OpenAI's reference Whisper implementation at comparable accuracy, with lower VRAM use
- **Output:** Word-level timestamps, language detection
- **Hardware:** RTX 5070 Ti (16GB VRAM); plenty for large-v3
#### Stage 2: Speaker Diarization — `pyannote.audio`
- **Model:** `pyannote/speaker-diarization-3.1`
- **Purpose:** Identify speaker turns (host, caller 1, caller 2, etc.)
- **Voice enrollment:** Bootstrapped from archive (hundreds of hours of host speech)
- **Output:** Speaker segments with timestamps
#### Stage 3: Segment Detection — Multi-Signal Classifier
Commercial and segment detection uses multiple signals combined, because not all show production elements are in the archive — bumpers, stingers, and jingles vary across stations and eras.
**Signal 1: Known element fingerprints (seed library)**
- Fingerprint the production elements we DO have (intros, outros, bumpers from archive)
- Match against episodes to detect known boundaries
- Partial coverage — some elements won't match
**Signal 2: Unknown element discovery**
- Detect short non-speech audio segments (music, jingles, produced audio) that don't match any known fingerprint
- Cluster unknown elements across episodes — if the same 5-second clip appears in 30 episodes, it's a show element
- Flag new clusters for host review and naming
- Discovered elements get added to the fingerprint library automatically
**Signal 3: Speaker identity**
- Host voice present = show content
- Non-host voices with commercial audio characteristics (compressed, produced, different acoustic environment) = ads
- Host voice absent for extended periods (>30s) = likely commercial break
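
The third rule above (host silent for more than 30 seconds) can be sketched as a gap search over the host's diarized turns. This is a minimal illustration, not the project's detector: the function name `find_host_gaps` and the `(start, end)` tuple shape are assumptions made for the example.

```python
def find_host_gaps(host_turns, episode_duration, min_gap_s=30.0):
    """Return (start, end) spans where the host is silent for longer than min_gap_s.

    host_turns: iterable of (start, end) tuples in seconds (hypothetical shape).
    """
    gaps = []
    cursor = 0.0  # end of the latest host speech seen so far
    for start, end in sorted(host_turns):
        if start - cursor > min_gap_s:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Trailing gap between the last host turn and the end of the recording
    if episode_duration - cursor > min_gap_s:
        gaps.append((cursor, episode_duration))
    return gaps


turns = [(0.0, 600.0), (605.0, 900.0), (1080.0, 1500.0)]
print(find_host_gaps(turns, episode_duration=1500.0))
```

Each returned span is only a candidate break; a long caller monologue would produce the same gap, which is why this signal is combined with the others.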
**Signal 4: Audio characteristics**
- Volume/loudness shifts (commercials often have different LUFS profiles)
- Spectral characteristics (produced/compressed commercial audio vs. live studio mic)
- Silence gaps (dead air between show and ads)
- Audio environment changes (room tone, background noise differences)
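
As a rough illustration of the silence-gap signal, the sketch below measures windowed RMS level against a threshold such as the `silence_threshold_db: -40` value used in the configuration. It is an assumption-laden stand-in: plain RMS in dBFS is not LUFS (EBU R128 loudness adds K-weighting and gating), and `rms_dbfs` / `silent_windows` are hypothetical helper names.

```python
import math


def rms_dbfs(samples):
    """RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def silent_windows(samples, sample_rate, window_s=0.5, threshold_db=-40.0):
    """Return start times (seconds) of windows whose RMS falls below threshold_db."""
    size = int(sample_rate * window_s)
    starts = []
    for offset in range(0, len(samples) - size + 1, size):
        if rms_dbfs(samples[offset:offset + size]) < threshold_db:
            starts.append(offset / sample_rate)
    return starts


# Synthetic test signal: 1 s tone, 1 s digital silence, 1 s tone
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
signal = tone + [0.0] * rate + tone
print(silent_windows(signal, rate))
```

A real implementation would more likely read audio with `librosa` and work on NumPy arrays; the pure-Python version here just keeps the example self-contained.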
**Signal 5: Learned break patterns (from archive)**
- HR1/HR2 file boundaries = confirmed commercial break locations
- Train a classifier on the audio features at these known boundaries
- Generalize to detect similar patterns within single-file recordings
**Signal 6: Structural heuristics**
- Commercial breaks are typically 2-5 minutes
- Shows typically break every 12-20 minutes
- Transition phrases in transcript ("We'll be right back", "Welcome back")
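
The structural heuristics can be sketched with a phrase search over transcript segments plus a duration check. The segment dict shape (`start`, `end`, `text`) and the helper names are assumptions for this example, and the 30-300 second bounds are borrowed from the `min_break_duration_s` / `max_break_duration_s` config values rather than the 2-5 minute rule of thumb.

```python
import re

BREAK_OUT = re.compile(r"\bwe'?ll be right back\b", re.IGNORECASE)
BREAK_IN = re.compile(r"\bwelcome back\b", re.IGNORECASE)


def find_transition_cues(segments):
    """Return (time, kind) cues from transcript segments with start/end/text keys."""
    cues = []
    for seg in segments:
        if BREAK_OUT.search(seg["text"]):
            cues.append((seg["end"], "break_start"))    # break begins after the phrase
        if BREAK_IN.search(seg["text"]):
            cues.append((seg["start"], "break_end"))    # break ends as the phrase starts
    return cues


def plausible_break(start, end, min_s=30.0, max_s=300.0):
    """True when the span length falls inside the expected break duration."""
    return min_s <= end - start <= max_s


transcript = [
    {"start": 700.0, "end": 705.0, "text": "Stay with us, we'll be right back."},
    {"start": 890.0, "end": 894.0, "text": "Welcome back to the show."},
]
cues = find_transition_cues(transcript)
print(cues, plausible_break(cues[0][0], cues[1][0]))
```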
**Combined scoring:** Each signal contributes a confidence score. A segment is classified as commercial when the combined score exceeds a threshold. This is resilient to missing fingerprints — even without a known bumper match, the other signals can still identify breaks.
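
A minimal sketch of the combined scoring, assuming each signal has already been reduced to a confidence in [0, 1]. The weights and the 0.70 threshold are the example values from `config.yaml`; the function names and the weighted-sum form are assumptions, since the actual combination logic lives in the segment detector.

```python
WEIGHTS = {
    "fingerprint_match": 0.30,
    "speaker_identity": 0.25,
    "audio_characteristics": 0.20,
    "break_pattern": 0.15,
    "structural_heuristic": 0.10,
}


def combined_score(signal_scores, weights=WEIGHTS):
    """Weighted sum of per-signal confidences; missing signals count as 0."""
    return sum(w * signal_scores.get(name, 0.0) for name, w in weights.items())


def is_commercial(signal_scores, threshold=0.70):
    """Classify as commercial when the combined score exceeds the threshold."""
    return combined_score(signal_scores) > threshold


no_fingerprint = {
    "speaker_identity": 1.0,
    "audio_characteristics": 0.9,
    "break_pattern": 0.8,
    "structural_heuristic": 1.0,
}
with_fingerprint = dict(no_fingerprint, fingerprint_match=1.0)
print(combined_score(no_fingerprint), is_commercial(no_fingerprint))
print(combined_score(with_fingerprint), is_commercial(with_fingerprint))
```

One thing this arithmetic makes visible: with these example weights the four non-fingerprint signals sum to 0.70, so a segment with no fingerprint match tops out at the threshold rather than above it. Lowering `confidence_threshold` or renormalizing over the signals that are actually available are two possible adjustments if fingerprint-free detection is expected to work.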
#### Stage 4: Commercial Removal — `ffmpeg`
- Stitch show segments together with crossfades
- Normalize audio levels (EBU R128 loudness standard)
- Output clean podcast-ready MP3
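
One way to get real crossfades out of ffmpeg is to chain its `acrossfade` filter across the extracted segment files. The sketch below only builds the command line; `build_crossfade_cmd` is a hypothetical helper, not part of the committed `audio_editor.py`.

```python
def build_crossfade_cmd(inputs, output, crossfade_s=0.5, bitrate="192k"):
    """Build an ffmpeg command that joins inputs with acrossfade between neighbours."""
    if len(inputs) < 2:
        raise ValueError("need at least two inputs to crossfade")
    cmd = ["ffmpeg", "-y"]
    for path in inputs:
        cmd += ["-i", str(path)]
    chain = []
    prev = "[0:a]"
    for i in range(1, len(inputs)):
        label = f"[a{i}]"
        # Fade the running result into the next input and name the output pad
        chain.append(f"{prev}[{i}:a]acrossfade=d={crossfade_s}{label}")
        prev = label
    cmd += ["-filter_complex", ";".join(chain), "-map", prev,
            "-b:a", bitrate, str(output)]
    return cmd


cmd = build_crossfade_cmd(["seg_0000.mp3", "seg_0001.mp3", "seg_0002.mp3"],
                          "podcast-episode.mp3")
print(" ".join(cmd))
```

Because each crossfade overlaps two neighbouring segments, the output is shorter than the sum of the inputs by `crossfade_s` per join, so chapter offsets computed from raw segment durations would need the same correction.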
#### Stage 5: Segment Splitting — `ffmpeg`
- Export individual segments as separate MP3s
- Apply fade in/out
- Add ID3 tags (show name, segment title, date)
- Generate chapter markers file (for podcast apps)
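
For the chapter markers file, one option is ffmpeg's FFMETADATA text format, which ffmpeg can merge into an output file. The sketch below renders chapter dicts (the same `title` / `start` / `end` shape written to `chapters.json`) into that format; `chapters_to_ffmetadata` is a hypothetical helper and not something the pipeline currently does.

```python
def _escape(value):
    """Backslash-escape the characters that are special in FFMETADATA values."""
    for ch in ("\\", "=", ";", "#", "\n"):
        value = value.replace(ch, "\\" + ch)
    return value


def chapters_to_ffmetadata(chapters):
    """Render chapters (title plus start/end in seconds) as ffmpeg metadata text."""
    lines = [";FFMETADATA1"]
    for ch in chapters:
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START/END below are in milliseconds
            f"START={int(round(ch['start'] * 1000))}",
            f"END={int(round(ch['end'] * 1000))}",
            f"title={_escape(ch['title'])}",
        ]
    return "\n".join(lines) + "\n"


text = chapters_to_ffmetadata([
    {"title": "Intro", "start": 0.0, "end": 61.5},
    {"title": "Q&A; callers", "start": 61.5, "end": 300.0},
])
print(text)
```

The resulting file could then be applied with something like `ffmpeg -i podcast-episode.mp3 -i chapters.txt -map_metadata 1 -map_chapters 1 -codec copy out.mp3`; how well individual podcast players honour embedded MP3 chapters varies, so this is worth testing against the target apps.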
#### Stage 6: Content Analysis — `Ollama` (qwen3:14b or codestral)
- Feed transcript + speaker labels to local LLM
- Generate:
- Episode summary (2-3 paragraphs)
- Per-segment summaries
- Key quotes with speaker attribution
- Topic tags
- Suggested blog post topics
- Auto-filled post-show debrief template
### Audio Element Library
The element library is a **learning system**, not a static collection.
```
element-library/
  fingerprints.db          # SQLite database of audio fingerprints
  known/                   # Source files we have
    intros/
    outros/
    bumpers/
    promos/
  discovered/              # Elements found by the discovery system
    cluster-001.mp3        # Unknown element, appears in 47 episodes
    cluster-002.mp3        # Unknown element, appears in 12 episodes
    ...
  metadata.json            # Names, categories, date ranges for each element
```
**Lifecycle of a discovered element:**
1. Processor detects non-speech audio that doesn't match any known fingerprint
2. Audio clip is extracted and stored as a candidate
3. Candidate is compared against all other candidates across episodes
4. Matches are clustered — same audio in multiple episodes = confirmed element
5. New element is fingerprinted and added to the library as "unnamed"
6. Host reviews unnamed elements periodically and assigns names/categories
7. Named elements improve future detection accuracy
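
Steps 3 and 4 of the lifecycle (compare candidates, then cluster) can be sketched as a greedy pass using the `cluster_similarity_threshold` (0.90) and `min_cluster_occurrences` (3) values from the config. The feature vectors here are placeholders: real Chromaprint-style fingerprints are compared differently, and `cluster_candidates` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_candidates(candidates, similarity=0.90, min_episodes=3):
    """Greedily cluster (episode_id, vector) candidates; return confirmed clusters."""
    clusters = []  # each cluster: {"rep": first vector seen, "episodes": set of ids}
    for episode_id, vec in candidates:
        for cluster in clusters:
            if cosine(cluster["rep"], vec) >= similarity:
                cluster["episodes"].add(episode_id)
                break
        else:
            clusters.append({"rep": vec, "episodes": {episode_id}})
    # Only clusters seen in enough distinct episodes count as show elements
    return [c for c in clusters if len(c["episodes"]) >= min_episodes]


candidates = [
    ("ep1", [1.0, 0.0, 0.0]),
    ("ep2", [0.99, 0.05, 0.0]),
    ("ep3", [0.98, 0.10, 0.0]),
    ("ep1", [0.0, 1.0, 0.0]),   # a one-off sound, never confirmed
]
confirmed = cluster_candidates(candidates)
print(len(confirmed), sorted(confirmed[0]["episodes"]))
```

Counting distinct episodes rather than raw matches keeps a clip that repeats many times inside a single episode (an ad played twice, for instance) from being promoted to a show element.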
**Element categories:**
- `show-intro` — Full show opening
- `show-outro` — Full show closing
- `segment-bumper` — Music between show segments
- `break-bumper` — Music going into/out of commercial breaks
- `station-id` — Station identification (legal requirement, consistent per station)
- `promo` — Show promo or cross-promotion
- `stinger` — Short audio effect (sound effect, catchphrase)
- `unknown` — Not yet categorized
### Voice Profile System
**Bootstrapped from 579-episode archive**, not a single enrollment sample.
```
voice-profiles/
  host-mike-swanson/
    embedding-composite.npy    # Average embedding across all eras
    embedding-2010.npy         # Era-specific (voice changes over time)
    embedding-2014.npy
    embedding-2018.npy
    embedding-2026.npy
    metadata.json              # Speaker name, role, episode count
  guests/
    [name].npy                 # Named guest embeddings (built over time)
  callers/
    regular-001.npy            # Unnamed repeat caller
    regular-002.npy
  unknown/
    cluster-[id].npy           # Voices that appear multiple times, not yet named
```
**Bootstrap process:**
1. Diarize 10 diverse archive episodes (different years)
2. Dominant speaker in each = host (by far the most speaking time)
3. Extract host-only segments, generate embeddings
4. Create per-era profiles (voice may change over 8+ years)
5. Composite embedding = average across all eras
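
Step 5 and the later host matching can be sketched as an element-wise mean over per-era embeddings followed by a cosine-similarity check against `host_match_threshold` (0.75 in the config). The three-dimensional vectors are toy values and the function names are assumptions; real speaker embeddings would be much larger NumPy arrays loaded from the `.npy` files.

```python
import math


def mean_embedding(embeddings):
    """Element-wise average of equal-length embedding vectors."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_host(embedding, composite, threshold=0.75):
    """True when a speaker embedding is close enough to the host composite."""
    return cosine(embedding, composite) >= threshold


era_profiles = {
    "2010": [0.9, 0.1, 0.0],
    "2014": [0.8, 0.2, 0.1],
    "2018": [0.7, 0.3, 0.1],
}
composite = mean_embedding(list(era_profiles.values()))
print(is_host([0.85, 0.15, 0.05], composite))   # voice close to the host profiles
print(is_host([0.0, 0.2, 0.9], composite))      # clearly different voice
```

Whether to L2-normalize each era embedding before averaging is a design choice this sketch leaves open; without normalization, eras with larger-magnitude embeddings weigh more heavily in the composite.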
**Continuous improvement:**
- Each processed episode refines the host embedding
- Repeat non-host voices are clustered across episodes
- Host reviews clusters: "This voice appears in 47 episodes — who is this?"
- Named profiles improve future speaker labeling
## Dependencies
```bash
# System packages
sudo pacman -S python-pip ffmpeg
# Python packages (in a venv)
python3 -m venv ~/.local/share/radio-processor
source ~/.local/share/radio-processor/bin/activate
pip install faster-whisper # Transcription (CTranslate2 + CUDA)
pip install pyannote.audio # Speaker diarization
pip install torch torchaudio # PyTorch (CUDA)
pip install silero-vad # Voice activity detection
pip install pydub # Audio manipulation
pip install librosa # Audio analysis / spectral features
pip install chromaprint # Audio fingerprinting (or use dejavu)
pip install scikit-learn # Break pattern classifier
pip install ollama # Local LLM API
pip install rich # CLI progress display
```
### pyannote.audio Access
pyannote requires accepting the model license on HuggingFace:
1. Create account at huggingface.co
2. Accept the user conditions at https://huggingface.co/pyannote/speaker-diarization-3.1 and at https://huggingface.co/pyannote/segmentation-3.0, which the 3.1 pipeline loads
3. Generate access token
4. `huggingface-cli login`
## Usage
```bash
# Full pipeline (new episode)
radio-process process episode.mp3 --show-prep episodes/2026-03-21-who-controls-your-tech/show-prep.md
# Just transcribe
radio-process transcribe episode.mp3
# Process archive episode (training mode: learns elements + voices)
radio-process process episode-hr1.mp3 episode-hr2.mp3 --archive-mode --date 2016-03-15
# Batch process archive for training (planned: cli.py does not define a batch-train command yet)
radio-process --batch-train archive/2016/ --output training-data/
# Enroll host voice from archive (bootstrap)
radio-process bootstrap-voice archive/ --speaker-name "Mike Swanson"
# Review discovered elements
radio-process review-elements
# Review unknown speaker clusters
radio-process review-speakers
```
## Output Structure
```
episodes/YYYY-MM-DD-topic/
  show-prep.md                    # Pre-show (existing)
  post-show-debrief.md            # Auto-generated draft
  raw/
    full-broadcast.mp3            # Original recording
  processed/
    transcript.json               # Full transcript with timestamps + speakers
    transcript.txt                # Plain text transcript
    transcript.srt                # Subtitle format
    podcast-episode.mp3           # Clean episode (commercials removed)
    chapters.json                 # Chapter markers
    detection-report.json         # What was detected as commercial/show, confidence scores
    segments/
      00-intro.mp3
      01-the-week-that-was.mp3
      02-the-government-wants-in.mp3
      03-jensens-trillion-dollar-bet.mp3
      04-apple-gives-google-the-keys.mp3
      05-a-petabyte-of-your-data-gone.mp3
      06-right-to-repair.mp3
      07-outro.mp3
  generated/
    episode-post.md               # For website
    forum-thread.md               # For community forum
    blog-topic-1.md               # Deep-dive article
    blog-topic-2.md               # Deep-dive article
    analysis.json                 # LLM analysis output
```
## Configuration
```yaml
# config.yaml
show:
  name: "The Computer Guru Show"
  host: "Mike Swanson"
  typical_duration_minutes: 120  # 2-hour broadcast
  segment_count: 6
  has_commercials: true
audio:
  whisper_model: "large-v3"
  whisper_language: "en"
  output_format: "mp3"
  output_bitrate: "192k"
  normalize: true                # EBU R128
  crossfade_ms: 500              # Between stitched segments
segment_detection:
  # Fingerprint matching
  fingerprint_db: "element-library/fingerprints.db"
  fingerprint_match_threshold: 0.85   # Minimum similarity for a match
  # Element discovery
  discover_unknown_elements: true
  min_element_duration_s: 1.0         # Shortest element to detect
  max_element_duration_s: 30.0        # Longest (full intro might be 20-30s)
  cluster_similarity_threshold: 0.90  # How similar clips must be to cluster
  min_cluster_occurrences: 3          # Must appear in 3+ episodes to be an element
  # Commercial classification
  min_break_duration_s: 30            # Minimum commercial break length
  max_break_duration_s: 300           # Maximum (5 min)
  silence_threshold_db: -40           # Silence detection threshold
  confidence_threshold: 0.70          # Combined score to classify as commercial
  # Signal weights (tune based on accuracy)
  weights:
    fingerprint_match: 0.30           # Known element detected
    speaker_identity: 0.25            # Host voice absent
    audio_characteristics: 0.20       # Production style differs
    break_pattern: 0.15               # Matches trained break pattern
    structural_heuristic: 0.10        # Duration/timing rules
diarization:
  min_speakers: 1
  max_speakers: 6
  voice_profiles_dir: "voice-profiles/"
  host_match_threshold: 0.75          # Similarity to host embedding
llm:
  model: "qwen3:14b"                  # Ollama model for analysis
  ollama_host: "http://localhost:11434"
paths:
  episodes_dir: "episodes/"
  voice_profiles: "voice-profiles/"
  element_library: "element-library/"
  output_dir: "processed/"
archive:
  server: "172.16.3.10"
  path: "/home/gurushow/public_html/archive/"
  elements_path: "/home/gurushow/public_html/archive/Radio/Elements/"
```
## Training Data: 579-Episode Archive
The archive on IX server (172.16.3.10) contains 579 MP3 files spanning 2010-2018: 572 episode files plus 7 production-element MP3s, with no files listed for 2013.
| Year | Files | Size | Notes |
|------|-------|------|-------|
| 2010 | 43 | 664MB | Season 7 start |
| 2011 | 200 | 1.9GB | Peak output |
| 2012 | 98 | 1.2GB | |
| 2014 | 81 | 783MB | Season 6 (new station) |
| 2015 | 50 | 461MB | |
| 2016 | 54 | 1.2GB | |
| 2017 | 41 | 1.5GB | |
| 2018 | 5 | 101MB | Final season 10 episodes |
| Elements | 7 MP3 + 18 WAV | 203MB | Partial production library |
Episodes are split into HR 1 / HR 2 files. The HR boundary is a confirmed commercial break point — used for training the break detection classifier.
**Important:** Not all production elements are in the archive. Bumpers, stingers, and jingles varied across stations and time periods. The element discovery system handles this by detecting and clustering unknown elements across episodes.
## Future Enhancements
1. **Audiogram generator** — Create video clips with waveform animation + captions for social media
2. **Highlight reel** — Auto-detect the most engaging 60-90 seconds (high energy, laughter, emphasis)
3. **Show notes generator** — Generate timestamped show notes in podcast standard format
4. **RSS feed integration** — Auto-publish processed episodes to podcast RSS feed
5. **Sentiment analysis** — Track audience engagement topics over time
6. **Topic continuity** — Link topics across episodes ("Last week we talked about X, this week...")
7. **Live processing** — Real-time transcription during broadcast for immediate post-show turnaround
8. **Cross-episode search** — Full-text search across all transcripts ("When did we talk about net neutrality?")


@@ -0,0 +1,57 @@
show:
  name: "The Computer Guru Show"
  host: "Mike Swanson"
  typical_duration_minutes: 120
  segment_count: 6
  has_commercials: true
audio:
  whisper_model: "large-v3"
  whisper_language: "en"
  output_format: "mp3"
  output_bitrate: "192k"
  normalize: true
  crossfade_ms: 500
segment_detection:
  fingerprint_db: "element-library/fingerprints.db"
  fingerprint_match_threshold: 0.85
  discover_unknown_elements: true
  min_element_duration_s: 1.0
  max_element_duration_s: 30.0
  cluster_similarity_threshold: 0.90
  min_cluster_occurrences: 3
  min_break_duration_s: 30
  max_break_duration_s: 300
  silence_threshold_db: -40
  confidence_threshold: 0.70
  weights:
    fingerprint_match: 0.30
    speaker_identity: 0.25
    audio_characteristics: 0.20
    break_pattern: 0.15
    structural_heuristic: 0.10
diarization:
  min_speakers: 1
  max_speakers: 6
  voice_profiles_dir: "voice-profiles/"
  host_match_threshold: 0.75
llm:
  model: "qwen3:14b"
  ollama_host: "http://localhost:11434"
paths:
  episodes_dir: "episodes/"
  voice_profiles: "voice-profiles/"
  element_library: "element-library/"
  output_dir: "processed/"
archive:
  server: "172.16.3.10"
  path: "/home/gurushow/public_html/archive/"
  elements_path: "/home/gurushow/public_html/archive/Radio/Elements/"


@@ -0,0 +1,25 @@
[build-system]
requires = ["setuptools>=68.0"]
build-backend = "setuptools.build_meta"
[project]
name = "radio-processor"
version = "0.1.0"
description = "Audio processor for The Computer Guru Show"
requires-python = ">=3.11"
dependencies = [
    "faster-whisper",
    "pyannote.audio",
    "pydub",
    "librosa",
    "scikit-learn",
    "ollama",
    "rich",
    "pyyaml",
]
[project.scripts]
radio-process = "src.cli:main"
[tool.setuptools.packages.find]
include = ["src*"]


@@ -0,0 +1,187 @@
"""Stage 6: Content analysis using Ollama for summary, topics, and post-show debrief."""
import json
from dataclasses import dataclass
from pathlib import Path

from rich.console import Console

console = Console()


@dataclass
class EpisodeAnalysis:
    summary: str
    segment_summaries: list[dict]  # [{title, summary, key_points}]
    key_quotes: list[dict]  # [{quote, speaker, context}]
    topics: list[str]
    tags: list[str]
    blog_post_candidates: list[dict]  # [{title, angle, why}]
    debrief_draft: str  # Markdown debrief template

    def to_dict(self) -> dict:
        # debrief_draft is not included here; save() writes it to post-show-debrief.md
        return {
            "summary": self.summary,
            "segment_summaries": self.segment_summaries,
            "key_quotes": self.key_quotes,
            "topics": self.topics,
            "tags": self.tags,
            "blog_post_candidates": self.blog_post_candidates,
        }

    def save(self, output_dir: Path):
        output_dir.mkdir(parents=True, exist_ok=True)
        with open(output_dir / "analysis.json", "w") as f:
            json.dump(self.to_dict(), f, indent=2)
        with open(output_dir / "post-show-debrief.md", "w") as f:
            f.write(self.debrief_draft)
        console.print(f"[green]Analysis saved to {output_dir}[/green]")


def analyze_episode(transcript_text: str, diarization_data: dict | None = None,
                    show_prep: str | None = None, segments: list | None = None,
                    model: str = "qwen3:14b",
                    ollama_host: str = "http://localhost:11434") -> EpisodeAnalysis:
    """Analyze a transcribed episode using a local LLM."""
    import ollama as ollama_client

    console.print(f"[bold]Analyzing episode with {model}[/bold]")
    client = ollama_client.Client(host=ollama_host)

    # Build context for the LLM.
    # Only the first 3,000 characters of show prep and the first 12,000
    # characters of the transcript are included in the prompt.
    context_parts = []
    if show_prep:
        context_parts.append(f"## Show Prep (planned topics)\n\n{show_prep[:3000]}")
    context_parts.append(f"## Transcript\n\n{transcript_text[:12000]}")
    if diarization_data:
        speakers = diarization_data.get("speaker_map", {})
        if speakers:
            speaker_info = "\n".join(f"- {v}" for v in speakers.values())
            context_parts.append(f"## Speakers Identified\n\n{speaker_info}")
    context = "\n\n---\n\n".join(context_parts)

    # Query 1: Episode summary and segment summaries
    summary_prompt = f"""You are analyzing a radio show episode transcript.

Provide a JSON response with:
1. "summary": A 2-3 paragraph episode summary suitable for a podcast episode page.
   Write in third person. Be specific about topics discussed.
2. "segment_summaries": An array of objects, each with:
   - "title": A compelling segment title
   - "summary": 3-5 sentence summary
   - "key_points": Array of key takeaway bullet points
3. "topics": Array of main topics discussed (short phrases)
4. "tags": Array of SEO-friendly tags (lowercase, hyphenated)
5. "key_quotes": Array of notable quotes, each with:
   - "quote": The quote text
   - "speaker": Who said it (if identifiable)
   - "context": Brief context
6. "blog_post_candidates": Array of topics worth expanding into blog posts, each with:
   - "title": Proposed blog post title
   - "angle": The specific angle or thesis
   - "why": Why this topic deserves expansion

Respond ONLY with valid JSON, no markdown fencing.

{context}"""

    console.print("[dim]Generating episode analysis...[/dim]")
    response = client.chat(
        model=model,
        messages=[{"role": "user", "content": summary_prompt}],
        options={"temperature": 0.3, "num_ctx": 16384},
    )

    # Parse LLM response
    response_text = response["message"]["content"]

    # Strip markdown code fences if present
    if "```json" in response_text:
        response_text = response_text.split("```json", 1)[1]
        response_text = response_text.split("```", 1)[0]
    elif "```" in response_text:
        response_text = response_text.split("```", 1)[1]
        response_text = response_text.split("```", 1)[0]

    try:
        analysis_data = json.loads(response_text.strip())
    except json.JSONDecodeError:
        console.print("[yellow]LLM response was not valid JSON, using raw text[/yellow]")
        analysis_data = {
            "summary": response_text,
            "segment_summaries": [],
            "topics": [],
            "tags": [],
            "key_quotes": [],
            "blog_post_candidates": [],
        }

    # Query 2: Generate debrief draft
    debrief_prompt = f"""Based on this radio show transcript, generate a post-show debrief
in markdown format. Compare what was discussed against the show prep (planned topics)
to identify what made it in, what was cut, and what was added.

Format:
# Post-Show Debrief
## Episode: [derive title from content]
## Air Date: [today's date if not clear]
### What Made It In
[For each planned segment, note: Used / Modified / Cut]
### What Changed Live
[Topics expanded, cut short, or reordered vs. prep]
### Caller/Audience Interaction
[Any caller topics or audience engagement noted in transcript]
### Unplanned Additions
[Topics not in prep that came up]
### Best Moments
[Most compelling segments or quotes]
### Topics That Deserve More
[Topics that were rushed or generated high interest]
### Suggested Blog Posts
[2-3 specific blog post ideas with proposed titles and angles]

{context}"""

    console.print("[dim]Generating debrief draft...[/dim]")
    debrief_response = client.chat(
        model=model,
        messages=[{"role": "user", "content": debrief_prompt}],
        options={"temperature": 0.4, "num_ctx": 16384},
    )
    debrief_text = debrief_response["message"]["content"]

    console.print("[green]Analysis complete[/green]")
    return EpisodeAnalysis(
        summary=analysis_data.get("summary", ""),
        segment_summaries=analysis_data.get("segment_summaries", []),
        key_quotes=analysis_data.get("key_quotes", []),
        topics=analysis_data.get("topics", []),
        tags=analysis_data.get("tags", []),
        blog_post_candidates=analysis_data.get("blog_post_candidates", []),
        debrief_draft=debrief_text,
    )


@@ -0,0 +1,199 @@
"""Stage 4 & 5: Commercial removal and segment splitting using ffmpeg."""
import subprocess
import json
from dataclasses import dataclass
from pathlib import Path

from rich.console import Console
from rich.progress import Progress

from .segment_detector import SegmentType, DetectedSegment

console = Console()


@dataclass
class Chapter:
    title: str
    start: float
    end: float


def remove_commercials(audio_path: Path, segments: list[DetectedSegment],
                       output_path: Path, crossfade_ms: int = 500,
                       bitrate: str = "192k", normalize: bool = True):
    """Stitch show segments together, removing commercials.

    Note: crossfade_ms is accepted but not applied yet; the concat demuxer
    used below joins segments back to back.
    """
    show_segments = [s for s in segments
                     if s.segment_type in (SegmentType.SHOW_CONTENT,
                                           SegmentType.SHOW_ELEMENT)]
    if not show_segments:
        console.print("[red]No show segments found![/red]")
        return

    console.print(f"[bold]Removing commercials:[/bold] {len(segments)} segments "
                  f"-> {len(show_segments)} show segments")

    output_path.parent.mkdir(parents=True, exist_ok=True)
    temp_dir = output_path.parent / ".temp_segments"
    temp_dir.mkdir(exist_ok=True)

    try:
        # Extract each show segment
        segment_files = []
        with Progress(console=console) as progress:
            task = progress.add_task("Extracting segments...",
                                     total=len(show_segments))
            for i, seg in enumerate(show_segments):
                temp_file = temp_dir / f"seg_{i:04d}.mp3"
                _extract_segment(audio_path, seg.start, seg.end,
                                 temp_file, bitrate)
                segment_files.append(temp_file)
                progress.update(task, advance=1)

        # Create concat file for ffmpeg.
        # Use absolute paths: the concat demuxer resolves relative entries
        # against the list file's directory, not the working directory.
        concat_file = temp_dir / "concat.txt"
        with open(concat_file, "w") as f:
            for sf in segment_files:
                f.write(f"file '{sf.resolve()}'\n")

        # Concatenate the segments (no crossfade is applied at this step)
        cmd = [
            "ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(concat_file),
            "-b:a", bitrate,
        ]
        if normalize:
            # EBU R128 loudness normalization
            cmd.extend([
                "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",
            ])
        cmd.append(str(output_path))
        subprocess.run(cmd, capture_output=True, check=True, timeout=600)

        # Get output duration
        duration = _get_duration(output_path)
        console.print(f"[green]Clean episode saved: {output_path.name} "
                      f"({duration / 60:.1f} min)[/green]")
    finally:
        # Cleanup temp files
        import shutil
        shutil.rmtree(temp_dir, ignore_errors=True)


def split_segments(audio_path: Path, segments: list[DetectedSegment],
                   output_dir: Path, bitrate: str = "192k"):
    """Export individual show segments as separate MP3 files."""
    show_segments = [s for s in segments
                     if s.segment_type in (SegmentType.SHOW_CONTENT,
                                           SegmentType.SHOW_ELEMENT)]
    output_dir.mkdir(parents=True, exist_ok=True)
    console.print(f"[bold]Splitting into {len(show_segments)} segments[/bold]")

    exported = []
    for i, seg in enumerate(show_segments):
        # Fall back to a numbered name when the label is missing or slugifies to ""
        slug = (_slugify(seg.label) if seg.label else "") or f"segment-{i:02d}"
        filename = f"{i:02d}-{slug}.mp3"
        output_file = output_dir / filename
        _extract_segment(audio_path, seg.start, seg.end, output_file, bitrate,
                         fade_in_ms=200, fade_out_ms=500)
        duration = seg.duration
        console.print(f"  [green]{filename}[/green] ({duration:.0f}s)")
        exported.append({
            "file": filename,
            "label": seg.label,
            "start": seg.start,
            "end": seg.end,
            "duration": duration,
        })

    # Save manifest
    with open(output_dir / "segments.json", "w") as f:
        json.dump(exported, f, indent=2)
    return exported


def generate_chapters(segments: list[DetectedSegment],
                      output_path: Path) -> list[Chapter]:
    """Generate chapter markers from show segments."""
    show_segments = [s for s in segments
                     if s.segment_type in (SegmentType.SHOW_CONTENT,
                                           SegmentType.SHOW_ELEMENT)]
    chapters = []
    # Chapter times are offsets into the stitched episode, not the raw broadcast
    cumulative_time = 0.0
    for i, seg in enumerate(show_segments, start=1):
        chapters.append(Chapter(
            title=seg.label or f"Segment {i}",
            start=cumulative_time,
            end=cumulative_time + seg.duration,
        ))
        cumulative_time += seg.duration

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(
            [{"title": c.title, "start": c.start, "end": c.end}
             for c in chapters],
            f, indent=2,
        )
    console.print(f"[green]Chapter markers saved: {len(chapters)} chapters[/green]")
    return chapters


def _extract_segment(audio_path: Path, start: float, end: float,
                     output_path: Path, bitrate: str = "192k",
                     fade_in_ms: int = 0, fade_out_ms: int = 0):
    """Extract a segment from an audio file using ffmpeg."""
    duration = end - start
    cmd = [
        "ffmpeg", "-y",
        "-ss", str(start),
        "-t", str(duration),
        "-i", str(audio_path),
        "-b:a", bitrate,
    ]
    filters = []
    if fade_in_ms > 0:
        filters.append(f"afade=t=in:d={fade_in_ms / 1000}")
    if fade_out_ms > 0:
        filters.append(f"afade=t=out:st={duration - fade_out_ms / 1000}:d={fade_out_ms / 1000}")
    if filters:
        cmd.extend(["-af", ",".join(filters)])
    cmd.append(str(output_path))
    subprocess.run(cmd, capture_output=True, check=True, timeout=120)


def _get_duration(audio_path: Path) -> float:
    """Get audio file duration in seconds."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-show_entries", "format=duration",
         "-of", "csv=p=0", str(audio_path)],
        capture_output=True, text=True,
    )
    return float(result.stdout.strip())


def _slugify(text: str) -> str:
    """Convert text to a filename-safe slug."""
    import re
    text = text.lower().strip()
    text = re.sub(r'[^\w\s-]', '', text)
    text = re.sub(r'[\s_]+', '-', text)
    text = re.sub(r'-+', '-', text)
    return text[:50].strip('-')


@@ -0,0 +1,356 @@
"""CLI entry point for the radio show audio processor."""
import argparse
import sys
from pathlib import Path

from rich.console import Console
from rich.panel import Panel

from .config import load_config

console = Console()


def main():
    parser = argparse.ArgumentParser(
        description="Radio Show Audio Processor - The Computer Guru Show",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s process episode.mp3
  %(prog)s process episode.mp3 --show-prep show-prep.md
  %(prog)s process hr1.mp3 hr2.mp3 --archive-mode --date 2016-03-15
  %(prog)s transcribe episode.mp3
  %(prog)s bootstrap-voice archive/
  %(prog)s review-elements
  %(prog)s review-speakers
        """,
    )
    parser.add_argument("--config", type=str, default=None,
                        help="Path to config.yaml")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # === process ===
    p_process = subparsers.add_parser("process", help="Full pipeline")
    p_process.add_argument("audio", nargs="+", type=str,
                           help="Audio file(s) to process")
    p_process.add_argument("--show-prep", type=str, default=None,
                           help="Path to show prep markdown file")
    p_process.add_argument("--output", type=str, default=None,
                           help="Output directory")
    p_process.add_argument("--archive-mode", action="store_true",
                           help="Archive mode: learn elements and voices")
    p_process.add_argument("--date", type=str, default=None,
                           help="Episode date (for archive mode)")
    p_process.add_argument("--skip-transcribe", action="store_true",
                           help="Skip transcription (use existing transcript)")
    p_process.add_argument("--skip-diarize", action="store_true",
                           help="Skip diarization")
    p_process.add_argument("--skip-analysis", action="store_true",
                           help="Skip LLM analysis")

    # === transcribe ===
    p_transcribe = subparsers.add_parser("transcribe", help="Transcribe only")
    p_transcribe.add_argument("audio", type=str, help="Audio file")
    p_transcribe.add_argument("--output", type=str, default=None)
    p_transcribe.add_argument("--model", type=str, default=None,
                              help="Whisper model size")

    # === diarize ===
    p_diarize = subparsers.add_parser("diarize", help="Diarize only")
    p_diarize.add_argument("audio", type=str, help="Audio file")
    p_diarize.add_argument("--output", type=str, default=None)

    # === detect ===
    p_detect = subparsers.add_parser("detect", help="Detect segments only")
    p_detect.add_argument("audio", type=str, help="Audio file")
    p_detect.add_argument("--output", type=str, default=None)
    p_detect.add_argument("--show-prep", type=str, default=None)

    # === split ===
    p_split = subparsers.add_parser("split", help="Split into segments")
    p_split.add_argument("audio", type=str, help="Audio file")
    p_split.add_argument("--detection-report", type=str, required=True,
                         help="Path to detection-report.json")
    p_split.add_argument("--output", type=str, default=None)

    # === bootstrap-voice ===
    p_voice = subparsers.add_parser("bootstrap-voice",
                                    help="Bootstrap host voice profile from archive")
    p_voice.add_argument("archive_dir", type=str,
                         help="Directory containing archive MP3s")
    p_voice.add_argument("--speaker-name", type=str, default="Mike Swanson")
    p_voice.add_argument("--sample-count", type=int, default=10,
                         help="Number of episodes to sample")

    # === review-elements ===
    subparsers.add_parser("review-elements",
                          help="Review discovered audio elements")

    # === review-speakers ===
    subparsers.add_parser("review-speakers",
                          help="Review unknown speaker clusters")

    args = parser.parse_args()
    config = load_config(args.config)

    console.print(Panel.fit(
        "[bold]Radio Show Audio Processor[/bold]\n"
        "[dim]The Computer Guru Show[/dim]",
        border_style="blue",
    ))

    if args.command == "process":
        _cmd_process(args, config)
    elif args.command == "transcribe":
        _cmd_transcribe(args, config)
    elif args.command == "diarize":
        _cmd_diarize(args, config)
    elif args.command == "detect":
        _cmd_detect(args, config)
    elif args.command == "split":
        _cmd_split(args, config)
    elif args.command == "bootstrap-voice":
        _cmd_bootstrap_voice(args, config)
    elif args.command == "review-elements":
        _cmd_review_elements(args, config)
    elif args.command == "review-speakers":
        _cmd_review_speakers(args, config)
def _cmd_process(args, config):
"""Full processing pipeline."""
from .transcriber import transcribe
from .diarizer import diarize, VoiceProfileStore
from .segment_detector import SegmentDetector
from .audio_editor import remove_commercials, split_segments, generate_chapters
from .analyzer import analyze_episode
audio_files = [Path(f) for f in args.audio]
audio_path = audio_files[0] # Primary file
# If multiple files (HR1 + HR2), concatenate first
if len(audio_files) > 1:
audio_path = _concatenate_audio(audio_files, config)
output_dir = Path(args.output) if args.output else audio_path.parent / "processed"
output_dir.mkdir(parents=True, exist_ok=True)
# Load show prep if provided
show_prep = None
if args.show_prep:
show_prep = Path(args.show_prep).read_text()
# Stage 1: Transcribe
transcript = None
if not args.skip_transcribe:
transcript = transcribe(
audio_path,
model_size=config.audio.whisper_model,
language=config.audio.whisper_language,
)
transcript.save(output_dir)
else:
console.print("[dim]Skipping transcription[/dim]")
# Try to load existing transcript
transcript_file = output_dir / "transcript.json"
if transcript_file.exists():
from .transcriber import Transcript, TranscriptSegment, TranscriptWord
import json
with open(transcript_file) as f:
data = json.load(f)
transcript = Transcript(
segments=[
TranscriptSegment(
id=s["id"], text=s["text"],
start=s["start"], end=s["end"],
words=[TranscriptWord(**w) for w in s.get("words", [])],
)
for s in data["segments"]
],
language=data["language"],
language_probability=data["language_probability"],
duration=data["duration"],
)
# Stage 2: Diarize
diarization = None
if not args.skip_diarize:
voice_profiles = VoiceProfileStore(
config.resolve_path(config.diarization.voice_profiles_dir)
)
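        diarization = diarize(
            audio_path,
            voice_profiles=voice_profiles,
            min_speakers=config.diarization.min_speakers,
            max_speakers=config.diarization.max_speakers,
            # Pass the configured threshold; otherwise the config value is never used
            host_match_threshold=config.diarization.host_match_threshold,
        )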
diarization.save(output_dir)
else:
console.print("[dim]Skipping diarization[/dim]")
# Stage 3: Detect segments
detector = SegmentDetector(config)
detection = detector.detect(
audio_path,
transcript=transcript,
diarization=diarization,
show_prep=show_prep,
)
detection.save(output_dir)
# Stage 4: Remove commercials
clean_path = output_dir / f"podcast-episode.{config.audio.output_format}"
remove_commercials(
audio_path, detection.segments, clean_path,
crossfade_ms=config.audio.crossfade_ms,
bitrate=config.audio.output_bitrate,
normalize=config.audio.normalize,
)
# Stage 5: Split segments
segments_dir = output_dir / "segments"
split_segments(
audio_path, detection.segments, segments_dir,
bitrate=config.audio.output_bitrate,
)
# Generate chapters
generate_chapters(detection.segments, output_dir / "chapters.json")
# Stage 6: Analyze
if not args.skip_analysis and transcript:
analysis = analyze_episode(
transcript_text=transcript.full_text,
diarization_data=diarization.to_dict() if diarization else None,
show_prep=show_prep,
segments=detection.segments,
model=config.llm.model,
ollama_host=config.llm.ollama_host,
)
generated_dir = output_dir.parent / "generated"
analysis.save(generated_dir)
console.print("\n[bold green]Processing complete![/bold green]")
console.print(f"Output: {output_dir}")
def _cmd_transcribe(args, config):
"""Transcribe only."""
from .transcriber import transcribe
audio_path = Path(args.audio)
output_dir = Path(args.output) if args.output else audio_path.parent / "processed"
model = args.model or config.audio.whisper_model
transcript = transcribe(audio_path, model_size=model)
transcript.save(output_dir)
def _cmd_diarize(args, config):
"""Diarize only."""
from .diarizer import diarize, VoiceProfileStore
audio_path = Path(args.audio)
output_dir = Path(args.output) if args.output else audio_path.parent / "processed"
voice_profiles = VoiceProfileStore(
config.resolve_path(config.diarization.voice_profiles_dir)
)
result = diarize(audio_path, voice_profiles=voice_profiles)
result.save(output_dir)
def _cmd_detect(args, config):
"""Segment detection only."""
from .segment_detector import SegmentDetector
audio_path = Path(args.audio)
output_dir = Path(args.output) if args.output else audio_path.parent / "processed"
show_prep = None
if args.show_prep:
show_prep = Path(args.show_prep).read_text()
detector = SegmentDetector(config)
result = detector.detect(audio_path, show_prep=show_prep)
result.save(output_dir)
def _cmd_split(args, config):
"""Split using existing detection report."""
from .audio_editor import split_segments, generate_chapters
from .segment_detector import DetectedSegment, SegmentType
import json
audio_path = Path(args.audio)
output_dir = Path(args.output) if args.output else audio_path.parent / "segments"
with open(args.detection_report) as f:
report = json.load(f)
segments = [
DetectedSegment(
start=s["start"], end=s["end"],
segment_type=SegmentType(s["type"]),
confidence=s["confidence"],
label=s.get("label", ""),
)
for s in report["segments"]
]
split_segments(audio_path, segments, output_dir, config.audio.output_bitrate)
generate_chapters(segments, output_dir.parent / "chapters.json")
def _cmd_bootstrap_voice(args, config):
"""Bootstrap host voice profile from archive episodes."""
console.print("[bold]Bootstrapping host voice profile[/bold]")
console.print(f"Archive: {args.archive_dir}")
console.print(f"Speaker: {args.speaker_name}")
console.print(f"Sampling {args.sample_count} episodes")
# TODO: Implement archive sampling + diarization + embedding extraction
console.print("[yellow]Not yet implemented — run individual diarizations first[/yellow]")
def _cmd_review_elements(args, config):
"""Review discovered audio elements."""
console.print("[bold]Reviewing discovered elements[/bold]")
# TODO: Implement element review UI
console.print("[yellow]Not yet implemented[/yellow]")
def _cmd_review_speakers(args, config):
"""Review unknown speaker clusters."""
console.print("[bold]Reviewing unknown speakers[/bold]")
# TODO: Implement speaker review UI
console.print("[yellow]Not yet implemented[/yellow]")
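def _concatenate_audio(files: list[Path], config) -> Path:
    """Concatenate multiple audio files (e.g., HR1 + HR2) with ffmpeg's concat demuxer."""
    import subprocess

    output = files[0].parent / f"combined_{files[0].stem}.mp3"
    concat_file = files[0].parent / ".concat_list.txt"

    with open(concat_file, "w", encoding="utf-8") as f:
        for audio_file in files:
            # Relative paths in a concat list are resolved against the list
            # file's directory, not the CWD, so write absolute paths. A single
            # quote inside a quoted path must be written as '\''.
            escaped = audio_file.resolve().as_posix().replace("'", "'\\''")
            f.write(f"file '{escaped}'\n")

    try:
        subprocess.run(
            ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
             "-i", str(concat_file), "-c", "copy", str(output)],
            capture_output=True, check=True,
        )
    finally:
        # Remove the temporary list even if ffmpeg fails
        concat_file.unlink(missing_ok=True)

    console.print(f"[dim]Concatenated {len(files)} files -> {output.name}[/dim]")
    return output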
if __name__ == "__main__":
main()


@@ -0,0 +1,126 @@
"""Configuration loader for the radio show audio processor."""
from pathlib import Path
from dataclasses import dataclass, field
import yaml
@dataclass
class ShowConfig:
name: str = "The Computer Guru Show"
host: str = "Mike Swanson"
typical_duration_minutes: int = 120
segment_count: int = 6
has_commercials: bool = True
@dataclass
class AudioConfig:
whisper_model: str = "large-v3"
whisper_language: str = "en"
output_format: str = "mp3"
output_bitrate: str = "192k"
normalize: bool = True
crossfade_ms: int = 500
@dataclass
class DetectionWeights:
fingerprint_match: float = 0.30
speaker_identity: float = 0.25
audio_characteristics: float = 0.20
break_pattern: float = 0.15
structural_heuristic: float = 0.10
@dataclass
class SegmentDetectionConfig:
fingerprint_db: str = "element-library/fingerprints.db"
fingerprint_match_threshold: float = 0.85
discover_unknown_elements: bool = True
min_element_duration_s: float = 1.0
max_element_duration_s: float = 30.0
cluster_similarity_threshold: float = 0.90
min_cluster_occurrences: int = 3
min_break_duration_s: int = 30
max_break_duration_s: int = 300
silence_threshold_db: int = -40
confidence_threshold: float = 0.70
weights: DetectionWeights = field(default_factory=DetectionWeights)
@dataclass
class DiarizationConfig:
min_speakers: int = 1
max_speakers: int = 6
voice_profiles_dir: str = "voice-profiles/"
host_match_threshold: float = 0.75
@dataclass
class LLMConfig:
model: str = "qwen3:14b"
ollama_host: str = "http://localhost:11434"
@dataclass
class PathsConfig:
episodes_dir: str = "episodes/"
voice_profiles: str = "voice-profiles/"
element_library: str = "element-library/"
output_dir: str = "processed/"
@dataclass
class ArchiveConfig:
server: str = "172.16.3.10"
path: str = "/home/gurushow/public_html/archive/"
elements_path: str = "/home/gurushow/public_html/archive/Radio/Elements/"
@dataclass
class Config:
show: ShowConfig = field(default_factory=ShowConfig)
audio: AudioConfig = field(default_factory=AudioConfig)
segment_detection: SegmentDetectionConfig = field(default_factory=SegmentDetectionConfig)
diarization: DiarizationConfig = field(default_factory=DiarizationConfig)
llm: LLMConfig = field(default_factory=LLMConfig)
paths: PathsConfig = field(default_factory=PathsConfig)
archive: ArchiveConfig = field(default_factory=ArchiveConfig)
base_dir: Path = field(default_factory=lambda: Path.cwd())
def resolve_path(self, relative: str) -> Path:
return self.base_dir / relative
def load_config(config_path: str | Path | None = None) -> Config:
if config_path is None:
config_path = Path(__file__).parent.parent / "config.yaml"
config_path = Path(config_path)
if not config_path.exists():
return Config(base_dir=config_path.parent)
with open(config_path) as f:
raw = yaml.safe_load(f) or {}
config = Config(base_dir=config_path.parent)
if "show" in raw:
config.show = ShowConfig(**raw["show"])
if "audio" in raw:
config.audio = AudioConfig(**raw["audio"])
if "segment_detection" in raw:
sd = raw["segment_detection"]
weights = DetectionWeights(**sd.pop("weights", {}))
config.segment_detection = SegmentDetectionConfig(weights=weights, **sd)
if "diarization" in raw:
config.diarization = DiarizationConfig(**raw["diarization"])
if "llm" in raw:
config.llm = LLMConfig(**raw["llm"])
if "paths" in raw:
config.paths = PathsConfig(**raw["paths"])
if "archive" in raw:
config.archive = ArchiveConfig(**raw["archive"])
return config


@@ -0,0 +1,274 @@
"""Stage 2: Speaker diarization using pyannote.audio with voice profile matching."""
import json
from dataclasses import dataclass
from pathlib import Path
import numpy as np
from rich.console import Console
console = Console()
@dataclass
class SpeakerTurn:
speaker: str # "SPEAKER_00", "Host: Mike Swanson", "Caller 1", etc.
start: float
end: float
confidence: float = 1.0
@property
def duration(self) -> float:
return self.end - self.start
@dataclass
class DiarizationResult:
turns: list[SpeakerTurn]
num_speakers: int
speaker_map: dict[str, str] # raw label -> friendly name
def speaker_at(self, time: float) -> str | None:
"""Get the speaker at a given timestamp."""
for turn in self.turns:
if turn.start <= time <= turn.end:
return turn.speaker
return None
def speaker_time(self, speaker: str) -> float:
"""Total speaking time for a speaker."""
return sum(t.duration for t in self.turns if t.speaker == speaker)
def speakers_ranked(self) -> list[tuple[str, float]]:
"""Speakers ranked by total speaking time."""
times = {}
for turn in self.turns:
times[turn.speaker] = times.get(turn.speaker, 0) + turn.duration
return sorted(times.items(), key=lambda x: x[1], reverse=True)
def to_dict(self) -> dict:
return {
"num_speakers": self.num_speakers,
"speaker_map": self.speaker_map,
"turns": [
{
"speaker": t.speaker,
"start": t.start,
"end": t.end,
"confidence": t.confidence,
}
for t in self.turns
],
}
def save(self, output_dir: Path):
output_dir.mkdir(parents=True, exist_ok=True)
with open(output_dir / "diarization.json", "w") as f:
json.dump(self.to_dict(), f, indent=2)
console.print(f"[green]Diarization saved to {output_dir}[/green]")
class VoiceProfileStore:
"""Manages speaker voice embeddings for identification."""
def __init__(self, profiles_dir: str | Path):
self.profiles_dir = Path(profiles_dir)
self.embeddings: dict[str, np.ndarray] = {}
self.metadata: dict[str, dict] = {}
self._load_profiles()
def _load_profiles(self):
if not self.profiles_dir.exists():
return
for npy_file in self.profiles_dir.rglob("*.npy"):
name = npy_file.stem
# Determine speaker name from directory structure
parent = npy_file.parent.name
if parent.startswith("host-"):
speaker_name = parent.replace("host-", "").replace("-", " ").title()
role = "host"
elif parent == "guests":
speaker_name = name.replace("-", " ").title()
role = "guest"
elif parent == "callers":
speaker_name = name
role = "caller"
else:
speaker_name = name
role = "unknown"
self.embeddings[name] = np.load(npy_file)
self.metadata[name] = {
"name": speaker_name,
"role": role,
"file": str(npy_file),
}
if self.embeddings:
console.print(f"[dim]Loaded {len(self.embeddings)} voice profiles[/dim]")
def match_embedding(self, embedding: np.ndarray, threshold: float = 0.75
) -> tuple[str | None, float]:
"""Match an embedding against stored profiles. Returns (name, similarity)."""
if not self.embeddings:
return None, 0.0
best_match = None
best_score = 0.0
for name, stored in self.embeddings.items():
# Cosine similarity
similarity = np.dot(embedding, stored) / (
np.linalg.norm(embedding) * np.linalg.norm(stored) + 1e-8
)
if similarity > best_score:
best_score = similarity
best_match = name
if best_score >= threshold:
meta = self.metadata.get(best_match, {})
friendly_name = meta.get("name", best_match)
role = meta.get("role", "unknown")
if role == "host":
return f"Host: {friendly_name}", best_score
return friendly_name, best_score
return None, best_score
def save_embedding(self, name: str, embedding: np.ndarray,
role: str = "unknown"):
"""Save a new voice profile."""
if role == "host":
subdir = self.profiles_dir / f"host-{name.lower().replace(' ', '-')}"
elif role == "guest":
subdir = self.profiles_dir / "guests"
elif role == "caller":
subdir = self.profiles_dir / "callers"
else:
subdir = self.profiles_dir / "unknown"
subdir.mkdir(parents=True, exist_ok=True)
filename = name.lower().replace(" ", "-")
np.save(subdir / f"{filename}.npy", embedding)
console.print(f"[green]Saved voice profile: {name} ({role})[/green]")
def diarize(audio_path: str | Path,
voice_profiles: VoiceProfileStore | None = None,
min_speakers: int = 1,
max_speakers: int = 6,
host_match_threshold: float = 0.75) -> DiarizationResult:
"""Run speaker diarization on an audio file."""
from pyannote.audio import Pipeline
import torch
audio_path = Path(audio_path)
console.print(f"[bold]Diarizing:[/bold] {audio_path.name}")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
console.print(f"[dim]Device: {device}[/dim]")
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1"
).to(device)
diarization = pipeline(
str(audio_path),
min_speakers=min_speakers,
max_speakers=max_speakers,
)
# Extract turns
raw_turns = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
raw_turns.append(SpeakerTurn(
speaker=speaker,
start=turn.start,
end=turn.end,
))
# Count unique speakers
raw_speakers = set(t.speaker for t in raw_turns)
console.print(f"[dim]Detected {len(raw_speakers)} speakers[/dim]")
    # Match against voice profiles if available
    speaker_map = {}
    if voice_profiles and voice_profiles.embeddings:
        console.print("[dim]Matching speakers against voice profiles...[/dim]")

        # One embedding per excerpt, from the pipeline's own embedding model.
        # Assumes the stored profiles were produced by the same model, so the
        # vector dimensions line up.
        from pyannote.audio import Inference
        from pyannote.core import Segment
        inference = Inference(pipeline.embedding, window="whole")

        for raw_label in raw_speakers:
            # Get segments for this speaker
            speaker_segments = [t for t in raw_turns if t.speaker == raw_label]
            total_time = sum(t.duration for t in speaker_segments)

            # Use the longest segment for embedding
            longest = max(speaker_segments, key=lambda t: t.duration)

            try:
                # crop() decodes and embeds just this excerpt, replacing the
                # earlier all-zeros placeholder that could never match a profile
                embedding = inference.crop(
                    str(audio_path), Segment(longest.start, longest.end)
                )
                match_name, score = voice_profiles.match_embedding(
                    np.asarray(embedding).reshape(-1),
                    threshold=host_match_threshold,
                )
                if match_name:
                    speaker_map[raw_label] = match_name
                    console.print(f"  [green]{raw_label} -> {match_name} "
                                  f"(score: {score:.2f}, {total_time:.0f}s)[/green]")
            except Exception as e:
                console.print(f"  [yellow]Could not match {raw_label}: {e}[/yellow]")
        # If no voice profiles matched, use speaking time heuristic
        # The host almost always has the most speaking time
        if not speaker_map:
            ranked = sorted(
                [(s, sum(t.duration for t in raw_turns if t.speaker == s))
                 for s in raw_speakers],
                key=lambda x: x[1],
                reverse=True,
            )
            if ranked:
                # Profiles are keyed by file stem, so there is no "host" key;
                # look the host up by role instead
                host_name = next(
                    (m["name"] for m in voice_profiles.metadata.values()
                     if m.get("role") == "host"),
                    "Unknown",
                )
                speaker_map[ranked[0][0]] = f"Host: {host_name}"
                console.print(f"  [yellow]Assumed {ranked[0][0]} is host "
                              f"(most speaking time: {ranked[0][1]:.0f}s)[/yellow]")
# If no voice profiles at all, label by speaking time
if not speaker_map:
ranked = sorted(
[(s, sum(t.duration for t in raw_turns if t.speaker == s))
for s in raw_speakers],
key=lambda x: x[1],
reverse=True,
)
for i, (speaker, time) in enumerate(ranked):
if i == 0:
speaker_map[speaker] = "Host (assumed)"
else:
speaker_map[speaker] = f"Speaker {i}"
# Apply friendly names
for turn in raw_turns:
if turn.speaker in speaker_map:
turn.speaker = speaker_map[turn.speaker]
console.print(f"[green]Diarization complete: {len(raw_turns)} turns, "
f"{len(raw_speakers)} speakers[/green]")
return DiarizationResult(
turns=raw_turns,
num_speakers=len(raw_speakers),
speaker_map=speaker_map,
)


@@ -0,0 +1,419 @@
"""Stage 3: Segment detection — multi-signal commercial/show content classifier."""
import json
from dataclasses import dataclass
from pathlib import Path
from enum import Enum
import numpy as np
from rich.console import Console
from rich.table import Table
console = Console()
class SegmentType(Enum):
SHOW_CONTENT = "show_content"
COMMERCIAL = "commercial"
SHOW_ELEMENT = "show_element" # intro, outro, bumper
SILENCE = "silence"
UNKNOWN = "unknown"
@dataclass
class DetectedSegment:
    start: float
    end: float
    segment_type: SegmentType
    confidence: float
    label: str = ""  # "Segment 1: The Week That Was", "Commercial Break 1", etc.
    signals: dict | None = None  # Individual signal scores; None becomes {} below

    def __post_init__(self):
        if self.signals is None:
            self.signals = {}

    @property
    def duration(self) -> float:
        return self.end - self.start
@dataclass
class SegmentDetectionResult:
segments: list[DetectedSegment]
show_segments: list[DetectedSegment]
commercial_segments: list[DetectedSegment]
element_segments: list[DetectedSegment]
total_show_time: float
total_commercial_time: float
def to_dict(self) -> dict:
return {
"total_show_time": self.total_show_time,
"total_commercial_time": self.total_commercial_time,
"segments": [
{
"start": s.start,
"end": s.end,
"type": s.segment_type.value,
"confidence": s.confidence,
"label": s.label,
"signals": s.signals,
}
for s in self.segments
],
}
def save(self, output_dir: Path):
output_dir.mkdir(parents=True, exist_ok=True)
with open(output_dir / "detection-report.json", "w") as f:
json.dump(self.to_dict(), f, indent=2)
def print_summary(self):
table = Table(title="Segment Detection Results")
table.add_column("Time", style="cyan")
table.add_column("Duration", style="magenta")
table.add_column("Type", style="green")
table.add_column("Confidence", style="yellow")
table.add_column("Label")
for seg in self.segments:
start = _format_time(seg.start)
dur = f"{seg.duration:.0f}s"
type_style = {
SegmentType.SHOW_CONTENT: "[green]SHOW[/green]",
SegmentType.COMMERCIAL: "[red]COMMERCIAL[/red]",
SegmentType.SHOW_ELEMENT: "[blue]ELEMENT[/blue]",
SegmentType.SILENCE: "[dim]SILENCE[/dim]",
SegmentType.UNKNOWN: "[yellow]UNKNOWN[/yellow]",
}.get(seg.segment_type, str(seg.segment_type))
table.add_row(start, dur, type_style, f"{seg.confidence:.2f}", seg.label)
console.print(table)
console.print(f"\nShow content: {self.total_show_time / 60:.1f} min")
console.print(f"Commercials: {self.total_commercial_time / 60:.1f} min")
def _format_time(seconds: float) -> str:
m = int(seconds // 60)
s = int(seconds % 60)
return f"{m:02d}:{s:02d}"
class SegmentDetector:
"""Multi-signal commercial/show content detector."""
def __init__(self, config):
self.config = config
self.weights = config.segment_detection.weights
def detect(self, audio_path: Path, transcript=None, diarization=None,
show_prep=None) -> SegmentDetectionResult:
"""Run all detection signals and combine scores."""
console.print(f"[bold]Detecting segments:[/bold] {audio_path.name}")
# Load audio for analysis
audio_data, sample_rate = self._load_audio(audio_path)
duration = len(audio_data) / sample_rate
# Step 1: Find candidate boundaries using silence detection
boundaries = self._detect_silence_boundaries(audio_data, sample_rate)
console.print(f"[dim]Found {len(boundaries)} silence boundaries[/dim]")
# Step 2: Create candidate segments between boundaries
candidates = self._create_candidate_segments(boundaries, duration)
# Step 3: Score each candidate with all available signals
for candidate in candidates:
scores = {}
# Signal 1: Fingerprint matching (if library available)
scores["fingerprint"] = self._score_fingerprint(
audio_data, sample_rate, candidate
)
# Signal 2: Speaker identity
if diarization:
scores["speaker"] = self._score_speaker_identity(
diarization, candidate
)
else:
scores["speaker"] = 0.5 # neutral
# Signal 3: Audio characteristics
scores["audio_chars"] = self._score_audio_characteristics(
audio_data, sample_rate, candidate
)
# Signal 4: Structural heuristics
if transcript:
scores["structural"] = self._score_structural(
transcript, candidate
)
else:
scores["structural"] = 0.5
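            # Combined weighted score (higher = more likely commercial).
            # No break_pattern signal is computed yet, so normalize by the
            # weights actually used. Without this, the default weights cap the
            # score at 0.64 (fingerprint stubbed at 0.5, audio and structural
            # signals peaking at 0.8), below the default 0.70 threshold.
            used_weight = (
                self.weights.fingerprint_match +
                self.weights.speaker_identity +
                self.weights.audio_characteristics +
                self.weights.structural_heuristic
            )
            commercial_score = (
                self.weights.fingerprint_match * scores.get("fingerprint", 0.5) +
                self.weights.speaker_identity * scores.get("speaker", 0.5) +
                self.weights.audio_characteristics * scores.get("audio_chars", 0.5) +
                self.weights.structural_heuristic * scores.get("structural", 0.5)
            ) / (used_weight or 1.0)

            candidate.signals = scores
            candidate.confidence = commercial_score

            if commercial_score >= self.config.segment_detection.confidence_threshold:
                candidate.segment_type = SegmentType.COMMERCIAL
            else:
                candidate.segment_type = SegmentType.SHOW_CONTENT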
# Step 4: Merge adjacent segments of same type
merged = self._merge_adjacent(candidates)
# Step 5: Apply duration constraints
final = self._apply_constraints(merged)
# Step 6: Label show segments using show prep if available
if show_prep:
self._label_from_prep(final, transcript, show_prep)
# Build result
show_segs = [s for s in final if s.segment_type == SegmentType.SHOW_CONTENT]
comm_segs = [s for s in final if s.segment_type == SegmentType.COMMERCIAL]
elem_segs = [s for s in final if s.segment_type == SegmentType.SHOW_ELEMENT]
result = SegmentDetectionResult(
segments=final,
show_segments=show_segs,
commercial_segments=comm_segs,
element_segments=elem_segs,
total_show_time=sum(s.duration for s in show_segs),
total_commercial_time=sum(s.duration for s in comm_segs),
)
result.print_summary()
return result
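    def _load_audio(self, audio_path: Path) -> tuple[np.ndarray, int]:
        """Load audio file as a mono 16 kHz float32 numpy array."""
        import subprocess

        sample_rate = 16000
        # Use ffmpeg to decode to raw 16-bit PCM on stdout
        result = subprocess.run(
            ["ffmpeg", "-v", "error", "-i", str(audio_path), "-f", "s16le",
             "-ac", "1", "-ar", str(sample_rate), "-"],
            capture_output=True, timeout=300,
        )
        # Fail loudly instead of returning an empty array on a decode error
        if result.returncode != 0 or not result.stdout:
            raise RuntimeError(
                f"ffmpeg failed to decode {audio_path}: "
                f"{result.stderr.decode(errors='replace').strip()}"
            )
        audio = np.frombuffer(result.stdout, dtype=np.int16).astype(np.float32) / 32768.0
        return audio, sample_rate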
def _detect_silence_boundaries(self, audio: np.ndarray, sr: int,
min_silence_ms: int = 500) -> list[float]:
"""Detect silence gaps in audio that likely indicate segment boundaries."""
frame_size = int(sr * 0.025) # 25ms frames
hop_size = int(sr * 0.010) # 10ms hop
threshold_db = self.config.segment_detection.silence_threshold_db
threshold_amp = 10 ** (threshold_db / 20)
min_silence_frames = int(min_silence_ms / 10)
# Calculate frame energy
energies = []
for i in range(0, len(audio) - frame_size, hop_size):
frame = audio[i:i + frame_size]
rms = np.sqrt(np.mean(frame ** 2))
energies.append(rms)
# Find silence regions
is_silent = [e < threshold_amp for e in energies]
boundaries = []
silent_count = 0
for i, silent in enumerate(is_silent):
if silent:
silent_count += 1
else:
if silent_count >= min_silence_frames:
# Mark the midpoint of the silence as a boundary
mid_frame = i - silent_count // 2
boundary_time = mid_frame * 0.010
boundaries.append(boundary_time)
silent_count = 0
return boundaries
def _create_candidate_segments(self, boundaries: list[float],
total_duration: float) -> list[DetectedSegment]:
"""Create candidate segments from silence boundaries."""
candidates = []
prev = 0.0
for boundary in boundaries:
if boundary - prev > 1.0: # Ignore segments < 1 second
candidates.append(DetectedSegment(
start=prev,
end=boundary,
segment_type=SegmentType.UNKNOWN,
confidence=0.0,
))
prev = boundary
# Final segment
if total_duration - prev > 1.0:
candidates.append(DetectedSegment(
start=prev,
end=total_duration,
segment_type=SegmentType.UNKNOWN,
confidence=0.0,
))
return candidates
def _score_fingerprint(self, audio: np.ndarray, sr: int,
segment: DetectedSegment) -> float:
"""Score based on audio fingerprint matching against element library.
Returns 0.0 (no match / definitely show) to 1.0 (definite commercial boundary).
"""
# TODO: Implement fingerprint matching against element-library/fingerprints.db
# For now, return neutral score
return 0.5
def _score_speaker_identity(self, diarization, segment: DetectedSegment) -> float:
"""Score based on whether the host is speaking.
Returns 0.0 (host definitely speaking = show content)
to 1.0 (host definitely absent = likely commercial).
"""
host_time = 0.0
total_time = segment.duration
for turn in diarization.turns:
if turn.end < segment.start or turn.start > segment.end:
continue
# Calculate overlap
overlap_start = max(turn.start, segment.start)
overlap_end = min(turn.end, segment.end)
overlap = max(0, overlap_end - overlap_start)
if "host" in turn.speaker.lower():
host_time += overlap
if total_time == 0:
return 0.5
host_fraction = host_time / total_time
# Invert: high host presence = low commercial score
return 1.0 - host_fraction
def _score_audio_characteristics(self, audio: np.ndarray, sr: int,
segment: DetectedSegment) -> float:
"""Score based on audio production characteristics.
Commercials tend to be louder, more compressed, different spectral profile.
Returns 0.0 (matches show characteristics) to 1.0 (matches commercial characteristics).
"""
start_sample = int(segment.start * sr)
end_sample = min(int(segment.end * sr), len(audio))
seg_audio = audio[start_sample:end_sample]
if len(seg_audio) < sr: # Less than 1 second
return 0.5
# RMS energy (commercials tend to be louder)
rms = np.sqrt(np.mean(seg_audio ** 2))
# Dynamic range (commercials tend to be more compressed)
frame_size = int(sr * 0.050) # 50ms frames
frame_rms = []
for i in range(0, len(seg_audio) - frame_size, frame_size):
frame = seg_audio[i:i + frame_size]
frame_rms.append(np.sqrt(np.mean(frame ** 2)))
if not frame_rms:
return 0.5
dynamic_range = max(frame_rms) / (min(frame_rms) + 1e-8)
# Simple heuristic scoring:
# High RMS + low dynamic range = compressed commercial audio
score = 0.5
if rms > 0.15: # Louder than typical speech
score += 0.15
if dynamic_range < 5.0: # Very compressed
score += 0.15
return min(1.0, max(0.0, score))
def _score_structural(self, transcript, segment: DetectedSegment) -> float:
"""Score based on transcript content structural cues.
Returns 0.0 (show content cues found) to 1.0 (commercial cues found).
"""
text = transcript.text_at(segment.start, segment.end).lower()
# Show content indicators
show_phrases = [
"welcome back", "let's move on", "next up", "our next topic",
"let's talk about", "as i mentioned", "the question is",
"caller", "what do you think", "here's the thing",
]
# Commercial/break indicators
break_phrases = [
"we'll be right back", "stay tuned", "don't go anywhere",
"after the break", "when we come back",
]
show_hits = sum(1 for p in show_phrases if p in text)
break_hits = sum(1 for p in break_phrases if p in text)
if show_hits > 0 and break_hits == 0:
return 0.2 # Likely show content
if break_hits > 0:
return 0.8 # Likely near a break
return 0.5 # Neutral
def _merge_adjacent(self, segments: list[DetectedSegment]) -> list[DetectedSegment]:
"""Merge adjacent segments of the same type."""
if not segments:
return []
merged = [segments[0]]
for seg in segments[1:]:
prev = merged[-1]
if (prev.segment_type == seg.segment_type and
abs(seg.start - prev.end) < 2.0): # Within 2 seconds
# Extend previous segment
prev.end = seg.end
prev.confidence = (prev.confidence + seg.confidence) / 2
else:
merged.append(seg)
return merged
def _apply_constraints(self, segments: list[DetectedSegment]) -> list[DetectedSegment]:
"""Apply duration constraints — short 'commercial' segments are likely misclassified."""
min_break = self.config.segment_detection.min_break_duration_s
for seg in segments:
if (seg.segment_type == SegmentType.COMMERCIAL and
seg.duration < min_break):
seg.segment_type = SegmentType.SHOW_CONTENT
seg.label = "(reclassified: too short for commercial)"
return segments
def _label_from_prep(self, segments: list[DetectedSegment],
transcript, show_prep: str):
"""Label show segments by matching transcript content to show prep topics."""
# TODO: Use Ollama to match transcript sections against show prep segment titles
# For now, number them sequentially
show_count = 0
comm_count = 0
for seg in segments:
if seg.segment_type == SegmentType.SHOW_CONTENT:
show_count += 1
seg.label = f"Show Segment {show_count}"
elif seg.segment_type == SegmentType.COMMERCIAL:
comm_count += 1
seg.label = f"Commercial Break {comm_count}"


@@ -0,0 +1,179 @@
"""Stage 1: Audio transcription using faster-whisper with GPU acceleration."""
import json
from dataclasses import dataclass
from pathlib import Path
from rich.console import Console
from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn, TimeElapsedColumn
console = Console()
@dataclass
class TranscriptWord:
word: str
start: float
end: float
probability: float
@dataclass
class TranscriptSegment:
id: int
text: str
start: float
end: float
words: list[TranscriptWord]
@dataclass
class Transcript:
segments: list[TranscriptSegment]
language: str
language_probability: float
duration: float
@property
def full_text(self) -> str:
return " ".join(seg.text.strip() for seg in self.segments)
def text_at(self, start: float, end: float) -> str:
"""Get transcript text within a time range."""
result = []
for seg in self.segments:
if seg.end < start:
continue
if seg.start > end:
break
result.append(seg.text.strip())
return " ".join(result)
def to_srt(self) -> str:
"""Export as SRT subtitle format."""
lines = []
for i, seg in enumerate(self.segments, 1):
start = _format_srt_time(seg.start)
end = _format_srt_time(seg.end)
lines.append(f"{i}")
lines.append(f"{start} --> {end}")
lines.append(seg.text.strip())
lines.append("")
return "\n".join(lines)
def to_dict(self) -> dict:
return {
"language": self.language,
"language_probability": self.language_probability,
"duration": self.duration,
"segments": [
{
"id": seg.id,
"text": seg.text,
"start": seg.start,
"end": seg.end,
"words": [
{
"word": w.word,
"start": w.start,
"end": w.end,
"probability": w.probability,
}
for w in seg.words
],
}
for seg in self.segments
],
}
    def save(self, output_dir: Path):
        output_dir.mkdir(parents=True, exist_ok=True)

        # JSON with full detail
        with open(output_dir / "transcript.json", "w", encoding="utf-8") as f:
            json.dump(self.to_dict(), f, indent=2)

        # Plain text (explicit UTF-8: the platform default encoding may not
        # cover every character in the transcript)
        with open(output_dir / "transcript.txt", "w", encoding="utf-8") as f:
            f.write(self.full_text)

        # SRT subtitles
        with open(output_dir / "transcript.srt", "w", encoding="utf-8") as f:
            f.write(self.to_srt())

        console.print(f"[green]Transcript saved to {output_dir}[/green]")
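def _format_srt_time(seconds: float) -> str:
    # Work in integer milliseconds so float error can't drop a millisecond
    # (e.g. int((1.001 % 1) * 1000) yields 0, not 1)
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"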
def transcribe(audio_path: str | Path, model_size: str = "large-v3",
language: str = "en", device: str = "cuda") -> Transcript:
"""Transcribe an audio file using faster-whisper."""
from faster_whisper import WhisperModel
audio_path = Path(audio_path)
console.print(f"[bold]Transcribing:[/bold] {audio_path.name}")
console.print(f"[dim]Model: {model_size}, Device: {device}[/dim]")
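    # float16 is a GPU compute type; fall back to int8 when running on CPU
    compute_type = "float16" if device == "cuda" else "int8"
    model = WhisperModel(model_size, device=device, compute_type=compute_type)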
segments_raw, info = model.transcribe(
str(audio_path),
language=language,
word_timestamps=True,
vad_filter=True,
vad_parameters=dict(
min_silence_duration_ms=500,
speech_pad_ms=200,
),
)
console.print(f"[dim]Detected language: {info.language} "
f"(probability: {info.language_probability:.2f})[/dim]")
console.print(f"[dim]Duration: {info.duration:.1f}s "
f"({info.duration / 60:.1f} min)[/dim]")
segments = []
with Progress(
SpinnerColumn(),
TextColumn("[progress.description]{task.description}"),
BarColumn(),
TextColumn("{task.completed} segments"),
TimeElapsedColumn(),
console=console,
) as progress:
task = progress.add_task("Transcribing...", total=None)
for i, seg in enumerate(segments_raw):
words = [
TranscriptWord(
word=w.word,
start=w.start,
end=w.end,
probability=w.probability,
)
for w in (seg.words or [])
]
segments.append(TranscriptSegment(
id=i,
text=seg.text,
start=seg.start,
end=seg.end,
words=words,
))
progress.update(task, completed=i + 1)
console.print(f"[green]Transcription complete: {len(segments)} segments[/green]")
return Transcript(
segments=segments,
language=info.language,
language_probability=info.language_probability,
duration=info.duration,
)

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long


@@ -0,0 +1,256 @@
# Training Plan: Using the 579-Episode Archive
## Available Training Data
### Episode Archive
- **Location:** `/home/gurushow/public_html/archive/` on IX server (172.16.3.10)
- **Count:** 579 MP3 files, 7.8GB
- **Span:** 2010-2018 (Seasons 6-10)
- **Format:** Split into "HR 1" / "HR 2" per episode (2-hour shows)
- **Year breakdown:**
  - 2010: 43 files (664MB)
  - 2011: 200 files (1.9GB)
  - 2012: 98 files (1.2GB)
  - 2014: 81 files (783MB)
  - 2015: 50 files (461MB)
  - 2016: 54 files (1.2GB)
  - 2017: 41 files (1.5GB)
  - 2018: 5 files (101MB)
### Show Production Elements
- **Location:** `/home/gurushow/public_html/archive/Radio/Elements/`
- **Intros:** 5 WAV files (show intro variations, beast intro, kick back intro, streaming intro)
- **Outros:** 2 WAV files
- **Bumpers:** 7 files (MP3 + WAV) — music stingers for transitions
- **Promos:** 2 WAV files (promo windows, show spot)
- **Corrected versions:** Separate folder with phone-number-corrected versions
---
## Phase 1: Audio Element Library (Seed + Discover)
### Purpose
Build a library of all show production elements (intros, outros, bumpers, stingers, station IDs) for reliable segment boundary detection. The archive contains SOME elements but not all — different stations and eras used different production elements.
### Step 1: Seed with known elements
1. Download all files from `Radio/Elements/` on IX server (7 MP3 + 18 WAV)
2. Convert WAVs to consistent format (mono, 16kHz for fingerprinting)
3. Generate chromaprint fingerprints for each element
4. Store in `element-library/fingerprints.db` (SQLite)
5. Categorize: show-intro, show-outro, segment-bumper, break-bumper, promo
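
The seeding steps above could be sketched roughly as follows. This is a minimal sketch, assuming `ffmpeg` and `fpcalc` are installed and on the PATH; the table layout and the helper names (`normalize_cmd`, `fingerprint`, `open_library`, `add_element`) are illustrative and not part of the project.

```python
import json
import sqlite3
import subprocess
from pathlib import Path

CATEGORIES = {"show-intro", "show-outro", "segment-bumper", "break-bumper", "promo"}


def normalize_cmd(src: Path, dst: Path) -> list[str]:
    """Build the ffmpeg command that converts a file to mono 16 kHz audio."""
    return ["ffmpeg", "-y", "-i", str(src), "-ac", "1", "-ar", "16000", str(dst)]


def fingerprint(path: Path) -> list[int]:
    """Run fpcalc and return the raw chromaprint fingerprint as integers.

    Assumes fpcalc's JSON output carries a "fingerprint" array when -raw is set.
    """
    result = subprocess.run(
        ["fpcalc", "-raw", "-json", str(path)],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["fingerprint"]


def open_library(db_path: str) -> sqlite3.Connection:
    """Open (or create) the fingerprint database with a hypothetical schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS elements (
               id INTEGER PRIMARY KEY,
               name TEXT NOT NULL,
               category TEXT NOT NULL,
               fingerprint TEXT NOT NULL  -- JSON list of integers
           )"""
    )
    return conn


def add_element(conn: sqlite3.Connection, name: str, category: str,
                fp: list[int]) -> int:
    """Store one element and return its row id; rejects unknown categories."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    cur = conn.execute(
        "INSERT INTO elements (name, category, fingerprint) VALUES (?, ?, ?)",
        (name, category, json.dumps(fp)),
    )
    conn.commit()
    return cur.lastrowid
```

Storing the raw integer fingerprint (rather than the compressed string) keeps later bitwise comparison simple, at the cost of a somewhat larger database.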
### Step 2: Discover unknown elements from archive
1. Process episodes through the pipeline
2. Detect short non-speech audio segments (music, jingles, produced audio)
3. Extract each detected clip
4. Compare against known fingerprints — if no match, store as candidate
5. Compare candidates against each other across episodes
6. Cluster: same audio appearing in 3+ episodes = confirmed show element
7. Add to fingerprint database as "unnamed" element
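
One possible shape for the compare-and-cluster steps is shown below. It is a sketch under stated assumptions: fingerprints are raw chromaprint integer lists, clips are already trimmed so they start at the same point (no offset search), and the `max_ber` threshold of 0.15 is a guess that would need tuning on real data.

```python
def bit_error_rate(fp_a: list[int], fp_b: list[int]) -> float:
    """Fraction of differing bits over the overlapping part of two fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 1.0
    # Mask to 32 bits so signed and unsigned fingerprint values compare the same
    diff = sum(bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp_a, fp_b))
    return diff / (32 * n)


def cluster_candidates(candidates, max_ber=0.15, min_episodes=3):
    """Greedy clustering: each clip joins the first cluster whose seed it matches.

    candidates: list of (episode_id, fingerprint) tuples.
    Returns only clusters seen in at least min_episodes distinct episodes.
    """
    clusters = []
    for episode_id, fp in candidates:
        for cluster in clusters:
            seed_fp = cluster[0][1]
            if bit_error_rate(seed_fp, fp) <= max_ber:
                cluster.append((episode_id, fp))
                break
        else:
            # No existing cluster matched, so this clip seeds a new one
            clusters.append([(episode_id, fp)])
    return [c for c in clusters if len({ep for ep, _ in c}) >= min_episodes]
```

Counting distinct episodes rather than raw clip count keeps a jingle that repeats several times within one show from being promoted on its own.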
### Step 3: Host review
- Present discovered clusters: "This 4-second audio clip appears in 38 episodes between 2015-2017 — what is it?"
- Host names and categorizes each cluster
- Named elements improve future detection accuracy
### What This Enables
- **Known elements:** Exact boundary detection when a fingerprinted intro/bumper is detected
- **Unknown elements:** Even without the source file, if the same jingle appears repeatedly, we know it marks a boundary
- **Era awareness:** Elements used in 2011 may differ from 2016 — the library tracks date ranges
- **New show elements:** When the show returns in 2026 with a new station, new bumpers get discovered automatically after a few episodes
### Tools
- `chromaprint` / `fpcalc` for audio fingerprinting
- `librosa` for spectral analysis and non-speech detection
- `dejavu` (Python audio fingerprinting library) or custom matching
- SQLite for fingerprint storage and lookup
---
## Phase 2: Host Voice Profile (Bootstrapped from Archive)
### Purpose
Build an extremely robust speaker embedding for Mike's voice using hundreds of hours of confirmed speech.
### Method
#### Step 1: Bootstrap from clean segments
Show intros typically feature the host speaking alone, so start with a handful of episodes where the host is the only speaker for the first few minutes:
1. Transcribe 10 diverse episodes (different years, different energy levels)
2. Run pyannote diarization
3. The dominant speaker in each episode = the host (by far the most speaking time)
4. Extract host-only segments from each episode
5. Generate embeddings from all host segments
6. Average/cluster to create a robust reference embedding
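
Steps 3 and 6 can be illustrated with plain Python. This is a sketch only: it assumes diarization output has already been reduced to `(speaker_label, start, end)` tuples and that embeddings are equal-length lists of floats; a real implementation would more likely use NumPy arrays.

```python
import math
from collections import defaultdict


def dominant_speaker(turns):
    """Return the speaker label with the most total speaking time.

    turns: iterable of (speaker_label, start_seconds, end_seconds).
    """
    totals = defaultdict(float)
    for speaker, start, end in turns:
        totals[speaker] += end - start
    return max(totals, key=totals.get)


def mean_embedding(embeddings):
    """Average equal-length vectors and L2-normalize the result."""
    dim = len(embeddings[0])
    mean = [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean
```

The dominant-speaker shortcut only holds while the host really does talk the most, so a long guest interview is worth a manual check before its segments are folded into the reference.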
#### Step 2: Validate across eras
The host's voice may have changed subtly over 8 years. Generate per-year embeddings:
- 2010 voice profile
- 2014 voice profile
- 2018 voice profile
- 2026 voice profile (from new episodes)

Store all as the same speaker with temporal metadata. The matching algorithm checks against all variants.
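
Checking against all variants might look like the sketch below. The `threshold` of 0.7 and the dictionary keyed by year are assumptions for illustration; the right cutoff depends on the embedding model actually used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_host(embedding, era_profiles, threshold=0.7):
    """Compare against every era profile; return (is_host, best_era, best_score)."""
    best_era, best_score = None, -1.0
    for era, profile in era_profiles.items():
        score = cosine_similarity(embedding, profile)
        if score > best_score:
            best_era, best_score = era, score
    return best_score >= threshold, best_era, best_score
```

Taking the best score across eras means a 2011 recording is not penalized for sounding unlike the 2018 profile.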
#### Step 3: Continuous improvement
Each newly processed episode refines the host embedding: confirmed host segments are folded back into the reference.

---
## Phase 3: Commercial Break Pattern Training
### Purpose
Learn the specific audio patterns that signal commercial breaks. Because not all production elements are in the archive, the detector must combine multiple signals rather than relying solely on fingerprint matching.
### Method: Multi-Signal Classifier
The classifier combines all available signals with weighted scoring. No single signal is required — the system degrades gracefully when some signals are unavailable.
#### Signal 1: Known + discovered element fingerprints
- Match detected audio against the element library (Phase 1)
- If a known break-bumper is detected, high confidence of a break boundary
- If no match, other signals still contribute
- **Availability:** Partial for archive episodes (incomplete element library), improves over time via discovery
#### Signal 2: Speaker identity (from Phase 2)
- Host voice present = show content (high confidence)
- Host voice absent for >30 seconds = possible break
- Multiple unfamiliar voices in quick succession with produced audio = commercial cluster
- **Availability:** High — host voice profile is robust from hundreds of hours
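
The 30-second absence rule can be expressed as a gap search over confirmed host speech. A minimal sketch, assuming host turns are available as `(start, end)` tuples in seconds; the function name is illustrative.

```python
def host_absence_gaps(host_turns, duration, min_gap=30.0):
    """Return (start, end) spans where the host is silent longer than min_gap.

    host_turns: (start, end) tuples of confirmed host speech, in seconds.
    duration: total length of the recording, in seconds.
    """
    gaps = []
    cursor = 0.0  # end of the latest host speech seen so far
    for start, end in sorted(host_turns):
        if start - cursor > min_gap:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if duration - cursor > min_gap:
        gaps.append((cursor, duration))
    return gaps
```

Each returned span is only a candidate break; a long caller monologue produces the same gap, which is why this signal is combined with the others.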
#### Signal 3: Audio characteristics
- Extract per-segment features: MFCC, spectral centroid, zero-crossing rate, loudness (LUFS), dynamic range
- Commercials typically: higher loudness, more compression, different spectral profile, different room tone
- Show content typically: consistent room tone, natural dynamic range, live mic characteristics
- **Availability:** Always available — inherent to audio
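
To make the feature idea concrete, here is a dependency-free sketch of three simple measurements. These are stand-ins, not the features named above: RMS level is only a rough proxy for LUFS, crest factor is a rough proxy for dynamic range, and MFCC or spectral centroid would come from a library such as `librosa`. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose sign differs."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)


def rms_dbfs(samples):
    """RMS level in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def crest_factor_db(samples):
    """Peak-to-RMS ratio in dB; heavy compression tends to lower this value."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(peak / rms) if rms > 0 else 0.0
```

A heavily compressed commercial would be expected to show a higher RMS level and a lower crest factor than live studio speech, which is the contrast the classifier tries to pick up.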
#### Signal 4: HR 1/HR 2 boundary training
Since archive episodes are split into Hour 1 and Hour 2, the END of HR 1 and START of HR 2 always contain a commercial break boundary. This gives us 194+ confirmed break points.
1. Take the last 5 minutes of every HR 1 file and first 5 minutes of every HR 2 file
2. Analyze the audio feature transition at the show→commercial boundary
3. Train a Random Forest classifier on these labeled transitions
4. Apply the learned transition pattern to detect similar boundaries within single-file recordings
- **Availability:** Training data from archive; model applies to new episodes
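
Before any classifier can be trained, the hour files have to be paired and the analysis windows located. The sketch below covers only that bookkeeping. The filename pattern (an episode key followed by "HR 1" or "HR 2") is an assumption about the archive's naming, and the Random Forest training itself is not shown; note also that training would need negative examples, for instance windows taken from mid-segment show content, which the plan does not spell out.

```python
import re
from collections import defaultdict

# Hypothetical naming: "<episode key> HR 1.mp3" / "<episode key> HR 2.mp3"
HOUR_RE = re.compile(r"^(?P<episode>.*?)[\s_-]*HR\s*(?P<hour>[12])\b", re.IGNORECASE)


def pair_hours(filenames):
    """Group filenames by episode key and keep only complete HR 1 / HR 2 pairs."""
    episodes = defaultdict(dict)
    for name in filenames:
        match = HOUR_RE.search(name)
        if match:
            episodes[match.group("episode")][int(match.group("hour"))] = name
    return {key: hours for key, hours in episodes.items() if set(hours) == {1, 2}}


def boundary_windows(hr1_duration, window=300.0):
    """Return the (start, end) windows: last 5 min of HR 1, first 5 min of HR 2."""
    return (max(0.0, hr1_duration - window), hr1_duration), (0.0, window)
```

Episodes with only one surviving hour file are dropped here, since they provide no confirmed boundary.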
#### Signal 5: Structural heuristics
- Commercial breaks are typically 2-5 minutes
- Shows typically break every 12-20 minutes
- Transition phrases in transcript ("We'll be right back", "Welcome back", "Stay tuned")
- Silence gaps >1 second often bookend breaks
- **Availability:** Always available
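
Two of these heuristics are easy to sketch: spotting transition phrases in the transcript and checking that a candidate break has a plausible length. The phrase list comes from the bullets above; the `(start_seconds, text)` segment shape and the function names are assumptions.

```python
import re

TRANSITION_RE = re.compile(
    r"we'?ll be right back|welcome back|stay tuned", re.IGNORECASE
)


def transition_cues(segments):
    """Return (start_time, matched_phrase) for segments containing a cue.

    segments: iterable of (start_seconds, text) from the transcript.
    """
    cues = []
    for start, text in segments:
        # Normalize curly apostrophes so "We’ll" matches as well
        match = TRANSITION_RE.search(text.replace("\u2019", "'"))
        if match:
            cues.append((start, match.group(0).lower()))
    return cues


def plausible_break(start, end, min_len=120.0, max_len=300.0):
    """True when a candidate break lasts roughly 2-5 minutes."""
    return min_len <= end - start <= max_len
```

A "right back" cue followed within 2-5 minutes by a "welcome back" cue is a strong hint that the span between them is a break.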
#### Combined scoring
Each signal produces a confidence value (0.0-1.0). Weighted sum determines classification:
```
score = (w1 * fingerprint_match) +
        (w2 * speaker_absence) +
        (w3 * audio_characteristics) +
        (w4 * break_pattern_match) +
        (w5 * structural_heuristic)

if score > threshold: classify as commercial
```
Default weights (tunable after validation):
- Fingerprint match: 0.30 (strongest when available, but often unavailable)
- Speaker identity: 0.25 (very reliable)
- Audio characteristics: 0.20 (always available)
- Break pattern: 0.15 (learned from archive)
- Structural: 0.10 (least reliable alone, but useful confirmation)
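
A runnable version of the weighted score, using the default weights listed above, is sketched here. One detail is an assumption rather than something the plan specifies: when a signal is unavailable (`None`), the remaining weights are renormalized so the score stays on a 0.0-1.0 scale. The `threshold` of 0.5 is likewise a placeholder.

```python
DEFAULT_WEIGHTS = {
    "fingerprint_match": 0.30,
    "speaker_absence": 0.25,
    "audio_characteristics": 0.20,
    "break_pattern_match": 0.15,
    "structural_heuristic": 0.10,
}


def commercial_score(signals, weights=DEFAULT_WEIGHTS):
    """Weighted average over the signals that are present (value is not None)."""
    available = {name: value for name, value in signals.items() if value is not None}
    total_weight = sum(weights[name] for name in available)
    if total_weight == 0:
        return 0.0
    return sum(weights[name] * value for name, value in available.items()) / total_weight


def is_commercial(signals, threshold=0.5, weights=DEFAULT_WEIGHTS):
    """Classify a segment as commercial when its score exceeds the threshold."""
    return commercial_score(signals, weights) > threshold
```

Without renormalization, a segment with no fingerprint match could never score above 0.70, so a single fixed threshold would behave differently for archive episodes and for new-station episodes.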
#### Self-calibration
After processing a batch of archive episodes:
1. Compare detected breaks against HR1/HR2 boundaries (known ground truth)
2. Auto-tune weights on a tuning subset of episodes to maximize accuracy
3. Report accuracy metrics on held-out episodes that played no part in tuning
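
Comparing detected breaks against ground truth needs a matching rule. The sketch below treats a detection as correct when it falls within a tolerance of a known boundary; the 5-second tolerance and the function name are assumptions. Because only the hour boundary is known for archive episodes, recall on those known boundaries is the more meaningful number; precision is informative only where every break in the recording has been labeled.

```python
def boundary_metrics(detected, truth, tolerance=5.0):
    """Match detected break times to ground-truth times within a tolerance.

    Each ground-truth boundary can be claimed by at most one detection.
    Returns (precision, recall).
    """
    unmatched = sorted(truth)
    hits = 0
    for t in sorted(detected):
        match = next((g for g in unmatched if abs(g - t) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall
```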
### Expected Accuracy
- With all signals available (including fingerprint match): >95%
- Without fingerprint matches (new station, new elements): >85%
- Improves over time as element discovery adds to the fingerprint library
---
## Phase 4: Repeat Speaker Detection
### Purpose
Identify co-hosts, regular callers, and guests across the archive.
### Method
1. Diarize a representative sample (20-30 episodes across all years)
2. For each episode, extract embeddings for all non-host speakers
3. Cluster all non-host embeddings across all episodes
4. Clusters that appear in multiple episodes = repeat speakers
5. Present clusters to the host for naming: "This voice appears in 47 episodes — who is this?"
6. Save named speaker profiles
### Known Speakers to Look For
- Co-hosts (Harry mentioned in early episodes)
- Regular callers
- Recurring guests
---
## Phase 5: Batch Processing Pipeline
### Purpose
Process the full archive to build the training dataset and generate transcripts.
### Approach: Incremental, not all-at-once
**Batch 1: Training set (10 episodes)**
- Select 10 episodes spanning different years
- Full transcription + diarization
- Manual review to validate accuracy
- Use results to tune parameters
**Batch 2: Element fingerprinting**
- Download and fingerprint all show elements
- Test detection against Batch 1 episodes
**Batch 3: Commercial detection training**
- Process 50 HR1/HR2 pairs
- Train break detection classifier
- Validate against held-out episodes
**Batch 4: Full archive (optional, on demand)**
- Process remaining episodes as background task
- Each file: ~5-10 minutes to transcribe on RTX 5070 Ti
- Full archive (579 files): ~50-100 hours of compute time
- Run overnight in batches
### Storage Requirements
- Transcripts (JSON): ~500KB per episode × 194 = ~100MB
- Speaker embeddings: negligible
- Processed audio (if re-encoding): skip unless needed
- Total new storage: < 500MB for all metadata
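
The estimates above can be reproduced with a few lines of arithmetic. The inputs are the figures from this plan; reading the 5-10 minute figure as a per-file cost applied to all 579 files is an interpretation, chosen because it is the one that lands near the stated 50-100 hours.

```python
def compute_hours(n_files, minutes_per_file):
    """Total processing time in hours at a fixed per-file cost."""
    return n_files * minutes_per_file / 60


def transcript_storage_mb(n_episodes, kb_per_episode=500):
    """Total transcript size in MB (1 MB = 1024 KB)."""
    return n_episodes * kb_per_episode / 1024


low, high = compute_hours(579, 5), compute_hours(579, 10)
print(f"Compute: {low:.2f} to {high:.2f} hours")
print(f"Transcripts: {transcript_storage_mb(194):.1f} MB")
```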
---
## Implementation Priority
1. **Set up Python environment** — venv with faster-whisper, pyannote, torch CUDA
2. **Download show elements** — Fingerprint the known intros/outros/bumpers (seed library)
3. **Process 3-5 archive episodes** — Validate transcription + diarization quality
4. **Build host voice profile** — Bootstrap from initial batch
5. **Run element discovery on initial batch** — Find unknown elements, begin clustering
6. **Train commercial detector** — Using HR1/HR2 boundaries + all available signals
7. **Process 20-30 more episodes** — Expand element library, refine classifier weights, discover repeat speakers
8. **Host review session** — Name discovered elements and speaker clusters
9. **Build the CLI tool** — Wire it all together with config file
10. **Process a new 2026 episode end-to-end** — Full pipeline test with new station's elements
11. **Batch process remaining archive** — Background task, overnight
---
## Disk Space Plan
The archive is 7.8GB on IX server. Options:
1. **Stream from server** — Process one at a time via SSH/SCP, don't store locally
2. **Download subset** — Training set only (~500MB for 10 episodes + elements)
3. **Download all** — 7.8GB to local disk (easy, NVMe has plenty of space)
4. **NFS/SSHFS mount** — Mount the IX server directory, process in place

Recommendation: Download the elements + 10-episode training set first. Full archive download only when ready for batch processing.


@@ -0,0 +1,276 @@
# Post-Show Workflow: The Computer Guru Show
## Overview
After each live show, this workflow transforms the broadcast into multiple content pieces that extend the show's reach, deepen audience engagement, and build a searchable archive. The process starts with a debrief and produces 3 tiers of content.

---
## Phase 1: Post-Show Debrief (Same Day)
### Input
- Show prep file (`episodes/YYYY-MM-DD-topic/show-prep.md`)
- Host's notes on what actually happened during the show
### Debrief Questionnaire
Create a file: `episodes/YYYY-MM-DD-topic/post-show-debrief.md`
```markdown
# Post-Show Debrief
## Episode: [title]
## Air Date: [date]
### What Made It In
- [ ] Segment 1: [topic] — Used / Modified / Cut
- [ ] Segment 2: [topic] — Used / Modified / Cut
- [ ] Segment 3: [topic] — Used / Modified / Cut
- [ ] Segment 4: [topic] — Used / Modified / Cut
- [ ] Segment 5: [topic] — Used / Modified / Cut
- [ ] Segment 6: [topic] — Used / Modified / Cut
### What Changed Live
- Segments reordered? Which ones?
- Topics expanded beyond prep? Which ones and why?
- Topics cut short? Why? (time, audience reaction, breaking news)
- Unplanned tangents that worked well?
### Caller/Audience Interaction
- Caller topics and questions (summarize each)
- Live chat highlights (if applicable)
- Audience reactions that shifted the conversation
### Unplanned Additions
- Breaking news discussed
- Personal stories / anecdotes shared
- Technical demos or live troubleshooting
- Guest appearances or call-ins
### Best Moments
- Strongest segment (what resonated most)
- Best one-liner or quotable moment
- Most engaging audience interaction
- "Wish I'd said..." moments (capture for blog expansion)
### Topics That Deserve More
- What couldn't you finish due to time?
- What generated the most audience interest?
- What deserves a deep-dive blog post?
- Follow-up stories to watch for next week?
```
---
## Phase 2: Content Generation (Within 48 Hours)
### Tier 1: Episode Post (Radio Show Website)
**Target:** `website/src/content/episodes/s[SS]e[EE]-slug.md`
**Purpose:** Canonical episode page with summary, chapters, and links
**Structure:**
```markdown
---
title: "S[X]E[X] [Episode Title]"
season: [number]
episode: [number]
pubDate: [air date]
duration: "[HH:MM:SS]"
audioUrl: "[podcast audio URL]"
audioSize: [bytes]
episodeType: "full"
featured: [true for current episode]
tags: [topic tags from show prep + debrief]
chapters:
  - time: "00:00"
    title: "Introduction"
  - time: "MM:SS"
    title: "[Segment 1 title]"
  [...]
---
## Episode Summary
[2-3 paragraph summary of what the show covered — written from the
debrief, not just the prep. Captures what ACTUALLY happened, including
unplanned moments, caller contributions, and tangents that worked.]
## Topics Covered
### [Topic 1 Title]
[3-5 sentence summary with the key takeaway. Link to deep-dive blog
post if one exists.]
### [Topic 2 Title]
[...]
## Links & Resources
- [Relevant links mentioned on air]
- [Source articles referenced in prep]
## Continue the Conversation
- [Link to forum discussion thread]
- [Link to related blog posts]
```
**Action items:**
1. Generate episode markdown from show-prep + debrief
2. Add chapter timestamps (from audio if available, estimated from segment timing if not)
3. Create matching forum discussion thread (Flarum, tag: Show Discussion)
4. Build and deploy website
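
For action item 2, chapter entries can be generated from segment start times. A minimal sketch, assuming segments arrive as `(start_seconds, title)` pairs; switching to `HH:MM:SS` after the first hour is an assumption, since the template only shows `MM:SS`.

```python
def format_chapter_time(seconds):
    """Format seconds as MM:SS, or HH:MM:SS once past the first hour."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"


def chapters_yaml(segments):
    """Render (start_seconds, title) pairs as a `chapters:` front-matter block."""
    lines = ["chapters:"]
    for start, title in segments:
        safe_title = title.replace('"', '\\"')  # keep the YAML string valid
        lines.append(f'  - time: "{format_chapter_time(start)}"')
        lines.append(f'    title: "{safe_title}"')
    return "\n".join(lines)
```

The same function works for both cases named in the action item: pass detected segment boundaries when audio has been processed, or estimated offsets from the show prep timing when it has not.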
### Tier 2: Forum Discussion Thread (Community Forum)
**Target:** Flarum forum at community.azcomputerguru.com
**Tag:** Show Discussion (ID 8)
**Purpose:** Ongoing conversation hub for each episode
**Structure:**
```
Title: S[X]E[X] Discussion: [Episode Title] — [Air Date]
Body:
This week's episode: [Episode Title]
[Brief 2-3 sentence hook — the most provocative or interesting
angle from the show]
Topics we covered:
- [Topic 1] — [one-line teaser]
- [Topic 2] — [one-line teaser]
- [Topic 3] — [one-line teaser]
What do you think? Drop your thoughts below.
- Did we miss anything on [controversial topic]?
- What's your experience with [relatable topic]?
- [Specific question raised by a caller that others might want to weigh in on]
Listen to the full episode: [link to episode page]
Read our deep-dive on [topic]: [link to blog post]
```
### Tier 3: Deep-Dive Blog Posts (Radio Show Website)
**Target:** `website/src/content/blog/[slug].md`
**Purpose:** SEO-rich, shareable long-form content that expands on show topics
**Selection criteria (from debrief):**
- Topics that generated the most audience interest
- Topics cut short due to time
- Topics with strong search potential (trending tech news)
- Topics where the host has unique expertise or perspective
**Structure:**
```markdown
---
title: "[Expanded Topic Title]"
pubDate: [date, within 48h of show]
description: "[SEO-friendly 150-char description]"
author: "Mike Swanson"
tags: [relevant tags]
image: [optional hero image]
---
[Long-form article expanding on the show segment. NOT a transcript.
This is the version you'd write if you had unlimited airtime:]
- Background context the audience needs
- The full argument with supporting evidence
- Technical details simplified for general audience
- What it means for regular people (the show's signature angle)
- What to watch for next (forward-looking)
- Host's personal take / opinion
## Key Takeaways
- [Bullet point summary for skimmers]
## Related Episodes
- [Links to past episodes that covered related topics]
*This topic was discussed on [Episode Title], airing [date].
[Listen to the full episode →](link)*
```
**Recommended: 1-3 blog posts per episode**, focusing on the strongest topics.

---
## Phase 3: Cross-Promotion & Engagement
### Immediate (Day of Show)
- [ ] Post episode page to website
- [ ] Create forum discussion thread
- [ ] Cross-link episode ↔ forum thread
### Within 48 Hours
- [ ] Publish deep-dive blog post(s)
- [ ] Cross-link blog posts ↔ episode page ↔ forum
- [ ] Update episode page with blog post links
### Engagement Opportunities to Build Out
#### Currently Missing (Identify & Prioritize)
1. **Social media distribution** — No social accounts linked. Where does the audience hang out? Twitter/X? Facebook? Reddit? Mastodon?
2. **Email newsletter** — Subscribe page exists but is placeholder. Mailchimp/Buttondown/self-hosted? Weekly digest of episode + blog posts?
3. **Podcast distribution** — Audio URL points to Blubrry (legacy). Are new episodes going to Apple Podcasts, Spotify, etc.? RSS feed exists (`feed.xml.ts`) but needs verification.
4. **Show notes SEO** — Episode pages need proper meta descriptions, Open Graph tags, structured data (PodcastEpisode schema).
5. **Audiogram/clips** — Short audio or video clips of the best 60-90 seconds for social sharing.
6. **Caller follow-up** — If callers raise topics, follow up in blog posts and tag them (builds loyalty).
7. **"This Week in Tech" roundup email** — Repurpose the show prep Quick Headlines into a weekly email blast.
8. **Community forum engagement** — Seed discussion threads with provocative questions, not just summaries. Respond to replies.
9. **Guest booking pipeline** — The show prep references industry topics where expert guests would add value. Track potential guests.
10. **Analytics-driven topic selection** — Use Matomo data to see which episode pages and blog posts get the most traffic, inform future show prep.
---
## Automation Opportunities
### What Claude Can Do Now
- Generate episode post from show-prep + debrief
- Generate forum discussion thread
- Generate deep-dive blog posts from show prep segments
- Post to forum via Flarum database insert
- Build and deploy website via Astro build + rsync
- Track analytics via Matomo
### What Needs Setup
- Podcast audio hosting for new episodes (Blubrry? Podbean? Self-hosted?)
- Social media API access (for automated posting)
- Newsletter platform (for automated digest)
- Audio processing pipeline (for audiograms/clips)
---
## File Structure
```
episodes/
  YYYY-MM-DD-topic/
    show-prep.md              ← Pre-show (already exists)
    post-show-debrief.md      ← NEW: Post-show notes
    generated/
      episode-post.md         ← Generated episode page content
      forum-thread.md         ← Generated forum discussion
      blog-topic-1.md         ← Generated deep-dive blog post
      blog-topic-2.md         ← Generated deep-dive blog post
```
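
Creating this layout for a new episode could be automated with a small helper. This is a sketch; `scaffold_episode` and its arguments are hypothetical, and the debrief stub contains only the heading rather than the full questionnaire.

```python
from pathlib import Path


def scaffold_episode(root, air_date, topic_slug):
    """Create the episode folder layout and return the episode directory."""
    episode_dir = Path(root) / "episodes" / f"{air_date}-{topic_slug}"
    (episode_dir / "generated").mkdir(parents=True, exist_ok=True)
    debrief = episode_dir / "post-show-debrief.md"
    if not debrief.exists():  # never overwrite notes the host already wrote
        debrief.write_text("# Post-Show Debrief\n", encoding="utf-8")
    return episode_dir
```

For example, `scaffold_episode(".", "2026-03-21", "whos-really-in-control")` would create `episodes/2026-03-21-whos-really-in-control/` with an empty `generated/` folder and a debrief stub.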
---
## Example: March 21 Episode
If we ran this workflow for today's "Who's Really In Control?" episode:
**Episode post:** S11E02 (placeholder; substitute the actual season/episode number)
**Forum thread:** "S11E02 Discussion: Who's Really In Control? — March 21, 2026"
**Blog post candidates (from show prep):**
1. "The White House AI Framework: What It Actually Says and Why It Matters" — Strong SEO potential, timely, unique angle (preemption vs. state laws)
2. "NVIDIA's Trillion-Dollar Bet: How One Company Controls the AI Revolution" — Evergreen explainer, strong search volume
3. "Apple Gave Google the Keys to Siri — Here's Why That Should Concern You" — Provocative, shareable, high interest
4. "1 Petabyte Stolen: Inside the TELUS Digital Breach" — Cybersecurity angle, practical advice for listeners
5. "Right to Repair Just Became Law — What You Can (and Can't) Fix Now" — Practical, actionable, local angle
**Recommended picks:** #1 (timely + unique), #3 (provocative + shareable), #5 (practical + evergreen)