Add radio show audio processor and post-show workflow
- Audio processor CLI tool with 6-stage pipeline: transcribe (faster-whisper GPU), diarize (pyannote), detect segments (multi-signal classifier), remove commercials, split segments, analyze content (Ollama)
- Post-show workflow doc for episode posts, forum threads, deep-dive blog posts
- Training plan for using 579-episode archive for voice profiles and commercial detection
- Successful test: 45min episode transcribed in 2:37 on RTX 5070 Ti
- Sample transcript output from S7E30 (March 2015)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4
.gitignore
vendored
@@ -63,3 +63,7 @@ api/.env
.mcp.json
Pictures/
.grepai/
# Radio processor
projects/radio-show/audio-processor/test-data/*.mp3
projects/radio-show/audio-processor/*.egg-info/
365
projects/radio-show/audio-processor/README.md
Normal file
@@ -0,0 +1,365 @@
# Radio Show Audio Processor

Automated pipeline for processing The Computer Guru Show recordings into podcast-ready audio, transcripts, and segmented clips.

## What It Does

```
Raw MP3 (full broadcast with commercials)
    │
    ├── 1. Transcribe (Whisper + GPU)
    │   └── Full transcript with timestamps
    │
    ├── 2. Speaker Diarization (pyannote)
    │   └── Who said what (host vs. callers vs. guests)
    │
    ├── 3. Segment Detection
    │   ├── Identify show segments vs. commercials
    │   ├── Detect music/jingles (known + discovered)
    │   └── Map segments to show prep structure
    │
    ├── 4. Commercial Removal
    │   └── Clean episode MP3 (show content only)
    │
    ├── 5. Segment Splitting
    │   ├── Individual segment MP3s (for social media)
    │   └── Chapter markers (for podcast players)
    │
    └── 6. Content Analysis (Ollama)
        ├── Episode summary
        ├── Topic extraction
        ├── Key quotes
        └── Auto-populate post-show debrief
```

## Architecture

### Pipeline Stages

#### Stage 1: Transcription (`faster-whisper`, GPU)
- **Model:** `large-v3` (best accuracy, ~3GB VRAM)
- **Why faster-whisper:** CTranslate2 backend, up to 4x faster than OpenAI Whisper at the same accuracy, lower VRAM use
- **Output:** Word-level timestamps, language detection
- **Hardware:** RTX 5070 Ti (16GB VRAM), plenty for large-v3

#### Stage 2: Speaker Diarization (`pyannote.audio`)
- **Model:** `pyannote/speaker-diarization-3.1`
- **Purpose:** Identify speaker turns (host, caller 1, caller 2, etc.)
- **Voice enrollment:** Bootstrapped from archive (hundreds of hours of host speech)
- **Output:** Speaker segments with timestamps

#### Stage 3: Segment Detection (Multi-Signal Classifier)

Commercial and segment detection combines multiple signals, because not all show production elements are in the archive: bumpers, stingers, and jingles vary across stations and eras.

**Signal 1: Known element fingerprints (seed library)**
- Fingerprint the production elements we DO have (intros, outros, bumpers from archive)
- Match against episodes to detect known boundaries
- Partial coverage: some elements won't match

**Signal 2: Unknown element discovery**
- Detect short non-speech audio segments (music, jingles, produced audio) that don't match any known fingerprint
- Cluster unknown elements across episodes: if the same 5-second clip appears in 30 episodes, it's a show element
- Flag new clusters for host review and naming
- Discovered elements get added to the fingerprint library automatically

**Signal 3: Speaker identity**
- Host voice present = show content
- Non-host voices with commercial audio characteristics (compressed, produced, different acoustic environment) = ads
- Host voice absent for extended periods (>30s) = likely commercial break

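The host-absence rule in Signal 3 can be sketched as a gap search over diarization output. This is a minimal illustration, not the project's code: it assumes diarization yields `(speaker, start, end)` tuples in seconds and that the host carries the label `"host"`. The function name `find_host_gaps` and that tuple layout are hypothetical.

```python
def find_host_gaps(turns, total_duration, host="host", min_gap_s=30.0):
    """Return (start, end) spans where the host is absent for more than min_gap_s.

    turns: iterable of (speaker, start, end) tuples in seconds (assumed layout).
    """
    host_turns = sorted((s, e) for spk, s, e in turns if spk == host)
    gaps = []
    cursor = 0.0  # end of the latest host speech seen so far
    for start, end in host_turns:
        if start - cursor > min_gap_s:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Trailing gap after the host's last turn
    if total_duration - cursor > min_gap_s:
        gaps.append((cursor, total_duration))
    return gaps


turns = [
    ("host", 0.0, 600.0),
    ("caller", 600.0, 615.0),      # short caller turn, not a break
    ("host", 615.0, 900.0),
    ("ad_voice", 900.0, 1080.0),   # three minutes without the host
    ("host", 1080.0, 1500.0),
]
print(find_host_gaps(turns, total_duration=1500.0))
```

A gap found this way is only a candidate: a long caller monologue would also show up, which is why the other signals are scored alongside it.
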
**Signal 4: Audio characteristics**
- Volume/loudness shifts (commercials often have different LUFS profiles)
- Spectral characteristics (produced/compressed commercial audio vs. live studio mic)
- Silence gaps (dead air between show and ads)
- Audio environment changes (room tone, background noise differences)

**Signal 5: Learned break patterns (from archive)**
- HR1/HR2 file boundaries = confirmed commercial break locations
- Train a classifier on the audio features at these known boundaries
- Generalize to detect similar patterns within single-file recordings

**Signal 6: Structural heuristics**
- Commercial breaks are typically 2-5 minutes
- Shows typically break every 12-20 minutes
- Transition phrases in transcript ("We'll be right back", "Welcome back")

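The transition-phrase heuristic from Signal 6 could look roughly like the sketch below. It assumes transcript segments are dicts with `start`, `end`, and `text` keys; that layout and the name `find_transition_cues` are illustrative assumptions, and only the two phrases quoted above are matched.

```python
import re

# The two phrases come from the list above; a real phrase list would be longer.
BREAK_OUT = re.compile(r"\bwe'?ll be right back\b", re.IGNORECASE)
BREAK_IN = re.compile(r"\bwelcome back\b", re.IGNORECASE)


def find_transition_cues(segments):
    """Return (time, kind) cues found in transcript segments.

    segments: iterable of dicts with "start", "end", "text" keys (assumed layout).
    """
    cues = []
    for seg in segments:
        text = seg["text"].replace("\u2019", "'")  # normalize curly apostrophes
        if BREAK_OUT.search(text):
            cues.append((seg["end"], "break_start"))   # break begins after the phrase
        if BREAK_IN.search(text):
            cues.append((seg["start"], "break_end"))   # break ends where the phrase begins
    return cues


segments = [
    {"start": 890.0, "end": 900.0, "text": "We'll be right back after this."},
    {"start": 1080.0, "end": 1085.0, "text": "Welcome back to the show."},
]
cues = find_transition_cues(segments)
print(cues)
```

Pairing a `break_start` with the next `break_end` gives a candidate break whose length can then be checked against the 2-5 minute range.
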
**Combined scoring:** Each signal contributes a confidence score. A segment is classified as commercial when the combined score exceeds a threshold. This is resilient to missing fingerprints: even without a known bumper match, the other signals can still identify breaks.

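One way to read the combined scoring is as a weighted average over whichever signals produced evidence. The sketch below uses the weights and threshold from this project's `config.yaml`; renormalizing over the available signals is an assumption made here so that a missing fingerprint does not cap the score, and the function names are hypothetical.

```python
WEIGHTS = {
    "fingerprint_match": 0.30,
    "speaker_identity": 0.25,
    "audio_characteristics": 0.20,
    "break_pattern": 0.15,
    "structural_heuristic": 0.10,
}
CONFIDENCE_THRESHOLD = 0.70


def combined_score(signals, weights=WEIGHTS):
    """Weighted average over the signals that are present.

    signals: dict of signal name -> score in [0, 1]; None means "no evidence".
    """
    available = {k: v for k, v in signals.items() if v is not None and k in weights}
    total_weight = sum(weights[k] for k in available)
    if total_weight == 0:
        return 0.0
    return sum(weights[k] * v for k, v in available.items()) / total_weight


def is_commercial(signals, threshold=CONFIDENCE_THRESHOLD):
    return combined_score(signals) > threshold


# A break with no known bumper: the fingerprint signal has nothing to say.
no_fingerprint = {
    "fingerprint_match": None,
    "speaker_identity": 0.9,
    "audio_characteristics": 0.8,
    "break_pattern": 0.7,
    "structural_heuristic": 1.0,
}
print(round(combined_score(no_fingerprint), 3), is_commercial(no_fingerprint))
```

Without the renormalization, the same segment would score 0.59 (a plain weighted sum with the fingerprint term at zero) and fall under the 0.70 threshold, so how missing signals are treated matters for the resilience described above.
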
#### Stage 4: Commercial Removal (`ffmpeg`)
- Stitch show segments together with crossfades
- Normalize audio levels (EBU R128 loudness standard)
- Output clean podcast-ready MP3

#### Stage 5: Segment Splitting (`ffmpeg`)
- Export individual segments as separate MP3s
- Apply fade in/out
- Add ID3 tags (show name, segment title, date)
- Generate chapter markers file (for podcast apps)

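The ID3 tagging step is not part of `audio_editor.py` in this commit. A minimal sketch of how it might be done with ffmpeg's `-metadata` option follows; the helper name `build_tag_command` and the choice to store the show name in the `album` field are assumptions for illustration.

```python
def build_tag_command(src, dst, show, title, date):
    """Build an ffmpeg command that copies the audio stream and writes ID3 tags.

    ffmpeg cannot edit a file in place, so dst must differ from src.
    """
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-c", "copy",                 # no re-encode, only rewrite the tags
        "-id3v2_version", "3",        # write ID3v2.3 tags
        "-metadata", f"title={title}",
        "-metadata", f"album={show}",
        "-metadata", f"date={date}",
        str(dst),
    ]


cmd = build_tag_command(
    "segments/01-the-week-that-was.mp3",
    "segments/tagged/01-the-week-that-was.mp3",
    show="The Computer Guru Show",
    title="The Week That Was",
    date="2026-03-21",
)
print(cmd)
```

The list can be passed to `subprocess.run(cmd, check=True)`, the same way the existing helpers invoke ffmpeg.
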
#### Stage 6: Content Analysis (`Ollama`, qwen3:14b or codestral)
- Feed transcript + speaker labels to local LLM
- Generate:
  - Episode summary (2-3 paragraphs)
  - Per-segment summaries
  - Key quotes with speaker attribution
  - Topic tags
  - Suggested blog post topics
  - Auto-filled post-show debrief template

### Audio Element Library

The element library is a **learning system**, not a static collection.

```
element-library/
  fingerprints.db            # SQLite database of audio fingerprints
  known/                     # Source files we have
    intros/
    outros/
    bumpers/
    promos/
  discovered/                # Elements found by the discovery system
    cluster-001.mp3          # Unknown element, appears in 47 episodes
    cluster-002.mp3          # Unknown element, appears in 12 episodes
    ...
  metadata.json              # Names, categories, date ranges for each element
```

**Lifecycle of a discovered element:**
1. Processor detects non-speech audio that doesn't match any known fingerprint
2. Audio clip is extracted and stored as a candidate
3. Candidate is compared against all other candidates across episodes
4. Matches are clustered: same audio in multiple episodes = confirmed element
5. New element is fingerprinted and added to the library as "unnamed"
6. Host reviews unnamed elements periodically and assigns names/categories
7. Named elements improve future detection accuracy

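Steps 3 and 4 of the lifecycle can be sketched as a greedy clustering pass. This is a toy illustration under stated assumptions: each candidate clip is represented by a small feature vector (a real system would compare audio fingerprints), similarity is cosine similarity, and the defaults mirror `cluster_similarity_threshold: 0.90` and `min_cluster_occurrences: 3` from the configuration. The names `cluster_candidates` and `cosine` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_candidates(candidates, similarity=0.90, min_occurrences=3):
    """Greedily group candidate clips and keep groups seen in enough episodes.

    candidates: list of (episode_id, feature_vector) pairs (assumed layout).
    Each cluster is compared through its first member (the "seed").
    """
    clusters = []  # each: {"seed": vector, "episodes": set of episode ids}
    for episode_id, vec in candidates:
        for cluster in clusters:
            if cosine(cluster["seed"], vec) >= similarity:
                cluster["episodes"].add(episode_id)
                break
        else:
            clusters.append({"seed": list(vec), "episodes": {episode_id}})
    return [c for c in clusters if len(c["episodes"]) >= min_occurrences]


candidates = [
    ("ep1", [1.0, 0.0, 0.1]),
    ("ep2", [0.98, 0.05, 0.1]),
    ("ep3", [1.0, 0.02, 0.12]),
    ("ep1", [0.0, 1.0, 0.0]),
    ("ep4", [0.0, 0.9, 0.1]),
]
confirmed = cluster_candidates(candidates)
print([sorted(c["episodes"]) for c in confirmed])
```

The second group appears in only two episodes, so it stays a candidate until it is seen again.
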
**Element categories:**
- `show-intro`: Full show opening
- `show-outro`: Full show closing
- `segment-bumper`: Music between show segments
- `break-bumper`: Music going into/out of commercial breaks
- `station-id`: Station identification (legal requirement, consistent per station)
- `promo`: Show promo or cross-promotion
- `stinger`: Short audio effect (sound effect, catchphrase)
- `unknown`: Not yet categorized

### Voice Profile System

**Bootstrapped from 579-episode archive**, not a single enrollment sample.

```
voice-profiles/
  host-mike-swanson/
    embedding-composite.npy  # Average embedding across all eras
    embedding-2010.npy       # Era-specific (voice changes over time)
    embedding-2014.npy
    embedding-2018.npy
    embedding-2026.npy
    metadata.json            # Speaker name, role, episode count
  guests/
    [name].npy               # Named guest embeddings (built over time)
  callers/
    regular-001.npy          # Unnamed repeat caller
    regular-002.npy
  unknown/
    cluster-[id].npy         # Voices that appear multiple times, not yet named
```

**Bootstrap process:**
1. Diarize 10 diverse archive episodes (different years)
2. Dominant speaker in each = host (by far the most speaking time)
3. Extract host-only segments, generate embeddings
4. Create per-era profiles (voice may change over 8+ years)
5. Composite embedding = average across all eras

**Continuous improvement:**
- Each processed episode refines the host embedding
- Repeat non-host voices are clustered across episodes
- Host reviews clusters: "This voice appears in 47 episodes. Who is this?"
- Named profiles improve future speaker labeling

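The composite profile and the host match can be illustrated with plain lists standing in for embeddings. This is a sketch under assumptions: real embeddings would be loaded from the `.npy` files and have hundreds of dimensions, the comparison is taken to be cosine similarity, and 0.75 mirrors `host_match_threshold` in the configuration. The names `mean_vector` and `is_host` are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_host(embedding, era_profiles, threshold=0.75):
    """Match against the composite and each era profile; the best score decides."""
    composite = mean_vector(list(era_profiles.values()))
    best = max(cosine(embedding, p) for p in [composite, *era_profiles.values()])
    return best >= threshold


# Toy 3-dimensional "embeddings", one per era.
era_profiles = {
    "2010": [0.9, 0.1, 0.0],
    "2014": [0.8, 0.2, 0.1],
    "2018": [0.7, 0.3, 0.1],
}
print(is_host([0.85, 0.15, 0.05], era_profiles))  # close to every era profile
print(is_host([0.0, 0.1, 0.9], era_profiles))     # a clearly different voice
```

Checking the era-specific profiles as well as the composite lets an old episode match its own era even if the averaged profile has drifted toward recent recordings.
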
## Dependencies

```bash
# System packages
sudo pacman -S python-pip ffmpeg

# Python packages (in a venv)
python3 -m venv ~/.local/share/radio-processor
source ~/.local/share/radio-processor/bin/activate

pip install faster-whisper     # Transcription (CTranslate2 + CUDA)
pip install pyannote.audio     # Speaker diarization
pip install torch torchaudio   # PyTorch (CUDA)
pip install silero-vad         # Voice activity detection
pip install pydub              # Audio manipulation
pip install librosa            # Audio analysis / spectral features
pip install chromaprint        # Audio fingerprinting (or use dejavu)
pip install scikit-learn       # Break pattern classifier
pip install ollama             # Local LLM API
pip install rich               # CLI progress display
```

### pyannote.audio Access
pyannote requires accepting the model licenses on HuggingFace:
1. Create account at huggingface.co
2. Accept the user conditions at https://huggingface.co/pyannote/speaker-diarization-3.1 and at https://huggingface.co/pyannote/segmentation-3.0 (the segmentation model that pipeline depends on)
3. Generate access token
4. `huggingface-cli login`

## Usage

```bash
# Full pipeline (new episode)
radio-process process episode.mp3 --show-prep episodes/2026-03-21-who-controls-your-tech/show-prep.md

# Just transcribe
radio-process transcribe episode.mp3

# Process archive episode (training mode: learns elements + voices)
radio-process process episode-hr1.mp3 episode-hr2.mp3 --archive-mode --date 2016-03-15

# Batch process archive for training
radio-process --batch-train archive/2016/ --output training-data/

# Enroll host voice from archive (bootstrap)
radio-process bootstrap-voice archive/ --speaker-name "Mike Swanson" --role host

# Review discovered elements
radio-process review-elements

# Review unknown speaker clusters
radio-process review-speakers
```

## Output Structure

```
episodes/YYYY-MM-DD-topic/
  show-prep.md                          # Pre-show (existing)
  post-show-debrief.md                  # Auto-generated draft
  raw/
    full-broadcast.mp3                  # Original recording
  processed/
    transcript.json                     # Full transcript with timestamps + speakers
    transcript.txt                      # Plain text transcript
    transcript.srt                      # Subtitle format
    podcast-episode.mp3                 # Clean episode (commercials removed)
    chapters.json                       # Chapter markers
    detection-report.json               # What was detected as commercial/show, confidence scores
  segments/
    00-intro.mp3
    01-the-week-that-was.mp3
    02-the-government-wants-in.mp3
    03-jensens-trillion-dollar-bet.mp3
    04-apple-gives-google-the-keys.mp3
    05-a-petabyte-of-your-data-gone.mp3
    06-right-to-repair.mp3
    07-outro.mp3
  generated/
    episode-post.md                     # For website
    forum-thread.md                     # For community forum
    blog-topic-1.md                     # Deep-dive article
    blog-topic-2.md                     # Deep-dive article
    analysis.json                       # LLM analysis output
```

## Configuration

```yaml
# config.yaml
show:
  name: "The Computer Guru Show"
  host: "Mike Swanson"
  typical_duration_minutes: 120  # 2-hour broadcast
  segment_count: 6
  has_commercials: true

audio:
  whisper_model: "large-v3"
  whisper_language: "en"
  output_format: "mp3"
  output_bitrate: "192k"
  normalize: true  # EBU R128
  crossfade_ms: 500  # Between stitched segments

segment_detection:
  # Fingerprint matching
  fingerprint_db: "element-library/fingerprints.db"
  fingerprint_match_threshold: 0.85  # Minimum similarity for a match

  # Element discovery
  discover_unknown_elements: true
  min_element_duration_s: 1.0  # Shortest element to detect
  max_element_duration_s: 30.0  # Longest (full intro might be 20-30s)
  cluster_similarity_threshold: 0.90  # How similar clips must be to cluster
  min_cluster_occurrences: 3  # Must appear in 3+ episodes to be an element

  # Commercial classification
  min_break_duration_s: 30  # Minimum commercial break length
  max_break_duration_s: 300  # Maximum (5 min)
  silence_threshold_db: -40  # Silence detection threshold
  confidence_threshold: 0.70  # Combined score to classify as commercial

  # Signal weights (tune based on accuracy)
  weights:
    fingerprint_match: 0.30  # Known element detected
    speaker_identity: 0.25  # Host voice absent
    audio_characteristics: 0.20  # Production style differs
    break_pattern: 0.15  # Matches trained break pattern
    structural_heuristic: 0.10  # Duration/timing rules

diarization:
  min_speakers: 1
  max_speakers: 6
  voice_profiles_dir: "voice-profiles/"
  host_match_threshold: 0.75  # Similarity to host embedding

llm:
  model: "qwen3:14b"  # Ollama model for analysis
  ollama_host: "http://localhost:11434"

paths:
  episodes_dir: "episodes/"
  voice_profiles: "voice-profiles/"
  element_library: "element-library/"
  output_dir: "processed/"

archive:
  server: "172.16.3.10"
  path: "/home/gurushow/public_html/archive/"
  elements_path: "/home/gurushow/public_html/archive/Radio/Elements/"
```

## Training Data: 579-Episode Archive

The archive on the IX server (172.16.3.10) contains 579 MP3 files spanning 2010-2018. The per-year rows below total 572 files; the 7 element MP3s bring the count to 579. No files are listed for 2013.

| Year | Files | Size | Notes |
|------|-------|------|-------|
| 2010 | 43 | 664MB | Season 7 start |
| 2011 | 200 | 1.9GB | Peak output |
| 2012 | 98 | 1.2GB | |
| 2014 | 81 | 783MB | Season 6 (new station) |
| 2015 | 50 | 461MB | |
| 2016 | 54 | 1.2GB | |
| 2017 | 41 | 1.5GB | |
| 2018 | 5 | 101MB | Final season 10 episodes |
| Elements | 7 MP3 + 18 WAV | 203MB | Partial production library |

Episodes are split into HR 1 / HR 2 files. The HR boundary is a confirmed commercial break point and is used for training the break detection classifier.

**Important:** Not all production elements are in the archive. Bumpers, stingers, and jingles varied across stations and time periods. The element discovery system handles this by detecting and clustering unknown elements across episodes.

## Future Enhancements

1. **Audiogram generator**: Create video clips with waveform animation + captions for social media
2. **Highlight reel**: Auto-detect the most engaging 60-90 seconds (high energy, laughter, emphasis)
3. **Show notes generator**: Generate timestamped show notes in podcast standard format
4. **RSS feed integration**: Auto-publish processed episodes to podcast RSS feed
5. **Sentiment analysis**: Track audience engagement topics over time
6. **Topic continuity**: Link topics across episodes ("Last week we talked about X, this week...")
7. **Live processing**: Real-time transcription during broadcast for immediate post-show turnaround
8. **Cross-episode search**: Full-text search across all transcripts ("When did we talk about net neutrality?")
57
projects/radio-show/audio-processor/config.yaml
Normal file
@@ -0,0 +1,57 @@
show:
  name: "The Computer Guru Show"
  host: "Mike Swanson"
  typical_duration_minutes: 120
  segment_count: 6
  has_commercials: true

audio:
  whisper_model: "large-v3"
  whisper_language: "en"
  output_format: "mp3"
  output_bitrate: "192k"
  normalize: true
  crossfade_ms: 500

segment_detection:
  fingerprint_db: "element-library/fingerprints.db"
  fingerprint_match_threshold: 0.85

  discover_unknown_elements: true
  min_element_duration_s: 1.0
  max_element_duration_s: 30.0
  cluster_similarity_threshold: 0.90
  min_cluster_occurrences: 3

  min_break_duration_s: 30
  max_break_duration_s: 300
  silence_threshold_db: -40
  confidence_threshold: 0.70

  weights:
    fingerprint_match: 0.30
    speaker_identity: 0.25
    audio_characteristics: 0.20
    break_pattern: 0.15
    structural_heuristic: 0.10

diarization:
  min_speakers: 1
  max_speakers: 6
  voice_profiles_dir: "voice-profiles/"
  host_match_threshold: 0.75

llm:
  model: "qwen3:14b"
  ollama_host: "http://localhost:11434"

paths:
  episodes_dir: "episodes/"
  voice_profiles: "voice-profiles/"
  element_library: "element-library/"
  output_dir: "processed/"

archive:
  server: "172.16.3.10"
  path: "/home/gurushow/public_html/archive/"
  elements_path: "/home/gurushow/public_html/archive/Radio/Elements/"
25
projects/radio-show/audio-processor/pyproject.toml
Normal file
@@ -0,0 +1,25 @@
[build-system]
requires = ["setuptools>=68.0"]
build-backend = "setuptools.build_meta"

[project]
name = "radio-processor"
version = "0.1.0"
description = "Audio processor for The Computer Guru Show"
requires-python = ">=3.11"
dependencies = [
    "faster-whisper",
    "pyannote.audio",
    "pydub",
    "librosa",
    "scikit-learn",
    "ollama",
    "rich",
    "pyyaml",
]

[project.scripts]
radio-process = "src.cli:main"

[tool.setuptools.packages.find]
include = ["src*"]
0
projects/radio-show/audio-processor/src/__init__.py
Normal file
187
projects/radio-show/audio-processor/src/analyzer.py
Normal file
@@ -0,0 +1,187 @@
"""Stage 6: Content analysis using Ollama for summary, topics, and post-show debrief."""

import json
from dataclasses import dataclass
from pathlib import Path

from rich.console import Console

console = Console()


@dataclass
class EpisodeAnalysis:
    summary: str
    segment_summaries: list[dict]  # [{title, summary, key_points}]
    key_quotes: list[dict]  # [{quote, speaker, context}], matching the prompt below
    topics: list[str]
    tags: list[str]
    blog_post_candidates: list[dict]  # [{title, angle, why}]
    debrief_draft: str  # Markdown debrief template

    def to_dict(self) -> dict:
        # debrief_draft is left out on purpose: save() writes it to its own file.
        return {
            "summary": self.summary,
            "segment_summaries": self.segment_summaries,
            "key_quotes": self.key_quotes,
            "topics": self.topics,
            "tags": self.tags,
            "blog_post_candidates": self.blog_post_candidates,
        }

    def save(self, output_dir: Path):
        output_dir.mkdir(parents=True, exist_ok=True)

        with open(output_dir / "analysis.json", "w", encoding="utf-8") as f:
            json.dump(self.to_dict(), f, indent=2)

        with open(output_dir / "post-show-debrief.md", "w", encoding="utf-8") as f:
            f.write(self.debrief_draft)

        console.print(f"[green]Analysis saved to {output_dir}[/green]")


def analyze_episode(transcript_text: str, diarization_data: dict | None = None,
                    show_prep: str | None = None, segments: list | None = None,
                    model: str = "qwen3:14b",
                    ollama_host: str = "http://localhost:11434") -> EpisodeAnalysis:
    """Analyze a transcribed episode using a local LLM."""
    import ollama as ollama_client

    console.print(f"[bold]Analyzing episode with {model}[/bold]")

    client = ollama_client.Client(host=ollama_host)

    # Build context for the LLM
    context_parts = []

    if show_prep:
        context_parts.append(f"## Show Prep (planned topics)\n\n{show_prep[:3000]}")

    # Only the first 12,000 characters of the transcript are sent, so the prompt
    # fits the 16k-token context window requested below.
    context_parts.append(f"## Transcript\n\n{transcript_text[:12000]}")

    if diarization_data:
        speakers = diarization_data.get("speaker_map", {})
        if speakers:
            speaker_info = "\n".join(f"- {v}" for v in speakers.values())
            context_parts.append(f"## Speakers Identified\n\n{speaker_info}")

    context = "\n\n---\n\n".join(context_parts)

    # Query 1: Episode summary and segment summaries
    summary_prompt = f"""You are analyzing a radio show episode transcript.
Provide a JSON response with:

1. "summary": A 2-3 paragraph episode summary suitable for a podcast episode page.
   Write in third person. Be specific about topics discussed.

2. "segment_summaries": An array of objects, each with:
   - "title": A compelling segment title
   - "summary": 3-5 sentence summary
   - "key_points": Array of key takeaway bullet points

3. "topics": Array of main topics discussed (short phrases)

4. "tags": Array of SEO-friendly tags (lowercase, hyphenated)

5. "key_quotes": Array of notable quotes, each with:
   - "quote": The quote text
   - "speaker": Who said it (if identifiable)
   - "context": Brief context

6. "blog_post_candidates": Array of topics worth expanding into blog posts, each with:
   - "title": Proposed blog post title
   - "angle": The specific angle or thesis
   - "why": Why this topic deserves expansion

Respond ONLY with valid JSON, no markdown fencing.

{context}"""

    console.print("[dim]Generating episode analysis...[/dim]")

    response = client.chat(
        model=model,
        messages=[{"role": "user", "content": summary_prompt}],
        options={"temperature": 0.3, "num_ctx": 16384},
    )

    # Parse LLM response
    response_text = response["message"]["content"]

    # Strip markdown code fences if present
    if "```json" in response_text:
        response_text = response_text.split("```json", 1)[1]
        response_text = response_text.split("```", 1)[0]
    elif "```" in response_text:
        response_text = response_text.split("```", 1)[1]
        response_text = response_text.split("```", 1)[0]

    try:
        analysis_data = json.loads(response_text.strip())
    except json.JSONDecodeError:
        # Fall back to treating the whole reply as the summary
        console.print("[yellow]LLM response was not valid JSON, using raw text[/yellow]")
        analysis_data = {
            "summary": response_text,
            "segment_summaries": [],
            "topics": [],
            "tags": [],
            "key_quotes": [],
            "blog_post_candidates": [],
        }

    # Query 2: Generate debrief draft
    debrief_prompt = f"""Based on this radio show transcript, generate a post-show debrief
in markdown format. Compare what was discussed against the show prep (planned topics)
to identify what made it in, what was cut, and what was added.

Format:

# Post-Show Debrief
## Episode: [derive title from content]
## Air Date: [today's date if not clear]

### What Made It In
[For each planned segment, note: Used / Modified / Cut]

### What Changed Live
[Topics expanded, cut short, or reordered vs. prep]

### Caller/Audience Interaction
[Any caller topics or audience engagement noted in transcript]

### Unplanned Additions
[Topics not in prep that came up]

### Best Moments
[Most compelling segments or quotes]

### Topics That Deserve More
[Topics that were rushed or generated high interest]

### Suggested Blog Posts
[2-3 specific blog post ideas with proposed titles and angles]

{context}"""

    console.print("[dim]Generating debrief draft...[/dim]")

    debrief_response = client.chat(
        model=model,
        messages=[{"role": "user", "content": debrief_prompt}],
        options={"temperature": 0.4, "num_ctx": 16384},
    )

    debrief_text = debrief_response["message"]["content"]

    console.print("[green]Analysis complete[/green]")

    return EpisodeAnalysis(
        summary=analysis_data.get("summary", ""),
        segment_summaries=analysis_data.get("segment_summaries", []),
        key_quotes=analysis_data.get("key_quotes", []),
        topics=analysis_data.get("topics", []),
        tags=analysis_data.get("tags", []),
        blog_post_candidates=analysis_data.get("blog_post_candidates", []),
        debrief_draft=debrief_text,
    )
199
projects/radio-show/audio-processor/src/audio_editor.py
Normal file
@@ -0,0 +1,199 @@
"""Stage 4 & 5: Commercial removal and segment splitting using ffmpeg."""

import subprocess
import json
from dataclasses import dataclass
from pathlib import Path

from rich.console import Console
from rich.progress import Progress

from .segment_detector import SegmentType, DetectedSegment

console = Console()


@dataclass
class Chapter:
    title: str
    start: float
    end: float


def remove_commercials(audio_path: Path, segments: list[DetectedSegment],
                       output_path: Path, crossfade_ms: int = 500,
                       bitrate: str = "192k", normalize: bool = True):
    """Stitch show segments together, removing commercials."""
    show_segments = [s for s in segments
                     if s.segment_type in (SegmentType.SHOW_CONTENT,
                                           SegmentType.SHOW_ELEMENT)]

    if not show_segments:
        console.print("[red]No show segments found![/red]")
        return

    console.print(f"[bold]Removing commercials:[/bold] {len(segments)} segments "
                  f"-> {len(show_segments)} show segments")

    output_path.parent.mkdir(parents=True, exist_ok=True)
    temp_dir = output_path.parent / ".temp_segments"
    temp_dir.mkdir(exist_ok=True)

    try:
        # Extract each show segment
        segment_files = []
        with Progress(console=console) as progress:
            task = progress.add_task("Extracting segments...",
                                     total=len(show_segments))

            for i, seg in enumerate(show_segments):
                temp_file = temp_dir / f"seg_{i:04d}.mp3"
                _extract_segment(audio_path, seg.start, seg.end,
                                 temp_file, bitrate)
                segment_files.append(temp_file)
                progress.update(task, advance=1)

        # Create concat file for ffmpeg. Absolute paths are written because the
        # concat demuxer resolves relative entries against the list file's directory.
        concat_file = temp_dir / "concat.txt"
        with open(concat_file, "w") as f:
            for sf in segment_files:
                f.write(f"file '{sf.resolve()}'\n")

        # Concatenate the segments back to back. The concat demuxer does not
        # crossfade, so crossfade_ms is not applied at this step.
        cmd = [
            "ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(concat_file),
            "-b:a", bitrate,
        ]

        if normalize:
            # EBU R128 loudness normalization
            cmd.extend([
                "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",
            ])

        cmd.append(str(output_path))

        subprocess.run(cmd, capture_output=True, check=True, timeout=600)

        # Get output duration
        duration = _get_duration(output_path)
        console.print(f"[green]Clean episode saved: {output_path.name} "
                      f"({duration / 60:.1f} min)[/green]")

    finally:
        # Cleanup temp files
        import shutil
        shutil.rmtree(temp_dir, ignore_errors=True)


def split_segments(audio_path: Path, segments: list[DetectedSegment],
                   output_dir: Path, bitrate: str = "192k"):
    """Export individual show segments as separate MP3 files."""
    show_segments = [s for s in segments
                     if s.segment_type in (SegmentType.SHOW_CONTENT,
                                           SegmentType.SHOW_ELEMENT)]

    output_dir.mkdir(parents=True, exist_ok=True)
    console.print(f"[bold]Splitting into {len(show_segments)} segments[/bold]")

    exported = []
    for i, seg in enumerate(show_segments):
        slug = _slugify(seg.label) if seg.label else f"segment-{i:02d}"
        filename = f"{i:02d}-{slug}.mp3"
        output_file = output_dir / filename

        _extract_segment(audio_path, seg.start, seg.end, output_file, bitrate,
                         fade_in_ms=200, fade_out_ms=500)

        duration = seg.duration
        console.print(f"  [green]{filename}[/green] ({duration:.0f}s)")
        exported.append({
            "file": filename,
            "label": seg.label,
            "start": seg.start,
            "end": seg.end,
            "duration": duration,
        })

    # Save manifest
    with open(output_dir / "segments.json", "w") as f:
        json.dump(exported, f, indent=2)

    return exported


def generate_chapters(segments: list[DetectedSegment],
                      output_path: Path) -> list[Chapter]:
    """Generate chapter markers from show segments."""
    show_segments = [s for s in segments
                     if s.segment_type in (SegmentType.SHOW_CONTENT,
                                           SegmentType.SHOW_ELEMENT)]

    chapters = []
    # Chapter times are relative to the stitched episode, so they accumulate
    # segment durations instead of reusing the original broadcast timestamps.
    cumulative_time = 0.0

    for seg in show_segments:
        chapters.append(Chapter(
            title=seg.label or "Segment",
            start=cumulative_time,
            end=cumulative_time + seg.duration,
        ))
        cumulative_time += seg.duration

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(
            [{"title": c.title, "start": c.start, "end": c.end}
             for c in chapters],
            f, indent=2,
        )

    console.print(f"[green]Chapter markers saved: {len(chapters)} chapters[/green]")
    return chapters


def _extract_segment(audio_path: Path, start: float, end: float,
                     output_path: Path, bitrate: str = "192k",
                     fade_in_ms: int = 0, fade_out_ms: int = 0):
    """Extract a segment from an audio file using ffmpeg."""
    duration = end - start
    cmd = [
        "ffmpeg", "-y",
        "-ss", str(start),
        "-t", str(duration),
        "-i", str(audio_path),
        "-b:a", bitrate,
    ]

    filters = []
    if fade_in_ms > 0:
        filters.append(f"afade=t=in:d={fade_in_ms / 1000}")
    if fade_out_ms > 0:
        # Clamp so clips shorter than the fade do not get a negative start time.
        fade_start = max(0.0, duration - fade_out_ms / 1000)
        filters.append(f"afade=t=out:st={fade_start}:d={fade_out_ms / 1000}")

    if filters:
        cmd.extend(["-af", ",".join(filters)])

    cmd.append(str(output_path))
    subprocess.run(cmd, capture_output=True, check=True, timeout=120)
|
||||
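To make the ffmpeg invocation easier to inspect, the command construction can be separated from the `subprocess.run` call. The following is a minimal sketch, not the project's implementation; `build_extract_command` is a hypothetical helper, and the fade-out clamp is an added assumption for clips shorter than the fade.

```python
def build_extract_command(src, dst, start, end, bitrate="192k",
                          fade_in_ms=0, fade_out_ms=0):
    """Return the ffmpeg argument list for cutting [start, end) out of src."""
    duration = end - start
    cmd = ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration),
           "-i", str(src), "-b:a", bitrate]
    filters = []
    if fade_in_ms > 0:
        filters.append(f"afade=t=in:d={fade_in_ms / 1000}")
    if fade_out_ms > 0:
        fade_s = fade_out_ms / 1000
        # Fade-out start is relative to the extracted clip, not the source file.
        fade_start = max(0.0, duration - fade_s)
        filters.append(f"afade=t=out:st={fade_start}:d={fade_s}")
    if filters:
        cmd += ["-af", ",".join(filters)]
    cmd.append(str(dst))
    return cmd


print(" ".join(build_extract_command("episode.mp3", "seg.mp3", 60, 90,
                                     fade_in_ms=200, fade_out_ms=500)))
```

Returning the argument list keeps the filter logic testable on machines where ffmpeg is not installed.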
|
||||
|
||||
def _get_duration(audio_path: Path) -> float:
    """Get audio file duration in seconds."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-show_entries", "format=duration",
         "-of", "csv=p=0", str(audio_path)],
        # check=True surfaces ffprobe failures instead of float("") errors.
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())
|
||||
|
||||
|
||||
def _slugify(text: str) -> str:
    """Convert text to a filename-safe slug."""
    import re
    text = text.lower().strip()
    text = re.sub(r'[^\w\s-]', '', text)
    text = re.sub(r'[\s_]+', '-', text)
    text = re.sub(r'-+', '-', text)
    return text[:50].strip('-')
|
||||
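The three substitutions in `_slugify` are easiest to follow on concrete labels. This standalone copy (renamed `slugify`, with a hypothetical `max_length` parameter in place of the hard-coded 50) is only a sketch for experimenting with inputs.

```python
import re


def slugify(text: str, max_length: int = 50) -> str:
    text = text.lower().strip()
    text = re.sub(r"[^\w\s-]", "", text)   # drop punctuation
    text = re.sub(r"[\s_]+", "-", text)    # whitespace/underscores -> hyphen
    text = re.sub(r"-+", "-", text)        # collapse runs of hyphens
    return text[:max_length].strip("-")


for label in ["Segment 1: The Week That Was", "  Q&A -- Listener Calls!  "]:
    print(f"{label!r} -> {slugify(label)!r}")
```

A label made only of punctuation slugifies to an empty string, so callers that build filenames from labels may want a fallback name.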
356
projects/radio-show/audio-processor/src/cli.py
Normal file
@@ -0,0 +1,356 @@
|
||||
"""CLI entry point for the radio show audio processor."""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
from rich.console import Console
|
||||
from rich.panel import Panel
|
||||
|
||||
from .config import load_config
|
||||
|
||||
console = Console()
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Radio Show Audio Processor — The Computer Guru Show",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
%(prog)s process episode.mp3
|
||||
%(prog)s process episode.mp3 --show-prep show-prep.md
|
||||
%(prog)s process hr1.mp3 hr2.mp3 --archive-mode --date 2016-03-15
|
||||
%(prog)s transcribe episode.mp3
|
||||
%(prog)s bootstrap-voice archive/
|
||||
%(prog)s review-elements
|
||||
%(prog)s review-speakers
|
||||
""",
|
||||
)
|
||||
parser.add_argument("--config", type=str, default=None,
|
||||
help="Path to config.yaml")
|
||||
|
||||
subparsers = parser.add_subparsers(dest="command", required=True)
|
||||
|
||||
# === process ===
|
||||
p_process = subparsers.add_parser("process", help="Full pipeline")
|
||||
p_process.add_argument("audio", nargs="+", type=str,
|
||||
help="Audio file(s) to process")
|
||||
p_process.add_argument("--show-prep", type=str, default=None,
|
||||
help="Path to show prep markdown file")
|
||||
p_process.add_argument("--output", type=str, default=None,
|
||||
help="Output directory")
|
||||
p_process.add_argument("--archive-mode", action="store_true",
|
||||
help="Archive mode: learn elements and voices")
|
||||
p_process.add_argument("--date", type=str, default=None,
|
||||
help="Episode date (for archive mode)")
|
||||
p_process.add_argument("--skip-transcribe", action="store_true",
|
||||
help="Skip transcription (use existing transcript)")
|
||||
p_process.add_argument("--skip-diarize", action="store_true",
|
||||
help="Skip diarization")
|
||||
p_process.add_argument("--skip-analysis", action="store_true",
|
||||
help="Skip LLM analysis")
|
||||
|
||||
# === transcribe ===
|
||||
p_transcribe = subparsers.add_parser("transcribe", help="Transcribe only")
|
||||
p_transcribe.add_argument("audio", type=str, help="Audio file")
|
||||
p_transcribe.add_argument("--output", type=str, default=None)
|
||||
p_transcribe.add_argument("--model", type=str, default=None,
|
||||
help="Whisper model size")
|
||||
|
||||
# === diarize ===
|
||||
p_diarize = subparsers.add_parser("diarize", help="Diarize only")
|
||||
p_diarize.add_argument("audio", type=str, help="Audio file")
|
||||
p_diarize.add_argument("--output", type=str, default=None)
|
||||
|
||||
# === detect ===
|
||||
p_detect = subparsers.add_parser("detect", help="Detect segments only")
|
||||
p_detect.add_argument("audio", type=str, help="Audio file")
|
||||
p_detect.add_argument("--output", type=str, default=None)
|
||||
p_detect.add_argument("--show-prep", type=str, default=None)
|
||||
|
||||
# === split ===
|
||||
p_split = subparsers.add_parser("split", help="Split into segments")
|
||||
p_split.add_argument("audio", type=str, help="Audio file")
|
||||
p_split.add_argument("--detection-report", type=str, required=True,
|
||||
help="Path to detection-report.json")
|
||||
p_split.add_argument("--output", type=str, default=None)
|
||||
|
||||
# === bootstrap-voice ===
|
||||
p_voice = subparsers.add_parser("bootstrap-voice",
|
||||
help="Bootstrap host voice profile from archive")
|
||||
p_voice.add_argument("archive_dir", type=str,
|
||||
help="Directory containing archive MP3s")
|
||||
p_voice.add_argument("--speaker-name", type=str, default="Mike Swanson")
|
||||
p_voice.add_argument("--sample-count", type=int, default=10,
|
||||
help="Number of episodes to sample")
|
||||
|
||||
# === review-elements ===
|
||||
subparsers.add_parser("review-elements",
|
||||
help="Review discovered audio elements")
|
||||
|
||||
# === review-speakers ===
|
||||
subparsers.add_parser("review-speakers",
|
||||
help="Review unknown speaker clusters")
|
||||
|
||||
args = parser.parse_args()
|
||||
config = load_config(args.config)
|
||||
|
||||
console.print(Panel.fit(
|
||||
"[bold]Radio Show Audio Processor[/bold]\n"
|
||||
"[dim]The Computer Guru Show[/dim]",
|
||||
border_style="blue",
|
||||
))
|
||||
|
||||
if args.command == "process":
|
||||
_cmd_process(args, config)
|
||||
elif args.command == "transcribe":
|
||||
_cmd_transcribe(args, config)
|
||||
elif args.command == "diarize":
|
||||
_cmd_diarize(args, config)
|
||||
elif args.command == "detect":
|
||||
_cmd_detect(args, config)
|
||||
elif args.command == "split":
|
||||
_cmd_split(args, config)
|
||||
elif args.command == "bootstrap-voice":
|
||||
_cmd_bootstrap_voice(args, config)
|
||||
elif args.command == "review-elements":
|
||||
_cmd_review_elements(args, config)
|
||||
elif args.command == "review-speakers":
|
||||
_cmd_review_speakers(args, config)
|
||||
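The `if`/`elif` dispatch chain at the end of `main()` could alternatively be expressed with argparse's `set_defaults(func=...)` pattern, which keeps each handler next to its subparser. The sketch below is an illustration under assumptions: the handler names, the `radio-processor` program name, and the reduced set of subcommands are all hypothetical.

```python
import argparse


def cmd_transcribe(args):
    return f"transcribe {args.audio}"


def cmd_review_elements(args):
    return "review-elements"


def build_parser():
    parser = argparse.ArgumentParser(prog="radio-processor")
    sub = parser.add_subparsers(dest="command", required=True)

    p_transcribe = sub.add_parser("transcribe")
    p_transcribe.add_argument("audio")
    p_transcribe.set_defaults(func=cmd_transcribe)  # handler travels with the subcommand

    p_review = sub.add_parser("review-elements")
    p_review.set_defaults(func=cmd_review_elements)
    return parser


args = build_parser().parse_args(["transcribe", "episode.mp3"])
print(args.func(args))
```

With this layout, adding a subcommand touches one place instead of two, at the cost of handlers needing a uniform signature.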
|
||||
|
||||
def _cmd_process(args, config):
|
||||
"""Full processing pipeline."""
|
||||
from .transcriber import transcribe
|
||||
from .diarizer import diarize, VoiceProfileStore
|
||||
from .segment_detector import SegmentDetector
|
||||
from .audio_editor import remove_commercials, split_segments, generate_chapters
|
||||
from .analyzer import analyze_episode
|
||||
|
||||
audio_files = [Path(f) for f in args.audio]
|
||||
audio_path = audio_files[0] # Primary file
|
||||
|
||||
# If multiple files (HR1 + HR2), concatenate first
|
||||
if len(audio_files) > 1:
|
||||
audio_path = _concatenate_audio(audio_files, config)
|
||||
|
||||
output_dir = Path(args.output) if args.output else audio_path.parent / "processed"
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Load show prep if provided
|
||||
show_prep = None
|
||||
if args.show_prep:
|
||||
show_prep = Path(args.show_prep).read_text()
|
||||
|
||||
# Stage 1: Transcribe
|
||||
transcript = None
|
||||
if not args.skip_transcribe:
|
||||
transcript = transcribe(
|
||||
audio_path,
|
||||
model_size=config.audio.whisper_model,
|
||||
language=config.audio.whisper_language,
|
||||
)
|
||||
transcript.save(output_dir)
|
||||
else:
|
||||
console.print("[dim]Skipping transcription[/dim]")
|
||||
# Try to load existing transcript
|
||||
transcript_file = output_dir / "transcript.json"
|
||||
if transcript_file.exists():
|
||||
from .transcriber import Transcript, TranscriptSegment, TranscriptWord
|
||||
import json
|
||||
with open(transcript_file) as f:
|
||||
data = json.load(f)
|
||||
transcript = Transcript(
|
||||
segments=[
|
||||
TranscriptSegment(
|
||||
id=s["id"], text=s["text"],
|
||||
start=s["start"], end=s["end"],
|
||||
words=[TranscriptWord(**w) for w in s.get("words", [])],
|
||||
)
|
||||
for s in data["segments"]
|
||||
],
|
||||
language=data["language"],
|
||||
language_probability=data["language_probability"],
|
||||
duration=data["duration"],
|
||||
)
|
||||
|
||||
# Stage 2: Diarize
|
||||
diarization = None
|
||||
if not args.skip_diarize:
|
||||
voice_profiles = VoiceProfileStore(
|
||||
config.resolve_path(config.diarization.voice_profiles_dir)
|
||||
)
|
||||
diarization = diarize(
|
||||
audio_path,
|
||||
voice_profiles=voice_profiles,
|
||||
min_speakers=config.diarization.min_speakers,
|
||||
max_speakers=config.diarization.max_speakers,
host_match_threshold=config.diarization.host_match_threshold,
|
||||
)
|
||||
diarization.save(output_dir)
|
||||
else:
|
||||
console.print("[dim]Skipping diarization[/dim]")
|
||||
|
||||
# Stage 3: Detect segments
|
||||
detector = SegmentDetector(config)
|
||||
detection = detector.detect(
|
||||
audio_path,
|
||||
transcript=transcript,
|
||||
diarization=diarization,
|
||||
show_prep=show_prep,
|
||||
)
|
||||
detection.save(output_dir)
|
||||
|
||||
# Stage 4: Remove commercials
|
||||
clean_path = output_dir / f"podcast-episode.{config.audio.output_format}"
|
||||
remove_commercials(
|
||||
audio_path, detection.segments, clean_path,
|
||||
crossfade_ms=config.audio.crossfade_ms,
|
||||
bitrate=config.audio.output_bitrate,
|
||||
normalize=config.audio.normalize,
|
||||
)
|
||||
|
||||
# Stage 5: Split segments
|
||||
segments_dir = output_dir / "segments"
|
||||
split_segments(
|
||||
audio_path, detection.segments, segments_dir,
|
||||
bitrate=config.audio.output_bitrate,
|
||||
)
|
||||
|
||||
# Generate chapters
|
||||
generate_chapters(detection.segments, output_dir / "chapters.json")
|
||||
|
||||
# Stage 6: Analyze
|
||||
if not args.skip_analysis and transcript:
|
||||
analysis = analyze_episode(
|
||||
transcript_text=transcript.full_text,
|
||||
diarization_data=diarization.to_dict() if diarization else None,
|
||||
show_prep=show_prep,
|
||||
segments=detection.segments,
|
||||
model=config.llm.model,
|
||||
ollama_host=config.llm.ollama_host,
|
||||
)
|
||||
generated_dir = output_dir.parent / "generated"
|
||||
analysis.save(generated_dir)
|
||||
|
||||
console.print("\n[bold green]Processing complete![/bold green]")
|
||||
console.print(f"Output: {output_dir}")
|
||||
|
||||
|
||||
def _cmd_transcribe(args, config):
    """Transcribe only."""
    from .transcriber import transcribe

    audio_path = Path(args.audio)
    output_dir = Path(args.output) if args.output else audio_path.parent / "processed"
    model = args.model or config.audio.whisper_model

    transcript = transcribe(
        audio_path,
        model_size=model,
        language=config.audio.whisper_language,  # same setting the full pipeline uses
    )
    transcript.save(output_dir)
|
||||
|
||||
|
||||
def _cmd_diarize(args, config):
    """Diarize only."""
    from .diarizer import diarize, VoiceProfileStore

    audio_path = Path(args.audio)
    output_dir = Path(args.output) if args.output else audio_path.parent / "processed"

    voice_profiles = VoiceProfileStore(
        config.resolve_path(config.diarization.voice_profiles_dir)
    )
    # Pass the configured limits; otherwise diarize() falls back to its defaults.
    result = diarize(
        audio_path,
        voice_profiles=voice_profiles,
        min_speakers=config.diarization.min_speakers,
        max_speakers=config.diarization.max_speakers,
        host_match_threshold=config.diarization.host_match_threshold,
    )
    result.save(output_dir)
|
||||
|
||||
|
||||
def _cmd_detect(args, config):
|
||||
"""Segment detection only."""
|
||||
from .segment_detector import SegmentDetector
|
||||
|
||||
audio_path = Path(args.audio)
|
||||
output_dir = Path(args.output) if args.output else audio_path.parent / "processed"
|
||||
|
||||
show_prep = None
|
||||
if args.show_prep:
|
||||
show_prep = Path(args.show_prep).read_text()
|
||||
|
||||
detector = SegmentDetector(config)
|
||||
result = detector.detect(audio_path, show_prep=show_prep)
|
||||
result.save(output_dir)
|
||||
|
||||
|
||||
def _cmd_split(args, config):
|
||||
"""Split using existing detection report."""
|
||||
from .audio_editor import split_segments, generate_chapters
|
||||
from .segment_detector import DetectedSegment, SegmentType
|
||||
import json
|
||||
|
||||
audio_path = Path(args.audio)
|
||||
output_dir = Path(args.output) if args.output else audio_path.parent / "segments"
|
||||
|
||||
with open(args.detection_report) as f:
|
||||
report = json.load(f)
|
||||
|
||||
segments = [
|
||||
DetectedSegment(
|
||||
start=s["start"], end=s["end"],
|
||||
segment_type=SegmentType(s["type"]),
|
||||
confidence=s["confidence"],
|
||||
label=s.get("label", ""),
|
||||
)
|
||||
for s in report["segments"]
|
||||
]
|
||||
|
||||
split_segments(audio_path, segments, output_dir, config.audio.output_bitrate)
|
||||
generate_chapters(segments, output_dir.parent / "chapters.json")
|
||||
|
||||
|
||||
def _cmd_bootstrap_voice(args, config):
|
||||
"""Bootstrap host voice profile from archive episodes."""
|
||||
console.print("[bold]Bootstrapping host voice profile[/bold]")
|
||||
console.print(f"Archive: {args.archive_dir}")
|
||||
console.print(f"Speaker: {args.speaker_name}")
|
||||
console.print(f"Sampling {args.sample_count} episodes")
|
||||
|
||||
# TODO: Implement archive sampling + diarization + embedding extraction
|
||||
console.print("[yellow]Not yet implemented — run individual diarizations first[/yellow]")
|
||||
|
||||
|
||||
def _cmd_review_elements(args, config):
|
||||
"""Review discovered audio elements."""
|
||||
console.print("[bold]Reviewing discovered elements[/bold]")
|
||||
# TODO: Implement element review UI
|
||||
console.print("[yellow]Not yet implemented[/yellow]")
|
||||
|
||||
|
||||
def _cmd_review_speakers(args, config):
|
||||
"""Review unknown speaker clusters."""
|
||||
console.print("[bold]Reviewing unknown speakers[/bold]")
|
||||
# TODO: Implement speaker review UI
|
||||
console.print("[yellow]Not yet implemented[/yellow]")
|
||||
|
||||
|
||||
def _concatenate_audio(files: list[Path], config) -> Path:
    """Concatenate multiple audio files (e.g., HR1 + HR2)."""
    import subprocess

    output = files[0].parent / f"combined_{files[0].stem}.mp3"
    concat_file = files[0].parent / ".concat_list.txt"

    with open(concat_file, "w") as f:
        for audio_file in files:
            # The concat demuxer resolves relative paths against the list
            # file's directory, so write absolute paths and escape quotes.
            escaped = str(audio_file.resolve()).replace("'", "'\\''")
            f.write(f"file '{escaped}'\n")

    try:
        subprocess.run(
            ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
             "-i", str(concat_file), "-c", "copy", str(output)],
            capture_output=True, check=True,
        )
    finally:
        concat_file.unlink()  # remove the temporary list even if ffmpeg fails

    console.print(f"[dim]Concatenated {len(files)} files -> {output.name}[/dim]")
    return output
|
||||
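The list file that `_concatenate_audio` hands to ffmpeg's concat demuxer is plain text with one `file '...'` directive per input. A minimal sketch of building that text follows; `concat_list_text` is a hypothetical helper, and writing absolute paths with escaped single quotes is a defensive assumption rather than something the code above is confirmed to need for every input.

```python
from pathlib import Path


def concat_list_text(files):
    """Build the text of an ffmpeg concat-demuxer list file."""
    lines = []
    for path in files:
        absolute = str(Path(path).resolve())
        # Inside single quotes, a literal quote is written as '\''
        escaped = absolute.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


print(concat_list_text(["hr1.mp3", "mike's hour 2.mp3"]), end="")
```

Because the inputs are stream-copied (`-c copy`), this approach works best when both hours share the same codec, sample rate, and channel layout.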
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
126
projects/radio-show/audio-processor/src/config.py
Normal file
@@ -0,0 +1,126 @@
|
||||
"""Configuration loader for the radio show audio processor."""
|
||||
|
||||
from pathlib import Path
|
||||
from dataclasses import dataclass, field
|
||||
import yaml
|
||||
|
||||
|
||||
@dataclass
|
||||
class ShowConfig:
|
||||
name: str = "The Computer Guru Show"
|
||||
host: str = "Mike Swanson"
|
||||
typical_duration_minutes: int = 120
|
||||
segment_count: int = 6
|
||||
has_commercials: bool = True
|
||||
|
||||
|
||||
@dataclass
|
||||
class AudioConfig:
|
||||
whisper_model: str = "large-v3"
|
||||
whisper_language: str = "en"
|
||||
output_format: str = "mp3"
|
||||
output_bitrate: str = "192k"
|
||||
normalize: bool = True
|
||||
crossfade_ms: int = 500
|
||||
|
||||
|
||||
@dataclass
|
||||
class DetectionWeights:
|
||||
fingerprint_match: float = 0.30
|
||||
speaker_identity: float = 0.25
|
||||
audio_characteristics: float = 0.20
|
||||
break_pattern: float = 0.15
|
||||
structural_heuristic: float = 0.10
|
||||
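The default weights sum to 1.0, so a weighted sum of per-signal scores in the range 0 to 1 stays in that range. The sketch below shows one way such a combination might look; the `combine` function, the use of field names as score keys, and the neutral value of 0.5 for missing signals are assumptions for illustration, not the detector's confirmed behavior.

```python
from dataclasses import dataclass, asdict


@dataclass
class DetectionWeights:
    fingerprint_match: float = 0.30
    speaker_identity: float = 0.25
    audio_characteristics: float = 0.20
    break_pattern: float = 0.15
    structural_heuristic: float = 0.10


def combine(weights: DetectionWeights, scores: dict, neutral: float = 0.5) -> float:
    """Weighted sum of per-signal scores; missing signals count as neutral."""
    return sum(w * scores.get(name, neutral) for name, w in asdict(weights).items())


weights = DetectionWeights()
assert abs(sum(asdict(weights).values()) - 1.0) < 1e-9  # defaults sum to 1

scores = {"fingerprint_match": 0.9, "speaker_identity": 0.8}
print(f"commercial score: {combine(weights, scores):.3f}")
```

If weights are overridden in `config.yaml`, a check like the assertion above would catch sets that no longer sum to 1.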
|
||||
|
||||
@dataclass
|
||||
class SegmentDetectionConfig:
|
||||
fingerprint_db: str = "element-library/fingerprints.db"
|
||||
fingerprint_match_threshold: float = 0.85
|
||||
discover_unknown_elements: bool = True
|
||||
min_element_duration_s: float = 1.0
|
||||
max_element_duration_s: float = 30.0
|
||||
cluster_similarity_threshold: float = 0.90
|
||||
min_cluster_occurrences: int = 3
|
||||
min_break_duration_s: int = 30
|
||||
max_break_duration_s: int = 300
|
||||
silence_threshold_db: int = -40
|
||||
confidence_threshold: float = 0.70
|
||||
weights: DetectionWeights = field(default_factory=DetectionWeights)
|
||||
|
||||
|
||||
@dataclass
|
||||
class DiarizationConfig:
|
||||
min_speakers: int = 1
|
||||
max_speakers: int = 6
|
||||
voice_profiles_dir: str = "voice-profiles/"
|
||||
host_match_threshold: float = 0.75
|
||||
|
||||
|
||||
@dataclass
|
||||
class LLMConfig:
|
||||
model: str = "qwen3:14b"
|
||||
ollama_host: str = "http://localhost:11434"
|
||||
|
||||
|
||||
@dataclass
|
||||
class PathsConfig:
|
||||
episodes_dir: str = "episodes/"
|
||||
voice_profiles: str = "voice-profiles/"
|
||||
element_library: str = "element-library/"
|
||||
output_dir: str = "processed/"
|
||||
|
||||
|
||||
@dataclass
|
||||
class ArchiveConfig:
|
||||
server: str = "172.16.3.10"
|
||||
path: str = "/home/gurushow/public_html/archive/"
|
||||
elements_path: str = "/home/gurushow/public_html/archive/Radio/Elements/"
|
||||
|
||||
|
||||
@dataclass
|
||||
class Config:
|
||||
show: ShowConfig = field(default_factory=ShowConfig)
|
||||
audio: AudioConfig = field(default_factory=AudioConfig)
|
||||
segment_detection: SegmentDetectionConfig = field(default_factory=SegmentDetectionConfig)
|
||||
diarization: DiarizationConfig = field(default_factory=DiarizationConfig)
|
||||
llm: LLMConfig = field(default_factory=LLMConfig)
|
||||
paths: PathsConfig = field(default_factory=PathsConfig)
|
||||
archive: ArchiveConfig = field(default_factory=ArchiveConfig)
|
||||
base_dir: Path = field(default_factory=lambda: Path.cwd())
|
||||
|
||||
def resolve_path(self, relative: str) -> Path:
|
||||
return self.base_dir / relative
|
||||
|
||||
|
||||
def load_config(config_path: str | Path | None = None) -> Config:
|
||||
if config_path is None:
|
||||
config_path = Path(__file__).parent.parent / "config.yaml"
|
||||
|
||||
config_path = Path(config_path)
|
||||
if not config_path.exists():
|
||||
return Config(base_dir=config_path.parent)
|
||||
|
||||
with open(config_path) as f:
|
||||
raw = yaml.safe_load(f) or {}
|
||||
|
||||
config = Config(base_dir=config_path.parent)
|
||||
|
||||
if "show" in raw:
|
||||
config.show = ShowConfig(**raw["show"])
|
||||
if "audio" in raw:
|
||||
config.audio = AudioConfig(**raw["audio"])
|
||||
if "segment_detection" in raw:
|
||||
sd = raw["segment_detection"]
|
||||
weights = DetectionWeights(**sd.pop("weights", {}))
|
||||
config.segment_detection = SegmentDetectionConfig(weights=weights, **sd)
|
||||
if "diarization" in raw:
|
||||
config.diarization = DiarizationConfig(**raw["diarization"])
|
||||
if "llm" in raw:
|
||||
config.llm = LLMConfig(**raw["llm"])
|
||||
if "paths" in raw:
|
||||
config.paths = PathsConfig(**raw["paths"])
|
||||
if "archive" in raw:
|
||||
config.archive = ArchiveConfig(**raw["archive"])
|
||||
|
||||
return config
|
||||
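`load_config` builds each dataclass by splatting the matching YAML section into its constructor, with the nested `weights` mapping popped out first. The sketch below reproduces that pattern on a plain dict, with trimmed-down hypothetical dataclasses (`Weights`, `Detection`) so it runs without PyYAML. It copies the section before popping, which is an optional refinement and not what the code above does.

```python
from dataclasses import dataclass, field


@dataclass
class Weights:
    fingerprint_match: float = 0.30
    speaker_identity: float = 0.25


@dataclass
class Detection:
    confidence_threshold: float = 0.70
    weights: Weights = field(default_factory=Weights)


def detection_from_raw(raw: dict) -> Detection:
    section = dict(raw.get("segment_detection", {}))  # copy so the caller's dict is untouched
    weights = Weights(**section.pop("weights", {}))
    return Detection(weights=weights, **section)


raw = {"segment_detection": {"confidence_threshold": 0.8,
                             "weights": {"fingerprint_match": 0.4}}}
cfg = detection_from_raw(raw)
print(cfg)

try:
    detection_from_raw({"segment_detection": {"confidense_threshold": 0.8}})
except TypeError as exc:
    print(f"rejected: {exc}")
```

One consequence of this pattern is that a misspelled or unknown key in the config file raises `TypeError` at load time instead of being ignored.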
274
projects/radio-show/audio-processor/src/diarizer.py
Normal file
@@ -0,0 +1,274 @@
|
||||
"""Stage 2: Speaker diarization using pyannote.audio with voice profile matching."""
|
||||
|
||||
import json
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from rich.console import Console
|
||||
|
||||
console = Console()
|
||||
|
||||
|
||||
@dataclass
|
||||
class SpeakerTurn:
|
||||
speaker: str # "SPEAKER_00", "Host: Mike Swanson", "Caller 1", etc.
|
||||
start: float
|
||||
end: float
|
||||
confidence: float = 1.0
|
||||
|
||||
@property
|
||||
def duration(self) -> float:
|
||||
return self.end - self.start
|
||||
|
||||
|
||||
@dataclass
|
||||
class DiarizationResult:
|
||||
turns: list[SpeakerTurn]
|
||||
num_speakers: int
|
||||
speaker_map: dict[str, str] # raw label -> friendly name
|
||||
|
||||
def speaker_at(self, time: float) -> str | None:
|
||||
"""Get the speaker at a given timestamp."""
|
||||
for turn in self.turns:
|
||||
if turn.start <= time <= turn.end:
|
||||
return turn.speaker
|
||||
return None
|
||||
|
||||
def speaker_time(self, speaker: str) -> float:
|
||||
"""Total speaking time for a speaker."""
|
||||
return sum(t.duration for t in self.turns if t.speaker == speaker)
|
||||
|
||||
def speakers_ranked(self) -> list[tuple[str, float]]:
|
||||
"""Speakers ranked by total speaking time."""
|
||||
times = {}
|
||||
for turn in self.turns:
|
||||
times[turn.speaker] = times.get(turn.speaker, 0) + turn.duration
|
||||
return sorted(times.items(), key=lambda x: x[1], reverse=True)
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"num_speakers": self.num_speakers,
|
||||
"speaker_map": self.speaker_map,
|
||||
"turns": [
|
||||
{
|
||||
"speaker": t.speaker,
|
||||
"start": t.start,
|
||||
"end": t.end,
|
||||
"confidence": t.confidence,
|
||||
}
|
||||
for t in self.turns
|
||||
],
|
||||
}
|
||||
|
||||
def save(self, output_dir: Path):
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
with open(output_dir / "diarization.json", "w") as f:
|
||||
json.dump(self.to_dict(), f, indent=2)
|
||||
console.print(f"[green]Diarization saved to {output_dir}[/green]")
|
||||
|
||||
|
||||
class VoiceProfileStore:
|
||||
"""Manages speaker voice embeddings for identification."""
|
||||
|
||||
def __init__(self, profiles_dir: str | Path):
|
||||
self.profiles_dir = Path(profiles_dir)
|
||||
self.embeddings: dict[str, np.ndarray] = {}
|
||||
self.metadata: dict[str, dict] = {}
|
||||
self._load_profiles()
|
||||
|
||||
def _load_profiles(self):
|
||||
if not self.profiles_dir.exists():
|
||||
return
|
||||
|
||||
for npy_file in self.profiles_dir.rglob("*.npy"):
|
||||
name = npy_file.stem
|
||||
# Determine speaker name from directory structure
|
||||
parent = npy_file.parent.name
|
||||
if parent.startswith("host-"):
|
||||
speaker_name = parent.replace("host-", "").replace("-", " ").title()
|
||||
role = "host"
|
||||
elif parent == "guests":
|
||||
speaker_name = name.replace("-", " ").title()
|
||||
role = "guest"
|
||||
elif parent == "callers":
|
||||
speaker_name = name
|
||||
role = "caller"
|
||||
else:
|
||||
speaker_name = name
|
||||
role = "unknown"
|
||||
|
||||
self.embeddings[name] = np.load(npy_file)
|
||||
self.metadata[name] = {
|
||||
"name": speaker_name,
|
||||
"role": role,
|
||||
"file": str(npy_file),
|
||||
}
|
||||
|
||||
if self.embeddings:
|
||||
console.print(f"[dim]Loaded {len(self.embeddings)} voice profiles[/dim]")
|
||||
|
||||
def match_embedding(self, embedding: np.ndarray, threshold: float = 0.75
|
||||
) -> tuple[str | None, float]:
|
||||
"""Match an embedding against stored profiles. Returns (name, similarity)."""
|
||||
if not self.embeddings:
|
||||
return None, 0.0
|
||||
|
||||
best_match = None
|
||||
best_score = 0.0
|
||||
|
||||
for name, stored in self.embeddings.items():
|
||||
# Cosine similarity
|
||||
similarity = np.dot(embedding, stored) / (
|
||||
np.linalg.norm(embedding) * np.linalg.norm(stored) + 1e-8
|
||||
)
|
||||
if similarity > best_score:
|
||||
best_score = similarity
|
||||
best_match = name
|
||||
|
||||
if best_score >= threshold:
|
||||
meta = self.metadata.get(best_match, {})
|
||||
friendly_name = meta.get("name", best_match)
|
||||
role = meta.get("role", "unknown")
|
||||
if role == "host":
|
||||
return f"Host: {friendly_name}", best_score
|
||||
return friendly_name, best_score
|
||||
|
||||
return None, best_score
|
||||
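The matching rule in `match_embedding` is a best-of-N cosine similarity with a cutoff. A dependency-free sketch of the same idea follows; it uses plain lists instead of NumPy arrays, and `cosine` and `best_match` are hypothetical names.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)  # epsilon avoids division by zero for zero vectors


def best_match(query, profiles, threshold=0.75):
    """Return (name, score); name is None when nothing clears the threshold."""
    best_name, best_score = None, 0.0
    for name, stored in profiles.items():
        score = cosine(query, stored)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name if best_score >= threshold else None), best_score


profiles = {"host": [1.0, 0.0, 0.0], "guest": [0.0, 1.0, 0.0]}
print(best_match([0.9, 0.1, 0.0], profiles))
print(best_match([0.0, 0.0, 0.0], profiles))
```

An all-zero query scores 0 against every profile, so a zero-vector placeholder such as the `np.zeros(256)` used in `diarize` can never produce a match.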
|
||||
def save_embedding(self, name: str, embedding: np.ndarray,
|
||||
role: str = "unknown"):
|
||||
"""Save a new voice profile."""
|
||||
if role == "host":
|
||||
subdir = self.profiles_dir / f"host-{name.lower().replace(' ', '-')}"
|
||||
elif role == "guest":
|
||||
subdir = self.profiles_dir / "guests"
|
||||
elif role == "caller":
|
||||
subdir = self.profiles_dir / "callers"
|
||||
else:
|
||||
subdir = self.profiles_dir / "unknown"
|
||||
|
||||
subdir.mkdir(parents=True, exist_ok=True)
|
||||
filename = name.lower().replace(" ", "-")
|
||||
np.save(subdir / f"{filename}.npy", embedding)
|
||||
console.print(f"[green]Saved voice profile: {name} ({role})[/green]")
|
||||
|
||||
|
||||
def diarize(audio_path: str | Path,
|
||||
voice_profiles: VoiceProfileStore | None = None,
|
||||
min_speakers: int = 1,
|
||||
max_speakers: int = 6,
|
||||
host_match_threshold: float = 0.75) -> DiarizationResult:
|
||||
"""Run speaker diarization on an audio file."""
|
||||
from pyannote.audio import Pipeline
|
||||
import torch
|
||||
|
||||
audio_path = Path(audio_path)
|
||||
console.print(f"[bold]Diarizing:[/bold] {audio_path.name}")
|
||||
|
||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||
console.print(f"[dim]Device: {device}[/dim]")
|
||||
|
||||
pipeline = Pipeline.from_pretrained(
|
||||
"pyannote/speaker-diarization-3.1"
|
||||
).to(device)
|
||||
|
||||
diarization = pipeline(
|
||||
str(audio_path),
|
||||
min_speakers=min_speakers,
|
||||
max_speakers=max_speakers,
|
||||
)
|
||||
|
||||
# Extract turns
|
||||
raw_turns = []
|
||||
for turn, _, speaker in diarization.itertracks(yield_label=True):
|
||||
raw_turns.append(SpeakerTurn(
|
||||
speaker=speaker,
|
||||
start=turn.start,
|
||||
end=turn.end,
|
||||
))
|
||||
|
||||
# Count unique speakers
|
||||
raw_speakers = set(t.speaker for t in raw_turns)
|
||||
console.print(f"[dim]Detected {len(raw_speakers)} speakers[/dim]")
|
||||
|
||||
# Match against voice profiles if available
|
||||
speaker_map = {}
|
||||
if voice_profiles and voice_profiles.embeddings:
|
||||
console.print("[dim]Matching speakers against voice profiles...[/dim]")
|
||||
embedding_model = pipeline.embedding # pyannote's embedding model
|
||||
|
||||
# Get embeddings for each detected speaker
|
||||
from pyannote.audio import Inference
|
||||
inference = Inference(pipeline.embedding, window="whole")
|
||||
|
||||
for raw_label in raw_speakers:
|
||||
# Get segments for this speaker
|
||||
speaker_segments = [t for t in raw_turns if t.speaker == raw_label]
|
||||
total_time = sum(t.duration for t in speaker_segments)
|
||||
|
||||
# Use the longest segment for embedding
|
||||
longest = max(speaker_segments, key=lambda t: t.duration)
|
||||
|
||||
try:
|
||||
# Extract embedding from audio segment
|
||||
import torchaudio
# Offsets are given in frames, so look up the file's native sample rate first.
sr = torchaudio.info(str(audio_path)).sample_rate
waveform, _ = torchaudio.load(
str(audio_path),
frame_offset=int(longest.start * sr),
num_frames=int(longest.duration * sr),
)
|
||||
# This is simplified — proper implementation would use pyannote's
|
||||
# embedding extraction pipeline
|
||||
match_name, score = voice_profiles.match_embedding(
|
||||
np.zeros(256), # placeholder
|
||||
threshold=host_match_threshold,
|
||||
)
|
||||
if match_name:
|
||||
speaker_map[raw_label] = match_name
|
||||
console.print(f" [green]{raw_label} -> {match_name} "
|
||||
f"(score: {score:.2f}, {total_time:.0f}s)[/green]")
|
||||
except Exception as e:
|
||||
console.print(f" [yellow]Could not match {raw_label}: {e}[/yellow]")
|
||||
|
||||
# If no voice profiles matched, use speaking time heuristic
|
||||
# The host almost always has the most speaking time
|
||||
if not speaker_map:
|
||||
ranked = sorted(
|
||||
[(s, sum(t.duration for t in raw_turns if t.speaker == s))
|
||||
for s in raw_speakers],
|
||||
key=lambda x: x[1],
|
||||
reverse=True,
|
||||
)
|
||||
if ranked:
|
||||
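# Profiles are keyed by file stem, so find the host entry by its role.
host_name = next((m["name"] for m in voice_profiles.metadata.values() if m.get("role") == "host"), "Unknown")
speaker_map[ranked[0][0]] = f"Host: {host_name}"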
|
||||
console.print(f" [yellow]Assumed {ranked[0][0]} is host "
|
||||
f"(most speaking time: {ranked[0][1]:.0f}s)[/yellow]")
|
||||
|
||||
# If no voice profiles at all, label by speaking time
|
||||
if not speaker_map:
|
||||
ranked = sorted(
|
||||
[(s, sum(t.duration for t in raw_turns if t.speaker == s))
|
||||
for s in raw_speakers],
|
||||
key=lambda x: x[1],
|
||||
reverse=True,
|
||||
)
|
||||
for i, (speaker, _) in enumerate(ranked):
|
||||
if i == 0:
|
||||
speaker_map[speaker] = "Host (assumed)"
|
||||
else:
|
||||
speaker_map[speaker] = f"Speaker {i}"
|
||||
|
||||
# Apply friendly names
|
||||
for turn in raw_turns:
|
||||
if turn.speaker in speaker_map:
|
||||
turn.speaker = speaker_map[turn.speaker]
|
||||
|
||||
console.print(f"[green]Diarization complete: {len(raw_turns)} turns, "
|
||||
f"{len(raw_speakers)} speakers[/green]")
|
||||
|
||||
return DiarizationResult(
|
||||
turns=raw_turns,
|
||||
num_speakers=len(raw_speakers),
|
||||
speaker_map=speaker_map,
|
||||
)
|
||||
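When no voice profile matches, `diarize` falls back to labeling the speaker with the most total talk time as the host. That heuristic can be sketched in isolation as below; the `(raw_label, start, end)` tuple shape and the `label_by_speaking_time` name are assumptions made for this example.

```python
from collections import defaultdict


def label_by_speaking_time(turns):
    """turns: iterable of (raw_label, start, end). Returns raw label -> friendly name."""
    totals = defaultdict(float)
    for label, start, end in turns:
        totals[label] += end - start
    ranked = sorted(totals, key=totals.get, reverse=True)
    return {label: ("Host (assumed)" if i == 0 else f"Speaker {i}")
            for i, label in enumerate(ranked)}


turns = [("SPEAKER_00", 0, 40), ("SPEAKER_01", 40, 55), ("SPEAKER_00", 55, 120),
         ("SPEAKER_02", 120, 125)]
print(label_by_speaking_time(turns))
```

The heuristic is likely to mislabel recordings where commercials were not removed first, since an announcer voice in a long ad block could outrank the host.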
419
projects/radio-show/audio-processor/src/segment_detector.py
Normal file
@@ -0,0 +1,419 @@
|
||||
"""Stage 3: Segment detection — multi-signal commercial/show content classifier."""
|
||||
|
||||
import json
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from enum import Enum
|
||||
|
||||
import numpy as np
|
||||
from rich.console import Console
|
||||
from rich.table import Table
|
||||
|
||||
console = Console()
|
||||
|
||||
|
||||
class SegmentType(Enum):
|
||||
SHOW_CONTENT = "show_content"
|
||||
COMMERCIAL = "commercial"
|
||||
SHOW_ELEMENT = "show_element" # intro, outro, bumper
|
||||
SILENCE = "silence"
|
||||
UNKNOWN = "unknown"
|
||||
|
||||
|
||||
@dataclass
|
||||
class DetectedSegment:
|
||||
start: float
|
||||
end: float
|
||||
segment_type: SegmentType
|
||||
confidence: float
|
||||
label: str = "" # "Segment 1: The Week That Was", "Commercial Break 1", etc.
|
||||
signals: dict | None = None # Individual signal scores
|
||||
|
||||
def __post_init__(self):
|
||||
if self.signals is None:
|
||||
self.signals = {}
|
||||
|
||||
@property
|
||||
def duration(self) -> float:
|
||||
return self.end - self.start
|
||||
|
||||
|
||||
@dataclass
|
||||
class SegmentDetectionResult:
|
||||
segments: list[DetectedSegment]
|
||||
show_segments: list[DetectedSegment]
|
||||
commercial_segments: list[DetectedSegment]
|
||||
element_segments: list[DetectedSegment]
|
||||
total_show_time: float
|
||||
total_commercial_time: float
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"total_show_time": self.total_show_time,
|
||||
"total_commercial_time": self.total_commercial_time,
|
||||
"segments": [
|
||||
{
|
||||
"start": s.start,
|
||||
"end": s.end,
|
||||
"type": s.segment_type.value,
|
||||
"confidence": s.confidence,
|
||||
"label": s.label,
|
||||
"signals": s.signals,
|
||||
}
|
||||
for s in self.segments
|
||||
],
|
||||
}
|
||||
|
||||
def save(self, output_dir: Path):
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
with open(output_dir / "detection-report.json", "w") as f:
|
||||
json.dump(self.to_dict(), f, indent=2)
|
||||
|
||||
def print_summary(self):
|
||||
table = Table(title="Segment Detection Results")
|
||||
table.add_column("Time", style="cyan")
|
||||
table.add_column("Duration", style="magenta")
|
||||
table.add_column("Type", style="green")
|
||||
table.add_column("Confidence", style="yellow")
|
||||
table.add_column("Label")
|
||||
|
||||
for seg in self.segments:
|
||||
start = _format_time(seg.start)
|
||||
dur = f"{seg.duration:.0f}s"
|
||||
type_style = {
|
||||
SegmentType.SHOW_CONTENT: "[green]SHOW[/green]",
|
||||
SegmentType.COMMERCIAL: "[red]COMMERCIAL[/red]",
|
||||
SegmentType.SHOW_ELEMENT: "[blue]ELEMENT[/blue]",
|
||||
SegmentType.SILENCE: "[dim]SILENCE[/dim]",
|
||||
SegmentType.UNKNOWN: "[yellow]UNKNOWN[/yellow]",
|
||||
}.get(seg.segment_type, str(seg.segment_type))
|
||||
|
||||
table.add_row(start, dur, type_style, f"{seg.confidence:.2f}", seg.label)
|
||||
|
||||
console.print(table)
|
||||
console.print(f"\nShow content: {self.total_show_time / 60:.1f} min")
|
||||
console.print(f"Commercials: {self.total_commercial_time / 60:.1f} min")
|
||||
|
||||
|
||||
def _format_time(seconds: float) -> str:
    m = int(seconds // 60)
    s = int(seconds % 60)
    return f"{m:02d}:{s:02d}"
|
||||
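`_format_time` keeps counting minutes past 59, so a timestamp 75 minutes into a two-hour broadcast prints as `75:30`. If hour-aware output is preferred in the summary table, a variant might look like the following sketch; `format_time` here is a hypothetical alternative, not a drop-in that the rest of the code is known to expect.

```python
def format_time(seconds: float) -> str:
    """Format as MM:SS, switching to H:MM:SS once an hour is reached."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


for t in (75, 3599.9, 4530):
    print(format_time(t))
```

Fractional seconds are truncated rather than rounded, matching the `int()` conversions in the original helper.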
|
||||
|
||||
class SegmentDetector:
|
||||
"""Multi-signal commercial/show content detector."""
|
||||
|
||||
def __init__(self, config):
|
||||
self.config = config
|
||||
self.weights = config.segment_detection.weights
|
||||
|
||||
def detect(self, audio_path: Path, transcript=None, diarization=None,
|
||||
show_prep=None) -> SegmentDetectionResult:
|
||||
"""Run all detection signals and combine scores."""
|
||||
console.print(f"[bold]Detecting segments:[/bold] {audio_path.name}")
|
||||
|
||||
# Load audio for analysis
|
||||
audio_data, sample_rate = self._load_audio(audio_path)
|
||||
duration = len(audio_data) / sample_rate
|
||||
|
||||
# Step 1: Find candidate boundaries using silence detection
|
||||
boundaries = self._detect_silence_boundaries(audio_data, sample_rate)
|
||||
console.print(f"[dim]Found {len(boundaries)} silence boundaries[/dim]")
|
||||
|
||||
# Step 2: Create candidate segments between boundaries
|
||||
candidates = self._create_candidate_segments(boundaries, duration)
|
||||
|
||||
# Step 3: Score each candidate with all available signals
|
||||
for candidate in candidates:
|
||||
scores = {}
|
||||
|
||||
# Signal 1: Fingerprint matching (if library available)
|
||||
scores["fingerprint"] = self._score_fingerprint(
|
||||
audio_data, sample_rate, candidate
|
||||
)
|
||||
|
||||
# Signal 2: Speaker identity
|
||||
if diarization:
|
||||
scores["speaker"] = self._score_speaker_identity(
|
||||
diarization, candidate
|
||||
)
|
||||
else:
|
||||
scores["speaker"] = 0.5 # neutral
|
||||
|
||||
# Signal 3: Audio characteristics
|
||||
scores["audio_chars"] = self._score_audio_characteristics(
|
||||
audio_data, sample_rate, candidate
|
||||
)
|
||||
|
||||
# Signal 4: Structural heuristics
|
||||
if transcript:
|
||||
scores["structural"] = self._score_structural(
|
||||
transcript, candidate
|
||||
)
|
||||
else:
|
||||
scores["structural"] = 0.5
|
||||
|
||||
# Combined weighted score (higher = more likely commercial)
|
||||
commercial_score = (
|
||||
self.weights.fingerprint_match * scores.get("fingerprint", 0.5) +
|
||||
self.weights.speaker_identity * scores.get("speaker", 0.5) +
|
||||
self.weights.audio_characteristics * scores.get("audio_chars", 0.5) +
|
||||
self.weights.structural_heuristic * scores.get("structural", 0.5)
|
||||
)
|
||||
|
||||
candidate.signals = scores
|
||||
candidate.confidence = commercial_score
|
||||
|
||||
if commercial_score >= self.config.segment_detection.confidence_threshold:
|
||||
candidate.segment_type = SegmentType.COMMERCIAL
|
||||
else:
|
||||
candidate.segment_type = SegmentType.SHOW_CONTENT
|
||||
|
||||
# Step 4: Merge adjacent segments of same type
|
||||
merged = self._merge_adjacent(candidates)
|
||||
|
||||
# Step 5: Apply duration constraints
|
||||
final = self._apply_constraints(merged)
|
||||
|
||||
# Step 6: Label show segments using show prep if available
|
||||
if show_prep:
|
||||
self._label_from_prep(final, transcript, show_prep)
|
||||
|
||||
# Build result
|
||||
show_segs = [s for s in final if s.segment_type == SegmentType.SHOW_CONTENT]
|
||||
comm_segs = [s for s in final if s.segment_type == SegmentType.COMMERCIAL]
|
||||
elem_segs = [s for s in final if s.segment_type == SegmentType.SHOW_ELEMENT]
|
||||
|
||||
result = SegmentDetectionResult(
|
||||
segments=final,
|
||||
show_segments=show_segs,
|
||||
commercial_segments=comm_segs,
|
||||
element_segments=elem_segs,
|
||||
total_show_time=sum(s.duration for s in show_segs),
|
||||
total_commercial_time=sum(s.duration for s in comm_segs),
|
||||
)
|
||||
|
||||
result.print_summary()
|
||||
return result
|
||||
|
||||
    def _load_audio(self, audio_path: Path) -> tuple[np.ndarray, int]:
        """Load audio file as mono numpy array."""
        import subprocess

        # Use ffmpeg to decode to raw 16-bit mono PCM at 16 kHz
        result = subprocess.run(
            ["ffmpeg", "-i", str(audio_path), "-f", "s16le", "-ac", "1",
             "-ar", "16000", "-"],
            capture_output=True, timeout=300,
        )
        if result.returncode != 0:
            # Fail loudly instead of returning an empty array on decode errors
            raise RuntimeError(
                f"ffmpeg failed for {audio_path}: "
                f"{result.stderr.decode(errors='replace')[-500:]}"
            )
        audio = np.frombuffer(result.stdout, dtype=np.int16).astype(np.float32) / 32768.0
        return audio, 16000
|
||||
|
||||
def _detect_silence_boundaries(self, audio: np.ndarray, sr: int,
|
||||
min_silence_ms: int = 500) -> list[float]:
|
||||
"""Detect silence gaps in audio that likely indicate segment boundaries."""
|
||||
frame_size = int(sr * 0.025) # 25ms frames
|
||||
hop_size = int(sr * 0.010) # 10ms hop
|
||||
threshold_db = self.config.segment_detection.silence_threshold_db
|
||||
threshold_amp = 10 ** (threshold_db / 20)
|
||||
min_silence_frames = int(min_silence_ms / 10)
|
||||
|
||||
# Calculate frame energy
|
||||
energies = []
|
||||
for i in range(0, len(audio) - frame_size, hop_size):
|
||||
frame = audio[i:i + frame_size]
|
||||
rms = np.sqrt(np.mean(frame ** 2))
|
||||
energies.append(rms)
|
||||
|
||||
# Find silence regions
|
||||
is_silent = [e < threshold_amp for e in energies]
|
||||
boundaries = []
|
||||
silent_count = 0
|
||||
|
||||
for i, silent in enumerate(is_silent):
|
||||
if silent:
|
||||
silent_count += 1
|
||||
else:
|
||||
if silent_count >= min_silence_frames:
|
||||
# Mark the midpoint of the silence as a boundary
|
||||
mid_frame = i - silent_count // 2
|
||||
boundary_time = mid_frame * 0.010
|
||||
boundaries.append(boundary_time)
|
||||
silent_count = 0
|
||||
|
||||
return boundaries
|
||||
|
||||
def _create_candidate_segments(self, boundaries: list[float],
|
||||
total_duration: float) -> list[DetectedSegment]:
|
||||
"""Create candidate segments from silence boundaries."""
|
||||
candidates = []
|
||||
prev = 0.0
|
||||
|
||||
for boundary in boundaries:
|
||||
if boundary - prev > 1.0: # Ignore segments < 1 second
|
||||
candidates.append(DetectedSegment(
|
||||
start=prev,
|
||||
end=boundary,
|
||||
segment_type=SegmentType.UNKNOWN,
|
||||
confidence=0.0,
|
||||
))
|
||||
prev = boundary
|
||||
|
||||
# Final segment
|
||||
if total_duration - prev > 1.0:
|
||||
candidates.append(DetectedSegment(
|
||||
start=prev,
|
||||
end=total_duration,
|
||||
segment_type=SegmentType.UNKNOWN,
|
||||
confidence=0.0,
|
||||
))
|
||||
|
||||
return candidates
|
||||
|
||||
def _score_fingerprint(self, audio: np.ndarray, sr: int,
|
||||
segment: DetectedSegment) -> float:
|
||||
"""Score based on audio fingerprint matching against element library.
|
||||
Returns 0.0 (no match / definitely show) to 1.0 (definite commercial boundary).
|
||||
"""
|
||||
# TODO: Implement fingerprint matching against element-library/fingerprints.db
|
||||
# For now, return neutral score
|
||||
return 0.5
|
||||
|
||||
def _score_speaker_identity(self, diarization, segment: DetectedSegment) -> float:
|
||||
"""Score based on whether the host is speaking.
|
||||
Returns 0.0 (host definitely speaking = show content)
|
||||
to 1.0 (host definitely absent = likely commercial).
|
||||
"""
|
||||
host_time = 0.0
|
||||
total_time = segment.duration
|
||||
|
||||
for turn in diarization.turns:
|
||||
if turn.end < segment.start or turn.start > segment.end:
|
||||
continue
|
||||
# Calculate overlap
|
||||
overlap_start = max(turn.start, segment.start)
|
||||
overlap_end = min(turn.end, segment.end)
|
||||
overlap = max(0, overlap_end - overlap_start)
|
||||
|
||||
if "host" in turn.speaker.lower():
|
||||
host_time += overlap
|
||||
|
||||
if total_time == 0:
|
||||
return 0.5
|
||||
|
||||
        # Clamp: overlapping host turns can push the summed overlap past the segment length
        host_fraction = min(1.0, host_time / total_time)
        # Invert: high host presence = low commercial score
        return 1.0 - host_fraction
|
||||
|
||||
def _score_audio_characteristics(self, audio: np.ndarray, sr: int,
|
||||
segment: DetectedSegment) -> float:
|
||||
"""Score based on audio production characteristics.
|
||||
Commercials tend to be louder, more compressed, different spectral profile.
|
||||
Returns 0.0 (matches show characteristics) to 1.0 (matches commercial characteristics).
|
||||
"""
|
||||
start_sample = int(segment.start * sr)
|
||||
end_sample = min(int(segment.end * sr), len(audio))
|
||||
seg_audio = audio[start_sample:end_sample]
|
||||
|
||||
if len(seg_audio) < sr: # Less than 1 second
|
||||
return 0.5
|
||||
|
||||
# RMS energy (commercials tend to be louder)
|
||||
rms = np.sqrt(np.mean(seg_audio ** 2))
|
||||
|
||||
# Dynamic range (commercials tend to be more compressed)
|
||||
frame_size = int(sr * 0.050) # 50ms frames
|
||||
frame_rms = []
|
||||
for i in range(0, len(seg_audio) - frame_size, frame_size):
|
||||
frame = seg_audio[i:i + frame_size]
|
||||
frame_rms.append(np.sqrt(np.mean(frame ** 2)))
|
||||
|
||||
if not frame_rms:
|
||||
return 0.5
|
||||
|
||||
dynamic_range = max(frame_rms) / (min(frame_rms) + 1e-8)
|
||||
|
||||
# Simple heuristic scoring:
|
||||
# High RMS + low dynamic range = compressed commercial audio
|
||||
score = 0.5
|
||||
if rms > 0.15: # Louder than typical speech
|
||||
score += 0.15
|
||||
if dynamic_range < 5.0: # Very compressed
|
||||
score += 0.15
|
||||
|
||||
return min(1.0, max(0.0, score))
|
||||
|
||||
def _score_structural(self, transcript, segment: DetectedSegment) -> float:
|
||||
"""Score based on transcript content structural cues.
|
||||
Returns 0.0 (show content cues found) to 1.0 (commercial cues found).
|
||||
"""
|
||||
text = transcript.text_at(segment.start, segment.end).lower()
|
||||
|
||||
# Show content indicators
|
||||
show_phrases = [
|
||||
"welcome back", "let's move on", "next up", "our next topic",
|
||||
"let's talk about", "as i mentioned", "the question is",
|
||||
"caller", "what do you think", "here's the thing",
|
||||
]
|
||||
# Commercial/break indicators
|
||||
break_phrases = [
|
||||
"we'll be right back", "stay tuned", "don't go anywhere",
|
||||
"after the break", "when we come back",
|
||||
]
|
||||
|
||||
show_hits = sum(1 for p in show_phrases if p in text)
|
||||
break_hits = sum(1 for p in break_phrases if p in text)
|
||||
|
||||
if show_hits > 0 and break_hits == 0:
|
||||
return 0.2 # Likely show content
|
||||
if break_hits > 0:
|
||||
return 0.8 # Likely near a break
|
||||
return 0.5 # Neutral
|
||||
|
||||
def _merge_adjacent(self, segments: list[DetectedSegment]) -> list[DetectedSegment]:
|
||||
"""Merge adjacent segments of the same type."""
|
||||
if not segments:
|
||||
return []
|
||||
|
||||
        merged = [segments[0]]
        for seg in segments[1:]:
            prev = merged[-1]
            if (prev.segment_type == seg.segment_type and
                    abs(seg.start - prev.end) < 2.0):  # Within 2 seconds
                # Extend previous segment; weight confidence by duration so a
                # long run of merges does not favour the most recent segment
                prev_dur, seg_dur = prev.duration, seg.duration
                total = prev_dur + seg_dur
                if total > 0:
                    prev.confidence = (
                        prev.confidence * prev_dur + seg.confidence * seg_dur
                    ) / total
                prev.end = seg.end
            else:
                merged.append(seg)

        return merged
|
||||
|
||||
def _apply_constraints(self, segments: list[DetectedSegment]) -> list[DetectedSegment]:
|
||||
"""Apply duration constraints — short 'commercial' segments are likely misclassified."""
|
||||
min_break = self.config.segment_detection.min_break_duration_s
|
||||
|
||||
for seg in segments:
|
||||
if (seg.segment_type == SegmentType.COMMERCIAL and
|
||||
seg.duration < min_break):
|
||||
seg.segment_type = SegmentType.SHOW_CONTENT
|
||||
seg.label = "(reclassified: too short for commercial)"
|
||||
|
||||
return segments
|
||||
|
||||
def _label_from_prep(self, segments: list[DetectedSegment],
|
||||
transcript, show_prep: str):
|
||||
"""Label show segments by matching transcript content to show prep topics."""
|
||||
# TODO: Use Ollama to match transcript sections against show prep segment titles
|
||||
# For now, number them sequentially
|
||||
show_count = 0
|
||||
comm_count = 0
|
||||
for seg in segments:
|
||||
if seg.segment_type == SegmentType.SHOW_CONTENT:
|
||||
show_count += 1
|
||||
seg.label = f"Show Segment {show_count}"
|
||||
elif seg.segment_type == SegmentType.COMMERCIAL:
|
||||
comm_count += 1
|
||||
seg.label = f"Commercial Break {comm_count}"
|
||||
179
projects/radio-show/audio-processor/src/transcriber.py
Normal file
@@ -0,0 +1,179 @@
|
||||
"""Stage 1: Audio transcription using faster-whisper with GPU acceleration."""
|
||||
|
||||
import json
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
|
||||
from rich.console import Console
|
||||
from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn, TimeElapsedColumn
|
||||
|
||||
console = Console()
|
||||
|
||||
|
||||
@dataclass
|
||||
class TranscriptWord:
|
||||
word: str
|
||||
start: float
|
||||
end: float
|
||||
probability: float
|
||||
|
||||
|
||||
@dataclass
|
||||
class TranscriptSegment:
|
||||
id: int
|
||||
text: str
|
||||
start: float
|
||||
end: float
|
||||
words: list[TranscriptWord]
|
||||
|
||||
|
||||
@dataclass
|
||||
class Transcript:
|
||||
segments: list[TranscriptSegment]
|
||||
language: str
|
||||
language_probability: float
|
||||
duration: float
|
||||
|
||||
@property
|
||||
def full_text(self) -> str:
|
||||
return " ".join(seg.text.strip() for seg in self.segments)
|
||||
|
||||
def text_at(self, start: float, end: float) -> str:
|
||||
"""Get transcript text within a time range."""
|
||||
result = []
|
||||
for seg in self.segments:
|
||||
if seg.end < start:
|
||||
continue
|
||||
if seg.start > end:
|
||||
break
|
||||
result.append(seg.text.strip())
|
||||
return " ".join(result)
|
||||
|
||||
def to_srt(self) -> str:
|
||||
"""Export as SRT subtitle format."""
|
||||
lines = []
|
||||
for i, seg in enumerate(self.segments, 1):
|
||||
start = _format_srt_time(seg.start)
|
||||
end = _format_srt_time(seg.end)
|
||||
lines.append(f"{i}")
|
||||
lines.append(f"{start} --> {end}")
|
||||
lines.append(seg.text.strip())
|
||||
lines.append("")
|
||||
return "\n".join(lines)
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"language": self.language,
|
||||
"language_probability": self.language_probability,
|
||||
"duration": self.duration,
|
||||
"segments": [
|
||||
{
|
||||
"id": seg.id,
|
||||
"text": seg.text,
|
||||
"start": seg.start,
|
||||
"end": seg.end,
|
||||
"words": [
|
||||
{
|
||||
"word": w.word,
|
||||
"start": w.start,
|
||||
"end": w.end,
|
||||
"probability": w.probability,
|
||||
}
|
||||
for w in seg.words
|
||||
],
|
||||
}
|
||||
for seg in self.segments
|
||||
],
|
||||
}
|
||||
|
||||
    def save(self, output_dir: Path):
        output_dir.mkdir(parents=True, exist_ok=True)

        # JSON with full detail
        with open(output_dir / "transcript.json", "w", encoding="utf-8") as f:
            json.dump(self.to_dict(), f, indent=2)

        # Plain text (explicit UTF-8 so non-ASCII text does not depend on the OS locale)
        with open(output_dir / "transcript.txt", "w", encoding="utf-8") as f:
            f.write(self.full_text)

        # SRT subtitles
        with open(output_dir / "transcript.srt", "w", encoding="utf-8") as f:
            f.write(self.to_srt())

        console.print(f"[green]Transcript saved to {output_dir}[/green]")
|
||||
|
||||
|
||||
def _format_srt_time(seconds: float) -> str:
    # Round to whole milliseconds first so float error cannot drop a millisecond
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
|
||||
|
||||
|
||||
def transcribe(audio_path: str | Path, model_size: str = "large-v3",
|
||||
language: str = "en", device: str = "cuda") -> Transcript:
|
||||
"""Transcribe an audio file using faster-whisper."""
|
||||
from faster_whisper import WhisperModel
|
||||
|
||||
audio_path = Path(audio_path)
|
||||
console.print(f"[bold]Transcribing:[/bold] {audio_path.name}")
|
||||
console.print(f"[dim]Model: {model_size}, Device: {device}[/dim]")
|
||||
|
||||
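    # float16 needs a GPU; fall back to int8 when running on CPU
    compute_type = "int8" if device == "cpu" else "float16"
    model = WhisperModel(model_size, device=device, compute_type=compute_type)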
|
||||
|
||||
segments_raw, info = model.transcribe(
|
||||
str(audio_path),
|
||||
language=language,
|
||||
word_timestamps=True,
|
||||
vad_filter=True,
|
||||
vad_parameters=dict(
|
||||
min_silence_duration_ms=500,
|
||||
speech_pad_ms=200,
|
||||
),
|
||||
)
|
||||
|
||||
console.print(f"[dim]Detected language: {info.language} "
|
||||
f"(probability: {info.language_probability:.2f})[/dim]")
|
||||
console.print(f"[dim]Duration: {info.duration:.1f}s "
|
||||
f"({info.duration / 60:.1f} min)[/dim]")
|
||||
|
||||
segments = []
|
||||
with Progress(
|
||||
SpinnerColumn(),
|
||||
TextColumn("[progress.description]{task.description}"),
|
||||
BarColumn(),
|
||||
TextColumn("{task.completed} segments"),
|
||||
TimeElapsedColumn(),
|
||||
console=console,
|
||||
) as progress:
|
||||
task = progress.add_task("Transcribing...", total=None)
|
||||
|
||||
for i, seg in enumerate(segments_raw):
|
||||
words = [
|
||||
TranscriptWord(
|
||||
word=w.word,
|
||||
start=w.start,
|
||||
end=w.end,
|
||||
probability=w.probability,
|
||||
)
|
||||
for w in (seg.words or [])
|
||||
]
|
||||
segments.append(TranscriptSegment(
|
||||
id=i,
|
||||
text=seg.text,
|
||||
start=seg.start,
|
||||
end=seg.end,
|
||||
words=words,
|
||||
))
|
||||
progress.update(task, completed=i + 1)
|
||||
|
||||
console.print(f"[green]Transcription complete: {len(segments)} segments[/green]")
|
||||
|
||||
return Transcript(
|
||||
segments=segments,
|
||||
language=info.language,
|
||||
language_probability=info.language_probability,
|
||||
duration=info.duration,
|
||||
)
|
||||
52637
projects/radio-show/audio-processor/test-data/output/transcript.json
Normal file
File diff suppressed because it is too large
2983
projects/radio-show/audio-processor/test-data/output/transcript.srt
Normal file
File diff suppressed because it is too large
File diff suppressed because one or more lines are too long
256
projects/radio-show/audio-processor/training-plan.md
Normal file
@@ -0,0 +1,256 @@
|
||||
# Training Plan: Using the 579-Episode Archive
|
||||
|
||||
## Available Training Data
|
||||
|
||||
### Episode Archive
|
||||
- **Location:** `/home/gurushow/public_html/archive/` on IX server (172.16.3.10)
|
||||
- **Count:** 579 MP3 files, 7.8GB
|
||||
- **Span:** 2010-2018 (Seasons 6-10)
|
||||
- **Format:** Split into "HR 1" / "HR 2" per episode (2-hour shows)
|
||||
- **Year breakdown:**
  - 2010: 43 files (664MB)
  - 2011: 200 files (1.9GB)
  - 2012: 98 files (1.2GB)
  - 2014: 81 files (783MB)
  - 2015: 50 files (461MB)
  - 2016: 54 files (1.2GB)
  - 2017: 41 files (1.5GB)
  - 2018: 5 files (101MB)
|
||||
|
||||
### Show Production Elements
|
||||
- **Location:** `/home/gurushow/public_html/archive/Radio/Elements/`
|
||||
- **Intros:** 5 WAV files (show intro variations, beast intro, kick back intro, streaming intro)
|
||||
- **Outros:** 2 WAV files
|
||||
- **Bumpers:** 7 files (MP3 + WAV) — music stingers for transitions
|
||||
- **Promos:** 2 WAV files (promo windows, show spot)
|
||||
- **Corrected versions:** Separate folder with phone-number-corrected versions
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Audio Element Library (Seed + Discover)
|
||||
|
||||
### Purpose
|
||||
Build a library of all show production elements (intros, outros, bumpers, stingers, station IDs) for reliable segment boundary detection. The archive contains some of these elements but not all, because different stations and eras used different production elements.
|
||||
|
||||
### Step 1: Seed with known elements
|
||||
1. Download all files from `Radio/Elements/` on IX server (7 MP3 + 18 WAV)
|
||||
2. Convert WAVs to consistent format (mono, 16kHz for fingerprinting)
|
||||
3. Generate chromaprint fingerprints for each element
|
||||
4. Store in `element-library/fingerprints.db` (SQLite)
|
||||
5. Categorize: show-intro, show-outro, segment-bumper, break-bumper, promo
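
Steps 3 to 5 could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the `elements` table schema and the helper names (`fingerprint_file`, `open_library`, `store_element`, `load_elements`) are hypothetical, and it assumes the Chromaprint `fpcalc` binary is on `PATH` and accepts `-raw -json`.

```python
import json
import sqlite3
import subprocess
from pathlib import Path

# Categories taken from Step 5 of the plan.
CATEGORIES = {"show-intro", "show-outro", "segment-bumper", "break-bumper", "promo"}


def fingerprint_file(path: Path) -> list[int]:
    """Return the raw Chromaprint fingerprint (a list of ints) for one file.

    Assumes fpcalc prints JSON with a "fingerprint" key for -raw -json.
    """
    out = subprocess.run(
        ["fpcalc", "-raw", "-json", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["fingerprint"]


def open_library(db_path: str) -> sqlite3.Connection:
    """Open (or create) the fingerprint database with a hypothetical schema."""
    conn = sqlite3.connect(db_path)
    # The fingerprint column holds a JSON-encoded list of ints.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS elements ("
        "name TEXT PRIMARY KEY, category TEXT NOT NULL, fingerprint TEXT NOT NULL)"
    )
    return conn


def store_element(conn: sqlite3.Connection, name: str, category: str,
                  fingerprint: list[int]) -> None:
    """Insert or update one named element."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    conn.execute(
        "INSERT OR REPLACE INTO elements VALUES (?, ?, ?)",
        (name, category, json.dumps(fingerprint)),
    )
    conn.commit()


def load_elements(conn: sqlite3.Connection) -> dict[str, list[int]]:
    """Load every stored fingerprint keyed by element name."""
    rows = conn.execute("SELECT name, fingerprint FROM elements").fetchall()
    return {name: json.loads(fp) for name, fp in rows}
```

Seeding would then be a loop over the downloaded element files, calling `store_element(conn, path.stem, category, fingerprint_file(path))` with a category chosen by hand.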
|
||||
|
||||
### Step 2: Discover unknown elements from archive
|
||||
1. Process episodes through the pipeline
|
||||
2. Detect short non-speech audio segments (music, jingles, produced audio)
|
||||
3. Extract each detected clip
|
||||
4. Compare against known fingerprints — if no match, store as candidate
|
||||
5. Compare candidates against each other across episodes
|
||||
6. Cluster: same audio appearing in 3+ episodes = confirmed show element
|
||||
7. Add to fingerprint database as "unnamed" element
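
The compare-and-cluster steps (4 to 6) might look something like this. It is a sketch only: it assumes raw Chromaprint-style fingerprints (lists of 32-bit integers), compares them at a fixed alignment with no offset search, and uses an arbitrary `max_ber` threshold that would need tuning. The function names are hypothetical.

```python
def bit_error_rate(fp_a: list[int], fp_b: list[int]) -> float:
    """Fraction of differing bits over the overlapping prefix of two fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 1.0
    # zip() stops at the shorter fingerprint; mask to 32 bits so signed values work.
    diff = sum(bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp_a, fp_b))
    return diff / (32 * n)


def confirmed_elements(candidates, max_ber=0.15, min_episodes=3):
    """Greedily cluster (episode_id, fingerprint) pairs.

    A cluster counts as a confirmed show element once it has been seen
    in at least `min_episodes` distinct episodes.
    """
    clusters = []  # each cluster: {"rep": fingerprint, "episodes": set of ids}
    for episode_id, fp in candidates:
        for cluster in clusters:
            if bit_error_rate(cluster["rep"], fp) <= max_ber:
                cluster["episodes"].add(episode_id)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append({"rep": fp, "episodes": {episode_id}})
    return [c for c in clusters if len(c["episodes"]) >= min_episodes]
```

Greedy clustering against a single representative is the simplest option; a real run over hundreds of episodes may need sliding-offset matching to cope with clips that were cut at slightly different points.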
|
||||
|
||||
### Step 3: Host review
|
||||
- Present discovered clusters: "This 4-second audio clip appears in 38 episodes between 2015-2017 — what is it?"
|
||||
- Host names and categorizes each cluster
|
||||
- Named elements improve future detection accuracy
|
||||
|
||||
### What This Enables
|
||||
- **Known elements:** Exact boundary detection when a fingerprinted intro/bumper is detected
|
||||
- **Unknown elements:** Even without the source file, if the same jingle appears repeatedly, we know it marks a boundary
|
||||
- **Era awareness:** Elements used in 2011 may differ from 2016 — the library tracks date ranges
|
||||
- **New show elements:** When the show returns in 2026 with a new station, new bumpers get discovered automatically after a few episodes
|
||||
|
||||
### Tools
|
||||
- `chromaprint` / `fpcalc` for audio fingerprinting
|
||||
- `librosa` for spectral analysis and non-speech detection
|
||||
- `dejavu` (Python audio fingerprinting library) or custom matching
|
||||
- SQLite for fingerprint storage and lookup
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Host Voice Profile (Bootstrapped from Archive)
|
||||
|
||||
### Purpose
|
||||
Build an extremely robust speaker embedding for Mike's voice using hundreds of hours of confirmed speech.
|
||||
|
||||
### Method
|
||||
|
||||
#### Step 1: Bootstrap from clean segments
|
||||
The show intros typically have the host speaking directly. Use a handful of episodes where the host is the only speaker for the first few minutes:
|
||||
1. Transcribe 10 diverse episodes (different years, different energy levels)
|
||||
2. Run pyannote diarization
|
||||
3. The dominant speaker in each episode = the host (by far the most speaking time)
|
||||
4. Extract host-only segments from each episode
|
||||
5. Generate embeddings from all host segments
|
||||
6. Average/cluster to create a robust reference embedding
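
Steps 3 and 6 could be sketched as below. This assumes diarization turns are available as plain `(speaker_label, start_s, end_s)` tuples and embeddings as lists of floats; both shapes, and the function names, are illustrative assumptions rather than pyannote's actual output types.

```python
import math
from collections import defaultdict


def dominant_speaker(turns):
    """Return the speaker label with the most total talk time.

    turns: iterable of (speaker_label, start_s, end_s).
    """
    totals = defaultdict(float)
    for speaker, start, end in turns:
        totals[speaker] += max(0.0, end - start)
    return max(totals, key=totals.get)


def reference_embedding(embeddings):
    """L2-normalise each embedding, average them, and re-normalise the mean."""
    def normalise(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    unit = [normalise(e) for e in embeddings]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return normalise(mean)
```

Normalising before averaging keeps one unusually loud or long segment from dominating the reference vector.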
|
||||
|
||||
#### Step 2: Validate across eras
|
||||
The host's voice may have changed subtly over 8 years. Generate one embedding per era rather than a single global one:
|
||||
- 2010 voice profile
|
||||
- 2014 voice profile
|
||||
- 2018 voice profile
|
||||
- 2026 voice profile (from new episodes)
|
||||
|
||||
Store all as the same speaker with temporal metadata. The matching algorithm checks against all variants.
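
Checking a new embedding against every era variant could be as simple as taking the best cosine similarity. A minimal sketch, assuming profiles are stored as a dict from era label to reference vector; the `0.7` threshold and the function names are placeholders, not measured values or existing code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_host_match(embedding, era_profiles, threshold=0.7):
    """Compare against all era variants and keep the closest one.

    Returns (era_label, similarity) when the best match clears the
    threshold, otherwise (None, similarity).
    """
    if not era_profiles:
        return None, 0.0
    best_label, best_sim = max(
        ((label, cosine_similarity(embedding, ref))
         for label, ref in era_profiles.items()),
        key=lambda item: item[1],
    )
    if best_sim >= threshold:
        return best_label, best_sim
    return None, best_sim
```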
|
||||
|
||||
#### Step 3: Continuous improvement
|
||||
Each processed new episode refines the host embedding (confirmed host segments get folded back in).
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Commercial Break Pattern Training
|
||||
|
||||
### Purpose
|
||||
Learn the specific audio patterns that signal commercial breaks. Because not all production elements are in the archive, the detector must combine multiple signals rather than relying solely on fingerprint matching.
|
||||
|
||||
### Method: Multi-Signal Classifier
|
||||
|
||||
The classifier combines all available signals with weighted scoring. No single signal is required — the system degrades gracefully when some signals are unavailable.
|
||||
|
||||
#### Signal 1: Known + discovered element fingerprints
|
||||
- Match detected audio against the element library (Phase 1)
|
||||
- If a known break-bumper is detected, high confidence of a break boundary
|
||||
- If no match, other signals still contribute
|
||||
- **Availability:** Partial for archive episodes (incomplete element library), improves over time via discovery
|
||||
|
||||
#### Signal 2: Speaker identity (from Phase 2)
|
||||
- Host voice present = show content (high confidence)
|
||||
- Host voice absent for >30 seconds = possible break
|
||||
- Multiple unfamiliar voices in quick succession with produced audio = commercial cluster
|
||||
- **Availability:** High — host voice profile is robust from hundreds of hours
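
The "host absent for >30 seconds" rule can be illustrated with a small gap finder. This sketch assumes host turns have already been extracted as `(start_s, end_s)` pairs; the function name and input shape are hypothetical.

```python
def host_absence_gaps(host_turns, total_duration, min_gap_s=30.0):
    """Return (gap_start, gap_end) spans where the host is silent for > min_gap_s.

    host_turns: list of (start_s, end_s) for turns attributed to the host.
    """
    gaps = []
    cursor = 0.0  # end of the latest host speech seen so far
    for start, end in sorted(host_turns):
        if start - cursor > min_gap_s:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Also check the stretch between the last host turn and the end of the file.
    if total_duration - cursor > min_gap_s:
        gaps.append((cursor, total_duration))
    return gaps
```

Each returned span is only a candidate break; the other signals still have to agree before it is classified as commercial.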
|
||||
|
||||
#### Signal 3: Audio characteristics
|
||||
- Extract per-segment features: MFCC, spectral centroid, zero-crossing rate, loudness (LUFS), dynamic range
|
||||
- Commercials typically: higher loudness, more compression, different spectral profile, different room tone
|
||||
- Show content typically: consistent room tone, natural dynamic range, live mic characteristics
|
||||
- **Availability:** Always available — inherent to audio
|
||||
|
||||
#### Signal 4: HR 1/HR 2 boundary training
|
||||
Since archive episodes are split into Hour 1 and Hour 2, the END of HR 1 and START of HR 2 always contain a commercial break boundary. This gives us 194+ confirmed break points.
|
||||
|
||||
1. Take the last 5 minutes of every HR 1 file and first 5 minutes of every HR 2 file
|
||||
2. Analyze the audio feature transition at the show→commercial boundary
|
||||
3. Train a Random Forest classifier on these labeled transitions
|
||||
4. Apply the learned transition pattern to detect similar boundaries within single-file recordings
|
||||
- **Availability:** Training data from archive; model applies to new episodes
|
||||
|
||||
#### Signal 5: Structural heuristics
|
||||
- Commercial breaks are typically 2-5 minutes
|
||||
- Shows typically break every 12-20 minutes
|
||||
- Transition phrases in transcript ("We'll be right back", "Welcome back", "Stay tuned")
|
||||
- Silence gaps >1 second often bookend breaks
|
||||
- **Availability:** Always available
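
The duration and spacing rules above could be turned into a score along these lines. The 2-5 minute and 12-20 minute windows come from the list; the size of each adjustment (0.25) and the "too short / too soon" cut-offs are assumptions chosen only to make the sketch concrete.

```python
def structural_score(break_duration_s, since_last_break_s):
    """Return 0.0-1.0; higher means the gap looks more like a commercial break."""
    score = 0.5  # neutral starting point

    # Typical breaks run 2-5 minutes; very short gaps are unlikely to be breaks.
    if 120 <= break_duration_s <= 300:
        score += 0.25
    elif break_duration_s < 30:
        score -= 0.25

    # Shows typically break every 12-20 minutes.
    if 12 * 60 <= since_last_break_s <= 20 * 60:
        score += 0.25
    elif since_last_break_s < 5 * 60:
        score -= 0.25

    return min(1.0, max(0.0, score))
```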
|
||||
|
||||
#### Combined scoring
|
||||
Each signal produces a confidence value (0.0-1.0). A weighted sum of these values determines the classification:
|
||||
|
||||
```
score = (w1 * fingerprint_match) +
        (w2 * speaker_absence) +
        (w3 * audio_characteristics) +
        (w4 * break_pattern_match) +
        (w5 * structural_heuristic)

if score > threshold: classify as commercial
```
|
||||
|
||||
Default weights (tunable after validation):
|
||||
- Fingerprint match: 0.30 (strongest when available, but often unavailable)
|
||||
- Speaker identity: 0.25 (very reliable)
|
||||
- Audio characteristics: 0.20 (always available)
|
||||
- Break pattern: 0.15 (learned from archive)
|
||||
- Structural: 0.10 (least reliable alone, but useful confirmation)
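
One way to get the graceful degradation described earlier is to renormalise over whichever signals are available. The sketch below does that; it is an illustrative alternative to substituting a neutral 0.5 for a missing signal, and the signal key names are hypothetical.

```python
# Default weights from the list above.
DEFAULT_WEIGHTS = {
    "fingerprint": 0.30,
    "speaker": 0.25,
    "audio_chars": 0.20,
    "break_pattern": 0.15,
    "structural": 0.10,
}


def combined_score(signals, weights=DEFAULT_WEIGHTS):
    """Weighted average over the signals that are present.

    signals maps a signal name to a confidence in [0, 1], or None when
    that signal is unavailable for this segment.
    """
    available = {k: v for k, v in signals.items()
                 if v is not None and k in weights}
    total_weight = sum(weights[k] for k in available)
    if total_weight == 0:
        return 0.5  # nothing to go on: stay neutral
    return sum(weights[k] * v for k, v in available.items()) / total_weight
```

With renormalisation, a segment scored only by speaker identity and audio characteristics is judged on those two alone instead of being pulled towards 0.5 by the missing signals.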
|
||||
|
||||
#### Self-calibration
|
||||
After processing a batch of archive episodes:
|
||||
1. Compare detected breaks against HR1/HR2 boundaries (known ground truth)
|
||||
2. Auto-tune weights to maximize accuracy on held-out episodes
|
||||
3. Report accuracy metrics
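
For the comparison and reporting steps, boundary detection is commonly scored with precision and recall under a time tolerance. A minimal sketch, assuming boundaries are given in seconds; the 5-second tolerance and the function name are assumptions.

```python
def boundary_metrics(detected, truth, tolerance_s=5.0):
    """Score detected break boundaries against ground-truth boundaries.

    Each ground-truth boundary may be matched by at most one detection
    that lies within tolerance_s seconds of it.
    """
    unmatched = sorted(detected)
    hits = 0
    for t in sorted(truth):
        match = next((d for d in unmatched if abs(d - t) <= tolerance_s), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1

    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Weight tuning can then be framed as searching for the weights that maximise `f1` on held-out episodes.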
|
||||
|
||||
### Expected Accuracy
|
||||
- With all signals available (including fingerprint match): >95%
|
||||
- Without fingerprint matches (new station, new elements): >85%
|
||||
- Improves over time as element discovery adds to the fingerprint library
|
||||
|
||||
---
|
||||
|
||||
## Phase 4: Repeat Speaker Detection
|
||||
|
||||
### Purpose
|
||||
Identify co-hosts, regular callers, and guests across the archive.
|
||||
|
||||
### Method
|
||||
1. Diarize a representative sample (20-30 episodes across all years)
|
||||
2. For each episode, extract embeddings for all non-host speakers
|
||||
3. Cluster all non-host embeddings across all episodes
|
||||
4. Clusters that appear in multiple episodes = repeat speakers
|
||||
5. Present clusters to the host for naming: "This voice appears in 47 episodes — who is this?"
|
||||
6. Save named speaker profiles
|
||||
|
||||
### Known Speakers to Look For
|
||||
- Co-hosts (Harry mentioned in early episodes)
|
||||
- Regular callers
|
||||
- Recurring guests
|
||||
|
||||
---
|
||||
|
||||
## Phase 5: Batch Processing Pipeline
|
||||
|
||||
### Purpose
|
||||
Process the full archive to build the training dataset and generate transcripts.
|
||||
|
||||
### Approach: Incremental, not all-at-once
|
||||
|
||||
**Batch 1: Training set (10 episodes)**
|
||||
- Select 10 episodes spanning different years
|
||||
- Full transcription + diarization
|
||||
- Manual review to validate accuracy
|
||||
- Use results to tune parameters
|
||||
|
||||
**Batch 2: Element fingerprinting**
|
||||
- Download and fingerprint all show elements
|
||||
- Test detection against Batch 1 episodes
|
||||
|
||||
**Batch 3: Commercial detection training**
|
||||
- Process 50 HR1/HR2 pairs
|
||||
- Train break detection classifier
|
||||
- Validate against held-out episodes
|
||||
|
||||
**Batch 4: Full archive (optional, on demand)**
|
||||
- Process remaining episodes as background task
|
||||
- Each episode: ~5-10 minutes to transcribe on RTX 5070 Ti
|
||||
- Full archive: ~50-100 hours of compute time
|
||||
- Run overnight in batches
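
The compute estimate can be checked with a little arithmetic. This assumes each of the 579 archive files is treated as one "episode" for the per-episode transcription time; the function name is hypothetical.

```python
def archive_hours(file_count, minutes_per_file):
    """Total transcription time in hours for the whole archive."""
    return file_count * minutes_per_file / 60


# 579 files at 5 and 10 minutes each
low = archive_hours(579, 5)
high = archive_hours(579, 10)
print(f"Estimated compute: {low:.1f} to {high:.1f} hours")
```

That lands at roughly 48 to 97 hours, which agrees with the ~50-100 hour figure above.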
|
||||
|
||||
### Storage Requirements
|
||||
- Transcripts (JSON): ~500KB per episode × 194 = ~100MB
|
||||
- Speaker embeddings: negligible
|
||||
- Processed audio (if re-encoding): skip unless needed
|
||||
- Total new storage: < 500MB for all metadata
|
||||
|
||||
---
|
||||
|
||||
## Implementation Priority
|
||||
|
||||
1. **Set up Python environment** — venv with faster-whisper, pyannote, torch CUDA
|
||||
2. **Download show elements** — Fingerprint the known intros/outros/bumpers (seed library)
|
||||
3. **Process 3-5 archive episodes** — Validate transcription + diarization quality
|
||||
4. **Build host voice profile** — Bootstrap from initial batch
|
||||
5. **Run element discovery on initial batch** — Find unknown elements, begin clustering
|
||||
6. **Train commercial detector** — Using HR1/HR2 boundaries + all available signals
|
||||
7. **Process 20-30 more episodes** — Expand element library, refine classifier weights, discover repeat speakers
|
||||
8. **Host review session** — Name discovered elements and speaker clusters
|
||||
9. **Build the CLI tool** — Wire it all together with config file
|
||||
10. **Process a new 2026 episode end-to-end** — Full pipeline test with new station's elements
|
||||
11. **Batch process remaining archive** — Background task, overnight
|
||||
|
||||
---
|
||||
|
||||
## Disk Space Plan
|
||||
|
||||
The archive is 7.8GB on IX server. Options:
|
||||
1. **Stream from server** — Process one at a time via SSH/SCP, don't store locally
|
||||
2. **Download subset** — Training set only (~500MB for 10 episodes + elements)
|
||||
3. **Download all** — 7.8GB to local disk (easy, NVMe has plenty of space)
|
||||
4. **NFS/SSHFS mount** — Mount the IX server directory, process in place
|
||||
|
||||
Recommendation: Download the elements + 10-episode training set first. Full archive download only when ready for batch processing.
|
||||
276
projects/radio-show/post-show-workflow.md
Normal file
@@ -0,0 +1,276 @@
|
||||
# Post-Show Workflow: The Computer Guru Show
|
||||
|
||||
## Overview
|
||||
|
||||
After each live show, this workflow transforms the broadcast into multiple content pieces that extend the show's reach, deepen audience engagement, and build a searchable archive. The process starts with a debrief and produces 3 tiers of content.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Post-Show Debrief (Same Day)
|
||||
|
||||
### Input
|
||||
- Show prep file (`episodes/YYYY-MM-DD-topic/show-prep.md`)
|
||||
- Host's notes on what actually happened during the show
|
||||
|
||||
### Debrief Questionnaire
|
||||
|
||||
Create a file: `episodes/YYYY-MM-DD-topic/post-show-debrief.md`
|
||||
|
||||
```markdown
|
||||
# Post-Show Debrief
|
||||
## Episode: [title]
|
||||
## Air Date: [date]
|
||||
|
||||
### What Made It In
|
||||
- [ ] Segment 1: [topic] — Used / Modified / Cut
|
||||
- [ ] Segment 2: [topic] — Used / Modified / Cut
|
||||
- [ ] Segment 3: [topic] — Used / Modified / Cut
|
||||
- [ ] Segment 4: [topic] — Used / Modified / Cut
|
||||
- [ ] Segment 5: [topic] — Used / Modified / Cut
|
||||
- [ ] Segment 6: [topic] — Used / Modified / Cut
|
||||
|
||||
### What Changed Live
|
||||
- Segments reordered? Which ones?
|
||||
- Topics expanded beyond prep? Which ones and why?
|
||||
- Topics cut short? Why? (time, audience reaction, breaking news)
|
||||
- Unplanned tangents that worked well?
|
||||
|
||||
### Caller/Audience Interaction
|
||||
- Caller topics and questions (summarize each)
|
||||
- Live chat highlights (if applicable)
|
||||
- Audience reactions that shifted the conversation
|
||||
|
||||
### Unplanned Additions
|
||||
- Breaking news discussed
|
||||
- Personal stories / anecdotes shared
|
||||
- Technical demos or live troubleshooting
|
||||
- Guest appearances or call-ins
|
||||
|
||||
### Best Moments
|
||||
- Strongest segment (what resonated most)
|
||||
- Best one-liner or quotable moment
|
||||
- Most engaging audience interaction
|
||||
- "Wish I'd said..." moments (capture for blog expansion)
|
||||
|
||||
### Topics That Deserve More
|
||||
- What couldn't you finish due to time?
|
||||
- What generated the most audience interest?
|
||||
- What deserves a deep-dive blog post?
|
||||
- Follow-up stories to watch for next week?
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Content Generation (Within 48 Hours)
|
||||
|
||||
### Tier 1: Episode Post (Radio Show Website)
|
||||
|
||||
**Target:** `website/src/content/episodes/s[SS]e[EE]-slug.md`
|
||||
**Purpose:** Canonical episode page with summary, chapters, and links
|
||||
|
||||
**Structure:**
|
||||
```markdown
|
||||
---
|
||||
title: "S[X]E[X] – [Episode Title]"
|
||||
season: [number]
|
||||
episode: [number]
|
||||
pubDate: [air date]
|
||||
duration: "[HH:MM:SS]"
|
||||
audioUrl: "[podcast audio URL]"
|
||||
audioSize: [bytes]
|
||||
episodeType: "full"
|
||||
featured: [true for current episode]
|
||||
tags: [topic tags from show prep + debrief]
|
||||
chapters:
|
||||
  - time: "00:00"
    title: "Introduction"
  - time: "MM:SS"
    title: "[Segment 1 title]"
  [...]
|
||||
---
|
||||
|
||||
## Episode Summary
|
||||
|
||||
[2-3 paragraph summary of what the show covered — written from the
|
||||
debrief, not just the prep. Captures what ACTUALLY happened, including
|
||||
unplanned moments, caller contributions, and tangents that worked.]
|
||||
|
||||
## Topics Covered
|
||||
|
||||
### [Topic 1 Title]
|
||||
[3-5 sentence summary with the key takeaway. Link to deep-dive blog
|
||||
post if one exists.]
|
||||
|
||||
### [Topic 2 Title]
|
||||
[...]
|
||||
|
||||
## Links & Resources
|
||||
- [Relevant links mentioned on air]
|
||||
- [Source articles referenced in prep]
|
||||
|
||||
## Continue the Conversation
|
||||
- [Link to forum discussion thread]
|
||||
- [Link to related blog posts]
|
||||
```
|
||||
|
||||
**Action items:**
|
||||
1. Generate episode markdown from show-prep + debrief
|
||||
2. Add chapter timestamps (from audio if available, estimated from segment timing if not)
|
||||
3. Create matching forum discussion thread (Flarum, tag: Show Discussion)
|
||||
4. Build and deploy website
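
For action item 2, when no audio timestamps are available, chapter times can be estimated from planned segment lengths. The following is a rough sketch under stated assumptions: the function names are hypothetical, segment durations are given in seconds, and minutes are not wrapped at 60 (a chapter 75 minutes in renders as `75:00`), which may or may not match what the site expects.

```python
import json
from typing import List, Tuple


def format_timestamp(seconds: float) -> str:
    """Format elapsed seconds as MM:SS (minutes may exceed 59)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def estimate_chapters(segments: List[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Turn (title, duration_seconds) pairs into (start_time, title) pairs.

    Each chapter starts where the previous segments end.
    """
    chapters = []
    elapsed = 0.0
    for title, duration in segments:
        chapters.append((format_timestamp(elapsed), title))
        elapsed += duration
    return chapters


def render_chapters_yaml(chapters: List[Tuple[str, str]]) -> str:
    """Render the `chapters:` front matter block used in the episode template."""
    lines = ["chapters:"]
    for time, title in chapters:
        lines.append(f'  - time: "{time}"')
        # json.dumps yields a double-quoted string with quotes escaped,
        # which is also a valid YAML double-quoted scalar.
        lines.append(f"    title: {json.dumps(title, ensure_ascii=False)}")
    return "\n".join(lines)
```

Estimated times should be replaced with real ones once the audio has been processed.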
|
||||
|
||||
### Tier 2: Forum Discussion Thread (Community Forum)
|
||||
|
||||
**Target:** Flarum forum at community.azcomputerguru.com
|
||||
**Tag:** Show Discussion (ID 8)
|
||||
**Purpose:** Ongoing conversation hub for each episode
|
||||
|
||||
**Structure:**
|
||||
```
|
||||
Title: S[X]E[X] Discussion: [Episode Title] — [Air Date]
|
||||
|
||||
Body:
|
||||
This week's episode: [Episode Title]
|
||||
|
||||
[Brief 2-3 sentence hook — the most provocative or interesting
|
||||
angle from the show]
|
||||
|
||||
Topics we covered:
|
||||
- [Topic 1] — [one-line teaser]
|
||||
- [Topic 2] — [one-line teaser]
|
||||
- [Topic 3] — [one-line teaser]
|
||||
|
||||
What do you think? Drop your thoughts below.
|
||||
|
||||
- Did we miss anything on [controversial topic]?
|
||||
- What's your experience with [relatable topic]?
|
||||
- [Specific question raised by a caller that others might want to weigh in on]
|
||||
|
||||
Listen to the full episode: [link to episode page]
|
||||
Read our deep-dive on [topic]: [link to blog post]
|
||||
```
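
The thread template above could be filled in programmatically along these lines. This is an illustrative sketch only: `EpisodeInfo` and its fields are hypothetical, the two-digit season/episode padding is an assumption taken from the `S11E02` example, and the discussion-question and blog-link lines are left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

DASH = "\u2014"  # the dash separator used in the thread template


@dataclass
class EpisodeInfo:
    season: int
    episode: int
    title: str
    air_date: date
    hook: str
    topics: List[Tuple[str, str]]  # (topic, one-line teaser)
    episode_url: str


def episode_code(ep: EpisodeInfo) -> str:
    """Zero-padded code such as S11E02 (padding width is an assumption)."""
    return f"S{ep.season:02d}E{ep.episode:02d}"


def format_air_date(d: date) -> str:
    """Format like 'March 21, 2026' without a zero-padded day."""
    return f"{d:%B} {d.day}, {d.year}"


def thread_title(ep: EpisodeInfo) -> str:
    return f"{episode_code(ep)} Discussion: {ep.title} {DASH} {format_air_date(ep.air_date)}"


def thread_body(ep: EpisodeInfo) -> str:
    lines = [f"This week's episode: {ep.title}", "", ep.hook, "", "Topics we covered:"]
    lines += [f"- {topic} {DASH} {teaser}" for topic, teaser in ep.topics]
    lines += [
        "",
        "What do you think? Drop your thoughts below.",
        "",
        f"Listen to the full episode: {ep.episode_url}",
    ]
    return "\n".join(lines)
```

The hook and the questions still need a human (or model) pass; the code only guarantees a consistent title and layout.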
|
||||
|
||||
### Tier 3: Deep-Dive Blog Posts (Radio Show Website)
|
||||
|
||||
**Target:** `website/src/content/blog/[slug].md`
|
||||
**Purpose:** SEO-rich, shareable long-form content that expands on show topics
|
||||
|
||||
**Selection criteria (from debrief):**
|
||||
- Topics that generated the most audience interest
|
||||
- Topics cut short due to time
|
||||
- Topics with strong search potential (trending tech news)
|
||||
- Topics where the host has unique expertise or perspective
|
||||
|
||||
**Structure:**
|
||||
```markdown
|
||||
---
|
||||
title: "[Expanded Topic Title]"
|
||||
pubDate: [date, within 48h of show]
|
||||
description: "[SEO-friendly 150-char description]"
|
||||
author: "Mike Swanson"
|
||||
tags: [relevant tags]
|
||||
image: [optional hero image]
|
||||
---
|
||||
|
||||
[Long-form article expanding on the show segment. NOT a transcript.
|
||||
This is the version you'd write if you had unlimited airtime:]
|
||||
|
||||
- Background context the audience needs
|
||||
- The full argument with supporting evidence
|
||||
- Technical details simplified for general audience
|
||||
- What it means for regular people (the show's signature angle)
|
||||
- What to watch for next (forward-looking)
|
||||
- Host's personal take / opinion
|
||||
|
||||
## Key Takeaways
|
||||
- [Bullet point summary for skimmers]
|
||||
|
||||
## Related Episodes
|
||||
- [Links to past episodes that covered related topics]
|
||||
|
||||
*This topic was discussed on [Episode Title], which aired on [date].
|
||||
[Listen to the full episode →](link)*
|
||||
```
|
||||
|
||||
**Recommended: 1-3 blog posts per episode**, focusing on the strongest topics.
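
Two small pieces of the blog front matter lend themselves to helpers: the `[slug]` in the target path and the 150-character description. The sketch below is one possible approach, not the site's actual rule; the slug scheme (lowercase, hyphen-separated) and the word-boundary trimming are assumptions, and the function names are hypothetical.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def blog_path(title: str) -> str:
    """Target path for a deep-dive post, following the pattern in this doc."""
    return f"website/src/content/blog/{slugify(title)}.md"


def seo_description(text: str, limit: int = 150) -> str:
    """Collapse whitespace and trim to `limit` characters at a word boundary."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    cut = collapsed[: limit - 3]  # leave room for the trailing "..."
    if " " in cut:
        cut = cut[: cut.rfind(" ")]  # drop the partial last word
    return cut.rstrip(" ,;:.") + "..."
```

A hand-written description will usually read better; the helper is mainly a guard against overlong text.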
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Cross-Promotion & Engagement
|
||||
|
||||
### Immediate (Day of Show)
|
||||
- [ ] Post episode page to website
|
||||
- [ ] Create forum discussion thread
|
||||
- [ ] Cross-link episode ↔ forum thread
|
||||
|
||||
### Within 48 Hours
|
||||
- [ ] Publish deep-dive blog post(s)
|
||||
- [ ] Cross-link blog posts ↔ episode page ↔ forum
|
||||
- [ ] Update episode page with blog post links
|
||||
|
||||
### Engagement Opportunities to Build Out
|
||||
|
||||
#### Currently Missing (Identify & Prioritize)
|
||||
1. **Social media distribution** — No social accounts linked. Where does the audience hang out? Twitter/X? Facebook? Reddit? Mastodon?
|
||||
2. **Email newsletter** — Subscribe page exists but is a placeholder. Mailchimp/Buttondown/self-hosted? Weekly digest of episode + blog posts?
|
||||
3. **Podcast distribution** — Audio URL points to Blubrry (legacy). Are new episodes going to Apple Podcasts, Spotify, etc.? RSS feed exists (`feed.xml.ts`) but needs verification.
|
||||
4. **Show notes SEO** — Episode pages need proper meta descriptions, Open Graph tags, structured data (PodcastEpisode schema).
|
||||
5. **Audiogram/clips** — Short audio or video clips of the best 60-90 seconds for social sharing.
|
||||
6. **Caller follow-up** — If callers raise topics, follow up in blog posts and tag them (builds loyalty).
|
||||
7. **"This Week in Tech" roundup email** — Repurpose the show prep Quick Headlines into a weekly email blast.
|
||||
8. **Community forum engagement** — Seed discussion threads with provocative questions, not just summaries. Respond to replies.
|
||||
9. **Guest booking pipeline** — The show prep references industry topics where expert guests would add value. Track potential guests.
|
||||
10. **Analytics-driven topic selection** — Use Matomo data to see which episode pages and blog posts get the most traffic, then use that to inform future show prep.
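
For item 5 (audiogram/clips), cutting a 60-90 second excerpt can be done with ffmpeg. The sketch below only builds the command; the function name and the range check are assumptions, and `-c copy` is chosen to avoid re-encoding, which means cuts land on frame boundaries rather than exact timestamps.

```python
from typing import List


def clip_command(
    source: str, start_seconds: float, duration_seconds: float, output: str
) -> List[str]:
    """Build an ffmpeg command that copies a short excerpt without re-encoding."""
    if not 60 <= duration_seconds <= 90:
        raise ValueError("clips are expected to run 60-90 seconds")
    return [
        "ffmpeg",
        "-ss", f"{start_seconds:.3f}",    # seek to the clip start
        "-t", f"{duration_seconds:.3f}",  # limit the clip length
        "-i", source,
        "-c", "copy",                     # stream copy, no re-encode
        output,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` once the start time of the best moment is known from the debrief or the transcript.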
|
||||
|
||||
---
|
||||
|
||||
## Automation Opportunities
|
||||
|
||||
### What Claude Can Do Now
|
||||
- Generate episode post from show-prep + debrief
|
||||
- Generate forum discussion thread
|
||||
- Generate deep-dive blog posts from show prep segments
|
||||
- Post to forum via Flarum database insert
|
||||
- Build and deploy website via Astro build + rsync
|
||||
- Track analytics via Matomo
|
||||
|
||||
### What Needs Setup
|
||||
- Podcast audio hosting for new episodes (Blubrry? Podbean? Self-hosted?)
|
||||
- Social media API access (for automated posting)
|
||||
- Newsletter platform (for automated digest)
|
||||
- Audio processing pipeline (for audiograms/clips)
|
||||
|
||||
---
|
||||
|
||||
## File Structure
|
||||
|
||||
```
|
||||
episodes/
  YYYY-MM-DD-topic/
    show-prep.md              ← Pre-show (already exists)
    post-show-debrief.md      ← NEW: Post-show notes
    generated/
      episode-post.md         ← Generated episode page content
      forum-thread.md         ← Generated forum discussion
      blog-topic-1.md         ← Generated deep-dive blog post
      blog-topic-2.md         ← Generated deep-dive blog post
|
||||
```
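
The layout above could be scaffolded with a few lines of Python. This is a sketch under assumptions: `scaffold_episode` is a hypothetical helper, `generated/` is taken to sit inside the episode directory, and the stub contains only the first headings of the debrief questionnaire.

```python
from pathlib import Path

# First lines of the debrief template; the rest would be filled in from the
# questionnaire in Phase 1.
DEBRIEF_STUB = "# Post-Show Debrief\n\n## Episode: [title]\n\n## Air Date: [date]\n"


def scaffold_episode(root: Path, air_date: str, topic: str) -> Path:
    """Create episodes/<air_date>-<topic>/ with generated/ and a debrief stub.

    Existing files are left untouched, so the function is safe to re-run.
    """
    episode_dir = root / "episodes" / f"{air_date}-{topic}"
    (episode_dir / "generated").mkdir(parents=True, exist_ok=True)
    debrief = episode_dir / "post-show-debrief.md"
    if not debrief.exists():
        debrief.write_text(DEBRIEF_STUB, encoding="utf-8")
    return episode_dir
```

Calling `scaffold_episode(Path("."), "2026-03-21", "whos-really-in-control")` would create the directory for the example episode below; the topic slug here is illustrative.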
|
||||
|
||||
---
|
||||
|
||||
## Example: March 21 Episode
|
||||
|
||||
If we ran this workflow for today's "Who's Really In Control?" episode:
|
||||
|
||||
**Episode post:** S11E02 (or whatever the current season/episode numbering is)
|
||||
|
||||
**Forum thread:** "S11E02 Discussion: Who's Really In Control? — March 21, 2026"
|
||||
|
||||
**Blog post candidates (from show prep):**
|
||||
1. "The White House AI Framework: What It Actually Says and Why It Matters" — Strong SEO potential, timely, unique angle (preemption vs. state laws)
|
||||
2. "NVIDIA's Trillion-Dollar Bet: How One Company Controls the AI Revolution" — Evergreen explainer, strong search volume
|
||||
3. "Apple Gave Google the Keys to Siri — Here's Why That Should Concern You" — Provocative, shareable, high interest
|
||||
4. "1 Petabyte Stolen: Inside the TELUS Digital Breach" — Cybersecurity angle, practical advice for listeners
|
||||
5. "Right to Repair Just Became Law — What You Can (and Can't) Fix Now" — Practical, actionable, local angle
|
||||
|
||||
**Recommended picks:** #1 (timely + unique), #3 (provocative + shareable), #5 (practical + evergreen)
|
||||