claudetools/wiki/projects/radio-show.md
Mike Swanson f4fb131529 wiki: seed remaining clients and projects (batch 3)
Adds 11 client articles and 5 project articles:

Clients: kittle, khalsa, anaise, azcomputerguru.com, bg-builders,
evs, furrier, horseshoe-management, kittle-design, scileppi-law,
western-tire

Projects: discord-bot, radio-show, msp-pricing, wrightstown-smarthome,
wrightstown-solar

Updates wiki/index.md with all new entries, cross-references, and
removes seeded client:birthbiologic from compilation queue.

Critical findings surfaced:
- Kittle: WS2025 EVAL license, no backups, 3 plaintext creds in Syncro
- Western Tire: SSL cert *.westerntire.com expires 2026-05-30
- Kittle Design: active compromise (Ken inbox rule unresolved)
- Horseshoe Mgmt: plaintext creds for 5+ users in Syncro notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-24 19:59:40 -07:00

---
type: project
name: radio-show
display_name: The Computer Guru Show
last_compiled: 2026-05-24
compiled_by: DESKTOP-0O8A1RL/claude-main
sources:
- projects/radio-show/post-show-workflow.md
- projects/radio-show/audio-processor/README.md
- projects/radio-show/session-logs/2026-04-27-qa-extraction-cohost-indexing.md
- projects/radio-show/session-logs/2026-05-01-ui-redesign-recovery.md
---
# The Computer Guru Show
## Overview
"The Computer Guru Show" is Mike Swanson's radio program. The project covers two distinct workstreams:
1. **Audio Processor** — Automated pipeline that processes raw broadcast recordings (with commercials) into podcast-ready audio, transcripts, speaker-diarized segments, and a searchable SQLite archive.
2. **Post-Show Content Workflow** — Process for turning each episode into an episode page (website), forum discussion thread (Flarum), and 1-3 deep-dive blog posts within 48 hours of airing.
**Status:** Active development. Audio processor pipeline functional with 572 episodes indexed locally on BEAST. FastAPI browse/search UI redesigned (2026-05-01). Jupiter deployment has a known audio-file gap (open). Post-show workflow documented but not yet fully automated.
Archive spans 2010-2018 (no 2013 season), 579 MP3s, ~30-40 GB.
---
## Tech Stack
| Layer | Technology |
|---|---|
| Transcription | faster-whisper (`large-v3`, CTranslate2 + CUDA), int8_float16, batched |
| Speaker diarization | pyannote.audio 3.1 (WavLM embeddings) |
| Audio processing | ffmpeg, pydub, librosa |
| Audio fingerprinting | chromaprint |
| Voice activity detection | silero-vad |
| ML / classification | scikit-learn (break pattern classifier) |
| Content analysis | Ollama — `qwen3:14b` (narrative/summary), local LLM |
| Archive database | SQLite with FTS5 (segments, Q&A pairs) |
| Web server | FastAPI + uvicorn (embedded HTML templates) |
| Hardware (primary) | DESKTOP-0O8A1RL — RTX 5070 Ti Laptop GPU |
| Hardware (secondary) | GURU-BEAST-ROG — RTX 4090 (benchmark pending) |
---
## Architecture
### Audio Processor Pipeline
```
Raw MP3 (full broadcast with commercials)
|
+-- 1. Transcription: faster-whisper large-v3 (63.8x realtime on 5070 Ti)
| Output: word-level timestamps, language detection
|
+-- 2. Speaker Diarization: pyannote.audio 3.1 (209.7x realtime on 5070 Ti)
| 10s windows / 5s hop, midpoint boundary resolution at load time
| Speaker profiles: host (Mike, era-specific embeddings), co-hosts, callers
|
+-- 3. Segment Detection: Multi-signal classifier (5 signals, combined weighted score)
| Signals: fingerprint match (0.30), speaker identity (0.25),
| audio characteristics (0.20), break pattern (0.15), structural heuristics (0.10)
| Element library: SQLite fingerprints.db + learning/discovery system
|
+-- 4. Commercial Removal: ffmpeg — stitch segments, EBU R128 normalize
|
+-- 5. Segment Splitting: ffmpeg — individual MP3s per segment, ID3 tags, chapter markers
|
+-- 6. Content Analysis: Ollama qwen3:14b
Output: episode summary, per-segment summaries, key quotes, topic tags,
suggested blog post topics, auto-filled post-show debrief
```
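The weighted combination in step 3 can be sketched as follows. The five weights and the 0.70 commercial threshold are taken from this page; the signal key names, the assumption that each signal is already a score in [0, 1], and the plain weighted sum are illustrative guesses, not the audio processor's confirmed implementation.

```python
# Weights as listed in step 3 of the pipeline. Key names are hypothetical.
SIGNAL_WEIGHTS = {
    "fingerprint_match": 0.30,
    "speaker_identity": 0.25,
    "audio_characteristics": 0.20,
    "break_pattern": 0.15,
    "structural_heuristics": 0.10,
}
COMMERCIAL_THRESHOLD = 0.70  # combined confidence threshold (Key Thresholds)


def combined_score(signals: dict) -> float:
    """Weighted sum of per-signal scores; a missing signal counts as 0.0."""
    total = 0.0
    for name, weight in SIGNAL_WEIGHTS.items():
        score = signals.get(name, 0.0)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
        total += weight * score
    return total


def is_commercial(signals: dict) -> bool:
    """Assumed decision rule: flag the segment when the score reaches the threshold."""
    return combined_score(signals) >= COMMERCIAL_THRESHOLD
```

Under these assumptions a strong fingerprint match alone (0.30) cannot cross the threshold; at least three signals have to agree before a segment is treated as a commercial.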
### Key Thresholds
| Parameter | Value |
|---|---|
| Host/co-host match threshold | 0.85 cosine similarity (WavLM) |
| Tara (co-host) vs Mike separation | 0.698 cosine similarity |
| CALLER minimum coverage in transcript segment | 4.0 seconds |
| Promo score threshold | 2 (weighted signatures) |
| Min Q&A question duration | 5.0s |
| Min Q&A answer duration | 15.0s |
| Max gap between Q and A | 30.0s |
| Commercial break: min/max duration | 30s / 300s |
| Combined confidence threshold (commercial) | 0.70 |
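As an illustration of how the three Q&A thresholds in the table fit together, here is a minimal sketch. The `Span` type and `is_valid_qa_pair` function are hypothetical names, and the rule that the answer must start at or after the end of the question is an assumption; only the three numeric limits come from the table.

```python
from dataclasses import dataclass

MIN_QUESTION_SECONDS = 5.0   # min Q&A question duration
MIN_ANSWER_SECONDS = 15.0    # min Q&A answer duration
MAX_GAP_SECONDS = 30.0       # max gap between Q and A


@dataclass
class Span:
    """A speech span in seconds from the start of the episode (hypothetical type)."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def is_valid_qa_pair(question: Span, answer: Span) -> bool:
    """Apply the three thresholds; assumes the answer does not overlap the question."""
    gap = answer.start - question.end
    return (
        question.duration >= MIN_QUESTION_SECONDS
        and answer.duration >= MIN_ANSWER_SECONDS
        and 0.0 <= gap <= MAX_GAP_SECONDS
    )
```

For example, a 6-second question followed 4 seconds later by a 20-second answer passes, while the same question with a 10-second answer does not.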
### Voice Profile System
Bootstrapped from the 579-episode archive. Host (Mike) has era-specific embeddings (2010, 2014, 2018, 2026). Co-host Tara has 44 embeddings from 2 episodes. Unknown repeat voices are clustered and held for host review.
```
voice-profiles/
host-mike-swanson/ -- composite + era embeddings
guests/<name>.npy -- named guest embeddings (built over time)
callers/regular-NNN.npy -- unnamed repeat callers
unknown/cluster-NNN.npy -- unidentified voices appearing multiple times
```
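Matching a new embedding against stored profiles at the 0.85 cosine threshold might look like the sketch below. It uses plain Python lists so it runs without NumPy; the real profiles are `.npy` files, and `best_profile_match` and the profile dictionary layout are hypothetical.

```python
import math

HOST_MATCH_THRESHOLD = 0.85  # host/co-host match threshold (Key Thresholds)


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embedding has zero norm")
    return dot / (norm_a * norm_b)


def best_profile_match(embedding, profiles):
    """Return (name, similarity) for the best profile at or above the threshold, else None."""
    best = None
    for name, reference in profiles.items():
        sim = cosine_similarity(embedding, reference)
        if sim >= HOST_MATCH_THRESHOLD and (best is None or sim > best[1]):
            best = (name, sim)
    return best
```

Under this rule, the 0.698 similarity measured between Tara and Mike sits below 0.85, which suggests a typical Tara embedding would not be accepted as a match for Mike's profile.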
### Archive Index (SQLite)
`archive.db` schema: `episodes`, `segments`, `segments_fts` (FTS5), `qa_pairs`, `qa_fts`. As of 2026-05-01 on BEAST: 572 episodes indexed.
FTS5 search supports: segment text search, Q&A pair search, speaker filter.
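A segment text search with an optional speaker filter could be written roughly as follows. The table name `segments_fts` comes from the schema list above, but the column names (`episode_id`, `speaker`, `text`) and the demo rows are assumptions made for this sketch, and it requires a Python build whose SQLite includes FTS5.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for archive.db with a hypothetical segments_fts layout."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE VIRTUAL TABLE segments_fts "
        "USING fts5(episode_id UNINDEXED, speaker UNINDEXED, text)"
    )
    conn.executemany(
        "INSERT INTO segments_fts (episode_id, speaker, text) VALUES (?, ?, ?)",
        [
            (1, "HOST", "Back up your router configuration before a firmware update."),
            (1, "CALLER", "My router keeps dropping the wireless connection."),
            (2, "HOST", "Solid state drives are faster than spinning disks."),
        ],
    )
    return conn


def search_segments(conn, query, speaker=None, limit=20):
    """Full-text search over segment text, optionally restricted to one speaker label."""
    sql = "SELECT episode_id, speaker, text FROM segments_fts WHERE segments_fts MATCH ?"
    params = [query]
    if speaker is not None:
        sql += " AND speaker = ?"
        params.append(speaker)
    sql += " ORDER BY rank LIMIT ?"  # rank is FTS5's built-in relevance ordering
    params.append(limit)
    return conn.execute(sql, params).fetchall()
```

Passing the query as a bound parameter keeps user input out of the SQL string; FTS5 still interprets it with its own query syntax, so quoting of special characters may be needed for free-form input.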
### FastAPI Browse/Search UI
Single-file server at `projects/radio-show/audio-processor/server/main.py`. Two embedded HTML templates:
- `INDEX_HTML` — search/browse page with CSS custom property theme (`#c39733` accent), browse-mode toggle, Q&A pill badges.
- `EPISODE_HTML` — episode detail page with sticky `<audio>` player, active-Q&A highlight that follows playhead via `timeupdate` listener, `preload="metadata"`.
Env vars: `ARCHIVE_DB`, `EPISODES_DIR`, `PORT`.
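One way to read these three variables at startup is sketched below. The variable names are from this page and port 8765 matches the Deployment table, but the default paths, the `ServerConfig` type, and `load_config` are hypothetical; `main.py` may handle configuration differently.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ServerConfig:
    archive_db: Path
    episodes_dir: Path
    port: int


def load_config(env=None) -> ServerConfig:
    """Build the config from a mapping (defaults to os.environ); defaults are assumptions."""
    env = os.environ if env is None else env
    return ServerConfig(
        archive_db=Path(env.get("ARCHIVE_DB", "archive.db")),
        episodes_dir=Path(env.get("EPISODES_DIR", "archive-data/episodes")),
        port=int(env.get("PORT", "8765")),
    )
```

Accepting an explicit mapping makes the loader easy to exercise in tests without touching the real environment.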
### Post-Show Content Workflow
Three content tiers produced within 48 hours of each episode:
| Tier | Target | Output |
|---|---|---|
| 1 | Radio show website | Episode page (`website/src/content/episodes/sXXeYY-slug.md`) with summary, chapters, links |
| 2 | Flarum forum | Discussion thread (tag: Show Discussion, ID 8) at community.azcomputerguru.com |
| 3 | Radio show website | 1-3 deep-dive blog posts (`website/src/content/blog/<slug>.md`) |
Claude handles: generating all content from show-prep + debrief, posting to Flarum via DB insert, building and deploying the Astro website.
---
## Deployment / Hosting
| Item | Value |
|---|---|
| Jupiter (primary archive host) | `172.16.3.20:8765` — uvicorn, FastAPI |
| Local dev (BEAST) | `127.0.0.1:8765` — same port as Jupiter for bookmark parity |
| Archive source (IX server) | `gurushow@172.16.3.10`, `/home/gurushow/public_html/archive/Radio/` |
| Archive local copy (BEAST) | `projects/radio-show/audio-processor/archive-data/` |
| Forum | community.azcomputerguru.com (Flarum) |
| Radio show website | Astro site, deployed via rsync |
[WARNING] Jupiter's `/data/episodes` tree is EMPTY. `GET /api/audio/{id}` returns HTTP 404 for all episode IDs on Jupiter. Audio works locally on BEAST only (full archive in `archive-data/episodes/`). Fix decision is pending — see Open Items.
---
## Configuration / Credentials
| Secret | Location |
|---|---|
| IX server SSH (gurushow) | SOPS vault — search `gurushow` or `ix server` |
| HuggingFace token (pyannote license) | `huggingface-cli login` — required for pyannote.audio |
| Forum DB access (Flarum insert) | SOPS vault — search `flarum` or `community forum` |
IX server access: paramiko with `look_for_keys=False, allow_agent=False`. Tailscale required for `172.16.3.10`.
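A minimal sketch of that connection setup follows. The two flags, host, and user come from this page; password authentication is inferred from key-based auth being disabled, and the function names, the timeout, and the `AutoAddPolicy` host-key handling are assumptions for illustration. The password itself should come from the SOPS vault and is not shown.

```python
def ix_connect_kwargs(password: str, host: str = "172.16.3.10", user: str = "gurushow") -> dict:
    """Keyword arguments for paramiko.SSHClient.connect (hypothetical helper)."""
    return {
        "hostname": host,
        "username": user,
        "password": password,
        "look_for_keys": False,  # do not try ~/.ssh keys
        "allow_agent": False,    # do not try an SSH agent
        "timeout": 15,           # assumed value, in seconds
    }


def open_ix_client(password: str):
    """Open an SSH client to the IX server; needs Tailscale up and paramiko installed."""
    import paramiko  # third-party; imported lazily so the helper above has no dependency

    client = paramiko.SSHClient()
    # Assumption for this sketch: accept unknown host keys. Tighten this for real use.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(**ix_connect_kwargs(password))
    return client
```

Disabling key and agent lookup means paramiko goes straight to the password instead of first offering keys the host will reject.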
---
## Active Work / Open Items
- [ ] **Jupiter audio fix (open, unresolved).** Three options, no pick made:
1. rsync full archive (~30-40 GB) to Jupiter at `/data/episodes/`
2. Proxy `/api/audio/{id}` from Jupiter to IX on demand (~5 lines)
3. Point `<audio src>` at IX directly via public HTTPS endpoint
- [ ] **Commit intro/QA sort tie-break fix** (`server/main.py` lines 551, 597 — `key=lambda x: x[0]`). Two-line fix, uncommitted as of end of 2026-05-01 session. Mike had not yet OK'd the commit.
- [ ] **RTX 4090 benchmark on BEAST** — establish diarization RTF baseline (expected ~250-300x vs 209.7x on laptop 5070 Ti).
- [ ] **Download full archive from IX to BEAST** for batch training (paramiko script skeleton exists in prior session log `2026-04-27-diarization-pipeline.md`).
- [ ] **Verify Tara profile generalizes across 2015/2016 episodes** — re-run `build_cohost_profile.py` with additional windows if false positives appear.
- [ ] **Post-show workflow automation** — social media, email newsletter, podcast RSS still need platform setup.
---
## Key Events / History
| Date | Event |
|---|---|
| 2010-2018 | Show original run. 579 episodes archived. No 2013 season. |
| 2026-04-27 | Q&A extraction + co-host profile session (DESKTOP-0O8A1RL). Built Tara co-host voice profile (44 embeddings, 0.698 cosine vs Mike). Fixed false-positive Q&A extraction for co-host episodes. Created `archive.db` with FTS5. Indexed 6 test episodes: 762 segments, 10 Q&A pairs. Transcription benchmarked at 63.8x realtime; diarization at 209.7x realtime. |
| 2026-04-30 | UI redesign done on BEAST (mid-session, uncommitted before reboot). |
| 2026-05-01 | Session recovery after BEAST reboot. Found 820-line uncommitted diff to `server/main.py`. Committed as `d7ce9cb` (rebased to `296d157`). Diagnosed Jupiter audio-404 (pre-existing deployment gap, not a regression). Deployed locally on BEAST and confirmed 572 episodes with working audio. Fixed the sort bug that caused an HTTP 500 on the episode page (episode 479). |
| 2026-05-01 | Co-host name corrected: previously labeled "Tom" in session log, Mike confirmed it is "Tara." All references updated. |
---
## Anti-Patterns / Warnings
[WARNING] Do NOT attempt interactive SSH to `gurushow@172.16.3.10` from scripts. Use paramiko with `look_for_keys=False, allow_agent=False`. Key-based auth is disabled on this host.
[WARNING] Tailscale must be active to reach `172.16.3.10` (IX server) or `172.16.3.20` (Jupiter).
[WARNING] The Ollama `/save` protocol has a known stale-prompt-file bug: `save_narrative_prompt.txt` at `C:/Users/guru/AppData/Local/Temp/` is reused across sessions and can cause qwen3 to produce a narrative about the WRONG session. Recovery: write narrative directly. Fix: delete prompt file before re-writing, or use a unique per-session filename.
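The unique-filename fix mentioned in the warning above could be sketched like this. The function name, the session-id scheme, and the filename pattern are hypothetical; only the idea of one prompt file per session comes from this page.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def write_session_prompt(prompt: str, session_id: Optional[str] = None,
                         directory: Optional[str] = None) -> Path:
    """Write the narrative prompt to a per-session file so an old prompt cannot be reused."""
    session_id = session_id or uuid.uuid4().hex
    base = Path(directory) if directory else Path(tempfile.gettempdir())
    path = base / f"save_narrative_prompt_{session_id}.txt"
    path.write_text(prompt, encoding="utf-8")
    return path
```

Because each call without an explicit `session_id` gets a fresh name, a leftover file from an earlier session is never picked up by mistake; old files still need occasional cleanup.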
[WARNING] `sorted()` over `(timestamp, sqlite3.Row)` tuples without `key=` will raise `TypeError` when two rows share the same timestamp. Always use `key=lambda x: x[0]`. This bit `_episode_html` at lines 551 and 597 (2026-05-01 bug).
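The failure described in the `sorted()` warning above can be reproduced with a small in-memory table. The table and column names are made up for the demonstration; the behavior shown is standard `sqlite3` and tuple comparison.

```python
import sqlite3


def fetch_rows():
    """Return three sqlite3.Row objects, two of which share the same timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute("CREATE TABLE segments (start REAL, text TEXT)")  # hypothetical schema
    conn.executemany(
        "INSERT INTO segments VALUES (?, ?)",
        [(12.0, "intro"), (12.0, "first question"), (3.5, "cold open")],
    )
    rows = conn.execute("SELECT start, text FROM segments ORDER BY rowid").fetchall()
    conn.close()
    return rows


def sort_pairs_unsafe(pairs):
    # On a timestamp tie, tuple comparison falls through to Row < Row,
    # which sqlite3.Row does not support, so this raises TypeError.
    return sorted(pairs)


def sort_pairs(pairs):
    # Compare timestamps only; sorted() is stable, so tied rows keep their input order.
    return sorted(pairs, key=lambda x: x[0])
```

With `pairs = [(row["start"], row) for row in fetch_rows()]`, `sort_pairs_unsafe(pairs)` fails on the two 12.0 entries, while `sort_pairs(pairs)` returns them in their original relative order after the 3.5 entry.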
[INFO] Co-host voice profiles must be built from the first 60 minutes of co-host episodes. Real callers do not call in during the first hour, so CALLER-labeled windows in that span can safely be treated as co-host speech.
[INFO] Tara's exact tenure as co-host is unverified. Do not assume her profile applies across all 2013-2016 episodes without spot-checking.
---
## Backlinks
- `wiki/systems/jupiter.md` [unverified — may not exist yet] — Jupiter server spec
- `wiki/systems/ix-server.md` [unverified — may not exist yet] — IX hosting server spec
- `wiki/projects/gururmm.md` — related ACG project
- `projects/radio-show/audio-processor/README.md` — full pipeline spec and configuration reference
- `projects/radio-show/post-show-workflow.md` — full post-show content workflow spec