claudetools

Author	SHA1	Message	Date
Mike Swanson	6b63c154d2	sync: auto-sync from GURU-BEAST-ROG at 2026-04-29 13:29:17 Author: Mike Swanson Machine: GURU-BEAST-ROG Timestamp: 2026-04-29 13:29:17	2026-04-29 13:29:19 -07:00
Mike Swanson	519278a664	radio: session log update — Jupiter container live at 172.16.3.20:8765 Append to 2026-04-28-session.md covering the FastAPI/SQLite container deploy: build + ship + verify, plus credentials, paths, and re-deploy procedures for both DB updates and source updates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 06:05:34 -07:00
Mike Swanson	71ada136a8	radio: FastAPI/SQLite query server, deployed to Jupiter Read-only HTTP layer over archive.db. Endpoints: /api/stats, /api/episodes, /api/episodes/{id}, /api/episodes/{id}/transcript, /api/search (FTS5 over segments + qa_pairs, bm25-ranked, snippets), /api/callers. Single-file HTML index with debounced search UI. Deployed: Jupiter (Unraid Docker), bound to 172.16.3.20:8765, LAN only. Container path: /mnt/user/appdata/radio-archive/{app,data}. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 06:00:22 -07:00
Mike Swanson	90b1ffff8b	radio: session log — full archive imported (572 ep / 482.7h / 57.7 MB DB) Execution-only follow-on to 2026-04-27. Both batch passes done (519+53, 0 errors), import_to_sqlite.py run incrementally to bring archive.db to final state. Next step: Jupiter Docker container deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 05:30:08 -07:00
Mike Swanson	82940d96d7	radio: utf-8 transcript writes + sqlite archive importer + session log - src/transcriber.py: open transcript.{json,txt,srt} with encoding="utf-8". Windows cp1252 default crashed on Whisper output containing U+2044. - import_to_sqlite.py: new. Walks archive-data/transcripts, builds archive.db (5 tables + 2 FTS5 virtual tables, sha256-keyed idempotency). 20.5 MB / 208 episodes at smoke-test time, 1.9s rebuild. - batch_process.py: tracked from prior session — full-archive batch with resumable transcribe/diarize/intros/qa pipeline. - .gitignore: archive-data/ and logs/. Session log: 2026-04-27-archive-batch-and-sqlite-import.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 19:38:02 -07:00
Mike Swanson	488bf5849e	radio: attach caller names to Q&A pairs from transcript intros QAPair gets caller_name and caller_role fields populated by a new attach_caller_names(pairs, transcript_segments) helper. For each pair, finds the active opening intro at the question_start time (8s forward tolerance, no backward limit — a caller's call can run for 10+ minutes and the intro happens once at the start) and attaches the speaker name. Validation on 9-episode test set: 19/19 Q&A pairs (100%) now have caller names attached. Examples of corrections from oracle attribution: 2018-s10e18 @ 73:36 Christopher (was misattributed to "Tara") 2015-s7e19 @ 35:45 William (was misattributed to "Tara") 2010-05-08-hr1 Jackie x3, Bruce 2012-03-10-hr1 Adam x2 2016-s8e43 John, Doug 2017-s9e30 Tom, Denise x3, Charlie speaker_oracle.py: adds speaker_at(time, intros) helper used both by the existing resolve_speakers() and the new caller-name attachment. Also adds the "let's fit/bring/put X in/on" intro pattern variant (caught Charlie at 70:21 in 2017-s9e30 that "talk to X" missed). download_full_archive.py: SSH keepalive every 30s + per-file retry-on- failure (up to 3 attempts with reconnect). Earlier run hung on a dead connection at file 109 of 589 with no recovery; restarted run is now running at ~10 MB/s vs ~2-3 MB/s before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:55:31 -07:00
Mike Swanson	1b574caba4	radio: transcript-driven speaker name resolution (oracle) New module src/speaker_oracle.py extracts speaker introductions from transcripts ("let's talk to William", "we have Clay from the Nerd Junkies", "in Tara's place, we have Clay", "thanks for the call <name>") and binds them to non-HOST diarization turns. Pure post-pass on diarization JSONs, no audio processing — corrects audio-only cosine errors using Mike's deterministic on-air announcements. Algorithm: - Extract intros: regex patterns for caller pickups, guest intros, fill-in announcements, caller closes. Case-strict (rejects mid-sentence lowercase matches), with a blacklist of common false-positive words. Deduplicates same-name intros within 5s. - Resolve speakers: for each non-HOST turn, find the LATEST opening intro at or before turn.start (with 8s forward tolerance for boundary slop). Later intros implicitly close earlier callers, so the most recent intro wins. No artificial lookback limit (callers can talk for 10+ min). - Falls back to caller_close patterns within 30s after a turn ends. Validation on 9-episode test set: 2018-s10e18: Christopher 190s correctly named (was mislabeled "Tara") 2012-06-09 : Kay 160s correctly named (was mislabeled "Tara") 2015-s7e19 : Clay 45s as fillin for Tara, William 40s as caller 2016-s8e43 : Charles 630s, Bruce 210s, John 205s — most callers named 2017-s9e30 : Denise 295s, Tom 115s, Elaine 85s, Jeff 10s Many other callers across all episodes correctly named. Remaining unnamed CO-HOST/CALLER (~5-10% of non-HOST time) are real co-host banter or callers without explicit Mike-introductions. benchmark.py: adds Phase 2.5 "Name Resolution" between diarization and Q&A extraction. Prints named-speaker breakdown per episode. Doesn't modify diarization JSONs (resolution is computed on demand). Next step: feed named turns into qa_extractor so Q&A pairs get caller name attached for searchability. Also: bootstrap recurring-speaker profiles (Tara, Tony, Rob, Randall, producers) by accumulating intro-tagged windows across the full archive once download completes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:48:16 -07:00
Mike Swanson	4c89402df8	radio: skip Clay profile build (failed) — accept 2015-s7e19 Q&A as noisy First attempt at Clay's voice profile from 2015-s7e19 produced Clay-vs-Mike cosine similarity of 0.994 — essentially a Mike clone. Root cause: 10s WavLM x-vector chunks averaged Mike's frequent interjections together with Clay's dialogue, and Mike's well-trained profile dominated the resulting embedding signal. Mike's call: skip Clay, accept the 2015-s7e19 Q&A as noisy. Clay rarely appears in other episodes, so the cost of not having his profile is bounded to this one episode plus any rare future appearances. Cleanup: - voice-profiles/clay/ removed - voice-profiles/profiles.json: Clay entry removed - Memory updated to record the decision and the failure mode Kept build_clay_profile.py in-repo as documentation of the attempt and the Mike-similarity-filter pattern. Useful starting point if a future attempt provides cleaner pure-Clay timestamps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:36:46 -07:00
Mike Swanson	c760e430c0	radio: bumper detection in diarizer + full archive download script Adds a transcript-driven bumper filter to the diarization pipeline. When a transcript segment matches qa_extractor's promo/bumper signatures, the overlapping audio windows are labeled BUMPER and the WavLM cosine match is skipped. Prevents music/promo from being matched against speaker profiles (the failure mode Mike caught in 2018-s10e18 @ 09:20-10:05). Code changes: - src/voice_profiler.py: identify_speakers() takes optional skip_ranges parameter; windows whose midpoint falls in a skip range get labeled "[bumper]" and skip cosine match - src/diarizer.py: diarize() takes optional transcript_path; pre-computes bumper time ranges via qa_extractor._is_promo_or_bumper, passes to identify_speakers; adds BUMPER speaker label - benchmark.py: passes transcript_path to diarize() Aggregate impact across 9-episode test set: Tara attribution: 4880s -> 3680s (-1200s / -25%) Q&A pairs: 17 -> 19 (+2) (bumper-flagged segments had been disrupting conversation detection in 2017-s9e30 and 2018-s10e18) CALLER total: 1320s -> 1190s (bumpers previously labeled CALLER moved) Per-episode bumpers caught: 1-8, total ~165 bumper segments across set Remaining Tara false positives are real callers acoustically similar to Tara (Christopher in 2018, Kay in 2012, William and Charles in 2015) and guest Clay in 2015-s7e19 — those need profile rebuild + Clay profile, not bumper filtering. Adds download_full_archive.py — resumable mirror-style downloader that walks IX server's /home/gurushow/public_html/archive/{year}/ and copies all MP3s to archive-data/episodes/. Run is in progress (~589 files, ~10-15GB). Used to source clean profile windows for the remaining co-hosts (Tara rebuild, Clay, Tony, Rob, Randall, producers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:17:50 -07:00
Mike Swanson	a4f527f31e	radio: per-year test set (one episode per year, 2010-2018) Added 2010, 2015, 2018 test episodes to round out the test set to one per available year: - 2010-05-08-hr1 (May 2010, earliest available; pre-Tara era) - 2015-s7e19 (Jan 2015, avoids training's s7e30) - 2018-s10e18 (only 3 non-training 2018 episodes exist) Archive has no 2019 directory — Rob's "2018/2019 appearances" are constrained to the 5 available 2018 episodes only. Per-year diarization summary (Tara presence, post-rename): 2010-05-08 30s 1.2% likely false positive (pre-Tara) 2011-03-12 140s 5.6% likely false positive (call-in only) 2012-03-10 30s 1.1% likely false positive (call-in only) 2012-06-09 340s 12.8% suspicious — Mike to confirm 2014-s6e19 680s 23.3% confirmed 2015-s7e19 280s 9.9% plausible — Mike to confirm 2016-s8e43 1890s 35.5% confirmed 2017-s9e30 610s 11.4% plausible 2018-s10e18 880s 17.1% COULD BE ROB — Mike flagged Rob for 2018/2019 appearances; cosine threshold may be hitting on Rob being acoustically similar to Tara Total Tara across 9 episodes: 1h 21m / 8h 52m audio (15.3%). Q&A counts (still suspect — every voice that isn't Mike-or-Tara is labeled CALLER, so Randall/Rob/producers inflate the bucket): 2010=4, 2011=1, 2012a=2, 2012b=0, 2014=0, 2015=1, 2016=2, 2017=4, 2018=3 Total: 17 pairs across 9 episodes 4090 perf on the expanded set: - Diarization: 31928s in 121.5s = 262.7x realtime (vs 209.7x on 5070 Ti, +25.3%) - Transcription (3 new episodes only): 10554s in 112.4s = 93.9x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:20:09 -07:00
Mike Swanson	fb683d6a05	radio: rename Tom -> Tara, expand speaker roster Mike confirmed there is no co-host named "Tom" — the voice in 2014-s6e19 and 2016-s8e43 is Tara. The 5070 Ti session fabricated the Tom identity. The voice profile itself (44 embeddings, 0.698 cosine vs Mike) is correct; only the human label was wrong. Rename swept: - voice-profiles/tom/ -> voice-profiles/tara/ (git mv preserves all .npy) - voice-profiles/profiles.json: "Tom" key -> "Tara" - build_cohost_profile.py: TOM_WINDOWS -> TARA_WINDOWS, COHOST_NAME, comments - 2026-04-27-qa-extraction-cohost-indexing.md: correction header + body sweep - 2026-04-27-4090-benchmark-and-test-set.md: closure note - .claude/memory/radio_show_no_cohost_named_tom.md: resolution + speaker roster Diarization re-run after rename so speaker_map emits "Cohost: Tara". Q&A counts unchanged (rename is label-only): 9 pairs across 6 test episodes. Tara distribution from the post-rename diarization (per-episode % of audio): 2011-03-12-hr1 140s 5.6% likely false positive (call-in only) 2012-03-10-hr1 30s 1.1% likely false positive (call-in only) 2012-06-09-hr1 340s 12.8% suspicious — pending Mike confirm 2014-s6e19 680s 23.3% confirmed 2016-s8e43 1890s 35.5% confirmed 2017-s9e30 610s 11.4% plausible — pending Mike confirm Broader speaker-roster context Mike provided this session (saved to memory): the show has had multiple co-hosts (Tara, Randall, Rob) plus producers/board ops (Andrew, Shannon, Ken, others) who would sometimes go on-air. Only Tara has a profile so far. Every other speaker is currently labeled CALLER, which means small CO-HOST attributions in unexpected episodes (e.g. 2011/2012) may actually be a producer rather than a false positive — Mike to spot-check. Action item before full-archive run: build profiles for Randall, Rob, and the named producers to avoid systematic Q&A false positives in early-years and 2018/2019 episodes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:11:03 -07:00
Mike Swanson	b9a4bb8807	scc: 4090 benchmark with new code state — 338.1x diarize, 94.8x transcribe Re-ran benchmark.py on GURU-BEAST-ROG against the post-overhaul code (co-host profile, batched Whisper int8_float16, revised Q&A extractor). Results vs 5070 Ti baseline: - Diarization: 209.7x -> 338.1x (+61.2%) - Transcription: 63.8x -> 94.8x (+48.6%) - Q&A pairs: 9 vs 10 (within run-to-run noise; structural correctness matches: 2014 = 0 callers, 2016 = 2 WiFi caller pairs) Setup change: BENCH_SETUP.md now lists ffmpeg as a Step-2 prereq (winget install Gyan.FFmpeg). Was missing on this machine and the pipeline fails silently at the first diarize call without ffprobe. Code change: benchmark.py BASELINE_RTF updated 149.5 -> 209.7 to reflect the 5070 Ti's post-overhaul measurement (`e9ac607`). Data: 6 test episode transcripts and diarizations regenerated under the new code path (batched Whisper output + co-host-aware speaker_map). Correction memory: voice-profiles/tom/ directory + 5070 Ti session log fabricated a co-host named "Tom" — Mike confirms no such person exists on the show. The audio profile is real and the diarization separation is sound, but the human identity attached to it is wrong. Saved under .claude/memory/radio_show_no_cohost_named_tom.md pending Mike providing the correct name for rename. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 14:54:07 -07:00
Mike Swanson	7bb683a3ed	sync: auto-sync from GURU-BEAST-ROG at 2026-04-27 14:42:18 Author: Mike Swanson Machine: GURU-BEAST-ROG Timestamp: 2026-04-27 14:42:18	2026-04-27 14:42:25 -07:00
Mike Swanson	e9ac607500	radio show: co-host voice profile, Q&A extraction fixes, archive index - Build Tom (co-host) voice profile (44 embeddings, 0.698 similarity to Mike) - diarizer.py: add CO-HOST speaker label for cohost-role profiles - voice_profiler.py: emit "Cohost: <name>" label for cohost role - qa_extractor.py: overlap resolution at load time (midpoint boundary split), 4s CALLER-preference threshold, turn-based caller-intro lookback (2 HOST turns), _preceded_by_caller_intro() helper, _PHONE_GREETING pattern, 751-1041 + "we'll get your problem solved" promo signatures - benchmark.py: use src.transcriber.transcribe with batch_size=16 - add index_test_episodes.py and build_cohost_profile.py scripts - add .gitignore (exclude episodes, transcripts, *.db, .venv) - session log: 2026-04-27-qa-extraction-cohost-indexing.md Result: 2016-s8e43 drops from 12 false-positive Q&A pairs to 2 real caller pairs. archive.db: 6 episodes, 762 segments, 10 Q&A pairs, FTS5 search verified. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 14:41:04 -07:00
Mike Swanson	79abef9dc9	radio: diarization pipeline fixes, benchmark setup, test episode set - Fix voice_profiler threshold bug (HOST label overwrote Unknown unconditionally) - Audio preload optimization: single ffmpeg per episode, 149.5x realtime on 5070 Ti - WavLM threshold raised to 0.85 (Mike 0.90-0.99, callers 0.46-0.83) - Promo/bumper filter: weighted signature scoring, 42->27 clean Q&A pairs - Text-only Q&A fallback for episodes with no CALLER diarization labels - TRANSFORMERS_OFFLINE=1 to skip HuggingFace freshness checks - Add diarize_2018.py for targeted re-run + FTS5 rebuild - Add benchmark.py + BENCH_SETUP.md for GURU-BEAST-ROG (RTX 4090) comparison - Commit 9-episode training diarization.json outputs - Session log: 2026-04-27-diarization-pipeline.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 13:20:40 -07:00
Mike Swanson	26dbe2327f	sync: Auto-sync from GURU-BEAST-ROG at 2026-04-26 15:09:57 Synced files: - Session logs updated - Latest context and credentials - Command/directive updates Machine: GURU-BEAST-ROG Timestamp: 2026-04-26 15:09:57 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-04-26 15:10:06 -07:00
Mike Swanson	9011670fce	sync: Auto-sync from GURU-BEAST-ROG at 2026-03-25 03:45:04 Synced files: - Session logs updated - Latest context and credentials - Command/directive updates Machine: GURU-BEAST-ROG Timestamp: 2026-03-25 03:45:04 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-03-25 03:46:07 -07:00
Mike Swanson	ad88fc31f0	sync: Auto-sync from acg-guru-5070 at 2026-03-22 22:31:46 Synced files: - Session logs updated - Latest context and credentials - Command/directive updates Machine: acg-guru-5070 Timestamp: 2026-03-22 22:31:46 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-03-22 22:31:46 -07:00
azcomputerguru	a3a47f2d5e	Add batch transcription scripts and 8 episode transcripts Created Mac M4 batch transcription using mlx-whisper with Apple Silicon GPU acceleration. Transcribed 8 remaining episodes (17,555 total segments). Scripts: - batch_transcribe_mac.py: Full batch processor with mlx-whisper - test_mac_transcribe.py: Quick test script for faster-whisper Transcripts (JSON, SRT, TXT formats): - 2011-06-04-hr1: 1,503 segments - 2011-09-10-hr1: 1,378 segments - 2014-s6e05: 1,340 segments - 2015-s7e30: 1,053 segments - 2016-s8e42: 2,205 segments - 2017-s9e26: 2,366 segments - 2018-s10e17: 4,683 segments - 2018-s10e21: 2,493 segments All 9 episodes now transcribed (8 on Mac + 1 from Linux). Ready for Stages 3-6 on Linux PC. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-21 23:12:06 -07:00
Mike Swanson	122b87a1d6	Audio processor: add Mac build task for voice training GPU firmware bug (NVRM 0x00000062) on RTX 5070 Ti makes GPU transcription impossible. Handoff doc for Mac M4 to build native version and complete the 8 remaining episode transcriptions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 17:44:52 -07:00
Mike Swanson	a29d00c6b2	sync: Auto-sync from acg-guru-5070 at 2026-03-21 16:34:05 Synced files: - Session logs updated - Latest context and credentials - Command/directive updates Machine: acg-guru-5070 Timestamp: 2026-03-21 16:34:05 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-03-21 16:34:05 -07:00
Mike Swanson	6cc9043b8e	Audio processor: validated voice profiling accuracy, tuned threshold - Fine-grained speaker analysis (3s windows, 1s hop) across 42min episode - Host voice: 0.90-0.98 similarity (clear positive match) - Callers: 0.65-0.68 (correctly below threshold) - Produced audio/clips: 0.53-0.65 (correctly identified as non-host) - Co-host/other speakers: 0.56-0.62 (correctly identified) - Tuned host_match_threshold from 0.75 to 0.83 based on empirical data - Cross-referenced dips with transcript: correctly identifies callers, show intros, played audio clips, and station breaks - Batch transcription of 7 additional training episodes in progress Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 12:48:25 -07:00
Mike Swanson	826141a319	Audio processor: working voice profiler with WavLM speaker embeddings - Voice profiler using microsoft/wavlm-base-sv (512-dim x-vector embeddings) - Bootstrap from archive: 180 embeddings from 9 episodes across 2010-2018 - Host identification accuracy: 0.87-0.98 similarity for live speech, 0.60-0.64 for non-host audio (produced intros, co-host) - Dropped speechbrain dependency (requires torchaudio, CUDA version conflicts) - Patched torchaudio CUDA 12.8/13.1 version check (warning instead of error) - Profile stored in voice-profiles/mike-swanson/ with per-chunk embeddings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 12:19:13 -07:00
Mike Swanson	87f5a9306a	Audio processor: fix segment detection with transcript-driven breaks - Add transcript break phrase detection (going_to_break/coming_back cues) - Create segments from transcript breaks with silence boundary snapping - Fix segment dedup in merge_adjacent (handle overlapping segments) - Add CUDA 12 library path fix (gpu.py + venv activate hook) - Auto-load existing transcript in detect command - Tested on 2011-03-05 HR1: correctly identifies commercial break at 34:38 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 11:59:54 -07:00
Mike Swanson	a1e0442d8b	Add radio show audio processor and post-show workflow - Audio processor CLI tool with 6-stage pipeline: transcribe (faster-whisper GPU), diarize (pyannote), detect segments (multi-signal classifier), remove commercials, split segments, analyze content (Ollama) - Post-show workflow doc for episode posts, forum threads, deep-dive blog posts - Training plan for using 579-episode archive for voice profiles and commercial detection - Successful test: 45min episode transcribed in 2:37 on RTX 5070 Ti - Sample transcript output from S7E30 (March 2015) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 11:51:59 -07:00
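
The message for commit 1b574caba4 describes the speaker-resolution rule in prose: for a given time, pick the latest opening intro at or before that time, allowing 8s of forward tolerance and no lookback limit. A minimal sketch of that rule follows. The `Intro` dataclass, its field names, and the `forward_tolerance` parameter are assumptions made for illustration; only the helper name `speaker_at(time, intros)` appears in the log, and the real code in src/speaker_oracle.py may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Intro:
    """Hypothetical record for one on-air introduction found in a transcript."""
    time: float  # seconds from episode start
    name: str    # speaker name taken from the intro phrase


def speaker_at(time: float,
               intros: Sequence[Intro],
               forward_tolerance: float = 8.0) -> Optional[str]:
    """Return the name from the latest intro at or before `time`.

    An intro up to `forward_tolerance` seconds after `time` still counts,
    which absorbs boundary slop between transcript and diarization times.
    There is deliberately no lookback limit: a call can run 10+ minutes
    after a single intro at its start.
    """
    latest: Optional[Intro] = None
    for intro in intros:
        if intro.time <= time + forward_tolerance:
            # A later intro implicitly closes the earlier caller.
            if latest is None or intro.time >= latest.time:
                latest = intro
    return latest.name if latest is not None else None


intros = [Intro(100.0, "William"), Intro(700.0, "Denise")]
print(speaker_at(95.0, intros))   # intro at 100s is within the 8s tolerance
print(speaker_at(650.0, intros))  # still William's call, 550s after his intro
print(speaker_at(693.0, intros))  # Denise's intro at 700s now wins
```

Scanning every intro keeps the sketch independent of input order; a sorted list with a binary search would work the same way on long episodes.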

25 Commits
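
The message for commit c760e430c0 says that windows whose midpoint falls in a skip range are labeled "[bumper]" and skip the cosine match. The sketch below shows one way that check could look. The function name `label_windows`, the tuple shapes, and the `match_speaker` callback are hypothetical stand-ins; the actual `identify_speakers()` signature in src/voice_profiler.py is not shown in the log beyond its optional `skip_ranges` parameter.

```python
from typing import Callable, List, Sequence, Tuple

TimeRange = Tuple[float, float]  # (start_seconds, end_seconds)


def label_windows(windows: Sequence[TimeRange],
                  skip_ranges: Sequence[TimeRange],
                  match_speaker: Callable[[float, float], str]) -> List[str]:
    """Label each audio window, bypassing speaker matching inside skip ranges.

    `match_speaker` stands in for the embedding + cosine comparison; it is
    only called for windows whose midpoint lies outside every skip range.
    """
    labels: List[str] = []
    for start, end in windows:
        midpoint = (start + end) / 2.0
        if any(lo <= midpoint <= hi for lo, hi in skip_ranges):
            labels.append("[bumper]")  # music/promo: never compared to profiles
        else:
            labels.append(match_speaker(start, end))
    return labels


# Three 10s windows; a bumper detected from the transcript spans 8s-16s.
windows = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]
print(label_windows(windows, [(8.0, 16.0)], lambda s, e: "HOST"))
```

Testing the midpoint rather than any overlap means a window that only clips the edge of a bumper is still matched against the speaker profiles.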