claudetools/projects/radio-show/audio-processor/session-logs/2026-04-28-session.md
Mike Swanson 90b1ffff8b radio: session log — full archive imported (572 ep / 482.7h / 57.7 MB DB)
Execution-only follow-on to 2026-04-27. Both batch passes done (519+53,
0 errors), import_to_sqlite.py run incrementally to bring archive.db
to final state. Next step: Jupiter Docker container deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 05:30:08 -07:00

Session Log — 2026-04-28

Project: The Computer Guru Show — Archive Mining System
Goal: Complete the in-flight batch + sqlite import; bring DB to final state
Machine: GURU-BEAST-ROG (RTX 4090, 24GB)
User: Mike Swanson (mike)

Continuation of 2026-04-27-archive-batch-and-sqlite-import.md. Short execution-only session — finished the runs that were started yesterday, no new architecture or code.


User

  • User: Mike Swanson (mike)
  • Machine: GURU-BEAST-ROG
  • Role: admin

Session Summary

The session completed the final processing steps for the Computer Guru Radio Show archive. batch_process.py finished its first 519-file snapshot — 449 processed, 68 cached, 0 errors, 4.6 hours wall — and was relaunched to pick up the 53 stragglers (late-2016 plus all of 2017/2018) that arrived after the original snapshot was taken. The second pass took 56.4 minutes with 0 errors. The encoding fix made yesterday in src/transcriber.py held across both passes — no charmap errors recurred.

import_to_sqlite.py was run once after the second batch pass, incrementally over all 572 complete episode directories. Result: 364 newly inserted (the second-pass output plus everything done after the smoke test), 208 skipped via sha256 match, 0 errors, 3.4 seconds. The DB grew from 20.5 MB to 57.7 MB and now covers the entire downloadable archive: 572 episodes, 482.7 audio-hours, 60,917 transcript segments, 25,918 diarization turns, 3,043 intros, 1,407 Q&A pairs.
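
The skip-by-sha256 behaviour can be pictured roughly as follows. This is a minimal sketch, not the code in import_to_sqlite.py: the function names, the choice to hash the four per-episode JSON files, and the `known` mapping of stored fingerprints are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Hypothetical: the files whose contents define an episode's fingerprint.
FINGERPRINT_FILES = ("transcript.json", "diarization.json", "intros.json", "qa.json")


def episode_fingerprint(episode_dir: Path) -> str:
    """Return a sha256 hex digest over the episode's JSON outputs."""
    digest = hashlib.sha256()
    for name in FINGERPRINT_FILES:
        path = episode_dir / name
        if path.exists():
            # Include the file name so moving content between files changes the hash.
            digest.update(name.encode("utf-8"))
            digest.update(path.read_bytes())
    return digest.hexdigest()


def plan_import(episode_dirs, known):
    """Split directories into (to_insert, to_skip).

    `known` maps str(directory) to the fingerprint stored at the last import
    (an assumed bookkeeping structure, not the real schema).
    """
    to_insert, to_skip = [], []
    for episode_dir in episode_dirs:
        if known.get(str(episode_dir)) == episode_fingerprint(episode_dir):
            to_skip.append(episode_dir)
        else:
            to_insert.append(episode_dir)
    return to_insert, to_skip
```

Under this scheme an unchanged directory costs only a hash and a lookup, which is consistent with the 208 skips adding little to the 3.4-second run.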

Both tasks (#1 download, #2 batch_process) are now complete. The next step is the deferred Jupiter Docker container deployment so the DB can be queried over Tailscale by Howard or me during show prep.


Key Decisions

  • No new architectural decisions this session — execution pass on the prior session's plan.

Problems Encountered

  • None. No errors during either batch pass or the import.
  • One pre-existing minor bug surfaced (not blocking, not fixed): batch_process.py's all_intros dict is module-level and starts empty on every run, so intro_roster.json is overwritten per run rather than aggregated across runs. The second pass's roster therefore shows only the 89 unique names from its 53 episodes, not the cumulative roster for the full archive (the first pass alone reported 303 unique names). The per-episode intros.json files remain authoritative and the DB import reads from those, so the canonical name list in the DB (intros table, 3,043 rows) is correct. Future cleanup: load and merge the existing intro_roster.json before populating, or drop the aggregate JSON and rely on SELECT DISTINCT name FROM intros against the DB.
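
The load-and-merge option could look something like the sketch below. The roster's real shape is not recorded in this log, so the `{caller name: [episode stem, ...]}` layout and the `merge_roster` helper are hypothetical. Merging by set union keeps the operation idempotent if a run is repeated.

```python
import json
from pathlib import Path


def merge_roster(roster_path: Path, run_intros: dict) -> dict:
    """Merge this run's intros into the roster on disk and write it back.

    Assumed shape (hypothetical): {caller name: [episode stem, ...]}.
    """
    merged = {}
    if roster_path.exists():
        # Start from the previous runs' roster instead of an empty dict.
        merged = json.loads(roster_path.read_text(encoding="utf-8"))
    for name, episodes in run_intros.items():
        combined = set(merged.get(name, [])) | set(episodes)
        merged[name] = sorted(combined)
    roster_path.write_text(
        json.dumps(merged, indent=2, sort_keys=True), encoding="utf-8"
    )
    return merged
```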

Final DB State

DB:    archive-data/archive.db  (57.7 MB)
Year   Episodes   Hours
2010         43    32.1
2011        200   147.8
2012         98    70.4
2014         81    64.5
2015         50    38.4
2016         54    61.7
2017         41    60.5
2018          5     7.3
Total       572   482.7

(Source archive has no 2013 directory.)

Table Rows
episodes 572
segments 60,917
turns 25,918
intros 3,043
qa_pairs 1,407

Top 10 recurring caller names (by Q&A count):

Caller Pairs
Mark 77
John 52
Steve 49
Tom 46
Richard 30
Paul 29
Robert 29
Bill 27
Jeff 27
Bob 25

Note: the "Tom" here is unrelated to the prior session's co-host identity error. These caller_name values come from the host's spoken introductions in the transcript (e.g. "Let's talk to Tom..."), not from voice profiling. They are real Tucson listeners named Tom who called in over 8 years.
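
A table like the one above can be regenerated from the DB with a grouped count. The sketch below assumes that qa_pairs carries the caller_name column directly. The real schema lives in import_to_sqlite.py and may differ, and the `top_callers` helper is hypothetical.

```python
import sqlite3


def top_callers(conn: sqlite3.Connection, limit: int = 10):
    """Return (caller_name, pair_count) rows, most frequent first."""
    query = """
        SELECT caller_name, COUNT(*) AS pairs
        FROM qa_pairs
        WHERE caller_name IS NOT NULL AND caller_name != ''
        GROUP BY caller_name
        ORDER BY pairs DESC, caller_name ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# Usage against the local DB (path from this log):
#   conn = sqlite3.connect("archive-data/archive.db")
#   for name, pairs in top_callers(conn):
#       print(name, pairs)
```

Ties are broken alphabetically, which matches the ordering of the tied rows in the table.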


Run Stats (this session)

batch_process.py — first pass (519-snapshot)

=== Done ===
  processed       : 449
  cached (skipped): 68
  in-progress     : 2
  errors          : 0
  wall time       : 276.9 min
  roster          : 303 unique names

batch_process.py — second pass (53 stragglers)

=== Done ===
  processed       : 53
  cached (skipped): 519
  in-progress     : 0
  errors          : 0
  wall time       : 56.4 min
  roster          : 89 unique names  (per-run overwrite, see Problems)

import_to_sqlite.py — incremental (no flags)

Found 572 complete episode directories
=== Done in 3.4s ===
  inserted : 364
  updated  : 0
  skipped  : 208
  errors   : 0
  db       : archive-data/archive.db  (57.7 MB)

Credentials

No new credentials this session. Reference from prior session log:

IX Server (archive source — used yesterday, idle now)

  • Vault path: infrastructure/ix-server.sops.yaml
  • Host: 172.16.3.10 (Tailscale)
  • External: ix.azcomputerguru.com / 72.194.62.5
  • SSH port: 22
  • Username: root
  • Password: Gptf*77ttb!@#!@#

Jupiter (Unraid — pending deploy target)

  • Vault path: infrastructure/jupiter-unraid-primary.sops.yaml
  • Container setup not yet started.

Infrastructure & Paths

Resource                Value
Audio processor root    c:\Users\guru\ClaudeTools\projects\radio-show\audio-processor\
Local DB                archive-data/archive.db (57.7 MB)
Per-episode outputs     archive-data/transcripts/<year>/.../<stem>/{transcript,diarization,intros,qa}.json + transcript.{txt,srt}
MP3s                    archive-data/episodes/<year>/... (572 files / ~7.5 GB)
Background logs         logs/{download,batch_process}.log
Planned Jupiter mount   /mnt/user/appdata/radio-archive/

Commands Run

# Second batch pass (after first 519-snapshot completed)
.venv/Scripts/python.exe batch_process.py >> logs/batch_process.log 2>&1

# Final incremental import to bring DB to 572 episodes
.venv/Scripts/python.exe import_to_sqlite.py

Pending / Next Up

  1. Stand up the Jupiter Docker container (the only remaining work item from the original plan):
    • Create /mnt/user/appdata/radio-archive/ on Jupiter
    • Container: FastAPI + sqlite (~50 lines), read-only mount of archive.db
    • Bind only on Tailscale interface, not the public IP
    • Deploy step: rsync archive.db from GURU-BEAST-ROG → Jupiter
  2. (Followup) Fix the intro_roster.json aggregation bug — load+merge existing roster on startup, or drop the aggregate JSON and use SELECT DISTINCT name FROM intros instead.
  3. (Future, deferred from prior session) Build voice profiles for Randall, Rob, and named producers (Andrew/Shannon/Ken) so non-Mike-non-Tara voices stop being labeled CALLER and inflating Q&A false positives in early-years and 2018/2019 episodes. The current 1,407 Q&A pairs include some unknown number of these false positives.
  4. (Future) Speaker-name resolution view — once query patterns emerge, decide whether to materialize a SQL view that joins turns (role labels) ↔ intros (real names) by time-window.
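
For item 1, the read-only mount can be backed up at the application level by opening the database read-only as well. The sketch below shows only that piece, using the standard library; the FastAPI layer is left out, and the `open_read_only` helper is a hypothetical name.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file read-only via a URI filename."""
    # mode=ro makes SQLite refuse writes and refuse to create a missing file.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


# Usage inside the container (mount path from the plan above):
#   conn = open_read_only(Path("/mnt/user/appdata/radio-archive/archive.db"))
```

With this in place, any accidental write from a query handler raises sqlite3.OperationalError instead of modifying the deployed copy.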

Reference Information

  • Re-run the importer any time new episodes are processed: .venv/Scripts/python.exe import_to_sqlite.py. Idempotent. Skip-by-sha256 means only changed/new episodes do real work. This session's run covered all 572 directories (364 inserts, 208 skips) in 3.4 seconds, roughly 6 ms per episode on average.
  • Full rebuild: import_to_sqlite.py --rebuild (drops and recreates the DB; ~3 seconds for the current 572-episode dataset).
  • Schema and triggers are in import_to_sqlite.py itself — SCHEMA and TRIGGERS constants at top of file.
  • FTS query examples (verified working):
    • SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12) FROM segments_fts JOIN segments s ON s.id=segments_fts.rowid JOIN episodes e ON e.id=s.episode_id WHERE segments_fts MATCH 'wireless'
    • Same shape for qa_fts against qa_pairs.
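
The FTS query above can be wrapped for use from Python with the standard sqlite3 module. This is a sketch under the assumption that the interpreter's SQLite build includes FTS5. The `search_segments` helper and the added LIMIT are illustrative and not part of the existing code.

```python
import sqlite3

# Same query shape as in the bullet above, with the search term parameterised.
FTS_QUERY = """
    SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12)
    FROM segments_fts
    JOIN segments s ON s.id = segments_fts.rowid
    JOIN episodes e ON e.id = s.episode_id
    WHERE segments_fts MATCH ?
    LIMIT ?
"""


def search_segments(conn: sqlite3.Connection, term: str, limit: int = 20):
    """Return (episode title, highlighted snippet) rows for an FTS5 match."""
    return conn.execute(FTS_QUERY, (term, limit)).fetchall()


# Usage against the local DB (path from this log):
#   conn = sqlite3.connect("archive-data/archive.db")
#   for title, hit in search_segments(conn, "wireless"):
#       print(title, "|", hit)
```

The term is passed through as FTS5 query syntax, so input containing quotes or operators may need escaping before it reaches MATCH.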