Mike Swanson 90b1ffff8b radio: session log — full archive imported (572 ep / 482.7h / 57.7 MB DB)

Execution-only follow-on to 2026-04-27. Both batch passes done (519+53,
0 errors), import_to_sqlite.py run incrementally to bring archive.db
to final state. Next step: Jupiter Docker container deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 05:30:08 -07:00

Session Log — 2026-04-28

Project: The Computer Guru Show - Archive Mining System
Goal: Complete the in-flight batch + sqlite import; bring DB to final state
Machine: GURU-BEAST-ROG (RTX 4090, 24GB)
User: Mike Swanson (mike)

Continuation of 2026-04-27-archive-batch-and-sqlite-import.md. Short execution-only session — finished the runs that were started yesterday, no new architecture or code.

User

User: Mike Swanson (mike)
Machine: GURU-BEAST-ROG
Role: admin

Session Summary

The session completed the final processing steps for the Computer Guru Radio Show archive. batch_process.py finished its first pass over the 519-file snapshot (449 processed, 68 cached, 2 reported in-progress, 0 errors, 276.9 minutes or about 4.6 hours of wall time) and was relaunched to pick up the 53 stragglers (late-2016 plus all of 2017/2018) that arrived after the original snapshot was taken. The second pass took 56.4 minutes with 0 errors. The encoding fix made yesterday in src/transcriber.py held across both passes, and no charmap errors recurred.

import_to_sqlite.py was run once after the second batch pass, incrementally over all 572 complete episode directories. It inserted 364 new episodes (the second-pass output plus everything processed after the smoke test), skipped 208 via sha256 match, and reported 0 errors, all in 3.4 seconds. The DB grew from 20.5 MB to 57.7 MB and now covers the entire downloadable archive: 572 episodes, 482.7 audio-hours, 60,917 transcript segments, 25,918 diarization turns, 3,043 intros, 1,407 Q&A pairs.
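
The skip-via-sha256 behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual import_to_sqlite.py: the two-column `episodes` table, the choice of hashing `transcript.json`, and the names `file_sha256`, `import_episode`, and `demo` are all assumptions made for the example.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def import_episode(conn: sqlite3.Connection, stem: str, transcript: Path) -> str:
    """Insert, update, or skip one episode and report which action was taken."""
    digest = file_sha256(transcript)
    row = conn.execute(
        "SELECT sha256 FROM episodes WHERE stem = ?", (stem,)
    ).fetchone()
    if row is None:
        # Never seen before: real work.
        conn.execute(
            "INSERT INTO episodes (stem, sha256) VALUES (?, ?)", (stem, digest)
        )
        return "inserted"
    if row[0] == digest:
        # Same content as last import: nothing to do.
        return "skipped"
    # Content changed since last import: refresh the stored digest.
    conn.execute("UPDATE episodes SET sha256 = ? WHERE stem = ?", (digest, stem))
    return "updated"


def demo() -> list:
    """Import the same (hypothetical) episode three times, changing it once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE episodes (stem TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    with tempfile.TemporaryDirectory() as tmp:
        transcript = Path(tmp) / "transcript.json"
        transcript.write_text('{"segments": []}', encoding="utf-8")
        actions = [
            import_episode(conn, "ep1", transcript),
            import_episode(conn, "ep1", transcript),
        ]
        transcript.write_text('{"segments": [1]}', encoding="utf-8")
        actions.append(import_episode(conn, "ep1", transcript))
    conn.close()
    return actions
```

Under these assumptions, re-running the importer over an unchanged directory only costs a hash and a lookup, which is why the incremental run stays cheap as the archive grows.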

Both tasks (#1 download, #2 batch_process) are now complete. The next step is the deferred Jupiter Docker container deployment so the DB can be queried over Tailscale by Howard or me during show prep.

Key Decisions

No new architectural decisions this session — execution pass on the prior session's plan.

Problems Encountered

No blocking problems. No errors during either batch pass or the import.
One pre-existing minor bug surfaced (not blocking, not fixed): batch_process.py's all_intros dict is module-level and reset at each run start, so intro_roster.json is overwritten per-run rather than aggregated across runs. The second pass's roster shows only 89 unique names from its 53 episodes, not the 303+ cumulative across the full archive. The per-episode intros.json files remain authoritative and the DB import reads from those, so the canonical name list in the DB (intros table, 3,043 rows) is correct. Future cleanup: load+merge the existing intro_roster.json before populating, or just drop the aggregate JSON and rely on SELECT DISTINCT name FROM intros against the DB.
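
The load+merge cleanup could look something like the sketch below. The on-disk shape of intro_roster.json is not recorded in this log, so the example assumes a JSON object mapping each name to a list of episode stems; `merge_roster` and `demo` are hypothetical names, not existing functions in batch_process.py.

```python
import json
import tempfile
from pathlib import Path


def merge_roster(roster_path: Path, run_intros: dict) -> dict:
    """Merge this run's intros into the roster on disk and write it back.

    Assumed shape (not confirmed by the log): {name: [episode_stem, ...]}.
    """
    merged = {}
    if roster_path.exists():
        # Start from the previous runs instead of an empty dict.
        merged = json.loads(roster_path.read_text(encoding="utf-8"))
    for name, stems in run_intros.items():
        known = set(merged.get(name, []))
        known.update(stems)
        merged[name] = sorted(known)
    roster_path.write_text(
        json.dumps(merged, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return merged


def demo() -> dict:
    """Two consecutive runs: the second adds to the first instead of replacing it."""
    with tempfile.TemporaryDirectory() as tmp:
        roster = Path(tmp) / "intro_roster.json"
        merge_roster(roster, {"Mark": ["ep-a"], "John": ["ep-b"]})
        return merge_roster(roster, {"Mark": ["ep-c"], "Tom": ["ep-d"]})
```

Writing with an explicit utf-8 encoding also keeps the roster clear of the charmap class of error fixed in the prior session.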

Final DB State

DB:    archive-data/archive.db  (57.7 MB)

Year	Episodes	Hours
2010	43	32.1
2011	200	147.8
2012	98	70.4
2014	81	64.5
2015	50	38.4
2016	54	61.7
2017	41	60.5
2018	5	7.3
Total	572	482.7

(Source archive has no 2013 directory.)

Table	Rows
episodes	572
segments	60,917
turns	25,918
intros	3,043
qa_pairs	1,407

Top 10 recurring caller names (by Q&A count):

Caller	Pairs
Mark	77
John	52
Steve	49
Tom	46
Richard	30
Paul	29
Robert	29
Bill	27
Jeff	27
Bob	25

Note: the "Tom" here is unrelated to the prior session's co-host identity error. These caller_name values come from the host's spoken introductions in the transcript (e.g. "Let's talk to Tom..."), not from voice profiling. They are real Tucson listeners named Tom who called in over 8 years.
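
For reference, a ranking like the one above can be produced with a query along these lines. This is a sketch under assumptions: it presumes caller_name is a column on qa_pairs, and the alphabetical tie-break, `top_callers`, and `demo` are illustrative rather than taken from the actual tooling.

```python
import sqlite3

# Assumes qa_pairs has a caller_name column; NULL/empty names are excluded.
TOP_CALLERS_SQL = """
SELECT caller_name, COUNT(*) AS pairs
FROM qa_pairs
WHERE caller_name IS NOT NULL AND caller_name <> ''
GROUP BY caller_name
ORDER BY pairs DESC, caller_name ASC
LIMIT ?
"""


def top_callers(conn: sqlite3.Connection, limit: int = 10) -> list:
    """Return (caller_name, pair_count) rows, most frequent first."""
    return conn.execute(TOP_CALLERS_SQL, (limit,)).fetchall()


def demo() -> list:
    """Run the query against a tiny in-memory stand-in for archive.db."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE qa_pairs (id INTEGER PRIMARY KEY, caller_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO qa_pairs (caller_name) VALUES (?)",
        [("Mark",), ("Mark",), ("John",), (None,)],
    )
    rows = top_callers(conn, limit=2)
    conn.close()
    return rows
```

Because the names are first names only, one row can aggregate several different people, so the counts measure name frequency rather than individual callers.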

Run Stats (this session)

batch_process.py — first pass (519-snapshot)

=== Done ===
  processed       : 449
  cached (skipped): 68
  in-progress     : 2
  errors          : 0
  wall time       : 276.9 min
  roster          : 303 unique names

batch_process.py — second pass (53 stragglers)

=== Done ===
  processed       : 53
  cached (skipped): 519
  in-progress     : 0
  errors          : 0
  wall time       : 56.4 min
  roster          : 89 unique names  (per-run overwrite, see Problems)

import_to_sqlite.py — incremental (no flags)

Found 572 complete episode directories
=== Done in 3.4s ===
  inserted : 364
  updated  : 0
  skipped  : 208
  errors   : 0
  db       : archive-data/archive.db  (57.7 MB)

Credentials

No new credentials this session. Reference from prior session log:

IX Server (archive source — used yesterday, idle now)

Vault path: infrastructure/ix-server.sops.yaml
Host: 172.16.3.10 (Tailscale)
External: ix.azcomputerguru.com / 72.194.62.5
SSH port: 22
Username: root
Password: [redacted; retrieve from the vault path above]

Jupiter (Unraid — pending deploy target)

Vault path: infrastructure/jupiter-unraid-primary.sops.yaml
Container setup not yet started.

Infrastructure & Paths

Resource	Value
Audio processor root	`c:\Users\guru\ClaudeTools\projects\radio-show\audio-processor\`
Local DB	`archive-data/archive.db` (57.7 MB)
Per-episode outputs	`archive-data/transcripts/<year>/.../<stem>/{transcript,diarization,intros,qa}.json` + `transcript.{txt,srt}`
MP3s	`archive-data/episodes/<year>/...` (572 files / ~7.5 GB)
Background logs	`logs/{download,batch_process}.log`
Planned Jupiter mount	`/mnt/user/appdata/radio-archive/`

Commands Run

# Second batch pass (after first 519-snapshot completed)
.venv/Scripts/python.exe batch_process.py >> logs/batch_process.log 2>&1

# Final incremental import to bring DB to 572 episodes
.venv/Scripts/python.exe import_to_sqlite.py

Pending / Next Up

Stand up the Jupiter Docker container (the only remaining work item from the original plan):
- Create /mnt/user/appdata/radio-archive/ on Jupiter
- Container: FastAPI + sqlite (~50 lines), read-only mount of archive.db
- Bind only on Tailscale interface, not the public IP
- Deploy step: rsync archive.db from GURU-BEAST-ROG → Jupiter
(Followup) Fix the intro_roster.json aggregation bug — load+merge existing roster on startup, or drop the aggregate JSON and use SELECT DISTINCT name FROM intros instead.
(Future, deferred from prior session) Build voice profiles for Randall, Rob, and named producers (Andrew/Shannon/Ken) so non-Mike-non-Tara voices stop being labeled CALLER and inflating Q&A false positives in early-years and 2018/2019 episodes. The current 1,407 Q&A pairs include some unknown number of these false positives.
(Future) Speaker-name resolution view — once query patterns emerge, decide whether to materialize a SQL view that joins turns (role labels) ↔ intros (real names) by time-window.
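
The read-only requirement for the container can be enforced at the connection level as well as at the mount. The sketch below covers only that part (no FastAPI, no interface binding) and uses SQLite's URI filename form with mode=ro; `open_readonly` and `demo` are hypothetical names, and the demo uses a temporary database rather than the real archive.db.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_readonly(db_path: Path) -> sqlite3.Connection:
    """Open a SQLite database so that any write attempt fails."""
    # as_uri() yields a file: URI; mode=ro asks SQLite for read-only access.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def demo() -> tuple:
    """Create a small DB, reopen it read-only, then try one read and one write."""
    with tempfile.TemporaryDirectory() as tmp:
        db = Path(tmp) / "archive.db"
        rw = sqlite3.connect(db)
        rw.execute("CREATE TABLE episodes (id INTEGER PRIMARY KEY, title TEXT)")
        rw.execute("INSERT INTO episodes (title) VALUES ('demo')")
        rw.commit()
        rw.close()

        ro = open_readonly(db)
        count = ro.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]
        try:
            ro.execute("INSERT INTO episodes (title) VALUES ('x')")
            write_blocked = False
        except sqlite3.OperationalError:
            # SQLite refuses writes on a mode=ro connection.
            write_blocked = True
        ro.close()
    return count, write_blocked
```

Pairing this with a read-only bind mount gives two independent layers, so a bug in the query service cannot modify the deployed copy of the DB.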

Reference Information

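Re-run the importer any time new episodes are processed: .venv/Scripts/python.exe import_to_sqlite.py. It is idempotent: skip-by-sha256 means only changed or new episodes do real work. The per-episode costs noted earlier (~50 ms per skip-decision, ~5 ms per insert) are rough estimates; this session's run (208 skips and 364 inserts in 3.4 s, about 6 ms per episode on average) suggests a skip costs well under 50 ms.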
Full rebuild: import_to_sqlite.py --rebuild (drops and recreates the DB; ~3 seconds for the current 572-episode dataset).
Schema and triggers are in import_to_sqlite.py itself — SCHEMA and TRIGGERS constants at top of file.
FTS query examples (verified working):
- SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12) FROM segments_fts JOIN segments s ON s.id=segments_fts.rowid JOIN episodes e ON e.id=s.episode_id WHERE segments_fts MATCH 'wireless'
- Same shape for qa_fts against qa_pairs.
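
The FTS query above can be wrapped in a small parameterized helper for show prep. This is a sketch only: the query text follows the example above, but the demo schema (a plain FTS5 table with one text column, populated by rowid) is an assumption, since the real SCHEMA and TRIGGERS live in import_to_sqlite.py and are not reproduced in this log. It also assumes the Python build's SQLite includes FTS5.

```python
import sqlite3

# Same shape as the verified query, with the search term as a parameter.
FTS_SQL = """
SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12)
FROM segments_fts
JOIN segments s ON s.id = segments_fts.rowid
JOIN episodes e ON e.id = s.episode_id
WHERE segments_fts MATCH ?
"""


def search_segments(conn: sqlite3.Connection, term: str) -> list:
    """Return (episode_title, highlighted_snippet) rows for an FTS5 match."""
    return conn.execute(FTS_SQL, (term,)).fetchall()


def demo() -> list:
    """Build a tiny stand-in schema (assumed, not the real one) and search it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE episodes (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE segments (
            id INTEGER PRIMARY KEY, episode_id INTEGER, text TEXT
        );
        CREATE VIRTUAL TABLE segments_fts USING fts5(text);
        INSERT INTO episodes (id, title) VALUES (1, 'Demo episode');
        INSERT INTO segments (id, episode_id, text)
            VALUES (1, 1, 'my wireless router keeps dropping');
        INSERT INTO segments (id, episode_id, text)
            VALUES (2, 1, 'the printer is offline');
        INSERT INTO segments_fts (rowid, text) SELECT id, text FROM segments;
        """
    )
    rows = search_segments(conn, "wireless")
    conn.close()
    return rows
```

Passing the term as a bound parameter keeps user-typed search strings out of the SQL text, which matters once the query is exposed through the planned container.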
