Execution-only follow-on to 2026-04-27. Both batch passes done (519+53, 0 errors), import_to_sqlite.py run incrementally to bring archive.db to final state. Next step: Jupiter Docker container deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Session Log — 2026-04-28
Project: The Computer Guru Show - Archive Mining System
Goal: Complete the in-flight batch + sqlite import; bring DB to final state
Machine: GURU-BEAST-ROG (RTX 4090, 24GB)
User: Mike Swanson (mike)
Continuation of 2026-04-27-archive-batch-and-sqlite-import.md. Short execution-only session — finished the runs that were started yesterday, no new architecture or code.
User
- User: Mike Swanson (mike)
- Machine: GURU-BEAST-ROG
- Role: admin
Session Summary
The session completed the final processing steps for the Computer Guru Radio Show archive. batch_process.py finished its first 519-file snapshot — 449 processed, 68 cached, 0 errors, 4.6 hours wall — and was relaunched to pick up the 53 stragglers (late-2016 plus all of 2017/2018) that arrived after the original snapshot was taken. The second pass took 56.4 minutes with 0 errors. The encoding fix made yesterday in src/transcriber.py held across both passes — no charmap errors recurred.
import_to_sqlite.py was run once after the second batch pass, incrementally over all 572 complete episode directories. 364 newly inserted (the second-pass output plus everything done after the smoke test), 208 skipped via sha256 match, 0 errors, 3.4 seconds. The DB grew from 20.5 MB → 57.7 MB and now covers the entire downloadable archive: 572 episodes, 482.7 audio-hours, 60,917 transcript segments, 25,918 diarization turns, 3,043 intros, 1,407 Q&A pairs.
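The inserted/updated/skipped split above follows from the skip-by-sha256 check. A rough sketch of that idea is below. This is not the code in import_to_sqlite.py: the two-column `episodes` table, the `content_sha256` column, and the `import_episode` helper are hypothetical names for illustration, and hashing the raw per-episode payload is an assumption.

```python
import hashlib
import sqlite3


def import_episode(conn: sqlite3.Connection, stem: str, payload: bytes) -> str:
    """Return 'inserted', 'updated', or 'skipped' for one episode (illustrative only)."""
    digest = hashlib.sha256(payload).hexdigest()
    row = conn.execute(
        "SELECT content_sha256 FROM episodes WHERE stem = ?", (stem,)
    ).fetchone()
    if row is None:
        # Never seen before: do the real insert work.
        conn.execute(
            "INSERT INTO episodes (stem, content_sha256) VALUES (?, ?)",
            (stem, digest),
        )
        return "inserted"
    if row[0] == digest:
        # Same content as last time: nothing to do.
        return "skipped"
    # Content changed since the last import: refresh the stored hash.
    conn.execute(
        "UPDATE episodes SET content_sha256 = ? WHERE stem = ?", (digest, stem)
    )
    return "updated"


# Toy in-memory DB; the real schema lives in import_to_sqlite.py.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE episodes (stem TEXT PRIMARY KEY, content_sha256 TEXT NOT NULL)"
)
first = import_episode(conn, "ep-001", b'{"segments": []}')
second = import_episode(conn, "ep-001", b'{"segments": []}')
third = import_episode(conn, "ep-001", b'{"segments": [1]}')
```

With a check of this shape, re-running the importer over all 572 directories only does real work for new or changed episodes, which is consistent with the 364 inserted / 208 skipped split.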
Both tasks (#1 download, #2 batch_process) are now complete. The next step is the deferred Jupiter Docker container deployment so the DB can be queried over Tailscale by Howard or me during show prep.
Key Decisions
- No new architectural decisions this session — execution pass on the prior session's plan.
Problems Encountered
- None. No errors during either batch pass or the import.
- One pre-existing minor bug surfaced (not blocking, not fixed):
batch_process.py's all_intros dict is module-level and reset at each run start, so intro_roster.json is overwritten per-run rather than aggregated across runs. The second pass's roster shows only 89 unique names from its 53 episodes, not the 303+ cumulative across the full archive. The per-episode intros.json files remain authoritative and the DB import reads from those, so the canonical name list in the DB (intros table, 3,043 rows) is correct. Future cleanup: load+merge existing roster.json before populating, or just drop the aggregate JSON and rely on SELECT DISTINCT name FROM intros against the DB.
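The load+merge cleanup mentioned above might look something like this. It is a sketch under assumptions: the actual shape of intro_roster.json is not shown in this log, so the example assumes a JSON object mapping each name to an occurrence count, and `merge_roster` is a hypothetical helper, not a function in batch_process.py.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def merge_roster(roster_path: Path, run_counts: dict) -> dict:
    """Add this run's name counts to whatever roster is already on disk."""
    totals = Counter()
    if roster_path.exists():
        # Assumed format: {"Name": count, ...}
        totals.update(json.loads(roster_path.read_text(encoding="utf-8")))
    totals.update(run_counts)  # Counter.update adds counts rather than replacing them
    merged = dict(sorted(totals.items()))
    roster_path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged


# Two simulated runs writing to the same roster file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "intro_roster.json"
    merge_roster(path, {"Mark": 2, "Tom": 1})
    merged = merge_roster(path, {"Tom": 3, "Paul": 1})
```

One caveat with this approach: if an episode is ever reprocessed instead of served from cache, its names would be counted twice. That is an argument for the other option, dropping the aggregate JSON and querying the DB.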
Final DB State
DB: archive-data/archive.db (57.7 MB)
| Year | Episodes | Hours |
|---|---|---|
| 2010 | 43 | 32.1 |
| 2011 | 200 | 147.8 |
| 2012 | 98 | 70.4 |
| 2014 | 81 | 64.5 |
| 2015 | 50 | 38.4 |
| 2016 | 54 | 61.7 |
| 2017 | 41 | 60.5 |
| 2018 | 5 | 7.3 |
| Total | 572 | 482.7 |
(Source archive has no 2013 directory.)
| Table | Rows |
|---|---|
| episodes | 572 |
| segments | 60,917 |
| turns | 25,918 |
| intros | 3,043 |
| qa_pairs | 1,407 |
Top 10 recurring caller names (by Q&A count):
| Caller | Pairs |
|---|---|
| Mark | 77 |
| John | 52 |
| Steve | 49 |
| Tom | 46 |
| Richard | 30 |
| Paul | 29 |
| Robert | 29 |
| Bill | 27 |
| Jeff | 27 |
| Bob | 25 |
Note: the "Tom" here is unrelated to the prior session's co-host identity error. These caller_name values come from the host's spoken introductions in the transcript (e.g. "Let's talk to Tom..."), not from voice profiling. They are real Tucson listeners named Tom who called in over 8 years.
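A ranking like the one above can be produced with a plain GROUP BY. The sketch below runs against a toy in-memory table. The `caller_name` column is named in this log, but the rest of the `qa_pairs` schema is not shown here, so the table definition and the NULL filter are assumptions.

```python
import sqlite3

TOP_CALLERS_SQL = """
SELECT caller_name, COUNT(*) AS pairs
FROM qa_pairs
WHERE caller_name IS NOT NULL
GROUP BY caller_name
ORDER BY pairs DESC, caller_name
LIMIT 10
"""

# Toy stand-in for qa_pairs; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qa_pairs (id INTEGER PRIMARY KEY, caller_name TEXT)")
conn.executemany(
    "INSERT INTO qa_pairs (caller_name) VALUES (?)",
    [("Mark",), ("Tom",), ("Mark",), (None,), ("John",), ("Mark",), ("Tom",)],
)
top = conn.execute(TOP_CALLERS_SQL).fetchall()
```

To try the same query on the real data, point `sqlite3.connect` at archive-data/archive.db instead of `:memory:`.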
Run Stats (this session)
batch_process.py — first pass (519-snapshot)
=== Done ===
processed : 449
cached (skipped): 68
in-progress : 2
errors : 0
wall time : 276.9 min
roster : 303 unique names
batch_process.py — second pass (53 stragglers)
=== Done ===
processed : 53
cached (skipped): 519
in-progress : 0
errors : 0
wall time : 56.4 min
roster : 89 unique names (per-run overwrite, see Problems)
import_to_sqlite.py — incremental (no flags)
Found 572 complete episode directories
=== Done in 3.4s ===
inserted : 364
updated : 0
skipped : 208
errors : 0
db : archive-data/archive.db (57.7 MB)
Credentials
No new credentials this session. Reference from prior session log:
IX Server (archive source — used yesterday, idle now)
- Vault path: infrastructure/ix-server.sops.yaml
- Host: 172.16.3.10 (Tailscale)
- External: ix.azcomputerguru.com / 72.194.62.5
- SSH port: 22
- Username: root
- Password:
Gptf*77ttb!@#!@#
Jupiter (Unraid — pending deploy target)
- Vault path: infrastructure/jupiter-unraid-primary.sops.yaml
- Container setup not yet started.
Infrastructure & Paths
| Resource | Value |
|---|---|
| Audio processor root | c:\Users\guru\ClaudeTools\projects\radio-show\audio-processor\ |
| Local DB | archive-data/archive.db (57.7 MB) |
| Per-episode outputs | archive-data/transcripts/<year>/.../<stem>/{transcript,diarization,intros,qa}.json + transcript.{txt,srt} |
| MP3s | archive-data/episodes/<year>/... (572 files / ~7.5 GB) |
| Background logs | logs/{download,batch_process}.log |
| Planned Jupiter mount | /mnt/user/appdata/radio-archive/ |
Commands Run
# Second batch pass (after first 519-snapshot completed)
.venv/Scripts/python.exe batch_process.py >> logs/batch_process.log 2>&1
# Final incremental import to bring DB to 572 episodes
.venv/Scripts/python.exe import_to_sqlite.py
Pending / Next Up
- Stand up the Jupiter Docker container (the only remaining work item from the original plan):
  - Create /mnt/user/appdata/radio-archive/ on Jupiter
  - Container: FastAPI + sqlite (~50 lines), read-only mount of archive.db
  - Bind only on Tailscale interface, not the public IP
  - Deploy step: rsync archive.db from GURU-BEAST-ROG → Jupiter
- (Followup) Fix the intro_roster.json aggregation bug: load+merge existing roster on startup, or drop the aggregate JSON and use SELECT DISTINCT name FROM intros instead.
- (Future, deferred from prior session) Build voice profiles for Randall, Rob, and named producers (Andrew/Shannon/Ken) so non-Mike-non-Tara voices stop being labeled CALLER and inflating Q&A false positives in early-years and 2018/2019 episodes. The current 1,407 Q&A pairs include some unknown number of these false positives.
- (Future) Speaker-name resolution view: once query patterns emerge, decide whether to materialize a SQL view that joins turns (role labels) ↔ intros (real names) by time-window.
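For the speaker-name resolution item, one possible shape for a time-window join is sketched below. Everything about the schema here is assumed for illustration: the column names (`start_sec`, `time_sec`), the 120-second look-back window, and the view name `turn_names` are all hypothetical. Only the `turns` and `intros` table names and the CALLER label come from this log.

```python
import sqlite3

# Hypothetical view: for each CALLER turn, take the most recent intro
# in the same episode that falls within 120 s before the turn starts.
VIEW_SQL = """
CREATE VIEW turn_names AS
SELECT t.id AS turn_id,
       t.episode_id,
       t.start_sec,
       (SELECT i.name
          FROM intros i
         WHERE i.episode_id = t.episode_id
           AND i.time_sec <= t.start_sec
           AND i.time_sec >= t.start_sec - 120
         ORDER BY i.time_sec DESC
         LIMIT 1) AS resolved_name
FROM turns t
WHERE t.role = 'CALLER'
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE turns  (id INTEGER PRIMARY KEY, episode_id INTEGER, start_sec REAL, role TEXT);
CREATE TABLE intros (id INTEGER PRIMARY KEY, episode_id INTEGER, time_sec REAL, name TEXT);
INSERT INTO intros (episode_id, time_sec, name) VALUES (1, 100, 'Tom'), (1, 900, 'Paul');
INSERT INTO turns (episode_id, start_sec, role) VALUES
  (1, 110, 'HOST'), (1, 130, 'CALLER'), (1, 500, 'CALLER'), (1, 905, 'CALLER');
""")
conn.execute(VIEW_SQL)
names = [
    row[0]
    for row in conn.execute("SELECT resolved_name FROM turn_names ORDER BY start_sec")
]
```

In this toy data the turn at 500 s resolves to NULL because no intro falls inside its window. Whether a fixed window is the right rule, or whether a name should persist until the next intro, is exactly the kind of question that real query patterns would need to settle.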
Reference Information
- Re-run the importer any time new episodes are processed: .venv/Scripts/python.exe import_to_sqlite.py. Idempotent. Skip-by-sha256 means only changed/new episodes do real work. ~50 ms per skip-decision, ~5 ms per insert.
- Full rebuild: import_to_sqlite.py --rebuild (drops and recreates the DB; ~3 seconds for the current 572-episode dataset).
- Schema and triggers are in import_to_sqlite.py itself: SCHEMA and TRIGGERS constants at top of file.
- FTS query examples (verified working):
  SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12) FROM segments_fts JOIN segments s ON s.id=segments_fts.rowid JOIN episodes e ON e.id=s.episode_id WHERE segments_fts MATCH 'wireless'
- Same shape for qa_fts against qa_pairs.
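The FTS query above can be run from Python with the standard sqlite3 module, provided the local SQLite build includes FTS5. The sketch below rebuilds a tiny stand-in for the schema in memory so it is self-contained. The column names (`title`, `text`) and the external-content FTS5 setup are assumptions, since the real definitions live in the SCHEMA constant and are not reproduced in this log. `search_segments` is a hypothetical helper.

```python
import sqlite3


def search_segments(conn: sqlite3.Connection, term: str) -> list:
    """Run the log's FTS query with the search term passed as a bound parameter."""
    sql = """
    SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12)
    FROM segments_fts
    JOIN segments s ON s.id = segments_fts.rowid
    JOIN episodes e ON e.id = s.episode_id
    WHERE segments_fts MATCH ?
    """
    return conn.execute(sql, (term,)).fetchall()


# Toy schema for demonstration only; the real DB populates the index via triggers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE episodes (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE segments (id INTEGER PRIMARY KEY, episode_id INTEGER, text TEXT);
CREATE VIRTUAL TABLE segments_fts USING fts5(text, content='segments', content_rowid='id');
INSERT INTO episodes VALUES (1, 'Demo episode');
INSERT INTO segments VALUES
  (1, 1, 'my wireless router keeps dropping'),
  (2, 1, 'the printer is offline');
INSERT INTO segments_fts (rowid, text) SELECT id, text FROM segments;
""")
hits = search_segments(conn, "wireless")
```

Binding the term with `?` rather than formatting it into the SQL string avoids quoting problems when a search phrase contains apostrophes.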