radio: session log — full archive imported (572 ep / 482.7h / 57.7 MB DB)

Execution-only follow-on to 2026-04-27. Both batch passes done (519+53, 0 errors), import_to_sqlite.py run incrementally to bring archive.db to final state. Next step: Jupiter Docker container deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 05:29:47 -07:00
parent 72b7996be4
commit 90b1ffff8b
1 changed files with 189 additions and 0 deletions
--- a/projects/radio-show/audio-processor/session-logs/2026-04-28-session.md
+++ b/projects/radio-show/audio-processor/session-logs/2026-04-28-session.md
@@ -0,0 +1,189 @@
+# Session Log — 2026-04-28
+
+**Project:** The Computer Guru Show — Archive Mining System
+**Goal:** Complete the in-flight batch + sqlite import; bring DB to final state
+**Machine:** GURU-BEAST-ROG (RTX 4090, 24GB)
+**User:** Mike Swanson (mike)
+
+Continuation of `2026-04-27-archive-batch-and-sqlite-import.md`. Short execution-only session — finished the runs that were started yesterday, no new architecture or code.
+
+---
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-BEAST-ROG
+- **Role:** admin
+
+---
+
+## Session Summary
+
+The session completed the final processing steps for the Computer Guru Radio Show archive. `batch_process.py` finished its first 519-file snapshot (449 processed, 68 cached, 2 in-progress, 0 errors, 276.9 minutes of wall time, about 4.6 hours) and was relaunched to pick up the 53 stragglers (late-2016 plus all of 2017/2018) that arrived after the original snapshot was taken. The second pass took 56.4 minutes with 0 errors. The encoding fix made yesterday in `src/transcriber.py` held across both passes: no charmap errors recurred.
+
+`import_to_sqlite.py` was run once after the second batch pass, incrementally over all 572 complete episode directories. Result: 364 episodes inserted (the second-pass output plus everything done after the smoke test), 208 skipped via sha256 match, 0 errors, in 3.4 seconds. The DB grew from 20.5 MB to 57.7 MB and now covers the entire downloadable archive: 572 episodes, 482.7 audio-hours, 60,917 transcript segments, 25,918 diarization turns, 3,043 intros, 1,407 Q&A pairs.
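
A minimal sketch of the skip-by-sha256 idea described above, assuming the importer hashes each episode directory's output files and compares the digest with the one recorded at the previous import. The names `digest_dir`, `plan_import`, and the `stored` mapping are hypothetical and are not taken from `import_to_sqlite.py`.

```python
import hashlib
from pathlib import Path


def digest_dir(episode_dir: Path) -> str:
    """Hash every file under an episode directory, in sorted order for stability."""
    h = hashlib.sha256()
    for path in sorted(p for p in episode_dir.rglob("*") if p.is_file()):
        # Include the relative name so a renamed file changes the digest.
        h.update(path.relative_to(episode_dir).as_posix().encode("utf-8"))
        h.update(path.read_bytes())
    return h.hexdigest()


def plan_import(episode_dirs, stored):
    """Classify directories as inserted, updated, or skipped.

    `stored` maps a directory key to the sha256 recorded at the last
    import (hypothetical stand-in for a column in the DB).
    """
    plan = {"inserted": [], "updated": [], "skipped": []}
    for d in episode_dirs:
        key, digest = str(d), digest_dir(d)
        if key not in stored:
            plan["inserted"].append(d)
        elif stored[key] != digest:
            plan["updated"].append(d)
        else:
            plan["skipped"].append(d)
        stored[key] = digest
    return plan
```

Under this scheme a second run over unchanged directories classifies everything as skipped, which matches the idempotent behavior the log reports.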
+
+Both tasks (#1 download, #2 batch_process) are now complete. The next step is the deferred Jupiter Docker container deployment so the DB can be queried over Tailscale by Howard or me during show prep.
+
+---
+
+## Key Decisions
+
+- No new architectural decisions this session — execution pass on the prior session's plan.
+
+---
+
+## Problems Encountered
+
+- **None.** No errors during either batch pass or the import.
+- One pre-existing minor bug surfaced (not blocking, not fixed): `batch_process.py`'s `all_intros` dict is module-level and starts empty on every run, so `intro_roster.json` is overwritten per run rather than aggregated across runs. The second pass's roster therefore shows only the 89 unique names from its 53 episodes, not the 303+ cumulative names across the full archive. The per-episode `intros.json` files remain authoritative and the DB import reads from those, so the canonical name list in the DB (`intros` table, 3,043 rows) is correct. Future cleanup: load and merge the existing `intro_roster.json` before populating it, or drop the aggregate JSON and rely on `SELECT DISTINCT name FROM intros` against the DB.
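
The load-and-merge cleanup could look roughly like the sketch below. It assumes the roster is a flat JSON object mapping name to count; the real layout of `intro_roster.json` is not shown in this log, and `merge_roster` is a hypothetical helper, not existing code in `batch_process.py`.

```python
import json
from pathlib import Path


def merge_roster(roster_path: Path, run_intros: dict) -> dict:
    """Merge this run's name -> count mapping into the roster already on disk."""
    merged = {}
    if roster_path.exists():
        # Start from the previous runs' totals instead of an empty dict.
        merged = json.loads(roster_path.read_text(encoding="utf-8"))
    for name, count in run_intros.items():
        merged[name] = merged.get(name, 0) + count
    roster_path.write_text(
        json.dumps(merged, indent=2, sort_keys=True), encoding="utf-8"
    )
    return merged
```

Summing counts is only safe if each run contributes newly processed episodes, as the second pass's 89-name roster suggests. If an episode were ever reprocessed, it would be double-counted, so rebuilding the roster from the per-episode `intros.json` files or from the DB may be the more robust choice.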
+
+---
+
+## Final DB State
+
+```
+DB:    archive-data/archive.db  (57.7 MB)
+```
+
+| Year | Episodes | Hours |
+|---|---|---|
+| 2010 | 43  | 32.1  |
+| 2011 | 200 | 147.8 |
+| 2012 | 98  | 70.4  |
+| 2014 | 81  | 64.5  |
+| 2015 | 50  | 38.4  |
+| 2016 | 54  | 61.7  |
+| 2017 | 41  | 60.5  |
+| 2018 | 5   | 7.3   |
+| **Total** | **572** | **482.7** |
+
+(Source archive has no 2013 directory.)
+
+| Table     | Rows   |
+|-----------|--------|
+| episodes  | 572    |
+| segments  | 60,917 |
+| turns     | 25,918 |
+| intros    | 3,043  |
+| qa_pairs  | 1,407  |
+
+**Top 10 recurring caller names (by Q&A count):**
+
+| Caller | Pairs |
+|---|---|
+| Mark | 77 |
+| John | 52 |
+| Steve | 49 |
+| Tom | 46 |
+| Richard | 30 |
+| Paul | 29 |
+| Robert | 29 |
+| Bill | 27 |
+| Jeff | 27 |
+| Bob | 25 |
+
+Note: the "Tom" here is unrelated to the prior session's co-host identity error. These `caller_name` values come from the host's spoken introductions in the transcript (e.g. *"Let's talk to Tom..."*), not from voice profiling. They are real Tucson listeners named Tom who called in over 8 years.
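
A table like the one above can be produced with a simple aggregate query. The sketch below assumes `caller_name` is a column on `qa_pairs`; the log mentions both names but does not show the schema, so treat the column placement and the tie-breaking order as assumptions.

```python
import sqlite3

TOP_CALLERS_SQL = """
SELECT caller_name, COUNT(*) AS pairs
FROM qa_pairs
WHERE caller_name IS NOT NULL AND caller_name <> ''
GROUP BY caller_name
ORDER BY pairs DESC, caller_name ASC
LIMIT ?
"""


def top_callers(conn: sqlite3.Connection, limit: int = 10):
    """Return (caller_name, pair_count) rows, most frequent first."""
    return conn.execute(TOP_CALLERS_SQL, (limit,)).fetchall()
```

Usage would be along the lines of `top_callers(sqlite3.connect("archive-data/archive.db"))`.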
+
+---
+
+## Run Stats (this session)
+
+### batch_process.py — first pass (519-snapshot)
+```
+=== Done ===
+  processed       : 449
+  cached (skipped): 68
+  in-progress     : 2
+  errors          : 0
+  wall time       : 276.9 min
+  roster          : 303 unique names
+```
+
+### batch_process.py — second pass (53 stragglers)
+```
+=== Done ===
+  processed       : 53
+  cached (skipped): 519
+  in-progress     : 0
+  errors          : 0
+  wall time       : 56.4 min
+  roster          : 89 unique names  (per-run overwrite, see Problems)
+```
+
+### import_to_sqlite.py — incremental (no flags)
+```
+Found 572 complete episode directories
+=== Done in 3.4s ===
+  inserted : 364
+  updated  : 0
+  skipped  : 208
+  errors   : 0
+  db       : archive-data/archive.db  (57.7 MB)
+```
+
+---
+
+## Credentials
+
+No new credentials this session. Reference from prior session log:
+
+### IX Server (archive source — used yesterday, idle now)
+- **Vault path:** `infrastructure/ix-server.sops.yaml`
+- **Host:** 172.16.3.10 (Tailscale)
+- **External:** ix.azcomputerguru.com / 72.194.62.5
+- **SSH port:** 22
+- **Username:** root
+- **Password:** stored in the vault entry above (`infrastructure/ix-server.sops.yaml`)
+
+### Jupiter (Unraid — pending deploy target)
+- **Vault path:** `infrastructure/jupiter-unraid-primary.sops.yaml`
+- Container setup not yet started.
+
+---
+
+## Infrastructure & Paths
+
+| Resource | Value |
+|---|---|
+| Audio processor root | `c:\Users\guru\ClaudeTools\projects\radio-show\audio-processor\` |
+| Local DB | `archive-data/archive.db` (57.7 MB) |
+| Per-episode outputs | `archive-data/transcripts/<year>/.../<stem>/{transcript,diarization,intros,qa}.json` + `transcript.{txt,srt}` |
+| MP3s | `archive-data/episodes/<year>/...` (572 files / ~7.5 GB) |
+| Background logs | `logs/{download,batch_process}.log` |
+| Planned Jupiter mount | `/mnt/user/appdata/radio-archive/` |
+
+---
+
+## Commands Run
+
+```bash
+# Second batch pass (after first 519-snapshot completed)
+.venv/Scripts/python.exe batch_process.py >> logs/batch_process.log 2>&1
+
+# Final incremental import to bring DB to 572 episodes
+.venv/Scripts/python.exe import_to_sqlite.py
+```
+
+---
+
+## Pending / Next Up
+
+1. **Stand up the Jupiter Docker container** (the only remaining work item from the original plan):
+   - Create `/mnt/user/appdata/radio-archive/` on Jupiter
+   - Container: FastAPI + sqlite (~50 lines), read-only mount of `archive.db`
+   - Bind only on Tailscale interface, not the public IP
+   - Deploy step: rsync `archive.db` from GURU-BEAST-ROG → Jupiter
+2. **(Followup) Fix the intro_roster.json aggregation bug** — load+merge existing roster on startup, or drop the aggregate JSON and use `SELECT DISTINCT name FROM intros` instead.
+3. **(Future, deferred from prior session)** Build voice profiles for Randall, Rob, and named producers (Andrew/Shannon/Ken) so that voices other than Mike's and Tara's stop being labeled CALLER and inflating Q&A false positives in early-years and 2018/2019 episodes. The current 1,407 Q&A pairs include an unknown number of these false positives.
+4. **(Future) Speaker-name resolution view** — once query patterns emerge, decide whether to materialize a SQL view that joins `turns` (role labels) ↔ `intros` (real names) by time-window.
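
For the read-only requirement in item 1, one option on the Python side is to open the database through a SQLite URI with `mode=ro`, in addition to any read-only container mount. This is a sketch using only the standard library. `open_read_only` is a hypothetical helper, and the planned FastAPI service itself is not designed here.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file read-only; writes raise OperationalError."""
    # as_uri() yields file:///abs/path, to which the SQLite mode flag is appended.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With a connection opened this way, `SELECT` statements work as usual, while an accidental `INSERT` or `UPDATE` fails instead of modifying `archive.db`.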
+
+---
+
+## Reference Information
+
+- **Re-run the importer any time** new episodes are processed: `.venv/Scripts/python.exe import_to_sqlite.py`. It is idempotent: skip-by-sha256 means only changed or new episodes do real work. This session's run took 3.4 s for 572 directories (364 inserts, 208 skips), roughly 6 ms per directory on average.
+- **Full rebuild:** `import_to_sqlite.py --rebuild` (drops and recreates the DB; ~3 seconds for the current 572-episode dataset).
+- **Schema and triggers** are in `import_to_sqlite.py` itself — `SCHEMA` and `TRIGGERS` constants at top of file.
+- **FTS query examples (verified working):**
+  - `SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12) FROM segments_fts JOIN segments s ON s.id=segments_fts.rowid JOIN episodes e ON e.id=s.episode_id WHERE segments_fts MATCH 'wireless'`
+  - Same shape for `qa_fts` against `qa_pairs`.
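
The FTS query above can be wrapped for use from Python as sketched below, with the search term bound as a parameter rather than interpolated into the SQL. `search_segments` and the `limit` argument are hypothetical additions. The table and column names come from the query in this log, and the six-argument `snippet()` form implies an FTS5 table.

```python
import sqlite3

SEARCH_SQL = """
SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12)
FROM segments_fts
JOIN segments s ON s.id = segments_fts.rowid
JOIN episodes e ON e.id = s.episode_id
WHERE segments_fts MATCH ?
LIMIT ?
"""


def search_segments(conn: sqlite3.Connection, term: str, limit: int = 20):
    """Return (episode_title, highlighted_snippet) rows for a full-text match."""
    return conn.execute(SEARCH_SQL, (term, limit)).fetchall()
```

Binding the term keeps quotes in user input from breaking the statement, although FTS query syntax errors in the term itself would still surface as `sqlite3.OperationalError`.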