- src/transcriber.py: open transcript.{json,txt,srt} with encoding="utf-8".
Windows cp1252 default crashed on Whisper output containing U+2044.
- import_to_sqlite.py: new. Walks archive-data/transcripts, builds
archive.db (5 tables + 2 FTS5 virtual tables, sha256-keyed idempotency).
20.5 MB / 208 episodes at smoke-test time, 1.9s rebuild.
- batch_process.py: tracked from prior session — full-archive batch with
resumable transcribe/diarize/intros/qa pipeline.
- .gitignore: archive-data/ and logs/.
Session log: 2026-04-27-archive-batch-and-sqlite-import.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Session Log — 2026-04-27 (continuation #2)
Project: The Computer Guru Show — Archive Mining System
Goal: Resume archive download + batch transcribe/diarize after machine restart, then design + build the SQLite archive database
Machine: GURU-BEAST-ROG (RTX 4090, 24GB)
User: Mike Swanson (mike)
Companion to:
- 2026-04-27-diarization-pipeline.md (DESKTOP-0O8A1RL — diarization fixes)
- 2026-04-27-4090-benchmark-and-test-set.md (GURU-BEAST-ROG — 4090 perf + per-year test set)
User
- User: Mike Swanson (mike)
- Machine: GURU-BEAST-ROG
- Role: admin
Session Summary
The session focused on resuming interrupted archive processing and initiating the design of a SQLite database for the Computer Guru Radio Show. The machine had restarted during the execution of download_full_archive.py and batch_process.py, leaving the download partially complete and batch processing halted mid-year. The download had completed through 2015 with partial progress in 2016, and years 2017 and 2018 were entirely missing locally. batch_process had finished 2010 (43/43) and stopped at 21 of 200 episodes in 2011. Connectivity to the IX server was confirmed via Tailscale, the SOPS vault yielded the IX root password, and both jobs were restarted in the background.
The download successfully resumed via size-match skipping, then pulled the remaining 88 files (2.65 GB) covering late-2016 plus all of 2017 and 2018 in 30 minutes wall time, 0 errors. batch_process picked up at episode 65/519 of its file-list snapshot but immediately tripped a 'charmap' codec can't encode character '⁄' error: src/transcriber.py opened transcript.txt and transcript.srt with Windows default cp1252 encoding, which cannot represent Whisper's U+2044 fraction-slash output. The fix was a one-line addition (encoding="utf-8") on three open() calls. The partial output dir for episode 65 was deleted to force a clean redo, and batch_process was restarted.
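The encoding failure described above can be reproduced in a few lines. This is a minimal sketch, not the code in src/transcriber.py: the sample text and the save_text helper are hypothetical, and only the encoding="utf-8" argument comes from the actual fix.

```python
import tempfile
from pathlib import Path

# Hypothetical sample line containing U+2044 FRACTION SLASH.
text = "Use a 1\u20442-inch adapter"


def can_encode(s: str, codec: str) -> bool:
    """Return True if every character in s is representable in codec."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def save_text(path: Path, s: str) -> None:
    """Write text with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(s)


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "transcript.txt"
    save_text(out, text)
    round_trip = out.read_text(encoding="utf-8")
```

cp1252 has no mapping for U+2044, so an open() that falls back to the Windows default raises UnicodeEncodeError partway through the write, which is how a truncated file can be left behind.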
The user then raised an architectural question about where the canonical archive database should live. Discussion converged on SQLite (over MariaDB) because the per-episode JSONs are the source of truth and the .db is rebuildable in seconds, and on Jupiter+Docker (over IX cPanel) because the use case is internal-only and Tailscale already provides access; public exposure can be added later via cloudflared. The schema (5 tables + 2 FTS5 virtual tables) was designed and import_to_sqlite.py was written and smoke-tested against 208 currently-complete episodes — 1.9 seconds for full rebuild, 20.5 MB DB, FTS queries on "wireless" and "virus" returning correct snippets.
By session end, the download was complete and batch_process was at episode 211/519 (147 episodes transcribed since the encoding-fix restart). One final batch_process re-run is needed after the current 519-snapshot finishes, to pick up the 53 newly-downloaded files that were not in the startup snapshot.
Key Decisions
- SQLite over MariaDB: per-episode JSONs are the source of truth, the .db is rebuildable in seconds. Can graduate to MariaDB later by re-importing from the same JSONs without losing anything.
- Jupiter (Unraid Docker) over IX cPanel: use case is internal-only show-prep search. Tailscale already covers access. IX is the right place for public-facing show-site content but adds shared-hosting friction the v1 doesn't need.
- FTS5 with porter unicode61 tokenizer: porter stemming for English query expansion, unicode61 for case-folding and basic punctuation handling. External-content tables with content_rowid pointing back to segments and qa_pairs so the FTS index doesn't duplicate the text.
- Skip speaker-name resolution view in v1: turns table holds role labels (HOST/CO-HOST/CALLER/BUMPER), intros and qa_pairs hold real names. A SQL view that joins them by time-window is cheap to add later and no data is lost by deferring.
- Keep BUMPER turns and promo-flagged segments raw: filter at query time. Excluding them at insert loses signal that may matter for future analysis.
- sha256 of transcript.json as idempotency key: importer skips an episode whose recorded hash matches the on-disk file. Re-run the importer after each batch_process pass; it only does work for changed files.
- Restart batch_process to fix encoding bug rather than --amend partial files: the .json was correct (ensure_ascii=True default), but .txt and .srt were potentially truncated. Cleanest path was to delete the failed episode's whole output dir and let the pipeline regenerate everything with the encoding fix.
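The sha256 idempotency decision can be sketched as below. This is an illustration under assumptions, not the code in import_to_sqlite.py: the schema is cut down to the columns the check needs, import_episode is a hypothetical helper, and the hash is taken over bytes passed in rather than over the on-disk transcript.json.

```python
import hashlib
import sqlite3

# Reduced schema: only the columns the idempotency check touches.
SCHEMA = """
CREATE TABLE episodes (
    id INTEGER PRIMARY KEY,
    rel_path TEXT NOT NULL UNIQUE,
    transcript_sha256 TEXT NOT NULL
);
CREATE TABLE segments (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episodes(id) ON DELETE CASCADE,
    seg_idx INTEGER NOT NULL,
    text TEXT NOT NULL,
    UNIQUE(episode_id, seg_idx)
);
"""


def import_episode(conn, rel_path, transcript_bytes, segments):
    """Insert, skip, or replace one episode based on its transcript hash."""
    digest = hashlib.sha256(transcript_bytes).hexdigest()
    row = conn.execute(
        "SELECT id, transcript_sha256 FROM episodes WHERE rel_path = ?",
        (rel_path,),
    ).fetchone()
    if row is not None and row[1] == digest:
        return "skipped"
    with conn:  # one transaction: commit on success, roll back on error
        if row is not None:
            # ON DELETE CASCADE removes the old segments with the episode.
            conn.execute("DELETE FROM episodes WHERE id = ?", (row[0],))
        cur = conn.execute(
            "INSERT INTO episodes (rel_path, transcript_sha256) VALUES (?, ?)",
            (rel_path, digest),
        )
        conn.executemany(
            "INSERT INTO segments (episode_id, seg_idx, text) VALUES (?, ?, ?)",
            [(cur.lastrowid, i, t) for i, t in enumerate(segments)],
        )
    return "inserted" if row is None else "updated"


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # cascades are off by default in SQLite
conn.executescript(SCHEMA)

first = import_episode(conn, "demo/ep1", b'{"v": 1}', ["a", "b"])
second = import_episode(conn, "demo/ep1", b'{"v": 1}', ["a", "b"])
third = import_episode(conn, "demo/ep1", b'{"v": 2}', ["c"])
segment_count = conn.execute("SELECT COUNT(*) FROM segments").fetchone()[0]
```

Note that the cascade only fires when PRAGMA foreign_keys is enabled on the connection. Without it, the old segments would survive the delete.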
Problems Encountered
- 'charmap' codec encoding error in transcriber.py
  - Cause: open(... "w") defaulted to Windows cp1252; Whisper output contained U+2044 (fraction slash) which cp1252 cannot encode.
  - Fix: added encoding="utf-8" to the three open() calls at src/transcriber.py:93,97,101.
  - Audited the rest of the pipeline: diarizer.py, batch_process.py, and other JSON writers use json.dump default ensure_ascii=True, which escapes unicode to ASCII before encoding — safe under cp1252 even without explicit utf-8. Only transcriber.py writes raw unicode (transcript.txt, transcript.srt).
- Failed episode left inconsistent output
  - Episode 65 (2011/10 - October/10-15-11 HR 2) had transcript.json written successfully but transcript.txt truncated mid-encode.
  - Fix: rm -rf the entire episode output dir; batch_process redoes it cleanly on next pass.
- Monitor refired the same error every poll
  - Initial monitor used grep -E "ERROR" $LOG | tail -1 each iteration, so a single historical error line emitted a notification every 60s.
  - Fix: track error count between polls; only emit when count grows. Same pattern applied to download FAILED counts.
- batch_process snapshot taken before download finished
  - all_mp3s = ... is computed once at startup. The 53 newly-downloaded MP3s (late-2016, 2017, 2018) are not visible to the currently-running batch.
  - Mitigation: after current 519-snapshot run finishes, relaunch batch_process once. Resumability via existence-check makes the re-run only process the new files.
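The monitor fix (emit only when the error count grows) can be sketched as follows. The actual monitor was a shell loop; this Python version, the GrowthMonitor class, and the sample log lines are all hypothetical and only illustrate the count-tracking idea.

```python
from typing import Iterable


class GrowthMonitor:
    """Report only newly appeared matching lines, not historical ones."""

    def __init__(self, needle: str) -> None:
        self.needle = needle
        self.last_count = 0

    def poll(self, lines: Iterable[str]) -> int:
        """Return how many matching lines appeared since the previous poll."""
        count = sum(1 for line in lines if self.needle in line)
        new = max(count - self.last_count, 0)
        self.last_count = count
        return new


# Hypothetical log contents for illustration.
log = [
    "INFO resumed at 65/519",
    "ERROR 'charmap' codec can't encode character",
]
monitor = GrowthMonitor("ERROR")
first_poll = monitor.poll(log)   # the historical error is reported once
second_poll = monitor.poll(log)  # nothing new, so nothing to emit
log.append("ERROR second failure")
third_poll = monitor.poll(log)   # only the new line counts
```

The same class would work for the download FAILED counts by constructing it with a different needle.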
Files Modified / Created
| Path | Change |
|---|---|
| projects/radio-show/audio-processor/src/transcriber.py | Added encoding="utf-8" to all three open() calls in Transcript.save() (lines 93, 97, 101) |
| projects/radio-show/audio-processor/import_to_sqlite.py | NEW. Walks archive-data/transcripts, imports JSONs into archive.db with FTS5. sha256-keyed idempotency. |
| projects/radio-show/audio-processor/batch_process.py | (already untracked from prior session — no edits this session) |
| projects/radio-show/audio-processor/archive-data/episodes/{2010..2018}/ | Filled in by download_full_archive.py — 88 new files |
| projects/radio-show/audio-processor/archive-data/transcripts/{2010,2011}/... | Per-episode output dirs — written by batch_process |
| projects/radio-show/audio-processor/archive-data/archive.db | NEW (smoke-test rebuild, 20.5 MB at 208 episodes) |
| projects/radio-show/audio-processor/logs/download.log | Background download output |
| projects/radio-show/audio-processor/logs/batch_process.log | Background batch output |
SQLite Schema (full DDL)
CREATE TABLE episodes (
    id INTEGER PRIMARY KEY,
    rel_path TEXT NOT NULL UNIQUE,
    year INTEGER NOT NULL,
    title TEXT,
    air_date TEXT,
    duration_sec REAL NOT NULL,
    language TEXT,
    language_probability REAL,
    num_speakers INTEGER,
    transcript_sha256 TEXT NOT NULL,
    processed_at TEXT NOT NULL
);
CREATE INDEX idx_episodes_year ON episodes(year);
CREATE INDEX idx_episodes_air_date ON episodes(air_date);
CREATE TABLE segments (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episodes(id) ON DELETE CASCADE,
    seg_idx INTEGER NOT NULL,
    start_sec REAL, end_sec REAL,
    text TEXT NOT NULL,
    UNIQUE(episode_id, seg_idx)
);
CREATE INDEX idx_segments_episode ON segments(episode_id, start_sec);
CREATE TABLE turns (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episodes(id) ON DELETE CASCADE,
    speaker TEXT NOT NULL, -- HOST / CO-HOST / CALLER / BUMPER
    start_sec REAL, end_sec REAL,
    confidence REAL
);
CREATE INDEX idx_turns_episode ON turns(episode_id, start_sec);
CREATE INDEX idx_turns_speaker ON turns(episode_id, speaker);
CREATE TABLE intros (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episodes(id) ON DELETE CASCADE,
    name TEXT NOT NULL,
    role_hint TEXT, -- caller / cohost / fillin
    intro_time_sec REAL,
    affiliation TEXT, fillin_for TEXT,
    source_text TEXT
);
CREATE INDEX idx_intros_episode ON intros(episode_id);
CREATE INDEX idx_intros_name ON intros(name);
CREATE TABLE qa_pairs (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episodes(id) ON DELETE CASCADE,
    question_start_sec REAL, question_end_sec REAL,
    answer_start_sec REAL, answer_end_sec REAL,
    question_text TEXT NOT NULL,
    answer_text TEXT NOT NULL,
    caller_name TEXT, caller_role TEXT,
    topic TEXT, topic_tags TEXT -- JSON array as TEXT
);
CREATE INDEX idx_qa_episode ON qa_pairs(episode_id);
CREATE INDEX idx_qa_caller ON qa_pairs(caller_name);
CREATE VIRTUAL TABLE segments_fts USING fts5(
    text, content='segments', content_rowid='id',
    tokenize='porter unicode61'
);
CREATE VIRTUAL TABLE qa_fts USING fts5(
    question_text, answer_text,
    content='qa_pairs', content_rowid='id',
    tokenize='porter unicode61'
);
-- + standard ai/ad triggers to keep FTS in sync on insert/delete
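The ai/ad triggers are not spelled out in the DDL above. A minimal sketch of what they might look like for segments_fts, following the external-content pattern from the SQLite FTS5 documentation: the trigger names segments_ai and segments_ad are assumptions, the segments table is reduced to two columns, and the sample row is made up. It requires a Python build whose SQLite includes FTS5.

```python
import sqlite3

DDL = """
CREATE TABLE segments (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE VIRTUAL TABLE segments_fts USING fts5(
    text, content='segments', content_rowid='id',
    tokenize='porter unicode61'
);
-- After insert: add the new row to the index.
CREATE TRIGGER segments_ai AFTER INSERT ON segments BEGIN
    INSERT INTO segments_fts(rowid, text) VALUES (new.id, new.text);
END;
-- After delete: issue the special 'delete' command with the old values.
CREATE TRIGGER segments_ad AFTER DELETE ON segments BEGIN
    INSERT INTO segments_fts(segments_fts, rowid, text)
    VALUES ('delete', old.id, old.text);
END;
"""


def count_hits(conn, query):
    """Count FTS rows matching the query string."""
    return conn.execute(
        "SELECT COUNT(*) FROM segments_fts WHERE segments_fts MATCH ?",
        (query,),
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

cur = conn.execute(
    "INSERT INTO segments (text) VALUES (?)",
    ("The wireless routers kept dropping",),
)
hits_after_insert = count_hits(conn, "router")  # porter stems routers -> router

conn.execute("DELETE FROM segments WHERE id = ?", (cur.lastrowid,))
hits_after_delete = count_hits(conn, "router")
```

qa_fts would need the same pair of triggers on qa_pairs, listing both question_text and answer_text.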
Smoke-Test Results (post-import, mid-batch)
Found 208 complete episode directories under archive-data/transcripts/
inserted : 208
updated : 0
skipped : 0
errors : 0
db : archive-data/archive.db (20.5 MB)
wall : 1.9 seconds
| Year | Episodes | Hours |
|---|---|---|
| 2010 | 43 | 32.1 |
| 2011 | 165 | 122.2 |
| Total at smoke-test time | 208 | 154.3 |
| Table | Rows |
|---|---|
| episodes | 208 |
| segments | 19,745 |
| turns | 7,233 |
| intros | 1,117 |
| qa_pairs | 566 |
Air-date parsed for 204/208 episodes (4 misses are season/episode-format filenames like s7e30 with no calendar date — accepted).
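A sketch of how a date might be pulled from an episode stem such as "10-15-11 HR 2". The regex, the MM-DD-YY reading, and the 2000s century are assumptions inferred from the one path shown in this log. The importer's actual parsing rules are not recorded here.

```python
import re
from datetime import date
from typing import Optional

# Assumed pattern: month-day-two-digit-year, not embedded in a longer number.
_DATE_RE = re.compile(r"(?<!\d)(\d{1,2})-(\d{1,2})-(\d{2})(?!\d)")


def parse_air_date(stem: str) -> Optional[str]:
    """Return an ISO date string, or None when the stem carries no valid date."""
    match = _DATE_RE.search(stem)
    if match is None:
        return None
    month, day, yy = (int(g) for g in match.groups())
    try:
        return date(2000 + yy, month, day).isoformat()
    except ValueError:  # e.g. month 13 or day 32
        return None


dated = parse_air_date("10-15-11 HR 2")
undated = parse_air_date("s7e30")
```

Returning None rather than guessing leaves air_date NULL for season/episode-style names, which matches the accepted misses.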
FTS5 queries verified:
- segments MATCH 'wireless' returned 3 hits with correct episode attribution and snippets
- qa MATCH 'virus' returned 3 hits with correct episode attribution
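A query of this kind might look like the sketch below, which joins segments_fts back to segments and episodes so each hit carries its episode path and a snippet. The search_segments helper, the demo rows, and the snippet markers are illustrative assumptions, and the tables are reduced to the columns used here. It requires a SQLite build with FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE episodes (id INTEGER PRIMARY KEY, rel_path TEXT NOT NULL UNIQUE);
CREATE TABLE segments (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episodes(id),
    start_sec REAL,
    text TEXT NOT NULL
);
CREATE VIRTUAL TABLE segments_fts USING fts5(
    text, content='segments', content_rowid='id',
    tokenize='porter unicode61'
);
-- Made-up demo rows.
INSERT INTO episodes (id, rel_path) VALUES (1, 'demo/episode-a'), (2, 'demo/episode-b');
INSERT INTO segments (episode_id, start_sec, text) VALUES
    (1, 12.5, 'My wireless network keeps dropping'),
    (2, 40.0, 'How do I remove a virus');
-- Populate the external-content index from the segments table.
INSERT INTO segments_fts(segments_fts) VALUES ('rebuild');
""")


def search_segments(conn, query, limit=10):
    """Return (rel_path, start_sec, snippet) rows for an FTS5 query."""
    return conn.execute(
        """
        SELECT e.rel_path, s.start_sec,
               snippet(segments_fts, 0, '[', ']', '...', 8) AS snip
        FROM segments_fts
        JOIN segments AS s ON s.id = segments_fts.rowid
        JOIN episodes AS e ON e.id = s.episode_id
        WHERE segments_fts MATCH ?
        ORDER BY segments_fts.rank
        LIMIT ?
        """,
        (query, limit),
    ).fetchall()


hits = search_segments(conn, "wireless")
```

Filtering out BUMPER turns or promo-flagged segments at query time would be an extra join or WHERE clause on top of this.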
Download Run — Final Stats
=== Summary ===
Total remote files : 589
Total remote bytes : 7.53 GB
Already present : 501 files / 4.88 GB
Newly downloaded : 88 files / 2.65 GB
Errors : 0
Wall time : 1799.3s
| Year | Local MP3 count |
|---|---|
| 2010 | 43 |
| 2011 | 200 |
| 2012 | 98 |
| 2014 | 81 |
| 2015 | 50 |
| 2016 | 54 |
| 2017 | 41 |
| 2018 | 5 |
| Total | 572 |
(572 vs 589-remote-total: 17-file delta is case-variant duplicates .MP3/.mp3 already counted under one local name, not missing files.)
Credentials
IX Server (archive source)
- Vault path: infrastructure/ix-server.sops.yaml
- Host: 172.16.3.10 (Tailscale required)
- External: ix.azcomputerguru.com / 72.194.62.5
- SSH port: 22
- OS: Rocky Linux (WHM/cPanel; WHM 2087, cPanel 2083)
- Username: root
- Password: stored in the vault entry above (not recorded in this log)
- Notes: Use paramiko with look_for_keys=False, allow_agent=False, timeout=30, banner_timeout=30, auth_timeout=30. Set transport.set_keepalive(30) and sftp.get_channel().settimeout(120) for long sessions. SSH from command line is blocked by key-agent interference on this machine.
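The paramiko settings in the notes above can be collected into one helper. This is a sketch, not the code in download_full_archive.py: open_sftp and CONNECT_KWARGS are hypothetical names, and the AutoAddPolicy host-key handling is an assumption that the notes do not mention. paramiko is a third-party package and is imported only when the function is called.

```python
# Connection options taken from the notes above.
CONNECT_KWARGS = {
    "look_for_keys": False,  # do not try local key files
    "allow_agent": False,    # bypass the SSH agent
    "timeout": 30,
    "banner_timeout": 30,
    "auth_timeout": 30,
}


def open_sftp(host, username, password, port=22):
    """Open a password-authenticated SFTP session tuned for long transfers."""
    import paramiko  # third-party; imported lazily so the module loads without it

    client = paramiko.SSHClient()
    # Assumption: accept unknown host keys. Pin known hosts if that is too lax.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        host, port=port, username=username, password=password, **CONNECT_KWARGS
    )
    client.get_transport().set_keepalive(30)  # keepalive packet every 30 s
    sftp = client.open_sftp()
    sftp.get_channel().settimeout(120)        # per-operation socket timeout
    return client, sftp
```

The password would be read from the environment (for example os.environ["IX_PASSWORD"]) rather than written into the script.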
Jupiter (Unraid — planned destination for archive.db)
- Vault path: infrastructure/jupiter-unraid-primary.sops.yaml
- (Container setup pending — no work done yet, just architectural decision)
Infrastructure & Paths
| Resource | Value |
|---|---|
| Audio processor root | c:\Users\guru\ClaudeTools\projects\radio-show\audio-processor\ |
| Episodes root (local) | archive-data/episodes/<year>/... |
| Transcripts root (local) | archive-data/transcripts/<year>/.../<stem>/ |
| Archive DB (local) | archive-data/archive.db |
| Per-episode outputs | transcript.json, transcript.txt, transcript.srt, diarization.json, intros.json, qa.json |
| Voice profiles | voice-profiles/ (181 profiles loaded by current run) |
| Background log dir | logs/ (download.log, batch_process.log) |
| Remote archive root | /home/gurushow/public_html/archive/{2010-2018}/ on IX |
| Planned Jupiter dir | /mnt/user/appdata/radio-archive/ |
Commands Run (key invocations)
# Resume download (from audio-processor dir, in venv)
IX_PASSWORD='<from SOPS vault>' .venv/Scripts/python.exe download_full_archive.py > logs/download.log 2>&1
# Resume batch transcribe + diarize (no env needed)
.venv/Scripts/python.exe batch_process.py >> logs/batch_process.log 2>&1
# Initial DB build / smoke test
.venv/Scripts/python.exe import_to_sqlite.py --rebuild
# Subsequent incremental imports (after each batch_process pass)
.venv/Scripts/python.exe import_to_sqlite.py
Pending / Next Up
- Wait for current batch_process to finish the 519-file snapshot (currently at 211/519, 147 transcribed since restart).
- Re-launch batch_process once more — picks up the 53 new MP3s downloaded after the snapshot was taken (5 late-2016 + 41 in 2017 + 5 in 2018 + 2 stragglers).
- Re-run import_to_sqlite.py (incremental, idempotent — only the new ones do real work).
- Stand up the Jupiter Docker container:
  - Create /mnt/user/appdata/radio-archive/ on Jupiter
  - Define container (FastAPI + sqlite, ~50 lines) — read-only mount of archive.db
  - Expose only on Tailscale interface, not on the public IP
  - rsync archive.db from GURU-BEAST-ROG to Jupiter as the deploy step
- Decide on speaker-name resolution view once query patterns emerge.
- (Future) profile-build for Randall, Rob, and named producers (Andrew/Shannon/Ken) so non-Mike-non-Tara speakers stop falling into the CALLER bucket. Per the prior session log, this is what's inflating Q&A false-positive rates in early-years and 2018/2019 episodes.
Reference Information
- Encoding rule for Windows Python: any open(...) that may write or read non-ASCII text (transcripts, captions, raw text dumps) must specify encoding="utf-8". JSON writes via json.dump with default ensure_ascii=True are safe but defensive encoding="utf-8" doesn't hurt.
- batch_process resumability: existence-check on all four output JSONs. To force a redo, delete the episode's output directory.
- Importer resumability: sha256 of transcript.json recorded per episode. Hash mismatch → cascade-delete + reinsert in one transaction.
- FTS5 trigger pattern (external content): INSERT INTO fts(rowid, ...) for ai trigger; INSERT INTO fts(fts, rowid, ...) VALUES('delete', ...) for ad trigger. Same column count for both.
- Per-year MP3 totals on IX: 2010 (52), 2011 (200), 2012 (98), 2014 (81), 2015 (50), 2016 (54), 2017 (41), 2018 (5) — note 2013 directory does not exist on the source.
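The existence-check resumability rule can be sketched as follows. The four filenames are taken from the per-episode outputs listed under Infrastructure & Paths, on the assumption that those are the four JSONs batch_process checks. The is_complete and pending helpers are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed to be the four JSON outputs that mark an episode as done.
REQUIRED_OUTPUTS = ("transcript.json", "diarization.json", "intros.json", "qa.json")


def is_complete(out_dir: Path) -> bool:
    """An episode counts as done only when every required JSON exists."""
    return all((out_dir / name).is_file() for name in REQUIRED_OUTPUTS)


def pending(out_dirs):
    """Return the output dirs that still need processing."""
    return [d for d in out_dirs if not is_complete(d)]


with tempfile.TemporaryDirectory() as tmp:
    done = Path(tmp) / "ep-done"
    partial = Path(tmp) / "ep-partial"
    done.mkdir()
    partial.mkdir()
    for name in REQUIRED_OUTPUTS:
        (done / name).write_text("{}", encoding="utf-8")
    # A crash after the first stage leaves only transcript.json behind.
    (partial / "transcript.json").write_text("{}", encoding="utf-8")
    todo_names = [d.name for d in pending([done, partial])]
```

A check like this only sees whether files exist, not whether they are intact, which is why deleting the whole output directory is the way to force a clean redo.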