Mike Swanson 82940d96d7 radio: utf-8 transcript writes + sqlite archive importer + session log

- src/transcriber.py: open transcript.{json,txt,srt} with encoding="utf-8".
  Windows cp1252 default crashed on Whisper output containing U+2044.
- import_to_sqlite.py: new. Walks archive-data/transcripts, builds
  archive.db (5 tables + 2 FTS5 virtual tables, sha256-keyed idempotency).
  20.5 MB / 208 episodes at smoke-test time, 1.9s rebuild.
- batch_process.py: tracked from prior session — full-archive batch with
  resumable transcribe/diarize/intros/qa pipeline.
- .gitignore: archive-data/ and logs/.

Session log: 2026-04-27-archive-batch-and-sqlite-import.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 19:38:02 -07:00

Session Log — 2026-04-27 (continuation #2)

Project: The Computer Guru Show — Archive Mining System
Goal: Resume archive download + batch transcribe/diarize after machine restart, then design + build the SQLite archive database
Machine: GURU-BEAST-ROG (RTX 4090, 24GB)
User: Mike Swanson (mike)

Companion to:

2026-04-27-diarization-pipeline.md (DESKTOP-0O8A1RL — diarization fixes)
2026-04-27-4090-benchmark-and-test-set.md (GURU-BEAST-ROG — 4090 perf + per-year test set)

User

User: Mike Swanson (mike)
Machine: GURU-BEAST-ROG
Role: admin

Session Summary

The session focused on resuming interrupted archive processing and initiating the design of a SQLite database for the Computer Guru Radio Show. The machine had restarted during the execution of download_full_archive.py and batch_process.py, leaving the download partially complete and batch processing halted mid-year. The download had completed through 2015 with partial progress in 2016, and years 2017 and 2018 were entirely missing locally. batch_process had finished 2010 (43/43) and stopped at 21 of 200 episodes in 2011. Connectivity to the IX server was confirmed via Tailscale, the SOPS vault yielded the IX root password, and both jobs were restarted in the background.

The download successfully resumed via size-match skipping, then pulled the remaining 88 files (2.65 GB) covering late-2016 plus all of 2017 and 2018 in 30 minutes wall time, with 0 errors. batch_process picked up at episode 65/519 of its file-list snapshot but immediately tripped a 'charmap' codec can't encode character '⁄' error: src/transcriber.py opened its transcript output files without an explicit encoding, so Windows defaulted to cp1252, which cannot represent the U+2044 fraction slash in Whisper's output. The fix was to add encoding="utf-8" to each of the three open() calls. The partial output dir for episode 65 was deleted to force a clean redo, and batch_process was restarted.
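
The failure and the fix can be reproduced in a few lines. This is a minimal sketch, not the code in src/transcriber.py: the sample string and the temporary file are made up for illustration, and cp1252 is named explicitly so the behavior is the same on any platform.

```python
import json
import os
import tempfile

# Hypothetical sample text containing U+2044 FRACTION SLASH.
text = "about 1\u20442 of the drive"

# cp1252 has no mapping for U+2044, so encoding raises UnicodeEncodeError.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# With encoding="utf-8" the write succeeds and the text round-trips.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "transcript.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

# json.dumps defaults to ensure_ascii=True, so its output is pure ASCII
# and encodes cleanly even under cp1252.
json_text = json.dumps({"text": text})
json_bytes = json_text.encode("cp1252")

print(cp1252_ok, round_trip == text, json_text)
```

The last two lines show why the JSON writers survived the same input: the fraction slash is escaped before any byte encoding happens.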

The user then raised an architectural question about where the canonical archive database should live. Discussion converged on SQLite (over MariaDB) because the per-episode JSONs are the source of truth and the .db is rebuildable in seconds, and on Jupiter+Docker (over IX cPanel) because the use case is internal-only and Tailscale already provides access; public exposure can be added later via cloudflared. The schema (5 tables + 2 FTS5 virtual tables) was designed and import_to_sqlite.py was written and smoke-tested against 208 currently-complete episodes — 1.9 seconds for full rebuild, 20.5 MB DB, FTS queries on "wireless" and "virus" returning correct snippets.

By session end, the download was complete and batch_process was at episode 211/519 (147 episodes transcribed since the encoding-fix restart). One final batch_process re-run is needed after the current 519-snapshot finishes, to pick up the 53 newly-downloaded files that were not in the startup snapshot.

Key Decisions

SQLite over MariaDB: per-episode JSONs are the source of truth, the .db is rebuildable in seconds. Can graduate to MariaDB later by re-importing from the same JSONs without losing anything.
Jupiter (Unraid Docker) over IX cPanel: use case is internal-only show-prep search. Tailscale already covers access. IX is the right place for public-facing show-site content but adds shared-hosting friction the v1 doesn't need.
FTS5 with porter unicode61 tokenizer: porter stemming for English query expansion, unicode61 for case-folding and basic punctuation handling. External-content tables with content_rowid pointing back to segments and qa_pairs so the FTS index doesn't duplicate the text.
Skip speaker-name resolution view in v1: turns table holds role labels (HOST/CO-HOST/CALLER/BUMPER), intros and qa_pairs hold real names. A SQL view that joins them by time-window is cheap to add later and no data is lost by deferring.
Keep BUMPER turns and promo-flagged segments raw: filter at query time. Excluding them at insert loses signal that may matter for future analysis.
sha256 of transcript.json as idempotency key: importer skips an episode whose recorded hash matches the on-disk file. Re-run the importer after each batch_process pass; it only does work for changed files.
Restart batch_process to fix the encoding bug rather than patch partial files in place: the .json was correct (json.dump's default ensure_ascii=True), but .txt and .srt were potentially truncated. The cleanest path was to delete the failed episode's whole output dir and let the pipeline regenerate everything with the encoding fix.
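
The sha256 idempotency decision can be sketched as follows. This is an illustration under stated assumptions, not the code in import_to_sqlite.py: the helper names file_sha256, needs_import, and demo are hypothetical, and the episodes table is reduced to the two columns the check needs.

```python
import hashlib
import os
import sqlite3
import tempfile


def file_sha256(path):
    """Hash a file in chunks so large transcripts are not read in one go."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_import(conn, rel_path, digest):
    """True when the episode is new or its recorded hash differs."""
    row = conn.execute(
        "SELECT transcript_sha256 FROM episodes WHERE rel_path = ?",
        (rel_path,),
    ).fetchone()
    return row is None or row[0] != digest


def demo():
    """Walk one hypothetical episode through new, unchanged, and changed."""
    conn = sqlite3.connect(":memory:")
    # Reduced, illustrative table: only the columns the check reads.
    conn.execute(
        "CREATE TABLE episodes (rel_path TEXT UNIQUE, transcript_sha256 TEXT)"
    )
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "transcript.json")
        with open(path, "w", encoding="utf-8") as f:
            f.write('{"segments": []}')
        digest = file_sha256(path)
        results.append(needs_import(conn, "2011/example", digest))  # new
        conn.execute(
            "INSERT INTO episodes VALUES (?, ?)", ("2011/example", digest)
        )
        results.append(needs_import(conn, "2011/example", digest))  # unchanged
        with open(path, "w", encoding="utf-8") as f:
            f.write('{"segments": [{"text": "hi"}]}')
        results.append(needs_import(conn, "2011/example", file_sha256(path)))
    conn.close()
    return results


print(demo())
```

Hashing the file bytes rather than comparing modification times means a re-run of batch_process that rewrites an identical transcript.json costs nothing on the import side.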

Problems Encountered

'charmap' codec encoding error in transcriber.py
- Cause: open(... "w") defaulted to Windows cp1252; Whisper output contained U+2044 (fraction slash) which cp1252 cannot encode.
- Fix: added encoding="utf-8" to the three open() calls at src/transcriber.py:93,97,101.
- Audited the rest of the pipeline: diarizer.py, batch_process.py, and other JSON writers use json.dump default ensure_ascii=True, which escapes unicode to ASCII before encoding; that is safe under cp1252 even without explicit utf-8. Only transcriber.py writes raw unicode (transcript.txt, transcript.srt).

Failed episode left inconsistent output
- Episode 65 (2011/10 - October/10-15-11 HR 2) had transcript.json written successfully but transcript.txt truncated mid-encode.
- Fix: rm -rf the entire episode output dir; batch_process redoes it cleanly on next pass.

Monitor refired the same error every poll
- Initial monitor used grep -E "ERROR" $LOG | tail -1 each iteration, so a single historical error line emitted a notification every 60s.
- Fix: track error count between polls; only emit when count grows. Same pattern applied to download FAILED counts.

batch_process snapshot taken before download finished
- all_mp3s = ... is computed once at startup. The 53 newly-downloaded MP3s (late-2016, 2017, 2018) are not visible to the currently-running batch.
- Mitigation: after current 519-snapshot run finishes, relaunch batch_process once. Resumability via existence-check makes the re-run only process the new files.
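
The monitor fix (emit only when the error count grows) can be sketched in Python. The session's monitor was a shell loop around grep, so this is a restatement of the pattern rather than the script that ran; the ErrorWatcher class and the log lines are hypothetical.

```python
class ErrorWatcher:
    """Remember how many ERROR lines were seen and report only new ones."""

    def __init__(self, marker="ERROR"):
        self.marker = marker
        self.last_count = 0

    def poll(self, log_text):
        """Return ERROR lines added since the previous poll (may be empty)."""
        errors = [line for line in log_text.splitlines() if self.marker in line]
        new_errors = errors[self.last_count:]
        # Store the current total so the same line is never reported twice.
        self.last_count = len(errors)
        return new_errors


watcher = ErrorWatcher()
log = "episode 64 ok\nERROR episode 65: charmap codec\n"
first = watcher.poll(log)    # the error is new on the first poll
second = watcher.poll(log)   # unchanged log, nothing to report
log += "episode 66 ok\nERROR episode 67: example\n"
third = watcher.poll(log)    # only the newly appended error

print(first, second, third)
```

The same class works for the download FAILED counts by passing marker="FAILED".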

Files Modified / Created

Path	Change
`projects/radio-show/audio-processor/src/transcriber.py`	Added `encoding="utf-8"` to all three `open()` calls in `Transcript.save()` (lines 93, 97, 101)
`projects/radio-show/audio-processor/import_to_sqlite.py`	NEW. Walks archive-data/transcripts, imports JSONs into archive.db with FTS5. sha256-keyed idempotency.
`projects/radio-show/audio-processor/batch_process.py`	(carried over untracked from the prior session and first tracked in this commit; no edits this session)
`projects/radio-show/audio-processor/archive-data/episodes/{2010..2018}/`	Filled in by download_full_archive.py — 88 new files
`projects/radio-show/audio-processor/archive-data/transcripts/{2010,2011}/...`	Per-episode output dirs — written by batch_process
`projects/radio-show/audio-processor/archive-data/archive.db`	NEW (smoke-test rebuild, 20.5 MB at 208 episodes)
`projects/radio-show/audio-processor/logs/download.log`	Background download output
`projects/radio-show/audio-processor/logs/batch_process.log`	Background batch output

SQLite Schema (full DDL)

CREATE TABLE episodes (
    id INTEGER PRIMARY KEY,
    rel_path TEXT NOT NULL UNIQUE,
    year INTEGER NOT NULL,
    title TEXT,
    air_date TEXT,
    duration_sec REAL NOT NULL,
    language TEXT,
    language_probability REAL,
    num_speakers INTEGER,
    transcript_sha256 TEXT NOT NULL,
    processed_at TEXT NOT NULL
);
CREATE INDEX idx_episodes_year ON episodes(year);
CREATE INDEX idx_episodes_air_date ON episodes(air_date);

CREATE TABLE segments (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episodes(id) ON DELETE CASCADE,
    seg_idx INTEGER NOT NULL,
    start_sec REAL, end_sec REAL,
    text TEXT NOT NULL,
    UNIQUE(episode_id, seg_idx)
);
CREATE INDEX idx_segments_episode ON segments(episode_id, start_sec);

CREATE TABLE turns (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episodes(id) ON DELETE CASCADE,
    speaker TEXT NOT NULL,                 -- HOST / CO-HOST / CALLER / BUMPER
    start_sec REAL, end_sec REAL,
    confidence REAL
);
CREATE INDEX idx_turns_episode ON turns(episode_id, start_sec);
CREATE INDEX idx_turns_speaker ON turns(episode_id, speaker);

CREATE TABLE intros (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episodes(id) ON DELETE CASCADE,
    name TEXT NOT NULL,
    role_hint TEXT,                        -- caller / cohost / fillin
    intro_time_sec REAL,
    affiliation TEXT, fillin_for TEXT,
    source_text TEXT
);
CREATE INDEX idx_intros_episode ON intros(episode_id);
CREATE INDEX idx_intros_name ON intros(name);

CREATE TABLE qa_pairs (
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL REFERENCES episodes(id) ON DELETE CASCADE,
    question_start_sec REAL, question_end_sec REAL,
    answer_start_sec REAL, answer_end_sec REAL,
    question_text TEXT NOT NULL,
    answer_text TEXT NOT NULL,
    caller_name TEXT, caller_role TEXT,
    topic TEXT, topic_tags TEXT            -- JSON array as TEXT
);
CREATE INDEX idx_qa_episode ON qa_pairs(episode_id);
CREATE INDEX idx_qa_caller ON qa_pairs(caller_name);

CREATE VIRTUAL TABLE segments_fts USING fts5(
    text, content='segments', content_rowid='id',
    tokenize='porter unicode61'
);
CREATE VIRTUAL TABLE qa_fts USING fts5(
    question_text, answer_text,
    content='qa_pairs', content_rowid='id',
    tokenize='porter unicode61'
);
-- + standard ai/ad triggers to keep FTS in sync on insert/delete
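
The ai/ad triggers mentioned in the last comment are not spelled out in the log. A minimal sketch of what they could look like for segments_fts follows, run through Python's sqlite3 module. It assumes the local SQLite build includes FTS5; the trigger names segments_ai and segments_ad are hypothetical, and the segments table is reduced to two columns for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced segments table (the real one also has episode_id, seg_idx, times).
    CREATE TABLE segments (
        id INTEGER PRIMARY KEY,
        text TEXT NOT NULL
    );
    CREATE VIRTUAL TABLE segments_fts USING fts5(
        text, content='segments', content_rowid='id',
        tokenize='porter unicode61'
    );
    -- After insert: index the new row.
    CREATE TRIGGER segments_ai AFTER INSERT ON segments BEGIN
        INSERT INTO segments_fts(rowid, text) VALUES (new.id, new.text);
    END;
    -- After delete: remove the old row from the index via the 'delete' command.
    CREATE TRIGGER segments_ad AFTER DELETE ON segments BEGIN
        INSERT INTO segments_fts(segments_fts, rowid, text)
        VALUES ('delete', old.id, old.text);
    END;
    """
)

conn.execute(
    "INSERT INTO segments(text) VALUES (?)", ("Resetting the wireless routers",)
)
query = "SELECT rowid FROM segments_fts WHERE segments_fts MATCH ?"
# The porter tokenizer stems both 'routers' and 'router' to the same term.
hits_before = conn.execute(query, ("router",)).fetchall()

conn.execute("DELETE FROM segments WHERE id = 1")
hits_after = conn.execute(query, ("router",)).fetchall()

print(hits_before, hits_after)
```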

Smoke-Test Results (post-import, mid-batch)

Found 208 complete episode directories under archive-data/transcripts/

  inserted : 208
  updated  : 0
  skipped  : 0
  errors   : 0
  db       : archive-data/archive.db  (20.5 MB)
  wall     : 1.9 seconds

Year	Episodes	Hours
2010	43	32.1
2011	165	122.2
Total at smoke-test time	208	154.3

Table	Rows
episodes	208
segments	19,745
turns	7,233
intros	1,117
qa_pairs	566

Air-date parsed for 204/208 episodes (4 misses are season/episode-format filenames like s7e30 with no calendar date — accepted).
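
How a date-style filename differs from a season/episode-style one can be shown with a small parser. This is a sketch only: the importer's real parsing rules are not recorded in this log, the MM-DD-YY pattern is inferred from the one example stem "10-15-11 HR 2", and the assumption that two-digit years belong to the 2000s is mine.

```python
import re
from datetime import date
from typing import Optional

# Assumed pattern: month-day-two-digit-year at the start of the file stem.
DATE_RE = re.compile(r"^(\d{1,2})-(\d{1,2})-(\d{2})\b")


def parse_air_date(stem: str) -> Optional[str]:
    """Return an ISO date for MM-DD-YY stems, or None when no date is present."""
    match = DATE_RE.match(stem.strip())
    if not match:
        return None
    month, day, yy = (int(group) for group in match.groups())
    try:
        return date(2000 + yy, month, day).isoformat()
    except ValueError:
        # Digits that do not form a real calendar date (e.g. month 13).
        return None


print(parse_air_date("10-15-11 HR 2"))
print(parse_air_date("s7e30"))
```

Returning None instead of guessing keeps air_date NULL for the season/episode-format files, which matches the "accepted" outcome above.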

FTS5 queries verified:

segments_fts MATCH 'wireless' returned 3 hits with correct episode attribution and snippets
qa_fts MATCH 'virus' returned 3 hits with correct episode attribution
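
A query of the kind verified above, with episode attribution and a snippet, might look like the sketch below. It assumes an FTS5-enabled SQLite build, uses reduced versions of the episodes and segments tables, and the two sample rows are invented; it is not the exact query that was run during the smoke test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced tables: only the columns this query touches.
    CREATE TABLE episodes (id INTEGER PRIMARY KEY, rel_path TEXT NOT NULL UNIQUE);
    CREATE TABLE segments (
        id INTEGER PRIMARY KEY,
        episode_id INTEGER NOT NULL REFERENCES episodes(id),
        text TEXT NOT NULL
    );
    CREATE VIRTUAL TABLE segments_fts USING fts5(
        text, content='segments', content_rowid='id',
        tokenize='porter unicode61'
    );
    INSERT INTO episodes(id, rel_path) VALUES (1, '2011/example-a'), (2, '2011/example-b');
    INSERT INTO segments(episode_id, text) VALUES
        (1, 'My wireless card keeps dropping'),
        (2, 'Back up your photos first');
    -- Populate the external-content index from the segments table.
    INSERT INTO segments_fts(segments_fts) VALUES ('rebuild');
    """
)

rows = conn.execute(
    """
    SELECT e.rel_path,
           snippet(segments_fts, 0, '[', ']', '...', 8) AS snip
    FROM segments_fts
    JOIN segments AS s ON s.id = segments_fts.rowid
    JOIN episodes AS e ON e.id = s.episode_id
    WHERE segments_fts MATCH ?
    ORDER BY rank
    """,
    ("wireless",),
).fetchall()

for rel_path, snip in rows:
    print(rel_path, snip)
```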

Download Run — Final Stats

=== Summary ===
  Total remote files : 589
  Total remote bytes : 7.53 GB
  Already present    : 501 files / 4.88 GB
  Newly downloaded   : 88 files / 2.65 GB
  Errors             : 0
  Wall time          : 1799.3s

Year	Local MP3 count
2010	43
2011	200
2012	98
2014	81
2015	50
2016	54
2017	41
2018	5
Total	572

(572 local vs. 589 remote: the 17-file delta is case-variant duplicates (.MP3/.mp3) that are counted under one local name, not missing files.)

Credentials

IX Server (archive source)

Vault path: infrastructure/ix-server.sops.yaml
Host: 172.16.3.10 (Tailscale required)
External: ix.azcomputerguru.com / 72.194.62.5
SSH port: 22
OS: Rocky Linux (WHM/cPanel; WHM 2087, cPanel 2083)
Username: root
Password: Gptf*77ttb!@#!@#
Notes: Use paramiko with look_for_keys=False, allow_agent=False, timeout=30, banner_timeout=30, auth_timeout=30. Set transport.set_keepalive(30) and sftp.get_channel().settimeout(120) for long sessions. SSH from command line is blocked by key-agent interference on this machine.

Jupiter (Unraid — planned destination for archive.db)

Vault path: infrastructure/jupiter-unraid-primary.sops.yaml
(Container setup pending — no work done yet, just architectural decision)

Infrastructure & Paths

Resource	Value
Audio processor root	`c:\Users\guru\ClaudeTools\projects\radio-show\audio-processor\`
Episodes root (local)	`archive-data/episodes/<year>/...`
Transcripts root (local)	`archive-data/transcripts/<year>/.../<stem>/`
Archive DB (local)	`archive-data/archive.db`
Per-episode outputs	`transcript.json`, `transcript.txt`, `transcript.srt`, `diarization.json`, `intros.json`, `qa.json`
Voice profiles	`voice-profiles/` (181 profiles loaded by current run)
Background log dir	`logs/` (download.log, batch_process.log)
Remote archive root	`/home/gurushow/public_html/archive/{2010-2018}/` on IX
Planned Jupiter dir	`/mnt/user/appdata/radio-archive/`

Commands Run (key invocations)

# Resume download (from audio-processor dir, in venv)
IX_PASSWORD='Gptf*77ttb!@#!@#' .venv/Scripts/python.exe download_full_archive.py > logs/download.log 2>&1

# Resume batch transcribe + diarize (no env needed)
.venv/Scripts/python.exe batch_process.py >> logs/batch_process.log 2>&1

# Initial DB build / smoke test
.venv/Scripts/python.exe import_to_sqlite.py --rebuild

# Subsequent incremental imports (after each batch_process pass)
.venv/Scripts/python.exe import_to_sqlite.py

Pending / Next Up

Wait for current batch_process to finish the 519-file snapshot (currently at 211/519, 147 transcribed since restart).
Re-launch batch_process once more — picks up the 53 new MP3s downloaded after the snapshot was taken (5 late-2016 + 41 in 2017 + 5 in 2018 + 2 stragglers).
Re-run import_to_sqlite.py (incremental, idempotent — only the new ones do real work).
Stand up the Jupiter Docker container:
- Create /mnt/user/appdata/radio-archive/ on Jupiter
- Define container (FastAPI + sqlite, ~50 lines) — read-only mount of archive.db
- Expose only on Tailscale interface, not on the public IP
- rsync archive.db from GURU-BEAST-ROG to Jupiter as the deploy step
Decide on speaker-name resolution view once query patterns emerge.
(Future) profile-build for Randall, Rob, and named producers (Andrew/Shannon/Ken) so non-Mike-non-Tara speakers stop falling into the CALLER bucket. Per the prior session log, this is what's inflating Q&A false-positive rates in early-years and 2018/2019 episodes.

Reference Information

Encoding rule for Windows Python: any open(...) that may write or read non-ASCII text (transcripts, captions, raw text dumps) must specify encoding="utf-8". JSON writes via json.dump with the default ensure_ascii=True are safe, but a defensive encoding="utf-8" doesn't hurt.
batch_process resumability: existence-check on all four output JSONs. To force a redo, delete the episode's output directory.
Importer resumability: sha256 of transcript.json recorded per episode. Hash mismatch → cascade-delete + reinsert in one transaction.
FTS5 trigger pattern (external content): INSERT INTO fts(rowid, ...) for the ai trigger; INSERT INTO fts(fts, rowid, ...) VALUES('delete', ...) for the ad trigger. Both list the same content columns, and the delete form must pass the old row's values.
Per-year MP3 totals on IX: 2010 (52), 2011 (200), 2012 (98), 2014 (81), 2015 (50), 2016 (54), 2017 (41), 2018 (5) — note 2013 directory does not exist on the source.
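
The importer's "cascade-delete + reinsert in one transaction" step can be sketched as below. The function name replace_episode and the reduced two-table schema are hypothetical. One detail worth stating: SQLite only enforces ON DELETE CASCADE when PRAGMA foreign_keys is ON for the connection, so the sketch enables it before any transaction starts.

```python
import sqlite3


def replace_episode(conn, rel_path, digest, segment_texts):
    """Delete any existing episode row (children cascade) and reinsert it."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM episodes WHERE rel_path = ?", (rel_path,))
        cur = conn.execute(
            "INSERT INTO episodes(rel_path, transcript_sha256) VALUES (?, ?)",
            (rel_path, digest),
        )
        episode_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO segments(episode_id, seg_idx, text) VALUES (?, ?, ?)",
            [(episode_id, i, text) for i, text in enumerate(segment_texts)],
        )
    return episode_id


conn = sqlite3.connect(":memory:")
# Without this pragma the DELETE would leave orphaned segments behind.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    -- Reduced schema for illustration.
    CREATE TABLE episodes (
        id INTEGER PRIMARY KEY,
        rel_path TEXT NOT NULL UNIQUE,
        transcript_sha256 TEXT NOT NULL
    );
    CREATE TABLE segments (
        id INTEGER PRIMARY KEY,
        episode_id INTEGER NOT NULL REFERENCES episodes(id) ON DELETE CASCADE,
        seg_idx INTEGER NOT NULL,
        text TEXT NOT NULL,
        UNIQUE(episode_id, seg_idx)
    );
    """
)

replace_episode(conn, "2011/example", "hash-v1", ["one", "two", "three"])
replace_episode(conn, "2011/example", "hash-v2", ["uno"])

episode_count = conn.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]
segment_count = conn.execute("SELECT COUNT(*) FROM segments").fetchone()[0]
stored_hash = conn.execute(
    "SELECT transcript_sha256 FROM episodes WHERE rel_path = ?", ("2011/example",)
).fetchone()[0]
print(episode_count, segment_count, stored_hash)
```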
