Files
Mike Swanson 519278a664 radio: session log update — Jupiter container live at 172.16.3.20:8765
Append to 2026-04-28-session.md covering the FastAPI/SQLite container
deploy: build + ship + verify, plus credentials, paths, and re-deploy
procedures for both DB updates and source updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 06:05:34 -07:00

15 KiB

Session Log — 2026-04-28

Project: The Computer Guru Show — Archive Mining System Goal: Complete the in-flight batch + sqlite import; bring DB to final state Machine: GURU-BEAST-ROG (RTX 4090, 24GB) User: Mike Swanson (mike)

Continuation of 2026-04-27-archive-batch-and-sqlite-import.md. Short execution-only session — finished the runs that were started yesterday, no new architecture or code.


User

  • User: Mike Swanson (mike)
  • Machine: GURU-BEAST-ROG
  • Role: admin

Session Summary

The session completed the final processing steps for the Computer Guru Radio Show archive. batch_process.py finished its first 519-file snapshot — 449 processed, 68 cached, 0 errors, 4.6 hours wall — and was relaunched to pick up the 53 stragglers (late-2016 plus all of 2017/2018) that arrived after the original snapshot was taken. The second pass took 56.4 minutes with 0 errors. The encoding fix made yesterday in src/transcriber.py held across both passes — no charmap errors recurred.

import_to_sqlite.py was run once after the second batch pass, incrementally over all 572 complete episode directories. 364 newly inserted (the second-pass output plus everything done after the smoke test), 208 skipped via sha256 match, 0 errors, 3.4 seconds. The DB grew from 20.5 MB → 57.7 MB and now covers the entire downloadable archive: 572 episodes, 482.7 audio-hours, 60,917 transcript segments, 25,918 diarization turns, 3,043 intros, 1,407 Q&A pairs.

Both tasks (#1 download, #2 batch_process) are now complete. The next step is the deferred Jupiter Docker container deployment so the DB can be queried over Tailscale by Howard or me during show prep.


Key Decisions

  • No new architectural decisions this session — execution pass on the prior session's plan.

Problems Encountered

  • None. No errors during either batch pass or the import.
  • One pre-existing minor bug surfaced (not blocking, not fixed): batch_process.py's all_intros dict is module-level and reset at each run start, so intro_roster.json is overwritten per-run rather than aggregated across runs. The second pass's roster shows only 89 unique names from its 53 episodes, not the 303+ cumulative across the full archive. The per-episode intros.json files remain authoritative and the DB import reads from those, so the canonical name list in the DB (intros table — 3,043 rows) is correct. Future cleanup: load+merge existing roster.json before populating, or just drop the aggregate JSON and rely on SELECT DISTINCT name FROM intros against the DB.

Final DB State

DB:    archive-data/archive.db  (57.7 MB)
Year Episodes Hours
2010 43 32.1
2011 200 147.8
2012 98 70.4
2014 81 64.5
2015 50 38.4
2016 54 61.7
2017 41 60.5
2018 5 7.3
Total 572 482.7

(Source archive has no 2013 directory.)

Table Rows
episodes 572
segments 60,917
turns 25,918
intros 3,043
qa_pairs 1,407

Top 10 recurring caller names (by Q&A count):

Caller Pairs
Mark 77
John 52
Steve 49
Tom 46
Richard 30
Paul 29
Robert 29
Bill 27
Jeff 27
Bob 25

Note: the "Tom" here is unrelated to the prior session's co-host identity error. These caller_name values come from the host's spoken introductions in the transcript (e.g. "Let's talk to Tom..."), not from voice profiling. Real Tucson listeners named Tom calling in over 8 years.


Run Stats (this session)

batch_process.py — first pass (519-snapshot)

=== Done ===
  processed       : 449
  cached (skipped): 68
  in-progress     : 2
  errors          : 0
  wall time       : 276.9 min
  roster          : 303 unique names

batch_process.py — second pass (53 stragglers)

=== Done ===
  processed       : 53
  cached (skipped): 519
  in-progress     : 0
  errors          : 0
  wall time       : 56.4 min
  roster          : 89 unique names  (per-run overwrite, see Problems)

import_to_sqlite.py — incremental (no flags)

Found 572 complete episode directories
=== Done in 3.4s ===
  inserted : 364
  updated  : 0
  skipped  : 208
  errors   : 0
  db       : archive-data/archive.db  (57.7 MB)

Credentials

No new credentials this session. Reference from prior session log:

IX Server (archive source — used yesterday, idle now)

  • Vault path: infrastructure/ix-server.sops.yaml
  • Host: 172.16.3.10 (Tailscale)
  • External: ix.azcomputerguru.com / 72.194.62.5
  • SSH port: 22
  • Username: root
  • Password: Gptf*77ttb!@#!@#

Jupiter (Unraid — pending deploy target)

  • Vault path: infrastructure/jupiter-unraid-primary.sops.yaml
  • Container setup not yet started.

Infrastructure & Paths

Resource Value
Audio processor root c:\Users\guru\ClaudeTools\projects\radio-show\audio-processor\
Local DB archive-data/archive.db (57.7 MB)
Per-episode outputs archive-data/transcripts/<year>/.../<stem>/{transcript,diarization,intros,qa}.json + transcript.{txt,srt}
MP3s archive-data/episodes/<year>/... (572 files / ~7.5 GB)
Background logs logs/{download,batch_process}.log
Planned Jupiter mount /mnt/user/appdata/radio-archive/

Commands Run

# Second batch pass (after first 519-snapshot completed)
.venv/Scripts/python.exe batch_process.py >> logs/batch_process.log 2>&1

# Final incremental import to bring DB to 572 episodes
.venv/Scripts/python.exe import_to_sqlite.py

Pending / Next Up

  1. Stand up the Jupiter Docker container (the only remaining work item from the original plan):
    • Create /mnt/user/appdata/radio-archive/ on Jupiter
    • Container: FastAPI + sqlite (~50 lines), read-only mount of archive.db
    • Bind only on Tailscale interface, not the public IP
    • Deploy step: rsync archive.db from GURU-BEAST-ROG → Jupiter
  2. (Followup) Fix the intro_roster.json aggregation bug — load+merge existing roster on startup, or drop the aggregate JSON and use SELECT DISTINCT name FROM intros instead.
  3. (Future, deferred from prior session) Build voice profiles for Randall, Rob, and named producers (Andrew/Shannon/Ken) so non-Mike-non-Tara voices stop being labeled CALLER and inflating Q&A false positives in early-years and 2018/2019 episodes. The current 1,407 Q&A pairs include some unknown number of these false positives.
  4. (Future) Speaker-name resolution view — once query patterns emerge, decide whether to materialize a SQL view that joins turns (role labels) ↔ intros (real names) by time-window.

Reference Information

  • Re-run the importer any time new episodes are processed: .venv/Scripts/python.exe import_to_sqlite.py. Idempotent. Skip-by-sha256 means only changed/new episodes do real work. ~50 ms per skip-decision, ~5 ms per insert.
  • Full rebuild: import_to_sqlite.py --rebuild (drops and recreates the DB; ~3 seconds for the current 572-episode dataset).
  • Schema and triggers are in import_to_sqlite.py itself — SCHEMA and TRIGGERS constants at top of file.
  • FTS query examples (verified working):
    • SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12) FROM segments_fts JOIN segments s ON s.id=segments_fts.rowid JOIN episodes e ON e.id=s.episode_id WHERE segments_fts MATCH 'wireless'
    • Same shape for qa_fts against qa_pairs.

Update: 06:02

Continuation later the same day — the deferred Jupiter Docker container was built, deployed, and verified end-to-end.

Update Summary

A small FastAPI + SQLite query server was built in projects/radio-show/audio-processor/server/ and deployed to Jupiter (Unraid) as a Docker container. The container is bound to 172.16.3.20:8765, runs with restart: unless-stopped, and serves a read-only mount of archive.db. End-to-end verification from GURU-BEAST-ROG over the LAN confirmed GET / (HTML index, 200 in 111 ms, 3,647 bytes), /api/stats (correct counts), /api/search?q=BIOS&kind=qa (Pete 2012-09-22 + Frank 2011-09-24, bm25-ranked, snippets with <mark> highlighting), and clean uvicorn logs.

The HTML index page uses fetch + vanilla JS with debounced search (250 ms) and a kind filter (both / Q&A only / transcript only). No build step. The SQLite connection uses mode=ro URI and check_same_thread=False so future multi-worker uvicorn deployments can share the connection safely. Source committed at 71ada13 on main after rebasing over an intervening commit (b3da922) from another machine.

Total work item is closed: the original architectural plan (sqlite-first, Jupiter Docker, internal-only) is fully realized and reachable at http://172.16.3.20:8765/.

Key Decisions (this update)

  • LAN-only re-framing: Jupiter is NOT on Tailscale (no daemon, not in tailnet status). The original "Tailscale-only" framing was tightened to "LAN-only via 172.16.3.0/22". Effective protection is the same — no public exposure — and any user on the office LAN or with subnet routing through Tailscale can reach the service.
  • Explicit host bind 172.16.3.20:8765:8765 in compose.yml instead of 0.0.0.0:8765:8765 or 8765:8765. Pinning to a specific host IP eliminates accidental exposure if Jupiter ever gets a second IP (e.g. an Tailscale interface added later).
  • Single-file HTML, no build step: index served as a 3.6 KB string from a @app.get("/") handler. Trades polish for zero deploy complexity. If the UI grows, swap in a static/ mount and a real frontend.
  • mode=ro + check_same_thread=False: read-only is enforced at the SQLite-URI layer (immune to a future bug in app-level read-only checks). check_same_thread=False is safe with mode=ro and lets uvicorn's worker model share connections without per-request reconnects.
  • Build on Jupiter, not locally: avoids docker save | ssh ... docker load round-trip and the cross-platform image-format edge cases. Only source files crossed the wire.

Problems Encountered (this update)

  • No new problems. Smoke test was clean; deploy was clean. Push hit the gitea pre-receive once because Howard/another machine had landed b3da922 in the meantime — resolved with a clean git pull --rebase (single commit, no conflicts) and a re-push.

Files Created (this update)

Path Purpose
projects/radio-show/audio-processor/server/main.py FastAPI app — endpoints + HTML index
projects/radio-show/audio-processor/server/requirements.txt fastapi==0.115.6, uvicorn[standard]==0.34.0
projects/radio-show/audio-processor/server/Dockerfile python:3.12-slim base, pip install, copy main.py, default python -u main.py
projects/radio-show/audio-processor/server/compose.yml Service radio-archive, bind 172.16.3.20:8765, mount /mnt/user/appdata/radio-archive/data:/data:ro, restart: unless-stopped

Endpoints (live on http://172.16.3.20:8765/)

Method Path Notes
GET / HTML search UI (debounced, kind filter, snippet highlighting)
GET /api/stats Counts per table + episodes/hours per year
GET /api/episodes?year=YYYY&limit=N List, ordered by air_date then title
GET /api/episodes/{id} Episode meta + intros + qa_pairs
GET /api/episodes/{id}/transcript Segments + diarization turns (chronological)
GET /api/search?q=...&kind=both|segments|qa&limit=N FTS5, bm25-ranked, snippet-formatted with <mark>
GET /api/callers?limit=N Top recurring caller_names

Credentials (used this update)

Jupiter (Unraid Primary)

  • Vault path: infrastructure/jupiter-unraid-primary.sops.yaml
  • Host: 172.16.3.20 (LAN; office network 172.16.3.0/22)
  • SSH port: 22
  • Username: root
  • Password: Th1nk3r^99##
  • iDRAC IP: 172.16.1.73
  • iDRAC user / pass: root / Window123!@#-idrac
  • OS / Docker: Linux Jupiter 6.12.54-Unraid; Docker 27.5.1
  • Notes: Primary container host (Gitea, NPM, Seafile + many others). Tailscale daemon NOT installed on this host.

Infrastructure & Paths (Jupiter)

Resource Value
App source dir on Jupiter /mnt/user/appdata/radio-archive/app/
DB dir on Jupiter /mnt/user/appdata/radio-archive/data/
DB file on Jupiter /mnt/user/appdata/radio-archive/data/archive.db (60,465,152 bytes)
Image radio-archive:latest (sha256:9a04d06ba2c1…)
Container name radio-archive
Bind 172.16.3.20:8765 -> 8765/tcp
Restart policy unless-stopped
Service URL http://172.16.3.20:8765/

Commands Run

# SSH (PuTTY plink — system OpenSSH password auth doesn't work without sshpass)
echo y | "/c/Program Files/PuTTY/plink.exe" -ssh -pw '<jupiter pw>' root@172.16.3.20 "<cmd>"

# File transfer (PuTTY pscp)
"/c/Program Files/PuTTY/pscp.exe" -batch -pw '<jupiter pw>' -scp \
    server/main.py server/requirements.txt server/Dockerfile server/compose.yml \
    root@172.16.3.20:/mnt/user/appdata/radio-archive/app/

"/c/Program Files/PuTTY/pscp.exe" -batch -pw '<jupiter pw>' -scp \
    archive-data/archive.db \
    root@172.16.3.20:/mnt/user/appdata/radio-archive/data/archive.db

# Build + run (on Jupiter via SSH)
cd /mnt/user/appdata/radio-archive/app && docker compose build  # ~70s
cd /mnt/user/appdata/radio-archive/app && docker compose up -d  # ~5s

# Verify (from GURU-BEAST-ROG)
curl -s http://172.16.3.20:8765/api/stats
curl -s "http://172.16.3.20:8765/api/search?q=BIOS&kind=qa"

Re-deploy Procedure (when DB updates)

# 1. Regenerate DB on GURU-BEAST-ROG
cd /c/Users/guru/ClaudeTools/projects/radio-show/audio-processor
.venv/Scripts/python.exe import_to_sqlite.py        # incremental
# or for full rebuild:
.venv/Scripts/python.exe import_to_sqlite.py --rebuild

# 2. Ship to Jupiter
"/c/Program Files/PuTTY/pscp.exe" -pw '<pw>' -scp archive-data/archive.db \
  root@172.16.3.20:/mnt/user/appdata/radio-archive/data/archive.db

# 3. No container restart needed — read-only connection picks up fresh data on next request

Re-deploy Procedure (when source changes)

# 1. ship updated source
"/c/Program Files/PuTTY/pscp.exe" -pw '<pw>' -scp server/*.py server/Dockerfile \
  root@172.16.3.20:/mnt/user/appdata/radio-archive/app/

# 2. rebuild + restart on Jupiter
echo y | "/c/Program Files/PuTTY/plink.exe" -ssh -pw '<pw>' root@172.16.3.20 \
  "cd /mnt/user/appdata/radio-archive/app && docker compose build && docker compose up -d"

Pending / Next Up (refreshed)

  1. Stand up the Jupiter Docker container DONE (this update).
  2. (Followup) Fix the intro_roster.json aggregation bug — load+merge or drop-and-use-DB.
  3. (Future) Voice profiles for Randall, Rob, and producers (Andrew/Shannon/Ken). Currently every non-Mike-non-Tara voice falls into the CALLER bucket, inflating Q&A false positives in early-years and 2018+ episodes.
  4. (Future) Speaker-name resolution view — defer until query patterns emerge.
  5. (Optional) Add Tailscale to Jupiter (subnet router for 172.16.3.0/22, or direct daemon for ts- hostname access). Only needed if remote, off-LAN access becomes a requirement.