Files

Mike Swanson 519278a664 radio: session log update — Jupiter container live at 172.16.3.20:8765

Append to 2026-04-28-session.md covering the FastAPI/SQLite container
deploy: build + ship + verify, plus credentials, paths, and re-deploy
procedures for both DB updates and source updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 06:05:34 -07:00

15 KiB

Raw Permalink Blame History

Session Log — 2026-04-28

Project: The Computer Guru Show — Archive Mining System Goal: Complete the in-flight batch + sqlite import; bring DB to final state Machine: GURU-BEAST-ROG (RTX 4090, 24GB) User: Mike Swanson (mike)

Continuation of 2026-04-27-archive-batch-and-sqlite-import.md. Short execution-only session — finished the runs that were started yesterday, no new architecture or code.

User

User: Mike Swanson (mike)
Machine: GURU-BEAST-ROG
Role: admin

Session Summary

The session completed the final processing steps for the Computer Guru Radio Show archive. batch_process.py finished its first 519-file snapshot — 449 processed, 68 cached, 0 errors, 4.6 hours wall — and was relaunched to pick up the 53 stragglers (late-2016 plus all of 2017/2018) that arrived after the original snapshot was taken. The second pass took 56.4 minutes with 0 errors. The encoding fix made yesterday in src/transcriber.py held across both passes — no charmap errors recurred.

import_to_sqlite.py was run once after the second batch pass, incrementally over all 572 complete episode directories. 364 newly inserted (the second-pass output plus everything done after the smoke test), 208 skipped via sha256 match, 0 errors, 3.4 seconds. The DB grew from 20.5 MB → 57.7 MB and now covers the entire downloadable archive: 572 episodes, 482.7 audio-hours, 60,917 transcript segments, 25,918 diarization turns, 3,043 intros, 1,407 Q&A pairs.

Both tasks (#1 download, #2 batch_process) are now complete. The next step is the deferred Jupiter Docker container deployment so the DB can be queried over Tailscale by Howard or me during show prep.

Key Decisions

No new architectural decisions this session — execution pass on the prior session's plan.

Problems Encountered

None. No errors during either batch pass or the import.
One pre-existing minor bug surfaced (not blocking, not fixed): batch_process.py's all_intros dict is module-level and reset at each run start, so intro_roster.json is overwritten per-run rather than aggregated across runs. The second pass's roster shows only 89 unique names from its 53 episodes, not the 303+ cumulative across the full archive. The per-episode intros.json files remain authoritative and the DB import reads from those, so the canonical name list in the DB (intros table — 3,043 rows) is correct. Future cleanup: load+merge existing roster.json before populating, or just drop the aggregate JSON and rely on SELECT DISTINCT name FROM intros against the DB.

Final DB State

DB:    archive-data/archive.db  (57.7 MB)

Year	Episodes	Hours
2010	43	32.1
2011	200	147.8
2012	98	70.4
2014	81	64.5
2015	50	38.4
2016	54	61.7
2017	41	60.5
2018	5	7.3
Total	572	482.7

(Source archive has no 2013 directory.)

Table	Rows
episodes	572
segments	60,917
turns	25,918
intros	3,043
qa_pairs	1,407

Top 10 recurring caller names (by Q&A count):

Caller	Pairs
Mark	77
John	52
Steve	49
Tom	46
Richard	30
Paul	29
Robert	29
Bill	27
Jeff	27
Bob	25

Note: the "Tom" here is unrelated to the prior session's co-host identity error. These caller_name values come from the host's spoken introductions in the transcript (e.g. "Let's talk to Tom..."), not from voice profiling. Real Tucson listeners named Tom calling in over 8 years.

Run Stats (this session)

batch_process.py — first pass (519-snapshot)

=== Done ===
  processed       : 449
  cached (skipped): 68
  in-progress     : 2
  errors          : 0
  wall time       : 276.9 min
  roster          : 303 unique names

batch_process.py — second pass (53 stragglers)

=== Done ===
  processed       : 53
  cached (skipped): 519
  in-progress     : 0
  errors          : 0
  wall time       : 56.4 min
  roster          : 89 unique names  (per-run overwrite, see Problems)

import_to_sqlite.py — incremental (no flags)

Found 572 complete episode directories
=== Done in 3.4s ===
  inserted : 364
  updated  : 0
  skipped  : 208
  errors   : 0
  db       : archive-data/archive.db  (57.7 MB)

Credentials

No new credentials this session. Reference from prior session log:

IX Server (archive source — used yesterday, idle now)

Vault path: infrastructure/ix-server.sops.yaml
Host: 172.16.3.10 (Tailscale)
External: ix.azcomputerguru.com / 72.194.62.5
SSH port: 22
Username: root
Password: Gptf*77ttb!@#!@#

Jupiter (Unraid — pending deploy target)

Vault path: infrastructure/jupiter-unraid-primary.sops.yaml
Container setup not yet started.

Infrastructure & Paths

Resource	Value
Audio processor root	`c:\Users\guru\ClaudeTools\projects\radio-show\audio-processor\`
Local DB	`archive-data/archive.db` (57.7 MB)
Per-episode outputs	`archive-data/transcripts/<year>/.../<stem>/{transcript,diarization,intros,qa}.json` + `transcript.{txt,srt}`
MP3s	`archive-data/episodes/<year>/...` (572 files / ~7.5 GB)
Background logs	`logs/{download,batch_process}.log`
Planned Jupiter mount	`/mnt/user/appdata/radio-archive/`

Commands Run

# Second batch pass (after first 519-snapshot completed)
.venv/Scripts/python.exe batch_process.py >> logs/batch_process.log 2>&1

# Final incremental import to bring DB to 572 episodes
.venv/Scripts/python.exe import_to_sqlite.py

Pending / Next Up

Stand up the Jupiter Docker container (the only remaining work item from the original plan):
- Create /mnt/user/appdata/radio-archive/ on Jupiter
- Container: FastAPI + sqlite (~50 lines), read-only mount of archive.db
- Bind only on Tailscale interface, not the public IP
- Deploy step: rsync archive.db from GURU-BEAST-ROG → Jupiter
(Followup) Fix the intro_roster.json aggregation bug — load+merge existing roster on startup, or drop the aggregate JSON and use SELECT DISTINCT name FROM intros instead.
(Future, deferred from prior session) Build voice profiles for Randall, Rob, and named producers (Andrew/Shannon/Ken) so non-Mike-non-Tara voices stop being labeled CALLER and inflating Q&A false positives in early-years and 2018/2019 episodes. The current 1,407 Q&A pairs include some unknown number of these false positives.
(Future) Speaker-name resolution view — once query patterns emerge, decide whether to materialize a SQL view that joins turns (role labels) ↔ intros (real names) by time-window.

Reference Information

Re-run the importer any time new episodes are processed: .venv/Scripts/python.exe import_to_sqlite.py. Idempotent. Skip-by-sha256 means only changed/new episodes do real work. ~50 ms per skip-decision, ~5 ms per insert.
Full rebuild: import_to_sqlite.py --rebuild (drops and recreates the DB; ~3 seconds for the current 572-episode dataset).
Schema and triggers are in import_to_sqlite.py itself — SCHEMA and TRIGGERS constants at top of file.
FTS query examples (verified working):
- SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12) FROM segments_fts JOIN segments s ON s.id=segments_fts.rowid JOIN episodes e ON e.id=s.episode_id WHERE segments_fts MATCH 'wireless'
- Same shape for qa_fts against qa_pairs.

Update: 06:02

Continuation later the same day — the deferred Jupiter Docker container was built, deployed, and verified end-to-end.

Update Summary

A small FastAPI + SQLite query server was built in projects/radio-show/audio-processor/server/ and deployed to Jupiter (Unraid) as a Docker container. The container is bound to 172.16.3.20:8765, runs with restart: unless-stopped, and serves a read-only mount of archive.db. End-to-end verification from GURU-BEAST-ROG over the LAN confirmed GET / (HTML index, 200 in 111 ms, 3,647 bytes), /api/stats (correct counts), /api/search?q=BIOS&kind=qa (Pete 2012-09-22 + Frank 2011-09-24, bm25-ranked, snippets with <mark> highlighting), and clean uvicorn logs.

The HTML index page uses fetch + vanilla JS with debounced search (250 ms) and a kind filter (both / Q&A only / transcript only). No build step. The SQLite connection uses mode=ro URI and check_same_thread=False so future multi-worker uvicorn deployments can share the connection safely. Source committed at 71ada13 on main after rebasing over an intervening commit (b3da922) from another machine.

Total work item is closed: the original architectural plan (sqlite-first, Jupiter Docker, internal-only) is fully realized and reachable at http://172.16.3.20:8765/.

Key Decisions (this update)

LAN-only re-framing: Jupiter is NOT on Tailscale (no daemon, not in tailnet status). The original "Tailscale-only" framing was tightened to "LAN-only via 172.16.3.0/22". Effective protection is the same — no public exposure — and any user on the office LAN or with subnet routing through Tailscale can reach the service.
Explicit host bind 172.16.3.20:8765:8765 in compose.yml instead of 0.0.0.0:8765:8765 or 8765:8765. Pinning to a specific host IP eliminates accidental exposure if Jupiter ever gets a second IP (e.g. an Tailscale interface added later).
Single-file HTML, no build step: index served as a 3.6 KB string from a @app.get("/") handler. Trades polish for zero deploy complexity. If the UI grows, swap in a static/ mount and a real frontend.
mode=ro + check_same_thread=False: read-only is enforced at the SQLite-URI layer (immune to a future bug in app-level read-only checks). check_same_thread=False is safe with mode=ro and lets uvicorn's worker model share connections without per-request reconnects.
Build on Jupiter, not locally: avoids docker save | ssh ... docker load round-trip and the cross-platform image-format edge cases. Only source files crossed the wire.

Problems Encountered (this update)

No new problems. Smoke test was clean; deploy was clean. Push hit the gitea pre-receive once because Howard/another machine had landed b3da922 in the meantime — resolved with a clean git pull --rebase (single commit, no conflicts) and a re-push.

Files Created (this update)

Path	Purpose
`projects/radio-show/audio-processor/server/main.py`	FastAPI app — endpoints + HTML index
`projects/radio-show/audio-processor/server/requirements.txt`	`fastapi==0.115.6`, `uvicorn[standard]==0.34.0`
`projects/radio-show/audio-processor/server/Dockerfile`	python:3.12-slim base, pip install, copy main.py, default `python -u main.py`
`projects/radio-show/audio-processor/server/compose.yml`	Service `radio-archive`, bind `172.16.3.20:8765`, mount `/mnt/user/appdata/radio-archive/data:/data:ro`, `restart: unless-stopped`

Endpoints (live on http://172.16.3.20:8765/)

Method	Path	Notes
GET	`/`	HTML search UI (debounced, kind filter, snippet highlighting)
GET	`/api/stats`	Counts per table + episodes/hours per year
GET	`/api/episodes?year=YYYY&limit=N`	List, ordered by air_date then title
GET	`/api/episodes/{id}`	Episode meta + intros + qa_pairs
GET	`/api/episodes/{id}/transcript`	Segments + diarization turns (chronological)
GET	`/api/search?q=...&kind=both\|segments\|qa&limit=N`	FTS5, bm25-ranked, snippet-formatted with `<mark>`
GET	`/api/callers?limit=N`	Top recurring caller_names

Credentials (used this update)

Jupiter (Unraid Primary)

Vault path: infrastructure/jupiter-unraid-primary.sops.yaml
Host: 172.16.3.20 (LAN; office network 172.16.3.0/22)
SSH port: 22
Username: root
Password: Th1nk3r^99##
iDRAC IP: 172.16.1.73
iDRAC user / pass: root / Window123!@#-idrac
OS / Docker: Linux Jupiter 6.12.54-Unraid; Docker 27.5.1
Notes: Primary container host (Gitea, NPM, Seafile + many others). Tailscale daemon NOT installed on this host.

Infrastructure & Paths (Jupiter)

Resource	Value
App source dir on Jupiter	`/mnt/user/appdata/radio-archive/app/`
DB dir on Jupiter	`/mnt/user/appdata/radio-archive/data/`
DB file on Jupiter	`/mnt/user/appdata/radio-archive/data/archive.db` (60,465,152 bytes)
Image	`radio-archive:latest` (sha256:9a04d06ba2c1…)
Container name	`radio-archive`
Bind	`172.16.3.20:8765 -> 8765/tcp`
Restart policy	`unless-stopped`
Service URL	http://172.16.3.20:8765/

Commands Run

# SSH (PuTTY plink — system OpenSSH password auth doesn't work without sshpass)
echo y | "/c/Program Files/PuTTY/plink.exe" -ssh -pw '<jupiter pw>' root@172.16.3.20 "<cmd>"

# File transfer (PuTTY pscp)
"/c/Program Files/PuTTY/pscp.exe" -batch -pw '<jupiter pw>' -scp \
    server/main.py server/requirements.txt server/Dockerfile server/compose.yml \
    root@172.16.3.20:/mnt/user/appdata/radio-archive/app/

"/c/Program Files/PuTTY/pscp.exe" -batch -pw '<jupiter pw>' -scp \
    archive-data/archive.db \
    root@172.16.3.20:/mnt/user/appdata/radio-archive/data/archive.db

# Build + run (on Jupiter via SSH)
cd /mnt/user/appdata/radio-archive/app && docker compose build  # ~70s
cd /mnt/user/appdata/radio-archive/app && docker compose up -d  # ~5s

# Verify (from GURU-BEAST-ROG)
curl -s http://172.16.3.20:8765/api/stats
curl -s "http://172.16.3.20:8765/api/search?q=BIOS&kind=qa"

Re-deploy Procedure (when DB updates)

# 1. Regenerate DB on GURU-BEAST-ROG
cd /c/Users/guru/ClaudeTools/projects/radio-show/audio-processor
.venv/Scripts/python.exe import_to_sqlite.py        # incremental
# or for full rebuild:
.venv/Scripts/python.exe import_to_sqlite.py --rebuild

# 2. Ship to Jupiter
"/c/Program Files/PuTTY/pscp.exe" -pw '<pw>' -scp archive-data/archive.db \
  root@172.16.3.20:/mnt/user/appdata/radio-archive/data/archive.db

# 3. No container restart needed — read-only connection picks up fresh data on next request

Re-deploy Procedure (when source changes)

# 1. ship updated source
"/c/Program Files/PuTTY/pscp.exe" -pw '<pw>' -scp server/*.py server/Dockerfile \
  root@172.16.3.20:/mnt/user/appdata/radio-archive/app/

# 2. rebuild + restart on Jupiter
echo y | "/c/Program Files/PuTTY/plink.exe" -ssh -pw '<pw>' root@172.16.3.20 \
  "cd /mnt/user/appdata/radio-archive/app && docker compose build && docker compose up -d"

Pending / Next Up (refreshed)

~~Stand up the Jupiter Docker container~~ DONE (this update).
(Followup) Fix the intro_roster.json aggregation bug — load+merge or drop-and-use-DB.
(Future) Voice profiles for Randall, Rob, and producers (Andrew/Shannon/Ken). Currently every non-Mike-non-Tara voice falls into the CALLER bucket, inflating Q&A false positives in early-years and 2018+ episodes.
(Future) Speaker-name resolution view — defer until query patterns emerge.
(Optional) Add Tailscale to Jupiter (subnet router for 172.16.3.0/22, or direct daemon for ts- hostname access). Only needed if remote, off-LAN access becomes a requirement.

15 KiB Raw Permalink Blame History

Session Log — 2026-04-28

User

Session Summary

Key Decisions

Problems Encountered

Final DB State

Run Stats (this session)

batch_process.py — first pass (519-snapshot)

batch_process.py — second pass (53 stragglers)

import_to_sqlite.py — incremental (no flags)

Credentials

IX Server (archive source — used yesterday, idle now)

Jupiter (Unraid — pending deploy target)

Infrastructure & Paths

Commands Run

Pending / Next Up

Reference Information

Update: 06:02

Update Summary

Key Decisions (this update)

Problems Encountered (this update)

Files Created (this update)

Endpoints (live on http://172.16.3.20:8765/)

Credentials (used this update)

Jupiter (Unraid Primary)

Infrastructure & Paths (Jupiter)

Commands Run

Re-deploy Procedure (when DB updates)

Re-deploy Procedure (when source changes)

Pending / Next Up (refreshed)

15 KiB

Raw Permalink Blame History