Append to 2026-04-28-session.md covering the FastAPI/SQLite container deploy: build + ship + verify, plus credentials, paths, and re-deploy procedures for both DB updates and source updates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
323 lines
15 KiB
Markdown
323 lines
15 KiB
Markdown
# Session Log — 2026-04-28
|
|
|
|
**Project:** The Computer Guru Show — Archive Mining System
|
|
**Goal:** Complete the in-flight batch + sqlite import; bring DB to final state
|
|
**Machine:** GURU-BEAST-ROG (RTX 4090, 24GB)
|
|
**User:** Mike Swanson (mike)
|
|
|
|
Continuation of `2026-04-27-archive-batch-and-sqlite-import.md`. Short execution-only session — finished the runs that were started yesterday, no new architecture or code.
|
|
|
|
---
|
|
|
|
## User
|
|
- **User:** Mike Swanson (mike)
|
|
- **Machine:** GURU-BEAST-ROG
|
|
- **Role:** admin
|
|
|
|
---
|
|
|
|
## Session Summary
|
|
|
|
The session completed the final processing steps for the Computer Guru Radio Show archive. `batch_process.py` finished its first 519-file snapshot — 449 processed, 68 cached, 0 errors, 4.6 hours wall — and was relaunched to pick up the 53 stragglers (late-2016 plus all of 2017/2018) that arrived after the original snapshot was taken. The second pass took 56.4 minutes with 0 errors. The encoding fix made yesterday in `src/transcriber.py` held across both passes — no charmap errors recurred.
|
|
|
|
`import_to_sqlite.py` was run once after the second batch pass, incrementally over all 572 complete episode directories. 364 newly inserted (the second-pass output plus everything done after the smoke test), 208 skipped via sha256 match, 0 errors, 3.4 seconds. The DB grew from 20.5 MB → 57.7 MB and now covers the entire downloadable archive: 572 episodes, 482.7 audio-hours, 60,917 transcript segments, 25,918 diarization turns, 3,043 intros, 1,407 Q&A pairs.
|
|
|
|
Both tasks (#1 download, #2 batch_process) are now complete. The next step is the deferred Jupiter Docker container deployment so the DB can be queried over Tailscale by Howard or me during show prep.
|
|
|
|
---
|
|
|
|
## Key Decisions
|
|
|
|
- No new architectural decisions this session — execution pass on the prior session's plan.
|
|
|
|
---
|
|
|
|
## Problems Encountered
|
|
|
|
- **None.** No errors during either batch pass or the import.
|
|
- One pre-existing minor bug surfaced (not blocking, not fixed): `batch_process.py`'s `all_intros` dict is module-level and reset at each run start, so `intro_roster.json` is overwritten per-run rather than aggregated across runs. The second pass's roster shows only 89 unique names from its 53 episodes, not the 303+ cumulative across the full archive. The per-episode `intros.json` files remain authoritative and the DB import reads from those, so the canonical name list in the DB (`intros` table — 3,043 rows) is correct. Future cleanup: load+merge existing roster.json before populating, or just drop the aggregate JSON and rely on `SELECT DISTINCT name FROM intros` against the DB.
|
|
|
|
---
|
|
|
|
## Final DB State
|
|
|
|
```
|
|
DB: archive-data/archive.db (57.7 MB)
|
|
```
|
|
|
|
| Year | Episodes | Hours |
|
|
|---|---|---|
|
|
| 2010 | 43 | 32.1 |
|
|
| 2011 | 200 | 147.8 |
|
|
| 2012 | 98 | 70.4 |
|
|
| 2014 | 81 | 64.5 |
|
|
| 2015 | 50 | 38.4 |
|
|
| 2016 | 54 | 61.7 |
|
|
| 2017 | 41 | 60.5 |
|
|
| 2018 | 5 | 7.3 |
|
|
| **Total** | **572** | **482.7** |
|
|
|
|
(Source archive has no 2013 directory.)
|
|
|
|
| Table | Rows |
|
|
|-----------|--------|
|
|
| episodes | 572 |
|
|
| segments | 60,917 |
|
|
| turns | 25,918 |
|
|
| intros | 3,043 |
|
|
| qa_pairs | 1,407 |
|
|
|
|
**Top 10 recurring caller names (by Q&A count):**
|
|
|
|
| Caller | Pairs |
|
|
|---|---|
|
|
| Mark | 77 |
|
|
| John | 52 |
|
|
| Steve | 49 |
|
|
| Tom | 46 |
|
|
| Richard | 30 |
|
|
| Paul | 29 |
|
|
| Robert | 29 |
|
|
| Bill | 27 |
|
|
| Jeff | 27 |
|
|
| Bob | 25 |
|
|
|
|
Note: the "Tom" here is unrelated to the prior session's co-host identity error. These `caller_name` values come from the host's spoken introductions in the transcript (e.g. *"Let's talk to Tom..."*), not from voice profiling. Real Tucson listeners named Tom calling in over 8 years.
|
|
|
|
---
|
|
|
|
## Run Stats (this session)
|
|
|
|
### batch_process.py — first pass (519-snapshot)
|
|
```
|
|
=== Done ===
|
|
processed : 449
|
|
cached (skipped): 68
|
|
in-progress : 2
|
|
errors : 0
|
|
wall time : 276.9 min
|
|
roster : 303 unique names
|
|
```
|
|
|
|
### batch_process.py — second pass (53 stragglers)
|
|
```
|
|
=== Done ===
|
|
processed : 53
|
|
cached (skipped): 519
|
|
in-progress : 0
|
|
errors : 0
|
|
wall time : 56.4 min
|
|
roster : 89 unique names (per-run overwrite, see Problems)
|
|
```
|
|
|
|
### import_to_sqlite.py — incremental (no flags)
|
|
```
|
|
Found 572 complete episode directories
|
|
=== Done in 3.4s ===
|
|
inserted : 364
|
|
updated : 0
|
|
skipped : 208
|
|
errors : 0
|
|
db : archive-data/archive.db (57.7 MB)
|
|
```
|
|
|
|
---
|
|
|
|
## Credentials
|
|
|
|
No new credentials this session. Reference from prior session log:
|
|
|
|
### IX Server (archive source — used yesterday, idle now)
|
|
- **Vault path:** `infrastructure/ix-server.sops.yaml`
|
|
- **Host:** 172.16.3.10 (Tailscale)
|
|
- **External:** ix.azcomputerguru.com / 72.194.62.5
|
|
- **SSH port:** 22
|
|
- **Username:** root
|
|
- **Password:** `Gptf*77ttb!@#!@#`
|
|
|
|
### Jupiter (Unraid — pending deploy target)
|
|
- **Vault path:** `infrastructure/jupiter-unraid-primary.sops.yaml`
|
|
- Container setup not yet started.
|
|
|
|
---
|
|
|
|
## Infrastructure & Paths
|
|
|
|
| Resource | Value |
|
|
|---|---|
|
|
| Audio processor root | `c:\Users\guru\ClaudeTools\projects\radio-show\audio-processor\` |
|
|
| Local DB | `archive-data/archive.db` (57.7 MB) |
|
|
| Per-episode outputs | `archive-data/transcripts/<year>/.../<stem>/{transcript,diarization,intros,qa}.json` + `transcript.{txt,srt}` |
|
|
| MP3s | `archive-data/episodes/<year>/...` (572 files / ~7.5 GB) |
|
|
| Background logs | `logs/{download,batch_process}.log` |
|
|
| Planned Jupiter mount | `/mnt/user/appdata/radio-archive/` |
|
|
|
|
---
|
|
|
|
## Commands Run
|
|
|
|
```bash
|
|
# Second batch pass (after first 519-snapshot completed)
|
|
.venv/Scripts/python.exe batch_process.py >> logs/batch_process.log 2>&1
|
|
|
|
# Final incremental import to bring DB to 572 episodes
|
|
.venv/Scripts/python.exe import_to_sqlite.py
|
|
```
|
|
|
|
---
|
|
|
|
## Pending / Next Up
|
|
|
|
1. **Stand up the Jupiter Docker container** (the only remaining work item from the original plan):
|
|
- Create `/mnt/user/appdata/radio-archive/` on Jupiter
|
|
- Container: FastAPI + sqlite (~50 lines), read-only mount of `archive.db`
|
|
- Bind only on Tailscale interface, not the public IP
|
|
- Deploy step: rsync `archive.db` from GURU-BEAST-ROG → Jupiter
|
|
2. **(Followup) Fix the intro_roster.json aggregation bug** — load+merge existing roster on startup, or drop the aggregate JSON and use `SELECT DISTINCT name FROM intros` instead.
|
|
3. **(Future, deferred from prior session)** Build voice profiles for Randall, Rob, and named producers (Andrew/Shannon/Ken) so non-Mike-non-Tara voices stop being labeled CALLER and inflating Q&A false positives in early-years and 2018/2019 episodes. The current 1,407 Q&A pairs include some unknown number of these false positives.
|
|
4. **(Future) Speaker-name resolution view** — once query patterns emerge, decide whether to materialize a SQL view that joins `turns` (role labels) ↔ `intros` (real names) by time-window.
|
|
|
|
---
|
|
|
|
## Reference Information
|
|
|
|
- **Re-run the importer any time** new episodes are processed: `.venv/Scripts/python.exe import_to_sqlite.py`. Idempotent. Skip-by-sha256 means only changed/new episodes do real work. ~50 ms per skip-decision, ~5 ms per insert.
|
|
- **Full rebuild:** `import_to_sqlite.py --rebuild` (drops and recreates the DB; ~3 seconds for the current 572-episode dataset).
|
|
- **Schema and triggers** are in `import_to_sqlite.py` itself — `SCHEMA` and `TRIGGERS` constants at top of file.
|
|
- **FTS query examples (verified working):**
|
|
- `SELECT e.title, snippet(segments_fts, 0, '[', ']', '...', 12) FROM segments_fts JOIN segments s ON s.id=segments_fts.rowid JOIN episodes e ON e.id=s.episode_id WHERE segments_fts MATCH 'wireless'`
|
|
- Same shape for `qa_fts` against `qa_pairs`.
|
|
|
|
---
|
|
|
|
## Update: 06:02
|
|
|
|
Continuation later the same day — the deferred Jupiter Docker container was built, deployed, and verified end-to-end.
|
|
|
|
### Update Summary
|
|
|
|
A small FastAPI + SQLite query server was built in `projects/radio-show/audio-processor/server/` and deployed to Jupiter (Unraid) as a Docker container. The container is bound to `172.16.3.20:8765`, runs with `restart: unless-stopped`, and serves a read-only mount of `archive.db`. End-to-end verification from `GURU-BEAST-ROG` over the LAN confirmed `GET /` (HTML index, 200 in 111 ms, 3,647 bytes), `/api/stats` (correct counts), `/api/search?q=BIOS&kind=qa` (Pete 2012-09-22 + Frank 2011-09-24, bm25-ranked, snippets with `<mark>` highlighting), and clean uvicorn logs.
|
|
|
|
The HTML index page uses fetch + vanilla JS with debounced search (250 ms) and a kind filter (both / Q&A only / transcript only). No build step. The SQLite connection uses `mode=ro` URI and `check_same_thread=False` so future multi-worker uvicorn deployments can share the connection safely. Source committed at `71ada13` on `main` after rebasing over an intervening commit (`b3da922`) from another machine.
|
|
|
|
Total work item is closed: the original architectural plan (sqlite-first, Jupiter Docker, internal-only) is fully realized and reachable at **http://172.16.3.20:8765/**.
|
|
|
|
### Key Decisions (this update)
|
|
|
|
- **LAN-only re-framing**: Jupiter is NOT on Tailscale (no daemon, not in tailnet status). The original "Tailscale-only" framing was tightened to "LAN-only via 172.16.3.0/22". Effective protection is the same — no public exposure — and any user on the office LAN or with subnet routing through Tailscale can reach the service.
|
|
- **Explicit host bind `172.16.3.20:8765:8765`** in compose.yml instead of `0.0.0.0:8765:8765` or `8765:8765`. Pinning to a specific host IP eliminates accidental exposure if Jupiter ever gets a second IP (e.g. an Tailscale interface added later).
|
|
- **Single-file HTML, no build step**: index served as a 3.6 KB string from a `@app.get("/")` handler. Trades polish for zero deploy complexity. If the UI grows, swap in a `static/` mount and a real frontend.
|
|
- **`mode=ro` + `check_same_thread=False`**: read-only is enforced at the SQLite-URI layer (immune to a future bug in app-level read-only checks). `check_same_thread=False` is safe with `mode=ro` and lets uvicorn's worker model share connections without per-request reconnects.
|
|
- **Build on Jupiter, not locally**: avoids `docker save | ssh ... docker load` round-trip and the cross-platform image-format edge cases. Only source files crossed the wire.
|
|
|
|
### Problems Encountered (this update)
|
|
|
|
- **No new problems**. Smoke test was clean; deploy was clean. Push hit the gitea pre-receive once because Howard/another machine had landed `b3da922` in the meantime — resolved with a clean `git pull --rebase` (single commit, no conflicts) and a re-push.
|
|
|
|
### Files Created (this update)
|
|
|
|
| Path | Purpose |
|
|
|---|---|
|
|
| `projects/radio-show/audio-processor/server/main.py` | FastAPI app — endpoints + HTML index |
|
|
| `projects/radio-show/audio-processor/server/requirements.txt` | `fastapi==0.115.6`, `uvicorn[standard]==0.34.0` |
|
|
| `projects/radio-show/audio-processor/server/Dockerfile` | python:3.12-slim base, pip install, copy main.py, default `python -u main.py` |
|
|
| `projects/radio-show/audio-processor/server/compose.yml` | Service `radio-archive`, bind `172.16.3.20:8765`, mount `/mnt/user/appdata/radio-archive/data:/data:ro`, `restart: unless-stopped` |
|
|
|
|
### Endpoints (live on http://172.16.3.20:8765/)
|
|
|
|
| Method | Path | Notes |
|
|
|---|---|---|
|
|
| GET | `/` | HTML search UI (debounced, kind filter, snippet highlighting) |
|
|
| GET | `/api/stats` | Counts per table + episodes/hours per year |
|
|
| GET | `/api/episodes?year=YYYY&limit=N` | List, ordered by air_date then title |
|
|
| GET | `/api/episodes/{id}` | Episode meta + intros + qa_pairs |
|
|
| GET | `/api/episodes/{id}/transcript` | Segments + diarization turns (chronological) |
|
|
| GET | `/api/search?q=...&kind=both\|segments\|qa&limit=N` | FTS5, bm25-ranked, snippet-formatted with `<mark>` |
|
|
| GET | `/api/callers?limit=N` | Top recurring caller_names |
|
|
|
|
### Credentials (used this update)
|
|
|
|
#### Jupiter (Unraid Primary)
|
|
- **Vault path:** `infrastructure/jupiter-unraid-primary.sops.yaml`
|
|
- **Host:** 172.16.3.20 (LAN; office network 172.16.3.0/22)
|
|
- **SSH port:** 22
|
|
- **Username:** root
|
|
- **Password:** `Th1nk3r^99##`
|
|
- **iDRAC IP:** 172.16.1.73
|
|
- **iDRAC user / pass:** root / `Window123!@#-idrac`
|
|
- **OS / Docker:** Linux Jupiter 6.12.54-Unraid; Docker 27.5.1
|
|
- **Notes:** Primary container host (Gitea, NPM, Seafile + many others). Tailscale daemon NOT installed on this host.
|
|
|
|
### Infrastructure & Paths (Jupiter)
|
|
|
|
| Resource | Value |
|
|
|---|---|
|
|
| App source dir on Jupiter | `/mnt/user/appdata/radio-archive/app/` |
|
|
| DB dir on Jupiter | `/mnt/user/appdata/radio-archive/data/` |
|
|
| DB file on Jupiter | `/mnt/user/appdata/radio-archive/data/archive.db` (60,465,152 bytes) |
|
|
| Image | `radio-archive:latest` (sha256:9a04d06ba2c1…) |
|
|
| Container name | `radio-archive` |
|
|
| Bind | `172.16.3.20:8765 -> 8765/tcp` |
|
|
| Restart policy | `unless-stopped` |
|
|
| Service URL | http://172.16.3.20:8765/ |
|
|
|
|
### Commands Run
|
|
|
|
```bash
|
|
# SSH (PuTTY plink — system OpenSSH password auth doesn't work without sshpass)
|
|
echo y | "/c/Program Files/PuTTY/plink.exe" -ssh -pw '<jupiter pw>' root@172.16.3.20 "<cmd>"
|
|
|
|
# File transfer (PuTTY pscp)
|
|
"/c/Program Files/PuTTY/pscp.exe" -batch -pw '<jupiter pw>' -scp \
|
|
server/main.py server/requirements.txt server/Dockerfile server/compose.yml \
|
|
root@172.16.3.20:/mnt/user/appdata/radio-archive/app/
|
|
|
|
"/c/Program Files/PuTTY/pscp.exe" -batch -pw '<jupiter pw>' -scp \
|
|
archive-data/archive.db \
|
|
root@172.16.3.20:/mnt/user/appdata/radio-archive/data/archive.db
|
|
|
|
# Build + run (on Jupiter via SSH)
|
|
cd /mnt/user/appdata/radio-archive/app && docker compose build # ~70s
|
|
cd /mnt/user/appdata/radio-archive/app && docker compose up -d # ~5s
|
|
|
|
# Verify (from GURU-BEAST-ROG)
|
|
curl -s http://172.16.3.20:8765/api/stats
|
|
curl -s "http://172.16.3.20:8765/api/search?q=BIOS&kind=qa"
|
|
```
|
|
|
|
### Re-deploy Procedure (when DB updates)
|
|
|
|
```bash
|
|
# 1. Regenerate DB on GURU-BEAST-ROG
|
|
cd /c/Users/guru/ClaudeTools/projects/radio-show/audio-processor
|
|
.venv/Scripts/python.exe import_to_sqlite.py # incremental
|
|
# or for full rebuild:
|
|
.venv/Scripts/python.exe import_to_sqlite.py --rebuild
|
|
|
|
# 2. Ship to Jupiter
|
|
"/c/Program Files/PuTTY/pscp.exe" -pw '<pw>' -scp archive-data/archive.db \
|
|
root@172.16.3.20:/mnt/user/appdata/radio-archive/data/archive.db
|
|
|
|
# 3. No container restart needed — read-only connection picks up fresh data on next request
|
|
```
|
|
|
|
### Re-deploy Procedure (when source changes)
|
|
|
|
```bash
|
|
# 1. ship updated source
|
|
"/c/Program Files/PuTTY/pscp.exe" -pw '<pw>' -scp server/*.py server/Dockerfile \
|
|
root@172.16.3.20:/mnt/user/appdata/radio-archive/app/
|
|
|
|
# 2. rebuild + restart on Jupiter
|
|
echo y | "/c/Program Files/PuTTY/plink.exe" -ssh -pw '<pw>' root@172.16.3.20 \
|
|
"cd /mnt/user/appdata/radio-archive/app && docker compose build && docker compose up -d"
|
|
```
|
|
|
|
### Pending / Next Up (refreshed)
|
|
|
|
1. ~~**Stand up the Jupiter Docker container**~~ DONE (this update).
|
|
2. **(Followup)** Fix the intro_roster.json aggregation bug — load+merge or drop-and-use-DB.
|
|
3. **(Future)** Voice profiles for Randall, Rob, and producers (Andrew/Shannon/Ken). Currently every non-Mike-non-Tara voice falls into the CALLER bucket, inflating Q&A false positives in early-years and 2018+ episodes.
|
|
4. **(Future)** Speaker-name resolution view — defer until query patterns emerge.
|
|
5. **(Optional)** Add Tailscale to Jupiter (subnet router for 172.16.3.0/22, or direct daemon for `ts-` hostname access). Only needed if remote, off-LAN access becomes a requirement.
|