claudetools/projects/radio-show/audio-processor/session-logs/2026-04-29-qa-quality-classifier.md
Mike Swanson 8d4bb16255 radio: session log — Q/A usefulness classifier (Track 1) complete
3.5h run on qwen3:14b processed 1,405/1,407 Q/A pairs (2 failed,
will retry on next invocation). 37% scored 4-5 (useful), 41%
scored 1-2 (banter/promo/off-topic). API filter ready; Jupiter
redeploy pending Mike's manual review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 17:34:15 -07:00


Session Log — 2026-04-29 — Q/A Quality Classifier (Track 1)

Project: The Computer Guru Show — Archive Mining System
Goal: Add LLM-based usefulness scoring to all 1,407 existing Q/A pairs so search can filter out banter/promo/off-topic without losing real listener questions
Machine: GURU-BEAST-ROG (RTX 4090)
User: Mike Swanson (mike)


Session Summary

Mike asked whether voice profiles should be built for recurring guests/co-hosts to fix Q/A pair quality. The conversation pivoted to a more direct goal: usefulness of Q/A search results regardless of speaker role. A producer asking a real Windows question is useful Q&A; the same producer joking with the host is banter. The transcript text is rich enough to classify with an LLM.

Track 1 (LLM-based usefulness classifier) was delegated to the Coding Agent. The agent designed and wrote three deliverables: schema migration in import_to_sqlite.py, a new classify_qa_quality.py script, and min_score + exclude_banter query params on /api/search. The agent's full classifier run was kicked off but stalled at 0/1407 — the agent's Bash background subprocess timed out before the classifier could make progress.

Relaunched the classifier as a detached PowerShell process (PID 29796) so it would survive Bash subshell timeouts. Polled progress at 25-30 min intervals via ScheduleWakeup. The full run took 3.5 hours for 1,407 rows on qwen3:14b — ~6.7 rows/min steady throughput. Final result: 1,405 succeeded, 2 failed (0.14%).

Commit 4268890 pushed to main. Jupiter Docker container has NOT been redeployed yet — Mike does that manually after reviewing.


Final Distribution

SCORE     COUNT   SHARE
  5         362   25.8%   ← clear computer help
  4         161   11.5%
  3         309   22.0%   ← borderline / general advice
  2         257   18.3%
  1         316   22.5%   ← pure banter / promo

useful (4-5):  523  (37.2%)
noise (1-2):   573  (40.8%)

TOPIC_CLASS:
  computer-help  928  66.0%
  banter         248  17.7%
  off-topic      147  10.5%
  promo           78   5.6%
  unclear          4   0.3%

is_banter=True: 606  (43.1%)

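About 41% of the classified Q/A pairs (scores 1-2) are noise that a min_score filter can drop at search time. The spot check in this log found no real listener questions among the low-scoring samples it reviewed, though that sample is small.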
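
The band shares above can be recomputed from the per-score counts. The following is a minimal sketch, not the project's code: it assumes a qa_pairs table with a usefulness_score column (the column name appears in this log's WHERE clause), and the helper names build_demo_db, score_counts, and share are hypothetical. The demo database is filled with the counts reported in the table above plus two NULL rows standing in for the failed classifications.

```python
import sqlite3

# Per-score counts as reported in this log; the 2 failed rows stay NULL.
REPORTED_COUNTS = {5: 362, 4: 161, 3: 309, 2: 257, 1: 316}


def build_demo_db(counts, unclassified=2):
    """Build an in-memory stand-in for archive.db (hypothetical minimal schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE qa_pairs (id INTEGER PRIMARY KEY, usefulness_score INTEGER)"
    )
    for score, n in counts.items():
        db.executemany(
            "INSERT INTO qa_pairs (usefulness_score) VALUES (?)", [(score,)] * n
        )
    db.executemany(
        "INSERT INTO qa_pairs (usefulness_score) VALUES (?)", [(None,)] * unclassified
    )
    db.commit()
    return db


def score_counts(db):
    """Return {score: count} over classified rows only (NULL rows are excluded)."""
    rows = db.execute(
        "SELECT usefulness_score, COUNT(*) FROM qa_pairs "
        "WHERE usefulness_score IS NOT NULL GROUP BY usefulness_score"
    ).fetchall()
    return dict(rows)


def share(counts, scores):
    """Percentage of classified rows whose score is in `scores`, to one decimal."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return round(100 * sum(counts.get(s, 0) for s in scores) / total, 1)


if __name__ == "__main__":
    counts = score_counts(build_demo_db(REPORTED_COUNTS))
    print("useful (4-5):", share(counts, (4, 5)))
    print("noise (1-2): ", share(counts, (1, 2)))
```

Computing a band share from the summed counts, as share does, avoids the small drift that comes from adding per-score percentages that were already rounded.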


Spot Check (manual review of 6 rows)

Score 1, banter (correctly flagged):

  • "We've got some options... we're open until 6 today and Monday" — business-hours chitchat
  • "How are you? Why would you unfriend some people?" — Facebook social talk
  • "I've been looking for something... eradic[ate]..." — fragmentary

Score 5, computer-help (correctly flagged):

  • "It's all dependent on the actual hardware, right? The actual TV..." — TV/hardware Q
  • "But, you know, if you put it on a drive render, can you get my data off..." — data recovery
  • "AZ Computer Guru on there also..." — antivirus/Avast advice

Classifier output matches human judgment on these samples.


Key Decisions

  • Deferred Track 2 (voice profile clustering) — content classifier has higher leverage for search quality and is independent of voice work. Voice profiles still useful but lower priority.
  • Detached classifier process instead of agent-managed Bash background — Bash tool timeout (10 min) was killing the long-running classifier. PowerShell Start-Process with redirected output runs to completion regardless of session state.
  • Default min_score=0 (no filter) on /api/search — keeps existing search clients working unchanged. UI can opt into min_score=3 or min_score=4 when ready.
  • NULL handling in the WHERE clause: (usefulness_score IS NULL OR usefulness_score >= :min_score) — unprocessed rows stay visible during incremental rollout.
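
The default and NULL-handling decisions above can be illustrated with a small query sketch. This is not the code in server/main.py: the search here is a plain LIKE match, the question column and the search_qa helper are hypothetical, and treating a NULL is_banter as "not banter" is an assumption made by analogy with the usefulness_score clause quoted above.

```python
import sqlite3

# The usefulness_score clause is the one quoted in this log; everything else
# (LIKE search, `question` column, is_banter NULL handling) is assumed.
SEARCH_SQL = """
SELECT id
FROM qa_pairs
WHERE question LIKE :pattern
  AND (usefulness_score IS NULL OR usefulness_score >= :min_score)
  AND (:exclude_banter = 0 OR is_banter IS NULL OR is_banter = 0)
ORDER BY id
"""


def search_qa(db, q, min_score=0, exclude_banter=False):
    """Return matching row ids; the defaults apply no quality filter."""
    params = {
        "pattern": f"%{q}%",
        "min_score": min_score,
        "exclude_banter": int(exclude_banter),
    }
    return [row[0] for row in db.execute(SEARCH_SQL, params)]


def demo_db():
    """Three rows: useful, banter, and not yet classified."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE qa_pairs (id INTEGER PRIMARY KEY, question TEXT, "
        "usefulness_score INTEGER, is_banter INTEGER)"
    )
    db.executemany(
        "INSERT INTO qa_pairs VALUES (?, ?, ?, ?)",
        [
            (1, "BIOS update fails", 5, 0),
            (2, "BIOS jokes", 1, 1),
            (3, "BIOS beep codes", None, None),  # unprocessed row
        ],
    )
    return db


if __name__ == "__main__":
    db = demo_db()
    print(search_qa(db, "BIOS"))                       # no filter
    print(search_qa(db, "BIOS", min_score=4))          # unprocessed row stays visible
    print(search_qa(db, "BIOS", exclude_banter=True))
```

With min_score=0 every classified row passes the comparison, so existing clients see the same results as before the migration.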

Problems Encountered

  • Coding Agent's classifier run stalled — kicked off via Bash run_in_background, hit the 10-min Bash tool timeout, never made progress. Agent went idle waiting for a completion notification that wasn't coming. Resolved by relaunching the script as a detached PowerShell process and polling DB progress via scheduled wakeups.
  • rich.Progress doesn't write clean log lines — the per-batch progress UI uses ANSI escapes; the classify.log was sparse but the script DID write a clean final summary table on completion (visible via Get-Content -Tail 20). DB count poll was the reliable progress signal.
  • 2 rows failed classification — the script logged them and skipped (left as NULL). Re-running classify_qa_quality.py (idempotent) will retry just those 2 rows.
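
The retry behaviour described above (failed rows stay NULL and are picked up on the next run) can be sketched as follows. This is an illustration under assumptions, not the contents of classify_qa_quality.py: the classify callable stands in for the Ollama call, the question, answer, topic_class, and is_banter column names are assumed, and classify_pending is a hypothetical name. It also prints one plain line per failure, which sidesteps the ANSI-escape problem noted for the progress UI.

```python
import sqlite3

BATCH_SIZE = 25  # the log mentions batch-25 commits


def classify_pending(db, classify, batch_size=BATCH_SIZE):
    """Classify only rows whose usefulness_score is NULL.

    `classify(question, answer)` is a stand-in for the LLM call and must
    return (score, topic_class, is_banter). Rows that raise are logged and
    left NULL, so a later run retries exactly those rows.
    """
    pending = db.execute(
        "SELECT id, question, answer FROM qa_pairs "
        "WHERE usefulness_score IS NULL ORDER BY id"
    ).fetchall()
    done = failed = 0
    for i, (row_id, question, answer) in enumerate(pending, start=1):
        try:
            score, topic, banter = classify(question, answer)
        except Exception as exc:  # log and skip; the row stays NULL
            failed += 1
            print(f"row {row_id} failed: {exc}", flush=True)
        else:
            db.execute(
                "UPDATE qa_pairs SET usefulness_score = ?, topic_class = ?, "
                "is_banter = ? WHERE id = ?",
                (score, topic, int(banter), row_id),
            )
            done += 1
        if i % batch_size == 0:
            db.commit()  # persist progress so a DB count poll can observe it
    db.commit()
    return done, failed
```

Because the selection is driven by NULL scores rather than by a separate progress file, re-running after a crash or a partial failure needs no extra bookkeeping.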

Files Changed

Path | Change
projects/radio-show/audio-processor/classify_qa_quality.py | NEW — 19 KB Ollama-based classifier with --smoke / --rebuild / --limit flags, batch-25 commits
projects/radio-show/audio-processor/server/main.py | +35 / -19 — added min_score and exclude_banter to /api/search; new fields in episode detail and search response
projects/radio-show/audio-processor/import_to_sqlite.py | (per Coding Agent's diff — schema migration adds 3 columns to qa_pairs)
projects/radio-show/audio-processor/archive-data/archive.db | Updated with all 1,405 classifications (NOT in git, ships separately to Jupiter)
projects/radio-show/audio-processor/logs/classify.log | Final run output (rich progress + summary table)
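
A schema migration of the kind listed for import_to_sqlite.py might look like the sketch below. The log does not name the three columns; usefulness_score, topic_class, and is_banter are inferred from the WHERE clause and the distribution labels elsewhere in this log, and the column types and the migrate_qa_pairs helper are assumptions.

```python
import sqlite3

# Assumed column names and types; the log only says "3 columns".
NEW_COLUMNS = {
    "usefulness_score": "INTEGER",
    "topic_class": "TEXT",
    "is_banter": "INTEGER",
}


def migrate_qa_pairs(db):
    """Add any missing classifier columns to qa_pairs; safe to run repeatedly."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in db.execute("PRAGMA table_info(qa_pairs)")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            # Names come from the constant above, never from user input.
            db.execute(f"ALTER TABLE qa_pairs ADD COLUMN {name} {sql_type}")
            added.append(name)
    db.commit()
    return added
```

Columns added this way are NULL for existing rows, which is the state the search filter treats as "not yet classified".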

Commands Used

# Detached classifier launch
Start-Process -FilePath .venv\Scripts\python.exe `
              -ArgumentList "-u", "classify_qa_quality.py" `
              -WorkingDirectory <path> `
              -RedirectStandardOutput logs\classify.log `
              -RedirectStandardError logs\classify.err `
              -WindowStyle Hidden -PassThru
# Progress poll
.venv/Scripts/python.exe -c "
import sqlite3
db = sqlite3.connect('archive-data/archive.db')
print(db.execute('SELECT COUNT(*) FROM qa_pairs WHERE usefulness_score IS NOT NULL').fetchone()[0])
"
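
The count returned by the progress poll can be turned into a throughput and a rough time remaining. A minimal sketch, assuming the same qa_pairs table; classified_count and estimate are hypothetical helper names, not part of the project.

```python
import sqlite3


def classified_count(db):
    """Number of rows that already have a usefulness_score."""
    return db.execute(
        "SELECT COUNT(*) FROM qa_pairs WHERE usefulness_score IS NOT NULL"
    ).fetchone()[0]


def estimate(done, total, elapsed_min):
    """Return (rows_per_min, minutes_remaining), or None if no progress yet."""
    if done <= 0 or elapsed_min <= 0:
        return None  # e.g. a stalled run sitting at 0/1407
    rate = done / elapsed_min
    return rate, (total - done) / rate


if __name__ == "__main__":
    # Figures from this log: 1,407 rows over a 3.5 h (210 min) run.
    rate, remaining = estimate(1407, 1407, 210)
    print(f"{rate:.1f} rows/min, {remaining:.0f} min remaining")
```

A None result on two consecutive polls is a cheap signal that the run has stalled rather than merely slowed.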

Pending / Next

  1. Deploy to Jupiter (manual — Mike's call):
    • pscp archive-data/archive.db root@172.16.3.20:/mnt/user/appdata/radio-archive/data/archive.db
    • pscp server/main.py root@172.16.3.20:/mnt/user/appdata/radio-archive/app/
    • ssh root@172.16.3.20 "cd /mnt/user/appdata/radio-archive/app && docker compose build && docker compose up -d"
  2. UI update (separate task) — the HTML index in server/main.py doesn't yet show usefulness_score badges or expose the min_score filter as a toggle. Backend is ready; UI work is next when desired.
  3. Re-run the 2 failed rows: classify_qa_quality.py is idempotent, so the next invocation retries only the rows still NULL.
  4. Track 2 (voice profile clustering) — still deferred. Lower priority now that the content classifier handles search quality.

Reference

  • Search with quality filter: curl 'http://172.16.3.20:8765/api/search?q=BIOS&kind=qa&min_score=4'
  • Without banter: curl 'http://172.16.3.20:8765/api/search?q=ssd&kind=qa&exclude_banter=true'
  • Default behavior unchanged when min_score=0 (the default).
  • Re-classify: classify_qa_quality.py --rebuild reprocesses everything (don't run unless prompt changes).
  • Spot check: classify_qa_quality.py --smoke runs 10 sample rows and prints classifications without writing to DB.
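
For client code, the query strings used in the curl examples above can be built programmatically. A small sketch using only the standard library; the search_url helper is hypothetical, and it omits min_score when it is 0 on the assumption that the server default applies.

```python
from urllib.parse import urlencode

BASE_URL = "http://172.16.3.20:8765/api/search"  # host and port from the curl examples


def search_url(q, min_score=0, exclude_banter=False, kind="qa"):
    """Build a /api/search URL; filters are added only when requested."""
    params = {"q": q, "kind": kind}
    if min_score > 0:
        params["min_score"] = min_score
    if exclude_banter:
        params["exclude_banter"] = "true"
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(search_url("BIOS", min_score=4))
    print(search_url("ssd", exclude_banter=True))
```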

Status at session end

  • Classifier: COMPLETE (1,405/1,407, 2 failed → NULL → will retry on next invocation)
  • API: ready (uvicorn imports cleanly, all routes intact)
  • DB: updated locally with all classifications
  • Git: commit 4268890 pushed to main
  • Jupiter: NOT redeployed (manual step pending Mike's review)