Added 2010, 2015, 2018 test episodes to round out the test set to one
per available year:
- 2010-05-08-hr1 (May 2010, earliest available; pre-Tara era)
- 2015-s7e19 (Jan 2015, avoids training's s7e30)
- 2018-s10e18 (only 3 non-training 2018 episodes exist)
The archive has no 2019 directory, so Rob's "2018/2019 appearances" can
only come from the 5 available 2018 episodes.
Per-year diarization summary (Tara presence, post-rename):
  Episode       Tara  Share  Assessment
  2010-05-08     30s   1.2%  likely false positive (pre-Tara)
  2011-03-12    140s   5.6%  likely false positive (call-in only)
  2012-03-10     30s   1.1%  likely false positive (call-in only)
  2012-06-09    340s  12.8%  suspicious; Mike to confirm
  2014-s6e19    680s  23.3%  confirmed
  2015-s7e19    280s   9.9%  plausible; Mike to confirm
  2016-s8e43   1890s  35.5%  confirmed
  2017-s9e30    610s  11.4%  plausible
  2018-s10e18   880s  17.1%  COULD BE ROB: Mike flagged Rob for
                             2018/2019 appearances; cosine threshold may
                             be hitting on Rob being acoustically similar
                             to Tara
Total Tara across 9 episodes: 1h 21m / 8h 52m audio (15.3%).
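The suspected Rob/Tara confusion can be illustrated with a minimal sketch of threshold-based speaker matching. Everything here is hypothetical: the 4-dimensional embeddings, the `THRESHOLD` value, and the `label` helper are illustrative assumptions, not the pipeline's actual model or settings. The point is only that a voice close to the reference clears a cosine threshold just as the reference speaker does.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; real speaker embeddings have hundreds of dimensions.
tara_ref = [0.9, 0.1, 0.3, 0.2]      # enrolled reference voice
similar_voice = [0.8, 0.2, 0.35, 0.25]  # acoustically close, different person
distant_voice = [0.1, 0.9, 0.2, 0.7]    # clearly different voice

THRESHOLD = 0.75  # assumed value, not the pipeline's real threshold


def label(segment_embedding, reference=tara_ref, threshold=THRESHOLD):
    """Label a segment TARA when it is close enough to the reference."""
    score = cosine_similarity(segment_embedding, reference)
    return "TARA" if score >= threshold else "OTHER"


for name, emb in [("similar", similar_voice), ("distant", distant_voice)]:
    print(name, round(cosine_similarity(emb, tara_ref), 3), label(emb))
```

If this is what is happening in the 2018 episode, raising the threshold or enrolling a separate reference for Rob would be the usual remedies, but that depends on how the real embeddings are distributed.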
Q&A counts (still suspect — every voice that isn't Mike-or-Tara is
labeled CALLER, so Randall/Rob/producers inflate the bucket):
2010=4, 2011=1, 2012a=2, 2012b=0, 2014=0, 2015=1, 2016=2, 2017=4, 2018=3
Total: 17 pairs across 9 episodes
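A small sketch of the labeling rule described above, which shows why the CALLER bucket is inflated. The speaker strings `MIKE`, `TARA`, `ROB`, and `RANDALL` and the `bucket` helper are assumptions for illustration; only the CALLER label and the per-episode counts come from this message.

```python
KNOWN_HOSTS = {"MIKE", "TARA"}  # assumed label strings


def bucket(speaker):
    """Collapse every voice that is not a known host into CALLER."""
    return speaker if speaker in KNOWN_HOSTS else "CALLER"


# Rob and Randall are not callers, yet they land in the CALLER bucket.
print([bucket(s) for s in ["MIKE", "ROB", "RANDALL", "TARA"]])

# Per-episode Q&A pair counts as listed above.
qa_pairs = {
    "2010": 4, "2011": 1, "2012a": 2, "2012b": 0, "2014": 0,
    "2015": 1, "2016": 2, "2017": 4, "2018": 3,
}
total_pairs = sum(qa_pairs.values())
print(total_pairs, "pairs across", len(qa_pairs), "episodes")
```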
4090 perf on the expanded set:
- Diarization: 31928s in 121.5s = 262.7x realtime (vs 209.7x on 5070 Ti, +25.3%)
- Transcription (3 new episodes only): 10554s in 112.4s = 93.9x
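The realtime factors above are audio duration divided by wall-clock time. The sketch below recomputes them from the figures in this message; the helper names are illustrative. The diarization ratio comes out near 262.78, so the reported 262.7x was presumably derived from an unrounded wall-clock time.

```python
def realtime_factor(audio_seconds, wall_seconds):
    """How many seconds of audio are processed per second of wall time."""
    return audio_seconds / wall_seconds


def speedup_percent(new_factor, old_factor):
    """Relative gain of new_factor over old_factor, in percent."""
    return (new_factor / old_factor - 1) * 100


diarization = realtime_factor(31928, 121.5)
transcription = realtime_factor(10554, 112.4)
gain = speedup_percent(262.7, 209.7)  # 4090 vs 5070 Ti, as reported

print(f"diarization:   {diarization:.2f}x")
print(f"transcription: {transcription:.1f}x")
print(f"gain:          +{gain:.1f}%")
```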
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
42 lines
1.9 KiB
Python
import os
import sys
import paramiko

password = os.environ.get('IX_PASSWORD')
if not password:
    print('IX_PASSWORD env var not set', file=sys.stderr)
    sys.exit(1)

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('172.16.3.10', username='root', password=password,
               look_for_keys=False, allow_agent=False, timeout=30)
sftp = client.open_sftp()

os.makedirs('test-data/episodes', exist_ok=True)

downloads = [
    ('/home/gurushow/public_html/archive/2010/COMPUTER GURU 5-8-10 hour 1.mp3', 'test-data/episodes/2010-05-08-hr1.mp3'),
    ('/home/gurushow/public_html/archive/2011/3-12-11 HR 1.mp3', 'test-data/episodes/2011-03-12-hr1.mp3'),
    ('/home/gurushow/public_html/archive/2012/3 - March/3-10-12HR1.mp3', 'test-data/episodes/2012-03-10-hr1.mp3'),
    ('/home/gurushow/public_html/archive/2012/6 - June/6-9-12-HR1.mp3', 'test-data/episodes/2012-06-09-hr1.mp3'),
    ('/home/gurushow/public_html/archive/2014/06/s6e19.mp3', 'test-data/episodes/2014-s6e19.mp3'),
    ('/home/gurushow/public_html/archive/2015/01/s7e19.mp3', 'test-data/episodes/2015-s7e19.mp3'),
    ('/home/gurushow/public_html/archive/2016/06/s8e43.mp3', 'test-data/episodes/2016-s8e43.mp3'),
    ('/home/gurushow/public_html/archive/2017/04/s9e30.mp3', 'test-data/episodes/2017-s9e30.mp3'),
    ('/home/gurushow/public_html/archive/2018/01/s10e18.mp3', 'test-data/episodes/2018-s10e18.mp3'),
]

for remote, local in downloads:
    if os.path.exists(local):
        print(f'[skip] {local} already exists')
        continue
    size_mb = sftp.stat(remote).st_size / 1024 / 1024
    print(f'Downloading {local} ({size_mb:.1f} MB)...', flush=True)
    sftp.get(remote, local)
    print(' done', flush=True)

sftp.close()
client.close()
print('All downloads complete.')
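The download loop skips any file that already exists locally, so a transfer interrupted partway could leave a truncated file that later runs never re-fetch. One possible hardening, sketched below, writes to a temporary `.part` path and renames it only on success. The `download_atomically` helper and the `.part` suffix are hypothetical additions, not part of the script; `fetch` stands in for any callable taking `(remotepath, localpath)`, such as paramiko's `sftp.get`.

```python
import os


def download_atomically(fetch, remote, local):
    """Fetch `remote` into `local` via a temporary '.part' file.

    Returns False if `local` already exists (nothing fetched), True otherwise.
    `fetch` is any callable taking (remotepath, localpath).
    """
    if os.path.exists(local):
        return False
    partial = local + ".part"
    try:
        fetch(remote, partial)
    except BaseException:
        # Remove the incomplete file so a later run starts clean.
        if os.path.exists(partial):
            os.remove(partial)
        raise
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(partial, local)
    return True
```

In the loop above this could be called as `download_atomically(sftp.get, remote, local)`, leaving the skip message and the size printout to the caller.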