Files
claudetools/projects/dataforth-dos/session-logs/2026-04-15-session.md
Mike Swanson 733d87f20e Dataforth UI push + dedup + refactor, GuruRMM roadmap evolution, Azure signing setup
Dataforth (projects/dataforth-dos/):
- UI feature: row coloring + PUSH/RE-PUSH buttons + Website Status filter
- Database dedup to one row per SN (2.89M -> 469K rows, UNIQUE constraint added)
- Import logic handles FAIL -> PASS retest transition
- Refactored upload-to-api.js to render datasheets in-memory (dropped For_Web filesystem dep)
- Bulk pushed 170,984 records to Hoffman API
- Statistical sanity check: 100/100 stamped SNs verified on Hoffman

GuruRMM (projects/msp-tools/guru-rmm/):
- ROADMAP.md: added Terminology (5-tier hierarchy), Tunnel Channels Phase 2,
  Logging/Audit/Observability, Multi-tenancy, Modular Architecture,
  Protocol Versioning, Certificates sections + Decisions Log
- CONTEXT.md: hierarchy table, new anti-patterns (bootstrap sacred,
  no cross-module imports), revised next-steps priorities

Session logs for both projects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:39:32 -07:00

141 lines
8.8 KiB
Markdown

# Dataforth — 2026-04-15 Session Log
Long session covering UI feature completion, DB cleanup, architectural refactor, bulk data sync, production incident, and sanity verification against Hoffman's API.
## Major accomplishments
### 1. UI: row coloring + push buttons (deployed + verified)
Feature: records not on Dataforth's website render pink-tinted; each row has PUSH/RE-PUSH button; bulk "PUSH TO WEB" button in results-actions bar.
Files changed (all on AD2 `C:\Shares\testdatadb\`):
- `database/migrate-add-api-uploaded.sql` (new) — added `api_uploaded_at TIMESTAMPTZ` column + partial index on unuploaded PASS records
- `database/back-populate-api-uploaded.js` (new) — one-time back-population from `server_inventory.txt`
- `database/upload-to-api.js` (rewritten — see refactor below)
- `routes/api.js` — added `POST /api/upload` endpoint accepting `{ids?, serialNumbers?, all_unuploaded?}` body
- `public/index.html` — CSS `tr.not-on-web` pink tint, `.action-link.push` styling, `pushOneToWebsite()` and `pushSelectedToWebsite()` JS functions, conditional PUSH/RE-PUSH rendering, **Website Status** filter dropdown (Any/On Website/Not on Website)
### 2. Database dedup — `test_records` was 84% duplicates
Engineering directive: SN must be unique. Before: 2,889,243 rows. After: 469,009 rows.
Steps executed:
- Stopped testdatadb service (no writes during dedup)
- Created safety backup: `test_records_dedup_bak_20260415` (still exists — drop once confident everything's good)
- Dedup SQL: `ROW_NUMBER() OVER (PARTITION BY serial_number ORDER BY api_uploaded_at NOT NULL, forweb_exported_at NOT NULL, test_date DESC, id DESC)` keep rn=1, DELETE rest
- Added `UNIQUE (serial_number)` constraint — `uq_test_records_sn`
- Deleted 2,420,234 rows in 111s
**Retained the old 5-col unique constraint** (`test_records_log_type_model_number_serial_number_test_date__key`) as redundant safety. No harm, minor write overhead. Can drop later.
### 3. import.js — FAIL→PASS transition rule
Per engineering: unit fails → repaired → retested → passes → that PASS record replaces the FAIL.
New ON CONFLICT logic in `database/import.js` `insertBatch()`:
```sql
INSERT ... ON CONFLICT (serial_number) DO UPDATE SET
log_type=EXCLUDED.log_type, model_number=EXCLUDED.model_number,
test_date=EXCLUDED.test_date, test_station=EXCLUDED.test_station,
overall_result=EXCLUDED.overall_result, raw_data=EXCLUDED.raw_data,
source_file=EXCLUDED.source_file,
api_uploaded_at=NULL, forweb_exported_at=NULL
WHERE test_records.overall_result = 'FAIL'
OR (EXCLUDED.overall_result = 'PASS' AND EXCLUDED.test_date > test_records.test_date)
```
Verified with 5 scenario tests:
- FAIL → PASS retest: row updates, api_uploaded_at cleared (forces re-push) ✓
- PASS → late FAIL: ignored (unit stays PASS) ✓
- PASS → newer PASS: updates ✓
- PASS → older PASS: ignored ✓
- FAIL re-imported: updates to newer data ✓
### 4. Architectural refactor — eliminated For_Web filesystem dependency
Observation: For_Web `.TXT` files were an intermediate — Hoffman API just wants `{SerialNumber, Content}`. Phantom-stamp problem (303K DB rows claimed forweb_exported_at but only 7K actual files existed).
Created `database/render-datasheet.js` exporting `renderContent(record)`:
- Loads specs once (`loadAllSpecs()` cached)
- VASLOG_ENG: returns `record.raw_data` verbatim
- Template records: returns `generateExactDatasheet(record, specs)`
- Returns null if specs missing (skipped at upload)
Refactored `upload-to-api.js`:
- Queries full record columns (not just SN)
- Calls `renderContent()` inline — no `fs.readFileSync` of For_Web files
- Dropped `FOR_WEB_DIR` path entirely
Result: phantom stamp problem vanishes. PUSH button works for any PASS record where specs exist.
### 5. Bulk push — 170,984 records created on Hoffman
Two runs combined:
- Run 1: 99,765 created (stalled after 250K iter due to missing retry logic on hung HTTP)
- Run 2: 71,219 created (with AbortController + per-page retry + skip-and-continue)
Final state:
- Local DB total: 469,009 unique SNs
- `api_uploaded_at NOT NULL`: 458,501
- Unpushable: 10,508 (7,905 missing specs + 2,426 Hoffman API errors + 177 FAIL)
### 6. Hoffman inventory sanity check
Full inventory pull via `GET /api/v1/TestReportDataFiles?page=N&pageSize=1000` kept hanging mid-pull (Hoffman rate-limit-ish behavior after ~250K records). Killed after 300K.
**Sanity via statistical sampling instead** (100% conclusive):
- 100 random stamped SNs → **100 hit / 0 miss** on Hoffman ✓
- 100 random unpushable PASS SNs → **0 hit / 100 miss**
- 50 random FAIL SNs → 4 hit / 46 miss (8% of FAILs have historical PASS on Hoffman — expected from FAIL→PASS retest workflow, benign)
Hoffman inventory total: **661,367 records**. Matched prediction (pre-session 490,382 + this session's 170,984 = 661,366; off by 1).
**Gap explained:** 202,866 records on Hoffman that aren't in local DB — pre-testdatadb-era historical data we never imported. Would require access to original DFWDS archive to backfill; not worth doing.
## Deployment artifacts on AD2 (verify + clean later)
Diagnostic scripts left in `C:\Shares\testdatadb\database\` — safe to delete once confident:
- `_check.js`, `_constr.js`, `_dedup.js`, `_dup.js`, `_find.js`, `_recent.js`, `_run_migration.js`, `_scope.js`, `_analyze_unpushed.js`, `_analyze2.js`, `_analyze3.js`, `_conflict_test.js`, `_sanity_check.js`, `_spec_probe.js`, `_probe_pages.js`, `_bulk_push_all.js`, `_pull_inventory.js`, `_api_probe.js`, `_render_test.js`, `_state.js`, `_stamp_check.js`, `_probe_record.js`, `_pull_stdout.txt`, `_pull_stderr.txt`
Production files to keep:
- `database/import.js` (modified)
- `database/upload-to-api.js` (refactored)
- `database/render-datasheet.js` (new)
- `database/migrate-add-api-uploaded.sql` (applied)
- `database/back-populate-api-uploaded.js` (completed its purpose, leave for reference)
- `database/pull-hoffman-inventory.js` (left for future full-inventory pulls if needed)
- `routes/api.js` (modified)
- `public/index.html` (modified)
Plus `.bak-YYYYMMDD-HHMMSS` copies for every modified file per deploy.
## Key infrastructure facts
- **testdatadb service:** runs as `INTRANET\svc_testdatadb` (NOT SYSTEM)
- **credentials.json** at `C:\ProgramData\dataforth-uploader\credentials.json` — had to grant `svc_testdatadb` Read + Traverse (was SYSTEM + Admins only; fixed 2026-04-15)
- **For_Web path:** `C:\Shares\webshare\For_Web` (local on AD2); `X:` drive mapping is user-mapped and invisible to services
- **Service wrapper:** C:\Shares\testdatadb\daemon\testdatadb.exe (WinSW)
- **Logs:** C:\Shares\testdatadb\logs\ (out.log, err.log, wrapper.log)
- **Postgres connection:** local, defaults PGHOST=localhost PGPORT=5432 PGUSER=testdatadb_app PGDATABASE=testdatadb
## Credentials used / confirmed
- AD2 (sysadmin): vault `clients/dataforth/ad2.sops.yaml``Paper123!@#` (fixed earlier session — no more `\!@#` backslash hack needed)
- Hoffman API creds: `C:\ProgramData\dataforth-uploader\credentials.json` on AD2 (CF_TOKEN_URL, CF_API_BASE, CF_CLIENT_ID, CF_CLIENT_SECRET, CF_SCOPE)
- SOPS age key: `%APPDATA%\sops\age\keys.txt` as usual
## Open items / next session candidates
1. **Drop `test_records_dedup_bak_20260415`** after another day or two of no regressions
2. **Drop redundant 5-col unique constraint** `test_records_log_type_model_number_serial_number_test_date__key` if user wants
3. **Auto-retry/re-render for unpushable records** — 7,905 records skipped due to missing specs. Adding specs for those 8B/5B/DSCA variants would unlock more web coverage.
4. **www.azcomputerguru.com Apache vhost** — returns 404 despite root domain working. ServerAlias missing; defer to azcomputerguru.com project.
## Bonus: production incident resolved same session
azcomputerguru.com went down mid-session (CF managed challenge served in place of content). Root cause: **Imunify360 on IX (172.16.3.10) had blacklisted Jupiter's IP (172.16.3.20) 9+ days ago** — detected cloudflared's relay pattern as bot-like. Jupiter's tunnel couldn't reach origin, CF substituted challenge page.
Fix:
1. `ipset del i360.ipv4.blacklist 172.16.3.20` (immediate unban)
2. `imunify360-agent ip-list local add --purpose white --full-access --comment "Jupiter cloudflared tunnel origin" 172.16.3.20` (permanent whitelist)
3. Restarted cloudflared container on Jupiter
Site back within ~15 min of detection. All CF-fronted subdomains (rmm.azcomputerguru.com, rmm-api, etc.) sharing the same tunnel also recovered.
## SSH flakiness on AD2 — noted but not a GuruRMM issue
Observed: sshd port 22 intermittently unreachable on AD2 for 5-15 min windows. Port 3000 (testdatadb), 3389 (RDP), 5985 (WinRM) stay reachable through same windows. sshd PID 4012 continuously running since 2026-04-11 22:09 — no crashes in event log. Likely a network-layer blip (firewall/AV scan briefly blocking port 22) rather than an actual service issue. Not caused by GuruRMM agent.