# Parsing Fidelity Verdict — testdatadb ingestion vs original staged datasheets **Date:** 2026-06-17 · **Host:** AD2 · **Scope:** all 11,922 staged original `.TXT` datasheets vs the PostgreSQL `test_records` **Raw report:** `PARSING-FIDELITY-REPORT-2026-06-17.txt` · **Tool:** `datasheet-pipeline/implementation/tools/validate-parsing.js` ## Verdict **Ingestion/parsing is faithful — 0 genuine parse faults across 11,239 comparable records.** Every staged datasheet that has a corresponding DB record and a comparable test run matches on serial, model, date, and the 5 accuracy-test results. The earlier "mismatches" were all explained by retests, reused serials, format variants, or legacy out-of-scope units — not by the parser misreading or mis-segmenting data. ## Method Compared each staged original `.TXT` (the DOS-station ground truth, written *before* ingestion) against the DB record's `raw_data` (parsed from the `.DAT`). The cross-check keyed on **scale-invariant data**: - **Error (%)** — dimensionless, identical in `.DAT` and `.TXT` for every family (immune to mV scaling and current-output V→mA conversion). The primary fidelity signal. - **Stim setpoints** (scale-aware) — used to confirm the *same unit/test* when error values differed (retest vs wrong-record). - Serial (with hex-prefix decode), model (SCM-prefix normalized), date. ## Results (11,921 staged files with an SN) | Outcome | Count | Meaning | |---|---:|---| | **Consistent** (SN+model+date+5×error%) | **11,226** | Faithful parse, confirmed | | Retest — DB date newer than `.TXT` | 35 | ON-CONFLICT updated DB to a later test (expected) | | Retest — same date, stim matches, run differs | 42 | Unit tested twice same day; DB keeps first run (strictly-greater-date rule) | | VAS/single-point format | 5 | No 5-row accuracy block (SCMVAS) — not comparable by this method | | Serial collision (generic SN, diff family) | 2 | `1-1`/`1-2` reused across products; unique-on-serial keeps one | | **Genuine parse fault** | **0** | — | | Model variant mismatch (same family) | 2 | `A819-1`/`A821-2` — reused serial across 8B35/8B36 (collision) | | DB older than `.TXT` | 1 | `A821-1` — same collision pair | | Accuracy-row-count diff | 0 | — | ### Why the "genuine" bucket collapsed to 0 The last 16 suspects were all `SCM5B37K-1530` (K-thermocouple). Their stim values matched the same 5 nominal setpoints (-50/112.5/275/437.5/600 °C) but differed by ~0.06 °C run-to-run — because the thermocouple input is a *measured analog value*, not an exact setpoint. A scale+relative stim tolerance correctly classifies them as same-day retests. A real segmentation fault would show a different setpoint *structure*; none did. ## Two follow-up items (NOT parsing-correctness bugs) 1. **608 staged originals have no DB record** (mostly A-prefix `10xxx` serials, e.g. `A243-1` = `10243-1`, model `5B45-25D`). These exist as staged `.TXT` but are absent from the DB under both decoded and encoded serial. This is an **ingestion-completeness** question (the source `.DAT` for these units appears to be out of the import scan scope, or these are custom `-NND` variants), separate from parsing fidelity. Worth a completeness pass: confirm which `.DAT` paths the importer scans and whether these models' `.DAT` files are present. 2. **Same-day retests keep the first run.** The `ON CONFLICT` rule updates only when `EXCLUDED.test_date > test_records.test_date` (strictly greater). Two runs on the *same date* leave the DB on the first-imported run, which may differ from the latest staged datasheet. If "latest run wins" is desired, the rule needs a tiebreaker (e.g. `>=` with an import-time or sequence guard). 42 records currently sit on a same-day earlier run. ## How to re-run ```bash cd C:\Shares\testdatadb node /validate-parsing.js [optional-report-path] # Reads C:\Shares\test\STAGE\**\*.TXT and compares to test_records. Read-only. ```