Validated all 11,922 staged original .TXT datasheets against test_records. 0 genuine parse faults across 11,239 comparable records; mismatches all explained (retests, reused serials, VAS format, legacy out-of-scope units). Adds the validate-parsing.js tool, raw report, and verdict.

Two follow-ups (NOT parse bugs): 608 staged units absent from DB (ingestion completeness), and same-day retests keep the first run (ON CONFLICT strictly-greater-date).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Parsing Fidelity Verdict — testdatadb ingestion vs original staged datasheets
Date: 2026-06-17 · Host: AD2 · Scope: all 11,922 staged original .TXT datasheets vs the PostgreSQL test_records
Raw report: PARSING-FIDELITY-REPORT-2026-06-17.txt · Tool: datasheet-pipeline/implementation/tools/validate-parsing.js
Verdict
Ingestion/parsing is faithful — 0 genuine parse faults across 11,239 comparable records. Every staged datasheet that has a corresponding DB record and a comparable test run matches on serial, model, date, and the 5 accuracy-test results. The earlier "mismatches" were all explained by retests, reused serials, format variants, or legacy out-of-scope units — not by the parser misreading or mis-segmenting data.
Method
Compared each staged original .TXT (the DOS-station ground truth, written before ingestion) against the DB record's raw_data (parsed from the .DAT). The cross-check keyed on scale-invariant data:
- Error (%): dimensionless, identical in .DAT and .TXT for every family (immune to mV scaling and current-output V→mA conversion). The primary fidelity signal.
- Stim setpoints (scale-aware): used to confirm the same unit/test when error values differed (retest vs wrong-record).
- Serial (with hex-prefix decode), model (SCM-prefix normalized), date.
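The three keys above could be normalized and compared roughly as follows. This is a minimal sketch, not the code in validate-parsing.js: the function names, the reading of "hex-prefix decode" as a single leading hex letter mapped to its decimal value (inferred from the A243-1 = 10243-1 example), and the error tolerance are all assumptions.

```javascript
// Hypothetical helpers; names and exact rules are assumptions,
// not taken from validate-parsing.js.

// Decode a single leading hex letter to decimal, e.g. "A243-1" -> "10243-1".
// Serials that do not start with A-F followed by a digit are returned trimmed.
function decodeSerial(sn) {
  const m = /^([A-F])(\d.*)$/i.exec(sn.trim());
  return m ? String(parseInt(m[1], 16)) + m[2] : sn.trim();
}

// Strip an optional "SCM" prefix so "SCM5B45-25D" and "5B45-25D" compare equal.
function normalizeModel(model) {
  return model.trim().toUpperCase().replace(/^SCM/, '');
}

// Compare the accuracy-test Error (%) rows within a small absolute tolerance
// (the 0.0005 default is an assumed value).
function errorsMatch(txtErrors, dbErrors, tol = 0.0005) {
  return txtErrors.length === dbErrors.length &&
    txtErrors.every((e, i) => Math.abs(e - dbErrors[i]) <= tol);
}

console.log(decodeSerial('A243-1'));          // "10243-1"
console.log(normalizeModel('SCM5B37K-1530')); // "5B37K-1530"
```

Because Error (%) is dimensionless, a comparison like `errorsMatch` needs no per-family scale factor, which is why it works as the primary signal.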
Results (11,921 staged files with an SN)
| Outcome | Count | Meaning |
|---|---|---|
| Consistent (SN+model+date+5×error%) | 11,226 | Faithful parse, confirmed |
| Retest — DB date newer than .TXT | 35 | ON-CONFLICT updated DB to a later test (expected) |
| Retest — same date, stim matches, run differs | 42 | Unit tested twice same day; DB keeps first run (strictly-greater-date rule) |
| VAS/single-point format | 5 | No 5-row accuracy block (SCMVAS) — not comparable by this method |
| Serial collision (generic SN, diff family) | 2 | 1-1/1-2 reused across products; unique-on-serial keeps one |
| Genuine parse fault | 0 | — |
| Model variant mismatch (same family) | 2 | A819-1/A821-2 — reused serial across 8B35/8B36 (collision) |
| DB older than .TXT | 1 | A821-1 — same collision pair |
| Accuracy-row-count diff | 0 | — |
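As a quick sanity check, the table counts appear to reconcile with the file totals if the 608 staged units that have no DB record are the ones left out of the table. That reading is an assumption, and the object keys below are hypothetical labels, not identifiers from the tool.

```javascript
// Counts copied from the results table; key names are illustrative only.
const outcomes = {
  consistent: 11226,
  retestNewerDbDate: 35,
  retestSameDay: 42,
  vasSinglePoint: 5,
  serialCollision: 2,
  genuineParseFault: 0,
  modelVariantMismatch: 2,
  dbOlderThanTxt: 1,
  accuracyRowCountDiff: 0,
};
const stagedWithSerial = 11921; // staged files with an SN
const missingFromDb = 608;      // staged units with no DB record

const classified = Object.values(outcomes).reduce((a, b) => a + b, 0);
console.log(classified);                       // 11313
console.log(stagedWithSerial - missingFromDb); // 11313
```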
Why the "genuine" bucket collapsed to 0
The last 16 suspects were all SCM5B37K-1530 (K-thermocouple). Their stim values matched the same 5 nominal setpoints (-50/112.5/275/437.5/600 °C) but differed by ~0.06 °C run-to-run — because the thermocouple input is a measured analog value, not an exact setpoint. A scale+relative stim tolerance correctly classifies them as same-day retests. A real segmentation fault would show a different setpoint structure; none did.
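A relative stim tolerance of this kind might look like the sketch below. It is not the check in validate-parsing.js: the function name, the 0.1%-of-span tolerance, and the sample readings are illustrative assumptions, and the .DAT to .TXT scale conversion is left out.

```javascript
// Hypothetical sketch; tolerance and sample values are assumptions.
// Nominal setpoints for SCM5B37K-1530, in °C.
const NOMINAL = [-50, 112.5, 275, 437.5, 600];

// True when every measured stim lies within relTol of the setpoint span
// from its nominal setpoint.
function sameSetpointStructure(stim, nominal = NOMINAL, relTol = 0.001) {
  if (stim.length !== nominal.length) return false;
  const span = Math.max(...nominal) - Math.min(...nominal); // 650 °C here
  return stim.every((v, i) => Math.abs(v - nominal[i]) <= relTol * span);
}

// Made-up readings that drift about 0.06 °C from nominal: same structure.
console.log(sameSetpointStructure([-50.06, 112.44, 275.05, 437.5, 600.06])); // true
// A different setpoint structure, as a mis-segmented record would show.
console.log(sameSetpointStructure([0, 150, 300, 450, 600]));                 // false
```

Scaling the tolerance by the span rather than by each value keeps the check meaningful near the -50 °C point, where a purely relative bound would become very tight.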
Two follow-up items (NOT parsing-correctness bugs)
- 608 staged originals have no DB record (mostly A-prefix 10xxx serials, e.g. A243-1 = 10243-1, model 5B45-25D). These exist as staged .TXT but are absent from the DB under both decoded and encoded serial. This is an ingestion-completeness question (the source .DAT for these units appears to be out of the import scan scope, or these are custom -NND variants), separate from parsing fidelity. Worth a completeness pass: confirm which .DAT paths the importer scans and whether these models' .DAT files are present.
- Same-day retests keep the first run. The ON CONFLICT rule updates only when EXCLUDED.test_date > test_records.test_date (strictly greater). Two runs on the same date leave the DB on the first-imported run, which may differ from the latest staged datasheet. If "latest run wins" is desired, the rule needs a tiebreaker (e.g. >= with an import-time or sequence guard). 42 records currently sit on a same-day earlier run.
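The difference between the current rule and a tiebreaker variant can be modelled as a plain decision function. This is a sketch only: the field names `testDate` and `importSeq` are hypothetical, and it assumes the sequence value reflects the order in which the runs were actually performed, which would need confirming against the importer.

```javascript
// Hypothetical model of the upsert decision; field names are assumptions.
// Dates are ISO "YYYY-MM-DD" strings, which compare correctly as strings.

// Current behaviour as described: replace only on a strictly later test date.
function replacesStrict(existing, incoming) {
  return incoming.testDate > existing.testDate;
}

// Possible "latest run wins" variant: on equal dates, fall back to a sequence guard.
function replacesWithTiebreak(existing, incoming) {
  if (incoming.testDate !== existing.testDate) {
    return incoming.testDate > existing.testDate;
  }
  return incoming.importSeq > existing.importSeq;
}

const first = { testDate: '2026-06-17', importSeq: 1 };
const second = { testDate: '2026-06-17', importSeq: 2 };
console.log(replacesStrict(first, second));       // false
console.log(replacesWithTiebreak(first, second)); // true
```

The guard matters because a bare `>=` would let a re-import of the same run overwrite the record on every pass, whereas comparing a sequence keeps repeated imports idempotent.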
How to re-run
cd C:\Shares\testdatadb
node <path-to>/validate-parsing.js [optional-report-path]
# Reads C:\Shares\test\STAGE\**\*.TXT and compares to test_records. Read-only.