Mike Swanson bbcde2be8e dataforth(datasheet): parsing-fidelity validation — all staged originals vs DB

Validated all 11,922 staged original .TXT datasheets against test_records.
0 genuine parse faults across 11,239 comparable records; mismatches all explained
(retests, reused serials, VAS format, legacy out-of-scope units). Adds the
validate-parsing.js tool, raw report, and verdict. Two follow-ups (NOT parse bugs):
608 staged units absent from DB (ingestion completeness), and same-day retests keep
the first run (ON CONFLICT strictly-greater-date).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-18 13:02:32 -07:00

Parsing Fidelity Verdict — testdatadb ingestion vs original staged datasheets

Date: 2026-06-17 · Host: AD2 · Scope: all 11,922 staged original .TXT datasheets vs the PostgreSQL test_records
Raw report: PARSING-FIDELITY-REPORT-2026-06-17.txt · Tool: datasheet-pipeline/implementation/tools/validate-parsing.js

Verdict

Ingestion/parsing is faithful — 0 genuine parse faults across 11,239 comparable records. Every staged datasheet that has a corresponding DB record and a comparable test run matches on serial, model, date, and the 5 accuracy-test results. The earlier "mismatches" were all explained by retests, reused serials, format variants, or legacy out-of-scope units — not by the parser misreading or mis-segmenting data.

Method

Compared each staged original .TXT (the DOS-station ground truth, written before ingestion) against the DB record's raw_data (parsed from the .DAT). The cross-check keyed on scale-invariant data:

- Error (%): dimensionless, identical in .DAT and .TXT for every family (immune to mV scaling and current-output V→mA conversion). This is the primary fidelity signal.
- Stim setpoints (scale-aware): used to confirm the same unit/test when error values differed (distinguishing a retest from a wrong record).
- Serial (with hex-prefix decode), model (SCM-prefix normalized), and date.
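
The cross-check keys above can be sketched roughly as follows. This is a minimal illustration, not the code in validate-parsing.js: every function name, the record shape, and the `eps` value are hypothetical. The hex-prefix decode is inferred from the one example given in this report (A243-1 = 10243-1), and the model normalization assumes it only strips a leading "SCM".

```javascript
// Hypothetical helpers for illustration; not taken from validate-parsing.js.

// Decode a leading hex letter in a serial: "A243-1" -> "10243-1".
// Assumption: only a single leading A-F letter is encoded this way.
function decodeSerial(sn) {
  const s = sn.trim();
  const m = /^([A-F])(\d.*)$/i.exec(s);
  return m ? String(parseInt(m[1], 16)) + m[2] : s;
}

// Drop a leading "SCM" so "SCM5B45-25D" and "5B45-25D" compare equal.
function normalizeModel(model) {
  return model.trim().toUpperCase().replace(/^SCM/, '');
}

// Compare the Error (%) values row by row.
// eps only absorbs print rounding; its value is an assumption.
function errorsMatch(txtErrors, dbErrors, eps = 1e-6) {
  return txtErrors.length === dbErrors.length &&
    txtErrors.every((e, i) => Math.abs(e - dbErrors[i]) <= eps);
}

// A record pair is "consistent" when serial, model, date, and all
// error values agree after normalization.
function isConsistent(txt, db) {
  return decodeSerial(txt.serial) === decodeSerial(db.serial) &&
    normalizeModel(txt.model) === normalizeModel(db.model) &&
    txt.date === db.date &&
    errorsMatch(txt.errors, db.errors);
}
```

Because Error (%) is dimensionless, this comparison needs no per-family scaling, which is why the report treats it as the primary signal and falls back to stim setpoints only when the error values differ.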

Results (11,921 staged files with an SN)

| Outcome | Count | Meaning |
| --- | --- | --- |
| Consistent (SN+model+date+5×error%) | 11,226 | Faithful parse, confirmed |
| Retest: DB date newer than `.TXT` | 35 | ON-CONFLICT updated DB to a later test (expected) |
| Retest: same date, stim matches, run differs | 42 | Unit tested twice same day; DB keeps first run (strictly-greater-date rule) |
| VAS/single-point format | 5 | No 5-row accuracy block (SCMVAS); not comparable by this method |
| Serial collision (generic SN, diff family) | 2 | `1-1`/`1-2` reused across products; unique-on-serial keeps one |
| Genuine parse fault | 0 | - |
| Model variant mismatch (same family) | 2 | `A819-1`/`A821-2`: reused serial across 8B35/8B36 (collision) |
| DB older than `.TXT` | 1 | `A821-1`: same collision pair |
| Accuracy-row-count diff | 0 | - |
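
As a quick arithmetic check, the outcome counts can be summed and combined with the 608 staged units that have no DB record (listed under the follow-up items). The snippet below only restates numbers from this report; the object keys are shorthand labels chosen here, not identifiers from the tool.

```javascript
// Counts copied from the results table; keys are shorthand labels.
const outcomes = {
  consistent: 11226,
  retestNewerDate: 35,
  retestSameDay: 42,
  vasSinglePoint: 5,
  serialCollision: 2,
  genuineParseFault: 0,
  modelVariantMismatch: 2,
  dbOlderThanTxt: 1,
  accuracyRowCountDiff: 0,
};

// Files that were matched to a DB record and classified.
const classified = Object.values(outcomes).reduce((a, b) => a + b, 0);

// Staged originals with no DB record (follow-up item).
const noDbRecord = 608;

console.log(classified, classified + noDbRecord); // 11313 11921
```

The total of 11,921 matches the number of staged files with an SN, so the table and the 608 unmatched units appear to account for every file in scope.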

Why the "genuine" bucket collapsed to 0

The last 16 suspects were all SCM5B37K-1530 (K-thermocouple). Their stim values matched the same 5 nominal setpoints (-50/112.5/275/437.5/600 °C) but differed by ~0.06 °C run-to-run — because the thermocouple input is a measured analog value, not an exact setpoint. A scale+relative stim tolerance correctly classifies them as same-day retests. A real segmentation fault would show a different setpoint structure; none did.
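
The tolerance idea can be sketched as below. This is an assumption-laden illustration rather than the logic in validate-parsing.js: `stimMatches`, the 0.2% default, and the choice to scale the tolerance by the setpoint span are all hypothetical, and the second run's readings are made-up values that differ from nominal by about 0.06 °C.

```javascript
// Hypothetical sketch: two runs share a setpoint structure when every
// stim value agrees within a tolerance scaled to the run's span.
function stimMatches(runA, runB, relTol = 0.002) {
  if (runA.length !== runB.length) return false;
  const span = Math.max(...runA) - Math.min(...runA);
  const tol = relTol * span; // 650 °C span -> 1.3 °C allowed
  return runA.every((v, i) => Math.abs(v - runB[i]) <= tol);
}

// Nominal K-thermocouple setpoints from the report (°C).
const nominal = [-50, 112.5, 275, 437.5, 600];
// Illustrative measured values, about 0.06 °C off nominal.
const retest = [-49.94, 112.56, 274.95, 437.44, 600.06];
// A different setpoint structure, as a segmentation fault might produce.
const other = [0, 150, 300, 450, 600];

console.log(stimMatches(nominal, retest)); // true
console.log(stimMatches(nominal, other));  // false
```

An exact-equality check would have flagged the measured run as a mismatch, which is how the 16 thermocouple units ended up as suspects before the tolerance was applied.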

Two follow-up items (NOT parsing-correctness bugs)

1. 608 staged originals have no DB record (mostly A-prefix 10xxx serials, e.g. A243-1 = 10243-1, model 5B45-25D). These exist as staged .TXT but are absent from the DB under both decoded and encoded serial. This is an ingestion-completeness question (the source .DAT for these units appears to be out of the import scan scope, or these are custom -NND variants), separate from parsing fidelity. Worth a completeness pass: confirm which .DAT paths the importer scans and whether these models' .DAT files are present.
2. Same-day retests keep the first run. The ON CONFLICT rule updates only when EXCLUDED.test_date > test_records.test_date (strictly greater). Two runs on the same date leave the DB on the first-imported run, which may differ from the latest staged datasheet. If "latest run wins" is desired, the rule needs a tiebreaker (e.g. >= with an import-time or sequence guard). 42 records currently sit on a same-day earlier run.
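
The effect of the strictly-greater rule, and one possible tiebreaker, can be modeled in a few lines. This is a sketch of the decision only, not the importer's SQL or code: the function names and the `importSeq` field are hypothetical, and dates are assumed to be ISO `YYYY-MM-DD` strings so that string comparison orders them correctly.

```javascript
// Current rule as described in this report: replace the stored row only
// when the incoming test date is strictly greater.
function replacesCurrent(existing, incoming) {
  return incoming.testDate > existing.testDate;
}

// One possible "latest run wins" variant: on equal dates, fall back to a
// hypothetical monotonically increasing import sequence.
function replacesWithTiebreak(existing, incoming) {
  if (incoming.testDate !== existing.testDate) {
    return incoming.testDate > existing.testDate;
  }
  return incoming.importSeq > existing.importSeq;
}

// Two runs of the same unit on the same day.
const firstRun = { testDate: '2026-06-17', importSeq: 1 };
const secondRun = { testDate: '2026-06-17', importSeq: 2 };

console.log(replacesCurrent(firstRun, secondRun));      // false: DB keeps first run
console.log(replacesWithTiebreak(firstRun, secondRun)); // true: later run wins
```

The sequence guard matters because a plain `>=` would also let a re-import of the same run overwrite itself; comparing a sequence or import time keeps the update directional.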

How to re-run

cd C:\Shares\testdatadb
node <path-to>/validate-parsing.js [optional-report-path]
# Reads C:\Shares\test\STAGE\**\*.TXT and compares to test_records. Read-only.
