Files
claudetools/projects/dataforth-dos/PARSING-FIDELITY-VERDICT-2026-06-17.md
Mike Swanson bbcde2be8e dataforth(datasheet): parsing-fidelity validation — all staged originals vs DB
Validated all 11,922 staged original .TXT datasheets against test_records.
0 genuine parse faults across 11,239 comparable records; mismatches all explained
(retests, reused serials, VAS format, legacy out-of-scope units). Adds the
validate-parsing.js tool, raw report, and verdict. Two follow-ups (NOT parse bugs):
608 staged units absent from DB (ingestion completeness), and same-day retests keep
the first run (ON CONFLICT strictly-greater-date).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 13:02:32 -07:00

46 lines
3.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Parsing Fidelity Verdict — testdatadb ingestion vs original staged datasheets
**Date:** 2026-06-17 · **Host:** AD2 · **Scope:** all 11,922 staged original `.TXT` datasheets vs the PostgreSQL `test_records`
**Raw report:** `PARSING-FIDELITY-REPORT-2026-06-17.txt` · **Tool:** `datasheet-pipeline/implementation/tools/validate-parsing.js`
## Verdict
**Ingestion/parsing is faithful — 0 genuine parse faults across 11,239 comparable records.** Every staged datasheet that has a corresponding DB record and a comparable test run matches on serial, model, date, and the 5 accuracy-test results. The earlier "mismatches" were all explained by retests, reused serials, format variants, or legacy out-of-scope units — not by the parser misreading or mis-segmenting data.
## Method
Compared each staged original `.TXT` (the DOS-station ground truth, written *before* ingestion) against the DB record's `raw_data` (parsed from the `.DAT`). The cross-check keyed on **scale-invariant data**:
- **Error (%)** — dimensionless, identical in `.DAT` and `.TXT` for every family (immune to mV scaling and current-output V→mA conversion). The primary fidelity signal.
- **Stim setpoints** (scale-aware) — used to confirm the *same unit/test* when error values differed (retest vs wrong-record).
- Serial (with hex-prefix decode), model (SCM-prefix normalized), date.
## Results (11,921 staged files with an SN)
| Outcome | Count | Meaning |
|---|---:|---|
| **Consistent** (SN+model+date+5×error%) | **11,226** | Faithful parse, confirmed |
| Retest — DB date newer than `.TXT` | 35 | ON-CONFLICT updated DB to a later test (expected) |
| Retest — same date, stim matches, run differs | 42 | Unit tested twice same day; DB keeps first run (strictly-greater-date rule) |
| VAS/single-point format | 5 | No 5-row accuracy block (SCMVAS) — not comparable by this method |
| Serial collision (generic SN, diff family) | 2 | `1-1`/`1-2` reused across products; unique-on-serial keeps one |
| **Genuine parse fault** | **0** | — |
| Model variant mismatch (same family) | 2 | `A819-1`/`A821-2` — reused serial across 8B35/8B36 (collision) |
| DB older than `.TXT` | 1 | `A821-1` — same collision pair |
| Accuracy-row-count diff | 0 | — |
### Why the "genuine" bucket collapsed to 0
The last 16 suspects were all `SCM5B37K-1530` (K-thermocouple). Their stim values matched the same 5 nominal setpoints (-50/112.5/275/437.5/600 °C) but differed by ~0.06 °C run-to-run — because the thermocouple input is a *measured analog value*, not an exact setpoint. A scale+relative stim tolerance correctly classifies them as same-day retests. A real segmentation fault would show a different setpoint *structure*; none did.
## Two follow-up items (NOT parsing-correctness bugs)
1. **608 staged originals have no DB record** (mostly A-prefix `10xxx` serials, e.g. `A243-1` = `10243-1`, model `5B45-25D`). These exist as staged `.TXT` but are absent from the DB under both decoded and encoded serial. This is an **ingestion-completeness** question (the source `.DAT` for these units appears to be out of the import scan scope, or these are custom `-NND` variants), separate from parsing fidelity. Worth a completeness pass: confirm which `.DAT` paths the importer scans and whether these models' `.DAT` files are present.
2. **Same-day retests keep the first run.** The `ON CONFLICT` rule updates only when `EXCLUDED.test_date > test_records.test_date` (strictly greater). Two runs on the *same date* leave the DB on the first-imported run, which may differ from the latest staged datasheet. If "latest run wins" is desired, the rule needs a tiebreaker (e.g. `>=` with an import-time or sequence guard). 42 records currently sit on a same-day earlier run.
## How to re-run
```bash
cd C:\Shares\testdatadb
node <path-to>/validate-parsing.js [optional-report-path]
# Reads C:\Shares\test\STAGE\**\*.TXT and compares to test_records. Read-only.
```