claudetools/projects/dataforth-dos/PARSING-FIDELITY-VERDICT-2026-06-17.md

# Parsing Fidelity Verdict — testdatadb ingestion vs original staged datasheets

**Date:** 2026-06-17 · **Host:** AD2 · **Scope:** all 11,922 staged original `.TXT` datasheets vs the PostgreSQL `test_records`
**Raw report:** `PARSING-FIDELITY-REPORT-2026-06-17.txt` · **Tool:** `datasheet-pipeline/implementation/tools/validate-parsing.js`

## Verdict

**Ingestion/parsing is faithful — 0 genuine parse faults across 11,239 comparable records.** Every staged datasheet that has a corresponding DB record and a comparable test run matches on serial, model, date, and the 5 accuracy-test results. The earlier "mismatches" were all explained by retests, reused serials, format variants, or legacy out-of-scope units — not by the parser misreading or mis-segmenting data.

## Method

Compared each staged original `.TXT` (the DOS-station ground truth, written *before* ingestion) against the DB record's `raw_data` (parsed from the `.DAT`). The cross-check keyed on **scale-invariant data**:
- **Error (%)** — dimensionless, identical in `.DAT` and `.TXT` for every family (immune to mV scaling and current-output V→mA conversion). The primary fidelity signal.
- **Stim setpoints** (scale-aware) — used to confirm the *same unit/test* when error values differed (retest vs wrong-record).
- Serial (with hex-prefix decode), model (SCM-prefix normalized), date.

## Results (11,921 staged files with an SN)

| Outcome | Count | Meaning |
|---|---:|---|
| **Consistent** (SN+model+date+5×error%) | **11,226** | Faithful parse, confirmed |
| Retest — DB date newer than `.TXT` | 35 | ON-CONFLICT updated DB to a later test (expected) |
| Retest — same date, stim matches, run differs | 42 | Unit tested twice same day; DB keeps first run (strictly-greater-date rule) |
| VAS/single-point format | 5 | No 5-row accuracy block (SCMVAS) — not comparable by this method |
| Serial collision (generic SN, diff family) | 2 | `1-1`/`1-2` reused across products; unique-on-serial keeps one |
| **Genuine parse fault** | **0** | — |
| Model variant mismatch (same family) | 2 | `A819-1`/`A821-2` — reused serial across 8B35/8B36 (collision) |
| DB older than `.TXT` | 1 | `A821-1` — same collision pair |
| Accuracy-row-count diff | 0 | — |

### Why the "genuine" bucket collapsed to 0
The last 16 suspects were all `SCM5B37K-1530` (K-thermocouple). Their stim values matched the same 5 nominal setpoints (-50/112.5/275/437.5/600 °C) but differed by ~0.06 °C run-to-run — because the thermocouple input is a *measured analog value*, not an exact setpoint. A scale+relative stim tolerance correctly classifies them as same-day retests. A real segmentation fault would show a different setpoint *structure*; none did.

## Two follow-up items (NOT parsing-correctness bugs)

1. **608 staged originals have no DB record** (mostly A-prefix `10xxx` serials, e.g. `A243-1` = `10243-1`, model `5B45-25D`). These exist as staged `.TXT` but are absent from the DB under both decoded and encoded serial. This is an **ingestion-completeness** question (the source `.DAT` for these units appears to be out of the import scan scope, or these are custom `-NND` variants), separate from parsing fidelity. Worth a completeness pass: confirm which `.DAT` paths the importer scans and whether these models' `.DAT` files are present.
2. **Same-day retests keep the first run.** The `ON CONFLICT` rule updates only when `EXCLUDED.test_date > test_records.test_date` (strictly greater). Two runs on the *same date* leave the DB on the first-imported run, which may differ from the latest staged datasheet. If "latest run wins" is desired, the rule needs a tiebreaker (e.g. `>=` with an import-time or sequence guard). 42 records currently sit on a same-day earlier run.

## How to re-run

```bash
cd C:\Shares\testdatadb
node <path-to>/validate-parsing.js [optional-report-path]
# Reads C:\Shares\test\STAGE\**\*.TXT and compares to test_records. Read-only.
```