diff --git a/projects/dataforth-dos/PARSING-FIDELITY-REPORT-2026-06-17.txt b/projects/dataforth-dos/PARSING-FIDELITY-REPORT-2026-06-17.txt
new file mode 100644
index 00000000..1d4e4d8a
--- /dev/null
+++ b/projects/dataforth-dos/PARSING-FIDELITY-REPORT-2026-06-17.txt
@@ -0,0 +1,34 @@
+========== PARSING FIDELITY REPORT ==========
+Staged .TXT files scanned : 11922
+ - no SN line (non-standard fmt): 1
+ - SN found / compared : 11921
+ - .TXT w/o 5 accuracy rows : 239
+Unique SNs looked up in DB : 11811
+SNs present in DB : 11239
+
+EXPLAINED (not parsing faults):
+ Consistent (SN+model+date+5 error% match) : 11226
+ Retest, DB newer date than .TXT : 35
+ Retest same-day (stim matches, run differs): 42
+ VAS/single-point fmt (no 5-row block) : 5
+ Serial collision (generic SN, diff family): 2
+
+NEEDS REVIEW (potential genuine issues):
+ Missing from DB (after hex-decode) : 608
+ Model variant mismatch (same family) : 2
+ DB OLDER than .TXT (stale DB?) : 1
+ GENUINE error% fault (stim ALSO differs) : 0
+ Accuracy-row-count diff : 0
+
+COLLISION (informational) (first 20):
+ 1-1: txt=SCM5B34-02 db=SCMVAS-M300
+ 1-2: txt=SCM5B34-02 db=SCMVAS-M300
+
+MODEL VARIANT MISMATCH (first 20):
+ A819-1: txt=8B35-01 db=8B36-04
+ A821-2: txt=8B35-04 db=8B36-01
+
+DB OLDER THAN .TXT (first 20):
+ A821-1: txt=02-25-2026 db=2026-01-13
+
+MISSING-FROM-DB (first 30): A243-1 (dec 10243-1), A243-2 (dec 10243-2), A244-1 (dec 10244-1), A255-1 (dec 10255-1), A255-2 (dec 10255-2), A276-1 (dec 10276-1), A276-2 (dec 10276-2), A328-1 (dec 10328-1), A328-2 (dec 10328-2), A376-1 (dec 10376-1), A376-2 (dec 10376-2), A376-3 (dec 10376-3), A377-1 (dec 10377-1), A377-2 (dec 10377-2), A377-3 (dec 10377-3), A405-1 (dec 10405-1), A405-2 (dec 10405-2), A405-3 (dec 10405-3), A405-4 (dec 10405-4), A417-1 (dec 10417-1), A417-2 (dec 10417-2), A561-1 (dec 10561-1), A561-2 (dec 10561-2), A561-3 (dec 10561-3), A561-4 (dec 10561-4), A561-5 (dec 10561-5), A561-6 (dec 10561-6), A601-1 (dec 10601-1), A601-2 (dec 10601-2), A602-1 (dec 10602-1)
diff --git a/projects/dataforth-dos/PARSING-FIDELITY-VERDICT-2026-06-17.md b/projects/dataforth-dos/PARSING-FIDELITY-VERDICT-2026-06-17.md
new file mode 100644
index 00000000..52b16b20
--- /dev/null
+++ b/projects/dataforth-dos/PARSING-FIDELITY-VERDICT-2026-06-17.md
@@ -0,0 +1,45 @@
+# Parsing Fidelity Verdict — testdatadb ingestion vs original staged datasheets
+
+**Date:** 2026-06-17 · **Host:** AD2 · **Scope:** all 11,922 staged original `.TXT` datasheets vs the PostgreSQL `test_records`
+**Raw report:** `PARSING-FIDELITY-REPORT-2026-06-17.txt` · **Tool:** `datasheet-pipeline/implementation/tools/validate-parsing.js`
+
+## Verdict
+
+**Ingestion/parsing is faithful — 0 genuine parse faults across 11,239 comparable records.** Every staged datasheet that has a corresponding DB record and a comparable test run matches on serial, model, date, and the 5 accuracy-test results. The earlier "mismatches" were all explained by retests, reused serials, format variants, or legacy out-of-scope units — not by the parser misreading or mis-segmenting data.
+
+## Method
+
+Compared each staged original `.TXT` (the DOS-station ground truth, written *before* ingestion) against the DB record's `raw_data` (parsed from the `.DAT`). The cross-check keyed on **scale-invariant data**:
+- **Error (%)** — dimensionless, identical in `.DAT` and `.TXT` for every family (immune to mV scaling and current-output V→mA conversion). The primary fidelity signal.
+- **Stim setpoints** (scale-aware) — used to confirm the *same unit/test* when error values differed (retest vs wrong-record).
+- Serial (with hex-prefix decode), model (SCM-prefix normalized), date.
+
+## Results (11,921 staged files with an SN)
+
+| Outcome | Count | Meaning |
+|---|---:|---|
+| **Consistent** (SN+model+date+5×error%) | **11,226** | Faithful parse, confirmed |
+| Retest — DB date newer than `.TXT` | 35 | ON-CONFLICT updated DB to a later test (expected) |
+| Retest — same date, stim matches, run differs | 42 | Unit tested twice same day; DB keeps first run (strictly-greater-date rule) |
+| VAS/single-point format | 5 | No 5-row accuracy block (SCMVAS) — not comparable by this method |
+| Serial collision (generic SN, diff family) | 2 | `1-1`/`1-2` reused across products; unique-on-serial keeps one |
+| **Genuine parse fault** | **0** | — |
+| Model variant mismatch (same family) | 2 | `A819-1`/`A821-2` — reused serial across 8B35/8B36 (collision) |
+| DB older than `.TXT` | 1 | `A821-1` — same collision pair |
+| Accuracy-row-count diff | 0 | — |
+
+### Why the "genuine" bucket collapsed to 0
+The last 16 suspects were all `SCM5B37K-1530` (K-thermocouple). Their stim values matched the same 5 nominal setpoints (-50/112.5/275/437.5/600 °C) but differed by ~0.06 °C run-to-run — because the thermocouple input is a *measured analog value*, not an exact setpoint. A scale+relative stim tolerance correctly classifies them as same-day retests. A real segmentation fault would show a different setpoint *structure*; none did.
+
+## Two follow-up items (NOT parsing-correctness bugs)
+
+1. **608 staged originals have no DB record** (mostly A-prefix `10xxx` serials, e.g. `A243-1` = `10243-1`, model `5B45-25D`). These exist as staged `.TXT` but are absent from the DB under both decoded and encoded serial. This is an **ingestion-completeness** question (the source `.DAT` for these units appears to be out of the import scan scope, or these are custom `-NND` variants), separate from parsing fidelity. Worth a completeness pass: confirm which `.DAT` paths the importer scans and whether these models' `.DAT` files are present.
+2. **Same-day retests keep the first run.** The `ON CONFLICT` rule updates only when `EXCLUDED.test_date > test_records.test_date` (strictly greater). Two runs on the *same date* leave the DB on the first-imported run, which may differ from the latest staged datasheet. If "latest run wins" is desired, the rule needs a tiebreaker (e.g. `>=` with an import-time or sequence guard). 42 records currently sit on a same-day earlier run.
+
+## How to re-run
+
+```bash
+cd C:\Shares\testdatadb
+node /validate-parsing.js [optional-report-path]
+# Reads C:\Shares\test\STAGE\**\*.TXT and compares to test_records. Read-only.
+```
diff --git a/projects/dataforth-dos/datasheet-pipeline/implementation/tools/validate-parsing.js b/projects/dataforth-dos/datasheet-pipeline/implementation/tools/validate-parsing.js
new file mode 100644
index 00000000..92738ed3
--- /dev/null
+++ b/projects/dataforth-dos/datasheet-pipeline/implementation/tools/validate-parsing.js
@@ -0,0 +1,177 @@
+// Parsing-fidelity validation (READ-ONLY): every staged original .TXT vs the DB record.
+// Compares scale-invariant data: SN, model, date, and the 5 Error(%) accuracy values
+// (error% is dimensionless -> immune to mV scaling / current-output conversion, so a
+// mismatch means a real parsing/segmentation/identity fault, not a rendering transform).
+const fs = require('fs');
+const path = require('path');
+const db = require('./database/db');
+
+const STAGE = 'C:/Shares/test/STAGE';
+const ERR_TOL = 0.003; // half-unit of 3-decimal display + margin
+const REPORT = process.argv[2] || null;
+
+function walk(dir, out) {
+  let items = [];
+  try { items = fs.readdirSync(dir, { withFileTypes: true }); } catch { return out; }
+  for (const it of items) {
+    const p = path.join(dir, it.name);
+    if (it.isDirectory()) walk(p, out);
+    else if (/\.txt$/i.test(it.name)) out.push(p);
+  }
+  return out;
+}
+
+function parseTxt(txt) {
+  const lines = txt.split(/\r?\n/);
+  const get = re => { for (const l of lines) { const m = l.match(re); if (m) return m[1]; } return null; };
+  const sn = get(/^\s*SN:\s*(\S+)/);
+  const model = get(/^\s*Model:\s*(\S+)/);
+  const date = get(/^\s*Date:\s*(\d{2}-\d{2}-\d{4})/);
+  // accuracy rows: lines ending in PASS/FAIL with >=4 numeric tokens, before FINAL TEST
+  const errs = [];
+  const stims = [];
+  for (const l of lines) {
+    if (/FINAL TEST/i.test(l)) break;
+    if (!/\b(PASS|FAIL)\b/.test(l)) continue;
+    const nums = (l.match(/[+-]?\d*\.\d+|[+-]?\d+/g) || []).map(Number);
+    if (nums.length >= 4) { errs.push(nums[3]); stims.push(nums[0]); } // [0]=stim [3]=Error(%)
+    if (errs.length === 5) break;
+  }
+  return { sn, model, date, errs, stims };
+}
+
+// Decode hex-prefix encoded serial (A-prefix files store the ENCODED SN inside):
+// leading [A-Z] -> (charCode-55) numeric prefix. H9553-13-style files already store
+// the decoded SN, which is numeric, so they don't match and pass through unchanged.
+function decodeSn(sn) {
+  if (/^[A-Za-z]\d/.test(sn)) {
+    const n = sn.toUpperCase().charCodeAt(0) - 55;
+    return String(n) + sn.slice(1);
+  }
+  return sn;
+}
+const normModel = m => (m || '').toUpperCase().replace(/^SCM/, '');
+
+function parseRawAcc(raw) {
+  if (!raw) return { errs: [], stims: [] };
+  const lines = raw.split('\n').map(s => s.trim()).filter(Boolean);
+  const errs = [], stims = [];
+  for (let i = 1; i < lines.length && errs.length < 5; i++) {
+    const f = lines[i].split(',');
+    if (f.length >= 5 && /"(PASS|FAIL)"/.test(lines[i])) {
+      const e = parseFloat(f[3]), s = parseFloat(f[0]);
+      if (!isNaN(e)) { errs.push(e); stims.push(s); }
+    }
+  }
+  return { errs, stims };
+}
+// scale-aware + relative stim match (mV display = V*1000; analog inputs vary run-to-run).
+// Matching the 5-point setpoint pattern proves same unit/test -> correct segmentation.
+function stimMatch1(t, r) {
+  return [r, r * 1000, r / 1000].some(c => Math.abs(t - c) <= Math.max(0.3, 0.005 * Math.abs(c)));
+}
+function stimsMatch(txt, raw) {
+  return txt.length === 5 && raw.length === 5 && txt.every((t, i) => stimMatch1(t, raw[i]));
+}
+
+(async () => {
+  console.log('Scanning staged .TXT files...');
+  const files = walk(STAGE, []);
+  console.log('Found ' + files.length + ' staged .TXT files');
+
+  // Parse all files, collect SNs
+  const recs = [];
+  let noSn = 0, noAcc = 0;
+  for (const f of files) {
+    let t; try { t = fs.readFileSync(f, 'utf8'); } catch { continue; }
+    const p = parseTxt(t);
+    if (!p.sn) { noSn++; continue; }
+    if (p.errs.length < 5) noAcc++;
+    p.key = decodeSn(p.sn); // DB lookup key (decoded)
+    recs.push({ file: f, ...p });
+  }
+
+  // Bulk-load DB rows for these SNs (decoded keys)
+  const sns = [...new Set(recs.map(r => r.key))];
+  const dbMap = new Map();
+  for (let i = 0; i < sns.length; i += 1000) {
+    const chunk = sns.slice(i, i + 1000);
+    const rows = await db.query(
+      'SELECT serial_number, model_number, test_date, raw_data FROM test_records WHERE serial_number = ANY($1)', [chunk]);
+    for (const r of rows) dbMap.set(r.serial_number, r);
+  }
+
+  const out = { missing: [], collision: [], model: [], dbOlder: [], err: [], errRowCount: [], retest: 0, retestSameDay: 0, vasFmt: 0, ok: 0 };
+  for (const r of recs) {
+    const d = dbMap.get(r.key);
+    if (!d) { out.missing.push(r.sn + (r.key !== r.sn ? ' (dec ' + r.key + ')' : '')); continue; }
+    const dbDate = d.test_date && d.test_date.toISOString ? d.test_date.toISOString().slice(0,10) : String(d.test_date);
+    let txtDate = null;
+    if (r.date) { const [mm,dd,yy] = r.date.split('-'); txtDate = `${yy}-${mm}-${dd}`; }
+
+    // Collision: same SN but a genuinely different product family in DB (generic serials like 1-1 reused)
+    if (r.model && d.model_number && normModel(r.model) !== normModel(d.model_number)) {
+      const famTxt = normModel(r.model).replace(/[-0-9].*$/, '');
+      const famDb = normModel(d.model_number).replace(/[-0-9].*$/, '');
+      if (famTxt !== famDb) { out.collision.push(`${r.sn}: txt=${r.model} db=${d.model_number}`); continue; }
+      out.model.push(`${r.sn}: txt=${r.model} db=${d.model_number}`); continue; // same family, diff variant
+    }
+
+    // Retest: DB date newer than the staged file -> ON-CONFLICT updated DB to a later test. Expected.
+    if (txtDate && dbDate > txtDate) { out.retest++; continue; }
+    if (txtDate && dbDate < txtDate) { out.dbOlder.push(`${r.sn}: txt=${r.date} db=${dbDate}`); continue; }
+
+    // Same test run -> error% must match
+    const acc = parseRawAcc(d.raw_data);
+    const de = acc.errs;
+    if (r.errs.length === 5 && de.length === 5) {
+      const maxd = Math.max(...r.errs.map((e,i) => Math.abs(e - de[i])));
+      if (maxd > ERR_TOL) {
+        // Same SN+model+date but error% differs. If the STIM SETPOINTS match, it's the
+        // same unit/test points -> a same-day retest (DB kept a different run). If stim
+        // does NOT match, the wrong record's data is in raw_data -> genuine parse fault.
+        if (stimsMatch(r.stims, acc.stims)) { out.retestSameDay++; continue; }
+        out.err.push(`${r.sn} (${d.model_number}): STIM txt=[${r.stims.join(',')}] raw=[${acc.stims.map(x=>x.toFixed(4)).join(',')}] | err txt=[${r.errs.join(',')}] db=[${de.map(x=>x.toFixed(4)).join(',')}]`); continue;
+      }
+    } else if (r.errs.length === 5 && de.length === 0) {
+      out.vasFmt++; continue; // VAS/single-point format, no 5-row accuracy block in raw_data
+    } else if (r.errs.length === 5 && de.length !== 5) {
+      out.errRowCount.push(`${r.sn} (${d.model_number}): txt 5 rows, raw_data ${de.length}`); continue;
+    }
+    out.ok++;
+  }
+
+  const lines = [];
+  const L = s => { lines.push(s); console.log(s); };
+  L('========== PARSING FIDELITY REPORT ==========');
+  L('Staged .TXT files scanned : ' + files.length);
+  L(' - no SN line (non-standard fmt): ' + noSn);
+  L(' - SN found / compared : ' + recs.length);
+  L(' - .TXT w/o 5 accuracy rows : ' + noAcc);
+  L('Unique SNs looked up in DB : ' + sns.length);
+  L('SNs present in DB : ' + (sns.length - new Set(out.missing).size));
+  L('');
+  L('EXPLAINED (not parsing faults):');
+  L(' Consistent (SN+model+date+5 error% match) : ' + out.ok);
+  L(' Retest, DB newer date than .TXT : ' + out.retest);
+  L(' Retest same-day (stim matches, run differs): ' + out.retestSameDay);
+  L(' VAS/single-point fmt (no 5-row block) : ' + out.vasFmt);
+  L(' Serial collision (generic SN, diff family): ' + out.collision.length);
+  L('');
+  L('NEEDS REVIEW (potential genuine issues):');
+  L(' Missing from DB (after hex-decode) : ' + out.missing.length);
+  L(' Model variant mismatch (same family) : ' + out.model.length);
+  L(' DB OLDER than .TXT (stale DB?) : ' + out.dbOlder.length);
+  L(' GENUINE error% fault (stim ALSO differs) : ' + out.err.length);
+  L(' Accuracy-row-count diff : ' + out.errRowCount.length);
+  const sample = (label, arr) => { if (arr.length) { L(''); L(label + ' (first 20):'); arr.slice(0,20).forEach(x => L(' ' + x)); } };
+  sample('COLLISION (informational)', out.collision);
+  sample('MODEL VARIANT MISMATCH', out.model);
+  sample('DB OLDER THAN .TXT', out.dbOlder);
+  sample('GENUINE FAULT (stim+error differ)', out.err);
+  sample('ROW-COUNT DIFF', out.errRowCount);
+  if (out.missing.length) { L(''); L('MISSING-FROM-DB (first 30): ' + out.missing.slice(0,30).join(', ')); }
+
+  if (REPORT) { fs.writeFileSync(REPORT, lines.join('\n') + '\n'); console.log('\n[written] ' + REPORT); }
+  await db.close();
+})().catch(e => { console.error(e); process.exit(1); });
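
Review note: the two helpers that drive most of the classification, `decodeSn` and `stimMatch1`, are easy to sanity-check in isolation. The sketch below copies both functions from `validate-parsing.js` as they appear in this diff so that it runs on its own. The serial `A243-1` comes from the MISSING-FROM-DB sample in the report; the stim values are illustrative assumptions, not readings taken from a real datasheet.

```javascript
// Copied from validate-parsing.js in this diff so the snippet is self-contained.
function decodeSn(sn) {
  if (/^[A-Za-z]\d/.test(sn)) {
    const n = sn.toUpperCase().charCodeAt(0) - 55; // 'A' (char code 65) -> 10
    return String(n) + sn.slice(1);
  }
  return sn;
}

function stimMatch1(t, r) {
  // Try the raw value plus the x1000 and /1000 rescalings; accept within
  // max(0.3 absolute, 0.5% relative) of any candidate.
  return [r, r * 1000, r / 1000].some(
    c => Math.abs(t - c) <= Math.max(0.3, 0.005 * Math.abs(c)));
}

// Serial taken from the report's MISSING-FROM-DB sample.
console.log(decodeSn('A243-1'));  // 10243-1
console.log(decodeSn('10243-1')); // 10243-1 (starts with a digit, passes through)

// Illustrative stim values (assumed for this example).
console.log(stimMatch1(112.5, 112.56)); // true: small run-to-run analog drift
console.log(stimMatch1(5000, 5));       // true: mV display vs V stored
console.log(stimMatch1(275, 437.5));    // false: a different setpoint
```

The last case is the one that matters for the verdict: a mis-segmented record would pair a staged setpoint with a different setpoint in `raw_data`, and none of the three rescalings brings those within tolerance.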