dataforth(datasheet): root-cause the 608 missing units (report for John)

608 staged datasheets absent from DB. Two causes: (1) 229 units with encoded/
non-standard serials the importer's leading-digit regex silently skips - data is
in the .DAT, recoverable; full blind spot is 840 serials / 9,510 records / 141
models dropped fleet-wide. (2) 379 units whose per-model .DAT was overwritten by a
later work order - recoverable only from the staged .TXT or a log backup. Adds
John-facing report, raw data, and the chase-missing-units.js tool.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 14:55:44 -07:00
parent d58d1dd76c
commit 8f06426ba0
3 changed files with 241 additions and 0 deletions


@@ -0,0 +1,92 @@
# Test Datasheets Missing From the Database/Website — Findings
**To:** John Lehman (Engineering)
**From:** Mike Swanson, AZ Computer Guru
**Date:** 2026-06-17
**Scope:** Why some tested units have a staged datasheet but no record in testdatadb / on the website.
---
## Summary
I cross-checked **all 11,921 staged datasheet files that carry a serial number** (the `.TXT` files the test stations produce) against the database. **608 had no matching database record.** They fall into two distinct causes:
| Cause | Units | Recoverable? |
|---|---:|---|
| **1. Encoded / non-standard serial numbers the importer skips** | **229** | **Yes** — the data exists, the importer just doesn't read it |
| **2. Source log (`.DAT`) was overwritten before import** | **379** | Only from a backup, or from the staged `.TXT` itself |
The first cause is the important one: it is a **software limitation we can fix**, and its true reach is far larger than these 608 — see "Full scope" below.
---
## Cause 1 — Encoded serial numbers are silently skipped (229 units, fixable)
**What happens:** When a serial number is too long for the DOS 8.3 filename, the test program encodes the first two digits as a letter (e.g. `10243-1` is written as `A243-1`; `10` -> `A`). For these units, the serial is stored **with a leading letter** inside the log file:
```
"A243-1","01-21-2025" <- real serial 10243-1, model 5B45-25D
```
The database importer recognizes a record only when the serial **starts with a digit**. A serial that starts with a letter never matches, so the whole record is **silently dropped** — it is never imported, never rendered, never sent to the website.
**Confirmed:** the encoded serials are present in the `.DAT` logs (e.g. `5BLOG\45-25D.DAT` contains `"A243-1","01-21-2025"`), and the decoded form (`10243-1`) appears in no log and in no database row. So the data exists; the importer simply can't read it.
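The rejection can be reproduced in isolation. The sketch below reuses the two patterns from the analysis tool in this commit: `STRICT`, which that tool records as the importer's current pattern, and `LOOSE`, which accepts any serial followed by a date. The `classify` helper is hypothetical and exists only to illustrate the behavior.

```javascript
// Patterns copied from the analysis tool in this commit.
const STRICT = /^"(\d+-\d+[A-Za-z]?)","(\d{2}-\d{2}-\d{4})"$/; // serial must start with a digit
const LOOSE = /^"([^"]+)","(\d{2}-\d{2}-\d{4})"$/;             // any serial followed by a date

// Hypothetical helper: report how a single .DAT line would be treated.
function classify(line) {
  const t = line.trim();
  if (STRICT.test(t)) return 'imported';
  if (LOOSE.test(t)) return 'dropped'; // shaped like a record, but STRICT never matches it
  return 'not-a-record';
}

console.log(classify('"177097-1","10-17-2025"'));  // imported
console.log(classify('"A243-1","01-21-2025"'));    // dropped
console.log(classify('"178540-A1","02-26-2026"')); // dropped
```

The third line shows why the 17 other-format serials fall under the same rule: `178540-A1` starts with a digit, but the letter after the hyphen still fails the strict pattern.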
**Of the 608 missing, 229 are this case:**
- 212 are hex-encoded serials (all `A`-prefix, i.e. `10xxx` serials)
- 17 are other non-standard serial formats the same rule rejects (e.g. `TEST-1`, `178540-A1`, `A-1`)
- Stations for the 212 encoded serials: TS-11R (142), TS-11L (59), TS-8R (11); the 17 other-format units come from stations including TS-11R and TS-8L
- Dates: late 2025 through 2026
- Example models: SCM5B47K-05, SCM5B38-04, SCM5B34-01, 8B45-02, SCM5B36-02, 8B32-01, 5B45-25D, DSCA45-01
**Examples (encoded -> real serial):**
```
A243-1 -> 10243-1 5B45-25D 02-03-2026 TS-11L
A244-1 -> 10244-1 SCM5B30-02 03-26-2026 TS-11L
A276-1 -> 10276-1 DSCA39-05 05-07-2026 TS-11L
A328-1 -> 10328-1 DSCA45-08 02-17-2026 TS-11L
```
### Full scope of this bug (beyond the 608)
The 229 above are only the units that *also* still have a staged `.TXT` on disk. Scanning **every** `.DAT` log across all stations and the central history logs, the importer is dropping:
- **840 distinct encoded serial numbers**
- **9,510 individual test records**
- across **141 models**
- of which **831 of 840 serials are absent from the database**
So this single serial-format limitation is keeping roughly **9,500 test results out of the database and off the website.**
**Fix:** teach the importer to (a) accept a serial that starts with a letter and (b) decode it back to its real number (`A243-1` -> `10243-1`) before storing, matching how the longer serials (the `H`-prefix range) are already handled. This is a one-function change to the import parser. It would recover the 212 encoded units here plus the ~9,500-record backlog. The 17 other-format serials (e.g. `178540-A1`, `TEST-1`) need no decoding, but the pattern would also have to be loosened to accept them if they are to be imported. (I will write this up as a separate proposed change for review; no code has been changed.)
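A minimal sketch of what such a change might look like, assuming the importer parses one `"serial","date"` line at a time. This is not the importer's actual code: `RECORD`, `decodeSerial`, and `parseRecordLine` are hypothetical names, and the letter arithmetic (A -> 10, B -> 11, ...) is taken from the `decode` helper in the analysis tool, where only the `A` prefix was observed among the staged units.

```javascript
// Sketch only: accepts any quoted serial followed by an MM-DD-YYYY date.
const RECORD = /^"([^"]+)","(\d{2}-\d{2}-\d{4})"$/;

// A leading letter followed by a digit stands for a two-digit prefix:
// 'A' (char code 65) - 55 = 10, so A243-1 -> 10243-1.
// Serials that do not start with letter+digit are returned unchanged.
function decodeSerial(sn) {
  if (!/^[A-Za-z]\d/.test(sn)) return sn;
  return String(sn.toUpperCase().charCodeAt(0) - 55) + sn.slice(1);
}

// Returns null for lines that are not records (e.g. a bare model header).
function parseRecordLine(line) {
  const m = line.trim().match(RECORD);
  if (!m) return null;
  return { rawSerial: m[1], serial: decodeSerial(m[1]), date: m[2] };
}

console.log(parseRecordLine('"A243-1","01-21-2025"'));
// { rawSerial: 'A243-1', serial: '10243-1', date: '01-21-2025' }
```

Keeping the raw serial alongside the decoded one is a design choice of this sketch; it would let a stored row be traced back to the exact token in the `.DAT`.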
---
## Cause 2 — Source log overwritten before import (379 units)
These have ordinary numeric serials (no encoding issue), but their test data is **no longer in any log file** we import from. The per-model `.DAT` log is reused for later production runs, and the older records get overwritten.
**Confirmed example:** units `177097-1 ... 177097-16` (model DSCA33-05, tested 10-17-2025, TS-1R) appear in **no** log file anywhere under the test share. Their model's log (`DSCLOG\33-05.DAT`) now contains a **different** work order (`178644-*`, tested 02-26-2026). The 10-17-2025 results were overwritten; only the staged `.TXT` datasheet survived.
**Recovery options for these 379:**
1. **From the staged `.TXT` itself** — the rendered datasheet still exists on disk for these units; they can be imported directly from the `.TXT` (store the existing sheet as-is) rather than re-derived from the `.DAT`. This is the most practical recovery and would cover most of the 379.
2. **From a pre-overwrite backup** of the `.DAT` logs, if one exists. (The `Recovery-TEST` backup area referenced by the importer does not exist on this server; if Engineering keeps log backups elsewhere, those could be imported.)
If neither is pursued, these units' results remain only as the staged `.TXT` files and will not appear in the database or on the website.
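For option 1, the identifying fields can be read straight from the staged sheet. The sketch below uses the same `SN:`, `Model:`, and `Date:` patterns as the analysis tool; `parseStagedSheet` is a hypothetical name, the sample text is invented for illustration (a real sheet contains more than these three lines), and how the result would be written to the database is left open.

```javascript
// Sketch only: extract the key fields from a staged .TXT datasheet and
// keep the full text so the existing sheet can be stored as-is.
function parseStagedSheet(text) {
  const pick = re => (text.match(re) || [])[1] || '';
  const serial = pick(/^\s*SN:\s*(\S+)/m);
  if (!serial) return null; // no serial line: nothing to key the record on
  return {
    serial,
    model: pick(/^\s*Model:\s*(\S+)/m),
    date: pick(/^\s*Date:\s*(\d{2}-\d{2}-\d{4})/m),
    sheet: text,
  };
}

// Invented sample; only the three labelled lines are assumed.
const sample = ['Model: DSCA33-05', 'SN: 177097-1', 'Date: 10-17-2025'].join('\n');
const rec = parseStagedSheet(sample);
console.log(rec.serial, rec.model, rec.date); // 177097-1 DSCA33-05 10-17-2025
```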
---
## Recommended actions
1. **Fix the importer's serial handling** (Cause 1) — recovers 229 staged units and ~9,500 total dropped records across 141 models. Highest value, single code change. Proposal to follow for review.
2. **Backfill the 379 overwritten units (Cause 2) from their staged `.TXT` files** — recovers the datasheets that still exist on disk. Lower-risk than chasing log backups.
3. **Prevent recurrence:** the per-model log overwrite is the underlying reason Cause 2 data is lost. If retaining every run matters, the logs should be archived before reuse (or the import scheduled frequently enough that records are captured before a log is overwritten).
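One way the archiving in item 3 could be done is a server-side copy of each per-model log to a timestamped file before it is reused. The sketch below is an assumption about how that might look, not an existing tool: `archiveLog` and the archive folder are hypothetical, and when it would be triggered (scheduled task, import hook) is not decided here.

```javascript
const fs = require('fs');
const path = require('path');

// Sketch only: copy a per-model log into archiveDir under a timestamped name,
// e.g. 33-05.DAT -> 33-05.2025-10-17T12-00-00-000Z.DAT
function archiveLog(datPath, archiveDir, now = new Date()) {
  fs.mkdirSync(archiveDir, { recursive: true });
  const stamp = now.toISOString().replace(/[:.]/g, '-'); // ':' is not valid in Windows filenames
  const { name, ext } = path.parse(datPath);
  const dest = path.join(archiveDir, `${name}.${stamp}${ext}`);
  // COPYFILE_EXCL makes the copy fail rather than overwrite an existing archive file.
  fs.copyFileSync(datPath, dest, fs.constants.COPYFILE_EXCL);
  return dest;
}
```

Because each copy gets its own name, an earlier work order's records would survive even after the live `.DAT` is reused.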
---
## How this was determined (for the record)
- Compared every `C:\Shares\test\STAGE\**\*.TXT` against `test_records` (serial, model, date, measured results).
- For each missing unit, searched all `.DAT` sources (central HISTLOGS + every station's LOGS, 26,815 files) for the encoded and decoded serial.
- Confirmed encoded serials are present in the logs but skipped by the import regex; confirmed overwritten units are absent from all logs and their model log now holds a newer work order.
- Tools (read-only) committed under `projects/dataforth-dos/datasheet-pipeline/implementation/tools/`; raw data in `MISSING-UNITS-ROOTCAUSE-2026-06-17.txt`.


@@ -0,0 +1,57 @@
========== MISSING-UNITS ROOT CAUSE & SCOPE ==========
Staged .TXT with SN : 11921
Staged units MISSING from DB : 608
ROOT-CAUSE CATEGORIES (of the missing):
Parser-drop (encoded serial w/ leading letter present in .DAT, regex rejects): 212
Source .DAT has no record for this unit (data absent) : 379
Other (decoded present, still missing - investigate) : 17
by leading letter: A=212
by station : TS-11R=142, TS-11L=59, TS-8R=11
by model (top 12): SCM5B47K-05=18, SCM5B38-04=14, SCM5B34-01=13, 8B45-02=12, SCM5B36-02=11, 8B32-01=10, 5B45-25D=9, SCM5B49-05=8, 5B45-01=6, SCM5B39-05=6, DSCA45-01=5, SCM5B36-04=5
date range : 01-13-2026 .. 12-11-2025
samples :
A243-1 -> 10243-1 5B45-25D 02-03-2026 TS-11L
A243-2 -> 10243-2 5B45-25D 02-03-2026 TS-11L
A244-1 -> 10244-1 SCM5B30-02 03-26-2026 TS-11L
A255-1 -> 10255-1 5B45-25D 02-03-2026 TS-11L
A255-2 -> 10255-2 5B45-25D 02-03-2026 TS-11L
A276-1 -> 10276-1 DSCA39-05 05-07-2026 TS-11L
A276-2 -> 10276-2 DSCA39-05 05-07-2026 TS-11L
A328-1 -> 10328-1 DSCA45-08 02-17-2026 TS-11L
A328-2 -> 10328-2 DSCA45-01C 02-17-2026 TS-11L
A376-1 -> 10376-1 DSCA45-08 02-17-2026 TS-11L
A376-2 -> 10376-2 DSCA45-08 02-17-2026 TS-11L
A376-3 -> 10376-3 DSCA45-02 02-17-2026 TS-11L
DATA-ABSENT samples:
177097-1 -> 177097-1 DSCA33-05 10-17-2025 TS-1R
177097-10 -> 177097-10 DSCA33-05 10-17-2025 TS-1R
177097-11 -> 177097-11 DSCA33-05 10-17-2025 TS-1R
177097-12 -> 177097-12 DSCA33-05 10-17-2025 TS-1R
177097-13 -> 177097-13 DSCA33-05 10-17-2025 TS-1R
177097-14 -> 177097-14 DSCA33-05 10-17-2025 TS-1R
177097-16 -> 177097-16 DSCA33-05 10-17-2025 TS-1R
177097-2 -> 177097-2 DSCA33-05 10-17-2025 TS-1R
177097-3 -> 177097-3 DSCA33-05 10-17-2025 TS-1R
177097-4 -> 177097-4 DSCA33-05 10-17-2025 TS-1R
OTHER samples:
A-1 -> A-1 SCM5B38-37 11-20-2025 TS-11R
TEST-1 -> TEST-1 SCM5B392-04 11-12-2025 TS-11R
TEST-2 -> TEST-2 SCM5B392-04 11-12-2025 TS-11R
178540-A1 -> 178540-A1 SCM5B40-03 02-26-2026 TS-8L
178540-A2 -> 178540-A2 SCM5B40-03 02-26-2026 TS-8L
178540-A3 -> 178540-A3 SCM5B40-03 02-26-2026 TS-8L
178540-A4 -> 178540-A4 SCM5B40-03 02-26-2026 TS-8L
178540-B1 -> 178540-B1 SCM5B40-03 02-26-2026 TS-8L
178540-B2 -> 178540-B2 SCM5B40-03 02-26-2026 TS-8L
178540-B3 -> 178540-B3 SCM5B40-03 02-26-2026 TS-8L
FULL .DAT BLIND SPOT (all letter-prefixed serials the importer skips, not just staged):
distinct letter-prefixed serials in .DAT : 840
total letter-prefixed records (dropped) : 9510
distinct models affected : 141
of those serials, DECODED form absent from DB: 831 / 840


@@ -0,0 +1,92 @@
// Root-cause + scope of the 608 missing staged units (READ-ONLY) for the report to John.
// Hypothesis: importer serial/date regex requires a leading digit, so hex-encoded
// (leading-letter) serials in the .DAT are never matched -> records dropped.
const fs = require('fs');
const path = require('path');
const db = require('./database/db');
const STAGE = 'C:/Shares/test/STAGE';
const STRICT = /^"(\d+-\d+[A-Za-z]?)","(\d{2}-\d{2}-\d{4})"$/; // current importer regex
const LOOSE = /^"([^"]+)","(\d{2}-\d{2}-\d{4})"$/; // any serial before a date
const decode = sn => /^[A-Za-z]\d/.test(sn) ? String(sn.toUpperCase().charCodeAt(0) - 55) + sn.slice(1) : sn;
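// Recursively collect files under dir whose names match re; unreadable directories are skipped.
function walk(dir, re, out) {
  let entries = [];
  try { entries = fs.readdirSync(dir, { withFileTypes: true }); } catch { return out; }
  for (const e of entries) {
    const p = path.join(dir, e.name);
    if (e.isDirectory()) walk(p, re, out);
    else if (re.test(e.name)) out.push(p);
  }
  return out;
}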
(async () => {
// ---- staged .TXT inventory ----
const txts = walk(STAGE, /\.txt$/i, []);
const staged = [];
for (const f of txts) { let t; try{t=fs.readFileSync(f,'utf8')}catch{continue;}
const sn=(t.match(/^\s*SN:\s*(\S+)/m)||[])[1]; if(!sn) continue;
const model=(t.match(/^\s*Model:\s*(\S+)/m)||[])[1]||'';
const date=(t.match(/^\s*Date:\s*(\d{2}-\d{2}-\d{4})/m)||[])[1]||'';
const station=(f.match(/STAGE[\\\/]([^\\\/]+)/)||[])[1]||'';
staged.push({ sn, dec: decode(sn), model, date, station, file: f });
}
// ---- which staged decoded serials are in DB ----
const decs=[...new Set(staged.map(s=>s.dec))]; const inDb=new Set();
for(let i=0;i<decs.length;i+=1000){ const rows=await db.query('SELECT serial_number FROM test_records WHERE serial_number = ANY($1)',[decs.slice(i,i+1000)]); for(const r of rows) inDb.add(r.serial_number); }
const missing = staged.filter(s=>!inDb.has(s.dec));
// ---- scan ALL .DAT sources: which serial tokens appear, strict vs letter-prefixed ----
let dats=[]; walk('C:/Shares/test/Ate/HISTLOGS', /\.dat$/i, dats);
let stations=[]; try{stations=fs.readdirSync('C:/Shares/test',{withFileTypes:true}).filter(d=>d.isDirectory()&&/^TS-\d+[LR]?$/i.test(d.name)).map(d=>d.name);}catch{}
for(const s of stations) walk(path.join('C:/Shares/test',s,'LOGS'), /\.dat$/i, dats);
const looseSet=new Set(); const letterSet=new Set(); let letterRecs=0; const letterModels=new Set();
let fi=0;
for(const f of dats){ fi++; if(fi%5000===0) console.log(' scan '+fi+'/'+dats.length);
let lines; try{lines=fs.readFileSync(f,'utf8').split('\n')}catch{continue;}
let lastModel='';
for(const l of lines){ const t=l.trim();
const mm=t.match(/^"([A-Z0-9][A-Z0-9 \-]*)"$/i); if(mm && !/PASS|FAIL/.test(t) && !t.includes(',')) { lastModel=mm[1].trim(); continue; }
const m=t.match(LOOSE); if(m){ const sn=m[1]; looseSet.add(sn);
if(/^[A-Za-z]\d/.test(sn) && !STRICT.test(t)){ letterSet.add(sn); letterRecs++; if(lastModel) letterModels.add(lastModel); } }
}
}
// ---- categorize the missing ----
const cat = { parserDrop: [], absent: [], decInDbButMiss: [] };
for(const s of missing){
if(letterSet.has(s.sn) || (/^[A-Za-z]\d/.test(s.sn) && looseSet.has(s.sn))) cat.parserDrop.push(s);
else if(!looseSet.has(s.sn) && !looseSet.has(s.dec)) cat.absent.push(s);
else cat.decInDbButMiss.push(s);
}
const by=(arr,k)=>{const m={};for(const x of arr){const v=(x[k]||'?');m[v]=(m[v]||0)+1;}return Object.entries(m).sort((a,b)=>b[1]-a[1]);};
// ---- full letter-prefixed population in .DAT and how much is absent from DB ----
const letterDecs=[...letterSet].map(decode); const letterInDb=new Set();
for(let i=0;i<letterDecs.length;i+=1000){ const rows=await db.query('SELECT serial_number FROM test_records WHERE serial_number = ANY($1)',[letterDecs.slice(i,i+1000)]); for(const r of rows) letterInDb.add(r.serial_number); }
const letterMissingDistinct=letterDecs.filter(d=>!letterInDb.has(d)).length;
const out=[]; const L=s=>{out.push(s);console.log(s);};
L('========== MISSING-UNITS ROOT CAUSE & SCOPE ==========');
L('Staged .TXT with SN : '+staged.length);
L('Staged units MISSING from DB : '+missing.length);
L('');
L('ROOT-CAUSE CATEGORIES (of the missing):');
L(' Parser-drop (encoded serial w/ leading letter present in .DAT, regex rejects): '+cat.parserDrop.length);
L(' Source .DAT has no record for this unit (data absent) : '+cat.absent.length);
L(' Other (decoded present, still missing - investigate) : '+cat.decInDbButMiss.length);
L('');
// Section header. (The previous expression concatenated the label into a ternary condition, so it always printed an empty string.)
L('PARSER-DROP breakdown:');
const lead=()=>{const m={};for(const x of cat.parserDrop){const c=x.sn[0].toUpperCase();m[c]=(m[c]||0)+1;}return Object.entries(m).sort((a,b)=>b[1]-a[1]);};
L(' by leading letter: '+lead().map(([c,n])=>c+'='+n).join(', '));
L(' by station : '+by(cat.parserDrop,'station').map(([c,n])=>c+'='+n).join(', '));
L(' by model (top 12): '+by(cat.parserDrop,'model').slice(0,12).map(([c,n])=>c+'='+n).join(', '));
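// Sort chronologically: MM-DD-YYYY strings do not sort by date, so compare on a YYYYMMDD key.
L(' date range : '+(()=>{const key=d=>d.slice(6)+d.slice(0,2)+d.slice(3,5);const ds=cat.parserDrop.map(s=>s.date).filter(Boolean).sort((a,b)=>key(a).localeCompare(key(b)));return ds.length?ds[0]+' .. '+ds[ds.length-1]:'(none)';})());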
L(' samples :'); cat.parserDrop.slice(0,12).forEach(s=>L(' '+s.sn+' -> '+s.dec+' '+s.model+' '+s.date+' '+s.station));
if(cat.absent.length){ L(''); L('DATA-ABSENT samples:'); cat.absent.slice(0,10).forEach(s=>L(' '+s.sn+' -> '+s.dec+' '+s.model+' '+s.date+' '+s.station)); }
if(cat.decInDbButMiss.length){ L(''); L('OTHER samples:'); cat.decInDbButMiss.slice(0,10).forEach(s=>L(' '+s.sn+' -> '+s.dec+' '+s.model+' '+s.date+' '+s.station)); }
L('');
L('FULL .DAT BLIND SPOT (all letter-prefixed serials the importer skips, not just staged):');
L(' distinct letter-prefixed serials in .DAT : '+letterSet.size);
L(' total letter-prefixed records (dropped) : '+letterRecs);
L(' distinct models affected : '+letterModels.size);
L(' of those serials, DECODED form absent from DB: '+letterMissingDistinct+' / '+letterSet.size);
if(process.argv[2]) fs.writeFileSync(process.argv[2], out.join('\n')+'\n');
await db.close();
})().catch(e=>{console.error(e);process.exit(1);});