medway-import-core
v0.2.0
Published
Lightweight, browser/RN-safe core for parsing MedWay stock import files (XLSX and CSV) and producing a canonical product payload with detailed row errors and metadata.
Installation
Install from npm:
npm install medway-import-core
Build locally:
pnpm install # or npm/yarn
pnpm run build
Public API
parseProductsFileFromBuffer(fileBytes, filename, options?): Promise<ParsedImportResult>
- fileBytes: ArrayBuffer of the selected file
- filename: original filename, used to detect the extension
- options: { mode?: "fast"|"deep", validationMode?: "full"|"errorsOnly"|"none" }
- Returns { rows: CanonicalProduct[], errors: ParsedRowError[], meta: {...} }
- Meta includes: sourceSchema, headerMode, requiredFields, analysisMode, sampleSize, concatMode, validationMode, engineVersion, concatenatedColumns, dirtyColumns, decomposedColumns, and columnGuesses (headerless only).
Types are exported from ./types.
Usage: Web
import { parseProductsFileFromBuffer } from "medway-import-core";

async function handleFile(file: File) {
  const bytes = await file.arrayBuffer();
  const result = await parseProductsFileFromBuffer(bytes, file.name);
  console.log(result.rows, result.errors, result.meta);
}
Usage: React Native / Expo
import { parseProductsFileFromBuffer } from "medway-import-core";
import * as DocumentPicker from "expo-document-picker";

async function handleImport() {
  const res = await DocumentPicker.getDocumentAsync({
    type: [
      "text/csv",
      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ],
  });
  // Stop when the picker was dismissed or returned no file.
  if (res.canceled || !res.assets?.[0]) return;
  const asset = res.assets[0];
  // Read the picked file into an ArrayBuffer via its local URI.
  const bytes = await fetch(asset.uri).then((r) => r.arrayBuffer());
  const result = await parseProductsFileFromBuffer(bytes, asset.name);
  console.log(result);
}
Notes
- ESM-only build (type: module) for modern bundlers and Metro; no Node-only APIs.
- XLSX parsing uses xlsx in array mode (type: "array"), no fs usage.
- CSV parsing is a small, dependency-free parser.
- Header semantics: generic CSVs are mapped via a synonyms + type-aware matcher. You can inspect suggested mappings by importing suggestHeaderMappings.
Changes (documentation)
- Added .gitignore for node_modules, lockfiles, debug logs, .DS_Store.
- Updated package.json with files and devDependencies.typescript.
- Made tsconfig.json standalone and strict.
- Relaxed identity validation: only validate identity codes when provided, and treat identity errors as non-fatal; coo issues reported under identity.coo.
- Added CLI tester: npm run parse-file <path> prints schema, counts, sample rows, and errors.
- Added minimal tests: npm test runs CSV and XLSX template fixtures to prevent regressions.
- Added header semantics module to improve generic CSV mapping using synonyms and data type checks.
- Updated row contract: any non-empty input row appears in preview; missing product name no longer drops the row. Instead an error E_PRODUCT_NAME_REQUIRED is attached. Only fully blank rows are dropped.
- Promoted additional fields to canonical and mapping: brand_name, requires_prescription, is_controlled, storage_conditions, description, and packaging (purchase_unit, pieces_per_unit, unit).
- CLI now shows header-to-canonical mapping with confidence and debug lists of kept/dropped indices.
- Required vs optional fields:
  - Required (blocking if missing/invalid): product.generic_name, product.strength, product.form, product.category, batch.expiry_date, pkg.pieces_per_unit, identity.coo, batch.on_hand.
  - Optional: brand_name, batch.batch_no, batch.unit_price, identity.sku, requires_prescription, is_controlled, storage_conditions, description, purchase_unit, unit, reserved.
- Error codes for required fields added: E_REQUIRED_GENERIC_NAME, E_REQUIRED_STRENGTH, E_REQUIRED_FORM, E_REQUIRED_CATEGORY, E_REQUIRED_EXPIRY, E_REQUIRED_PACK_CONTENTS, E_REQUIRED_COO, E_REQUIRED_QUANTITY.
- Header detection: the first row is scored as header-like vs data-like. If data signals dominate (dates, numeric values, strength patterns, country names) the file is treated as headerless (meta.headerMode = "none") and column guesses are produced. If header tokens and short labels dominate, it is treated as headers.
- Fallback policy: headers are preferred when detection is ambiguous; headerless mapping is applied when the first row is data-like or header-based parsing yields zero rows. meta.headerMode, meta.columnGuesses, and meta.requiredFields are exposed.
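The header detection and fallback policy described above can be pictured with a small sketch. This is not the library's scorer: guessHeaderMode, looksLikeData, and the three regular expressions are hypothetical names and patterns chosen for illustration, and country-name signals are left out. It only shows the idea of counting data-like cells and preferring headers on a tie.

```typescript
type HeaderMode = "headers" | "none";

// Hypothetical data signals: dates, plain numbers, and strength-like tokens.
const DATE_RE = /^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$/;
const NUMERIC_RE = /^\d+(\.\d+)?$/;
const STRENGTH_RE = /^\d+(\.\d+)?\s*(mg|g|mcg|iu|ml|%)$/i;

function looksLikeData(cell: string): boolean {
  const v = cell.trim();
  return DATE_RE.test(v) || NUMERIC_RE.test(v) || STRENGTH_RE.test(v);
}

// Score the first row; ambiguous (tied) rows are treated as headers.
function guessHeaderMode(firstRow: string[]): HeaderMode {
  const cells = firstRow.map((c) => c.trim()).filter((c) => c.length > 0);
  const dataLike = cells.filter((c) => looksLikeData(c)).length;
  const headerLike = cells.length - dataLike;
  return dataLike > headerLike ? "none" : "headers";
}
```

Preferring headers on a tie mirrors the stated fallback policy; a real scorer would also weigh header tokens and short labels rather than treating every non-data cell as header-like.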
Heuristic Improvements (Headerless)
- GTIN detection: columns that are ≥60% and typically ≥90% 13-digit numeric are classified as identity.sku with high confidence; on_hand and generic_name guesses are suppressed for those columns. See inferHeaderlessAssignments and inferHeaderlessGuesses in src/schema.ts.
- Purchase unit mapping: columns with values from {box, bottle, strip, vial, ampoule, device} are mapped to identity.purchase_unit.
- Prescription mapping: columns containing {RX, OTC} (case-insensitive) are mapped to product.requires_prescription; sanitizer accepts these tokens.
- CLI polish: when meta.headerMode = "none", the CLI suppresses headerMappings and prints only columnGuesses.
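As a rough illustration of these headerless heuristics, the sketch below guesses a field for one column of sampled values. guessHeaderlessColumn is a hypothetical helper, not the library's inferHeaderlessGuesses; it uses the 60% GTIN floor from the text and assumes that purchase-unit and RX/OTC columns contain only those tokens.

```typescript
type HeaderlessGuess =
  | "identity.sku"
  | "identity.purchase_unit"
  | "product.requires_prescription"
  | null;

const GTIN13_RE = /^\d{13}$/;
const PURCHASE_UNITS = new Set(["box", "bottle", "strip", "vial", "ampoule", "device"]);
const RX_TOKENS = new Set(["rx", "otc"]);

function guessHeaderlessColumn(values: string[]): HeaderlessGuess {
  // Compare case-insensitively and ignore blank cells.
  const cells = values.map((v) => v.trim().toLowerCase()).filter((v) => v !== "");
  if (cells.length === 0) return null;

  // GTIN check runs first so numeric guesses cannot claim the column.
  const gtinShare = cells.filter((v) => GTIN13_RE.test(v)).length / cells.length;
  if (gtinShare >= 0.6) return "identity.sku";

  // Assumption: every non-empty cell must be a known token.
  if (cells.every((v) => PURCHASE_UNITS.has(v))) return "identity.purchase_unit";
  if (cells.every((v) => RX_TOKENS.has(v))) return "product.requires_prescription";
  return null;
}
```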
Category Classification
- A simple scoring library assigns an umbrella therapeutic category from generic_name, brand_name, category, description.
- Scoring uses weighted positives and negatives: +3 if category hits cluster keywords; +2 if generic/brand hits known molecules; +1 for description hits; device keywords boost MISC (+4) and others (+2); negative keywords subtract with guardrails. Minimum score and separation required.
- Result is placed at product.umbrella_category when score ≥ 2.
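A minimal sketch of the weighted scoring, assuming a toy cluster shape. The Cluster interface, the keyword lists passed in, and the separation margin of 1 are illustrative assumptions; only the +3/+2/+1 weights and the minimum score of 2 come from the text, and device boosts and negative keywords are omitted.

```typescript
interface Cluster {
  id: string;
  categoryKeywords: string[];
  molecules: string[];
  descriptionKeywords: string[];
}

interface ScoringRow {
  generic_name?: string;
  brand_name?: string;
  category?: string;
  description?: string;
}

function hits(text: string | undefined, words: string[]): boolean {
  if (text === undefined) return false;
  const lower = text.toLowerCase();
  return words.some((w) => lower.includes(w));
}

function scoreCluster(c: Cluster, row: ScoringRow): number {
  let score = 0;
  if (hits(row.category, c.categoryKeywords)) score += 3;
  if (hits(row.generic_name, c.molecules) || hits(row.brand_name, c.molecules)) score += 2;
  if (hits(row.description, c.descriptionKeywords)) score += 1;
  return score;
}

// Returns the best umbrella ID, or null when the score is too low or too close.
function classifyUmbrella(row: ScoringRow, clusters: Cluster[], minScore = 2, minGap = 1): string | null {
  const ranked = clusters
    .map((c) => ({ id: c.id, score: scoreCluster(c, row) }))
    .sort((a, b) => b.score - a.score);
  const best = ranked[0];
  const second = ranked[1];
  if (best === undefined || best.score < minScore) return null;
  if (second !== undefined && best.score - second.score < minGap) return null;
  return best.id;
}
```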
Codes Mapping
- Added 3-letter therapeutic code → umbrella mapping used to override classification when identity.cat is present.
- Supported codes: ANT, CVS, RES, CNS, ANE, MSK, OPH, HEM, END, VAC, IMM, DER, VIT, OBG, BPH, FER, ONC, ENT, GIT, SIG, TOX, RCM, MSC.
- Example: ONC → ANTINEOPLASTICS_SUPPORT, HEM → BLOOD, GIT → GASTROINTESTINAL.
- Classification now prefers the code when provided; otherwise falls back to text scoring with guardrails.
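The "code first, text scoring second" preference might look like the following sketch. resolveUmbrella is a hypothetical helper, and the table contains only the three pairs given as examples above; the remaining codes are deliberately not guessed.

```typescript
// Only the example pairs from the text; the full 23-code table is not reproduced here.
const CODE_TO_UMBRELLA = new Map<string, string>([
  ["ONC", "ANTINEOPLASTICS_SUPPORT"],
  ["HEM", "BLOOD"],
  ["GIT", "GASTROINTESTINAL"],
]);

// Prefer a recognized code; otherwise defer to a caller-supplied text classifier.
function resolveUmbrella(code: string | undefined, textFallback: () => string | null): string | null {
  const key = code === undefined ? "" : code.trim().toUpperCase();
  const mapped = CODE_TO_UMBRELLA.get(key);
  return mapped !== undefined ? mapped : textFallback();
}
```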
Consolidation
- Consolidated category logic into src/category.ts and removed duplicate dist/catefory.ts without losing functionality. The source file now contains:
  - Weighted scoring with guardrails
  - Optional negative keyword support
  - Code → umbrella mapping (mapCategoryCodeToUmbrella)
  - O(1) index (UMBRELLA_CATEGORY_INDEX)
Template v3 (Products sheet) Mapping
- Exact headers recognized with confidence 1.0 via semantics: Generic (International Name), Strength, Dosage Form, Product Category, Expiry Date, Pack Contents, Batch / Lot Number, Item Quantity, Unit Price, Country of Manufacture, Serial Number, Brand Name, Manufacturer, Notes.
- Mapped to canonical: product.generic_name (required), product.strength (required), product.form (required), product.category (required), batch.expiry_date (required), pkg.pieces_per_unit (required), batch.batch_no, batch.on_hand (required), batch.unit_price, identity.coo (required), identity.sku, product.brand_name, product.manufacturer_name, product.description.
- Country names are fuzzily normalized to ISO‑2 before validation (e.g., “United States” → US, “UK” → GB).
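To make the country normalization step concrete, here is a small alias-table sketch. lookupCountryAlias is a hypothetical function, not the library's normalizeCountryToIso2, and it covers only the aliases mentioned in this README; real fuzzy matching would need a fuller country database.

```typescript
// Aliases taken from the examples in this document, keyed in normalized form.
const COUNTRY_ALIASES = new Map<string, string>([
  ["united states", "US"],
  ["usa", "US"],
  ["uk", "GB"],
  ["bharat", "IN"],
  ["uae", "AE"],
  ["ksa", "SA"],
  ["ivory coast", "CI"],
  ["cote d ivoire", "CI"],
  ["türkiye", "TR"],
  ["holland", "NL"],
]);

function lookupCountryAlias(raw: string): string | null {
  // Lowercase, drop dots ("U.S.A" becomes "usa"), and collapse whitespace.
  const key = raw.trim().toLowerCase().replace(/\./g, "").replace(/\s+/g, " ");
  const iso = COUNTRY_ALIASES.get(key);
  return iso !== undefined ? iso : null;
}
```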
2025-11-19 – Country Normalizer + Expiry Parser (EyosiyasJ)
- Expanded country alias map to 30+ common variants (e.g., U.S.A → US, Bharat → IN, UAE → AE, KSA → SA, Ivory Coast/Cote d Ivoire → CI, Türkiye → TR, Holland → NL).
- Kept bundler/RN safety by registering i18n-iso-countries English locale via dynamic JSON import with a safe fallback when import assertions are unavailable.
- Implemented flexible expiry parsing for MMM-YY, MMM YYYY, MM-YY, MM/YYYY formats, converting to a deterministic ISO date (last day of month).
- Added unit tests for normalizeCountryToIso2 and flexible expiry parsing in tests/run-tests.mjs.
- No UI or styling changes; follows established patterns and includes function‑level comments.
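The flexible expiry rule (month/year input, last day of month output) can be sketched as below. parseExpiryToIso is a hypothetical name, and mapping two-digit years to 20YY is an assumption that the text does not state.

```typescript
const MONTHS = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"];

function parseExpiryToIso(raw: string): string | null {
  // Accepts MMM-YY, MMM YYYY, MM-YY, MM/YYYY (separator: hyphen, slash, or space).
  const m = /^([a-z]{3}|\d{1,2})[-\/ ](\d{4}|\d{2})$/.exec(raw.trim().toLowerCase());
  if (m === null) return null;
  const monthToken = m[1] ?? "";
  const yearToken = m[2] ?? "";

  const month = /^\d+$/.test(monthToken) ? Number(monthToken) : MONTHS.indexOf(monthToken) + 1;
  if (month < 1 || month > 12) return null;

  // Assumption: two-digit years belong to the 2000s.
  const year = yearToken.length === 2 ? 2000 + Number(yearToken) : Number(yearToken);

  // Day 0 of the following month is the last day of the requested month.
  const lastDay = new Date(Date.UTC(year, month, 0)).getUTCDate();
  const pad = (n: number): string => String(n).padStart(2, "0");
  return `${year}-${pad(month)}-${pad(lastDay)}`;
}
```

Working in UTC keeps the result independent of the device time zone, which matters for a parser that runs in browsers and React Native.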
2025-11-19 – Concatenation Splitter Mode (EyosiyasJ)
- Renamed legacy_items to concat_items (kept legacy_items as alias).
- Introduced a generic concatenation decomposer (src/concatDecompose.ts) to peel out strength, form, pack, country, GTIN, batch, and manufacturer from concatenated cells.
- Universal integration detects likely concatenated columns and runs a pre-sanitize decomposition pass; exposes meta.concatenatedColumns with indices and reasons.
- Removed 3-letter category code detection from the decomposer to avoid overfitting.
CLI Cheat Sheet (Concat Mode)
- Use npm run parse-file <path> to inspect results quickly.
- Look for:
  - schema: → should show concat_items when Items.xlsx-like headers are used; alias legacy_items still recognized.
  - concatenatedColumns: in meta → indices of columns treated as “mushed” with a short reason.
  - sampleRows: → verify decomposed fields like product.strength, product.form, pkg.pieces_per_unit, identity.coo, batch.batch_no, product.manufacturer_name are populated.
- If required fields are still missing after decomposition, rows will carry E_REQUIRED_* errors. Review these to decide whether to accept or fix the source file.
Signed: EyosiyasJ
2025-11-19 – Concatenated Column Decision Tree (EyosiyasJ)
- Header trust: headers are scored and only treated as “known” when confidence ≥ 0.8. Weak labels (e.g., Name) are treated as headerless columns for content-driven parsing.
- Concat-prone detection: columns are flagged as concatenated when, across a sample of rows, ≥70% contain at least two signals among strength, form, pack, country, GTIN, batch; formula-like lists without numeric+unit are excluded.
- Atomic fields are skipped: GTIN, price, quantity, expiry, COO, and SKU-like codes are never treated as concatenated.
- Pipeline overlay: when a column is flagged, decomposition is applied before sanitization, and extractions fill empty canonical fields only. Leftover text is used as product.generic_name if still empty.
Row-level Opportunistic Decomposer
- Applies per cell on product.generic_name, product.brand_name, product.description even when a column was not flagged.
- Acceptance gate requires: strength present AND at least two of {form, pack, COO, GTIN, batch} (≥3 total signals), and leftover not formula-like.
- Fills only empty fields; leftover appended to product.description when generic_name already populated.
- References: src/concatDecompose.ts:decomposeConcatenatedCell (mode='opportunistic'), src/parseProductsCore.ts row-level pass.
2025-11-19 – Name Column minSignals Tuning (EyosiyasJ)
- To address mixed Name columns in Items.xlsx, opportunistic decomposition now uses minSignals: 2 on product.generic_name only. Other targets remain at minSignals: 3.
- This lowers the acceptance bar just for the Name field to split embedded strength/form/pack reliably while preserving strictness for brand_name and description.
- Implementation: src/parseProductsCore.ts:64-81 passes { mode: 'opportunistic', minSignals: 2 } for product.generic_name and { minSignals: 3 } otherwise.
- If a dataset still fails to split valid entries, consider temporarily extending the form keywords or pack patterns; avoid adding category-code detectors.
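The acceptance gate and the minSignals option can be illustrated with a short sketch. The Extraction shape and passesAcceptanceGate are hypothetical; the sketch assumes that strength is always mandatory and counts as one signal, and it leaves out the formula-like leftover check.

```typescript
interface Extraction {
  strength?: string;
  form?: string;
  pack?: string;
  coo?: string;
  gtin?: string;
  batch?: string;
}

// Strength is mandatory; the other extracted fields count toward minSignals.
function passesAcceptanceGate(x: Extraction, minSignals = 3): boolean {
  if (x.strength === undefined || x.strength === "") return false;
  const others = [x.form, x.pack, x.coo, x.gtin, x.batch].filter(
    (v) => v !== undefined && v !== "",
  ).length;
  return 1 + others >= minSignals;
}
```

With the default of 3, a cell needs strength plus two other signals; passing 2 (as described for product.generic_name) accepts strength plus a single extra signal such as the form.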
2025-11-19 – Test Output Prints 14 Mapped Fields (EyosiyasJ)
- After tests complete, the runner prints the 14 canonical fields mapped by Template v3 and concat mode for quick verification:
  - product.generic_name, product.strength, product.form, product.category, batch.expiry_date, pkg.pieces_per_unit, batch.batch_no, batch.on_hand, batch.unit_price, identity.coo, identity.sku, product.brand_name, product.manufacturer_name, product.description.
- Reference: tests/run-tests.mjs final output section.
2025-11-19 – Test Output Prints Parsed Items Preview (EyosiyasJ)
- The test runner now prints a parsed items preview after completion. For Items.xlsx, the first 20 canonical rows are output as JSON with the 14 key fields for quick eyeballing.
- This helps confirm that concatenated fields were split into canonical values (e.g., strength/form/pack/COO) and that sanitized rows are present.
- Reference: tests/run-tests.mjs – look for Parsed Items Preview: in the final output.
2025-11-19 – Pattern-Driven Name Splitter (EyosiyasJ)
- Added a strict, right-sided splitter splitNameGenericStrengthForm for Name cells to peel out generic_name, full strength tokens (e.g., 125mg/5ml, 1%), and normalized form.
- Form detection uses a trailing form dictionary (hyphen or space-suffix) and maps to existing sanitize forms (tablet, capsule, syrup, cream, ointment, gel, drops, spray, lotion, patch, suspension, solution, inhaler, powder, other).
- Strength detection captures the last numeric+unit block including ratios and % w/w; normalization removes extra hyphens and %w/w → % for sanitizer compatibility.
- Integration points:
  - src/parseProductsCore.ts: applies the splitter on raw Name column and on product.generic_name before opportunistic decomposition.
  - src/concatDecompose.ts: provides splitNameGenericStrengthForm and keeps universal detectors for other fields.
- Result: reduces E_REQUIRED_STRENGTH and E_FORM_MISSING on Items.xlsx while preserving formula-like generics.
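A simplified sketch of right-sided splitting follows. splitNameSketch and both regular expressions are illustrative assumptions rather than the shipped splitNameGenericStrengthForm; the sketch peels a trailing form word first, then the last numeric+unit block, and does not handle % w/w normalization or multi-word forms.

```typescript
interface NameSplit {
  generic_name: string;
  strength: string | null;
  form: string | null;
}

const FORMS = ["tablet", "capsule", "syrup", "cream", "ointment", "gel", "drops", "spray",
  "lotion", "patch", "suspension", "solution", "inhaler", "powder"];
// Trailing form word, attached with a hyphen or a space, optional plural "s".
const FORM_TAIL_RE = new RegExp(`[-\\s](${FORMS.join("|")})s?$`, "i");
// Last numeric+unit block at the end of the string, with an optional ratio such as /5ml.
const STRENGTH_TAIL_RE =
  /[-\s](\d+(?:\.\d+)?\s*(?:mg|mcg|g|iu|ml|%)(?:\/\d*(?:\.\d+)?\s*(?:mg|mcg|g|ml))?)$/i;

function splitNameSketch(raw: string): NameSplit {
  let rest = raw.trim();
  let form: string | null = null;
  let strength: string | null = null;

  const f = rest.match(FORM_TAIL_RE);
  if (f !== null && f.index !== undefined) {
    form = (f[1] ?? "").toLowerCase();
    rest = rest.slice(0, f.index).trim();
  }

  const s = rest.match(STRENGTH_TAIL_RE);
  if (s !== null && s.index !== undefined) {
    strength = (s[1] ?? "").replace(/\s+/g, "");
    rest = rest.slice(0, s.index).trim();
  }

  // Drop any separator left dangling at the end of the generic name.
  return { generic_name: rest.replace(/[-\s]+$/, ""), strength, form };
}
```

Working from the right keeps hyphens and digits inside the generic name untouched, which is the reason the text calls the splitter right-sided.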
2025-11-20 – Name Pre-Split Cleanup + Opportunistic Routing Fix (EyosiyasJ)
- Cleaned generic_name artifacts like trailing -0, -1-, -0.64- by trimming strength-prefix fragments during split and by forcing a safe override when an initial generic_name equals the raw Name cell or matches hyphen-digit patterns.
- Applied a dedicated pre-split on the Name column before concat decomposition; strength and form are merged only if empty to follow established patterns.
- Prevented opportunistic leftover from overwriting product.generic_name by avoiding replacement when the source equals the current generic; leftovers now append to product.description instead.
- References: src/concatDecompose.ts:191-197, src/parseProductsCore.ts:81-93, src/parseProductsCore.ts:143-152.
- Outcome: preview rows from Items.xlsx now show clean generic_name values (e.g., MOMETASONE FUROATE, GENTAMICIN) while preserving extracted strength and normalized form.
Signed: EyosiyasJ
2025-11-20 – Formal Form Identifier + "No Digits" Rule (EyosiyasJ)
- Added a formal form identifier layer (dictionary + matcher) as an anchor for concatenated text across any column.
  - Dictionary covers core forms and multi-word variants: tablet (incl. effervescent tablets, film-coated tablet, chewable tablet), capsule, syrup, suspension (powder for suspension), cream, gel, ointment, drops (eye/ear drop(s)), injection (inj./injection), inhaler (aerosol, suspension for inhalation, puffer), plus conservative other bucket (shampoo, plaster, sachet, suppository, pregnancy test, test).
  - Used as a tail-phrase anchor in splitNameGenericStrengthForm and decomposition; prefers longest phrase matches to avoid partial anchors (e.g., effervescent tablets over tablets).
  - Guardrails: variants under other only anchor when dose signals are present to avoid misclassifying devices (e.g., adhesive plaster, pregnancy test).
- Enforced a "no digits allowed" rule for text-only fields during validation:
  - Flags suspicious digits/units in manufacturer and category with E_TEXT_DIGITS_SUSPECT (warn)
  - Also flags form containing digits/units (warn) and raw country (COO) values containing digits (warn)
- References: src/concatDecompose.ts:204-265 (dictionary + matcher), src/concatDecompose.ts:104-116 (anchor in splitter), src/sanitize.ts:302-309,314-318 (warnings).
- Outcome: more deterministic form extraction in messy text (including non-Name columns) and cleaner separation between dose tokens and label-like fields.
2025-11-20 – Schema‑Aware Validation (concat_items best‑effort) (EyosiyasJ)
- Added schema‑specific validation profile for schema: concat_items (POS‑style dumps):
  - When no dose signal is present (no strength), only product.generic_name is required.
  - When dose signal is present, full strictness applies (strength, form, expiry, COO, pack contents, quantity).
  - Suppressed E_TEXT_DIGITS_SUSPECT warnings for product.category in concat_items because this column often carries numeric IDs.
- Exposed schema‑aware required fields in meta.requiredFields:
  - concat_items: ["product.generic_name"]
  - others: full list [generic_name, strength, form, category, batch.expiry_date, pkg.pieces_per_unit, identity.coo, batch.on_hand].
- References: src/sanitize.ts:509–655 (schema‑aware requiredness), src/parseProductsCore.ts:171 (pass sourceSchema), src/index.ts:45–56 (meta.requiredFields).
- Outcome: Items.xlsx imports are still parsed and decomposed correctly, but low‑signal device/inventory rows no longer produce a wall of blocking errors.
2025-11-20 – Analysis Mode (fast vs deep) (EyosiyasJ)
- Added global AnalysisMode with options { fast, deep } influencing sampling size for upfront analysis only.
- Fast mode samples up to 32 rows; deep mode samples 64–256 rows or 25% of file.
- Detection affected: headerless mapping, concatenation column detection, and concat mode selection.
- Per-row logic unchanged: splitting, decomposition, and validation are identical regardless of mode.
- Exposed in meta: analysisMode, sampleSize, and concatMode (none | name_only | full).
- CLI: node scripts/parse-file.mjs <file> --mode fast|deep or --deep shorthand; prints analysis details.
- References: src/types.ts (AnalysisMode, ParseOptions, meta fields), src/index.ts (options passthrough), src/parseProductsCore.ts (sampling + concatMode), scripts/parse-file.mjs (CLI flags).
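One possible reading of the sampling rule is sketched below. The text does not say how "64–256 rows or 25% of file" is combined, so this sketch assumes that deep mode takes 25% of the rows, clamped to the 64 to 256 range and capped at the file size. sampleSizeFor is a hypothetical helper.

```typescript
type AnalysisMode = "fast" | "deep";

function sampleSizeFor(totalRows: number, mode: AnalysisMode): number {
  if (mode === "fast") {
    // Fast: up to 32 rows.
    return Math.min(totalRows, 32);
  }
  // Deep (assumed interpretation): 25% of the file, clamped to [64, 256].
  const quarter = Math.ceil(totalRows * 0.25);
  const clamped = Math.max(64, Math.min(256, quarter));
  return Math.min(totalRows, clamped);
}
```

Because the mode only changes how many rows feed the upfront detection, the same file should produce identical rows in both modes and differ only in meta.sampleSize.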
2025-11-20 – Column Hygiene Scan (EyosiyasJ)
- Added a lightweight sampling-based column hygiene classifier to avoid heavy decomposition on clean columns.
- Classifies columns into numeric/ID vs dirty free-text using unit tokens, form words and basic text heuristics.
- In concatMode=full, heavy decomposition runs only on dirty columns; numeric/ID columns are skipped.
- References: src/parseProductsCore.ts:55–117 (setup), src/parseProductsCore.ts:145–174 (dirty gating).
2025-11-20 – Validation Mode (optional ingest speed) (EyosiyasJ)
- Added ValidationMode with options { full, errorsOnly, none }.
  - full: current behaviour, includes hygiene warnings and all validations.
  - errorsOnly: filters out warning-level hygiene issues; only hard errors are returned.
  - none: mapping + decomposition only; validation errors suppressed (rows still normalized).
- Default mode is full.
- Exposed in meta as validationMode; passed via ParseOptions.validationMode.
- References: src/types.ts (ValidationMode), src/sanitize.ts (mode-aware filtering), src/parseProductsCore.ts (propagate to meta).
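The three modes amount to a filter over the issues collected for a row. The sketch below assumes a hypothetical RowIssue shape with a severity field; the real ParsedRowError type may be structured differently.

```typescript
type ValidationMode = "full" | "errorsOnly" | "none";

// Hypothetical issue shape used only for this illustration.
interface RowIssue {
  code: string;
  severity: "error" | "warning";
}

function filterIssues(issues: RowIssue[], mode: ValidationMode = "full"): RowIssue[] {
  if (mode === "none") return []; // validation output suppressed
  if (mode === "errorsOnly") return issues.filter((i) => i.severity === "error");
  return issues; // full: warnings and errors
}
```

For any input, the result sizes are ordered full ≥ errorsOnly ≥ none, which is the invariant that the test suite described later in this README checks.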
References
- src/schema.ts: isHeaderTrusted and inferConcatenatedColumns implement header trust and column-level concat detection.
- src/parseProductsCore.ts:42-56 applies the pre-sanitize concat overlay for flagged columns.
- src/concatDecompose.ts provides decomposeConcatenatedCell used by detection and overlay.
Thresholds
- Header trust threshold: 0.8.
- Concatenation coverage: ≥70% of sampled non-empty cells have ≥2 signals.
- Formula-like exclusion: formula-rate must be ≤30%.
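The coverage threshold can be sketched as a column-level check. The signal patterns below are simplified stand-ins (strength, form, pack, GTIN only), isConcatenatedColumn is a hypothetical name, and the formula-rate exclusion is omitted because the text does not define how formula-like cells are recognized.

```typescript
// Simplified, hypothetical signal detectors.
const SIGNAL_PATTERNS: RegExp[] = [
  /\b\d+(\.\d+)?\s*(mg|mcg|g|iu|ml|%)/i,                             // strength
  /\b(tablet|capsule|syrup|cream|ointment|gel|drops|injection)s?\b/i, // form
  /\b\d+\s*x\s*\d+\b|\b\d+'s\b/i,                                    // pack
  /\b\d{13}\b/,                                                      // GTIN
];

function countSignals(cell: string): number {
  return SIGNAL_PATTERNS.filter((re) => re.test(cell)).length;
}

// Flag a column when enough sampled non-empty cells carry at least minSignals signals.
function isConcatenatedColumn(sample: string[], coverage = 0.7, minSignals = 2): boolean {
  const cells = sample.map((c) => c.trim()).filter((c) => c !== "");
  if (cells.length === 0) return false;
  const flagged = cells.filter((c) => countSignals(c) >= minSignals).length;
  return flagged / cells.length >= coverage;
}
```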
API Reference
- parseProductsFileFromBuffer(bytes, name, options?)
  - Detects schema (template_v3 | concat_items | csv_generic | unknown), headers vs headerless, and concatenation mode.
2025-11-21 – npm Publishing (EyosiyasJ)
- Packaging readiness:
  - Added publishConfig.access = "public" and license: "MIT" to package.json.
  - Published as unscoped medway-import-core while the @medway org scope is not configured.
  - files restricts the tarball to dist/**/* only.
- How to publish:
  - npm login (account with publish rights).
  - npm run build to emit dist with .d.ts.
  - npm pack to inspect the tarball locally.
  - npm publish (no flags needed; publishConfig.access is set).
- Notes:
  - Tests: npm test should pass before publishing.
  - Vulnerabilities: run npm audit and address if needed.
2025-11-21 – Codepage Support for CSV/XLS (EyosiyasJ)
- Added a side‑effect import in the package entry to enable non‑UTF encodings:
  - src/index.ts now includes import 'xlsx/dist/cpexcel.js' so Windows‑1252/ISO‑8859‑1 CSVs and legacy .xls parse correctly.
- Prevent bundlers from dropping the import by marking side effects:
  - package.json sets "sideEffects": ["xlsx/dist/cpexcel.js"].
- Dependency alignment:
  - Kept xlsx under dependencies and verified with build/tests.
- Publish:
  - Bumped package version to 0.1.2 and verified the tarball via npm pack.
2025-11-21 – Universal Tabular Acceptance (EyosiyasJ)
- Parser now accepts a broad set of tabular formats:
  - Excel family: .xls, .xlsx, .xlsb, SpreadsheetML .xml, .ods
  - Delimited text: .csv (comma/semicolon), .tsv (tab), pipe | DSV
  - HTML tables (single table parsed as the first sheet)
- Strategy:
  - First try SheetJS to read any workbook/tabular payload from bytes.
  - If a sheet is produced, convert to AoA and run header detection, preserving headerless behavior.
  - Fallback: sniff delimiter among , ; \t | and parse with quote handling.
- Notes:
  - Non‑UTF encodings and legacy .xls are supported via cpexcel side‑effect import.
  - Headerless files still expose meta.columnGuesses for UI alignment.
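The delimiter-sniffing fallback could be approximated as follows. sniffDelimiter and countOutsideQuotes are hypothetical helpers; the sketch simply picks the candidate that occurs most often outside double quotes in the first few lines, defaulting to a comma, and the library's own sniffer may weigh consistency per line instead.

```typescript
const DELIMITER_CANDIDATES: string[] = [",", ";", "\t", "|"];

// Count delimiter occurrences, ignoring anything inside double-quoted fields.
function countOutsideQuotes(line: string, delimiter: string): number {
  let inQuotes = false;
  let count = 0;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (ch === '"') inQuotes = !inQuotes;
    else if (ch === delimiter && !inQuotes) count++;
  }
  return count;
}

function sniffDelimiter(text: string, maxLines = 10): string {
  const lines = text.split(/\r?\n/).filter((l) => l.length > 0).slice(0, maxLines);
  let best = ",";
  let bestCount = 0;
  for (const candidate of DELIMITER_CANDIDATES) {
    const total = lines.reduce((sum, line) => sum + countOutsideQuotes(line, candidate), 0);
    // Strictly greater keeps the earlier candidate on ties.
    if (total > bestCount) {
      best = candidate;
      bestCount = total;
    }
  }
  return best;
}
```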
Signed: EyosiyasJ
2025-11-21 – Headerless Workbook Classification (EyosiyasJ)
- Adjusted schema detection to treat headerless workbook inputs (synthetic col_# keys) as unknown when the origin is a workbook, not csv_generic.
- Rationale: Headerless Excel POS dumps should not be forced into the generic CSV profile; this aligns tests and real-world behavior.
- Implementation:
  - src/schema.ts:336–370 now accepts an origin hint ("workbook" | "text") and returns unknown for headerless workbook.
  - src/parseProductsCore.ts:69–70 passes through the origin hint.
  - src/index.ts:66 and src/index.ts:91–92 set origin appropriately for workbook vs text paths.
- Tests: npm test now passes testHeaderlessPosDetection by returning a schema in the accepted set.
- Version: bumped to 0.1.3.
Signed: EyosiyasJ
2025-11-21 – Manufacturer/Brand Conservative Allowlists & Guards (EyosiyasJ)
- Added env-based allowlists to accommodate legitimate numeric edge cases without weakening invariants:
  - ALLOWED_NUMERIC_BRANDS: comma/semicolon-separated brand names allowed to contain digits (e.g., 3M)
  - ALLOWED_NUMERIC_MANUF: comma/semicolon-separated manufacturer names allowed to contain digits (e.g., XYZ Pharma 2000)
  - Usage: set via environment before parsing (ALLOWED_NUMERIC_BRANDS="3M;GSK 3" ALLOWED_NUMERIC_MANUF="XYZ Pharma 2000").
- Brand hardening: length cap (≤40 chars) and punctuation-heavy guard (>3 of -, , /, &, .).
- Manufacturer hardening: punctuation-heavy guard and existing unit/strength rejection preserved; tail-scanning with hint words remains conservative.
- Concat items optionality: ensured manufacturer_name and brand_name are optional under schema: concat_items (no E_REQUIRED_* for missing values).
- Decomposer integration: brand head detection removes generic/strength/form tokens and applies allowlists; manufacturer detectors accept allowlisted numeric names.
- Tests: added targeted regression cases in tests/run-tests.mjs covering generic-only rows, brand+generic+strength+form, tail manufacturer in description, brand==generic demotion, org-like brand promotion to manufacturer, and contaminated manufacturer demotion.
- References:
  - src/sanitize.ts:803–858 (env allowlists, brand/manufacturer guards)
  - src/concatDecompose.ts:471–558 (allowlist-aware detectors)
  - tests/run-tests.mjs:515–585 (manufacturer/brand tests)
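A sketch of how such an allowlist could be parsed and applied to a brand candidate. parseAllowlist and isAcceptableBrand are hypothetical names; the raw string is passed in as a parameter so the example does not depend on any particular environment API, and the punctuation-heavy guard is left out. Rejecting non-allowlisted brands that contain digits is an assumption inferred from the allowlist's purpose.

```typescript
// Split on commas or semicolons, trim, and compare case-insensitively.
function parseAllowlist(raw: string | undefined): Set<string> {
  const entries = (raw ?? "")
    .split(/[;,]/)
    .map((s) => s.trim().toLowerCase())
    .filter((s) => s !== "");
  return new Set(entries);
}

function isAcceptableBrand(name: string, allowNumeric: Set<string>): boolean {
  const trimmed = name.trim();
  // Length cap from the text: at most 40 characters.
  if (trimmed.length === 0 || trimmed.length > 40) return false;
  // Assumption: digits are only acceptable for allowlisted brands.
  if (/\d/.test(trimmed) && !allowNumeric.has(trimmed.toLowerCase())) return false;
  return true;
}
```

In a Node-based setup the raw value would typically come from the ALLOWED_NUMERIC_BRANDS environment variable, for example parseAllowlist("3M;GSK 3").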
Signed: EyosiyasJ
2025-11-21 – Field Invariants + Sanity Pass (EyosiyasJ)
- Added per‑field invariants and a post‑parse sanity pass to reduce cross‑contamination and error noise while keeping rows importable.
- Invariants:
  - Text‑only fields (manufacturer_name, category, form) reject digits and unit tokens; batch_no must be strictly alphanumeric and unit‑free.
  - Central helpers and patterns: UNIT_RE, HAS_DIGIT_RE, ALNUM_BATCH_RE, and helpers hasDigit, hasUnitToken, isAlphaNumericBatch.
- Splitter enforcement:
  - Batch detection now gates on isAlphaNumericBatch and !hasUnitToken so unit‑like tokens don’t leak into batch_no.
  - Manufacturer detection rejects candidates containing digits/units.
  - Category is not decomposed by the splitter to avoid misclassification.
- Leftover routing:
  - When a candidate fails invariants (e.g., manufacturer with digits), the value is routed to product.description instead of being forced into the field.
- Post‑parse sanity checks:
  - After row sanitize, invariants are re‑checked; failing values are moved to product.description and a compact warning E_FIELD_SUSPECT_VALUE is recorded.
- Cross‑field consistency sweeps:
  - Strip unit tokens and form words from product.generic_name; append stripped tokens to product.description.
  - If product.form equals product.category (or plural), clear category and append original to product.description.
  - If manufacturer_name is a country name, set identity.coo to ISO‑2 and demote the manufacturer text to product.description; if coo carries manufacturer hints (pharma, labs, ltd), demote to product.description.
- Schema‑aware noise reduction (concat_items):
  - Allow pure integer category IDs.
  - Downgrade missing strength, expiry, and COO to warnings to keep POS‑style imports lightweight.
- Validation mode safeguard:
  - validationMode: "none" suppresses all errors, including sanity‑pass suspects; rows are still normalized.
- References:
  - src/concatDecompose.ts:374–379 (helpers), src/concatDecompose.ts:437–467 (batch/manufacturer gates)
  - src/parseProductsCore.ts:410–492 (leftover routing for description/manufacturer/category)
  - src/sanitize.ts:314–421 (helpers + batch sanitize), src/sanitize.ts:578–837 (schema‑aware sanitize and sanity pass)
  - src/index.ts:103–118 (schema‑aware required fields in meta)
2025-11-20 – Category & Field Guardrails (EyosiyasJ)
- Umbrella category: when a category signal exists (product.category or identity.cat) but cannot be mapped to one of the 23 umbrella categories, the parser now sets product.category = "NA" and does not attach an error. This replaces the previous E_UMBRELLA_NOT_FOUND emission.
- Numeric exclusion:
  - product.form: rejects purely numeric values with E_FORM_NUMERIC.
  - product.category: rejects purely numeric values with E_CATEGORY_NUMERIC.
  - identity.coo/batch.coo: flags purely numeric values with E_COO_NUMERIC; ISO‑2 validation still applies.
- Batch rule: batch.batch_no must contain both letters and digits; otherwise E_BATCH_ALPHA_NUM_MIX is returned. Cleaning and truncation still apply.
- Parser routing: leftover text from columns mapped as product.category no longer populates product.generic_name. Category and form remain isolated and are not linked to other name fields.
- Follows existing patterns: validations live in src/sanitize.ts, parser routing in src/parseProductsCore.ts, umbrella logic in src/category.ts.
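The numeric exclusion and batch rules can be read as a handful of independent checks. The sketch below uses the error codes named above but a hypothetical flat input shape (GuardrailInput) and a hypothetical function name; the real sanitizer operates on canonical rows and also performs cleaning and truncation.

```typescript
interface GuardrailInput {
  form?: string;
  category?: string;
  coo?: string;
  batchNo?: string;
}

const PURELY_NUMERIC_RE = /^\d+$/;

function guardrailCodes(row: GuardrailInput): string[] {
  const codes: string[] = [];
  const isNumeric = (v: string | undefined): boolean =>
    v !== undefined && PURELY_NUMERIC_RE.test(v.trim());

  if (isNumeric(row.form)) codes.push("E_FORM_NUMERIC");
  if (isNumeric(row.category)) codes.push("E_CATEGORY_NUMERIC");
  if (isNumeric(row.coo)) codes.push("E_COO_NUMERIC");

  // Batch numbers, when present, must mix letters and digits.
  const batch = row.batchNo === undefined ? "" : row.batchNo.trim();
  if (batch !== "" && !(/[A-Za-z]/.test(batch) && /\d/.test(batch))) {
    codes.push("E_BATCH_ALPHA_NUM_MIX");
  }
  return codes;
}
```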
2025-11-20 – 3‑Letter Code → Full Umbrella Label (EyosiyasJ)
- When identity.cat contains a valid 3‑letter therapeutic code, the engine now maps it to an umbrella ID and sets product.category to the umbrella’s human‑readable label (e.g., CVS → Cardiovascular (CVS)). product.umbrella_category continues to hold the umbrella ID.
- Reference: src/sanitize.ts (label assignment via UMBRELLA_CATEGORY_INDEX).
2025-11-20 – Universal NA Fallback for Empty Text Fields (EyosiyasJ)
- For empty text fields, the canonical output now uses "NA" instead of returning empty strings or implicit “other”. Applied to:
  - product.brand_name, product.manufacturer_name, product.form, product.category, product.storage_conditions, product.description, and batch.batch_no.
- Required‑field validations remain unchanged and still error when missing; NA is applied after validation so UI can display placeholders without masking errors.
2025-11-20 – Test Fixtures, Bench, and Packaging Smoke (EyosiyasJ)
- Added fixture generator tests/generate-fixtures.mjs producing:
  - template_clean.xlsx, devices_only.xlsx, headerless_pos.xlsx, garbage.xlsx, BigItems.xlsx
- Extended tests/run-tests.mjs with golden expectations and invariants:
  - Fast vs Deep rows identical; meta differs by sampleSize
  - ValidationMode ordering (full ≥ errorsOnly ≥ none), mapping unchanged
  - Devices-only relaxed requiredness (no strength/form/expiry/COO blockers)
  - Headerless POS detection and concatenation meta present
- Performance bench bench.mjs logs rows/errors/ms and ms/row on BigItems.xlsx.
- Packaging smoke tests/smoke-pack.mjs packs the library, installs into a temp project, and runs a parse.
- How to run:
  - node tests/generate-fixtures.mjs
  - pnpm run test
  - node bench.mjs
  - node tests/smoke-pack.mjs
- Analysis mode affects sampling for detection only; per-row logic is identical. validationMode controls error verbosity and performance.
- parseProductsCore(input)
  - Internal pipeline that applies headerless assignments, pre-sanitize concatenation overlay, opportunistic decomposition, and schema-aware validation.
- readXlsxToRows(bytes)
  - Parses the workbook and extracts __meta (template_version, header_checksum).
- parseCsvRaw(text), detectHeaderMode(rows), buildRawRows(rows, mode)
  - Helpers for CSV dual-path parsing and header detection.
- inferHeaderlessGuesses(rows)
  - Produces candidates and confidence for UI debugging in headerless mode.
- sanitizeCanonicalRow(row, idx, schema?, validationMode?)
  - Normalizes and validates a row with schema-aware requiredness and mode filtering.
Signed: EyosiyasJ
2025-11-20 – Function & Module Documentation Pass (EyosiyasJ)
- Added function‑level JSDoc to entry API, core pipeline, CSV/XLSX helpers, schema detectors, and sanitizers.
- Introduced module‑level headers explaining purpose, behavior, and design decisions for key files (src/index.ts, src/parseProductsCore.ts, src/csv.ts, src/xlsx.ts, src/schema.ts, src/sanitize.ts, src/concatDecompose.ts).
- Expanded Public API section with options (AnalysisMode, ValidationMode) and comprehensive meta fields reference.
- Aligns with established patterns and keeps bundler/RN safety. No UI/styling changes.
References
- Entry: src/index.ts:13-40
- Core: src/parseProductsCore.ts:30-43
- CSV: src/csv.ts:3-7, src/csv.ts:68-74, src/csv.ts:137-142, src/csv.ts:191-214
- XLSX: src/xlsx.ts:4-13, src/xlsx.ts:33-38, src/xlsx.ts:40-58
- Schema: src/schema.ts:327-361, src/schema.ts:363-385, src/schema.ts:569-577, src/schema.ts:679-687, src/schema.ts:862-908
- Sanitizers: src/sanitize.ts:12-21, src/sanitize.ts:518-526
Signed: EyosiyasJ
2025-11-23 – Product Type + Non‑Medicine Classification (EyosiyasJ)
- Added Product Type column to Template v3 as column 2 with values medicine or non-medicine (parser is case-insensitive; UI shows canonical).
- For Product Type = medicine: system behavior unchanged; required fields remain as before.
- For Product Type = non-medicine: required fields are generic_name, product.category (must be either Accessories or Chemicals & Reagents), pack contents, item quantity, country of manufacture. Medicine-only fields Strength, Dosage Form, Expiry Date should be blank and are ignored if present.
- Short-circuit classification when Product Type is missing/invalid (no NLP):
  - Category check: if category ∈ the 23 medicine umbrellas → medicine; if category ∈ {Accessories, Chemicals & Reagents} → non-medicine.
  - Dosage form check: if form ∈ known forms (tablet, capsule, syrup, suspension, injection, cream, ointment, gel, drops, spray, lotion, patch, powder) → medicine.
  - Strength check: if strength looks like number + unit (mg, g, mcg, IU, ml, %, ratios like mg/ml, mg/5ml) → medicine.
  - Default: non-medicine; refined to Accessories vs Chemicals & Reagents using a keyword library.
- Shipped keyword library NON_MEDICINE_KEYWORDS grouped under:
  - accessories: syringes/needles/IV sets; gloves/masks/PPE; dressings/bandages/plasters; diagnostics/test strips/kits; general accessories; small devices sold in pharmacies; personal care/hygiene items.
  - chemicalsAndReagents: common chemicals/solutions/reagents (hydrogen peroxide, alcohols, chlorhexidine solution, povidone iodine, saline, distilled water, glycerin, liquid paraffin, petroleum jelly, sodium hypochlorite, bleach, potassium permanganate, formalin, buffer solution, lab reagent).
- Updated template header checksum to f9802bc8 to reflect the new column.
- All unit and integration tests pass; header and headerless flows keep established behavior while Template v3 adopts Product Type.
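The short-circuit order (explicit Product Type, then category, form, strength, then the non-medicine default) can be sketched as below. classifyProductType and ProductTypeInput are hypothetical; the set of medicine umbrella labels is passed in by the caller because the 23 labels are not listed here, and the keyword-based refinement into Accessories vs Chemicals & Reagents is omitted.

```typescript
type ProductType = "medicine" | "non-medicine";

interface ProductTypeInput {
  productType?: string;
  category?: string;
  form?: string;
  strength?: string;
}

const NON_MEDICINE_CATEGORIES = new Set(["accessories", "chemicals & reagents"]);
const KNOWN_FORMS = new Set(["tablet", "capsule", "syrup", "suspension", "injection", "cream",
  "ointment", "gel", "drops", "spray", "lotion", "patch", "powder"]);
// number + unit, optionally followed by a ratio such as /ml or /5ml
const STRENGTH_LIKE_RE =
  /^\d+(\.\d+)?\s*(mg|g|mcg|iu|ml|%)(\s*\/\s*\d*(\.\d+)?\s*(mg|g|mcg|ml))?$/i;

const norm = (v: string | undefined): string => (v ?? "").trim().toLowerCase();

// medicineUmbrellas: lowercase umbrella labels supplied by the caller.
function classifyProductType(row: ProductTypeInput, medicineUmbrellas: Set<string>): ProductType {
  const explicit = norm(row.productType);
  if (explicit === "medicine" || explicit === "non-medicine") return explicit;

  const category = norm(row.category);
  if (medicineUmbrellas.has(category)) return "medicine";
  if (NON_MEDICINE_CATEGORIES.has(category)) return "non-medicine";

  if (KNOWN_FORMS.has(norm(row.form))) return "medicine";
  if (STRENGTH_LIKE_RE.test((row.strength ?? "").trim())) return "medicine";
  return "non-medicine";
}
```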
