@soustack/ingest
v0.1.1
# soustack-ingest

## CLI usage
```sh
soustack-ingest ingest <inputPath> --out <outDir>
```

The CLI reads the input file, runs it through the ingest pipeline, and writes JSON outputs under `<outDir>` (see `src/cli.ts` and `src/pipeline/emit.ts`).
## Prerequisites
- Node.js 18+ (or compatible)
- Optional: `pandoc` for improved RTF/RTFD conversion (the adapter will fall back to a built-in parser when it is unavailable).
## Adapter behavior
Adapters are selected by file extension (`src/cli.ts`, `src/adapters`).

- `.rtfd.zip`: handled by `readRtfdZip` (`src/adapters/rtfdZip.ts`). The adapter extracts the archive, locates the primary `.rtf` payload (preferring `TXT.rtf` or the largest `.rtf` file), and converts it to text. It tries a Node-based parser first, then falls back to `pandoc` and `textutil` when available.
- `.txt`: handled by `readTxt` (`src/adapters/txt.ts`). Reads the file as UTF-8 text and passes it to the pipeline.
- `.docx`: handled by `readDocx` (`src/adapters/docx.ts`). Extracts plain text from Microsoft Word documents using `mammoth`.
- `.pdf`: handled by `readPdf` (`src/adapters/pdf.ts`). Extracts plain text from PDF files using `pdf-parse`.

Unsupported extensions throw an error.
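
Extension-based selection can be sketched as a suffix lookup. This is a minimal illustration, not the code in `src/cli.ts`: the `ADAPTERS` table and `selectAdapter` helper are hypothetical names, the adapters are represented by their names only, and case-insensitive matching is an assumption.

```typescript
// Hypothetical lookup table. The adapter names come from the list above;
// the table itself and the matching rules are assumptions for illustration.
const ADAPTERS: ReadonlyArray<readonly [string, string]> = [
  [".rtfd.zip", "readRtfdZip"], // compound suffix, so match on the path ending
  [".txt", "readTxt"],
  [".docx", "readDocx"],
  [".pdf", "readPdf"],
];

// Returns the name of the adapter for a path, or throws for unsupported input.
function selectAdapter(inputPath: string): string {
  const lower = inputPath.toLowerCase(); // assumption: extensions match case-insensitively
  const match = ADAPTERS.find(([suffix]) => lower.endsWith(suffix));
  if (!match) {
    throw new Error(`Unsupported input extension: ${inputPath}`);
  }
  return match[1];
}
```

Matching on the path ending rather than on a single extension keeps `.rtfd.zip` distinct from a plain `.zip`, which this sketch rejects.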
## Pipeline stages & contracts

The ingest pipeline runs stages in order (`src/cli.ts`, `src/pipeline`).
- `normalize` (`src/pipeline/normalize.ts`)
  - Input: raw adapter text (`string`).
  - Output: `NormalizedText` with `fullText` and line metadata (`Line[]`).
  - Contract: normalize newlines to `\n` and assign 1-based line numbers.
- `segment` (`src/pipeline/segment.ts`)
  - Input: `Line[]`.
  - Output: `SegmentedText` with `Chunk[]`.
  - Contract: scores potential recipe boundaries and returns one chunk per inferred recipe with a best-effort title guess and confidence score.
- `extract` (`src/pipeline/extract.ts`)
  - Input: a `Chunk` plus the full `Line[]`.
  - Output: `IntermediateRecipe` containing title, ingredients, instructions, and source-line evidence.
  - Contract: splits lines into `ingredients` and `instructions` sections by headers; lines before any header fall into instructions.
- `toSoustack` (`src/pipeline/toSoustack.ts`)
  - Input: `IntermediateRecipe`.
  - Output: `SoustackRecipe` (Soustack JSON shape) with `$schema` (canonical URL), `profile: "lite"`, `stacks` as an object map, normalized `ingredients`/`instructions` string arrays, and ingest metadata.
  - Contract: embeds source path and line range into `metadata.ingest`.
- `validate` (`src/pipeline/validate.ts`)
  - Input: `SoustackRecipe`.
  - Output: `ValidationResult` (`ok`, `errors`).
  - Contract: see validator notes below.
- `emit` (`src/pipeline/emit.ts`)
  - Input: list of validated `SoustackRecipe` values and an output directory.
  - Output:
    - `<outDir>/index.json` with name/slug/path entries.
    - `<outDir>/recipes/<slug>.soustack.json` files for each recipe.
  - Contract: recipe filenames are slugified from `recipe.name` and truncated to 80 characters.
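
Three of the contracts above are easier to see in code. The sketch below is an illustration under assumptions, not the code in `src/pipeline`: the `Line` field names (`n`, `text`), the header patterns, the `splitSections` and `slugify` helper names, and the exact slug rules (lowercase, hyphen-separated, truncated after slugifying) are all hypothetical.

```typescript
// Hypothetical shapes; the real types may carry more fields.
interface Line { n: number; text: string }
interface NormalizedText { fullText: string; lines: Line[] }

// normalize: turn CRLF and lone CR into "\n", then number lines from 1.
function normalize(raw: string): NormalizedText {
  const fullText = raw.replace(/\r\n?/g, "\n");
  const lines = fullText.split("\n").map((text, i) => ({ n: i + 1, text }));
  return { fullText, lines };
}

// extract (section split only). The header patterns are assumptions.
const INGREDIENTS_HEADER = /^\s*ingredients\s*:?\s*$/i;
const INSTRUCTIONS_HEADER = /^\s*(instructions|directions|method)\s*:?\s*$/i;

function splitSections(lines: Line[]): { ingredients: string[]; instructions: string[] } {
  const out = { ingredients: [] as string[], instructions: [] as string[] };
  // Lines seen before any header fall into instructions.
  let current: "ingredients" | "instructions" = "instructions";
  for (const { text } of lines) {
    if (INGREDIENTS_HEADER.test(text)) { current = "ingredients"; continue; }
    if (INSTRUCTIONS_HEADER.test(text)) { current = "instructions"; continue; }
    if (text.trim() !== "") out[current].push(text.trim());
  }
  return out;
}

// emit (filename only): slugify the recipe name, then cap it at 80 characters.
function slugify(name: string, maxLength = 80): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of other characters into one hyphen
    .replace(/^-+|-+$/g, "")     // drop leading and trailing hyphens
    .slice(0, maxLength)
    .replace(/-+$/, "");         // the cut may leave a trailing hyphen
}
```

With these rules, `slugify("Grandma's Apple Pie!")` yields `grandma-s-apple-pie`, which would correspond to `<outDir>/recipes/grandma-s-apple-pie.soustack.json`. Note that in this sketch a title line ahead of the first header lands in `instructions`; how the real `extract` stage separates the title is not shown here.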
## Validator behavior & wiring soustack

Validation is intentionally lightweight today. The pipeline starts with a stub validator built from a fallback schema (`src/pipeline/validate.ts`). It attempts to load `soustack` at runtime:
- If `soustack` exports `validator`, that object is used.
- If it exports `validateRecipe`, it is wrapped into a `validator`.
- If neither exists or the import fails, the stub validator stays active.
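
The fallback order above can be sketched as a pure function over the exports of the loaded module. This is an illustration, not the code in `src/pipeline/validate.ts`: `resolveValidator` and `stubValidator` are hypothetical names, `errors` is assumed to be a string array, and the stub here accepts everything rather than checking the fallback schema.

```typescript
type ValidationResult = { ok: boolean; errors: string[] }; // errors type is an assumption
type Validator = { validate(recipe: unknown): ValidationResult };

// Placeholder for the stub validator; the real one is built from a fallback schema.
const stubValidator: Validator = { validate: () => ({ ok: true, errors: [] }) };

// Picks the active validator from whatever the soustack module exported.
// Pass null (or undefined) when the import failed.
function resolveValidator(mod: unknown): Validator {
  const m = (mod ?? {}) as {
    validator?: Validator;
    validateRecipe?: Validator["validate"];
  };
  // 1. A ready-made validator object wins.
  if (m.validator && typeof m.validator.validate === "function") {
    return m.validator;
  }
  // 2. A bare validateRecipe function is wrapped into a validator.
  if (typeof m.validateRecipe === "function") {
    const fn = m.validateRecipe;
    return { validate: (recipe) => fn(recipe) };
  }
  // 3. Neither export exists, or the import failed: keep the stub.
  return stubValidator;
}
```

A caller could pass the result of a dynamic `import()` of `soustack`, substituting `null` when the import rejects, so a missing or incompatible package leaves the stub in place.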
To wire `soustack` validation:

- Ensure `soustack` is installed (already in `package.json`).
- Export either a `validator` object with a `validate(recipe)` function, or a `validateRecipe(recipe)` function, from the `soustack` package entry point.
- Call `initValidator()` once at startup (the CLI does this before any `validate()` calls) so the active validator is set deterministically.
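
As a rough picture of the second step, the two export shapes might look like the sketch below. It is hypothetical throughout: the name check stands in for real Soustack schema validation, `errors` is assumed to be a string array, and the `export` keywords are left as comments so the snippet runs on its own.

```typescript
type ValidationResult = { ok: boolean; errors: string[] }; // errors type is an assumption

// Shape A: a bare function. In the package entry point this would be
// `export function validateRecipe(...)`.
function validateRecipe(recipe: unknown): ValidationResult {
  const errors: string[] = [];
  // Placeholder rule for illustration only; real validation would check the schema.
  const name = (recipe as { name?: unknown } | null)?.name;
  if (typeof name !== "string" || name.trim() === "") {
    errors.push("name must be a non-empty string");
  }
  return { ok: errors.length === 0, errors };
}

// Shape B: a validator object. In the package entry point this would be
// `export const validator = ...`.
const validator = { validate: validateRecipe };
```

Either shape is enough on its own; exporting `validator` avoids the wrapping step described above.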
## Build, test, and run

```sh
npm run build
npm test
npm run ingest -- <inputPath> --out <outDir>
```

### Example usage

```sh
npm run ingest -- "/mnt/data/bowman cookbook.rtfd.zip" --out ./output
```