@etohq/workflows-input-dataset-pdf-runtime
v0.0.1-next-20260318155517
PDF text extraction + record structuring runtime for workflows input dataset imports (Node-only)
Node-only PDF text extraction + user-defined structuring into tabular records.
This package is intentionally separate from @etohq/workflows-input-schema-runtime because PDF parsing
is not cross-platform.
Typical Usage
- Extract PDF text:
import { pdfToText } from "@etohq/workflows-input-dataset-pdf-runtime"
const res = await pdfToText(pdfBuffer)
if (!res.ok) throw new Error(res.error)
- Structure into records using a user-provided PdfExtractSpec:
import { structureRecordsFromText } from "@etohq/workflows-input-dataset-pdf-runtime"
const structured = structureRecordsFromText(res.text, extractSpec)
// structured.records: Record<string, string>[]
- Feed those records into dataset import mapping:
import { importCsvRecords } from "@etohq/workflows-input-dataset-runtime"
const imported = importCsvRecords({
  schema,
  spec: datasetImportSpec,
  records: structured.records,
})
User-Defined Structure (Regex)
The core idea: users define a regex whose named capture groups represent columns. Each line of extracted text that matches the regex becomes one record.
Example line regex:
^(?<S\\/N>\\d+)\\s+(?<Appnum>\\w+)\\s+(?<Surname>[A-Z-]+)\\s+(?<Firstname>[A-Z-]+)\\s+(?<Othername>[A-Z-]+)\\s+(?<Program>.+?)\\s+(?<List>Merit|Supplementary)$
Then map the group names to output column names via column_map.
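The mapping from named groups to columns can be illustrated with plain `RegExp`, independent of this package. The following is a minimal sketch, not the package's implementation: the helper `structureLines`, the `ColumnMap` type, the sample text, and the output column name "Application Number" are all hypothetical. JavaScript requires capture group names to be valid identifiers, so the sketch assumes a group named `SN` and restores the `S/N` label through the column map.

```typescript
// Hypothetical helper for illustration only; it does not reflect how
// structureRecordsFromText is implemented inside the package.
type ColumnMap = Record<string, string>;

function structureLines(
  text: string,
  lineRegex: RegExp,
  columnMap: ColumnMap,
): Record<string, string>[] {
  const records: Record<string, string>[] = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const match = lineRegex.exec(rawLine.trim());
    // Lines that do not match (headers, footers, blanks) are skipped.
    if (!match || !match.groups) continue;
    const record: Record<string, string> = {};
    for (const [group, value] of Object.entries(match.groups)) {
      // Fall back to the group name when the map has no entry for it.
      record[columnMap[group] ?? group] = value ?? "";
    }
    records.push(record);
  }
  return records;
}

// Assumption: SN stands in for S/N, because "/" is not allowed in a
// JavaScript capture group name.
const lineRegex =
  /^(?<SN>\d+)\s+(?<Appnum>\w+)\s+(?<Surname>[A-Z-]+)\s+(?<Firstname>[A-Z-]+)\s+(?<Othername>[A-Z-]+)\s+(?<Program>.+?)\s+(?<List>Merit|Supplementary)$/;

// Hypothetical column map: group name -> output column name.
const columnMap: ColumnMap = { SN: "S/N", Appnum: "Application Number" };

// Made-up sample text standing in for the output of pdfToText.
const sampleText = [
  "S/N Appnum Surname Firstname Othername Program List",
  "1 A12345 DOE JANE MARY Computer Science Merit",
  "2 B67890 SMITH-JONES JOHN PAUL Civil Engineering Supplementary",
].join("\n");

const records = structureLines(sampleText, lineRegex, columnMap);
console.log(records);
```

In this sketch the header line is dropped because it does not start with a digit, and groups without a map entry keep their own name as the column name. Whether the package skips non-matching lines or applies the same fallback is not stated here, so treat both behaviors as assumptions.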
