@etohq/workflows-input-dataset-pdf-runtime

v0.0.1-next-20260318155517

Published

4 months ago

PDF text extraction + record structuring runtime for workflows input dataset imports (Node-only)

solarsoft0

@etohq/workflows-input-dataset-pdf-runtime

Node-only PDF text extraction + user-defined structuring into tabular records.

This package is intentionally separate from @etohq/workflows-input-schema-runtime because PDF parsing is Node-only and therefore not cross-platform.

Typical Usage

Extract PDF text:

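import { readFile } from "node:fs/promises"
import { pdfToText } from "@etohq/workflows-input-dataset-pdf-runtime"

const pdfBuffer = await readFile("input.pdf") // placeholder path to the PDF to import
const res = await pdfToText(pdfBuffer)
if (!res.ok) throw new Error(res.error) // res.text is only used after this check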
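
The `ok` / `error` / `text` fields used above suggest a result object that is checked before use. As a minimal sketch, assuming that shape (the `TextResult` type and the `unwrapText` helper are hypothetical and not exported by this package), the check can be wrapped once so callers always get a plain string:

```typescript
// Assumed result shape, inferred only from the fields used in the snippet above.
type TextResult =
  | { ok: true; text: string }
  | { ok: false; error: string };

// Hypothetical helper: throw on failure, return the extracted text on success.
function unwrapText(res: TextResult): string {
  if (res.ok === false) {
    throw new Error(`PDF text extraction failed: ${res.error}`);
  }
  return res.text;
}

// Stand-in value; in real use this would come from `await pdfToText(pdfBuffer)`.
const okResult: TextResult = {
  ok: true,
  text: "1 A12345 DOE JANE ANN Computer Science Merit",
};
const extractedText = unwrapText(okResult);
console.log(extractedText);
```

If the real result type differs, only the type alias needs to change; the calling code stays the same.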

Structure into records using a user-provided PdfExtractSpec:

import { structureRecordsFromText } from "@etohq/workflows-input-dataset-pdf-runtime"

const structured = structureRecordsFromText(res.text, extractSpec)
// structured.records: Record<string, string>[]

Feed those records into dataset import mapping:

import { importCsvRecords } from "@etohq/workflows-input-dataset-runtime"

const imported = importCsvRecords({
  schema,
  spec: datasetImportSpec,
  records: structured.records,
})

User-Defined Structure (Regex)

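The core idea: users define a line regex whose named capture groups represent columns.

Example line regex (backslashes are doubled, as they appear inside a JSON or JavaScript string literal; the serial-number group is named SN because JavaScript capture group names must be valid identifiers, so S/N cannot be used as a group name directly):

^(?<SN>\\d+)\\s+(?<Appnum>\\w+)\\s+(?<Surname>[A-Z-]+)\\s+(?<Firstname>[A-Z-]+)\\s+(?<Othername>[A-Z-]+)\\s+(?<Program>.+?)\\s+(?<List>Merit|Supplementary)$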

Then map the group names to output column names via column_map.
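
To make the regex-plus-mapping idea concrete, here is a minimal standalone sketch of how a line regex with named groups could turn extracted text into records. It does not call this package and is not its implementation: `structureLines`, `ColumnMap`, and the sample text are hypothetical, and the real PdfExtractSpec may be shaped differently. The serial-number group is named `SN` here because JavaScript group names must be valid identifiers; the map then restores `S/N` as the output column name.

```typescript
// Hypothetical mapping type: capture group name -> output column name.
type ColumnMap = Record<string, string>;

// Apply the regex to each line; lines that do not match are skipped.
function structureLines(
  text: string,
  lineRegex: RegExp,
  columnMap: ColumnMap,
): Record<string, string>[] {
  const records: Record<string, string>[] = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const match = lineRegex.exec(rawLine.trim());
    if (!match || !match.groups) continue; // headers, blanks, footers
    const record: Record<string, string> = {};
    for (const [group, value] of Object.entries(match.groups)) {
      if (value === undefined) continue; // group did not participate
      record[columnMap[group] ?? group] = value; // unmapped groups keep their name
    }
    records.push(record);
  }
  return records;
}

// Built from a string, so backslashes are doubled as in the example above.
const lineRegex = new RegExp(
  "^(?<SN>\\d+)\\s+(?<Appnum>\\w+)\\s+(?<Surname>[A-Z-]+)\\s+(?<Firstname>[A-Z-]+)" +
    "\\s+(?<Othername>[A-Z-]+)\\s+(?<Program>.+?)\\s+(?<List>Merit|Supplementary)$",
);

// Made-up sample text standing in for the output of PDF text extraction.
const sampleText = [
  "S/N Appnum Surname Firstname Othername Program List",
  "1 A12345 DOE JANE ANN Computer Science Merit",
  "2 B67890 SMITH-JONES JOHN PAUL Civil Engineering Supplementary",
].join("\n");

const records = structureLines(sampleText, lineRegex, {
  SN: "S/N",
  Appnum: "Application Number",
});
console.log(records);
```

The header line is dropped because it does not start with a digit, and the lazy `.+?` in `Program` lets multi-word program names stop just before the final `Merit` or `Supplementary` token.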
