breadchunks

v0.2.2

Published

5 days ago

Heading-aware, token-budgeted semantic chunker for Markdown — for RAG and embedding pipelines.

0High
0Medium
0Low

jongleberry

markdown chunking rag embedding llm

breadchunks

Heading-aware, token-budgeted semantic chunker for Markdown — for RAG and embedding pipelines.

Native Node.js addon (N-API). Prebuilt binaries for macOS arm64/x64, Linux glibc arm64/x64, and Windows x64. Alpine/musl and Windows arm64 must build from source (requires cloning the repository) via napi build --release.

Install

npm install breadchunks

Usage

import { chunk } from 'breadchunks'

// Chunk a single document
const [chunks] = await chunk([Buffer.from(markdown)], { minLength: 400, maxLength: 2000 })
for (const c of chunks) {
  console.log(`[${c.breadcrumb}]`, c.text.slice(0, 80))
}

// Chunk multiple documents in one call
const results = await chunk([docA, docB, docC])
// results[0] → Chunk[] for docA, results[1] → Chunk[] for docB, …

TypeScript types are included (index.d.ts).

chunk accepts a batch of Buffer | string inputs and returns a Promise<Chunk[][]> — one Chunk[] per input. Buffer is preferred; it avoids a round-trip UTF-8 re-encode from the JS heap.

How it works

Three-phase pipeline:

Phase 1 — Split: Split at header boundaries. Each section becomes its own chunk tagged with a full heading breadcrumb (H1 > H2 > H3). Backtick-fenced code blocks are protected — # comment inside a backtick fence is never treated as a heading.
Phase 2 — Merge same-breadcrumb: Merge adjacent chunks that share the same breadcrumb and are below minLength.
Phase 3 — Parent absorption (bottom-up, h6→h1): Absorb small child sections into their parent when the combined size stays under maxLength.

Supported Markdown: ATX headers only (# H1 through ###### H6). Backtick-fenced code blocks are protected. Setext headers, tilde fences (~~~), and 4-space-indented code are not recognized as structural boundaries — # inside them is treated as a header.

API

`chunk(inputs, options?)`

function chunk(
  inputs: Array<Buffer | string>,
  options?: ChunkOptions,
): Promise<Array<Array<Chunk>>>

`ChunkOptions`

| Field | Type | Default | Description | |---|---|---|---| | minLength | number? | 512 | Target minimum chunk size (chars) | | maxLength | number? | 3072 | Hard maximum chunk size (chars) | | phase | number? | 3 | Stop after this phase (1, 2, or 3) | | title | string? | undefined | Document title — prepended to every breadcrumb |

`Chunk`

| Field | Type | Description | |---|---|---| | level | number | Heading depth (0 = preface, 1–6 = h1–h6) | | header | string? | Text of the nearest heading | | headers | (string \| undefined \| null)[] | Full 6-slot heading stack (h1–h6) | | breadcrumb | string | Human-readable path: "H1 > H2 > H3" | | text | string | Chunk body (without the heading line or breadcrumb). To get the full string an embedding model sees, prepend breadcrumb + "\n\n" when breadcrumb is non-empty. | | length | number | Character count of the full string (including breadcrumb if present) after whitespace collapse. |

Note on length: it counts breadcrumb + "\n\n" + text (or just text if breadcrumb is empty) after collapsing all whitespace runs to a single space — not text.length.

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

breadchunks

Install

Usage

How it works

API

chunk(inputs, options?)

ChunkOptions

Chunk

License

`chunk(inputs, options?)`

`ChunkOptions`

`Chunk`