
chunk-smart

v0.1.2


chunk-smart

Structure-aware text chunker for RAG pipelines.


chunk-smart detects the content type of input text -- markdown, code, JSON, HTML, YAML, or plain text -- and splits it at natural structural boundaries rather than blindly by character count. Markdown is split at heading boundaries, code at function and class boundaries, JSON at top-level keys or array element groups, and plain text at paragraph and sentence boundaries. Every chunk carries rich metadata including positional offsets, token counts, heading context, and overlap indicators.

The package has zero runtime dependencies. All detection and splitting logic uses hand-written scanners and regex patterns. Token counting uses an approximate heuristic (1 token per 4 characters) suitable for most LLM tokenizers. The same input with the same options always produces the same output -- no LLM calls, no network access, fully deterministic.
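The 4-characters-per-token heuristic is simple enough to reproduce. As an independent sketch (not chunk-smart's actual source), it amounts to:

```typescript
// Approximate token count: 1 token per ~4 characters, rounded up.
// Sketch of the documented heuristic, not the library's internal code.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

approxTokens('');            // 0
approxTokens('abcd');        // 1
approxTokens('hello world'); // 11 chars -> 3 tokens
```

Because the count is purely character-based, it is deterministic and tokenizer-agnostic, at the cost of some accuracy versus a real BPE tokenizer.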

Installation

npm install chunk-smart

Quick Start

import { chunk } from 'chunk-smart';

const text = `# Introduction

This is the first section with some content.

## Details

Here are the details of the implementation.

## Conclusion

Final thoughts on the topic.`;

const chunks = chunk(text);

for (const c of chunks) {
  console.log(
    c.metadata.index,
    c.metadata.contentType,
    c.metadata.tokenCount,
    c.content.slice(0, 60)
  );
}

Output:

0 markdown 15 # Introduction\n\nThis is the first section with some c
1 markdown 14 ## Details\n\nHere are the details of the implementatio
2 markdown 10 ## Conclusion\n\nFinal thoughts on the topic.

Features

  • Auto-detection -- Identifies content type (markdown, code, JSON, HTML, YAML, plain text) from structural markers with a confidence score.
  • Structure-aware splitting -- Splits at headings, function/class boundaries, JSON keys, paragraph breaks, and sentence endings instead of arbitrary character positions.
  • Rich metadata -- Every chunk includes its sequential index, character offsets in the original text, token count, character count, content type, heading context (markdown), detected code language, and overlap indicators.
  • Configurable chunk sizing -- Control maximum and minimum token counts per chunk, with an approximate tokenizer (1 token = 4 characters).
  • Overlap support -- Configurable token-based overlap between adjacent chunks for continuity in retrieval pipelines.
  • Factory pattern -- createChunker produces a reusable chunker instance with preset defaults, avoiding repeated option parsing.
  • Type-specific methods -- Dedicated chunkMarkdown, chunkCode, and chunkJSON methods bypass auto-detection when you know the input format.
  • Zero dependencies -- No runtime dependencies. Ships as compiled CommonJS with TypeScript declarations.
  • Deterministic -- Same input and options always produce the same output. No LLM calls, no network access.

API Reference

chunk(text, options?)

Splits a text string into an array of Chunk objects. Auto-detects content type unless contentType is specified in options.

Signature:

function chunk(text: string, options?: ChunkOptions): Chunk[]

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| text | string | The text to split into chunks. |
| options | ChunkOptions | Optional configuration for chunking behavior. |

Returns: Chunk[] -- An array of chunk objects, each containing content and metadata.

Example:

import { chunk } from 'chunk-smart';

// Auto-detect content type
const chunks = chunk('# Hello\n\nWorld.\n\n## Section\n\nDetails.');

// Force content type and limit chunk size
const smallChunks = chunk(longText, {
  maxTokens: 256,
  contentType: 'markdown',
});

// Enable overlap between chunks
const overlapping = chunk(document, {
  maxTokens: 512,
  overlap: 50,
});

createChunker(defaultOptions?)

Creates a reusable chunker instance with preset default options. All methods on the returned Chunker accept optional overrides that are merged with the defaults.

Signature:

function createChunker(defaultOptions?: ChunkOptions): Chunker

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| defaultOptions | ChunkOptions | Default options applied to every call on the returned chunker. |

Returns: Chunker -- An object with chunk, chunkMarkdown, chunkCode, chunkJSON, and detectContentType methods.

Example:

import { createChunker } from 'chunk-smart';

const chunker = createChunker({ maxTokens: 256, overlap: 20 });

const mdChunks = chunker.chunkMarkdown(markdownText);
const codeChunks = chunker.chunkCode(sourceCode);
const jsonChunks = chunker.chunkJSON(jsonString);
const autoChunks = chunker.chunk(unknownText);

Chunker Interface

The object returned by createChunker.

interface Chunker {
  chunk(text: string, overrides?: Partial<ChunkOptions>): Chunk[];
  chunkMarkdown(text: string, options?: Partial<ChunkOptions>): Chunk[];
  chunkCode(text: string, options?: Partial<ChunkOptions>): Chunk[];
  chunkJSON(text: string, options?: Partial<ChunkOptions>): Chunk[];
  detectContentType(text: string): DetectResult;
}

| Method | Description |
|--------|-------------|
| chunk | Auto-detects content type and splits accordingly. |
| chunkMarkdown | Forces contentType: 'markdown' and splits at heading boundaries, then paragraphs, then sentences. |
| chunkCode | Forces contentType: 'code' and splits at function, class, def, const, let, var boundaries, then blank lines. |
| chunkJSON | Forces contentType: 'json' and splits at top-level object keys or array element groups. Falls back to token-based splitting for invalid JSON. |
| detectContentType | Returns the detected content type and confidence score without chunking. |


detectContentType(text)

Analyzes text and returns the detected content type with a confidence score.

Signature:

function detectContentType(text: string): DetectResult

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| text | string | The text to analyze. |

Returns: DetectResult -- An object with type and confidence.

Detection heuristics:

| Content Type | Signals | Confidence |
|--------------|---------|------------|
| json | Starts with { or [; valid JSON.parse | 0.95 |
| json | Starts with { or [; invalid parse | 0.70 |
| html | <!DOCTYPE html>, <html>, or common block tags (div, p, span, body, etc.) | 0.90 |
| yaml | --- marker or 2+ key: value lines; no < characters | 0.80 |
| markdown | # headings, triple backtick fences, **bold**, list items, [link](url) (score >= 2) | 0.80 |
| code | function, class, def, import, const/let/var assignments, indented {}; lines (score >= 2) | 0.70 |
| text | No structural markers detected | 0.50 |
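To illustrate how score-based detection works, here is a simplified sketch of the markdown case. The signal list and the score >= 2 threshold follow the heuristics above; the specific regexes are assumptions, not chunk-smart's actual scanner:

```typescript
// Simplified sketch of score-based markdown detection.
// Counts how many markdown signals appear; 2+ signals => markdown.
function looksLikeMarkdown(text: string): boolean {
  const signals = [
    /^#{1,6}\s/m,           // ATX headings
    /```/,                  // fenced code blocks
    /\*\*[^*]+\*\*/,        // **bold**
    /^[-*+]\s/m,            // list items
    /\[[^\]]+\]\([^)]+\)/,  // [link](url)
  ];
  const score = signals.filter((re) => re.test(text)).length;
  return score >= 2;
}

looksLikeMarkdown('# Title\n\n- item one\n- item two'); // true (heading + list)
looksLikeMarkdown('Just plain text.');                  // false
```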

Example:

import { detectContentType } from 'chunk-smart';

detectContentType('{"key": "value"}');
// { type: 'json', confidence: 0.95 }

detectContentType('# Title\n\nParagraph text.');
// { type: 'markdown', confidence: 0.8 }

detectContentType('function greet() { return "hi"; }');
// { type: 'code', confidence: 0.7 }

detectContentType('Just plain text with no markers.');
// { type: 'text', confidence: 0.5 }

Configuration

ChunkOptions

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxTokens | number | 512 | Maximum tokens per chunk. 1 token is approximately 4 characters. |
| minTokens | number | 50 | Minimum tokens per chunk (informational). |
| overlap | number | 0 | Number of tokens of overlap between adjacent chunks. |
| contentType | ContentType | auto-detected | Force a specific content type instead of auto-detecting. One of 'markdown', 'code', 'html', 'json', 'yaml', 'text'. |
| preserveStructure | boolean | true | When true, uses boundary-aware splitting (paragraphs, sentences). When false, falls back to fixed-size token splitting for html, yaml, and text types. |

ContentType

type ContentType = 'markdown' | 'code' | 'html' | 'json' | 'yaml' | 'text'

Content Type Splitting Strategies

| Type | Primary Boundary | Secondary Boundary | Tertiary Boundary |
|------|------------------|--------------------|-------------------|
| markdown | # headings | Double newlines (paragraphs) | Sentence endings |
| code | function/class/def/const/let/var declarations | Blank lines | Character-level hard split |
| json | Top-level object keys or array element groups | Token-based fallback (invalid JSON) | -- |
| html | Paragraphs | Sentences | Word boundaries |
| yaml | Paragraphs | Sentences | Word boundaries |
| text | Double newlines (paragraphs) | Sentence endings (. ! ?) | Word boundaries |
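The boundary cascade for plain text (paragraphs first, then sentences when a paragraph exceeds the budget) can be sketched roughly as follows. This is an independent illustration of the strategy using the documented ~4-chars-per-token heuristic, not chunk-smart's implementation:

```typescript
// Rough sketch: split plain text at paragraph boundaries, falling back to
// sentence boundaries when a single paragraph exceeds the token budget.
function splitPlainText(text: string, maxTokens: number): string[] {
  const budget = maxTokens * 4; // approximate characters per chunk
  const pieces: string[] = [];
  for (const para of text.split(/\n{2,}/)) {
    if (Math.ceil(para.length / 4) <= maxTokens) {
      if (para.trim()) pieces.push(para.trim());
    } else {
      // Fall back to sentence endings (. ! ?), packing sentences greedily.
      let current = '';
      for (const sentence of para.split(/(?<=[.!?])\s+/)) {
        if (current && (current.length + sentence.length + 1) > budget) {
          pieces.push(current.trim());
          current = sentence;
        } else {
          current = current ? current + ' ' + sentence : sentence;
        }
      }
      if (current.trim()) pieces.push(current.trim());
    }
  }
  return pieces;
}

splitPlainText('First para.\n\nSecond para.', 512);
// ['First para.', 'Second para.']
```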

Types

Chunk

interface Chunk {
  content: string;
  metadata: ChunkMetadata;
}

ChunkMetadata

interface ChunkMetadata {
  index: number;          // Sequential position in the result array (0-based)
  startOffset: number;    // Character offset of chunk start in the original text
  endOffset: number;      // Character offset of chunk end in the original text
  tokenCount: number;     // Approximate token count: Math.ceil(content.length / 4)
  charCount: number;      // Character count: content.length
  contentType: ContentType; // Detected or specified content type
  headings: string[];     // Headings found within the chunk (markdown only)
  codeLanguage?: string;  // Language detected from ``` fence or #! shebang (code only)
  overlapBefore: number;  // Characters of overlap with the previous chunk
  overlapAfter: number;   // Characters of overlap with the next chunk
}

DetectResult

interface DetectResult {
  type: ContentType;
  confidence: number;     // 0.0 to 1.0
}

Error Handling

chunk-smart handles edge cases gracefully without throwing exceptions:

  • Empty string -- Returns an empty array [].
  • Whitespace-only input -- Returns an empty array [].
  • Invalid JSON with contentType: 'json' -- Falls back to token-based splitting.
  • Single word exceeding maxTokens -- Hard-splits at the character limit to guarantee every chunk respects the size constraint.
  • No structural boundaries found -- Falls back to the next finer boundary level (paragraphs to sentences to words to characters).
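The hard-split guarantee in the last two cases reduces to slicing at the character budget. A minimal sketch, assuming the documented 4-characters-per-token conversion (again an illustration, not the library's code):

```typescript
// Minimal sketch of a character-level hard split that guarantees every
// chunk stays within maxTokens (~4 characters per token).
function hardSplit(text: string, maxTokens: number): string[] {
  const maxChars = maxTokens * 4;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

hardSplit('a'.repeat(100), 10); // 3 chunks of 40, 40, and 20 characters
```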

Advanced Usage

Chunking a Codebase for Retrieval

import { createChunker } from 'chunk-smart';
import { readFileSync } from 'node:fs';

const chunker = createChunker({ maxTokens: 512, overlap: 30 });

const source = readFileSync('src/parser.ts', 'utf-8');
const chunks = chunker.chunkCode(source);

for (const c of chunks) {
  console.log(`Chunk ${c.metadata.index}: ${c.metadata.tokenCount} tokens`);
  if (c.metadata.codeLanguage) {
    console.log(`  Language: ${c.metadata.codeLanguage}`);
  }
}

Processing Large JSON API Responses

import { chunk } from 'chunk-smart';

const apiResponse = JSON.stringify(largeDataset, null, 2);
const chunks = chunk(apiResponse, {
  maxTokens: 1024,
  contentType: 'json',
});

// Each chunk is valid JSON (subset of top-level keys or array elements)
for (const c of chunks) {
  const parsed = JSON.parse(c.content);
  console.log(`Chunk ${c.metadata.index}: ${Object.keys(parsed).length} keys`);
}

Markdown Documentation with Heading Context

import { chunk } from 'chunk-smart';
import { readFileSync } from 'node:fs';

const docs = readFileSync('API.md', 'utf-8');
const chunks = chunk(docs, { maxTokens: 256, contentType: 'markdown' });

for (const c of chunks) {
  if (c.metadata.headings.length > 0) {
    console.log(`Section: ${c.metadata.headings.join(' > ')}`);
  }
  console.log(`  Offset: ${c.metadata.startOffset}-${c.metadata.endOffset}`);
  console.log(`  Tokens: ${c.metadata.tokenCount}`);
}

Disabling Structure-Aware Splitting

import { chunk } from 'chunk-smart';

// Fixed-size token splitting without boundary awareness
const chunks = chunk(text, {
  maxTokens: 128,
  overlap: 20,
  preserveStructure: false,
  contentType: 'text',
});

Using Per-Call Overrides with a Chunker Instance

import { createChunker } from 'chunk-smart';

const chunker = createChunker({ maxTokens: 512 });

// Override maxTokens for a single call
const small = chunker.chunk(text, { maxTokens: 128 });

// Detect content type without chunking
const detected = chunker.detectContentType(unknownInput);
console.log(detected.type, detected.confidence);

TypeScript

chunk-smart is written in TypeScript with strict mode enabled. Type declarations are shipped alongside the compiled JavaScript in the dist/ directory.

All public types are exported from the package entry point:

import {
  chunk,
  createChunker,
  detectContentType,
} from 'chunk-smart';

import type {
  ContentType,
  DetectResult,
  ChunkMetadata,
  Chunk,
  ChunkOptions,
  Chunker,
} from 'chunk-smart';

License

MIT