code-chunk

v0.1.12

AST-aware code chunking for semantic search and RAG pipelines.

Uses tree-sitter to split source code at semantic boundaries (functions, classes, methods) rather than arbitrary character limits. Each chunk includes rich context: scope chain, imports, siblings, and entity signatures.

Features

  • AST-aware: Splits at semantic boundaries, never mid-function
  • Rich context: Scope chain, imports, siblings, entity signatures
  • Contextualized text: Pre-formatted for embedding models
  • Multi-language: TypeScript, JavaScript, Python, Rust, Go, Java
  • Batch processing: Process entire codebases with controlled concurrency
  • Streaming: Process large files incrementally
  • Effect support: First-class Effect integration

How It Works

Traditional text splitters chunk code by character count or line breaks, often cutting functions in half or separating related code. code-chunk takes a different approach:

1. Parse

Source code is parsed into an Abstract Syntax Tree (AST) using tree-sitter. This gives us a structured representation of the code that understands language grammar.

2. Extract

We traverse the AST to extract semantic entities: functions, methods, classes, interfaces, types, and imports. For each entity, we capture:

  • Name and type
  • Full signature (e.g., async getUser(id: string): Promise<User>)
  • Docstring/comments if present
  • Byte and line ranges

3. Build Scope Tree

Entities are organized into a hierarchical scope tree that captures nesting relationships. A method inside a class knows its parent; a nested function knows its containing function. This enables us to provide scope context like UserService > getUser.
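As an illustration of the idea, a scope chain like `UserService > getUser` can be derived with a simple tree walk. The `ScopeNode` shape below is a hypothetical stand-in for this sketch, not code-chunk's actual internal type:

```typescript
// Illustrative sketch: deriving a scope chain such as "UserService > getUser"
// from a nested scope tree. ScopeNode is a hypothetical shape for this
// example, not code-chunk's internal representation.
interface ScopeNode {
  name: string
  type: 'class' | 'function' | 'method'
  children: ScopeNode[]
}

// Depth-first search that returns the chain of names from the root
// down to the target entity, or null if the entity isn't found.
function scopeChain(root: ScopeNode, target: string, path: string[] = []): string[] | null {
  const next = [...path, root.name]
  if (root.name === target) return next
  for (const child of root.children) {
    const found = scopeChain(child, target, next)
    if (found) return found
  }
  return null
}

const tree: ScopeNode = {
  name: 'UserService',
  type: 'class',
  children: [{ name: 'getUser', type: 'method', children: [] }],
}

console.log(scopeChain(tree, 'getUser')?.join(' > ')) // "UserService > getUser"
```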

4. Chunk

Code is split at semantic boundaries while respecting the maxChunkSize limit. The chunker:

  • Prefers to keep complete entities together
  • Splits oversized entities at logical points (statement boundaries)
  • Never cuts mid-expression or mid-statement
  • Merges small adjacent chunks to reduce fragmentation
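The merge step can be illustrated with a minimal greedy pass. This is a simplification for intuition only; the real chunker also tracks entity boundaries:

```typescript
// Simplified sketch of the merge pass: greedily combine adjacent small
// chunks as long as the combined size stays within maxChunkSize.
// Illustrative only; code-chunk's actual merger also respects entities.
function mergeSmallChunks(chunks: string[], maxChunkSize: number): string[] {
  const merged: string[] = []
  for (const chunk of chunks) {
    const last = merged[merged.length - 1]
    // +1 accounts for the newline joining the two chunks
    if (last !== undefined && last.length + chunk.length + 1 <= maxChunkSize) {
      merged[merged.length - 1] = last + '\n' + chunk
    } else {
      merged.push(chunk)
    }
  }
  return merged
}

// Two tiny chunks merge; the oversized one stays separate.
console.log(mergeSmallChunks(['a', 'b', 'x'.repeat(40)], 10).length) // 2
```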

5. Enrich with Context

Each chunk is enriched with contextual metadata:

  • Scope chain: Where this code lives (e.g., inside which class/function)
  • Entities: What's defined in this chunk
  • Siblings: What comes before/after (for continuity)
  • Imports: What dependencies are used

This context is formatted into contextualizedText, optimized for embedding models to understand semantic relationships.
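As a sketch, building such a header looks roughly like the following. The `ChunkContextSketch` fields mirror the context described above but are simplified for illustration; this is not code-chunk's actual formatter:

```typescript
// Illustrative sketch of how a contextualizedText header could be built.
// ChunkContextSketch is a simplified stand-in, not the library's type.
interface ChunkContextSketch {
  filepath: string
  scope: string[]      // e.g. ['UserService']
  defines: string[]    // entity signatures defined in the chunk
  uses: string[]       // imported dependencies referenced by the chunk
}

function formatContextHeader(ctx: ChunkContextSketch): string {
  const lines = [`# ${ctx.filepath}`]
  if (ctx.scope.length) lines.push(`# Scope: ${ctx.scope.join(' > ')}`)
  if (ctx.defines.length) lines.push(`# Defines: ${ctx.defines.join(', ')}`)
  if (ctx.uses.length) lines.push(`# Uses: ${ctx.uses.join(', ')}`)
  return lines.join('\n')
}

console.log(formatContextHeader({
  filepath: 'src/services/user.ts',
  scope: ['UserService'],
  defines: ['async getUser(id: string): Promise<User>'],
  uses: ['Database'],
}))
```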

Installation

bun add code-chunk
# or
npm install code-chunk

Quickstart

Basic Usage

import { chunk } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode)

for (const c of chunks) {
  console.log(c.text)
  console.log(c.context.scope)    // [{ name: 'UserService', type: 'class' }]
  console.log(c.context.entities) // [{ name: 'getUser', type: 'method', ... }]
}

Using Contextualized Text for Embeddings

Use contextualizedText for better embedding quality in RAG systems:

// `embed`, `vectorDB`, and `filepath` come from your own pipeline
for (const c of chunks) {
  const embedding = await embed(c.contextualizedText)
  await vectorDB.upsert({
    id: `${filepath}:${c.index}`,
    embedding,
    metadata: { filepath, lines: c.lineRange }
  })
}

The contextualizedText prepends semantic context to the raw code:

# src/services/user.ts
# Scope: UserService
# Defines: async getUser(id: string): Promise<User>
# Uses: Database
# After: constructor

  async getUser(id: string): Promise<User> {
    return this.db.query('SELECT * FROM users WHERE id = ?', [id])
  }

Streaming Large Files

Process chunks incrementally without loading everything into memory:

import { chunkStream } from 'code-chunk'

for await (const c of chunkStream('src/large.ts', code)) {
  await process(c)
}

Reusable Chunker

Create a chunker instance when processing multiple files with the same config:

import { createChunker } from 'code-chunk'

const chunker = createChunker({
  maxChunkSize: 2048,
  contextMode: 'full',
  siblingDetail: 'signatures',
})

for (const file of files) {
  const chunks = await chunker.chunk(file.path, file.content)
}

Batch Processing

Process multiple files concurrently with error handling per file:

import { chunkBatch } from 'code-chunk'

const files = [
  { filepath: 'src/user.ts', code: userCode },
  { filepath: 'src/auth.ts', code: authCode },
  { filepath: 'lib/utils.py', code: utilsCode },
]

const results = await chunkBatch(files, {
  maxChunkSize: 1500,
  concurrency: 10,
  onProgress: (done, total, path, success) => {
    console.log(`[${done}/${total}] ${path}: ${success ? 'ok' : 'failed'}`)
  }
})

for (const result of results) {
  if (result.error) {
    console.error(`Failed: ${result.filepath}`, result.error)
  } else {
    await indexChunks(result.filepath, result.chunks)
  }
}

Stream results as they complete:

import { chunkBatchStream } from 'code-chunk'

for await (const result of chunkBatchStream(files, { concurrency: 5 })) {
  if (result.chunks) {
    await indexChunks(result.filepath, result.chunks)
  }
}

Effect Integration

For Effect-based pipelines:

import { chunkStreamEffect } from 'code-chunk'
import { Effect, Stream } from 'effect'

const program = Stream.runForEach(
  chunkStreamEffect('src/utils.ts', code),
  (chunk) => Effect.log(chunk.text)
)

await Effect.runPromise(program)

API Reference

chunk(filepath, code, options?)

Chunk source code into semantic pieces with context.

Parameters:

  • filepath: File path (used for language detection)
  • code: Source code string
  • options: Optional configuration

Returns: Promise<Chunk[]>

Throws: ChunkingError, UnsupportedLanguageError


chunkStream(filepath, code, options?)

Stream chunks as they're generated. Useful for large files.

Returns: AsyncGenerator<Chunk>

Note: chunk.totalChunks is -1 in streaming mode (unknown upfront).


chunkStreamEffect(filepath, code, options?)

Effect-native streaming API for composable pipelines.

Returns: Stream.Stream<Chunk, ChunkingError | UnsupportedLanguageError>


createChunker(options?)

Create a reusable chunker instance with default options.

Returns: Chunker with chunk(), stream(), chunkBatch(), and chunkBatchStream() methods


chunkBatch(files, options?)

Process multiple files concurrently with per-file error handling.

Parameters:

  • files: Array of { filepath, code, options? }
  • options: Batch options (extends ChunkOptions with concurrency and onProgress)

Returns: Promise<BatchResult[]> where each result has { filepath, chunks, error }


chunkBatchStream(files, options?)

Stream batch results as files complete processing.

Returns: AsyncGenerator<BatchResult>


chunkBatchEffect(files, options?)

Effect-native batch processing.

Returns: Effect.Effect<BatchResult[], never>


chunkBatchStreamEffect(files, options?)

Effect-native streaming batch processing.

Returns: Stream.Stream<BatchResult, never>


formatChunkWithContext(text, context, overlapText?)

Format chunk text with semantic context prepended. Useful for custom embedding pipelines.

Returns: string


detectLanguage(filepath)

Detect programming language from file extension.

Returns: Language | null
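A minimal sketch of extension-based detection, using the extensions from the Supported Languages table below (illustrative only, not the library's implementation):

```typescript
// Sketch of extension-based language detection, mirroring the
// Supported Languages table. Not code-chunk's actual implementation.
const EXTENSION_MAP: Record<string, string> = {
  ts: 'typescript', tsx: 'typescript', mts: 'typescript', cts: 'typescript',
  js: 'javascript', jsx: 'javascript', mjs: 'javascript', cjs: 'javascript',
  py: 'python', pyi: 'python',
  rs: 'rust',
  go: 'go',
  java: 'java',
}

function detectLanguageSketch(filepath: string): string | null {
  const ext = filepath.split('.').pop() ?? ''
  return EXTENSION_MAP[ext] ?? null
}

console.log(detectLanguageSketch('src/user.ts'))  // "typescript"
console.log(detectLanguageSketch('README.md'))    // null
```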


ChunkOptions

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxChunkSize | number | 1500 | Maximum chunk size in bytes |
| contextMode | 'none' \| 'minimal' \| 'full' | 'full' | How much context to include |
| siblingDetail | 'none' \| 'names' \| 'signatures' | 'signatures' | Level of sibling detail |
| filterImports | boolean | false | Filter out import statements |
| language | Language | auto | Override language detection |
| overlapLines | number | 10 | Lines from the previous chunk to include in contextualizedText |

BatchOptions

Extends ChunkOptions with:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| concurrency | number | 10 | Maximum files to process concurrently |
| onProgress | function | - | Callback (completed, total, filepath, success) => void |


Supported Languages

| Language | Extensions |
|----------|------------|
| TypeScript | .ts, .tsx, .mts, .cts |
| JavaScript | .js, .jsx, .mjs, .cjs |
| Python | .py, .pyi |
| Rust | .rs |
| Go | .go |
| Java | .java |


Errors

ChunkingError: Thrown when chunking fails (parsing error, extraction error, etc.)

UnsupportedLanguageError: Thrown when the file extension is not supported

Both errors have a _tag property for Effect-style error handling.
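A sketch of discriminating on `_tag` (the error classes below are simplified stand-ins for the library's errors, kept only to show the pattern):

```typescript
// Sketch of Effect-style error discrimination via the _tag property.
// These classes are simplified stand-ins for code-chunk's actual errors.
class ChunkingError extends Error {
  readonly _tag = 'ChunkingError'
}
class UnsupportedLanguageError extends Error {
  readonly _tag = 'UnsupportedLanguageError'
}

// The _tag literal lets TypeScript narrow the union exhaustively.
function describeError(err: ChunkingError | UnsupportedLanguageError): string {
  switch (err._tag) {
    case 'ChunkingError':
      return 'chunking failed'
    case 'UnsupportedLanguageError':
      return 'file extension not supported'
  }
}

console.log(describeError(new UnsupportedLanguageError())) // "file extension not supported"
```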

License

MIT