@sharc-code/splitter
v0.2.3
SHARC Splitter - AST and LangChain code chunking for semantic search
Code chunking library for semantic search, featuring AST-based splitting with context injection and LangChain fallback.
Overview
@sharc-code/splitter provides intelligent code chunking for RAG (Retrieval-Augmented Generation) and semantic code search applications. It extracts meaningful code units while preserving semantic context.
Key Features
- AST-Based Splitting: Tree-sitter powered parsing for 9 languages
- Context Injection: Automatically adds class/module context to extracted methods
- Decorator/Annotation Support: Preserves decorators (@Get(), #[derive], etc.) in context
- LangChain Fallback: Character-based splitting for unsupported languages
- Syntax Error Detection: Validate code before indexing
- Memory Safe: Proper cleanup of native tree-sitter resources
Installation
# npm
npm install @sharc-code/splitter
# bun
bun add @sharc-code/splitter
# pnpm
pnpm add @sharc-code/splitter
Note: This package includes native tree-sitter bindings. Ensure you have a C++ compiler available:
- Windows: Visual Studio Build Tools
- macOS: Xcode Command Line Tools (xcode-select --install)
- Linux: build-essential package
Quick Start
import { AstCodeSplitter, LangChainCodeSplitter } from '@sharc-code/splitter';
// AST-based splitting (recommended for code)
const astSplitter = new AstCodeSplitter(3500, 0);
const chunks = await astSplitter.split(
  `class UserService {
  async authenticate(user: string): Promise<boolean> {
    return this.validateCredentials(user);
  }
}`,
  'typescript',
  'src/services/user.ts'
);
console.log(chunks[0].content);
// Output:
// // Context: class UserService (services/user.ts)
// async authenticate(user: string): Promise<boolean> {
// return this.validateCredentials(user);
// }
// LangChain splitting (for docs/config)
const langchainSplitter = new LangChainCodeSplitter(1500, 150);
const docChunks = await langchainSplitter.split(markdownContent, 'markdown', 'README.md');
Splitters
AstCodeSplitter
Tree-sitter based splitter that extracts complete semantic units (functions, classes, methods) with automatic context injection.
Supported Languages:
- TypeScript / JavaScript
- Python
- Java
- C++ / C
- Go
- Rust
- C#
- Scala
Features:
- Extracts functions, classes, methods, interfaces as complete units
- Injects context comments (e.g., // Context: class UserService > module auth)
- Automatic fallback to LangChain for unsupported languages
- Syntax error detection via checkSyntaxErrors()
- Memory-safe with dispose() for cleanup
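To make the context-comment feature concrete, here is a minimal sketch of how such a comment could be assembled. It is not the package's implementation: `Container`, `shortPath`, and `buildContextComment` are hypothetical names, and the sketch assumes containers are listed outermost first, that Python uses `#` while other languages use `//`, and that only the last two path segments are shown (as in the Quick Start output).

```typescript
// Hypothetical helper, not part of @sharc-code/splitter: it only illustrates
// how a context comment could be assembled from the enclosing containers.
interface Container {
  kind: string; // e.g. "class", "module", "namespace"
  name: string;
}

// Assumption: Python uses "#" comments; every other language here uses "//".
function commentPrefix(language: string): string {
  return language === 'python' ? '#' : '//';
}

// Assumption: only the last two path segments are kept.
function shortPath(filePath: string): string {
  return filePath.split('/').slice(-2).join('/');
}

function buildContextComment(
  containers: Container[], // assumed to be ordered outermost first
  language: string,
  filePath?: string
): string {
  const path = containers.map((c) => `${c.kind} ${c.name}`).join(' > ');
  const suffix = filePath ? ` (${shortPath(filePath)})` : '';
  return `${commentPrefix(language)} Context: ${path}${suffix}`;
}

const comment = buildContextComment(
  [{ kind: 'class', name: 'UserService' }],
  'typescript',
  'src/services/user.ts'
);
console.log(comment); // "// Context: class UserService (services/user.ts)"
```

Prepending a line like this to each extracted method keeps the enclosing class name in the embedded text even though the class body itself is not part of the chunk.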
import { AstCodeSplitter } from '@sharc-code/splitter';
const splitter = new AstCodeSplitter(
  3500, // chunkSize (max characters per chunk)
  0     // chunkOverlap (0 for AST - semantic units don't need overlap)
);
// Split code
const chunks = await splitter.split(code, 'typescript', 'src/auth.ts');
// Check for syntax errors before indexing
const { hasErrors, errorCount } = splitter.checkSyntaxErrors(code, 'typescript');
// Check if language is supported
if (AstCodeSplitter.isLanguageSupported('rust')) {
  // Use AST splitter
}
// Clean up when done (important for long-running processes)
splitter.dispose();
LangChainCodeSplitter
Character-based splitter using LangChain's RecursiveCharacterTextSplitter. Best for documentation, configuration files, and languages without AST support.
Supported Languages:
- JavaScript/TypeScript (as js)
- Python, Java, C++, Go, Rust, PHP, Ruby, Swift, Scala
- Markdown, HTML, LaTeX
- Solidity
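The chunkOverlap parameter is easiest to understand with a toy example. The sketch below is a simplified fixed-window chunker, not RecursiveCharacterTextSplitter (which prefers to break on separators such as blank lines); `chunkWithOverlap` is a hypothetical name used only to show how consecutive chunks share characters.

```typescript
// Simplified fixed-window chunker. This is NOT RecursiveCharacterTextSplitter;
// it only demonstrates what "overlap" means for consecutive chunks.
function chunkWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0) {
    throw new Error('chunkSize must be positive');
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be in [0, chunkSize)');
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const parts = chunkWithOverlap('abcdefghij', 4, 2);
console.log(parts); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

Each chunk repeats the last two characters of the previous one, so text that straddles a boundary still appears intact in at least one chunk.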
import { LangChainCodeSplitter } from '@sharc-code/splitter';
const splitter = new LangChainCodeSplitter(
  1500, // chunkSize
  150   // chunkOverlap (overlap preserves context across chunks)
);
const chunks = await splitter.split(content, 'markdown', 'docs/README.md');
API Reference
CodeChunk
interface CodeChunk {
  content: string;
  metadata: {
    startLine: number;
    endLine: number;
    language?: string;
    filePath?: string;
  };
}
Splitter Interface
interface Splitter {
  split(code: string, language: string, filePath?: string): Promise<CodeChunk[]>;
  setChunkSize(chunkSize: number): void;
  setChunkOverlap(chunkOverlap: number): void;
}
AstCodeSplitter
class AstCodeSplitter implements Splitter {
  constructor(chunkSize?: number, chunkOverlap?: number);
  // Split code into chunks
  split(code: string, language: string, filePath?: string): Promise<CodeChunk[]>;
  // Check for syntax errors
  checkSyntaxErrors(code: string, language: string): { hasErrors: boolean; errorCount: number };
  // Configuration
  setChunkSize(chunkSize: number): void;
  setChunkOverlap(chunkOverlap: number): void;
  // Cleanup native resources
  dispose(): void;
  // Static utilities
  static getSupportedLanguages(): string[];
  static isLanguageSupported(language: string): boolean;
}
LangChainCodeSplitter
class LangChainCodeSplitter implements Splitter {
  constructor(chunkSize?: number, chunkOverlap?: number);
  split(code: string, language: string, filePath?: string): Promise<CodeChunk[]>;
  setChunkSize(chunkSize: number): void;
  setChunkOverlap(chunkOverlap: number): void;
}
Context Injection
The AST splitter automatically adds context comments to extracted code chunks, improving search relevance:
TypeScript/JavaScript/Java/C++/Go/Rust/C#/Scala
// Context: class UserService > module auth (services/user.ts)
async authenticate(user: string): Promise<boolean> {
  return this.validateCredentials(user);
}
Python
# Context: class UserService (services/user.py)
def authenticate(self, user: str) -> bool:
    return self.validate_credentials(user)
Context Hierarchy
The splitter tracks nested containers and builds a context path:
// Input: deeply nested method
namespace App {
  module Auth {
    class UserService {
      authenticate() { ... }
    }
  }
}
// Output chunk:
// Context: namespace App > module Auth > class UserService (auth/user.ts)
authenticate() { ... }
Recommended Chunk Sizes
| Content Type | Chunk Size | Overlap | Splitter | Rationale |
|--------------|------------|---------|----------|-----------|
| Code (AST) | 3500 | 0 | AST | Complete semantic units, no overlap needed |
| Documentation | 1500 | 150 | LangChain | Prose flows between sections |
| Config/Data | 1500 | 100 | LangChain | Related keys grouped together |
| Fallback Code | 1500 | 100 | LangChain | Conservative chunking |
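One way to apply these recommendations is to keep them in a lookup keyed by content type. The sketch below is illustrative only: the `ContentType` keys, the `ChunkSettings` shape, and `settingsFor` are hypothetical names, while the numbers are copied from the table.

```typescript
// Hypothetical content categories; the names are illustrative, not part of the package.
type ContentType = 'code' | 'documentation' | 'config' | 'fallbackCode';

interface ChunkSettings {
  chunkSize: number;    // max characters per chunk
  chunkOverlap: number; // characters shared between consecutive chunks
  splitter: 'ast' | 'langchain';
}

// Values copied from the recommended chunk sizes table.
const RECOMMENDED: Record<ContentType, ChunkSettings> = {
  code: { chunkSize: 3500, chunkOverlap: 0, splitter: 'ast' },
  documentation: { chunkSize: 1500, chunkOverlap: 150, splitter: 'langchain' },
  config: { chunkSize: 1500, chunkOverlap: 100, splitter: 'langchain' },
  fallbackCode: { chunkSize: 1500, chunkOverlap: 100, splitter: 'langchain' },
};

function settingsFor(contentType: ContentType): ChunkSettings {
  return RECOMMENDED[contentType];
}

const docs = settingsFor('documentation');
console.log(docs.chunkSize, docs.chunkOverlap, docs.splitter);
```

The returned chunkSize and chunkOverlap can then be passed to the AstCodeSplitter or LangChainCodeSplitter constructor, depending on the splitter field.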
Memory Management
Tree-sitter uses native C++ bindings that allocate memory outside Node.js's garbage collector. In long-running processes, call dispose() once you are done with a splitter:
const splitter = new AstCodeSplitter();
try {
  // Process many files...
  for (const file of files) {
    const chunks = await splitter.split(file.content, file.language, file.path);
    // ... use chunks
  }
} finally {
  // Free native memory when done
  splitter.dispose();
}
The splitter also automatically cleans up parse trees after each split() call.
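If the try/finally pattern is repeated in several places, it can be wrapped in a small helper. This is a sketch under assumptions: `HasDispose`, `withDisposable`, and `FakeSplitter` are hypothetical names, and the stub class stands in for AstCodeSplitter so the example runs without native bindings.

```typescript
// Minimal shape for anything exposing dispose(), such as AstCodeSplitter.
interface HasDispose {
  dispose(): void;
}

// Hypothetical helper: runs `work` and always disposes the resource afterwards,
// whether `work` resolves or rejects.
async function withDisposable<T extends HasDispose, R>(
  resource: T,
  work: (resource: T) => Promise<R>
): Promise<R> {
  try {
    return await work(resource);
  } finally {
    resource.dispose();
  }
}

// Stub used only to keep the sketch runnable without tree-sitter installed.
class FakeSplitter implements HasDispose {
  disposed = false;

  async split(code: string): Promise<string[]> {
    return [code];
  }

  dispose(): void {
    this.disposed = true;
  }
}

const fake = new FakeSplitter();
withDisposable(fake, (s) => s.split('const x = 1;')).then((chunks) => {
  console.log(chunks.length, fake.disposed);
});
```

Because dispose() runs in a finally block, the native resources are released even when a split() call throws partway through a batch.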
Error Handling
Syntax Error Detection
const splitter = new AstCodeSplitter();
// Check for errors before indexing
const { hasErrors, errorCount } = splitter.checkSyntaxErrors(code, 'typescript');
if (hasErrors) {
  console.warn(`File has ${errorCount} syntax errors, skipping...`);
} else {
  const chunks = await splitter.split(code, 'typescript');
}
Automatic Fallback
If AST parsing fails or the language isn't supported, the AST splitter automatically falls back to LangChain:
const splitter = new AstCodeSplitter();
// Vue files aren't AST-supported, will use LangChain
const chunks = await splitter.split(vueCode, 'vue', 'App.vue');
// Console: "Language vue not supported by AST, using LangChain splitter for: App.vue"
Development
# Install dependencies
bun install
# Build
bun run build
# Type check
bun run typecheck
# Watch mode
bun run dev
# Clean build artifacts
bun run clean
Dependencies
| Package | Purpose |
|---------|---------|
| tree-sitter | AST parsing engine |
| tree-sitter-* | Language grammars (9 languages) |
| langchain | Text splitting utilities |
Use in SHARC
This package is used by:
- @sharc-code/mcp - MCP server for AI assistants
- @sharc/core - Core indexing engine (internal)
For end-to-end semantic code search, see the main SHARC documentation.
License
MIT - See LICENSE for details.
Contributing
All code modifications must be done via Pull Request. See CLAUDE.md for guidelines.
