@sharc-code/splitter
v0.2.8
SHARC Splitter - AST and recursive text chunking for semantic search
Code chunking library for semantic search, featuring AST-based splitting with context injection and recursive character fallback.
Overview
@sharc-code/splitter provides intelligent code chunking for RAG (Retrieval-Augmented Generation) and semantic code search applications. It extracts meaningful code units while preserving semantic context.
Key Features
- AST-Based Splitting: Tree-sitter powered parsing for 9 languages
- Context Injection: Automatically adds class/module context to extracted methods
- Decorator/Annotation Support: Preserves decorators (@Get(), #[derive], etc.) in context
- Recursive Fallback: Character-based splitting for unsupported languages
- Syntax Error Detection: Validate code before indexing
- Memory Safe: Proper cleanup of native tree-sitter resources
Installation
# npm
npm install @sharc-code/splitter
# bun
bun add @sharc-code/splitter --trust
# pnpm
pnpm add @sharc-code/splitter
Note: This package includes native tree-sitter bindings. Ensure you have a C++ compiler available:
- Windows: Visual Studio Build Tools
- macOS: Xcode Command Line Tools (xcode-select --install)
- Linux: build-essential package
If you install with Bun, verify the native bindings were trusted:
bun pm untrusted
If tree-sitter or tree-sitter-* packages still appear there, AST chunking may degrade to the recursive fallback splitter until those dependencies are trusted.
Quick Start
import { AstCodeSplitter, LangChainCodeSplitter } from '@sharc-code/splitter';
// AST-based splitting (recommended for code)
const astSplitter = new AstCodeSplitter(3500, 0);
const chunks = await astSplitter.split(
  `class UserService {
    async authenticate(user: string): Promise<boolean> {
      return this.validateCredentials(user);
    }
  }`,
  'typescript',
  'src/services/user.ts'
);
console.log(chunks[0].content);
// Output:
// // Context: class UserService (services/user.ts)
// async authenticate(user: string): Promise<boolean> {
// return this.validateCredentials(user);
// }
// Recursive splitting (for docs/config)
const langchainSplitter = new LangChainCodeSplitter(1500, 150);
const docChunks = await langchainSplitter.split(markdownContent, 'markdown', 'README.md');
Splitters
AstCodeSplitter
Tree-sitter based splitter that extracts complete semantic units (functions, classes, methods) with automatic context injection.
Supported Languages:
- TypeScript / JavaScript
- Python
- Java
- C++ / C
- Go
- Rust
- C#
- Scala
Features:
- Extracts functions, classes, methods, interfaces as complete units
- Injects context comments (e.g., // Context: class UserService > module auth)
- Automatic fallback to recursive splitting for unsupported languages
- Syntax error detection via checkSyntaxErrors()
- Memory-safe with dispose() for cleanup
import { AstCodeSplitter } from '@sharc-code/splitter';
const splitter = new AstCodeSplitter(
  3500, // chunkSize (max characters per chunk)
  0     // chunkOverlap (0 for AST - semantic units don't need overlap)
);
// Split code
const chunks = await splitter.split(code, 'typescript', 'src/auth.ts');
// Check for syntax errors before indexing
const { hasErrors, errorCount } = splitter.checkSyntaxErrors(code, 'typescript');
// Check if language is supported
if (AstCodeSplitter.isLanguageSupported('rust')) {
  // Use AST splitter
}
// Clean up when done (important for long-running processes)
splitter.dispose();
LangChainCodeSplitter
Character-based splitter using SHARC's vendored recursive character splitter. Best for documentation, configuration files, and languages without AST support.
This class name is kept for backward compatibility even though SHARC no longer depends on LangChain internally.
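To illustrate what recursive character splitting means in general, here is a minimal sketch. It is not SHARC's vendored implementation: the function name recursiveSplit, the default separator list, and the omission of chunk overlap are all assumptions made to keep the example short. The idea is to try coarse separators first and recurse to finer ones only for pieces that are still too large.

```typescript
// Hypothetical sketch of recursive character splitting (overlap omitted).
// Not the package's actual code; names and separators are illustrative.
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", "\n", " ", ""]
): string[] {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];

  const [sep, ...finer] = separators;
  if (sep === undefined || sep === "") {
    // No separator left: cut at fixed character offsets.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize));
    }
    return pieces;
  }

  const chunks: string[] = [];
  let current = "";
  for (const part of text.split(sep)) {
    const candidate = current === "" ? part : current + sep + part;
    if (candidate.length <= chunkSize) {
      // Keep merging neighbouring parts while they still fit.
      current = candidate;
      continue;
    }
    if (current !== "") chunks.push(current);
    if (part.length > chunkSize) {
      // A single part is too large: retry it with finer separators.
      chunks.push(...recursiveSplit(part, chunkSize, finer));
      current = "";
    } else {
      current = part;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

A production splitter would additionally carry chunkOverlap characters from one chunk into the next and use language-specific separators.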
Supported Languages:
- JavaScript/TypeScript (as js)
- Python, Java, C++, Go, Rust, PHP, Ruby, Swift, Scala
- Markdown, HTML, LaTeX
- Solidity
import { LangChainCodeSplitter } from '@sharc-code/splitter';
const splitter = new LangChainCodeSplitter(
  1500, // chunkSize
  150   // chunkOverlap (overlap preserves context across chunks)
);
const chunks = await splitter.split(content, 'markdown', 'docs/README.md');
API Reference
CodeChunk
interface CodeChunk {
  content: string;
  metadata: {
    startLine: number;
    endLine: number;
    language?: string;
    filePath?: string;
    chunkKind?: 'wrapper' | 'member' | 'container' | 'data_container' | 'standalone';
    documentClass?: 'implementation' | 'test' | 'docs_examples';
  };
}
Splitter Interface
interface Splitter {
  split(code: string, language: string, filePath?: string): Promise<CodeChunk[]>;
  setChunkSize(chunkSize: number): void;
  setChunkOverlap(chunkOverlap: number): void;
}
AstCodeSplitter
class AstCodeSplitter implements Splitter {
  constructor(chunkSize?: number, chunkOverlap?: number);
  // Split code into chunks
  split(code: string, language: string, filePath?: string): Promise<CodeChunk[]>;
  // Check for syntax errors
  checkSyntaxErrors(code: string, language: string): { hasErrors: boolean; errorCount: number };
  // Configuration
  setChunkSize(chunkSize: number): void;
  setChunkOverlap(chunkOverlap: number): void;
  // Cleanup native resources
  dispose(): void;
  // Static utilities
  static getSupportedLanguages(): string[];
  static isLanguageSupported(language: string): boolean;
}
LangChainCodeSplitter
class LangChainCodeSplitter implements Splitter {
  constructor(chunkSize?: number, chunkOverlap?: number);
  split(code: string, language: string, filePath?: string): Promise<CodeChunk[]>;
  setChunkSize(chunkSize: number): void;
  setChunkOverlap(chunkOverlap: number): void;
}
Context Injection
The AST splitter automatically adds context comments to extracted code chunks, improving search relevance:
TypeScript/JavaScript/Java/C++/Go/Rust/C#/Scala
// Context: class UserService > module auth (services/user.ts)
async authenticate(user: string): Promise<boolean> {
  return this.validateCredentials(user);
}
Python
# Context: class UserService (services/user.py)
def authenticate(self, user: str) -> bool:
    return self.validate_credentials(user)
Context Hierarchy
The splitter tracks nested containers and builds a context path:
// Input: deeply nested method
namespace App {
  module Auth {
    class UserService {
      authenticate() { ... }
    }
  }
}
// Output chunk:
// Context: namespace App > module Auth > class UserService (auth/user.ts)
authenticate() { ... }
Recommended Chunk Sizes
| Content Type | Chunk Size | Overlap | Splitter | Rationale |
|--------------|------------|---------|----------|-----------|
| Code (AST) | 3500 | 0 | AST | Complete semantic units, no overlap needed |
| Documentation | 1500 | 150 | Recursive | Prose flows between sections |
| Config/Data | 1500 | 100 | Recursive | Related keys grouped together |
| Fallback Code | 1500 | 100 | Recursive | Conservative chunking |
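One way to apply these recommendations is a small lookup keyed by content type. The sketch below is only an illustration: the names ContentType, ChunkConfig, RECOMMENDED, and classify are hypothetical and not part of the package, and the file-extension lists are assumptions you would adapt to your own repository.

```typescript
// Hypothetical helper: maps a file to the recommended settings from the table.
// None of these names are exported by @sharc-code/splitter.
type ContentType = "code" | "documentation" | "config" | "fallback-code";

interface ChunkConfig {
  chunkSize: number;
  chunkOverlap: number;
  splitter: "ast" | "recursive";
}

const RECOMMENDED: Record<ContentType, ChunkConfig> = {
  "code": { chunkSize: 3500, chunkOverlap: 0, splitter: "ast" },
  "documentation": { chunkSize: 1500, chunkOverlap: 150, splitter: "recursive" },
  "config": { chunkSize: 1500, chunkOverlap: 100, splitter: "recursive" },
  "fallback-code": { chunkSize: 1500, chunkOverlap: 100, splitter: "recursive" },
};

// Assumed extension lists; extend them to match your project.
const DOC_EXTENSIONS = new Set([".md", ".mdx", ".rst", ".txt"]);
const CONFIG_EXTENSIONS = new Set([".json", ".yaml", ".yml", ".toml"]);

// astSupported would typically come from AstCodeSplitter.isLanguageSupported(language).
function classify(filePath: string, astSupported: boolean): ContentType {
  const dot = filePath.lastIndexOf(".");
  const ext = dot >= 0 ? filePath.slice(dot).toLowerCase() : "";
  if (DOC_EXTENSIONS.has(ext)) return "documentation";
  if (CONFIG_EXTENSIONS.has(ext)) return "config";
  return astSupported ? "code" : "fallback-code";
}
```

The resulting chunkSize and chunkOverlap values map directly onto the two constructor arguments of AstCodeSplitter and LangChainCodeSplitter.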
Memory Management
Tree-sitter uses native C++ bindings that allocate memory outside Node.js's garbage collector. For long-running processes:
const splitter = new AstCodeSplitter();
try {
  // Process many files...
  for (const file of files) {
    const chunks = await splitter.split(file.content, file.language, file.path);
    // ... use chunks
  }
} finally {
  // Free native memory when done
  splitter.dispose();
}
The splitter also automatically cleans up parse trees after each split() call.
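If several call sites need this try/finally pattern, it can be wrapped once. The following is a sketch under assumptions: withSplitter and DisposableSplitter are hypothetical names, not part of the package, and the interface only stands in for any object that exposes dispose(), such as an AstCodeSplitter instance.

```typescript
// Hypothetical wrapper that guarantees dispose() runs, even if the work throws.
// DisposableSplitter is a stand-in for any splitter exposing dispose().
interface DisposableSplitter {
  dispose(): void;
}

async function withSplitter<S extends DisposableSplitter, T>(
  create: () => S,
  work: (splitter: S) => Promise<T>
): Promise<T> {
  const splitter = create();
  try {
    // Await inside the try block so dispose() runs only after the work settles.
    return await work(splitter);
  } finally {
    splitter.dispose();
  }
}
```

A call site would then pass a factory such as () => new AstCodeSplitter(3500, 0) and do all of its split() calls inside the callback.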
Error Handling
Syntax Error Detection
const splitter = new AstCodeSplitter();
// Check for errors before indexing
const { hasErrors, errorCount } = splitter.checkSyntaxErrors(code, 'typescript');
if (hasErrors) {
  console.warn(`File has ${errorCount} syntax errors, skipping...`);
} else {
  const chunks = await splitter.split(code, 'typescript');
}
Automatic Fallback
If AST parsing fails or the language isn't supported, the AST splitter automatically falls back to recursive splitting:
const splitter = new AstCodeSplitter();
// Vue files aren't AST-supported, will use the recursive fallback
const chunks = await splitter.split(vueCode, 'vue', 'App.vue');
// Console: "Language vue not supported by AST, using recursive fallback splitter for: App.vue"
Development
# Install dependencies
bun install
# Build
bun run build
# Type check
bun run typecheck
# Watch mode
bun run dev
# Clean build artifacts
bun run clean
Dependencies
| Package | Purpose |
|---------|---------|
| tree-sitter | AST parsing engine |
| tree-sitter-* | Language grammars (9 languages) |
| vendored recursive splitter | Text splitting utilities |
Use in SHARC
This package is used by:
- @sharc-code/mcp - MCP server for AI assistants
- @sharc/core - Core indexing engine (internal)
For end-to-end semantic code search, see the main SHARC documentation.
License
MIT - See LICENSE for details.
Contributing
All code modifications must be done via Pull Request. See CLAUDE.md for guidelines.
