@sharc-code/splitter

v0.2.8

Published

2 months ago

SHARC Splitter - AST and recursive text chunking for semantic search

yxanul

code-search semantic-search mcp model-context-protocol ai embeddings tree-sitter ast

@sharc-code/splitter

Code chunking library for semantic search, featuring AST-based splitting with context injection and recursive character fallback.

Overview

@sharc-code/splitter provides intelligent code chunking for RAG (Retrieval-Augmented Generation) and semantic code search applications. It extracts meaningful code units while preserving semantic context.

Key Features

AST-Based Splitting: Tree-sitter powered parsing for 9 languages
Context Injection: Automatically adds class/module context to extracted methods
Decorator/Annotation Support: Preserves decorators (@Get(), #[derive], etc.) in context
Recursive Fallback: Character-based splitting for unsupported languages
Syntax Error Detection: Validate code before indexing
Memory Safe: Proper cleanup of native tree-sitter resources

Installation

# npm
npm install @sharc-code/splitter

# bun
bun add @sharc-code/splitter --trust

# pnpm
pnpm add @sharc-code/splitter

Note: This package includes native tree-sitter bindings. Ensure you have a C++ compiler available:

Windows: Visual Studio Build Tools
macOS: Xcode Command Line Tools (xcode-select --install)
Linux: build-essential package

If you install with Bun, verify the native bindings were trusted:

bun pm untrusted

If tree-sitter or any tree-sitter-* package still appears in that list, AST chunking may degrade to the recursive fallback splitter until those dependencies are trusted.

Quick Start

import { AstCodeSplitter, LangChainCodeSplitter } from '@sharc-code/splitter';

// AST-based splitting (recommended for code)
const astSplitter = new AstCodeSplitter(3500, 0);

const chunks = await astSplitter.split(
  `class UserService {
    async authenticate(user: string): Promise<boolean> {
      return this.validateCredentials(user);
    }
  }`,
  'typescript',
  'src/services/user.ts'
);

console.log(chunks[0].content);
// Output:
// // Context: class UserService (services/user.ts)
// async authenticate(user: string): Promise<boolean> {
//   return this.validateCredentials(user);
// }

// Recursive splitting (for docs/config)
const markdownContent = '# Guide\n\nIntroductory paragraph.'; // placeholder input
const langchainSplitter = new LangChainCodeSplitter(1500, 150);
const docChunks = await langchainSplitter.split(markdownContent, 'markdown', 'README.md');

Splitters

AstCodeSplitter

Tree-sitter based splitter that extracts complete semantic units (functions, classes, methods) with automatic context injection.

Supported Languages:

TypeScript / JavaScript
Python
Java
C++ / C
Go
Rust
C#
Scala

Features:

Extracts functions, classes, methods, interfaces as complete units
Injects context comments (e.g., // Context: module auth > class UserService)
Automatic fallback to recursive splitting for unsupported languages
Syntax error detection via checkSyntaxErrors()
Memory-safe with dispose() for cleanup

import { AstCodeSplitter } from '@sharc-code/splitter';

const splitter = new AstCodeSplitter(
  3500,  // chunkSize (max characters per chunk)
  0      // chunkOverlap (0 for AST - semantic units don't need overlap)
);

// Split code
const chunks = await splitter.split(code, 'typescript', 'src/auth.ts');

// Check for syntax errors before indexing
const { hasErrors, errorCount } = splitter.checkSyntaxErrors(code, 'typescript');

// Check if language is supported
if (AstCodeSplitter.isLanguageSupported('rust')) {
  // Use AST splitter
}

// Clean up when done (important for long-running processes)
splitter.dispose();
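
When indexing a whole repository, the choice between the two splitters usually has to be made per file. The sketch below shows one way to do that routing. It is illustrative only: the extension table, the `'text'` default, and the helper names are assumptions rather than part of the package, and `isAstSupported` stands in for `AstCodeSplitter.isLanguageSupported`.

```typescript
// Hypothetical extension-to-language table. Only identifiers that appear in
// this README are listed; the ids the package actually accepts may differ.
const EXTENSION_LANGUAGE: Record<string, string> = {
  ts: 'typescript',
  rs: 'rust',
  md: 'markdown',
  vue: 'vue',
};

type SplitterKind = 'ast' | 'recursive';

// Derive a language id from a file path, or 'text' when the extension is unknown.
function languageFromPath(filePath: string): string {
  const fileName = filePath.split('/').pop() ?? '';
  const dot = fileName.lastIndexOf('.');
  // dot > 0 skips dotfiles such as ".gitignore" and names without an extension.
  const ext = dot > 0 ? fileName.slice(dot + 1).toLowerCase() : '';
  return EXTENSION_LANGUAGE[ext] ?? 'text';
}

// Pick a splitter kind. In real code, pass AstCodeSplitter.isLanguageSupported
// as the predicate; it is injected here so the sketch runs standalone.
function chooseSplitter(
  language: string,
  isAstSupported: (language: string) => boolean,
): SplitterKind {
  return isAstSupported(language) ? 'ast' : 'recursive';
}
```

Injecting the predicate keeps the routing logic testable without loading the native bindings.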

LangChainCodeSplitter

Character-based splitter using SHARC's vendored recursive character splitter. Best for documentation, configuration files, and languages without AST support.

This class name is kept for backward compatibility even though SHARC no longer depends on LangChain internally.

Supported Languages:

JavaScript/TypeScript (as js)
Python, Java, C++, Go, Rust, PHP, Ruby, Swift, Scala
Markdown, HTML, LaTeX
Solidity

import { LangChainCodeSplitter } from '@sharc-code/splitter';

const splitter = new LangChainCodeSplitter(
  1500,  // chunkSize
  150    // chunkOverlap (overlap preserves context across chunks)
);

const chunks = await splitter.split(content, 'markdown', 'docs/README.md');
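
To make the effect of chunkOverlap concrete, here is a deliberately simplified sketch: fixed-size windows in which each window repeats the last few characters of the previous one. This is not the vendored recursive splitter, which also splits on separators; `chunkWithOverlap` is a hypothetical name used only for illustration.

```typescript
// Slide a window of chunkSize characters over the text, advancing by
// (chunkSize - chunkOverlap) so that consecutive chunks share chunkOverlap
// characters. Simplified illustration, not the package's algorithm.
function chunkWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0) {
    throw new RangeError('chunkSize must be positive');
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be in [0, chunkSize)');
  }

  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With `chunkSize = 4` and `chunkOverlap = 1`, the string `abcdefghij` yields `abcd`, `defg`, `ghij`: the last character of each chunk reappears at the start of the next.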

API Reference

CodeChunk

interface CodeChunk {
  content: string;
  metadata: {
    startLine: number;
    endLine: number;
    language?: string;
    filePath?: string;
    chunkKind?: 'wrapper' | 'member' | 'container' | 'data_container' | 'standalone';
    documentClass?: 'implementation' | 'test' | 'docs_examples';
  };
}
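
As a rough illustration of how the metadata fields might be consumed downstream, the sketch below derives an identifier from the path and line range and filters chunks by documentClass. The interface is a local mirror of the one above so the example compiles on its own; `chunkId` and `implementationOnly` are hypothetical helpers, not exports of the package.

```typescript
// Local mirror of the documented CodeChunk shape.
interface CodeChunk {
  content: string;
  metadata: {
    startLine: number;
    endLine: number;
    language?: string;
    filePath?: string;
    chunkKind?: 'wrapper' | 'member' | 'container' | 'data_container' | 'standalone';
    documentClass?: 'implementation' | 'test' | 'docs_examples';
  };
}

// Hypothetical helper: an identifier built from file path and line range,
// e.g. for use as a key in a vector store.
function chunkId(chunk: CodeChunk): string {
  const { filePath = '<unknown>', startLine, endLine } = chunk.metadata;
  return `${filePath}:${startLine}-${endLine}`;
}

// Hypothetical helper: keep chunks classified as implementation code.
// Chunks without a documentClass are kept, because the field is optional.
function implementationOnly(chunks: CodeChunk[]): CodeChunk[] {
  return chunks.filter((chunk) => {
    const cls = chunk.metadata.documentClass;
    return cls === undefined || cls === 'implementation';
  });
}
```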

Splitter Interface

interface Splitter {
  split(code: string, language: string, filePath?: string): Promise<CodeChunk[]>;
  setChunkSize(chunkSize: number): void;
  setChunkOverlap(chunkOverlap: number): void;
}
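
Because both splitters share this interface, a custom implementation can be swapped in wherever a Splitter is expected. The following is a minimal sketch of such an implementation that packs whole lines into chunks. `LineSplitter` and `groupLines` are hypothetical, the interfaces are local mirrors of the documented ones, and the 1-based line numbering is an assumption about how startLine and endLine are counted.

```typescript
// Local mirrors of the documented interfaces so the sketch compiles standalone.
interface CodeChunk {
  content: string;
  metadata: { startLine: number; endLine: number; language?: string; filePath?: string };
}

interface Splitter {
  split(code: string, language: string, filePath?: string): Promise<CodeChunk[]>;
  setChunkSize(chunkSize: number): void;
  setChunkOverlap(chunkOverlap: number): void;
}

interface LineGroup { content: string; startLine: number; endLine: number }

// Greedily pack whole lines into groups of at most maxChars characters.
// A single line longer than maxChars becomes its own oversized group.
function groupLines(code: string, maxChars: number): LineGroup[] {
  if (code.length === 0) return [];
  const lines = code.split('\n');
  const groups: LineGroup[] = [];
  let buffer: string[] = [];
  let bufferLength = 0;
  let startLine = 1; // 1-based line numbers (an assumption of this sketch)

  lines.forEach((line, index) => {
    // +1 accounts for the newline that joins this line to the buffer.
    const projected = buffer.length === 0 ? line.length : bufferLength + 1 + line.length;
    if (buffer.length > 0 && projected > maxChars) {
      // Flush lines startLine..index, then start a new group at this line.
      groups.push({ content: buffer.join('\n'), startLine, endLine: index });
      buffer = [line];
      bufferLength = line.length;
      startLine = index + 1;
    } else {
      buffer.push(line);
      bufferLength = projected;
    }
  });

  groups.push({ content: buffer.join('\n'), startLine, endLine: lines.length });
  return groups;
}

// Hypothetical splitter built on groupLines. chunkOverlap is stored but
// ignored, since whole-line groups do not overlap in this sketch.
class LineSplitter implements Splitter {
  chunkSize: number;
  chunkOverlap: number;

  constructor(chunkSize = 1500, chunkOverlap = 0) {
    this.chunkSize = chunkSize;
    this.chunkOverlap = chunkOverlap;
  }

  async split(code: string, language: string, filePath?: string): Promise<CodeChunk[]> {
    return groupLines(code, this.chunkSize).map((group) => ({
      content: group.content,
      metadata: { startLine: group.startLine, endLine: group.endLine, language, filePath },
    }));
  }

  setChunkSize(chunkSize: number): void { this.chunkSize = chunkSize; }
  setChunkOverlap(chunkOverlap: number): void { this.chunkOverlap = chunkOverlap; }
}
```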

AstCodeSplitter

class AstCodeSplitter implements Splitter {
  constructor(chunkSize?: number, chunkOverlap?: number);

  // Split code into chunks
  split(code: string, language: string, filePath?: string): Promise<CodeChunk[]>;

  // Check for syntax errors
  checkSyntaxErrors(code: string, language: string): { hasErrors: boolean; errorCount: number };

  // Configuration
  setChunkSize(chunkSize: number): void;
  setChunkOverlap(chunkOverlap: number): void;

  // Cleanup native resources
  dispose(): void;

  // Static utilities
  static getSupportedLanguages(): string[];
  static isLanguageSupported(language: string): boolean;
}

LangChainCodeSplitter

class LangChainCodeSplitter implements Splitter {
  constructor(chunkSize?: number, chunkOverlap?: number);

  split(code: string, language: string, filePath?: string): Promise<CodeChunk[]>;
  setChunkSize(chunkSize: number): void;
  setChunkOverlap(chunkOverlap: number): void;
}

Context Injection

The AST splitter automatically adds context comments to extracted code chunks, improving search relevance:

TypeScript/JavaScript/Java/C++/Go/Rust/C#/Scala

// Context: module auth > class UserService (services/user.ts)
async authenticate(user: string): Promise<boolean> {
  return this.validateCredentials(user);
}

Python

# Context: class UserService (services/user.py)
def authenticate(self, user: str) -> bool:
    return self.validate_credentials(user)
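
Consumers sometimes need to separate the injected context line from the code itself, for example to display the raw snippet in search results. The sketch below parses the header format shown in the examples above. It assumes the header is always the first line, uses `//` or `#` as the comment prefix, separates containers with ` > `, and optionally ends with a path in parentheses; `parseContextHeader` is a hypothetical helper, not a package export.

```typescript
interface ParsedContext {
  containers: string[];         // e.g. ['class UserService']
  filePath: string | undefined; // e.g. 'services/user.py', if present
  body: string;                 // chunk content without the header line
}

// Parse a leading "// Context: ..." or "# Context: ..." line.
// Returns null when the chunk does not start with such a header.
function parseContextHeader(content: string): ParsedContext | null {
  const newline = content.indexOf('\n');
  const firstLine = newline === -1 ? content : content.slice(0, newline);

  // Group 1: container path; group 2: optional file path in parentheses.
  const match = /^(?:\/\/|#) Context: (.+?)(?: \(([^()]+)\))?$/.exec(firstLine);
  if (match === null) return null;

  const [, containerPath = '', filePath] = match;
  return {
    containers: containerPath.split(' > ').map((part) => part.trim()),
    filePath,
    body: newline === -1 ? '' : content.slice(newline + 1),
  };
}
```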

Context Hierarchy

The splitter tracks nested containers and builds a context path:

// Input: deeply nested method
namespace App {
  module Auth {
    class UserService {
      authenticate() { ... }
    }
  }
}

// Output chunk:
// Context: namespace App > module Auth > class UserService (auth/user.ts)
authenticate() { ... }
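
The idea of tracking nested containers can be sketched with a small recursive walk over a hand-written outline, followed by formatting of the header line. This is only an approximation of the behaviour described above, not the package's implementation: `OutlineNode`, `leafContexts`, `shortPath`, and `contextHeader` are hypothetical, the `'python'` language id is an assumption, and trimming the path to its last two segments is inferred from the examples in this README.

```typescript
// Simplified stand-in for a syntax tree: containers have children, leaves do not.
interface OutlineNode { label: string; children?: OutlineNode[] }

interface LeafContext { leaf: string; containers: string[] }

// Walk the outline and record, for each leaf, the labels of its ancestors
// from outermost to innermost.
function leafContexts(node: OutlineNode, ancestors: string[] = []): LeafContext[] {
  if (!node.children || node.children.length === 0) {
    return [{ leaf: node.label, containers: ancestors }];
  }
  const path = [...ancestors, node.label];
  const results: LeafContext[] = [];
  for (const child of node.children) {
    results.push(...leafContexts(child, path));
  }
  return results;
}

// Keep only the last `segments` parts of a path (assumed from the examples).
function shortPath(filePath: string, segments = 2): string {
  return filePath.split('/').filter((part) => part.length > 0).slice(-segments).join('/');
}

// Format the header with the comment prefix of the target language.
function contextHeader(containers: string[], filePath: string, language: string): string {
  const prefix = language === 'python' ? '#' : '//';
  return `${prefix} Context: ${containers.join(' > ')} (${shortPath(filePath)})`;
}
```

Passing the ancestor list down the recursion, rather than mutating a shared stack, keeps each leaf's path independent of its siblings.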

Recommended Chunk Sizes

| Content Type | Chunk Size (characters) | Overlap (characters) | Splitter | Rationale |
|--------------|-------------------------|----------------------|----------|-----------|
| Code (AST) | 3500 | 0 | AST | Complete semantic units, no overlap needed |
| Documentation | 1500 | 150 | Recursive | Prose flows between sections |
| Config/Data | 1500 | 100 | Recursive | Related keys grouped together |
| Fallback Code | 1500 | 100 | Recursive | Conservative chunking |
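
The recommendations above can be captured as a small lookup table so that chunk settings live in one place. The values are taken from the table; the `ContentType` keys, the `ChunkSettings` shape, and `settingsFor` are hypothetical names chosen for this sketch.

```typescript
type ContentType = 'code' | 'documentation' | 'config' | 'fallback-code';

interface ChunkSettings {
  chunkSize: number;    // max characters per chunk
  chunkOverlap: number; // characters shared between consecutive chunks
  splitter: 'ast' | 'recursive';
}

// Values copied from the recommended chunk sizes table.
const RECOMMENDED: Record<ContentType, ChunkSettings> = {
  'code': { chunkSize: 3500, chunkOverlap: 0, splitter: 'ast' },
  'documentation': { chunkSize: 1500, chunkOverlap: 150, splitter: 'recursive' },
  'config': { chunkSize: 1500, chunkOverlap: 100, splitter: 'recursive' },
  'fallback-code': { chunkSize: 1500, chunkOverlap: 100, splitter: 'recursive' },
};

// Return a copy so callers cannot mutate the shared defaults.
function settingsFor(type: ContentType): ChunkSettings {
  return { ...RECOMMENDED[type] };
}
```

The returned values map directly onto the constructor arguments, e.g. `new AstCodeSplitter(settings.chunkSize, settings.chunkOverlap)`.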

Memory Management

Tree-sitter uses native C++ bindings that allocate memory outside the reach of Node.js's garbage collector. In long-running processes, call dispose() in a finally block once processing is complete:

const splitter = new AstCodeSplitter();

try {
  // Process many files...
  for (const file of files) {
    const chunks = await splitter.split(file.content, file.language, file.path);
    // ... use chunks
  }
} finally {
  // Free native memory when done
  splitter.dispose();
}

The splitter also automatically cleans up parse trees after each split() call.
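
The try/finally pattern can be wrapped in a small helper so that call sites cannot forget the cleanup step. This is a sketch under the assumption that the resource exposes a synchronous `dispose()` as documented above; `NativeResource` and `withResource` are hypothetical names, not package exports.

```typescript
// Anything with a dispose() method, such as an AstCodeSplitter instance.
interface NativeResource {
  dispose(): void;
}

// Run async work with a resource and always dispose of it afterwards,
// whether the work resolves or rejects.
async function withResource<T extends NativeResource, R>(
  resource: T,
  work: (resource: T) => Promise<R>,
): Promise<R> {
  try {
    return await work(resource);
  } finally {
    resource.dispose();
  }
}
```

A call might then look like `await withResource(new AstCodeSplitter(), (s) => s.split(code, 'typescript'))`, with `dispose()` running even if `split` throws.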

Error Handling

Syntax Error Detection

const splitter = new AstCodeSplitter();

// Check for errors before indexing
const { hasErrors, errorCount } = splitter.checkSyntaxErrors(code, 'typescript');

if (hasErrors) {
  console.warn(`File has ${errorCount} syntax errors, skipping...`);
} else {
  const chunks = await splitter.split(code, 'typescript');
}
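
For batch indexing, the same check can be used to sort files into those that parse cleanly and those that do not, so broken files can be reported together. The sketch below assumes only the `checkSyntaxErrors` signature from the API reference; `SourceInput`, `SyntaxReport`, and `partitionBySyntax` are hypothetical names.

```typescript
interface SyntaxCheck { hasErrors: boolean; errorCount: number }

// Structural type satisfied by anything with the documented method.
interface SyntaxChecker {
  checkSyntaxErrors(code: string, language: string): SyntaxCheck;
}

interface SourceInput { path: string; language: string; content: string }

interface SyntaxReport {
  clean: SourceInput[];
  broken: Array<{ file: SourceInput; errorCount: number }>;
}

// Split a batch of files into clean and broken according to the checker.
function partitionBySyntax(files: SourceInput[], checker: SyntaxChecker): SyntaxReport {
  const report: SyntaxReport = { clean: [], broken: [] };
  for (const file of files) {
    const { hasErrors, errorCount } = checker.checkSyntaxErrors(file.content, file.language);
    if (hasErrors) {
      report.broken.push({ file, errorCount });
    } else {
      report.clean.push(file);
    }
  }
  return report;
}
```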

Automatic Fallback

If AST parsing fails or the language isn't supported, the AST splitter automatically falls back to recursive splitting:

const splitter = new AstCodeSplitter();

// Vue files aren't AST-supported, will use the recursive fallback
const chunks = await splitter.split(vueCode, 'vue', 'App.vue');
// Console: "Language vue not supported by AST, using recursive fallback splitter for: App.vue"
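
The same fallback idea can also be applied one level up, for instance to guard against unexpected exceptions thrown by a primary splitter. The sketch below illustrates the general pattern only and says nothing about how the package implements its internal fallback; `ChunkSource` and `splitWithFallback` are hypothetical names.

```typescript
interface Chunk { content: string }

// Structural type matching the split() method of the Splitter interface.
interface ChunkSource {
  split(code: string, language: string, filePath?: string): Promise<Chunk[]>;
}

interface FallbackResult { chunks: Chunk[]; used: 'primary' | 'fallback' }

// Try the primary splitter first; if it throws, use the fallback instead.
// Errors from the fallback itself are not caught and propagate to the caller.
async function splitWithFallback(
  primary: ChunkSource,
  fallback: ChunkSource,
  code: string,
  language: string,
  filePath?: string,
): Promise<FallbackResult> {
  try {
    const chunks = await primary.split(code, language, filePath);
    return { chunks, used: 'primary' };
  } catch {
    const chunks = await fallback.split(code, language, filePath);
    return { chunks, used: 'fallback' };
  }
}
```

Returning which path was taken makes it easy to log or count how often the fallback is used.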

Development

# Install dependencies
bun install

# Build
bun run build

# Type check
bun run typecheck

# Watch mode
bun run dev

# Clean build artifacts
bun run clean

Dependencies

| Package | Purpose |
|---------|---------|
| tree-sitter | AST parsing engine |
| tree-sitter-* | Language grammars (9 languages) |
| vendored recursive splitter | Text splitting utilities |

Use in SHARC

This package is used by:

@sharc-code/mcp - MCP server for AI assistants
@sharc/core - Core indexing engine (internal)

For end-to-end semantic code search, see the main SHARC documentation.

License

MIT - See LICENSE for details.

Contributing

All code changes must be submitted through a pull request. See CLAUDE.md for guidelines.
