blingfire
v0.1.1
Node.js bindings for BlingFire, a tokenization library from Microsoft.
BlingFire provides text tokenization and sentence splitting with high performance and no external dependencies.
Features
- Native C++ implementation with minimal overhead
- Production-tested tokenization and sentence splitting
- Full TypeScript support with type definitions
- No runtime dependencies beyond Node.js
- Includes precompiled binaries for Linux (arm64, amd64) and macOS (arm64, amd64)
Installation
npm install blingfire
Platform Support
This package includes precompiled binaries for:
- Linux: arm64, amd64
- macOS (darwin): arm64, amd64
For these platforms, no build tools are required. For other platforms, you will need to build from source.
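To tell in advance whether an install will need build tools, the platform list above can be checked against what Node.js reports. The sketch below is illustrative only: `hasPrebuiltBinary` and `PREBUILT_TARGETS` are hypothetical names, not part of blingfire, and the list is copied from this README rather than read from the package.

```typescript
// Platform/architecture pairs with precompiled binaries, per the list above.
// Node.js reports amd64 as "x64" in process.arch and macOS as "darwin"
// in process.platform.
const PREBUILT_TARGETS: ReadonlyArray<string> = [
  "linux-arm64",
  "linux-x64",
  "darwin-arm64",
  "darwin-x64",
];

// Hypothetical helper (not part of blingfire): true when the given
// platform/arch pair appears in the precompiled list.
function hasPrebuiltBinary(platform: string, arch: string): boolean {
  return PREBUILT_TARGETS.indexOf(`${platform}-${arch}`) !== -1;
}

// Typical call site in Node.js:
//   hasPrebuiltBinary(process.platform, process.arch)
```

If the check returns false, expect the install to fall back to a source build.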
Building from Source (Other Platforms)
If precompiled binaries are not available for your platform, you will need:
- Node.js 16.x or higher
- Python (for node-gyp)
- C++ compiler:
  - macOS: Xcode Command Line Tools (xcode-select --install)
  - Linux: GCC or Clang (apt-get install build-essential)
  - Windows: Visual Studio Build Tools
During source builds, the package will:
- Copy BlingFire source files to the vendor directory
- Compile the native C++ addon
- Build TypeScript files
Usage
Basic Example
import { textToSentences, textToWords } from 'blingfire';
// Sentence splitting
const text = 'Hello world. This is a test. How are you?';
const sentences = textToSentences(text);
console.log(sentences);
// Output: "Hello world.\nThis is a test.\nHow are you?"
// Get sentences as array
const sentenceArray = sentences.split('\n');
console.log(sentenceArray);
// Output: ["Hello world.", "This is a test.", "How are you?"]
// Word tokenization
const words = textToWords(text);
console.log(words);
// Output: "Hello world . This is a test . How are you ?"
// Get words as array
const wordArray = words.split(' ');
console.log(wordArray);
// Output: ["Hello", "world", ".", "This", "is", "a", "test", ".", "How", "are", "you", "?"]
CommonJS
const { textToSentences, textToWords } = require('blingfire');
const text = 'Dr. Smith went to the U.S. He was happy.';
const sentences = textToSentences(text);
console.log(sentences.split('\n'));
Advanced Examples
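Both functions return a single delimited string, so most callers convert it to an array first. Splitting directly on '\n' or ' ' can leave empty entries if the output happens to contain leading, trailing, or repeated delimiters. Whether BlingFire ever emits those is not stated in this README, so the helpers below are a defensive sketch. `sentencesToArray` and `wordsToArray` are hypothetical names, not part of blingfire.

```typescript
// Hypothetical helpers (not part of blingfire) that turn the
// delimiter-separated output strings into clean arrays.

// Splits textToSentences() output on newlines, trimming each entry
// and dropping empty ones.
function sentencesToArray(output: string): string[] {
  return output
    .split("\n")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Splits textToWords() output on runs of whitespace and drops empty tokens.
function wordsToArray(output: string): string[] {
  return output.split(/\s+/).filter((t) => t.length > 0);
}
```

Usage would look like `sentencesToArray(textToSentences(text))` and `wordsToArray(textToWords(text))`.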
Processing Documents
import { textToSentences, textToWords } from 'blingfire';
// Process a document
const document = `
Natural language processing (NLP) is a subfield of AI.
It focuses on the interaction between computers and humans.
Dr. Jones believes it's revolutionary!
`;
// Split into sentences
const sentences = textToSentences(document).split('\n');
// Tokenize each sentence
sentences.forEach((sentence, i) => {
  const words = textToWords(sentence).split(' ');
  console.log(`Sentence ${i + 1} has ${words.length} tokens`);
});
Unicode and Multilingual Text
import { textToSentences, textToWords } from 'blingfire';
// Works with Unicode
const multiLang = 'Hello world. 你好世界。Привет мир.';
const sentences = textToSentences(multiLang);
console.log(sentences.split('\n'));
// Handles emojis and special characters
const withEmoji = 'I love coding! 🚀 It makes me happy. 😊';
const words = textToWords(withEmoji);
console.log(words);
API Reference
textToSentences(text: string): string
Splits text into sentences using BlingFire's sentence boundary detection.
Parameters:
- text (string): The input text to split into sentences
Returns:
- (string): Sentences separated by newline characters (\n)
Throws:
- TypeError: If input is not a string
- Error: If BlingFire processing fails
Example:
const result = textToSentences('First. Second. Third.');
// Returns: "First.\nSecond.\nThird."
textToWords(text: string): string
Tokenizes text into words using BlingFire's word boundary detection.
Parameters:
- text (string): The input text to tokenize
Returns:
- (string): Words separated by space characters
Throws:
- TypeError: If input is not a string
- Error: If BlingFire processing fails
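Both functions document the same two error cases, so callers handling untrusted input may want to tell them apart. The following is a minimal sketch under that assumption. `tryTokenize`, `Tokenizer`, and `TokenizeResult` are hypothetical names, not part of blingfire, and the tokenizer is passed in as a parameter so the wrapper works with either function.

```typescript
type Tokenizer = (text: string) => string;

type TokenizeResult =
  | { ok: true; value: string }
  | { ok: false; reason: "invalid-input" | "processing-failed"; message: string };

// Hypothetical wrapper (not part of blingfire): calls a tokenizer such as
// textToSentences or textToWords and maps the documented errors to a result.
function tryTokenize(tokenize: Tokenizer, input: unknown): TokenizeResult {
  try {
    // The cast lets non-string input reach the tokenizer, so the
    // tokenizer's own TypeError check is what reports the problem.
    return { ok: true, value: tokenize(input as string) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // TypeError is checked first because it is a subclass of Error.
    const reason = err instanceof TypeError ? "invalid-input" : "processing-failed";
    return { ok: false, reason, message };
  }
}
```

A call such as `tryTokenize(textToWords, userInput)` then returns a tagged result instead of throwing.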
Example:
const result = textToWords('Hello, world!');
// Returns: "Hello , world !"
How It Works
This package wraps the BlingFire C++ library using Node.js N-API. When installation builds from source:
- Vendor Script: Copies BlingFire source files from a git submodule to a vendor/ directory
- Native Compilation: Compiles both BlingFire and the N-API binding using node-gyp
- TypeScript Build: Compiles the TypeScript wrapper to JavaScript
The package includes all BlingFire source code, so no external dependencies are required at runtime.
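To illustrate the layering described above, here is a sketch of what a thin TypeScript wrapper over a native binding could look like. It is not the package's actual source: `NativeBinding` and `createWrapper` are hypothetical, and the shape of the compiled addon's exports is an assumption.

```typescript
// Shape assumed for the compiled addon; the real binding's exports may differ.
interface NativeBinding {
  textToSentences(text: string): string;
  textToWords(text: string): string;
}

// Hypothetical factory: wraps a native binding with input validation so that
// non-string input fails with a TypeError before reaching the C++ side.
function createWrapper(native: NativeBinding): NativeBinding {
  const guard = (fn: (text: string) => string) => (text: string): string => {
    if (typeof text !== "string") {
      throw new TypeError("Input must be a string");
    }
    return fn(text);
  };
  return {
    textToSentences: guard((t) => native.textToSentences(t)),
    textToWords: guard((t) => native.textToWords(t)),
  };
}
```

Validating in the wrapper keeps the error behavior in one place, whatever the native layer does with unexpected input.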
Development
Setup
# Clone with submodules
git clone --recursive <repository-url>
cd blingfire
# Install dependencies
npm install
# Run tests
npm test
# Build (TypeScript + Native)
npm run build:typescript
npm run build:native
Testing
# Run tests once
npm test
# Watch mode
npm run test:watch
Building from Source
# Clean build artifacts
npm run clean
# Rebuild everything
npm install
npm run build:typescript
npm run build:native
License
The contents of this repository are licensed under the MIT License.
See the LICENSE file for more information.
Credits
- BlingFire by Microsoft
- This package wraps the C++ library with Node.js N-API bindings
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
