rawtype
v1.0.3
Fast heuristic detection of text vs binary data using byte-level analysis.
Rawtype is a lightweight, zero-dependency library for Node.js and browsers that performs a fundamental job: distinguishing binary data from text by analyzing raw bytes. It uses pragmatic heuristics—like scanning for null bytes and checking for common binary headers—to make a fast, informed guess. This is a heuristic, not a guaranteed fact, so results include a confidence score to help you assess reliability.
Quick Start
```sh
npm install rawtype
```

```js
import fs from 'fs';
import { detect, isText } from 'rawtype';

// Detect with detailed results
const result = detect(fs.readFileSync('file.dat'));
console.log(result);
// { kind: 'binary', confidence: 0.99, sampleSize: 4096 }

// Simple boolean check (userInput stands for any supported input, e.g. a string or Buffer)
if (isText(userInput)) {
  console.log('Safe to process as a string.');
}
```

Why Use Rawtype?
It solves specific, common problems where simpler checks fail:
| Problem | Typical Solution | Why It Fails | Rawtype's Approach |
| :--- | :--- | :--- | :--- |
| Null bytes in text | `isUtf8()` from `node:buffer` | Fails on UTF-16 or corrupted text. | Weights null bytes heavily, but allows for edge cases. |
| Unknown file type | MIME type from extension | Easily spoofed; unreliable for raw data. | Scans initial bytes for known binary signatures (magic numbers). |
| Large files | Read entire file | Memory-intensive and slow. | Samples only the first 4 KB (4096 bytes) by default (configurable). |
| Need for certainty | True/False guess | Lacks nuance for ambiguous data. | Provides a confidence score (0.0-1.0) with each result. |
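To make the signature-scanning and sampling rows concrete, here is a minimal sketch of the general technique. It is not rawtype's internal code: `SIGNATURES`, `takeSample`, and `matchSignature` are hypothetical names, and the table lists only a few well-known file signatures.

```typescript
// Illustrative only: a tiny signature table, not rawtype's actual list.
const SIGNATURES: ReadonlyArray<{ label: string; bytes: number[] }> = [
  { label: 'png', bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { label: 'pdf', bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { label: 'zip', bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04"
  { label: 'gif', bytes: [0x47, 0x49, 0x46, 0x38] }, // "GIF8"
  { label: 'jpeg', bytes: [0xff, 0xd8, 0xff] },
];

// Return a view of at most `maxSample` leading bytes without copying.
function takeSample(data: Uint8Array, maxSample = 4096): Uint8Array {
  return data.subarray(0, maxSample);
}

// Return the label of the first signature that prefixes the sample, or null.
function matchSignature(sample: Uint8Array): string | null {
  for (const { label, bytes } of SIGNATURES) {
    if (sample.length >= bytes.length && bytes.every((b, i) => sample[i] === b)) {
      return label;
    }
  }
  return null;
}
```

Because `subarray` returns a view rather than a copy, sampling stays cheap even when the input buffer is large.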
API Reference
Core Function: detect(input, options?)
The primary function analyzes input and returns a result object.
Signature:

```ts
function detect(
  input: string | Buffer | Uint8Array | ArrayBuffer | number[],
  options?: {
    maxSample?: number;      // Bytes to sample (default: 4096)
    nullByteWeight?: number; // Penalty for null bytes (default: 50)
    textThreshold?: number;  // Confidence needed for "text" (default: 0.85)
  }
): DetectionResult
```

Result Object (DetectionResult):

```ts
{
  kind: 'text' | 'binary'; // The best-guess classification
  confidence: number;      // Certainty of the guess (0.0 to 1.0)
  sampleSize: number;      // How many bytes were actually analyzed
}
```

Understanding Confidence: The confidence score represents the algorithm's certainty in its kind classification. A high score (e.g., 0.98) indicates strong evidence, while a score near your textThreshold (default 0.85) suggests the data was ambiguous.
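One way to act on the confidence score is to send near-threshold results to a fallback path. The sketch below is an assumption about how a caller might do this, not part of rawtype: `triage` and its `margin` parameter are hypothetical, and the `DetectionResult` interface simply mirrors the shape documented above.

```typescript
// Mirrors the documented result shape; declared locally so the sketch is self-contained.
interface DetectionResult {
  kind: 'text' | 'binary';
  confidence: number;
  sampleSize: number;
}

type Triage = 'text' | 'binary' | 'needs-review';

// Treat results whose confidence lies within `margin` of the threshold as ambiguous.
// `margin` is a hypothetical tuning knob chosen by the caller.
function triage(result: DetectionResult, textThreshold = 0.85, margin = 0.05): Triage {
  if (result.sampleSize === 0) return 'needs-review'; // nothing was analyzed
  if (Math.abs(result.confidence - textThreshold) <= margin) return 'needs-review';
  return result.kind;
}
```

With the defaults, a result such as `{ kind: 'text', confidence: 0.86, sampleSize: 100 }` is flagged for review, while the `0.99` binary result from the Quick Start passes through unchanged.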
Helper Functions
- `isText(input, options?): boolean`: Returns `true` if `detect()` returns `kind: 'text'` with confidence >= threshold.
- `isBinary(input, options?): boolean`: Inverse of `isText`.
File Scanner (scanFile, scanDirectory)
Note: These are Node.js-only utilities.
```js
import { scanFile, scanDirectory } from 'rawtype/file-scanner';

// Scan a single file
const fileResult = await scanFile('./data.bin');

// Recursively scan a directory
const dirResult = await scanDirectory('./user-uploads', {
  recursive: true,
  extensions: ['.dat', '.txt'],
  exclude: ['**/node_modules/**']
});
```

The scanDirectory function returns a DirectoryScanResult containing a summary and an array of FileScanResult objects, one per file.
Integration Examples
1. HTTP API (Edge/Browser)
Use the built-in handler for HTTP detection endpoints.
```js
// Example for Vercel/Cloudflare
import { rawtypeHandler } from 'rawtype/api-handler';

export default rawtypeHandler.fetch;
```

Endpoint: POST /detect
Accepts: text/plain, application/json, application/octet-stream
Returns: JSON with the DetectionResult.
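A client for this endpoint could look like the sketch below. It relies only on the standard Fetch API; `buildDetectRequest`, `detectRemotely`, and the base URL are hypothetical, and the response is assumed to match the documented DetectionResult shape.

```typescript
// Mirrors the documented result shape; declared locally so the sketch is self-contained.
interface DetectionResult {
  kind: 'text' | 'binary';
  confidence: number;
  sampleSize: number;
}

// Build a POST /detect request. Strings are sent as text/plain and raw
// bytes as application/octet-stream, two of the accepted content types.
// Note: '/detect' is resolved against the origin of `baseUrl`.
function buildDetectRequest(baseUrl: string, payload: ArrayBuffer | string): Request {
  const contentType = typeof payload === 'string' ? 'text/plain' : 'application/octet-stream';
  return new Request(new URL('/detect', baseUrl).toString(), {
    method: 'POST',
    headers: { 'Content-Type': contentType },
    body: payload,
  });
}

// Send the request and parse the JSON body, failing loudly on HTTP errors.
async function detectRemotely(baseUrl: string, payload: ArrayBuffer | string): Promise<DetectionResult> {
  const response = await fetch(buildDetectRequest(baseUrl, payload));
  if (!response.ok) throw new Error(`detect failed: HTTP ${response.status}`);
  return (await response.json()) as DetectionResult;
}
```

Separating request construction from sending keeps the content-type logic easy to check without a network call.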
2. Stream Processing
Process data from streams without buffering everything.
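As a rough illustration of what bounded sampling from a stream involves, the sketch below reads chunks from a web `ReadableStream` until a byte budget is met and then cancels the rest. It is not the implementation of rawtype's `sampleStream`; `takeFirstBytes` is a hypothetical name, and the sketch handles web streams only (not Node.js `Readable` streams).

```typescript
// Collect at most `maxBytes` from the start of a web ReadableStream.
async function takeFirstBytes(
  stream: ReadableStream<Uint8Array>,
  maxBytes: number
): Promise<Uint8Array> {
  const reader = stream.getReader();
  const out = new Uint8Array(maxBytes); // fine for small budgets such as 4-8 KB
  let filled = 0;
  try {
    while (filled < maxBytes) {
      const { done, value } = await reader.read();
      if (done || !value) break;
      // Copy only as much of this chunk as still fits in the budget.
      const slice = value.subarray(0, maxBytes - filled);
      out.set(slice, filled);
      filled += slice.length;
    }
  } finally {
    // Cancel so the source can stop producing data that will never be read.
    await reader.cancel();
  }
  return out.subarray(0, filled);
}
```

When sampling a `fetch` response, keep in mind that `response.body` can be `null` (for example, for responses without a body), so a null check is needed before handing it to any sampler.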
```js
import { sampleStream } from 'rawtype/stream-helper';
import { detect } from 'rawtype';

// Sample from a fetch response stream
const response = await fetch('https://example.com/data');
const sample = await sampleStream(response.body, 8192);
const result = detect(sample);
```

3. Binary Data Parsing Pipeline
You can integrate rawtype as a first step before detailed parsing.
```ts
import { createReadStream } from 'fs';
import { detect } from 'rawtype';
import { sampleStream } from 'rawtype/stream-helper';
import { Parser } from 'my-binary-parser'; // Your parser

async function processFile(path: string) {
  const stream = createReadStream(path);
  const sample = await sampleStream(stream, 4096);
  const { kind, confidence } = detect(sample);

  if (kind === 'binary' && confidence > 0.9) {
    const parser = new Parser();
    // ... safe to proceed with binary parsing
  } else {
    // Process as text or log ambiguity
  }
}
```

License and Compliance
Important: Rawtype is licensed under the GNU General Public License v3.0 (GPLv3).
What This Means for You:
- You can use, modify, and distribute rawtype freely.
- If you distribute a modified version of rawtype, you must license it under the GPLv3 and make the corresponding source code available to everyone who receives it.
- If you distribute a larger application that includes rawtype as part of it, the entire combined work may need to be licensed under GPLv3. This is a requirement of the license's "copyleft" provision.
- The software is provided without any warranty.
Considerations:
- For internal use where software is not distributed, GPLv3 requirements generally do not apply.
- If you intend to use rawtype in a proprietary, closed-source product, the GPLv3 license may not be compatible with your goals. You may need to seek a different license from the copyright holder or choose an alternative library with a more permissive license (like MIT or Apache 2.0).
For full legal details, please read the GPLv3 license text and consult with a legal professional if you have specific questions.
Design Philosophy
Rawtype adheres to a strict, minimalistic design:
- Zero Dependencies: To keep it lightweight and secure.
- Single Responsibility: It detects binary vs. text—it does not decode encodings, validate MIME types, or parse formats.
- Practical Heuristics: Uses byte inspection, null-byte detection, and magic number checks for a balance of speed and accuracy.
- Honest Results: Provides a confidence score to communicate uncertainty, not just a binary guess.
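
To show how byte inspection and null-byte weighting can combine into a confidence score, here is a deliberately small toy scorer. It is not rawtype's algorithm: `guessKind`, `NULL_WEIGHT`, and the scoring formula are assumptions made for illustration, with the weight of 50 borrowed from the documented `nullByteWeight` default.

```typescript
// Mirrors the documented result shape; declared locally so the sketch is self-contained.
interface Guess {
  kind: 'text' | 'binary';
  confidence: number;
  sampleSize: number;
}

const NULL_WEIGHT = 50; // each null byte counts as 50 suspicious bytes (assumed formula)

function guessKind(data: Uint8Array, maxSample = 4096): Guess {
  const sample = data.subarray(0, maxSample);
  if (sample.length === 0) return { kind: 'text', confidence: 0, sampleSize: 0 };

  let nulls = 0;
  let controls = 0;
  sample.forEach((byte) => {
    if (byte === 0x00) nulls += 1;
    // Control characters other than tab, line feed, and carriage return.
    else if (byte < 0x20 && byte !== 0x09 && byte !== 0x0a && byte !== 0x0d) controls += 1;
  });

  // Weighted share of suspicious bytes, clamped to [0, 1].
  const binaryScore = Math.min(1, (nulls * NULL_WEIGHT + controls) / sample.length);
  const kind = binaryScore >= 0.5 ? 'binary' : 'text';
  return {
    kind,
    confidence: kind === 'binary' ? binaryScore : 1 - binaryScore,
    sampleSize: sample.length,
  };
}
```

A toy like this misclassifies UTF-16 text, which contains a zero byte for every ASCII-range character, so a real detector needs extra handling for such cases.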
