# anchor-atomizer

Text decomposition: Compound → Molecule → Atom, with sanitization.
## Features

- **Atomic Knowledge Model**: Decompose documents into Atom/Molecule structures
- **Text Sanitization**: Remove YAML frontmatter, log lines, HTML tags, and JSON artifacts
- **Unicode-Aware**: Proper segmentation for international text
- **Character Offsets**: Track byte positions for lazy content loading
- **Serde Support**: Serialize/deserialize atoms and molecules
## Performance

| Operation | Target |
|-----------|--------|
| Ingestion throughput | >100 atoms/sec |
| Sanitization | <1 ms per document |
| Unicode segmentation | Native speed |
## Quick Start

```rust
use anchor_atomizer::{atomize, decompose_to_molecules, sanitize, tokenize};

// Sanitize input (remove YAML frontmatter, log lines, etc.)
let raw = r#"---
title: My Document
tags: [rust, test]
---
# Introduction

This is the first paragraph.

This is the second paragraph."#;
let clean = sanitize(raw);

// Decompose into molecules (sections)
let molecules = decompose_to_molecules(&clean);

// Atomize into individual atoms (paragraphs)
let atoms = atomize(&clean);
println!("Found {} atoms", atoms.len());
for atom in &atoms {
    // Take the first 20 characters (not bytes), so the preview
    // never panics on a multi-byte UTF-8 boundary.
    let preview: String = atom.content.chars().take(20).collect();
    println!("  Atom at {}-{}: {}", atom.char_start, atom.char_end, preview);
}

// Tokenize for SimHash/TF-IDF
let tokens = tokenize(&clean);
println!("Found {} tokens", tokens.len());
```
## API

### Sanitization

```rust
/// Remove metadata wrappers from text
pub fn sanitize(text: &str) -> String;

/// With custom options
pub fn sanitize_with_options(text: &str, options: &SanitizeOptions) -> String;
```

Example:

```rust
use anchor_atomizer::sanitize;

let raw = r#"[INFO] 2024-01-01 - Starting
Actual content here."#;
let clean = sanitize(raw);
assert!(!clean.contains("[INFO]"));
```
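For intuition, the log-line pass can be pictured as a line filter. The following is a minimal, self-contained sketch of that one idea, not the crate's implementation (which also handles YAML frontmatter, HTML tags, and JSON artifacts); the `strip_log_lines` name and the set of recognized tags are illustrative assumptions:

```rust
/// Illustrative sketch only: drop lines that start with a "[LEVEL]"
/// log tag, keeping all other lines verbatim.
fn strip_log_lines(text: &str) -> String {
    text.lines()
        .filter(|line| {
            let trimmed = line.trim_start();
            // Treat these prefixes as log output (hypothetical tag set).
            !["[INFO]", "[WARN]", "[ERROR]", "[DEBUG]"]
                .iter()
                .any(|tag| trimmed.starts_with(tag))
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let raw = "[INFO] 2024-01-01 - Starting\nActual content here.";
    let clean = strip_log_lines(raw);
    assert!(!clean.contains("[INFO]"));
    println!("{clean}"); // prints "Actual content here."
}
```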
### Atomization

```rust
/// Split text into paragraph-level atoms
pub fn atomize(text: &str) -> Vec<Atom>;

/// Split text into section-level molecules
pub fn decompose_to_molecules(text: &str) -> Vec<Molecule>;
```

Atom struct:

```rust
pub struct Atom {
    pub content: String,    // Text content
    pub char_start: usize,  // Start offset in the original text
    pub char_end: usize,    // End offset in the original text
}
```

Molecule struct:

```rust
pub struct Molecule {
    pub atoms: Vec<Atom>,        // Contained atoms
    pub metadata: Option<Value>, // Section header, etc.
    pub char_start: usize,
    pub char_end: usize,
}
```
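To make the offset fields concrete, here is a self-contained sketch of blank-line paragraph splitting that records offsets into the original text, so slicing the original recovers each atom. This is not the crate's implementation, and it assumes (as the feature list states) that the offsets are byte positions:

```rust
/// Illustrative sketch only: split on blank lines while recording
/// (start, end) byte offsets, so each paragraph can be recovered
/// by slicing the original text.
fn split_paragraphs(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    let mut pos = 0;
    for line in text.split_inclusive('\n') {
        if line.trim().is_empty() {
            // Blank line closes the current paragraph, if any.
            if let Some(s) = start.take() {
                spans.push((s, pos));
            }
        } else if start.is_none() {
            start = Some(pos);
        }
        pos += line.len();
    }
    if let Some(s) = start {
        spans.push((s, text.len()));
    }
    spans
}

fn main() {
    let text = "First paragraph.\n\nSecond paragraph.";
    for (s, e) in split_paragraphs(text) {
        println!("{s}-{e}: {}", text[s..e].trim_end());
    }
}
```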
### Tokenization

```rust
/// Tokenize into lowercase words
pub fn tokenize(text: &str) -> Vec<String>;

/// Count tokens without allocating
pub fn count_tokens(text: &str) -> usize;
```
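As a rough mental model only: lowercase the text, then split on non-alphanumeric boundaries. The crate's actual tokenizer uses Unicode-aware segmentation and may produce different tokens; the sketch below is an approximation:

```rust
/// Illustrative approximation of tokenization: lowercase, then
/// split on non-alphanumeric characters. The real `tokenize`
/// uses Unicode-aware word segmentation.
fn tokenize_sketch(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_owned)
        .collect()
}

fn main() {
    let tokens = tokenize_sketch("Hello, World! 42");
    assert_eq!(tokens, vec!["hello", "world", "42"]);
    println!("{} tokens", tokens.len());
}
```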
## Installation

```toml
[dependencies]
anchor-atomizer = "0.1.0"
```

Or:

```sh
cargo add anchor-atomizer
```
## Usage Examples

### Full Ingestion Pipeline

```rust
use anchor_atomizer::{atomize, sanitize};
use anchor_fingerprint::simhash;

fn ingest_document(text: &str) -> Vec<(String, u64)> {
    // 1. Sanitize
    let clean = sanitize(text);

    // 2. Atomize
    let atoms = atomize(&clean);

    // 3. Fingerprint each atom
    atoms
        .iter()
        .map(|atom| {
            let hash = simhash(&atom.content);
            (atom.content.clone(), hash)
        })
        .collect()
}
```
### Custom Sanitization

```rust
use anchor_atomizer::{sanitize_with_options, SanitizeOptions};

let options = SanitizeOptions {
    remove_yaml_frontmatter: true,
    remove_log_lines: true,
    remove_html_tags: true,
    remove_code_fences: false, // Keep code blocks
    remove_json_artifacts: true,
    trim_result: true,
};
let clean = sanitize_with_options(text, &options);
```
### Lazy Content Loading

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use anchor_atomizer::Atom;

/// Re-read an atom's content from the original file on demand.
/// Note: despite the `char_*` names, the offsets are byte positions
/// (see Features), so they can be used directly for seeking.
fn load_atom_content(atom: &Atom, file_path: &str) -> std::io::Result<String> {
    let mut file = File::open(file_path)?;
    let mut buffer = vec![0u8; atom.char_end - atom.char_start];
    file.seek(SeekFrom::Start(atom.char_start as u64))?;
    file.read_exact(&mut buffer)?;
    Ok(String::from_utf8_lossy(&buffer).into_owned())
}
```
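The crucial detail here is that seeking works with byte offsets; a character count would drift on multi-byte UTF-8. A self-contained check of that fact, using only the standard library (the temp file name is hypothetical):

```rust
use std::fs::{self, File};
use std::io::{Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    // "café" contains a multi-byte character, so byte and char offsets differ.
    let text = "café: first paragraph.\n\nsecond paragraph.";
    let path = std::env::temp_dir().join("atomizer_offset_demo.txt");
    fs::write(&path, text)?;

    // Byte offset (not char count) of the second paragraph.
    let start = text.find("second").unwrap();
    let len = "second paragraph.".len();

    let mut file = File::open(&path)?;
    file.seek(SeekFrom::Start(start as u64))?;
    let mut buf = vec![0u8; len];
    file.read_exact(&mut buf)?;
    assert_eq!(String::from_utf8_lossy(&buf), "second paragraph.");

    fs::remove_file(&path)?;
    Ok(())
}
```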
## Text Processing Flow

```text
  Raw Document
       │
       ▼
┌─────────────────┐
│    Sanitize     │  Remove YAML, logs, HTML
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Decompose     │  Split on headers
│  to Molecules   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Atomize      │  Split on paragraphs
└────────┬────────┘
         │
         ▼
   Atoms with
   char offsets
```
## Testing

```sh
cargo test --all-features
```

Coverage:

```sh
cargo tarpaulin --out Html
```

## Benchmarks

```sh
cargo bench
```

Sample output:

```text
tokenize_short_100_chars     time: [500 ns 520 ns 540 ns]
tokenize_medium_1000_chars   time: [2.5 µs 2.6 µs 2.7 µs]
atomize_short                time: [1.0 µs 1.1 µs 1.2 µs]
sanitize_yaml_frontmatter    time: [800 ns 850 ns 900 ns]
```
## License

AGPL-3.0. See LICENSE for details.
## Contributing
- Read the specification
- Follow code style
- Write tests per testing standards
- Submit a PR
## Acknowledgments

- Unicode segmentation: the `unicode-segmentation` crate
- Regex engine: the `regex` crate
