# anchor-atomizer

Text decomposition: Compound → Molecule → Atom, with sanitization.
## Features

- **Atomic Knowledge Model**: Decompose documents into Atom/Molecule structures
- **Text Sanitization**: Remove YAML frontmatter, log lines, HTML tags, and JSON artifacts
- **Unicode-Aware**: Proper segmentation for international text
- **Character Offsets**: Track byte positions for lazy content loading
- **Serde Support**: Serialize/deserialize atoms and molecules
## Performance

| Operation | Target |
|-----------|--------|
| Ingestion throughput | >100 atoms/sec |
| Sanitization | <1 ms per document |
| Unicode segmentation | Native speed |
## Quick Start

```rust
use anchor_atomizer::{atomize, decompose_to_molecules, sanitize, tokenize};

// Sanitize input (remove YAML frontmatter, log lines, etc.)
let raw = r#"---
title: My Document
tags: [rust, test]
---
# Introduction

This is the first paragraph.

This is the second paragraph."#;
let clean = sanitize(raw);

// Decompose into molecules (sections)
let molecules = decompose_to_molecules(&clean);

// Atomize into individual atoms (paragraphs)
let atoms = atomize(&clean);
println!("Found {} atoms", atoms.len());
for atom in &atoms {
    // Take the first 20 characters (not bytes), so the preview
    // never panics on a multi-byte UTF-8 boundary.
    let preview: String = atom.content.chars().take(20).collect();
    println!("  Atom at {}-{}: {}", atom.char_start, atom.char_end, preview);
}

// Tokenize for SimHash/TF-IDF
let tokens = tokenize(&clean);
println!("Found {} tokens", tokens.len());
```
## API

### Sanitization

```rust
/// Remove metadata wrappers from text
pub fn sanitize(text: &str) -> String;

/// With custom options
pub fn sanitize_with_options(text: &str, options: &SanitizeOptions) -> String;
```

Example:

```rust
use anchor_atomizer::sanitize;

let raw = r#"[INFO] 2024-01-01 - Starting
Actual content here."#;
let clean = sanitize(raw);
assert!(!clean.contains("[INFO]"));
```
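For intuition, the log-line pass can be pictured as a line filter. The following is a minimal, self-contained sketch of that one idea, not the crate's implementation (which also handles YAML frontmatter, HTML tags, and JSON artifacts); the `strip_log_lines` name and the set of recognized tags are illustrative assumptions:

```rust
/// Illustrative sketch only: drop lines that start with a "[LEVEL]"
/// log tag, keeping all other lines verbatim.
fn strip_log_lines(text: &str) -> String {
    text.lines()
        .filter(|line| {
            let trimmed = line.trim_start();
            // Treat these prefixes as log output (hypothetical tag set).
            !["[INFO]", "[WARN]", "[ERROR]", "[DEBUG]"]
                .iter()
                .any(|tag| trimmed.starts_with(tag))
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let raw = "[INFO] 2024-01-01 - Starting\nActual content here.";
    let clean = strip_log_lines(raw);
    assert!(!clean.contains("[INFO]"));
    println!("{clean}"); // prints "Actual content here."
}
```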
### Atomization

```rust
/// Split text into paragraph-level atoms
pub fn atomize(text: &str) -> Vec<Atom>;

/// Split text into section-level molecules
pub fn decompose_to_molecules(text: &str) -> Vec<Molecule>;
```

Atom struct:

```rust
pub struct Atom {
    pub content: String,    // Text content
    pub char_start: usize,  // Start offset in the original text
    pub char_end: usize,    // End offset in the original text
}
```

Molecule struct:

```rust
pub struct Molecule {
    pub atoms: Vec<Atom>,        // Contained atoms
    pub metadata: Option<Value>, // Section header, etc.
    pub char_start: usize,
    pub char_end: usize,
}
```
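To make the offset fields concrete, here is a self-contained sketch of blank-line paragraph splitting that records offsets into the original text, so slicing the original recovers each atom. This is not the crate's implementation, and it assumes (as the feature list states) that the offsets are byte positions:

```rust
/// Illustrative sketch only: split on blank lines while recording
/// (start, end) byte offsets, so each paragraph can be recovered
/// by slicing the original text.
fn split_paragraphs(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    let mut pos = 0;
    for line in text.split_inclusive('\n') {
        if line.trim().is_empty() {
            // Blank line closes the current paragraph, if any.
            if let Some(s) = start.take() {
                spans.push((s, pos));
            }
        } else if start.is_none() {
            start = Some(pos);
        }
        pos += line.len();
    }
    if let Some(s) = start {
        spans.push((s, text.len()));
    }
    spans
}

fn main() {
    let text = "First paragraph.\n\nSecond paragraph.";
    for (s, e) in split_paragraphs(text) {
        println!("{s}-{e}: {}", text[s..e].trim_end());
    }
}
```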
### Tokenization

```rust
/// Tokenize into lowercase words
pub fn tokenize(text: &str) -> Vec<String>;

/// Count tokens without allocating
pub fn count_tokens(text: &str) -> usize;
```
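As a rough mental model only: lowercase the text, then split on non-alphanumeric boundaries. The crate's actual tokenizer uses Unicode-aware segmentation and may produce different tokens; the sketch below is an approximation:

```rust
/// Illustrative approximation of tokenization: lowercase, then
/// split on non-alphanumeric characters. The real `tokenize`
/// uses Unicode-aware word segmentation.
fn tokenize_sketch(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_owned)
        .collect()
}

fn main() {
    let tokens = tokenize_sketch("Hello, World! 42");
    assert_eq!(tokens, vec!["hello", "world", "42"]);
    println!("{} tokens", tokens.len());
}
```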
## Installation

```toml
[dependencies]
anchor-atomizer = "0.1.0"
```

Or:

```sh
cargo add anchor-atomizer
```
## Usage Examples

### Full Ingestion Pipeline

```rust
use anchor_atomizer::{atomize, sanitize};
use anchor_fingerprint::simhash;

fn ingest_document(text: &str) -> Vec<(String, u64)> {
    // 1. Sanitize
    let clean = sanitize(text);

    // 2. Atomize
    let atoms = atomize(&clean);

    // 3. Fingerprint each atom
    atoms
        .iter()
        .map(|atom| {
            let hash = simhash(&atom.content);
            (atom.content.clone(), hash)
        })
        .collect()
}
```
### Custom Sanitization

```rust
use anchor_atomizer::{sanitize_with_options, SanitizeOptions};

let options = SanitizeOptions {
    remove_yaml_frontmatter: true,
    remove_log_lines: true,
    remove_html_tags: true,
    remove_code_fences: false, // Keep code blocks
    remove_json_artifacts: true,
    trim_result: true,
};
let clean = sanitize_with_options(text, &options);
```
### Lazy Content Loading

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use anchor_atomizer::Atom;

/// Re-read an atom's content from the original file on demand.
/// Note: despite the `char_*` names, the offsets are byte positions
/// (see Features), so they can be used directly for seeking.
fn load_atom_content(atom: &Atom, file_path: &str) -> std::io::Result<String> {
    let mut file = File::open(file_path)?;
    let mut buffer = vec![0u8; atom.char_end - atom.char_start];
    file.seek(SeekFrom::Start(atom.char_start as u64))?;
    file.read_exact(&mut buffer)?;
    Ok(String::from_utf8_lossy(&buffer).into_owned())
}
```
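The crucial detail here is that seeking works with byte offsets; a character count would drift on multi-byte UTF-8. A self-contained check of that fact, using only the standard library (the temp file name is hypothetical):

```rust
use std::fs::{self, File};
use std::io::{Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    // "café" contains a multi-byte character, so byte and char offsets differ.
    let text = "café: first paragraph.\n\nsecond paragraph.";
    let path = std::env::temp_dir().join("atomizer_offset_demo.txt");
    fs::write(&path, text)?;

    // Byte offset (not char count) of the second paragraph.
    let start = text.find("second").unwrap();
    let len = "second paragraph.".len();

    let mut file = File::open(&path)?;
    file.seek(SeekFrom::Start(start as u64))?;
    let mut buf = vec![0u8; len];
    file.read_exact(&mut buf)?;
    assert_eq!(String::from_utf8_lossy(&buf), "second paragraph.");

    fs::remove_file(&path)?;
    Ok(())
}
```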
## Text Processing Flow

```text
  Raw Document
       │
       ▼
┌─────────────────┐
│    Sanitize     │  Remove YAML, logs, HTML
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Decompose     │  Split on headers
│  to Molecules   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Atomize      │  Split on paragraphs
└────────┬────────┘
         │
         ▼
   Atoms with
   char offsets
```
## Testing

```sh
cargo test --all-features
```

Coverage:

```sh
cargo tarpaulin --out Html
```

## Benchmarks

```sh
cargo bench
```

Sample output:

```text
tokenize_short_100_chars     time: [500 ns 520 ns 540 ns]
tokenize_medium_1000_chars   time: [2.5 µs 2.6 µs 2.7 µs]
atomize_short                time: [1.0 µs 1.1 µs 1.2 µs]
sanitize_yaml_frontmatter    time: [800 ns 850 ns 900 ns]
```
## License

AGPL-3.0. See LICENSE for details.
## Contributing
- Read the specification
- Follow code style
- Write tests per testing standards
- Submit a PR
## Acknowledgments

- Unicode segmentation: the `unicode-segmentation` crate
- Regex engine: the `regex` crate
