anchor-keyextract
v0.1.0
Keyword extraction using TF-IDF and RAKE with synonym ring support
Features
- TF-IDF: Term Frequency - Inverse Document Frequency for keyword scoring
- RAKE: Rapid Automatic Keyword Extraction for multi-word phrases
- Synonym Rings: Tag expansion for search queries (like #rust → #programming, #systems)
- Unicode Support: Handles international text correctly
- Serde Ready: Serialize/deserialize for storage
Quick Start
use anchor_keyextract::{extract_keywords, extract_keywords_rake, SynonymRing};
// Extract keywords using TF-IDF
let text = "Rust is a systems programming language with zero-cost abstractions";
let keywords = extract_keywords(text, 5);
for kw in keywords {
    println!("{}: {:.3}", kw.term, kw.score);
}
// Or use RAKE for multi-word phrases
let rake_keywords = extract_keywords_rake(text, 5);
// Synonym ring for tag expansion
let mut ring = SynonymRing::new();
ring.add("#rust", vec!["#programming", "#systems", "#language"]);
let expanded = ring.expand("#rust");
println!("{:?}", expanded);
// ["#rust", "#programming", "#systems", "#language"]
API
Keyword Extraction
/// Extract keywords using TF-IDF (single document)
pub fn extract_keywords(text: &str, max_keywords: usize) -> Vec<Keyword>;
/// Extract keywords using RAKE algorithm
pub fn extract_keywords_rake(text: &str, max_keywords: usize) -> Vec<Keyword>;
Keyword struct:
pub struct Keyword {
    pub term: String, // The keyword
    pub score: f32,   // Relevance score (higher = more important)
}
TF-IDF (Multi-document)
use anchor_keyextract::{TfIdf, TfIdfBuilder};
// Build from multiple documents
let tfidf = TfIdfBuilder::new()
    .add_document("Rust is fast and safe")
    .add_document("Python is popular")
    .add_document("Rust has zero-cost abstractions")
    .build();
// Get keywords for document 0
let keywords = tfidf.get_keywords(0, 5);
RAKE
use anchor_keyextract::Rake;
let rake = Rake::new();
let keywords = rake.extract("Machine learning algorithms process data", 5);
Synonym Ring
use anchor_keyextract::SynonymRing;
// Load from file
let ring = SynonymRing::load_or_empty(std::path::Path::new("internal_tags.json"));
// Or build programmatically
let mut ring = SynonymRing::new();
ring.add("#rust", vec!["#programming", "#systems"]);
ring.add("#ai", vec!["#ml", "#machine-learning"]);
// Expand a tag
let expanded = ring.expand("#rust");
// Returns: ["#rust", "#programming", "#systems"]
// Reverse lookup works too
// If "#programming" is a synonym of "#rust", expanding "#programming"
// will also return "#rust" and other synonyms
Installation
[dependencies]
anchor-keyextract = "0.1.0"
Or:
cargo add anchor-keyextract
Usage Examples
Auto-tagging Documents
use anchor_keyextract::{extract_keywords, SynonymRing};
fn auto_tag_document(text: &str, ring: &SynonymRing) -> Vec<String> {
    let keywords = extract_keywords(text, 10);
    // Convert to tags and expand with synonyms
    let mut tags = Vec::new();
    for kw in keywords {
        if kw.score > 0.5 {
            let expanded = ring.expand(&format!("#{}", kw.term));
            tags.extend(expanded);
        }
    }
    tags
}
Building a Synonym Ring from JSON
{
    "#rust": ["#programming", "#systems", "#language", "#memory-safety"],
    "#python": ["#scripting", "#data-science", "#ml"],
    "#web": ["#frontend", "#backend", "#fullstack"],
    "#database": ["#sql", "#nosql", "#storage"]
}
use anchor_keyextract::SynonymRing;
// Pass a Path, matching the call shown in the API section
let ring = SynonymRing::load_or_empty(std::path::Path::new("internal_tags.json"));
Search Query Expansion
use anchor_keyextract::SynonymRing;
fn expand_search_query(query: &str, ring: &SynonymRing) -> String {
    let terms: Vec<String> = query
        .split_whitespace()
        .flat_map(|term| ring.expand(term))
        .collect();
    terms.join(" OR ")
}
// Usage
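let ring = SynonymRing::load_or_empty(std::path::Path::new("internal_tags.json"));
let expanded = expand_search_query("#rust performance", &ring);
// With "#rust" mapped to ["#programming", "#systems"], and assuming unknown
// terms expand to themselves, every term is joined with OR:
// "#rust OR #programming OR #systems OR performance"
Algorithms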
TF-IDF
Term Frequency - Inverse Document Frequency measures how important a word is to a document:
TF-IDF(t, d) = TF(t, d) × IDF(t)
IDF(t) = log((N + 1) / (DF(t) + 1)) + 1
Where:
- TF(t, d): Frequency of term t in document d (normalized by doc length)
- N: Total number of documents
- DF(t): Number of documents containing term t
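The formula can be checked by hand with a few lines of standalone Rust. This is a minimal sketch, not the crate's implementation: the function names `tf`, `idf`, and `tf_idf` are hypothetical, documents are assumed to be pre-tokenized, and the natural logarithm is assumed because the formula above does not state a base.

```rust
/// Smoothed inverse document frequency: ln((N + 1) / (DF + 1)) + 1.
/// Assumption: natural logarithm; the README formula does not name the base.
fn idf(total_docs: usize, docs_with_term: usize) -> f64 {
    ((total_docs as f64 + 1.0) / (docs_with_term as f64 + 1.0)).ln() + 1.0
}

/// Term frequency of `term` in `doc`, normalized by document length.
fn tf(term: &str, doc: &[&str]) -> f64 {
    if doc.is_empty() {
        return 0.0;
    }
    let count = doc.iter().filter(|w| **w == term).count();
    count as f64 / doc.len() as f64
}

/// TF-IDF of `term` in document `doc_index` of a pre-tokenized `corpus`.
fn tf_idf(term: &str, doc_index: usize, corpus: &[Vec<&str>]) -> f64 {
    // DF(t): number of documents that contain the term at least once.
    let df = corpus.iter().filter(|d| d.contains(&term)).count();
    tf(term, &corpus[doc_index]) * idf(corpus.len(), df)
}
```

Under these assumptions, with N = 3 documents and a term that appears in 2 of them, IDF = ln(4/3) + 1 ≈ 1.288. A term that appears in only 1 document gets ln(4/2) + 1 ≈ 1.693, so rarer terms score higher at equal term frequency.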
RAKE
Rapid Automatic Keyword Extraction identifies multi-word keywords by:
- Splitting text on stop words and punctuation
- Building a word co-occurrence graph
- Scoring each phrase as the sum of degree/frequency over its words
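The three steps above can be sketched in standalone Rust. This is an illustration of the general RAKE technique under stated assumptions, not the crate's `Rake` type: `candidate_phrases` and `rake_scores` are hypothetical names, the caller supplies the stop-word set, and the co-occurrence graph is reduced to per-word degree and frequency counts.

```rust
use std::collections::{HashMap, HashSet};

/// Step 1: split text into candidate phrases at stop words and punctuation.
fn candidate_phrases(text: &str, stop_words: &HashSet<&str>) -> Vec<Vec<String>> {
    let mut phrases = Vec::new();
    let mut current: Vec<String> = Vec::new();
    // Anything other than letters, digits, '-', and whitespace ends a phrase.
    let is_break = |c: char| !(c.is_alphanumeric() || c == '-' || c.is_whitespace());
    for chunk in text.split(is_break) {
        for word in chunk.split_whitespace() {
            let word = word.to_lowercase();
            if stop_words.contains(word.as_str()) {
                // A stop word closes the phrase collected so far.
                if !current.is_empty() {
                    phrases.push(std::mem::take(&mut current));
                }
            } else {
                current.push(word);
            }
        }
        if !current.is_empty() {
            phrases.push(std::mem::take(&mut current));
        }
    }
    phrases
}

/// Steps 2 and 3: count degree and frequency per word, then score each
/// distinct phrase as the sum of degree(word) / frequency(word).
fn rake_scores(phrases: &[Vec<String>]) -> Vec<(String, f64)> {
    let mut freq: HashMap<&str, f64> = HashMap::new();
    let mut degree: HashMap<&str, f64> = HashMap::new();
    for phrase in phrases {
        for word in phrase {
            *freq.entry(word.as_str()).or_insert(0.0) += 1.0;
            // Degree counts every word in the phrase, including the word itself.
            *degree.entry(word.as_str()).or_insert(0.0) += phrase.len() as f64;
        }
    }
    let mut seen: HashSet<String> = HashSet::new();
    let mut scored: Vec<(String, f64)> = phrases
        .iter()
        .filter(|p| seen.insert(p.join(" "))) // keep the first copy of each phrase
        .map(|p| {
            let score: f64 = p
                .iter()
                .map(|w| degree[w.as_str()] / freq[w.as_str()])
                .sum();
            (p.join(" "), score)
        })
        .collect();
    // Highest-scoring phrases first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

For the Quick Start sentence with the stop words "is", "a", and "with", this sketch yields three phrases: "systems programming language" (score 9), "zero-cost abstractions" (score 4), and "rust" (score 1). Longer phrases win because each word's degree grows with the length of the phrases it appears in.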
Synonym Ring Expansion
Bidirectional expansion:
- Forward: #rust → [#programming, #systems]
- Reverse: #programming → [#rust, #systems] (if #programming is a synonym of #rust)
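One way to get this bidirectional behavior is to index every member of a ring, not only the key it was registered under. The sketch below is a hypothetical `MiniRing`, written only to illustrate the idea; it is not the crate's `SynonymRing`, and it makes the simplifying assumptions that each tag belongs to at most one ring and that unknown tags expand to themselves.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a synonym ring; not the crate's `SynonymRing`.
#[derive(Default)]
struct MiniRing {
    /// Maps every tag (key or synonym) to the index of its ring.
    index: HashMap<String, usize>,
    /// Each ring lists its members in insertion order.
    rings: Vec<Vec<String>>,
}

impl MiniRing {
    /// Registers `tag` and its synonyms as one ring.
    /// Simplification: a tag added to a second ring is re-pointed to that ring.
    fn add(&mut self, tag: &str, synonyms: &[&str]) {
        let id = self.rings.len();
        let mut members = vec![tag.to_string()];
        members.extend(synonyms.iter().map(|s| s.to_string()));
        for member in &members {
            self.index.insert(member.clone(), id);
        }
        self.rings.push(members);
    }

    /// Returns the queried tag first, then the other members of its ring.
    /// Assumption: unknown tags expand to themselves.
    fn expand(&self, tag: &str) -> Vec<String> {
        let mut out = vec![tag.to_string()];
        if let Some(&id) = self.index.get(tag) {
            out.extend(
                self.rings[id]
                    .iter()
                    .filter(|member| member.as_str() != tag)
                    .cloned(),
            );
        }
        out
    }
}
```

After `add("#rust", &["#programming", "#systems"])`, expanding "#rust" gives the forward result and expanding "#programming" gives "#programming", "#rust", "#systems", because both lookups resolve to the same ring.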
Testing
cargo test --all-features
Benchmarks
cargo bench
Sample output:
tfidf_build_medium time: [5.0 µs 5.2 µs 5.4 µs]
tfidf_extract_keywords time: [2.0 µs 2.1 µs 2.2 µs]
extract_keywords_short time: [3.0 µs 3.1 µs 3.2 µs]
extract_keywords_medium time: [15.0 µs 15.5 µs 16.0 µs]
rake_extract_medium time: [20.0 µs 21.0 µs 22.0 µs]
synonym_ring_expand time: [50 ns 52 ns 54 ns]
License
AGPL-3.0 - See LICENSE for details.
Contributing
- Read the specification
- Follow code style
- Write tests per testing standards
- Submit a PR
Acknowledgments
- TF-IDF: Standard information retrieval algorithm
- RAKE: Rose, Engel, Eigner, Jones (2010)
- Unicode segmentation: unicode-segmentation crate
