deduplino
v0.0.9
CLI tool for deduplicating lino format
Deduplino
A CLI tool for deduplicating lino format files by identifying patterns in repeated link references and replacing them with numbered references for improved readability and reduced file size.
Installation
Using Bun (Recommended)
# Install globally with bun
bun install -g deduplino
# Or from source
git clone <repository-url>
cd deduplino
bun install
bun run build
Using NPM (Fallback)
npm install -g deduplino
Quick Start
# Basic usage
deduplino -i input.lino -o output.lino
# From stdin to stdout
printf '(test link)\n(test link)\n' | deduplino --piped-input
# Process with different threshold
deduplino --deduplication-threshold 0.5 -i input.lino
How It Works
Deduplino analyzes lino files to find patterns in link references and creates optimized representations using three pattern types.
Auto-Escape Feature
The --auto-escape option automatically converts non-lino text (like logs) into valid lino format:
- First attempt: Escape only references containing colons (timestamps, URLs, field names)
- Second attempt: Escape references with special characters (!@#$%^&*+=|\\:;?/<>.,)
- Final fallback: Escape all references except simple punctuation and quoted strings
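As a rough illustration of the first attempt only, the sketch below quotes every whitespace-separated token that contains a colon. The function name escapeColonTokens is hypothetical and this is not deduplino's actual implementation; it also ignores tokens that already contain quotes.

```typescript
// Hypothetical helper, NOT deduplino's real code: a minimal sketch of the
// "escape references containing colons" pass described above.
function escapeColonTokens(line: string): string {
  return line
    .split(/\s+/)
    .filter((token) => token.length > 0)
    // Wrap only tokens that contain a colon (timestamps, URLs, field names).
    .map((token) => (token.includes(":") ? `'${token}'` : token))
    .join(" ");
}

console.log(escapeColonTokens("2025-07-25T21:32:46Z error: connection failed"));
// prints: '2025-07-25T21:32:46Z' 'error:' connection failed
```

Tokens without a colon pass through untouched, which is why the later attempts with broader escaping rules are needed for harder input.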
Example log processing:
Input: 2025-07-25T21:32:46Z updateReferences id: a43fad436d79
Output: '2025-07-25T21:32:46Z' updateReferences 'id:' a43fad436d79
Pattern Types
1. Exact Duplicates
Links that appear identically multiple times.
Input:
(first second)
(first second)
(first second)
Output:
1: first second
1
1
1
2. Prefix Patterns
Links that share common beginnings.
Input:
(this is a link of cat)
(this is a link of tree)
Output:
1: this is a link of
1 cat
1 tree
3. Suffix Patterns
Links that share common endings.
Input:
(foo ends here)
(bar ends here)
Output:
1: ends here
foo 1
bar 1
Advanced Pattern Detection
The tool handles complex nested structures and can identify patterns in structured links:
Input:
(this is) a link
(this is) a link
Output:
1: this is
1 a link
1 a link
Algorithm
- Parse input using the Protocols.Lino parser
- Filter links with 2+ references (deduplicatable content)
- Identify Patterns:
  - Exact duplicates
  - Common prefixes between link pairs
  - Common suffixes between link pairs
  - Special handling for structured links
- Score & Select patterns by (frequency × pattern_length)
- Apply top patterns based on threshold
- Format output using the library's formatLinks function
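The prefix and suffix steps in this list can be sketched as a pairwise comparison at the reference level. The helpers below are hypothetical (commonPrefix, commonSuffix, and refs are illustrative names, not the tool's API), and they model a link simply as an array of space-separated references rather than using the real parser.

```typescript
// Hypothetical helpers, not deduplino's actual code.
// Returns the leading references that two links share.
function commonPrefix(a: string[], b: string[]): string[] {
  const shared: string[] = [];
  const limit = Math.min(a.length, b.length);
  for (let i = 0; i < limit; i++) {
    const ref = a[i];
    if (ref === undefined || ref !== b[i]) break;
    shared.push(ref);
  }
  return shared;
}

// Returns the trailing references that two links share,
// by running the prefix comparison on reversed copies.
function commonSuffix(a: string[], b: string[]): string[] {
  return commonPrefix([...a].reverse(), [...b].reverse()).reverse();
}

// Assumption: references are separated by single spaces.
const refs = (link: string): string[] => link.split(" ");

console.log(commonPrefix(refs("this is a link of cat"), refs("this is a link of tree")).join(" "));
// prints: this is a link of
console.log(commonSuffix(refs("foo ends here"), refs("bar ends here")).join(" "));
// prints: ends here
```

Comparing whole references instead of characters keeps a pattern from ending in the middle of a word.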
CLI Options
| Option | Short | Description | Default |
|--------|--------|-------------|---------|
| [input-file] | | Input file as positional argument | - |
| --input | -i | Input file path (alternative to positional argument) | - |
| --output | -o | Output file path (smart naming if not provided) | - |
| --deduplication-threshold | -p | Fraction of top-ranked patterns to apply (0-1) | 0.2 |
| --auto-escape | | Automatically escape input to make it valid lino format | false |
| --piped-input | | Read from stdin (use when piping data) | false |
| --fail-on-parse-error | | Exit with code 1 if input cannot be parsed as lino format | false |
| --detect-auto-escape-edge-cases | | Analyze log file line-by-line to find cases that auto-escape cannot fix | false |
| --help | -h | Show help information | - |
Examples
Basic File Processing
# Deduplicate a file (smart output naming)
deduplino document.lino
# Creates document.deduped.lino
# Deduplicate with custom output
deduplino document.lino -o compressed.lino
# Traditional flag syntax
deduplino -i document.lino -o compressed.lino
# Process from pipeline
cat document.lino | deduplino --piped-input > compressed.lino
# Quick stdin processing
printf '(test)\n(test)\n' | deduplino --piped-input
Smart Output Naming
When you don't specify an output file, deduplino automatically generates one:
# File with .lino extension
deduplino input.lino # → input.deduped.lino
# File without .lino extension
deduplino server.log # → server.log.deduped.lino
deduplino data.txt # → data.txt.deduped.lino
Threshold Control
# Conservative (default) - top 20% of patterns
deduplino document.lino
# More aggressive - top 50% of patterns
deduplino --deduplication-threshold 0.5 -i document.lino
# Maximum deduplication - all patterns
deduplino --deduplication-threshold 1.0 -i document.lino
Auto-Escape for Logs
# Process log files that aren't valid lino format
deduplino --auto-escape -i server.log -o processed.lino
# Handle timestamps and special characters
echo "2025-07-25T21:32:46Z error: connection failed" | deduplino --auto-escape --piped-input
# Output: '2025-07-25T21:32:46Z' 'error:' connection failed
Pipeline Usage
# Chain with other tools
some-tool | deduplino --piped-input | other-tool
# Multiple processing steps
cat input.lino | deduplino --piped-input -p 0.3 | tee intermediate.lino | final-processor
Error Handling and Validation
# Validate lino format - exit with code 1 if invalid
deduplino --fail-on-parse-error -i document.lino
# Auto-escape with validation - useful for CI/CD pipelines
deduplino --auto-escape --fail-on-parse-error -i log.txt
# This will attempt auto-escape, but fail if it still can't parse the result
# Check if auto-escape worked properly
echo "problematic: input" | deduplino --piped-input --auto-escape --fail-on-parse-error
Edge Case Detection and Analysis
# Analyze a log file to find problematic lines
deduplino --detect-auto-escape-edge-cases -i server.log
# Find edge cases in piped input
cat application.log | deduplino --piped-input --detect-auto-escape-edge-cases
# Example output:
# 🔍 Found 3 edge case(s) that auto-escape cannot fix:
#
# 📂 Unbalanced Parentheses (2 cases):
# Line 42: "))((("
# Line 156: "))((()))(("
#
# 📂 Only Punctuation (1 cases):
# Line 89: "( ( ( ) )"
#
# 📊 Statistics:
# Total lines processed: 1000
# Failed lines: 3
# Success rate: 99.7%
Pattern Selection Strategy
The --deduplication-threshold parameter controls which patterns are applied:
- 0.2 (default): Apply top 20% of patterns for optimal readability/compression balance
- 0.5: More aggressive deduplication, may impact readability
- 1.0: Maximum deduplication, applies all found patterns
Patterns are ranked by: frequency × pattern_length
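One way to read the threshold is as the fraction of ranked patterns to keep. The sketch below assumes exactly that; selectTopPatterns and ScoredPattern are hypothetical names, and rounding up with Math.ceil is an assumption, since the rounding rule is not stated here.

```typescript
// Hypothetical types and helper, not deduplino's actual code.
interface ScoredPattern {
  pattern: string;
  score: number; // frequency × pattern_length
}

// Keep the top `threshold` fraction of patterns, highest score first.
// ASSUMPTION: the cut-off is rounded up; the real tool may round differently.
function selectTopPatterns(patterns: ScoredPattern[], threshold: number): ScoredPattern[] {
  const ranked = [...patterns].sort((x, y) => y.score - x.score);
  const keep = Math.ceil(ranked.length * threshold);
  return ranked.slice(0, keep);
}

const candidates: ScoredPattern[] = [
  { pattern: "ends here", score: 4 },
  { pattern: "this is a link of", score: 10 },
  { pattern: "first second", score: 6 },
  { pattern: "this is", score: 4 },
];

console.log(selectTopPatterns(candidates, 0.5).map((p) => p.pattern));
// prints: [ 'this is a link of', 'first second' ]
console.log(selectTopPatterns(candidates, 1.0).length);
// prints: 4
```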
Development
Setup
bun install
Testing
# Run all tests
bun test
# Watch mode
bun test --watch
Building
# Build for production
bun run build
# Development mode with file watching
bun run dev
Project Structure
src/
├── index.ts # CLI interface and argument parsing
├── deduplicator.ts # Core deduplication algorithm
tests/
└── deduplicator.test.ts # Comprehensive test suite (27 tests)
Algorithm Details
Pattern Finding
- Exact: Map-based counting of identical content
- Prefix/Suffix: Pairwise comparison with reference-level matching
- Structured: Special handling for nested link structures like (this is) a link
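The map-based counting for exact duplicates could look roughly like the following. The function name countExactDuplicates is hypothetical, and links are modeled as plain strings of their content rather than parsed link objects.

```typescript
// Hypothetical helper, not deduplino's actual code.
// Counts identical link contents and keeps only those seen at least twice.
function countExactDuplicates(links: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const link of links) {
    counts.set(link, (counts.get(link) ?? 0) + 1);
  }
  // Content that occurs once has nothing to deduplicate, so drop it.
  return new Map(Array.from(counts).filter(([, count]) => count >= 2));
}

const duplicates = countExactDuplicates([
  "first second",
  "first second",
  "first second",
  "something else",
]);

console.log(duplicates.get("first second")); // prints: 3
console.log(duplicates.has("something else")); // prints: false
```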
Pattern Scoring
Patterns are scored by count × pattern.split(' ').length to favor:
- High-frequency patterns (appear many times)
- Longer patterns (more compression benefit)
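The scoring formula can be written out directly. The function name scorePattern is a hypothetical wrapper; only the expression count × pattern.split(' ').length comes from the description above.

```typescript
// Hypothetical wrapper around the stated formula:
// score = count × number of space-separated references in the pattern.
function scorePattern(pattern: string, count: number): number {
  return count * pattern.split(" ").length;
}

console.log(scorePattern("this is a link of", 2)); // prints: 10
console.log(scorePattern("first second", 3)); // prints: 6
console.log(scorePattern("ends here", 2)); // prints: 4
```

With these numbers, a five-reference prefix shared by two links outranks a two-reference link repeated three times, which shows how pattern length can outweigh frequency.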
Overlap Prevention
Selected patterns are filtered to prevent overlap - each link content can only be part of one pattern.
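A minimal sketch of such a filter, assuming a greedy pass in descending score order that skips any pattern touching an already claimed link. The Candidate shape and the filterOverlaps name are hypothetical, and the greedy strategy itself is an assumption rather than a description of the tool's code.

```typescript
// Hypothetical shape: each candidate lists the indices of the links it covers.
interface Candidate {
  pattern: string;
  score: number;
  links: number[];
}

// ASSUMPTION: greedy selection by score; a candidate is skipped entirely
// if any of its links already belongs to a selected pattern.
function filterOverlaps(candidates: Candidate[]): Candidate[] {
  const claimed = new Set<number>();
  const selected: Candidate[] = [];
  const ranked = [...candidates].sort((x, y) => y.score - x.score);
  for (const candidate of ranked) {
    if (candidate.links.some((index) => claimed.has(index))) continue;
    candidate.links.forEach((index) => claimed.add(index));
    selected.push(candidate);
  }
  return selected;
}

const kept = filterOverlaps([
  { pattern: "this is a link of", score: 10, links: [0, 1] },
  { pattern: "of cat", score: 4, links: [1, 2] }, // shares link 1 with the first pattern
  { pattern: "first second", score: 6, links: [3, 4] },
]);

console.log(kept.map((c) => c.pattern));
// prints: [ 'this is a link of', 'first second' ]
```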
Dependencies
- @linksplatform/protocols-lino: Lino format parsing and formatting
- yargs: Command-line argument parsing
License
This is free and unencumbered software released into the public domain.
See LICENSE for full details or visit https://unlicense.org
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass: bun test
- Submit a pull request
