utf8-flattener
v1.0.1
A powerful CLI tool to flatten codebases into a single XML file with UTF-8 support and comprehensive statistics
UTF8-Flattener
A powerful CLI tool to flatten codebases into a single XML file with comprehensive statistics and UTF-8 support.
Features
- 🚀 Fast file discovery - Uses Git for efficient file listing when available
- 🌍 UTF-8 support - Properly handles file paths with non-ASCII characters (Vietnamese, Chinese, etc.)
- 📊 Comprehensive statistics - File sizes, extensions, directories, and more
- 🔍 Smart filtering - Respects .gitignore and provides sensible defaults
- 📄 Detailed reporting - Generates both console output and markdown reports
- ⚡ Concurrent processing - Automatically optimizes based on CPU count
- 🛡️ Binary detection - Safely handles binary files without corruption
Installation
npm install -g utf8-flattener
Usage
Basic Usage
# Flatten current directory
utf8-flattener
# Or use the shorter alias
flattener
# Flatten specific directory
utf8-flattener -i /path/to/project
# Custom output file
utf8-flattener -i /path/to/project -o my-codebase.xml
Command Line Options
utf8-flattener [options]
Options:
-V, --version Output the version number
-i, --input <path> Input directory to flatten (default: current directory)
-o, --output <path> Output file path (default: "flattened-codebase.xml")
-h, --help Display help for command
Interactive Mode
When run without arguments, utf8-flattener will:
- Auto-detect project root (looks for .git, package.json, etc.)
- Suggest intelligent defaults
- Ask for confirmation before proceeding
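The root auto-detection step can be pictured as a walk up the directory tree that stops at the first folder containing a marker. The following is only a minimal sketch under that assumption: `findProjectRoot` and its default marker list are hypothetical names chosen for illustration, not the package's actual implementation.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: walk upward from startDir until a directory
// containing one of the marker entries is found.
function findProjectRoot(startDir, markers = ['.git', 'package.json']) {
  let dir = path.resolve(startDir);
  while (true) {
    if (markers.some((marker) => fs.existsSync(path.join(dir, marker)))) {
      return dir;
    }
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the filesystem root without a match
    dir = parent;
  }
}
```

Returning `null` when nothing matches leaves the caller free to fall back to the current directory, which is the documented default for `--input`.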
$ utf8-flattener
Detected project root at "/home/user/my-project". Use it as input and write output to "/home/user/my-project/flattened-codebase.xml"? [Y/n]
Output Format
The tool generates:
- XML file - Contains all text file contents with metadata
- Statistics report - Comprehensive analysis of your codebase
- Markdown report (optional) - Detailed breakdown with tables and charts
XML Structure
<?xml version="1.0" encoding="UTF-8"?>
<codebase>
<file path="src/index.js" size="1234" lines="45">
<content><![CDATA[
// Your file content here
]]></content>
</file>
<!-- More files... -->
</codebase>
Examples
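To make the XML structure above concrete, here is a rough sketch of how a single `<file>` element could be assembled. The helper names (`escapeAttr`, `wrapCdata`, `fileElement`) and the line-counting rule are assumptions for illustration; the package's own XML generation may differ. The one detail worth noting is that file content containing a literal `]]>` has to be split across two CDATA sections, otherwise it would close the section early.

```javascript
// Escape characters that are unsafe inside a double-quoted XML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Wrap text in CDATA, splitting any "]]>" so it cannot end the section early.
function wrapCdata(text) {
  return '<![CDATA[' + text.split(']]>').join(']]]]><![CDATA[>') + ']]>';
}

// Hypothetical builder for one <file> element.
// Assumption: "lines" counts newline-separated segments, 0 for empty content.
function fileElement({ path, content }) {
  const size = Buffer.byteLength(content, 'utf8');
  const lines = content === '' ? 0 : content.split('\n').length;
  return (
    `<file path="${escapeAttr(path)}" size="${size}" lines="${lines}">\n` +
    `<content>${wrapCdata(content)}</content>\n` +
    '</file>'
  );
}
```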
Example 1: Basic Project Flattening
cd my-project
utf8-flattener
Output:
🔍 Discovering files...
✅ Found 142 files to include
📄 Processing files...
✅ Processed 142/142 files
🔧 Generating XML output...
✅ XML generation completed
📊 Completion Summary:
✅ Successfully processed 142 files into flattened-codebase.xml
📁 Output file: /home/user/my-project/flattened-codebase.xml
📏 Total source size: 2.3 MB
📄 Generated XML size: 2.5 MB
📝 Total lines of code: 8,543
🔢 Estimated tokens: 125,670
Example 2: Large Codebase with Custom Output
utf8-flattener -i /large/project -o /output/analysis.xml
Example 3: Working with UTF-8 Paths
# Works perfectly with international file names
utf8-flattener -i "项目/中文目录" -o chinese-project.xml
utf8-flattener -i "dự án/tiếng việt" -o vietnamese-project.xml
File Filtering
The tool automatically respects:
- Git repositories - Uses git ls-files for accurate file listing
- .gitignore rules - Excludes ignored files and directories
- Binary files - Detects binary files and skips their contents (their size is still included in the output)
- Common ignore patterns - node_modules, .git, build artifacts, etc.
Default Ignore Patterns
- node_modules/
- .git/
- dist/, build/
- *.log
- *.tmp, *.temp
- IDE files (.vscode, .idea)
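The filtering behaviour described in this section could be approximated as follows. This is a sketch, not the package's discovery code: the function names are hypothetical, and the choice of `git ls-files` flags (`--cached --others --exclude-standard`, which lists tracked plus untracked-but-not-ignored files) is an assumption. The `-z` flag asks Git for NUL-separated, unquoted paths, which sidesteps path quoting entirely. Outside a Git repository, the sketch falls back to a plain directory walk using the default ignore patterns listed above.

```javascript
const { execFileSync } = require('child_process');
const fs = require('fs');
const path = require('path');

const IGNORED_DIRS = new Set(['node_modules', '.git', 'dist', 'build', '.vscode', '.idea']);
const IGNORED_EXTS = new Set(['.log', '.tmp', '.temp']);

// Ask Git for tracked and untracked files, honouring .gitignore.
function listWithGit(root) {
  const out = execFileSync(
    'git',
    ['ls-files', '-z', '--cached', '--others', '--exclude-standard'],
    { cwd: root, encoding: 'utf8', stdio: ['ignore', 'pipe', 'ignore'] }
  );
  return out.split('\0').filter(Boolean);
}

// Fallback: recursive walk that skips the default ignore patterns.
function walk(root, dir = root, found = []) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      if (!IGNORED_DIRS.has(entry.name)) walk(root, full, found);
    } else if (entry.isFile() && !IGNORED_EXTS.has(path.extname(entry.name))) {
      found.push(path.relative(root, full));
    }
  }
  return found;
}

// Prefer Git; fall back to the walk if Git is missing or root is not a repository.
function discover(root) {
  try {
    return listWithGit(root).sort();
  } catch {
    return walk(root).sort();
  }
}
```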
Statistics Overview
The tool provides detailed analysis including:
File Analysis
- Total files processed
- File size distribution
- File type breakdown
- Largest files
- Duplicate candidates
Directory Analysis
- Directory size breakdown
- Depth distribution
- File count per directory
Code Quality Metrics
- Lines of code
- Token estimation
- Zero-byte files
- Empty text files
- Suspiciously large files
Git Integration
- Tracked vs untracked files
- Git LFS candidates
- Repository status
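A few of the metrics listed above can be derived from nothing more than each file's path, size, and line count. The sketch below assumes a hypothetical `summarize` function operating on records shaped like the `<file>` attributes in the XML output; it illustrates the idea and is not the package's statistics module.

```javascript
const path = require('path');

// Hypothetical summary over records of the form { path, size, lines }.
function summarize(files, topN = 3) {
  const byExtension = {};
  let totalBytes = 0;
  let totalLines = 0;

  for (const file of files) {
    totalBytes += file.size;
    totalLines += file.lines;
    const ext = path.extname(file.path) || '(none)';
    if (!byExtension[ext]) byExtension[ext] = { count: 0, bytes: 0 };
    byExtension[ext].count += 1;
    byExtension[ext].bytes += file.size;
  }

  return {
    totalFiles: files.length,
    totalBytes,
    totalLines,
    byExtension,
    // Copy before sorting so the caller's array is left untouched.
    largest: [...files].sort((a, b) => b.size - a.size).slice(0, topN).map((f) => f.path),
    zeroByte: files.filter((f) => f.size === 0).map((f) => f.path),
  };
}
```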
Advanced Usage
Programmatic Usage
const flattener = require('utf8-flattener');
// Use as a module (main.js exports the commander program)
const program = flattener;
Custom Integration
The flattener consists of modular components:
- discovery.js - File discovery with Git integration
- aggregate.js - Content aggregation with concurrency
- stats.js - Statistical analysis
- xml.js - XML generation
- binary.js - Binary file detection
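As an illustration of what one such component might look like, here is a sketch of binary detection using a common heuristic: a file is treated as binary if a NUL byte appears in its first few kilobytes. Whether binary.js uses this heuristic is not stated here, so treat `looksBinary`, `isBinaryFile`, and the 8000-byte sample size as assumptions. A small `Map` caches results per path so repeated checks do not reread the file.

```javascript
const fs = require('fs');

const SAMPLE_SIZE = 8000; // assumed sample length, not taken from the package

// Heuristic: text files almost never contain a NUL byte.
function looksBinary(buffer) {
  const end = Math.min(buffer.length, SAMPLE_SIZE);
  for (let i = 0; i < end; i++) {
    if (buffer[i] === 0) return true;
  }
  return false;
}

const cache = new Map();

// Read only the first SAMPLE_SIZE bytes of the file and cache the verdict.
function isBinaryFile(filePath) {
  if (cache.has(filePath)) return cache.get(filePath);
  const fd = fs.openSync(filePath, 'r');
  try {
    const sample = Buffer.alloc(SAMPLE_SIZE);
    const bytesRead = fs.readSync(fd, sample, 0, sample.length, 0);
    const result = looksBinary(sample.subarray(0, bytesRead));
    cache.set(filePath, result);
    return result;
  } finally {
    fs.closeSync(fd);
  }
}
```

One known limitation of this heuristic is that UTF-16 text, which contains NUL bytes for ASCII characters, would be classified as binary.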
Performance
- Concurrent processing - Automatically scales based on CPU cores
- Memory efficient - Streams large files during XML generation
- Git optimization - Uses git ls-files for fast discovery
- Smart caching - Reuses binary detection results
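Concurrency that scales with the CPU count is often implemented as a small worker pool. The sketch below shows one way to do that; `mapWithLimit` is a hypothetical helper, and using `os.cpus().length` as the default limit is an assumption about how the tool might choose its concurrency.

```javascript
const os = require('os');

// Run worker(item, index) over all items with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithLimit(items, worker, limit = Math.max(1, os.cpus().length)) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index.
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

With a pool like this, reading thousands of files never opens more than `limit` of them at once, which keeps file-descriptor usage and memory bounded.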
Typical Performance
| Project Size | Files | Time | Memory |
|-------------|-------|------|--------|
| Small (< 100 files) | ~50 | < 1s | ~50MB |
| Medium (1k files) | ~1,000 | ~5s | ~200MB |
| Large (10k files) | ~10,000 | ~30s | ~500MB |
UTF-8 Support
This tool has enhanced UTF-8 support for international file paths:
- ✅ Vietnamese: Phát hành thẻ/verifyIssueCreditCard.feature
- ✅ Chinese: 项目/测试/测试文件.js
- ✅ Japanese: プロジェクト/テスト.js
- ✅ Arabic: مشروع/اختبار.js
- ✅ Emoji: 🚀project/📁folder/📄file.js
The tool properly handles Git's quoted path format and C-style escape sequences.
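For context, Git by default prints paths containing non-ASCII bytes wrapped in double quotes, with each such byte written as a three-digit octal escape (for example `"d\341\273\261 \303\241n/a.js"`). The sketch below shows how such a string can be decoded back to UTF-8. `unquoteGitPath` is a hypothetical helper written for illustration and is not taken from the package's source.

```javascript
// C-style single-character escapes and the byte each one stands for.
const SIMPLE_ESCAPES = new Map([
  ['a', 0x07], ['b', 0x08], ['t', 0x09], ['n', 0x0a], ['v', 0x0b],
  ['f', 0x0c], ['r', 0x0d], ['"', 0x22], ['\\', 0x5c],
]);

// Hypothetical decoder for Git's quoted path format.
function unquoteGitPath(raw) {
  // Paths without surrounding quotes are already literal.
  if (raw.length < 2 || !raw.startsWith('"') || !raw.endsWith('"')) return raw;

  const src = Buffer.from(raw.slice(1, -1), 'utf8');
  const out = [];
  for (let i = 0; i < src.length; i++) {
    if (src[i] !== 0x5c) { // not a backslash: copy the byte through
      out.push(src[i]);
      continue;
    }
    const digits = src.toString('latin1', i + 1, i + 4);
    const letter = src.toString('latin1', i + 1, i + 2);
    if (/^[0-3][0-7]{2}$/.test(digits)) {
      out.push(parseInt(digits, 8)); // \NNN is one raw byte in octal
      i += 3;
    } else if (SIMPLE_ESCAPES.has(letter)) {
      out.push(SIMPLE_ESCAPES.get(letter));
      i += 1;
    } else {
      out.push(0x5c); // unknown escape: keep the backslash as-is
    }
  }
  // Reassemble the collected bytes as a UTF-8 string.
  return Buffer.from(out).toString('utf8');
}

console.log(unquoteGitPath('"d\\341\\273\\261 \\303\\241n/a.js"')); // prints: dự án/a.js
```

Decoding at the byte level matters because a single character such as `ự` is spread over three octal escapes; converting each escape to a character individually would produce mojibake.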
Troubleshooting
Common Issues
Issue: "Cannot find module 'commander'"
# Solution: Install dependencies
npm install
Issue: UTF-8 paths showing as escape sequences
# This is now fixed! The tool automatically decodes Git's quoted paths
Issue: Permission denied on binary executable
# Solution: Make executable
chmod +x ./bin/flattener.js
Issue: Out of memory on very large projects
# Solution: The tool uses streaming for large files, but you may need to increase Node.js memory
node --max-old-space-size=4096 ./bin/flattener.js -i large-project
Debug Mode
For troubleshooting, you can examine the intermediate steps:
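// Call the discovery step directly to see which files would be included.
// Top-level await is not available in CommonJS, so wrap the call in an async function.
const { discoverFiles } = require('./discovery');

(async () => {
  const files = await discoverFiles('/path/to/project');
  console.log('Discovered files:', files);
})();
Contributing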
This tool was extracted from the BMAD-METHOD™ toolkit. Contributions welcome!
Development Setup
git clone https://github.com/yourusername/utf8-flattener.git
cd utf8-flattener
npm install
npm test
License
MIT License - see LICENSE file for details.
Changelog
v1.0.0
- Initial release
- UTF-8 path support
- Comprehensive statistics
- Git integration
- Binary file detection
- Concurrent processing
- Markdown reporting
