DocFlow
A developer-friendly transformation engine built on SuperDoc
Transform documents programmatically with a clean, chainable API. Built on top of SuperDoc Editor, DocFlow makes it easy to load, query, transform, and export documents in multiple formats.
Features
- 🔄 Format Conversion: DOCX ↔ Markdown ↔ HTML ↔ JSON ↔ Plain Text
- 📊 Table Support: DOCX tables export cleanly to Markdown with formatting preserved
- 🔍 Powerful Queries: CSS-like selectors and predicate functions
- ⚡ Batch Processing: Process multiple documents concurrently
- 🎯 Type-Safe: Full TypeScript support
- 🔗 Chainable API: Fluent, readable transformations
- 🚫 Error Handling: Multiple error handling modes
- 📦 Headless: Runs in Node.js without a browser
Installation
npm install @paulmeller/docflow
Quick Start
import DocFlow from '@paulmeller/docflow';
// Simple transformation
await new DocFlow()
.load('document.docx')
.transform('heading[level=2]', node => ({
...node,
text: node.text.toUpperCase()
}))
.save('output.docx');
⚠️ Important: load() vs loadContent()
DO NOT confuse these two methods:
| Method | Purpose | First Parameter | When to Use |
|--------|---------|----------------|-------------|
| load() | Load from file path or buffer | File path string or Buffer | Reading files from disk |
| loadContent() | Load from content string | Content string (markdown/HTML/JSON) | When you have content in memory |
Common Mistake ❌
// ❌ WRONG - This treats JSON string as a file path!
const jsonString = JSON.stringify(docJson);
await doc.load(jsonString, { format: 'json' });
// Error: ENAMETOOLONG or "file not found"
// ❌ WRONG - This treats markdown as a file path!
const markdown = '# Hello\n\nWorld';
await doc.load(markdown, { format: 'markdown' });
Correct Usage ✅
// ✅ CORRECT - Use load() for file paths
await doc.load('document.docx');
await doc.load('data.json', { format: 'json' });
// ✅ CORRECT - Use loadContent() for content strings
const jsonString = JSON.stringify(docJson);
await doc.loadContent(jsonString);
const markdown = '# Hello\n\nWorld';
await doc.loadContent(markdown);
Core Concepts
1. Load Documents
Load from file, buffer, or content:
const doc = new DocFlow();
// From file path
await doc.load('document.docx');
// From buffer (binary data)
const buffer = fs.readFileSync('document.docx');
await doc.load(buffer);
// From JSON buffer with explicit format
const jsonBuffer = Buffer.from(jsonString, 'utf-8');
await doc.load(jsonBuffer, { format: 'json' });
// From content string (auto-detects markdown, HTML, JSON, or text)
await doc.loadContent('<h1>Title</h1><p>Content</p>');
await doc.loadContent('# Title\n\nContent');
await doc.loadContent(JSON.stringify(docJson));
2. Query Structure
Find nodes using selectors:
// CSS-like selectors
const headings = doc.query('heading[level=2]');
const paragraphs = doc.query('paragraph');
// Predicate functions
const longParagraphs = doc.query(node =>
node.type === 'paragraph' &&
node.content?.length > 100
);
// Work with results
console.log(headings.count); // Number of matches
console.log(headings.text()); // Concatenated text
console.log(headings.first()); // First match
console.log(headings.last()); // Last match
3. Transform Content
Modify document structure:
import { Transforms } from '@paulmeller/docflow';
// Transform matching nodes
await doc.transform('heading[level=2]', node => ({
...node,
text: Transforms.toTitleCase(node.text)
}));
// Transform with built-in helpers
await doc.transform('text', Transforms.replace(/foo/g, 'bar'));
await doc.transform('text', Transforms.addPrefix('> '));
// Transform entire document
await doc.transformDocument(json => {
// Modify entire document structure
return modifiedJSON;
});
4. Export & Save
Export to various formats:
// Export to buffer/string
const docx = await doc.export('docx');
const html = await doc.export('html');
const markdown = await doc.export('markdown');
const json = await doc.export('json');
const text = await doc.export('text');
// Save directly to file
await doc.save('output.docx');
await doc.save('output.md', 'markdown');
await doc.save('output.html', 'html');
await doc.save('output.json', 'json');
await doc.save('output.txt', 'text');
Examples
Example 1: Format Conversion
import DocFlow from '@paulmeller/docflow';
// DOCX to Markdown
await new DocFlow()
.load('document.docx')
.save('document.md', 'markdown');
// Markdown to DOCX
await new DocFlow()
.load('document.md')
.save('document.docx');
// DOCX with tables to Markdown (tables preserved with formatting)
await new DocFlow()
.load('report-with-tables.docx')
.save('report.md', 'markdown');
// Extract plain text for analysis
const plainText = await new DocFlow()
.load('document.docx')
.export('text');
console.log(plainText); // All text without formatting
Example 2: JSON Import/Export
import DocFlow from '@paulmeller/docflow';
// Export DOCX to JSON (for storage or transmission)
const doc1 = new DocFlow();
await doc1.load('report.docx');
const jsonData = await doc1.export('json');
await fs.writeFile('report.json', JSON.stringify(jsonData, null, 2));
// Import JSON and convert to Markdown
const doc2 = new DocFlow();
await doc2.load('report.json'); // Auto-detects .json extension
const markdown = await doc2.export('markdown');
// Load JSON from buffer (e.g., from API or virtual filesystem)
const jsonString = JSON.stringify(jsonData);
const buffer = Buffer.from(jsonString, 'utf-8');
const doc3 = new DocFlow();
await doc3.load(buffer, { format: 'json' }); // Explicit format for buffers
const html = await doc3.export('html');
// Load JSON from content string (when you have it in memory)
const doc4 = new DocFlow();
await doc4.loadContent(jsonString); // Auto-detects JSON
const docx = await doc4.export('docx');
Example 3: Heading Standardization
import { DocFlow, Transforms } from '@paulmeller/docflow';
const doc = new DocFlow();
await doc
.load('document.docx')
.transform('heading[level=1]', node => ({
...node,
text: Transforms.toTitleCase(node.text)
}))
.transform('heading[level=2]', node => ({
...node,
text: Transforms.toSentenceCase(node.text)
}))
.save('standardized.docx');
Example 4: Template Filling
const doc = new DocFlow();
await doc
.load('template.docx')
.transform('text', Transforms.replace(/\{name\}/g, 'John Doe'))
.transform('text', Transforms.replace(/\{date\}/g, new Date().toLocaleDateString()))
.transform('text', Transforms.replace(/\{company\}/g, 'Acme Corp'))
.save('filled-document.docx');
Example 5: Content Analysis
const doc = new DocFlow();
await doc.load('document.docx');
// Find all headings
const headings = doc.query('heading');
console.log(`Document has ${headings.count} headings`);
// Find specific content
const importantSections = doc.query(node =>
node.type === 'heading' &&
node.text?.toLowerCase().includes('important')
);
console.log('Important sections:');
importantSections.map(h => console.log(`- ${h.text}`));
Example 6: Batch Processing
import { BatchProcessor } from '@paulmeller/docflow';
const batch = new BatchProcessor({
concurrency: 5
});
const results = await batch.process(
['doc1.docx', 'doc2.docx', 'doc3.docx'],
async (doc, file) => {
await doc
.transform('heading[level=1]', node => ({
...node,
attrs: { ...node.attrs, level: 2 }
}))
.save(file.replace('.docx', '-updated.docx'));
return { processed: true };
}
);
console.log(`✓ Processed ${results.successful} files`);
console.log(`✗ Failed ${results.failed} files`);
Example 7: Complex Pipeline
const doc = new DocFlow();
await doc
.load('input.docx')
// Step 1: Standardize heading levels
.transform('heading[level=1]', node => ({
...node,
text: node.text.toUpperCase()
}))
// Step 2: Remove empty paragraphs
.transformDocument(json => {
json.content = json.content.filter(node =>
node.type !== 'paragraph' || node.content?.length > 0
);
return json;
})
// Step 3: Add metadata
.transformDocument(json => ({
...json,
attrs: {
...json.attrs,
processed: true,
timestamp: Date.now()
}
}))
.save('processed.docx');
Example 8: Document Structure Report
const doc = new DocFlow();
await doc.load('document.docx');
const json = doc.toJSON();
// Analyze structure
const stats = {
headings: doc.query('heading').count,
paragraphs: doc.query('paragraph').count,
h1: doc.query('heading[level=1]').count,
h2: doc.query('heading[level=2]').count,
h3: doc.query('heading[level=3]').count
};
console.log('Document Structure:');
console.log(JSON.stringify(stats, null, 2));
// Table of contents
const toc = doc.query('heading')
.map(h => `${' '.repeat(h.attrs.level - 1)}- ${h.text}`)
.join('\n');
console.log('\nTable of Contents:');
console.log(toc);
Example 9: Error Handling
// Mode 1: Throw on error (default)
try {
await new DocFlow()
.load('missing.docx')
.save('output.docx');
} catch (error) {
console.error('Failed:', error.message);
}
// Mode 2: Collect errors
const doc = new DocFlow({ errorMode: 'collect' });
await doc
.load('input.docx')
.transform('invalid-selector', node => node)
.save('output.docx');
const errors = doc.getErrors();
if (errors.length > 0) {
console.error('Errors occurred:', errors);
}
// Mode 3: Silent (ignore errors)
const doc2 = new DocFlow({ errorMode: 'silent' });
await doc2.load('maybe-exists.docx').save('output.docx');
Example 10: Custom Transformations
// Define custom transformation
function addTimestamp(node) {
if (node.type === 'paragraph') {
return {
...node,
attrs: {
...node.attrs,
timestamp: new Date().toISOString()
}
};
}
return node;
}
// Apply it
await doc
.load('document.docx')
.transform('paragraph', addTimestamp)
.save('timestamped.docx');
Example 11: Conditional Processing
const doc = new DocFlow();
await doc.load('document.docx');
// Only process if conditions are met
const validation = doc.validate();
if (validation.valid) {
const headings = doc.query('heading');
if (headings.count > 0) {
await doc
.transform('heading', node => ({
...node,
text: `📌 ${node.text}`
}))
.save('processed.docx');
}
}
Format Auto-Detection
DocFlow automatically detects formats to make your code simpler and more intuitive.
load() behavior:
- Auto-detects format from file extension: .docx, .md, .html, .json, .txt
- Use the format option to override: await doc.load('file.txt', { format: 'markdown' })
- For Buffers, format defaults to 'docx' unless you specify otherwise
// Auto-detected from extension
await doc.load('document.docx'); // → format: 'docx'
await doc.load('data.json'); // → format: 'json'
await doc.load('readme.md'); // → format: 'markdown'
// Explicit format for buffers
const buffer = Buffer.from(jsonString, 'utf-8');
await doc.load(buffer, { format: 'json' });
loadContent() behavior:
- Auto-detects JSON: looks for the {"type": "doc", ...} pattern
- Auto-detects HTML: looks for <html>, <h1>, <p>, etc.
- Auto-detects Markdown: everything else is treated as markdown
- No format parameter needed; detection is automatic
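For intuition, the detection rules above can be sketched in plain JavaScript. This is an illustration of the documented heuristic, not DocFlow's actual implementation, and the function name detectFormat is hypothetical:

```javascript
// Hypothetical sketch of the loadContent() detection heuristic described above.
function detectFormat(content) {
  const trimmed = content.trim();
  // JSON: a serialized ProseMirror document, e.g. {"type": "doc", ...}
  if (trimmed.startsWith('{')) {
    try {
      if (JSON.parse(trimmed).type === 'doc') return 'json';
    } catch {
      // Not valid JSON; fall through to the other checks
    }
  }
  // HTML: common tags such as <html>, <h1>, <p>
  if (/<\s*(html|h[1-6]|p|div|ul|ol|table)[\s/>]/i.test(trimmed)) return 'html';
  // Everything else is treated as markdown
  return 'markdown';
}
```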
// All auto-detected
await doc.loadContent('# Heading\n\nText'); // → markdown
await doc.loadContent('<h1>Title</h1><p>Text</p>'); // → HTML
await doc.loadContent('{"type": "doc", ...}'); // → JSON
API Reference
DocFlow
Constructor
new DocFlow(options?: {
headless?: boolean; // Default: true
validateSchema?: boolean; // Default: true
errorMode?: 'throw' | 'collect' | 'silent'; // Default: 'throw'
})
Methods
load(source, options?)
Load a document from file or buffer.
load(source: string | Buffer, options?: {
format?: 'docx' | 'html' | 'markdown' | 'json' | 'text'
}): Promise<DocFlow>
loadContent(content)
Load content directly. Automatically detects if content is markdown, HTML, or plain text. Auto-initializes a blank document if none exists.
loadContent(content: string): Promise<DocFlow>
toJSON()
Get document as ProseMirror JSON.
toJSON(): ProseMirrorJSON
query(selector)
Query document structure.
query(selector: string | Function): QueryResult
Selectors:
"heading"- All heading nodes"heading[level=2]"- H2 headings"paragraph"- All paragraphsnode => condition- Custom predicate
transform(selector, transformer)
Transform matching nodes.
transform(
selector: string | Function,
transformer: (node) => node | Promise<node> | null
): Promise<DocFlow>
transformDocument(transformer)
Transform entire document.
transformDocument(
transformer: (json) => json | Promise<json>
): Promise<DocFlow>
export(format?, options?)
Export to format.
export(
format?: 'docx' | 'html' | 'markdown' | 'json' | 'text',
options?: ExportOptions
): Promise<Buffer | string | Object>
save(filepath, format?)
Save to file.
save(filepath: string, format?: string): Promise<DocFlow>
validate()
Validate document.
validate(): {
valid: boolean;
errors: string[];
warnings: string[];
document: ProseMirrorJSON;
}
getHistory()
Get operation history.
getHistory(): Operation[]
getErrors()
Get collected errors (when errorMode: 'collect').
getErrors(): ErrorRecord[]
QueryResult
Properties
- count: Number of matching nodes
- nodes: Array of found nodes
- selector: Selector used
Methods
first()
Get first match.
first(): Node | null
last()
Get last match.
last(): Node | null
map(fn)
Map over results.
map<T>(fn: (node, index) => T): T[]
filter(fn)
Filter results.
filter(fn: (node, index) => boolean): QueryResult
text()
Get concatenated text content.
text(): string
transform(transformer)
Transform all found nodes.
transform(transformer: Function): Promise<DocFlow>
BatchProcessor
Process multiple documents concurrently.
const batch = new BatchProcessor({
concurrency: 5,
errorMode: 'collect'
});
const results = await batch.process(
['file1.docx', 'file2.docx'],
async (doc, file) => {
// Process each document
await doc.transform(...).save(...);
return { success: true };
}
);
Transforms
Built-in transformation helpers.
import { Transforms } from '@paulmeller/docflow';
Transforms.toTitleCase(text) // "hello world" → "Hello World"
Transforms.toSentenceCase(text) // "HELLO WORLD" → "Hello world"
Transforms.replace(pattern, repl) // Replace matching text
Transforms.addPrefix(prefix) // Add prefix to text
Transforms.remove(condition) // Remove matching nodes
ProseMirror JSON Structure
Documents are represented as ProseMirror JSON:
{
"type": "doc",
"content": [
{
"type": "heading",
"attrs": { "level": 1 },
"content": [
{ "type": "text", "text": "Title" }
]
},
{
"type": "paragraph",
"content": [
{ "type": "text", "text": "Content" }
]
}
]
}
Common Node Types
- doc - Root document
- heading - Heading (attrs: level)
- paragraph - Paragraph
- text - Text content
- blockquote - Block quote
- codeBlock - Code block
- bulletList - Bullet list
- orderedList - Numbered list
- listItem - List item
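Given this node vocabulary, many analyses reduce to a simple recursive walk over the JSON. A self-contained sketch (no DocFlow APIs involved):

```javascript
// Sketch: recursively collect the text content of a ProseMirror JSON tree.
function extractText(node) {
  if (node.type === 'text') return node.text;
  return (node.content ?? []).map(extractText).join('');
}

const doc = {
  type: 'doc',
  content: [
    { type: 'heading', attrs: { level: 1 },
      content: [{ type: 'text', text: 'Title' }] },
    { type: 'paragraph',
      content: [{ type: 'text', text: 'Content' }] }
  ]
};
extractText(doc); // → 'TitleContent'
```

A real text exporter would also insert separators between block nodes; doc.export('text') handles that for you.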
Best Practices
1. Chain Operations
// ✓ Good - Chainable, readable
await doc
.load('input.docx')
.transform('heading', ...)
.transform('paragraph', ...)
.save('output.docx');
// ✗ Avoid - Verbose
await doc.load('input.docx');
await doc.transform('heading', ...);
await doc.transform('paragraph', ...);
await doc.save('output.docx');
2. Use Specific Selectors
// ✓ Good - Specific
doc.query('heading[level=2]')
// ✗ Avoid - Too broad
doc.query('heading').filter(h => h.attrs.level === 2)
3. Handle Errors
// ✓ Good - Handle errors
try {
await doc.load('file.docx');
} catch (error) {
console.error('Failed to load:', error);
}
// Or use collect mode
const doc = new DocFlow({ errorMode: 'collect' });
await doc.load('file.docx');
if (doc.getErrors().length > 0) {
// Handle errors
}
4. Validate Complex Transformations
// ✓ Good - Validate after transformation
await doc.transform(...);
const validation = doc.validate();
if (!validation.valid) {
console.error('Validation failed:', validation.errors);
}
5. Use Batch Processing for Multiple Files
// ✓ Good - Concurrent processing
const batch = new BatchProcessor({ concurrency: 5 });
await batch.process(files, pipeline);
// ✗ Avoid - Sequential processing
for (const file of files) {
await new DocFlow().load(file).transform(...).save(...);
}
TypeScript Usage
Full TypeScript support included:
import DocFlow, { QueryResult, Transforms } from '@paulmeller/docflow';
const doc = new DocFlow({
errorMode: 'collect'
});
await doc.load('document.docx');
const headings: QueryResult = doc.query('heading');
const count: number = headings.count;
Known Limitations
Due to limitations in the underlying SuperDoc library's DOCX conversion engine:
DOCX Round-Trip Limitations
- Lists: DOCX export/import loses list items beyond the first (a confirmed SuperDoc limitation, even when using the proper command API)
- Tables: ✅ Work perfectly for DOCX round-trips
- Recommendation: For list transformations, use Markdown↔HTML conversions instead of DOCX round-trips
Working Perfectly
- ✅ Markdown → HTML → Markdown: Full fidelity for lists, tables, links, formatting
- ✅ HTML → Markdown: Complete preservation of all content
- ✅ DOCX → Markdown/HTML: One-way conversion works well for extracting content
- ✅ JSON export: Perfect for analyzing document structure
Best Practices
// ✅ GOOD: Use HTML/Markdown for list transformations
await doc.createBlank();
await doc.loadContent('- Item 1\n- Item 2\n- Item 3');
const html = await doc.export('html'); // Preserves all items
const md = await doc.export('markdown'); // Preserves all items
// ⚠️ LIMITED: DOCX round-trips may lose list structure
await doc.load('document.docx');
const docx = await doc.export('docx');
await doc.load(docx); // May lose list items 2+
For applications requiring full DOCX round-trip fidelity with complex lists and tables, consider using Microsoft Word's native APIs or alternative DOCX libraries.
Upstream Issue Tracker: These limitations originate from the SuperDoc library. You can track progress at SuperDoc GitHub Issues.
Performance Tips
- Use batch processing for multiple files
- Reuse DocFlow instances when possible
- Use specific selectors to minimize traversal
- Validate only when necessary (disable with validateSchema: false)
- Handle large documents with streaming (future feature)
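The batch-processing tip boils down to a bounded worker pool. A dependency-free sketch of the pattern (processWithLimit is illustrative, not part of DocFlow; BatchProcessor provides this behavior for you):

```javascript
// Sketch: run `worker` over `items` with at most `limit` tasks in flight.
async function processWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const i = next++;          // claim the next item (synchronous, so safe)
      results[i] = await worker(items[i], i);
    }
  }
  // Launch up to `limit` runners that pull from the shared queue
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

Results come back in input order regardless of which tasks finish first.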
Troubleshooting
Error: ENAMETOOLONG: name too long
Error message:
Error: ENAMETOOLONG: name too long, open '{"type": "doc", "content": [...]}'
Cause: You're passing content to load() instead of a file path.
Fix: Use loadContent() for content strings:
// ❌ WRONG - Treats JSON string as file path
const json = await doc.export('json');
await doc.load(JSON.stringify(json), { format: 'json' });
// ✅ CORRECT - Use loadContent() for content
const json = await doc.export('json');
await doc.loadContent(JSON.stringify(json));
// OR use load() with a buffer
const buffer = Buffer.from(JSON.stringify(json), 'utf-8');
await doc.load(buffer, { format: 'json' });
Document fails to load
// Check file exists
if (fs.existsSync('document.docx')) {
await doc.load('document.docx');
}
// Force the format explicitly (path.extname returns '.docx', so strip the dot)
const format = path.extname('document.docx').slice(1);
await doc.load('document.docx', { format });
Transformation not applying
// Verify selector matches
const matches = doc.query('heading[level=2]');
console.log(`Found ${matches.count} matches`);
// Check transformation logic
await doc.transform('heading', node => {
console.log('Transforming:', node);
return { ...node, text: node.text.toUpperCase() };
});
Export format issues
// Explicitly specify format
await doc.save('output.docx', 'docx');
// Check supported formats
const formats = ['docx', 'html', 'markdown', 'json', 'text'];
License
MIT
Contributing
Contributions welcome! Please read CONTRIBUTING.md for guidelines.
Credits
Built on SuperDoc by Harbour Enterprises.
