DocFlow
A developer-friendly transformation engine built on SuperDoc
Transform documents programmatically with a clean, chainable API. Built on top of SuperDoc Editor, DocFlow makes it easy to load, query, transform, and export documents in multiple formats.
Features
- 🔄 Format Conversion: DOCX ↔ Markdown ↔ HTML ↔ JSON ↔ Plain Text
- 📊 Table Support: DOCX tables export cleanly to Markdown with formatting preserved
- 🔍 Powerful Queries: CSS-like selectors and predicate functions
- ⚡ Batch Processing: Process multiple documents concurrently
- 🎯 Type-Safe: Full TypeScript support
- 🔗 Chainable API: Fluent, readable transformations
- 🚫 Error Handling: Multiple error handling modes
- 📦 Headless: Runs in Node.js without a browser
Installation
npm install @paulmeller/docflow
Quick Start
import DocFlow from '@paulmeller/docflow';
// Simple transformation
await new DocFlow()
.load('document.docx')
.transform('heading[level=2]', node => ({
...node,
text: node.text.toUpperCase()
}))
.save('output.docx');
⚠️ Important: load() vs loadContent()
DO NOT confuse these two methods:
| Method | Purpose | First Parameter | When to Use |
|--------|---------|----------------|-------------|
| load() | Load from file path or buffer | File path string or Buffer | Reading files from disk |
| loadContent() | Load from content string | Content string (markdown/HTML/JSON) | When you have content in memory |
Common Mistake ❌
// ❌ WRONG - This treats JSON string as a file path!
const jsonString = JSON.stringify(docJson);
await doc.load(jsonString, { format: 'json' });
// Error: ENAMETOOLONG or "file not found"
// ❌ WRONG - This treats markdown as a file path!
const markdown = '# Hello\n\nWorld';
await doc.load(markdown, { format: 'markdown' });
Correct Usage ✅
// ✅ CORRECT - Use load() for file paths
await doc.load('document.docx');
await doc.load('data.json', { format: 'json' });
// ✅ CORRECT - Use loadContent() for content strings
const jsonString = JSON.stringify(docJson);
await doc.loadContent(jsonString);
const markdown = '# Hello\n\nWorld';
await doc.loadContent(markdown);
Core Concepts
1. Load Documents
Load from file, buffer, or content:
const doc = new DocFlow();
// From file path
await doc.load('document.docx');
// From buffer (binary data)
const buffer = fs.readFileSync('document.docx');
await doc.load(buffer);
// From JSON buffer with explicit format
const jsonBuffer = Buffer.from(jsonString, 'utf-8');
await doc.load(jsonBuffer, { format: 'json' });
// From content string (auto-detects markdown, HTML, JSON, or text)
await doc.loadContent('<h1>Title</h1><p>Content</p>');
await doc.loadContent('# Title\n\nContent');
await doc.loadContent(JSON.stringify(docJson));
2. Query Structure
Find nodes using selectors:
// CSS-like selectors
const headings = doc.query('heading[level=2]');
const paragraphs = doc.query('paragraph');
// Predicate functions
const longParagraphs = doc.query(node =>
node.type === 'paragraph' &&
node.content?.length > 100
);
// Work with results
console.log(headings.count); // Number of matches
console.log(headings.text()); // Concatenated text
console.log(headings.first()); // First match
console.log(headings.last()); // Last match
3. Transform Content
Modify document structure:
import { Transforms } from '@paulmeller/docflow';
// Transform matching nodes
await doc.transform('heading[level=2]', node => ({
...node,
text: Transforms.toTitleCase(node.text)
}));
// Transform with built-in helpers
await doc.transform('text', Transforms.replace(/foo/g, 'bar'));
await doc.transform('text', Transforms.addPrefix('> '));
// Transform entire document
await doc.transformDocument(json => {
// Modify entire document structure
return modifiedJSON;
});
4. Export & Save
Export to various formats:
// Export to buffer/string
const docx = await doc.export('docx');
const html = await doc.export('html');
const markdown = await doc.export('markdown');
const json = await doc.export('json');
const text = await doc.export('text');
// Save directly to file
await doc.save('output.docx');
await doc.save('output.md', 'markdown');
await doc.save('output.html', 'html');
await doc.save('output.json', 'json');
await doc.save('output.txt', 'text');
Examples
Example 1: Format Conversion
import DocFlow from '@paulmeller/docflow';
// DOCX to Markdown
await new DocFlow()
.load('document.docx')
.save('document.md', 'markdown');
// Markdown to DOCX
await new DocFlow()
.load('document.md')
.save('document.docx');
// DOCX with tables to Markdown (tables preserved with formatting)
await new DocFlow()
.load('report-with-tables.docx')
.save('report.md', 'markdown');
// Extract plain text for analysis
const plainText = await new DocFlow()
.load('document.docx')
.export('text');
console.log(plainText); // All text without formatting
Example 2: JSON Import/Export
import DocFlow from '@paulmeller/docflow';
// Export DOCX to JSON (for storage or transmission)
const doc1 = new DocFlow();
await doc1.load('report.docx');
const jsonData = await doc1.export('json');
await fs.writeFile('report.json', JSON.stringify(jsonData, null, 2));
// Import JSON and convert to Markdown
const doc2 = new DocFlow();
await doc2.load('report.json'); // Auto-detects .json extension
const markdown = await doc2.export('markdown');
// Load JSON from buffer (e.g., from API or virtual filesystem)
const jsonString = JSON.stringify(jsonData);
const buffer = Buffer.from(jsonString, 'utf-8');
const doc3 = new DocFlow();
await doc3.load(buffer, { format: 'json' }); // Explicit format for buffers
const html = await doc3.export('html');
// Load JSON from content string (when you have it in memory)
const doc4 = new DocFlow();
await doc4.loadContent(jsonString); // Auto-detects JSON
const docx = await doc4.export('docx');
Example 3: Heading Standardization
import { DocFlow, Transforms } from '@paulmeller/docflow';
const doc = new DocFlow();
await doc
.load('document.docx')
.transform('heading[level=1]', node => ({
...node,
text: Transforms.toTitleCase(node.text)
}))
.transform('heading[level=2]', node => ({
...node,
text: Transforms.toSentenceCase(node.text)
}))
.save('standardized.docx');
Example 4: Template Filling
const doc = new DocFlow();
await doc
.load('template.docx')
.transform('text', Transforms.replace(/\{name\}/g, 'John Doe'))
.transform('text', Transforms.replace(/\{date\}/g, new Date().toLocaleDateString()))
.transform('text', Transforms.replace(/\{company\}/g, 'Acme Corp'))
.save('filled-document.docx');
Example 5: Content Analysis
const doc = new DocFlow();
await doc.load('document.docx');
// Find all headings
const headings = doc.query('heading');
console.log(`Document has ${headings.count} headings`);
// Find specific content
const importantSections = doc.query(node =>
node.type === 'heading' &&
node.text?.toLowerCase().includes('important')
);
console.log('Important sections:');
importantSections.map(h => console.log(`- ${h.text}`));
Example 6: Batch Processing
import { BatchProcessor } from '@paulmeller/docflow';
const batch = new BatchProcessor({
concurrency: 5
});
const results = await batch.process(
['doc1.docx', 'doc2.docx', 'doc3.docx'],
async (doc, file) => {
await doc
.transform('heading[level=1]', node => ({
...node,
attrs: { ...node.attrs, level: 2 }
}))
.save(file.replace('.docx', '-updated.docx'));
return { processed: true };
}
);
console.log(`✓ Processed ${results.successful} files`);
console.log(`✗ Failed ${results.failed} files`);
Example 7: Complex Pipeline
const doc = new DocFlow();
await doc
.load('input.docx')
// Step 1: Standardize heading levels
.transform('heading[level=1]', node => ({
...node,
text: node.text.toUpperCase()
}))
// Step 2: Remove empty paragraphs
.transformDocument(json => {
json.content = json.content.filter(node =>
node.type !== 'paragraph' || node.content?.length > 0
);
return json;
})
// Step 3: Add metadata
.transformDocument(json => ({
...json,
attrs: {
...json.attrs,
processed: true,
timestamp: Date.now()
}
}))
.save('processed.docx');
Example 8: Document Structure Report
const doc = new DocFlow();
await doc.load('document.docx');
const json = doc.toJSON();
// Analyze structure
const stats = {
headings: doc.query('heading').count,
paragraphs: doc.query('paragraph').count,
h1: doc.query('heading[level=1]').count,
h2: doc.query('heading[level=2]').count,
h3: doc.query('heading[level=3]').count
};
console.log('Document Structure:');
console.log(JSON.stringify(stats, null, 2));
// Table of contents
const toc = doc.query('heading')
.map(h => `${' '.repeat(h.attrs.level - 1)}- ${h.text}`)
.join('\n');
console.log('\nTable of Contents:');
console.log(toc);
Example 9: Error Handling
// Mode 1: Throw on error (default)
try {
await new DocFlow()
.load('missing.docx')
.save('output.docx');
} catch (error) {
console.error('Failed:', error.message);
}
// Mode 2: Collect errors
const doc = new DocFlow({ errorMode: 'collect' });
await doc
.load('input.docx')
.transform('invalid-selector', node => node)
.save('output.docx');
const errors = doc.getErrors();
if (errors.length > 0) {
console.error('Errors occurred:', errors);
}
// Mode 3: Silent (ignore errors)
const doc2 = new DocFlow({ errorMode: 'silent' });
await doc2.load('maybe-exists.docx').save('output.docx');
Example 10: Custom Transformations
// Define custom transformation
function addTimestamp(node) {
if (node.type === 'paragraph') {
return {
...node,
attrs: {
...node.attrs,
timestamp: new Date().toISOString()
}
};
}
return node;
}
// Apply it
await doc
.load('document.docx')
.transform('paragraph', addTimestamp)
.save('timestamped.docx');
Example 11: Conditional Processing
const doc = new DocFlow();
await doc.load('document.docx');
// Only process if conditions are met
const validation = doc.validate();
if (validation.valid) {
const headings = doc.query('heading');
if (headings.count > 0) {
await doc
.transform('heading', node => ({
...node,
text: `📌 ${node.text}`
}))
.save('processed.docx');
}
}
Format Auto-Detection
DocFlow automatically detects formats to make your code simpler and more intuitive.
load() behavior:
- Auto-detects format from file extension: .docx, .md, .html, .json, .txt
- Use the format option to override: await doc.load('file.txt', { format: 'markdown' })
- For Buffers, format defaults to 'docx' unless you specify otherwise
// Auto-detected from extension
await doc.load('document.docx'); // → format: 'docx'
await doc.load('data.json'); // → format: 'json'
await doc.load('readme.md'); // → format: 'markdown'
// Explicit format for buffers
const buffer = Buffer.from(jsonString, 'utf-8');
await doc.load(buffer, { format: 'json' });
loadContent() behavior:
- Auto-detects JSON: looks for the {"type": "doc", ...} pattern
- Auto-detects HTML: looks for <html>, <h1>, <p>, etc.
- Auto-detects Markdown: everything else is treated as markdown
- No format parameter needed; detection is automatic
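For intuition, the detection rules above can be sketched in plain JavaScript. This is an illustration of the documented heuristic, not DocFlow's actual implementation, and the function name detectFormat is hypothetical:

```javascript
// Hypothetical sketch of the loadContent() detection heuristic described above.
function detectFormat(content) {
  const trimmed = content.trim();
  // JSON: a serialized ProseMirror document, e.g. {"type": "doc", ...}
  if (trimmed.startsWith('{')) {
    try {
      if (JSON.parse(trimmed).type === 'doc') return 'json';
    } catch {
      // Not valid JSON; fall through to the other checks
    }
  }
  // HTML: common tags such as <html>, <h1>, <p>
  if (/<\s*(html|h[1-6]|p|div|ul|ol|table)[\s/>]/i.test(trimmed)) return 'html';
  // Everything else is treated as markdown
  return 'markdown';
}
```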
// All auto-detected
await doc.loadContent('# Heading\n\nText'); // → markdown
await doc.loadContent('<h1>Title</h1><p>Text</p>'); // → HTML
await doc.loadContent('{"type": "doc", ...}'); // → JSON
API Reference
DocFlow
Constructor
new DocFlow(options?: {
headless?: boolean; // Default: true
validateSchema?: boolean; // Default: true
errorMode?: 'throw' | 'collect' | 'silent'; // Default: 'throw'
})
Methods
load(source, options?)
Load a document from file or buffer.
load(source: string | Buffer, options?: {
format?: 'docx' | 'html' | 'markdown' | 'json' | 'text'
}): Promise<DocFlow>
loadContent(content)
Load content directly. Automatically detects if content is markdown, HTML, or plain text. Auto-initializes a blank document if none exists.
loadContent(content: string): Promise<DocFlow>
toJSON()
Get document as ProseMirror JSON.
toJSON(): ProseMirrorJSON
query(selector)
Query document structure.
query(selector: string | Function): QueryResult
Selectors:
"heading"- All heading nodes"heading[level=2]"- H2 headings"paragraph"- All paragraphsnode => condition- Custom predicate
transform(selector, transformer)
Transform matching nodes.
transform(
selector: string | Function,
transformer: (node) => node | Promise<node> | null
): Promise<DocFlow>
transformDocument(transformer)
Transform entire document.
transformDocument(
transformer: (json) => json | Promise<json>
): Promise<DocFlow>
export(format?, options?)
Export to format.
export(
format?: 'docx' | 'html' | 'markdown' | 'json' | 'text',
options?: ExportOptions
): Promise<Buffer | string | Object>
save(filepath, format?)
Save to file.
save(filepath: string, format?: string): Promise<DocFlow>
validate()
Validate document.
validate(): {
valid: boolean;
errors: string[];
warnings: string[];
document: ProseMirrorJSON;
}
getHistory()
Get operation history.
getHistory(): Operation[]
getErrors()
Get collected errors (when errorMode: 'collect').
getErrors(): ErrorRecord[]
QueryResult
Properties
- count: Number of matching nodes
- nodes: Array of found nodes
- selector: Selector used
Methods
first()
Get first match.
first(): Node | null
last()
Get last match.
last(): Node | null
map(fn)
Map over results.
map<T>(fn: (node, index) => T): T[]
filter(fn)
Filter results.
filter(fn: (node, index) => boolean): QueryResult
text()
Get concatenated text content.
text(): string
transform(transformer)
Transform all found nodes.
transform(transformer: Function): Promise<DocFlow>
BatchProcessor
Process multiple documents concurrently.
const batch = new BatchProcessor({
concurrency: 5,
errorMode: 'collect'
});
const results = await batch.process(
['file1.docx', 'file2.docx'],
async (doc, file) => {
// Process each document
await doc.transform(...).save(...);
return { success: true };
}
);
Transforms
Built-in transformation helpers.
import { Transforms } from '@paulmeller/docflow';
Transforms.toTitleCase(text) // "hello world" → "Hello World"
Transforms.toSentenceCase(text) // "HELLO WORLD" → "Hello world"
Transforms.replace(pattern, repl) // Replace matching text
Transforms.addPrefix(prefix) // Add prefix to text
Transforms.remove(condition) // Remove matching nodes
ProseMirror JSON Structure
Documents are represented as ProseMirror JSON:
{
"type": "doc",
"content": [
{
"type": "heading",
"attrs": { "level": 1 },
"content": [
{ "type": "text", "text": "Title" }
]
},
{
"type": "paragraph",
"content": [
{ "type": "text", "text": "Content" }
]
}
]
}
Common Node Types
- doc - Root document
- heading - Heading (attrs: level)
- paragraph - Paragraph
- text - Text content
- blockquote - Block quote
- codeBlock - Code block
- bulletList - Bullet list
- orderedList - Numbered list
- listItem - List item
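Given this node vocabulary, many analyses reduce to a simple recursive walk over the JSON. A self-contained sketch (no DocFlow APIs involved):

```javascript
// Sketch: recursively collect the text content of a ProseMirror JSON tree.
function extractText(node) {
  if (node.type === 'text') return node.text;
  return (node.content ?? []).map(extractText).join('');
}

const doc = {
  type: 'doc',
  content: [
    { type: 'heading', attrs: { level: 1 },
      content: [{ type: 'text', text: 'Title' }] },
    { type: 'paragraph',
      content: [{ type: 'text', text: 'Content' }] }
  ]
};
extractText(doc); // → 'TitleContent'
```

A real text exporter would also insert separators between block nodes; doc.export('text') handles that for you.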
Best Practices
1. Chain Operations
// ✓ Good - Chainable, readable
await doc
.load('input.docx')
.transform('heading', ...)
.transform('paragraph', ...)
.save('output.docx');
// ✗ Avoid - Verbose
await doc.load('input.docx');
await doc.transform('heading', ...);
await doc.transform('paragraph', ...);
await doc.save('output.docx');
2. Use Specific Selectors
// ✓ Good - Specific
doc.query('heading[level=2]')
// ✗ Avoid - Too broad
doc.query('heading').filter(h => h.attrs.level === 2)
3. Handle Errors
// ✓ Good - Handle errors
try {
await doc.load('file.docx');
} catch (error) {
console.error('Failed to load:', error);
}
// Or use collect mode
const doc = new DocFlow({ errorMode: 'collect' });
await doc.load('file.docx');
if (doc.getErrors().length > 0) {
// Handle errors
}
4. Validate Complex Transformations
// ✓ Good - Validate after transformation
await doc.transform(...);
const validation = doc.validate();
if (!validation.valid) {
console.error('Validation failed:', validation.errors);
}
5. Use Batch Processing for Multiple Files
// ✓ Good - Concurrent processing
const batch = new BatchProcessor({ concurrency: 5 });
await batch.process(files, pipeline);
// ✗ Avoid - Sequential processing
for (const file of files) {
await new DocFlow().load(file).transform(...).save(...);
}
TypeScript Usage
Full TypeScript support included:
import DocFlow, { QueryResult, Transforms } from '@paulmeller/docflow';
const doc = new DocFlow({
errorMode: 'collect'
});
await doc.load('document.docx');
const headings: QueryResult = doc.query('heading');
const count: number = headings.count;
Known Limitations
Due to limitations in the underlying SuperDoc library's DOCX conversion engine:
DOCX Round-Trip Limitations
- Lists: DOCX export/import loses list items beyond the first (a confirmed SuperDoc limitation, even when using the proper command API)
- Tables: ✅ Work perfectly for DOCX round-trips
- Recommendation: For list transformations, use Markdown↔HTML conversions instead of DOCX round-trips
Working Perfectly
- ✅ Markdown → HTML → Markdown: Full fidelity for lists, tables, links, formatting
- ✅ HTML → Markdown: Complete preservation of all content
- ✅ DOCX → Markdown/HTML: One-way conversion works well for extracting content
- ✅ JSON export: Perfect for analyzing document structure
Best Practices
// ✅ GOOD: Use HTML/Markdown for list transformations
await doc.createBlank();
await doc.loadContent('- Item 1\n- Item 2\n- Item 3');
const html = await doc.export('html'); // Preserves all items
const md = await doc.export('markdown'); // Preserves all items
// ⚠️ LIMITED: DOCX round-trips may lose list structure
await doc.load('document.docx');
const docx = await doc.export('docx');
await doc.load(docx); // May lose list items 2+
For applications requiring full DOCX round-trip fidelity with complex lists and tables, consider using Microsoft Word's native APIs or alternative DOCX libraries.
Upstream Issue Tracker: These limitations originate from the SuperDoc library. You can track progress at SuperDoc GitHub Issues.
Performance Tips
- Use batch processing for multiple files
- Reuse DocFlow instances when possible
- Use specific selectors to minimize traversal
- Validate only when necessary (disable with validateSchema: false)
- Handle large documents with streaming (future feature)
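The batch-processing tip boils down to a bounded worker pool. A dependency-free sketch of the pattern (processWithLimit is illustrative, not part of DocFlow; BatchProcessor provides this behavior for you):

```javascript
// Sketch: run `worker` over `items` with at most `limit` tasks in flight.
async function processWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const i = next++;          // claim the next item (synchronous, so safe)
      results[i] = await worker(items[i], i);
    }
  }
  // Launch up to `limit` runners that pull from the shared queue
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

Results come back in input order regardless of which tasks finish first.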
Troubleshooting
Error: ENAMETOOLONG: name too long
Error message:
Error: ENAMETOOLONG: name too long, open '{"type": "doc", "content": [...]}'
Cause: You're passing content to load() instead of a file path.
Fix: Use loadContent() for content strings:
// ❌ WRONG - Treats JSON string as file path
const json = await doc.export('json');
await doc.load(JSON.stringify(json), { format: 'json' });
// ✅ CORRECT - Use loadContent() for content
const json = await doc.export('json');
await doc.loadContent(JSON.stringify(json));
// OR use load() with a buffer
const buffer = Buffer.from(JSON.stringify(json), 'utf-8');
await doc.load(buffer, { format: 'json' });
Document fails to load
// Check file exists
if (fs.existsSync('document.docx')) {
await doc.load('document.docx');
}
// Force the format explicitly (path.extname returns '.docx', so strip the dot)
const format = path.extname('document.docx').slice(1);
await doc.load('document.docx', { format });
Transformation not applying
// Verify selector matches
const matches = doc.query('heading[level=2]');
console.log(`Found ${matches.count} matches`);
// Check transformation logic
await doc.transform('heading', node => {
console.log('Transforming:', node);
return { ...node, text: node.text.toUpperCase() };
});
Export format issues
// Explicitly specify format
await doc.save('output.docx', 'docx');
// Check supported formats
const formats = ['docx', 'html', 'markdown', 'json', 'text'];
License
MIT
Contributing
Contributions welcome! Please read CONTRIBUTING.md for guidelines.
Credits
Built on SuperDoc by Harbour Enterprises.
