file-to-html-converter
v1.0.0
Convert DOCX and PDF files to clean semantic HTML
file2html
Convert DOCX and PDF files to clean semantic HTML.
Features
- DOCX Support: Convert Microsoft Word documents to semantic HTML
- PDF Support: Convert PDF files to semantic HTML
- Clean Output: Generates semantic HTML with proper tags like <h1>, <h2>, <p>, <ul>, <li>, <strong>, <em>, <img>
- No Inline Styles: Output is clean HTML without inline styles or absolute positioning
- TypeScript Support: Full TypeScript definitions included
Installation
npm install file2html
Usage
DOCX to HTML
import { docxToHtml } from 'file2html';
const html = await docxToHtml('document.docx');
console.log(html);
PDF to HTML
import { pdfToHtml } from 'file2html';
const html = await pdfToHtml('document.pdf');
console.log(html);
Complete Example
import { docxToHtml, pdfToHtml } from 'file2html';
async function convertFiles() {
  try {
    // Convert DOCX file
    const docxHtml = await docxToHtml('sample.docx');
    console.log('DOCX HTML:', docxHtml);
    // Convert PDF file
    const pdfHtml = await pdfToHtml('sample.pdf');
    console.log('PDF HTML:', pdfHtml);
  } catch (error) {
    console.error('Conversion failed:', error.message);
  }
}
convertFiles();
API Reference
docxToHtml(filePath: string): Promise<string>
Converts a DOCX file to semantic HTML.
Parameters:
filePath (string): Path to the DOCX file
Returns:
Promise<string>: Clean semantic HTML
Features:
- Converts paragraphs to <p> tags
- Maps bold text to <strong> tags
- Maps italic text to <em> tags
- Converts tables to <table>, <tr>, <td> structure
- Detects headings based on paragraph styles and converts to <h1>, <h2>, etc.
- Extracts images and converts to <img> tags with base64 data URLs
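The bold and italic mapping listed above can be pictured with a small sketch. This is not file2html's implementation: the `Run` shape and the `escapeHtml` and `runsToParagraph` names are hypothetical, standing in for whatever the library derives from the runs in the WordprocessingML XML.

```typescript
// Hypothetical shape for a formatted text run; not part of the file2html API.
interface Run {
  text: string;
  bold?: boolean;
  italic?: boolean;
}

// Escape characters that would otherwise be read as markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wrap each run in <strong> and/or <em> as needed, then join into one <p>.
function runsToParagraph(runs: Run[]): string {
  const inner = runs
    .map((run) => {
      let html = escapeHtml(run.text);
      if (run.italic) html = `<em>${html}</em>`;
      if (run.bold) html = `<strong>${html}</strong>`;
      return html;
    })
    .join('');
  return `<p>${inner}</p>`;
}

const paragraph = runsToParagraph([
  { text: 'This is ' },
  { text: 'bold', bold: true },
  { text: ' and ' },
  { text: 'italic', italic: true },
  { text: ' text.' },
]);
console.log(paragraph);
// prints: <p>This is <strong>bold</strong> and <em>italic</em> text.</p>
```

Formatting is expressed only through tags here, which matches the stated goal of output without inline styles.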
pdfToHtml(filePath: string): Promise<string>
Converts a PDF file to semantic HTML.
Parameters:
filePath (string): Path to the PDF file
Returns:
Promise<string>: Clean semantic HTML
Features:
- Groups text into paragraphs
- Detects headings based on text patterns and formatting
- Converts to semantic HTML structure
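To make the paragraph grouping and heading detection more concrete, here is a minimal sketch of one possible heuristic. It is an assumption for illustration, not the algorithm file2html uses: blank lines separate blocks, and a block counts as a heading when it is a single short line that does not end in sentence punctuation. The names `looksLikeHeading`, `escapeText`, and `linesToHtml`, and the 60-character limit, are all hypothetical.

```typescript
// Hypothetical rule: a heading is a single short line that does not end
// in sentence punctuation. The 60-character limit is an arbitrary choice.
function looksLikeHeading(block: string[]): boolean {
  const [line] = block;
  if (block.length !== 1 || line === undefined) return false;
  return line.length <= 60 && !/[.!?:;,]$/.test(line);
}

// Escape characters that would otherwise be read as markup.
function escapeText(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Group lines into blocks separated by blank lines, then tag each block.
function linesToHtml(lines: string[]): string {
  const blocks: string[][] = [];
  let current: string[] = [];
  for (const raw of lines) {
    const line = raw.trim();
    if (line === '') {
      if (current.length > 0) blocks.push(current);
      current = [];
    } else {
      current.push(line);
    }
  }
  if (current.length > 0) blocks.push(current);

  return blocks
    .map((block) => {
      const text = escapeText(block.join(' '));
      return looksLikeHeading(block) ? `<h2>${text}</h2>` : `<p>${text}</p>`;
    })
    .join('\n');
}

const pdfHtml = linesToHtml([
  'Introduction',
  '',
  'PDF text usually arrives as separate lines,',
  'so consecutive lines are joined into one paragraph.',
]);
console.log(pdfHtml);
```

A heuristic like this can misclassify short one-line paragraphs as headings, which is one reason PDF output is worth reviewing after conversion.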
Output Format
The library generates clean, semantic HTML without inline styles:
<h1>Document Title</h1>
<p>This is a paragraph with <strong>bold text</strong> and <em>italic text</em>.</p>
<h2>Section Heading</h2>
<ul>
  <li>List item 1</li>
  <li>List item 2</li>
</ul>
<table>
  <tr>
    <td>Cell 1</td>
    <td>Cell 2</td>
  </tr>
</table>
<img src="data:image/png;base64,..." alt="">
Error Handling
Both functions throw errors for:
- Non-existent files
- Invalid file formats
- Corrupted files
- Permission issues
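The README does not say what the thrown errors look like, so the sketch below is hedged accordingly. It assumes that file-system failures may surface as Node.js-style errors carrying a `code` property (`ENOENT`, `EACCES`, `EPERM`), which file2html does not document, and falls back to the plain message otherwise. The `describeConversionError` helper is hypothetical.

```typescript
// Turn an unknown thrown value into a short, user-facing message.
// Assumption: file-system failures carry a Node.js-style `code` property.
// file2html does not document its error shape, so treat this as a guess.
function describeConversionError(error: unknown): string {
  if (typeof error === 'object' && error !== null && 'code' in error) {
    const code = (error as { code?: unknown }).code;
    if (code === 'ENOENT') return 'File not found';
    if (code === 'EACCES' || code === 'EPERM') return 'Permission denied';
  }
  if (error instanceof Error) return `Conversion failed: ${error.message}`;
  return `Conversion failed: ${String(error)}`;
}

// Simulated errors; real ones would come from docxToHtml or pdfToHtml.
const missing = Object.assign(new Error('no such file'), { code: 'ENOENT' });
console.log(describeConversionError(missing));
console.log(describeConversionError(new Error('not a valid DOCX archive')));
```

Taking the argument as `unknown` also suits TypeScript projects compiled with `strict`, where catch variables are typed `unknown` and `error.message` cannot be read without narrowing first.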
try {
  const html = await docxToHtml('invalid-file.docx');
} catch (error) {
  console.error('Conversion failed:', error.message);
}
Development
Building
npm run build
Testing
npm test
Running Tests in Watch Mode
npm run test:watch
Dependencies
- adm-zip: For extracting DOCX files
- fast-xml-parser: For parsing WordprocessingML XML
- pdf-parse: For extracting text from PDF files
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
