@parallelsoftware/doc-extract
v1.0.2
DocExtract
Intelligent document extraction package for TypeScript applications.
Overview
DocExtract is a TypeScript library designed to extract text content from various document formats including PDFs, Word documents, and images. It provides a simple, type-safe API for processing documents from URLs or Buffer contents.
Features
- 📄 Multiple Document Types: Support for PDF, DOC, DOCX, and ODT files
- 🖼️ Image Processing: Extract text from JPEG, PNG, JPG, and WebP images
- 🔧 Configurable: Customize allowed file types per instance
- 📝 TypeScript Support: Full type safety with comprehensive type definitions
- 🛡️ Input Validation: Built-in validation for document types and sources
- 🧪 Well Tested: Comprehensive test suite with high coverage
Installation
npm install @parallelsoftware/doc-extract
Quick Start
import { DocExtract, Document } from '@parallelsoftware/doc-extract'
// Create an instance with default settings
const extractor = new DocExtract()
// Extract from a document URL
const document: Document = {
filename: 'example.pdf',
type: 'application/pdf',
url: 'https://example.com/document.pdf'
}
const extractedText = await extractor.extractText(document)
console.log(extractedText)
API Reference
DocExtract Class
Constructor
new DocExtract(options?: DocExtractClientOptions)
Options:
- allowedImages?: ImageTypes[] - Array of allowed image MIME types
- allowedDocuments?: DocumentTypes[] - Array of allowed document MIME types
Methods
extractText(document: Document): Promise<string>
Extracts text content from the provided document.
Parameters:
- document: Document - The document to extract text from
Returns:
- Promise<string> - The extracted text content
Throws:
- Error - If the document has neither a URL nor contents
- Error - If the document type is not allowed
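Both documented failure modes can be checked before calling extractText. The following is a minimal sketch under stated assumptions, not the library's implementation: validateDocument, DocumentInput, DEFAULT_DOCUMENTS, and DEFAULT_IMAGES are hypothetical names, the type mirrors the Document type from this README (with Uint8Array standing in for Buffer, which is a Uint8Array subclass, so the sketch needs no Node typings), and the error messages are assumptions that only echo the substrings used in the Error Handling example.

```typescript
// Hypothetical stand-in for the README's Document type.
type DocumentInput = {
  filename: string
  type: string
  url?: string
  contents?: Uint8Array
}

// MIME types listed under "Supported File Types" in this README.
const DEFAULT_DOCUMENTS: readonly string[] = [
  'application/pdf',
  'application/msword',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'application/vnd.oasis.opendocument.text'
]
const DEFAULT_IMAGES: readonly string[] = [
  'image/jpeg',
  'image/png',
  'image/jpg',
  'image/webp'
]
const DEFAULT_ALLOWED: readonly string[] = DEFAULT_DOCUMENTS.concat(DEFAULT_IMAGES)

// Throws if the type is not in the allow-list or if no source is given.
// The order of the two checks and the message wording are assumptions.
function validateDocument(
  doc: DocumentInput,
  allowed: readonly string[] = DEFAULT_ALLOWED
): void {
  if (allowed.indexOf(doc.type) === -1) {
    throw new Error('Document type ' + doc.type + ' is not allowed')
  }
  if (!doc.url && !doc.contents) {
    throw new Error('Document must have either url or contents')
  }
}

// Usage: passes silently for a PDF with a URL.
validateDocument({
  filename: 'report.pdf',
  type: 'application/pdf',
  url: 'https://example.com/report.pdf'
})
```

Running a check like this up front lets a caller reject bad input without a network round trip, while the library's own validation remains the source of truth.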
Types
Document
type Document = {
filename: string
type: DocumentTypes | ImageTypes
url?: string
contents?: Buffer
}
DocumentTypes
type DocumentTypes =
| 'application/pdf'
| 'application/msword'
| 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
| 'application/vnd.oasis.opendocument.text'
ImageTypes
type ImageTypes = 'image/jpeg' | 'image/png' | 'image/jpg' | 'image/webp'
Usage Examples
Basic Usage
import { DocExtract, Document } from '@parallelsoftware/doc-extract'
const extractor = new DocExtract()
// Extract from PDF
const pdfDoc: Document = {
filename: 'report.pdf',
type: 'application/pdf',
url: 'https://example.com/report.pdf'
}
const text = await extractor.extractText(pdfDoc)
Custom Configuration
// Only allow PDFs and JPEG images
const extractor = new DocExtract({
allowedDocuments: ['application/pdf'],
allowedImages: ['image/jpeg']
})
Using Buffer Contents
import fs from 'fs'
const fileBuffer = fs.readFileSync('./document.pdf')
const document: Document = {
filename: 'document.pdf',
type: 'application/pdf',
contents: fileBuffer
}
const text = await extractor.extractText(document)
Error Handling
try {
  const text = await extractor.extractText(document)
  console.log('Extracted:', text)
} catch (error) {
  // Under strict TypeScript the caught value is `unknown`, so narrow it before reading `.message`
  const message = error instanceof Error ? error.message : String(error)
  if (message.includes('not allowed')) {
    console.error('File type not supported:', document.type)
  } else if (message.includes('url or contents')) {
    console.error('Document source missing')
  } else {
    console.error('Extraction failed:', message)
  }
}
Supported File Types
Documents (Default)
- PDF: application/pdf
- Word 97-2003: application/msword
- Word 2007+: application/vnd.openxmlformats-officedocument.wordprocessingml.document
- OpenDocument Text: application/vnd.oasis.opendocument.text
Images (Default)
- JPEG: image/jpeg
- PNG: image/png
- JPG: image/jpg
- WebP: image/webp
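Because the type field of a Document must be one of the MIME types above, callers that start from a bare filename need a way to pick the right value. The following is a small sketch of one approach, not part of the package: mimeTypeForFilename and MIME_BY_EXTENSION are hypothetical names, and the extension-to-type mapping is an assumption based on the lists above. It maps .jpg to image/jpeg, the registered media type, even though image/jpg is also accepted by default.

```typescript
// Hypothetical lookup table from file extension to a supported MIME type.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: 'application/pdf',
  doc: 'application/msword',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  odt: 'application/vnd.oasis.opendocument.text',
  jpeg: 'image/jpeg',
  jpg: 'image/jpeg',
  png: 'image/png',
  webp: 'image/webp'
}

// Returns the MIME type for a filename, or undefined when the
// extension is missing or not in the table.
function mimeTypeForFilename(filename: string): string | undefined {
  const dot = filename.lastIndexOf('.')
  if (dot === -1 || dot === filename.length - 1) {
    return undefined
  }
  const extension = filename.slice(dot + 1).toLowerCase()
  // Own-property check so names like "constructor" do not match inherited keys.
  if (!Object.prototype.hasOwnProperty.call(MIME_BY_EXTENSION, extension)) {
    return undefined
  }
  return MIME_BY_EXTENSION[extension]
}

// Usage: resolve the type before building a Document.
const resolvedType = mimeTypeForFilename('report.PDF')
console.log(resolvedType)
```

An extension is only a hint about a file's real format, so a lookup like this is best treated as a convenience for trusted inputs.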
Development
Prerequisites
- Node.js 16+
- npm or yarn
Setup
# Clone the repository
git clone <repository-url>
cd doc-extract
# Install dependencies
npm install
# Build the project
npm run build
Testing
# Run tests
npm test
# Run tests in watch mode
npm run test:watch
# Generate coverage report
npm run test:coverage
Linting and Formatting
# Lint code
npm run lint
# Fix linting issues
npm run lint:fix
# Format code
npm run format
# Check formatting
npm run format:check
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for new functionality
- Ensure all tests pass (npm test)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
MIT © Anton R. Menkveld
Changelog
1.0.0
- Initial release
- Support for PDF, DOC, DOCX, and image extraction
- Configurable file type restrictions
- TypeScript support with full type definitions
