doc-extract

v1.0.4

Published

8 months ago

A Node.js library for reading and extracting text from various document formats (PDF, DOCX, DOC, PPT, PPTX, TXT)


haider_nakara

document reader pdf docx powerpoint text-extraction file-parser nodejs typescript

Doc Extract

A powerful Node.js library for reading and extracting text from various document formats including PDF, DOCX, DOC, PPT, PPTX, and TXT files.

Features

📄 Multiple Format Support: PDF, DOCX, DOC, PPT, PPTX, TXT
🔍 Text Extraction: Extract clean text content from documents
📊 Rich Metadata: Get document statistics (word count, character count, pages, etc.)
💾 Buffer Support: Read documents from memory buffers
🔧 TypeScript: Full TypeScript support with type definitions
🚀 Promise-based: Modern async/await API
🛡️ Error Handling: Comprehensive error handling with custom error types

Installation

npm install doc-extract

System Dependencies

This library depends on some system packages for full functionality:

For PDF support:

No additional dependencies required

For PowerPoint and DOC support:

# Ubuntu/Debian
sudo apt-get install antiword unrtf poppler-utils tesseract-ocr

# macOS
brew install antiword unrtf poppler tesseract

# Windows
# Install poppler and tesseract manually or use chocolatey:
choco install poppler tesseract

Quick Start

import DocumentReader, { readDocument } from "doc-extract";

// Simple usage
const content = await readDocument("./path/to/document.pdf");
console.log(content.text);
console.log(content.metadata);

// Using the class for more control
const reader = new DocumentReader({ debug: true });
const docxContent = await reader.readDocument("./path/to/document.docx");

API Reference

Class: DocumentReader

Constructor

new DocumentReader(options?: { debug?: boolean })

options.debug: Enable debug logging (default: false)

Methods

readDocument(filePath: string): Promise<DocumentContent>

Read a document from file path.

const reader = new DocumentReader();
const content = await reader.readDocument("./document.pdf");

readDocumentFromBuffer(buffer: Buffer, fileName: string, mimeType?: string): Promise<DocumentContent>

Read a document from a Buffer.

import { readFileSync } from "fs";

// Load the file into memory, then pass the buffer along with its file name
const buffer = readFileSync("./document.pdf");
const content = await reader.readDocumentFromBuffer(buffer, "document.pdf");
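
The optional mimeType argument is not described further here. When a buffer arrives without one, a caller could derive it from the file name. The sketch below is only an illustration: guessMimeType and MIME_BY_EXTENSION are hypothetical names that are not part of doc-extract, and the table holds the standard MIME types for the six listed formats.

```typescript
import { extname } from "node:path";

// Standard MIME types for the formats this README lists as supported.
const MIME_BY_EXTENSION: Record<string, string> = {
  ".pdf": "application/pdf",
  ".docx":
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  ".doc": "application/msword",
  ".pptx":
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
  ".ppt": "application/vnd.ms-powerpoint",
  ".txt": "text/plain",
};

// Hypothetical helper (not a doc-extract export): look up a MIME type from
// the file extension, ignoring case. Returns undefined for other extensions.
function guessMimeType(fileName: string): string | undefined {
  return MIME_BY_EXTENSION[extname(fileName).toLowerCase()];
}
```

The result could then be passed as the third argument, for example reader.readDocumentFromBuffer(buffer, name, guessMimeType(name)).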

readMultipleDocuments(filePaths: string[]): Promise<DocumentContent[]>

Read multiple documents at once.

const contents = await reader.readMultipleDocuments([
  "./doc1.pdf",
  "./doc2.docx",
  "./doc3.pptx",
]);

readMultipleFromBuffers(buffers: Array<{buffer: Buffer, fileName: string, mimeType?: string}>): Promise<DocumentContent[]>

Read multiple documents from buffers.

const contents = await reader.readMultipleFromBuffers([
  { buffer: buffer1, fileName: "doc1.pdf" },
  { buffer: buffer2, fileName: "doc2.docx" },
]);
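
This README does not say how the batch methods behave when one document in the batch fails. If per-file isolation matters, one option is to call a single-document method for each path and record every outcome. The sketch below assumes nothing about doc-extract itself: readEach and ReadOutcome are hypothetical names, and the read callback stands in for whichever method you choose.

```typescript
// Outcome for one path: either the value that was read or the error thrown.
type ReadOutcome<T> =
  | { path: string; ok: true; value: T }
  | { path: string; ok: false; error: unknown };

// Hypothetical helper: runs `read` for every path in parallel and never
// rejects; failures are captured per path instead of aborting the batch.
async function readEach<T>(
  paths: string[],
  read: (path: string) => Promise<T>
): Promise<ReadOutcome<T>[]> {
  return Promise.all(
    paths.map(async (path): Promise<ReadOutcome<T>> => {
      try {
        return { path, ok: true, value: await read(path) };
      } catch (error) {
        return { path, ok: false, error };
      }
    })
  );
}
```

With a reader instance this might be called as readEach(paths, (p) => reader.readDocument(p)), after which failed paths can be logged or retried separately.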

Specific Format Methods

// PDF specific
const pdfContent = await reader.readPdf("./document.pdf");

// DOCX specific (includes HTML conversion)
const docxContent = await reader.readDocx("./document.docx");
console.log(docxContent.html); // HTML version of the document

// PowerPoint specific
const pptContent = await reader.readPowerPoint("./presentation.pptx");

Utility Methods

// Check if format is supported
const isSupported = reader.isFormatSupported("./document.pdf"); // true

// Get all supported formats
const formats = reader.getSupportedFormats(); // ['pdf', 'docx', 'doc', 'pptx', 'ppt', 'txt']

// Validate file
await reader.validateFile("./document.pdf"); // throws error if invalid

Convenience Functions

import { readDocument, readDocumentFromBuffer } from "doc-extract";

// Quick read from file
const content = await readDocument("./document.pdf");

// Quick read from buffer
const bufferContent = await readDocumentFromBuffer(buffer, "document.pdf");

Types

DocumentContent

interface DocumentContent {
  text: string;
  metadata?: {
    pages?: number;
    words?: number;
    characters?: number;
    fileSize?: number;
    fileName?: string;
  };
}
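
Because metadata and each of its fields are optional, calling code has to handle missing values. The following sketch shows one way to do that. It re-declares the interface locally so it runs on its own, describeDocument is a hypothetical helper, and the whitespace-based fallback word count is an assumption that may not match how the library counts words.

```typescript
// Local copy of the DocumentContent shape shown above.
interface DocumentContent {
  text: string;
  metadata?: {
    pages?: number;
    words?: number;
    characters?: number;
    fileSize?: number;
    fileName?: string;
  };
}

// Hypothetical helper: builds a one-line summary, falling back to values
// computed from the text when a metadata field is absent.
function describeDocument(content: DocumentContent): string {
  const name = content.metadata?.fileName ?? "(unnamed)";
  const words =
    content.metadata?.words ??
    content.text.split(/\s+/).filter(Boolean).length; // assumed counting rule
  const characters = content.metadata?.characters ?? content.text.length;
  const pages = content.metadata?.pages;
  const pagePart = pages !== undefined ? `, ${pages} page(s)` : "";
  return `${name}: ${words} words, ${characters} characters${pagePart}`;
}
```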

PdfContent

interface PdfContent extends DocumentContent {
  metadata: DocumentContent["metadata"] & {
    pages: number;
    info?: any; // PDF metadata from pdf-parse
  };
}

DocxContent

interface DocxContent extends DocumentContent {
  html?: string; // HTML version of the document
  messages?: any[]; // Conversion messages from mammoth
}

SupportedFormats

enum SupportedFormats {
  PDF = "pdf",
  DOCX = "docx",
  DOC = "doc",
  PPTX = "pptx",
  PPT = "ppt",
  TXT = "txt",
}
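
Since the enum values are the lowercase file extensions, a file name can be mapped to a format by comparing its extension against them. This is a sketch under that assumption: the enum is copied locally so the example is self-contained, and detectFormat is a hypothetical helper rather than a doc-extract export.

```typescript
import { extname } from "node:path";

// Local copy of the SupportedFormats enum shown above.
enum SupportedFormats {
  PDF = "pdf",
  DOCX = "docx",
  DOC = "doc",
  PPTX = "pptx",
  PPT = "ppt",
  TXT = "txt",
}

// Hypothetical helper: returns the matching enum member for a file name,
// or undefined when the extension is missing or not in the enum.
function detectFormat(fileName: string): SupportedFormats | undefined {
  const ext = extname(fileName).slice(1).toLowerCase();
  const known = Object.values(SupportedFormats) as string[];
  return known.includes(ext) ? (ext as SupportedFormats) : undefined;
}
```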

Error Handling

The library uses custom error types for better error handling:

import { DocumentReaderError, readDocument } from "doc-extract";

try {
  const content = await readDocument("./nonexistent.pdf");
} catch (error) {
  if (error instanceof DocumentReaderError) {
    console.log("Error code:", error.code);
    console.log("Error message:", error.message);
  }
}

Error Codes

UNSUPPORTED_FORMAT: File format not supported
READ_ERROR: General read error
PDF_READ_ERROR: PDF-specific read error
DOCX_READ_ERROR: DOCX-specific read error
TEXTRACT_READ_ERROR: Textract-related error
BUFFER_READ_ERROR: Buffer reading error
VALIDATION_ERROR: File validation error
INVALID_FILE_PATH: Invalid file path
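
In a web service, these codes can be translated into HTTP status codes so that clients can tell a bad upload from a server fault. The mapping below is only one possible choice, not something the library defines: httpStatusForCode is a hypothetical helper, and the status assigned to each code is an assumption.

```typescript
// Hypothetical helper: picks an HTTP status for a doc-extract error code.
// The code strings come from the list above; the statuses are assumptions.
function httpStatusForCode(code: string): number {
  switch (code) {
    case "UNSUPPORTED_FORMAT":
      return 415; // Unsupported Media Type
    case "VALIDATION_ERROR":
    case "INVALID_FILE_PATH":
      return 400; // Bad Request
    case "PDF_READ_ERROR":
    case "DOCX_READ_ERROR":
    case "TEXTRACT_READ_ERROR":
    case "BUFFER_READ_ERROR":
      return 422; // The upload arrived but could not be parsed
    default:
      return 500; // READ_ERROR and anything unrecognized
  }
}
```

Inside a catch block this could be used as res.status(httpStatusForCode(error.code)) once error has been narrowed to DocumentReaderError.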

Examples

Express.js Integration

import express from "express";
import multer from "multer";
import { DocumentReader } from "doc-extract";

const app = express();
const upload = multer();
const reader = new DocumentReader();

app.post("/upload", upload.single("document"), async (req, res) => {
  try {
    if (!req.file) {
      return res.status(400).json({ error: "No file uploaded" });
    }

    const content = await reader.readDocumentFromBuffer(
      req.file.buffer,
      req.file.originalname,
      req.file.mimetype
    );

    res.json({
      text: content.text,
      metadata: content.metadata,
    });
  } catch (error) {
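    // Caught values are typed as unknown in strict TypeScript, so narrow first
    const message = error instanceof Error ? error.message : String(error);
    res.status(500).json({ error: message });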
  }
});

Batch Processing

import { DocumentReader } from "doc-extract";
import { promises as fs } from "fs";
import path from "path";

async function processDocumentsInDirectory(dirPath: string) {
  const reader = new DocumentReader({ debug: true });

  const files = await fs.readdir(dirPath);
  const documentPaths = files
    .filter((file) => reader.isFormatSupported(file))
    .map((file) => path.join(dirPath, file));

  const results = await reader.readMultipleDocuments(documentPaths);

  results.forEach((content, index) => {
    console.log(`Document ${documentPaths[index]}:`);
    console.log(`Words: ${content.metadata?.words}`);
    console.log(`Characters: ${content.metadata?.characters}`);
    console.log("---");
  });
}

Search in Documents

import { DocumentReader } from "doc-extract";

async function searchInDocument(filePath: string, searchTerm: string) {
  const reader = new DocumentReader();
  const content = await reader.readDocument(filePath);

  const lines = content.text.split("\n");
  const matchingLines = lines
    .map((line, index) => ({ line, lineNumber: index + 1 }))
    .filter(({ line }) =>
      line.toLowerCase().includes(searchTerm.toLowerCase())
    );

  return {
    totalMatches: matchingLines.length,
    matches: matchingLines,
    metadata: content.metadata,
  };
}

Performance Tips

Reuse DocumentReader instances - The class can be reused for multiple operations
Use batch methods - readMultipleDocuments() is more efficient than individual calls
Enable debug mode only during development
Leave room for temporary files - The library cleans these up automatically, but make sure your temp directory has sufficient space
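
For very large directories, it may also help to cap how many documents are processed at once, so that memory and temp-directory usage stay bounded. The README does not state whether the batch methods limit concurrency, so the sketch below is a generic pattern rather than a description of the library: mapWithLimit is a hypothetical helper, and the task callback stands in for a call such as reader.readDocument.

```typescript
// Hypothetical helper: runs `task` over `items` with at most `limit` calls
// in flight at a time. Results keep the order of the input items.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next item that has not been claimed yet

  // Each worker repeatedly claims the next unclaimed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index] as T);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

If any task rejects, the returned promise rejects with that error, so combine this with per-file error capture when partial results are needed.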

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Development Setup

git clone https://github.com/HaiderNakara/doc-extract.git
cd doc-extract
npm install
npm run build
npm test

Running Tests

npm test          # Run tests once
npm run test:watch # Run tests in watch mode
npm run test:coverage # Run tests with coverage

License

This project is licensed under the MIT License - see the LICENSE file for details.