npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2025 – Pkg Stats / Ryan Hefner

@missionsquad/docsuite

v1.2.2

Published

MissionSquad document parsing and text extraction suite

Readme

DocSuite: Universal Document Text Extractor

DocSuite is a powerful and simple Node.js module for extracting text content from various document formats, including Microsoft Word (.docx), Excel (.xlsx), and PowerPoint (.pptx). It provides a unified, asynchronous API that returns structured, easy-to-process data for each document.

Features

  • Multi-Format Support: Extract text from .docx, .xlsx, .pptx, and .pdf files.
  • PDF Capabilities: Extract text content, embedded images, and render full-page images from PDF documents.
  • Structured Output: Instead of a single block of text, DocSuite returns an array of objects, where each object represents a page, sheet, or slide.
  • Robust Error Handling: The module captures parsing errors gracefully and returns them as part of the structured response, preventing crashes.
  • Simple API: A single extract method serves as the entry point for all supported file types.
  • TypeScript Native: Built with TypeScript, providing strong typing for all methods and data structures.

Installation

You can add DocSuite to your project using npm or yarn.

npm install @missionsquad/docsuite
# or
yarn add @missionsquad/docsuite

System Dependencies

PDF support requires Poppler utilities to be installed on your system. Without Poppler, DocSuite can still process .docx, .xlsx, and .pptx files, but PDF extraction will be unavailable.

Installing Poppler

macOS:

brew install poppler

Ubuntu/Debian:

sudo apt-get install poppler-utils

Windows: Download and install from Poppler for Windows or use Chocolatey:

choco install poppler

Verifying Installation

You can verify Poppler is installed correctly by running:

pdftotext -v

Or programmatically check within your application:

import DocSuite from '@missionsquad/docsuite';

await DocSuite.checkDependencies();
// Logs: "Poppler dependency verified successfully." or a warning if not found

Quick Start

Here's a basic example of how to use DocSuite to extract text from a file:

import DocSuite, { ExtractionResult } from '@missionsquad/docsuite';

const filePath = './docs/Research Doc.docx';

(async () => {
  const results: ExtractionResult[] = await DocSuite.extract(filePath);

  for (const result of results) {
    if (result.error) {
      console.error(`[!] Error processing ${result.fileName} on page ${result.page}:`);
      console.error(result.error);
    } else {
      console.log(`--- Page/Sheet/Slide ${result.page} of ${result.fileName} ---`);
      console.log(result.contents);
    }
  }
})();

API Reference

DocSuite.extract(filePath: string): Promise<ExtractionResult[]>

This is the main entry point for the module. It automatically detects the file type based on its extension and routes it to the appropriate parser.

  • filePath: The absolute or relative path to the document file.
  • Returns: A Promise that resolves to an array of ExtractionResult objects.

The ExtractionResult Object

This is the core data structure returned by the extract method.

type ExtractionResult = {
  // The type of content extracted: 'text', 'csv', 'image', or null (on error)
  type: 'text' | 'csv' | 'image' | null;

  // The name of the file being processed.
  fileName: string;

  // The 1-based index of the page, sheet, or slide.
  page: number;

  // The extracted content:
  // - For text/csv: the text string
  // - For images: base64 data URI (e.g., "data:image/jpeg;base64,...")
  // This property is omitted if an error occurs.
  contents?: string;

  // A string containing the error message if parsing fails. This property is omitted on success.
  error?: string;

  // For PDF images: indicates if this is a full-page render (true) or an embedded image (false)
  isFullPage?: boolean;
};

File-Specific Behavior

  • .docx (Word Documents): The entire document is treated as a single page (page: 1). The mammoth library does not support page-by-page extraction as page breaks are dynamic.
  • .xlsx (Excel Workbooks): Each sheet in the workbook is returned as a separate ExtractionResult object. The contents will be the sheet's data formatted as a CSV string.
  • .pptx (PowerPoint Presentations): Each slide is returned as a separate ExtractionResult object.
  • .pdf (PDF Documents): Each page is processed individually. By default, text is extracted from each page. With options, you can also extract embedded images or render full pages as images.

PDF Extraction Options

DocSuite provides advanced options for PDF processing:

Basic PDF Text Extraction

import DocSuite from '@missionsquad/docsuite';

const results = await DocSuite.extract('./document.pdf');

for (const result of results) {
  if (result.type === 'text') {
    console.log(`Page ${result.page}:`, result.contents);
  }
}

Extracting Embedded Images

const results = await DocSuite.extract('./document.pdf', {
  pdf: {
    imageFormat: 'jpeg' // Options: 'native', 'jpeg', 'png'
  }
});

for (const result of results) {
  if (result.type === 'image') {
    // result.contents is a base64 data URI
    console.log(`Image on page ${result.page}:`, result.contents);
  } else if (result.type === 'text') {
    console.log(`Text on page ${result.page}:`, result.contents);
  }
}

Full-Page Image Rendering

For cases where you need a visual representation of each page (e.g., for AI vision models):

const results = await DocSuite.extract('./document.pdf', {
  pdf: {
    fullPageImage: true,
    imageFormat: 'jpeg' // Format for full-page renders
  }
});

for (const result of results) {
  if (result.type === 'image' && result.isFullPage) {
    // High-quality full-page render at 150 DPI, scaled to 1080px width
    console.log(`Full page ${result.page} as image`);
  } else if (result.type === 'text') {
    console.log(`Text from page ${result.page}`);
  }
}

Progress Tracking

Monitor extraction progress for long PDFs:

const results = await DocSuite.extract('./large-document.pdf', {
  pdf: { fullPageImage: true },
  progressCallback: (event) => {
    if (event.type === 'embedding_page') {
      const { currentPage, totalPages } = event.data;
      console.log(`Processing page ${currentPage} of ${totalPages}`);
    }
  }
});

PDF Extraction Types

The PdfExtractionOptions interface:

type PdfExtractionOptions = {
  // Format for extracted images: 'native' preserves original format,
  // 'jpeg' and 'png' convert all images to that format
  imageFormat?: 'native' | 'jpeg' | 'png';
  
  // If true, renders each page as a high-quality image (150 DPI, 1080px width)
  fullPageImage?: boolean;
};

The ExtractionOptions interface:

type ExtractionOptions = {
  // Override automatic extension detection
  extension?: string;
  
  // PDF-specific options
  pdf?: PdfExtractionOptions;
  
  // Callback for progress events during extraction
  progressCallback?: (event: { type: string; data: any }) => void;
};

Advanced Features

Post-Processors

DocSuite supports post-processing extracted content. See README-PostProcessors.md for detailed usage.

// Add custom processing to results
DocSuite.setPdfPostProcessor((results) => {
  return results.map(result => ({
    ...result,
    contents: result.contents ? result.contents.toUpperCase() : result.contents
  }));
});

API Reference Summary

DocSuite.extract(filePath: string, options?: ExtractionOptions): Promise<ExtractionResult[]>

Main entry point for extracting content from any supported document format.

DocSuite.checkDependencies(): Promise<void>

Verify that required system dependencies (Poppler) are installed.

Format-Specific Methods

  • DocSuite.extractDocx(filePath: string): Promise<ExtractionResult[]>
  • DocSuite.extractXlsx(filePath: string): Promise<ExtractionResult[]>
  • DocSuite.extractPptx(filePath: string): Promise<ExtractionResult[]>
  • DocSuite.extractPdf(filePath: string, options?: PdfExtractionOptions, progressCallback?: Function): Promise<ExtractionResult[]>

Post-Processor Configuration

  • DocSuite.setDocxPostProcessor(processor: PostProcessorInput): void
  • DocSuite.setXlsxPostProcessor(processor: PostProcessorInput): void
  • DocSuite.setPptxPostProcessor(processor: PostProcessorInput): void
  • DocSuite.setPdfPostProcessor(processor: PostProcessorInput): void
  • DocSuite.setPostProcessor(extension: string, processor: PostProcessorInput): void
  • DocSuite.clearPostProcessor(extension: string): void
  • DocSuite.clearAllPostProcessors(): void

Development

To test the module, a test script is included in the scripts directory. This script will process all documents in the docs directory and print the structured output or any errors to the console.

Running Tests

  1. Install development dependencies: You will need ts-node to run the script directly.

    npm install -D ts-node
    # or
    yarn add -D ts-node
  2. Execute the test script:

    npx ts-node src/scripts/test.ts