ubc-genai-toolkit-document-parsing

v0.1.0

A module for the UBC GenAI Toolkit that provides a standardized interface for parsing text content from various document formats like PDF, DOCX, and HTML.

UBC GenAI Toolkit - Document Parsing Module

Overview

This module provides a standardized interface for parsing text content from various document formats. It follows the Facade pattern, simplifying interactions with underlying parsing libraries for formats like PDF, DOCX, and HTML, while shielding your application from their complexities.

Applications can use this module to extract text from files into either plain text or Markdown format through a consistent API.
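To make the Facade idea concrete, here is a minimal, self-contained sketch. It is not the module's actual source: ToyParsingFacade, Backend, stripTags and passthrough are hypothetical names, and the toy backends stand in for the real PDF, DOCX and HTML libraries. It only illustrates how one parse entry point can dispatch by file extension and return a result with a content field.

```typescript
// Hypothetical illustration of the Facade pattern; not the module's real implementation.
type OutputFormat = 'text' | 'markdown';

// Every backend hides one format-specific library behind the same signature.
type Backend = (raw: string, format: OutputFormat) => string;

// Toy backends (assumptions) standing in for real parsing libraries.
const stripTags: Backend = (raw) => raw.replace(/<[^>]+>/g, '').trim();
const passthrough: Backend = (raw) => raw.trim();

class ToyParsingFacade {
	// Extension-to-backend table; callers never see which backend runs.
	private readonly backends = new Map<string, Backend>([
		['.html', stripTags],
		['.htm', stripTags],
		['.md', passthrough],
	]);

	parse(fileName: string, raw: string, format: OutputFormat): { content: string } {
		const dot = fileName.lastIndexOf('.');
		const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : '';
		const backend = this.backends.get(ext);
		if (!backend) {
			throw new Error(`Unsupported file type: ${ext || '(none)'}`);
		}
		return { content: backend(raw, format) };
	}
}

const facade = new ToyParsingFacade();
const result = facade.parse('page.HTML', '<p>Hello</p>', 'text');
console.log(result.content); // prints "Hello"
```

The caller only deals with one method and one result shape; supporting another format in this sketch means adding one entry to the table.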

Installation

npm install ubc-genai-toolkit-document-parsing ubc-genai-toolkit-core

Core Concepts

  • DocumentParsingModule: The main class and entry point for parsing documents.
  • parse(input, outputFormat): The primary method. It takes an input object containing a file path (for example, { filePath }) and a desired output format ('text' or 'markdown'), and returns a result whose content property holds the extracted content.
  • Supported Formats: The module currently supports:
    • PDF (.pdf)
    • Microsoft Word (.docx)
    • HTML (.html, .htm)
    • Markdown (.md)

Configuration

The DocumentParsingModule is configured during instantiation with a DocumentParsingConfig object, which extends the ModuleConfig from ubc-genai-toolkit-core.

import { DocumentParsingModule } from 'ubc-genai-toolkit-document-parsing';
import { ConsoleLogger } from 'ubc-genai-toolkit-core';

const config = {
	logger: new ConsoleLogger(),
	debug: true,
};

const docParser = new DocumentParsingModule(config);

Usage Example

The following example demonstrates how to use the module to parse a document from a file path.

import { DocumentParsingModule } from 'ubc-genai-toolkit-document-parsing';
import path from 'path';

async function parseDocument(filePath: string) {
	const docParser = new DocumentParsingModule();

	console.log(`--- Parsing: ${path.basename(filePath)} ---`);

	try {
		// Parse to Markdown
		const markdownResult = await docParser.parse({ filePath }, 'markdown');
		console.log('Markdown Output (first 200 chars):');
		console.log(markdownResult.content.substring(0, 200) + '...');

		// Parse to Plain Text
		const textResult = await docParser.parse({ filePath }, 'text');
		console.log('\nText Output (first 200 chars):');
		console.log(textResult.content.substring(0, 200) + '...');
	} catch (error) {
		console.error(`Failed to parse ${filePath}:`, error);
	}
}

// Example usage:
// const pathToDoc = path.resolve(__dirname, 'data/sample.docx');
// parseDocument(pathToDoc);

This example initializes the module and uses it to parse a file into both Markdown and plain text, printing the first 200 characters of each result.

Error Handling

The module uses the common error types from ubc-genai-toolkit-core and defines its own specific errors:

  • UnsupportedFileTypeError: Thrown if the file type of the input document is not supported.
  • ParsingError: A generic error for issues during the parsing process, such as file access problems or failures in the underlying parsing libraries.

Always wrap calls to the parse method in try...catch blocks to handle these potential errors.
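A catch block can branch on the error type to decide how to report the failure. The sketch below is illustrative only: this README does not show how the error classes are exported, so the two classes are declared locally as stand-ins, and describeParseFailure is a hypothetical helper rather than part of the toolkit.

```typescript
// Local stand-ins (assumption): in a real application these classes would come
// from the toolkit packages rather than being declared here.
class UnsupportedFileTypeError extends Error {
	constructor(message: string) {
		super(message);
		this.name = 'UnsupportedFileTypeError';
		// Keeps instanceof working when compiling to older targets.
		Object.setPrototypeOf(this, new.target.prototype);
	}
}

class ParsingError extends Error {
	constructor(message: string) {
		super(message);
		this.name = 'ParsingError';
		Object.setPrototypeOf(this, new.target.prototype);
	}
}

// Hypothetical helper: turns a caught value into a short, user-facing message.
function describeParseFailure(error: unknown): string {
	if (error instanceof UnsupportedFileTypeError) {
		return `unsupported file type: ${error.message}`;
	}
	if (error instanceof ParsingError) {
		return `parsing failed: ${error.message}`;
	}
	// Anything else (for example, a core toolkit error) is reported generically.
	return `unexpected error: ${String(error)}`;
}

try {
	// Stands in for an awaited docParser.parse(...) call that rejects.
	throw new UnsupportedFileTypeError('.xyz');
} catch (error) {
	console.error(describeParseFailure(error)); // logs "unsupported file type: .xyz"
}
```

Checking the more specific types first and falling back to a generic message keeps unexpected failures visible instead of silently swallowing them.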