iris-extract

v0.0.2

TypeScript DOM cleaning and structuring library

unstructured-ts

A TypeScript library for cleaning and structuring DOM content, inspired by Unstructured. Built with Cheerio for fast, server-side HTML processing.

Features

  • 🧹 DOM Cleaning: Remove scripts, styles, navigation, and other unwanted elements
  • 🏗️ Semantic Structure: Classify elements as titles, paragraphs, lists, tables, etc.
  • 📊 Table Extraction: Extract tables with headers and structured data
  • 🖼️ Image Handling: Extract images with metadata and alt text
  • ⚡ Fast Processing: Built on Cheerio for efficient server-side HTML parsing
  • 🎯 Configurable: Flexible options for different use cases
  • 📝 TypeScript: Full type safety and excellent IDE support

Installation

npm install unstructured-ts

Quick Start

import { partitionHtml } from 'unstructured-ts';

const html = `
<html>
  <body>
    <nav>Skip this navigation</nav>
    <h1>Main Title</h1>
    <p>This is a paragraph with some content.</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
    <table>
      <tr><th>Name</th><th>Age</th></tr>
      <tr><td>John</td><td>30</td></tr>
    </table>
  </body>
</html>
`;

const result = partitionHtml(html);

console.log(result.elements);
// [
//   { type: 'Title', text: 'Main Title', ... },
//   { type: 'NarrativeText', text: 'This is a paragraph with some content.', ... },
//   { type: 'ListItem', text: 'First item', ... },
//   { type: 'ListItem', text: 'Second item', ... },
//   { type: 'Table', text: 'Name | Age\n--- | ---\nJohn | 30', rows: [['John', '30']], headers: ['Name', 'Age'], ... }
// ]
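
Once you have `result.elements`, a common next step is to flatten them into text for indexing or prompting. The sketch below is not part of the library: `SketchElement` and `toPlainText` are hypothetical names, and the sample array is written by hand to mirror the output shape shown above rather than produced by `partitionHtml`.

```typescript
// Hypothetical local shape mirroring the Quick Start output; not imported from the library.
type SketchElement = {
  type: string;
  text: string;
};

// Render elements as plain text: titles get a "# " prefix (with a blank line
// before every title except the first), list items get a "- " bullet.
function toPlainText(elements: SketchElement[]): string {
  const lines: string[] = [];
  for (const el of elements) {
    if (el.type === 'Title') {
      if (lines.length > 0) lines.push('');
      lines.push(`# ${el.text}`);
    } else if (el.type === 'ListItem') {
      lines.push(`- ${el.text}`);
    } else {
      lines.push(el.text);
    }
  }
  return lines.join('\n');
}

// Hand-written sample data in the shape of the Quick Start output.
const sampleElements: SketchElement[] = [
  { type: 'Title', text: 'Main Title' },
  { type: 'NarrativeText', text: 'This is a paragraph with some content.' },
  { type: 'ListItem', text: 'First item' },
  { type: 'ListItem', text: 'Second item' },
];

console.log(toPlainText(sampleElements));
```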

Advanced Usage

Custom Options

import { DOMPartitioner } from 'unstructured-ts';

const partitioner = new DOMPartitioner({
  skipNavigation: true,      // Remove navigation elements
  skipHeaders: false,        // Keep header elements
  skipFooters: true,         // Remove footer elements
  skipForms: true,           // Remove form elements
  minTextLength: 15,         // Minimum text length to include
  extractTables: true,       // Extract table structure
  extractImages: true,       // Extract image elements
  includeImageAlt: true,     // Include alt text in image elements
  includeOriginalHtml: false // Include original HTML in metadata
});

const result = partitioner.partition(html);

Working with Elements

import { partitionHtml, ElementType } from 'unstructured-ts';

const result = partitionHtml(html);

// Filter by element type
const titles = result.elements.filter(el => el.type === ElementType.TITLE);
const paragraphs = result.elements.filter(el => el.type === ElementType.NARRATIVE_TEXT);
const tables = result.elements.filter(el => el.type === ElementType.TABLE);

// Access table data
tables.forEach(table => {
  if (table.type === ElementType.TABLE) {
    console.log('Headers:', table.headers);
    console.log('Rows:', table.rows);
  }
});

// Access metadata
result.elements.forEach(element => {
  console.log(`${element.type}: ${element.text}`);
  console.log('Metadata:', element.metadata);
});
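
If you filter by type often, a user-defined type guard lets `filter()` return an array that is already typed as tables. This is a self-contained sketch: `GuardElement`, `GuardTableElement`, and `isTableElement` are hypothetical local stand-ins for the documented interfaces, and the `'Table'` string is assumed from the `type` values in the Quick Start output.

```typescript
// Local stand-ins for the documented interfaces (not imported from the library).
// The 'Table' string is assumed from the `type` values in the Quick Start output.
interface GuardElement {
  type: string;
  text: string;
}

interface GuardTableElement extends GuardElement {
  type: 'Table';
  rows: string[][];
  headers?: string[];
}

// User-defined type guard: narrows a generic element to the table shape.
function isTableElement(el: GuardElement): el is GuardTableElement {
  return el.type === 'Table';
}

const titleEl: GuardElement = { type: 'Title', text: 'Main Title' };
const tableEl: GuardTableElement = {
  type: 'Table',
  text: 'Name | Age',
  rows: [['John', '30']],
  headers: ['Name', 'Age'],
};

const mixedElements: GuardElement[] = [titleEl, tableEl];

// filter() accepts a type guard, so `tablesOnly` is typed GuardTableElement[].
const tablesOnly = mixedElements.filter(isTableElement);

for (const t of tablesOnly) {
  // `headers` and `rows` are available without a further type check.
  console.log(t.headers, t.rows);
}
```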

Element Types

The library classifies DOM elements into semantic types:

  • Title: Headings (h1-h6) and title-like content
  • NarrativeText: Paragraphs and article content
  • ListItem: List items and bullet points
  • Text: Generic text content
  • Table: Structured tabular data
  • Image: Images with metadata
  • Header/Footer: Page headers and footers
  • Navigation: Navigation menus and links
  • Form: Form elements and inputs
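
To see how a page breaks down across these types, you can tally the elements by their `type` field. The helper below is a hypothetical sketch (`countByType` is not a library function) and works on any array of objects that carry a `type` string.

```typescript
// Tally elements by their `type` string. Works on any object with a `type` field.
function countByType(elements: { type: string }[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const el of elements) {
    counts[el.type] = (counts[el.type] ?? 0) + 1;
  }
  return counts;
}

// Hand-written sample using type names from the list above.
const typeCounts = countByType([
  { type: 'Title' },
  { type: 'NarrativeText' },
  { type: 'ListItem' },
  { type: 'ListItem' },
]);

console.log(typeCounts);
```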

API Reference

partitionHtml(html: string, options?: PartitionOptions): PartitionResult

Convenience function to partition HTML content.

DOMPartitioner

Main class for partitioning DOM content.

Constructor

new DOMPartitioner(options?: PartitionOptions)

Methods

  • partition(html: string): PartitionResult - Partition HTML content

PartitionOptions

Configuration options for partitioning:

interface PartitionOptions {
  skipNavigation?: boolean;     // Default: true
  skipHeaders?: boolean;        // Default: false
  skipFooters?: boolean;        // Default: false
  skipForms?: boolean;          // Default: true
  minTextLength?: number;       // Default: 10
  preserveWhitespace?: boolean; // Default: false
  extractTables?: boolean;      // Default: true
  extractImages?: boolean;      // Default: true
  includeImageAlt?: boolean;    // Default: true
  includeOriginalHtml?: boolean;// Default: false
}
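
The documented defaults can be read as "start from these values, then apply whatever the caller passes". The sketch below illustrates that merge with a local copy of the interface. `SketchOptions`, `DOCUMENTED_DEFAULTS`, and `resolveOptions` are hypothetical names, and this is not necessarily how the library resolves options internally.

```typescript
// Local copy of the documented options shape (not imported from the library).
interface SketchOptions {
  skipNavigation?: boolean;
  skipHeaders?: boolean;
  skipFooters?: boolean;
  skipForms?: boolean;
  minTextLength?: number;
  preserveWhitespace?: boolean;
  extractTables?: boolean;
  extractImages?: boolean;
  includeImageAlt?: boolean;
  includeOriginalHtml?: boolean;
}

// Default values copied from the comments in the PartitionOptions listing.
const DOCUMENTED_DEFAULTS: Required<SketchOptions> = {
  skipNavigation: true,
  skipHeaders: false,
  skipFooters: false,
  skipForms: true,
  minTextLength: 10,
  preserveWhitespace: false,
  extractTables: true,
  extractImages: true,
  includeImageAlt: true,
  includeOriginalHtml: false,
};

// Merge caller overrides on top of the defaults.
// Caveat: an explicit `undefined` in `overrides` would overwrite a default.
function resolveOptions(overrides: SketchOptions = {}): Required<SketchOptions> {
  return { ...DOCUMENTED_DEFAULTS, ...overrides };
}

const resolved = resolveOptions({ minTextLength: 15, skipFooters: true });
console.log(resolved.minTextLength, resolved.skipFooters, resolved.skipNavigation);
```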

Element

Base element interface:

interface Element {
  id: string;
  type: ElementType;
  text: string;
  metadata: ElementMetadata;
}

TableElement

Extended element for tables:

interface TableElement extends Element {
  type: ElementType.TABLE;
  rows: string[][];
  headers?: string[];
}
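
Because `rows` is a plain `string[][]` and `headers` is optional, it is often convenient to turn a table into an array of keyed records. The helper below is a hypothetical sketch (`tableToRecords` is not a library function). The fallback column names `col0`, `col1`, ... are an assumption made here for tables without headers.

```typescript
// Convert table rows into keyed records, using headers as keys when present.
// Falls back to positional keys (col0, col1, ...) when a header is missing.
function tableToRecords(
  headers: string[] | undefined,
  rows: string[][],
): Record<string, string>[] {
  return rows.map((row) => {
    const record: Record<string, string> = {};
    row.forEach((cell, i) => {
      const key = headers?.[i] ?? `col${i}`;
      record[key] = cell;
    });
    return record;
  });
}

// Sample data taken from the Quick Start table.
const withHeaders = tableToRecords(['Name', 'Age'], [['John', '30']]);
const withoutHeaders = tableToRecords(undefined, [['John', '30']]);

console.log(withHeaders, withoutHeaders);
```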

ImageElement

Extended element for images:

interface ImageElement extends Element {
  type: ElementType.IMAGE;
  src?: string;
  alt?: string;
  width?: number;
  height?: number;
}
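
Since every image field other than `type` is optional, consumers need to handle missing values. As a minimal sketch, the hypothetical helper below (`toMarkdownImage`, working on a local `SketchImage` shape rather than the library's type) renders an image as Markdown and skips entries without a `src`.

```typescript
// Local shape with the optional fields documented for ImageElement.
interface SketchImage {
  src?: string;
  alt?: string;
  width?: number;
  height?: number;
}

// Render an image as Markdown. Returns null when there is no usable src.
function toMarkdownImage(img: SketchImage): string | null {
  if (!img.src) return null;
  return `![${img.alt ?? ''}](${img.src})`;
}

console.log(toMarkdownImage({ src: 'chart.png', alt: 'Sales chart' }));
console.log(toMarkdownImage({ src: 'logo.png' }));
console.log(toMarkdownImage({ alt: 'No source' }));
```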

Comparison with Unstructured Python Library

This library is inspired by the Python Unstructured library but is designed specifically for TypeScript/JavaScript environments:

| Feature | unstructured-ts | Unstructured Python |
|---------|----------------|-------------------|
| DOM Processing | ✅ Cheerio-based | ✅ BeautifulSoup-based |
| Element Classification | ✅ Simplified | ✅ Comprehensive |
| Table Extraction | ✅ Basic structure | ✅ Advanced analysis |
| Multiple File Formats | ❌ HTML only | ✅ PDF, DOCX, etc. |
| OCR Support | ❌ | ✅ |
| Language | TypeScript | Python |
| Performance | ⚡ Fast | 🐌 Slower |
| Dependencies | Minimal | Heavy |

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT