@alloc/dom-to-semantic-markdown

v2.0.1

DOM to Semantic-Markdown for use in LLMs

This library converts HTML DOM to a semantic Markdown format optimized for use with Large Language Models (LLMs). It preserves the semantic structure of web content, extracts essential metadata, and reduces token usage compared to raw HTML, making it easier for LLMs to understand and process information.

Note: This is a personal fork of romansky/dom-to-semantic-markdown. Support will not be provided.

Key Features

  • Semantic Structure Preservation: Retains the meaning of HTML elements like <header>, <footer>, <nav>, and more.
  • Metadata Extraction: Captures important metadata such as title, description, keywords, Open Graph tags, Twitter Card tags, and JSON-LD data.
  • Token Efficiency: Reduces token usage through URL refification (converting URLs to short reference-style links) and a concise representation of content.
  • Main Content Detection: Automatically identifies and extracts the primary content section of a webpage.
  • Table Column Tracking: Adds unique identifiers to table columns, improving an LLM's ability to correlate data across rows.
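
To make the URL refification idea concrete, here is a minimal TypeScript sketch. It is not the library's implementation: `refifyMarkdownLinks`, the `refN` naming scheme, and the returned `urlMap` are assumptions made for illustration, and the library's actual output format may differ.

```typescript
// Illustrative only: not the library's implementation or output format.
interface RefifyResult {
  markdown: string; // Markdown with URLs replaced by short refs
  urlMap: Map<string, string>; // ref id -> original URL
}

function refifyMarkdownLinks(markdown: string): RefifyResult {
  const refByUrl = new Map<string, string>();
  const urlMap = new Map<string, string>();

  // Match the target of inline links and images: [text](url) or ![alt](url).
  const rewritten = markdown.replace(
    /\]\((https?:\/\/[^\s)]+)\)/g,
    (_match: string, url: string) => {
      let ref = refByUrl.get(url);
      if (ref === undefined) {
        // First time this URL is seen: assign the next short id.
        ref = `ref${refByUrl.size}`;
        refByUrl.set(url, ref);
        urlMap.set(ref, url);
      }
      return `](${ref})`;
    },
  );

  return { markdown: rewritten, urlMap };
}

const sample =
  "[Docs](https://example.com/a/very/long/path?utm=1) and " +
  "[Docs again](https://example.com/a/very/long/path?utm=1), " +
  "plus [Home](https://example.com/).";
const refified = refifyMarkdownLinks(sample);
// refified.markdown is "[Docs](ref0) and [Docs again](ref0), plus [Home](ref1)."
```

Each distinct URL appears only once, in the map, so a page that repeats long URLs spends fewer tokens on links while the originals stay recoverable.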

Installation

pnpm add @alloc/dom-to-semantic-markdown

Usage

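import { convertElementToMarkdown } from "@alloc/dom-to-semantic-markdown";

// document.body is a DOM Element, so use the Element-based converter.
// For an HTML string, use convertHtmlToMarkdown instead.
const markdown = convertElementToMarkdown(document.body);
console.log(markdown);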

Functions

convertHtmlToMarkdown(html: string, options?: ConversionOptions): string

Converts an HTML string to semantic Markdown.

  • html: string: The HTML string to be converted.
  • options?: ConversionOptions: Optional configuration object to customize the conversion process. See ConversionOptions for available settings.

Returns: string - The Markdown string representation of the HTML content.

convertElementToMarkdown(element: Element, options?: ConversionOptions): string

Converts an HTML Element to semantic Markdown.

  • element: Element: The HTML DOM Element to be converted. This allows you to convert specific parts of a document, not just the entire HTML string.
  • options?: ConversionOptions: Optional configuration object to customize the conversion process. See ConversionOptions for available settings.

Returns: string - The Markdown string representation of the provided HTML Element and its descendants.

extractMetaData(element: Element, mode?: 'basic' | 'extended'): SemanticMarkdownAST.MetaDataNode['content']

Extracts metadata from an HTML Element.

  • element: Element: The HTML DOM Element to extract metadata from.
  • mode?: 'basic' | 'extended': Optional mode to control the level of metadata extraction.
    • 'basic': Includes standard meta tags like title, description, and keywords.
    • 'extended': Includes basic meta tags, Open Graph tags, Twitter Card tags, and JSON-LD data.

Returns: SemanticMarkdownAST.MetaDataNode['content'] - An object containing the extracted metadata.
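
The difference between the two modes can be pictured as a filter over the page's meta tags. The sketch below is a simplified illustration, not the library's code: `MetaTag`, `selectMetaTags`, and the flat tag list are hypothetical, and JSON-LD (which lives in script tags rather than meta tags) is left out. The `og:` and `twitter:` prefixes are the standard ones used by Open Graph and Twitter Cards.

```typescript
// Hypothetical shapes for illustration; the library's real types may differ.
type MetaMode = "basic" | "extended";

interface MetaTag {
  key: string; // value of the tag's name or property attribute
  content: string;
}

const BASIC_KEYS = new Set(["title", "description", "keywords"]);

function selectMetaTags(tags: MetaTag[], mode: MetaMode = "basic"): MetaTag[] {
  return tags.filter((tag) => {
    if (BASIC_KEYS.has(tag.key)) return true;
    if (mode === "basic") return false;
    // Extended mode also keeps Open Graph and Twitter Card tags.
    return tag.key.startsWith("og:") || tag.key.startsWith("twitter:");
  });
}

const sampleTags: MetaTag[] = [
  { key: "title", content: "Hello" },
  { key: "og:image", content: "https://example.com/cover.png" },
  { key: "twitter:card", content: "summary" },
  { key: "viewport", content: "width=device-width" },
];

const basicTags = selectMetaTags(sampleTags); // keeps only "title"
const extendedTags = selectMetaTags(sampleTags, "extended"); // also keeps og: and twitter:
```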

htmlToMarkdownAST(element: Element, options?: ExtractOptions, indentLevel?: number): SemanticMarkdownAST.Node[]

Converts an HTML Element into a Semantic Markdown Abstract Syntax Tree (AST). This function recursively parses the HTML structure and generates a structured Markdown representation. It uses extractMetaData to extract metadata from the <head> element.

  • element: Element: The HTML DOM element to be converted.
  • options?: ExtractOptions: Optional configuration to customize the extraction process. See ExtractOptions for details.
  • indentLevel?: number: The current indentation level, used for nested elements like lists. Defaults to 0.

Returns: SemanticMarkdownAST.Node[] - An array of AST nodes representing the semantic Markdown structure of the input HTML element. This AST can then be rendered into a Markdown string using a separate rendering function.

This function is not intended for direct use in most cases. Use convertHtmlToMarkdown or convertElementToMarkdown for simpler HTML to Markdown conversion. However, understanding htmlToMarkdownAST is crucial for customizing or extending the library's functionality.

markdownASTToString(nodes: Node[], options?: RenderOptions, indentLevel?: number): string

Converts a Semantic Markdown Abstract Syntax Tree (AST) back into a Markdown string. This function takes the AST generated by htmlToMarkdownAST and renders it into a human-readable Markdown format.

  • nodes: Node[]: An array of SemanticMarkdownAST nodes representing the Markdown content. This is typically the output of the htmlToMarkdownAST function.
  • options?: RenderOptions: Optional configuration object to customize the rendering process. See RenderOptions for available settings.
  • indentLevel?: number: The initial indentation level for the Markdown output. Used for nested structures like lists and blockquotes. Defaults to 0.

Returns: string - The Markdown string representation of the AST.

This function is essential for completing the HTML to Markdown conversion process. It takes the structured AST and transforms it into a flat, string-based Markdown output.
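
To illustrate what rendering an AST to a string involves, here is a toy renderer over a handful of node shapes. It is a sketch under assumptions, not the library's renderer: `MiniNode` and its `type` discriminant are hypothetical stand-ins (the real shapes are defined in SemanticMarkdownAST), and indentation and render options are ignored.

```typescript
// Toy node shapes, assumed for illustration; see SemanticMarkdownAST for the real ones.
type MiniNode =
  | { type: "text"; content: string }
  | { type: "bold"; content: MiniNode[] }
  | { type: "link"; href: string; content: MiniNode[] }
  | { type: "heading"; level: 1 | 2 | 3 | 4 | 5 | 6; content: MiniNode[] };

// Render a list of sibling nodes by concatenating their string forms.
function renderMiniNodes(nodes: MiniNode[]): string {
  return nodes.map((node) => renderMiniNode(node)).join("");
}

// Render one node, recursing into children for container nodes.
function renderMiniNode(node: MiniNode): string {
  switch (node.type) {
    case "text":
      return node.content;
    case "bold":
      return `**${renderMiniNodes(node.content)}**`;
    case "link":
      return `[${renderMiniNodes(node.content)}](${node.href})`;
    case "heading":
      return `${"#".repeat(node.level)} ${renderMiniNodes(node.content)}\n\n`;
  }
}

const miniAst: MiniNode[] = [
  { type: "heading", level: 2, content: [{ type: "text", content: "Install" }] },
  { type: "text", content: "See the " },
  {
    type: "link",
    href: "https://example.com/docs",
    content: [{ type: "bold", content: [{ type: "text", content: "docs" }] }],
  },
  { type: "text", content: "." },
];

const rendered = renderMiniNodes(miniAst);
// rendered is "## Install\n\nSee the [**docs**](https://example.com/docs)."
```

Keeping the tree separate from the string output is what makes hooks such as overrideNodeRenderer possible: a custom renderer can intercept a node before the default case runs.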

Types

ExtractOptions

  • debug?: boolean: Enable debug logging.
  • websiteDomain?: string: The domain of the website being converted.
  • extractMainContent?: boolean: Whether to extract only the main content of the page.
  • includeMetaData?: 'basic' | 'extended' | false: Controls whether to include metadata extracted from the HTML head.
    • 'basic': Includes standard meta tags like title, description, and keywords.
    • 'extended': Includes basic meta tags, Open Graph tags, Twitter Card tags, and JSON-LD data.
    • false: Disables metadata extraction.
  • excludeTagNames?: string[]: Avoid extracting content from these tags.
  • excludeInvisibleElements?: boolean: Whether to exclude elements that are not visible.
  • enableTableColumnTracking?: boolean: Adds unique identifiers to table columns.
  • overrideElementProcessing?: (element: Element, options: ConversionOptions, indentLevel: number) => SemanticMarkdownAST[] | undefined: Custom processing for HTML elements.
  • processUnhandledElement?: (element: Element, options: ConversionOptions, indentLevel: number) => SemanticMarkdownAST[] | undefined: Handler for unknown HTML elements.

RenderOptions

  • emitFrontMatter?: boolean: Include the metadata as “front matter” in the output.
  • overrideNodeRenderer?: (node: SemanticMarkdownAST, options: ConversionOptions, indentLevel: number) => string | undefined: Custom renderer for AST nodes.
  • renderCustomNode?: (node: CustomNode, options: ConversionOptions, indentLevel: number) => string | undefined: Renderer for custom AST nodes.

ConversionOptions

  • refifyUrls?: boolean: Whether to convert URLs to reference-style links.
  • overrideDOMParser?: DOMParser: Custom DOMParser for Node.js environments.
  • Everything in ExtractOptions and RenderOptions
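
The relationship between the three option types can be sketched with local interfaces that mirror the fields documented above. The `*Sketch` names are hypothetical and exist only for this example; the callback options and overrideDOMParser are omitted so the snippet does not depend on DOM types.

```typescript
// Local mirror of the documented option fields, for illustration only.
interface ExtractOptionsSketch {
  debug?: boolean;
  websiteDomain?: string;
  extractMainContent?: boolean;
  includeMetaData?: "basic" | "extended" | false;
  excludeTagNames?: string[];
  excludeInvisibleElements?: boolean;
  enableTableColumnTracking?: boolean;
}

interface RenderOptionsSketch {
  emitFrontMatter?: boolean;
}

// ConversionOptions adds its own fields on top of both option sets.
interface ConversionOptionsSketch
  extends ExtractOptionsSketch,
    RenderOptionsSketch {
  refifyUrls?: boolean;
}

// One object can therefore carry extraction, rendering, and conversion settings.
const articleOptions: ConversionOptionsSketch = {
  extractMainContent: true,
  includeMetaData: "extended",
  enableTableColumnTracking: true,
  emitFrontMatter: true,
  refifyUrls: true,
};
```

An object shaped like `articleOptions` is what would be passed as the second argument to convertHtmlToMarkdown or convertElementToMarkdown.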

SemanticMarkdownAST

SemanticMarkdownAST is a type-only namespace that defines the structure of the Markdown Abstract Syntax Tree (AST) used by this library. It encompasses various node types that represent different semantic elements in Markdown, allowing for a structured and programmatically accessible representation of Markdown content.

The namespace includes the following type definitions for different Markdown elements:

  • BlockquoteNode: Represents blockquotes.
  • BoldNode: Represents bold text.
  • CodeNode: Represents code blocks and inline code.
  • CustomNode: Represents custom, user-defined nodes.
  • HeadingNode: Represents headings with levels from 1 to 6.
  • ImageNode: Represents images.
  • ItalicNode: Represents italic text.
  • LinkNode: Represents hyperlinks.
  • ListItemNode: Represents items in a list.
  • ListNode: Represents ordered and unordered lists.
  • MetaDataNode: Represents metadata extracted from HTML <head>, including standard meta tags, Open Graph, Twitter Card, and JSON-LD.
  • SemanticHtmlNode: Represents semantic HTML elements like <article>, <header>, etc.
  • StrikethroughNode: Represents strikethrough text.
  • TableCellNode: Represents cells within a table.
  • TableNode: Represents tables.
  • TableRowNode: Represents rows within a table.
  • TextNode: Represents plain text content.
  • VideoNode: Represents video embeds.

Each of these node types defines a specific structure with properties relevant to the represented Markdown element, such as content, level (for headings), href (for links), etc. These types are used throughout the library to represent and manipulate Markdown content programmatically.

Using the Output with LLMs

The semantic Markdown produced by this library is optimized for use with Large Language Models (LLMs). To use it effectively:

  1. Extract the Markdown content using the library.
  2. Start with a brief instruction or context for the LLM.
  3. Wrap the extracted Markdown in triple backticks (```).
  4. Follow the Markdown with your question or prompt.

Example:

The following is a semantic Markdown representation of a webpage. Please analyze its content:

```markdown
{paste your extracted markdown here}
```

{your question, e.g., "What are the main points discussed in this article?"}

This format helps the LLM understand its task and the context of the content, enabling more accurate and relevant responses to your questions.
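
The prompt layout above can be assembled programmatically. The helper below is a minimal sketch: `buildLlmPrompt` is a hypothetical name, not part of the library. It also picks a fence longer than any backtick run already present in the Markdown, so that code blocks inside the extracted content do not close the wrapper early.

```typescript
// Hypothetical helper: wraps extracted Markdown in a fenced block between
// an instruction and a question.
function buildLlmPrompt(
  instruction: string,
  markdown: string,
  question: string,
): string {
  // Find the longest run of backticks in the content (0 if there are none).
  const runs = markdown.match(/`+/g) ?? [];
  const longestRun = Math.max(0, ...runs.map((run) => run.length));
  // Use at least three backticks, and always more than the content uses.
  const fence = "`".repeat(Math.max(3, longestRun + 1));

  return [
    instruction,
    "",
    `${fence}markdown`,
    markdown.trim(),
    fence,
    "",
    question,
  ].join("\n");
}

const prompt = buildLlmPrompt(
  "The following is a semantic Markdown representation of a webpage. Please analyze its content:",
  "# Title\n\nBody text",
  "What are the main points discussed in this article?",
);
```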