@aiquants/html-to-markdown
v1.2.0
HTML to Markdown converter
A tool to dynamically fetch a web page from a given URL and convert it to Markdown. Now with Model Context Protocol (MCP) server support!
This tool fetches the fully rendered HTML after JavaScript execution and converts it to Markdown. It supports complex pages, including footnotes and table structures.
Key Features
- Dynamic Content Fetching: Accurately fetches content from dynamic sites like SPAs (Single Page Applications) using Playwright.
- High-Fidelity Conversion: Performs AST (Abstract Syntax Tree)-based conversion using rehype and remark.
- Preserves Table Content: Retains HTML tags within table cells (<td>, <th>) as much as possible to maintain rich formatting.
- Link Normalization: Automatically converts relative links on the page to absolute links to prevent broken links.
- Multiple Interfaces: Usable as a Node.js library, command-line tool, and MCP server.
- MCP Server Support: Provides Model Context Protocol server functionality for AI assistants.
- Streamable MCP Support: Supports streamable MCP protocols for real-time progress updates.
Installation
npm install @aiquants/html-to-markdown
Usage
As a Command-Line Tool
You can run the tool directly using npx without installation. By default, the converted Markdown will be saved in the .outputs/raw directory, but you can specify a custom output path using the --output option.
npx @aiquants/html-to-markdown <URL> [--locale <locale>] [--output <path>]
Or convert HTML content directly:
npx @aiquants/html-to-markdown --html-content <HTML_TEXT> [--locale <locale>] [--output <path>]
Options:
- --locale <locale>: Set the locale for the browser context and console messages (en-US or ja-JP). Defaults to en-US.
- --output <path>, -o <path>: Specify the output file path. If not specified, the file will be saved in the .outputs/raw directory with an auto-generated filename.
- --html-content <HTML_TEXT>, -h <HTML_TEXT>: Convert HTML content directly instead of fetching from a URL.
Examples:
# Convert a Wikipedia page with the Japanese locale
npx @aiquants/html-to-markdown https://ja.wikipedia.org/wiki/Node.js --locale ja-JP
# Convert an English page (locale defaults to en-US)
npx @aiquants/html-to-markdown https://en.wikipedia.org/wiki/Node.js
# Save to a specific file
npx @aiquants/html-to-markdown https://en.wikipedia.org/wiki/Node.js --output ./my-output.md
# Use short option for output
npx @aiquants/html-to-markdown https://ja.wikipedia.org/wiki/Node.js --locale ja-JP -o ./nodejs-ja.md
# Convert HTML content directly
npx @aiquants/html-to-markdown --html-content '<html><body><h1>Sample Title</h1><p>This is a sample paragraph.</p></body></html>' --output ./sample.md
# Convert HTML content with Japanese locale
npx @aiquants/html-to-markdown --html-content '<html><body><h1>サンプルタイトル</h1><p>これはサンプルの段落です。</p></body></html>' --locale ja-JP -o ./sample-ja.md
As a Library
import { htmlToMarkdown } from '@aiquants/html-to-markdown';
import fs from 'fs';
async function main() {
    // Example 1: Convert from URL
    const url = 'https://en.wikipedia.org/wiki/Node.js';
    const options = {
        locale: 'en-US', // 'en-US' (default) or 'ja-JP'
    };
    try {
        const { markdown } = await htmlToMarkdown(url, options);
        fs.writeFileSync('output.md', markdown);
        console.log('Markdown file has been saved as output.md');
    } catch (error) {
        console.error('Error converting HTML to Markdown:', error);
    }
    // Example 2: Convert HTML content directly
    const htmlContent = `
        <html>
            <body>
                <h1>Sample Title</h1>
                <p>This is a sample paragraph.</p>
                <ul>
                    <li>Item 1</li>
                    <li>Item 2</li>
                </ul>
            </body>
        </html>
    `;
    try {
        // The first argument is not fetched when htmlContent is provided
        const { markdown } = await htmlToMarkdown('', {
            locale: 'en-US',
            htmlContent: htmlContent
        });
        fs.writeFileSync('output-from-html.md', markdown);
        console.log('Markdown file has been saved as output-from-html.md');
    } catch (error) {
        console.error('Error converting HTML to Markdown:', error);
    }
}
main();
As an MCP Server
This package can be used as a Model Context Protocol (MCP) server, allowing AI assistants and MCP-compatible applications to convert web pages to Markdown.
MCP Server Configuration
For VS Code with MCP extension, add to your mcp.json:
{
    "html-to-md": {
        "command": "npx",
        "args": [
            "--package=@aiquants/html-to-markdown",
            "aiq-html2md-mcp"
        ],
        "type": "stdio"
    },
    "html-to-md-streamable": {
        "url": "http://127.0.0.1:4001/mcp",
        "type": "http",
        "_comment": "Note: You need to start the streamable MCP server separately using 'npx --package=@aiquants/html-to-markdown aiq-html2md-mcp-stream'"
    }
}
For Claude Desktop or other MCP clients, add to your configuration file:
{
    "mcpServers": {
        "html-to-markdown": {
            "command": "npx",
            "args": ["--package=@aiquants/html-to-markdown", "aiq-html2md-mcp"],
            "description": "Convert HTML content from URLs to Markdown format"
        }
    }
}
Note: The streamable MCP server runs as an HTTP server on port 4001 and needs to be started separately:
# Start the streamable MCP server (requires global installation)
npm install -g @aiquants/html-to-markdown
aiq-html2md-mcp-stream
Alternatively, you can use npx to run without global installation:
# Using npx with the streamable server binary
npx --package=@aiquants/html-to-markdown aiq-html2md-mcp-stream
MCP Tools Available
- html_to_markdown: Convert HTML content from a URL to Markdown format
  - url (required*): The URL of the web page to convert
  - html_content (required*): HTML content as a string to convert (alternative to URL)
  - locale (optional): Browser locale (en-US or ja-JP, defaults to en-US)
  *Either url or html_content is required
- save_content_to_file: Save text content to a file with specified path or directory
  - content (required): Text content to save to file (will be saved as Markdown format)
  - save_path (optional): Complete file path including filename to save content
  - save_directory (optional): Directory path to save content with auto-generated filename
  - filename (optional): Base filename to use when save_directory is specified (extension .md will be added automatically)
- url_to_markdown_file: Convert web pages or HTML strings directly to Markdown files (combines conversion and file saving)
  - url (required*): The URL of the web page to convert and save
  - html_content (required*): HTML content as a string to convert and save (alternative to URL)
  - locale (optional): Browser locale (en-US or ja-JP, defaults to en-US)
  - save_path (optional): Complete file path including filename to save the converted content
  - save_directory (optional): Directory path to save the converted content with auto-generated filename
  - filename (optional): Base filename to use when save_directory is specified (extension .md will be added automatically)
  *Either url or html_content is required, and either save_path or save_directory is required
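The argument rules for url_to_markdown_file can be checked on the client side before a call is sent. The following is a minimal sketch, not the package's own validation code: the function name validateUrlToMarkdownFileArgs and the error messages are hypothetical, and only the rules stated in this README are encoded.

```javascript
// Hypothetical client-side check of url_to_markdown_file arguments.
// It encodes only the rules documented in this README.
function validateUrlToMarkdownFileArgs(args) {
    const errors = [];
    // Treat a key as present only when it holds a non-empty string.
    const has = (key) => typeof args[key] === 'string' && args[key].length > 0;

    if (!has('url') && !has('html_content')) {
        errors.push('Either url or html_content is required.');
    }
    if (!has('save_path') && !has('save_directory')) {
        errors.push('Either save_path or save_directory is required.');
    }
    if (has('save_path') && has('save_directory')) {
        errors.push('save_path and save_directory cannot be used together.');
    }
    if (args.locale !== undefined && !['en-US', 'ja-JP'].includes(args.locale)) {
        errors.push('locale must be en-US or ja-JP.');
    }
    return errors;
}

// An empty array means the arguments satisfy the documented rules.
console.log(validateUrlToMarkdownFileArgs({
    url: 'https://example.com/article',
    save_directory: '/path/to/output/',
}));
```

How the server itself reports invalid arguments is not described here, so a check like this is only a convenience for callers.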
Streamable MCP Tools (for real-time progress updates):
- html_to_markdown_streamable: Same as html_to_markdown but with real-time progress updates
- save_content_to_file: Same file saving functionality as in standard MCP
- url_to_markdown_file_streamable: Same as url_to_markdown_file but with real-time progress updates and streaming support
File Saving Options:
- save_path: Specify the complete file path, including filename and extension, where you want to save the content.
  - Example: /path/to/output/my-page.md
  - The directory will be created automatically if it doesn't exist.
  - You cannot use both save_path and save_directory at the same time.
- save_directory: Specify only the directory where you want to save the file. The filename will be auto-generated based on the URL.
  - Example: /path/to/output/ (filename will be auto-generated, like page.md)
  - For URLs like https://example.com/articles/my-article, the filename becomes my-article.md
  - The directory will be created automatically if it doesn't exist.
  - You cannot use both save_path and save_directory at the same time.
File Saving Behavior:
- All content is saved as Markdown format (.md files)
- The directory will be created automatically if it doesn't exist
- With save_path: File is saved to the exact specified path
- With save_directory: File is saved with auto-generated filename based on the URL or custom filename if provided
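The save_directory naming rule can be illustrated with a short sketch. This is an assumption about how such a rule could be written, not the package's actual code: deriveMarkdownFilename and joinSavePath are hypothetical helpers, and falling back to page.md for URLs without a path segment is a guess based on the page.md example above.

```javascript
// Hypothetical helper: choose a filename from a custom base name or from the URL.
function deriveMarkdownFilename(url, customName) {
    // A custom base filename wins; the .md extension is appended automatically.
    if (customName) {
        return `${customName}.md`;
    }
    // Otherwise use the last non-empty path segment of the URL.
    const segments = new URL(url).pathname.split('/').filter(Boolean);
    // Assumption: fall back to "page" when the URL has no path segments.
    const base = segments.length > 0 ? segments[segments.length - 1] : 'page';
    return `${base}.md`;
}

// Hypothetical helper: join a directory and a filename with a single slash.
function joinSavePath(saveDirectory, filename) {
    return saveDirectory.endsWith('/')
        ? saveDirectory + filename
        : `${saveDirectory}/${filename}`;
}

const filename = deriveMarkdownFilename('https://example.com/articles/my-article');
console.log(joinSavePath('/path/to/output/', filename));
```

The real implementation may sanitize characters or handle query strings differently; the sketch only covers the cases this README describes.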
MCP Usage Examples
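The examples in this section show only the method and params fields. MCP messages are JSON-RPC 2.0 requests, so a client that builds them by hand also sets jsonrpc and an id. A minimal sketch of such a builder follows; buildToolCall is a hypothetical helper, and most MCP client libraries construct this envelope for you.

```javascript
// Hypothetical helper: wrap a tool name and its arguments in a JSON-RPC 2.0 request.
function buildToolCall(id, name, args) {
    return {
        jsonrpc: '2.0',
        id,
        method: 'tools/call',
        params: { name, arguments: args },
    };
}

const request = buildToolCall(1, 'html_to_markdown', {
    url: 'https://example.com',
    locale: 'en-US',
});
console.log(JSON.stringify(request, null, 2));
```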
Example 1: Basic HTML to Markdown conversion
{
    "method": "tools/call",
    "params": {
        "name": "html_to_markdown",
        "arguments": {
            "url": "https://example.com",
            "locale": "en-US"
        }
    }
}
Example 2: Save content to a specific file
{
    "method": "tools/call",
    "params": {
        "name": "save_content_to_file",
        "arguments": {
            "content": "# My Content\n\nThis is my markdown content.",
            "save_path": "/path/to/my-file.md"
        }
    }
}
Example 3: Convert URL directly to Markdown file (One-step operation)
{
    "method": "tools/call",
    "params": {
        "name": "url_to_markdown_file",
        "arguments": {
            "url": "https://example.com/article",
            "save_directory": "/path/to/output/",
            "filename": "my-article",
            "locale": "en-US"
        }
    }
}
Example 4: Streamable conversion with real-time progress
{
    "method": "tools/call",
    "params": {
        "name": "url_to_markdown_file_streamable",
        "arguments": {
            "url": "https://example.com/large-page",
            "save_path": "/path/to/large-page.md",
            "locale": "ja-JP"
        }
    }
}
Using MCP Server Programmatically
import { createMcpServer, createStreamableMcpServer } from '@aiquants/html-to-markdown';
// Create standard MCP server
const mcpServer = createMcpServer();
await mcpServer.start(8000); // Port 8000
// Create streamable MCP server
const streamableMcpServer = createStreamableMcpServer();
await streamableMcpServer.start(8000); // Port 8000
API
htmlToMarkdown(urlOrHtml, options?)
Converts the HTML content of a given URL or HTML string to Markdown.
- urlOrHtml (string, required): The URL of the web page to convert. When the htmlContent option is used, this can be any string (commonly used as an identifier).
- options (object, optional): Options for the conversion process.
options object
- locale (string, optional): Specifies the locale to use for the browser context and console messages.
  - 'en-US' (default)
  - 'ja-JP'
- htmlContent (string, optional): HTML content as a string. When provided, the function converts this HTML directly instead of fetching content from the URL given in urlOrHtml.
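Because the same function accepts either a URL or raw HTML through htmlContent, a caller that receives arbitrary input may want to pick the right call shape. The sketch below is one possible approach under stated assumptions: toHtmlToMarkdownArgs is a hypothetical helper, and treating only http and https input as a URL is a choice made for this example, not a rule of the package.

```javascript
// Hypothetical helper: build the [urlOrHtml, options] pair for htmlToMarkdown.
function toHtmlToMarkdownArgs(input, locale = 'en-US') {
    let isUrl = false;
    try {
        // Assumption: only absolute http(s) URLs count as pages to fetch.
        const parsed = new URL(input);
        isUrl = parsed.protocol === 'http:' || parsed.protocol === 'https:';
    } catch {
        // Input that does not parse as an absolute URL is treated as HTML.
        isUrl = false;
    }
    return isUrl
        ? [input, { locale }]
        : ['', { locale, htmlContent: input }];
}

console.log(toHtmlToMarkdownArgs('https://en.wikipedia.org/wiki/Node.js'));
console.log(toHtmlToMarkdownArgs('<h1>Sample Title</h1>', 'ja-JP'));
```

The resulting pair can be spread into the documented call, for example htmlToMarkdown(...toHtmlToMarkdownArgs(input)).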
How It Works
This tool follows these steps for conversion:
Fetch HTML (Playwright):
- Launches a browser with Playwright using the specified locale.
- Navigates to the page and waits until the networkidle event, ensuring dynamic content is fully loaded before fetching the HTML.
Pre-process HTML (Rehype):
- rehype-parse: Parses the HTML into a HAST (HTML Abstract Syntax Tree).
- rehype-raw: Preserves elements like script and style.
- rehypeSanitizeHtml (custom plugin): Removes empty comment nodes and unnecessary whitespace.
- rehypeAbsoluteLinks (custom plugin): Converts relative paths in href and src attributes to absolute paths.
- rehypeWikipediaFootnotes (custom plugin): Transforms Wikipedia footnotes into standard Markdown format.
- rehypeSlug & rehypeAutolinkHeadings: Adds IDs to headings and automatically generates anchor links.
Convert to Markdown (Remark):
- rehype-remark: Converts the HAST to an MDAST (Markdown Abstract Syntax Tree), with custom handling for links (<a>) and table cells (<td>, <th>).
- remark-gfm: Adds support for GitHub Flavored Markdown (GFM), including tables and strikethrough.
- remark-stringify: Serializes the MDAST into a Markdown string.
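The link normalization performed in the pre-processing stage can be illustrated for a single attribute value. This sketch uses the standard URL constructor to resolve a value against the page URL. It is not the source of rehypeAbsoluteLinks, which operates on the HAST; toAbsoluteUrl is a hypothetical helper.

```javascript
// Hypothetical helper: resolve an href or src value against the page URL.
function toAbsoluteUrl(value, pageUrl) {
    try {
        // The URL constructor handles root-relative, path-relative,
        // protocol-relative, and fragment-only references.
        return new URL(value, pageUrl).href;
    } catch {
        // Leave values that cannot be resolved untouched.
        return value;
    }
}

const pageUrl = 'https://en.wikipedia.org/wiki/Node.js';
console.log(toAbsoluteUrl('/wiki/JavaScript', pageUrl));
console.log(toAbsoluteUrl('#History', pageUrl));
console.log(toAbsoluteUrl('//upload.wikimedia.org/logo.png', pageUrl));
```

Already absolute URLs pass through unchanged, so the same function can be applied to every href and src on the page.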
Dependencies and Licenses
This project is built upon the following open-source software. We are grateful to the developers of these libraries.
| Package                    | License    |
| -------------------------- | ---------- |
| @modelcontextprotocol/sdk  | MIT        |
| cors                       | MIT        |
| express                    | MIT        |
| github-slugger             | ISC        |
| happy-dom                  | MIT        |
| hast                       | MIT        |
| hast-util-to-html          | MIT        |
| playwright                 | Apache-2.0 |
| rehype-parse               | MIT        |
| rehype-raw                 | MIT        |
| rehype-remark              | MIT        |
| rehype-slug                | MIT        |
| remark-gfm                 | MIT        |
| remark-stringify           | MIT        |
| unified                    | MIT        |
| unist-util-visit           | MIT        |
| yargs-parser               | ISC        |
This list is generated based on the dependencies in package.json. For the most accurate and up-to-date license information, please refer to the individual packages.
Author
License
MIT
