@thds/markdown-block-extractor
v1.2.1
Markdown Block Extractor
A TypeScript library for extracting structured blocks and media items from markdown content. This library processes markdown with custom block markers and extracts both regular content blocks and media items with detailed metadata. Optimized for React, Vite, Deno, and modern JavaScript applications.
Features
- Block Extraction: Extract content blocks marked with HTML comments
- Flexible Block IDs: Support for any string as block ID including GUIDs, alphanumeric strings, and special characters
- Media Detection: Automatically detect images and videos in both markdown and HTML syntax
- Rich Metadata: Generate detailed metadata for each block including word count, line count, and content features
- TypeScript Support: Full TypeScript definitions included
- React/Vite Optimized: Built with Vite for optimal bundling in modern React applications
- Deno Compatible: Works seamlessly in Deno environments and edge functions
- Browser Compatible: Uses native crypto.randomUUID() for UUID generation
- Tree Shakeable: ES modules with proper exports for efficient bundling
- Multiple Build Formats: CommonJS, ES modules, UMD builds, and Deno imports available
- Proper AST Structure: Each block contains a complete Abstract Syntax Tree with individual nodes
- Title Extraction: Automatic title extraction from headings or first text line
How the Parsing Pipeline Works
The extractor uses a multi-stage parsing pipeline so that every block is extracted accurately and ends up with a proper AST structure. Here's how it works:
1. Initial Markdown Parsing
const ast = unified()
  .use(remarkParse)
  .parse(markdown) as Node;
The original markdown is parsed into an Abstract Syntax Tree (AST) using remark-parse. This creates the initial structure, but blocks with HTML content get treated as single HTML nodes.
2. Transform Pipeline (Applied in Order)
The AST goes through several transformation stages:
Stage 1: Orphan Content Wrapper
- Wraps content outside of blocks in custom blocks
- Ensures all content is contained within blocks
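To make Stage 1 more concrete, here is a minimal string-level sketch of the idea. It is not the library's plugin (which works on the AST); the function name `wrapOrphans`, the regular expression, and the injected `makeId` callback are all assumptions made for illustration.

```typescript
// Hypothetical sketch: wrap any text found outside a block in a generated custom block.
// Matches a complete block: start marker, content, and the end marker with the same kind and ID.
const BLOCK_RE =
  /<!--\s*(block|custom-block):id=(.+?)\s*-->[\s\S]*?<!--\s*end-\1:id=\2\s*-->/g;

function wrapOrphans(markdown: string, makeId: () => string): string {
  // Wrap a piece of orphan text, skipping pieces that are only whitespace.
  const wrap = (text: string): string => {
    const trimmed = text.trim();
    if (trimmed === "") return "";
    const id = makeId();
    return `<!-- custom-block:id=${id} -->\n${trimmed}\n<!-- end-custom-block:id=${id} -->\n`;
  };

  let out = "";
  let cursor = 0;
  for (const m of markdown.matchAll(BLOCK_RE)) {
    const start = m.index ?? 0;
    out += wrap(markdown.slice(cursor, start)); // orphan text before this block
    out += m[0] + "\n"; // the existing block, unchanged
    cursor = start + m[0].length;
  }
  out += wrap(markdown.slice(cursor)); // orphan text after the last block
  return out;
}
```

In a runtime that provides it, `() => crypto.randomUUID()` could be passed as `makeId`; a counter works just as well for tests.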
Stage 2: Custom Blocks Processing
- Identifies and processes <!-- block:id=X --> and <!-- custom-block:id=X --> markers
- Creates BlockNode and CustomBlockNode structures
- Collects content between markers as children
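The marker recognition in Stage 2 can be sketched as a small parser over the text of an HTML comment. This is an illustration only: `parseMarker`, the `Marker` shape, and the regular expression are assumptions, not the plugin's actual code.

```typescript
// Hypothetical sketch of recognising block markers in an HTML comment.
type MarkerKind = "block" | "customBlock";

interface Marker {
  kind: MarkerKind;
  id: string;
  isEnd: boolean;
}

// Matches <!-- block:id=X -->, <!-- custom-block:id=X --> and their end- variants.
const MARKER_RE = /^<!--\s*(end-)?(custom-block|block):id=(.*?)\s*-->$/;

function parseMarker(comment: string): Marker | null {
  const match = MARKER_RE.exec(comment.trim());
  if (!match) return null; // an ordinary comment, not a marker
  const id = (match[3] ?? "").trim();
  if (id === "") return null; // empty or whitespace-only IDs are invalid
  return {
    kind: match[2] === "custom-block" ? "customBlock" : "block",
    id,
    isEnd: match[1] !== undefined, // the optional "end-" prefix was present
  };
}
```

A caller could walk the HTML nodes of the AST, call `parseMarker` on each node's value, and collect everything between a start marker and its matching end marker as the block's children.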
Stage 3: Media Extraction
- Extracts images and videos from both markdown and HTML content
- Associates media items with their containing blocks
- Runs on the original parsed content (before re-parsing)
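As a rough illustration of detecting media in both syntaxes, the sketch below scans a string with two regular expressions. The real plugin works on AST nodes and records more fields; `findMedia` and `FoundMedia` are hypothetical names, and the patterns deliberately ignore edge cases such as `<video>` elements that use nested `<source>` tags.

```typescript
// Hypothetical sketch: find images and videos written in markdown or HTML syntax.
interface FoundMedia {
  type: "image" | "video";
  url: string;
  alt?: string;
  syntax: "markdown" | "html";
}

function findMedia(source: string): FoundMedia[] {
  const found: FoundMedia[] = [];

  // Markdown images: ![alt](url) or ![alt](url "title")
  const markdownImage = /!\[([^\]]*)\]\((\S+?)(?:\s+"[^"]*")?\)/g;
  for (const m of source.matchAll(markdownImage)) {
    const [, alt = "", url = ""] = m;
    found.push({ type: "image", url, alt, syntax: "markdown" });
  }

  // HTML <img> and <video> tags that carry a quoted src attribute.
  const htmlMedia = /<(img|video)\b[^>]*?\bsrc=["']([^"']+)["'][^>]*>/gi;
  for (const m of source.matchAll(htmlMedia)) {
    const [, tag = "", url = ""] = m;
    found.push({
      type: tag.toLowerCase() === "img" ? "image" : "video",
      url,
      syntax: "html",
    });
  }

  // Note: markdown matches come first, so results are not in document order.
  return found;
}
```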
Stage 4: Block Extraction with Proper AST ⭐ Key Innovation
- Re-parses block content to create proper AST structure
- Uses position information to extract original markdown content
- Re-parses that content to get individual nodes (headings, paragraphs, etc.)
3. The Re-parsing Process
When processing each block, the system:
Extracts Original Content:
const blockContent = originalMarkdown.substring(startOffset, endOffset);
const markdownContent = blockContent.substring(contentStart, contentEnd).trim();
Re-parses the Original Markdown:
const parsed = unified()
  .use(remarkParse)
  .parse(markdownContent) as Node;
Creates Proper AST Structure:
- Instead of a single HTML node, you get individual nodes:
  - heading nodes for # Title
  - paragraph nodes for text content
  - image nodes for markdown images
  - text nodes for plain text
  - etc.
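The position-based extraction step can be sketched without any parser. The snippet above does not show how `contentStart` and `contentEnd` are computed, so the approach here (skip past the first `-->`, stop at the last `<!--`) is an assumption, as is the function name `extractInnerMarkdown`.

```typescript
// Hypothetical sketch: slice a block out of the original markdown by offsets,
// then strip the start and end markers to get the inner markdown.
function extractInnerMarkdown(
  originalMarkdown: string,
  startOffset: number,
  endOffset: number
): string | null {
  // The slice covers the start marker, the content, and the end marker.
  const blockContent = originalMarkdown.substring(startOffset, endOffset);

  // Content begins right after the first "-->" (the end of the start marker)
  // and stops at the last "<!--" (the beginning of the end marker).
  const firstClose = blockContent.indexOf("-->");
  const contentEnd = blockContent.lastIndexOf("<!--");
  if (firstClose === -1 || contentEnd === -1) return null;

  const contentStart = firstClose + "-->".length;
  if (contentEnd < contentStart) return null; // only one marker in the slice

  return blockContent.substring(contentStart, contentEnd).trim();
}
```

Returning `null` gives the caller a clear signal to fall back to another strategy when the offsets do not enclose a complete block.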
4. Example Transformation
Before (Single HTML node):
{
"type": "html",
"value": "<img src=\"test.jpg\">\n# Test Block\nSome content here."
}
After (Proper AST structure):
{
"type": "root",
"children": [
{
"type": "html",
"value": "<img src=\"test.jpg\">"
},
{
"type": "heading",
"depth": 1,
"children": [
{
"type": "text",
"value": "Test Block"
}
]
},
{
"type": "paragraph",
"children": [
{
"type": "text",
"value": "Some content here."
}
]
}
]
}
5. Benefits of This Approach
- Proper AST Structure: Each block has individual nodes instead of monolithic HTML
- Preserved Functionality: Media extraction, metadata, and title extraction all work correctly
- Position-Based Extraction: Uses original markdown positions for accurate content extraction
- Fallback Handling: If position extraction fails, falls back to node-based extraction
- Backward Compatibility: All existing functionality is preserved
6. Final Result Structure
Each BlockExtract contains:
- ast: Proper AST with individual nodes (headings, paragraphs, images, etc.)
- title: Extracted from the proper AST structure
- markdown: Stringified version of the proper AST
- mediaItems: Correctly associated media items
- metadata: Accurate metadata based on original content
- All other existing fields
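Because the title is derived from the re-parsed AST, a proper node structure matters. The sketch below shows one plausible way to pick a title from the first heading, falling back to the first non-empty text line. The `MdNode` shape is a simplified stand-in for mdast nodes, and `extractTitle` is a hypothetical name; the library's exact rules may differ.

```typescript
// Hypothetical sketch of title extraction over a simplified mdast-like tree.
interface MdNode {
  type: string;
  value?: string;
  children?: MdNode[];
}

// Concatenate the visible text of a node and all of its descendants.
// Raw "html" nodes contribute nothing, so an <img> tag never becomes a title.
function textOf(node: MdNode): string {
  if (node.type === "text" || node.type === "inlineCode") return node.value ?? "";
  return (node.children ?? []).map(textOf).join("");
}

function extractTitle(root: MdNode): string {
  const children = root.children ?? [];

  // Prefer the first heading in the block.
  const heading = children.find((n) => n.type === "heading");
  if (heading) return textOf(heading).trim();

  // Otherwise use the first non-empty line of text.
  for (const node of children) {
    const firstLine = (textOf(node).split("\n")[0] ?? "").trim();
    if (firstLine !== "") return firstLine;
  }
  return "";
}
```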
Installation
NPM (React/Vite/Node.js)
npm install @thds/markdown-block-extractor
import { parse } from "@thds/markdown-block-extractor";
Deno
import { parse } from "https://deno.land/x/[email protected]/src/index.ts";
Or using an import map in deno.json:
{
"imports": {
"@thds/markdown-block-extractor": "https://deno.land/x/[email protected]/src/index.ts"
}
}
import { parse } from "@thds/markdown-block-extractor";
React Usage
import React, { useEffect, useState } from 'react';
import { parse, type ParseResult } from '@thds/markdown-block-extractor';
function MarkdownProcessor() {
  const [result, setResult] = useState<ParseResult | null>(null);
  useEffect(() => {
    const markdown = `<!-- block:id=my-block-123 -->
# My Block

Some content here.
<!-- end-block:id=my-block-123 -->`;
    const parsed = parse(markdown);
    setResult(parsed);
  }, []);
  return (
    <div>
      {result?.blockExtracts.map(block => (
        <div key={block.id}>
          <h3>{block.title || `Block ${block.id}`}</h3>
          <p>Word count: {block.metadata.wordCount}</p>
          <p>Has images: {block.metadata.hasImages ? 'Yes' : 'No'}</p>
          {/* block.markdown is markdown source, not HTML, so show it as text */}
          <pre>{block.markdown}</pre>
        </div>
      ))}
    </div>
  );
}
Deno Edge Function Usage
import { parse } from "https://deno.land/x/[email protected]/src/index.ts";
Deno.serve(async (req: Request) => {
  if (req.method !== "POST") {
    return new Response("Method not allowed", { status: 405 });
  }
  try {
    const { markdown } = await req.json();
    const result = parse(markdown);
    return new Response(JSON.stringify(result), {
      headers: { "Content-Type": "application/json" }
    });
  } catch (error) {
    // `error` is typed `unknown` in strict TypeScript, so narrow it before reading `.message`.
    const message = error instanceof Error ? error.message : String(error);
    return new Response(
      JSON.stringify({ error: message }),
      { status: 500, headers: { "Content-Type": "application/json" } }
    );
  }
});
Deno Local Usage
import { parse } from "https://deno.land/x/[email protected]/src/index.ts";
const markdown = `<!-- block:id=my-block-123 -->
# My Block

Some content here.
<!-- end-block:id=my-block-123 -->`;
const result = parse(markdown);
console.log(result.blockExtracts);
Usage
import { parse } from '@thds/markdown-block-extractor';
const markdown = `<!-- block:id=my-block-123 -->
# My Block

Some content here.
<!-- end-block:id=my-block-123 -->`;
const result = parse(markdown);
console.log(result.blockExtracts);
// [
//   {
//     id: "my-block-123",
//     type: "block",
//     title: "My Block",
//     markdown: "# My Block\n\nSome content here.",
//     ast: {
//       type: "root",
//       children: [
//         {
//           type: "heading",
//           depth: 1,
//           children: [{ type: "text", value: "My Block" }]
//         },
//         {
//           type: "paragraph",
//           children: [{ type: "text", value: "Some content here." }]
//         }
//       ]
//     },
//     mediaItems: [],
//     metadata: {
//       lineCount: 3,
//       hasImages: false,
//       hasVideos: false,
//       hasCodeBlocks: false,
//       hasTables: false,
//       hasLists: false,
//       hasLinks: false,
//       wordCount: 4,
//       characterCount: 50
//     }
//   }
// ]
Block Syntax
The library recognizes two types of blocks with flexible ID support:
Regular Blocks
<!-- block:id=1 -->
Your content here
<!-- end-block:id=1 -->
<!-- block:id=my-block-123 -->
Your content here
<!-- end-block:id=my-block-123 -->
<!-- block:id=550e8400-e29b-41d4-a716-446655440000 -->
Your content here
<!-- end-block:id=550e8400-e29b-41d4-a716-446655440000 -->
Custom Blocks
<!-- custom-block:id=2 -->
Your content here
<!-- end-custom-block:id=2 -->
<!-- custom-block:id=special-block_123 -->
Your content here
<!-- end-custom-block:id=special-block_123 -->
Block ID Rules
- Block IDs can be any non-empty string
- Supports GUIDs, alphanumeric strings, and special characters
- Must match exactly between start and end markers
- Empty or whitespace-only IDs are treated as invalid
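These rules can be checked before parsing. The sketch below is a standalone pre-flight check, not part of the library's API: `isValidBlockId` and `findUnclosedBlocks` are hypothetical helpers, and the regular expression is an assumption about the marker format shown above.

```typescript
// Hypothetical pre-flight check for the ID rules: IDs must be non-empty and
// the end marker must repeat the start marker's kind and ID exactly.
function isValidBlockId(id: string): boolean {
  return id.trim().length > 0;
}

// Returns "kind:id" keys for start markers that have no exactly matching end marker.
function findUnclosedBlocks(markdown: string): string[] {
  const open: string[] = [];
  const markerRe = /<!--\s*(end-)?(custom-block|block):id=(.*?)\s*-->/g;

  for (const m of markdown.matchAll(markerRe)) {
    const [, endPrefix, kind = "", rawId = ""] = m;
    const id = rawId.trim();
    if (!isValidBlockId(id)) continue; // invalid IDs are ignored here

    const key = `${kind}:${id}`;
    if (endPrefix === undefined) {
      open.push(key); // start marker
    } else {
      const i = open.lastIndexOf(key); // comparison is exact and case-sensitive
      if (i !== -1) open.splice(i, 1);
    }
  }
  return open;
}
```

An empty result means every start marker found a matching end marker.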
API Reference
parse(markdown: string): ParseResult
Parses markdown content and returns extracted blocks and media items.
Parameters:
- markdown (string): The markdown content to parse
Returns:
- ParseResult: Object containing:
  - blockExtracts: Array of extracted blocks
  - mediaItems: Array of all media items found
  - ast: The parsed AST tree
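Since the top-level mediaItems array spans all blocks, a common follow-up is to group items by the block they belong to. The sketch below assumes only the `blockId` field documented on MediaItem; `MediaItemLike` is a trimmed local stand-in for the library type and `groupMediaByBlock` is a hypothetical helper, so the example runs without the package installed.

```typescript
// Hypothetical helper: group media items by their containing block.
interface MediaItemLike {
  type: "image" | "video";
  url: string;
  blockId?: string;
  syntax: "markdown" | "html";
}

function groupMediaByBlock(mediaItems: MediaItemLike[]): Map<string, MediaItemLike[]> {
  const groups = new Map<string, MediaItemLike[]>();
  for (const item of mediaItems) {
    // Items without a blockId are collected under a placeholder key.
    const key = item.blockId ?? "(no block)";
    const group = groups.get(key) ?? [];
    group.push(item);
    groups.set(key, group);
  }
  return groups;
}
```

With the real library, the input would be `parse(markdown).mediaItems`.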
Types
BlockExtract
interface BlockExtract {
id: string;
type: 'block' | 'customBlock';
title: string;
markdown: string;
ast: Node;
mediaItems: MediaItem[];
metadata: BlockMetadata;
position?: Position;
}
MediaItem
interface MediaItem {
type: 'image' | 'video';
url: string;
alt?: string;
title?: string;
blockId?: string;
blockType?: string;
position?: Position;
syntax: 'markdown' | 'html';
}
BlockMetadata
interface BlockMetadata {
lineCount: number;
hasImages: boolean;
hasVideos: boolean;
hasCodeBlocks: boolean;
hasTables: boolean;
hasLists: boolean;
hasLinks: boolean;
wordCount: number;
characterCount: number;
}
Development
Running the Example
# Node.js/NPM
npm run example
# or
npx tsx examples/example.ts
# Deno
npm run example:deno
# or
deno run --allow-read examples/deno-example.ts
Running Tests
# Node.js/NPM
npm test
# or
npm run test:run
# Deno
npm run test:deno
# or
deno test --allow-read --allow-net
Building
# Node.js/NPM builds
npm run build
# Deno build
npm run build:deno
# or
deno run --allow-read --allow-write npm:typescript@^5.9.2 --project tsconfig.deno.json
Releasing (npm + JSR)
# Interactive release: bumps package.json and deno.json, builds, validates, and publishes
npm run release
# Under the hood, you can also run each step manually:
npm run build
deno publish --dry-run
npm run publish:npm
npm run publish:jsr
Development Mode
# Node.js/NPM
npm run dev
# Runs the example in watch mode
# Deno
npm run dev:deno
# or
deno run --allow-read --watch examples/deno-example.ts
Project Structure
markdown-block-extractor/
├── src/                          # Source code
│   ├── index.ts                  # Main library entry point
│   ├── types/                    # TypeScript type definitions
│   │   └── index.ts
│   ├── utils/                    # Utility functions
│   │   ├── index.ts
│   │   ├── block-utils.ts
│   │   ├── html-parser.ts
│   │   ├── media-utils.ts
│   │   ├── metadata-utils.ts
│   │   ├── node-utils.ts
│   │   └── uuid-utils.ts
│   └── plugins/                  # Remark plugins
│       ├── remark-block-extractor.ts
│       ├── remark-custom-blocks.ts
│       ├── remark-media-extractor.ts
│       └── remark-orphan-content-wrapper.ts
├── dist/                         # Built output
│   ├── index.cjs.js              # CommonJS build
│   ├── index.es.js               # ES modules build
│   ├── index.umd.js              # UMD build
│   └── index.d.ts                # TypeScript definitions
├── dist-deno/                    # Deno build output
├── tests/                        # Test files
│   └── test.spec.ts
├── examples/                     # Example usage
│   ├── example.ts                # Node.js example
│   ├── deno-example.ts           # Deno example
│   └── deno-edge-function.ts     # Deno edge function example
├── package.json                  # NPM package configuration
├── deno.json                     # Deno configuration
├── vite.config.ts                # Vite build configuration
├── tsconfig.json                 # TypeScript configuration
├── tsconfig.deno.json            # Deno TypeScript configuration
└── README.md
Package Information
- NPM Package: @thds/markdown-block-extractor
- Version: 1.1.0
- Author: THDS GmbH
- License: CC-BY-NC-4.0
- Node.js: >=16.0.0
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC-4.0).
Important: This license prohibits commercial use. You may use this library for personal, educational, or non-commercial projects, but commercial use requires explicit permission from the copyright holder.
For commercial licensing inquiries, please contact THDS GmbH.
