@thds/markdown-block-extractor

v1.2.1

Published

3 months ago

A TypeScript library for extracting structured blocks and media items from markdown content. Optimized for React, Vite, and Deno.

jankrueger_datenlotse

tom.nissing

torbenthds

markdown parser extractor blocks media typescript remark unified react vite deno edge-function frontend library

Markdown Block Extractor

A TypeScript library for extracting structured blocks and media items from markdown content. This library processes markdown with custom block markers and extracts both regular content blocks and media items with detailed metadata. Optimized for React, Vite, Deno, and modern JavaScript applications.

Features

Block Extraction: Extract content blocks marked with HTML comments
Flexible Block IDs: Support for any string as block ID including GUIDs, alphanumeric strings, and special characters
Media Detection: Automatically detect images and videos in both markdown and HTML syntax
Rich Metadata: Generate detailed metadata for each block including word count, line count, and content features
TypeScript Support: Full TypeScript definitions included
React/Vite Optimized: Built with Vite for optimal bundling in modern React applications
Deno Compatible: Works seamlessly in Deno environments and edge functions
Browser Compatible: Uses native crypto.randomUUID() for UUID generation
Tree Shakeable: ES modules with proper exports for efficient bundling
Multiple Build Formats: CommonJS, ES modules, UMD builds, and Deno imports available
Proper AST Structure: Each block contains a complete Abstract Syntax Tree with individual nodes
Title Extraction: Automatic title extraction from headings or first text line

How the Parsing Pipeline Works

The extractor runs a multi-stage pipeline so that each block ends up with its own AST of individual nodes rather than one opaque HTML node. The stages are:

1. Initial Markdown Parsing

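import { unified } from "unified";
import remarkParse from "remark-parse";
import type { Node } from "unist";
const ast = unified()
  .use(remarkParse)
  .parse(markdown) as Node;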

The original markdown is parsed into an Abstract Syntax Tree (AST) using remark-parse. This creates the initial structure, but blocks with HTML content get treated as single HTML nodes.

2. Transform Pipeline (Applied in Order)

The AST goes through several transformation stages:

Stage 1: Orphan Content Wrapper

Wraps content outside of blocks in custom blocks
Ensures all content is contained within blocks
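
As a rough illustration of Stage 1, the sketch below wraps any content that sits outside block markers in generated custom-block markers. It works on a flat list of lines rather than the real AST, and `wrapOrphans`, the injectable `makeId` callback, and the choice of custom-block markers for the wrapper are assumptions made for this example, not the library's implementation.

```typescript
// Simplified stand-in for the real AST: one entry per top-level line.
type Line = string;

const START = /^<!--\s*(?:custom-)?block:id=.+?\s*-->$/;
const END = /^<!--\s*end-(?:custom-)?block:id=.+?\s*-->$/;

// Wrap every run of lines outside a block in generated custom-block markers.
// `makeId` is injectable so the output is deterministic in tests.
function wrapOrphans(lines: Line[], makeId: () => string): Line[] {
  const out: Line[] = [];
  let orphan: Line[] = [];
  let insideBlock = false;

  const flush = (): void => {
    // Runs that contain only blank lines are passed through unwrapped.
    if (orphan.some((l) => l.trim() !== "")) {
      const id = makeId();
      out.push(`<!-- custom-block:id=${id} -->`, ...orphan, `<!-- end-custom-block:id=${id} -->`);
    } else {
      out.push(...orphan);
    }
    orphan = [];
  };

  for (const line of lines) {
    const trimmed = line.trim();
    if (!insideBlock && START.test(trimmed)) {
      flush();
      insideBlock = true;
      out.push(line);
    } else if (insideBlock && END.test(trimmed)) {
      insideBlock = false;
      out.push(line);
    } else if (insideBlock) {
      out.push(line);
    } else {
      orphan.push(line);
    }
  }
  flush();
  return out;
}
```

In a browser or Deno, `makeId` could simply be `() => crypto.randomUUID()`.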

Stage 2: Custom Blocks Processing

Identifies and processes <!-- block:id=... --> and <!-- custom-block:id=... --> start markers together with their matching end markers
Creates BlockNode and CustomBlockNode structures
Collects content between markers as children
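
The grouping step can be pictured with the following sketch, which scans lines for start and end markers and collects everything in between as children. It is a line-based approximation under stated assumptions: `collectBlocks`, `CollectedBlock`, and the marker regex are hypothetical, and the real plugin operates on AST nodes rather than raw lines.

```typescript
type BlockType = "block" | "customBlock";

interface CollectedBlock {
  type: BlockType;
  id: string;
  children: string[]; // lines between the start and end marker
}

// Matches start and end markers. Group 1 = "end-" (optional),
// group 2 = "custom-" (optional), group 3 = the raw ID.
const MARKER = /^<!--\s*(end-)?(custom-)?block:id=(.*?)\s*-->$/;

function collectBlocks(markdown: string): CollectedBlock[] {
  const blocks: CollectedBlock[] = [];
  let open: CollectedBlock | null = null;

  for (const line of markdown.split("\n")) {
    const m = MARKER.exec(line.trim());
    if (!m) {
      // Ordinary content: attach it to the currently open block, if any.
      if (open) open.children.push(line);
      continue;
    }
    const isEnd = m[1] !== undefined;
    const type: BlockType = m[2] !== undefined ? "customBlock" : "block";
    const id = m[3] ?? "";

    if (!isEnd && !open) {
      open = { type, id, children: [] };
    } else if (isEnd && open && open.type === type && open.id === id) {
      blocks.push(open);
      open = null;
    }
    // Nested starts and mismatched ends are ignored in this sketch.
  }
  return blocks;
}
```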

Stage 3: Media Extraction

Extracts images and videos from both markdown and HTML content
Associates media items with their containing blocks
Runs on the original parsed content (before re-parsing)
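
To make the "markdown and HTML" distinction concrete, here is a minimal regex-based sketch of media detection. The library's actual detection rules are not documented in this README, so `findMedia`, `SketchMediaItem`, and the patterns (markdown images, plus `<img>` and `<video>` tags with a quoted `src`) are assumptions for illustration only.

```typescript
interface SketchMediaItem {
  type: "image" | "video";
  url: string;
  alt?: string;
  syntax: "markdown" | "html";
}

function findMedia(content: string): SketchMediaItem[] {
  const items: SketchMediaItem[] = [];
  let m: RegExpExecArray | null;

  // Markdown images: ![alt](url) or ![alt](url "title")
  const mdImage = /!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)/g;
  while ((m = mdImage.exec(content)) !== null) {
    items.push({ type: "image", url: m[2] ?? "", alt: m[1] ?? "", syntax: "markdown" });
  }

  // HTML <img> and <video> tags that carry a quoted src attribute.
  const htmlMedia = /<(img|video)\b[^>]*?\bsrc\s*=\s*["']([^"']+)["'][^>]*>/gi;
  while ((m = htmlMedia.exec(content)) !== null) {
    const tag = (m[1] ?? "").toLowerCase();
    items.push({ type: tag === "video" ? "video" : "image", url: m[2] ?? "", syntax: "html" });
  }
  return items;
}
```

Note that this sketch lists all markdown matches before all HTML matches, so the result is not in document order when the two syntaxes are interleaved.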

Stage 4: Block Extraction with Proper AST ⭐ Key Innovation

Re-parses block content to create proper AST structure
Uses position information to extract original markdown content
Re-parses that content to get individual nodes (headings, paragraphs, etc.)

3. The Re-parsing Process

When processing each block, the system:

Extracts Original Content:

const blockContent = originalMarkdown.substring(startOffset, endOffset);
const markdownContent = blockContent.substring(contentStart, contentEnd).trim();

Re-parses the Original Markdown:

const parsed = unified()
  .use(remarkParse)
  .parse(markdownContent) as Node;

Creates Proper AST Structure:
- Instead of a single HTML node, you get individual nodes:
  - heading nodes for # Title
  - paragraph nodes for text content
  - image nodes for ![alt](url)
  - text nodes for plain text
  - etc.
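
The snippets above leave open where `contentStart` and `contentEnd` come from. One plausible way to derive them, shown here as a sketch and not as the library's code, is to look for the end of the opening marker and the start of the closing marker inside the sliced block. `extractInnerMarkdown` is a hypothetical helper name.

```typescript
// Slice a block's inner markdown out of the original source, given the
// offsets of the whole block (start and end markers included).
function extractInnerMarkdown(
  originalMarkdown: string,
  startOffset: number,
  endOffset: number,
): string {
  const blockContent = originalMarkdown.substring(startOffset, endOffset);

  // Content starts right after the opening marker's "-->" ...
  const openEnd = blockContent.indexOf("-->");
  // ... and ends right before the closing marker's "<!--".
  const closeStart = blockContent.lastIndexOf("<!--");

  if (openEnd === -1 || closeStart === -1 || closeStart < openEnd + 3) {
    // No usable marker pair: return the trimmed slice unchanged.
    return blockContent.trim();
  }
  return blockContent.substring(openEnd + 3, closeStart).trim();
}
```

The returned string is what would then be handed to the second `unified().use(remarkParse).parse(...)` call.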

4. Example Transformation

Before (Single HTML node):

{
  "type": "html",
  "value": "<img src=\"test.jpg\">\n# Test Block\nSome content here."
}

After (Proper AST structure):

{
  "type": "root",
  "children": [
    {
      "type": "html",
      "value": "<img src=\"test.jpg\">"
    },
    {
      "type": "heading",
      "depth": 1,
      "children": [
        {
          "type": "text",
          "value": "Test Block"
        }
      ]
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "Some content here."
        }
      ]
    }
  ]
}

5. Benefits of This Approach

Proper AST Structure: Each block has individual nodes instead of monolithic HTML
Preserved Functionality: Media extraction, metadata, and title extraction all work correctly
Position-Based Extraction: Uses original markdown positions for accurate content extraction
Fallback Handling: If position extraction fails, falls back to node-based extraction
Backward Compatibility: All existing functionality is preserved
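
The fallback behaviour mentioned above might look roughly like this. The sketch assumes a unist-style node with optional `position` offsets; `SketchNode`, `nodeText`, and `blockSource` are hypothetical names, and the real fallback may rebuild text differently.

```typescript
interface SketchNode {
  type: string;
  value?: string;
  children?: SketchNode[];
  position?: { start: { offset?: number }; end: { offset?: number } };
}

// Concatenate the literal values found in a node and its descendants.
function nodeText(node: SketchNode): string {
  if (node.value !== undefined) return node.value;
  return (node.children ?? []).map(nodeText).join("\n");
}

function blockSource(node: SketchNode, originalMarkdown: string): string {
  const start = node.position?.start.offset;
  const end = node.position?.end.offset;

  // Preferred path: slice the exact source text by offset.
  if (start !== undefined && end !== undefined && start <= end && end <= originalMarkdown.length) {
    return originalMarkdown.substring(start, end);
  }
  // Fallback path: rebuild text from node values (original formatting is lost).
  return nodeText(node);
}
```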

6. Final Result Structure

Each BlockExtract contains:

ast: Proper AST with individual nodes (headings, paragraphs, images, etc.)
title: Extracted from the proper AST structure
markdown: Stringified version of the proper AST
mediaItems: Correctly associated media items
metadata: Accurate metadata based on original content
All other existing fields
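
For the title field, the rule "heading first, otherwise first text line" can be sketched on raw markdown as follows. This is an approximation: the library derives the title from the AST, and `extractTitle` plus its handling of HTML and image lines are assumptions for illustration.

```typescript
function extractTitle(markdown: string): string {
  const lines = markdown
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l !== "");

  // Prefer the first ATX heading ("# Title" through "###### Title").
  for (const line of lines) {
    const m = /^#{1,6}\s+(.*?)(?:\s+#+)?$/.exec(line);
    if (m && m[1]) return m[1];
  }
  // Otherwise use the first line that is neither an HTML tag nor an image.
  for (const line of lines) {
    if (line.charAt(0) !== "<" && line.indexOf("![") !== 0) return line;
  }
  return "";
}
```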

Installation

NPM (React/Vite/Node.js)

npm install @thds/markdown-block-extractor

import { parse } from "@thds/markdown-block-extractor";

Deno

import { parse } from "https://deno.land/x/[email protected]/src/index.ts";

Or using an import map in deno.json:

{
  "imports": {
    "@thds/markdown-block-extractor": "https://deno.land/x/[email protected]/src/index.ts"
  }
}

import { parse } from "@thds/markdown-block-extractor";

React Usage

import React, { useEffect, useState } from 'react';
import { parse, type ParseResult } from '@thds/markdown-block-extractor';

function MarkdownProcessor() {
  const [result, setResult] = useState<ParseResult | null>(null);
  
  useEffect(() => {
    const markdown = `<!-- block:id=my-block-123 -->
# My Block
![Image](https://example.com/image.jpg)
Some content here.
<!-- end-block:id=my-block-123 -->`;
    
    const parsed = parse(markdown);
    setResult(parsed);
  }, []);
  
  return (
    <div>
      {result?.blockExtracts.map(block => (
        <div key={block.id}>
          <h3>{block.title || `Block ${block.id}`}</h3>
          <p>Word count: {block.metadata.wordCount}</p>
          <p>Has images: {block.metadata.hasImages ? 'Yes' : 'No'}</p>
          {/* block.markdown is markdown source, not HTML, so render it as text */}
          <pre>{block.markdown}</pre>
        </div>
      ))}
    </div>
  );
}

Deno Edge Function Usage

import { parse } from "https://deno.land/x/[email protected]/src/index.ts";

Deno.serve(async (req: Request) => {
  if (req.method !== "POST") {
    return new Response("Method not allowed", { status: 405 });
  }

  try {
    const { markdown } = await req.json();
    const result = parse(markdown);
    
    return new Response(JSON.stringify(result), {
      headers: { "Content-Type": "application/json" }
    });
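  } catch (error) {
    // `error` is `unknown` in TypeScript, so narrow it before reading `.message`.
    const message = error instanceof Error ? error.message : String(error);
    return new Response(
      JSON.stringify({ error: message }),
      { status: 500, headers: { "Content-Type": "application/json" } }
    );
  }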
});
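
The handler above passes whatever the request body contains straight to `parse`. A small guard like the one below could reject bodies without a usable `markdown` string before parsing. `readMarkdownField` is a hypothetical helper, not part of the library.

```typescript
// Pull a non-empty `markdown` string out of an already-parsed JSON body.
// Returns null when the body has the wrong shape.
function readMarkdownField(body: unknown): string | null {
  if (typeof body !== "object" || body === null) return null;
  const value = (body as Record<string, unknown>)["markdown"];
  return typeof value === "string" && value.trim() !== "" ? value : null;
}
```

Inside the handler, a `null` result would typically map to a 400 response rather than the generic 500.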

Deno Local Usage

import { parse } from "https://deno.land/x/[email protected]/src/index.ts";

const markdown = `<!-- block:id=my-block-123 -->
# My Block
![Image](https://example.com/image.jpg)
Some content here.
<!-- end-block:id=my-block-123 -->`;

const result = parse(markdown);
console.log(result.blockExtracts);

Usage

import { parse } from '@thds/markdown-block-extractor';

const markdown = `<!-- block:id=my-block-123 -->
# My Block
![Image](https://example.com/image.jpg)
Some content here.
<!-- end-block:id=my-block-123 -->`;

const result = parse(markdown);

console.log(result.blockExtracts);
// [
//   {
//     id: "my-block-123",
//     type: "block",
//     title: "My Block",
//     markdown: "# My Block\n![Image](https://example.com/image.jpg)\nSome content here.",
//     ast: {
//       type: "root",
//       children: [
//         {
//           type: "heading",
//           depth: 1,
//           children: [{ type: "text", value: "My Block" }]
//         },
//         {
//           type: "image",
//           url: "https://example.com/image.jpg",
//           alt: "Image"
//         },
//         {
//           type: "paragraph",
//           children: [{ type: "text", value: "Some content here." }]
//         }
//       ]
//     },
//     mediaItems: [
//       {
//         type: "image",
//         url: "https://example.com/image.jpg",
//         syntax: "markdown"
//       }
//     ],
//     metadata: {
//       lineCount: 3,
//       hasImages: true,
//       hasVideos: false,
//       hasCodeBlocks: false,
//       hasTables: false,
//       hasLists: false,
//       hasLinks: false,
//       wordCount: 4,
//       characterCount: 50
//     }
//   }
// ]

Block Syntax

The library recognizes two types of blocks with flexible ID support:

Regular Blocks

<!-- block:id=1 -->
Your content here
<!-- end-block:id=1 -->

<!-- block:id=my-block-123 -->
Your content here
<!-- end-block:id=my-block-123 -->

<!-- block:id=550e8400-e29b-41d4-a716-446655440000 -->
Your content here
<!-- end-block:id=550e8400-e29b-41d4-a716-446655440000 -->

Custom Blocks

<!-- custom-block:id=2 -->
Your content here
<!-- end-custom-block:id=2 -->

<!-- custom-block:id=special-block_123 -->
Your content here
<!-- end-custom-block:id=special-block_123 -->

Block ID Rules

Block IDs can be any non-empty string
Supports GUIDs, alphanumeric strings, and special characters
Must match exactly between start and end markers
Empty or whitespace-only IDs are treated as invalid
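
These rules can be expressed as a small check on a marker pair. The sketch below is an assumption about how such validation might be written. `matchMarkers` and its regexes are hypothetical and only mirror the rules listed above.

```typescript
// Returns the block ID if the two markers form a valid pair, otherwise null.
function matchMarkers(startMarker: string, endMarker: string): string | null {
  const start = /^<!--\s*(custom-)?block:id=(.*?)\s*-->$/.exec(startMarker.trim());
  const end = /^<!--\s*end-(custom-)?block:id=(.*?)\s*-->$/.exec(endMarker.trim());
  if (!start || !end) return null;

  const startId = start[2] ?? "";
  const endId = end[2] ?? "";
  // A regular start marker must not be closed by a custom end marker, and vice versa.
  const sameKind = (start[1] !== undefined) === (end[1] !== undefined);

  // Rules: non-empty, not whitespace-only, and an exact match on both ends.
  if (startId.trim() === "" || !sameKind || startId !== endId) return null;
  return startId;
}
```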

API Reference

`parse(markdown: string): ParseResult`

Parses markdown content and returns extracted blocks and media items.

Parameters:

markdown (string): The markdown content to parse

Returns:

ParseResult: Object containing:
- blockExtracts: Array of extracted blocks
- mediaItems: Array of all media items found
- ast: The parsed AST tree

Types

`BlockExtract`

interface BlockExtract {
  id: string;
  type: 'block' | 'customBlock';
  title: string;
  markdown: string;
  ast: Node;
  mediaItems: MediaItem[];
  metadata: BlockMetadata;
  position?: Position;
}

`MediaItem`

interface MediaItem {
  type: 'image' | 'video';
  url: string;
  alt?: string;
  title?: string;
  blockId?: string;
  blockType?: string;
  position?: Position;
  syntax: 'markdown' | 'html';
}
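
Because each media item carries an optional `blockId`, the top-level media list can be regrouped per block. The helper below is a sketch that uses a local structural type (`MediaItemLike`) so it stays self-contained. `groupMediaByBlock` and the "(no block)" bucket name are assumptions.

```typescript
interface MediaItemLike {
  type: "image" | "video";
  url: string;
  blockId?: string;
  syntax: "markdown" | "html";
}

// Group media items by the ID of the block that contains them.
function groupMediaByBlock(items: MediaItemLike[]): Record<string, MediaItemLike[]> {
  const groups: Record<string, MediaItemLike[]> = {};
  for (const item of items) {
    const key = item.blockId ?? "(no block)";
    const list = groups[key] ?? [];
    list.push(item);
    groups[key] = list;
  }
  return groups;
}
```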

`BlockMetadata`

interface BlockMetadata {
  lineCount: number;
  hasImages: boolean;
  hasVideos: boolean;
  hasCodeBlocks: boolean;
  hasTables: boolean;
  hasLists: boolean;
  hasLinks: boolean;
  wordCount: number;
  characterCount: number;
}
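
To give a feel for how these fields could be derived, here is a heuristic sketch that fills the same shape from a markdown string. The library's actual counting rules are not documented in this README, so the numbers produced here (especially `wordCount` and `characterCount`) may differ from what `parse` returns. `computeMetadata` and `SketchMetadata` are hypothetical names.

```typescript
interface SketchMetadata {
  lineCount: number;
  hasImages: boolean;
  hasVideos: boolean;
  hasCodeBlocks: boolean;
  hasTables: boolean;
  hasLists: boolean;
  hasLinks: boolean;
  wordCount: number;
  characterCount: number;
}

function computeMetadata(markdown: string): SketchMetadata {
  const lines = markdown.split("\n");
  // A table is assumed when a delimiter row such as "| --- | --- |" is present.
  const hasTables = lines.some(
    (l) => /^[\s|:-]+$/.test(l) && l.indexOf("|") !== -1 && l.indexOf("---") !== -1,
  );
  return {
    lineCount: lines.length,
    hasImages: /!\[[^\]]*\]\([^)]+\)|<img\b/i.test(markdown),
    hasVideos: /<video\b/i.test(markdown),
    // Fenced code only; indented code blocks are not detected.
    hasCodeBlocks: /^[ \t]*(```|~~~)/m.test(markdown),
    hasTables,
    hasLists: /^[ \t]*(?:[-*+]|\d+[.)])[ \t]+\S/m.test(markdown),
    // Links are [text](url) not preceded by "!" (which would make them images).
    hasLinks: /(^|[^!])\[[^\]]+\]\([^)]+\)/.test(markdown),
    // Whitespace-separated tokens, markdown syntax included.
    wordCount: markdown.split(/\s+/).filter((w) => w !== "").length,
    characterCount: markdown.length,
  };
}
```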

Development

Running the Example

# Node.js/NPM
npm run example
# or
npx tsx examples/example.ts

# Deno
npm run example:deno
# or
deno run --allow-read examples/deno-example.ts

Running Tests

# Node.js/NPM
npm test
# or
npm run test:run

# Deno
npm run test:deno
# or
deno test --allow-read --allow-net

Building

# Node.js/NPM builds
npm run build

# Deno build
npm run build:deno
# or
deno run --allow-read --allow-write npm:typescript@^5.9.2 --project tsconfig.deno.json

Releasing (npm + JSR)

# Interactive release: bumps package.json and deno.json, builds, validates, and publishes
npm run release

# Under the hood, you can also run each step manually:
npm run build
deno publish --dry-run
npm run publish:npm
npm run publish:jsr

Development Mode

# Node.js/NPM
npm run dev
# Runs the example in watch mode

# Deno
npm run dev:deno
# or
deno run --allow-read --watch examples/deno-example.ts

Project Structure

markdown-block-extractor/
├── src/                    # Source code
│   ├── index.ts           # Main library entry point
│   ├── types/             # TypeScript type definitions
│   │   └── index.ts
│   ├── utils/             # Utility functions
│   │   ├── index.ts
│   │   ├── block-utils.ts
│   │   ├── html-parser.ts
│   │   ├── media-utils.ts
│   │   ├── metadata-utils.ts
│   │   ├── node-utils.ts
│   │   └── uuid-utils.ts
│   └── plugins/           # Remark plugins
│       ├── remark-block-extractor.ts
│       ├── remark-custom-blocks.ts
│       ├── remark-media-extractor.ts
│       └── remark-orphan-content-wrapper.ts
├── dist/                  # Built output
│   ├── index.cjs.js       # CommonJS build
│   ├── index.es.js        # ES modules build
│   ├── index.umd.js       # UMD build
│   └── index.d.ts         # TypeScript definitions
├── dist-deno/             # Deno build output
├── tests/                 # Test files
│   └── test.spec.ts
├── examples/              # Example usage
│   ├── example.ts         # Node.js example
│   ├── deno-example.ts    # Deno example
│   └── deno-edge-function.ts # Deno edge function example
├── package.json           # NPM package configuration
├── deno.json              # Deno configuration
├── vite.config.ts         # Vite build configuration
├── tsconfig.json          # TypeScript configuration
├── tsconfig.deno.json     # Deno TypeScript configuration
└── README.md

Package Information

NPM Package: @thds/markdown-block-extractor
Version: 1.2.1
Author: THDS GmbH
License: CC-BY-NC-4.0
Node.js: >=16.0.0

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC-4.0).

Important: This license prohibits commercial use. You may use this library for personal, educational, or non-commercial projects, but commercial use requires explicit permission from the copyright holder.

For commercial licensing inquiries, please contact THDS GmbH.