markudown-engine

v0.1.5

Published

2 months ago

High-performance web scraping engine with BullMQ workers — Cheerio, Playwright, and Abrasio layers

0High
0Medium
0Low

jvosobhie

scraper markdown bullmq playwright cheerio

MarkUDown Engine

High-performance web scraping engine powered by BullMQ, Playwright, and Cheerio. Converts any web page into clean markdown with a 3-layer extraction strategy.

Architecture

                                    MarkUDown Engine
                                          |
                    +---------------------+---------------------+
                    |                     |                     |
              Worker (TS)          Go HTML->MD          Python LLM
              BullMQ + Node.js     Port 3001            Port 3002
                    |                     |                     |
              +-----+-----+         HTML -> MD          Gemini/OpenAI
              |     |     |                             Extract/Research
           Cheerio PW  Abrasio
           (L1)  (L2)  (L3)

Worker (TypeScript) — Main processing engine. Receives jobs via BullMQ (Redis), scrapes pages using a 3-layer fallback orchestrator, cleans HTML, and converts to markdown.

Go HTML-to-Markdown — Lightweight microservice for high-performance HTML-to-markdown conversion (~10-50x faster than JS alternatives). Falls back to in-process Turndown if unavailable.

Python LLM Service — Handles AI-powered structured data extraction, schema generation, and deep research synthesis using Gemini.

3-Layer Extraction Orchestrator

Each scrape request passes through layers until content is successfully extracted:

| Layer | Engine | Speed | Use Case | |-------|--------|-------|----------| | 1 | Cheerio | ~100ms | Static HTML sites (no JS rendering needed) | | 2 | Playwright | ~2-5s | JavaScript-rendered SPAs, dynamic content | | 3 | Abrasio | ~5-15s | Anti-bot protected sites (CAPTCHA, fingerprint detection) |

Layer 1 (Cheerio): HTTP fetch + DOM parsing. No browser overhead. Validates that content is > 50 chars and has no CAPTCHA markers.
Layer 2 (Playwright): Headless Chromium with semaphore-controlled concurrency. Blocks images/media/fonts for speed. Detects soft blocks (403/429/503).
Layer 3 (Abrasio): Proprietary stealth engine with browser fingerprinting, CAPTCHA solving, IP rotation, and profile management. Only available when ABRASIO_API_URL is configured.

Without Abrasio configured, the engine operates in open-source mode using Layers 1 and 2.

Job Types

| Job | Type | Description | |-----|------|-------------| | /scrape | Sync | Scrape a single URL, return markdown + metadata | | /map | Sync | Discover all URLs on a website (sitemap + link crawl) | | /crawl | Async | Recursively crawl a site with depth/limit controls | | /batch-scrape | Async | Scrape multiple URLs in parallel | | /extract | Async | Scrape + LLM-based structured data extraction | | /search | Async | Google search + scrape results | | /screenshot | Sync | Full-page screenshot via Playwright | | /rss | Async | Generate RSS feed from any web page | | /change-detection | Async | Detect content changes via hash comparison | | /deep-research | Async | Multi-page scrape + LLM synthesis report | | /agent | Async | AI-driven autonomous web navigation — answers a question by iteratively scraping and navigating pages |

Sync jobs return results immediately. Async jobs return a job_id — poll GET /{job_type}/{job_id} for status and results.

Quick Start

Docker Compose (Recommended)

# 1. Clone and configure
cp .env.example .env
# Edit .env with your settings (GENAI_API_KEY for LLM features)

# 2. Start all services
docker-compose up -d

# 3. Verify
curl http://localhost:3001/health   # Go service
curl http://localhost:3002/health   # Python LLM service

Development

# 1. Install dependencies
npm install
npx playwright install chromium

# 2. Start Redis
docker run -d --name redis -p 6379:6379 redis:7-alpine

# 3. Start Go service (optional, falls back to Turndown)
cd services/go-html-to-md && go run . &

# 4. Start Python LLM service (optional, needed for /extract and /deep-research)
cd services/python-llm && pip install -r requirements.txt && python main.py &

# 5. Start worker (with hot-reload)
npm run dev

Build

npm run build          # Compile TypeScript to dist/
npm run typecheck      # Type check without emitting
npm run lint           # ESLint

Project Structure

MarkUDown-Engine/
├── src/                           # TypeScript Worker (main engine)
│   ├── index.ts                   # Entry: starts BullMQ workers + Playwright
│   ├── config.ts                  # Zod-validated environment config
│   ├── engine/
│   │   ├── orchestrator.ts        # 3-layer fallback (Cheerio -> Playwright -> Abrasio)
│   │   ├── cheerio-engine.ts      # Layer 1: HTTP fetch + Cheerio parse
│   │   ├── playwright-engine.ts   # Layer 2: headless browser + semaphore
│   │   └── abrasio-engine.ts      # Layer 3: proprietary stealth API
│   ├── jobs/
│   │   ├── scrape.ts              # Single URL scrape
│   │   ├── crawl.ts               # Recursive BFS crawl
│   │   ├── map.ts                 # URL discovery (sitemap + links)
│   │   ├── batch-scrape.ts        # Parallel multi-URL scrape
│   │   ├── extract.ts             # Scrape + LLM extraction
│   │   ├── search.ts              # Google search + scrape
│   │   ├── screenshot.ts          # Full-page screenshot
│   │   ├── rss.ts                 # RSS feed generation
│   │   ├── change-detection.ts    # Content diff via SHA-256 hash
│   │   ├── deep-research.ts       # Multi-page research synthesis
│   │   └── agent.ts               # AI autonomous navigation agent
│   ├── processors/
│   │   ├── html-cleaner.ts        # Cheerio-based HTML sanitization
│   │   └── markdown-client.ts     # Go service client + Turndown fallback
│   ├── queues/
│   │   ├── connection.ts          # Redis connection
│   │   ├── queues.ts              # BullMQ queue definitions (10 queues)
│   │   └── workers.ts             # Worker registration and lifecycle
│   └── utils/
│       ├── logger.ts              # Winston structured logging
│       ├── errors.ts              # TransportableError hierarchy
│       ├── url-utils.ts           # URL normalize, filter, extract
│       └── redis.ts               # Redis client for auxiliary storage
├── services/
│   ├── go-html-to-md/             # Go microservice (~15MB Docker image)
│   │   ├── main.go                # HTTP server on port 3001
│   │   ├── handler.go             # POST /convert endpoint
│   │   ├── converter.go           # html-to-markdown v2
│   │   ├── go.mod
│   │   └── Dockerfile
│   └── python-llm/                # Python LLM microservice
│       ├── main.py                # FastAPI on port 3002
│       ├── routers/
│       │   ├── extract.py         # POST /extract (Gemini structured extraction)
│       │   ├── schema.py          # POST /schema/create (NL -> JSON schema)
│       │   └── deep_research.py   # POST /deep-research (multi-source synthesis)
│       ├── requirements.txt
│       └── Dockerfile
├── package.json
├── tsconfig.json
├── Dockerfile                     # Worker: node:20-slim + Playwright Chromium
├── docker-compose.yml             # Production: Redis + Worker + Go + Python
├── docker-compose.dev.yml         # Development with hot-reload
└── .env.example

Configuration

All configuration is via environment variables. See .env.example for the full list.

| Variable | Default | Description | |----------|---------|-------------| | REDIS_URL | redis://localhost:6379 | Redis connection for BullMQ | | GO_MD_SERVICE_URL | http://localhost:3001 | Go HTML-to-Markdown service | | PYTHON_LLM_URL | http://localhost:3002 | Python LLM service | | ABRASIO_API_URL | (empty) | Abrasio stealth engine URL (empty = disabled) | | ABRASIO_API_KEY | (empty) | Abrasio API key | | GENAI_API_KEY | (empty) | Google Gemini API key (for /extract, /deep-research) | | PROXY_URL | (empty) | Proxy server address, e.g. http://host:port | | PROXY_USERNAME | (empty) | Proxy username prefix — target country code is appended per request (e.g. user-country-) | | PROXY_PASSWORD | (empty) | Proxy password | | HEADLESS | true | Run Playwright browser in headless mode. Set to false to open a visible window (local dev only — requires a display server such as Xvfb in Docker) | | MAX_CONCURRENT_PAGES | 10 | Max simultaneous Playwright pages | | MAX_CRAWL_DEPTH | 5 | Default max crawl depth | | MAX_CRAWL_URLS | 1000 | Default max URLs per crawl | | DEFAULT_TIMEOUT | 60 | Default timeout in seconds |

API Integration

MarkUDown Engine is designed to be used with a separate API gateway that handles authentication, billing, and rate limiting. The API pushes jobs to BullMQ queues and polls Redis for results.

Push a Job (Python example with python-bullmq)

from bullmq import Queue

queue = Queue("scrape", {"connection": "redis://localhost:6379"})

job = await queue.add("scrape", {
    "url": "https://example.com",
    "options": {
        "main_content": True,
        "include_link": True,
        "timeout": 60
    }
})

print(f"Job ID: {job.id}")

Poll for Results

import redis
import json

r = redis.from_url("redis://localhost:6379", decode_responses=True)

# Check job status
key = f"bull:scrape:{job_id}"
finished = r.hget(key, "finishedOn")

if finished:
    result = json.loads(r.hget(key, "returnvalue"))
    print(result["data"]["markdown"])

Job Data Formats

Scrape Job:

{
  "url": "https://example.com",
  "options": {
    "timeout": 60,
    "exclude_tags": ["nav", "footer"],
    "main_content": true,
    "include_link": true,
    "include_html": false,
    "force_playwright": false,
    "force_abrasio": false
  }
}

Crawl Job:

{
  "url": "https://example.com",
  "options": {
    "max_depth": 3,
    "limit": 50,
    "concurrency": 5,
    "blocked_words": ["login", "admin"],
    "allowed_patterns": ["/blog/", "/docs/"],
    "main_content": true
  }
}

Extract Job:

{
  "url": "https://example.com/products",
  "schema": {
    "name": "string",
    "price": "float",
    "description": "string",
    "url": "url"
  },
  "extract_query": "Extract all product listings"
}

Map Job:

{
  "url": "https://example.com",
  "options": {
    "max_urls": 500,
    "allowed_words": ["blog", "docs"],
    "blocked_words": ["login", "cart"]
  }
}

Batch Scrape Job:

{
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "options": {
    "main_content": true,
    "timeout": 30
  }
}

Search Job:

{
  "query": "best web scraping tools 2026",
  "options": {
    "limit": 5,
    "scrape_results": true,
    "lang": "en",
    "country": "us"
  }
}

RSS Job:

{
  "url": "https://example.com/blog",
  "options": {
    "max_items": 20,
    "title": "Example Blog Feed"
  }
}

Screenshot Job:

{
  "url": "https://example.com",
  "options": {
    "full_page": true,
    "type": "png",
    "timeout": 30
  }
}

Change Detection Job:

{
  "url": "https://example.com/pricing",
  "options": {
    "main_content": true,
    "include_diff": true
  }
}

Deep Research Job:

{
  "query": "Compare pricing strategies of SaaS companies",
  "urls": [
    "https://example1.com/pricing",
    "https://example2.com/pricing"
  ],
  "options": {
    "max_tokens": 4096
  }
}

Agent Job:

{
  "url": "https://example.com",
  "prompt": "What is the return policy and how many days do I have to return a product?",
  "options": {
    "timeout": 60,
    "max_steps": 10,
    "max_pages": 5,
    "allow_navigation": true,
    "main_content": true
  }
}

| Option | Default | Description | |--------|---------|-------------| | max_steps | 10 | Max LLM decision steps (capped at 25) | | max_pages | 5 | Max pages the agent can navigate to (capped at 15) | | allow_navigation | true | Allow the agent to follow links to other pages | | main_content | true | Strip nav/footer/ads from pages before sending to LLM |

Agent Response:

{
  "success": true,
  "data": {
    "url": "https://example.com",
    "answer": "You have 30 days to return any product in original condition...",
    "steps": [
      {
        "step": 1,
        "url": "https://example.com",
        "action": "navigate",
        "reasoning": "Need to find the returns policy page",
        "result": "Navigating to: https://example.com/returns"
      },
      {
        "step": 2,
        "url": "https://example.com/returns",
        "action": "answer",
        "reasoning": "Found the complete returns policy on this page",
        "result": "You have 30 days to return..."
      }
    ],
    "pages_visited": ["https://example.com", "https://example.com/returns"],
    "total_steps": 2
  },
  "processing_time_ms": 4821
}

Note: Requires GENAI_API_KEY. The agent uses the Python LLM service (/agent/step/) to decide the next action at each step. Without Abrasio configured, scraping uses Cheerio → Playwright fallback.

Response Formats

Scrape Response

{
  "success": true,
  "data": {
    "url": "https://example.com",
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "links": ["https://www.iana.org/domains/example"],
    "metadata": {
      "title": "Example Domain",
      "description": "",
      "source": "cheerio",
      "statusCode": 200
    }
  },
  "processing_time_ms": 145
}

Crawl Response

{
  "success": true,
  "status": "completed",
  "total": 15,
  "data": [
    {
      "url": "https://example.com",
      "markdown": "# Page content...",
      "metadata": { "title": "...", "source": "cheerio", "statusCode": 200 }
    }
  ],
  "processing_time_ms": 12500
}

Extract Response

{
  "success": true,
  "data": [
    { "name": "Product A", "price": 29.99, "description": "...", "url": "..." },
    { "name": "Product B", "price": 49.99, "description": "...", "url": "..." }
  ],
  "total": 2,
  "url": "https://example.com/products",
  "processing_time_ms": 8500
}

Self-Hosting Guide

Requirements

Docker and Docker Compose (recommended)
OR: Node.js 20+, Redis 7+, Go 1.22+ (optional), Python 3.10+ (optional)

Production Deployment

# 1. Configure
cp .env.example .env
# Set GENAI_API_KEY for LLM features
# Set ABRASIO_API_URL/KEY for stealth mode (optional, paid)

# 2. Build and run
docker-compose up -d --build

# 3. Monitor
docker-compose logs -f worker

Kubernetes

The worker, Go service, and Python LLM service each have their own Dockerfile and can be deployed as separate Kubernetes Deployments with a shared Redis (or Redis Cluster) as the message broker.

Scaling

Horizontal: Run multiple worker replicas. BullMQ handles job distribution across workers automatically.
Vertical: Increase MAX_CONCURRENT_PAGES for more simultaneous Playwright pages per worker (requires more RAM).
Recommended: 1 worker per 2 CPU cores, 2GB RAM per worker.

Python LLM Service API

Base URL: http://localhost:3002

All endpoints accept and return application/json.

`POST /extract/`

Extract structured data from Markdown content using Gemini.

Request body:

{
  "url": "https://example.com/products",
  "markdown": "# Products\n...",
  "schema_fields": {
    "name": "string",
    "price": "float",
    "url": "url",
    "in_stock": "boolean"
  },
  "extraction_scope": "list_page",
  "extraction_target": "products",
  "extract_query": "scrape all products with their prices"
}

| Field | Type | Required | Description | |-------|------|----------|-------------| | markdown | string | Yes | Markdown content of the page | | url | string | No | Source URL | | schema_fields | object | No* | Field names → types (string, float, integer, date, url, boolean) | | prompt | string | No* | Free-form extraction instruction (alternative to schema_fields) | | extract_query | string | No | Natural-language description of what to extract | | extraction_scope | string | No | One of: whole_site, category, single_page, list_page, search_query | | extraction_target | string | No | Target category or search term |

*At least one of schema_fields, prompt, or extract_query is required.

Response:

{
  "success": true,
  "data": [
    { "name": "Widget A", "price": 29.99, "url": "https://example.com/widget-a", "in_stock": true },
    { "name": "Widget B", "price": 49.99, "url": "https://example.com/widget-b", "in_stock": false }
  ],
  "total": 2
}

`POST /schema/create`

Generate a scraping schema from a natural language query.

Request body:

{ "query": "Scrape all laptop listings from https://store.example.com including name, price, and specs" }

Response:

{
  "success": true,
  "schema": {
    "url": "https://store.example.com",
    "extraction_scope": "list_page",
    "extraction_target": null,
    "name": "string",
    "price": "float",
    "specs": "string",
    "allowed_words": ["laptop", "notebook", "specs", "buy"],
    "blocked_words": ["cart", "checkout", "login"],
    "allowed_patterns": ["/laptops/", "/notebooks/"],
    "blocked_patterns": ["/cart", "/account"]
  }
}

The returned schema can be passed directly as schema_fields in a /extract/ call or to the TypeScript worker's extract job.

`POST /summarize/`

Summarize a web page into title, prose summary, and key points.

Request body:

{
  "url": "https://example.com/article",
  "markdown": "# Title\n...",
  "max_length": 300,
  "language": "English"
}

| Field | Type | Default | Description | |-------|------|---------|-------------| | markdown | string | — | Page content in Markdown | | url | string | null | Source URL | | max_length | integer | 500 | Target summary length in words | | language | string | null | Output language (defaults to source language) |

Response:

{
  "success": true,
  "title": "How to Build a Web Scraper in 2024",
  "summary": "This article covers the fundamentals of web scraping...",
  "key_points": [
    "Choose between static (Cheerio) and dynamic (Playwright) scrapers",
    "Respect robots.txt and rate limits",
    "Use proxies for large-scale scraping"
  ]
}

`POST /deep-research/`

Synthesize a comprehensive research report from multiple scraped pages.

Request body:

{
  "query": "What are the best practices for web scraping in 2024?",
  "pages": [
    { "url": "https://source1.com", "title": "Web Scraping Guide", "markdown": "..." },
    { "url": "https://source2.com", "markdown": "..." }
  ],
  "max_tokens": 4096
}

Response:

{
  "success": true,
  "research": "## Web Scraping Best Practices\n\nBased on [Source 1] and [Source 2]...",
  "sources": ["https://source1.com", "https://source2.com"],
  "pages_analyzed": 2
}

`POST /agent/step/`

Execute one step of an autonomous web navigation agent. The caller is responsible for driving the loop: navigate to target_url, feed the new page back in the next request, and stop when action is "answer" or "done".

Request body:

{
  "prompt": "Find the price of the MacBook Pro 16-inch",
  "current_url": "https://www.apple.com",
  "page_content": "# Apple\nShop iPhone, Mac, iPad...",
  "available_links": ["https://www.apple.com/mac/", "https://www.apple.com/macbook-pro/"],
  "steps_so_far": [],
  "pages_visited": [],
  "step_number": 1,
  "max_steps": 10,
  "allow_navigation": true
}

Response:

{
  "action": "navigate",
  "reasoning": "The homepage doesn't show prices. I should go to the MacBook Pro page.",
  "answer": null,
  "target_url": "https://www.apple.com/macbook-pro/",
  "extracted_data": null
}

| Action | Description | |--------|-------------| | navigate | Go to target_url and call again with the new page content | | extract | Data extracted from the current page is in extracted_data | | answer | Final answer is in answer — stop the loop | | done | All data compiled — stop the loop |

`GET /health`

{ "status": "healthy", "service": "python-llm" }

Technology Stack

| Component | Technology | Purpose | |-----------|-----------|---------| | Job Queue | BullMQ (Redis) | Reliable job processing with retries | | Worker Runtime | Node.js 20 + TypeScript | High-performance async I/O | | HTTP Scraping | Cheerio + undici | Fast DOM parsing without browser | | Browser Scraping | Playwright (Chromium) | JS-rendered content extraction | | HTML to Markdown | Go (html-to-markdown v2) | High-performance conversion | | Markdown Fallback | Turndown | In-process JS fallback | | LLM Extraction | Python + Gemini | Structured data extraction | | Logging | Winston | Structured JSON logging | | Config Validation | Zod | Runtime env var validation |

Support & Community

| Channel | Link | |---------|------| | 💬 Discord | discord.gg/GBSKsC8DvS | | 📧 Email | [email protected] | | 🌐 API Docs | scrapetechnology.com/markudown/docs |

For bug reports and feature requests, open a thread in the #markudown-feedback channel on Discord.

License

AGPL-3.0

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

MarkUDown Engine

Architecture

3-Layer Extraction Orchestrator

Job Types

Quick Start

Docker Compose (Recommended)

Development

Build

Project Structure

Configuration

API Integration

Push a Job (Python example with python-bullmq)

Poll for Results

Job Data Formats

Response Formats

Scrape Response

Crawl Response

Extract Response

Self-Hosting Guide

Requirements

Production Deployment

Kubernetes

Scaling

Python LLM Service API

POST /extract/

POST /schema/create

POST /summarize/

POST /deep-research/

POST /agent/step/

GET /health

Technology Stack

Support & Community

License

`POST /extract/`

`POST /schema/create`

`POST /summarize/`

`POST /deep-research/`

`POST /agent/step/`

`GET /health`