markudown-sdk

v0.1.0

Published

2 months ago

Node.js / TypeScript SDK for MarkUDown — AI data extraction infrastructure

0High
0Medium
0Low

jvosobhie

scraping web-scraping data-extraction ai markudown

MarkUDown Node.js / TypeScript SDK

TypeScript SDK for MarkUDown — AI data extraction infrastructure that turns any website into structured, ready-to-use data.

Requires Node 18+ (uses native fetch). Zero dependencies.

Installation

npm install markudown
# or
pnpm add markudown
# or
yarn add markudown

Quick Start

import { MarkUDown } from "markudown";

const md = new MarkUDown({ apiKey: "your-api-key" });

// Scrape a page
const result = await md.scrape("https://example.com");

// Search the web
const results = await md.search("best web scraping tools 2025", {
  engine: "google",
  limit: 5,
});

// Extract structured data
const data = await md.extract(
  "https://shop.example.com/products",
  [
    { name: "title",  type: "string", active: true },
    { name: "price",  type: "float",  active: true },
    { name: "rating", type: "float",  active: true },
  ],
  { extractQuery: "Extract all product listings" }
);

Get your API key at scrapetechnology.com/markudown. New accounts include 500 free credits.

Configuration

const md = new MarkUDown({
  apiKey: "your-api-key",
  baseUrl: "https://api.scrapetechnology.com", // default
  timeout: 120_000,      // HTTP timeout in ms (default: 120 000)
  pollInterval: 2_000,   // ms between status polls (default: 2 000)
  pollTimeout: 300_000,  // max ms to wait for a job (default: 300 000)
});

All async endpoints accept wait: false to return the job ID immediately:

const job = await md.crawl("https://example.com", { wait: false }) as { id: string };

// Check later
const status = await md.getJobStatus("crawl", job.id);

API Reference

`scrape(url, opts?)` → Promise<unknown>

Scrape a single URL. Returns result immediately.

const result = await md.scrape("https://example.com", {
  mainContent: true,
  includeLinks: true,
  excludeTags: ["header", "nav", "footer"],
});

| Option | Type | Default | Description | |--------|------|---------|-------------| | timeout | number | 60 | Timeout in seconds | | excludeTags | string[] | ["header","nav","footer"] | HTML tags to strip | | mainContent | boolean | true | Extract only main content | | includeLinks | boolean | false | Include hyperlinks | | includeHtml | boolean | false | Include raw HTML |

`screenshot(url, opts?)` → Promise<unknown>

Take a screenshot of a URL.

const result = await md.screenshot("https://example.com");
// result.image → base64 PNG

`map(url, opts?)` → Promise<unknown>

Discover all URLs on a website.

const result = await md.map("https://example.com", {
  maxUrls: 500,
  allowedWords: ["product", "shop"],
  blockedWords: ["blog", "careers"],
});

`crawl(url, opts?)` → Promise<unknown>

Crawl a website recursively and extract content from every page.

const result = await md.crawl("https://docs.example.com", {
  maxDepth: 3,
  limit: 50,
  includeOnly: ["*guide*", "*tutorial*"],
});

| Option | Type | Default | |--------|------|---------| | maxDepth | number | 2 | | limit | number | 10 | | timeout | number | 60 | | includeLinks | boolean | false | | blockedWords | string[] | [] | | includeOnly | string[] | [] | | mainContent | boolean | true | | callbackUrl | string | — | | wait | boolean | true |

`extract(url, schema, opts?)` → Promise<unknown>

Schema-based structured data extraction with AI.

const result = await md.extract(
  "https://shop.example.com",
  [
    { name: "title",    type: "string", active: true },
    { name: "price",    type: "float",  active: true },
    { name: "in_stock", type: "string", active: true },
  ],
  {
    extractQuery: "Product listings with title, price and availability",
    extractionScope: "whole_site",
  }
);

`createSchema(query)` → Promise<unknown>

Generate an extraction schema automatically from a natural language description.

const schema = await md.createSchema(
  "Extract all products from https://shop.example.com with title, price and rating"
);

`promptExtract(url, prompt, opts?)` → Promise<unknown>

Extract data using a natural language prompt — no schema required.

const result = await md.promptExtract(
  "https://example.com/team",
  "List all team members with their name, title and LinkedIn URL"
);

`batchScrape(urls, opts?)` → Promise<unknown>

Scrape multiple URLs in parallel.

const result = await md.batchScrape([
  "https://example.com/page-1",
  "https://example.com/page-2",
  "https://example.com/page-3",
]);

`search(query, opts?)` → Promise<unknown>

Web search with optional scraping of result pages.

const result = await md.search("open source web scraping", {
  engine: "all",       // "google" | "bing" | "duckduckgo" | "all"
  limit: 10,
  scrapeResults: true,
  lang: "en",
  country: "us",
});

`rss(url, opts?)` → Promise<unknown>

Generate an RSS feed from any web page.

const result = await md.rss("https://blog.example.com", {
  maxItems: 20,
  title: "My Blog Feed",
});

`changeDetection(url, opts?)` → Promise<unknown>

Detect and diff content changes on a page.

const result = await md.changeDetection("https://example.com/pricing", {
  includeDiff: true,
});
// result.data.changed → true/false
// result.data.diff    → diff string

`deepResearch(query, urls, opts?)` → Promise<unknown>

Scrape multiple pages and synthesize findings with an LLM.

const result = await md.deepResearch(
  "What are the pricing differences between these competitors?",
  [
    "https://competitor-a.com/pricing",
    "https://competitor-b.com/pricing",
  ],
  { maxTokens: 4096 }
);

`agent(url, prompt, opts?)` → Promise<unknown>

AI autonomous navigation agent that clicks, scrolls, and fills forms.

const result = await md.agent(
  "https://example.com",
  "Find the contact email on this website",
  { maxSteps: 15, includeScreenshots: false }
);

`smartExtract(url, goal, opts?)` → Promise<unknown>

Guided extraction: analyzes the site, plans interactions, and returns complete data.

const result = await md.smartExtract(
  "https://example.com/products",
  "Extract all products with name, price, and description",
  {
    outputFormat: "json",
    hints: ["Click 'Load more' button to get all items"],
  }
);

`rank(keyword, domain, opts?)` → Promise<unknown>

Check where a domain ranks for a keyword in search results.

const result = await md.rank("web scraping api", "scrapetechnology.com", {
  engine: "google",
  country: "us",
  depth: 100,
});
// result.data.position → e.g. 4

`dataset(url, goal, opts?)` → Promise<unknown>

Auto-paginate a listing page and extract all items as a structured dataset.

const result = await md.dataset(
  "https://books.toscrape.com",
  "Extract all books with title, price, rating and availability",
  { maxPages: 10, outputFormat: "json" }
);

Monitor

// Create
const sub = await md.monitorCreate(
  "https://example.com/pricing",
  "https://yourapp.com/webhook",
  { intervalMinutes: 60 }
);
const subId = (sub as { subscription_id: string }).subscription_id;

// List
const monitors = await md.monitorList();

// Delete
await md.monitorDelete(subId);

`instagram(resource, target, opts?)` → Promise<unknown>

Extract public Instagram data.

| resource | target | Description | |------------|----------|-------------| | "profile" | username | Profile info + recent posts | | "post" | shortcode | Single post | | "hashtag" | hashtag | Recent posts for a hashtag | | "search" | keyword | Search results |

// Profile
const profile = await md.instagram("profile", "instagram");

// Hashtag
const posts = await md.instagram("hashtag", "webdev", { limit: 30 });

// Search
const results = await md.instagram("search", "coffee shops NYC");

`x(resource, target, opts?)` → Promise<unknown>

Extract public X (Twitter) data.

| resource | target | Description | |------------|----------|-------------| | "profile" | username | Profile info + recent tweets | | "post" | tweet ID | Single tweet | | "search" | keyword | Search results |

const profile = await md.x("profile", "elonmusk");
const results = await md.x("search", "web scraping tools", { limit: 25 });

Job Management

// Get status
const status = await md.getJobStatus("crawl", "job-id");
// status.status → "pending" | "processing" | "completed" | "failed"

// Cancel
await md.cancelJob("agent", "job-id");

Error Handling

import { MarkUDown, MarkUDownError, RateLimitError, InsufficientCreditsError } from "markudown";

const md = new MarkUDown({ apiKey: "your-key" });

try {
  const result = await md.scrape("https://example.com");
} catch (e) {
  if (e instanceof RateLimitError) {
    console.log("Rate limit hit — wait a minute and retry");
  } else if (e instanceof InsufficientCreditsError) {
    console.log("Out of credits — upgrade at scrapetechnology.com");
  } else if (e instanceof MarkUDownError) {
    console.log(`API error ${e.statusCode}: ${e.message}`);
  }
}

Build

npm install
npm run build       # compiles to dist/
npm run typecheck   # type-check without emitting

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

MarkUDown Node.js / TypeScript SDK

Installation

Quick Start

Configuration

API Reference

scrape(url, opts?) → Promise<unknown>

screenshot(url, opts?) → Promise<unknown>

map(url, opts?) → Promise<unknown>

crawl(url, opts?) → Promise<unknown>

extract(url, schema, opts?) → Promise<unknown>

createSchema(query) → Promise<unknown>

promptExtract(url, prompt, opts?) → Promise<unknown>

batchScrape(urls, opts?) → Promise<unknown>

search(query, opts?) → Promise<unknown>

rss(url, opts?) → Promise<unknown>

changeDetection(url, opts?) → Promise<unknown>

deepResearch(query, urls, opts?) → Promise<unknown>

agent(url, prompt, opts?) → Promise<unknown>

smartExtract(url, goal, opts?) → Promise<unknown>

rank(keyword, domain, opts?) → Promise<unknown>

dataset(url, goal, opts?) → Promise<unknown>