@innotekseo/blogs-migrate
v0.1.2
HTML-to-MDX migration CLI — crawl legacy HTML sites and convert pages to MDX content files with frontmatter.
Part of the Innotek Platform Toolkits — open-source tools for AI-era content discoverability.
GitHub: innotekseoai/innotekseo-blogs · packages/cli
Install
npm install -g @innotekseo/blogs-migrate
Or use without installing:
npx @innotekseo/blogs-migrate --url https://old-blog.example.com
The binary name is innotekseo-migrate.
CLI Usage
innotekseo-migrate --url <start-url> [options]
Options:
--url <url> Start URL to crawl (required)
--output <dir> Output directory (default: ./content)
--depth <n> Max crawl depth (default: 1)
--delay <ms> Delay between requests in ms (default: 500)
Example:
innotekseo-migrate \
--url https://old-blog.example.com/posts \
--output ./src/content \
--depth 3 \
--delay 1000
Output: for each crawled page, the CLI generates an .mdx file:
---
title: "Original Page Title"
date: "2024-01-15T00:00:00.000Z"
source: "https://old-blog.example.com/posts/my-article"
---
Converted markdown content...
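To post-process migrated files, it can help to read the frontmatter back. The sketch below is a minimal parser for the three-field layout shown above. `splitFrontmatter` is a hypothetical helper, not part of the package, and it assumes simple `key: "value"` lines rather than full YAML.

```typescript
/**
 * Split an MDX string into frontmatter fields and body.
 * Assumption: frontmatter is a flat block of `key: "value"` lines between
 * two `---` markers, as in the sample output. This is not a YAML parser.
 */
function splitFrontmatter(mdx: string): { data: Record<string, string>; body: string } {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(mdx);
  if (!match) return { data: {}, body: mdx };

  const data: Record<string, string> = {};
  for (const line of match[1].split(/\r?\n/)) {
    // Split on the first colon only, so values such as URLs stay intact.
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim();
    let value = line.slice(sep + 1).trim();
    // Strip one pair of surrounding double quotes, as used in the sample.
    if (value.length >= 2 && value.startsWith('"') && value.endsWith('"')) {
      value = value.slice(1, -1);
    }
    data[key] = value;
  }
  return { data, body: mdx.slice(match[0].length) };
}
```

For anything beyond this flat layout (nested keys, multi-line values), a real YAML frontmatter parser is the safer choice.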
Features
- BFS crawling with configurable depth limit
- Smart content extraction: tries <article>, <main>, and common CSS classes before falling back to <body>
- Noise removal: strips <script>, <style>, <nav>, <footer>, <header>, <aside>, <iframe>
- Image downloading: saves remote images locally, rewrites paths in markdown
- Rate limiting between requests (default 500ms)
- Slug deduplication: appends -1, -2 for duplicate URL slugs
- Same-domain only: only follows links within the source domain
- SSRF protection: blocks localhost, private IPs, IPv6 loopback, file://, .local hostnames
- Path traversal protection: output directory validated to stay within CWD
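Two of the features above, depth-limited BFS crawling restricted to one domain and slug deduplication, can be sketched roughly as follows. This is an illustration of the general techniques, not the package's internals. `uniqueSlug`, `crawlOrder` and the `linksOf` callback are hypothetical names, and treating the start page as depth 0 is an assumption.

```typescript
/** Return a slug not used so far, appending -1, -2, ... on collisions. */
function uniqueSlug(base: string, used: Set<string>): string {
  let candidate = base;
  let n = 0;
  while (used.has(candidate)) {
    n += 1;
    candidate = `${base}-${n}`;
  }
  used.add(candidate);
  return candidate;
}

/**
 * Breadth-first traversal of a link graph, limited to maxDepth hops from the
 * start URL and to the start URL's hostname.
 * Assumption: the start page is depth 0. linksOf stands in for fetching a
 * page and extracting its absolute links.
 */
function crawlOrder(
  start: string,
  linksOf: (url: string) => string[],
  maxDepth: number,
): string[] {
  const host = new URL(start).hostname;
  const seen = new Set<string>([start]);
  const queue: Array<{ url: string; depth: number }> = [{ url: start, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;
    order.push(url);
    if (depth >= maxDepth) continue; // do not expand links past the limit
    for (const link of linksOf(url)) {
      if (seen.has(link)) continue; // already queued or visited
      if (new URL(link).hostname !== host) continue; // same-domain only
      seen.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return order;
}
```

The queue gives pages closest to the start URL priority, so a low depth limit still captures the pages linked directly from the index.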
Programmatic Usage
import { migrate } from "@innotekseo/blogs-migrate";
const files = await migrate({
url: "https://old-blog.example.com",
output: "./content",
depth: 2,
delay: 500,
});
// files: string[], the paths to the created .mdx files
Individual functions are also exported: crawlPage, crawlSite, convertPage, slugify, toMdxString, downloadImages.
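When several legacy sites need migrating, the `migrate` call shown above could be wrapped in a small batch runner. The sketch below takes the migrate function as a parameter so it can be tried without the package installed. `migrateAll`, `MigrateOptions` and `BatchResult` are hypothetical names. The option shape mirrors the example above, and marking `depth` and `delay` as optional is an assumption.

```typescript
// Shapes inferred from the usage example; treat them as assumptions.
interface MigrateOptions {
  url: string;
  output: string;
  depth?: number;
  delay?: number;
}
type MigrateFn = (options: MigrateOptions) => Promise<string[]>;

interface BatchResult {
  created: string[];
  failed: Array<{ url: string; reason: string }>;
}

/** Run migrations one site at a time so one failure does not abort the batch. */
async function migrateAll(jobs: MigrateOptions[], run: MigrateFn): Promise<BatchResult> {
  const result: BatchResult = { created: [], failed: [] };
  for (const job of jobs) {
    try {
      result.created.push(...(await run(job)));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      result.failed.push({ url: job.url, reason });
    }
  }
  return result;
}
```

In real use, the second argument would be the `migrate` function imported from `@innotekseo/blogs-migrate`. Running the jobs sequentially rather than in parallel keeps each site's request delay meaningful.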
After Migration
The generated MDX files work directly with @innotekseo/blogs-core via LocalAdapter and with the @innotekseo/blogs-components layout components.
Related Packages
| Package | Description |
|---|---|
| @innotekseo/cli | Main GEO/SEO CLI — llms.txt, article scaffolding |
| @innotekseo/blogs-core | Content adapter library + REST API |
| @innotekseo/blogs-components | Astro UI components for MDX content |
License
ISC — Innotek Solutions Ltd
