@udx/mq v1.1.5 (published 23 days ago)

Markdown Query - jq for Markdown documents

@udx/mq - Markdown Query

mq queries and transforms Markdown documents and is designed as a companion to @udx/mcurl. Think of it as "jq for Markdown": it lets you treat a Markdown document as structured data.

Key Capabilities

Clean Content Extraction: Pull narrative content without code blocks for cleaner analysis
Structured Querying: Filter and transform markdown content like jq does for JSON
Document Analysis: Generate actionable insights and understand document structure
Format Conversion: Transform between JSON, markdown, and other formats
Composability: Combine with other tools in Unix-style pipelines

Why Clean Content Extraction Matters

Code blocks in technical documents serve a crucial purpose for developers but act as "noise" when analyzing the narrative flow. By separating content from code, mq helps:

Improve focus on conceptual information
Extract cleaner summaries without code snippets
Better identify key points and arguments
Create more approachable versions of technical content

Installation

npm install -g @udx/mq

mq pairs well with mcurl. For instance, the following command fetches a web page and extracts its images:

mcurl https://udx.io/about | mq --images

Add --format json to get the raw JSON output, which jq can then reduce to a list of image URLs:

mcurl https://udx.io/about | mq --images --format json | jq '.[].src'
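For scripts that already run in Node.js, the same reduction can be done without jq. The sketch below assumes, as the jq filter above implies, that the JSON output is an array of objects with a src field. The imageSources helper and the sample data are hypothetical and not part of mq. Unlike jq '.[].src', it drops entries that lack a src instead of emitting null.

```javascript
// Assumption: `mq --images --format json` prints a JSON array whose items
// carry a `src` field (implied by the jq filter '.[].src').
function imageSources(mqJson) {
  const images = JSON.parse(mqJson);
  if (!Array.isArray(images)) {
    throw new TypeError('expected a JSON array of image objects');
  }
  return images
    .map((image) => image && image.src)
    .filter((src) => typeof src === 'string' && src.length > 0);
}

// Hypothetical sample; real mq output may carry other fields.
const sampleImages = JSON.stringify([
  { src: 'https://example.com/a.png', alt: 'A' },
  { src: 'https://example.com/b.png', alt: 'B' },
  { alt: 'no source' },
]);

console.log(imageSources(sampleImages));
```

In a real script the JSON string would come from the pipeline's standard output rather than from a literal.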

Usage Examples

Extract Clean Content (No Code Blocks)

# Extract clean content without code blocks
mq --clean-content --input test/fixtures/test-code-blocks.md

# Filter content to include only h1 and h2 headings and their content
mq --clean-content=2 --input test/fixtures/complex-test.md

# Get clean content in JSON format
mq --clean-content --format json --input test/fixtures/test-code-blocks.md

Basic Query Operations

# Extract headings from a document (returns JSON structure by default)
mq --input test/fixtures/basic-test.md '.headings[]'

# Analyze document structure (returns formatted Markdown report)
mq --analyze --input test/fixtures/complex-test.md

# Generate a table of contents (returns Markdown TOC)
mq --input test/fixtures/test-document.md '.toc'

# Extract code blocks by language (returns JSON structure)
mq --language javascript --input test/fixtures/test-code-blocks.md

# Extract code content only in raw format
mq --language javascript --input test/fixtures/test-code-blocks.md | jq -r '.[0].content'

# Extract all images (returns JSON structure)
mq --input test/fixtures/test-images.md '.images[]'

# Extract first sentences from sections (returns text content)
mq --first-sentences 2 --input test/fixtures/test-sentences.md
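To make the '.toc' query more concrete, here is a rough sketch of how a Markdown table of contents can be derived from a list of headings. This is not mq's implementation: buildToc, slugify, and the text field are assumptions made for illustration (only level appears in the queries in this README), and the anchor slugs are a simplified approximation of what Markdown renderers generate.

```javascript
// Simplified slug: lowercase, drop punctuation, turn whitespace into hyphens.
function slugify(text) {
  return text
    .toLowerCase()
    .trim()
    .replace(/[^\w\s-]/g, '')
    .replace(/\s+/g, '-');
}

// Assumption: each heading looks like { level: number, text: string }.
function buildToc(headings) {
  if (headings.length === 0) return '';
  // Indent relative to the shallowest heading in the document.
  const topLevel = Math.min(...headings.map((h) => h.level));
  return headings
    .map((h) => `${'  '.repeat(h.level - topLevel)}- [${h.text}](#${slugify(h.text)})`)
    .join('\n');
}

// Hypothetical sample headings.
const sampleHeadings = [
  { level: 1, text: 'Intro' },
  { level: 2, text: 'Getting Started' },
  { level: 2, text: 'Usage' },
];

console.log(buildToc(sampleHeadings));
```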

Pipe with mcurl

# Fetch web content and analyze it
mcurl https://udx.io | mq --analyze

# Fetch web content and extract key information
mcurl https://udx.io/work | mq --clean-content

# Analyze the overall structure of a page before querying it
mcurl https://udx.io/about | mq --analyze

Complex Queries

# Extract level 2 headings
mq --input test/fixtures/complex-test.md '.headings[] | select(.level == 2)'

# Extract links whose href contains "example"
mq --input test/fixtures/test-document.md '.links[] | select(.href | contains("example"))'

# Extract code blocks and make them collapsible
mq --input test/fixtures/test-code-blocks.md --transform-code-blocks
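The selectors above use jq-style syntax, so they map directly onto array filters. A minimal JavaScript sketch of the first two selections, assuming the document has already been parsed into an object with headings and links arrays. The helper names, the sample object, and the text field are hypothetical; level and href are the fields used in the queries above.

```javascript
// Equivalent of '.headings[] | select(.level == N)'.
function headingsAtLevel(doc, level) {
  return (doc.headings || []).filter((heading) => heading.level === level);
}

// Equivalent of '.links[] | select(.href | contains("..."))'.
function linksMatching(doc, fragment) {
  return (doc.links || []).filter(
    (link) => typeof link.href === 'string' && link.href.includes(fragment)
  );
}

// Hypothetical parsed document.
const sampleDoc = {
  headings: [
    { level: 1, text: 'Guide' },
    { level: 2, text: 'Install' },
    { level: 2, text: 'Usage' },
  ],
  links: [
    { href: 'https://example.com/docs', text: 'Docs' },
    { href: 'https://udx.io', text: 'UDX' },
  ],
};

console.log(headingsAtLevel(sampleDoc, 2));
console.log(linksMatching(sampleDoc, 'example'));
```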

Integration with curl and jq

One of the most powerful aspects of mq is its ability to integrate with curl, mcurl, and jq in Unix-style pipelines:

# Fetch a GitHub markdown file and extract headings
curl -s https://raw.githubusercontent.com/WordPress/wordpress-develop/HEAD/README.md | mq '.headings[]'

# Get content from a website and extract clean narrative content
mcurl https://udx.io/about | mq --clean-content

# Process markdown content and pipe to jq for further filtering
curl -s https://raw.githubusercontent.com/WordPress/wordpress-develop/HEAD/README.md | mq --clean-content --format json | \
  jq '[.[] | select(.type=="heading" and .level == 1)]'

# Extract expertise data from UDX API using proper jq patterns
curl -s 'https://udx.io/wp-json/udx/v2/works/search?query=&page=1' | \
  jq '.facets.expertise[] | select(.count > 10) | {name: .name, count: .count}'

Advanced Features

Clean Content Extraction

The clean content extractor is one of mq's most powerful features for document analysis. It removes code blocks while preserving the document's narrative structure:

# Extract clean content without code blocks
mq --clean-content --input test/fixtures/test-code-blocks.md

# Limit extraction to specific heading levels (h1 and h2 only)
mq --clean-content=2 --input test/fixtures/complex-test.md

# Get JSON output for programmatic processing
mq --clean-content --format json --input test/fixtures/test-code-blocks.md | jq length
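To illustrate what removing code blocks while keeping the narrative means, the sketch below strips fenced code blocks from a Markdown string line by line. It is a simplified stand-in, not mq's implementation: stripFencedCode is a hypothetical helper, it ignores indented code blocks and inline code, and it does not apply the heading-level limit.

```javascript
// Drop fenced code blocks (backtick or tilde fences) and keep everything else.
function stripFencedCode(markdown) {
  const kept = [];
  let openFence = null; // marker that opened the current block, or null
  for (const line of markdown.split('\n')) {
    const match = line.match(/^ {0,3}(`{3,}|~{3,})/);
    if (openFence === null) {
      if (match) {
        openFence = match[1]; // opening fence: start skipping lines
      } else {
        kept.push(line);
      }
    } else if (
      match &&
      match[1][0] === openFence[0] &&
      match[1].length >= openFence.length &&
      line.trim() === match[1]
    ) {
      openFence = null; // closing fence: same character, at least as long
    }
  }
  return kept.join('\n');
}

// Build the fence marker programmatically to keep this sample readable.
const fence = '`'.repeat(3);
const sampleMarkdown = [
  '# Title',
  'Intro text.',
  fence + 'js',
  'console.log(1);',
  fence,
  'Closing text.',
].join('\n');

console.log(stripFencedCode(sampleMarkdown));
```

An unclosed fence swallows the rest of the document in this sketch, which matches how most renderers treat it.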

Benefits of Clean Content Extraction

Improved Analysis: Focus on the narrative without code noise
Better Summarization: Generate more coherent summaries from technical content
Hierarchical Understanding: Preserve document structure while filtering code
Content Repurposing: Transform code-heavy tutorials into conceptual guides
Incremental Content Processing: Extract varying amounts of content for different purposes

Advanced UDX API Examples

# Extract the first five links from a page fetched with mcurl
mcurl https://udx.io/about | mq '.links[0:5]'

# Extract clean content from a WordPress page for easier reading
mcurl https://udx.io/guidance | mq --clean-content

# Analyze page structure first; extract specific elements in a follow-up query
mcurl https://udx.io/work | mq --analyze

Approach

Best Practices for Working with Markdown and APIs

Native Node.js Functions: Prefer using native Node.js functions for fetching API data rather than dedicated modules. For example:

// Using native Node.js rather than dedicated modules
const https = require('https');

function fetchContent(url) {
  // Fetches content from a URL using native Node.js modules
  // Input: url - String URL to fetch
  // Output: Promise that resolves to the response body as a UTF-8 string
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      // Decode as UTF-8 so multi-byte characters split across chunks stay intact
      res.setEncoding('utf8');
      let data = '';
      res.on('data', (chunk) => { data += chunk; });
      res.on('end', () => { resolve(data); });
      res.on('error', reject);
    }).on('error', reject);
  });
}

Logging and Debugging: Log API request metadata and response data for troubleshooting, gated behind a debug or verbose flag:

// Proper logging for API requests
function logApiRequest(url, options = {}, response = {}) {
  // Log API request details when debug or verbose mode is enabled
  // Input: url - request URL, options - request options, response - API response
  // Output: None, logs to console
  if (process.env.DEBUG || process.env.VERBOSE) {
    console.log(`[API Request] ${options.method || 'GET'} ${url}`);
    console.log(`[API Response] Status: ${response.statusCode}`);
    if (process.env.VERBOSE) {
      // JSON.stringify returns undefined for an undefined body, so fall back to ''
      const body = JSON.stringify(response.body) ?? '';
      console.log(`[API Response Body] ${body.substring(0, 200)}...`);
    }
  }
}

Use Lodash for Complex Operations: Leverage Lodash for data transformations to improve readability and fault tolerance in your pipeline.
Progressive Enhancement Workflow:
1. Start by analyzing content structure with mq --analyze
2. Extract relevant sections with targeted selectors
3. Process and transform with clean content extraction
4. Format output appropriately for your use case
Testing Strategy: Test your pipelines using REST API tools, Mocha for unit tests, or simple curl commands for verification.
Documentation: Add comprehensive function headers that explain purpose, inputs, and outputs for all custom operations.
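The workflow and testing advice above can be sketched as a small Node.js helper that runs command-line stages in sequence, feeding each stage's output to the next. This is an illustration rather than part of mq: runStage and runPipeline are hypothetical names, and the sketch assumes the commands are on the PATH and behave as ordinary stdin/stdout filters.

```javascript
const { spawnSync } = require('child_process');

// Run one command, passing `input` on stdin and returning its stdout.
// Input: command - executable name, args - argument array, input - stdin text
// Output: stdout as a string; throws if the command fails to start or exits non-zero
function runStage(command, args, input) {
  const result = spawnSync(command, args, { input, encoding: 'utf8' });
  if (result.error) throw result.error;
  if (result.status !== 0) {
    throw new Error(`${command} exited with status ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}

// Chain stages like a shell pipeline: each stage is [command, args].
function runPipeline(stages, input = '') {
  return stages.reduce((data, [command, args]) => runStage(command, args, data), input);
}

// Example usage (requires mcurl and mq to be installed):
// const clean = runPipeline([
//   ['mcurl', ['https://udx.io/about']],
//   ['mq', ['--clean-content']],
// ]);
```

Because each stage is a plain function call, a pipeline can be verified with simple assertions before it is pointed at live URLs.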

Common Pipelines

# Fetch content → Clean → Convert to JSON → Count elements
mcurl https://udx.io/about | mq --clean-content | mq --format json | jq 'length'

# Analyze content structure then target specific elements
mcurl https://udx.io/work | mq --analyze && mcurl https://udx.io/work | mq '.headings[0:5]'

# Process multiple sources with consistent transformations
for url in "udx.io/about" "udx.io/work" "udx.io/guidance"; do
  echo "Processing $url"
  mcurl "https://$url" | mq --clean-content=2 | wc -l
done

UDX API Integration Patterns

mq can be used as part of a larger data processing pipeline, alongside tools like curl and jq:

# Use mq for HTML content processing
mcurl https://udx.io/work | mq --clean-content | grep "Cloud"

# Use curl+jq for JSON API processing (not mcurl!)
curl -s 'https://udx.io/wp-json/udx/v2/works/search?query=&page=1' | \
  jq '.facets.expertise[] | select(.count > 10) | {name: .name, count: .count}'

# Get industry distribution (entries with a count above 5)
curl -s 'https://udx.io/wp-json/udx/v2/works/search?query=&page=1' | \
  jq '.facets.industries[] | select(.count > 5) | {name: .name, count: .count}'

# Pipeline: Extract content from UDX pages, clean it, then analyze structure
for page in "about" "work" "guidance"; do
  mcurl "https://udx.io/$page" | mq --clean-content | mq --analyze | grep -i "headings"
done
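The two jq facet queries above differ only in the facet name and the threshold, so in Node.js they can share one function. A minimal sketch, assuming the API response has the shape those filters imply (a facets object whose entries carry name and count); facetSummary and the sample response are hypothetical.

```javascript
// Equivalent of: jq '.facets.<facetName>[] | select(.count > minCount) | {name, count}'
function facetSummary(response, facetName, minCount) {
  const entries = (response.facets && response.facets[facetName]) || [];
  return entries
    .filter((entry) => entry.count > minCount)
    .map(({ name, count }) => ({ name, count }));
}

// Hypothetical response; the real API may return more fields per entry.
const sampleResponse = {
  facets: {
    expertise: [
      { name: 'Cloud', count: 12, slug: 'cloud' },
      { name: 'Design', count: 4, slug: 'design' },
    ],
    industries: [
      { name: 'Finance', count: 7, slug: 'finance' },
    ],
  },
};

console.log(facetSummary(sampleResponse, 'expertise', 10));
console.log(facetSummary(sampleResponse, 'industries', 5));
```

A missing facet yields an empty array, which keeps downstream code free of null checks.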