@uniweb/semantic-parser
v1.1.4
Semantic parser for ProseMirror/TipTap content structures
A semantic parser for ProseMirror/TipTap content structures that helps bridge the gap between natural content writing and component-based web development.
What it Does
The parser transforms rich text editor content (ProseMirror/TipTap) into structured, semantic groups that web components can easily consume. It provides two complementary views of your content:
- Sequence: An ordered list of all content elements (for rendering in document order)
- Groups: Content organized into semantic sections (main content + items)
Installation
npm install @uniweb/semantic-parser
Quick Start
import { parseContent } from "@uniweb/semantic-parser";
// Your ProseMirror/TipTap document
const doc = {
type: "doc",
content: [
{
type: "heading",
attrs: { level: 1 },
content: [{ type: "text", text: "Welcome" }],
},
{
type: "paragraph",
content: [{ type: "text", text: "Get started today." }],
},
],
};
// Parse the content
const result = parseContent(doc);
// Access different views
console.log(result.sequence); // Ordered array of elements
console.log(result.title); // Main content fields at top level
console.log(result.items); // Additional content groups
Output Structure
Sequence View
An ordered array of semantic elements preserving document order:
result.sequence = [
{ type: "heading", level: 1, content: "Welcome" },
{ type: "paragraph", content: "Get started today." }
]
Content Structure
Main content fields are at the top level. The items array contains additional content groups (created when headings appear after content), each with the same field structure:
result = {
// Header fields (from headings)
pretitle: "", // Heading before main title
title: "Welcome", // Main heading
subtitle: "", // Heading after main title
subtitle2: "", // Third heading level
// Body fields
paragraphs: ["Get started today."],
links: [], // All links (including buttons, documents)
imgs: [],
videos: [],
icons: [],
lists: [],
quotes: [],
data: {}, // Structured data (tagged code blocks, forms, cards)
headings: [], // Overflow headings after title/subtitle/subtitle2
// Additional content groups (from headings after content)
items: [
{ title: "Feature 1", paragraphs: [...], links: [...] },
{ title: "Feature 2", paragraphs: [...], links: [...] }
],
// Ordered sequence for document-order rendering
sequence: [...],
// Original document
raw: { type: "doc", content: [...] }
}
Common Use Cases
Extracting Main Content
const content = parseContent(doc);
const title = content.title;
const description = content.paragraphs.join(" ");
const image = content.imgs[0]?.url; // First image, if any (imgs is the documented image field)
Processing Content Sections
const content = parseContent(doc);
// Main content
console.log("Title:", content.title);
console.log("Description:", content.paragraphs);
// Additional content groups
content.items.forEach(item => {
console.log("Section:", item.title);
console.log("Content:", item.paragraphs);
});
Sequential Processing
const { sequence } = parseContent(doc);
sequence.forEach(element => {
switch(element.type) {
case 'heading':
renderHeading(element);
break;
case 'paragraph':
renderParagraph(element);
break;
case 'image':
renderImage(element);
break;
}
});
Content Mapping Utilities
The parser includes optional mapping utilities that transform parsed content into component-specific formats. They are intended for visual editors and component-based systems.
Type System (Recommended)
Automatically transform content based on field types with context-aware behavior:
import { parseContent, mappers } from "@uniweb/semantic-parser";
const parsed = parseContent(doc);
const schema = {
title: {
path: "title",
type: "plaintext", // Auto-strips <strong>, <em>, etc.
maxLength: 60 // Auto-truncates intelligently
},
excerpt: {
path: "paragraphs",
type: "excerpt", // Auto-creates excerpt from paragraphs
maxLength: 150
},
image: {
path: "imgs[0].url",
type: "image",
defaultValue: "/placeholder.jpg"
}
};
// Visual editor mode (default) - silent, graceful cleanup
const editorData = mappers.extractBySchema(parsed, schema);
// Build mode - validates and warns
const buildData = mappers.extractBySchema(parsed, schema, { mode: 'build' });
Field Types: plaintext, richtext, excerpt, number, image, link
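To make the plaintext and excerpt behavior described above more concrete, here is a minimal sketch of tag stripping and word-boundary truncation. The helper names (stripTags, truncateAtWord, makeExcerpt) are hypothetical and are not part of the library's API; the library's own rules for "intelligent" truncation may differ.

```javascript
// Hypothetical helpers for illustration only; not the library's API.

// Remove HTML tags, keeping only their text content.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

// Truncate at the last word boundary that fits, then append an ellipsis.
// One character is reserved so the result never exceeds maxLength.
function truncateAtWord(text, maxLength) {
  if (text.length <= maxLength) return text;
  const slice = text.slice(0, maxLength - 1);
  const lastSpace = slice.lastIndexOf(" ");
  const cut = lastSpace > 0 ? slice.slice(0, lastSpace) : slice;
  return cut + "…";
}

// Build an excerpt: join paragraphs, strip tags, then truncate.
function makeExcerpt(paragraphs, maxLength) {
  return truncateAtWord(stripTags(paragraphs.join(" ")), maxLength);
}

const plainTitle = truncateAtWord(stripTags("Welcome <strong>home</strong>"), 60);
const excerptText = makeExcerpt(
  ["Get <em>started</em> today.", "It takes five minutes."],
  20
);

console.log(plainTitle);  // "Welcome home"
console.log(excerptText); // "Get started today.…"
```

Cutting at a space rather than at a fixed character count avoids ending the excerpt in the middle of a word.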
Using Pre-Built Extractors
import { parseContent, mappers } from "@uniweb/semantic-parser";
const parsed = parseContent(doc);
// Extract hero component data
const heroData = mappers.extractors.hero(parsed);
// { title, subtitle, kicker, description, image, cta, ... }
// Extract card data
const cards = mappers.extractors.card(parsed, { useItems: true });
// Extract statistics
const stats = mappers.extractors.stats(parsed);
// [{ value: "12", label: "Partner Labs" }, ...]
// Extract navigation menu
const nav = mappers.extractors.navigation(parsed);
// Extract features list
const features = mappers.extractors.features(parsed);
Schema-Based Mapping
Define custom mappings using schemas:
const schema = {
brand: "pretitle",
title: "title",
subtitle: "subtitle",
image: {
path: "imgs[0].url",
defaultValue: "/placeholder.jpg"
},
actions: {
path: "links",
transform: links => links.map(l => ({ label: l.label, type: "primary" }))
}
};
const componentData = mappers.accessor.extractBySchema(parsed, schema);
Available Extractors
- hero - Hero/banner sections
- card - Card components
- article - Article/blog content
- stats - Statistics/metrics
- navigation - Navigation menus
- features - Feature lists
- testimonial - Testimonials
- faq - FAQ sections
- pricing - Pricing tiers
- team - Team members
- gallery - Image galleries
See Mapping Patterns Guide for complete documentation.
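As a rough illustration of how a schema with path, defaultValue, and transform entries can be resolved, the sketch below walks paths such as "imgs[0].url" over a parsed-content object. The functions getByPath and extractBySchemaSketch are hypothetical stand-ins written for this example; the library's extractBySchema may handle defaults, transforms, and missing values differently.

```javascript
// Hypothetical sketch; the library's own extractBySchema may behave differently.

// Resolve a path such as "imgs[0].url" against an object.
function getByPath(source, path) {
  const keys = path.replace(/\[(\d+)\]/g, ".$1").split(".").filter(Boolean);
  return keys.reduce(
    (value, key) => (value == null ? undefined : value[key]),
    source
  );
}

// Each schema entry is either a path string or { path, defaultValue, transform }.
function extractBySchemaSketch(parsed, schema) {
  const result = {};
  for (const [field, rule] of Object.entries(schema)) {
    const { path, defaultValue, transform } =
      typeof rule === "string" ? { path: rule } : rule;
    let value = getByPath(parsed, path);
    if (value === undefined) value = defaultValue;
    result[field] = transform && value !== undefined ? transform(value) : value;
  }
  return result;
}

// Sample data in the shape of the documented output structure.
const parsedSample = {
  pretitle: "Acme",
  title: "Welcome",
  imgs: [],
  links: [{ label: "Start", href: "/start" }],
};

const componentDataSketch = extractBySchemaSketch(parsedSample, {
  brand: "pretitle",
  title: "title",
  image: { path: "imgs[0].url", defaultValue: "/placeholder.jpg" },
  actions: {
    path: "links",
    transform: (links) => links.map((l) => ({ label: l.label, type: "primary" })),
  },
});

console.log(componentDataSketch.image); // "/placeholder.jpg" (imgs is empty)
```

Because imgs is empty in the sample, the image field falls back to its defaultValue instead of throwing on a missing array element.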
Rendering Content
After extracting content, render it using a Text component that handles paragraph arrays, rich HTML, and formatting marks.
Text Component Pattern
import { parseContent, mappers } from '@uniweb/semantic-parser';
import { H1, P } from './components/Text';
const parsed = parseContent(doc);
const hero = mappers.extractors.hero(parsed);
// Render extracted content
<>
<H1 text={hero.title} />
<P text={hero.description} /> {/* Handles arrays automatically */}
</>
The Text component:
- Handles arrays - Renders ["Para 1", "Para 2"] as separate paragraphs
- Supports rich HTML - Preserves formatting marks
- Multi-line headings - Wraps multiple lines in semantic heading tags
- Color marks - Supports <mark> and <span> for visual emphasis
See Text Component Reference for implementation guide.
Sanitization
Sanitize content at the engine level (during data preparation), not in components:
import { parseContent, mappers } from '@uniweb/semantic-parser';
function prepareData(parsed) {
const hero = mappers.extractors.hero(parsed);
return {
...hero,
title: mappers.types.sanitizeHtml(hero.title, {
allowedTags: ['strong', 'em', 'mark', 'span'],
allowedAttr: ['class', 'data-variant']
})
};
}
The parser provides sanitization utilities but doesn't enforce their use. Your engine decides when to sanitize based on security requirements.
Content Grouping
The parser supports two grouping modes:
Heading-Based Grouping (Default)
Groups are created based on heading patterns. A new group starts when:
- A heading follows content
- Multiple H1s appear (no main content created)
- The heading level indicates a new section
Pretitle Detection: Any heading followed by a more important heading is automatically detected as a pretitle:
- H3 before H1 → pretitle ✅
- H2 before H1 → pretitle ✅
- H6 before H5 → pretitle ✅
- H4 before H2 → pretitle ✅
No configuration is needed; detection relies only on the relative heading levels.
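The two rules above that are easiest to state precisely (a heading that follows content starts a new group, and a heading directly followed by a more important heading is a pretitle) can be sketched as follows. This is an illustrative approximation over the sequence shape shown earlier, not the parser's actual algorithm; isPretitle and groupByHeadings are hypothetical names, and the remaining rules (multiple H1s, level-based sections) are not modeled.

```javascript
// Illustrative only; not the parser's implementation.
// Elements use the sequence shape: { type, level?, content }.

// A heading is a pretitle when the very next element is a heading
// with a smaller level number (a more important heading).
function isPretitle(sequence, index) {
  const current = sequence[index];
  const next = sequence[index + 1];
  return (
    current?.type === "heading" &&
    next?.type === "heading" &&
    next.level < current.level
  );
}

// Start a new group whenever a heading follows non-heading content.
function groupByHeadings(sequence) {
  const groups = [];
  let current = null;
  let sawContent = false;
  for (const element of sequence) {
    const isHeading = element.type === "heading";
    if (current === null || (isHeading && sawContent)) {
      current = [];
      groups.push(current);
      sawContent = false;
    }
    current.push(element);
    if (!isHeading) sawContent = true;
  }
  return groups;
}

const sampleSequence = [
  { type: "heading", level: 3, content: "New" },
  { type: "heading", level: 1, content: "Welcome" },
  { type: "paragraph", content: "Get started today." },
  { type: "heading", level: 2, content: "Feature 1" },
  { type: "paragraph", content: "Details." },
];

const pretitleFlags = sampleSequence.map((_, i) => isPretitle(sampleSequence, i));
const headingGroups = groupByHeadings(sampleSequence);

console.log(pretitleFlags);        // [true, false, false, false, false]
console.log(headingGroups.length); // 2
```

In the sample, the H3 before the H1 is flagged as a pretitle, and "Feature 1" opens a second group because it follows a paragraph.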
Divider-Based Grouping
When any horizontal rule (---) is present, the entire document uses divider-based grouping. Groups are split explicitly by dividers.
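A minimal sketch of the divider rule, assuming the horizontal rule appears as a top-level node named "horizontalRule" (TipTap's default name) or "horizontal_rule" (ProseMirror's basic schema). The functions hasDivider and splitByDividers are hypothetical, and dropping empty groups is a choice made for this example rather than documented parser behavior.

```javascript
// Illustrative only; node type names are assumptions about the editor schema.
const DIVIDER_TYPES = new Set(["horizontalRule", "horizontal_rule"]);

// True when the document contains at least one divider node.
function hasDivider(nodes) {
  return nodes.some((node) => DIVIDER_TYPES.has(node.type));
}

// Split top-level nodes into groups, using dividers as explicit boundaries.
function splitByDividers(nodes) {
  const groups = [[]];
  for (const node of nodes) {
    if (DIVIDER_TYPES.has(node.type)) {
      groups.push([]); // a divider closes the current group
    } else {
      groups[groups.length - 1].push(node);
    }
  }
  // Drop empty groups produced by leading, trailing, or repeated dividers.
  return groups.filter((group) => group.length > 0);
}

const docNodes = [
  { type: "heading", attrs: { level: 1 } },
  { type: "paragraph" },
  { type: "horizontalRule" },
  { type: "heading", attrs: { level: 2 } },
  { type: "paragraph" },
];

const dividerGroups = hasDivider(docNodes) ? splitByDividers(docNodes) : [docNodes];

console.log(dividerGroups.length); // 2
```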
Text Formatting
Inline formatting is preserved as HTML tags:
// Input: Text with bold mark
// Output: "Text with <strong>bold</strong>"
// Input: Text with italic mark
// Output: "Text with <em>emphasis</em>"
// Input: Link mark
// Output: "Click <a href=\"/docs\">here</a>"
// Input: Span mark (bracketed spans)
// Output: "This is <span class=\"highlight\">highlighted</span> text"
Span Marks
Bracketed spans ([text]{.class}) are converted to <span> elements with their attributes:
// Input mark
{ type: "span", attrs: { class: "highlight", id: "note-1" } }
// Output HTML
'<span class="highlight" id="note-1">text</span>'
Spans can have classes, IDs, and custom attributes. They combine with other marks: a span with bold becomes <strong><span class="...">text</span></strong>.
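The mark-to-HTML conversion described above can be sketched as a small function over a single text node. The mark names used here ("bold", "italic", "link", "span") follow common TipTap conventions and are assumptions, as is the fixed nesting order; renderTextNode is a hypothetical helper, not a function exported by the parser.

```javascript
// Illustrative only; mark names and nesting order are assumptions.

function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Serialize mark attributes such as { class: "highlight", id: "note-1" }.
function attrsToString(attrs = {}) {
  return Object.entries(attrs)
    .filter(([, value]) => value != null)
    .map(([name, value]) => ` ${name}="${escapeHtml(value)}"`)
    .join("");
}

// Innermost wrapper first, so a span with bold yields <strong><span>...</span></strong>.
const WRAP_ORDER = ["span", "link", "italic", "bold"];

function renderTextNode(node) {
  const marks = [...(node.marks ?? [])].sort(
    (a, b) => WRAP_ORDER.indexOf(a.type) - WRAP_ORDER.indexOf(b.type)
  );
  return marks.reduce((html, mark) => {
    switch (mark.type) {
      case "span":
        return `<span${attrsToString(mark.attrs)}>${html}</span>`;
      case "link":
        return `<a href="${escapeHtml(mark.attrs?.href ?? "")}">${html}</a>`;
      case "italic":
        return `<em>${html}</em>`;
      case "bold":
        return `<strong>${html}</strong>`;
      default:
        return html; // unknown marks are ignored in this sketch
    }
  }, escapeHtml(node.text));
}

const spanWithBold = renderTextNode({
  type: "text",
  text: "text",
  marks: [
    { type: "bold" },
    { type: "span", attrs: { class: "highlight", id: "note-1" } },
  ],
});

console.log(spanWithBold);
// <strong><span class="highlight" id="note-1">text</span></strong>
```

Sorting marks into a fixed order keeps the output stable regardless of the order in which the editor stored them.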
Documentation
- Content Writing Guide: Learn how to structure content for optimal parsing
- API Reference: Complete API documentation with all element types
- Mapping Patterns Guide: Transform content to component-specific formats
- Text Component Reference: Reference implementation for rendering parsed content
- File Structure: Codebase organization
Use Cases
- Component-based websites: Extract structured data for React/Vue components
- Content management: Parse editor content into database-friendly structures
- Static site generation: Transform rich content into template-ready data
- Content analysis: Analyze document structure and content types
License
GPL-3.0-or-later
