@uniwebcms/semantic-parser

v2.5.0

Published

a month ago

Semantic parser for ProseMirror/TipTap content structures

0High
0Medium
0Low

prosemirror tiptap parser semantic content

@uniwebcms/semantic-parser

A semantic parser for ProseMirror/TipTap content structures that helps bridge the gap between natural content writing and component-based web development.

What it Does

The parser transforms rich text editor content (ProseMirror/TipTap) into structured, semantic groups that web components can easily consume. It provides three complementary views of your content:

Sequence: A flat, ordered list of all content elements
Groups: Content organized into semantic sections with identified main content
ByType: Elements categorized by type for easy filtering and queries

Installation

npm install @uniwebcms/semantic-parser

Quick Start

import { parseContent } from "@uniwebcms/semantic-parser";

// Your ProseMirror/TipTap document
const doc = {
  type: "doc",
  content: [
    {
      type: "heading",
      attrs: { level: 1 },
      content: [{ type: "text", text: "Welcome" }],
    },
    {
      type: "paragraph",
      content: [{ type: "text", text: "Get started today." }],
    },
  ],
};

// Parse the content
const result = parseContent(doc);

// Access different views
console.log(result.sequence);  // Flat array of elements
console.log(result.groups);    // Semantic groups with main/items
console.log(result.byType);    // Elements organized by type

Output Structure

Sequence View

A flat array of semantic elements preserving document order:

result.sequence = [
  { type: "heading", level: 1, content: "Welcome" },
  { type: "paragraph", content: "Get started today." }
]

Groups View

Content organized into semantic groups:

result.groups = {
  main: {
    header: {
      pretitle: "",           // H3 before main title
      title: "Welcome",       // Main heading
      subtitle: ""            // Heading after main title
    },
    body: {
      paragraphs: ["Get started today."],
      imgs: [],
      videos: [],
      links: [],
      lists: [],
      // ... more content types
    },
    banner: null,             // Optional banner image
    metadata: { level: 1 }
  },
  items: [],                  // Additional content groups
  metadata: {
    dividerMode: false,       // Using dividers vs headings
    groups: 0
  }
}

ByType View

Elements organized by type with context:

result.byType = {
  headings: [
    {
      type: "heading",
      level: 1,
      content: "Welcome",
      context: {
        position: 0,
        previousElement: null,
        nextElement: { type: "paragraph", ... },
        nearestHeading: null
      }
    }
  ],
  paragraphs: [ /* ... */ ],
  images: {
    background: [],
    content: [],
    gallery: [],
    icon: []
  },
  lists: [],
  metadata: {
    totalElements: 2,
    dominantType: "paragraph",
    hasMedia: false
  },
  // Helper methods
  getHeadingsByLevel(level),
  getElementsByHeadingContext(filter)
}

Common Use Cases

Extracting Main Content

const { groups } = parseContent(doc);

const title = groups.main.header.title;
const description = groups.main.body.paragraphs.join(" ");
const image = groups.main.banner?.url;

Processing Content Sections

const { groups } = parseContent(doc);

// Main content
console.log("Main:", groups.main.header.title);

// Additional sections
groups.items.forEach(item => {
  console.log("Section:", item.header.title);
  console.log("Content:", item.body.paragraphs);
});

Finding Specific Elements

const { byType } = parseContent(doc);

// Get all H2 headings
const subheadings = byType.getHeadingsByLevel(2);

// Get all background images
const backgrounds = byType.images.background;

// Get content under specific headings
const features = byType.getElementsByHeadingContext(
  h => h.content.includes("Features")
);

Sequential Processing

const { sequence } = parseContent(doc);

sequence.forEach(element => {
  switch(element.type) {
    case 'heading':
      renderHeading(element);
      break;
    case 'paragraph':
      renderParagraph(element);
      break;
    case 'image':
      renderImage(element);
      break;
  }
});

Content Mapping Utilities

The parser includes optional mapping utilities to transform parsed content into component-specific formats. Perfect for visual editors and component-based systems.

Type System (Recommended)

Automatically transform content based on field types with context-aware behavior:

const schema = {
  title: {
    path: "groups.main.header.title",
    type: "plaintext",  // Auto-strips <strong>, <em>, etc.
    maxLength: 60       // Auto-truncates intelligently
  },
  excerpt: {
    path: "groups.main.body.paragraphs",
    type: "excerpt",    // Auto-creates excerpt from paragraphs
    maxLength: 150
  },
  image: {
    path: "groups.main.body.imgs[0].url",
    type: "image",
    defaultValue: "/placeholder.jpg"
  }
};

// Visual editor mode (default) - silent, graceful cleanup
const data = mappers.extractBySchema(parsed, schema);

// Build mode - validates and warns
const data = mappers.extractBySchema(parsed, schema, { mode: 'build' });

Field Types: plaintext, richtext, excerpt, number, image, link

Using Pre-Built Extractors

import { parseContent, mappers } from "@uniwebcms/semantic-parser";

const parsed = parseContent(doc);

// Extract hero component data
const heroData = mappers.extractors.hero(parsed);
// { title, subtitle, kicker, description, image, cta, ... }

// Extract card data
const cards = mappers.extractors.card(parsed, { useItems: true });

// Extract statistics
const stats = mappers.extractors.stats(parsed);
// [{ value: "12", label: "Partner Labs" }, ...]

// Extract navigation menu
const nav = mappers.extractors.navigation(parsed);

// Extract features list
const features = mappers.extractors.features(parsed);

Schema-Based Mapping

Define custom mappings using schemas:

const schema = {
  brand: "groups.main.header.pretitle",
  title: "groups.main.header.title",
  subtitle: "groups.main.header.subtitle",
  image: {
    path: "groups.main.body.imgs[0].url",
    defaultValue: "/placeholder.jpg"
  },
  actions: {
    path: "groups.main.body.links",
    transform: links => links.map(l => ({ label: l.label, type: "primary" }))
  }
};

const componentData = mappers.accessor.extractBySchema(parsed, schema);

Available Extractors

hero - Hero/banner sections
card - Card components
article - Article/blog content
stats - Statistics/metrics
navigation - Navigation menus
features - Feature lists
testimonial - Testimonials
faq - FAQ sections
pricing - Pricing tiers
team - Team members
gallery - Image galleries

See Mapping Patterns Guide for complete documentation.

Rendering Content

After extracting content, render it using a Text component that handles paragraph arrays, rich HTML, and formatting marks.

Text Component Pattern

import { parseContent, mappers } from '@uniwebcms/semantic-parser';
import { H1, P } from './components/Text';

const parsed = parseContent(doc);
const hero = mappers.extractors.hero(parsed);

// Render extracted content
<>
  <H1 text={hero.title} />
  <P text={hero.description} />  {/* Handles arrays automatically */}
</>

The Text component:

Handles arrays - Renders ["Para 1", "Para 2"] as separate paragraphs
Supports rich HTML - Preserves formatting marks
Multi-line headings - Wraps multiple lines in semantic heading tags
Color marks - Supports <mark> and <span> for visual emphasis

See Text Component Reference for implementation guide.

Sanitization

Sanitize content at the engine level (during data preparation), not in components:

import { parseContent, mappers } from '@uniwebcms/semantic-parser';

function prepareData(parsed) {
  const hero = mappers.extractors.hero(parsed);
  return {
    ...hero,
    title: mappers.types.sanitizeHtml(hero.title, {
      allowedTags: ['strong', 'em', 'mark', 'span'],
      allowedAttr: ['class', 'data-variant']
    })
  };
}

The parser provides sanitization utilities but doesn't enforce their use. Your engine decides when to sanitize based on security requirements.

Content Grouping

The parser supports two grouping modes:

Heading-Based Grouping (Default)

Groups are created based on heading patterns. A new group starts when:

A heading follows content
Multiple H1s appear (no main content created)
The heading level indicates a new section

Pretitle Detection: Any heading followed by a more important heading is automatically detected as a pretitle:

H3 before H1 → pretitle ✅
H2 before H1 → pretitle ✅
H6 before H5 → pretitle ✅
H4 before H2 → pretitle ✅

No configuration needed - it just works naturally!

Divider-Based Grouping

When any horizontal rule (---) is present, the entire document uses divider-based grouping. Groups are split explicitly by dividers.

Text Formatting

Inline formatting is preserved as HTML tags:

// Input: Text with bold mark
// Output: "Text with <strong>bold</strong>"

// Input: Link mark
// Output: "Click <a href=\"/docs\">here</a>"

Documentation

Content Writing Guide: Learn how to structure content for optimal parsing
API Reference: Complete API documentation with all element types
Mapping Patterns Guide: Transform content to component-specific formats
Text Component Reference: Reference implementation for rendering parsed content
File Structure: Codebase organization

Use Cases

Component-based websites: Extract structured data for React/Vue components
Content management: Parse editor content into database-friendly structures
Static site generation: Transform rich content into template-ready data
Content analysis: Analyze document structure and content types

License

GPL-3.0-or-later

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@uniwebcms/semantic-parser

What it Does

Installation

Quick Start

Output Structure

Sequence View

Groups View

ByType View

Common Use Cases

Extracting Main Content

Processing Content Sections

Finding Specific Elements

Sequential Processing

Content Mapping Utilities

Type System (Recommended)

Using Pre-Built Extractors

Schema-Based Mapping

Available Extractors

Rendering Content

Text Component Pattern

Sanitization

Content Grouping

Heading-Based Grouping (Default)

Divider-Based Grouping

Text Formatting

Documentation

Use Cases

License