html-string-splitter
v2.1.0
Split HTML strings by character, word, sentence, line, or HTML tag while preserving valid HTML structure.
Why?
Truncating HTML is hard. String.prototype.slice() cuts through markup and leaves tags unclosed, which produces invalid HTML. This library counts only visible content and closes any tags left open:
// Broken HTML
'<p>Hello <strong>world</strong></p>'.slice(0, 18)
// '<p>Hello <strong>w' — broken tag!
// Valid HTML
clip('<p>Hello <strong>world</strong></p>', { keep: 7, by: 'c' })
// '<p>Hello <strong>w...</strong></p>' (properly closed)
Zero dependencies. TypeScript. ESM + CJS. Emoji-safe. Entity-aware.
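To make the difference concrete, here is a minimal, self-contained sketch of tag-aware truncation. It is not the library's implementation: `clipSketch`, `VOID_TAGS`, and the regex tokenizer are hypothetical names and simplifications. The sketch assumes well-formed input, counts characters as code points, and counts an entity such as `&amp;` as raw characters rather than as one.

```javascript
// Illustrative sketch only; NOT how html-string-splitter is implemented.
// Elements that never take a closing tag (partial list, for illustration).
const VOID_TAGS = new Set(['br', 'hr', 'img', 'input', 'meta', 'link']);

function clipSketch(html, keep, ellipsis = '...') {
  // Matches either a tag (capturing its name) or a run of text.
  const tokenPattern = /<\/?([a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+/g;
  const openTags = []; // stack of tag names that still need closing
  let out = '';
  let remaining = keep; // visible characters we may still emit
  let match;

  while ((match = tokenPattern.exec(html)) !== null) {
    const [token, tagName] = match;

    if (tagName === undefined) {
      // Text run: only text counts against the budget.
      const chars = Array.from(token); // code points, so most emoji count once
      if (chars.length > remaining) {
        out += chars.slice(0, remaining).join('') + ellipsis;
        // Close whatever is still open, innermost first.
        while (openTags.length > 0) out += `</${openTags.pop()}>`;
        return out;
      }
      out += token;
      remaining -= chars.length;
    } else if (token.startsWith('</')) {
      openTags.pop(); // assumes closing tags are properly nested
      out += token;
    } else {
      const selfClosing = token.endsWith('/>') || VOID_TAGS.has(tagName.toLowerCase());
      if (!selfClosing) openTags.push(tagName);
      out += token;
    }
  }
  return out; // input was short enough; nothing was cut
}

console.log(clipSketch('<p>Hello <strong>world</strong></p>', 7));
// '<p>Hello <strong>w...</strong></p>'
```

The key design point is that tags are copied through without touching the budget, while a stack remembers which ones must be closed once the budget runs out.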
Installation
npm install html-string-splitter
Common Use Cases
Blog post preview
import { clip } from 'html-string-splitter';
clip(articleHtml, { keep: 200, by: 'c' });
// First 200 characters with "..." and valid HTML
clip(articleHtml, { keep: 200, by: 'c', suffix: '<a href="/post">Read more</a>' });
// With a "Read More" link
Word-based truncation
clip('<p>Hello beautiful world</p>', { keep: 2, by: 'w' });
// '<p>Hello beautiful...</p>'
Conditional "Read More"
import { split } from 'html-string-splitter';
const result = split(html, { keep: 200, by: 'c' });
if (result.truncated) {
  showReadMoreButton();
}
// result = { html, truncated, total, kept }
Count words or characters
import { count } from 'html-string-splitter';
count('<p>Hello world</p>', { by: 'w' }); // 2
count('<p>A &amp; B</p>'); // 5 (entity = 1 char)
Extract plain text
import { text } from 'html-string-splitter';
text('<p>Hello <strong>world</strong></p>'); // 'Hello world'
Paginate an article
import { chunk } from 'html-string-splitter';
const pages = chunk(articleHtml, { size: 100, by: 'w' });
// pages[0] = first 100 words (valid HTML)
// pages[1] = next 100 words (valid HTML)
Split by HTML tag
// Keep first 3 paragraphs
clip(html, { keep: 3, by: 'p' });
// Keep first 5 list items
clip(html, { keep: 5, by: 'li' });
// Count all images
count(html, { by: 'img' });
Core API
| Function | Returns | Description |
|----------|---------|-------------|
| clip(html, options) | string | Truncate HTML, return string |
| split(html, options) | SplitResult | Truncate with metadata |
| count(html, options?) | number | Count units |
| text(html, options?) | string | Extract plain text |
| splitAt(html, options) | [string, string] | Split into two parts |
| slice(html, options?) | string | Extract a range (like String.slice) |
| chunk(html, options) | string[] | Split into equal parts |
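The `text` and `count` rows both depend on reading only the visible text of a fragment. As a rough, self-contained illustration of that idea (not the library's code), the sketch below strips tags, decodes a handful of entities, and counts code points. `textSketch`, `countCharsSketch`, `decodeEntity`, and the small `NAMED` table are hypothetical; a real decoder would need the full HTML named-entity list.

```javascript
// Illustrative sketch only; names and the entity table are assumptions.
const NAMED = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'],
  ['quot', '"'], ['apos', "'"], ['nbsp', '\u00A0'],
]);

// Turns one entity reference into its character; unknown ones are left as-is.
function decodeEntity(whole, body) {
  if (body[0] !== '#') return NAMED.get(body) ?? whole;
  const isHex = body[1] === 'x' || body[1] === 'X';
  const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
  return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : whole;
}

function textSketch(html) {
  return html
    .replace(/<[^>]*>/g, '') // drop tags first, so decoded '<' is never mistaken for markup
    .replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g, decodeEntity);
}

function countCharsSketch(html) {
  // Array.from iterates code points, so an astral emoji counts once.
  // Multi-code-point sequences (ZWJ emoji, flags) would still count more than once.
  return Array.from(textSketch(html)).length;
}

console.log(textSketch('<p>Hello <strong>world</strong></p>')); // 'Hello world'
console.log(countCharsSketch('<p>A &amp; B</p>'));              // 5
```

Decoding after tag removal is what lets an entity count as a single character while keeping escaped markup such as `&lt;b&gt;` in the text.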
Advanced API
| Function | Returns | Description |
|----------|---------|-------------|
| summary(html) | SummaryResult | Full statistics in one pass |
| pick(html, options) | PickResult[] | Extract pieces by text or tag |
| highlight(html, query) | string | Wrap text matches in a tag |
| wrap(html, options) | string | Wrap content at intervals (by chars, words, or tags) |
| tokenize(html) | Token[] | Low-level HTML tokenizer |
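For readers unfamiliar with what a low-level HTML tokenizer produces, the following is a small, self-contained sketch. The token shape (`type`, `name`, `raw`) and the function name `tokenizeSketch` are assumptions made for illustration; the library's actual `Token` type is described in its Tokenize guide and may differ. The regex approach also ignores edge cases such as `>` inside attribute values.

```javascript
// Illustrative sketch only; the token shape here is hypothetical.
function tokenizeSketch(html) {
  // Alternatives, in order: comment, closing tag, opening tag, text run, stray '<'.
  const pattern = /<!--[\s\S]*?-->|<\/([a-zA-Z][\w-]*)\s*>|<([a-zA-Z][\w-]*)[^>]*>|[^<]+|</g;
  const tokens = [];

  for (const m of html.matchAll(pattern)) {
    const raw = m[0];
    if (raw.startsWith('<!--')) {
      tokens.push({ type: 'comment', raw });
    } else if (m[1] !== undefined) {
      tokens.push({ type: 'close', name: m[1].toLowerCase(), raw });
    } else if (m[2] !== undefined) {
      tokens.push({ type: 'open', name: m[2].toLowerCase(), raw });
    } else {
      tokens.push({ type: 'text', raw }); // includes any stray '<'
    }
  }
  return tokens;
}

const tokens = tokenizeSketch('<p>Hi <b>there</b></p>');
console.log(tokens.map((t) => t.type).join(','));
// 'open,text,open,text,close,close'
```

Because every character lands in exactly one token, joining the `raw` fields reproduces the input, which makes this kind of tokenizer a convenient base for truncation, counting, or highlighting.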
Split Units
The by parameter accepts:
| Unit | Alias | Example |
|------|-------|---------|
| 'character' | 'c' | clip(html, { keep: 100, by: 'c' }) |
| 'word' | 'w' | clip(html, { keep: 20, by: 'w' }) |
| 'sentence' | 's' | clip(html, { keep: 3, by: 's' }) |
| 'line' | 'l' | clip(html, { keep: 5, by: 'l' }) |
| Any tag name | — | clip(html, { keep: 3, by: 'p' }) |
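As a rough mental model for the four text units, the sketch below resolves the aliases from the table and segments plain text accordingly. It is only an approximation under stated assumptions: `splitUnitsSketch` and `UNIT_ALIASES` are hypothetical names, sentences are assumed to end at `.`, `!`, or `?`, and lines are assumed to be separated by newlines. How the library itself defines sentence and line boundaries is not stated here.

```javascript
// Illustrative sketch only; boundary rules are assumptions, not the library's.
const UNIT_ALIASES = new Map([
  ['c', 'character'], ['w', 'word'], ['s', 'sentence'], ['l', 'line'],
]);

function splitUnitsSketch(text, by = 'c') {
  const unit = UNIT_ALIASES.get(by) ?? by; // accept short alias or full name
  switch (unit) {
    case 'character':
      return Array.from(text); // code points
    case 'word':
      return text.match(/\S+/g) ?? []; // whitespace-delimited
    case 'sentence':
      // A run of text plus its terminators, or a trailing run with none.
      return (text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [])
        .map((s) => s.trim())
        .filter((s) => s.length > 0);
    case 'line':
      return text.split(/\r?\n/);
    default:
      // Tag-name units need markup, which this plain-text sketch does not handle.
      throw new Error(`Unsupported unit in this sketch: ${by}`);
  }
}

console.log(splitUnitsSketch('Hello beautiful world', 'w').slice(0, 2).join(' '));
// 'Hello beautiful'
```

Keeping the first `keep` entries of the returned array is the plain-text analogue of `clip(html, { keep, by })`.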
Options
Basic
clip(html, {
  keep: 10, // units to keep (required)
  by: 'c', // split unit (default: 'c')
  ellipsis: '...', // appended at truncation (default: '...')
  suffix: '<a>More</a>', // HTML after ellipsis
  from: 'end', // 'start' (default) or 'end'
  stripTags: true, // return plain text
});
Advanced
split(html, {
  keep: 100,
  by: 'c',
  preserveWords: true, // don't cut mid-word (true | number | 'trim')
  smartEllipsis: true, // skip "..." at block boundaries
  stripComments: true, // remove HTML comments
  exclude: ['figcaption'], // remove elements entirely
  selectiveTags: ['span'], // only strip these tags (with stripTags)
  imageWeight: 5, // character cost for <img>, <video>, etc.
  wordPattern: /[\p{Han}]|\w+/gu, // custom word boundaries (CJK)
  output: 'both', // return html + text in one pass
});
Chunk
chunk(html, {
  size: 100, // units per chunk (required)
  by: 'w', // split unit
  overlap: 20, // shared units between chunks (for RAG/LLM)
  breakAt: 'word', // don't cut mid-word
});
See Options Reference for detailed explanations with examples.
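To show how `size` and `overlap` might interact, here is a self-contained sketch that chunks plain text by words. It assumes that `overlap` means each chunk starts with the last `overlap` units of the previous one; that reading, along with the name `chunkWordsSketch`, is an assumption for illustration, and the sketch ignores HTML entirely.

```javascript
// Illustrative sketch only; the overlap semantics shown here are an assumption.
function chunkWordsSketch(text, size, overlap = 0) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= size) {
    throw new RangeError('overlap must satisfy 0 <= overlap < size');
  }

  const words = text.match(/\S+/g) ?? [];
  const step = size - overlap; // how far the window advances each time
  const chunks = [];

  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(' '));
    // Stop once a window has reached the end, to avoid a redundant tail chunk.
    if (start + size >= words.length) break;
  }
  return chunks;
}

console.log(chunkWordsSketch('a b c d e f g', 3, 1));
// [ 'a b c', 'c d e', 'e f g' ]
```

Requiring `overlap < size` guarantees the window always advances, so the loop terminates.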
CommonJS
const { clip, split, count } = require('html-string-splitter');
TypeScript
import type { SplitOptions, SplitResult, ChunkOptions, PickOptions, HighlightOptions } from 'html-string-splitter';
Documentation
Guides:
- Split & Clip — Truncation, splitAt, slice
- Chunk — Pagination with overlap and breakAt
- Count & Summary — Counting and statistics
- Text Extraction — Plain text and output modes
Advanced:
- Pick & Highlight — Extract pieces and highlight matches
- Wrap — Wrap by chars, words, or tags
- Tokenize — Low-level tokenizer API
Reference:
- Options Reference — All options in detail
- Migration from v1 — Upgrade guide
