html-string-splitter

v2.1.0

Published a month ago

Split HTML strings by character, word, sentence, line, or HTML tag while preserving valid HTML structure.

Vulnerabilities: 0 high, 0 medium, 0 low

hrdelwar

html html-split html-truncate html-chunk html-tokenizer truncate split word-count text-extraction typescript


Why?

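Truncating HTML is hard. String.prototype.slice() counts markup as characters, so it can cut a tag in half or leave elements unclosed, producing invalid HTML. This library counts only visible content and closes any tags left open at the cut: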

// Invalid HTML: String.slice() cuts by raw string index
'<p>Hello <strong>world</strong></p>'.slice(0, 18)
// '<p>Hello <strong>w'  (<p> and <strong> are never closed)

// Valid HTML: clip() counts visible characters and closes open tags
clip('<p>Hello <strong>world</strong></p>', { keep: 7, by: 'c' })
// '<p>Hello <strong>w...</strong></p>'
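
A rough tag-balance check makes the difference between the two results concrete. The sketch below is not part of html-string-splitter: `unclosedTags` is a hypothetical helper that ignores void elements, comments, and attribute values containing `>`, so treat it as an illustration only. The `clipped` string is simply the output quoted in the example above.

```typescript
// Hypothetical helper, not part of html-string-splitter.
// Returns the names of tags that are opened but never closed.
function unclosedTags(html: string): string[] {
  const open: string[] = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === '/';
    const name = match[2].toLowerCase();
    if (!isClosing) {
      open.push(name);
    } else if (open[open.length - 1] === name) {
      open.pop();
    }
  }
  return open;
}

const source = '<p>Hello <strong>world</strong></p>';
const sliced = source.slice(0, 18);
const clipped = '<p>Hello <strong>w...</strong></p>'; // output quoted in the example above

console.log(unclosedTags(sliced));  // [ 'p', 'strong' ]
console.log(unclosedTags(clipped)); // []
```
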

Zero dependencies. TypeScript. ESM + CJS. Emoji-safe. Entity-aware.

Installation

npm install html-string-splitter

Common Use Cases

Blog post preview

import { clip } from 'html-string-splitter';

clip(articleHtml, { keep: 200, by: 'c' });
// First 200 characters with "..." and valid HTML

clip(articleHtml, { keep: 200, by: 'c', suffix: '<a href="/post">Read more</a>' });
// With a "Read More" link

Word-based truncation

clip('<p>Hello beautiful world</p>', { keep: 2, by: 'w' });
// '<p>Hello beautiful...</p>'

Conditional "Read More"

import { split } from 'html-string-splitter';

const result = split(html, { keep: 200, by: 'c' });
if (result.truncated) {
  showReadMoreButton();
}
// result = { html, truncated, total, kept }
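
One way to use that result shape is sketched below. `PreviewResult` and `renderPreview` are hypothetical names, the field types are assumptions, and the reading of `total` and `kept` as "units in the source" and "units kept" is a guess based on the field names rather than something stated above. The URL is inserted without escaping, which a real implementation would need to handle.

```typescript
// Shape taken from the comment above; the field types are assumptions.
interface PreviewResult {
  html: string;
  truncated: boolean;
  total: number;
  kept: number;
}

// Hypothetical helper: append a "Read more" link only when content was cut.
function renderPreview(result: PreviewResult, postUrl: string): string {
  if (!result.truncated) {
    return result.html;
  }
  const hidden = result.total - result.kept;
  return `${result.html}<a href="${postUrl}">Read more (${hidden} more characters)</a>`;
}

// Hand-written sample data, not actual library output.
const example: PreviewResult = {
  html: '<p>Hello <strong>w...</strong></p>',
  truncated: true,
  total: 11,
  kept: 7,
};

console.log(renderPreview(example, '/post'));
```
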

Count words or characters

import { count } from 'html-string-splitter';

count('<p>Hello world</p>', { by: 'w' });  // 2
count('<p>A &amp; B</p>');                 // 5 (entity = 1 char)
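
To illustrate what "entity = 1 char" and emoji-safe counting mean, here is a minimal sketch. It is not the library's implementation: `countVisibleChars` is a hypothetical function that strips tags with a naive regex and counts code points, so multi-code-point emoji (ZWJ sequences, skin-tone modifiers) would still count as more than one.

```typescript
// Illustrative only: not the library's implementation.
// Counts visible characters by removing tags and treating each entity as one character.
function countVisibleChars(html: string): number {
  const withoutTags = html.replace(/<[^>]*>/g, '');
  // Collapse named, decimal, and hex entities to a single placeholder character.
  const entitiesCollapsed = withoutTags.replace(/&(?:[a-zA-Z]+|#\d+|#x[0-9a-fA-F]+);/g, '_');
  // Array.from iterates by code point, so a surrogate-pair emoji counts once.
  return Array.from(entitiesCollapsed).length;
}

console.log(countVisibleChars('<p>A &amp; B</p>')); // 5
console.log('👍'.length);                           // 2 (UTF-16 code units)
console.log(countVisibleChars('<p>👍</p>'));        // 1
```
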

Extract plain text

import { text } from 'html-string-splitter';

text('<p>Hello <strong>world</strong></p>');  // 'Hello world'

Paginate an article

import { chunk } from 'html-string-splitter';

const pages = chunk(articleHtml, { size: 100, by: 'w' });
// pages[0] = first 100 words (valid HTML)
// pages[1] = next 100 words (valid HTML)

Split by HTML tag

// Keep first 3 paragraphs
clip(html, { keep: 3, by: 'p' });

// Keep first 5 list items
clip(html, { keep: 5, by: 'li' });

// Count all images
count(html, { by: 'img' });

Core API

| Function | Returns | Description |
|----------|---------|-------------|
| clip(html, options) | string | Truncate HTML, return string |
| split(html, options) | SplitResult | Truncate with metadata |
| count(html, options?) | number | Count units |
| text(html, options?) | string | Extract plain text |
| splitAt(html, options) | [string, string] | Split into two parts |
| slice(html, options?) | string | Extract a range (like String.slice) |
| chunk(html, options) | string[] | Split into equal parts |

Advanced API

| Function | Returns | Description |
|----------|---------|-------------|
| summary(html) | SummaryResult | Full statistics in one pass |
| pick(html, options) | PickResult[] | Extract pieces by text or tag |
| highlight(html, query) | string | Wrap text matches in a tag |
| wrap(html, options) | string | Wrap content at intervals (by chars, words, or tags) |
| tokenize(html) | Token[] | Low-level HTML tokenizer |

Split Units

The by parameter accepts:

| Unit | Alias | Example |
|------|-------|---------|
| 'character' | 'c' | clip(html, { keep: 100, by: 'c' }) |
| 'word' | 'w' | clip(html, { keep: 20, by: 'w' }) |
| 'sentence' | 's' | clip(html, { keep: 3, by: 's' }) |
| 'line' | 'l' | clip(html, { keep: 5, by: 'l' }) |
| Any tag name | (none) | clip(html, { keep: 3, by: 'p' }) |
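
The mapping in the table can be pictured as a small lookup. The sketch below is an assumption about how such resolution could work, not the library's code: `resolveUnit`, `ALIASES`, and the `ResolvedUnit` type are hypothetical names. Note that `s` is also a valid HTML element name; how the library disambiguates is not stated here, and this sketch simply gives the alias priority.

```typescript
// Hypothetical sketch of alias resolution; names follow the table above.
type NamedUnit = 'character' | 'word' | 'sentence' | 'line';

const ALIASES: Record<string, NamedUnit> = {
  c: 'character',
  w: 'word',
  s: 'sentence',
  l: 'line',
};

const NAMED_UNITS: readonly string[] = ['character', 'word', 'sentence', 'line'];

type ResolvedUnit =
  | { kind: 'unit'; unit: NamedUnit }
  | { kind: 'tag'; tag: string };

function resolveUnit(by: string): ResolvedUnit {
  // Short aliases win over tag names in this sketch.
  if (Object.prototype.hasOwnProperty.call(ALIASES, by)) {
    return { kind: 'unit', unit: ALIASES[by] };
  }
  if (NAMED_UNITS.indexOf(by) !== -1) {
    return { kind: 'unit', unit: by as NamedUnit };
  }
  // Anything else is treated as an HTML tag name.
  return { kind: 'tag', tag: by };
}

console.log(resolveUnit('w'));  // { kind: 'unit', unit: 'word' }
console.log(resolveUnit('li')); // { kind: 'tag', tag: 'li' }
```
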

Options

Basic

clip(html, {
  keep: 10,            // units to keep (required)
  by: 'c',             // split unit (default: 'c')
  ellipsis: '...',     // appended at truncation (default: '...')
  suffix: '<a>More</a>', // HTML after ellipsis
  from: 'end',         // 'start' (default) or 'end'
  stripTags: true,     // return plain text
});

Advanced

split(html, {
  keep: 100,
  by: 'c',
  preserveWords: true,         // don't cut mid-word (true | number | 'trim')
  smartEllipsis: true,         // skip "..." at block boundaries
  stripComments: true,         // remove HTML comments
  exclude: ['figcaption'],     // remove elements entirely
  selectiveTags: ['span'],     // only strip these tags (with stripTags)
  imageWeight: 5,              // character cost for <img>, <video>, etc.
  wordPattern: /[\p{Han}]|\w+/gu, // custom word boundaries (CJK)
  output: 'both',              // return html + text in one pass
});
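
The `wordPattern` value in this example is an ordinary JavaScript regular expression, so its effect can be checked without the library. The sketch below uses only built-in `String.prototype.match`; `matchWords` is a hypothetical helper name, and how the library applies the pattern internally is not shown here.

```typescript
// The pattern from the options example: each Han character is its own word,
// and each run of ASCII word characters forms one word.
const wordPattern = /[\p{Han}]|\w+/gu;

function matchWords(text: string): string[] {
  return text.match(wordPattern) ?? [];
}

const sample = '你好 hello world';

console.log(matchWords(sample));          // [ '你', '好', 'hello', 'world' ]
console.log(sample.split(/\s+/).length);  // 3 (whitespace splitting treats '你好' as one word)
```
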

Chunk

chunk(html, {
  size: 100,           // units per chunk (required)
  by: 'w',             // split unit
  overlap: 20,         // shared units between chunks (for RAG/LLM)
  breakAt: 'word',     // don't cut mid-word
});

See Options Reference for detailed explanations with examples.
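
To make the `overlap` option concrete, the sketch below computes which unit indices each chunk would cover if consecutive chunks advance by `size - overlap`. That stride is an assumption, `chunkRanges` is a hypothetical helper, and the library's actual boundaries (especially with `breakAt`) may differ.

```typescript
// Hypothetical illustration of overlapping windows; the library's exact
// boundaries may differ. Ranges are [start, end) over unit indices.
function chunkRanges(total: number, size: number, overlap: number): Array<[number, number]> {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError('require size > 0 and 0 <= overlap < size');
  }
  const ranges: Array<[number, number]> = [];
  const stride = size - overlap;
  for (let start = 0; start < total; start += stride) {
    const end = Math.min(start + size, total);
    ranges.push([start, end]);
    if (end === total) break; // the last window already reaches the end
  }
  return ranges;
}

// 250 words, 100 per chunk, 20 shared between neighbours
console.log(chunkRanges(250, 100, 20)); // [ [ 0, 100 ], [ 80, 180 ], [ 160, 250 ] ]
```

With these numbers each chunk repeats the last 20 units of the previous one, which is the usual reason to set an overlap when feeding chunks to a retrieval or LLM pipeline.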

CommonJS

const { clip, split, count } = require('html-string-splitter');

TypeScript

import type { SplitOptions, SplitResult, ChunkOptions, PickOptions, HighlightOptions } from 'html-string-splitter';

Documentation

Guides:

Split & Clip — Truncation, splitAt, slice
Chunk — Pagination with overlap and breakAt
Count & Summary — Counting and statistics
Text Extraction — Plain text and output modes

Advanced:

Pick & Highlight — Extract pieces and highlight matches
Wrap — Wrap by chars, words, or tags
Tokenize — Low-level tokenizer API

Reference:

Options Reference — All options in detail
Migration from v1 — Upgrade guide
