flappa-doormal
v2.21.0
Arabic text marker pattern library for generating regex from declarative configurations
Why This Library?
The Problem
Working with Arabic hadith and Islamic text collections requires splitting continuous text into segments (individual hadiths, chapters, verses). This traditionally means:
- Writing complex Unicode regex patterns: ^[\u0660-\u0669]+\s*[-–—ـ]\s*
- Handling diacritic variations: حَدَّثَنَا vs حدثنا
- Managing multi-page spans and page boundary tracking
- Manually extracting hadith numbers, volume/page references
What Exists
- General regex libraries: Don't understand Arabic text nuances
- NLP tokenizers: Overkill for pattern-based segmentation
- Manual regex: Error-prone, hard to maintain, no metadata extraction
The Solution
flappa-doormal provides:
✅ Readable templates: {{raqms}} {{dash}} instead of cryptic regex
✅ Named captures: {{raqms:hadithNum}} auto-extracts to meta.hadithNum
✅ Fuzzy matching: Auto-enabled for {{bab}}, {{kitab}}, {{basmalah}}, {{fasl}}, {{naql}} (override with fuzzy: false)
✅ Content limits: maxPages and maxContentLength (safety-hardened) control segment size
✅ Page tracking: Know which page each segment came from
✅ Declarative rules: Describe what to match, not how
Installation
npm install flappa-doormal
# or
bun add flappa-doormal
# or
yarn add flappa-doormal
Quick Start
import { segmentPages } from 'flappa-doormal';
// Your pages from a hadith book
const pages = [
{ id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ...' },
{ id: 1, content: '٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ...' },
{ id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ...' },
];
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{raqms:num}} {{dash}} '],
split: 'at',
}]
});
// Result:
// [
// { content: 'حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ...', from: 1, meta: { num: '٦٦٩٦' } },
// { content: 'أَخْبَرَنَا عُمَرُ قَالَ...', from: 1, meta: { num: '٦٦٩٧' } },
// { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { num: '٦٦٩٨' } }
// ]
Segment Validation
Use validateSegments() to sanity-check segmentation output against the input pages and options. This is useful for detecting page attribution issues or maxPages violations before sending segments to downstream systems.
import { segmentPages, validateSegments } from 'flappa-doormal';
const segments = segmentPages(pages, { rules, maxPages: 0 });
const report = validateSegments(pages, { rules, maxPages: 0 }, segments);
if (!report.ok) {
console.log(report.summary);
console.log(report.issues[0]);
}
Example issue entry (truncated):
{
"type": "page_attribution_mismatch",
"severity": "error",
"segmentIndex": 2,
"expected": { "from": 5 },
"actual": { "from": 4 },
"evidence": "Content found in page 5, but segment.from=4."
}
Features
1. Template Tokens
Replace regex with readable tokens:
| Token | Matches | Regex Equivalent |
|-------|---------|------------------|
| {{raqms}} | Arabic-Indic digits | [\\u0660-\\u0669]+ |
| {{raqm}} | Single Arabic digit | [\\u0660-\\u0669] |
| {{nums}} | ASCII digits | \\d+ |
| {{num}} | Single ASCII digit | \\d |
| {{dash}} | Dash variants | [-–—ـ] |
| {{harf}} | Arabic letter | [أ-ي] |
| {{harfs}} | Single-letter codes separated by spaces, with optional marks/tatweel on each isolated letter | e.g. د ت س ي ق, هـ ث |
| {{rumuz}} | Source abbreviations (rijāl/takhrīj rumuz), incl. multi-code blocks | e.g. خت ٤, خ سي, خ فق, د ت سي ق, دت عس ق |
| {{numbered}} | Hadith numbering ٢٢ - | {{raqms}} {{dash}} |
| {{fasl}} | Section markers | فصل\|مسألة |
| {{tarqim}} | Punctuation marks | [.!?؟؛] |
| {{bullet}} | Bullet points | [•*°] |
| {{newline}} | Newline character | \n |
| {{naql}} | Narrator phrases | حدثنا\|أخبرنا\|... |
| {{kitab}} | "كتاب" (book) | كتاب |
| {{bab}} | "باب" (chapter) | باب |
| {{basmalah}} | "بسم الله" | بسم الله |
| {{hr}} | Horizontal rule (5+ chars) | [-–—ـ_=]{5,} |
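Most of these expansions can be sanity-checked outside the library with hand-written equivalents. A minimal sketch (the regex literals below are written from the table above, not taken from the library's compiled output):

```typescript
// Hand-written equivalents of a few token expansions from the table above.
const raqms = /^[\u0660-\u0669]+/;                               // {{raqms}}
const dash = /[-\u2013\u2014\u0640]/;                            // {{dash}}
const numbered = /^[\u0660-\u0669]+\s*[-\u2013\u2014\u0640]\s*/; // {{numbered}}

console.log(raqms.test('٦٦٩٦ - حدثنا'));  // true
console.log(dash.test('ـ'));              // true (tatweel)
console.log(numbered.test('٢٢ - حدثنا')); // true
```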
Token Details
- {{kitab}} – Matches "كتاب" (Book). Used in hadith collections to mark major book divisions. Example: كتاب الإيمان (Book of Faith).
- {{bab}} – Matches "باب" (Chapter). Example: باب ما جاء في الصلاة (Chapter on what came regarding prayer).
- {{fasl}} – Matches "فصل" or "مسألة" (Section/Issue). Common in fiqh books.
- {{basmalah}} – Matches "بسم الله" or "﷽". Commonly appears at the start of chapters, books, or documents.
{{naql}} matches common hadith transmission phrases:
- حدثنا (he narrated to us)
- أخبرنا (he informed us)
- حدثني (he narrated to me)
- وحدثنا (and he narrated to us)
- أنبأنا (he reported to us)
- سمعت (I heard)
{{rumuz}} matches rijāl/takhrīj source abbreviations used in narrator biography books:
- All six books: ع
- The four Sunan: ٤
- Bukhari: خ / خت / خغ / بخ / عخ / ز / ي
- Muslim: م / مق / مت
- Nasa'i: س / ن / ص / عس / سي / كن
- Abu Dawud: د / مد / قد / خد / ف / فد / ل / دل / كد / غد / صد
- Tirmidhi: ت / تم
- Ibn Majah: ق / فق
Matches blocks of codes separated by whitespace (e.g., خ سي, خ فق, خت ٤, د ت سي ق).
Note: Single-letter rumuz like ع are only matched when they appear as standalone codes, not as the first letter of words like عَن.
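The standalone-code behavior can be pictured with a plain regex sketch (an illustration of the idea only, not the library's actual pattern):

```typescript
// A standalone code must be bounded by whitespace or string edges, so ع
// matches as a rumuz code but not as the first letter of a word like عن.
const standaloneAyn = /(?:^|\s)ع(?=\s|$)/;

console.log(standaloneAyn.test('خ ع')); // true  – standalone code
console.log(standaloneAyn.test('عن'));  // false – first letter of a word
```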
| Token | Matches | Example |
|-------|---------|---------|
| {{raqms}} | One or more Arabic-Indic digits (٠-٩) | ٦٦٩٦ in ٦٦٩٦ - حدثنا |
| {{raqm}} | Single Arabic-Indic digit | ٥ |
| {{nums}} | One or more ASCII digits (0-9) | 123 |
| {{num}} | Single ASCII digit | 5 |
| {{numbered}} | Common hadith format: {{raqms}} {{dash}} | ٢٢ - حدثنا |
{{dash}} matches:
- - (hyphen-minus, U+002D)
- – (en-dash, U+2013)
- — (em-dash, U+2014)
- ـ (tatweel, U+0640, Arabic elongation character)
Example: ٦٦٩٦ - حدثنا or ٦٦٩٦ ـ حدثنا
Token Constants (TypeScript)
For better IDE support, use the Token constants instead of raw strings:
import { Token, withCapture } from 'flappa-doormal';
// Instead of:
{ lineStartsWith: ['{{kitab}}', '{{bab}}'] }
// Use:
{ lineStartsWith: [Token.KITAB, Token.BAB] }
// With named captures:
const pattern = withCapture(Token.RAQMS, 'hadithNum') + ' ' + Token.DASH + ' ';
// Result: '{{raqms:hadithNum}} {{dash}} '
{ lineStartsAfter: [pattern], split: 'at' }
// segment.meta.hadithNum will contain the matched number
Available constants: Token.BAB, Token.BASMALAH, Token.BULLET, Token.DASH, Token.FASL, Token.HARF, Token.HARFS, Token.HR, Token.KITAB, Token.NAQL, Token.NUM, Token.NUMS, Token.NUMBERED, Token.RAQM, Token.RAQMS, Token.RUMUZ, Token.TARQIM
2. Named Capture Groups
Extract metadata automatically with the {{token:name}} syntax:
// Capture hadith number
{ template: '^{{raqms:hadithNum}} {{dash}} ' }
// Result: meta.hadithNum = '٦٦٩٦'
// Capture volume and page
{ template: '^{{raqms:vol}}/{{raqms:page}} {{dash}} ' }
// Result: meta.vol = '٣', meta.page = '٤٥٦'
// Capture rest of content
{ template: '^{{raqms:num}} {{dash}} {{:text}}' }
// Result: meta.num = '٦٦٩٦', meta.text = 'حَدَّثَنَا أَبُو بَكْرٍ'
3. Fuzzy Matching (Diacritic-Insensitive)
Match Arabic text regardless of harakat:
const rules = [{
fuzzy: true,
lineStartsAfter: ['{{kitab:book}} '],
split: 'at',
}];
// Matches both:
// - 'كِتَابُ الصلاة' (with diacritics)
// - 'كتاب الصيام' (without diacritics)
4. Pattern Types
| Type | Marker in content? | Use case |
|------|-------------------|----------|
| lineStartsWith | ✅ Included | Keep marker, segment at boundary |
| lineStartsAfter | ❌ Excluded | Strip marker, capture only content |
| lineEndsWith | ✅ Included | Match patterns at end of line |
| template | Depends | Custom pattern with full control |
| regex | Depends | Raw regex for complex cases |
| dictionaryEntry | ✅ Included | Serializable Arabic dictionary headword rule |
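The included/excluded distinction in the table can be pictured with a plain regex (an illustrative sketch of the behavior, not the library's internals):

```typescript
// lineStartsWith keeps the matched marker; lineStartsAfter strips it.
const line = '٦٦٩٦ - حدثنا أبو بكر';
const m = line.match(/^[\u0660-\u0669]+\s*-\s*/);
const marker = m ? m[0] : '';

const startsWithResult = line;                       // marker included
const startsAfterResult = line.slice(marker.length); // marker stripped

console.log(startsAfterResult); // 'حدثنا أبو بكر'
```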
Building UIs with Pattern Type Keys
The library exports PATTERN_TYPE_KEYS (a const array) and PatternTypeKey (a type) for building UIs that let users select pattern types:
import { PATTERN_TYPE_KEYS, type PatternTypeKey } from 'flappa-doormal';
// PATTERN_TYPE_KEYS = ['lineStartsWith', 'lineStartsAfter', 'lineEndsWith', 'template', 'regex', 'dictionaryEntry']
// Build a dropdown/select
PATTERN_TYPE_KEYS.map(key => <option value={key}>{key}</option>)
// Type-safe validation
const isPatternKey = (k: string): k is PatternTypeKey =>
(PATTERN_TYPE_KEYS as readonly string[]).includes(k);
4.1 Page-start Guard (avoid page-wrap false positives)
When matching at line starts (e.g., {{naql}}), a new page can begin with a marker that is actually a continuation of the previous page (page wrap), not a true new segment.
Use pageStartGuard to allow a rule to match at the start of a page only if the previous page’s last non-whitespace character matches a pattern (tokens supported):
const segments = segmentPages(pages, {
rules: [{
fuzzy: true,
lineStartsWith: ['{{naql}}'],
split: 'at',
// Only allow a split at the start of a new page if the previous page ended with sentence punctuation:
pageStartGuard: '{{tarqim}}'
}]
});
This guard applies only at page starts. Mid-page line starts are unaffected.
Previous-Word Page-Start Stoplist
For dictionary-like content, page wraps can split a phrase across pages and create false positives at the top of the next page. Example:
- Page N ends with قال
- Page N+1 starts with العجاج:
Use pageStartPrevWordStoplist to suppress page-start matches when the previous
page's last Arabic word is in a stoplist. Matching is Arabic-normalized and
diacritic-insensitive.
const segments = segmentPages(pages, {
rules: [{
regex: '^(?<lemma>[ء-غف-ي]+):',
split: 'at',
pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال']
}]
});If the previous page ends with strong sentence punctuation (., !, ?, ؟, ؛),
the stoplist guard is skipped and the page-start match is allowed.
Preferred Dictionary Profile
For new Shamela-style dictionary work, prefer the top-level dictionary
profile over hand-built raw regexes or the older one-rule helper:
import { segmentPages } from 'flappa-doormal';
const segments = segmentPages(pages, {
breakpoints: ['{{tarqim}}'],
dictionary: {
version: 2,
zones: [{
name: 'main',
blockers: [
{ appliesTo: ['lineEntry', 'inlineSubentry'], use: 'pageContinuation' },
{ appliesTo: ['lineEntry', 'inlineSubentry'], use: 'intro' },
{
appliesTo: ['lineEntry', 'inlineSubentry'],
use: 'stopLemma',
words: ['ومعناه', 'ويقال', 'وقيل']
},
],
families: [
{ classes: ['chapter'], emit: 'chapter', use: 'heading' },
{ emit: 'entry', use: 'lineEntry', wrappers: 'none' },
{ emit: 'entry', prefixes: ['و'], stripPrefixesFromLemma: false, use: 'inlineSubentry' },
],
}],
},
maxPages: 1,
});
Why this is preferred:
- serializable JSON authoring shape
- profile-scoped blockers instead of giant regex blobs
- zone support for books that change layout later
- compatible with diagnostics tooling via diagnoseDictionaryProfile()
- first-class validation via validateDictionaryProfile()
Blocker authoring notes:
- previousWord.scope defaults to 'samePage'
- set scope: 'pageStart' to compare only against the previous page's last Arabic word for page-start candidates
- set scope: 'any' to combine the page-start cross-page check with the normal same-page check
- pageContinuation.authorityPrecision defaults to 'high'; use 'aggressive' when page-start continuation filtering should treat authority-like prefixes more conservatively
- qualifierTail and structuralLeak are always-on global safety checks and show up in diagnostics even though they are not zone-declared blockers
The production dictionary implementation now lives under src/dictionary/
inside the repo, separate from the generic segmentation internals.
Dictionary runtime semantics:
- segmentPages() is still the only entry point; dictionary profiles do not use a separate API
- dictionary split points are merged with ordinary rules
- when a rule split and a dictionary split land at the same offset, metadata is merged; if debug is enabled, _flappa.rule and _flappa.dictionary can both appear on the same segment
- for dictionary-only configs, content before the first detected entry/chapter is preserved as a leading segment with no dictionary metadata
Advanced: Single-Rule Arabic Dictionary Matching
createArabicDictionaryEntryRule() and the native dictionaryEntry rule shape
are still supported as the lower-level, advanced path for clients who want one
Arabic dictionary-style matcher inside a broader rules pipeline.
Use this path when:
- you need exactly one conservative dictionary headword rule
- you want to compose it with ordinary SplitRule[]
- you do not need profile zones, per-family blockers, or full-book tuning
Prefer the top-level dictionary profile when:
- segmenting an entire dictionary book
- persisting JSON config for a corpus
- the book changes layout in different sections
- you need diagnostics, rejection-reason rates, or book-specific profile tuning
Decision guide:
| Use case | Preferred API |
|----------|---------------|
| One conservative lemma matcher inside a normal segmentation pipeline | createArabicDictionaryEntryRule() / dictionaryEntry |
| Full-book dictionary segmentation with blockers, families, and zones | top-level dictionary |
| Persisted JSON config for real books | top-level dictionary |
| Advanced composition with other SplitRule[] rules | createArabicDictionaryEntryRule() / dictionaryEntry |
The helper returns a serializable native dictionaryEntry rule rather than an
eagerly-compiled regex blob:
import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
const rule = createArabicDictionaryEntryRule({
stopWords: ['وقيل', 'ويقال', 'قال', 'العجاج', 'أخاك'],
pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],
samePagePrevWordStoplist: ['جل'],
// Optional dictionary-specific shapes:
allowParenthesized: true, // e.g. (عنبر) :
allowWhitespaceBeforeColon: true, // e.g. عنبر :
allowCommaSeparated: true, // e.g. سبد، دبس:
midLineSubentries: false, // line/page starts only
});
const segments = segmentPages(pages, { rules: [rule] });
Equivalent direct JSON-authored rule:
const rule = {
dictionaryEntry: {
stopWords: ['وقيل', 'ويقال', 'قال', 'العجاج', 'أخاك'],
allowParenthesized: true,
allowWhitespaceBeforeColon: true,
allowCommaSeparated: true,
midLineSubentries: false,
},
pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],
samePagePrevWordStoplist: ['جل'],
meta: { type: 'entry' },
};
Behavior:
- Keeps the lemma marker in segment.content
- Stores the matched lemma in segment.meta.lemma
- Matches root entries at true line/page starts like عز: and لع:
- Matches mid-line subentries conservatively when they begin with و
- Supports disabling mid-line subentries entirely with midLineSubentries: false
- Can match parenthesized headwords like (عنبر) : when enabled
- Can match comma-separated headword lists like سبد، دبس: when enabled
- Can suppress same-page false positives like جلّ وعزّ: with samePagePrevWordStoplist
Option notes:
- stopWords
  - exact lemma-level blockers for non-lexical heads like وقيل or ويقال
  - use this for rejecting candidate headwords themselves
- pageStartPrevWordStoplist
  - blocks a page-start candidate when the previous page ends with one of these words
  - useful for page-wrap false positives after citation/introduction prose
- samePagePrevWordStoplist
  - blocks a same-page candidate when the previous local word matches
  - useful for phrases like جلّ وعزّ
- allowParenthesized
  - enables heads like (عنبر):
- allowWhitespaceBeforeColon
  - enables spacing variants like عنبر :
- allowCommaSeparated
  - enables grouped heads like سبد، دبس:
- midLineSubentries
  - when true, allows conservative same-line subentries such as والعزاء:
  - when false, only line-start/page-start heads are emitted
Serialization tradeoff:
- dictionaryEntry is serializable and safe to keep in JSON
- but it is still a single-rule primitive
- if you need corpus-wide blocker tuning, families, or zones, move up to the top-level dictionary profile
Example: compose with chapter rules
import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
const segments = segmentPages(pages, {
rules: [
{ lineStartsAfter: ['## '], meta: { type: 'chapter' } },
{
fuzzy: true,
lineStartsAfter: ['{{bab}} '],
meta: { type: 'chapter' },
},
createArabicDictionaryEntryRule({
stopWords: ['وقيل', 'ويقال', 'قال'],
pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],
samePagePrevWordStoplist: ['جل'],
allowCommaSeparated: true,
}),
],
breakpoints: ['{{tarqim}}'],
maxPages: 1,
});
Example: one-off advanced rule inside a non-dictionary pipeline
import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
const segments = segmentPages(pages, {
rules: [
{ lineStartsWith: ['{{kitab}}'], meta: { type: 'book' } },
{ lineStartsWith: ['{{bab}}'], meta: { type: 'chapter' } },
createArabicDictionaryEntryRule({
stopWords: ['وقيل', 'ويقال'],
midLineSubentries: false,
allowParenthesized: true,
}),
],
});
Use createArabicDictionaryEntryRule() or dictionaryEntry when you only need
one conservative dictionary matcher and want it to behave like a normal
SplitRule.
For full-book dictionary profiling, diagnostics, and book-specific tuning,
prefer the top-level dictionary contract above.
Repo Fixture Book Options
The repo keeps book-specific golden options for the four reference Shamela dictionaries as local test/support fixtures, not as part of the public package API.
If you want standalone JSON copies of those fixture options for your own local workflow, export them on demand:
bun run dictionary:export-options
bun run dictionary:export-options -- --out-dir /path/to/dictionary-options
By default this writes to out/dictionary-options/, which is not intended to be checked into the repo.
Dictionary Diagnostics
Use diagnoseDictionaryProfile() when tuning blockers and families for a
dictionary profile:
import { diagnoseDictionaryProfile } from 'flappa-doormal';
const diagnostics = diagnoseDictionaryProfile(pages, profile, {
sampleLimit: 25,
});
console.log(diagnostics.rejectionReasons);
console.log(diagnostics.rejectedLemmas.slice(0, 10));
Returned diagnostics include:
- accepted vs rejected candidate counts
- accepted counts by kind
- accepted/rejected counts by family and zone
- rejection-reason counts (intro, stopLemma, pageContinuation, qualifierTail, structuralLeak, etc.)
- top rejected lemmas
- sampled accepted/rejected candidates for quick inspection
diagnoseDictionaryProfile() is primarily a tuning API for profile authoring,
so consumers should treat its output shape as less stable than the segmentation
API itself.
Validate profiles before persisting them or shipping them to an editor/CI step:
import { validateDictionaryProfile } from 'flappa-doormal';
const issues = validateDictionaryProfile(profile);
if (issues.length > 0) {
console.error(issues);
}
Validation catches:
- empty or duplicate zones
- invalid gate shapes
- empty blocker lists
- inert heading families (for example, a heading family that emits entry but never matches entry headings)
The runtime throws DictionaryProfileValidationError if invalid profiles reach
segmentPages() or diagnoseDictionaryProfile().
Dictionary Surface Analysis
For corpus exploration and profile authoring, the library also exposes the heading/surface scanner used during the proposal phase:
import {
analyzeDictionaryMarkdownPages,
classifyDictionaryHeading,
scanDictionaryMarkdownPage,
} from 'flappa-doormal';
const kind = classifyDictionaryHeading('## (خَ غ)');
const pageMatches = scanDictionaryMarkdownPage(page);
const report = analyzeDictionaryMarkdownPages(pages);
Use these for:
- inspecting convertContentToMarkdown() output before profile authoring
- spotting structural marker/code lines
- building your own authoring tools around the same heading classifier
These are analysis helpers, not a replacement for the full runtime.
For full-book scans, use the bundled script:
bun run dictionary:scan -- --book 1687 --input /path/to/1687.json
bun run dictionary:scan -- --book 7031 --books-dir /path/to/books --json
bun run dictionary:scan -- --book 1687 --input /path/to/1687.json --out diagnostics/1687.txt
The scan script:
- reads an explicit --input file or resolves <books-dir>/<book>.json
- converts each page with convertContentToMarkdown()
- applies removeZeroWidth
- runs diagnoseDictionaryProfile() with the repo-local golden profile fixture for that book
The test suite does not require the full Shamela corpora. It uses extracted
markdown fixtures under testing/fixtures/dictionary-books/, so moving your
local books/ directory will not break CI or the built-in tests.
Dictionary Letter-Code Lines
For dictionary-specific letter-code lines like ك ش ن or (هـ ث), use
{{harfs}} and decide the metadata shape in client code:
import { getTokenPattern, segmentPages } from 'flappa-doormal';
const harfCodes = getTokenPattern('harfs').replaceAll('\\s+', '[ \\t]+');
const segments = segmentPages(pages, {
rules: [{
regex: `^(?:\\((?<huruf>${harfCodes})\\)|(?<huruf>${harfCodes}))$`,
split: 'at',
meta: { type: 'C' },
}],
});
Here huruf is just a named capture group chosen by the client, not a built-in
regex primitive.
This client-side rule can be used for:
- chapter-adjacent code lines like (هـ ث)
- consecutive bare code lines like س ط ب then س د ر
The replaceAll('\\s+', '[ \\t]+') step is intentional:
- {{harfs}} itself uses \s+
- but when embedding it in a raw full-line regex, horizontal whitespace is usually safer than unrestricted \s+, because it prevents accidental matching across newlines
5. Auto-Escaping Brackets
In lineStartsWith, lineStartsAfter, lineEndsWith, and template patterns, parentheses () and square brackets [] are automatically escaped. This means you can write intuitive patterns without manual escaping:
// Write this (clean and readable):
{ lineStartsAfter: ['({{harf}}): '], split: 'at' }
// Instead of this (verbose escaping):
{ lineStartsAfter: ['\\({{harf}}\\): '], split: 'at' }
Important: Brackets inside {{tokens}} are NOT escaped - token patterns like {{harf}} which expand to [أ-ي] work correctly.
For full regex control (character classes, capturing groups), use the regex pattern type which does NOT auto-escape:
// Character class [أب] matches أ or ب
{ regex: '^[أب] ', split: 'at' }
// Capturing group (test|text) matches either
{ regex: '^(test|text) ', split: 'at' }
// Named capture groups extract metadata from raw regex too!
{ regex: '^(?<num>[٠-٩]+)\\s+[أ-ي\\s]+:\\s*(.+)' }
// meta.num = matched number, content = captured (.+) group
6. Page Constraints
Limit rules to specific page ranges:
{
lineStartsWith: ['## '],
split: 'at',
min: 10, // Only pages 10+
max: 100, // Only pages up to 100
}
7. Max Content Length (Safety Hardened)
Split oversized segments based on character count:
{
maxContentLength: 500, // Split after 500 characters
prefer: 'longer', // Try to fill the character bucket
breakpoints: ['\\.'], // Recommended: split on punctuation within window
}
The library implements safety hardening for character-based splits:
- Safe Fallback: If no breakpoint matches, it searches backward up to 100 characters for a delimiter (whitespace or punctuation) to avoid chopping words.
- Unicode Safety: Automatically prevents splitting inside Unicode surrogate pairs (e.g., emojis), preventing text corruption.
- Validation: maxContentLength must be at least 50.
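The surrogate-pair hazard mentioned above is easy to reproduce in plain JavaScript. A minimal sketch of the problem and one possible guard (an illustration, not the library's actual implementation):

```typescript
// Naive index-based slicing can cut a surrogate pair (e.g. an emoji) in half.
const text = 'قال 😀 ثم';
const naive = text.slice(0, 5);
console.log(/[\uD800-\uDBFF]$/.test(naive)); // true – ends on a lone high surrogate

// One possible guard: back up when the cut lands on a low surrogate.
function safeCut(s: string, i: number): number {
  const c = s.charCodeAt(i);
  return c >= 0xdc00 && c <= 0xdfff ? i - 1 : i;
}
console.log(safeCut(text, 5)); // 4 – moved off the surrogate boundary
```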
7.1 Preprocessing
Apply text normalization transforms before segmentation rules are evaluated:
segmentPages(pages, {
preprocess: [
'removeZeroWidth', // Strip invisible Unicode control characters
'condenseEllipsis', // "..." → "…" (prevents {{tarqim}} false matches)
'fixTrailingWaw', // " و " → " و" (joins waw to next word)
],
rules: [...],
});
Available transforms:
| Transform | Effect | Use Case |
|-----------|--------|----------|
| removeZeroWidth | Strips U+200B–U+200F, U+202A–U+202E, U+2060–U+2064, U+FEFF | Invisible chars interfering with patterns |
| condenseEllipsis | ... → … | Prevent {{tarqim}} matching inside ellipsis |
| fixTrailingWaw | " و " → " و" | Fix OCR artifacts with detached waw |
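The last two transforms are simple enough to approximate with plain string replacements (rough sketches of the documented effects; the library's own transforms may cover more edge cases):

```typescript
// Approximations of condenseEllipsis and fixTrailingWaw.
const condenseEllipsis = (s: string) => s.replace(/\.{3,}/g, '…');
const fixTrailingWaw = (s: string) => s.replace(/ و /g, ' و');

console.log(condenseEllipsis('ثم قال...')); // 'ثم قال…'
console.log(fixTrailingWaw('قال و كتب'));   // 'قال وكتب'
```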
Page constraints:
preprocess: [
'removeZeroWidth', // All pages
{ type: 'condenseEllipsis', min: 100 }, // Pages 100+
{ type: 'fixTrailingWaw', min: 50, max: 500 }, // Pages 50-500
]
removeZeroWidth modes:
// Default: strip entirely
{ type: 'removeZeroWidth', mode: 'strip' }
// Alternative: replace with space (preserves word boundaries)
// Note: Won't insert space after existing whitespace (space, newline, tab)
{ type: 'removeZeroWidth', mode: 'space' }
8. Advanced Structural Filters
Refine rule matching with page-specific constraints:
{
lineStartsWith: ['### '],
split: 'at',
// Range constraints
min: 10, // Only match on pages 10 and above
max: 500, // Only match on pages 500 and below
exclude: [50, [100, 110]], // Skip page 50 and range 100-110
// Negative lookahead: skip rule if content matches this pattern
// (e.g. skip chapter marker if it appears inside a table/list)
  skipWhen: '^\\s*- ',
}
9. Debugging & Logging
Pass an optional logger to trace segmentation decisions or enable debug to attach match metadata to segments:
const segments = segmentPages(pages, {
rules: [...],
debug: true, // Enables detailed match metadata
logger: {
debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
info: (msg, data) => console.info(`[INFO] ${msg}`, data),
warn: (msg, data) => console.warn(`[WARN] ${msg}`, data),
error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
}
});
// Helper to format debug reason
// import { getSegmentDebugReason } from 'flappa-doormal';
// console.log(getSegmentDebugReason(segments[0])); // "Rule #0 (lineStartsWith) [idx:2] (Matched: '{{naql}}')"
Debug Metadata (_flappa)
When debug: true is enabled, the library attaches a _flappa object to each segment's meta property. This is extremely useful for understanding exactly why a segment was created and which pattern matched.
The metadata includes different fields based on the split reason:
1. Rule-based Splits
If a segment was created by one of your rules:
{
"meta": {
"_flappa": {
"rule": {
"index": 0, // Index of the rule in your rules array
"patternType": "lineStartsWith", // The type of pattern that matched
"wordIndex": 2, // Index of the specific pattern in the array
"word": "{{naql}}" // The specific pattern string that matched
}
}
}
}
2. Breakpoint-based Splits
If a segment was created by a breakpoint pattern (e.g. because it exceeded maxPages or maxContentLength):
{
"meta": {
"_flappa": {
"breakpoint": {
"index": 0, // Index of the breakpoint in your array
"pattern": "\\.", // The pattern (or `regex`) that matched
"kind": "pattern", // "pattern", "regex", or "pageBoundary"
"wordIndex": 1, // Index in `words` array (if using `words` field)
"word": "ثم " // The specific word that matched
}
}
}
}
3. Dictionary-based Splits
If a segment was created by a dictionary profile:
{
"meta": {
"_flappa": {
"dictionary": {
"family": "lineEntry"
}
}
}
}
Heading-driven dictionary splits can also record the heading class:
{
"meta": {
"_flappa": {
"dictionary": {
"family": "heading",
"headingClass": "chapter"
}
}
}
}
4. Safety Fallback Splits (maxContentLength)
If no rule or breakpoint matched and the library was forced to perform a safety fallback split:
{
"meta": {
"_flappa": {
"contentLengthSplit": {
"maxContentLength": 5000,
"splitReason": "whitespace" // "whitespace", "unicode_boundary", or "grapheme_cluster"
}
}
}
}
- whitespace: Found a safe space/newline to split at.
- unicode_boundary: No whitespace found, split at a safe character boundary (avoiding surrogate pairs).
- grapheme_cluster: Split at a grapheme boundary (avoiding diacritic/ZWJ corruption).
10. Page Joiners
Control how text from different pages is stitched together:
// Default: space ' ' joiner
// Result: "...end of page 1. Start of page 2..."
segmentPages(pages, { pageJoiner: 'space' });
// Result: "...end of page 1.\nStart of page 2..."
segmentPages(pages, { pageJoiner: 'newline' });
11. Breakpoint Preferences
When a segment exceeds maxPages or maxContentLength, breakpoints split it at the "best" available match:
{
maxPages: 1, // Minimum segment size (page span)
breakpoints: ['{{tarqim}}'],
// 'longer' (default): Greedy. Finds the match furthest in the window.
// Result: Segments stay close to the max limit.
prefer: 'longer',
// 'shorter': Conservative. Finds the first available match.
// Result: Segments split as early as possible.
prefer: 'shorter',
}
Breakpoint Pattern Behavior
When a breakpoint pattern matches, the split position is controlled by the split option:
⚠️ Split Defaults Differ: Rules default to split: 'at', while breakpoints default to split: 'after'.
{
breakpoints: [
// Default: split AFTER the match (match included in previous segment)
{ pattern: '{{tarqim}}' }, // or { pattern: '{{tarqim}}', split: 'after' }
// Alternative: split AT the match (match starts next segment)
{ pattern: 'ولهذا', split: 'at' },
],
}
split: 'after' (default)
- Previous segment ENDS WITH the matched text
- New segment STARTS AFTER the matched text
// Pattern "ولهذا" with split: 'after' on "النص الأول ولهذا النص الثاني"
// - Segment 1: "النص الأول ولهذا" (ends WITH match)
// - Segment 2: "النص الثاني" (starts AFTER match)
split: 'at'
- Previous segment ENDS BEFORE the matched text
- New segment STARTS WITH the matched text
// Pattern "ولهذا" with split: 'at' on "النص الأول ولهذا النص الثاني"
// - Segment 1: "النص الأول" (ends BEFORE match)
// - Segment 2: "ولهذا النص الثاني" (starts WITH match)
Note: For empty pattern '' (page boundary fallback), split is ignored since there is no matched text to include/exclude.
Pattern order matters - the first matching pattern wins:
{
// Patterns are tried in order
breakpoints: [
'\\.', // Try punctuation first (no need for \\s* - segments are trimmed)
'ولهذا', // Then try specific word
'', // Finally, fall back to page boundary
],
}
// If punctuation is found, "ولهذا" is never triedNote on lookahead patterns: Zero-length patterns like
(?=X)are not supported for breakpoints because they can cause non-progress scenarios. Use{ pattern: 'X', split: 'at' }instead to achieve "split before X" behavior.
Note on whitespace: Segments are trimmed by default. With split: 'at', if the match consists only of whitespace, it will be trimmed from the start of the next segment. This is usually desirable for delimiter patterns.
Tip: \s* after punctuation is redundant: because segments are trimmed, {{tarqim}}\s* produces identical output to {{tarqim}}. The trailing whitespace captured by \s* gets trimmed anyway. Save yourself the extra characters!
pattern vs regex Field
Breakpoints support two pattern fields:
| Field | Bracket escaping | Use case |
|-------|-----------------|----------|
| pattern | ()[] auto-escaped | Simple patterns, token-friendly |
| regex | None (raw regex) | Complex regex with groups, lookahead |
// Use `pattern` for simple patterns (brackets are auto-escaped)
{ pattern: '(a)', split: 'after' } // Matches literal "(a)"
{ pattern: '{{tarqim}}', split: 'after' } // Token expansion works
// Use `regex` for complex patterns with regex groups
{ regex: '\\s+(?:ولهذا|وكذلك|فلذلك)', split: 'at' } // Non-capturing group
{ regex: '{{tarqim}}', split: 'after' } // Tokens work in regex too!
If both pattern and regex are specified, regex takes precedence.
⚠️ Mid-Word Matching Caveat
Breakpoint patterns match substrings, not whole words. A pattern like ولهذا will match inside مَولهذا, causing a mid-word split:
// Content: "النص الأول مَولهذا النص"
// Pattern: { pattern: 'ولهذا', split: 'at' }
// Result:
// - Segment 1: "النص الأول مَ" ← orphaned letter!
// - Segment 2: "ولهذا النص"
Solution: Require whitespace before the pattern to ensure whole-word matching:
// Single word - require preceding whitespace
{ pattern: '\\s+ولهذا', split: 'at' }
// Multiple words using alternation - each needs whitespace prefix
{ pattern: '\\s+(?:ولهذا|وكذلك|فلذلك)', split: 'at' }
Why not \b? JavaScript's \b word boundary does not work with Arabic text. Since Arabic letters aren't considered "word characters" (\w = [a-zA-Z0-9_]), using \b will match nothing - not even standalone words. Always use \s+ prefix instead.
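This failure mode is easy to verify directly:

```typescript
// \b finds no boundary next to Arabic letters (they are not \w characters),
// so a word-boundary pattern silently matches nothing.
console.log(/\bثم\b/.test('قال ثم ذهب')); // false
// A whitespace prefix is the reliable alternative:
console.log(/\s+ثم/.test('قال ثم ذهب'));  // true
```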
The words Field (Simplified Word Breakpoints)
For breaking on multiple words, the words field provides a simpler syntax with automatic whitespace boundaries:
{
breakpoints: [
// Instead of manually writing:
// { regex: '\\s+(?:فهذا|ثم|أقول)', split: 'at' }
// Use the `words` field:
{ words: ['فهذا', 'ثم', 'أقول'], min: 100 }
],
}
Features:
- Automatic \s+ prefix for whole-word matching
- Defaults to split: 'at' (can be overridden)
- Metacharacters auto-escaped (literals match literally)
- Tokens supported ({{naql}} expands as usual)
- Longest match first (words sorted by length descending)
// Override split behavior
{ words: ['والله أعلم'], split: 'after' } // Include phrase in previous segment
// Use tokens in words
{ words: ['{{naql}}', 'وكذلك'] } // Token expansion works
// Note: `words` cannot be combined with `pattern` or `regex`
// Note: Empty `words: []` is filtered out (no-op), NOT treated as page-boundary fallback
⚠️ Partial Word Matching: The words field matches text that starts with the word, not complete words only. For example, words: ['ثم'] will also match ثمامة (a name starting with ثم).
To match only complete words, add a trailing space:
// ❌ Matches 'ثم' anywhere, including inside 'ثمامة'
{ words: ['فهذا', 'ثم', 'أقول'] }
// ✅ Matches only standalone words followed by space
{ words: ['فهذا ', 'ثم ', 'أقول '] }
Security note (ReDoS): Breakpoints (and raw regex rules) compile user-provided regular expressions. Do not accept untrusted patterns (e.g., from end users) without validation or sandboxing; some regexes can trigger catastrophic backtracking and hang the process.
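The partial-word behavior can be sanity-checked with plain regexes that approximate the automatic \s+ prefix the words field generates (a sketch, not the library's exact compiled pattern):

```javascript
// 'ثم' with an automatic \s+ prefix still matches the START of a longer word:
const partial = /\s+ثم/.test('قال ثمامة'); // true: matches inside ثمامة
// A trailing space restricts it to standalone words:
const strict = /\s+ثم /.test('قال ثمامة'); // false
const standalone = /\s+ثم /.test('قال ثم سكت'); // true
```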
12. Occurrence Filtering
Control which matches to use:
{
lineEndsWith: ['\\.'],
split: 'after',
occurrence: 'last', // Only split at LAST period on page
}
Use Cases
Simple Hadith Segmentation
Use {{numbered}} for the common "number - content" format:
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{numbered}}'],
split: 'at',
meta: { type: 'hadith' }
}]
});
// Matches: ٢٢ - حدثنا, ٦٦٩٦ – أخبرنا, etc.
// Content starts AFTER the number and dash
Hadith Segmentation with Number Extraction
For capturing the hadith number, use explicit capture syntax:
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{raqms:hadithNum}} {{dash}} '],
split: 'at',
meta: { type: 'hadith' }
}]
});
// Each segment has:
// - content: The hadith text (without number prefix)
// - from/to: Page range
// - meta: { type: 'hadith', hadithNum: '٦٦٩٦' }
Volume/Page Reference Extraction
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{raqms:vol}}/{{raqms:page}} {{dash}} '],
split: 'at'
}]
});
// meta: { vol: '٣', page: '٤٥٦' }
Chapter Detection with Fuzzy Matching
const segments = segmentPages(pages, {
rules: [{
fuzzy: true,
lineStartsAfter: ['{{kitab:book}} '],
split: 'at',
meta: { type: 'chapter' }
}]
});
// Matches "كِتَابُ" or "كتاب" regardless of diacritics
Naql (Transmission) Phrase Detection
const segments = segmentPages(pages, {
rules: [{
fuzzy: true,
lineStartsWith: ['{{naql:phrase}}'],
split: 'at'
}]
});
// meta.phrase captures which narrator phrase was matched:
// 'حدثنا', 'أخبرنا', 'حدثني', etc.
Mixed Captured and Non-Captured Tokens
// Only capture the number, not the letter
const segments = segmentPages(pages, {
rules: [{
lineStartsWith: ['{{raqms:num}} {{harf}} {{dash}} '],
split: 'at'
}]
});
// Input: '٥ أ - البند الأول'
// meta: { num: '٥' } // harf not captured (no :name suffix)
Narrator Abbreviation Codes
Use {{rumuz}} for matching rijāl/takhrīj source abbreviations (common in narrator biography books and takhrīj notes):
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{raqms:num}} {{rumuz}}:'],
split: 'at'
}]
});
// Matches: ١١١٨ ع: ... / ١١١٨ خ سي: ... / ١١١٨ خ فق: ...
// meta: { num: '١١١٨' }
// content: '...' (rumuz stripped)
Supported codes: Single-letter (ع, خ, م, د, etc.), two-letter (خت, عس, سي, etc.), digit ٤, and the word تمييز (used in jarḥ wa taʿdīl books).
Note: Single-letter rumuz like ع are only matched when they appear as standalone codes, not as the first letter of words like عَن. The pattern is diacritic-safe.
If your data uses only single-letter codes separated by spaces (e.g., د ت س ي ق), you can also use {{harfs}}.
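The standalone-code idea can be approximated with lookarounds in plain JavaScript; this is only a sketch of the concept, not the actual {{rumuz}} pattern:

```javascript
// Match ع only when it stands alone (bounded by start/end, whitespace, or ':'),
// so the first letter of a word like عَن is not mistaken for a code.
const code = /(?<=^|\s)ع(?=$|\s|:)/u;
const isCode = code.test('١١١٨ ع: فلان'); // true: standalone code
const isWordStart = code.test('عَن فلان'); // false: followed by a diacritic, not a boundary
```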
Analysis Helpers (no LLM required)
Use analyzeCommonLineStarts(pages) to discover common line-start signatures across a book, useful for rule authoring:
import { analyzeCommonLineStarts } from 'flappa-doormal';
const patterns = analyzeCommonLineStarts(pages);
// [{ pattern: "{{numbered}}", count: 1234, examples: [...] }, ...]
You can control what gets analyzed and how results are ranked:
import { analyzeCommonLineStarts } from 'flappa-doormal';
// Top 20 most common line-start signatures (by frequency)
const topByCount = analyzeCommonLineStarts(pages, {
sortBy: 'count',
topK: 20,
});
// Only analyze markdown H2 headings (lines beginning with "##")
// This shows what comes AFTER the heading marker (e.g. "## {{bab}}", "## {{numbered}}\\[", etc.)
const headingVariants = analyzeCommonLineStarts(pages, {
lineFilter: (line) => line.startsWith('##'),
sortBy: 'count',
topK: 40,
});
// Support additional prefix styles without changing library code
// (e.g. markdown blockquotes ">> ..." + headings)
const quotedHeadings = analyzeCommonLineStarts(pages, {
lineFilter: (line) => line.startsWith('>') || line.startsWith('#'),
prefixMatchers: [/^>+/u, /^#+/u],
sortBy: 'count',
topK: 40,
});
Key options:
- sortBy: 'specificity' (default) or 'count' (highest frequency first)
- lineFilter: restrict which lines are counted (e.g., only headings)
- prefixMatchers: consume syntactic prefixes (the default includes headings via /^#+/u) so you can see variations after the prefix
- normalizeArabicDiacritics: true by default (helps token matching like وأَخْبَرَنَا → {{naql}})
- whitespace: how whitespace is represented in returned patterns:
  - 'regex' (default): uses \\s* placeholders between tokens
  - 'space': uses literal single spaces (' ') between tokens (useful if you don't want \\s to later match newlines when reusing these patterns)
Note on brackets in returned patterns:
- analyzeCommonLineStarts() returns template-like signatures, not "ready-to-run regex".
- It intentionally does not escape literal ()/[] in the returned pattern (e.g., (ح) stays (ح)).
- If you paste these signatures into lineStartsWith/lineStartsAfter/template, that's fine: those template pattern types auto-escape ()[] outside {{tokens}}.
- If you paste them into a raw regex rule, you may need to escape literal brackets yourself.
Repeating Sequence Analysis (continuous text)
For texts without line breaks (continuous prose), use analyzeRepeatingSequences():
import { analyzeRepeatingSequences } from 'flappa-doormal';
const patterns = analyzeRepeatingSequences(pages, {
minElements: 2,
maxElements: 4,
minCount: 3,
topK: 20,
});
// [{ pattern: "{{naql}}\\s*{{harf}}", count: 42, examples: [...] }, ...]
Key options:
- minElements / maxElements: N-gram size range (default 1-3)
- minCount: Minimum occurrences to include (default 3)
- topK: Maximum patterns to return (default 20)
- requireToken: Only patterns containing {{tokens}} (default true)
- normalizeArabicDiacritics: Ignore diacritics when matching (default true)
Analysis → Segmentation Workflow
Use analysis functions to discover patterns, then pass to segmentPages().
Example A: Continuous Text (No Punctuation)
For prose-like text without structural line breaks:
import { analyzeRepeatingSequences, segmentPages, type Page } from 'flappa-doormal';
// Continuous Arabic text with narrator phrases
const pages: Page[] = [
{ id: 1, content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي حدثنا خالد بن زيد عن علي' },
{ id: 2, content: 'حدثنا سعيد بن جبير عن ابن عباس أخبرنا يوسف عن أنس' },
];
// Step 1: Discover repeating patterns
const patterns = analyzeRepeatingSequences(pages, { minCount: 2, topK: 10 });
// [{ pattern: '{{naql}}', count: 5, examples: [...] }, ...]
// Step 2: Build rules from discovered patterns
const rules = patterns.filter(p => p.count >= 3).map(p => ({
lineStartsWith: [p.pattern],
split: 'at' as const,
fuzzy: true,
}));
// Step 3: Segment
const segments = segmentPages(pages, { rules });
// [{ content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي', from: 1 }, ...]
Example B: Structured Text (With Numbering)
For hadith-style numbered entries:
import { analyzeCommonLineStarts, segmentPages, type Page } from 'flappa-doormal';
// Numbered hadith text
const pages: Page[] = [
{ id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ\n٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ' },
{ id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ عَنْ عَائِشَةَ' },
];
// Step 1: Discover common line-start patterns
const patterns = analyzeCommonLineStarts(pages, { topK: 10, minCount: 2 });
// [{ pattern: '{{raqms}}\\s*{{dash}}', count: 3, examples: [...] }, ...]
// Step 2: Build rules (add named capture for hadith number)
const topPattern = patterns[0]?.pattern ?? '{{raqms}} {{dash}} ';
const rules = [{
lineStartsAfter: [topPattern.replace('{{raqms}}', '{{raqms:num}}')],
split: 'at' as const,
meta: { type: 'hadith' }
}];
// Step 3: Segment
const segments = segmentPages(pages, { rules });
// [
// { content: 'حَدَّثَنَا أَبُو بَكْرٍ...', from: 1, meta: { type: 'hadith', num: '٦٦٩٦' } },
// { content: 'أَخْبَرَنَا عُمَرُ قَالَ', from: 1, meta: { type: 'hadith', num: '٦٦٩٧' } },
// { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { type: 'hadith', num: '٦٦٩٨' } },
// ]
Agent Advisor Workflow
If you want an AI agent to start from raw pages and get to a draft configuration with less hand-written glue, use suggestSegmentationOptions():
import { suggestSegmentationOptions } from 'flappa-doormal';
const report = suggestSegmentationOptions(pages, {
maxRules: 4,
topLineStarts: 12,
topRepeatingSequences: 8,
});
console.log(report.assessment);
console.log(report.recommendedOptions);
console.log(report.ruleSuggestions.slice(0, 5));
The report includes:
- preprocess cleanup hints (removeZeroWidth, condenseEllipsis, fixTrailingWaw)
- an assessment of whether the book looks structured, continuous, or mixed
- draft SplitRule[] suggestions with examples and confidence
- a ready-to-run recommendedOptions object
- rule validation output
- self-evaluation of the generated segmentation draft
- optional breakpoint suggestions when the draft still produces very large segments
For local JSON files, you can run the bundled script:
bun run segment:advise -- --input ./pages.json
bun run segment:advise -- --input ./book.json --format markdown --out ./segmentation-report.md
Input can be either:
- Page[]
- { pages: Page[] }
MCP Server
The repo now includes a stdio MCP server wrapper for agent workflows:
bun run mcp:serve
When packaged, the server binary is flappa-doormal-mcp.
Exposed MCP tools:
- inspect_book: Input { pages, advisorOptions? }. Returns preprocess detections, line-start analysis, repeating sequences, and draft rule suggestions.
- suggest_segmentation_options: Input { pages, advisorOptions? }. Returns the full advisor report, including recommendedOptions.
- preview_segmentation: Input { pages, options, sampleSegments? }. Runs segmentation and returns segments, samples, and validation.
- validate_segmentation: Input { pages, options, segments }. Validates caller-provided segments against the source book.
- score_candidate_options: Input { pages, candidates, sampleSegments? }. Ranks multiple SegmentationOptions candidates using validation and segment-shape heuristics.
All tool results are returned as JSON-friendly objects so agents can iterate without scraping prose output.
Advanced: Metadata Extraction & Data Migration
If you already have pre-segmented data (e.g., records from a database or JSON file) and want to use flappa-doormal's token system to extract metadata and clean the content without further splitting, you can use the Metadata Extraction pattern.
By setting maxPages: 0, you guarantee a 1:1 mapping: each input page produces exactly one output segment, regardless of how much text is on the page.
Example: Extracting multiple fields from pre-split records
import { segmentPages, type Page } from 'flappa-doormal';
const excerpts = [
{ nass: '٧٠١٦ - ١ - ١ - فَقَصَّتْهَا حَفْصَةُ', id: 1 },
{ nass: '٧٠١٧ (أ) - بَابُ الْقَيْدِ', id: 2 },
{ nass: 'باب الصلاة - الفصل الأول', id: 3 },
];
// Convert your data to the Page format
const pages: Page[] = excerpts.map(e => ({ content: e.nass, id: e.id }));
const result = segmentPages(pages, {
maxPages: 0, // IMPORTANT: Guarantees each page stays isolated (no merging/splitting)
rules: [
// 1. Extract triple numbers: ٧٠١٦ - ١ - ١
{
lineStartsAfter: ['{{raqms:num}} {{dash}} {{raqms:num2}} {{dash}} {{raqms:num3}} '],
},
// 2. Extract number + indicator: ٧٠١٧ (أ)
{
lineStartsAfter: ['{{raqms:num}} ({{harf:indicator}}) {{dash}} '],
},
// 3. Mark chapters using fuzzy tokens
{
fuzzy: true,
lineStartsWith: ['{{bab}} '],
meta: { type: 'Chapter' },
},
],
});
// Segment 0: { content: 'فَقَصَّتْهَا حَفْصَةُ', meta: { num: '٧٠١٦', num2: '١', num3: '١' }, ... }
// Segment 1: { content: 'بَابُ الْقَيْدِ', meta: { num: '٧٠١٧', indicator: 'أ' }, ... }
// Segment 2: { content: 'باب الصلاة - الفصل الأول', meta: { type: 'Chapter' }, ... }
Why use this?
- Pattern Robustness: Use {{raqms}}, {{dash}}, and {{harf}} instead of writing raw regex for every edge case.
- Prefix Cleaning: lineStartsAfter automatically removes the matched pattern, leaving only the clean text.
- Metadata Extraction: Named captures like {{raqms:num}} automatically populate the meta object.
- Fuzzy Headers: Use fuzzy: true to match headers like كتاب ("Book") or باب ("Chapter") regardless of Arabic diacritics.
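Fuzzy matching boils down to tolerating optional diacritics between base letters. A minimal sketch of the idea in plain JavaScript (not the library's actual implementation):

```javascript
// Allow any run of Arabic harakat (U+064B to U+0652) after each base letter.
const HARAKAT = '[\\u064B-\\u0652]*';
const fuzzyPattern = (word) => [...word].map((ch) => ch + HARAKAT).join('');
const re = new RegExp(fuzzyPattern('كتاب'));
const plain = re.test('كتاب'); // true
const vocalized = re.test('كِتَابُ'); // true (diacritics tolerated)
```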
Rule Optimization
Use optimizeRules() to automatically merge compatible rules, remove duplicate patterns, and sort rules by specificity (longest patterns first):
import { optimizeRules } from 'flappa-doormal';
const rules = [
// These will be merged because meta/fuzzy options match
{ lineStartsWith: ['{{kitab}}'], fuzzy: true, meta: { type: 'header' } },
{ lineStartsWith: ['{{bab}}'], fuzzy: true, meta: { type: 'header' } },
// This will be kept separate
{ lineStartsAfter: ['{{numbered}}'], meta: { type: 'entry' } },
];
const { rules: optimized, mergedCount } = optimizeRules(rules);
// Result:
// optimized[0] = {
// lineStartsWith: ['{{kitab}}', '{{bab}}'],
// fuzzy: true,
// meta: { type: 'header' }
// }
// optimized[1] = { lineStartsAfter: ['{{numbered}}'], ... }
Rule Validation
Use validateRules() to detect common mistakes in rule patterns before running segmentation:
import { validateRules } from 'flappa-doormal';
const issues = validateRules([
{ lineStartsAfter: ['raqms:num'] }, // Missing {{}}
{ lineStartsWith: ['{{unknown}}'] }, // Unknown token
{ lineStartsAfter: ['## (rumuz:rumuz)'] } // Typo - should be {{rumuz:rumuz}}
]);
// issues[0]?.lineStartsAfter?.[0]?.type === 'missing_braces'
// issues[1]?.lineStartsWith?.[0]?.type === 'unknown_token'
// issues[2]?.lineStartsAfter?.[0]?.type === 'missing_braces'
// To get a simple list of error strings for UI display:
import { formatValidationReport } from 'flappa-doormal';
const errors = formatValidationReport(issues);
// [
// 'Rule 1, lineStartsAfter: Missing {{}} around token "raqms:num"',
// 'Rule 2, lineStartsWith: Unknown token "{{unknown}}"',
// ...
// ]
Checks performed:
- Missing braces: Detects token names like raqms:num without {{}}
- Unknown tokens: Flags tokens inside {{}} that don't exist (e.g., {{nonexistent}})
- Duplicates: Finds duplicate patterns within the same rule
Token Mapping Utilities
When building UIs for rule editing, it's often useful to separate the token pattern (e.g., {{raqms}}) from the capture name (e.g., {{raqms:hadithNum}}).
import { applyTokenMappings, stripTokenMappings } from 'flappa-doormal';
// 1. Apply user-defined mappings to a raw template
const template = '{{raqms}} {{dash}}';
const mappings = [{ token: 'raqms', name: 'num' }];
const result = applyTokenMappings(template, mappings);
// result = '{{raqms:num}} {{dash}}'
// 2. Strip captures to get back to the canonical pattern
const raw = stripTokenMappings(result);
// raw = '{{raqms}} {{dash}}'
Prompting LLMs / Agents to Generate Rules (Shamela books)
Pre-analysis (no LLM required): generate “hints” from the book
Before prompting an LLM, you can quickly extract high-signal pattern hints from the book using:
- analyzeCommonLineStarts(pages, options) (from src/line-start-analysis.ts): common line-start signatures (tokenized)
- analyzeTextForRule(text) / detectTokenPatterns(text) (from src/pattern-detection.ts): turn a single representative line into a token template suggestion
These help the LLM avoid guessing and focus on the patterns actually present.
Step 1: top line-start signatures (frequency-first)
import { analyzeCommonLineStarts } from 'flappa-doormal';
const top = analyzeCommonLineStarts(pages, {
sortBy: 'count',
topK: 40,
minCount: 10,
});
console.log(top.map((p) => ({ pattern: p.pattern, count: p.count, example: p.examples[0] })));
Typical output (example):
[
{ pattern: "{{numbered}}", count: 1200, example: { pageId: 50, line: "١ - حَدَّثَنَا ..." } },
{ pattern: "{{bab}}", count: 180, example: { pageId: 66, line: "باب ..." } },
{ pattern: "##\\s*{{bab}}", count: 140, example: { pageId: 69, line: "## باب ..." } }
]
If you only want to analyze headings (to see what comes after ##):
const headingVariants = analyzeCommonLineStarts(pages, {
lineFilter: (line) => line.startsWith('##'),
sortBy: 'count',
topK: 40,
});
Step 2: convert a few representative lines into token templates
Pick 3–10 representative line prefixes from the book (often from the examples returned above) and run:
import { analyzeTextForRule } from 'flappa-doormal';
console.log(analyzeTextForRule("٢٩- خ سي: أحمد بن حميد ..."));
// -> { template: "{{raqms}}- {{rumuz}}: أحمد...", patternType: "lineStartsAfter", fuzzy: false, ... }
Step 3: paste the "hints" into your LLM prompt
When you prompt the LLM, include a short “Hints” section:
- Top 20–50 analyzeCommonLineStarts patterns (with counts + 1–2 examples)
- 3–10 analyzeTextForRule(...) results
- A small sample of pages (not the full book)
Then instruct the LLM to prioritize rules that align with those hints.
You can use an LLM to generate SegmentationOptions by pasting it a random subset of pages and asking it to infer robust segmentation rules. Here’s a ready-to-copy plain-text prompt:
You are helping me generate JSON configuration for a text-segmentation function called segmentPages(pages, options).
It segments Arabic book pages (e.g., Shamela) into logical segments (books/chapters/sections/entries/hadiths).
I will give you a random subset of pages so you can infer patterns. You must respond with ONLY JSON (no prose).
I will paste a random subset of pages. Each page has:
- id: page number (not necessarily consecutive)
- content: plain text; line breaks are \n
Output ONLY a JSON object compatible with SegmentationOptions (no prose, no code fences).
SegmentationOptions shape:
- rules: SplitRule[]
- optional: maxPages, breakpoints, prefer
SplitRule constraints:
- Each rule must use exactly ONE of: lineStartsWith, lineStartsAfter, lineEndsWith, template, regex
- Optional fields: split ("at" | "after"), meta, min, max, exclude, occurrence ("first" | "last"), fuzzy
Important behaviors:
- lineStartsAfter matches at line start but strips the marker from segment.content.
- Template patterns (lineStartsWith/After/EndsWith/template) auto-escape ()[] outside tokens.
- Raw regex patterns do NOT auto-escape and can include groups, named captures, etc.
Available tokens you may use in templates:
- {{basmalah}} (بسم الله / ﷽)
- {{kitab}} (كتاب)
- {{bab}} (باب)
- {{fasl}} (فصل | مسألة)
- {{naql}} (حدثنا/أخبرنا/... narration phrases)
- {{raqm}} (single Arabic-Indic digit)
- {{raqms}} (Arabic-Indic digits)
- {{num}} (single ASCII digit)
- {{nums}} (ASCII digits)
- {{dash}} (dash variants)
- {{tarqim}} (punctuation [. ! ? ؟ ؛])
- {{harf}} (Arabic letter)
- {{harfs}} (single-letter codes separated by spaces; e.g. "د ت س ي ق")
- {{rumuz}} (rijāl/takhrīj source abbreviations; matches blocks like "خت ٤", "خ سي", "خ فق")
Named captures:
- {{raqms:num}} captures to meta.num
- {{:name}} captures arbitrary text to meta.name
Your tasks:
1) Identify document structure from the sample:
- book headers (كتاب), chapter headers (باب), sections (فصل/مسألة), hadith numbering, biography entries, etc.
2) Propose a minimal but robust ordered ruleset:
- Put most-specific rules first.
- Use fuzzy:true for Arabic headings where diacritics vary.
- Use lineStartsAfter when you want to remove the marker (e.g., hadith numbers, rumuz prefixes).
3) Use constraints:
- Use min/max/exclude when front matter differs or specific pages are noisy.
4) If segments can span many pages:
- Set maxPages and breakpoints.
- Suggested breakpoints (in order): "{{tarqim}}", "\\n", "" (page boundary)
- Prefer "longer" unless there’s a reason to prefer shorter segments.
5) Capture useful metadata:
- For numbering patterns, capture the number into meta.num (e.g., {{raqms:num}}).
Examples (what good answers look like):
Example A: hadith-style numbered segments
Input pages:
PAGE 10:
٣٤ - حَدَّثَنَا ...\n... (rest of hadith)
PAGE 11:
٣٥ - حَدَّثَنَا ...\n... (rest of hadith)
Good JSON answer:
{
"rules": [
{
"lineStartsAfter": ["{{raqms:num}} {{dash}}\\s*"],
"split": "at",
"meta": { "type": "hadith" }
}
]
}
Example B: chapter markers + hadith numbers
Input pages:
PAGE 50:
كتاب الصلاة\nباب فضل الصلاة\n١ - حَدَّثَنَا ...\n...
PAGE 51:
٢ - حَدَّثَنَا ...\n...
Good JSON answer:
{
"rules": [
{ "fuzzy": true, "lineStartsWith": ["{{kitab}}"], "split": "at", "meta": { "type": "book" } },
{ "fuzzy": true, "lineStartsWith": ["{{bab}}"], "split": "at", "meta": { "type": "chapter" } },
{ "lineStartsAfter": ["{{raqms:num}}\\s*{{dash}}\\s*"], "split": "at", "meta": { "type": "hadith" } }
]
}
Example C: narrator/rijāl entries with rumuz (codes) + colon
Input pages:
PAGE 257:
٢٩- خ سي: أحمد بن حميد...\nوكان من حفاظ الكوفة.
PAGE 258:
١٠٢- ق: تمييز ولهم شيخ آخر...\n...
Good JSON answer:
{
"rules": [
{
"lineStartsAfter": ["{{raqms:num}}\\s*{{dash}}\\s*{{rumuz}}:\\s*"],
"split": "at",
"meta": { "type": "entry" }
}
]
}
Now wait for the pages.
Sentence-Based Splitting (Last Period Per Page)
const segments = segmentPages(pages, {
rules: [{
lineEndsWith: ['\\.'],
split: 'after',
occurrence: 'last',
}]
});
Multiple Rules with Priority
const segments = segmentPages(pages, {
rules: [
// First: Chapter headers (highest priority)
{ fuzzy: true, lineStartsAfter: ['{{kitab:book}} '], split: 'at', meta: { type: 'chapter' } },
// Second: Sub-chapters
{ fuzzy: true, lineStartsAfter: ['{{bab:section}} '], split: 'at', meta: { type: 'section' } },
// Third: Individual hadiths
{ lineStartsAfter: ['{{raqms:num}} {{dash}} '], split: 'at', meta: { type: 'hadith' } },
]
});
API Reference
segmentPages(pages, options)
Main segmentation function.
import { segmentPages, type Page, type SegmentationOptions, type Segment } from 'flappa-doormal';
const pages: Page[] = [
{ id: 1, content: 'First page content...' },
{ id: 2, content: 'Second page content...' },
];
const options: SegmentationOptions = {
// Optional preprocessing transforms (run before pattern matching)
// See "7.1 Preprocessing" section for details
preprocess: ['removeZeroWidth', 'condenseEllipsis'],
rules: [
{ lineStartsWith: ['## '], split: 'at' }
],
// How to join content across page boundaries in OUTPUT segments:
// - 'space' (default): page boundaries become spaces
// - 'newline': preserve page boundaries as newlines
pageJoiner: 'newline',
// Breakpoint preferences for resizing oversized segments:
// - 'longer' (default): maximizes segment size within limits
// - 'shorter': minimizes segment size (splits at first match)
prefer: 'longer',
// Post-structural limit: split if segment spans more than 2 pages
maxPages: 2,
// Post-structural limit: split if segment exceeds 5000 characters
maxContentLength: 5000,
// Enable match metadata in segments (meta.debug)
debug: true,
// Custom logger for tracing
logger: {
info: (m) => console.log(m),
warn: (m) => console.warn(m),
}
};
const segments: Segment[] = segmentPages(pages, options);
validateSegments(pages, options, segments, validationOptions?)
Validates that segments correctly map back to the source pages and adhere to constraints.
import { validateSegments } from 'flappa-doormal';
const report = validateSegments(pages, options, segments, {
// Optional: Max content length to search before falling back (default: 500)
// Segments longer than this are checked via fast path unless issues are found.
fullSearchThreshold: 1000,
});
Returns a SegmentValidationReport containing:
- ok: boolean
- summary: counts of errors/warnings
- issues: detailed list of problems (page attribution mismatch, maxPages violation, etc.)
stripHtmlTags(html)
Remove all HTML tags from content, keeping only text.
import { stripHtmlTags } from 'flappa-doormal';
const text = stripHtmlTags('<p>Hello <b>World</b></p>');
// Returns: 'Hello World'
For more sophisticated HTML-to-Markdown conversion (like converting <span data-type="title"> to ## headers), you can implement your own function. Here's an example:
const htmlToMarkdown = (html: string): string => {
return html
// Convert title spans to markdown headers
.replace(/<span[^>]*data-type=["']title["'][^>]*>(.*?)<\/span>/gi, '## $1')
// Strip narrator links but keep text
.replace(/<a[^>]*href=["']inr:\/\/[^"']*["'][^>]*>(.*?)<\/a>/gi, '$1')
// Strip all remaining HTML tags
.replace(/<[^>]*>/g, '');
};
expandTokens(template)
Expand template tokens to regex pattern.
import { expandTokens } from 'flappa-doormal';
const pattern = expandTokens('{{raqms}} {{dash}}');
// Returns: '[\u0660-\u0669]+ [-–—ـ]'
makeDiacriticInsensitive(text)
Make Arabic text diacritic-insensitive for fuzzy matching.
import { makeDiacriticInsensitive } from 'flappa-doormal';
const pattern = makeDiacriticInsensitive('حدثنا');
// Returns regex pattern matching 'حَدَّثَنَا', 'حدثنا', etc.
TOKEN_PATTERNS
Access available token definitions.
import { TOKEN_PATTERNS } from 'flappa-doormal';
console.log(TOKEN_PATTERNS.narrated);
// 'حدثنا|أخبرنا|حدثني|وحدثنا|أنبأنا|سمعت'
Pattern Detection Utilities
These functions help auto-detect tokens in text, useful for building UI tools that suggest rule configurations from user-highlighted text.
detectTokenPatterns(text)
Analyzes text and returns all detected token patterns with their positions.