@markupai/format-offset-mapper
v1.1.6
Map character offsets between different text formats (XML, HTML, Markdown, plain text) using a three-stage surface-alignment pipeline.
A small, dependency-light TypeScript library that maps character offsets between different text format representations (XML, XHTML, HTML, Markdown, and plain text) using a deterministic three-stage surface-alignment pipeline.
import { mapOffsets } from "@markupai/format-offset-mapper";
// DITA XML → editor plain text (most common use case)
const map = mapOffsets("xml", ditaXml, editorPlainText);
const editorStart = map[xmlIssueStart];
const editorEnd = map[xmlIssueEnd];
Table of Contents
- Overview and Motivation
- Installation
- Three-Stage Pipeline
- Public API
- Usage Examples
- Format Handler Details
- SurfaceExtractionResult Type
- Architecture Decisions
- Testing and Coverage
- Package Info and Repository Structure
- License
Overview and Motivation
The Problem
Content is often authored in a rich original format — DITA XML, XHTML, custom HTML, or Markdown — but rendered to the user as a different on-screen representation (typically plain text in native desktop editors, or rendered HTML/DOM in web editors).
Writing analysis / linting / grammar APIs receive content in the original rich format and return issue offsets measured in that same original format's coordinate system. To highlight an issue visually in the editor, those original-format offsets must be translated into the on-screen representation's coordinate system.
API receives: <topic><title>Hello World</title></topic> (DITA XML)
API returns: issue at offsets [20, 25] (XML offsets for "World")
Editor shows: Hello World (plain text on screen)
Need to find: offsets [6, 11] (plain text offsets for "World")
Without a correct offset map, highlight decorations land on the wrong characters, fall inside markup tags, or crash when offsets are out of range.
The mapping direction is always:
original-format offset → on-screen offset
(API returns) (editor needs)
For example: mapOffsets("xml", ditaXml, editorPlainText) maps DITA XML offsets (what the API returns) to plain text offsets (what the editor displays).
Design Goals
- Deterministic. Pure character-level diff; no heuristic / fuzzy fallbacks.
- Linear time. Single-pass O(n) format handlers; no AST parsers.
- Zero browser globals. Runs in Node.js, browsers, Web Workers, and edge runtimes.
- Tiny. A single runtime dependency (diff).
- Extensible. Custom formats plug in via a registry.
- Battle-tested. 100% statement / branch / function / line coverage.
Installation
npm install @markupai/format-offset-mapper
The package is ESM-only and ships TypeScript types out of the box.
import { mapOffsets, alignRanges } from "@markupai/format-offset-mapper";
Three-Stage Pipeline
Source content (format A) Target content (format B)
│ │
▼ Stage 1: format handler ▼ Stage 1: format handler
surface_src surface_tgt
map_src[srcOff → surfSrcOff] map_tgt[tgtOff → surfTgtOff]
│ │
└────────────────────────────────┘
▼ Stage 2: diff alignment (diffChars)
align[surfSrcOff → surfTgtOff]
│
▼ Stage 3: compose + optional inverse
result[srcOff] = tgtOff
Stage 1 — Format Handler
Each format handler takes the raw content string and produces:
- A surface string: the plain text with all markup stripped and entities decoded.
- A source-to-surface map: an Int32Array where map[sourceOffset] = the corresponding offset in the surface string.
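The two outputs listed above can be illustrated with a deliberately tiny handler. This is a sketch under stated assumptions, not the library's XML handler: the names `SketchSurfaceResult` and `sketchTagSkipHandler` are hypothetical, and the scanner only skips `<...>` tags (no entity decoding, comments, or CDATA).

```typescript
// Hypothetical result shape mirroring the surface + map pair described above.
interface SketchSurfaceResult {
  surface: string;
  map: Int32Array; // map[sourceOffset] = surfaceOffset, length = source.length + 1
}

// Minimal single-pass scanner: characters inside <...> emit nothing.
function sketchTagSkipHandler(source: string): SketchSurfaceResult {
  const map = new Int32Array(source.length + 1);
  let surface = "";
  let inTag = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source.charAt(i);
    // Record the surface position before anything is emitted for this character.
    map[i] = surface.length;
    if (inTag) {
      if (ch === ">") inTag = false;
    } else if (ch === "<") {
      inTag = true;
    } else {
      surface += ch;
    }
  }
  map[source.length] = surface.length; // sentinel entry
  return { surface, map };
}

const sketchHandlerResult = sketchTagSkipHandler("<p>Hello</p>");
console.log(sketchHandlerResult.surface); // "Hello"
console.log(Array.from(sketchHandlerResult.map)); // [0, 0, 0, 0, 1, 2, 3, 4, 5, 5, 5, 5, 5]
```

Tag characters share the surface position of the next content character, which is why `map[3]` (the `H`) and `map[0]` (the `<`) are both 0.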
Stage 2 — Diff Alignment
The diff library's diffChars function computes a character-level LCS diff between the source surface and the target surface. From the diff output, an alignment map is built: alignMap[sourceSurfaceOffset] = the corresponding target surface offset.
This stage is what makes the pipeline robust to whitespace normalisation, entity encoding differences, and minor Markdown parsing drift — even when the source and target surfaces are not identical character-for-character.
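To make the alignment map concrete, here is a self-contained sketch that builds one from a longest-common-subsequence table. It is an assumption-laden stand-in: it does not call the diff package, it runs in quadratic time, and the hypothetical `sketchAlignmentMap` may break ties differently from diffChars.

```typescript
// Sketch only: a quadratic-time LCS table stands in for diffChars.
function sketchAlignmentMap(oldText: string, newText: string): Int32Array {
  const n = oldText.length;
  const m = newText.length;

  // lcs[i][j] = LCS length of oldText.slice(i) and newText.slice(j).
  const lcs: Int32Array[] = [];
  for (let i = 0; i <= n; i++) lcs.push(new Int32Array(m + 1));
  for (let i = n - 1; i >= 0; i--) {
    for (let j = m - 1; j >= 0; j--) {
      lcs[i][j] =
        oldText.charAt(i) === newText.charAt(j)
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  const map = new Int32Array(n + 1);
  let i = 0;
  let j = 0;
  while (i < n) {
    if (j < m && oldText.charAt(i) === newText.charAt(j)) {
      map[i] = j; // equal run: both offsets advance together
      i++;
      j++;
    } else if (j < m && lcs[i][j + 1] >= lcs[i + 1][j]) {
      j++; // insertion in newText: skip over it
    } else {
      map[i] = j; // deletion: collapse onto the current newText position
      i++;
    }
  }
  map[n] = m; // sentinel entry
  return map;
}

const sketchIrisMap = sketchAlignmentMap("Iris", "Iris foo");
console.log(Array.from(sketchIrisMap)); // [0, 1, 2, 3, 8]

// Whitespace normalisation: the second space in the source is a deletion.
const sketchSpaceMap = sketchAlignmentMap("hello  world", "hello world");
console.log(Array.from(sketchSpaceMap)); // [0, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]
```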
Stage 3 — Compose
Fast path (targetFormat = "text", the default): The target IS the surface — no inversion needed. Stage 3 simply composes srcMap → alignMap:
result[srcOff] = alignMap[srcMap[srcOff]]
This is the most common case (XML/HTML/Markdown → editor plain text) and is cheaper than the full path.
Cross-format path (targetFormat = "html", "xml", etc.): Stage 3 composes srcMap → alignMap → inverse(tgtMap). buildInverseMap converts the target's Stage-1 map (target offset → surface offset) into a surface→target inverse using last-write-wins semantics (see Architecture Decisions for why).
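The fast-path composition is a single loop over the source map. The sketch below is illustrative only: `sketchCompose` is a hypothetical name, and both input maps are hand-written for the source `<p>Hello</p>` and the target `Say Hello` rather than produced by the library.

```typescript
// Fast-path composition: result[srcOff] = alignMap[srcMap[srcOff]].
function sketchCompose(srcMap: Int32Array, alignMap: Int32Array): Int32Array {
  const result = new Int32Array(srcMap.length);
  for (let i = 0; i < srcMap.length; i++) {
    result[i] = alignMap[srcMap[i]];
  }
  return result;
}

// Assumed Stage-1 map for "<p>Hello</p>" (surface "Hello", 12 source chars + sentinel).
const sketchSrcMap = Int32Array.from([0, 0, 0, 0, 1, 2, 3, 4, 5, 5, 5, 5, 5]);
// Assumed Stage-2 map aligning surface "Hello" to target "Say Hello" (sentinel = 9).
const sketchAlignMap = Int32Array.from([4, 5, 6, 7, 8, 9]);

const sketchComposed = sketchCompose(sketchSrcMap, sketchAlignMap);
console.log(Array.from(sketchComposed)); // [4, 4, 4, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9]
```

The `H` at XML offset 3 lands on target offset 4, and the sentinel at index 12 lands on the target length, 9.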
Public API
All exports are available from the top-level package import:
import {
mapOffsets,
buildCombinedMap,
buildAlignmentMap,
remapRange,
buildInverseMap,
alignRanges,
normalizeForComparison,
buildXmlToSurfaceMap,
buildHtmlToSurfaceMap,
buildMarkdownToSurfaceMap,
buildPlaintextSurfaceMap,
FormatHandlerRegistry,
defaultRegistry,
} from "@markupai/format-offset-mapper";
Primary Functions
mapOffsets
function mapOffsets(
sourceFormat: FormatName, // "xml" | "xhtml" | "html" | "markdown" | "text" | string
sourceContent: string,
targetContent: string,
targetFormat?: FormatName, // default: "text" (fast path)
options?: MapOffsetsOptions,
): Int32Array // result[sourceOffset] = targetOffset
The main entry point. Returns an Int32Array of length sourceContent.length + 1 where result[sourceOffset] = the corresponding offset in targetContent.
The sentinel entry result[sourceContent.length] = targetContent.length, so end-of-range lookups (map[end] where end === sourceContent.length) are always safe.
buildCombinedMap
function buildCombinedMap(
sourceFormat: FormatName,
sourceContent: string,
targetContent: string,
targetFormat?: FormatName,
options?: MapOffsetsOptions,
): CombinedMapResult
Like mapOffsets but returns all intermediate surfaces and maps for each pipeline stage. Useful for debugging and parity tests.
interface CombinedMapResult {
readonly sourceSurface: string; // plain text from source
readonly targetSurface: string; // plain text from target
readonly sourceToSurface: Int32Array; // Stage-1 source→surface map
readonly targetToSurface: Int32Array; // Stage-1 target→surface map
readonly combined: Int32Array; // same as mapOffsets() result
}
Stage-Level Functions
buildAlignmentMap
function buildAlignmentMap(oldText: string, newText: string): Int32Array
Stage 2 only. Builds a character-level diff-based alignment map from oldText to newText. map[i] is the offset in newText corresponding to offset i in oldText. Offsets inside deleted runs collapse to the single newText position where the deletion occurred.
remapRange
function remapRange(
map: Int32Array,
start: number,
end: number,
): { start: number; end: number } | null
Remaps a [start, end) half-open range through any alignment map. Returns null when the range collapses entirely (the entire range was deleted). Use this to translate [issueStart, issueEnd) pairs.
buildInverseMap
function buildInverseMap(map: Int32Array, surfaceLength: number): Int32Array
Stage 3 inverse only. Given a source→surface map, returns a surface→source inverse. Used internally for cross-format mapping when the target has non-trivial markup.
alignRanges
function alignRanges(
sourceFormat: FormatName,
sourceContent: string,
targetContent: string,
inputs: readonly AlignInput[],
options?: AlignRangesOptions,
): AlignResult[] | null
High-level batch primitive. Takes an array of source-format ranges and returns target-format ranges, optionally validating that each range's target content still matches the source (drift detection).
It wraps mapOffsets + per-range remapRange + optional normalizeForComparison comparison in a single call. It is shape-neutral by design — inputs and results are plain { start, end } pairs, not typed Match objects. Wrap the results in your own issue-payload shape as needed.
interface AlignInput {
readonly start: number;
readonly end: number;
}
interface AlignResult {
readonly range: { start: number; end: number } | null; // null → range collapsed
readonly drifted: boolean; // true → target text changed (validate mode only)
}
interface AlignRangesOptions {
readonly targetFormat?: FormatName; // default "text"
readonly registry?: FormatHandlerRegistry;
readonly validate?: boolean; // compare target substring vs source surface (normalised)
readonly abortOnDrift?: boolean; // return null from the whole call on any drift
}
Drift semantics: when validate: true, the target substring at each aligned range is compared to the source surface substring at the input range, after both are passed through normalizeForComparison. Strings that differ only in whitespace runs or zero-width characters are treated as equal.
With abortOnDrift: true, the entire call returns null the moment any range drifts. Use this when your policy is "any drift = abandon the batch". Without it, callers can inspect per-range drifted flags and decide what to do individually (partial rendering, etc.).
normalizeForComparison
function normalizeForComparison(text: string): string
Normalises a string for drift-detection comparisons by removing zero-width characters (U+200B–U+200D, U+FEFF), collapsing whitespace runs to a single space, and trimming. Used internally by alignRanges when validate: true; exported because every consumer doing its own drift check wants exactly this behaviour.
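One plausible way to implement the three steps described above is a pair of regular-expression replacements followed by a trim. This is a sketch, not the library's source; `sketchNormalize` is a hypothetical name. The order matters because JavaScript's `\s` class also matches U+FEFF, so zero-width characters are removed first.

```typescript
// Sketch of the described normalisation; not the library's implementation.
function sketchNormalize(text: string): string {
  return text
    .replace(/[\u200B-\u200D\uFEFF]/g, "") // 1. drop zero-width characters
    .replace(/\s+/g, " ") // 2. collapse whitespace runs to one space
    .trim(); // 3. trim both ends
}

console.log(sketchNormalize("hello \u200B  world\n") === "hello world"); // true
console.log(sketchNormalize("a\uFEFFb") === "ab"); // true
```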
normalizeForComparison("hello world\n") === "hello world"; // true
Format Handlers (Standalone)
All format handlers can be used independently of the pipeline:
function buildXmlToSurfaceMap(xml: string): SurfaceExtractionResult
function buildHtmlToSurfaceMap(html: string): SurfaceExtractionResult
function buildMarkdownToSurfaceMap(md: string): SurfaceExtractionResult
function buildPlaintextSurfaceMap(content: string): SurfaceExtractionResult
Registry
class FormatHandlerRegistry {
register(format: string, handler: FormatHandler): this // chainable
get(format: string): FormatHandler // throws for unknown format
has(format: string): boolean
clone(): FormatHandlerRegistry // for test isolation
}
const defaultRegistry: FormatHandlerRegistry // pre-populated with all built-ins
The default registry has "xml", "xhtml" (identical handler to "xml"), "html", "markdown", and "text" pre-registered.
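The registry contract above (chainable register, throwing get, independent clone) can be sketched with a plain Map. Everything in this block is hypothetical illustration: `SketchRegistry`, `SketchHandler`, and `sketchIdentityHandler` are invented names, and the real class may be implemented differently.

```typescript
type SketchHandler = (content: string) => { surface: string; map: Int32Array };

class SketchRegistry {
  private readonly handlers = new Map<string, SketchHandler>();

  register(format: string, handler: SketchHandler): this {
    this.handlers.set(format, handler);
    return this; // chainable
  }

  get(format: string): SketchHandler {
    const handler = this.handlers.get(format);
    if (handler === undefined) throw new Error("Unknown format: " + format);
    return handler;
  }

  has(format: string): boolean {
    return this.handlers.has(format);
  }

  clone(): SketchRegistry {
    const copy = new SketchRegistry();
    this.handlers.forEach((handler, format) => copy.register(format, handler));
    return copy; // later registrations on the copy do not touch the original
  }
}

// Identity handler in the spirit of the plaintext handler: map[i] = i.
const sketchIdentityHandler: SketchHandler = (content) => {
  const map = new Int32Array(content.length + 1);
  for (let i = 0; i <= content.length; i++) map[i] = i;
  return { surface: content, map };
};

const sketchBase = new SketchRegistry().register("text", sketchIdentityHandler);
const sketchIsolated = sketchBase.clone().register("docbook", sketchIdentityHandler);
console.log(sketchBase.has("docbook")); // false
console.log(sketchIsolated.has("docbook")); // true
```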
Types
interface SurfaceExtractionResult {
readonly surface: string; // plain text with markup stripped
readonly map: Int32Array; // map[sourceOffset] = surfaceOffset
}
type FormatHandler = (content: string) => SurfaceExtractionResult;
type BuiltInFormat = "xml" | "xhtml" | "html" | "markdown" | "text";
// Accepts built-in names or any custom string key registered in a registry
type FormatName = BuiltInFormat | (string & Record<never, never>);
interface MapOffsetsOptions {
registry?: FormatHandlerRegistry; // override the global defaultRegistry
}
// See `alignRanges` above for AlignInput / AlignResult / AlignRangesOptions.
Usage Examples
XML / DITA → Plain Text
import { mapOffsets } from "@markupai/format-offset-mapper";
const ditaXml = `
<topic>
<title>Hello</title>
<body>
<p>World &amp; more</p>
</body>
</topic>
`;
const editorPlainText = "Hello\nWorld & more";
const map = mapOffsets("xml", ditaXml, editorPlainText);
// Translate an API issue range [xmlStart, xmlEnd) to editor offsets
const editorStart = map[xmlIssueStart];
const editorEnd = map[xmlIssueEnd];
The "xhtml" format name uses the identical XML handler, so both work:
const map = mapOffsets("xhtml", xhtmlContent, editorPlainText);
HTML → Plain Text
import { mapOffsets } from "@markupai/format-offset-mapper";
const htmlContent = "<h1>Title</h1><p>Hello <strong>World</strong></p>";
const editorText = "TitleHello World";
const map = mapOffsets("html", htmlContent, editorText);
const editorPos = map[htmlApiOffset];
Markdown → Plain Text
import { mapOffsets } from "@markupai/format-offset-mapper";
const md = "# Heading\n\nHello **bold** world";
const plain = "Heading\n\nHello bold world";
const map = mapOffsets("markdown", md, plain);
XML → HTML (Cross-Format)
When both source and target have markup, pass the target format explicitly:
import { mapOffsets } from "@markupai/format-offset-mapper";
const xml = "<topic><title>Hello</title><body><p>World</p></body></topic>";
const html = "<article><h1>Hello</h1><section><p>World</p></section></article>";
// Runs the full three-stage pipeline, including the inverse-map step for the target
const map = mapOffsets("xml", xml, html, "html");
const htmlOffset = map[xmlOffset];
Using remapRange for Issue Ranges
API issues typically have [start, end) ranges. Use remapRange to translate them in one call:
import { mapOffsets, remapRange } from "@markupai/format-offset-mapper";
const offsetMap = mapOffsets("xml", ditaXml, editorText);
for (const issue of apiIssues) {
const range = remapRange(offsetMap, issue.start, issue.end);
if (range === null) continue; // issue range was deleted entirely
highlightRange(range.start, range.end);
}
Using alignRanges for Batch Issue Alignment + Drift Detection
When you have a set of API issues to align in one go — and want to skip any whose underlying text has been edited since the check — reach for alignRanges:
import { alignRanges } from "@markupai/format-offset-mapper";
// Each suggestion has a start_index and an original-text length.
const inputs = suggestions.map((s) => ({
start: s.start_index,
end: s.start_index + s.original.length,
}));
const results = alignRanges("xml", baselineXml, editorPlainText, inputs, {
validate: true, // compare target content vs source surface after normalisation
});
suggestions.forEach((suggestion, i) => {
const r = results![i];
if (!r.range || r.drifted) return;
highlightRange(suggestion, r.range.start, r.range.end);
});
For a strict abort-all-on-drift policy:
const results = alignRanges("xml", baselineXml, currentXml, inputs, {
targetFormat: "xml", // the target is XML too, so name its format explicitly
validate: true,
abortOnDrift: true,
});
if (results === null) return []; // any drift abandoned the whole batch
Debugging with buildCombinedMap
When offset mapping produces unexpected results, inspect each stage:
import { buildCombinedMap } from "@markupai/format-offset-mapper";
const result = buildCombinedMap("xml", ditaXml, editorText);
console.log("Source surface:", result.sourceSurface);
// → plain text with XML tags stripped and entities decoded
console.log("Source-to-surface at offset 42:", result.sourceToSurface[42]);
// → position in source surface for that XML offset
console.log("Final result at offset 42:", result.combined[42]);
// → position in editor text (same as mapOffsets result)
Custom Format Handler
Register a custom format in a clone of the default registry to avoid polluting the global instance:
import { defaultRegistry, mapOffsets } from "@markupai/format-offset-mapper";
import type { SurfaceExtractionResult } from "@markupai/format-offset-mapper";
function buildDocBookHandler(content: string): SurfaceExtractionResult {
// Your custom stripping logic here...
const surface = stripDocBookMarkup(content);
const map = buildOffsetMap(content, surface);
return { surface, map };
}
// Use a clone so tests don't bleed into production code
const registry = defaultRegistry.clone().register("docbook", buildDocBookHandler);
const offsetMap = mapOffsets("docbook", docbookContent, editorText, "text", { registry });
Test Isolation with Registry Clone
import { describe, it, expect } from "vitest";
import { FormatHandlerRegistry, buildXmlToSurfaceMap, buildPlaintextSurfaceMap, mapOffsets } from "@markupai/format-offset-mapper";
// Build a minimal registry — no side-effects on defaultRegistry
const registry = new FormatHandlerRegistry()
.register("xml", buildXmlToSurfaceMap)
.register("text", buildPlaintextSurfaceMap);
describe("my adapter tests", () => {
it("maps correctly", () => {
const map = mapOffsets("xml", "<p>Hello</p>", "Hello", "text", { registry });
expect(map[3]).toBe(0); // 'H' at xml offset 3 → editor offset 0
});
});
Format Handler Details
XML / XHTML Handler (buildXmlToSurfaceMap)
Single-pass, O(n). Registered for both "xml" and "xhtml" format names.
Tag skipping: All characters inside <...> (including the < and >) are mapped to the current surface position. No surface characters are emitted for tags.
Entity decoding: Recognises and decodes:
- Five predefined XML entities: &amp; &lt; &gt; &quot; &apos;
- Decimal numeric references: &#169; → ©
- Hex numeric references: &#x1F600; → 😀
Entity length limit: the semicolon must appear within 12 characters of the &. Malformed entities (no semicolon within 12 chars) are treated as literal &.
All source characters within an entity reference (e.g. all five characters &, a, m, p, ; in &amp;) are mapped to the surface position before the decoded character. The decoded character increments the surface position after it is emitted.
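The entity rules above can be sketched as a small scanner that decodes references while filling the offset map. This is an illustration under assumptions, not the library's code: the names are hypothetical, "within 12 characters" is read as `semicolonIndex - ampersandIndex <= 12`, and unrecognised names are emitted literally.

```typescript
const SKETCH_XML_ENTITIES = new Map<string, string>([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["apos", "'"],
]);

// Returns the decoded text, or null when the reference is not recognised.
function sketchDecodeEntity(name: string): string | null {
  let codePoint = NaN;
  if (/^#x[0-9a-fA-F]+$/.test(name)) codePoint = parseInt(name.slice(2), 16);
  else if (/^#[0-9]+$/.test(name)) codePoint = parseInt(name.slice(1), 10);
  if (!Number.isNaN(codePoint)) {
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : null;
  }
  return SKETCH_XML_ENTITIES.get(name) ?? null;
}

function sketchDecodeText(source: string): { surface: string; map: Int32Array } {
  const MAX_ENTITY_LENGTH = 12; // assumed interpretation of the limit
  const map = new Int32Array(source.length + 1);
  let surface = "";
  let i = 0;
  while (i < source.length) {
    if (source.charAt(i) === "&") {
      const semi = source.indexOf(";", i + 1);
      if (semi !== -1 && semi - i <= MAX_ENTITY_LENGTH) {
        const decoded = sketchDecodeEntity(source.slice(i + 1, semi));
        if (decoded !== null) {
          // Every character of the reference maps to the position
          // before the decoded text is emitted.
          for (let k = i; k <= semi; k++) map[k] = surface.length;
          surface += decoded;
          i = semi + 1;
          continue;
        }
      }
    }
    map[i] = surface.length; // ordinary character, or a literal "&"
    surface += source.charAt(i);
    i++;
  }
  map[source.length] = surface.length; // sentinel entry
  return { surface, map };
}

const sketchEntityResult = sketchDecodeText("a &amp; b");
console.log(sketchEntityResult.surface); // "a & b"
console.log(Array.from(sketchEntityResult.map)); // [0, 1, 2, 2, 2, 2, 2, 3, 4, 5]
```

Offsets count UTF-16 code units, so a reference that decodes to an astral character such as 😀 advances the surface position by two.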
Known limitation: CDATA sections are only partially handled. The < of <![CDATA[ opens a tag-skip scan that finds the next >, so CDATA content is emitted as surface text while <![CDATA[ and ]]> are suppressed. If your XML uses CDATA sections, register a custom handler.
HTML Handler (buildHtmlToSurfaceMap)
Extends the XML handler with HTML-specific behaviour.
Script and style suppression: Entire <script> and <style> elements (opening tag, content, and closing tag) are suppressed. All source characters in the suppressed range are mapped to the surface position at the start of the element. Tag names are matched case-insensitively.
HTML5 named entities: 50+ entities including &nbsp;, &mdash;, &ldquo;, &euro;, Greek letters, and others. For the full ~2000-entry HTML5 spec, register a custom handler using the entities npm package; the diff alignment stage compensates for rare misses in this table.
Entity length limit: 20 characters (HTML named entities are longer than XML ones).
Void elements (<br>, <img>, <hr>, etc.): No special handling needed — the tag-skip branch already covers them correctly since void elements have no closing tag and > terminates the tag.
Tags with non-alpha-starting names (like <123>): treated as regular tag skips.
Markdown Handler (buildMarkdownToSurfaceMap)
CommonMark subset, single-pass. Syntax markers are stripped; visible text is kept.
| Construct | Surface output | Notes |
|---|---|---|
| ATX heading # Heading | heading text | Strips # prefix and optional trailing ### |
| Bold **text** / __text__ | inner text | Strips outer marker pairs |
| Italic *text* / _text_ | inner text | Strips outer marker pairs |
| Inline code `code` | code text | Strips backtick markers |
| Fenced code block | code content | Fence lines (``` or ~~~) suppressed |
| Link [text](url) | text | URL chars mapped to end of text surface pos |
| Image ![alt](url) | alt | URL chars suppressed |
| Reference link [text][ref] | text | Ref chars suppressed |
| Block quote > | inner text | Strips leading > per line |
| Horizontal rule --- / *** / ___ | (suppressed) | Must be on its own line |
| Setext underlines === / --- | (suppressed) | Lines immediately after heading text |
| Escaped char \* | literal char | Strips backslash, emits the escaped char |
| Inline HTML <...> | (tag skipped) | Same as XML tag-skip branch |
This is a best-effort single-pass scanner, not a full CommonMark parser. For deeply nested or ambiguous constructs the Stage-2 diff alignment compensates for minor drift. If 100% fidelity on complex Markdown is required, register a custom handler built on micromark or a similar AST parser.
Plaintext Handler (buildPlaintextSurfaceMap)
Identity map. The surface IS the source: map[i] = i for all i. Used as the implicit target-side handler when targetFormat is "text" (the fast path).
SurfaceExtractionResult Type
interface SurfaceExtractionResult {
readonly surface: string; // plain text with all markup stripped
readonly map: Int32Array; // map[sourceOffset] = surfaceOffset
// length = source.length + 1
}
The map is always non-decreasing: map[i] <= map[i+1] for all valid i. This is a fundamental invariant: tag characters collapse forward to the next content character's surface position, but never move backwards.
The extra sentinel entry map[source.length] = surface.length covers end-of-range lookups. You can safely use map[end] without a bounds check even when end === source.length.
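Both invariants are cheap to verify mechanically. The helper below is a hypothetical sketch (`sketchCheckInvariants` is not part of the package) that reports every violation it finds instead of stopping at the first.

```typescript
// Returns a list of human-readable problems; an empty list means the map is valid.
function sketchCheckInvariants(
  sourceLength: number,
  surfaceLength: number,
  map: Int32Array,
): string[] {
  const problems: string[] = [];
  if (map.length !== sourceLength + 1) {
    problems.push("expected length " + (sourceLength + 1) + ", got " + map.length);
  }
  for (let i = 0; i + 1 < map.length; i++) {
    if (map[i] > map[i + 1]) {
      problems.push("map decreases between offsets " + i + " and " + (i + 1));
    }
  }
  if (map[sourceLength] !== surfaceLength) {
    problems.push("sentinel does not equal the surface length");
  }
  return problems;
}

// Map for "<p>Hello</p>" with surface "Hello".
const sketchGoodMap = Int32Array.from([0, 0, 0, 0, 1, 2, 3, 4, 5, 5, 5, 5, 5]);
console.log(sketchCheckInvariants(12, 5, sketchGoodMap)); // []

// Same map with one entry moved backwards.
const sketchBadMap = Int32Array.from([0, 0, 0, 0, 1, 2, 1, 4, 5, 5, 5, 5, 5]);
console.log(sketchCheckInvariants(12, 5, sketchBadMap).length); // 1
```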
Architecture Decisions
This section captures the library's design constraints and trade-offs. Read it before modifying any of the core pipeline stages.
Why remapRange excludes trailing target insertions
remapRange(map, start, end) does not use map[end] directly — it uses map[end - 1] + 1, falling back to map[end] only when the end - 1 derivation would exceed the sentinel.
The reason: map[end] for end === source.length returns target.length (the documented sentinel), which includes any target-side insertions that happened immediately after the source's last equal character. For range highlighting or selection this is the wrong answer — you get a range that spills over into user-inserted content.
Concretely, for source "Iris" aligned to target "Iris foo":
- map[4] (sentinel) = 8 (points past the inserted " foo").
- remapRange(map, 0, 4) returns { start: 0, end: 4 } (just "Iris"), not { start: 0, end: 8 } (the whole string).
The same issue appears in XML/HTML mapping: when source markup places tag characters after the last content character (e.g. <b>Iris</b> where </b> comes after Iris), those tag characters map to the source surface's sentinel position, and looking them up in the alignment map returns the trailing-insert-inclusive sentinel. remapRange corrects this so callers get well-behaved half-open semantics for free.
Insertions that happen between equal runs (mid-range) are still included — only insertions past the last equal character are trimmed. If you genuinely need the whole target (including trailing inserts), read map[source.length] directly rather than going through remapRange.
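The end-offset rule can be sketched as follows. This is a simplification, not the library's implementation: the hypothetical `sketchRemapRange` takes the smaller of `map[end - 1] + 1` and `map[end]`, which never exceeds the sentinel and also collapses ranges whose last character was deleted. The real function may treat other edge cases differently.

```typescript
function sketchRemapRange(
  map: Int32Array,
  start: number,
  end: number,
): { start: number; end: number } | null {
  // Reject empty or out-of-bounds input ranges.
  if (start < 0 || start >= end || end > map.length - 1) return null;
  const newStart = map[start];
  // Derive the end from the last character inside the range so that
  // insertions after that character are not swallowed.
  const newEnd = Math.min(map[end - 1] + 1, map[end]);
  return newEnd > newStart ? { start: newStart, end: newEnd } : null;
}

// "Iris" → "Iris foo": the trailing insertion is excluded.
console.log(sketchRemapRange(Int32Array.from([0, 1, 2, 3, 8]), 0, 4)); // { start: 0, end: 4 }

// "ab" → "aXb": an insertion between equal characters is included.
console.log(sketchRemapRange(Int32Array.from([0, 2, 3]), 0, 2)); // { start: 0, end: 3 }

// "abc" → "ac": the range covering only the deleted "b" collapses.
console.log(sketchRemapRange(Int32Array.from([0, 1, 1, 2]), 1, 2)); // null
```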
Why diff (not diff-match-patch) for Stage 2?
The diff package performs a pure character-level LCS (Longest Common Subsequence) diff. It is deterministic and produces predictable alignments.
diff-match-patch has a "fuzzy matching" phase designed for applying patches to slightly-changed text, which can produce surprising non-deterministic alignments when the source and target differ by more than whitespace. For an offset-mapping primitive that needs to behave the same way every time, that fuzziness is a liability.
Why Last-Write-Wins in buildInverseMap?
This is the most subtle design decision in the library and affects the cross-format path.
The problem: The XML/HTML handlers map multiple source characters to the same surface position. Specifically, all characters within a tag (<, d, i, v, > in <div>) map to the current surface position, and the content character immediately following the tag maps to that same surface position. The surface position advances only after the content character is recorded.
When building the inverse (surface position → source position), there is a tie: which source offset should surface position N map back to — the opening < of the preceding tag, or the content character that follows it?
Last-write-wins resolves the tie by overwriting on each forward pass through the source map. Since the content character is processed last (after all the tag characters), it wins. The inverse for surface position N returns the content character's source offset.
This is what callers need: when translating "surface position 5 → target source offset", you want to land on the content character in the target, not on the < of some preceding tag.
The entity trade-off: Entity characters (all of &, a, m, p, ; in &amp;) also all map to the same surface position. With last-write-wins, the inverse returns the ; (last entity char) instead of the & (first). This means cross-format mapping for source positions inside an entity reference may land on the ; rather than the & in the target.
Content characters surrounding entities (spaces, regular letters) always align correctly. For the primary use case — highlighting issue ranges that span real content — this entity-internal misalignment is acceptable.
The alternative (first-write-wins) would correctly return & for entities but would incorrectly return < for tag+content groups, which is the more common and more visually destructive failure.
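Last-write-wins falls out of a plain forward loop, as the sketch below shows. `sketchInverseMap` is a hypothetical stand-in for buildInverseMap; it assumes the sentinel entry is included in the pass and leaves any surface position that is never written at 0.

```typescript
// Forward pass over a source→surface map; later source offsets overwrite earlier ones.
function sketchInverseMap(map: Int32Array, surfaceLength: number): Int32Array {
  const inverse = new Int32Array(surfaceLength + 1);
  for (let src = 0; src < map.length; src++) {
    inverse[map[src]] = src;
  }
  return inverse;
}

// "<p>Hello</p>": surface position 0 resolves to the "H" (offset 3), not the "<".
const sketchTagInverse = sketchInverseMap(
  Int32Array.from([0, 0, 0, 0, 1, 2, 3, 4, 5, 5, 5, 5, 5]),
  5,
);
console.log(Array.from(sketchTagInverse)); // [3, 4, 5, 6, 7, 12]

// "a&amp;b": surface position 1 resolves to the ";" (offset 5), the entity trade-off.
const sketchEntityInverse = sketchInverseMap(Int32Array.from([0, 1, 1, 1, 1, 1, 2, 3]), 3);
console.log(Array.from(sketchEntityInverse)); // [0, 5, 6, 7]
```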
Why Single-Pass Handlers (Not AST Parsers)?
Three reasons:
- Zero extra dependencies: the library has a single runtime dependency (diff). An AST parser like unified/rehype for HTML or micromark for Markdown would add hundreds of kilobytes to the bundle.
- O(n) time: single-pass handlers are linear in the source length. AST parsers typically also achieve O(n) but with a larger constant.
- Stage 2 compensates for drift: the diff alignment stage handles edge cases where the handler's surface text differs slightly from the target text (e.g. minor Markdown ambiguity). A 100% faithful AST-based surface is only necessary for cross-format mapping where Stage 2 maps surface-to-surface and drift in Stage 1 propagates through. For the primary use case (format → plain text), Stage 2 absorbs the drift.
For genuinely complex Markdown or exotic HTML structures, register a custom handler:
import { buildCombinedMap, defaultRegistry } from "@markupai/format-offset-mapper";
import { fromMarkdown } from "mdast-util-from-markdown";
function buildAstMarkdownHandler(md: string) {
// AST-based surface extraction...
}
defaultRegistry.register("markdown", buildAstMarkdownHandler);
Map Length Convention: source.length + 1
Every map in the pipeline has length source.length + 1. The extra entry at index source.length is the sentinel, recording the surface length. This allows callers to use map[end] for any end in [0, source.length] without a bounds check.
This convention applies to SurfaceExtractionResult.map, the return value of buildAlignmentMap, and the return value of mapOffsets.
Int32Array for All Maps
Int32Array was chosen over number[] for three reasons:
- Fixed 4 bytes per entry (predictable memory for large documents)
- Typed arrays are significantly faster to fill in tight loops in V8
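- The values are always non-negative integers; Int32Array communicates the integer constraint in the type (it is a signed type, so non-negativity remains a pipeline invariant rather than a type guarantee)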
All offset values are non-negative and well under 2^31 - 1 (the Int32Array max), so there is no overflow risk for any realistic document size.
DOM-Free by Design
The library does not use document, DOMParser, window, or any other browser global. tsconfig.json uses lib: ["ES2022"] which excludes DOM types. This keeps the package usable from Node.js, browsers, Web Workers, and edge runtimes alike — and keeps the format handlers fast and predictable, since they don't depend on a host parser.
Testing and Coverage
100% statement / branch / function / line coverage is enforced via Vitest + v8.
Coverage thresholds are set in vitest.config.ts and will cause npm run test:coverage to fail if any threshold drops below 100%.
Test Structure
tests/
core/
alignment-map.test.ts # buildAlignmentMap + remapRange
align-ranges.test.ts # alignRanges
compose.test.ts # mapOffsets + buildCombinedMap
inverse-map.test.ts # buildInverseMap
registry.test.ts # FormatHandlerRegistry
handlers/
xml.test.ts # buildXmlToSurfaceMap unit tests
html.test.ts # buildHtmlToSurfaceMap unit tests
markdown.test.ts # buildMarkdownToSurfaceMap unit tests
plaintext.test.ts # buildPlaintextSurfaceMap unit tests
integration/
xml-to-text.test.ts # Full pipeline, DITA fixtures
html-to-text.test.ts # Full pipeline, HTML fixtures
xml-to-html.test.ts # Cross-format pipeline with inverse map
markdown-to-html.test.ts
fixtures/
dita-samples.ts # Real-world DITA XML fixtures with expected offsets
html-samples.ts
markdown-samples.ts
Commands
npm test # run tests in watch mode
npm run test:run # run tests once (CI mode)
npm run test:coverage # run with coverage report (fails if < 100%)
npm run build # compile TypeScript to dist/
npm run typecheck # type-check only (no emit)
npm run lint # ESLint + tsc --noEmit
npm run lint:fix # ESLint with --fix
Writing Tests for a New Format Handler
Use a minimal registry to keep tests hermetic:
import { FormatHandlerRegistry, buildPlaintextSurfaceMap, mapOffsets } from "@markupai/format-offset-mapper";
import { myHandler } from "./my-handler.js";
const registry = new FormatHandlerRegistry()
.register("myformat", myHandler)
.register("text", buildPlaintextSurfaceMap);
it("maps content correctly", () => {
const map = mapOffsets("myformat", source, plain, "text", { registry });
expect(map[contentOffset]).toBe(expectedPlainOffset);
});
it("map is non-decreasing", () => {
const { map } = myHandler(source);
for (let i = 0; i < map.length - 1; i++) {
expect(map[i]).toBeLessThanOrEqual(map[i + 1]);
}
});
it("sentinel: map[source.length] === surface.length", () => {
const { surface, map } = myHandler(source);
expect(map[source.length]).toBe(surface.length);
});
The non-decreasing and sentinel invariants are the two most important correctness properties for any format handler.
Package Info and Repository Structure
@markupai/format-offset-mapper
Node.js >= 24
ESM only (type: "module")
Single runtime dependency: diff ^9.0.0
Repository Layout
format-offset-mapper/
src/
index.ts # Public API exports + registry population
types.ts # All shared TypeScript types and interfaces
core/
compose.ts # mapOffsets + buildCombinedMap (pipeline orchestration)
alignment-map.ts # buildAlignmentMap + remapRange (Stage 2)
inverse-map.ts # buildInverseMap (Stage 3, cross-format path)
align-ranges.ts # alignRanges (high-level batch primitive)
handlers/
xml.ts # buildXmlToSurfaceMap (also used for xhtml)
html.ts # buildHtmlToSurfaceMap
markdown.ts # buildMarkdownToSurfaceMap
plaintext.ts # buildPlaintextSurfaceMap (identity)
utils/
entities.ts # decodeXmlEntity + decodeHtmlEntity
normalize.ts # normalizeForComparison
registry.ts # FormatHandlerRegistry class + defaultRegistry singleton
tests/
core/ # Unit tests for pipeline stages
handlers/ # Unit tests for each format handler
integration/ # Full-pipeline tests using real-world fixtures
fixtures/ # DITA, HTML, Markdown sample data
dist/ # Compiled output (generated by npm run build)
package.json
tsconfig.json
tsconfig.build.json # Build-only tsconfig (excludes tests/)
vitest.config.ts
eslint.config.js
TypeScript Configuration
- Target: ES2022
- Module system: NodeNext (strict ESM with .js extensions in imports)
- Strict mode: enabled with noUncheckedIndexedAccess, exactOptionalPropertyTypes, noImplicitReturns
noUncheckedIndexedAccess means all array and typed-array reads produce T | undefined. The ! non-null assertion is used only where the invariants of the pipeline guarantee the index is in bounds (e.g. sentinel entries, Math.min-clamped indices).
Publishing
prepublishOnly runs build and test:run automatically. The files field in package.json publishes only dist/, README.md, and LICENSE.
Releases are published to npm via the Publish to npm GitHub Actions workflow, which is triggered by publishing a GitHub release. The workflow uses npm provenance so consumers can verify each published artifact was built from this repository.
License
Apache License 2.0 — see LICENSE for the full text.
Copyright © Markup AI.
