@markupai/format-offset-mapper

v1.1.6

Published

10 days ago

Map character offsets between different text formats (XML, HTML, Markdown, plain text) using a three-stage surface-alignment pipeline.

A small, dependency-light TypeScript library that maps character offsets between different text format representations (XML, XHTML, HTML, Markdown, and plain text) using a deterministic three-stage surface-alignment pipeline.

import { mapOffsets } from "@markupai/format-offset-mapper";

// DITA XML → editor plain text (most common use case)
const map = mapOffsets("xml", ditaXml, editorPlainText);
const editorStart = map[xmlIssueStart];
const editorEnd   = map[xmlIssueEnd];

Overview and Motivation

The Problem

Content is often authored in a rich original format — DITA XML, XHTML, custom HTML, or Markdown — but rendered to the user as a different on-screen representation (typically plain text in native desktop editors, or rendered HTML/DOM in web editors).

Writing analysis / linting / grammar APIs receive content in the original rich format and return issue offsets measured in that same original format's coordinate system. To highlight an issue visually in the editor, those original-format offsets must be translated into the on-screen representation's coordinate system.

API receives:  <topic><title>Hello World</title></topic>   (DITA XML)
API returns:   issue at offsets [20, 25]                   (XML offsets for "World")

Editor shows:  Hello World                                  (plain text on screen)
Need to find:  offsets [6, 11]                             (plain text offsets for "World")

Without a correct offset map, highlight decorations land on the wrong characters, fall inside markup tags, or crash when offsets are out of range.

The mapping direction is always:

original-format offset  →  on-screen offset
     (API returns)              (editor needs)

For example: mapOffsets("xml", ditaXml, editorPlainText) maps DITA XML offsets (what the API returns) to plain text offsets (what the editor displays).

Design Goals

Deterministic. Pure character-level diff; no heuristic / fuzzy fallbacks.
Linear time. Single-pass O(n) format handlers; no AST parsers.
Zero browser globals. Runs in Node.js, browsers, Web Workers, and edge runtimes.
Tiny. A single runtime dependency (diff).
Extensible. Custom formats plug in via a registry.
Battle-tested. 100% statement / branch / function / line coverage.

Installation

npm install @markupai/format-offset-mapper

The package is ESM-only and ships TypeScript types out of the box.

import { mapOffsets, alignRanges } from "@markupai/format-offset-mapper";

Three-Stage Pipeline

Source content (format A)        Target content (format B)
        │                                │
        ▼  Stage 1: format handler       ▼  Stage 1: format handler
   surface_src                      surface_tgt
   map_src[srcOff → surfSrcOff]     map_tgt[tgtOff → surfTgtOff]
        │                                │
        └────────────────────────────────┘
                    ▼  Stage 2: diff alignment (diffChars)
             align[surfSrcOff → surfTgtOff]
                    │
                    ▼  Stage 3: compose + optional inverse
         result[srcOff] = tgtOff

Stage 1 — Format Handler

Each format handler takes the raw content string and produces:

A surface string: the plain text with all markup stripped and entities decoded.
A source-to-surface map: an Int32Array where map[sourceOffset] = the corresponding offset in the surface string.
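The two outputs above can be illustrated with a deliberately small handler. This is a sketch only: `stripTagsSketch` is a hypothetical name, it strips `<...>` tags and nothing else (no entity decoding, comments, or CDATA), and it is not the library's XML handler.

```typescript
// Hypothetical stand-in for a Stage-1 handler: skips <...> tags only.
interface SurfaceSketch {
  surface: string;
  map: Int32Array; // map[sourceOffset] = surfaceOffset, length = source.length + 1
}

function stripTagsSketch(source: string): SurfaceSketch {
  const map = new Int32Array(source.length + 1);
  let surface = "";
  let inTag = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source.charAt(i);
    // Every source character records the surface length at the moment it is seen.
    map[i] = surface.length;
    if (inTag) {
      if (ch === ">") inTag = false; // tag ends; nothing is emitted
    } else if (ch === "<") {
      inTag = true; // tag starts; nothing is emitted
    } else {
      surface += ch; // content character: emit it, which advances the surface
    }
  }
  map[source.length] = surface.length; // sentinel entry
  return { surface, map };
}

const stage1Demo = stripTagsSketch("<p>Hello <b>World</b></p>");
console.log(stage1Demo.surface);                      // "Hello World"
console.log(stage1Demo.map[12], stage1Demo.map[17]);  // 6 11
```

In the demo, source offsets 12 and 17 bracket "World" in the markup, and the map sends them to surface offsets 6 and 11.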

Stage 2 — Diff Alignment

The diff library's diffChars function computes a character-level LCS diff between the source surface and the target surface. From the diff output, an alignment map is built: alignMap[sourceSurfaceOffset] = the corresponding target surface offset.

This stage is what makes the pipeline robust to whitespace normalisation, entity encoding differences, and minor Markdown parsing drift — even when the source and target surfaces are not identical character-for-character.
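To make the shape of the alignment map concrete, here is a self-contained sketch that builds one from a textbook dynamic-programming LCS table. It is a swapped-in technique, not the `diff` package's algorithm: it uses O(n·m) memory, and for ambiguous inputs it may pick a different (equally long) common subsequence than `diffChars` would. The name `alignByLcsSketch` is hypothetical.

```typescript
function alignByLcsSketch(oldText: string, newText: string): Int32Array {
  const n = oldText.length;
  const m = newText.length;
  const w = m + 1;
  // lcs[i * w + j] = LCS length of oldText.slice(i) and newText.slice(j)
  const lcs = new Int32Array((n + 1) * w);
  for (let i = n - 1; i >= 0; i--) {
    for (let j = m - 1; j >= 0; j--) {
      lcs[i * w + j] =
        oldText.charCodeAt(i) === newText.charCodeAt(j)
          ? lcs[(i + 1) * w + j + 1]! + 1
          : Math.max(lcs[(i + 1) * w + j]!, lcs[i * w + j + 1]!);
    }
  }
  const map = new Int32Array(n + 1);
  let i = 0;
  let j = 0;
  while (i < n) {
    if (j < m && oldText.charCodeAt(i) === newText.charCodeAt(j)) {
      map[i++] = j++; // equal run: both sides advance
    } else if (j < m && lcs[i * w + j + 1]! >= lcs[(i + 1) * w + j]!) {
      j++; // insertion on the new side: skip over it
    } else {
      map[i++] = j; // deletion: collapse onto the current new-side position
    }
  }
  map[n] = m; // sentinel entry
  return map;
}

// A doubled space on the source side collapses onto the next target position.
console.log(Array.from(alignByLcsSketch("a  b", "a b"))); // [0, 1, 2, 2, 3]
```

The deleted second space maps to offset 2, the same target position as the `b` that follows it, which is the "collapse to the position where the deletion occurred" behaviour the pipeline relies on.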

Stage 3 — Compose

Fast path (targetFormat = "text", the default): The target IS the surface — no inversion needed. Stage 3 simply composes srcMap → alignMap:

result[srcOff] = alignMap[srcMap[srcOff]]

This is the most common case (XML/HTML/Markdown → editor plain text) and is cheaper than the full path.

Cross-format path (targetFormat = "html", "xml", etc.): Stage 3 composes srcMap → alignMap → inverse(tgtMap). buildInverseMap converts the target's Stage-1 map (target offset → surface offset) into a surface→target inverse using last-write-wins semantics (see Architecture Decisions for why).
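The fast-path composition is a single lookup chain per source offset. The sketch below works on hand-built maps so it stays self-contained; `composeSketch` is a hypothetical helper, not an export of the package.

```typescript
// result[srcOff] = alignMap[srcMap[srcOff]]
function composeSketch(srcMap: Int32Array, alignMap: Int32Array): Int32Array {
  const result = new Int32Array(srcMap.length);
  for (let i = 0; i < srcMap.length; i++) {
    result[i] = alignMap[srcMap[i]!]!;
  }
  return result;
}

// Source "<b>Hi</b>" has surface "Hi"; the editor text is " Hi" (one leading space).
const srcMapDemo = new Int32Array([0, 0, 0, 0, 1, 2, 2, 2, 2, 2]); // source → surface
const alignDemo = new Int32Array([1, 2, 3]);                       // "Hi" → " Hi"
console.log(Array.from(composeSketch(srcMapDemo, alignDemo)));
// [1, 1, 1, 1, 2, 3, 3, 3, 3, 3]
```

For the cross-format path, the same loop would apply one more lookup through the inverse of the target's map.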

Public API

All exports are available from the top-level package import:

import {
  mapOffsets,
  buildCombinedMap,
  buildAlignmentMap,
  remapRange,
  buildInverseMap,
  alignRanges,
  normalizeForComparison,
  buildXmlToSurfaceMap,
  buildHtmlToSurfaceMap,
  buildMarkdownToSurfaceMap,
  buildPlaintextSurfaceMap,
  FormatHandlerRegistry,
  defaultRegistry,
} from "@markupai/format-offset-mapper";

Primary Functions

`mapOffsets`

function mapOffsets(
  sourceFormat: FormatName,   // "xml" | "xhtml" | "html" | "markdown" | "text" | string
  sourceContent: string,
  targetContent: string,
  targetFormat?: FormatName,  // default: "text" (fast path)
  options?: MapOffsetsOptions,
): Int32Array                 // result[sourceOffset] = targetOffset

The main entry point. Returns an Int32Array of length sourceContent.length + 1 where result[sourceOffset] = the corresponding offset in targetContent.

The sentinel entry is result[sourceContent.length] = targetContent.length, so end-of-range lookups (map[end] where end === sourceContent.length) are always safe.

`buildCombinedMap`

function buildCombinedMap(
  sourceFormat: FormatName,
  sourceContent: string,
  targetContent: string,
  targetFormat?: FormatName,
  options?: MapOffsetsOptions,
): CombinedMapResult

Like mapOffsets but returns all intermediate surfaces and maps for each pipeline stage. Useful for debugging and parity tests.

interface CombinedMapResult {
  readonly sourceSurface: string;       // plain text from source
  readonly targetSurface: string;       // plain text from target
  readonly sourceToSurface: Int32Array; // Stage-1 source→surface map
  readonly targetToSurface: Int32Array; // Stage-1 target→surface map
  readonly combined: Int32Array;        // same as mapOffsets() result
}

Stage-Level Functions

`buildAlignmentMap`

function buildAlignmentMap(oldText: string, newText: string): Int32Array

Stage 2 only. Builds a character-level diff-based alignment map from oldText to newText. map[i] is the offset in newText corresponding to offset i in oldText. Offsets inside deleted runs collapse to the single newText position where the deletion occurred.

`remapRange`

function remapRange(
  map: Int32Array,
  start: number,
  end: number,
): { start: number; end: number } | null

Remaps a [start, end) half-open range through any alignment map. Returns null when the range collapses entirely (the entire range was deleted). Use this to translate [issueStart, issueEnd) pairs.

`buildInverseMap`

function buildInverseMap(map: Int32Array, surfaceLength: number): Int32Array

Stage 3 inverse only. Given a source→surface map, returns a surface→source inverse. Used internally for cross-format mapping when the target has non-trivial markup.

`alignRanges`

function alignRanges(
  sourceFormat: FormatName,
  sourceContent: string,
  targetContent: string,
  inputs: readonly AlignInput[],
  options?: AlignRangesOptions,
): AlignResult[] | null

High-level batch primitive. Takes an array of source-format ranges and returns target-format ranges, optionally validating that each range's target content still matches the source (drift detection).

It wraps mapOffsets + per-range remapRange + optional normalizeForComparison comparison in a single call. It is shape-neutral by design — inputs and results are plain { start, end } pairs, not typed Match objects. Wrap the results in your own issue-payload shape as needed.

interface AlignInput {
  readonly start: number;
  readonly end: number;
}

interface AlignResult {
  readonly range: { start: number; end: number } | null; // null → range collapsed
  readonly drifted: boolean;                              // true → target text changed (validate mode only)
}

interface AlignRangesOptions {
  readonly targetFormat?: FormatName;          // default "text"
  readonly registry?: FormatHandlerRegistry;
  readonly validate?: boolean;                  // compare target substring vs source surface (normalised)
  readonly abortOnDrift?: boolean;              // return null from the whole call on any drift
}

Drift semantics: when validate: true, the target substring at each aligned range is compared to the source surface substring at the input range, after both are passed through normalizeForComparison. Strings that differ only in whitespace runs or zero-width characters are treated as equal.

With abortOnDrift: true, the entire call returns null the moment any range drifts. Use this when your policy is "any drift = abandon the batch". Without it, callers can inspect per-range drifted flags and decide what to do individually (partial rendering, etc.).
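The drift check described above can be sketched as a comparison of two normalised substrings per range. Everything here is an assumption made for illustration: the names are hypothetical, the already-remapped target ranges are passed in rather than computed, and a collapsed range is reported with `drifted: false`, which the text does not specify.

```typescript
interface RangeSketch { start: number; end: number }
interface AlignResultSketch { range: RangeSketch | null; drifted: boolean }

// Same normalisation rules the text gives for normalizeForComparison.
function normalizeForDriftSketch(text: string): string {
  return text.replace(/[\u200B-\u200D\uFEFF]/g, "").replace(/\s+/g, " ").trim();
}

function validateRangesSketch(
  sourceSurface: string,
  sourceToSurface: Int32Array,
  target: string,
  inputs: readonly RangeSketch[],
  targetRanges: readonly (RangeSketch | null)[],
  abortOnDrift = false,
): AlignResultSketch[] | null {
  const results: AlignResultSketch[] = [];
  for (let k = 0; k < inputs.length; k++) {
    const input = inputs[k]!;
    const range = targetRanges[k] ?? null;
    let drifted = false;
    if (range !== null) {
      // Source side: the surface text covered by the input range.
      const expected = sourceSurface.slice(
        sourceToSurface[input.start]!,
        sourceToSurface[input.end]!,
      );
      // Target side: the text at the aligned range.
      const actual = target.slice(range.start, range.end);
      drifted = normalizeForDriftSketch(expected) !== normalizeForDriftSketch(actual);
    }
    if (drifted && abortOnDrift) return null; // strict policy: abandon the batch
    results.push({ range, drifted });
  }
  return results;
}

// Source "<p>Hello World</p>" (surface "Hello World"); the editor now says "Wurld".
const driftMap = new Int32Array([0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 11, 11, 11]);
const driftInputs = [{ start: 3, end: 8 }, { start: 9, end: 14 }];
const driftTargets = [{ start: 0, end: 5 }, { start: 6, end: 11 }];
const driftDemo = validateRangesSketch("Hello World", driftMap, "Hello Wurld", driftInputs, driftTargets);
console.log(driftDemo?.map((r) => r.drifted)); // [false, true]
```

Passing `true` as the last argument turns the same input into a `null` result, mirroring the abort-all policy.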

`normalizeForComparison`

function normalizeForComparison(text: string): string

Normalises a string for drift-detection comparisons by removing zero-width characters (U+200B–U+200D, U+FEFF), collapsing whitespace runs to a single space, and trimming. Used internally by alignRanges when validate: true; exported because every consumer doing its own drift check wants exactly this behaviour.

normalizeForComparison("hello   world\n") === "hello world"; // true
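A minimal sketch of the three normalisation steps, in the order the description lists them. The function name `normalizeSketch` is hypothetical and the library's actual implementation may differ in detail.

```typescript
function normalizeSketch(text: string): string {
  return text
    .replace(/[\u200B-\u200D\uFEFF]/g, "") // drop zero-width characters
    .replace(/\s+/g, " ")                  // collapse each whitespace run to one space
    .trim();                               // strip leading and trailing whitespace
}

console.log(normalizeSketch("hello   world\n") === "hello world"); // true
console.log(normalizeSketch("a \u200B b") === "a b");              // true
```

Order matters in this sketch: removing zero-width characters first lets the spaces on either side of one merge into a single run before collapsing.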

Format Handlers (Standalone)

All format handlers can be used independently of the pipeline:

function buildXmlToSurfaceMap(xml: string): SurfaceExtractionResult
function buildHtmlToSurfaceMap(html: string): SurfaceExtractionResult
function buildMarkdownToSurfaceMap(md: string): SurfaceExtractionResult
function buildPlaintextSurfaceMap(content: string): SurfaceExtractionResult

Registry

class FormatHandlerRegistry {
  register(format: string, handler: FormatHandler): this  // chainable
  get(format: string): FormatHandler  // throws for unknown format
  has(format: string): boolean
  clone(): FormatHandlerRegistry  // for test isolation
}

const defaultRegistry: FormatHandlerRegistry  // pre-populated with all built-ins

The default registry has "xml", "xhtml" (identical handler to "xml"), "html", "markdown", and "text" pre-registered.
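The registry surface above is small enough to sketch in full. The class below is a hypothetical re-implementation that mirrors the four documented methods; it is meant to show why `clone()` gives isolation, not to reproduce the library's source.

```typescript
type HandlerSketch = (content: string) => { surface: string; map: Int32Array };

class RegistrySketch {
  private readonly handlers = new Map<string, HandlerSketch>();

  register(format: string, handler: HandlerSketch): this {
    this.handlers.set(format, handler);
    return this; // chainable
  }

  get(format: string): HandlerSketch {
    const handler = this.handlers.get(format);
    if (handler === undefined) throw new Error("Unknown format: " + format);
    return handler;
  }

  has(format: string): boolean {
    return this.handlers.has(format);
  }

  clone(): RegistrySketch {
    const copy = new RegistrySketch();
    this.handlers.forEach((handler, format) => copy.register(format, handler));
    return copy; // independent table: later registrations do not affect the original
  }
}

// Identity handler used only for the demo.
const identityDemo: HandlerSketch = (content) => {
  const map = new Int32Array(content.length + 1);
  for (let i = 0; i <= content.length; i++) map[i] = i;
  return { surface: content, map };
};

const baseRegistry = new RegistrySketch().register("text", identityDemo);
const scopedRegistry = baseRegistry.clone().register("docbook", identityDemo);
console.log(baseRegistry.has("docbook"), scopedRegistry.has("docbook")); // false true
```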

Types

interface SurfaceExtractionResult {
  readonly surface: string;   // plain text with markup stripped
  readonly map: Int32Array;   // map[sourceOffset] = surfaceOffset
}

type FormatHandler = (content: string) => SurfaceExtractionResult;

type BuiltInFormat = "xml" | "xhtml" | "html" | "markdown" | "text";

// Accepts built-in names or any custom string key registered in a registry
type FormatName = BuiltInFormat | (string & Record<never, never>);

interface MapOffsetsOptions {
  registry?: FormatHandlerRegistry;  // override the global defaultRegistry
}

// See `alignRanges` above for AlignInput / AlignResult / AlignRangesOptions.

Usage Examples

XML / DITA → Plain Text

import { mapOffsets } from "@markupai/format-offset-mapper";

const ditaXml = `
  <topic>
    <title>Hello</title>
    <body>
      <p>World &amp; more</p>
    </body>
  </topic>
`;
const editorPlainText = "Hello\nWorld & more";

const map = mapOffsets("xml", ditaXml, editorPlainText);

// Translate an API issue range [xmlStart, xmlEnd) to editor offsets
const editorStart = map[xmlIssueStart];
const editorEnd   = map[xmlIssueEnd];

The "xhtml" format name uses the identical XML handler, so both work:

const map = mapOffsets("xhtml", xhtmlContent, editorPlainText);

HTML → Plain Text

import { mapOffsets } from "@markupai/format-offset-mapper";

const htmlContent = "<h1>Title</h1><p>Hello <strong>World</strong></p>";
const editorText  = "TitleHello World";

const map = mapOffsets("html", htmlContent, editorText);

const editorPos = map[htmlApiOffset];

Markdown → Plain Text

import { mapOffsets } from "@markupai/format-offset-mapper";

const md = "# Heading\n\nHello **bold** world";
const plain = "Heading\n\nHello bold world";

const map = mapOffsets("markdown", md, plain);

XML → HTML (Cross-Format)

When both source and target have markup, pass the target format explicitly:

import { mapOffsets } from "@markupai/format-offset-mapper";

const xml  = "<topic><title>Hello</title><body><p>World</p></body></topic>";
const html = "<article><h1>Hello</h1><section><p>World</p></section></article>";

// Runs the full three-stage pipeline, including the inverse-map step
const map = mapOffsets("xml", xml, html, "html");

const htmlOffset = map[xmlOffset];

Using `remapRange` for Issue Ranges

API issues typically have [start, end) ranges. Use remapRange to translate them in one call:

import { mapOffsets, remapRange } from "@markupai/format-offset-mapper";

const offsetMap = mapOffsets("xml", ditaXml, editorText);

for (const issue of apiIssues) {
  const range = remapRange(offsetMap, issue.start, issue.end);
  if (range === null) continue; // issue range was deleted entirely
  highlightRange(range.start, range.end);
}

Using `alignRanges` for Batch Issue Alignment + Drift Detection

When you have a set of API issues to align in one go — and want to skip any whose underlying text has been edited since the check — reach for alignRanges:

import { alignRanges } from "@markupai/format-offset-mapper";

// Each suggestion has a start_index and an original-text length.
const inputs = suggestions.map((s) => ({
  start: s.start_index,
  end: s.start_index + s.original.length,
}));

const results = alignRanges("xml", baselineXml, editorPlainText, inputs, {
  validate: true, // compare target content vs source surface after normalisation
});

suggestions.forEach((suggestion, i) => {
  const r = results![i];
  if (!r.range || r.drifted) return;
  highlightRange(suggestion, r.range.start, r.range.end);
});

For a strict abort-all-on-drift policy:

const results = alignRanges("xml", baselineXml, currentXml, inputs, {
  targetFormat: "xml", // currentXml is markup, so name its format (the default is "text")
  validate: true,
  abortOnDrift: true,
});
if (results === null) return []; // any drift abandoned the whole batch

Debugging with `buildCombinedMap`

When offset mapping produces unexpected results, inspect each stage:

import { buildCombinedMap } from "@markupai/format-offset-mapper";

const result = buildCombinedMap("xml", ditaXml, editorText);

console.log("Source surface:", result.sourceSurface);
// → plain text with XML tags stripped and entities decoded

console.log("Source-to-surface at offset 42:", result.sourceToSurface[42]);
// → position in source surface for that XML offset

console.log("Final result at offset 42:", result.combined[42]);
// → position in editor text (same as mapOffsets result)

Custom Format Handler

import { defaultRegistry, mapOffsets } from "@markupai/format-offset-mapper";
import type { SurfaceExtractionResult } from "@markupai/format-offset-mapper";

function buildDocBookHandler(content: string): SurfaceExtractionResult {
  // Your custom stripping logic here...
  const surface = stripDocBookMarkup(content);
  const map = buildOffsetMap(content, surface);
  return { surface, map };
}

// Use a clone so tests don't bleed into production code
const registry = defaultRegistry.clone().register("docbook", buildDocBookHandler);

const offsetMap = mapOffsets("docbook", docbookContent, editorText, "text", { registry });

Test Isolation with Registry Clone

import { describe, it, expect } from "vitest";
import { FormatHandlerRegistry, buildXmlToSurfaceMap, buildPlaintextSurfaceMap, mapOffsets } from "@markupai/format-offset-mapper";

// Build a minimal registry — no side-effects on defaultRegistry
const registry = new FormatHandlerRegistry()
  .register("xml", buildXmlToSurfaceMap)
  .register("text", buildPlaintextSurfaceMap);

describe("my adapter tests", () => {
  it("maps correctly", () => {
    const map = mapOffsets("xml", "<p>Hello</p>", "Hello", "text", { registry });
    expect(map[3]).toBe(0); // 'H' at xml offset 3 → editor offset 0
  });
});

Format Handler Details

XML / XHTML Handler (`buildXmlToSurfaceMap`)

Single-pass, O(n). Registered for both "xml" and "xhtml" format names.

Tag skipping: All characters inside <...> (including the < and >) are mapped to the current surface position. No surface characters are emitted for tags.

Entity decoding: Recognises and decodes:

Five predefined XML entities: &amp; &lt; &gt; &quot; &apos;
Decimal numeric references: &#169; → ©
Hex numeric references: &#x1F600; → 😀

Entity length limit: the semicolon must appear within 12 characters of the &. Malformed entities (no semicolon within 12 chars) are treated as literal &.

All source characters within an entity reference (e.g. all five characters &, a, m, p, ; in &amp;) are mapped to the surface position before the decoded character. The decoded character increments the surface position after it is emitted.
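The decoding rules above can be sketched as a single lookup function. The name `decodeEntityAtSketch` is hypothetical, and the reading of the length limit (the `;` may sit at most 12 positions after the `&`) is an assumption; the library may count differently.

```typescript
const XML_ENTITIES_SKETCH = new Map<string, string>([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["apos", "'"],
]);

interface DecodedEntitySketch {
  text: string;   // decoded character(s)
  length: number; // number of source characters consumed, including & and ;
}

// Returns null for anything malformed, so the caller can treat "&" as a literal.
function decodeEntityAtSketch(src: string, at: number, maxLen = 12): DecodedEntitySketch | null {
  if (src.charAt(at) !== "&") return null;
  const semi = src.indexOf(";", at + 1);
  if (semi === -1 || semi - at > maxLen) return null; // no ';' close enough
  const body = src.slice(at + 1, semi);
  let text: string | undefined;
  const dec = /^#([0-9]+)$/.exec(body);
  const hex = /^#[xX]([0-9a-fA-F]+)$/.exec(body);
  if (dec || hex) {
    const code = dec ? parseInt(dec[1]!, 10) : parseInt(hex![1]!, 16);
    if (code <= 0x10ffff) text = String.fromCodePoint(code); // guard against RangeError
  } else {
    text = XML_ENTITIES_SKETCH.get(body);
  }
  return text === undefined ? null : { text, length: semi - at + 1 };
}

console.log(decodeEntityAtSketch("a &amp; b", 2)); // { text: "&", length: 5 }
console.log(decodeEntityAtSketch("AT&T", 2));      // null
```

A handler built on this would map all `length` source characters to the current surface position, emit `text`, and only then advance the surface.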

Known limitation: CDATA sections are only partially handled. The < of <![CDATA[ opens a tag-skip scan that finds the next >, so CDATA content is emitted as surface text while <![CDATA[ and ]]> are suppressed. If your XML uses CDATA sections, register a custom handler.

HTML Handler (`buildHtmlToSurfaceMap`)

Extends the XML handler with HTML-specific behaviour.

Script and style suppression: Entire <script> and <style> elements (opening tag, content, and closing tag) are suppressed. All source characters in the suppressed range are mapped to the surface position at the start of the element. Tag names are matched case-insensitively.
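A minimal sketch of that suppression rule, layered on plain tag skipping. `stripHtmlSketch` is a hypothetical name; entity decoding is left out, and skipping to the end of input for an unterminated tag or element is an assumption made to keep the sketch total.

```typescript
function stripHtmlSketch(html: string): { surface: string; map: Int32Array } {
  const map = new Int32Array(html.length + 1);
  let surface = "";
  let i = 0;
  while (i < html.length) {
    if (html.charAt(i) !== "<") {
      map[i] = surface.length;  // content character: record, emit, advance
      surface += html.charAt(i);
      i++;
      continue;
    }
    // Ordinary tag: skip through the next ">".
    let stop = html.indexOf(">", i);
    // <script ...> or <style ...>: skip through the matching closing tag instead.
    const opener = /^<(script|style)\b/i.exec(html.slice(i, i + 8));
    if (opener) {
      const closeRe = new RegExp("</" + opener[1]! + "\\s*>", "i");
      const close = closeRe.exec(html.slice(i));
      stop = close ? i + close.index + close[0].length - 1 : html.length - 1;
    }
    if (stop === -1) stop = html.length - 1; // unterminated tag (assumption)
    // Every suppressed character maps to the surface position at the element start.
    for (; i <= stop; i++) map[i] = surface.length;
  }
  map[html.length] = surface.length; // sentinel entry
  return { surface, map };
}

const htmlDemo = stripHtmlSketch('<p>Hi</p><script>var x = "<b>";</script><p>!</p>');
console.log(htmlDemo.surface); // "Hi!"
```

Note that the `<b>` inside the script body never reaches the tag-skip branch, because the whole element is consumed in one step.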

HTML5 named entities: 50+ entities including &nbsp;, &mdash;, &ldquo;, &euro;, Greek letters, and others. For the full ~2000-entry HTML5 spec, register a custom handler using the entities npm package; the diff alignment stage compensates for rare misses in this table.

Entity length limit: 20 characters (HTML named entities are longer than XML ones).

Void elements (<br>, <img>, <hr>, etc.): No special handling needed — the tag-skip branch already covers them correctly since void elements have no closing tag and > terminates the tag.

Tags with non-alpha-starting names (like <123>): treated as regular tag skips.

Markdown Handler (`buildMarkdownToSurfaceMap`)

CommonMark subset, single-pass. Syntax markers are stripped; visible text is kept.

| Construct | Surface output | Notes |
|---|---|---|
| ATX heading # Heading | heading text | Strips # prefix and optional trailing ### |
| Bold **text** / __text__ | inner text | Strips outer marker pairs |
| Italic *text* / _text_ | inner text | Strips outer marker pairs |
| Inline code `code` | code text | Strips backtick markers |
| Fenced code block | code content | Fence lines (``` or ~~~) suppressed |
| Link [text](url) | text | URL chars mapped to end of text surface pos |
| Image ![alt](url) | alt | URL chars suppressed |
| Reference link [text][ref] | text | Ref chars suppressed |
| Block quote > | inner text | Strips leading > per line |
| Horizontal rule --- / *** / ___ | (suppressed) | Must be on its own line |
| Setext underlines === / --- | (suppressed) | Lines immediately after heading text |
| Escaped char \* | literal char | Strips backslash, emits the escaped char |
| Inline HTML <...> | (tag skipped) | Same as XML tag-skip branch |

This is a best-effort single-pass scanner, not a full CommonMark parser. For deeply nested or ambiguous constructs the Stage-2 diff alignment compensates for minor drift. If 100% fidelity on complex Markdown is required, register a custom handler built on micromark or a similar AST parser.

Plaintext Handler (`buildPlaintextSurfaceMap`)

Identity map. The surface IS the source: map[i] = i for all i. Used as the implicit target-side handler when targetFormat is "text" (the fast path).

SurfaceExtractionResult Type

interface SurfaceExtractionResult {
  readonly surface: string;  // plain text with all markup stripped
  readonly map: Int32Array;  // map[sourceOffset] = surfaceOffset
                             // length = source.length + 1
}

The map is always non-decreasing: map[i] <= map[i+1] for all valid i. This is a fundamental invariant — tag characters collapse forward to the next content character's surface position, but never move backwards.

The extra sentinel entry map[source.length] = surface.length covers end-of-range lookups. You can safely use map[end] without a bounds check even when end === source.length.

Architecture Decisions

This section captures the library's design constraints and trade-offs. Read it before modifying any of the core pipeline stages.

Why `remapRange` excludes trailing target insertions

remapRange(map, start, end) does not use map[end] directly — it uses map[end - 1] + 1, falling back to map[end] only when the end - 1 derivation would exceed the sentinel.

The reason: map[end] for end === source.length returns target.length (the documented sentinel), which includes any target-side insertions that happened immediately after the source's last equal character. For range highlighting or selection this is the wrong answer — you get a range that spills over into user-inserted content.

Concretely, for source "Iris" aligned to target "Iris foo":

map[4] (sentinel) = 8 (points past the inserted " foo").
remapRange(map, 0, 4) returns { start: 0, end: 4 } (just "Iris"), not { start: 0, end: 8 } (the whole string).

The same issue appears in XML/HTML mapping: when source markup places tag characters after the last content character (e.g. <b>Iris</b> where </b> comes after Iris), those tag characters map to the source surface's sentinel position, and looking them up in the alignment map returns the trailing-insert-inclusive sentinel. remapRange corrects this so callers get well-behaved half-open semantics for free.

Insertions that happen between equal runs (mid-range) are still included — only insertions past the last equal character are trimmed. If you genuinely need the whole target (including trailing inserts), read map[source.length] directly rather than going through remapRange.
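One way to realise the end-of-range rule is to take the smaller of `map[end]` and `map[end - 1] + 1`. The sketch below is an assumption about how the described behaviour could be implemented, with a hypothetical name; the library's exact guard and its handling of empty input ranges may differ.

```typescript
function remapRangeSketch(
  map: Int32Array,
  start: number,
  end: number,
): { start: number; end: number } | null {
  if (end <= start) return null; // empty input range (assumption)
  const newStart = map[start]!;
  // "One past the last mapped character", capped by map[end] so that a deleted
  // final character does not pull the following target character into the range.
  const newEnd = Math.min(map[end]!, map[end - 1]! + 1);
  return newEnd > newStart ? { start: newStart, end: newEnd } : null;
}

const irisMap = new Int32Array([0, 1, 2, 3, 8]);     // "Iris" → "Iris foo"
console.log(remapRangeSketch(irisMap, 0, 4));        // { start: 0, end: 4 }

const midInsertMap = new Int32Array([0, 2, 3]);      // "ab" → "aXb"
console.log(remapRangeSketch(midInsertMap, 0, 2));   // { start: 0, end: 3 }

const deletedMap = new Int32Array([0, 1, 1, 2]);     // "abc" → "ac"
console.log(remapRangeSketch(deletedMap, 1, 2));     // null
```

The three calls cover the cases discussed above: a trailing insertion is trimmed, a mid-range insertion is kept, and a fully deleted range collapses to `null`.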

Why `diff` (not `diff-match-patch`) for Stage 2?

The diff package performs a pure character-level LCS (Longest Common Subsequence) diff. It is deterministic and produces predictable alignments.

diff-match-patch has a "fuzzy matching" phase designed for applying patches to slightly-changed text, which can produce surprising non-deterministic alignments when the source and target differ by more than whitespace. For an offset-mapping primitive that needs to behave the same way every time, that fuzziness is a liability.

Why Last-Write-Wins in `buildInverseMap`?

This is the most subtle design decision in the library and affects the cross-format path.

The problem: The XML/HTML handlers map multiple source characters to the same surface position. Specifically, all characters within a tag (<, d, i, v, > in <div>) map to the current surface position, and the content character immediately following the tag maps to that same surface position. The surface position advances only after the content character is recorded.

When building the inverse (surface position → source position), there is a tie: which source offset should surface position N map back to — the opening < of the preceding tag, or the content character that follows it?

Last-write-wins resolves the tie by overwriting on each forward pass through the source map. Since the content character is processed last (after all the tag characters), it wins. The inverse for surface position N returns the content character's source offset.

This is what callers need: when translating "surface position 5 → target source offset", you want to land on the content character in the target, not on the < of some preceding tag.

The entity trade-off: Entity characters (all of &, a, m, p, ; in &amp;) also all map to the same surface position. With last-write-wins, the inverse returns the ; (last entity char) instead of the & (first). This means cross-format mapping for source positions inside an entity reference may land on the ; rather than the & in the target.

Content characters surrounding entities (spaces, regular letters) always align correctly. For the primary use case — highlighting issue ranges that span real content — this entity-internal misalignment is acceptable.

The alternative (first-write-wins) would correctly return & for entities but would incorrectly return < for tag+content groups, which is the more common and more visually destructive failure.
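The tie-break is easiest to see on a tiny map. The sketch below is a hypothetical re-implementation of the idea, not the library's `buildInverseMap`; the forward fill for surface positions that no source offset writes to is an added assumption.

```typescript
function inverseLastWriteWinsSketch(map: Int32Array, surfaceLength: number): Int32Array {
  const inverse = new Int32Array(surfaceLength + 1).fill(-1);
  for (let src = 0; src < map.length; src++) {
    inverse[map[src]!] = src; // later source offsets overwrite earlier ones
  }
  // Assumption: positions never written inherit the previous position's value.
  for (let s = 1; s < inverse.length; s++) {
    if (inverse[s] === -1) inverse[s] = inverse[s - 1]!;
  }
  return inverse;
}

// "<b>Hi</b>": offsets 0-2 are the tag, 3 is "H", 4 is "i", 5-8 are the closing tag.
const tagMapDemo = new Int32Array([0, 0, 0, 0, 1, 2, 2, 2, 2, 2]);
console.log(Array.from(inverseLastWriteWinsSketch(tagMapDemo, 2))); // [3, 4, 9]
```

Surface position 0 resolves to source offset 3 (the `H`) rather than offset 0 (the `<`), which is the outcome the section argues for. A first-write-wins loop over the same map would return 0 for that position.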

Why Single-Pass Handlers (Not AST Parsers)?

Three reasons:

Zero extra dependencies: the library has a single runtime dependency (diff). An AST parser like unified/rehype for HTML or micromark for Markdown would add hundreds of kilobytes to the bundle.
O(n) time: single-pass handlers are linear in the source length. AST parsers typically also achieve O(n) but with a larger constant.
Stage 2 compensates for drift: the diff alignment stage handles edge cases where the handler's surface text differs slightly from the target text (e.g. minor Markdown ambiguity). A 100% faithful AST-based surface is only necessary for cross-format mapping where Stage 2 maps surface-to-surface and drift in Stage 1 propagates through. For the primary use case (format → plain text), Stage 2 absorbs the drift.

For genuinely complex Markdown or exotic HTML structures, register a custom handler:

import { buildCombinedMap, defaultRegistry } from "@markupai/format-offset-mapper";
import { fromMarkdown } from "mdast-util-from-markdown";

function buildAstMarkdownHandler(md: string) {
  // AST-based surface extraction...
}

defaultRegistry.register("markdown", buildAstMarkdownHandler);

Map Length Convention: `source.length + 1`

Every map in the pipeline has length source.length + 1. The extra entry at index source.length is the sentinel, recording the surface length. This allows callers to use map[end] for any end in [0, source.length] without a bounds check.

This convention applies to SurfaceExtractionResult.map, the return value of buildAlignmentMap, and the return value of mapOffsets.

`Int32Array` for All Maps

Int32Array was chosen over number[] for three reasons:

Fixed 4 bytes per entry (predictable memory for large documents)
Typed arrays are significantly faster to fill in tight loops in V8
The values are always integers, and Int32Array encodes that in the type (non-negativity is a pipeline invariant; the signed element type does not enforce it)

All offset values are non-negative and well under 2^31 - 1 (the Int32Array max), so there is no overflow risk for any realistic document size.

DOM-Free by Design

The library does not use document, DOMParser, window, or any other browser global. tsconfig.json uses lib: ["ES2022"] which excludes DOM types. This keeps the package usable from Node.js, browsers, Web Workers, and edge runtimes alike — and keeps the format handlers fast and predictable, since they don't depend on a host parser.

Testing and Coverage

100% statement / branch / function / line coverage is enforced via Vitest + v8.

Coverage thresholds are set in vitest.config.ts and will cause npm run test:coverage to fail if any threshold drops below 100%.

Test Structure

tests/
  core/
    alignment-map.test.ts   # buildAlignmentMap + remapRange
    align-ranges.test.ts    # alignRanges
    compose.test.ts         # mapOffsets + buildCombinedMap
    inverse-map.test.ts     # buildInverseMap
    registry.test.ts        # FormatHandlerRegistry
  handlers/
    xml.test.ts             # buildXmlToSurfaceMap unit tests
    html.test.ts            # buildHtmlToSurfaceMap unit tests
    markdown.test.ts        # buildMarkdownToSurfaceMap unit tests
    plaintext.test.ts       # buildPlaintextSurfaceMap unit tests
  integration/
    xml-to-text.test.ts     # Full pipeline, DITA fixtures
    html-to-text.test.ts    # Full pipeline, HTML fixtures
    xml-to-html.test.ts     # Cross-format pipeline with inverse map
    markdown-to-html.test.ts
  fixtures/
    dita-samples.ts         # Real-world DITA XML fixtures with expected offsets
    html-samples.ts
    markdown-samples.ts

Commands

npm test              # run tests in watch mode
npm run test:run      # run tests once (CI mode)
npm run test:coverage # run with coverage report (fails if < 100%)
npm run build         # compile TypeScript to dist/
npm run typecheck     # type-check only (no emit)
npm run lint          # ESLint + tsc --noEmit
npm run lint:fix      # ESLint with --fix

Writing Tests for a New Format Handler

Use a minimal registry to keep tests hermetic:

import { FormatHandlerRegistry, buildPlaintextSurfaceMap, mapOffsets } from "@markupai/format-offset-mapper";
import { myHandler } from "./my-handler.js";

const registry = new FormatHandlerRegistry()
  .register("myformat", myHandler)
  .register("text", buildPlaintextSurfaceMap);

it("maps content correctly", () => {
  const map = mapOffsets("myformat", source, plain, "text", { registry });
  expect(map[contentOffset]).toBe(expectedPlainOffset);
});

it("map is non-decreasing", () => {
  const { map } = myHandler(source);
  for (let i = 0; i < map.length - 1; i++) {
    expect(map[i]).toBeLessThanOrEqual(map[i + 1]);
  }
});

it("sentinel: map[source.length] === surface.length", () => {
  const { surface, map } = myHandler(source);
  expect(map[source.length]).toBe(surface.length);
});

The non-decreasing and sentinel invariants are the two most important correctness properties for any format handler.

Package Info and Repository Structure

@markupai/format-offset-mapper
  Node.js >= 24
  ESM only (type: "module")
  Single runtime dependency: diff ^9.0.0

Repository Layout

format-offset-mapper/
  src/
    index.ts               # Public API exports + registry population
    types.ts               # All shared TypeScript types and interfaces
    core/
      compose.ts           # mapOffsets + buildCombinedMap (pipeline orchestration)
      alignment-map.ts     # buildAlignmentMap + remapRange (Stage 2)
      inverse-map.ts       # buildInverseMap (Stage 3, cross-format path)
      align-ranges.ts      # alignRanges (high-level batch primitive)
    handlers/
      xml.ts               # buildXmlToSurfaceMap (also used for xhtml)
      html.ts              # buildHtmlToSurfaceMap
      markdown.ts          # buildMarkdownToSurfaceMap
      plaintext.ts         # buildPlaintextSurfaceMap (identity)
    utils/
      entities.ts          # decodeXmlEntity + decodeHtmlEntity
      normalize.ts         # normalizeForComparison
      registry.ts          # FormatHandlerRegistry class + defaultRegistry singleton
  tests/
    core/                  # Unit tests for pipeline stages
    handlers/              # Unit tests for each format handler
    integration/           # Full-pipeline tests using real-world fixtures
    fixtures/              # DITA, HTML, Markdown sample data
  dist/                    # Compiled output (generated by npm run build)
  package.json
  tsconfig.json
  tsconfig.build.json      # Build-only tsconfig (excludes tests/)
  vitest.config.ts
  eslint.config.js

TypeScript Configuration

Target: ES2022
Module system: NodeNext (strict ESM with .js extensions in imports)
Strict mode: enabled with noUncheckedIndexedAccess, exactOptionalPropertyTypes, noImplicitReturns
noUncheckedIndexedAccess means all array and typed-array reads produce T | undefined. The ! non-null assertion is used only where the invariants of the pipeline guarantee the index is in bounds (e.g. sentinel entries, Math.min-clamped indices).

Publishing

prepublishOnly runs build and test:run automatically. The files field in package.json publishes only dist/, README.md, and LICENSE.

Releases are published to npm via the Publish to npm GitHub Actions workflow, which is triggered by publishing a GitHub release. The workflow uses npm provenance so consumers can verify each published artifact was built from this repository.

License

Apache License 2.0 — see LICENSE for the full text.
