@tfw.in/structura-lib
v0.2.12
Structura Library Components
A React component library for PDF document viewing with structured data extraction and rendering.
Features
- PDF & JSON Side-by-Side Viewing - View original PDF alongside extracted structured content
- Edit Mode - Inline editing of extracted content
- Math Rendering - LaTeX math expressions rendered via KaTeX ($...$ inline, $$...$$ display)
- Semantic Tags - Visual highlighting for corrections, additions, and deletions
- Header/Footer Detection - Automatic badges for header and footer content
- Table Support - Rich table rendering with cell-level editing
Installation
npm install @tfw.in/structura-lib
Usage
import { Structura } from '@tfw.in/structura-lib';
import '@tfw.in/structura-lib/styles.css';
function App() {
  return (
    <Structura
      apiKey="your-api-key"
      baseUrl="https://api.example.com"
    />
  );
}
Props
Core Props
| Prop | Type | Default | Description |
|------|------|---------|-------------|
| apiKey | string | required | API key for authentication |
| baseUrl | string | undefined | Optional API base URL |
| initialPdfPath | string \| null | null | Initial PDF file path to load |
| initialJsonData | any | null | Initial JSON data to display |
Feature Flags
| Prop | Type | Default | Description |
|------|------|---------|-------------|
| editMode | boolean | true | Enable/disable edit mode toggle |
| jsonMode | boolean | true | Enable/disable JSON view mode toggle |
| mathRendering | boolean | true | Enable LaTeX math rendering |
| semanticTags | boolean | true | Enable/disable semantic tags toggle |
| headerFooterBadges | boolean | true | Show header/footer badges |
| postProcessors | PostProcessor[] | defaultPostProcessors | Ordered array of HTML → HTML plugins. Runs before math rendering. Pass [] to disable all cleanup. See Post-Processing Plugins. |
| dedupePostTableText | boolean | true | Deprecated. Flip the default preset on/off. Prefer postProcessors. |
| htmlPostProcessor | (html: string) => string | undefined | Deprecated. Appended to the plugin chain. Prefer postProcessors. |
| defaultViewMode | 'read' \| 'edit' \| 'json' | 'read' | Initial view mode |
Callbacks
| Prop | Type | Default | Description |
|------|------|---------|-------------|
| onContentChange | function | undefined | Callback when content is edited: (blockId, oldContent, newContent) => void |
| onExport | function | undefined | Callback when data is exported: (data) => void |
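The callback signatures above can be typed and exercised without the component. The sketch below is only an illustration: the README gives the argument names but not their types, so `BlockId` as a string, string content values, and the `createChangeLog` helper are all assumptions, not part of the library.

```typescript
// Assumed types: the README only names the arguments, it does not type them.
type BlockId = string;
type ContentChangeHandler = (
  blockId: BlockId,
  oldContent: string,
  newContent: string
) => void;

interface ContentChange {
  blockId: BlockId;
  oldContent: string;
  newContent: string;
}

// Hypothetical helper: builds a handler that records every real edit.
function createChangeLog(): { handler: ContentChangeHandler; changes: ContentChange[] } {
  const changes: ContentChange[] = [];
  const handler: ContentChangeHandler = (blockId, oldContent, newContent) => {
    if (oldContent === newContent) return; // ignore no-op edits
    changes.push({ blockId, oldContent, newContent });
  };
  return { handler, changes };
}

// Simulate two callback invocations; only the first is a real change.
const changeLog = createChangeLog();
changeLog.handler('block-1', 'Total aera', 'Total area');
changeLog.handler('block-2', 'unchanged', 'unchanged');
```

In an application, `changeLog.handler` would be passed as the `onContentChange` prop, and `changeLog.changes` could be sent to a backend when the user saves.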
Styling Props
| Prop | Type | Default | Description |
|------|------|---------|-------------|
| className | string | undefined | Custom class for the container |
| style | React.CSSProperties | undefined | Inline styles for the container |
| pdfPanelClassName | string | undefined | Custom class for the PDF panel |
| htmlPanelClassName | string | undefined | Custom class for the HTML panel |
| theme | object | undefined | Theme customization object (see below) |
Theme Object
theme={{
  primaryColor: '#3b82f6', // Primary accent color
  backgroundColor: '#ffffff', // Background color
  textColor: '#1f2937', // Text color
  borderColor: '#e5e7eb', // Border color
  fontFamily: 'Inter, sans-serif' // Font family
}}
Full Example
import { Structura } from '@tfw.in/structura-lib';
import '@tfw.in/structura-lib/styles.css';
function App() {
  const handleContentChange = (blockId, oldContent, newContent) => {
    console.log(`Block ${blockId} changed`);
    console.log('Old:', oldContent);
    console.log('New:', newContent);
  };
  const handleExport = (data) => {
    console.log('Exported data:', data);
    // Save to your backend, etc.
  };
  return (
    <Structura
      apiKey="your-api-key"
      baseUrl="https://api.example.com"
      editMode={true}
      jsonMode={true}
      mathRendering={true}
      semanticTags={true}
      headerFooterBadges={true}
      defaultViewMode="read"
      onContentChange={handleContentChange}
      onExport={handleExport}
    />
  );
}
Math Rendering
The Structura component automatically renders LaTeX math expressions ($m^2$, $L/m^2$, etc.) when mathRendering={true} (the default).
If you extract text from the response structure (e.g. Gemini corrected output) and render it in your own components, the raw LaTeX delimiters will appear as plain text. Use renderMathInHtml to convert them to rendered math:
import { renderMathInHtml } from '@tfw.in/structura-lib';
// Convert LaTeX delimiters to rendered HTML
const rawText = "Total membrane area required 0.02$m^2$";
const rendered = renderMathInHtml(rawText);
// Use with dangerouslySetInnerHTML
<div dangerouslySetInnerHTML={{ __html: rendered }} />
Utilities
| Export | Type | Description |
|--------|------|-------------|
| renderMathInHtml(html) | function | Converts $...$ (inline) and $$...$$ (display) math to KaTeX HTML. Use this when rendering extracted text in your own components. |
| containsMath(html) | function | Returns true if the string contains math delimiters. |
| MathContent | React component | Renders HTML with math expressions. Props: html, className, as (element type). |
| useMathHtml(html) | React hook | Returns rendered math HTML string via useMemo. |
import { MathContent, renderMathInHtml, containsMath, useMathHtml } from '@tfw.in/structura-lib';
// As a React component
<MathContent html="The formula is $E = mc^2$" />
// As a hook
function MyComponent({ text }) {
  const rendered = useMathHtml(text);
  return <span dangerouslySetInnerHTML={{ __html: rendered }} />;
}
// Check before processing
if (containsMath(text)) {
  const html = renderMathInHtml(text);
}
Note: Ensure you import the library styles (@tfw.in/structura-lib/styles.css); this loads the KaTeX CSS required for proper math rendering.
Post-Processing Plugins
HTML cleanup is plugin-based. A PostProcessor is any pure function (html: string) => string. Plugins run in order, before math rendering. You can use the defaults, subset them, extend, or replace entirely.
import type { PostProcessor } from '@tfw.in/structura-lib';
// Equivalent to: type PostProcessor = (html: string) => string;
Built-in plugins
| Plugin | What it does |
|--------|-------------|
| dedupeTableText | Strips floating <td>/<th>/<tr> elements or prose blocks that repeat the preceding <table>'s content (a frequent Gemini artifact). |
| fixTocTable | For TOC-like tables where Gemini leaves section numbers stuck at the start of the title cell ("10 In Process Data" in one <td>), moves the digits into the leading cell. |
The default preset is exported as defaultPostProcessors:
import { defaultPostProcessors } from '@tfw.in/structura-lib';
// === [dedupeTableText, fixTocTable]
Using the plugin system
import {
  Structura,
  defaultPostProcessors,
  dedupeTableText,
  fixTocTable,
} from '@tfw.in/structura-lib';
// 1. Default: all built-ins applied
<Structura initialJsonData={data} />
// 2. Subset: only dedupe, skip TOC fix
<Structura initialJsonData={data} postProcessors={[dedupeTableText]} />
// 3. Extend with your own plugin
const stripLatexNoise = (html) =>
  html.replace(/\\bigcirc/g, '○').replace(/\\overline\{([^}]*)\}/g, '$1');
<Structura
  initialJsonData={data}
  postProcessors={[...defaultPostProcessors, stripLatexNoise]}
/>
// 4. Turn off all cleanup
<Structura initialJsonData={data} postProcessors={[]} />
Writing your own plugin
A plugin is any function matching (html: string) => string. Keep it pure:
import type { PostProcessor } from '@tfw.in/structura-lib';
const removeEmptyParagraphs: PostProcessor = (html) =>
  html.replace(/<p>\s*<\/p>/g, '');
Plugins that throw are isolated: the rest of the chain continues on the last good output.
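To make "the chain continues on the last good output" concrete, here is a minimal sketch of an ordered plugin chain with error isolation. `runChain`, `alwaysThrows`, and `collapseSpaces` are hypothetical names written for this illustration; the library's internal runner may be implemented differently.

```typescript
type PostProcessor = (html: string) => string;

// Hypothetical stand-in for the library's chain runner, not its actual source.
function runChain(plugins: PostProcessor[], html: string): string {
  let current = html;
  for (const plugin of plugins) {
    try {
      current = plugin(current);
    } catch {
      // A failing plugin is skipped; `current` keeps the last good output.
    }
  }
  return current;
}

const removeEmptyParagraphs: PostProcessor = (html) =>
  html.replace(/<p>\s*<\/p>/g, '');

// Illustrative plugins: one that always fails, one that runs after it.
const alwaysThrows: PostProcessor = () => {
  throw new Error('plugin bug');
};
const collapseSpaces: PostProcessor = (html) => html.replace(/ {2,}/g, ' ');

const chainOutput = runChain(
  [removeEmptyParagraphs, alwaysThrows, collapseSpaces],
  '<p> </p><p>Flow  rate</p>'
);
// chainOutput === '<p>Flow rate</p>'
```

The failing plugin neither aborts the chain nor discards the work done by the plugin before it, which is the behavior the sentence above describes.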
Composing outside of Structura
If you process HTML outside the component (custom sidebar, PDF export, etc.), use composePostProcessors:
import {
  composePostProcessors,
  defaultPostProcessors,
  renderMathInHtml,
} from '@tfw.in/structura-lib';
const pipeline = composePostProcessors(defaultPostProcessors);
const cleaned = renderMathInHtml(pipeline(rawHtml));
Math rendering is separate
renderMathInHtml (controlled by the mathRendering prop) runs after all post-processors and is not a plugin. It transforms $...$ → KaTeX HTML, which changes output semantics beyond cleanup.
Cleaning the whole JSON tree
If you extract content from the parsed JSON for downstream processing (not just rendering), use cleanJsonData. It walks the tree and runs your plugin chain on every /GeminiCorrected block's html.
import { cleanJsonData, dedupeTableText, defaultPostProcessors } from '@tfw.in/structura-lib';
// Default: uses defaultPostProcessors
const cleaned = cleanJsonData(responseJson);
// Only dedupe, skip TOC fix:
const cleaned2 = cleanJsonData(responseJson, { postProcessors: [dedupeTableText] });
// Extend defaults with your own plugin:
const cleaned3 = cleanJsonData(responseJson, {
  postProcessors: [...defaultPostProcessors, myPlugin],
});
// Raw: no cleanup
const raw = cleanJsonData(responseJson, { postProcessors: [] });
Inside Structura this is applied automatically with your chosen postProcessors, so onContentChange / onExport already hand you clean HTML.
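The idea of walking a tree and running a plugin chain on each block's html can be sketched as follows. This is a simplified illustration, not the library's code: the `DocNode` shape with a `children` array is an assumption, and the sketch processes every node that has an `html` string rather than only /GeminiCorrected blocks.

```typescript
type PostProcessor = (html: string) => string;

// Hypothetical node shape: only the `html` field name comes from the README.
interface DocNode {
  html?: string;
  children?: DocNode[];
}

// Illustrative walker: returns a new tree and never mutates the input.
function cleanTree(node: DocNode, plugins: PostProcessor[]): DocNode {
  const copy: DocNode = { ...node };
  if (typeof node.html === 'string') {
    // Apply the plugins in order, feeding each one the previous output.
    copy.html = plugins.reduce((acc, plugin) => plugin(acc), node.html);
  }
  if (node.children) {
    copy.children = node.children.map((child) => cleanTree(child, plugins));
  }
  return copy;
}

const dropEmptyParagraphs: PostProcessor = (html) =>
  html.replace(/<p>\s*<\/p>/g, '');

const sampleTree: DocNode = {
  children: [
    { html: '<p></p><p>Intro</p>' },
    { html: '<p>Body</p>', children: [{ html: '<p>  </p>' }] },
  ],
};

const cleanedTree = cleanTree(sampleTree, [dropEmptyParagraphs]);
// cleanedTree.children[0].html === '<p>Intro</p>', and sampleTree is unchanged.
```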
Processor contract (immutability)
Every processor exported by this library follows the same contract:
- Pure — the input value is never mutated
- Immutable output — strings are immutable by language; object outputs are deep-frozen so downstream code cannot accidentally mutate them
| Processor | Input | Output |
|-----------|-------|--------|
| dedupeTableText(html) | string | string (new) |
| fixTocTable(html) | string | string (new) |
| renderMathInHtml(html) | string | string (new) |
| cleanJsonData(data, opts?) | object tree | new tree, deep-frozen |
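As an illustration of what "deep-frozen" means for a consumer, the sketch below freezes a small tree recursively and shows that a later write is rejected. `deepFreeze` is a hypothetical helper written for this example; the library's own freezing logic is not shown in this README and may differ.

```typescript
// Hypothetical helper, written only to illustrate the "deep-frozen" contract.
function deepFreeze<T>(value: T): T {
  // Skipping already-frozen objects also stops recursion on cyclic references.
  if (typeof value === 'object' && value !== null && !Object.isFrozen(value)) {
    Object.freeze(value);
    for (const key of Object.keys(value)) {
      deepFreeze((value as Record<string, unknown>)[key]);
    }
  }
  return value;
}

const frozenTree = deepFreeze({
  html: '<p>Area</p>',
  children: [{ html: '<td>1</td>' }],
});

// Pushing onto a frozen array throws a TypeError.
let writeRejected = false;
try {
  frozenTree.children.push({ html: '<td>2</td>' });
} catch {
  writeRejected = true;
}
```

Freezing catches accidental mutation early, at the cost of requiring an explicit copy (or an opt-out) when in-place edits are intended.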
If you need a mutable tree (e.g. to hand to another library that edits in place), pass { disableFreeze: true }:
const mutable = cleanJsonData(data, { disableFreeze: true });
Recipes for common Gemini quirks
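htmlPostProcessor is appended to the plugin chain, so it runs after the built-in plugins and before math rendering. Use it to patch issues you see in your own data without waiting on an SDK release. The prop is deprecated in favor of postProcessors, and every recipe below is a plain (html: string) => string function, so it can be passed through either prop. Below are tested recipes for issues seen in customer documents; copy what you need.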
Tip: Always wrap your processor in useCallback so React doesn't re-run rendering on every parent update. Compose multiple recipes by chaining .replace(...) or piping helper functions.
1. Section header order swap ("Equipment 3" → "3 Equipment")
Gemini sometimes emits headings as name + number instead of number + name. Detect the pattern at the end of a heading's text and swap.
const fixSectionOrder = (html) =>
  html.replace(
    /<(h[1-6])([^>]*)>([^<]+?)\s+(\d+(?:\.\d+)*)\s*<\/\1>/g,
    '<$1$2>$4 $3</$1>'
  );
2. Strip stray LaTeX tokens that aren't real math
Gemini occasionally inserts \bullet, \bigcirc, \overline{...}, \mathbf{...}, or \text{...} around plain text. Replace the symbol commands with their visual equivalents and unwrap the rest.
const stripLatexNoise = (html) =>
  html
    .replace(/\\bullet/g, '•')
    .replace(/\\bigcirc/g, '○')
    .replace(/\\overline\{([^}]*)\}/g, '$1')
    .replace(/\\mathbf\{([^}]*)\}/g, '$1')
    .replace(/\\text\{([^}]*)\}/g, '$1');
3. Drop block-type pseudo-attributes from prose
Gemini sometimes leaves debug-style attributes like <p block-type="Text">. Harmless but noisy.
const stripBlockType = (html) => html.replace(/\s+block-type="[^"]*"/g, '');
4. Compose multiple recipes
import { useCallback } from 'react';
import { Structura } from '@tfw.in/structura-lib';
function MyViewer({ jsonData }) {
  const postProcess = useCallback((html) => {
    return [fixSectionOrder, stripLatexNoise, stripBlockType]
      .reduce((acc, fn) => fn(acc), html);
  }, []);
  return (
    <Structura
      initialJsonData={jsonData}
      htmlPostProcessor={postProcess}
    />
  );
}
5. Apply the same transforms to the JSON before extracting data
If your downstream code reads from node.html directly (export, search, etc.), run the same transform on the JSON tree so consumers see the cleaned content:
import { cleanJsonData } from '@tfw.in/structura-lib';
const cleaned = cleanJsonData(jsonData, {
  transformHtml: (html) => stripLatexNoise(fixSectionOrder(html)),
});
// `cleaned` now has both built-in table dedupe AND your custom rules
// applied to every /GeminiCorrected block's `html` field.
Heads up: Some quirks (missing columns, deteriorated OCR values) are upstream data issues that no UI-side post-processor can fix; those need a pipeline change. The recipes above only address things visible in the HTML string.
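The recipes can be tried on a small sample without mounting the component. The sketch below copies recipes 1 to 3 verbatim and composes them with the same reduce pattern as recipe 4; the sample HTML string and the block-type values in it are made up for illustration.

```typescript
type PostProcessor = (html: string) => string;

// Recipe 1: move a trailing section number to the front of a heading.
const fixSectionOrder: PostProcessor = (html) =>
  html.replace(
    /<(h[1-6])([^>]*)>([^<]+?)\s+(\d+(?:\.\d+)*)\s*<\/\1>/g,
    '<$1$2>$4 $3</$1>'
  );

// Recipe 2: replace or unwrap stray LaTeX tokens.
const stripLatexNoise: PostProcessor = (html) =>
  html
    .replace(/\\bullet/g, '•')
    .replace(/\\bigcirc/g, '○')
    .replace(/\\overline\{([^}]*)\}/g, '$1')
    .replace(/\\mathbf\{([^}]*)\}/g, '$1')
    .replace(/\\text\{([^}]*)\}/g, '$1');

// Recipe 3: drop block-type pseudo-attributes.
const stripBlockType: PostProcessor = (html) =>
  html.replace(/\s+block-type="[^"]*"/g, '');

// Made-up sample input showing all three quirks at once.
const sampleHtml =
  '<h2 block-type="SectionHeader">Equipment 3</h2>' +
  '<p block-type="Text">\\bullet Pump \\mathbf{P-101}</p>';

const recipeOutput = [fixSectionOrder, stripLatexNoise, stripBlockType].reduce(
  (acc, fn) => fn(acc),
  sampleHtml
);
// recipeOutput === '<h2>3 Equipment</h2><p>• Pump P-101</p>'
```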
Semantic Tags
Parse and render semantic tags for document corrections:
import { SemanticTagRenderer, parseSemanticTags } from '@tfw.in/structura-lib';
// Render with visual highlighting
<SemanticTagRenderer content="Text with <add>additions</add> and <del>deletions</del>" />
// Parse tags programmatically
const parsed = parseSemanticTags(content);
License
MIT
