pixagram-content
v0.1.1
Content rendering, sanitization and NLP for Pixagram — compiled to WebAssembly
pixagram-content
WebAssembly content engine for the Pixagram platform.
Handles all user-generated content safely and efficiently:
| Input | Output |
|---|---|
| Post body (Markdown or HTML) | Sanitised HTML + image list + plain text |
| Comment (Markdown or HTML) | Sanitised HTML (restricted tags) + plain text |
| Biography | Plain text only |
| Username/display name | Plain text, max 50 chars |
Additional utilities:
- extract_plain_text(html): strip tags, decode entities
- summarize(text, n): TF-IDF extractive summary of n sentences
Architecture
src/
lib.rs → WASM-exported public API
renderer.rs → Markdown → HTML (pulldown-cmark)
sanitizer.rs → HTML sanitisation (hand-rolled allowlist) + @mention/#hashtag + external-link tagging
text_utils.rs → Image extraction, HTML → plain text, entity decoding
summarizer.rs → TF-IDF sentence scoring + extractive summarisation
Dependencies
| Crate | Why | Transitive weight |
|---|---|---|
| wasm-bindgen | JS↔Rust bridge | ~50 KB |
| pulldown-cmark | CommonMark parser | ~80 KB |
| js-sys | Build JS objects without serde | ~15 KB |
Deliberately excluded:
- ~~ammonia~~ → pulled html5ever + markup5ever + cssparser + ICU Unicode tables ≈ 1.4 MB. Replaced by the hand-rolled allowlist sanitiser in sanitizer.rs.
- ~~serde + serde-wasm-bindgen~~ → replaced by direct js-sys::Object construction.
- ~~regex~~ → replaced by O(n) scanner functions.
- ~~once_cell~~ → no longer needed without regex.
Expected WASM binary size after wasm-opt -Oz: ~200–350 KB.
Building
# Install wasm-pack (once)
cargo install wasm-pack
# Build for browser (ES module output)
wasm-pack build --target web --out-dir pkg
# Build for bundler (webpack/vite/rollup)
wasm-pack build --target bundler --out-dir pkg
# Run tests
wasm-pack test --headless --chrome
The pkg/ directory will contain:
- pixagram_content_bg.wasm: the compiled WebAssembly binary
- pixagram_content.js: JS glue code (auto-generated)
- pixagram_content.d.ts: TypeScript types (auto-generated)
Usage (React / TypeScript)
Initialise once
// content-engine.ts
import init, {
  render_post,
  render_comment,
  sanitize_biography,
  sanitize_username,
  extract_plain_text,
  summarize,
} from "./pkg/pixagram_content.js";
let ready = false;
export async function initContentEngine() {
  if (!ready) {
    await init();
    ready = true;
  }
}
export {
  render_post, render_comment, sanitize_biography, sanitize_username,
  extract_plain_text, summarize,
};
Render a post
import { initContentEngine, render_post } from "@/lib/content-engine";
await initContentEngine();
const ORIGIN = "https://pixagram.art";
const post = render_post(rawContent, isMarkdown, ORIGIN);
// post.html → sanitised HTML (with images)
// post.html_no_images → HTML with <img> stripped (for card preview)
// post.images → [{ src, alt, is_base64 }, …]
// post.plain_text → plain text for search / preview
Render a comment
const comment = render_comment(rawComment, isMarkdown, ORIGIN);
// comment.html
// comment.plain_text
Display with React (external-link dialog)
External links are tagged with data-external="true". Use
html-react-parser to intercept them:
// npm install html-react-parser
import parse, {
  domToReact,
  Element,
  type DOMNode,
  type HTMLReactParserOptions,
} from "html-react-parser";
import ExternalLinkDialog from "@/components/ExternalLinkDialog";
interface Props { html: string }
export function RichContent({ html }: Props) {
  // Annotating `options` lets the callback pass it on recursively
  // without TypeScript inferring an implicit `any`.
  const options: HTMLReactParserOptions = {
    replace(node) {
      if (!(node instanceof Element)) return;
      if (
        node.name === "a" &&
        node.attribs?.["data-external"] === "true"
      ) {
        return (
          <ExternalLinkDialog href={node.attribs.href}>
            {domToReact(node.children as DOMNode[], options)}
          </ExternalLinkDialog>
        );
      }
    },
  };
  return (
    <article className="rich-content">
      {parse(html, options)}
    </article>
  );
}
Summarise
const ORIGIN = "https://pixagram.art";
const post = render_post(body, true, ORIGIN);
const summary = summarize(post.plain_text, 3); // top 3 sentences
Biography & username
const bio = sanitize_biography(rawBio); // plain text, any length
const name = sanitize_username(rawName); // plain text, max 50 chars
Security model
Posts
Allowed elements: h1–h6, p, br, hr, ul, ol, li, blockquote,
pre, code, em, strong, del, s, sup, sub, a, img,
table, thead, tbody, tfoot, tr, th, td, div, span,
figure, figcaption, details, summary.
Allowed URL schemes: http, https, mailto, data (Base64 images).
Comments
Allowed elements: p, br, ul, ol, li, blockquote, code,
em, strong, del, s, sup, sub, a, span.
Allowed URL schemes: http, https, mailto (no data: — no embedded images in comments).
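The post and comment rules above amount to two allowlists that differ by context. The following is a minimal sketch of that idea, not the code in sanitizer.rs: the `Context` enum, the function names and the decision to let scheme-less (relative) URLs through are all assumptions made for illustration.

```rust
/// Rendering context (hypothetical type); posts allow more than comments.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Context {
    Post,
    Comment,
}

// Elements allowed in comments (and therefore also in posts).
const COMMENT_TAGS: &[&str] = &[
    "p", "br", "ul", "ol", "li", "blockquote", "code", "em", "strong",
    "del", "s", "sup", "sub", "a", "span",
];

// Elements that posts allow in addition to the comment set.
const POST_ONLY_TAGS: &[&str] = &[
    "h1", "h2", "h3", "h4", "h5", "h6", "hr", "pre", "img", "table",
    "thead", "tbody", "tfoot", "tr", "th", "td", "div", "figure",
    "figcaption", "details", "summary",
];

/// True when `tag` may appear in the given context (case-insensitive).
pub fn is_tag_allowed(tag: &str, ctx: Context) -> bool {
    let tag = tag.to_ascii_lowercase();
    COMMENT_TAGS.contains(&tag.as_str())
        || (ctx == Context::Post && POST_ONLY_TAGS.contains(&tag.as_str()))
}

/// True when the URL's scheme is allowed in the given context.
pub fn is_scheme_allowed(url: &str, ctx: Context) -> bool {
    let url = url.trim();
    let colon = match url.find(':') {
        Some(i) => i,
        None => return true, // no scheme: relative URL (assumed acceptable)
    };
    let head = &url[..colon];
    if head.chars().any(|c| matches!(c, '/' | '?' | '#')) {
        return true; // the ':' belongs to a path, query or fragment
    }
    // Remove whitespace and control characters before comparing, so that
    // inputs such as "java\tscript:" cannot slip past the allowlist.
    let scheme: String = head
        .chars()
        .filter(|c| !c.is_ascii_whitespace() && !c.is_ascii_control())
        .map(|c| c.to_ascii_lowercase())
        .collect();
    match scheme.as_str() {
        "http" | "https" | "mailto" => true,
        "data" => ctx == Context::Post, // Base64 images, posts only
        _ => false,
    }
}
```

Because both checks are allowlists, anything not named (a `<script>` element, a `javascript:` URL) is rejected without having to be enumerated.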
Biography / username
No HTML whatsoever — pure plain text output.
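As a rough illustration of what "plain text only" with a character cap can mean, here is a small sketch. It is an assumption-laden stand-in, not the crate's sanitize_biography or sanitize_username: the function name `to_plain_text` is hypothetical, the tag stripper is deliberately crude, and entity decoding is left out.

```rust
/// Reduce arbitrary input to plain text, optionally capped at `max_chars`
/// characters. Hypothetical helper for illustration only.
pub fn to_plain_text(input: &str, max_chars: Option<usize>) -> String {
    // Drop everything between '<' and '>' (a deliberately crude tag stripper).
    let mut stripped = String::with_capacity(input.len());
    let mut in_tag = false;
    for c in input.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            _ if in_tag => {}
            _ => stripped.push(c),
        }
    }
    // Collapse runs of whitespace so removed markup leaves no gaps.
    let collapsed = stripped.split_whitespace().collect::<Vec<_>>().join(" ");
    // Truncate by characters, not bytes, so multi-byte text is never cut
    // in the middle of a character.
    match max_chars {
        Some(limit) => collapsed.chars().take(limit).collect(),
        None => collapsed,
    }
}
```

Under these assumptions, a biography would be passed with `None` and a username with `Some(50)`, matching the limits listed in the table at the top.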
External links
Any <a href> pointing outside your origin receives:
- data-external="true": React hook for the dialog component
- target="_blank"
- rel="noopener noreferrer"
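One way to decide whether a link points "outside your origin" is to compare the scheme-plus-authority prefix of the href with the origin passed to render_post or render_comment. The sketch below assumes exactly that; `origin_of` and `is_external` are hypothetical names, and treating relative links and non-http schemes as internal is an assumption rather than documented behaviour.

```rust
/// Extract the lowercase "scheme://authority" prefix of an absolute
/// http(s) URL, or None for anything else (hypothetical helper).
fn origin_of(url: &str) -> Option<String> {
    let lower = url.trim().to_ascii_lowercase();
    let scheme_end = lower.find("://")?;
    let scheme = &lower[..scheme_end];
    if scheme != "http" && scheme != "https" {
        return None;
    }
    let authority_start = scheme_end + 3;
    // The authority ends at the first '/', '?' or '#', or at end of input.
    let authority_len = lower[authority_start..]
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(lower.len() - authority_start);
    Some(lower[..authority_start + authority_len].to_string())
}

/// True when `href` is an absolute http(s) URL on a different origin.
pub fn is_external(href: &str, origin: &str) -> bool {
    match (origin_of(href), origin_of(origin)) {
        (Some(link), Some(own)) => link != own,
        // Relative links, mailto: and similar are not tagged in this sketch.
        _ => false,
    }
}
```

Comparing the whole authority rather than testing for a substring means look-alike hosts such as `pixagram.art.evil.example` are still classified as external.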
What the sanitiser blocks
- All <script>, <style>, <iframe>, <object>, <embed>, <form>, … tags
- javascript: and vbscript: URL schemes
- Event handler attributes (onclick, onload, …)
- Arbitrary data-* attributes (only explicitly allowed ones pass through)
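With an allowlist, event handlers and stray data-* attributes do not need to be listed individually: they are dropped simply because they are not named. A minimal sketch follows; the contents of `ALLOWED_ATTRS` are an assumption for illustration and are not taken from sanitizer.rs.

```rust
// Hypothetical attribute allowlist; the real list lives in sanitizer.rs
// and may differ.
const ALLOWED_ATTRS: &[&str] =
    &["href", "src", "alt", "title", "class", "data-external"];

/// True when an attribute name may be kept (case-insensitive).
pub fn is_attribute_allowed(name: &str) -> bool {
    let name = name.trim().to_ascii_lowercase();
    ALLOWED_ATTRS.contains(&name.as_str())
}
```

Here `onclick` and `data-tracking` are rejected by the same test that accepts `href`, with no special case for the `on` prefix.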
@mentions and #hashtags
Applied to text nodes only (never inside existing HTML attributes):
@username → <a href="/@username" class="mention">@username</a>
#hashtag → <a href="/trending/#hashtag" class="hashtag">#hashtag</a>
Username character set: [A-Za-z0-9_.\-] up to 64 chars.
Hashtag character set: [A-Za-z0-9_] up to 100 chars.
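The rules above can be implemented as a single left-to-right scan without a regex. The sketch below is one possible shape for such a scanner, not the crate's code. It assumes the input is the content of one text node that has already been HTML-escaped, and it adds two assumptions of its own: a token only starts at a word boundary (so email addresses are left alone), and trailing `.` or `-` is not part of a username.

```rust
fn is_username_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || matches!(c, '_' | '.' | '-')
}

fn is_hashtag_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Wrap @mentions and #hashtags found in a text node in anchor tags.
pub fn link_mentions_and_hashtags(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        // Assumption: only start a token after a non-word character.
        let at_boundary =
            i == 0 || !(chars[i - 1].is_ascii_alphanumeric() || chars[i - 1] == '_');
        if at_boundary && (c == '@' || c == '#') {
            let is_mention = c == '@';
            let max_len = if is_mention { 64 } else { 100 };
            let accept: fn(char) -> bool =
                if is_mention { is_username_char } else { is_hashtag_char };
            // Consume up to max_len allowed characters; any excess stays text.
            let start = i + 1;
            let mut end = start;
            while end < chars.len() && end - start < max_len && accept(chars[end]) {
                end += 1;
            }
            // Assumption: sentence punctuation is not part of the name.
            while end > start && matches!(chars[end - 1], '.' | '-') {
                end -= 1;
            }
            if end > start {
                let name: String = chars[start..end].iter().collect();
                if is_mention {
                    out.push_str(&format!(
                        "<a href=\"/@{0}\" class=\"mention\">@{0}</a>", name));
                } else {
                    out.push_str(&format!(
                        "<a href=\"/trending/#{0}\" class=\"hashtag\">#{0}</a>", name));
                }
                i = end;
                continue;
            }
        }
        out.push(c);
        i += 1;
    }
    out
}
```

Since the name is built only from the restricted character sets, it can be interpolated into the href without further escaping.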
Images
post.images.forEach(img => {
console.log(img.src); // full URL or data:image/… base64 string
console.log(img.alt); // alt text
console.log(img.is_base64); // true for data: URIs
});
// Render card without inline images, show first image as cover:
const coverSrc = post.images[0]?.src ?? null;
// Use post.html_no_images in the article body
TF-IDF summariser details
- Sentence splitting: rule-based; avoids splitting on common abbreviations (≤2-char tokens before .).
- Tokenisation: lowercase, strip punctuation, remove stop words.
- Porter-lite stemming: suffix stripping for conflation (e.g. running → run).
- TF: term_count_in_sentence / sentence_length.
- IDF: ln(N / df) + 1 (smoothed to avoid zero IDF when a term appears in every sentence).
- Sentence score: Σ TF·IDF over all terms in the sentence.
- Selection: top k sentences returned in original document order (not ranked order) for readability.
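The pipeline above can be sketched in plain Rust as follows. This is an illustrative reimplementation under stated assumptions, not the contents of summarizer.rs: the stop-word list is a tiny placeholder, stemming is omitted, and the names `split_sentences`, `tokenise` and `summarise` are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

// Placeholder stop-word list; a real one would be much longer.
const STOP_WORDS: &[&str] = &["a", "an", "and", "the", "is", "are", "of", "to", "in", "on", "it"];

/// Split on '.', '!' or '?' followed by whitespace or end of input,
/// skipping a '.' that follows a token of two characters or fewer.
fn split_sentences(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        current.push(c);
        let at_break = matches!(c, '.' | '!' | '?')
            && chars.peek().map_or(true, |next| next.is_whitespace());
        if !at_break {
            continue;
        }
        let last_len = current
            .trim_end_matches('.')
            .split_whitespace()
            .last()
            .map_or(0, |token| token.chars().count());
        if c == '.' && last_len <= 2 {
            continue; // probably an abbreviation such as "Dr."
        }
        sentences.push(current.trim().to_string());
        current.clear();
    }
    if !current.trim().is_empty() {
        sentences.push(current.trim().to_string());
    }
    sentences
}

/// Lowercase, split on non-alphanumerics, drop stop words (no stemming here).
fn tokenise(sentence: &str) -> Vec<String> {
    sentence
        .split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty())
        .map(|word| word.to_lowercase())
        .filter(|word| !STOP_WORDS.contains(&word.as_str()))
        .collect()
}

/// Return the `k` highest-scoring sentences in original document order.
pub fn summarise(text: &str, k: usize) -> Vec<String> {
    let sentences = split_sentences(text);
    let docs: Vec<Vec<String>> = sentences.iter().map(|s| tokenise(s)).collect();
    let n = docs.len() as f64;

    // df: number of sentences that contain each term.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in &docs {
        let unique: HashSet<&str> = doc.iter().map(String::as_str).collect();
        for term in unique {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    let mut scored: Vec<(usize, f64)> = docs
        .iter()
        .enumerate()
        .map(|(i, doc)| {
            if doc.is_empty() {
                return (i, 0.0);
            }
            let mut counts: HashMap<&str, usize> = HashMap::new();
            for term in doc {
                *counts.entry(term.as_str()).or_insert(0) += 1;
            }
            // Score = sum over distinct terms of TF * (ln(N / df) + 1).
            let score: f64 = counts
                .iter()
                .map(|(term, &count)| {
                    let tf = count as f64 / doc.len() as f64;
                    let idf = (n / df[term] as f64).ln() + 1.0;
                    tf * idf
                })
                .sum();
            (i, score)
        })
        .collect();

    // Rank by score, keep the top k, then restore document order.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    let mut picked: Vec<usize> = scored.iter().take(k).map(|&(i, _)| i).collect();
    picked.sort_unstable();
    picked.into_iter().map(|i| sentences[i].clone()).collect()
}
```

Note that with TF normalised by sentence length, the score is effectively the average IDF of a sentence's terms, so a sentence made only of words that occur everywhere scores exactly 1 and is the first to be dropped.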
To upgrade to production-quality stemming without writing your own stemmer, replace the body of the stem() function in summarizer.rs with a call to the rust-stemmers crate (Snowball/Porter2 algorithm).
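To make the trade-off concrete, the following sketch shows the kind of suffix stripping that "Porter-lite" suggests. It is a guess at the general technique, not the stem() function shipped in summarizer.rs; the suffix list, the minimum stem length and the `undouble` helper are all assumptions.

```rust
// Longest suffixes first, so that "ings" is tried before "s".
const SUFFIXES: &[&str] = &["ingly", "edly", "ings", "ing", "ed", "es", "ly", "s"];

/// Collapse a doubled final consonant left behind by suffix removal
/// ("runn" becomes "run"), except for letters that commonly double.
fn undouble(stem: &str) -> String {
    let chars: Vec<char> = stem.chars().collect();
    let n = chars.len();
    let doubled = n >= 2 && chars[n - 1] == chars[n - 2];
    let keep = n >= 1
        && matches!(chars[n - 1], 'a' | 'e' | 'i' | 'o' | 'u' | 'l' | 's' | 'z');
    if doubled && !keep {
        chars[..n - 1].iter().collect()
    } else {
        stem.to_string()
    }
}

/// Crude suffix-stripping stemmer (illustrative only).
pub fn stem(word: &str) -> String {
    let lower = word.to_ascii_lowercase();
    for &suffix in SUFFIXES {
        if let Some(candidate) = lower.strip_suffix(suffix) {
            // Keep at least three characters so short words survive intact.
            if candidate.chars().count() >= 3 {
                return undouble(candidate);
            }
        }
    }
    lower
}
```

A stemmer this small conflates common inflections but mangles irregular words, which is the gap a Snowball implementation is meant to close.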
