pixagram-content

v0.1.1

Published

2 months ago

Content rendering, sanitization and NLP for Pixagram — compiled to WebAssembly


WebAssembly content engine for the Pixagram platform.

Handles all user-generated content safely and efficiently:

| Input | Output |
|---|---|
| Post body (Markdown or HTML) | Sanitised HTML + image list + plain text |
| Comment (Markdown or HTML) | Sanitised HTML (restricted tags) + plain text |
| Biography | Plain text only |
| Username/display name | Plain text, max 50 chars |

Additional utilities:

extract_plain_text(html) — strip tags, decode entities
summarize(text, n) — TF-IDF extractive summary of n sentences
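
As a rough illustration of what "strip tags, decode entities" involves, here is a minimal TypeScript sketch. It is not the crate's Rust code: the function names, the small entity table, and the use of JavaScript regular expressions (the crate itself avoids regex) are all assumptions made for brevity. The point it shows is the ordering: tags are removed first and entities decoded second, so an escaped `&lt;b&gt;` survives as literal text instead of being mistaken for markup.

```typescript
// Hypothetical sketch only; the real extract_plain_text lives in text_utils.rs.
const NAMED_ENTITIES = new Map<string, string>([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["apos", "'"], ["nbsp", " "],
]);

function decodeEntities(text: string): string {
  const pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;
  return text.replace(pattern, (match: string, body: string): string => {
    if (body.charAt(0) === "#") {
      const isHex = body.charAt(1).toLowerCase() === "x";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched rather than guessing.
      return code > 0 && code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are kept verbatim.
    return NAMED_ENTITIES.get(body.toLowerCase()) ?? match;
  });
}

function extractPlainTextSketch(html: string): string {
  // 1. Replace every tag with a space so adjacent blocks do not fuse together.
  const withoutTags = html.replace(/<[^>]*>/g, " ");
  // 2. Decode entities only after the tags are gone.
  // 3. Collapse the whitespace left behind.
  return decodeEntities(withoutTags).replace(/\s+/g, " ").trim();
}
```

For example, `extractPlainTextSketch("<p>Fish &amp; <b>chips</b></p>")` yields `Fish & chips`.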

Architecture

src/
  lib.rs         → WASM-exported public API
  renderer.rs    → Markdown → HTML  (pulldown-cmark)
  sanitizer.rs   → HTML sanitisation (hand-rolled allowlist) + @mention/#hashtag + external-link tagging
  text_utils.rs  → Image extraction, HTML → plain text, entity decoding
  summarizer.rs  → TF-IDF sentence scoring + extractive summarisation

Dependencies

| Crate | Why | Transitive weight |
|---|---|---|
| wasm-bindgen | JS↔Rust bridge | ~50 KB |
| pulldown-cmark | CommonMark parser | ~80 KB |
| js-sys | Build JS objects without serde | ~15 KB |

Deliberately excluded:

~~ammonia~~ → pulled html5ever + markup5ever + cssparser + ICU Unicode tables ≈ 1.4 MB. Replaced by the hand-rolled allowlist sanitiser in sanitizer.rs.
~~serde + serde-wasm-bindgen~~ → replaced by direct js-sys::Object construction.
~~regex~~ → replaced by O(n) scanner functions.
~~once_cell~~ → no longer needed without regex.

Expected WASM binary size after wasm-opt -Oz: ~200–350 KB.

Building

# Install wasm-pack (once)
cargo install wasm-pack

# Build for browser (ES module output)
wasm-pack build --target web --out-dir pkg

# Build for bundler (webpack/vite/rollup)
wasm-pack build --target bundler --out-dir pkg

# Run tests
wasm-pack test --headless --chrome

The pkg/ directory will contain:

pixagram_content_bg.wasm — the compiled WebAssembly binary
pixagram_content.js — JS glue code (auto-generated)
pixagram_content.d.ts — TypeScript types (auto-generated)

Usage (React / TypeScript)

Initialise once

// content-engine.ts
import init, {
  render_post,
  render_comment,
  sanitize_biography,
  sanitize_username,
  extract_plain_text,
  summarize,
} from "./pkg/pixagram_content.js";

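// Cache the in-flight promise so that concurrent callers share one init() call.
// A boolean flag set after the await would let overlapping calls load the module twice.
let ready: Promise<unknown> | null = null;

export function initContentEngine(): Promise<unknown> {
  if (ready === null) {
    ready = init();
  }
  return ready;
}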

export { render_post, render_comment, sanitize_biography, sanitize_username,
         extract_plain_text, summarize };

Render a post

import { initContentEngine, render_post } from "@/lib/content-engine";

await initContentEngine();

const ORIGIN = "https://pixagram.art";

const post = render_post(rawContent, isMarkdown, ORIGIN);
// post.html           → sanitised HTML (with images)
// post.html_no_images → HTML with <img> stripped (for card preview)
// post.images         → [{ src, alt, is_base64 }, …]
// post.plain_text     → plain text for search / preview
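
Because the result object is built with js-sys rather than serde, the generated typings may describe it loosely. The sketch below shows one way to give it a checked shape on the TypeScript side. The interface and function names are hypothetical, and the field types are inferred only from the comments above, so treat them as assumptions rather than the package's declared types.

```typescript
// Hypothetical typings inferred from the field comments in this README.
interface PostImage {
  src: string;
  alt: string;
  is_base64: boolean;
}

interface RenderedPost {
  html: string;
  html_no_images: string;
  images: PostImage[];
  plain_text: string;
}

function isPostImage(value: unknown): value is PostImage {
  if (typeof value !== "object" || value === null) return false;
  const { src, alt, is_base64 } = value as Record<string, unknown>;
  return typeof src === "string" && typeof alt === "string" && typeof is_base64 === "boolean";
}

// Validates an untyped render_post() result once, at the boundary.
function asRenderedPost(value: unknown): RenderedPost {
  if (typeof value !== "object" || value === null) {
    throw new TypeError("render_post result is not an object");
  }
  const { html, html_no_images, images, plain_text } = value as Record<string, unknown>;
  if (
    typeof html !== "string" ||
    typeof html_no_images !== "string" ||
    typeof plain_text !== "string" ||
    !Array.isArray(images) ||
    !images.every(isPostImage)
  ) {
    throw new TypeError("render_post result has an unexpected shape");
  }
  return { html, html_no_images, images, plain_text };
}
```

A call site could then read `const post = asRenderedPost(render_post(rawContent, isMarkdown, ORIGIN));` and get autocompletion on the four fields.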

Render a comment

const comment = render_comment(rawComment, isMarkdown, ORIGIN);
// comment.html
// comment.plain_text

Display with React (external-link dialog)

External links are tagged with data-external="true". Use html-react-parser to intercept them:

// npm install html-react-parser
import parse, { domToReact, Element } from "html-react-parser";
import ExternalLinkDialog from "@/components/ExternalLinkDialog";

interface Props { html: string }

export function RichContent({ html }: Props) {
  const options = {
    replace(node: unknown) {
      if (!(node instanceof Element)) return;
      if (
        node.name === "a" &&
        node.attribs?.["data-external"] === "true"
      ) {
        return (
          <ExternalLinkDialog href={node.attribs.href}>
            {domToReact(node.children, options)}
          </ExternalLinkDialog>
        );
      }
    },
  };

  return (
    <article className="rich-content">
      {parse(html, options)}
    </article>
  );
}

Summarise

const ORIGIN = "https://pixagram.art";
const post = render_post(body, true, ORIGIN);
const summary = summarize(post.plain_text, 3); // top 3 sentences

Biography & username

const bio = sanitize_biography(rawBio);       // plain text, any length
const name = sanitize_username(rawName);      // plain text, max 50 chars

Security model

Posts

Allowed elements: h1–h6, p, br, hr, ul, ol, li, blockquote, pre, code, em, strong, del, s, sup, sub, a, img, table, thead, tbody, tfoot, tr, th, td, div, span, figure, figcaption, details, summary.

Allowed URL schemes: http, https, mailto, data (Base64 images).

Comments

Allowed elements: p, br, ul, ol, li, blockquote, code, em, strong, del, s, sup, sub, a, span.

Allowed URL schemes: http, https, mailto (no data: — no embedded images in comments).
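
The two scheme lists can be pictured as a small allowlist check. The following TypeScript sketch is illustrative only and is not taken from sanitizer.rs. Its assumptions: scheme-less (relative) URLs are accepted, whitespace and control characters are stripped before the scheme is read (browsers ignore some of them, which is a classic way to smuggle `javascript:` past a naive filter), and `data:` is accepted only for `data:image/…` payloads.

```typescript
type ContentKind = "post" | "comment";

// Scheme lists copied from the Posts and Comments sections above.
const ALLOWED_SCHEMES: Record<ContentKind, readonly string[]> = {
  post: ["http", "https", "mailto", "data"],
  comment: ["http", "https", "mailto"],
};

function isAllowedUrl(rawUrl: string, kind: ContentKind): boolean {
  // Assumption: drop whitespace/control characters before inspecting the scheme.
  const url = rawUrl.replace(/[\u0000-\u0020]/g, "");
  const match = /^([A-Za-z][A-Za-z0-9+.\-]*):/.exec(url);
  // Assumption: no scheme means a relative URL, which stays on the current origin.
  if (match === null) return true;
  const scheme = (match[1] ?? "").toLowerCase();
  if (ALLOWED_SCHEMES[kind].indexOf(scheme) === -1) return false;
  // Assumption: data: URLs are only meant for embedded images.
  return scheme !== "data" || url.toLowerCase().indexOf("data:image/") === 0;
}
```

With these rules `isAllowedUrl("data:image/png;base64,AAAA", "post")` passes while the same URL in a comment is rejected.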

Biography / username

No HTML whatsoever — pure plain text output.

External links

Any <a href> pointing outside your origin receives:

data-external="true" — React hook for the dialog component
target="_blank"
rel="noopener noreferrer"
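
To make "pointing outside your origin" concrete, here is a hedged TypeScript sketch of the decision using the standard `URL` API. The function and interface names are hypothetical, and the rule that only http/https links count as external candidates (so `mailto:` is left untagged) is an assumption. The crate's Rust logic may differ in detail.

```typescript
interface LinkAttrs {
  href: string;
  target?: string;
  rel?: string;
  "data-external"?: string;
}

function parseUrl(value: string, base?: string): URL | null {
  try {
    return new URL(value, base);
  } catch {
    return null;
  }
}

function tagLink(href: string, siteOrigin: string): LinkAttrs {
  const site = parseUrl(siteOrigin);
  // Relative hrefs resolve against the site origin and therefore stay internal.
  const resolved = parseUrl(href, siteOrigin);
  if (site === null || resolved === null) return { href };
  const isWeb = resolved.protocol === "http:" || resolved.protocol === "https:";
  if (!isWeb || resolved.origin === site.origin) return { href };
  return { href, target: "_blank", rel: "noopener noreferrer", "data-external": "true" };
}
```

Resolving before comparing matters: a protocol-relative href such as `//example.com/x` has no scheme of its own but still leaves the site, and origin comparison catches it.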

What the sanitiser blocks

All <script>, <style>, <iframe>, <object>, <embed>, <form>, … tags
javascript: and vbscript: URL schemes
Event handler attributes (onclick, onload, …)
Arbitrary data-* attributes (only explicitly allowed ones pass through)
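
All four items follow from one allowlist rule: anything not explicitly permitted is dropped. The TypeScript sketch below illustrates that rule with a deliberately tiny, hypothetical table. The attribute lists are assumptions chosen for the example and are not the crate's real configuration.

```typescript
// Hypothetical, trimmed allowlist: tag name -> attributes permitted on it.
const ALLOWED_ATTRS: Record<string, readonly string[]> = {
  p: [],
  em: [],
  strong: [],
  code: [],
  a: ["href", "class", "target", "rel", "data-external"],
  img: ["src", "alt"],
};

// Returns the surviving attributes, or null when the element itself is not allowed.
function filterElement(tag: string, attrs: Record<string, string>): Record<string, string> | null {
  const tagName = tag.toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(ALLOWED_ATTRS, tagName)) return null;
  const allowed = ALLOWED_ATTRS[tagName] ?? [];
  const kept: Record<string, string> = {};
  for (const key of Object.keys(attrs)) {
    const attr = key.toLowerCase();
    const value = attrs[key];
    // onclick, onload, style, data-anything-else: none are listed, so none survive.
    if (value !== undefined && allowed.indexOf(attr) !== -1) kept[attr] = value;
  }
  return kept;
}
```

Nothing here names `onclick` or `<script>` explicitly. They are blocked because they are absent from the table, which is what makes an allowlist safer than a blocklist when new attributes or elements appear.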

@mentions and #hashtags

Applied to text nodes only (never inside existing HTML attributes):

@username   →  <a href="/@username"  class="mention">@username</a>
#hashtag    →  <a href="/trending/#hashtag" class="hashtag">#hashtag</a>

Username character set: [A-Za-z0-9_.\-] up to 64 chars. Hashtag character set: [A-Za-z0-9_] up to 100 chars.
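
A single left-to-right pass is enough to apply these rules without regular-expression backtracking. The TypeScript sketch below is an illustration of that idea, not a port of sanitizer.rs. Its assumptions: the input is the raw (unescaped) content of one text node, a `@` or `#` only starts a link when the preceding character is not a word character (so `a@b.com` is left alone), and names longer than the limit are cut at the limit.

```typescript
const MENTION_MAX = 64;
const HASHTAG_MAX = 100;

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

const isHashtagChar = (ch: string): boolean => /^[A-Za-z0-9_]$/.test(ch);
const isMentionChar = (ch: string): boolean => /^[A-Za-z0-9_.\-]$/.test(ch);

function linkifyText(text: string): string {
  let out = "";
  let plainStart = 0; // start of the not-yet-emitted plain text
  let i = 0;
  while (i < text.length) {
    const ch = text.charAt(i);
    const atBoundary = i === 0 || !isHashtagChar(text.charAt(i - 1));
    if ((ch === "@" || ch === "#") && atBoundary) {
      const isMention = ch === "@";
      const accepts = isMention ? isMentionChar : isHashtagChar;
      const max = isMention ? MENTION_MAX : HASHTAG_MAX;
      let end = i + 1;
      while (end < text.length && end - i - 1 < max && accepts(text.charAt(end))) end++;
      if (end > i + 1) {
        const label = text.slice(i + 1, end);
        out += escapeHtml(text.slice(plainStart, i));
        out += isMention
          ? `<a href="/@${label}" class="mention">@${label}</a>`
          : `<a href="/trending/#${label}" class="hashtag">#${label}</a>`;
        i = end;
        plainStart = end;
        continue;
      }
    }
    i++;
  }
  return out + escapeHtml(text.slice(plainStart));
}
```

Because the username character set includes `.`, a mention at the end of a sentence (`@alice.`) would swallow the full stop in this sketch. A real implementation would need a rule for trailing punctuation.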

Images

post.images.forEach(img => {
  console.log(img.src);       // full URL or data:image/… base64 string
  console.log(img.alt);       // alt text
  console.log(img.is_base64); // true for data: URIs
});

// Render card without inline images, show first image as cover:
const coverSrc = post.images[0]?.src ?? null;
// Use post.html_no_images in the article body

TF-IDF summariser details

Sentence splitting — rule-based; avoids splitting on common abbreviations (≤2-char tokens before .).
Tokenisation — lowercase, strip punctuation, remove stop words.
Porter-lite stemming — suffix stripping for conflation (e.g. running → run).
TF — term_count_in_sentence / sentence_length.
IDF — ln(N / df) + 1 (smoothed to avoid zero IDF when a term appears in every sentence).
Sentence score — Σ TF·IDF over all terms in the sentence.
Selection — top k sentences returned in original document order (not ranked order) for readability.
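
The scoring and selection steps above can be sketched in a few lines of TypeScript. This is a simplified illustration and not the code in summarizer.rs: it takes sentences that are already split, skips stemming, uses a tiny hypothetical stop-word list, and sums TF·IDF once per distinct term (the text above does not say whether repeated terms are counted once or per occurrence).

```typescript
// Hypothetical stop-word list, far smaller than a real one.
const STOP_WORDS = new Set<string>(["a", "an", "the", "is", "are", "of", "and", "to", "in", "it", "on"]);

function tokenize(sentence: string): string[] {
  return sentence
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0 && !STOP_WORDS.has(t));
}

function summarizeSketch(sentences: string[], k: number): string[] {
  const docs = sentences.map((sentence, index) => ({ index, sentence, tokens: tokenize(sentence) }));
  const total = docs.length;

  // df: number of sentences that contain each term at least once.
  const df = new Map<string, number>();
  docs.forEach((doc) => {
    new Set(doc.tokens).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  });

  const scored = docs.map((doc) => {
    const counts = new Map<string, number>();
    doc.tokens.forEach((t) => counts.set(t, (counts.get(t) ?? 0) + 1));
    let score = 0;
    counts.forEach((count, term) => {
      const tf = count / doc.tokens.length;                  // term_count / sentence_length
      const idf = Math.log(total / (df.get(term) ?? 1)) + 1; // ln(N / df) + 1
      score += tf * idf;
    });
    return { index: doc.index, sentence: doc.sentence, score };
  });

  return scored
    .sort((a, b) => b.score - a.score || a.index - b.index) // rank by score
    .slice(0, k)                                            // keep the top k
    .sort((a, b) => a.index - b.index)                      // restore document order
    .map((s) => s.sentence);
}
```

The two sorts mirror the last step of the list: ranking decides which sentences survive, and the second sort puts the survivors back in the order the author wrote them.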

To upgrade to production-quality stemming without writing your own stemmer, replace the stem() function in summarizer.rs with calls to the rust-stemmers crate (Snowball/Porter2 algorithm).
