omnimail

v1.0.0-alpha.0

Published

11 hours ago

Universal, schema-first email parsing library for .eml, .msg, .emlx, .mbox, and TNEF.

omnimail

A universal, schema-first email parsing library for .eml, .msg, .emlx, .mbox, .mht / .mhtml, Maildir, and TNEF (winmail.dat). Returns a single normalized JSON shape so consumers don't have to branch on the source format.

Status. Early development. The public API and output schema are stable; format coverage is being built out milestone-by-milestone.

Why

Existing JavaScript email parsers are typically scoped to a single format and produce different output shapes per library. Consumers shouldn't have to know whether they're holding an Outlook .msg or an RFC 5322 .eml; they should call one function and render the result.

Design principles

Pure, deterministic core. Parsing byte-like inputs is referentially transparent — no I/O, no globals, no time-dependent output.
Universal runtime. src/ uses only Web Standard APIs (Uint8Array, DataView, TextDecoder, TextEncoder, crypto.subtle, URL). No Node built-ins. Lint-enforced.
Synchronous parsing for bytes. parse stays synchronous for string, ArrayBuffer, Uint8Array, typed arrays, and DataView; Blob-like inputs return a Promise because reading them is async.
Tree-shakeable. ESM-only, "sideEffects": false. A consumer importing only parseEml ships zero .msg code.
Sanitization is the consumer's job. body.html is returned as a branded UnsafeHtml to force the consumer to acknowledge the cast at render time. Pair with DOMPurify before rendering.
Typed errors. FormatError, MalformedError, UnsupportedError: failures are never silent. Field-level parse errors yield undefined; document-level failures throw.
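The split between field-level and document-level failures can be sketched as follows. This is an illustration only: the FormatError class below is a local stand-in that reuses the name from the list above, and parseDateField, parseHeaderBlock, and the "empty input throws" rule are hypothetical, not omnimail's actual implementation.

```typescript
// Local stand-in: only the class name FormatError comes from the README.
class FormatError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'FormatError';
  }
}

interface HeaderSketch {
  subject: string | undefined;
  date: Date | undefined;
}

// Field-level problem: an unparseable Date value yields undefined, not an exception.
function parseDateField(raw: string | undefined): Date | undefined {
  if (raw === undefined) return undefined;
  const ms = Date.parse(raw);
  return Number.isNaN(ms) ? undefined : new Date(ms);
}

// Document-level problem (hypothetical rule): empty input cannot be a message, so throw.
// Header folding is deliberately ignored to keep the sketch short.
function parseHeaderBlock(source: string): HeaderSketch {
  if (source.length === 0) throw new FormatError('empty input');
  const headerEnd = source.indexOf('\r\n\r\n');
  const headerText = headerEnd === -1 ? source : source.slice(0, headerEnd);
  const headers = new Map<string, string>();
  for (const line of headerText.split('\r\n')) {
    const colon = line.indexOf(':');
    if (colon > 0) {
      headers.set(line.slice(0, colon).toLowerCase(), line.slice(colon + 1).trim());
    }
  }
  return {
    subject: headers.get('subject'),
    date: parseDateField(headers.get('date')),
  };
}
```

A caller can then treat a missing date as ordinary data and reserve try/catch for inputs that are not messages at all.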

Installation

npm install omnimail

Requires a runtime with the standard Web platform APIs listed above. Tested on Node 20+, Bun, Deno, Cloudflare Workers, and modern browsers.

Usage

import { parse, defaultInlineImageResolver, defaultTnefUnwrapper } from 'omnimail';

const email = parse(buffer, {
  resolveInlineImages: defaultInlineImageResolver,
  unwrapTnef:          defaultTnefUnwrapper,
});

email.format       // 'eml' | 'msg' | 'emlx' | 'mbox' | 'tnef' | 'mhtml' | 'maildir'
email.from         // { name?: string; address: string }
email.subject      // string | undefined
email.body.text    // string | undefined
email.body.html    // UnsafeHtml | undefined  ← sanitize before rendering!
email.attachments  // Attachment[]

parse accepts common browser, Node, and Web API data shapes. Byte-like inputs (strings, ArrayBuffer, typed arrays, DataView) return ParsedEmail synchronously; inputs that must be read asynchronously (Blob, File, Response, Request) return Promise<ParsedEmail>.

parse('From: sender@example.com\r\n\r\nhello');
parse(arrayBuffer);
parse(uint8Array);       // also Node Buffer
parse(dataView);

const fromBlob = await parse(fileOrBlob);
const fromFetch = await parse(await fetch('/message.eml'));
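One way to get a synchronous result for bytes and a Promise for Blob-like inputs from a single function is TypeScript overloads plus a runtime check. The sketch below assumes that approach; ParsedSketch, BlobLike, ByteLike, and parseSketch are hypothetical names, and the body only extracts a Subject header rather than doing real parsing.

```typescript
// Hypothetical stand-ins; the real ParsedEmail shape is much richer.
interface ParsedSketch { subject: string | undefined }
interface BlobLike { arrayBuffer(): Promise<ArrayBuffer> }
type ByteLike = string | ArrayBuffer | ArrayBufferView;

function isBlobLike(input: ByteLike | BlobLike): input is BlobLike {
  return typeof input === 'object' && typeof (input as BlobLike).arrayBuffer === 'function';
}

// Decode bytes one code unit per byte; enough for an ASCII header sketch.
function toText(input: ByteLike): string {
  if (typeof input === 'string') return input;
  const bytes = input instanceof ArrayBuffer
    ? new Uint8Array(input)
    : new Uint8Array(input.buffer, input.byteOffset, input.byteLength);
  let out = '';
  for (let i = 0; i < bytes.length; i++) out += String.fromCharCode(bytes[i] ?? 0);
  return out;
}

function parseBytes(input: ByteLike): ParsedSketch {
  const match = /^Subject:[ \t]*(.*)$/im.exec(toText(input));
  return { subject: match ? match[1] : undefined };
}

// Overloads: the return type follows the input type.
function parseSketch(input: ByteLike): ParsedSketch;
function parseSketch(input: BlobLike): Promise<ParsedSketch>;
function parseSketch(input: ByteLike | BlobLike): ParsedSketch | Promise<ParsedSketch> {
  if (isBlobLike(input)) return input.arrayBuffer().then(parseBytes);
  return parseBytes(input);
}
```

With this shape, callers holding bytes never pay for an await, and the compiler flags a missing await on Blob-like inputs.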

For multi-message archives (.mbox), use parseMbox directly — it returns a lazy iterable so messages are parsed on demand.

import { parseMbox } from 'omnimail';

for (const message of parseMbox(bytes)) {
  console.log(message.subject);
}
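To show what "lazy" means here, the following is a simplified generator that yields one raw message at a time by scanning for the next "From " separator line only when the consumer asks for it. It is a sketch under stated assumptions, not parseMbox itself: iterateMboxMessages is a hypothetical name, it works on a string rather than bytes, and it ignores ">From " unescaping and other mbox variant rules.

```typescript
// Yields the raw text of each message in an mbox string, one at a time.
// Assumes LF line endings and a leading "From " separator line per message.
function* iterateMboxMessages(text: string): Generator<string> {
  if (!text.startsWith('From ')) return; // not an mbox in this simplified model
  let pos = 0;
  while (pos < text.length) {
    // Find the next separator only when the consumer pulls another message.
    const next = text.indexOf('\nFrom ', pos);
    const end = next === -1 ? text.length : next + 1;
    const chunk = text.slice(pos, end);
    // Drop the "From " separator line itself.
    const firstNewline = chunk.indexOf('\n');
    yield firstNewline === -1 ? '' : chunk.slice(firstNewline + 1);
    pos = end;
  }
}
```

Breaking out of a for...of loop over such a generator stops the scan, so later messages in a large archive are never touched.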

MHTML (`.mht` / `.mhtml`)

MHTML archives (RFC 2557 — "Save Page As Web Archive") are auto-detected and returned with format: 'mhtml'. The bytes are RFC 5322 + multipart/related, so byte input flows through the same parse()/parseEml() machinery; the MHTML wrapper just relabels the format and exposes Attachment.contentLocation so consumers can match resources by URL.

import { parseMhtml, defaultMhtmlResourceResolver } from 'omnimail';

const archive = parseMhtml(bytes, {
  resolveInlineImages: defaultMhtmlResourceResolver,
});

archive.body.html       // root HTML, with src= rewritten to data: URIs
archive.attachments[0]?.contentLocation  // 'https://example.com/logo.png'

defaultMhtmlResourceResolver rewrites <img src="https://..."> and similar attributes to inline data: URIs when an attachment with a matching Content-Location exists. It pairs with defaultInlineImageResolver (for cid: references) — compose them yourself if a single archive uses both.
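The Content-Location matching described above can be sketched roughly as follows. This is not the code behind defaultMhtmlResourceResolver: ResourceSketch, toBase64, and inlineByContentLocation are hypothetical names, only double-quoted src attributes are handled, and URLs are compared as exact strings with no relative-URL resolution.

```typescript
// Hypothetical resource shape; only contentLocation is named in the README.
interface ResourceSketch {
  contentLocation: string | undefined;
  mimeType: string;
  bytes: Uint8Array;
}

const B64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Small standalone base64 encoder so the sketch needs no runtime globals.
function toBase64(bytes: Uint8Array): string {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const n = ((bytes[i] ?? 0) << 16) | ((bytes[i + 1] ?? 0) << 8) | (bytes[i + 2] ?? 0);
    out += B64.charAt((n >> 18) & 63) + B64.charAt((n >> 12) & 63);
    out += i + 1 < bytes.length ? B64.charAt((n >> 6) & 63) : '=';
    out += i + 2 < bytes.length ? B64.charAt(n & 63) : '=';
  }
  return out;
}

// Replace src="URL" with a data: URI when a resource has that exact Content-Location.
function inlineByContentLocation(html: string, resources: ResourceSketch[]): string {
  const byUrl = new Map<string, ResourceSketch>();
  for (const resource of resources) {
    if (resource.contentLocation) byUrl.set(resource.contentLocation, resource);
  }
  return html.replace(/\bsrc="([^"]+)"/g, (whole: string, url: string) => {
    const hit = byUrl.get(url);
    return hit ? `src="data:${hit.mimeType};base64,${toBase64(hit.bytes)}"` : whole;
  });
}
```

References with no matching resource are left untouched, so the output stays valid even for partially captured archives.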

Maildir

Maildir is a directory layout, not a single file: each message lives as its own file in cur/, new/, or tmp/, and flags are encoded in the filename suffix :2,<flags> (or ;2,<flags> on FAT). Because the parser core has no filesystem access, the caller is responsible for walking the directory and providing entries:

import { parseMaildir, type MaildirEntry } from 'omnimail';

const entries: MaildirEntry[] = [
  { filename: '1700000000.M.host:2,FRS', bytes, subdir: 'cur' },
  { filename: '1700000001.M.host',       bytes, subdir: 'new', folder: 'Sent' },
  // … one entry per file under cur/ and new/, optionally tmp/
];

for (const message of parseMaildir(entries)) {
  message.format                    // 'maildir'
  message.extras?.flags?.read       // boolean — from S flag
  message.extras?.flags?.flagged    // boolean — from F flag
  message.extras?.folder            // 'Sent' (Maildir++ folder)
  message.extras?.maildirFilename   // original basename (round-trip key)
}

Flags surfaced from :2,<flags>: D (draft), F (flagged), P (passed), R (replied), S (read), T (trashed). Unknown letters (Dovecot keyword indices) are ignored. Files in tmp/ are skipped by default — pass { includeTmp: true } to opt in.
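A minimal sketch of decoding the flag letters from a Maildir filename follows. The read and flagged property names match the usage shown earlier; the other property names, the MaildirFlagsSketch type, and parseMaildirFlags are assumptions made for illustration and may not match omnimail's real output.

```typescript
// Property names other than `read` and `flagged` are assumptions.
interface MaildirFlagsSketch {
  draft: boolean;
  flagged: boolean;
  passed: boolean;
  replied: boolean;
  read: boolean;
  trashed: boolean;
}

function parseMaildirFlags(filename: string): MaildirFlagsSketch {
  // The info suffix is ":2,<flags>", or ";2,<flags>" on filesystems that reject ':'.
  const match = /[:;]2,([A-Za-z]*)$/.exec(filename);
  const letters = match?.[1] ?? '';
  const has = (flag: string): boolean => letters.indexOf(flag) !== -1;
  // Letters outside D/F/P/R/S/T (for example lowercase keyword letters) are ignored.
  return {
    draft: has('D'),
    flagged: has('F'),
    passed: has('P'),
    replied: has('R'),
    read: has('S'),
    trashed: has('T'),
  };
}
```

A file delivered to new/ usually has no info suffix at all, in which case every flag comes back false.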

A Node-only directory walker is intentionally out of scope for the core package (universal-runtime constraint). A small companion utility may ship separately.

Hardening

Pass limits in ParseOptions to cap input size, attachment count, and multipart nesting depth. Violations throw MalformedError.

parse(bytes, {
  limits: {
    maxBytes:       50 * 1024 * 1024,  // default: 100 MiB
    maxAttachments: 500,                // default: 1000
    maxMimeDepth:   50,                 // default: 100
  },
});
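As an illustration of how such limits might be merged with defaults and enforced, here is a small sketch. The default values are taken from the comments above; the MalformedError class is a local stand-in, and LimitsSketch, resolveLimits, assertWithinLimits, and the stats object are hypothetical. Where and when omnimail performs these checks internally is not specified here.

```typescript
// Local stand-in: only the class name MalformedError comes from the README.
class MalformedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'MalformedError';
  }
}

interface LimitsSketch {
  maxBytes: number;
  maxAttachments: number;
  maxMimeDepth: number;
}

// Defaults as documented in the comments of the example above.
const DEFAULT_LIMITS: LimitsSketch = {
  maxBytes: 100 * 1024 * 1024,
  maxAttachments: 1000,
  maxMimeDepth: 100,
};

// Caller-supplied values override the defaults field by field.
function resolveLimits(overrides: Partial<LimitsSketch> = {}): LimitsSketch {
  return { ...DEFAULT_LIMITS, ...overrides };
}

// Hypothetical check against counters gathered while parsing.
function assertWithinLimits(
  stats: { bytes: number; attachments: number; mimeDepth: number },
  limits: LimitsSketch,
): void {
  if (stats.bytes > limits.maxBytes) {
    throw new MalformedError(`input is ${stats.bytes} bytes, limit is ${limits.maxBytes}`);
  }
  if (stats.attachments > limits.maxAttachments) {
    throw new MalformedError(`${stats.attachments} attachments, limit is ${limits.maxAttachments}`);
  }
  if (stats.mimeDepth > limits.maxMimeDepth) {
    throw new MalformedError(`MIME depth ${stats.mimeDepth}, limit is ${limits.maxMimeDepth}`);
  }
}
```

Tightening the limits is mainly useful when parsing untrusted uploads, where a deeply nested or oversized message should fail fast.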

Public API

| Export | Purpose |
| ------------------------------ | ------------------------------------------------------ |
| parse(input, options?) | Auto-detecting entry for strings, byte buffers, Blob/File, Response, Request. Throws on mbox → use parseMbox. |
| parseEml, parseMsg, parseEmlx, parseMhtml, parseTnef | Per-format parsers (better tree-shaking). |
| parseMbox, parseMboxStream | Lazy iterable / async iterable for .mbox archives. |
| parseMaildir | Lazy iterable for Maildir directory layouts. |
| detectFormat(bytes) | Format detection without parsing. |
| defaultInlineImageResolver | Rewrites cid: references to data: URIs. |
| defaultMhtmlResourceResolver | Rewrites Content-Location URLs to data: URIs. |
| defaultTnefUnwrapper | Unwraps TNEF (winmail.dat) attachments. |
| FormatError, MalformedError, UnsupportedError | Typed errors. |

See docs/api.md for the full output-shape contract, field-level guarantees, and cross-platform notes.

Rendering safely

body.html is intentionally typed as UnsafeHtml — a branded string. The brand exists to force you to acknowledge the cast at render time. Always pass the value through a sanitizer such as DOMPurify before injecting into the DOM.

import DOMPurify from 'dompurify';

const safeHtml = DOMPurify.sanitize(email.body.html ?? '');
container.innerHTML = safeHtml;
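The branding technique itself can be sketched in a few lines. The definitions below are a hypothetical reconstruction, not omnimail's actual UnsafeHtml type: SanitizedHtml, markUnsafe, escapeAsText, and render are assumed names, and escapeAsText renders markup as plain text, which is a stricter substitute for DOMPurify used here only to keep the example dependency-free.

```typescript
// Branded strings: plain strings at runtime, distinct types at compile time.
type UnsafeHtml = string & { readonly __brand: 'UnsafeHtml' };
type SanitizedHtml = string & { readonly __brand: 'SanitizedHtml' };

// Parser side: the only place that should mint UnsafeHtml values.
function markUnsafe(raw: string): UnsafeHtml {
  return raw as UnsafeHtml;
}

// Consumer side: escape everything so the markup is shown as text.
// A sanitizer such as DOMPurify would take this role when markup must be preserved.
function escapeAsText(html: UnsafeHtml): SanitizedHtml {
  return html
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;') as SanitizedHtml;
}

// A sink that only accepts the sanitized brand.
function render(target: { innerHTML: string }, html: SanitizedHtml): void {
  target.innerHTML = html;
}

const unsafe = markUnsafe('<img src=x onerror="alert(1)">');
const target = { innerHTML: '' };
// render(target, unsafe);            // would not compile: wrong brand
render(target, escapeAsText(unsafe)); // compiles: the conversion is explicit
```

The brand costs nothing at runtime; its value is that the unsafe-to-safe conversion has to appear in the source where a reviewer can see it.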

Testing

The test suite is ~241k tests across 60+ files, parsing ~60,000 vendored .eml fixtures plus ~80 MHTML samples, ~30 Maildir messages, and ~90 .mbox, .msg, and TNEF samples drawn from 25+ distinct sources:

Parser-test corpora — postal-mime, mailparser, MimeKit (C#), Apache james-mime4j, Apache JAMES, Ruby mail gem, CPython email, Mozilla Thunderbird, Stalwart mail-parser and mail-auth, Mailgun flanker, rspamd, msgreader (.msg), tnefparse (TNEF), nodemailer, php-mime-mail-parser
MHTML corpora — Jacob Palme RFC 2557 reference set, Chromium / Blink web tests, WebKit LayoutTests, fast-mhtml real-world saved pages (Wikipedia, MDN, GitHub, Hacker News, …), mhtml2html adversarial fixtures, zsxsoft Microsoft Word .mht exports
Maildir corpora — tedious/DovecotTesting captured Dovecot Maildir tree (Maildir++ folders, Dovecot S=,W= filename extensions, full flag-letter coverage)
Real-world inbound — Apache SpamAssassin public corpus (2002 + 2003, ~9,200 messages spanning hundreds of senders/MTAs)
Modern operational mail — Apache mailing-list archives (~22 lists × many months across 2024–2025) and lore.kernel.org public-inbox archives (git, lkml, bpf, netdev, and 14 other Linux subsystems)
Corporate mail — curated subset of the Enron corpus (1999–2002)

Differential parity

Every vendored .eml fixture is also parsed by postal-mime and mailparser; the normalized output is committed to tests/fixtures/eml/expected/. CI asserts that omnimail extracts at least the same information — same subject, recipients, date, attachments by filename + mime + size, body presence. 97.9% of the ~120k fixture × parser pairs match byte-for-byte; the remaining 2.1% are documented as named tolerance rules in tests/integration/differential.test.ts. Most tolerances exist because our output is more correct than the references (cross-platform charset table, RFC-strict date validation, partial base64 recovery, no encoded-word decoding inside addr-specs).

postal-mime and mailparser are not dev or runtime dependencies of this package. The capture is a one-shot operation in scripts/capture-expected/, re-run only when fixtures change. See that directory's README for details.
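The "at least the same information" check can be pictured as a one-directional comparison over the fields listed above. The sketch below is an assumption about how such an assertion might look, not the contents of tests/integration/differential.test.ts: NormalizedSketch, AttachmentSummary, and missingFields are hypothetical, and the named tolerance rules are not modeled.

```typescript
interface AttachmentSummary { filename: string; mime: string; size: number }

// Hypothetical normalized shape covering only the compared fields.
interface NormalizedSketch {
  subject: string | undefined;
  recipients: string[];
  date: string | undefined;
  attachments: AttachmentSummary[];
  hasBody: boolean;
}

const attachmentKey = (a: AttachmentSummary): string => `${a.filename}|${a.mime}|${a.size}`;

// Returns the fields where `ours` carries less information than `reference`.
// Extra information in `ours` is allowed, so the check is not symmetric.
function missingFields(ours: NormalizedSketch, reference: NormalizedSketch): string[] {
  const missing: string[] = [];
  if (reference.subject !== undefined && ours.subject !== reference.subject) missing.push('subject');
  if (reference.date !== undefined && ours.date !== reference.date) missing.push('date');
  if (reference.recipients.some((r) => ours.recipients.indexOf(r) === -1)) missing.push('recipients');
  const ourKeys = ours.attachments.map(attachmentKey);
  if (reference.attachments.some((a) => ourKeys.indexOf(attachmentKey(a)) === -1)) {
    missing.push('attachments');
  }
  if (reference.hasBody && !ours.hasBody) missing.push('body');
  return missing;
}
```

An empty result means the reference parser found nothing that the parser under test missed.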

Spec compliance

Every spec we implement has a machine-checked manifest under tests/spec/manifests/ listing every numbered section of the spec and this parser's stance on it: covered (test exists), uncovered (TODO), out-of-scope (with a reason — e.g., "S/MIME cryptographic operations delegated to crypto libraries"), or n/a (front matter, IANA registries, non-normative prose).

CI runs npm run spec:coverage, which cross-references each manifest with the §X.Y markers in describe() titles under tests/spec/. Builds fail if a section marked covered has no matching test, if a test cites a section not in the manifest, or if any spec's uncovered count rises above its baseline (tests/spec/manifests/uncovered-baseline.json). This makes the compliance claim a contract that drift cannot silently break.
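The three failure conditions can be expressed as a small cross-referencing function. This is a sketch of the idea only: the actual manifest schema is documented in tests/spec/manifests/README.md and may differ, and ManifestSketch, checkCoverage, and the flat list of test titles are assumptions made for illustration.

```typescript
type Stance = 'covered' | 'uncovered' | 'out-of-scope' | 'n/a';

// Hypothetical manifest shape: section number -> stance.
interface ManifestSketch {
  spec: string;
  sections: Record<string, Stance>;
}

// Returns one message per violated rule; an empty array means the build passes.
function checkCoverage(
  manifest: ManifestSketch,
  testTitles: string[],
  uncoveredBaseline: number,
): string[] {
  // Collect every section number cited as "§X.Y" in a test title.
  const cited = new Set<string>();
  const marker = /§(\d+(?:\.\d+)*)/g;
  for (const title of testTitles) {
    let m: RegExpExecArray | null;
    while ((m = marker.exec(title)) !== null) {
      const section = m[1];
      if (section) cited.add(section);
    }
  }

  const errors: string[] = [];
  let uncovered = 0;
  for (const section of Object.keys(manifest.sections)) {
    const stance = manifest.sections[section];
    if (stance === 'covered' && !cited.has(section)) {
      errors.push(`${manifest.spec} §${section}: marked covered but no test cites it`);
    }
    if (stance === 'uncovered') uncovered++;
  }
  cited.forEach((section) => {
    if (!Object.prototype.hasOwnProperty.call(manifest.sections, section)) {
      errors.push(`${manifest.spec} §${section}: cited by a test but missing from the manifest`);
    }
  });
  if (uncovered > uncoveredBaseline) {
    errors.push(`${manifest.spec}: ${uncovered} uncovered sections exceeds baseline ${uncoveredBaseline}`);
  }
  return errors;
}
```

Because the baseline can only be lowered deliberately, a new uncovered section cannot slip in unnoticed.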

The table below is generated by npm run spec:coverage -- --write-readme; CI verifies it against the live manifests via npm run spec:coverage:readme.

| Spec | Covered | Uncovered | Out-of-scope | N/A | Total |
| --- | ---: | ---: | ---: | ---: | ---: |
| [MS-CFB] — Compound File Binary File Format | 10 | 0 | 2 | 19 | 31 |
| [MS-DTYP] — Windows Data Types | 1 | 0 | 0 | 5 | 6 |
| [MS-OXMSG] — Outlook Item (.msg) File Format | 9 | 0 | 31 | 31 | 71 |
| [MS-OXOMSG] — Email Object Protocol | 2 | 0 | 0 | 7 | 9 |
| Maildir — Maildir (Bernstein) + Maildir++ (Varshavchik / Courier) | 10 | 0 | 5 | 0 | 15 |
| MS-OXRTFCP — Rich Text Format (RTF) Compression Algorithm | 13 | 0 | 16 | 53 | 82 |
| MS-OXRTFEX — Rich Text Format (RTF) Extensions Algorithm | 2 | 0 | 17 | 25 | 44 |
| MS-OXTNEF — Transport Neutral Encapsulation Format (TNEF) Data Algorithm | 14 | 0 | 51 | 30 | 95 |
| RFC 2045 — Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies | 13 | 0 | 3 | 17 | 33 |
| RFC 2046 — MIME Part Two: Media Types | 12 | 0 | 17 | 15 | 44 |
| RFC 2047 — MIME Part Three: Message Header Extensions for Non-ASCII Text | 9 | 0 | 1 | 7 | 17 |
| RFC 2049 — MIME Part Five: Conformance Criteria and Examples | 1 | 0 | 2 | 8 | 11 |
| RFC 2183 — Communicating Presentation Information in Internet Messages: The Content-Disposition Header Field | 11 | 0 | 1 | 8 | 20 |
| RFC 2231 — MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations | 5 | 0 | 1 | 8 | 14 |
| RFC 2369 — The Use of URLs as Meta-Syntax for Core Mail List Commands and their Transport through Message Header Fields | 8 | 0 | 9 | 12 | 29 |
| RFC 2387 — The MIME Multipart/Related Content-type | 5 | 0 | 5 | 12 | 22 |
| RFC 2392 — Content-ID and Message-ID Uniform Resource Locators | 1 | 0 | 1 | 5 | 7 |
| RFC 2557 — MIME Encapsulation of Aggregate Documents, such as HTML (MHTML) | 9 | 0 | 11 | 13 | 33 |
| RFC 3156 — MIME Security with OpenPGP | 4 | 0 | 4 | 9 | 17 |
| RFC 3462 — The Multipart/Report Content Type for the Reporting of Mail System Administrative Messages | 1 | 0 | 1 | 3 | 5 |
| RFC 3464 — An Extensible Message Format for Delivery Status Notifications | 1 | 0 | 23 | 15 | 39 |
| RFC 3676 — The Text/Plain Format and DelSp Parameters | 4 | 0 | 5 | 16 | 25 |
| RFC 3834 — Recommendations for Automatic Responses to Electronic Mail | 3 | 0 | 16 | 9 | 28 |
| RFC 4155 — The application/mbox Media Type | 1 | 0 | 1 | 6 | 8 |
| RFC 5322 — Internet Message Format | 26 | 0 | 11 | 32 | 69 |
| RFC 5545 — Internet Calendaring and Scheduling Core Object Specification (iCalendar) | 2 | 0 | 109 | 36 | 147 |
| RFC 6376 — DomainKeys Identified Mail (DKIM) Signatures | 3 | 0 | 55 | 45 | 103 |
| RFC 6531 — SMTP Extension for Internationalized Email | 1 | 0 | 11 | 15 | 27 |
| RFC 6532 — Internationalized Email Headers | 4 | 0 | 3 | 9 | 16 |
| RFC 6854 — Update to Internet Message Format to Allow Group Syntax in the From: and Sender: Header Fields | 2 | 0 | 1 | 11 | 14 |
| RFC 7208 — Sender Policy Framework (SPF) for Authorizing Use of Domains in Email, Version 1 | 2 | 0 | 64 | 46 | 112 |
| RFC 8058 — Signaling One-Click Functionality for List Email Headers | 3 | 0 | 3 | 8 | 14 |
| RFC 8098 — Message Disposition Notification | 3 | 0 | 23 | 22 | 48 |
| RFC 8551 — Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 4.0 Message Specification | 9 | 0 | 41 | 25 | 75 |
| RFC 8601 — Message Header Field for Indicating Message Authentication Status | 2 | 0 | 29 | 38 | 69 |

See tests/spec/manifests/README.md for the manifest schema and conventions.

What this parser does NOT do

The parser is intentionally read-only and receiver-side. The following are out-of-scope by design — they're the domain of dedicated libraries or upstream MTAs:

Cryptographic verification. DKIM signatures (RFC 6376), S/MIME signatures (RFC 8551), PGP/MIME signatures (RFC 3156) are surfaced as bytes + headers; signatures are not validated.
Decryption. S/MIME and PGP/MIME encrypted parts are surfaced as protected blobs; decryption is delegated to crypto libraries.
SPF / DMARC evaluation. Received-SPF and Authentication-Results round-trip verbatim; DNS lookups and policy evaluation are the MTA's job.
iCalendar interior parsing. text/calendar parts surface as attachments with the method parameter preserved; VEVENT/VTODO/iTIP semantics are delegated to calendar libraries.
Message generation. This is a parser. It does not compose, sign, encrypt, or send. RFC sections governing senders' MUST/SHOULD obligations are explicitly marked out-of-scope in the manifests.
DSN field-level interpretation. multipart/report and message/delivery-status shapes are detected; per-recipient field parsing of RFC 3464 is left to consumers.
Auto-response loop prevention. RFC 3834's Auto-Submitted header is surfaced; the loop-prevention rules govern responders, not parsers.

Versioning

This project follows Semantic Versioning. The public API surface — every named export, its type signature, and the shape of ParsedEmail — is the stability contract, and is mechanically enforced by tests/unit/api-surface.test.ts. That snapshot is the source of truth: any change to it is a breaking change and requires a major version bump. Additive changes (new exports, new optional fields, new optional options) go in minor releases; bug fixes and internal refactors go in patches. Symbols re-exported from src/ but not in the snapshot are internal and may change at any time.

Acknowledgements

The test corpus is seeded from many upstream projects. Provenance, sha256 hashes, and licenses for every vendored fixture are recorded in tests/fixtures/SOURCES.md. Our gratitude goes to the maintainers of:

postal-mime (MIT), mailparser (MIT-0), @kenjiuno/msgreader (Apache-2.0)
Mozilla Thunderbird (MPL-2.0), Apache james-mime4j / Apache JAMES (Apache-2.0)
MimeKit (MIT), Ruby mail gem (MIT), CPython (PSF-2.0)
Stalwart mail-parser / mail-auth (Apache-2.0 OR MIT)
Mailgun flanker (Apache-2.0), rspamd (Apache-2.0)
Apache SpamAssassin public corpus (Apache-2.0)
lore.kernel.org public-inbox archives (operational mail)
Apache list archives (operational mail)
Enron Email Corpus (public domain, FERC release)
tnefparse (BSD), nodemailer (MIT), php-mime-mail-parser (MIT)

License

MIT
