omnimail
v1.0.0-alpha.0
Universal, schema-first email parsing library for .eml, .msg, .emlx, .mbox, and TNEF.
omnimail
A universal, schema-first email parsing library for .eml, .msg, .emlx,
.mbox, .mht / .mhtml, Maildir, and TNEF (winmail.dat). Returns a
single normalized JSON shape so consumers don't have to branch on the source
format.
Status. Early development. The public API and output schema are stable; format coverage is being built out milestone-by-milestone.
Why
Existing JavaScript email parsers are typically scoped to a single format and
produce different output shapes per library. Consumers shouldn't have to know
whether they're holding an Outlook .msg or an RFC 5322 .eml; they should
call one function and render the result.
Design principles
- Pure, deterministic core. Parsing byte-like inputs is referentially transparent: no I/O, no globals, no time-dependent output.
- Universal runtime. src/ uses only Web Standard APIs (Uint8Array, DataView, TextDecoder, TextEncoder, crypto.subtle, URL). No Node built-ins. Lint-enforced.
- Synchronous parsing for bytes. parse stays synchronous for string, ArrayBuffer, Uint8Array, typed arrays, and DataView; Blob-like inputs return a Promise because reading them is async.
- Tree-shakeable. ESM-only, "sideEffects": false. A consumer importing only parseEml ships zero .msg code.
- Sanitization is the consumer's job. body.html is returned as a branded UnsafeHtml to force the consumer to acknowledge the cast at render time. Pair with DOMPurify before rendering.
- Typed errors. FormatError, MalformedError, UnsupportedError; never silent failures. Field-level parse errors yield undefined; document-level failures throw.
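
The branded-string idea behind UnsafeHtml can be sketched in a few lines of TypeScript. This is an illustration only: UnsafeHtmlSketch, SafeHtmlSketch, markUnsafe, escapeAll, and render are hypothetical names, the package's real type definition may differ, and the toy escaper merely stands in for a real sanitizer such as DOMPurify.

```typescript
// Hypothetical stand-ins; the real UnsafeHtml definition may differ.
declare const unsafeBrand: unique symbol;
declare const safeBrand: unique symbol;

type UnsafeHtmlSketch = string & { readonly [unsafeBrand]: true };
type SafeHtmlSketch = string & { readonly [safeBrand]: true };

// Parser side: tag raw HTML as unsafe. The cast is the only way in.
function markUnsafe(raw: string): UnsafeHtmlSketch {
  return raw as UnsafeHtmlSketch;
}

// Consumer side: the one place where unsafe becomes safe.
// A toy escaper stands in for a real sanitizer here.
function escapeAll(html: UnsafeHtmlSketch): SafeHtmlSketch {
  const escaped = html
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
  return escaped as SafeHtmlSketch;
}

// A render function that only accepts sanitized markup.
function render(html: SafeHtmlSketch): string {
  return `<div>${html}</div>`;
}

const fromParser = markUnsafe('<img src=x onerror=alert(1)>');
// render(fromParser); // compile error: the brands do not match
const rendered = render(escapeAll(fromParser));
console.log(rendered);
```

The brand exists only at the type level, so it costs nothing at runtime; the value is still an ordinary string.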
Installation
npm install omnimail
Requires a runtime with the standard Web platform APIs listed above. Tested on Node 20+, Bun, Deno, Cloudflare Workers, and modern browsers.
Usage
import { parse, defaultInlineImageResolver, defaultTnefUnwrapper } from 'omnimail';
const email = parse(buffer, {
resolveInlineImages: defaultInlineImageResolver,
unwrapTnef: defaultTnefUnwrapper,
});
email.format // 'eml' | 'msg' | 'emlx' | 'mbox' | 'tnef' | 'mhtml' | 'maildir'
email.from // { name?: string; address: string }
email.subject // string | undefined
email.body.text // string | undefined
email.body.html // UnsafeHtml | undefined ← sanitize before rendering!
email.attachments // Attachment[]
parse accepts common browser, Node, and Web API data shapes. Byte-like inputs
return ParsedEmail synchronously; async body-like inputs return
Promise<ParsedEmail>.
parse('From: [email protected]\r\n\r\nhello');
parse(arrayBuffer);
parse(uint8Array); // also Node Buffer
parse(dataView);
const fromBlob = await parse(fileOrBlob);
const fromFetch = await parse(await fetch('/message.eml'));
For multi-message archives (.mbox), use parseMbox directly; it returns a
lazy iterable, so messages are parsed on demand.
import { parseMbox } from 'omnimail';
for (const message of parseMbox(bytes)) {
console.log(message.subject);
}
MHTML (.mht / .mhtml)
MHTML archives (RFC 2557 — "Save Page As Web Archive") are auto-detected and
returned with format: 'mhtml'. The bytes are RFC 5322 + multipart/related,
so byte input flows through the same parse()/parseEml() machinery; the
MHTML wrapper just relabels the format and exposes
Attachment.contentLocation so consumers can match resources by URL.
import { parseMhtml, defaultMhtmlResourceResolver } from 'omnimail';
const archive = parseMhtml(bytes, {
resolveInlineImages: defaultMhtmlResourceResolver,
});
archive.body.html // root HTML, with src= rewritten to data: URIs
archive.attachments[0]?.contentLocation // 'https://example.com/logo.png'
defaultMhtmlResourceResolver rewrites <img src="https://..."> and
similar attributes to inline data: URIs when an attachment with a matching
Content-Location exists. It pairs with defaultInlineImageResolver (for
cid: references) — compose them yourself if a single archive uses both.
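
The resolver signature is not spelled out in this README, so the following is only a sketch of what composing the two rewrites could look like. It assumes a resolver is a function from (html, resources) to html; ResourceSketch, HtmlRewriter, resolveCid, resolveLocation, and compose are hypothetical names rather than omnimail exports.

```typescript
// Hypothetical shapes for illustration; omnimail's real Attachment and
// resolver types may differ.
interface ResourceSketch {
  mimeType: string;
  bytes: Uint8Array;
  contentId?: string; // assumed to be stored without angle brackets
  contentLocation?: string;
}

type HtmlRewriter = (html: string, resources: ResourceSketch[]) => string;

function toDataUri(resource: ResourceSketch): string {
  let binary = '';
  for (let i = 0; i < resource.bytes.length; i++) {
    binary += String.fromCharCode(resource.bytes[i]);
  }
  return `data:${resource.mimeType};base64,${btoa(binary)}`;
}

// Shared helper: rewrite double-quoted src="..." values that the lookup resolves.
function rewriteSrc(
  html: string,
  lookup: (ref: string) => ResourceSketch | undefined,
): string {
  return html.replace(/\bsrc="([^"]*)"/g, (whole: string, ref: string) => {
    const hit = lookup(ref);
    return hit ? `src="${toDataUri(hit)}"` : whole;
  });
}

const resolveCid: HtmlRewriter = (html, resources) =>
  rewriteSrc(html, (ref) =>
    ref.startsWith('cid:')
      ? resources.find((r) => r.contentId === ref.slice(4))
      : undefined,
  );

const resolveLocation: HtmlRewriter = (html, resources) =>
  rewriteSrc(html, (ref) => resources.find((r) => r.contentLocation === ref));

// Run each rewriter in order over the output of the previous one.
const compose = (...steps: HtmlRewriter[]): HtmlRewriter =>
  (html, resources) => steps.reduce((acc, step) => step(acc, resources), html);

const resolveBoth = compose(resolveCid, resolveLocation);
const sampleBytes = new Uint8Array([1, 2, 3]);
const rewritten = resolveBoth(
  '<img src="cid:logo"><img src="https://example.com/a.png"><img src="https://example.com/missing.png">',
  [
    { mimeType: 'image/png', bytes: sampleBytes, contentId: 'logo' },
    { mimeType: 'image/png', bytes: sampleBytes, contentLocation: 'https://example.com/a.png' },
  ],
);
console.log(rewritten);
```

The sketch only handles double-quoted src attributes and ignores percent-encoding in cid: URLs (RFC 2392); references with no matching resource are left untouched.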
Maildir
Maildir is a directory layout, not a single file: each message lives as its
own file in cur/, new/, or tmp/, and flags are encoded in the
filename suffix :2,<flags> (or ;2,<flags> on FAT). Because the parser
core has no filesystem access, the caller is responsible for walking the
directory and providing entries:
import { parseMaildir, type MaildirEntry } from 'omnimail';
const entries: MaildirEntry[] = [
{ filename: '1700000000.M.host:2,FRS', bytes, subdir: 'cur' },
{ filename: '1700000001.M.host', bytes, subdir: 'new', folder: 'Sent' },
// … one entry per file under cur/ and new/, optionally tmp/
];
for (const message of parseMaildir(entries)) {
message.format // 'maildir'
message.extras?.flags?.read // boolean — from S flag
message.extras?.flags?.flagged // boolean — from F flag
message.extras?.folder // 'Sent' (Maildir++ folder)
message.extras?.maildirFilename // original basename (round-trip key)
}
Flags surfaced from :2,<flags>: D (draft), F (flagged), P (passed),
R (replied), S (read), T (trashed). Unknown letters (Dovecot keyword
indices) are ignored. Files in tmp/ are skipped by default — pass
{ includeTmp: true } to opt in.
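
The flag mapping above can be sketched as a small standalone function. parseMaildirFlags and MaildirFlagsSketch are hypothetical names; only read and flagged appear in the example earlier in this section, and the other property names are assumed from the labels in the flag list.

```typescript
// Assumed flag shape; only `read` and `flagged` are confirmed by the README.
interface MaildirFlagsSketch {
  draft: boolean;
  flagged: boolean;
  passed: boolean;
  replied: boolean;
  read: boolean;
  trashed: boolean;
}

function parseMaildirFlags(filename: string): MaildirFlagsSketch {
  // The info suffix is ":2,<flags>" (or ";2,<flags>" on FAT) at the end of
  // the basename. Files in new/ usually have no suffix at all.
  const match = /[:;]2,([A-Za-z]*)$/.exec(filename);
  const letters = match ? match[1] : '';
  return {
    draft: letters.includes('D'),
    flagged: letters.includes('F'),
    passed: letters.includes('P'),
    replied: letters.includes('R'),
    read: letters.includes('S'),
    trashed: letters.includes('T'),
    // Any other letters (for example Dovecot keyword indices) are ignored.
  };
}

console.log(parseMaildirFlags('1700000000.M.host:2,FRS'));
console.log(parseMaildirFlags('1700000001.M.host'));
```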
A Node-only directory walker is intentionally out of scope for the core package (universal-runtime constraint). A small companion utility may ship separately.
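
Until such a utility exists, a caller on Node could collect entries along these lines. This is a minimal sketch, not the planned companion package: collectMaildirEntries and MaildirEntrySketch are hypothetical names, and the entry shape only mirrors the fields used in the example above. It reads every file into memory, which is fine for small mailboxes but not for large ones.

```typescript
import {
  existsSync, mkdirSync, mkdtempSync, readdirSync, readFileSync, rmSync, writeFileSync,
} from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

type MaildirSubdir = 'cur' | 'new' | 'tmp';

// Local mirror of the entry fields used in the README example; the real
// MaildirEntry type may carry more fields (for example `folder`).
interface MaildirEntrySketch {
  filename: string;
  bytes: Uint8Array;
  subdir: MaildirSubdir;
}

// Hypothetical helper: read every regular file under cur/ and new/
// (and tmp/ on request) into memory.
function collectMaildirEntries(root: string, includeTmp = false): MaildirEntrySketch[] {
  const subdirs: MaildirSubdir[] = includeTmp ? ['cur', 'new', 'tmp'] : ['cur', 'new'];
  const entries: MaildirEntrySketch[] = [];
  for (const subdir of subdirs) {
    const dir = join(root, subdir);
    if (!existsSync(dir)) continue;
    for (const item of readdirSync(dir, { withFileTypes: true })) {
      if (!item.isFile()) continue;
      entries.push({
        filename: item.name,
        bytes: readFileSync(join(dir, item.name)),
        subdir,
      });
    }
  }
  return entries;
}

// Demo: build a throwaway Maildir in the OS temp directory and walk it.
const demoRoot = mkdtempSync(join(tmpdir(), 'maildir-sketch-'));
for (const name of ['cur', 'new', 'tmp']) mkdirSync(join(demoRoot, name));
writeFileSync(join(demoRoot, 'cur', '1700000000.M.host;2,S'), 'Subject: seen\r\n\r\nhello');
writeFileSync(join(demoRoot, 'new', '1700000001.M.host'), 'Subject: unseen\r\n\r\nhi');
writeFileSync(join(demoRoot, 'tmp', 'partial'), 'incomplete delivery');

const demoEntries = collectMaildirEntries(demoRoot);
console.log(demoEntries.map((entry) => `${entry.subdir}/${entry.filename}`));
rmSync(demoRoot, { recursive: true, force: true });
```

The resulting array has the same filename, bytes, and subdir fields as the entries passed to parseMaildir in the example above.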
Hardening
Pass limits in ParseOptions to cap input size, attachment count, and
multipart nesting depth. Violations throw MalformedError.
parse(bytes, {
limits: {
maxBytes: 50 * 1024 * 1024, // default: 100 MiB
maxAttachments: 500, // default: 1000
maxMimeDepth: 50, // default: 100
},
});
Public API
| Export | Purpose |
| ------------------------------ | ------------------------------------------------------ |
| parse(input, options?) | Auto-detecting entry for strings, byte buffers, Blob/File, Response, Request. Throws on mbox → use parseMbox. |
| parseEml, parseMsg, parseEmlx, parseMhtml, parseTnef | Per-format parsers (better tree-shaking). |
| parseMbox, parseMboxStream | Lazy iterable / async iterable for .mbox archives. |
| parseMaildir | Lazy iterable for Maildir directory layouts. |
| detectFormat(bytes) | Format detection without parsing. |
| defaultInlineImageResolver | Rewrites cid: references to data: URIs. |
| defaultMhtmlResourceResolver | Rewrites Content-Location URLs to data: URIs. |
| defaultTnefUnwrapper | Unwraps TNEF (winmail.dat) attachments. |
| FormatError, MalformedError, UnsupportedError | Typed errors. |
See docs/api.md for the full output-shape contract,
field-level guarantees, and cross-platform notes.
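
How detectFormat works internally is not described here. As a rough illustration of what detection without parsing can mean, the sketch below checks three well-known leading signatures: the Compound File Binary header used by .msg containers, the TNEF signature 0x223E9F78 (stored little-endian), and the mbox "From " separator line. sniffFormat and SniffedFormat are hypothetical names, and a real detector needs many more rules.

```typescript
type SniffedFormat = 'cfb' | 'tnef' | 'mbox' | 'unknown';

// Compound File Binary header signature ([MS-CFB]). Outlook .msg files are
// CFB containers, but so are other formats, so this is only a first filter.
const CFB_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];
// TNEF signature 0x223E9F78, little-endian on disk.
const TNEF_MAGIC = [0x78, 0x9f, 0x3e, 0x22];
// ASCII "From " (with trailing space): the mbox separator line.
const MBOX_MAGIC = [0x46, 0x72, 0x6f, 0x6d, 0x20];

function hasPrefix(bytes: Uint8Array, magic: number[]): boolean {
  if (bytes.length < magic.length) return false;
  return magic.every((value, index) => bytes[index] === value);
}

function sniffFormat(bytes: Uint8Array): SniffedFormat {
  if (hasPrefix(bytes, CFB_MAGIC)) return 'cfb';
  if (hasPrefix(bytes, TNEF_MAGIC)) return 'tnef';
  if (hasPrefix(bytes, MBOX_MAGIC)) return 'mbox';
  return 'unknown';
}

const encoder = new TextEncoder();
console.log(sniffFormat(encoder.encode('From alice Thu Jan  1 00:00:00 1970\n')));
console.log(sniffFormat(encoder.encode('From: alice@example.com\r\n\r\nhello')));
```

Note that an RFC 5322 "From:" header does not match the mbox prefix, because the fifth byte is a colon rather than a space.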
Rendering safely
body.html is intentionally typed as UnsafeHtml — a branded string. The
brand exists to force you to acknowledge the cast at render time. Always pass
the value through a sanitizer such as DOMPurify before injecting into the
DOM.
import DOMPurify from 'dompurify';
const safeHtml = DOMPurify.sanitize(email.body.html ?? '');
container.innerHTML = safeHtml;
Testing
The test suite is ~241k tests across 60+ files, parsing ~60,000 vendored
.eml fixtures plus ~80 MHTML samples, ~30 Maildir messages, and ~90
.mbox, .msg, and TNEF samples drawn from 25+ distinct sources:
- Parser-test corpora: postal-mime, mailparser, MimeKit (C#), Apache james-mime4j, Apache JAMES, Ruby mail gem, CPython email, Mozilla Thunderbird, Stalwart mail-parser and mail-auth, Mailgun flanker, rspamd, msgreader (.msg), tnefparse (TNEF), nodemailer, php-mime-mail-parser
- MHTML corpora: Jacob Palme RFC 2557 reference set, Chromium / Blink web tests, WebKit LayoutTests, fast-mhtml real-world saved pages (Wikipedia, MDN, GitHub, Hacker News, …), mhtml2html adversarial fixtures, zsxsoft Microsoft Word .mht exports
- Maildir corpora: tedious/DovecotTesting captured Dovecot Maildir tree (Maildir++ folders, Dovecot S=, W= filename extensions, full flag-letter coverage)
- Real-world inbound: Apache SpamAssassin public corpus (2002 + 2003, ~9,200 messages spanning hundreds of senders/MTAs)
- Modern operational mail: Apache mailing-list archives (~22 lists × many months across 2024–2025) and lore.kernel.org public-inbox archives (git, lkml, bpf, netdev, and 14 other Linux subsystems)
- Corporate mail: curated subset of the Enron corpus (1999–2002)
Differential parity
Every vendored .eml fixture is also parsed by postal-mime and
mailparser; the normalized output is committed to
tests/fixtures/eml/expected/. CI asserts that omnimail extracts
at least the same information — same subject, recipients, date,
attachments by filename + mime + size, body presence. 97.9% of the
~120k fixture × parser pairs match byte-for-byte; the remaining 2.1%
are documented as named tolerance rules in
tests/integration/differential.test.ts. Most tolerances exist because
our output is more correct than the references (cross-platform charset
table, RFC-strict date validation, partial base64 recovery, no
encoded-word decoding inside addr-specs).
postal-mime and mailparser are not dev or runtime dependencies of this
package. The capture is a one-shot operation in scripts/capture-expected/,
re-run only when fixtures change. See that directory's README for details.
Spec compliance
Every spec we implement has a machine-checked manifest under
tests/spec/manifests/ listing every numbered section of the spec and
this parser's stance on it: covered (test exists), uncovered (TODO),
out-of-scope (with a reason — e.g., "S/MIME cryptographic operations
delegated to crypto libraries"), or n/a (front matter, IANA registries,
non-normative prose).
CI runs npm run spec:coverage, which cross-references each manifest with
the §X.Y markers in describe() titles under tests/spec/. Builds fail
if a section marked covered has no matching test, if a test cites a
section not in the manifest, or if any spec's uncovered count rises
above its baseline (tests/spec/manifests/uncovered-baseline.json). This
makes the compliance claim a contract that drift cannot silently break.
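
The three failure conditions can be sketched as a small checker. The actual manifest schema and the spec:coverage script are not shown in this README, so everything here is an assumption made for illustration: ManifestSketch models a manifest as a plain map from section number to stance, and checkCoverage takes describe() titles as an array of strings.

```typescript
type Stance = 'covered' | 'uncovered' | 'out-of-scope' | 'n/a';
// Assumed shape: section number -> stance. The real manifests may differ.
type ManifestSketch = Record<string, Stance>;

// Collect every §X.Y marker that appears in the test titles.
function citedSections(describeTitles: string[]): Set<string> {
  const cited = new Set<string>();
  for (const title of describeTitles) {
    for (const match of title.matchAll(/§(\d+(?:\.\d+)*)/g)) {
      cited.add(match[1]);
    }
  }
  return cited;
}

function checkCoverage(
  manifest: ManifestSketch,
  describeTitles: string[],
  uncoveredBaseline: number,
): string[] {
  const cited = citedSections(describeTitles);
  const problems: string[] = [];

  // 1. A section marked covered must be cited by at least one test.
  for (const [section, stance] of Object.entries(manifest)) {
    if (stance === 'covered' && !cited.has(section)) {
      problems.push(`§${section} is marked covered but no test cites it`);
    }
  }
  // 2. A test must not cite a section the manifest does not list.
  for (const section of cited) {
    if (!(section in manifest)) {
      problems.push(`a test cites §${section}, which is not in the manifest`);
    }
  }
  // 3. The uncovered count must not rise above the recorded baseline.
  const uncovered = Object.values(manifest).filter((s) => s === 'uncovered').length;
  if (uncovered > uncoveredBaseline) {
    problems.push(`uncovered count ${uncovered} exceeds baseline ${uncoveredBaseline}`);
  }
  return problems;
}

const sampleManifest: ManifestSketch = {
  '1': 'n/a', '3.3': 'covered', '3.4': 'covered', '3.6': 'uncovered',
};
const sampleProblems = checkCoverage(
  sampleManifest,
  ['RFC 5322 §3.3 date-time', 'RFC 5322 §9.9 no such section'],
  0,
);
console.log(sampleProblems);
```

A build would fail whenever the returned list is non-empty.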
The table below is generated by npm run spec:coverage -- --write-readme;
CI verifies it against the live manifests via npm run spec:coverage:readme.

| Spec | Covered | Uncovered | Out-of-scope | N/A | Total |
| --- | ---: | ---: | ---: | ---: | ---: |
| [MS-CFB] — Compound File Binary File Format | 10 | 0 | 2 | 19 | 31 |
| [MS-DTYP] — Windows Data Types | 1 | 0 | 0 | 5 | 6 |
| [MS-OXMSG] — Outlook Item (.msg) File Format | 9 | 0 | 31 | 31 | 71 |
| [MS-OXOMSG] — Email Object Protocol | 2 | 0 | 0 | 7 | 9 |
| Maildir — Maildir (Bernstein) + Maildir++ (Varshavchik / Courier) | 10 | 0 | 5 | 0 | 15 |
| MS-OXRTFCP — Rich Text Format (RTF) Compression Algorithm | 13 | 0 | 16 | 53 | 82 |
| MS-OXRTFEX — Rich Text Format (RTF) Extensions Algorithm | 2 | 0 | 17 | 25 | 44 |
| MS-OXTNEF — Transport Neutral Encapsulation Format (TNEF) Data Algorithm | 14 | 0 | 51 | 30 | 95 |
| RFC 2045 — Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies | 13 | 0 | 3 | 17 | 33 |
| RFC 2046 — MIME Part Two: Media Types | 12 | 0 | 17 | 15 | 44 |
| RFC 2047 — MIME Part Three: Message Header Extensions for Non-ASCII Text | 9 | 0 | 1 | 7 | 17 |
| RFC 2049 — MIME Part Five: Conformance Criteria and Examples | 1 | 0 | 2 | 8 | 11 |
| RFC 2183 — Communicating Presentation Information in Internet Messages: The Content-Disposition Header Field | 11 | 0 | 1 | 8 | 20 |
| RFC 2231 — MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations | 5 | 0 | 1 | 8 | 14 |
| RFC 2369 — The Use of URLs as Meta-Syntax for Core Mail List Commands and their Transport through Message Header Fields | 8 | 0 | 9 | 12 | 29 |
| RFC 2387 — The MIME Multipart/Related Content-type | 5 | 0 | 5 | 12 | 22 |
| RFC 2392 — Content-ID and Message-ID Uniform Resource Locators | 1 | 0 | 1 | 5 | 7 |
| RFC 2557 — MIME Encapsulation of Aggregate Documents, such as HTML (MHTML) | 9 | 0 | 11 | 13 | 33 |
| RFC 3156 — MIME Security with OpenPGP | 4 | 0 | 4 | 9 | 17 |
| RFC 3462 — The Multipart/Report Content Type for the Reporting of Mail System Administrative Messages | 1 | 0 | 1 | 3 | 5 |
| RFC 3464 — An Extensible Message Format for Delivery Status Notifications | 1 | 0 | 23 | 15 | 39 |
| RFC 3676 — The Text/Plain Format and DelSp Parameters | 4 | 0 | 5 | 16 | 25 |
| RFC 3834 — Recommendations for Automatic Responses to Electronic Mail | 3 | 0 | 16 | 9 | 28 |
| RFC 4155 — The application/mbox Media Type | 1 | 0 | 1 | 6 | 8 |
| RFC 5322 — Internet Message Format | 26 | 0 | 11 | 32 | 69 |
| RFC 5545 — Internet Calendaring and Scheduling Core Object Specification (iCalendar) | 2 | 0 | 109 | 36 | 147 |
| RFC 6376 — DomainKeys Identified Mail (DKIM) Signatures | 3 | 0 | 55 | 45 | 103 |
| RFC 6531 — SMTP Extension for Internationalized Email | 1 | 0 | 11 | 15 | 27 |
| RFC 6532 — Internationalized Email Headers | 4 | 0 | 3 | 9 | 16 |
| RFC 6854 — Update to Internet Message Format to Allow Group Syntax in the From: and Sender: Header Fields | 2 | 0 | 1 | 11 | 14 |
| RFC 7208 — Sender Policy Framework (SPF) for Authorizing Use of Domains in Email, Version 1 | 2 | 0 | 64 | 46 | 112 |
| RFC 8058 — Signaling One-Click Functionality for List Email Headers | 3 | 0 | 3 | 8 | 14 |
| RFC 8098 — Message Disposition Notification | 3 | 0 | 23 | 22 | 48 |
| RFC 8551 — Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 4.0 Message Specification | 9 | 0 | 41 | 25 | 75 |
| RFC 8601 — Message Header Field for Indicating Message Authentication Status | 2 | 0 | 29 | 38 | 69 |

See tests/spec/manifests/README.md for the manifest schema and conventions.
What this parser does NOT do
The parser is intentionally read-only and receiver-side. The following are out-of-scope by design — they're the domain of dedicated libraries or upstream MTAs:
- Cryptographic verification. DKIM signatures (RFC 6376), S/MIME signatures (RFC 8551), PGP/MIME signatures (RFC 3156) are surfaced as bytes + headers; signatures are not validated.
- Decryption. S/MIME and PGP/MIME encrypted parts are surfaced as protected blobs; decryption is delegated to crypto libraries.
- SPF / DMARC evaluation. Received-SPF and Authentication-Results round-trip verbatim; DNS lookups and policy evaluation are the MTA's job.
- iCalendar interior parsing. text/calendar parts surface as attachments with the method parameter preserved; VEVENT/VTODO/iTIP semantics are delegated to calendar libraries.
- Message generation. This is a parser. It does not compose, sign, encrypt, or send. RFC sections governing senders' MUST/SHOULD obligations are explicitly marked out-of-scope in the manifests.
- DSN field-level interpretation. multipart/report and message/delivery-status shapes are detected; per-recipient field parsing of RFC 3464 is left to consumers.
- Auto-response loop prevention. RFC 3834's Auto-Submitted header is surfaced; the loop-prevention rules govern responders, not parsers.
Versioning
This project follows Semantic Versioning. The public
API surface — every named export, its type signature, and the shape of
ParsedEmail — is the stability contract, and is mechanically enforced by
tests/unit/api-surface.test.ts. That snapshot is the source of truth: any
change to it is a breaking change and requires a major version bump.
Additive changes (new exports, new optional fields, new optional options)
go in minor releases; bug fixes and internal refactors go in patches.
Symbols re-exported from src/ but not in the snapshot are internal and
may change at any time.
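
The contents of tests/unit/api-surface.test.ts are not reproduced here. As one reading of the policy, a snapshot check over runtime exports could look like the sketch below. describeSurface and requiredBump are hypothetical helpers, and the sketch only sees export names and their runtime kind; it cannot see TypeScript signatures or the shape of ParsedEmail, which the real test is said to cover.

```typescript
type Bump = 'none' | 'minor' | 'major';

// Reduce a module namespace to sorted "name: kind" lines.
function describeSurface(moduleNamespace: Record<string, unknown>): string[] {
  return Object.keys(moduleNamespace)
    .sort()
    .map((name) => `${name}: ${typeof moduleNamespace[name]}`);
}

// Compare a committed snapshot with the current surface.
function requiredBump(snapshot: string[], current: string[]): Bump {
  const removedOrChanged = snapshot.filter((line) => !current.includes(line));
  const added = current.filter((line) => !snapshot.includes(line));
  if (removedOrChanged.length > 0) return 'major'; // something consumers rely on is gone
  if (added.length > 0) return 'minor'; // purely additive
  return 'none';
}

// Placeholder functions stand in for real exports.
const noop = (): void => {};
const released = describeSurface({ parse: noop, parseEml: noop });
const withNewExport = describeSurface({ parse: noop, parseEml: noop, detectFormat: noop });
const withRemoval = describeSurface({ parse: noop });

console.log(requiredBump(released, released));
console.log(requiredBump(released, withNewExport));
console.log(requiredBump(released, withRemoval));
```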
Acknowledgements
The test corpus is seeded from many upstream projects. Provenance, sha256
hashes, and licenses for every vendored fixture are recorded in
tests/fixtures/SOURCES.md. Our gratitude
goes to the maintainers of:
- postal-mime (MIT), mailparser (MIT-0), @kenjiuno/msgreader (Apache-2.0)
- Mozilla Thunderbird (MPL-2.0), Apache james-mime4j / Apache JAMES (Apache-2.0)
- MimeKit (MIT), Ruby mail gem (MIT), CPython (PSF-2.0)
- Stalwart mail-parser / mail-auth (Apache-2.0 OR MIT)
- Mailgun flanker (Apache-2.0), rspamd (Apache-2.0)
- Apache SpamAssassin public corpus (Apache-2.0)
- lore.kernel.org public-inbox archives (operational mail)
- Apache list archives (operational mail)
- Enron Email Corpus (public domain, FERC release)
- tnefparse (BSD), nodemailer (MIT), php-mime-mail-parser (MIT)
