@friedrich.fschr/epub-parser

v0.2.6

Published

a year ago

An epub parser which can extract chapter contents from an epub file

An EPUB file is essentially a .zip archive. Its content is built with HTML and CSS, and can in theory include JavaScript as well. By changing the file extension to .zip and extracting the contents, you can view chapter content directly by opening the corresponding HTML/XHTML files. However, the extracted files carry no reading order of their own, so the chapters will not appear in sequence. If certain chapters or resources are encrypted, this extraction method will not produce readable content.

Parsing an EPUB file involves three steps: (1) Parse files such as container.xml, .opf, and .ncx, which contain metadata (title, author, publication date, etc.), resource information (paths to images and other assets within the EPUB), and the sequential chapter display order (the spine). (2) Handle resource paths within chapters. References to resources in chapter files are only valid inside the archive, so they must be converted to paths usable in the display environment: blob URLs in browsers or absolute filesystem paths in Node.js. (3) Process EPUB encryption, signatures, and permissions, which are managed via encryption.xml, signatures.xml, and rights.xml respectively. Like container.xml, these files reside in the /META-INF/ directory with fixed filenames. Currently, @lingo-reader/epub-parser supports the first two steps; the third (encryption handling) is coming soon, so for now it works only with unencrypted EPUBs.
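
To make the first two steps more concrete, here is a minimal sketch of locating the .opf file from container.xml and resolving a chapter-relative resource path against the archive root. The helper names findOpfPath and resolveArchivePath are hypothetical and are not part of this package; the regex stands in for a real XML parser only to keep the example dependency-free.

```typescript
// Hypothetical helper (not part of the library): read the .opf path from the
// text of META-INF/container.xml. A regex keeps the sketch dependency-free;
// a real parser would use a proper XML library.
function findOpfPath(containerXml: string): string | undefined {
  const match = /<rootfile\b[^>]*\bfull-path\s*=\s*["']([^"']+)["']/i.exec(containerXml);
  return match?.[1];
}

// Hypothetical helper: resolve an href found inside one archive file
// (for example a chapter) to a path relative to the archive root.
// Percent-decoding is deliberately left out of this sketch.
function resolveArchivePath(fromFile: string, href: string): string {
  // Fragments and queries do not identify a file inside the archive.
  const withoutFragment = href.split("#")[0] ?? "";
  const clean = withoutFragment.split("?")[0] ?? "";
  // Start from the directory that contains `fromFile`.
  const segments = fromFile.split("/").slice(0, -1);
  for (const part of clean.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") segments.pop();
    else segments.push(part);
  }
  return segments.join("/");
}

const opfPath = findOpfPath(
  '<container><rootfiles><rootfile full-path="OEBPS/content.opf"/></rootfiles></container>',
);
// opfPath is "OEBPS/content.opf"
const imagePath = resolveArchivePath("OEBPS/text/ch1.xhtml", "../images/cover.jpg");
// imagePath is "OEBPS/images/cover.jpg"
```

The resolved archive path is what would then be mapped to a blob URL or an absolute filesystem path, depending on the environment.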

The parser follows the EPUB 3.3 and Open Packaging Format (OPF) 2.0.1 v1.0 specifications. Its API aims to expose all available file information comprehensively.

Install

pnpm install @lingo-reader/epub-parser

initEpubFile

import { initEpubFile } from "@lingo-reader/epub-parser";
import type { EpubFile } from "@lingo-reader/epub-parser";
/*
  type initEpubFile = (epubPath: string | File, resourceSaveDir?: string) => Promise<EpubFile>
*/

const epub: EpubFile = await initEpubFile(file);

The primary API exposed by @lingo-reader/epub-parser is initEpubFile. Given a file path or File object, it returns a promise that resolves to an initialized EpubFile instance, which provides methods to read metadata, spine information, and other EPUB data.

Parameters:

epubPath: string | File: File path or File object.
resourceSaveDir?: string: Optional (Node.js only). Specifies where to save resources like images.
- default: './images/'

Returns:

Promise<EpubFile>: Resolves to the initialized EpubFile object.

EpubFile

The EpubFile class exposes these methods:

import { EpubFile } from "@lingo-reader/epub-parser";
import { EBookParser } from "@lingo-reader/shared";

declare class EpubFile implements EBookParser {
  getFileInfo(): EpubFileInfo;
  getMetadata(): EpubMetadata;
  getManifest(): Record<string, ManifestItem>;
  getSpine(): EpubSpine;
  getGuide(): GuideReference[];
  getCollection(): CollectionItem[];
  getToc(): EpubToc;
  getPageList(): PageList;
  getNavList(): NavList;
  loadChapter(id: string): Promise<EpubProcessedChapter>;
  resolveHref(href: string): EpubResolvedHref | undefined;
  destroy(): void;
}

getManifest(): Record<string, ManifestItem>

Retrieves all resources contained in the EPUB (HTML files, images, etc.).

import type { ManifestItem } from "@lingo-reader/epub-parser";
/*
  type getManifest = () => Record<string, ManifestItem>
*/

// Keys represent resource `id`
const manifest: Record<string, ManifestItem> = epub.getManifest();

Parameters:

None

Returns:

Record<string, ManifestItem> - A dictionary mapping each resource id to its descriptor:

interface ManifestItem {
  // Unique resource identifier
  id: string;
  // Path within the EPUB (ZIP) archive
  href: string;
  // MIME type (e.g., "application/xhtml+xml")
  mediaType: string;
  // Special role (e.g., "cover-image")
  properties?: string;
  // Associated media overlay for audio/video
  mediaOverlay?: string;
  // Fallback resources when this item cannot be loaded
  fallback?: string[];
}
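
As an illustration of working with the manifest dictionary, the sketch below filters it for images and looks up the cover image. The ManifestItem interface is copied from above so the block is self-contained; listImages and findCoverImage are hypothetical helpers, and treating properties as a space-separated list follows the EPUB 3 convention rather than anything this package guarantees.

```typescript
// Local copy of the interface documented above, so the sketch stands alone.
interface ManifestItem {
  id: string;
  href: string;
  mediaType: string;
  properties?: string;
  mediaOverlay?: string;
  fallback?: string[];
}

// Hypothetical helper: every manifest entry whose MIME type is an image type.
function listImages(manifest: Record<string, ManifestItem>): ManifestItem[] {
  return Object.values(manifest).filter((item) => item.mediaType.startsWith("image/"));
}

// Hypothetical helper: the entry flagged as the cover image, if any.
// `properties` is treated as a space-separated list of values.
function findCoverImage(manifest: Record<string, ManifestItem>): ManifestItem | undefined {
  return Object.values(manifest).find((item) =>
    (item.properties ?? "").split(/\s+/).includes("cover-image"),
  );
}

// Assumed usage with the parser (not executed here):
// const cover = findCoverImage(epub.getManifest());
```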

getSpine(): EpubSpine

Returns the reading order of all content documents in the EPUB.

The linear property in SpineItem indicates whether the item is part of the primary reading flow (values: "yes" or "no").

import type { EpubSpine } from "@lingo-reader/epub-parser";
/*
  type getSpine = () => EpubSpine
*/

const spine: EpubSpine = epub.getSpine();

Parameters:

None

Returns:

EpubSpine - An ordered array of spine items:

type SpineItem = ManifestItem & {
  /**
   * Reading progression flag
   * - "yes": Primary reading content (default)
   * - "no": Supplementary material
   */
  linear?: string;
};
type EpubSpine = SpineItem[];
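
A short sketch of how the linear flag might be used: keep only the primary reading flow and step from one chapter to the next. The SpineItem type here is a reduced local stand-in for the one above, and primaryReadingOrder and nextChapterId are hypothetical helpers, not library functions.

```typescript
// Reduced local stand-in for SpineItem (only the fields this sketch needs).
interface SpineItem {
  id: string;
  href: string;
  mediaType: string;
  linear?: string;
}

// Keep items in the primary reading flow. A missing `linear` counts as "yes",
// matching the default described above.
function primaryReadingOrder(spine: SpineItem[]): SpineItem[] {
  return spine.filter((item) => item.linear !== "no");
}

// Id of the chapter that follows `currentId` in the primary flow, or undefined
// when `currentId` is the last chapter or is not part of that flow.
function nextChapterId(spine: SpineItem[], currentId: string): string | undefined {
  const linear = primaryReadingOrder(spine);
  const index = linear.findIndex((item) => item.id === currentId);
  return index === -1 ? undefined : linear[index + 1]?.id;
}

// Assumed usage with the parser (not executed here):
// const next = nextChapterId(epub.getSpine(), currentChapterId);
```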

loadChapter(id: string): Promise<EpubProcessedChapter>

loadChapter takes a chapter id and returns a promise that resolves to the processed chapter object, or to undefined if the chapter doesn't exist.

const spine = epub.getSpine();
const fileInfo = epub.getFileInfo();

// Load the first chapter. 'html' is the processed HTML chapter string;
// 'css' lists the chapter's stylesheets, each with an 'href' that is a blob URL
// in browsers or an absolute path in Node.js, which can be read directly.
const { html, css } = await epub.loadChapter(spine[0].id);

Parameters:

id: string - The chapter id from spine

Returns:

Promise<EpubProcessedChapter | undefined> - Processed chapter content

interface EpubCssPart {
  id: string;
  href: string;
}

interface EpubProcessedChapter {
  css: EpubCssPart[];
  html: string;
}

In an EPUB file, each chapter is typically an XHTML (or HTML) file. The processed chapter object therefore has two parts: html, the content string under the <body> tag, and css, the stylesheets parsed from the <link> tags in the chapter file. Each stylesheet is given as an EpubCssPart whose href is a blob URL (or an absolute filesystem path in a Node.js environment) and whose id identifies that URL. In a browser, the blob URL can be referenced directly in a <link> tag or fetched via the Fetch API to obtain the CSS text for further processing; in Node.js, the file at the absolute path can be read instead.
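
One way to combine the two parts for display is to rebuild a full document string, with one <link> tag per stylesheet. The sketch below assumes a browser-style setting where each href is directly usable; buildChapterDocument and escapeAttribute are hypothetical helpers, and the interfaces are local copies of the ones above.

```typescript
// Local copies of the interfaces documented above.
interface EpubCssPart {
  id: string;
  href: string;
}
interface EpubProcessedChapter {
  css: EpubCssPart[];
  html: string;
}

// Escape a value so it is safe inside a double-quoted HTML attribute.
function escapeAttribute(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
}

// Hypothetical helper: wrap the chapter body in a document that links
// every stylesheet listed in `chapter.css`.
function buildChapterDocument(chapter: EpubProcessedChapter): string {
  const links = chapter.css
    .map((part) => `<link rel="stylesheet" href="${escapeAttribute(part.href)}">`)
    .join("\n");
  return [
    "<!DOCTYPE html>",
    "<html>",
    "<head>",
    links,
    "</head>",
    "<body>",
    chapter.html,
    "</body>",
    "</html>",
  ].join("\n");
}

// Assumed usage with the parser (not executed here):
// const chapter = await epub.loadChapter(spine[0].id);
// if (chapter) iframe.srcdoc = buildChapterDocument(chapter);
```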

Internal chapter navigation in EPUBs is handled through <a> tags' href attributes. To distinguish internal links from external links and facilitate internal navigation logic, internal links are prefixed with epub:. These links can be resolved using the resolveHref function. The handling of such links is managed at the UI layer, while epub-parser only provides the corresponding chapter HTML and selector functionality.

resolveHref(href: string): EpubResolvedHref | undefined

resolveHref parses internal links into a chapter ID and a CSS selector within the book's HTML.

If an external link (e.g., https://www.example.com) or an invalid internal link is provided, it returns undefined.

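const toc: EpubToc = epub.getToc();
// The result is undefined for external or invalid links, so check it before use.
const resolved = epub.resolveHref(toc[0].href);
// 'id' is the chapter ID, 'selector' is a DOM selector (e.g., `[id="ididid"]`)
const id = resolved?.id;
const selector = resolved?.selector;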

Parameters:

href: string - The internal resource path.

Returns:

EpubResolvedHref | undefined - The resolved internal link. Returns undefined if the path is invalid.

interface EpubResolvedHref {
  id: string;
  selector: string;
}
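
Since link handling is left to the UI layer, a click handler needs to decide what to do with each href. The following sketch shows one possible routing function. It relies on the epub: prefix described earlier and takes the resolver as a parameter so it can run without the library; routeLink and LinkAction are hypothetical names.

```typescript
// Local copy of the interface documented above.
interface EpubResolvedHref {
  id: string;
  selector: string;
}

// Hypothetical result type describing what the UI should do with a link.
type LinkAction =
  | { kind: "internal"; id: string; selector: string }
  | { kind: "external"; href: string }
  | { kind: "unresolved"; href: string };

// Hypothetical helper: classify an href from a chapter's <a> tag.
// `resolveHref` is injected; with the parser it would be
// (href) => epub.resolveHref(href).
function routeLink(
  href: string,
  resolveHref: (href: string) => EpubResolvedHref | undefined,
): LinkAction {
  // Internal links carry the "epub:" prefix; anything else is external.
  if (!href.startsWith("epub:")) return { kind: "external", href };
  const resolved = resolveHref(href);
  return resolved
    ? { kind: "internal", id: resolved.id, selector: resolved.selector }
    : { kind: "unresolved", href };
}
```

For an internal action, the UI would load the chapter with the returned id and then scroll to the element matching selector.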

getToc(): EpubToc

The toc structure corresponds to the navMap section of the EPUB's .ncx file, which contains the book's navigation hierarchy.

import type { EpubToc } from "@lingo-reader/epub-parser";
/*
  type getToc = () => EpubToc
*/

const toc: EpubToc = epub.getToc();

Parameters:

None

Returns:

EpubToc:

interface NavPoint {
  // Display text of the table of contents entry
  label: string;

  // Resource path within the EPUB file (preprocessed format).
  // Can be resolved using resolveHref()
  href: string;

  // Chapter identifier
  id: string;

  // Reading order sequence
  playOrder: string;

  // Nested sub-entries (optional)
  children?: NavPoint[];
}

/** EPUB table of contents structure (NCX navMap representation) */
type EpubToc = NavPoint[];
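
Because NavPoint entries nest through children, a UI that renders a simple indented list may want a flat view. Below is a minimal sketch of such a traversal; flattenToc and FlatTocEntry are hypothetical names, and NavPoint is a local copy of the interface above.

```typescript
// Local copy of the interface documented above.
interface NavPoint {
  label: string;
  href: string;
  id: string;
  playOrder: string;
  children?: NavPoint[];
}

// Hypothetical flat entry: the nesting level is kept as `depth`.
interface FlatTocEntry {
  label: string;
  href: string;
  depth: number;
}

// Depth-first walk that preserves document order.
function flattenToc(toc: NavPoint[], depth = 0): FlatTocEntry[] {
  const flat: FlatTocEntry[] = [];
  for (const point of toc) {
    flat.push({ label: point.label, href: point.href, depth });
    if (point.children) flat.push(...flattenToc(point.children, depth + 1));
  }
  return flat;
}

// Assumed usage with the parser (not executed here):
// const entries = flattenToc(epub.getToc());
```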

destroy(): void

Cleans up generated resources (like blob URLs) created during file parsing to prevent memory leaks. In Node.js environments, it also deletes corresponding temporary files.
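
To make sure destroy() runs even when processing throws, the parser can be wrapped in a try/finally helper. This is only a sketch: withParser is a hypothetical function, and the opener is passed in as a parameter so the example does not depend on the library.

```typescript
// Anything with a destroy() method, such as the EpubFile described above.
interface Destroyable {
  destroy(): void;
}

// Hypothetical helper: open a parser, run `use`, and always call destroy()
// afterwards, whether `use` succeeds or throws.
async function withParser<T extends Destroyable, R>(
  open: () => Promise<T>,
  use: (parser: T) => Promise<R> | R,
): Promise<R> {
  const parser = await open();
  try {
    // `await` here so that `finally` runs only after `use` has settled.
    return await use(parser);
  } finally {
    parser.destroy();
  }
}

// Assumed usage with the library (not executed here):
// const title = await withParser(
//   () => initEpubFile(file),
//   (epub) => epub.getMetadata().title,
// );
```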

getFileInfo(): EpubFileInfo

import type { EpubFileInfo } from "@lingo-reader/epub-parser";
/*
  type getFileInfo = () => EpubFileInfo
*/

const fileInfo: EpubFileInfo = epub.getFileInfo();

EpubFileInfo currently includes two attributes: fileName is the name of the file, and mimetype is the file type of the EPUB, read from the /mimetype file. This value is always application/epub+zip.

Parameters:

None

Returns:

EpubFileInfo:

interface EpubFileInfo {
  fileName: string;
  mimetype: string;
}

getMetadata(): EpubMetadata

The metadata recorded in the book.

import type { EpubMetadata } from "@lingo-reader/epub-parser";
/*
  type getMetadata = () => EpubMetadata
*/

const metadata: EpubMetadata = epub.getMetadata();

Parameters:

None

Returns:

EpubMetadata:

interface EpubMetadata {
  // Title of the book
  title: string;
  // Language of the book
  language: string;
  // Description of the book
  description?: string;
  // Publisher of the EPUB file
  publisher?: string;
  // General type/genre of the book, such as novel, biography, etc.
  type?: string;
  // MIME type of the EPUB file
  format?: string;
  // Original source of the book content
  source?: string;
  // Related external resources
  relation?: string;
  // Coverage of the publication content
  coverage?: string;
  // Copyright statement
  rights?: string;
  // Includes creation time, publication date, update time, etc. of the book
  // Specific fields depend on opf:event, such as modification
  date?: Record<string, string>;

  identifier: Identifier;
  packageIdentifier: Identifier;
  creator?: Contributor[];
  contributor?: Contributor[];
  subject?: Subject[];

  metas?: Record<string, string>;
  links?: Link[];
}

identifier: Identifier

id represents the unique identifier of the resource. The scheme specifies the system or authority used to generate or assign the identifier, such as ISBN or DOI. identifierType indicates the type of identifier used by id, which is similar to scheme.

interface Identifier {
  id: string;
  scheme?: string;
  identifierType?: string;
}

packageIdentifier: Identifier

It is essentially also an Identifier. Typically, within the <package> tag, it is referenced using the unique-identifier attribute, whose value corresponds to the id of the relevant <identifier> element.

<package unique-identifier="id">

<dc:identifier id="id" opf:scheme="URI">uuid:19c0c5cb-002b-476f-baa7-fcf510414f95</dc:identifier>

</package>

creator?: Contributor[]

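Describes the people or organizations who contributed to the book. The creator and contributor fields of EpubMetadata share this structure.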

interface Contributor {
  // Name of the contributor
  contributor: string;
  // Sort-friendly version of the name
  fileAs?: string;
  // Role of the contributor
  role?: string;

  // The encoding scheme used for role or alternateScript,
  // can also represent a language, such as English or Chinese
  scheme?: string;
  // Alternative script or writing system for the contributor's name
  alternateScript?: string;
}
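
As a small example of how fileAs might be used, the sketch below orders contributors by their sort-friendly name and falls back to the display name when fileAs is absent. creatorsInFilingOrder is a hypothetical helper, and Contributor is a local copy of the interface above.

```typescript
// Local copy of the interface documented above.
interface Contributor {
  contributor: string;
  fileAs?: string;
  role?: string;
  scheme?: string;
  alternateScript?: string;
}

// Hypothetical helper: display names ordered by `fileAs`
// (falling back to `contributor` when `fileAs` is missing).
function creatorsInFilingOrder(creators: Contributor[] | undefined): string[] {
  const key = (c: Contributor): string => c.fileAs ?? c.contributor;
  // Copy before sorting so the metadata object is not mutated.
  return [...(creators ?? [])]
    .sort((a, b) => key(a).localeCompare(key(b)))
    .map((c) => c.contributor);
}

// Assumed usage with the parser (not executed here):
// const names = creatorsInFilingOrder(epub.getMetadata().creator);
```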

subject?: Subject[]

The subject or theme of the book.

interface Subject {
  // Subject, such as fiction, essay, etc.
  subject: string;
  // The authority or organization providing the code or identifier
  authority?: string;
  // Associated subject code or term
  term?: string;
}

links?: Link[]

Provides additional related resources or external links.

interface Link {
  // URL or path to the resource
  href: string;
  // Language of the resource
  hreflang?: string;
  // id
  id?: string;
  // MIME type of the resource (e.g., image/jpeg, application/xml)
  mediaType?: string;
  // Additional properties
  properties?: string;
  // Purpose or function of the link
  rel: string;
}

getGuide(): EpubGuide

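The references listed in the <guide> element of the .opf file, which point to key structural parts of the book such as the cover or the table of contents. They can be used as preview chapters of the book; the first few chapters from the spine can serve the same purpose.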

import type { EpubGuide } from "@lingo-reader/epub-parser";
/*
  type getGuide = () => EpubGuide
*/

const guide: EpubGuide = epub.getGuide();

Parameters:

None

Returns:

EpubGuide:

interface GuideReference {
  title: string;
  // The role of the resource, such as toc, loi, cover-image, etc.
  type: string;
  // The path to the resource within the EPUB file
  href: string;
}

type EpubGuide = GuideReference[];

getCollection(): EpubCollection

The content under the <collection> tag in the .opf file, used to specify whether an EPUB file belongs to a specific collection, such as a series, category, or a particular group of publications.

import type { EpubCollection } from "@lingo-reader/epub-parser";
/*
  type getCollection = () => EpubCollection
*/

const collection: EpubCollection = epub.getCollection();

Parameters:

None

Returns:

EpubCollection:

interface CollectionItem {
  // The role played within the Collection
  role: string;
  // Links to related resources
  links: string[];
}

type EpubCollection = CollectionItem[];

getPageList(): PageList

See https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.4.1.2 for the specification. correspondId is the resource's ID; the remaining fields correspond to the specification.

import type { PageList } from "@lingo-reader/epub-parser";
/*
  type getPageList = () => PageList
*/

const pageList: PageList = epub.getPageList();

Parameters:

None

Returns:

PageList:

interface PageTarget {
  label: string;
  // Page number
  value: string;
  href: string;
  playOrder: string;
  type: string;
  correspondId: string;
}
interface PageList {
  label: string;
  pageTargets: PageTarget[];
}
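
A page list is typically used for "go to page" navigation. The sketch below looks up a page target by its page number; findPageTarget is a hypothetical helper and the interfaces are local copies of the ones above. What the caller then does with the returned href depends on the UI.

```typescript
// Local copies of the interfaces documented above.
interface PageTarget {
  label: string;
  value: string;
  href: string;
  playOrder: string;
  type: string;
  correspondId: string;
}
interface PageList {
  label: string;
  pageTargets: PageTarget[];
}

// Hypothetical helper: the first target whose `value` equals the requested
// page number. `value` is a string, so "12" and "xii" are both valid inputs.
function findPageTarget(pageList: PageList, pageNumber: string): PageTarget | undefined {
  return pageList.pageTargets.find((target) => target.value === pageNumber);
}

// Assumed usage with the parser (not executed here):
// const target = findPageTarget(epub.getPageList(), "12");
```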

getNavList(): NavList

See https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.4.1.2 for the specification. correspondId is the resource's ID, label is the content of navLabel.text, and href is the path to the resource within the EPUB file.

import type { NavList } from "@lingo-reader/epub-parser";
/*
  type getNavList = () => NavList
*/

const navList: NavList = epub.getNavList();

Parameters:

None

Returns:

NavList:

interface NavTarget {
  label: string;
  href: string;
  correspondId: string;
}
interface NavList {
  label: string;
  navTargets: NavTarget[];
}
