@xcvzmoon/document-metadata-extractor
v1.1.0
Metadata extractor for document files
Document Metadata Extractor
A TypeScript library that provides a unified interface for extracting metadata from PDFs, images, Excel files, Word documents, and PowerPoint presentations.
Overview
This library is built on top of various specialized libraries to extract metadata from different document formats. Each document type uses its underlying library to parse and extract relevant metadata:
- PDF: Built on top of unpdf for extracting PDF metadata and page counts
- Images: Built on top of exiftool-vendored for extracting EXIF and image metadata
- Excel: Built on top of xlsx for extracting spreadsheet metadata, sheet information, and document properties
- DOCX/PPTX: Built on top of jszip and @xmldom/xmldom for parsing Office Open XML documents and extracting metadata from core and application properties
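To make the DOCX/PPTX item more concrete: in an Office Open XML package, core properties are stored in docProps/core.xml and application properties in docProps/app.xml. The following is a minimal sketch of reading a few core properties from such a file. It is not this library's implementation. The sample XML, the `readTag` helper, and the regex approach are illustrative assumptions; the library itself is described as using jszip and @xmldom/xmldom.

```typescript
// Sample docProps/core.xml content, trimmed to a few elements (illustrative data).
const coreXml = `<?xml version="1.0" encoding="UTF-8"?>
<cp:coreProperties
  xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Quarterly Report</dc:title>
  <dc:creator>Jane Doe</dc:creator>
  <cp:lastModifiedBy>John Roe</cp:lastModifiedBy>
  <dcterms:created xsi:type="dcterms:W3CDTF">2024-01-15T09:30:00Z</dcterms:created>
</cp:coreProperties>`;

// Hypothetical helper: returns the text of the first <tag>...</tag> element,
// or undefined if the element is absent. It does not decode XML entities or
// handle nested markup, so a real DOM parser is the safer choice.
function readTag(xml: string, tag: string): string | undefined {
  const pattern = new RegExp(`<${tag}(?:\\s[^>]*)?>([^<]*)</${tag}>`);
  const match = pattern.exec(xml);
  return match ? match[1].trim() : undefined;
}

const core = {
  title: readTag(coreXml, 'dc:title'),
  creator: readTag(coreXml, 'dc:creator'),
  lastModifiedBy: readTag(coreXml, 'cp:lastModifiedBy'),
  created: readTag(coreXml, 'dcterms:created'),
};
console.log(core);
```

A regex keeps the sketch dependency-free, but it is fragile on real documents, which is presumably why a proper XML parser is used in practice.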
Installation
npm install @xcvzmoon/document-metadata-extractor
# or
pnpm add @xcvzmoon/document-metadata-extractor
# or
yarn add @xcvzmoon/document-metadata-extractor
# or
bun add @xcvzmoon/document-metadata-extractor
Usage
import { getMetadata } from '@xcvzmoon/document-metadata-extractor';
import { readFile } from 'fs/promises';
// Read a file as Buffer
const fileBuffer = await readFile('document.pdf');
// Extract metadata
const metadata = await getMetadata(fileBuffer, { target: 'pdf' });
console.log(metadata);
Supported Document Types
PDF
Extracts PDF metadata including title, author, subject, creator, producer, creation date, modification date, and page count.
const metadata = await getMetadata(pdfBuffer, { target: 'pdf' });
// Returns: PdfMetadata with pages, title, author, subject, creator, producer, creationDate, modificationDate
Images
Extracts EXIF and image metadata using ExifTool. Returns all available tags from the image file.
const metadata = await getMetadata(imageBuffer, { target: 'image' });
// Returns: All ExifTool tags for the image
Excel
Extracts spreadsheet metadata including sheet names, sheet count, row/column counts, author, last modified by, creation/modification dates, company, and file size.
const metadata = await getMetadata(excelBuffer, { target: 'excel' });
// Returns: ExcelMetadata with sheets, sheetCount, rows, columns, author, lastModifiedBy, created, modified, company, fileSize
DOCX
Extracts Word document metadata including title, subject, creator, keywords, description, last modified by, revision, creation/modification dates, category, company, page count, word count, character count, and file size.
const metadata = await getMetadata(docxBuffer, { target: 'docx' });
// Returns: DocxMetadata with title, subject, creator, keywords, description, lastModifiedBy, revision, created, modified, category, company, pageCount, wordCount, characterCount, fileSize
PPTX
Extracts PowerPoint presentation metadata using the same extraction method as DOCX files.
const metadata = await getMetadata(pptxBuffer, { target: 'pptx' });
// Returns: DocxMetadata (same structure as DOCX)
API
getMetadata(data: Buffer, options: { target: 'image' | 'pdf' | 'docx' | 'excel' | 'pptx' })
Extracts metadata from a document buffer based on the specified target type.
Parameters:
- data: A Buffer containing the document file data
- options.target: The document type to extract metadata from
Returns:
- Promise resolving to the appropriate metadata type based on the target:
  - PdfMetadata for PDF files
  - ExifTool tags object for images
  - ExcelMetadata for Excel files
  - DocxMetadata for DOCX and PPTX files
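Because the caller must supply the target explicitly, it can help to derive it from a filename. The sketch below is one possible approach and is not part of this library: `targetFromFilename` and the extension table are hypothetical, and which extensions each target actually accepts is an assumption to verify against your own files. Only the `Target` union is taken from the `getMetadata` signature.

```typescript
// Mirrors the options.target values from the getMetadata signature.
type Target = 'image' | 'pdf' | 'docx' | 'excel' | 'pptx';

// Hypothetical extension table (an assumption, not defined by the library).
const EXTENSION_TARGETS = new Map<string, Target>([
  ['pdf', 'pdf'],
  ['docx', 'docx'],
  ['pptx', 'pptx'],
  ['xlsx', 'excel'],
  ['jpg', 'image'],
  ['jpeg', 'image'],
  ['png', 'image'],
  ['tiff', 'image'],
]);

// Returns the target for a filename, or undefined when there is no
// extension or the extension is not in the table.
function targetFromFilename(filename: string): Target | undefined {
  const dot = filename.lastIndexOf('.');
  if (dot < 0 || dot === filename.length - 1) return undefined;
  const ext = filename.slice(dot + 1).toLowerCase();
  return EXTENSION_TARGETS.get(ext);
}

console.log(targetFromFilename('report.PDF'));
console.log(targetFromFilename('notes.txt'));
```

When the helper returns a target, it can be passed as `{ target }` in the options argument of `getMetadata`; an `undefined` result is a cue to reject the file or ask the user for its type.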
Type Definitions
The library exports TypeScript type definitions for all metadata types:
- PdfMetadata
- ExcelMetadata
- DocxMetadata
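As an illustration of consuming one of these types, the sketch below formats a short summary from DOCX-style metadata. The field names come from the DOCX section above, but their types and optionality are assumptions, so the local `DocxMetadataLike` interface and the `summarizeDocx` helper are hypothetical stand-ins; in real code, prefer the exported `DocxMetadata` type.

```typescript
// Assumed shape: names match the documented DOCX fields, but the types and
// optionality here are guesses rather than the library's definitions.
interface DocxMetadataLike {
  title?: string;
  creator?: string;
  pageCount?: number;
  wordCount?: number;
}

// Builds a one-line summary, skipping any field that is missing.
function summarizeDocx(meta: DocxMetadataLike): string {
  const parts: string[] = [];
  if (meta.title) parts.push(`"${meta.title}"`);
  if (meta.creator) parts.push(`by ${meta.creator}`);
  if (meta.pageCount !== undefined) parts.push(`${meta.pageCount} pages`);
  if (meta.wordCount !== undefined) parts.push(`${meta.wordCount} words`);
  return parts.length > 0 ? parts.join(', ') : '(no metadata)';
}

console.log(summarizeDocx({ title: 'Q3 Report', creator: 'Ana', pageCount: 4 }));
```

Since PPTX results are documented as sharing the DOCX structure, the same helper would presumably apply to presentations as well.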
License
ISC
Author
Mon Albert Gamil - GitHub
