@egintegrations/document-services
v0.1.0
Document processing library with Google Cloud Vision OCR, text extraction from PDFs and images, and document parsing utilities. Includes text normalization and base parser framework.
Document processing library with Google Cloud Vision OCR for text extraction from PDFs and images, plus text processing utilities.
Installation
npm install @egintegrations/document-services
Peer Dependencies
# For Google Cloud Vision OCR
npm install @google-cloud/vision
Features
- Google Cloud Vision OCR: Extract text from images and PDFs
- Text Processing: Normalize whitespace, clean OCR artifacts, extract dates/amounts
- TypeScript: Full type safety
Quick Start
OCR from Images/PDFs
import { readFile } from 'node:fs/promises';
import { GoogleVisionOCR } from '@egintegrations/document-services';

const ocr = new GoogleVisionOCR({
  credentials: {
    client_email: process.env.GCP_CLIENT_EMAIL,
    private_key: process.env.GCP_PRIVATE_KEY,
  },
  projectId: process.env.GCP_PROJECT_ID,
});

// Extract text from image
// (loading from disk here assumes `data` accepts a Node.js Buffer)
const imageBuffer = await readFile('receipt.jpg');
const result = await ocr.extractText({
  data: imageBuffer,
  mimeType: 'image/jpeg',
  filename: 'receipt.jpg',
});
if (result.success) {
  console.log(result.extractedText);
}

// Extract text from PDF
const pdfBuffer = await readFile('invoice.pdf');
const pdfResult = await ocr.extractText({
  data: pdfBuffer,
  mimeType: 'application/pdf',
  filename: 'invoice.pdf',
});
if (pdfResult.success) {
  console.log(pdfResult.extractedText);
}
Text Processing
import {
  cleanOCRText,
  extractAmounts,
  extractDates,
  extractLines,
} from '@egintegrations/document-services';

const rawText = 'Total: $ 100 . 00 Date: 01 / 15 / 2023';

// Clean OCR artifacts
const cleaned = cleanOCRText(rawText);
// "Total: $100.00 Date: 01/15/2023"

// Extract amounts
const amounts = extractAmounts(cleaned);
// [100.00]

// Extract dates
const dates = extractDates(cleaned);
// ['01/15/2023']

// Extract non-empty lines
const lines = extractLines(cleaned);
API Reference
GoogleVisionOCR
interface OCRConfig {
  credentials?: {
    client_email: string;
    private_key: string;
  };
  projectId?: string;
}

class GoogleVisionOCR {
  constructor(config: OCRConfig);
  extractText(document: DocumentInput): Promise<OCRResult>;
}
Text Processing Functions
- normalizeWhitespace(text: string): string - Normalize whitespace
- cleanOCRText(text: string): string - Clean OCR artifacts
- extractLines(text: string): string[] - Extract non-empty lines
- extractAmounts(text: string): number[] - Extract currency amounts
- extractDates(text: string): string[] - Extract dates (MM/DD/YYYY)
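The package source is not reproduced in this README, so the following is only a rough sketch of how helpers with these signatures could behave, written from the one-line descriptions above and the Quick Start example. The `sketch` prefix, the regular expressions, and the dollar-only amount format are all assumptions; the real implementations may differ.

```typescript
// Hypothetical stand-ins for illustration only; NOT the package's source.

// Collapse runs of spaces/tabs to one space and trim each line.
function sketchNormalizeWhitespace(text: string): string {
  return text
    .split(/\r?\n/)
    .map((line) => line.replace(/[ \t]+/g, ' ').trim())
    .join('\n');
}

// Remove stray spaces that OCR tends to insert after "$" and around
// "." or "/" between digits. Deliberately simplistic: it would also
// join "2023. 5 items" into "2023.5 items".
function sketchCleanOCRText(text: string): string {
  return sketchNormalizeWhitespace(text)
    .replace(/\$\s+(?=\d)/g, '$$') // "$$" emits a literal "$"
    .replace(/(\d)\s*([./])\s*(?=\d)/g, '$1$2');
}

// Return the non-empty lines of a text, trimmed.
function sketchExtractLines(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Find dollar amounts such as "$100.00" or "$1,234.50" (assumed format).
function sketchExtractAmounts(text: string): number[] {
  const matches = text.match(/\$\d[\d,]*(?:\.\d{1,2})?/g) || [];
  return matches.map((m) => Number(m.replace(/[$,]/g, '')));
}

// Find dates written as MM/DD/YYYY (digits are not range-checked).
function sketchExtractDates(text: string): string[] {
  return text.match(/\b\d{2}\/\d{2}\/\d{4}\b/g) || [];
}
```

Cleaning before extraction matters in this sketch: the amount and date patterns only match once the stray spaces are removed, which is why the Quick Start passes `cleaned` rather than `rawText` to the extractors.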
Environment Variables
For Google Cloud Vision:
- GCP_CLIENT_EMAIL - Service account email
- GCP_PRIVATE_KEY - Service account private key
- GCP_PROJECT_ID - Google Cloud project ID
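One way to turn these variables into an OCRConfig is to validate them up front, so a missing value fails with a clear message instead of surfacing later as an authentication error. The sketch below is an assumption-laden example: `loadOCRConfig` is a hypothetical helper, not part of the package, and it treats all three variables as required even though the OCRConfig fields are optional. It also restores escaped newlines in the private key, since keys stored in .env files are often written with literal `\n` sequences.

```typescript
// Mirrors the OCRConfig shape from the API Reference.
interface OCRConfig {
  credentials?: { client_email: string; private_key: string };
  projectId?: string;
}

// Hypothetical helper (not provided by the package): build an OCRConfig
// from an environment-like object, throwing if any variable is unset.
function loadOCRConfig(env: Record<string, string | undefined>): OCRConfig {
  const required = ['GCP_CLIENT_EMAIL', 'GCP_PRIVATE_KEY', 'GCP_PROJECT_ID'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    credentials: {
      client_email: env['GCP_CLIENT_EMAIL'] as string,
      // Convert literal "\n" sequences back into real newlines.
      private_key: (env['GCP_PRIVATE_KEY'] as string).replace(/\\n/g, '\n'),
    },
    projectId: env['GCP_PROJECT_ID'],
  };
}

// Usage (assuming the import from the Quick Start):
//   const ocr = new GoogleVisionOCR(loadOCRConfig(process.env));
```

Validating first also sidesteps a TypeScript strictness issue: `process.env.X` has type `string | undefined`, while `client_email` and `private_key` are declared as `string`.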
License
MIT
Credits
Extracted from BRS-Inbox-Scanner with Google Cloud Vision OCR integration.
