@egintegrations/document-services
v0.1.0
Document processing library with Google Cloud Vision OCR, text extraction from PDFs and images, and document parsing utilities. Includes text normalization and base parser framework.
Document processing library with Google Cloud Vision OCR for text extraction from PDFs and images, plus text processing utilities.
Installation
npm install @egintegrations/document-services
Peer Dependencies
# For Google Cloud Vision OCR
npm install @google-cloud/vision
Features
- Google Cloud Vision OCR: Extract text from images and PDFs
- Text Processing: Normalize whitespace, clean OCR artifacts, extract dates/amounts
- TypeScript: Full type safety
Quick Start
OCR from Images/PDFs
import { GoogleVisionOCR } from '@egintegrations/document-services';
const ocr = new GoogleVisionOCR({
  credentials: {
    client_email: process.env.GCP_CLIENT_EMAIL,
    private_key: process.env.GCP_PRIVATE_KEY,
  },
  projectId: process.env.GCP_PROJECT_ID,
});
// Extract text from image
const result = await ocr.extractText({
  data: imageBuffer,
  mimeType: 'image/jpeg',
  filename: 'receipt.jpg',
});
if (result.success) {
  console.log(result.extractedText);
}
// Extract text from PDF
const pdfResult = await ocr.extractText({
  data: pdfBuffer,
  mimeType: 'application/pdf',
  filename: 'invoice.pdf',
});
Text Processing
import {
  cleanOCRText,
  extractAmounts,
  extractDates,
  extractLines,
} from '@egintegrations/document-services';
const rawText = 'Total: $ 100 . 00 Date: 01 / 15 / 2023';
// Clean OCR artifacts
const cleaned = cleanOCRText(rawText);
// "Total: $100.00 Date: 01/15/2023"
// Extract amounts
const amounts = extractAmounts(cleaned);
// [100.00]
// Extract dates
const dates = extractDates(cleaned);
// ['01/15/2023']
// Extract non-empty lines
const lines = extractLines(cleaned);
API Reference
GoogleVisionOCR
interface OCRConfig {
  credentials?: {
    client_email: string;
    private_key: string;
  };
  projectId?: string;
}
class GoogleVisionOCR {
  constructor(config: OCRConfig);
  extractText(document: DocumentInput): Promise<OCRResult>;
}
Text Processing Functions
- normalizeWhitespace(text: string): string - Normalize whitespace
- cleanOCRText(text: string): string - Clean OCR artifacts
- extractLines(text: string): string[] - Extract non-empty lines
- extractAmounts(text: string): number[] - Extract currency amounts
- extractDates(text: string): string[] - Extract dates (MM/DD/YYYY)
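To make the signatures above more concrete, here is a minimal sketch of how helpers with the same shapes could behave. The sketch* functions are hypothetical stand-ins written for illustration, not the package's actual implementation, and the regular expressions are assumptions (US-style dollar amounts and MM/DD/YYYY dates only).

```typescript
// Hypothetical stand-ins for illustration only; NOT the package's real code.

/** Collapse runs of whitespace into single spaces and trim both ends. */
function sketchNormalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, ' ').trim();
}

/** Split on line breaks, trim each line, and drop the empty ones. */
function sketchExtractLines(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

/** Find dollar amounts such as "$100.00" or "$1,234.50" (assumed format). */
function sketchExtractAmounts(text: string): number[] {
  const matches =
    text.match(/\$\s?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{1,2})?/g) || [];
  // Strip the currency sign, thousands separators, and spaces before parsing.
  return matches.map((m) => parseFloat(m.replace(/[$,\s]/g, '')));
}

/** Find dates written as MM/DD/YYYY (leading zeros optional). */
function sketchExtractDates(text: string): string[] {
  return (
    text.match(/\b(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01])\/\d{4}\b/g) || []
  );
}
```

The real functions may handle more formats (other currencies, other date orders), so treat this only as a guide to the input and output types.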
Environment Variables
For Google Cloud Vision:
- GCP_CLIENT_EMAIL - Service account email
- GCP_PRIVATE_KEY - Service account private key
- GCP_PROJECT_ID - Google Cloud project ID
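Since all three variables may be undefined at runtime, it can help to validate them before constructing the client. The following is a minimal sketch; loadVisionConfig and VisionEnvConfig are hypothetical names that are not part of the package, and the newline handling assumes the private key was stored with escaped "\n" sequences, which is common in .env files.

```typescript
// Hypothetical helper for illustration; not exported by the package.
interface VisionEnvConfig {
  credentials: { client_email: string; private_key: string };
  projectId: string;
}

function loadVisionConfig(
  env: Record<string, string | undefined>,
): VisionEnvConfig {
  const required = ['GCP_CLIENT_EMAIL', 'GCP_PRIVATE_KEY', 'GCP_PROJECT_ID'];
  // Treat unset and empty variables the same way: both are unusable.
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    credentials: {
      client_email: env.GCP_CLIENT_EMAIL as string,
      // Assumption: the key may contain literal "\n" pairs; restore real newlines.
      private_key: (env.GCP_PRIVATE_KEY as string).replace(/\\n/g, '\n'),
    },
    projectId: env.GCP_PROJECT_ID as string,
  };
}
```

The returned object has the same shape as OCRConfig, so it could be passed along as new GoogleVisionOCR(loadVisionConfig(process.env)).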
License
MIT
Credits
Extracted from BRS-Inbox-Scanner with Google Cloud Vision OCR integration.
