pdf-tax-reader-cl
v1.0.0
PDF scraping library for Chilean tax documents. Extracts the emitter name, economic activities, and address from structured PDF documents such as 'CARPETA TRIBUTARIA ELECTRÓNICA PARA SOLICITAR CRÉDITOS'.
A Node.js library for extracting specific data from Chilean tax PDF documents. This library is designed to scrape structured PDF documents like "CARPETA TRIBUTARIA ELECTRÓNICA PARA SOLICITAR CRÉDITOS" and extract key information.
🚀 Quick Start
npm install pdf-tax-reader-cl
const { extractTaxData } = require('pdf-tax-reader-cl');
// Extract data from a PDF file
extractTaxData('./tax-document.pdf')
  .then(data => {
    console.log('Extracted Data:', data);
    // {
    //   emitterName: "GUITAL Y PARTNERS LIMITADA",
    //   economicActivities: [
    //     "ASES.COMER.PUBLICIDAD,REPONEDORES,COMERC.FRUT,VERD,BEBIDAS DE FANTASIA",
    //     "463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS"
    //   ],
    //   address: "VITACURA 4380 , Dpto. 31 , VITACURA"
    // }
  })
  .catch(error => {
    console.error('Error:', error.message);
  });
Features
- Extract emitter name from PDF documents
- Extract economic activities list
- Extract address information
- Process single PDF files or entire directories
- Save extracted data to JSON format
- Comprehensive error handling and logging
Installation
npm install pdf-tax-reader-cl
Requirements
- Node.js >= 14.0.0
- PDF files must be text-based (not scanned images)
- PDFs must follow the Chilean tax document structure
Usage
Single PDF Processing
const { extractTaxData } = require('pdf-tax-reader-cl');
// Process a single PDF file
extractTaxData('./documents/tax-document.pdf')
  .then(data => {
    console.log('Extracted Data:', data);
  })
  .catch(error => {
    console.error('Error:', error.message);
  });
Multiple PDF Processing
const { processMultiplePDFs } = require('pdf-tax-reader-cl');
// Process all PDF files in a directory
processMultiplePDFs('./documents')
  .then(results => {
    console.log('Processing completed:', results);
    // [
    //   {
    //     filename: "document1.pdf",
    //     data: { emitterName: "...", economicActivities: [...], address: "..." }
    //   },
    //   {
    //     filename: "document2.pdf",
    //     error: "Invalid PDF format"
    //   }
    // ]
  })
  .catch(error => {
    console.error('Error:', error);
  });
TypeScript Support
import { extractTaxData, ExtractedTaxData } from 'pdf-tax-reader-cl';
// Extract data with TypeScript types
extractTaxData('./path/to/document.pdf')
  .then((data: ExtractedTaxData) => {
    console.log('Extracted Data:', data);
    // data.emitterName is string | null
    // data.economicActivities is string[]
    // data.address is string | null
  })
  .catch(error => {
    console.error('Error:', error);
  });
Testing
If you're developing or contributing to this library:
# Clone the repository
git clone https://github.com/Jmzp/pdf-tax-reader-cl.git
cd pdf-tax-reader-cl
# Install dependencies
npm install
# Run tests
npm test
The test suite includes:
- Mock data validation
- Single PDF processing test
- Multiple PDF processing test
Error Handling Examples
The library provides detailed error messages for different types of invalid files:
const { extractTaxData } = require('pdf-tax-reader-cl');
// Example error handling. In a CommonJS module, `await` must run inside an async function.
async function main() {
  try {
    await extractTaxData('./invalid-file.txt');
  } catch (error) {
    console.log('Error:', error.message);
    // Output: "Invalid file extension. Expected .pdf, got: txt"
  }
  try {
    await extractTaxData('./corrupted-file.pdf');
  } catch (error) {
    console.log('Error:', error.message);
    // Output: "Invalid PDF format. File does not appear to be a valid PDF document."
  }
  try {
    await extractTaxData('./non-tax-document.pdf');
  } catch (error) {
    console.log('Error:', error.message);
    // Output: "Document does not appear to be a Chilean tax document. Missing expected tax document structure."
  }
}
main();
Data Extraction
The library extracts the following information from PDF documents:
1. Emitter Name (Nombre del emisor)
- Extracts the company or entity name that generated the document
- Pattern: Nombre del emisor: [COMPANY_NAME]
2. Economic Activities (Actividades Económicas)
- Extracts all economic activities listed in the document
- Includes both general descriptions and specific activity codes
- Pattern: Looks for lines containing activity codes (6 digits) or specific keywords
3. Address (Domicilio)
- Extracts the registered address of the taxpayer
- Pattern: Domicilio: [ADDRESS]
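The three patterns above could be approximated with regular expressions over the text extracted from a PDF. This is a minimal sketch, not the library's actual implementation: the helper name `extractFieldsFromText`, the exact regular expressions, and the sample text are assumptions made for illustration, and the keyword-based matching of activity descriptions is left out.

```javascript
// Hypothetical helper: illustrates the documented patterns, not the library's internals.
function extractFieldsFromText(text) {
  // "Nombre del emisor: [COMPANY_NAME]" -> capture the rest of the line
  const nameMatch = text.match(/Nombre del emisor:\s*(.+)/);
  // "Domicilio: [ADDRESS]" -> capture the rest of the line
  const addressMatch = text.match(/Domicilio:\s*(.+)/);
  // Keep lines that start with a 6-digit activity code followed by a description
  const economicActivities = text
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => /^\d{6}\s+\S/.test(line));

  return {
    emitterName: nameMatch ? nameMatch[1].trim() : null,
    economicActivities,
    address: addressMatch ? addressMatch[1].trim() : null,
  };
}

// Sample text assembled by hand from the example values in this README
const sampleText = [
  'Nombre del emisor: GUITAL Y PARTNERS LIMITADA',
  'Domicilio: VITACURA 4380, Dpto. 31, VITACURA',
  '463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS',
  '731001 SERVICIOS DE PUBLICIDAD PRESTADOS POR EMPRESAS',
].join('\n');

console.log(extractFieldsFromText(sampleText));
```

Fields that are not found come back as `null` (or an empty array for activities), which matches the `string | null` and `string[]` types shown in the TypeScript example.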
Output Format
The extracted data is returned in the following JSON format:
{
  "emitterName": "GUITAL Y PARTNERS LIMITADA",
  "economicActivities": [
    "ASES.COMER.PUBLICIDAD, REPONEDORES, COMERC.FRUT, VERD, BEBIDAS DE FANTASIA",
    "463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS",
    "463020 VENTA AL POR MAYOR DE BEBIDAS ALCOHOLICAS Y NO ALCOHOLICAS",
    "692000 ACTIVIDADES DE CONTABILIDAD, TENEDURIA DE LIBROS Y AUDITORIA; CONSULTO",
    "731001 SERVICIOS DE PUBLICIDAD PRESTADOS POR EMPRESAS",
    "783000 OTRAS ACTIVIDADES DE DOTACION DE RECURSOS HUMANOS",
    "854909 OTROS TIPOS DE ENSEÑANZA N.C.P.",
    "855000 ACTIVIDADES DE APOYO A LA ENSEÑANZA"
  ],
  "address": "VITACURA 4380, Dpto. 31, VITACURA"
}
API Reference
Functions
extractTaxData(pdfPath: string): Promise<ExtractedTaxData>
Extract tax data from a PDF file.
processMultiplePDFs(directoryPath: string): Promise<ProcessingResult[]>
Process multiple PDF files in a directory.
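Because each result carries either `data` or `error`, a caller will usually want to separate the two. The sketch below assumes results shaped like the `ProcessingResult` type in this README; `summarizeResults` is a hypothetical helper, not part of the library, and the sample array is made up.

```javascript
// Hypothetical helper: splits ProcessingResult-shaped objects into successes and failures.
function summarizeResults(results) {
  const succeeded = results.filter(result => result.data !== undefined);
  const failed = results.filter(result => result.error !== undefined);
  return { succeeded, failed };
}

// Made-up sample mirroring the documented result shape
const sampleResults = [
  {
    filename: 'document1.pdf',
    data: { emitterName: 'EJEMPLO LIMITADA', economicActivities: [], address: null },
  },
  { filename: 'document2.pdf', error: 'Invalid PDF format' },
];

const { succeeded, failed } = summarizeResults(sampleResults);
console.log(`${succeeded.length} succeeded, ${failed.length} failed`);
failed.forEach(result => console.log(`${result.filename}: ${result.error}`));
```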
saveToJSON(data: any, outputPath: string): void
Save extracted data to a JSON file.
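A function with this signature can be approximated with Node's built-in `fs` module. The following is a sketch only: `saveToJSONSketch` is a hypothetical stand-in, and the 2-space indentation and UTF-8 encoding are assumptions rather than documented behaviour of `saveToJSON`.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-in with the same parameters as saveToJSON(data, outputPath).
function saveToJSONSketch(data, outputPath) {
  // Pretty-print with 2-space indentation; UTF-8 keeps characters such as "Ñ" intact
  fs.writeFileSync(outputPath, JSON.stringify(data, null, 2), 'utf8');
}

// Write into a fresh temporary directory so the example does not touch real files
const outputDir = fs.mkdtempSync(path.join(os.tmpdir(), 'tax-data-'));
const outputPath = path.join(outputDir, 'extracted.json');

const sampleData = {
  emitterName: 'EJEMPLO LIMITADA',
  economicActivities: ['854909 OTROS TIPOS DE ENSEÑANZA N.C.P.'],
  address: null,
};

saveToJSONSketch(sampleData, outputPath);
console.log(fs.readFileSync(outputPath, 'utf8'));
```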
isValidPDF(dataBuffer: Buffer): boolean
Validate if a file is a valid PDF.
hasValidExtension(filePath: string): boolean
Validate file extension.
isTaxDocument(text: string): boolean
Check if the document appears to be a Chilean tax document.
validateExtractedData(data: ExtractedTaxData): ValidationResult
Validate extracted data completeness.
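The shape of `ValidationResult` is not documented here, so the sketch below assumes a hypothetical `{ isValid, missingFields }` object. It shows one plausible way to check completeness against the three fields of `ExtractedTaxData`; the library's real checks may differ.

```javascript
// Hypothetical sketch: the { isValid, missingFields } result shape is an assumption.
function validateExtractedDataSketch(data) {
  const missingFields = [];
  if (!data.emitterName) missingFields.push('emitterName');
  if (!Array.isArray(data.economicActivities) || data.economicActivities.length === 0) {
    missingFields.push('economicActivities');
  }
  if (!data.address) missingFields.push('address');
  return { isValid: missingFields.length === 0, missingFields };
}

const incompleteData = { emitterName: null, economicActivities: [], address: 'VITACURA 4380' };
console.log(validateExtractedDataSketch(incompleteData));
```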
Types
ExtractedTaxData
interface ExtractedTaxData {
  emitterName: string | null;
  economicActivities: string[];
  address: string | null;
}
ProcessingResult
interface ProcessingResult {
  filename: string;
  data?: ExtractedTaxData;
  error?: string;
}
Dependencies
- pdf-parse: For extracting text content from PDF files
- Built-in Node.js modules: fs, path
Error Handling & Validation
The library includes comprehensive error handling and validation for:
File Validation
- File existence: Checks if the file exists before processing
- File extension: Validates that the file has a .pdf extension
- File size: Ensures the file is not empty
- PDF format: Validates PDF structure and signatures
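The four file checks above can be sketched with Node's built-in `fs` and `path` modules. All function names, the returned messages, and the order of the checks are assumptions for illustration; the signature check only looks at the first five bytes, which is stricter than some PDF readers.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical re-implementations for illustration only.
function hasPdfExtension(filePath) {
  return path.extname(filePath).toLowerCase() === '.pdf';
}

function startsWithPdfSignature(dataBuffer) {
  // PDF files begin with the ASCII bytes "%PDF-"
  return dataBuffer.length >= 5 && dataBuffer.toString('latin1', 0, 5) === '%PDF-';
}

// Returns a short problem description, or null when no problem is found.
function checkFile(filePath) {
  if (!fs.existsSync(filePath)) return 'File not found';
  if (!hasPdfExtension(filePath)) return 'Invalid file extension';
  const dataBuffer = fs.readFileSync(filePath);
  if (dataBuffer.length === 0) return 'File is empty';
  if (!startsWithPdfSignature(dataBuffer)) return 'Invalid PDF format';
  return null;
}

console.log(checkFile('./does-not-exist.pdf'));
console.log(startsWithPdfSignature(Buffer.from('%PDF-1.7\n')));
```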
Content Validation
- PDF structure: Verifies the file is a valid PDF document
- Text content: Ensures the PDF contains extractable text (not just scanned images)
- Tax document structure: Validates that the document appears to be a Chilean tax document
- Data completeness: Ensures all required fields are successfully extracted
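As a rough illustration of the text and structure checks, the sketch below tests whether extracted text is non-empty and contains the two field labels documented in the Data Extraction section. The marker list and function names are hypothetical; the heuristics used by `isTaxDocument` itself are not described here and may be different.

```javascript
// Hypothetical marker list built from the field labels documented in this README.
const EXPECTED_MARKERS = ['Nombre del emisor', 'Domicilio'];

// A scanned (image-only) PDF typically yields empty or whitespace-only text.
function hasExtractableText(text) {
  return typeof text === 'string' && text.trim().length > 0;
}

// True only when there is text and every expected label appears in it.
function looksLikeTaxDocument(text) {
  return hasExtractableText(text) && EXPECTED_MARKERS.every(marker => text.includes(marker));
}

console.log(looksLikeTaxDocument('Nombre del emisor: EJEMPLO LIMITADA\nDomicilio: VITACURA 4380'));
console.log(looksLikeTaxDocument('Invoice #42'));
```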
Error Types Handled
- File not found errors
- Invalid file extensions (.txt, .doc, etc.)
- Corrupted or invalid PDF files
- Empty files
- PDFs without extractable text
- Non-tax documents
- Incomplete data extraction
- Directory access errors
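Callers that need to react differently to each failure can branch on the error message. The sketch below relies only on the three message prefixes shown in the Error Handling Examples section; the category names and the `classifyError` helper are hypothetical, and matching on message text is brittle if the wording changes between versions.

```javascript
// Hypothetical helper: maps documented error message prefixes to coarse categories.
function classifyError(message) {
  if (message.startsWith('Invalid file extension')) return 'extension';
  if (message.startsWith('Invalid PDF format')) return 'format';
  if (message.startsWith('Document does not appear to be a Chilean tax document')) {
    return 'not-tax-document';
  }
  return 'other';
}

console.log(classifyError('Invalid file extension. Expected .pdf, got: txt'));
console.log(classifyError('Something unexpected happened'));
```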
Limitations
- The library is designed specifically for Chilean tax documents with the structure shown in the example
- PDFs must be text-based (not scanned images)
- Extraction accuracy depends on the consistency of the PDF format
- The library rejects non-PDF files, corrupted PDFs, and documents that don't match the expected tax document structure
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Jorge Zapata - GitHub
Support
If you find this library useful, please consider giving it a ⭐️ on GitHub!
