pdf-tax-reader-cl

v1.0.0

PDF scraping library for Chilean tax documents. Extract emitter name, economic activities, and address from structured PDF documents like 'CARPETA TRIBUTARIA ELECTRÓNICA PARA SOLICITAR CRÉDITOS'

A Node.js library for extracting specific data from Chilean tax PDF documents. This library is designed to scrape structured PDF documents like "CARPETA TRIBUTARIA ELECTRÓNICA PARA SOLICITAR CRÉDITOS" and extract key information.

🚀 Quick Start

npm install pdf-tax-reader-cl

const { extractTaxData } = require('pdf-tax-reader-cl');

// Extract data from a PDF file
extractTaxData('./tax-document.pdf')
  .then(data => {
    console.log('Extracted Data:', data);
    // {
    //   emitterName: "GUITAL Y PARTNERS LIMITADA",
    //   economicActivities: [
    //     "ASES.COMER.PUBLICIDAD,REPONEDORES,COMERC.FRUT,VERD,BEBIDAS DE FANTASIA",
    //     "463011     VENTA AL POR MAYOR DE FRUTAS Y VERDURAS"
    //   ],
    //   address: "VITACURA 4380 , Dpto. 31 , VITACURA"
    // }
  })
  .catch(error => {
    console.error('Error:', error.message);
  });

Features

  • Extract emitter name from PDF documents
  • Extract economic activities list
  • Extract address information
  • Process single PDF files or entire directories
  • Save extracted data to JSON format
  • Comprehensive error handling and logging

Installation

npm install pdf-tax-reader-cl

Requirements

  • Node.js >= 14.0.0
  • PDF files must be text-based (not scanned images)
  • PDFs must follow the Chilean tax document structure

Usage

Single PDF Processing

const { extractTaxData } = require('pdf-tax-reader-cl');

// Process a single PDF file
extractTaxData('./documents/tax-document.pdf')
  .then(data => {
    console.log('Extracted Data:', data);
  })
  .catch(error => {
    console.error('Error:', error.message);
  });

Multiple PDF Processing

const { processMultiplePDFs } = require('pdf-tax-reader-cl');

// Process all PDF files in a directory
processMultiplePDFs('./documents')
  .then(results => {
    console.log('Processing completed:', results);
    // [
    //   {
    //     filename: "document1.pdf",
    //     data: { emitterName: "...", economicActivities: [...], address: "..." }
    //   },
    //   {
    //     filename: "document2.pdf",
    //     error: "Invalid PDF format"
    //   }
    // ]
  })
  .catch(error => {
    console.error('Error:', error);
  });
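
Since each entry in the result array carries either data or error, it can help to separate the two before doing anything else with them. The following is a minimal sketch that assumes the result shape shown in the comment above; partitionResults is a hypothetical helper, not something exported by pdf-tax-reader-cl, and the sample array stands in for the real output of processMultiplePDFs.

```javascript
// Hypothetical helper, not part of pdf-tax-reader-cl: splits result entries
// into successes (no `error` field) and failures (have an `error` field).
function partitionResults(results) {
  const succeeded = [];
  const failed = [];
  for (const result of results) {
    if (result.error !== undefined) {
      failed.push({ filename: result.filename, error: result.error });
    } else {
      succeeded.push({ filename: result.filename, data: result.data });
    }
  }
  return { succeeded, failed };
}

// Sample input mirroring the documented result shape (values are made up).
const sample = [
  {
    filename: 'document1.pdf',
    data: { emitterName: 'ACME LTDA', economicActivities: [], address: null },
  },
  { filename: 'document2.pdf', error: 'Invalid PDF format' },
];

const { succeeded, failed } = partitionResults(sample);
console.log(`${succeeded.length} succeeded, ${failed.length} failed`);
```

In real use, the sample array would be replaced by the value the processMultiplePDFs promise resolves to.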

TypeScript Support

import { extractTaxData, ExtractedTaxData } from 'pdf-tax-reader-cl';

// Extract data with TypeScript types
extractTaxData('./path/to/document.pdf')
  .then((data: ExtractedTaxData) => {
    console.log('Extracted Data:', data);
    // data.emitterName is string | null
    // data.economicActivities is string[]
    // data.address is string | null
  })
  .catch(error => {
    console.error('Error:', error);
  });

Testing

If you're developing or contributing to this library:

# Clone the repository
git clone https://github.com/Jmzp/pdf-tax-reader-cl.git
cd pdf-tax-reader-cl

# Install dependencies
npm install

# Run tests
npm test

The test suite includes:

  • Mock data validation
  • Single PDF processing test
  • Multiple PDF processing test

Error Handling Examples

The library rejects with a descriptive error message for each kind of invalid input:

const { extractTaxData } = require('pdf-tax-reader-cl');

// Log the rejection message for a given path.
async function logExtractionError(filePath) {
    try {
        await extractTaxData(filePath);
    } catch (error) {
        console.log('Error:', error.message);
    }
}

// `await` needs an async context in CommonJS, so wrap the calls.
(async () => {
    await logExtractionError('./invalid-file.txt');
    // Output: "Invalid file extension. Expected .pdf, got: txt"

    await logExtractionError('./corrupted-file.pdf');
    // Output: "Invalid PDF format. File does not appear to be a valid PDF document."

    await logExtractionError('./non-tax-document.pdf');
    // Output: "Document does not appear to be a Chilean tax document. Missing expected tax document structure."
})();

Data Extraction

The library extracts the following information from PDF documents:

1. Emitter Name (Nombre del emisor)

  • Extracts the company or entity name that generated the document
  • Pattern: Nombre del emisor: [COMPANY_NAME]

2. Economic Activities (Actividades Económicas)

  • Extracts all economic activities listed in the document
  • Includes both general descriptions and specific activity codes
  • Pattern: Looks for lines containing activity codes (6 digits) or specific keywords

3. Address (Domicilio)

  • Extracts the registered address of the taxpayer
  • Pattern: Domicilio: [ADDRESS]
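
The label patterns above can be illustrated with a few regular expressions. This is only a sketch under stated assumptions: the regexes, the extractFields name, and the sample text are invented for illustration and are not the library's actual implementation. It also covers only lines that begin with a 6-digit code, not the keyword-based matching mentioned for economic activities.

```javascript
// Illustrative only: the labels come from the README, but the regexes and
// the sample text are assumptions, not the library's implementation.
function extractFields(text) {
  const emitterMatch = text.match(/Nombre del emisor:\s*(.+)/);
  const addressMatch = text.match(/Domicilio:\s*(.+)/);

  // Treat any line that starts with exactly six digits as an activity entry.
  const economicActivities = text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => /^\d{6}\b/.test(line));

  return {
    emitterName: emitterMatch ? emitterMatch[1].trim() : null,
    economicActivities,
    address: addressMatch ? addressMatch[1].trim() : null,
  };
}

// Made-up text laid out the way the patterns above describe.
const sampleText = [
  'Nombre del emisor: GUITAL Y PARTNERS LIMITADA',
  'Domicilio: VITACURA 4380, Dpto. 31, VITACURA',
  '463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS',
  '731001 SERVICIOS DE PUBLICIDAD PRESTADOS POR EMPRESAS',
].join('\n');

const fields = extractFields(sampleText);
console.log(fields);
```

Returning null for a missing label matches the string | null typing of emitterName and address in ExtractedTaxData.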

Output Format

The extracted data is returned in the following JSON format:

{
  "emitterName": "GUITAL Y PARTNERS LIMITADA",
  "economicActivities": [
    "ASES.COMER.PUBLICIDAD, REPONEDORES, COMERC.FRUT, VERD, BEBIDAS DE FANTASIA",
    "463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS",
    "463020 VENTA AL POR MAYOR DE BEBIDAS ALCOHOLICAS Y NO ALCOHOLICAS",
    "692000 ACTIVIDADES DE CONTABILIDAD, TENEDURIA DE LIBROS Y AUDITORIA; CONSULTO",
    "731001 SERVICIOS DE PUBLICIDAD PRESTADOS POR EMPRESAS",
    "783000 OTRAS ACTIVIDADES DE DOTACION DE RECURSOS HUMANOS",
    "854909 OTROS TIPOS DE ENSEÑANZA N.C.P.",
    "855000 ACTIVIDADES DE APOYO A LA ENSEÑANZA"
  ],
  "address": "VITACURA 4380, Dpto. 31, VITACURA"
}

API Reference

Functions

extractTaxData(pdfPath: string): Promise<ExtractedTaxData>

Extract tax data from a PDF file.

processMultiplePDFs(directoryPath: string): Promise<ProcessingResult[]>

Process multiple PDF files in a directory.

saveToJSON(data: any, outputPath: string): void

Save extracted data to JSON file.

isValidPDF(dataBuffer: Buffer): boolean

Validate if a file is a valid PDF.

hasValidExtension(filePath: string): boolean

Validate file extension.

isTaxDocument(text: string): boolean

Check if the document appears to be a Chilean tax document.

validateExtractedData(data: ExtractedTaxData): ValidationResult

Validate extracted data completeness.

Types

ExtractedTaxData

interface ExtractedTaxData {
  emitterName: string | null;
  economicActivities: string[];
  address: string | null;
}

ProcessingResult

interface ProcessingResult {
  filename: string;
  data?: ExtractedTaxData;
  error?: string;
}

Dependencies

  • pdf-parse: For extracting text content from PDF files
  • Built-in Node.js modules: fs, path

Error Handling & Validation

The library includes error handling and validation for:

File Validation

  • File existence: Checks if the file exists before processing
  • File extension: Validates that the file has a .pdf extension
  • File size: Ensures the file is not empty
  • PDF format: Validates PDF structure and signatures
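
The four file checks above could be sketched as follows, in the same order as the list. The helper names and the returned messages are hypothetical and do not reproduce the library's own functions or error strings. The only outside fact relied on is that PDF files begin with the ASCII signature %PDF-.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helpers; the bodies are assumptions, not the library's code.
function hasPdfExtension(filePath) {
  return path.extname(filePath).toLowerCase() === '.pdf';
}

// PDF files begin with the ASCII signature "%PDF-".
function looksLikePdf(buffer) {
  return buffer.length >= 5 && buffer.subarray(0, 5).toString('latin1') === '%PDF-';
}

// Returns a short problem description, or null when no problem is found.
function checkFile(filePath) {
  if (!fs.existsSync(filePath)) return 'File not found';
  if (!hasPdfExtension(filePath)) return 'Invalid file extension';
  const buffer = fs.readFileSync(filePath);
  if (buffer.length === 0) return 'File is empty';
  if (!looksLikePdf(buffer)) return 'Invalid PDF format';
  return null;
}

// Demo: a text file with a .pdf extension fails the signature check.
const tempDir = fs.mkdtempSync(path.join(os.tmpdir(), 'pdf-check-'));
const fakePdf = path.join(tempDir, 'not-really.pdf');
fs.writeFileSync(fakePdf, 'plain text, no PDF header');

const checkResult = checkFile(fakePdf);
console.log(checkResult);
```

Running the cheap checks (existence, extension) before reading the file avoids loading content that would be rejected anyway.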

Content Validation

  • PDF structure: Verifies the file is a valid PDF document
  • Text content: Ensures the PDF contains extractable text (not just scanned images)
  • Tax document structure: Validates that the document appears to be a Chilean tax document
  • Data completeness: Ensures all required fields are successfully extracted
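
A data completeness check might look like the sketch below. The README does not document the shape of ValidationResult, so the { isValid, missingFields } object returned here is an assumption, and checkCompleteness is a hypothetical stand-in rather than the library's validateExtractedData.

```javascript
// Hypothetical completeness check. The { isValid, missingFields } shape is
// an assumption, since ValidationResult is not documented in the README.
function checkCompleteness(data) {
  const missingFields = [];
  if (!data.emitterName) missingFields.push('emitterName');
  if (!Array.isArray(data.economicActivities) || data.economicActivities.length === 0) {
    missingFields.push('economicActivities');
  }
  if (!data.address) missingFields.push('address');
  return { isValid: missingFields.length === 0, missingFields };
}

// A partially extracted record: name found, activities and address missing.
const partial = {
  emitterName: 'GUITAL Y PARTNERS LIMITADA',
  economicActivities: [],
  address: null,
};

const completeness = checkCompleteness(partial);
console.log(completeness);
```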

Error Types Handled

  • File not found errors
  • Invalid file extensions (.txt, .doc, etc.)
  • Corrupted or invalid PDF files
  • Empty files
  • PDFs without extractable text
  • Non-tax documents
  • Incomplete data extraction
  • Directory access errors

Limitations

  • The library is designed specifically for Chilean tax documents that follow the structure of the "CARPETA TRIBUTARIA ELECTRÓNICA PARA SOLICITAR CRÉDITOS"
  • PDFs must be text-based (not scanned images)
  • Extraction accuracy depends on the consistency of the PDF format
  • The library rejects non-PDF files, corrupted PDFs, and documents that don't match the expected tax document structure

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Jorge Zapata - GitHub

Support

If you find this library useful, please consider giving it a ⭐️ on GitHub!