@thecodingwhale/cv-processor

v1.0.12

Published

9 months ago

CV Processor to extract structured data from PDF resumes using TypeScript


thecodingwhale

CV resume PDF parser extraction

CV Processor (TypeScript)

A TypeScript/Node.js tool to extract structured data from CV/resume PDFs.

Overview

This tool processes PDF resumes/CVs and extracts structured information as JSON, making CV data easier to analyze, search, and integrate into applications. It is designed specifically for actor and actress resumes: it extracts credits and groups them by category (for example, Film or Television).

Features

PDF text extraction and image processing for visual resume analysis
AI-powered extraction using multiple providers:
- Google's Gemini AI
- OpenAI (GPT-4, etc.)
- Azure OpenAI
- Grok (X.AI)
- AWS Bedrock (Claude, Nova, etc.)
Organized output with categorized credits
Command-line interface (CLI)
Parallel processing of multiple AI providers
Performance metrics and processing time tracking
Report analysis and provider comparison

Installation

# Clone the repository
git clone <repository-url>
cd cv-processor-ts

# Install dependencies
npm install

# Build the project
npm run build

Configuration

To use the AI-powered features, you need to configure your API keys:

Create a .env file in the project root:

# Google Gemini API Key
GEMINI_API_KEY=your_gemini_api_key_here

# OpenAI API Key
OPENAI_API_KEY=your_openai_api_key_here

# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY=your_azure_openai_api_key_here
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com
AZURE_OPENAI_API_VERSION=2024-04-01-preview
AZURE_OPENAI_DEPLOYMENT_NAME=your-deployment-name

# Grok (X.AI) API Key
GROK_API_KEY=your_grok_api_key_here

# AWS Bedrock Configuration
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_REGION=us-east-1
AWS_BEDROCK_INFERENCE_PROFILE_ARN=arn:aws:bedrock:us-east-1:123456789012:inference-profile/my-profile
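
Before running an extraction, it can help to check which providers actually have their variables set. The following is a minimal sketch, not part of the package: `configuredProviders` and `missingVars` are hypothetical helpers, and the grouping of variables per provider (including treating AWS_BEDROCK_INFERENCE_PROFILE_ARN as optional) is an assumption based on the .env example above.

```typescript
// Hypothetical helpers (not part of cv-processor): they only inspect the
// variable names shown in the .env example above.
type Env = Record<string, string | undefined>

// Assumed set of required variables per provider.
const REQUIRED_VARS: Record<string, string[]> = {
  gemini: ['GEMINI_API_KEY'],
  openai: ['OPENAI_API_KEY'],
  azure: [
    'AZURE_OPENAI_API_KEY',
    'AZURE_OPENAI_ENDPOINT',
    'AZURE_OPENAI_API_VERSION',
    'AZURE_OPENAI_DEPLOYMENT_NAME',
  ],
  grok: ['GROK_API_KEY'],
  aws: ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_REGION'],
}

// True when the variable is unset or contains only whitespace.
const isBlank = (value: string | undefined): boolean =>
  (value ?? '').trim() === ''

// Providers whose required variables are all set to non-empty values.
function configuredProviders(env: Env): string[] {
  return Object.entries(REQUIRED_VARS)
    .filter(([, names]) => names.every((name) => !isBlank(env[name])))
    .map(([provider]) => provider)
}

// Variables still missing for one provider (empty for unknown providers).
function missingVars(provider: string, env: Env): string[] {
  return (REQUIRED_VARS[provider] ?? []).filter((name) => isBlank(env[name]))
}
```

In a Node.js script, `configuredProviders(process.env)` would list the providers that are ready to use once the .env file has been loaded.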

Azure OpenAI and AWS Bedrock Setup

For detailed setup instructions for Azure OpenAI and AWS Bedrock, please refer to the respective documentation.

Customizing Instructions

The application uses a text file for AI extraction instructions. You can customize these instructions by:

Editing the instructions.txt file in the project root directory
Or specifying a custom instructions file path when creating an AICVProcessor:

const processor = new AICVProcessor(aiProvider, {
  instructionsPath: '/path/to/your/custom-instructions.txt',
  verbose: true,
})

The instructions file contains:

The schema definition for extracted data
Categorization rules for actor credits
Extraction rules and guidelines
Examples of expected input/output

Usage

Command Line

# Process a PDF resume with default AI (Gemini)
npm start -- process path/to/resume.pdf

# With verbose output
npm start -- process path/to/resume.pdf -v

# Specify output file
npm start -- process path/to/resume.pdf -o output.json

# Use OpenAI instead of Gemini
npm start -- process path/to/resume.pdf --use-ai openai

# Use Azure OpenAI
npm start -- process path/to/resume.pdf --use-ai azure

# Use Grok (X.AI)
npm start -- process path/to/resume.pdf --use-ai grok

# Use AWS Bedrock
npm start -- process path/to/resume.pdf --use-ai aws
npm start -- process path/to/resume.pdf --use-ai aws --ai-model anthropic.claude-3-sonnet-20240229-v1:0

# Specify a different AI model
npm start -- process path/to/resume.pdf --ai-model gpt-4o
npm start -- process path/to/resume.pdf --use-ai gemini --ai-model gemini-1.5-flash

# Specify conversion type (PDF to Images or PDF to Text)
npm start -- process path/to/resume.pdf --conversion-type pdftoimages
npm start -- process path/to/resume.pdf --conversion-type pdftotexts

# Specify custom instructions file path
npm start -- process path/to/resume.pdf --instructions-path ./custom-instructions.txt

# Specify expected total fields for emptiness percentage calculation
npm start -- process path/to/resume.pdf --expected-total-fields 50

Parallel Processing

You can process a CV with multiple AI providers in parallel:

# Process with all configured providers simultaneously
npm run parallel path/to/resume.pdf

# Process with all providers while specifying expected total fields
npm run parallel path/to/resume.pdf --expected-total-fields 50

# Example with a real file path
npm run parallel ./CVs/KRISTEEN-LY-castingnetworks.pdf --expected-total-fields 108

When using the --expected-total-fields parameter, the system will calculate two emptiness percentages:

The default percentage based on AI-determined total fields
A percentage based on your specified expected total field count

Running the parallel command will:

Run extractions using all configured AI providers/models in parallel
Save all results to an organized output directory
Generate a markdown report comparing performance and results
Track processing time for benchmarking purposes
Include both AI-determined and user-expected emptiness percentages in the report when --expected-total-fields is used
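
The README does not spell out how the emptiness percentage is computed. As a rough illustration only, the sketch below assumes it is the share of fields left empty, computed once against the AI-determined total and once against the value passed to --expected-total-fields. The function name and the field counts are hypothetical.

```typescript
// Assumed formula (not confirmed by the package):
//   emptiness % = (totalFields - filledFields) / totalFields * 100
// rounded to one decimal place.
function emptinessPercentage(filledFields: number, totalFields: number): number {
  if (totalFields <= 0) return 0
  // Never report a negative number of empty fields.
  const emptyFields = Math.max(totalFields - filledFields, 0)
  return Math.round((emptyFields / totalFields) * 1000) / 10
}

// Made-up numbers: 81 fields were filled in by the extraction.
const filled = 81
const aiDeterminedTotal = 90 // total reported by the AI provider
const expectedTotal = 108 // value passed to --expected-total-fields

const defaultEmptiness = emptinessPercentage(filled, aiDeterminedTotal)
const expectedEmptiness = emptinessPercentage(filled, expectedTotal)
```

Under this assumption, a larger expected total raises the reported emptiness, which is why the two figures can differ for the same extraction.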

The output will be saved to: output/CVName_YYYY-MM-DD_HH-MM-SS/
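
To locate a run's results from a script, the directory name can be rebuilt from the PDF path and the run time. This is a sketch under assumptions: `outputDirFor` is a hypothetical helper, CVName is taken to be the file name without its .pdf extension, and the timestamp is taken to be local time.

```typescript
// Hypothetical helper that mirrors the output/CVName_YYYY-MM-DD_HH-MM-SS/ pattern.
function outputDirFor(pdfPath: string, when: Date): string {
  // Keep only the last path segment, accepting both / and \ separators.
  const fileName = pdfPath.split(/[\\/]/).pop() ?? ''
  // Assumption: CVName is the file name without the .pdf extension.
  const cvName = fileName.replace(/\.pdf$/i, '')
  const pad = (n: number): string => String(n).padStart(2, '0')
  // Assumption: the timestamp uses the local time zone.
  const date = `${when.getFullYear()}-${pad(when.getMonth() + 1)}-${pad(when.getDate())}`
  const time = `${pad(when.getHours())}-${pad(when.getMinutes())}-${pad(when.getSeconds())}`
  return `output/${cvName}_${date}_${time}/`
}
```

For example, calling it with `./CVs/KRISTEEN-LY-castingnetworks.pdf` and a run time of 5 January 2024, 09:03:07 yields `output/KRISTEEN-LY-castingnetworks_2024-01-05_09-03-07/`.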

Analyzing Results

After running multiple CV processes, you can generate a merged report to compare AI provider performance:

# Generate a merged report from all output directories
npm start -- merge-reports

# Specify a custom output directory
npm start -- merge-reports -d ./my-output-folder

# Specify a custom output file for the report
npm start -- merge-reports -o performance-analysis.md

The merged report provides:

Rankings of AI providers by accuracy, speed, and combined performance
Detailed metrics for each provider and model
Recommendations for the best overall performer
Summary of all processing runs

This helps identify which AI provider and model combination delivers the best results for your specific CV processing needs.
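
How the merged report scores providers is not documented here. As one possible reading, the sketch below ranks runs by accuracy, by speed, and by a combined score defined as the sum of the two rank positions. The `RunSummary` shape, the accuracy scale, and the combined-score rule are all assumptions, and the numbers in the usage note are made up.

```typescript
// Hypothetical summary of one processing run.
interface RunSummary {
  provider: string
  model: string
  accuracy: number // assumed 0-100 score, higher is better
  processingTime: number // seconds, lower is better
}

const label = (run: RunSummary): string => `${run.provider}/${run.model}`

function rankProviders(runs: RunSummary[]): {
  byAccuracy: string[]
  bySpeed: string[]
  combined: string[]
} {
  // Sort copies so the caller's array is left untouched.
  const byAccuracy = [...runs].sort((a, b) => b.accuracy - a.accuracy)
  const bySpeed = [...runs].sort((a, b) => a.processingTime - b.processingTime)
  // Assumed combined score: sum of both rank positions (lower is better),
  // with ties broken in favour of higher accuracy.
  const score = (run: RunSummary): number =>
    byAccuracy.indexOf(run) + bySpeed.indexOf(run)
  const combined = [...runs].sort(
    (a, b) => score(a) - score(b) || b.accuracy - a.accuracy
  )
  return {
    byAccuracy: byAccuracy.map(label),
    bySpeed: bySpeed.map(label),
    combined: combined.map(label),
  }
}
```

With three illustrative runs (openai/gpt-4o at 95 accuracy and 8.1 s, gemini/gemini-1.5-pro at 92 and 4.5 s, gemini/gemini-1.5-flash at 80 and 5.0 s), gemini-1.5-pro comes first in the combined ranking because it is second on accuracy and first on speed.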

API Usage

// Load variables from .env into process.env (uses the dotenv dependency)
import 'dotenv/config'
import { AIProviderFactory } from './dist/ai/AIProviderFactory'
import { AICVProcessor } from './dist/AICVProcessor'

const main = async () => {
  // Configure AI provider
  const aiConfig = {
    apiKey: process.env.GEMINI_API_KEY!,
    model: 'gemini-1.5-pro',
  }

  // Create AI provider and processor
  const aiProvider = AIProviderFactory.createProvider('gemini', aiConfig)
  const processor = new AICVProcessor(aiProvider, {
    verbose: true,
    // Optional: custom instructions path
    instructionsPath: './my-custom-instructions.txt',
  })

  try {
    // Process the CV
    const cvData = await processor.processCv('path/to/resume.pdf')

    // Save to file
    processor.saveToJson(cvData, 'output.json')
  } catch (error) {
    console.error('Error processing CV:', error)
  }
}

main()

Output Format

The processed CV is output as a JSON file with the following structure:

{
  "resume": [
    {
      "category": "Film",
      "category_id": "a1b2c3d4-e5f6-g7h8-i9j0-k1l2m3n4o5p6",
      "credits": [
        {
          "id": "b1c2d3e4-f5g6-h7i8-j9k0-l1m2n3o4p5q6",
          "year": "2023",
          "title": "Major Motion Picture",
          "role": "Supporting Character",
          "director": "Famous Director",
          "attached_media": []
        }
      ]
    },
    {
      "category": "Television",
      "category_id": "c1d2e3f4-g5h6-i7j8-k9l0-m1n2o3p4q5r6",
      "credits": [
        {
          "id": "d1e2f3g4-h5i6-j7k8-l9m0-n1o2p3q4r5s6",
          "year": "2022",
          "title": "Popular TV Show",
          "role": "Guest Star",
          "director": "TV Director",
          "attached_media": []
        }
      ]
    }
  ],
  "resume_show_years": true,
  "metadata": {
    "processedDate": "2023-07-01T12:34:56.789Z",
    "sourceFile": "actor_resume.pdf",
    "processingTime": 5.23,
    "provider": "gemini",
    "model": "gemini-1.5-pro"
  }
}
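
When consuming this JSON from TypeScript, it can be convenient to describe the structure with types. The interfaces below are inferred from the sample output above and are not exported by the package. The element type of attached_media is unknown because the sample only shows empty arrays, and `creditsPerCategory` is a hypothetical helper.

```typescript
// Types inferred from the sample output; the names are illustrative.
interface Credit {
  id: string
  year: string
  title: string
  role: string
  director: string
  attached_media: unknown[] // element shape is not shown in the sample
}

interface ResumeCategory {
  category: string
  category_id: string
  credits: Credit[]
}

interface CVOutput {
  resume: ResumeCategory[]
  resume_show_years: boolean
  metadata: {
    processedDate: string // ISO 8601 timestamp
    sourceFile: string
    processingTime: number // seconds
    provider: string
    model: string
  }
}

// Counts credits per category, e.g. { Film: 1, Television: 1 } for the sample.
function creditsPerCategory(cv: CVOutput): Record<string, number> {
  const counts: Record<string, number> = {}
  for (const section of cv.resume) {
    counts[section.category] = (counts[section.category] ?? 0) + section.credits.length
  }
  return counts
}
```

A saved result can be loaded with `JSON.parse(text) as CVOutput`. The cast does not validate the data, so fields should still be checked when the file comes from an untrusted run.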

AI Provider System

The application is designed with a flexible AI provider system that allows you to easily swap between different AI models:

Built-in Providers:
- Google Gemini AI (default)
- OpenAI (GPT-4o, etc.)
- Azure OpenAI (GPT-4o, etc.)
- Grok (X.AI) API
- AWS Bedrock (Amazon Nova, etc.)
Performance Metrics:
- Each output includes processing time in seconds
- Filenames include the processing time for easy comparison
- Parallel processing generates reports comparing all providers
- Merged reports identify the best providers based on accuracy and speed
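
A script that wraps the CLI might validate the provider name before launching a run. The sketch below uses the values accepted by --use-ai in the usage examples (gemini, openai, azure, grok, aws) and falls back to gemini, the documented default. `parseProviderName` is a hypothetical helper, and exact, case-sensitive matching is an assumption.

```typescript
// Provider names taken from the --use-ai examples in this README.
const PROVIDER_NAMES = ['gemini', 'openai', 'azure', 'grok', 'aws'] as const
type ProviderName = (typeof PROVIDER_NAMES)[number]

// Hypothetical helper: returns a valid provider name or throws.
function parseProviderName(value: string | undefined): ProviderName {
  // No value given: use the documented default provider.
  if (value === undefined || value.trim() === '') return 'gemini'
  const wanted = value.trim()
  const match = PROVIDER_NAMES.find((name) => name === wanted)
  if (match === undefined) {
    throw new Error(
      `Unknown AI provider "${value}". Expected one of: ${PROVIDER_NAMES.join(', ')}`
    )
  }
  return match
}
```

Returning the narrowed `ProviderName` type lets the compiler catch misspelled provider names elsewhere in the calling code.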

Dependencies

@google/generative-ai: Google Gemini AI integration
openai: OpenAI API integration
pdf-parse: PDF text extraction
tesseract.js: OCR capability
@aws-sdk/client-bedrock-runtime: AWS Bedrock integration
commander: CLI framework
dotenv: Environment variable management
jsonrepair: Fix malformed JSON from AI responses
glob: File path matching
poppler-utils: Required for PDF to image conversion (external dependency)
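
AI responses often wrap JSON in extra prose or a fenced block, which must be stripped before the text can be parsed or repaired. The sketch below shows only that isolation step using built-in string functions. It does not use jsonrepair, and whether the package performs a similar step is an assumption. `extractJsonText` is a hypothetical helper.

```typescript
// Three backticks, built at runtime so this example can itself sit in a fence.
const FENCE = '`'.repeat(3)
const FENCED_BLOCK = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE, 'i')

// Hypothetical helper: returns the substring that looks like a JSON object.
function extractJsonText(response: string): string {
  // Prefer the contents of the first fenced block, if there is one.
  const fenced = response.match(FENCED_BLOCK)
  const candidate = fenced?.[1] ?? response
  // Keep everything from the first "{" to the last "}".
  const start = candidate.indexOf('{')
  const end = candidate.lastIndexOf('}')
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in response')
  }
  return candidate.slice(start, end + 1)
}
```

The returned text can then be passed to `JSON.parse`, or to a repair step first if the model produced slightly malformed JSON.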

License

MIT