@solvers-hub/llm-json

v0.1.11

Published

2 months ago

A TypeScript SDK to extract and correct JSON from LLM outputs

0High
0Medium
0Low

solvers-hub

nitish236

llm json extractor parser

LLM-JSON Extractor

A TypeScript SDK for extracting and correcting JSON data from LLM outputs.

Overview

LLM-JSON is a lightweight library designed to parse and extract JSON objects from large language model (LLM) outputs. It can handle multiple JSON objects within text, extract text separately from JSON, and even attempt to fix malformed JSON.

Key Features

Text/JSON Separation: Cleanly separates text content from JSON data in LLM outputs
Multiple JSON Support: Extracts multiple JSON objects or arrays from a single input
JSON Validation & Correction: Automatically fixes common JSON formatting errors from LLMs
Code Block Support: Extracts JSON from markdown code blocks (```json)
Schema Validation: Validates extracted JSON against provided schemas
TypeScript Support: Written in TypeScript with full type definitions

Quick Start

Installation

npm install @solvers-hub/llm-json

Basic Usage

import { LlmJson } from '@solvers-hub/llm-json';

const llmOutput = `Here's some text followed by JSON:

{
  "name": "John",
  "age": 30,
  "skills": ["JavaScript", "TypeScript", "React"]
}`;

const llmJson = new LlmJson({ attemptCorrection: true });
const { text, json } = llmJson.extract(llmOutput);

console.log(text); // ['Here\'s some text followed by JSON:']
console.log(json); // [{ name: 'John', age: 30, skills: ['JavaScript', 'TypeScript', 'React'] }]

Schema Validation

You can validate extracted JSON against schemas:

import { LlmJson } from '@solvers-hub/llm-json';

const schemas = [
  {
    name: 'person',
    schema: {
      type: 'object',
      properties: {
        name: { type: 'string' },
        age: { type: 'integer' }
      },
      required: ['name', 'age']
    }
  }
];

const llmJson = new LlmJson({ 
  attemptCorrection: true,
  schemas
});

const llmOutput = `Here's a person: {"name": "John", "age": 30}
And some other data: {"title": "Meeting notes"}`;
const result = llmJson.extract(llmOutput);

// Note: All extracted JSON objects are included in the json array
console.log(result.json);
// [
//   { name: 'John', age: 30 },
//   { title: 'Meeting notes' }
// ]

// The validatedJson array includes validation results for each JSON object
console.log(result.validatedJson);
// [
//   {
//     json: { name: 'John', age: 30 },
//     matchedSchema: 'person',
//     isValid: true
//   },
//   {
//     json: { title: 'Meeting notes' },
//     matchedSchema: null,
//     isValid: false,
//     validationErrors: [...]  // Validation errors
//   }
// ]

Advanced Usage

Using with Zod Schemas

While llm-json uses JSON Schema internally, you can easily integrate with Zod using the zod-to-json-schema library:

npm install zod zod-to-json-schema

import { LlmJson } from '@solvers-hub/llm-json';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

// Define your Zod schema
const productSchema = z.object({
  productId: z.string(),
  quantity: z.number().min(0)
});

// Convert to JSON Schema
const llmJson = new LlmJson({
  attemptCorrection: true,
  schemas: [{
    name: 'product',
    schema: zodToJsonSchema(productSchema)
  }]
});

// Process LLM output
const result = llmJson.extract('Found product: {"productId": "abc-123", "quantity": 10}');

// Get validated result
if (result.validatedJson[0]?.isValid) {
  const validatedProduct = productSchema.parse(result.validatedJson[0].json);
  console.log(validatedProduct); // Type-safe!
}

Handling Nested Validation Errors

When validating complex nested schemas, you can identify specific validation failures:

const playlistSchema = {
  name: 'playlist',
  schema: {
    type: 'object',
    properties: {
      playlistId: { type: 'string' },
      tracks: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            id: { type: 'string' },
            title: { type: 'string' },
            duration: { type: 'number' }
          },
          required: ['id', 'title', 'duration']
        }
      }
    },
    required: ['playlistId', 'tracks']
  }
};

const result = llmJson.extract(llmOutput);

if (!result.validatedJson[0]?.isValid) {
  // Identify which array item failed
  result.validatedJson[0].validationErrors?.forEach(error => {
    console.log('Path:', error.instancePath); // e.g., "/tracks/1/duration"
    console.log('Message:', error.message);
    
    // Extract the failed item index
    const match = error.instancePath?.match(/\/tracks\/(\d+)/);
    if (match) {
      const failedIndex = parseInt(match[1]);
      console.log('Failed track:', result.json[0].tracks[failedIndex]);
    }
  });
}

Two-Stage Parsing for Performance

For high-throughput pipelines, use a two-stage strategy:

class OptimizedParser {
  private fastParser = new LlmJson({ attemptCorrection: false });  // Stage 1
  private fallbackParser = new LlmJson({ attemptCorrection: true }); // Stage 2

  parse(input: string) {
    // Stage 1: Fast path (no correction)
    const result = this.fastParser.extract(input);
    
    // Only use Stage 2 if Stage 1 failed AND input contains JSON-like patterns
    if (result.json.length === 0 && this.hasJsonPattern(input)) {
      return this.fallbackParser.extract(input); // Stage 2: Fallback with correction
    }
    
    return result;
  }

  private hasJsonPattern(input: string): boolean {
    return /\{[^}]*:/.test(input) || /\[[^\]]*\{/.test(input);
  }
}

Trigger conditions for Stage 2:

Stage 1 returns empty json array
Input contains JSON-like patterns ({, [, :, etc.)
Distinguishes between "no JSON" vs "malformed JSON"

Streaming/Chunked Input Handling

For real-time LLM streaming, buffer chunks until complete JSON is detected:

class StreamingHandler {
  private buffer = '';
  private llmJson = new LlmJson({ attemptCorrection: true });

  processChunk(chunk: string) {
    this.buffer += chunk;
    
    // Detect complete JSON by counting braces
    const complete = this.findCompleteJson(this.buffer);
    
    if (complete) {
      const result = this.llmJson.extract(complete.json);
      this.buffer = complete.remaining;
      return result.json;
    }
    
    return []; // Still buffering
  }

  private findCompleteJson(text: string): { json: string; remaining: string } | null {
    let depth = 0;
    let start = text.indexOf('{');
    if (start === -1) return null;
    
    for (let i = start; i < text.length; i++) {
      if (text[i] === '{') depth++;
      if (text[i] === '}') depth--;
      if (depth === 0) {
        return {
          json: text.substring(start, i + 1),
          remaining: text.substring(i + 1)
        };
      }
    }
    return null; // Incomplete
  }
}

Primary challenges:

Detecting JSON boundaries in incomplete streams
Handling braces inside strings (must track string state)
Supporting nested objects (depth counting)
Managing buffer size for long streams

Processing Multiple Schema Types

Build a typed processor for different data types:

class LogProcessor {
  private llmJson: LlmJson;

  constructor(schemas: { errorLog: any; infoLog: any }) {
    this.llmJson = new LlmJson({
      attemptCorrection: true,
      schemas: [
        { name: 'errorLog', schema: schemas.errorLog },
        { name: 'infoLog', schema: schemas.infoLog }
      ]
    });
  }

  process(logEntry: string): { errors: any[]; infos: any[] } {
    const result = this.llmJson.extract(logEntry);
    
    const errors: any[] = [];
    const infos: any[] = [];

    result.validatedJson?.forEach(validated => {
      if (validated.isValid) {
        if (validated.matchedSchema === 'errorLog') {
          errors.push(validated.json);
        } else if (validated.matchedSchema === 'infoLog') {
          infos.push(validated.json);
        }
      }
    });

    return { errors, infos };
  }
}

Examples

See the examples directory for comprehensive examples:

example.ts - Basic extraction and correction
array-example.ts - Extracting arrays from LLM output
schema-validation-example.ts - JSON Schema validation
schema-validation-examples.ts - Complex nested schemas
zod-integration-example.ts - Using Zod schemas with llm-json
advanced-patterns-example.ts - Streaming, two-stage parsing, error detection

Run examples:

npm run example:run -- examples/example.ts
npm run example:run -- examples/zod-integration-example.ts
npm run example:run -- examples/advanced-patterns-example.ts

API Reference

`ExtractResult`

interface ExtractResult {
  text: string[];              // Non-JSON text segments
  json: any[];                 // Extracted JSON objects/arrays
  validatedJson?: Array<{      // Validation results (if schemas provided)
    json: any;
    matchedSchema: string | null;
    isValid: boolean;
    validationErrors?: Array<{
      instancePath?: string;   // Path to failed property (e.g., "/tracks/1/duration")
      schemaPath?: string;     // Path in schema
      message?: string;        // Error description
    }>;
  }>;
}

Distinguishing Parse Failures from Empty Input

function diagnoseResult(input: string, result: ExtractResult): string {
  // Case 1: Success
  if (result.json.length > 0) return 'success';
  
  // Case 2: No JSON-like patterns = genuinely empty
  const hasJsonPattern = /\{[^}]*:/.test(input) || /\[[^\]]*\{/.test(input);
  if (!hasJsonPattern) return 'no_json';
  
  // Case 3: Has JSON patterns but extraction failed = parse failure
  return 'parse_failure';
}