@pluckr/core

v0.1.1

Published

a month ago

Schema-first, self-healing HTML data extraction with LLM-generated selectors

0High
0Medium
0Low

@pluckr/core

Schema-first, self-healing HTML data extraction powered by LLMs. Define what you want with a Zod schema, and Pluckr figures out how to extract it.

Installation

npm install @pluckr/core

You also need an AI SDK provider:

npm install @ai-sdk/anthropic   # or @ai-sdk/openai, @ai-sdk/google

Quick Start

import { Pluckr } from '@pluckr/core'
import { anthropic } from '@ai-sdk/anthropic'
import { z } from 'zod'

const pluckr = new Pluckr({
  model: anthropic('claude-haiku-4-5-20251001'),
})

const result = await pluckr.extract({
  html: '<html>...</html>',
  schema: z.object({
    title: z.string(),
    price: z.coerce.number().positive(),
    inStock: z.coerce.boolean(),
  }),
  cacheKey: 'my-product-page',
})

if (result.success) {
  console.log(result.data.title)  // fully typed!
}

await pluckr.close()

How it works

You provide HTML and a Zod schema describing the data you want
An LLM generates CSS selectors for each schema field
Selectors are tested and verified via an agentic tool loop
Zod validates and coerces the extracted data
Working selectors are cached — subsequent extractions are instant (no LLM calls)
If the page changes and selectors break, Pluckr asks the LLM to fix them

Bring your own model

Pluckr accepts any Vercel AI SDK compatible model:

import { anthropic } from '@ai-sdk/anthropic'
import { openai } from '@ai-sdk/openai'
import { google } from '@ai-sdk/google'

new Pluckr({ model: anthropic('claude-haiku-4-5-20251001') })
new Pluckr({ model: openai('gpt-4o-mini') })
new Pluckr({ model: google('gemini-2.0-flash') })

Use `z.coerce` for non-string fields

CSS selectors extract text from HTML, so all raw values are strings. Use z.coerce.number(), z.coerce.boolean(), etc. so Zod converts "29.99" to 29.99 and "true" to true.

Caching

By default, Pluckr uses in-memory storage. For persistent caching, use @pluckr/sqlite:

import { SqliteStorage } from '@pluckr/sqlite'

const pluckr = new Pluckr({
  model: anthropic('claude-haiku-4-5-20251001'),
  storage: new SqliteStorage(),  // defaults to .pluckr/cache.db
})

You can also implement the Storage interface for any backend (Redis, Postgres, etc.).

Error Handling

extract() returns a discriminated union — no exceptions:

const result = await pluckr.extract({ html, schema, cacheKey: 'my-page' })

if (result.success) {
  console.log(result.data)
} else {
  // result.error.code: 'NO_DATA' | 'EXTRACTION_FAILED' | 'PERMANENT_FAILURE'
  console.error(result.error.message)
}

API

`new Pluckr(config)`

| Option | Type | Default | Description | |--------|------|---------|-------------| | model | LanguageModel | required | Any Vercel AI SDK model | | storage | Storage | MemoryStorage | Cache backend | | debug | boolean | false | Log extraction steps | | maxToolCallsPerField | number | 3 | Max LLM tool calls per schema field |

`pluckr.extract(options)`

| Option | Type | Description | |--------|------|-------------| | html | string | Raw HTML to extract from | | schema | ZodObject | Zod schema describing desired data | | cacheKey | string? | Optional key for selector caching |

Returns ExtractResult<T>: { success: true, data: T } or { success: false, error: ExtractError }.

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@pluckr/core

Installation

Quick Start

How it works

Bring your own model

Use z.coerce for non-string fields

Caching

Error Handling

API

new Pluckr(config)

pluckr.extract(options)

License

Use `z.coerce` for non-string fields

`new Pluckr(config)`

`pluckr.extract(options)`