@nicolaygerold/pluck

v0.1.7

Published

a day ago

LLM-native HTML selector library. Extract data with XPath and CSS.

0High
0Medium
0Low

nicolaygerold

html xpath css selector scraping parsing extraction

pluck

LLM-native HTML selector library. Extract data from HTML with XPath and CSS selectors.

bun add pluck

Quick Start

import { pluck } from "pluck";

const html = `
  <div class="product">
    <h1>Wireless Headphones</h1>
    <span class="price">$99.99</span>
    <a href="/buy/123">Buy now</a>
  </div>
`;

const doc = pluck(html);

// Extract text
const title = doc.css("h1::text").get();
// → "Wireless Headphones"

// Extract attribute
const link = doc.xpath("//a/@href").get();
// → "/buy/123"

// Check if element exists
if (doc.css(".discount").ok) {
  console.log("On sale!");
}

API Reference

Entry Point

import { pluck } from "pluck";

const doc = pluck(html);              // Parse HTML string
const doc = pluck(html, { debug: true }); // Enable debug logging

Query Methods

doc.css("selector")      // CSS selector
doc.xpath("//expression") // XPath expression

Both return a Selector that can be chained.

Extraction

.get()              // First match or null
.get("default")     // First match or default value
.getall()           // All matches as string[]

.text()             // Combined text content
.attr("name")       // Attribute value or null
.html()             // Inner HTML or null

Feedback

.ok                 // true if selector matched anything
.count              // Number of matches
.selector           // The selector string used

.result()           // { ok: true, value, count } or { ok: false, selector }

Chaining

.or(fallback)       // Use fallback selector if no match
.map(fn)            // Transform values: .map(s => s.toUpperCase())
.first()            // First match only
.last()             // Last match only
.eq(n)              // Nth match (0-indexed)

Iteration

.each((sel, i) => {})  // Iterate with callback
.toArray()             // Convert to Selector[]

for (const item of doc.css("li")) {
  console.log(item.text());
}

Pseudo-Elements

Extract text or attributes directly in selectors:

// CSS
doc.css("h1::text").get()           // Text content
doc.css("a::attr(href)").get()      // Attribute value

// XPath
doc.xpath("//h1::text").get()       // Text content  
doc.xpath("//a::attr(href)").get()  // Attribute value
doc.xpath("//a/@href").get()        // Also works (native XPath)

XPath Cheatsheet

Axes

//div                 // All div elements anywhere
/div                  // Direct child div
.//div                // Descendant div from current context
..                    // Parent element
ancestor::div         // Ancestor div elements
following-sibling::p  // Following p siblings
preceding-sibling::p  // Preceding p siblings
following::p          // All following p elements
preceding::p          // All preceding p elements

Predicates

//div[1]                      // First div
//div[last()]                 // Last div
//div[position() > 1]         // All except first
//a[@href]                    // Has href attribute
//a[@class='active']          // Exact attribute match
//a[@class!='hidden']         // Attribute not equal
//a[contains(@class, 'btn')]  // Attribute contains
//a[starts-with(@href, '/')]  // Attribute starts with
//a[ends-with(@href, '.pdf')] // Attribute ends with
//p[text()='Hello']           // Exact text match
//p[contains(text(), 'Hello')] // Text contains
//div[span]                   // Has child span
//div[count(p) > 2]           // Has more than 2 p children

String Functions

//p[normalize-space()='Hello']           // Ignore whitespace
//p[string-length() > 0]                 // Non-empty text
//p[substring(., 1, 5)='Hello']          // First 5 chars
//p[substring-before(., ':')='Price']    // Before delimiter
//p[substring-after(., ': ')='$99']      // After delimiter
//p[translate(., 'ABC', 'abc')='hello']  // Case conversion

Logical Operators

//a[@class='x' and @id='y']   // Both conditions
//a[@class='x' or @class='y'] // Either condition
//a[not(contains(@class, 'hidden'))] // Negation
//h1 | //h2                   // Union (combine results)

CSS Cheatsheet

div                    // Element
.class                 // Class
#id                    // ID
div.class              // Element with class
div > p                // Direct child
div p                  // Descendant
div + p                // Adjacent sibling
div ~ p                // General sibling
[href]                 // Has attribute
[href="/page"]         // Attribute equals
[href^="/"]            // Starts with
[href$=".pdf"]         // Ends with
[href*="example"]      // Contains
[class~="btn"]         // Contains word

Common Patterns

Tables

const doc = pluck(html);

// Get all rows
const rows = doc.xpath("//table//tr").toArray();

// Get specific cell (row 2, column 3)
const cell = doc.xpath("//table//tr[2]/td[3]::text").get();

// Get column values
const prices = doc.xpath("//table//tr/td[2]::text").getall();

// Get row by content
const row = doc.xpath("//tr[td[text()='Product A']]");

Lists

// All list items
const items = doc.css("ul li::text").getall();

// Nested lists
const nested = doc.xpath("//ul/li/ul/li::text").getall();

Forms

// Input value
const value = doc.css("input[name='email']::attr(value)").get();

// All form fields
doc.css("form input").each((input) => {
  const name = input.attr("name");
  const value = input.attr("value");
});

// Select options
const options = doc.xpath("//select[@name='country']/option/@value").getall();

Links

// All links
const hrefs = doc.css("a::attr(href)").getall();

// External links
const external = doc.xpath("//a[starts-with(@href, 'http')]/@href").getall();

// Links with specific text
const login = doc.xpath("//a[text()='Login']/@href").get();

Definition Lists

// Get value after specific term
const price = doc.xpath("//dt[text()='Price']/following-sibling::dd[1]::text").get();

Structured Data

// Extract product cards
const products = doc.css(".product").map((p) => ({
  name: p.css(".name::text").get(),
  price: p.css(".price::text").get(),
  url: p.css("a::attr(href)").get(),
})).getall();

CSS vs XPath

| Use CSS When | Use XPath When | |--------------|----------------| | Simple element selection | Text content matching | | Class/ID selection | Attribute contains/starts-with | | Direct children | Sibling navigation | | Attribute presence | Parent/ancestor traversal | | | Position-based selection | | | Complex predicates |

Rule of thumb: Start with CSS, switch to XPath when you need text matching or axis navigation.

Error Handling

// Check before access
const price = doc.css(".price::text");
if (price.ok) {
  console.log(price.get());
}

// Default values
const stock = doc.css(".stock::text").get("In stock");

// Fallback selectors
const title = doc
  .css("h1::text")
  .or(doc.css(".title::text"))
  .or(doc.css("title::text"))
  .get();

// Structured result
const result = doc.css(".price::text").result();
if (result.ok) {
  console.log(result.value, result.count);
} else {
  console.log("Selector failed:", result.selector);
}

// Invalid selectors return ok: false (no exceptions)
const invalid = doc.xpath("//[broken");
invalid.ok;    // false
invalid.count; // 0
invalid.get(); // null

LLM Tips

Patterns that work well for code generation:

// ✅ Good: Explicit extraction
doc.css("h1::text").get()
doc.xpath("//a/@href").get()

// ✅ Good: Check existence
if (doc.css(".error").ok) { ... }

// ✅ Good: Safe defaults  
doc.css(".price::text").get("N/A")

// ✅ Good: Structured extraction
doc.css(".item").map(item => ({
  title: item.css(".title::text").get(),
  link: item.css("a::attr(href)").get(),
})).getall()

// ❌ Avoid: Chaining without checks
doc.css(".maybe-missing").css(".child::text").get()

// ✅ Better: Check at each step
const parent = doc.css(".maybe-missing");
const text = parent.ok ? parent.css(".child::text").get() : null;

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

pluck

Quick Start

API Reference

Entry Point

Query Methods

Extraction

Feedback

Chaining

Iteration

Pseudo-Elements

XPath Cheatsheet

Axes

Predicates

String Functions

Logical Operators

CSS Cheatsheet

Common Patterns

Tables

Lists

Forms

Links

Definition Lists

Structured Data

CSS vs XPath

Error Handling

LLM Tips

License