llm-search

v1.0.5, published 7 months ago

[DEPRECATED] A Node.js module for searching and scraping web content, designed for LLMs but useful for any project that needs web scraping!

[DEPRECATED] llm-search 🔍

⚠️ DEPRECATION NOTICE: This package has been deprecated in favor of llm-kit. Please install llm-kit instead:
npm install llm-kit
The new package provides all the same functionality with improved features and ongoing maintenance. Please update your dependencies accordingly.

A Node.js module for searching and scraping web content, designed for LLMs but useful for everyone!

Features

Search multiple engines (Google, DuckDuckGo)
Wikipedia search and content extraction
HackerNews scraping
Webpage content extraction
Document parsing (PDF, DOCX, CSV)
Image OCR/text extraction support
No API keys required
Automatic fallbacks
TypeScript & Node support
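
The feature list mentions automatic fallbacks without describing how they work. As a rough illustration of the general pattern only, and not the package's actual implementation, the sketch below tries a list of providers in order and returns the first result that succeeds. The names `Provider` and `firstSuccessful` are hypothetical and are not part of llm-search.

```typescript
// Hypothetical sketch of a fallback chain; not llm-search's real internals.
type Provider<T> = (query: string) => Promise<T>;

async function firstSuccessful<T>(
  query: string,
  providers: Provider<T>[],
): Promise<T> {
  const failures: string[] = [];
  for (const provider of providers) {
    try {
      // Return as soon as one provider succeeds.
      return await provider(query);
    } catch (error) {
      // Record the failure and fall through to the next provider.
      failures.push(error instanceof Error ? error.message : String(error));
    }
  }
  // Every provider failed, so surface all collected messages at once.
  throw new Error(
    `All ${providers.length} providers failed: ${failures.join("; ")}`,
  );
}
```

With two placeholder functions such as `searchGoogle` and `searchDuckDuckGo` (assumed names), a call like `firstSuccessful("typescript tutorial", [searchGoogle, searchDuckDuckGo])` would only reach the second engine if the first one throws.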

Installation

npm install llm-search

# Optional: Install OCR language data for non-English languages
npm install tesseract.js-data

Quick Start

import { search, parse } from "llm-search";

// Web Search
const results = await search("typescript tutorial");
console.log(results);

// Parse Documents
const pdfResult = await parse("document.pdf");
console.log(pdfResult.text);

const csvResult = await parse("data.csv", {
  csv: { columns: true },
});
console.log(csvResult.data);

// OCR Images
const imageResult = await parse("image.png", {
  language: "eng",
});
console.log(imageResult.text);

Supported File Types

Documents

PDF files (.pdf)
Word documents (.docx)
CSV files (.csv)

Images (OCR)

PNG (.png)
JPEG (.jpg, .jpeg)
BMP (.bmp)
GIF (.gif)
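
To make the two groups above concrete, here is a minimal sketch that classifies a path by its extension before handing it to `parse`. The helper `classifyFile` and the `FileKind` type are hypothetical, and matching extensions case-insensitively is an assumption, since the README does not say how the package itself detects file types.

```typescript
// Hypothetical helper; the extension lists mirror the supported types above.
type FileKind = "document" | "image" | "unsupported";

const DOCUMENT_EXTENSIONS = [".pdf", ".docx", ".csv"];
const IMAGE_EXTENSIONS = [".png", ".jpg", ".jpeg", ".bmp", ".gif"];

function classifyFile(filePath: string): FileKind {
  const dot = filePath.lastIndexOf(".");
  // No dot means there is no extension to inspect.
  if (dot === -1) return "unsupported";
  // Assumption: compare extensions case-insensitively.
  const ext = filePath.slice(dot).toLowerCase();
  if (DOCUMENT_EXTENSIONS.indexOf(ext) !== -1) return "document";
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) return "image";
  return "unsupported";
}
```

A check like this can be used to reject unsupported files early, or to decide whether passing an OCR `language` option makes sense for a given path.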

Documentation

See the docs directory for detailed documentation:

Search - Web search capabilities
Wikipedia - Wikipedia integration
HackerNews - HackerNews API
Webpage - Web content extraction
Parser - Document and image parsing

Example Usage

Web Search

import { search } from "llm-search";

const results = await search("typescript tutorial");
console.log(results);

Document Parsing

import { parse } from "llm-search";

// Parse PDF
const pdfResult = await parse("document.pdf");
console.log(pdfResult.text);

// Parse CSV with options
const csvResult = await parse("data.csv", {
  csv: {
    delimiter: ";",
    columns: true,
  },
});
console.log(csvResult.data);

// OCR Image
const imageResult = await parse("image.png", {
  language: "eng", // Tesseract language code; other languages are supported
});
console.log(imageResult.text);
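
The `delimiter` and `columns` options above match the option names of csv-parse, which is listed under Dependencies. Assuming they are forwarded to it unchanged, `columns: true` means the first row is used as the header and each later row becomes an object keyed by those names. The deliberately naive sketch below illustrates only that resulting shape. `parseDelimited` is a hypothetical function, it does not handle quoting or escaping, and it is not how the package parses CSV.

```typescript
// Naive illustration only: no quoting or escaping, unlike a real CSV parser.
type Row = Record<string, string>;

function parseDelimited(text: string, delimiter: string): Row[] {
  // Split into lines and drop empty ones (for example a trailing newline).
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0);
  const [headerLine, ...dataLines] = lines;
  if (headerLine === undefined) return [];
  const header = headerLine.split(delimiter);
  return dataLines.map((line) => {
    const cells = line.split(delimiter);
    const row: Row = {};
    // Pair each header name with the cell in the same position.
    header.forEach((name, i) => {
      row[name] = cells[i] ?? "";
    });
    return row;
  });
}
```

For the input `name;age` followed by `Ada;36`, calling `parseDelimited(input, ";")` yields one object with the keys `name` and `age`.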

Error Handling

import { parse } from "llm-search";

try {
  const result = await parse("document.pdf");
  console.log(result.text);
} catch (error) {
  // Errors carry a string `code` that identifies the failure type
  if (error.code === "PDF_PARSE_ERROR") {
    console.error("PDF parsing failed:", error.message);
  }
  // Handle other errors
}
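
In strict TypeScript the caught value is typed `unknown`, so reading `error.code` directly does not compile. One possible way to handle this is a small type guard, sketched below. `CodedError`, `isCodedError`, and `describeParseError` are hypothetical names, and `"PDF_PARSE_ERROR"` is the only error code this README mentions; no other codes are assumed.

```typescript
// Hypothetical helpers; only "PDF_PARSE_ERROR" is taken from the README.
interface CodedError extends Error {
  code: string;
}

function isCodedError(error: unknown): error is CodedError {
  // Cast from `unknown` first so the optional `code` property can be read safely.
  const code = (error as { code?: unknown } | null | undefined)?.code;
  return error instanceof Error && typeof code === "string";
}

function describeParseError(error: unknown): string {
  if (isCodedError(error) && error.code === "PDF_PARSE_ERROR") {
    return `PDF parsing failed: ${error.message}`;
  }
  if (error instanceof Error) {
    return `Unexpected error: ${error.message}`;
  }
  // Anything can be thrown in JavaScript, including strings and numbers.
  return `Unexpected non-error value: ${String(error)}`;
}
```

Inside a `catch (error)` block, `console.error(describeParseError(error))` then works without any unchecked property access.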

Dependencies

This package uses these great libraries:

@mozilla/readability - Web content extraction
csv-parse - CSV parsing
duck-duck-scrape - DuckDuckGo search
fast-xml-parser - XML parsing
google-sr - Google search
jsdom - DOM emulation for web scraping
mammoth - DOCX parsing
pdf-parse - PDF parsing
puppeteer - Headless browser automation
tesseract.js - OCR
wikipedia - Wikipedia API

License

MIT

Contributing

Contributions VERY welcome!! Please read the contributing guidelines first.