@ujeebu-org/langchain

v0.2.1

Published

4 months ago

LangChain integration for Ujeebu Extract API

yalhyane

ujeebu

langchain ujeebu extract article scraping nlp llm ai web-scraping content-extraction

LangChain Ujeebu Integration (Node.js/TypeScript)

Official LangChain integration for the Ujeebu Extract API. It extracts clean, structured content from news articles and blog posts for use with large language models (LLMs) and AI applications.

Features

Easy Integration: Seamlessly integrate Ujeebu Extract API with LangChain agents and chains
Document Loaders: Load articles as LangChain Documents for use with vector stores and retrievers
Agent Tools: Use Ujeebu Extract as a tool in LangChain agents
Rich Metadata: Extract article text, HTML, author, publication date, images, and more
Quick Mode: Optional fast extraction mode (30-60% faster)
TypeScript Support: Full TypeScript types and interfaces

What is Ujeebu Extract?

Ujeebu Extract converts news and blog articles into clean, structured JSON data. It extracts:

Clean article text and HTML
Author and publication date
Title and summary
Images and media
RSS feeds
Site metadata

This output is well suited to RAG (Retrieval-Augmented Generation) applications, content analysis, and building LLM training data.

Installation

npm install @ujeebu-org/langchain
# or
yarn add @ujeebu-org/langchain
# or
pnpm add @ujeebu-org/langchain

Requirements

Node.js 16.0 or higher
@langchain/core 0.3.0 or higher
An Ujeebu API key (Get one here)

Quick Start

Set up your API key

export UJEEBU_API_KEY="your-api-key"

Or set it programmatically:

process.env.UJEEBU_API_KEY = "your-api-key";
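
When the key can come from more than one place, it can help to resolve it once and fail early. The sketch below uses a hypothetical helper (`resolveApiKey` is not part of the package) that prefers an explicitly passed key and otherwise falls back to a `UJEEBU_API_KEY` entry in an environment-like object. In Node.js you would pass `process.env`.

```typescript
// Hypothetical helper, not part of @ujeebu-org/langchain.
type Env = Record<string, string | undefined>;

function resolveApiKey(explicitKey: string | undefined, env: Env): string {
  // Prefer a non-empty explicit key, then fall back to the environment entry.
  const key = explicitKey?.trim() || env.UJEEBU_API_KEY?.trim() || '';
  if (key === '') {
    throw new Error('Ujeebu API key missing: pass one explicitly or set UJEEBU_API_KEY');
  }
  return key;
}

// Stand-in environment object for illustration; in Node.js, pass process.env.
const apiKey = resolveApiKey(undefined, { UJEEBU_API_KEY: 'your-api-key' });
console.log(apiKey === 'your-api-key'); // prints true
```

The resolved value can then be supplied as the apiKey constructor parameter described in the API Reference.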

Using as an Agent Tool

npm install @ujeebu-org/langchain @langchain/core @langchain/openai @langchain/langgraph langchain

import { UjeebuExtractTool } from '@ujeebu-org/langchain';
import { createAgent } from 'langchain';
import { ChatOpenAI } from '@langchain/openai';

// Initialize the tool
const ujeebuTool = new UjeebuExtractTool();

// Create an agent (ChatOpenAI reads OPENAI_API_KEY from the environment)
const model = new ChatOpenAI({ temperature: 0 });
const agent = createAgent({ model, tools: [ujeebuTool] });

// Use the agent
const response = await agent.invoke({
  messages: [{ role: 'user', content: 'Extract the article from https://example.com/article and summarize it' }],
});
console.log(response.messages[response.messages.length - 1].content);

Using the Document Loader

npm install @ujeebu-org/langchain @langchain/core @langchain/openai @langchain/community faiss-node

import { UjeebuLoader } from '@ujeebu-org/langchain';
import { FaissStore } from '@langchain/community/vectorstores/faiss';
import { OpenAIEmbeddings } from '@langchain/openai';

// Load articles
const loader = new UjeebuLoader({
  urls: [
    'https://example.com/article1',
    'https://example.com/article2',
    'https://example.com/article3'
  ]
});
const documents = await loader.load();

// Create a vector store
const embeddings = new OpenAIEmbeddings();
const vectorStore = await FaissStore.fromDocuments(documents, embeddings);

// Query the documents (returns the most similar Documents)
const results = await vectorStore.similaritySearch('What are the main topics?');
console.log(results.map((doc) => doc.metadata.title));

Usage Examples

Basic Article Extraction

import { UjeebuExtractTool } from '@ujeebu-org/langchain';

const tool = new UjeebuExtractTool();
const result = await tool.invoke({
  url: 'https://example.com/article',
  text: true,
  author: true,
  pub_date: true
});
console.log(result);

Extract with Images

import { UjeebuExtractTool } from '@ujeebu-org/langchain';

const tool = new UjeebuExtractTool();
const result = await tool.invoke({
  url: 'https://example.com/article',
  images: true  // Extract article images
});

Quick Mode for Faster Extraction

import { UjeebuLoader } from '@ujeebu-org/langchain';

const loader = new UjeebuLoader({
  urls: ['https://example.com/article'],
  quickMode: true  // 30-60% faster, slightly less accurate
});
const documents = await loader.load();

Load with HTML Content

import { UjeebuLoader } from '@ujeebu-org/langchain';

const loader = new UjeebuLoader({
  urls: ['https://example.com/article'],
  extractHtml: true,   // Include HTML content
  extractImages: true  // Include images
});
const documents = await loader.load();

// Access metadata
const doc = documents[0];
console.log(`Title: ${doc.metadata.title}`);
console.log(`Author: ${doc.metadata.author}`);
console.log(`Images: ${doc.metadata.images}`);

Build a QA System

import { UjeebuLoader } from '@ujeebu-org/langchain';
import { FaissStore } from '@langchain/community/vectorstores/faiss';
import { OpenAIEmbeddings, ChatOpenAI } from '@langchain/openai';
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';

// Load articles
const loader = new UjeebuLoader({
  urls: [
    'https://example.com/article1',
    'https://example.com/article2'
  ]
});
const documents = await loader.load();

// Create vector store
const embeddings = new OpenAIEmbeddings();
const vectorStore = await FaissStore.fromDocuments(documents, embeddings);
const retriever = vectorStore.asRetriever();

// Retrieve and answer
const query = 'What are the main points?';
const relevantDocs = await retriever.invoke(query);
const context = relevantDocs.map((doc) => doc.pageContent).join('\n\n');

const prompt = ChatPromptTemplate.fromMessages([
  ['system', 'Answer based on the following context:\n\n{context}'],
  ['human', '{question}'],
]);

const chain = prompt.pipe(new ChatOpenAI({ temperature: 0 })).pipe(new StringOutputParser());
const answer = await chain.invoke({ context, question: query });
console.log(answer);

API Reference

UjeebuExtractTool

A LangChain tool for extracting article content.

Constructor Parameters:

apiKey (string, optional): Ujeebu API key. Defaults to UJEEBU_API_KEY environment variable.
baseUrl (string, optional): Custom API endpoint URL.

Tool Parameters (UjeebuExtractInput):

url (string, required): URL of the article to extract
text (boolean, optional): Extract article text (default: true)
html (boolean, optional): Extract article HTML (default: false)
author (boolean, optional): Extract article author (default: true)
pub_date (boolean, optional): Extract publication date (default: true)
images (boolean, optional): Extract images (default: false)
quick_mode (boolean, optional): Use quick mode for faster extraction (default: false)
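
To make the defaults listed above concrete, here is a minimal sketch that mirrors the documented input shape locally and fills in the documented default for every omitted flag. The interface `ExtractInputSketch` and the helper `withDocumentedDefaults` are illustrative names, not exports of the package, and the tool applies its own defaults internally.

```typescript
// Local mirror of the documented tool input, for illustration only.
// The package has its own UjeebuExtractInput type; this is not it.
interface ExtractInputSketch {
  url: string;
  text?: boolean;
  html?: boolean;
  author?: boolean;
  pub_date?: boolean;
  images?: boolean;
  quick_mode?: boolean;
}

// Hypothetical helper: returns the effective options once the defaults
// listed in this README are applied to any omitted flag.
function withDocumentedDefaults(input: ExtractInputSketch): Required<ExtractInputSketch> {
  return {
    url: input.url,
    text: input.text ?? true,
    html: input.html ?? false,
    author: input.author ?? true,
    pub_date: input.pub_date ?? true,
    images: input.images ?? false,
    quick_mode: input.quick_mode ?? false,
  };
}

const effective = withDocumentedDefaults({ url: 'https://example.com/article', images: true });
console.log(effective.images, effective.html); // prints: true false
```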

UjeebuLoader

A LangChain document loader for articles.

Constructor Parameters (UjeebuLoaderParams):

urls (string[], required): List of article URLs to load
apiKey (string, optional): Ujeebu API key
extractText (boolean, optional): Extract article text (default: true)
extractHtml (boolean, optional): Extract article HTML (default: false)
extractAuthor (boolean, optional): Extract author (default: true)
extractPubDate (boolean, optional): Extract publication date (default: true)
extractImages (boolean, optional): Extract images (default: false)
quickMode (boolean, optional): Use quick mode (default: false)
baseUrl (string, optional): Custom API endpoint URL
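
Because urls is the only required parameter, it may be worth cleaning the list before constructing the loader. The following sketch is a hypothetical pre-processing step (`cleanUrls` is not provided by the package) that keeps well-formed http(s) URLs and drops exact duplicates.

```typescript
// Hypothetical pre-processing step, not part of the package:
// keep only well-formed http(s) URLs and drop exact duplicates.
function cleanUrls(candidates: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const candidate of candidates) {
    let parsed: URL;
    try {
      parsed = new URL(candidate.trim());
    } catch (e) {
      continue; // not a valid absolute URL
    }
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') continue;
    const normalized = parsed.toString();
    if (!seen.has(normalized)) {
      seen.add(normalized);
      result.push(normalized);
    }
  }
  return result;
}

const cleaned = cleanUrls([
  'https://example.com/article1',
  'https://example.com/article1',
  'not a url',
  'ftp://example.com/file',
]);
console.log(cleaned.length); // prints 1
```

The cleaned array can be passed directly as the urls parameter.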

Methods:

load(): Promise<Document[]> - Load all documents

Document Metadata:

source: Original URL
url: Resolved URL
canonical_url: Canonical URL
title: Article title
author: Article author
pub_date: Publication date
language: Article language
site_name: Site name
summary: Article summary
image: Main image URL
images: List of all image URLs (if extractImages=true)
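
As an illustration of working with these fields, the sketch below types the metadata locally and builds a one-line citation from whatever is present. Treating every field as optional and as a string (including pub_date) is an assumption, since this README does not state the exact types. `ArticleMetadataSketch` and `formatCitation` are hypothetical names.

```typescript
// Field names follow the metadata list above. Optionality and string
// types are assumptions made for this sketch, not documented behavior.
interface ArticleMetadataSketch {
  source?: string;
  url?: string;
  canonical_url?: string;
  title?: string;
  author?: string;
  pub_date?: string;
  language?: string;
  site_name?: string;
  summary?: string;
  image?: string;
  images?: string[];
}

// Hypothetical helper: builds a one-line citation, skipping missing fields.
function formatCitation(meta: ArticleMetadataSketch): string {
  // Prefer the canonical URL, then the resolved URL, then the original one.
  const link = meta.canonical_url ?? meta.url ?? meta.source;
  const parts = [meta.title, meta.author, meta.site_name, meta.pub_date, link];
  return parts
    .filter((part): part is string => typeof part === 'string' && part !== '')
    .join(' | ');
}

console.log(
  formatCitation({
    source: 'https://example.com/article',
    title: 'Example Article',
    author: 'Jane Doe',
  })
);
// prints "Example Article | Jane Doe | https://example.com/article"
```

With a loaded document, the call would look like formatCitation(doc.metadata).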

Advanced Usage

Custom API Endpoint

import { UjeebuLoader } from '@ujeebu-org/langchain';

const loader = new UjeebuLoader({
  urls: ['https://example.com/article'],
  baseUrl: 'https://custom-api.ujeebu.com/extract'
});

Error Handling

import { UjeebuLoader } from '@ujeebu-org/langchain';

const loader = new UjeebuLoader({
  urls: ['https://example.com/article']
});

try {
  const documents = await loader.load();
  console.log(`Loaded ${documents.length} documents`);
} catch (error) {
  // Non-Error values can be thrown too, so report those as well
  const message = error instanceof Error ? error.message : String(error);
  console.error(`Error: ${message}`);
}
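
The try/catch above treats the whole batch as one unit. If you would rather keep the articles that did load and record the ones that failed, one option is to load each URL separately. The sketch below is generic over a `loadOne` function so that it stays self-contained; `loadEach`, `BatchResult`, and `fakeLoad` are hypothetical names. How the loader itself behaves when a single URL fails is not stated in this README, so treat the per-URL approach as an assumption.

```typescript
// Result of loading several URLs independently.
interface BatchResult<T> {
  loaded: T[];
  failed: { url: string; reason: string }[];
}

// Hypothetical helper: calls loadOne for each URL in turn and records
// failures instead of aborting the whole batch on the first error.
async function loadEach<T>(
  urls: string[],
  loadOne: (url: string) => Promise<T[]>
): Promise<BatchResult<T>> {
  const result: BatchResult<T> = { loaded: [], failed: [] };
  for (const url of urls) {
    try {
      result.loaded.push(...(await loadOne(url)));
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      result.failed.push({ url, reason });
    }
  }
  return result;
}

// Stand-in loader for demonstration; replace it with a real one in practice.
const fakeLoad = async (url: string): Promise<string[]> => {
  if (url.includes('broken')) throw new Error('extraction failed');
  return [`content of ${url}`];
};

loadEach(['https://example.com/a', 'https://example.com/broken'], fakeLoad).then((r) => {
  console.log(r.loaded.length, r.failed.length); // prints: 1 1
});
```

With this package, loadOne could be written as (url) => new UjeebuLoader({ urls: [url] }).load(), at the cost of one loader instance per URL.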

Testing

Run the test suite:

# Install dependencies
npm install

# Build the package
npm run build

# Run tests
npm test

# Run tests with coverage
npm run test:coverage

# Run linting
npm run lint

# Format code
npm run format

Examples

Check out the examples directory for more usage examples:

agent-example.ts - Using Ujeebu with LangChain agents
document-loader-example.ts - Using the document loader with vector stores

Pricing

Ujeebu Extract API pricing is based on usage. Check the pricing page for details.

Support

Documentation: https://ujeebu.com/docs/extract
API Reference: https://ujeebu.com/docs
Support: [email protected]
GitHub Issues: Report a bug

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Related Projects

LangChain - Build applications with LLMs through composability
Ujeebu API - Web scraping and content extraction API

Changelog

0.2.0

Upgrade to LangChain 1.x compatibility
Import BaseDocumentLoader from @langchain/core instead of langchain
Only @langchain/core required as peer dependency (not langchain)
Update examples to use @langchain/langgraph for agents

0.1.0

Initial release
UjeebuExtractTool for LangChain agents
UjeebuLoader document loader
Full TypeScript support
Comprehensive test coverage
Complete documentation