ai-search-indexer
v1.0.2
Website content indexer using Mozilla Readability and Playwright
Index documentation websites into JSON artifacts that power the web components in this monorepo.
This package supports two index formats:
- context.json for wg-doc-search (classic search + Prompt API answer)
- vector-context.json for wg-rag-search (Orama + Transformers.js retrieval + Prompt API answer)
Install
pnpm add ai-search-indexer
npx playwright install chromium
Quick Start (Monorepo)
From repo root:
pnpm install
1. Create a Classic Index (context.json)
Example (Angular Material overview pages):
node packages/indexer/examples/index-angular-material.js
Output:
packages/web-components/demo/context.json
2. Create a Vector Index (vector-context.json)
Example (~50 Angular Material pages: overview + api):
node packages/indexer/examples/index-vector.js
Output:
packages/web-components/demo/vector-context.json
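A URL list covering both the overview and api page of several components could be assembled along these lines. This is a minimal sketch, not part of the package: `buildComponentUrls` and `BASE_URL` are hypothetical names, and the URL pattern is inferred from the example URLs in this README.

```javascript
// Hypothetical helper (not part of ai-search-indexer): expands component
// names into overview + api URLs, following the URL pattern used in this README.
const BASE_URL = 'https://material.angular.dev/components';

function buildComponentUrls(components, sections = ['overview', 'api']) {
  // One URL per (component, section) pair, grouped by component.
  return components.flatMap((name) =>
    sections.map((section) => `${BASE_URL}/${name}/${section}`)
  );
}

// 'table' and 'sort' are the two components named in this README.
const urls = buildComponentUrls(['table', 'sort']);
console.log(urls.join('\n'));
```

The resulting array can be passed to `indexUrls`, or written one URL per line to a file for the CLI.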
Programmatic API
WebsiteIndexer (classic)
import { WebsiteIndexer } from 'ai-search-indexer';

const indexer = new WebsiteIndexer('./context.json');
await indexer.indexUrls([
  'https://material.angular.dev/components/table/overview',
  'https://material.angular.dev/components/sort/overview'
]);
indexer.displayStats();
VectorDocIndexer (RAG)
import { VectorDocIndexer } from 'ai-search-indexer';

const indexer = new VectorDocIndexer('./vector-context.json', {
  embeddingModel: 'Xenova/all-MiniLM-L6-v2'
});
await indexer.indexUrls([
  'https://material.angular.dev/components/table/overview',
  'https://material.angular.dev/components/table/api'
]);
indexer.displayStats();
Vector CLI
Run from packages/indexer:
pnpm index:vector --output ./vector-context.json --urls-file ./urls.txt
You can also pass URLs directly:
pnpm index:vector \
  https://material.angular.dev/components/table/overview \
  https://material.angular.dev/components/table/api
CLI options
- --urls-file <path>: .txt or .json URL list
- --output <path>: output path (default ./vector-context.json)
- --model <name>: embedding model (default Xenova/all-MiniLM-L6-v2)
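The README does not spell out the layout of the URL list file. As a rough illustration, the sketch below assumes a .txt file holds one URL per line and a .json file holds a flat array of URL strings. `parseUrlList` is a hypothetical helper, and skipping blank lines and `#` comment lines is an assumption of this sketch, not documented CLI behavior.

```javascript
// Hypothetical parser for a URL list file. Assumed formats (not confirmed
// by the package): .txt = one URL per line, .json = array of URL strings.
function parseUrlList(fileName, text) {
  if (fileName.toLowerCase().endsWith('.json')) {
    const parsed = JSON.parse(text);
    if (!Array.isArray(parsed)) {
      throw new TypeError('Expected a JSON array of URL strings');
    }
    return parsed.map(String);
  }
  // Plain text: trim each line, drop blanks and '#' comments (an assumption).
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('#'));
}

const fromTxt = parseUrlList(
  'urls.txt',
  'https://material.angular.dev/components/table/overview\n' +
    'https://material.angular.dev/components/table/api\n'
);
console.log(fromTxt);
```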
Output Formats
context.json
Contains indexed_content pages with fields like:
- url
- title
- description
- content
- html_content
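To make the shape concrete, here is a small sketch of a keyword lookup over such a file once it has been parsed. It assumes `indexed_content` is an array of page objects with the fields listed above. `findPages` and the sample data are hypothetical, and this is not how wg-doc-search itself searches.

```javascript
// Sketch only: assumes context.json parses to { indexed_content: [ ...pages ] }.
function findPages(context, term) {
  const needle = term.toLowerCase();
  const pages = Array.isArray(context.indexed_content)
    ? context.indexed_content
    : [];
  // Case-insensitive substring match on the text fields of each page.
  return pages.filter((page) =>
    [page.title, page.description, page.content].some(
      (field) => typeof field === 'string' && field.toLowerCase().includes(needle)
    )
  );
}

// Hand-written sample in the assumed shape (not real indexer output).
const sampleContext = {
  indexed_content: [
    {
      url: 'https://material.angular.dev/components/table/overview',
      title: 'Table',
      description: 'Displays rows of data',
      content: 'Rows and columns',
      html_content: '<p>Rows and columns</p>'
    },
    {
      url: 'https://material.angular.dev/components/sort/overview',
      title: 'Sort',
      description: 'Sortable column headers',
      content: 'Click a header to sort',
      html_content: '<p>Click a header to sort</p>'
    }
  ]
};

const matches = findPages(sampleContext, 'sort');
console.log(matches.map((page) => page.url));
```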
vector-context.json
Contains:
- metadata
- pages[]
- chunks[] with per-chunk embedding
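The per-chunk embeddings are what make similarity retrieval possible. The sketch below ranks chunks against a query vector with plain cosine similarity. It is an illustration only: wg-rag-search uses Orama for retrieval, not this code, and the `text` field, the two-dimensional sample vectors, and the names `cosineSimilarity` and `rankChunks` are all assumptions of this sketch.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the k highest-scoring chunks, best first.
function rankChunks(chunks, queryEmbedding, k = 3) {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(chunk.embedding, queryEmbedding)
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data: real embeddings have far more dimensions, and 'text' is a
// hypothetical field name.
const sampleChunks = [
  { text: 'table basics', embedding: [1, 0] },
  { text: 'sort headers', embedding: [0, 1] },
  { text: 'sortable tables', embedding: [1, 1] }
];

const ranked = rankChunks(sampleChunks, [1, 0], 2);
console.log(ranked.map((r) => `${r.chunk.text}: ${r.score.toFixed(3)}`));
```

In practice the query vector would come from the same embedding model used at index time, so that query and chunk vectors share one space.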
Scripts
In packages/indexer/package.json:
- pnpm example
- pnpm example:vector
- pnpm index:vector
- pnpm test
Tutorial
See the full step-by-step guide:
packages/indexer/TUTORIAL.md
License
MIT
