@upstash/docs2vector
v1.0.2
A tool to process markdown files from GitHub repositories and store them in Upstash Vector
GitHub Docs Vectorizer
A Node.js tool that processes Markdown files from GitHub repositories, generates embeddings, and stores them in an Upstash Vector database. It is well suited to building document search systems, AI-driven documentation assistants, or knowledge bases.
Features
- Recursively find all Markdown (.md) and MDX (.mdx) files in any GitHub repository
- Chunk documents using LangChain's RecursiveCharacterTextSplitter for better text segmentation
- Supports both OpenAI and Upstash embeddings
- Stores document chunks and metadata in Upstash Vector for enhanced retrieval
Prerequisites
- Node.js (v16 or higher)
- NPM or Yarn for package management
- GitHub personal access token (required for repository access)
- Upstash Vector database account (to store vectors)
- OpenAI API key (optional, for generating embeddings)
How to Find Your GitHub Token
- Go to GitHub.com and sign in to your account
- Click on your profile picture in the top-right corner
- Go to Settings > Developer settings > Personal access tokens > Tokens (classic)
- Click Generate new token > Generate new token (classic)
- Give your token a descriptive name in the "Note" field
- Select the following scopes:
  - repo (Full control of private repositories)
  - read:org (Read organization data)
- Click Generate token
Installation Guide
- Clone the repository or create a new directory:
mkdir github-docs-vectorizer
cd github-docs-vectorizer
Ensure the following files are included in your directory:
- script.js: The main script for processing
- package.json: Manages project dependencies
- .env: Contains your environment variables (explained below)
- Install dependencies:
npm install @upstash/docs2vector
- Set up a .env file in the root directory of your project with your credentials:
# Required for accessing GitHub repositories
GITHUB_TOKEN=your_github_token
# Required for storing vectors in Upstash
UPSTASH_VECTOR_REST_URL=your_upstash_vector_url
UPSTASH_VECTOR_REST_TOKEN=your_upstash_vector_token
# Optional: Provide if using OpenAI embeddings
OPENAI_API_KEY=your_openai_api_key
Usage
Run the script by providing the GitHub repository URL as an argument:
node script.js https://github.com/username/repository
Example:
node script.js https://github.com/facebook/react
The script will:
- Clone the specified repository
- Find all Markdown files
- Split content into chunks
- Generate embeddings (using either OpenAI or Upstash)
- Store the chunks in your Upstash Vector database
- Clean up temporary files
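To make the chunking step in the list above more concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification for illustration only: the tool itself relies on LangChain's RecursiveCharacterTextSplitter, which also tries to break text on natural separators. The function name `chunkWithOverlap` is hypothetical and not part of this package.

```javascript
// Simplified illustration of chunking with overlap (NOT LangChain's algorithm).
// Each chunk holds at most `chunkSize` characters and repeats the last
// `chunkOverlap` characters of the previous chunk.
function chunkWithOverlap(text, chunkSize = 1000, chunkOverlap = 200) {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  const step = chunkSize - chunkOverlap; // distance between chunk starts
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

// A 2500-character document with the default settings.
const sample = 'a'.repeat(2500);
console.log(chunkWithOverlap(sample).map((chunk) => chunk.length));
// prints [ 1000, 1000, 900 ]
```

With a chunk size of 1000 and an overlap of 200, consecutive chunks start 800 characters apart, so text near a chunk boundary appears in both neighbouring chunks.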
Configuration
Embedding Options
Supported Embedding Providers
OpenAI Embeddings (default if API key is provided)
- Requires OPENAI_API_KEY in .env
- Uses OpenAI's text-embedding-ada-002 model
Upstash Embeddings (used when OpenAI API key is not provided)
- No additional configuration needed
- Uses Upstash's built-in embedding service
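The selection rule described above can be sketched as a small function. The environment variable names come from the .env example in this README, but the helper `resolveConfig` and its return shape are assumptions made for illustration; the package's internal logic may differ.

```javascript
// Hypothetical helper, not the package's API: reports which required
// variables are missing and which embedding provider would be used.
function resolveConfig(env) {
  const required = [
    'GITHUB_TOKEN',
    'UPSTASH_VECTOR_REST_URL',
    'UPSTASH_VECTOR_REST_TOKEN',
  ];
  const missing = required.filter((name) => !env[name]);
  // OpenAI is used only when an API key is present; otherwise Upstash.
  const provider = env.OPENAI_API_KEY ? 'openai' : 'upstash';
  return { provider, missing };
}

const config = resolveConfig(process.env);
if (config.missing.length > 0) {
  console.warn(`Missing required variables: ${config.missing.join(', ')}`);
}
console.log(`Embedding provider: ${config.provider}`);
```

Running a check like this before processing starts can surface a missing credential early, rather than partway through a repository.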
Customizing Document Chunking
To adjust how documents are split into chunks, you can update the configuration in script.js:
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // Adjust chunk size as needed
  chunkOverlap: 200 // Adjust overlap as needed
});
SDK
npm install @upstash/docs2vector dotenv
import Docs2Vector from '@upstash/docs2vector';
import dotenv from 'dotenv';

// Load environment variables
dotenv.config();

async function main() {
  try {
    // Step 1: Define the GitHub repository URL
    const githubRepoUrl = 'YOUR_GITHUB_URL';

    // Print start message
    console.log(`Starting processing for the repository: ${githubRepoUrl}`);

    // Step 2: Initialize the Docs2Vector SDK
    const converter = new Docs2Vector();

    // Step 3: Run the processing flow with Docs2Vector's `run` method
    await converter.run(githubRepoUrl);

    // Print success message
    console.log(`Successfully processed repository: ${githubRepoUrl}`);
    console.log('Vectors stored in Upstash Vector database.');
  } catch (error) {
    console.error('An error occurred while processing the repository:', error.message);
  }
}

main();
Metadata
Metadata accompanies each stored chunk for improved context:
- Original file name
- File type (Markdown or MDX)
- Relative file path in the repository
- Document source for the specific chunk of text
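As a rough picture of what such a metadata record could look like, the sketch below builds one from a repository URL and a file path. The field names (`fileName`, `fileType`, `filePath`, `source`) and the choice to store the repository URL as the source are assumptions for illustration; the keys actually written by the tool may be named differently.

```javascript
// Hypothetical metadata shape: field names are assumptions, not the
// package's documented schema.
function buildChunkMetadata(repoUrl, relativePath) {
  const fileName = relativePath.split('/').pop();
  const fileType = fileName.toLowerCase().endsWith('.mdx') ? 'mdx' : 'md';
  return {
    fileName,               // original file name
    fileType,               // Markdown or MDX
    filePath: relativePath, // path relative to the repository root
    source: repoUrl,        // assumed: where the chunk's document came from
  };
}

console.log(
  buildChunkMetadata('https://github.com/facebook/react', 'docs/guides/intro.mdx'),
);
```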
Error Handling
The script is designed to handle errors gracefully in the following cases:
- Invalid repository URLs provided
- Missing or incorrect credentials
- Unable to access or read the required files
- Connectivity or network-related problems
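For the first case in the list, an invalid repository URL can be caught before any cloning starts. The following is a minimal sketch of such a check; `parseGitHubRepoUrl` is a hypothetical helper and the tool may validate URLs in a different way.

```javascript
// Hypothetical validator: accepts https://github.com/<owner>/<repo> and
// returns the owner and repository name, or throws a descriptive error.
function parseGitHubRepoUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    throw new Error(`Invalid URL: ${input}`);
  }
  if (url.protocol !== 'https:' || url.hostname !== 'github.com') {
    throw new Error(`Not a GitHub HTTPS URL: ${input}`);
  }
  const [owner, repo] = url.pathname.split('/').filter(Boolean);
  if (!owner || !repo) {
    throw new Error(`Expected https://github.com/<owner>/<repo>, got: ${input}`);
  }
  // Tolerate a trailing ".git" as used in clone URLs.
  return { owner, repo: repo.replace(/\.git$/, '') };
}

console.log(parseGitHubRepoUrl('https://github.com/facebook/react'));
```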
In case of errors, the script will:
- Log the error message
- Clean up any temporary files
- Exit with a non-zero status code
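The log, clean up, and exit sequence above maps naturally onto a try/catch/finally block. The sketch below shows one way to structure it; `runWithCleanup`, `task`, and `cleanup` are placeholder names for illustration and do not refer to functions exported by this package.

```javascript
// Hypothetical wrapper: `task` stands in for the processing pipeline and
// `cleanup` for the removal of temporary files.
async function runWithCleanup(task, cleanup) {
  try {
    await task();
    return 0; // success
  } catch (error) {
    console.error('Processing failed:', error.message); // log the error message
    return 1; // non-zero status for the caller
  } finally {
    await cleanup(); // runs on success and on failure
  }
}

// Demo with a task that fails. A real CLI script would assign the returned
// code to process.exitCode instead of printing it.
runWithCleanup(
  async () => { throw new Error('Invalid repository URL'); },
  async () => console.log('Temporary files removed'),
).then((code) => console.log(`Exit code would be ${code}`));
```

Returning the status code instead of calling `process.exit` directly lets the cleanup step finish before the process ends.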
Contributing
Feel free to submit issues and enhancement requests!
License
MIT License - feel free to use this tool for any purpose.
Credits
This tool uses the following open-source packages:
- LangChain: Handles document processing and vector store integration
- Octokit: Facilitates interactions with the GitHub API
- simple-git: Manages operations on Git repositories
- Upstash Vector: Enables seamless storage and retrieval of document vectors
