# AI Server
A TypeScript microservice that provides an API compatible with OpenAI and Claude for working with local LLM models through node-llama-cpp.
## Features
- 🔄 Full compatibility with OpenAI Chat API (`/v1/chat/completions`)
- 🤖 Compatibility with Anthropic Claude API (`/v1/messages`)
- 🐟 Full compatibility with DeepSeek API (`/v1/chat/completions`)
- 🌊 Support for streaming generation (Streaming API)
- 🔑 Your own API key authentication
- 🧠 Run local LLM models in GGUF format
- ⚙️ Configuration through environment variables
- 🔍 Monitoring via the `/health` endpoint
- 📋 Standard API for retrieving the model list (`/v1/models`)
## Requirements
- Node.js 18+
- TypeScript 5.3+
- A GGUF model (Llama 2, Mistral, Llama 3, or other compatible models)
- At least 16 GB of RAM recommended for 7B models
## Installation
1. Clone the repository:

   ```bash
   git clone https://github.com/ivanoff/ai-server.git
   cd ai-server
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Create a directory for models:

   ```bash
   mkdir -p models
   ```

4. Download a GGUF model into the `models/` directory (for example, from Hugging Face).

5. Copy the example `.env` file and configure it to your needs:

   ```bash
   cp .env.example .env
   ```

6. Compile TypeScript:

   ```bash
   npm run build
   ```

7. Start the server:

   ```bash
   npm start
   ```
## Project Structure
```
ai-server/
├── src/
│   └── server.ts     # Main server code
├── models/           # Directory for GGUF models
├── dist/             # Compiled files
├── .env              # Configuration
├── package.json
└── tsconfig.json
```

## Configuration
Configure the `.env` file to change server parameters:

```env
# Path to the model (absolute or relative to project root)
MODEL_PATH=./models/llama-2-7b-chat.gguf

# Server port
PORT=3000

# Default maximum number of tokens
DEFAULT_MAX_TOKENS=2048

# Number of model layers to offload to GPU (0 for CPU-only)
GPU_LAYERS=120

# API key for authentication
API_KEY=your_api_key
```
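As a rough illustration of how these values are typically consumed at startup (a sketch, not the actual `server.ts`; the use of `dotenv` and the default values are assumptions):

```typescript
// config.ts — illustrative sketch of reading the .env values above (not the actual server code)
import 'dotenv/config'; // assumes the dotenv package is used to load .env

export const config = {
  // Path to the GGUF model file
  modelPath: process.env.MODEL_PATH ?? './models/llama-2-7b-chat.gguf',
  // HTTP port the server listens on
  port: Number(process.env.PORT ?? 3000),
  // Token limit applied when a request omits max_tokens
  defaultMaxTokens: Number(process.env.DEFAULT_MAX_TOKENS ?? 2048),
  // Layers offloaded to the GPU; 0 keeps inference on the CPU
  gpuLayers: Number(process.env.GPU_LAYERS ?? 0),
  // Key that clients must present when calling the API
  apiKey: process.env.API_KEY ?? '',
};
```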
## Usage Examples

### OpenAI API compatible request
```typescript
const response = await fetch('http://localhost:3000/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer your_api_key'
  },
  body: JSON.stringify({
    model: 'llama-local',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Tell me about TypeScript' }
    ],
    max_tokens: 500,
    temperature: 0.7
  })
});

const data = await response.json();
console.log(data);
```
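Because the endpoint mirrors the OpenAI Chat API, the official `openai` npm client can likely be pointed at this server by overriding its base URL. A minimal sketch under that assumption (not verified against this server's response format):

```typescript
import OpenAI from 'openai';

// Point the official client at the local server instead of api.openai.com
const client = new OpenAI({
  baseURL: 'http://localhost:3000/v1',
  apiKey: 'your_api_key', // the API_KEY value from .env
});

const completion = await client.chat.completions.create({
  model: 'llama-local',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Tell me about TypeScript' },
  ],
  max_tokens: 500,
  temperature: 0.7,
});

// Assumes the server returns an OpenAI-shaped response with choices[0].message
console.log(completion.choices[0].message.content);
```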
### Anthropic Claude API compatible request

```typescript
const response = await fetch('http://localhost:3000/v1/messages', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': 'your_api_key'
  },
  body: JSON.stringify({
    model: 'llama-local',
    messages: [
      { role: 'human', content: 'Tell me about TypeScript' }
    ],
    max_tokens: 500,
    temperature: 0.7
  })
});

const data = await response.json();
console.log(data);
```

### Streaming mode
To use streaming mode, add the `stream: true` parameter to the request and process the event stream:
```typescript
const response = await fetch('http://localhost:3000/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer your_api_key'
  },
  body: JSON.stringify({
    model: 'llama-local',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Tell me about TypeScript' }
    ],
    max_tokens: 500,
    temperature: 0.7,
    stream: true
  })
});

// Process the event stream
const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const chunk = decoder.decode(value);
  const lines = chunk.split('\n').filter(line => line.trim() !== '');

  for (const line of lines) {
    if (line.startsWith('data: ') && line !== 'data: [DONE]') {
      const jsonData = JSON.parse(line.replace('data: ', ''));
      console.log(jsonData);
    }
  }
}
```
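If the streamed chunks follow the OpenAI streaming format (each `data:` payload carrying the next text fragment under `choices[0].delta.content`), the loop above can be narrowed to print only the generated text. A small helper sketch under that assumption; the exact chunk shape produced by this server is not documented here:

```typescript
// Extracts the text fragment from one parsed SSE chunk, assuming an
// OpenAI-style payload: { choices: [{ delta: { content: string } }] }.
// This shape is an assumption, not taken from this server's docs.
function extractDelta(jsonData: any): string {
  return jsonData?.choices?.[0]?.delta?.content ?? '';
}

// Inside the for-loop above, replace console.log(jsonData) with:
//   process.stdout.write(extractDelta(jsonData));
```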
## API Endpoints

### `/v1/chat/completions`
OpenAI Chat API compatible endpoint.
**Request Parameters:**

- `messages`: Array of message objects with `role` and `content`
- `model`: Model identifier (optional)
- `max_tokens`: Maximum number of tokens to generate (optional)
- `temperature`: Randomness of generation (optional)
- `stream`: Enable streaming mode (optional)
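For reference, the request body can be described with a TypeScript type like the following (an illustrative sketch based on the parameters above; the role values mirror the OpenAI-style examples earlier and are not an exhaustive list):

```typescript
// Illustrative request shape for /v1/chat/completions — field names follow the
// documented parameters; the role union is assumed from the examples above.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatCompletionRequest {
  messages: ChatMessage[]; // required
  model?: string;          // optional model identifier
  max_tokens?: number;     // optional generation limit
  temperature?: number;    // optional randomness of generation
  stream?: boolean;        // optional streaming mode
}
```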
### `/v1/messages`
Claude API compatible endpoint.
**Request Parameters:**

- `messages`: Array of message objects with `role` and `content`
- `model`: Model identifier (optional)
- `max_tokens`: Maximum number of tokens to generate (optional)
- `temperature`: Randomness of generation (optional)
- `stream`: Enable streaming mode (optional)
### `/health`
Health check endpoint that returns server status and model path.
### `/v1/models`
Returns a list of available models (currently returns a single model, `llama-local`).
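A quick way to check both utility endpoints from a script (a sketch that only prints the raw JSON, since the exact response fields beyond server status, model path, and the model list are not documented here; whether `/health` requires authentication is not specified, so the key is sent to both):

```typescript
// Query the utility endpoints and print their raw JSON responses.
// Assumes the default port 3000 and the API_KEY value from .env.
const headers = { 'Authorization': 'Bearer your_api_key' };

const health = await fetch('http://localhost:3000/health', { headers });
console.log('health:', await health.json());

const models = await fetch('http://localhost:3000/v1/models', { headers });
console.log('models:', await models.json());
```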
## Development
Run in development mode with hot reloading:

```bash
npm run dev
```

Watch for TypeScript changes:

```bash
npm run watch
```
## License
Created by Dimitry Ivanov [email protected] `# curl -A cv ivanoff.org.ua`
