@fuzzyos/fuzzy
v0.1.4
CLI tool for managing vLLM deployments on GPU pods
fuzzy
Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.
Installation
npm install -g @fuzzyos/fuzzy
What is fuzzy?
fuzzy simplifies running large language models on remote GPU pods. It automatically:
- Sets up vLLM on fresh Ubuntu pods
- Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
- Manages multiple models on the same pod with "smart" GPU allocation
- Provides OpenAI-compatible API endpoints for each model
- Includes an interactive agent with file system tools for testing
Quick Start
# Set required environment variables
export HF_TOKEN=your_huggingface_token # Get from https://huggingface.co/settings/tokens
export FUZZY_API_KEY=your_api_key # Any string you want for API authentication
# Set up a DataCrunch pod with NFS storage (models path auto-extracted)
fuzzy pods setup dc1 "ssh [email protected]" \
--mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
# Start a model (automatic configuration for known models)
fuzzy start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
# Send a single message to the model
fuzzy agent qwen "What is the Fibonacci sequence?"
# Interactive chat mode with file system tools
fuzzy agent qwen -i
# Use with any OpenAI-compatible client
export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
export OPENAI_API_KEY=$FUZZY_API_KEY
Prerequisites
- Node.js 18+
- HuggingFace token (for model downloads)
- GPU pod with:
  - Ubuntu 22.04 or 24.04
  - SSH root access
  - NVIDIA drivers installed
  - Persistent storage for models
Supported Providers
Primary Support
DataCrunch - Best for shared model storage
- NFS volumes sharable across multiple pods in same region
- Models download once, use everywhere
- Ideal for teams or multiple experiments
RunPod - Good persistent storage
- Network volumes persist independently
- Cannot share between running pods simultaneously
- Good for single-pod workflows
Also Works With
- Vast.ai (volumes locked to specific machine)
- Prime Intellect (no persistent storage)
- AWS EC2 (with EFS setup)
- Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH
Commands
Pod Management
fuzzy pods setup <name> "<ssh>" [options] # Setup new pod
--mount "<mount_command>" # Run mount command during setup
--models-path <path> # Override extracted path (optional)
--vllm release|nightly|gpt-oss # vLLM version (default: release)
fuzzy pods # List all configured pods
fuzzy pods active <name> # Switch active pod
fuzzy pods remove <name> # Remove pod from local config
fuzzy shell [<name>] # SSH into pod
fuzzy ssh [<name>] "<command>" # Run command on pod
Note: When using --mount, the models path is automatically extracted from the mount command's target directory. You only need --models-path if not using --mount or to override the extracted path.
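The path extraction described in the note can be pictured with a small Python sketch. This is not fuzzy's actual implementation; it only assumes the mount command follows the usual `mount [options] <source> <target>` form, so the target directory is the last argument. The function name `extract_mount_target` is hypothetical.

```python
import shlex


def extract_mount_target(mount_command: str) -> str:
    """Return the target directory of a mount command (its final argument)."""
    tokens = shlex.split(mount_command)
    if "mount" not in tokens or len(tokens) < 3:
        raise ValueError(f"not a recognizable mount command: {mount_command!r}")
    # Assumption: `mount [options] <source> <target>`, so the target is last.
    return tokens[-1]


cmd = "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
print(extract_mount_target(cmd))  # /mnt/hf-models
```

If your mount command puts anything after the target directory, pass --models-path explicitly instead of relying on extraction.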
vLLM Version Options
- release (default): Stable vLLM release, recommended for most users
- nightly: Latest vLLM features, needed for newest models like GLM-4.5
- gpt-oss: Special build for OpenAI's GPT-OSS models only
Model Management
fuzzy start <model> --name <name> [options] # Start a model
--memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
--context <size> # Context window: 4k, 8k, 16k, 32k, 64k, 128k
--gpus <count> # Number of GPUs to use (predefined models only)
--pod <name> # Target specific pod (overrides active)
--vllm <args...> # Pass custom args directly to vLLM
fuzzy stop [<name>] # Stop model (or all if no name given)
fuzzy list # List running models with status
fuzzy logs <name> # Stream model logs (tail -f)
Agent & Chat Interface
fuzzy agent <name> "<message>" # Single message to model
fuzzy agent <name> "<msg1>" "<msg2>" # Multiple messages in sequence
fuzzy agent <name> -i # Interactive chat mode
fuzzy agent <name> -i -c # Continue previous session
# Standalone OpenAI-compatible agent (works with any API)
fuzzy-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
fuzzy-agent --api-key sk-... "What is 2+2?" # Uses OpenAI by default
fuzzy-agent --json "What is 2+2?" # Output event stream as JSONL
fuzzy-agent -i # Interactive mode
The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities, particularly useful for code navigation and analysis tasks.
Predefined Model Configurations
fuzzy includes predefined configurations for popular agentic models, so you do not have to specify --vllm arguments manually. fuzzy will also check if the model you selected can actually run on your pod with respect to the number of GPUs and available VRAM. Run fuzzy start without additional arguments to see a list of predefined models that can run on the active pod.
Qwen Models
# Qwen2.5-Coder-32B - Excellent coding model, fits on single H100/H200
fuzzy start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
# Qwen3-Coder-30B - Advanced reasoning with tool use
fuzzy start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3
# Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
fuzzy start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
GPT-OSS Models
# Requires special vLLM build during setup
fuzzy pods setup gpt-pod "ssh [email protected]" --models-path /workspace --vllm gpt-oss
# GPT-OSS-20B - Fits on 16GB+ VRAM
fuzzy start openai/gpt-oss-20b --name gpt20
# GPT-OSS-120B - Needs 60GB+ VRAM
fuzzy start openai/gpt-oss-120b --name gpt120
GLM Models
# GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
fuzzy start zai-org/GLM-4.5 --name glm
# GLM-4.5-Air - Smaller version, 1-2 GPUs
fuzzy start zai-org/GLM-4.5-Air --name glm-air
Custom Models with --vllm
For models not in the predefined list, use --vllm to pass arguments directly to vLLM:
# DeepSeek with custom settings
fuzzy start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
--tensor-parallel-size 4 --trust-remote-code
# Mixtral with tensor and pipeline parallelism
fuzzy start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
--tensor-parallel-size 8 --pipeline-parallel-size 2
# Any model with specific tool parser
fuzzy start some/model --name mymodel --vllm \
--tool-call-parser hermes --enable-auto-tool-choice
DataCrunch Setup
DataCrunch offers the best experience with shared NFS storage across pods:
1. Create Shared Filesystem (SFS)
- Go to DataCrunch dashboard → Storage → Create SFS
- Choose size and datacenter
- Note the mount command (e.g., sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02)
2. Create GPU Instance
- Create instance in same datacenter as SFS
- Share the SFS with the instance
- Get SSH command from dashboard
3. Setup with fuzzy
# Get mount command from DataCrunch dashboard
fuzzy pods setup dc1 "ssh [email protected]" \
--mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
# Models automatically stored in /mnt/hf-models (extracted from mount command)
4. Benefits
- Models persist across instance restarts
- Share models between multiple instances in same datacenter
- Download once, use everywhere
- Pay only for storage, not compute time during downloads
RunPod Setup
RunPod offers good persistent storage with network volumes:
1. Create Network Volume (optional)
- Go to RunPod dashboard → Storage → Create Network Volume
- Choose size and region
2. Create GPU Pod
- Select "Network Volume" during pod creation (if using)
- Attach your volume to /runpod-volume
- Get SSH command from pod details
3. Setup with fuzzy
# With network volume
fuzzy pods setup runpod "ssh [email protected]" --models-path /runpod-volume
# Or use workspace (persists with pod but not shareable)
fuzzy pods setup runpod "ssh [email protected]" --models-path /workspace
Multi-GPU Support
Automatic GPU Assignment
When running multiple models, fuzzy automatically assigns them to different GPUs:
fuzzy start model1 --name m1 # Auto-assigns to GPU 0
fuzzy start model2 --name m2 # Auto-assigns to GPU 1
fuzzy start model3 --name m3 # Auto-assigns to GPU 2
Specify GPU Count for Predefined Models
For predefined models with multiple configurations, use --gpus to control GPU usage:
# Run Qwen on 1 GPU instead of all available
fuzzy start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1
# Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
fuzzy start zai-org/GLM-4.5 --name glm --gpus 8
If the model doesn't have a configuration for the requested GPU count, you'll see available options.
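The lookup behind --gpus can be pictured as a table keyed by GPU count. The Python sketch below is illustrative only: the function `select_config`, the table layout, and the example entries are all hypothetical and are not taken from fuzzy's real model definitions.

```python
def select_config(configs, gpus):
    """Return the vLLM arguments registered for `gpus`, or fail listing the options."""
    if gpus not in configs:
        options = ", ".join(str(n) for n in sorted(configs))
        raise ValueError(f"no configuration for {gpus} GPU(s); available: {options}")
    return configs[gpus]


# Hypothetical table for one model: GPU count -> vLLM arguments.
example_configs = {
    1: ["--tensor-parallel-size", "1"],
    2: ["--tensor-parallel-size", "2"],
    8: ["--tensor-parallel-size", "8"],
}

print(select_config(example_configs, 2))
try:
    select_config(example_configs, 4)
except ValueError as err:
    print(err)  # no configuration for 4 GPU(s); available: 1, 2, 8
```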
Tensor Parallelism for Large Models
For models that don't fit on a single GPU:
# Shard the model across 4 GPUs with tensor parallelism
fuzzy start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
--tensor-parallel-size 4
# Data parallelism with expert parallelism across 8 GPUs
fuzzy start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
--data-parallel-size 8 --enable-expert-parallel
API Integration
All models expose OpenAI-compatible endpoints:
from openai import OpenAI
client = OpenAI(
    base_url="http://your-pod-ip:8001/v1",
    api_key="your-fuzzy-api-key"
)
# Chat completion with tool calling
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    }],
    tool_choice="auto"
)
Standalone Agent CLI
fuzzy includes a standalone OpenAI-compatible agent that can work with any API:
# Install globally to get fuzzy-agent command
npm install -g @fuzzyos/fuzzy
# Use with OpenAI
fuzzy-agent --api-key sk-... "What is machine learning?"
# Use with local vLLM
fuzzy-agent --base-url http://localhost:8000/v1 \
--model meta-llama/Llama-3.1-8B-Instruct \
--api-key dummy \
"Explain quantum computing"
# Interactive mode
fuzzy-agent -i
# Continue previous session
fuzzy-agent --continue "Follow up question"
# Custom system prompt
fuzzy-agent --system-prompt "You are a Python expert" "Write a web scraper"
# Use responses API (for GPT-OSS models)
fuzzy-agent --api responses --model openai/gpt-oss-20b "Hello"
The agent supports:
- Session persistence across conversations
- Interactive TUI mode with syntax highlighting
- File system tools (read, list, bash, glob, rg) for code navigation
- Both Chat Completions and Responses API formats
- Custom system prompts
Tool Calling Support
fuzzy automatically configures appropriate tool calling parsers for known models:
- Qwen models: hermes parser (Qwen3-Coder uses qwen3_coder)
- GLM models: glm4_moe parser with reasoning support
- GPT-OSS models: Use the /v1/responses endpoint, because tool calling (function calling in OpenAI parlance) is still a work in progress on the /v1/chat/completions endpoint.
- Custom models: Specify with --vllm --tool-call-parser <parser> --enable-auto-tool-choice
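With a parser configured, tool calls come back in the standard Chat Completions shape: each entry in the assistant message's tool_calls list carries a function name and its arguments as a JSON string. The Python sketch below shows one way a client might dispatch them. It works on plain dictionaries so it runs without a server; `run_tool_calls` and the stub handler are hypothetical names, and the stub does not actually execute code.

```python
import json


def run_tool_calls(message, handlers):
    """Dispatch each tool call in an assistant message and build the tool replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
        output = handlers[function["name"]](**arguments)
        replies.append({"role": "tool", "tool_call_id": call["id"], "content": str(output)})
    return replies


def fake_execute_code(code):
    """Hypothetical stand-in for a real sandboxed executor."""
    return f"received {len(code)} characters"


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "execute_code", "arguments": json.dumps({"code": "print(1)"})},
    }],
}

print(run_tool_calls(assistant_message, {"execute_code": fake_execute_code}))
```

The returned messages would be appended to the conversation and sent back to the model in the next request.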
To disable tool calling:
fuzzy start model --name mymodel --vllm --disable-tool-call-parser
Memory and Context Management
GPU Memory Allocation
Controls how much GPU memory vLLM pre-allocates:
- --memory 30%: High concurrency, limited context
- --memory 50%: Balanced (default)
- --memory 90%: Maximum context, low concurrency
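vLLM itself takes this setting as a fraction through its --gpu-memory-utilization argument. Assuming fuzzy converts the percentage in the obvious way (an assumption; the actual mapping is not documented here), the conversion would look like this Python sketch, where `memory_to_fraction` is a hypothetical helper:

```python
def memory_to_fraction(memory):
    """Convert a percentage string such as '30%' into a fraction such as 0.3."""
    percent = float(memory.strip().rstrip("%"))
    if not 0 < percent <= 100:
        raise ValueError(f"memory must be between 0% and 100%, got {memory!r}")
    return percent / 100


for value in ("30%", "50%", "90%"):
    print(value, "->", memory_to_fraction(value))
```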
Context Window
Sets maximum input + output tokens:
- --context 4k: 4,096 tokens total
- --context 32k: 32,768 tokens total
- --context 128k: 131,072 tokens total
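The sizes above use k as a multiplier of 1,024 tokens. A short Python sketch of that arithmetic (the helper name `parse_context` is hypothetical, not part of fuzzy):

```python
def parse_context(size):
    """Convert a context size such as '32k' into a token count (k = 1,024)."""
    text = size.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)


for size in ("4k", "32k", "128k"):
    print(size, "->", parse_context(size))
```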
Example for coding workload:
# Large context for code analysis, moderate concurrency
fuzzy start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
--context 64k --memory 70%
Note: When using --vllm, the --memory, --context, and --gpus parameters are ignored. You'll see a warning if you try to use them together.
Session Persistence
The interactive agent mode (-i) saves sessions for each project directory:
# Start new session
fuzzy agent qwen -i
# Continue previous session (maintains chat history)
fuzzy agent qwen -i -c
Sessions are stored in ~/.fuzzy/sessions/ organized by project path and include:
- Complete conversation history
- Tool call results
- Token usage statistics
Architecture & Event System
The agent uses a unified event-based architecture where all interactions flow through AgentEvent types. This enables:
- Consistent UI rendering across console and TUI modes
- Session recording and replay
- Clean separation between API calls and UI updates
- JSON output mode for programmatic integration
Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.
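Because every interaction is an event with a type field, a consumer of the JSONL stream can simply dispatch on that field. The Python sketch below assumes only the event shapes shown under JSON Output Mode; the function name `summarize_events` is hypothetical, and other event types are ignored.

```python
import json


def summarize_events(lines):
    """Collect assistant text and token totals from a JSONL event stream."""
    texts = []
    usage = {"inputTokens": 0, "outputTokens": 0, "totalTokens": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event["type"] == "assistant_message":
            texts.append(event["text"])
        elif event["type"] == "token_usage":
            for key in usage:
                usage[key] += event.get(key, 0)
    return texts, usage


sample = [
    '{"type":"user_message","text":"What is 2+2?"}',
    '{"type":"assistant_start"}',
    '{"type":"assistant_message","text":"2 + 2 = 4"}',
    '{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}',
]
print(summarize_events(sample))
```

In practice the lines could come from sys.stdin, with the agent's --json output piped into the script.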
JSON Output Mode
Use the --json flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:
fuzzy-agent --api-key sk-... --json "What is 2+2?"
Each line is a complete JSON object representing an event:
{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
Troubleshooting
OOM (Out of Memory) Errors
- Reduce --memory percentage
- Use smaller model or quantized version (FP8)
- Reduce --context size
Model Won't Start
# Check GPU usage
fuzzy ssh "nvidia-smi"
# Check if port is in use
fuzzy list
# Force stop all models
fuzzy stop
Tool Calling Issues
- Not all models support tool calling reliably
- Try different parser: --vllm --tool-call-parser mistral
- Or disable: --vllm --disable-tool-call-parser
Access Denied for Models
Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".
vLLM Build Issues
If using --vllm nightly fails, try:
- Use --vllm release for stable version
- Check CUDA compatibility with fuzzy ssh "nvidia-smi"
Agent Not Finding Messages
If the agent prints its configuration instead of answering your message, wrap any message containing special characters in quotes:
# Good
fuzzy agent qwen "What is this file about?"
# Bad (shell might interpret special chars)
fuzzy agent qwen What is this file about?
Advanced Usage
Working with Multiple Pods
# Override active pod for any command
fuzzy start model --name test --pod dev-pod
fuzzy list --pod prod-pod
fuzzy stop test --pod dev-pod
Custom vLLM Arguments
# Pass any vLLM argument after --vllm
fuzzy start model --name custom --vllm \
--quantization awq \
--enable-prefix-caching \
--max-num-seqs 256 \
--gpu-memory-utilization 0.95
Monitoring
# Watch GPU utilization
fuzzy ssh "watch -n 1 nvidia-smi"
# Check model downloads
fuzzy ssh "du -sh ~/.cache/huggingface/hub/*"
# View all logs
fuzzy ssh "ls -la ~/.vllm_logs/"
# Check agent session history
ls -la ~/.fuzzy/sessions/
Environment Variables
- HF_TOKEN - HuggingFace token for model downloads
- FUZZY_API_KEY - API key for vLLM endpoints
- FUZZY_CONFIG_DIR - Config directory (default: ~/.fuzzy)
- OPENAI_API_KEY - Used by fuzzy-agent when no --api-key provided
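The two fallbacks described above (a default config directory and an API key taken from the environment when no flag is given) can be sketched in Python as follows. The helper names are hypothetical and only illustrate the documented precedence, not fuzzy's own code.

```python
import os
from pathlib import Path


def resolve_config_dir(env=os.environ):
    """Use FUZZY_CONFIG_DIR when set, otherwise fall back to ~/.fuzzy."""
    return Path(env.get("FUZZY_CONFIG_DIR") or "~/.fuzzy").expanduser()


def resolve_api_key(cli_key, env=os.environ):
    """Prefer an explicit --api-key value, then fall back to OPENAI_API_KEY."""
    return cli_key or env.get("OPENAI_API_KEY")


print(resolve_config_dir({"FUZZY_CONFIG_DIR": "/tmp/fuzzy-config"}))
print(resolve_api_key(None, {"OPENAI_API_KEY": "from-env"}))
```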
License
MIT
