@tobiascantcode/spfgraph — zeroshot-run / sph
Run LLMs locally in your terminal. Like Ollama, but with first-class support for
custom .pt GPT-2, LLaMA, and Phi checkpoints. GGUF models work too via
node-llama-cpp (optional). Ships a live terminal dashboard (tokens/sec + VRAM
in real time) and an OpenAI-compatible HTTP server so any client that speaks
the OpenAI API works out of the box.
Status: early (0.1.x). The .pt path is the most exercised; GGUF support depends on node-llama-cpp building on your platform. Bug reports and PRs welcome — see Issues.
Tested on: Windows 11 (PyTorch 2.10 + CUDA 12.6, Python 3.10). macOS and Linux are expected to work but currently unverified — please open an issue if something's off.
┌─────────────────────────────────────────────────────────┐
│ zeroshot-run | ckpt_mid_final | 539M params | fp16 │
│ VRAM: 1.0GB / 4.0GB | Speed: 79.0 tok/s | Ctx: 21/2048 │
├──────────────────────────┬──────────────────────────────┤
│ tok/s over time │ VRAM (MB) over time │
├──────────────────────────┴──────────────────────────────┤
│ chat ↓ │
│ you The quick brown fox │
│ model jumps over the lazy dog" is more concise... │
├─────────────────────────────────────────────────────────┤
│ you _ │
└─────────────────────────────────────────────────────────┘
Install
npm i -g @tobiascantcode/spfgraph
That gives you two equivalent commands: zeroshot-run and sph.
Prerequisites
For .pt checkpoints (the main path) you need Python with torch and
tiktoken:
pip install tiktoken
# For torch, use the install command from https://pytorch.org/get-started/locally/
# that matches your CUDA version (or CPU-only). A bare `pip install torch` may
# pull a build that doesn't match your driver.
For .gguf models the optional node-llama-cpp dependency must build
successfully. On Windows that needs Visual Studio Build Tools + cmake +
Python; on macOS/Linux it's usually automatic. If the build fails the npm
install still succeeds — only the GGUF backend is unavailable, the .pt
path keeps working.
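Back on the .pt side, if you're unsure whether the torch build you installed actually matches your driver, here is a quick optional sanity check (not specific to this tool):
# Optional: confirm the installed torch can see your GPU. If this prints
# False, .pt inference still works, just on CPU (see --cpu below).
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())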
Quickstart
# Drop a checkpoint into ~/.zeroshot/models/, or register one explicitly:
sph register zeroshot-500m ./checkpoints/zeroshot-500m.pt
# See what's known
sph list
# Open the dashboard
sph load zeroshot-500m
# Or pass a path directly
sph load ./model.pt
sph load ./llama3-8b.Q4_K_M.gguf
# Throughput benchmark (no TUI)
sph bench zeroshot-500m --tokens 256 --prompt "Hello"
# Start an OpenAI-compatible REST API
sph serve ./model.pt --port 11434
Commands
sph load <model> Load a model and start the interactive REPL
sph serve <model> Start an OpenAI-compatible HTTP server
sph list List local models (registered + scanned)
sph bench <model> Run a quick throughput benchmark
sph register <name> <path> Add a model to the local registry
load flags:
| Flag | Default | Effect |
|---|---|---|
| --temperature <n> | 0.8 | sampling temperature |
| --top-k <n> | 40 | top-k sampling |
| --top-p <n> | 1.0 | nucleus sampling (1.0 disables) |
| --min-p <n> | 0.0 | min-p sampling (0.05 recommended, 0.0 disables) |
| --repetition-penalty <n> | 1.0 | penalize repeated tokens (1.1 typical, 1.0 disables) |
| --max-tokens <n> | 512 | max tokens per response |
| --ctx <n> | model's block_size | override context window |
| --tokenizer <path> | tiktoken | path to custom Hugging Face tokenizer.json |
| --arch <name> | auto | override architecture detection (gpt2 or llama) |
| --cpu | off | force CPU inference (otherwise CUDA fp16 is used when available) |
| --python <path> | python | Python executable (also via ZEROSHOT_PYTHON env var) |
serve flags: --port (default 11434), --host (default 127.0.0.1),
--ctx, --cpu, --python, --tokenizer, --arch. Sampling
(temperature, top_k, top_p, min_p, repetition_penalty,
max_tokens) is set per-request via the JSON body, not at server start.
bench flags: --tokens <n>, --prompt <text>, --python <path>
Dashboard controls
Once load is running:
| Key | Action |
|---|---|
| Enter | send the prompt |
| Ctrl+C | quit |
| Esc | cancel current generation |
| Ctrl+T | scroll up one line (slow / fine control) |
| Ctrl+U | scroll down one line |
| PgUp / PgDn | half-page scroll |
| Home / End | jump to top / bottom |
| Ctrl+B / Ctrl+F | full-page back / forward |
| Ctrl+G | jump to bottom and re-enable auto-follow |
| Mouse wheel | scroll 3 lines per tick |
| /clear | clear chat history |
| /quit or /exit | quit |
The chat label shows your scroll status: chat ↓ when pinned to the bottom
(auto-following new tokens), chat 12/30 (scrolled — End to follow) when
you've scrolled up to read history. New tokens won't yank you back to the
bottom while you're scrolled up.
OpenAI-compatible server
sph serve exposes any loaded model as an HTTP API that speaks the OpenAI
chat-completions dialect, so existing clients (LangChain, llama-index, the
openai SDK pointed at a custom base URL, Open WebUI, etc.) work without
changes.
sph serve ./model.pt --port 11434 --host 127.0.0.1
Endpoints:
| Method | Path | Description |
|---|---|---|
| GET | /v1/models | Lists the currently-loaded model |
| POST | /v1/chat/completions | Chat completion, streaming or not |
| OPTIONS | * | CORS preflight (all origins allowed) |
Request body fields honored: messages, stream, max_tokens,
temperature, top_p, top_k, min_p, repetition_penalty. Streaming
responses are server-sent events terminated with data: [DONE].
The server holds a single in-flight generation lock — concurrent requests get HTTP 429 until the current one finishes. (One model, one GPU, one generation at a time.)
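As a sketch, any OpenAI client works by pointing its base URL at the server; for example with the openai Python SDK (the model name is arbitrary since only one model is loaded, and the api_key is a dummy value assuming no auth):
# Sketch: openai Python SDK pointed at sph serve. Sampling parameters
# (temperature, max_tokens, ...) go per-request, as noted above.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="unused")
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "hi"}],
    temperature=0.8,
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)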
Quick test:
curl http://127.0.0.1:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"local","messages":[{"role":"user","content":"hi"}],"stream":true}'Checkpoint format (.pt)
The bundled python/inference.py expects checkpoints saved as:
torch.save({'model': model.state_dict(), 'config': asdict(cfg)}, path)
# either key works; `config` and `model_config` are both accepted:
torch.save({'model': model.state_dict(), 'model_config': asdict(cfg)}, path)
Required config keys: n_layer, n_head, n_embd. Optional keys:
block_size (default 1024), vocab_size (default 50304), bias
(default True). For LLaMA architectures: rope_theta, num_key_value_heads, intermediate_size.
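For instance, a sketch of saving a compatible checkpoint with an explicit config dict (the values are illustrative and `model` stands in for your trained module):
# Illustrative checkpoint with explicit config keys; `model` is assumed
# to be your trained GPT-2-style nn.Module.
import torch

config = {
    "n_layer": 24, "n_head": 16, "n_embd": 1024,             # required
    "block_size": 2048, "vocab_size": 50304, "bias": True,   # optional
}
torch.save({"model": model.state_dict(), "config": config}, "zeroshot-500m.pt")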
Architecture detection is automatic: GPT-2 (decoder-only with weight tying) and LLaMA (RMSNorm, RoPE, SwiGLU, GQA) are distinguished via config keys.
Tokenization defaults to tiktoken's GPT-2 encoding, but you can pass
--tokenizer ./tokenizer.json to load any Hugging Face tokenizers file natively.
How loading works
The loader is designed to fit large models on small GPUs (e.g., a 600M model on a 4 GB card):
- torch.load(..., mmap=True) — the checkpoint is memory-mapped, not read entirely into RAM (PyTorch ≥ 2.1).
- The model is built on the meta device with set_default_dtype(fp16) — every parameter has a shape but zero storage, so there's no CPU RAM spike from default initialization.
- model.to_empty(device='cuda') allocates fresh fp16 storage directly on the GPU.
- load_state_dict(...) copies tensors one at a time, casting fp32 → fp16 in flight. Peak temporary memory ≈ one tensor.
For a 600M-parameter model the peak footprint works out to about 2.4 GB CPU
RAM (mmap'd state dict) plus 1.2 GB VRAM (fp16 model). YMMV depending on
model size and n_embd.
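In PyTorch terms the sequence is roughly the following sketch (not the project's exact code; GPT is a hypothetical GPT-2-style module built from the config — the real loader lives in python/inference.py):
# Sketch of the loading sequence above. `GPT` is a hypothetical
# GPT-2-style nn.Module constructed from the checkpoint's config keys.
import torch

ckpt = torch.load("model.pt", mmap=True, map_location="cpu")  # memory-mapped
cfg = ckpt.get("config") or ckpt["model_config"]

torch.set_default_dtype(torch.float16)
with torch.device("meta"):             # shapes only, zero storage
    model = GPT(**cfg)

model.to_empty(device="cuda")          # fresh fp16 storage directly on the GPU
model.load_state_dict(ckpt["model"])   # tensor-by-tensor copy, fp32 -> fp16 in flight
model.eval()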
CUDA OOM auto-fallback
If to_empty(device='cuda') raises OutOfMemoryError (your GPU is
contended — browser, other ML processes, etc.), the loader automatically
retries on CPU instead of bailing:
materializing on cuda...
CUDA OOM during materialization: ...
falling back to CPU (re-run with --cpu to skip this attempt next time)
CPU is much slower than GPU but always works.
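The pattern is roughly this (a sketch of the behavior described above, not the project's exact code):
# Sketch: materialize on CUDA, retry on CPU if the allocation fails.
import torch

def materialize(model):
    if torch.cuda.is_available():
        try:
            model.to_empty(device="cuda")
            return "cuda"
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # drop any partial allocation
    model.to_empty(device="cpu")
    return "cpu"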
Bridge protocol
zeroshot-run spawns python/inference.py --server ... and exchanges
line-delimited JSON over stdin/stdout. Events from Python:
{ "type": "ready", "config": {...}, "device": "cuda", "vram_total_mb": 4096 }
{ "type": "token", "text": "...", "id": 123 }
{ "type": "stats", "tokens_per_sec": 34.2, "vram_used_mb": 2100, "ctx_used": 128, "ctx_max": 2048 }
{ "type": "done", "total_tokens": 84, "elapsed_ms": 2450 }
{ "type": "error", "message": "..." }Commands accepted on stdin:
{ "cmd": "generate", "prompt": "...", "max_tokens": 256, "temperature": 0.8, "top_k": 40 }
{ "cmd": "stop" }
{ "cmd": "shutdown" }The GGUF backend (src/inference/gguf_loader.js) emits the same event
surface so the dashboard is backend-agnostic.
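A minimal sketch of driving the bridge by hand from Python follows; the arguments after --server are elided above, so the spawn line below is a placeholder (see python/inference.py for the real CLI):
# Sketch of a hand-rolled bridge client. The arguments after --server are
# placeholders; consult python/inference.py for the actual flags.
import json, subprocess

proc = subprocess.Popen(
    ["python", "python/inference.py", "--server"],  # + model args (placeholder)
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

def send(cmd):
    proc.stdin.write(json.dumps(cmd) + "\n")
    proc.stdin.flush()

for line in proc.stdout:                       # one JSON event per line
    event = json.loads(line)
    if event["type"] == "ready":
        send({"cmd": "generate", "prompt": "Hello", "max_tokens": 32,
              "temperature": 0.8, "top_k": 40})
    elif event["type"] == "token":
        print(event["text"], end="", flush=True)
    elif event["type"] == "done":
        send({"cmd": "shutdown"})
        break
    elif event["type"] == "error":
        raise RuntimeError(event["message"])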
Config
~/.zeroshot/config.json — model registry and defaults. Edit it directly to
change defaults or scan paths:
{
"defaults": { "temperature": 0.8, "top_k": 40, "max_tokens": 512, "fp16": true },
"models": {
"zeroshot-500m": { "path": "C:\\path\\to\\model.pt", "added": "..." }
},
"scan_paths": ["C:\\Users\\you\\.zeroshot\\models"]
}
Drop any .pt / .gguf file into ~/.zeroshot/models/ and sph list
will pick it up automatically.
Environment variables
| Variable | Effect |
|---|---|
| ZEROSHOT_PYTHON | Python executable to spawn (default python) |
| ZEROSHOT_DEBUG=1 | Log every keypress and mouse event the dashboard receives to ~/.zeroshot/debug.log. Use this if a key or wheel feels unresponsive — the log tells you whether the terminal is sending the event at all. |
| PYTORCH_CUDA_ALLOC_CONF | Inherited by the Python process. Defaults to expandable_segments:True (no-op on Windows, helps on Linux when free VRAM is fragmented). |
Layout
src/
ui/dashboard.js blessed-contrib screen, scroll handling, render loop
inference/
python_bridge.js spawn + line-delimited JSON protocol for .pt models
gguf_loader.js optional node-llama-cpp backend, same event surface
bench/benchmark.js throughput benchmark
models/registry.js ~/.zeroshot/config.json
commands/ load.js, list.js, bench.js
python/
inference.py GPT-2 decoder + meta-device loader + JSON server
requirements.txt
bin/
  zeroshot-run.js
Troubleshooting
node-llama-cpp install warnings on Windows — benign. The package is in
optionalDependencies, so a build failure is just a warning. Only the GGUF
backend is affected; .pt loading goes through Python and works regardless.
python exited before ready (code=3221225477) — that's 0xC0000005,
a Windows access violation, historically caused by torch running out of CPU
RAM during default model initialization for large models on systems with
≤8 GB RAM. The current loader uses meta-device construction to avoid that
spike. If you still hit it, run with --cpu and report the stderr in an
issue.
CUDA out of memory — the loader auto-retries on CPU. To skip the CUDA
attempt entirely, pass --cpu. To free GPU memory, close other apps that
use the GPU (browsers with hardware acceleration, Discord, OBS, other
training jobs).
Scrolling feels dead — try Ctrl+T / Ctrl+U (one-line scroll) before
PgUp/PgDn; some Windows console configurations don't forward PgUp/PgDn to
applications. If even Ctrl+T does nothing, run with ZEROSHOT_DEBUG=1 and
check ~/.zeroshot/debug.log to see whether the terminal is sending the
event at all.
sph: command not found after install — make sure you installed with
-g and that npm's global bin directory is on your PATH. npm config get prefix
shows where global bins go.
E404 on npm publish (developers only) — npm's confusing way of
saying "you don't have permission to publish to this scope." Run
npm whoami; if it shows the wrong user, run npm logout && npm login
and retry. Add --otp=123456 if 2FA is on.
Contributing
Source: github.com/TobiasLogic/SPFgraph. Issues and PRs welcome — especially:
- Reports from macOS or Linux (currently unverified)
- GGUF / node-llama-cpp interactions on different platforms
- Other GPT-style checkpoint formats worth supporting
