skillforge
v0.1.0
Minimal runner and eval framework for Claude Skills
I'm using this to play with multiple versions of the same skill, and to refine them based on evaluation.
⚠️ WARNING ⚠️
This is alpha software written for personal use.
- It runs Claude in yolo mode which can and will wipe your data one day.
- It can also burn through a ton of tokens depending on what your skills do.
- Also, this is pretty much entirely vibecoded, and I have not read the code.
If you burn all your money to brick your computer, it's your own fault.
Usage
Create a folder for experimentation inside your existing project:
npx skillforge create myfolder
This will generate a stub with a skill to test and an example evaluation.
Run benchmarks:
cd myfolder
npx skillforge
You can add more versions of the same skill, or add more skills.
By default, the entire matrix runs 10 times. Repeated runs show cached results unless you pass --reset or delete myfolder/results from disk.
I recommend setting up a Claude session and asking it to check out the /results. Then you can use insights from that convo to refine your eval.md and SKILL.md.
Options
--runs <n>      Number of runs per variant (default: 10)
--reset         Archive old results and start fresh
--reeval        Re-run evaluation on existing workspaces
--cache-only    Show cached results only
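Here is a rough sketch of how the options above might be used day to day. The `sf` wrapper and the `SKILLFORGE_LIVE` variable are hypothetical and not part of skillforge; they only exist so the commands get printed instead of executed unless you opt in, which seems prudent given the warning above. The flags themselves are the ones listed in Options.

```shell
# Hypothetical helper, NOT part of skillforge: prints the command by default
# and only executes it when SKILLFORGE_LIVE=1 is set in the environment.
sf() {
  if [ "${SKILLFORGE_LIVE:-0}" = "1" ]; then
    npx skillforge "$@"
  else
    echo "npx skillforge $*"
  fi
}

sf --runs 3       # fewer runs per variant than the default of 10
sf --cache-only   # show cached results only
sf --reeval       # re-run evaluation on existing workspaces
sf --reset        # archive old results and start fresh
```

Run the snippet from inside myfolder. Without `SKILLFORGE_LIVE=1` it only echoes the four commands, so you can check what would run before spending any tokens.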