@reasoningco/cse
v0.1.0
Testing and analytics toolkit for Claude Code skills — trigger simulation, regression testing, conflict detection, and live session logging
claude-skills-evalkit
Testing and analytics toolkit for Claude Code skills. Answers "will my skill actually trigger?" before you deploy.
Install
npm install -g claude-skills-evalkit
This installs the cse CLI globally and automatically sets up the /cse slash command in Claude Code.
Or use without installing:
npx claude-skills-evalkit simulate ./my-skill/SKILL.md -p "your prompt"
The CLI is available as both claude-skills-evalkit and cse (shorthand).
/cse in Claude Code
After installing, a /cse slash command is available directly inside Claude Code. Restart Claude Code after install, then:
/cse simulate ./skills/my-skill/SKILL.md "does this trigger?"
/cse test ./skill-tests/my-suite.yaml
/cse conflicts
/cse optimize ./skills/my-skill/SKILL.md
/cse help
You can also use natural language:
/cse does my code reviewer skill trigger for "review this PR"?
/cse are there any conflicts between my skills?
To install the slash command manually (if you already had cse installed):
cse install-command
Commands
cse simulate — Test skill triggering
# Single prompt
cse simulate ./skills/my-skill/SKILL.md -p "build me a dashboard"
# Batch from file (one prompt per line)
cse simulate ./skills/my-skill/SKILL.md -f prompts.txt
# Verbose — show per-signal score breakdowns
cse simulate ./skills/my-skill/SKILL.md -p "build a form" -v
# JSON output for scripting
cse simulate ./skills/my-skill/SKILL.md -f prompts.txt --json
# Show top N results (default: 5)
cse simulate ./skills/my-skill/SKILL.md -p "create a component" --top 10
Output:
Prompt: "build me a dashboard"
Rank  Skill               Plugin      Confidence
──────────────────────────────────────────────────────────
1   ► frontend-design     ui-toolkit  0.84 ████████
2     data-viz-builder    analytics   0.61 ██████
3     chart-generator     charting    0.42 ████
► = your target skill
Scored in 4.24ms
cse test — Regression test suites
Create a YAML test suite:
# skill-tests/frontend.yaml
name: frontend-triggers
skillPath: ../skills/frontend-design/SKILL.md
cases:
  - id: dashboard
    prompt: "build me a dashboard"
    expectedTrigger: true
    minConfidence: 0.5
  - id: sql-query
    prompt: "optimize SQL queries"
    expectedTrigger: false
Run it:
# Run all suites in ./skill-tests/
cse test
# Specific suite
cse test --suite ./skill-tests/frontend.yaml
# CI mode with JUnit XML
cse test --ci --reporter junit --output results.xml
# Filter by test ID
cse test --filter "dashboard"
cse conflicts — Detect overlapping skills
# Scan all installed skills for overlaps
cse conflicts
# Set overlap threshold (0-1)
cse conflicts --threshold 0.6
# JSON output
cse conflicts --format json
cse optimize — Improve skill descriptions
# Offline analysis (precision/recall scores + suggestions)
cse optimize ./skills/my-skill/SKILL.md \
  --positive should-trigger.txt \
  --negative should-not-trigger.txt
# AI-powered rewriting via claude -p (uses your existing Claude Code session)
cse optimize ./skills/my-skill/SKILL.md \
  --positive pos.txt --negative neg.txt --api
cse dry-run — Test instruction adherence
cse dry-run ./skills/my-skill/SKILL.md "build me a todo app"
cse dry-run ./skills/my-skill/SKILL.md "build me a todo app" --json
Uses claude -p (print mode), so no API key is needed. If ANTHROPIC_API_KEY is set, it falls back to the direct API.
cse log — Session analytics
cse log status # Show logging status
cse log query --since 7d # Query last 7 days
cse log query --group-by skill --since 30d
cse log export --format csv --output analytics.csv
cse dashboard — Analytics dashboard
cse dashboard
cse dashboard --since 30d
cse init — Initialize config
cse init
Creates .evalkitrc.json and skill-tests/example.yaml.
Programmatic API
import {
  simulateTrigger,
  batchSimulate,
  parseSkill,
  scanPlugins,
  detectConflicts,
} from 'claude-skills-evalkit';

// Parse a skill
const skill = parseSkill('./my-skill/SKILL.md');

// Scan installed plugins
const plugins = await scanPlugins();
const allSkills = plugins.flatMap(p => p.skills);

// Simulate triggering
const result = simulateTrigger('build me a dashboard', {
  targetSkill: skill,
  competingSkills: allSkills,
});
console.log(result.targetSkill?.confidence); // 0.84
console.log(result.targetSkill?.rank); // 1

// Detect conflicts
const report = detectConflicts(allSkills, 0.5);
console.log(report.overlaps);
Configuration
Create .evalkitrc.json in your project root (or run cse init):
{
  "scoring": {
    "weights": {
      "semantic": 0.35,
      "keywordOverlap": 0.20,
      "triggerPhrase": 0.25,
      "specificity": 0.10,
      "competition": 0.10
    }
  },
  "logging": {
    "redactPrompts": true,
    "storageBackend": "sqlite"
  },
  "regression": {
    "suiteDir": "./skill-tests",
    "reporters": ["console"]
  }
}
Claude Code Plugin
Install as a Claude Code plugin for live session logging and MCP tools (in addition to the CLI):
.claude-plugin/plugin.json — Plugin manifest
hooks/hooks.json — Automatic session logging
skills/claude-skills-evalkit/SKILL.md — Invoke via "test my skill"
.mcp.json — MCP server with simulate_trigger, detect_conflicts, analyze_description
Scoring Algorithm
Five signals combined with calibrated weights:
| Signal | Weight | What It Measures |
|--------|--------|------------------|
| Semantic (TF-IDF) | 0.35 | Topic relevance via cosine similarity |
| Trigger Phrase | 0.25 | Match against quoted phrases in description |
| Keyword Overlap | 0.20 | IDF-weighted Jaccard coefficient |
| Specificity | 0.10 | Description quality and precision |
| Competition | 0.10 | Score gap between top candidates |
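To make the weighting concrete, here is a minimal sketch of how five signal scores could be folded into one confidence value. The weights come from the table above, but the README does not say how they are applied. The plain weighted sum, the clamping to [0, 1], and the names `combineSignals` and `SignalScores` are assumptions for illustration, not the toolkit's actual implementation.

```typescript
// Weights copied from the scoring table; they sum to 1.0.
const WEIGHTS = {
  semantic: 0.35,
  triggerPhrase: 0.25,
  keywordOverlap: 0.20,
  specificity: 0.10,
  competition: 0.10,
} as const;

type Signal = keyof typeof WEIGHTS;
type SignalScores = Record<Signal, number>;

// ASSUMPTION: each signal is a score in [0, 1] and the signals are
// combined as a plain weighted sum. The real combination rule may differ.
function combineSignals(scores: SignalScores): number {
  let total = 0;
  for (const name of Object.keys(WEIGHTS) as Signal[]) {
    // Clamp so an out-of-range signal cannot push the result outside [0, 1].
    const clamped = Math.min(1, Math.max(0, scores[name]));
    total += WEIGHTS[name] * clamped;
  }
  return total;
}

// Hypothetical per-signal scores for one prompt/skill pair.
const example: SignalScores = {
  semantic: 0.8,
  triggerPhrase: 1.0,
  keywordOverlap: 0.5,
  specificity: 0.6,
  competition: 0.4,
};

console.log(combineSignals(example).toFixed(2)); // prints "0.73"
```

Under this assumption, a skill that matches a quoted trigger phrase exactly but scores zero on every other signal would reach only 0.25, so no single signal decides the ranking on its own.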
Security
- API keys: environment variables only (ANTHROPIC_API_KEY), never in config
- Prompts: SHA-256 hashed by default, never stored in plaintext
- Storage: database files created with 0600 permissions
- No telemetry, no phone-home, no eval()
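The prompt hashing and file permission points can be illustrated with Node's standard library. This is a sketch under assumptions: `hashPrompt` and `writeOwnerOnly` are hypothetical helper names, the temporary file stands in for the real database, and whether the toolkit salts or otherwise preprocesses prompts before hashing is not stated in this README.

```typescript
import { createHash } from "node:crypto";
import { mkdtempSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: hash a prompt so the log never holds the plaintext.
function hashPrompt(prompt: string): string {
  return createHash("sha256").update(prompt, "utf8").digest("hex");
}

// Hypothetical helper: create a file that only the owner can read or write.
// The mode applies when the file is created; on Windows it is largely ignored.
function writeOwnerOnly(path: string, contents: string): void {
  writeFileSync(path, contents, { mode: 0o600 });
}

// Write one hashed prompt to a throwaway file and inspect its permission bits.
const dir = mkdtempSync(join(tmpdir(), "evalkit-sketch-"));
const logPath = join(dir, "log.txt");
writeOwnerOnly(logPath, hashPrompt("build me a dashboard") + "\n");
console.log((statSync(logPath).mode & 0o777).toString(8));
```

On a POSIX system with a typical umask the last line should report `600`. The stored digest is 64 hex characters and cannot be turned back into the prompt, although identical prompts still produce identical hashes.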
License
MIT
