@reasoningco/cse

v0.1.0 · Published a month ago

Testing and analytics toolkit for Claude Code skills — trigger simulation, regression testing, conflict detection, and live session logging

Maintainers: sanvithreddy, anmolgxrg

Keywords: claude, claude-code, skills, testing, analytics, trigger-simulation, regression-testing

claude-skills-evalkit

Testing and analytics toolkit for Claude Code skills. Answers "will my skill actually trigger?" before you deploy.

Install

npm install -g claude-skills-evalkit

This installs the cse CLI globally and automatically sets up the /cse slash command in Claude Code.

Or use without installing:

npx claude-skills-evalkit simulate ./my-skill/SKILL.md -p "your prompt"

The CLI is available as both claude-skills-evalkit and cse (shorthand).

/cse in Claude Code

After installing, restart Claude Code. The /cse slash command is then available directly inside Claude Code:

/cse simulate ./skills/my-skill/SKILL.md "does this trigger?"
/cse test ./skill-tests/my-suite.yaml
/cse conflicts
/cse optimize ./skills/my-skill/SKILL.md
/cse help

You can also use natural language:

/cse does my code reviewer skill trigger for "review this PR"?
/cse are there any conflicts between my skills?

To install the slash command manually (if cse was already installed):

cse install-command

Commands

`cse simulate` — Test skill triggering

# Single prompt
cse simulate ./skills/my-skill/SKILL.md -p "build me a dashboard"

# Batch from file (one prompt per line)
cse simulate ./skills/my-skill/SKILL.md -f prompts.txt

# Verbose — show per-signal score breakdowns
cse simulate ./skills/my-skill/SKILL.md -p "build a form" -v

# JSON output for scripting
cse simulate ./skills/my-skill/SKILL.md -f prompts.txt --json

# Show top N results (default: 5)
cse simulate ./skills/my-skill/SKILL.md -p "create a component" --top 10

Output:

Prompt: "build me a dashboard"

  Rank  Skill                  Plugin              Confidence
  ──────────────────────────────────────────────────────────
  1  ►  frontend-design        ui-toolkit          0.84  ████████
  2     data-viz-builder       analytics           0.61  ██████
  3     chart-generator        charting            0.42  ████

  ► = your target skill
  Scored in 4.24ms

`cse test` — Regression test suites

Create a YAML test suite:

# skill-tests/frontend.yaml
name: frontend-triggers
skillPath: ../skills/frontend-design/SKILL.md
cases:
  - id: dashboard
    prompt: "build me a dashboard"
    expectedTrigger: true
    minConfidence: 0.5

  - id: sql-query
    prompt: "optimize SQL queries"
    expectedTrigger: false

Run it:

# Run all suites in ./skill-tests/
cse test

# Specific suite
cse test --suite ./skill-tests/frontend.yaml

# CI mode with JUnit XML
cse test --ci --reporter junit --output results.xml

# Filter by test ID
cse test --filter "dashboard"
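
The suite fields shown above (id, prompt, expectedTrigger, minConfidence) map onto a small type. The sketch below is one plausible reading of how a case might be judged, not the toolkit's confirmed logic: the `Observation` shape and the `casePasses` helper are hypothetical names, and it assumes minConfidence only applies to cases that are expected to trigger.

```typescript
// Mirrors the fields used in the YAML suite above.
interface SkillTestCase {
  id: string;
  prompt: string;
  expectedTrigger: boolean;
  minConfidence?: number;
}

// Hypothetical: what a simulation run reported for the target skill.
interface Observation {
  triggered: boolean;
  confidence: number;
}

// Assumption: a case passes when the trigger outcome matches the expectation
// and, for expected triggers, the confidence reaches minConfidence (if given).
function casePasses(testCase: SkillTestCase, observed: Observation): boolean {
  if (observed.triggered !== testCase.expectedTrigger) {
    return false;
  }
  if (testCase.expectedTrigger && testCase.minConfidence !== undefined) {
    return observed.confidence >= testCase.minConfidence;
  }
  return true;
}

const dashboardCase: SkillTestCase = {
  id: "dashboard",
  prompt: "build me a dashboard",
  expectedTrigger: true,
  minConfidence: 0.5,
};

console.log(casePasses(dashboardCase, { triggered: true, confidence: 0.84 })); // true
console.log(casePasses(dashboardCase, { triggered: true, confidence: 0.4 }));  // false
```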

`cse conflicts` — Detect overlapping skills

# Scan all installed skills for overlaps
cse conflicts

# Set overlap threshold (0-1)
cse conflicts --threshold 0.6

# JSON output
cse conflicts --format json
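
To give a feel for what an overlap threshold means, here is a minimal sketch of pairwise overlap detection. It is an illustration under stated assumptions, not how cse conflicts is implemented: it uses a plain Jaccard coefficient over description tokens, and `SkillSummary`, `descriptionTokens`, `jaccard`, and `findOverlaps` are hypothetical names.

```typescript
// Hypothetical minimal view of a skill: a name and its description text.
interface SkillSummary {
  name: string;
  description: string;
}

interface OverlapPair {
  a: string;
  b: string;
  score: number;
}

// Lowercase, split into alphanumeric words, and drop duplicates.
function descriptionTokens(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words.filter((word, index) => words.indexOf(word) === index);
}

// Plain Jaccard coefficient: |A ∩ B| / |A ∪ B| (inputs must be de-duplicated).
function jaccard(a: string[], b: string[]): number {
  const inB = new Set(b);
  const shared = a.filter((term) => inB.has(term)).length;
  const union = a.length + b.length - shared;
  return union === 0 ? 0 : shared / union;
}

// Report every pair of skills whose overlap meets the threshold (0-1),
// highest overlap first.
function findOverlaps(skills: SkillSummary[], threshold: number): OverlapPair[] {
  const pairs: OverlapPair[] = [];
  for (let i = 0; i < skills.length; i++) {
    for (let j = i + 1; j < skills.length; j++) {
      const score = jaccard(
        descriptionTokens(skills[i].description),
        descriptionTokens(skills[j].description),
      );
      if (score >= threshold) {
        pairs.push({ a: skills[i].name, b: skills[j].name, score });
      }
    }
  }
  return pairs.sort((x, y) => y.score - x.score);
}

const sampleSkills: SkillSummary[] = [
  { name: "data-viz-builder", description: "Build charts and dashboards" },
  { name: "chart-generator", description: "Build dashboards and charts" },
  { name: "sql-tuner", description: "Optimize SQL queries" },
];

console.log(findOverlaps(sampleSkills, 0.6));
```

Raising the threshold reports fewer, stronger overlaps, and lowering it surfaces weaker ones.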

`cse optimize` — Improve skill descriptions

# Offline analysis (precision/recall scores + suggestions)
cse optimize ./skills/my-skill/SKILL.md \
  --positive should-trigger.txt \
  --negative should-not-trigger.txt

# AI-powered rewriting via claude -p (uses your existing Claude Code session)
cse optimize ./skills/my-skill/SKILL.md \
  --positive pos.txt --negative neg.txt --api

`cse dry-run` — Test instruction adherence

cse dry-run ./skills/my-skill/SKILL.md "build me a todo app"
cse dry-run ./skills/my-skill/SKILL.md "build me a todo app" --json

Uses claude -p (print mode), so no API key is needed. If ANTHROPIC_API_KEY is set, falls back to the direct API.

`cse log` — Session analytics

cse log status              # Show logging status
cse log query --since 7d    # Query last 7 days
cse log query --group-by skill --since 30d
cse log export --format csv --output analytics.csv

`cse dashboard` — Analytics dashboard

cse dashboard
cse dashboard --since 30d

`cse init` — Initialize config

cse init

Creates .evalkitrc.json and skill-tests/example.yaml.

Programmatic API

import {
  simulateTrigger,
  batchSimulate,
  parseSkill,
  scanPlugins,
  detectConflicts,
} from 'claude-skills-evalkit';

// Parse a skill
const skill = parseSkill('./my-skill/SKILL.md');

// Scan installed plugins
const plugins = await scanPlugins();
const allSkills = plugins.flatMap(p => p.skills);

// Simulate triggering
const result = simulateTrigger('build me a dashboard', {
  targetSkill: skill,
  competingSkills: allSkills,
});
console.log(result.targetSkill?.confidence); // 0.84
console.log(result.targetSkill?.rank);       // 1

// Detect conflicts
const report = detectConflicts(allSkills, 0.5);
console.log(report.overlaps);

Configuration

Create .evalkitrc.json in your project root (or run cse init):

{
  "scoring": {
    "weights": {
      "semantic": 0.35,
      "keywordOverlap": 0.20,
      "triggerPhrase": 0.25,
      "specificity": 0.10,
      "competition": 0.10
    }
  },
  "logging": {
    "redactPrompts": true,
    "storageBackend": "sqlite"
  },
  "regression": {
    "suiteDir": "./skill-tests",
    "reporters": ["console"]
  }
}
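
To illustrate how weights like these might be applied, the sketch below combines five per-signal scores into one confidence value. It assumes a plain weighted sum and that the weights are meant to sum to 1. Neither is confirmed by the toolkit, and `weightsSumToOne` and `combineSignals` are hypothetical helpers, not part of the package API.

```typescript
// Signal names follow the "weights" keys in .evalkitrc.json.
type SignalName =
  | "semantic"
  | "keywordOverlap"
  | "triggerPhrase"
  | "specificity"
  | "competition";

type SignalScores = Record<SignalName, number>;

const configuredWeights: SignalScores = {
  semantic: 0.35,
  keywordOverlap: 0.2,
  triggerPhrase: 0.25,
  specificity: 0.1,
  competition: 0.1,
};

// Assumption: weights should sum to 1 so the combined score stays in [0, 1].
function weightsSumToOne(weights: SignalScores, tolerance = 1e-9): boolean {
  const total = (Object.keys(weights) as SignalName[]).reduce(
    (sum, key) => sum + weights[key],
    0,
  );
  return Math.abs(total - 1) <= tolerance;
}

// Assumption: the final confidence is a plain weighted sum of the signals.
function combineSignals(scores: SignalScores, weights: SignalScores): number {
  return (Object.keys(weights) as SignalName[]).reduce(
    (sum, key) => sum + weights[key] * scores[key],
    0,
  );
}

// Made-up per-signal scores, each in [0, 1].
const sampleScores: SignalScores = {
  semantic: 0.8,
  keywordOverlap: 0.6,
  triggerPhrase: 1.0,
  specificity: 0.8,
  competition: 0.7,
};

console.log(weightsSumToOne(configuredWeights)); // true
console.log(combineSignals(sampleScores, configuredWeights).toFixed(2)); // 0.80
```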

Claude Code Plugin

Install as a Claude Code plugin for live session logging and MCP tools (in addition to the CLI):

.claude-plugin/plugin.json  — Plugin manifest
hooks/hooks.json            — Automatic session logging
skills/claude-skills-evalkit/SKILL.md — Invoke via "test my skill"
.mcp.json                   — MCP server with simulate_trigger, detect_conflicts, analyze_description

Scoring Algorithm

Five signals are combined using calibrated weights:

| Signal | Weight | What It Measures |
|--------|--------|------------------|
| Semantic (TF-IDF) | 0.35 | Topic relevance via cosine similarity |
| Trigger Phrase | 0.25 | Match against quoted phrases in description |
| Keyword Overlap | 0.20 | IDF-weighted Jaccard coefficient |
| Specificity | 0.10 | Description quality and precision |
| Competition | 0.10 | Score gap between top candidates |
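
As a rough illustration of the Keyword Overlap row, the sketch below computes an IDF-weighted Jaccard coefficient: the summed IDF of shared terms divided by the summed IDF of all terms in either text. The tokenizer, the smoothed IDF formula, and the names `tokenize`, `buildIdf`, and `idfWeightedJaccard` are assumptions made for this example, so the toolkit's actual scoring may differ.

```typescript
// Lowercase, split into alphanumeric words, and drop duplicates.
function tokenize(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words.filter((word, index) => words.indexOf(word) === index);
}

// Smoothed inverse document frequency over a corpus of de-duplicated token lists.
// Assumption: idf(t) = ln((N + 1) / (df(t) + 1)) + 1, so rare terms weigh more.
function buildIdf(corpus: string[][]): (term: string) => number {
  const docFreq = new Map<string, number>();
  for (const doc of corpus) {
    for (const term of doc) {
      docFreq.set(term, (docFreq.get(term) ?? 0) + 1);
    }
  }
  const docCount = corpus.length;
  return (term) => Math.log((docCount + 1) / ((docFreq.get(term) ?? 0) + 1)) + 1;
}

// Sum of IDF over shared terms divided by sum of IDF over the union.
function idfWeightedJaccard(
  a: string[],
  b: string[],
  idf: (term: string) => number,
): number {
  const inA = new Set(a);
  const inB = new Set(b);
  let shared = 0;
  let union = 0;
  for (const term of a) {
    const weight = idf(term);
    union += weight;
    if (inB.has(term)) shared += weight;
  }
  for (const term of b) {
    if (!inA.has(term)) union += idf(term);
  }
  return union === 0 ? 0 : shared / union;
}

// Made-up skill descriptions used as the IDF corpus.
const descriptions = [
  "Build dashboards and charts for the frontend",
  "Optimize SQL queries and database indexes",
  "Review pull requests and suggest fixes",
].map(tokenize);

const idf = buildIdf(descriptions);
const overlap = idfWeightedJaccard(tokenize("build me a dashboard"), descriptions[0], idf);
console.log(overlap.toFixed(3));
```

There is no stemming in this sketch, so "dashboard" and "dashboards" count as different terms.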

Security

API keys: environment variables only (ANTHROPIC_API_KEY), never in config
Prompts: SHA-256 hashed by default, never stored in plaintext
Storage: database files created with 0600 permissions
No telemetry, no phone-home, no eval()
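
For readers unfamiliar with prompt hashing, the sketch below shows what replacing a prompt with its SHA-256 digest can look like using Node's built-in crypto module. `redactPrompt` is a hypothetical helper written for this example. Whether the toolkit salts, truncates, or otherwise post-processes the hash is not stated here.

```typescript
import { createHash } from "node:crypto";

// Replace a prompt with its SHA-256 digest (64 hex characters).
// The same prompt always maps to the same digest, so logged entries can
// still be grouped and counted without storing the original text.
function redactPrompt(prompt: string): string {
  return createHash("sha256").update(prompt, "utf8").digest("hex");
}

const digest = redactPrompt("build me a dashboard");
console.log(digest.length); // 64
console.log(digest === redactPrompt("build me a dashboard")); // true
```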

License

MIT
