pup-appsec
v0.2.0
Pup — AI-native, evidence-based AppSec: contextual SAST, reachability SCA, threat modelling, and secrets for JavaScript/TypeScript, Python, Java/Kotlin, Go, Ruby, PHP, and C#.
🐾 Pup
An AI-native, evidence-based Application Security Platform. Pup replaces signature- and regex-based scanning with agentic security analysis that reasons like an experienced application security engineer.
The goal is not more findings. The goal is fewer, higher-confidence findings — each one answering: Is this actually exploitable? Why? Show the evidence. Show the execution path. How do we fix it?
More demos — SCA + SBOM, cross-endpoint chains, and the living threat-model diagram: see Pup in action →
Status: open-core, usable today. Modules shipped: AI SAST · AI SCA · Threat Model · Secrets, with incremental CI scanning, an MCP server, and a local dashboard. Languages: JavaScript/TypeScript, Python, Java, Kotlin, Go, Ruby, PHP, and C# (JS/TS via a full AST; others via entry-point detectors + the language-agnostic agent — adding a language is one small detector). DAST, IaC, and container modules are planned.
How it differs from a scanner
Traditional SAST asks "does this line match a rule?" Pup asks "what is happening, and can an attacker exploit it?"
Repo (JS/TS) ─▶ ts-morph project ─▶ entry-point detection (where untrusted input enters)
        │
        ▼
Context bundle (handler + trust boundaries + route)
        │
        ▼
AI SAST agent (Claude Opus 4.8)  ◀──▶  code-navigation tools
  reasons about exploitability,         (read_file, search_code, get_function,
  proves it, suppresses false +ves       get_callers, list_files)
        │
        ▼
Evidence-based findings (CLI report + JSON)

Two design choices keep this from becoming a glorified grep:
- Scope by reachability of untrusted data, not by vuln patterns. Pup structurally locates HTTP entry points (where req.query/req.body/req.params enter); it does not scan for "dangerous" function names.
- The model does the vulnerability judgment. The agent is handed an entry point and a set of navigation tools, then investigates across files and reasons about exploitability. This is what catches contextual bugs that no signature finds (IDOR, broken authorization, business-logic flaws) and what lets Pup suppress a scary-looking-but-safe parameterized query.
See it in action
More animated runs — real findings, severities, and costs (full gallery: docs/demos.md):
Reachability triage. Two known CVEs: one is reachable (real risk), the other the CVE database rates HIGH but Pup proves not reachable and filters out — plus a CycloneDX SBOM.
In your pull requests. The summary comment Pup posts on every PR, with the gate that blocks the merge:
→ Also in the gallery: AI SAST, AI-validated secrets, cross-endpoint chains, and the living threat-model data-flow diagram.
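The reachability triage described above can be pictured as a graph search: an advisory only matters if some application entry point can reach the vulnerable function. The sketch below is a generic illustration in TypeScript, not Pup's implementation; the `CallGraph` shape, the `Advisory` fields, the advisory IDs, and every function name are hypothetical.

```typescript
// Hypothetical shapes for illustration only; these are not Pup's real types.
type CallGraph = Map<string, string[]>; // caller -> callees

interface Advisory {
  id: string;               // placeholder advisory identifier
  vulnerableSymbol: string; // function the advisory is about
  databaseSeverity: "low" | "medium" | "high" | "critical";
}

// Breadth-first search from the entry points over the call graph.
function reachableSymbols(graph: CallGraph, entryPoints: string[]): Set<string> {
  const seen = new Set<string>(entryPoints);
  const queue = [...entryPoints];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const callee of graph.get(current) ?? []) {
      if (!seen.has(callee)) {
        seen.add(callee);
        queue.push(callee);
      }
    }
  }
  return seen;
}

// Split advisories into those an entry point can reach and those it cannot.
function triage(graph: CallGraph, entryPoints: string[], advisories: Advisory[]) {
  const reachable = reachableSymbols(graph, entryPoints);
  return {
    reachable: advisories.filter((a) => reachable.has(a.vulnerableSymbol)),
    filteredOut: advisories.filter((a) => !reachable.has(a.vulnerableSymbol)),
  };
}

const graph: CallGraph = new Map([
  ["GET /api/report", ["renderReport"]],
  ["renderReport", ["templateLib.compile"]],
  ["legacyImport", ["yamlLib.load"]], // never called from an entry point
]);

const result = triage(graph, ["GET /api/report"], [
  { id: "ADV-1", vulnerableSymbol: "templateLib.compile", databaseSeverity: "medium" },
  { id: "ADV-2", vulnerableSymbol: "yamlLib.load", databaseSeverity: "high" },
]);
```

Here `result.reachable` holds only ADV-1, while ADV-2 lands in `result.filteredOut` despite its higher database severity, because nothing on a path from the entry point calls `yamlLib.load`.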
Results on known-vulnerable apps
Pup run against public, deliberately-vulnerable apps (where the bugs are documented, so you can check its work). Context-aware (threat model + SAST):
| App | Open findings | False positives suppressed | Endpoints | Cost |
|---|---|---|---|---|
| DVNA | 11 (6 crit / 1 high / 3 med / 1 low) | 5 | 10 | ~$2 |
| NodeGoat | 20 (1 crit / 10 high / 5 med / 4 low) | 11 | 19 | ~$7.50 |
| Juice Shop | 18 (5 high / 10 med / 3 low) | 117 | 20 of 238¹ | ~$7 |
Findings map to each app's documented OWASP issues (SQLi, command injection, RCE via eval/deserialization, XXE, IDOR, stored/reflected XSS, broken access control, ReDoS, …) — with evidence and execution paths, and the non-issues filtered out.
¹ Juice Shop has 238 detected endpoints; we ran a representative sample of 20 (--limit 20) to keep the showcase cheap. Several findings map to named Juice Shop challenges (Zip Slip, XXE, SSRF, persisted-XSS feedback, YAML bomb, CAPTCHA bypass). A full baseline scan is the one-time larger-cost run.
Compared to traditional scanners
Public analyses of SAST tools on these same apps show how much results swing by tool and ruleset. One Semgrep run on Juice Shop surfaced only a handful of findings (Yıldırım, A Brief Semgrep Analysis of Juice Shop), while multi-tool deep-dives (Semgrep + CodeQL + OSV-Scanner) find different subsets and conclude "no single tool catches everything" (Deep Dive into SAST Tools for OWASP Juice Shop). Notably, none publish a clean true-positive / false-positive breakdown, and the trade-off traditional SAST makes (higher recall at the cost of a high false-positive rate) is a well-documented source of alert fatigue.
Pup's difference is the "false positives suppressed" column above: every listed finding is proven exploitable with evidence and an execution path, and the non-issues are filtered out rather than handed to you as noise. A rigorous head-to-head — Pup vs Semgrep/Snyk on the same app version, comparing real-exploitable findings against false positives — is the credible next step.
Install
npm install -g pup-appsec # provides the `pup` command
pup scan ./my-app
# ...or run it without installing:
npx --package pup-appsec pup scan ./my-app
# Docker (no Node needed):
docker run --rm -e ANTHROPIC_API_KEY -v "$PWD:/repo" ghcr.io/shadowexia1/pup scan /repo --diff main
Set ANTHROPIC_API_KEY in your environment (or a .env). To build from source instead, see Quick start below.
One command to review a whole repo — threat model + SAST + SCA into a single shareable HTML report:
pup audit ./my-app --out report.html # open report.html in a browser
What you need to run Pup
- Node.js 20+ (or Docker).
- An LLM API key. Pup sends focused code context to a model to reason about exploitability. You bring your own key; it stays in your environment and is never bundled or sent anywhere but your chosen provider.
  - Anthropic / Claude (recommended): set ANTHROPIC_API_KEY.
  - OpenAI or Azure OpenAI (alternative): set PUP_PROVIDER=openai and OPENAI_API_KEY (and PUP_OPENAI_BASE_URL for Azure).
- You need a little API credit on that account: a full pup audit of a small app is typically well under a dollar (Pup uses prompt caching + incremental scanning to keep cost down).
That's it — install, set the key, run. No account, no signup, nothing phones home.
What it costs to run
You pay your LLM provider directly (BYO key). Pup is built to be cheap — prompt caching + incremental scanning keep it low. Rough estimates (Claude Opus, real numbers from the sample apps):
| Action | Typical cost (USD) | How often |
|---|---|---|
| Threat model | ~0.15 – 0.40 | once, then cheap incremental updates |
| Full repo baseline scan / pup audit | ~0.05 – 0.40 per endpoint — a small app under 1 dollar, a large app a few dollars | once (the baseline) |
| Per merge-request scan (incremental) | ~0.10 – 1.00 — only the changed endpoints/dependencies | every MR |
The expensive part is the one-time baseline; after that, MRs only re-scan the
diff, so steady-state cost is a few cents to ~1 dollar per MR. Run
pup … --budget 5 to cap and report spend.
Large monorepos. Baseline cost scales with the number of entry points —
roughly endpoints × $0.10–0.40 on Claude Opus. A big service with ~1,000
endpoints is a ~$100–400 one-time baseline, and you control it: scan
incrementally (--diff), in slices (--entrypoint / --limit), or on a
cheaper/local model (gpt-4o, or Ollama for ~free). For a team that's typically
well under a year of per-seat enterprise SAST/SCA licensing — and steady-state
is cents per PR. Scope matters too: scan the directory (or repo root) that
contains all the code you care about (see Limitations → scan-path scoped).
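The arithmetic behind those estimates is simple enough to sketch. The helper below only restates the per-endpoint range quoted above ($0.10 to $0.40 on Claude Opus); the function names are hypothetical, and `endpointsWithinBudget` is a planning aid for choosing a `--limit` slice, not a description of how `--budget` works internally.

```typescript
// Per-endpoint range quoted above for Claude Opus, held in cents to avoid float drift.
const LOW_CENTS_PER_ENDPOINT = 10;
const HIGH_CENTS_PER_ENDPOINT = 40;

interface CostRange {
  lowUsd: number;
  highUsd: number;
}

// Rough one-time baseline estimate: endpoints x per-endpoint range.
function estimateBaseline(endpoints: number): CostRange {
  if (!Number.isInteger(endpoints) || endpoints < 0) {
    throw new RangeError("endpoints must be a non-negative integer");
  }
  return {
    lowUsd: (endpoints * LOW_CENTS_PER_ENDPOINT) / 100,
    highUsd: (endpoints * HIGH_CENTS_PER_ENDPOINT) / 100,
  };
}

// Largest slice whose worst-case estimate still fits inside a dollar budget.
function endpointsWithinBudget(budgetUsd: number): number {
  return Math.floor((budgetUsd * 100) / HIGH_CENTS_PER_ENDPOINT);
}

const monorepo = estimateBaseline(1000); // { lowUsd: 100, highUsd: 400 }
const slice = endpointsWithinBudget(5);  // 12
```

The 1,000-endpoint case reproduces the $100 to $400 figure above, and a $5 budget covers about 12 endpoints at the pessimistic end of the range.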
Why Pup vs Snyk and traditional scanners
| | Traditional scanners (Snyk, etc.) | Pup |
|---|---|---|
| How it finds issues | signatures, regexes, CVE/taint rules | reasons about the code like an engineer |
| Output | many findings, high false-positive rate | fewer, higher-confidence, proven-exploitable findings |
| "Is it actually exploitable?" | rarely answered | evidence, execution path, attack scenario, fix for every finding |
| Context | none; same verdict everywhere | threat-model context loop: knows if an endpoint is internet-facing, behind auth, handling sensitive data |
| Cost | per-seat SaaS subscription | free & open-source, bring your own key |
| Your code | uploaded to their cloud | stays in your environment, sent only to your chosen LLM |
| Run it your way | their platform only | CLI · CI (GitHub/GitLab) · your editor (MCP) · self-host |
The headline: fewer false positives, every finding proven, and it's free.
Why Pup (vs Claude/Cursor/Copilot security review)
AI security review is ramping up, and we're now seeing tools that actually reason about a bug and explain it with evidence. Pup isn't trying to be a better review bot — it's a dedicated, open AppSec platform with a different shape:
Whole-app attack surface, not just the diff. Pup enumerates every HTTP entry point across the codebase (238 in Juice Shop, including factory-pattern routes and cross-file framework splits) and reasons about each. PR-review tools focus on the lines that changed; Pup finds the exploitable path in code this PR didn't touch.
More than just SAST — one tool that includes:
- exploitability-reasoning SAST (including secrets scanning)
- reachability-based SCA
- CycloneDX SBOM
- a living, committed threat model
- cross-endpoint chains
- a latent / defense-in-depth pass (--latent)
A review bot reviews code; Pup is your security engineer.
Persistent context. The threat model (auth, WAF, trust boundaries) and sticky suppressions persist across runs, so "exploitable" means exploitable in your architecture, and yesterday's false positive stays suppressed.
Open and vendor-neutral. FSL open-core you can read, fork, and self-host. Runs on Claude, OpenAI, Gemini, or a local model (air-gapped / cost-controlled) — bring your own key, no lock-in.
Built for pipelines. Exit-code gating to block merges, plus SARIF / GitLab / SBOM / Prometheus / webhook / MCP outputs.
Honest take: if you're all-in on one vendor and just want quick PR review, that vendor's built-in reviewer may be all you need. Pup is for teams already using AI who want to move away from traditional scanners — which add little value by comparison — and want SAST + SCA + threat-modelling in one place without building it all out themselves. Pup gives you the freedom to run it on any model, including a local one.
Quick start
Installed the CLI (above)? Point pup at any project:
export ANTHROPIC_API_KEY=sk-ant-... # or put it in a .env file
pup scan ./my-app # AI SAST
pup sca ./my-app --sbom sbom.cdx.json # dependency analysis + SBOM
pup audit ./my-app --out report.html # threat model + SAST + SCA, one report
Add --dry-run to any scan to list the entry points Pup found with no API calls (free), which makes a good first look. Useful flags: --json findings.json, --fail-on high (CI gate), --entrypoint "/api/orders" --limit 1 (focus one).
From source (contributors, or to try the bundled vulnerable sample):
git clone https://github.com/ShadowExia1/Pup && cd Pup && npm install
cp .env.example .env # add your key
npm run scan:sample:dry # dry run on samples/vulnerable-express (no key)
npm run scan:sample # live analysis (needs key)
npm run pup -- scan /path/to/project
The sample target
samples/vulnerable-express/ is a deliberately vulnerable Express app used to
exercise the engine. It contains, by design:
- A SQL injection that is only visible by tracing input across three files (app.js → services.js → db.js).
- An IDOR / broken authorization bug on GET /api/orders/:id: the route is authenticated but never checks that the order belongs to the caller. No regex finds this; it requires reasoning about a missing check.
- A command injection in the report export endpoint.
- A safe, parameterized product search that looks like the SQLi but is not, included to verify Pup suppresses false positives.
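To make the IDOR case concrete, here is a minimal, framework-free sketch of the pattern: an authenticated lookup that never checks ownership, next to the fixed version. It is not the sample app's actual code; the `Order` shape, the in-memory `orders` array, and both function names are assumptions made for illustration.

```typescript
interface Order {
  id: string;
  ownerId: string;
  total: number;
}

// Stand-in for a database table.
const orders: Order[] = [
  { id: "1", ownerId: "alice", total: 40 },
  { id: "2", ownerId: "bob", total: 15 },
];

// Vulnerable shape: the caller is authenticated, but ownership is never checked.
function getOrderUnsafe(callerId: string, orderId: string): Order | undefined {
  if (!callerId) throw new Error("unauthenticated");
  return orders.find((o) => o.id === orderId);
}

// Fixed shape: the lookup also requires that the order belongs to the caller.
function getOrderSafe(callerId: string, orderId: string): Order | undefined {
  if (!callerId) throw new Error("unauthenticated");
  return orders.find((o) => o.id === orderId && o.ownerId === callerId);
}

const leaked = getOrderUnsafe("alice", "2"); // bob's order, returned to alice
const blocked = getOrderSafe("alice", "2");  // undefined
```

Nothing in `getOrderUnsafe` looks dangerous on its own line, which is why the bug is a missing check rather than a matchable pattern.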
Configuration
Environment variables (via shell or .env):
| Variable | Default | Purpose |
| ------------------- | ------------------ | ---------------------------------------- |
| ANTHROPIC_API_KEY | — | Required for live analysis. |
| PUP_MODEL | claude-opus-4-8 | Reasoning model for the SAST agent. |
| PUP_EFFORT | high | Agent effort: low/medium/high/max. |
Reports, dashboard & CI
Every scan can emit machine- and human-readable output:
# --json:    structured report (summary + cost + findings)
# --html:    self-contained dashboard, just open it
# --sarif:   GitHub/GitLab code scanning (inline PR annotations)
# --fail-on: exit non-zero if any exploitable finding >= high (CI gate)
npm run pup -- scan <path> \
  --json report.json \
  --html report.html \
  --sarif report.sarif \
  --fail-on high
Every scan also prints an MR-friendly one-liner:
SAST results — 2 critical · 1 high · 0 medium · 0 low · 0 info · 1 false positives
The HTML dashboard is a single static file (no server): severity chips, scan cost, and every finding with its evidence, execution path, fix, generated patch, and security test. Suppressed/safe findings are shown separately so you can see what Pup reviewed and cleared.
CI / Merge Requests — set up once, runs on every PR
It's a two-step, one-time setup — then Pup runs on every PR/MR automatically (nothing to re-add per PR):
- Copy a template into your repo: ci/github-actions.yml → .github/workflows/pup-security.yml, or ci/gitlab-ci.yml → .gitlab-ci.yml.
- Add an ANTHROPIC_API_KEY secret (GitHub: Settings → Secrets → Actions; GitLab: Settings → CI/CD → Variables).
Each PR then runs SAST + SCA, updates a committed threat model, writes an SBOM,
uploads SARIF (inline annotations), posts a summary comment, and fails the
check when a finding meets the gate severity. The templates use the published
pup-appsec package — no build step.
The SAST template scans all changed code by default (--deep-reachability
re-scans endpoints that import changed shared files; --latent flags risky sinks
in changed files) — toggle via PUP_SAST_EXTRA.
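The difference between scanning only changed handlers and also following imports can be sketched as a selection over entry points. This is an illustration of the idea behind --deep-reachability, not Pup's code: the `EntryPointFiles` shape, the file names, and both function names are hypothetical, and how Pup actually builds its import graph is not described here.

```typescript
// Hypothetical shape: an entry point plus every file it imports, directly or transitively.
interface EntryPointFiles {
  route: string;
  handlerFile: string;
  importedFiles: string[];
}

// Narrow scope: only endpoints whose own handler file changed.
function directlyChanged(entryPoints: EntryPointFiles[], changed: Set<string>): string[] {
  return entryPoints.filter((e) => changed.has(e.handlerFile)).map((e) => e.route);
}

// Deep scope: also endpoints that import a changed shared file.
function withDeepReachability(entryPoints: EntryPointFiles[], changed: Set<string>): string[] {
  return entryPoints
    .filter((e) => changed.has(e.handlerFile) || e.importedFiles.some((f) => changed.has(f)))
    .map((e) => e.route);
}

const entryPoints: EntryPointFiles[] = [
  { route: "GET /api/orders/:id", handlerFile: "routes/orders.ts", importedFiles: ["services/orders.ts", "db.ts"] },
  { route: "GET /api/products", handlerFile: "routes/products.ts", importedFiles: ["db.ts"] },
  { route: "GET /health", handlerFile: "routes/health.ts", importedFiles: [] },
];

const changedFiles = new Set(["db.ts"]);
const shallow = directlyChanged(entryPoints, changedFiles);   // no handler file changed
const deep = withDeepReachability(entryPoints, changedFiles); // both routes that import db.ts
```

A change to a shared file such as `db.ts` touches no handler directly, so only the deep selection picks up the two endpoints that depend on it.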
Choose what blocks a merge with PUP_FAIL_ON: critical / high /
medium / low to set the bar, or leave it empty for report-only (Pup still
posts findings but never fails). Different teams/repos can pick different bars.
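The gate semantics described above amount to a severity comparison. The following is a minimal sketch under stated assumptions: the `GateFinding` shape and `gateExitCode` name are hypothetical, and the only behavior taken from the text is that an empty PUP_FAIL_ON means report-only and that suppressed findings do not count toward the gate.

```typescript
type Severity = "info" | "low" | "medium" | "high" | "critical";
type FailOn = "low" | "medium" | "high" | "critical" | ""; // "" means report-only

const RANK: Record<Severity, number> = { info: 0, low: 1, medium: 2, high: 3, critical: 4 };

interface GateFinding {
  severity: Severity;
  exploitable: boolean; // false for suppressed false positives
}

// Returns the exit code: 1 if any exploitable finding meets the bar, else 0.
function gateExitCode(findings: GateFinding[], failOn: FailOn): number {
  if (failOn === "") return 0; // report-only never fails the check
  const bar = RANK[failOn];
  return findings.some((f) => f.exploitable && RANK[f.severity] >= bar) ? 1 : 0;
}

const findings: GateFinding[] = [
  { severity: "high", exploitable: true },
  { severity: "critical", exploitable: false }, // suppressed, so it cannot trip the gate
  { severity: "medium", exploitable: true },
];

const failsOnHigh = gateExitCode(findings, "high");         // 1
const failsOnCritical = gateExitCode(findings, "critical"); // 0
const reportOnly = gateExitCode(findings, "");              // 0
```

With these findings a `high` bar blocks the merge, while a `critical` bar passes because the only critical finding was suppressed.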
To enforce it across many repos (org secrets, required status checks, branch
protection, reusable workflows) see Configure the platform.
Full guide + Azure/Bitbucket/Jenkins: ci/README.md.
The intended flow: run one manual full-repo baseline first (pup scan .), then let CI scan incrementally per PR (changed files / changed dependencies only) to keep per-PR cost minimal.
SBOM. pup sca --sbom sbom.cdx.json writes a CycloneDX dependency
inventory (npm, PyPI, Maven, Go) for Dependency-Track, Grype, or procurement —
pure parsing, no AI cost.
Push findings to your own tooling. --webhook <url> POSTs the JSON report
to any endpoint (a collector, DefectDojo, a Slack relay); --prometheus <file>
writes findings-by-severity + cost as Grafana-ready metrics. Step-by-step
setup for each: docs/integrations.md.
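For readers unfamiliar with what "Grafana-ready metrics" look like on disk, the sketch below renders findings-by-severity and cost in the Prometheus text exposition format. The metric names (`example_findings`, `example_scan_cost_usd`) and the function are hypothetical, so check the file Pup actually writes before building dashboards on specific names.

```typescript
type MetricSeverity = "critical" | "high" | "medium" | "low" | "info";

// Renders counts and cost as Prometheus text exposition lines.
// Metric names are placeholders, not Pup's real metric names.
function renderMetrics(counts: Record<MetricSeverity, number>, costUsd: number): string {
  const order: MetricSeverity[] = ["critical", "high", "medium", "low", "info"];
  const lines = [
    "# HELP example_findings Open findings by severity.",
    "# TYPE example_findings gauge",
    ...order.map((s) => `example_findings{severity="${s}"} ${counts[s]}`),
    "# HELP example_scan_cost_usd LLM spend for the scan in US dollars.",
    "# TYPE example_scan_cost_usd gauge",
    `example_scan_cost_usd ${costUsd}`,
  ];
  return lines.join("\n") + "\n";
}

const metricsText = renderMetrics(
  { critical: 2, high: 1, medium: 0, low: 0, info: 0 },
  1.25,
);
```

Each severity becomes one labeled sample, which lets a dashboard chart a single metric split by the `severity` label instead of tracking five separate ones.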
📚 All guides are indexed in
docs/.
Use inside your editor (MCP)
Pup runs as an MCP server so a coding agent
(Claude Code, Cursor, Windsurf, …) can call its scans inline while you write
code — security review where the code is written. It exposes three tools:
pup_scan (SAST), pup_sca (SCA), and pup_threat_model.
Start it: npm run mcp (it speaks JSON-RPC over stdio).
Claude Code — add to .mcp.json (or claude mcp add):
{
"mcpServers": {
"pup": {
"command": "npx",
"args": ["tsx", "/absolute/path/to/Pup/src/cli.ts", "mcp"],
"env": { "ANTHROPIC_API_KEY": "sk-ant-..." }
}
}
}
Cursor: the same shape in .cursor/mcp.json.
Then ask your agent things like "use pup_scan on src/api/orders.ts" or
"run pup_threat_model then pup_scan and tell me what's actually exploitable."
Tool calls are capped to a few entry points/advisories by default — pass
entryPoint, diff, or limit to scope a larger run.
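Under the hood, an agent's tool invocation is a JSON-RPC message sent to the server over stdio. The sketch below builds an MCP `tools/call` request for pup_scan. The `path` argument name is an assumption (only entryPoint, diff, and limit are named above), and a real client must complete the MCP `initialize` handshake before sending this.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Builds a JSON-RPC 2.0 request in the shape MCP uses for tool invocation.
function buildToolCall(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// `path` is an assumed argument name; entryPoint and limit are the scoping options named above.
const request = buildToolCall(1, "pup_scan", {
  path: "src/api/orders.ts",
  entryPoint: "/api/orders",
  limit: 1,
});

// The stdio transport carries one JSON message per line.
const wireMessage = JSON.stringify(request) + "\n";
```

In practice the coding agent constructs this message for you; the sketch is mainly useful when debugging the server by hand.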
Layout
src/
  cli.ts                 # `pup scan <path>`
  config.ts
  core/types.ts          # Finding, EntryPoint, ContextBundle
  context/
    project.ts           # load target into a ts-morph project
    entrypoints.ts       # structural HTTP entry-point detection
    contextBuilder.ts    # assemble the agent's focused context + seed prompt
  agent/
    tools.ts             # code-navigation tools the agent drives
    sastAgent.ts         # the reasoning loop + evidence-based finding schema
  report/render.ts       # terminal report
samples/
  vulnerable-express/    # deliberately vulnerable target
License
Pup is source-available under the Functional Source License (FSL-1.1-Apache-2.0). You may read, run, self-host, modify, and contribute to it freely. The only restriction is a Competing Use — you may not offer Pup (or a substantially similar service) as a commercial product that competes with the maintainers. Each release automatically converts to the permissive Apache License 2.0 two years after it is published.
This is an open-core model: the engine is open; a hosted commercial layer (team dashboard, history, SSO, policy, compliance, managed inference) is the business. Not legal advice — consult a lawyer for your use.
Limitations (what Pup does not do — yet)
About the engine
- Non-deterministic. It's LLM-based; the same code can produce slightly different findings or wording across runs. CI gating works on exit codes, but treat findings as an expert review, not a formal proof — verify the criticals and highs.
- Not sound static analysis. No formal taint/dataflow; Pup reasons over code it navigates. Heavy reflection, dynamic dispatch, or metaprogramming can cause both false negatives and false positives.
- Your code is sent to an LLM provider (Anthropic/OpenAI/…) — unless you run a local model (Ollama/vLLM) for air-gapped/private use.
Coverage
- Coverage = what's detected. Undetected entry points (exotic frameworks, dynamic route registration) aren't analyzed. Non-JS/TS detection is heuristic, not full AST (JS/TS gets ts-morph depth).
- Reachable-first by default. The high-confidence findings come from code reachable from detected entry points. Risky code that no entry point reaches isn't in that proven-exploitable set, but you can opt in with --latent, which surfaces inherently dangerous sinks (eval, exec, deserialization, dynamic SQL, weak crypto) anywhere in the repo at reduced severity (it flags for review; it doesn't prove exploitability). Non-HTTP reachability still isn't traced.
- Scan-path scoped. Only files under the path you scan are indexed. In a monorepo, code imported from outside that directory (e.g. ./backend importing ../shared) isn't seen; scan the directory, or the repo root, that contains everything relevant.
- HTTP entry points only: no message queues, gRPC, WebSockets, CLI, or cron/worker surfaces.
- No mobile client (iOS/Android/RN), no DAST, no client-side/DOM-XSS, no IaC/container/cloud-config, no license compliance.
Dependencies
- SCA/SBOM cover npm/PyPI/Maven/Go, but depth needs a lockfile (without one, versions may be ranges and transitive deps incomplete). --diff scoping is npm-only today.
Cost & maturity
- A full baseline of a large app is a real spend (for scale, about $7 for 20 complex Juice Shop endpoints on Claude Opus), though small next to what commercial products charge today. Use a one-time baseline + cheap incremental MR scans, or a cheaper/local model. See What it costs to run.
- Early-stage (v0.2). Expect rough edges and breaking changes.
