pup-appsec

v0.2.0

Published

5 days ago

Pup — AI-native, evidence-based AppSec: contextual SAST, reachability SCA, threat modelling, and secrets for JavaScript/TypeScript, Python, Java/Kotlin, Go, Ruby, PHP, and C#.


shadowexia1

security appsec sast sca threat-modeling ai static-analysis vulnerability mcp

🐾 Pup

An AI-native, evidence-based Application Security Platform. Pup replaces signature- and regex-based scanning with agentic security analysis that reasons like an experienced application security engineer.

The goal is not more findings. The goal is fewer, higher-confidence findings — each one answering: Is this actually exploitable? Why? Show the evidence. Show the execution path. How do we fix it?

[Demo: Pup scanning a vulnerable app. SQL injection, IDOR, and command injection found; one safe route filtered out.]

More demos — SCA + SBOM, cross-endpoint chains, and the living threat-model diagram: see Pup in action →

Status: open-core, usable today. Modules shipped: AI SAST · AI SCA · Threat Model · Secrets, with incremental CI scanning, an MCP server, and a local dashboard. Languages: JavaScript/TypeScript, Python, Java, Kotlin, Go, Ruby, PHP, and C# (JS/TS via a full AST; others via entry-point detectors + the language-agnostic agent — adding a language is one small detector). DAST, IaC, and container modules are planned.

How it differs from a scanner

Traditional SAST asks "does this line match a rule?" Pup asks "what is happening, and can an attacker exploit it?"

Repo (JS/TS) ─▶ ts-morph project ─▶ entry-point detection (where untrusted input enters)
                                          │
                                          ▼
                          Context bundle (handler + trust boundaries + route)
                                          │
                                          ▼
   AI SAST agent (Claude Opus 4.8)  ◀──▶  code-navigation tools
   reasons about exploitability,         (read_file, search_code, get_function,
   proves it, suppresses false +ves       get_callers, list_files)
                                          │
                                          ▼
                          Evidence-based findings (CLI report + JSON)

Two design choices keep this from becoming a glorified grep:

1. Scope by reachability of untrusted data, not by vuln patterns. Pup structurally locates HTTP entry points (where req.query / req.body / req.params enter); it does not scan for "dangerous" function names.
2. The model does the vulnerability judgment. The agent is handed an entry point and a set of navigation tools, then investigates across files and reasons about exploitability. This is what catches contextual bugs that no signature finds (IDOR, broken authorization, business-logic flaws) and what lets Pup suppress a scary-looking-but-safe parameterized query.
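The context bundle and the evidence-based finding in the pipeline above could be modelled roughly as follows. This is only a sketch: the names Finding, EntryPoint, and ContextBundle appear in the repo layout, but every type and field name below is an assumption made for illustration, not Pup's actual schema.

```typescript
// Hypothetical shapes; field names are illustrative, not Pup's real schema.
type SketchSeverity = "critical" | "high" | "medium" | "low" | "info";

interface SketchEntryPoint {
  method: string;            // e.g. "GET"
  route: string;             // e.g. "/api/orders/:id"
  handlerFile: string;       // file where untrusted input enters
  untrustedInputs: string[]; // e.g. ["req.params.id"]
}

interface SketchContextBundle {
  entryPoint: SketchEntryPoint;
  handlerSource: string;
  trustBoundaries: string[]; // e.g. ["requires session auth"]
}

interface SketchFinding {
  title: string;
  severity: SketchSeverity;
  exploitable: boolean;    // the agent's verdict
  evidence: string[];      // code excerpts supporting the verdict
  executionPath: string[]; // ordered hops from input to sink
  fix: string;
}

// Keep a finding open only when it is judged exploitable AND comes with
// evidence and an execution path; everything else counts as suppressed.
function splitFindings(findings: SketchFinding[]): {
  open: SketchFinding[];
  suppressed: SketchFinding[];
} {
  const open: SketchFinding[] = [];
  const suppressed: SketchFinding[] = [];
  for (const f of findings) {
    const proven = f.exploitable && f.evidence.length > 0 && f.executionPath.length > 0;
    (proven ? open : suppressed).push(f);
  }
  return { open, suppressed };
}
```

The point of the split is that a verdict without evidence and a path never reaches the open list, which mirrors the "fewer, higher-confidence findings" goal.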

See it in action

More animated runs — real findings, severities, and costs (full gallery: docs/demos.md):

Reachability triage. Two known CVEs: one is reachable (real risk), the other the CVE database rates HIGH but Pup proves not reachable and filters out — plus a CycloneDX SBOM.

[Demo: Pup AI SCA]
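Reachability triage can be pictured as a graph search: an advisory only matters if some path leads from an entry point to the vulnerable function. The sketch below uses a plain breadth-first search over a hand-written call graph. The graph, the node names, and `isReachable` are all hypothetical; Pup's agent reasons over real code and does not necessarily work this way.

```typescript
// Hypothetical call graph: caller -> callees. Node names are illustrative.
type CallGraph = Record<string, string[]>;

// Breadth-first search from the entry points; returns true if any path
// reaches the vulnerable function named in an advisory.
function isReachable(graph: CallGraph, entryPoints: string[], vulnerableFn: string): boolean {
  const seen: Record<string, boolean> = {};
  const queue: string[] = [];
  for (const entry of entryPoints) {
    seen[entry] = true;
    queue.push(entry);
  }
  while (queue.length > 0) {
    const current = queue.shift() as string;
    if (current === vulnerableFn) return true;
    for (const callee of graph[current] || []) {
      if (!seen[callee]) {
        seen[callee] = true;
        queue.push(callee);
      }
    }
  }
  return false;
}

const demoGraph: CallGraph = {
  "GET /render": ["templates.render"],
  "templates.render": ["libA.compile"],
  "scripts/migrate": ["libB.parse"], // never called from an HTTP entry point
};

console.log(isReachable(demoGraph, ["GET /render"], "libA.compile")); // true
console.log(isReachable(demoGraph, ["GET /render"], "libB.parse"));   // false
```

In this toy graph the advisory against `libB.parse` would be filtered out regardless of its database severity, because no entry point leads to it.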

In your pull requests. The summary comment Pup posts on every PR, with the gate that blocks the merge:

[Image: Pup PR comment]

→ Also in the gallery: AI SAST, AI-validated secrets, cross-endpoint chains, and the living threat-model data-flow diagram.

Results on known-vulnerable apps

Pup run against public, deliberately-vulnerable apps (where the bugs are documented, so you can check its work). Context-aware (threat model + SAST):

| App | Open findings | False positives suppressed | Endpoints | Cost |
|---|---|---|---|---|
| DVNA | 11 (6 crit / 1 high / 3 med / 1 low) | 5 | 10 | ~$2 |
| NodeGoat | 20 (1 crit / 10 high / 5 med / 4 low) | 11 | 19 | ~$7.50 |
| Juice Shop | 18 (5 high / 10 med / 3 low) | 117 | 20 of 238¹ | ~$7 |

Findings map to each app's documented OWASP issues (SQLi, command injection, RCE via eval/deserialization, XXE, IDOR, stored/reflected XSS, broken access control, ReDoS, …) — with evidence and execution paths, and the non-issues filtered out.

¹ Juice Shop has 238 detected endpoints; we ran a representative sample of 20 (--limit 20) to keep the showcase cheap. Several findings map to named Juice Shop challenges (Zip Slip, XXE, SSRF, persisted-XSS feedback, YAML bomb, CAPTCHA bypass). A full baseline scan is the one-time larger-cost run.
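For a rough sense of unit cost, the table's approximate totals can be divided by the number of endpoints scanned. The snippet below does only that arithmetic; the figures are copied from the table and are themselves approximations, and the variable and function names are illustrative.

```typescript
// Figures copied from the results table above (approximate USD).
const showcaseRuns = [
  { app: "DVNA", endpointsScanned: 10, costUsd: 2.0 },
  { app: "NodeGoat", endpointsScanned: 19, costUsd: 7.5 },
  { app: "Juice Shop", endpointsScanned: 20, costUsd: 7.0 },
];

// Cost per scanned endpoint, formatted to two decimals.
function costPerEndpoint(costUsd: number, endpoints: number): string {
  return (costUsd / endpoints).toFixed(2);
}

for (const run of showcaseRuns) {
  console.log(`${run.app}: ~$${costPerEndpoint(run.costUsd, run.endpointsScanned)} per endpoint`);
}
// DVNA: ~$0.20 per endpoint
// NodeGoat: ~$0.39 per endpoint
// Juice Shop: ~$0.35 per endpoint
```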

Compared to traditional scanners

Public analyses of SAST tools on these same apps show how much results swing by tool and ruleset. One Semgrep run on Juice Shop surfaced only a handful of findings (Yıldırım, A Brief Semgrep Analysis of Juice Shop), while multi-tool deep-dives (Semgrep + CodeQL + OSV-Scanner) find different subsets and conclude "no single tool catches everything" (Deep Dive into SAST Tools for OWASP Juice Shop). Notably, none publish a clean true-positive / false-positive breakdown, and the trade-off traditional SAST makes (higher recall at the cost of a high false-positive rate) is a well-documented source of alert fatigue.

Pup's difference is the "false positives suppressed" column above: every listed finding is backed by evidence and an execution path showing how it can be exploited, and the non-issues are filtered out rather than handed to you as noise. A rigorous head-to-head (Pup vs Semgrep/Snyk on the same app version, comparing real-exploitable findings against false positives) has not been run yet and is the credible next step.

Install

npm install -g pup-appsec          # provides the `pup` command
pup scan ./my-app

# ...or run it without installing:
npx --package pup-appsec pup scan ./my-app

# Docker (no Node needed):
docker run --rm -e ANTHROPIC_API_KEY -v "$PWD:/repo" ghcr.io/shadowexia1/pup scan /repo --diff main

Set ANTHROPIC_API_KEY in your environment (or a .env). To build from source instead, see Quick start below.

One command to review a whole repo — threat model + SAST + SCA into a single shareable HTML report:

pup audit ./my-app --out report.html        # open report.html in a browser

What you need to run Pup

- Node.js 20+ (or Docker).
- An LLM API key. Pup sends focused code context to a model to reason about exploitability. You bring your own key; it stays in your environment and is never bundled or sent anywhere but your chosen provider.
  - Anthropic / Claude (recommended): set ANTHROPIC_API_KEY.
  - OpenAI or Azure OpenAI (alternative): set PUP_PROVIDER=openai and OPENAI_API_KEY (and PUP_OPENAI_BASE_URL for Azure).
  - You need a little API credit on that account. A full pup audit of a small app is typically well under a dollar (Pup uses prompt caching + incremental scanning to keep cost down).

That's it — install, set the key, run. No account, no signup, nothing phones home.

What it costs to run

You pay your LLM provider directly (BYO key). Pup is built to be cheap — prompt caching + incremental scanning keep it low. Rough estimates (Claude Opus, real numbers from the sample apps):

| Action | Typical cost (USD) | How often |
|---|---|---|
| Threat model | ~0.15 – 0.40 | once, then cheap incremental updates |
| Full repo baseline scan / pup audit | ~0.05 – 0.40 per endpoint; a small app under 1 dollar, a large app a few dollars | once (the baseline) |
| Per merge-request scan (incremental) | ~0.10 – 1.00; only the changed endpoints/dependencies | every MR |

The expensive part is the one-time baseline; after that, MRs only re-scan the diff, so steady-state cost is a few cents to ~1 dollar per MR. Run pup … --budget 5 to cap and report spend.

Large monorepos. Baseline cost scales with the number of entry points: roughly endpoints × $0.10–0.40 on Claude Opus. A big service with ~1,000 endpoints is a ~$100–400 one-time baseline, and you control it: scan incrementally (--diff), in slices (--entrypoint / --limit), or on a cheaper/local model (gpt-4o, or Ollama for ~free). For a team, that is typically well below the cost of a year of per-seat enterprise SAST/SCA licensing, and steady-state cost is a few cents to about a dollar per PR. Scope matters too: scan the directory (or repo root) that contains all the code you care about (see Limitations → scan-path scoped).
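The endpoints × $0.10–0.40 rule of thumb is easy to turn into a planning helper. The sketch below is just that arithmetic, run before a scan; `estimateBaselineUsd` and `fitsBudget` are hypothetical names and are not part of Pup, whose `--budget` flag works on actual spend.

```typescript
// Per-endpoint range quoted above for Claude Opus, held in cents to avoid
// floating-point drift.
const LOW_CENTS_PER_ENDPOINT = 10;
const HIGH_CENTS_PER_ENDPOINT = 40;

function estimateBaselineUsd(endpointCount: number): { low: number; high: number } {
  return {
    low: (endpointCount * LOW_CENTS_PER_ENDPOINT) / 100,
    high: (endpointCount * HIGH_CENTS_PER_ENDPOINT) / 100,
  };
}

// Would the worst-case estimate fit under a cap such as `--budget 5`?
function fitsBudget(endpointCount: number, budgetUsd: number): boolean {
  return estimateBaselineUsd(endpointCount).high <= budgetUsd;
}

console.log(estimateBaselineUsd(1000)); // { low: 100, high: 400 }
console.log(fitsBudget(12, 5));         // true: 12 endpoints cost at most $4.80
```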

Why Pup vs Snyk and traditional scanners

| | Traditional scanners (Snyk, etc.) | Pup |
|---|---|---|
| How it finds issues | signatures, regexes, CVE/taint rules | reasons about the code like an engineer |
| Output | many findings, high false-positive rate | fewer, higher-confidence, proven-exploitable findings |
| "Is it actually exploitable?" | rarely answered | evidence, execution path, attack scenario, fix for every finding |
| Context | none; same verdict everywhere | threat-model context loop: knows if an endpoint is internet-facing, behind auth, handling sensitive data |
| Cost | per-seat SaaS subscription | free & open-source, bring your own key |
| Your code | uploaded to their cloud | stays in your environment, sent only to your chosen LLM |
| Run it your way | their platform only | CLI · CI (GitHub/GitLab) · your editor (MCP) · self-host |

The headline: fewer false positives, every finding proven, and it's free.

Why Pup (vs Claude/Cursor/Copilot security review)

AI security review is ramping up, and we're now seeing tools that actually reason about a bug and explain it with evidence. Pup isn't trying to be a better review bot — it's a dedicated, open AppSec platform with a different shape:

- Whole-app attack surface, not just the diff. Pup enumerates every HTTP entry point across the codebase (238 in Juice Shop, including factory-pattern routes and cross-file framework splits) and reasons about each. PR-review tools focus on the lines that changed; Pup finds the exploitable path in code this PR didn't touch.
- More than just SAST. One tool that includes:
  - exploitability-reasoning SAST (including secrets scanning)
  - reachability-based SCA
  - CycloneDX SBOM
  - a living, committed threat model
  - cross-endpoint chains
  - a latent / defense-in-depth pass (--latent)

  A review bot reviews code; Pup is your security engineer.
- Persistent context. The threat model (auth, WAF, trust boundaries) and sticky suppressions persist across runs, so "exploitable" means exploitable in your architecture, and yesterday's false positive stays suppressed.
- Open and vendor-neutral. FSL open-core you can read, fork, and self-host. Runs on Claude, OpenAI, Gemini, or a local model (air-gapped / cost-controlled): bring your own key, no lock-in.
- Built for pipelines. Exit-code gating to block merges, plus SARIF / GitLab / SBOM / Prometheus / webhook / MCP outputs.
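One way a suppression can stay "sticky" across runs is to key it on properties that survive unrelated edits rather than on line numbers. The sketch below illustrates that idea only; the fingerprint fields, the `|` separator, and both function names are assumptions, and Pup's real suppression format may differ.

```typescript
// Hypothetical finding shape for illustration.
interface SuppressibleFinding {
  category: string; // e.g. "sql-injection"
  route: string;    // e.g. "GET /api/products"
  sinkFile: string; // file containing the flagged sink
  line: number;     // deliberately NOT part of the fingerprint
}

// Stable across unrelated edits because it ignores line numbers.
function fingerprint(f: SuppressibleFinding): string {
  return [f.category, f.route, f.sinkFile].join("|");
}

// Drop findings whose fingerprint was suppressed in an earlier run.
function applySuppressions(
  findings: SuppressibleFinding[],
  suppressedFingerprints: string[]
): SuppressibleFinding[] {
  return findings.filter((f) => suppressedFingerprints.indexOf(fingerprint(f)) === -1);
}
```

With this scheme, a false positive suppressed at line 40 stays suppressed when a later commit pushes the same code to line 57.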

Honest take: if you're all-in on one vendor and just want quick PR review, that vendor's built-in reviewer may be all you need. Pup is for teams already using AI who want to move away from traditional scanners — which add little value by comparison — and want SAST + SCA + threat-modelling in one place without building it all out themselves. Pup gives you the freedom to run it on any model, including a local one.

Quick start

Installed the CLI (above)? Point pup at any project:

export ANTHROPIC_API_KEY=sk-ant-...           # or put it in a .env file
pup scan  ./my-app                             # AI SAST
pup sca   ./my-app --sbom sbom.cdx.json        # dependency analysis + SBOM
pup audit ./my-app --out report.html           # threat model + SAST + SCA, one report

Add --dry-run to any scan to list the entry points Pup found without making any API calls (free), which makes a good first look. Useful flags: --json findings.json, --fail-on high (CI gate), and --entrypoint "/api/orders" --limit 1 (focus on a single entry point).

From source (contributors, or to try the bundled vulnerable sample):

git clone https://github.com/ShadowExia1/Pup && cd Pup && npm install
cp .env.example .env            # add your key
npm run scan:sample:dry         # dry run on samples/vulnerable-express (no key)
npm run scan:sample             # live analysis (needs key)
npm run pup -- scan /path/to/project

The sample target

samples/vulnerable-express/ is a deliberately vulnerable Express app used to exercise the engine. It contains, by design:

- A SQL injection that is only visible by tracing input across three files (app.js → services.js → db.js).
- An IDOR / broken authorization bug on GET /api/orders/:id: the route is authenticated but never checks that the order belongs to the caller. No regex finds this; it requires reasoning about a missing check.
- A command injection in the report export endpoint.
- A safe, parameterized product search that looks like the SQLi but is not, included to verify Pup suppresses false positives.
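The IDOR case is worth seeing in miniature, because the vulnerable and fixed versions differ only by an ownership check that no pattern can spot as "missing". The code below is not taken from samples/vulnerable-express; the data, types, and function names are hypothetical and framework-free so the logic stands alone.

```typescript
interface DemoOrder {
  id: string;
  ownerId: string;
  total: number;
}

const demoOrders: DemoOrder[] = [
  { id: "1001", ownerId: "alice", total: 42 },
  { id: "1002", ownerId: "bob", total: 17 },
];

function findOrder(orderId: string): DemoOrder | undefined {
  return demoOrders.filter((o) => o.id === orderId)[0];
}

// Vulnerable: the caller is authenticated, but ownership is never checked,
// so any logged-in user can read any order by guessing its id.
function getOrderVulnerable(_callerId: string, orderId: string): DemoOrder | undefined {
  return findOrder(orderId);
}

// Fixed: return the order only when it belongs to the caller.
function getOrderFixed(callerId: string, orderId: string): DemoOrder | undefined {
  const order = findOrder(orderId);
  return order && order.ownerId === callerId ? order : undefined;
}

console.log(getOrderVulnerable("alice", "1002")); // bob's order leaks to alice
console.log(getOrderFixed("alice", "1002"));      // undefined
```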

Configuration

Environment variables (via shell or .env):

| Variable | Default | Purpose |
| --- | --- | --- |
| ANTHROPIC_API_KEY | (none) | Required for live analysis. |
| PUP_MODEL | claude-opus-4-8 | Reasoning model for the SAST agent. |
| PUP_EFFORT | high | Agent effort: low/medium/high/max. |

Reports, dashboard & CI

Every scan can emit machine- and human-readable output:

# --json     structured report (summary + cost + findings)
# --html     self-contained dashboard; just open it
# --sarif    GitHub/GitLab code scanning (inline PR annotations)
# --fail-on  exit non-zero if any exploitable finding >= high (CI gate)
npm run pup -- scan <path> \
  --json report.json \
  --html report.html \
  --sarif report.sarif \
  --fail-on high

Every scan also prints an MR-friendly one-liner:

SAST results — 2 critical · 1 high · 0 medium · 0 low · 0 info · 1 false positives

The HTML dashboard is a single static file (no server): severity chips, scan cost, and every finding with its evidence, execution path, fix, generated patch, and security test. Suppressed/safe findings are shown separately so you can see what Pup reviewed and cleared.

CI / Merge Requests — set up once, runs on every PR

It's a two-step, one-time setup — then Pup runs on every PR/MR automatically (nothing to re-add per PR):

1. Copy a template into your repo: ci/github-actions.yml → .github/workflows/pup-security.yml, or ci/gitlab-ci.yml → .gitlab-ci.yml.
2. Add an ANTHROPIC_API_KEY secret (GitHub: Settings → Secrets → Actions; GitLab: Settings → CI/CD → Variables).

Each PR then runs SAST + SCA, updates a committed threat model, writes an SBOM, uploads SARIF (inline annotations), posts a summary comment, and fails the check when a finding meets the gate severity. The templates use the published pup-appsec package — no build step.

The SAST template scans all changed code by default (--deep-reachability re-scans endpoints that import changed shared files; --latent flags risky sinks in changed files) — toggle via PUP_SAST_EXTRA.

Choose what blocks a merge with PUP_FAIL_ON: critical / high / medium / low to set the bar, or leave it empty for report-only (Pup still posts findings but never fails). Different teams/repos can pick different bars. To enforce it across many repos (org secrets, required status checks, branch protection, reusable workflows) see Configure the platform. Full guide + Azure/Bitbucket/Jenkins: ci/README.md.
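The gate semantics described above (fail when any finding meets the bar, never fail when the bar is empty) can be sketched in a few lines. This is an illustration of the rule, not Pup's source; the names are hypothetical, and the specific exit code 1 is an assumption, since the text only promises a non-zero exit.

```typescript
type GateSeverity = "critical" | "high" | "medium" | "low";

const SEVERITY_RANK: Record<GateSeverity, number> = {
  low: 1,
  medium: 2,
  high: 3,
  critical: 4,
};

// Returns a process exit code: 1 blocks the merge, 0 lets it through.
// An empty threshold means report-only, so the gate never fails.
function gateExitCode(findingSeverities: GateSeverity[], failOn: GateSeverity | ""): number {
  if (failOn === "") return 0;
  const bar = SEVERITY_RANK[failOn];
  return findingSeverities.some((s) => SEVERITY_RANK[s] >= bar) ? 1 : 0;
}

console.log(gateExitCode(["medium", "high"], "high")); // 1
console.log(gateExitCode(["medium"], "high"));         // 0
console.log(gateExitCode(["critical"], ""));           // 0 (report-only)
```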

The intended flow: run one manual full-repo baseline first (pup scan .), then let CI scan incrementally per PR (changed files / changed dependencies only) to keep per-PR cost minimal.
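Incremental scanning comes down to choosing which endpoints a change set touches. The sketch below shows one plausible selection rule, including the `--deep-reachability` behaviour of re-scanning endpoints that import a changed shared file. The data shape and function name are hypothetical, and Pup's real selection logic is not shown in this README.

```typescript
// Hypothetical per-endpoint dependency record.
interface EndpointDeps {
  route: string;
  handlerFile: string;
  importedFiles: string[]; // shared modules the handler pulls in
}

// Select routes to re-scan for a set of changed files. With deepReachability
// off, only endpoints whose own handler file changed are selected.
function selectEndpointsToRescan(
  endpoints: EndpointDeps[],
  changedFiles: string[],
  deepReachability: boolean
): string[] {
  const changed = (file: string): boolean => changedFiles.indexOf(file) !== -1;
  return endpoints
    .filter((e) => changed(e.handlerFile) || (deepReachability && e.importedFiles.some(changed)))
    .map((e) => e.route);
}

const demoEndpoints: EndpointDeps[] = [
  { route: "GET /api/orders/:id", handlerFile: "routes/orders.ts", importedFiles: ["lib/db.ts"] },
  { route: "GET /api/products", handlerFile: "routes/products.ts", importedFiles: ["lib/db.ts", "lib/search.ts"] },
  { route: "GET /health", handlerFile: "routes/health.ts", importedFiles: [] },
];

console.log(selectEndpointsToRescan(demoEndpoints, ["lib/search.ts"], true));  // [ 'GET /api/products' ]
console.log(selectEndpointsToRescan(demoEndpoints, ["lib/search.ts"], false)); // []
```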

SBOM. pup sca --sbom sbom.cdx.json writes a CycloneDX dependency inventory (npm, PyPI, Maven, Go) for Dependency-Track, Grype, or procurement — pure parsing, no AI cost.
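For readers who have not seen a CycloneDX file, the sketch below builds a minimal JSON document from a list of npm dependencies. It shows the general format only: the spec version chosen here (1.5), the function names, and the decision to keep the scope inside `name` are assumptions, and the file Pup writes may contain more fields.

```typescript
interface NpmDependency {
  name: string;    // e.g. "express" or "@types/node"
  version: string; // exact version, as pinned by a lockfile
}

// Package URL for an npm package; the "@" of a scoped name is percent-encoded.
function npmPurl(dep: NpmDependency): string {
  const encodedName = dep.name.replace(/^@/, "%40");
  return `pkg:npm/${encodedName}@${dep.version}`;
}

// Minimal CycloneDX JSON document: format marker, spec version, BOM version,
// and one "library" component per dependency.
function toCycloneDx(deps: NpmDependency[]) {
  return {
    bomFormat: "CycloneDX",
    specVersion: "1.5",
    version: 1,
    components: deps.map((dep) => ({
      type: "library",
      name: dep.name,
      version: dep.version,
      purl: npmPurl(dep),
    })),
  };
}

const demoBom = toCycloneDx([
  { name: "express", version: "4.18.2" },
  { name: "@types/node", version: "20.1.0" },
]);
console.log(JSON.stringify(demoBom, null, 2));
```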

Push findings to your own tooling. --webhook <url> POSTs the JSON report to any endpoint (a collector, DefectDojo, a Slack relay); --prometheus <file> writes findings-by-severity + cost as Grafana-ready metrics. Step-by-step setup for each: docs/integrations.md.
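To make "Grafana-ready metrics" concrete, here is what findings-by-severity plus cost could look like in the Prometheus text exposition format. The metric names `pup_findings` and `pup_scan_cost_usd` and the function itself are hypothetical; check the file Pup actually writes (or docs/integrations.md) for the real names.

```typescript
// Render severity counts and scan cost as Prometheus text exposition format.
// Metric names are hypothetical placeholders, not Pup's real metric names.
function toPrometheusText(countsBySeverity: Record<string, number>, costUsd: number): string {
  const lines: string[] = [
    "# HELP pup_findings Open findings by severity.",
    "# TYPE pup_findings gauge",
  ];
  for (const severity of Object.keys(countsBySeverity)) {
    lines.push(`pup_findings{severity="${severity}"} ${countsBySeverity[severity]}`);
  }
  lines.push("# HELP pup_scan_cost_usd LLM spend for the scan in USD.");
  lines.push("# TYPE pup_scan_cost_usd gauge");
  lines.push(`pup_scan_cost_usd ${costUsd}`);
  return lines.join("\n") + "\n";
}

console.log(toPrometheusText({ critical: 2, high: 1, medium: 0, low: 0 }, 1.85));
```

One labelled gauge per severity keeps dashboards simple: a single query can stack or filter by the `severity` label.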

📚 All guides are indexed in docs/.

Use inside your editor (MCP)

Pup runs as an MCP server so a coding agent (Claude Code, Cursor, Windsurf, …) can call its scans inline while you write code — security review where the code is written. It exposes three tools: pup_scan (SAST), pup_sca (SCA), and pup_threat_model.

Start it: npm run mcp (it speaks JSON-RPC over stdio).

Claude Code — add to .mcp.json (or claude mcp add):

{
  "mcpServers": {
    "pup": {
      "command": "npx",
      "args": ["tsx", "/absolute/path/to/Pup/src/cli.ts", "mcp"],
      "env": { "ANTHROPIC_API_KEY": "sk-ant-..." }
    }
  }
}

Cursor — the same shape in .cursor/mcp.json.

Then ask your agent things like "use pup_scan on src/api/orders.ts" or "run pup_threat_model then pup_scan and tell me what's actually exploitable." Tool calls are capped to a few entry points/advisories by default — pass entryPoint, diff, or limit to scope a larger run.
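Under the hood, an agent's request to run a tool is a JSON-RPC 2.0 `tools/call` message written to the server's stdin as a single line. The sketch below builds such a message. The tool name `pup_scan` and the `limit` argument come from the text above, while the `path` argument name is an assumption, and a real client must complete the MCP `initialize` handshake before calling tools.

```typescript
// Shape of an MCP tool invocation over JSON-RPC 2.0.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Serialize one request. The stdio transport is newline-delimited, so the
// message is a single line of JSON followed by "\n".
function buildToolCall(id: number, toolName: string, args: Record<string, unknown>): string {
  const request: ToolCallRequest = {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
  return JSON.stringify(request) + "\n";
}

// `path` is a hypothetical argument name; `limit` is mentioned in the README.
const scanCallLine = buildToolCall(1, "pup_scan", { path: "src/api/orders.ts", limit: 1 });
console.log(scanCallLine);
```

In practice the coding agent generates these messages for you; the sketch is only meant to show what travels over the pipe.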

Layout

src/
  cli.ts                  # `pup scan <path>`
  config.ts
  core/types.ts           # Finding, EntryPoint, ContextBundle
  context/
    project.ts            # load target into a ts-morph project
    entrypoints.ts        # structural HTTP entry-point detection
    contextBuilder.ts     # assemble the agent's focused context + seed prompt
  agent/
    tools.ts              # code-navigation tools the agent drives
    sastAgent.ts          # the reasoning loop + evidence-based finding schema
  report/render.ts        # terminal report
samples/
  vulnerable-express/     # deliberately vulnerable target

License

Pup is source-available under the Functional Source License (FSL-1.1-Apache-2.0). You may read, run, self-host, modify, and contribute to it freely. The only restriction is a Competing Use — you may not offer Pup (or a substantially similar service) as a commercial product that competes with the maintainers. Each release automatically converts to the permissive Apache License 2.0 two years after it is published.

This is an open-core model: the engine is open; a hosted commercial layer (team dashboard, history, SSO, policy, compliance, managed inference) is the business. Not legal advice — consult a lawyer for your use.

Limitations (what Pup does not do — yet)

About the engine

- Non-deterministic. It's LLM-based; the same code can produce slightly different findings or wording across runs. CI gating works on exit codes, but treat findings as an expert review, not a formal proof: verify the criticals and highs.
- Not sound static analysis. No formal taint/dataflow; Pup reasons over code it navigates. Heavy reflection, dynamic dispatch, or metaprogramming can cause both false negatives and false positives.
- Your code is sent to an LLM provider (Anthropic/OpenAI/…), unless you run a local model (Ollama/vLLM) for air-gapped/private use.

Coverage

- Coverage = what's detected. Undetected entry points (exotic frameworks, dynamic route registration) aren't analyzed. Non-JS/TS detection is heuristic, not full AST (JS/TS gets ts-morph depth).
- Reachable-first by default. The high-confidence findings come from code reachable from detected entry points. Risky code that no entry point reaches isn't in that proven-exploitable set, but you can opt in with --latent, which surfaces inherently dangerous sinks (eval, exec, deserialization, dynamic SQL, weak crypto) anywhere in the repo at reduced severity (it flags for review; it doesn't prove exploitability). Non-HTTP reachability still isn't traced.
- Scan-path scoped. Only files under the path you scan are indexed. In a monorepo, code imported from outside that directory (e.g. ./backend importing ../shared) isn't seen, so scan the directory, or the repo root, that contains everything relevant.
- HTTP entry points only: no message queues, gRPC, WebSockets, CLI, or cron/worker surfaces.
- No mobile client (iOS/Android/RN), no DAST, no client-side/DOM-XSS, no IaC/container/cloud-config, no license compliance.

Dependencies

SCA/SBOM cover npm/PyPI/Maven/Go, but depth needs a lockfile (without one, versions may be ranges and transitive deps incomplete). --diff scoping is npm-only today.

Cost & maturity

- A full baseline of a large app is a real spend (about $7 for 20 complex Juice Shop endpoints on Claude Opus), though small next to what commercial products charge today. Use a one-time baseline + cheap incremental MR scans, or a cheaper/local model. See What it costs to run.
- Early-stage (v0.2). Expect rough edges and breaking changes.