falsegreen-skill

v0.2.0

Published

2 days ago

LLM skill for false-positive test detection (J1-J6 protocol)

vinicq

claude-code-plugin testing test-smells code-quality false-positive robotframework

falsegreen-skill

LLM-based semantic analysis for false-positive test detection. Companion to falsegreen, the Python static scanner.

For Python, this skill applies the complete falsegreen catalog directly — all structural and semantic patterns — via LLM analysis, without requiring the static scanner to run first. For TypeScript, JavaScript, and Robot Framework it is the primary detection tool. It is a superset of the three static scanners (falsegreen, falsegreen-js, robotframework-falsegreen) plus semantic patterns only an LLM can detect.

Why this exists

A test suite with 100% green tests is not a proof of correctness. It is a proof that no test failed — which is a different thing. Tests can pass permanently not because the code is right, but because the test never checks anything meaningful.

Static analysis tools catch some of these cases. Linters like ruff or flake8-pytest-style catch syntax-level patterns: a bare assert True, a test with no assertion, an unreachable block. Mutation testing tools like mutmut probe whether tests actually fail when the code changes. Both approaches have limits: linters cannot reason about test intent, and mutation testing requires the code to run.

This skill fills the gap between linters and mutation testing. It reads the test as text, reconstructs the intent, and asks six structural questions about whether the test can actually fail. The questions are derived from the taxonomy of false-positive test patterns collected in CREDITS.md.

The core insight: a test is useful if and only if there exists some incorrect implementation that would cause it to fail. If no such implementation exists — because the assertion is unreachable, tautological, or verifies the mock instead of the code — the test is structurally green regardless of whether the production code is correct.
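The core insight can be sketched in a few lines. Everything here is a hypothetical illustration, not part of falsegreen: `apply_discount` stands in for production code, `apply_discount_broken` for an incorrect implementation, and `passes` is a small helper that reports whether a test survives a given implementation.

```python
def apply_discount(price, percent):
    """Correct implementation: reduce price by the given percentage."""
    return price * (1 - percent / 100)


def apply_discount_broken(price, percent):
    """Incorrect implementation: ignores the discount entirely."""
    return price


def structurally_green_test(discount_fn):
    result = discount_fn(200, 25)
    expected = result  # expected value copied from the result itself
    # True for any return value short of NaN, so a wrong implementation passes.
    assert result == expected


def useful_test(discount_fn):
    result = discount_fn(200, 25)
    # Expected value comes from the requirement: 25% off 200 is 150.
    assert result == 150


def passes(test, discount_fn):
    """Return True if the test passes against the given implementation."""
    try:
        test(discount_fn)
    except AssertionError:
        return False
    return True
```

Under these assumptions, `passes(structurally_green_test, apply_discount_broken)` is true while `passes(useful_test, apply_discount_broken)` is false: only the second test has an incorrect implementation that makes it fail.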

The methodology

One rule underlies every judgment: a test is useful only if it can fail when the code breaks.

The six-judgment framework (J1-J6) makes this rule concrete:

| # | Question | Catches |
|---|---|---|
| J1 | Does the assertion run? | Dead assertions, vacuous loops, swallowed failures |
| J2 | Is the expected value from an independent oracle? | Echo mocks, formula re-implementation, spec contradictions |
| J3 | Is the real unit under test, not a mock of it? | Mock-the-SUT, self-confirming literals |
| J4 | Does the assertion verify enough? | Truthiness-only, len > 0, repr coupling, broad raises |
| J5 | Is the test coupled to implementation internals? | Positional mock args, private method testing |
| J6 | Does the test pass in isolation, without ordering? | Shared mutable state, test-order dependency |

A test is flagged HIGH only when the first failed judgment has no plausible legitimate interpretation. A test is flagged LOW when the smell is likely but has plausible intent. Everything else is PASS.

Precision over recall. One wrong flag on a legitimate test costs more goodwill than a missed smell. Exemptions are explicit:

Semantic case 18 requires a cited independent oracle (spec, docstring, API contract). Without a citation, do not report case 18.
Characterization tests — intentionally freezing current behavior — are not false positives.
Boolean predicates (isinstance, .exists(), .is_dir()) are not weak assertions.
In HTTP/UI layer tests, a truthiness check on a response object means "the request succeeded" and is meaningful.
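As an illustration of the boolean-predicate exemption, the sketch below uses only predicates as assertions. `write_report` is a hypothetical function under test introduced for this example; each predicate has a definite failing outcome, which is why such checks are not treated as weak.

```python
import tempfile
from pathlib import Path


def write_report(directory: Path) -> Path:
    """Hypothetical function under test: writes a report and returns its path."""
    target = directory / "report.txt"
    target.write_text("total=3\n")
    return target


def test_report_is_created():
    with tempfile.TemporaryDirectory() as tmp:
        path = write_report(Path(tmp))
        # Each predicate fails if the function returns the wrong type,
        # skips the write, or targets a location that does not exist.
        assert isinstance(path, Path)
        assert path.exists()
        assert path.parent.is_dir()
```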

Full protocol: SKILL.md.

What it detects

Python — structural patterns (complete falsegreen catalog)

Family A — The test never checks anything

| Code | Pattern | Example |
|---|---|---|
| C1 | Assert inside if/for that may not run | if items: assert items[0].valid when items can be [] |
| C2 | No assertion at all | test body contains only setup calls |
| C2b | Calls SUT but discards result | result = process(x), with result never asserted |
| C3 | Assert inside try whose except swallows it | except Exception: pass catches AssertionError |
| C4 | Test function nested inside another function | pytest does not collect inner defs |
| C4b | Test class with __init__ | pytest skips classes that have __init__ |
| C20 | Assertion after unconditional return/raise | dead code, never runs |
| C21 | Every assert is conditional, none runs unconditionally | all asserts inside if/else branches |
| CC | Commented-out assertion | # assert result == 42 |
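Two of the Family A patterns, C1 and C3, can be sketched as follows. `find_valid` is a hypothetical function under test, deliberately broken so that a meaningful test should fail; `strict_check` is an illustrative name for the unconditional version of the assertion.

```python
def find_valid(items):
    """Hypothetical function under test; this version is broken and returns nothing."""
    return []


def test_c1_assert_inside_loop():
    # C1: the loop body never executes for an empty list, so the assert is skipped.
    for item in find_valid(["a", "b"]):
        assert item.isalpha()


def test_c3_swallowed_assertion():
    # C3: AssertionError is a subclass of Exception, so the failure is swallowed.
    try:
        assert find_valid(["a", "b"]) == ["a", "b"]
    except Exception:
        pass


def strict_check():
    # Unconditional assertion on the whole result: fails for the broken function.
    assert find_valid(["a", "b"]) == ["a", "b"]
```

Both `test_` functions stay green against the broken `find_valid`, while `strict_check` raises AssertionError.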

Family B — The check is weak or always true

| Code | Pattern | Example |
|---|---|---|
| C5 | Always-true check | assert True, assert (a, b) (non-empty tuple) |
| C6 | Truthiness / len > 0 / substring in str() | assert result, assert len(x) > 0 |
| C6b | Positional mock arg via computed index | call_args.args[expected_args.index("target")] |
| C7 | Self-comparison | assert name == name |
| C8 | Exact float equality | assert ratio == 3.14159 |
| C9 | pytest.raises too broad or no match= | with pytest.raises(Exception) |
| C11a | Self-confirming literal | product.price = 100; assert product.price == 100 |
| C13 | Mock assertion uncalled or misspelled | mock.assert_called_once (no parens) |
| C13b | @patch without autospec=True | typos in kwargs pass silently |
| C14 | Golden file written from actual output | first run records any output as truth |
| C16 | Depends on wall clock, random, or sleep | datetime.now() unfrozen, time.sleep() |
| C18 | str()/repr() comparison | assert str(user) == "User(Alice, 30)" |
| C25 | @pytest.mark.xfail without strict=True | XPASS silently accepted |
| C34 | Suboptimal assertion form | == True, == None, not x in y, len == 0 |
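A minimal sketch of C13 and C11a, using only `unittest.mock` from the standard library. `notify`, `Product`, and `strict_check` are hypothetical names introduced for this example; `notify` is deliberately broken so the difference between the weak and the strict check is visible.

```python
from unittest.mock import Mock


class Product:
    def __init__(self, price):
        self.price = price


def notify(sender, message):
    """Hypothetical function under test; this version is broken and never sends."""
    return None


def test_c13_missing_parentheses():
    sender = Mock()
    notify(sender, "hello")
    # C13: without parentheses this is an attribute lookup; nothing is checked.
    sender.send.assert_called_once


def test_c11a_self_confirming_literal():
    product = Product(price=0)
    product.price = 100
    # C11a: verifies the assignment on the previous line, not production logic.
    assert product.price == 100


def strict_check():
    sender = Mock()
    notify(sender, "hello")
    # With parentheses the assertion executes and fails for the broken function.
    sender.send.assert_called_once_with("hello")
```

The two `test_` functions pass even though `notify` never calls `sender.send`; `strict_check` raises AssertionError.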

Family C — The test checks its own setup

| Code | Pattern | Example |
|---|---|---|
| C19 | pytest.raises wraps multiple calls | setup call inside raises block may be the one that raises |
| C28 | pytest.raises binding variable never read | as exc: but exc never asserted |
| C29 | os.environ mutated directly | os.environ["KEY"] = "x" without monkeypatch |
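C29 can be illustrated with the standard library alone. In a pytest suite the usual remedy is `monkeypatch.setenv`; this sketch substitutes `unittest.mock.patch.dict` so it runs without pytest installed. The variable name `FALSEGREEN_DEMO_MODE` and the function `read_mode` are hypothetical.

```python
import os
from unittest import mock

KEY = "FALSEGREEN_DEMO_MODE"  # hypothetical variable name used only in this sketch


def read_mode():
    """Hypothetical function under test: reads its mode from the environment."""
    return os.environ.get(KEY, "default")


def test_c29_direct_mutation():
    # C29: the assignment outlives this test and leaks into whatever runs next.
    os.environ[KEY] = "fast"
    assert read_mode() == "fast"


def test_scoped_mutation():
    # patch.dict restores os.environ to its previous state when the block exits.
    with mock.patch.dict(os.environ, {KEY: "fast"}):
        assert read_mode() == "fast"
```

After `test_scoped_mutation` the variable is gone again; after `test_c29_direct_mutation` it is still set, so a later test that expects the default can pass or fail depending on execution order.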

Family D — Green depends on outside factors

| Code | Pattern | Example |
|---|---|---|
| C17 | pytest.skip() inside broad except | assertion failure silently becomes a skip |
| C23 | Hard-coded absolute or home-relative path | /home/user/data.csv |
| C24 | Module-level mutable state shared between tests | _cache = {} at module scope |
| C27 | try/except/pass instead of pytest.raises | both raise and no-raise leave test green |
| C30 | responses.add() without activating interceptor | real HTTP goes through |
| C31 | capsys.readouterr() result discarded | captured output never asserted |
| C32 | @pytest.mark.skip without reason= | forgotten skip |
| C35 | @pytest.mark.flaky / retry decorator | masks non-determinism |
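A short sketch of C27, written without pytest so it runs on the standard library. `parse_age` is a hypothetical function under test that is supposed to reject negative input but, in this broken version, does not; `strict_check` is an illustrative name for the variant that fails on the no-raise path.

```python
def parse_age(text):
    """Hypothetical function under test; this version is broken and accepts negatives."""
    return int(text)


def test_c27_try_except_pass():
    # C27: green if ValueError is raised, and equally green if nothing is raised.
    try:
        parse_age("-5")
    except ValueError:
        pass


def strict_check():
    # Explicit failure on the no-raise path. In a pytest suite this is what
    # `with pytest.raises(ValueError):` provides.
    try:
        parse_age("-5")
    except ValueError:
        return
    raise AssertionError("parse_age('-5') did not raise ValueError")
```

Against the broken `parse_age`, the first function passes and `strict_check` raises AssertionError.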

Family E — The test checks the wrong thing

| Code | Pattern | Example |
|---|---|---|
| C33 | sklearn/ML metric computed but not asserted | accuracy_score(y, y_hat) result discarded |
| C36 | pytest.fail() without reason | CI shows only FAILED, no context |
| C37 | Duplicate case in @pytest.mark.parametrize | same (a, b, expected) tuple appears twice |

Semantic patterns (all three languages)

Semantic patterns require LLM judgment — no static rule can detect them.

| Case | Pattern |
|---|---|
| 10 | Patches the unit under test (not a dependency) |
| 11 | Asserts the value fed to the mock (echo) |
| 12 | Re-implements the production formula as the expected value |
| 15 | Passes only when another test has already run |
| 18 | Expected value contradicts the spec (freezes a bug as correct) |
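Cases 10 and 12 might look like the sketch below. `shipping_cost` and its spec (5.00 base plus 2.00 per kg) are assumptions made up for this example, and the function is deliberately broken. `patch.dict` on `globals()` stands in for the more common `mock.patch("yourmodule.shipping_cost")` so the snippet stays self-contained.

```python
from unittest import mock


def shipping_cost(weight_kg):
    """Hypothetical function under test. Assumed spec: 5.00 base plus 2.00 per kg.
    This version is broken: it charges 3.00 per kg."""
    return 5.00 + 3.00 * weight_kg


def test_case_12_formula_reimplementation():
    # Case 12: the expected value repeats the production formula, bug included.
    expected = 5.00 + 3.00 * 4
    assert shipping_cost(4) == expected


def test_case_10_patches_the_sut():
    # Case 10: the function under test is replaced, so only the mock is exercised.
    fake = mock.Mock(return_value=13.00)
    with mock.patch.dict(globals(), {"shipping_cost": fake}):
        assert shipping_cost(4) == 13.00


def independent_oracle_check():
    # Expected value worked out by hand from the assumed spec: 5.00 + 2.00 * 4.
    assert shipping_cost(4) == 13.00
```

Both `test_` functions pass despite the pricing bug. Only `independent_oracle_check`, whose expected value comes from the spec rather than from the code, raises AssertionError.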

Diagnostic and coupling codes (opt-in)

These codes do not create false positives, but they reduce observability and make failures harder to diagnose. They are OFF by default and can be enabled per code in .falsegreen.toml:

[tool.falsegreen]
severity = { D1 = "info", D3 = "info", D4 = "info", D5 = "info", D6 = "info", M2 = "info" }

| Code | Pattern | Why it matters |
|---|---|---|
| D1 | Assertion Roulette: 2+ asserts without messages | CI output says only the line number, which is hard to triage |
| D3 | Duplicate Assert: exact same assertion written twice | second assertion adds nothing |
| D4 | Unnamed Parametrize: 3+ cases, no ids= | CI shows test[0], test[1], which makes failure reports unreadable |
| D5 | Inline Setup Excess: 5+ setup statements before first assert | test should be split or setup moved to a fixture |
| D6 | Debug Print: print() or pprint() in test body | suppressed by default, often a forgotten debug statement |
| M2 | Long Test Method: test body over 50 lines | trying to verify too many concerns at once |

How it compares

vs. ruff / flake8-pytest-style

Ruff and flake8-pytest-style catch syntax-level patterns: assert True, pytest.raises with no type, magic values in assertions. They are fast and precise for the patterns they cover — about 8-10 of the 37+ cases in the falsegreen catalog.

This skill covers all 37+ structural codes and the 5 semantic cases that require reading the test as a whole — echo mocks, formula re-implementation, spec contradictions. The two tools are complementary: run the linter for instant feedback on simple cases, run the skill for semantic judgment on the rest.

vs. PyNose / pytest-smell

PyNose and pytest-smell are the closest research counterparts. Both apply the classic Palomba 2018 test-smell taxonomy (Assertion Roulette, Duplicate Assert, General Fixture, etc.). The falsegreen taxonomy is narrower: it focuses only on patterns that create false-positive green tests, not on maintainability smells in general.

Where there is overlap (Assertion Roulette = D1, Duplicate Assert = D3), falsegreen flags them as diagnostic codes — informational, not blocking. The structural codes unique to falsegreen (C1-C45) cover patterns that Palomba's taxonomy does not address because they were derived specifically from studying how green tests hide broken code in CI.

vs. mutmut / cosmic-ray

Mutation testing answers the question definitively: change the code, does the test fail? That is the ground truth. Mutmut and cosmic-ray are accurate for the programs they can run, but they require an executable environment, a full test suite, and minutes to hours per run.

This skill is a static pre-flight check. It cannot prove that a test fails when the code changes — that is mutation testing's job. It can identify, in seconds, tests that are structurally unable to fail: assertions that never execute, checks that are always true by construction, mocks that intercept the function being tested. Think of the skill as a fast filter before the mutation testing pass.

How to use

Installation

| Platform | How |
|---|---|
| Claude Code | /plugin marketplace add vinicq/falsegreen-skill then /plugin install falsegreen-skill@falsegreen |
| Claude.ai / Anthropic API Skills | npm run build:targets, then package dist/claude-agent-skill/ as the standalone skill |
| OpenAI Codex CLI | codex plugin marketplace add vinicq/falsegreen-skill, or clone the repo: AGENTS.md is auto-loaded |
| Gemini CLI | gemini extensions install https://github.com/vinicq/falsegreen-skill |
| Gemini Agent Skill | workspace skill at .gemini/skills/falsegreen-skill/SKILL.md, or npm run build:targets for dist/gemini-skill/ |
| Cursor | Copy contents of contexts/cursor.md to .cursor/rules/falsegreen-skill.mdc |
| CLI | npx falsegreen-skill analyze tests/test_example.py (see docs/cli.md) |
| API | Use the defined provider guides in contexts/claude.md, contexts/codex.md, and contexts/gemini.md |

Quick example

Given this test that echoes the mock back to itself:

# tests/test_tax.py
def test_calculate_tax(mock_calc):
    mock_calc.return_value = 0.15
    result = calculate_tax(100, mock_calc)
    assert result == mock_calc.return_value  # J2: asserting the mock, not behavior

export ANTHROPIC_API_KEY=sk-ant-...
npx falsegreen-skill analyze tests/test_tax.py

Output:

CASE 11 (J2) - HIGH - Python - spec

Test: test_calculate_tax (line 3-6)
Finding: The assertion checks mock_calc.return_value - the same value the
mock was configured to return. This passes for any return value, including
wrong ones.
Evidence:
  mock_calc.return_value = 0.15
  assert result == mock_calc.return_value
Fix hint: Assert against an independently computed expected value, e.g.
assert result == 15.0 for a 15% tax on 100.

SUMMARY
Tests reviewed: 1
Findings: 1 (1 high, 0 low)
Clean: 0
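Applying the fix hint might look like the sketch below. The body of `calculate_tax` is a hypothetical stand-in, since the real function is not shown in this README; the point is only that the expected value is worked out by hand instead of being read back from the mock. `math.isclose` is used rather than `==` to avoid an exact float comparison (C8).

```python
import math
from unittest.mock import Mock


def calculate_tax(amount, rate_provider):
    """Hypothetical stand-in for the function under test:
    multiplies the amount by the rate returned from the provider."""
    return amount * rate_provider()


def test_calculate_tax():
    mock_calc = Mock(return_value=0.15)
    result = calculate_tax(100, mock_calc)
    # Expected value worked out by hand: 15% of 100 is 15.
    assert math.isclose(result, 15.0)
    # The dependency was consulted exactly once, with no arguments.
    mock_calc.assert_called_once_with()
```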

Try it on your test suite

Point the CLI at any test file or directory:

# single file
npx falsegreen-skill analyze tests/test_orders.py

# multiple files
npx falsegreen-skill analyze tests/test_orders.py tests/test_payments.py

# JSON report for CI — exits 2 if any HIGH finding is present
npx falsegreen-skill analyze tests/test_orders.py --json --fail-on-high

# deep analysis with a stronger model
npx falsegreen-skill analyze tests/test_orders.py --model claude-opus-4-8

# lower temperature for more deterministic output (default is already 0.2)
npx falsegreen-skill analyze tests/test_orders.py --temperature 0.0

The skill identifies the language from the file extension. TypeScript and JavaScript work the same way — no extra flags needed.

Full flag reference: docs/cli.md.
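For CI scripts written in Python, the exit-code behaviour can be wrapped in a small helper. This is a sketch under stated assumptions: `run_falsegreen` and its `runner` parameter are hypothetical names, only the flags and the exit code 2 shown above are relied on, and the JSON output is returned unparsed because its structure is defined in schema/report.json rather than here.

```python
import subprocess


def run_falsegreen(paths, runner=subprocess.run):
    """Run the CLI on the given test files and report HIGH findings.

    Returns (has_high, stdout). The `runner` parameter exists only so the
    sketch can be exercised without Node.js installed.
    """
    command = ["npx", "falsegreen-skill", "analyze", *paths, "--json", "--fail-on-high"]
    completed = runner(command, capture_output=True, text=True)
    # Exit code 2 is the documented signal for at least one HIGH finding.
    # The meaning of other non-zero codes is not assumed here.
    return completed.returncode == 2, completed.stdout
```

A pipeline step could call `run_falsegreen(["tests/test_orders.py"])` and fail the build when the first element of the returned tuple is true.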

Claude Code (primary path)

Add the marketplace and install the plugin:

/plugin marketplace add vinicq/falsegreen-skill
/plugin install falsegreen-skill@falsegreen

Then invoke the skill with /falsegreen-skill:falsegreen-llm, or just attach a test file and ask for false-positive analysis — the skill triggers on intent. The skill identifies the language and framework, classifies the test intent, applies the six-judgment protocol, and reports findings with case numbers, confidence levels, and fix hints.

For Python, the skill applies the full pattern catalog directly. Optionally, run the static scanner first to speed up batch analysis:

pip install falsegreen
falsegreen tests/

If you provide the scanner output, the skill uses it as the structural pass and applies semantic judgment on top. Without it, the skill runs everything.

Defined API providers

This skill is not tied to Claude. The maintained provider paths are Anthropic, OpenAI/Codex, Google Gemini, and the configured CLI providers listed in providers.md.

See providers.md for per-provider invocation code and Cursor setup.

Cursor

Add .cursor/rules/falsegreen-skill.mdc to your project (template in providers.md). Open a test file, ask Cursor to analyze it for false-positive smells, and the J1-J6 protocol runs automatically.

Supported languages and frameworks

| Language | Frameworks |
|---|---|
| Python | pytest, unittest |
| TypeScript | Jest, Vitest, Mocha + Chai, React Testing Library, Vue Test Utils, Angular TestBed |
| JavaScript | Jest, Vitest, Mocha + Chai, Jasmine, React Testing Library |

Frontend component tests — React, Vue, Angular, Svelte — use the same J1-J6 framework as backend tests. The structural failures are identical: a J4 weak assertion on a rendered component is the same smell as a J4 on a service method. See the family-based examples under examples/typescript/ (for instance family_a_never_checks.ts, which carries the Testing Library patterns) for annotated cases.

Test levels (the pyramid)

The skill detects the test level and reads the oracle in light of it, a step the static scanners cannot fully perform. The level changes what counts as a valid check:

Unit: a function or component with its boundaries doubled. A real assertion on the return value is the oracle.
Integration (API and database): API tests (supertest, httpx, a framework TestClient, Tavern) and database tests against a real datastore. The response or the row IS the verification at this level, so the skill does not flag it as a weak check.
E2E: Cypress, Playwright, Selenium, Robot Browser. The presence of a rendered element or a page state is a real check here.

The level itself is part of the judgment: a real API or database call inside a test that claims to be a unit test is a smell (over-mocking inverted, mystery guest), and the skill says so rather than accepting the level at face value.
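An integration-level check of the kind described above can be sketched with an in-memory SQLite database. `save_order` and the `orders` table are hypothetical; what matters is that the row read back from a real datastore serves as the oracle, with no test doubles involved.

```python
import sqlite3


def save_order(conn, customer, total_cents):
    """Hypothetical function under test: persists one order row."""
    conn.execute(
        "INSERT INTO orders (customer, total_cents) VALUES (?, ?)",
        (customer, total_cents),
    )
    conn.commit()


def test_save_order_writes_row():
    # Integration level: a real (in-memory) datastore, no doubles.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, total_cents INTEGER)")
    save_order(conn, "alice", 1250)
    # The row read back from the database is the verification at this level.
    rows = conn.execute("SELECT customer, total_cents FROM orders").fetchall()
    conn.close()
    assert rows == [("alice", 1250)]
```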

Project layout

falsegreen-skill/
  SKILL.md              the skill protocol (language and LLM agnostic)
  AGENTS.md             Codex CLI context (auto-loaded from project root)
  GEMINI.md             Gemini CLI context (auto-loaded, extension contextFileName)
  llm.md                self-contained prompt context used by CLI/API examples
  reference.md          per-language case catalog and framework cues
  providers.md          multi-LLM invocation guide (API snippets)
  CREDITS.md            the research this skill builds on
  gemini-extension.json Gemini CLI extension manifest
  .gemini/              Gemini Agent Skill entry point
  .claude-plugin/       Claude Code plugin manifest + marketplace catalog
  .codex-plugin/        Codex CLI plugin manifest
  .agents/plugins/      Codex CLI marketplace catalog
  skills/
    falsegreen-llm/     shared skill entry point (Claude Code + Codex plugins)
  bin/
    falsegreen-llm.js   zero-dependency CLI (npx falsegreen-skill)
  scripts/
    validate-package.mjs validate manifests, frontmatter, and schema naming
    build-targets.mjs    generate standalone Claude/Gemini skill packages
  docs/
    cli.md              CLI usage guide
    packaging.md        target packaging and release checklist
  schema/
    finding.json        JSON Schema for a single finding
    report.json         JSON Schema for a full report
  contexts/             ready-to-use context files per platform
    claude.md           Claude Code CLI, Claude.ai, Anthropic API
    codex.md            ChatGPT, OpenAI API, structured output, batch
    gemini.md           Google AI Studio, Gemini API, long context
    cursor.md           Cursor IDE — full .cursor/rules/ MDC template
  examples/
    python/
      family_a_never_checks.py       C1, C2, C2b, C3, C4, C4b, C20, C21, CC
      family_b_weak_always_true.py   C5, C6, C6b, C7, C8, C9, C11a, C13, C13b, C14, C16, C18, C25, C34
      family_c_checks_own_setup.py   C19, C28, C29
      family_d_external_state.py     C17, C23, C24, C27, C30, C31, C32, C35
      family_e_wrong_thing.py        C33, C36, C37
      semantic_cases.py              cases 10, 11, 12, 15, 18 (LLM-only)
      diagnostic_codes.py            D1, D3, D4, D5, D6, M2 (opt-in)
    typescript/
    javascript/

Contributing

See CONTRIBUTING.md. The main contribution paths are language-specific patterns and look-alike examples in reference.md.

License: MIT, see LICENSE.

Contributors ✨

Thanks to the people who keep false-green tests out of real suites (emoji key):

New contributors are added automatically; the table also recognizes non-code work (docs, ideas, infrastructure, tests, research) via the all-contributors spec.
