@phoenixaihub/flake-finder

v0.1.0

Detect, score, and quarantine flaky tests in your CI pipeline

phoenixaihub

flaky-tests testing ci bayesian vitest jest cli

flake-finder

flake-finder tracks test results over time and uses Bayesian statistics and change-point detection to identify flaky tests: tests that fail non-deterministically without any code change.

Features

📥 Ingest JUnit XML, Jest JSON, pytest JSON, or generic JSON
📊 Score each test with a Bayesian flakiness score (0–100)
🔍 Distinguish regression from flakiness via CUSUM change-point detection
⚖️ Weight recent results more heavily via exponential decay (14-day half-life)
🚫 Quarantine flaky tests with ready-to-use config for Jest, pytest, and JUnit
🤖 CI-native commands: ingest, check (exits with code 1 when the check fails), and GitHub PR comment generation

Install

npm install -g @phoenixaihub/flake-finder
# or as a dev dependency:
npm install -D @phoenixaihub/flake-finder

Requires Node.js 18+.

Quick Start

1. Track test results

# JUnit XML (Java, Go, Python, Rust...)
flake-finder track results.xml

# Jest JSON (jest --json)
jest --json --outputFile results.json
flake-finder track results.json --format jest

# pytest (pytest-json-report)
pytest --json-report --json-report-file=results.json
flake-finder track results.json --format pytest

Run this after every CI build. Results accumulate in .flake-finder/results.db.

2. View the flakiness report

flake-finder report

# Only show tests with score > 20
flake-finder report --threshold 20

# Output as Markdown
flake-finder report --format markdown

Sample output:

🔍 Flaky Test Report

┌─────────────────────────────────────────────────┬───────┬───────────┬──────┬───────┬──────────────┐
│ Test                                            │ Score │ Fail Rate │ Runs │ Fails │ Change Point │
├─────────────────────────────────────────────────┼───────┼───────────┼──────┼───────┼──────────────┤
│ LoginPage > handles session timeout             │  72.4 │    68.0%  │  25  │   17  │ flaky        │
│ PaymentService#processRefund                    │  48.1 │    42.0%  │  12  │    5  │ ⚠ regression │
│ UserAuth > validates expired token              │  31.2 │    28.0%  │  18  │    5  │ flaky        │
│ …SearchController#testPaginationEdgeCase        │  12.7 │    11.0%  │  27  │    3  │ flaky        │
└─────────────────────────────────────────────────┴───────┴───────────┴──────┴───────┴──────────────┘

  4 flaky test(s) found
  Score: 0-100 (higher = flakier) | ⚠ = regression detected, not pure flakiness

3. Generate quarantine config

# Show all formats
flake-finder quarantine --threshold 20

# Dry run to preview
flake-finder quarantine --threshold 20 --dry-run

Output includes:

📦 Jest (--testPathIgnorePatterns):
--testPathIgnorePatterns \
  "LoginPage",
  "UserAuth"

🐍 pytest (-k exclusion):
pytest -k 'not "LoginPage > handles session timeout" and not "UserAuth > validates expired token"'

☕ JUnit (@Ignore annotations):
  @Ignore("flaky: score=72.4")
  // LoginPage > handles session timeout

4. Stats dashboard

flake-finder stats

📊 Flake-Finder Stats Dashboard

  Total tests tracked:  142
  Total test runs:      38
  Total results:        5,396
  Flaky tests:          7 / 142 (threshold: 10)
  Date range:           12/15/2023 → 1/15/2024

🔥 Top 10 Flakiest Tests:
  ███████░░░  72.4 LoginPage > handles session timeout
  ████░░░░░░  48.1 PaymentService#processRefund
  ███░░░░░░░  31.2 UserAuth > validates expired token
  ...

CI Integration

GitHub Actions

# .github/workflows/test.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci && npm test -- --json --outputFile results.json

      - name: Ingest flake results
        run: npx flake-finder ci ingest results.json
        env:
          GITHUB_SHA: ${{ github.sha }}

      - name: Check for flaky tests
        run: npx flake-finder ci check --threshold 25

      - name: Post PR comment
        if: github.event_name == 'pull_request'
        run: |
          npx flake-finder ci comment > /tmp/flake-comment.md
          gh pr comment ${{ github.event.pull_request.number }} --body-file /tmp/flake-comment.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

CircleCI

jobs:
  test:
    steps:
      - checkout
      - run: npm test -- --json --outputFile results.json
      - run: |
          npx flake-finder ci ingest results.json
          npx flake-finder ci check --threshold 25

Persisting the database across runs

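For flakiness scores to improve over time, persist the .flake-finder/ directory between CI runs and restore it before the ingest step. GitHub Actions cache entries are immutable once saved, so give each run a unique key and restore the most recent entry by prefix:

GitHub Actions:

- uses: actions/cache@v4
  with:
    path: .flake-finder
    key: flake-finder-${{ runner.os }}-${{ github.run_id }}
    restore-keys: |
      flake-finder-${{ runner.os }}-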

How It Works

Bayesian Flakiness Score

Each test gets a Beta distribution as its failure rate posterior:

Prior: Beta(1, 1), the uniform (uninformative) prior, which assumes nothing about the failure rate
For each test result: add weight to α (failure) or β (pass)
Results are exponentially decayed by age (14-day half-life by default)
Score = E[failure_rate] × confidence_weight × 100

This means:

A test that fails once in 100 runs scores ~1–2
A test that fails 5 times in 10 runs scores ~40–60
A test that always fails scores close to 100 (high confidence)
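
The scoring steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not flake-finder's actual implementation: the README does not define confidence_weight, so the n / (n + 2) form below is a hypothetical choice that happens to reproduce the example ranges listed above, and the names Result and flakiness_score are made up for this sketch.

```python
from dataclasses import dataclass

HALF_LIFE_DAYS = 14.0  # default half-life stated in the README


@dataclass
class Result:
    failed: bool
    age_days: float  # how long ago the result was recorded


def flakiness_score(results, half_life_days=HALF_LIFE_DAYS):
    """Return a 0-100 score from a decayed Beta posterior (illustrative only)."""
    alpha, beta = 1.0, 1.0  # Beta(1, 1) prior
    for r in results:
        w = 2.0 ** (-r.age_days / half_life_days)  # exponential decay by age
        if r.failed:
            alpha += w  # failures add weight to alpha
        else:
            beta += w  # passes add weight to beta
    expected_failure_rate = alpha / (alpha + beta)  # mean of Beta(alpha, beta)
    n_eff = alpha + beta - 2.0  # decayed evidence, excluding the prior
    # ASSUMPTION: the README does not specify this formula.
    confidence_weight = n_eff / (n_eff + 2.0)
    return expected_failure_rate * confidence_weight * 100.0


def fresh(fails, passes):
    """Build a list of results that were all recorded today (age 0)."""
    return [Result(True, 0.0)] * fails + [Result(False, 0.0)] * passes


print(round(flakiness_score(fresh(1, 99)), 1))   # 1.9
print(round(flakiness_score(fresh(5, 5)), 1))    # 41.7
print(round(flakiness_score(fresh(100, 0)), 1))  # 97.1
```

With no results at all, the assumed confidence weight is 0, so an untested case scores 0 rather than the prior mean of 50.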

Change-Point Detection (CUSUM)

flake-finder uses Page's CUSUM algorithm to tell whether a test has shifted from passing to failing (a regression) or is flipping randomly (flakiness):

Encodes pass=0, fail=1
Accumulates deviations from baseline failure rate
If cumulative sum exceeds threshold → change point detected
Tests with change points are flagged ⚠ regression in reports
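
A minimal one-sided CUSUM in Python shows how the steps above separate the two cases. The README does not state the baseline, slack, or threshold values, so the defaults here (overall failure rate as baseline, slack 0.1, threshold 3.0) are assumptions chosen for illustration, and cusum_change_point is a hypothetical name rather than part of the tool.

```python
def cusum_change_point(outcomes, baseline=None, slack=0.1, threshold=3.0):
    """Return the index where the one-sided CUSUM first exceeds threshold, else None.

    outcomes: chronological list of 0 (pass) / 1 (fail).
    """
    if not outcomes:
        return None
    if baseline is None:
        # ASSUMPTION: use the overall failure rate as the baseline.
        baseline = sum(outcomes) / len(outcomes)
    s = 0.0
    for i, x in enumerate(outcomes):
        # Accumulate deviations above baseline; never let the sum go negative.
        s = max(0.0, s + (x - baseline - slack))
        if s > threshold:
            return i
    return None


regression = [0] * 20 + [1] * 10  # passes, then fails consistently
flaky = [0, 0, 1] * 10            # same failure rate, but scattered

print(cusum_change_point(regression))  # 25
print(cusum_change_point(flaky))       # None
```

Both sequences fail 10 times out of 30, so failure rate alone cannot tell them apart. Only the sustained run of failures pushes the cumulative sum past the threshold.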

Exponential Decay

Results decay with a half-life of 14 days:

weight = 2^(-(age_days / 14))

A result recorded 14 days ago contributes half as much signal as one recorded today.
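
To get a feel for the formula, the short Python snippet below evaluates it at a few ages. The function name decay_weight is hypothetical; only the formula and the 14-day default come from the text above.

```python
def decay_weight(age_days, half_life_days=14.0):
    """Weight of a result that is age_days old, per the formula above."""
    return 2.0 ** (-age_days / half_life_days)


for age in (0, 7, 14, 28):
    print(age, round(decay_weight(age), 3))
# 0 1.0
# 7 0.707
# 14 0.5
# 28 0.25
```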

Configuration

All commands accept --db <path> to override the default .flake-finder/results.db location.

Environment variables respected during ci ingest:

GITHUB_SHA or GIT_COMMIT — auto-attached as commit SHA
GITHUB_RUN_ID or CI_RUN_ID — auto-attached as run ID
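
The fallback behavior can be pictured with the Python sketch below. It assumes the first variable listed takes precedence when both are set, which the README does not state explicitly; first_env and ci_metadata are hypothetical helper names used only for this illustration.

```python
import os


def first_env(*names, env=None):
    """Return the first non-empty value among the named variables, else None."""
    env = os.environ if env is None else env
    for name in names:
        value = env.get(name)
        if value:
            return value
    return None


def ci_metadata(env=None):
    """Collect commit and run identifiers the way the list above describes."""
    return {
        # ASSUMPTION: the first name listed wins when both are set.
        "commit_sha": first_env("GITHUB_SHA", "GIT_COMMIT", env=env),
        "run_id": first_env("GITHUB_RUN_ID", "CI_RUN_ID", env=env),
    }


print(ci_metadata({"GIT_COMMIT": "abc123", "CI_RUN_ID": "42"}))
# {'commit_sha': 'abc123', 'run_id': '42'}
```

On CI systems other than GitHub Actions, this suggests exporting GIT_COMMIT and CI_RUN_ID from the provider's own variables before running ci ingest.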

Contributing

See CONTRIBUTING.md.