# git-forensics

> Uncover architectural secrets hidden in your git history
A TypeScript library that extracts actionable insights from git commit history.

## Features

- Actionable insights
- Fast: ~700ms for 100,000 commits (fetching the git log itself is the slow part)
- Follows file renames and removals
- Optimized for CI
- Percentile-based classification — self-calibrating thresholds that work across any codebase size
- Composite risk scoring — weighted multi-metric risk scores per file
- Integrated (very basic) code complexity engine
- Bring your own code complexity score
- Add custom metrics using the full temporal history
## Motivation
Existing git analysis tools (code-maat, git-of-theseus, Hercules, etc.) are great for reports but feel heavy as a backend for dev-tools. This library is designed to be lightweight, fast, and embeddable.
> Tip: Focus on recent history (6-9 months). While the library handles renames and long histories correctly, older data tends to add noise.
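If you pre-fetch log data yourself (see the Data-Driven API section), one way to follow this tip is to trim the commit list by date before analysis. The helper below is an illustrative sketch, not part of the library:

```ts
// Illustrative helper (not a git-forensics export): keep only commits
// from the last `months` months, given an ISO date string on each commit.
function recentCommits<T extends { date: string }>(
  commits: T[],
  months = 9,
  now: Date = new Date()
): T[] {
  const cutoff = new Date(now);
  cutoff.setMonth(cutoff.getMonth() - months); // JS Date rolls negative months into the prior year
  return commits.filter((c) => new Date(c.date) >= cutoff);
}

// Keeps only the 2025-01-01 commit when "now" is 2025-03-01:
recentCommits([{ date: '2025-01-01' }, { date: '2023-01-01' }], 9, new Date('2025-03-01'));
```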
## Installation

```bash
npm install git-forensics
```

## Quick Start
```ts
import { simpleGit } from 'simple-git';
import { computeForensics } from 'git-forensics';

const git = simpleGit('/path/to/repo');
const forensics = await computeForensics(git);

forensics.hotspots; // Files changed most often
forensics.churn; // Code volatility (lines added/deleted)
forensics.coupledPairs; // Hidden dependencies
forensics.couplingRankings; // Architectural hubs
forensics.codeAge; // Stale code detection
forensics.ownership; // Knowledge silos
forensics.communication; // Developer coordination needs
forensics.topContributors; // Per-file contributor breakdown
```

## Example Output
Running computeForensics on a repository returns structured data across all metrics:
```jsonc
{
  "analyzedCommits": 842,
  "dateRange": { "from": "2024-03-10", "to": "2025-01-15" },
  "metadata": { "totalFilesAnalyzed": 134, "totalAuthors": 12 },
  "hotspots": [
    { "file": "src/api/routes.ts", "revisions": 87, "exists": true },
    { "file": "src/core/engine.ts", "revisions": 64, "exists": true },
  ],
  "coupledPairs": [
    {
      "file1": "src/api/routes.ts",
      "file2": "src/api/middleware.ts",
      "couplingPercent": 82,
      "coChanges": 34,
    },
  ],
  "ownership": [
    {
      "file": "src/core/engine.ts",
      "mainDev": "alice",
      "ownershipPercent": 34,
      "fractalValue": 0.18,
      "authorCount": 7,
    },
  ],
  // ... plus churn, codeAge, couplingRankings, communication, topContributors
}
```

Passing the result to generateInsights produces actionable alerts:
```jsonc
[
  {
    "file": "src/core/engine.ts",
    "type": "hotspot",
    "severity": "critical",
    "data": {
      "type": "hotspot",
      "revisions": 64,
      "rank": 2,
      "percentile": 95,
    },
    "fragments": {
      "title": "Hotspot",
      "finding": "64 revisions (P95), ranked #2 in repository",
      "risk": "Top-ranked churn file — prioritize for refactoring or test hardening",
      "suggestion": "Consider breaking into smaller modules or adding test coverage",
    },
  },
  {
    "file": "src/core/engine.ts",
    "type": "ownership-risk",
    "severity": "critical",
    "data": {
      "type": "ownership-risk",
      "fractalValue": 0.18,
      "authorCount": 7,
      "mainDev": "alice",
      "percentile": 92,
    },
    "fragments": {
      "title": "Fragmented Ownership",
      "finding": "7 contributors, fragmentation score 0.18 (P92)",
      "risk": "Diffuse ownership slows review cycles and increases merge conflicts",
      "suggestion": "Request review from alice (primary contributor)",
    },
  },
  // ... insights generated for each metric that exceeds thresholds
]
```

## Actionable Insights
generateInsights transforms metrics into alerts with severity (warning, critical) and human-readable fragments (title, finding, risk, suggestion).
Insights use percentile-based thresholds — a file is flagged based on where it ranks relative to other files in the same repository. This makes thresholds self-calibrating across codebases of any size.
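Conceptually, the classification step reduces to comparing a file's percentile rank against two cutoffs. The sketch below is illustrative only, not the library's internals; the default cutoffs mirror the P75/P90 values described in this section:

```ts
// Illustrative only — not the library's implementation. Maps a percentile
// rank to a severity using warning/critical cutoffs.
type Severity = 'none' | 'warning' | 'critical';

function classify(percentile: number, warning = 75, critical = 90): Severity {
  if (percentile >= critical) return 'critical';
  if (percentile >= warning) return 'warning';
  return 'none';
}

classify(95); // 'critical'
classify(80); // 'warning'
classify(50); // 'none'
```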
### Insight thresholds
| Question | Metric | Insight triggers when |
| ----------------------------------- | ------------------ | ---------------------------------------------- |
| Where's the riskiest code? | hotspots | Revisions in P75+ (warning) or P90+ (critical) |
| What keeps getting rewritten? | churn | Churn in P75+ or P90+ |
| What hidden dependencies exist? | coupledPairs | ≥70% co-change rate (absolute, not percentile) |
| What has ripple effects? | couplingRankings | Coupling score in P75+ or P90+ |
| What's been forgotten? | codeAge | Age in P75+ or P90+ |
| Who owns what? Any knowledge silos? | ownership | ≥3 authors, fragmentation in P75+ or P90+ |
All thresholds are overridable — pass a partial thresholds object and only the values you specify will change:
```ts
const insights = generateInsights(forensics, {
  thresholds: {
    hotspot: { warning: 80, critical: 95 }, // percentile cutoffs
    churn: { warning: 80 },
    staleCode: { warning: 60, critical: 85 },
    coupling: { minPercent: 80 }, // stays absolute — not percentile-based
    ownershipRisk: { warning: 70, critical: 90, minAuthors: 4 },
    couplingScore: { warning: 80, critical: 95 },
  },
});
```

### Analysis options
The analysis pipeline has its own configurable thresholds that control what data is collected:
```ts
const forensics = await computeForensics(git, {
  maxFilesPerCommit: 50, // exclude large commits from coupling analysis (default: 50)
  minCoChanges: 3, // minimum co-changes to report a coupled pair (default: 3)
  minCouplingPercent: 30, // minimum coupling % to report a pair (default: 30)
  minSharedEntities: 2, // minimum shared files for communication pairs (default: 2)
});
```

These options are also available on computeForensicsFromData().
### Build your own insights
forensics.stats contains the complete temporal history—every commit, by every author, for every file. Access stats.fileStats[file].byAuthor, authorContributions, nameHistory, etc. to build custom metrics like temporal histograms, expertise scores, or handoff detection.
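As a sketch of what a custom metric might look like, here is a toy "bus factor": the number of authors needed to cover half of a file's commits. The byAuthor shape assumed below (author name mapped to commit count) is an assumption for illustration; check the library's actual types:

```ts
// Hypothetical custom metric. Assumes byAuthor maps author -> commit count;
// the library's actual stats shapes may differ.
type FileAuthorStats = { byAuthor: Record<string, number> };

function busFactor(stats: FileAuthorStats): number {
  const counts = Object.values(stats.byAuthor).sort((a, b) => b - a);
  const total = counts.reduce((sum, n) => sum + n, 0);
  let covered = 0;
  let authors = 0;
  for (const n of counts) {
    covered += n;
    authors += 1;
    if (covered >= total / 2) break;
  }
  return authors;
}

busFactor({ byAuthor: { alice: 10, bob: 3, carol: 1 } }); // 1 — alice alone covers >50%
```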
## Composite Risk Score
computeRiskScores produces a single 0-100 risk score per file by combining percentile ranks across all metrics with configurable weights:
```ts
import { computeRiskScores } from 'git-forensics';

const scores = computeRiskScores(forensics);
// [
//   { file: 'src/core/engine.ts', riskScore: 87.5, breakdown: { revisions: 22.5, churn: 25, ownershipRisk: 18, age: 12, couplingScore: 10 } },
//   { file: 'src/api/routes.ts', riskScore: 72.0, breakdown: { ... } },
//   ...
// ]
```

Default weights:
| Metric         | Weight |
| -------------- | ------ |
| Revisions      | 0.25   |
| Churn          | 0.25   |
| Ownership Risk | 0.20   |
| Age            | 0.15   |
| Coupling Score | 0.15   |
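Judging from the breakdown values in the example above, the score behaves like a weighted sum of per-metric percentile ranks (e.g. 0.25 × P90 revisions = 22.5). A minimal sketch of that idea, illustrative only and not the library's actual implementation:

```ts
// Illustrative sketch of a weight × percentile composite; not the actual
// computeRiskScores implementation.
const weights: Record<string, number> = {
  revisions: 0.25,
  churn: 0.25,
  ownershipRisk: 0.2,
  age: 0.15,
  couplingScore: 0.15,
};

function compositeRisk(percentiles: Record<string, number>): number {
  let score = 0;
  for (const [metric, weight] of Object.entries(weights)) {
    score += weight * (percentiles[metric] ?? 0); // missing metrics count as P0
  }
  return Math.round(score * 10) / 10; // one decimal place
}

compositeRisk({ revisions: 90, churn: 100, ownershipRisk: 90, age: 80, couplingScore: 50 }); // 85
```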
Override weights to match your priorities:
```ts
const scores = computeRiskScores(forensics, {
  revisions: 0.4,
  churn: 0.3,
  ownershipRisk: 0.1,
  age: 0.1,
  couplingScore: 0.1,
});
```

## File Metrics with Percentiles
extractFileMetrics flattens forensics into per-file rows for storage. Pass includePercentiles: true to enrich each row with percentile ranks and a composite risk score:
```ts
import { extractFileMetrics } from 'git-forensics';

const metrics = extractFileMetrics(forensics, { includePercentiles: true });
// Each entry includes:
// {
//   file, revisions, ageMonths, churn, fractalValue, ...
//   percentiles: { revisions: 90, churn: 75, ownershipRisk: 85, ageMonths: 60, couplingScore: 40 },
//   riskScore: 72.5,
// }
```

## Percentile Utilities
The underlying percentile functions are exported for building custom scoring:
```ts
import {
  percentileRank,
  createPercentileRanker,
  createInvertedPercentileRanker,
} from 'git-forensics';

// One-off calculation
percentileRank(50, [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]); // 45

// Reusable ranker for repeated lookups
const rank = createPercentileRanker([10, 20, 30, 40, 50]);
rank(30); // 50
rank(50); // 90

// Inverted ranker (lower values = higher percentile)
const riskRank = createInvertedPercentileRanker([0.1, 0.3, 0.5, 0.7, 0.9]);
riskRank(0.1); // 90 (lowest value = highest risk)
```

## Complexity Analysis
git-forensics separates commit analysis from static code analysis. It provides optional complexity helpers for convenience (using indent-complexity).
We recommend using a language-aware complexity scorer and passing its results to computeForensics.
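To give a feel for what indentation-based scoring measures, here is a toy version — a rough illustration only, not how the indent-complexity package actually scores code:

```ts
// Toy indentation complexity: average leading-whitespace depth across
// non-empty lines. Illustrative only.
function indentScore(source: string, tabWidth = 2): number {
  const lines = source.split('\n').filter((line) => line.trim().length > 0);
  if (lines.length === 0) return 0;
  const depths = lines.map((line) => {
    const ws = line.match(/^[ \t]*/)?.[0] ?? '';
    return ws.replace(/\t/g, ' '.repeat(tabWidth)).length / tabWidth;
  });
  return depths.reduce((sum, d) => sum + d, 0) / lines.length;
}

indentScore('if (a) {\n  if (b) {\n    go();\n  }\n}'); // 0.8 — shallow nesting
```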
## CI Usage

### Building a report
Loop over insights and build a PR comment or CI annotation:
```ts
const insights = generateInsights(forensics, { minSeverity: 'warning' });

for (const insight of insights) {
  const prefix = insight.severity === 'critical' ? '[CRITICAL]' : '[WARNING]';
  console.log(`${prefix} ${insight.file} - ${insight.fragments.title}`);
  console.log(`  ${insight.fragments.finding}`);
  console.log(`  ${insight.fragments.suggestion}\n`);
}
```

### Optimization: Store & Reuse (large codebases)
For very large repos, store the computeForensics result between runs and rehydrate with generateInsights — no git scan needed:
```ts
import { generateInsights, getChangedFiles } from 'git-forensics';

// Fetch pre-computed forensics from your server/cache
const forensics = await fetch('https://your-server/api/forensics?repo=org/repo').then((r) =>
  r.json()
);

// Generate insights only for files changed in the PR
const changedFiles = await getChangedFiles(git, 'origin/main');
const insights = generateInsights(forensics, { files: changedFiles, minSeverity: 'warning' });
```

## Data-Driven API
For environments without direct git access, use computeForensicsFromData() with pre-fetched git data:
```ts
import { computeForensicsFromData, gitLogDataSchema, validateGitLogData } from 'git-forensics';

// Data must match the following format
const data = {
  log: {
    all: [
      {
        hash: 'abc123',
        date: '2025-01-15T10:00:00Z',
        author_name: 'Alice',
        message: 'Add feature',
        diff: {
          files: [
            { file: 'src/app.ts', insertions: 50, deletions: 10 },
            { file: 'src/utils.ts', insertions: 20, deletions: 5 },
          ],
        },
      },
      // ... more commits
    ],
  },
  trackedFiles: 'src/app.ts\nsrc/utils.ts\nsrc/index.ts', // from git ls-files
};

// Print the JSON Schema if needed
console.log(gitLogDataSchema); // JSON Schema object

// Validate before processing
validateGitLogData(data); // throws if invalid

const forensics = computeForensicsFromData(data);
```

## Migration from v1.x
v2.0.0 replaces absolute thresholds with percentile-based classification. Key changes:
- `InsightThresholds` values are now percentile cutoffs (0-100), not raw metric values
- `InsightData` variants (except `coupling`) include a `percentile` field
- Stale-code severity changed from `info`/`warning` to `warning`/`critical`
- Finding strings now include `(Pxx)` percentile annotations
- Generator function signatures added a `percentileRank` parameter (affects direct generator importers)
- New exports: `computeRiskScores`, `DEFAULT_RISK_WEIGHTS`, `percentileRank`, `createPercentileRanker`, `createInvertedPercentileRanker`
- New types: `PercentileThresholds`, `RiskWeights`, `FileRiskScore`, `ExtractFileMetricsOptions`
## Attribution

Based on concepts from Adam Tornhill's *Your Code as a Crime Scene* and *Software Design X-Rays*.

## License

MIT
