# git-forensics

> Uncover architectural secrets hidden in your git history
A TypeScript library that extracts actionable insights from git commit history.

## Features

- Actionable insights
- Fast: ~700ms for 100,000 commits (fetching the git log itself is the slow part)
- Follows file renames and removals
- Optimized for CI
- Percentile-based classification — self-calibrating thresholds that work across any codebase size
- Composite risk scoring — weighted multi-metric risk scores per file
- Integrated (very basic) code complexity engine
- Bring your own code complexity score
- Add custom metrics using the full temporal history
## Motivation
Existing git analysis tools (code-maat, git-of-theseus, Hercules, etc.) are great for reports but feel heavy as a backend for dev-tools. This library is designed to be lightweight, fast, and embeddable.
> Tip: Focus on recent history (6-9 months). While the library handles renames and long histories correctly, older data tends to add noise.
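If you pre-fetch log data yourself (see the Data-Driven API section), one way to follow this tip is to trim the commit list by date before analysis. The helper below is an illustrative sketch, not part of the library:

```ts
// Illustrative helper (not a git-forensics export): keep only commits
// from the last `months` months, given an ISO date string on each commit.
function recentCommits<T extends { date: string }>(
  commits: T[],
  months = 9,
  now: Date = new Date()
): T[] {
  const cutoff = new Date(now);
  cutoff.setMonth(cutoff.getMonth() - months); // JS Date rolls negative months into the prior year
  return commits.filter((c) => new Date(c.date) >= cutoff);
}

// Keeps only the 2025-01-01 commit when "now" is 2025-03-01:
recentCommits([{ date: '2025-01-01' }, { date: '2023-01-01' }], 9, new Date('2025-03-01'));
```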
## Installation

```bash
npm install git-forensics
```

## Quick Start
```ts
import { simpleGit } from 'simple-git';
import { computeForensics } from 'git-forensics';

const git = simpleGit('/path/to/repo');
const forensics = await computeForensics(git);

forensics.hotspots; // Files changed most often
forensics.churn; // Code volatility (lines added/deleted)
forensics.coupledPairs; // Hidden dependencies
forensics.couplingRankings; // Architectural hubs
forensics.codeAge; // Stale code detection
forensics.ownership; // Knowledge silos
forensics.communication; // Developer coordination needs
forensics.topContributors; // Per-file contributor breakdown
```

## Example Output
Running computeForensics on a repository returns structured data across all metrics:
```jsonc
{
  "analyzedCommits": 842,
  "dateRange": { "from": "2024-03-10", "to": "2025-01-15" },
  "metadata": { "totalFilesAnalyzed": 134, "totalAuthors": 12 },
  "hotspots": [
    { "file": "src/api/routes.ts", "revisions": 87, "exists": true },
    { "file": "src/core/engine.ts", "revisions": 64, "exists": true },
  ],
  "coupledPairs": [
    {
      "file1": "src/api/routes.ts",
      "file2": "src/api/middleware.ts",
      "couplingPercent": 82,
      "coChanges": 34,
    },
  ],
  "ownership": [
    {
      "file": "src/core/engine.ts",
      "mainDev": "alice",
      "ownershipPercent": 34,
      "fractalValue": 0.18,
      "authorCount": 7,
    },
  ],
  // ... plus churn, codeAge, couplingRankings, communication, topContributors
}
```

Passing the result to generateInsights produces actionable alerts:
```jsonc
[
  {
    "file": "src/core/engine.ts",
    "type": "hotspot",
    "severity": "critical",
    "data": {
      "type": "hotspot",
      "revisions": 64,
      "rank": 2,
      "percentile": 95,
    },
    "fragments": {
      "title": "Hotspot",
      "finding": "64 revisions (P95), ranked #2 in repository",
      "risk": "Top-ranked churn file — prioritize for refactoring or test hardening",
      "suggestion": "Consider breaking into smaller modules or adding test coverage",
    },
  },
  {
    "file": "src/core/engine.ts",
    "type": "ownership-risk",
    "severity": "critical",
    "data": {
      "type": "ownership-risk",
      "fractalValue": 0.18,
      "authorCount": 7,
      "mainDev": "alice",
      "percentile": 92,
    },
    "fragments": {
      "title": "Fragmented Ownership",
      "finding": "7 contributors, fragmentation score 0.18 (P92)",
      "risk": "Diffuse ownership slows review cycles and increases merge conflicts",
      "suggestion": "Request review from alice (primary contributor)",
    },
  },
  // ... insights generated for each metric that exceeds thresholds
]
```

## Actionable Insights
generateInsights transforms metrics into alerts with severity (warning, critical) and human-readable fragments (title, finding, risk, suggestion).
Insights use percentile-based thresholds — a file is flagged based on where it ranks relative to other files in the same repository. This makes thresholds self-calibrating across codebases of any size.
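Conceptually, the classification step reduces to comparing a file's percentile rank against two cutoffs. The sketch below is illustrative only, not the library's internals; the default cutoffs mirror the P75/P90 values described in this section:

```ts
// Illustrative only — not the library's implementation. Maps a percentile
// rank to a severity using warning/critical cutoffs.
type Severity = 'none' | 'warning' | 'critical';

function classify(percentile: number, warning = 75, critical = 90): Severity {
  if (percentile >= critical) return 'critical';
  if (percentile >= warning) return 'warning';
  return 'none';
}

classify(95); // 'critical'
classify(80); // 'warning'
classify(50); // 'none'
```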
### Insight thresholds
| Question | Metric | Insight triggers when |
| ----------------------------------- | ------------------ | ---------------------------------------------- |
| Where's the riskiest code? | hotspots | Revisions in P75+ (warning) or P90+ (critical) |
| What keeps getting rewritten? | churn | Churn in P75+ or P90+ |
| What hidden dependencies exist? | coupledPairs | ≥70% co-change rate (absolute, not percentile) |
| What has ripple effects? | couplingRankings | Coupling score in P75+ or P90+ |
| What's been forgotten? | codeAge | Age in P75+ or P90+ |
| Who owns what? Any knowledge silos? | ownership | ≥3 authors, fragmentation in P75+ or P90+ |
All thresholds are overridable — pass a partial thresholds object and only the values you specify will change:
```ts
const insights = generateInsights(forensics, {
  thresholds: {
    hotspot: { warning: 80, critical: 95 }, // percentile cutoffs
    churn: { warning: 80 },
    staleCode: { warning: 60, critical: 85 },
    coupling: { minPercent: 80 }, // stays absolute — not percentile-based
    ownershipRisk: { warning: 70, critical: 90, minAuthors: 4 },
    couplingScore: { warning: 80, critical: 95 },
  },
});
```

### Analysis options
The analysis pipeline has its own configurable thresholds that control what data is collected:
```ts
const forensics = await computeForensics(git, {
  maxFilesPerCommit: 50, // exclude large commits from coupling analysis (default: 50)
  minCoChanges: 3, // minimum co-changes to report a coupled pair (default: 3)
  minCouplingPercent: 30, // minimum coupling % to report a pair (default: 30)
  minSharedEntities: 2, // minimum shared files for communication pairs (default: 2)
});
```

These options are also available on computeForensicsFromData().
### Build your own insights
forensics.stats contains the complete temporal history—every commit, by every author, for every file. Access stats.fileStats[file].byAuthor, authorContributions, nameHistory, etc. to build custom metrics like temporal histograms, expertise scores, or handoff detection.
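As a sketch of what a custom metric might look like, here is a toy "bus factor": the number of authors needed to cover half of a file's commits. The byAuthor shape assumed below (author name mapped to commit count) is an assumption for illustration; check the library's actual types:

```ts
// Hypothetical custom metric. Assumes byAuthor maps author -> commit count;
// the library's actual stats shapes may differ.
type FileAuthorStats = { byAuthor: Record<string, number> };

function busFactor(stats: FileAuthorStats): number {
  const counts = Object.values(stats.byAuthor).sort((a, b) => b - a);
  const total = counts.reduce((sum, n) => sum + n, 0);
  let covered = 0;
  let authors = 0;
  for (const n of counts) {
    covered += n;
    authors += 1;
    if (covered >= total / 2) break;
  }
  return authors;
}

busFactor({ byAuthor: { alice: 10, bob: 3, carol: 1 } }); // 1 — alice alone covers >50%
```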
## Composite Risk Score
computeRiskScores produces a single 0-100 risk score per file by combining percentile ranks across all metrics with configurable weights:
```ts
import { computeRiskScores } from 'git-forensics';

const scores = computeRiskScores(forensics);
// [
//   { file: 'src/core/engine.ts', riskScore: 87.5, breakdown: { revisions: 22.5, churn: 25, ownershipRisk: 18, age: 12, couplingScore: 10 } },
//   { file: 'src/api/routes.ts', riskScore: 72.0, breakdown: { ... } },
//   ...
// ]
```

Default weights:
| Metric         | Weight |
| -------------- | ------ |
| Revisions      | 0.25   |
| Churn          | 0.25   |
| Ownership Risk | 0.20   |
| Age            | 0.15   |
| Coupling Score | 0.15   |
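Judging from the breakdown values in the example above, the score behaves like a weighted sum of per-metric percentile ranks (e.g. 0.25 × P90 revisions = 22.5). A minimal sketch of that idea, illustrative only and not the library's actual implementation:

```ts
// Illustrative sketch of a weight × percentile composite; not the actual
// computeRiskScores implementation.
const weights: Record<string, number> = {
  revisions: 0.25,
  churn: 0.25,
  ownershipRisk: 0.2,
  age: 0.15,
  couplingScore: 0.15,
};

function compositeRisk(percentiles: Record<string, number>): number {
  let score = 0;
  for (const [metric, weight] of Object.entries(weights)) {
    score += weight * (percentiles[metric] ?? 0); // missing metrics count as P0
  }
  return Math.round(score * 10) / 10; // one decimal place
}

compositeRisk({ revisions: 90, churn: 100, ownershipRisk: 90, age: 80, couplingScore: 50 }); // 85
```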
Override weights to match your priorities:
```ts
const scores = computeRiskScores(forensics, {
  revisions: 0.4,
  churn: 0.3,
  ownershipRisk: 0.1,
  age: 0.1,
  couplingScore: 0.1,
});
```

## File Metrics with Percentiles
extractFileMetrics flattens forensics into per-file rows for storage. Pass includePercentiles: true to enrich each row with percentile ranks and a composite risk score:
```ts
import { extractFileMetrics } from 'git-forensics';

const metrics = extractFileMetrics(forensics, { includePercentiles: true });
// Each entry includes:
// {
//   file, revisions, ageMonths, churn, fractalValue, ...
//   percentiles: { revisions: 90, churn: 75, ownershipRisk: 85, ageMonths: 60, couplingScore: 40 },
//   riskScore: 72.5,
// }
```

## Percentile Utilities
The underlying percentile functions are exported for building custom scoring:
```ts
import {
  percentileRank,
  createPercentileRanker,
  createInvertedPercentileRanker,
} from 'git-forensics';

// One-off calculation
percentileRank(50, [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]); // 45

// Reusable ranker for repeated lookups
const rank = createPercentileRanker([10, 20, 30, 40, 50]);
rank(30); // 50
rank(50); // 90

// Inverted ranker (lower values = higher percentile)
const riskRank = createInvertedPercentileRanker([0.1, 0.3, 0.5, 0.7, 0.9]);
riskRank(0.1); // 90 (lowest value = highest risk)
```

## Complexity Analysis
git-forensics separates commit analysis from static code analysis. It provides optional complexity helpers for convenience (using indent-complexity).
We recommend using a language-aware complexity scorer and passing its results to computeForensics.
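To give a feel for what indentation-based scoring measures, here is a toy version — a rough illustration only, not how the indent-complexity package actually scores code:

```ts
// Toy indentation complexity: average leading-whitespace depth across
// non-empty lines. Illustrative only.
function indentScore(source: string, tabWidth = 2): number {
  const lines = source.split('\n').filter((line) => line.trim().length > 0);
  if (lines.length === 0) return 0;
  const depths = lines.map((line) => {
    const ws = line.match(/^[ \t]*/)?.[0] ?? '';
    return ws.replace(/\t/g, ' '.repeat(tabWidth)).length / tabWidth;
  });
  return depths.reduce((sum, d) => sum + d, 0) / lines.length;
}

indentScore('if (a) {\n  if (b) {\n    go();\n  }\n}'); // 0.8 — shallow nesting
```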
## CI Usage

### Building a report
Loop over insights and build a PR comment or CI annotation:
```ts
const insights = generateInsights(forensics, { minSeverity: 'warning' });

for (const insight of insights) {
  const prefix = insight.severity === 'critical' ? '[CRITICAL]' : '[WARNING]';
  console.log(`${prefix} ${insight.file} - ${insight.fragments.title}`);
  console.log(`  ${insight.fragments.finding}`);
  console.log(`  ${insight.fragments.suggestion}\n`);
}
```

### Optimization: Store & Reuse (large codebases)
For very large repos, store the computeForensics result between runs and rehydrate with generateInsights — no git scan needed:
```ts
import { generateInsights, getChangedFiles } from 'git-forensics';

// Fetch pre-computed forensics from your server/cache
const forensics = await fetch('https://your-server/api/forensics?repo=org/repo').then((r) =>
  r.json()
);

// Generate insights only for files changed in the PR
const changedFiles = await getChangedFiles(git, 'origin/main');
const insights = generateInsights(forensics, { files: changedFiles, minSeverity: 'warning' });
```

## Data-Driven API
For environments without direct git access, use computeForensicsFromData() with pre-fetched git data:
```ts
import { computeForensicsFromData, gitLogDataSchema, validateGitLogData } from 'git-forensics';

// Data must match the following format
const data = {
  log: {
    all: [
      {
        hash: 'abc123',
        date: '2025-01-15T10:00:00Z',
        author_name: 'Alice',
        message: 'Add feature',
        diff: {
          files: [
            { file: 'src/app.ts', insertions: 50, deletions: 10 },
            { file: 'src/utils.ts', insertions: 20, deletions: 5 },
          ],
        },
      },
      // ... more commits
    ],
  },
  trackedFiles: 'src/app.ts\nsrc/utils.ts\nsrc/index.ts', // from git ls-files
};

// Print the JSON Schema if needed
console.log(gitLogDataSchema); // JSON Schema object

// Validate before processing
validateGitLogData(data); // throws if invalid

const forensics = computeForensicsFromData(data);
```

## Migration from v1.x
v2.0.0 replaces absolute thresholds with percentile-based classification. Key changes:
- `InsightThresholds` values are now percentile cutoffs (0-100), not raw metric values
- `InsightData` variants (except `coupling`) include a `percentile` field
- Stale-code severity changed from `info`/`warning` to `warning`/`critical`
- Finding strings now include `(Pxx)` percentile annotations
- Generator function signatures added a `percentileRank` parameter (affects direct generator importers)
- New exports: `computeRiskScores`, `DEFAULT_RISK_WEIGHTS`, `percentileRank`, `createPercentileRanker`, `createInvertedPercentileRanker`
- New types: `PercentileThresholds`, `RiskWeights`, `FileRiskScore`, `ExtractFileMetricsOptions`
## Attribution

Based on concepts from Adam Tornhill's *Your Code as a Crime Scene* and *Software Design X-Rays*.

## License

MIT
