dry-ts
v0.13.0
Find candidate duplicate TypeScript code by comparing normalized AST structure.
dry-ts finds candidate duplicate TypeScript code across files and directories. It reports fuzzy structural matches as clusters of related filename and line ranges so another mechanism — a CI gate, an AI agent, a human reviewer — can evaluate and reduce duplication.
It catches Type-2/Type-3 clones — same shape, renamed identifiers, reordered or slightly varied statements — that token and line matchers miss. That is exactly the class an LLM produces when it reimplements existing structure, so dry-ts is built with agents and PR gates as the primary consumers. See How it works for the engine.
Quickstart
# 1. Scan a source tree for candidate duplicates
bunx dry-ts src
# 2. PR gate — fail only if THIS change adds duplication (the recommended CI workflow)
bunx dry-ts --profile pr --changed-from origin/main src
# 3. Agent loop — after edits, emit only NEW duplication as JSON, each finding
# routed to its nearest existing match
bunx dry-ts --profile agent --changed-from HEAD src
# 4. SARIF for GitHub code scanning (inline PR annotations)
bunx dry-ts --sarif --changed-from origin/main src > dry-ts.sarif
--profile pr and --profile agent are curated presets (see Profiles). Both require a changed-scope flag and fail loud without one, so they never gate against the wrong base.
- Config: persisting flags in a committed config file (.dry-ts.json) so a repo sets policy once is planned, not shipped. Until then, encode policy in a profile + a short flag list.
- Suppress an intentional repetition: annotate it with // dry-ignore, or exclude a whole path with --exclude.
- Full flag reference: Usage. Cutting noise: Curating results.
What it is: a TypeScript-first structural duplicate-candidate detector, built for PR gates and AI/agent consumers.
What it is not: a general-purpose, multi-language copy/paste detector. For broad Type-1 token cloning across many languages with mature CI reporters, reach for jscpd or PMD CPD. dry-ts reports structural similarity candidates, not confirmed semantic duplication — keep that framing when you triage. The point is to catch accidental reimplementation before it lands, not to "dedupe everything."
Maturity: young package; pin the version (dry-ts@x.y.z). It has no install hooks and two runtime dependencies (ignore, typescript); git is spawned only for --changed-from.
Output stability
dry-ts is stateless — no baseline file — so findings are a pure function of the input and the normalization rules. Those rules (src/TypeScriptNormalizer.ts, src/NormalizedNode.ts) and the scoring can evolve, which means an upgrade can shift findings. The policy:
- Any change that can move findings is at least a MINOR version bump, called out in CHANGELOG.md under that release.
- Pin dry-ts@x.y.z in CI so a gate stays reproducible across runs, and read the changelog before bumping the pin.
CI tools live or die on trust; treating output stability as part of semver is how a pinned gate stays honest.
Usage
Run without installing after the package is published:
bunx dry-ts [options] [file-or-directory ...]
npx dry-ts [options] [file-or-directory ...]
Run from this repository:
bun install
bun run build
bun ./dist/bin/dry-ts.js [options] [file-or-directory ...]
Options:
--profile NAME Start from a curated preset (pr, src, audit, tests, agent),
then apply explicit flags on top. See "Profiles".
--threshold N Minimum structural similarity score, default 0.82
--min-lines N Minimum source lines in a candidate declaration, default 4
--min-nodes N Minimum normalized syntax nodes, default 20; candidates
below this threshold are excluded before pair matching,
so raising this speeds scans
--min-locations N
Minimum locations in a reported cluster, default 2
--format F text, json, edn, or sarif, default text
--edn Same as --format edn
--json Same as --format json
--text Same as --format text
--sarif Same as --format sarif (SARIF 2.1.0 for GitHub code scanning)
--changed-from REF
Incremental gating: mark clusters that intersect code changed
since merge-base(REF, HEAD) as status "new". The scope is your
working tree (committed + staged + unstaged). Untracked scanned
files count as fully changed. Requires a git repository.
--changed FILE Incremental gating: mark clusters intersecting FILE (every
line) as status "new". Repeatable; for agents/non-git callers.
Cannot be combined with --changed-from.
--explain-changed
Dump the resolved changed-region map to stderr for debugging.
--only-new Restrict reported clusters to status "new". Output filter only:
the exit code is unchanged (still governed by
--fail-on-duplicates). Requires --changed-from/--changed.
Totals print to stderr, e.g. "showing 6 new (73 known hidden)".
--fail-on-duplicates
Exit 1 on findings. With --changed-from/--changed, only
clusters with status "new" gate; otherwise any cluster does.
--no-gitignore Include files and directories ignored by .gitignore
--exclude GLOB Skip files/directories matching a .gitignore-style glob, e.g.
--exclude '**/*.spec.*'. Repeatable. Applies during directory
scans regardless of --no-gitignore; explicit file arguments are
always scanned.
--exclude-tests Skip test files during directory scans: a curated preset of
**/*.test.*, **/*.spec.*, **/*.e2e-spec.*, **/__tests__/**, and
**/__mocks__/**, merged into the --exclude glob list (composes
with any --exclude globs; explicit file arguments still scanned).
Opt-in, default off; output byte-for-byte unchanged when off.
--exclude-kinds KIND[,KIND...]
Drop candidate declarations of the given SyntaxKinds before
matching. Comma-separated and repeatable. Opt-in only: with no
flag, output is unchanged. Useful for suppressing boilerplate
false positives such as dep-only DI constructors
(--exclude-kinds Constructor) or port/interface member
signatures (--exclude-kinds PropertySignature,MethodSignature).
An unknown or non-candidate kind name is a hard error.
--exclude-tagged-templates
Drop candidate declarations whose value is a tagged template
literal (const X = styled(Button)`…`, css`…`, gql`…`). Opt-in,
default off. Suppresses CSS-in-JS / styled-components clusters,
a dominant false-positive class on frontend codebases.
--counterparts Add, per cluster location, its nearest matching counterpart
({index, file, startLine, endLine, shared, total, score}) and —
under an active change scope (--changed-from/--changed) — a
per-location "changed" boolean. The nearest is the absolute
strongest AST-similar partner in the same cluster (intra-cluster
reference, so --only-new never orphans it). Opt-in, default off;
off-path output is byte-for-byte unchanged. Note: --only-new
filters whole clusters, never individual locations or counterparts.
Valid --exclude-kinds names are the candidate root kinds: the TypeScript AST
node types dry-ts treats as comparable units. The names are TypeScript
SyntaxKinds; what each one is in plain terms:
| Name | What it is |
| --- | --- |
| ClassDeclaration | a class Foo {} declaration |
| InterfaceDeclaration | an interface Foo {} declaration |
| TypeAliasDeclaration | a type Foo = ... alias |
| EnumDeclaration | an enum Foo {} declaration |
| ModuleDeclaration | a namespace Foo {} / module Foo {} block |
| FunctionDeclaration | a function foo() {} declaration |
| MethodDeclaration | a method body in a class or object literal: foo() {} |
| Constructor | a class constructor() {} |
| GetAccessor | a getter: get foo() {} |
| SetAccessor | a setter: set foo(v) {} |
| PropertyDeclaration | a class field: foo = ... / foo: T |
| PropertySignature | a property in an interface/type: foo: T |
| MethodSignature | a method signature in an interface/type: foo(): T |
| CallSignature | a callable signature in a type: (arg: T): U |
| ConstructSignature | a constructable signature in a type: new (): T |
| IndexSignature | an index signature: [key: string]: T |
| VariableStatement | a const / let / var statement (the whole declaration line) |
| EnumMember | a single member inside an enum |
| ArrowFunction | an arrow function used as a value: () => {} |
| FunctionExpression | a function () {} used as a value |
Excluding a kind never hides a longer child candidate — children are always visited regardless.
Curating results
Out of the box on a large frontend monorepo a scan can report thousands of clusters, much of it expected duplication — test scaffolding, CSS-in-JS, generated code. The raw output is a candidate list, not a ranked verdict. Two things cut it down to the findings that matter:
Ranking. Clusters where the same declaration name recurs across two or
more distinct files float to the top, tagged same-name=<name> in the text
header. A validateUser copied into another module is the strongest "real,
copy-pasted duplicate" signal there is — near-zero false positive — so it leads
the report regardless of score. Everything below is ordered strongest score
first, as before. (A name repeated only within one file — overloads, shadowed
locals — does not count; cross-file is the signal.)
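The ranking rule above can be sketched in a few lines. This is only an illustration: the RankedCluster and RankedLocation shapes are hypothetical, and ordering the same-name group by maximum score is an assumption (the text above only fixes the order of everything below that group).

```typescript
// Hypothetical shapes for illustration; not the package's exported types.
interface RankedLocation { file: string; name: string | null }
interface RankedCluster { scoreMax: number; locations: RankedLocation[] }

// Names that occur in at least two distinct files within one cluster.
function crossFileNames(cluster: RankedCluster): string[] {
  const filesByName = new Map<string, Set<string>>();
  for (const loc of cluster.locations) {
    if (loc.name === null) continue; // anonymous candidates carry no name signal
    const files = filesByName.get(loc.name) ?? new Set<string>();
    files.add(loc.file);
    filesByName.set(loc.name, files);
  }
  const names: string[] = [];
  filesByName.forEach((files, declName) => {
    if (files.size >= 2) names.push(declName); // same file only does not count
  });
  return names;
}

// Same-name clusters first, then strongest score first (assumed tie-break).
function rankClusters(clusters: RankedCluster[]): RankedCluster[] {
  return [...clusters].sort((a, b) => {
    const aSame = crossFileNames(a).length > 0 ? 1 : 0;
    const bSame = crossFileNames(b).length > 0 ? 1 : 0;
    if (aSame !== bSame) return bSame - aSame;
    return b.scoreMax - a.scoreMax;
  });
}
```

A cluster with validateUser in two files leads the result even when another cluster scores higher; a name repeated inside a single file does not promote its cluster.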
The curation footer. On a large run (≥10 clusters) the text format prints a short footer to stderr that names the levers which would cut the noise and estimates the reduction:
2574 clusters. Curation levers (see README "Curating results"):
- 1250 disappear with --exclude-tests (clusters that fall below --min-locations
once test files are dropped) → ≈1324 left
- --exclude-tagged-templates drops CSS-in-JS / styled-components clusters
- --min-nodes N raises the size floor (currently 20); --exclude '<glob>' drops paths
It is a teaching aid, not a finding: it goes to stderr (not stdout), so it never
pollutes the findings stream; pipe or redirect stdout and the footer stays out
of it. The ≈ is honest: the --exclude-tests count is estimated from the reported
clusters, not a re-scan, so a removed location that bridged two halves of a
cluster can split it rather than delete it. JSON/EDN output never prints the footer.
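To make the footer's arithmetic concrete, here is a rough sketch of how such an estimate could be computed from already-reported clusters. The names are hypothetical and isTestPath only approximates the --exclude-tests preset with a basename check, so treat it as an illustration rather than the tool's logic.

```typescript
// Hypothetical shapes; only the file path matters for this estimate.
interface FooterLocation { file: string }
interface FooterCluster { locations: FooterLocation[] }

// Approximation of the preset: *.test.*, *.spec.*, *.e2e-spec.* basenames
// and anything under a __tests__ or __mocks__ directory.
function isTestPath(file: string): boolean {
  const segments = file.split("/");
  const base = segments[segments.length - 1];
  const dirs = segments.slice(0, -1);
  if (dirs.some((d) => d === "__tests__" || d === "__mocks__")) return true;
  return /\.(test|spec|e2e-spec)\./.test(base);
}

// A cluster "disappears" when dropping test files leaves it below minLocations.
function estimateAfterExcludeTests(
  clusters: FooterCluster[],
  minLocations = 2,
): { disappear: number; left: number } {
  let disappear = 0;
  for (const cluster of clusters) {
    const remaining = cluster.locations.filter((l) => !isTestPath(l.file)).length;
    if (remaining < minLocations) disappear++;
  }
  return { disappear, left: clusters.length - disappear };
}
```

Because it works on the reported clusters instead of re-scanning, the result is an estimate: it cannot see a cluster splitting in two once a bridging test location is gone.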
The levers themselves, roughly in order of leverage on a frontend codebase:
--exclude-tests (test scaffolding is typically about half the noise),
--exclude-tagged-templates (CSS-in-JS), --exclude '<glob>' (whole paths:
generated code, fixtures, stories), --min-nodes N (raise the size floor), and
the kind filters below.
Profiles: --profile NAME
Rather than rediscover the right flag combination per run, start from a curated
preset. --profile NAME seeds a bundle of defaults; any explicit flag you pass
overrides it. Precedence is explicit flag > profile > built-in default, and
list flags (--exclude-kinds) union the profile's entries with yours rather
than replacing them — so --profile pr --exclude-kinds Constructor excludes
ArrowFunction, VariableStatement, and Constructor.
| Profile | Expands to | For |
| --- | --- | --- |
| pr | --exclude-tests --min-nodes 50 --exclude-kinds ArrowFunction,VariableStatement --only-new --fail-on-duplicates | PR gate, highest signal. Requires --changed-from/--changed (it sets --only-new, which errors without a scope — so it fails loud rather than gating against the wrong base). |
| agent | pr + --counterparts --format json | After-edit agent loop. The PR gate plus per-location counterpart routing and JSON output, so an agent reads each new finding's nearest existing match. Inherits pr's scope requirement. See AI Agents. |
| src | --exclude-tests | Source-only scan with test scaffolding dropped. |
| audit | --min-nodes 12 | Broad exploratory scan — lower the floor to surface near-misses the default filters out. |
| tests | --exclude-kinds ArrowFunction --min-nodes 40 | Test-infrastructure duplication (shared setup/fixtures/builders), explicitly not the anonymous arrow bodies that dominate a raw test scan. Point it at your test directories. |
A profile only ever turns things on (there is no negation flag to switch one back off), so each stays at the high-signal defaults its name implies.
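The precedence rule (explicit flag > profile > built-in default, with list flags unioned) can be sketched as below. ResolvedOptions, PROFILES, and resolveOptions are hypothetical names for this illustration; only the pr and audit rows are copied from the table, and throwing on an unknown profile is an assumption.

```typescript
// Hypothetical option bag covering just the fields this example needs.
interface ResolvedOptions {
  minNodes: number;
  excludeTests: boolean;
  excludeKinds: string[];
}

// Built-in defaults as documented under Options.
const DEFAULTS: ResolvedOptions = { minNodes: 20, excludeTests: false, excludeKinds: [] };

// Two rows from the profile table.
const PROFILES: Record<string, Partial<ResolvedOptions>> = {
  pr: { excludeTests: true, minNodes: 50, excludeKinds: ["ArrowFunction", "VariableStatement"] },
  audit: { minNodes: 12 },
};

function resolveOptions(
  profile: string | undefined,
  explicit: Partial<ResolvedOptions>,
): ResolvedOptions {
  const preset: Partial<ResolvedOptions> | undefined =
    profile === undefined ? {} : PROFILES[profile];
  if (preset === undefined) throw new Error(`unknown profile: ${profile}`); // assumed behavior
  return {
    // Scalars: explicit flag wins, then the profile, then the default.
    minNodes: explicit.minNodes ?? preset.minNodes ?? DEFAULTS.minNodes,
    excludeTests: explicit.excludeTests ?? preset.excludeTests ?? DEFAULTS.excludeTests,
    // Lists: union of profile entries and explicit entries, profile first.
    excludeKinds: Array.from(
      new Set([...(preset.excludeKinds ?? []), ...(explicit.excludeKinds ?? [])]),
    ),
  };
}
```

With this sketch, resolveOptions("pr", { excludeKinds: ["Constructor"] }) yields the three kinds named in the precedence paragraph, while an explicit minNodes replaces the profile's floor.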
Dropping near-uniform candidates: --min-distinct-kinds
--min-nodes filters by raw node count, but a large candidate can still be
near-uniform boilerplate — a property-only interface, a flat config object —
that clears the node bar yet reaches the similarity threshold against any
similarly-shaped block. --min-distinct-kinds N drops a candidate whose subtree
spans fewer than N distinct node kinds, while keeping candidates with varied
control flow.
It complements --min-nodes (size) with a structure-variety floor. Default
off (0); markers do not count toward kind diversity, only node kinds do.
The off path tracks nothing, so it costs nothing.
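As a rough illustration of a structure-variety floor, the sketch below counts distinct node kinds in a candidate subtree. SketchNode is a hypothetical stand-in for whatever the normalizer produces, and markers are simply not modeled here.

```typescript
// Hypothetical tree shape: a kind name plus child nodes.
interface SketchNode { kind: string; children: SketchNode[] }

// Iterative walk so deeply nested candidates cannot overflow the call stack.
function countDistinctKinds(root: SketchNode): number {
  const kinds = new Set<string>();
  const stack: SketchNode[] = [root];
  while (stack.length > 0) {
    const node = stack.pop() as SketchNode;
    kinds.add(node.kind);
    for (const child of node.children) stack.push(child);
  }
  return kinds.size;
}

// A floor of 0 means the filter is off and every candidate passes.
function passesDiversityFloor(root: SketchNode, minDistinctKinds: number): boolean {
  if (minDistinctKinds <= 0) return true;
  return countDistinctKinds(root) >= minDistinctKinds;
}
```

A property-only interface with ten members still spans only a handful of kinds, so it falls under a modest floor no matter how many nodes it has.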
Suppressing CSS-in-JS declarations: --exclude-tagged-templates
Styled-components and other tagged-template idioms (const X = styled(Button)`…`,
styled('span')`…`, css`…`, gql`…`) normalize to a near-identical AST: a
VariableStatement whose initializer is a TaggedTemplateExpression, with
${p => p.theme.x} arrow interpolations that add just enough distinct kinds to
clear the diversity floor. So they cluster across dozens of files despite sharing
no logic, and the existing reducers do not catch them (excluding VariableStatement
wholesale would also kill real const-bound function duplicates). On a large frontend
codebase like the Sentry corpus this is one of the single largest false-positive
classes.
--exclude-tagged-templates drops any candidate declaration whose value is a
tagged template literal. It matches by structure rather than by tag name, so it
suppresses styled, css, gql, and any styled alias uniformly, with no tag
allowlist to maintain. Default off; when off, output is byte-for-byte
unchanged and the check costs nothing. Like the other reducers it never stops
recursion into children — a genuine duplicate nested inside a tagged template is
still reported.
Skipping test files: --exclude-tests
Test files are the single largest false-positive class in real scans — ~82% of
clusters on a Node/TS backend (n8n cli/src), ~49% on a frontend (Sentry). Most
of that is table-driven test cases: identical arrange/act/assert blocks differing
only in data. That repetition is correct — test bodies should be DAMP
(descriptive and meaningful) over DRY, so readability and failure-localization
win, and table-driven cases are the sanctioned form. --exclude-tests exists to
focus a run on src duplication, not because test duplication never matters:
real test infrastructure dup (builders, factories, custom matchers, shared
setup) is worth its own dedicated scan — just point the tool at the test tree
without this flag.
The flag merges a curated preset (**/*.test.*, **/*.spec.*,
**/*.e2e-spec.*, **/__tests__/**, **/__mocks__/**) into the --exclude
glob list, so it composes with any explicit --exclude globs and follows the
same rules (directory scans only; explicitly-named file arguments are always
scanned). Bare test/ / tests/ / e2e/ directories are deliberately left out
of the preset — too many projects use those names for non-test code; add them
with --exclude '**/test/**' if your layout needs it. Default off; when off,
output is byte-for-byte unchanged.
Suppressing a single occurrence: // dry-ignore
For an intentional, idiomatic repetition that you do not want to exclude wholesale by kind or file, annotate the specific declaration at the source:
// dry-ignore
export function knownDuplicate(): void {
// ...
}
A // dry-ignore (or // dry-ignore-next-line) comment in a declaration's
leading trivia drops that declaration as a candidate. The block-comment form
/* dry-ignore */ works too. No flag required; reads the existing source, no
second parse.
Placement rule: suppression is scoped to the node whose leading comment carries
the directive — put it on the exact declaration you mean. A directive on a
wrapping const statement suppresses that VariableStatement candidate but does
not reach a nested arrow function, which keeps its own (separate) leading
trivia and remains a candidate. Suppressing a parent never hides unrelated child
candidates inside it.
Limitation — // dry-ignore is all-or-nothing and global to that
declaration. It drops the node as a candidate entirely, including against any
future accidental clone of it. There is deliberately no "this particular pair
is intentional, but keep watching each member for other clones"
acknowledgment: that would be a stored, fingerprint-keyed baseline, and dry-ts
is stateless by design (no snapshot to drift or rot). The consequence is honest:
for an intentional N-member family (say 18 sibling templates) you either
annotate all N declarations, or — the intended blunt instrument — exclude the
whole path with --exclude '<glob>' (or .gitignore), accepting that it also
hides any real duplicate that later lands there.
For file- or glob-level ignores, exclude the path via --exclude/.gitignore
(or --no-gitignore to override). Persisting flags in config (so you do not
retype --exclude-tests --min-nodes 40 … every run) is planned — config state, not a findings baseline — and does not exist yet.
Incremental gating
--fail-on-duplicates on its own is zero-tolerance: any cluster anywhere fails
the build, which no real codebase survives. Pair it with a changed-scope flag to
gate only on duplication a change introduces — "no change makes the codebase
wetter" — while still reporting known debt. No baseline file, no state.
Every cluster carries a status:
- new: at least one location intersects the changed scope. This is the finding, even when the counterpart location is old code (you copied something). Only new clusters gate under --fail-on-duplicates.
- known: pre-existing duplication, entirely in unchanged code. Reported, never gates.
- unscoped: emitted for every cluster when no changed-scope flag is active (the tool cannot know what is "known" without a scope).
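The three statuses reduce to a small decision, sketched here under assumptions: ChangedScope is a hypothetical map from file path to changed line ranges, and an undefined scope stands for a run with no changed-scope flag.

```typescript
type ClusterStatus = "new" | "known" | "unscoped";

interface LineRange { startLine: number; endLine: number }
interface ScopedLocation extends LineRange { file: string }

// Hypothetical representation of the resolved changed-region map.
type ChangedScope = Map<string, LineRange[]>;

// Inclusive line ranges overlap when each starts before the other ends.
function rangesIntersect(a: LineRange, b: LineRange): boolean {
  return a.startLine <= b.endLine && b.startLine <= a.endLine;
}

function classifyCluster(
  locations: ScopedLocation[],
  scope: ChangedScope | undefined,
): ClusterStatus {
  if (scope === undefined) return "unscoped"; // no scope flag: nothing can be "known"
  const touched = locations.some((loc) =>
    (scope.get(loc.file) ?? []).some((range) => rangesIntersect(loc, range)),
  );
  return touched ? "new" : "known";
}
```

One changed location is enough to make the whole cluster new, which is why copying old code into a new file is flagged even though the original has not moved.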
In a CI gate the actionable new clusters are easily buried under pre-existing
known ones. --only-new filters the report down to new clusters across
all formats (text/json/edn/sarif). It is an output filter only: the exit code is
unchanged (still governed by --fail-on-duplicates), so the gate behaves
identically while the log stays readable. It requires a changed-scope flag
(there is no new status without one).
What counts as "changed"
--changed-from REF resolves merge-base(REF, HEAD) and diffs from there, so a
branch behind its base does not see base-side changes pollute the result. Write
--changed-from origin/main and get correct PR semantics directly.
The diff is taken against your working tree, so the scope includes committed, staged, and unstaged edits — an uncommitted change already gates. That is what makes it right for an agent loop that runs before committing. On top of that, any scanned file that git does not track counts as fully changed (a brand-new file is all-new, so a freshly-added duplicate cannot slip past the gate).
A file renamed into scope with no edits gates nothing — moving code is not
duplicating it. --changed FILE scopes the whole file (file granularity),
including any pre-existing duplication inside it, and is the path for non-git
callers; use --changed-from for line-level precision.
When no paths are provided, dry-ts scans src. Directory arguments recursively include .js, .jsx, .ts, .tsx, .mts, and .cts files, excluding TypeScript declaration files. Directory scans respect .gitignore from the working directory by default; pass --no-gitignore to include ignored paths. Explicit file arguments are always scanned even when they match a .gitignore pattern.
--exclude GLOB drops files and directories matching a .gitignore-style glob, e.g. --exclude '**/*.spec.*' --exclude '**/*.stories.*'. It is repeatable and applies during directory scans regardless of --no-gitignore (it is an explicit instruction, not repo config); explicit file arguments are still always scanned. This is the highest-leverage way to cut whole categories of expected duplication — on a large frontend codebase, test and story files alone are typically about half of all reported clusters.
Output formats
Default text output:
CLUSTER 1 score=0.89 locations=2 status=unscoped
src/invoice.ts:12-25 nodes=88 kind=FunctionDeclaration name=renderInvoice
src/receipt.ts:30-44 nodes=91 kind=FunctionDeclaration name=renderReceipt
Under a changed-scope, findings are marked: status=new (intersects your change).
A cluster whose declaration name recurs across files carries a trailing
same-name=<name> tag (up to three names, then (+N)) and is ranked to the top
— see Curating results.
Each location carries two diagnostic facts so a reader can classify a finding
without opening the file: kind (the candidate root SyntaxKind — Constructor,
InterfaceDeclaration, ArrowFunction, …) and name (the declaration
identifier). A Constructor is named constructor; an anonymous candidate (an
arrow function, a callable signature) has no name — the text format drops the
name= token, JSON/EDN report null/nil.
EDN output:
{:clusters
[{:score-min 0.8909090909090909
:score-max 0.8909090909090909
:status :unscoped
:location-count 2
:locations [{:file "src/invoice.ts", :start-line 12, :end-line 25, :nodes 88, :kind "FunctionDeclaration", :name "renderInvoice"}
{:file "src/receipt.ts", :start-line 30, :end-line 44, :nodes 91, :kind "FunctionDeclaration", :name "renderReceipt"}]}]}
JSON output:
{
"clusters": [
{
"score": { "min": 0.8909090909090909, "max": 0.8909090909090909 },
"status": "unscoped",
"locationCount": 2,
"locations": [
{ "file": "src/invoice.ts", "startLine": 12, "endLine": 25, "nodes": 88, "kind": "FunctionDeclaration", "name": "renderInvoice" },
{ "file": "src/receipt.ts", "startLine": 30, "endLine": 44, "nodes": 91, "kind": "FunctionDeclaration", "name": "renderReceipt" }
]
}
]
}
--format sarif emits SARIF 2.1.0 for GitHub code scanning; see SARIF.
CI
Gate a PR only when it introduces new duplication, tolerating known debt, with
--changed-from against the PR's base branch. This is the one copy-paste action:
name: Duplicate Code
on: [push, pull_request]
jobs:
  dry-ts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # merge-base needs history; the default shallow checkout breaks it.
          fetch-depth: 0
      - uses: oven-sh/setup-bun@v2
        with:
          bun-version: 1.3.6
      - run: bunx dry-ts@x.y.z --profile pr --changed-from origin/${{ github.base_ref || 'main' }} src
Pin dry-ts@x.y.z to a real version (see Output stability). To gate on all
duplication (zero-tolerance) instead, drop --changed-from and the pr profile:
bunx dry-ts --fail-on-duplicates src.
For this repository, bun run ci builds, tests, and runs dry-ts against src test.
SARIF / GitHub code scanning
--format sarif (or --sarif) emits SARIF 2.1.0, the lingua franca for GitHub
code scanning and most CI quality dashboards. One result per cluster under the
rule dry-ts/structural-duplicate; each location maps to a physicalLocation,
and --counterparts nearest data maps to relatedLocations joined back by a
relevant relationship. A cluster's level follows its status: new (intersects
the change) → warning, known/unscoped → note. Findings are framed as
structural candidates, not confirmed duplicates.
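The status-to-level mapping and the one-result-per-cluster shape can be sketched as follows. The rule id and the level mapping come from the description above; the message text and the SarifSourceLocation type are assumptions, and relatedLocations are left out.

```typescript
type ClusterStatus = "new" | "known" | "unscoped";

// Hypothetical input shape: one entry per cluster location.
interface SarifSourceLocation { file: string; startLine: number; endLine: number }

// new -> warning; known and unscoped -> note.
function sarifLevel(status: ClusterStatus): "warning" | "note" {
  return status === "new" ? "warning" : "note";
}

// Builds one SARIF 2.1.0 result object for a cluster.
function toSarifResult(status: ClusterStatus, locations: SarifSourceLocation[]) {
  return {
    ruleId: "dry-ts/structural-duplicate",
    level: sarifLevel(status),
    // Illustrative wording only; framed as a candidate, not a confirmed duplicate.
    message: { text: `Structural duplicate candidate across ${locations.length} locations` },
    locations: locations.map((loc) => ({
      physicalLocation: {
        artifactLocation: { uri: loc.file },
        region: { startLine: loc.startLine, endLine: loc.endLine },
      },
    })),
  };
}
```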
Upload the report so findings surface inline on the PR:
- run: bunx dry-ts --sarif --changed-from origin/${{ github.base_ref || 'main' }} src > dry-ts.sarif
- uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: dry-ts.sarif
Pair with --counterparts to enrich each finding with its nearest counterpart.
Run without --fail-on-duplicates if you want the annotations without failing
the build.
AI Agents
If you use an AI agent, run npx @tanstack/intent@latest install.
dry-ts is built for agents as a first-class consumer: run it after an edit,
parse the JSON, and route each finding. The agent profile bundles exactly that
configuration.
# After an agent edits code: gate on duplication the edit introduced and hand
# the agent routed JSON — each new cluster with its nearest existing match.
bunx dry-ts --profile agent --changed-from HEAD src test
The loop:
- Run after edits with --profile agent (= --only-new --counterparts --fail-on-duplicates --format json over the pr floors).
- Read the exit code. 0 = clean. 1 = new duplication found, with the JSON below on stdout. 2 = infra/config failure; do not read it as findings.
- Per cluster, read locations[].nearest (the nearest existing match) and its changed flag to route the fix:
  - counterpart changed: false ⇒ the new code duplicates existing code → reuse / extract toward the existing definition.
  - counterpart changed: true ⇒ the agent reimplemented itself within its own diff → refactor the new code (lowest-risk; nothing stable depends on it yet).
- Reuse existing code, or justify the duplication (a // dry-ignore if it is intentional).
Exit codes are stable for automation:
0 success: no findings, or no --fail-on-duplicates
1 findings with --fail-on-duplicates (status "new" under a changed-scope;
any cluster otherwise)
2 usage/configuration error, or any git/scanner failure (fail-closed)
The gate fails closed: a missing git binary, a bad ref, unparseable diff output,
an unreadable source file, or zero files scanned under --fail-on-duplicates all
exit 2 with a message — never a silent green or a 1 that reads as "findings".
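A caller can keep the three exit codes apart with a small helper like the one below. It assumes a gated run (--fail-on-duplicates active, as in the agent profile), so exit 0 means clean; GateOutcome and interpretRun are hypothetical names, and treating every code other than 0 or 1 as a failure is a deliberately fail-closed assumption.

```typescript
// Hypothetical result type for a wrapper around the CLI.
type GateOutcome =
  | { kind: "clean" }
  | { kind: "findings"; json: string }
  | { kind: "failure"; message: string };

function interpretRun(exitCode: number, stdout: string, stderr: string): GateOutcome {
  switch (exitCode) {
    case 0:
      return { kind: "clean" };
    case 1:
      // Findings: stdout still carries the parseable report.
      return { kind: "findings", json: stdout };
    default:
      // Exit 2 (or anything unexpected) is never read as findings.
      return { kind: "failure", message: stderr.trim() };
  }
}
```

Keeping failure separate from findings means a broken git ref cannot be mistaken for a duplication report, and it cannot pass as a green build either.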
The JSON shape is intentionally small and stable: { "clusters": ClusterReport[] }. Each cluster includes a score range, a status ("new" | "known" | "unscoped"), locationCount, and grouped locations. Each location includes nodes (the normalized syntax node count for that duplicated block), kind (the candidate root SyntaxKind name), and name (the declaration identifier, or null when anonymous). kind and name let an agent triage a finding — e.g. skip a Constructor in a *.spec.ts as dependency-injection boilerplate — without a second read of the source.
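For consumers that want types, the shape described above can be written down roughly as follows. The field names follow the JSON example in Output formats; the interface names other than ClusterReport, and both helper functions, are hypothetical and not part of the package's API.

```typescript
// Field names mirror the documented JSON; type names are illustrative.
interface LocationReport {
  file: string;
  startLine: number;
  endLine: number;
  nodes: number;
  kind: string;
  name: string | null;
}

interface ClusterReport {
  score: { min: number; max: number };
  status: "new" | "known" | "unscoped";
  locationCount: number;
  locations: LocationReport[];
}

interface DryTsReport { clusters: ClusterReport[] }

// Minimal structural check before trusting the parsed value.
function parseReport(stdout: string): DryTsReport {
  const parsed = JSON.parse(stdout) as DryTsReport;
  if (parsed === null || !Array.isArray(parsed.clusters)) {
    throw new Error("not a dry-ts report: missing clusters array");
  }
  return parsed;
}

// Triage example from the text: constructors in *.spec.ts files are
// likely dependency-injection boilerplate.
function isLikelyDiBoilerplate(cluster: ClusterReport): boolean {
  return (
    cluster.locations.length > 0 &&
    cluster.locations.every((l) => l.kind === "Constructor" && /\.spec\.ts$/.test(l.file))
  );
}
```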
Nearest-counterpart provenance: --counterparts
A cluster's score range and member list do not say, for a given location,
which member it actually matches and how strongly — in a transitive cluster
(>2 members) the range hides the edge structure. --counterparts adds that
missing payload: per location, its nearest matching counterpart (the absolute
strongest AST-similar partner in the same cluster) as { index, file, startLine,
endLine, shared, total, score }, plus — under an active change scope — a
per-location changed boolean. index is the counterpart's position in the same
cluster's locations array (an O(1) deref); file/startLine/endLine are the
self-contained reference; shared/total are the exact pairwise
fingerprint-intersection and union counts; score is shared / total, the same
similarity value the cluster reports, so you never recompute a float.
{
"clusters": [
{
"score": { "min": 0.8205128205128205, "max": 1 },
"status": "new",
"locationCount": 3,
"locations": [
{
"file": "new_a.ts", "startLine": 1, "endLine": 4, "nodes": 41,
"kind": "FunctionDeclaration", "name": "summarizeRows",
"nearest": {
"index": 2, "file": "old.ts", "startLine": 1, "endLine": 4,
"shared": 34, "total": 34, "score": 1
},
"changed": true
},
{
"file": "new_b.ts", "startLine": 1, "endLine": 5, "nodes": 46,
"kind": "FunctionDeclaration", "name": "reduceEntries",
"nearest": {
"index": 0, "file": "new_a.ts", "startLine": 1, "endLine": 4,
"shared": 32, "total": 39, "score": 0.8205128205128205
},
"changed": true
},
{
"file": "old.ts", "startLine": 1, "endLine": 4, "nodes": 41,
"kind": "FunctionDeclaration", "name": "normalizeRecord",
"nearest": {
"index": 0, "file": "new_a.ts", "startLine": 1, "endLine": 4,
"shared": 34, "total": 34, "score": 1
},
"changed": false
}
]
}
]
}
Reading this: new_a re-implements old exactly (score 1, a tight pair),
while new_b chains in more loosely (shared 32/39, score 0.82). The
changed flag on both sides tells an agent how to route the fix:
- counterpart changed: true ⇒ new/new: the agent reimplemented itself within its own diff; refactor the new code (highest-confidence, lowest-risk fix, nothing stable depends on it yet).
- counterpart changed: false ⇒ new/old: the new code duplicates existing code; extract toward the existing definition.
The nearest is always a member of the same cluster, so the index always
dereferences within that cluster's locations, and --only-new (which filters
whole clusters, never individual locations or counterparts) never orphans it.
The reported score/shared/total describe the edge between the two rendered
locations. (One rare exception: when a location's only structural match is to a
substructure inside a larger member — e.g. it matches another declaration's inner
body but not the whole declaration — the counterpart resolves to that enclosing
rendered member, and the score reflects the substructure edge.)
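The routing rule can be sketched by dereferencing nearest.index inside the same cluster. The types below mirror the --counterparts payload under an active change scope; the Route labels and routeCluster are hypothetical names for this illustration.

```typescript
interface NearestCounterpart {
  index: number;
  file: string;
  startLine: number;
  endLine: number;
  shared: number;
  total: number;
  score: number;
}

interface RoutedLocation {
  file: string;
  nearest: NearestCounterpart;
  changed: boolean;
}

type Route = "reuse-existing" | "refactor-new";

// For every changed location, look at whether its nearest counterpart changed too.
function routeCluster(locations: RoutedLocation[]): Array<{ file: string; route: Route }> {
  const routes: Array<{ file: string; route: Route }> = [];
  for (const loc of locations) {
    if (!loc.changed) continue; // unchanged code is the reference, not the fix target
    const counterpart = locations[loc.nearest.index];
    if (counterpart === undefined) continue; // defensive; the index should always resolve
    routes.push({
      file: loc.file,
      route: counterpart.changed ? "refactor-new" : "reuse-existing",
    });
  }
  return routes;
}
```

Applied to the example payload, new_a.ts routes to reuse-existing (its nearest is old.ts) and new_b.ts routes to refactor-new (its nearest is new_a.ts).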
The full payload (index + score included) lives in json and edn. The
text format keeps one scannable line per location and appends an abbreviated
counterpart — → nearest <file>:<start>-<end> (<shared>/<total>) — without index
or score; read --format json/edn when an agent needs those fields.
Copy-paste recipes:
# Self-catch: did the block I just wrote re-implement existing structure?
dry-ts --counterparts --json --changed src/foo.ts src test
# Line-precise self-catch against the last commit, gated (or just --profile agent).
dry-ts --counterparts --only-new --fail-on-duplicates --changed-from HEAD src test
# CI fixer: gate a PR and hand a reviewer/fixer agent the routed JSON.
dry-ts --profile agent --changed-from origin/main src test
On a finding, exit 1 still emits the parseable JSON above on stdout, so read it.
Exit 2 is an infra/config failure (see the exit-code table) and must not be
read as duplicate findings.
Library API
import { TypeScriptDuplicateFinder } from "dry-ts";
const clusters = new TypeScriptDuplicateFinder().findClusters({
paths: ["src"],
threshold: 0.82,
minLines: 4,
minNodes: 20,
minLocations: 2,
respectGitignore: true, // default; set false to include .gitignore-d paths
});
findClusters() returns raw clusters with status unset. The changed-scope
flags (--changed-from, --changed) and the status field
("new" | "known" | "unscoped") are assigned by the CLI, not the library
finder.
How it works
dry-ts parses TypeScript source with the TypeScript compiler API, selects TypeScript declarations and function-like nodes as comparison candidates, normalizes each candidate's AST, and compares sets of structural fingerprints with Jaccard similarity:
score = shared fingerprints / all fingerprints seen in either candidate
Names and literal values normalize away, while TypeScript syntax shape remains. Classes, interfaces, type aliases, enums, functions, methods, constructors, properties, variable statements, accessors, enum members, arrow functions, and function expressions can all become candidates.
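The score formula is plain Jaccard similarity over two sets. A minimal sketch, assuming fingerprints are represented as strings (the real fingerprint type is not specified here) and that two empty sets score 0:

```typescript
// shared = |A ∩ B|, total = |A ∪ B|, score = shared / total.
function jaccard(
  a: Set<string>,
  b: Set<string>,
): { shared: number; total: number; score: number } {
  let shared = 0;
  a.forEach((fingerprint) => {
    if (b.has(fingerprint)) shared++;
  });
  const total = a.size + b.size - shared; // union size by inclusion-exclusion
  return { shared, total, score: total === 0 ? 0 : shared / total };
}
```

Because the inputs are sets, statement order does not matter and an inserted statement only adds a few fingerprints to the union instead of breaking the match.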
How dry-ts differs from token and line matchers
Most duplicate-code tools match tokens or lines. dry-ts matches normalized AST structure. The difference is which kind of clone each can see, in the standard Type 1–4 clone taxonomy:
- Type 1 — identical code, modulo whitespace and comments.
- Type 2 — Type 1 with renamed identifiers and changed literals; same structure.
- Type 3 — near-miss: statements added, removed, or reordered.
- Type 4 — semantically equivalent but structurally different.
| Tool | Method | Catches |
| --- | --- | --- |
| Simian | line hashing (ignores whitespace, braces, comments) | mostly Type 1 |
| jscpd | contiguous token-sequence matching (Rabin–Karp over Prism tokens) | Type 1 |
| PMD CPD | contiguous token-sequence matching (Rabin–Karp / suffix tree); can normalize identifiers and literals | Type 1, Type 2 |
| dry-ts | set similarity over normalized-AST fingerprints (Jaccard) | Type 2 and Type 3 |
Two properties follow from comparing sets of structural fingerprints instead of contiguous token runs:
- Rename- and reorder-tolerant. Names and literals normalize away before fingerprinting, and Jaccard scores partial overlap, so two blocks with the same shape but added, removed, or reordered statements still score high. Token- and line-sequence matchers need a contiguous run, so a single insertion splits the match. (PMD CPD's ignore-identifiers/ignore-literals reach Type 2, but still match contiguous token sequences, not fuzzy structural overlap.)
- Graded, not binary. The output is a similarity score (default ≥ 0.82), not "≥ N identical tokens": you tune by structural similarity, not run length.
dry-ts does not target Type 4 (semantic) clones; it compares structure, not behavior. What it adds over token/line matchers is the Type-2/Type-3 middle: same shape, different names, slight variations. That is exactly the class an LLM produces when it reimplements existing structure — and dry-ts is built for agents as the primary consumer of its output (catching their own reimplementations, or gating others' in CI), not for humans reading copy-paste reports.
Publishing
Before publishing:
bun install --frozen-lockfile
bun run ci
bun run pack:dry-run
npm publish
Development
bun run test
bun run check
bun run ci
bun run bench -- /path/to/project/src /path/to/project/tests
Benchmarking
Three corpus tiers, all scanned with bun run bench -- <paths>:
- Real mid-size project: any ~30k LOC repository you have locally. Use it as a regression check: cluster output should stay identical across performance changes, and timing should not regress.
- Pinned large repositories: bun run bench:setup fetches three pinned real-world corpora into .bench/ (gitignored). Pass a name (bun run bench:setup sentry) to fetch just one.
  - microsoft/TypeScript at a statically pinned tag (v5.9.3, chosen by maintainers to match the current typescript dependency; bumped by hand, not resolved automatically). Scan .bench/TypeScript/src/compiler for a worst-case stress: very large files, deeply nested ASTs, and high structural self-similarity. Its scan time is already low (~1.5s), so it has little regression headroom.
  - getsentry/sentry (sparse blobless clone of static/app only). Scan .bench/sentry/static/app: a large, messy real-world TS/TSX frontend (~6.8k files, ~2.6k clusters). Wider and more varied than the compiler subtree, so it surfaces hot-path regressions the compiler scan would miss.
  - n8n-io/n8n (sparse blobless clone of two subtrees). A large real-world Node/TS backend that keeps the corpus from over-indexing on frontend code. Two complementary scan targets, both checked out by bench:setup n8n:
    - .bench/n8n/packages/nodes-base/nodes (~3.7k files, ~20 MB): declarative integration nodes with real near-duplicate boilerplate; volume + recall and pair-comparison stress. This is the documented default baseline.
    - .bench/n8n/packages/cli/src (~1.9k files, ~14 MB): the server itself (controllers, services, entities, queue, auth), representative imperative backend app logic. Scan it explicitly with bun run bench -- <path>.
- Synthetic regimes: bun run bench:corpus <regime> generates a deterministic corpus into .bench/corpus/<regime>:
  - identical (default 800 functions): dense identical structures, stresses the pairwise comparison phase
  - oneliners (default 10000): trivial declarations below --min-lines, stresses entry filtering
  - nested (default depth 300): deeply nested expressions, stresses fingerprint construction
Both bench:corpus and bench pass --no-gitignore so corpus paths under .bench/ (which is gitignored) are scanned correctly. bench:corpus also accepts --count N to override the default size and --out DIR to write the corpus to a custom directory.
Example:
bun run bench:corpus identical -- --count 1200
bun run bench -- --runs 5 .bench/corpus/identical
Baselines (use as regression checks: cluster counts must stay fixed and timing must not regress across changes):
- TypeScript v5.9.3 src/compiler: ~1.5s, 246 clusters (since v0.3.0).
- Sentry 25.10.0 static/app: ~5.3s, 2574 clusters (since v0.5.0, --exclude-kinds hot-path change measured against it).
- n8n 2.25.7 packages/nodes-base/nodes: ~3.3s, 1720 clusters (since v0.8.0).
- n8n 2.25.7 packages/cli/src: ~2.9s, 1650 clusters (since v0.8.0).
