TA2 Duplication Detection

§0b

Opinion

I've done enough code review on enough teams to know that the duplication a human catches is the duplication that lives in the same file or the same pull request. The duplication that ships, that lives in production for two years, that breaks when one of two parallel implementations gets patched and the other does not: that duplication never shows up in review. It is too far apart in the file tree, too different in the variable names, too disguised by a wrapping abstraction. The eye does not catch it. The token-counter does.

The literature on this is decades old, and it makes a distinction Adam keeps reaching for: the four clone types.[1] Type-1 is identical code with whitespace differences. Type-2 is the same code with renamed identifiers. Type-3 is the same loop with one extra log line, or one reordered branch. Type-4 is the case I started with: two functions, completely different syntax, that compute the same answer. The first three are detectable with deterministic tools and have been since the 1990s. Practical detection of the fourth arrived in 2023, with LLMs that can read both functions and notice the equivalence.[2]
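To make the taxonomy concrete, here is a minimal sketch of one function and a clone of each type. The function names and the summing example are my own illustrative assumptions; they are not taken from the cited survey.

```typescript
// Original: sum the positive numbers in a list.
function sumPositive(values: number[]): number {
  let total = 0;
  for (const v of values) {
    if (v > 0) {
      total += v;
    }
  }
  return total;
}

// Type-1: the body has identical tokens; only whitespace and layout differ.
// (The function name differs only so both can live in one file.)
function sumPositiveCopy(values: number[]): number {
  let total = 0;
  for (const v of values) { if (v > 0) { total += v; } }
  return total;
}

// Type-2: same structure, identifiers renamed.
function addGains(amounts: number[]): number {
  let acc = 0;
  for (const a of amounts) {
    if (a > 0) {
      acc += a;
    }
  }
  return acc;
}

// Type-3: same loop with one statement added (an early return here;
// an extra log line would be the same kind of edit).
function sumPositiveGuarded(values: number[]): number {
  if (values.length === 0) return 0;
  let total = 0;
  for (const v of values) {
    if (v > 0) {
      total += v;
    }
  }
  return total;
}

// Type-4: different syntax, same answer.
function totalOfGains(xs: number[]): number {
  return xs.filter((x) => x > 0).reduce((a, b) => a + b, 0);
}
```

All five return the same value for the same input, but only the first three clones share enough tokens or tree shape with the original for a deterministic tool to notice.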

The mistake the existing TA2 is correcting is the assumption that jscpd green means duplication-free.[3] jscpd is a Rabin-Karp token tool; it catches Type-1 reliably, Type-2 partially, and is silent on Type-3 and Type-4 by construction. The honest version of this tenet is “run jscpd in CI” and “run a structural pass periodically” and “ask the agent during review whether anything in this change duplicates anything in the codebase.” Three different tools, three different clone types, three different cadences. The cost of any one of them is small; the cost of shipping a near-duplicate that lives for two years is enormous.
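Why a token tool behaves this way is easier to see in code. The following is a rough sketch of Rabin-Karp style window fingerprinting over a token stream. It is not jscpd's implementation: the tokenizer regex, the keyword list, the hash constants and every function name are assumptions made for illustration.

```typescript
const KEYWORDS = new Set(["let", "const", "for", "of", "if", "else", "return", "function"]);
const BASE = 257;
const MOD = 1000003;

// Split source into identifiers, numbers and single punctuation characters.
// With renameIdentifiers on, every non-keyword identifier becomes "ID".
function tokenize(source: string, renameIdentifiers: boolean): string[] {
  const raw = source.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) ?? [];
  return raw.map((t) =>
    renameIdentifiers && /^[A-Za-z_$]/.test(t) && !KEYWORDS.has(t) ? "ID" : t,
  );
}

function tokenHash(token: string): number {
  let h = 0;
  for (let i = 0; i < token.length; i++) h = (h * 31 + token.charCodeAt(i)) % MOD;
  return h;
}

// Rolling fingerprint of every window of `size` consecutive tokens.
function windowFingerprints(tokens: string[], size: number): number[] {
  const out: number[] = [];
  if (tokens.length < size) return out;
  let highPower = 1; // BASE^(size - 1) mod MOD
  for (let i = 1; i < size; i++) highPower = (highPower * BASE) % MOD;
  let h = 0;
  for (let i = 0; i < tokens.length; i++) {
    if (i >= size) {
      // Remove the token that just left the window.
      h = (h - ((tokenHash(tokens[i - size]) * highPower) % MOD) + MOD) % MOD;
    }
    h = (h * BASE + tokenHash(tokens[i])) % MOD;
    if (i >= size - 1) out.push(h);
  }
  return out;
}

// Count windows of `a` that also occur in `b`; tokens are compared on a
// fingerprint hit so that hash collisions are not reported as clones.
function sharedWindows(a: string[], b: string[], size: number): number {
  const index = new Map<number, number[]>();
  windowFingerprints(b, size).forEach((fp, start) => {
    index.set(fp, [...(index.get(fp) ?? []), start]);
  });
  let shared = 0;
  windowFingerprints(a, size).forEach((fp, start) => {
    const windowA = a.slice(start, start + size).join(" ");
    const candidates = index.get(fp) ?? [];
    if (candidates.some((s) => b.slice(s, s + size).join(" ") === windowA)) shared++;
  });
  return shared;
}

const original = "let total = 0; for (const v of values) { total += v; }";
const renamed = "let acc = 0; for (const a of amounts) { acc += a; }";
const edited = "let total = 0; for (const v of values) { log(v); total += v; }";

const strictHits = sharedWindows(tokenize(original, false), tokenize(renamed, false), 10);
const renamedHits = sharedWindows(tokenize(original, true), tokenize(renamed, true), 10);
const editedHits = sharedWindows(tokenize(original, true), tokenize(edited, true), 10);
```

With exact tokens the renamed copy shares no ten-token window with the original; with identifiers normalised every window matches. The inserted statement in the third snippet breaks the contiguous run, so only the leading windows still match, which is the sense in which a token tool goes quiet on Type-3.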

The cross-cut is F3 Don't Repeat Yourself on the principle side and S1 Wait for Three on the steelman side. Detection is not extraction. The tool catches the pattern; the team decides whether to extract. Sandi Metz's wrong-abstraction warning is the same warning a Type-2 jscpd hit deserves: two functions that look the same today might diverge tomorrow, and the abstraction that pretends they belong together can be more expensive than the duplication.[4]

Copy a note and link

Grab this short comment and drop it into a PR comment or an LLM chat to prompt the right change.

Eyeballs miss copy-paste. Run jscpd in CI for Type-1 and Type-2 clones; Semgrep or ast-grep weekly for Type-3; an LLM review pass on the change set for Type-4. Detect duplication of logic, not just text.

/tenet/duplication-detection/TA2

§0c

AI eyes only

Rule: deterministic tools catch Types 1–3 (token tools for Types 1–2, AST tools for Type 3); the AI catches Type 4 (semantic).

Reject: shipping with jscpd over the threshold. Reject: dismissing a flagged duplicate without a reason. Reject: declaring duplication-free without an LLM pass for semantic equivalents.

Generate: run jscpd in CI as a build gate. After every change, ask the AI reviewer “is anything in this change semantically duplicating anything else in the codebase?” Surface and resolve.

Diagnostic: every duplicate the tooling flags is either deduplicated or justified inline. Type-4 surfaces only via the LLM pass; the absence of jscpd hits is not the absence of duplication.

§0d

Why?

Catches the duplicate the eye misses. Cross-file, cross-module, cross-team near-duplicates are the class of duplication review never finds and a token tool finds in seconds.
Stratified by clone type. jscpd for Type-1 and Type-2 in CI; Semgrep or ast-grep for Type-3 on a schedule; LLM-assisted detection for Type-4 in review. Three tools, three cadences, three cost profiles.
Detection is not extraction. The tool flags the duplicate; the team decides whether to extract per S1 Wait for Three. Detection is asymmetric in a way extraction is not: the pass costs seconds, while a missed near-duplicate can cost years.
Reaches the semantic clone class current LLMs can detect and deterministic tools cannot. Two functions with different signatures computing the same answer is the case the canon explicitly delegates to AI tooling.
Ratchets down. .jscpd.json with a threshold of zero on greenfield codebases; directory-scoped overrides for legacy zones with a paydown plan. The threshold only ever moves toward zero.
Cheap on the hot path. The Type-1 and Type-2 pass costs single-digit seconds in CI; the Type-3 weekly pass costs minutes; only the Type-4 LLM pass costs real money, and only on the change set, not the repo.
Honest about its perimeter. Each tool comes with a stated detection ceiling; the tenet does not pretend a clean run means duplication-free.
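The ratchet rule in the list above can be checked mechanically. This is a minimal sketch, assuming thresholds are kept as a plain map from scope to allowed duplication percentage; that map, the `"*"` default key and the function name are hypothetical and are not a jscpd config format.

```typescript
// Scope -> maximum allowed duplication percentage. "*" is the repo-wide default.
// Hypothetical shape for illustration only.
type Thresholds = Record<string, number>;

// Returns one message per scope whose threshold moved away from zero.
function ratchetViolations(previous: Thresholds, proposed: Thresholds): string[] {
  const violations: string[] = [];
  const previousDefault = previous["*"] ?? 0;
  for (const scope of Object.keys(proposed)) {
    // Assumption: a scope that did not exist before is held to the old default.
    const before = scope in previous ? previous[scope] : previousDefault;
    const after = proposed[scope];
    if (after > before) {
      violations.push(`${scope}: threshold raised from ${before} to ${after}`);
    }
  }
  return violations;
}
```

A CI step could run this against the thresholds on the main branch and fail the build when the returned list is non-empty, so a legacy override can be paid down but never loosened.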

The receipts

Origins, quoted passages, evidence, the strongest counter-argument and the reply.

§1

Origins

Code-clone research as a formal discipline begins with Brenda Baker's 1995 paper On Finding Duplication and Near-Duplication in Large Software Systems, which used suffix-tree algorithms to find textually similar regions in a million lines of system code.[7] Baker's framing, that duplication is a code-quality property worth measuring with tooling and not just intuiting at review, established the field.

Ira Baxter and his collaborators introduced AST-based clone detection in 1998, which moved the work from token sequences to abstract syntax trees and made Type-2 detection (renamed identifiers) reliable.[8] Nearly a decade later Lingxiao Jiang, Ghassan Misherghi, Zhendong Su and Stephane Glondu published DECKARD at ICSE 2007, which used AST-vector clustering to push into Type-3 territory at scale.[9]

The four-clone-types taxonomy crystallised through a series of survey papers in the mid-2000s: Roy and Cordy's 2007 technical report A Survey on Software Clone Detection Research is the canonical reference, and every modern clone-detection paper still cites its Section 2.[1] The same taxonomy underwrites the ICPC clone-detection track that has been the forum for the field for two decades.

The arrival of LLMs that can do Type-4 detection is recent. The clearest empirical case is in the ICPC 2024 paper Investigating the Efficacy of LLMs for Code Clone Detection and the companion arXiv preprint Towards Understanding the Capability of LLMs on Code Clone Detection.[2] Both report that current LLMs reliably detect Type-4 clones across languages (the case where two functions compute the same answer through different syntax), which no token-based or AST-based deterministic tool has been able to do at scale.

Production tooling for the first three clone types has been mature for a decade. jscpd, PMD CPD and SonarQube CPD all use Rabin-Karp token matching for Type-1 and Type-2 with a small slice of Type-3.[3] Semgrep and ast-grep, both built on tree-sitter, are the modern replacements for DECKARD: AST pattern matching with metavariables that abstracts identifiers and reaches meaningful Type-3 coverage. The question this tenet answers is which tool runs on which cadence; the deeper question of whether to extract is a different debate handled by F3 DRY and S1 Wait for Three.
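The metavariable idea can be sketched in a few lines. Semgrep and ast-grep match against syntax trees; the sketch below works on a flat token list instead, purely to show how consistent bindings abstract identifiers and how an ellipsis tolerates an inserted statement. The pattern syntax, the tokenizer and the function names are illustrative assumptions, not either tool's API.

```typescript
type Bindings = Record<string, string>;

// Same simplistic tokenizer idea as a token-based tool would use.
function toTokens(source: string): string[] {
  return source.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) ?? [];
}

// Match pattern[p..] against tokens[t..]. "$NAME" binds one identifier and must
// bind the same identifier every time it appears; "..." absorbs any run of tokens.
function matchFrom(
  pattern: string[], tokens: string[], p: number, t: number, bound: Bindings,
): Bindings | null {
  if (p === pattern.length) return t === tokens.length ? bound : null;
  const part = pattern[p];
  if (part === "...") {
    for (let next = t; next <= tokens.length; next++) {
      const result = matchFrom(pattern, tokens, p + 1, next, bound);
      if (result) return result;
    }
    return null;
  }
  if (t >= tokens.length) return null;
  const token = tokens[t];
  if (part.charAt(0) === "$") {
    if (!/^[A-Za-z_]\w*$/.test(token)) return null; // metavariables bind identifiers only
    if (bound[part] !== undefined && bound[part] !== token) return null;
    return matchFrom(pattern, tokens, p + 1, t + 1, { ...bound, [part]: token });
  }
  return part === token ? matchFrom(pattern, tokens, p + 1, t + 1, bound) : null;
}

function matchPattern(pattern: string, source: string): Bindings | null {
  return matchFrom(pattern.split(" "), toTokens(source), 0, 0, {});
}

const accumulateLoop = "for ( const $ITEM of $LIST ) { ... $ACC + = $ITEM ; }";

const plain = matchPattern(accumulateLoop, "for (const v of values) { total += v; }");
const renamedAndEdited = matchPattern(
  accumulateLoop, "for (const a of amounts) { log(a); acc += a; }",
);
const differentLogic = matchPattern(accumulateLoop, "for (const a of amounts) { acc += b; }");
```

The first two loops match the same pattern despite different names and an extra statement, while the third does not, because `$ITEM` would have to bind both `a` and `b`.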

§2

Quotes

I describe an algorithm and tool, called dup, for finding all pairs of matching parameterized code fragments … the goal is to give insight into duplication that is not always apparent from textual inspection.

Brenda S. Baker, On Finding Duplication and Near-Duplication in Large Software Systems (1995)

We classify clones into four types based on the textual and functional similarities between the original and copied fragments … this taxonomy is the basis on which the tools and techniques we survey are evaluated.

Chanchal K. Roy & James R. Cordy, A Survey on Software Clone Detection Research (2007)

Duplication is far cheaper than the wrong abstraction. The detection tool is doing your team a favour; the extraction is the team's call.

Sandi Metz, The Wrong Abstraction (2016)

Our results indicate that LLMs can effectively detect Type-4 (semantic) clones across multiple programming languages, outperforming traditional clone detection techniques in this category.

ICPC 2024, Investigating the Efficacy of LLMs for Code Clone Detection

§3

Evidence

Twenty external sources, ranked by author authority. The first five are the canon; expand to see the rest, including the qualifiers and the named opposers. Each links out to its primary source.

01 · Supports
On Finding Duplication and Near-Duplication in Large Software Systems
Brenda S. Baker · 1995
The progenitor: clone detection as a tooling discipline. Suffix-tree algorithm for textual duplication that established the field.

02 · Supports
Clone Detection Using Abstract Syntax Trees
Ira D. Baxter et al. · 1998
ICSM 1998. Moved clone detection from token sequences to ASTs; established Type-2 detection (renamed identifiers) as a reliable tooling target.

03 · Supports
DECKARD: Scalable and Accurate Tree-based Detection of Code Clones
Lingxiao Jiang et al. · 2007
ICSE 2007. AST-vector clustering with high recall on Types 1, 2 and 3. The state of the art for deterministic clone detection through 2023.

04 · Supports
A Survey on Software Clone Detection Research
Chanchal K. Roy & James R. Cordy · 2007
The canonical four-clone-types taxonomy. Type-1, Type-2, Type-3, Type-4: the categorisation every modern clone-detection paper still cites.

05 · Qualifies
Comparison and Evaluation of Clone Detection Tools
Stefan Bellon et al. · 2007
IEEE TSE. The empirical comparison: every detection family has a different recall/precision profile; no single tool dominates. Underwrites the multi-tool cadence model.

Eighteen sources. The supports are the canonical clone-detection literature: Baker's 1995 paper, Baxter et al. on AST clones, Jiang's DECKARD, Roy & Cordy's 2007 survey, and Bellon et al. on tool comparison. The qualifiers further down carry the not-all-duplication-is-meaningful steelman. The opposers split between “tooling cannot detect Type-4” (true until 2023, qualified now) and “DRY is itself overrated” (the steelman that S1 carries).

§4b

Enforcement

Viewing: TypeScript.

Apply these rules in .jscpd.json. The full enforcement across every tenet lives on the implementation page.

Rule	Tool	Catches
jscpd	jscpd	Type-1 (identical) and Type-2 (renamed) clones via Rabin-Karp token matching. The CI gate.
ast-grep structural	ast-grep	Type-2 reliably and meaningful Type-3 via tree-sitter pattern matching with metavariables. Weekly scheduled pass.
Semgrep duplication patterns	Semgrep	Type-2 and Type-3 via AST pattern matching with metavariables. Originally a security tool; the duplication use is adjacent.
SonarQube CPD	SonarQube CPD	Statement-based detection that suppresses false positives like repeated import blocks. Reaches into Type-2 and partial Type-3.
Copilot duplication filter	GitHub Copilot duplication filter	65-lexeme suggestion checks against a public-code corpus. Production-grade lexeme-based filter at scale.
LLM Type-4 review pass	agent rule in CLAUDE.md / .cursor/rules	Type-4 (semantically equivalent, syntactically different) clones in the change set; the review prompt asks the model to identify functions in the diff that duplicate logic elsewhere.
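For the last row of the table, the review prompt itself is just text built from the change set. A minimal sketch, assuming the diff is already available as a string; the function name, the truncation limit and the exact wording beyond the question quoted in this tenet are hypothetical, and no model API is called here.

```typescript
// Builds the text handed to the AI reviewer. How the prompt is sent to a model
// is out of scope; this only assembles the question and the change set.
function buildDuplicationReviewPrompt(diff: string, maxDiffChars: number = 20000): string {
  const body =
    diff.length > maxDiffChars ? diff.slice(0, maxDiffChars) + "\n[diff truncated]" : diff;
  return [
    "You are reviewing a change set for duplicated logic.",
    "Is anything in this change semantically duplicating anything else in the codebase?",
    "Look for Type-4 clones: functions with different syntax that compute the same result.",
    "For each suspected duplicate, name the function in the diff, the existing function",
    "it duplicates, and why the two are equivalent. If you find none, say so explicitly.",
    "",
    "--- DIFF ---",
    body,
  ].join("\n");
}
```

Keeping the input to the diff, not the repository, is what keeps this pass on the change-set cost profile described elsewhere in this tenet.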

.jscpd.json configuration snippet

{
  "threshold": 0,
  "reporters": ["html", "console", "markdown"],
  "ignore": [
    "**/*.spec.ts",
    "**/*.test.ts",
    "**/node_modules/**",
    "**/dist/**",
    "**/.next/**"
  ],
  "absolute": true,
  "gitignore": true,
  "format": ["typescript", "javascript", "tsx", "jsx"],
  "minTokens": 50,
  "minLines": 5,
  "mode": "strict"
}

§4c

AI rules

Paste destination

File: .cursor/rules/ta2-duplication-detection.mdc

---
description: Prickles TA2 — Duplication Detection
globs: "**/*.{ts,tsx,js,jsx,py,java,php}"
alwaysApply: false
---

## Prickles TA2 — Duplication Detection

Run jscpd in CI for every commit. Threshold zero on greenfield code; directory-scoped overrides for legacy zones with a paydown plan.

Run a structural pass with Semgrep or ast-grep on a weekly schedule. The AST tools reach Type-2 reliably and meaningful Type-3.

Run an LLM duplication-review pass on the change set during AI review. The model catches Type-4 (semantically equivalent, syntactically different) clones that no deterministic tool can.

Detection is not extraction. The tool flags the duplicate; the engineer extracts on the second occurrence (F3 DRY). Don't wait for three.

Repo layout, CI, and ESLint wiring for these paths live on /implementation — not repeated on every tenet.

§5

Counter-argument

Counter

The honest steelman is Sandi Metz's.[4] Detection without judgement is exactly the failure mode that creates the wrong abstraction: jscpd sees twenty shared tokens, the team rushes to extract a parameter, two months later one caller needs a different shape and the abstraction has to be torn down. The duplication was cheaper than the abstraction it was eliminating. Anthony Sciamanna's work on incidental duplication makes the same point: not every shared shape is meaningful.[6]

§6

Counter-argument retort

Metz's wrong-abstraction warning[4] is doing real work, but it is a warning about extraction, not about detection. The tenet is about the latter. The team still applies judgement at the moment of decision; jscpd does not file the pull request. The cost of the detection pass is small (one CI step, a few seconds, no human attention required) and the cost of shipping a near-duplicate that lives for two years is enormous. The detection is asymmetric in the way Metz's extraction is not.

Sciamanna's incidental-duplication objection[6] is correct in the small case and exactly what S1 Wait for Three is for. Two callers with the same shape today is not duplication; three callers, two of which want the same behaviour, is. The detection tool flags the shape; the human reads the call sites; if the shape is incidental the team marks the report as a false positive in the jscpd config and moves on. The mechanism is editable, auditable, and reviewable in exactly the way an inline eslint-disable is not.

The harder steelman is the cost-of-Type-4 objection: the LLM pass costs more than the token pass, and the false-positive rate is higher. The reply is the cadence. Type-1 and Type-2 run on every commit; the cost is a few seconds and the recall is high. Type-3 runs on a weekly schedule against the whole repo; the cost is a few minutes and the recall is medium. Type-4 runs as part of the AI review pass on the change set, not the whole repo — AI6 Self-Review Pass is the right surface for it. Three different cadences, three different cost profiles, three different rates of false positive. The framing “detect duplication of logic, not just text” only holds if the cadence model is honest about which tool catches what.
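The cadence model can be written down as data, which also makes the "green is not duplication-free" point checkable. A minimal sketch; the type names and the `uncoveredCloneTypes` helper are hypothetical, and the tool-to-type mapping simply restates the one given in this tenet.

```typescript
type CloneType = 1 | 2 | 3 | 4;
type Trigger = "commit" | "weekly" | "review";

interface DetectionPass {
  tool: string;
  cloneTypes: CloneType[];
  trigger: Trigger;
}

// The three passes as this tenet describes them.
const PASSES: DetectionPass[] = [
  { tool: "jscpd", cloneTypes: [1, 2], trigger: "commit" },
  { tool: "Semgrep or ast-grep", cloneTypes: [3], trigger: "weekly" },
  { tool: "LLM review pass", cloneTypes: [4], trigger: "review" },
];

// Which clone types has nobody looked for, given the triggers that actually ran?
function uncoveredCloneTypes(passes: DetectionPass[], ran: Trigger[]): CloneType[] {
  const covered = new Set<CloneType>();
  for (const pass of passes) {
    if (ran.indexOf(pass.trigger) !== -1) {
      pass.cloneTypes.forEach((t) => covered.add(t));
    }
  }
  const all: CloneType[] = [1, 2, 3, 4];
  return all.filter((t) => !covered.has(t));
}
```

A change that has only passed the commit gate still has Types 3 and 4 unexamined, which is the honest reading of a green jscpd run.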

§7

Notes

[1] Chanchal K. Roy & James R. Cordy, A Survey on Software Clone Detection Research, Queen's University Technical Report 541 (2007). The canonical four-clone-types taxonomy: Type-1 (identical), Type-2 (renamed), Type-3 (with edits), Type-4 (semantically equivalent, syntactically different). Cited by every modern clone-detection paper.
[2] Zhang et al., Towards Understanding the Capability of LLMs on Code Clone Detection (arXiv 2308.01191, 2023). Empirical study showing current LLMs reliably detect Type-4 (semantic) clones across multiple languages, outperforming token-based and AST-based deterministic tools in that category.
[3] jscpd maintainers, jscpd, the JavaScript copy-paste detector. Token-based, Rabin-Karp algorithm. Reliable on Type-1, partial on Type-2 in mild/weak modes, silent on Type-3 and Type-4 by construction. The right default for a CI gate.


Last revised: 2026-04-27