SourcererCC: Scaling Code Clone Detection to Big Code

Chanchal K. Roy; Cristina V. Lopes; Hitesh Sajnani; Jeffrey Svajlenko; Vaibhav Saini

arxiv: 1512.06448 · v1 · pith:PEVLEF6Vnew · submitted 2015-12-20 · 💻 cs.SE


Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K. Roy, Cristina V. Lopes


keywords: clone, clones, code, sourcerercc, large, recall, index, inter-project


Despite a decade of active research, there is a marked lack of clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones, where significant editing may have taken place in the cloned code. We present SourcererCC, a token-based clone detector that targets three clone types and exploits an index to scale to large inter-project repositories on a standard workstation. SourcererCC uses an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall, and precision of SourcererCC and compare it to four publicly available, state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) BigCloneBench, a large benchmark of real clones, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find that SourcererCC has both high recall and precision and scales to a large inter-project repository (250 MLOC) on a standard workstation.
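
The abstract does not specify the similarity function or the exact form of the token-ordering heuristics, so the following is only a minimal sketch of how an inverted index combined with an ordering-based filter can cut down comparisons. It assumes a bag-of-tokens overlap similarity with a threshold `theta`, and a prefix filter under one global rarest-first token ordering. The function name `find_clone_pairs`, the input format, and the default `theta=0.8` are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def find_clone_pairs(blocks, theta=0.8):
    """Return pairs of block ids whose token overlap meets the threshold.

    blocks: dict mapping a block id to its list of tokens (assumed format).
    theta:  assumed similarity threshold in (0, 1].
    """
    # One global ordering shared by all blocks: rarest tokens first,
    # ties broken alphabetically so the ordering is deterministic.
    freq = Counter(tok for toks in blocks.values() for tok in toks)

    index = defaultdict(set)  # token -> ids of indexed blocks with it in their prefix
    bags = {}                 # block id -> token multiset, kept for verification
    pairs = set()

    for bid, toks in blocks.items():
        bag = Counter(toks)
        size = len(toks)
        ordered = sorted(toks, key=lambda t: (freq[t], t))
        # Only this prefix is indexed. Two blocks that reach the overlap
        # threshold must share at least one token within their prefixes.
        prefix = ordered[: size - math.ceil(theta * size) + 1]

        # Query the index: candidates are blocks sharing a prefix token.
        candidates = set()
        for tok in prefix:
            candidates |= index.get(tok, set())

        # Verify each candidate with a full multiset overlap count.
        for cid in candidates:
            other = bags[cid]
            needed = math.ceil(theta * max(size, sum(other.values())))
            overlap = sum((bag & other).values())
            if overlap >= needed:
                pairs.add((cid, bid))

        for tok in set(prefix):
            index[tok].add(bid)
        bags[bid] = bag

    return pairs


blocks = {
    "b1": ["int", "a", "=", "b", "+", "c", ";", "return", "a", ";"],
    "b2": ["int", "a", "=", "b", "+", "d", ";", "return", "a", ";"],
    "b3": ["while", "x", ">", "0", ":", "x", "-=", "1"],
}
print(find_clone_pairs(blocks))
```

In this toy input, `b1` and `b2` differ in a single token and are reported as a pair, while `b3` shares no prefix token with either block and is never compared against them. Indexing only the prefix is what keeps both the index and the candidate lists small in this sketch.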


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Decomposition Is the Fingerprint: Per-Component Identity for Agent Skills
cs.CR 2026-06 unverdicted novelty 6.0

A per-component SimHash fingerprint supplies structural identity for AI agent skills, recovering family membership under paraphrase and refactoring with AUC 0.974 while localizing changes.