Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale
Pith reviewed 2026-05-25 20:04 UTC · model grok-4.3
The pith
Public source code files and commits have grown exponentially for over 40 years, and provenance tracking at this scale fits on ordinary computers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Over more than 40 years the number of unique, never-before-seen source code files and commits in the archive has followed an exponential curve. At the same time the same files appear in a rapidly growing number of distinct commits, producing a combinatorial multiplication factor. A data model that records each file and commit together with the places where it has been observed can be built to accommodate both the exponential arrival rate and the multiplication factor while remaining runnable on commodity hardware.
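To make the two measured quantities concrete, here is a minimal sketch on a toy history. The commit tuples, the function names, and the use of a plain average as the multiplication factor are all assumptions made for illustration; this summary does not give the paper's exact definitions.

```python
from collections import defaultdict

# Hypothetical toy history: (commit id, year, set of file content hashes).
commits = [
    ("c1", 2001, {"f1", "f2"}),
    ("c2", 2002, {"f1", "f2", "f3"}),
    ("c3", 2002, {"f1", "f3", "f4"}),
    ("c4", 2003, {"f1", "f4", "f5", "f6"}),
]

def new_files_per_year(history):
    """Count files whose content hash is observed for the first time in each year."""
    seen = set()
    counts = defaultdict(int)
    for _commit, year, files in sorted(history, key=lambda c: c[1]):
        fresh = files - seen
        counts[year] += len(fresh)
        seen |= fresh
    return dict(counts)

def multiplication_factor(history):
    """Average number of distinct commits in which each unique file appears
    (an assumed stand-in for the paper's multiplication factor)."""
    occurrences = defaultdict(set)
    for commit, _year, files in history:
        for file_hash in files:
            occurrences[file_hash].add(commit)
    total = sum(len(commit_ids) for commit_ids in occurrences.values())
    return total / len(occurrences)

originals = new_files_per_year(commits)
factor = multiplication_factor(commits)
print(originals, factor)
```

In this toy history each year contributes two never-before-seen files, while the six unique files account for twelve file-in-commit occurrences. That gap between unique artifacts and their occurrences is what a provenance store has to absorb.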
What carries the argument
The benchmarked data model for provenance that records observations of files and commits across contexts while accounting for the measured multiplication factor of identical artifacts.
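As an illustration only, a provenance index of this kind could take the following shape. The class, its method names, and the in-memory dictionaries are hypothetical; this is not one of the data models the paper benchmarks, which this summary does not describe.

```python
from collections import defaultdict

class ProvenanceIndex:
    """Toy in-memory index: which commits contain a file, and where commits were seen."""

    def __init__(self):
        self.commits_of_file = defaultdict(set)    # file hash -> commit ids
        self.origins_of_commit = defaultdict(set)  # commit id -> origin URLs
        self.commit_time = {}                      # commit id -> timestamp

    def record(self, file_hash, commit_id, timestamp, origin):
        """Store one observation of a file inside a commit found at an origin."""
        self.commits_of_file[file_hash].add(commit_id)
        self.origins_of_commit[commit_id].add(origin)
        self.commit_time[commit_id] = timestamp

    def first_seen(self, file_hash):
        """Return (timestamp, commit id) of the earliest known occurrence, or None."""
        commit_ids = self.commits_of_file.get(file_hash)
        if not commit_ids:
            return None
        return min((self.commit_time[c], c) for c in commit_ids)

    def where(self, file_hash):
        """Return every origin in which some commit containing the file was observed."""
        found = set()
        for commit_id in self.commits_of_file.get(file_hash, ()):
            found |= self.origins_of_commit[commit_id]
        return found

# Placeholder identifiers and URLs, chosen only for the example.
index = ProvenanceIndex()
index.record("f1", "c2", 1_100_000_000, "https://example.org/fork")
index.record("f1", "c1", 1_000_000_000, "https://example.org/upstream")
index.record("f2", "c2", 1_100_000_000, "https://example.org/fork")
print(index.first_seen("f1"), sorted(index.where("f1")))
```

Keeping file-to-commit and commit-to-origin as separate mappings means a file shared by many commits is stored once per commit rather than once per origin. That is one plausible way to limit the effect of the multiplication factor, offered here as a design assumption rather than the paper's finding.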
Load-bearing premise
The archive contains a sufficiently complete sample of all publicly available source code for its measured growth and duplication patterns to apply to the full body of public code.
What would settle it
A count of new files and commits in a substantially larger public corpus that shows clearly sub-exponential growth over the same period, or a direct resource measurement showing that the chosen data model exceeds commodity hardware limits when loaded with the observed volume.
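The first test could start from a log-linear fit of yearly counts, sketched below. The series is synthetic, with an assumed doubling period of two years; none of the numbers come from the paper, and `fit_exponential` is a hypothetical helper.

```python
import math

def fit_exponential(years, counts):
    """Least-squares fit of log(count) = a + b * year.

    Returns (b, doubling time in years, R^2 of the fit in log space).
    """
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    syy = sum((y - mean_y) ** 2 for y in logs)
    rate = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return rate, math.log(2) / rate, r_squared

# Synthetic counts that double every 2 years (assumed, not measured).
years = list(range(1980, 2020))
counts = [1000 * 2 ** ((y - 1980) / 2) for y in years]
rate, doubling, r2 = fit_exponential(years, counts)
print(rate, doubling, r2)
```

A high R² in log space is only a starting point. Distinguishing exponential from sub-exponential growth would also need a comparison against alternative models (for example polynomial growth) and uncertainty estimates on the fitted rate.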
Original abstract
We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On such corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how much the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implication of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes the Software Heritage archive (4B unique files, 1B commits from 50M projects) to quantify exponential growth rates of never-seen-before source code files and commits over >40 years, measures the multiplication factor due to duplication across contexts, and benchmarks data models for provenance tracking at this scale, identifying a viable model deployable on commodity hardware for the entire body of public source code.
Significance. If the growth and duplication measurements are representative, the work supplies concrete empirical grounding for provenance system design in software engineering and archival research. The scale of the corpus and the identification of a maintainable data model constitute practical strengths that could inform future large-scale tracking infrastructure.
major comments (2)
- [Abstract] The claim that measured growth rates and the benchmarked model apply 'for the entire body of publicly available source code' is load-bearing yet rests on the untested assumption that Software Heritage ingestion provides uniform coverage across decades and project types; no cross-validation against independent corpora (e.g., GitHub mirrors or Debian snapshots) is described to bound potential sampling bias in older material.
- [Abstract] The exponential growth claim is presented without any description of data selection criteria, statistical fitting procedure, confidence intervals, or error estimation, preventing assessment of whether the reported rates are robust to variations in archive completeness.
Simulated Author's Rebuttal
We thank the referee for these comments on the abstract. We respond to each point below and will make revisions to address the concerns about qualification of claims and methodological transparency.
- Referee: [Abstract] The claim that measured growth rates and the benchmarked model apply 'for the entire body of publicly available source code' is load-bearing yet rests on the untested assumption that Software Heritage ingestion provides uniform coverage across decades and project types; no cross-validation against independent corpora (e.g., GitHub mirrors or Debian snapshots) is described to bound potential sampling bias in older material.
  Authors: We agree that the phrasing in the abstract overstates generalizability without explicit qualification of coverage. Software Heritage is the largest known public corpus, but ingestion is not guaranteed to be uniform. We will revise the abstract to qualify the scope and add a short discussion of known ingestion characteristics and potential biases in the manuscript body.
  Revision: yes
- Referee: [Abstract] The exponential growth claim is presented without any description of data selection criteria, statistical fitting procedure, confidence intervals, or error estimation, preventing assessment of whether the reported rates are robust to variations in archive completeness.
  Authors: The body of the paper details the corpus construction and growth quantification, but the abstract indeed omits the fitting procedure and uncertainty measures. We will revise the abstract to include a concise description of the data selection and exponential fitting approach, and ensure confidence intervals are reported alongside the growth rates in the results.
  Revision: yes
Circularity Check
No circularity: empirical counts and benchmarks from archive data, no derivations or self-referential fits.
The paper reports direct measurements of growth rates, duplication factors, and data model benchmarks on the Software Heritage corpus. No equations, fitted parameters relabeled as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on observed counts rather than any reduction to prior self-defined quantities. Representativeness concerns are external validity issues, not circularity.