arxiv: 1906.08076 · v1 · pith:7VOGYQEWnew · submitted 2019-06-19 · 💻 cs.SE

Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale

Pith reviewed 2026-05-25 20:04 UTC · model grok-4.3

classification 💻 cs.SE
keywords source code growth · provenance tracking · exponential growth · code duplication · software archive · version control history · public code corpus

The pith

Public source code files and commits have grown exponentially for over 40 years, and provenance tracking at this scale fits on ordinary computers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures the rate at which entirely new source code files and commits appear in a very large public archive and reports that both have increased exponentially for more than four decades. It also counts how often the same file shows up in many different commits and projects, finding a rapid combinatorial increase in duplicates. These two facts together determine how much storage and indexing power would be needed to record the full history of every public code artifact. The authors then test several ways of storing that history information and identify one design that handles the measured growth and duplication without needing specialized machines.

Core claim

Over more than 40 years the number of unique, never-before-seen source code files and commits in the archive has followed an exponential curve. At the same time the same files appear in a rapidly growing number of distinct commits, producing a combinatorial multiplication factor. A data model that records each file and commit together with the places where it has been observed can be built to accommodate both the exponential arrival rate and the multiplication factor while remaining runnable on commodity hardware.
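
As a rough illustration of the two ingredients in this claim, the sketch below records where each file has been observed and derives a multiplication factor from that record. It is not the paper's benchmarked data model. All names (`build_occurrence_index`, `multiplication_factor`, the toy identifiers) are hypothetical, and treating the multiplication factor as the mean number of distinct commits per unique file is an assumption; the paper may define it differently.

```python
from collections import defaultdict


def build_occurrence_index(observations):
    """Map each file id to the set of commit ids in which it was observed.

    `observations` is an iterable of (file_id, commit_id) pairs. Using a set
    means a repeated observation of the same pair is stored only once.
    """
    index = defaultdict(set)
    for file_id, commit_id in observations:
        index[file_id].add(commit_id)
    return index


def multiplication_factor(index):
    """Mean number of distinct commits per unique file (assumed definition)."""
    if not index:
        return 0.0
    total_occurrences = sum(len(commits) for commits in index.values())
    return total_occurrences / len(index)


# Toy data: file_a is seen in three commits, file_b in one.
observations = [
    ("file_a", "commit_1"),
    ("file_a", "commit_2"),
    ("file_a", "commit_3"),
    ("file_b", "commit_2"),
    ("file_a", "commit_2"),  # repeated observation, deduplicated by the set
]

index = build_occurrence_index(observations)
factor = multiplication_factor(index)
print(f"unique files: {len(index)}, multiplication factor: {factor}")
```

A flat index like this stores one entry per (file, commit) pair, so its size grows with the number of unique files times the multiplication factor. That product is the quantity a provenance store has to absorb.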

What carries the argument

The benchmarked data model for provenance that records observations of files and commits across contexts while accounting for the measured multiplication factor of identical artifacts.

Load-bearing premise

The archive contains a sufficiently complete sample of all publicly available source code for its measured growth and duplication patterns to apply to the full body of public code.

What would settle it

A count of new files and commits in a substantially larger public corpus that shows clearly sub-exponential growth over the same period, or a direct resource measurement showing that the chosen data model exceeds commodity hardware limits when loaded with the observed volume.
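
One simple way to frame the first test is a log-linear fit: if yearly counts of new artifacts grow exponentially, the logarithm of the count is roughly a straight line in time. The sketch below fits that line by ordinary least squares and converts the slope into a doubling time. The function names and the synthetic data are hypothetical, and this is not the fitting procedure used in the paper, which the text above does not describe.

```python
import math


def fit_exponential(years, counts):
    """Least-squares fit of log(count) = a + b * year; returns (a, b).

    b is the continuous growth rate per year. Counts must be positive.
    """
    n = len(years)
    logs = [math.log(c) for c in counts]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def doubling_time(rate):
    """Years needed for the count to double at continuous growth rate `rate`."""
    return math.log(2) / rate


# Synthetic data (not from the paper): counts that double every 2 years.
years = list(range(2000, 2010))
counts = [1000 * 2 ** ((y - 2000) / 2) for y in years]

a, b = fit_exponential(years, counts)
print(f"growth rate per year: {b:.4f}, doubling time: {doubling_time(b):.2f} years")
```

On real data, clearly curved residuals in log space, or a slope that keeps falling when the fit is restricted to later years, would point toward sub-exponential growth.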

Figures

Figures reproduced from arXiv: 1906.08076 by Guillaume Rousseau (UPD7), Roberto Di Cosmo (IRIF), Stefano Zacchiroli (IRIF).

Figure 1: Software Heritage Merkle DAG with crawling information.
Figure 2: Global production of original software artifacts over time, in terms of never-seen-before …
Figure 3: The three layers of multiplication in public source code: SLOCs occurring in source code …
Figure 4: Top: cumulative (upper curve) and simple (lower curve) multiplication factor of unique …
Figure 5: Distribution of normalized SLOC lengths in a sample of 2.5 M contents that appear at …
Figure 6: Multiplication factor of normalized SLOCs as the number of unique contents they appear …
Figure 7: Duplication of revisions across origins.
Figure 8: Distribution of origin size as the number of revisions they host.
Figure 9: Provenance tracking models, entity-relationship (E-R) views
Figure 10: Evolution over time of the sizes of different provenance data models, in terms of entities …
Original abstract

We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On such corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how much the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implication of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper analyzes the Software Heritage archive (4B unique files, 1B commits from 50M projects) to quantify exponential growth rates of never-seen-before source code files and commits over >40 years, measures the multiplication factor due to duplication across contexts, and benchmarks data models for provenance tracking at this scale, identifying a viable model deployable on commodity hardware for the entire body of public source code.

Significance. If the growth and duplication measurements are representative, the work supplies concrete empirical grounding for provenance system design in software engineering and archival research. The scale of the corpus and the identification of a maintainable data model constitute practical strengths that could inform future large-scale tracking infrastructure.

major comments (2)
  1. [Abstract] the claim that measured growth rates and the benchmarked model apply 'for the entire body of publicly available source code' is load-bearing yet rests on the untested assumption that Software Heritage ingestion provides uniform coverage across decades and project types; no cross-validation against independent corpora (e.g., GitHub mirrors or Debian snapshots) is described to bound potential sampling bias in older material.
  2. [Abstract] the exponential growth claim is presented without any description of data selection criteria, statistical fitting procedure, confidence intervals, or error estimation, preventing assessment of whether the reported rates are robust to variations in archive completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these comments on the abstract. We respond to each point below and will make revisions to address the concerns about qualification of claims and methodological transparency.

Point-by-point responses
  1. Referee: [Abstract] the claim that measured growth rates and the benchmarked model apply 'for the entire body of publicly available source code' is load-bearing yet rests on the untested assumption that Software Heritage ingestion provides uniform coverage across decades and project types; no cross-validation against independent corpora (e.g., GitHub mirrors or Debian snapshots) is described to bound potential sampling bias in older material.

    Authors: We agree that the phrasing in the abstract overstates generalizability without explicit qualification of coverage. Software Heritage is the largest known public corpus, but ingestion is not guaranteed to be uniform. We will revise the abstract to qualify the scope and add a short discussion of known ingestion characteristics and potential biases in the manuscript body. revision: yes

  2. Referee: [Abstract] the exponential growth claim is presented without any description of data selection criteria, statistical fitting procedure, confidence intervals, or error estimation, preventing assessment of whether the reported rates are robust to variations in archive completeness.

    Authors: The body of the paper details the corpus construction and growth quantification, but the abstract indeed omits the fitting procedure and uncertainty measures. We will revise the abstract to include a concise description of the data selection and exponential fitting approach, and ensure confidence intervals are reported alongside the growth rates in the results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical counts and benchmarks from archive data, no derivations or self-referential fits.

full rationale

The paper reports direct measurements of growth rates, duplication factors, and data model benchmarks on the Software Heritage corpus. No equations, fitted parameters relabeled as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on observed counts rather than any reduction to prior self-defined quantities. Representativeness concerns are external validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical measurement study on an existing archive; the abstract introduces no free parameters, mathematical axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5746 in / 1136 out tokens · 44463 ms · 2026-05-25T20:04:39.535213+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages

  [1] Jean-François Abramatic, Roberto Di Cosmo, and Stefano Zacchiroli. Building the universal archive of source code. Communications of the ACM, 61(10):29–31, October 2018.

  [2] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.

  [3] Miltiadis Allamanis and Charles A. Sutton. Mining source code repositories at massive scale using language modeling. In Thomas Zimmermann, Massimiliano Di Penta, and Sunghun Kim, editors, Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, San Francisco, CA, USA, May 18-19, 2013, pages 207–216. IEEE Computer Society, 2013.

  [4] H. Borges, A. Hora, and M. T. Valente. Understanding the factors that impact the popularity of github repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 334–344, October 2016.

  [5] Frederick P. Brooks, Jr. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1978.

  [6] Matthieu Caneill, Daniel M. Germán, and Stefano Zacchiroli. The Debsources dataset: two decades of free and open source software. Empirical Software Engineering, 22(3):1405–1437, 2017.

  [7] Kevin Crowston, Kangning Wei, James Howison, and Andrea Wiggins. Free/libre open-source software development: What we know and what we do not know. ACM Comput. Surv., 44(2):7:1–7:35, March 2008.

  [8] Julius Davies, Daniel M. Germán, Michael W. Godfrey, and Abram Hindle. Software bertillonage - determining the provenance of software development artifacts. Empirical Software Engineering, 18(6):1195–1237, 2013.

  [9] Roberto Di Cosmo, Morane Gruenpeter, and Stefano Zacchiroli. Identifiers for digital objects: the case of software source code preservation. In Proceedings of the 15th International Conference on Digital Preservation, iPRES 2018, Boston, USA, September 2018. Available from https://hal.archives-ouvertes.fr/hal-01865790

  [10] Roberto Di Cosmo and Stefano Zacchiroli. Software heritage: Why and how to preserve software source code. In Proceedings of the 14th International Conference on Digital Preservation, iPRES 2017, Kyoto, Japan, September 2017. Available from https://hal.archives-ouvertes.fr/hal-01590958

  [11] Sergey N Dorogovtsev and Jose FF Mendes. Evolution of networks. Advances in physics, 51(4):1079–1187, 2002.

  [12] Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N Nguyen. Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In Proceedings of the 2013 International Conference on Software Engineering, pages 422–431. IEEE Press, 2013.

  [13] Daniel M. Germán, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol. Code siblings: Technical and legal implications of copying code between applications. In Godfrey and Whitehead [16], pages 81–90.

  [14] Michael W. Godfrey. Understanding software artifact provenance. Sci. Comput. Program., 97:86–90, 2015.

  [15] Michael W. Godfrey, Daniel M. German, Julius Davies, and Abram Hindle. Determining the provenance of software artifacts. In Proceedings of the 5th International Workshop on Software Clones, IWSC ’11, pages 65–66, New York, NY, USA, 2011. ACM.

  [16] Michael W. Godfrey and Jim Whitehead, editors. Proceedings of the 6th International Working Conference on Mining Software Repositories, MSR 2009 (Co-located with ICSE), Vancouver, BC, Canada, May 16-17, 2009, Proceedings. IEEE Computer Society, 2009.

  [17] Georgios Gousios, Martin Pinzger, and Arie van Deursen. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, pages 345–355. ACM, 2014.

  [18] Gustavo Grieco, Guillermo Luis Grinblat, Lucas Uzal, Sanjay Rawat, Josselin Feist, and Laurent Mounier. Toward large-scale vulnerability discovery using machine learning. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, CODASPY ’16, pages 85–96, New York, NY, USA, 2016. ACM.

  [19] Ahmed E Hassan. The road ahead for mining software repositories. In Frontiers of Software Maintenance, 2008. FoSM 2008., pages 48–57. IEEE, 2008.

  [20] Les Hatton, Diomidis Spinellis, and Michiel van Genuchten. The long-term growth rate of evolving software: Empirical results and implications. Journal of Software: Evolution and Process, 29(5), 2017.

  [21] Israel Herraiz, Daniel Rodríguez, Gregorio Robles, and Jesús M. González-Barahona. The evolution of the laws of software evolution: A discussion based on a systematic literature review. ACM Comput. Surv., 46(2):28:1–28:28, 2013.

  [22] T. Ishio, R. G. Kula, T. Kanda, D. M. German, and K. Inoue. Software Ingredients: Detection of Third-Party Component Reuse in Java Software Release. In 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pages 339–350, May 2016.

  [23] Meir M. Lehman. On understanding laws, evolution, and conservation in the large-program life cycle. Journal of Systems and Software, 1:213–221, 1980.

  [24] Frank Li and Vern Paxson. A large-scale empirical study of security patches. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pages 2201–2215, New York, NY, USA, 2017. ACM.

  [25] Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. Déjàvu: a map of code duplicates on github. PACMPL, 1(OOPSLA):84:1–84:28, 2017.

  [26] Vadim Markovtsev and Waren Long. Public git archive: a big code dataset for all. In Andy Zaidman, Yasutaka Kamei, and Emily Hill, editors, Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, May 28-29, 2018, pages 34–37. ACM, 2018.

  [27] Matias Martinez and Martin Monperrus. Mining software repair models for reasoning on the search space of automated program fixing. Empirical Software Engineering, 20(1):176–205, 2015.

  [28] Ralph C. Merkle. A digital signature based on a conventional encryption function. In Carl Pomerance, editor, Advances in Cryptology - CRYPTO ’87, A Conference on the Theory and Applications of Cryptographic Techniques, volume 293 of Lecture Notes in Computer Science, pages 369–378. Springer, 1987.

  [29] Audris Mockus. Amassing and indexing a large sample of version control systems: Towards the census of public source code history. In Godfrey and Whitehead [16], pages 11–20.

  [30] Antoine Pietri, Diomidis Spinellis, and Stefano Zacchiroli. The software heritage graph dataset: Public software development under one roof. In MSR 2019: The 16th International Conference on Mining Software Repositories, pages 138–142. IEEE, 2019.

  [31] Guillaume Rousseau and Maxime Biais. Computer Tool for Managing Digital Documents, February 2010. CIB: G06F17/30; G06F21/10; G06F21/64.

  [32] Yuichi Semura, Norihiro Yoshida, Eunjong Choi, and Katsuro Inoue. Ccfindersw: Clone detection tool with flexible multilingual tokenization. In Jian Lv, He Jason Zhang, Mike Hinchey, and Xiao Liu, editors, 24th Asia-Pacific Software Engineering Conference, APSEC 2017, Nanjing, China, December 4-8, 2017, pages 654–659. IEEE Computer Society, 2017.

  [33] Diomidis Spinellis. A repository of Unix history and evolution. Empirical Software Engineering, 22(3):1372–1404, 2017.

  [34] Megan Squire. The lives and deaths of open source code forges. In Lorraine Morgan, editor, Proceedings of the 13th International Symposium on Open Collaboration, OpenSym 2017, Galway, Ireland, August 23-25, 2017, pages 15:1–15:8. ACM, 2017.

  [35] Jeffrey Svajlenko and Chanchal Kumar Roy. Fast and flexible large-scale clone detection with cloneworks. In Sebastián Uchitel, Alessandro Orso, and Martin P. Robillard, editors, Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 - Companion Volume, pages 27–30. IEEE Computer So-...

  [36] Suresh Thummalapenta, Luigi Cerulo, Lerina Aversano, and Massimiliano Di Penta. An empirical study on the maintenance of source code clones. Empirical Software Engineering, 15(1):1–34, 2010.

  [37] Nitin M. Tiwari, Ganesha Upadhyaya, and Hridesh Rajan. Candoia: a platform and ecosystem for mining software repositories tools. In Laura K. Dillon, Willem Visser, and Laurie Williams, editors, Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, pages 759–764. ACM, 2016.

  [38] C. Vendome. A large scale study of license usage on github. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 2, pages 772–774, May 2015.

  [39] Yuhao Wu, Yuki Manabe, Tetsuya Kanda, Daniel M. Germán, and Katsuro Inoue. Analysis of license inconsistency in large collections of open source projects. Empirical Software Engineering, 22(3):1194–1222, 2017.

  [40] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for eclipse. In Predictor Models in Software Engineering, 2007. PROMISE’07: ICSE Workshops 2007. International Workshop on, pages 9–9, May 2007.

  [41] Thomas Zimmermann, Peter Weißgerber, Stephan Diehl, and Andreas Zeller. Mining version histories to guide software changes. In Anthony Finkelstein, Jacky Estublier, and David S. Rosenblum, editors, 26th International Conference on Software Engineering (ICSE 2004), 23-28 May 2004, Edinburgh, United Kingdom, pages 563–572. IEEE Computer Society, 2004.