Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale

Guillaume Rousseau (UPD7); Roberto Di Cosmo (IRIF); Stefano Zacchiroli (IRIF)

arxiv: 1906.08076 · v1 · pith:7VOGYQEWnew · submitted 2019-06-19 · 💻 cs.SE


Pith reviewed 2026-05-25 20:04 UTC · model grok-4.3

classification 💻 cs.SE

keywords: source code growth, provenance tracking, exponential growth, code duplication, software archive, version control history, public code corpus


The pith

Public source code files and commits have grown exponentially for over 40 years, and provenance tracking at this scale fits on ordinary computers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures the rate at which entirely new source code files and commits appear in a very large public archive and reports that both have increased exponentially for more than four decades. It also counts how often the same file shows up in many different commits and projects, finding a rapid combinatorial increase in duplicates. These two facts together determine how much storage and indexing power would be needed to record the full history of every public code artifact. The authors then test several ways of storing that history information and identify one design that handles the measured growth and duplication without needing specialized machines.

Core claim

Over more than 40 years the number of unique, never-before-seen source code files and commits in the archive has followed an exponential curve. At the same time the same files appear in a rapidly growing number of distinct commits, producing a combinatorial multiplication factor. A data model that records each file and commit together with the places where it has been observed can be built to accommodate both the exponential arrival rate and the multiplication factor while remaining runnable on commodity hardware.
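The multiplication factor can be made concrete with a small sketch. It assumes the simplest reading of the claim: the factor is the number of distinct (artifact, context) occurrences divided by the number of unique artifacts. The function name, the toy identifiers, and that exact definition are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def multiplication_factor(occurrences):
    """Average number of distinct contexts per unique artifact.

    `occurrences` is an iterable of (artifact_id, context_id) pairs,
    for example (file hash, commit hash). Repeated pairs count once.
    This ratio is an assumed, simplified definition for illustration.
    """
    contexts = defaultdict(set)
    for artifact_id, context_id in occurrences:
        contexts[artifact_id].add(context_id)
    if not contexts:
        return 0.0
    total_occurrences = sum(len(seen) for seen in contexts.values())
    return total_occurrences / len(contexts)


# Toy data, invented for illustration only.
occurrences = [
    ("file_a", "commit_1"),
    ("file_a", "commit_2"),
    ("file_a", "commit_3"),
    ("file_b", "commit_2"),
    ("file_a", "commit_2"),  # repeated pair, counted once
]
factor = multiplication_factor(occurrences)
```

In this toy input, `file_a` appears in three commits and `file_b` in one, so two unique files account for four occurrences. A provenance index that stores one row per occurrence grows with the numerator, while the count of original artifacts grows with the denominator.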

What carries the argument

The benchmarked data model for provenance that records observations of files and commits across contexts while accounting for the measured multiplication factor of identical artifacts.
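As a rough illustration of what "recording observations across contexts" can look like, the sketch below keeps two layered relations: which revisions contain a content, and which origins host a revision. The class name, method names, and the two-relation layout are hypothetical choices made for this example; the page does not say which of the paper's benchmarked models corresponds to this shape.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProvenanceIndex:
    """Toy two-layer provenance index (hypothetical, for illustration)."""

    # content id -> set of revision ids in which it was observed
    content_in_revision: dict = field(default_factory=lambda: defaultdict(set))
    # revision id -> set of origin URLs in which it was observed
    revision_in_origin: dict = field(default_factory=lambda: defaultdict(set))

    def observe(self, content_id, revision_id, origin_url):
        """Record that a content was seen in a revision hosted at an origin."""
        self.content_in_revision[content_id].add(revision_id)
        self.revision_in_origin[revision_id].add(origin_url)

    def origins_of(self, content_id):
        """Return every origin reachable from a content through its revisions."""
        origins = set()
        for revision_id in self.content_in_revision.get(content_id, ()):
            origins |= self.revision_in_origin.get(revision_id, set())
        return origins


index = ProvenanceIndex()
index.observe("content_1", "rev_1", "https://example.org/project-a")
index.observe("content_1", "rev_2", "https://example.org/project-b")
index.observe("content_2", "rev_2", "https://example.org/project-b")
```

The design point is that a flat table of (content, revision, origin) triples would store one row per combination, so its size would scale with the product of the duplication layers. Splitting the relation in two stores each layer once and recovers the combinations at query time.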

Load-bearing premise

The archive contains a sufficiently complete sample of all publicly available source code for its measured growth and duplication patterns to apply to the full body of public code.

What would settle it

A count of new files and commits in a substantially larger public corpus that shows clearly sub-exponential growth over the same period, or a direct resource measurement showing that the chosen data model exceeds commodity hardware limits when loaded with the observed volume.
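The first check could be run along these lines: fit a straight line to the logarithm of the yearly counts and read off the doubling time. The sketch uses an ordinary least-squares fit in plain Python on synthetic counts generated to double every 2.5 years. Both the data and the function names are assumptions made for illustration; neither reproduces the paper's figures or its fitting procedure.

```python
import math


def fit_exponential(years, counts):
    """Least-squares fit of ln(count) = a + b * year; returns (a, b)."""
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(years, counts, a, b):
    """Coefficient of determination of the fit, computed in log space."""
    logs = [math.log(c) for c in counts]
    mean_y = sum(logs) / len(logs)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(years, logs))
    ss_tot = sum((y - mean_y) ** 2 for y in logs)
    return 1.0 - ss_res / ss_tot


# Synthetic counts that double every 2.5 years (invented, not measured).
years = list(range(1980, 2020, 5))
counts = [1000 * 2 ** ((year - 1980) / 2.5) for year in years]

a, b = fit_exponential(years, counts)
doubling_time = math.log(2) / b
fit_quality = r_squared(years, counts, a, b)
```

On real data, sub-exponential growth would show up as a curve that bends downward on a semi-log plot, so the slope fitted on the later half of the period would be visibly smaller than the slope fitted on the earlier half.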

Figures

Figures reproduced from arXiv: 1906.08076 by Guillaume Rousseau (UPD7), Roberto Di Cosmo (IRIF), Stefano Zacchiroli (IRIF).

**Figure 2.** Global production of original software artifacts over time, in terms of never-seen-before …

**Figure 3.** The three layers of multiplication in public source code: SLOCs occurring in source code …

**Figure 4.** Top: cumulative (upper curve) and simple (lower curve) multiplication factor of unique …

**Figure 5.** Distribution of normalized SLOC lengths in a sample of 2.5 M contents that appear at …

**Figure 6.** Multiplication factor of normalized SLOCs as the number of unique contents they appear …

**Figure 7.** Duplication of revisions across origins.

**Figure 8.** Distribution of origin size as the number of revisions they host.

**Figure 9.** Provenance tracking models, entity-relationship (E-R) views

**Figure 10.** Evolution over time of the sizes of different provenance data models, in terms of entities …

Original abstract

We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On such corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how much the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implication of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper delivers concrete exponential growth rates and duplication factors from the Software Heritage corpus plus a working provenance model, but the leap to all public code depends on untested archive representativeness.


This paper measures exponential growth in never-seen-before files and commits over more than 40 years inside the Software Heritage archive, records a combinatorial multiplication of identical artifacts across contexts, and benchmarks provenance data models to find one that runs on commodity hardware. The concrete counts and the scaling benchmark are the new pieces; prior work had not quantified these rates at this corpus size. Handling 4B files and 1B commits is solid engineering, and identifying a maintainable model is useful for anyone building provenance systems.

The central soft spot is the representativeness assumption. The archive's crawling history is not shown to be uniform across decades or project types, so older code could be under-sampled and the growth curve partly an artifact of improving coverage. No cross-checks against GitHub mirrors or Debian snapshots appear in the abstract, which leaves the claim about the entire body of public code resting on an extrapolation that is not yet bounded. Method details on selection criteria, error estimation, and exact benchmark numbers are also absent, so soundness cannot be fully checked from the abstract alone.

The work is for people who need large-scale empirical data on code duplication or provenance infrastructure. A reader focused on archive statistics or supply-chain tooling would extract value from the numbers even with the coverage caveat. It deserves peer review because the dataset scale and the practical model are worth referee scrutiny, though the authors will likely need to add coverage validation and method transparency.

Referee Report

2 major / 0 minor

Summary. The paper analyzes the Software Heritage archive (4B unique files, 1B commits from 50M projects) to quantify exponential growth rates of never-seen-before source code files and commits over >40 years, measures the multiplication factor due to duplication across contexts, and benchmarks data models for provenance tracking at this scale, identifying a viable model deployable on commodity hardware for the entire body of public source code.

Significance. If the growth and duplication measurements are representative, the work supplies concrete empirical grounding for provenance system design in software engineering and archival research. The scale of the corpus and the identification of a maintainable data model constitute practical strengths that could inform future large-scale tracking infrastructure.

major comments (2)

[Abstract] The claim that measured growth rates and the benchmarked model apply 'for the entire body of publicly available source code' is load-bearing yet rests on the untested assumption that Software Heritage ingestion provides uniform coverage across decades and project types; no cross-validation against independent corpora (e.g., GitHub mirrors or Debian snapshots) is described to bound potential sampling bias in older material.

[Abstract] The exponential growth claim is presented without any description of data selection criteria, statistical fitting procedure, confidence intervals, or error estimation, preventing assessment of whether the reported rates are robust to variations in archive completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these comments on the abstract. We respond to each point below and will make revisions to address the concerns about qualification of claims and methodological transparency.


Referee: [Abstract] The claim that measured growth rates and the benchmarked model apply 'for the entire body of publicly available source code' is load-bearing yet rests on the untested assumption that Software Heritage ingestion provides uniform coverage across decades and project types; no cross-validation against independent corpora (e.g., GitHub mirrors or Debian snapshots) is described to bound potential sampling bias in older material.

Authors: We agree that the phrasing in the abstract overstates generalizability without explicit qualification of coverage. Software Heritage is the largest known public corpus, but ingestion is not guaranteed to be uniform. We will revise the abstract to qualify the scope and add a short discussion of known ingestion characteristics and potential biases in the manuscript body. Revision: yes.

Referee: [Abstract] The exponential growth claim is presented without any description of data selection criteria, statistical fitting procedure, confidence intervals, or error estimation, preventing assessment of whether the reported rates are robust to variations in archive completeness.

Authors: The body of the paper details the corpus construction and growth quantification, but the abstract indeed omits the fitting procedure and uncertainty measures. We will revise the abstract to include a concise description of the data selection and exponential fitting approach, and ensure confidence intervals are reported alongside the growth rates in the results. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical counts and benchmarks from archive data, no derivations or self-referential fits.


The paper reports direct measurements of growth rates, duplication factors, and data model benchmarks on the Software Heritage corpus. No equations, fitted parameters relabeled as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on observed counts rather than any reduction to prior self-defined quantities. Representativeness concerns are external validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical measurement study on an existing archive; the abstract introduces no free parameters, mathematical axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5746 in / 1136 out tokens · 44463 ms · 2026-05-25T20:04:39.535213+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages

[1] Jean-François Abramatic, Roberto Di Cosmo, and Stefano Zacchiroli. Building the universal archive of source code. Communications of the ACM, 61(10):29–31, October 2018.

[2] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.

[3] Miltiadis Allamanis and Charles A. Sutton. Mining source code repositories at massive scale using language modeling. In Thomas Zimmermann, Massimiliano Di Penta, and Sunghun Kim, editors, Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, San Francisco, CA, USA, May 18-19, 2013, pages 207–216. IEEE Computer Society, 2013.

[4] H. Borges, A. Hora, and M. T. Valente. Understanding the factors that impact the popularity of github repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 334–344, October 2016.

[5] Frederick P. Brooks, Jr. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1978.

[6] Matthieu Caneill, Daniel M. Germán, and Stefano Zacchiroli. The Debsources dataset: two decades of free and open source software. Empirical Software Engineering, 22(3):1405–1437, 2017.

[7] Kevin Crowston, Kangning Wei, James Howison, and Andrea Wiggins. Free/libre open-source software development: What we know and what we do not know. ACM Comput. Surv., 44(2):7:1–7:35, March 2008.

[8] Julius Davies, Daniel M. Germán, Michael W. Godfrey, and Abram Hindle. Software bertillonage - determining the provenance of software development artifacts. Empirical Software Engineering, 18(6):1195–1237, 2013.

[9] Roberto Di Cosmo, Morane Gruenpeter, and Stefano Zacchiroli. Identifiers for digital objects: the case of software source code preservation. In Proceedings of the 15th International Conference on Digital Preservation, iPRES 2018, Boston, USA, September 2018. Available from https://hal.archives-ouvertes.fr/hal-01865790

[10] Roberto Di Cosmo and Stefano Zacchiroli. Software heritage: Why and how to preserve software source code. In Proceedings of the 14th International Conference on Digital Preservation, iPRES 2017, Kyoto, Japan, September 2017. Available from https://hal.archives-ouvertes.fr/hal-01590958

[11] Sergey N Dorogovtsev and Jose FF Mendes. Evolution of networks. Advances in physics, 51(4):1079–1187, 2002.

[12] Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N Nguyen. Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In Proceedings of the 2013 International Conference on Software Engineering, pages 422–431. IEEE Press, 2013.

[13] Daniel M. Germán, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol. Code siblings: Technical and legal implications of copying code between applications. In Godfrey and Whitehead [16], pages 81–90.

[14] Michael W. Godfrey. Understanding software artifact provenance. Sci. Comput. Program., 97:86–90, 2015.

[15] Michael W. Godfrey, Daniel M. German, Julius Davies, and Abram Hindle. Determining the provenance of software artifacts. In Proceedings of the 5th International Workshop on Software Clones, IWSC ’11, pages 65–66, New York, NY, USA, 2011. ACM.

[16] Michael W. Godfrey and Jim Whitehead, editors. Proceedings of the 6th International Working Conference on Mining Software Repositories, MSR 2009 (Co-located with ICSE), Vancouver, BC, Canada, May 16-17, 2009, Proceedings. IEEE Computer Society, 2009.

[17] Georgios Gousios, Martin Pinzger, and Arie van Deursen. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, pages 345–355. ACM, 2014.

[18] Gustavo Grieco, Guillermo Luis Grinblat, Lucas Uzal, Sanjay Rawat, Josselin Feist, and Laurent Mounier. Toward large-scale vulnerability discovery using machine learning. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, CODASPY ’16, pages 85–96, New York, NY, USA, 2016. ACM.

[19] Ahmed E Hassan. The road ahead for mining software repositories. In Frontiers of Software Maintenance, 2008. FoSM 2008., pages 48–57. IEEE, 2008.

[20] Les Hatton, Diomidis Spinellis, and Michiel van Genuchten. The long-term growth rate of evolving software: Empirical results and implications. Journal of Software: Evolution and Process, 29(5), 2017.

[21] Israel Herraiz, Daniel Rodríguez, Gregorio Robles, and Jesús M. González-Barahona. The evolution of the laws of software evolution: A discussion based on a systematic literature review. ACM Comput. Surv., 46(2):28:1–28:28, 2013.

[22] T. Ishio, R. G. Kula, T. Kanda, D. M. German, and K. Inoue. Software Ingredients: Detection of Third-Party Component Reuse in Java Software Release. In 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pages 339–350, May 2016.

[23] Meir M. Lehman. On understanding laws, evolution, and conservation in the large-program life cycle. Journal of Systems and Software, 1:213–221, 1980.

[24] Frank Li and Vern Paxson. A large-scale empirical study of security patches. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pages 2201–2215, New York, NY, USA, 2017. ACM.

[25] Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. Déjàvu: a map of code duplicates on github. PACMPL, 1(OOPSLA):84:1–84:28, 2017.

[26] Vadim Markovtsev and Waren Long. Public git archive: a big code dataset for all. In Andy Zaidman, Yasutaka Kamei, and Emily Hill, editors, Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, May 28-29, 2018, pages 34–37. ACM, 2018.

[27] Matias Martinez and Martin Monperrus. Mining software repair models for reasoning on the search space of automated program fixing. Empirical Software Engineering, 20(1):176–205, 2015.

[28] Ralph C. Merkle. A digital signature based on a conventional encryption function. In Carl Pomerance, editor, Advances in Cryptology - CRYPTO ’87, A Conference on the Theory and Applications of Cryptographic Techniques, volume 293 of Lecture Notes in Computer Science, pages 369–378. Springer, 1987.

[29] Audris Mockus. Amassing and indexing a large sample of version control systems: Towards the census of public source code history. In Godfrey and Whitehead [16], pages 11–20.

[30] Antoine Pietri, Diomidis Spinellis, and Stefano Zacchiroli. The software heritage graph dataset: Public software development under one roof. In MSR 2019: The 16th International Conference on Mining Software Repositories, pages 138–142. IEEE, 2019.

[31] Guillaume Rousseau and Maxime Biais. Computer Tool for Managing Digital Documents, February 2010. CIB: G06F17/30; G06F21/10; G06F21/64.

[32] Yuichi Semura, Norihiro Yoshida, Eunjong Choi, and Katsuro Inoue. Ccfindersw: Clone detection tool with flexible multilingual tokenization. In Jian Lv, He Jason Zhang, Mike Hinchey, and Xiao Liu, editors, 24th Asia-Pacific Software Engineering Conference, APSEC 2017, Nanjing, China, December 4-8, 2017, pages 654–659. IEEE Computer Society, 2017.

[33] Diomidis Spinellis. A repository of Unix history and evolution. Empirical Software Engineering, 22(3):1372–1404, 2017.

[34] Megan Squire. The lives and deaths of open source code forges. In Lorraine Morgan, editor, Proceedings of the 13th International Symposium on Open Collaboration, OpenSym 2017, Galway, Ireland, August 23-25, 2017, pages 15:1–15:8. ACM, 2017.

[35] Jeffrey Svajlenko and Chanchal Kumar Roy. Fast and flexible large-scale clone detection with cloneworks. In Sebastián Uchitel, Alessandro Orso, and Martin P. Robillard, editors, Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 - Companion Volume, pages 27–30. IEEE Computer So-...

[36] Suresh Thummalapenta, Luigi Cerulo, Lerina Aversano, and Massimiliano Di Penta. An empirical study on the maintenance of source code clones. Empirical Software Engineering, 15(1):1–34, 2010.

[37] Nitin M. Tiwari, Ganesha Upadhyaya, and Hridesh Rajan. Candoia: a platform and ecosystem for mining software repositories tools. In Laura K. Dillon, Willem Visser, and Laurie Williams, editors, Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, pages 759–764. ACM, 2016.

[38] C. Vendome. A large scale study of license usage on github. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 2, pages 772–774, May 2015.

[39] Yuhao Wu, Yuki Manabe, Tetsuya Kanda, Daniel M. Germán, and Katsuro Inoue. Analysis of license inconsistency in large collections of open source projects. Empirical Software Engineering, 22(3):1194–1222, 2017.

[40] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for eclipse. In Predictor Models in Software Engineering, 2007. PROMISE’07: ICSE Workshops 2007. International Workshop on, pages 9–9, May 2007.

[41] Thomas Zimmermann, Peter Weißgerber, Stephan Diehl, and Andreas Zeller. Mining version histories to guide software changes. In Anthony Finkelstein, Jacky Estublier, and David S. Rosenblum, editors, 26th International Conference on Software Engineering (ICSE 2004), 23-28 May 2004, Edinburgh, United Kingdom, pages 563–572. IEEE Computer Society, 2004.

[1] [1]

Building the universal archive of source code

Jean-Fran¸ cois Abramatic, Roberto Di Cosmo, and Stefano Zacchiroli. Building the universal archive of source code. Communications of the ACM , 61(10):29–31, October 2018

work page 2018

[2] [2]

Statistical mechanics of complex networks.Reviews of modern physics , 74(1):47, 2002

R´ eka Albert and Albert-L´ aszl´ o Barab´ asi. Statistical mechanics of complex networks.Reviews of modern physics , 74(1):47, 2002

work page 2002

[3] [3]

Miltiadis Allamanis and Charles A. Sutton. Mining source code repositories at massive scale using language modeling. In Thomas Zimmermann, Massimiliano Di Penta, and Sunghun Kim, editors, Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, San Francisco, CA, USA, May 18-19, 2013 , pages 207–216. IEEE Computer Soci- ety, 2013

work page 2013

[4] [4]

Borges, A

H. Borges, A. Hora, and M. T. Valente. Understanding the factors that impact the popularity of github repositories. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 334–344, October 2016

work page 2016

[5] [5]

Brooks, Jr

Frederick P. Brooks, Jr. The Mythical Man-Month: Essays on Software Engineering. Addison- Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1978

work page 1978

[6] [6]

Germ´ an, and Stefano Zacchiroli

Matthieu Caneill, Daniel M. Germ´ an, and Stefano Zacchiroli. The Debsources dataset: two decades of free and open source software. Empirical Software Engineering, 22(3):1405–1437, 2017

work page 2017

[7] [7]

Free/libre open- source software development: What we know and what we do not know

Kevin Crowston, Kangning Wei, James Howison, and Andrea Wiggins. Free/libre open- source software development: What we know and what we do not know. ACM Comput. Surv., 44(2):7:1–7:35, March 2008

work page 2008

[8] [8]

Germ´ an, Michael W

Julius Davies, Daniel M. Germ´ an, Michael W. Godfrey, and Abram Hindle. Software bertillon- age - determining the provenance of software development artifacts. Empirical Software En- gineering, 18(6):1195–1237, 2013

work page 2013

[9] [9]

Identiﬁers for digital ob- jects: the case of software source code preservation

Roberto Di Cosmo, Morane Gruenpeter, and Stefano Zacchiroli. Identiﬁers for digital ob- jects: the case of software source code preservation. In Proceedings of the 15th International Conference on Digital Preservation, iPRES 2018, Boston, USA , September 2018. Available from https://hal.archives-ouvertes.fr/hal-01865790

work page 2018

[10] [10]

Software heritage: Why and how to pre- serve software source code

Roberto Di Cosmo and Stefano Zacchiroli. Software heritage: Why and how to pre- serve software source code. In Proceedings of the 14th International Conference on Dig- ital Preservation, iPRES 2017, Kyoto, Japan , September 2017. Available from https: //hal.archives-ouvertes.fr/hal-01590958

work page 2017

[11] [11]

Evolution of networks

Sergey N Dorogovtsev and Jose FF Mendes. Evolution of networks. Advances in physics , 51(4):1079–1187, 2002

work page 2002

[12] [12]

Boa: A language and infrastructure for analyzing ultra-large-scale software repositories

Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N Nguyen. Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In Proceedings of the 2013 International Conference on Software Engineering , pages 422–431. IEEE Press, 2013

work page 2013

[13] [13]

Germ´ an, Massimiliano Di Penta, Yann-Ga¨ el Gu´ eh´ eneuc, and Giuliano Antoniol

Daniel M. Germ´ an, Massimiliano Di Penta, Yann-Ga¨ el Gu´ eh´ eneuc, and Giuliano Antoniol. Code siblings: Technical and legal implications of copying code between applications. In Godfrey and Whitehead [16], pages 81–90. 19

work page

[14] [14]

Michael W. Godfrey. Understanding software artifact provenance. Sci. Comput. Program. , 97:86–90, 2015

work page 2015

[15] [15]

Godfrey, Daniel M

Michael W. Godfrey, Daniel M. German, Julius Davies, and Abram Hindle. Determining the provenance of software artifacts. In Proceedings of the 5th International Workshop on Software Clones, IWSC ’11, pages 65–66, New York, NY, USA, 2011. ACM

work page 2011

[16] [16]

Godfrey and Jim Whitehead, editors

Michael W. Godfrey and Jim Whitehead, editors. Proceedings of the 6th International Work- ing Conference on Mining Software Repositories, MSR 2009 (Co-located with ICSE), Van- couver, BC, Canada, May 16-17, 2009, Proceedings . IEEE Computer Society, 2009

work page 2009

[17] [17]

An exploratory study of the pull- based software development model

Georgios Gousios, Martin Pinzger, and Arie van Deursen. An exploratory study of the pull- based software development model. In Proceedings of the 36th International Conference on Software Engineering, pages 345–355. ACM, 2014

work page 2014

[18] [18]

Toward large-scale vulnerability discovery using machine learning

Gustavo Grieco, Guillermo Luis Grinblat, Lucas Uzal, Sanjay Rawat, Josselin Feist, and Laurent Mounier. Toward large-scale vulnerability discovery using machine learning. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy , CODASPY ’16, pages 85–96, New York, NY, USA, 2016. ACM

work page 2016

[19] [19]

The road ahead for mining software repositories

Ahmed E Hassan. The road ahead for mining software repositories. In Frontiers of Software Maintenance, 2008. FoSM 2008. , pages 48–57. IEEE, 2008

work page 2008

[20] [20]

The long-term growth rate of evolving software: Empirical results and implications

Les Hatton, Diomidis Spinellis, and Michiel van Genuchten. The long-term growth rate of evolving software: Empirical results and implications. Journal of Software: Evolution and Process, 29(5), 2017

work page 2017

[21] [21]

Gonz´ alez-Barahona

Israel Herraiz, Daniel Rodr´ ıguez, Gregorio Robles, and Jes´ us M. Gonz´ alez-Barahona. The evolution of the laws of software evolution: A discussion based on a systematic literature review. ACM Comput. Surv. , 46(2):28:1–28:28, 2013

work page 2013

[22] [22]

Ishio, R

T. Ishio, R. G. Kula, T. Kanda, D. M. German, and K. Inoue. Software Ingredients: Detection of Third-Party Component Reuse in Java Software Release. In2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR) , pages 339–350, May 2016

work page 2016

[23] [23]

Meir M. Lehman. On understanding laws, evolution, and conservation in the large-program life cycle. Journal of Systems and Software , 1:213–221, 1980

work page 1980

[24] [24]

A large-scale empirical study of security patches

Frank Li and Vern Paxson. A large-scale empirical study of security patches. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security , CCS ’17, pages 2201–2215, New York, NY, USA, 2017. ACM

work page 2017

[25] [25]

Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek

Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. D´ ej` avu: a map of code duplicates on github. PACMPL, 1(OOPSLA):84:1–84:28, 2017

work page 2017

[26] Vadim Markovtsev and Waren Long. Public git archive: a big code dataset for all. In Andy Zaidman, Yasutaka Kamei, and Emily Hill, editors, Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, May 28-29, 2018, pages 34–37. ACM, 2018

[27] Matias Martinez and Martin Monperrus. Mining software repair models for reasoning on the search space of automated program fixing. Empirical Software Engineering, 20(1):176–205, 2015

[28] Ralph C. Merkle. A digital signature based on a conventional encryption function. In Carl Pomerance, editor, Advances in Cryptology - CRYPTO ’87, A Conference on the Theory and Applications of Cryptographic Techniques, volume 293 of Lecture Notes in Computer Science, pages 369–378. Springer, 1987

[29] Audris Mockus. Amassing and indexing a large sample of version control systems: Towards the census of public source code history. In Godfrey and Whitehead [16], pages 11–20

[30] Antoine Pietri, Diomidis Spinellis, and Stefano Zacchiroli. The software heritage graph dataset: Public software development under one roof. In MSR 2019: The 16th International Conference on Mining Software Repositories, pages 138–142. IEEE, 2019

[31] Guillaume Rousseau and Maxime Biais. Computer Tool for Managing Digital Documents, February 2010. CIB: G06F17/30; G06F21/10; G06F21/64

[32] Yuichi Semura, Norihiro Yoshida, Eunjong Choi, and Katsuro Inoue. Ccfindersw: Clone detection tool with flexible multilingual tokenization. In Jian Lv, He Jason Zhang, Mike Hinchey, and Xiao Liu, editors, 24th Asia-Pacific Software Engineering Conference, APSEC 2017, Nanjing, China, December 4-8, 2017, pages 654–659. IEEE Computer Society, 2017

[33] Diomidis Spinellis. A repository of Unix history and evolution. Empirical Software Engineering, 22(3):1372–1404, 2017

[34] Megan Squire. The lives and deaths of open source code forges. In Lorraine Morgan, editor, Proceedings of the 13th International Symposium on Open Collaboration, OpenSym 2017, Galway, Ireland, August 23-25, 2017, pages 15:1–15:8. ACM, 2017

[35] Jeffrey Svajlenko and Chanchal Kumar Roy. Fast and flexible large-scale clone detection with cloneworks. In Sebastián Uchitel, Alessandro Orso, and Martin P. Robillard, editors, Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 - Companion Volume, pages 27–30. IEEE Computer So-...

[36] Suresh Thummalapenta, Luigi Cerulo, Lerina Aversano, and Massimiliano Di Penta. An empirical study on the maintenance of source code clones. Empirical Software Engineering, 15(1):1–34, 2010

[37] Nitin M. Tiwari, Ganesha Upadhyaya, and Hridesh Rajan. Candoia: a platform and ecosystem for mining software repositories tools. In Laura K. Dillon, Willem Visser, and Laurie Williams, editors, Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, pages 759–764. ACM, 2016

[38] C. Vendome. A large scale study of license usage on github. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 2, pages 772–774, May 2015

[39] Yuhao Wu, Yuki Manabe, Tetsuya Kanda, Daniel M. Germán, and Katsuro Inoue. Analysis of license inconsistency in large collections of open source projects. Empirical Software Engineering, 22(3):1194–1222, 2017

[40] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for eclipse. In Predictor Models in Software Engineering, 2007. PROMISE’07: ICSE Workshops 2007. International Workshop on, pages 9–9, May 2007

[41] Thomas Zimmermann, Peter Weißgerber, Stephan Diehl, and Andreas Zeller. Mining version histories to guide software changes. In Anthony Finkelstein, Jacky Estublier, and David S. Rosenblum, editors, 26th International Conference on Software Engineering (ICSE 2004), 23-28 May 2004, Edinburgh, United Kingdom, pages 563–572. IEEE Computer Society, 2004