Ensuring Open Source Integrity: The Intersection of Copy-Based Reuse and License Compliance
Pith reviewed 2026-06-26 07:18 UTC · model grok-4.3
The pith
Nearly two in five instances of copy-based code reuse across open source projects carry a potential license noncompliance risk.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the World of Code infrastructure, which approximates the entirety of open source software, the authors map direct copying between projects and estimate that 39.4% of project combinations linked by copying are at potential risk of license noncompliance. Their regression models associate permissive licenses with a higher likelihood of copy-based reuse and public domain licenses with a lower one. Only 2.43% of the detected reuse is discoverable through dependency analysis.
What carries the argument
The copy-based code reuse network, which maps direct copying across projects.
Load-bearing premise
The method used to detect copied code accurately distinguishes actual copying from coincidental similarities, and the license information extracted from projects is accurate enough to assess noncompliance.
What would settle it
A manual audit of a sample of flagged project pairs that finds either no evidence of copying or correct license compliance in most cases would undermine the risk estimate.
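How much a manual audit can settle depends on its sample size. The sketch below shows one common way to put an uncertainty band around the share of audited pairs that are confirmed as genuine copies, using a Wilson score interval. The audit counts (200 pairs checked, 170 confirmed) are invented for illustration and do not come from the paper.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical audit: 200 flagged pairs checked by hand, 170 confirmed as copies.
low, high = wilson_interval(170, 200)
print(f"confirmed share: {170 / 200:.3f}, 95% interval: [{low:.3f}, {high:.3f}]")
```

If the lower bound of such an interval stayed well above one half, the audit would support the detector; if the upper bound fell below it, the risk estimate would be hard to defend.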
Original abstract
Like other creative work, source code is protected by copyright. The owner can license the work, e.g., to permit copying and other kinds of use, and even start legal proceedings against license violators. However, source code can be reused in subtle ways, e.g., via copying without explicit package manager dependencies, making it hard to reason about potential license noncompliance. Using the World of Code infrastructure approximating the entirety of open source software, in this paper we create a copy-based code reuse network mapping direct copying across projects, and use it to quantify the extent of potential license noncompliance across the entire open source ecosystem. In addition, we estimate regression models to understand whether code copying is affected by the origin project's license, and, if so, how it varies with other project characteristics. We find that code in repositories with permissive licenses, such as MIT and Apache, shows higher likelihood of reuse across programming languages. In contrast, copyleft licenses, like the GPL, exhibit mixed effects. Public domain licenses, despite their aim of allowing unrestricted use, are associated with lower likelihood of copy-based reuse. Widespread potential license noncompliance appears to accompany copy-based reuse, with 39.4% of project combinations at potential noncompliance risk, particularly when licenses are unclear or absent. Our findings reveal that only 2.43% of reuse detected through the copy-based network was discoverable via dependency analysis, highlighting the limitations of existing dependency-tracking tools in capturing copy-based reuse.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper uses the World of Code infrastructure to build a copy-based code reuse network across OSS projects and quantifies potential license noncompliance, reporting that 39.4% of project combinations are at risk (especially with unclear/absent licenses). Regression models examine how origin-project license type affects copy likelihood (higher for MIT/Apache, mixed for GPL, lower for public domain). Only 2.43% of detected reuse is captured by dependency analysis, underscoring limitations of package-manager tools.
Significance. If the copy detector and license metadata prove reliable, the 39.4% figure and the dependency-gap result would provide a large-scale empirical basis for the prevalence of hidden license risks in direct code copying, with direct implications for compliance tooling and OSS governance. The license-type regressions add nuance on reuse incentives. The scale of the World of Code data is a methodological strength for ecosystem claims.
major comments (2)
- [Abstract / Methods (copy detection)] Abstract and methods description: the central 39.4% noncompliance-risk statistic rests on the copy detector correctly identifying direct copying events rather than coincidental similarity. No precision/recall figures, manual validation sample size, or error analysis on the detector or license extractor are reported; without these the percentage cannot be interpreted as a robust ecosystem-wide estimate.
- [Abstract / Regression analysis] Abstract: the regression claims (permissive licenses increase reuse likelihood; copyleft shows mixed effects) are load-bearing for the secondary contribution, yet no model specification, controls for project characteristics, sample sizes, or robustness checks are described, preventing assessment of whether the reported patterns are driven by the license variable or by confounding factors.
minor comments (1)
- [Abstract] The exact definition of 'project combinations' used to compute the 39.4% figure should be stated explicitly (e.g., how pairs are sampled and filtered) to support replication.
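To make the replication concern concrete, here is a minimal sketch of one possible definition of "project combinations": each distinct (origin, destination) project pair with at least one copy counts once, and a pair is flagged when either license is missing or the pair of license types appears in a rule table. The rule table `RISKY_TYPE_PAIRS`, the helper names, and the toy data are all hypothetical; the paper's actual rules are not stated in the abstract.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical, deliberately tiny rule table: (origin type, destination type)
# combinations treated as potentially noncompliant. Not the paper's rules.
RISKY_TYPE_PAIRS: Set[Tuple[str, str]] = {("copyleft", "permissive")}

def is_at_risk(origin: Optional[str], dest: Optional[str]) -> bool:
    """Flag a pair when a license is missing or the type pair is listed as risky."""
    if origin is None or dest is None:
        return True
    return (origin, dest) in RISKY_TYPE_PAIRS

def at_risk_share(copies: Iterable[Tuple[str, str]],
                  license_type: Dict[str, Optional[str]]) -> float:
    """Share of distinct (origin, destination) project pairs flagged as at risk."""
    pairs = {(o, d) for o, d in copies if o != d}  # deduplicate, drop self-copies
    if not pairs:
        return 0.0
    flagged = sum(is_at_risk(license_type.get(o), license_type.get(d))
                  for o, d in pairs)
    return flagged / len(pairs)

# Toy data: project C has no detectable license.
licenses = {"A": "permissive", "B": "copyleft", "C": None, "D": "permissive"}
copies = [("A", "B"), ("A", "B"), ("B", "D"), ("A", "C"), ("A", "D"), ("D", "A")]
print(at_risk_share(copies, licenses))
```

Small choices in this definition, such as whether repeated copies between the same two projects count once or many times, change the denominator and therefore the headline percentage, which is why the referee asks for it to be stated.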
Simulated Author's Rebuttal
We appreciate the referee's feedback on the methodological rigor of our study. We address the major comments below and plan revisions accordingly.
-
Referee: [Abstract / Methods (copy detection)] Abstract and methods description: the central 39.4% noncompliance-risk statistic rests on the copy detector correctly identifying direct copying events rather than coincidental similarity. No precision/recall figures, manual validation sample size, or error analysis on the detector or license extractor are reported; without these the percentage cannot be interpreted as a robust ecosystem-wide estimate.
Authors: The copy detection method is based on established techniques in the World of Code infrastructure, which has been used and validated in multiple prior studies for identifying code reuse at scale. While we did not report specific precision and recall for this particular application in the current manuscript, we can provide additional details on the detector's parameters and any internal validation. We will revise the methods section to include a discussion of potential false positives in copy detection and how they might affect the 39.4% estimate, along with any available error analysis.
revision: yes
-
Referee: [Abstract / Regression analysis] Abstract: the regression claims (permissive licenses increase reuse likelihood; copyleft shows mixed effects) are load-bearing for the secondary contribution, yet no model specification, controls for project characteristics, sample sizes, or robustness checks are described, preventing assessment of whether the reported patterns are driven by the license variable or by confounding factors.
Authors: The full manuscript describes the regression models in the Methods section, including the use of logistic regression with controls for project size, age, and primary programming language. Sample sizes are provided in the results tables. However, to better address potential confounding, we will add explicit robustness checks (e.g., alternative model specifications and subsample analyses) in the revision. This will clarify that the license effects hold after accounting for other project characteristics.
revision: yes
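The kind of model the response describes can be illustrated with a small sketch: a logistic regression of "was copied" on a license indicator plus one control. Everything here is an assumption for illustration. The data are synthetic, the only control is a standardized size variable, and the fit uses plain gradient ascent in pure Python rather than a statistics package. It is not the authors' specification.

```python
import math
import random
from typing import List, Sequence

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, steps: int = 500) -> List[float]:
    """Fit coefficients by batch gradient ascent on the mean log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for row, target in zip(X, y):
            err = target - sigmoid(sum(b * x for b, x in zip(beta, row)))
            for j in range(k):
                grad[j] += err * row[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Synthetic data only: columns are intercept, permissive (0/1), standardized size.
rng = random.Random(0)
TRUE_BETA = [-1.0, 1.0, 0.5]  # assumed generating coefficients, not estimates
X, y = [], []
for _ in range(1000):
    row = [1.0, float(rng.random() < 0.5), rng.gauss(0.0, 1.0)]
    p = sigmoid(sum(b * x for b, x in zip(TRUE_BETA, row)))
    X.append(row)
    y.append(1 if rng.random() < p else 0)

beta = fit_logistic(X, y)
print("odds ratio, permissive vs. not:", round(math.exp(beta[1]), 2))
```

Exponentiating the license coefficient gives an odds ratio that is conditional on the included controls. The robustness checks the authors promise amount to refitting such a model with different control sets or subsamples and checking that the sign and rough size of that ratio persist.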
Circularity Check
No circularity; statistics derived from external infrastructure data
The paper's central quantitative results, including the 39.4% noncompliance risk and regression findings on license effects, are computed directly from the World of Code copy-detection network and license metadata as external inputs. No derivation step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the reported figures are empirical aggregates and model outputs on observed project combinations rather than tautological renamings or predictions forced by the analysis itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: World of Code accurately detects direct code copying events across projects and provides reliable license metadata for compliance assessment.
Forward citations
Cited by 1 Pith paper
-
File-Level Copying Is an Implicit Dependency in Open Source
File-level copying acts as an implicit dependency in open source, removing provenance signals and concentrating security risks in vendored copies and license risks in direct source reuse.