arxiv: 2607.02059 · v1 · pith:WGW2QC4Hnew · submitted 2026-07-02 · 💻 cs.SE

File-Level Copying Is an Implicit Dependency in Open Source

Pith reviewed 2026-07-03 08:45 UTC · model grok-4.3

classification 💻 cs.SE
keywords file-level copying · software reuse · supply chain security · open source ecosystem · dependency visibility · vulnerability management · license compliance · vendoring

The pith

File-level copying removes the four observable signals package managers provide for declared dependencies, hiding security and license risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that copying individual files between open source repositories is a widespread but ungoverned practice that eliminates the visibility normally provided by declared dependencies. This removal affects four specific signals: provenance, ongoing maintenance, security status, and licensing compliance, with no built-in way to recover them. Analysis of copy events across the ecosystem shows that different forms of copying carry distinct risks, with security problems clustering in vendored dependencies and license issues in direct source-code reuse. The copied code is typically stale and undocumented, allowing these risks to propagate into packaged software undetected by tools that examine only declared metadata.

Core claim

File-level copying is a widespread but ungoverned form of software reuse. Copying files across repositories reduces supply-chain visibility by removing the four observable signals a package manager provides for a declared dependency (provenance, maintenance, security, and compliance), with no mechanism to restore them. These visibility gaps are form-specific: security risk concentrates in vendored dependencies and license risk in direct source-code reuse.

What carries the argument

The thirteen axial copy forms that classify file-level reuse events and act as unreliable proxies for developer intent while determining where security and license risks concentrate.

Load-bearing premise

That the 0.1% commit sample from World of Code combined with manual labeling of 3,912 rationale-bearing commits produces unbiased estimates of copy prevalence, intent distributions, and risk concentrations across the entire ecosystem.

What would settle it

Observing that most copied files record a recoverable origin and version number, or that high-severity CVE instances do not concentrate disproportionately in the vendoring form.

Figures

Figures reproduced from arXiv: 2607.02059 by Audris Mockus, Minghui Zhou, Runzhi He, Wenhao Yang.

Figure 1: RQ1 copy form vs. RQ2 rationale (3,494 in-taxonomy commits)
Original abstract

File-level copying is a widespread but ungoverned form of software reuse. Copying files across repositories reduces supply-chain visibility: it removes the four observable signals a package manager provides for a declared dependency (provenance, maintenance, security, and compliance) with no mechanism to restore them. To characterize the scale and consequences of this unmanaged reuse, we present a mixed-method study of copying across the entire open-source ecosystem using World of Code (WoC). From a 0.1% commit sample, we extract 690,500 copy events and retain 3,912 rationale-bearing copy commits for intent labeling. We show that the 13 axial copy forms, spanning vendored dependencies, hardware/driver synchronization, scaffolding, UI assets, and direct source-code reuse, are unreliable proxies for developer intent: among rationale-bearing commits, hardware/driver copies are predominantly fork-maintenance work (78%), while dependency-vendoring copies more often signal upstream bypass (70%) than offline availability. These visibility gaps are form-specific: security and license risk concentrate in complementary copy forms. Copied sources are frequently stale (median 155 days; 38.5% over one year old) and seldom record a recoverable origin (4.3% documented), let alone a checkable version (2.0% versioned); even vendored copies record where they came from only 10% of the time. Security risk concentrates in vendored dependencies: 17,314 CVE-risk copy commits in the full-WoC graph, 88% in the dependency-vendoring form; 80% score CVSS >= 7.0 and upstream-fix adoption is only 47%-84%. License risk concentrates in direct source-code reuse: 41,777 pre-validation candidates, 66% in the source-code form, with 39 verified high-star violations (kappa = 0.752). Both risks reach packaged software and are invisible to dependency scanners operating on declared metadata alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that file-level copying across open-source repositories is a widespread but ungoverned form of reuse that removes the provenance, maintenance, security, and compliance signals provided by declared package-manager dependencies. Using a 0.1% commit sample from World of Code, the authors extract 690,500 copy events, retain 3,912 rationale-bearing commits for manual intent labeling, and identify 13 axial copy forms. They report that copy form is an unreliable proxy for intent (among rationale-bearing commits, hardware/driver copies are 78% fork-maintenance work, while vendoring copies signal upstream bypass in 70% of cases), that copied sources are frequently stale (median 155 days) and rarely document origin (4.3%) or version (2.0%), and that risks concentrate in complementary forms: 88% of the 17,314 CVE-risk copy commits occur in vendoring (80% score CVSS >= 7.0, and upstream-fix adoption is only 47%-84%), while 66% of the 41,777 pre-validation license-risk candidates occur in direct source-code reuse (39 verified high-star violations).

Significance. If the sampling, detection, and labeling procedures prove robust, the work supplies the first large-scale, form-stratified quantification of an implicit dependency mechanism that evades existing supply-chain scanners. The mixed-method design, the concrete counts of CVE and license exposure, and the demonstration that risk types map to complementary copy forms constitute a substantive empirical contribution to software supply-chain security and reuse studies. The scale of the underlying WoC graph and the explicit mapping from copy form to observable risk are particular strengths.

major comments (3)
  1. [Abstract / Methods (sampling)] Abstract and Methods (sampling description): The central quantitative claims—88% of CVE-risk copies in vendoring, 66% of license candidates in source-code reuse, and the intent percentages (78%, 70%)—are derived from a 0.1% commit sample that produced 690,500 events. No stratification by repository size, language, commit density, or activity level is described, nor is any weighting or bias-correction procedure. If copy events are non-uniform, the reported form-specific concentrations are not guaranteed to be ecosystem-representative, directly undermining the claim that risks concentrate in complementary forms.
  2. [Methods (copy-event extraction)] Methods (copy-event extraction): The paper provides no description of the copy-detection algorithm, similarity threshold, or false-positive handling used to identify the 690,500 events. Because every downstream count, percentage, and risk concentration rests on the correctness of these events, the absence of validation metrics or inter-rater checks for the detection step is load-bearing for the soundness of the entire study.
  3. [Methods (intent labeling)] Methods (intent labeling): The 3,912 rationale-bearing commits were manually labeled, yet no inter-rater reliability statistic (e.g., Cohen’s kappa) is reported for the intent categories. This directly affects the reliability of the form-specific intent distributions that support the claim that copy forms are unreliable proxies for developer intent.
minor comments (2)
  1. [Abstract] The abstract states that 17,314 CVE-risk copy commits exist in the full-WoC graph but does not clarify whether this figure is extrapolated from the 0.1% sample or obtained by a separate full-graph scan; this ambiguity should be resolved for reproducibility.
  2. [Discussion / Limitations] The paper would benefit from an explicit limitations subsection that enumerates the sampling assumptions and detection-validation gaps already noted in the major comments.
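
The sampling concern in major comment 1 can be made concrete with a small sketch. A uniform commit sample weights each repository by its commit volume, so a pooled share can differ from a share reweighted to some other target population (for example, one in which strata count by their share of repositories). The Python below is only an illustration of post-stratification under that assumption; the stratum names, counts, and target shares are hypothetical and are not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    name: str              # hypothetical stratum label
    target_share: float    # share of the target population (hypothetical)
    sampled_events: int    # copy events observed in the sample (hypothetical)
    vendoring_events: int  # of those, events in the vendoring form (hypothetical)


def pooled_share(strata):
    """Share of vendoring events when all sampled events are pooled."""
    total = sum(s.sampled_events for s in strata)
    hits = sum(s.vendoring_events for s in strata)
    return hits / total


def post_stratified_share(strata):
    """Per-stratum rates reweighted by each stratum's target share."""
    return sum(
        s.target_share * s.vendoring_events / s.sampled_events
        for s in strata
    )


if __name__ == "__main__":
    strata = [
        Stratum("large repositories", 0.2, 800, 400),
        Stratum("small repositories", 0.8, 200, 20),
    ]
    print(f"pooled:          {pooled_share(strata):.2f}")
    print(f"post-stratified: {post_stratified_share(strata):.2f}")
```

With these made-up inputs the pooled share is dominated by the stratum that contributes most sampled events, which is the kind of gap a stratified or weighted analysis would expose.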

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the empirical contribution of our mixed-methods study on file-level copying as an implicit dependency. We address each major comment below and will revise the manuscript to improve methodological transparency and address potential limitations.

Point-by-point responses
  1. Referee: [Abstract / Methods (sampling)] Abstract and Methods (sampling description): The central quantitative claims—88% of CVE-risk copies in vendoring, 66% of license candidates in source-code reuse, and the intent percentages (78%, 70%)—are derived from a 0.1% commit sample that produced 690,500 events. No stratification by repository size, language, commit density, or activity level is described, nor is any weighting or bias-correction procedure. If copy events are non-uniform, the reported form-specific concentrations are not guaranteed to be ecosystem-representative, directly undermining the claim that risks concentrate in complementary forms.

    Authors: We agree that the sampling procedure merits explicit discussion to support generalizability. The 0.1% sample was drawn uniformly at random from the World of Code commit database to obtain broad ecosystem coverage. However, we acknowledge that without stratification or bias correction, non-uniformity in copy events could affect representativeness. In revision we will expand the Methods section with a detailed sampling description, add a Limitations subsection addressing possible biases and the rationale for random sampling, and note the statistical robustness afforded by the large event count (690,500). We will also indicate whether post-hoc stratification is feasible with existing data. revision: yes

  2. Referee: [Methods (copy-event extraction)] Methods (copy-event extraction): The paper provides no description of the copy-detection algorithm, similarity threshold, or false-positive handling used to identify the 690,500 events. Because every downstream count, percentage, and risk concentration rests on the correctness of these events, the absence of validation metrics or inter-rater checks for the detection step is load-bearing for the soundness of the entire study.

    Authors: We agree that a clear description of the copy-detection procedure is required for reproducibility and soundness assessment. In the revised manuscript we will add a dedicated subsection in Methods that fully specifies the detection algorithm, the similarity threshold used, false-positive handling (including sampling-based manual inspection), and any validation metrics such as estimated precision. This addition will ensure all quantitative claims rest on transparent foundations. revision: yes

  3. Referee: [Methods (intent labeling)] Methods (intent labeling): The 3,912 rationale-bearing commits were manually labeled, yet no inter-rater reliability statistic (e.g., Cohen’s kappa) is reported for the intent categories. This directly affects the reliability of the form-specific intent distributions that support the claim that copy forms are unreliable proxies for developer intent.

    Authors: We note that Cohen’s kappa (0.752) is already reported for the separate license-violation verification step. We agree that reporting inter-rater reliability for the intent labeling of the 3,912 commits is necessary. In revision we will add the appropriate IRR statistic (Cohen’s kappa or equivalent) for the intent categories, computed on the double-labeled subset, to the Methods section. revision: yes
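
For reference, Cohen's kappa, the statistic discussed in response 3, compares the observed agreement between two raters (p_o) with the agreement expected by chance (p_e): kappa = (p_o - p_e) / (1 - p_e). The following is a minimal Python sketch; the intent labels and the toy ratings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical intent labels for ten double-labeled commits.
    a = ["fork", "fork", "bypass", "bypass", "offline",
         "fork", "bypass", "fork", "offline", "bypass"]
    b = ["fork", "fork", "bypass", "fork", "offline",
         "fork", "bypass", "bypass", "offline", "bypass"]
    print(f"kappa = {cohens_kappa(a, b):.4f}")
```

On the commonly cited Landis and Koch scale, values between 0.61 and 0.80 are described as substantial agreement, which is the band that the reported 0.752 for license-violation verification falls into.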

Circularity Check

0 steps flagged

Purely empirical measurement study with no derivations or fitted models

full rationale

This paper is a mixed-method empirical study that extracts copy events from a commit sample, performs manual labeling on a subset, and reports observed distributions, prevalences, and risk concentrations. No equations, models, or predictions are derived; all reported quantities (e.g., percentages of CVE-risk copies, license candidates, intent distributions) are direct tabulations from the labeled data. The sampling and labeling steps are independent of the conclusions and do not reduce to self-definition or fitted-input predictions. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The study is therefore self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that commit-based copy detection and manual intent labeling on a small subsample faithfully represent ecosystem-wide behavior. No free parameters are explicitly fitted, but the sampling fraction and the labeling criteria function as implicit choices.

free parameters (2)
  • 0.1% commit sample fraction
    Chosen sampling rate used to extract 690,500 copy events; affects scale estimates.
  • Rationale-bearing commit filter
    Threshold or rule that retains only 3,912 commits for labeling; directly shapes intent distributions.
axioms (2)
  • domain assumption: File copies can be reliably identified from commit metadata and content hashes in World of Code
    Invoked to produce the initial 690,500 copy events before any labeling or risk analysis.
  • domain assumption: Manual labeling of commit messages and context produces accurate developer intent categories
    Required to claim that hardware copies are 78% fork-maintenance and vendoring copies are 70% upstream bypass.
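
To illustrate what identifying copies by content hash can look like, the sketch below groups files by their Git blob ID, which Git computes as the SHA-1 of a `blob <size>\0` header followed by the file bytes. It covers exact matches only and assumes that an identical blob ID appearing in more than one repository indicates a copy; the paper's actual detection procedure is not described in this review, and the repository names, paths, and file contents are hypothetical.

```python
import hashlib
from collections import defaultdict


def git_blob_id(content: bytes) -> str:
    """Return the Git blob ID (SHA-1 object format) for raw file bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def find_exact_copies(files):
    """Group (repo, path, content) triples by blob ID.

    Returns {blob_id: [(repo, path), ...]} for blobs that appear in
    more than one distinct repository.
    """
    by_blob = defaultdict(list)
    for repo, path, content in files:
        by_blob[git_blob_id(content)].append((repo, path))
    return {
        blob: locations
        for blob, locations in by_blob.items()
        if len({repo for repo, _ in locations}) > 1
    }


if __name__ == "__main__":
    # Hypothetical repositories and contents, for illustration only.
    files = [
        ("org-a/lib", "src/inflate.c", b"int inflate(void);\n"),
        ("org-b/app", "third_party/inflate.c", b"int inflate(void);\n"),
        ("org-b/app", "src/main.c", b"int main(void) { return 0; }\n"),
    ]
    for blob, locations in find_exact_copies(files).items():
        print(blob, locations)
```

Exact hashing of this kind misses copies that were edited after being copied, which is one reason the referee asks for the similarity threshold and false-positive handling to be documented.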

pith-pipeline@v0.9.1-grok · 5898 in / 1492 out tokens · 25425 ms · 2026-07-03T08:45:27.768660+00:00 · methodology

