pith. machine review for the scientific record.

arxiv: 2604.17940 · v1 · submitted 2026-04-20 · 💻 cs.SE

Recognition: unknown

When AI Models Become Dependencies: Studying the Evolution of Pre-Trained Model Reuse in Downstream Software Systems


Pith reviewed 2026-05-10 04:23 UTC · model grok-4.3

classification 💻 cs.SE
keywords pre-trained models · PTM dependencies · software evolution · empirical study · dependency management · GitHub repositories · AI-integrated systems

The pith

Pre-trained models are added late in software projects and accumulate over time rather than being replaced.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies how pre-trained models function as dependencies in open-source software systems by tracking their addition, removal, and updates across project releases. It uses traditional libraries as a baseline for comparison in a dataset of nearly five thousand releases. The analysis shows PTMs appear later in a project's life and tend to stay and multiply instead of cycling out. This matters for maintenance because PTMs have opaque internals and faster-changing capabilities that differ from standard code libraries. The results point to a need for updated practices in how these models are managed as permanent parts of software architectures.

Core claim

The study of 4,988 releases in 323 GitHub repositories finds that PTMs are typically added late in the project life-cycle and tend to accumulate rather than be replaced as a project matures. PTM changes occur in only 406 of 2,814 release transitions, roughly one-third as often as library changes. PTM changes are less routinely documented yet more likely to include explicit rationale, and unlike reactive library evolution, PTM changes are proactively driven by capability expansion with a distinctive rationale of testing uncertainty.

What carries the argument

Empirical comparison of PTM versus library change frequency, timing, and documented rationales extracted from release notes and repository metadata across 323 projects.
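A minimal sketch of how such change events might be extracted, assuming each release is represented as a multiset of (name, version) pairs, in the spirit of the multiset diff the paper illustrates in Figure 4. The function and data shapes here are illustrative, not the paper's actual pipeline:

```python
from collections import Counter

def diff_dependencies(prev_release, next_release):
    """Classify dependency changes between two consecutive releases.

    Each release is a list of (name, version) pairs; treating it as a
    multiset lets the same PTM appear more than once when a project
    reuses one model for several downstream tasks.
    """
    prev, nxt = Counter(prev_release), Counter(next_release)
    prev_names = Counter(n for n, _ in prev.elements())
    next_names = Counter(n for n, _ in nxt.elements())

    added = list((next_names - prev_names).elements())
    removed = list((prev_names - next_names).elements())
    # An "update" is a name present on both sides whose multiset of
    # versions changed between the two releases.
    updated = sorted(
        n for n in set(prev_names) & set(next_names)
        if Counter(v for m, v in prev.elements() if m == n)
        != Counter(v for m, v in nxt.elements() if m == n)
    )
    return {"added": added, "removed": removed, "updated": updated}
```

Counting the transitions where any of these lists is non-empty, separately for PTMs and for libraries, would yield frequency figures of the kind the paper reports (406 of 2,814 transitions with a PTM change).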

If this is right

  • PTMs require dedicated tracking methods separate from standard library dependency tools.
  • Maintenance teams should expect PTM updates to be driven by new capabilities rather than routine fixes.
  • Documentation standards for PTM changes need to capture their explicit rationales more consistently.
  • Systems may accumulate multiple PTM instances for different tasks, increasing long-term complexity.
  • Software engineering processes should treat PTMs as multi-role dependencies rather than single libraries.
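One way to read the last bullet concretely: a PTM manifest that records each (model, version, downstream task) instance as a separate entry, lockfile-style, so that multi-role reuse of a single model stays visible. The structure and names below are our illustration, not anything the paper proposes:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PTMDependency:
    """One tracked PTM instance; the same model reused for a different
    downstream task is recorded as a separate dependency."""
    model_id: str   # e.g. a model-hub identifier
    revision: str   # pinned version or commit, like a lockfile entry
    task: str       # downstream role this instance serves

@dataclass
class PTMManifest:
    entries: list = field(default_factory=list)

    def add(self, dep: PTMDependency):
        # Exact duplicates (same model, revision, and task) are ignored.
        if dep not in self.entries:
            self.entries.append(dep)

    def roles_of(self, model_id: str):
        """All downstream tasks a given model currently serves."""
        return [e.task for e in self.entries if e.model_id == model_id]
```

A standard library dependency file would collapse the two roles below into one entry; the manifest keeps them distinct.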

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If accumulation continues unchecked, long-lived projects could face growing integration and version conflicts across PTM instances.
  • Automated tools that scan release notes for testing-uncertainty mentions could help flag risky PTM adoptions early.
  • The proactive capability-driven pattern may not hold in closed-source or enterprise environments where update decisions follow different incentives.
  • Extending the analysis to measure actual runtime impact of infrequent PTM changes could test whether lower frequency correlates with higher stability or hidden risks.
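The second bullet could be prototyped as a simple release-note scanner. The phrase list below is a guess at what "testing uncertainty" language looks like, not a validated taxonomy; a real tool would need to be calibrated against labeled release notes:

```python
import re

# Illustrative phrases only, chosen by us as plausible markers of
# hedged, try-it-and-see PTM adoption language.
UNCERTAINTY_PATTERNS = [
    r"\bexperimental\b",
    r"\bnot (?:yet )?(?:fully )?tested\b",
    r"\btry(?:ing)? out\b",
    r"\bevaluat(?:e|ing) whether\b",
]

def flags_testing_uncertainty(release_note: str) -> bool:
    """Heuristically flag release notes whose rationale sounds tentative."""
    text = release_note.lower()
    return any(re.search(p, text) for p in UNCERTAINTY_PATTERNS)
```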

Load-bearing premise

The 323 GitHub repositories and 4,988 releases accurately represent typical downstream systems that reuse open-source pre-trained models, and PTM changes can be reliably identified from metadata and notes.

What would settle it

A study of a larger or more diverse set of repositories showing PTM change frequency equal to or higher than library changes, or lacking the late-addition and accumulation pattern, would undermine the claim of a qualitatively distinct evolution.

Figures

Figures reproduced from arXiv: 2604.17940 by Christoph Treude, Haoyu Gao, Mansooreh Zahedi, Patanamon Thongtanunam, Peerachai Banyongrakkul.

Figure 1: A motivating example where developers introduce new …
Figure 2: Overview of our dataset creation and analysis pipeline. The process spans from initial PTM signature identification …
Figure 3: Distribution of repository characteristics before and …
Figure 4: Example of PTM change detection using multiset …
Figure 5: Example of the qualitative process used to analyze …
Figure 6: Distribution of change types for PTMs and Libraries.
Figure 7: Comparative distribution of releases per change event …
Figure 8: Top five most frequently added PTMs. To further contextualize this trend over calendar time, …
Figure 9: Quarterly number of PTMs reused, decomposed into …
Figure 10: Comparative distribution of PTM and library addition …
Figure 11: Documentation and rationale coverage profile com…
Figure 13: Distribution of documentation artifacts and rationale …
Figure 14: Example of PTM lightweight testing (Example …)
Figure 15: Example of performance optimization (Example …)
Original abstract

Modern software systems have transitioned from purely code-based architectures to AI-integrated systems where pre-trained models (PTMs) serve as permanent dependencies. However, while the evolution of traditional software libraries is well-documented, we lack a clear understanding of how these "PTM dependencies" change over time. Unlike libraries, PTMs are characterized by opaque internals and less standardized, rapidly evolving release cycles. Furthermore, their multi-role nature enables developers to treat individual instances of a single PTM as separate functional dependencies based on their specific downstream tasks. This raises a critical question for software maintenance: do PTMs change like standard software libraries or do they follow a divergent pattern? To answer this, we present the first empirical study of downstream PTM changes, analyzing a comprehensive dataset of 4,988 releases across 323 GitHub OSS repositories that reuse open-source PTMs. Using traditional software libraries as a baseline, we find that PTMs follow a qualitatively distinct pattern. PTMs are typically added late in the project life-cycle and tend to accumulate rather than be replaced as a project matures. Our findings show that PTM changes are three times less frequent (406 of 2,814 release transitions) than library changes. PTM changes are also less routinely documented, but more likely to carry explicit rationale. Unlike libraries, which evolve reactively, PTM evolution is proactively driven by capability expansion, with a unique documented rationale of PTM testing uncertainty. Our work calls for a rethinking of how PTMs are tracked and managed as dependencies in modern software engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the first empirical study of pre-trained model (PTM) evolution as dependencies in downstream software systems. Analyzing 4,988 releases from 323 GitHub OSS repositories that reuse open-source PTMs, and using traditional libraries as a baseline, it claims PTMs follow a qualitatively distinct pattern: added late in the project lifecycle, tending to accumulate rather than be replaced; PTM changes occur three times less frequently than library changes (406 of 2,814 release transitions); PTM changes are less routinely documented but more likely to carry explicit rationale; and PTM evolution is proactively driven by capability expansion with a unique rationale of PTM testing uncertainty, unlike the reactive evolution of libraries.

Significance. If the results hold after addressing detection validity, the work is significant as the first large-scale quantitative and qualitative characterization of PTM dependency evolution in software engineering. It supplies concrete counts from a substantial dataset (323 repositories, 4,988 releases) and identifies actionable differences from library management, supporting calls to rethink tracking and maintenance practices for AI components. The combination of frequency statistics and rationale categorization provides falsifiable observations that can guide future tooling and empirical work.

major comments (2)
  1. [Abstract and §3] Data Collection/Analysis: The central claim that PTM changes are three times less frequent than library changes (406 of 2,814 release transitions) rests on the PTM change detection pipeline. No description is provided of the extraction method from release notes and metadata, handling of dynamic loading or multi-role PTM instances, inter-rater reliability for categorization, or any validation against ground truth. This directly affects the quantitative distinctness result and the comparison baseline, as differential miss rates would inflate the reported frequency gap.
  2. [§4] Results and selection criteria: The representativeness claim for the 323 repositories and 4,988 releases as typical downstream systems is load-bearing for generalizing the late-addition and accumulation patterns. The manuscript provides no details on selection bias controls, inclusion/exclusion criteria beyond GitHub OSS, or comparison to broader PTM reuse populations, leaving the qualitative pattern vulnerable to sampling artifacts.
minor comments (2)
  1. [Abstract] The abstract states PTM changes are 'less routinely documented' without providing the exact documentation rates or statistical test used for the comparison to libraries.
  2. [Results] Figure or table presenting the 406/2,814 counts and rationale categories would benefit from explicit confidence intervals or effect sizes to support the 'three times less frequent' statement.
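To illustrate the kind of interval the second minor comment asks for, here is a Wilson score interval around the reported 406/2,814 PTM-change proportion. The choice of interval and the 95% level are ours, not the paper's:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# PTM changes occurred in 406 of 2,814 release transitions (~14.4%).
lo, hi = wilson_interval(406, 2814)
```

The same calculation on the library-change counts (not reported in the abstract) would make the "three times less frequent" comparison checkable.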

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We have addressed each of the major comments in detail below and revised the manuscript to enhance methodological transparency and address concerns about generalizability.

Point-by-point responses
  1. Referee: [Abstract and §3] Data Collection/Analysis: The central claim that PTM changes are three times less frequent than library changes (406 of 2,814 release transitions) rests on the PTM change detection pipeline. No description is provided of the extraction method from release notes and metadata, handling of dynamic loading or multi-role PTM instances, inter-rater reliability for categorization, or any validation against ground truth. This directly affects the quantitative distinctness result and the comparison baseline, as differential miss rates would inflate the reported frequency gap.

    Authors: We agree with the referee that the description of the PTM change detection pipeline in the original manuscript was insufficiently detailed, which is important for validating the central quantitative claim. We have revised §3 to include a comprehensive description of the extraction method from release notes and metadata, our handling of dynamic loading and multi-role PTM instances (by identifying distinct usage contexts in the code), the inter-rater reliability assessment for categorization, and the validation against ground truth on a sample of releases. These revisions ensure the reported frequency difference is robustly supported. revision: yes

  2. Referee: [§4] Results and selection criteria: The representativeness claim for the 323 repositories and 4,988 releases as typical downstream systems is load-bearing for generalizing the late-addition and accumulation patterns. The manuscript provides no details on selection bias controls, inclusion/exclusion criteria beyond GitHub OSS, or comparison to broader PTM reuse populations, leaving the qualitative pattern vulnerable to sampling artifacts.

    Authors: We acknowledge the importance of discussing selection criteria and potential biases for the generalizability of our findings on late-addition and accumulation patterns. The original manuscript described the dataset as 323 GitHub OSS repositories with 4,988 releases that reuse open-source PTMs, but provided limited details on the exact selection process and bias controls. In the revised version, we have added explicit inclusion and exclusion criteria in §3 and a new subsection in §4 addressing threats to validity, including selection bias and limitations in representing the broader population of PTM-reusing systems. While a full comparative analysis to all PTM reuse instances is beyond the scope of this study, we discuss how our sample aligns with known characteristics of AI-integrated OSS projects. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely observational empirical study

Full rationale

This paper conducts an empirical analysis of 4,988 releases from 323 GitHub repositories, reporting observed frequencies (e.g., 406 PTM changes out of 2,814 transitions) and patterns directly from external repository metadata and release notes. There are no derivations, equations, fitted parameters, or predictions that could reduce to their own inputs by construction. All claims rest on data extraction and categorization rather than self-definitions, self-citations as load-bearing premises, or ansatzes smuggled in from prior author work. The study's findings are grounded in observable repository events rather than internal logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the representativeness of the chosen GitHub repositories and the accuracy of identifying PTM versus library changes from release data; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 323 GitHub OSS repositories reusing open-source PTMs are representative of broader downstream software systems
    Generalization from the sampled projects to industry practice depends on this assumption.

pith-pipeline@v0.9.0 · 5607 in / 1299 out tokens · 58800 ms · 2026-05-10T04:23:07.231963+00:00 · methodology


Reference graph

Works this paper leans on

98 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Software reuse research: status and future,

    W. B. Frakes and K. Kang, “Software reuse research: status and future,” IEEE Trans. Softw. Eng., vol. 31, no. 7, pp. 529–536, 2005

  2. [2]

    Predicting software reuse using machine learning techniques—A case study on open-source Java software systems,

    M. Y . H. Yeow, C. Y . Chong, M. K. Lim, and Y . Yee Yen, “Predicting software reuse using machine learning techniques—A case study on open-source Java software systems,”PLoS ONE, vol. 20, no. 2, p. e0314512, feb 2025

  3. [3]

    Do developers update their library dependencies?: An empirical study on the impact of security advisories on library migration,

    R. G. Kula, D. M. German, A. Ouni, T. Ishio, and K. Inoue, “Do developers update their library dependencies?: An empirical study on the impact of security advisories on library migration,”Empirical Software Engineering, vol. 23, no. 1, pp. 384–417, 2018

  4. [4]

    An empirical comparison of dependency network evolution in seven software packaging ecosystems,

    A. Decan, T. Mens, and P. Grosjean, “An empirical comparison of dependency network evolution in seven software packaging ecosystems,” Empirical Software Engineering, vol. 24, no. 1, pp. 381–416, 2019

  5. [5]

    An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry,

    W. Jianget al., “An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry,” inProceedings of the 45th International Conference on Software Engineering (ICSE 2023). Piscataway, NJ, USA: IEEE, 2023, pp. 2463–2475

  6. [6]

    Pre-trained models: Past, present and future,

    X. Hanet al., “Pre-trained models: Past, present and future,”AI Open, vol. 2, pp. 225–250, 2021

  7. [7]

    Software Dependencies 2.0: An Empirical Study of Reuse and Integration of Pre-Trained Models in Open-Source Projects,

    J. Yasmin, W. Jiang, and C. D. Y . Tian, “Software Dependencies 2.0: An Empirical Study of Reuse and Integration of Pre-Trained Models in Open-Source Projects,” 2026

  8. [8]

    Deep Learning Model Reuse in the HuggingFace Community: Chal- lenges, Benefit and Trends,

    M. Taraghi, G. Dorcelus, A. Foundjem, F. Tambon, and F. Khomh, “Deep Learning Model Reuse in the HuggingFace Community: Chal- lenges, Benefit and Trends,” inProceedings of the 31st IEEE Interna- tional Conference on Software Analysis, Evolution and Reengineering (SANER 2024). Piscataway, NJ, USA: IEEE, mar 2024, pp. 512–523

  9. [9]

    Challenges of Using Pre-trained Models: the Practitioners’ Perspective,

    X. Tanet al., “Challenges of Using Pre-trained Models: the Practitioners’ Perspective,” inProceedings of the 31st IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2024). Los Alamitos, CA, USA: IEEE, mar 2024, pp. 67–78. JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 18

  10. [10]

    From release to adoption: Challenges in reusing pre-trained ai models for downstream developers,

    P. Banyongrakkul, M. Zahedi, P. Thongtanunam, C. Treude, and H. Gao, “From release to adoption: Challenges in reusing pre-trained ai models for downstream developers,” inProceedings of the 41st IEEE Inter- national Conference on Software Maintenance and Evolution (ICSME 2025). Piscataway, NJ, USA: IEEE, 2025, pp. 1–13

  11. [11]

    What do we know about hugging face? a systematic literature review and quantitative validation of qualitative claims,

    J. Joneset al., “What do we know about hugging face? a systematic literature review and quantitative validation of qualitative claims,” in Proceedings of the 18th ACM/IEEE International Symposium on Em- pirical Software Engineering and Measurement (ESEM 2024), vol. 1, no. 1. New York, NY , USA: ACM, 2024, pp. 13–24

  12. [12]

    Docu- menting ethical considerations in open source ai models,

    H. Gao, M. Zahedi, C. Treude, S. Rosenstock, and M. Cheong, “Docu- menting ethical considerations in open source ai models,” inProceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2024). New York, NY , USA: ACM, 2024, p. 177–188

  13. [13]

    Towards semantic versioning of open pre-trained language model releases on hugging face,

    A. Ajibode, A. A. Bangash, F. R. Cogo, B. Adams, and A. E. Hassan, “Towards semantic versioning of open pre-trained language model releases on hugging face,”Empirical Software Engineering, vol. 30, no. 3, pp. 1–63, 2025

  14. [14]

    PeaTMOSS: A Dataset and Initial Analysis of Pre- Trained Models in Open-Source Software,

    W. Jianget al., “PeaTMOSS: A Dataset and Initial Analysis of Pre- Trained Models in Open-Source Software,” inProceedings of the 21st IEEE/ACM International Conference on Mining Software Repositories (MSR 2024), vol. 1. New York, NY , USA: ACM, 2024, p. 431–443

  15. [15]

    Reusing Deep Learning Models: Challenges and Directions in Software Engineering,

    J. C. Daviset al., “Reusing Deep Learning Models: Challenges and Directions in Software Engineering,” inProceedings of the 2023 IEEE John Vincent Atanasoff Symposium on Modern Computing (JVA 2023). Piscataway, NJ, USA: IEEE, 2023, pp. 17–30

  16. [16]

    From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI

    M.-A. Storey, “From technical debt to cognitive and intent debt: Rethinking software health in the age of ai,” 2026. [Online]. Available: https://arxiv.org/abs/2603.22106

  17. [17]

    On the adoption and effects of source code reuse on defect proneness and maintenance effort,

    G. Giordanoet al., “On the adoption and effects of source code reuse on defect proneness and maintenance effort,”Empirical Software Engineering, vol. 29, no. 1, p. 20, 2023

  18. [18]

    Design patterns: Abstraction and reuse of object-oriented design,

    E. Gamma, R. Helm, R. Johnson, and J. Vlissides, “Design patterns: Abstraction and reuse of object-oriented design,” inProceedings of the European Conference on Object-Oriented Programming (ECOOP 1993), ser. Lecture Notes in Computer Science. Cham, Switzerland: Springer Nature, 1993, vol. 707, pp. 406–431

  19. [19]

    Surviving Software Dependencies,

    R. Cox, “Surviving Software Dependencies,”Queue, vol. 17, no. 2, pp. 24–47, 2019

  20. [20]

    An Empirical Analysis of Technical Lag in npm Package Dependencies,

    A. Zerouali, E. Constantinou, T. Menset al., “An Empirical Analysis of Technical Lag in npm Package Dependencies,” inProceedings of the 17th International Conference on Software Reuse (ICSR 2018), ser. Lecture Notes in Computer Science, vol. 10826. Cham, Switzerland: Springer, 2018, pp. 95–110

  21. [21]

    Characterizing usages, updates and risks of third- party libraries in java projects,

    K. Huanget al., “Characterizing usages, updates and risks of third- party libraries in java projects,”Empirical Software Engineering, vol. 27, no. 4, p. 78, 2022

  22. [22]

    How the apache community upgrades dependencies: an evolutionary study,

    G. Bavota, G. Canfora, M. Di Penta, R. Oliveto, and S. Panichella, “How the apache community upgrades dependencies: an evolutionary study,” Empirical Software Engineering, vol. 20, no. 5, pp. 1275–1317, 2015

  23. [23]

    A large scale analysis of semantic versioning in npm,

    D. Pinckney, F. Cassano, A. Guha, and J. Bell, “A large scale analysis of semantic versioning in npm,” inProceedings of the 20th IEEE/ACM International Conference on Mining Software Repositories (MSR 2023). Piscataway, NJ, USA: IEEE, 2023, pp. 485–497

  24. [24]

    A study of library migrations in java,

    C. Teyton, J. R. Falleri, M. Palyart, and X. Blanc, “A study of library migrations in java,”Journal of Software: Evolution and Process, vol. 26, no. 11, pp. 1030–1052, 2014

  25. [25]

    An Empirical Study on the Reuse of Third-Party Libraries in Open-Source Software Development,

    A. Zaimiet al., “An Empirical Study on the Reuse of Third-Party Libraries in Open-Source Software Development,” inProceedings of the 7th Balkan Conference on Informatics Conference (BCI 2015). New York, NY , USA: ACM, 2015

  26. [26]

    Logging library migrations: a case study for the apache software foundation projects,

    S. Kabinna, C.-P. Bezemer, W. Shang, and A. E. Hassan, “Logging library migrations: a case study for the apache software foundation projects,” inProceedings of the 13th International Conference on Mining Software Repositories (MSR 2016). ACM, 2016, pp. 154–164

  27. [27]

    A large-scale empirical study on Java library migrations: prevalence, trends, and rationales,

    H. He, R. He, H. Gu, and M. Zhou, “A large-scale empirical study on Java library migrations: prevalence, trends, and rationales,” inProceed- ings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021). New York, NY , USA: ACM, 2021, pp. 478–490

  28. [28]

    How and why developers migrate python tests,

    L. Barbosa and A. Hora, “How and why developers migrate python tests,” inProceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2022). Piscataway, NJ, USA: IEEE, 2022, pp. 538–548

  29. [29]

    Software Reuse and Evolution in JavaScript Applications,

    A. Terzi, “Software Reuse and Evolution in JavaScript Applications,” in Proceedings of the 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022). IEEE, 2022, pp. 263–269

  30. [30]

    Jaisri, B

    P. Jaisri, B. Reid, and R. G. Kula,A Preliminary Study on Self-contained Libraries in the NPM Ecosystem. Cham, Switzerland: Springer Nature, 2025, pp. 53–65

  31. [31]

    Pymigbench: A bench- mark for python library migration,

    M. Islam, A. K. Jha, S. Nadi, and I. Akhmetov, “Pymigbench: A bench- mark for python library migration,” inProceedings of the IEEE/ACM 20th International Conference on Mining Software Repositories (MSR 2023). Piscataway, NJ, USA: IEEE, 2023, pp. 511–515

  32. [32]

    Self-Admitted Library Migrations in Java, JavaScript, and Python Packaging Ecosystems: A Comparative Study,

    H. Gu, H. He, and M. Zhou, “Self-Admitted Library Migrations in Java, JavaScript, and Python Packaging Ecosystems: A Comparative Study,” inProceedings of the 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2023). Piscataway, NJ, USA: IEEE, 2023, pp. 627–638

  33. [33]

    A Qualitative Study of De- pendency Management and Its Security Implications,

    I. Pashchenko, D. L. Vu, and F. Massacci, “A Qualitative Study of De- pendency Management and Its Security Implications,” inProceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS 2020). ACM, 2020, pp. 1513–1531

  34. [34]

    Characterizing python library migrations,

    M. Islam, A. K. Jha, I. Akhmetov, and S. Nadi, “Characterizing python library migrations,”Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 92–114, 2024

  35. [35]

    Cramming: training a language model on a single GPU in one day,

    J. Geipinget al., “Cramming: training a language model on a single GPU in one day,” inIn Proceedings of the 40th International Conference on Machine Learning (ICML 2023), vol. 202. Cambridge, MA, USA: PMLR Press, Jul 2023, pp. 11 117–11 143

  36. [36]

    How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study,

    F. Pepeet al., “How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study,” inProccedings of the 32nd IEEE International Conference on Program Comprehension (ICPC 2024), no. iii. New York, NY , USA: ACM, 2024, pp. 370–381

  37. [37]

    Siavash Ameli, Siyuan Zhuang, Ion Stoica, and Michael W

    M. Abdin, S. Agarwal, A. Awadallahet al., “Phi-4-reasoning technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2504.21318

  38. [38]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,

    D. Guo, D. Yang, H. Zhanget al., “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,”Springer Nature, vol. 645, no. 8081, pp. 633–638, 2025

  39. [39]

    High-Resolution Image Synthesis with Latent Diffusion Models ,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “ High-Resolution Image Synthesis with Latent Diffusion Models ,” in Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022). Los Alamitos, CA, USA: IEEE, jun 2022, pp. 10 674–10 685

  40. [40]

    Analyzing the Evolution and Maintenance of ML Models on Hugging Face,

    J. Castano, S. Martinez-Fernandez, X. Franch, and J. Bogner, “Analyzing the Evolution and Maintenance of ML Models on Hugging Face,” in Proceedings of the IEEE/ACM 21st International Conference on Mining Software Repositories (MSR 2024), vol. 1, no. 1. New York, NY , USA: ACM, 2024, pp. 607–618

  41. [41]

    Navigating dataset documentations in ai: A large-scale analysis of dataset cards on hugging face,

    X. Yang, W. Liang, and J. Zou, “Navigating dataset documentations in ai: A large-scale analysis of dataset cards on hugging face,” inProceedings of the 12th International Conference on Learning Representations (ICLR 2024). OpenReview.net, 2024

  42. [42]

    “i see models being a whole other thing

    W. Jianget al., ““i see models being a whole other thing”: an empirical study of pre-trained model naming conventions and a tool for enhancing naming consistency,”Empirical Software Engineering, vol. 30, no. 6, p. 155, 2025

  43. [43]

    What Is the Intended Usage Context of This Model? An Exploratory Study of Pre- Trained Models on Various Model Repositories,

    L. Gong, J. Zhang, M. Wei, H. Zhang, and Z. Huang, “What Is the Intended Usage Context of This Model? An Exploratory Study of Pre- Trained Models on Various Model Repositories,”ACM Trans. Softw. Eng. Methodol., vol. 32, no. 3, pp. 1–57, may 2023

  44. [44]

    An Empirical Study of Artifacts and Security Risks in the Pre-trained Model Supply Chain,

    W. Jiang, N. Synovic, R. Sethiet al., “An Empirical Study of Artifacts and Security Risks in the Pre-trained Model Supply Chain,” inProceed- ings of the 2022 ACM Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses (SCORED 2022). New York, NY , USA: ACM, 2022, pp. 105–114

  45. [45]

    Discrepancies among pre-trained deep neural networks: A new threat to model zoo reliability,

    D. Montes, P. Peerapatanapokin, J. Schultz, C. Guo, W. Jiang, and J. C. Davis, “Discrepancies among pre-trained deep neural networks: A new threat to model zoo reliability,” inProceedings of the 30th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022). New York, NY , USA: A...

  46. [46]

    Exploring the Carbon Footprint of Hugging Face’s ML Models: A Repository Min- ing Study,

    J. Castano, S. Martinez-Fernandez, X. Franch, and J. Bogner, “Exploring the Carbon Footprint of Hugging Face’s ML Models: A Repository Min- ing Study,” inProceedings of the 17th IEEE/ACM International Sym- posium on Empirical Software Engineering and Measurement (ESEM 2023). Piscataway, NJ, USA: IEEE, 2023, pp. 1–12

  47. [47]

    A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT,

    C. Zhouet al., “A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT,”International Journal of Machine Learning and Cybernetics, 2023. JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 19

  48. [48] M. Chakraborty, “Does reusing pre-trained NLP model propagate bugs?” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021). ACM, 2021, pp. 1686–1688

  49. [49] H. Gao, M. Zahedi, W. Jiang, H. Y. Lin, J. Davis, and C. Treude, “AI Safety in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges,” vol. 1, no. 1, pp. 1–29, 2025

  50. [50] T. R. Toma and C. P. Bezemer, “An exploratory study of dataset and model management in open source machine learning applications,” in Proceedings of the 3rd IEEE/ACM International Conference on AI Engineering – Software Engineering for AI (CAIN 2024). New York, NY, USA: ACM, 2024, pp. 64–74

  51. [51] D. Gonzalez, T. Zimmermann, and N. Nagappan, “The State of the ML-universe: 10 Years of Artificial Intelligence & Machine Learning Software Development on GitHub,” in Proceedings of the 17th IEEE/ACM International Conference on Mining Software Repositories (MSR 2020). New York, NY, USA: ACM, 2020, pp. 431–442

  52. [52] E. Laukkanen, M. Paasivaara, J. Itkonen et al., “Comparison of release engineering practices in a large mature company and a startup,” Empirical Software Engineering, vol. 23, no. 6, pp. 3535–3577, 2018

  53. [53] J. Coelho, M. T. Valente, L. L. Silva, and E. Shihab, “Identifying unmaintained projects in github,” in Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2018). New York, NY, USA: ACM, 2018

  54. [54] S. D. Joshi and S. Chimalakonda, “RapidRelease - A dataset of projects and issues on github with rapid releases,” in Proceedings of the IEEE/ACM 16th International Conference on Mining Software Repositories (MSR 2019), vol. 2019-May. Piscataway, NJ, USA: IEEE, 2019, pp. 587–591

  55. [55] O. Kilic, N. Bowness, and O. Baysal, “Keep the Ball Rolling: Analyzing Release Cadence in GitHub Projects,” in Proceedings of the 20th IEEE/ACM International Conference on Mining Software Repositories (MSR 2023). Piscataway, NJ, USA: IEEE, 2023, pp. 372–376

  56. [56] D. Chakroborti, S. S. Nath, K. A. Schneider, and C. K. Roy, “Release conventions of open-source software: An exploratory study,” Journal of Software: Evolution and Process, vol. 35, no. 1, p. e2499, 2023

  57. [57] T. Preston-Werner and Contributors, “Semantic versioning 2.0.0,” https://semver.org, 2023, accessed: Oct. 13, 2025

  58. [58] H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” The Annals of Mathematical Statistics, vol. 18, no. 1, pp. 50–60, 1947

  59. [59] M. J. Diener, “Cohen’s d,” in The Corsini Encyclopedia of Psychology. John Wiley & Sons, Ltd, 2010, p. 1

  60. [60] J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, Mar. 1977

  61. [61] M. L. McHugh, “Interrater reliability: the kappa statistic,” Biochemia Medica, vol. 22, no. 3, pp. 276–282, 2012

  62. [62] foundation-model-stack/aiu-fms-testing-utils. (2025) Commit 041f39a. (Accessed: 2025-12-01). [Online]. Available: https://github.com/foundation-model-stack/aiu-fms-testing-utils/commit/041f39a9

  63. [63] ZFTurbo/Music-Source-Separation-Training. (2023) Commit 7ee8e07. (Accessed: 2025-12-01). [Online]. Available: https://github.com/ZFTurbo/Music-Source-Separation-Training/commit/7ee8e074e6a9f6cd217f66a360a82c84cc2b174a

  64. [64] D. R. Thomas, “A General Inductive Approach for Analyzing Qualitative Evaluation Data,” American Journal of Evaluation, vol. 27, no. 2, pp. 237–246, 2006

  65. [65] R. F. Woolson, “Wilcoxon signed-rank test,” in Wiley Encyclopedia of Clinical Trials. John Wiley & Sons, Ltd, 2008, pp. 1–3

  66. [66] luxonis/datadreamer. (2025) Example 1. (Accessed: 2025-12-01). [Online]. Available: https://github.com/luxonis/datadreamer/pull/77

  67. [67] centre-for-humanities-computing/conspiracies. (2023) Example 2. (Accessed: 2025-12-01). [Online]. Available: https://github.com/centre-for-humanities-computing/conspiracies/commit/2c3d5e32318dd0713770d32b485b63ff986e67ac

  68. [68] hpcaitech/ColossalAI. (2024) Example 3. (Accessed: 2025-12-01). [Online]. Available: https://github.com/hpcaitech/ColossalAI/releases/tag/v0.4.3

  69. [69] koito19960406/ZenSVI. (2024) Example 4. (Accessed: 2025-12-01). [Online]. Available: https://github.com/koito19960406/ZenSVI/pull/91

  70. [70] vllm-project/vllm. (2024) Example 5. (Accessed: 2025-12-01). [Online]. Available: https://github.com/vllm-project/vllm/issues/4141

  71. [71] castorini/pyserini. (2021) Example 6. (Accessed: 2025-12-01). [Online]. Available: https://github.com/castorini/pyserini/pull/620

  72. [72] huggingface/trl. (2025) Example 7. (Accessed: 2025-12-01). [Online]. Available: https://github.com/huggingface/trl/pull/3415

  73. [73] biopragmatics/bioregistry. (2025) Example 8. (Accessed: 2025-12-01). [Online]. Available: https://github.com/biopragmatics/bioregistry/pull/1439/commits/e29600af8d57c7dacf28d7bddddb3b629f2e0b1a

  74. [74] TransformerLensOrg/TransformerLens. (2024) Example 9. (Accessed: 2025-12-01). [Online]. Available: https://github.com/TransformerLensOrg/TransformerLens/pull/777

  75. [75] vllm-project/vllm. (2025) Example 10. (Accessed: 2025-12-01). [Online]. Available: https://github.com/vllm-project/vllm/pull/14422

  76. [76] PrunaAI/pruna. (2025) Example 11. (Accessed: 2025-12-01). [Online]. Available: https://github.com/PrunaAI/pruna/commit/bc1ece9b77f4fd426fbaf43e03b2f5eb66f2dc96

  77. [77] arthur-ai/arthur-engine. (2025) Example 12. (Accessed: 2025-12-01). [Online]. Available: https://github.com/arthur-ai/arthur-engine/pull/310

  78. [78] mlflow/mlflow. (2023) Example 13. (Accessed: 2025-12-01). [Online]. Available: https://github.com/mlflow/mlflow/pull/8623

  79. [79] mlflow/mlflow. (2024) Example 14. (Accessed: 2025-12-01). [Online]. Available: https://github.com/mlflow/mlflow/issues/10887

  80. [80] vllm-project/vllm. (2025) Example 14. (Accessed: 2025-12-01). [Online]. Available: https://github.com/vllm-project/vllm/pull/21169
