The Grand Software Supply Chain of AI Systems
Pith reviewed 2026-05-07 06:48 UTC · model grok-4.3
The pith
AI systems suffer from four unaddressed supply chain gaps that leave them without verifiability, versioning, observability, or traceability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By elevating the AI supply chain to first-class status, the paper decomposes it into data acquisition, model training, model inference, and a cross-cutting substrate, then isolates four gaps—verifiability, versioning, observability, and traceability—that traditional mechanisms leave open. It shows that present-day AI systems carry undeclared behavioral couplings with no resolver enforcement, cannot be rolled back to known working assemblies, degrade silently instead of surfacing breaking changes, and allow only approximate lineage reconstruction. The scale of the problem is illustrated by a reference stack of 48 open-source projects whose 4,664 declared direct dependencies resolve to 11,508 transitive packages, totaling roughly 392 million lines of code.
What carries the argument
The four-layer decomposition of the AI supply chain (data acquisition, model training, model inference, cross-cutting substrate) together with the four structural gaps (verifiability, versioning, observability, traceability) that current mechanisms fail to close.
If this is right
- AI systems embed undeclared behavioral couplings that no resolver can enforce.
- AI systems cannot be reverted to known working assemblies when problems arise.
- AI systems degrade silently instead of surfacing breaking changes to operators.
- AI system lineage can only be approximated rather than reconstructed exactly.
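The first of these consequences, couplings that no resolver enforces, can be made concrete with a small sketch. Assuming a hypothetical manifest that pins every stack component (package, model weights, dataset snapshot) to a content digest, a loader could refuse any component whose bytes no longer match; this is the resolver-level enforcement the paper argues is missing. The manifest format and component names below are illustrative, not from the paper.

```python
import hashlib

def digest(data: bytes) -> str:
    """Content digest of a component's bytes (sha256 hex)."""
    return hashlib.sha256(data).hexdigest()

def verify_assembly(manifest: dict[str, str], fetch) -> list[str]:
    """Compare every pinned component against its fetched bytes.

    `manifest` maps component name -> expected digest; `fetch` returns
    the component's current bytes. Returns the names that fail, i.e.
    components whose behavior changed without any declared version bump.
    """
    return [name for name, expected in manifest.items()
            if digest(fetch(name)) != expected]

# Hypothetical three-component stack: code, weights, data.
store = {
    "tokenizer": b"vocab-v1",
    "weights": b"checkpoint-epoch-3",
    "train-data": b"snapshot-2026-01",
}
manifest = {name: digest(blob) for name, blob in store.items()}
assert verify_assembly(manifest, store.__getitem__) == []

# A silent upstream change (same name, different bytes) is caught.
store["weights"] = b"checkpoint-epoch-4"
assert verify_assembly(manifest, store.__getitem__) == ["weights"]
```

The sketch only covers byte-level identity; the paper's point is that behavioral couplings (a retrained model, a refreshed dataset) evade even this check when no digest is recorded at all.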
Where Pith is reading between the lines
- Security and reliability work on AI may need to shift focus from model internals to the full dependency stack.
- Standard dependency tools from traditional software may require AI-specific extensions to close the four gaps.
- Organizations deploying AI could reduce risk by treating the entire stack as a single versioned and verifiable artifact.
- The measured dependency counts suggest that AI maintenance costs are dominated by transitive package management rather than model training itself.
Load-bearing premise
That the four identified gaps are the main unaddressed integrity problems in AI supply chains and that the measured 48-project stack is representative of typical production systems.
What would settle it
An AI system that demonstrably enforces component verification, supports full reversion to prior working assemblies, actively surfaces breaking changes through observability, and provides accurate lineage tracing would falsify the claim that current systems fall short on all four gaps.
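One of the four properties such a system would need, reversion to a known working assembly, reduces to snapshotting the full component-version map and restoring it as a single unit. A minimal sketch, with invented component names and no claim about how the paper envisions the mechanism:

```python
import copy

class AssemblyStore:
    """Versioned snapshots of a complete AI stack assembly.

    An 'assembly' is the full map of component -> version the system
    runs with; snapshotting it makes the stack revertible as one unit
    rather than component by component.
    """
    def __init__(self, assembly: dict[str, str]):
        self.current = dict(assembly)
        self.snapshots: list[dict[str, str]] = []

    def snapshot(self) -> int:
        """Record the current assembly; returns its snapshot id."""
        self.snapshots.append(copy.deepcopy(self.current))
        return len(self.snapshots) - 1

    def revert(self, snapshot_id: int) -> None:
        """Restore a prior known-working assembly atomically."""
        self.current = copy.deepcopy(self.snapshots[snapshot_id])

store = AssemblyStore({"model": "v3", "embedder": "v1", "retriever": "v2"})
good = store.snapshot()

# A model upgrade degrades quality; the fix is a single revert.
store.current["model"] = "v4"
store.revert(good)
assert store.current["model"] == "v3"
```

The hard part the paper gestures at is not the bookkeeping above but that model weights, datasets, and inference services rarely expose the stable version identifiers this bookkeeping presumes.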
Original abstract
AI systems rest on software with low integrity mechanisms, leaving AI systems exposed across every stage from data acquisition to final inference. This paper makes the AI supply chain a first-class object of analysis, decomposing it across four architectural layers: data acquisition, model training, model inference, and a cross-cutting substrate. Within these layers, we identify four structural gaps that traditional supply chain mechanisms do not address: verifiability, versioning, observability, and traceability. Current AI systems fall short on all of them: they carry undeclared behavioral couplings that no resolver enforces; they cannot be reverted back to known working assemblies; they degrade silently rather than surfacing breaking changes; and their lineage can hardly be approximated. To illustrate the scale of the software supply chain of AI, we measure a reference stack of 48 production-grade open-source projects, which declares 4,664 direct dependencies, resolves to 11,508 transitive packages, and totals roughly 392M lines of code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that AI systems rest on software with low integrity mechanisms and positions the AI software supply chain as a first-class object of analysis. It decomposes the supply chain into four architectural layers (data acquisition, model training, model inference, and cross-cutting substrate) and identifies four structural gaps (verifiability, versioning, observability, and traceability) that traditional mechanisms do not address. The authors assert that current AI systems fall short on all four: they carry undeclared behavioral couplings with no resolver, cannot revert to known working assemblies, degrade silently, and have lineage that is hard to approximate. To illustrate scale, the paper measures a reference stack of 48 production-grade open-source projects reporting 4,664 direct dependencies, 11,508 transitive packages, and roughly 392M lines of code.
Significance. If the four gaps are shown to be both real and unaddressed by existing tools, the work would be significant for software engineering research on AI systems, as it would provide a structured framework for analyzing integrity risks across the full pipeline from data to inference. The concrete dependency counts from the 48-project stack are a clear strength, offering quantifiable evidence of the size and transitivity of AI software supply chains that could motivate further empirical studies.
major comments (3)
- [Abstract and gap identification section] Abstract and the section defining the four structural gaps: the claim that 'current AI systems fall short on all of them' (undeclared behavioral couplings with no resolver, inability to revert, silent degradation, hard-to-approximate lineage) is asserted without supporting evidence from the measured stack. The empirical data only reports aggregate counts and does not include any per-project analysis showing absence of resolvers, lockfiles, change-detection hooks, or lineage metadata.
- [Measurement of the reference stack] The measurement section on the reference stack of 48 projects: the reported figures (4,664 direct dependencies, 11,508 transitive packages, ~392M LOC) establish scale and transitivity but contain no breakdown or audit demonstrating that the claimed gaps are present. Scale alone does not establish that traditional mechanisms (e.g., SBOMs or dependency resolvers) are inapplicable or absent.
- [Architectural layers and structural gaps] The section introducing the four architectural layers and gaps: these are presented as structural gaps that 'traditional supply chain mechanisms do not address,' yet the paper provides no systematic comparison or survey of existing tools (such as model registries, lockfiles, or provenance systems) to show why they fail to cover verifiability, versioning, observability, or traceability in the AI context.
minor comments (2)
- [Terminology] The paper would benefit from explicit definitions or examples for terms such as 'undeclared behavioral couplings' and 'silent degradation' with references to specific components in the 48-project stack.
- [References] Add citations to prior work on software supply chain security, SBOM standards, and ML model provenance to better situate the novelty of the four gaps.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. The feedback correctly identifies that the manuscript's empirical measurement focuses on aggregate scale while the claims about the four gaps rest on structural analysis of the architectural layers. We address each major comment below, clarifying the intended role of the measurement and outlining targeted revisions to improve the connection between evidence and claims without altering the core contribution.
Point-by-point responses
-
Referee: [Abstract and gap identification section] Abstract and the section defining the four structural gaps: the claim that 'current AI systems fall short on all of them' (undeclared behavioral couplings with no resolver, inability to revert, silent degradation, hard-to-approximate lineage) is asserted without supporting evidence from the measured stack. The empirical data only reports aggregate counts and does not include any per-project analysis showing absence of resolvers, lockfiles, change-detection hooks, or lineage metadata.
Authors: We agree that the measurement reports only aggregate counts (4,664 direct dependencies, 11,508 transitive packages, ~392M LOC) and contains no per-project audit of mechanisms such as resolvers or lockfiles. The assertion that current AI systems fall short on the four gaps is derived from the architectural decomposition across the four layers, where we argue that AI-specific elements (data pipelines, model artifacts, inference services) create behavioral couplings and lineage needs outside the scope of conventional code-centric tools. The reference stack serves to quantify the underlying software complexity that makes these gaps consequential. We will revise the abstract and gap-identification section to explicitly separate the structural argument from the empirical illustration and add a clarifying sentence on the measurement's purpose and limitations. revision: yes
-
Referee: [Measurement of the reference stack] The measurement section on the reference stack of 48 projects: the reported figures (4,664 direct dependencies, 11,508 transitive packages, ~392M LOC) establish scale and transitivity but contain no breakdown or audit demonstrating that the claimed gaps are present. Scale alone does not establish that traditional mechanisms (e.g., SBOMs or dependency resolvers) are inapplicable or absent.
Authors: The referee is correct that the reported figures demonstrate scale and transitivity but provide no breakdown or audit showing the presence or absence of specific mechanisms such as SBOMs or resolvers. The measurement's purpose is to illustrate the size and interconnectedness of the software foundation supporting AI systems, thereby underscoring why the identified gaps matter at scale. We will revise the measurement section to state its purpose and limitations more explicitly and, where feasible, add high-level observations from the dependency data (for example, dominant package managers) without claiming a comprehensive audit of gap-related features. revision: yes
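The kind of aggregate figure the measurement reports, direct versus transitive counts, is a plain reachability computation over the dependency graph. A toy version over an invented five-package ecosystem (names are illustrative, not from the measured stack):

```python
from collections import deque

def transitive_closure(graph: dict[str, list[str]], roots: list[str]) -> set[str]:
    """All packages reachable from the declared (direct) dependencies."""
    seen, queue = set(), deque(roots)
    while queue:
        pkg = queue.popleft()
        if pkg not in seen:
            seen.add(pkg)
            # Packages absent from the graph are treated as leaves.
            queue.extend(graph.get(pkg, []))
    return seen

# Invented miniature ecosystem: a project declares two direct deps,
# which pull in three more transitively.
graph = {
    "torch-like": ["blas-like", "cuda-rt"],
    "tokenizer-like": ["regex-like"],
    "blas-like": [],
    "cuda-rt": [],
    "regex-like": [],
}
direct = ["torch-like", "tokenizer-like"]
resolved = transitive_closure(graph, direct)
assert len(direct) == 2 and len(resolved) == 5
```

The paper's 4,664 direct versus 11,508 transitive figures are the same computation at ecosystem scale; as the referee notes, such counts quantify exposure but say nothing about which integrity mechanisms are present.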
-
Referee: [Architectural layers and structural gaps] The section introducing the four architectural layers and gaps: these are presented as structural gaps that 'traditional supply chain mechanisms do not address,' yet the paper provides no systematic comparison or survey of existing tools (such as model registries, lockfiles, or provenance systems) to show why they fail to cover verifiability, versioning, observability, or traceability in the AI context.
Authors: We acknowledge that the manuscript does not contain a systematic survey or comparison of existing tools such as model registries, lockfiles, or provenance systems. The gaps are defined structurally by contrasting the requirements of each AI layer (for example, verifiability of data-to-model couplings) against the typical coverage of conventional mechanisms, which focus primarily on code artifacts. To make this reasoning more explicit, we will add a concise paragraph or short subsection that discusses representative existing approaches (such as MLflow for model tracking or SPDX-based SBOMs) and explains their current limitations with respect to the four AI-specific gaps. This addition will remain brief and within the scope of a position paper. revision: yes
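The rebuttal's contrast between code-centric SBOM records and AI-specific lineage needs can be illustrated with two record shapes. The field names below are illustrative only; they are not drawn from SPDX, MLflow, or the paper.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SbomComponent:
    """Roughly what a code-centric SBOM entry captures per component."""
    name: str
    version: str
    checksum: str

@dataclass
class ModelLineage:
    """Extra fields an AI artifact would need to close the four gaps."""
    component: SbomComponent
    dataset_digest: str              # traceability: which data produced it
    training_config_digest: str      # verifiability: which run produced it
    parent_model: Optional[str] = None  # lineage chain, e.g. fine-tuning base
    eval_metrics: dict[str, float] = field(default_factory=dict)  # observability baseline

weights = SbomComponent("chat-model", "1.4.0", "sha256:abc...")
lineage = ModelLineage(weights,
                       dataset_digest="sha256:def...",
                       training_config_digest="sha256:123...",
                       parent_model="base-model:1.0.0",
                       eval_metrics={"accuracy": 0.91})
assert lineage.parent_model == "base-model:1.0.0"
```

A conventional SBOM stops at the first record shape; the rebuttal's argument is that the second shape's fields have no standard home in code-centric tooling, which is what the planned comparison subsection would spell out.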
Circularity Check
No significant circularity: purely observational and descriptive analysis
full rationale
The paper presents an observational decomposition of AI software supply chains into four layers and identifies four gaps (verifiability, versioning, observability, traceability) that traditional mechanisms allegedly do not address. It supports the scale claim solely via direct measurement of a 48-project reference stack (4,664 direct dependencies, 11,508 transitive packages, ~392M LOC). No equations, derivations, predictions, or first-principles results exist. No self-definitional statements, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes appear. The gap identification and scale measurement are independent; the former is conceptual framing and the latter is empirical counting. This matches the default expectation of a non-circular descriptive paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: AI systems rest on software with low integrity mechanisms
invented entities (2)
- Four architectural layers: no independent evidence
- Four structural gaps: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Poisoning web-scale training datasets is practical
Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. In 2024 IEEE Symposium on Security and Privacy (SP), pages 407–425. IEEE, 2024
2024
-
[2]
Poisoning and backdooring contrastive learning
Nicholas Carlini and Andreas Terzis. Poisoning and backdooring contrastive learning. arXiv preprint arXiv:2106.09667, 2021
2021
-
[3]
How is ChatGPT’s behavior changing over time?
Lingjiao Chen, Matei Zaharia, and James Zou. How is ChatGPT’s behavior changing over time? Harvard Data Science Review, 6(2), 2024
2024
-
[4]
Model-driven prompt engineering
Robert Clarisó and Jordi Cabot. Model-driven prompt engineering. In 2023 ACM/IEEE 26th International Conference on Model Driven Engineering Languages and Systems (MODELS), pages 47–54. IEEE, 2023
2023
-
[5]
Data scientists targeted by malicious hugging face ml models with silent backdoor
David Cohen. Data scientists targeted by malicious hugging face ml models with silent backdoor. JFrog, 2024. Accessed: April 2026
2024
-
[6]
GPT3.int8(): 8-bit matrix multiplication for transformers at scale
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 30318–30332. Curran Associates, Inc., 2022
2022
-
[7]
Creating the first confidential GPUs
Gobikrishna Dhanuskodi, Sudeshna Guha, Vidhya Krishnan, Aruna Manjunatha, Rob Nertney, Michael O’Connor, and Phil Rogers. Creating the first confidential GPUs. Communications of the ACM, 67(1):60–67, 2023
2023
-
[8]
Datasheets for datasets
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021
2021
-
[9]
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017
2017
-
[10]
LoRA: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022
2022
-
[11]
An empirical study of pre-trained model reuse in the hugging face deep learning model registry
Wenxin Jiang, Nicholas Synovic, Matt Hyatt, Taylor R Schorlemmer, Rohan Sethi, Yung-Hsiang Lu, George K Thiruvathukal, and James C Davis. An empirical study of pre-trained model reuse in the hugging face deep learning model registry. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 2463–2475. IEEE, 2023
2023
-
[12]
Billion-scale similarity search with GPUs
Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019
2019
-
[13]
SoK: Taxonomy of attacks on open-source software supply chains
Piergiorgio Ladisa, Henrik Plate, Matias Martinez, and Olivier Barais. SoK: Taxonomy of attacks on open-source software supply chains. In 2023 IEEE Symposium on Security and Privacy (SP), pages 1509–1526. IEEE, 2023
2023
-
[14]
Reproducible builds: Increasing the integrity of software supply chains
Chris Lamb and Stefano Zacchiroli. Reproducible builds: Increasing the integrity of software supply chains. IEEE Software, 39(2):62–70, 2021
2021
-
[15]
Model cards for model reporting
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229, 2019
2019
-
[16]
What we know about AIBOMs: Results from a multivocal literature review on artificial intelligence bill of materials
Sabato Nocera, Massimiliano Di Penta, Fatima Ahmed, Simone Romano, and Giuseppe Scanniello. What we know about AIBOMs: Results from a multivocal literature review on artificial intelligence bill of materials. ACM Transactions on Software Engineering and Methodology, 2025
2025
-
[17]
Problems and opportunities in training deep learning software systems: An analysis of variance
Hung Viet Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan. Problems and opportunities in training deep learning software systems: An analysis of variance. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pages 771–783, 2020
2020
-
[18]
Three’s a crowd: TeamPCP trojanizes LiteLLM in continuation of campaign
Benjamin Read, Merav Bar, Rami McCarthy, and James Haughom. Three’s a crowd: TeamPCP trojanizes LiteLLM in continuation of campaign. Wiz Blog, March 2026. Accessed: April 2026
2026
-
[19]
Hidden technical debt in machine learning systems
David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems, 28, 2015
2015
-
[20]
Reproducibility in machine-learning-based research: Overview, barriers, and drivers
Harald Semmelrock, Tony Ross-Hellauer, Simone Kopeinik, Dieter Theiler, Armin Haberl, Stefan Thalmann, and Dominik Kowald. Reproducibility in machine-learning-based research: Overview, barriers, and drivers. AI Magazine, 46(2):e70002, 2025
2025
-
[21]
An empirical study on remote code execution in machine learning model hosting ecosystems
Mohammed Latif Siddiq, Tanzim Hossain Romel, Natalie Sekerak, Beatrice Casey, and Joanna Santos. An empirical study on remote code execution in machine learning model hosting ecosystems. arXiv preprint arXiv:2601.14163, 2026
2026
-
[22]
SLSA supply-chain levels for software artifacts
The Linux Foundation. SLSA supply-chain levels for software artifacts. https://slsa.dev,
-
[23]
Accessed: 2026-04-20
2026
-
[24]
Identifying and eliminating CSAM in generative ML training data and models
David Thiel. Identifying and eliminating CSAM in generative ML training data and models. Stanford Internet Observatory, Cyber Policy Center, December, 23(3):131, 2023
2023
-
[25]
in-toto: Providing farm-to-table guarantees for bits and bytes
Santiago Torres-Arias, Hammad Afzali, Trishank Karthik Kuppusamy, Reza Curtmola, and Justin Cappos. in-toto: Providing farm-to-table guarantees for bits and bytes. In 28th USENIX Security Symposium (USENIX Security 19), pages 1393–1410, 2019
2019
-
[26]
Confidential computing within an AI accelerator
Kapil Vaswani, Stavros Volos, Cédric Fournet, Antonio Nino Diaz, Ken Gordon, Balaji Vembu, Sam Webster, David Chisnall, Saurabh Kulkarni, Graham Cunningham, et al. Confidential computing within an AI accelerator. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 501–518, 2023
2023
-
[27]
Research directions in software supply chain security
Laurie Williams, Giacomo Benedetti, Sivana Hamer, Ranindya Paramitha, Imranur Rahman, Mahzabin Tamanna, Greg Tystahl, Nusrat Zahan, Patrick Morrison, Yasemin Acar, et al. Research directions in software supply chain security. ACM Transactions on Software Engineering and Methodology, 34(5):1–38, 2025
2025
-
[28]
The shift from models to compound ai systems
Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to compound ai systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/, 2024
2024