pith. machine review for the scientific record.

arxiv: 2604.27781 · v1 · submitted 2026-04-30 · 💻 cs.SE

Recognition: unknown

The Grand Software Supply Chain of AI Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 06:48 UTC · model grok-4.3

classification 💻 cs.SE
keywords: AI supply chain · software integrity · verifiability · versioning · observability · traceability · dependency management · AI systems

The pith

AI systems suffer from four unaddressed supply chain gaps that leave them without verifiability, versioning, observability, or traceability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the AI software supply chain as a first-class object of study and decomposes it into four layers: data acquisition, model training, model inference, and a cross-cutting substrate. Within this structure it isolates four structural gaps that conventional supply-chain tools do not cover. Current AI systems fail all four: they embed undeclared behavioral couplings, cannot revert to known-good states, degrade without warning, and make lineage reconstruction nearly impossible. A measurement of one reference stack of 48 production-grade projects shows the practical scale: 4,664 declared direct dependencies that resolve to 11,508 packages and roughly 392 million lines of code. The argument is that these gaps expose AI systems to integrity failures across every stage from data ingestion to inference.
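The jump from 4,664 declared direct dependencies to 11,508 resolved packages is the usual behavior of transitive closure over a dependency graph. A minimal sketch of that expansion, using an invented toy graph (all package names are stand-ins, and real resolvers also handle version constraints):

```python
from collections import deque

# Hypothetical toy dependency graph: package -> its declared direct dependencies.
GRAPH = {
    "ml-app":        ["tensor-lib", "data-loader"],
    "tensor-lib":    ["blas-bindings", "proto-serde"],
    "data-loader":   ["proto-serde", "http-client"],
    "blas-bindings": [],
    "proto-serde":   ["codegen-core"],
    "http-client":   ["tls-stack"],
    "codegen-core":  [],
    "tls-stack":     [],
}

def resolve(root: str) -> set[str]:
    """Breadth-first walk from the root's direct dependencies to the full
    transitive closure, mirroring how a resolver expands a manifest."""
    seen, queue = set(), deque(GRAPH[root])
    while queue:
        pkg = queue.popleft()
        if pkg in seen:
            continue
        seen.add(pkg)
        queue.extend(GRAPH[pkg])
    return seen

direct = set(GRAPH["ml-app"])
transitive = resolve("ml-app")
print(f"{len(direct)} direct -> {len(transitive)} transitive packages")
```

Even in this toy case, two direct dependencies expand to seven resolved packages; at the paper's scale the same mechanics produce the roughly 2.5x blowup it measures.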

Core claim

By elevating the AI supply chain to first-class status, the paper decomposes it into data acquisition, model training, model inference, and a cross-cutting substrate, then isolates four gaps (verifiability, versioning, observability, and traceability) that traditional mechanisms leave open. It shows that present-day AI systems carry undeclared behavioral couplings with no resolver enforcement, cannot be rolled back to known working assemblies, degrade silently instead of surfacing breaking changes, and allow only approximate lineage reconstruction. The scale of the problem is illustrated by a reference stack of 48 open-source projects that declares 4,664 direct dependencies, resolves to 11,508 transitive packages, and totals roughly 392 million lines of code.

What carries the argument

The four-layer decomposition of the AI supply chain (data acquisition, model training, model inference, cross-cutting substrate) together with the four structural gaps (verifiability, versioning, observability, traceability) that current mechanisms fail to close.

If this is right

  • AI systems embed undeclared behavioral couplings that no resolver can enforce.
  • AI systems cannot be reverted to known working assemblies when problems arise.
  • AI systems degrade silently instead of surfacing breaking changes to operators.
  • AI system lineage can only be approximated rather than reconstructed exactly.
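The second failure mode, inability to revert to a known working assembly, is handled in conventional software by content-addressed manifests. A minimal sketch of how that idea might extend to an AI stack, with invented artifact names and stand-in file contents:

```python
import hashlib
import json

def digest(data: bytes) -> str:
    """Content hash of one artifact (code, weights, or a data snapshot)."""
    return hashlib.sha256(data).hexdigest()

def snapshot(artifacts: dict[str, bytes]) -> str:
    """Freeze the current assembly as a JSON manifest of name -> hash."""
    return json.dumps({name: digest(blob) for name, blob in sorted(artifacts.items())})

def verify(manifest: str, artifacts: dict[str, bytes]) -> list[str]:
    """Return the names of artifacts that drifted from the recorded assembly."""
    want = json.loads(manifest)
    return [name for name, blob in artifacts.items() if want.get(name) != digest(blob)]

# Record a known-good assembly, then check a later state against it.
known_good = snapshot({"model.bin": b"weights-v1", "tokenizer.json": b"vocab-v1"})
drifted = verify(known_good, {"model.bin": b"weights-v2", "tokenizer.json": b"vocab-v1"})
print(drifted)  # only the silently changed artifact is flagged
```

A manifest like this gives reversion a concrete target: the known-good state is whatever set of artifacts reproduces the recorded hashes.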

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Security and reliability work on AI may need to shift focus from model internals to the full dependency stack.
  • Standard dependency tools from traditional software may require AI-specific extensions to close the four gaps.
  • Organizations deploying AI could reduce risk by treating the entire stack as a single versioned and verifiable artifact.
  • The measured dependency counts suggest that AI maintenance costs are dominated by transitive package management rather than model training itself.
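One way to picture an AI-specific extension for the observability gap: pin a small set of golden input/output probes when an assembly is declared working, and fail loudly when behavior drifts. A hedged sketch, where the probes and the predict functions are invented stand-ins for a real inference stack:

```python
# Golden probes recorded at release time (invented examples).
GOLDEN = {"2+2": "4", "capital of France": "Paris"}

def check_behavior(predict) -> list[str]:
    """Return the golden probes whose answers changed under this stack."""
    return [q for q, expected in GOLDEN.items() if predict(q) != expected]

old_stack = lambda q: GOLDEN[q]                               # behaves as pinned
new_stack = lambda q: "Lyon" if "France" in q else GOLDEN[q]  # silent regression

assert check_behavior(old_stack) == []
broken = check_behavior(new_stack)
print(broken)  # the regression is surfaced instead of degrading silently
```

The point is not the probes themselves but the contract: a behavioral change anywhere in the dependency stack becomes a visible failure rather than a silent one.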

Load-bearing premise

That the four identified gaps are the main unaddressed integrity problems in AI supply chains and that the measured 48-project stack is representative of typical production systems.

What would settle it

An AI system that demonstrably enforces component verification, supports full reversion to prior working assemblies, actively surfaces breaking changes through observability, and provides accurate lineage tracing would falsify the claim that current systems fall short on all four gaps.
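The lineage-tracing capability named here can be pictured as a provenance DAG in which every artifact records the identifiers of its inputs; reconstruction then becomes an exact graph walk rather than an approximation. A minimal sketch with invented artifact IDs:

```python
# Hypothetical provenance records: each artifact lists its direct inputs.
RECORDS = {
    "dataset:v3": {"inputs": []},
    "dataset:v4": {"inputs": ["dataset:v3"]},
    "model:v1":   {"inputs": ["dataset:v4"]},
    "service:v7": {"inputs": ["model:v1"]},
}

def lineage(artifact: str) -> list[str]:
    """Depth-first walk over recorded inputs; returns every ancestor in order."""
    ancestors = []
    for parent in RECORDS[artifact]["inputs"]:
        ancestors.append(parent)
        ancestors.extend(lineage(parent))
    return ancestors

print(lineage("service:v7"))
```

A system that maintained records like these end-to-end, from data ingestion through inference, would meet the traceability bar the review describes.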

Figures

Figures reproduced from arXiv: 2604.27781 by Carmine Cesarano, Martin Monperrus.

Figure 1
Figure 1. The software supply chain of a reference stack of 48 production-grade open-source projects. view at source ↗
read the original abstract

AI systems rest on software with low integrity mechanisms, leaving AI systems exposed across every stage from data acquisition to final inference. This paper makes the AI supply chain a first-class object of analysis, decomposing it across four architectural layers: data acquisition, model training, model inference, and a cross-cutting substrate. Within these layers, we identify four structural gaps that traditional supply chain mechanisms do not address: verifiability, versioning, observability, and traceability. Current AI systems fall short on all of them: they carry undeclared behavioral couplings that no resolver enforces; they cannot be reverted back to known working assemblies; they degrade silently rather than surfacing breaking changes; and their lineage can hardly be approximated. To illustrate the scale of the software supply chain of AI, we measure a reference stack of 48 production-grade open-source projects, which declares 4,664 direct dependencies, resolves to 11,508 transitive packages, and totals roughly 392M lines of code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that AI systems rest on software with low integrity mechanisms and positions the AI software supply chain as a first-class object of analysis. It decomposes the supply chain into four architectural layers (data acquisition, model training, model inference, and cross-cutting substrate) and identifies four structural gaps (verifiability, versioning, observability, and traceability) that traditional mechanisms do not address. The authors assert that current AI systems fall short on all four: they carry undeclared behavioral couplings with no resolver, cannot revert to known working assemblies, degrade silently, and have lineage that is hard to approximate. To illustrate scale, the paper measures a reference stack of 48 production-grade open-source projects reporting 4,664 direct dependencies, 11,508 transitive packages, and roughly 392M lines of code.

Significance. If the four gaps are shown to be both real and unaddressed by existing tools, the work would be significant for software engineering research on AI systems, as it would provide a structured framework for analyzing integrity risks across the full pipeline from data to inference. The concrete dependency counts from the 48-project stack are a clear strength, offering quantifiable evidence of the size and transitivity of AI software supply chains that could motivate further empirical studies.

major comments (3)
  1. [Abstract and gap identification section] Abstract and the section defining the four structural gaps: the claim that 'current AI systems fall short on all of them' (undeclared behavioral couplings with no resolver, inability to revert, silent degradation, hard-to-approximate lineage) is asserted without supporting evidence from the measured stack. The empirical data only reports aggregate counts and does not include any per-project analysis showing absence of resolvers, lockfiles, change-detection hooks, or lineage metadata.
  2. [Measurement of the reference stack] The measurement section on the reference stack of 48 projects: the reported figures (4,664 direct dependencies, 11,508 transitive packages, ~392M LOC) establish scale and transitivity but contain no breakdown or audit demonstrating that the claimed gaps are present. Scale alone does not establish that traditional mechanisms (e.g., SBOMs or dependency resolvers) are inapplicable or absent.
  3. [Architectural layers and structural gaps] The section introducing the four architectural layers and gaps: these are presented as structural gaps that 'traditional supply chain mechanisms do not address,' yet the paper provides no systematic comparison or survey of existing tools (such as model registries, lockfiles, or provenance systems) to show why they fail to cover verifiability, versioning, observability, or traceability in the AI context.
minor comments (2)
  1. [Terminology] The paper would benefit from explicit definitions or examples for terms such as 'undeclared behavioral couplings' and 'silent degradation' with references to specific components in the 48-project stack.
  2. [References] Add citations to prior work on software supply chain security, SBOM standards, and ML model provenance to better situate the novelty of the four gaps.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. The feedback correctly identifies that the manuscript's empirical measurement focuses on aggregate scale while the claims about the four gaps rest on structural analysis of the architectural layers. We address each major comment below, clarifying the intended role of the measurement and outlining targeted revisions to improve the connection between evidence and claims without altering the core contribution.

read point-by-point responses
  1. Referee: [Abstract and gap identification section] Abstract and the section defining the four structural gaps: the claim that 'current AI systems fall short on all of them' (undeclared behavioral couplings with no resolver, inability to revert, silent degradation, hard-to-approximate lineage) is asserted without supporting evidence from the measured stack. The empirical data only reports aggregate counts and does not include any per-project analysis showing absence of resolvers, lockfiles, change-detection hooks, or lineage metadata.

    Authors: We agree that the measurement reports only aggregate counts (4,664 direct dependencies, 11,508 transitive packages, ~392M LOC) and contains no per-project audit of mechanisms such as resolvers or lockfiles. The assertion that current AI systems fall short on the four gaps is derived from the architectural decomposition across the four layers, where we argue that AI-specific elements (data pipelines, model artifacts, inference services) create behavioral couplings and lineage needs outside the scope of conventional code-centric tools. The reference stack serves to quantify the underlying software complexity that makes these gaps consequential. We will revise the abstract and gap-identification section to explicitly separate the structural argument from the empirical illustration and add a clarifying sentence on the measurement's purpose and limitations. revision: yes

  2. Referee: [Measurement of the reference stack] The measurement section on the reference stack of 48 projects: the reported figures (4,664 direct dependencies, 11,508 transitive packages, ~392M LOC) establish scale and transitivity but contain no breakdown or audit demonstrating that the claimed gaps are present. Scale alone does not establish that traditional mechanisms (e.g., SBOMs or dependency resolvers) are inapplicable or absent.

    Authors: The referee is correct that the reported figures demonstrate scale and transitivity but provide no breakdown or audit showing the presence or absence of specific mechanisms such as SBOMs or resolvers. The measurement's purpose is to illustrate the size and interconnectedness of the software foundation supporting AI systems, thereby underscoring why the identified gaps matter at scale. We will revise the measurement section to state its purpose and limitations more explicitly and, where feasible, add high-level observations from the dependency data (for example, dominant package managers) without claiming a comprehensive audit of gap-related features. revision: yes

  3. Referee: [Architectural layers and structural gaps] The section introducing the four architectural layers and gaps: these are presented as structural gaps that 'traditional supply chain mechanisms do not address,' yet the paper provides no systematic comparison or survey of existing tools (such as model registries, lockfiles, or provenance systems) to show why they fail to cover verifiability, versioning, observability, or traceability in the AI context.

    Authors: We acknowledge that the manuscript does not contain a systematic survey or comparison of existing tools such as model registries, lockfiles, or provenance systems. The gaps are defined structurally by contrasting the requirements of each AI layer (for example, verifiability of data-to-model couplings) against the typical coverage of conventional mechanisms, which focus primarily on code artifacts. To make this reasoning more explicit, we will add a concise paragraph or short subsection that discusses representative existing approaches (such as MLflow for model tracking or SPDX-based SBOMs) and explains their current limitations with respect to the four AI-specific gaps. This addition will remain brief and within the scope of a position paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely observational and descriptive analysis

full rationale

The paper presents an observational decomposition of AI software supply chains into four layers and identifies four gaps (verifiability, versioning, observability, traceability) that traditional mechanisms allegedly do not address. It supports the scale claim solely via direct measurement of a 48-project reference stack (4,664 direct dependencies, 11,508 transitive packages, ~392M LOC). No equations, derivations, predictions, or first-principles results exist. No self-definitional statements, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes appear. The gap identification and scale measurement are independent; the former is conceptual framing and the latter is empirical counting. This matches the default expectation of a non-circular descriptive paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Claims rely on domain assumptions about AI software and conceptual entities for layers and gaps; no fitted parameters.

axioms (1)
  • domain assumption: AI systems rest on software with low integrity mechanisms
    Foundational premise stated in the abstract.
invented entities (2)
  • Four architectural layers (no independent evidence)
    purpose: Decompose AI supply chain
    New conceptual structure for analysis.
  • Four structural gaps (no independent evidence)
    purpose: Highlight AI-specific deficiencies
    Identified as unaddressed by traditional mechanisms.

pith-pipeline@v0.9.0 · 10742 in / 1065 out tokens · 128973 ms · 2026-05-07T06:48:09.669975+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Poisoning web-scale training datasets is practical

    Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. In 2024 IEEE Symposium on Security and Privacy (SP), pages 407–425. IEEE, 2024

  2. [2]

    Poisoning and backdooring contrastive learning. arXiv preprint arXiv:2106.09667, 2021

    Nicholas Carlini and Andreas Terzis. Poisoning and backdooring contrastive learning. arXiv preprint arXiv:2106.09667, 2021

  3. [3]

    How is chatgpt’s behavior changing over time? Harvard Data Science Review, 6(2), 2024

    Lingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt’s behavior changing over time? Harvard Data Science Review, 6(2), 2024

  4. [4]

    Model-driven prompt engineering

    Robert Clarisó and Jordi Cabot. Model-driven prompt engineering. In 2023 ACM/IEEE 26th International Conference on Model Driven Engineering Languages and Systems (MODELS), pages 47–54. IEEE, 2023

  5. [5]

    Data scientists targeted by malicious hugging face ml models with silent backdoor

    David Cohen. Data scientists targeted by malicious hugging face ml models with silent backdoor. JFrog, 2024. Accessed: April 2026

  6. [6]

    Gpt3.int8(): 8-bit matrix multiplication for transformers at scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 30318–30332. Curran Associates, Inc., 2022

  7. [7]

    Creating the first confidential gpus. Communications of the ACM, 67(1):60–67, 2023

    Gobikrishna Dhanuskodi, Sudeshna Guha, Vidhya Krishnan, Aruna Manjunatha, Rob Nertney, Michael O’Connor, and Phil Rogers. Creating the first confidential gpus. Communications of the ACM, 67(1):60–67, 2023

  8. [8]

    Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021

  9. [9]

    BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

    Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017

  10. [10]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

  11. [11]

    An empirical study of pre-trained model reuse in the hugging face deep learning model registry

    Wenxin Jiang, Nicholas Synovic, Matt Hyatt, Taylor R Schorlemmer, Rohan Sethi, Yung-Hsiang Lu, George K Thiruvathukal, and James C Davis. An empirical study of pre-trained model reuse in the hugging face deep learning model registry. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 2463–2475. IEEE, 2023

  12. [12]

    Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547, 2019

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547, 2019

  13. [13]

    Sok: Taxonomy of attacks on open-source software supply chains

    Piergiorgio Ladisa, Henrik Plate, Matias Martinez, and Olivier Barais. Sok: Taxonomy of attacks on open-source software supply chains. In 2023 IEEE Symposium on Security and Privacy (SP), pages 1509–1526. IEEE, 2023

  14. [14]

    Reproducible builds: Increasing the integrity of software supply chains. IEEE Software, 39(2):62–70, 2021

    Chris Lamb and Stefano Zacchiroli. Reproducible builds: Increasing the integrity of software supply chains. IEEE Software, 39(2):62–70, 2021

  15. [15]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pages 220–229, 2019

  16. [16]

    What we know about aiboms: Results from a multivocal literature review on artificial intelligence bill of materials. ACM Transactions on Software Engineering and Methodology, 2025

    Sabato Nocera, Massimiliano Di Penta, Fatima Ahmed, Simone Romano, and Giuseppe Scanniello. What we know about aiboms: Results from a multivocal literature review on artificial intelligence bill of materials. ACM Transactions on Software Engineering and Methodology, 2025

  17. [17]

    Problems and opportunities in training deep learning software systems: An analysis of variance

    Hung Viet Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan. Problems and opportunities in training deep learning software systems: An analysis of variance. In Proceedings of the 35th IEEE/ACM international conference on automated software engineering, pages 771–783, 2020

  18. [18]

    Three’s a crowd: TeamPCP trojanizes LiteLLM in continuation of campaign

    Benjamin Read, Merav Bar, Rami McCarthy, and James Haughom. Three’s a crowd: TeamPCP trojanizes LiteLLM in continuation of campaign. Wiz Blog, March 2026. Accessed: April 2026

  19. [19]

    Hidden technical debt in machine learning systems. Advances in neural information processing systems, 28, 2015

    David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. Advances in neural information processing systems, 28, 2015

  20. [20]

    Reproducibility in machine-learning-based research: Overview, barriers, and drivers. AI Magazine, 46(2):e70002, 2025

    Harald Semmelrock, Tony Ross-Hellauer, Simone Kopeinik, Dieter Theiler, Armin Haberl, Stefan Thalmann, and Dominik Kowald. Reproducibility in machine-learning-based research: Overview, barriers, and drivers. AI Magazine, 46(2):e70002, 2025

  21. [21]

    An empirical study on remote code execution in machine learning model hosting ecosystems. arXiv preprint arXiv:2601.14163, 2026

    Mohammed Latif Siddiq, Tanzim Hossain Romel, Natalie Sekerak, Beatrice Casey, and Joanna Santos. An empirical study on remote code execution in machine learning model hosting ecosystems. arXiv preprint arXiv:2601.14163, 2026

  22. [22]

    SLSA supply-chain levels for software artifacts

    The Linux Foundation. SLSA supply-chain levels for software artifacts. https://slsa.dev. Accessed: 2026-04-20

  24. [24]

    Identifying and eliminating csam in generative ml training data and models

    David Thiel. Identifying and eliminating csam in generative ml training data and models. Stanford Internet Observatory, Cyber Policy Center, December, 23(3):131, 2023

  25. [25]

    in-toto: Providing farm-to-table guarantees for bits and bytes

    Santiago Torres-Arias, Hammad Afzali, Trishank Karthik Kuppusamy, Reza Curtmola, and Justin Cappos. in-toto: Providing farm-to-table guarantees for bits and bytes. In28th USENIX Security Symposium (USENIX Security 19), pages 1393–1410, 2019

  26. [26]

    Confidential computing within an AI accelerator

    Kapil Vaswani, Stavros Volos, Cédric Fournet, Antonio Nino Diaz, Ken Gordon, Balaji Vembu, Sam Webster, David Chisnall, Saurabh Kulkarni, Graham Cunningham, et al. Confidential computing within an AI accelerator. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 501–518, 2023

  27. [27]

    Research directions in software supply chain security. ACM Transactions on Software Engineering and Methodology, 34(5):1–38, 2025

    Laurie Williams, Giacomo Benedetti, Sivana Hamer, Ranindya Paramitha, Imranur Rahman, Mahzabin Tamanna, Greg Tystahl, Nusrat Zahan, Patrick Morrison, Yasemin Acar, et al. Research directions in software supply chain security. ACM Transactions on Software Engineering and Methodology, 34(5):1–38, 2025

  28. [28]

    The shift from models to compound ai systems

    Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to compound ai systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/, 2024