Data-aware Static Analysis: Improving Detection of Semantic Faults in Machine Learning Code Using Data Characteristics
Pith reviewed 2026-06-27 15:42 UTC · model grok-4.3
The pith
A data-aware static analysis detects semantic faults in machine learning code before any model is trained.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that semantic faults specific to machine learning can be detected at development time by a static analysis that tracks data characteristics through data and control flow while consulting API contracts, rather than requiring execution or model training.
What carries the argument
The data-aware static analysis that performs combined data-flow and control-flow reasoning over ML code and API contracts to infer relevant data characteristics.
If this is right
- Developers receive warnings about data-related ML bugs while editing code, before any training run.
- Faults such as unscaled inputs to sensitive models or incorrect feature dimensions become detectable without manual post-training inspection.
- The analysis operates at the level of API contracts rather than low-level tensor shapes, keeping it language- and framework-agnostic in principle.
- Existing static-analysis tools for ML can be extended by adding the data-characteristic tracking layer described.
Where Pith is reading between the lines
- If the inference of data characteristics proves reliable, it could be integrated into IDEs as a lightweight check that runs on every save.
- The same machinery might apply to detecting other data-dependent issues such as privacy leaks or fairness violations that depend on data distributions.
- A natural next measurement would be the false-positive rate on a broad corpus of notebooks that do not contain the targeted faults.
Load-bearing premise
Data characteristics that determine whether a semantic fault exists can be inferred from source code and library contracts with enough precision to catch real faults without false positives or missed cases.
What would settle it
Running the analysis on a larger set of notebooks containing known semantic faults and checking whether every fault is reported exactly when the data characteristic mismatch is present in the code.
Figures
read the original abstract
Semantic faults specific to the use of machine learning models are a common problem for machine learning developers, causing suboptimal predictions, high computational cost, or incorrect outputs. For example, one may erroneously use unscaled data to train a scale-sensitive model. Machine learning developers detect these faults after training their models and manually analyzing the results, making it an inefficient process. We propose a novel data-aware static analysis approach to detect semantic faults in machine learning code, allowing developers to reveal these bugs while writing code instead of after training the model. Our approach uses combined data and control flow analysis, and API contracts, enabling data-aware reasoning about machine learning code at a high level of abstraction. We highlight the potential of our solution by analyzing a sample of real-world machine learning notebooks, finding that we can detect faults that require a data-aware approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a novel data-aware static analysis approach to detect semantic faults in machine learning code (e.g., using unscaled data for scale-sensitive models). The method combines data and control flow analysis with API contracts to enable high-level reasoning about data characteristics without executing or training models. The authors evaluate the approach on a sample of real-world ML notebooks and report that it can detect faults requiring data awareness.
Significance. If the approach achieves adequate precision and recall on real code, it would represent a useful advance in software engineering for ML by shifting detection of data-dependent semantic faults to coding time. The integration of API contracts for abstraction is a promising direction, though its effectiveness depends on details not visible in the provided text.
major comments (1)
- [Abstract] Abstract: the central claim that the approach 'enables data-aware reasoning about machine learning code at a high level of abstraction' and can 'detect faults that require a data-aware approach' rests on static inference of data properties (scaling, shapes, distributions) from code, control flow, and contracts. No algorithm, contract modeling, over-approximation strategy, or handling of Python dynamism is described, making it impossible to assess whether the method avoids excessive false positives or misses on runtime-dependent ML library properties.
minor comments (1)
- [Abstract] Abstract: the evaluation is described only as 'analyzing a sample of real-world machine learning notebooks' with no mention of sample size, specific faults found, or any metrics.
Simulated Author's Rebuttal
We thank the referee for their review and the opportunity to clarify our work. We address the single major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the approach 'enables data-aware reasoning about machine learning code at a high level of abstraction' and can 'detect faults that require a data-aware approach' rests on static inference of data properties (scaling, shapes, distributions) from code, control flow, and contracts. No algorithm, contract modeling, over-approximation strategy, or handling of Python dynamism is described, making it impossible to assess whether the method avoids excessive false positives or misses on runtime-dependent ML library properties.
Authors: We agree that the abstract is high-level and does not describe the algorithm, contract modeling, over-approximation strategy, or handling of Python dynamism. The full manuscript expands on the combined data/control-flow analysis and API contracts, but we acknowledge these details are insufficient for assessing precision, recall, or robustness to runtime-dependent library behavior. We will revise the abstract to include a brief overview of the key technical components and add/expand sections describing the analysis algorithm, contract representation, over-approximation choices, and Python dynamism handling. The notebook evaluation is presented as an initial demonstration of data-aware fault detection rather than a full precision/recall study; we will clarify this scope. revision: yes
Circularity Check
No circularity: methodological proposal with no derivations or fitted quantities
full rationale
The paper proposes a static analysis technique combining data/control-flow analysis with API contracts to detect ML-specific semantic faults at development time. No equations, parameters, predictions, or uniqueness theorems appear in the provided text. The central claim rests on the existence of a combined analysis plus a notebook sample showing detectable faults; this does not reduce to any self-definition, fitted input renamed as prediction, or self-citation chain. The approach is presented as novel without invoking prior author results as load-bearing justification. This is a standard non-circular engineering proposal.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
OECD/BCG/INSEAD,The Adoption of Artificial Intelligence in Firms: New Evidence for Policymaking. OECD Publishing, May 2025. [Online]. Available: http://dx.doi.org/10.1787/f9ef33c3-en
-
[2]
M. F. Arroyabe, C. F. Arranz, I. Fernandez De Arroyabe, and J. C. Fernandez de Arroyabe, “Analyzing ai adoption in european smes: A study of digital capabilities, innovation, and external environment,” Technology in Society, vol. 79, p. 102733, Dec. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.techsoc. 2024.102733
-
[3]
Maintainability challenges in ml: A systematic literature review,
K. Shivashankar and A. Martini, “Maintainability challenges in ml: A systematic literature review,” in2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, Aug. 2022, p. 60–67. [Online]. Available: http://dx.doi.org/10.1109/ SEAA56994.2022.00018
arXiv 2022
-
[4]
Santhanam,Quality Management of Machine Learning Systems
P. Santhanam,Quality Management of Machine Learning Systems. Springer International Publishing, 2020, p. 1–13. [Online]. Available: http://dx.doi.org/10.1007/ 978-3-030-62144-5 1
2020
-
[5]
Characterizing technical debt and antipatterns in ai-based systems: A systematic mapping study,
J. Bogner, R. Verdecchia, and I. Gerostathopoulos, “Characterizing technical debt and antipatterns in ai-based systems: A systematic mapping study,” in2021 IEEE/ACM International Conference on Technical Debt (TechDebt). IEEE, May 2021, p. 64–73. [Online]. Available: http://dx.doi.org/10.1109/ TechDebt52882.2021.00016
arXiv 2021
-
[6]
Quality issues in machine learning software systems,
P.-O. C ˆot´e, A. Nikanjam, R. Bouchoucha, I. Basta, M. Abidi, and F. Khomh, “Quality issues in machine learning software systems,”Empirical Software Engi- neering, vol. 29, no. 6, Sep. 2024. [Online]. Available: http://dx.doi.org/10.1007/s10664-024-10536-7
-
[7]
Architecting ML-enabled systems: Challenges, best practices, and design decisions,
R. Nazir, A. Bucaioni, and P. Pelliccione, “Architecting ml-enabled systems: Challenges, best practices, and design decisions,”Journal of Systems and Software, vol. 207, p. 111860, Jan. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2023.111860
-
[8]
A checklist of quality concerns for architecting ML- intensive systems,
A. Bucaioni, R. Kazman, and P. Pelliccione, “A checklist of quality concerns for architecting ml- intensive systems,”Journal of Systems and Software, vol. 231, p. 112612, Jan. 2026. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2025.112612
-
[9]
Software engineering practices for machine learning — adoption, effects, and team assessment,
A. Serban, K. van der Blom, H. Hoos, and J. Visser, “Software engineering practices for machine learning — adoption, effects, and team assessment,”Journal of Systems and Software, vol. 209, p. 111907, Mar. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2023. 111907
-
[10]
Data collection and quality challenges in deep learning: a data- centric AI perspective,
S. E. Whang, Y . Roh, H. Song, and J.-G. Lee, “Data collection and quality challenges in deep learning: a data-centric ai perspective,”The VLDB Journal, vol. 32, no. 4, p. 791–813, Jan. 2023. [Online]. Available: http://dx.doi.org/10.1007/s00778-022-00775-9
-
[11]
Imbalanced data problem in machine learning: A review,
S. Kumar, S. Datta, V . Singh, S. K. Singh, and R. Sharma, “Opportunities and challenges in data- centric ai,”IEEE Access, vol. 12, p. 33173–33189, 2024. [Online]. Available: http://dx.doi.org/10.1109/ACCESS. 2024.3369417
-
[12]
Testing machine learning and deep learning systems: Achievements and challenges,
S. Albelali and M. Ahmed, “Testing machine learning and deep learning systems: Achievements and challenges,”Arabian Journal for Science and Engineering, vol. 50, no. 15, p. 11433–11484, Jun
-
[13]
Available: http://dx.doi.org/10.1007/ s13369-025-10276-w
[Online]. Available: http://dx.doi.org/10.1007/ s13369-025-10276-w
-
[14]
Why do machine learning notebooks crash? an empirical study on public Python Jupyter notebooks,
Y . Wang, W. Meijer, J. A. H. L ´opez, U. Nilsson, and D. Varr ´o, “Why do machine learning notebooks crash? an empirical study on public python jupyter notebooks,” IEEE Transactions on Software Engineering, vol. 51, no. 7, p. 2181–2196, Jul. 2025. [Online]. Available: http://dx.doi.org/10.1109/TSE.2025.3574500
-
[15]
Taxonomy of real faults in deep learning systems,
N. Humbatova, G. Jahangirova, G. Bavota, V . Riccio, A. Stocco, and P. Tonella, “Taxonomy of real faults in deep learning systems,” inProceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ser. ICSE ’20. ACM, Jun. 2020, p. 1110–1121. [Online]. Available: http://dx.doi.org/10.1145/3377811.3380395
-
[16]
Bug analysis in Jupyter notebook projects: An empirical study,
T. L. De Santana, P. A. D. M. S. Neto, E. S. De Almeida, and I. Ahmed, “Bug analysis in jupyter notebook projects: An empirical study,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 4, p. 1–34, Apr. 2024. [Online]. Available: http://dx.doi.org/10.1145/3641539
-
[17]
What kinds of contracts do ML APIs need?
S. S. Khairunnesa, S. Ahmed, S. M. Imtiaz, H. Rajan, and G. T. Leavens, “What kinds of contracts do ml apis need?”Empirical Software Engineering, vol. 28, no. 6, Oct. 2023. [Online]. Available: http://dx.doi.org/ 10.1007/s10664-023-10320-z
-
[18]
Comparative analysis of real issues in open- source machine learning projects,
T. D. Lai, A. Simmons, S. Barnett, J.-G. Schneider, and R. Vasa, “Comparative analysis of real issues in open- source machine learning projects,”Empirical Software Engineering, vol. 29, no. 3, May 2024. [Online]. Avail- able: http://dx.doi.org/10.1007/s10664-024-10467-3
-
[19]
Bug characterization in machine learning-based systems,
M. M. Morovati, A. Nikanjam, F. Tambon, F. Khomh, and Z. M. Jiang, “Bug characterization in machine learning-based systems,”Empirical Software Engineer- ing, vol. 29, no. 1, Dec. 2023. [Online]. Available: http://dx.doi.org/10.1007/s10664-023-10400-0
-
[20]
Refty: refinement types for valid deep learning models,
Y . Gao, Z. Li, H. Lin, H. Zhang, M. Wu, and M. Yang, “Refty: refinement types for valid deep learning models,” inProceedings of the 44th International Conference on Software Engineering, ser. ICSE ’22. ACM, May 2022, p. 1843–1855. [Online]. Available: http://dx.doi.org/10.1145/3510003.3510077
-
[21]
Safe- DS: A domain specific language to make data science safe,
L. Reimann and G. Kniesel-W ¨unsche, “Safe- ds: A domain specific language to make data science safe,” in2023 IEEE/ACM 45th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, May 2023, p. 72–77. [Online]. Available: http://dx.doi.org/10.1109/ICSE-NIER58687.2023.00019
-
[22]
Mlscent: A tool for anti-pattern detection in ml projects,
K. Shivashankar and A. Martini, “Mlscent: A tool for anti-pattern detection in ml projects,” in2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI (CAIN). IEEE, Apr. 2025, p. 150–160. [Online]. Available: http://dx.doi.org/10. 1109/CAIN66642.2025.00026
arXiv 2025
-
[23]
Design by contract for deep learning APIs,
S. Ahmed, S. M. Imtiaz, S. S. Khairunnesa, B. D. Cruz, and H. Rajan, “Design by contract for deep learning apis,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’23. ACM, Nov. 2023, p. 94–106. [Online]. Available: http://dx.doi.org/10.1145/3611643.3616247
-
[24]
The fault in our stats,
A. Turcotte and N. N. Mehta, “The fault in our stats,” inProceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’25. ACM, Nov. 2025. [Online]. Available: https://reallytg.github.io/files/papers/ ase25 statlint final final.pdf
2025
-
[25]
Expressing and checking statistical assumptions,
A. Turcotte and Z. Wu, “Expressing and checking statistical assumptions,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, p. 2735–2758, Jun. 2025. [Online]. Available: http://dx.doi.org/10.1145/ 3729391
2025
-
[26]
W. Meijer, “Contract-based validation of conceptual design bugs for engineering complex machine learning software,” inProceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, ser. MODELS Companion ’24. ACM, Sep. 2024, p. 155–161. [Online]. Available: http://dx.doi.org/10.1145/3652620.3688201
-
[27]
The data linter: Lightweight, automated sanity checking for ML data sets,
N. Hynes, D. Sculley, and M. Terry, “The data linter: Lightweight, automated sanity checking for ML data sets,” inNIPS MLSys Workshop, vol. 1, no. 5, 2017, p. 10
2017
-
[28]
Jenga: A framework to study the impact of data errors on the predictions of machine learning models,
S. Schelter, T. Rukat, and F. Biessmann, “Jenga: A framework to study the impact of data errors on the predictions of machine learning models,” inProceedings of the 24th International Conference on Extending Database Technology (EDBT). OpenProceedings, Mar
-
[29]
Available: https://openproceedings.org/ 2021/conf/edbt/p134.pdf
[Online]. Available: https://openproceedings.org/ 2021/conf/edbt/p134.pdf
2021
-
[30]
Experiment repository for
A. Authors, “Experiment repository for ”data-informed static analysis: Improving detection of semantic faults in machine learning code using data characteristics”,” Dec. 2025. [Online]. Available: https://doi.org/10.5281/ zenodo.17805911
2025
-
[31]
Dille: Data-informed static analysis,
——, “Dille: Data-informed static analysis,” Dec
-
[32]
Available: https://doi.org/10.5281/ zenodo.17805763
[Online]. Available: https://doi.org/10.5281/ zenodo.17805763
-
[33]
Better code, better sharing: on the need of analyzing jupyter notebooks,
J. Wang, L. Li, and A. Zeller, “Better code, better sharing: on the need of analyzing jupyter notebooks,” inProceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results, ser. ICSE ’20. ACM, Jun. 2020, p. 53–56. [Online]. Available: http://dx.doi.org/10.1145/ 3377816.3381724
arXiv 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.