pith. sign in

arxiv: 2606.09957 · v1 · pith:IGCPYRLPnew · submitted 2026-06-08 · 💻 cs.SE · cs.LG

Data-aware Static Analysis: Improving Detection of Semantic Faults in Machine Learning Code Using Data Characteristics

Pith reviewed 2026-06-27 15:42 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords static analysismachine learningsemantic faultsdata flow analysisAPI contractssoftware engineering
0
0 comments X

The pith

A data-aware static analysis detects semantic faults in machine learning code before any model is trained.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes combining data-flow analysis, control-flow analysis, and API contracts to reason about data characteristics statically in ML programs. This lets the analysis flag issues such as feeding unscaled inputs to scale-sensitive models or mismatched feature counts without executing or training the code. The approach is evaluated on real-world notebooks, where it identifies faults that purely syntactic or type-based checks miss. A sympathetic reader would care because these faults are currently found only after costly training runs and manual result inspection.

Core claim

The central claim is that semantic faults specific to machine learning can be detected at development time by a static analysis that tracks data characteristics through data and control flow while consulting API contracts, rather than requiring execution or model training.

What carries the argument

The data-aware static analysis that performs combined data-flow and control-flow reasoning over ML code and API contracts to infer relevant data characteristics.

If this is right

  • Developers receive warnings about data-related ML bugs while editing code, before any training run.
  • Faults such as unscaled inputs to sensitive models or incorrect feature dimensions become detectable without manual post-training inspection.
  • The analysis operates at the level of API contracts rather than low-level tensor shapes, keeping it language- and framework-agnostic in principle.
  • Existing static-analysis tools for ML can be extended by adding the data-characteristic tracking layer described.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the inference of data characteristics proves reliable, it could be integrated into IDEs as a lightweight check that runs on every save.
  • The same machinery might apply to detecting other data-dependent issues such as privacy leaks or fairness violations that depend on data distributions.
  • A natural next measurement would be the false-positive rate on a broad corpus of notebooks that do not contain the targeted faults.

Load-bearing premise

Data characteristics that determine whether a semantic fault exists can be inferred from source code and library contracts with enough precision to catch real faults without false positives or missed cases.

What would settle it

Running the analysis on a larger set of notebooks containing known semantic faults and checking whether every fault is reported exactly when the data characteristic mismatch is present in the code.

Figures

Figures reproduced from arXiv: 2606.09957 by D\'aniel Varr\'o, Kristian Sandahl, Willem Meijer.

Figure 1
Figure 1. Figure 1: Decision boundary plots of a SVC fitted on unequally scaled data (left) versus equally scaled data (right) support vector machine1 svc on line 4, and trains it on the dataset on line 5 using the fit method. Seemingly, this code is syntactically correct and runs without exceptions. However, it can still fail semantically, as it does not account for the support vector machine’s prerequisite that the scale of… view at source ↗
Figure 2
Figure 2. Figure 2: Visual representation of the running example in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Extension of Figure 2 where the dataset is transformed [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
read the original abstract

Semantic faults specific to the use of machine learning models are a common problem for machine learning developers, causing suboptimal predictions, high computational cost, or incorrect outputs. For example, one may erroneously use unscaled data to train a scale-sensitive model. Machine learning developers detect these faults after training their models and manually analyzing the results, making it an inefficient process. We propose a novel data-aware static analysis approach to detect semantic faults in machine learning code, allowing developers to reveal these bugs while writing code instead of after training the model. Our approach uses combined data and control flow analysis, and API contracts, enabling data-aware reasoning about machine learning code at a high level of abstraction. We highlight the potential of our solution by analyzing a sample of real-world machine learning notebooks, finding that we can detect faults that require a data-aware approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes a novel data-aware static analysis approach to detect semantic faults in machine learning code (e.g., using unscaled data for scale-sensitive models). The method combines data and control flow analysis with API contracts to enable high-level reasoning about data characteristics without executing or training models. The authors evaluate the approach on a sample of real-world ML notebooks and report that it can detect faults requiring data awareness.

Significance. If the approach achieves adequate precision and recall on real code, it would represent a useful advance in software engineering for ML by shifting detection of data-dependent semantic faults to coding time. The integration of API contracts for abstraction is a promising direction, though its effectiveness depends on details not visible in the provided text.

major comments (1)
  1. [Abstract] Abstract: the central claim that the approach 'enables data-aware reasoning about machine learning code at a high level of abstraction' and can 'detect faults that require a data-aware approach' rests on static inference of data properties (scaling, shapes, distributions) from code, control flow, and contracts. No algorithm, contract modeling, over-approximation strategy, or handling of Python dynamism is described, making it impossible to assess whether the method avoids excessive false positives or misses on runtime-dependent ML library properties.
minor comments (1)
  1. [Abstract] Abstract: the evaluation is described only as 'analyzing a sample of real-world machine learning notebooks' with no mention of sample size, specific faults found, or any metrics.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and the opportunity to clarify our work. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the approach 'enables data-aware reasoning about machine learning code at a high level of abstraction' and can 'detect faults that require a data-aware approach' rests on static inference of data properties (scaling, shapes, distributions) from code, control flow, and contracts. No algorithm, contract modeling, over-approximation strategy, or handling of Python dynamism is described, making it impossible to assess whether the method avoids excessive false positives or misses on runtime-dependent ML library properties.

    Authors: We agree that the abstract is high-level and does not describe the algorithm, contract modeling, over-approximation strategy, or handling of Python dynamism. The full manuscript expands on the combined data/control-flow analysis and API contracts, but we acknowledge these details are insufficient for assessing precision, recall, or robustness to runtime-dependent library behavior. We will revise the abstract to include a brief overview of the key technical components and add/expand sections describing the analysis algorithm, contract representation, over-approximation choices, and Python dynamism handling. The notebook evaluation is presented as an initial demonstration of data-aware fault detection rather than a full precision/recall study; we will clarify this scope. revision: yes

Circularity Check

0 steps flagged

No circularity: methodological proposal with no derivations or fitted quantities

full rationale

The paper proposes a static analysis technique combining data/control-flow analysis with API contracts to detect ML-specific semantic faults at development time. No equations, parameters, predictions, or uniqueness theorems appear in the provided text. The central claim rests on the existence of a combined analysis plus a notebook sample showing detectable faults; this does not reduce to any self-definition, fitted input renamed as prediction, or self-citation chain. The approach is presented as novel without invoking prior author results as load-bearing justification. This is a standard non-circular engineering proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no concrete free parameters, axioms, or invented entities; the approach description mentions API contracts and data flow but gives no details on how they are formalized or any fitted values.

pith-pipeline@v0.9.1-grok · 5677 in / 1053 out tokens · 20728 ms · 2026-06-27T15:42:53.188111+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 18 canonical work pages

  1. [1]

    OECD Publishing, May 2025

    OECD/BCG/INSEAD,The Adoption of Artificial Intelligence in Firms: New Evidence for Policymaking. OECD Publishing, May 2025. [Online]. Available: http://dx.doi.org/10.1787/f9ef33c3-en

  2. [2]

    Analyzing AI adoption in European SMEs: A study of digital capabilities, innovation, and external environment,

    M. F. Arroyabe, C. F. Arranz, I. Fernandez De Arroyabe, and J. C. Fernandez de Arroyabe, “Analyzing ai adoption in european smes: A study of digital capabilities, innovation, and external environment,” Technology in Society, vol. 79, p. 102733, Dec. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.techsoc. 2024.102733

  3. [3]

    Maintainability challenges in ml: A systematic literature review,

    K. Shivashankar and A. Martini, “Maintainability challenges in ml: A systematic literature review,” in2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, Aug. 2022, p. 60–67. [Online]. Available: http://dx.doi.org/10.1109/ SEAA56994.2022.00018

  4. [4]

    Santhanam,Quality Management of Machine Learning Systems

    P. Santhanam,Quality Management of Machine Learning Systems. Springer International Publishing, 2020, p. 1–13. [Online]. Available: http://dx.doi.org/10.1007/ 978-3-030-62144-5 1

  5. [5]

    Characterizing technical debt and antipatterns in ai-based systems: A systematic mapping study,

    J. Bogner, R. Verdecchia, and I. Gerostathopoulos, “Characterizing technical debt and antipatterns in ai-based systems: A systematic mapping study,” in2021 IEEE/ACM International Conference on Technical Debt (TechDebt). IEEE, May 2021, p. 64–73. [Online]. Available: http://dx.doi.org/10.1109/ TechDebt52882.2021.00016

  6. [6]

    Quality issues in machine learning software systems,

    P.-O. C ˆot´e, A. Nikanjam, R. Bouchoucha, I. Basta, M. Abidi, and F. Khomh, “Quality issues in machine learning software systems,”Empirical Software Engi- neering, vol. 29, no. 6, Sep. 2024. [Online]. Available: http://dx.doi.org/10.1007/s10664-024-10536-7

  7. [7]

    Architecting ML-enabled systems: Challenges, best practices, and design decisions,

    R. Nazir, A. Bucaioni, and P. Pelliccione, “Architecting ml-enabled systems: Challenges, best practices, and design decisions,”Journal of Systems and Software, vol. 207, p. 111860, Jan. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2023.111860

  8. [8]

    A checklist of quality concerns for architecting ML- intensive systems,

    A. Bucaioni, R. Kazman, and P. Pelliccione, “A checklist of quality concerns for architecting ml- intensive systems,”Journal of Systems and Software, vol. 231, p. 112612, Jan. 2026. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2025.112612

  9. [9]

    Software engineering practices for machine learning — adoption, effects, and team assessment,

    A. Serban, K. van der Blom, H. Hoos, and J. Visser, “Software engineering practices for machine learning — adoption, effects, and team assessment,”Journal of Systems and Software, vol. 209, p. 111907, Mar. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2023. 111907

  10. [10]

    Data collection and quality challenges in deep learning: a data- centric AI perspective,

    S. E. Whang, Y . Roh, H. Song, and J.-G. Lee, “Data collection and quality challenges in deep learning: a data-centric ai perspective,”The VLDB Journal, vol. 32, no. 4, p. 791–813, Jan. 2023. [Online]. Available: http://dx.doi.org/10.1007/s00778-022-00775-9

  11. [11]

    Decoding the mystery: How can LLMs turn text into Cypher in complex knowledge graphs?IEEE Access, 13:80981–81001, 2025

    S. Kumar, S. Datta, V . Singh, S. K. Singh, and R. Sharma, “Opportunities and challenges in data- centric ai,”IEEE Access, vol. 12, p. 33173–33189, 2024. [Online]. Available: http://dx.doi.org/10.1109/ACCESS. 2024.3369417

  12. [12]

    Testing machine learning and deep learning systems: Achievements and challenges,

    S. Albelali and M. Ahmed, “Testing machine learning and deep learning systems: Achievements and challenges,”Arabian Journal for Science and Engineering, vol. 50, no. 15, p. 11433–11484, Jun

  13. [13]

    Available: http://dx.doi.org/10.1007/ s13369-025-10276-w

    [Online]. Available: http://dx.doi.org/10.1007/ s13369-025-10276-w

  14. [14]

    Why do machine learning notebooks crash? an empirical study on public Python Jupyter notebooks,

    Y . Wang, W. Meijer, J. A. H. L ´opez, U. Nilsson, and D. Varr ´o, “Why do machine learning notebooks crash? an empirical study on public python jupyter notebooks,” IEEE Transactions on Software Engineering, vol. 51, no. 7, p. 2181–2196, Jul. 2025. [Online]. Available: http://dx.doi.org/10.1109/TSE.2025.3574500

  15. [15]

    Taxonomy of real faults in deep learning systems,

    N. Humbatova, G. Jahangirova, G. Bavota, V . Riccio, A. Stocco, and P. Tonella, “Taxonomy of real faults in deep learning systems,” inProceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ser. ICSE ’20. ACM, Jun. 2020, p. 1110–1121. [Online]. Available: http://dx.doi.org/10.1145/3377811.3380395

  16. [16]

    Bug analysis in Jupyter notebook projects: An empirical study,

    T. L. De Santana, P. A. D. M. S. Neto, E. S. De Almeida, and I. Ahmed, “Bug analysis in jupyter notebook projects: An empirical study,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 4, p. 1–34, Apr. 2024. [Online]. Available: http://dx.doi.org/10.1145/3641539

  17. [17]

    What kinds of contracts do ML APIs need?

    S. S. Khairunnesa, S. Ahmed, S. M. Imtiaz, H. Rajan, and G. T. Leavens, “What kinds of contracts do ml apis need?”Empirical Software Engineering, vol. 28, no. 6, Oct. 2023. [Online]. Available: http://dx.doi.org/ 10.1007/s10664-023-10320-z

  18. [18]

    Comparative analysis of real issues in open- source machine learning projects,

    T. D. Lai, A. Simmons, S. Barnett, J.-G. Schneider, and R. Vasa, “Comparative analysis of real issues in open- source machine learning projects,”Empirical Software Engineering, vol. 29, no. 3, May 2024. [Online]. Avail- able: http://dx.doi.org/10.1007/s10664-024-10467-3

  19. [19]

    Bug characterization in machine learning-based systems,

    M. M. Morovati, A. Nikanjam, F. Tambon, F. Khomh, and Z. M. Jiang, “Bug characterization in machine learning-based systems,”Empirical Software Engineer- ing, vol. 29, no. 1, Dec. 2023. [Online]. Available: http://dx.doi.org/10.1007/s10664-023-10400-0

  20. [20]

    Refty: refinement types for valid deep learning models,

    Y . Gao, Z. Li, H. Lin, H. Zhang, M. Wu, and M. Yang, “Refty: refinement types for valid deep learning models,” inProceedings of the 44th International Conference on Software Engineering, ser. ICSE ’22. ACM, May 2022, p. 1843–1855. [Online]. Available: http://dx.doi.org/10.1145/3510003.3510077

  21. [21]

    Safe- DS: A domain specific language to make data science safe,

    L. Reimann and G. Kniesel-W ¨unsche, “Safe- ds: A domain specific language to make data science safe,” in2023 IEEE/ACM 45th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, May 2023, p. 72–77. [Online]. Available: http://dx.doi.org/10.1109/ICSE-NIER58687.2023.00019

  22. [22]

    Mlscent: A tool for anti-pattern detection in ml projects,

    K. Shivashankar and A. Martini, “Mlscent: A tool for anti-pattern detection in ml projects,” in2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI (CAIN). IEEE, Apr. 2025, p. 150–160. [Online]. Available: http://dx.doi.org/10. 1109/CAIN66642.2025.00026

  23. [23]

    Design by contract for deep learning APIs,

    S. Ahmed, S. M. Imtiaz, S. S. Khairunnesa, B. D. Cruz, and H. Rajan, “Design by contract for deep learning apis,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’23. ACM, Nov. 2023, p. 94–106. [Online]. Available: http://dx.doi.org/10.1145/3611643.3616247

  24. [24]

    The fault in our stats,

    A. Turcotte and N. N. Mehta, “The fault in our stats,” inProceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’25. ACM, Nov. 2025. [Online]. Available: https://reallytg.github.io/files/papers/ ase25 statlint final final.pdf

  25. [25]

    Expressing and checking statistical assumptions,

    A. Turcotte and Z. Wu, “Expressing and checking statistical assumptions,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, p. 2735–2758, Jun. 2025. [Online]. Available: http://dx.doi.org/10.1145/ 3729391

  26. [26]

    Contract-based validation of conceptual design bugs for engineering complex machine learning software,

    W. Meijer, “Contract-based validation of conceptual design bugs for engineering complex machine learning software,” inProceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, ser. MODELS Companion ’24. ACM, Sep. 2024, p. 155–161. [Online]. Available: http://dx.doi.org/10.1145/3652620.3688201

  27. [27]

    The data linter: Lightweight, automated sanity checking for ML data sets,

    N. Hynes, D. Sculley, and M. Terry, “The data linter: Lightweight, automated sanity checking for ML data sets,” inNIPS MLSys Workshop, vol. 1, no. 5, 2017, p. 10

  28. [28]

    Jenga: A framework to study the impact of data errors on the predictions of machine learning models,

    S. Schelter, T. Rukat, and F. Biessmann, “Jenga: A framework to study the impact of data errors on the predictions of machine learning models,” inProceedings of the 24th International Conference on Extending Database Technology (EDBT). OpenProceedings, Mar

  29. [29]

    Available: https://openproceedings.org/ 2021/conf/edbt/p134.pdf

    [Online]. Available: https://openproceedings.org/ 2021/conf/edbt/p134.pdf

  30. [30]

    Experiment repository for

    A. Authors, “Experiment repository for ”data-informed static analysis: Improving detection of semantic faults in machine learning code using data characteristics”,” Dec. 2025. [Online]. Available: https://doi.org/10.5281/ zenodo.17805911

  31. [31]

    Dille: Data-informed static analysis,

    ——, “Dille: Data-informed static analysis,” Dec

  32. [32]

    Available: https://doi.org/10.5281/ zenodo.17805763

    [Online]. Available: https://doi.org/10.5281/ zenodo.17805763

  33. [33]

    Better code, better sharing: on the need of analyzing jupyter notebooks,

    J. Wang, L. Li, and A. Zeller, “Better code, better sharing: on the need of analyzing jupyter notebooks,” inProceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results, ser. ICSE ’20. ACM, Jun. 2020, p. 53–56. [Online]. Available: http://dx.doi.org/10.1145/ 3377816.3381724