pith. sign in

arxiv: 2509.20491 · v2 · submitted 2025-09-24 · 💻 cs.SE · cs.AI

ML Code Smells: From Specification to Detection

Pith reviewed 2026-05-18 13:34 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords ML code smellscode smell detectiondomain-specific languagecode property graphsmachine learning pipelinessoftware maintainabilityreproducibilitydata leakage
0
0 comments X

The pith

SpecDetect4ML detects ML code smells using declarative DSL specifications and code property graph analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SpecDetect4ML as a tool to identify recurring implementation patterns in machine learning systems that undermine reproducibility, robustness, and maintainability, such as those causing silent failures or data leakage. It combines a declarative domain-specific language for expressing smells as reusable predicates with analysis over code property graphs that capture syntactic, control-flow, and data-flow relations across an entire project. Evaluated on 890 ML-based systems against a unified ground truth, the approach specifies 22 smells and reports 95.82 percent precision with 88.14 percent recall. This outperforms state-of-the-art analyzers in effectiveness and coverage while remaining scalable. A sympathetic reader would care because such detection could help practitioners catch subtle issues in complex ML pipelines before they affect experimental results or production behavior.

Core claim

The paper claims that ML code smells can be expressed as executable specifications in a declarative DSL and detected effectively through CPG-based project-level reasoning, which handles non-local and flow-dependent patterns that AST-only tools miss, yielding 95.82 percent precision and 88.14 percent recall on a unified ground truth while providing an extensible foundation for additional smells.

What carries the argument

SpecDetect4ML, the specification-driven detection tool that pairs a declarative DSL for reusable smell predicates with a scalable CPG analysis engine for syntactic, control-flow, and data-flow relations.

If this is right

  • ML developers gain a practical way to scan pipelines for issues like data leakage before they compromise results.
  • The DSL approach allows new smells to be added by writing predicates rather than hand-coding rules.
  • Project-level CPG analysis extends detection to control and data flows that local pattern matchers overlook.
  • Higher coverage than existing tools reduces the chance of silent failures in reproducibility and maintainability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same DSL-plus-CPG pattern could apply to code smells in other domains such as scientific computing or web services.
  • Embedding the detector in continuous integration could prevent smells from entering codebases during development.
  • Automatic repair suggestions based on the detected smells would be a natural next step not explored here.

Load-bearing premise

The 22 specified ML code smells represent real-world problems and the unified ground truth used for evaluation on 890 systems is accurate and unbiased.

What would settle it

Re-running the evaluation on a fresh set of ML systems outside the original 890 would produce precision or recall below 80 percent or miss known flow-dependent smells that the tool claims to cover.

Figures

Figures reproduced from arXiv: 2509.20491 by Brahim Mahmoudi, Florent Avellaneda, Naouel Moha, Quentin Sti\'evenart.

Figure 1
Figure 1. Figure 1: Overview of SpecDetect4AI’s three-step approach for specifying informal AI-specific code smell descriptions into executable detection rules. usedBeforeTrainTestSplit(), which checks whether preprocessing occurs before dataset partitioning. We define the DSL in Lark’s Extended Backus-Naur Form (EBNF) [39, 45]. Listing 4 illustrates a simplified excerpt of the grammar, showing how rules are composed of an id… view at source ↗
Figure 2
Figure 2. Figure 2: Per-rule F1-scores on the intersection of supported smells (darker = higher). “NA” marks rules outside an approach’s coverage. 4.6.2 RQ2: Does SpecDetect4AI perform better in terms of precision and recall than existing detection tools ? [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Execution time vs. system size (values above the [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
read the original abstract

The rapid adoption of Artificial Intelligence (AI) is increasingly realised through Machine Learning (ML) pipelines that integrate data preprocessing, model training, evaluation scripts, and configuration-heavy experimentation code. In these ML-based systems, small and often overlooked implementation choices can silently compromise experimental reproducibility, robustness to data and environment changes, and maintainability. We study ML code smells, recurring implementation patterns that can undermine reproducibility, robustness, and maintainability, for example by inducing silent failures or data leakage. We present SpecDetect4ML, a specification-driven detection tool that combines a declarative Domain-Specific Language (DSL) with a scalable analysis engine backed with Code Property Graphs (CPGs). Unlike the state-of-the-art (SOTA) analysers that rely on hand-coded, per-rule local pattern checks, our DSL expresses smells as executable specifications via reusable predicates, while the CPG analysis enables project-level reasoning over syntactic, control-flow, and data-flow relations. This CPG reasoning outperforms Abstract Syntax Tree (AST)-only analysers while remaining scalable and practical in analysis time. We specified 22 ML code smells and evaluated SpecDetect4ML on 890 ML-based systems. SpecDetect4ML achieves 95.82% precision and 88.14% recall on a unified ground truth, outperforming SOTA analysers in both effectiveness and coverage, providing an extensible foundation for detecting non-local, flow-dependent ML code smells.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces SpecDetect4ML, a specification-driven tool that uses a declarative DSL to express ML code smells as reusable predicates and a CPG-based analysis engine to perform project-level syntactic, control-flow, and data-flow reasoning. The authors specify 22 ML code smells and evaluate the approach on 890 ML-based systems, claiming 95.82% precision and 88.14% recall on a unified ground truth while outperforming SOTA analysers in effectiveness and coverage for non-local, flow-dependent smells.

Significance. If the evaluation methodology proves robust, the work would be significant as it moves beyond local AST pattern matching to scalable, non-local detection of ML-specific smells that affect reproducibility and maintainability. The extensible DSL and CPG foundation could serve as a reusable platform for future smell specifications.

major comments (1)
  1. [Evaluation] Evaluation section: The central claims of 95.82% precision and 88.14% recall (and superiority over SOTA) rest on an unspecified 'unified ground truth' for the 890 systems. No details are provided on the labeling protocol, number of annotators, inter-rater reliability, sampling strategy, conflict resolution for non-local or cross-file smells, or operationalization criteria for the 22 smells. This absence directly undermines verification of the reported metrics and the claim that the tool accurately captures both local and flow-dependent issues.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'unified ground truth' appears without a forward reference to the evaluation methodology or a brief characterization of how it was assembled; adding one sentence would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the presentation of our evaluation. We address the major comment below and will revise the manuscript to provide the requested details on ground truth construction.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The central claims of 95.82% precision and 88.14% recall (and superiority over SOTA) rest on an unspecified 'unified ground truth' for the 890 systems. No details are provided on the labeling protocol, number of annotators, inter-rater reliability, sampling strategy, conflict resolution for non-local or cross-file smells, or operationalization criteria for the 22 smells. This absence directly undermines verification of the reported metrics and the claim that the tool accurately captures both local and flow-dependent issues.

    Authors: We agree that the manuscript currently provides insufficient detail on the unified ground truth, which limits independent verification of the metrics. In the revised version we will insert a dedicated subsection 'Ground Truth Construction' immediately preceding the results. This subsection will cover: the labeling protocol, which operationalizes each of the 22 smells using the exact DSL predicate definitions to distinguish local syntactic patterns from non-local control- and data-flow conditions; the number of annotators and their qualifications; the inter-rater reliability measure employed; the sampling strategy for selecting the 890 systems and the annotated instances; and the conflict-resolution procedure, which explicitly addresses disagreements on cross-file smells through consensus discussion anchored to the smell specifications. These additions will directly support the claims regarding both local and flow-dependent detection and the comparison with prior tools. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical evaluation on external corpus

full rationale

The paper specifies 22 ML code smells in a DSL, implements detection via CPG-based analysis, and reports precision/recall on a unified ground truth across 890 external ML systems. No equations, fitted parameters, or first-principles derivations are present that could reduce to self-definition or input renaming. Performance figures are measured outcomes against an external benchmark rather than quantities constructed from the paper's own fitted values or prior self-citations. The evaluation protocol, while potentially underspecified in the abstract, does not create a circular reduction because the ground truth is treated as an independent reference set. This is a standard empirical software-engineering result with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that ML code smells can be captured by declarative predicates over syntactic, control-flow, and data-flow relations and that these smells meaningfully affect reproducibility and robustness.

axioms (2)
  • domain assumption ML pipelines contain recurring implementation patterns that silently compromise reproducibility and robustness
    Invoked in the opening motivation for studying ML code smells.
  • standard math Code Property Graphs enable scalable project-level reasoning over data and control flow
    Used to justify superiority over AST-only analysers.

pith-pipeline@v0.9.0 · 5791 in / 1199 out tokens · 47235 ms · 2026-05-18T13:34:41.249136+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 3 internal anchors

  1. [1]

    Meta AI. 2024. LLaMA 3: Open Foundation and Instruction-Tuned Language Models by Meta. https://github.com/meta-llama/llama3. Accessed: 2025-03-10

  2. [2]

    Eduardo Almeida and Elvys Soares. 2025. Agentic SLMs: Hunting Down Test Smells.arXiv preprint arXiv:2504.07277(2025). https://arxiv.org/abs/ 2504.07277

  3. [3]

    Anonymous. 2025. SpecDetect4AI Replication Package. https://anonymous.4open.science/r/SpecDetect4AI-B903. Study artefacts

  4. [4]

    Sebastian Baltes and Paul Ralph. 2022. Sampling in Software Engineering Research: A Critical Review and Guidelines.Empirical Software Engineering 27, 3 (2022), 1–38. doi:10.1007/s10664-021-10072-8 AI-Specific Code Smells: From Specification to Detection 19

  5. [5]

    Polarization Control of VCSELs,

    Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. 2009.Pearson correlation coefficient. Springer. 1–4 pages. doi:10.1007/978-3-642- 00296-0_1

  6. [6]

    quick and dirty

    John Brooke. 1996. SUS: a “quick and dirty” usability scale.Usability evaluation in industry189, 194 (1996), 4–7

  7. [7]

    William G. Cochran. 1977.Sampling Techniques(3 ed.). John Wiley & Sons, New York

  8. [8]

    Databricks. 2024. mlflow: An Open Source Platform for the Machine Learning Lifecycle. https://github.com/mlflow/mlflow. Accessed: 2025-04-01

  9. [9]

    DeepSeek-AI. 2024. DeepSeek-R1: A Strong Open LLM Series with 7B/67B Models. https://github.com/deepseek-ai/DeepSeek-R1. Accessed: 2025-03-21

  10. [10]

    DeepSeek-AI et al. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948 (2025). https://arxiv.org/abs/2501.12948

  11. [11]

    Malinda Dilhara, Danny Dig, and Ameya Ketkar. 2023. PyEvolve: Automating Frequent Code Changes in Python ML Systems. InProceedings of the 45th International Conference on Software Engineering(Melbourne, Victoria, Australia)(ICSE ’23). IEEE Press, 995–1007. doi:10.1109/ICSE48619.2023.00091

  12. [12]

    Julian Dolby, Avraham Shinnar, Allison Allain, and Jenna Reinen. 2018. Ariadne: analysis for machine learning programs. InProceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages(Philadelphia, PA, USA)(MAPL 2018). Association for Computing Machinery, New York, NY, USA, 1–10. doi:10.1145/3211346.3211349

  13. [13]

    Aryaz Eghbali and Michael Pradel. 2022. DynaPyt: A Dynamic Analysis Framework for Python. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’22). 760–771. doi:10.1145/3540250.3549126

  14. [14]

    John Field, Manu Sridharan, and Gaurav S. Kc. 2005. WALA: A Scalable Infrastructure for Program Analysis. InWorkshop on Middleware for Software Development Environments (at ACM/IFIP/USENIX Middleware). https://wala.github.io/ IBM T.J. Watson Research Center

  15. [15]

    Martin Fowler. 2002. Refactoring: Improving the Design of Existing Code. InExtreme Programming and Agile Methods — XP/Agile Universe 2002, Don Wells and Laurie Williams (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 256–256

  16. [16]

    Corrado Gini. 1921. Measurement of inequality of incomes.The economic journal31, 121 (1921), 124–125

  17. [17]

    GitHub Security Lab. 2024. CodeQL: Semantic Code Analysis Engine. https://codeql.github.com. Accessed May 2025

  18. [18]

    Rebecca A Grier, Aaron Bangor, Philip Kortum, and S Camille Peres. 2013. The system usability scale: Beyond standard usability testing. In Proceedings of the human factors and ergonomics society annual meeting, Vol. 57. SAGE Publications Sage CA: Los Angeles, CA, 187–191

  19. [19]

    2023.MLpylint: Automating the Identification of Machine Learning-Specific Code Smells

    Pär Hamfelt. 2023.MLpylint: Automating the Identification of Machine Learning-Specific Code Smells. Dissertation. Blekinge Institute of Technology. https://urn.kb.se/resolve?urn=urn:nbn:se:bth-25490

  20. [20]

    Weigang He, Peng Di, Mengli Ming, Chengyu Zhang, Ting Su, Shijie Li, and Yulei Sui. 2024. Finding and Understanding Defects in Static Analyzers by Constructing Automated Oracles.Proc. ACM Softw. Eng.1, FSE, Article 74 (July 2024), 23 pages. doi:10.1145/3660781

  21. [21]

    Heli Järvenpää, Patricia Lago, Justus Bogner, Grace Lewis, Henry Muccini, and Ipek Ozkaya. 2024. A Synthesis of Green Architectural Tactics for ML-Enabled Systems. InProceedings of the 46th International Conference on Software Engineering: Software Engineering in Society(Lisbon, Portugal) (ICSE-SEIS’24). Association for Computing Machinery, New York, NY, ...

  22. [22]

    2013.Content Analysis: An Introduction to Its Methodology(3rd ed.)

    Klaus Krippendorff. 2013.Content Analysis: An Introduction to Its Methodology(3rd ed.). SAGE Publications

  23. [23]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention.arXiv preprint arXiv:2309.06180(2023). arXiv:2309.06180 [cs.LG] doi:10.48550/arXiv.2309.06180

  24. [24]

    Nelson Amaral, Bor-Yuh Evan Chang, Samuel Z

    Benjamin Livshits, Manu Sridharan, Yannis Smaragdakis, Ondřej Lhoták, J. Nelson Amaral, Bor-Yuh Evan Chang, Samuel Z. Guyer, Uday P. Khedker, Anders Møller, and Dimitrios Vardoulakis. 2015. In defense of soundiness: a manifesto.Commun. ACM58, 2 (Jan. 2015), 44–46. doi:10.1145/2644805

  25. [25]

    Marjan Mernik, Jan Heering, and Anthony M. Sloane. 2005. When and how to develop domain-specific languages.ACM Comput. Surv.37, 4 (Dec. 2005), 316–344. doi:10.1145/1118890.1118892

  26. [26]

    Nadia Nahar, Haoran Zhang, Grace Lewis, Shurui Zhou, and Christian Kästner. 2023. A meta-summary of challenges in building products with ml components–collecting experiences from 4758+ practitioners. In2023 IEEE/ACM 2nd International Conference on AI Engineering–Software Engineering for AI (CAIN). IEEE, 171–183

  27. [27]

    Himesh Nandani, Mootez Saad, and Tushar Sharma. 2023. DACOS: A Manually Annotated Dataset of Code Smells. InProceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). 1–12. doi:MSR59073.2023.00067

  28. [28]

    Shayan Noei, Heng Li, and Ying Zou. 2018. Detecting Refactoring Commits in Machine Learning Python Projects: A Machine Learning-Based Approach.arXiv preprint arXiv:2404.06572(2018)

  29. [29]

    OpenAI. 2025. GPT-4.1 mini (Model documentation). https://platform.openai.com/docs/models/gpt-4.1-mini Accessed: 2025-09-01

  30. [30]

    Samir Passi and Steven J. Jackson. 2018. Trust in Data Science: Collaboration, Translation, and Accountability in Corporate Data Science Projects. Proc. ACM Hum.-Comput. Interact.2, CSCW, Article 136 (Nov. 2018), 28 pages. doi:10.1145/3274405

  31. [31]

    Păsăreanu, and Limin Jia

    Eduard Pinconschi, Sofia Reis, Chi Zhang, Rui Abreu, Hakan Erdogmus, Corina S. Păsăreanu, and Limin Jia. 2023. Tenet: A Flexible Framework for Machine-Learning-based Vulnerability Detection. InProceedings of the 2023 IEEE/ACM 2nd International Conference on AI Engineering – Software Engineering for AI (CAIN). IEEE. doi:10.1109/CAIN58948.2023.00026 Paper p...

  32. [32]

    PyCQA_flake8. 2024. flake8. https://pypi.org/project/flake8/. Accessed May 2025

  33. [33]

    PyCQA_pylint. 2024. Pylint. https://pypi.org/project/pylint/. Accessed May 2025

  34. [34]

    r2c, Inc. 2024. Semgrep: Lightweight Static Analysis for Many Languages. https://semgrep.dev. Accessed May 2025. 20 Mahmoudi et al

  35. [35]

    Gilberto Recupito, Giammaria Giordano, Filomena Ferrucci, Dario Di Nucci, and Fabio Palomba. 2025. When code smells meet ML: on the lifecycle of ML-specific code smells in ML-enabled systems.Empirical Software Engineering30, Article 139 (2025). doi:10.1007/s10664-025-10676-4

  36. [36]

    Eduardo Royuela and Yania Crespo. 2025. Comparing Different Techniques for Automatic Detection and Correction of React Code Smells.SSRN Electronic Journal(2025). doi:10.2139/ssrn.5251648

  37. [37]

    Hellerstein, and Aditya G

    Shreya Shankar, Reyna Garcia, Joseph M. Hellerstein, and Aditya G. Parameswaran. 2022. Operationalizing Machine Learning: An Interview Study. (2022). arXiv:2209.09125 [cs.SE] arXiv preprint

  38. [38]

    Mary Shaw and Liming Zhu. 2022. Can Software Engineering Harness the Benefits of Advanced AI?IEEE Softw.39, 6 (Nov. 2022), 99–104. doi:10.1109/MS.2022.3203200

  39. [39]

    Erez Shinan. 2017. Lark: A Modern Parsing Library for Python. https://github.com/lark-parser/lark. Used for DSL grammar parsing in this work. Accessed April 7, 2025

  40. [40]

    Xiaoxi Wang et al. 2023. Code Llama: Open Foundation Models for Code. InArXiv Preprint. https://arxiv.org/abs/2308.12950

  41. [41]

    Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. 2025. Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models. InProceedings of the 2025 International Conference on Software Engineering (Track ICSE Research Track)(São Francisco, CA)

  42. [42]

    Moshi Wei, Nima Shiri Harzevili, Yuekai Huang, Jinqiu Yang, Junjie Wang, and Song Wang. 2024. Demystifying and Detecting Misuses of Deep Learning APIs. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 201, 12 pages. doi:10.1145/3...

  43. [43]

    Moshi Wei, Yuchao Huang, Junjie Wang, Jiho Shin, Nima Shiri Harzevili, and Song Wang. 2022. API recommendation for machine learning libraries: how far are we?. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering(Singapore, Singapore)(ESEC/FSE 2022). Association for Computing...

  44. [44]

    Frank Wilcoxon. 1945. Individual comparisons by ranking methods.Biometrics Bulletin1, 6 (1945), 80–83. doi:10.2307/3001968

  45. [45]

    Niklaus Wirth. 1977. What Can We Do About the Unnecessary Diversity of Notation for Syntactic Definitions?Commun. ACM20, 11 (1977), 822–823. doi:10.1145/359863.359909

  46. [46]

    Di Wu, Fangwen Mu, Lin Shi, Zhaoqiang Guo, Kui Liu, Weiguang Zhuang, Yuqi Zhong, and Li Zhang. 2024. iSMELL: Assembling LLMs with Expert Toolsets for Code Smell Detection and Refactoring. In2024 39th IEEE/ACM International Conference on Automated Software Engineering (ASE). 1345–1357

  47. [47]

    Yao Yao, Lei Zhang, and Abhimanyu Chaudhuri. 2008. Early Stopping and Its Applications to Boosting.Journal of Machine Learning Research9 (2008), 947–970

  48. [48]

    Haiyin Zhang, Luís Cruz, and Arie van Deursen. 2022. Code smells for machine learning applications. InProceedings of the 1st International Conference on AI Engineering: Software Engineering for AI(Pittsburgh, Pennsylvania)(CAIN ’22). Association for Computing Machinery, New York, NY, USA, 217–228. doi:10.1145/3522664.3528620