LLM Code Smells: A Taxonomy and Detection Approach
Pith reviewed 2026-05-25 05:42 UTC · model grok-4.3
The pith
Nine LLM code smells documented in a taxonomy and detected by SpecDetect4LLM appear in 73.5% of 692 analyzed software systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors consolidate and refine the concept of LLM code smells by presenting a self-contained taxonomy and catalog of nine such smells. They develop SpecDetect4LLM, a static source code analysis tool for detecting these smells, and evaluate it on 692 open-source projects comprising 171,194 source files. The results indicate that LLM code smells affect 73.5% of the analyzed systems, with the tool achieving a precision of 91.3% and recall of 71.8%.
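To make the two effectiveness figures concrete, here is a minimal sketch of how precision and recall are conventionally computed from detection counts. The function name and the counts are hypothetical and chosen only to illustrate the arithmetic; they are not the paper's data.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall) from true-positive, false-positive and
    false-negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, NOT taken from the paper.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.900 recall=0.750
```

Precision answers "of the smells the tool reported, how many were real", while recall answers "of the real smells, how many did the tool report". Both depend on a trusted set of ground-truth labels.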
What carries the argument
A catalog of nine LLM code smells together with the SpecDetect4LLM static analysis rules that map to them.
Load-bearing premise
The nine LLM code smells accurately capture inadequate integration practices that undermine software system quality, and the static analysis rules in SpecDetect4LLM correctly map to these smells without significant false classifications.
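As an illustration of what a static analysis rule of this kind might look like, the sketch below uses Python's standard `ast` module. The rule itself (flag any call whose dotted name ends in `.create` and that passes no explicit `temperature` keyword) is a hypothetical example and is not taken from the paper's catalog or from SpecDetect4LLM. The sample call shape is likewise only an assumption for demonstration.

```python
import ast


def dotted_name(node: ast.AST) -> str:
    """Rebuild a dotted name such as 'client.chat.completions.create'."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
    return ".".join(reversed(parts))


def find_calls_without_temperature(source: str) -> list:
    """Hypothetical rule: return (line, dotted name) for every call ending in
    '.create' that has no explicit `temperature` keyword argument.
    Calls that pass **kwargs are treated as missing it (a known limitation)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if not name.endswith(".create"):
            continue
        keywords = {kw.arg for kw in node.keywords}
        if "temperature" not in keywords:
            findings.append((node.lineno, name))
    return sorted(findings)


# Illustrative input; the code is only parsed, never executed.
SAMPLE = '''
client.chat.completions.create(model="m", messages=msgs)
client.chat.completions.create(model="m", messages=msgs, temperature=0.2)
'''
print(find_calls_without_temperature(SAMPLE))
# [(2, 'client.chat.completions.create')]
```

A purely syntactic rule like this one shows why the premise is load-bearing: the rule matches a code pattern, and whether that pattern really signals a quality problem is a separate question that the rule cannot answer.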
What would settle it
A controlled study comparing quality metrics such as bug rates or maintenance effort between systems containing the detected smells and equivalent refactored versions without them.
Original abstract
Large Language Models (LLMs) are increasingly integrated into software systems for diverse purposes, due to their versatility, flexibility, and ability to simulate human reasoning to some extent. However, poor integration of LLM inference in source code can undermine software system quality. Therefore, inadequate LLM integration coding practices must be documented to help developers mitigate such issues. Following our earlier work on LLM code smells, this paper consolidates and refines the concept by presenting a self-contained taxonomy and a catalog of nine LLM code smells. We also create SpecDetect4LLM, a static source code analysis tool for their detection, and conduct extensive empirical evaluations of its detection effectiveness (precision and recall) as well as the prevalence of LLM code smells across 692 open-source software projects (171,194 source files). Our results show that LLM code smells affect 73.5% of the analyzed systems, with a detection precision of 91.3% and a recall of 71.8%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper consolidates prior work into a self-contained taxonomy of nine LLM code smells, presents the SpecDetect4LLM static analysis tool for detecting them, and reports an empirical evaluation on 692 open-source projects (171,194 files) claiming that the smells affect 73.5% of systems with tool precision of 91.3% and recall of 71.8%.
Significance. If the taxonomy validly identifies integration practices that degrade quality and the detection rules map to them without substantial misclassification, the catalog and tool could give developers concrete guidance for LLM usage in production codebases. The scale of the corpus evaluation would add practical value if the metrics are independently corroborated.
major comments (2)
- [Evaluation] Evaluation section: the reported precision of 91.3% and recall of 71.8% are presented without any description of how ground-truth labels for the nine smells were obtained, how inter-rater agreement was measured, or what controls were applied for selection bias in the 692-project corpus; these omissions directly undermine the reliability of the central effectiveness and prevalence claims.
- [Taxonomy] Taxonomy and §3 (or equivalent): the nine smells are asserted to capture inadequate LLM integration practices that undermine system quality, yet the manuscript supplies no external expert validation, quality-impact correlation study, or comparison against independent oracles beyond the authors' internal definitions; this makes both the 73.5% prevalence figure and the tool's mapping dependent on unverified internal consistency.
minor comments (1)
- The abstract states the work 'consolidates and refines' an earlier taxonomy but provides neither a citation to that prior work nor a concise delta table showing which smells were added, removed, or redefined.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below. We will revise the manuscript to provide the requested details on ground-truth labeling and inter-rater agreement. For the taxonomy, we will add explicit discussion of its derivation and limitations while maintaining that the internal definitions are grounded in prior work.
Point-by-point responses
- Referee: [Evaluation] Evaluation section: the reported precision of 91.3% and recall of 71.8% are presented without any description of how ground-truth labels for the nine smells were obtained, how inter-rater agreement was measured, or what controls were applied for selection bias in the 692-project corpus; these omissions directly undermine the reliability of the central effectiveness and prevalence claims.
  Authors: We agree that these methodological details are essential for assessing the reliability of the metrics. The ground-truth labels were created via manual review of a stratified sample of files by two authors, with disagreements resolved through discussion; inter-rater agreement was measured using Cohen's kappa (value to be reported). Corpus selection followed criteria from prior LLM studies to reduce bias. In the revision we will insert a new subsection (likely 4.2) detailing the full labeling protocol, agreement statistics, and bias controls.
  Revision: yes
- Referee: [Taxonomy] Taxonomy and §3 (or equivalent): the nine smells are asserted to capture inadequate LLM integration practices that undermine system quality, yet the manuscript supplies no external expert validation, quality-impact correlation study, or comparison against independent oracles beyond the authors' internal definitions; this makes both the 73.5% prevalence figure and the tool's mapping dependent on unverified internal consistency.
  Authors: The taxonomy consolidates and refines definitions from our earlier published work on LLM code smells, where initial examples were drawn from real-world LLM usage patterns reported in the literature. We did not perform a new external expert survey or correlation study in this manuscript. We will add a limitations paragraph acknowledging this and noting that future work could include such validation. The prevalence and tool results are presented as tied to the stated definitions; we will make this dependency explicit in the revised text.
  Revision: partial
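The first response names Cohen's kappa as the inter-rater agreement statistic. A minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), follows. The function name and the labels are hypothetical; the rebuttal does not report the actual annotations or the resulting value.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations for six files, NOT the authors' data.
rater_1 = ["smell", "smell", "clean", "clean", "smell", "clean"]
rater_2 = ["smell", "clean", "clean", "clean", "smell", "clean"]
print(f"{cohens_kappa(rater_1, rater_2):.3f}")  # 0.667
```

Unlike raw percent agreement (5 of 6 here), kappa discounts the agreement that two raters would reach by chance given how often each uses each label.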
Circularity Check
No circularity: empirical metrics from external projects are independent of author definitions
full rationale
The paper's central results (prevalence 73.5%, precision 91.3%, recall 71.8%) are obtained by applying SpecDetect4LLM to 692 external open-source projects and counting matches against the nine author-defined smells. No equations, fitted parameters, or self-referential reductions appear in the provided text; the taxonomy is presented as self-contained and the evaluation numbers are direct empirical counts rather than predictions derived from the definitions themselves. The mention of 'earlier work' is a normal citation and does not carry the load-bearing claim. This is a standard empirical software-engineering study whose reported figures do not reduce to the inputs by construction.
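The "direct empirical count" reading of the prevalence figure can be sketched as follows, assuming (the text does not define this) that a system counts as affected when the detector reports at least one smell in it. The project names and findings are hypothetical.

```python
def prevalence(findings_per_project: dict) -> float:
    """Share of projects with at least one detected smell."""
    if not findings_per_project:
        raise ValueError("no projects to analyze")
    affected = sum(1 for findings in findings_per_project.values() if findings)
    return affected / len(findings_per_project)


# Hypothetical detector output: project name -> list of reported smells.
detections = {
    "proj-a": ["smell-1"],
    "proj-b": [],
    "proj-c": ["smell-2", "smell-1"],
    "proj-d": [],
}
print(f"{prevalence(detections):.1%}")  # 50.0%
```

Under this reading the figure inherits the detector's errors: false positives can inflate the count of affected projects, and missed smells can deflate it.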
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM code smells can be identified through static source code analysis rules.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "presents a self-contained taxonomy and a catalog of nine LLM code smells... SpecDetect4LLM, a static source code analysis tool"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "LLM code smells affect 73.5% of the analyzed systems, with a detection precision of 91.3% and a recall of 71.8%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.