LLM Code Smells: A Taxonomy and Detection Approach

Brahim Mahmoudi; Florent Avellaneda; Naouel Moha; Quentin Stiévenart; Zacharie Chenail-Larcher

arxiv: 2605.22976 · v1 · pith:PWR5LQRInew · submitted 2026-05-21 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-25 05:42 UTC · model grok-4.3

keywords: LLM code smells, taxonomy, static analysis, code smell detection, software quality, LLM integration, open-source projects, detection tool

The pith

Nine LLM code smells documented in a taxonomy and detected by SpecDetect4LLM appear in 73.5% of 692 analyzed software systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines nine specific coding practices that represent poor ways of integrating large language models into software applications. It builds a static analysis tool called SpecDetect4LLM to automatically find these practices in source code. Evaluation across hundreds of open-source projects shows these smells are common and the tool detects them with 91.3 percent precision and 71.8 percent recall. Developers need this kind of guidance because bad LLM integration can reduce the reliability and maintainability of the overall system.

Core claim

The authors consolidate and refine the concept of LLM code smells by presenting a self-contained taxonomy and catalog of nine such smells. They develop SpecDetect4LLM, a static source code analysis tool for detecting these smells, and evaluate it on 692 open-source projects comprising 171,194 source files. The results indicate that LLM code smells affect 73.5% of the analyzed systems, with the tool achieving a precision of 91.3% and recall of 71.8%.
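
To make the two effectiveness figures concrete, here is a minimal sketch of how precision and recall are conventionally computed from detection outcomes. The class and function names are illustrative, and the counts in `example` are hypothetical, chosen only to exercise the functions; they are not taken from the paper. The F1 value is the harmonic mean of the two reported percentages, derived here rather than reported by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Detection outcomes measured against a manually labeled ground truth."""
    true_positives: int   # smells the tool flagged that the labels confirm
    false_positives: int  # flagged locations the labels reject
    false_negatives: int  # labeled smells the tool missed


def precision(c: ConfusionCounts) -> float:
    """Share of flagged locations that are real smells."""
    flagged = c.true_positives + c.false_positives
    return c.true_positives / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    """Share of real smells that the tool flagged."""
    actual = c.true_positives + c.false_negatives
    return c.true_positives / actual if actual else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, not from the paper.
example = ConfusionCounts(true_positives=8, false_positives=2, false_negatives=2)
example_precision = precision(example)  # 8 / (8 + 2)
example_recall = recall(example)        # 8 / (8 + 2)

# Derived from the reported 91.3% precision and 71.8% recall.
reported_f1 = f1(0.913, 0.718)
```

The gap between the two reported figures reads naturally in these terms: most of what the tool flags is confirmed, while a larger share of labeled smells goes undetected.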

What carries the argument

A catalog of nine LLM code smells together with the SpecDetect4LLM static analysis rules that map to them.

Load-bearing premise

The nine LLM code smells accurately capture inadequate integration practices that undermine software system quality, and the static analysis rules in SpecDetect4LLM correctly map to these smells without significant false classifications.

What would settle it

A controlled study comparing quality metrics such as bug rates or maintenance effort between systems containing the detected smells and equivalent refactored versions without them.

Original abstract

Large Language Models (LLMs) are increasingly integrated into software systems for diverse purposes, due to their versatility, flexibility, and ability to simulate human reasoning to some extent. However, poor integration of LLM inference in source code can undermine software system quality. Therefore, inadequate LLM integration coding practices must be documented to help developers mitigate such issues. Following our earlier work on LLM code smells, this paper consolidates and refines the concept by presenting a self-contained taxonomy and a catalog of nine LLM code smells. We also create SpecDetect4LLM, a static source code analysis tool for their detection, and conduct extensive empirical evaluations of its detection effectiveness (precision and recall) as well as the prevalence of LLM code smells across 692 open-source software projects (171,194 source files). Our results show that LLM code smells affect 73.5% of the analyzed systems, with a detection precision of 91.3% and a recall of 71.8%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper consolidates prior work into a self-contained taxonomy of nine LLM code smells, presents the SpecDetect4LLM static analysis tool for detecting them, and reports an empirical evaluation on 692 open-source projects (171,194 files) claiming that the smells affect 73.5% of systems with tool precision of 91.3% and recall of 71.8%.

Significance. If the taxonomy validly identifies integration practices that degrade quality and the detection rules map to them without substantial misclassification, the catalog and tool could give developers concrete guidance for LLM usage in production codebases. The scale of the corpus evaluation would add practical value if the metrics are independently corroborated.

major comments (2)

[Evaluation] Evaluation section: the reported precision of 91.3% and recall of 71.8% are presented without any description of how ground-truth labels for the nine smells were obtained, how inter-rater agreement was measured, or what controls were applied for selection bias in the 692-project corpus; these omissions directly undermine the reliability of the central effectiveness and prevalence claims.
[Taxonomy] Taxonomy and §3 (or equivalent): the nine smells are asserted to capture inadequate LLM integration practices that undermine system quality, yet the manuscript supplies no external expert validation, quality-impact correlation study, or comparison against independent oracles beyond the authors' internal definitions; this makes both the 73.5% prevalence figure and the tool's mapping dependent on unverified internal consistency.

minor comments (1)

The abstract states the work 'consolidates and refines' an earlier taxonomy but provides neither a citation to that prior work nor a concise delta table showing which smells were added, removed, or redefined.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below. We will revise the manuscript to provide the requested details on ground-truth labeling and inter-rater agreement. For the taxonomy, we will add explicit discussion of its derivation and limitations while maintaining that the internal definitions are grounded in prior work.

Point-by-point responses

Referee: [Evaluation] Evaluation section: the reported precision of 91.3% and recall of 71.8% are presented without any description of how ground-truth labels for the nine smells were obtained, how inter-rater agreement was measured, or what controls were applied for selection bias in the 692-project corpus; these omissions directly undermine the reliability of the central effectiveness and prevalence claims.

Authors: We agree that these methodological details are essential for assessing the reliability of the metrics. The ground-truth labels were created via manual review of a stratified sample of files by two authors, with disagreements resolved through discussion; inter-rater agreement was measured using Cohen's kappa (value to be reported). Corpus selection followed criteria from prior LLM studies to reduce bias. In the revision we will insert a new subsection (likely 4.2) detailing the full labeling protocol, agreement statistics, and bias controls.
Revision: yes
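
Since the simulated response leaves the kappa value unreported, a short sketch of the statistic itself may help readers judge what such a number would mean. The function below computes Cohen's kappa for two raters over the same items; the label names and the two label lists are hypothetical and serve only as a worked example.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater labeled independently at their own rates.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both raters use one identical label;
        # returning 1.0 is a convention adopted for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four files, not data from the paper.
labels_a = ["smell", "smell", "clean", "clean"]
labels_b = ["smell", "clean", "clean", "clean"]
kappa = cohens_kappa(labels_a, labels_b)
```

In this example the raters agree on three of four files (0.75), chance agreement is 0.5, and kappa is therefore 0.5. Raw agreement alone would overstate reliability, which is why the referee asks for the corrected statistic.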
Referee: [Taxonomy] Taxonomy and §3 (or equivalent): the nine smells are asserted to capture inadequate LLM integration practices that undermine system quality, yet the manuscript supplies no external expert validation, quality-impact correlation study, or comparison against independent oracles beyond the authors' internal definitions; this makes both the 73.5% prevalence figure and the tool's mapping dependent on unverified internal consistency.

Authors: The taxonomy consolidates and refines definitions from our earlier published work on LLM code smells, where initial examples were drawn from real-world LLM usage patterns reported in the literature. We did not perform a new external expert survey or correlation study in this manuscript. We will add a limitations paragraph acknowledging this and noting that future work could include such validation. The prevalence and tool results are presented as tied to the stated definitions; we will make this dependency explicit in the revised text.
Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical metrics from external projects are independent of author definitions

full rationale

The paper's central results (prevalence 73.5%, precision 91.3%, recall 71.8%) are obtained by applying SpecDetect4LLM to 692 external open-source projects and counting matches against the nine author-defined smells. No equations, fitted parameters, or self-referential reductions appear in the provided text; the taxonomy is presented as self-contained and the evaluation numbers are direct empirical counts rather than predictions derived from the definitions themselves. The mention of 'earlier work' is a normal citation and does not carry the load-bearing claim. This is a standard empirical software-engineering study whose reported figures do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that the defined smells are meaningful indicators of poor LLM integration and that static rules can detect them reliably; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: LLM code smells can be identified through static source code analysis rules.
The paper states that SpecDetect4LLM is a static source code analysis tool for detection of the smells.
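
To illustrate what a static rule of this kind might look like, the following sketch walks a Python syntax tree and flags calls that omit a keyword argument. It is not SpecDetect4LLM's implementation, and the rule is a hypothetical example rather than one of the paper's nine smells: it flags calls shaped like `<something>.completions.create(...)` that pass no explicit `temperature`. The function names and the sample snippet are assumptions made for illustration.

```python
import ast
from typing import List, Tuple

# Hypothetical rule, for illustration only.
TARGET_ATTRIBUTE_CHAIN = ("completions", "create")


def _attribute_chain(node: ast.expr) -> Tuple[str, ...]:
    """Return the dotted names of an attribute access, outermost name first."""
    parts: List[str] = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
    return tuple(reversed(parts))


def find_calls_missing_temperature(source: str) -> List[int]:
    """Return line numbers of matching calls without an explicit `temperature`."""
    flagged: List[int] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        if _attribute_chain(node.func)[-2:] != TARGET_ATTRIBUTE_CHAIN:
            continue
        keywords = {kw.arg for kw in node.keywords}
        # A `**kwargs` expansion shows up as a keyword whose arg is None. It may
        # carry the setting, so this sketch stays silent instead of guessing.
        if "temperature" not in keywords and None not in keywords:
            flagged.append(node.lineno)
    return sorted(flagged)


# Sample text to analyze; it is parsed, never executed.
SAMPLE = '''
client.chat.completions.create(model="m", messages=msgs)
client.chat.completions.create(model="m", messages=msgs, temperature=0.2)
client.chat.completions.create(**options)
'''
flagged_lines = find_calls_missing_temperature(SAMPLE)
```

Only the first call in the sample is flagged. Staying silent on the `**options` call trades recall for precision, the kind of design choice that can produce a detector with high precision and lower recall.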

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation: unclear
Paper passage: "presents a self-contained taxonomy and a catalog of nine LLM code smells... SpecDetect4LLM, a static source code analysis tool"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "LLM code smells affect 73.5% of the analyzed systems, with a detection precision of 91.3% and a recall of 71.8%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

https://arxiv.org/abs/2504.08619

Xia, Z., Zhu, L., Li, B., Chen, F., Li, Q., Liao, C., Wang, F., Liu, H.: Analyzing 16,193 LLM Papers for Fun and Profits (2025). https://arxiv.org/abs/2504.08619

work page arXiv 2025

[2] [2]

https://arxiv.org/abs/2407.05138

Shao, Y., Huang, Y., Shen, J., Ma, L., Su, T., Wan, C.: Are LLMs Correctly Integrated into Software Systems? (2025). https://arxiv.org/abs/2407.05138

work page arXiv 2025

[3] [3]

PhD thesis, University of Waterloo (August 2024)

Khatun, A.: Uncovering the reliability and consistency of ai language models: A systematic study. PhD thesis, University of Waterloo (August 2024). https: //uwspace.uwaterloo.ca/items/e01e11a6-e033-4f6a-85c6-849fba74e039

work page 2024

[4] [4]

Knowledge-Based Systems 318, 113503 (2025) https://doi.org/10.1016/j.knosys.2025.113503

Yang, W., Some, L., Bain, M., Kang, B.: A comprehensive survey on integrating large language models with knowledge-based methods. Knowledge-Based Systems 318, 113503 (2025) https://doi.org/10.1016/j.knosys.2025.113503

work page doi:10.1016/j.knosys.2025.113503 2025

[5] [5]

https://arxiv.org/abs/ 2501.12904

Bucaioni, A., Weyssow, M., He, J., Lyu, Y., Lo, D.: A Functional Software Ref- erence Architecture for LLM-Integrated Systems (2025). https://arxiv.org/abs/ 2501.12904

work page arXiv 2025

[6] [6]

Addison-Wesley Longman Publishing Co., Inc., USA (1999)

Fowler, M., Beck, K., Brant, J., Opdyke, W., Roberts, D.: Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., USA (1999)

work page 1999

[7] [7]

https://arxiv.org/abs/2203.13746

Zhang, H., Cruz, L., Deursen, A.: Code Smells for Machine Learning Applications (2022). https://arxiv.org/abs/2203.13746

work page arXiv 2022

[8] [8]

https://arxiv.org/abs/2509.14404

Tian, H., Wang, C., Yang, B., Zhang, L., Liu, Y.: A Taxonomy of Prompt Defects in LLM Systems (2025). https://arxiv.org/abs/2509.14404

work page arXiv 2025

[9] [9]

Paul, D.G., Zhu, H., Bayley, I.: Investigating the Smells of LLM Generated Code. SSRN. Available at SSRN (2025). https://doi.org/10.2139/ssrn.5601126 . https: //ssrn.com/abstract=5601126

work page doi:10.2139/ssrn.5601126 2025

[10] [10]

In: Proceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering, New Ideas and Emerging Results (ICSE-NIER ’26)

Mahmoudi, B., Chenail-Larcher, Z., Moha, N., Sti´ evenart, Q., Avellaneda, F.: Specification and detection of LLM code smells. In: Proceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering, New Ideas and Emerging Results (ICSE-NIER ’26). Association for Computing Machin- ery, New York, NY, USA (2026). https://doi.org/10.1145/3...

work page doi:10.1145/3786582.3786835 2026

[11] [11]

https://doi.org/10.48550/arXiv.2509

Mahmoudi, B., Moha, N., Stievenert, Q., Avellaneda, F.: AI-Specific Code Smells: From Specification to Detection (2025). https://doi.org/10.48550/arXiv.2509. 52 20491

work page doi:10.48550/arxiv.2509 2025

[12] [12]

In: Proceedings of the 31st International Conference on Neural Information Processing Systems

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17, pp. 6000–6010. Curran Associates Inc., Red Hook, NY, USA (2017)

work page 2017

[13] [13]

IEEE Transactions on Pattern Analysis and Machine Intelligence46(8), 5625–5644 (2024) https://doi.org/10.1109/TPAMI.2024.3369699

Zhang, J., Huang, J., Jin, S., Lu, S.: Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence46(8), 5625–5644 (2024) https://doi.org/10.1109/TPAMI.2024.3369699

work page doi:10.1109/tpami.2024.3369699 2024

[14] [14]

Technical report (2024)

OpenAI: Learning to Reason with LLMs. Technical report (2024). https://openai. com/index/learning-to-reason-with-llms/

work page 2024

[15] [15]

International standard, International Organization for Standardiza- tion (2017)

ISO/IEC/IEEE: ISO/IEC/IEEE 24765:2017 Systems and software engineering: Vocabulary. International standard, International Organization for Standardiza- tion (2017)

work page 2017

[16] [16]

International standard, International Organization for Standardization (2023)

ISO/IEC: ISO/IEC 25010:2023 Systems and software engineering: Systems and software Quality Requirements and Evaluation (SQuaRE): Product quality model. International standard, International Organization for Standardization (2023)

work page 2023

[17] [17]

Ieee std 610.12-1990, Institute of Electrical and Electronics Engineers (1990)

IEEE: IEEE Standard Glossary of Software Engineering Terminology. Ieee std 610.12-1990, Institute of Electrical and Electronics Engineers (1990). https://doi. org/10.1109/IEEESTD.1990.101064

work page doi:10.1109/ieeestd.1990.101064 1990

[18] Kitchenham, B., Charters, S.: Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE-2007-01, EBSE 2007 (2007). https://www.elsevier.com/__data/promis_misc/525444systematicreviewsguide.pdf

[19] Cherief, H.A., Mahmoudi, B., Chenail-Larcher, Z., Moha, N., Stiévenart, Q., Avellaneda, F.: An Automated Grey Literature Extraction Tool for Software Engineering (2025). https://arxiv.org/abs/2512.23066

[20] Page, M.J., McKenzie, J.E., Bossuyt, P.M., Boutron, I., Hoffmann, T.C., Mulrow, C.D., Shamseer, L., Tetzlaff, J.M., Akl, E.A., Brennan, S.E., Chou, R., Glanville, J., Grimshaw, J.M., Hrobjartsson, A., Lalu, M.M., Li, T., Loder, E.W., Mayo-Wilson, E., McDonald, S., McGuinness, L.A., Stewart, L.A., Thomas, J., Tricco, A.C., Welch, V.A., Whiting, P., Moher... BMJ 372(71), 1–9 (2021) https://doi.org/10.1136/bmj.n71

[21] Page, M.J., McKenzie, J.E., Bossuyt, P.M., Boutron, I., Hoffmann, T.C., Mulrow, C.D., Shamseer, L., Tetzlaff, J.M., Akl, E.A., Brennan, S.E., Chou, R., Glanville, J., Grimshaw, J.M., Hrobjartsson, A., Lalu, M.M., Li, T., Loder, E.W., Mayo-Wilson, E., McDonald, S., McGuinness, L.A., Stewart, L.A., Thomas, J., Tricco, A.C., Welch, V.A., Whiting, P., Moh... BMJ 372, 160 (2021) https://doi.org/10.1136/bmj.n160

[22] Kitchenham, B., Madeyski, L., Budgen, D.: SEGRESS: Software engineering guidelines for reporting secondary studies. IEEE Transactions on Software Engineering 49(3), 1273–1298 (2023) https://doi.org/10.1109/TSE.2022.3174092

[23] Schardt, C., Adams, M.B., Owens, T., Keitz, S., Fontelo, P.: Utilization of the PICO framework to improve searching PubMed for clinical questions. BMC Medical Informatics and Decision Making 7, 16 (2007) https://doi.org/10.1186/1472-6947-7-16

[24] Dybå, T., Dingsøyr, T., Hanssen, G.K.: Applying systematic reviews to diverse study types: An experience report. In: Proceedings of the First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007), pp. 225–234. IEEE (2007). https://doi.org/10.1109/ESEM.2007.59

[25] Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J., Wen, J.: A survey of large language models. arXiv preprint (2023) arXiv:2303.18223 [cs.CL]

[26] Garousi, V., Felderer, M., Mäntylä, M.V.: Guidelines for including grey literature and conducting multivocal literature reviews in software engineering. Information and Software Technology 106, 101–121 (2019) https://doi.org/10.1016/j.infsof.2018.09.006

[27] Kamei, F., Wiese, I., Pinto, G., Ribeiro, M., Soares, S.: On the use of grey literature: A survey with the Brazilian software engineering research community. In: Proceedings of the 34th Brazilian Symposium on Software Engineering. SBES ’20. Association for Computing Machinery (2020). https://doi.org/10.1145/3422392.3422442

[28] Perplexity AI: Perplexity. https://www.perplexity.ai/ (2025)

[29] Hugging Face: Hugging Face - The AI Community Building the Future. https://huggingface.co/ Accessed 2025-09-25

[30] Mahmoudi, B., Chenail-Larcher, Z.: Replication Package LLM-code smells. https://github.com/Brahim-Mahmoudi/Code_Smell_LLM (2025)

[31] OpenAI: API Reference - Chat Completions (2025). https://platform.openai.com/docs/api-reference/chat Accessed 2025-09-25

[32] Anthropic: Messages API - Claude Docs (2025). https://docs.claude.com/en/api/messages Accessed 2025-09-25

[33] OpenAI: Images and Vision — OpenAI API Documentation (2025). https://developers.openai.com/api/docs/guides/images-vision

[34] Anthropic: Vision - Claude API Documentation (2025). https://platform.claude.com/docs/en/build-with-claude/vision

[35] OpenAI Developer Community: Clarifications on setting temperature = 0. Discussion thread accessed 2025-12-09 (2024). https://community.openai.com/t/clarifications-on-setting-temperature-0/886447

[36] Nandani, H., Saad, M., Sharma, T.: DACOS: A manually annotated dataset of code smells. In: Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1–12 (2023). https://doi.org/10.1109/MSR59073.2023.00067

[37] Cochran, W.G.: Sampling Techniques, 3rd edn. John Wiley & Sons, New York (1977). Chap. 5

[38] Passi, S., Jackson, S.J.: Trust in data science: Collaboration, translation, and accountability in corporate data science projects. Proc. ACM Hum.-Comput. Interact. 2(CSCW) (2018) https://doi.org/10.1145/3274405

[39] Livshits, B., Sridharan, M., Smaragdakis, Y., Lhoták, O., Amaral, J.N., Chang, B.-Y.E., Guyer, S.Z., Khedker, U.P., Møller, A., Vardoulakis, D.: In defense of soundiness: a manifesto. Commun. ACM 58(2), 44–46 (2015) https://doi.org/10.1145/2644805

[40] Carvalho, S.G., Aniche, M., Veríssimo, J., Garcia, A., Alves, V., Gheyi, R.: An empirical catalog of code smells for the presentation layer of Android apps. Empirical Software Engineering 24(6), 3546–3586 (2019) https://doi.org/10.1007/s10664-019-09768-9

[41] Ning, K., Chen, J., Zhang, J., Li, W., Wang, Z., Feng, Y., Zhang, W., Zheng, Z.: Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents (2024). https://arxiv.org/abs/2412.18371

[42] Ke, Z., Jiao, F., Ming, Y., Nguyen, X.-P., Xu, A., Long, D.X., Li, M., Qin, C., Wang, P., Savarese, S., Xiong, C., Joty, S.: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems (2025). https://arxiv.org/abs/2504.09037

[43] Winston, C., Just, R.: A taxonomy of failures in tool-augmented LLMs. In: AST 2025, pp. 125–135 (2025). https://doi.org/10.1109/AST66626.2025.00019

[44] Cemri, M., Pan, M.Z., Yang, S., Agrawal, L.A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., Zaharia, M., Gonzalez, J.E., Stoica, I.: Why Do Multi-Agent LLM Systems Fail? (2025)

[45] Le Jeune, P., Liu, J., Rossi, L., Dora, M.: RealHarm: A collection of real-world language model application failures. In: LLMSEC 2025, pp. 87–100 (2025)

[46] Ronanki, K., Cabrero-Daniel, B., Berger, C.: Prompt Smells: An Omen for Undesirable Generative AI Outputs (2024). https://arxiv.org/abs/2401.12611

[47] Agrawal, A., Kedia, N., Agarwal, A., Mohan, J., Kwatra, N., Kundu, S., Ramjee, R., Tumanov, A.: On Evaluating Performance of LLM Inference Serving Systems (2025)

[48] Zhuo, T.Y., He, J., Sun, J., Xing, Z., Lo, D., Grundy, J., Du, X.: Identifying and Mitigating API Misuse in Large Language Models (2025)

[49] Esfahani, A.M., Kahani, N., Ajila, S.A.: Understanding Defects in Generated Codes by Language Models (2024). https://arxiv.org/abs/2408.13372

[50] Diaz-De-Arcaya, J., López-De-Armentia, J., Miñón, R., Ojanguren, I.L., Torre-Bastida, A.I.: Large language model operations (LLMOps): Definition, challenges, and lifecycle management. In: 2024 9th International Conference on Smart and Sustainable Technologies (SpliTech), pp. 1–4 (2024). https://doi.org/10.23919/SpliTech61897.2024.10612341

[51] Tantithamthavorn, C.K., Palomba, F., Khomh, F., Chua, J.J.: MLOps, LLMOps, FMOps, and beyond. IEEE Software 42(1), 26–32 (2025) https://doi.org/10.1109/MS.2024.3477014

[52] Google Cloud: What is LLMOps (large language model operations)? (2026). https://cloud.google.com/discover/what-is-llmops

[53] IBM: What is LLMOps? Accessed: March 20, 2026 (2026). https://www.ibm.com/think/topics/llmops

11 Appendix Selected Papers

[SP54] Liu, M.X., Liu, F., Fiannaca, A.J., Koo, T., Dixon, L., Terry, M., Cai, C.J.: We need structured output: Towards user-centered constraints on large language model output. (2024). https://doi.org/10.1145/3613905.3650756

[SP55] P...

Discussion thread accessed 2025-12-09. https://community.openai.com/t/clarifications-on-setting-temperature-0/886447

[SP164] Institute, P.E.: Complete Guide to Prompt Engineering with Temperature and Top-p. Accessed: 2025-12-31 (2024). https://promptengineering.org/prompt-engineering-with-temperature-and-top-p/

[SP165] Reyes, F., Gamage, Y., Skoglund,...