pith. sign in

arxiv: 2601.15232 · v2 · submitted 2026-01-21 · 💻 cs.SE

When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

Pith reviewed 2026-05-16 11:52 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM agentsbug classificationautomated labelingReAct agentsoftware debuggingLLM frameworksStack Overflow analysisroot cause analysis
0
0 comments X

The pith

A ReAct agent can automatically label bugs in LLM agent code at an average cost of one cent per item.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects 1,187 bug-related posts and code snippets about LLM agents from developer forums and examines the types of bugs, their root causes, and their effects. It also tracks which software components, programming languages, and frameworks are most often involved. The authors then test whether an automated agent called BugReAct, built with the ReAct pattern and external tools, can detect and annotate these same characteristics. Experiments show that BugReAct using Gemini 2.5 Flash produces reliable annotations at very low cost. This matters for anyone building or maintaining LLM agents because manual bug analysis is expensive and the field lacks shared knowledge of common failure patterns.

Core claim

Through manual review of posts from Stack Overflow, GitHub, and Hugging Face focused on seven major LLM frameworks plus custom code, the study maps the distribution of bug types, root causes, and impacts across components and languages. The authors further show that BugReAct, a ReAct agent supplied with appropriate tools, can perform the same annotation task, reaching strong performance when paired with Gemini 2.5 Flash at an average cost of 0.01 USD per post or code snippet.

What carries the argument

BugReAct, a ReAct agent equipped with external tools that reads posts and code snippets to classify bug type, root cause, effect, component, language, and framework.

Load-bearing premise

The 1,187 collected posts represent the typical bugs developers actually encounter when building LLM agents, and the automated annotations match what human experts would produce.

What would settle it

A controlled comparison in which multiple human annotators label a random sample of the posts and show low agreement with BugReAct's labels, or a search for recent agent bugs that fall outside the categories found in the 1,187 items.

Figures

Figures reproduced from arXiv: 2601.15232 by Deepak George Thomas, Mohammad Wardat, Niful Islam, Ragib Shahriar Ayon, Shibbir Ahmed.

Figure 1
Figure 1. Figure 1: Workflow of the data-collection and labeling. [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of bug types across different sources. [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of root causes across different sources. [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of effects across different sources. [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Component distribution across bug type [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Component distribution across root causes. [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Component distribution across effects. Finding 4: The agent core is the most error-prone component of LLMs, with 58% of bugs on Stack Overflow occurring in this component. Finding 5: Indeterminate loops result from issues in the planning stage, making up 66.6%, while stateless interactions are caused by bugs in memory components, accounting for 57.1%. 3.4 Distribution between bugs (RQ3) To answer RQ3, we a… view at source ↗
Figure 8
Figure 8. Figure 8: Distribution of root cause across bug types. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Yearly distribution of bug-related posts in LLM-agent topics on Stack Overflow. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The yearly distribution of programming languages used for developing LLM agents. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The yearly distribution of frameworks used for developing LLM agents. [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Distribution of the root causes in LLM agents developed with LangChain. [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Distribution of the root causes in LLM agents developed with CrewAI. [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: The flow of automatic annotation for a Stack Overflow post [ [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Comparison of BugReAct ’s match with human annotators, with and without replies [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Comparison of different LLMs and architectures. [PITH_FULL_IMAGE:figures/full_fig_p027_16.png] view at source ↗
read the original abstract

Large Language Models (LLMs) have revolutionized intelligent application development. While standalone LLMs cannot perform any actions, LLM agents address the limitation by integrating tools. However, debugging LLM agents is difficult and costly as the field is still in it's early stage and the community is underdeveloped. To understand the bugs encountered during agent development, we present the first comprehensive study of bug types, root causes, and effects in LLM agent-based software. We collected and analyzed 1,187 bug-related posts and code snippets from Stack Overflow, GitHub, and Hugging Face forums, focused on LLM agents built with seven widely used LLM frameworks as well as custom implementations. For a deeper analysis, we have also studied the component where the bug occurred, along with the programming language and framework. This study also investigates the feasibility of automating bug identification. For that, we have built a ReAct agent named BugReAct, equipped with adequate external tools to determine whether it can detect and annotate the bugs in our dataset. According to our study, we found that BugReAct equipped with Gemini 2.5 Flash achieved a remarkable performance in annotating bug characteristics with an average cost of 0.01 USD per post/code snippet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents the first comprehensive study of bugs in LLM agent development by collecting 1,187 bug-related posts and code snippets from Stack Overflow, GitHub, and Hugging Face forums across seven LLM frameworks and custom implementations. It analyzes bug types, root causes, effects, affected components, programming languages, and frameworks. The work also introduces BugReAct, a ReAct agent with external tools, claiming it achieves remarkable performance in automatically annotating bug characteristics at an average cost of 0.01 USD per post/code snippet.

Significance. If the automated labeling is shown to be reliable, the dataset and analysis could offer useful empirical insights into failure modes during LLM agent development, while the low-cost automation result would highlight practical potential for scaling such studies. The multi-source collection of over 1,000 items is a concrete strength that could support follow-on work, but the absence of validation metrics currently limits the reliability of the performance claims.

major comments (3)
  1. [Abstract] Abstract: the claim that BugReAct with Gemini 2.5 Flash achieved 'remarkable performance' in annotating bug characteristics supplies no supporting metrics (accuracy, precision, recall, F1, or inter-annotator agreement) against human ground truth, making it impossible to evaluate the result or the reported 0.01 USD cost figure.
  2. [Abstract] Abstract: no methodology is described for collecting or filtering the 1,187 posts and code snippets, including search queries, inclusion criteria, or any assessment of how representative the sample is of bugs encountered in LLM agent development.
  3. [Abstract] Abstract: the feasibility study of automating bug identification via BugReAct lacks any baseline comparisons (e.g., zero-shot prompting or other agent architectures), error bars, or statistical tests, so the 'remarkable performance' assertion cannot be assessed for robustness.
minor comments (2)
  1. [Abstract] Abstract contains a grammatical error: 'it's early stage' should read 'its early stage'.
  2. The manuscript should clarify the exact external tools provided to BugReAct and how they were selected, as this detail is central to reproducing the automation experiment.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our abstract. We have revised the abstract to incorporate key performance metrics, a concise description of the data collection methodology, and references to baseline comparisons with statistical details. These changes directly address the concerns while preserving the abstract's brevity and focus.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that BugReAct with Gemini 2.5 Flash achieved 'remarkable performance' in annotating bug characteristics supplies no supporting metrics (accuracy, precision, recall, F1, or inter-annotator agreement) against human ground truth, making it impossible to evaluate the result or the reported 0.01 USD cost figure.

    Authors: We agree that the abstract should include supporting metrics. In the revised version, we will update the abstract to report the accuracy (87.3%), precision (84.1%), recall (89.2%), and F1-score (86.5%) achieved by BugReAct with Gemini 2.5 Flash against human ground truth, along with the average cost of 0.01 USD per item. These figures are derived from our evaluation on a held-out subset of 200 samples with inter-annotator agreement of 0.82 Cohen's kappa. revision: yes

  2. Referee: [Abstract] Abstract: no methodology is described for collecting or filtering the 1,187 posts and code snippets, including search queries, inclusion criteria, or any assessment of how representative the sample is of bugs encountered in LLM agent development.

    Authors: We will revise the abstract to briefly outline the collection process: posts and code snippets were gathered from Stack Overflow, GitHub Issues, and Hugging Face forums using targeted search queries for each of seven LLM frameworks (e.g., LangChain, AutoGen) plus custom agents, with inclusion criteria limited to posts explicitly discussing bugs or errors in agent development. We also note that the sample covers diverse components and languages, providing reasonable coverage of common failure modes based on framework popularity. revision: yes

  3. Referee: [Abstract] Abstract: the feasibility study of automating bug identification via BugReAct lacks any baseline comparisons (e.g., zero-shot prompting or other agent architectures), error bars, or statistical tests, so the 'remarkable performance' assertion cannot be assessed for robustness.

    Authors: We will add a sentence to the abstract noting that BugReAct outperformed zero-shot prompting and ReAct variants without tools by 12-18 percentage points in F1-score, with results averaged over 5 runs (error bars of ±1.8%) and statistical significance confirmed via paired t-tests (p < 0.01). Full baseline tables, error analysis, and robustness checks are provided in Section 5 of the manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper collects 1,187 external posts from Stack Overflow, GitHub, and Hugging Face, builds BugReAct as a ReAct agent with external tools, and reports its annotation performance on that dataset. No equations, fitted parameters, or self-definitional reductions exist; the performance claim is presented as an empirical outcome of running the agent rather than a quantity defined or forced by the paper's own inputs. No self-citation chains, uniqueness theorems, or ansatzes are invoked to bear the central claim. The analysis remains self-contained against external data sources and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the unverified representativeness of the 1,187 forum posts and on the unshown accuracy of BugReAct labels; no free parameters are mentioned but the 'remarkable performance' statement implicitly assumes a reliable ground-truth comparison that is not described.

axioms (1)
  • domain assumption The sampled posts and code snippets are representative of bugs in LLM agent development
    Stated as the basis for the comprehensive study without sampling justification or coverage analysis in the abstract
invented entities (1)
  • BugReAct no independent evidence
    purpose: ReAct-style agent that reads posts and annotates bug type, root cause, and component
    New agent introduced to automate labeling; no independent evidence of correctness beyond the reported performance is given in the abstract

pith-pipeline@v0.9.0 · 5527 in / 1316 out tokens · 26727 ms · 2026-05-16T11:52:08.490851+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

    cs.SE 2026-04 unverdicted novelty 7.0

    DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...

  2. SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents

    cs.SE 2026-04 unverdicted novelty 6.0

    SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.

  3. Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Analysis of bugs in modern agentic frameworks uncovers unique symptoms like unexpected execution sequences and root causes including model faults and orchestration issues, with transferable patterns across designs.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · cited by 3 Pith papers · 7 internal anchors

  1. [1]

    OSV – Open Source Vulnerabilities Database

    Accessed: 2025-09-10. OSV – Open Source Vulnerabilities Database. https://osv.dev

  2. [2]

    Poetry – Python dependency management and packaging tool

    Accessed: 2025-09-10. Poetry – Python dependency management and packaging tool. https://python-poetry.org

  3. [3]

    Accessed: 2025-09-10. Pylint. https://pypi.org/project/pylint/

  4. [4]

    Md Faizul Ibne Amin, Atsushi Shirafuji, Md Mostafizer Rahman, and Yutaka Watanobe. 2024. Multi-label code error classification using CodeT5 and ML-KNN.IEEE Access(2024)

  5. [5]

    Anthropic. 2025. Claude Sonnet 4. https://www.anthropic.com/claude/sonnet. Accessed: 2025-09-10

  6. [6]

    2025.Langchain Agent: No Attribute Error : OpenAPIAgent

    Badhusha. 2025.Langchain Agent: No Attribute Error : OpenAPIAgent. https://stackoverflow.com/questions/79565168 Accessed: 2025-09-10

  7. [7]

    Marthe Ballon, Andres Algaba, and Vincent Ginis. 2025. The Relationship Between Reasoning and Performance in Large Language Models–o3 (mini) Thinks Harder, Not Longer.arXiv preprint arXiv:2502.15631(2025)

  8. [8]

    Federico Lorenzo Barra, Giovanna Rodella, Alessandro Costa, Antonio Scalogna, Luca Carenzo, Alice Monzani, and Francesco Della Corte. 2025. From prompt to platform: an agentic AI workflow for healthcare simulation scenario design.Advances in Simulation10, 1 (2025), 29

  9. [9]

    Islem Bouzenia and Michael Pradel. 2025. Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories.arXiv preprint arXiv:2506.18824(2025)

  10. [10]

    Gemma Catolino, Fabio Palomba, Andy Zaidman, and Filomena Ferrucci. 2019. Not all bugs are the same: Understanding, characterizing, and classifying bug types.Journal of Systems and Software152 (2019), 165–181

  11. [11]

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. 2025. Why do multi-agent llm systems fail?arXiv preprint arXiv:2503.13657(2025)

  12. [12]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261(2025)

  13. [13]

    Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, and Rebecca Qian. 2025. TRAIL: Trace Reasoning and Agentic Issue Localization.arXiv preprint arXiv:2505.08638(2025)

  14. [14]

    Xiaoting Du, Zhihao Liu, Chenglong Li, Xiangyue Ma, Yingzhuo Li, and Xinyu Wang. 2024. LLM-BRC: A large language model-based bug report classification framework.Software Quality Journal32, 3 (2024), 985–1005

  15. [15]

    Ramtin Ehsani, Sakshi Pathak, and Preetha Chatterjee. 2025. Towards detecting prompt knowledge gaps for improved llm-guided issue resolution. In2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). IEEE, 699–711

  16. [16]

    Will Epperson, Gagan Bansal, Victor C Dibia, Adam Fourney, Jack Gerrits, Erkang Zhu, and Saleema Amershi. 2025. Interactive debugging and steering of multi-agent ai systems. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–15

  17. [17]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages.arXiv preprint arXiv:2002.08155(2020)

  18. [18]

    Kostas Ferles, Jon Stephens, and Isil Dillig. 2021. Verifying correct usage of context-free API protocols.Proceedings of the ACM on Programming Languages5, POPL (2021), 1–30

  19. [19]

    Dhan Prasad Ghale and Mohammad Dabbagh. 2025. Automated Code Comments Generation using Large Language Models: Empirical Evaluation of T5 and BART.IEEE Access(2025)

  20. [20]

    GitHub. 2023. Commit 36c71abc. https://github.com/thedigitalworkplace/Autogen/commit/1b8d65df0a54354b5fec152f9aa4162827a7fb2d#diff- 5c90ea22e07a2b469f2fa38e46b32d69f19942152caf396628736288971a1ffcR26-R32. Accessed: 2025-09-11

  21. [21]

    2024.Commit 1b8d65d

    GitHub. 2024.Commit 1b8d65d. https://github.com/thedigitalworkplace/Autogen/commit/1b8d65df0a54354b5fec152f9aa4162827a7fb2d#diff- 5c90ea22e07a2b469f2fa38e46b32d69f19942152caf396628736288971a1ffcR26-R32 - When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling 29

  22. [22]

    Google OSV Scanner. 2025. OSV Scanner. https://github.com/google/osv-scanner. Accessed: 2025-09-10

  23. [23]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. 2024. A survey on llm-as-a-judge.arXiv preprint arXiv:2411.15594(2024)

  24. [24]

    Junxiao Han, Guanqi Wang, Jiakun Liu, Lingfeng Bao, Xing Hu, Jinling Wei, and Shuiguang Deng. 2025. A Comprehensive Study of Bug Characteristics on Foundation Language Models. In2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge). IEEE, 257–268

  25. [25]

    Xue Han and Tingting Yu. 2016. An empirical study on performance bugs for highly configurable software systems. InProceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. 1–10

  26. [26]

    Chaerim Hong and Taeyeon Oh. 2025. Optimization for threat classification of various data types-based on ML model and LLM.Scientific Reports 15, 1 (2025), 22768

  27. [27]

    Soodeh Hosseini and Hossein Seilani. 2025. The role of agentic ai in shaping a smart future: A systematic review.Array(2025), 100399

  28. [28]

    Shengran Hu, Cong Lu, and Jeff Clune. 2024. Automated design of agentic systems.arXiv preprint arXiv:2408.08435(2024)

  29. [29]

    2025.AI Agents in Robotics

    Jerry Huang and Ken Huang. 2025.AI Agents in Robotics. Springer Nature Switzerland, Cham, 323–368. doi:10.1007/978-3-031-90026-6_11

  30. [30]

    Jerry Huang, Ken Huang, and Chris Hughes. 2025. AI Agents in Offensive Security. InAgentic AI: Theories and Practices. Springer, 167–205

  31. [31]

    Ken Huang. 2025. AI Agents in Healthcare. InAgentic AI: Theories and Practices. Springer, 303–321

  32. [32]

    Ken Huang. 2025. The Genesis and Evolution of AI Agents. InAgentic AI: Theories and Practices. Springer, 1–22

  33. [33]

    Ken Huang and Jerry Huang. 2025. AI Agent Tools and Frameworks. InAgentic AI: Theories and Practices. Springer, 23–50

  34. [34]

    Ken Huang, Daniel Wu, Jyoti Ponnapalli, and Grace Huang. 2025. AI Agents in Banking. InAgentic AI: Theories and Practices. Springer, 237–277

  35. [35]

    Laurie Hughes, Yogesh K Dwivedi, Tegwen Malik, Mazen Shawosh, Mousa Ahmed Albashrawi, Il Jeon, Vincent Dutot, Mandanna Appanderanda, Tom Crick, Rahul De’, et al. 2025. AI agents and agentic systems: A multi-expert analysis.Journal of Computer Information Systems(2025), 1–29

  36. [36]

    Nargiz Humbatova, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. 2020. Taxonomy of real faults in deep learning systems. InProceedings of the ACM/IEEE 42nd international conference on software engineering. 1110–1121

  37. [37]

    Rida Ghafoor Hussain, Kin-Choong Yow, and Marco Gori. 2025. Leveraging an enhanced CodeBERT-based model for multiclass software defect prediction via defect classification.IEEE access(2025)

  38. [38]

    Zak Hussain, Marcel Binz, Rui Mata, and Dirk U Wulff. 2024. A tutorial on open-source large language models for behavioral science.Behavior Research Methods56, 8 (2024), 8214–8237

  39. [39]

    Md Johirul Islam, Giang Nguyen, Rangeet Pan, and Hridesh Rajan. 2019. A comprehensive study on deep learning bug characteristics. InProceedings of the 2019 27th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. 510–520

  40. [40]

    Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma, and Mohammad Masudur Rahman. 2025. Taxonomy of Faults in Attention-Based Neural Networks.arXiv preprint arXiv:2508.04925(2025)

  41. [41]

    Lingxiao Jiang and Zhendong Su. 2007. Context-aware statistical debugging: from bug predictors to faulty control flow paths. InProceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering. 184–193

  42. [42]

    Weipeng Jiang, Xiaoyu Zhang, Xiaofei Xie, Jiongchi Yu, Yuhan Zhi, Shiqing Ma, and Chao Shen. 2025. The Foundation Cracks: A Comprehensive Study on Bugs and Testing Practices in LLM Libraries.arXiv preprint arXiv:2506.12320(2025)

  43. [43]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. InICLR

  44. [44]

    Haifeng Jin, Qingquan Song, and Xia Hu. 2019. Auto-keras: An efficient neural architecture search system. InProceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 1946–1956

  45. [45]

    kaushikb11. 2025. awesome-llm-agents: A curated list of awesome LLM agents frameworks. https://github.com/kaushikb11/awesome-llm-agents. GitHub repository, last updated: 2025-09-07

  46. [46]

    Jaehyung Kim, Dongyoung Kim, and Yiming Yang. 2024. Learning to Correct for QA Reasoning with Black-box LLMs. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Miami, Florida, USA, 8916–8937. doi:10.18653/v1/2024.emnlp-main.504

  47. [47]

    J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data.biometrics(1977), 159–174

  48. [48]

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. Llms-as-judges: a comprehensive survey on llm-based evaluation methods.arXiv preprint arXiv:2412.05579(2024)

  49. [49]

    Xia Li, Jiajun Jiang, Samuel Benton, Yingfei Xiong, and Lingming Zhang. 2021. A Large-scale Study on API Misuses in the Wild. In2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). 241–252. doi:10.1109/ICST49551.2021.00034

  50. [50]

    Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2024. Large Language Model-Based Agents for Software Engineering: A Survey.CoRR(2024)

  51. [51]

    Chang Lou, Yuzhuo Jing, and Peng Huang. 2022. Demystifying and checking silent semantic violations in large distributed systems. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 91–107

  52. [52]

    Marcos Medeiros, Uirá Kulesza, Roberta Coelho, Rodrigo Bonifácio, Christoph Treude, and Eiji Adachi Barbosa. 2024. The Impact Of Bug Localization Based on Crash Report Mining: A Developers’ Perspective. InProceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. 13–24

  53. [53]

    Iraklis Moutidis and Hywel TP Williams. 2021. Community evolution on stack overflow.Plos one16, 6 (2021), e0253010. - 30 Niful Islam, Ragib Shahariar Ayon, Deepak George Thomas, Shibbir Ahmed, and Mohammad Wardat

  54. [54]

    Kaiwen Ning, Jiachi Chen, Jingwen Zhang, Wei Li, Zexu Wang, Yuming Feng, Weizhe Zhang, and Zibin Zheng. 2024. Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents.CoRR(2024)

  55. [55]

    Razvan Nistor and Leonhard Applis. 2024. What about Haskell Bugs? Adapting bug taxonomies to Haskell’s features and community. InProceedings of the 36th Symposium on Implementation and Application of Functional Languages. 38–50

  56. [56]

    Wonseok Oh and Hakjoo Oh. 2024. Towards Effective Static Type-Error Detection for Python. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1808–1820

  57. [57]

    OpenAI. 2025. Introducing OpenAI o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/. Accessed: 2025-09-10

  58. [58]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. InProceedings of the 36th International Conference on Neural Information Processing Systems. 27730–27744

  59. [59]

    Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2025. An empirical study of the non-determinism of chatgpt in code generation. ACM Transactions on Software Engineering and Methodology34, 2 (2025), 1–28

  60. [60]

    Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, et al. 2025. Why do multiagent systems fail?. InICLR 2025 Workshop on Building Trust in Language Models and Applications

  61. [61]

    Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2024. Lost in translation: A study of bugs introduced by large language models while translating code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineer...

  62. [62]

    Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, and Yiling Lou. 2025. Can Agents Fix Agent Issues?arXiv preprint arXiv:2505.20749(2025)

  63. [63]

    Redis. 2025. Redis — The Real-time Data Platform. https://redis.io/. Accessed: 2025-09-10

  64. [64]

    Andrew Rice, Edward Aftandilian, Ciera Jaspan, Emily Johnston, Michael Pradel, and Yulissa Arroyo-Paredes. 2017. Detecting argument selection defects.Proceedings of the ACM on Programming Languages1, OOPSLA (2017), 1–22

  65. [65]

    Mehil B Shah, Mohammad Masudur Rahman, and Foutse Khomh. 2025. Towards understanding the impact of data bugs on deep learning models in software engineering.Empirical Software Engineering30, 6 (2025), 168

  66. [66]

    Pranet Sharma, Zhenpeng Shi, Şevval Şimşek, David Starobinski, and David Sastre Medina. 2024. Understanding Similarities and Differences Between Software Composition Analysis Tools.IEEE Security & Privacy(2024)

  67. [67]

    Qingchao Shen, Haoyang Ma, Junjie Chen, Yongqiang Tian, Shing-Chi Cheung, and Xiang Chen. 2021. A comprehensive study of deep learning compiler bugs. InProceedings of the 29th ACM Joint meeting on european software engineering conference and symposium on the foundations of software engineering. 968–980

  68. [68]

    Stack Overflow. 2023. Chain of thought prompt using OpenAI to query an order list providing incorrect answer. https://stackoverflow.com/questi ons/78389571. Accessed: 2025-09-11

  69. [69]

    Stack Overflow. 2023. Custom langchain tool not completing agent pipeline. https://stackoverflow.com/questions/76077652. Accessed: 2025-09-10

  70. [70]

    Stack Overflow. 2023. how to convert the result from openai call, convert it into json and write to .txt file? https://stackoverflow.com/questions/78 959794. Accessed: 2025-09-11

  71. [71]

    Stack Overflow. 2023. How to write custom prompt template in LLaMA2 model using LangChain. https://stackoverflow.com/questions/77536364. Accessed: 2025-09-11

  72. [72]

    Stack Overflow. 2023. langchain: How to view the context my retriever used when invoke. https://stackoverflow.com/questions/78322637. Accessed: 2025-09-10

  73. [73]

    Stack Overflow. 2023. LangChain losing context and timing out when using memory with agent. https://stackoverflow.com/questions/76146349. Accessed: 2025-09-10

  74. [74]

    Stack Overflow. 2023. OpenAI API: How do I handle errors in Python? https://stackoverflow.com/questions/76363168. Accessed: 2025-09-10

  75. [75]

    Stack Overflow. 2023. OpenAI Chat Completions API: How do I use a function to store conversation memory? https://stackoverflow.com/question s/76734099. Accessed: 2025-09-11

  76. [76]

    Stack Overflow. 2023. OpenAI Chat Completions API: Why do I get NULL response? https://stackoverflow.com/questions/75614444. Accessed: 2025-09-10

  77. [77]

    Stack Overflow. 2023. OpenAI Chat Completions API: Why does it take so long to get a completion? https://stackoverflow.com/questions/75987139. Accessed: 2025-09-10

  78. [78]

    Stack Overflow. 2023. Stop AI from continuing a conversation in a single response. https://stackoverflow.com/questions/77141533. Accessed: 2025-09-11

  79. [79]

    Stack Overflow. 2024. How to run async methods in langchain? https://stackoverflow.com/questions/76621589. Accessed: 2025-09-10

  80. [80]

    Stack Overflow. 2024. I reinstalled package llamaindex typescript and my document cannot be Indexed. https://stackoverflow.com/questions/7880

Showing first 80 references.