When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

Niful Islam, Ragib Shahriar Ayon, Deepak George Thomas, Shibbir Ahmed, Mohammad Wardat

classification 💻 cs.SE

keywords agents, bugs, agent, bugreact, built, code, comprehensive, development

Large Language Models (LLMs) have revolutionized intelligent application development. While standalone LLMs cannot perform any actions, LLM agents address the limitation by integrating tools. However, debugging LLM agents is difficult and costly as the field is still in its early stage and the community is underdeveloped. To understand the bugs encountered during agent development, we present the first comprehensive study of bug types, root causes, and effects in LLM agent-based software. We collected and analyzed 1,187 bug-related posts and code snippets from Stack Overflow, GitHub, and Hugging Face forums, focused on LLM agents built with seven widely used LLM frameworks as well as custom implementations. For a deeper analysis, we have also studied the component where the bug occurred, along with the programming language and framework. This study also investigates the feasibility of automating bug identification. For that, we have built a ReAct agent named BugReAct, equipped with adequate external tools to determine whether it can detect and annotate the bugs in our dataset. We found that BugReAct equipped with Gemini 2.5 Flash achieved a remarkable performance in annotating bug characteristics with an average cost of 0.01 USD per post/code snippet.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
cs.SE 2026-04 unverdicted novelty 7.0

DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...

SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents
cs.SE 2026-04 unverdicted novelty 6.0

SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.

Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study
cs.SE 2026-04 unverdicted novelty 6.0

Analysis of bugs in modern agentic frameworks uncovers unique symptoms like unexpected execution sequences and root causes including model faults and orchestration issues, with transferable patterns across designs.