TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems

Minxing Wang; Xiaofei Xie; Yintong Huo

arxiv: 2605.26563 · v1 · pith:SEHWXHFRnew · submitted 2026-05-26 · 💻 cs.SE

Pith reviewed 2026-06-29 16:10 UTC · model grok-4.3

classification 💻 cs.SE

keywords agentic coding · failure diagnosis · trajectory analysis · repository-level tasks · RootSE benchmark · localization accuracy · noise filtering · test report priors

The pith

TrajAudit diagnoses failures in long noisy agentic coding trajectories by filtering irrelevant details and supplying test-report priors to an investigator agent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic coding systems produce extended execution traces filled with redundant code and context that degrade diagnosis performance. TrajAudit counters this with two supporting modules for its investigator agent: one that removes failure-irrelevant material through pattern matching and keyword detection, and another that supplies an initial diagnosis drawn from test failure reports. The agent can then retrieve filtered-out segments on demand. On the new RootSE benchmark of 93 real repository-level failure cases, the method raises localization accuracy by more than 24.4 percentage points over prior approaches while cutting token usage by at least 18 percent.
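To make the test-report prior more concrete, here is a minimal sketch that pulls failing tests out of a report before any trajectory is read. It assumes a pytest-style summary line (`FAILED <test id> - <ErrorType>: <message>`); the function name `preliminary_diagnosis`, the `FailureHint` type, the regular expression, and the sample report (including the port value) are all hypothetical. The paper only states that a preliminary diagnosis is generated from test failure reports, so this is an illustration of the idea rather than the authors' module.

```python
import re
from dataclasses import dataclass

# Assumed report format (pytest-style short summary); not taken from the paper.
FAILED_LINE = re.compile(
    r"^FAILED\s+(?P<test>\S+)\s+-\s+(?P<error>\w+):\s*(?P<message>.*)$"
)


@dataclass
class FailureHint:
    test: str      # identifier of the failing test
    error: str     # exception or assertion type
    message: str   # message attached to the failure


def preliminary_diagnosis(report: str) -> list[FailureHint]:
    """Collect one hint per failing test found in the report."""
    hints = []
    for line in report.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            hints.append(FailureHint(**match.groupdict()))
    return hints


# Hypothetical report used only for illustration.
report = """
PASSED tests/test_server.py::test_start
FAILED tests/test_server.py::test_default_port - AssertionError: expected port 8080
"""
hints = preliminary_diagnosis(report)
```

In a setup like this, the hints would be handed to the investigator as a starting hypothesis, which it can confirm or reject against the trajectory.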

Core claim

TrajAudit is the first failure diagnosis framework built specifically for repository-level agentic coding trajectories. It pairs an investigator agent with a noise-filter module that applies pattern matching and keyword detection, plus a preliminary-diagnosis module that extracts prior knowledge from test reports; the agent invokes tools to pull back filtered content when needed. This design directly targets the twin problems of excessive length and high noise that impair LLM reasoning on complex software-maintenance traces. Evaluation on RootSE, a collection of 93 authentic failure instances, shows the framework exceeds all baselines by over 24.4 percentage points in localization accuracy a

What carries the argument

Investigator agent backed by a noise-filter module (pattern matching plus keyword detection) and a preliminary-diagnosis module (drawn from test reports), with on-demand retrieval of the content the filter set aside.
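The filter-then-retrieve mechanism can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword list, the noise pattern, and the names `fold`, `is_noise`, and `FoldedTrajectory` are hypothetical, since the page does not list the patterns or keywords the paper uses. Observations judged to be noise are replaced by a short placeholder and kept in a side store, so a retrieval tool can hand them back on request.

```python
import re
from dataclasses import dataclass, field

# Hypothetical signals; the actual patterns and keywords are not given here.
KEEP_KEYWORDS = ("error", "traceback", "failed", "exception")
NOISE_PATTERN = re.compile(r"^\s*(import |from \S+ import |#|$)")


@dataclass
class FoldedTrajectory:
    visible: list[str]                                   # what the investigator reads first
    folded: dict[int, str] = field(default_factory=dict)  # step index -> hidden content

    def retrieve(self, step: int) -> str:
        """Tool call that returns the original content of a folded step."""
        return self.folded[step]


def is_noise(observation: str) -> bool:
    """Treat an observation as noise if it has no failure keyword and
    every line is an import, a comment, or blank."""
    text = observation.lower()
    if any(keyword in text for keyword in KEEP_KEYWORDS):
        return False
    return all(NOISE_PATTERN.match(line) for line in observation.splitlines())


def fold(observations: list[str]) -> FoldedTrajectory:
    """Replace noisy observations with placeholders, keeping the originals."""
    result = FoldedTrajectory(visible=[])
    for step, observation in enumerate(observations):
        if is_noise(observation):
            result.folded[step] = observation
            result.visible.append(f"[step {step}: folded, {len(observation)} chars]")
        else:
            result.visible.append(observation)
    return result


# Hypothetical three-step trajectory.
observations = [
    "import os\nimport sys\n# helper module",
    "Traceback (most recent call last):\nAssertionError: expected port 8080",
    "def serve(port=80):\n    return port",
]
trajectory = fold(observations)
```

Keeping the folded text addressable by step index is the design point: the filter can be aggressive about what it hides because nothing is discarded outright.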

If this is right

Automated diagnosis becomes feasible for the longest and noisiest agentic coding runs that current methods cannot handle.
Lower token budgets allow repeated diagnosis cycles during iterative refinement of agentic systems.
The RootSE benchmark supplies a concrete testbed for comparing future trajectory-diagnosis techniques.
A localization-accuracy gain of more than 24 percentage points over prior work should translate into faster identification of root causes in real maintenance tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same filtering-plus-prior structure could be tested on non-coding agentic domains that also generate long execution logs.
Integration into CI pipelines would let teams receive automated failure explanations immediately after each agent run.
If the filter proves too aggressive on certain codebases, adding lightweight semantic embeddings to the matching step could be checked experimentally.

Load-bearing premise

Simple pattern matching and keyword detection together with a preliminary diagnosis from test reports will keep every essential piece of failure information while stripping enough noise for the investigator agent to succeed.

What would settle it

A fresh set of repository trajectories in which the decisive failure evidence sits inside code patterns the filter removes, producing no accuracy gain over baselines.
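One way to quantify this scenario is to count how often the annotated failure step ends up among the steps the filter hid. The sketch below assumes each case is available as a pair of (annotated failure step, set of folded step indices); that data layout, the function name `folded_failure_rate`, and the sample cases are hypothetical and do not come from RootSE.

```python
def folded_failure_rate(cases: list[tuple[int, set[int]]]) -> float:
    """Fraction of cases whose annotated failure step was folded away."""
    if not cases:
        raise ValueError("no cases to audit")
    hidden = sum(
        1 for failure_step, folded_steps in cases if failure_step in folded_steps
    )
    return hidden / len(cases)


# Hypothetical audit data: (annotated failure step, steps hidden by the filter).
cases = [
    (4, {1, 2}),
    (7, {7, 9}),   # the decisive step was folded in this case
    (3, set()),
    (5, {0, 6}),
]
rate = folded_failure_rate(cases)
```

A rate near zero would support the premise; a high rate on fresh trajectories would point to the failure mode described above, subject to how often the investigator retrieves the hidden step anyway.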

Figures

Figures reproduced from arXiv: 2605.26563 by Minxing Wang, Xiaofei Xie, Yintong Huo.

**Figure 1.** Failure diagnosis in agentic systems. Adjacent text from the paper: "… refer to the information returned by tools invoked by the agent, often accounting for over 70% of the total trajectory content. However, most observations are not relevant to failure localization, such as redundant program structures and verbose code context, which can interfere with LLM reasoning [44]. (2) Excessive length. These trajectories often span from 20 to ov…"

**Figure 2.** The agent workflow and execution trajectory in a coding task.

**Figure 3.** Accuracy of baseline methods under varying tra…

**Figure 4.** RootSE Annotation Guideline. Adjacent text from the paper: "… to a single decisive step due to system limitations rather than to ambiguous task descriptions or misaligned test code. 3.3 Annotation. Following prior work [57], we adopt the Earliest Decisive Error Step as the failure point definition for RootSE. We outline this problem formulation and our annotation process below. 3.3.1 Problem Formulation. We consider an agentic system as a …"

**Figure 5.** Phase-wise Failure Distribution in RootSE.

**Figure 6.** The overall workflow of TrajAudit. Adjacent text from the paper: "… by dynamically inspecting folded observations and probing for additional context when the compressed trajectory provides insufficient information [43]. Through the complementary strengths of targeted information extraction and active context probing, TrajAudit locates failures more accurately and efficiently than existing methods. As illustrated in …"

**Figure 7.** Exact Step-Level Accuracy across Varying Trajectory…

**Figure 8.** A Worked Example of TrajAudit. Adjacent text from the paper: "… most suspicious failure region. The diagnosis is based on the observation that the test code explicitly specifies the expected port, yet the generated patch does not address this inconsistency, suggesting that the agent failed to correctly identify the problematic code. (ib) Concurrently, the semantic saliency folding module applies pattern matching and keyword filtering to…"

Original abstract

Agentic systems have been widely studied to automate software engineering jobs such as bug fixing. As these systems increasingly tackle complex tasks, understanding where and why they fail becomes essential for iterative refinement and operational reliability. Existing automated failure diagnosis approaches leverage task execution trajectories, yet their effectiveness degrades substantially as trajectory length and complexity increase. For repository-level coding tasks specifically, trajectories are laden with noise, such as redundant program structure and verbose code context. Moreover, these trajectories are very long, while long-context reasoning remains a known weakness of LLMs. To address these two challenges, we propose TrajAudit, the first failure diagnosis framework for repository-level coding trajectories. TrajAudit employs an investigator agent supported by two modules: one filters failure-irrelevant information through pattern matching and keyword detection, and the other generates a preliminary diagnosis from test failure reports as prior knowledge, helping the agent handle noisy long contexts. The investigator agent can further invoke tools to retrieve filtered content on demand, ensuring that critical information is preserved while noise is minimized. We also introduce RootSE, a benchmark of 93 real-world agentic failure instances sourced from software maintenance tasks, representing the most complex trajectory diagnosis benchmark to date. Experiments on RootSE show that TrajAudit outperforms all existing baselines by over 24.4 percentage points in localization accuracy, while reducing token consumption by at least 18%, demonstrating its practical effectiveness. We hope this work draws community attention to failure management in agentic software engineering and provides a foundational resource for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TrajAudit adds a filter-plus-prior setup and a new RootSE benchmark for diagnosing failures in long repo-scale agent trajectories, but the 24-point gains rest on an unverified claim that the filter keeps every necessary signal.

The paper's core move is practical: it splits failure diagnosis into a pattern-and-keyword filter that strips redundant code, a test-report prior that gives the agent a head start, and an investigator that can pull back dropped content on demand. The authors also release RootSE, 93 real maintenance-task failures, which is positioned as the hardest trajectory benchmark so far.

That combination is new for repository-level work. Most prior diagnosis methods either work on short unit-test traces or treat the whole trajectory as input, so the explicit handling of length and noise fills a gap people actually hit when scaling agents.

The reported numbers (localization accuracy more than 24.4 percentage points higher and at least 18% fewer tokens) are the headline result. They come from running the filtered trajectories through the investigator versus existing baselines.
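The two headline figures are different kinds of quantity: the accuracy gain is an absolute difference in percentage points, while the token saving is a relative reduction. The sketch below shows both calculations on made-up data. The function names and every number in it are hypothetical, and the accuracy shown is a plain exact-match rate over predicted versus annotated failure steps, which is an assumption about how step-level localization accuracy is scored.

```python
def step_accuracy(predicted: list[int], annotated: list[int]) -> float:
    """Percentage of cases where the predicted failure step matches the annotation."""
    if len(predicted) != len(annotated) or not annotated:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == a for p, a in zip(predicted, annotated))
    return 100.0 * hits / len(annotated)


def token_reduction(baseline_tokens: int, method_tokens: int) -> float:
    """Relative token saving, in percent of the baseline's usage."""
    return 100.0 * (baseline_tokens - method_tokens) / baseline_tokens


# Hypothetical predictions on four cases; not RootSE results.
annotated = [3, 7, 2, 9]
baseline = [3, 1, 5, 0]   # 1 of 4 correct
method = [3, 7, 2, 4]     # 3 of 4 correct

# Absolute gap in percentage points, not a relative improvement.
gap_pp = step_accuracy(method, annotated) - step_accuracy(baseline, annotated)

# Relative saving against a hypothetical baseline budget.
saving = token_reduction(baseline_tokens=1000, method_tokens=800)
```

Here the gap is 50 percentage points and the saving is 20 percent; stated as a relative improvement, the same accuracy change would read as a tripling, which is why the unit matters when comparing against the paper's 24.4-point claim.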

The soft spot is exactly the one the stress-test flags. The gains are measured after the filter has already removed content. If any decisive root-cause fragment falls outside the chosen patterns or keywords on even a handful of the 93 cases, the investigator is solving an easier problem than the baselines. The abstract gives no evidence that the patterns were validated for completeness or that the filter was ablated. Without those checks, the margin is hard to interpret.

This is for people building or debugging agentic coding tools who need to understand why their systems fail on real repositories. The benchmark alone would be worth a look for that group.

It should go to peer review. The idea is grounded in a real scaling limit, the benchmark is a concrete addition, and the filtering concern is fixable with more detail rather than fatal.

Referee Report

1 major / 1 minor

Summary. The paper introduces TrajAudit, the first automated failure diagnosis framework for repository-level agentic coding trajectories. It consists of an investigator agent augmented by two modules—one that filters failure-irrelevant information via pattern matching and keyword detection, and another that generates a preliminary diagnosis from test failure reports as prior knowledge—while allowing on-demand retrieval of filtered content. The work also presents RootSE, a benchmark of 93 real-world failure instances from software maintenance tasks, and reports that TrajAudit outperforms existing baselines by more than 24.4 percentage points in localization accuracy while reducing token consumption by at least 18%.

Significance. If the central empirical claims hold after verification of the filtering step, the contribution would be significant: it directly targets the degradation of diagnosis methods on long, noisy repository trajectories, introduces the most complex such benchmark to date, and demonstrates concrete gains in both accuracy and efficiency. The emphasis on practical failure management in agentic SE systems could stimulate follow-on work on trajectory auditing and iterative agent refinement.

major comments (1)

[Abstract] The reported 24.4 pp localization gain and 18% token reduction on RootSE are measured after the pattern-matching/keyword filter has already removed content from the trajectories and the test-report prior has been added. For these margins to be attributable to the investigator agent rather than to an easier problem, the manuscript must demonstrate that the retained fragments always include every decisive root-cause signal across the 93 instances. No such validation (e.g., manual audit of discarded segments, coverage analysis of failure-critical statements, or ablation removing the filter) is described, leaving open the possibility that the preprocessing itself accounts for part or all of the observed improvement.

minor comments (1)

[Abstract] The claim that RootSE is 'the most complex trajectory diagnosis benchmark to date' would be strengthened by explicit quantitative comparisons (trajectory length, number of files touched, number of LLM calls) against prior benchmarks.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for this constructive comment on validating the filtering step. We address the concern directly below and commit to revisions that strengthen the attribution of results to the investigator agent.

Referee: The reported 24.4 pp localization gain and 18% token reduction on RootSE are measured after the pattern-matching/keyword filter has already removed content from the trajectories and the test-report prior has been added. For these margins to be attributable to the investigator agent rather than to an easier problem, the manuscript must demonstrate that the retained fragments always include every decisive root-cause signal across the 93 instances. No such validation (e.g., manual audit of discarded segments, coverage analysis of failure-critical statements, or ablation removing the filter) is described, leaving open the possibility that the preprocessing itself accounts for part or all of the observed improvement.

Authors: We agree that the current manuscript lacks an explicit validation (such as a manual audit of discarded segments or an ablation removing the filter) to confirm that no decisive root-cause signals are lost. The filter is designed via pattern matching and keyword detection to target only failure-irrelevant noise (e.g., redundant program structure), with the investigator agent retaining on-demand tool-based retrieval of any filtered content. Nevertheless, this design choice alone does not substitute for empirical verification across all 93 RootSE instances. We will add (1) a manual audit of discarded segments on a representative sample of instances and (2) an ablation study that disables the filter (while retaining the test-report prior and investigator agent) to quantify its isolated contribution. These results will be reported in the revised manuscript to directly address the attribution question.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with no derivations or self-referential fits

The paper describes a filtering-plus-prior system (pattern matching, keyword detection, test-report prior) and reports measured performance gains on the newly introduced RootSE benchmark. No equations, fitted parameters, or load-bearing self-citations appear in the provided text. Localization accuracy and token reduction are presented as direct experimental outcomes rather than quantities derived from the method's own inputs by construction. The filtering step is an engineering choice whose sufficiency is tested empirically, not presupposed in a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation; the work is an empirical framework proposal. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5803 in / 1088 out tokens · 26157 ms · 2026-06-29T16:10:00.088043+00:00 · methodology

Reference graph

Works this paper leans on

60 extracted references · 11 canonical work pages · 6 internal anchors

[1] Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2007. On the accuracy of spectrum-based fault localization. In Testing: Academic and industrial conference practice and research techniques-MUTATION (TAICPART-MUTATION 2007). IEEE, 89–98.
[2] Elena Akik, Marko Vještica, Vladimir Dimitrieski, Slavica Kordić, and Sonja Ristić. 2025. Architecture of Multi-agent System for Automatic Code Template Maintenance. In European Conference on Advances in Databases and Information Systems. Springer, 296–310.
[3] Stefano V Albrecht and Peter Stone. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence 258 (2018), 66–95.
[4] Anonymous. 2026. Reference. doi:10.5281/zenodo.19230090
[5] Amine Barrak. 2025. Traceability and Accountability in Role-Specialized Multi-Agent LLM Pipelines. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW). IEEE, 315–322.
[6] Islem Bouzenia and Michael Pradel. 2025. Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2846–2857.
[7] Subhajit Chaudhury, Payel Das, Sarathkrishna Swaminathan, Georgios Kollias, Elliot Nelson, Khushbu Pahwa, Tejaswini Pedapati, Igor Melnyk, and Matthew Riemer. 2025. EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 11696–11708.
[8] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement 20, 1 (1960), 37–46.
[9] Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. 2025. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941 (2025).
[10] Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, and Rebecca Qian. 2025. TRAIL: Trace Reasoning and Agentic Issue Localization. arXiv:2505.08638 [cs.AI] https://arxiv.org/abs/2505.08638
[11] Liming Dong, Qinghua Lu, and Liming Zhu. 2024. Agentops: Enabling observability of llm agents. arXiv preprint arXiv:2411.05285 (2024).
[12] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security. 1285–1298.
[13] Will Epperson, Gagan Bansal, Victor C Dibia, Adam Fourney, Jack Gerrits, Erkang Zhu, and Saleema Amershi. 2025. Interactive debugging and steering of multi-agent ai systems. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–15.
[14] Stan Franklin and Art Graesser. 1996. Is it an Agent, or just a Program?: A Taxonomy for Autonomous Agents. In International workshop on agent theories, architectures, and languages. Springer, 21–35.
[15] Yu Ge, Linna Xie, Zhong Li, Yu Pei, and Tian Zhang. 2025. Who is introducing the failure? automatically attributing failures of multi-agent systems via spectrum analysis. arXiv preprint arXiv:2509.13782 (2025).
[16] Haixuan Guo, Shuhan Yuan, and Xintao Wu. 2021. Logbert: Log anomaly detection via bert. In 2021 international joint conference on neural networks (IJCNN). IEEE, 1–8.
[17] Shanshan Han, Qifan Zhang, Weizhao Jin, and Zhaozhuo Xu. 2024. LLM multi-agent systems: Challenges and open problems. arXiv preprint arXiv:2402.03578 (2024).
[18] Shilin He, Jieming Zhu, Pinjia He, and Michael R Lyu. 2016. Experience report: System log analysis for anomaly detection. In 2016 IEEE 27th international symposium on software reliability engineering (ISSRE). IEEE, 207–218.
[19] Samuel Holt, Max Ruiz Luyten, and Mihaela van der Schaar. [n. d.]. L2MAC: Large Language Model Automatic Computer for Extensive Code Generation. In The Twelfth International Conference on Learning Representations.
[20–21] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations.
[22] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79.
[23] Li Hu, Guoqiang Chen, Xiuwei Shang, Shaoyin Cheng, Benlong Wu, LiGangyang LiGangyang, Xu Zhu, Weiming Zhang, and Nenghai Yu. 2025. CompileAgent: Automated real-world repo-level compilation with tool-integrated LLM-based agent system. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2078–2091.
[24] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 (2023).
[25] James A Jones and Mary Jean Harrold. 2005. Empirical evaluation of the tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering. 273–282.
[26] Satyadhar Joshi. 2025. LLMOps, AgentOps, and MLOps for Generative AI: A Comprehensive Review. (2025).
[27] Max Landauer, Sebastian Onder, Florian Skopik, and Markus Wurzenberger. 2023. Deep learning for anomaly detection in log data: A survey. Machine Learning with Applications 12 (2023), 100470.
[28] J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. biometrics (1977), 159–174.
[29] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large language model society. Advances in neural information processing systems 36 (2023), 51991–52008.
[30] Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2024. Large language model-based agents for software engineering: A survey. ACM Transactions on Software Engineering and Methodology (2024).
[31] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics 12 (2024), 157–173.
[32] Tianyang Liu, Canwen Xu, and Julian McAuley. [n. d.]. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. In The Twelfth International Conference on Learning Representations.
[33] Ruofan Lu, Yichen Li, and Yintong Huo. 2025. Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 3856–3860.
[34] Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. 2025. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460 (2025).
[35] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2023. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations.
[36] Niels Mündler, Mark N Müller, Jingxuan He, and Martin Vechev. 2024. Swt-bench: Testing and validating real-world bug-fixes with code agents. Advances in Neural Information Processing Systems 37 (2024), 81857–81887.
[37] Eugene W Myers. 1986. An O(ND) difference algorithm and its variations. Algorithmica 1, 1 (1986), 251–266.
[38] OpenAI. 2023. OpenAI API. https://openai.com/blog/openai-api [Online; accessed 1 Aug 2023]
[39] Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2025. An empirical study of the non-determinism of chatgpt in code generation. ACM Transactions on Software Engineering and Methodology 34, 2 (2025), 1–28.
[40] Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, et al. 2025. Why do multiagent systems fail?. In ICLR 2025 Workshop on Building Trust in Language Models and Applications.
[41] Chris Parnin and Alessandro Orso. 2011. Are automated debugging techniques actually helping programmers?. In Proceedings of the 2011 international symposium on software testing and analysis. 199–209.
[42] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. 2024. Chatdev: Communicative agents for software development. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 15174–15186.
[43] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789 (2023).
[44] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems 36 (2023), 68539–68551.
[45] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning. PMLR, 31210–31227.
[46] Yuan Tian and Tianyi Zhang. 2025. Selective Prompt Anchoring for Code Generation. In International Conference on Machine Learning. PMLR, 59528–59551.
[47] Maria Trofimova, Anton Shevtsov, Badertdinov Ibragim, Konstantin Pyaev, Simon Karasik, and Alexander Golubev. 2025. OpenHands Trajectories with Qwen3-Coder-480B-A35B-Instruct. Nebius blog (2025).
[48] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741 (2024).
[49] Yanlin Wang, Wanjun Zhong, Yanxian Huang, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, and Zibin Zheng. 2025. Agents in software engineering: Survey, landscape, and vision. Automated Software Engineering 32, 2 (2025), 70.
[50] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
[51] Mark Weiser. 1984. Program slicing. IEEE Transactions on software engineering 4 (1984), 352–357.
[52] W Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A survey on software fault localization. IEEE Transactions on Software Engineering 42, 8 (2016), 707–740.
[53] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First conference on language modeling.
[54] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.
[55] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.
[56] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations.
[57] Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. 2025. AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? arXiv preprint arXiv:2509.03312 (2025).
[58] Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. 2025. Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=GazlTYxZss
[59] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604.
[60] Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, and Beidi Chen. 2025. GSM: How Do your LLMs Behave over Infinitely Increasing Reasoning Complexity and Context Length?. In Forty-second International Conference on Machine Learning.

[1] [1]

Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2007. On the accuracy of spectrum-based fault localization. InTesting: Academic and industrial conference practice and research techniques-MUTATION (TAICPART-MUTATION 2007). IEEE, 89–98

2007

[2] [2]

Elena Akik, Marko Vještica, Vladimir Dimitrieski, Slavica Kordić, and Sonja Ristić. 2025. Architecture of Multi-agent System for Automatic Code Template Maintenance. InEuropean Conference on Advances in Databases and Information Systems. Springer, 296–310

2025

[3] [3]

Stefano V Albrecht and Peter Stone. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems.Artificial Intelligence258 (2018), 66–95

2018

[4] [4]

2026.Reference

Anonymous. 2026.Reference. doi:10.5281/zenodo.19230090

work page doi:10.5281/zenodo.19230090 2026

[5] [5]

Amine Barrak. 2025. Traceability and Accountability in Role-Specialized Multi- Agent LLM Pipelines. In2025 40th IEEE/ACM International Conference on Auto- mated Software Engineering Workshops (ASEW). IEEE, 315–322

2025

[6] [6]

Islem Bouzenia and Michael Pradel. 2025. Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2846– 2857

2025

[7] [7]

Subhajit Chaudhury, Payel Das, Sarathkrishna Swaminathan, Georgios Kollias, Elliot Nelson, Khushbu Pahwa, Tejaswini Pedapati, Igor Melnyk, and Matthew Riemer. 2025. EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts. InProceedings of the 63rd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers). 11696–11708

2025

[8] [8]

Jacob Cohen. 1960. A coefficient of agreement for nominal scales.Educational and psychological measurement20, 1 (1960), 37–46

1960

[9] [9]

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. 2025. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941 (2025).

[10]

Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, and Rebecca Qian. 2025. TRAIL: Trace Reasoning and Agentic Issue Localization. arXiv:2505.08638 [cs.AI] https://arxiv.org/abs/2505.08638

[11]

Liming Dong, Qinghua Lu, and Liming Zhu. 2024. Agentops: Enabling observability of llm agents. arXiv preprint arXiv:2411.05285 (2024).

[12]

Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security. 1285–1298.

[13]

Will Epperson, Gagan Bansal, Victor C Dibia, Adam Fourney, Jack Gerrits, Erkang Zhu, and Saleema Amershi. 2025. Interactive debugging and steering of multi-agent ai systems. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–15.

[14]

Stan Franklin and Art Graesser. 1996. Is it an Agent, or just a Program?: A Taxonomy for Autonomous Agents. In International workshop on agent theories, architectures, and languages. Springer, 21–35.

[15]

Yu Ge, Linna Xie, Zhong Li, Yu Pei, and Tian Zhang. 2025. Who is introducing the failure? automatically attributing failures of multi-agent systems via spectrum analysis. arXiv preprint arXiv:2509.13782 (2025).

[16]

Haixuan Guo, Shuhan Yuan, and Xintao Wu. 2021. Logbert: Log anomaly detection via bert. In 2021 international joint conference on neural networks (IJCNN). IEEE, 1–8.

[17]

Shanshan Han, Qifan Zhang, Weizhao Jin, and Zhaozhuo Xu. 2024. LLM multi-agent systems: Challenges and open problems. arXiv preprint arXiv:2402.03578 (2024).

[18]

Shilin He, Jieming Zhu, Pinjia He, and Michael R Lyu. 2016. Experience report: System log analysis for anomaly detection. In 2016 IEEE 27th international symposium on software reliability engineering (ISSRE). IEEE, 207–218.

[19]

Samuel Holt, Max Ruiz Luyten, and Mihaela van der Schaar. [n. d.]. L2MAC: Large Language Model Automatic Computer for Extensive Code Generation. In The Twelfth International Conference on Learning Representations.

[20]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations.

[22]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79.

[23]

Li Hu, Guoqiang Chen, Xiuwei Shang, Shaoyin Cheng, Benlong Wu, LiGangyang LiGangyang, Xu Zhu, Weiming Zhang, and Nenghai Yu. 2025. CompileAgent: Automated real-world repo-level compilation with tool-integrated LLM-based agent system. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2078–2091.

[24]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 (2023).

[25]

James A Jones and Mary Jean Harrold. 2005. Empirical evaluation of the tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering. 273–282.

[26]

Satyadhar Joshi. 2025. LLMOps, AgentOps, and MLOps for Generative AI: A Comprehensive Review. (2025).

[27]

Max Landauer, Sebastian Onder, Florian Skopik, and Markus Wurzenberger. 2023. Deep learning for anomaly detection in log data: A survey. Machine Learning with Applications 12 (2023), 100470.

[28]

J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. biometrics (1977), 159–174.

[29]

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large language model society. Advances in neural information processing systems 36 (2023), 51991–52008.

[30]

Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2024. Large language model-based agents for software engineering: A survey. ACM Transactions on Software Engineering and Methodology (2024).

[31]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics 12 (2024), 157–173.

[32]

Tianyang Liu, Canwen Xu, and Julian McAuley. [n. d.]. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. In The Twelfth International Conference on Learning Representations.

[33]

Ruofan Lu, Yichen Li, and Yintong Huo. 2025. Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 3856–3860.

[34]

Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. 2025. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460 (2025).

[35]

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2023. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations.

[36]

Niels Mündler, Mark N Müller, Jingxuan He, and Martin Vechev. 2024. Swt-bench: Testing and validating real-world bug-fixes with code agents. Advances in Neural Information Processing Systems 37 (2024), 81857–81887.

[37]

Eugene W Myers. 1986. An O(ND) difference algorithm and its variations. Algorithmica 1, 1 (1986), 251–266.

[38]

OpenAI. 2023. OpenAI API. https://openai.com/blog/openai-api [Online; accessed 1 Aug 2023]

[39]

Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2025. An empirical study of the non-determinism of chatgpt in code generation. ACM Transactions on Software Engineering and Methodology 34, 2 (2025), 1–28.

[40]

Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, et al. 2025. Why do multiagent systems fail?. In ICLR 2025 Workshop on Building Trust in Language Models and Applications.

[41]

Chris Parnin and Alessandro Orso. 2011. Are automated debugging techniques actually helping programmers?. In Proceedings of the 2011 international symposium on software testing and analysis. 199–209.

[42]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. 2024. Chatdev: Communicative agents for software development. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 15174–15186.

[43]

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789 (2023).

[44]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems 36 (2023), 68539–68551.

[45]

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning. PMLR, 31210–31227.

[46]

Yuan Tian and Tianyi Zhang. 2025. Selective Prompt Anchoring for Code Generation. In International Conference on Machine Learning. PMLR, 59528–59551.

[47]

Maria Trofimova, Anton Shevtsov, Badertdinov Ibragim, Konstantin Pyaev, Simon Karasik, and Alexander Golubev. 2025. OpenHands Trajectories with Qwen3-Coder-480B-A35B-Instruct. Nebius blog (2025).

[48]

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741 (2024).

[49]

Yanlin Wang, Wanjun Zhong, Yanxian Huang, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, and Zibin Zheng. 2025. Agents in software engineering: Survey, landscape, and vision. Automated Software Engineering 32, 2 (2025), 70.

[50]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.

[51]

Mark Weiser. 1984. Program slicing. IEEE Transactions on software engineering 4 (1984), 352–357.

[52]

W Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A survey on software fault localization. IEEE Transactions on Software Engineering 42, 8 (2016), 707–740.

[53]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First conference on language modeling.

[54]

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.

[55]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.

[56]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations.

[57]

Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. 2025. AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? arXiv preprint arXiv:2509.03312 (2025).

[58]

Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. 2025. Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=GazlTYxZss

[59]

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604.

[60]

Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, and Beidi Chen. 2025. GSM: How Do your LLMs Behave over Infinitely Increasing Reasoning Complexity and Context Length?. In Forty-second International Conference on Machine Learning.