pith. machine review for the scientific record.

arxiv: 2605.07509 · v2 · submitted 2026-05-08 · 💻 cs.SE

Recognition: 2 theorem links · Lean Theorem

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:58 UTC · model grok-4.3

classification 💻 cs.SE
keywords failure attribution · multi-agent systems · small language models · prefill signals · negative log-likelihood · attention weights · LLM debugging · execution traces

The pith

MASPrism attributes failures in multi-agent LLM systems by reading prefill-stage signals from a small language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that failure attribution in long multi-agent execution traces can be done accurately and quickly without replay, synthetic training data, or full decoding. It extracts token-level negative log-likelihood and attention weights during a single prefill pass over the trace with a small language model to flag symptom-like steps and earlier candidate sources. A second prefill pass on a reconstructed prompt then ranks the most likely root causes. If this holds, diagnosis becomes lightweight enough to apply to every trace in real time instead of relying on expensive agent workflows. The reported results indicate strong accuracy gains over baselines and even some large proprietary models on the tested benchmarks, all while finishing each trace in seconds with zero output tokens.

Core claim

MASPrism extracts token-level negative log-likelihood and attention weights during a prefill pass of a small language model over the full execution trace to identify symptom-like steps and earlier candidate sources without any decoding. It then reconstructs a focused diagnostic prompt and runs a second prefill pass to rank the failure-source candidates. This two-pass process requires no output generation and runs in an average of 2.66 seconds per trace.
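As a concrete illustration of what a prefill-only pass can expose, here is a minimal sketch using the Hugging Face transformers API with Qwen3-0.6B, the SLM named in the paper. It extracts token-level NLL and a per-token attention-received score in a single forward pass with no decoding; the last-layer, head-averaged attention pooling and the eager attention setting are assumptions for illustration, not necessarily the paper's exact implementation.

```python
# Minimal sketch of a prefill-only signal pass, assuming the Hugging Face
# transformers API and Qwen3-0.6B (the SLM named in the paper). The last-layer,
# head-averaged attention pooling is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-0.6B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, attn_implementation="eager")
model.eval()

def prefill_signals(trace_text: str):
    """One forward (prefill) pass over the full trace: no decoding, zero output tokens."""
    enc = tokenizer(trace_text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    ids = enc["input_ids"][0]                                  # [seq_len]
    # Token-level NLL: negative log-probability of each token given its prefix.
    log_probs = torch.log_softmax(out.logits[0, :-1], dim=-1)  # [seq_len - 1, vocab]
    nll = -log_probs.gather(1, ids[1:].unsqueeze(-1)).squeeze(-1)
    # Attention each token *receives*, averaged over heads in the last layer.
    attn = out.attentions[-1][0].mean(dim=0)                   # [seq_len, seq_len]
    attn_received = attn.sum(dim=0)                            # sum over query positions
    return ids, nll, attn_received
```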

What carries the argument

Prefill-stage token-level negative log-likelihood and attention weights from a small language model, used first to surface symptom steps and candidate sources, then to rank root causes in a second, focused prefill.
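A minimal sketch of how such signals could be aggregated from tokens to steps and turned into a ranked candidate list and a second-pass prompt is shown below. The combination rule (mean per-step NLL plus mean attention received), the argmax symptom pick, and the prompt template are illustrative assumptions, not the paper's formulation.

```python
# Hypothetical step-level aggregation on top of the prefill signals sketched above.
# `step_spans` maps each trace step to a (start, end) token range; the one-token
# offset between nll and the token axis is ignored here for brevity.
import torch

def rank_candidates(nll, attn_received, step_spans, top_k=5):
    # Steps the SLM finds most "surprising" (high mean NLL) act as symptom candidates.
    step_nll = torch.tensor([nll[s:e].mean() for s, e in step_spans])
    symptom = int(step_nll.argmax())
    # Earlier steps that draw unusually high attention are candidate root causes.
    step_attn = torch.tensor([attn_received[s:e].mean() for s, e in step_spans])
    earlier = list(range(symptom))          # a root cause must precede the symptom
    ranked = sorted(earlier,
                    key=lambda i: (step_attn[i] + step_nll[i]).item(),
                    reverse=True)
    return symptom, ranked[:top_k]

def build_diagnostic_prompt(trace_steps, symptom, candidates):
    # The second prefill pass runs over a reconstructed, focused prompt such as this one.
    kept = sorted(set(candidates) | {symptom})
    body = "\n".join(f"[Step {i}] {trace_steps[i]}" for i in kept)
    return "Rank the candidate steps most likely to be the root cause:\n" + body
```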

If this is right

  • Failure attribution no longer requires replaying executions or training on synthetic failure logs.
  • Each trace can be diagnosed in seconds rather than requiring full generation or multi-step agent workflows.
  • Accuracy remains competitive with, or superior to, larger models on the Who&When and TRAIL benchmarks.
  • The method produces zero output tokens while still delivering ranked failure sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar prefill signals might support debugging of success paths or other properties in sequential agent traces.
  • The approach could extend to single-agent chains or other long-horizon processes where causal evidence is delayed.
  • If prefill signals encode sufficient causal structure, lightweight models might replace heavier attribution pipelines in production monitoring.

Load-bearing premise

Token-level negative log-likelihood and attention weights extracted during one prefill pass of a small language model are reliable enough to surface both symptom steps and earlier root-cause candidates in long multi-agent traces without full decoding or task-specific training.
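A direct way to stress this premise is to measure, over a labeled benchmark, where the annotated failure step lands when steps are ordered by the prefill-derived score alone. The sketch below assumes per-trace step scores and gold failure indices are already available; the `score_steps` helper is hypothetical and stands in for any scoring rule built on the first-pass signals.

```python
# Hedged premise check: over a labeled benchmark, where does the annotated failure
# step land when steps are ordered by the prefill-derived score alone?
# `score_steps(trace)` is a hypothetical helper returning one score per step.
def premise_check(traces, gold_indices, score_steps, k=5):
    ranks, top1, topk = [], 0, 0
    for trace, gold in zip(traces, gold_indices):
        scores = score_steps(trace)
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        rank = order.index(gold) + 1            # rank of the true failure step
        ranks.append(rank)
        top1 += int(rank == 1)
        topk += int(rank <= k)
    n = len(ranks)
    return {"top1": top1 / n, f"top{k}": topk / n, "mean_rank": sum(ranks) / n}
```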

What would settle it

A controlled test set of multi-agent traces containing deliberately injected single-point failures at known locations, where the method fails to rank the injected root cause among the top candidates.
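As a rough illustration of such a controlled test, the sketch below injects a synthetic error into one step of an otherwise clean trace and checks whether the attribution pipeline ranks the injected step among its top-k candidates. The `attribute` callable and the string-level corruption are assumptions; the paper does not specify an injection protocol.

```python
# Hedged sketch of the controlled test described above: corrupt one known step of an
# otherwise clean trace and check whether the pipeline ranks it among the top-k
# candidates. `attribute(steps)` stands in for the full two-pass method.
import random

def inject_failure(trace_steps, seed=0):
    rng = random.Random(seed)
    idx = rng.randrange(1, len(trace_steps) - 1)       # assumes traces with >= 3 steps
    corrupted = list(trace_steps)
    corrupted[idx] += "\n[Injected error] tool returned a stale result for the wrong entity."
    return corrupted, idx

def injection_test(clean_traces, attribute, k=5):
    hits = 0
    for steps in clean_traces:
        corrupted, idx = inject_failure(steps)
        ranked = attribute(corrupted)                   # list of step indices, best first
        hits += int(idx in ranked[:k])
    return hits / len(clean_traces)                     # top-k hit rate on injected faults
```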

Figures

Figures reproduced from arXiv:2605.07509 by Hongjiang Feng, Junsong Pu, Yang Liu, and Zhuangbin Chen.

Figure 1. Top-5 hit rates of direct NLL ranking and attention.
Figure 2. Overview of the MASPrism framework.
Figure 3. Prompt prepended to the truncated trace in the …
Figure 4. Prompt prepended to the reconstructed trace in the …
Figure 5. Attention from symptom steps on Who&When.
read the original abstract

Failure attribution in LLM-based multi-agent systems aims to identify the steps that contribute to a failed execution. This task remains difficult because a single execution can contain many agent actions and tool calls, failure evidence can appear many steps after the original mistake, and existing methods often rely on costly agent workflows, replay, or training on synthetic failure logs. To address these challenges, we propose MASPrism, a lightweight framework that performs failure attribution using prefill-stage signals from a small language model (SLM). MASPrism first extracts token-level negative log-likelihood and attention weights during a prefill pass to identify symptom-like steps and earlier candidate sources, without decoding. It then reconstructs a focused diagnostic prompt and performs a second prefill pass to rank failure-source candidates. Using Qwen3-0.6B as the SLM, MASPrism achieves the best performance on three of the four evaluated subsets across Who&When and TRAIL, improving Top-1 accuracy on Who&When-HC by 33.41% over the best baseline. On TRAIL, MASPrism outperforms strong proprietary LLMs, including Gemini-2.5-Pro, with up to 89.50% relative improvement. MASPrism processes each trace in 2.66 seconds on average, achieving a 6.69× speedup over the single-pass prompting baseline, with zero output tokens. These results show that MASPrism provides an effective and practical framework for failure attribution in long multi-agent execution logs.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MASPrism, a lightweight framework for failure attribution in LLM-based multi-agent systems. It extracts token-level negative log-likelihood and attention weights during a single prefill pass of a small language model (Qwen3-0.6B) to identify symptom steps and earlier candidate sources without decoding or training, then performs a second focused prefill to rank candidates. On Who&When and TRAIL benchmarks, it claims best performance on three of four subsets, with 33.41% Top-1 accuracy lift on Who&When-HC over the best baseline, up to 89.50% relative improvement on TRAIL versus proprietary LLMs like Gemini-2.5-Pro, and 6.69× speedup (2.66s per trace) with zero output tokens.

Significance. If the empirical results hold under rigorous controls, MASPrism would represent a practical advance for efficient debugging of long multi-agent traces by avoiding full decoding, replay, or task-specific training. The reliance on standard prefill signals from an off-the-shelf 0.6B model could enable low-overhead integration into existing MAS pipelines. The reported speedups and outperformance of larger models on TRAIL are potentially impactful for production settings, provided the method generalizes beyond the evaluated traces.

major comments (3)
  1. [Abstract / §4] Abstract and §4 (Evaluation): The central empirical claims report concrete gains such as 33.41% Top-1 accuracy improvement on Who&When-HC and up to 89.50% relative improvement on TRAIL, yet provide no details on baseline implementations, train/test splits, number of runs, or statistical significance testing. This leaves the performance numbers only moderately supported and prevents independent verification of the claimed superiority.
  2. [§3] §3 (Method): The two-pass procedure depends on the first prefill pass using NLL and attention weights to surface symptom steps and root-cause candidates in long traces. No ablation isolates the quality of this candidate set, nor is there direct evidence (e.g., correlation plots or per-trace analysis) linking the extracted signals to ground-truth failure locations, despite known concentration of attention on recent tokens in small models.
  3. [§4] §4 (Evaluation): The manuscript does not report whether results are averaged over multiple random seeds or data partitions, nor does it include controls for post-hoc hyperparameter choices in the candidate-ranking step. These omissions are load-bearing for the claim that prefill signals alone suffice for reliable attribution without training or replay.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'with zero output tokens' should clarify whether this applies only to the first pass or the entire two-pass procedure, as the second diagnostic prefill necessarily consumes input tokens.
  2. [§2] §2 (Related Work): The comparison to prior failure-attribution methods could include a brief table summarizing their computational requirements (e.g., number of LLM calls or training data needs) to better contextualize the claimed 6.69× speedup.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns about empirical rigor, providing additional implementation details, ablations, statistical reporting, and controls. Below we respond to each major comment point-by-point.

read point-by-point responses
  1. Referee: [Abstract / §4] Abstract and §4 (Evaluation): The central empirical claims report concrete gains such as 33.41% Top-1 accuracy improvement on Who&When-HC and up to 89.50% relative improvement on TRAIL, yet provide no details on baseline implementations, train/test splits, number of runs, or statistical significance testing. This leaves the performance numbers only moderately supported and prevents independent verification of the claimed superiority.

    Authors: We agree that these details are necessary for verification. In the revised manuscript we have expanded §4.1 with complete baseline implementation descriptions (including exact prompts, model versions, and decoding parameters), clarified that evaluations are strictly zero-shot on the full benchmark test sets with no training or custom splits, reported all metrics as means over 5 independent runs with different random seeds (including standard deviations in Tables 1–3), and added statistical significance results using paired Wilcoxon signed-rank tests with p-values. These additions directly support the reported gains. revision: yes

  2. Referee: [§3] §3 (Method): The two-pass procedure depends on the first prefill pass using NLL and attention weights to surface symptom steps and root-cause candidates in long traces. No ablation isolates the quality of this candidate set, nor is there direct evidence (e.g., correlation plots or per-trace analysis) linking the extracted signals to ground-truth failure locations, despite known concentration of attention on recent tokens in small models.

    Authors: We have added a new §3.4 ablation that isolates the first-pass candidate set by comparing it against random sampling and recency-based heuristics, showing that the NLL+attention signals contribute measurable gains. We also include Appendix D with Pearson correlation plots between the extracted signals and ground-truth failure positions across all traces, plus per-trace case studies. While small models do exhibit recency bias, the symptom-step identification step in our pipeline mitigates this by prioritizing high-NLL tokens before candidate ranking; the new analysis quantifies the residual effect. revision: yes

  3. Referee: [§4] §4 (Evaluation): The manuscript does not report whether results are averaged over multiple random seeds or data partitions, nor does it include controls for post-hoc hyperparameter choices in the candidate-ranking step. These omissions are load-bearing for the claim that prefill signals alone suffice for reliable attribution without training or replay.

    Authors: We have updated §4 to state that all results are averaged over 5 random seeds with standard deviations now shown. For the candidate-ranking hyperparameters (NLL/attention weighting coefficients and top-k thresholds), we added §4.2 explaining that they were selected once on a small held-out validation set of 20 traces drawn from the same benchmarks and then frozen for all test evaluations; no post-hoc adjustment on test data occurred. This protocol ensures the method relies only on prefill signals without task-specific training or replay. revision: yes
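Response 1 above cites paired Wilcoxon signed-rank tests. The sketch below shows one minimal form such a test could take, pairing a per-trace metric (here, the reciprocal rank of the annotated failure step, an assumed choice rather than the paper's) between MASPrism and a baseline evaluated on the same traces.

```python
# Hedged sketch of the paired significance test cited in response 1. The per-trace
# metric (reciprocal rank of the annotated failure step) is an assumed choice; the
# pairing requires both methods to score the same traces in the same order.
from scipy.stats import wilcoxon

def paired_significance(rr_masprism, rr_baseline):
    stat, p_value = wilcoxon(rr_masprism, rr_baseline,
                             zero_method="pratt", alternative="greater")
    return stat, p_value    # a small p-value supports MASPrism > baseline on paired traces

# Illustrative call with toy numbers (not results from the paper):
# paired_significance([1.0, 0.5, 1.0, 0.33], [0.5, 0.25, 1.0, 0.2])
```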

Circularity Check

0 steps flagged

No circularity: performance claims rest on external benchmark evaluation of standard prefill signals

full rationale

The paper presents MASPrism as a two-pass heuristic that directly extracts token-level negative log-likelihood and attention weights from a single prefill pass of Qwen3-0.6B to surface symptom steps and candidate sources, followed by a second diagnostic prefill on a reconstructed prompt. No equations, fitted parameters, or self-referential definitions appear in the derivation; the reported Top-1 accuracy gains (e.g., 33.41% on Who&When-HC) and speedups are obtained by comparing outputs against independent external benchmarks (Who&When, TRAIL) rather than quantities defined by the method itself. The approach is therefore self-contained as an empirical procedure whose validity can be assessed externally without reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the approach rests on standard prefill computations already present in any transformer language model and on empirical benchmark comparisons.

pith-pipeline@v0.9.0 · 5581 in / 1302 out tokens · 54939 ms · 2026-05-15T05:58:05.520923+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    MASPrism first extracts token-level negative log-likelihood and attention weights during a prefill pass to identify symptom-like steps and earlier candidate sources, without decoding.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
