pith. machine review for the scientific record.

arxiv: 2604.12108 · v1 · submitted 2026-04-13 · 💻 cs.SE · cs.AI

Recognition: no theorem link

LLM-Based Automated Diagnosis Of Integration Test Failures At Google

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:29 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM · integration testing · failure diagnosis · root cause analysis · software testing · log analysis · automated debugging · code review tools

The pith

An LLM-based tool identifies root causes of integration test failures with 90 percent accuracy at Google.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large language models can analyze unstructured and heterogeneous logs from integration tests to produce concise summaries and accurate root cause diagnoses. This matters because developers consistently report spending far more time on these failures than on unit tests, creating high cognitive load and slow debugging cycles in large systems. By embedding the tool directly into the existing code review workflow, the approach delivers assistance at the moment it is needed rather than as a separate step. Evaluation on dozens of real cases and usage across tens of thousands of failures shows that developers accept the output and find it helpful in the great majority of instances.

Core claim

Auto-Diagnose processes failure logs with an LLM to extract the most relevant lines, generate a short summary, and state the likely root cause. On a manual review of 71 real-world integration test failures, the diagnoses matched human judgment 90.14 percent of the time. After full deployment, the system ran on 52,635 distinct failing tests; users marked it not helpful in only 5.8 percent of cases and ranked it fourteenth in helpfulness among 370 tools that post findings in Critique.
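
As a quick sanity check, the reported rates translate into approximate counts as follows; the counts below are inferred from the percentages, not quoted from the paper.

```python
# Back-of-the-envelope conversion of the reported rates into counts.
# Counts are inferred from the percentages above, not stated in the paper.
manually_reviewed = 71
accuracy = 0.9014
print(round(manually_reviewed * accuracy))          # -> 64 of 71 diagnoses judged correct

deployed_failures = 52_635
not_helpful_rate = 0.058
print(round(deployed_failures * not_helpful_rate))  # -> ~3,053 findings marked "Not helpful"
```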

What carries the argument

Auto-Diagnose, an LLM pipeline that turns raw integration-test logs into concise summaries and root-cause statements inside the Critique code-review interface.
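
The review does not reproduce the tool's prompt or serving code, so the following is only a minimal sketch of the general pattern named above: trim the log, fill a prompt template, call an LLM, and parse a summary plus a root-cause statement. The prompt wording, the truncation limit, and the injected `call_llm` client are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the log-to-diagnosis pattern, not Auto-Diagnose itself.
# Prompt wording, truncation limit, and the injected `call_llm` client are assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Diagnosis:
    summary: str
    root_cause: str

PROMPT_TEMPLATE = (
    "A developer's integration test failed.\n"
    "Failure log (tail only):\n{log}\n\n"
    "Reply with exactly two lines:\n"
    "SUMMARY: <two or three sentences describing the failure>\n"
    "ROOT_CAUSE: <the most likely root cause>\n"
)

def log_tail(raw_log: str, max_chars: int = 30_000) -> str:
    """Keep only the end of the log so the prompt fits in the model's context window."""
    return raw_log[-max_chars:]

def diagnose(raw_log: str, call_llm: Callable[[str], str]) -> Diagnosis:
    """Build the prompt, call the injected LLM client, and parse the two labeled lines."""
    reply = call_llm(PROMPT_TEMPLATE.format(log=log_tail(raw_log)))
    fields = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().upper()] = value.strip()
    return Diagnosis(fields.get("SUMMARY", ""), fields.get("ROOT_CAUSE", ""))

# Usage with a stubbed model in place of a real LLM call:
stub = lambda prompt: ("SUMMARY: The SUT never became healthy and RPCs timed out.\n"
                       "ROOT_CAUSE: A backend dependency failed to start within its deadline.")
print(diagnose("...many log lines...\nDEADLINE_EXCEEDED while waiting for backend", stub))
```

In the deployed system the resulting text is posted as a finding inside Critique alongside other analyzers; the sketch stops at producing the diagnosis.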

If this is right

  • Developers receive immediate, low-effort assistance while reviewing changes that trigger integration failures.
  • The fraction of time spent interpreting logs drops for the subset of failures the tool handles correctly.
  • Accuracy becomes the main driver of whether developers continue to use or ignore the assistance.
  • LLMs prove capable of extracting signal from the high-volume, low-signal text that integration testing produces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same log-summarization pattern could be applied to other classes of failures whose logs are currently too large for quick human inspection.
  • Over repeated use the tool may gradually change what information developers expect to see first when a test fails.
  • Organizations with similar scale and testing volume could adopt comparable embeddings without rebuilding their review systems from scratch.

Load-bearing premise

The 71 manually checked failures are typical of all integration test failures, and human judges can reliably determine the true root cause from the same logs the model receives.

What would settle it

A follow-up study that presents the same set of new failures to both the LLM tool and to independent developers who have full access to source code and environment details, then measures agreement between the two diagnoses.
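
If such a study were run, agreement could be summarized with raw agreement plus a chance-corrected statistic. The sketch below assumes both diagnoses are first mapped onto a shared set of root-cause categories; the categories and data are illustrative, and the paper does not prescribe this protocol.

```python
# Sketch of the proposed agreement measurement: compare the tool's diagnosis with an
# independent developer's diagnosis for the same failures. Labels are illustrative.
from collections import Counter

def agreement_and_kappa(tool: list[str], human: list[str]) -> tuple[float, float]:
    """Raw agreement and Cohen's kappa for two raters over the same failures."""
    assert len(tool) == len(human) and tool
    n = len(tool)
    observed = sum(t == h for t, h in zip(tool, human)) / n
    # Chance agreement from each rater's marginal label frequencies.
    pt, ph = Counter(tool), Counter(human)
    expected = sum(pt[c] * ph[c] for c in set(tool) | set(human)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Illustrative labels for five failures (broad root-cause categories).
tool_labels  = ["flaky-dep", "bad-config", "code-bug", "code-bug", "timeout"]
human_labels = ["flaky-dep", "bad-config", "code-bug", "timeout",  "timeout"]
print(agreement_and_kappa(tool_labels, human_labels))  # -> (0.8, kappa ~= 0.74)
```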

Figures

Figures reproduced from arXiv: 2604.12108 by Celal Ziftci, Livio Dalloro, Ray Liu, Spencer Greene.

Figure 1. Differences in the encounter and diagnosis of unit …
Figure 3. Integration test failure information surfaced in Critique.
Figure 4. Test driver and the system under test (SUT) that …
Figure 5. System overview of automatically generating findings for integration test failure diagnosis with …
Figure 6. The LLM-based diagnosis result posted as a finding …
Figure 3. Typically, a reviewer would click on "Please fix" …
Figure 7. The prompt template used to construct the prompt sent to the LLM.
Figure 8. Excerpts of feedback on Auto-Diagnose Critique findings from interviews conducted with 11 participants.
read the original abstract

Integration testing is critical for the quality and reliability of complex software systems. However, diagnosing their failures presents significant challenges due to the massive volume, unstructured nature, and heterogeneity of logs they generate. These result in a high cognitive load, low signal-to-noise ratio, and make diagnosis difficult and time-consuming. Developers complain about these difficulties consistently and report spending substantially more time diagnosing integration test failures compared to unit test failures. To address these shortcomings, we introduce Auto-Diagnose, a novel diagnosis tool that leverages LLMs to help developers efficiently determine the root cause of integration test failures. Auto-Diagnose analyzes failure logs, produces concise summaries with the most relevant log lines, and is integrated into Critique, Google's internal code review system, providing contextual and in-time assistance. Based on our case studies, Auto-Diagnose is highly effective. A manual evaluation conducted on 71 real-world failures demonstrated 90.14% accuracy in diagnosing the root cause. Following its Google-wide deployment, Auto-Diagnose was used across 52, 635 distinct failing tests. User feedback indicated that the tool was deemed "Not helpful" in only 5.8% of cases, and it was ranked #14 in helpfulness among 370 tools that post findings in Critique. Finally, user interviews confirmed the perceived usefulness of Auto-Diagnose and positive reception of integrating automatic diagnostic assistance into existing workflows. We conclude that LLMs are highly successful in diagnosing integration test failures due to their capacity to process and summarize complex textual data. Integrating such AI-powered tooling automatically into developers' daily workflows is perceived positively, with the tool's accuracy remaining a critical factor in shaping developer perception and adoption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Auto-Diagnose, an LLM-based tool integrated into Google's internal Critique code review system that analyzes integration test failure logs, extracts relevant lines, and generates concise root-cause summaries. It reports a manual evaluation showing 90.14% accuracy on 71 real-world failures, Google-wide deployment across 52,635 distinct failing tests, user feedback with only 5.8% 'Not helpful' ratings, a #14 helpfulness ranking among 370 tools, and positive interview feedback on workflow integration.

Significance. If the accuracy claim holds under rigorous validation, the work supplies concrete industrial evidence that LLMs can reduce cognitive load in diagnosing heterogeneous, high-volume integration test logs. Strengths include the scale of deployment data, direct integration into an existing developer tool, and collection of both quantitative usage metrics and qualitative user perceptions, which together offer a rare end-to-end view of AI tooling adoption inside a large organization.

major comments (2)
  1. [Abstract] The headline 90.14% accuracy rests on a manual evaluation of 71 failures, yet the manuscript supplies no sampling protocol for selecting those cases, no description of how ground truth was established, no inter-rater reliability statistic, and no blinding procedure. Because the same logs are available to both the LLM and the human judges, the metric risks circularity or inflation if judges routinely consult unlogged context (prior commits, test history, or code inspection). This detail is load-bearing for the central effectiveness claim.
  2. [Abstract and evaluation sections] No baseline comparison (e.g., keyword search, simple log heuristics, or non-LLM ML classifiers) is reported against which the LLM's incremental benefit can be measured. Without such controls, it is impossible to determine whether the observed accuracy and helpfulness ratings exceed what simpler methods already achieve on the same failure population.
minor comments (1)
  1. [Abstract] The deployment figure is written as '52, 635' with an extraneous space; standardize to '52,635'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity and transparency on the evaluation methodology.

read point-by-point responses
  1. Referee: [Abstract] The headline 90.14% accuracy rests on a manual evaluation of 71 failures, yet the manuscript supplies no sampling protocol for selecting those cases, no description of how ground truth was established, no inter-rater reliability statistic, and no blinding procedure. Because the same logs are available to both the LLM and the human judges, the metric risks circularity or inflation if judges routinely consult unlogged context (prior commits, test history, or code inspection). This detail is load-bearing for the central effectiveness claim.

    Authors: We agree that the current description of the manual evaluation lacks sufficient methodological detail. In the revised manuscript we will add a dedicated subsection under Evaluation that specifies: the sampling protocol (random selection of 71 failures from those occurring during the study period), ground-truth establishment (independent root-cause identification by two authors with extensive experience in the relevant codebases, followed by discussion to resolve disagreements), inter-rater reliability (we will compute and report Cohen’s kappa), and blinding steps (judges first assessed the logs without seeing the LLM output). We will also explicitly state that judges were instructed to rely solely on the provided failure logs and not to consult unlogged context such as commit history. These additions will directly address concerns about circularity and strengthen the validity of the reported accuracy. revision: yes

  2. Referee: [Abstract and evaluation sections] No baseline comparison (e.g., keyword search, simple log heuristics, or non-LLM ML classifiers) is reported against which the LLM's incremental benefit can be measured. Without such controls, it is impossible to determine whether the observed accuracy and helpfulness ratings exceed what simpler methods already achieve on the same failure population.

    Authors: We acknowledge that a baseline comparison would help quantify the LLM’s added value. The manuscript’s primary contribution, however, lies in the large-scale industrial deployment and user-adoption metrics rather than a controlled ablation study. Performing a full non-LLM baseline experiment on the identical 71 cases would require substantial additional effort. In the revision we will insert a discussion paragraph explaining why keyword-based or simple heuristic approaches are inadequate for the heterogeneous, high-volume integration-test logs we target, and we will list a comparative baseline evaluation as future work. This provides context for the results without overstating incremental benefit. revision: partial
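
For concreteness, the kind of keyword heuristic the referee and the response are discussing might look like the sketch below. The patterns and canned root causes are illustrative assumptions, not drawn from the paper; rules of this sort are exactly what tends to break on heterogeneous, high-volume integration-test logs.

```python
# Illustrative keyword-heuristic baseline of the kind proposed as a comparison point.
# Patterns and canned causes are assumptions for illustration, not from the paper.
import re

HEURISTICS = [
    (re.compile(r"DEADLINE_EXCEEDED|timed? ?out", re.I), "timeout while waiting on a dependency"),
    (re.compile(r"connection refused|UNAVAILABLE", re.I), "dependent service not reachable"),
    (re.compile(r"assert(ion)? ?(failed|error)", re.I),   "test assertion failed"),
    (re.compile(r"OutOfMemory|OOM", re.I),                "process ran out of memory"),
]

def keyword_diagnosis(log: str) -> str:
    """Return the first matching canned root cause, scanning from the log tail upward."""
    for line in reversed(log.splitlines()):
        for pattern, cause in HEURISTICS:
            if pattern.search(line):
                return cause
    return "no known pattern matched"

print(keyword_diagnosis("...setup ok...\nRPC to backend: DEADLINE_EXCEEDED"))
```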

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of tool performance

full rationale

The paper presents an empirical description of Auto-Diagnose, an LLM-based tool, along with direct measurements of its accuracy (90.14% on 71 manually evaluated cases) and deployment outcomes (usage on 52,635 tests, 5.8% 'not helpful' rate). No equations, derivations, fitted parameters, or predictions are introduced that could reduce to the inputs by construction. Human judgment serves as the external ground truth for the accuracy metric, and deployment statistics are observational rather than model-derived. This satisfies the default expectation of no significant circularity for an applied systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical case study and deployment report; it introduces no free parameters, mathematical axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5608 in / 1237 out tokens · 44332 ms · 2026-05-10T15:29:19.764899+00:00 · methodology

