Towards Agentic Runtime Healing

Bowen Xu; David Lo; Haotian Zhu; Li Li; Xiaoning Du; Zhensu Sun

arXiv: 2408.01055 · v2 · submitted 2024-08-02 · 💻 cs.SE · cs.AI · cs.CR

Pith reviewed 2026-05-23 22:12 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CR

keywords runtime error recovery · large language models · self-healing systems · dynamic code generation · software resilience · agentic systems · error handling

The pith

Large language models can generate on-the-fly error handlers that recover from 72.8 percent of runtime errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using large language models to create tailored error-handling code in real time when runtime errors occur. It presents the Healer framework, which invokes an LLM to produce healing code based on the specific error message and program state, then executes that code to restore a working state. Experiments across four datasets and three models show GPT-4 succeeds in 72.8 percent of cases. This moves beyond fixed heuristic rules toward adaptive recovery. The work also flags remaining issues with code safety but suggests checks and special programming patterns as mitigations.

Core claim

We demonstrate the feasibility of this approach by designing such a framework, Healer, and empirically showing that it can handle runtime errors with a high success rate. When an unanticipated runtime error occurs, Healer leverages its internal LLM to generate bespoke error-handling code. The generated healing code is then executed to produce a corrected program state, allowing the program to continue execution with minimal disruption. GPT-4 can successfully recover from 72.8 percent of runtime errors.

What carries the argument

The Healer framework, which calls an internal LLM to generate and run custom error-handling code based on the runtime error and program state.
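
The loop described here can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `collect_context`, `stub_llm`, and `run_with_healing` are hypothetical names, the LLM call is replaced by a canned function, and a copied dictionary stands in for a real isolated environment.

```python
from typing import Any, Callable, Dict


def collect_context(exc: BaseException, state: Dict[str, Any], statement: str) -> str:
    """Build a prompt from the error message, the failing statement, and the program state."""
    state_repr = ", ".join(f"{name}={value!r}" for name, value in sorted(state.items()))
    return (
        f"Error: {type(exc).__name__}: {exc}\n"
        f"Failing statement: {statement}\n"
        f"Program state: {state_repr}\n"
        "Write Python code that repairs the state so execution can continue."
    )


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns canned healing code."""
    if "ZeroDivisionError" in prompt:
        return "result = 0.0"
    return "result = None"


def run_with_healing(
    statement: str,
    state: Dict[str, Any],
    llm: Callable[[str], str] = stub_llm,
) -> Dict[str, Any]:
    """Run one statement; on an exception, ask the LLM for healing code and apply it."""
    try:
        exec(statement, {}, state)
    except Exception as exc:
        healing_code = llm(collect_context(exc, state, statement))
        # Assumption: a copied dict is only a toy substitute for real isolation.
        scratch = dict(state)
        exec(healing_code, {}, scratch)
        state.update(scratch)
    return state


healed = run_with_healing("result = total / count", {"total": 10, "count": 0})
print(healed["result"])
```

In this toy run the division by zero is caught, the canned patch assigns a fallback value to `result`, and execution continues. Whether a fallback like this is semantically acceptable is exactly the trustworthiness question the paper leaves open.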

If this is right

Self-healing systems become able to address a wider variety of runtime errors than rule-based methods permit.
Programs can resume after errors with less need for manual intervention or predefined handlers.
LLM integration at runtime can support more adaptive and resilient software architectures.
Safety checks and specialized programming conventions become necessary to incorporate generated patches safely.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same LLM generation pattern could be tested on logical errors or performance degradations beyond crashes.
Reliable runtime healing might let developers write less defensive code upfront.
Combining the generated patches with static analysis tools could provide an extra layer of verification before execution.
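
As a rough illustration of the last point, a static gate could inspect the generated patch before it is executed. The sketch below uses Python's standard `ast` module. The function name `is_patch_allowed` and the denylist are assumptions made for this example, the denylist is deliberately incomplete, and a check of this kind is not a substitute for a sandbox.

```python
import ast

# Illustrative, incomplete denylist of builtins a healing patch should not call.
FORBIDDEN_CALLS = {"eval", "exec", "open", "compile", "__import__"}


def is_patch_allowed(code: str) -> bool:
    """Return False if the patch fails to parse or uses an obviously risky construct."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        # Reject any import statement.
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        # Reject direct calls to denylisted builtins.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                return False
        # Reject dunder attribute access, a common route around name-based checks.
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False
    return True


print(is_patch_allowed("result = 0.0"))
print(is_patch_allowed("import os\nos.remove('data.db')"))
```

A gate like this only filters syntactic patterns; it says nothing about whether an accepted patch restores a correct program state.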

Load-bearing premise

The trustworthiness of LLM-generated code can be managed sufficiently through safety checks and Healer-aware programming so that executing the generated patches does not introduce new errors or security issues.

What would settle it

A controlled test in which the generated healing code frequently fails to restore correct execution or introduces new errors or security problems would show the claimed recovery rates are not reliable.

Figures

Figures reproduced from arXiv: 2408.01055 by Bowen Xu, David Lo, Haotian Zhu, Li Li, Xiaoning Du, Zhensu Sun.

**Figure 1.** A motivating example of how Healer handles the runtime errors to recover the execution of the program.

**Figure 2.** The workflow of Healer. When a runtime error occurs, Healer collects the error context, prompts the LLM with the context, and generates the handling code. The handling code is then executed in an isolated environment to recover the program from the faulty state.

**Figure 3.** An example of the prompt construction workflow in …

Original abstract

Self-healing systems have long been a focus of research, aiming to enable software to recover from unexpected runtime errors without human intervention. Traditional approaches rely on predefined heuristic rules, such as reusing error handlers or rolling back to checkpoints, but these methods struggle to adapt to the diverse range of runtime errors. The emergence of Large Language Models offers a new opportunity to address this challenge. Leveraging their ability to understand and generate code and natural language, we propose using LLMs to dynamically generate error-handling strategies in real time, tailored to specific runtime contexts such as error messages and program states. We demonstrate the feasibility of this approach by designing such a framework, Healer, and empirically showing that it can handle runtime errors with a high success rate. When an unanticipated runtime error occurs, Healer leverages its internal LLM to generate bespoke error-handling code. The generated healing code is then executed to produce a corrected program state, allowing the program to continue execution with minimal disruption. We evaluate Healer across four code datasets and three state-of-the-art LLMs (GPT-3.5, GPT-4, and CodeQwen-7B), where GPT-4 can successfully recover from 72.8% of runtime errors, underscoring the promise of LLMs in this domain. Despite these promising results, challenges remain, particularly regarding the trustworthiness of LLM-generated code and its integration into existing systems. We mention potential solutions, such as safety checks and Healer-aware programming, to mitigate risks and ensure reliable operation. This work represents the first step toward agentic runtime healing, paving the way for more adaptive, resilient, and self-healing software systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches an LLM-driven approach to generating runtime patches on the fly, but the reported 72.8% success rate leaves open whether those patches create fresh errors or security holes.

The main takeaway is that this work tries to move self-healing from fixed rules and checkpoints to on-the-spot LLM code generation that reacts to the actual error and program state. The authors built a framework called Healer and ran it on four datasets with GPT-3.5, GPT-4, and CodeQwen-7B, getting the 72.8% figure with GPT-4. That combination of dynamic generation and reported recovery numbers is presented as new relative to earlier heuristic methods. The evaluation setup itself is a straightforward first cut at showing the idea can produce a corrected state in many cases. They deserve credit for running the experiment across multiple models and datasets rather than on a single toy example.

The soft spot is exactly the one the stress test points to: the abstract acknowledges the trustworthiness problem and lists safety checks as a future mitigation, yet the results only track whether the original error was cleared. No numbers appear on post-patch exceptions, static analysis of the generated code, or adversarial cases that would show whether new failures or vulnerabilities are common. Without that, the success rate is hard to interpret as evidence of practical feasibility.

The paper is aimed at researchers working on runtime reliability or LLM use in software engineering. A reader looking for early empirical signals on agentic healing ideas can extract the basic setup and the headline number, but anyone needing rigorous controls or security data will find the current version thin. It is worth sending to referees because the direction is distinct enough and the initial data is there; review would likely push for the missing post-fix checks and clearer methodology rather than reject outright.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Healer framework, which uses an LLM to generate and execute bespoke error-handling code at runtime when an unanticipated error occurs, allowing the program to continue from a corrected state. It evaluates the approach across four code datasets and three LLMs, reporting that GPT-4 recovers from 72.8% of runtime errors, and identifies trustworthiness of generated patches as a remaining challenge while suggesting safety checks and Healer-aware programming as mitigations.

Significance. If the evaluation methodology and post-patch safety claims can be substantiated, the work would provide a concrete demonstration of LLM-driven runtime recovery that goes beyond static heuristics, representing an early empirical step toward agentic self-healing systems. The explicit acknowledgment of the trustworthiness gap is a constructive element.

major comments (2)

[Evaluation section] The reported 72.8% success rate for GPT-4 is presented without any description of the evaluation methodology, definition of success (e.g., whether the original program resumes without further exceptions or merely that the healing code executes), error diversity across the four datasets, baselines, or statistical controls. This information is required to assess whether the number supports the feasibility claim.
[Abstract and Evaluation section] The central feasibility claim requires that executing LLM-generated patches does not introduce new runtime errors or security vulnerabilities, yet the evaluation reports only recovery success rates and supplies no post-patch error rates, static analysis results, or adversarial test outcomes. The abstract lists safety checks as a mitigation but provides no empirical evidence that they suffice.

minor comments (2)

[Abstract] The abstract and introduction could more precisely delimit the class of runtime errors considered (e.g., whether they include only exceptions or also logical errors and performance degradations).
[Evaluation section] No table or figure summarizes the per-dataset or per-LLM breakdown of the 72.8% figure; adding one would improve clarity of the empirical results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation methodology and safety considerations. These comments identify areas where the manuscript would benefit from greater clarity and detail. We address each point below and will revise the manuscript accordingly.

Referee: [Evaluation section] The reported 72.8% success rate for GPT-4 is presented without any description of the evaluation methodology, definition of success (e.g., whether the original program resumes without further exceptions or merely that the healing code executes), error diversity across the four datasets, baselines, or statistical controls. This information is required to assess whether the number supports the feasibility claim.

Authors: We agree that the Evaluation section requires expansion to substantiate the reported recovery rate. In the revised manuscript we will add: (1) an explicit definition of success (the original program resumes execution from the corrected state without raising further exceptions); (2) a breakdown of error types and their distribution across the four datasets; (3) any baseline comparisons performed; and (4) statistical details such as the number of trials and observed variance. These additions will directly address the request for methodological transparency.

Revision: yes

Referee: [Abstract and Evaluation section] The central feasibility claim requires that executing LLM-generated patches does not introduce new runtime errors or security vulnerabilities, yet the evaluation reports only recovery success rates and supplies no post-patch error rates, static analysis results, or adversarial test outcomes. The abstract lists safety checks as a mitigation but provides no empirical evidence that they suffice.

Authors: We concur that the feasibility claim would be strengthened by evidence on post-patch behavior. The current manuscript reports only recovery rates and flags trustworthiness as an open challenge while listing safety checks as a proposed mitigation without supporting measurements. In revision we will incorporate any post-healing error observations available from the existing experimental logs, add a dedicated limitations subsection on potential new errors or vulnerabilities, and expand the discussion of safety checks to clarify their current status as unvalidated proposals rather than demonstrated safeguards.

Revision: yes
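
The distinction at stake in this exchange (the healing code merely executes, versus the program resumes without further exceptions and yields the right result) can be made concrete with a small outcome classifier. The sketch below is an assumption-laden illustration: `classify_outcome`, the outcome labels, and the four toy cases are invented for this example and are not data or code from the paper.

```python
from collections import Counter
from typing import Any, Dict


def classify_outcome(
    healing_code: str,
    rest_of_program: str,
    state: Dict[str, Any],
    expected: Any,
) -> str:
    """Label one healing attempt by how far the program gets after the patch."""
    scratch = dict(state)
    try:
        exec(healing_code, {}, scratch)
    except Exception:
        return "patch_failed"   # the healing code itself raised
    try:
        exec(rest_of_program, {}, scratch)
    except Exception:
        return "new_error"      # patch ran, but the resumed program raised
    if scratch.get("output") != expected:
        return "wrong_result"   # program finished with an incorrect output
    return "recovered"


# Toy cases made up for illustration only: (patch, remaining program, state, expected output).
cases = [
    ("result = 0.0", "output = result + 1", {}, 1.0),
    ("result = 1 / 0", "output = result + 1", {}, 1.0),
    ("result = None", "output = result + 1", {}, 1.0),
    ("result = 5.0", "output = result + 1", {}, 1.0),
]
counts = Counter(classify_outcome(*case) for case in cases)
recovery_rate = counts["recovered"] / len(cases)
print(dict(counts), recovery_rate)
```

Reporting all four buckets rather than a single success rate would show how often a cleared error is followed by a new exception or an incorrect result.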

Circularity Check

0 steps flagged

No circularity: empirical success rates are measured directly, not derived or fitted.

full rationale

The paper proposes the Healer framework and reports measured recovery rates (e.g., 72.8% with GPT-4) from direct evaluation on four datasets and three LLMs. No equations, parameters, predictions, or first-principles derivations appear in the provided text. The central claim is an observed empirical outcome rather than a reduction to inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The evaluation is presented as a feasibility demonstration, making the result self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that current LLMs possess sufficient code-understanding and code-generation capability to produce correct runtime fixes; the Healer system itself is an invented artifact whose behavior is only characterized through the reported experiments.

axioms (1)

Domain assumption: Large language models can understand runtime error contexts and generate appropriate fixing code.
Invoked throughout the description of how Healer operates and in the interpretation of the 72.8% recovery result.

invented entities (1)

Healer framework (no independent evidence)
Purpose: Dynamically generate and execute bespoke error-handling code using an internal LLM.
New system introduced by the paper; no independent evidence outside the reported experiments is provided.


[1] [1]

CWE - CWE-248: Uncaught Exception (4.14)

2024. CWE - CWE-248: Uncaught Exception (4.14). https://cwe.mitre.org/data/ definitions/248.html Accessed: 2024-06-03

work page 2024

[2] [2]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

AlphaCode Team, Google DeepMind. 2023. AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind-media/AlphaCode2/ AlphaCode2_Tech_Report.pdf Accessed: 2024-05-23

work page 2023

[4] [4]

AtCoder. 2024. AtCoder. https://atcoder.jp/ Accessed: 2024-05-26

work page 2024

[5] [5]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

Algirdas Avizienis, J-C Laprie, Brian Randell, and Carl Landwehr. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE transactions on dependable and secure computing 1, 1 (2004), 11–33

work page 2004

[7] [7]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023 (2023)

[9]

George Candea, Emre Kiciman, Steve Zhang, Pedram Keyani, and Armando Fox. 2003. JAGR: An autonomous self-recovering application server. In 2003 Autonomic Computing Workshop. IEEE, 168–177

[10]

Antonio Carzaniga, Alessandra Gorla, Andrea Mattavelli, Nicolo Perino, and Mauro Pezze. 2013. Automatic recovery from runtime failures. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, 782–791

[11]

Antonio Carzaniga, Alessandra Gorla, Nicolò Perino, and Mauro Pezzè. 2010. Automatic workarounds for web applications. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering. 237–246

[12]

Hervé Chang, Leonardo Mariani, and Mauro Pezze. 2013. Exception handlers for healing component-based systems. ACM Transactions on Software Engineering and Methodology (TOSEM) 22, 4 (2013), 1–40

[13]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

[14]

Zakir Durumeric, Frank Li, James Kasten, Johanna Amann, Jethro Beekman, Mathias Payer, Nicolas Weaver, David Adrian, Vern Paxson, Michael Bailey, et al. 2014. The matter of heartbleed. In Proceedings of the 2014 conference on internet measurement conference. 475–488

[16]

EvalPlus. 2024. EvalPlus Releases. https://github.com/evalplus/evalplus/releases Accessed: 2024-05-26

[17]

Python Software Foundation. 2024. Python FAQ: How fast are exceptions? https://docs.python.org/3/faq/design.html#how-fast-are-exceptions Accessed: 2024-08-01

[18]

David Garlan and Bradley Schmerl. 2002. Model-based adaptation for self-healing systems. In Proceedings of the first workshop on Self-healing systems. 27–32

[19]

Georgi Gerganov. 2023. llama.cpp: Port of Facebook’s LLaMA model in C/C++. https://github.com/ggerganov/llama.cpp Accessed: 2024-05-30

[20]

Debanjan Ghosh, Raj Sharman, H Raghav Rao, and Shambhu Upadhyaya. 2007. Self-healing systems—survey and synthesis. Decision support systems 42, 4 (2007), 2164–2185

[21]

Tianxiao Gu, Chengnian Sun, Xiaoxing Ma, Jian Lü, and Zhendong Su. 2016. Automatic runtime recovery via error handler synthesis. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. 684–695

[22]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2023. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv:2308.10620 [cs.SE]

[23]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

[24]

Michael N Huhns, Vance T Holderfield, and Rosa Laura Zavala Gutierrez. 2003. Robust software via agent-based redundancy. In Proceedings of the second international joint conference on Autonomous agents and multiagent systems. 1018–1019

[25]

IBM. 2024. Project CodeNet. https://github.com/IBM/Project_CodeNet Accessed: 2024-05-26

[26]

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)

[27]

Aizu Online Judge. 2024. Aizu Online Judge Home. https://onlinejudge.u-aizu.ac.jp/home Accessed: 2024-05-26

[28]

Sungmin Kang, Gabin An, and Shin Yoo. 2023. A preliminary evaluation of llm-based fault localization. arXiv preprint arXiv:2308.05487 (2023)

[29]

Pavneet Singh Kochhar, Ferdian Thung, Nachiappan Nagappan, Thomas Zimmermann, and David Lo. 2015. Understanding the test automation culture of app developers. In 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 1–10

[30]

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023)

[31]

Tailin Liang, John Glossner, Lei Wang, Shaobo Shi, and Xiaotong Zhang. 2021. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 461 (2021), 370–403

[32]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2024)

[33]

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35

[34]

Fan Long, Vijay Ganesh, Michael Carbin, Stelios Sidiroglou, and Martin Rinard. 2012. Automatic input rectification. In 2012 34th International Conference on Software Engineering (ICSE). IEEE, 80–90

[36]

Frank D Macías-Escrivá, Rodolfo Haber, Raul Del Toro, and Vicente Hernandez. 2013. Self-adaptive systems: A survey of current approaches, research challenges and applications. Expert Systems with Applications 40, 18 (2013), 7267–7279

[38]

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022)

[39]

OpenAI. 2024. Fine-Tuning Integrations. https://platform.openai.com/docs/guides/fine-tuning/fine-tuning-integrations Accessed: 2024-05-26

[40]

OpenAI. 2024. GPT-3.5 Turbo Model Documentation. https://platform.openai.com/docs/models/gpt-3-5-turbo Accessed: 2024-05-21

[41]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744

[42]

Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the keyboard? assessing the security of github copilot’s code contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 754–768

[43]

William H Pierce. 2014. Failure-tolerant computer design. Academic Press

[44]

Harald Psaier and Schahram Dustdar. 2011. A survey on self-healing systems: approaches and systems. Computing 91 (2011), 43–73

[45]

Martin C Rinard. 2007. Living in the comfort zone. ACM SIGPLAN Notices 42, 10 (2007), 611–622

[46]

Martin C Rinard, Cristian Cadar, Daniel Dumitran, Daniel M Roy, Tudor Leu, and William S Beebee. 2004. Enhancing Server Availability and Security Through Failure-Oblivious Computing. In OSDI, Vol. 4. 21–21

[47]

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)

[48]

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207 (2021)

[50]

Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. 2023. In chatgpt we trust? measuring and characterizing the reliability of chatgpt. arXiv preprint arXiv:2304.08979 (2023)

[51]

Jieke Shi, Zhou Yang, Bowen Xu, Hong Jin Kang, and David Lo. 2022. Compressing pre-trained models of code into 3 mb. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12

[52]

Beatriz Souza and Michael Pradel. 2023. Lexecutor: Learning-guided execution. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1522–1534

[53]

Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2024. Debugbench: Evaluating debugging capability of large language models. arXiv preprint arXiv:2401.04621 (2024)

[54]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

[55]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)

[56]

Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922 (2023)

[57]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837

[58]

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120 (2023)

[59]

Westley Weimer and George C Necula. 2004. Finding and preventing run-time error handling mistakes. In Proceedings of the 19th annual ACM SIGPLAN Conference on Object-oriented programming, systems, languages, and applications. 419–431

[60]

Danny Weyns, Ilias Gerostathopoulos, Nadeem Abbas, Jesper Andersson, Stefan Biffl, Premek Brada, Tomas Bures, Amleto Di Salle, Patricia Lago, Angelika Musil, et al. 2022. Preliminary results of a survey on the use of self-adaptation in industry. In Proceedings of the 17th Symposium on Software Engineering for Adaptive and Self-Managing Systems. 70–76

[61]

Ratnadira Widyasari, Jia Wei Ang, Truong Giang Nguyen, Neil Sharma, and David Lo. 2024. Demystifying Faulty Code with LLM: Step-by-Step Reasoning for Explainable Fault Localization. arXiv preprint arXiv:2403.10507 (2024)

[62]

Chunqiu Steven Xia and Lingming Zhang. 2023. Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. arXiv preprint arXiv:2304.00385 (2023)

[63]

Zhou Yang, Jieke Shi, Junda He, and David Lo. 2022. Natural attack for pre-trained models of code. In Proceedings of the 44th International Conference on Software Engineering. 1482–1493

[64]

Zhou Yang, Zhensu Sun, Terry Zhuo Yue, Premkumar Devanbu, and David Lo. 2024. Robustness, security, privacy, explainability, efficiency, and usability of large language models for code. arXiv preprint arXiv:2403.07506 (2024)

[66]

Jian Zhou, Hongyu Zhang, and David Lo. 2012. Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports. In 2012 34th International conference on software engineering (ICSE). IEEE, 14–24
