Towards Agentic Runtime Healing
Pith reviewed 2026-05-23 22:12 UTC · model grok-4.3
The pith
Large language models can generate on-the-fly error handlers, with GPT-4 recovering from 72.8 percent of runtime errors in the paper's evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We demonstrate the feasibility of this approach by designing such a framework, Healer, and empirically showing that it can handle runtime errors with a high success rate. When an unanticipated runtime error occurs, Healer leverages its internal LLM to generate bespoke error-handling code. The generated healing code is then executed to produce a corrected program state, allowing the program to continue execution with minimal disruption. GPT-4 can successfully recover from 72.8 percent of runtime errors.
What carries the argument
The Healer framework, which calls an internal LLM to generate and run custom error-handling code based on the runtime error and program state.
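The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `heal` and `stub_llm` are hypothetical names, the "LLM" is a canned function that always returns the same patch, and the prompt layout is invented for the example.

```python
from typing import Callable, Dict


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (assumption, not a real API)."""
    # A real system would send `prompt` to a model. This canned reply
    # only makes sense for the division-by-zero case used below.
    return "result = 0.0"


def heal(code: str, state: Dict[str, object],
         llm: Callable[[str], str] = stub_llm) -> Dict[str, object]:
    """Run `code` against `state`; on an exception, ask `llm` for healing
    code and execute it against the same state."""
    try:
        exec(code, {}, state)
    except Exception as err:
        state_repr = {name: repr(value) for name, value in state.items()}
        prompt = (
            f"Failing code:\n{code}\n"
            f"Error: {type(err).__name__}: {err}\n"
            f"Program state: {state_repr}\n"
            "Write Python statements that repair the state."
        )
        healing_code = llm(prompt)
        # Executing generated code is unsafe without sandboxing and checks.
        exec(healing_code, {}, state)
    return state


state = heal("result = total / count", {"total": 10, "count": 0})
print(state["result"])  # 0.0
```

Here the failing statement raises `ZeroDivisionError`, the stub returns a patch that assigns `result`, and execution continues with the patched state. Whether `0.0` is the right recovery value is exactly the kind of judgment the trustworthiness concern is about.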
If this is right
- Self-healing systems become able to address a wider variety of runtime errors than rule-based methods permit.
- Programs can resume after errors with less need for manual intervention or predefined handlers.
- LLM integration at runtime can support more adaptive and resilient software architectures.
- Safety checks and specialized programming conventions become necessary to incorporate generated patches safely.
Where Pith is reading between the lines
- The same LLM generation pattern could be tested on logical errors or performance degradations beyond crashes.
- Reliable runtime healing might let developers write less defensive code upfront.
- Combining the generated patches with static analysis tools could provide an extra layer of verification before execution.
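As a rough illustration of the static-analysis idea in the last bullet, the sketch below screens a generated patch with Python's standard `ast` module before it would be executed. The deny-list `BANNED_CALLS` and the function name `is_patch_allowed` are assumptions made for this example; the paper does not specify its safety checks, and a name-based deny-list is easy to bypass.

```python
import ast

# Hypothetical deny-list; a real policy would likely be allow-list based.
BANNED_CALLS = {"eval", "exec", "open", "__import__", "compile"}


def is_patch_allowed(source: str) -> bool:
    """Return False if the patch does not parse, imports a module,
    or calls a banned builtin by its plain name."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            return False
    return True


print(is_patch_allowed("result = 0.0"))                # True
print(is_patch_allowed("import os\nos.remove('x')"))   # False
print(is_patch_allowed("data = open('/etc/passwd')"))  # False
```

A check like this only rejects obviously risky patches; it says nothing about whether an accepted patch restores a correct program state.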
Load-bearing premise
The trustworthiness of LLM-generated code can be managed sufficiently through safety checks and Healer-aware programming so that executing the generated patches does not introduce new errors or security issues.
What would settle it
A controlled test in which the generated healing code frequently fails to restore correct execution or introduces new errors or security problems would show the claimed recovery rates are not reliable.
Original abstract
Self-healing systems have long been a focus of research, aiming to enable software to recover from unexpected runtime errors without human intervention. Traditional approaches rely on predefined heuristic rules, such as reusing error handlers or rolling back to checkpoints, but these methods struggle to adapt to the diverse range of runtime errors. The emergence of Large Language Models offers a new opportunity to address this challenge. Leveraging their ability to understand and generate code and natural language, we propose using LLMs to dynamically generate error-handling strategies in real time, tailored to specific runtime contexts such as error messages and program states. We demonstrate the feasibility of this approach by designing such a framework, Healer, and empirically showing that it can handle runtime errors with a high success rate. When an unanticipated runtime error occurs, Healer leverages its internal LLM to generate bespoke error-handling code. The generated healing code is then executed to produce a corrected program state, allowing the program to continue execution with minimal disruption. We evaluate Healer across four code datasets and three state-of-the-art LLMs (GPT-3.5, GPT-4, and CodeQwen-7B), where GPT-4 can successfully recover from 72.8% of runtime errors, underscoring the promise of LLMs in this domain. Despite these promising results, challenges remain, particularly regarding the trustworthiness of LLM-generated code and its integration into existing systems. We mention potential solutions, such as safety checks and Healer-aware programming, to mitigate risks and ensure reliable operation. This work represents the first step toward agentic runtime healing, paving the way for more adaptive, resilient, and self-healing software systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Healer framework, which uses an LLM to generate and execute bespoke error-handling code at runtime when an unanticipated error occurs, allowing the program to continue from a corrected state. It evaluates the approach across four code datasets and three LLMs, reporting that GPT-4 recovers from 72.8% of runtime errors, and identifies trustworthiness of generated patches as a remaining challenge while suggesting safety checks and Healer-aware programming as mitigations.
Significance. If the evaluation methodology and post-patch safety claims can be substantiated, the work would provide a concrete demonstration of LLM-driven runtime recovery that goes beyond static heuristics, representing an early empirical step toward agentic self-healing systems. The explicit acknowledgment of the trustworthiness gap is a constructive element.
major comments (2)
- [Evaluation section] The reported 72.8% success rate for GPT-4 is presented without any description of the evaluation methodology, definition of success (e.g., whether the original program resumes without further exceptions or merely that the healing code executes), error diversity across the four datasets, baselines, or statistical controls. This information is required to assess whether the number supports the feasibility claim.
- [Abstract and Evaluation section] The central feasibility claim requires that executing LLM-generated patches does not introduce new runtime errors or security vulnerabilities, yet the evaluation reports only recovery success rates and supplies no post-patch error rates, static analysis results, or adversarial test outcomes. The abstract lists safety checks as a mitigation but provides no empirical evidence that they suffice.
minor comments (2)
- [Abstract] The abstract and introduction could more precisely delimit the class of runtime errors considered (e.g., whether they include only exceptions or also logical errors and performance degradations).
- [Evaluation section] No table or figure summarizes the per-dataset or per-LLM breakdown of the 72.8% figure; adding one would improve clarity of the empirical results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation methodology and safety considerations. These comments identify areas where the manuscript would benefit from greater clarity and detail. We address each point below and will revise the manuscript accordingly.
- Referee: [Evaluation section] The reported 72.8% success rate for GPT-4 is presented without any description of the evaluation methodology, definition of success (e.g., whether the original program resumes without further exceptions or merely that the healing code executes), error diversity across the four datasets, baselines, or statistical controls. This information is required to assess whether the number supports the feasibility claim.
  Authors: We agree that the Evaluation section requires expansion to substantiate the reported recovery rate. In the revised manuscript we will add: (1) an explicit definition of success (the original program resumes execution from the corrected state without raising further exceptions); (2) a breakdown of error types and their distribution across the four datasets; (3) any baseline comparisons performed; and (4) statistical details such as the number of trials and observed variance. These additions will directly address the request for methodological transparency.
  Revision: yes
- Referee: [Abstract and Evaluation section] The central feasibility claim requires that executing LLM-generated patches does not introduce new runtime errors or security vulnerabilities, yet the evaluation reports only recovery success rates and supplies no post-patch error rates, static analysis results, or adversarial test outcomes. The abstract lists safety checks as a mitigation but provides no empirical evidence that they suffice.
  Authors: We concur that the feasibility claim would be strengthened by evidence on post-patch behavior. The current manuscript reports only recovery rates and flags trustworthiness as an open challenge while listing safety checks as a proposed mitigation without supporting measurements. In revision we will incorporate any post-healing error observations available from the existing experimental logs, add a dedicated limitations subsection on potential new errors or vulnerabilities, and expand the discussion of safety checks to clarify their current status as unvalidated proposals rather than demonstrated safeguards.
  Revision: yes
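To make the distinction raised in this exchange concrete, the sketch below separates "the healing code ran" from "the program resumed without further exceptions" and reports both a recovery rate and a post-patch error rate. The `Trial` structure, the helper names, and the toy trials are hypothetical; the numbers printed come from the toy data only and are unrelated to the paper's results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trial:
    """One injected-error trial (hypothetical structure)."""
    run_healing: Callable[[], None]  # executes the generated healing code
    resume: Callable[[], None]       # continues the original program


def raises(step: Callable[[], None]) -> bool:
    """Return True if calling `step` raises an exception."""
    try:
        step()
    except Exception:
        return True
    return False


def summarize(trials: List[Trial]) -> Dict[str, float]:
    """Count a trial as recovered only if healing ran AND resumption
    raised no further exception."""
    if not trials:
        raise ValueError("need at least one trial")
    recovered = post_patch_errors = 0
    for trial in trials:
        if raises(trial.run_healing):
            continue                # the healing code itself failed
        if raises(trial.resume):
            post_patch_errors += 1  # healed, but resumption raised again
        else:
            recovered += 1
    n = len(trials)
    return {"recovery_rate": recovered / n,
            "post_patch_error_rate": post_patch_errors / n}


def ok() -> None:
    pass


def boom() -> None:
    raise ValueError("still broken")


trials = [Trial(ok, ok), Trial(ok, boom), Trial(boom, ok), Trial(ok, ok)]
print(summarize(trials))
# {'recovery_rate': 0.5, 'post_patch_error_rate': 0.25}
```

Reporting the two rates side by side shows how a headline recovery figure can change depending on which definition of success is used.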
Circularity Check
No circularity: empirical success rates are measured directly, not derived or fitted.
The paper proposes the Healer framework and reports measured recovery rates (e.g., 72.8% with GPT-4) from direct evaluation on four datasets and three LLMs. No equations, parameters, predictions, or first-principles derivations appear in the provided text. The central claim is an observed empirical outcome rather than a reduction to inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The evaluation is presented as a feasibility demonstration, making the result self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can understand runtime error contexts and generate appropriate fixing code.
invented entities (1)
- Healer framework: no independent evidence