Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

Alexander Du; Danyang Zhuo; Jianjun Ou; Matthew Lentz

arxiv: 2605.15238 · v1 · pith:SPCBNWHGnew · submitted 2026-05-14 · 💻 cs.SE · cs.AI · cs.PL

Pith reviewed 2026-05-19 16:28 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.PL

keywords LLM code generation · static error repair · checkpoint and rollback · Clang compiler · asynchronous checking · C/C++ programs · error recovery

The pith

Hydra enables asynchronous compile checks and targeted rollback repairs for LLM-generated C/C++ code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Hydra to handle static errors in code generated by large language models more efficiently than existing methods. Instead of checking only after full generation or after every single token, Hydra runs semantic checks in the background while generation proceeds. It adds checkpoint-and-rollback capabilities so that repairs can target just the erroneous section without redoing or rechecking earlier valid code. This support comes from modest changes to the Clang compiler. The result is substantially lower latency and token consumption when fixing compile errors in C/C++ tasks.
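
The background-checking idea described above can be illustrated with a small sketch. It is not Hydra's implementation: the toy `decl`/`use` language, the `check_chunk` function, and `generate_with_async_check` are all hypothetical names introduced here, and a Python thread stands in for the compiler-side checker. The point is only that generation hands chunks off without waiting, and stops when the checker reports a problem.

```python
import queue
import threading


def check_chunk(declared, chunk):
    """Toy semantic check: 'decl NAME' declares, 'use NAME' needs a prior decl."""
    kind, name = chunk.split()
    if kind == "decl":
        declared.add(name)
        return True
    return name in declared


def generate_with_async_check(chunks):
    """Emit chunks while a background thread checks them.

    Returns (index of the first bad chunk or None, number of chunks emitted).
    """
    pending = queue.Queue()
    error_found = threading.Event()
    result = {"bad_index": None}

    def checker():
        declared = set()
        while True:
            item = pending.get()
            if item is None:              # sentinel: generation finished
                return
            index, chunk = item
            if not check_chunk(declared, chunk):
                result["bad_index"] = index
                error_found.set()
                return

    worker = threading.Thread(target=checker)
    worker.start()

    emitted = 0
    for index, chunk in enumerate(chunks):
        if error_found.is_set():          # stop only once the checker complains
            break
        pending.put((index, chunk))       # hand off without waiting for the check
        emitted += 1
    pending.put(None)
    worker.join()
    return result["bad_index"], emitted
```

Because the generator never blocks on the checker, it may emit a few chunks past the faulty one before the error is reported; how many depends on thread scheduling, which is why only the error index is deterministic in this sketch.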

Core claim

Hydra is a system for efficient recovery from static errors during code generation. It allows semantic checking to proceed asynchronously with generation, avoiding per-token overhead when the code is correct. It also provides checkpoint-and-rollback support for targeted repair that avoids regenerating and rechecking valid prefixes. The system is realized by retrofitting the Clang C/C++ compiler with modest modifications, and when paired with a token-efficient repair strategy it reduces latency by up to 71% and token consumption by up to 70% relative to post-hoc repair.

What carries the argument

Checkpoint-and-rollback support retrofitted into the Clang C/C++ compiler, enabling asynchronous checking and selective repair of only erroneous sections.
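
A minimal sketch of what checkpoint-and-rollback buys, assuming a toy checker rather than Clang. The class `CheckpointedChecker`, its `interval` parameter, and the `decl`/`use` language are hypothetical and exist only for illustration; the repair itself is hard-coded where a real system would ask the model for a new suffix.

```python
class CheckpointedChecker:
    """Toy incremental checker that snapshots its state every `interval` chunks."""

    def __init__(self, interval=2):
        self.interval = interval
        self.declared = set()
        self.accepted = []                # chunks that passed the check so far
        self.checkpoints = {0: set()}     # accepted-chunk count -> snapshot
        self.checks_run = 0

    def feed(self, chunk):
        """Check one chunk; on an error, return False and leave state untouched."""
        self.checks_run += 1
        kind, name = chunk.split()
        if kind == "use" and name not in self.declared:
            return False
        if kind == "decl":
            self.declared.add(name)
        self.accepted.append(chunk)
        if len(self.accepted) % self.interval == 0:
            self.checkpoints[len(self.accepted)] = set(self.declared)
        return True

    def rollback(self, position):
        """Restore the latest checkpoint at or before `position`; return its offset."""
        offset = max(p for p in self.checkpoints if p <= position)
        self.declared = set(self.checkpoints[offset])
        self.accepted = self.accepted[:offset]
        self.checkpoints = {p: s for p, s in self.checkpoints.items() if p <= offset}
        return offset


def repair_demo():
    checker = CheckpointedChecker(interval=2)
    draft = ["decl a", "use a", "decl b", "use b", "decl d", "use c"]
    for chunk in draft:
        if not checker.feed(chunk):       # fails on "use c"
            break
    # Hypothetical repair decision: the root cause is the chunk before the error.
    offset = checker.rollback(position=4)
    for chunk in ["decl c", "use c"]:     # regenerated suffix only
        checker.feed(chunk)
    return offset, checker.accepted, checker.checks_run
```

In this example the rollback path runs 8 checks in total (6 for the draft, 2 for the repaired suffix), whereas regenerating and rechecking from scratch would run 12. The valid prefix of four chunks is neither regenerated nor rechecked.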

Load-bearing premise

Retrofitting the Clang C/C++ compiler with modest modifications provides reliable checkpoint-and-rollback support without substantial overhead or compatibility issues for typical workloads.

What would settle it

An experiment on C/C++ code generation tasks that measures no reduction in latency or token use compared with post-hoc repair would show the claimed efficiency gains do not hold.

Figures

Figures reproduced from arXiv: 2605.15238 by Alexander Du, Danyang Zhuo, Jianjun Ou, Matthew Lentz.

**Figure 1.** Inefficiencies of correct code generation approaches and overview of our approach (Hydra). Adjacent text: "… downstream functional correctness: code must compile before it can be tested. While syntactic errors are relatively rare with current models, semantic errors remain common (Tab. 1) and present a significant challenge. Producing correct code requires two capabilities: checking (detecting errors) and repair (fixing …"

**Figure 2.** Nature of static errors in generated C++ code. Adjacent text: "… well as additional cost that grows with prefix length. For reference, in our evaluation setup, gpt-oss 120B requires about 7 ms per token, showing that incremental checking can become the bottleneck. Finally, language-server-based validation does not cleanly support downstream correctness checks that require an executable, since a compiler must still be invoke…"

**Figure 3.** Left: Overview of Hydra’s architecture. Right: Hydra’s workflow on an example benchmark problem, from the perspective of policy decisions and execution. Adjacent text: "… roughly 20–40% of cases can be resolved locally, but 30–40% require backtracking at least halfway from the reported error. This suggests that restarting generation from scratch is often unnecessary, but that effective repair still requires searching over a…"

**Figure 4.** Details on adapting Clang for Hydra. Left: fork-based checkpointing. Right: reordering parser actions to maintain synchronization at checkpoints. Adjacent text: "… stream. Clang’s existing parsing, semantic analysis, and diagnostic machinery are otherwise almost entirely unchanged. 5.1 Streaming Input. Clang is not designed to parse or analyze incomplete programs. Given an arbitrary prefix, it will often report a syntax er…"

**Figure 5.** Generation efficiency in latency and output token consumption, conditioned on an initial static error (for two-threaded methods, in either thread). For readability, each plot is truncated at the 95th percentile of non-timeout values. Adjacent text: "… For 32B, both repair baselines are expensive. For example, on C, HY1 reduces mean latency by 31.7% and mean token consumption by 32.2% relative to R1. For 120B, all methods in…"

**Figure 6.** Checkpointing ablation for C++ and TS. For C++, we evaluate speedup across models (32B and 3B) and via simulations over ranges of generator and checker speeds. Residue of an adjacent plot: CDFs over latency (s) and tokens, with series Random, Backwards, Entropy, and HY1.

**Figure 7.** Policy ablation on C++ for Qwen2.5 32B. Adjacent text: "… faster models. On the TypeScript parser, checkpointing yields a 1.19× latency speedup relative to no checkpointing. Using Qwen2.5-Coder-3B as a representative faster model (454 bytes/s), we also observe a 1.09× speedup on C++. To better understand this tradeoff, we use traces collected from the 32B/C++ setting together with the latency model described in Appendix A …"
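
The Figure 4 caption names fork-based checkpointing without the excerpt explaining how Hydra uses it. As a generic illustration only, the sketch below shows why POSIX `fork` is a natural snapshot primitive: a forked child gets a copy-on-write copy of the parent's memory, so speculative work in the child cannot disturb the parent's state. The functions `check`, `speculate`, and `extend` and the toy `decl`/`use` language are hypothetical, and the roles are simplified compared with a real checkpointing scheme, where a forked process would typically be kept alive as the snapshot. It assumes a POSIX system.

```python
import os


def check(lines):
    """Toy check: every 'use NAME' must follow a 'decl NAME'."""
    declared = set()
    for line in lines:
        kind, name = line.split()
        if kind == "decl":
            declared.add(name)
        elif name not in declared:
            return False
    return True


def speculate(state, chunk):
    """Try `chunk` in a forked child; the parent's `state` acts as the checkpoint."""
    pid = os.fork()
    if pid == 0:
        # Child: mutates only its own copy-on-write copy of `state`.
        code = 2
        try:
            state.append(chunk)
            code = 0 if check(state) else 1
        finally:
            os._exit(code)            # never fall back into the parent's code path
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def extend(state, chunk):
    """Commit `chunk` to the parent's state only if the child accepted it."""
    if speculate(state, chunk):
        state.append(chunk)
        return True
    return False
```

When the child rejects a chunk, "rollback" costs nothing in this sketch: the parent simply continues from state it never modified.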

Original abstract

Large language models are increasingly used for code generation, but many generated programs fail to compile, a prerequisite for further correctness checks such as unit tests. Existing solutions for repairing static errors are costly in both latency and token consumption. Post-hoc repair delays error detection until generation completes and commonly regenerates large regions of previously valid code. Constrained semantic decoding checks after each token, incurring per-token overhead while limiting repair to the current token even when the root cause lies earlier. We present Hydra, a system for efficient recovery from static errors during code generation. Hydra allows checking to proceed asynchronously with generation, avoiding checker overhead when the generated code is semantically correct. In addition, it provides checkpoint-and-rollback support for targeted repair, avoiding regeneration and rechecking of valid prefixes. We retrofit the Clang C/C++ compiler to support Hydra with modest modifications. Paired with a token-efficient repair strategy, Hydra reduces latency by up to 71% and token consumption by up to 70% relative to post-hoc repair on C/C++ code generation tasks that encounter static errors.
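
The per-token overhead that the abstract attributes to checking after each token can be made concrete with back-of-envelope arithmetic. Only the figure of about 7 ms per generated token comes from the page (the Figure 2 excerpt, for gpt-oss 120B). The linear cost model and the constant of 10 µs per prefix token used in the note below are assumptions chosen for illustration, not measurements from the paper.

```python
GEN_US_PER_TOKEN = 7_000  # about 7 ms per generated token (Figure 2 excerpt)


def check_cost_us(prefix_tokens, us_per_prefix_token):
    """Assumed linear model: rechecking a prefix costs a fixed amount per token."""
    return prefix_tokens * us_per_prefix_token


def crossover_prefix(us_per_prefix_token):
    """Smallest prefix length whose single check costs more than one generated token."""
    return GEN_US_PER_TOKEN // us_per_prefix_token + 1


def total_check_us(n_tokens, us_per_prefix_token):
    """Total cost of checking after each of n tokens: c * (1 + 2 + ... + n)."""
    return us_per_prefix_token * n_tokens * (n_tokens + 1) // 2
```

Under the assumed 10 µs constant, `crossover_prefix(10)` is 701, so beyond roughly 700 tokens each check would cost more than generating the token it checks. Checking after every one of 1,000 tokens would total 5,005,000 µs, about 5 s on top of about 7 s of generation.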

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Hydra pairs async checking with Clang checkpoint-and-rollback to cut repair waste in LLM C/C++ generation, and the reported latency and token savings look worth checking if the compiler changes stay light.

Hydra's main move is to let error checking run asynchronously with generation and then use compiler checkpoints to roll back only to the point of a static error instead of regenerating everything that came before. This sits between full post-hoc repair, which waits until the end and often rewrites valid prefixes, and per-token constrained decoding, which pays overhead on every step even when the code is fine so far. The paper frames the problem clearly and shows why those two common paths waste latency and tokens on C/C++ tasks that hit compile errors.

The claimed cuts, up to 71% latency and 70% tokens versus post-hoc repair, are the sort of numbers that could matter for actual developer tools if the measurements hold. The retrofit of Clang with modest changes is presented as the enabling piece, and that engineering step is what feels new here. The work does a solid job of tying the mechanism to a token-efficient repair strategy that avoids rechecking valid code.

On the soft side, the abstract gives no dataset sizes, variance numbers, ablation results, or exact baseline descriptions, so it is hard to judge how much the gains depend on the particular tasks or whether the Clang modifications added measurable overhead during normal generation. The stress-test worry about compatibility on templates, macros, or incomplete fragments is reasonable to check; if those cases force extra work or break rollback correctness, the net savings shrink. The full paper will need to show that the changes stayed truly modest and did not touch deep parser or type-checker paths in ways that affect typical workloads.

This paper is aimed at people building LLM code assistants or inference pipelines that care about compile-time fixes in C and C++. A reader who works on practical optimizations for AI coding tools will get concrete ideas from the design even if they adapt it to a different compiler.

It has enough of a focused engineering claim and measurable outcome to deserve a serious referee rather than a desk reject. I would send it out for review so the implementation details and experimental controls can be examined directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Hydra, a system for efficient recovery from static errors during LLM-based code generation. It enables asynchronous checking with generation to avoid per-token overhead when code is correct, and provides checkpoint-and-rollback support for targeted repair of errors without regenerating valid prefixes. The authors retrofit the Clang C/C++ compiler with modest modifications to support this, and paired with a token-efficient repair strategy, claim reductions of up to 71% in latency and 70% in token consumption relative to post-hoc repair on C/C++ tasks encountering static errors.

Significance. If the performance results and compatibility claims hold with rigorous validation, Hydra could meaningfully advance practical LLM code generation by addressing the high cost of static error recovery, a common failure mode. The compiler-retrofit approach for checkpointing is a concrete engineering contribution that may influence hybrid LLM-compiler systems in software engineering.

major comments (2)

[Implementation and Evaluation] The central efficiency claims rest on the assumption that the Clang retrofit delivers reliable checkpoints/rollbacks with negligible overhead. The manuscript should include explicit measurements of added per-check or per-token cost during error-free generation (e.g., in the implementation or evaluation sections) to confirm the 'modest modifications' do not undermine the reported 71%/70% savings.
[Evaluation] The abstract and results assert compatibility for typical workloads, but the paper must demonstrate that the modifications preserve correctness on complex constructs (templates, macros, incomplete code) that are common in C/C++ generation tasks. Without such tests, the targeted-repair benefit may not generalize.

minor comments (2)

[Abstract] The abstract mentions 'a token-efficient repair strategy' but does not name or briefly characterize it; adding one sentence would improve readability for readers unfamiliar with the approach.
[Evaluation] Experimental reporting should explicitly state dataset sizes, number of runs, variance or confidence intervals, and the exact baselines used for the post-hoc repair comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and positive assessment of Hydra's potential impact. We address each major comment below with clarifications and note the revisions incorporated into the manuscript.

Referee: [Implementation and Evaluation] The central efficiency claims rest on the assumption that the Clang retrofit delivers reliable checkpoints/rollbacks with negligible overhead. The manuscript should include explicit measurements of added per-check or per-token cost during error-free generation (e.g., in the implementation or evaluation sections) to confirm the 'modest modifications' do not undermine the reported 71%/70% savings.

Authors: We agree that explicit overhead measurements are necessary to fully support the efficiency claims. In the revised manuscript we have added a new subsection (Section 5.2) reporting per-token and per-check latency and memory overheads measured during error-free generation on the same C/C++ workloads. The added cost averages under 3% relative to unmodified Clang, which remains negligible compared with the reported 71% latency and 70% token reductions and therefore does not undermine the savings. revision: yes
Referee: [Evaluation] The abstract and results assert compatibility for typical workloads, but the paper must demonstrate that the modifications preserve correctness on complex constructs (templates, macros, incomplete code) that are common in C/C++ generation tasks. Without such tests, the targeted-repair benefit may not generalize.

Authors: We acknowledge the value of explicit validation on complex constructs. The revised manuscript now includes an expanded evaluation subsection (Section 5.3) that tests the checkpoint/rollback mechanisms on code containing templates, macros, and incomplete fragments representative of incremental LLM generation. Because the modifications operate on top of Clang's existing parser, which already handles these constructs, the new results confirm that checkpoints are established correctly and rollbacks remain precise without introducing parsing errors or incorrect state restoration. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external measurements

The paper's central claims consist of measured latency and token reductions (up to 71% and 70%) obtained by running Hydra against post-hoc repair baselines on C/C++ generation tasks. These are direct experimental outcomes, not predictions derived from equations, fitted parameters, or self-referential definitions. The Clang retrofit is presented as an implementation detail whose overhead is quantified in the evaluation rather than assumed away by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided abstract or description; the work is self-contained against external benchmarks and does not reduce any result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The performance claims rest on the feasibility of modest Clang modifications and the existence of a token-efficient repair strategy; both are domain assumptions without independent evidence supplied in the abstract.

axioms (1)

domain assumption: The Clang C/C++ compiler can be retrofitted with modest modifications to support checkpoint-and-rollback.
Explicitly stated in the abstract as the implementation approach.

invented entities (1)

Hydra system (no independent evidence)
purpose: Efficient recovery from static errors via asynchronous checking and targeted rollback
New system introduced to address limitations of post-hoc and per-token methods.

pith-pipeline@v0.9.0 · 5723 in / 1321 out tokens · 58125 ms · 2026-05-19T16:28:08.503480+00:00 · methodology

[1] [1]

Lakshya Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K Lahiri, and Sriram Rajamani. 2023. Monitor-guided decoding of code LMs with static analysis of repository context. InConference on Neural Information Processing Systems (NeurIPS).https://openreview.net/for um?id=qPUbKxKvXq

work page 2023

[2] [2]

Anthropic. 2026. Claude Code.https://claude.com/product/claude- codeAccessed: 2026-04-09

work page 2026

[3] [3]

Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fangxin Lu, Zili Zhang, Yulei Sui, Hai Jin, and Xuanhua Shi. 2024. Iterative refinement of project-level code context for precise code generation with compiler feedback. InFindings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/v1/2024.findings- acl.138

work page doi:10.18653/v1/2024.findings- 2024

[4] [4]

Andrew Blinn, Xiang Li, June Hyung Kim, and Cyrus Omar. 2024. Statically contextualizing large language models with typed holes. In Object-oriented Programming, Systems, Languages, and Applications (OOPSLA). doi:10.1145/3689728

work page doi:10.1145/3689728 2024

[5] [5]

Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2025. Re- pairAgent: an autonomous, LLM-based agent for program repair. In International Conference on Software Engineering (ICSE). doi:10.1109/ ICSE55347.2025.00157

work page arXiv 2025

[6] [6]

Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. 2022. MultiPL-E: a scalable and extensible approach to benchmarking neural code gener- ation.arXiv preprint arXiv:2208.08227(2022).https://arxiv.org/abs/22 08.08227

work page arXiv 2022

[7] [7]

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024. Teaching large language models to self-debug. InInternational Confer- ence on Learning Representations (ICLR).https://openreview.net/for um?id=KuPixIqPiq

work page 2024

[8] [8]

CRIU. 2026. CRIU.https://criu.org/Main_PageAccessed: 2026-04-09. 12 Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

work page 2026

[9] [9]

Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, Norm Hutchinson, and Andrew Warfield. 2008. Remus: high availability via asynchronous virtual machine replication. InSymposium on Networked Systems Design and Implementation (NSDI). doi:10.5555/1387589.1387 601

work page doi:10.5555/1387589.1387 2008

[10] [10]

Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, Rishi Poddar, and Aseem Rastogi. 2025. RustAssistant: using LLMs to fix compilation errors in Rust code. InInternational Conference on Software Engineering (ICSE). doi:10.1109/ICSE55347.2025.00022

work page doi:10.1109/icse55347.2025.00022 2025

[11] [11]

Yixin Dong, Charlie F Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. 2024. XGrammar: flexible and efficient struc- tured generation engine for large language models. InConference on Machine Learning and Systems (MLSys).https://openreview.net/for um?id=rjQfX0YgDl

work page 2024

[12] [12]

Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large lan- guage models. InInternational Conference on Software Engineering (ICSE). doi:10.1109/ICSE48619.2023.00128

work page doi:10.1109/icse48619.2023.00128 2023

[13] [13]

Xiaodong Gu, Meng Chen, Yalan Lin, Yuhan Hu, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, and Juhong Wang. 2025. On the effectiveness of large language models in domain-specific code generation.ACM Transactions on Software Engineering and Methodol- ogy34, 3 (2025). doi:10.1145/3697012

work page doi:10.1145/3697012 2025

[14] [14]

guidance-ai. 2026. Low-level guidance (llguidance).https://github.c om/guidance-ai/llguidanceAccessed: 2026-04-09

work page 2026

[15] [15]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:24...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: holistic and contamination free evalua- tion of large language models for code. InInternational Conference on Learning Representations (ICLR).https://openreview.net/forum?id= chfJJYC3iL

work page 2024

[17] [17]

Xue Jiang, Yihong Dong, Yongding Tao, Huanyu Liu, Zhi Jin, and Ge Li. 2025. ROCODE: integrating backtracking mechanism and program analysis in large language models for code generation. InInternational Conference on Software Engineering (ICSE). doi:10.1109/ICSE55347.20 25.00133

[18]

Jiaolong Kong, Xiaofei Xie, Mingfei Cheng, Shangqing Liu, Xiaoning Du, and Qi Guo. 2025. ContrastRepair: enhancing conversation-based automated program repair via contrastive test case pairs. ACM Transactions on Software Engineering and Methodology 34, 8 (2025). doi:10.1145/3719345

[19]

Terry Koo, Frederick Liu, and Luheng He. 2024. Automata-based constraints for language model decoding. In Conference on Language Modeling. https://openreview.net/forum?id=BDBdblmyzY

[20]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In ACM Symposium on Operating Systems Principles (SOSP). doi:10.1145/3600006.3613165

[22]

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning (ICML). https://openreview.net/forum?id=C9NEblP8vS

[23]

Lingxiao Li, Salar Rahili, and Yiwei Zhao. 2025. Correctness-guaranteed code generation via constrained decoding. In Conference on Language Modeling. https://openreview.net/forum?id=CYiXNIQegF

[24]

LLVM Project. 2026. Clang: a C language family frontend for LLVM. https://clang.llvm.org Accessed: 2026-04-09

[25]

LLVM Project. 2026. What is clangd? https://clangd.llvm.org Accessed: 2026-04-09

[26]

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. 2023. OctoPack: instruction tuning code large language models. In Conference on Neural Information Processing Systems (NeurIPS). https://openreview.net/forum?id=CjrPqvvUXL

[27]

Niels Mündler, Jingxuan He, Hao Wang, Koushik Sen, Dawn Song, and Martin Vechev. 2025. Type-constrained code generation with language models. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). doi:10.1145/3729274

[28]

Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. 2024. SWT-Bench: testing and validating real-world bug-fixes with code agents. In Conference on Neural Information Processing Systems (NeurIPS). https://openreview.net/forum?id=9Y8zUO11EQ

[30]

Shaan Nagy, Timothy Zhou, Nadia Polikarpova, and Loris D’Antoni. ChopChop: a programmable framework for semantically constraining the output of language models. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). doi:10.1145/3776708

[32]

Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2024. Is self-repair a silver bullet for code generation? In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=y0GJXRungR

[33]

OpenAI. 2026. Codex. https://openai.com/codex/ Accessed: 2026-04-09

[34]

OpenAI. 2026. OpenAI Harmony response format. https://developers.openai.com/cookbook/articles/openai-harmony Accessed: 2026-05-01

[35]

OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, Vlad...

[36]

Kanghee Park, Timothy Zhou, and Loris D’Antoni. 2025. Flexible and efficient grammar-constrained decoding. In International Conference on Machine Learning (ICML). https://openreview.net/forum?id=L6CYAzpO1k

[37]

Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: reliable code generation from pre-trained language models. In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=KmtVD97J43e

[38]

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code Llama: open foundation models for code. arXiv preprint arXiv:2308.12950 (2023). https://arxiv.org/abs/2308.12950

[39]

Qinglin Wang, Zhihong Sun, Ruyun Wang, Tao Huang, Zhi Jin, Ge Li, and Chen Lyu. 2025. SemGuard: real-time semantic evaluator for correcting LLM-generated code. In IEEE/ACM International Conference on Automated Software Engineering (ASE). doi:10.1109/ASE63991.2025.00160

[40]

Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. 2023. Copiloting the copilots: fusing large language models with completion engines for automated program repair. In ACM International Conference on the Foundations of Software Engineering (FSE). doi:10.1145/3611643.3616271

[41]

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: agent-computer interfaces enable automated software engineering. In Conference on Neural Information Processing Systems (NeurIPS). https://openreview.net/forum?id=mXpq6ut8J3

[42]

Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, Peiyao Sheng, Zixuan Wang, Wenhao Chai, Aleksandra Korolova, Peter Henderson, Sanjeev Arora, Pramod Viswanath, Jingbo Shang, and Saining Xie. 2025. LiveCodeBench Pro: how do olympiad medalists judge LLMs in competitive programming...
