pith. sign in

arxiv: 2605.17626 · v1 · pith:65R2ABKMnew · submitted 2026-05-17 · 💻 cs.LG · cs.SE

Verifier-Guided Code Translation via Meta-Step Decoding

Pith reviewed 2026-05-20 14:12 UTC · model grok-4.3

classification 💻 cs.LG cs.SE
keywords code translationverifier-guided decodingtest-time scalinglarge language modelsstructural boundarieserror propagation preventionC-to-RustJavaScript-to-TypeScript
0
0 comments X

The pith

Verifier-guided decoding at structural boundaries during generation improves code translation success rates and efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that code translation by large language models can be improved by guiding the generation process itself with verifier feedback at key structural points, rather than checking only after the full output is produced. The core idea is to detect boundaries such as function ends or statement completions in the emerging code and immediately verify those prefixes using compilers or type checkers. If an error is found, the system rolls back to the last valid point instead of letting the mistake corrupt everything that follows. A sympathetic reader would care because this uses inference compute more efficiently on tasks where exact feedback is available, potentially making automated code migration more reliable without needing to generate and discard many full attempts. The results indicate better pass rates at lower token cost than standard self-refinement methods.

Core claim

Decoding Time Verification (DTV) treats structural boundaries as meta-steps in the decoding process. A state-machine controller interleaves token generation with verifier calls on valid prefixes, and applies structure-aware rollback when errors are detected, thereby preventing error propagation in the autoregressive output for code translation from C to Rust and from JavaScript to TypeScript.

What carries the argument

Decoding Time Verification (DTV) framework, which uses a state-machine to interleave generation with verifier-guided checks at structural boundaries and rollback mechanisms.

If this is right

  • DTV raises pass rates from 72.3% to 82.0% on C-to-Rust translation using the Qwen3-4B model under matched token budgets.
  • Pass rates improve from 33.3% to 46.0% on JavaScript-to-TypeScript translation with the same setup.
  • The method consumes fewer tokens per case than matched self-refinement baselines.
  • Similar gains appear when using the Gemma-4-E4B model.
  • DTV achieves a superior pass-rate versus cost tradeoff compared to post-generation verification approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If boundary detection works reliably across languages, the technique could apply to other generation tasks with deterministic verifiers such as formal proofs or API specifications.
  • Guided decoding might allow smaller models to match the performance of larger ones on structured output tasks by spending compute on corrections early rather than late.
  • Future work could test whether combining DTV with sampling-based methods yields further gains beyond what either achieves alone.

Load-bearing premise

The method depends on being able to reliably identify structural boundaries in the partially generated target code at the moment they are reached.

What would settle it

Running the same experiments but finding no improvement in pass rates or higher token consumption with DTV compared to the self-refinement baselines would indicate the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.17626 by Kirill Levchenko, Mihai Christodorescu, Somesh Jha, Tianyang Zhou, Varun Chandrasekaran.

Figure 1
Figure 1. Figure 1: GENERATION extends the target prefix to the next structural boundary (statement, block, function, or program). VERIFICATION renders the prefix and runs applicable oracles, committing on pass. RECOVERY parses diagnostics on fail, rolls back to the last checkpoint with augmented context, and escalates scope on repeated failure. The loop terminates on program-level success or budget bailout. intermediate reas… view at source ↗
Figure 2
Figure 2. Figure 2: DTV/self-refine Pareto-dominates na¨ıve/self-refine and sits on the upper-left of the cost￾pass-rate frontier on both tasks, above the BoN-N and S* frontiers at matched per-case cost. DTV/self-refine is also more token-efficient than na¨ıve/self-refine while attaining a higher pass rate: on C to Rust the two configurations average k = 5.28 and k = 7.18 tokens per case respectively. DTV/self-refine therefor… view at source ↗
Figure 3
Figure 3. Figure 3: Paired round-count ratios (rna¨ıve/rDTV) on the both-pass subset of self-refine. Markers: median (white circle), mean (red triangle). Differential-testing pass rates. They follow the same ordering: DTV-full attains 10.0%, while no-feedback, no-escalation, and detect-and-abort reach only 8.0%, 5.0%, and 8.0% respectively. DTV-full therefore leads on behavioral correctness as well, ruling out the possibility… view at source ↗
Figure 4
Figure 4. Figure 4: Three verification paradigms on an illustrative C-to-Rust translation. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The two repair modes (Section 2.2) on the same rustc E0277 failure. (left) Inline continuation: diagnostic appended as a comment, model continues in the same turn. (right) Patch￾based repair: structured user turn, model returns a diff patch in a fresh assistant turn. constrained to the rollback scope; outputs that violate the constraint are rejected and the controller advances along the retry ladder. The c… view at source ↗
Figure 6
Figure 6. Figure 6: Full-set head-to-head comparison on C to Rust ( [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Cross-model cumulative pass rate vs. normalized per-case cost [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-case distribution of the shared-code edit-concentration ratio on rescued cases where [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗
read the original abstract

Test-time scaling is an important mechanism for improving large language models, especially on tasks with deterministic verifiers. Code translation is a canonical example: the source program constrains valid outputs, while compilers, type check- ers, and behavioral checks provide exact pass/fail feedback. Existing approaches typically apply these verifiers only after generation, which is inefficient because early errors corrupt the autoregressive context and are rarely corrected later. We introduce Decoding Time Verification (DTV), a framework that treats structural boundaries as meta steps for verifier-guided decoding. DTV interleaves generation with verifier calls under a state-machine controller that enforces valid prefixes, using structural-boundary checks and structure-aware rollback to prevent error propagation while reducing wasted tokens. We evaluate DTV on C-to-Rust and JavaScript-to-TypeScript translation. Using Qwen3-4B as the primary generator under matched token budgets, DTV improves pass rates from 72.3% to 82.0% on C-to-Rust and from 33.3% to 46.0% on JavaScript-to-TypeScript relative to matched self-refinement baselines, while using fewer tokens per case; the same trend largely transfers to Gemma-4-E4B. In the evaluated cost-matched grid, DTV achieves a more favorable pass-rate-cost tradeoff than post-hoc verification or sampling-based scaling. These results show that verifier-guided decoding is an effective use of inference-time compute for code translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Decoding Time Verification (DTV), a framework that treats structural boundaries as meta-steps for verifier-guided decoding in code translation. It uses a state-machine controller to interleave autoregressive generation with verifier calls at detected boundaries, applies structure-aware rollback to enforce valid prefixes, and reports empirical pass-rate gains under matched token budgets: from 72.3% to 82.0% on C-to-Rust and 33.3% to 46.0% on JavaScript-to-TypeScript using Qwen3-4B relative to self-refinement baselines, with similar trends for Gemma-4-E4B and a more favorable pass-rate-cost tradeoff than post-hoc verification.

Significance. If the gains are robustly due to the proposed mechanism, DTV could meaningfully improve the efficiency of test-time scaling for code tasks with deterministic verifiers by enabling early error correction rather than post-generation fixes. The cost-matched grid evaluation and explicit token-budget comparisons are strengths that make the empirical claims more actionable than typical sampling-based scaling results.

major comments (2)
  1. [§3 (DTV framework and state-machine controller)] §3 (DTV framework and state-machine controller): The central claim that structural-boundary detection enables timely verifier feedback to prevent error propagation rests on an unvalidated assumption. No accuracy metrics for boundary detection, failure-case analysis on malformed prefixes or language-specific constructs, or ablation removing the detector are provided; if detection fails, the controller either skips verifiers (reducing to baseline) or triggers spurious rollbacks that could waste tokens or introduce new errors.
  2. [§4 (Experimental evaluation)] §4 (Experimental evaluation): Token accounting, baseline equivalence, and rollback logic details are insufficient to confirm that the reported pass-rate improvements (e.g., 72.3%→82.0% on C-to-Rust) arise from DTV rather than selection bias or unequal effective compute. The manuscript does not specify how rollback affects the final candidate pool or whether verifier calls are strictly budget-matched across methods.
minor comments (2)
  1. [Abstract] The abstract states that 'the same trend largely transfers to Gemma-4-E4B' but provides no numerical pass rates or token counts for this model, making it hard to assess transferability.
  2. [§3] Notation for the state-machine states and rollback triggers could be clarified with a small diagram or pseudocode to make the controller logic easier to follow without re-reading the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we will revise the manuscript to provide the requested validation and details.

read point-by-point responses
  1. Referee: [§3 (DTV framework and state-machine controller)] §3 (DTV framework and state-machine controller): The central claim that structural-boundary detection enables timely verifier feedback to prevent error propagation rests on an unvalidated assumption. No accuracy metrics for boundary detection, failure-case analysis on malformed prefixes or language-specific constructs, or ablation removing the detector are provided; if detection fails, the controller either skips verifiers (reducing to baseline) or triggers spurious rollbacks that could waste tokens or introduce new errors.

    Authors: We agree that explicit validation of the boundary detector is needed to support the central claim. In the revised manuscript we will add precision/recall/F1 metrics for the detector on a held-out set of 500 snippets per translation direction. We will also include a failure-case analysis covering malformed prefixes and language-specific constructs (C macros, TypeScript generics). Finally, we will add an ablation that disables the detector while keeping the rest of the controller fixed, showing that performance drops to near-baseline levels and confirming that gains are not due to skipping verifiers or spurious rollbacks. These results will appear in §3 and the appendix. revision: yes

  2. Referee: [§4 (Experimental evaluation)] §4 (Experimental evaluation): Token accounting, baseline equivalence, and rollback logic details are insufficient to confirm that the reported pass-rate improvements (e.g., 72.3%→82.0% on C-to-Rust) arise from DTV rather than selection bias or unequal effective compute. The manuscript does not specify how rollback affects the final candidate pool or whether verifier calls are strictly budget-matched across methods.

    Authors: We acknowledge that the current description of token accounting and rollback effects is insufficient. In the revision we will expand §4 with a precise accounting of how each verifier call and rollback is charged against the token budget, and we will state that the generation length is adjusted so that total tokens (including verifier overhead) are matched across DTV, self-refinement, and post-hoc baselines. We will also clarify that structure-aware rollbacks revert only to the last verified prefix and that the final candidate pool contains only fully verified outputs; this maintains equivalence in the search space and rules out selection bias. These additions will confirm that the reported gains are attributable to the DTV mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results are direct experimental comparisons

full rationale

The paper introduces the DTV framework and reports pass-rate improvements on C-to-Rust and JavaScript-to-TypeScript translation tasks using Qwen3-4B and Gemma-4-E4B under matched token budgets. These gains are presented as measured outcomes relative to self-refinement baselines, with no equations, first-principles derivations, or predictions that reduce by construction to fitted inputs or self-citations. The central claims rest on empirical evaluation rather than any tautological reduction, rendering the reported results self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the existence of reliable structural-boundary detectors and deterministic verifiers; these are treated as given rather than derived.

axioms (2)
  • domain assumption Compilers, type checkers, and behavioral checks provide exact pass/fail feedback on partial code prefixes
    Stated directly in the abstract as the basis for verifier-guided decoding.
  • domain assumption Structural boundaries in the target language can be identified reliably during token-by-token generation
    Required for the meta-step controller to decide when to invoke verifiers.
invented entities (1)
  • Decoding Time Verification (DTV) state-machine controller no independent evidence
    purpose: Enforces valid prefixes and triggers structure-aware rollback during generation
    Newly introduced framework; no independent evidence outside this work is provided in the abstract.

pith-pipeline@v0.9.0 · 5803 in / 1511 out tokens · 51926 ms · 2026-05-20T14:12:14.537513+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. [1]

    Scaling LLM Test-Time Com- pute Optimally Can be More Effective than Scaling Parameters for Reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM Test-Time Com- pute Optimally Can be More Effective than Scaling Parameters for Reasoning. InInternational Conference on Learning Representations, 2025

  2. [2]

    S1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. S1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332. Association for Computational Linguistics, 2025. doi...

  3. [3]

    A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning, January 2025

    Yixin Ji, Juntao Li, Yang Xiang, Hai Ye, Kaixin Wu, Kai Yao, Jia Xu, Linjian Mo, and Min Zhang. A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning, January 2025

  4. [4]

    Let’s Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s Verify Step by Step. In International Conference on Learning Representations, 2024

  5. [5]

    OpenAI o1 System Card, December 2024

    OpenAI. OpenAI o1 System Card, December 2024

  6. [6]

    Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, and Oriol Vinyals

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R´emi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Push...

  7. [7]

    Le, Christopher R´e, and Azalia Mirhoseini

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher R´e, and Azalia Mirhoseini. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, July 2024

  8. [8]

    Griffiths, Yuan Cao, and Karthik R Narasimhan

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. InAdvances in Neural Information Processing Systems, 2023

  9. [9]

    2025.IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

    Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and Toufique Ahmed. Calibration and Correctness of Language Models for Code. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 540–552. IEEE, April 2025. doi: 10.1109/icse55347.2025.00040

  10. [10]

    Process Reward Models That Think, April 2025

    Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process Reward Models That Think, April 2025

  11. [11]

    LEVER: Learning to Verify Language-to-Code Generation with Execution

    Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. LEVER: Learning to Verify Language-to-Code Generation with Execution. In International Conference on Machine Learning, 2023

  12. [12]

    Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen

    Bei Chen, Fengji Zhang, A. Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. CodeT: Code Generation with Generated Tests. InInternational Conference on Learning Representations, July 2022. 10

  13. [13]

    Teaching Large Language Models to Self-Debug

    Xinyun Chen, Maxwell Lin, Nathanael Sch¨arli, and Denny Zhou. Teaching Large Language Models to Self-Debug. InInternational Conference on Learning Representations, 2024

  14. [14]

    Kushal Arora, Layla El Asri, Hareesh Bahuleyan, and J. Cheung. Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation.Findings, April 2022

  15. [15]

    Execution Guided Line-by-Line Code Generation, June 2025

    Boaz Lavon, Shahar Katz, and Lior Wolf. Execution Guided Line-by-Line Code Generation, June 2025

  16. [16]

    Byrd, Robert Zinkov, and Nada Amin

    David Brandfonbrener, Simon Henniger, Sibi Raja, Tarun Prasad, Chloe Loughridge, Federico Cassano, Sabrina Ruixin Hu, Jianang Yang, William E. Byrd, Robert Zinkov, and Nada Amin. VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search, May 2024

  17. [17]

    AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement, December 2024

    Pranjal Aggarwal, Bryan Parno, and Sean Welleck. AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement, December 2024

  18. [18]

    Qwen3 Technical Report, May 2025

    Qwen Team. Qwen3 Technical Report, May 2025

  19. [19]

    Gemma 4: Byte for byte, the most capable open models, April 2026

    Clement Farabet and Olivier Lacombe. Gemma 4: Byte for byte, the most capable open models, April 2026

  20. [20]

    Gonzalez, and Ion Stoica

    Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. S*: Test Time Scaling for Code Generation. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 15964–15978. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-emnlp.865

  21. [21]

    Ruchi Puri, David S. Kung, G. Janssen, Wei Zhang, Giacomo Domeniconi, Vladmir A. Zolotov, Julian Dolby, Jie Chen, M. Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, and Ulrich Finkler. CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks.NeurIPS Datasets and Benchmarks, May 2021

  22. [22]

    Do Machine Learning Models Produce TypeScript Types That Type Check?LIPIcs, Volume 263, ECOOP 2023, 263:37:1–37:28, 2023

    Ming-Ho Yee and Arjun Guha. Do Machine Learning Models Produce TypeScript Types That Type Check?LIPIcs, Volume 263, ECOOP 2023, 263:37:1–37:28, 2023. ISSN 1868-8969. doi: 10.4230/LIPIcs.ECOOP.2023.37

  23. [23]

    Wang, Antonia Creswell, G

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, L. Wang, Antonia Creswell, G. Irving, and I. Higgins. Solving math word problems with process- and outcome- based feedback, November 2022

  24. [24]

    Training Verifiers to Solve Math Word Problems, October 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems, October 2021

  25. [25]

    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. InProceedings of the 62nd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 9426–9439. Association for Computational Linguistics...

  26. [26]

    Improve Mathematical Reasoning in Language Models by Automated Process Supervision, June 2024

    Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve Mathematical Reasoning in Language Models by Automated Process Supervision, June 2024

  27. [27]

    Probabilistic Programs of Thought, April 2026

    Poorva Garg, Renato Lui Geh, Daniel Israel, Todd Millstein, Kyle Richardson, and Guy Van den Broeck. Probabilistic Programs of Thought, April 2026

  28. [28]

    doi:10.18653/v1/2021.emnlp-main.779

    Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9895–9901. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.779. 11

  29. [29]

    Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani

    Gabriel Poesia, Oleksandr Polozov, Vu Le, A. Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. Synchromesh: Reliable code generation from pre-trained language models. In International Conference on Learning Representations, January 2022

  30. [30]

    Willard and R´emi Louf

    Brandon T. Willard and R´emi Louf. Efficient Guided Generation for Large Language Models, July 2023

  31. [31]

    doi:10.1145/3591300

    Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting Is Programming: A Query Language for Large Language Models.Proceedings of the ACM on Programming Languages, 7 (PLDI):1946–1969, June 2023. ISSN 2475-1421. doi: 10.1145/3591300

  32. [32]

    Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning

    Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10932–10952. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.674

  33. [33]

    Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context

    Lakshya Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K Lahiri, and Sriram Rajamani. Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context. In Advances in Neural Information Processing Systems, 2023

  34. [34]

    Copiloting the copilots: Fusing large language models with completion engines for automated program repair

    Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, pages 172–184, New York, NY , USA, November 20...

  35. [35]

    Type- Constrained Code Generation with Language Models

    Niels M ¨undler, Jingxuan He, Hao Wang, Koushik Sen, Dawn Song, and Martin Vechev. Type- Constrained Code Generation with Language Models. InProgramming Language Design and Implementation, May 2025. doi: 10.1145/3729274

  36. [36]

    CRANE: Reasoning with constrained LLM generation

    Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, and Gagandeep Singh. CRANE: Reasoning with constrained LLM generation. InInternational Conference on Machine Learning, February 2025

  37. [37]

    TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation, November 2025

    Henrijs Princis, Arindam Sharma, and Cristina David. TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation, November 2025

  38. [38]

    Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents, April 2026

    Anh Ta, Junjie Zhu, and Shahin Shayandeh. Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents, April 2026

  39. [39]

    SACTOR: LLM-Driven Correct and Idiomatic C to Rust Translation with Static Analysis and FFI-Based Verification, March 2025

    Tianyang Zhou, Ziyi Zhang, Haowen Lin, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, and Varun Chandrasekaran. SACTOR: LLM-Driven Correct and Idiomatic C to Rust Translation with Static Analysis and FFI-Based Verification, March 2025

  40. [40]

    Understanding TypeScript

    Gavin Bierman, Mart´ın Abadi, and Mads Torgersen. Understanding TypeScript. InLecture Notes in Computer Science, pages 257–281. Springer Berlin Heidelberg, 2014. ISBN 978-3- 662-44201-2 978-3-662-44202-9. doi: 10.1007/978-3-662-44202-9 11

  41. [41]

    10"; L2 let total = count * 2; L3 let avg = total / 4; L4 println!(

    Vincent J. Hellendoorn, Christian Bird, Earl T. Barr, and Miltiadis Allamanis. Deep learning type inference. InProceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 152–162. ACM, October 2018. doi: 10.1145/3236024.3236051. 12 APPENDIX A Limitations DTV res...

  42. [42]

    unexpected EOF

    augment the grammar to preserve reasoning capacity. More recent framework work generalizes this with pluggable search strategies and composable constraints: Princis et al. [37] unify sampling, beam search, MCTS, SMC, and ASAp under a tree-search abstraction that admits both token-level syntactic constraints and whole-program execution constraints (e.g., u...