Verifier-Guided Code Translation via Meta-Step Decoding

Kirill Levchenko; Mihai Christodorescu; Somesh Jha; Tianyang Zhou; Varun Chandrasekaran

arxiv: 2605.17626 · v1 · pith:65R2ABKMnew · submitted 2026-05-17 · 💻 cs.LG · cs.SE

Verifier-Guided Code Translation via Meta-Step Decoding

Tianyang Zhou , Somesh Jha , Mihai Christodorescu , Kirill Levchenko , Varun Chandrasekaran This is my paper

Pith reviewed 2026-05-20 14:12 UTC · model grok-4.3

classification 💻 cs.LG cs.SE

keywords code translationverifier-guided decodingtest-time scalinglarge language modelsstructural boundarieserror propagation preventionC-to-RustJavaScript-to-TypeScript

0 comments

The pith

Verifier-guided decoding at structural boundaries during generation improves code translation success rates and efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that code translation by large language models can be improved by guiding the generation process itself with verifier feedback at key structural points, rather than checking only after the full output is produced. The core idea is to detect boundaries such as function ends or statement completions in the emerging code and immediately verify those prefixes using compilers or type checkers. If an error is found, the system rolls back to the last valid point instead of letting the mistake corrupt everything that follows. A sympathetic reader would care because this uses inference compute more efficiently on tasks where exact feedback is available, potentially making automated code migration more reliable without needing to generate and discard many full attempts. The results indicate better pass rates at lower token cost than standard self-refinement methods.

Core claim

Decoding Time Verification (DTV) treats structural boundaries as meta-steps in the decoding process. A state-machine controller interleaves token generation with verifier calls on valid prefixes, and applies structure-aware rollback when errors are detected, thereby preventing error propagation in the autoregressive output for code translation from C to Rust and from JavaScript to TypeScript.

What carries the argument

Decoding Time Verification (DTV) framework, which uses a state-machine to interleave generation with verifier-guided checks at structural boundaries and rollback mechanisms.

If this is right

DTV raises pass rates from 72.3% to 82.0% on C-to-Rust translation using the Qwen3-4B model under matched token budgets.
Pass rates improve from 33.3% to 46.0% on JavaScript-to-TypeScript translation with the same setup.
The method consumes fewer tokens per case than matched self-refinement baselines.
Similar gains appear when using the Gemma-4-E4B model.
DTV achieves a superior pass-rate versus cost tradeoff compared to post-generation verification approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If boundary detection works reliably across languages, the technique could apply to other generation tasks with deterministic verifiers such as formal proofs or API specifications.
Guided decoding might allow smaller models to match the performance of larger ones on structured output tasks by spending compute on corrections early rather than late.
Future work could test whether combining DTV with sampling-based methods yields further gains beyond what either achieves alone.

Load-bearing premise

The method depends on being able to reliably identify structural boundaries in the partially generated target code at the moment they are reached.

What would settle it

Running the same experiments but finding no improvement in pass rates or higher token consumption with DTV compared to the self-refinement baselines would indicate the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.17626 by Kirill Levchenko, Mihai Christodorescu, Somesh Jha, Tianyang Zhou, Varun Chandrasekaran.

**Figure 1.** Figure 1: GENERATION extends the target prefix to the next structural boundary (statement, block, function, or program). VERIFICATION renders the prefix and runs applicable oracles, committing on pass. RECOVERY parses diagnostics on fail, rolls back to the last checkpoint with augmented context, and escalates scope on repeated failure. The loop terminates on program-level success or budget bailout. intermediate reas… view at source ↗

**Figure 2.** Figure 2: DTV/self-refine Pareto-dominates na¨ıve/self-refine and sits on the upper-left of the costpass-rate frontier on both tasks, above the BoN-N and S* frontiers at matched per-case cost. DTV/self-refine is also more token-efficient than na¨ıve/self-refine while attaining a higher pass rate: on C to Rust the two configurations average k = 5.28 and k = 7.18 tokens per case respectively. DTV/self-refine therefor… view at source ↗

**Figure 3.** Figure 3: Paired round-count ratios (rna¨ıve/rDTV) on the both-pass subset of self-refine. Markers: median (white circle), mean (red triangle). Differential-testing pass rates. They follow the same ordering: DTV-full attains 10.0%, while no-feedback, no-escalation, and detect-and-abort reach only 8.0%, 5.0%, and 8.0% respectively. DTV-full therefore leads on behavioral correctness as well, ruling out the possibility… view at source ↗

**Figure 4.** Figure 4: Three verification paradigms on an illustrative C-to-Rust translation. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: The two repair modes (Section 2.2) on the same rustc E0277 failure. (left) Inline continuation: diagnostic appended as a comment, model continues in the same turn. (right) Patchbased repair: structured user turn, model returns a diff patch in a fresh assistant turn. constrained to the rollback scope; outputs that violate the constraint are rejected and the controller advances along the retry ladder. The c… view at source ↗

**Figure 6.** Figure 6: Full-set head-to-head comparison on C to Rust ( [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: Cross-model cumulative pass rate vs. normalized per-case cost [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Per-case distribution of the shared-code edit-concentration ratio on rescued cases where [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗

read the original abstract

Test-time scaling is an important mechanism for improving large language models, especially on tasks with deterministic verifiers. Code translation is a canonical example: the source program constrains valid outputs, while compilers, type check- ers, and behavioral checks provide exact pass/fail feedback. Existing approaches typically apply these verifiers only after generation, which is inefficient because early errors corrupt the autoregressive context and are rarely corrected later. We introduce Decoding Time Verification (DTV), a framework that treats structural boundaries as meta steps for verifier-guided decoding. DTV interleaves generation with verifier calls under a state-machine controller that enforces valid prefixes, using structural-boundary checks and structure-aware rollback to prevent error propagation while reducing wasted tokens. We evaluate DTV on C-to-Rust and JavaScript-to-TypeScript translation. Using Qwen3-4B as the primary generator under matched token budgets, DTV improves pass rates from 72.3% to 82.0% on C-to-Rust and from 33.3% to 46.0% on JavaScript-to-TypeScript relative to matched self-refinement baselines, while using fewer tokens per case; the same trend largely transfers to Gemma-4-E4B. In the evaluated cost-matched grid, DTV achieves a more favorable pass-rate-cost tradeoff than post-hoc verification or sampling-based scaling. These results show that verifier-guided decoding is an effective use of inference-time compute for code translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DTV gets solid pass-rate gains on code translation by verifying at structural boundaries during generation rather than after, but the boundary detector itself lacks direct validation.

read the letter

The main point is that this paper shows a practical way to use inference compute on code translation by calling verifiers at detected structural points in the middle of autoregressive generation instead of only at the end. They wrap it in a state-machine controller that does structure-aware rollbacks when a prefix fails the check. On Qwen3-4B with matched token budgets, they report lifting C-to-Rust pass rates from 72.3% to 82.0% and JavaScript-to-TypeScript from 33.3% to 46.0% versus self-refinement baselines, while spending fewer tokens overall. The same pattern holds for the other model they tested, and the cost-pass-rate curve looks better than post-hoc verification or extra sampling.

Referee Report

2 major / 2 minor

Summary. The paper introduces Decoding Time Verification (DTV), a framework that treats structural boundaries as meta-steps for verifier-guided decoding in code translation. It uses a state-machine controller to interleave autoregressive generation with verifier calls at detected boundaries, applies structure-aware rollback to enforce valid prefixes, and reports empirical pass-rate gains under matched token budgets: from 72.3% to 82.0% on C-to-Rust and 33.3% to 46.0% on JavaScript-to-TypeScript using Qwen3-4B relative to self-refinement baselines, with similar trends for Gemma-4-E4B and a more favorable pass-rate-cost tradeoff than post-hoc verification.

Significance. If the gains are robustly due to the proposed mechanism, DTV could meaningfully improve the efficiency of test-time scaling for code tasks with deterministic verifiers by enabling early error correction rather than post-generation fixes. The cost-matched grid evaluation and explicit token-budget comparisons are strengths that make the empirical claims more actionable than typical sampling-based scaling results.

major comments (2)

[§3 (DTV framework and state-machine controller)] §3 (DTV framework and state-machine controller): The central claim that structural-boundary detection enables timely verifier feedback to prevent error propagation rests on an unvalidated assumption. No accuracy metrics for boundary detection, failure-case analysis on malformed prefixes or language-specific constructs, or ablation removing the detector are provided; if detection fails, the controller either skips verifiers (reducing to baseline) or triggers spurious rollbacks that could waste tokens or introduce new errors.
[§4 (Experimental evaluation)] §4 (Experimental evaluation): Token accounting, baseline equivalence, and rollback logic details are insufficient to confirm that the reported pass-rate improvements (e.g., 72.3%→82.0% on C-to-Rust) arise from DTV rather than selection bias or unequal effective compute. The manuscript does not specify how rollback affects the final candidate pool or whether verifier calls are strictly budget-matched across methods.

minor comments (2)

[Abstract] The abstract states that 'the same trend largely transfers to Gemma-4-E4B' but provides no numerical pass rates or token counts for this model, making it hard to assess transferability.
[§3] Notation for the state-machine states and rollback triggers could be clarified with a small diagram or pseudocode to make the controller logic easier to follow without re-reading the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we will revise the manuscript to provide the requested validation and details.

read point-by-point responses

Referee: [§3 (DTV framework and state-machine controller)] §3 (DTV framework and state-machine controller): The central claim that structural-boundary detection enables timely verifier feedback to prevent error propagation rests on an unvalidated assumption. No accuracy metrics for boundary detection, failure-case analysis on malformed prefixes or language-specific constructs, or ablation removing the detector are provided; if detection fails, the controller either skips verifiers (reducing to baseline) or triggers spurious rollbacks that could waste tokens or introduce new errors.

Authors: We agree that explicit validation of the boundary detector is needed to support the central claim. In the revised manuscript we will add precision/recall/F1 metrics for the detector on a held-out set of 500 snippets per translation direction. We will also include a failure-case analysis covering malformed prefixes and language-specific constructs (C macros, TypeScript generics). Finally, we will add an ablation that disables the detector while keeping the rest of the controller fixed, showing that performance drops to near-baseline levels and confirming that gains are not due to skipping verifiers or spurious rollbacks. These results will appear in §3 and the appendix. revision: yes
Referee: [§4 (Experimental evaluation)] §4 (Experimental evaluation): Token accounting, baseline equivalence, and rollback logic details are insufficient to confirm that the reported pass-rate improvements (e.g., 72.3%→82.0% on C-to-Rust) arise from DTV rather than selection bias or unequal effective compute. The manuscript does not specify how rollback affects the final candidate pool or whether verifier calls are strictly budget-matched across methods.

Authors: We acknowledge that the current description of token accounting and rollback effects is insufficient. In the revision we will expand §4 with a precise accounting of how each verifier call and rollback is charged against the token budget, and we will state that the generation length is adjusted so that total tokens (including verifier overhead) are matched across DTV, self-refinement, and post-hoc baselines. We will also clarify that structure-aware rollbacks revert only to the last verified prefix and that the final candidate pool contains only fully verified outputs; this maintains equivalence in the search space and rules out selection bias. These additions will confirm that the reported gains are attributable to the DTV mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results are direct experimental comparisons

full rationale

The paper introduces the DTV framework and reports pass-rate improvements on C-to-Rust and JavaScript-to-TypeScript translation tasks using Qwen3-4B and Gemma-4-E4B under matched token budgets. These gains are presented as measured outcomes relative to self-refinement baselines, with no equations, first-principles derivations, or predictions that reduce by construction to fitted inputs or self-citations. The central claims rest on empirical evaluation rather than any tautological reduction, rendering the reported results self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the existence of reliable structural-boundary detectors and deterministic verifiers; these are treated as given rather than derived.

axioms (2)

domain assumption Compilers, type checkers, and behavioral checks provide exact pass/fail feedback on partial code prefixes
Stated directly in the abstract as the basis for verifier-guided decoding.
domain assumption Structural boundaries in the target language can be identified reliably during token-by-token generation
Required for the meta-step controller to decide when to invoke verifiers.

invented entities (1)

Decoding Time Verification (DTV) state-machine controller no independent evidence
purpose: Enforces valid prefixes and triggers structure-aware rollback during generation
Newly introduced framework; no independent evidence outside this work is provided in the abstract.

pith-pipeline@v0.9.0 · 5803 in / 1511 out tokens · 51926 ms · 2026-05-20T14:12:14.537513+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1]

Scaling LLM Test-Time Com- pute Optimally Can be More Effective than Scaling Parameters for Reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM Test-Time Com- pute Optimally Can be More Effective than Scaling Parameters for Reasoning. InInternational Conference on Learning Representations, 2025

work page 2025
[2]

S1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. S1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332. Association for Computational Linguistics, 2025. doi...

work page doi:10.18653/v1/2025.emnlp-main.1025 2025
[3]

A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning, January 2025

Yixin Ji, Juntao Li, Yang Xiang, Hai Ye, Kaixin Wu, Kai Yao, Jia Xu, Linjian Mo, and Min Zhang. A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning, January 2025

work page 2025
[4]

Let’s Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s Verify Step by Step. In International Conference on Learning Representations, 2024

work page 2024
[5]

OpenAI o1 System Card, December 2024

OpenAI. OpenAI o1 System Card, December 2024

work page 2024
[6]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R´emi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Push...

work page doi:10.1126/science.abq1158 2022
[7]

Le, Christopher R´e, and Azalia Mirhoseini

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher R´e, and Azalia Mirhoseini. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, July 2024

work page 2024
[8]

Griffiths, Yuan Cao, and Karthik R Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[9]

2025.IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and Toufique Ahmed. Calibration and Correctness of Language Models for Code. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 540–552. IEEE, April 2025. doi: 10.1109/icse55347.2025.00040

work page doi:10.1109/icse55347.2025.00040 2025
[10]

Process Reward Models That Think, April 2025

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process Reward Models That Think, April 2025

work page 2025
[11]

LEVER: Learning to Verify Language-to-Code Generation with Execution

Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. LEVER: Learning to Verify Language-to-Code Generation with Execution. In International Conference on Machine Learning, 2023

work page 2023
[12]

Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen

Bei Chen, Fengji Zhang, A. Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. CodeT: Code Generation with Generated Tests. InInternational Conference on Learning Representations, July 2022. 10

work page 2022
[13]

Teaching Large Language Models to Self-Debug

Xinyun Chen, Maxwell Lin, Nathanael Sch¨arli, and Denny Zhou. Teaching Large Language Models to Self-Debug. InInternational Conference on Learning Representations, 2024

work page 2024
[14]

Kushal Arora, Layla El Asri, Hareesh Bahuleyan, and J. Cheung. Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation.Findings, April 2022

work page 2022
[15]

Execution Guided Line-by-Line Code Generation, June 2025

Boaz Lavon, Shahar Katz, and Lior Wolf. Execution Guided Line-by-Line Code Generation, June 2025

work page 2025
[16]

Byrd, Robert Zinkov, and Nada Amin

David Brandfonbrener, Simon Henniger, Sibi Raja, Tarun Prasad, Chloe Loughridge, Federico Cassano, Sabrina Ruixin Hu, Jianang Yang, William E. Byrd, Robert Zinkov, and Nada Amin. VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search, May 2024

work page 2024
[17]

AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement, December 2024

Pranjal Aggarwal, Bryan Parno, and Sean Welleck. AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement, December 2024

work page 2024
[18]

Qwen3 Technical Report, May 2025

Qwen Team. Qwen3 Technical Report, May 2025

work page 2025
[19]

Gemma 4: Byte for byte, the most capable open models, April 2026

Clement Farabet and Olivier Lacombe. Gemma 4: Byte for byte, the most capable open models, April 2026

work page 2026
[20]

Gonzalez, and Ion Stoica

Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. S*: Test Time Scaling for Code Generation. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 15964–15978. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-emnlp.865

work page doi:10.18653/v1/2025.findings-emnlp.865 2025
[21]

Ruchi Puri, David S. Kung, G. Janssen, Wei Zhang, Giacomo Domeniconi, Vladmir A. Zolotov, Julian Dolby, Jie Chen, M. Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, and Ulrich Finkler. CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks.NeurIPS Datasets and Benchmarks, May 2021

work page 2021
[22]

Do Machine Learning Models Produce TypeScript Types That Type Check?LIPIcs, Volume 263, ECOOP 2023, 263:37:1–37:28, 2023

Ming-Ho Yee and Arjun Guha. Do Machine Learning Models Produce TypeScript Types That Type Check?LIPIcs, Volume 263, ECOOP 2023, 263:37:1–37:28, 2023. ISSN 1868-8969. doi: 10.4230/LIPIcs.ECOOP.2023.37

work page doi:10.4230/lipics.ecoop.2023.37 2023
[23]

Wang, Antonia Creswell, G

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, L. Wang, Antonia Creswell, G. Irving, and I. Higgins. Solving math word problems with process- and outcome- based feedback, November 2022

work page 2022
[24]

Training Verifiers to Solve Math Word Problems, October 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems, October 2021

work page 2021
[25]

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. InProceedings of the 62nd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 9426–9439. Association for Computational Linguistics...

work page doi:10.18653/v1/2024.acl-long.510 2024
[26]

Improve Mathematical Reasoning in Language Models by Automated Process Supervision, June 2024

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve Mathematical Reasoning in Language Models by Automated Process Supervision, June 2024

work page 2024
[27]

Probabilistic Programs of Thought, April 2026

Poorva Garg, Renato Lui Geh, Daniel Israel, Todd Millstein, Kyle Richardson, and Guy Van den Broeck. Probabilistic Programs of Thought, April 2026

work page 2026
[28]

doi:10.18653/v1/2021.emnlp-main.779

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9895–9901. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.779. 11

work page doi:10.18653/v1/2021.emnlp-main.779 2021
[29]

Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani

Gabriel Poesia, Oleksandr Polozov, Vu Le, A. Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. Synchromesh: Reliable code generation from pre-trained language models. In International Conference on Learning Representations, January 2022

work page 2022
[30]

Willard and R´emi Louf

Brandon T. Willard and R´emi Louf. Efficient Guided Generation for Large Language Models, July 2023

work page 2023
[31]

doi:10.1145/3591300

Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting Is Programming: A Query Language for Large Language Models.Proceedings of the ACM on Programming Languages, 7 (PLDI):1946–1969, June 2023. ISSN 2475-1421. doi: 10.1145/3591300

work page doi:10.1145/3591300 1946
[32]

Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10932–10952. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.674

work page doi:10.18653/v1/2023.emnlp-main.674 2023
[33]

Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context

Lakshya Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K Lahiri, and Sriram Rajamani. Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context. In Advances in Neural Information Processing Systems, 2023

work page 2023
[34]

Copiloting the copilots: Fusing large language models with completion engines for automated program repair

Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, pages 172–184, New York, NY , USA, November 20...

work page doi:10.1145/3611643.3616271 2023
[35]

Type- Constrained Code Generation with Language Models

Niels M ¨undler, Jingxuan He, Hao Wang, Koushik Sen, Dawn Song, and Martin Vechev. Type- Constrained Code Generation with Language Models. InProgramming Language Design and Implementation, May 2025. doi: 10.1145/3729274

work page doi:10.1145/3729274 2025
[36]

CRANE: Reasoning with constrained LLM generation

Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, and Gagandeep Singh. CRANE: Reasoning with constrained LLM generation. InInternational Conference on Machine Learning, February 2025

work page 2025
[37]

TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation, November 2025

Henrijs Princis, Arindam Sharma, and Cristina David. TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation, November 2025

work page 2025
[38]

Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents, April 2026

Anh Ta, Junjie Zhu, and Shahin Shayandeh. Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents, April 2026

work page 2026
[39]

SACTOR: LLM-Driven Correct and Idiomatic C to Rust Translation with Static Analysis and FFI-Based Verification, March 2025

Tianyang Zhou, Ziyi Zhang, Haowen Lin, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, and Varun Chandrasekaran. SACTOR: LLM-Driven Correct and Idiomatic C to Rust Translation with Static Analysis and FFI-Based Verification, March 2025

work page 2025
[40]

Understanding TypeScript

Gavin Bierman, Mart´ın Abadi, and Mads Torgersen. Understanding TypeScript. InLecture Notes in Computer Science, pages 257–281. Springer Berlin Heidelberg, 2014. ISBN 978-3- 662-44201-2 978-3-662-44202-9. doi: 10.1007/978-3-662-44202-9 11

work page doi:10.1007/978-3-662-44202-9 2014
[41]

10"; L2 let total = count * 2; L3 let avg = total / 4; L4 println!(

Vincent J. Hellendoorn, Christian Bird, Earl T. Barr, and Miltiadis Allamanis. Deep learning type inference. InProceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 152–162. ACM, October 2018. doi: 10.1145/3236024.3236051. 12 APPENDIX A Limitations DTV res...

work page doi:10.1145/3236024.3236051 2018
[42]

unexpected EOF

augment the grammar to preserve reasoning capacity. More recent framework work generalizes this with pluggable search strategies and composable constraints: Princis et al. [37] unify sampling, beam search, MCTS, SMC, and ASAp under a tree-search abstraction that admits both token-level syntactic constraints and whole-program execution constraints (e.g., u...

work page 2048

[1] [1]

Scaling LLM Test-Time Com- pute Optimally Can be More Effective than Scaling Parameters for Reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM Test-Time Com- pute Optimally Can be More Effective than Scaling Parameters for Reasoning. InInternational Conference on Learning Representations, 2025

work page 2025

[2] [2]

S1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. S1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332. Association for Computational Linguistics, 2025. doi...

work page doi:10.18653/v1/2025.emnlp-main.1025 2025

[3] [3]

A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning, January 2025

Yixin Ji, Juntao Li, Yang Xiang, Hai Ye, Kaixin Wu, Kai Yao, Jia Xu, Linjian Mo, and Min Zhang. A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning, January 2025

work page 2025

[4] [4]

Let’s Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s Verify Step by Step. In International Conference on Learning Representations, 2024

work page 2024

[5] [5]

OpenAI o1 System Card, December 2024

OpenAI. OpenAI o1 System Card, December 2024

work page 2024

[6] [6]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R´emi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Push...

work page doi:10.1126/science.abq1158 2022

[7] [7]

Le, Christopher R´e, and Azalia Mirhoseini

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher R´e, and Azalia Mirhoseini. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, July 2024

work page 2024

[8] [8]

Griffiths, Yuan Cao, and Karthik R Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. InAdvances in Neural Information Processing Systems, 2023

work page 2023

[9] [9]

2025.IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and Toufique Ahmed. Calibration and Correctness of Language Models for Code. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 540–552. IEEE, April 2025. doi: 10.1109/icse55347.2025.00040

work page doi:10.1109/icse55347.2025.00040 2025

[10] [10]

Process Reward Models That Think, April 2025

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process Reward Models That Think, April 2025

work page 2025

[11] [11]

LEVER: Learning to Verify Language-to-Code Generation with Execution

Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. LEVER: Learning to Verify Language-to-Code Generation with Execution. In International Conference on Machine Learning, 2023

work page 2023

[12] [12]

Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen

Bei Chen, Fengji Zhang, A. Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. CodeT: Code Generation with Generated Tests. InInternational Conference on Learning Representations, July 2022. 10

work page 2022

[13] [13]

Teaching Large Language Models to Self-Debug

Xinyun Chen, Maxwell Lin, Nathanael Sch¨arli, and Denny Zhou. Teaching Large Language Models to Self-Debug. InInternational Conference on Learning Representations, 2024

work page 2024

[14] [14]

Kushal Arora, Layla El Asri, Hareesh Bahuleyan, and J. Cheung. Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation.Findings, April 2022

work page 2022

[15] [15]

Execution Guided Line-by-Line Code Generation, June 2025

Boaz Lavon, Shahar Katz, and Lior Wolf. Execution Guided Line-by-Line Code Generation, June 2025

work page 2025

[16] [16]

Byrd, Robert Zinkov, and Nada Amin

David Brandfonbrener, Simon Henniger, Sibi Raja, Tarun Prasad, Chloe Loughridge, Federico Cassano, Sabrina Ruixin Hu, Jianang Yang, William E. Byrd, Robert Zinkov, and Nada Amin. VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search, May 2024

work page 2024

[17] [17]

AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement, December 2024

Pranjal Aggarwal, Bryan Parno, and Sean Welleck. AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement, December 2024

work page 2024

[18] [18]

Qwen3 Technical Report, May 2025

Qwen Team. Qwen3 Technical Report, May 2025

work page 2025

[19] [19]

Gemma 4: Byte for byte, the most capable open models, April 2026

Clement Farabet and Olivier Lacombe. Gemma 4: Byte for byte, the most capable open models, April 2026

work page 2026

[20] [20]

Gonzalez, and Ion Stoica

Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. S*: Test Time Scaling for Code Generation. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 15964–15978. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-emnlp.865

work page doi:10.18653/v1/2025.findings-emnlp.865 2025

[21] [21]

Ruchi Puri, David S. Kung, G. Janssen, Wei Zhang, Giacomo Domeniconi, Vladmir A. Zolotov, Julian Dolby, Jie Chen, M. Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, and Ulrich Finkler. CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks.NeurIPS Datasets and Benchmarks, May 2021

work page 2021

[22] [22]

Do Machine Learning Models Produce TypeScript Types That Type Check?LIPIcs, Volume 263, ECOOP 2023, 263:37:1–37:28, 2023

Ming-Ho Yee and Arjun Guha. Do Machine Learning Models Produce TypeScript Types That Type Check?LIPIcs, Volume 263, ECOOP 2023, 263:37:1–37:28, 2023. ISSN 1868-8969. doi: 10.4230/LIPIcs.ECOOP.2023.37

work page doi:10.4230/lipics.ecoop.2023.37 2023

[23] [23]

Wang, Antonia Creswell, G

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, L. Wang, Antonia Creswell, G. Irving, and I. Higgins. Solving math word problems with process- and outcome- based feedback, November 2022

work page 2022

[24] [24]

Training Verifiers to Solve Math Word Problems, October 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems, October 2021

work page 2021

[25] [25]

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. InProceedings of the 62nd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 9426–9439. Association for Computational Linguistics...

work page doi:10.18653/v1/2024.acl-long.510 2024

[26] [26]

Improve Mathematical Reasoning in Language Models by Automated Process Supervision, June 2024

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve Mathematical Reasoning in Language Models by Automated Process Supervision, June 2024

work page 2024

[27] [27]

Probabilistic Programs of Thought, April 2026

Poorva Garg, Renato Lui Geh, Daniel Israel, Todd Millstein, Kyle Richardson, and Guy Van den Broeck. Probabilistic Programs of Thought, April 2026

work page 2026

[28] [28]

doi:10.18653/v1/2021.emnlp-main.779

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9895–9901. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.779. 11

work page doi:10.18653/v1/2021.emnlp-main.779 2021

[29] [29]

Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani

Gabriel Poesia, Oleksandr Polozov, Vu Le, A. Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. Synchromesh: Reliable code generation from pre-trained language models. In International Conference on Learning Representations, January 2022

work page 2022

[30] [30]

Willard and R´emi Louf

Brandon T. Willard and R´emi Louf. Efficient Guided Generation for Large Language Models, July 2023

work page 2023

[31] [31]

doi:10.1145/3591300

Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting Is Programming: A Query Language for Large Language Models.Proceedings of the ACM on Programming Languages, 7 (PLDI):1946–1969, June 2023. ISSN 2475-1421. doi: 10.1145/3591300

work page doi:10.1145/3591300 1946

[32] [32]

Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10932–10952. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.674

work page doi:10.18653/v1/2023.emnlp-main.674 2023

[33] [33]

Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context

Lakshya Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K Lahiri, and Sriram Rajamani. Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context. In Advances in Neural Information Processing Systems, 2023

work page 2023

[34] [34]

Copiloting the copilots: Fusing large language models with completion engines for automated program repair

Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, pages 172–184, New York, NY , USA, November 20...

work page doi:10.1145/3611643.3616271 2023

[35] [35]

Type- Constrained Code Generation with Language Models

Niels M ¨undler, Jingxuan He, Hao Wang, Koushik Sen, Dawn Song, and Martin Vechev. Type- Constrained Code Generation with Language Models. InProgramming Language Design and Implementation, May 2025. doi: 10.1145/3729274

work page doi:10.1145/3729274 2025

[36] [36]

CRANE: Reasoning with constrained LLM generation

Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, and Gagandeep Singh. CRANE: Reasoning with constrained LLM generation. InInternational Conference on Machine Learning, February 2025

work page 2025

[37] [37]

TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation, November 2025

Henrijs Princis, Arindam Sharma, and Cristina David. TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation, November 2025

work page 2025

[38] [38]

Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents, April 2026

Anh Ta, Junjie Zhu, and Shahin Shayandeh. Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents, April 2026

work page 2026

[39] [39]

SACTOR: LLM-Driven Correct and Idiomatic C to Rust Translation with Static Analysis and FFI-Based Verification, March 2025

Tianyang Zhou, Ziyi Zhang, Haowen Lin, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, and Varun Chandrasekaran. SACTOR: LLM-Driven Correct and Idiomatic C to Rust Translation with Static Analysis and FFI-Based Verification, March 2025

work page 2025

[40] [40]

Understanding TypeScript

Gavin Bierman, Mart´ın Abadi, and Mads Torgersen. Understanding TypeScript. InLecture Notes in Computer Science, pages 257–281. Springer Berlin Heidelberg, 2014. ISBN 978-3- 662-44201-2 978-3-662-44202-9. doi: 10.1007/978-3-662-44202-9 11

work page doi:10.1007/978-3-662-44202-9 2014

[41] [41]

10"; L2 let total = count * 2; L3 let avg = total / 4; L4 println!(

Vincent J. Hellendoorn, Christian Bird, Earl T. Barr, and Miltiadis Allamanis. Deep learning type inference. InProceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 152–162. ACM, October 2018. doi: 10.1145/3236024.3236051. 12 APPENDIX A Limitations DTV res...

work page doi:10.1145/3236024.3236051 2018

[42] [42]

unexpected EOF

augment the grammar to preserve reasoning capacity. More recent framework work generalizes this with pluggable search strategies and composable constraints: Princis et al. [37] unify sampling, beam search, MCTS, SMC, and ASAp under a tree-search abstraction that admits both token-level syntactic constraints and whole-program execution constraints (e.g., u...

work page 2048