TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation

Arindam Sharma; Cristina David; Henrijs Princis

arxiv: 2511.22277 · v2 · submitted 2025-11-27 · 💻 cs.LG

TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation

Henrijs Princis , Arindam Sharma , Cristina David This is my paper

Pith reviewed 2026-05-17 04:16 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM code generationdecoding strategiesconstraint optimizationtree searchMBPP benchmarkSQL-Spidersystematic tuning

0 comments

The pith

TreeCoder models LLM decoding for code as an optimizable tree search over constraints to raise output accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TreeCoder as a framework that casts the decoding step in large language models for code generation as a tree search across possible program candidates. Decoding strategies and constraint functions for syntax, style, and execution are made into adjustable, first-class parts that can be tuned automatically with standard optimization methods. This moves the enforcement of correctness into the generation process itself instead of relying only on prompts. If the method works as described, models would produce more valid code on tasks such as Python function writing and SQL query construction without extra training. Reported tests on the MBPP and SQL-Spider sets show higher success rates for several open models compared with ordinary decoding.

Core claim

TreeCoder represents decoding as a tree search over candidate programs, where both decoding strategies and constraint functions such as style, syntax, and execution are treated as first-class, optimisable components that enable systematic exploration and automatic tuning using standard optimisation techniques, resulting in improved accuracy on the MBPP Python and SQL-Spider benchmarks for models including CodeLlama, Mistral, and DeepSeek.

What carries the argument

TreeCoder, a framework that represents decoding as a tree search over candidate programs and treats decoding strategies and constraint functions as first-class optimisable components.

Load-bearing premise

That turning decoding into a tree search with constraints as tunable components will deliver accuracy gains at acceptable computational cost and without losing flexibility.

What would settle it

If experiments on the same benchmarks show that tuned TreeCoder configurations produce no accuracy lift over unconstrained decoding or incur runtimes too high for practical use.

Figures

Figures reproduced from arXiv: 2511.22277 by Arindam Sharma, Cristina David, Henrijs Princis.

**Figure 2.** Figure 2: Expand without rollouts (left) and expand with rollouts (right). The [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗

**Figure 3.** Figure 3: Update step of all five decoding functions [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Prune and move steps of all five decoding functions [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Partial order of constraints. These constraints can act individually or jointly during decoding, guiding the search toward syntactically valid, well-structured, and semantically meaningful programs (more details on the constraints and their use are given in the experimental evaluation in Section 7). Formally, they are composed under the product-of-experts formulation introduced earlier, where the overall c… view at source ↗

**Figure 6.** Figure 6: Constraint function examples: Syntax (checks syntactic validity), No-Comments (suppress comments), No-Explanations (encourage code over natural language explanations), Unit-Tests (validate generated code via assertions). function. As a result, TreeCoder enables complex constraint behaviours—such as enforcing syntax, style, and functional correctness simultaneously—without altering the underlying decoding l… view at source ↗

read the original abstract

Large language models (LLMs) have shown remarkable ability to generate code, yet their outputs often violate syntactic or semantic constraints when guided only through natural language prompts. We introduce TreeCoder, the most general and flexible framework to date for exploring decoding strategies, constraints, and hyperparameters in LLMs, and use it in code generation to enforce correctness and structure during decoding rather than relying on prompt engineering. TreeCoder represents decoding as a tree search over candidate programs, where both decoding strategies and constraint functions - such as style, syntax, execution - are treated as first-class, optimisable components. This design enables systematic exploration and automatic tuning of decoding configurations using standard optimisation techniques. Experiments on the MBPP (Python) and SQL-Spider benchmarks show that TreeCoder consistently improves accuracy across open-source models such as CodeLlama, Mistral and DeepSeek, often outperforming their unconstrained baselines by considerable margins.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TreeCoder frames decoding as joint tree-search optimization over strategies and constraints, with reported accuracy gains, but efficiency controls are the open question.

read the letter

The punchline is that TreeCoder presents a tree search framework for optimizing both decoding and constraints in LLM code generation, showing accuracy improvements on MBPP and SQL-Spider, but the lack of efficiency metrics makes the practicality hard to assess. The new part is the treatment of decoding strategies and constraint functions as jointly optimizable first-class components in a tree search setup. This allows automatic tuning rather than manual design, which is a step forward from typical prompt engineering or simple filtering. The paper demonstrates this on open-source models like CodeLlama, Mistral, and DeepSeek, with results suggesting notable gains over baselines in code correctness. It does well in providing a flexible framework that can incorporate various constraints such as style, syntax, and execution feedback. This systematic exploration using optimization techniques is a solid conceptual contribution. The soft spots center on the experimental protocol and scalability. The description does not specify the search parameters like branching limits or pruning methods, leaving open whether the tree search remains efficient for longer code snippets. If the number of LLM calls grows too fast, the reported improvements may not translate to usable systems. More data on runtime costs and comparisons to existing constrained generation techniques would help. Readers working on AI-assisted programming tools would find this relevant, particularly those experimenting with search algorithms for better structured outputs. It offers a new angle for those frustrated with constraint violations in generated code. Overall, the paper shows clear thinking on the problem and engages with the need for better decoding methods. It deserves peer review because the idea has merit and the benchmarks are appropriate, even with the gaps in details. Referees can provide feedback on strengthening the efficiency arguments. My recommendation is to send it for peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TreeCoder, a framework that models LLM decoding for code generation as a tree search over candidate programs. Both decoding strategies and constraint functions (e.g., style, syntax, execution) are treated as first-class, optimizable components, enabling systematic exploration and automatic tuning via standard optimization methods. Experiments on the MBPP (Python) and SQL-Spider benchmarks are reported to show consistent accuracy gains across open-source models including CodeLlama, Mistral, and DeepSeek, often exceeding unconstrained baselines by considerable margins.

Significance. If the empirical results and efficiency claims hold, the work could offer a more principled alternative to prompt engineering for enforcing syntactic and semantic constraints in code generation. Treating constraints as tunable components within a tree-search formulation is a potentially generalizable idea. However, the abstract supplies no quantitative results, implementation details, or overhead measurements, so the practical significance remains difficult to assess from the provided text.

major comments (2)

[Abstract] Abstract: The central claim that TreeCoder 'consistently improves accuracy' and 'outperforms their unconstrained baselines by considerable margins' is stated without any numerical results, baseline comparisons, statistical tests, or even the magnitude of the reported gains. This absence makes it impossible to evaluate whether the experimental evidence supports the claim.
[Methods] Methods (tree-search formulation): No description is given of branching factor, pruning criteria, beam width, or the average number of LLM calls required per sample. Because the central claim requires that the search remains tractable on MBPP and Spider programs, the lack of these controls is load-bearing for the scalability and practicality assertions.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one concrete accuracy delta or table reference so readers can immediately gauge effect size.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions have been made to improve clarity and completeness.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that TreeCoder 'consistently improves accuracy' and 'outperforms their unconstrained baselines by considerable margins' is stated without any numerical results, baseline comparisons, statistical tests, or even the magnitude of the reported gains. This absence makes it impossible to evaluate whether the experimental evidence supports the claim.

Authors: We agree that the abstract would benefit from including specific quantitative results to allow readers to immediately assess the strength of the claims. In the revised version, we have updated the abstract to report key accuracy improvements (e.g., absolute gains on MBPP and SQL-Spider for CodeLlama, Mistral, and DeepSeek) along with direct comparisons to the unconstrained baselines. revision: yes
Referee: [Methods] Methods (tree-search formulation): No description is given of branching factor, pruning criteria, beam width, or the average number of LLM calls required per sample. Because the central claim requires that the search remains tractable on MBPP and Spider programs, the lack of these controls is load-bearing for the scalability and practicality assertions.

Authors: The referee is correct that these implementation parameters are essential for evaluating tractability. We have added a dedicated paragraph in the Methods section specifying the branching factor, pruning criteria, beam width, and the measured average number of LLM calls per sample on both benchmarks, confirming that the search remains practical within the reported experimental setup. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces TreeCoder as an empirical framework that represents LLM decoding as tree search with first-class optimizable constraints and validates it through accuracy improvements on the MBPP and SQL-Spider benchmarks across multiple models. No mathematical derivation chain, fitted-parameter predictions, or self-citation load-bearing steps are present in the abstract or described methods; the central claims rest on external benchmark results rather than reducing to inputs by construction or renaming known patterns. The work is therefore self-contained against external benchmarks with no evidence of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters or axioms; the approach implicitly relies on the assumption that standard optimization techniques can effectively tune tree-search configurations and that constraints can be applied during decoding without major efficiency trade-offs.

pith-pipeline@v0.9.0 · 5455 in / 1116 out tokens · 47885 ms · 2026-05-17T04:16:18.615375+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
cs.SE 2026-04 unverdicted novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
Diagnosing CFG Interpretation in LLMs
cs.AI 2026-04 unverdicted novelty 6.0

LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 2 Pith papers · 11 internal anchors

[1]

Kareem Ahmed, Kai-Wei Chang, and Guy Van den Broeck. 2024. Controllable Generation via Locally Constrained Resampling.arXiv preprint arXiv:2410.13111(2024)

work page arXiv 2024
[2]

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. arXiv:1907.10902 [cs.LG] https://arxiv.org/abs/1907.10902

work page internal anchor Pith review Pith/arXiv arXiv 2019
[3]

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-Generation Hyperparameter Optimization Framework. InThe 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2623–2631

work page 2019
[5]

Zhamri Che Ani, Zauridah Abdul Hamid, and Nur Nazifa Zhamri. 2024. The Recent Trends of Research on GitHub Copilot: A Systematic Review. InComputing and Informatics, Nur Haryani Zakaria, Nur Suhaili Mansor, Husniza Husni, and Fathey Mohammed (Eds.). Springer Nature Singapore, Singapore, 355–366

work page 2024
[6]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. CoRRabs/2108.07732 (2021). arXiv:2108.07732 https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, and Gagandeep Singh. 2025. CRANE: Reasoning with constrained LLM generation.arXiv preprint arXiv:2502.09061(2025)

work page arXiv 2025
[8]

Yehoshua Bar-Hillel, Micha Perles, and Eli Shamir. 1961. On formal properties of simple phrase structure grammars. Sprachtypologie und Universalienforschung14 (1961), 143–172

work page 1961
[9]

Browne, Edward Powley, Daniel Whitehouse, Simon M

Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. 2012. A Survey of Monte Carlo Tree Search Methods.IEEE Transactions on Computational Intelligence and AI in Games4, 1 (2012), 1–43. doi:10.1109/TCIAIG.2012. 2186810

work page doi:10.1109/tciaig.2012 2012
[10]

Godsill, and Eric Moulines

Olivier Cappe, Simon J. Godsill, and Eric Moulines. 2007. An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo.Proc. IEEE95, 5 (2007), 899–924. doi:10.1109/JPROC.2007.893250

work page doi:10.1109/jproc.2007.893250 2007
[11]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page
[12]

Evaluating Large Language Models Trained on Code

Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Xiang Chen, Jiacheng Xue, Xiaofei Xie, Caokai Liang, and Xiaolin Ju. 2025. A Systematic Literature Review on Neural Code Translation. arXiv:2505.07425 [cs.SE] https://arxiv.org/abs/2505.07425

work page arXiv 2025
[14]

Microsoft Corporation. 2025. Pyright: Static Type Checker for Python. https://github.com/microsoft/pyright. Accessed: 2025-11-13

work page 2025
[15]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

2001.An Introduction to Sequential Monte Carlo Methods

Arnaud Doucet, Nando de Freitas, and Neil Gordon. 2001.An Introduction to Sequential Monte Carlo Methods. Springer New York, New York, NY, 3–14. doi:10.1007/978-1-4757-3437-9_1

work page doi:10.1007/978-1-4757-3437-9_1 2001
[17]

Saibo Geng, Berkay Döner, Chris Wendler, Martin Josifoski, and Robert West. 2024. Sketch-guided constrained decoding for boosting blackbox large language models without logit access.arXiv preprint arXiv:2401.09967(2024)

work page arXiv 2024
[18]

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. 2024. Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. arXiv:2305.13971 [cs.CL] https://arxiv.org/abs/2305.13971

work page arXiv 2024
[19]

Emmanuel Anaya Gonzalez, Sairam Vaidya, Kanghee Park, Ruyi Ji, Taylor Berg-Kirkpatrick, and Loris D’Antoni

work page
[20]

arXiv:2506.05754 [cs.AI] https://arxiv.org/abs/2506.05754

Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective. arXiv:2506.05754 [cs.AI] https://arxiv.org/abs/2506.05754

work page arXiv
[21]

Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. 2025. rStar- Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv:2501.04519 [cs.CL] https: //arxiv.org/abs/2501.04519

work page internal anchor Pith review arXiv 2025
[22]

Jianfei He, Shichao Sun, Xiaohua Jia, and Wenjie Li. 2023. Empirical Analysis of Beam Search Curse and Search Errors with Model Errors in Neural Machine Translation. InProceedings of the 24th Annual Conference of the European Association for Machine Translation, Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Sc...

work page 2023
[23]

Corentin Hokamp and Qun Liu. 2017. Lexically Constrained Decoding for Neural Machine Translation via Grid Beam Search. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). 1535–1546

work page 2017
[24]

Liang Huang, Kai Zhao, and Mingbo Ma. 2018. When to Finish? Optimal Beam Search for Neural Text Generation (modulo beam size). arXiv:1809.00069 [cs.CL] https://arxiv.org/abs/1809.00069

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.06825...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

work page doi:10.1126/science.abq1158 2022
[28]

Syntactic and semantic control of large language models via sequential

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo. arXiv:2504.13139 [cs.CL] https://arxiv.org/abs/2504.13139

work page arXiv
[29]

Lew, Tim Vieira, and Timothy J

João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Alexander K. Lew, Tim Vieira, and Timothy J. O’Donnell

work page
[30]

InInternational Conference on Learning Representations (ICLR)

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo. InInternational Conference on Learning Representations (ICLR)

work page
[31]

Tingwei Lu, Yangning Li, Liyuan Wang, Binghuai Lin, Jiwei Tang, Qingsong Lv, Wanshi Xu, Hai-Tao Zheng, Yinghui Li, Xin Su, and Zifei Shan. 2025. LSR-MCTS: Alleviating Long Range Dependency in Code Generation. arXiv:2504.07433 [cs.CL] https://arxiv.org/abs/2504.07433

work page arXiv 2025
[32]

Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Kristina Toutanova, An...

work page 2021
[33]

Clara Meister, Tim Vieira, and Ryan Cotterell. 2021. If beam search is the answer, what was the question? arXiv:2010.02650 [cs.CL] https://arxiv.org/abs/2010.02650 TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation 111:23

work page arXiv 2021
[34]

Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, and Loris D’Antoni. 2024. Grammar-Aligned Decoding. arXiv:2405.21047 [cs.AI] https://arxiv.org/abs/2405.21047

work page arXiv 2024
[35]

Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, and Loris D’Antoni. 2025. Grammar-aligned decoding. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article 774, 22 pages

work page 2025
[36]

Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani

work page
[37]

arXiv:2201.11227 [cs.LG] https: //arxiv.org/abs/2201.11227

Synchromesh: Reliable code generation from pre-trained language models. arXiv:2201.11227 [cs.LG] https: //arxiv.org/abs/2201.11227

work page arXiv
[38]

Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable Code Generation from Pre-Trained Language Models.arXiv preprint arXiv:2204.02805(2022)

work page arXiv 2022
[39]

Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, and Akash Srivastava. 2025. Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods. arXiv:2502.01618 [cs.LG] https://arxiv.org/abs/2502.01618

work page arXiv 2025
[40]

Christopher D. Rosin. 2011. Multi-armed bandits with episode context.Annals of Mathematics and Artificial Intelligence 61, 3 (2011), 203–230. doi:10.1007/s10472-011-9258-6

work page doi:10.1007/s10472-011-9258-6 2011
[41]

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nico...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. arXiv:2109.05093 [cs.CL] https://arxiv.org/abs/2109.05093

work page arXiv 2021
[43]

Torsten Scholak, Nathan Schücher, and Dzmitry Bahdanau. 2021. PICARD: Parsing Incrementally for Constrained Autoregressive Decoding from Language Models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 7517–7535

work page 2021
[44]

Shubham Ugare, Rohan Gumaste, Tarun Suresh, Gagandeep Singh, and Sasa Misailovic. 2025. IterGen: Iterative Semantic-aware Structured LLM Generation with Backtracking. arXiv:2410.07295 [cs.SE] https://arxiv.org/abs/2410. 07295

work page arXiv 2025
[45]

Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep Singh. 2024. Improving LLM Code Generation with Grammar Augmentation.CoRRabs/2403.01632 (2024). https://doi.org/10.48550/arXiv.2403.01632

work page doi:10.48550/arxiv.2403.01632 2024
[46]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush

work page
[47]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs.CL] https: //arxiv.org/abs/1910.03771

work page internal anchor Pith review Pith/arXiv arXiv 1910
[48]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2019. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. arXiv:1809.08887 [cs.CL] https://arxiv.org/abs/1809.08887

work page internal anchor Pith review Pith/arXiv arXiv 2019
[50]

Yuqi Zhu, Jia Li, Ge Li, YunFei Zhao, Jia Li, Zhi Jin, and Hong Mei. 2023. Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models. arXiv:2309.02772 [cs.SE] https://arxiv.org/abs/2309.02772

work page arXiv 2023

[1] [1]

Kareem Ahmed, Kai-Wei Chang, and Guy Van den Broeck. 2024. Controllable Generation via Locally Constrained Resampling.arXiv preprint arXiv:2410.13111(2024)

work page arXiv 2024

[2] [2]

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. arXiv:1907.10902 [cs.LG] https://arxiv.org/abs/1907.10902

work page internal anchor Pith review Pith/arXiv arXiv 2019

[3] [3]

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-Generation Hyperparameter Optimization Framework. InThe 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2623–2631

work page 2019

[4] [5]

Zhamri Che Ani, Zauridah Abdul Hamid, and Nur Nazifa Zhamri. 2024. The Recent Trends of Research on GitHub Copilot: A Systematic Review. InComputing and Informatics, Nur Haryani Zakaria, Nur Suhaili Mansor, Husniza Husni, and Fathey Mohammed (Eds.). Springer Nature Singapore, Singapore, 355–366

work page 2024

[5] [6]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. CoRRabs/2108.07732 (2021). arXiv:2108.07732 https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [7]

Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, and Gagandeep Singh. 2025. CRANE: Reasoning with constrained LLM generation.arXiv preprint arXiv:2502.09061(2025)

work page arXiv 2025

[7] [8]

Yehoshua Bar-Hillel, Micha Perles, and Eli Shamir. 1961. On formal properties of simple phrase structure grammars. Sprachtypologie und Universalienforschung14 (1961), 143–172

work page 1961

[8] [9]

Browne, Edward Powley, Daniel Whitehouse, Simon M

Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. 2012. A Survey of Monte Carlo Tree Search Methods.IEEE Transactions on Computational Intelligence and AI in Games4, 1 (2012), 1–43. doi:10.1109/TCIAIG.2012. 2186810

work page doi:10.1109/tciaig.2012 2012

[9] [10]

Godsill, and Eric Moulines

Olivier Cappe, Simon J. Godsill, and Eric Moulines. 2007. An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo.Proc. IEEE95, 5 (2007), 899–924. doi:10.1109/JPROC.2007.893250

work page doi:10.1109/jproc.2007.893250 2007

[10] [11]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page

[11] [12]

Evaluating Large Language Models Trained on Code

Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv

[12] [13]

Xiang Chen, Jiacheng Xue, Xiaofei Xie, Caokai Liang, and Xiaolin Ju. 2025. A Systematic Literature Review on Neural Code Translation. arXiv:2505.07425 [cs.SE] https://arxiv.org/abs/2505.07425

work page arXiv 2025

[13] [14]

Microsoft Corporation. 2025. Pyright: Static Type Checker for Python. https://github.com/microsoft/pyright. Accessed: 2025-11-13

work page 2025

[14] [15]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [16]

2001.An Introduction to Sequential Monte Carlo Methods

Arnaud Doucet, Nando de Freitas, and Neil Gordon. 2001.An Introduction to Sequential Monte Carlo Methods. Springer New York, New York, NY, 3–14. doi:10.1007/978-1-4757-3437-9_1

work page doi:10.1007/978-1-4757-3437-9_1 2001

[16] [17]

Saibo Geng, Berkay Döner, Chris Wendler, Martin Josifoski, and Robert West. 2024. Sketch-guided constrained decoding for boosting blackbox large language models without logit access.arXiv preprint arXiv:2401.09967(2024)

work page arXiv 2024

[17] [18]

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. 2024. Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. arXiv:2305.13971 [cs.CL] https://arxiv.org/abs/2305.13971

work page arXiv 2024

[18] [19]

Emmanuel Anaya Gonzalez, Sairam Vaidya, Kanghee Park, Ruyi Ji, Taylor Berg-Kirkpatrick, and Loris D’Antoni

work page

[19] [20]

arXiv:2506.05754 [cs.AI] https://arxiv.org/abs/2506.05754

Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective. arXiv:2506.05754 [cs.AI] https://arxiv.org/abs/2506.05754

work page arXiv

[20] [21]

Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. 2025. rStar- Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv:2501.04519 [cs.CL] https: //arxiv.org/abs/2501.04519

work page internal anchor Pith review arXiv 2025

[21] [22]

Jianfei He, Shichao Sun, Xiaohua Jia, and Wenjie Li. 2023. Empirical Analysis of Beam Search Curse and Search Errors with Model Errors in Neural Machine Translation. InProceedings of the 24th Annual Conference of the European Association for Machine Translation, Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Sc...

work page 2023

[22] [23]

Corentin Hokamp and Qun Liu. 2017. Lexically Constrained Decoding for Neural Machine Translation via Grid Beam Search. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). 1535–1546

work page 2017

[23] [24]

Liang Huang, Kai Zhao, and Mingbo Ma. 2018. When to Finish? Optimal Beam Search for Neural Text Generation (modulo beam size). arXiv:1809.00069 [cs.CL] https://arxiv.org/abs/1809.00069

work page internal anchor Pith review Pith/arXiv arXiv 2018

[24] [25]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.06825...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [26]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

work page doi:10.1126/science.abq1158 2022

[26] [28]

Syntactic and semantic control of large language models via sequential

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo. arXiv:2504.13139 [cs.CL] https://arxiv.org/abs/2504.13139

work page arXiv

[27] [29]

Lew, Tim Vieira, and Timothy J

João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Alexander K. Lew, Tim Vieira, and Timothy J. O’Donnell

work page

[28] [30]

InInternational Conference on Learning Representations (ICLR)

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo. InInternational Conference on Learning Representations (ICLR)

work page

[29] [31]

Tingwei Lu, Yangning Li, Liyuan Wang, Binghuai Lin, Jiwei Tang, Qingsong Lv, Wanshi Xu, Hai-Tao Zheng, Yinghui Li, Xin Su, and Zifei Shan. 2025. LSR-MCTS: Alleviating Long Range Dependency in Code Generation. arXiv:2504.07433 [cs.CL] https://arxiv.org/abs/2504.07433

work page arXiv 2025

[30] [32]

Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Kristina Toutanova, An...

work page 2021

[31] [33]

Clara Meister, Tim Vieira, and Ryan Cotterell. 2021. If beam search is the answer, what was the question? arXiv:2010.02650 [cs.CL] https://arxiv.org/abs/2010.02650 TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation 111:23

work page arXiv 2021

[32] [34]

Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, and Loris D’Antoni. 2024. Grammar-Aligned Decoding. arXiv:2405.21047 [cs.AI] https://arxiv.org/abs/2405.21047

work page arXiv 2024

[33] [35]

Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, and Loris D’Antoni. 2025. Grammar-aligned decoding. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article 774, 22 pages

work page 2025

[34] [36]

Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani

work page

[35] [37]

arXiv:2201.11227 [cs.LG] https: //arxiv.org/abs/2201.11227

Synchromesh: Reliable code generation from pre-trained language models. arXiv:2201.11227 [cs.LG] https: //arxiv.org/abs/2201.11227

work page arXiv

[36] [38]

Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable Code Generation from Pre-Trained Language Models.arXiv preprint arXiv:2204.02805(2022)

work page arXiv 2022

[37] [39]

Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, and Akash Srivastava. 2025. Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods. arXiv:2502.01618 [cs.LG] https://arxiv.org/abs/2502.01618

work page arXiv 2025

[38] [40]

Christopher D. Rosin. 2011. Multi-armed bandits with episode context.Annals of Mathematics and Artificial Intelligence 61, 3 (2011), 203–230. doi:10.1007/s10472-011-9258-6

work page doi:10.1007/s10472-011-9258-6 2011

[39] [41]

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nico...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [42]

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. arXiv:2109.05093 [cs.CL] https://arxiv.org/abs/2109.05093

work page arXiv 2021

[41] [43]

Torsten Scholak, Nathan Schücher, and Dzmitry Bahdanau. 2021. PICARD: Parsing Incrementally for Constrained Autoregressive Decoding from Language Models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 7517–7535

work page 2021

[42] [44]

Shubham Ugare, Rohan Gumaste, Tarun Suresh, Gagandeep Singh, and Sasa Misailovic. 2025. IterGen: Iterative Semantic-aware Structured LLM Generation with Backtracking. arXiv:2410.07295 [cs.SE] https://arxiv.org/abs/2410. 07295

work page arXiv 2025

[43] [45]

Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep Singh. 2024. Improving LLM Code Generation with Grammar Augmentation.CoRRabs/2403.01632 (2024). https://doi.org/10.48550/arXiv.2403.01632

work page doi:10.48550/arxiv.2403.01632 2024

[44] [46]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush

work page

[45] [47]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs.CL] https: //arxiv.org/abs/1910.03771

work page internal anchor Pith review Pith/arXiv arXiv 1910

[46] [48]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [49]

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2019. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. arXiv:1809.08887 [cs.CL] https://arxiv.org/abs/1809.08887

work page internal anchor Pith review Pith/arXiv arXiv 2019

[48] [50]

Yuqi Zhu, Jia Li, Ge Li, YunFei Zhao, Jia Li, Zhi Jin, and Hong Mei. 2023. Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models. arXiv:2309.02772 [cs.SE] https://arxiv.org/abs/2309.02772

work page arXiv 2023