
arxiv: 2606.21619 · v1 · pith:RQKITQVGnew · submitted 2026-06-19 · 💻 cs.SE · cs.LG · cs.PL

The Alignment Problem in Constrained Code Generation

Pith reviewed 2026-06-26 13:30 UTC · model grok-4.3

classification 💻 cs.SE · cs.LG · cs.PL
keywords constrained decoding · code generation · large language models · alignment · incompleteness · functional correctness · syntax constraints · type constraints

The pith

Incomplete constrainers cause constrained decoding to underperform unconstrained decoding in code generation by distorting language model distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that constrained decoding for code generation fails to improve functional correctness when the constrainer is incomplete. Incompleteness rejects programs that belong to the target language, which distorts the language model's probability distribution and pushes generations into low-probability regions that often time out. Experiments with seven models, two target languages, and three benchmarks show that unconstrained decoding then achieves significantly higher functional correctness, with gaps reaching 97 percent. The core issue is misalignment among the constrainer, the model, and the specification language, where the bias from incompleteness outweighs the benefit of avoiding syntax or type errors. This implies that formal constraints only deliver their intended gains if the constrainer is complete enough to preserve the model's natural behavior.

Core claim

When the constrainer is incomplete, unconstrained decoding significantly outperforms constrained decoding in terms of functional correctness. Incompleteness pushes the model into low-probability regions of the program space, causing the generation to frequently time out, and reducing functional correctness by up to 97%.

What carries the argument

The alignment between constrainer, language model, and target specification language, where incompleteness creates a bias that distorts the model's distribution over valid programs.
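
The mechanism can be illustrated with a toy sketch. It is not the paper's setup: the token names, probabilities, and step budget below are invented, and greedy decoding stands in for whatever sampling the experiments use. The constrainer rejects a token that is valid in the target language, so decoding is forced onto a path the model itself considers unlikely, and that path loops until the budget runs out.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
# All token names and numbers are invented for illustration.
NEXT = {
    "<bos>": {"forward_ref": 0.9, "inline_body": 0.1},
    "forward_ref": {"<eos>": 1.0},
    "inline_body": {"inline_body": 0.6, "<eos>": 0.4},
}


def greedy_decode(allowed, max_steps=10):
    """Greedy decoding with a token filter; returns (tokens, nll, status).

    The NLL is accumulated under the unmasked model, so it measures how
    unlikely the chosen path is from the model's own point of view.
    """
    tokens, nll, prev = [], 0.0, "<bos>"
    for _ in range(max_steps):
        candidates = {t: p for t, p in NEXT[prev].items() if allowed(t)}
        if not candidates:
            return tokens, nll, "dead_end"
        tok = max(candidates, key=candidates.get)
        nll -= math.log(NEXT[prev][tok])
        if tok == "<eos>":
            return tokens, nll, "done"
        tokens.append(tok)
        prev = tok
    return tokens, nll, "timeout"


unconstrained = greedy_decode(lambda t: True)
# Incomplete constrainer: it rejects "forward_ref" although the token is valid.
constrained = greedy_decode(lambda t: t != "forward_ref")

print("unconstrained:", unconstrained[2], round(unconstrained[1], 3))
print("constrained:  ", constrained[2], round(constrained[1], 3))
```

In this toy model the unconstrained run finishes after one token, while the constrained run exhausts its step budget with a much higher NLL, mirroring the timeout behavior described above.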

If this is right

  • Constrained decoding improves functional correctness only when the constrainer is complete enough to avoid rejecting valid programs.
  • Design of constrainers must prioritize completeness alongside soundness to prevent distortion of the language model distribution.
  • Functional correctness metrics for constrained code generation must account for increased timeout rates induced by incompleteness.
  • Unconstrained decoding remains preferable in settings where the constrainer cannot be made sufficiently complete.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Measuring the degree of incompleteness before deployment could help decide whether to apply constrained decoding at all.
  • Hybrid systems that switch to unconstrained generation when constraints become too restrictive may mitigate the observed performance loss.
  • The same misalignment effect could appear in other constrained generation domains such as structured text or formal specifications.
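
A hybrid policy of the kind suggested in the second bullet might look like the following sketch. The function name `choose_token`, the threshold `max_rejected_mass`, and the example distribution are all hypothetical; the paper does not propose or evaluate this rule.

```python
def choose_token(probs, allowed, max_rejected_mass=0.5):
    """Pick the next token, falling back when the constrainer removes too much mass.

    probs:   dict mapping token -> model probability (assumed to sum to 1)
    allowed: predicate standing in for the constrainer's per-token check
    Returns (token, mode), where mode records which policy was used.
    """
    kept = {t: p for t, p in probs.items() if allowed(t)}
    rejected_mass = sum(p for t, p in probs.items() if not allowed(t))
    if not kept or rejected_mass > max_rejected_mass:
        # Constraints are too restrictive at this step: trust the model instead.
        return max(probs, key=probs.get), "unconstrained"
    return max(kept, key=kept.get), "constrained"


step = {"forward_ref": 0.9, "inline_body": 0.1}
print(choose_token(step, lambda t: t != "forward_ref"))
print(choose_token(step, lambda t: t != "inline_body"))
```

The trade-off is explicit: every fallback step gives up the by-construction guarantee, so a policy like this only makes sense when rejected mass is a reliable signal of incompleteness rather than of a genuinely invalid continuation.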

Load-bearing premise

The observed performance gaps are caused primarily by incompleteness of the constrainer rather than other experimental factors such as benchmark selection or timeout thresholds.

What would settle it

An experiment that applies a demonstrably complete constrainer and finds constrained decoding then matches or exceeds unconstrained decoding on functional correctness.

Figures

Figures reproduced from arXiv: 2606.21619 by George Zakhour, Guido Salvaneschi, Jahrim Gabriele Cesario, Luca Di Grazia, Matteo Biagiola.

Figure 1
Figure 1: Incompleteness bias in TypeScript. The model favors a valid forward reference after generating a function signature, but the incomplete constrainer (𝐶) rejects it, leading to a low-probability path and text degeneration.
Figure 2
Figure 2: Irrelevant combinations of the model language
Figure 3
Figure 3: Alignment between the model language 𝐿M, target language 𝐿𝑇, and the constrained language 𝐿𝐶. Lastly, the constrained language 𝐿𝐶 is the set of programs that are compliant with some user-defined constraining model, or constrainer. More formally, 𝐿𝐶 includes all complete programs that are valid by some syntactic PI and semantic TI constraints (from Algorithm 1), which are often approximating the target la…
Figure 3
Figure 3b shows the first general case: the constrainer is sound,
Figure 4
Figure 4 shows NLL distributions for Gemma-2-2B at 𝜏 = 0.1 on HumanEval (results for MBPP are similar). Overall, constrained and unconstrained distributions differ significantly, with constrained decoding producing higher NLL. This is most pronounced for timed-out solutions, which are extremely unlikely under the model’s distribution, illustrating how an incomplete constrainer can push the language model into uncha…
Original abstract

Large Language Models (LLMs) have demonstrated strong capabilities in code generation, but their outputs frequently contain syntax or type errors that result in compilation failures. Constrained decoding has been proposed as a solution to mitigate compilation errors by construction, improving functional correctness as a byproduct. However, previous works overlook a critical aspect of constrained decoding: the alignment between constrainer (e.g., types), language model and the target specification language (e.g., TypeScript). Misalignment is caused by the constrainer being incomplete--rejecting programs that belong to the target--or unsound--allowing programs that are not part of the target. The bias created by incompleteness distorts the language model distribution, and can be detrimental for code generation. We evaluate this hypothesis using seven language models, two target languages, two constrainers, enforcing types and syntax during decoding, and we study how language models react to varying levels of incompleteness. On three benchmarks, when the constrainer is incomplete, unconstrained decoding significantly outperforms constrained decoding in terms of functional correctness. Incompleteness pushes the model into low-probability regions of the program space, causing the generation to frequently time out, and reducing functional correctness by up to 97%. These contributions make the community aware of the negative effects of misalignment in constrained decoding, and provide quantitative insights on how to design constrainers that are beneficial for code generation systems with formal guarantees.
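
The two failure modes named in the abstract can be written as set relations, using the symbols from the Figure 3 caption (𝐿𝐶 for the constrained language, 𝐿𝑇 for the target language). This is a paraphrase of the abstract's wording, not a reproduction of the paper's formal definitions, and it assumes the amsmath package.

```latex
\begin{align*}
\text{sound:}    \quad & L_C \subseteq L_T, &
\text{unsound:}  \quad & L_C \setminus L_T \neq \emptyset,\\
\text{complete:} \quad & L_T \subseteq L_C, &
\text{incomplete:} \quad & L_T \setminus L_C \neq \emptyset.
\end{align*}
```

Under this reading, a constrainer that is both sound and complete satisfies $L_C = L_T$, and the incompleteness bias concerns the programs in $L_T \setminus L_C$ to which the model assigns probability.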

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that misalignment between constrainers, language models, and target languages in constrained code generation arises from incompleteness (rejecting valid programs), which distorts the LM distribution, pushes generations into low-probability regions, causes frequent timeouts, and reduces functional correctness by up to 97% relative to unconstrained decoding. It supports this via evaluations across seven models, two target languages, two constrainers (types and syntax), and three benchmarks, concluding that incomplete constrainers can be detrimental and offering insights for better constrainer design.

Significance. If the central empirical claim can be isolated from confounds, the work would highlight an important practical limitation of constrained decoding for code generation that prior literature has overlooked. The quantitative results on performance degradation could inform constrainer design choices in systems aiming for formal guarantees. No machine-checked proofs, reproducible artifacts, or parameter-free derivations are described.

major comments (3)
  1. [Evaluation section] The attribution of the performance gap (unconstrained outperforming constrained by up to 97%) to incompleteness lacks controls that vary timeout thresholds independently of the constrainer; without such isolation it is unclear whether the observed timeouts and correctness drops are driven by incompleteness or by the interaction of the chosen timeout with constrained search paths.
  2. [Methods and results] Incompleteness levels are not shown to be quantified via an independent metric separate from the decoding runs themselves; if incompleteness is measured from the same generations that exhibit timeouts, the causal claim that incompleteness pushes models into low-probability regions risks circularity.
  3. [Abstract and evaluation] The claim that 'incompleteness pushes the model into low-probability regions' is presented as the operative mechanism, yet no direct measurement (e.g., probability mass or entropy comparisons between constrained and unconstrained paths) is described to support this over alternative explanations such as benchmark selection or model-specific sampling.
minor comments (2)
  1. [Abstract] The abstract states results on 'seven language models, two target languages, two constrainers' but does not list the exact models, languages, or constrainers; adding an explicit table or list in the methods would improve clarity.
  2. [Evaluation] No discussion of how functional correctness is measured (e.g., test-case pass rate, exact match) or inter-rater reliability for any manual checks appears in the provided abstract; this detail should be added to the evaluation protocol.
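
The direct measurement requested in major comment 3 could be sketched per decoding step as follows. The helper names and the example distribution are hypothetical, and masking followed by renormalization is assumed as the constraining step; the report does not say which quantities the paper computes.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def mask_and_renormalize(dist, allowed):
    """Drop tokens the constrainer rejects and renormalize the remainder.

    Returns (masked_distribution, rejected_mass).
    """
    kept = {t: p for t, p in dist.items() if allowed(t)}
    kept_mass = sum(kept.values())
    if kept_mass == 0:
        raise ValueError("constrainer rejected every token")
    rejected_mass = sum(dist.values()) - kept_mass
    return {t: p / kept_mass for t, p in kept.items()}, rejected_mass


# Invented example: the constrainer rejects the model's preferred token "a".
step = {"a": 0.7, "b": 0.2, "c": 0.1}
masked, rejected = mask_and_renormalize(step, lambda t: t != "a")

print("rejected mass:", round(rejected, 3))
print("entropy before:", round(entropy(step), 3))
print("entropy after: ", round(entropy(masked), 3))
```

Logging the rejected mass at every step of a constrained run, and comparing it between runs that finish and runs that time out, would be one way to test the claimed mechanism against the alternative explanations the referee lists.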

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our experimental design. We address each major comment below, providing clarifications and indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Evaluation section] The attribution of the performance gap (unconstrained outperforming constrained by up to 97%) to incompleteness lacks controls that vary timeout thresholds independently of the constrainer; without such isolation it is unclear whether the observed timeouts and correctness drops are driven by incompleteness or by the interaction of the chosen timeout with constrained search paths.

    Authors: We acknowledge that varying timeout thresholds independently would provide stronger isolation of the incompleteness effect. Our experiments use a fixed timeout value consistent with prior constrained decoding literature for code generation. The higher timeout rates under constrained decoding arise because incompleteness forces the search to exhaust valid high-probability paths, but we agree this could interact with the timeout choice. We will revise the evaluation section to explicitly discuss the fixed timeout as a potential confound and include it as a direction for future work. revision: partial

  2. Referee: [Methods and results] Incompleteness levels are not shown to be quantified via an independent metric separate from the decoding runs themselves; if incompleteness is measured from the same generations that exhibit timeouts, the causal claim that incompleteness pushes models into low-probability regions risks circularity.

    Authors: Incompleteness is quantified independently by measuring the rate at which the constrainer rejects ground-truth programs from the benchmarks (i.e., valid programs in the target language that the constrainer incorrectly rejects). This metric is computed prior to and separately from any decoding runs or timeout observations. The timeouts and functional correctness results are then correlated with these pre-computed incompleteness levels across different constrainers. We will revise the methods section to make this separation explicit and include the exact formula used for the independent incompleteness metric. revision: yes

  3. Referee: [Abstract and evaluation] The claim that 'incompleteness pushes the model into low-probability regions' is presented as the operative mechanism, yet no direct measurement (e.g., probability mass or entropy comparisons between constrained and unconstrained paths) is described to support this over alternative explanations such as benchmark selection or model-specific sampling.

    Authors: We rely on indirect but consistent evidence: across seven models and three benchmarks, incomplete constrainers produce substantially higher timeout rates and lower functional correctness, which we attribute to the model being forced away from its preferred (high-probability) valid programs. While direct probability mass or entropy measurements on paths would provide stronger mechanistic support, such measurements were not performed in the current study. We will add a limitations paragraph noting this and that alternative explanations cannot be fully ruled out without those measurements. revision: partial
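
The independent metric described in response 2 (the rate at which the constrainer rejects ground-truth programs) could be computed along these lines. The rebuttal promises an exact formula but does not give one, so the plain rejection fraction below is an assumption, and `toy_accepts` is an invented stand-in for a real constrainer that refuses forward references to a function named `helper`.

```python
def incompleteness_rate(reference_programs, accepts):
    """Fraction of known-valid programs that the constrainer rejects.

    reference_programs: programs assumed to be valid in the target language
    accepts:            predicate standing in for the constrainer's verdict
    """
    if not reference_programs:
        raise ValueError("need at least one reference program")
    rejected = sum(1 for program in reference_programs if not accepts(program))
    return rejected / len(reference_programs)


def toy_accepts(program):
    """Hypothetical constrainer: rejects a call to helper() before its declaration."""
    first_use = program.find("helper(")
    declaration = program.find("function helper(")
    return not (0 <= first_use < declaration)


references = [
    "function main() { return helper(1); }\nfunction helper(x) { return x; }",
    "function helper(x) { return x; }\nfunction main() { return helper(1); }",
    "function main() { return 1; }",
]

rate = incompleteness_rate(references, toy_accepts)
print(f"incompleteness rate: {rate:.2f}")
```

Because the metric only needs the reference solutions and the constrainer, it can be computed before any decoding run, which is the separation the referee asks for.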

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation

full rationale

The paper advances an empirical hypothesis about misalignment in constrained decoding and tests it via direct experiments across seven LLMs, two languages, two constrainers, and three benchmarks. Functional correctness, timeout rates, and incompleteness effects are measured as observed outcomes rather than derived from any equations, fitted parameters, or self-referential definitions. No load-bearing self-citations, ansatzes, or uniqueness theorems appear in the abstract or described contributions. The central quantitative claim (up to 97% reduction) is an experimental result, not a prediction that reduces to its inputs by construction. This matches the default expectation of no significant circularity for benchmark-driven work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical study relying on standard assumptions in machine learning evaluation for code generation tasks. No free parameters, invented entities, or non-standard axioms are evident from the abstract.

axioms (1)
  • domain assumption: Benchmarks and models used are representative for assessing constrained decoding effects.
    Invoked implicitly in the evaluation across seven models and three benchmarks.


