The Alignment Problem in Constrained Code Generation

George Zakhour; Guido Salvaneschi; Jahrim Gabriele Cesario; Luca Di Grazia; Matteo Biagiola

arxiv: 2606.21619 · v1 · pith:RQKITQVGnew · submitted 2026-06-19 · 💻 cs.SE · cs.LG · cs.PL

Pith reviewed 2026-06-26 13:30 UTC · model grok-4.3

classification 💻 cs.SE · cs.LG · cs.PL

keywords constrained decoding · code generation · large language models · alignment · incompleteness · functional correctness · syntax constraints · type constraints

The pith

Incomplete constrainers cause constrained decoding to underperform unconstrained decoding in code generation by distorting language model distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that constrained decoding for code generation fails to improve functional correctness when the constrainer is incomplete. Incompleteness rejects programs that belong to the target language, which distorts the language model's probability distribution and pushes generations into low-probability regions that often time out. Experiments with seven models, two target languages, and three benchmarks show that unconstrained decoding then achieves significantly higher functional correctness, with gaps reaching 97 percent. The core issue is misalignment among the constrainer, the model, and the specification language, where the bias from incompleteness outweighs the benefit of avoiding syntax or type errors. This implies that formal constraints only deliver their intended gains if the constrainer is complete enough to preserve the model's natural behavior.

Core claim

When the constrainer is incomplete, unconstrained decoding significantly outperforms constrained decoding in terms of functional correctness. Incompleteness pushes the model into low-probability regions of the program space, causing the generation to frequently time out, and reducing functional correctness by up to 97%.
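A toy illustration of this mechanism, not the paper's implementation: the sketch below masks the tokens a constrainer rejects and renormalizes what is left. The vocabulary, the probabilities, and the `constrain` helper are hypothetical; the scenario loosely follows the forward-reference example described in Figure 1.

```python
import math

def constrain(probs, allowed):
    """Mask tokens the constrainer rejects and renormalize the rest."""
    kept = {tok: p for tok, p in probs.items() if tok in allowed}
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("constrainer rejected every token")
    return {tok: p / total for tok, p in kept.items()}

# Hypothetical next-token distribution right after a function signature.
model_probs = {"helper(": 0.90, "return": 0.08, "//": 0.02}

# Assumed incomplete constrainer: it rejects "helper(", a valid forward reference.
allowed_by_incomplete = {"return", "//"}

constrained = constrain(model_probs, allowed_by_incomplete)
best = max(constrained, key=constrained.get)

# Surprisal of the forced choice under the *unconstrained* model.
surprisal = -math.log(model_probs[best])
print(best, round(constrained[best], 3), round(surprisal, 3))
```

After masking, the token the model originally gave 8% looks like a confident choice, yet under the model's own distribution it is a low-probability continuation. Repeating such steps is one way a generation could drift into regions the model rarely visits.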

What carries the argument

The alignment between constrainer, language model, and target specification language, where incompleteness creates a bias that distorts the model's distribution over valid programs.

If this is right

Constrained decoding improves functional correctness only when the constrainer is complete enough to avoid rejecting valid programs.
Design of constrainers must prioritize completeness alongside soundness to prevent distortion of the language model distribution.
Functional correctness metrics for constrained code generation must account for increased timeout rates induced by incompleteness.
Unconstrained decoding remains preferable in settings where the constrainer cannot be made sufficiently complete.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Measuring the degree of incompleteness before deployment could help decide whether to apply constrained decoding at all.
Hybrid systems that switch to unconstrained generation when constraints become too restrictive may mitigate the observed performance loss.
The same misalignment effect could appear in other constrained generation domains such as structured text or formal specifications.
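The hybrid idea in the list above could be sketched as follows. This is only one possible reading of it: the `choose_next` helper, the greedy selection, and the `min_allowed_mass` threshold are all assumptions made for illustration, and falling back to unconstrained generation gives up the constrainer's guarantees for that step.

```python
def choose_next(probs, allowed, min_allowed_mass=0.2):
    """Pick the next token greedily, dropping the constraint when it is too restrictive.

    probs: mapping from token to model probability (hypothetical toy vocabulary).
    allowed: set of tokens the constrainer accepts at this step.
    min_allowed_mass: assumed threshold on the model probability mass that
        the constrainer leaves available.
    """
    allowed_mass = sum(p for tok, p in probs.items() if tok in allowed)
    if allowed_mass < min_allowed_mass:
        # The constrainer removed most of what the model wanted: fall back.
        return max(probs, key=probs.get), "unconstrained"
    kept = {tok: p for tok, p in probs.items() if tok in allowed}
    return max(kept, key=kept.get), "constrained"

model_probs = {"helper(": 0.90, "return": 0.08, "//": 0.02}

restrictive = choose_next(model_probs, {"return", "//"})
permissive = choose_next(model_probs, {"helper(", "return"})
print(restrictive, permissive)
```

The allowed mass doubles as a cheap per-step signal of how far the constrainer is pulling the model from its preferred continuation, which is why it is used as the switch here.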

Load-bearing premise

The observed performance gaps are caused primarily by incompleteness of the constrainer rather than other experimental factors such as benchmark selection or timeout thresholds.

What would settle it

An experiment that applies a demonstrably complete constrainer and finds constrained decoding then matches or exceeds unconstrained decoding on functional correctness.

Figures

Figures reproduced from arXiv: 2606.21619 by George Zakhour, Guido Salvaneschi, Jahrim Gabriele Cesario, Luca Di Grazia, Matteo Biagiola.

**Figure 1.** Incompleteness bias in TypeScript. The model favors a valid forward reference after generating a function signature, but the incomplete constrainer (𝐶) rejects it, leading to a low-probability path and text degeneration.

**Figure 2.** Irrelevant combinations of the model language

**Figure 3.** Alignment between the model language 𝐿M, target language 𝐿𝑇, and the constrained language 𝐿𝐶. Lastly, the constrained language 𝐿𝐶 is the set of programs that are compliant with some user-defined constraining model, or constrainer. More formally, 𝐿𝐶 includes all complete programs that are valid by some syntactic PI and semantic TI constraints (from Algorithm 1), which are often approximating the target la…

**Figure 3.** Figure 3b shows the first general case: the constrainer is sound,

**Figure 4.** NLL distributions for Gemma-2-2B at 𝜏 = 0.1 on HumanEval (results for MBPP are similar). Overall, constrained and unconstrained distributions differ significantly, with constrained decoding producing higher NLL. This is most pronounced for timed-out solutions, which are extremely unlikely under the model's distribution, illustrating how an incomplete constrainer can push the language model into uncha…
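For readers unfamiliar with the quantity in Figure 4, the negative log-likelihood (NLL) of a generated sequence can be computed from the per-token probabilities the model assigned to it. The sketch below is a generic definition with made-up numbers; whether the paper normalizes NLL by sequence length is not stated in this excerpt, and `sequence_nll` is a hypothetical helper name.

```python
import math

def sequence_nll(token_probs):
    """Total negative log-likelihood of a sequence from its per-token probabilities."""
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)

# Made-up per-token probabilities under the unconstrained model.
unconstrained_path = [0.90, 0.85, 0.95]  # the model follows its preferred tokens
constrained_path = [0.90, 0.04, 0.10]    # the constrainer forced unlikely tokens

nll_unconstrained = sequence_nll(unconstrained_path)
nll_constrained = sequence_nll(constrained_path)
print(round(nll_unconstrained, 3), round(nll_constrained, 3))
```

A higher NLL means the sequence is less likely under the model, so a shift of the constrained distribution toward higher NLL is consistent with generations being pushed away from what the model would normally produce.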

Original abstract

Large Language Models (LLMs) have demonstrated strong capabilities in code generation, but their outputs frequently contain syntax or type errors that result in compilation failures. Constrained decoding has been proposed as a solution to mitigate compilation errors by construction, improving functional correctness as a byproduct. However, previous works overlook a critical aspect of constrained decoding: the alignment between constrainer (e.g., types), language model and the target specification language (e.g., TypeScript). Misalignment is caused by the constrainer being incomplete--rejecting programs that belong to the target--or unsound--allowing programs that are not part of the target. The bias created by incompleteness distorts the language model distribution, and can be detrimental for code generation. We evaluate this hypothesis using seven language models, two target languages, two constrainers, enforcing types and syntax during decoding, and we study how language models react to varying levels of incompleteness. On three benchmarks, when the constrainer is incomplete, unconstrained decoding significantly outperforms constrained decoding in terms of functional correctness. Incompleteness pushes the model into low-probability regions of the program space, causing the generation to frequently time out, and reducing functional correctness by up to 97%. These contributions make the community aware of the negative effects of misalignment in constrained decoding, and provide quantitative insights on how to design constrainers that are beneficial for code generation systems with formal guarantees.
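The two failure modes named in the abstract can be read as set differences between the target language and the constrained language. The sketch below uses tiny finite sets as stand-ins for what are really infinite languages; `classify_constrainer` and the program labels are hypothetical.

```python
def classify_constrainer(target, constrained):
    """Compare a constrained language against the target language.

    Both arguments are finite sets of program labels, used here as toy
    stand-ins for the (infinite) languages discussed in the abstract.
    """
    rejected_valid = target - constrained    # in the target, but rejected
    accepted_invalid = constrained - target  # accepted, but not in the target
    return {
        "incomplete": bool(rejected_valid),
        "unsound": bool(accepted_invalid),
    }

target_language = {"p1", "p2", "p3"}
sound_but_incomplete = {"p1", "p2"}               # misses the valid program p3
complete_but_unsound = {"p1", "p2", "p3", "bad"}  # lets an invalid program through

print(classify_constrainer(target_language, sound_but_incomplete))
print(classify_constrainer(target_language, complete_but_unsound))
```

A constrainer is aligned with the target in this toy sense only when both differences are empty, that is, when the two sets coincide.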

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Incomplete constrainers can make constrained decoding worse than unconstrained on functional correctness, but the causal isolation from confounds like timeouts is not yet clear from the abstract.

The main takeaway is that when a constrainer rejects valid programs, constrained decoding can drop functional correctness by large margins compared to unconstrained runs, up to 97% in their tests, because it steers the model into low-probability areas that often time out.

What stands out is the focus on misalignment between the constrainer, the model, and the target specification. Prior work mostly chased lower compilation errors; this one flags that incompleteness itself can be harmful. They back this with runs across seven models, three benchmarks, two languages, and two constrainers, plus some variation in incompleteness levels. That gives the claim some breadth and produces concrete numbers on when the technique backfires.

The soft spot is the causal attribution. The abstract ties the performance gap directly to incompleteness distorting the distribution, but it does not describe controls that separate this from timeout thresholds, benchmark selection, or model sampling quirks. Without those, the mechanism remains plausible but not isolated. The low soundness score tracks with that gap.

This is for people who build or deploy constrained decoding in LLM code tools. A practitioner weighing whether to add type or syntax constraints would get a practical caution and some scale on the risk.

Send it to peer review. The empirical angle and the alignment framing are worth referee scrutiny even if the methods need tightening on the controls.

Referee Report

3 major / 2 minor

Summary. The paper claims that misalignment between constrainers, language models, and target languages in constrained code generation arises from incompleteness (rejecting valid programs), which distorts the LM distribution, pushes generations into low-probability regions, causes frequent timeouts, and reduces functional correctness by up to 97% relative to unconstrained decoding. It supports this via evaluations across seven models, two target languages, two constrainers (types and syntax), and three benchmarks, concluding that incomplete constrainers can be detrimental and offering insights for better constrainer design.

Significance. If the central empirical claim can be isolated from confounds, the work would highlight an important practical limitation of constrained decoding for code generation that prior literature has overlooked. The quantitative results on performance degradation could inform constrainer design choices in systems aiming for formal guarantees. No machine-checked proofs, reproducible artifacts, or parameter-free derivations are described.

major comments (3)

[Evaluation section] The attribution of the performance gap (unconstrained outperforming constrained by up to 97%) to incompleteness lacks controls that vary timeout thresholds independently of the constrainer; without such isolation it is unclear whether the observed timeouts and correctness drops are driven by incompleteness or by the interaction of the chosen timeout with constrained search paths.
[Methods and results] Incompleteness levels are not shown to be quantified via an independent metric separate from the decoding runs themselves; if incompleteness is measured from the same generations that exhibit timeouts, the causal claim that incompleteness pushes models into low-probability regions risks circularity.
[Abstract and evaluation] The claim that 'incompleteness pushes the model into low-probability regions' is presented as the operative mechanism, yet no direct measurement (e.g., probability mass or entropy comparisons between constrained and unconstrained paths) is described to support this over alternative explanations such as benchmark selection or model-specific sampling.

minor comments (2)

[Abstract] The abstract states results on 'seven language models, two target languages, two constrainers' but does not list the exact models, languages, or constrainers; adding an explicit table or list in the methods would improve clarity.
[Evaluation] The provided abstract does not say how functional correctness is measured (e.g., test-case pass rate, exact match), nor does it report inter-rater reliability for any manual checks; these details should be added to the evaluation protocol.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our experimental design. We address each major comment below, providing clarifications and indicating where revisions will be made to strengthen the manuscript.

Referee: [Evaluation section] The attribution of the performance gap (unconstrained outperforming constrained by up to 97%) to incompleteness lacks controls that vary timeout thresholds independently of the constrainer; without such isolation it is unclear whether the observed timeouts and correctness drops are driven by incompleteness or by the interaction of the chosen timeout with constrained search paths.

Authors: We acknowledge that varying timeout thresholds independently would provide stronger isolation of the incompleteness effect. Our experiments use a fixed timeout value consistent with prior constrained decoding literature for code generation. The higher timeout rates under constrained decoding arise because incompleteness forces the search to exhaust valid high-probability paths, but we agree this could interact with the timeout choice. We will revise the evaluation section to explicitly discuss the fixed timeout as a potential confound and include it as a direction for future work. revision: partial
Referee: [Methods and results] Incompleteness levels are not shown to be quantified via an independent metric separate from the decoding runs themselves; if incompleteness is measured from the same generations that exhibit timeouts, the causal claim that incompleteness pushes models into low-probability regions risks circularity.

Authors: Incompleteness is quantified independently by measuring the rate at which the constrainer rejects ground-truth programs from the benchmarks (i.e., valid programs in the target language that the constrainer incorrectly rejects). This metric is computed prior to and separately from any decoding runs or timeout observations. The timeouts and functional correctness results are then correlated with these pre-computed incompleteness levels across different constrainers. We will revise the methods section to make this separation explicit and include the exact formula used for the independent incompleteness metric. revision: yes
Referee: [Abstract and evaluation] The claim that 'incompleteness pushes the model into low-probability regions' is presented as the operative mechanism, yet no direct measurement (e.g., probability mass or entropy comparisons between constrained and unconstrained paths) is described to support this over alternative explanations such as benchmark selection or model-specific sampling.

Authors: We rely on indirect but consistent evidence: across seven models and three benchmarks, incomplete constrainers produce substantially higher timeout rates and lower functional correctness, which we attribute to the model being forced away from its preferred (high-probability) valid programs. While direct probability mass or entropy measurements on paths would provide stronger mechanistic support, such measurements were not performed in the current study. We will add a limitations paragraph noting this and that alternative explanations cannot be fully ruled out without those measurements. revision: partial
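The independent incompleteness metric described in the second response (the rate at which a constrainer rejects ground-truth programs) could be computed along the following lines. This is a sketch under assumptions: the exact formula is not given in this excerpt, and `incompleteness_rate`, `toy_accepts`, and the sample programs are hypothetical stand-ins for a real constrainer and benchmark.

```python
from typing import Callable, Iterable

def incompleteness_rate(programs: Iterable[str],
                        accepts: Callable[[str], bool]) -> float:
    """Fraction of known-valid programs that the constrainer rejects."""
    programs = list(programs)
    if not programs:
        raise ValueError("need at least one ground-truth program")
    rejected = sum(1 for program in programs if not accepts(program))
    return rejected / len(programs)

def toy_accepts(program: str) -> bool:
    # Hypothetical constrainer that does not support `while` loops,
    # so it rejects valid programs that use them.
    return "while" not in program

# Stand-ins for benchmark ground-truth solutions (all valid in the target language).
ground_truth = [
    "return a + b;",
    "while (i < n) { i++; }",
    "for (const x of xs) { s += x; }",
    "while (true) { break; }",
]

rate = incompleteness_rate(ground_truth, toy_accepts)
print(rate)
```

Because the programs are checked offline against the constrainer, the metric does not depend on any decoding run, which is the separation the response appeals to.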

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation

The paper advances an empirical hypothesis about misalignment in constrained decoding and tests it via direct experiments across seven LLMs, two languages, two constrainers, and three benchmarks. Functional correctness, timeout rates, and incompleteness effects are measured as observed outcomes rather than derived from any equations, fitted parameters, or self-referential definitions. No load-bearing self-citations, ansatzes, or uniqueness theorems appear in the abstract or described contributions. The central quantitative claim (up to 97% reduction) is an experimental result, not a prediction that reduces to its inputs by construction. This matches the default expectation of no significant circularity for benchmark-driven work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study relying on standard assumptions in machine learning evaluation for code generation tasks. No free parameters, invented entities, or non-standard axioms are evident from the abstract.

axioms (1)

domain assumption: Benchmarks and models used are representative for assessing constrained decoding effects.
Invoked implicitly in the evaluation across seven models and three benchmarks.

Reference graph

Works this paper leans on

49 extracted references · 26 canonical work pages · 9 internal anchors

[1]

Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K

Lakshya A. Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K. Lahiri, and Sriram K. Rajamani. 2023. Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023...

2023
[2]

Guidance AI. 2025. LLGuidance: Super-fast Structured Outputs for Large Lan- guage Models. GitHub repository. https://guidance-ai.github.io/llguidance/llg- go-brrr Version 1.0.0, MIT License

2025
[3]

Andrea Arcuri and Lionel C. Briand. 2014. A Hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering.Softw. Test. Verification Reliab.24, 3 (2014), 219–250. doi:10.1002/STVR.1486

work page doi:10.1002/stvr.1486 2014
[4]

Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. CoRRabs/2108.07732 (2021). arXiv:2108.07732 https://arxiv.org/abs/2108.07732

Pith/arXiv arXiv 2021
[5]

Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. 2025. Small Lan- guage Models are the Future of Agentic AI.CoRRabs/2506.02153 (2025). doi:10.48550/ARXIV.2506.02153 arXiv:2506.02153

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.02153 2025
[6]

Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro von Werra. 2022. A framework for the evaluation of code genera- tion models. https://github.com/bigcode-project/bigcode-evaluation-harness. Accessed 2026-06-18

2022
[7]

Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, Zeyao Ma, Kashun Shum, Xuwu Wang, Jinxi Wei, Jiaxi Yang, Jiajun Zhang, Lei Zhang, Zongmeng Zhang, Wenting Zhao, and Fan Zhou. 2026. Qwen3-Coder-Next Technical Report.CoRR abs/2603.00729 (2026). doi:10.48550/ARXIV.2603.00729 arXiv:...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2603.00729 2026
[8]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

Pith/arXiv arXiv 2021
[9]

Yiu Wai Chow, Luca Di Grazia, and Michael Pradel. 2024. PyTy: Repairing Static Type Errors in Python. InProceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024. ACM, 87:1–87:13. doi:10.1145/3597503.3639184

work page doi:10.1145/3597503.3639184 2024
[10]

Ruan, Yaxing Cai, Ziyi Xu, Yilong Zhao, Ruihang Lai, and Tianqi Chen

Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ziyi Xu, Yilong Zhao, Ruihang Lai, and Tianqi Chen. 2025. XGrammar: Flexible and Efficient Structured Genera- tion Engine for Large Language Models. InProceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12- 15, 2025, Matei Zaharia, Gauri Joshi, and Yingyan (Ce...

2025
[11]

Margarida Ferreira, Victor Nicolet, Joey Dodds, and Daniel Kroening. 2025. Pro- gram Synthesis from Partial Traces.Proc. ACM Program. Lang.9, PLDI (2025), 1642–1665. doi:10.1145/3729316

work page doi:10.1145/3729316 2025
[12]

Teodoro Freund, Yann Hamdaoui, and Arnaud Spiwack. 2021. Union and inter- section contracts are hard, actually. InDLS 2021: Proceedings of the 17th ACM SIG- PLAN International Symposium on Dynamic Languages, Virtual Event / Chicago, IL, USA, October 19, 2021, Arjun Guha (Ed.). ACM, 1–11. doi:10.1145/3486602.3486767

work page doi:10.1145/3486602.3486767 2021
[13]

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. 2023. Grammar- Constrained Decoding for Structured NLP Tasks without Finetuning. InProceed- ings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computati...

work page doi:10.18653/v1/2023.emnlp-main.674 2023
[14]

ggml-org. 2026. GGML BNF Grammars. https://github.com/ggml-org/llama.cpp/ blob/master/grammars/README.md. GitHub, Accessed 2026-03-12

2026
[15]

Alessandro Giagnorio, Alberto Martin-Lopez, and Gabriele Bavota. 2025. En- hancing Code Generation for Low-Resource Languages: No Silver Bullet. In33rd IEEE/ACM International Conference on Program Comprehension, ICPC@ICSE 2025, Ottawa, ON, Canada, April 27-28, 2025. IEEE, 478–488. doi:10.1109/ICPC66645. 2025.00058

work page doi:10.1109/icpc66645 2025
[16]

Emmanuel Anaya Gonzalez, Sairam Vaidya, Kanghee Park, Ruyi Ji, Taylor Berg- Kirkpatrick, and Loris D’Antoni. 2025. Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective.CoRRabs/2506.05754 (2025). doi:10.48550/ARXIV.2506.05754 arXiv:2506.05754

work page doi:10.48550/arxiv.2506.05754 2025
[17]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wen- feng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence.CoRRabs/2401.14196 (2024). doi:10.48550/ARXIV.2401.14196 arXiv:2401.14196

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.14196 2024
[18]

Md Mahade Hasan, Muhammad Waseem, Kai-Kristian Kemell, Jussi Rasku, Juha Ala-Rantala, and Pekka Abrahamsson. 2026. Assessing small language models for code generation: An empirical study with benchmarks.J. Syst. Softw.236 (2026), 112815. doi:10.1016/J.JSS.2026.112815

work page doi:10.1016/j.jss.2026.112815 2026
[19]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review.ACM Trans. Softw. Eng. Methodol.33, 8 (2024), 220:1–220:79. doi:10.1145/3695988

work page doi:10.1145/3695988 2024
[20]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=nZeVKeeFYf9

2022
[21]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report.CoRRabs/2409.12186 (2024). doi:10.48550/ARXIV.2409.12186 arXiv:2409.12186

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186 2024
[22]

Nima Karimipour, Michael Pradel, Martin Kellogg, and Manu Sridharan. 2025. LLM-Based Repair of Static Nullability Errors.CoRRabs/2507.20674 (2025). doi:10.48550/ARXIV.2507.20674 arXiv:2507.20674

work page doi:10.48550/arxiv.2507.20674 2025
[23]

Lingxiao Li, Salar Rahili, and Yiwei Zhao. 2025. Correctness-Guaranteed Code Generation via Constrained Decoding.CoRRabs/2508.15866 (2025). doi:10.48550/ ARXIV.2508.15866 arXiv:2508.15866

arXiv 2025
[24]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. InAdvances in Neural Infor- mation Processing Systems 36: Annual Conference on Neural Information Pro- cessing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 1...

2023
[25]

Lew, Tim Vieira, and Timothy J

João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Alexander K. Lew, Tim Vieira, and Timothy J. O’Donnell
[26]

InThe Thirteenth International Conference on Learn- ing Representations, ICLR 2025, Singapore, April 24-28, 2025

Syntactic and Semantic Control of Large Language Models via Se- quential Monte Carlo. InThe Thirteenth International Conference on Learn- ing Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id=xoXn62FzD0

2025
[27]

smoke test passes

Davide Molinelli, Luca Di Grazia, Alberto Martin-Lopez, Michael D. Ernst, and Mauro Pezzè. 2025. Do LLMs Generate Useful Test Oracles? An Empirical Study with an Unbiased Dataset. In40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025, Seoul, Korea, Republic of, November 16-20, 2025. IEEE, 278–290. doi:10.1109/ASE63991.2025.00031

work page doi:10.1109/ase63991.2025.00031 2025
[28]

Ernst, and Mauro Pezzè

Davide Molinelli, Alberto Martin-Lopez, Elliott Zackrone, Beyza Eken, Michael D. Ernst, and Mauro Pezzè. 2025. Tratto: A Neuro-Symbolic Approach to Deriving Axiomatic Test Oracles.Proc. ACM Softw. Eng.2, ISSTA (2025), 1887–1909. doi:10.1145/3728960

work page doi:10.1145/3728960 2025
[29]

Niels Mündler, Jasper Dekoninck, and Martin T. Vechev. 2025. Constrained Decoding of Diffusion LLMs with Context-Free Grammars.CoRRabs/2508.10111 (2025). doi:10.48550/ARXIV.2508.10111 arXiv:2508.10111

work page doi:10.48550/arxiv.2508.10111 2025
[30]

Niels Mündler, Jingxuan He, Hao Wang, Koushik Sen, Dawn Song, and Martin T. Vechev. 2025. Type-Constrained Code Generation with Language Models.Proc. ACM Program. Lang.9, PLDI (2025), 601–626. doi:10.1145/3729274

work page doi:10.1145/3729274 2025
[31]

Shaan Nagy, Timothy Zhou, Nadia Polikarpova, and Loris D’Antoni. 2026. Chop- Chop: A Programmable Framework for Semantically Constraining the Output of Language Models.Proc. ACM Program. Lang.10, POPL (2026), 1905–1932. doi:10.1145/3776708

work page doi:10.1145/3776708 2026
[32]

NousResearch. 2025. NousResearch/json-mode-eval. Hugging Face dataset repos- itory. https://huggingface.co/datasets/NousResearch/json-mode-eval Accessed Matteo Biagiola, Jahrim Gabriele Cesario, Luca Di Grazia, George Zakhour, and Guido Salvaneschi 2026-03-12

2025
[33]

OpenAI. 2023. GPT-4 Technical Report.CoRRabs/2303.08774 (2023). doi:10. 48550/ARXIV.2303.08774 arXiv:2303.08774

Pith/arXiv arXiv 2023
[34]

Zhang, Mark Harman, and Meng Wang

Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. 2025. An Em- pirical Study of the Non-Determinism of ChatGPT in Code Generation.ACM Trans. Softw. Eng. Methodol.34, 2 (2025), 42:1–42:28. doi:10.1145/3697010

work page doi:10.1145/3697010 2025
[35]

Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, and Loris D’Antoni. 2024. Grammar-Aligned Decoding. InAdvances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globersons, Lester Mackey, Danielle Belgrave, An...

2024
[36]

Kanghee Park, Timothy Zhou, and Loris D’Antoni. 2025. Flexible and Efficient Grammar-Constrained Decoding. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025 (Proceed- ings of Machine Learning Research, Vol. 267), Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Teg...

2025
[37]

Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable Code Generation from Pre-trained Language Models. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=KmtVD97J43e

2022
[38]

Bierman, and Panagiotis Vekris

Aseem Rastogi, Nikhil Swamy, Cédric Fournet, Gavin M. Bierman, and Panagiotis Vekris. 2015. Safe & Efficient Gradual Typing for TypeScript. InProceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, India, January 15-17, 2015, Sriram K. Rajamani and David Walker (Eds.). ACM, 167–180. doi:10.114...

work page doi:10.1145/2676726.2676971 2015
[39]

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xi- aoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton- Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Th...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.12950 2023
[40]


[1]

Lakshya A. Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K. Lahiri, and Sriram K. Rajamani. 2023. Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023...

[2]

Guidance AI. 2025. LLGuidance: Super-fast Structured Outputs for Large Language Models. GitHub repository. https://guidance-ai.github.io/llguidance/llg-go-brrr Version 1.0.0, MIT License

[3]

Andrea Arcuri and Lionel C. Briand. 2014. A Hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Softw. Test. Verification Reliab. 24, 3 (2014), 219–250. doi:10.1002/STVR.1486

[4]

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. CoRR abs/2108.07732 (2021). arXiv:2108.07732 https://arxiv.org/abs/2108.07732

[5]

Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. 2025. Small Language Models are the Future of Agentic AI. CoRR abs/2506.02153 (2025). doi:10.48550/ARXIV.2506.02153 arXiv:2506.02153

[6]

Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro von Werra. 2022. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness. Accessed 2026-06-18

[7]

Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, Zeyao Ma, Kashun Shum, Xuwu Wang, Jinxi Wei, Jiaxi Yang, Jiajun Zhang, Lei Zhang, Zongmeng Zhang, Wenting Zhao, and Fan Zhou. 2026. Qwen3-Coder-Next Technical Report. CoRR abs/2603.00729 (2026). doi:10.48550/ARXIV.2603.00729 arXiv:...

[8]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian... 2021.

[9]

Yiu Wai Chow, Luca Di Grazia, and Michael Pradel. 2024. PyTy: Repairing Static Type Errors in Python. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024. ACM, 87:1–87:13. doi:10.1145/3597503.3639184

[10]

Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ziyi Xu, Yilong Zhao, Ruihang Lai, and Tianqi Chen. 2025. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. In Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025, Matei Zaharia, Gauri Joshi, and Yingyan (Ce...

[11]

Margarida Ferreira, Victor Nicolet, Joey Dodds, and Daniel Kroening. 2025. Program Synthesis from Partial Traces. Proc. ACM Program. Lang. 9, PLDI (2025), 1642–1665. doi:10.1145/3729316

[12]

Teodoro Freund, Yann Hamdaoui, and Arnaud Spiwack. 2021. Union and intersection contracts are hard, actually. In DLS 2021: Proceedings of the 17th ACM SIGPLAN International Symposium on Dynamic Languages, Virtual Event / Chicago, IL, USA, October 19, 2021, Arjun Guha (Ed.). ACM, 1–11. doi:10.1145/3486602.3486767

[13]

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. 2023. Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computati... doi:10.18653/v1/2023.emnlp-main.674

[14]

ggml-org. 2026. GGML BNF Grammars. https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md. GitHub. Accessed 2026-03-12

[15]

Alessandro Giagnorio, Alberto Martin-Lopez, and Gabriele Bavota. 2025. Enhancing Code Generation for Low-Resource Languages: No Silver Bullet. In 33rd IEEE/ACM International Conference on Program Comprehension, ICPC@ICSE 2025, Ottawa, ON, Canada, April 27-28, 2025. IEEE, 478–488. doi:10.1109/ICPC66645.2025.00058

[16]

Emmanuel Anaya Gonzalez, Sairam Vaidya, Kanghee Park, Ruyi Ji, Taylor Berg-Kirkpatrick, and Loris D’Antoni. 2025. Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective. CoRR abs/2506.05754 (2025). doi:10.48550/ARXIV.2506.05754 arXiv:2506.05754

[17]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence. CoRR abs/2401.14196 (2024). doi:10.48550/ARXIV.2401.14196 arXiv:2401.14196

[18]

Md Mahade Hasan, Muhammad Waseem, Kai-Kristian Kemell, Jussi Rasku, Juha Ala-Rantala, and Pekka Abrahamsson. 2026. Assessing small language models for code generation: An empirical study with benchmarks. J. Syst. Softw. 236 (2026), 112815. doi:10.1016/J.JSS.2026.112815

[19]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. 33, 8 (2024), 220:1–220:79. doi:10.1145/3695988

[20]

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=nZeVKeeFYf9

[21]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. CoRR abs/2409.12186 (2024). doi:10.48550/ARXIV.2409.12186 arXiv:2409.12186

[22]

Nima Karimipour, Michael Pradel, Martin Kellogg, and Manu Sridharan. 2025. LLM-Based Repair of Static Nullability Errors. CoRR abs/2507.20674 (2025). doi:10.48550/ARXIV.2507.20674 arXiv:2507.20674

[23]

Lingxiao Li, Salar Rahili, and Yiwei Zhao. 2025. Correctness-Guaranteed Code Generation via Constrained Decoding. CoRR abs/2508.15866 (2025). doi:10.48550/ARXIV.2508.15866 arXiv:2508.15866

[24]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 1...

[25]

João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Alexander K. Lew, Tim Vieira, and Timothy J. O’Donnell. 2025. Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id=xoXn62FzD0

[27]

Davide Molinelli, Luca Di Grazia, Alberto Martin-Lopez, Michael D. Ernst, and Mauro Pezzè. 2025. Do LLMs Generate Useful Test Oracles? An Empirical Study with an Unbiased Dataset. In 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025, Seoul, Korea, Republic of, November 16-20, 2025. IEEE, 278–290. doi:10.1109/ASE63991.2025.00031

[28]

Davide Molinelli, Alberto Martin-Lopez, Elliott Zackrone, Beyza Eken, Michael D. Ernst, and Mauro Pezzè. 2025. Tratto: A Neuro-Symbolic Approach to Deriving Axiomatic Test Oracles. Proc. ACM Softw. Eng. 2, ISSTA (2025), 1887–1909. doi:10.1145/3728960

[29]

Niels Mündler, Jasper Dekoninck, and Martin T. Vechev. 2025. Constrained Decoding of Diffusion LLMs with Context-Free Grammars. CoRR abs/2508.10111 (2025). doi:10.48550/ARXIV.2508.10111 arXiv:2508.10111

[30]

Niels Mündler, Jingxuan He, Hao Wang, Koushik Sen, Dawn Song, and Martin T. Vechev. 2025. Type-Constrained Code Generation with Language Models. Proc. ACM Program. Lang. 9, PLDI (2025), 601–626. doi:10.1145/3729274

[31]

Shaan Nagy, Timothy Zhou, Nadia Polikarpova, and Loris D’Antoni. 2026. ChopChop: A Programmable Framework for Semantically Constraining the Output of Language Models. Proc. ACM Program. Lang. 10, POPL (2026), 1905–1932. doi:10.1145/3776708

[32]

NousResearch. 2025. NousResearch/json-mode-eval. Hugging Face dataset repository. https://huggingface.co/datasets/NousResearch/json-mode-eval Accessed 2026-03-12

[33]

OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023). doi:10.48550/ARXIV.2303.08774 arXiv:2303.08774

[34]

Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. 2025. An Empirical Study of the Non-Determinism of ChatGPT in Code Generation. ACM Trans. Softw. Eng. Methodol. 34, 2 (2025), 42:1–42:28. doi:10.1145/3697010

[35]

Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, and Loris D’Antoni. 2024. Grammar-Aligned Decoding. In Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globersons, Lester Mackey, Danielle Belgrave, An...

[36]

Kanghee Park, Timothy Zhou, and Loris D’Antoni. 2025. Flexible and Efficient Grammar-Constrained Decoding. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025 (Proceedings of Machine Learning Research, Vol. 267), Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Teg...

[37]

Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable Code Generation from Pre-trained Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=KmtVD97J43e

[38]

Aseem Rastogi, Nikhil Swamy, Cédric Fournet, Gavin M. Bierman, and Panagiotis Vekris. 2015. Safe & Efficient Gradual Typing for TypeScript. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, India, January 15-17, 2015, Sriram K. Rajamani and David Walker (Eds.). ACM, 167–180. doi:10.1145/2676726.2676971

[39]

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Th... 2023. doi:10.48550/arxiv.2308.12950

[40]

Jan-Philipp Schreiter, Kirill Fuks, and Horst Hellbrück. 2025. A Novel Approach and Framework for Configuration of Agent-Based LLMs in Real-World Applications. In Intelligent Computing, Kohei Arai (Ed.). Springer Nature Switzerland, Cham, 635–650

[41]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. CoRR abs/2402.03300 (2024). doi:10.48550/ARXIV.2402.03300

[42]

Tarun Suresh, Debangshu Banerjee, Shubham Ugare, Sasa Misailovic, and Gagandeep Singh. 2025. DINGO: Constrained Inference for Diffusion LLMs. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, NeurIPS 2025, San Diego, CA, USA, December 2-7, 2025 / Mexico City, Mexico, November 30 - ...

[43]

Gemma Team. 2024. Gemma: Open Models Based on Gemini Research and Technology. CoRR abs/2403.08295 (2024). doi:10.48550/ARXIV.2403.08295 arXiv:2403.08295

[44]

TOML. 2026. TOML: Tom’s Obvious Minimal Language, v1.1.0. https://toml.io/en/. Accessed 2026-03-13

[46]

Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep Singh. 2025. SynCode: LLM Generation with Grammar Augmentation. Trans. Mach. Learn. Res. 2025 (2025). https://openreview.net/forum?id=HiUZtgAPoH

[47]

András Vargha and Harold D. Delaney. 2000. A Critique and Improvement of the "CL" Common Language Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25, 2 (2000), 101–132. http://www.jstor.org/stable/1165329

[48]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxbur...

[49]

Brandon T. Willard and Rémi Louf. 2023. Efficient Guided Generation for Large Language Models. CoRR abs/2307.09702 (2023). doi:10.48550/ARXIV.2307.09702 arXiv:2307.09702

[50]

John Yang, Kilian Leret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. 2025. SWE-smith: Scaling Data for Software Engineering Agents. CoRR abs/2504.21798 (2025). doi:10.48550/ARXIV.2504.21798 arXiv:2504.21798