ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation
Pith reviewed 2026-05-18 07:56 UTC · model grok-4.3
The pith
Code LLMs achieve high functional correctness but produce almost no code that enforces task preconditions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that functional-correctness benchmarks overlook whether generated code enforces the preconditions stated or implied by the task. ContractEval supplies 364 tasks that include reconstructed descriptions stating contracts explicitly, test cases created by pairing an LLM with an SMT solver to verify contract satisfaction, and reference implementations that incorporate the contracts. Evaluation of five representative open-source code LLMs shows pass@1 scores of 75-82 percent on functional tests but 0 percent contract satisfaction under standard prompting, rising to only 23-41 percent when contracts appear explicitly in the prompt.
What carries the argument
ContractEval benchmark, built from HumanEval+ and MBPP+ tasks augmented with explicit contracts, neuro-symbolic test-case synthesis via LLM paired with SMT solver, and reference code that includes contracts, used to isolate contract satisfaction from functional correctness.
If this is right
- Generated code that passes functional tests but violates contracts can accept invalid inputs and produce incorrect or unsafe results in deployment.
- Code generation benchmarks should add contract-satisfaction checks to avoid overestimating model quality on precondition enforcement.
- Explicitly stating contracts in prompts improves satisfaction rates but still leaves most outputs non-compliant.
- Training or fine-tuning objectives for code models do not currently emphasize learning to enforce input preconditions.
Where Pith is reading between the lines
- Real-world use of such models risks crashes or security problems when inputs fall outside the assumed valid range.
- Extending the same contract-checking method to other languages or task domains would likely reveal similar gaps between correctness and safety.
- Incorporating contract verification directly into the generation or post-processing step could serve as a practical safeguard until models improve.
Load-bearing premise
The neuro-symbolic pipeline that pairs an LLM with an SMT solver produces test cases that correctly and completely evaluate whether generated code satisfies the stated contracts.
What would settle it
Independent execution of generated code on an input that violates a contract, showing acceptance of the invalid input despite the benchmark marking the code as contract-satisfying.
Figures
read the original abstract
Current code generation evaluation measures functional correctness on well-formed inputs that satisfy all input preconditions. This paradigm has a critical limitation: task descriptions often leave these preconditions implicit, while evaluation filters out inputs that violate them. As a result, generated code may achieve high pass@k scores while failing to enforce the preconditions that the task actually requires. To address this gap, we introduce ContractEval, a benchmark for evaluating whether generated code enforces such preconditions--commonly referred to as contracts. Built on HumanEval+ and MBPP+, ContractEval consists of 364 tasks, each with three components: (i) descriptions reconstructed to explicitly state the contracts, (ii) test cases synthesized through a neuro-symbolic pipeline that pairs an LLM with an SMT solver to evaluate whether generated code satisfies these contracts, and (iii) reference code combined with contracts. Using ContractEval to evaluate five representative open-source code LLMs, we reveal a stark disparity between functional correctness and contract satisfaction. Under standard prompting, these models achieve pass@1 of 75-82% with 0% contract satisfaction. Even when contracts are explicitly stated in the prompt, the satisfaction rate reaches only 23-41%. This indicates that current LLMs struggle to satisfy contracts in their generated code, establishing contract satisfaction as a crucial and previously overlooked axis of code generation quality. Our code is available at https://github.com/suhanmen/ContractEval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ContractEval, a benchmark of 364 tasks derived from HumanEval+ and MBPP+ to evaluate whether LLM-generated code satisfies contracts (preconditions). Each task includes reconstructed descriptions that explicitly state contracts, test cases synthesized via a neuro-symbolic LLM+SMT-solver pipeline, and reference code with contracts. Evaluation of five open-source code LLMs shows functional pass@1 rates of 75-82% under standard prompting but 0% contract satisfaction, improving only to 23-41% when contracts are explicitly included in the prompt.
Significance. If the synthesized test cases reliably detect contract violations, the work identifies contract satisfaction as a distinct and previously unmeasured axis of code-generation quality that is orthogonal to functional correctness on valid inputs. The results suggest current models largely ignore implicit preconditions even when high functional scores are achieved, with potential relevance to reliable and safe code generation. The public GitHub release aids reproducibility.
major comments (1)
- [Methodology (test-case synthesis pipeline)] The neuro-symbolic test-case synthesis pipeline (described in the methodology section on task construction) reports no quantitative validation such as false-negative rates for undetected contract violations, coverage across violation modes, or agreement with human-authored oracles. This is load-bearing for the headline disparity (75-82% functional vs. 0% contract satisfaction), because incomplete or overly narrow test cases could produce the observed low satisfaction rates without reflecting actual model behavior.
minor comments (1)
- [Abstract] The abstract refers to results on 'five representative open-source code LLMs' without naming the specific models; this detail should appear in the abstract or early in the evaluation section for immediate clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comment point by point below.
read point-by-point responses
-
Referee: [Methodology (test-case synthesis pipeline)] The neuro-symbolic test-case synthesis pipeline (described in the methodology section on task construction) reports no quantitative validation such as false-negative rates for undetected contract violations, coverage across violation modes, or agreement with human-authored oracles. This is load-bearing for the headline disparity (75-82% functional vs. 0% contract satisfaction), because incomplete or overly narrow test cases could produce the observed low satisfaction rates without reflecting actual model behavior.
Authors: We agree that the current manuscript lacks explicit quantitative validation of the test-case synthesis pipeline, including false-negative rates, coverage across violation modes, and agreement with human-authored oracles. This is a substantive point, as such metrics would increase confidence that the synthesized tests are comprehensive enough to support the reported disparity. The pipeline pairs an LLM with an SMT solver to ensure that any generated violating input is formally verified as a true violation (providing soundness for detected cases), but we acknowledge that unreported false-negative rates could in principle affect interpretation. In the revised manuscript we will add a dedicated validation subsection reporting: (i) agreement rates with human oracles on a sampled subset of tasks, (ii) coverage statistics stratified by precondition type, and (iii) an empirical estimate of undetected violations obtained via alternative generation methods. These additions will directly address the concern while preserving the core findings. revision: yes
Circularity Check
No significant circularity in benchmark construction or evaluation results
full rationale
The paper constructs ContractEval by reconstructing explicit contracts from existing HumanEval+ and MBPP+ tasks, synthesizing test cases via an LLM+SMT neuro-symbolic pipeline, and running independent evaluations of five open-source LLMs to measure functional pass@1 versus contract satisfaction rates. These steps produce empirical observations (e.g., 75-82% pass@1 with 0% contract satisfaction under standard prompting) from external model inferences on newly generated artifacts rather than any self-referential fitting, parameter estimation, or derivation that reduces the reported disparity to inputs defined inside the paper. No equations, uniqueness theorems, ansatzes, or load-bearing self-citations appear in the derivation chain; the methodology is presented as an additive empirical benchmark without reducing claims by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Reconstructed task descriptions accurately capture the implicit contracts present in the original HumanEval+ and MBPP+ tasks.
- domain assumption The neuro-symbolic pipeline (LLM paired with SMT solver) correctly determines whether generated code satisfies the contracts.
Reference graph
Works this paper leans on
-
[1]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Do- han, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models.CoRR, abs/2108.07732,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Pavol Bielik, Veselin Raychev, and Martin T
9 Under review. Pavol Bielik, Veselin Raychev, and Martin T. Vechev. PHOG: probabilistic model for code. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.),Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, vol- ume 48 ofJMLR Workshop and Conference Proceedings, pp. 2933–2942. JMLR.org,
work page 2016
-
[3]
URL http://proceedings.mlr.press/v48/bielik16.html. Jacob Burnim and Koushik Sen. Heuristics for scalable dynamic test generation. In23rd IEEE/ACM International Conference on Automated Software Engineering (ASE 2008), 15-19 September 2008, L’Aquila, Italy, pp. 443–446. IEEE Computer Society,
work page 2008
-
[4]
Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In Richard Draves and Robbert van Re- nesse (eds.),8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, December 8-10, 2008, San Diego, California, USA, Proceedings, pp. 209–224. USE...
work page 2008
-
[5]
Automatic gen- eration of test cases from formal specifications using mutation testing
Román Jaramillo Cajica, Raul Ernesto Gonzalez-Torres, and Pedro Mejía-Alvarez. Automatic gen- eration of test cases from formal specifications using mutation testing. In18th International Conference on Electrical Engineering, Computing Science and Automatic Control, CCE 2021, Mexico City, Mexico, November 10-12, 2021, pp. 1–6. IEEE,
work page 2021
-
[6]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[7]
Quickcheck: a lightweight tool for random testing of haskell programs
Koen Claessen and John Hughes. Quickcheck: a lightweight tool for random testing of haskell programs. In Martin Odersky and Philip Wadler (eds.),Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming (ICFP ’00), Montreal, Canada, September 18-21, 2000, pp. 268–279. ACM,
work page 2000
-
[8]
doi: 10.1145/351240.351266. URLhttps://doi. org/10.1145/351240.351266. 10 Under review. Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C. Des- marais. Effective test generation using pre-trained large language models and mutation testing. Information & Software Technology, 171:107468,
-
[9]
Leonardo Mendonça de Moura and Nikolaj S. Bjørner. Z3: an efficient SMT solver. In C. R. Ramakrishnan and Jakob Rehof (eds.),Tools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29...
work page 2008
-
[10]
Springer, 2008.doi:10.1007/978-3-540-78800-3\_24
doi: 10.1007/978-3-540-78800-3\_24. URLhttps://doi.org/10.1007/ 978-3-540-78800-3_24. Florian Dyck, Cedric Richter, and Heike Wehrheim. Robustness testing of software verifiers. In Carla Ferreira and Tim A. C. Willemse (eds.),Software Engineering and Formal Methods - 21st International Conference, SEFM 2023, Eindhoven, The Netherlands, November 6-10, 2023...
-
[11]
DART: directed automated random testing
Patrice Godefroid, Nils Klarlund, and Koushik Sen. DART: directed automated random testing. In Vivek Sarkar and Mary W. Hall (eds.),Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005, pp. 213–223. ACM,
work page 2005
-
[12]
Patrice Godefroid, Adam Kiezun, and Michael Y . Levin. Grammar-based whitebox fuzzing. In Rajiv Gupta and Saman P. Amarasinghe (eds.),Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Tucson, AZ, USA, June 7-13, 2008, pp. 206–215. ACM,
work page 2008
-
[13]
Jingxuan He, Gishor Sivanrupan, Petar Tsankov, and Martin T. Vechev. Learning to explore paths for symbolic execution. In Yongdae Kim, Jong Kim, Giovanni Vigna, and Elaine Shi (eds.),CCS ’21: 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, November 15 - 19, 2021, pp. 2526–2540. ACM,
work page 2021
-
[14]
Testgeneval: A real world unit test genera- tion and test completion benchmark
Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test genera- tion and test completion benchmark. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,
work page 2025
-
[15]
Revisiting implicitly abusive language detection: Evaluating llms in zero-shot and few-shot settings
Julia Jaremko, Dagmar Gromann, and Michael Wiegand. Revisiting implicitly abusive language detection: Evaluating llms in zero-shot and few-shot settings. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (eds.),Pro- ceedings of the 31st International Conference on Computational Linguistics, COLING ...
work page 2025
-
[16]
Brahma Reddy Korraprolu, Pavitra Pinninti, and Y
11 Under review. Brahma Reddy Korraprolu, Pavitra Pinninti, and Y . Raghu Reddy. Test case generation for re- quirements in natural language - an LLM comparison study. In Jitendra Chhabra, Lov Ku- mar, Sridhar Chimalakonda, Paddy Krishan, and Sangharatna Godboley (eds.),Proceedings of the 18th Innovations in Software Engineering Conference, ISEC 2025, Kur...
work page 2025
-
[17]
URL https://doi.org/10.1145/3717383.3717389
doi: 10.1145/3717383.3717389. URL https://doi.org/10.1145/3717383.3717389. Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. Codamosa: Es- caping coverage plateaus in test generation with pre-trained large language models. In45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Aus- tralia, May ...
-
[18]
Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan
doi: 10.1109/ICSE48619.2023.00085. URL https://doi.org/10.1109/ICSE48619.2023.00085. Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Go...
-
[20]
Training language models to follow instructions with human feedback
URLhttps://arxiv.org/abs/2203.02155. Jan Peleska, Elena V orobev, and Florian Lapschies. Automated test case generation with smt-solving and abstract interpretation. In Mihaela Gheorghiu Bobaru, Klaus Havelund, Gerard J. Holzmann, and Rajeev Joshi (eds.),NASA Formal Methods - Third International Symposium, NFM 2011, Pasadena, CA, USA, April 18-20,
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[21]
T-fuzz: Fuzzing by program transformation
Hui Peng, Yan Shoshitaishvili, and Mathias Payer. T-fuzz: Fuzzing by program transformation. In 2018 IEEE Symposium on Security and Privacy, SP 2018, Proceedings, 21-23 May 2018, San Francisco, California, USA, pp. 697–710. IEEE Computer Society,
work page 2018
-
[22]
Investigating coverage guided fuzzing with mutation testing
Ruixiang Qian, Quanjun Zhang, Chunrong Fang, and Lihua Guo. Investigating coverage guided fuzzing with mutation testing. InInternetware 2022: 13th Asia-Pacific Symposium on Internet- ware, Hohhot, China, June 11 - 12, 2022, pp. 272–281. ACM,
work page 2022
-
[23]
Testspark: Intellij idea’s ultimate test generation companion
Arkadii Sapozhnikov, Mitchell Olsthoorn, Annibale Panichella, Vladimir Kovalenko, and Pouria Derakhshanfar. Testspark: Intellij idea’s ultimate test generation companion. InProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, ICSE Companion 2024, Lisbon, Portugal, April 14-20, 2024, pp. 30–34. ACM,
work page 2024
-
[24]
FOX: coverage-guided fuzzing as online stochastic control
Dongdong She, Adam Storek, Yuchong Xie, Seoyoung Kweon, Prashast Srivastava, and Suman Jana. FOX: coverage-guided fuzzing as online stochastic control. In Bo Luo, Xiaojing Liao, Jun Xu, Engin Kirda, and David Lie (eds.),Proceedings of the 2024 on ACM SIGSAC Conference 12 Under review. on Computer and Communications Security, CCS 2024, Salt Lake City, UT, ...
work page 2024
-
[25]
Gramatron: effective grammar-aware fuzzing
Prashast Srivastava and Mathias Payer. Gramatron: effective grammar-aware fuzzing. In Cristian Cadar and Xiangyu Zhang (eds.),ISSTA ’21: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, Denmark, July 11-17, 2021, pp. 244–256. ACM,
work page 2021
-
[26]
Driller: Augmenting fuzzing through selective symbolic execution
Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Driller: Augmenting fuzzing through selective symbolic execution. In23rd Annual Network and Distributed System Secu- rity Symposium, NDSS 2016, San Diego, California, USA, February 21-24,
work page 2016
-
[27]
Sicheol Sung, Aditi, Dogyu kim, Yo-Sub Han, and Sang-Ki Ko. Logicase: Effective test case generation from logical description in competitive programming.CoRR, abs/2505.15039,
-
[28]
2025, arXiv e-prints, arXiv:2510.13477, doi:10.48550/arXiv
doi: 10.48550/ARXIV .2505.15039. URLhttps://doi.org/10.48550/arXiv.2505. 15039. Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. TESTEV AL: benchmarking large language models for test case generation. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.),Findings of the Association for Comp...
work page internal anchor Pith review doi:10.48550/arxiv 2025
-
[29]
URLhttps://doi.org/10.1145/3660815
doi: 10.1145/3660815. URLhttps://doi.org/10.1145/3660815. Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. exlong: Generating excep- tional behavior tests with large language models,
-
[30]
doi: 10.1145/3696630.3728608. URL http://dx.doi.org/10.1145/3696630.3728608. A CASESTUDY: LOGICALCONTRADICTIONS INDIRECTLLM TESTCASE GENERATION HumanEvalIn Figure 3 shown in the code snippet, this task includes three sequential contracts: assert_0checks if the input is a list,assert_1verifies that all elements in the list are strings, andassert_2ensures t...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.