ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation

Hyunwoo Park; Joonghyuk Hahn; Sang-Ki Ko; Soohan Lim; Yo-Sub Han

arxiv: 2510.12047 · v4 · submitted 2025-10-14 · 💻 cs.AI · cs.SE

ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation

Soohan Lim , Joonghyuk Hahn , Hyunwoo Park , Sang-Ki Ko , Yo-Sub Han This is my paper

Pith reviewed 2026-05-18 07:56 UTC · model grok-4.3

classification 💻 cs.AI cs.SE

keywords code generationbenchmarkscontractspreconditionsLLM evaluationfunctional correctnessinput validationneuro-symbolic methods

0 comments

The pith

Code LLMs achieve high functional correctness but produce almost no code that enforces task preconditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard code generation tests only check whether programs work on valid inputs and discard cases that violate implicit rules, so high scores can hide code that accepts bad inputs the task actually forbids. ContractEval rebuilds 364 tasks from existing benchmarks with explicit contract statements in the descriptions, then uses a pipeline that combines language models and solvers to create tests checking whether generated code rejects invalid inputs. When five open-source code models were run on these tasks, they reached 75 to 82 percent pass rates on ordinary correctness yet satisfied contracts in zero percent of cases under normal prompting. Even when the contracts were written directly into the prompt the satisfaction rate rose only to 23 to 41 percent, showing that contract enforcement is a separate capability current models largely lack.

Core claim

The paper claims that functional-correctness benchmarks overlook whether generated code enforces the preconditions stated or implied by the task. ContractEval supplies 364 tasks that include reconstructed descriptions stating contracts explicitly, test cases created by pairing an LLM with an SMT solver to verify contract satisfaction, and reference implementations that incorporate the contracts. Evaluation of five representative open-source code LLMs shows pass@1 scores of 75-82 percent on functional tests but 0 percent contract satisfaction under standard prompting, rising to only 23-41 percent when contracts appear explicitly in the prompt.

What carries the argument

ContractEval benchmark, built from HumanEval+ and MBPP+ tasks augmented with explicit contracts, neuro-symbolic test-case synthesis via LLM paired with SMT solver, and reference code that includes contracts, used to isolate contract satisfaction from functional correctness.

If this is right

Generated code that passes functional tests but violates contracts can accept invalid inputs and produce incorrect or unsafe results in deployment.
Code generation benchmarks should add contract-satisfaction checks to avoid overestimating model quality on precondition enforcement.
Explicitly stating contracts in prompts improves satisfaction rates but still leaves most outputs non-compliant.
Training or fine-tuning objectives for code models do not currently emphasize learning to enforce input preconditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world use of such models risks crashes or security problems when inputs fall outside the assumed valid range.
Extending the same contract-checking method to other languages or task domains would likely reveal similar gaps between correctness and safety.
Incorporating contract verification directly into the generation or post-processing step could serve as a practical safeguard until models improve.

Load-bearing premise

The neuro-symbolic pipeline that pairs an LLM with an SMT solver produces test cases that correctly and completely evaluate whether generated code satisfies the stated contracts.

What would settle it

Independent execution of generated code on an input that violates a contract, showing acceptance of the invalid input despite the benchmark marking the code as contract-satisfying.

Figures

Figures reproduced from arXiv: 2510.12047 by Hyunwoo Park, Joonghyuk Hahn, Sang-Ki Ko, Soohan Lim, Yo-Sub Han.

**Figure 1.** Figure 1: PACT’s contract-violating test uncovers an implicit constraint that conventional functional tests miss, proving the need for contractaware evaluation. Therefore, contracts are not optional safeguards but integral parts of the specification that define the boundary of valid behavior. Without them, even code snippets that appear functionally correct cannot be trusted to operate safely and reliably. Emp… view at source ↗

**Figure 2.** Figure 2: Running example of PACT with an Example-Augmented Specification (EAS) prompt, [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Code and contracts for HumanEval. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

**Figure 4.** Figure 4: Code and contracts for MBPP. B CASE STUDY: CONTRACT ENFORCEMENT UNDER DIFFERENT PROMPTING CONDITIONS We present a direct comparison of code snippets that the DeepSeek model generated for the MBPP/11 task under two prompting conditions. This comparison illustrates why providing CVTCs is more effective than relying on natural language descriptions alone [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Code generated by DeepSeek with the contract specification (CS) prompt. [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Code generated by DeepSeek with the example-augmented specification (EAS) prompt. [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: The ground-truth implementation, correct in both functionality and contracts. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: A functionally correct implementation that fails to enforce contracts. [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: A robust implementation that enforces contracts but fails on functionality. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: The SMT-LIB template used for formalizing contracts. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: The ground-truth implementation in HumanEval/11 [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: An example of the SMT-LIB template populated for HumanEval/11 [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

read the original abstract

Current code generation evaluation measures functional correctness on well-formed inputs that satisfy all input preconditions. This paradigm has a critical limitation: task descriptions often leave these preconditions implicit, while evaluation filters out inputs that violate them. As a result, generated code may achieve high pass@k scores while failing to enforce the preconditions that the task actually requires. To address this gap, we introduce ContractEval, a benchmark for evaluating whether generated code enforces such preconditions--commonly referred to as contracts. Built on HumanEval+ and MBPP+, ContractEval consists of 364 tasks, each with three components: (i) descriptions reconstructed to explicitly state the contracts, (ii) test cases synthesized through a neuro-symbolic pipeline that pairs an LLM with an SMT solver to evaluate whether generated code satisfies these contracts, and (iii) reference code combined with contracts. Using ContractEval to evaluate five representative open-source code LLMs, we reveal a stark disparity between functional correctness and contract satisfaction. Under standard prompting, these models achieve pass@1 of 75-82% with 0% contract satisfaction. Even when contracts are explicitly stated in the prompt, the satisfaction rate reaches only 23-41%. This indicates that current LLMs struggle to satisfy contracts in their generated code, establishing contract satisfaction as a crucial and previously overlooked axis of code generation quality. Our code is available at https://github.com/suhanmen/ContractEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ContractEval flags a real gap in how we test code LLMs for contract adherence, but the test synthesis pipeline lacks the validation needed to trust the reported disparity.

read the letter

The main takeaway here is that standard code generation benchmarks let models pass functional tests while ignoring whether the generated code actually respects the preconditions that the task description implies. ContractEval tries to close that gap by rebuilding 364 tasks from HumanEval+ and MBPP+ with explicit contracts and new test cases aimed at contract violations. That focus is new and worth noticing because real software development cares about safety properties that go beyond just producing correct outputs on happy-path inputs. The reported numbers are eye-catching: 75-82% pass@1 drops to 0% contract satisfaction under normal prompting, and only climbs to 23-41% when contracts are spelled out. If those numbers hold, they point to a practical weakness in current models that matters for anyone shipping AI coding assistants. The paper does a clean job of laying out the motivation and showing the disparity across five open-source models. The neuro-symbolic test generation approach (LLM plus SMT solver) is a reasonable way to scale test creation without hand-writing everything. What is missing is any direct check on whether those synthesized tests actually catch the violations they are supposed to catch. There are no reported false-negative rates, no coverage numbers for different violation modes, and no comparison against human-written oracles. Without that, it is hard to know how much of the low satisfaction score comes from model behavior versus test cases that are too narrow or miss edge cases. The reconstruction of contracts from the original task descriptions also introduces a step where small wording choices could affect the results. This work is aimed at researchers who evaluate or improve code generation systems, especially those concerned with robustness and safety. A reader who already works on benchmarks like HumanEval+ will see the value immediately. The paper deserves a serious referee because the core idea is sound and the benchmark could become a useful addition if the test pipeline is tightened up. I would send it to review with a request for more validation data on the synthesized tests rather than desk-rejecting it.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces ContractEval, a benchmark of 364 tasks derived from HumanEval+ and MBPP+ to evaluate whether LLM-generated code satisfies contracts (preconditions). Each task includes reconstructed descriptions that explicitly state contracts, test cases synthesized via a neuro-symbolic LLM+SMT-solver pipeline, and reference code with contracts. Evaluation of five open-source code LLMs shows functional pass@1 rates of 75-82% under standard prompting but 0% contract satisfaction, improving only to 23-41% when contracts are explicitly included in the prompt.

Significance. If the synthesized test cases reliably detect contract violations, the work identifies contract satisfaction as a distinct and previously unmeasured axis of code-generation quality that is orthogonal to functional correctness on valid inputs. The results suggest current models largely ignore implicit preconditions even when high functional scores are achieved, with potential relevance to reliable and safe code generation. The public GitHub release aids reproducibility.

major comments (1)

[Methodology (test-case synthesis pipeline)] The neuro-symbolic test-case synthesis pipeline (described in the methodology section on task construction) reports no quantitative validation such as false-negative rates for undetected contract violations, coverage across violation modes, or agreement with human-authored oracles. This is load-bearing for the headline disparity (75-82% functional vs. 0% contract satisfaction), because incomplete or overly narrow test cases could produce the observed low satisfaction rates without reflecting actual model behavior.

minor comments (1)

[Abstract] The abstract refers to results on 'five representative open-source code LLMs' without naming the specific models; this detail should appear in the abstract or early in the evaluation section for immediate clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment point by point below.

read point-by-point responses

Referee: [Methodology (test-case synthesis pipeline)] The neuro-symbolic test-case synthesis pipeline (described in the methodology section on task construction) reports no quantitative validation such as false-negative rates for undetected contract violations, coverage across violation modes, or agreement with human-authored oracles. This is load-bearing for the headline disparity (75-82% functional vs. 0% contract satisfaction), because incomplete or overly narrow test cases could produce the observed low satisfaction rates without reflecting actual model behavior.

Authors: We agree that the current manuscript lacks explicit quantitative validation of the test-case synthesis pipeline, including false-negative rates, coverage across violation modes, and agreement with human-authored oracles. This is a substantive point, as such metrics would increase confidence that the synthesized tests are comprehensive enough to support the reported disparity. The pipeline pairs an LLM with an SMT solver to ensure that any generated violating input is formally verified as a true violation (providing soundness for detected cases), but we acknowledge that unreported false-negative rates could in principle affect interpretation. In the revised manuscript we will add a dedicated validation subsection reporting: (i) agreement rates with human oracles on a sampled subset of tasks, (ii) coverage statistics stratified by precondition type, and (iii) an empirical estimate of undetected violations obtained via alternative generation methods. These additions will directly address the concern while preserving the core findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation results

full rationale

The paper constructs ContractEval by reconstructing explicit contracts from existing HumanEval+ and MBPP+ tasks, synthesizing test cases via an LLM+SMT neuro-symbolic pipeline, and running independent evaluations of five open-source LLMs to measure functional pass@1 versus contract satisfaction rates. These steps produce empirical observations (e.g., 75-82% pass@1 with 0% contract satisfaction under standard prompting) from external model inferences on newly generated artifacts rather than any self-referential fitting, parameter estimation, or derivation that reduces the reported disparity to inputs defined inside the paper. No equations, uniqueness theorems, ansatzes, or load-bearing self-citations appear in the derivation chain; the methodology is presented as an additive empirical benchmark without reducing claims by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions about description reconstruction and the correctness of the LLM-plus-SMT test synthesis pipeline; no free parameters or invented entities are introduced.

axioms (2)

domain assumption Reconstructed task descriptions accurately capture the implicit contracts present in the original HumanEval+ and MBPP+ tasks.
This assumption is required to create the 364 ContractEval tasks from the base benchmarks.
domain assumption The neuro-symbolic pipeline (LLM paired with SMT solver) correctly determines whether generated code satisfies the contracts.
This assumption underpins both the test-case synthesis and the reported contract-satisfaction percentages.

pith-pipeline@v0.9.0 · 5795 in / 1407 out tokens · 41557 ms · 2026-05-18T07:56:56.084360+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 4 internal anchors

[1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Do- han, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models.CoRR, abs/2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Pavol Bielik, Veselin Raychev, and Martin T

9 Under review. Pavol Bielik, Veselin Raychev, and Martin T. Vechev. PHOG: probabilistic model for code. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.),Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, vol- ume 48 ofJMLR Workshop and Conference Proceedings, pp. 2933–2942. JMLR.org,

work page 2016
[3]

Jacob Burnim and Koushik Sen

URL http://proceedings.mlr.press/v48/bielik16.html. Jacob Burnim and Koushik Sen. Heuristics for scalable dynamic test generation. In23rd IEEE/ACM International Conference on Automated Software Engineering (ASE 2008), 15-19 September 2008, L’Aquila, Italy, pp. 443–446. IEEE Computer Society,

work page 2008
[4]

Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In Richard Draves and Robbert van Re- nesse (eds.),8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, December 8-10, 2008, San Diego, California, USA, Proceedings, pp. 209–224. USE...

work page 2008
[5]

Automatic gen- eration of test cases from formal specifications using mutation testing

Román Jaramillo Cajica, Raul Ernesto Gonzalez-Torres, and Pedro Mejía-Alvarez. Automatic gen- eration of test cases from formal specifications using mutation testing. In18th International Conference on Electrical Engineering, Computing Science and Automatic Control, CCE 2021, Mexico City, Mexico, November 10-12, 2021, pp. 1–6. IEEE,

work page 2021
[6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[7]

Quickcheck: a lightweight tool for random testing of haskell programs

Koen Claessen and John Hughes. Quickcheck: a lightweight tool for random testing of haskell programs. In Martin Odersky and Philip Wadler (eds.),Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming (ICFP ’00), Montreal, Canada, September 18-21, 2000, pp. 268–279. ACM,

work page 2000
[8]

URLhttps://doi

doi: 10.1145/351240.351266. URLhttps://doi. org/10.1145/351240.351266. 10 Under review. Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C. Des- marais. Effective test generation using pre-trained large language models and mutation testing. Information & Software Technology, 171:107468,

work page doi:10.1145/351240.351266
[9]

Leonardo Mendonça de Moura and Nikolaj S. Bjørner. Z3: an efficient SMT solver. In C. R. Ramakrishnan and Jakob Rehof (eds.),Tools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29...

work page 2008
[10]

Springer, 2008.doi:10.1007/978-3-540-78800-3\_24

doi: 10.1007/978-3-540-78800-3\_24. URLhttps://doi.org/10.1007/ 978-3-540-78800-3_24. Florian Dyck, Cedric Richter, and Heike Wehrheim. Robustness testing of software verifiers. In Carla Ferreira and Tim A. C. Willemse (eds.),Software Engineering and Formal Methods - 21st International Conference, SEFM 2023, Eindhoven, The Netherlands, November 6-10, 2023...

work page doi:10.1007/978-3-540-78800-3 2023
[11]

DART: directed automated random testing

Patrice Godefroid, Nils Klarlund, and Koushik Sen. DART: directed automated random testing. In Vivek Sarkar and Mary W. Hall (eds.),Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005, pp. 213–223. ACM,

work page 2005
[12]

Patrice Godefroid, Adam Kiezun, and Michael Y . Levin. Grammar-based whitebox fuzzing. In Rajiv Gupta and Saman P. Amarasinghe (eds.),Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Tucson, AZ, USA, June 7-13, 2008, pp. 206–215. ACM,

work page 2008
[13]

Jingxuan He, Gishor Sivanrupan, Petar Tsankov, and Martin T. Vechev. Learning to explore paths for symbolic execution. In Yongdae Kim, Jong Kim, Giovanni Vigna, and Elaine Shi (eds.),CCS ’21: 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, November 15 - 19, 2021, pp. 2526–2540. ACM,

work page 2021
[14]

Testgeneval: A real world unit test genera- tion and test completion benchmark

Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test genera- tion and test completion benchmark. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

work page 2025
[15]

Revisiting implicitly abusive language detection: Evaluating llms in zero-shot and few-shot settings

Julia Jaremko, Dagmar Gromann, and Michael Wiegand. Revisiting implicitly abusive language detection: Evaluating llms in zero-shot and few-shot settings. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (eds.),Pro- ceedings of the 31st International Conference on Computational Linguistics, COLING ...

work page 2025
[16]

Brahma Reddy Korraprolu, Pavitra Pinninti, and Y

11 Under review. Brahma Reddy Korraprolu, Pavitra Pinninti, and Y . Raghu Reddy. Test case generation for re- quirements in natural language - an LLM comparison study. In Jitendra Chhabra, Lov Ku- mar, Sridhar Chimalakonda, Paddy Krishan, and Sangharatna Godboley (eds.),Proceedings of the 18th Innovations in Software Engineering Conference, ISEC 2025, Kur...

work page 2025
[17]

URL https://doi.org/10.1145/3717383.3717389

doi: 10.1145/3717383.3717389. URL https://doi.org/10.1145/3717383.3717389. Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. Codamosa: Es- caping coverage plateaus in test generation with pre-trained large language models. In45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Aus- tralia, May ...

work page doi:10.1145/3717383.3717389 2023
[18]

Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan

doi: 10.1109/ICSE48619.2023.00085. URL https://doi.org/10.1109/ICSE48619.2023.00085. Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Go...

work page doi:10.1109/icse48619.2023.00085 2023
[20]

Training language models to follow instructions with human feedback

URLhttps://arxiv.org/abs/2203.02155. Jan Peleska, Elena V orobev, and Florian Lapschies. Automated test case generation with smt-solving and abstract interpretation. In Mihaela Gheorghiu Bobaru, Klaus Havelund, Gerard J. Holzmann, and Rajeev Joshi (eds.),NASA Formal Methods - Third International Symposium, NFM 2011, Pasadena, CA, USA, April 18-20,

work page internal anchor Pith review Pith/arXiv arXiv 2011
[21]

T-fuzz: Fuzzing by program transformation

Hui Peng, Yan Shoshitaishvili, and Mathias Payer. T-fuzz: Fuzzing by program transformation. In 2018 IEEE Symposium on Security and Privacy, SP 2018, Proceedings, 21-23 May 2018, San Francisco, California, USA, pp. 697–710. IEEE Computer Society,

work page 2018
[22]

Investigating coverage guided fuzzing with mutation testing

Ruixiang Qian, Quanjun Zhang, Chunrong Fang, and Lihua Guo. Investigating coverage guided fuzzing with mutation testing. InInternetware 2022: 13th Asia-Pacific Symposium on Internet- ware, Hohhot, China, June 11 - 12, 2022, pp. 272–281. ACM,

work page 2022
[23]

Testspark: Intellij idea’s ultimate test generation companion

Arkadii Sapozhnikov, Mitchell Olsthoorn, Annibale Panichella, Vladimir Kovalenko, and Pouria Derakhshanfar. Testspark: Intellij idea’s ultimate test generation companion. InProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, ICSE Companion 2024, Lisbon, Portugal, April 14-20, 2024, pp. 30–34. ACM,

work page 2024
[24]

FOX: coverage-guided fuzzing as online stochastic control

Dongdong She, Adam Storek, Yuchong Xie, Seoyoung Kweon, Prashast Srivastava, and Suman Jana. FOX: coverage-guided fuzzing as online stochastic control. In Bo Luo, Xiaojing Liao, Jun Xu, Engin Kirda, and David Lie (eds.),Proceedings of the 2024 on ACM SIGSAC Conference 12 Under review. on Computer and Communications Security, CCS 2024, Salt Lake City, UT, ...

work page 2024
[25]

Gramatron: effective grammar-aware fuzzing

Prashast Srivastava and Mathias Payer. Gramatron: effective grammar-aware fuzzing. In Cristian Cadar and Xiangyu Zhang (eds.),ISSTA ’21: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, Denmark, July 11-17, 2021, pp. 244–256. ACM,

work page 2021
[26]

Driller: Augmenting fuzzing through selective symbolic execution

Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Driller: Augmenting fuzzing through selective symbolic execution. In23rd Annual Network and Distributed System Secu- rity Symposium, NDSS 2016, San Diego, California, USA, February 21-24,

work page 2016
[27]

Logicase: Effective test case generation from logical description in competitive programming.CoRR, abs/2505.15039,

Sicheol Sung, Aditi, Dogyu kim, Yo-Sub Han, and Sang-Ki Ko. Logicase: Effective test case generation from logical description in competitive programming.CoRR, abs/2505.15039,

work page arXiv
[28]

2025, arXiv e-prints, arXiv:2510.13477, doi:10.48550/arXiv

doi: 10.48550/ARXIV .2505.15039. URLhttps://doi.org/10.48550/arXiv.2505. 15039. Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. TESTEV AL: benchmarking large language models for test case generation. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.),Findings of the Association for Comp...

work page internal anchor Pith review doi:10.48550/arxiv 2025
[29]

URLhttps://doi.org/10.1145/3660815

doi: 10.1145/3660815. URLhttps://doi.org/10.1145/3660815. Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. exlong: Generating excep- tional behavior tests with large language models,

work page doi:10.1145/3660815
[30]

456"], [

doi: 10.1145/3696630.3728608. URL http://dx.doi.org/10.1145/3696630.3728608. A CASESTUDY: LOGICALCONTRADICTIONS INDIRECTLLM TESTCASE GENERATION HumanEvalIn Figure 3 shown in the code snippet, this task includes three sequential contracts: assert_0checks if the input is a list,assert_1verifies that all elements in the list are strings, andassert_2ensures t...

work page doi:10.1145/3696630.3728608

[1] [1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Do- han, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models.CoRR, abs/2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Pavol Bielik, Veselin Raychev, and Martin T

9 Under review. Pavol Bielik, Veselin Raychev, and Martin T. Vechev. PHOG: probabilistic model for code. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.),Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, vol- ume 48 ofJMLR Workshop and Conference Proceedings, pp. 2933–2942. JMLR.org,

work page 2016

[3] [3]

Jacob Burnim and Koushik Sen

URL http://proceedings.mlr.press/v48/bielik16.html. Jacob Burnim and Koushik Sen. Heuristics for scalable dynamic test generation. In23rd IEEE/ACM International Conference on Automated Software Engineering (ASE 2008), 15-19 September 2008, L’Aquila, Italy, pp. 443–446. IEEE Computer Society,

work page 2008

[4] [4]

Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In Richard Draves and Robbert van Re- nesse (eds.),8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, December 8-10, 2008, San Diego, California, USA, Proceedings, pp. 209–224. USE...

work page 2008

[5] [5]

Automatic gen- eration of test cases from formal specifications using mutation testing

Román Jaramillo Cajica, Raul Ernesto Gonzalez-Torres, and Pedro Mejía-Alvarez. Automatic gen- eration of test cases from formal specifications using mutation testing. In18th International Conference on Electrical Engineering, Computing Science and Automatic Control, CCE 2021, Mexico City, Mexico, November 10-12, 2021, pp. 1–6. IEEE,

work page 2021

[6] [6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2019

[7] [7]

Quickcheck: a lightweight tool for random testing of haskell programs

Koen Claessen and John Hughes. Quickcheck: a lightweight tool for random testing of haskell programs. In Martin Odersky and Philip Wadler (eds.),Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming (ICFP ’00), Montreal, Canada, September 18-21, 2000, pp. 268–279. ACM,

work page 2000

[8] [8]

URLhttps://doi

doi: 10.1145/351240.351266. URLhttps://doi. org/10.1145/351240.351266. 10 Under review. Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C. Des- marais. Effective test generation using pre-trained large language models and mutation testing. Information & Software Technology, 171:107468,

work page doi:10.1145/351240.351266

[9] [9]

Leonardo Mendonça de Moura and Nikolaj S. Bjørner. Z3: an efficient SMT solver. In C. R. Ramakrishnan and Jakob Rehof (eds.),Tools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29...

work page 2008

[10] [10]

Springer, 2008.doi:10.1007/978-3-540-78800-3\_24

doi: 10.1007/978-3-540-78800-3\_24. URLhttps://doi.org/10.1007/ 978-3-540-78800-3_24. Florian Dyck, Cedric Richter, and Heike Wehrheim. Robustness testing of software verifiers. In Carla Ferreira and Tim A. C. Willemse (eds.),Software Engineering and Formal Methods - 21st International Conference, SEFM 2023, Eindhoven, The Netherlands, November 6-10, 2023...

work page doi:10.1007/978-3-540-78800-3 2023

[11] [11]

DART: directed automated random testing

Patrice Godefroid, Nils Klarlund, and Koushik Sen. DART: directed automated random testing. In Vivek Sarkar and Mary W. Hall (eds.),Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005, pp. 213–223. ACM,

work page 2005

[12] [12]

Patrice Godefroid, Adam Kiezun, and Michael Y . Levin. Grammar-based whitebox fuzzing. In Rajiv Gupta and Saman P. Amarasinghe (eds.),Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Tucson, AZ, USA, June 7-13, 2008, pp. 206–215. ACM,

work page 2008

[13] [13]

Jingxuan He, Gishor Sivanrupan, Petar Tsankov, and Martin T. Vechev. Learning to explore paths for symbolic execution. In Yongdae Kim, Jong Kim, Giovanni Vigna, and Elaine Shi (eds.),CCS ’21: 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, November 15 - 19, 2021, pp. 2526–2540. ACM,

work page 2021

[14] [14]

Testgeneval: A real world unit test genera- tion and test completion benchmark

Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test genera- tion and test completion benchmark. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

work page 2025

[15] [15]

Revisiting implicitly abusive language detection: Evaluating llms in zero-shot and few-shot settings

Julia Jaremko, Dagmar Gromann, and Michael Wiegand. Revisiting implicitly abusive language detection: Evaluating llms in zero-shot and few-shot settings. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (eds.),Pro- ceedings of the 31st International Conference on Computational Linguistics, COLING ...

work page 2025

[16] [16]

Brahma Reddy Korraprolu, Pavitra Pinninti, and Y

11 Under review. Brahma Reddy Korraprolu, Pavitra Pinninti, and Y . Raghu Reddy. Test case generation for re- quirements in natural language - an LLM comparison study. In Jitendra Chhabra, Lov Ku- mar, Sridhar Chimalakonda, Paddy Krishan, and Sangharatna Godboley (eds.),Proceedings of the 18th Innovations in Software Engineering Conference, ISEC 2025, Kur...

work page 2025

[17] [17]

URL https://doi.org/10.1145/3717383.3717389

doi: 10.1145/3717383.3717389. URL https://doi.org/10.1145/3717383.3717389. Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. Codamosa: Es- caping coverage plateaus in test generation with pre-trained large language models. In45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Aus- tralia, May ...

work page doi:10.1145/3717383.3717389 2023

[18] [18]

Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan

doi: 10.1109/ICSE48619.2023.00085. URL https://doi.org/10.1109/ICSE48619.2023.00085. Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Go...

work page doi:10.1109/icse48619.2023.00085 2023

[19] [20]

Training language models to follow instructions with human feedback

URLhttps://arxiv.org/abs/2203.02155. Jan Peleska, Elena V orobev, and Florian Lapschies. Automated test case generation with smt-solving and abstract interpretation. In Mihaela Gheorghiu Bobaru, Klaus Havelund, Gerard J. Holzmann, and Rajeev Joshi (eds.),NASA Formal Methods - Third International Symposium, NFM 2011, Pasadena, CA, USA, April 18-20,

work page internal anchor Pith review Pith/arXiv arXiv 2011

[20] [21]

T-fuzz: Fuzzing by program transformation

Hui Peng, Yan Shoshitaishvili, and Mathias Payer. T-fuzz: Fuzzing by program transformation. In 2018 IEEE Symposium on Security and Privacy, SP 2018, Proceedings, 21-23 May 2018, San Francisco, California, USA, pp. 697–710. IEEE Computer Society,

work page 2018

[21] [22]

Investigating coverage guided fuzzing with mutation testing

Ruixiang Qian, Quanjun Zhang, Chunrong Fang, and Lihua Guo. Investigating coverage guided fuzzing with mutation testing. InInternetware 2022: 13th Asia-Pacific Symposium on Internet- ware, Hohhot, China, June 11 - 12, 2022, pp. 272–281. ACM,

work page 2022

[22] [23]

Testspark: Intellij idea’s ultimate test generation companion

Arkadii Sapozhnikov, Mitchell Olsthoorn, Annibale Panichella, Vladimir Kovalenko, and Pouria Derakhshanfar. Testspark: Intellij idea’s ultimate test generation companion. InProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, ICSE Companion 2024, Lisbon, Portugal, April 14-20, 2024, pp. 30–34. ACM,

work page 2024

[23] [24]

FOX: coverage-guided fuzzing as online stochastic control

Dongdong She, Adam Storek, Yuchong Xie, Seoyoung Kweon, Prashast Srivastava, and Suman Jana. FOX: coverage-guided fuzzing as online stochastic control. In Bo Luo, Xiaojing Liao, Jun Xu, Engin Kirda, and David Lie (eds.),Proceedings of the 2024 on ACM SIGSAC Conference 12 Under review. on Computer and Communications Security, CCS 2024, Salt Lake City, UT, ...

work page 2024

[24] [25]

Gramatron: effective grammar-aware fuzzing

Prashast Srivastava and Mathias Payer. Gramatron: effective grammar-aware fuzzing. In Cristian Cadar and Xiangyu Zhang (eds.),ISSTA ’21: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, Denmark, July 11-17, 2021, pp. 244–256. ACM,

work page 2021

[25] [26]

Driller: Augmenting fuzzing through selective symbolic execution

Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Driller: Augmenting fuzzing through selective symbolic execution. In23rd Annual Network and Distributed System Secu- rity Symposium, NDSS 2016, San Diego, California, USA, February 21-24,

work page 2016

[26] [27]

Logicase: Effective test case generation from logical description in competitive programming.CoRR, abs/2505.15039,

Sicheol Sung, Aditi, Dogyu kim, Yo-Sub Han, and Sang-Ki Ko. Logicase: Effective test case generation from logical description in competitive programming.CoRR, abs/2505.15039,

work page arXiv

[27] [28]

2025, arXiv e-prints, arXiv:2510.13477, doi:10.48550/arXiv

doi: 10.48550/ARXIV .2505.15039. URLhttps://doi.org/10.48550/arXiv.2505. 15039. Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. TESTEV AL: benchmarking large language models for test case generation. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.),Findings of the Association for Comp...

work page internal anchor Pith review doi:10.48550/arxiv 2025

[28] [29]

URLhttps://doi.org/10.1145/3660815

doi: 10.1145/3660815. URLhttps://doi.org/10.1145/3660815. Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. exlong: Generating excep- tional behavior tests with large language models,

work page doi:10.1145/3660815

[29] [30]

456"], [

doi: 10.1145/3696630.3728608. URL http://dx.doi.org/10.1145/3696630.3728608. A CASESTUDY: LOGICALCONTRADICTIONS INDIRECTLLM TESTCASE GENERATION HumanEvalIn Figure 3 shown in the code snippet, this task includes three sequential contracts: assert_0checks if the input is a list,assert_1verifies that all elements in the list are strings, andassert_2ensures t...

work page doi:10.1145/3696630.3728608