POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
Pith reviewed 2026-05-07 15:59 UTC · model grok-4.3
The pith
A new benchmark reveals that LLMs generate postconditions that are often correct yet miss many behaviors in real code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce POSTCONDBENCH, a multilingual benchmark for method-level postcondition generation from real-world software. POSTCONDBENCH comprises 420 Python and Java tasks drawn from 121 open-source projects, each paired with a high-quality ground-truth postcondition set constructed with expert involvement. To enable automatic evaluation, POSTCONDBENCH provides a runnable execution environment and operationalizes completeness via defect discrimination: a postcondition set is more complete if it is violated by more defective implementations while remaining satisfied on correct executions. Using POSTCONDBENCH, we formulate three generation settings and evaluate five SOTA LLMs. Our results reveal a substantial gap between correctness and completeness, and show that repository-level dependencies and method complexity exacerbate this gap.
What carries the argument
The POSTCONDBENCH benchmark, which turns completeness into a measurable defect-discrimination score by counting how many buggy implementations violate the generated postconditions while all correct executions still satisfy them.
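The scoring idea can be sketched as a small function. The interface below is an illustrative assumption, not POSTCONDBENCH's actual API: postconditions are predicates over an input/output pair, and implementations are callables evaluated on shared test inputs.

```python
# Illustrative sketch of defect discrimination (assumed interface,
# not the benchmark's real tooling).

def violates(postconditions, impl, inputs):
    """True if impl's output falsifies some postcondition on some input."""
    return any(not p(x, impl(x)) for x in inputs for p in postconditions)

def discrimination_score(postconditions, reference, defective_impls, inputs):
    # Correctness gate: a set falsified by the correct implementation
    # scores zero regardless of how many defects it would flag.
    if violates(postconditions, reference, inputs):
        return 0.0
    # Completeness proxy: fraction of defective implementations caught.
    caught = sum(violates(postconditions, d, inputs) for d in defective_impls)
    return caught / len(defective_impls)

# A richer postcondition set for abs() catches more defects:
weak = [lambda x, y: y >= 0]
full = weak + [lambda x, y: y == x or y == -x]
defects = [lambda x: x, lambda x: -x, lambda x: 0]
print(discrimination_score(weak, abs, defects, [-2, 3]))  # 2/3
print(discrimination_score(full, abs, defects, [-2, 3]))  # 1.0
```

The gate matters: without it, a vacuously strong set (e.g. `y < 0` for `abs`) would "catch" every defect while also rejecting correct code.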
If this is right
- LLM-generated postconditions that pass surface correctness checks can still leave many observable behaviors unspecified.
- Repository-level context and method complexity make it harder for models to produce postconditions that discriminate defects.
- Benchmarks limited to correctness or manual review will overstate the practical utility of current postcondition generators.
- Automatic, execution-based completeness checks become necessary to guide progress beyond small synthetic test sets.
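A hypothetical sorting example (not from the paper) makes the first bullet concrete: a postcondition can hold on every correct run and still leave an observable behavior unspecified.

```python
def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

# "The output is sorted" is a *correct* postcondition for a sort
# routine, but it is incomplete: it never requires the output to be a
# permutation of the input.
post_sorted = lambda inp, out: is_sorted(out)

def buggy_sort(xs):
    return []  # defective: discards every element

assert post_sorted([3, 1, 2], buggy_sort([3, 1, 2]))    # the bug slips through

# Adding a multiset-preservation clause closes the gap:
post_full = lambda inp, out: is_sorted(out) and sorted(out) == sorted(inp)
assert not post_full([3, 1, 2], buggy_sort([3, 1, 2]))  # the bug is caught
```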
Where Pith is reading between the lines
- Models that score high on local correctness may still require additional static analysis or test generation to reach usable completeness.
- If the gap persists across newer models, postcondition-based verification tools will inherit the same blind spots on complex modules.
- Extending the defect-discrimination idea to other specification artifacts, such as invariants or preconditions, could expose similar trade-offs in other verification tasks.
Load-bearing premise
The expert-written reference postconditions are treated as complete enough, and the selected defective implementations are treated as a fair stand-in for the kinds of mistakes that occur in practice.
What would settle it
Collecting an independent set of real defects for the same methods and checking whether the postconditions ranked most complete by the benchmark also catch the new defects at higher rates.
Original abstract
Formal postconditions precisely characterize program behavior and support debugging, testing, and verification, but writing them requires substantial expertise and effort. This has motivated recent work on automatically generating postconditions from code and natural-language artifacts using large language models (LLMs). However, evaluation remains a key bottleneck. Existing benchmarks primarily emphasize correctness under limited evaluation settings, often relying on surface-form matching or manual assessment on small or synthetic datasets. We introduce POSTCONDBENCH, a multilingual benchmark for evaluating method-level postcondition generation from real-world software. POSTCONDBENCH comprises 420 Python and Java tasks drawn from 121 open-source projects, each paired with a high-quality ground-truth postcondition set constructed with expert involvement. To enable automatic evaluation, POSTCONDBENCH provides a runnable execution environment and operationalizes completeness via defect discrimination: a postcondition set is more complete if it is violated by more defective implementations while remaining satisfied on correct executions. Using POSTCONDBENCH, we formulate three generation settings and evaluate five SOTA LLMs. Our results reveal a substantial gap between correctness and completeness, and show that repository-level dependencies and method complexity exacerbate this gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces POSTCONDBENCH, a multilingual benchmark with 420 Python and Java tasks drawn from 121 real-world open-source projects. Each task includes an expert-constructed ground-truth postcondition set and a runnable execution environment. Completeness is operationalized via defect discrimination: generated postconditions must hold on correct executions but be violated by as many defective implementations as possible. The work evaluates five SOTA LLMs across three generation settings and reports a substantial correctness-completeness gap that widens with repository-level dependencies and method complexity.
Significance. If the ground-truth sets and defect proxies are validated, POSTCONDBENCH offers a scalable, execution-based alternative to surface matching or small-scale manual assessment. The emphasis on real projects, automatic completeness measurement, and analysis of complexity/dependency factors could usefully guide LLM research toward more complete postcondition generation, with downstream benefits for verification, testing, and debugging.
major comments (3)
- [Benchmark construction] Benchmark construction section: the completeness metric depends on the assumption that expert ground-truth postcondition sets are sufficiently complete and that the defective implementations are realistic proxies for real-world bugs. No quantitative validation (e.g., inter-expert agreement scores, coverage analysis against project test suites, or comparison of defects to actual bug reports) is reported, which directly affects the interpretability of the reported gap.
- [Results and analysis] Results and analysis section (factors exacerbating the gap): the claim that repository-level dependencies and method complexity worsen the correctness-completeness gap requires explicit controls or regression analysis to isolate these variables from confounders such as method length or number of defects per task; without this, the exacerbation finding is not yet load-bearing.
- [Defect generation] Defect generation subsection: details on how the defective implementations were produced (e.g., mutation operators, manual injection criteria, or selection from version history) are insufficient to assess whether violation rates reflect genuine incompleteness rather than artifacts of the defect-creation process.
minor comments (3)
- The abstract refers to 'three generation settings' without naming them; the introduction or evaluation setup section should list them explicitly for immediate clarity.
- A summary table of dataset characteristics (tasks per language, average cyclomatic complexity, dependency counts, number of defects per task) would improve readability and allow readers to assess balance.
- Related-work discussion could more explicitly contrast POSTCONDBENCH with prior specification-inference benchmarks that also use execution or mutation-based evaluation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on POSTCONDBENCH. We address each major comment point by point below and will revise the manuscript accordingly to strengthen the benchmark's validation, analysis, and methodological transparency.
Point-by-point responses
Referee: [Benchmark construction] Benchmark construction section: the completeness metric depends on the assumption that expert ground-truth postcondition sets are sufficiently complete and that the defective implementations are realistic proxies for real-world bugs. No quantitative validation (e.g., inter-expert agreement scores, coverage analysis against project test suites, or comparison of defects to actual bug reports) is reported, which directly affects the interpretability of the reported gap.
Authors: We agree that additional quantitative validation would improve interpretability. In the revision we will add inter-expert agreement scores on a random sample of 50 tasks (two experts independently annotating postconditions) and report coverage of our ground-truth sets against the source projects' existing test suites. For defect realism we will expand the discussion of how defects were chosen to reflect common bug patterns documented in the literature and project histories; direct one-to-one mapping to specific bug reports is not always possible because the defects function as controlled proxies for measuring incompleteness rather than as literal reproductions of reported issues. revision: yes
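The agreement statistic the authors promise could be computed along these lines. This is a minimal sketch with made-up labels; the paper's actual annotation protocol and categories are assumptions here.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both annotators labeled at random with
    # their observed marginal frequencies.
    p_exp = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical per-clause judgments ("inc"lude / "exc"lude) from two
# experts annotating the same sampled postcondition clauses:
e1 = ["inc", "inc", "exc", "inc", "exc", "inc"]
e2 = ["inc", "exc", "exc", "inc", "exc", "inc"]
print(round(cohens_kappa(e1, e2), 3))  # 0.667
```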
Referee: [Results and analysis] Results and analysis section (factors exacerbating the gap): the claim that repository-level dependencies and method complexity worsen the correctness-completeness gap requires explicit controls or regression analysis to isolate these variables from confounders such as method length or number of defects per task; without this, the exacerbation finding is not yet load-bearing.
Authors: We acknowledge the need for stronger statistical isolation of effects. The revised manuscript will include a multiple linear regression with the correctness-completeness gap as the outcome variable and predictors for repository-level dependencies, method complexity (measured by cyclomatic complexity and dependency count), method length, and number of defects per task. We will also report correlation matrices and subgroup analyses to demonstrate that the exacerbation effect persists after controlling for the listed confounders. revision: yes
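The promised regression could look like the following sketch. The data and variable names are synthetic stand-ins (the paper's measurements are not reproduced here); the fit is ordinary least squares via the normal equations in pure Python.

```python
def ols(X, y):
    """Ordinary least squares: solve (X^T X) beta = X^T y."""
    n, p = len(X), len(X[0])
    # Normal equations A beta = b
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         for i in range(p)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    for col in range(p):                      # Gaussian elimination
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j]
                              for j in range(i + 1, p))) / A[i][i]
    return beta

# Each row: [intercept, dependency_count, cyclomatic_complexity,
#            method_length]; outcome: correctness-completeness gap.
rows = [[1, 0, 1, 10], [1, 2, 1, 12], [1, 0, 4, 30],
        [1, 3, 5, 40], [1, 1, 2, 15], [1, 4, 3, 25]]
gap = [0.10, 0.30, 0.40, 0.75, 0.25, 0.60]
coefs = ols(rows, gap)
# coefs[1] and coefs[2] estimate the dependency and complexity effects
# after controlling for method length.
```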
Referee: [Defect generation] Defect generation subsection: details on how the defective implementations were produced (e.g., mutation operators, manual injection criteria, or selection from version history) are insufficient to assess whether violation rates reflect genuine incompleteness rather than artifacts of the defect-creation process.
Authors: We will expand the Defect generation subsection with the requested details. Defective implementations were produced by first applying a fixed set of mutation operators (statement deletion, condition negation, arithmetic operator replacement, and variable misuse) drawn from standard mutation-testing suites, then supplementing with targeted manual injections of realistic bugs identified from the projects' commit histories and issue trackers. All defects were required to compile, execute without crashing, and violate at least one ground-truth postcondition. The revision will list the exact operators, the proportion of manual vs. automated defects, and the verification steps used to avoid trivial or non-executable mutations. revision: yes
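One of the listed operators, condition negation, can be sketched with Python's `ast` module. The tooling and the `clamp` target are illustrative assumptions, not the paper's actual pipeline.

```python
import ast

class NegateConditions(ast.NodeTransformer):
    """Mutation operator: wrap every if-condition in `not (...)`."""
    def visit_If(self, node):
        self.generic_visit(node)
        node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
        return node

SRC = """
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""
tree = NegateConditions().visit(ast.parse(SRC))
ast.fix_missing_locations(tree)
ns = {}
exec(compile(tree, "<mutant>", "exec"), ns)
mutant = ns["clamp"]

# The mutant compiles and runs without crashing, but violates the
# ground-truth postcondition result == min(max(x, lo), hi):
post = lambda x, lo, hi, r: r == min(max(x, lo), hi)
assert not post(5, 0, 10, mutant(5, 0, 10))  # mutant returns lo, not x
```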
Circularity Check
No circularity in benchmark construction or metric definitions
Full rationale
POSTCONDBENCH is constructed from independent open-source projects, with expert-constructed ground-truth postconditions and provided correct and defective implementations. Completeness is operationalized via execution-based defect discrimination, with no fitted parameters, self-citations, or derivations that would reduce the paper's claims to its inputs by construction. The evaluation of five LLMs across three generation settings is purely empirical and does not depend on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Expert involvement produces high-quality and complete ground-truth postcondition sets.
- domain assumption: Violation rates on defective implementations measure postcondition completeness.