POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
Pith reviewed 2026-05-07 15:59 UTC · model grok-4.3
The pith
A new benchmark reveals that LLMs generate postconditions that are often correct yet miss many behaviors in real code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce POSTCONDBENCH, a multilingual benchmark for method-level postcondition generation from real-world software. POSTCONDBENCH comprises 420 Python and Java tasks drawn from 121 open-source projects, each paired with a high-quality ground-truth postcondition set constructed with expert involvement. To enable automatic evaluation, POSTCONDBENCH provides a runnable execution environment and operationalizes completeness via defect discrimination: a postcondition set is more complete if it is violated by more defective implementations while remaining satisfied on correct executions. Using POSTCONDBENCH, we formulate three generation settings and evaluate five SOTA LLMs. Our results reveal a substantial gap between correctness and completeness, and show that repository-level dependencies and method complexity exacerbate this gap.
What carries the argument
The POSTCONDBENCH benchmark, which turns completeness into a measurable defect-discrimination score by counting how many buggy implementations violate the generated postconditions while all correct executions still satisfy them.
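The scoring idea can be sketched as a small function. The interface below is an illustrative assumption, not POSTCONDBENCH's actual API: postconditions are predicates over an input/output pair, and implementations are callables evaluated on shared test inputs.

```python
# Illustrative sketch of defect discrimination (assumed interface,
# not the benchmark's real tooling).

def violates(postconditions, impl, inputs):
    """True if impl's output falsifies some postcondition on some input."""
    return any(not p(x, impl(x)) for x in inputs for p in postconditions)

def discrimination_score(postconditions, reference, defective_impls, inputs):
    # Correctness gate: a set falsified by the correct implementation
    # scores zero regardless of how many defects it would flag.
    if violates(postconditions, reference, inputs):
        return 0.0
    # Completeness proxy: fraction of defective implementations caught.
    caught = sum(violates(postconditions, d, inputs) for d in defective_impls)
    return caught / len(defective_impls)

# A richer postcondition set for abs() catches more defects:
weak = [lambda x, y: y >= 0]
full = weak + [lambda x, y: y == x or y == -x]
defects = [lambda x: x, lambda x: -x, lambda x: 0]
print(discrimination_score(weak, abs, defects, [-2, 3]))  # 2/3
print(discrimination_score(full, abs, defects, [-2, 3]))  # 1.0
```

The gate matters: without it, a vacuously strong set (e.g. `y < 0` for `abs`) would "catch" every defect while also rejecting correct code.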
If this is right
- LLM-generated postconditions that pass surface correctness checks can still leave many observable behaviors unspecified.
- Repository-level context and method complexity make it harder for models to produce postconditions that discriminate defects.
- Benchmarks limited to correctness or manual review will overstate the practical utility of current postcondition generators.
- Automatic, execution-based completeness checks become necessary to guide progress beyond small synthetic test sets.
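A hypothetical sorting example (not from the paper) makes the first bullet concrete: a postcondition can hold on every correct run and still leave an observable behavior unspecified.

```python
def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

# "The output is sorted" is a *correct* postcondition for a sort
# routine, but it is incomplete: it never requires the output to be a
# permutation of the input.
post_sorted = lambda inp, out: is_sorted(out)

def buggy_sort(xs):
    return []  # defective: discards every element

assert post_sorted([3, 1, 2], buggy_sort([3, 1, 2]))    # the bug slips through

# Adding a multiset-preservation clause closes the gap:
post_full = lambda inp, out: is_sorted(out) and sorted(out) == sorted(inp)
assert not post_full([3, 1, 2], buggy_sort([3, 1, 2]))  # the bug is caught
```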
Where Pith is reading between the lines
- Models that score high on local correctness may still require additional static analysis or test generation to reach usable completeness.
- If the gap persists across newer models, postcondition-based verification tools will inherit the same blind spots on complex modules.
- Extending the defect-discrimination idea to other specification artifacts, such as invariants or preconditions, could expose similar trade-offs in other verification tasks.
Load-bearing premise
The expert-written reference postconditions are treated as complete enough, and the selected defective implementations are treated as a fair stand-in for the kinds of mistakes that occur in practice.
What would settle it
Collecting an independent set of real defects for the same methods and checking whether the postconditions ranked most complete by the benchmark also catch the new defects at higher rates.
Original abstract
Formal postconditions precisely characterize program behavior and support debugging, testing, and verification, but writing them requires substantial expertise and effort. This has motivated recent work on automatically generating postconditions from code and natural-language artifacts using large language models (LLMs). However, evaluation remains a key bottleneck. Existing benchmarks primarily emphasize correctness under limited evaluation settings, often relying on surface-form matching or manual assessment on small or synthetic datasets. We introduce POSTCONDBENCH, a multilingual benchmark for evaluating method-level postcondition generation from real-world software. POSTCONDBENCH comprises 420 Python and Java tasks drawn from 121 open-source projects, each paired with a high-quality ground-truth postcondition set constructed with expert involvement. To enable automatic evaluation, POSTCONDBENCH provides a runnable execution environment and operationalizes completeness via defect discrimination: a postcondition set is more complete if it is violated by more defective implementations while remaining satisfied on correct executions. Using POSTCONDBENCH, we formulate three generation settings and evaluate five SOTA LLMs. Our results reveal a substantial gap between correctness and completeness, and show that repository-level dependencies and method complexity exacerbate this gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces POSTCONDBENCH, a multilingual benchmark with 420 Python and Java tasks drawn from 121 real-world open-source projects. Each task includes an expert-constructed ground-truth postcondition set and a runnable execution environment. Completeness is operationalized via defect discrimination: generated postconditions must hold on correct executions but be violated by as many defective implementations as possible. The work evaluates five SOTA LLMs across three generation settings and reports a substantial correctness-completeness gap that widens with repository-level dependencies and method complexity.
Significance. If the ground-truth sets and defect proxies are validated, POSTCONDBENCH offers a scalable, execution-based alternative to surface matching or small-scale manual assessment. The emphasis on real projects, automatic completeness measurement, and analysis of complexity/dependency factors could usefully guide LLM research toward more complete postcondition generation, with downstream benefits for verification, testing, and debugging.
major comments (3)
- [Benchmark construction] Benchmark construction section: the completeness metric depends on the assumption that expert ground-truth postcondition sets are sufficiently complete and that the defective implementations are realistic proxies for real-world bugs. No quantitative validation (e.g., inter-expert agreement scores, coverage analysis against project test suites, or comparison of defects to actual bug reports) is reported, which directly affects the interpretability of the reported gap.
- [Results and analysis] Results and analysis section (factors exacerbating the gap): the claim that repository-level dependencies and method complexity worsen the correctness-completeness gap requires explicit controls or regression analysis to isolate these variables from confounders such as method length or number of defects per task; without this, the exacerbation finding is not yet load-bearing.
- [Defect generation] Defect generation subsection: details on how the defective implementations were produced (e.g., mutation operators, manual injection criteria, or selection from version history) are insufficient to assess whether violation rates reflect genuine incompleteness rather than artifacts of the defect-creation process.
minor comments (3)
- The abstract refers to 'three generation settings' without naming them; the introduction or evaluation setup section should list them explicitly for immediate clarity.
- A summary table of dataset characteristics (tasks per language, average cyclomatic complexity, dependency counts, number of defects per task) would improve readability and allow readers to assess balance.
- Related-work discussion could more explicitly contrast POSTCONDBENCH with prior specification-inference benchmarks that also use execution or mutation-based evaluation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on POSTCONDBENCH. We address each major comment point by point below and will revise the manuscript accordingly to strengthen the benchmark's validation, analysis, and methodological transparency.
Point-by-point responses
Referee: [Benchmark construction] Benchmark construction section: the completeness metric depends on the assumption that expert ground-truth postcondition sets are sufficiently complete and that the defective implementations are realistic proxies for real-world bugs. No quantitative validation (e.g., inter-expert agreement scores, coverage analysis against project test suites, or comparison of defects to actual bug reports) is reported, which directly affects the interpretability of the reported gap.
Authors: We agree that additional quantitative validation would improve interpretability. In the revision we will add inter-expert agreement scores on a random sample of 50 tasks (two experts independently annotating postconditions) and report coverage of our ground-truth sets against the source projects' existing test suites. For defect realism we will expand the discussion of how defects were chosen to reflect common bug patterns documented in the literature and project histories; direct one-to-one mapping to specific bug reports is not always possible because the defects function as controlled proxies for measuring incompleteness rather than as literal reproductions of reported issues. revision: yes
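The agreement statistic the authors promise could be computed along these lines. This is a minimal sketch with made-up labels; the paper's actual annotation protocol and categories are assumptions here.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both annotators labeled at random with
    # their observed marginal frequencies.
    p_exp = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical per-clause judgments ("inc"lude / "exc"lude) from two
# experts annotating the same sampled postcondition clauses:
e1 = ["inc", "inc", "exc", "inc", "exc", "inc"]
e2 = ["inc", "exc", "exc", "inc", "exc", "inc"]
print(round(cohens_kappa(e1, e2), 3))  # 0.667
```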
Referee: [Results and analysis] Results and analysis section (factors exacerbating the gap): the claim that repository-level dependencies and method complexity worsen the correctness-completeness gap requires explicit controls or regression analysis to isolate these variables from confounders such as method length or number of defects per task; without this, the exacerbation finding is not yet load-bearing.
Authors: We acknowledge the need for stronger statistical isolation of effects. The revised manuscript will include a multiple linear regression with the correctness-completeness gap as the outcome variable and predictors for repository-level dependencies, method complexity (measured by cyclomatic complexity and dependency count), method length, and number of defects per task. We will also report correlation matrices and subgroup analyses to demonstrate that the exacerbation effect persists after controlling for the listed confounders. revision: yes
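The promised regression could look like the following sketch. The data and variable names are synthetic stand-ins (the paper's measurements are not reproduced here); the fit is ordinary least squares via the normal equations in pure Python.

```python
def ols(X, y):
    """Ordinary least squares: solve (X^T X) beta = X^T y."""
    n, p = len(X), len(X[0])
    # Normal equations A beta = b
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         for i in range(p)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    for col in range(p):                      # Gaussian elimination
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j]
                              for j in range(i + 1, p))) / A[i][i]
    return beta

# Each row: [intercept, dependency_count, cyclomatic_complexity,
#            method_length]; outcome: correctness-completeness gap.
rows = [[1, 0, 1, 10], [1, 2, 1, 12], [1, 0, 4, 30],
        [1, 3, 5, 40], [1, 1, 2, 15], [1, 4, 3, 25]]
gap = [0.10, 0.30, 0.40, 0.75, 0.25, 0.60]
coefs = ols(rows, gap)
# coefs[1] and coefs[2] estimate the dependency and complexity effects
# after controlling for method length.
```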
Referee: [Defect generation] Defect generation subsection: details on how the defective implementations were produced (e.g., mutation operators, manual injection criteria, or selection from version history) are insufficient to assess whether violation rates reflect genuine incompleteness rather than artifacts of the defect-creation process.
Authors: We will expand the Defect generation subsection with the requested details. Defective implementations were produced by first applying a fixed set of mutation operators (statement deletion, condition negation, arithmetic operator replacement, and variable misuse) drawn from standard mutation-testing suites, then supplementing with targeted manual injections of realistic bugs identified from the projects' commit histories and issue trackers. All defects were required to compile, execute without crashing, and violate at least one ground-truth postcondition. The revision will list the exact operators, the proportion of manual vs. automated defects, and the verification steps used to avoid trivial or non-executable mutations. revision: yes
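One of the listed operators, condition negation, can be sketched with Python's `ast` module. The tooling and the `clamp` target are illustrative assumptions, not the paper's actual pipeline.

```python
import ast

class NegateConditions(ast.NodeTransformer):
    """Mutation operator: wrap every if-condition in `not (...)`."""
    def visit_If(self, node):
        self.generic_visit(node)
        node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
        return node

SRC = """
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""
tree = NegateConditions().visit(ast.parse(SRC))
ast.fix_missing_locations(tree)
ns = {}
exec(compile(tree, "<mutant>", "exec"), ns)
mutant = ns["clamp"]

# The mutant compiles and runs without crashing, but violates the
# ground-truth postcondition result == min(max(x, lo), hi):
post = lambda x, lo, hi, r: r == min(max(x, lo), hi)
assert not post(5, 0, 10, mutant(5, 0, 10))  # mutant returns lo, not x
```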
Circularity Check
No circularity in benchmark construction or metric definitions
Full rationale
POSTCONDBENCH is constructed from independent open-source projects, with expert-constructed ground-truth postconditions and provided correct and defective implementations. Completeness is operationalized via execution-based defect discrimination, with no fitted parameters, self-citations, or derivations that would reduce the paper's claims to its inputs by construction. The evaluation of five LLMs across three generation settings is purely empirical and does not depend on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Expert involvement produces high-quality and complete ground-truth postcondition sets.
- domain assumption: Violation rates on defective implementations measure postcondition completeness.