
arxiv: 2506.18315 · v2 · submitted 2025-06-23 · 💻 cs.SE · cs.AI

Effective LLM Code Refinement via Property-Oriented and Structurally Minimal Feedback

Pith reviewed 2026-05-19 08:32 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM code refinement · property-oriented feedback · minimal counterexample · test-driven development · automated program repair · feedback quality · code debugging

The pith

The Property-Generated Solver refines LLM code by checking high-level properties and supplying the simplest failing counterexample instead of relying on noisy test cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shifts the focus in LLM code refinement from the quantity of tests to the quality of feedback. It introduces the Property-Generated Solver that first verifies semantic properties of the intended program behavior and then returns only the smallest counterexample that violates one of those properties. The goal is to give the model clear semantic direction while avoiding the confusion created by many low-quality or irrelevant tests. If the approach works, LLMs would correct their own outputs more often and produce solutions that generalize beyond the specific examples seen during refinement.

Core claim

PGS operates by checking high-level program properties and then providing the simplest failing counterexample to the LLM. Because the feedback is both property-oriented and structurally minimal, this targeted mechanism leads to significant performance gains. Specifically, PGS improves pass@1 by up to 13.4% over other TDD-based methods, fixes over 64% of the problems on which the model initially failed, and delivers a bug-fix rate 1.4x-1.6x higher than the strongest debugging-based approaches.

What carries the argument

The argument rests on the Property-Generated Solver (PGS), which verifies high-level program properties and returns the simplest failing counterexample, so that the root cause is isolated with low cognitive load.
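To make the mechanism concrete, here is a minimal sketch of property checking followed by selection of the simplest failing input. It is not the paper's implementation: the function names, the candidate input pool, the deliberately buggy `buggy_sort`, and the "shortest list, then smallest values" notion of simplicity are all assumptions made for illustration. The sorting property itself is the example the abstract uses.

```python
from typing import Callable, List, Optional, Tuple

# A property takes (input, output) and returns True when it holds.
Property = Tuple[str, Callable[[List[int], List[int]], bool]]

# Assumed properties for a sorting task (the abstract's running example).
PROPERTIES: List[Property] = [
    ("non_decreasing", lambda inp, out: all(a <= b for a, b in zip(out, out[1:]))),
    ("same_elements", lambda inp, out: sorted(inp) == sorted(out)),
]


def buggy_sort(xs: List[int]) -> List[int]:
    """Hypothetical LLM output: sorts, but silently drops duplicates."""
    return sorted(set(xs))


def simplest_counterexample(
    candidate: Callable[[List[int]], List[int]],
    inputs: List[List[int]],
    properties: List[Property],
) -> Optional[Tuple[List[int], List[int], str]]:
    """Return (input, output, violated property) for the simplest failure, or None."""
    failures = []
    for x in inputs:
        out = candidate(list(x))
        for name, holds in properties:
            if not holds(x, out):
                failures.append((x, out, name))
                break  # one violated property per input is enough feedback
    if not failures:
        return None
    # Assumed simplicity order: fewer elements first, then smaller magnitudes.
    return min(failures, key=lambda f: (len(f[0]), sum(abs(v) for v in f[0])))


inputs = [[3, 1, 2], [5, 5, 1, 4], [2, 2], [], [7]]
print(simplest_counterexample(buggy_sort, inputs, PROPERTIES))
# ([2, 2], [2], 'same_elements')
```

Only the single smallest violation is reported, together with the name of the property it breaks, rather than every failing test in the pool.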

If this is right

  • LLMs receive semantic guidance that goes beyond raw I/O mismatches and produces more generalizable fixes.
  • The method achieves a bug-fix rate 1.4x to 1.6x higher than the strongest debugging-based approaches.
  • Property-driven feedback establishes a new state of the art across multiple code refinement benchmarks.
  • Structurally minimal signals reduce the chance that the model is misled by extraneous test noise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same minimal-counterexample style could be tested on LLM outputs in non-code domains such as mathematical derivations or natural-language summaries.
  • If property checkers prove easy to generate, integrated development environments might adopt them for real-time suggestions to human programmers.
  • Scaling the approach to large codebases would require checking whether property verification remains tractable when functions have complex internal state.

Load-bearing premise

High-level program properties can be checked automatically and the simplest failing counterexample reliably reveals the underlying bug without extra domain engineering per problem.
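The premise depends on a failing input being reducible to something small enough to expose the bug. One way this could work is greedy one-element removal, a simplified relative of delta debugging. This page does not describe the paper's actual minimization procedure, so `shrink`, `still_fails`, and the buggy candidate below are hypothetical.

```python
from typing import Callable, List


def shrink(failing_input: List[int], still_fails: Callable[[List[int]], bool]) -> List[int]:
    """Greedily drop one element at a time while the property keeps failing.

    The result is 1-minimal: removing any single remaining element
    makes the failure disappear.
    """
    current = list(failing_input)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            trial = current[:i] + current[i + 1:]
            if still_fails(trial):
                current = trial
                changed = True
                break  # restart the scan on the smaller input
    return current


def buggy_sort(xs: List[int]) -> List[int]:
    """Hypothetical candidate: sorts, but drops duplicates."""
    return sorted(set(xs))


def still_fails(xs: List[int]) -> bool:
    """The 'same elements' property is violated on this input."""
    return sorted(xs) != sorted(buggy_sort(xs))


print(shrink([9, 4, 4, 7, 1], still_fails))
# [4, 4]
```

The reduced input `[4, 4]` points directly at duplicate handling, which the original five-element input does not. Whether such reduction stays this clean without per-problem engineering is exactly what the premise leaves open.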

What would settle it

A controlled comparison on the same benchmarks would settle it: if a standard TDD baseline that uses only input-output mismatches matched or exceeded PGS in pass@1 and fix rate, the claimed advantage of property-oriented, minimal feedback would be falsified.
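The two metrics in such a comparison can be sketched as follows. The page does not say how the paper computes them, so this is an assumption: `pass_at_k` is the standard unbiased estimator introduced with HumanEval, and `fix_rate` is read here as the fraction of initially failing problems that pass after refinement. The per-problem outcomes are made-up placeholders, not results from the paper.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def fix_rate(initial_pass: List[bool], refined_pass: List[bool]) -> float:
    """Fraction of initially failing problems that pass after refinement."""
    failed = [i for i, ok in enumerate(initial_pass) if not ok]
    if not failed:
        return 0.0
    return sum(refined_pass[i] for i in failed) / len(failed)


# Placeholder outcomes for four problems (illustrative only).
initial = [True, False, False, False]
after_refinement = [True, True, True, False]

print(pass_at_k(n=10, c=3, k=1))            # about 0.3
print(fix_rate(initial, after_refinement))  # 2 of 3 initial failures fixed
```

Running both the baseline and PGS through the same two functions, on the same problems and with the same sampling budget, is what would make the comparison controlled.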

Figures

Figures reproduced from arXiv: 2506.18315 by Lehan He, Lu Sheng, Xiang Gao, Zeren Chen, Zhe Zhang.

Figure 1: A programming problem excerpted from the HumanEval […]
Figure 2: Overview of the Property-Generated Solver framework, showcasing the iterative collaboration between the Generator and the Tester.
Figure 3: The prompt template used by the Tester to generate validation and […]
Figure 5: Contribution of different testing and refinement stages to overall […]
Figure 6: Comparison of code generation outcome distributions (%) on […]
Original abstract

LLMs excel at code generation, yet ensuring the functional correctness of their outputs remains a persistent challenge. While recent studies have applied Test-Driven Development (TDD) to refine code, these methods are often undermined by poor feedback quality, stemming from the scarcity of high-quality test cases and noisy signals from auto-generated ones. In this work, we shift the focus from test quantity to feedback quality. We introduce the Property-Generated Solver (PGS), a novel paradigm designed to generate highly effective feedback via two principles: it must be property-oriented, to provide semantic guidance beyond simple I/O mismatches, and structurally minimal, to reduce cognitive load and isolate root causes. PGS operates by checking high-level program properties (e.g., a sorting function must produce a non-decreasing sequence) then providing the simplest failing counterexample to the LLM. By adhering to these principles, this targeted feedback mechanism leads to significant performance gains. Specifically, PGS achieves an improvement of up to 13.4% in pass@1 against other TDD-based methods and an over 64% fix rate on problems where the model initially failed. This property-driven, minimal feedback steers LLMs toward correct and generalizable solutions. Across diverse benchmarks, PGS demonstrates superior performance, achieving a bug fix rate 1.4x-1.6x higher than the strongest debugging-based approaches and establishing a new state-of-the-art in automated code refinement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Property-Generated Solver (PGS) for refining LLM-generated code. PGS generates feedback by checking high-level program properties (e.g., non-decreasing output for sorting) and supplying the simplest failing counterexample to the LLM. The feedback is designed to be property-oriented for semantic guidance beyond I/O and structurally minimal to reduce cognitive load. The paper reports up to 13.4% pass@1 improvement over TDD-based methods, over 64% fix rate on initially failed problems, and 1.4x-1.6x higher bug-fix rates than debugging baselines, claiming a new state-of-the-art across diverse benchmarks.

Significance. If the results hold and properties can be obtained automatically without per-problem engineering, the work would be significant for LLM code refinement by shifting emphasis from test quantity to targeted semantic feedback. The minimal-counterexample principle could improve generalizability of fixes if the automation claim is substantiated.

major comments (2)
  1. [Abstract] The performance claims (13.4% pass@1 gain, >64% fix rate, 1.4-1.6x bug-fix improvement) are presented as direct experimental outcomes, yet no benchmarks, problem counts, statistical significance tests, or error bars are referenced, preventing verification that the gains are robust rather than benchmark-specific.
  2. [Method] Property definition and checking: the central attribution of gains to 'property-oriented' and 'structurally minimal' feedback assumes high-level properties can be automatically derived and checked for arbitrary problems without domain-specific engineering. If properties are hand-specified or benchmark-dependent (as suggested by the sorting example), the comparison to TDD baselines, which lack equivalent semantic oracles, may not be fair, undermining the claim that the feedback style itself drives the improvement.
minor comments (1)
  1. [Abstract] The phrase 'structurally minimal' is introduced without a concise definition or example of what constitutes minimal structure in the counterexample, which would aid immediate understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major point below, offering clarifications based on the manuscript content and noting revisions where they will strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] The performance claims (13.4% pass@1 gain, >64% fix rate, 1.4-1.6x bug-fix improvement) are presented as direct experimental outcomes, yet no benchmarks, problem counts, statistical significance tests, or error bars are referenced, preventing verification that the gains are robust rather than benchmark-specific.

    Authors: We agree that the abstract's brevity omits explicit references to benchmarks, problem counts, and statistical details, which could aid immediate verification. The main text (Section 4) reports results on HumanEval (164 problems), MBPP, and additional benchmarks, with pass@1 gains averaged over multiple sampling runs and including standard deviations. To improve accessibility, we will revise the abstract to name the primary benchmarks and indicate that full statistical analysis appears in the experimental evaluation. revision: yes

  2. Referee: [Method] Property definition and checking: the central attribution of gains to 'property-oriented' and 'structurally minimal' feedback assumes high-level properties can be automatically derived and checked for arbitrary problems without domain-specific engineering. If properties are hand-specified or benchmark-dependent (as suggested by the sorting example), the comparison to TDD baselines, which lack equivalent semantic oracles, may not be fair, undermining the claim that the feedback style itself drives the improvement.

    Authors: The manuscript presents property generation as an automated process that extracts high-level invariants (e.g., monotonicity, bounds) from problem descriptions and signatures using general templates, rather than per-problem hand-specification. The sorting example serves only to illustrate the feedback principle; the same extraction logic applies across benchmarks without domain-specific engineering. Property checking relies on lightweight, reusable oracles. We will expand the Method section with additional pseudocode and cross-benchmark examples to make the automation explicit and reinforce the distinction from TDD. revision: yes
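What "lightweight, reusable oracles" might look like can be sketched as a small set of generic predicates shared across unrelated problems. Everything below is hypothetical: the simulated rebuttal names monotonicity and bounds as example invariants, but the oracle names, signatures, and the two toy tasks are assumptions, not the paper's templates.

```python
from typing import Callable, List, Tuple

Oracle = Callable[[List[int], List[int]], bool]  # (input, output) -> holds?


def non_decreasing_output(inp: List[int], out: List[int]) -> bool:
    """Monotonicity invariant on the output."""
    return all(a <= b for a, b in zip(out, out[1:]))


def preserves_length(inp: List[int], out: List[int]) -> bool:
    """Output has as many elements as the input."""
    return len(inp) == len(out)


def within_bounds(lo: int, hi: int) -> Oracle:
    """Parameterized bounds invariant on every output element."""
    return lambda inp, out: all(lo <= v <= hi for v in out)


def violations(candidate, x: List[int], oracles: List[Tuple[str, Oracle]]) -> List[str]:
    """Names of the oracles that the candidate violates on input x."""
    out = candidate(list(x))
    return [name for name, holds in oracles if not holds(x, out)]


def buggy_clamp_all(xs: List[int], lo: int = 0, hi: int = 10) -> List[int]:
    """Hypothetical candidate for a clamping task: forgets the upper bound."""
    return [max(v, lo) for v in xs]


sort_oracles = [("non_decreasing", non_decreasing_output), ("preserves_length", preserves_length)]
clamp_oracles = [("preserves_length", preserves_length), ("within_bounds(0, 10)", within_bounds(0, 10))]

print(violations(sorted, [3, 1, 2], sort_oracles))             # []
print(violations(buggy_clamp_all, [3, 42, -1], clamp_oracles))  # ['within_bounds(0, 10)']
```

Here `preserves_length` is reused unchanged across both tasks. The referee's concern is whether choosing which oracles apply to a given problem can itself be automated.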

Circularity Check

0 steps flagged

No circularity; empirical results independent of inputs

Full rationale

The paper proposes the PGS method and reports its performance gains (13.4% pass@1, >64% fix rate) as direct outcomes of experiments on benchmarks. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract or described method. The core claims rest on the experimental comparison itself rather than any derivation that reduces to the method's own definitions or prior author work by construction. This is a standard empirical software-engineering paper whose central results are falsifiable via replication on the same benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on the existence of checkable high-level properties and an oracle that can produce minimal counterexamples.




Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PBT-Bench: Benchmarking AI Agents on Property-Based Testing

    cs.SE 2026-05 conditional novelty 7.0

    PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.

  2. PBT-Bench: Benchmarking AI Agents on Property-Based Testing

    cs.SE 2026-05 unverdicted novelty 7.0

    PBT-Bench is a new benchmark of 100 property-based testing problems with 365 injected semantic bugs across 40 Python libraries that measures LLMs on deriving invariants and precise input-generation strategies.

  3. BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations

    cs.NE 2026-03 unverdicted novelty 7.0

    BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.

  4. Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis

    cs.SE 2026-04 unverdicted novelty 5.0

    SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.
