Effective LLM Code Refinement via Property-Oriented and Structurally Minimal Feedback

Lehan He; Lu Sheng; Xiang Gao; Zeren Chen; Zhe Zhang

arxiv: 2506.18315 · v2 · submitted 2025-06-23 · 💻 cs.SE · cs.AI


Pith reviewed 2026-05-19 08:32 UTC · model grok-4.3


keywords LLM code refinement · property-oriented feedback · minimal counterexample · test-driven development · automated program repair · feedback quality · code debugging


The pith

The Property-Generated Solver refines LLM code by checking high-level properties and supplying the simplest failing counterexample instead of relying on noisy test cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shifts the focus in LLM code refinement from the quantity of tests to the quality of feedback. It introduces the Property-Generated Solver (PGS), which first verifies semantic properties of the intended program behavior and then returns only the smallest counterexample that violates one of those properties. The goal is to give the model clear semantic direction while avoiding the confusion created by many low-quality or irrelevant tests. If the approach works, LLMs would correct their own outputs more often and produce solutions that generalize beyond the specific examples seen during refinement.

Core claim

PGS operates by checking high-level program properties then providing the simplest failing counterexample to the LLM. By adhering to these principles of being property-oriented and structurally minimal, this targeted feedback mechanism leads to significant performance gains. Specifically, PGS achieves an improvement of up to 13.4% in pass@1 against other TDD-based methods and an over 64% fix rate on problems where the model initially failed, while also delivering a bug fix rate 1.4x-1.6x higher than the strongest debugging-based approaches.

What carries the argument

The argument rests on the Property-Generated Solver (PGS), which verifies high-level program properties and returns the simplest failing counterexample, so that root causes are isolated with low cognitive load.
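The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `buggy_sort`, `violates`, and `shrink` are hypothetical names, the two properties are hand-written for the sorting example, and the greedy shrinking loop stands in for whatever minimization procedure PGS actually uses.

```python
from typing import Callable, List, Optional

Candidate = Callable[[List[int]], List[int]]


def buggy_sort(xs: List[int]) -> List[int]:
    """Hypothetical faulty candidate: sorts, but silently drops duplicates."""
    return sorted(set(xs))


def violates(candidate: Candidate, xs: List[int]) -> Optional[str]:
    """Return the name of the first violated property, or None if all hold."""
    out = candidate(xs)
    if any(a > b for a, b in zip(out, out[1:])):
        return "output is not non-decreasing"
    if sorted(xs) != sorted(out):
        return "output is not a permutation of the input"
    return None


def shrink(candidate: Candidate, xs: List[int]) -> List[int]:
    """Greedily simplify a failing input while it keeps failing.

    First try deleting one element, then try moving one value toward 0.
    The result is locally minimal, not guaranteed globally minimal.
    """
    current = list(xs)
    changed = True
    while changed:
        changed = False
        # Step 1: try to delete a single element.
        for i in range(len(current)):
            trial = current[:i] + current[i + 1:]
            if violates(candidate, trial):
                current, changed = trial, True
                break
        if changed:
            continue
        # Step 2: try to replace a single value with a smaller-magnitude one.
        for i, v in enumerate(current):
            for smaller in (0, v // 2):
                if abs(smaller) < abs(v):
                    trial = current[:i] + [smaller] + current[i + 1:]
                    if violates(candidate, trial):
                        current, changed = trial, True
                        break
            if changed:
                break
    return current


failing_input = [5, 3, 9, 3, 7, 5]
minimal = shrink(buggy_sort, failing_input)
print(minimal, "->", violates(buggy_sort, minimal))
```

On this input the sketch shrinks the six-element list to `[3, 3]`, which exposes the duplicate-dropping bug far more directly than the original input does.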

If this is right

LLMs receive semantic guidance that goes beyond raw I/O mismatches and produces more generalizable fixes.
The method achieves a bug-fix rate 1.4x to 1.6x higher than the strongest debugging-based approaches.
Property-driven feedback establishes a new state-of-the-art across multiple code refinement benchmarks.
Structurally minimal signals reduce the chance that the model is misled by extraneous test noise.
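To make the contrast between a raw I/O mismatch and property-oriented, minimal feedback concrete, the following sketch builds both kinds of message for the same bug. The message wording, the function names, and the buggy candidate are all assumptions made for illustration; none of them is taken from the paper's prompt templates.

```python
from typing import List


def candidate(xs: List[int]) -> List[int]:
    """Hypothetical buggy sort that drops duplicate elements."""
    return sorted(set(xs))


def io_mismatch_feedback(inp: List[int], expected: List[int], actual: List[int]) -> str:
    """Conventional test feedback: one failing input with its expected output."""
    return f"Test failed: f({inp}) returned {actual}, expected {expected}."


def property_feedback(prop: str, inp: List[int], actual: List[int]) -> str:
    """Property-oriented feedback: names the violated property and shows a small input."""
    return (
        f"Property violated: {prop}.\n"
        f"Small failing input: {inp}\n"
        f"Observed output: {actual}"
    )


# A larger, noisier failing test case reported as a plain I/O mismatch.
noisy_input = [5, 3, 9, 3, 7, 5]
io_message = io_mismatch_feedback(noisy_input, sorted(noisy_input), candidate(noisy_input))

# The same bug reported against a property, using a two-element input.
minimal_input = [3, 3]
property_message = property_feedback(
    "the output must contain every input element, including duplicates",
    minimal_input,
    candidate(minimal_input),
)

print(io_message)
print(property_message)
```

The second message states which rule was broken and needs no expected output, whereas the first leaves the model to infer the rule from a six-element diff.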

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same minimal-counterexample style could be tested on LLM outputs in non-code domains such as mathematical derivations or natural-language summaries.
If property checkers prove easy to generate, integrated development environments might adopt them for real-time suggestions to human programmers.
Scaling the approach to large codebases would require checking whether property verification remains tractable when functions have complex internal state.

Load-bearing premise

High-level program properties can be checked automatically and the simplest failing counterexample reliably reveals the underlying bug without extra domain engineering per problem.

What would settle it

A controlled comparison on the same benchmarks where a standard TDD baseline using only input-output mismatches matches or exceeds PGS pass@1 and fix rates would falsify the advantage of property-oriented minimal feedback.
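The comparison proposed above depends on two metrics. As a rough sketch, assuming each problem is attempted once before and once after refinement, pass@1 can be computed as the fraction of problems solved, and the fix rate as the share of initially failing problems that pass after refinement. These definitions are assumptions about how the terms are used here, and the outcome tables are invented for illustration; they are not results from the paper.

```python
from typing import Dict


def pass_at_1(results: Dict[str, bool]) -> float:
    """Fraction of problems whose single sampled solution passes all tests."""
    return sum(results.values()) / len(results)


def fix_rate(before: Dict[str, bool], after: Dict[str, bool]) -> float:
    """Among problems that failed before refinement, the fraction passing afterwards."""
    initially_failed = [problem for problem, ok in before.items() if not ok]
    if not initially_failed:
        return 0.0  # nothing to fix
    return sum(after[problem] for problem in initially_failed) / len(initially_failed)


# Invented per-problem outcomes (True = passes the hidden tests).
before = {"p1": True, "p2": False, "p3": False, "p4": True, "p5": False}
after = {"p1": True, "p2": True, "p3": False, "p4": True, "p5": True}

print(f"pass@1 before refinement: {pass_at_1(before):.2f}")
print(f"pass@1 after refinement:  {pass_at_1(after):.2f}")
print(f"fix rate:                 {fix_rate(before, after):.2f}")
```

Running the same two functions over the outcomes of a PGS run and of an I/O-only TDD baseline on identical problems would give the head-to-head numbers the test calls for.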

Figures

Figures reproduced from arXiv: 2506.18315 by Lehan He, Lu Sheng, Xiang Gao, Zeren Chen, Zhe Zhang.

**Figure 2.** Overview of the Property-Generated Solver framework, showcasing the iterative collaboration between the Generator and the Tester.

**Figure 3.** The prompt template used by the Tester to generate validation and …

**Figure 5.** Contribution of different testing and refinement stages to overall …

**Figure 6.** Comparison of code generation outcome distributions (%) on …

read the original abstract

LLMs excel at code generation, yet ensuring the functional correctness of their outputs remains a persistent challenge. While recent studies have applied Test-Driven Development (TDD) to refine code, these methods are often undermined by poor feedback quality, stemming from the scarcity of high-quality test cases and noisy signals from auto-generated ones. In this work, we shift the focus from test quantity to feedback quality. We introduce the Property-Generated Solver (PGS), a novel paradigm designed to generate highly effective feedback via two principles: it must be property-oriented, to provide semantic guidance beyond simple I/O mismatches, and structurally minimal, to reduce cognitive load and isolate root causes. PGS operates by checking high-level program properties (e.g., a sorting function must produce a non-decreasing sequence) then providing the simplest failing counterexample to the LLM. By adhering to these principles, this targeted feedback mechanism leads to significant performance gains. Specifically, PGS achieves an improvement of up to 13.4% in pass@1 against other TDD-based methods and an over 64% fix rate on problems where the model initially failed. This property-driven, minimal feedback steers LLMs toward correct and generalizable solutions. Across diverse benchmarks, PGS demonstrates superior performance, achieving a bug fix rate 1.4x-1.6x higher than the strongest debugging-based approaches and establishing a new state-of-the-art in automated code refinement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PGS tries to improve LLM code fixes by swapping noisy tests for property checks plus minimal counterexamples, but the gains rest on how automatically those properties are produced.


Colleague, the core point is that this work moves from standard TDD feedback to property-oriented checks that supply semantic guidance, and then returns the simplest failing counterexample. The abstract reports up to 13.4% better pass@1 and 1.4x-1.6x higher bug-fix rates than the baselines it compares against, which would matter if the numbers hold in fuller experiments.

The paper does a clean job of naming the real problem with current test-based refinement: too many low-quality or noisy signals that do not help the model isolate what is actually wrong. Focusing on high-level properties such as sortedness or other invariants, then stripping the counterexample down, is a sensible way to reduce cognitive load and point the LLM at the root cause rather than just showing another failing input. That framing is useful, and the experimental claims are presented directly as outcomes rather than derived quantities.

The soft spot is exactly the one the stress test flags. The abstract implies that the properties are checked automatically, with no domain-specific engineering, yet it is not clear from the given text how the system derives or verifies those properties for new problems. If the full method relies on hand-written properties or external oracles that the TDD baselines never receive, the performance edge could be coming from extra semantic information rather than from the feedback format itself. The comparisons would then need to be re-run with the same property access granted to the other methods.

This paper is aimed at people building or studying iterative LLM code repair loops. A reader who cares about practical feedback design in software-engineering tools would get something out of the idea and the benchmark numbers, provided the property-generation details survive scrutiny. It is worth sending to peer review so that the implementation, benchmark coverage, and statistical reporting can be checked in detail.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Property-Generated Solver (PGS) for refining LLM-generated code. PGS generates feedback by checking high-level program properties (e.g., non-decreasing output for sorting) and supplying the simplest failing counterexample to the LLM. The feedback is designed to be property-oriented for semantic guidance beyond I/O and structurally minimal to reduce cognitive load. The paper reports up to 13.4% pass@1 improvement over TDD-based methods, over 64% fix rate on initially failed problems, and 1.4x-1.6x higher bug-fix rates than debugging baselines, claiming a new state-of-the-art across diverse benchmarks.

Significance. If the results hold and properties can be obtained automatically without per-problem engineering, the work would be significant for LLM code refinement by shifting emphasis from test quantity to targeted semantic feedback. The minimal-counterexample principle could improve generalizability of fixes if the automation claim is substantiated.

major comments (2)

[Abstract] The performance claims (13.4% pass@1 gain, >64% fix rate, 1.4-1.6x bug-fix improvement) are presented as direct experimental outcomes, yet no benchmarks, problem counts, statistical significance tests, or error bars are referenced, preventing verification that the gains are robust rather than benchmark-specific.
[Method] Property definition and checking: the central attribution of gains to 'property-oriented' and 'structurally minimal' feedback assumes high-level properties can be automatically derived and checked for arbitrary problems without domain-specific engineering. If properties are hand-specified or benchmark-dependent (as suggested by the sorting example), the comparison to TDD baselines, which lack equivalent semantic oracles, may not be fair, undermining the claim that the feedback style itself drives the improvement.

minor comments (1)

[Abstract] The phrase 'structurally minimal' is introduced without a concise definition or an example of what constitutes minimal structure in the counterexample; either would aid immediate understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major point below, offering clarifications based on the manuscript content and noting revisions where they will strengthen the presentation.


Referee: [Abstract] The performance claims (13.4% pass@1 gain, >64% fix rate, 1.4-1.6x bug-fix improvement) are presented as direct experimental outcomes, yet no benchmarks, problem counts, statistical significance tests, or error bars are referenced, preventing verification that the gains are robust rather than benchmark-specific.

Authors: We agree that the abstract's brevity omits explicit references to benchmarks, problem counts, and statistical details, which could aid immediate verification. The main text (Section 4) reports results on HumanEval (164 problems), MBPP, and additional benchmarks, with pass@1 gains averaged over multiple sampling runs and including standard deviations. To improve accessibility, we will revise the abstract to name the primary benchmarks and indicate that full statistical analysis appears in the experimental evaluation.

revision: yes

Referee: [Method] Property definition and checking: the central attribution of gains to 'property-oriented' and 'structurally minimal' feedback assumes high-level properties can be automatically derived and checked for arbitrary problems without domain-specific engineering. If properties are hand-specified or benchmark-dependent (as suggested by the sorting example), the comparison to TDD baselines, which lack equivalent semantic oracles, may not be fair, undermining the claim that the feedback style itself drives the improvement.

Authors: The manuscript presents property generation as an automated process that extracts high-level invariants (e.g., monotonicity, bounds) from problem descriptions and signatures using general templates, rather than per-problem hand-specification. The sorting example serves only to illustrate the feedback principle; the same extraction logic applies across benchmarks without domain-specific engineering. Property checking relies on lightweight, reusable oracles. We will expand the Method section with additional pseudocode and cross-benchmark examples to make the automation explicit and reinforce the distinction from TDD.

revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results independent of inputs


The paper proposes the PGS method and reports its performance gains (13.4% pass@1, >64% fix rate) as direct outcomes of experiments on benchmarks. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract or described method. The core claims rest on the experimental comparison itself rather than any derivation that reduces to the method's own definitions or prior author work by construction. This is a standard empirical software-engineering paper whose central results are falsifiable via replication on the same benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on the existence of checkable high-level properties and an oracle that can produce minimal counterexamples.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "PGS operates by checking high-level program properties (e.g., a sorting function must produce a non-decreasing sequence) then providing the simplest failing counterexample"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "property-oriented, to provide semantic guidance beyond simple I/O mismatches, and structurally minimal"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PBT-Bench: Benchmarking AI Agents on Property-Based Testing
cs.SE 2026-05 conditional novelty 7.0

PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.

PBT-Bench: Benchmarking AI Agents on Property-Based Testing
cs.SE 2026-05 unverdicted novelty 7.0

PBT-Bench is a new benchmark of 100 property-based testing problems with 365 injected semantic bugs across 40 Python libraries that measures LLMs on deriving invariants and precise input-generation strategies.

BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations
cs.NE 2026-03 unverdicted novelty 7.0

BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.

Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis
cs.SE 2026-04 unverdicted novelty 5.0

SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 3 Pith papers · 13 internal anchors

[1]

GPT-4 Technical Report

R. OpenAI, “Gpt-4 technical report. arxiv 2303.08774,” View in Article, vol. 2, no. 5, 2023. 1, 3, 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Qwen Technical Report

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang et al. , “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Q. Zhu, D. Guo, Z. Shao, D. Yang, P. Wang, R. Xu, Y . Wu, Y . Li, H. Gao, S. Ma et al., “Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence,” arXiv preprint arXiv:2406.11931, 2024. 1, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,

J. Liu, C. S. Xia, Y . Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” Advances in Neural Information Processing Systems , vol. 36, 2024. 1

work page 2024
[5]

Selfevolve: A code evolution framework via large language models,

S. Jiang, Y . Wang, and Y . Wang, “Selfevolve: A code evolution frame- work via large language models,” arXiv preprint arXiv:2306.02907 ,

work page arXiv
[6]

Debug like a human: A large language model debugger via verifying runtime execution step by step,

L. Zhong, Z. Wang, and J. Shang, “Debug like a human: A large language model debugger via verifying runtime execution step by step,” in Findings of the Association for Computational Linguistics ACL 2024 , 2024, pp. 851–870. 1, 7, 10

work page 2024
[7]

Re- flexion: Language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Re- flexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems , vol. 36, 2024. 1, 6

work page 2024
[8]

Studying the effect of ai code generators on supporting novice learners in introductory programming,

M. Kazemitabaar, J. Chow, C. K. T. Ma, B. J. Ericson, D. Weintrop, and T. Grossman, “Studying the effect of ai code generators on supporting novice learners in introductory programming,” in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems , 2023, pp. 1–23. 1

work page 2023
[9]

Using github copilot to solve simple programming problems,

M. Wermelinger, “Using github copilot to solve simple programming problems,” in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V . 1, 2023, pp. 172–178. 1

work page 2023
[10]

Large language models as test case gen- erators: Performance evaluation and enhancement,

K. Li and Y . Yuan, “Large language models as test case gen- erators: Performance evaluation and enhancement,” arXiv preprint arXiv:2404.13340, 2024. 1, 11

work page arXiv 2024
[11]

Codet: Code generation with generated tests,

B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and W. Chen, “Codet: Code generation with generated tests,” inThe Eleventh International Conference on Learning Representations , 2023. 1, 6, 7, 10

work page 2023
[12]

Llm-powered test case generation for detecting tricky bugs

K. Liu, Y . Liu, Z. Chen, J. M. Zhang, Y . Han, Y . Ma, G. Li, and G. Huang, “Llm-powered test case generation for detecting tricky bugs,” arXiv preprint arXiv:2404.10304 , 2024. 1, 10

work page arXiv 2024
[13]

An empirical evaluation of using large language models for automated unit test generation,

M. Sch ¨afer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Transactions on Software Engineering, vol. 50, no. 1, pp. 85–105, 2023. 1

work page 2023
[14]

The oracle problem in software testing: A survey,

E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,” IEEE transactions on software engineering, vol. 41, no. 5, pp. 507–525, 2014. 1

work page 2014
[15]

Togll: Correct and strong test oracle generation with llms,

S. B. Hossain and M. Dwyer, “Togll: Correct and strong test oracle generation with llms,” arXiv preprint arXiv:2405.03786 , 2024. 1

work page arXiv 2024
[16]

Livecodebench: Holistic and contamination free evaluation of large language models for code,

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica, “Livecodebench: Holistic and contamination free evaluation of large language models for code,” in The Thirteenth International Conference on Learning Representations ,

work page
[17]

Quickcheck: a lightweight tool for random testing of haskell programs,

K. Claessen and J. Hughes, “Quickcheck: a lightweight tool for random testing of haskell programs,” in Proceedings of the fifth ACM SIGPLAN international conference on Functional programming , 2000, pp. 268–

work page 2000
[18]

11 Xingyao Wang, Boxuan Li, Yufan Song, Frank F

V . Vikram, C. Lemieux, J. Sunshine, and R. Padhye, “Can large language models write good property-based tests?” arXiv preprint arXiv:2307.04346, 2023. 2, 11

work page arXiv 2023
[19]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374 ,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732 , 2021. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Test-driven development and llm- based code generation,

N. S. Mathews and M. Nagappan, “Test-driven development and llm- based code generation,” in Proceedings of the 39th IEEE/ACM Interna- tional Conference on Automated Software Engineering , ser. ASE ’24. Association for Computing Machinery, 2024, p. 1583–1594. 2

work page 2024
[22]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

M. Suzgun, N. Scales, N. Sch ¨arli, S. Gehrmann, Y . Tay, H. W. Chung, A. Chowdhery, Q. V . Le, E. H. Chi, D. Zhou et al. , “Challenging big-bench tasks and whether chain-of-thought can solve them,” arXiv preprint arXiv:2210.09261, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

The rise and potential of large language model based agents: A survey,

Z. Xi, W. Chen, X. Guo, W. He, Y . Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,” Science China Information Sciences , vol. 68, no. 2, p. 121101, 2025. 3, 10

work page 2025
[24]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948 ,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025

A. El-Kishky, A. Wei, A. Saraiva, B. Minaiev, D. Selsam, D. Do- han, F. Song, H. Lightman, I. Clavera, J. Pachocki et al. , “Com- petitive programming with large reasoning models,” arXiv preprint arXiv:2502.06807, 2025. 4

work page arXiv 2025
[26]

Lost in the middle: How language models use long contexts,

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics , vol. 12,

work page
[27]

Hdd: hierarchical delta debugging,

G. Misherghi and Z. Su, “Hdd: hierarchical delta debugging,” in Pro- ceedings of the 28th international conference on Software engineering , 2006, pp. 142–151. 4, 9

work page 2006
[28]

A pair programming framework for code generation via multi-plan exploration and feedback- driven refinement,

H. Zhang, W. Cheng, Y . Wu, and W. Hu, “A pair programming framework for code generation via multi-plan exploration and feedback- driven refinement,” in The 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024) , 2024. 6, 10

work page 2024
[29]

Competition- level code generation with alphacode,

Y . Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al. , “Competition- level code generation with alphacode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022. 6

work page 2022
[30]

Lever: Learning to verify language-to-code generation with execution,

A. Ni, S. Iyer, D. Radev, V . Stoyanov, W.-t. Yih, S. Wang, and X. V . Lin, “Lever: Learning to verify language-to-code generation with execution,” in International Conference on Machine Learning . PMLR, 2023, pp. 26 106–26 128. 6

work page 2023
[31]

Codereval: A benchmark of pragmatic code generation with generative pre-trained models,

H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y . Ma, G. Liang, Y . Li, Q. Wang, and T. Xie, “Codereval: A benchmark of pragmatic code generation with generative pre-trained models,” in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering , 2024, pp. 1–12. 6

work page 2024
[32]

Break-it-fix-it: Unsupervised learning for program repair,

M. Yasunaga and P. Liang, “Break-it-fix-it: Unsupervised learning for program repair,” in International conference on machine learning . PMLR, 2021, pp. 11 941–11 952. 6

work page 2021
[33]

Qwen2.5-Coder Technical Report

B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu et al. , “Qwen2. 5-coder technical report,” arXiv preprint arXiv:2409.12186, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou et al. , “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems , vol. 35, pp. 24 824–24 837, 2022. 6

work page 2022
[35]

Self-edit: Fault-aware code editor for code generation.arXiv preprint arXiv:2305.04087, 2023

K. Zhang, Z. Li, J. Li, G. Li, and Z. Jin, “Self-edit: Fault-aware code editor for code generation,” arXiv preprint arXiv:2305.04087 , 2023. 6, 7, 10

work page arXiv 2023
[36]

From code to correctness: Closing the last mile of code generation with hierarchical debugging,

Y . Shi, S. Wang, C. Wan, and X. Gu, “From code to correctness: Closing the last mile of code generation with hierarchical debugging,” arXiv preprint arXiv:2410.01215, 2024. 7, 8

work page arXiv 2024
[37]

Teaching Large Language Models to Self-Debug

X. Chen, M. Lin, N. Sch ¨arli, and D. Zhou, “Teaching large language models to self-debug,” arXiv preprint arXiv:2304.05128 , 2023. 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x,

Q. Zheng, X. Xia, X. Zou, Y . Dong, S. Wang, Y . Xue, L. Shen, Z. Wang, A. Wang, Y . Li et al. , “Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , 2023, pp. 5673–5684. 10

work page 2023
[39]

Qwen3 Technical Report

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al. , “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

DeepSeek-V3 Technical Report

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Structured chain-of-thought prompting for code generation,

J. Li, G. Li, Y . Li, and Z. Jin, “Structured chain-of-thought prompting for code generation,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–23, 2025. 10

work page 2025
[43]

Self-planning code generation with large language models,

X. Jiang, Y . Dong, L. Wang, Z. Fang, Q. Shang, G. Li, Z. Jin, and W. Jiao, “Self-planning code generation with large language models,” ACM Transactions on Software Engineering and Methodology , vol. 33, no. 7, pp. 1–30, 2024. 10

work page 2024
[44]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

D. Huang, Q. Bu, J. M. Zhang, M. Luck, and H. Cui, “Agentcoder: Multi-agent-based code generation with iterative testing and optimisa- tion,” arXiv preprint arXiv:2312.13010 , 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Self-collaboration code generation via chatgpt,

Y . Dong, X. Jiang, Z. Jin, and G. Li, “Self-collaboration code generation via chatgpt,” ACM Transactions on Software Engineering and Method- ology, vol. 33, no. 7, pp. 1–38, 2024. 10

work page 2024
[46]

From llms to llm- based agents for software engineering: A survey of current, challenges and future,

H. Jin, L. Huang, H. Cai, J. Yan, B. Li, and H. Chen, “From llms to llm- based agents for software engineering: A survey of current, challenges and future,” arXiv preprint arXiv:2408.02479 , 2024. 10

work page arXiv 2024
[47]

A deep dive into large language models for automated bug localization and repair,

S. B. Hossain, N. Jiang, Q. Zhou, X. Li, W.-H. Chiang, Y . Lyu, H. Nguyen, and O. Tripp, “A deep dive into large language models for automated bug localization and repair,” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1471–1493, 2024. 10

work page 2024
[48]

Conversational automated program repair,

C. S. Xia and L. Zhang, “Conversational automated program repair,” arXiv preprint arXiv:2301.13246 , 2023. 10

work page arXiv 2023
[49]

Cath: in- creased structural coverage of functional space,

I. Sillitoe, N. Bordin, N. Dawson, V . P. Waman, P. Ashford, H. M. Scholes, C. S. Pang, L. Woodridge, C. Rauer, N. Sen et al., “Cath: in- creased structural coverage of functional space,” Nucleic acids research, vol. 49, no. D1, pp. D266–D273, 2021. 10

[50]

An analysis and survey of the development of mutation testing

Y. Jia and M. Harman, “An analysis and survey of the development of mutation testing,” IEEE transactions on software engineering, vol. 37, no. 5, pp. 649–678, 2010.

[51]

Mutation testing advances: an analysis and survey

M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. Le Traon, and M. Harman, “Mutation testing advances: an analysis and survey,” in Advances in computers. Elsevier, 2019, vol. 112, pp. 275–378.

[52]

Predictive mutation testing

J. Zhang, Z. Wang, L. Zhang, D. Hao, L. Zang, S. Cheng, and L. Zhang, “Predictive mutation testing,” in Proceedings of the 25th international symposium on software testing and analysis, 2016, pp. 342–353.

[53]

Finding and understanding bugs in c compilers

X. Yang, Y. Chen, E. Eide, and J. Regehr, “Finding and understanding bugs in c compilers,” in Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, 2011, pp. 283–294.

[54]

Think outside the code: Brainstorming boosts large language models in code generation

X.-Y. Li, J.-T. Xue, Z. Xie, and M. Li, “Think outside the code: Brainstorming boosts large language models in code generation,” arXiv preprint arXiv:2305.10679, 2023.

[55]

Unit test case generation with transformers and focal context

M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, “Unit test case generation with transformers and focal context,” arXiv preprint arXiv:2009.05617, 2020.

[56]

Codecot: Tackling code syntax errors in cot reasoning for code generation

D. Huang, Q. Bu, Y. Qing, and H. Cui, “Codecot: Tackling code syntax errors in cot reasoning for code generation,” CoRR, vol. 2308, pp. 1–20,

