Effective LLM Code Refinement via Property-Oriented and Structurally Minimal Feedback
Pith reviewed 2026-05-19 08:32 UTC · model grok-4.3
The pith
The Property-Generated Solver refines LLM code by checking high-level properties and supplying the simplest failing counterexample instead of relying on noisy test cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PGS operates by checking high-level program properties then providing the simplest failing counterexample to the LLM. By adhering to these principles of being property-oriented and structurally minimal, this targeted feedback mechanism leads to significant performance gains. Specifically, PGS achieves an improvement of up to 13.4% in pass@1 against other TDD-based methods and an over 64% fix rate on problems where the model initially failed, while also delivering a bug fix rate 1.4x-1.6x higher than the strongest debugging-based approaches.
What carries the argument
Property-Generated Solver (PGS) that verifies high-level program properties and returns the simplest failing counterexample to isolate root causes with low cognitive load.
If this is right
- LLMs receive semantic guidance that goes beyond raw I/O mismatches and produces more generalizable fixes.
- The method outperforms other automated debugging approaches by a factor of 1.4x to 1.6x in bug-fix success.
- Property-driven feedback establishes a new state-of-the-art across multiple code refinement benchmarks.
- Structurally minimal signals reduce the chance that the model is misled by extraneous test noise.
Where Pith is reading between the lines
- The same minimal-counterexample style could be tested on LLM outputs in non-code domains such as mathematical derivations or natural-language summaries.
- If property checkers prove easy to generate, integrated development environments might adopt them for real-time suggestions to human programmers.
- Scaling the approach to large codebases would require checking whether property verification remains tractable when functions have complex internal state.
Load-bearing premise
High-level program properties can be checked automatically and the simplest failing counterexample reliably reveals the underlying bug without extra domain engineering per problem.
What would settle it
A controlled comparison on the same benchmarks where a standard TDD baseline using only input-output mismatches matches or exceeds PGS pass@1 and fix rates would falsify the advantage of property-oriented minimal feedback.
Figures
read the original abstract
LLMs excel at code generation, yet ensuring the functional correctness of their outputs remains a persistent challenge. While recent studies have applied Test-Driven Development (TDD) to refine code, these methods are often undermined by poor feedback quality, stemming from the scarcity of high-quality test cases and noisy signals from auto-generated ones. In this work, we shift the focus from test quantity to feedback quality. We introduce the Property-Generated Solver (PGS), a novel paradigm designed to generate highly effective feedback via two principles: it must be property-oriented, to provide semantic guidance beyond simple I/O mismatches, and structurally minimal, to reduce cognitive load and isolate root causes. PGS operates by checking high-level program properties (e.g., a sorting function must produce a non-decreasing sequence) then providing the simplest failing counterexample to the LLM. By adhering to these principles, this targeted feedback mechanism leads to significant performance gains. Specifically, PGS achieves an improvement of up to 13.4% in pass@1 against other TDD-based methods and an over 64% fix rate on problems where the model initially failed. This property-driven, minimal feedback steers LLMs toward correct and generalizable solutions. Across diverse benchmarks, PGS demonstrates superior performance, achieving a bug fix rate 1.4x-1.6x higher than the strongest debugging-based approaches and establishing a new state-of-the-art in automated code refinement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Property-Generated Solver (PGS) for refining LLM-generated code. PGS generates feedback by checking high-level program properties (e.g., non-decreasing output for sorting) and supplying the simplest failing counterexample to the LLM. The feedback is designed to be property-oriented for semantic guidance beyond I/O and structurally minimal to reduce cognitive load. The paper reports up to 13.4% pass@1 improvement over TDD-based methods, over 64% fix rate on initially failed problems, and 1.4x-1.6x higher bug-fix rates than debugging baselines, claiming a new state-of-the-art across diverse benchmarks.
Significance. If the results hold and properties can be obtained automatically without per-problem engineering, the work would be significant for LLM code refinement by shifting emphasis from test quantity to targeted semantic feedback. The minimal-counterexample principle could improve generalizability of fixes if the automation claim is substantiated.
major comments (2)
- [Abstract] Abstract: the performance claims (13.4% pass@1 gain, >64% fix rate, 1.4-1.6x bug-fix improvement) are presented as direct experimental outcomes, yet no benchmarks, problem counts, statistical significance tests, or error bars are referenced, preventing verification that the gains are robust rather than benchmark-specific.
- [Method] Method (property definition and checking): the central attribution of gains to 'property-oriented' and 'structurally minimal' feedback assumes high-level properties can be automatically derived and checked for arbitrary problems without domain-specific engineering. If properties are hand-specified or benchmark-dependent (as suggested by the sorting example), the comparison to TDD baselines—which lack equivalent semantic oracles—may not be fair, undermining the claim that the feedback style itself drives the improvement.
minor comments (1)
- [Abstract] Abstract: the phrase 'structurally minimal' is introduced without a concise definition or example of what constitutes minimal structure in the counterexample, which would aid immediate understanding.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond to each major point below, offering clarifications based on the manuscript content and noting revisions where they will strengthen the presentation.
read point-by-point responses
-
Referee: [Abstract] Abstract: the performance claims (13.4% pass@1 gain, >64% fix rate, 1.4-1.6x bug-fix improvement) are presented as direct experimental outcomes, yet no benchmarks, problem counts, statistical significance tests, or error bars are referenced, preventing verification that the gains are robust rather than benchmark-specific.
Authors: We agree that the abstract's brevity omits explicit references to benchmarks, problem counts, and statistical details, which could aid immediate verification. The main text (Section 4) reports results on HumanEval (164 problems), MBPP, and additional benchmarks, with pass@1 gains averaged over multiple sampling runs and including standard deviations. To improve accessibility, we will revise the abstract to name the primary benchmarks and indicate that full statistical analysis appears in the experimental evaluation. revision: yes
-
Referee: [Method] Method (property definition and checking): the central attribution of gains to 'property-oriented' and 'structurally minimal' feedback assumes high-level properties can be automatically derived and checked for arbitrary problems without domain-specific engineering. If properties are hand-specified or benchmark-dependent (as suggested by the sorting example), the comparison to TDD baselines—which lack equivalent semantic oracles—may not be fair, undermining the claim that the feedback style itself drives the improvement.
Authors: The manuscript presents property generation as an automated process that extracts high-level invariants (e.g., monotonicity, bounds) from problem descriptions and signatures using general templates, rather than per-problem hand-specification. The sorting example serves only to illustrate the feedback principle; the same extraction logic applies across benchmarks without domain-specific engineering. Property checking relies on lightweight, reusable oracles. We will expand the Method section with additional pseudocode and cross-benchmark examples to make the automation explicit and reinforce the distinction from TDD. revision: yes
Circularity Check
No circularity; empirical results independent of inputs
full rationale
The paper proposes the PGS method and reports its performance gains (13.4% pass@1, >64% fix rate) as direct outcomes of experiments on benchmarks. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract or described method. The core claims rest on the experimental comparison itself rather than any derivation that reduces to the method's own definitions or prior author work by construction. This is a standard empirical software-engineering paper whose central results are falsifiable via replication on the same benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
PGS operates by checking high-level program properties (e.g., a sorting function must produce a non-decreasing sequence) then providing the simplest failing counterexample
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanalpha_pin_under_high_calibration unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
property-oriented, to provide semantic guidance beyond simple I/O mismatches, and structurally minimal
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.
-
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
PBT-Bench is a new benchmark of 100 property-based testing problems with 365 injected semantic bugs across 40 Python libraries that measures LLMs on deriving invariants and precise input-generation strategies.
-
BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations
BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
-
Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis
SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.
Reference graph
Works this paper leans on
-
[1]
R. OpenAI, “Gpt-4 technical report. arxiv 2303.08774,” View in Article, vol. 2, no. 5, 2023. 1, 3, 10
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang et al. , “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Q. Zhu, D. Guo, Z. Shao, D. Yang, P. Wang, R. Xu, Y . Wu, Y . Li, H. Gao, S. Ma et al., “Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence,” arXiv preprint arXiv:2406.11931, 2024. 1, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
J. Liu, C. S. Xia, Y . Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” Advances in Neural Information Processing Systems , vol. 36, 2024. 1
work page 2024
-
[5]
Selfevolve: A code evolution framework via large language models,
S. Jiang, Y . Wang, and Y . Wang, “Selfevolve: A code evolution frame- work via large language models,” arXiv preprint arXiv:2306.02907 ,
-
[6]
Debug like a human: A large language model debugger via verifying runtime execution step by step,
L. Zhong, Z. Wang, and J. Shang, “Debug like a human: A large language model debugger via verifying runtime execution step by step,” in Findings of the Association for Computational Linguistics ACL 2024 , 2024, pp. 851–870. 1, 7, 10
work page 2024
-
[7]
Re- flexion: Language agents with verbal reinforcement learning,
N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Re- flexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems , vol. 36, 2024. 1, 6
work page 2024
-
[8]
Studying the effect of ai code generators on supporting novice learners in introductory programming,
M. Kazemitabaar, J. Chow, C. K. T. Ma, B. J. Ericson, D. Weintrop, and T. Grossman, “Studying the effect of ai code generators on supporting novice learners in introductory programming,” in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems , 2023, pp. 1–23. 1
work page 2023
-
[9]
Using github copilot to solve simple programming problems,
M. Wermelinger, “Using github copilot to solve simple programming problems,” in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V . 1, 2023, pp. 172–178. 1
work page 2023
-
[10]
Large language models as test case gen- erators: Performance evaluation and enhancement,
K. Li and Y . Yuan, “Large language models as test case gen- erators: Performance evaluation and enhancement,” arXiv preprint arXiv:2404.13340, 2024. 1, 11
-
[11]
Codet: Code generation with generated tests,
B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and W. Chen, “Codet: Code generation with generated tests,” inThe Eleventh International Conference on Learning Representations , 2023. 1, 6, 7, 10
work page 2023
-
[12]
Llm-powered test case generation for detecting tricky bugs
K. Liu, Y . Liu, Z. Chen, J. M. Zhang, Y . Han, Y . Ma, G. Li, and G. Huang, “Llm-powered test case generation for detecting tricky bugs,” arXiv preprint arXiv:2404.10304 , 2024. 1, 10
-
[13]
An empirical evaluation of using large language models for automated unit test generation,
M. Sch ¨afer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Transactions on Software Engineering, vol. 50, no. 1, pp. 85–105, 2023. 1
work page 2023
-
[14]
The oracle problem in software testing: A survey,
E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,” IEEE transactions on software engineering, vol. 41, no. 5, pp. 507–525, 2014. 1
work page 2014
-
[15]
Togll: Correct and strong test oracle generation with llms,
S. B. Hossain and M. Dwyer, “Togll: Correct and strong test oracle generation with llms,” arXiv preprint arXiv:2405.03786 , 2024. 1
-
[16]
Livecodebench: Holistic and contamination free evaluation of large language models for code,
N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica, “Livecodebench: Holistic and contamination free evaluation of large language models for code,” in The Thirteenth International Conference on Learning Representations ,
-
[17]
Quickcheck: a lightweight tool for random testing of haskell programs,
K. Claessen and J. Hughes, “Quickcheck: a lightweight tool for random testing of haskell programs,” in Proceedings of the fifth ACM SIGPLAN international conference on Functional programming , 2000, pp. 268–
work page 2000
-
[18]
11 Xingyao Wang, Boxuan Li, Yufan Song, Frank F
V . Vikram, C. Lemieux, J. Sunshine, and R. Padhye, “Can large language models write good property-based tests?” arXiv preprint arXiv:2307.04346, 2023. 2, 11
-
[19]
Evaluating Large Language Models Trained on Code
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Program Synthesis with Large Language Models
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732 , 2021. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[21]
Test-driven development and llm- based code generation,
N. S. Mathews and M. Nagappan, “Test-driven development and llm- based code generation,” in Proceedings of the 39th IEEE/ACM Interna- tional Conference on Automated Software Engineering , ser. ASE ’24. Association for Computing Machinery, 2024, p. 1583–1594. 2
work page 2024
-
[22]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
M. Suzgun, N. Scales, N. Sch ¨arli, S. Gehrmann, Y . Tay, H. W. Chung, A. Chowdhery, Q. V . Le, E. H. Chi, D. Zhou et al. , “Challenging big-bench tasks and whether chain-of-thought can solve them,” arXiv preprint arXiv:2210.09261, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[23]
The rise and potential of large language model based agents: A survey,
Z. Xi, W. Chen, X. Guo, W. He, Y . Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,” Science China Information Sciences , vol. 68, no. 2, p. 121101, 2025. 3, 10
work page 2025
-
[24]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025
A. El-Kishky, A. Wei, A. Saraiva, B. Minaiev, D. Selsam, D. Do- han, F. Song, H. Lightman, I. Clavera, J. Pachocki et al. , “Com- petitive programming with large reasoning models,” arXiv preprint arXiv:2502.06807, 2025. 4
-
[26]
Lost in the middle: How language models use long contexts,
N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics , vol. 12,
-
[27]
Hdd: hierarchical delta debugging,
G. Misherghi and Z. Su, “Hdd: hierarchical delta debugging,” in Pro- ceedings of the 28th international conference on Software engineering , 2006, pp. 142–151. 4, 9
work page 2006
-
[28]
H. Zhang, W. Cheng, Y . Wu, and W. Hu, “A pair programming framework for code generation via multi-plan exploration and feedback- driven refinement,” in The 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024) , 2024. 6, 10
work page 2024
-
[29]
Competition- level code generation with alphacode,
Y . Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al. , “Competition- level code generation with alphacode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022. 6
work page 2022
-
[30]
Lever: Learning to verify language-to-code generation with execution,
A. Ni, S. Iyer, D. Radev, V . Stoyanov, W.-t. Yih, S. Wang, and X. V . Lin, “Lever: Learning to verify language-to-code generation with execution,” in International Conference on Machine Learning . PMLR, 2023, pp. 26 106–26 128. 6
work page 2023
-
[31]
Codereval: A benchmark of pragmatic code generation with generative pre-trained models,
H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y . Ma, G. Liang, Y . Li, Q. Wang, and T. Xie, “Codereval: A benchmark of pragmatic code generation with generative pre-trained models,” in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering , 2024, pp. 1–12. 6
work page 2024
-
[32]
Break-it-fix-it: Unsupervised learning for program repair,
M. Yasunaga and P. Liang, “Break-it-fix-it: Unsupervised learning for program repair,” in International conference on machine learning . PMLR, 2021, pp. 11 941–11 952. 6
work page 2021
-
[33]
Qwen2.5-Coder Technical Report
B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu et al. , “Qwen2. 5-coder technical report,” arXiv preprint arXiv:2409.12186, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou et al. , “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems , vol. 35, pp. 24 824–24 837, 2022. 6
work page 2022
-
[35]
Self-edit: Fault-aware code editor for code generation.arXiv preprint arXiv:2305.04087, 2023
K. Zhang, Z. Li, J. Li, G. Li, and Z. Jin, “Self-edit: Fault-aware code editor for code generation,” arXiv preprint arXiv:2305.04087 , 2023. 6, 7, 10
-
[36]
From code to correctness: Closing the last mile of code generation with hierarchical debugging,
Y . Shi, S. Wang, C. Wan, and X. Gu, “From code to correctness: Closing the last mile of code generation with hierarchical debugging,” arXiv preprint arXiv:2410.01215, 2024. 7, 8
-
[37]
Teaching Large Language Models to Self-Debug
X. Chen, M. Lin, N. Sch ¨arli, and D. Zhou, “Teaching large language models to self-debug,” arXiv preprint arXiv:2304.05128 , 2023. 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x,
Q. Zheng, X. Xia, X. Zou, Y . Dong, S. Wang, Y . Xue, L. Shen, Z. Wang, A. Wang, Y . Li et al. , “Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , 2023, pp. 5673–5684. 10
work page 2023
-
[39]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al. , “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025. 10
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023. 10
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024. 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Structured chain-of-thought prompting for code generation,
J. Li, G. Li, Y . Li, and Z. Jin, “Structured chain-of-thought prompting for code generation,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–23, 2025. 10
work page 2025
-
[43]
Self-planning code generation with large language models,
X. Jiang, Y . Dong, L. Wang, Z. Fang, Q. Shang, G. Li, Z. Jin, and W. Jiao, “Self-planning code generation with large language models,” ACM Transactions on Software Engineering and Methodology , vol. 33, no. 7, pp. 1–30, 2024. 10
work page 2024
-
[44]
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
D. Huang, Q. Bu, J. M. Zhang, M. Luck, and H. Cui, “Agentcoder: Multi-agent-based code generation with iterative testing and optimisa- tion,” arXiv preprint arXiv:2312.13010 , 2023. 10
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Self-collaboration code generation via chatgpt,
Y . Dong, X. Jiang, Z. Jin, and G. Li, “Self-collaboration code generation via chatgpt,” ACM Transactions on Software Engineering and Method- ology, vol. 33, no. 7, pp. 1–38, 2024. 10
work page 2024
-
[46]
From llms to llm- based agents for software engineering: A survey of current, challenges and future,
H. Jin, L. Huang, H. Cai, J. Yan, B. Li, and H. Chen, “From llms to llm- based agents for software engineering: A survey of current, challenges and future,” arXiv preprint arXiv:2408.02479 , 2024. 10
-
[47]
A deep dive into large language models for automated bug localization and repair,
S. B. Hossain, N. Jiang, Q. Zhou, X. Li, W.-H. Chiang, Y . Lyu, H. Nguyen, and O. Tripp, “A deep dive into large language models for automated bug localization and repair,” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1471–1493, 2024. 10
work page 2024
-
[48]
Conversational automated program repair,
C. S. Xia and L. Zhang, “Conversational automated program repair,” arXiv preprint arXiv:2301.13246 , 2023. 10
-
[49]
Cath: in- creased structural coverage of functional space,
I. Sillitoe, N. Bordin, N. Dawson, V . P. Waman, P. Ashford, H. M. Scholes, C. S. Pang, L. Woodridge, C. Rauer, N. Sen et al., “Cath: in- creased structural coverage of functional space,” Nucleic acids research, vol. 49, no. D1, pp. D266–D273, 2021. 10
work page 2021
-
[50]
An analysis and survey of the development of mutation testing,
Y . Jia and M. Harman, “An analysis and survey of the development of mutation testing,” IEEE transactions on software engineering , vol. 37, no. 5, pp. 649–678, 2010. 10
work page 2010
-
[51]
Mutation testing advances: an analysis and survey,
M. Papadakis, M. Kintis, J. Zhang, Y . Jia, Y . Le Traon, and M. Harman, “Mutation testing advances: an analysis and survey,” in Advances in computers. Elsevier, 2019, vol. 112, pp. 275–378. 10
work page 2019
-
[52]
J. Zhang, Z. Wang, L. Zhang, D. Hao, L. Zang, S. Cheng, and L. Zhang, “Predictive mutation testing,” in Proceedings of the 25th international symposium on software testing and analysis , 2016, pp. 342–353. 10
work page 2016
-
[53]
Finding and understanding bugs in c compilers,
X. Yang, Y . Chen, E. Eide, and J. Regehr, “Finding and understanding bugs in c compilers,” in Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, 2011, pp. 283–294. 10
work page 2011
-
[54]
Think outside the code: Brainstorming boosts large language models in code generation,
X.-Y . Li, J.-T. Xue, Z. Xie, and M. Li, “Think outside the code: Brainstorming boosts large language models in code generation,” arXiv preprint arXiv:2305.10679, 2023. 10
- [55]
-
[56]
Codecot: Tackling code syntax errors in cot reasoning for code generation,
D. Huang, Q. Bu, Y . Qing, and H. Cui, “Codecot: Tackling code syntax errors in cot reasoning for code generation,” CoRR, vol. 2308, pp. 1–20,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.