Unlocking LLM Code Correction with Iterative Feedback Loops

Le Zhang; Suresh Kothari

arxiv: 2606.17514 · v1 · pith:WP73QVZMnew · submitted 2026-06-16 · 💻 cs.SE · cs.AI

Unlocking LLM Code Correction with Iterative Feedback Loops

Le Zhang , Suresh Kothari This is my paper

Pith reviewed 2026-06-27 00:10 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords LLM code generationiterative refinementfeedback loopscode correctionreasoning modelssyntactic errorsruntime errors

0 comments

The pith

Reasoning LLMs improve code accuracy over iterations when given compiler errors and test feedback, outperforming non-reasoning models on syntactic and runtime issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can refine their own code across multiple attempts when supplied with execution feedback instead of succeeding on the first try. It applies an iterative process to real programming problems in two languages using four models and tracks how performance changes round by round. Reasoning models show steady gains while non-reasoning ones lag, and the work separates error types to show which ones respond to feedback. This approach mirrors how programmers actually work, so the results bear on practical use of LLMs for code tasks. The study supplies metrics for failure categories and rectification patterns to make the comparison concrete.

Core claim

Using real-world problems across four models and two languages, the evaluation shows that reasoning models consistently improve over iterations by leveraging compiler error messages and testcase feedback after each attempt, substantially outperforming non-reasoning models, while syntactic and runtime errors are far more tractable than logical or algorithmic failures.

What carries the argument

Iterative refinement framework in which each model attempt receives compiler error messages and testcase feedback before generating the next code version.

If this is right

Reasoning models reach higher final success rates through repeated feedback rounds than single-attempt generation allows.
Syntactic and runtime errors can be corrected more reliably than logical or algorithmic ones when feedback is supplied.
Non-reasoning models derive limited benefit from the same iterative signals.
Metrics that classify failure types and track rectification patterns can inform which models suit code correction tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feedback mechanism might allow LLMs to handle larger codebases if extended beyond isolated problems.
Training objectives that reward use of execution signals could narrow the gap between reasoning and non-reasoning models.
Embedding this loop inside development environments could change how programmers interact with generated code.

Load-bearing premise

Compiler error messages and testcase feedback after each attempt provide sufficient and representative signals for real-world code correction, and the chosen problems, models, and languages generalize beyond the tested set.

What would settle it

A new test on additional problems or models in which reasoning models show no consistent accuracy gains across iterations would falsify the central result.

Figures

Figures reproduced from arXiv: 2606.17514 by Le Zhang, Suresh Kothari.

**Figure 2.** Figure 2: Iteration limit calibration with DeepSeek-R1. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Iterative Experiment Procedure # Compile Error Feedback: Your code crashed at a compile error: [Error Message] Fix your code with above information. # Runtime Error Feedback: Your code crashed at a runtime error: [Error Message] Fix your code with above information. # Wrong Answer Feedback: Your code generated wrong outputs at testcase: [Testcase] Expected output: [Expected Output] Actual output: [Actual O… view at source ↗

**Figure 4.** Figure 4: Examples of execution feedback [ { role: ”user”, content: ”You are a software developer. Implement ...” }, { role: ”assistant”, content: ”Solution code #1”}, { role: ”user”, content: ”Your code crashed at a compile error: ...”}, { role: ”assistant”, content: ”Solution code #2”}, { role: ”user”, content: ”Your code generated wrong output ...”}, { role: ”assistant”, content: ”Solution code #3”}, ... { role: … view at source ↗

**Figure 5.** Figure 5: Example of multi-turn conversation with LLM [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Time Limit Exceeded error count with and without suggestive hint. [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Success rate comparison, baseline vs iterative framework. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Cumulative ISR@10 for Python solutions. 1 2 3 4 5 6 7 8 9 10 Iteration 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 Cumulative Solved Problems Models DeepSeek_R1 GPT-o4-mini DeepSeek_V3 GPT-4.1-mini [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Cumulative ISR@10 for Java solutions [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 10.** Figure 10: GPT-o4-mini solutions comparison, iteration #1 vs iteration #2. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: GPT-o4-mini solutions comparison, iteration #2 vs iteration #3. [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: DeepSeek-R1 solutions comparison, iteration #1 vs iteration #6. [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

read the original abstract

Large Language Models have shown remarkable capabilities in code generation. However, most existing evaluations focus only on single-attempt accuracy and overlook the iterative refinement process that is central to real-world programming. This study presents a systematic investigation of LLMs' ability to rectify their own code through execution feedback. Using real-world programming problems across four models and two major programming languages, this study evaluates performance using iterative refinement framework where LLMs receive compiler error messages and testcase feedback after each attempt. This study introduces metrics to evaluate code failures, analyze rectification patterns, and compare the effectiveness of reasoning and non-reasoning models, offering actionable insights into both the understanding and practical application of feedback loops in LLM-driven code generation systems. Results show that reasoning models consistently improve over iterations, substantially outperforming non-reasoning models in leveraging feedback, while syntactic and runtime errors are far more tractable than logical or algorithmic failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Reasoning models improve code fixes over iterations with feedback more than non-reasoning ones, but missing experimental details make the claims hard to assess.

read the letter

The paper's core finding is that reasoning models keep getting better at correcting code across iterations when fed compiler errors and test results, while non-reasoning models do not improve as much, and that syntax or runtime errors are easier to resolve than logical or algorithmic ones.

The new elements are the direct comparison of reasoning versus non-reasoning models in an iterative setting and the metrics they introduce for classifying failure types and tracking rectification patterns. These are not standard in single-shot code generation work.

The paper does a clear job showing why single-attempt accuracy misses how actual programming works with repeated feedback.

The main soft spot is the complete absence of experimental details: no dataset size, no model names or versions, no statistical tests, and no controls. Without those, it is impossible to judge whether the reported gaps are reliable or driven by problem selection. The stress-test concern about feedback signals and generalization lands because compiler messages plus test cases may not represent the noisy inputs developers actually face, and the tractability split between error types could shift if the problems were chosen to favor clean feedback.

If the full paper supplies a reproducible dataset, explicit error categorization rules, and basic stats, the results would be worth using. Right now the claims rest on unverified assumptions.

This is for researchers and tool builders working on LLM coding assistants who care about refinement loops. It deserves peer review so the methods and data can be checked properly.

Referee Report

2 major / 1 minor

Summary. The paper presents a systematic investigation of LLMs' ability to rectify their own code through execution feedback using an iterative refinement framework. It evaluates four models on real-world programming problems in two languages, introduces metrics for code failures and rectification patterns, and compares reasoning and non-reasoning models. The results show that reasoning models consistently improve over iterations and outperform non-reasoning models in leveraging feedback, while syntactic and runtime errors are more tractable than logical or algorithmic failures.

Significance. If the results hold, the paper contributes to the understanding of iterative code correction in LLMs, which is more representative of real-world programming than single-attempt evaluations. The distinction between reasoning and non-reasoning models and between error types offers actionable insights for improving LLM code generation systems. The introduction of specific metrics for analyzing rectification patterns is a strength.

major comments (2)

[Abstract] Abstract: The abstract states clear results but supplies no information on experimental design, dataset size, statistical tests, error bars, or controls for confounding factors, so the claims cannot be verified from the given text. This is load-bearing for the central empirical claims about model improvement and error tractability.
[Methodology/Results] The central claim that reasoning models substantially outperform non-reasoning models in leveraging feedback and that syntactic/runtime errors are far more tractable rests on the unexamined premise that the chosen problems, models, and languages capture general behavior; no justification or diversity analysis for problem/model selection is provided, which directly affects generalizability.

minor comments (1)

[Introduction] The paper could include more references to prior work on iterative refinement and feedback in code generation to better contextualize the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract states clear results but supplies no information on experimental design, dataset size, statistical tests, error bars, or controls for confounding factors, so the claims cannot be verified from the given text. This is load-bearing for the central empirical claims about model improvement and error tractability.

Authors: We agree the abstract is too concise for verification of claims. We will revise it to include key experimental details: evaluation on real-world problems across two languages using an iterative refinement framework with execution feedback, four models (reasoning and non-reasoning), and observed trends in success rates by error type. The study is descriptive rather than inferential, so no statistical tests or error bars were applied; we will note this explicitly. This addresses verifiability without altering the core claims. revision: yes
Referee: [Methodology/Results] The central claim that reasoning models substantially outperform non-reasoning models in leveraging feedback and that syntactic/runtime errors are far more tractable rests on the unexamined premise that the chosen problems, models, and languages capture general behavior; no justification or diversity analysis for problem/model selection is provided, which directly affects generalizability.

Authors: We agree explicit justification is needed. The problems were drawn from standard real-world programming benchmarks to reflect practical error distributions, and models were selected to contrast reasoning vs. non-reasoning architectures. In revision we will add a dedicated paragraph in Methodology explaining selection criteria (problem diversity in error types and difficulty; model representativeness) and a Limitations subsection discussing threats to generalizability and scope of claims. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation without derivations or self-referential definitions

full rationale

The paper is an empirical study evaluating LLMs on iterative code correction using compiler and testcase feedback across models and languages. It reports performance observations, introduces metrics for failures and patterns, and compares reasoning vs. non-reasoning models. No equations, derivations, fitted parameters presented as predictions, or self-citations justifying uniqueness/ansatzes appear in the abstract or described content. Central claims rest on experimental results rather than reducing to inputs by construction, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical study; the abstract introduces no free parameters, mathematical axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5674 in / 1120 out tokens · 42300 ms · 2026-06-27T00:10:49.521216+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 12 canonical work pages · 5 internal anchors

[1]

& Rouvoy, R

Coignion, T., Quinton, C. & Rouvoy, R. A performance study of llm-generated code on leetcode.Proceedings Of The 28th International Conference On Evaluation And Assessment In Software Engineering. pp. 79-89 (2024)

2024
[2]

& Combemale, B

D¨ oderlein, J., Kouadio, N., Acher, M., Khelladi, D. & Combemale, B. Piloting Copilot, Codex, and StarCoder2: Hot temperature, cold prompts, or black magic?. Journal Of Systems And Software. pp. 112562 (2025)

2025
[3]

& Chang, K

Zheng, S., Zhang, Y., Zhu, Y., Xi, C., Gao, P., Xun, Z. & Chang, K. Gpt-fathom: Benchmarking large language models to decipher the evolutionary path towards gpt-4 and beyond.Findings Of The Association For Computational Linguistics: NAACL 2024. pp. 1363-1382 (2024)

2024
[4]

& Choi, Y

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The Curious Case of Neural Text Degeneration.International Conference On Learning Representations. (2020)

2020
[5]

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G. & Others Evaluating large language models trained on code.https://doi.org/10.48550/arXiv.2107.03374. (2021)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374 2021
[6]

Program Synthesis with Large Language Models

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q. & Others Program synthesis with large language models. https://doi.org/10.48550/arXiv.2108.07732. (2021)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732 2021
[7]

& Xiong, C

Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S. & Xiong, C. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis.The Eleventh International Conference On Learning Represen- tations. (2023)

2023
[8]

& Kim, S

Jiang, J., Wang, F., Shen, J., Kim, S. & Kim, S. A Survey on Large Lan- guage Models for Code Generation.ACM Trans. Softw. Eng. Methodol.. (2025,7), https://doi.org/10.1145/3747588

work page doi:10.1145/3747588 2025
[9]

& Wang, Y

Jiang, S., Wang, Y. & Wang, Y. Selfevolve: A code evolution framework via large language models.https://doi.org/10.48550/arXiv.2306.02907. (2023)

work page doi:10.48550/arxiv.2306.02907 2023
[10]

& Zhang, S

Yu, Z., Gu, W., Wang, Y., Jiang, X., Zeng, Z., Wang, J., Ye, W. & Zhang, S. Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation.Forty-second International Conference On Machine Learning. (2025)

2025
[11]

& Shang, J

Zhong, L., Wang, Z. & Shang, J. Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step.Findings Of The Asso- ciation For Computational Linguistics ACL 2024. pp. 851-870 (2024)

2024
[12]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Gou, Z., Shao, Z., Gong, Y., Shen, Y., Yang, Y., Duan, N. & Chen, W. Critic: Large language models can self-correct with tool-interactive critiquing. https://doi.org/10.48550/arXiv.2305.11738. (2023)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.11738 2023
[13]

& Cohen, T

Butt, N., Manczak, B., Wiggers, A., Rainone, C., Zhang, D., Defferrard, M. & Cohen, T. CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay.ICML. (2024) 18 Le Zhang et al

2024
[14]

& Zheng, H

Nadimi, B., Boutaib, G. & Zheng, H. Verimind: Agentic llm for automated verilog generation with a novel evaluation metric. https://doi.org/10.48550/arXiv.2503.16514. (2025)

work page doi:10.48550/arxiv.2503.16514 2025
[15]

& Chen, Y

Wang, J. & Chen, Y. A review on code generation with llms: Application and evaluation.2023 IEEE International Conference On Medical Artificial Intelligence (MedAI). pp. 284-289 (2023)

2023
[17]

& Chen, K

Liu, J., Liu, H., Xiao, L., Wang, Z., Liu, K., Gao, S., Zhang, W., Zhang, S. & Chen, K. Are Your LLMs Capable of Stable Reasoning?.Findings Of The Association For Computational Linguistics: ACL 2025. pp. 17594-17632 (2025)

2025
[18]

& Shi, G

Chen, Z., Qin, X., Wu, Y., Ling, Y., Ye, Q., Zhao, W. & Shi, G. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models. https://doi.org/10.48550/arXiv.2506.20807. (2025)

work page doi:10.48550/arxiv.2506.20807 2025
[19]

& Syed, R

Shukla, S., Joshi, H. & Syed, R. Security Degradation in Itera- tive AI Code Generation–A Systematic Analysis of the Paradox. https://doi.org/10.48550/arXiv.2506.11022. (2025)

work page doi:10.48550/arxiv.2506.11022 2025
[20]

& Jin, Z

Wang, Z., Li, J., Li, G. & Jin, Z. ChatCoder: Chat-based refine requirement im- proves LLMs’ code generation.https://doi.org/10.48550/arXiv.2311.00272. (2023)

work page doi:10.48550/arxiv.2311.00272 2023
[21]

& Milanova, M

Davis, D. & Milanova, M. Parsing Requirements for Automatic Prompting of Large Language Models for Requirements Validation.Intelligent Computing-Proceedings Of The Computing Conference. pp. 1-19 (2025)

2025
[22]

Niu, C., Zhang, T., Li, C., Luo, B. & Ng, V. On evaluating the efficiency of source code generated by llms.Proceedings Of The 2024 IEEE/ACM First In- ternational Conference On AI Foundation Models And Software Engineering. pp. 103-107 (2024)

2024
[23]

& Mehta, R

Dougherty, Q. & Mehta, R. Proving the coding interview: A benchmark for for- mally verified code generation.2025 IEEE/ACM International Workshop On Large Language Models For Code (LLM4Code). pp. 72-79 (2025)

2025
[24]

& Nelapati, A

Guimaraes, E., Nascimento, N., Shivalingaiah, C. & Nelapati, A. Analyzing promi- nent llms: An empirical study of performance and complexity in solving leetcode problems.https://doi.org/10.48550/arXiv.2508.03931. (2025)

work page doi:10.48550/arxiv.2508.03931 2025
[25]

& Karkhanis, D

Walder, C. & Karkhanis, D. Pass@K Policy Optimization: Solving Harder Re- inforcement Learning Problems.The Thirty-ninth Annual Conference On Neural Information Processing Systems. (2025)

2025
[26]

Code Generation with LLMs: Practical Challenges, Gotchas, and Nuances

Masood, A. Code Generation with LLMs: Practical Challenges, Gotchas, and Nuances. (2025), https://medium.com/@adnanmasood/code-generation-with- llms-practical-challenges-gotchas-and-nuances-7b51d394f588, [Online; posted 28- Feburary-2025]

2025
[27]

& Bile, A

Pawar, V., Gawande, M., Kollu, A. & Bile, A. Exploring the Potential of Prompt Engineering: A Comprehensive Analysis of Interacting with Large Language Mod- els.2024 8th International Conference On Computing, Communication, Control And Automation (ICCUBEA). pp. 1-9 (2024)

2024
[28]

& Mikkonen, T

Uusn¨ akki, J., Ihantola, P. & Mikkonen, T. Exploring Prompt Engineering with Large Language Models in the Context of Software Maintenance.Intelligent Computing-Proceedings Of The Computing Conference. pp. 166-177 (2025)

2025
[29]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D. & Others DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.https://doi.org/10.48550/arXiv.2501.12948. (2025) Iterative Feedback Loops 19

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
[30]

Is there an optimal temperature and top-p for code generation with paid LLM APIs?

Siml, J. Is there an optimal temperature and top-p for code generation with paid LLM APIs?. (2023), https://forem.julialang.org/svilupp/is-there-an-optimal- temperature-and-top-p-for-code-generation-with-paid-llm-apis-15am

2023
[31]

& Itkonen, J

Lappi, V., Tirronen, V. & Itkonen, J. A replication study on the intuitiveness of programming language syntax.Software Quality Journal.31, 1211-1240 (2023)

2023
[32]

Gradual Typing in an Open World

Vitousek, M. & Siek, J. Gradual typing in an open world. https://doi.org/10.48550/arXiv.1610.08476. (2016)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1610.08476 2016
[33]

The Top Programming Languages 2025.IEEE Spectrum

Cass, S. The Top Programming Languages 2025.IEEE Spectrum. (2025,10), https://spectrum.ieee.org/top-programming-languages-2025

2025
[34]

Minimum Cost to Equal- ize Array

Zhang, L. & Kothari, S. Holistic Evaluation of State-of-the-Art LLMs for Code Generation.ArXiv E-prints. (2025) A Tables Table 6: pass@1 scores across different models and datasets. Dataset Language DeepSeek-V3 DeepSeek-R1 GPT-4.1-mini GPT-o4-mini Core dataset Python 72.4 84.0 76.289.1 Java 71.6 82.4 75.187.3 Strain dataset Python 55.5 79.0 67.087.0 Java ...

2025

[1] [1]

& Rouvoy, R

Coignion, T., Quinton, C. & Rouvoy, R. A performance study of llm-generated code on leetcode.Proceedings Of The 28th International Conference On Evaluation And Assessment In Software Engineering. pp. 79-89 (2024)

2024

[2] [2]

& Combemale, B

D¨ oderlein, J., Kouadio, N., Acher, M., Khelladi, D. & Combemale, B. Piloting Copilot, Codex, and StarCoder2: Hot temperature, cold prompts, or black magic?. Journal Of Systems And Software. pp. 112562 (2025)

2025

[3] [3]

& Chang, K

Zheng, S., Zhang, Y., Zhu, Y., Xi, C., Gao, P., Xun, Z. & Chang, K. Gpt-fathom: Benchmarking large language models to decipher the evolutionary path towards gpt-4 and beyond.Findings Of The Association For Computational Linguistics: NAACL 2024. pp. 1363-1382 (2024)

2024

[4] [4]

& Choi, Y

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The Curious Case of Neural Text Degeneration.International Conference On Learning Representations. (2020)

2020

[5] [5]

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G. & Others Evaluating large language models trained on code.https://doi.org/10.48550/arXiv.2107.03374. (2021)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374 2021

[6] [6]

Program Synthesis with Large Language Models

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q. & Others Program synthesis with large language models. https://doi.org/10.48550/arXiv.2108.07732. (2021)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732 2021

[7] [7]

& Xiong, C

Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S. & Xiong, C. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis.The Eleventh International Conference On Learning Represen- tations. (2023)

2023

[8] [8]

& Kim, S

Jiang, J., Wang, F., Shen, J., Kim, S. & Kim, S. A Survey on Large Lan- guage Models for Code Generation.ACM Trans. Softw. Eng. Methodol.. (2025,7), https://doi.org/10.1145/3747588

work page doi:10.1145/3747588 2025

[9] [9]

& Wang, Y

Jiang, S., Wang, Y. & Wang, Y. Selfevolve: A code evolution framework via large language models.https://doi.org/10.48550/arXiv.2306.02907. (2023)

work page doi:10.48550/arxiv.2306.02907 2023

[10] [10]

& Zhang, S

Yu, Z., Gu, W., Wang, Y., Jiang, X., Zeng, Z., Wang, J., Ye, W. & Zhang, S. Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation.Forty-second International Conference On Machine Learning. (2025)

2025

[11] [11]

& Shang, J

Zhong, L., Wang, Z. & Shang, J. Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step.Findings Of The Asso- ciation For Computational Linguistics ACL 2024. pp. 851-870 (2024)

2024

[12] [12]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Gou, Z., Shao, Z., Gong, Y., Shen, Y., Yang, Y., Duan, N. & Chen, W. Critic: Large language models can self-correct with tool-interactive critiquing. https://doi.org/10.48550/arXiv.2305.11738. (2023)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.11738 2023

[13] [13]

& Cohen, T

Butt, N., Manczak, B., Wiggers, A., Rainone, C., Zhang, D., Defferrard, M. & Cohen, T. CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay.ICML. (2024) 18 Le Zhang et al

2024

[14] [14]

& Zheng, H

Nadimi, B., Boutaib, G. & Zheng, H. Verimind: Agentic llm for automated verilog generation with a novel evaluation metric. https://doi.org/10.48550/arXiv.2503.16514. (2025)

work page doi:10.48550/arxiv.2503.16514 2025

[15] [15]

& Chen, Y

Wang, J. & Chen, Y. A review on code generation with llms: Application and evaluation.2023 IEEE International Conference On Medical Artificial Intelligence (MedAI). pp. 284-289 (2023)

2023

[16] [17]

& Chen, K

Liu, J., Liu, H., Xiao, L., Wang, Z., Liu, K., Gao, S., Zhang, W., Zhang, S. & Chen, K. Are Your LLMs Capable of Stable Reasoning?.Findings Of The Association For Computational Linguistics: ACL 2025. pp. 17594-17632 (2025)

2025

[17] [18]

& Shi, G

Chen, Z., Qin, X., Wu, Y., Ling, Y., Ye, Q., Zhao, W. & Shi, G. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models. https://doi.org/10.48550/arXiv.2506.20807. (2025)

work page doi:10.48550/arxiv.2506.20807 2025

[18] [19]

& Syed, R

Shukla, S., Joshi, H. & Syed, R. Security Degradation in Itera- tive AI Code Generation–A Systematic Analysis of the Paradox. https://doi.org/10.48550/arXiv.2506.11022. (2025)

work page doi:10.48550/arxiv.2506.11022 2025

[19] [20]

& Jin, Z

Wang, Z., Li, J., Li, G. & Jin, Z. ChatCoder: Chat-based refine requirement im- proves LLMs’ code generation.https://doi.org/10.48550/arXiv.2311.00272. (2023)

work page doi:10.48550/arxiv.2311.00272 2023

[20] [21]

& Milanova, M

Davis, D. & Milanova, M. Parsing Requirements for Automatic Prompting of Large Language Models for Requirements Validation.Intelligent Computing-Proceedings Of The Computing Conference. pp. 1-19 (2025)

2025

[21] [22]

Niu, C., Zhang, T., Li, C., Luo, B. & Ng, V. On evaluating the efficiency of source code generated by llms.Proceedings Of The 2024 IEEE/ACM First In- ternational Conference On AI Foundation Models And Software Engineering. pp. 103-107 (2024)

2024

[22] [23]

& Mehta, R

Dougherty, Q. & Mehta, R. Proving the coding interview: A benchmark for for- mally verified code generation.2025 IEEE/ACM International Workshop On Large Language Models For Code (LLM4Code). pp. 72-79 (2025)

2025

[23] [24]

& Nelapati, A

Guimaraes, E., Nascimento, N., Shivalingaiah, C. & Nelapati, A. Analyzing promi- nent llms: An empirical study of performance and complexity in solving leetcode problems.https://doi.org/10.48550/arXiv.2508.03931. (2025)

work page doi:10.48550/arxiv.2508.03931 2025

[24] [25]

& Karkhanis, D

Walder, C. & Karkhanis, D. Pass@K Policy Optimization: Solving Harder Re- inforcement Learning Problems.The Thirty-ninth Annual Conference On Neural Information Processing Systems. (2025)

2025

[25] [26]

Code Generation with LLMs: Practical Challenges, Gotchas, and Nuances

Masood, A. Code Generation with LLMs: Practical Challenges, Gotchas, and Nuances. (2025), https://medium.com/@adnanmasood/code-generation-with- llms-practical-challenges-gotchas-and-nuances-7b51d394f588, [Online; posted 28- Feburary-2025]

2025

[26] [27]

& Bile, A

Pawar, V., Gawande, M., Kollu, A. & Bile, A. Exploring the Potential of Prompt Engineering: A Comprehensive Analysis of Interacting with Large Language Mod- els.2024 8th International Conference On Computing, Communication, Control And Automation (ICCUBEA). pp. 1-9 (2024)

2024

[27] [28]

& Mikkonen, T

Uusn¨ akki, J., Ihantola, P. & Mikkonen, T. Exploring Prompt Engineering with Large Language Models in the Context of Software Maintenance.Intelligent Computing-Proceedings Of The Computing Conference. pp. 166-177 (2025)

2025

[28] [29]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D. & Others DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.https://doi.org/10.48550/arXiv.2501.12948. (2025) Iterative Feedback Loops 19

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025

[29] [30]

Is there an optimal temperature and top-p for code generation with paid LLM APIs?

Siml, J. Is there an optimal temperature and top-p for code generation with paid LLM APIs?. (2023), https://forem.julialang.org/svilupp/is-there-an-optimal- temperature-and-top-p-for-code-generation-with-paid-llm-apis-15am

2023

[30] [31]

& Itkonen, J

Lappi, V., Tirronen, V. & Itkonen, J. A replication study on the intuitiveness of programming language syntax.Software Quality Journal.31, 1211-1240 (2023)

2023

[31] [32]

Gradual Typing in an Open World

Vitousek, M. & Siek, J. Gradual typing in an open world. https://doi.org/10.48550/arXiv.1610.08476. (2016)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1610.08476 2016

[32] [33]

The Top Programming Languages 2025.IEEE Spectrum

Cass, S. The Top Programming Languages 2025.IEEE Spectrum. (2025,10), https://spectrum.ieee.org/top-programming-languages-2025

2025

[33] [34]

Minimum Cost to Equal- ize Array

Zhang, L. & Kothari, S. Holistic Evaluation of State-of-the-Art LLMs for Code Generation.ArXiv E-prints. (2025) A Tables Table 6: pass@1 scores across different models and datasets. Dataset Language DeepSeek-V3 DeepSeek-R1 GPT-4.1-mini GPT-o4-mini Core dataset Python 72.4 84.0 76.289.1 Java 71.6 82.4 75.187.3 Strain dataset Python 55.5 79.0 67.087.0 Java ...

2025