How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3
The pith
Iterative self-repair by feeding execution errors back to large language models raises code generation pass rates across model scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that feeding execution error messages back to the model for correction increases pass rates on HumanEval by 4.9 to 17.1 percentage points and on MBPP by 16.0 to 30.0 percentage points. This holds across dense and mixture-of-experts architectures and for instruction-tuned models down to 8B parameters. Most of the benefit occurs in the first two repair rounds. Assertion errors remain the hardest to fix, at roughly 45 percent repair success. Adding chain-of-thought to the repair prompt yields up to 5.5 points of additional repair gain for stronger models.
What carries the argument
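Iterative self-repair: the model's code is executed, any execution error message is passed back to the model, and the model generates a corrected version, repeating until the code passes or the attempt budget (five attempts in the paper) runs out.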
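A minimal sketch of such a repair loop may help make the mechanism concrete. It is an illustration under stated assumptions, not the paper's harness: `run_tests`, `self_repair`, and the stub `fake_model` are hypothetical names, the repair prompt wording is invented for the example, and the code is executed in-process without sandboxing.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_tests(code: str, tests: str) -> Optional[str]:
    """Run candidate code, then its tests; return None on success, else the traceback text."""
    namespace: dict = {}
    try:
        # NOTE: exec is unsandboxed here; a real harness should isolate untrusted code.
        exec(code, namespace)
        exec(tests, namespace)
    except Exception:
        return traceback.format_exc()
    return None


def self_repair(
    task_prompt: str,
    tests: str,
    generate: Callable[[str], str],
    max_attempts: int = 5,
) -> Tuple[bool, int, str]:
    """Generate code, run it, and feed any execution error back until it passes or attempts run out."""
    prompt = task_prompt
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt)
        error = run_tests(code, tests)
        if error is None:
            return True, attempt, code
        # Hypothetical repair prompt; the paper's exact template is not quoted in this review.
        prompt = (
            f"{task_prompt}\n\nPrevious attempt:\n{code}\n\n"
            f"Execution error:\n{error}\n\nReturn a corrected version."
        )
    return False, max_attempts, code


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call: wrong on the first try, correct once it sees an error."""
    if "Execution error" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


solved, attempts, final_code = self_repair(
    "Write add(a, b).", "assert add(2, 3) == 5", fake_model
)
print(solved, attempts)  # True 2
```

Swapping `fake_model` for a real model client is the only change needed to try the loop against an actual API; the attempt counter is what a per-round pass-rate curve would be built from.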
Load-bearing premise
That execution error messages are clear and detailed enough for the models to interpret and produce accurate fixes.
What would settle it
Running the same models and benchmarks with error messages supplied but measuring no net rise in solved problems after five attempts.
Original abstract
Large language models frequently fail to produce correct code on their first attempt, yet most benchmarks evaluate them in a single-shot setting. We investigate iterative self-repair (feeding execution errors back to the model for correction) across seven models spanning three families and both open-weight and proprietary providers: Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout (MoE, 16 experts), Llama 4 Maverick (MoE, 128 experts), Qwen3 32B, Gemini 2.5 Flash, and Gemini 2.5 Pro. On HumanEval (164 problems) and MBPP Sanitized (257 problems) with up to five attempts, self-repair universally improves pass rates: +4.9 to +17.1 pp on HumanEval and +16.0 to +30.0 pp on MBPP. Gemini 2.5 Flash achieves the highest final pass rates (96.3% HumanEval, 93.8% MBPP). Most gains concentrate in the first two rounds. Error-type analysis shows assertion errors (logical mistakes) are the hardest to repair at ~45%, while syntax and name errors are repaired at substantially higher rates, connecting to broader findings on the limits of LLM self-correction. Prior work found that weaker models fail at self-repair or require fine-tuning; we show that modern instruction-tuned models succeed with prompting alone, even at 8B scale. We also provide the first comparison of dense and MoE architectures for self-repair, and extend the repair-vs-resampling tradeoff analysis to modern models. A prompt ablation reveals chain-of-thought repair yields up to +5.5 pp additional self-repair gain (measured as improvement in repair delta) over minimal prompting for capable models.
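The error-type analysis mentioned in the abstract can be pictured as grouping failed attempts by exception class and computing a repair rate per group. The sketch below is one plausible way to do that, not the authors' procedure: `error_type` and `repair_rates` are hypothetical helpers, and the four `records` are made-up examples rather than data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_type(traceback_text: str) -> str:
    """Return the exception class name from the last non-empty line of a Python traceback."""
    lines = [ln for ln in traceback_text.strip().splitlines() if ln.strip()]
    if not lines:
        return "Unknown"
    # The final line looks like "NameError: name 'x' is not defined" or just "AssertionError".
    name = lines[-1].split(":", 1)[0].strip().rsplit(".", 1)[-1]
    return name if name.isidentifier() else "Unknown"


def repair_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each error type to the fraction of its failures that a later attempt repaired."""
    fixed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for tb, repaired in records:
        kind = error_type(tb)
        total[kind] += 1
        fixed[kind] += int(repaired)
    return {kind: fixed[kind] / total[kind] for kind in total}


# Hypothetical (traceback, was_repaired) pairs, for illustration only.
records = [
    ('Traceback (most recent call last):\n  File "<string>", line 1, in <module>\nAssertionError', False),
    ('Traceback (most recent call last):\n  File "<string>", line 1, in <module>\nAssertionError: expected 5', True),
    ('Traceback (most recent call last):\n  File "<string>", line 3, in f\nNameError: name \'math\' is not defined', True),
    ('  File "<string>", line 2\n    def f(:\n          ^\nSyntaxError: invalid syntax', True),
]
rates = repair_rates(records)
print(rates)
```

Keying on the exception class is a coarse choice: an `AssertionError` from a failed test covers many distinct logical mistakes, which is consistent with that category being the hardest to repair.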
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically investigates iterative self-repair in LLM code generation, where execution errors are fed back to the model for correction. Across seven models (Llama 3.1 8B, Llama 3.3 70B, Llama 4 variants, Qwen3 32B, Gemini 2.5 Flash/Pro) on HumanEval (164 problems) and MBPP Sanitized (257 problems) with up to five attempts, it reports consistent pass-rate gains of +4.9 to +17.1 pp on HumanEval and +16.0 to +30.0 pp on MBPP. Most gains occur in the first two rounds; assertion errors are hardest to repair (~45% success) while syntax/name errors are easier. The work claims modern instruction-tuned models succeed at self-repair via prompting alone even at 8B scale (contrasting prior work), provides the first dense-vs-MoE comparison for this task, extends repair-vs-resampling analysis, and shows chain-of-thought repair adds up to +5.5 pp over minimal prompts.
Significance. If the results hold under full reproducibility details, the paper makes a useful empirical contribution by showing self-repair is broadly effective for current instruction-tuned models without fine-tuning, including smaller scales and MoE architectures. The error-type breakdown links to known LLM limitations in logical reasoning, and the architecture and prompt comparisons add concrete data points for practitioners building coding agents. The scale of the evaluation (multiple families, two benchmarks, iteration tracking) strengthens its value as a reference study.
major comments (2)
- [§3] §3 (Methodology / Prompt Construction): The exact base repair prompt template and the precise format of execution error feedback (full traceback with test inputs/assertions vs. stripped message) are not quoted or exemplified. This is load-bearing for the central claim that 'modern instruction-tuned models succeed with prompting alone' (including the 8B-scale result), because the prompt ablation already demonstrates +5.5 pp sensitivity to CoT scaffolding; without the exact strings used in the main condition, it is impossible to rule out unstated informativeness or engineering advantages that could explain the contrast with prior work on weaker models.
- [§4] §4 (Results): The concrete percentage-point gains are reported without random seeds, number of independent runs per problem, or statistical significance measures (confidence intervals or p-values). Given fixed problem counts (164/257) and the universality claim, this omission weakens confidence that the +4.9–17.1 pp HumanEval range reflects stable self-repair effects rather than sampling variability; adding these details would directly support the reported deltas and error-type repair rates.
minor comments (2)
- [Abstract] Abstract: The gain ranges (+4.9 to +17.1 pp, +16.0 to +30.0 pp) are useful but would be clearer if the text or a table explicitly mapped the lower/upper bounds to specific models.
- Consider moving the full prompt templates and one or two example error-message strings to an appendix; this is a minor reproducibility aid rather than a core methodological gap.
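To illustrate the distinction the first major comment draws between a full traceback and a stripped message, the following sketch produces both forms for the same failure. It is an assumption-laden example rather than the paper's feedback format: `feedback` is a hypothetical helper, the `area` task is invented, and the code runs via unsandboxed `exec`, so the full traceback refers to `<string>` instead of showing source lines as a file-based run would.

```python
import traceback


def feedback(code: str, tests: str, full: bool) -> str:
    """Run code plus tests and return the error feedback string, or "" if everything passes."""
    namespace: dict = {}
    try:
        # NOTE: unsandboxed exec, acceptable only for a toy illustration.
        exec(code, namespace)
        exec(tests, namespace)
    except Exception as exc:
        if full:
            # Full form: stack frames followed by the exception line.
            parts = traceback.format_exception(type(exc), exc, exc.__traceback__)
        else:
            # Stripped form: only the final "ExceptionType: message" line.
            parts = traceback.format_exception_only(type(exc), exc)
        return "".join(parts).strip()
    return ""


# Hypothetical buggy solution and test, not taken from HumanEval or MBPP.
buggy = "def area(r):\n    return 3.14 * r\n"
test = "assert area(2) == 12.56, 'area(2) should be 12.56'"

stripped = feedback(buggy, test, full=False)
full_text = feedback(buggy, test, full=True)
print(stripped)  # AssertionError: area(2) should be 12.56
print(full_text)
```

How much the assertion message reveals (here, the expected value) is exactly the kind of informativeness the referee asks the authors to document.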
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. The comments highlight important aspects of reproducibility and statistical rigor that we address below.
-
Referee: [§3] §3 (Methodology / Prompt Construction): The exact base repair prompt template and the precise format of execution error feedback (full traceback with test inputs/assertions vs. stripped message) are not quoted or exemplified. This is load-bearing for the central claim that 'modern instruction-tuned models succeed with prompting alone' (including the 8B-scale result), because the prompt ablation already demonstrates +5.5 pp sensitivity to CoT scaffolding; without the exact strings used in the main condition, it is impossible to rule out unstated informativeness or engineering advantages that could explain the contrast with prior work on weaker models.
Authors: We agree that full transparency on the prompt and error feedback format is essential to support the claim that modern instruction-tuned models succeed at self-repair via prompting alone. The manuscript currently describes the prompting approach at a high level but does not include the verbatim templates. In the revised version we will add the complete base repair prompt (including the exact error feedback formatting) to the Methodology section and provide illustrative examples of full versus stripped error messages. This will enable direct comparison with prior work and allow readers to evaluate the degree of informativeness in the feedback. revision: yes
-
Referee: [§4] §4 (Results): The concrete percentage-point gains are reported without random seeds, number of independent runs per problem, or statistical significance measures (confidence intervals or p-values). Given fixed problem counts (164/257) and the universality claim, this omission weakens confidence that the +4.9–17.1 pp HumanEval range reflects stable self-repair effects rather than sampling variability; adding these details would directly support the reported deltas and error-type repair rates.
Authors: We acknowledge that reporting experimental variability strengthens confidence in the results. Our evaluation used a single deterministic run per problem (temperature 0 where supported by the model API, or the lowest supported temperature otherwise) with fixed seeds for any stochastic components. We will revise §4 and the experimental setup to explicitly state the number of runs (one per problem), the decoding parameters, and any seeds used. We will also add a brief discussion of potential variability and, where computationally feasible, include bootstrap-derived confidence intervals around the pass-rate deltas. This directly addresses concerns about sampling variability while preserving the existing experimental design. revision: partial
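The bootstrap-derived confidence intervals the authors propose could be computed along the following lines. This is a sketch of a paired percentile bootstrap over problems, one common choice and not necessarily what the revision will use: `bootstrap_delta_ci` is a hypothetical function, and the per-problem outcomes (130 of 164 passing on the first attempt, 150 of 164 after repair) are invented numbers, not results from the paper.

```python
import random
from typing import Sequence, Tuple


def bootstrap_delta_ci(
    first: Sequence[int],
    final: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for the pass-rate gain final minus first.

    first[i] and final[i] are 0/1 outcomes for the same problem i, so problems are
    resampled as pairs to respect the dependence between the two conditions.
    """
    if len(first) != len(final) or not first:
        raise ValueError("first and final must be non-empty and equally long")
    rng = random.Random(seed)
    n = len(first)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(final[i] - first[i] for i in idx) / n)
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = (sum(final) - sum(first)) / n
    return point, lower, upper


# Hypothetical outcomes for 164 problems: 130 pass at once, 20 more pass after repair.
first_pass = [1] * 130 + [0] * 34
final_pass = [1] * 150 + [0] * 14

point, lower, upper = bootstrap_delta_ci(first_pass, final_pass)
print(f"gain = {100 * point:.1f} pp, 95% CI [{100 * lower:.1f}, {100 * upper:.1f}] pp")
```

Because the authors report a single deterministic run per problem, an interval of this kind captures uncertainty from the finite problem set only, not run-to-run decoding variability.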
Circularity Check
No circularity: purely empirical measurement study
The paper reports direct experimental measurements of pass rates under iterative self-repair on fixed benchmarks (HumanEval, MBPP) across seven models, with error-type breakdowns and prompt ablations. No equations, fitted parameters, derivations, or first-principles claims appear; all results are observational counts of success/failure after feeding execution feedback. Prior-work citations serve only as contrast, not as load-bearing premises that reduce the present findings to self-citation. The central claims (gains of +4.9–30 pp, success at 8B scale with prompting alone) are therefore independent of any definitional or fitted-input loop and remain falsifiable by replication on the same benchmarks.
Forward citations
Cited by 1 Pith paper
-
PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.