arxiv: 2604.10508 · v1 · submitted 2026-04-12 · 💻 cs.SE · cs.AI


How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

Johin Johny Arimbur


Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: self-repair, LLM code generation, iterative prompting, HumanEval, MBPP, model scaling, error feedback, code benchmarks


The pith

Iterative self-repair by feeding execution errors back to large language models raises code generation pass rates across model scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether large language models can correct their own code by using feedback from execution errors in repeated attempts. It evaluates this process on seven models from 8B to larger scales on the HumanEval and MBPP benchmarks, allowing up to five tries per problem. Results show that self-repair increases the share of problems solved correctly for all tested models, with larger gains on MBPP and most progress happening in the first two rounds. Modern instruction-tuned models achieve this through standard prompting alone, even at smaller sizes, and perform better on syntax and name errors than on logical mistakes.

Core claim

The central claim is that feeding execution error messages back to the model for correction increases pass rates on HumanEval by 4.9 to 17.1 percentage points and on MBPP by 16.0 to 30.0 percentage points. This holds across dense and mixture-of-experts architectures and for instruction-tuned models down to 8B parameters. Most of the benefit occurs in the first two repair rounds. Assertion errors remain the hardest to fix, at roughly 45 percent success, and adding chain-of-thought to the repair prompt yields up to 5.5 points more gain for stronger models.

What carries the argument

Iterative self-repair, the repeated process of passing execution error messages to the model to generate corrected code versions.
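The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's harness: `generate` is a hypothetical callable standing in for any model API, `run_candidate` and `stub_model` are invented helpers, and the repair prompt wording is an assumption, since the page does not quote the paper's templates.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_candidate(code: str, tests: str) -> Optional[str]:
    """Run candidate code and its assert-based tests.

    Returns None on success, otherwise the last traceback line
    (for example "NameError: name 'x' is not defined").
    A real harness would sandbox this and add a timeout.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate function(s)
        exec(tests, namespace)  # run the tests against them
        return None
    except Exception:
        return traceback.format_exc().strip().splitlines()[-1]


def self_repair(task: str, tests: str, generate: Callable[[str], str],
                max_attempts: int = 5) -> Tuple[bool, int]:
    """Return (solved, attempts_used) for one problem."""
    prompt = task
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt)
        error = run_candidate(code, tests)
        if error is None:
            return True, attempt
        # Hypothetical repair prompt: prior code plus the error message.
        prompt = (f"{task}\n\nYour previous attempt:\n{code}\n\n"
                  f"It failed with:\n{error}\n\nReturn a corrected version.")
    return False, max_attempts


def stub_model(prompt: str) -> str:
    # Toy stand-in for a model: wrong at first, right once it sees an error.
    if "It failed with" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


solved, attempts = self_repair("Write add(a, b).",
                               "assert add(2, 3) == 5", stub_model)
print(solved, attempts)
```

Returning only the last traceback line is one possible feedback format. How much of the traceback is passed back is a design choice that could plausibly affect repair rates.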

Load-bearing premise

That execution error messages are clear and detailed enough for the models to interpret and produce accurate fixes.

What would settle it

Running the same models and benchmarks with error messages supplied but measuring no net rise in solved problems after five attempts.

Figures

Figures reproduced from arXiv: 2604.10508 by Johin Johny Arimbur.

**Figure 3.** Distribution of error types at the initial attempt (… [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 2.** Cross-benchmark comparison of self-repair gains for all seven models. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

**Figure 5.** Marginal improvement in pass@1 per repair round on HumanEval. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png]

**Figure 6.** Prompt strategy ablation on HumanEval. CoT consistently achieves … [PITH_FULL_IMAGE:figures/full_fig_p007_6.png]

**Figure 7.** Self-repair vs. independent resampling on HumanEval. Self-repair … [PITH_FULL_IMAGE:figures/full_fig_p008_7.png]

Original abstract

Large language models frequently fail to produce correct code on their first attempt, yet most benchmarks evaluate them in a single-shot setting. We investigate iterative self-repair (feeding execution errors back to the model for correction) across seven models spanning three families and both open-weight and proprietary providers: Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout (MoE, 16 experts), Llama 4 Maverick (MoE, 128 experts), Qwen3 32B, Gemini 2.5 Flash, and Gemini 2.5 Pro. On HumanEval (164 problems) and MBPP Sanitized (257 problems) with up to five attempts, self-repair universally improves pass rates: +4.9 to +17.1 pp on HumanEval and +16.0 to +30.0 pp on MBPP. Gemini 2.5 Flash achieves the highest final pass rates (96.3% HumanEval, 93.8% MBPP). Most gains concentrate in the first two rounds. Error-type analysis shows assertion errors (logical mistakes) are the hardest to repair at ~45%, while syntax and name errors are repaired at substantially higher rates, connecting to broader findings on the limits of LLM self-correction. Prior work found that weaker models fail at self-repair or require fine-tuning; we show that modern instruction-tuned models succeed with prompting alone, even at 8B scale. We also provide the first comparison of dense and MoE architectures for self-repair, and extend the repair-vs-resampling tradeoff analysis to modern models. A prompt ablation reveals chain-of-thought repair yields up to +5.5 pp additional self-repair gain (measured as improvement in repair delta) over minimal prompting for capable models.
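The repair-vs-resampling comparison mentioned in the abstract needs a score for the resampling side. A common choice is the unbiased pass@k estimator introduced alongside HumanEval. Whether the paper uses exactly this estimator is not stated on this page, so the sketch below is an assumption; the function name `pass_at_k` and the toy numbers are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem.

    n: samples generated, c: samples that passed, k: budget.
    Estimates the probability that at least one of k samples drawn
    without replacement from the n generated samples is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy numbers (not from the paper): 10 samples, 3 of them correct.
single = pass_at_k(10, 3, 1)
budget_five = pass_at_k(10, 3, 5)
print(single, budget_five)
```

Under a five-call budget, averaging `pass_at_k(n, c, 5)` over problems gives a resampling figure that can be set against the share of problems solved within five repair attempts.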

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper empirically investigates iterative self-repair in LLM code generation, where execution errors are fed back to the model for correction. Across seven models (Llama 3.1 8B, Llama 3.3 70B, Llama 4 variants, Qwen3 32B, Gemini 2.5 Flash/Pro) on HumanEval (164 problems) and MBPP Sanitized (257 problems) with up to five attempts, it reports consistent pass-rate gains of +4.9 to +17.1 pp on HumanEval and +16.0 to +30.0 pp on MBPP. Most gains occur in the first two rounds; assertion errors are hardest to repair (~45% success) while syntax/name errors are easier. The work claims modern instruction-tuned models succeed at self-repair via prompting alone even at 8B scale (contrasting prior work), provides the first dense-vs-MoE comparison for this task, extends repair-vs-resampling analysis, and shows chain-of-thought repair adds up to +5.5 pp over minimal prompts.

Significance. If the results hold under full reproducibility details, the paper makes a useful empirical contribution by showing self-repair is broadly effective for current instruction-tuned models without fine-tuning, including smaller scales and MoE architectures. The error-type breakdown links to known LLM limitations in logical reasoning, and the architecture and prompt comparisons add concrete data points for practitioners building coding agents. The scale of the evaluation (multiple families, two benchmarks, iteration tracking) strengthens its value as a reference study.

major comments (2)

[§3] §3 (Methodology / Prompt Construction): The exact base repair prompt template and the precise format of execution error feedback (full traceback with test inputs/assertions vs. stripped message) are not quoted or exemplified. This is load-bearing for the central claim that 'modern instruction-tuned models succeed with prompting alone' (including the 8B-scale result), because the prompt ablation already demonstrates +5.5 pp sensitivity to CoT scaffolding; without the exact strings used in the main condition, it is impossible to rule out unstated informativeness or engineering advantages that could explain the contrast with prior work on weaker models.
[§4] §4 (Results): The concrete percentage-point gains are reported without random seeds, number of independent runs per problem, or statistical significance measures (confidence intervals or p-values). Given fixed problem counts (164/257) and the universality claim, this omission weakens confidence that the +4.9–17.1 pp HumanEval range reflects stable self-repair effects rather than sampling variability; adding these details would directly support the reported deltas and error-type repair rates.

minor comments (2)

[Abstract] Abstract: The gain ranges (+4.9 to +17.1 pp, +16.0 to +30.0 pp) are useful but would be clearer if the text or a table explicitly mapped the lower/upper bounds to specific models.
Consider moving the full prompt templates and one or two example error-message strings to an appendix; this is a minor reproducibility aid rather than a core methodological gap.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. The comments highlight important aspects of reproducibility and statistical rigor that we address below.

Point-by-point responses

Referee: [§3] §3 (Methodology / Prompt Construction): The exact base repair prompt template and the precise format of execution error feedback (full traceback with test inputs/assertions vs. stripped message) are not quoted or exemplified. This is load-bearing for the central claim that 'modern instruction-tuned models succeed with prompting alone' (including the 8B-scale result), because the prompt ablation already demonstrates +5.5 pp sensitivity to CoT scaffolding; without the exact strings used in the main condition, it is impossible to rule out unstated informativeness or engineering advantages that could explain the contrast with prior work on weaker models.

Authors: We agree that full transparency on the prompt and error feedback format is essential to support the claim that modern instruction-tuned models succeed at self-repair via prompting alone. The manuscript currently describes the prompting approach at a high level but does not include the verbatim templates. In the revised version we will add the complete base repair prompt (including the exact error feedback formatting) to the Methodology section and provide illustrative examples of full versus stripped error messages. This will enable direct comparison with prior work and allow readers to evaluate the degree of informativeness in the feedback.

revision: yes

Referee: [§4] §4 (Results): The concrete percentage-point gains are reported without random seeds, number of independent runs per problem, or statistical significance measures (confidence intervals or p-values). Given fixed problem counts (164/257) and the universality claim, this omission weakens confidence that the +4.9–17.1 pp HumanEval range reflects stable self-repair effects rather than sampling variability; adding these details would directly support the reported deltas and error-type repair rates.

Authors: We acknowledge that reporting experimental variability strengthens confidence in the results. Our evaluation used a single deterministic run per problem (temperature 0 where supported by the model API, or the lowest supported temperature otherwise) with fixed seeds for any stochastic components. We will revise §4 and the experimental setup to explicitly state the number of runs (one per problem), the decoding parameters, and any seeds used. We will also add a brief discussion of potential variability and, where computationally feasible, include bootstrap-derived confidence intervals around the pass-rate deltas. This directly addresses concerns about sampling variability while preserving the existing experimental design.

revision: partial
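The bootstrap-derived confidence intervals mentioned in the response could be computed roughly as follows. This is a sketch of a paired percentile bootstrap over problems, assuming one pass/fail outcome per problem before and after repair; the function name `bootstrap_delta_ci`, its parameters, and the synthetic data are illustrative and do not come from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_delta_ci(first_pass: List[bool], final_pass: List[bool],
                       n_resamples: int = 10_000, alpha: float = 0.05,
                       seed: int = 0) -> Tuple[float, float, float]:
    """Return (delta, lower, upper) in percentage points.

    delta is the final pass rate minus the first-attempt pass rate.
    Problems are resampled with replacement, keeping each problem's
    before/after outcomes paired.
    """
    if len(first_pass) != len(final_pass):
        raise ValueError("outcome lists must have equal length")
    n = len(first_pass)
    rng = random.Random(seed)
    # Per-problem change: +1 repaired, 0 unchanged, -1 regressed.
    diffs = [int(after) - int(before)
             for before, after in zip(first_pass, final_pass)]
    point = 100.0 * sum(diffs) / n
    deltas = sorted(100.0 * sum(rng.choices(diffs, k=n)) / n
                    for _ in range(n_resamples))
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper


# Synthetic example (not the paper's data): 164 problems, 130 solved on
# the first attempt and 150 solved after repair.
first = [True] * 130 + [False] * 34
final = [True] * 150 + [False] * 14
point, lo, hi = bootstrap_delta_ci(first, final)
print(f"delta = {point:.1f} pp, 95% CI [{lo:.1f}, {hi:.1f}]")
```

With a single deterministic run per problem, an interval like this reflects uncertainty from the finite problem set only; it says nothing about run-to-run decoding variability.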

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper reports direct experimental measurements of pass rates under iterative self-repair on fixed benchmarks (HumanEval, MBPP) across seven models, with error-type breakdowns and prompt ablations. No equations, fitted parameters, derivations, or first-principles claims appear; all results are observational counts of success/failure after feeding execution feedback. Prior-work citations serve only as contrast, not as load-bearing premises that reduce the present findings to self-citation. The central claims (gains of +4.9–30 pp, success at 8B scale with prompting alone) are therefore independent of any definitional or fitted-input loop and remain falsifiable by replication on the same benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work consists of controlled empirical trials on public benchmarks rather than theoretical modeling.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
cs.CL · 2026-05 · unverdicted · novelty 5.0

An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.
