Prompt Optimization for LLM Code Generation via Reinforcement Learning
Pith reviewed 2026-05-20 08:45 UTC · model grok-4.3
The pith
A reinforcement learning agent refines prompts to raise functional correctness in LLM code generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modeling prompt refinement as a reinforcement-learning task with Proximal Policy Optimization, a hybrid action space, and unit-test-derived rewards produces higher rates of functionally correct code from frozen LLMs on MBPP+, HumanEval+, and APPS.
What carries the argument
PPO agent that selects from a hybrid action space of direct generation, genetic lexical mutation, and semantic rewriting, guided by shaped rewards computed from unit-test feedback.
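To make the hybrid action space concrete, here is a minimal Python sketch of one refinement step. It is an illustration under stated assumptions, not the paper's implementation: the operator functions, the `ACTIONS` names, and the fixed logits standing in for a learned PPO policy are all hypothetical.

```python
import math
import random

# Hypothetical action names mirroring the three action types described above.
ACTIONS = ("direct_generation", "lexical_mutation", "semantic_rewrite")

def softmax(logits):
    """Convert raw policy scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def direct_generation(prompt, rng):
    """Leave the prompt unchanged and pass it to the code generator as is."""
    return prompt

def lexical_mutation(prompt, rng):
    """Toy genetic-style mutation: swap two randomly chosen words."""
    words = prompt.split()
    if len(words) < 2:
        return prompt
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def semantic_rewrite(prompt, rng):
    """Placeholder for an LLM-backed paraphrase; here it only appends a hint."""
    return prompt + " Handle edge cases explicitly."

OPERATORS = {
    "direct_generation": direct_generation,
    "lexical_mutation": lexical_mutation,
    "semantic_rewrite": semantic_rewrite,
}

def step(prompt, logits, rng):
    """Sample one action from the policy distribution and apply it."""
    probs = softmax(logits)
    action = rng.choices(ACTIONS, weights=probs, k=1)[0]
    return action, OPERATORS[action](prompt, rng)
```

In a trained agent the logits would come from the policy network conditioned on the current prompt and test feedback; the fixed list here only stands in for that output.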
If this is right
- Strict Pass@1 reaches 57.58 percent for CodeT5+, 64.80 percent for CodeLLaMA, and 85.50 percent for DeepSeek-Coder on the 500-task MBPP+ test set.
- Soft-Pass@1 reaches 67.90 percent, 73.10 percent, and 88.20 percent for the same three models on MBPP+.
- The method outperforms EPiC, Reflexion, and Random-Hybrid on MBPP+, HumanEval+, and APPS.
- Comparable accuracy lifts occur for all three backbone models on HumanEval+ and APPS.
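The two metrics in the list above can be sketched as follows. Strict Pass@1 is computed in the usual way (one sample per task, which must pass every test). The page does not define Soft-Pass@1, so the version below, the mean fraction of unit tests passed per task, is an assumption made only for illustration.

```python
def strict_pass_at_1(results):
    """Fraction of tasks whose single generated solution passes all tests.

    results: list with one inner list of booleans per task, one boolean per test.
    A task with no tests is counted as not passed.
    """
    passed = sum(1 for tests in results if tests and all(tests))
    return passed / len(results)

def soft_pass_at_1(results):
    """ASSUMED definition: mean fraction of unit tests passed per task."""
    fractions = [sum(tests) / len(tests) if tests else 0.0 for tests in results]
    return sum(fractions) / len(fractions)
```

Under this assumed definition the soft score is never below the strict score, which is consistent with the ordering of the figures reported above.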
Where Pith is reading between the lines
- The same reward-shaping approach could be tested on prompt optimization for non-code tasks such as mathematical reasoning.
- Replacing hand-designed mutation operators with learned ones might further reduce the need for domain-specific engineering.
- Combining the learned prompt policy with lightweight fine-tuning of the code generator could compound the observed gains.
Load-bearing premise
Unit-test feedback supplies a sufficiently dense and unbiased reward signal that allows the PPO agent to improve prompts without overfitting to the specific test suites.
What would settle it
The central claim would be falsified if the final prompts, evaluated on a fresh collection of programming problems whose unit tests were never seen during training, showed none of the reported Pass@1 gains.
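One way to set up such a check is a task-level split made before any prompt optimization runs. The sketch below is a hypothetical helper, not part of the paper's protocol; the function names, the 20 percent held-out fraction, and the seed are all assumptions.

```python
import random

def split_tasks(task_ids, held_out_fraction=0.2, seed=0):
    """Shuffle task ids deterministically and split them into train and held-out sets."""
    ids = sorted(task_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_held = max(1, round(len(ids) * held_out_fraction))
    return ids[n_held:], ids[:n_held]

def generalization_gap(train_pass_rate, held_out_pass_rate):
    """Positive values mean the prompts score better on tasks seen during optimization."""
    return train_pass_rate - held_out_pass_rate
```

The agent would receive rewards only from the training tasks, and the held-out tasks would be scored once at the end; a large gap would point to overfitting to the training test suites.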
Original abstract
Large Language Models (LLMs) can generate code from natural language, but their performance is highly sensitive to prompt formulation. We propose a reinforcement-learning-based framework that models prompt refinement as a sequential decision-making problem. A Proximal Policy Optimization (PPO) agent iteratively improves prompts using a hybrid action space that combines direct generation, genetic lexical mutation and semantic rewriting, guided by shaped rewards derived from unit-test feedback. We evaluate the framework on MBPP+, HumanEval+, and APPS using CodeT5+, CodeLLaMA, and DeepSeek-Coder as frozen code generators. On the 500-task MBPP+ test set, the PPO agent achieves strict Pass@1 scores of 57.58%, 64.80%, and 85.50%, respectively, outperforming EPiC, Reflexion, and Random-Hybrid. Soft-Pass@1 reaches 67.90%, 73.10%, and 88.20%, respectively. Similar improvements are observed on HumanEval+ and APPS across all backbone models. The results demonstrate that reinforcement learning with shaped test-driven rewards improves functional correctness in LLM-based code generation.
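For reference, the clipped surrogate objective from the PPO paper (Schulman et al., 2017) is shown below. How the authors map prompts and refinement actions onto states and actions is not given on this page, so reading s_t as the current prompt state and a_t as the chosen refinement action is an assumption.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
      \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here \hat{A}_t is the advantage estimate and \epsilon is the clipping range; the clip keeps each policy update close to the policy that collected the data.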
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a reinforcement learning approach that uses PPO to optimize prompts for LLM code generation. The agent acts in a hybrid action space of direct generation, genetic lexical mutation, and semantic rewriting, guided by rewards derived from unit tests. On the 500-task MBPP+ test set, it reports strict Pass@1 scores of 57.58%, 64.80%, and 85.50% for CodeT5+, CodeLLaMA, and DeepSeek-Coder, outperforming EPiC, Reflexion, and Random-Hybrid, with similar gains on HumanEval+ and APPS.
Significance. Should the results prove robust to proper validation splits and statistical analysis, this framework offers a promising method for automated prompt refinement in code generation, potentially reducing the need for manual prompt engineering. The integration of genetic and semantic actions with RL provides a novel combination that could inspire further work in test-driven prompt optimization.
major comments (3)
- [Section 4 (Experiments)] The experimental protocol does not specify the use of a held-out training set separate from the 500-task MBPP+ test set. Since rewards are derived from unit tests on these tasks, direct optimization risks overfitting to the test suites, which would invalidate the generalization claims for the learned prompt strategies.
- [§5 (Results and Discussion)] No information is provided on the variance across multiple runs, number of random seeds, or statistical significance of the reported Pass@1 improvements (e.g., 57.58% vs. baselines). This makes it hard to determine if the gains are reliable or due to chance.
- [§3.2 (Reward Shaping)] The exact formulation of the shaped rewards from unit-test feedback is not detailed, including any weighting parameters or how partial successes are handled. This is important for reproducibility and understanding the source of improvements.
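The second major comment asks for variance across seeds and a significance test. A minimal sketch of that reporting, using only the Python standard library, might look like the following. The function names are hypothetical, and the code returns only the paired t statistic and degrees of freedom; a p-value would still need to be looked up from a t distribution.

```python
import math
import statistics

def summarize(scores):
    """Return the mean and sample standard deviation of per-seed Pass@1 scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic for scores obtained with matching seeds.

    Returns (t, degrees_of_freedom). Raises ValueError when the inputs
    cannot support the test.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With only a handful of seeds the test has few degrees of freedom, so reporting the raw per-seed scores alongside the statistic would let readers judge the spread directly.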
minor comments (2)
- [Abstract] The abstract could briefly mention the training procedure details, such as the number of PPO iterations or the size of the prompt optimization dataset.
- [Table 1] Ensure all baseline methods are described consistently with their original papers for fair comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below in a point-by-point manner and indicate planned revisions to strengthen the work.
- Referee: [Section 4 (Experiments)] The experimental protocol does not specify the use of a held-out training set separate from the 500-task MBPP+ test set. Since rewards are derived from unit tests on these tasks, direct optimization risks overfitting to the test suites, which would invalidate the generalization claims for the learned prompt strategies.
  Authors: We appreciate the referee highlighting this important methodological point. The current protocol optimizes prompts directly on the MBPP+ tasks using unit-test rewards because the objective is to demonstrate automated refinement for standard benchmark performance. We acknowledge that this setup carries a risk of overfitting to the specific test suites and does not constitute a strict train/test separation for the prompt optimization process itself. To address the concern, we will revise Section 4 to explicitly describe the protocol, discuss the implications for generalization claims, and emphasize the supporting evidence from transfer to the separate HumanEval+ and APPS benchmarks. We will also note this as a limitation and outline how a held-out split could be incorporated in follow-up work.
  Revision: partial
- Referee: [§5 (Results and Discussion)] No information is provided on the variance across multiple runs, number of random seeds, or statistical significance of the reported Pass@1 improvements (e.g., 57.58% vs. baselines). This makes it hard to determine if the gains are reliable or due to chance.
  Authors: We agree that variance, random seeds, and statistical significance are necessary for assessing result reliability. The reported figures were obtained from single runs per configuration owing to the computational cost of PPO training. In the revision we will rerun the experiments with multiple random seeds (at least three), report means and standard deviations for Pass@1 scores, and include statistical tests (e.g., paired t-tests) comparing our method against baselines to establish significance.
  Revision: yes
- Referee: [§3.2 (Reward Shaping)] The exact formulation of the shaped rewards from unit-test feedback is not detailed, including any weighting parameters or how partial successes are handled. This is important for reproducibility and understanding the source of improvements.
  Authors: Thank you for this observation. We will expand §3.2 with the precise reward formulation, including the weighting scheme for passed, partially passed, and failed unit tests, the handling of partial credit (based on output similarity metrics), and all hyper-parameters used in the shaped reward function. This addition will directly improve reproducibility.
  Revision: yes
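The authors' description of the reward (weights for passed, partially passed, and failed tests, with partial credit from output similarity) could take a shape like the sketch below. Every specific here is an assumption, since the actual formulation is not given: the weight values, the use of `difflib.SequenceMatcher` as the similarity metric, and the averaging over tests are all illustrative.

```python
from difflib import SequenceMatcher

# ASSUMED weights, chosen only for illustration; the paper's values are not stated.
W_PASS = 1.0      # reward for an exact match with the expected output
W_PARTIAL = 0.5   # ceiling for partial credit on a wrong but similar output
W_FAIL = -0.1     # penalty when the program crashes or produces no output

def per_test_reward(expected, actual):
    """Reward for one unit test; `actual` is None when execution failed."""
    if actual is None:
        return W_FAIL
    if actual == expected:
        return W_PASS
    similarity = SequenceMatcher(None, str(expected), str(actual)).ratio()
    return W_PARTIAL * similarity

def shaped_reward(outcomes):
    """Average per-test reward over (expected, actual) pairs for one task."""
    if not outcomes:
        return W_FAIL
    return sum(per_test_reward(e, a) for e, a in outcomes) / len(outcomes)
```

Keeping the partial-credit ceiling below the pass reward means a fully correct output always scores higher than any near miss, which is one simple way to stop the agent from settling for outputs that only look similar.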
Circularity Check
No circularity detected; results are direct empirical measurements
The paper describes an RL-based prompt optimization method using PPO with hybrid actions and unit-test-derived rewards, then reports Pass@1 scores on fixed external benchmarks (MBPP+, HumanEval+, APPS). No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would reduce the central performance claims to the inputs by construction. The evaluation relies on direct measurement against standard test suites rather than any internal derivation that collapses into its own assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- reward shaping weights
axioms (1)
- domain assumption: Unit tests provide a reliable proxy for functional correctness that can guide sequential prompt refinement.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: A Proximal Policy Optimization (PPO) agent iteratively improves prompts using a hybrid action space that combines direct generation, genetic lexical mutation and semantic rewriting, guided by shaped rewards derived from unit-test feedback.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: On the 500-task MBPP+ test set, the PPO agent achieves strict Pass@1 scores of 57.58%, 64.80%, and 85.50%
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.