arxiv: 2605.19102 · v1 · pith:OUEP6ODUnew · submitted 2026-05-18 · 💻 cs.SE

Prompt Optimization for LLM Code Generation via Reinforcement Learning

Pith reviewed 2026-05-20 08:45 UTC · model grok-4.3

classification 💻 cs.SE
keywords prompt optimization · reinforcement learning · LLM code generation · PPO · unit-test rewards · functional correctness · MBPP+ · HumanEval+

The pith

A reinforcement learning agent refines prompts to raise functional correctness in LLM code generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames prompt improvement as a sequential decision process solved by a PPO agent. The agent chooses among direct generation, genetic lexical mutations, and semantic rewrites while receiving shaped rewards from unit-test execution. Experiments across three code-generation models and three benchmarks show higher strict and soft Pass@1 scores than prior prompt-optimization baselines. The gains appear without retraining the underlying LLMs.

Core claim

Modeling prompt refinement as a reinforcement-learning task with Proximal Policy Optimization, a hybrid action space, and unit-test-derived rewards produces higher rates of functionally correct code from frozen LLMs on MBPP+, HumanEval+, and APPS.

What carries the argument

PPO agent that selects from a hybrid action space of direct generation, genetic lexical mutation, and semantic rewriting, guided by shaped rewards computed from unit-test feedback.
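
The sequential decision process described above can be sketched as a small episode loop. This is an illustration only, not the authors' implementation: the `Action` enum, `run_episode`, the two toy prompt operators, and the assumption that direct generation ends an episode are all hypothetical. The `policy` argument stands in for the trained PPO agent, and `score_prompt` stands in for running the frozen code generator and its unit tests.

```python
import random
from enum import Enum
from typing import Callable, List, Tuple


class Action(Enum):
    DIRECT_GENERATION = 0  # generate code from the current prompt as-is
    LEXICAL_MUTATION = 1   # genetic-style word-level edit of the prompt
    SEMANTIC_REWRITE = 2   # meaning-level rewrite of the prompt


def lexical_mutation(prompt: str, rng: random.Random) -> str:
    """Toy stand-in for a genetic operator: swap two adjacent words."""
    words = prompt.split()
    if len(words) < 2:
        return prompt
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def semantic_rewrite(prompt: str) -> str:
    """Toy stand-in: a real system would likely call a rewriting model."""
    return prompt + " Return the result; do not print it."


def run_episode(
    prompt: str,
    policy: Callable[[str], Action],
    score_prompt: Callable[[str], float],
    max_steps: int = 5,
    seed: int = 0,
) -> List[Tuple[Action, str, float]]:
    """Refine a prompt step by step and record (action, prompt, reward)."""
    rng = random.Random(seed)
    trajectory = []
    for _ in range(max_steps):
        action = policy(prompt)
        if action is Action.LEXICAL_MUTATION:
            prompt = lexical_mutation(prompt, rng)
        elif action is Action.SEMANTIC_REWRITE:
            prompt = semantic_rewrite(prompt)
        reward = score_prompt(prompt)  # assumed to wrap unit-test execution
        trajectory.append((action, prompt, reward))
        if action is Action.DIRECT_GENERATION:
            break  # assumption: direct generation terminates the episode
    return trajectory
```

In a full system the recorded trajectory would feed a PPO update; the sketch stops at data collection.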

If this is right

  • Strict Pass@1 reaches 57.58 percent for CodeT5+, 64.80 percent for CodeLLaMA, and 85.50 percent for DeepSeek-Coder on the 500-task MBPP+ test set.
  • Soft-Pass@1 reaches 67.90 percent, 73.10 percent, and 88.20 percent for the same three models on MBPP+.
  • The method outperforms EPiC, Reflexion, and Random-Hybrid on MBPP+, HumanEval+, and APPS.
  • Comparable accuracy lifts occur for all three backbone models on HumanEval+ and APPS.
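
The page does not define how strict and soft Pass@1 differ. A minimal sketch under one common reading: strict counts a task only when every unit test passes, while soft averages the fraction of tests passed per task. Both function names and the toy results are hypothetical, and the paper's actual definitions may differ.

```python
from typing import Dict, List


def strict_pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Percentage of tasks whose single sample passes every unit test."""
    solved = sum(1 for outcomes in results.values() if all(outcomes))
    return 100.0 * solved / len(results)


def soft_pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean per-task fraction of passed tests, as a percentage.

    Assumes every task has at least one unit test.
    """
    fractions = [sum(outcomes) / len(outcomes) for outcomes in results.values()]
    return 100.0 * sum(fractions) / len(fractions)


# Hypothetical outcomes for four tasks (True = test passed).
toy_results = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False],
    "task_d": [True, True],
}

print(strict_pass_at_1(toy_results))          # 50.0
print(round(soft_pass_at_1(toy_results), 2))  # 66.67
```

Under this reading the soft score can never fall below the strict score, which matches the ordering of the reported numbers.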

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward-shaping approach could be tested on prompt optimization for non-code tasks such as mathematical reasoning.
  • Replacing hand-designed mutation operators with learned ones might further reduce the need for domain-specific engineering.
  • Combining the learned prompt policy with lightweight fine-tuning of the code generator could compound the observed gains.

Load-bearing premise

Unit-test feedback supplies a sufficiently dense and unbiased reward signal that allows the PPO agent to improve prompts without overfitting to the specific test suites.

What would settle it

Evaluating the final prompts on a fresh collection of programming problems whose unit tests were never seen during training and finding that the reported Pass@1 gains vanish would falsify the central claim.
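
One way to set up such a check is a deterministic task-level split, so that tasks used for reward computation never overlap with tasks used for the final evaluation. The sketch below is an assumption about how this could be done, not the paper's protocol; `assign_split`, `split_tasks`, the salt string, the 20 percent fraction, and the task identifiers are all hypothetical.

```python
import hashlib
from typing import Iterable, List, Tuple


def assign_split(task_id: str, heldout_fraction: float = 0.2,
                 salt: str = "split-v1") -> str:
    """Map a task id to 'train' or 'heldout' using a stable hash."""
    digest = hashlib.sha256(f"{salt}:{task_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "heldout" if bucket < heldout_fraction else "train"


def split_tasks(task_ids: Iterable[str],
                heldout_fraction: float = 0.2) -> Tuple[List[str], List[str]]:
    """Partition task ids into disjoint train and held-out lists."""
    train, heldout = [], []
    for task_id in task_ids:
        if assign_split(task_id, heldout_fraction) == "heldout":
            heldout.append(task_id)
        else:
            train.append(task_id)
    return train, heldout


# Hypothetical identifiers; real benchmark ids would be used instead.
task_ids = [f"task/{i}" for i in range(500)]
train_ids, heldout_ids = split_tasks(task_ids)
```

Hashing the identifier rather than shuffling keeps the assignment stable across runs and machines, so a task cannot drift from the held-out set into training when the task list changes.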

Figures

Figures reproduced from arXiv: 2605.19102 by Ali Mohammadi Esfahani, Nafiseh Kahani, Samuel A. Ajila.

Figure 1: Workflow of the RL-based prompt optimization framework
Figure 2: (Left) PPO-based training loop; (Right) Evaluation protocol. Both algo…
Figure 3: Illustrative progression from an ambiguous benchmark-style prompt to…

Large Language Models (LLMs) can generate code from natural language, but their performance is highly sensitive to prompt formulation. We propose a reinforcement-learning-based framework that models prompt refinement as a sequential decision-making problem. A Proximal Policy Optimization (PPO) agent iteratively improves prompts using a hybrid action space that combines direct generation, genetic lexical mutation and semantic rewriting, guided by shaped rewards derived from unit-test feedback. We evaluate the framework on MBPP+, HumanEval+, and APPS using CodeT5+, CodeLLaMA, and DeepSeek-Coder as frozen code generators. On the 500-task MBPP+ test set, the PPO agent achieves strict Pass@1 scores of 57.58%, 64.80%, and 85.50%, respectively, outperforming EPiC, Reflexion, and Random-Hybrid. Soft-Pass@1 reaches 67.90%, 73.10%, and 88.20%, respectively. Similar improvements are observed on HumanEval+ and APPS across all backbone models. The results demonstrate that reinforcement learning with shaped test-driven rewards improves functional correctness in LLM-based code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents a reinforcement learning approach that uses PPO to optimize prompts for LLM code generation. It uses a hybrid action space of direct generation, lexical mutation, and semantic rewriting, guided by unit-test-based rewards. On the 500-task MBPP+ test set, it reports strict Pass@1 scores of 57.58%, 64.80%, and 85.50% for CodeT5+, CodeLLaMA, and DeepSeek-Coder, outperforming EPiC, Reflexion, and Random-Hybrid, with similar gains on HumanEval+ and APPS.

Significance. Should the results prove robust to proper validation splits and statistical analysis, this framework offers a promising method for automated prompt refinement in code generation, potentially reducing the need for manual prompt engineering. The integration of genetic and semantic actions with RL provides a novel combination that could inspire further work in test-driven prompt optimization.

major comments (3)
  1. [Section 4 (Experiments)] The experimental protocol does not specify the use of a held-out training set separate from the 500-task MBPP+ test set. Since rewards are derived from unit tests on these tasks, direct optimization risks overfitting to the test suites, which would invalidate the generalization claims for the learned prompt strategies.
  2. [§5 (Results and Discussion)] No information is provided on the variance across multiple runs, number of random seeds, or statistical significance of the reported Pass@1 improvements (e.g., 57.58% vs. baselines). This makes it hard to determine if the gains are reliable or due to chance.
  3. [§3.2 (Reward Shaping)] The exact formulation of the shaped rewards from unit-test feedback is not detailed, including any weighting parameters or how partial successes are handled. This is important for reproducibility and understanding the source of improvements.
minor comments (2)
  1. [Abstract] The abstract could briefly mention the training procedure details, such as the number of PPO iterations or the size of the prompt optimization dataset.
  2. [Table 1] Ensure all baseline methods are described consistently with their original papers for fair comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below in a point-by-point manner and indicate planned revisions to strengthen the work.

  1. Referee: [Section 4 (Experiments)] The experimental protocol does not specify the use of a held-out training set separate from the 500-task MBPP+ test set. Since rewards are derived from unit tests on these tasks, direct optimization risks overfitting to the test suites, which would invalidate the generalization claims for the learned prompt strategies.

    Authors: We appreciate the referee highlighting this important methodological point. The current protocol optimizes prompts directly on the MBPP+ tasks using unit-test rewards because the objective is to demonstrate automated refinement for standard benchmark performance. We acknowledge that this setup carries a risk of overfitting to the specific test suites and does not constitute a strict train/test separation for the prompt optimization process itself. To address the concern, we will revise Section 4 to explicitly describe the protocol, discuss the implications for generalization claims, and emphasize the supporting evidence from transfer to the separate HumanEval+ and APPS benchmarks. We will also note this as a limitation and outline how a held-out split could be incorporated in follow-up work.

    revision: partial

  2. Referee: [§5 (Results and Discussion)] No information is provided on the variance across multiple runs, number of random seeds, or statistical significance of the reported Pass@1 improvements (e.g., 57.58% vs. baselines). This makes it hard to determine if the gains are reliable or due to chance.

    Authors: We agree that variance, random seeds, and statistical significance are necessary for assessing result reliability. The reported figures were obtained from single runs per configuration owing to the computational cost of PPO training. In the revision we will rerun the experiments with multiple random seeds (at least three), report means and standard deviations for Pass@1 scores, and include statistical tests (e.g., paired t-tests) comparing our method against baselines to establish significance.

    revision: yes

  3. Referee: [§3.2 (Reward Shaping)] The exact formulation of the shaped rewards from unit-test feedback is not detailed, including any weighting parameters or how partial successes are handled. This is important for reproducibility and understanding the source of improvements.

    Authors: Thank you for this observation. We will expand §3.2 with the precise reward formulation, including the weighting scheme for passed, partially passed, and failed unit tests, the handling of partial credit (based on output similarity metrics), and all hyper-parameters used in the shaped reward function. This addition will directly improve reproducibility.

    revision: yes
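
The seed-level reporting promised in response 2 (means, standard deviations, and a paired t-test) could be computed along the following lines. This is a sketch with invented numbers: the scores below are hypothetical and are not results from the paper, and the function names are illustrative. It uses only the Python standard library and stops at the t statistic.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs.

    a[i] and b[i] must come from the same seed and configuration.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical Pass@1 scores over three seeds (NOT from the paper).
method_scores = [60.0, 62.0, 61.0]
baseline_scores = [55.0, 58.0, 55.0]

mean, std = summarize(method_scores)
t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
print(f"{mean:.2f} +/- {std:.2f}, t = {t_stat:.3f}, df = {dof}")
```

A p-value would come from comparing the statistic against a t distribution with the returned degrees of freedom, for example with `scipy.stats.ttest_rel`. With only three seeds the test has two degrees of freedom, so it has little power to detect small differences.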

Circularity Check

0 steps flagged

No circularity detected; results are direct empirical measurements

The paper describes an RL-based prompt optimization method using PPO with hybrid actions and unit-test-derived rewards, then reports Pass@1 scores on fixed external benchmarks (MBPP+, HumanEval+, APPS). No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would reduce the central performance claims to the inputs by construction. The evaluation relies on direct measurement against standard test suites rather than any internal derivation that collapses into its own assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on the assumption that unit-test outcomes can be turned into effective shaped rewards and that the hybrid action space is expressive enough for prompt refinement; no new physical entities are introduced.

free parameters (1)
  • reward shaping weights
    Shaped rewards derived from unit-test feedback require choices about how to combine pass/fail signals, partial credit, and possibly length or diversity terms.
axioms (1)
  • domain assumption: Unit tests provide a reliable proxy for functional correctness that can guide sequential prompt refinement.
    The entire reward signal and therefore the learning process rests on this premise.
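
To make the "reward shaping weights" entry concrete, here is one possible shape for such a reward. It is a guess at the general form, not the paper's formulation, which the referee report notes is not specified: the `RewardWeights` fields, their default values, and the length penalty are all hypothetical. Partial credit here is the fraction of passed tests, whereas the rebuttal mentions output similarity metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # All three values are illustrative free parameters.
    full_pass_bonus: float = 1.0   # added only when every test passes
    partial_credit: float = 0.5    # scales the fraction of passed tests
    length_penalty: float = 0.001  # per-token cost on prompt length


def shaped_reward(passed: int, total: int, prompt_tokens: int,
                  weights: RewardWeights = RewardWeights()) -> float:
    """Combine pass/fail, partial credit, and a length term into one scalar."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need total > 0 and 0 <= passed <= total")
    reward = weights.partial_credit * (passed / total)
    if passed == total:
        reward += weights.full_pass_bonus
    reward -= weights.length_penalty * prompt_tokens
    return reward


print(round(shaped_reward(passed=3, total=4, prompt_tokens=50), 3))  # 0.325
print(round(shaped_reward(passed=4, total=4, prompt_tokens=50), 3))  # 1.45
```

The gap between the two printed values shows why the weights matter: the bonus for a full pass dominates the partial-credit term, so different weightings would steer the agent toward different trade-offs.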

pith-pipeline@v0.9.0 · 5736 in / 1242 out tokens · 43038 ms · 2026-05-20T08:45:33.544899+00:00 · methodology


