Prompt Optimization for LLM Code Generation via Reinforcement Learning

Ali Mohammadi Esfahani; Nafiseh Kahani; Samuel A. Ajila

arxiv: 2605.19102 · v1 · pith:OUEP6ODUnew · submitted 2026-05-18 · 💻 cs.SE

Pith reviewed 2026-05-20 08:45 UTC · model grok-4.3

classification 💻 cs.SE

keywords prompt optimization · reinforcement learning · LLM code generation · PPO · unit-test rewards · functional correctness · MBPP+ · HumanEval+

The pith

A reinforcement learning agent refines prompts to raise functional correctness in LLM code generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames prompt improvement as a sequential decision process solved by a PPO agent. The agent chooses among direct generation, genetic lexical mutations, and semantic rewrites while receiving shaped rewards from unit-test execution. Experiments across three code-generation models and three benchmarks show higher strict and soft Pass@1 scores than prior prompt-optimization baselines. The gains appear without retraining the underlying LLMs.

Core claim

Modeling prompt refinement as a reinforcement-learning task with Proximal Policy Optimization, a hybrid action space, and unit-test-derived rewards produces higher rates of functionally correct code from frozen LLMs on MBPP+, HumanEval+, and APPS.
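For reference, the standard clipped surrogate objective of PPO (Schulman et al., 2017) is reproduced below. The summary above does not say which PPO variant or hyperparameters the paper uses, so this is the general formulation rather than the authors' configuration. In this setting, $s_t$ would be the current prompt state, $a_t$ the chosen refinement action, and $\hat{A}_t$ an advantage estimate built from the unit-test reward; that mapping is an assumption made here for illustration.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\left[
      \min\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t
      \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

The clip range $\epsilon$ limits how far a single update can move the policy away from the one that collected the data.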

What carries the argument

PPO agent that selects from a hybrid action space of direct generation, genetic lexical mutation, and semantic rewriting, guided by shaped rewards computed from unit-test feedback.
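A minimal sketch of how such a hybrid action space could be wired into a refinement loop is shown below. Every name in it (`Action`, `lexical_mutation`, `semantic_rewrite`, `refine`, `CLARIFIER`) is hypothetical, the two operators are toy placeholders, and treating direct generation as a "commit and stop" action is an assumption, since the summary does not describe the paper's actual operators or termination rule. The `policy` argument stands in for the trained PPO policy.

```python
import random
from enum import Enum
from typing import Callable, Tuple


class Action(Enum):
    DIRECT_GENERATION = 0  # assumption: commit to the current prompt and stop
    LEXICAL_MUTATION = 1   # genetic-style word-level edit
    SEMANTIC_REWRITE = 2   # meaning-level rewrite of the task description


def lexical_mutation(prompt: str, rng: random.Random) -> str:
    """Toy operator: swap two adjacent words (placeholder for a real mutation)."""
    words = prompt.split()
    if len(words) < 2:
        return prompt
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


CLARIFIER = " Handle empty input explicitly."


def semantic_rewrite(prompt: str) -> str:
    """Toy operator: append one clarifying sentence (placeholder for an LLM rewriter)."""
    return prompt if prompt.endswith(CLARIFIER) else prompt + CLARIFIER


def refine(
    prompt: str,
    policy: Callable[[str], Action],
    reward_fn: Callable[[str], float],
    max_steps: int = 5,
    seed: int = 0,
) -> Tuple[str, float]:
    """Apply policy-chosen actions and keep the best-scoring prompt seen so far."""
    rng = random.Random(seed)
    best_prompt, best_reward = prompt, reward_fn(prompt)
    for _ in range(max_steps):
        action = policy(prompt)
        if action is Action.DIRECT_GENERATION:
            break  # stop editing; the current prompt is used as-is
        if action is Action.LEXICAL_MUTATION:
            prompt = lexical_mutation(prompt, rng)
        else:
            prompt = semantic_rewrite(prompt)
        reward = reward_fn(prompt)
        if reward > best_reward:
            best_prompt, best_reward = prompt, reward
    return best_prompt, best_reward
```

In a real system `reward_fn` would generate code with the frozen LLM and run the unit tests; here it can be any function from a prompt string to a float.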

If this is right

Strict Pass@1 reaches 57.58 percent for CodeT5+, 64.80 percent for CodeLLaMA, and 85.50 percent for DeepSeek-Coder on the 500-task MBPP+ test set.
Soft-Pass@1 reaches 67.90 percent, 73.10 percent, and 88.20 percent for the same three models on MBPP+.
The method outperforms EPiC, Reflexion, and Random-Hybrid on MBPP+, HumanEval+, and APPS.
Comparable accuracy lifts occur for all three backbone models on HumanEval+ and APPS.
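The difference between the two metrics can be illustrated with a short sketch. It assumes that strict Pass@1 counts a task as solved only when its single generated sample passes every unit test, and that soft Pass@1 averages the per-task fraction of tests passed; the summary does not give the paper's exact definition of the soft variant, so that part is a guess. The function name `pass_at_1` and the example data are hypothetical.

```python
from typing import List, Tuple


def pass_at_1(results: List[List[bool]]) -> Tuple[float, float]:
    """Return (strict, soft) Pass@1 in percent.

    results[i][j] is True if the one generated sample for task i passed test j.
    """
    if not results or any(len(tests) == 0 for tests in results):
        raise ValueError("every task needs at least one unit test")
    n = len(results)
    # Strict: a task counts only if all of its tests pass.
    strict = sum(all(tests) for tests in results) / n
    # Soft (assumed definition): mean fraction of tests passed per task.
    soft = sum(sum(tests) / len(tests) for tests in results) / n
    return 100.0 * strict, 100.0 * soft


# Illustrative data only, not taken from the paper.
example = [
    [True, True],
    [True, False],
    [False, False],
    [True, True, True, False],
]
strict_score, soft_score = pass_at_1(example)
print(strict_score, soft_score)  # 25.0 56.25
```

Under these definitions the soft score can never be lower than the strict one, which matches the ordering of the reported figures.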

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reward-shaping approach could be tested on prompt optimization for non-code tasks such as mathematical reasoning.
Replacing hand-designed mutation operators with learned ones might further reduce the need for domain-specific engineering.
Combining the learned prompt policy with lightweight fine-tuning of the code generator could compound the observed gains.

Load-bearing premise

Unit-test feedback supplies a sufficiently dense and unbiased reward signal that allows the PPO agent to improve prompts without overfitting to the specific test suites.

What would settle it

Evaluating the final prompts on a fresh collection of programming problems whose unit tests were never seen during training and finding that the reported Pass@1 gains vanish would falsify the central claim.
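One way to set up such a check is to split tasks by identifier before any optimization, so that no unit test from the held-out portion ever contributes to the reward. The sketch below is an illustration under that assumption; `split_tasks`, the 20 percent fraction, and the task-id format are hypothetical and do not come from the paper.

```python
import random
from typing import List, Sequence, Tuple


def split_tasks(
    task_ids: Sequence[str],
    held_out_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Deterministically split task ids into (train, held_out) with no overlap."""
    if not 0.0 < held_out_fraction < 1.0:
        raise ValueError("held_out_fraction must be strictly between 0 and 1")
    ids = sorted(set(task_ids))          # sort so the split does not depend on input order
    random.Random(seed).shuffle(ids)     # seeded shuffle keeps the split reproducible
    n_held = max(1, round(len(ids) * held_out_fraction))
    return ids[n_held:], ids[:n_held]


# Hypothetical ids for a 500-task benchmark.
all_ids = [f"task_{i}" for i in range(500)]
train_ids, held_out_ids = split_tasks(all_ids)
print(len(train_ids), len(held_out_ids))  # 400 100
```

The prompt policy would be trained with rewards from `train_ids` only, and the gap between Pass@1 on the two sets would then indicate how much of the gain is specific to the tests seen during optimization.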

Figures

Figures reproduced from arXiv: 2605.19102 by Ali Mohammadi Esfahani, Nafiseh Kahani, Samuel A. Ajila.

**Figure 2.** (Left) PPO-based training loop; (Right) Evaluation protocol. Both algo…

**Figure 3.** Illustrative progression from an ambiguous benchmark-style prompt to…

Original abstract

Large Language Models (LLMs) can generate code from natural language, but their performance is highly sensitive to prompt formulation. We propose a reinforcement-learning-based framework that models prompt refinement as a sequential decision-making problem. A Proximal Policy Optimization (PPO) agent iteratively improves prompts using a hybrid action space that combines direct generation, genetic lexical mutation and semantic rewriting, guided by shaped rewards derived from unit-test feedback. We evaluate the framework on MBPP+, HumanEval+, and APPS using CodeT5+, CodeLLaMA, and DeepSeek-Coder as frozen code generators. On the 500-task MBPP+ test set, the PPO agent achieves strict Pass@1 scores of 57.58%, 64.80%, and 85.50%, respectively, outperforming EPiC, Reflexion, and Random-Hybrid. Soft-Pass@1 reaches 67.90%, 73.10%, and 88.20%, respectively. Similar improvements are observed on HumanEval+ and APPS across all backbone models. The results demonstrate that reinforcement learning with shaped test-driven rewards improves functional correctness in LLM-based code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RL prompt optimization gets benchmark lifts via hybrid actions but the evaluation risks overfitting to test-suite rewards without clear held-out splits.

Here's the quick read on this one. The core idea is using PPO to iteratively refine prompts for code LLMs with a hybrid set of actions and rewards pulled from unit test results. They get better Pass@1 numbers than the baselines they compare against.

The new part is that hybrid action space inside the RL loop. Mixing direct generation with genetic mutations and semantic rewrites gives the agent more ways to explore prompt changes than just one method. They apply it to three different code models on MBPP+, HumanEval+, and APPS, which is a reasonable spread. The numbers are the main draw: up to 85% strict Pass@1 on MBPP+ for DeepSeek-Coder, with gains over EPiC, Reflexion, and random hybrids. Soft pass rates are even higher. If the full paper has the implementation details and code, that would make it more usable for others.

On the downside, the abstract leaves out key evaluation details. No mention of statistical significance, run-to-run variance, or whether they kept a separate set of tasks for training the PPO agent versus testing. The reward comes from unit tests on the same tasks used for the final metric. That setup could let the agent overfit to those specific tests instead of finding broadly better prompts. If the paper doesn't address this with held-out data or cross-validation, the improvements might not hold up as well as they appear.

This is the kind of paper that would interest people building tools for LLM-assisted coding or studying prompt optimization techniques. It gives a concrete method with benchmark results, so a practitioner could try adapting it.

I'd say it merits a serious referee. The approach is grounded in existing RL methods but applies them in a fresh combination for this domain. Send it to review, though expect feedback on the experimental rigor around data separation and reward shaping.

Referee Report

3 major / 2 minor

Summary. The paper presents a reinforcement learning approach using PPO to optimize prompts for code generation with LLMs. It uses a hybrid action space of generation, mutation, and rewriting, guided by unit-test-based rewards. On MBPP+ with 500 tasks, it reports strict Pass@1 scores of 57.58%, 64.80%, and 85.50% for CodeT5+, CodeLLaMA, and DeepSeek-Coder, outperforming several baselines, with similar gains on other benchmarks.

Significance. Should the results prove robust to proper validation splits and statistical analysis, this framework offers a promising method for automated prompt refinement in code generation, potentially reducing the need for manual prompt engineering. The integration of genetic and semantic actions with RL provides a novel combination that could inspire further work in test-driven prompt optimization.

major comments (3)

[Section 4 (Experiments)] The experimental protocol does not specify the use of a held-out training set separate from the 500-task MBPP+ test set. Since rewards are derived from unit tests on these tasks, direct optimization risks overfitting to the test suites, which would invalidate the generalization claims for the learned prompt strategies.
[§5 (Results and Discussion)] No information is provided on the variance across multiple runs, number of random seeds, or statistical significance of the reported Pass@1 improvements (e.g., 57.58% vs. baselines). This makes it hard to determine if the gains are reliable or due to chance.
[§3.2 (Reward Shaping)] The exact formulation of the shaped rewards from unit-test feedback is not detailed, including any weighting parameters or how partial successes are handled. This is important for reproducibility and understanding the source of improvements.

minor comments (2)

[Abstract] The abstract could briefly mention the training procedure details, such as the number of PPO iterations or the size of the prompt optimization dataset.
[Table 1] Ensure all baseline methods are described consistently with their original papers for fair comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below in a point-by-point manner and indicate planned revisions to strengthen the work.

Referee: [Section 4 (Experiments)] The experimental protocol does not specify the use of a held-out training set separate from the 500-task MBPP+ test set. Since rewards are derived from unit tests on these tasks, direct optimization risks overfitting to the test suites, which would invalidate the generalization claims for the learned prompt strategies.

Authors: We appreciate the referee highlighting this important methodological point. The current protocol optimizes prompts directly on the MBPP+ tasks using unit-test rewards because the objective is to demonstrate automated refinement for standard benchmark performance. We acknowledge that this setup carries a risk of overfitting to the specific test suites and does not constitute a strict train/test separation for the prompt optimization process itself. To address the concern, we will revise Section 4 to explicitly describe the protocol, discuss the implications for generalization claims, and emphasize the supporting evidence from transfer to the separate HumanEval+ and APPS benchmarks. We will also note this as a limitation and outline how a held-out split could be incorporated in follow-up work.

revision: partial

Referee: [§5 (Results and Discussion)] No information is provided on the variance across multiple runs, number of random seeds, or statistical significance of the reported Pass@1 improvements (e.g., 57.58% vs. baselines). This makes it hard to determine if the gains are reliable or due to chance.

Authors: We agree that variance, random seeds, and statistical significance are necessary for assessing result reliability. The reported figures were obtained from single runs per configuration owing to the computational cost of PPO training. In the revision we will rerun the experiments with multiple random seeds (at least three), report means and standard deviations for Pass@1 scores, and include statistical tests (e.g., paired t-tests) comparing our method against baselines to establish significance.

revision: yes

Referee: [§3.2 (Reward Shaping)] The exact formulation of the shaped rewards from unit-test feedback is not detailed, including any weighting parameters or how partial successes are handled. This is important for reproducibility and understanding the source of improvements.

Authors: Thank you for this observation. We will expand §3.2 with the precise reward formulation, including the weighting scheme for passed, partially passed, and failed unit tests, the handling of partial credit (based on output similarity metrics), and all hyper-parameters used in the shaped reward function. This addition will directly improve reproducibility.

revision: yes
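The multi-seed reporting described in the second response could look roughly like the sketch below, which uses only the Python standard library. The function names are hypothetical, and the scores are invented placeholders chosen for easy arithmetic; they are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of Pass@1 across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(
    ours: Sequence[float], baseline: Sequence[float]
) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need at least two matched runs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder Pass@1 values for three seeds (NOT from the paper).
ours = [60.0, 62.0, 61.0]
baseline = [55.0, 58.0, 55.0]

mean, std = summarize(ours)
t_stat, dof = paired_t_statistic(ours, baseline)
print(f"ours: {mean:.2f} +/- {std:.2f}, t = {t_stat:.3f}, dof = {dof}")
```

The t statistic would then be compared against Student's t distribution with `dof` degrees of freedom to obtain a p-value; `scipy.stats.ttest_rel` performs the same paired test and returns the p-value directly. With only three seeds the test has little power, so the standard deviations are likely to be the more informative addition.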

Circularity Check

0 steps flagged

No circularity detected; results are direct empirical measurements

The paper describes an RL-based prompt optimization method using PPO with hybrid actions and unit-test-derived rewards, then reports Pass@1 scores on fixed external benchmarks (MBPP+, HumanEval+, APPS). No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would reduce the central performance claims to the inputs by construction. The evaluation relies on direct measurement against standard test suites rather than any internal derivation that collapses into its own assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on the assumption that unit-test outcomes can be turned into effective shaped rewards and that the hybrid action space is expressive enough for prompt refinement; no new physical entities are introduced.

free parameters (1)

reward shaping weights
Shaped rewards derived from unit-test feedback require choices about how to combine pass/fail signals, partial credit, and possibly length or diversity terms.
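To make those choices concrete, here is one possible shape for such a reward, written as a sketch. The three weights, their default values, and the names `RewardWeights` and `shaped_reward` are all assumptions introduced for illustration. The paper's actual formulation is not given in this summary, and the authors' rebuttal indicates their partial credit relies on output similarity metrics rather than the simple pass fraction used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # All three values are hypothetical free parameters.
    pass_fraction: float = 1.0   # weight on the share of unit tests passed
    all_pass_bonus: float = 1.0  # extra reward when every test passes
    error_penalty: float = 0.5   # deducted when the code fails to run at all


def shaped_reward(
    passed: int,
    total: int,
    crashed: bool,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Combine partial credit, a full-pass bonus, and a crash penalty."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    reward = w.pass_fraction * passed / total  # dense partial-credit term
    if passed == total:
        reward += w.all_pass_bonus             # sparse success term
    if crashed:
        reward -= w.error_penalty              # discourage non-runnable code
    return reward


print(shaped_reward(3, 4, crashed=False))  # 0.75
print(shaped_reward(4, 4, crashed=False))  # 2.0
print(shaped_reward(0, 4, crashed=True))   # -0.5
```

The partial-credit term gives the agent a gradient to follow before any prompt yields a fully correct solution, while the bonus keeps full correctness strictly preferable to near misses.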

axioms (1)

domain assumption: Unit tests provide a reliable proxy for functional correctness that can guide sequential prompt refinement.
The entire reward signal and therefore the learning process rests on this premise.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: A Proximal Policy Optimization (PPO) agent iteratively improves prompts using a hybrid action space that combines direct generation, genetic lexical mutation and semantic rewriting, guided by shaped rewards derived from unit-test feedback.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: On the 500-task MBPP+ test set, the PPO agent achieves strict Pass@1 scores of 57.58%, 64.80%, and 85.50%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
