Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation

Faroq AL-Tam; Jie M. Zhang; Muhammad Al-Qurishi; Shuyin Ouyang; Zhaozhi Qian

arxiv: 2605.17174 · v1 · pith:CM42U3X3new · submitted 2026-05-16 · 💻 cs.SE · cs.AI

Shuyin Ouyang, Zhaozhi Qian, Faroq AL-Tam, Muhammad AL-Qurishi, Jie M. Zhang

Pith reviewed 2026-05-20 13:59 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: reinforcement learning · code generation · diffusion models · static analysis · reward design · hint conditioning · program synthesis · execution-free rewards

The pith

Static checking serves as the strongest execution-free reward for RL in diffusion code generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates reinforcement learning methods for aligning diffusion language models with functional correctness in code generation. It explores execution-free rewards for cases where unit-test execution gives too weak a learning signal on complex tasks. Experiments show that static checking is the strongest standalone execution-free reward, improving DiffuCoder on HumanEval and LiveCodeBench while reducing rollout time by 9.4%. The study further examines how hint-conditioned sampling aids exploration and how reward effectiveness changes with task difficulty. These insights offer ways to improve code generation models without heavy dependence on executable tests.

Core claim

The authors present an empirical study showing that static checking is the strongest overall standalone execution-free reward for RL post-training of diffusion-based code generation models. It improves DiffuCoder from 53.9 to 67.1 on HumanEval and from 14.9 to 15.5 on LiveCodeBench, while reducing rollout time by 9.4%. Moderate AST-based hinting is most useful on harder benchmarks, and the best reward design depends on task difficulty, with similarity-based rewards better for easier subsets and static checking more reliable for harder ones.
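To put the reported figures side by side, the short sketch below works out the absolute and relative gains and the throughput multiplier implied by a 9.4% rollout-time reduction. The helper names are illustrative and not from the paper; only the input numbers come from the claim above.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


def speedup_from_reduction(reduction_pct: float) -> float:
    """Throughput multiplier implied by cutting wall-clock time by reduction_pct percent."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


humaneval = gain(53.9, 67.1)        # about +13.2 points, roughly +24.5% relative
livecodebench = gain(14.9, 15.5)    # about +0.6 points, roughly +4.0% relative
rollout_speedup = speedup_from_reduction(9.4)  # roughly 1.10x
```

The LiveCodeBench gain is more than an order of magnitude smaller in absolute terms than the HumanEval gain, which is one reason seed-level spread matters for interpreting it.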

What carries the argument

Static checking as an execution-free reward in combination with hint-conditioned diffusion sampling.
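For readers unfamiliar with the idea, here is a minimal sketch of what an execution-free static-check reward could look like for Python code. The scoring tiers and the function name `static_check_reward` are assumptions made for illustration; the page does not state the paper's actual static-analysis rules.

```python
import ast


def static_check_reward(code: str) -> float:
    """Hypothetical execution-free reward in [0, 1]; NOT the paper's rule set.

    0.0   source does not parse
    0.5   parses but defines no function
    0.75  defines a function, but no function contains a return statement
    1.0   defines at least one function that contains a return statement
    """
    try:
        tree = ast.parse(code)
    except (SyntaxError, ValueError):
        return 0.0
    functions = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if not functions:
        return 0.5
    has_return = any(
        isinstance(node, ast.Return)
        for func in functions
        for node in ast.walk(func)
    )
    return 1.0 if has_return else 0.75
```

Nothing here runs the candidate program, so the reward stays informative even when every rollout would fail its unit tests, and it avoids the cost of a sandboxed execution step.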

If this is right

Static analysis can replace execution rewards to provide viable learning signals on complex programming tasks.
Hint conditioning during training helps overcome exploration bottlenecks in diffusion RL for code.
Reward choice should be tailored to task difficulty for optimal results in code generation.
These methods can improve both performance and efficiency in post-training diffusion models for code.
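The hint-conditioning point above can be made concrete with a small sketch. It reveals a fraction of a reference solution before sampling, preferring lines that open a syntactic block, and masks the rest. This is a line-level toy under stated assumptions: the `MASK` symbol, the function name `ast_line_hint`, and the "headers first" ordering are all hypothetical, and a real diffusion language model would mask at the token level rather than per line.

```python
import ast

MASK = "<mask>"  # placeholder mask symbol (assumption, not the model's real token)


def ast_line_hint(reference: str, hint_ratio: float) -> list[str]:
    """Reveal about hint_ratio of the reference lines, structural headers first.

    Hypothetical strategy: lines that open a def/for/while/if/with/try block
    are revealed before ordinary lines; every other line becomes MASK.
    """
    lines = reference.splitlines()
    budget = int(hint_ratio * len(lines))  # number of lines to reveal
    tree = ast.parse(reference)
    header_types = (ast.FunctionDef, ast.For, ast.While, ast.If, ast.With, ast.Try)
    headers = sorted(
        {node.lineno - 1 for node in ast.walk(tree) if isinstance(node, header_types)}
    )
    others = [i for i in range(len(lines)) if i not in headers]
    revealed = set((headers + others)[:budget])
    return [line if i in revealed else MASK for i, line in enumerate(lines)]


reference_solution = (
    "def f(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        total += x\n"
    "    return total"
)
hinted = ast_line_hint(reference_solution, hint_ratio=0.5)
```

With a hint ratio of 0.5 on this five-line reference, the `def` and `for` lines are kept and the three body lines are masked, so the sampler starts from the program's skeleton rather than from a fully masked sequence.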

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This approach may apply to training other generative AI models for programming beyond diffusion architectures.
Adopting execution-free rewards could lower the computational demands of RL training for code models.
Difficulty-aware reward systems might become standard in aligning models for software engineering tasks.
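One way to picture a difficulty-aware reward is a simple router: use a similarity reward where execution already succeeds reasonably often, and fall back to a static check where it rarely does. The sketch below is an assumption-laden illustration; the threshold of 0.2, the use of `difflib` similarity, and the parse-only static check are choices made here, not details reported by the paper.

```python
import ast
from difflib import SequenceMatcher


def similarity_reward(candidate: str, reference: str) -> float:
    """Character-level similarity to a reference solution, in [0, 1]."""
    return SequenceMatcher(None, candidate, reference).ratio()


def parses(candidate: str) -> float:
    """Minimal static check: 1.0 if the code parses, else 0.0."""
    try:
        ast.parse(candidate)
        return 1.0
    except (SyntaxError, ValueError):
        return 0.0


def difficulty_aware_reward(
    candidate: str,
    reference: str,
    exec_pass_rate: float,
    threshold: float = 0.2,  # hypothetical cut-off between "easy" and "hard"
) -> float:
    """Route by the task's observed execution pass rate (hypothetical rule).

    Easier tasks (pass rate at or above threshold) use the similarity reward;
    harder tasks use the static check.
    """
    if exec_pass_rate >= threshold:
        return similarity_reward(candidate, reference)
    return parses(candidate)
```

The pass rate serves as a stand-in for difficulty here; any other difficulty estimate could be substituted without changing the routing logic.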

Load-bearing premise

Improvements on HumanEval, MBPP, and LiveCodeBench with the DiffuCoder model result primarily from the static rewards and hint conditioning rather than differences in training dynamics or benchmark selection.

What would settle it

Reproducing the RL experiments while isolating only the reward function and observing no performance gains on the benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.17174 by Faroq AL-Tam, Jie M. Zhang, Muhammad Al-Qurishi, Shuyin Ouyang, Zhaozhi Qian.

**Figure 1.** Reward when applying RL to the latest SFT checkpoint of DiffuCoder (rollout number=10). While the format reward quickly converges near 1, the execution-based semantic reward remains near zero across training, highlighting the low-signal reward that motivates our execution-free reward. This problem is not tied to a particular RL algorithm. Methods such as PPO (Schulman et al., 2017), GRPO (Shao et al., 2…

**Figure 2.** Hinting strategies for diffusion sampling (hint ratio=0.5)

**Figure 3.** RQ3: Accuracy and average generation length across training sets of increasing dif…

**Figure 4.** Training reward trajectories under different composite reward designs. Each…

Original abstract

Reinforcement Learning (RL) is an important paradigm for aligning Diffusion Language Models (DLMs) toward functional correctness in code generation. However, these models often encounter a "capability cliff" on complex tasks, where execution-based semantic rewards become too low to provide a viable learning signal. In this paper, we present a systematic empirical study of RL post-training for diffusion-based code generation along three axes: reward design, hint-conditioned sampling, and task difficulty. We investigate the effectiveness of execution-free rewards as alternatives to traditional unit-test execution, the role of training-time hint-conditioned diffusion sampling in mitigating exploration bottlenecks, and how the impact of these design choices varies across tasks with different difficulty levels. Across HumanEval, MBPP, and LiveCodeBench, we find that static checking is the strongest overall standalone execution-free reward in our setting, notably improving DiffuCoder from 53.9 to 67.1 on HumanEval and from 14.9 to 15.5 on LiveCodeBench while reducing rollout time by 9.4%. We further find that moderate AST-based hinting is most useful on harder benchmarks, while the best reward design depends strongly on task difficulty: similarity-based rewards are more effective on easier subsets, whereas static checking is more reliable on harder subsets where execution rewards are low. These findings suggest that reward design and training guidance substantially affect diffusion RL performance in our evaluated code-generation setting.
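The "capability cliff" has a simple mechanical reading in group-based methods such as GRPO: when every rollout in a group receives the same reward, the standardized advantages are all zero and the update carries no signal. The sketch below illustrates this with a group of ten rollouts, matching the rollout number quoted in Figure 1. The graded reward values are made up for illustration, and the use of the population standard deviation plus a small epsilon is an assumption, since implementations differ on those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Ten rollouts that all fail the unit tests: every advantage is exactly zero.
all_fail = group_relative_advantages([0.0] * 10)

# Illustrative graded (execution-free) rewards for the same ten rollouts:
# the group now contains positive and negative advantages to learn from.
graded = group_relative_advantages(
    [0.0, 0.5, 0.5, 1.0, 0.0, 0.5, 1.0, 0.5, 0.0, 1.0]
)
```

A reward that takes more than one value within a group is therefore a precondition for any learning at all, independent of how well that reward tracks functional correctness.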

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Static checking rewards give measurable lifts on hard code tasks for diffusion RL, but the causal link needs tighter experimental controls. The paper compares execution-free rewards against unit-test execution in post-training of a diffusion code model and breaks the results out by task difficulty. It reports that static analysis is the strongest standalone option overall, moving DiffuCoder from 53.9 to 67.1 on HumanEval and from 14.9 to 15.5 on LiveCodeBench while cutting rollout time by 9.4 percent. Moderate AST-based hint conditioning helps more on the harder subsets, and similarity rewards work better on easier ones. Those difficulty-dependent patterns are the clearest new piece here.

The study is useful because it stays close to real training constraints and gives practitioners concrete numbers on public benchmarks to try first. The empirical sweep across reward types and hint strategies is systematic enough to be worth reading for anyone tuning RL on diffusion language models.

The soft spot is the missing detail on variance and controls. The abstract supplies aggregate deltas but no seed-level spreads, no hyperparameter-ablation table, and no explicit statement that every non-reward factor stayed frozen. RL training for these models is sensitive to optimizer state and sampling schedules, so without those checks some of the observed gain could trace to training dynamics rather than the reward design itself. That concern is real but fixable with standard additions.

This paper is for researchers and engineers working on reward engineering or alignment for generative code models. A reader in that area gets practical comparisons and can test the difficulty split against their own runs. It deserves peer review because the benchmarks are standard, the claims are falsifiable, and the topic is directly relevant to current diffusion RL work. Send it out and ask for the variance numbers and any ablations that exist.

Referee Report

1 major / 2 minor

Summary. The paper presents an empirical study of RL post-training for diffusion language models in code generation. It examines three axes: execution-free rewards as alternatives to unit-test execution, hint-conditioned diffusion sampling to mitigate exploration bottlenecks, and variation across task difficulties. On HumanEval, MBPP, and LiveCodeBench, the authors report that static checking is the strongest standalone execution-free reward, lifting DiffuCoder from 53.9 to 67.1 on HumanEval and from 14.9 to 15.5 on LiveCodeBench while cutting rollout time by 9.4%; they also find that moderate AST-based hinting helps more on harder tasks and that reward effectiveness depends on difficulty.

Significance. If the causal attribution to reward design holds, the work supplies actionable guidance for aligning diffusion-based code models when execution signals are weak, showing that static-analysis rewards can outperform execution-based ones on hard subsets while improving efficiency. The empirical focus on multiple public benchmarks and the explicit comparison of reward types and hinting strategies constitute a useful contribution to the emerging literature on diffusion RL for code.

major comments (1)

[Abstract and §4] (empirical results): the headline claim that static checking is the strongest execution-free reward and produces the reported deltas (53.9 → 67.1 on HumanEval, 14.9 → 15.5 on LiveCodeBench) rests on the unverified premise that every non-reward hyperparameter, optimizer state, sampling schedule, and random seed was identical across reward conditions. No variance across seeds, no hyperparameter-ablation table, and no explicit statement that training dynamics were frozen are provided; without these controls the observed improvements cannot be confidently attributed to the reward function itself rather than incidental differences in training.

minor comments (2)

[Abstract] The abstract and results sections would benefit from an explicit statement of the exact static-analysis rules employed and how they differ from the similarity-based and execution baselines.
[Results] Table or figure captions should include the number of random seeds and the precise definition of the reported metrics (pass@1, etc.) to allow direct replication.
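On the request for a precise metric definition: a commonly used definition of pass@k is the unbiased estimator from the HumanEval paper (Chen, "Evaluating large language models trained on code"), which draws n samples per problem, counts the c that pass, and computes 1 - C(n-c, k) / C(n, k). Whether the paper under review uses this estimator is not stated on this page, so the sketch below is offered only as a reference point.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that passed all tests, k: evaluation budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which gives pass@k = 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to c / n, so a benchmark-level pass@1 is the mean fraction of passing samples across problems.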

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for identifying a key aspect of experimental rigor. We address the major comment point by point below.

Referee: [Abstract and §4] (empirical results): the headline claim that static checking is the strongest execution-free reward and produces the reported deltas (53.9 → 67.1 on HumanEval, 14.9 → 15.5 on LiveCodeBench) rests on the unverified premise that every non-reward hyperparameter, optimizer state, sampling schedule, and random seed was identical across reward conditions. No variance across seeds, no hyperparameter-ablation table, and no explicit statement that training dynamics were frozen are provided; without these controls the observed improvements cannot be confidently attributed to the reward function itself rather than incidental differences in training.

Authors: We appreciate the referee's emphasis on isolating the causal effect of the reward function. In our experimental protocol (detailed in Section 3.2), all non-reward hyperparameters (optimizer configuration, learning rate schedule, diffusion sampling steps, batch size, and random seed initialization) were held fixed across reward conditions; only the reward computation itself was varied. This design choice is implicit in the shared training pipeline description but was not stated explicitly enough. We will revise the manuscript to add a dedicated paragraph in Section 4 (and a corresponding sentence in the abstract) confirming that training dynamics were frozen except for the reward component. We acknowledge that multi-seed variance statistics and a full hyperparameter-ablation table are absent; these omissions stem from the substantial compute cost of diffusion-model RL runs. In the revision we will (i) explicitly note this limitation and (ii) report, where space permits, pass@1 results from two additional seeds for the primary static-checking condition on HumanEval to provide at least a basic indication of stability.

revision: partial
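Once per-seed scores exist, the stability check promised in the rebuttal can be summarized in a few lines. The sketch below reports the mean and sample standard deviation across seeds and applies a rough "gain larger than z standard errors" rule. The function names and the default z = 2.0 are assumptions for illustration, and with only three seeds this is a sanity check rather than a formal significance test.

```python
from statistics import mean, stdev


def summarize_seeds(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed pass@1 scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(scores), stdev(scores)


def gain_exceeds_noise(baseline: float, scores: list[float], z: float = 2.0) -> bool:
    """True if the mean gain over baseline is larger than z standard errors."""
    m, s = summarize_seeds(scores)
    standard_error = s / len(scores) ** 0.5
    return (m - baseline) > z * standard_error
```

A small gain, such as the 0.6-point LiveCodeBench delta, is the case where this kind of check is most informative.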

Circularity Check

0 steps flagged

No circularity: direct empirical reporting of benchmark results

The paper is a systematic empirical study that reports observed performance deltas on HumanEval, MBPP, and LiveCodeBench when applying different reward designs and hint-conditioning to a diffusion model for code generation. No equations, derivations, or first-principles predictions appear in the provided text; the central claims (e.g., static checking lifting scores from 53.9 to 67.1) are presented as measured outcomes of experiments rather than results that reduce to fitted parameters or self-referential definitions by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are invoked to justify the findings. The work is therefore self-contained as straightforward experimental reporting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is empirical and relies on standard assumptions from reinforcement learning and diffusion modeling without introducing new free parameters or postulated entities.

axioms (1)

Domain assumption: Reinforcement learning with appropriate reward signals can improve functional correctness in diffusion language models for code generation.
This underpins the entire post-training approach described in the abstract.

pith-pipeline@v0.9.0 · 5803 in / 1241 out tokens · 87483 ms · 2026-05-20T13:59:33.133914+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "static checking is the strongest overall standalone execution-free reward... improving DiffuCoder from 53.9 to 67.1 on HumanEval"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 17 internal anchors

[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

[2] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301.

[3] Mark Chen. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[4] Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Wei Shen, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, et al. Stepcoder: Improve code generation with reinforcement learning from compiler feedback. arXiv preprint arXiv:2402.01391.

[5] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. DiffuSeq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933.

[6] Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891.

[7] Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. DiffuCoder: Understanding and improving masked diffusion models for code generation. arXiv preprint arXiv:2506.20639, 2025.

[8] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[9] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186.

[10] Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, and Sanjiban Choudhury. Multi-turn code generation through single-step rewards. arXiv preprint arXiv:2502.20380.

[11] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

[12] Jiazheng Li, Hongzhou Lin, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Yi Wu, and Jingzhao Zhang. Questa: Expanding reasoning capacity in LLMs via question augmentation. arXiv preprint arXiv:2507.13266.

[13] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

[14] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992.

[15] Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. An empirical study of the non-determinism of ChatGPT in code generation. ACM Transactions on Software Engineering and Methodology, 34(2):1–28, 2025a.

Shuyin Ouyang, Jie M Zhang, Zeyu Sun, and Albert Merono Penuela. Knowledge-enhanced program repair for data science code. arXiv preprint arXiv:2502.09771...

[16] ... URL https://arxiv.org/abs/2305.18290.

Allen Z. Ren, Justin Lidard, Lars L. Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. URL https://arxiv.org/abs/2409.00588.

[17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[18] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[19] Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. Execution-based code generation using deep reinforcement learning. arXiv preprint arXiv:2301.13816.

[20] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[21] Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, and Mengdi Wang. Revolutionizing reinforcement learning framework for diffusion large language models. arXiv preprint arXiv:2509.06949.

[22] Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong, Zirui Wu, Xueliang Zhao, Shansan Gong, Xin Jiang, Zhenguo Li, et al. Dream-coder 7b: An open diffusion language model for code. arXiv preprint arXiv:2509.01142, 2025.

[23] Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning. arXiv preprint arXiv:2410.14157.

[24] Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487.

[25] ... Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? URL https://arxiv.org/abs/2504.13837.

Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. Acecoder: Acing coder RL via automated test-case synthesis. arXiv preprint arXiv:2502.01718, 2025a.

Yiming Zeng, Jinghan Cao, Zexin Li, Yiming Chen, Tao Ren, Zhuochun Li, Dawei Xiang, Xidong Wu, Shangqian Gao, and Tingting Yu. Treediff: As...

[26] Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216, 2025.

[27] Lin Zheng, Jianbo Yuan, Lei Yu, and Lingpeng Kong. A reparameterized discrete diffusion model for text generation. arXiv preprint arXiv:2302.05737.

[28] Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, et al. LLaDA 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223.

[29] Appendix A: Preliminaries and Notation. A.1 Problem Definition. We consider mask-based DLMs that generate code through iterative denoising. A mask DLM performs inference by gradually denoising a masked input sequence. Let x_T be the DLM's input sequence, where each element may be an unmasked token from the vocabulary V or a special...

[30] ... and DiffuCoder (Gong et al., 2025), two representative state-of-the-art diffusion language models for code generation. These models provide strong diffusion-based coding baselines and allow us to examine whether the observed RL trends are consistent across different DLM architectures. We train in bfloat16 precision to reduce memory footprint while maintai...

[1] [1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Evaluating Large Language Models Trained on Code

Mark Chen. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Stepcoder: Improve code generation with reinforcement learning from compiler feedback.arXiv preprint arXiv:2402.01391,

Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Wei Shen, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, et al. Stepcoder: Improve code generation with reinforcement learning from compiler feedback.arXiv preprint arXiv:2402.01391,

work page arXiv

[5] [5]

DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models

Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Diffuseq: Se- quence to sequence text generation with diffusion models.arXiv preprint arXiv:2210.08933,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adapta- tion from autoregressive models.arXiv preprint arXiv:2410.17891,

work page internal anchor Pith review arXiv

[7] [7]

Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639, 2025

Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639,

work page arXiv

[8] [8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Multi-turn code generation through single-step rewards.arXiv preprint arXiv:2502.20380,

Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Went- ing Zhao, and Sanjiban Choudhury. Multi-turn code generation through single-step rewards.arXiv preprint arXiv:2502.20380,

work page arXiv

[11] [11]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contami- nation free evaluation of large language models for code.arXiv preprint arXiv:2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Under review

10 Preprint. Under review. Jiazheng Li, Hongzhou Lin, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Yi Wu, and Jingzhao Zhang. Questa: Expanding reasoning capacity in llms via question augmentation. arXiv preprint arXiv:2507.13266,

work page arXiv

[13] [13]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

An empirical study of the non-determinism of chatgpt in code generation.ACM Transactions on Software Engineering and Methodology, 34(2):1–28, 2025a

Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. An empirical study of the non-determinism of chatgpt in code generation.ACM Transactions on Software Engineering and Methodology, 34(2):1–28, 2025a. Shuyin Ouyang, Jie M Zhang, Zeyu Sun, and Albert Merono Penuela. Knowledge-enhanced program repair for data science code.arXiv preprint arXiv:2502.09771...

work page arXiv

[16] [16]

URLhttps://arxiv.org/abs/2305.18290. Allen Z. Ren, Justin Lidard, Lars L. Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Diffusion Policy Policy Optimization

URLhttps://arxiv.org/abs/2409.00588. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Execution-based code generation using deep reinforcement learning. arXiv preprint arXiv:2301.13816.

Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. Execution-based code generation using deep reinforcement learning. arXiv preprint arXiv:2301.13816.

[20]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[21]

Revolutionizing reinforcement learning framework for diffusion large language models. arXiv preprint arXiv:2509.06949.

Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, and Mengdi Wang. Revolutionizing reinforcement learning framework for diffusion large language models. arXiv preprint arXiv:2509.06949.

[22]

Dream-coder 7b: An open diffusion language model for code, 2025

Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong, Zirui Wu, Xueliang Zhao, Shansan Gong, Xin Jiang, Zhenguo Li, et al. Dream-coder 7b: An open diffusion language model for code. arXiv preprint arXiv:2509.01142.

[23]

Beyond autoregression: Discrete diffusion for complex reasoning and planning. arXiv preprint arXiv:2410.14157.

Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning. arXiv preprint arXiv:2410.14157.

[24]

Dream 7B: Diffusion Large Language Models

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487.

[25]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

URL https://arxiv.org/abs/2504.13837.

Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. Acecoder: Acing coder RL via automated test-case synthesis. arXiv preprint arXiv:2502.01718, 2025a.

Yiming Zeng, Jinghan Cao, Zexin Li, Yiming Chen, Tao Ren, Zhuochun Li, Dawei Xiang, Xidong Wu, Shangqian Gao, and Tingting Yu. Treediff: As...

[26]

d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216, 2025

Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216.

[27]

A reparameterized discrete diffusion model for text generation. arXiv preprint arXiv:2302.05737.

Lin Zheng, Jianbo Yuan, Lei Yu, and Lingpeng Kong. A reparameterized discrete diffusion model for text generation. arXiv preprint arXiv:2302.05737.

[28]

LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, et al. LLaDA 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223.

[29]

Under review

Appendix A Preliminaries and Notation

A.1 Problem Definition

We consider mask-based DLMs that generate code through iterative denoising. A mask DLM performs inference by gradually denoising a masked input sequence. Let x_T be the DLM’s input sequence, where each element may be an unmasked token from the vocabulary V or a special...

[30]

You are a helpful assistant

and DiffuCoder (Gong et al., 2025), two representative state-of-the-art diffusion language models for code generation. These models provide strong diffusion-based coding baselines and allow us to examine whether the observed RL trends are consistent across different DLM architectures. We train in bfloat16 precision to reduce memory footprint while maintai...
