DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models
Pith reviewed 2026-05-20 22:47 UTC · model grok-4.3
The pith
Diffusion language models gain from RL that assigns credit by denoising progress and reduces likelihood bias.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DACA-GRPO augments any GRPO-style base method with two mechanisms. Denoising Progress Scores derive per-token importance weights from intermediate denoising predictions at no extra forward cost. Stratified Masking Likelihood divides token positions into strata so that each token is predicted with nearly full context, which lowers the mean-field bias in the policy gradient.
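As a rough illustration of how per-token weights could be read off intermediate denoising predictions, the sketch below assumes the sampler already logs, at every denoising step, the probability it assigned to the token that ends up at each position of the final sequence. The function name `progress_weights`, the "sum of positive confidence gains" scoring rule, and the mean-one normalization are assumptions made for this example; the page does not give the paper's actual formula.

```python
from typing import List


def progress_weights(step_probs: List[List[float]], eps: float = 1e-8) -> List[float]:
    """Turn logged per-step confidences into per-token weights (illustrative only).

    step_probs[t][i] is the probability assigned at denoising step t to the
    token finally emitted at position i. This input is hypothetical: it is
    assumed to be recorded during sampling, so no extra forward pass is needed.
    """
    num_steps = len(step_probs)
    seq_len = len(step_probs[0])
    scores = [0.0] * seq_len
    for t in range(1, num_steps):
        for i in range(seq_len):
            # Credit a position only for steps where its confidence increased.
            scores[i] += max(step_probs[t][i] - step_probs[t - 1][i], 0.0)
    total = sum(scores)
    if total < eps:
        # No measurable progress anywhere: fall back to uniform weights.
        return [1.0] * seq_len
    # Normalize so the weights average to 1, leaving the loss scale unchanged.
    return [s * seq_len / total for s in scores]
```

Under this assumed rule, a position whose confidence rises sharply during denoising receives a larger weight than one that was already confident at the first step.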
What carries the argument
Denoising Progress Scores and Stratified Masking Likelihood, which together supply temporal credit assignment and debiased likelihood estimates across the denoising trajectory.
If this is right
- Mathematical reasoning benchmarks improve by as much as 5.6 percentage points.
- Code generation performance rises by up to 7.4 percentage points.
- Constraint satisfaction tasks see gains reaching 36.3 percentage points.
- Constrained generation, including JSON schema adherence, improves by as much as 5.9 percentage points.
- The same two mechanisms can be added to any existing GRPO trainer without extra model forward passes.
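The plug-in claim in the last bullet can be pictured with a minimal sketch of where per-token weights might enter a GRPO-style objective. Group-relative advantages (reward minus group mean, divided by group standard deviation) are the standard GRPO ingredient; the function names, the use of the population standard deviation, and the omission of the clipped importance ratio and KL term are simplifications assumed here, not details confirmed by the page.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages for one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def weighted_objective(token_logps: List[float],
                       token_weights: List[float],
                       advantage: float) -> float:
    """Advantage-scaled, weight-averaged token log-likelihood for one completion.

    token_weights stands in for per-token importance weights (hypothetical
    input); with all weights equal to 1 this is the usual length-normalized
    term without clipping.
    """
    assert len(token_logps) == len(token_weights)
    n = len(token_logps)
    return advantage * sum(w * lp for w, lp in zip(token_weights, token_logps)) / n
```

Because the weights only rescale terms that the base trainer already computes, a change of this shape would not require additional model calls, which is consistent with the bullet above.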
Where Pith is reading between the lines
- The same progress-score idea could be tested on diffusion models for images or audio to see whether temporal credit assignment helps non-language domains.
- If the stratified masking reduces bias reliably, it may also lower variance in other non-autoregressive policy optimization settings.
- The approach suggests that future work on RL for sequential generation should treat the generation trajectory as having ordered importance rather than uniform steps.
Load-bearing premise
The observed gains on benchmarks come from the two new mechanisms rather than from differences in training details, seeds, or evaluation protocols.
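One way to make this premise checkable is to diff the training configurations of a base run and its DACA-augmented counterpart, allowing only the mechanism switches to differ. The sketch below is hypothetical throughout: the helper `config_mismatches` and every key name in the usage example (`use_dps`, `use_sml`, `denoising_steps`, and so on) are invented for illustration and do not come from the paper's codebase.

```python
from typing import Any, Dict, Set, Tuple


def config_mismatches(base: Dict[str, Any],
                      variant: Dict[str, Any],
                      allowed: Set[str]) -> Dict[str, Tuple[Any, Any]]:
    """Return every setting that differs between two runs, except allowed keys.

    A key missing from one side is reported with None for that side.
    """
    keys = (set(base) | set(variant)) - allowed
    return {
        k: (base.get(k), variant.get(k))
        for k in sorted(keys)
        if base.get(k) != variant.get(k)
    }


if __name__ == "__main__":
    # Placeholder values, not the paper's hyperparameters.
    base_cfg = {"lr": 1e-6, "denoising_steps": 128, "seed": 0}
    daca_cfg = dict(base_cfg, use_dps=True, use_sml=True)
    print(config_mismatches(base_cfg, daca_cfg, {"use_dps", "use_sml"}))
```

An empty result means the two runs differ only in the switches listed in `allowed`.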
What would settle it
Re-running the seven-benchmark suite on the three base GRPO methods with and without the two mechanisms while holding every other training and evaluation detail fixed, then finding no consistent improvement when the mechanisms are present.
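The comparison described above amounts to a paired analysis over matched runs. The following sketch shows one simple way to summarize it, assuming scores are keyed by (benchmark, base method, seed); the helper name, the key layout, and the summary statistics are choices made for this example, and any numbers passed in would be the experimenter's own.

```python
from statistics import mean
from typing import Dict, Tuple

Key = Tuple[str, str, int]  # (benchmark, base method, seed); hypothetical layout


def paired_deltas(base_scores: Dict[Key, float],
                  daca_scores: Dict[Key, float]) -> Dict[str, float]:
    """Summarize score differences over runs present in both result sets."""
    shared = sorted(set(base_scores) & set(daca_scores))
    if not shared:
        raise ValueError("no matched runs to compare")
    deltas = [daca_scores[k] - base_scores[k] for k in shared]
    return {
        "n": len(deltas),
        "mean_delta": mean(deltas),
        # Share of matched runs where the augmented variant scored higher.
        "frac_improved": sum(d > 0 for d in deltas) / len(deltas),
    }
```

A mean delta near zero together with a fraction improved near one half, across all matched runs, would be the "no consistent improvement" outcome.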
Original abstract
Diffusion large language models are a compelling alternative to autoregressive models, yet existing RL methods for diffusion treat all denoising steps as equally important and rely on biased, high-variance likelihood estimates. We identify two fundamental weaknesses: the absence of temporal credit assignment across the denoising trajectory, and the systematic bias of mean-field likelihood estimates used for policy optimization. To address these, we propose Denoising-Aware Credit Assignment for GRPO (DACA-GRPO), a lightweight, plug-and-play enhancement for any GRPO-style trainer. DACA-GRPO introduces two complementary mechanisms: Denoising Progress Scores, which extract per-token importance weights from intermediate predictions at no additional forward cost, and Stratified Masking Likelihood, which partitions token positions into strata so that each token is predicted with most of the sequence as context, reducing the mean-field bias. Applied on top of three GRPO base methods, DACA-GRPO achieves consistent improvements across seven benchmarks spanning mathematical reasoning, code generation, constraint satisfaction, and constrained generation, with gains of up to 5.6pp on math reasoning, 7.4pp on code generation, 36.3pp on constraint satisfaction, and 5.9pp on JSON schema adherence.
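The stratified idea in the abstract can be sketched as follows, under loudly stated assumptions: positions are split into interleaved strata, each stratum is masked in turn, and the log-probabilities of the masked tokens are summed. The interleaved partition, the `MASK` id, and the `token_logprobs(masked, targets)` callback standing in for a model forward pass are all hypothetical; the page does not specify how the paper chooses strata or computes the estimate.

```python
from typing import Callable, List, Sequence

MASK = -1  # hypothetical mask-token id


def make_strata(seq_len: int, num_strata: int) -> List[List[int]]:
    """Interleaved partition: stratum s holds positions s, s + K, s + 2K, ..."""
    return [list(range(s, seq_len, num_strata)) for s in range(num_strata)]


def stratified_log_likelihood(
    tokens: Sequence[int],
    num_strata: int,
    token_logprobs: Callable[[List[int], Sequence[int]], List[float]],
) -> float:
    """Sum log-probs of each token, predicted with its own stratum masked.

    token_logprobs(masked, targets) is assumed to return, for every position,
    the log-probability of targets[i] given the partially masked sequence.
    """
    total = 0.0
    for stratum in make_strata(len(tokens), num_strata):
        masked = list(tokens)
        for i in stratum:
            masked[i] = MASK
        logps = token_logprobs(masked, tokens)  # one model call per stratum
        total += sum(logps[i] for i in stratum)
    return total
```

With K strata, each token is predicted with roughly a fraction 1 - 1/K of the sequence visible. Tokens inside the same stratum are still scored independently of one another, and this sketch makes K model calls per sequence.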
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DACA-GRPO as a lightweight plug-and-play addition to existing GRPO trainers for reinforcement learning on diffusion language models. It identifies two weaknesses in prior work—lack of temporal credit assignment across denoising steps and bias in mean-field likelihood estimates—and introduces Denoising Progress Scores (extracting per-token weights from intermediate predictions) together with Stratified Masking Likelihood (partitioning tokens into strata for richer context). The authors report consistent gains when the method is applied to three GRPO baselines across seven benchmarks covering mathematical reasoning, code generation, constraint satisfaction, and constrained generation.
Significance. If the reported improvements are shown to be caused by the two proposed mechanisms rather than differences in training configuration, the work supplies a practical, low-overhead technique for addressing credit assignment and likelihood bias in diffusion-model RL. The plug-and-play framing and zero-extra-forward-cost claim for the progress scores are attractive for adoption; the approach could help close the performance gap between diffusion and autoregressive LLMs on structured generation tasks.
major comments (2)
- [§4] §4 (Experiments) and the associated tables: the manuscript does not state that the three base GRPO runs were executed with identical hyperparameters, number of denoising steps, masking schedules, random seeds, and evaluation protocols as the DACA-GRPO variants. Because the central claim attributes the observed deltas (up to 5.6 pp math, 7.4 pp code, 36.3 pp constraint satisfaction) to Denoising Progress Scores and Stratified Masking Likelihood, the absence of matched controls leaves open the possibility that the gains reflect implementation discrepancies rather than the credit-assignment innovations.
- [§3.2] §3.2 (Stratified Masking Likelihood): the partitioning into strata is described at a high level, but no explicit equation or pseudocode shows how the likelihood is computed within each stratum or how the strata boundaries are chosen as a function of denoising progress. Without this, it is impossible to verify that the procedure actually reduces mean-field bias rather than introducing a new form of stratification bias.
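For orientation, one plausible form of the equation requested in the second major comment is written out below. The notation is assumed for this note rather than taken from the manuscript: x is the prompt, y = (y_1, ..., y_L) the completion, and S_1, ..., S_K a partition of the positions {1, ..., L}. The baseline is written as a fully masked one-step estimate, which is itself an assumption about the base methods.

```latex
% Assumed notation; not the paper's own equations.
\begin{align*}
\hat{\ell}_{\mathrm{MF}}(y \mid x)
  &= \sum_{i=1}^{L} \log p_\theta\bigl(y_i \mid x,\ y \text{ fully masked}\bigr), \\
\hat{\ell}_{\mathrm{strat}}(y \mid x)
  &= \sum_{k=1}^{K} \sum_{i \in S_k}
     \log p_\theta\bigl(y_i \mid x,\ y_{\setminus S_k}\bigr).
\end{align*}
```

In this reading, K = 1 recovers the fully masked estimate, and K = L gives a pseudo-log-likelihood in which every token is conditioned on all the others. Neither extreme equals the exact joint log-likelihood in general, which is why the referee's question about residual bias would still need an answer from the manuscript's actual formulation.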
minor comments (2)
- [Figure 2] Figure 2 caption: the legend does not distinguish the three GRPO base methods from their DACA-augmented counterparts, making it difficult to map the plotted curves to the numerical results in Table 1.
- [§3.1] The abstract states “no additional forward cost,” yet §3.1 does not quantify the extra memory or compute required to store and process the intermediate predictions used for Denoising Progress Scores.
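The overhead raised in the last minor comment depends heavily on what is cached. As a back-of-the-envelope aid, the sketch below contrasts caching one scalar per position per step with caching a full vocabulary distribution. The function name, the parameters, and the default of 2 bytes per value are assumptions for illustration; the page does not say what the paper stores.

```python
from typing import Optional


def intermediate_storage_bytes(num_steps: int,
                               seq_len: int,
                               batch_size: int,
                               vocab_size: Optional[int] = None,
                               bytes_per_value: int = 2) -> int:
    """Rough memory needed to keep intermediate predictions for a batch.

    With vocab_size=None, one scalar per position per step is cached (for
    example the confidence in the chosen token). With vocab_size set, the
    full per-position distribution is cached instead.
    """
    values_per_position = 1 if vocab_size is None else vocab_size
    return num_steps * seq_len * batch_size * values_per_position * bytes_per_value
```

The two modes differ by a factor equal to the vocabulary size, so whether the stored quantity is a scalar or a distribution is the figure a revised §3.1 would need to report.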
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address each of the major comments in detail below and have made revisions to the manuscript to improve clarity and rigor where indicated.
Point-by-point responses
- Referee: [§4] §4 (Experiments) and the associated tables: the manuscript does not state that the three base GRPO runs were executed with identical hyperparameters, number of denoising steps, masking schedules, random seeds, and evaluation protocols as the DACA-GRPO variants. Because the central claim attributes the observed deltas (up to 5.6 pp math, 7.4 pp code, 36.3 pp constraint satisfaction) to Denoising Progress Scores and Stratified Masking Likelihood, the absence of matched controls leaves open the possibility that the gains reflect implementation discrepancies rather than the credit-assignment innovations.
  Authors: We agree that explicit confirmation of matched experimental conditions is essential for attributing the performance gains to our proposed methods. Although the base GRPO and DACA-GRPO variants were implemented and run within the same codebase and training pipeline, the original manuscript did not explicitly document this. We have revised Section 4 and the experimental setup subsection to clearly state that all compared methods used identical hyperparameters, the same number of denoising steps, identical masking schedules, the same random seeds, and consistent evaluation protocols. This revision ensures that the deltas can be confidently attributed to the Denoising Progress Scores and Stratified Masking Likelihood. We have also included a sentence confirming that no other implementation differences were introduced.
  Revision: yes
- Referee: [§3.2] §3.2 (Stratified Masking Likelihood): the partitioning into strata is described at a high level, but no explicit equation or pseudocode shows how the likelihood is computed within each stratum or how the strata boundaries are chosen as a function of denoising progress. Without this, it is impossible to verify that the procedure actually reduces mean-field bias rather than introducing a new form of stratification bias.
  Authors: We appreciate this observation and acknowledge that the description of Stratified Masking Likelihood in §3.2 was insufficiently detailed. In the revised manuscript, we have added an explicit mathematical formulation for the stratified likelihood computation. Specifically, we now include the equation for the log-likelihood within each stratum, where tokens are grouped by their denoising progress level, and each token is predicted conditioned on the context from other strata. Additionally, we provide pseudocode that outlines the strata boundary selection as a function of the denoising timestep t, ensuring that boundaries are set to maximize contextual information. These additions clarify how the method mitigates mean-field bias without introducing new biases, and we believe they enable full reproducibility and verification.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper proposes DACA-GRPO as a lightweight additive enhancement to existing GRPO trainers via two new mechanisms (Denoising Progress Scores extracted from intermediate predictions and Stratified Masking Likelihood to reduce mean-field bias). These are presented as conceptual innovations addressing temporal credit assignment and likelihood bias, with reported gains framed as empirical outcomes on seven benchmarks when applied atop three base GRPO methods. No equations, derivations, or self-referential definitions appear in the abstract or description that would reduce the claimed improvements to fitted parameters or inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work are referenced. The central claims rest on experimental application rather than a closed mathematical chain that collapses to its own inputs, making the work self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "Denoising Progress Scores... extract per-token importance weights from intermediate predictions... Stratified Masking Likelihood... partitions token positions into strata"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "gains of up to 5.6pp on math reasoning, 7.4pp on code generation, 36.3pp on constraint satisfaction"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.