Diffusion-GR2: Diffusion Generative Reasoning Re-ranker
Pith reviewed 2026-07-02 06:23 UTC · model grok-4.3
The pith
A three-stage recipe converts an autoregressive reasoning re-ranker into a block-diffusion model that recovers near-parity accuracy while decoding in parallel.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Diffusion-GR2 converts an AR reasoning re-ranker into a block-diffusion re-ranker in three stages. Conversion fine-tuning first teaches the model to denoise answer tokens into valid permutations without external constraints. On-policy distillation then supplies dense targets from the AR teacher on the model's own decoded trajectories. Finally, reinforcement learning optimizes against a re-ranking reward. Together the stages close both the structural gap of invalid rankings and the distributional gap of off-policy training, yielding near-parity accuracy with the AR reference.
What carries the argument
The three-stage conversion pipeline of conversion fine-tuning, on-policy distillation, and reinforcement learning that adapts an AR-initialized diffusion model to produce valid on-policy rankings.
If this is right
- Conversion fine-tuning alone recovers most of the conversion gap between AR and diffusion decoding.
- On-policy distillation further reduces the remaining gap to the AR reference accuracy.
- Block-parallel decoding yields a 2.4–3.5× increase in throughput measured at the model's full reasoning output length.
- The final RL stage optimizes directly against the re-ranking reward on top of the distilled policy.
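The throughput claim in the list above rests on block-parallel decoding needing fewer sequential forward passes than AR decoding. A minimal sketch of that count follows. The function names, the block size, and the number of denoising steps per block are hypothetical illustrations, not values from the paper, and the ratio it computes counts passes only; it does not predict the measured 2.4–3.5× figure, since individual passes need not cost the same.

```python
import math


def ar_forward_passes(n_tokens: int) -> int:
    """An AR decoder spends one sequential forward pass per generated token."""
    return n_tokens


def block_diffusion_forward_passes(n_tokens: int, block_size: int,
                                   steps_per_block: int) -> int:
    """Blocks are decoded one after another; each block takes a fixed number
    of denoising steps (an assumption made for this sketch)."""
    n_blocks = math.ceil(n_tokens / block_size)
    return n_blocks * steps_per_block


def pass_ratio(n_tokens: int, block_size: int, steps_per_block: int) -> float:
    """Ratio of sequential passes, AR over block diffusion."""
    return ar_forward_passes(n_tokens) / block_diffusion_forward_passes(
        n_tokens, block_size, steps_per_block)


# Hypothetical setting: 1024 output tokens, blocks of 32, 8 steps per block.
# 32 blocks x 8 steps = 256 passes versus 1024, a pass ratio of 4.0.
example_ratio = pass_ratio(1024, 32, 8)
```

Because the AR pass count grows with every reasoning token while the block count grows only once per block, the ratio depends on output length, which is why a throughput-versus-length plot would be informative.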
Where Pith is reading between the lines
- The same staged adaptation might transfer to other generative tasks that require strict output validity, such as code or plan generation.
- If the on-policy stage generalizes beyond recommendation, it could reduce dependence on large teacher models for distillation in diffusion settings.
- The observed throughput gains suggest diffusion-based re-rankers could support interactive recommendation interfaces where AR latency is currently prohibitive.
Load-bearing premise
The combination of conversion fine-tuning, on-policy distillation, and RL can close both the structural gap of invalid permutations and the distributional gap of off-policy training without new failure modes or accuracy regressions.
What would settle it
An evaluation on Amazon Beauty after all three stages in which the diffusion model still emits a substantial fraction of invalid rankings or its final accuracy remains more than a few percent below the AR re-ranker.
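The invalid-ranking fraction named in this test can be measured with a simple structural check. The sketch below assumes a valid ranking is a full permutation of the candidate identifiers and uses the three defect types the abstract lists (duplicated, dropped, out-of-set). The function names and the string-identifier representation are hypothetical, not taken from the paper.

```python
from collections import Counter


def classify_ranking(predicted, candidates):
    """Return the set of structural defects found in one emitted ranking.

    An empty set means the ranking is a valid permutation of the candidates.
    """
    candidate_set = set(candidates)
    counts = Counter(predicted)
    defects = set()
    if any(n > 1 for n in counts.values()):
        defects.add("duplicated")   # the same identifier appears twice
    if any(item not in candidate_set for item in counts):
        defects.add("out_of_set")   # identifier that was never a candidate
    if any(item not in counts for item in candidate_set):
        defects.add("dropped")      # a candidate is missing from the output
    return defects


def invalid_rate(rankings, candidate_lists):
    """Fraction of rankings that have at least one structural defect."""
    invalid = sum(1 for predicted, candidates in zip(rankings, candidate_lists)
                  if classify_ranking(predicted, candidates))
    return invalid / len(rankings)
```

Reporting the defect types separately, rather than a single invalid flag, would show which failure mode each training stage removes.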
Original abstract
Generative reasoning re-rankers achieve strong recommendation accuracy by emitting a chain-of-thought before re-ordering a candidate list, but they are slow at inference: an autoregressive (AR) decoder spends one sequential forward pass per reasoning token, and the reasoning trace far exceeds the ranking it produces. To reduce this cost, block-diffusion language models decode many positions in parallel over a few denoising steps and are substantially faster, yet naively converting an AR re-ranker into one opens two accuracy gaps: (1) a structural gap: answer positions are denoised in parallel and scored independently, so the decoder emits invalid rankings (duplicated, dropped, or out-of-set identifiers) that AR avoids through left-to-right masking; and (2) a distributional gap: fine-tuning the converted model on fixed teacher trajectories is off-policy relative to its own decoding at inference, leaving a residual accuracy gap. To close both gaps while keeping the speedup, we propose Diffusion-GR2, a recipe that converts our AR reasoning re-ranker (GR2) into a block-diffusion re-ranker. First, conversion fine-tuning (CFT) adapts the AR-initialized diffusion model to denoise the answer into a valid permutation on its own, without an external constrained decoder. Next, on-policy distillation (OPD) supervises the model on its own decoded trajectories with dense per-token targets from the AR teacher. Finally, we apply a reinforcement-learning (RL) stage against a re-ranking reward on top of OPD's on-policy policy. Experiments on Amazon Beauty demonstrate that Diffusion-GR2 recovers near-parity with the AR re-ranker, while block-parallel decoding raises decode throughput by 2.4–3.5× at the model's reasoning output length. Ablations show that CFT recovers most of the conversion gap, and that on-policy distillation further closes it to the AR reference.
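To illustrate what "dense per-token targets from the AR teacher" could look like, here is a minimal sketch of a per-token distillation loss evaluated on a trajectory the student decoded itself. The abstract does not state the divergence used, so the reverse KL below is an assumption, as are the function names and the plain-list representation of logits.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit list."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed divergence)."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def opd_loss(student_logits_per_token, teacher_logits_per_token):
    """Mean per-token divergence over a trajectory sampled from the student.

    Both inputs hold one logit list per position of the student's own decoded
    sequence; the teacher is only scored on that sequence, never sampled from.
    """
    per_token = [reverse_kl(s, t) for s, t in
                 zip(student_logits_per_token, teacher_logits_per_token)]
    return sum(per_token) / len(per_token)
```

The point of the sketch is the data flow: the positions come from the student's decoding, so the supervision matches the distribution the model sees at inference, while every position still receives a full teacher distribution as its target.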
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Diffusion-GR2, a three-stage conversion recipe (conversion fine-tuning (CFT), on-policy distillation (OPD), and RL) that adapts an autoregressive generative reasoning re-ranker (GR2) to a block-diffusion decoder. It claims the recipe closes the structural gap (invalid permutations from parallel denoising) and the distributional gap (off-policy training), recovering near-parity accuracy with the AR baseline on Amazon Beauty while raising decode throughput by 2.4–3.5× via block-parallel decoding; ablations attribute most of the recovery to CFT and further gains to OPD.
Significance. If the empirical claims hold with quantified validity rates and no hidden regressions, the work would be significant for practical deployment of generative re-rankers: it shows how diffusion models can be adapted for permutation-structured outputs without external constrained decoders, while preserving reasoning traces. The explicit three-stage recipe and stage-wise ablation constitute a reusable template for AR-to-diffusion conversion in IR tasks.
major comments (2)
- [Experiments section, main results paragraph] The central claim of 'near-parity' recovery and 'CFT recovers most of the conversion gap' is load-bearing yet unsupported by any numeric values (NDCG@10, validity rate, or per-stage deltas), error bars, or dataset statistics (candidate list size, reasoning length). Without these, it is impossible to verify that the residual invalid rate is zero or that RL introduces no masked regressions.
- [Ablations paragraph] The statement that 'on-policy distillation further closes it to the AR reference' lacks per-stage validity percentages and accuracy deltas after CFT vs. after OPD vs. after RL. This directly affects the claim that the three-stage recipe closes both gaps without new failure modes (e.g., collapsed traces or reward hacking).
minor comments (1)
- The throughput multiplier (2.4–3.5×) is reported only at 'the model's reasoning output length'; a table or plot versus output length would clarify the practical operating range.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for explicit quantitative support in the experiments and ablations sections. We agree these details are necessary to substantiate the claims of near-parity recovery and the contribution of each stage. In the revised manuscript we will add the requested NDCG@10 values, validity rates, per-stage deltas, error bars, and dataset statistics. We address each major comment below.
Point-by-point responses
- Referee: [Experiments section, main results paragraph] The central claim of 'near-parity' recovery and 'CFT recovers most of the conversion gap' is load-bearing yet unsupported by any numeric values (NDCG@10, validity rate, or per-stage deltas), error bars, or dataset statistics (candidate list size, reasoning length). Without these, it is impossible to verify that the residual invalid rate is zero or that RL introduces no masked regressions.
  Authors: We agree the main results paragraph requires explicit numeric support. The revised version will report NDCG@10 for the AR baseline (GR2) and for Diffusion-GR2 after each stage, validity rates (fraction of valid permutations emitted), per-stage deltas, and standard-error bars computed over three random seeds. We will also state the candidate list size (100) and average reasoning trace length used on Amazon Beauty. These additions will allow direct verification that the final invalid rate is near zero and that the RL stage produces no regressions relative to OPD.
  Revision: yes
- Referee: [Ablations paragraph] The statement that 'on-policy distillation further closes it to the AR reference' lacks per-stage validity percentages and accuracy deltas after CFT vs. after OPD vs. after RL. This directly affects the claim that the three-stage recipe closes both gaps without new failure modes (e.g., collapsed traces or reward hacking).
  Authors: We accept that the current ablations paragraph is insufficiently quantitative. The revision will include a table (or expanded text) listing validity rate and NDCG@10 after CFT alone, after CFT+OPD, and after CFT+OPD+RL, each compared against the AR reference. We will also report qualitative checks (trace length distribution, absence of reward-hacking artifacts) to confirm no new failure modes were introduced. This will make the incremental contribution of OPD and RL verifiable.
  Revision: yes
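The metrics promised in the responses above, NDCG@10 and standard error over three seeds, can be computed as in the following sketch. It assumes binary relevance and a ranking that is already a valid permutation; the function names are hypothetical, and no values from the paper are used.

```python
import math
from statistics import mean, stdev


def ndcg_at_k(ranking, relevant, k=10):
    """NDCG@k with binary relevance (an assumption made for this sketch).

    `ranking` is the emitted order of item identifiers and `relevant` is the
    set of ground-truth items for the user.
    """
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranking[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def mean_and_standard_error(per_seed_scores):
    """Mean and standard error of the mean over independent seeds."""
    n = len(per_seed_scores)
    return mean(per_seed_scores), stdev(per_seed_scores) / math.sqrt(n)
```

With only three seeds the standard error is itself a noisy estimate, so per-stage deltas smaller than the error bar would not by themselves show that one stage improves on another.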
Circularity Check
No circularity: empirical conversion recipe with experimental validation
The paper describes a three-stage empirical conversion process (CFT to enforce valid permutations, OPD for on-policy trajectories, RL for reward optimization) from an existing AR re-ranker to a block-diffusion model, claiming near-parity recovery on Amazon Beauty via experiments and ablations. No equations, fitted parameters renamed as predictions, self-definitional constructs, uniqueness theorems, or ansatzes appear in the abstract or described method. The central result is an empirical outcome (throughput gain with accuracy recovery) rather than a derivation that reduces to its inputs by construction; the AR baseline is treated as an external starting point, not a self-referential fit. This is a standard non-circular empirical methods paper.