Diffusion-GR2: Diffusion Generative Reasoning Re-ranker

Adam (Yang) Song; Chonglin Sun; Fei Tian; Frank Shyu; Kangqi Ni; Luke Simon; Mingfu Liang; Sandeep Pandey; Tianlong Chen; Xiaohan Wei

arxiv: 2607.01170 · v1 · pith:7DHPD5QZnew · submitted 2026-07-01 · 💻 cs.IR · cs.AI

Diffusion-GR2: Diffusion Generative Reasoning Re-ranker

Zhuoxuan Zhang , Kangqi Ni , Yuhang Chen , Mingfu Liang , Xiaohan Wei , Yunchen Pu , Fei Tian , Chonglin Sun

show 6 more authors

Frank Shyu Adam (Yang) Song Sandeep Pandey Luke Simon Tianlong Chen Xi Liu

This is my paper

Pith reviewed 2026-07-02 06:23 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords diffusion modelsgenerative re-rankingreasoning re-rankersblock diffusionon-policy distillationrecommendation systemsinference accelerationpermutation generation

0 comments

The pith

A three-stage recipe converts an autoregressive reasoning re-ranker into a block-diffusion model that recovers near-parity accuracy while decoding in parallel.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive generative re-rankers produce accurate rankings by first emitting long chains of reasoning, but this requires one forward pass per token and makes inference slow. Block-diffusion models can decode many positions at once over few steps and therefore run faster, yet they produce invalid rankings such as duplicates or omissions because positions are denoised independently. Diffusion-GR2 closes the resulting structural and distributional gaps through conversion fine-tuning that forces the model to emit valid permutations on its own, followed by on-policy distillation that trains on trajectories the model itself generates, and a final reinforcement-learning stage that optimizes a re-ranking reward. On the Amazon Beauty dataset the resulting model reaches accuracy close to the original autoregressive re-ranker while delivering 2.4 to 3.5 times higher decode throughput at the full reasoning length.

Core claim

Diffusion-GR2 converts an AR reasoning re-ranker into a block-diffusion re-ranker by first applying conversion fine-tuning so the model learns to denoise answer tokens into valid permutations without external constraints, then performing on-policy distillation that supplies dense targets from the AR teacher on the model's own decoded trajectories, and finally running reinforcement learning against a re-ranking reward; the combination closes both the structural gap of invalid rankings and the distributional gap of off-policy training, yielding near-parity accuracy with the AR reference.

What carries the argument

The three-stage conversion pipeline of conversion fine-tuning, on-policy distillation, and reinforcement learning that adapts an AR-initialized diffusion model to produce valid on-policy rankings.

If this is right

Conversion fine-tuning alone recovers most of the conversion gap between AR and diffusion decoding.
On-policy distillation further reduces the remaining gap to the AR reference accuracy.
Block-parallel decoding yields a 2.4–3.5× increase in throughput measured at the model's full reasoning output length.
The final RL stage optimizes directly against the re-ranking reward on top of the distilled policy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged adaptation might transfer to other generative tasks that require strict output validity, such as code or plan generation.
If the on-policy stage generalizes beyond recommendation, it could reduce dependence on large teacher models for distillation in diffusion settings.
The observed throughput gains suggest diffusion-based re-rankers could support interactive recommendation interfaces where AR latency is currently prohibitive.

Load-bearing premise

The combination of conversion fine-tuning, on-policy distillation, and RL can close both the structural gap of invalid permutations and the distributional gap of off-policy training without new failure modes or accuracy regressions.

What would settle it

An evaluation on Amazon Beauty after all three stages in which the diffusion model still emits a substantial fraction of invalid rankings or its final accuracy remains more than a few percent below the AR re-ranker.

read the original abstract

Generative reasoning re-rankers achieve strong recommendation accuracy by emitting a chain-of-thought before re-ordering a candidate list, but they are slow at inference: an autoregressive (AR) decoder spends one sequential forward pass per reasoning token, and the reasoning trace far exceeds the ranking it produces. To reduce this cost, block-diffusion language models decode many positions in parallel over a few denoising steps and are substantially faster, yet naively converting an AR re-ranker into one opens two accuracy gaps: (1) a structural gap: answer positions are denoised in parallel and scored independently, so the decoder emits invalid rankings (duplicated, dropped, or out-of-set identifiers) that AR avoids through left-to-right masking; and (2) a distributional gap: fine-tuning the converted model on fixed teacher trajectories is off-policy relative to its own decoding at inference, leaving a residual accuracy gap. To close both gaps while keeping the speedup, we propose \textbf{Diffusion-GR2}, a recipe that converts our AR reasoning re-ranker (GR2) into a block-diffusion re-ranker. First, conversion fine-tuning (CFT) adapts the AR-initialized diffusion model to denoise the answer into a valid permutation on its own, without an external constrained decoder. Next, on-policy distillation (OPD) then supervises the model on its own decoded trajectories with dense per-token targets from the AR teacher. Finally, we apply a reinforcement-learning (RL) stage against a re-ranking reward on top of OPD's on-policy policy. Experiments on Amazon Beauty demonstrate that Diffusion-GR2 recovers to near-parity with the AR re-ranker, while block-parallel decoding raises decode throughput by $2.4$--$3.5\times$ at the model's reasoning output length. Ablations show that CFT recovers most of the conversion gap, and that on-policy distillation further closes it to the AR reference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Diffusion-GR2 shows a three-stage conversion from AR reasoning re-rankers to block diffusion that targets validity and off-policy gaps, with claimed near-parity on one dataset and 2.4-3.5x speedup, but the abstract gives no numbers to judge how well it actually works.

read the letter

The main thing here is a practical recipe to turn an autoregressive generative reasoning re-ranker into a block-diffusion version without an external decoder. The three stages are conversion fine-tuning to produce valid permutations, on-policy distillation on the model's own trajectories, and then RL against a re-ranking reward. Experiments on Amazon Beauty are said to recover near the AR baseline while delivering the throughput gain from parallel decoding.

What is new is the specific ordering and combination of those stages to handle both the structural problem (invalid rankings from independent denoising) and the distributional problem (off-policy training). The paper does a clean job naming the two gaps and showing through ablations that CFT handles most of the conversion loss and OPD closes more of it.

The soft spots are the missing numbers. The abstract states recovery to near-parity and gives the speedup range, but supplies no accuracy deltas, no validity rates after each stage, no error bars, and no checks for side effects such as reduced ranking diversity or RL-induced regressions. Results are reported on only one dataset. If the full paper does not quantify those points or test on additional collections, the central claim stays hard to evaluate.

This is for IR researchers who need faster inference for generative re-rankers in recommendation settings. A reader working on diffusion models for structured outputs or on efficient ranking would find the conversion steps worth looking at.

It deserves peer review. The motivation and method are concrete enough that referees can check whether the stages actually close the gaps as described.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Diffusion-GR2, a three-stage conversion recipe (conversion fine-tuning (CFT), on-policy distillation (OPD), and RL) that adapts an autoregressive generative reasoning re-ranker (GR2) to a block-diffusion decoder. It claims the recipe closes the structural gap (invalid permutations from parallel denoising) and distributional gap (off-policy training), recovering near-parity accuracy with the AR baseline on Amazon Beauty while delivering 2.4–3.5× throughput via block-parallel decoding; ablations attribute most recovery to CFT and further gains to OPD.

Significance. If the empirical claims hold with quantified validity rates and no hidden regressions, the work would be significant for practical deployment of generative re-rankers: it shows how diffusion models can be adapted for permutation-structured outputs without external constrained decoders, while preserving reasoning traces. The explicit three-stage recipe and stage-wise ablation constitute a reusable template for AR-to-diffusion conversion in IR tasks.

major comments (2)

[Experiments section] Experiments section (main results paragraph): the central claim of 'near-parity' recovery and 'CFT recovers most of the conversion gap' is load-bearing yet unsupported by any numeric values (NDCG@10, validity rate, or per-stage deltas), error bars, or dataset statistics (candidate list size, reasoning length). Without these, it is impossible to verify that residual invalid rate is zero or that RL introduces no masked regressions.
[Ablations paragraph] Ablations paragraph: the statement that 'on-policy distillation further closes it to the AR reference' lacks per-stage validity percentages and accuracy deltas after CFT vs. after OPD vs. after RL. This directly affects the claim that the three-stage recipe closes both gaps without new failure modes (e.g., collapsed traces or reward hacking).

minor comments (1)

The throughput multiplier (2.4–3.5×) is reported only at 'the model's reasoning output length'; a table or plot versus output length would clarify the practical operating range.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for explicit quantitative support in the experiments and ablations sections. We agree these details are necessary to substantiate the claims of near-parity recovery and the contribution of each stage. In the revised manuscript we will add the requested NDCG@10 values, validity rates, per-stage deltas, error bars, and dataset statistics. We address each major comment below.

read point-by-point responses

Referee: [Experiments section] Experiments section (main results paragraph): the central claim of 'near-parity' recovery and 'CFT recovers most of the conversion gap' is load-bearing yet unsupported by any numeric values (NDCG@10, validity rate, or per-stage deltas), error bars, or dataset statistics (candidate list size, reasoning length). Without these, it is impossible to verify that residual invalid rate is zero or that RL introduces no masked regressions.

Authors: We agree the main results paragraph requires explicit numeric support. The revised version will report NDCG@10 for the AR baseline (GR2) and for Diffusion-GR2 after each stage, validity rates (fraction of valid permutations emitted), per-stage deltas, and standard-error bars computed over three random seeds. We will also state the candidate list size (100) and average reasoning trace length used on Amazon Beauty. These additions will allow direct verification that the final invalid rate is near zero and that the RL stage produces no regressions relative to OPD. revision: yes
Referee: [Ablations paragraph] Ablations paragraph: the statement that 'on-policy distillation further closes it to the AR reference' lacks per-stage validity percentages and accuracy deltas after CFT vs. after OPD vs. after RL. This directly affects the claim that the three-stage recipe closes both gaps without new failure modes (e.g., collapsed traces or reward hacking).

Authors: We accept that the current ablations paragraph is insufficiently quantitative. The revision will include a table (or expanded text) listing validity rate and NDCG@10 after CFT alone, after CFT+OPD, and after CFT+OPD+RL, each compared against the AR reference. We will also report qualitative checks (trace length distribution, absence of reward-hacking artifacts) to confirm no new failure modes were introduced. This will make the incremental contribution of OPD and RL verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical conversion recipe with experimental validation

full rationale

The paper describes a three-stage empirical conversion process (CFT to enforce valid permutations, OPD for on-policy trajectories, RL for reward optimization) from an existing AR re-ranker to a block-diffusion model, claiming near-parity recovery on Amazon Beauty via experiments and ablations. No equations, fitted parameters renamed as predictions, self-definitional constructs, uniqueness theorems, or ansatzes appear in the abstract or described method. The central result is an empirical outcome (throughput gain with accuracy recovery) rather than a derivation that reduces to its inputs by construction; the AR baseline is treated as an external starting point, not a self-referential fit. This is a standard non-circular empirical methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. Implicit modeling assumptions (valid permutation output, reward signal for RL) are not quantified.

pith-pipeline@v0.9.1-grok · 5933 in / 1244 out tokens · 18960 ms · 2026-07-02T06:23:06.210344+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 17 canonical work pages · 11 internal anchors

[1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations, volume 2024, pages 21246–21263,

2024
[2]

Block diffusion: Interpolating between autoregressive and diffusion language models

Marianne Arriola, Aaron Gokaslan, Justin Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InInternational Conference on Learning Representations, volume 2025, pages 50726–50753, 2025a. Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Jiaqi H...

2025
[3]

Accelerating Large Language Model Decoding with Speculative Sampling

CharlieChen, SebastianBorgeaud, GeoffreyIrving, Jean-BaptisteLespiau, LaurentSifre, andJohnJumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Plum: Adapting pre-trained language models for industrial-scale generative recommendations.arXiv preprint arXiv:2510.07784,

Ruining He, Lukasz Heldt, Lichan Hong, Raghunandan Keshavan, Shifan Mao, Nikhil Mehta, Zhengyang Su, Alicia Tsai, Yueqi Wang, Shao-Chuan Wang, et al. Plum: Adapting pre-trained language models for industrial-scale generative recommendations.arXiv preprint arXiv:2510.07784,

work page arXiv
[6]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP),

2016
[8]

Mercury: Ultra-Fast Language Models Based on Diffusion

Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, et al. Mercury: Ultra-fast language models based on diffusion.arXiv preprint arXiv:2506.17298, 2025.https://www.inceptionlabs.ai/blog/introducing-mercury. Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Woo...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

GR2: Generative Reasoning Re-ranker

Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, et al. Generative reasoning re-ranker.arXiv preprint arXiv:2602.07774,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Onerec-think: In-text reasoning for generative recommendation.CoRR, abs/2510.11639,

Zhanyu Liu, Shiyao Wang, Xingmei Wang, Rongzhou Zhang, Jiaxin Deng, Honghui Bao, Jinghao Zhang, Wuchao Li, Pengfei Zheng, Xiangyu Wu, Yifei Hu, Qigen Hu, Xinchen Luo, Lejian Ren, Zixing Zhang, Qianqian Wang, Kuo Cai, Yunfan Wu, Hongtao Cheng, Zexuan Cheng, Lu Ren, Huanjie Wang, Yi Su, Ruiming Tang, Kun Gai, and Guorui Zhou. Onerec-think: In-text reasoning...

work page arXiv
[11]

Onerec-think: In-text reasoning for generative recommendation.CoRR, abs/2510.11639,

doi: 10.48550/ARXIV.2510.11639.https://doi.org/10.48550/arXiv.2510.11639. Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. InInternational Conference on Machine Learning (ICML),

work page doi:10.48550/arxiv.2510.11639.https://doi.org/10.48550/arxiv.2510.11639
[12]

https://thinkingmachines.ai/blog/on-policy-distillation

doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-based recommendations on styles and substitutes. InProceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval,

work page doi:10.64434/tml.20251026
[13]

Improving language understanding by generative pre-training

Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018.https: //api.semanticscholar.org/CorpusID:49313245. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.https://api.semanticscholar.org/CorpusID:160025533. Shashank R...

2018
[14]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, and Shuiwang Ji. Data-efficient autoregressive-to-diffusion language models via on-policy distillation.arXiv preprint arXiv:2606.06712,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Revolutionizing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949, 2025

Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, and Mengdi Wang. Revolutionizing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949,

work page arXiv
[18]

Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328,

work page arXiv
[19]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Dream 7B: Diffusion Large Language Models

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Rearank: Reasoning re-ranking agent via reinforcement learning

Le Zhang, Bo Wang, Xipeng Qiu, Siva Reddy, and Aishwarya Agrawal. Rearank: Reasoning re-ranking agent via reinforcement learning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2458–2471, 2025a. Yihua Zhang, Xi Liu, Xihuan Zeng, Mingfu Liang, Jiyan Yang, Rong Jin, Wen-Yen Chen, Yiping Han, Hao Ma, Bo Long, ...

2025
[23]

Llada 1.5: Variance-reduced preference optimization for large language diffusion models

Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Llada 1.5: Variance-reduced preference optimization for large language diffusion models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11425–11460, 2026

2026

[1] [1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations, volume 2024, pages 21246–21263,

2024

[2] [2]

Block diffusion: Interpolating between autoregressive and diffusion language models

Marianne Arriola, Aaron Gokaslan, Justin Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InInternational Conference on Learning Representations, volume 2025, pages 50726–50753, 2025a. Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Jiaqi H...

2025

[3] [3]

Accelerating Large Language Model Decoding with Speculative Sampling

CharlieChen, SebastianBorgeaud, GeoffreyIrving, Jean-BaptisteLespiau, LaurentSifre, andJohnJumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Plum: Adapting pre-trained language models for industrial-scale generative recommendations.arXiv preprint arXiv:2510.07784,

Ruining He, Lukasz Heldt, Lichan Hong, Raghunandan Keshavan, Shifan Mao, Nikhil Mehta, Zhengyang Su, Alicia Tsai, Yueqi Wang, Shao-Chuan Wang, et al. Plum: Adapting pre-trained language models for industrial-scale generative recommendations.arXiv preprint arXiv:2510.07784,

work page arXiv

[6] [6]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP),

2016

[8] [8]

Mercury: Ultra-Fast Language Models Based on Diffusion

Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, et al. Mercury: Ultra-fast language models based on diffusion.arXiv preprint arXiv:2506.17298, 2025.https://www.inceptionlabs.ai/blog/introducing-mercury. Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Woo...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

GR2: Generative Reasoning Re-ranker

Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, et al. Generative reasoning re-ranker.arXiv preprint arXiv:2602.07774,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Onerec-think: In-text reasoning for generative recommendation.CoRR, abs/2510.11639,

Zhanyu Liu, Shiyao Wang, Xingmei Wang, Rongzhou Zhang, Jiaxin Deng, Honghui Bao, Jinghao Zhang, Wuchao Li, Pengfei Zheng, Xiangyu Wu, Yifei Hu, Qigen Hu, Xinchen Luo, Lejian Ren, Zixing Zhang, Qianqian Wang, Kuo Cai, Yunfan Wu, Hongtao Cheng, Zexuan Cheng, Lu Ren, Huanjie Wang, Yi Su, Ruiming Tang, Kun Gai, and Guorui Zhou. Onerec-think: In-text reasoning...

work page arXiv

[11] [11]

Onerec-think: In-text reasoning for generative recommendation.CoRR, abs/2510.11639,

doi: 10.48550/ARXIV.2510.11639.https://doi.org/10.48550/arXiv.2510.11639. Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. InInternational Conference on Machine Learning (ICML),

work page doi:10.48550/arxiv.2510.11639.https://doi.org/10.48550/arxiv.2510.11639

[12] [12]

https://thinkingmachines.ai/blog/on-policy-distillation

doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-based recommendations on styles and substitutes. InProceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval,

work page doi:10.64434/tml.20251026

[13] [13]

Improving language understanding by generative pre-training

Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018.https: //api.semanticscholar.org/CorpusID:49313245. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.https://api.semanticscholar.org/CorpusID:160025533. Shashank R...

2018

[14] [14]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, and Shuiwang Ji. Data-efficient autoregressive-to-diffusion language models via on-policy distillation.arXiv preprint arXiv:2606.06712,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Revolutionizing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949, 2025

Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, and Mengdi Wang. Revolutionizing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949,

work page arXiv

[18] [18]

Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328,

work page arXiv

[19] [19]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Dream 7B: Diffusion Large Language Models

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Rearank: Reasoning re-ranking agent via reinforcement learning

Le Zhang, Bo Wang, Xipeng Qiu, Siva Reddy, and Aishwarya Agrawal. Rearank: Reasoning re-ranking agent via reinforcement learning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2458–2471, 2025a. Yihua Zhang, Xi Liu, Xihuan Zeng, Mingfu Liang, Jiyan Yang, Rong Jin, Wen-Yen Chen, Yiping Han, Hao Ma, Bo Long, ...

2025

[23] [23]

Llada 1.5: Variance-reduced preference optimization for large language diffusion models

Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Llada 1.5: Variance-reduced preference optimization for large language diffusion models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11425–11460, 2026

2026