Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

Anant Khandelwal; Manish Gupta

arxiv: 2606.04396 · v1 · pith:26LCY3DVnew · submitted 2026-06-03 · 💻 cs.CL

Pith reviewed 2026-06-28 06:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords diffusion language models · reinforcement learning · denoising trace · trajectory-aware RL · block-wise unmasking · value head · CAPR · reasoning tasks

The pith

CAPR turns the denoising trace of diffusion LLMs into block-level rewards that recover tree-search granularity at lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion language models generate by iteratively unmasking tokens, producing a trace of which positions gain confidence and when commitments form. Existing RL methods either apply one outcome reward to the full flat trajectory or pay for expensive tree rollouts that branch partial paths. The paper claims that summarizing this trace into cached path states and redistributing the final reward across blocks according to revealed tokens trains a value head that supplies local supervision. This yields training signals finer than flat rollouts while using less compute than full trees. The result is new state-of-the-art performance for RL-tuned dLLMs on Sudoku, Countdown, GSM8K and Math500 at fixed token budgets.

Core claim

CAPR records path-state and block-progress features under a block-wise unmasking schedule, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights, recovering much of the granularity of tree search while avoiding full tree expansion and reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts.
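To make the redistribution step concrete, here is a minimal sketch in Python. It assumes the split is strictly proportional to the number of tokens revealed in each block; the abstract only says the reward is redistributed "according to the tokens revealed in each block", so the exact rule, the function name `redistribute_reward`, and the even-split fallback are illustrative assumptions rather than the authors' implementation.

```python
from typing import List


def redistribute_reward(final_reward: float, tokens_revealed: List[int]) -> List[float]:
    """Split one outcome reward across blocks in proportion to tokens revealed.

    Proportional splitting is an assumption of this sketch, not a rule
    confirmed by the source.
    """
    if any(n < 0 for n in tokens_revealed):
        raise ValueError("token counts must be non-negative")
    if not tokens_revealed:
        return []
    total = sum(tokens_revealed)
    if total == 0:
        # No tokens revealed in any block: fall back to an even split (assumption).
        return [final_reward / len(tokens_revealed)] * len(tokens_revealed)
    return [final_reward * n / total for n in tokens_revealed]


# Three blocks revealing 8, 16, and 8 tokens share a terminal reward of 1.0.
block_targets = redistribute_reward(1.0, [8, 16, 8])
print(block_targets)  # [0.25, 0.5, 0.25]
```

Under this rule the per-block targets always sum to the final reward, so a value head regressed onto them would see one sparse reward turned into several local ones.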

What carries the argument

Cached-Amortized Path Refinement (CAPR), which summarizes the denoising trace into a compact path state, caches trajectory states for cheap sibling continuations, and trains a block-level value head from redistributed outcome rewards.
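The Figure 2 caption mentions per-position confidence, entropy, and stability as the ingredients of the path state. A rough sketch of how such a summary might be computed from one denoising step follows. The specific definitions used here (top-token probability, Shannon entropy, fraction of positions whose top token is unchanged since the previous step) and the name `path_state` are assumptions for illustration; the page does not give the paper's formulas.

```python
import math
from typing import Dict, List, Optional, Sequence


def path_state(
    probs: Sequence[Sequence[float]],
    prev_tokens: Optional[Sequence[int]] = None,
) -> Dict[str, object]:
    """Summarize one denoising step into a small path state (hypothetical definition).

    probs[i] is the predicted token distribution at position i.
    prev_tokens holds the previous step's top-token prediction per position.
    """
    if not probs:
        raise ValueError("need at least one position")
    tokens: List[int] = []
    confidences: List[float] = []
    entropies: List[float] = []
    for dist in probs:
        top = max(range(len(dist)), key=lambda i: dist[i])
        tokens.append(top)
        confidences.append(dist[top])  # confidence = top-token probability
        entropies.append(-sum(p * math.log(p) for p in dist if p > 0.0))
    if prev_tokens is None:
        stability = 0.0  # nothing to compare against on the first step
    else:
        unchanged = sum(a == b for a, b in zip(tokens, prev_tokens))
        stability = unchanged / len(tokens)
    n = len(probs)
    return {
        "tokens": tokens,
        "mean_confidence": sum(confidences) / n,
        "mean_entropy": sum(entropies) / n,
        "stability": stability,
    }


# Two positions over a two-token vocabulary, across two consecutive steps.
step1 = path_state([[0.9, 0.1], [0.5, 0.5]])
step2 = path_state([[0.95, 0.05], [0.2, 0.8]], prev_tokens=step1["tokens"])
print(step2["stability"])  # 0.5
```

Note that the sketch carries only the previous top-token prediction between steps, which mirrors the caption's description of carrying "only the previous clean-token prediction and path state".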

If this is right

CAPR sets a new state of the art for RL-tuned dLLMs on 4x4 Sudoku, Countdown, GSM8K, and Math500 at 256- and 512-token budgets on both dense and mixture-of-experts LLaDA backbones.
Rollout-generation cost falls to roughly 0.75x flat rollouts and 0.6x tree rollouts under standard settings.
On Sudoku the method matches the strongest tree-structured baseline at less than one third of the per-step compute.
The approach works by training a block-level value head that supplies PPO weights from a single sparse outcome reward.
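As a quick consistency check on the two cost ratios, the arithmetic below works out what they imply about tree versus flat rollouts. It assumes both ratios describe the same CAPR configuration and the same rollout-generation cost measure, which the abstract suggests ("under standard settings") but does not spell out. The symbols $C_{\text{CAPR}}$, $C_{\text{flat}}$, and $C_{\text{tree}}$ are introduced here for illustration.

```latex
C_{\text{CAPR}} \approx 0.75\, C_{\text{flat}}, \qquad
C_{\text{CAPR}} \approx 0.6\, C_{\text{tree}}
\quad\Longrightarrow\quad
C_{\text{tree}} \approx \frac{0.75}{0.6}\, C_{\text{flat}} = 1.25\, C_{\text{flat}}
```

This applies to rollout-generation cost only. The Sudoku figure of less than one third is stated in per-step compute, a different measure, so the two numbers should not be combined.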

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same trace-to-block-reward mapping could be tested on other iterative generative models that produce internal confidence sequences during sampling.
If the redistribution rule introduces schedule-dependent bias, performance would degrade when the block size or unmasking order changes.
The cached path-state representation may allow combining CAPR with other search methods that operate on partial trajectories.

Load-bearing premise

Redistributing the final outcome reward across blocks according to the tokens revealed in each block under a block-wise unmasking schedule yields accurate local value estimates that approximate full tree supervision without systematic bias.

What would settle it

A direct test on a small task would settle it: expand each block into complete trajectories, record the rewards actually obtained, and compare them with the learned block-level value head's predictions for the same blocks. Low correlation between the two would falsify the accuracy of the redistributed supervision.
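The check described above could be scored with a plain Pearson correlation, sketched below. The sample numbers are hypothetical placeholders, not results from the paper, and the choice of Pearson correlation (rather than, say, a rank correlation) is an assumption of this sketch.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between predicted and empirically measured block values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: value-head predictions for four blocks, and the mean reward
# observed when each block is expanded into complete trajectories.
predicted = [0.10, 0.40, 0.35, 0.80]
empirical = [0.00, 0.50, 0.25, 1.00]
print(f"correlation: {pearson(predicted, empirical):.3f}")
```

A value near 1 would support the redistributed supervision on that task; a value near 0 would count against it.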

Figures

Figures reproduced from arXiv: 2606.04396 by Anant Khandelwal, Manish Gupta.

**Figure 2.** Overview of CAPR. (a) Trace State / Cache & Steer: CAPR summarizes per-position confidence, entropy, and stability into a path state, then uses it to steer the next reverse step while carrying only the previous clean-token prediction and path state. (b) Branch & Prune: a shared denoising prefix is forked once at the branch step, and siblings are kept by path-state quality. (c) Block Critic: a value head re…

**Figure 3.** Training reward for the eight CAPR runs.

**Figure 4.** Sudoku training reward under progressive…

**Figure 5.** Peak Sudoku reward by ablation configuration.

**Figure 6.** Sudoku stability diagnostics. Left: self…

**Figure 7.** Dense-backbone diagnostics. Reference KL…

**Figure 8.** Block Critic loss and self-distillation NLL for…

**Figure 9.** Training reward curves for CAPR compared with…

**Figure 10.** A case study comparing LLaDA-8B-Instruct trained with…

Original abstract

Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CAPR's use of the denoising trace for block-wise reward redistribution is the new piece, but it rests on an untested assumption that the redistribution avoids bias from the unmasking schedule.

The key point is that CAPR claims to recover tree-search granularity in dLLM RL at a fraction of the cost by redistributing the final reward across blocks using the denoising trace.

This combination of cached sibling states, block-level value head, and trace-based redistribution is not in the prior work they cite.

The paper does a solid job explaining the limitations of existing flat and tree methods and proposing a practical alternative that leverages the parallel unmasking process already happening in dLLMs.

On the downside, the central mechanism, redistributing reward based on tokens revealed per block, could easily introduce bias if the unmasking schedule favors certain positions or if caching misses important variance. The abstract reports specific cost savings and new SOTA results, but without error bars, ablations, or an account of how the schedule was chosen, those numbers are hard to trust yet.

The math seems straightforward from the description, but the empirical claims need the full paper to verify.

This is for specialists in diffusion-based generation and RL fine-tuning for math and reasoning tasks. A reader in that area would find the idea worth testing even if the current results are preliminary.

It deserves peer review to see whether the bias concern is real or if their implementation avoids it.

Referee Report

2 major / 2 minor

Summary. The paper introduces CAPR (Cached-Amortized Path Refinement), an RL algorithm for diffusion LLMs that summarizes the denoising trace into a path state, uses cached states for sibling continuations, and redistributes the terminal outcome reward across blocks proportional to tokens revealed under a block-wise unmasking schedule to train a block-level value head for PPO. It claims this recovers much of tree-search granularity at reduced cost (roughly 0.75x flat rollouts and 0.6x tree rollouts) while setting SOTA for RL-tuned dLLMs on 4x4 Sudoku, Countdown, GSM8K, and Math500 across dense and MoE LLaDA backbones at 256- and 512-token budgets.

Significance. If the reward-redistribution mechanism produces value estimates sufficiently close to explicit tree supervision without systematic bias, the work would offer a practical efficiency gain for fine-grained RL in dLLMs, lowering the barrier to tree-like training signals. The caching and trace-summarization ideas are a concrete algorithmic contribution that could generalize beyond the reported tasks.

major comments (2)

[Abstract / method description] The central mechanism—redistributing the final outcome reward across blocks according to tokens revealed in each block under the chosen unmasking schedule—is presented as yielding local value estimates that approximate tree-rollout supervision. No comparison to explicit tree-search value estimates, no ablation on schedule parameters, and no analysis of potential bias from early-block credit assignment or caching of sibling states is provided to support this equivalence, which is required for the claimed 0.6–0.75× cost reduction with preserved granularity.
[Results (implied from abstract claims)] The SOTA and cost-reduction claims (0.75x flat, 0.6x tree; matching strongest tree baseline on Sudoku at <1/3 per-step compute) are stated without error bars, statistical significance tests, or ablation tables isolating the contribution of the block-wise value head versus the caching mechanism.

minor comments (2)

The phrase 'under standard settings' for the cost ratios is undefined; explicit hyperparameter values or a reference table for the unmasking schedule and cache size would improve reproducibility.
Notation for 'path-state and block-progress features' is introduced without an accompanying equation or pseudocode snippet, making the exact input to the value head unclear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, outlining clarifications from the manuscript and planned revisions to strengthen the supporting evidence for the method's claims.

Referee: [Abstract / method description] The central mechanism—redistributing the final outcome reward across blocks according to tokens revealed in each block under the chosen unmasking schedule—is presented as yielding local value estimates that approximate tree-rollout supervision. No comparison to explicit tree-search value estimates, no ablation on schedule parameters, and no analysis of potential bias from early-block credit assignment or caching of sibling states is provided to support this equivalence, which is required for the claimed 0.6–0.75× cost reduction with preserved granularity.

Authors: The manuscript motivates the redistribution mechanism via the denoising trace properties and demonstrates its utility through SOTA results on four benchmarks with the stated compute savings. We agree, however, that direct validation of the approximation quality would strengthen the equivalence claim. In revision we will add a dedicated analysis subsection that (i) compares CAPR block-level value estimates against explicit tree-search values on Sudoku and GSM8K subsets, (ii) ablates block size and unmasking schedule parameters, and (iii) quantifies bias from early-block credit assignment and sibling-state caching. These additions will directly support the reported cost reductions. revision: yes
Referee: [Results (implied from abstract claims)] The SOTA and cost-reduction claims (0.75x flat, 0.6x tree; matching strongest tree baseline on Sudoku at <1/3 per-step compute) are stated without error bars, statistical significance tests, or ablation tables isolating the contribution of the block-wise value head versus the caching mechanism.

Authors: The current manuscript reports point estimates for the performance and compute metrics. We acknowledge that statistical rigor and component isolation are needed. The revised version will include error bars from multiple random seeds, paired statistical significance tests against baselines, and ablation tables that separately measure the block-wise value head and the caching mechanism. These will appear in the main results and appendix. revision: yes
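The kind of reporting promised in this response can be sketched with the Python standard library alone. The snippet computes a mean and sample standard deviation across seeds and a paired t-statistic against a baseline run on the same seeds. The seed scores are hypothetical placeholders, and the paired t-test is only one possible choice of significance test; the response does not say which test the authors intend to use.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across random seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(method: Sequence[float], baseline: Sequence[float]) -> float:
    """t-statistic for per-seed differences (method minus baseline)."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equal-length sequences with at least 2 seeds")
    diffs = [m - b for m, b in zip(method, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Hypothetical accuracies over five seeds; not numbers from the paper.
method_scores = [0.71, 0.69, 0.73, 0.70, 0.72]
baseline_scores = [0.68, 0.67, 0.70, 0.69, 0.69]

mean, std = mean_and_std(method_scores)
print(f"method: {mean:.3f} +/- {std:.3f}")
print(f"paired t-statistic: {paired_t_statistic(method_scores, baseline_scores):.2f}")
```

The t-statistic would then be compared against a Student's t distribution with n - 1 degrees of freedom to obtain a p-value.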

Circularity Check

0 steps flagged

No significant circularity; CAPR is an independent algorithmic proposal

The paper presents CAPR as a new RL algorithm for dLLMs that processes the external denoising trace under a block-wise unmasking schedule to redistribute a terminal reward into block-level value estimates for PPO. No equations, fitted parameters, or self-citations are shown that reduce the claimed supervision or compute savings to the inputs by construction. The method is framed as using the generation trace as an independent signal rather than re-deriving quantities from the same data or prior author results. This is the most common honest finding for algorithmic papers that do not invoke uniqueness theorems or ansatzes from self-citations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated beyond standard RL components. The block-wise schedule and reward redistribution are presented as design choices without quantified fitting details.

free parameters (1)

block-wise unmasking schedule parameters
The method depends on a chosen block-wise schedule whose specific values are not reported in the abstract.

axioms (1)

domain assumption: The denoising trace contains sufficient information to support block-level value estimates approximating tree search
This premise underpins the claim that trace-based supervision can replace full tree expansion.

