pith. machine review for the scientific record.

arxiv: 2605.13935 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 04:56 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords diffusion language models · trajectory balance · post-training · trajectory locking · reward maximization · mathematical reasoning · code generation

The pith

A trajectory-balance objective stops diffusion language models from locking onto narrow denoising paths during post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion language models lose coverage of valid solutions when post-trained with standard reward-maximizing updates, because sampled trajectories concentrate on a small set of denoising paths. TraFL replaces this with a trajectory-balance objective that tilts the model toward higher-reward outputs while keeping the distribution anchored to a frozen reference model. The method is made practical for diffusion models through a sequence-level surrogate loss and a learned prompt-dependent normalization term. Experiments across mathematical reasoning and code generation benchmarks show that TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, and that the gains persist when more samples are drawn at inference time. The same pattern holds on held-out evaluations such as Minerva Math and every difficulty split of LiveCodeBench.

Core claim

Standard reward-driven post-training for diffusion language models produces trajectory locking by over-concentrating probability mass on narrow sets of denoising paths and thereby reducing coverage of alternative correct solutions. TraFL corrects this by training the policy to match a reward-tilted target distribution that is anchored to a frozen reference model. The objective is realized with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. On mathematical reasoning and code generation tasks, TraFL is the only evaluated post-training method that raises performance above the base model in every benchmark-length setting, with gains that persist as the sampling budget increases.

What carries the argument

TraFL, a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model, implemented via a diffusion-compatible sequence-level surrogate and learned prompt-dependent normalization.
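For orientation, a minimal sketch of what a trajectory-balance objective of this shape looks like in the GFlowNet-style literature the paper draws on (cf. references [4] and [28]). The temperature β and the learned normalizer Z_φ(x) are illustrative symbols, not necessarily the paper's notation:

```latex
% Reward-tilted target, anchored to the frozen reference policy:
%   \pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, e^{r(x,y)/\beta}
% Trajectory-balance-style squared residual with a learned,
% prompt-dependent normalizer Z_{\phi}(x):
\mathcal{L}_{\mathrm{TB}}(\theta,\phi)
  = \mathbb{E}_{x,\, y \sim \pi_{\theta}}\Big[
      \big( \log Z_{\phi}(x) + \log \pi_{\theta}(y \mid x)
        - \log \pi_{\mathrm{ref}}(y \mid x) - r(x,y)/\beta \big)^{2}
    \Big]
```

Driving this residual to zero forces π_θ(y|x) ∝ π_ref(y|x) e^{r(x,y)/β} on sampled outputs, which is the anchored, non-mode-seeking behavior described above; the diffusion-specific work is in estimating the sequence-level log-probabilities that appear inside the square.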

If this is right

  • Diffusion language models can receive post-training that raises reasoning accuracy while preserving coverage of multiple valid solution paths.
  • Performance advantages from TraFL remain or grow when inference draws more samples, unlike mode-seeking alternatives.
  • Improvements transfer to held-out mathematical and code benchmarks without retraining.
  • TraFL is the only tested post-training approach that beats the base model uniformly across all length settings and sampling budgets examined.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same balancing principle could be tested on other non-autoregressive generators that suffer from path concentration.
  • A reference-model anchor may allow TraFL to be combined with existing alignment pipelines without extra hyper-parameter search.
  • Practitioners could apply the method to tasks that reward multiple distinct outputs rather than a single best answer.
  • Scaling the learned normalization to longer contexts or larger models would test whether the surrogate remains faithful.

Load-bearing premise

The sequence-level surrogate and prompt-dependent normalization accurately approximate the trajectory-balance objective without creating new collapse modes or needing task-specific tuning.
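To make the premise concrete, a minimal sketch of how such a surrogate could be wired up, assuming one-sample ELBO-style estimates of sequence log-probabilities under a shared masking schedule and a small MLP head for the prompt-dependent normalizer. All names, shapes, and the β temperature are hypothetical, not the paper's implementation:

```python
import torch
import torch.nn as nn

class PromptNormalizer(nn.Module):
    """Hypothetical head predicting log Z_phi(x) from a pooled prompt embedding."""

    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.Tanh(), nn.Linear(d_model, 1)
        )

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        # (B, d_model) -> (B,)
        return self.mlp(prompt_emb).squeeze(-1)

def tb_surrogate_loss(
    logp_theta: torch.Tensor,  # (B,) sequence-level log-prob estimate, policy
    logp_ref: torch.Tensor,    # (B,) same estimate under the frozen reference
    rewards: torch.Tensor,     # (B,) scalar task rewards for decoded outputs
    log_z: torch.Tensor,       # (B,) learned prompt-dependent log Z_phi(x)
    beta: float = 1.0,         # reward temperature (assumed hyper-parameter)
) -> torch.Tensor:
    """Squared trajectory-balance residual at the sequence level.

    The reference model stays frozen, so gradients flow only through
    logp_theta and log_z; whether this estimator stays faithful to the
    true trajectory-balance objective is exactly the premise above.
    """
    residual = log_z + logp_theta - logp_ref - rewards / beta
    return (residual ** 2).mean()
```

If the ELBO estimates are biased differently for the policy and the reference, the residual can absorb that bias into Z_φ(x), which is one way the premise could fail quietly.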

What would settle it

If TraFL performance falls below the base model on a held-out benchmark once the sampling budget exceeds a modest threshold while other methods continue to improve, the claim that gains persist with increased sampling would be falsified.
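This test is mechanical to run with the standard unbiased Pass@k estimator; a sketch, with the benchmark harness around it assumed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def gap_curve(c_method: int, c_base: int, n: int = 16):
    """Method-minus-base Pass@k gap as the sampling budget k grows.
    The persistence claim predicts a positive gap at every k <= n."""
    return {k: pass_at_k(n, c_method, k) - pass_at_k(n, c_base, k)
            for k in (1, 2, 4, 8, 16)}
```

A gap curve that crosses zero at moderate k on a held-out benchmark would be the falsifying observation.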

Figures

Figures reproduced from arXiv: 2605.13935 by Prasanna Parthasarathi, Saba Ahmadi, Yufei Cui.

Figure 1
Figure 1. TraFL improves over the base model and strong prior post-training methods across sampling budgets and temperatures. (a) Average Pass@k of TraFL on GSM8K, MATH-500, HumanEval, and MBPP for T ∈ {0.3, 0.6, 0.9}. (b) Pass@k gap to LLaDA-8B-Instruct under matched decoding. (c) Baseline comparison at T = 0.6. TraFL leads ESPO, JustGRPO, and the base model from Pass@1 through Pass@16. All results use n = 16 sampl… view at source ↗
Figure 2
Figure 2. TraFL improves Pass@5 across math and coding benchmarks at comparable denoising cost. Pass@5 on GSM8K, MATH-500, HumanEval, and MBPP at maximum completion lengths 256 and 512 (n = 16, T = 0.6). Error bars show standard error of the learned policies. LLaDA-8B-Instruct, ESPO, and TraFL use comparable denoising budgets, while JustGRPO terminates much earlier without consistent Pass@5 gains. Shorter denoising… view at source ↗
Figure 3
Figure 3. Gains transfer to held-out math and coding benchmarks. Pass@5 on Minerva Math (left) using the GSM8K-trained and MATH-trained checkpoints, and on LiveCodeBench (right) by difficulty split, both at maximum completion lengths 256 and 512. On Minerva Math, TraFL stays above the base model at both lengths and substantially outperforms JustGRPO for both GSM8K-trained and MATH-trained checkpoints. ESPO is strong… view at source ↗
read the original abstract

Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. We make this practical for diffusion language models with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases. The improvements transfer to held-out evaluations: TraFL stays above the base model on Minerva Math and is the strongest method on every LiveCodeBench difficulty split.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies trajectory locking as a failure mode in reward-maximizing post-training of diffusion language models, where updates over-concentrate on narrow denoising paths. It proposes TraFL, a trajectory-balance objective anchored to a frozen reference model, implemented via a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Experiments on math reasoning and code generation benchmarks show TraFL as the only evaluated method that improves over the base model in every benchmark-length setting, with gains persisting at higher sampling budgets and transferring to held-out evaluations like Minerva Math and LiveCodeBench splits.

Significance. If the surrogate faithfully implements trajectory balancing without introducing new collapse modes, the result would be significant for diffusion LM post-training: it provides a non-mode-seeking alternative that improves coverage and robustness to sampling budget, grounded by the frozen reference. The consistent outperformance across settings and transfer to held-out tasks would strengthen the case for trajectory-balance objectives over standard RLHF-style methods in this architecture.

major comments (2)
  1. [§3.2] §3.2 (sequence-level surrogate): the central claim that the diffusion-compatible surrogate plus learned normalization implements the trajectory-balance objective without new collapse modes lacks direct validation. No diagnostics (e.g., KL divergence, coverage metrics, or path-probability comparisons) are reported between the surrogate-induced distribution and the true reward-tilted target, especially for long sequences or out-of-distribution prompts; this makes it unclear whether observed robustness to sampling budget arises from genuine balancing or from the normalization term acting as regularization.
  2. [Table 3, §5.1] Table 3 and §5.1 (benchmark results): the headline result that TraFL alone improves over the base model in every setting depends on the surrogate claim, yet the paper reports no statistical significance tests, exact baseline implementations, or ablation removing the prompt-dependent normalization. Without these, it is difficult to isolate whether gains are load-bearing for trajectory balance or artifacts of the learned normalization.
minor comments (2)
  1. [§3.3] Notation for the learned normalization factor is introduced without an explicit equation number; adding one would clarify its dependence on the prompt.
  2. [Figure 2] Figure 2 caption does not specify the exact sampling budgets used for the coverage curves, making it hard to reproduce the persistence claim.
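On major comment 1: one sample-based form the requested KL diagnostic could take. Relative to the tilted target π*(y|x) ∝ π_ref(y|x) e^{r(x,y)/β} (symbols as in the sketch earlier on this page, illustrative rather than the paper's notation), the KL gap decomposes into a term estimable from policy samples plus a prompt-constant log Z(x):

```latex
D_{\mathrm{KL}}\!\left(\pi_{\theta} \,\|\, \pi^{*}\right)
  = \mathbb{E}_{y \sim \pi_{\theta}}\!\big[ \log \pi_{\theta}(y \mid x)
      - \log \pi_{\mathrm{ref}}(y \mid x) - r(x,y)/\beta \big]
    + \log Z(x)
```

Since log Z(x) does not depend on θ, tracking the Monte-Carlo estimate of the expectation term across training checkpoints would show whether the surrogate moves the policy toward the target, without estimating the partition function.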

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and outline revisions to strengthen the validation of the surrogate and the statistical presentation of results.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (sequence-level surrogate): the central claim that the diffusion-compatible surrogate plus learned normalization implements the trajectory-balance objective without new collapse modes lacks direct validation. No diagnostics (e.g., KL divergence, coverage metrics, or path-probability comparisons) are reported between the surrogate-induced distribution and the true reward-tilted target, especially for long sequences or out-of-distribution prompts; this makes it unclear whether observed robustness to sampling budget arises from genuine balancing or from the normalization term acting as regularization.

    Authors: We acknowledge the value of direct diagnostics. Exact KL divergence over all denoising paths is computationally intractable for long sequences. We instead provide indirect support via consistent gains in coverage and robustness to sampling budget, which would be unlikely under new collapse modes. In revision we will add coverage metrics (unique correct solutions under repeated sampling; a sketch of this diagnostic follows these responses) and path-probability comparisons on representative prompt subsets, including OOD cases, to better separate the contributions of the surrogate and normalization. revision: partial

  2. Referee: [Table 3, §5.1] Table 3 and §5.1 (benchmark results): the headline result that TraFL alone improves over the base model in every setting depends on the surrogate claim, yet the paper reports no statistical significance tests, exact baseline implementations, or ablation removing the prompt-dependent normalization. Without these, it is difficult to isolate whether gains are load-bearing for trajectory balance or artifacts of the learned normalization.

    Authors: We agree these elements would improve clarity. We will add statistical significance tests (bootstrap confidence intervals and paired comparisons) to Table 3. Baseline implementations are detailed in the appendix; we will reference them explicitly in §5.1. The ablation removing prompt-dependent normalization already appears in Appendix B.3 and shows degraded performance and higher variance; we will highlight these results in the main text. These additions will better isolate the trajectory-balance contribution while preserving the headline observation that TraFL is the only method improving over the base model across all settings. revision: yes
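The coverage diagnostic promised in response 1 can be pinned down concretely. A minimal sketch, assuming string-level canonicalization of sampled solutions (a placeholder; deduplicating reasoning paths or programs would need something stronger, such as AST normalization for code):

```python
from typing import Callable, Iterable

def coverage(samples: Iterable[str],
             is_correct: Callable[[str], bool],
             canonicalize: Callable[[str], str] = str.strip) -> int:
    """Unique correct solutions under repeated sampling for one prompt.

    Trajectory locking predicts this count shrinks after mode-seeking
    post-training even when Pass@1 rises; a balanced objective should
    keep it flat or growing as the sample budget increases.
    """
    return len({canonicalize(s) for s in samples if is_correct(s)})
```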

Circularity Check

0 steps flagged

Derivation anchored externally with no reduction to self-fitted inputs

full rationale

The trajectory-balance objective is defined relative to a frozen reference model, and the diffusion-compatible surrogate plus learned normalization is presented as a practical implementation rather than a redefinition of the target. No equations or claims in the provided description reduce the central prediction to a fitted parameter or self-citation chain by construction. The benchmark improvements are reported as empirical outcomes without evident circularity in the derivation steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the assumption that a sequence-level surrogate can approximate trajectory balance in the diffusion setting and that a learned prompt-dependent normalization suffices to stabilize training.

free parameters (1)
  • prompt-dependent normalization
    Learned parameter introduced to make the objective practical for diffusion LMs.
axioms (1)
  • domain assumption: The reward-tilted target distribution can be approximated via the proposed diffusion-compatible sequence-level surrogate loss.
    Invoked to justify the practical implementation of TraFL.

pith-pipeline@v0.9.0 · 5478 in / 1203 out tokens · 26568 ms · 2026-05-15T04:56:08.731453+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 7 internal anchors

  1. [1]

    Natural gradient works efficiently in learning

    Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    Mirror descent and nonlinear projected subgradient methods for convex optimization

    Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003

  4. [4]

    GFlowNet Foundations

    Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J. Hu, Mo Tiwari, and Emmanuel Bengio. GFlowNet foundations. Journal of Machine Learning Research, 24(210):1–55, 2023

  5. [5]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    Practical Nonparametric Statistics

    W. J. Conover. Practical Nonparametric Statistics. John Wiley & Sons, New York, 3rd edition, 1999

  8. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  9. [9]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  10. [10]

    LiveCodeBench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025

  11. [11]

    Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

    Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, and Jean-Francois Chamberland. Reinforcement learning for diffusion LLMs with entropy-guided step selection and stepwise advantages. arXiv preprint arXiv:2603.12554, 2026

  12. [12]

    Solving quantitative reasoning problems with language models, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022

  13. [13]

    Let’s verify step by step, 2023

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023

  14. [14]

    Understanding r1-zero-like training: A critical perspective, 2025

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025

  15. [15]

    The flexibility trap: Why arbitrary order limits reasoning potential in diffusion language models

    Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, and Gao Huang. The flexibility trap: Why arbitrary order limits reasoning potential in diffusion language models. arXiv preprint arXiv:2601.15165, 2026

  16. [16]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025

  17. [17]

    Principled RL for diffusion LLMs emerges from a sequence-level perspective

    Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, and Chongxuan Li. Principled RL for diffusion LLMs emerges from a sequence-level perspective. In The Fourteenth International Conference on Learning Representations, 2026

  18. [18]

    Code llama: Open foundation models for code, 2024

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nico...

  19. [19]

    wd1: Weighted policy optimization for reasoning in diffusion language models

    Xindi Tang, Radu Dolga, Sangwon Yoon, and Ilija Bogunovic. wd1: Weighted policy optimization for reasoning in diffusion language models. arXiv preprint arXiv:2507.08838, 2025

  20. [20]

    SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

    Cheng Wang, Paria Rashidinejad, Dong Su, Shixin Jiang, Siyuan Wang, Shuo Zhao, Cheng Zhou, Sheng Shen, Fan Chen, Tommi Jaakkola, et al. SPG: Sandwiched policy gradient for masked diffusion language models. arXiv preprint arXiv:2510.09541, 2025

  21. [21]

    Revolutionizing reinforcement learning framework for diffusion large language models

    Yifei Wang, Lei Yang, Bowen Li, Yuhang Tian, Kai Shen, and Mengdi Wang. Revolutionizing reinforcement learning framework for diffusion large language models. arXiv preprint arXiv:2509.06949, 2025

  22. [22]

    Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding, 2025

    Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding, 2025

  23. [23]

    KodCode: A diverse, challenging, and verifiable synthetic dataset for coding

    Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. arXiv, 2025

  24. [24]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025

  25. [25]

    AceCoder: Acing coder RL via automated test-case synthesis

    Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. AceCoder: Acing coder RL via automated test-case synthesis. arXiv, abs/2207.01780, 2025

  26. [26]

    d1: Scaling reasoning in diffusion large language models via reinforcement learning

    Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  27. [27]

    LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, et al. LLaDA 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223, 2025

  28. [28]

    FlowRL: Matching reward distributions for LLM reasoning

    Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, and Zhouhan Lin. FlowRL: Matching reward distributions for LLM reasoning. In The ...

  29. [29]

    Enhancing reasoning for diffusion LLMs via distribution matching policy optimization

    Yuchen Zhu, Wei Guo, Jaemoo Choi, Petr Molodyk, Bo Yuan, Molei Tao, and Yongxin Chen. Enhancing reasoning for diffusion LLMs via distribution matching policy optimization, 2026
