Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

Liang Lin; Pengxu Wei; Weijian Deng; Xiangyang Ji; Yuliang Huang; ZiYi Dong

arxiv: 2605.14531 · v2 · pith:SQFVJQUWnew · submitted 2026-05-14 · 💻 cs.CL

ZiYi Dong, Yuliang Huang, Weijian Deng, Xiangyang Ji, Liang Lin, Pengxu Wei

Pith reviewed 2026-05-19 16:23 UTC · model grok-4.3

classification 💻 cs.CL

keywords language generation · optimal control · flow matching · diffusion models · closed-loop control · Hamilton-Jacobi-Bellman · latent space

The pith

Reformulating language generation as stochastic optimal control and approximating the Hamilton-Jacobi-Bellman equation with flow matching produces a closed-loop model that delivers high-fidelity text via efficient parallel sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper recasts language generation as finding optimal control actions in a stochastic dynamical system. It argues that autoregressive and diffusion approaches each run into distinct mathematical obstacles—trajectory singularity, vanishing adjoints, and absent gradients—that force a trade-off between generation quality and sampling speed. By treating the ideal solution as the policy that satisfies the Hamilton-Jacobi-Bellman equation and then approximating that policy with flow matching inside a rectified latent control space, the work constructs a closed-loop controller. A reader would care because the resulting model is claimed to generate coherent text at autoregressive fidelity while sampling tokens in parallel at low computational cost.

Core claim

Viewing language generation as a stochastic optimal control problem reveals that the optimal policy is the closed-loop controller obtained by approximating the Hamilton-Jacobi-Bellman equation. Flow matching serves as the trajectory solver inside the rectified latent control space; the Manta-LM model equipped with a global integral operator thereby approximates the global vector field, simultaneously realizing high-fidelity generation and low-cost parallel sampling while mitigating the efficiency-fidelity paradox, irreversibility error propagation, and optimization intractability.
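
To make the role of flow matching as a trajectory solver concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy latent vector and an idealized straight-line field toward a known target (the conditional optimal-transport field from the flow matching literature with zero terminal noise); `euler_sample`, `straight_field`, and the target values are hypothetical names chosen for illustration. The field is re-evaluated at the current state on every step, which is the closed-loop aspect, and a straight path is integrated exactly even with a single coarse step.

```python
from typing import Callable, List

VectorField = Callable[[List[float], float], List[float]]


def euler_sample(v: VectorField, z0: List[float], num_steps: int) -> List[float]:
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        velocity = v(z, t)  # feedback: the field sees the current state
        z = [zi + dt * vi for zi, vi in zip(z, velocity)]
    return z


def straight_field(target: List[float]) -> VectorField:
    """Idealized field (an assumption, not a learned model): points from
    the current state to `target`, scaled so the state arrives at t=1."""
    def v(z: List[float], t: float) -> List[float]:
        return [(yi - zi) / (1.0 - t) for yi, zi in zip(target, z)]
    return v


if __name__ == "__main__":
    field = straight_field([2.0, 3.0])
    print(euler_sample(field, [0.5, -1.0], num_steps=1))
    print(euler_sample(field, [0.5, -1.0], num_steps=8))
```

A learned field would only approximate this ideal, so the number of steps needed in practice depends on how straight the learned trajectories actually are.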

What carries the argument

Rectified latent control space in which flow matching acts as the optimal trajectory solver to approximate the Hamilton-Jacobi-Bellman equation and realize the closed-loop optimal policy.

If this is right

The model achieves high-fidelity text generation together with efficient, low-cost parallel sampling.
Generation exhibits improved stability, efficiency, and controllability relative to prior autoregressive and diffusion baselines.
Strong empirical results appear on both unconditional language modeling and conditional generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same control-theoretic framing may extend to other autoregressive sequence tasks such as code or music generation.
The closed-loop structure could support dynamic adjustment of generation constraints during sampling without retraining.
Rectification of the latent space may prove reusable as a general technique for turning open-loop diffusion processes into closed-loop controllers.

Load-bearing premise

Flow matching inside the rectified latent control space approximates the true optimal vector field closely enough to preserve both fidelity and efficiency without introducing substantial new errors.

What would settle it

A direct comparison on a standard language-modeling benchmark that reports both perplexity (or an equivalent fidelity metric) and wall-clock sampling time per sequence. The claim would be falsified if the new model cannot do both of the following at once: match autoregressive perplexity, and use substantially fewer serial steps than diffusion baselines.
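
The two quantities in such a comparison can be sketched as follows. This is an illustrative harness under stated assumptions, not a protocol from the paper: `perplexity` expects per-token natural-log probabilities that the caller has already obtained from a model, and `time_sampler` wraps any zero-argument sampling function; both names are hypothetical. Diffusion-style models often report a bound on perplexity or a generative perplexity under an external scorer, so the numbers are only comparable when the same definition is used for every system.

```python
import math
import time
from typing import Callable, Sequence, Tuple


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def time_sampler(sample_fn: Callable[[], object],
                 repeats: int = 5) -> Tuple[float, object]:
    """Return (best wall-clock seconds, last output) over `repeats` calls."""
    best = float("inf")
    out = None
    for _ in range(repeats):
        start = time.perf_counter()
        out = sample_fn()
        best = min(best, time.perf_counter() - start)
    return best, out


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every token.
    print(perplexity([math.log(0.25)] * 8))
    seconds, _ = time_sampler(lambda: sum(range(10_000)))
    print(f"best of 5 runs: {seconds:.6f} s")
```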

Figures

Figures reproduced from arXiv: 2605.14531 by Liang Lin, Pengxu Wei, Weijian Deng, Xiangyang Ji, Yuliang Huang, ZiYi Dong.

**Figure 1.** Generation dynamics. On a non-convex manifold, (a) AR and Diffusion are trapped in a slow, myopic crawl along the high-curvature density ridge. (b) In contrast, our method approximates the global optimal trajectory, bypassing curvature via the rectified latent geometry (energy-minimizing geodesic) for improved efficiency.

**Figure 2.** Visualizing Generative Dynamics and Error Propagation on BVP task. Color from pink to blue denotes generation progress. (a) AR suffers from compounding errors (blue lines in (d)) due to open-loop myopia, drifting off-manifold. (b) Discrete DLM relies on stochastic combinatorial search showing jagged trajectories caused by geometric blindness (lack of gradients). (c) Our Manta-LM acts as an optimal closed…

**Figure 3.** Geometric comparison. Unlike (a) Autoregressive models’ serial paths or (b-c) Diffusion baselines’ high-curvature trajectories in ill-conditioned spaces, (d) Ours operates on a rectified latent manifold. The learned optimal vector field v_θ enables energy-minimizing, straight-line transport from noise to data.

**Figure 4.** Efficiency evaluation with inference throughput.

**Figure 5.** Stiffness Analysis. The raw Token (Embedding) Space exhibits extreme stiffness and high curvature, indicating an ill-conditioned control landscape that forces adaptive solvers (RK45) to high NFE. In contrast, our Rectified Latent Space maintains low stiffness and near-linear trajectories, verifying the efficacy of VAE.

**Figure 6.** Optimization landscapes of different generation paradigms. (a) AR exhibits sharp and unstable geometry. (b) Discrete diffusion leads to fragmented and irregular landscapes. (c) Our Manta-LM yields a smooth and well-conditioned landscape, enabling stable optimization.

**Figure 7.** Model structure and pipeline.

Captured alongside this caption, from the paper's gradient-absence argument: due to the high-frequency discontinuities of the semantic energy landscape V across discrete tokens, no single vector v can satisfy the first-order Taylor approximation for the local neighborhood, confirming the structural absence of gradient guidance: ∄ v ∈ T_zM s.t. ⟨v, d⟩ ≈ V(z + d) − V(z). (Eq. 24)

**Figure 8.** Analysis on interplay between CFG guidance strength and integration fidelity.

• Optimal Regime (w ∈ [3.0, 5.0]): This setting achieves the best quality-efficiency trade-off. Metrics saturate rapidly (within 20–30 steps), indicating that the vector field is sufficiently aligned with the condition while remaining smooth enough for coarse-step integration.
• Over-Guided Regime (w ≥ 7.0): We observe a sharp per…
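
For readers unfamiliar with the guidance strength w in Figure 8, the sketch below shows one common classifier-free guidance convention applied to a velocity prediction. It is an assumption that Manta-LM combines conditional and unconditional predictions this way; the page does not state the formula, and some papers parameterize the same idea with a shifted scale. `guided_velocity` is a hypothetical helper name.

```python
from typing import List


def guided_velocity(v_cond: List[float],
                    v_uncond: List[float],
                    w: float) -> List[float]:
    """Classifier-free guidance (one common convention):
    v = v_uncond + w * (v_cond - v_uncond).
    w = 0 ignores the condition, w = 1 is the plain conditional
    prediction, and w > 1 extrapolates past it."""
    if len(v_cond) != len(v_uncond):
        raise ValueError("predictions must have the same dimension")
    return [u + w * (c - u) for c, u in zip(v_cond, v_uncond)]


if __name__ == "__main__":
    cond, uncond = [1.0, -0.5], [0.0, 0.5]
    for w in (0.0, 1.0, 3.0, 7.0):
        print(w, guided_velocity(cond, uncond, w))
```

Under this convention the guided field grows linearly with w, which is consistent with the qualitative picture of an over-guided regime being harder to integrate with coarse steps, though the figure itself is the only evidence the page offers.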

**Figure 9.** Step-by-step conditional generation process of Manta-LM on a paraphrase task. Given the input sentence “what was the best day of your life, and what happened?”, the figure visualizes the intermediate generation trajectories of Manta-LM across diffusion steps.

**Figure 10.** Step-by-step conditional generation process of Manta-LM on a paraphrase task. Given the input sentence “how can i be a good geologist?”, the figure visualizes the intermediate generation trajectories of Manta-LM across diffusion steps.

**Figure 11.** Visualizing error correction capabilities across different models. Red text indicates corrupted or erroneous tokens introduced by noise, while yellow text denotes tokens that are semantically consistent with the ground-truth text but differ in surface form.

**Figure 12.** Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model’s generated results. Example: Rabat – Dutch far-right lawmaker Geert Wilders was found not guilty of hate speech but guilty of discrimination and group insult. He will face no punishment. The verdict is in reference to comments Wilders, the leader of the Freedom Party, made…

**Figure 13.** Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model’s generated results.

**Figure 14.** Qualitative examples of the text infilling task. Text in blue represents the provided prefix and suffix, while text in black denotes the model’s generated results.

Original abstract

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives language generation a stochastic optimal control framing and tries to approximate the HJB equation with Flow Matching in a rectified latent space, but the abstract leaves the key equivalence unshown.

The core move is to treat generation as a stochastic optimal control problem, then use that lens to explain limits in autoregressive and diffusion models through trajectory singularity, adjoint vanishing, and missing gradients. They approximate the Hamilton-Jacobi-Bellman equation with Flow Matching inside a rectified latent control space so the resulting policy works as a closed-loop controller. The Manta-LM with its Global Integral Operator is meant to deliver both fidelity and cheap parallel sampling at once.

Referee Report

2 major / 2 minor

Summary. The manuscript reformulates language generation as a stochastic optimal control problem, using this lens to diagnose limitations of autoregressive and diffusion models (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) as arising from trajectory singularity, adjoint vanishing, and gradient absence. It proposes approximating the Hamilton-Jacobi-Bellman equation via Flow Matching as the trajectory solver inside a rectified latent control space, realized as Manta-LM equipped with a Global Integral Operator, to obtain a closed-loop optimal policy that simultaneously delivers high-fidelity generation and low-cost parallel sampling. Strong empirical results on language modeling and conditional generation are claimed, together with gains in stability, efficiency, and controllability.

Significance. If the central approximation is shown to be valid with a controllable error bound, the work supplies a unified theoretical account that could resolve persistent paradoxes in language generative modeling and yield practical controllers with both fidelity and sampling efficiency.

major comments (2)

[Abstract] Abstract and method description: the claim that Flow Matching inside the rectified latent control space approximates the HJB solution (and thereby yields an optimal closed-loop policy) is not accompanied by a derivation, equivalence statement, or error bound relating the Flow Matching regression objective to the HJB Hamiltonian or optimality condition, particularly for the discrete token state space.
[Abstract] Abstract: the assertion that the Global Integral Operator enables approximation of the global vector field without reintroducing singularity or adjoint-vanishing problems lacks a supporting analysis or proof sketch showing that the latent rectification maps back to a categorical distribution while preserving the required optimality properties.

minor comments (2)

Define the rectified latent control space and Global Integral Operator with explicit equations and state the precise mapping from latent trajectories to token distributions.
Supply quantitative tables or figures with concrete metrics, baselines, and ablation results to substantiate the claimed empirical gains in fidelity, efficiency, and stability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the unified theoretical framing and empirical contributions. We respond to each major comment below and indicate planned revisions to address the concerns about theoretical grounding.

Referee: [Abstract] Abstract and method description: the claim that Flow Matching inside the rectified latent control space approximates the HJB solution (and thereby yields an optimal closed-loop policy) is not accompanied by a derivation, equivalence statement, or error bound relating the Flow Matching regression objective to the HJB Hamiltonian or optimality condition, particularly for the discrete token state space.

Authors: We agree that the abstract is too concise on this point. The manuscript derives the stochastic optimal control formulation in Section 2 and positions Flow Matching as the tractable trajectory solver for the HJB equation inside the continuous rectified latent space (Section 3). The rectification step maps discrete tokens to this latent space so that the Flow Matching regression objective can be applied directly to the vector field. We acknowledge that an explicit equivalence statement or error bound for the discrete case is not stated in the abstract. In the revision we will add a short derivation sketch to the abstract and a dedicated paragraph in Section 3 (with an appendix note) that relates the Flow Matching objective to the HJB Hamiltonian under the latent rectification, while clarifying that the approximation error is controlled in the continuous latent space before the final categorical mapping.

revision: yes

Referee: [Abstract] Abstract: the assertion that the Global Integral Operator enables approximation of the global vector field without reintroducing singularity or adjoint-vanishing problems lacks a supporting analysis or proof sketch showing that the latent rectification maps back to a categorical distribution while preserving the required optimality properties.

Authors: We thank the referee for this observation. The Global Integral Operator is introduced in Section 4 as the mechanism that integrates the learned vector field over latent trajectories to recover the token distribution. Because the control policy is closed-loop and applied entirely inside the rectified latent space, the original singularity and adjoint-vanishing issues are sidestepped; the final rectification step simply decodes the latent state back to a categorical distribution without altering the optimality of the latent policy. We agree that the abstract omits a supporting analysis. In the revised manuscript we will insert a brief proof sketch (or reference to the relevant lemma in Section 3) demonstrating that the rectification mapping preserves the closed-loop optimality properties and does not reintroduce the identified pathologies.

revision: yes

Circularity Check

0 steps flagged

No circularity: derivation introduces independent approximation without reduction to inputs

The paper reformulates language generation as stochastic optimal control and proposes approximating the HJB PDE solution via Flow Matching inside a rectified latent control space with a Global Integral Operator. This is framed as a modeling choice to bypass direct PDE intractability rather than a definitional equivalence or a prediction forced by fitting parameters to the target result. No equations or self-citations are shown that reduce the claimed optimality or closed-loop policy back to the Flow Matching objective or latent rectification by construction; the central claims rest on the new framework's ability to deliver fidelity and efficiency, which remains externally falsifiable on benchmarks. The derivation is therefore self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Based solely on the abstract, the central claim rests on the applicability of stochastic optimal control to discrete language sequences and the effectiveness of Flow Matching for global approximation. No explicit free parameters or invented entities with independent evidence are detailed; the ledger reflects high-level assumptions from the abstract only.

axioms (1)

domain assumption: Language generation processes can be modeled as solutions to a stochastic optimal control problem with trajectory singularity and adjoint state vanishing issues.
This is the foundational reformulation invoked in the abstract to unify models and explain limitations.

invented entities (2)

rectified latent control space (no independent evidence)
purpose: To enable Flow Matching to approximate the global vector field and solve the HJB equation tractably.
New space introduced in the abstract to bypass direct PDE intractability.

Manta-LM with Global Integral Operator (no independent evidence)
purpose: To realize the closed-loop optimal policy for high-fidelity parallel sampling.
Proposed model name and component described in the abstract as the practical realization.

pith-pipeline@v0.9.0 · 5711 in / 1663 out tokens · 77281 ms · 2026-05-19T16:23:40.947497+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: J(u) = E[−log p_θ(z_1)] + λ ∫ E[½ ∥u_t(z_t)∥²] dt; u* = −∇V satisfying the HJB equation; Flow Matching regression L_CFM = E ∥v_θ(z_t, t) − (z_1 − z_0)∥² on OT paths
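
The Flow Matching regression term in the passage above can be written out as a small Monte Carlo estimate. This is a minimal sketch assuming straight-line interpolation z_t = (1 − t) z_0 + t z_1 between paired noise and data points and a uniformly sampled t; the function names `interpolate` and `cfm_loss`, the toy pairs, and the constant field are hypothetical, and nothing here reproduces the paper's training code or its control-cost term.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vec = List[float]


def interpolate(z0: Vec, z1: Vec, t: float) -> Vec:
    """Point on the straight path: z_t = (1 - t) * z0 + t * z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def cfm_loss(v: Callable[[Vec, float], Vec],
             pairs: Sequence[Tuple[Vec, Vec]],
             rng: random.Random) -> float:
    """Monte Carlo estimate of E || v(z_t, t) - (z1 - z0) ||^2."""
    total = 0.0
    for z0, z1 in pairs:
        t = rng.random()  # t drawn uniformly from [0, 1)
        z_t = interpolate(z0, z1, t)
        pred = v(z_t, t)
        total += sum((p - (b - a)) ** 2 for p, a, b in zip(pred, z0, z1))
    return total / len(pairs)


if __name__ == "__main__":
    # Toy data: every target is its source shifted by (1, -2).
    pairs = [([0.0, 0.0], [1.0, -2.0]), ([1.0, 1.0], [2.0, -1.0])]
    rng = random.Random(0)
    print(cfm_loss(lambda z, t: [1.0, -2.0], pairs, rng))  # matching field
    print(cfm_loss(lambda z, t: [0.0, 0.0], pairs, rng))   # zero field
```

In the toy data the displacement z_1 − z_0 is the same for every pair, so a constant field can fit it exactly; with real data the regression target varies and the minimizer is the conditional expectation of the displacement given (z_t, t).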

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: Manifold Rectification via regularized VAE producing a diffeomorphic Euclidean latent space so that ∇_z is well-defined and the dynamics are Lipschitz-regular

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 11 internal anchors

[1]

D., Ho, J., Tarlow, M., and van den Berg, R

Austin, J., Johnson, D. D., Ho, J., Tarlow, M., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, volume 34, pp.\ 17981--17993, 2021

work page 2021
[2]

and Brenier, Y

Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the monge-kantorovich mass transfer problem. Numerische Mathematik, 84 0 (3): 0 375--393, 2000

work page 2000
[3]

Stochastic optimal transport and hamilton–jacobi–bellman equations on the set of probability measures

Bertucci, C. Stochastic optimal transport and hamilton–jacobi–bellman equations on the set of probability measures. Annales de l'Institut Henri Poincar \'e C, Analyse non lin \'e aire , 2023. URL https://api.semanticscholar.org/CorpusID:259095954

work page 2023
[4]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in neural information processing systems, volume 33, pp.\ 1877--1901, 2020

work page 1901
[5]

and Lewis, A

Bullo, F. and Lewis, A. D. Geometric control of mechanical systems. 2004. URL https://api.semanticscholar.org/CorpusID:679624

work page 2004
[6]

T., and Robinson, T

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P. T., and Robinson, T. One billion word benchmark for measuring progress in statistical language modeling. In Interspeech, 2013

work page 2013
[7]

Deep compression autoencoder for efficient high-resolution diffusion models.arXiv preprint arXiv:2410.10733, 2024

Chen, J., Cai, H., Chen, J., Xie, E., Yang, S., Tang, H., Li, M., Lu, Y., and Han, S. Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733, 2024

work page arXiv 2024
[8]

Categorical flow matching on statistical manifolds

Cheng, C., Li, J., Peng, J., and Liu, G. Categorical flow matching on statistical manifolds. Advances in Neural Information Processing Systems, 37: 0 54787--54819, 2024

work page 2024
[9]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Quora question pairs

DataCanary, hilfialkaff, Jiang, L., Risdal, M., Dandekar, N., and tomtung. Quora question pairs. Kaggle Competition, 2017. https://kaggle.com/competitions/quora-question-pairs

work page 2017
[11]

Dhingra, B., Mazaitis, K., and Cohen, W. W. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[12]

Continuous diffusion for categorical data

Dieleman, S., Sartran, L., Roshannai, A., Savinov, N., Ganin, Y., Richemond, P. H., Doucet, A., Strudel, R., Dyer, C., Durkan, C., et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Fleming, W. H. and Rishel, R. W. Deterministic and stochastic optimal control. Springer Science & Business Media, 2012

work page 2012
[14]

and Cohen, V

Gokaslan, A. and Cohen, V. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

work page 2019
[15]

Diffuseq-v2: Bridging discrete and continuous text spaces for accelerated seq2seq diffusion models

Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. Diffuseq-v2: Bridging discrete and continuous text spaces for accelerated seq2seq diffusion models. In The 2023 Conference on Empirical Methods in Natural Language Processing

work page 2023
[16]

Diffuseq: Sequence to sequence text generation with diffusion models

Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. Diffuseq: Sequence to sequence text generation with diffusion models. In International Conference on Learning Representations (ICLR 2023)(01/05/2023-05/05/2023, Kigali, Rwanda), 2023

work page 2023
[17]

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Gong, S., Agarwal, S., Zhang, Y., Ye, J., Zheng, L., Li, M., An, C., Zhao, P., Bi, W., Han, J., et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891, 2024

work page internal anchor Pith review arXiv 2024
[18]

and Hashimoto, T

Gulrajani, I. and Hashimoto, T. B. Likelihood-based diffusion language models. Advances in Neural Information Processing Systems, 36: 0 16693--16715, 2023

work page 2023
[19]

Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control

Han, X., Kumar, S., and Tsvetkov, Y. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 11575--11596, 2023

work page 2023
[20]

Argmax flows and multinomial diffusion: Learning categorical distributions

Hoogeboom, E., Nielsen, D., Jaini, P., Forr \'e , P., and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in neural information processing systems, 34: 0 12454--12465, 2021

work page 2021
[21]

K., Xu, W., Hao, J., Song, L., Xu, Y., Yang, J., Liu, J., Zhang, C., et al

Huang, S., Cheng, T., Liu, J. K., Xu, W., Hao, J., Song, L., Xu, Y., Yang, J., Liu, J., Zhang, C., et al. Opencoder: The open cookbook for top-tier code large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 33167--33193, 2025

work page 2025
[22]

Neural crf model for sentence alignment in text simplification

Jiang, C., Maddela, M., Lan, W., Zhong, Y., and Xu, W. Neural crf model for sentence alignment in text simplification. arXiv preprint arXiv:2005.02324, 2020

work page arXiv 2005
[23]

and Hwang, S

Jo, J. and Hwang, S. J. Continuous diffusion model for language modeling. In Neural Information Processing Systems, 2025

work page 2025
[24]

Infinity Instruct: Scaling instruction selection and synthesis to enhance language models.arXiv preprint arXiv:2506.11116, 2025

Li, J., Du, L., Zhao, H., Zhang, B.-w., Wang, L., Gao, B., Liu, G., and Lin, Y. Infinity instruct: Scaling instruction selection and synthesis to enhance language models. arXiv preprint arXiv:2506.11116, 2025 a

work page arXiv 2025
[25]

Lavida-o: Elastic large masked diffusion models for unified multimodal understanding and generation.arXiv preprint arXiv:2509.19244, 2025

Li, S., Gu, J., Liu, K., Lin, Z., Wei, Z., Grover, A., and Kuen, J. Lavida-o: Elastic large masked diffusion models for unified multimodal understanding and generation. arXiv preprint arXiv:2509.19244, 2025 b

work page arXiv 2025
[26]

S., and Hashimoto, T

Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-lm improves controllable text generation. Advances in neural information processing systems, 35: 0 4328--4343, 2022

work page 2022
[27]

T., Ben-Hamu, H., Nickel, M., and Le, M

Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In International Conference on Learning Representations, 2023

work page 2023
[28]

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

K., Ivison, H., Tae, J., Henderson, J., Beltagy, I., Peters, M

Mahabadi, R. K., Ivison, H., Tae, J., Henderson, J., Beltagy, I., Peters, M. E., and Cohan, A. Tess: Text-to-text self-conditioned simplex diffusion. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 2347--2361, 2024

work page 2024
[30]

Pointer sentinel mixture models, 2016

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016

work page 2016
[31]

arXiv preprint arXiv:2504.16891 , year=

Moshkov, I., Hanley, D., Sorokin, I., Toshniwal, S., Henkel, C., Schifferer, B., Du, W., and Gitman, I. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv preprint arXiv:2504.16891, 2025

work page arXiv 2025
[32]

Large language diffusion models

Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. In Neural Information Processing Systems, 2025 a

work page 2025
[33]

Large Language Diffusion Models

Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

The lambada dataset: Word prediction requiring a broad discourse context

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.-Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fern \'a ndez, R. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers), pp.\ 1525--1534, 2016

work page 2016
[36]

Language models are unsupervised multitask learners

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

work page 2019
[37]

Simple and effective masked diffusion language models

Sahoo, S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J., Rush, A., and Kuleshov, V. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37: 0 130136--130184, 2024

work page 2024
[38]

Simplified and generalized masked diffusion for discrete data

Shi, J., Han, K., Wang, Z., Doucet, A., and Titsias, M. Simplified and generalized masked diffusion for discrete data. Advances in neural information processing systems, 37: 0 103131--103167, 2024

[39]

Deep unsupervised learning using nonequilibrium thermodynamics

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. PMLR, 2015

[40]

Self-conditioned embedding diffusion for text generation

Strudel, R., Tallec, C., Altché, F., Du, Y., Ganin, Y., Mensch, A., Grathwohl, W., Savinov, N., Dieleman, S., Sifre, L., et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022

[41]

Score-based continuous-time discrete diffusion models

Sun, H., Yu, L., Dai, B., Schuurmans, D., and Dai, H. Score-based continuous-time discrete diffusion models. arXiv preprint arXiv:2211.16750, 2022

[42]

Unified multimodal discrete diffusion

Swerdlow, A., Prabhudesai, M., Gandhi, S., Pathak, D., and Fragkiadaki, K. Unified multimodal discrete diffusion. arXiv preprint arXiv:2503.20853, 2025

[43]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[44]

Qwen2 Technical Report

Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K.-Y., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., Peng, R., Men, R., Gao, R., Lin, R., Wang, S., ...

[45]

MMaDA: Multimodal Large Diffusion Language Models

Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., and Wang, M. Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025

[46]

Dream 7B: Diffusion Large Language Models

Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025

[47]

Commonsense knowledge aware conversation generation with graph attention

Zhou, H., Young, T., Huang, M., Zhao, H., Xu, J., and Zhu, X. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, volume 18, pp. 4623–4629, 2018

