Continuous Latent Diffusion Language Model
Pith reviewed 2026-05-08 09:58 UTC · model grok-4.3
The pith
A hierarchical latent diffusion model separates global semantic organization from local text realization in continuous space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
From a unified Markov-path perspective, Cola DLM's diffusion performs latent prior transport rather than token-level observation recovery, separating global semantic organization from local textual realization. This yields a flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and extends naturally to other continuous modalities. Experiments on 8 benchmarks against matched ~2B-parameter baselines, with scaling curves up to ~2000 EFLOPs, identify an effective configuration and confirm strong scaling behavior for text generation.
What carries the argument
The hierarchical decomposition that uses a Text VAE for stable text-to-latent mapping, a block-causal DiT for diffusion-based global semantic prior transport in continuous space, and conditional decoding for final text output.
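To make the three-stage data flow concrete, here is a minimal numpy sketch. Everything in it is an invented stand-in (the embedding-table "VAE", the shrink-toward-estimate "denoiser", the linear schedule, and all sizes), not the paper's implementation; it only illustrates the encode → latent transport → conditional decode pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the paper's actual dimensions are not given in this review.
vocab, seq_len, latent_dim, n_steps = 100, 16, 8, 10

# Stage 1 stand-in for the Text VAE: a fixed embedding table as "encoder",
# nearest-embedding readout as the conditional "decoder".
E = rng.normal(size=(vocab, latent_dim))

def encode(tokens):
    """Discrete text -> continuous latents."""
    return E[tokens]

def decode(latents):
    """Continuous latents -> discrete text (nearest-token readout)."""
    return np.argmax(latents @ E.T, axis=-1)

# Stage 2 stand-in for the block-causal DiT prior: transport Gaussian noise
# toward the latent manifold by shrinking toward a crude clean estimate.
def denoise_step(z, alpha):
    target = encode(decode(z))  # stand-in for the DiT's x0 prediction
    return alpha * z + (1 - alpha) * target

z = rng.normal(size=(seq_len, latent_dim))  # start from pure noise
for t in range(n_steps, 0, -1):
    z = denoise_step(z, (t - 1) / n_steps)  # linear schedule (assumed)

tokens = decode(z)  # Stage 3: local text realization from the latent
```

The point of the sketch is the division of labor: the diffusion loop only ever touches continuous latents, and discrete tokens appear solely at the final decoding step.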
If this is right
- Text generation gains a non-autoregressive inductive bias that organizes semantics globally before realizing local tokens.
- Semantic compression and prior fitting occur directly in continuous space rather than through token likelihood.
- Generation quality and scaling curves become stronger indicators of model capability than likelihood alone.
- The same latent diffusion structure extends without modification to joint modeling of text with other continuous data types.
Where Pith is reading between the lines
- The separation of global semantics from token realization could reduce error accumulation in long sequences by enforcing high-level coherence first.
- A shared continuous latent space might allow direct mixing of text generation with image or audio synthesis under one diffusion process.
- Evaluation focus may shift toward measuring output coherence and scaling efficiency rather than perplexity on next-token prediction.
Load-bearing premise
A stable and invertible mapping from discrete text to continuous latent space exists, so that block-causal diffusion can reliably carry global semantics to support high-quality conditional text generation.
What would settle it
Scaling curves showing that Cola DLM generation quality plateaus or lags behind matched autoregressive baselines past 2000 EFLOPs, or that the Text VAE mapping becomes unstable and non-invertible on diverse or long texts.
Original abstract
Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Cola DLM, a hierarchical latent diffusion language model that decomposes text generation into a Text VAE for learning a stable text-to-latent mapping, a block-causal DiT for modeling a global semantic prior in continuous latent space, and conditional decoding for text generation. From a Markov-path view, the diffusion performs latent prior transport rather than token-level recovery. Experiments span 4 research questions and 8 benchmarks with strictly matched ~2B-parameter autoregressive and LLaDA baselines, plus scaling curves to ~2000 EFLOPs, claiming strong scaling behavior and establishing hierarchical continuous latent prior modeling as a principled non-autoregressive alternative to token-level language modeling.
Significance. If the empirical claims hold with full supporting data, the work would be significant for offering a continuous-space inductive bias that separates global semantics from local realization, with potential advantages in scaling and multimodal unification. The use of matched baselines and large-scale EFLOP curves provides a concrete basis for comparing generation quality and scaling behavior against likelihood-based AR models.
Major comments (2)
- [Abstract] The central claim that Cola DLM establishes hierarchical continuous latent prior modeling as a principled alternative rests on reported performance across 8 benchmarks and scaling to 2000 EFLOPs, yet the abstract supplies no numerical results, ablation tables, or error bars. This leaves the claimed outperformance over matched ~2B baselines unverifiable from the summary alone.
- [Methods] Text VAE component: The load-bearing assumption of a 'stable text-to-latent mapping' that supports faithful semantic representations for the subsequent DiT prior is not accompanied by reconstruction-fidelity metrics (e.g., BLEU, perplexity on held-out text), posterior-collapse diagnostics, or KL-annealing curves. Without these, downstream generation quality and scaling curves could reflect VAE compression artifacts rather than the benefits of block-causal diffusion in continuous space.
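The posterior-collapse diagnostic asked for here is standard: per-dimension KL of the approximate posterior against the N(0, I) prior, with near-zero dimensions flagged as collapsed. A minimal numpy sketch on toy statistics (not the paper's model):

```python
import numpy as np

def kl_per_dim(mu: np.ndarray, logvar: np.ndarray) -> np.ndarray:
    """Mean KL(q(z|x) || N(0, I)) per latent dimension, averaged over a
    batch of posterior parameters. Dimensions with KL near zero carry no
    information about x -- the signature of posterior collapse."""
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)  # [batch, dim]
    return kl.mean(axis=0)

# Toy check: one "active" dimension (informative mu) next to one
# "collapsed" dimension whose posterior equals the prior exactly.
rng = np.random.default_rng(0)
mu = np.stack([rng.normal(0, 2, 256), np.zeros(256)], axis=1)
logvar = np.zeros((256, 2))
kl = kl_per_dim(mu, logvar)  # kl[0] is large, kl[1] is exactly zero
```

Reporting this vector over training (alongside reconstruction BLEU/perplexity) would directly support or refute the 'stable mapping' premise.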
Minor comments (2)
- Clarify the precise definition of block-causality in the DiT architecture and how it interacts with the diffusion noise schedule; an explicit equation or diagram would aid reproducibility.
- Provide exact parameter counts, training token budgets, and optimizer settings for all baselines to ensure the 'strictly matched' comparison is fully transparent.
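On the first minor comment: one plausible formalization of block-causality (an assumption on our part; the paper's exact definition may differ) is full attention within a block and causal attention across blocks, i.e. position i may attend to position j iff j's block index does not exceed i's:

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean attention mask: True where attention is allowed.
    Full attention within a block, causal attention across blocks."""
    blocks = np.arange(seq_len) // block_size  # block index per position
    return blocks[:, None] >= blocks[None, :]

mask = block_causal_mask(6, 2)
```

Under this reading, block_size = 1 recovers an ordinary causal mask and block_size = seq_len recovers full bidirectional attention, which is why an explicit equation in the paper would pin down where Cola DLM sits between the two.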
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and the Text VAE validation. We have revised the manuscript to directly address both points by adding concrete numerical support and diagnostic metrics, which we believe strengthens the verifiability of our claims without altering the core contributions.
Point-by-point responses
Referee: [Abstract] The central claim that Cola DLM establishes hierarchical continuous latent prior modeling as a principled alternative rests on reported performance across 8 benchmarks and scaling to 2000 EFLOPs, yet the abstract supplies no numerical results, ablation tables, or error bars. This leaves the claimed outperformance over matched ~2B baselines unverifiable from the summary alone.
Authors: We agree that the abstract would benefit from explicit quantitative anchors to make the central claims immediately verifiable. In the revised manuscript we have inserted concise numerical highlights drawn from the main results (e.g., average gains over the matched ~2B AR and LLaDA baselines across the eight benchmarks, together with the observed scaling trend to ~2000 EFLOPs). Detailed ablation tables, error bars, and per-benchmark breakdowns remain in the body and appendix, as space constraints preclude their inclusion in the abstract itself. These additions make the support for outperformance directly readable from the abstract while preserving its brevity. Revision: yes.
Referee: [Methods] Text VAE component: The load-bearing assumption of a 'stable text-to-latent mapping' that supports faithful semantic representations for the subsequent DiT prior is not accompanied by reconstruction-fidelity metrics (e.g., BLEU, perplexity on held-out text), posterior-collapse diagnostics, or KL-annealing curves. Without these, downstream generation quality and scaling curves could reflect VAE compression artifacts rather than the benefits of block-causal diffusion in continuous space.
Authors: We acknowledge that the original submission did not foreground explicit reconstruction and stability diagnostics for the Text VAE in the main text. We have added a dedicated paragraph in Section 3.1 together with a new appendix subsection that reports (i) BLEU and perplexity on held-out text, (ii) posterior-collapse diagnostics via KL-divergence statistics and histograms, and (iii) the KL-annealing schedule and corresponding curves. These metrics confirm faithful reconstruction without collapse. We further include a controlled ablation that isolates the VAE contribution from the block-causal DiT prior, showing that the reported scaling behavior and benchmark gains are not explained by VAE compression artifacts alone. Revision: yes.
Circularity Check
No circularity: claims rest on external empirical comparisons
Full rationale
The paper presents Cola DLM as a hierarchical design (Text VAE for mapping, block-causal DiT for prior, conditional decoder) justified by experiments across 8 benchmarks, matched baselines, and scaling curves up to 2000 EFLOPs. No derivation chain reduces a claimed result to a fitted parameter or to a self-citation by construction; the Markov-path perspective is interpretive framing rather than a mathematical reduction. The central claim of a principled alternative is supported by generation quality and scaling behavior versus autoregressive and LLaDA baselines, which are independent of internal fits. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text.
Axiom & Free-Parameter Ledger
Free parameters (2)
- latent dimensionality
- diffusion noise schedule and number of steps
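Of these two free parameters, the noise schedule is the easier one to state concretely. A DDPM-style linear beta schedule is one common default (the review does not say which schedule Cola DLM actually uses):

```python
import numpy as np

def linear_alpha_bar(n_steps: int, beta_min: float = 1e-4,
                     beta_max: float = 0.02) -> np.ndarray:
    """Cumulative signal-retention coefficients alpha_bar[t] for a linear
    beta schedule: the clean latent is scaled by sqrt(alpha_bar[t]) at
    noise level t, decaying monotonically from ~1 toward ~0."""
    betas = np.linspace(beta_min, beta_max, n_steps)
    return np.cumprod(1.0 - betas)

abar = linear_alpha_bar(1000)  # abar[0] near 1, abar[-1] near 0
```

Both free parameters interact: a smaller latent dimensionality concentrates more information per coordinate, which typically calls for a gentler (slower-decaying) schedule, so the two should be ablated jointly.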
Axioms (2)
- Domain assumption: A Text VAE can learn a stable, sufficiently invertible mapping from discrete text to continuous latent codes.
- Domain assumption: Block-causal attention applied to latent codes can capture and transport global semantic structure.