Continuous Latent Diffusion Language Model
Pith reviewed 2026-05-08 09:58 UTC · model grok-4.3
The pith
A hierarchical latent diffusion model separates global semantic organization from local text realization in continuous space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
From a unified Markov-path perspective, Cola DLM's diffusion performs latent prior transport rather than token-level observation recovery, separating global semantic organization from local textual realization. This yields a flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and extends naturally to other continuous modalities. Experiments on 8 benchmarks against matched ~2B-parameter baselines, with scaling curves up to ~2000 EFLOPs, identify an effective configuration and confirm strong scaling behavior for text generation.
What carries the argument
The hierarchical decomposition that uses a Text VAE for stable text-to-latent mapping, a block-causal DiT for diffusion-based global semantic prior transport in continuous space, and conditional decoding for final text output.
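To make the three-stage data flow concrete, here is a minimal numpy sketch. Everything in it is an invented stand-in (the embedding-table "VAE", the shrink-toward-estimate "denoiser", the linear schedule, and all sizes), not the paper's implementation; it only illustrates the encode → latent transport → conditional decode pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the paper's actual dimensions are not given in this review.
vocab, seq_len, latent_dim, n_steps = 100, 16, 8, 10

# Stage 1 stand-in for the Text VAE: a fixed embedding table as "encoder",
# nearest-embedding readout as the conditional "decoder".
E = rng.normal(size=(vocab, latent_dim))

def encode(tokens):
    """Discrete text -> continuous latents."""
    return E[tokens]

def decode(latents):
    """Continuous latents -> discrete text (nearest-token readout)."""
    return np.argmax(latents @ E.T, axis=-1)

# Stage 2 stand-in for the block-causal DiT prior: transport Gaussian noise
# toward the latent manifold by shrinking toward a crude clean estimate.
def denoise_step(z, alpha):
    target = encode(decode(z))  # stand-in for the DiT's x0 prediction
    return alpha * z + (1 - alpha) * target

z = rng.normal(size=(seq_len, latent_dim))  # start from pure noise
for t in range(n_steps, 0, -1):
    z = denoise_step(z, (t - 1) / n_steps)  # linear schedule (assumed)

tokens = decode(z)  # Stage 3: local text realization from the latent
```

The point of the sketch is the division of labor: the diffusion loop only ever touches continuous latents, and discrete tokens appear solely at the final decoding step.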
If this is right
- Text generation gains a non-autoregressive inductive bias that organizes semantics globally before realizing local tokens.
- Semantic compression and prior fitting occur directly in continuous space rather than through token likelihood.
- Generation quality and scaling curves become stronger indicators of model capability than likelihood alone.
- The same latent diffusion structure extends without modification to joint modeling of text with other continuous data types.
Where Pith is reading between the lines
- The separation of global semantics from token realization could reduce error accumulation in long sequences by enforcing high-level coherence first.
- A shared continuous latent space might allow direct mixing of text generation with image or audio synthesis under one diffusion process.
- Evaluation focus may shift toward measuring output coherence and scaling efficiency rather than perplexity on next-token prediction.
Load-bearing premise
A stable and invertible mapping from discrete text to continuous latent space exists, so that block-causal diffusion can reliably carry global semantics to support high-quality conditional text generation.
What would settle it
Scaling curves showing that Cola DLM generation quality plateaus or lags behind matched autoregressive baselines past 2000 EFLOPs, or that the Text VAE mapping becomes unstable and non-invertible on diverse or long texts.
Original abstract
Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Cola DLM, a hierarchical latent diffusion language model that decomposes text generation into a Text VAE for learning a stable text-to-latent mapping, a block-causal DiT for modeling a global semantic prior in continuous latent space, and conditional decoding for text generation. From a Markov-path view, the diffusion performs latent prior transport rather than token-level recovery. Experiments span 4 research questions and 8 benchmarks with strictly matched ~2B-parameter autoregressive and LLaDA baselines, plus scaling curves to ~2000 EFLOPs, claiming strong scaling behavior and establishing hierarchical continuous latent prior modeling as a principled non-autoregressive alternative to token-level language modeling.
Significance. If the empirical claims hold with full supporting data, the work would be significant for offering a continuous-space inductive bias that separates global semantics from local realization, with potential advantages in scaling and multimodal unification. The use of matched baselines and large-scale EFLOP curves provides a concrete basis for comparing generation quality and scaling behavior against likelihood-based AR models.
Major comments (2)
- [Abstract] The central claim that Cola DLM establishes hierarchical continuous latent prior modeling as a principled alternative rests on reported performance across 8 benchmarks and scaling to 2000 EFLOPs, yet the abstract supplies no numerical results, ablation tables, or error bars. This leaves the claimed outperformance over matched ~2B baselines unverifiable from the summary alone.
- [Methods] Text VAE component: The load-bearing assumption of a 'stable text-to-latent mapping' that supports faithful semantic representations for the subsequent DiT prior is not accompanied by reconstruction-fidelity metrics (e.g., BLEU, perplexity on held-out text), posterior-collapse diagnostics, or KL-annealing curves. Without these, downstream generation quality and scaling curves could reflect VAE compression artifacts rather than the benefits of block-causal diffusion in continuous space.
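The posterior-collapse diagnostic asked for here is standard: per-dimension KL of the approximate posterior against the N(0, I) prior, with near-zero dimensions flagged as collapsed. A minimal numpy sketch on toy statistics (not the paper's model):

```python
import numpy as np

def kl_per_dim(mu: np.ndarray, logvar: np.ndarray) -> np.ndarray:
    """Mean KL(q(z|x) || N(0, I)) per latent dimension, averaged over a
    batch of posterior parameters. Dimensions with KL near zero carry no
    information about x -- the signature of posterior collapse."""
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)  # [batch, dim]
    return kl.mean(axis=0)

# Toy check: one "active" dimension (informative mu) next to one
# "collapsed" dimension whose posterior equals the prior exactly.
rng = np.random.default_rng(0)
mu = np.stack([rng.normal(0, 2, 256), np.zeros(256)], axis=1)
logvar = np.zeros((256, 2))
kl = kl_per_dim(mu, logvar)  # kl[0] is large, kl[1] is exactly zero
```

Reporting this vector over training (alongside reconstruction BLEU/perplexity) would directly support or refute the 'stable mapping' premise.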
Minor comments (2)
- Clarify the precise definition of block-causality in the DiT architecture and how it interacts with the diffusion noise schedule; an explicit equation or diagram would aid reproducibility.
- Provide exact parameter counts, training token budgets, and optimizer settings for all baselines to ensure the 'strictly matched' comparison is fully transparent.
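On the first minor comment: one plausible formalization of block-causality (an assumption on our part; the paper's exact definition may differ) is full attention within a block and causal attention across blocks, i.e. position i may attend to position j iff j's block index does not exceed i's:

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean attention mask: True where attention is allowed.
    Full attention within a block, causal attention across blocks."""
    blocks = np.arange(seq_len) // block_size  # block index per position
    return blocks[:, None] >= blocks[None, :]

mask = block_causal_mask(6, 2)
```

Under this reading, block_size = 1 recovers an ordinary causal mask and block_size = seq_len recovers full bidirectional attention, which is why an explicit equation in the paper would pin down where Cola DLM sits between the two.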
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and the Text VAE validation. We have revised the manuscript to directly address both points by adding concrete numerical support and diagnostic metrics, which we believe strengthens the verifiability of our claims without altering the core contributions.
Point-by-point responses
Referee: [Abstract] The central claim that Cola DLM establishes hierarchical continuous latent prior modeling as a principled alternative rests on reported performance across 8 benchmarks and scaling to 2000 EFLOPs, yet the abstract supplies no numerical results, ablation tables, or error bars. This leaves the claimed outperformance over matched ~2B baselines unverifiable from the summary alone.
Authors: We agree that the abstract would benefit from explicit quantitative anchors to make the central claims immediately verifiable. In the revised manuscript we have inserted concise numerical highlights drawn from the main results (e.g., average gains over the matched ~2B AR and LLaDA baselines across the eight benchmarks, together with the observed scaling trend to ~2000 EFLOPs). Detailed ablation tables, error bars, and per-benchmark breakdowns remain in the body and appendix, as space constraints preclude their inclusion in the abstract itself. These additions make the support for outperformance directly readable from the abstract while preserving its brevity. Revision: yes.
Referee: [Methods] Text VAE component: The load-bearing assumption of a 'stable text-to-latent mapping' that supports faithful semantic representations for the subsequent DiT prior is not accompanied by reconstruction-fidelity metrics (e.g., BLEU, perplexity on held-out text), posterior-collapse diagnostics, or KL-annealing curves. Without these, downstream generation quality and scaling curves could reflect VAE compression artifacts rather than the benefits of block-causal diffusion in continuous space.
Authors: We acknowledge that the original submission did not foreground explicit reconstruction and stability diagnostics for the Text VAE in the main text. We have added a dedicated paragraph in Section 3.1 together with a new appendix subsection that reports (i) BLEU and perplexity on held-out text, (ii) posterior-collapse diagnostics via KL-divergence statistics and histograms, and (iii) the KL-annealing schedule and corresponding curves. These metrics confirm faithful reconstruction without collapse. We further include a controlled ablation that isolates the VAE contribution from the block-causal DiT prior, showing that the reported scaling behavior and benchmark gains are not explained by VAE compression artifacts alone. Revision: yes.
Circularity Check
No circularity: claims rest on external empirical comparisons
Full rationale
The paper presents Cola DLM as a hierarchical design (Text VAE for mapping, block-causal DiT for prior, conditional decoder) justified by experiments across 8 benchmarks, matched baselines, and scaling curves up to 2000 EFLOPs. No derivation chain reduces a claimed result to a fitted parameter or to a self-citation by construction; the Markov-path perspective is interpretive framing rather than a mathematical reduction. The central claim of a principled alternative is supported by generation quality and scaling behavior versus autoregressive and LLaDA baselines, which are independent of internal fits. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text.
Axiom & Free-Parameter Ledger
Free parameters (2)
- latent dimensionality
- diffusion noise schedule and number of steps
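Of these two free parameters, the noise schedule is the easier one to state concretely. A DDPM-style linear beta schedule is one common default (the review does not say which schedule Cola DLM actually uses):

```python
import numpy as np

def linear_alpha_bar(n_steps: int, beta_min: float = 1e-4,
                     beta_max: float = 0.02) -> np.ndarray:
    """Cumulative signal-retention coefficients alpha_bar[t] for a linear
    beta schedule: the clean latent is scaled by sqrt(alpha_bar[t]) at
    noise level t, decaying monotonically from ~1 toward ~0."""
    betas = np.linspace(beta_min, beta_max, n_steps)
    return np.cumprod(1.0 - betas)

abar = linear_alpha_bar(1000)  # abar[0] near 1, abar[-1] near 0
```

Both free parameters interact: a smaller latent dimensionality concentrates more information per coordinate, which typically calls for a gentler (slower-decaying) schedule, so the two should be ablated jointly.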
Axioms (2)
- Domain assumption: A Text VAE can learn a stable, sufficiently invertible mapping from discrete text to continuous latent codes.
- Domain assumption: Block-causal attention applied to latent codes can capture and transport global semantic structure.