Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation

Chonglin Sun; Fei Tian; Frank Shyu; Jinhao Duan; Luke Simon; Mingfu Liang; Parish Aggarwal; Sandeep Pandey; Tianlong Chen; Xianfeng Wu

arxiv: 2606.27732 · v1 · pith:V6KHOCTQnew · submitted 2026-06-26 · 💻 cs.IR · cs.AI · cs.LG

Yuhang Chen, Xianfeng Wu, Jinhao Duan, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Parish Aggarwal, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Tianlong Chen

Pith reviewed 2026-06-29 03:28 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.LG

keywords: bifocal diffusion · R2LM · asymmetric bidirectional context · discrete diffusion language models · parallel generation · KV caching · Mamba SSM · right-to-left context

The pith

Bifocal dLLMs use causal attention plus a reverse Mamba sidecar to enable efficient parallel generation with full KV cache compatibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets a trade-off in discrete diffusion language models (dLLMs): bidirectional attention gives every position the full context and good quality but is incompatible with KV caching, while causal attention allows caching but loses right-side context. Its proposed model, R2LM, uses asymmetric bidirectional context: precise left context from causal attention plus compressed right context from a lightweight reverse Mamba SSM. This supports parallel decoding with KV caching in batch serving; the abstract reports 2.4× to 12.9× higher throughput than bidirectional dLLMs while exceeding the causal baseline on most benchmarks and surpassing the bidirectional dLLM on average. A sympathetic reader would care because this would make diffusion-based generation practical for high-throughput applications without major quality sacrifices.
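The caching half of that trade-off can be illustrated with a toy, single-head attention over scalar "tokens". This is only a sketch under simplifying assumptions and is not taken from the paper; the helper names `attend` and `prefix_is_stable` are hypothetical. It checks whether appending a token changes the outputs already computed for earlier positions. Under a causal mask they stay fixed, which is what lets a multi-layer model reuse cached keys and values; under bidirectional attention they shift, so anything cached from them would be stale.

```python
import math

def attend(queries, keys, values, causal):
    """Toy single-head softmax attention over scalar tokens (illustrative only)."""
    out = []
    for i, q in enumerate(queries):
        # Causal: position i sees 0..i. Bidirectional: position i sees every position.
        visible = range(i + 1) if causal else range(len(keys))
        scores = [q * keys[j] for j in visible]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(w * values[j] for w, j in zip(weights, visible)) / z)
    return out

def prefix_is_stable(tokens, new_token, causal):
    """True if appending new_token leaves the outputs at earlier positions unchanged."""
    before = attend(tokens, tokens, tokens, causal)
    longer = tokens + [new_token]
    after = attend(longer, longer, longer, causal)
    return all(abs(a - b) < 1e-12 for a, b in zip(before, after))

tokens = [0.5, -1.0, 2.0]
print(prefix_is_stable(tokens, 1.5, causal=True))   # True: earlier outputs are reusable
print(prefix_is_stable(tokens, 1.5, causal=False))  # False: earlier outputs shift
```

In a deeper model the outputs of one layer feed the keys and values of the next, so the instability shown in the bidirectional case is what invalidates a KV cache.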

Core claim

Bifocal dLLMs instantiate asymmetric bidirectional context in R2LM by pairing standard causal attention for left-context precision and KV cache compatibility with a lightweight reverse Mamba SSM sidecar that supplies compressed right-side context, resolving the architectural dilemma in discrete diffusion models.

What carries the argument

The R2LM (Right-to-Left Mamba) mechanism, which uses a reverse Mamba SSM sidecar to provide compressed right-side context alongside causal attention.
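A rough way to picture asymmetric context is the toy layer below. It is a minimal sketch under stated assumptions, not the authors' R2LM: tokens are scalars, the right-to-left branch is a plain linear recurrence with a fixed `decay` (a real Mamba block uses input-dependent, selective parameters), and the additive `gate` used to merge the two branches is a hypothetical choice, since the abstract does not say how the sidecar output is fused.

```python
import math

def causal_attention(x):
    """Toy softmax attention: position i attends to positions 0..i only."""
    out = []
    for i, q in enumerate(x):
        scores = [q * x[j] for j in range(i + 1)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        out.append(sum(wj * x[j] for j, wj in enumerate(w)) / sum(w))
    return out

def reverse_scan(x, decay=0.5):
    """Right-to-left linear recurrence: r[i] is a fixed-size summary of x[i+1:]."""
    r = [0.0] * len(x)
    state = 0.0
    for i in range(len(x) - 1, -1, -1):
        r[i] = state                  # summary of everything strictly to the right of i
        state = decay * state + x[i]  # fold x[i] in before moving to position i - 1
    return r

def bifocal_layer(x, gate=0.1):
    """Exact left context plus a gated, compressed right-context summary (assumed fusion)."""
    left = causal_attention(x)
    right = reverse_scan(x)
    return [l + gate * r for l, r in zip(left, right)]

x = [1.0, 2.0, 4.0]
print(reverse_scan(x))   # [4.0, 4.0, 0.0]
print(bifocal_layer(x))
```

The attention branch is still strictly causal, while each position additionally receives one number summarizing its right side, with more distant tokens down-weighted by powers of `decay`. How much of the right context survives that compression is exactly what the review's load-bearing premise is about.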

If this is right

R2LM achieves 2.4× to 12.9× higher throughput than bidirectional dLLMs in batch serving.
It provides 1.9× to 2.9× speedup over AR baselines through parallel decoding with KV caching.
R2LM exceeds the causal baseline on most benchmarks.
It surpasses the bidirectional dLLM on average quality metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This design could be adapted to other state space models for context compression in generative tasks.
Longer sequence lengths might benefit more from the compressed right context approach.
The method opens a path to hybrid attention-SSM architectures in diffusion models.
Batch serving efficiency gains may extend to other non-causal generation frameworks.

Load-bearing premise

That a lightweight reverse Mamba SSM sidecar can supply sufficient compressed right-side context without meaningful overhead or quality degradation while preserving KV cache compatibility.

What would settle it

Running the model without the reverse Mamba sidecar and measuring if generation quality drops to causal baseline levels or if throughput gains disappear in batch serving experiments.

Original abstract

Discrete diffusion language models (dLLMs) recover masked tokens in parallel, offering significant speedups over autoregressive (AR) generation. However, such promising frameworks face a fundamental architectural design dilemma: (1) Adopting bidirectional attention achieves strong generation quality by allowing each position to access the full context, but is inherently incompatible with KV caching, limiting inference throughput in batch-serving scenarios; (2) Conversely, causal attention enables efficient cached inference but loses all right-side context, substantially degrading generation quality. This paper introduces Bifocal dLLMs, a new paradigm that resolves this dilemma through asymmetric bidirectional context. Analogous to bifocal lenses, we instantiate the paradigm as R2LM (Right-to-Left Mamba), which combines two complementary mechanisms: (a) standard causal attention providing precise left-context with full KV cache compatibility, while (b) a lightweight reverse Mamba SSM sidecar supplying compressed right-side context without breaking cacheability. Comprehensive experiments on continued pretraining of Qwen3-1.7B with 60B tokens demonstrate that R2LM achieves 2.4× to 12.9× higher throughput than bidirectional dLLMs and 1.9× to 2.9× speedup over AR baselines in batch serving through parallel decoding with KV caching, while exceeding the causal baseline on most benchmarks and surpassing the bidirectional dLLM on average.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

R2LM pairs causal attention with a reverse Mamba sidecar to add right context without breaking KV cache, but the lack of ablations leaves the quality and overhead claims untested.

The new element is the asymmetric setup itself: standard left-context causal attention plus a lightweight reverse SSM that supplies compressed future information while keeping full cache compatibility. The paper frames the bidirectional-versus-causal tradeoff clearly and shows a workable path through it on Qwen3-1.7B after 60B tokens of continued pretraining.

The throughput numbers (2.4–12.9× over bidirectional dLLMs, 1.9–2.9× over AR) and the claim of matching or beating bidirectional quality on average are the parts that would matter if they hold. The architecture keeps the inference path cache-friendly, which is the practical constraint the work targets.

The main gap is verification. No ablations appear on sidecar state size, how often the reverse state is refreshed across diffusion steps, or how much right-context information actually survives the compression. The load-bearing premise is the weak point: if the sidecar drops key future tokens or adds measurable cost during unmasking, both the speed and quality results collapse. The abstract also omits error bars, controls, and direct comparisons of right-context fidelity, so the central assumption stays unproven from the given text.

This is for groups working on parallel LLM serving and diffusion generation. Someone already running dLLM experiments would get value from testing whether the sidecar delivers the promised context without extra latency.

It deserves peer review because the problem is concrete and the proposed mechanism is specific enough to evaluate, even though the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Bifocal dLLMs, a paradigm for discrete diffusion language models that resolves the bidirectional-vs-causal attention dilemma via asymmetric bidirectional context. It instantiates the paradigm as R2LM (Right-to-Left Mamba), which pairs standard causal attention (full KV-cache compatibility) with a lightweight reverse Mamba SSM sidecar that supplies compressed right-side context. After 60B-token continued pretraining on Qwen3-1.7B, the paper claims R2LM delivers 2.4×–12.9× higher throughput than bidirectional dLLMs and 1.9×–2.9× speedup over AR baselines in batch serving, while exceeding the causal baseline on most benchmarks and surpassing the bidirectional dLLM on average.

Significance. If the central claims hold, the work would be significant for practical deployment of diffusion LMs: it offers a concrete mechanism to retain bidirectional quality while preserving KV-cache compatibility and parallel decoding, directly addressing a load-bearing architectural tradeoff. The explicit use of a sidecar SSM for compressed future context, combined with reported speedups in batch-serving scenarios, would constitute a useful contribution if supported by ablations and controls. The paper's strength lies in framing the problem as an asymmetric-context design choice rather than a pure architectural incompatibility.

major comments (2)

[Abstract] The central claim that the reverse Mamba sidecar supplies 'sufficient' right-side context to exceed causal baselines and match bidirectional dLLMs on average rests on unshown ablations (state size, update frequency across diffusion steps, right-context fidelity). Without these, it is impossible to verify that the compressed state does not lose critical future tokens or introduce measurable overhead during parallel unmasking, which directly undermines the throughput (2.4×–12.9×) and quality assertions.
[Abstract] The manuscript states performance numbers (throughput multipliers, benchmark exceedances) but supplies no experimental details, benchmarks, error bars, or controls. This is load-bearing because the soundness of the claim that mechanism (b) resolves the dilemma without degradation cannot be assessed from the available text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract and the need for supporting evidence. We agree that the current abstract presentation does not make the central claims fully verifiable on its own and will revise the manuscript accordingly.

Referee: [Abstract] the central claim that the reverse Mamba sidecar supplies 'sufficient' right-side context to exceed causal baselines and match bidirectional dLLMs on average rests on unshown ablations (state size, update frequency across diffusion steps, right-context fidelity). Without these, it is impossible to verify that the compressed state does not lose critical future tokens or introduce measurable overhead during parallel unmasking, which directly undermines the throughput (2.4×–12.9×) and quality assertions.

Authors: We accept this point. The abstract summarizes results without including the requested ablations on Mamba state size, update frequency per diffusion step, or quantitative right-context fidelity metrics. In revision we will add a dedicated ablation subsection (or expanded paragraph in the experimental section) reporting these controls, including measurements of token recovery accuracy as a function of state dimension and any overhead during parallel unmasking. This will allow direct verification that the sidecar preserves sufficient future context. revision: yes
Referee: [Abstract] The manuscript states performance numbers (throughput multipliers, benchmark exceedances) but supplies no experimental details, benchmarks, error bars, or controls. This is load-bearing because the soundness of the claim that mechanism (b) resolves the dilemma without degradation cannot be assessed from the available text.

Authors: We agree that the abstract alone does not supply the necessary experimental details. The full manuscript reports continued pretraining on Qwen3-1.7B with 60B tokens and batch-serving throughput measurements, but lacks explicit benchmark lists, error bars across seeds, and control ablations in the summary. In the revised version we will expand the abstract with a concise experimental summary (benchmarks, number of runs, error reporting) and ensure the main text contains the full controls and statistical details so that the quality and throughput claims can be evaluated directly. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical architecture claims are self-contained

The paper proposes R2LM as a new asymmetric context mechanism for dLLMs, motivated by the stated bidirectional-causal tradeoff, and reports experimental throughput and quality results after continued pretraining. No equations, parameter fits, self-citations, or uniqueness theorems appear in the provided text that reduce the performance numbers or design choices to inputs by construction. The central claims rest on external benchmark results rather than on renaming or re-deriving prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review is abstract-only; no free parameters, axioms, or invented entities can be audited beyond the high-level design choice stated in the abstract.

axioms (1)

domain assumption: A lightweight reverse Mamba SSM sidecar can supply useful compressed right-side context while preserving KV cache compatibility
Core premise of the R2LM design as described in the abstract

invented entities (2)

Bifocal dLLM paradigm · no independent evidence
purpose: Resolving bidirectional vs causal attention trade-off via asymmetric context
New framing introduced in the paper
R2LM (Right-to-Left Mamba) · no independent evidence
purpose: Concrete instantiation using causal attention plus reverse Mamba sidecar
New model variant proposed

pith-pipeline@v0.9.1-grok · 5841 in / 1199 out tokens · 68476 ms · 2026-06-29T03:28:21.000544+00:00 · methodology

