Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation
Pith reviewed 2026-06-29 03:28 UTC · model grok-4.3
The pith
Bifocal dLLMs use causal attention plus a reverse Mamba sidecar to enable efficient parallel generation with full KV cache compatibility.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bifocal dLLMs instantiate asymmetric bidirectional context in R2LM by pairing standard causal attention for left-context precision and KV cache compatibility with a lightweight reverse Mamba SSM sidecar that supplies compressed right-side context, resolving the architectural dilemma in discrete diffusion models.
What carries the argument
The R2LM (Right-to-Left Mamba) mechanism, which uses a reverse Mamba SSM sidecar to provide compressed right-side context alongside causal attention.
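To make the asymmetry concrete, here is a minimal sketch of the idea in plain Python. It is not the paper's implementation. The constant-decay recurrence is a toy stand-in for a Mamba SSM, and the additive fusion, the strictly-right summary, and the names `causal_attention`, `reverse_scan`, and `bifocal_layer` are all assumptions made for illustration.

```python
import math


def causal_attention(x):
    """Single-head softmax self-attention with q = k = v = x.

    Position i only sees positions 0..i (its left context and itself).
    """
    d = len(x[0])
    out = []
    for i in range(len(x)):
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * x[j][c] for j, w in enumerate(weights)) / total
                    for c in range(d)])
    return out


def reverse_scan(x, decay=0.9):
    """Toy right-to-left recurrence: h[i] = decay * h[i + 1] + x[i + 1].

    h[i] is a fixed-size summary of the tokens strictly to the right of i.
    A real Mamba layer uses learned, input-dependent parameters instead of
    a constant decay; this is only a stand-in (assumption).
    """
    n, d = len(x), len(x[0])
    h = [[0.0] * d for _ in range(n)]
    for i in range(n - 2, -1, -1):
        h[i] = [decay * h[i + 1][c] + x[i + 1][c] for c in range(d)]
    return h


def bifocal_layer(x, decay=0.9):
    """Assumed fusion: add the right-context summary to the causal output."""
    left = causal_attention(x)
    right = reverse_scan(x, decay)
    return [[a + b for a, b in zip(l_row, r_row)]
            for l_row, r_row in zip(left, right)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
left_only = causal_attention(tokens)   # precise, left context only
right_only = reverse_scan(tokens)      # compressed, right context only
fused = bifocal_layer(tokens)          # both views combined per position
```

In this sketch, editing the last token leaves `left_only` at earlier positions untouched while changing `right_only` there, which is the split the "bifocal" framing describes: exact attention to the left, a lossy fixed-size summary to the right.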
If this is right
- R2LM achieves 2.4× to 12.9× higher throughput than bidirectional dLLMs in batch serving.
- It provides 1.9× to 2.9× speedup over AR baselines through parallel decoding with KV caching.
- R2LM exceeds the causal baseline on most benchmarks.
- It surpasses the bidirectional dLLM on average quality metrics.
Where Pith is reading between the lines
- This design could be adapted to other state space models for context compression in generative tasks.
- Longer sequence lengths might benefit more from the compressed right context approach.
- The method opens a path to hybrid attention-SSM architectures in diffusion models.
- Batch serving efficiency gains may extend to other non-causal generation frameworks.
Load-bearing premise
That a lightweight reverse Mamba SSM sidecar can supply sufficient compressed right-side context without meaningful overhead or quality degradation while preserving KV cache compatibility.
What would settle it
Running the model without the reverse Mamba sidecar and measuring if generation quality drops to causal baseline levels or if throughput gains disappear in batch serving experiments.
Original abstract
Discrete diffusion language models (dLLMs) recover masked tokens in parallel, offering significant speedups over autoregressive (AR) generation. However, such promising frameworks face a fundamental architectural design dilemma: (1) Adopting bidirectional attention achieves strong generation quality by allowing each position to access the full context, but is inherently incompatible with KV caching, limiting inference throughput in batch-serving scenarios; (2) Conversely, causal attention enables efficient cached inference but loses all right-side context, substantially degrading generation quality. This paper introduces Bifocal dLLMs, a new paradigm that resolves this dilemma through asymmetric bidirectional context. Analogous to bifocal lenses, we instantiate the paradigm as R2LM (Right-to-Left Mamba), which combines two complementary mechanisms: (a) standard causal attention providing precise left-context with full KV cache compatibility, while (b) a lightweight reverse Mamba SSM sidecar supplying compressed right-side context without breaking cacheability. Comprehensive experiments on continued pretraining of Qwen3-1.7B with 60B tokens demonstrate that R2LM achieves 2.4× to 12.9× higher throughput than bidirectional dLLMs and 1.9× to 2.9× speedup over AR baselines in batch serving through parallel decoding with KV caching, while exceeding the causal baseline on most benchmarks and surpassing the bidirectional dLLM on average.
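The dilemma in the abstract can be checked on a toy example. In a multi-layer model, each layer's outputs feed the next layer's keys and values, so cached entries for earlier positions stay valid only if those outputs do not change when a token further right is rewritten (for example, when a mask is filled in). The sketch below is a single-head, projection-free simplification, and the helper names are hypothetical. It compares causal and bidirectional attention on exactly that property.

```python
import math


def attention(x, causal):
    """Single-head self-attention with q = k = v = x (no learned projections)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        visible = range(i + 1) if causal else range(n)
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
                  for j in visible]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * x[j][c] for w, j in zip(weights, visible)) / total
                    for c in range(d)])
    return out


def earlier_outputs_unchanged(x, new_last, causal):
    """True if rewriting the last token leaves all earlier outputs intact."""
    before = attention(x, causal)[:-1]
    after = attention(x[:-1] + [new_last], causal)[:-1]
    return all(abs(a - b) < 1e-12
               for row_a, row_b in zip(before, after)
               for a, b in zip(row_a, row_b))


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
new_last = [0.0, 3.0]  # stands in for a right-side token being unmasked

causal_unchanged = earlier_outputs_unchanged(tokens, new_last, causal=True)
bidirectional_unchanged = earlier_outputs_unchanged(tokens, new_last, causal=False)
```

Here `causal_unchanged` is `True` and `bidirectional_unchanged` is `False`: under the causal mask the earlier positions never look right, so their cached state survives, while under full attention every earlier output shifts and would have to be recomputed.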
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Bifocal dLLMs, a paradigm for discrete diffusion language models that resolves the bidirectional-vs-causal attention dilemma via asymmetric bidirectional context. It instantiates the paradigm as R2LM (Right-to-Left Mamba), which pairs standard causal attention (full KV-cache compatibility) with a lightweight reverse Mamba SSM sidecar that supplies compressed right-side context. After 60B-token continued pretraining on Qwen3-1.7B, the paper claims R2LM delivers 2.4×–12.9× higher throughput than bidirectional dLLMs and 1.9×–2.9× speedup over AR baselines in batch serving, while exceeding the causal baseline on most benchmarks and surpassing the bidirectional dLLM on average.
Significance. If the central claims hold, the work would be significant for practical deployment of diffusion LMs: it offers a concrete mechanism to retain bidirectional quality while preserving KV-cache compatibility and parallel decoding, directly addressing a load-bearing architectural tradeoff. The explicit use of a sidecar SSM for compressed future context, combined with reported speedups in batch-serving scenarios, would constitute a useful contribution if supported by ablations and controls. The paper's strength lies in framing the problem as an asymmetric-context design choice rather than a pure architectural incompatibility.
major comments (2)
- [Abstract] The central claim that the reverse Mamba sidecar supplies 'sufficient' right-side context to exceed causal baselines and match bidirectional dLLMs on average rests on unshown ablations (state size, update frequency across diffusion steps, right-context fidelity). Without these, it is impossible to verify that the compressed state does not lose critical future tokens or introduce measurable overhead during parallel unmasking, which directly undermines the throughput (2.4×–12.9×) and quality assertions.
- [Abstract] The manuscript states performance numbers (throughput multipliers, benchmark exceedances) but supplies no experimental details, benchmarks, error bars, or controls. This is load-bearing because the soundness of the claim that mechanism (b) resolves the dilemma without degradation cannot be assessed from the available text.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on the abstract and the need for supporting evidence. We agree that the current abstract presentation does not make the central claims fully verifiable on its own and will revise the manuscript accordingly.
point-by-point responses
- Referee: [Abstract] the central claim that the reverse Mamba sidecar supplies 'sufficient' right-side context to exceed causal baselines and match bidirectional dLLMs on average rests on unshown ablations (state size, update frequency across diffusion steps, right-context fidelity). Without these, it is impossible to verify that the compressed state does not lose critical future tokens or introduce measurable overhead during parallel unmasking, which directly undermines the throughput (2.4×–12.9×) and quality assertions.
  Authors: We accept this point. The abstract summarizes results without including the requested ablations on Mamba state size, update frequency per diffusion step, or quantitative right-context fidelity metrics. In revision we will add a dedicated ablation subsection (or expanded paragraph in the experimental section) reporting these controls, including measurements of token recovery accuracy as a function of state dimension and any overhead during parallel unmasking. This will allow direct verification that the sidecar preserves sufficient future context.
  Revision: yes
- Referee: [Abstract] The manuscript states performance numbers (throughput multipliers, benchmark exceedances) but supplies no experimental details, benchmarks, error bars, or controls. This is load-bearing because the soundness of the claim that mechanism (b) resolves the dilemma without degradation cannot be assessed from the available text.
  Authors: We agree that the abstract alone does not supply the necessary experimental details. The full manuscript reports continued pretraining on Qwen3-1.7B with 60B tokens and batch-serving throughput measurements, but lacks explicit benchmark lists, error bars across seeds, and control ablations in the summary. In the revised version we will expand the abstract with a concise experimental summary (benchmarks, number of runs, error reporting) and ensure the main text contains the full controls and statistical details so that the quality and throughput claims can be evaluated directly.
  Revision: yes
Circularity Check
No circularity; empirical architecture claims are self-contained
full rationale
The paper proposes R2LM as a new asymmetric context mechanism for dLLMs, motivated by the stated bidirectional-causal tradeoff, and reports experimental throughput and quality results after continued pretraining. No equations, parameter fits, self-citations, or uniqueness theorems appear in the provided text that reduce the performance numbers or design choices to inputs by construction. The central claims rest on external benchmarks and ablations rather than renaming or re-deriving prior results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: a lightweight reverse Mamba SSM sidecar can supply useful compressed right-side context while preserving KV cache compatibility.
invented entities (2)
- Bifocal dLLM paradigm: no independent evidence
- R2LM (Right-to-Left Mamba): no independent evidence