BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
Pith reviewed 2026-05-10 14:25 UTC · model grok-4.3
The pith
Using progressive block merging and stage-wise distillation, a pretrained autoregressive vision-language model can be bridged to a large-block diffusion version that preserves multimodal strength and decodes up to three times faster.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Progressive supervised block merging gradually enlarges the decoding block size, while stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor recovers the performance lost at larger blocks; together they transfer strong multimodal capability to a large-block dVLM that sets new SOTA results among comparable open models at 4B and 8B scales and decodes up to 3 times faster, using no more than 4.4 million data points.
What carries the argument
Progressive supervised block merging combined with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor: the merging enlarges decoding blocks step by step, and the distillation recovers quality within the diffusion regime rather than across regimes.
Load-bearing premise
Progressive supervised block merging with stage-wise distillation from a small-block diffusion anchor will reliably recover performance when decoding blocks are enlarged.
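A minimal sketch of how such a bridging loop could be organized, assuming a model that already operates as a small-block diffusion model and accepts a block-size argument; the schedule, noise process, and loss weights below are illustrative placeholders rather than the paper's settings:

```python
# Sketch only (not the BARD code): control flow for progressive block merging
# with stage-wise distillation from a frozen small-block diffusion anchor.
# The `block_size` argument on the model is an assumption for illustration.
import copy
import torch
import torch.nn.functional as F

def noisy_copy(tokens, mask_ratio=0.5, mask_id=0):
    """Mask a random fraction of tokens (toy stand-in for the diffusion noise process)."""
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    return tokens.masked_fill(mask, mask_id), mask

def stage_loss(student, anchor, tokens, block_size, anchor_block=4,
               distill_weight=1.0, T=2.0):
    """Supervised masked-denoising loss at the enlarged block size, plus KL
    distillation from the frozen small-block anchor on the same noisy input,
    so distillation stays inside the diffusion regime."""
    noisy, mask = noisy_copy(tokens)
    logits = student(noisy, block_size=block_size)            # (batch, seq, vocab)
    supervised = F.cross_entropy(logits[mask], tokens[mask])
    with torch.no_grad():
        teacher_logits = anchor(noisy, block_size=anchor_block)
    kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return supervised + distill_weight * kd

def bridge(student, data_loader, block_schedule=(4, 16, 64), anchor_block=4):
    """Progressive block merging: train through increasing block sizes, always
    distilling from a frozen copy of the small-block diffusion model."""
    anchor = copy.deepcopy(student).eval()   # in practice: the model after small-block conversion
    for p in anchor.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
    for block_size in block_schedule:        # e.g. 4 -> 16 -> 64 tokens per block
        for tokens in data_loader:
            loss = stage_loss(student, anchor, tokens, block_size, anchor_block)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

The design point the sketch illustrates is that the distillation target is another diffusion model (the frozen small-block anchor), not the original autoregressive teacher, matching the paper's reported finding that intra-diffusion distillation aligns better than AR-to-diffusion distillation.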
What would settle it
If experiments show that the large-block diffusion model underperforms the original autoregressive model on key multimodal benchmarks such as visual question answering or image captioning, the bridging approach would be shown not to preserve capabilities as claimed.
read the original abstract
Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with $\leq$ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3$\times$ decoding throughput speedup compared to the source model. Code is available at https://github.com/fudan-generative-vision/Bard-VL.
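As a rough illustration of where the claimed decoding speedup could come from, the back-of-envelope count below compares forward passes for token-by-token decoding against block-wise parallel denoising; the block size and denoising-step count are assumptions for illustration, not the paper's measured configuration:

```python
# Back-of-envelope only: counts decoder forward passes, ignoring differences
# in per-pass cost. An AR decoder needs roughly one pass per generated token;
# a block-diffusion decoder needs about `denoise_steps` passes per block.
def decoding_passes(seq_len: int, block_size: int = 1, denoise_steps: int = 1) -> int:
    blocks = -(-seq_len // block_size)  # ceiling division
    return blocks * denoise_steps

seq_len = 256
ar_passes = decoding_passes(seq_len)                                      # 256
dvlm_passes = decoding_passes(seq_len, block_size=32, denoise_steps=10)   # 80
print(f"ideal speedup ~ {ar_passes / dvlm_passes:.1f}x")                  # ~3.2x under these assumptions
```

Real throughput also depends on attention patterns, KV-cache reuse, and batching, so the up-to-3x figure is an empirical measurement rather than a consequence of this counting argument.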
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes BARD, a bridging framework to convert pretrained autoregressive VLMs (e.g., Qwen3-VL) into same-architecture large-block diffusion VLMs (dVLMs). It combines progressive supervised block merging to enlarge decoding blocks with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor, plus a mixed noise scheduler and memory-friendly training. With ≤4.4M data, it claims to transfer strong multimodal capability, achieve new SOTA among comparable open dVLMs at 4B/8B scales, and deliver up to 3× decoding speedup. A key observation is that direct AR-to-diffusion distillation is poorly aligned while intra-dVLM distillation is effective.
Significance. If the empirical results hold under scrutiny, the work demonstrates a practical, data-efficient route to inference-efficient diffusion VLMs without full retraining from scratch. The code release and external baseline comparisons are strengths that support reproducibility. The reported distinction between intra- versus inter-regime distillation, if backed by controlled experiments, could guide hybrid model design in multimodal generation.
major comments (2)
- [Abstract] The headline SOTA and speedup claims rest on the assertion that progressive supervised block merging plus stage-wise distillation from one fixed small-block anchor reliably recovers the quality lost when enlarging blocks, yet the abstract references no sensitivity analysis on anchor choice, no failure-mode cases, and no ablation isolating progressive merging from direct large-block training, leaving the central transfer claim without visible load-bearing validation.
- [Abstract] The manuscript reports concrete SOTA numbers on an 'evaluation suite' at 4B/8B scales but provides no visible details on benchmark composition, error bars, data-exclusion criteria, or post-hoc selection effects, which directly affects assessment of whether the limited-data transfer result generalizes.
minor comments (1)
- The abstract refers to 'our evaluation suite' without enumerating the specific tasks or datasets; adding this list would improve clarity for readers assessing the SOTA claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our results.
read point-by-point responses
-
Referee: [Abstract] The headline SOTA and speedup claims rest on the assertion that progressive supervised block merging plus stage-wise distillation from one fixed small-block anchor reliably recovers the quality lost when enlarging blocks, yet the abstract references no sensitivity analysis on anchor choice, no failure-mode cases, and no ablation isolating progressive merging from direct large-block training, leaving the central transfer claim without visible load-bearing validation.
Authors: We agree that the central claim would benefit from additional controlled validation. In the revised manuscript we will add (i) a sensitivity study varying the small-block anchor size and initialization, (ii) an explicit ablation that trains large-block dVLMs directly versus via progressive merging, and (iii) a short discussion of observed failure modes and edge cases. These additions will provide clearer empirical support for the effectiveness of the progressive-plus-intra-distillation pipeline. revision: yes
-
Referee: [Abstract] The manuscript reports concrete SOTA numbers on an 'evaluation suite' at 4B/8B scales but provides no visible details on benchmark composition, error bars, data-exclusion criteria, or post-hoc selection effects, which directly affects assessment of whether the limited-data transfer result generalizes.
Authors: We acknowledge that greater transparency is required. In the revised version we will expand the experimental section to explicitly list the full composition of the evaluation suite, report standard deviations or error bars from repeated runs where available, state the data-exclusion criteria used, and discuss any post-hoc selection procedures. These clarifications will allow readers to better assess the generalizability of the limited-data transfer results. revision: yes
Circularity Check
No significant circularity; empirical engineering results stand on external benchmarks
full rationale
The manuscript describes an empirical bridging procedure (progressive supervised block merging plus intra-dVLM stage-wise distillation) whose performance claims are evaluated on held-out benchmarks and compared against independently trained open dVLMs. No equations, fitted parameters, or self-citations are invoked that would reduce the reported SOTA scores, throughput gains, or the key observation about intra- versus inter-regime distillation to quantities defined inside the same training run. The claims therefore rest on external benchmark data and a public code release rather than on internally defined quantities.
Reference graph
Works this paper leans on
-
[1]
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025
-
[2]
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573, 2025
-
[3]
Encoder-Decoder Diffusion Language Models for Efficient Training and Inference
Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, and Volodymyr Kuleshov. Encoder-decoder diffusion language models for efficient training and inference. arXiv preprint arXiv:2510.22852, 2025
-
[4]
Qwen3-VL Technical Report
Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025
-
[5]
SDAR-VL: Stable and Efficient Block-Wise Diffusion for Vision-Language Understanding
Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, and Bowen Zhou. Sdar-vl: Stable and efficient block-wise diffusion for vision-language understanding. arXiv preprint arXiv:2512.14068, 2025
-
[6]
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, et al. Efficient-dlm: From autoregressive to diffusion language models, and beyond in speed. arXiv preprint arXiv:2512.14067, 2025
-
[7]
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891, 2024
-
[8]
MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models
Haoyu He, Katrin Renz, Yong Cao, and Andreas Geiger. Mdpo: Overcoming the training-inference divide of masked diffusion language models. arXiv preprint arXiv:2508.13148, 2025
-
[9]
Accelerating Diffusion LLMs via Adaptive Parallel Decoding
Daniel Israel, Guy Van den Broeck, and Aditya Grover. Accelerating diffusion llms via adaptive parallel decoding. arXiv preprint arXiv:2506.00413, 2025
-
[10]
From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
Yatai Ji, Teng Wang, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, and Ping Luo. From denoising to refining: A corrective framework for vision-language diffusion model. arXiv preprint arXiv:2510.19871, 2025
-
[11]
LaViDa: A Large Diffusion Language Model for Multimodal Understanding
Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. Lavida: A large diffusion language model for multimodal understanding. arXiv preprint arXiv:2505.16839, 2025
-
[12]
RIV: Recursive Introspection Mask Diffusion Vision Language Model
YuQian Li, Limeng Qiao, and Lin Ma. Riv: Recursive introspection mask diffusion vision language model. arXiv preprint arXiv:2509.23625, 2025
-
[13]
Refusion: Enabling Large-Size Realistic Image Restoration with Latent-Space Diffusion Models
Ziwei Luo, Fredrik K Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B Schön. Refusion: Enabling large-size realistic image restoration with latent-space diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1680–1691, 2023
-
[14]
Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025
-
[15]
Simple and Effective Masked Diffusion Language Models
Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024
-
[16]
From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs
Yuchuan Tian, Yuchen Liang, Shuo Zhang, Yingte Shu, Guangwen Yang, Wei He, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, et al. From next-token to next-block: A principled adaptation path for diffusion llms. arXiv preprint arXiv:2512.06776, 2025
-
[17]
Generalized Interpolating Discrete Diffusion
Dimitri Von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann. Generalized interpolating discrete diffusion. arXiv preprint arXiv:2503.04482, 2025
-
[18]
Fudoki: Discrete Flow-Based Unified Understanding and Generation via Kinetic-Optimal Velocities
Jin Wang, Yao Lai, Aoxue Li, Shifeng Zhang, Jiacheng Sun, Ning Kang, Chengyue Wu, Zhenguo Li, and Ping Luo. Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities. arXiv preprint arXiv:2505.20147, 2025
-
[19]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025
-
[20]
FineVision: Open Data Is All You Need
Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, and Andrés Marafioti. Finevision: Open data is all you need. arXiv preprint arXiv:2510.17269, 2025
-
[21]
Fast-dLLM v2: Efficient Block-Diffusion LLM
Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm. arXiv preprint arXiv:2509.26328, 2025
-
[22]
T*: Progressive Block Scaling for MDM Through Trajectory-Aware RL
Hanchen Xia, Baoyou Chen, Yutang Ge, Guojiang Zhao, and Siyu Zhu. T*: Progressive block scaling for mdm through trajectory aware rl. arXiv preprint arXiv:2601.11214, 2026
-
[23]
MMaDA: Multimodal Large Diffusion Language Models
Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025
-
[24]
Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, and Lingpeng Kong. Dream-vl & dream-vla: Open vision-language and vision-language-action models with diffusion language model backbone. arXiv preprint arXiv:2512.22615, 2025
-
[25]
Dream 7B: Diffusion Large Language Models
Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025
-
[26]
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933, 2025
-
[27]
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
Runpeng Yu, Xinyin Ma, and Xinchao Wang. Dimple: Discrete diffusion multimodal large language model with parallel decoding. arXiv preprint arXiv:2505.16990, 2025
-
[28]
GLM-5: from Vibe Coding to Agentic Engineering
Aohan Zeng et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026
-
[29]
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, and Xinggang Wang. Diffusionvl: Translating any autoregressive models into diffusion vision language models. arXiv preprint arXiv:2512.15713, 2025