pith. machine review for the scientific record.

arxiv: 2604.16514 · v4 · submitted 2026-04-15 · 💻 cs.CV · cs.LG

Recognition: unknown

BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:25 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords vision-language models · diffusion vision-language models · autoregressive models · block merging · distillation · multimodal capability · decoding efficiency · parallel decoding

The pith

A pretrained autoregressive vision-language model can be bridged to a large-block diffusion version that preserves multimodal strength and decodes up to three times faster using progressive merging and distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that autoregressive vision-language models, strong in multimodal tasks but slow due to sequential decoding, can be converted into diffusion versions that decode in parallel for much higher speed. The conversion uses progressive supervised block merging to gradually enlarge the decoding blocks from small to large sizes, while applying stage-wise distillation inside the diffusion model from a fixed small-block version to restore any lost quality. Direct distillation from the autoregressive model to the diffusion one is shown to be poorly aligned and potentially damaging, whereas keeping the distillation within diffusion models proves reliable. With at most 4.4 million data points, this produces models at 4 billion and 8 billion parameters that set new performance records among open diffusion vision-language models while running up to three times faster. A reader might care because this offers a practical way to gain the speed benefits of diffusion approaches without sacrificing the capabilities built into existing autoregressive models.
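The throughput argument above can be made concrete with a back-of-the-envelope count of forward passes. The sketch below is illustrative only, not the paper's code; the block size, the number of denoising passes per block, and the resulting ratio are assumed numbers, and real wall-clock speedup (the paper reports up to 3×) is lower than the pass-count ratio because one block-wide pass costs more than one single-token pass.

```python
# Hypothetical illustration: why block-parallel diffusion decoding can beat
# token-by-token autoregressive decoding. An AR model needs one forward pass
# per generated token; a block dVLM denoises a whole block of masked tokens
# over a handful of refinement passes.

def ar_forward_passes(seq_len: int) -> int:
    """One forward pass per generated token."""
    return seq_len

def block_diffusion_forward_passes(seq_len: int, block_size: int, denoise_steps: int) -> int:
    """Each block of `block_size` tokens is denoised in `denoise_steps` parallel passes."""
    num_blocks = -(-seq_len // block_size)  # ceiling division
    return num_blocks * denoise_steps

if __name__ == "__main__":
    n = 512                                  # tokens to generate (assumed)
    ar = ar_forward_passes(n)
    # Assumed numbers: 32-token blocks, ~5 denoising passes per block.
    dv = block_diffusion_forward_passes(n, block_size=32, denoise_steps=5)
    print(f"AR: {ar} passes, block diffusion: {dv} passes, ratio {ar / dv:.1f}x")
```

The pass-count ratio is an upper bound on speedup; the gap between it and the measured 3× reflects the higher per-pass cost of denoising a full block.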

Core claim

Progressive supervised block merging gradually enlarges the decoding block size, while stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor recovers the performance lost at larger blocks. This transfers strong multimodal capability to a large-block dVLM that sets new SOTA results among comparable open models at 4B and 8B scales and decodes up to 3 times faster, using no more than 4.4 million data points.

What carries the argument

Progressive supervised block merging combined with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor, which carries the conversion by enlarging blocks step by step and recovering quality within the diffusion regime.
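As a hedged sketch of how such a pipeline might be organized: the abstract does not give the actual stage sizes or loss form, so the doubling schedule, the anchor block size of 4, the `distill_weight`, and all names below are assumptions for illustration, not the authors' method.

```python
# Minimal sketch of a progressive block-merging schedule with a FIXED
# small-block diffusion anchor as the distillation teacher (never the AR
# model, per the paper's key finding). All concrete values are assumed.

from dataclasses import dataclass

@dataclass
class Stage:
    block_size: int        # decoding block size trained at this stage
    distill_from: int      # block size of the fixed diffusion anchor
    distill_weight: float  # weight on the intra-dVLM distillation term

def progressive_schedule(start: int = 4, target: int = 32, anchor_block: int = 4,
                         distill_weight: float = 0.5) -> list[Stage]:
    """Double the block size each stage; the anchor stays at the small block."""
    stages, b = [], start
    while b <= target:
        stages.append(Stage(b, anchor_block, distill_weight))
        b *= 2
    return stages

# Per stage, the training loss would combine supervision and distillation,
# schematically: loss = ce(student) + distill_weight * kl(student, anchor).
for stage in progressive_schedule():
    print(stage)
```

The point of the fixed anchor is that the teacher and student share the diffusion regime, which the paper reports is better aligned than distilling directly from the autoregressive source model.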

Load-bearing premise

Progressive supervised block merging with stage-wise distillation from a small-block diffusion anchor will reliably recover performance when decoding blocks are enlarged.

What would settle it

If experiments show that the large-block diffusion model underperforms the original autoregressive model on key multimodal benchmarks such as visual question answering or image captioning, the bridging approach would be shown not to preserve capabilities as claimed.

Figures

Figures reproduced from arXiv: 2604.16514 by Baoyou Chen, Hanchen Xia, Haojun Shi, Liwei Zhang, Peng Tu, Siyu Zhu, Weihao Yuan.

Figure 1. Quality–efficiency comparison of BARD-VL and representative open dVLMs. Left: radar chart on seven multimodal …

Figure 2. The autoregressive VLM produces next-token logits …

Figure 3. Overview of BARD-VL. (a) Pipeline. We use the packed training layout …

Figure 4. Comparison on document understanding. Compared with prior diffusion VLMs, BARD-VL is consistently competitive and usually stronger across the reported suite. BARD-VL 8B outperforms LLaDA-V-8B on all seven benchmarks, and BARD-VL 4B also surpasses Dimple-VL on all seven. Relative to Dream-VL and SDAR-VL, BARD-VL 4B wins on six of seven benchmarks, with the only exception being ChartQA. Overall, these comp…
read the original abstract

Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with $\leq$ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3$\times$ decoding throughput speedup compared to the source model. Code is available at https://github.com/fudan-generative-vision/Bard-VL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes BARD, a bridging framework to convert pretrained autoregressive VLMs (e.g., Qwen3-VL) into same-architecture large-block diffusion VLMs (dVLMs). It combines progressive supervised block merging to enlarge decoding blocks with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor, plus a mixed noise scheduler and memory-friendly training. With ≤4.4M data, it claims to transfer strong multimodal capability, achieve new SOTA among comparable open dVLMs at 4B/8B scales, and deliver up to 3× decoding speedup. A key observation is that direct AR-to-diffusion distillation is poorly aligned while intra-dVLM distillation is effective.

Significance. If the empirical results hold under scrutiny, the work demonstrates a practical, data-efficient route to inference-efficient diffusion VLMs without full retraining from scratch. The code release and external baseline comparisons are strengths that support reproducibility. The reported distinction between intra- versus inter-regime distillation, if backed by controlled experiments, could guide hybrid model design in multimodal generation.

major comments (2)
  1. [Abstract] The headline SOTA and speedup claims rest on the assertion that progressive supervised block merging plus stage-wise distillation from one fixed small-block anchor reliably recovers the quality lost when enlarging blocks. Yet no sensitivity analysis on anchor choice, no failure-mode cases, and no ablation isolating progressive merging versus direct large-block training are referenced, leaving the central transfer claim without visible load-bearing validation.
  2. [Abstract] The manuscript reports concrete SOTA numbers on an 'evaluation suite' at 4B/8B scales but provides no visible details on benchmark composition, error bars, data-exclusion criteria, or post-hoc selection effects, which directly affects assessment of whether the limited-data transfer result generalizes.
minor comments (1)
  1. The abstract refers to 'our evaluation suite' without enumerating the specific tasks or datasets; adding this list would improve clarity for readers assessing the SOTA claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract] The headline SOTA and speedup claims rest on the assertion that progressive supervised block merging plus stage-wise distillation from one fixed small-block anchor reliably recovers the quality lost when enlarging blocks. Yet no sensitivity analysis on anchor choice, no failure-mode cases, and no ablation isolating progressive merging versus direct large-block training are referenced, leaving the central transfer claim without visible load-bearing validation.

    Authors: We agree that the central claim would benefit from additional controlled validation. In the revised manuscript we will add (i) a sensitivity study varying the small-block anchor size and initialization, (ii) an explicit ablation that trains large-block dVLMs directly versus via progressive merging, and (iii) a short discussion of observed failure modes and edge cases. These additions will provide clearer empirical support for the effectiveness of the progressive-plus-intra-distillation pipeline. revision: yes

  2. Referee: [Abstract] The manuscript reports concrete SOTA numbers on an 'evaluation suite' at 4B/8B scales but provides no visible details on benchmark composition, error bars, data-exclusion criteria, or post-hoc selection effects, which directly affects assessment of whether the limited-data transfer result generalizes.

    Authors: We acknowledge that greater transparency is required. In the revised version we will expand the experimental section to explicitly list the full composition of the evaluation suite, report standard deviations or error bars from repeated runs where available, state the data-exclusion criteria used, and discuss any post-hoc selection procedures. These clarifications will allow readers to better assess the generalizability of the limited-data transfer results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical engineering results stand on external benchmarks

full rationale

The manuscript describes an empirical bridging procedure (progressive supervised block merging plus intra-dVLM stage-wise distillation) whose performance claims are evaluated on held-out benchmarks and compared against independently trained open dVLMs. No equations, fitted parameters, or self-citations are invoked that would reduce the reported SOTA scores, throughput gains, or the key observation about intra- versus inter-regime distillation to quantities defined inside the same training run. The claims are therefore grounded in external data, and the code release supports independent verification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical ML engineering paper whose central claims rest on experimental outcomes rather than formal axioms or new physical entities. No free parameters, axioms, or invented entities are explicitly introduced in the abstract beyond standard training hyperparameters and the choice of small-block anchor model.

pith-pipeline@v0.9.0 · 5619 in / 1242 out tokens · 28375 ms · 2026-05-10T14:25:26.977504+00:00 · methodology


Reference graph

Works this paper leans on

29 extracted references · 27 canonical work pages · 7 internal anchors

  1. [1]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

  2. [2]

    Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

    Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models.arXiv preprint arXiv:2503.09573, 2025

  3. [3]

    Encoder-Decoder Diffusion Language Models for Efficient Training and Inference

    Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, and Volodymyr Kuleshov. Encoder-decoder diffusion language models for efficient training and inference.arXiv preprint arXiv:2510.22852, 2025

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

    Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, and Bowen Zhou. Sdar-vl: Stable and efficient block-wise diffusion for vision-language understanding.arXiv preprint arXiv:2512.14068, 2025

  6. [6]

    Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

    Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, et al. Efficient-dlm: From autoregressive to diffusion language models, and beyond in speed.arXiv preprint arXiv:2512.14067, 2025

  7. [7]

    Scaling diffusion language models via adaptation from autoregressive models

    Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891, 2024

  8. [8]

    MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

    Haoyu He, Katrin Renz, Yong Cao, and Andreas Geiger. Mdpo: Overcoming the training-inference divide of masked diffusion language models.arXiv preprint arXiv:2508.13148, 2025

  9. [9]

    Accelerating Diffusion LLMs via Adaptive Parallel Decoding

    Daniel Israel, Guy Van den Broeck, and Aditya Grover. Accelerating diffusion llms via adaptive parallel decoding.arXiv preprint arXiv:2506.00413, 2025

  10. [10]

    From denoising to refining: A corrective framework for vision-language diffusion model

    Yatai Ji, Teng Wang, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, and Ping Luo. From denoising to refining: A corrective framework for vision-language diffusion model.arXiv preprint arXiv:2510.19871, 2025

  11. [11]

    LaViDa: A Large Diffusion Language Model for Multimodal Understanding

    Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. Lavida: A large diffusion language model for multimodal understanding.arXiv preprint arXiv:2505.16839, 2025

  12. [12]

    RIV: Recursive Introspection Mask Diffusion Vision Language Model

    YuQian Li, Limeng Qiao, and Lin Ma. Riv: Recursive introspection mask diffusion vision language model.arXiv preprint arXiv:2509.23625, 2025

  13. [13]

    Refusion: Enabling large-size realistic image restoration with latent-space diffusion models

    Ziwei Luo, Fredrik K Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B Schön. Refusion: Enabling large-size realistic image restoration with latent-space diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1680–1691, 2023

  14. [14]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025

  15. [15]

    Simple and Effective Masked Diffusion Language Models

    Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

  16. [16]

    From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs

    Yuchuan Tian, Yuchen Liang, Shuo Zhang, Yingte Shu, Guangwen Yang, Wei He, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, et al. From next-token to next-block: A principled adaptation path for diffusion llms.arXiv preprint arXiv:2512.06776, 2025

  17. [17]

    Generalized Interpolating Discrete Diffusion

    Dimitri Von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, and Thomas Hofmann. Generalized interpolating discrete diffusion.arXiv preprint arXiv:2503.04482, 2025

  18. [18]

    Fudoki: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

    Jin Wang, Yao Lai, Aoxue Li, Shifeng Zhang, Jiacheng Sun, Ning Kang, Chengyue Wu, Zhenguo Li, and Ping Luo. Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities.arXiv preprint arXiv:2505.20147, 2025

  19. [19]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  20. [20]

    FineVision: Open Data Is All You Need

    Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, and Andrés Marafioti. Finevision: Open data is all you need.arXiv preprint arXiv:2510.17269, 2025

  21. [21]

    Fast-dLLM v2: Efficient Block-Diffusion LLM

    Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

  22. [22]

    T*: Progressive Block Scaling for MDM through Trajectory-Aware RL

    Hanchen Xia, Baoyou Chen, Yutang Ge, Guojiang Zhao, and Siyu Zhu. T*: Progressive block scaling for mdm through trajectory aware rl.arXiv preprint arXiv:2601.11214, 2026

  23. [23]

    Mmada: Multimodal large diffusion language models

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809, 2025

  24. [24]

    Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

    Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, and Lingpeng Kong. Dream-VL & Dream-VLA: Open vision-language and vision-language-action models with diffusion language model backbone. arXiv preprint arXiv:2512.22615, 2025

  25. [25]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025

  26. [26]

    LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

    Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. LLaDA-V: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933, 2025

  27. [27]

    Dimple: Discrete diffusion multimodal large language model with parallel decoding

    Runpeng Yu, Xinyin Ma, and Xinchao Wang. Dimple: Discrete diffusion multimodal large language model with parallel decoding. arXiv preprint arXiv:2505.16990, 2025

  28. [28]

    GLM-5: from Vibe Coding to Agentic Engineering

    Aohan Zeng et al. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

  29. [29]

    DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

    Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, and Xinggang Wang. DiffusionVL: Translating any autoregressive models into diffusion vision language models. arXiv preprint arXiv:2512.15713, 2025