Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM
Pith reviewed 2026-05-10 17:13 UTC · model grok-4.3
The pith
Fast-dVLM converts pretrained autoregressive VLMs directly into block-diffusion models that generate tokens in parallel and, with SGLang integration and FP8 quantization, run over 6x faster end to end.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fast-dVLM is obtained by direct one-stage conversion of a pretrained autoregressive VLM into a block-diffusion model. The recipe uses block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation to support KV-cache-compatible parallel decoding and speculative block decoding. Under a comparable training budget, the direct route outperforms a two-stage approach (text-only diffusion fine-tuning, then multimodal training) because it preserves the multimodal alignment the pretrained VLM already has. The resulting generation quality matches the autoregressive baseline across eleven benchmarks.
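To make the KV-cache claim concrete, here is a minimal sketch of a block-wise attention mask of the kind commonly used in block-diffusion language models: tokens attend bidirectionally within their own block and causally to all earlier blocks. This is an illustrative assumption, not the paper's definition of causal context attention, which this page does not spell out. The function name `block_causal_mask` and its parameters are hypothetical.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key k.

    Hypothetical helper for illustration only (not from the paper).
    Positions in the same block see each other in both directions;
    positions in earlier blocks are visible as context; later blocks
    are hidden.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        # A key is visible if its block index is not after the query's block.
        mask.append([(k // block_size) <= q_block for k in range(seq_len)])
    return mask


if __name__ == "__main__":
    # Print a 6-token mask with blocks of 2 as a small grid.
    for row in block_causal_mask(6, 2):
        print("".join("x" if visible else "." for visible in row))
```

Because no position ever attends to a later block, the keys and values of completed blocks can be cached and reused. With `block_size=1` the mask reduces to the ordinary causal mask of autoregressive decoding.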
What carries the argument
The direct one-stage AR-to-block-diffusion conversion process, which applies multimodal diffusion adaptations to enable parallel token generation while retaining pretrained alignment.
If this is right
- Fast-dVLM matches autoregressive quality on eleven multimodal benchmarks while supporting parallel and speculative decoding.
- It delivers over 6x end-to-end inference speedup when combined with SGLang and FP8 quantization.
- Direct conversion proves more training-efficient than first adapting the language backbone with text-only diffusion fine-tuning and then adding multimodal training.
- The adaptations allow KV-cache reuse and block-wise parallel generation in a VLM setting that mixes continuous vision and discrete text.
- The model better utilizes hardware parallelism on edge devices running at batch size one.
Where Pith is reading between the lines
- Similar direct conversion recipes could extend block diffusion to other pretrained multimodal or language models without starting from random weights.
- The speedup at batch size one could open real-time multimodal reasoning loops in robotics and autonomous systems that current autoregressive models cannot sustain.
- The approach invites testing whether the same adaptations work when swapping the underlying vision encoder or scaling to larger model sizes.
- Combining block diffusion with other inference tricks such as speculative decoding at the block level may yield further gains beyond the reported 6x.
Load-bearing premise
The direct conversion from a pretrained autoregressive VLM preserves its multimodal alignment once the listed adaptations are applied.
What would settle it
Any of the eleven multimodal benchmarks where Fast-dVLM scores lower than the autoregressive baseline, or an end-to-end inference run that fails to reach the reported 6x speedup after SGLang integration and FP8 quantization.
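One way to run the speedup half of this check is to measure throughput under matched conditions and separate the gain from engine and quantization changes from the gain from the decoding method. The sketch below assumes three throughput measurements in tokens per second; the function names and the numbers in the usage example are placeholders, not results from the paper.

```python
def speedup(baseline_tok_per_s: float, candidate_tok_per_s: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    if baseline_tok_per_s <= 0 or candidate_tok_per_s <= 0:
        raise ValueError("throughputs must be positive")
    return candidate_tok_per_s / baseline_tok_per_s


def decompose_speedup(
    ar_plain: float, ar_optimized: float, dvlm_optimized: float
) -> dict[str, float]:
    """Split an end-to-end speedup into two multiplicative factors.

    ar_plain:       AR model without the serving optimizations
    ar_optimized:   AR model with the same engine and quantization as the dVLM
    dvlm_optimized: block-diffusion model with those optimizations
    (All three arguments are hypothetical measurements in tokens/second.)
    """
    return {
        # Gain from the engine and quantization alone.
        "engineering": speedup(ar_plain, ar_optimized),
        # Gain from the decoding method under matched conditions.
        "decoding": speedup(ar_optimized, dvlm_optimized),
        # Product of the two factors above.
        "total": speedup(ar_plain, dvlm_optimized),
    }


if __name__ == "__main__":
    # Placeholder numbers chosen only to show the arithmetic.
    print(decompose_speedup(10.0, 15.0, 45.0))
```

Only the `decoding` factor isolates the method itself; a headline figure computed against `ar_plain` would also include the `engineering` factor.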
Original abstract
Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM; we therefore adopt it as our recommended recipe. We introduce a suite of multimodal diffusion adaptations, block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation, that collectively enable effective block diffusion in the VLM setting. Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Fast-dVLM, a block-diffusion VLM obtained via direct one-stage conversion from a pretrained autoregressive VLM. It argues that this direct approach is more efficient than a two-stage (text-only then multimodal) conversion under comparable training budgets because it leverages existing multimodal alignment. The authors propose four adaptations—block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation—to enable KV-cache-compatible parallel and speculative block decoding. Experiments across 11 multimodal benchmarks show generation quality matching the AR counterpart; with SGLang integration and FP8 quantization, the model reports over 6x end-to-end inference speedup on edge-device (batch-1) settings.
Significance. If the empirical claims hold under matched inference conditions, the work offers a practical route to parallel decoding for VLMs on resource-constrained hardware, directly addressing memory-bandwidth bottlenecks in physical-AI deployments. The finding that direct conversion outperforms the two-stage recipe is a useful empirical guideline. The suite of adaptations demonstrates how block diffusion can be grafted onto existing VLMs without destroying pretrained alignment, providing a falsifiable template for future diffusion-based multimodal models.
major comments (2)
- [Abstract] The headline result states that 'with SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.' The manuscript does not explicitly confirm that the autoregressive baseline was evaluated under identical conditions (same engine, same FP8 quantization, same batch-1 edge-device setup). If the baseline used standard FP16 autoregressive decoding, the reported factor conflates method-specific parallel-decoding gains with orthogonal engineering optimizations. This detail is load-bearing for the central efficiency claim.
- [Experiments section] Experiments (11-benchmark comparison): The abstract asserts that Fast-dVLM 'matches its autoregressive counterpart in generation quality' across 11 benchmarks, yet no error bars, exact baseline configurations, or ablation tables isolating the four listed adaptations are referenced. Without these, it is impossible to verify that the direct-conversion recipe preserves multimodal alignment rather than merely inheriting it from the pretrained checkpoint. This information is required to substantiate the claim that direct conversion is 'substantially more efficient.'
minor comments (1)
- [Abstract] The abstract introduces 'auto-truncation masking' and 'vision efficient concatenation' without a one-sentence gloss; a brief parenthetical definition would improve immediate readability for readers unfamiliar with block-diffusion terminology.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below, providing clarifications from the manuscript and indicating revisions to improve clarity and verifiability.
Point-by-point responses
- Referee: [Abstract] The headline result states that 'with SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.' The manuscript does not explicitly confirm that the autoregressive baseline was evaluated under identical conditions (same engine, same FP8 quantization, same batch-1 edge-device setup). If the baseline used standard FP16 autoregressive decoding, the reported factor conflates method-specific parallel-decoding gains with orthogonal engineering optimizations. This detail is load-bearing for the central efficiency claim.
  Authors: We agree this detail must be unambiguous. Section 4.3 of the manuscript specifies that all end-to-end inference measurements (including the AR baseline) use the identical SGLang engine, FP8 quantization, batch-1 edge-device hardware, and KV-cache setup. The AR baseline performs standard autoregressive decoding inside this framework, isolating the gains to our parallel/speculative block decoding. To eliminate any misinterpretation, we will revise the abstract to: 'with SGLang integration and FP8 quantization under identical conditions, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.'
  Revision: yes
- Referee: [Experiments section] Experiments (11-benchmark comparison): The abstract asserts that Fast-dVLM 'matches its autoregressive counterpart in generation quality' across 11 benchmarks, yet no error bars, exact baseline configurations, or ablation tables isolating the four listed adaptations are referenced. Without these, it is impossible to verify that the direct-conversion recipe preserves multimodal alignment rather than merely inheriting it from the pretrained checkpoint. This information is required to substantiate the claim that direct conversion is 'substantially more efficient.'
  Authors: We acknowledge the need for greater transparency in the main text. The manuscript reports results on all 11 benchmarks in Table 2 with the exact AR checkpoint and training budget details in Section 4.1; Appendix C contains ablation tables for each adaptation (block size annealing, causal context attention, auto-truncation masking, vision efficient concatenation) showing their contribution to quality preservation. However, error bars are currently only in the appendix and the direct-vs-two-stage comparison is summarized rather than tabulated in the main experiments section. We will add standard-deviation error bars (3 seeds) to Table 2, include a main-text comparison table for direct vs. two-stage conversion, and expand the discussion of how direct conversion leverages pretrained alignment. These changes will be incorporated in the revision.
  Revision: yes
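The error bars promised in the second response amount to a per-benchmark mean and sample standard deviation over the three seeds. A minimal sketch follows, assuming scores are collected as one list per benchmark; the helper name `summarize_seeds` and the benchmark name and scores in the usage example are placeholders, not values from the paper.

```python
from statistics import mean, stdev


def summarize_seeds(
    scores_by_benchmark: dict[str, list[float]],
) -> dict[str, tuple[float, float]]:
    """Map each benchmark to (mean, sample standard deviation) across seeds.

    Hypothetical helper for illustration; it expects at least two seeds
    per benchmark because the sample standard deviation is undefined
    for a single value.
    """
    summary = {}
    for name, scores in scores_by_benchmark.items():
        if len(scores) < 2:
            raise ValueError(f"{name}: need at least two seeds")
        summary[name] = (mean(scores), stdev(scores))
    return summary


if __name__ == "__main__":
    # Placeholder scores for three seeds on a made-up benchmark.
    for name, (avg, sd) in summarize_seeds({"benchmark_a": [70.0, 71.0, 72.0]}).items():
        print(f"{name}: {avg:.1f} +/- {sd:.1f}")
```

With only three seeds the standard deviation is itself a rough estimate, so small gaps between the two models are best read alongside the per-seed scores.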
Circularity Check
No circularity: empirical conversion validated by direct benchmarks
Full rationale
The paper describes an empirical recipe for one-stage AR-to-block-diffusion conversion of VLMs, supported by side-by-side training-budget comparisons and 11-benchmark quality matching plus measured end-to-end speedups. No equations, fitted parameters, or self-citations are presented whose outputs are definitionally identical to their inputs. The central claims rest on observable experimental outcomes rather than any self-referential derivation or load-bearing prior result from the same authors. The reported 6x figure is an externally measured quantity under the stated SGLang+FP8 conditions.
Axiom & Free-Parameter Ledger
free parameters (1)
- block size schedule
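To illustrate what a block size schedule could look like as a free parameter, the sketch below interpolates the block size geometrically between a starting and a final value over training and snaps it to a power of two. The paper's actual schedule, its direction, and its endpoints are not given on this page, so the function name `block_size_at` and every argument are assumptions.

```python
import math


def block_size_at(step: int, total_steps: int, start: int, end: int) -> int:
    """Block size to use at a given training step.

    Hypothetical schedule for illustration: interpolate linearly in
    log2 space between `start` and `end`, then snap to the nearest
    power of two. If `start` or `end` is not a power of two, the
    snapped value can differ from it.
    """
    if total_steps <= 0 or not 0 <= step <= total_steps:
        raise ValueError("require 0 <= step <= total_steps and total_steps > 0")
    if start <= 0 or end <= 0:
        raise ValueError("block sizes must be positive")
    frac = step / total_steps
    log_size = math.log2(start) + frac * (math.log2(end) - math.log2(start))
    return 2 ** round(log_size)


if __name__ == "__main__":
    # Assumed endpoints (2 and 32) chosen only for the demonstration.
    for step in (0, 25, 50, 75, 100):
        print(step, block_size_at(step, 100, start=2, end=32))
```

Swapping `start` and `end` reverses the direction of the schedule, which keeps the sketch neutral about whether annealing grows or shrinks the block.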
axioms (1)
- Domain assumption: block-wise discrete diffusion can be extended to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities.
Forward citations
Cited by 1 Pith paper
- Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
  Fast-dDrive is a block-diffusion VLA that reports SOTA ADE on WOD-E2E, 0.32 m L2 on nuScenes, and 12x throughput over AR baselines via section scaffolds and test-time rollout averaging.