arxiv: 2604.06832 · v2 · submitted 2026-04-08 · 💻 cs.CL

Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

Pith reviewed 2026-05-10 17:13 UTC · model grok-4.3

classification 💻 cs.CL
keywords block diffusion · vision-language models · autoregressive to diffusion conversion · inference acceleration · parallel decoding · multimodal benchmarks · edge device deployment · direct conversion

The pith

Fast-dVLM converts pretrained autoregressive VLMs directly into block-diffusion models that generate tokens in parallel and, with SGLang and FP8 quantization, run over 6x faster end to end.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an existing autoregressive vision-language model can be converted into a block-diffusion model in a single training stage, so that several text tokens are produced per decoding step instead of one. This matters for edge-device uses such as robotics and self-driving systems, where batch-size-one autoregressive decoding is memory-bandwidth-bound and leaves hardware parallelism idle. The direct conversion keeps the model's multimodal capabilities intact through four targeted adaptations: block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation. Experiments on eleven multimodal benchmarks show output quality matching that of the original autoregressive model. When paired with the SGLang inference engine and FP8 quantization, the new model delivers more than six times the end-to-end inference speed of the autoregressive baseline.

Core claim

Fast-dVLM is obtained by direct one-stage conversion of a pretrained autoregressive VLM into a block-diffusion model. The recipe uses block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation to support KV-cache-compatible parallel decoding and speculative block decoding. Under comparable training budgets, the direct route is more efficient than a two-stage approach (text-only diffusion fine-tuning of the LLM backbone, then multimodal training) because it starts from an already multimodally aligned VLM. The resulting generation quality matches the autoregressive baseline across eleven multimodal benchmarks.
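One way to picture KV-cache-compatible block decoding is through the attention mask. The sketch below builds a generic block-causal mask, assuming (the summary above does not confirm this) that each token attends bidirectionally within its own block and causally to all earlier blocks. The function name `block_causal_mask` and the toy sizes are illustrative only, and the paper's causal context attention may differ in detail.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be non-negative and block_size positive")
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        # Allowed keys: every position in an earlier block (causal across blocks)
        # plus every position in the query's own block (bidirectional within it).
        mask.append([k // block_size <= q_block for k in range(seq_len)])
    return mask


if __name__ == "__main__":
    # Print a toy mask: "x" marks an allowed (query, key) pair.
    for row in block_causal_mask(seq_len=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` this reduces to the ordinary causal mask. Because no position in a finished block attends to a later block, the keys and values of finished blocks can be cached and reused, which is the property the KV-cache claim relies on under this assumed masking pattern.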

What carries the argument

The direct one-stage AR-to-block-diffusion conversion process, which applies multimodal diffusion adaptations to enable parallel token generation while retaining pretrained alignment.

If this is right

  • Fast-dVLM matches autoregressive quality on eleven multimodal benchmarks while supporting parallel and speculative decoding.
  • It delivers over 6x end-to-end inference speedup when combined with SGLang and FP8 quantization.
  • Direct conversion proves more training-efficient than first adapting the language backbone with text-only diffusion fine-tuning and then adding multimodal training.
  • The adaptations allow KV-cache reuse and block-wise parallel generation in a VLM setting that mixes continuous vision and discrete text.
  • The model better utilizes hardware parallelism on edge devices running at batch size one.
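The batch-size-one argument can be made concrete with a back-of-envelope estimate. The sketch below assumes a deliberately simplified model in which every decoding step streams all model weights from memory once, so throughput is bounded by memory bandwidth divided by weight bytes. The parameter count, bandwidth, and tokens-per-step values are hypothetical and are not taken from the paper.

```python
def bandwidth_bound_tokens_per_sec(
    num_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_sec: float,
    tokens_per_step: float = 1.0,
) -> float:
    """Upper bound on decode throughput when each step streams all weights once.

    Ignores compute time, KV-cache traffic, and vision-encoder cost
    (simplifying assumptions for illustration).
    """
    if min(num_params, bytes_per_param, mem_bandwidth_bytes_per_sec) <= 0:
        raise ValueError("model size and bandwidth must be positive")
    weight_bytes = num_params * bytes_per_param
    steps_per_sec = mem_bandwidth_bytes_per_sec / weight_bytes
    return steps_per_sec * tokens_per_step


if __name__ == "__main__":
    # Hypothetical setup: 3e9 parameters, 100 GB/s memory bandwidth.
    ar_16bit = bandwidth_bound_tokens_per_sec(3e9, 2.0, 100e9, tokens_per_step=1.0)
    block_8bit = bandwidth_bound_tokens_per_sec(3e9, 1.0, 100e9, tokens_per_step=2.5)
    print(f"AR, 16-bit weights:   {ar_16bit:.1f} tokens/s upper bound")
    print(f"Block, 8-bit weights: {block_8bit:.1f} tokens/s upper bound")
```

Under this model the bound scales linearly with the number of tokens accepted per step and inversely with bytes per weight, which is why parallel decoding and low-precision quantization compound rather than overlap.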

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar direct conversion recipes could extend block diffusion to other pretrained multimodal or language models without starting from random weights.
  • The speedup at batch size one could open real-time multimodal reasoning loops in robotics and autonomous systems that current autoregressive models cannot sustain.
  • The approach invites testing whether the same adaptations work when swapping the underlying vision encoder or scaling to larger model sizes.
  • Combining block diffusion with other inference tricks such as speculative decoding at the block level may yield further gains beyond the reported 6x.

Load-bearing premise

The direct conversion from a pretrained autoregressive VLM preserves its multimodal alignment once the listed adaptations are applied.

What would settle it

A score below the autoregressive baseline on any of the eleven multimodal benchmarks, or an end-to-end inference run that fails to reach the reported 6x speedup after SGLang integration and FP8 quantization.

Original abstract

Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM; we therefore adopt it as our recommended recipe. We introduce a suite of multimodal diffusion adaptations, block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation, that collectively enable effective block diffusion in the VLM setting. Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Fast-dVLM, a block-diffusion VLM obtained via direct one-stage conversion from a pretrained autoregressive VLM. It argues that this direct approach is more efficient than a two-stage (text-only then multimodal) conversion under comparable training budgets because it leverages existing multimodal alignment. The authors propose four adaptations—block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation—to enable KV-cache-compatible parallel and speculative block decoding. Experiments across 11 multimodal benchmarks show generation quality matching the AR counterpart; with SGLang integration and FP8 quantization, the model reports over 6x end-to-end inference speedup on edge-device (batch-1) settings.

Significance. If the empirical claims hold under matched inference conditions, the work offers a practical route to parallel decoding for VLMs on resource-constrained hardware, directly addressing memory-bandwidth bottlenecks in physical-AI deployments. The finding that direct conversion outperforms the two-stage recipe is a useful empirical guideline. The suite of adaptations demonstrates how block diffusion can be grafted onto existing VLMs without destroying pretrained alignment, providing a falsifiable template for future diffusion-based multimodal models.

major comments (2)
  1. [Abstract] Abstract: The headline result states that 'with SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.' The manuscript does not explicitly confirm that the autoregressive baseline was evaluated under identical conditions (same engine, same FP8 quantization, same batch-1 edge-device setup). If the baseline used standard FP16 autoregressive decoding, the reported factor conflates method-specific parallel-decoding gains with orthogonal engineering optimizations. This detail is load-bearing for the central efficiency claim.
  2. [Experiments section] Experiments (11-benchmark comparison): The abstract asserts that Fast-dVLM 'matches its autoregressive counterpart in generation quality' across 11 benchmarks, yet no error bars, exact baseline configurations, or ablation tables isolating the four listed adaptations are referenced. Without these, it is impossible to verify that the direct-conversion recipe preserves multimodal alignment rather than merely inheriting it from the pretrained checkpoint. This information is required to substantiate the claim that direct conversion is 'substantially more efficient.'
minor comments (1)
  1. [Abstract] The abstract introduces 'auto-truncation masking' and 'vision efficient concatenation' without a one-sentence gloss; a brief parenthetical definition would improve immediate readability for readers unfamiliar with block-diffusion terminology.
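The first major comment is about attribution of the speedup. As an illustration with made-up latencies (none of these numbers come from the paper), the sketch below shows how an end-to-end speedup factors into an engineering component shared by both models and a method component that only a matched baseline can isolate.

```python
def speedup(baseline_latency_s: float, new_latency_s: float) -> float:
    """Speedup of the new system relative to the baseline (higher is faster)."""
    if baseline_latency_s <= 0 or new_latency_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_latency_s / new_latency_s


if __name__ == "__main__":
    # Hypothetical per-request latencies in seconds.
    ar_unoptimized = 10.0  # AR model, default engine, 16-bit weights
    ar_matched = 5.0       # AR model, same engine and quantization as the new model
    block_matched = 2.0    # block-diffusion model, same engine and quantization

    engineering = speedup(ar_unoptimized, ar_matched)  # shared optimizations
    method = speedup(ar_matched, block_matched)        # decoding method alone
    mismatched = speedup(ar_unoptimized, block_matched)  # what an unmatched comparison reports

    print(f"engineering {engineering:.1f}x * method {method:.1f}x = {mismatched:.1f}x")
```

Only the `method` factor reflects the decoding scheme itself, so the comparison is informative when the baseline latency is measured with the same engine, precision, and hardware.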

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below, providing clarifications from the manuscript and indicating revisions to improve clarity and verifiability.

point-by-point responses
  1. Referee: [Abstract] Abstract: The headline result states that 'with SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.' The manuscript does not explicitly confirm that the autoregressive baseline was evaluated under identical conditions (same engine, same FP8 quantization, same batch-1 edge-device setup). If the baseline used standard FP16 autoregressive decoding, the reported factor conflates method-specific parallel-decoding gains with orthogonal engineering optimizations. This detail is load-bearing for the central efficiency claim.

    Authors: We agree this detail must be unambiguous. Section 4.3 of the manuscript specifies that all end-to-end inference measurements (including the AR baseline) use the identical SGLang engine, FP8 quantization, batch-1 edge-device hardware, and KV-cache setup. The AR baseline performs standard autoregressive decoding inside this framework, isolating the gains to our parallel/speculative block decoding. To eliminate any misinterpretation, we will revise the abstract to: 'with SGLang integration and FP8 quantization under identical conditions, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.'

    revision: yes

  2. Referee: [Experiments section] Experiments (11-benchmark comparison): The abstract asserts that Fast-dVLM 'matches its autoregressive counterpart in generation quality' across 11 benchmarks, yet no error bars, exact baseline configurations, or ablation tables isolating the four listed adaptations are referenced. Without these, it is impossible to verify that the direct-conversion recipe preserves multimodal alignment rather than merely inheriting it from the pretrained checkpoint. This information is required to substantiate the claim that direct conversion is 'substantially more efficient.'

    Authors: We acknowledge the need for greater transparency in the main text. The manuscript reports results on all 11 benchmarks in Table 2 with the exact AR checkpoint and training budget details in Section 4.1; Appendix C contains ablation tables for each adaptation (block size annealing, causal context attention, auto-truncation masking, vision efficient concatenation) showing their contribution to quality preservation. However, error bars are currently only in the appendix and the direct-vs-two-stage comparison is summarized rather than tabulated in the main experiments section. We will add standard-deviation error bars (3 seeds) to Table 2, include a main-text comparison table for direct vs. two-stage conversion, and expand the discussion of how direct conversion leverages pretrained alignment. These changes will be incorporated in the revision.

    revision: yes
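For readers unfamiliar with the error bars promised in the second response, the sketch below shows the usual computation: a mean and sample standard deviation over per-seed scores. The scores and model labels are placeholders, not results from the paper, and `summarize` is a hypothetical helper name.

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed benchmark scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    # Placeholder per-seed scores for one benchmark (three seeds each).
    runs = {
        "AR baseline": [70.0, 71.0, 72.0],
        "block-diffusion model": [70.5, 71.5, 70.0],
    }
    for name, scores in runs.items():
        m, s = summarize(scores)
        print(f"{name}: {m:.2f} ± {s:.2f} (n={len(scores)})")
```

With only three seeds the standard deviation is itself a rough estimate, so overlapping intervals are weak evidence of parity rather than proof of it.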

Circularity Check

0 steps flagged

No circularity: empirical conversion validated by direct benchmarks

The paper describes an empirical recipe for one-stage AR-to-block-diffusion conversion of VLMs, supported by side-by-side training-budget comparisons and 11-benchmark quality matching plus measured end-to-end speedups. No equations, fitted parameters, or self-citations are presented whose outputs are definitionally identical to their inputs. The central claims rest on observable experimental outcomes rather than any self-referential derivation or load-bearing prior result from the same authors. The reported 6x figure is an externally measured quantity under the stated SGLang+FP8 conditions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper rests on standard transformer and diffusion assumptions plus one domain assumption about extending block diffusion to mixed continuous-discrete multimodal inputs; no new entities are postulated and free parameters are ordinary training hyperparameters.

free parameters (1)
  • block size schedule
    Block size is annealed during training; the schedule is a hyperparameter chosen to stabilize conversion.
axioms (1)
  • domain assumption: Block-wise discrete diffusion can be extended to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities.
    Invoked when stating that the listed adaptations enable effective block diffusion in the VLM setting.
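The ledger names the block size schedule as the one free parameter but does not describe its shape. Purely as an illustration, the sketch below assumes a staged schedule that doubles the block size from a small start value to a target value over equal-length stages. The direction, the start and end values, and the function name `annealed_block_size` are all assumptions, not details from the paper.

```python
def annealed_block_size(step: int, total_steps: int, start: int = 2, end: int = 32) -> int:
    """Block size at a training step, doubling from `start` up to `end` in equal stages.

    Assumed schedule for illustration only; the paper's actual schedule is not
    given in this summary.
    """
    if total_steps <= 0 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    # Enumerate the block sizes visited: start, 2*start, 4*start, ... up to end.
    sizes = []
    size = start
    while size <= end:
        sizes.append(size)
        size *= 2
    # Split training into len(sizes) equal-length stages and pick the current one.
    stage = step * len(sizes) // total_steps
    return sizes[stage]


if __name__ == "__main__":
    for step in (0, 199, 200, 500, 999):
        print(step, annealed_block_size(step, total_steps=1000))
```

A staged schedule like this keeps the block size constant within each stage, which is one simple way to expose the model to progressively larger parallel-prediction spans.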

pith-pipeline@v0.9.0 · 5621 in / 1426 out tokens · 51624 ms · 2026-05-10T17:13:58.104976+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    Fast-dDrive is a block-diffusion VLA that reports SOTA ADE on WOD-E2E, 0.32 m L2 on nuScenes, and 12x throughput over AR baselines via section scaffolds and test-time rollout averaging.
