Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Pith reviewed 2026-05-15 10:53 UTC · model grok-4.3
The pith
Block diffusion language models interpolate between autoregressive and diffusion approaches to support arbitrary-length generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Block diffusion processes sequences in blocks, applying diffusion denoising within each block and autoregressive prediction across blocks. This interpolation supports flexible-length generation, improves inference speed via KV caching and parallel token sampling, and reaches new state-of-the-art results among diffusion language models on standard benchmarks.
What carries the argument
The block diffusion process, which applies diffusion within fixed-size token blocks while chaining blocks autoregressively.
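To make the mechanism concrete, a minimal toy sketch of the generation loop follows. It is not the authors' code: denoise_block, BLOCK_SIZE, and NUM_STEPS are hypothetical stand-ins for the learned denoiser and its hyperparameters, and the toy denoiser just unmasks random tokens. The point is the control flow: diffusion steps run in parallel within a block, blocks are committed left to right, and the committed prefix plays the role a KV cache plays in the real model.

```python
# Minimal sketch (not the authors' code): generate a sequence block by block.
# `denoise_block` is a hypothetical denoiser that runs a few diffusion steps
# over one block, conditioned on the clean prefix (the KV-cache analogue).

import random

VOCAB = list(range(1000))   # toy vocabulary of token ids
MASK = -1                   # placeholder "noised" token
BLOCK_SIZE = 4              # tokens denoised in parallel within a block
NUM_STEPS = 3               # diffusion steps per block

def denoise_block(prefix, block, step):
    """Toy stand-in for a learned denoiser: unmask a fraction of positions per
    step; a real model would condition on the prefix, which is ignored here."""
    out = list(block)
    masked = [i for i, t in enumerate(out) if t == MASK]
    # unmask roughly 1/(remaining steps) of the still-masked tokens in parallel
    k = max(1, len(masked) // (NUM_STEPS - step)) if masked else 0
    for i in random.sample(masked, k):
        out[i] = random.choice(VOCAB)
    return out

def generate(num_blocks):
    sequence = []                           # clean prefix, grown autoregressively
    for _ in range(num_blocks):             # blocks are chained left to right
        block = [MASK] * BLOCK_SIZE         # start each block fully noised
        for step in range(NUM_STEPS):       # diffusion denoising within the block
            block = denoise_block(sequence, block, step)
        sequence.extend(block)              # commit the block; the prefix is now fixed
    return sequence

print(generate(num_blocks=3))   # arbitrary length = any number of blocks
```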
Load-bearing premise
The training algorithm, variance estimators, and data-driven noise schedules will produce stable models that generalize to new data without hidden instabilities or overfitting.
What would settle it
An evaluation on a standard language modeling benchmark where block diffusion models fail to exceed prior diffusion baselines in likelihood or produce incoherent text when generating sequences longer than the training block size.
Original abstract
Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms
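One way to read the recipe's variance-minimizing, data-driven noise schedule is as a search over how aggressively to noise the data. The sketch below is an editorial illustration only: the toy loss, the candidate mask-rate ranges, and the use of loss variance as a proxy for gradient variance are assumptions, not the paper's estimator or schedule parameterization.

```python
# Illustrative sketch (assumed, not the paper's method): treat the noise
# schedule's mask-rate range as a free parameter and pick the range whose
# per-batch losses vary least, as a cheap proxy for gradient variance.

import random
import statistics

def masked_loss(batch, mask_rate):
    """Toy stand-in for a diffusion training loss at a given mask rate."""
    return mask_rate * sum(batch) / len(batch) + random.gauss(0, 0.1)

def loss_variance(batches, rate_range, samples_per_batch=4):
    """Estimate loss variance when mask rates are drawn from rate_range."""
    lo, hi = rate_range
    losses = [
        masked_loss(b, random.uniform(lo, hi))
        for b in batches
        for _ in range(samples_per_batch)
    ]
    return statistics.variance(losses)

# Candidate "clipped" schedules: only sample mask rates inside these ranges.
candidates = [(0.0, 1.0), (0.3, 0.8), (0.45, 0.55)]
batches = [[random.random() for _ in range(16)] for _ in range(32)]

best = min(candidates, key=lambda r: loss_variance(batches, r))
print("data-driven mask-rate range:", best)
```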
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces block diffusion language models that interpolate between autoregressive and discrete denoising diffusion approaches for language modeling. It proposes an efficient training algorithm, estimators for gradient variance, and data-driven noise schedules, claiming new state-of-the-art results among diffusion models on language benchmarks together with support for flexible-length and arbitrary-length generation via KV caching and parallel token sampling.
Significance. If the central claims hold, the work would meaningfully narrow the performance gap between diffusion and autoregressive language models while adding controllability and parallel-generation advantages; the open release of code, weights, and a blog post further increases the potential impact.
major comments (2)
- [Experiments] Experiments section: the headline claim that block diffusion enables generation of arbitrary-length sequences rests on the unverified assumption that block-wise KV caching and the data-driven noise schedule remain stable and variance-minimizing for lengths substantially exceeding the training block size (e.g., 4–8× longer contexts). No scaling curves, perplexity-vs-length measurements, or out-of-distribution length ablations are reported, yet this evidence is load-bearing for the central flexibility claim. A minimal sketch of such a measurement harness follows these comments.
- [§3 and Experiments] §3 (training recipe) and Experiments: the paper asserts that the combination of the efficient training algorithm, gradient-variance estimators, and data-driven noise schedules produces stable models, yet provides neither quantitative ablations isolating each component’s contribution nor error bars on the reported benchmark numbers. Without these, the SOTA claim among diffusion models cannot be fully assessed.
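For concreteness, the measurement asked for in the first major comment could look like the following minimal harness. The model.nll(tokens) interface, the length multipliers, and the chunking scheme are assumptions introduced here, not part of the paper.

```python
# Sketch of a perplexity-versus-length evaluation, assuming a hypothetical
# model interface `nll(tokens) -> total negative log-likelihood in nats`.

import math

def perplexity_vs_length(model, corpus_tokens, train_block_len, multipliers=(1, 2, 4, 8)):
    """Probe length generalization: perplexity at 1x to 8x the training block size."""
    results = {}
    for m in multipliers:
        length = train_block_len * m
        # non-overlapping chunks of the evaluation corpus at this length
        chunks = [corpus_tokens[i:i + length]
                  for i in range(0, len(corpus_tokens) - length + 1, length)]
        if not chunks:
            continue
        total_nll = sum(model.nll(chunk) for chunk in chunks)       # nats
        total_tokens = sum(len(chunk) for chunk in chunks)
        results[length] = math.exp(total_nll / total_tokens)
    return results  # maps evaluated length -> perplexity
```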
minor comments (2)
- Figure captions and axis labels should be expanded to make the reported metrics (e.g., perplexity, generation speed) immediately interpretable without reference to the main text.
- [Abstract] The abstract states that code and weights are provided; the manuscript should include a permanent DOI or archive link in addition to the project-page URL.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical support for our claims.
Point-by-point responses
- Referee: Experiments section: the headline claim that block diffusion enables generation of arbitrary-length sequences rests on the unverified assumption that block-wise KV caching and the data-driven noise schedule remain stable and variance-minimizing for lengths substantially exceeding the training block size (e.g., 4–8× longer contexts). No scaling curves, perplexity-vs-length measurements, or out-of-distribution length ablations are reported, yet this evidence is load-bearing for the central flexibility claim.
Authors: We acknowledge that the current experiments focus on lengths comparable to the training block size and do not include explicit scaling curves or out-of-distribution ablations for substantially longer contexts. The design of block diffusion, which interpolates between autoregressive and diffusion models, provides a basis for expecting that KV caching and the data-driven noise schedule will generalize, but we agree this requires direct empirical verification. In the revision we will add perplexity-versus-length curves and length-ablation experiments testing up to 4× the training block size. revision: yes
- Referee: §3 (training recipe) and Experiments: the paper asserts that the combination of the efficient training algorithm, gradient-variance estimators, and data-driven noise schedules produces stable models, yet provides neither quantitative ablations isolating each component’s contribution nor error bars on the reported benchmark numbers. Without these, the SOTA claim among diffusion models cannot be fully assessed.
Authors: We agree that quantitative ablations isolating the contribution of each element of the training recipe and error bars on benchmark results would make the stability and SOTA claims more robust. In the revised manuscript we will add ablations that separately measure the effect of the efficient training algorithm, the gradient-variance estimators, and the data-driven noise schedules. We will also report standard deviations computed over multiple independent runs for all main benchmark numbers. revision: yes
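A minimal sketch of the promised reporting follows, aggregating independent runs into mean and standard deviation per ablation configuration; the configuration names and numbers below are placeholders, not results from the paper.

```python
# Sketch of the reporting promised in the rebuttal: collapse several independent
# training runs into mean +/- standard deviation per ablation configuration.

import statistics

def summarize_runs(runs_by_config):
    """Map each ablation configuration to (mean, std) over its per-seed metrics."""
    return {
        config: (statistics.mean(values),
                 statistics.stdev(values) if len(values) > 1 else 0.0)
        for config, values in runs_by_config.items()
    }

# Usage with arbitrary placeholder numbers (to be replaced by measured perplexities):
example = {
    "full recipe": [1.0, 1.1, 0.9],
    "no gradient-variance estimator": [1.3, 1.2, 1.4],
    "uniform noise schedule": [1.2, 1.5, 1.3],
}
print(summarize_runs(example))
```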
Circularity Check
No circularity; block diffusion adds its own block structure, training algorithm, and noise schedules on top of existing diffusion and autoregressive frameworks.
full rationale
The derivation chain in the abstract and described method is self-contained. Block diffusion is presented as an interpolation via explicit new elements (block-wise structure, KV caching for flexible lengths, efficient training algorithm, gradient-variance estimators, and data-driven noise schedules) that are not shown to reduce by construction to prior fitted quantities or self-citations. Performance claims and arbitrary-length generation are positioned as empirical outcomes of these additions rather than tautological renamings or load-bearing self-references. No equations or steps equate outputs to inputs via definition or fitting alone.
Axiom & Free-Parameter Ledger
free parameters (1)
- data-driven noise schedule parameters
axioms (1)
- domain assumption: Diffusion and autoregressive models can be effectively interpolated via a block structure for language sequences.
invented entities (1)
- block diffusion language model (no independent evidence)
Forward citations
Cited by 23 Pith papers
- Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages. Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
- Large Language Diffusion Models. LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
- Support Before Frequency in Discrete Diffusion. Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving. ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
- DARE: Diffusion Language Model Activation Reuse for Efficient Inference. DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.
- NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization. NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
- One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction. DiffTSP applies discrete diffusion to knowledge graph triple set prediction, recovering all missing triples simultaneously via edge-masking noise reversal and a structure-aware transformer, achieving SOTA on three datasets.
- BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation. BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
- LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling. LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
- Attention-Based Sampler for Diffusion Language Models. Attn-Sampler decodes diffusion language models by selecting tokens in descending order of attention column sums, yielding higher quality and more parallel generation than token-level greedy baselines.
- Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion. Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
- BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion. BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
- FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation. FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.
- FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation. FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.
- TextLDM: Language Modeling with Continuous Latent Diffusion. TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving. ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
- Stability-Weighted Decoding for Diffusion Language Models. Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
- Differences in Text Generated by Diffusion and Autoregressive Language Models. DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
- Generative Frontiers: Why Evaluation Matters for Diffusion Language Models. Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.
- LLaDA2.0: Scaling Up Diffusion Language Models to 100B. LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
- Dream 7B: Diffusion Large Language Models. Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...
- DMax: Aggressive Parallel Decoding for dLLMs. DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
- Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance. Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervise...