arxiv: 2510.13117 · v3 · submitted 2025-10-15 · 💻 cs.LG · cs.AI· cs.CL

On the Reasoning Abilities of Masked Diffusion Language Models

Anej Svete , Ashish Sabharwal This is my paper

Pith reviewed 2026-05-18 06:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords masked diffusion modelsreasoning abilitieschain of thoughtpadded looped transformersfinite-precision modelsparallel generationcomputational equivalencelanguage model reasoning

0 comments

The pith

Masked diffusion models match the reasoning power of chain-of-thought transformers and outperform them in efficiency on regular languages through parallel generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that masked diffusion models for text are computationally equivalent to polynomially-padded looped transformers in the finite-precision log-width setting. This equivalence implies that MDMs can solve every reasoning problem that can be solved by transformers augmented with chain-of-thought reasoning. A sympathetic reader would care because the result clarifies that the parallel nature of diffusion models does not restrict their reasoning abilities relative to sequential autoregressive models and can instead provide concrete efficiency advantages on structured tasks.

Core claim

In the finite-precision log-width setting, masked diffusion models are equivalent to polynomially-padded padded looped transformers. As a direct consequence, MDMs can solve all problems solvable by CoT-augmented transformers. Moreover, MDMs are inherently more efficient than CoT transformers on classes of problems that include regular languages because parallel generation permits substantially faster reasoning.

What carries the argument

The equivalence between masked diffusion models and polynomially-padded looped transformers under finite-precision log-width constraints, which transfers both full reasoning power and efficiency properties from one model class to the other.

If this is right

MDMs inherit the ability to solve any problem that CoT-augmented transformers can solve.
MDMs achieve faster reasoning than sequential CoT models on regular languages and similar structured problems.
Parallel generation in MDMs removes the need for explicit sequential CoT steps on tasks where parallelism suffices.
The computational capabilities of MDMs are fully characterized by their relation to padded looped transformers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the finite-precision model is representative, then practical masked diffusion models may perform complex reasoning without requiring explicit chain-of-thought prompting.
The efficiency results suggest exploring diffusion architectures optimized specifically for parallel reasoning rather than sequential generation.
These equivalences could guide the design of hybrid models that combine diffusion parallelism with targeted sequential refinement for harder problems.

Load-bearing premise

The finite-precision log-width computational model accurately captures the relevant reasoning behavior of practical masked diffusion models.

What would settle it

A concrete demonstration that some regular language recognizable by a CoT-augmented transformer in polynomial steps cannot be recognized by any masked diffusion model in a comparable number of parallel steps under the same finite-precision log-width constraints.

Figures

Figures reproduced from arXiv: 2510.13117 by Anej Svete, Ashish Sabharwal.

**Figure 2.** Figure 2: Two strategies for solving a mathematical expression. (a) Parallel: Parallel computation of intermediate values (three steps). (b) Sequential: Step-by-step generation (eleven steps). Our analysis builds upon the study of LM expressivity—the formal characterization of the problems whose solution the neural architecture of an LM (with appropriate parameters) can express. While this field comprehensively desc… view at source ↗

read the original abstract

Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent in their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MDMs match polynomially-padded PLTs in the finite-precision log-width model and gain an efficiency edge on regular languages, but only inside that model.

read the letter

The core result is that masked diffusion models are equivalent to polynomially-padded looped transformers in the finite-precision log-width setting, and they can solve the same problems as CoT transformers but more efficiently on regular languages due to parallel generation. This mapping is new. Prior work on diffusion models has not drawn this direct link to PLTs or shown the efficiency separation for those language classes. The paper does well by grounding the claims in explicit connections to established frameworks rather than inventing new abstractions from scratch. That makes the proofs more believable and allows reuse of existing analysis techniques. The derivations appear to go through cleanly inside the chosen model. The abstract indicates they prove the equivalence and the efficiency advantage without circularity or hidden fitting. The main soft spot is the computational model itself. All equivalences and step-count advantages are shown under finite precision and logarithmic width. In wider or higher-precision regimes the masking process may fail to pack the necessary state information into a single parallel step, which would break both the equivalence and the claimed speedup. The paper offers no argument that this restriction is without loss of generality for the reasoning problems it targets. This work is for people who study the theoretical capabilities of language models beyond autoregressive decoding. Readers who already know the CoT and PLT results will get the most value, as the paper mostly translates those ideas to MDMs. It is solid enough on its own terms to deserve a serious referee who can check the full proofs and weigh the model assumptions. I would recommend sending it to peer review.

Referee Report

1 major / 1 minor

Summary. The paper claims that in the finite-precision log-width setting, masked diffusion models (MDMs) are equivalent to polynomially-padded padded looped transformers (PLTs), that MDMs can solve all problems solvable by chain-of-thought (CoT) augmented transformers, and that MDMs are inherently more efficient than CoT transformers on certain problem classes including regular languages because parallel generation allows simulating looped steps in fewer iterations.

Significance. If the equivalences and efficiency results hold under the stated model, the work supplies a useful theoretical bridge between diffusion-based generation and established reasoning frameworks, clarifying the computational power of MDMs and identifying concrete settings where their parallelism yields asymptotic advantages. The explicit connections to CoT and PLT strengthen the analysis and make the claims falsifiable within the chosen computational model.

major comments (1)

Abstract and the main equivalence section: the central claims of equivalence to polynomially-padded PLTs and superior efficiency on regular languages are proved exclusively inside the finite-precision log-width regime. The manuscript supplies no argument that this restriction is without loss of generality; outside it, the masking schedule may fail to encode the required state transitions in a single parallel step, which directly undermines both the equivalence and the step-count advantage.

minor comments (1)

Notation for the masking schedule and the precise definition of 'polynomially-padded' should be introduced earlier and used consistently across the equivalence proofs.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the significance of our work in establishing connections between masked diffusion models and established reasoning frameworks like CoT and PLTs. We address the major comment below.

read point-by-point responses

Referee: Abstract and the main equivalence section: the central claims of equivalence to polynomially-padded PLTs and superior efficiency on regular languages are proved exclusively inside the finite-precision log-width regime. The manuscript supplies no argument that this restriction is without loss of generality; outside it, the masking schedule may fail to encode the required state transitions in a single parallel step, which directly undermines both the equivalence and the step-count advantage.

Authors: We agree that our proofs and claims are presented within the finite-precision log-width setting, as explicitly noted in the abstract and the main sections of the manuscript. This setting is not claimed to be without loss of generality; instead, it is selected as it provides a realistic and standard model for analyzing the computational power of transformer-like architectures, where finite precision accounts for practical numerical representations and log-width reflects the typical relationship between model capacity and input length in complexity-theoretic analyses. In this regime, we demonstrate the equivalence to polynomially-padded PLTs and the efficiency advantages for problems like regular language recognition through parallel simulation of looped computations. To strengthen the manuscript, we will add a new paragraph in the introduction and a dedicated discussion section explaining the rationale for this modeling choice, its implications, and how the masking mechanism leverages the finite precision to enable single-step state transitions. This revision will make explicit that our results pertain to this important setting without overgeneralizing. revision: yes

Circularity Check

0 steps flagged

No significant circularity; equivalences derived via connections to established external CoT and PLT frameworks

full rationale

The paper derives its main results by explicitly connecting masked diffusion models to prior, well-understood reasoning frameworks of chain-of-thought and padded looped transformers inside a finite-precision log-width model. These equivalences and efficiency comparisons are presented as mappings to those external frameworks rather than as self-referential re-derivations, fitted parameters renamed as predictions, or load-bearing self-citations that close on themselves. No quoted step reduces an output quantity to an input by construction, and the central claims retain independent content from the cited prior models. This is the normal case of a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the finite-precision log-width model and standard definitions of MDMs, CoT, and PLTs; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Finite-precision log-width setting accurately represents relevant model behavior
All equivalences and efficiency statements are derived inside this restricted computational regime.

pith-pipeline@v0.9.0 · 5677 in / 1288 out tokens · 32937 ms · 2026-05-18T06:51:19.179339+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting... MDMs with logN denoising steps can recognize regular languages.
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MDMrlogN, Ns zCoTrlogNs, for example, contains all NC1-complete regular languages.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1]

Nadav Borenstein, Anej Svete, Robin Chan, Josef Valvoda, Franz Nowak, Isabelle Augenstein, Eleanor Chodroff, and Ryan Cotterell

URLhttps://openreview.net/forum?id=FB8FMU99BC. Nadav Borenstein, Anej Svete, Robin Chan, Josef Valvoda, Franz Nowak, Isabelle Augenstein, Eleanor Chodroff, and Ryan Cotterell. What languages are easy to language-model? a perspective from learning probabilistic regular languages. InACL, 2024. URL https://aclanthology.org/ 2024.acl-long.807/. Alexandra Buto...

work page doi:10.1007/bf01744431 2024
[2]

Selim Jerad, Anej Svete, Jiaoda Li, and Ryan Cotterell

URLhttps://aclanthology.org/2023.acl-long.248/. Selim Jerad, Anej Svete, Jiaoda Li, and Ryan Cotterell. Unique hard attention: A tale of two sides. arXiv, 2025. URLhttps://arxiv.org/abs/2503.14615. Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, and Sitan Chen. Train for the worst, plan for the best: Understanding token ordering in masked diffusio...

work page arXiv 2023
[3]

Measuring Faithfulness in Chain-of-Thought Reasoning

URLhttps://arxiv.org/abs/2307.13702. Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InICML, 2023. Gen Li and Changxiao Cai. A convergence theory for diffusion language models: An information- theoretic perspective.arXiv, 2025. URLhttps://arxiv.org/abs/2505.21400. Jiaoda Li and Ryan Cotterell. Ch...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma

URLhttps://arxiv.org/abs/2505.23623. Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. InICLR, 2024. URL https://openreview.net/forum?id= 3EWTEy9MTM. 11 Preprint Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transform- ers learn shortcuts to automa...

work page arXiv 2024
[5]

Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834,

URLhttps://arxiv.org/abs/2508.09834. Anej Svete and Ryan Cotterell. Transformers can represent n-gram language models. InNAACL,

work page arXiv
[6]

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

URLhttps://aclanthology.org/2024.naacl-long.381/. Anej Svete, Nadav Borenstein, Mike Zhou, Isabelle Augenstein, and Ryan Cotterell. Can transformers learn n-gram language models? InEMNLP, 2024. URL https://aclanthology.org/2024. emnlp-main.550/. Anej Svete, Will Merrill, and Ashish Sabharwal. The exact expressive power of fixed-precision looped padded tra...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/978-3-662-03927-4 2024
[7]

All intermediate computations in Eq

Sub-case 2a: qJ n kn1 ălog 2pp`1q . All intermediate computations in Eq. (44a) are bounded by log 2pp`1q ` BF{2 in absolute value, so they fall within the range of Fp. Moreover, addition of BF{2 can be exactly represented in Fp. This makes addition in Eq. (44a) associative and commutative. By Lem. D.3, the terms G in Eq. (44a) are 0, meaning that the fina...

work page
[8]

In this case, the intermediate computations of Eq

Sub-case 2b: qJ n kn1 ělog 2pp`1q . In this case, the intermediate computations of Eq. (44a) are either exact or thresholded at BF. In both cases, the exponent of the resulting attention score isB F by Lem. D.1, preserving the attention score

work page
[9]

In this case, all intermediate computations are representable in Fp analogously to the case 2a

Sub-case 2c: qJ n kn1 ďlog 2pp`1q . In this case, all intermediate computations are representable in Fp analogously to the case 2a. The attention score is therefore preserved. Taken together, this means that the attention scores between positions n and n1 of τ 1 are identical to those of τ on the positions in N , while the attention scores on the rest of ...

work page
[10]

If it is, wn1 stores e1 and d1 def “w n1e1 in designated parts of its residual stream

In the first layer, each symbol wn1 P t0,1u checks if it is immediately preceded by the & symbol, which denotes the beginning of the pointer in the string. If it is, wn1 stores e1 and d1 def “w n1e1 in designated parts of its residual stream. Here, e1 is the first unit vector of RrlogNs

work page
[11]

pnmodNq `1 represents the position within that layer. T 1 then processes a stringwPΣ ˚ of lengthNpadded withN 2 padding symbols as follows. (1)T 1 uses an additional “copy

In the subsequent layers lP t2, . . . ,rlogNsu , each symbol wn1 checks if the entry el´1 has already been written to the designated space of the previous symbol’s residual stream. If it has, wn1 copies and shifts el´1 into el, and stores el and dl def “d l´1 `w n1el in designated parts of its residual stream. AfterrlogNslayers, the residual stream at pos...

work page 2025
[12]

This is because, by Lem

We first note that the pCoT transformer predicts the next P 1 symbols at every timestep t based on the input string wďN`pt´1qP 1 ˝¨ ¨ ¨˝lo omo on pT´tqP 1 rather than wďN`pt´1qP 1. This is because, by Lem. D.6, the pCoT transformer can ignore all symbols containing the padding symbol and thus produce equivalent predictions at every step t. This will help ...

work page
[13]

It will be filled in T generation steps, where at each step tP rTs , a new block of P 1 symbols will be predicted

We assume that the final output of the MDM will be stored in the first P positions of the padding space. It will be filled in T generation steps, where at each step tP rTs , a new block of P 1 symbols will be predicted. In particular, by Lem. D.9, the planner can select the next P 1 positions to unmask at time t by including the values B˘prrn´N s`{P1sq an...

work page
[14]

The predictor then uses an initial layer to copy the input string wďN`pt´1qP 1 ˝¨ ¨ ¨˝lo omo on pT´tqP 1 into the residual stream of the firstN`Ppositions of the padding space

work page
[15]

overwrites

The predictor then uses the pN`Pq 2 padding positions to predict the next P 1 symbols predicted by the pCoT transformer. These are written to the positions chosen to be unmasked by the planner. ■ Theorem 3.4(pCoT transformers can simulate MDMs). MDMrT, Ps ĎpCoTrT, LTpP`Nqs,(7) whereLis the number of layers in the transformer implementing the MDM. 34 Prepr...

work page