A little depth goes a long way: The expressive power of log-depth transformers. CoRR, abs/2503.03961

William Merrill, Ashish Sabharwal · 2025 · arXiv 2503.03961

8 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.

Training-Free Looped Transformers

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.

Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

cs.AI · 2025-10-03 · unverdicted · novelty 7.0

CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

cs.AI · 2026-06-16 · unverdicted · novelty 6.0

FPRM is a Transformer-based model using fixed-point convergence for adaptive halting in looped architectures, claimed effective on Sudoku, Maze, state-tracking, and ARC-AGI benchmarks.

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.

The Serial Scaling Hypothesis

cs.LG · 2025-07-16 · unverdicted · novelty 5.0

The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

Scaling Latent Reasoning via Looped Language Models

cs.CL · 2025-10-29

citing papers explorer

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference
cs.CL · 2026-05-25 · unverdicted · none · ref 40

Training-Free Looped Transformers
cs.LG · 2026-05-22 · unverdicted · none · ref 61

Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
cs.AI · 2025-10-03 · unverdicted · none · ref 28

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers
cs.AI · 2026-06-16 · unverdicted · none · ref 4

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
cs.LG · 2026-04-20 · unverdicted · none · ref 185

Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task
cs.LG · 2026-04-14 · unverdicted · none · ref 18

The Serial Scaling Hypothesis
cs.LG · 2025-07-16 · unverdicted · none · ref 72
