The Serial Scaling Hypothesis

Kananart Kuwaranancharoen; Konpat Preechakul; Yutong Bai; Yuxi Liu

arxiv: 2507.12549 · v4 · submitted 2025-07-16 · 💻 cs.LG · cs.CC· stat.ML

The Serial Scaling Hypothesis

Yuxi Liu , Konpat Preechakul , Kananart Kuwaranancharoen , Yutong Bai This is my paper

Pith reviewed 2026-05-19 04:02 UTC · model grok-4.3

classification 💻 cs.LG cs.CCstat.ML

keywords diffusion modelsinherently serial problemssequential computationparallelization limitsmachine learning architecturescomplexity theoryserial scaling

0 comments

The pith

Diffusion models cannot solve inherently serial problems even though they generate outputs step by step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that machine learning progress has relied on parallel computation while ignoring problems that depend on sequential steps that cannot be sped up by running many operations at once. These inherently serial problems appear in mathematical reasoning, physical simulations, and sequential decision making. The authors use complexity theory to define the distinction and then show that diffusion models fail on such tasks despite their own sequential generation process. If the claim holds, current scaling approaches that emphasize parallelism will hit hard limits on a broad class of important tasks.

Core claim

We show for the first time that diffusion models, despite their sequential nature, are incapable of solving inherently serial problems that require sequentially dependent computational steps that cannot be efficiently parallelized.

What carries the argument

The formal distinction between parallelizable and inherently serial problems, drawn from complexity theory, used to explain why current architectures including diffusion models reach fundamental limits.

If this is right

Model architectures must incorporate explicit mechanisms for sequential dependence rather than relying solely on parallel layers.
Hardware development will need to prioritize support for non-parallelizable computation paths.
Scaling efforts on reasoning and simulation tasks will require serial-aware designs instead of pure width increases.
Diffusion models will need fundamental changes or replacement for any application involving true sequential dependence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The hypothesis suggests that future scaling laws should track serial depth separately from parallel width.
Hybrid architectures that combine parallel efficiency with serial chaining could address the identified gap.
Empirical tests on specific complexity classes could provide direct evidence for or against the serial limitation.

Load-bearing premise

There exist problems that are fundamentally sequential and cannot be efficiently parallelized, as formalized in complexity theory.

What would settle it

A diffusion model that solves a known inherently serial problem, such as a canonical P-complete task, at scale with efficiency matching or exceeding parallel methods would disprove the central claim.

Figures

Figures reproduced from arXiv: 2507.12549 by Kananart Kuwaranancharoen, Konpat Preechakul, Yutong Bai, Yuxi Liu.

**Figure 1.** Figure 1: (Left) Many easy Sudoku puzzles, where the circled blanks can be filled independently in parallel. (Right) A hard Sudoku with the same total compute, but the circled blanks are interdependent, requiring sequential reasoning. Consider Sudoku—a number-placement puzzle requiring each number 1–9 to appear once per row, column, and subgrid—as a parable ( [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: (A) A decision problem has a variable-size input and a fixed-size output (e.g., “yes”/“no”). (B) A serial problem requires deeper or more steps as the problem size grows. Examples of serial problems are: (C) Cellular automaton: takes the initial state as input and outputs a discrete value of the row N at cell i for i ∈ {1, . . . , 2N − 1}. (D) Many-body mechanics: takes initial positions and momenta of eac… view at source ↗

**Figure 3.** Figure 3: The complexity classes are nested as TC0 ⊆ TC1 ⊆ · · · ⊆ TC ⊆ P. Each containment is widely believed to be strict. Problems in TC are parallel, while those outside are inherently serial. Although the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989) shows that a 3-layer MLP with unbounded width can approximate any continuous function, a constant-depth MLP with only polynomial width is f… view at source ↗

**Figure 4.** Figure 4: A single run of Rule 110. Given the top row, the CA evolves row-by-row according to the 8 rules. We begin with predicting the outcome of cellular automata (CA) (Sarkar, 2000; Codd, 2014), shown in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Predicting the frame at time T. The intermediate frames may not be observable by camera motion/occlusion. In real videos, objects frequently leave the field of view or become occluded, as illustrated in [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Empirical serial scaling on (a) Hex board game and (b) locomotion & manipulation. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Average benchmark scores [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: The backbone network takes as input a sequence of tokens, and a single noisy output token. This is repeated until the output token is fully denoised. The Serial Scaling Hypothesis provide a lens to look at machine learning from serial–parallel lens. We have examined serial machine learning problems. We now turn to a diffusion models (Ho et al., 2020; Song et al., 2021; Song & Ermon, 2019), widely used in… view at source ↗

read the original abstract

While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These "inherently serial" problems-from mathematical reasoning to physical simulations to sequential decision-making-require sequentially dependent computational steps that cannot be efficiently parallelized. We formalize this distinction in complexity theory, and demonstrate that current parallel-centric architectures face fundamental limitations on such tasks. Then, we show for first time that diffusion models despite their sequential nature are incapable of solving inherently serial problems. We argue that recognizing the serial nature of computation holds profound implications on machine learning, model design, and hardware development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper makes a case for serial scaling limits but the diffusion model incapability claim looks more like an efficiency point than a proven impossibility without more details.

read the letter

The main takeaway is that this paper proposes a serial scaling hypothesis, arguing that some problems are inherently sequential and cannot be efficiently handled by parallel architectures, including diffusion models. The authors claim to show for the first time that diffusion models fail on these tasks despite their sequential sampling process. What is new here is the application of complexity theory concepts like P-complete problems to machine learning scaling. The paper does a good job of explaining why massive parallelization might hit walls in areas such as mathematical reasoning, physical simulations, and sequential decision making. It correctly notes that not all computation can be sped up by throwing more parallel resources at it, which is a useful reminder in the current scaling era. They earn credit for trying to formalize the serial versus parallel distinction and for thinking through implications for model architecture and hardware priorities. This could encourage more work on models that better capture sequential dependencies. That said, there are some soft spots that stand out. The strongest claim about diffusion models being incapable requires careful justification. As the stress-test points out, each denoising step can perform arbitrary parallel computation, and the number of steps is a hyperparameter that can be scaled. Without explicit arguments ruling out increasing steps or capacity to match problem depth, the limitation might be more about computational efficiency than a hard impossibility. The abstract mentions a demonstration and formalization, but without equations or experimental details visible, it's difficult to assess how they establish this. There's also the question of whether the definition of inherently serial problems is grounded independently or shaped around the weaknesses of current models. If it's the latter, the hypothesis could feel circular. Overall, this paper is for researchers focused on the theoretical side of scaling laws and those exploring alternatives to pure parallel scaling in domains requiring deep sequential reasoning. A reader who wants to think differently about why certain tasks resist current approaches would find it relevant. I recommend putting it through peer review. The core idea has enough substance to deserve referee feedback, which could strengthen the formal parts and clarify the evidence.

Referee Report

2 major / 2 minor

Summary. The paper identifies a class of 'inherently serial' problems (e.g., mathematical reasoning, physical simulations, sequential decision-making) that require sequentially dependent steps not efficiently parallelizable. It claims to formalize this distinction via complexity theory, shows limitations of parallel-centric architectures on such tasks, and asserts for the first time that diffusion models are incapable of solving them despite their iterative sequential sampling process, with broad implications for model design and hardware.

Significance. If the formalization and impossibility result hold, the work would identify a fundamental blind spot in current parallel-focused ML paradigms and diffusion models, potentially motivating serial-aware architectures. The explicit grounding in complexity theory (e.g., sequential depth requirements) is a positive feature that could make the hypothesis falsifiable.

major comments (2)

Abstract and central claim: the assertion that diffusion models are 'incapable' of solving inherently serial problems is presented without any equations, sampling schedule analysis, or experimental protocol showing that bounded denoising trajectories cannot emulate required sequential depth even when steps or per-step capacity are scaled with problem size.
Formalization section (implied by abstract): the definition of 'inherently serial problems' must be shown to be independently grounded in complexity theory (e.g., P-complete tasks outside NC) rather than retrofitted to observed diffusion limitations; otherwise the distinction risks circularity and the 'incapability' claim reduces to an efficiency statement.

minor comments (2)

The abstract states 'we formalize this distinction in complexity theory' but supplies no theorem statements, definitions, or references to specific complexity classes; adding these would strengthen readability.
Clarify whether the diffusion schedule (number of steps) is treated as fixed or allowed to grow with serial depth; this directly affects whether the result is an impossibility or a scaling claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report on our manuscript 'The Serial Scaling Hypothesis'. We address each major comment below with clarifications drawn from the manuscript's formalization and provide indications of revisions to strengthen the presentation of our claims.

read point-by-point responses

Referee: Abstract and central claim: the assertion that diffusion models are 'incapable' of solving inherently serial problems is presented without any equations, sampling schedule analysis, or experimental protocol showing that bounded denoising trajectories cannot emulate required sequential depth even when steps or per-step capacity are scaled with problem size.

Authors: The manuscript's core argument links the serial scaling hypothesis to the fact that diffusion sampling, while iterative, operates under a fixed parallel denoising structure whose effective depth is bounded independently of problem size. We agree that the abstract states the claim concisely and that explicit equations would improve clarity. In the revised manuscript we will add a subsection deriving the sampling trajectory equations and proving that scaling the number of denoising steps or per-step model capacity cannot produce arbitrary sequential depth for problems whose solution depth grows linearly with input size. We will also include a high-level experimental protocol for empirical verification on controlled serial tasks. revision: yes
Referee: Formalization section (implied by abstract): the definition of 'inherently serial problems' must be shown to be independently grounded in complexity theory (e.g., P-complete tasks outside NC) rather than retrofitted to observed diffusion limitations; otherwise the distinction risks circularity and the 'incapability' claim reduces to an efficiency statement.

Authors: The formalization begins from standard parallel complexity theory: we define inherently serial problems as those whose minimal circuit depth is super-polylogarithmic in the input size, i.e., problems in P but not in NC. Concrete examples include the circuit-value problem and certain formulations of linear programming, both known to be P-complete and therefore not parallelizable to polylog depth. This grounding precedes any discussion of machine-learning architectures and is drawn directly from existing results on NC versus P. The manuscript therefore does not retrofit the definition to diffusion models. To eliminate any residual concern about circularity we will revise the formalization section to include explicit citations to NC/P-completeness results together with the cited examples. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper formalizes inherently serial problems using established complexity theory (e.g., P-complete tasks not in NC), an external framework independent of the diffusion model analysis. The claim that diffusion models cannot solve such problems is presented as a first-time demonstration based on their iterative sampling process, without reducing to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or derivations in the provided abstract equate the result to its inputs by construction; the distinction between parallel and serial computation draws on prior complexity literature rather than being tailored tautologically to observed model limitations. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of a class of inherently serial problems and on the inability of diffusion models to handle them; these are introduced without independent empirical or formal verification in the abstract.

axioms (1)

domain assumption Some problems are fundamentally sequential and cannot be efficiently parallelized.
This distinction is the load-bearing premise formalized in complexity theory.

invented entities (1)

inherently serial problems no independent evidence
purpose: To categorize tasks whose computational steps are sequentially dependent and non-parallelizable.
New category introduced to explain architectural limits.

pith-pipeline@v0.9.0 · 5635 in / 1236 out tokens · 72107 ms · 2026-05-19T04:02:47.928135+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 4.1: diffusion model with TC0 backbone remains in TC0 even with infinitely many sampling steps
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking (D=3 forcing) unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Definition 2.2 and Assumption 2.3: parallel iff in TC; TC ⊊ P

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling
cs.AI 2026-03 unverdicted novelty 6.0

Timer-S1 is a released 8.3B-parameter MoE time series model that achieves state-of-the-art MASE and CRPS scores on GIFT-Eval using serial scaling and Serial-Token Prediction.

Reference graph

Works this paper leans on

139 extracted references · 139 canonical work pages · cited by 2 Pith papers · 26 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Aggarwal, P. and Welleck, S. L1 : Controlling how long a reasoning model thinks with reinforcement learning. arXiv [cs.CL], 6 March 2025. URL http://arxiv.org/abs/2503.04697

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Amdahl, G. M. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, pp.\ 483--485, 1967

work page 1967
[4]

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., Sahoo, S. S., and Kuleshov, V. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

D., Ho, J., Tarlow, D., and Van Den Berg, R

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34: 0 17981--17993, 2021

work page 2021
[6]

Navigation world models

Bar, A., Zhou, G., Tran, D., Darrell, T., and LeCun, Y. Navigation world models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[7]

J., and Telgarsky, M

Bartlett, P., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 26 June 2017

work page 2017
[8]

Learning long-term dependencies with gradient descent is difficult.IEEE Transactions on Neural Networks, 5(2):157–166, 1994

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw., 5 0 (2): 0 157--166, 1994. URL http://dx.doi.org/10.1109/72.279181

work page doi:10.1109/72.279181 1994
[9]

On the ability and limitations of transformers to recognize formal languages

Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. arXiv preprint arXiv:2009.11264, 2020

work page arXiv 2009
[10]

Blelloch, G. E. Programming parallel algorithms. Communications of the ACM, 39 0 (3): 0 85--97, 1996

work page 1996
[11]

Average-case complexity

Bogdanov, A., Trevisan, L., et al. Average-case complexity. Foundations and Trends in Theoretical Computer Science , 2 0 (1): 0 1--106, 2006

work page 2006
[12]

Video generation models as world simulators

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al. Video generation models as world simulators. OpenAI Blog, 1: 0 8, 2024

work page 2024
[13]

Theoretical limitations of multi-layer transformer

Chen, L., Peng, B., and Wu, H. Theoretical limitations of multi-layer transformer. arXiv preprint arXiv:2412.02975, 2024 a

work page arXiv 2024
[14]

The computational limits of state-space models and mamba via the lens of circuit complexity

Chen, Y., Li, X., Liang, Y., Shi, Z., and Song, Z. The computational limits of state-space models and mamba via the lens of circuit complexity. arXiv preprint arXiv:2412.06148, 2024 b

work page arXiv 2024
[15]

Transformers in uniform TC0

Chiang, D. Transformers in uniform TC0 . arXiv preprint arXiv:2409.13629, 2024

work page arXiv 2024
[16]

On the Measure of Intelligence

Chollet, F. On the measure of intelligence. arXiv [cs.AI], 4 November 2019. URL http://arxiv.org/abs/1911.01547

work page internal anchor Pith review Pith/arXiv arXiv 2019
[17]

W., Sutton, C., Gehrmann, S., et al

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24 0 (240): 0 1--113, 2023

work page 2023
[18]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv [cs.LG], 27 October 2021. URL http://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Codd, E. F. Cellular automata. Academic press, 2014

work page 2014
[20]

On the expressive power of deep learning: A tensor analysis

Cohen, N., Sharir, O., and Shashua, A. On the expressive power of deep learning: A tensor analysis. In Conference on learning theory, pp.\ 698--728. PMLR, 2016

work page 2016
[21]

Cook, S. A. A taxonomy of problems with fast parallel algorithms. Information and Control, 64 0 (1--3): 0 2--22, 1985

work page 1985
[22]

Approximation by superpositions of a sigmoidal function

Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems, 2 0 (4): 0 303--314, December 1989. URL https://web.njit.edu/ usman/courses/cs675_fall18/10.1.1.441.7873.pdf

work page 1989
[23]

FlashAttention -2: Faster attention with better parallelism and work partitioning

Dao, T. FlashAttention -2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

work page 2024
[24]

Diffusion Models Beat GANs on Image Synthesis

Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, pp.\ 8780--8794, 11 May 2021. URL http://arxiv.org/abs/2105.05233

work page internal anchor Pith review Pith/arXiv arXiv 2021
[25]

Paradigms for Fast Parallel Approximability

D \' az, J., Serna, M., Spirakis, P., and Tor \'a n, J. Paradigms for Fast Parallel Approximability. Cambridge University Press, Cambridge, 1st edition, 1997

work page 1997
[26]

An image is worth 16x16 words: Transformers for image recognition at scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

work page 2021
[27]

Diffusiondepth: Diffusion denoising approach for monocular depth estimation

Duan, Y., Guo, X., and Zhu, Z. Diffusiondepth: Diffusion denoising approach for monocular depth estimation. In European Conference on Computer Vision, pp.\ 432--449. Springer, 2024

work page 2024
[28]

and Shamir, O

Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on learning theory, pp.\ 907--940. PMLR, 2016

work page 2016
[29]

B., Goldberg, K., Lee, Y., Hafner, D., and Abbeel, P

Escontrela, A., Adeniji, A., Yan, W., Jain, A., Peng, X. B., Goldberg, K., Lee, Y., Hafner, D., and Abbeel, P. Video prediction models as rewards for reinforcement learning. Neural Inf Process Syst, abs/2305.14343: 0 68760--68783, 23 May 2023. URL http://dx.doi.org/10.48550/arXiv.2305.14343

work page doi:10.48550/arxiv.2305.14343 2023
[30]

Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. J. Efficient and accurate estimation of lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, 11 June 2019

work page 2019
[31]

Towards revealing the mystery behind chain of thought: a theoretical perspective

Feng, G., Zhang, B., Gu, Y., Ye, H., He, D., and Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36: 0 70757--70798, 2023

work page 2023
[32]

and Toffoli, T

Fredkin, E. and Toffoli, T. Conservative logic. Int. J. Theor. Phys., 21 0 (3-4): 0 219--253, April 1982. URL http://dx.doi.org/10.1007/BF01857727

work page doi:10.1007/bf01857727 1982
[33]

Gander, M. J. and Vandewalle, S. Analysis of the parareal time-parallel time-integration method. SIAM Journal on Scientific Computing, 29 0 (2): 0 556--578, 2007

work page 2007
[34]

J., and Ruzzo, W

Greenlaw, R., Hoover, H. J., and Ruzzo, W. L. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, New York, 1995. ISBN 0-19-508591-4

work page 1995
[35]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Efficiently Modeling Long Sequences with Structured State Spaces

Gu, A., Goel, K., and R \'e , C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[37]

and Hashimoto, T

Gulrajani, I. and Hashimoto, T. B. Likelihood-based diffusion language models. In Advances in Neural Information Processing Systems, 30 May 2023

work page 2023
[38]

Gustafson, J. L. Reevaluating amdahl's law. Communications of the ACM, 31 0 (5): 0 532--533, 1988

work page 1988
[39]

Theoretical limitations of self-attention in neural sequence models

Hahn, M. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8: 0 156--171, 2020

work page 2020
[40]

Threshold circuits of bounded depth

Hajnal, A., Maass, W., Pudl \'a k, P., Szegedy, M., and Tur \'a n, G. Threshold circuits of bounded depth. Journal of Computer and System Sciences, 46 0 (2): 0 129--154, 1993

work page 1993
[41]

Formal language recognition by hard attention transformers: Perspectives from circuit complexity

Hao, Y., Angluin, D., and Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Transactions of the Association for Computational Linguistics, 10: 0 800--810, 2022

work page 2022
[42]

Denoising Diffusion Probabilistic Models

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 19 June 2020. URL http://arxiv.org/abs/2006.11239

work page internal anchor Pith review Pith/arXiv arXiv 2020
[43]

Probability inequalities for sums of bounded random variables

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58 0 (301): 0 13--30, 1963

work page 1963
[44]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[45]

Multilayer feedforward networks are universal approximators

Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Netw., 2 0 (5): 0 359--366, January 1989. URL https://cognitivemedium.com/magic_paper/assets/Hornik.pdf

work page 1989
[46]

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-L., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey,...

work page 2017
[47]

Jukna, S. et al. Boolean function complexity: advances and frontiers, volume 27. Springer, 2012

work page 2012
[48]

B., Blelloch, G

Kang, H., Gibbons, P. B., Blelloch, G. E., Dhulipala, L., Gu, Y., and McGuffey, C. The processing-in-memory model. In Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures, volume 1, pp.\ 295--306, New York, NY, USA, 6 July 2021. ACM. URL http://dx.doi.org/10.1145/3409964.3461816

work page doi:10.1145/3409964.3461816 2021
[49]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[50]

A comprehensive review of processing-in-memory architectures for deep neural networks

Kaur, R., Asad, A., and Mohammadi, F. A comprehensive review of processing-in-memory architectures for deep neural networks. Computers, 13 0 (7): 0 174, 16 July 2024. URL http://dx.doi.org/10.3390/computers13070174

work page doi:10.3390/computers13070174 2024
[51]

Big-bench extra hard.arXiv preprint arXiv:2502.19187, 2025

Kazemi, M., Fatemi, B., Bansal, H., Palowitch, J., Anastasiou, C., Mehta, S. V., Jain, L. K., Aglietti, V., Jindal, D., Chen, P., Dikkala, N., Tyen, G., Liu, X., Shalit, U., Chiappa, S., Olszewska, K., Tay, Y., Tran, V. Q., Le, Q. V., and Firat, O. BIG -bench extra hard. arXiv [cs.CL], 26 February 2025. URL http://dx.doi.org/10.48550/arXiv.2502.19187

work page doi:10.48550/arxiv.2502.19187 2025
[52]

C., and Schindler, K

Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R. C., and Schindler, K. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[53]

1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities, February 2026

Kevin, W., Ishaan, J., Michał, B., Tomasz, T., and Benjamin, E. 1000 layer networks for self-supervised RL : Scaling depth can enable new goal-reaching capabilities. arXiv [cs.LG], 19 March 2025. URL http://arxiv.org/abs/2503.14858

work page arXiv 2025
[54]

Kirousis, L. M. and Spirakis, P. Probabilistic log-space reductions and problems probabilistically hard for P . In SWAT 88: 1st Scandinavian Workshop on Algorithm Theory, Halmstad, Sweden, July 5--8, 1988 Proceedings. Springer Berlin Heidelberg, 1988

work page 1988
[55]

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012

work page 2012
[56]

Ladner, R. E. The circuit value problem is log space complete for p. ACM Sigact News, 7 0 (1): 0 18--20, 1975

work page 1975
[57]

and Yan, Y

Li, G. and Yan, Y. O(d/T) convergence theory for diffusion probabilistic models under minimal assumptions. arXiv preprint arXiv:2409.18959, 2024

work page arXiv 2024
[58]

S., and Hashimoto, T

Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-LM improves controllable text generation. Advances in neural information processing systems, 35: 0 4328--4343, 2022

work page 2022
[59]

Chain of thought empowers transformers to solve inherently serial problems

Li, Z., Liu, H., Zhou, D., and Ma, T. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 2024

work page arXiv 2024
[60]

A., Manning, C

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C. A., Manning, C. D., Re, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul,...

work page 2023
[61]

SDXL-Lightning: Progressive Adversarial Diffusion Distillation

Lin, S., Wang, A., and Yang, X. SDXL -lightning: Progressive adversarial diffusion distillation. arXiv [cs.CV], 21 February 2024. URL http://arxiv.org/abs/2402.13929

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

Résolution d' EDP par un schéma en temps «pararéel »

Lions, J.-L., Maday, Y., and Turinici, G. Résolution d' EDP par un schéma en temps «pararéel ». C.R. Acad. Sci. (Ser. 1) (Math./Math.), 332 0 (7): 0 661--668, 1 April 2001. URL http://dx.doi.org/10.1016/S0764-4442(00)01793-6

work page doi:10.1016/s0764-4442(00)01793-6 2001
[63]

arXiv preprint arXiv:2210.10749 , year=

Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022

work page arXiv 2022
[64]

InstaFlow : One step is enough for high-quality diffusion-based text-to-image generation

Liu, X., Zhang, X., Ma, J., Peng, J., and Liu, Q. InstaFlow : One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2024. URL http://arxiv.org/abs/2309.06380

work page arXiv 2024
[65]

Discrete diffusion modeling by estimating the ratios of the data distribution

Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

work page 2024
[66]

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Ma, N., Tong, S., Jia, H., Hu, H., Su, Y.-C., Zhang, M., Yang, X., Li, Y., Jaakkola, T., Jia, X., and Xie, S. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv [cs.CV], 16 January 2025. URL http://arxiv.org/abs/2501.09732

work page internal anchor Pith review arXiv 2025
[67]

and Mula, O

Maday, Y. and Mula, O. An adaptive parareal algorithm. J. Comput. Appl. Math., 377 0 (112915): 0 112915, 15 October 2020. URL http://dx.doi.org/10.1016/j.cam.2020.112915

work page doi:10.1016/j.cam.2020.112915 2020
[68]

Potemkin understanding in large language models

Mancoridis, M., Weeks, B., Vafa, K., and Mullainathan, S. Potemkin understanding in large language models. arXiv [cs.CL], 26 June 2025. URL http://arxiv.org/abs/2506.21521

work page arXiv 2025
[69]

Deep Learning: A Critical Appraisal

Marcus, G. Deep learning: A critical appraisal. arXiv [cs.AI], 2 January 2018. URL http://arxiv.org/abs/1801.00631

work page internal anchor Pith review Pith/arXiv arXiv 2018
[70]

and Sabharwal, A

Merrill, W. and Sabharwal, A. The expresssive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023 a

work page arXiv 2023
[71]

and Sabharwal, A

Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11: 0 531--545, 2023 b

work page 2023
[72]

A little depth goes a long way: The expressive power of log-depth transformers.CoRR, abs/2503.03961, 2025

Merrill, W. and Sabharwal, A. A little depth goes a long way: The expressive power of log-depth transformers. arXiv preprint arXiv:2503.03961, 2025

work page arXiv 2025
[73]

Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10: 0 843--856, 2022

work page 2022
[74]

R Chris Miall and Daniel M Wolpert

Merrill, W., Petty, J., and Sabharwal, A. The illusion of state in state-space models. arXiv preprint arXiv:2404.08819, 2024

work page arXiv 2024
[75]

Unpredictability and undecidability in dynamical systems

Moore, C. Unpredictability and undecidability in dynamical systems. Physical Review Letters, 64 0 (20): 0 2354, 1990

work page 1990
[76]

Majority-vote cellular automata, ising dynamics, and p-completeness

Moore, C. Majority-vote cellular automata, ising dynamics, and p-completeness. Journal of Statistical Physics, 88: 0 795--805, 1997

work page 1997
[77]

and Klabjan, D

Mu, S. and Klabjan, D. On the second-order convergence of biased policy gradient algorithms. In Proceedings of the 41st International Conference on Machine Learning, 2024

work page 2024
[78]

s1: Simple test-time scaling

Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Cand \`e s, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[79]

Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International conference on machine learning, pp.\ 8162--8171. PMLR, 2021

work page 2021
[80]

Large Language Diffusion Models

Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

Showing first 80 references.

[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Aggarwal, P. and Welleck, S. L1 : Controlling how long a reasoning model thinks with reinforcement learning. arXiv [cs.CL], 6 March 2025. URL http://arxiv.org/abs/2503.04697

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Amdahl, G. M. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, pp.\ 483--485, 1967

work page 1967

[4] [4]

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., Sahoo, S. S., and Kuleshov, V. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

D., Ho, J., Tarlow, D., and Van Den Berg, R

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34: 0 17981--17993, 2021

work page 2021

[6] [6]

Navigation world models

Bar, A., Zhou, G., Tran, D., Darrell, T., and LeCun, Y. Navigation world models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025

[7] [7]

J., and Telgarsky, M

Bartlett, P., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 26 June 2017

work page 2017

[8] [8]

Learning long-term dependencies with gradient descent is difficult.IEEE Transactions on Neural Networks, 5(2):157–166, 1994

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw., 5 0 (2): 0 157--166, 1994. URL http://dx.doi.org/10.1109/72.279181

work page doi:10.1109/72.279181 1994

[9] [9]

On the ability and limitations of transformers to recognize formal languages

Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. arXiv preprint arXiv:2009.11264, 2020

work page arXiv 2009

[10] [10]

Blelloch, G. E. Programming parallel algorithms. Communications of the ACM, 39 0 (3): 0 85--97, 1996

work page 1996

[11] [11]

Average-case complexity

Bogdanov, A., Trevisan, L., et al. Average-case complexity. Foundations and Trends in Theoretical Computer Science , 2 0 (1): 0 1--106, 2006

work page 2006

[12] [12]

Video generation models as world simulators

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al. Video generation models as world simulators. OpenAI Blog, 1: 0 8, 2024

work page 2024

[13] [13]

Theoretical limitations of multi-layer transformer

Chen, L., Peng, B., and Wu, H. Theoretical limitations of multi-layer transformer. arXiv preprint arXiv:2412.02975, 2024 a

work page arXiv 2024

[14] [14]

The computational limits of state-space models and mamba via the lens of circuit complexity

Chen, Y., Li, X., Liang, Y., Shi, Z., and Song, Z. The computational limits of state-space models and mamba via the lens of circuit complexity. arXiv preprint arXiv:2412.06148, 2024 b

work page arXiv 2024

[15] [15]

Transformers in uniform TC0

Chiang, D. Transformers in uniform TC0 . arXiv preprint arXiv:2409.13629, 2024

work page arXiv 2024

[16] [16]

On the Measure of Intelligence

Chollet, F. On the measure of intelligence. arXiv [cs.AI], 4 November 2019. URL http://arxiv.org/abs/1911.01547

work page internal anchor Pith review Pith/arXiv arXiv 2019

[17] [17]

W., Sutton, C., Gehrmann, S., et al

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24 0 (240): 0 1--113, 2023

work page 2023

[18] [18]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv [cs.LG], 27 October 2021. URL http://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021

[19] [19]

Codd, E. F. Cellular automata. Academic press, 2014

work page 2014

[20] [20]

On the expressive power of deep learning: A tensor analysis

Cohen, N., Sharir, O., and Shashua, A. On the expressive power of deep learning: A tensor analysis. In Conference on learning theory, pp.\ 698--728. PMLR, 2016

work page 2016

[21] [21]

Cook, S. A. A taxonomy of problems with fast parallel algorithms. Information and Control, 64 0 (1--3): 0 2--22, 1985

work page 1985

[22] [22]

Approximation by superpositions of a sigmoidal function

Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems, 2 0 (4): 0 303--314, December 1989. URL https://web.njit.edu/ usman/courses/cs675_fall18/10.1.1.441.7873.pdf

work page 1989

[23] [23]

FlashAttention -2: Faster attention with better parallelism and work partitioning

Dao, T. FlashAttention -2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

work page 2024

[24] [24]

Diffusion Models Beat GANs on Image Synthesis

Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, pp.\ 8780--8794, 11 May 2021. URL http://arxiv.org/abs/2105.05233

work page internal anchor Pith review Pith/arXiv arXiv 2021

[25] [25]

Paradigms for Fast Parallel Approximability

D \' az, J., Serna, M., Spirakis, P., and Tor \'a n, J. Paradigms for Fast Parallel Approximability. Cambridge University Press, Cambridge, 1st edition, 1997

work page 1997

[26] [26]

An image is worth 16x16 words: Transformers for image recognition at scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

work page 2021

[27] [27]

Diffusiondepth: Diffusion denoising approach for monocular depth estimation

Duan, Y., Guo, X., and Zhu, Z. Diffusiondepth: Diffusion denoising approach for monocular depth estimation. In European Conference on Computer Vision, pp.\ 432--449. Springer, 2024

work page 2024

[28] [28]

and Shamir, O

Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on learning theory, pp.\ 907--940. PMLR, 2016

work page 2016

[29] [29]

B., Goldberg, K., Lee, Y., Hafner, D., and Abbeel, P

Escontrela, A., Adeniji, A., Yan, W., Jain, A., Peng, X. B., Goldberg, K., Lee, Y., Hafner, D., and Abbeel, P. Video prediction models as rewards for reinforcement learning. Neural Inf Process Syst, abs/2305.14343: 0 68760--68783, 23 May 2023. URL http://dx.doi.org/10.48550/arXiv.2305.14343

work page doi:10.48550/arxiv.2305.14343 2023

[30] [30]

Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. J. Efficient and accurate estimation of lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, 11 June 2019

work page 2019

[31] [31]

Towards revealing the mystery behind chain of thought: a theoretical perspective

Feng, G., Zhang, B., Gu, Y., Ye, H., He, D., and Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36: 0 70757--70798, 2023

work page 2023

[32] [32]

and Toffoli, T

Fredkin, E. and Toffoli, T. Conservative logic. Int. J. Theor. Phys., 21 0 (3-4): 0 219--253, April 1982. URL http://dx.doi.org/10.1007/BF01857727

work page doi:10.1007/bf01857727 1982

[33] [33]

Gander, M. J. and Vandewalle, S. Analysis of the parareal time-parallel time-integration method. SIAM Journal on Scientific Computing, 29 0 (2): 0 556--578, 2007

work page 2007

[34] [34]

J., and Ruzzo, W

Greenlaw, R., Hoover, H. J., and Ruzzo, W. L. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, New York, 1995. ISBN 0-19-508591-4

work page 1995

[35] [35]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

Efficiently Modeling Long Sequences with Structured State Spaces

Gu, A., Goel, K., and R \'e , C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[37] [37]

and Hashimoto, T

Gulrajani, I. and Hashimoto, T. B. Likelihood-based diffusion language models. In Advances in Neural Information Processing Systems, 30 May 2023

work page 2023

[38] [38]

Gustafson, J. L. Reevaluating amdahl's law. Communications of the ACM, 31 0 (5): 0 532--533, 1988

work page 1988

[39] [39]

Theoretical limitations of self-attention in neural sequence models

Hahn, M. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8: 0 156--171, 2020

work page 2020

[40] [40]

Threshold circuits of bounded depth

Hajnal, A., Maass, W., Pudl \'a k, P., Szegedy, M., and Tur \'a n, G. Threshold circuits of bounded depth. Journal of Computer and System Sciences, 46 0 (2): 0 129--154, 1993

work page 1993

[41] [41]

Formal language recognition by hard attention transformers: Perspectives from circuit complexity

Hao, Y., Angluin, D., and Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Transactions of the Association for Computational Linguistics, 10: 0 800--810, 2022

work page 2022

[42] [42]

Denoising Diffusion Probabilistic Models

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 19 June 2020. URL http://arxiv.org/abs/2006.11239

work page internal anchor Pith review Pith/arXiv arXiv 2020

[43] [43]

Probability inequalities for sums of bounded random variables

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58 0 (301): 0 13--30, 1963

work page 1963

[44] [44]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[45] [45]

Multilayer feedforward networks are universal approximators

Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Netw., 2 0 (5): 0 359--366, January 1989. URL https://cognitivemedium.com/magic_paper/assets/Hornik.pdf

work page 1989

[46] [46]

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-L., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey,...

work page 2017

[47] [47]

Jukna, S. et al. Boolean function complexity: advances and frontiers, volume 27. Springer, 2012

work page 2012

[48] [48]

B., Blelloch, G

Kang, H., Gibbons, P. B., Blelloch, G. E., Dhulipala, L., Gu, Y., and McGuffey, C. The processing-in-memory model. In Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures, volume 1, pp.\ 295--306, New York, NY, USA, 6 July 2021. ACM. URL http://dx.doi.org/10.1145/3409964.3461816

work page doi:10.1145/3409964.3461816 2021

[49] [49]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[50] [50]

A comprehensive review of processing-in-memory architectures for deep neural networks

Kaur, R., Asad, A., and Mohammadi, F. A comprehensive review of processing-in-memory architectures for deep neural networks. Computers, 13 0 (7): 0 174, 16 July 2024. URL http://dx.doi.org/10.3390/computers13070174

work page doi:10.3390/computers13070174 2024

[51] [51]

Big-bench extra hard.arXiv preprint arXiv:2502.19187, 2025

Kazemi, M., Fatemi, B., Bansal, H., Palowitch, J., Anastasiou, C., Mehta, S. V., Jain, L. K., Aglietti, V., Jindal, D., Chen, P., Dikkala, N., Tyen, G., Liu, X., Shalit, U., Chiappa, S., Olszewska, K., Tay, Y., Tran, V. Q., Le, Q. V., and Firat, O. BIG -bench extra hard. arXiv [cs.CL], 26 February 2025. URL http://dx.doi.org/10.48550/arXiv.2502.19187

work page doi:10.48550/arxiv.2502.19187 2025

[52] [52]

C., and Schindler, K

Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R. C., and Schindler, K. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[53] [53]

1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities, February 2026

Kevin, W., Ishaan, J., Michał, B., Tomasz, T., and Benjamin, E. 1000 layer networks for self-supervised RL : Scaling depth can enable new goal-reaching capabilities. arXiv [cs.LG], 19 March 2025. URL http://arxiv.org/abs/2503.14858

work page arXiv 2025

[54] [54]

Kirousis, L. M. and Spirakis, P. Probabilistic log-space reductions and problems probabilistically hard for P . In SWAT 88: 1st Scandinavian Workshop on Algorithm Theory, Halmstad, Sweden, July 5--8, 1988 Proceedings. Springer Berlin Heidelberg, 1988

work page 1988

[55] [55]

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012

work page 2012

[56] [56]

Ladner, R. E. The circuit value problem is log space complete for p. ACM Sigact News, 7 0 (1): 0 18--20, 1975

work page 1975

[57] [57]

and Yan, Y

Li, G. and Yan, Y. O(d/T) convergence theory for diffusion probabilistic models under minimal assumptions. arXiv preprint arXiv:2409.18959, 2024

work page arXiv 2024

[58] [58]

S., and Hashimoto, T

Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-LM improves controllable text generation. Advances in neural information processing systems, 35: 0 4328--4343, 2022

work page 2022

[59] [59]

Chain of thought empowers transformers to solve inherently serial problems

Li, Z., Liu, H., Zhou, D., and Ma, T. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 2024

work page arXiv 2024

[60] [60]

A., Manning, C

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C. A., Manning, C. D., Re, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul,...

work page 2023

[61] [61]

SDXL-Lightning: Progressive Adversarial Diffusion Distillation

Lin, S., Wang, A., and Yang, X. SDXL -lightning: Progressive adversarial diffusion distillation. arXiv [cs.CV], 21 February 2024. URL http://arxiv.org/abs/2402.13929

work page internal anchor Pith review Pith/arXiv arXiv 2024

[62] [62]

Résolution d' EDP par un schéma en temps «pararéel »

Lions, J.-L., Maday, Y., and Turinici, G. Résolution d' EDP par un schéma en temps «pararéel ». C.R. Acad. Sci. (Ser. 1) (Math./Math.), 332 0 (7): 0 661--668, 1 April 2001. URL http://dx.doi.org/10.1016/S0764-4442(00)01793-6

work page doi:10.1016/s0764-4442(00)01793-6 2001

[63] [63]

arXiv preprint arXiv:2210.10749 , year=

Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022

work page arXiv 2022

[64] [64]

InstaFlow : One step is enough for high-quality diffusion-based text-to-image generation

Liu, X., Zhang, X., Ma, J., Peng, J., and Liu, Q. InstaFlow : One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2024. URL http://arxiv.org/abs/2309.06380

work page arXiv 2024

[65] [65]

Discrete diffusion modeling by estimating the ratios of the data distribution

Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

work page 2024

[66] [66]

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Ma, N., Tong, S., Jia, H., Hu, H., Su, Y.-C., Zhang, M., Yang, X., Li, Y., Jaakkola, T., Jia, X., and Xie, S. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv [cs.CV], 16 January 2025. URL http://arxiv.org/abs/2501.09732

work page internal anchor Pith review arXiv 2025

[67] [67]

and Mula, O

Maday, Y. and Mula, O. An adaptive parareal algorithm. J. Comput. Appl. Math., 377 0 (112915): 0 112915, 15 October 2020. URL http://dx.doi.org/10.1016/j.cam.2020.112915

work page doi:10.1016/j.cam.2020.112915 2020

[68] [68]

Potemkin understanding in large language models

Mancoridis, M., Weeks, B., Vafa, K., and Mullainathan, S. Potemkin understanding in large language models. arXiv [cs.CL], 26 June 2025. URL http://arxiv.org/abs/2506.21521

work page arXiv 2025

[69] [69]

Deep Learning: A Critical Appraisal

Marcus, G. Deep learning: A critical appraisal. arXiv [cs.AI], 2 January 2018. URL http://arxiv.org/abs/1801.00631

work page internal anchor Pith review Pith/arXiv arXiv 2018

[70] [70]

and Sabharwal, A

Merrill, W. and Sabharwal, A. The expresssive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023 a

work page arXiv 2023

[71] [71]

and Sabharwal, A

Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11: 0 531--545, 2023 b

work page 2023

[72] [72]

A little depth goes a long way: The expressive power of log-depth transformers.CoRR, abs/2503.03961, 2025

Merrill, W. and Sabharwal, A. A little depth goes a long way: The expressive power of log-depth transformers. arXiv preprint arXiv:2503.03961, 2025

work page arXiv 2025

[73] [73]

Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10: 0 843--856, 2022

work page 2022

[74] [74]

R Chris Miall and Daniel M Wolpert

Merrill, W., Petty, J., and Sabharwal, A. The illusion of state in state-space models. arXiv preprint arXiv:2404.08819, 2024

work page arXiv 2024

[75] [75]

Unpredictability and undecidability in dynamical systems

Moore, C. Unpredictability and undecidability in dynamical systems. Physical Review Letters, 64 0 (20): 0 2354, 1990

work page 1990

[76] [76]

Majority-vote cellular automata, ising dynamics, and p-completeness

Moore, C. Majority-vote cellular automata, ising dynamics, and p-completeness. Journal of Statistical Physics, 88: 0 795--805, 1997

work page 1997

[77] [77]

and Klabjan, D

Mu, S. and Klabjan, D. On the second-order convergence of biased policy gradient algorithms. In Proceedings of the 41st International Conference on Machine Learning, 2024

work page 2024

[78] [78]

s1: Simple test-time scaling

Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Cand \`e s, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[79] [79]

Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International conference on machine learning, pp.\ 8162--8171. PMLR, 2021

work page 2021

[80] [80]

Large Language Diffusion Models

Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025