pith. sign in

arxiv: 2507.12549 · v4 · submitted 2025-07-16 · 💻 cs.LG · cs.CC· stat.ML

The Serial Scaling Hypothesis

Pith reviewed 2026-05-19 04:02 UTC · model grok-4.3

classification 💻 cs.LG cs.CCstat.ML
keywords diffusion modelsinherently serial problemssequential computationparallelization limitsmachine learning architecturescomplexity theoryserial scaling
0
0 comments X

The pith

Diffusion models cannot solve inherently serial problems even though they generate outputs step by step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that machine learning progress has relied on parallel computation while ignoring problems that depend on sequential steps that cannot be sped up by running many operations at once. These inherently serial problems appear in mathematical reasoning, physical simulations, and sequential decision making. The authors use complexity theory to define the distinction and then show that diffusion models fail on such tasks despite their own sequential generation process. If the claim holds, current scaling approaches that emphasize parallelism will hit hard limits on a broad class of important tasks.

Core claim

We show for the first time that diffusion models, despite their sequential nature, are incapable of solving inherently serial problems that require sequentially dependent computational steps that cannot be efficiently parallelized.

What carries the argument

The formal distinction between parallelizable and inherently serial problems, drawn from complexity theory, used to explain why current architectures including diffusion models reach fundamental limits.

If this is right

  • Model architectures must incorporate explicit mechanisms for sequential dependence rather than relying solely on parallel layers.
  • Hardware development will need to prioritize support for non-parallelizable computation paths.
  • Scaling efforts on reasoning and simulation tasks will require serial-aware designs instead of pure width increases.
  • Diffusion models will need fundamental changes or replacement for any application involving true sequential dependence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hypothesis suggests that future scaling laws should track serial depth separately from parallel width.
  • Hybrid architectures that combine parallel efficiency with serial chaining could address the identified gap.
  • Empirical tests on specific complexity classes could provide direct evidence for or against the serial limitation.

Load-bearing premise

There exist problems that are fundamentally sequential and cannot be efficiently parallelized, as formalized in complexity theory.

What would settle it

A diffusion model that solves a known inherently serial problem, such as a canonical P-complete task, at scale with efficiency matching or exceeding parallel methods would disprove the central claim.

Figures

Figures reproduced from arXiv: 2507.12549 by Kananart Kuwaranancharoen, Konpat Preechakul, Yutong Bai, Yuxi Liu.

Figure 1
Figure 1. Figure 1: (Left) Many easy Sudoku puz￾zles, where the circled blanks can be filled independently in parallel. (Right) A hard Sudoku with the same total com￾pute, but the circled blanks are interde￾pendent, requiring sequential reasoning. Consider Sudoku—a number-placement puzzle requiring each number 1–9 to appear once per row, column, and subgrid—as a parable ( [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (A) A decision problem has a variable-size input and a fixed-size output (e.g., “yes”/“no”). (B) A serial problem requires deeper or more steps as the problem size grows. Examples of serial problems are: (C) Cellular automaton: takes the initial state as input and outputs a discrete value of the row N at cell i for i ∈ {1, . . . , 2N − 1}. (D) Many-body mechanics: takes initial positions and momenta of eac… view at source ↗
Figure 3
Figure 3. Figure 3: The complexity classes are nested as TC0 ⊆ TC1 ⊆ · · · ⊆ TC ⊆ P. Each containment is widely believed to be strict. Problems in TC are parallel, while those outside are inherently serial. Although the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989) shows that a 3-layer MLP with unbounded width can approximate any continuous func￾tion, a constant-depth MLP with only polynomial width is f… view at source ↗
Figure 4
Figure 4. Figure 4: A single run of Rule 110. Given the top row, the CA evolves row-by-row according to the 8 rules. We begin with predicting the outcome of cellular automata (CA) (Sarkar, 2000; Codd, 2014), shown in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Predicting the frame at time T. The intermediate frames may not be observable by camera motion/occlusion. In real videos, objects frequently leave the field of view or become occluded, as illustrated in [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Empirical serial scaling on (a) Hex board game and (b) locomotion & manipulation. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Average benchmark scores [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The backbone network takes as input a sequence of tokens, and a single noisy output token. This is repeated until the output to￾ken is fully denoised. The Serial Scaling Hypothesis provide a lens to look at machine learning from serial–parallel lens. We have examined serial machine learning problems. We now turn to a diffusion mod￾els (Ho et al., 2020; Song et al., 2021; Song & Ermon, 2019), widely used in… view at source ↗
read the original abstract

While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These "inherently serial" problems-from mathematical reasoning to physical simulations to sequential decision-making-require sequentially dependent computational steps that cannot be efficiently parallelized. We formalize this distinction in complexity theory, and demonstrate that current parallel-centric architectures face fundamental limitations on such tasks. Then, we show for first time that diffusion models despite their sequential nature are incapable of solving inherently serial problems. We argue that recognizing the serial nature of computation holds profound implications on machine learning, model design, and hardware development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies a class of 'inherently serial' problems (e.g., mathematical reasoning, physical simulations, sequential decision-making) that require sequentially dependent steps not efficiently parallelizable. It claims to formalize this distinction via complexity theory, shows limitations of parallel-centric architectures on such tasks, and asserts for the first time that diffusion models are incapable of solving them despite their iterative sequential sampling process, with broad implications for model design and hardware.

Significance. If the formalization and impossibility result hold, the work would identify a fundamental blind spot in current parallel-focused ML paradigms and diffusion models, potentially motivating serial-aware architectures. The explicit grounding in complexity theory (e.g., sequential depth requirements) is a positive feature that could make the hypothesis falsifiable.

major comments (2)
  1. Abstract and central claim: the assertion that diffusion models are 'incapable' of solving inherently serial problems is presented without any equations, sampling schedule analysis, or experimental protocol showing that bounded denoising trajectories cannot emulate required sequential depth even when steps or per-step capacity are scaled with problem size.
  2. Formalization section (implied by abstract): the definition of 'inherently serial problems' must be shown to be independently grounded in complexity theory (e.g., P-complete tasks outside NC) rather than retrofitted to observed diffusion limitations; otherwise the distinction risks circularity and the 'incapability' claim reduces to an efficiency statement.
minor comments (2)
  1. The abstract states 'we formalize this distinction in complexity theory' but supplies no theorem statements, definitions, or references to specific complexity classes; adding these would strengthen readability.
  2. Clarify whether the diffusion schedule (number of steps) is treated as fixed or allowed to grow with serial depth; this directly affects whether the result is an impossibility or a scaling claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report on our manuscript 'The Serial Scaling Hypothesis'. We address each major comment below with clarifications drawn from the manuscript's formalization and provide indications of revisions to strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: Abstract and central claim: the assertion that diffusion models are 'incapable' of solving inherently serial problems is presented without any equations, sampling schedule analysis, or experimental protocol showing that bounded denoising trajectories cannot emulate required sequential depth even when steps or per-step capacity are scaled with problem size.

    Authors: The manuscript's core argument links the serial scaling hypothesis to the fact that diffusion sampling, while iterative, operates under a fixed parallel denoising structure whose effective depth is bounded independently of problem size. We agree that the abstract states the claim concisely and that explicit equations would improve clarity. In the revised manuscript we will add a subsection deriving the sampling trajectory equations and proving that scaling the number of denoising steps or per-step model capacity cannot produce arbitrary sequential depth for problems whose solution depth grows linearly with input size. We will also include a high-level experimental protocol for empirical verification on controlled serial tasks. revision: yes

  2. Referee: Formalization section (implied by abstract): the definition of 'inherently serial problems' must be shown to be independently grounded in complexity theory (e.g., P-complete tasks outside NC) rather than retrofitted to observed diffusion limitations; otherwise the distinction risks circularity and the 'incapability' claim reduces to an efficiency statement.

    Authors: The formalization begins from standard parallel complexity theory: we define inherently serial problems as those whose minimal circuit depth is super-polylogarithmic in the input size, i.e., problems in P but not in NC. Concrete examples include the circuit-value problem and certain formulations of linear programming, both known to be P-complete and therefore not parallelizable to polylog depth. This grounding precedes any discussion of machine-learning architectures and is drawn directly from existing results on NC versus P. The manuscript therefore does not retrofit the definition to diffusion models. To eliminate any residual concern about circularity we will revise the formalization section to include explicit citations to NC/P-completeness results together with the cited examples. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper formalizes inherently serial problems using established complexity theory (e.g., P-complete tasks not in NC), an external framework independent of the diffusion model analysis. The claim that diffusion models cannot solve such problems is presented as a first-time demonstration based on their iterative sampling process, without reducing to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or derivations in the provided abstract equate the result to its inputs by construction; the distinction between parallel and serial computation draws on prior complexity literature rather than being tailored tautologically to observed model limitations. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of a class of inherently serial problems and on the inability of diffusion models to handle them; these are introduced without independent empirical or formal verification in the abstract.

axioms (1)
  • domain assumption Some problems are fundamentally sequential and cannot be efficiently parallelized.
    This distinction is the load-bearing premise formalized in complexity theory.
invented entities (1)
  • inherently serial problems no independent evidence
    purpose: To categorize tasks whose computational steps are sequentially dependent and non-parallelizable.
    New category introduced to explain architectural limits.

pith-pipeline@v0.9.0 · 5635 in / 1236 out tokens · 72107 ms · 2026-05-19T04:02:47.928135+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...

  2. Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

    cs.AI 2026-03 unverdicted novelty 6.0

    Timer-S1 is a released 8.3B-parameter MoE time series model that achieves state-of-the-art MASE and CRPS scores on GIFT-Eval using serial scaling and Serial-Token Prediction.

Reference graph

Works this paper leans on

139 extracted references · 139 canonical work pages · cited by 2 Pith papers · 26 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

    Aggarwal, P. and Welleck, S. L1 : Controlling how long a reasoning model thinks with reinforcement learning. arXiv [cs.CL], 6 March 2025. URL http://arxiv.org/abs/2503.04697

  3. [3]

    Amdahl, G. M. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, pp.\ 483--485, 1967

  4. [4]

    Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

    Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., Sahoo, S. S., and Kuleshov, V. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573, 2025

  5. [5]

    D., Ho, J., Tarlow, D., and Van Den Berg, R

    Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34: 0 17981--17993, 2021

  6. [6]

    Navigation world models

    Bar, A., Zhou, G., Tran, D., Darrell, T., and LeCun, Y. Navigation world models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  7. [7]

    J., and Telgarsky, M

    Bartlett, P., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 26 June 2017

  8. [8]

    Learning long-term dependencies with gradient descent is difficult.IEEE Transactions on Neural Networks, 5(2):157–166, 1994

    Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw., 5 0 (2): 0 157--166, 1994. URL http://dx.doi.org/10.1109/72.279181

  9. [9]

    On the ability and limitations of transformers to recognize formal languages

    Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. arXiv preprint arXiv:2009.11264, 2020

  10. [10]

    Blelloch, G. E. Programming parallel algorithms. Communications of the ACM, 39 0 (3): 0 85--97, 1996

  11. [11]

    Average-case complexity

    Bogdanov, A., Trevisan, L., et al. Average-case complexity. Foundations and Trends in Theoretical Computer Science , 2 0 (1): 0 1--106, 2006

  12. [12]

    Video generation models as world simulators

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al. Video generation models as world simulators. OpenAI Blog, 1: 0 8, 2024

  13. [13]

    Theoretical limitations of multi-layer transformer

    Chen, L., Peng, B., and Wu, H. Theoretical limitations of multi-layer transformer. arXiv preprint arXiv:2412.02975, 2024 a

  14. [14]

    The computational limits of state-space models and mamba via the lens of circuit complexity

    Chen, Y., Li, X., Liang, Y., Shi, Z., and Song, Z. The computational limits of state-space models and mamba via the lens of circuit complexity. arXiv preprint arXiv:2412.06148, 2024 b

  15. [15]

    Transformers in uniform TC0

    Chiang, D. Transformers in uniform TC0 . arXiv preprint arXiv:2409.13629, 2024

  16. [16]

    On the Measure of Intelligence

    Chollet, F. On the measure of intelligence. arXiv [cs.AI], 4 November 2019. URL http://arxiv.org/abs/1911.01547

  17. [17]

    W., Sutton, C., Gehrmann, S., et al

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24 0 (240): 0 1--113, 2023

  18. [18]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv [cs.LG], 27 October 2021. URL http://arxiv.org/abs/2110.14168

  19. [19]

    Codd, E. F. Cellular automata. Academic press, 2014

  20. [20]

    On the expressive power of deep learning: A tensor analysis

    Cohen, N., Sharir, O., and Shashua, A. On the expressive power of deep learning: A tensor analysis. In Conference on learning theory, pp.\ 698--728. PMLR, 2016

  21. [21]

    Cook, S. A. A taxonomy of problems with fast parallel algorithms. Information and Control, 64 0 (1--3): 0 2--22, 1985

  22. [22]

    Approximation by superpositions of a sigmoidal function

    Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems, 2 0 (4): 0 303--314, December 1989. URL https://web.njit.edu/ usman/courses/cs675_fall18/10.1.1.441.7873.pdf

  23. [23]

    FlashAttention -2: Faster attention with better parallelism and work partitioning

    Dao, T. FlashAttention -2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

  24. [24]

    Diffusion Models Beat GANs on Image Synthesis

    Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, pp.\ 8780--8794, 11 May 2021. URL http://arxiv.org/abs/2105.05233

  25. [25]

    Paradigms for Fast Parallel Approximability

    D \' az, J., Serna, M., Spirakis, P., and Tor \'a n, J. Paradigms for Fast Parallel Approximability. Cambridge University Press, Cambridge, 1st edition, 1997

  26. [26]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

  27. [27]

    Diffusiondepth: Diffusion denoising approach for monocular depth estimation

    Duan, Y., Guo, X., and Zhu, Z. Diffusiondepth: Diffusion denoising approach for monocular depth estimation. In European Conference on Computer Vision, pp.\ 432--449. Springer, 2024

  28. [28]

    and Shamir, O

    Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on learning theory, pp.\ 907--940. PMLR, 2016

  29. [29]

    B., Goldberg, K., Lee, Y., Hafner, D., and Abbeel, P

    Escontrela, A., Adeniji, A., Yan, W., Jain, A., Peng, X. B., Goldberg, K., Lee, Y., Hafner, D., and Abbeel, P. Video prediction models as rewards for reinforcement learning. Neural Inf Process Syst, abs/2305.14343: 0 68760--68783, 23 May 2023. URL http://dx.doi.org/10.48550/arXiv.2305.14343

  30. [30]

    Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. J. Efficient and accurate estimation of lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, 11 June 2019

  31. [31]

    Towards revealing the mystery behind chain of thought: a theoretical perspective

    Feng, G., Zhang, B., Gu, Y., Ye, H., He, D., and Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36: 0 70757--70798, 2023

  32. [32]

    and Toffoli, T

    Fredkin, E. and Toffoli, T. Conservative logic. Int. J. Theor. Phys., 21 0 (3-4): 0 219--253, April 1982. URL http://dx.doi.org/10.1007/BF01857727

  33. [33]

    Gander, M. J. and Vandewalle, S. Analysis of the parareal time-parallel time-integration method. SIAM Journal on Scientific Computing, 29 0 (2): 0 556--578, 2007

  34. [34]

    J., and Ruzzo, W

    Greenlaw, R., Hoover, H. J., and Ruzzo, W. L. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, New York, 1995. ISBN 0-19-508591-4

  35. [35]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  36. [36]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Gu, A., Goel, K., and R \'e , C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

  37. [37]

    and Hashimoto, T

    Gulrajani, I. and Hashimoto, T. B. Likelihood-based diffusion language models. In Advances in Neural Information Processing Systems, 30 May 2023

  38. [38]

    Gustafson, J. L. Reevaluating amdahl's law. Communications of the ACM, 31 0 (5): 0 532--533, 1988

  39. [39]

    Theoretical limitations of self-attention in neural sequence models

    Hahn, M. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8: 0 156--171, 2020

  40. [40]

    Threshold circuits of bounded depth

    Hajnal, A., Maass, W., Pudl \'a k, P., Szegedy, M., and Tur \'a n, G. Threshold circuits of bounded depth. Journal of Computer and System Sciences, 46 0 (2): 0 129--154, 1993

  41. [41]

    Formal language recognition by hard attention transformers: Perspectives from circuit complexity

    Hao, Y., Angluin, D., and Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Transactions of the Association for Computational Linguistics, 10: 0 800--810, 2022

  42. [42]

    Denoising Diffusion Probabilistic Models

    Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 19 June 2020. URL http://arxiv.org/abs/2006.11239

  43. [43]

    Probability inequalities for sums of bounded random variables

    Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58 0 (301): 0 13--30, 1963

  44. [44]

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  45. [45]

    Multilayer feedforward networks are universal approximators

    Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Netw., 2 0 (5): 0 359--366, January 1989. URL https://cognitivemedium.com/magic_paper/assets/Hornik.pdf

  46. [46]

    Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-L., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey,...

  47. [47]

    Jukna, S. et al. Boolean function complexity: advances and frontiers, volume 27. Springer, 2012

  48. [48]

    B., Blelloch, G

    Kang, H., Gibbons, P. B., Blelloch, G. E., Dhulipala, L., Gu, Y., and McGuffey, C. The processing-in-memory model. In Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures, volume 1, pp.\ 295--306, New York, NY, USA, 6 July 2021. ACM. URL http://dx.doi.org/10.1145/3409964.3461816

  49. [49]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  50. [50]

    A comprehensive review of processing-in-memory architectures for deep neural networks

    Kaur, R., Asad, A., and Mohammadi, F. A comprehensive review of processing-in-memory architectures for deep neural networks. Computers, 13 0 (7): 0 174, 16 July 2024. URL http://dx.doi.org/10.3390/computers13070174

  51. [51]

    Big-bench extra hard.arXiv preprint arXiv:2502.19187, 2025

    Kazemi, M., Fatemi, B., Bansal, H., Palowitch, J., Anastasiou, C., Mehta, S. V., Jain, L. K., Aglietti, V., Jindal, D., Chen, P., Dikkala, N., Tyen, G., Liu, X., Shalit, U., Chiappa, S., Olszewska, K., Tay, Y., Tran, V. Q., Le, Q. V., and Firat, O. BIG -bench extra hard. arXiv [cs.CL], 26 February 2025. URL http://dx.doi.org/10.48550/arXiv.2502.19187

  52. [52]

    C., and Schindler, K

    Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R. C., and Schindler, K. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  53. [53]

    1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities, February 2026

    Kevin, W., Ishaan, J., Michał, B., Tomasz, T., and Benjamin, E. 1000 layer networks for self-supervised RL : Scaling depth can enable new goal-reaching capabilities. arXiv [cs.LG], 19 March 2025. URL http://arxiv.org/abs/2503.14858

  54. [54]

    Kirousis, L. M. and Spirakis, P. Probabilistic log-space reductions and problems probabilistically hard for P . In SWAT 88: 1st Scandinavian Workshop on Algorithm Theory, Halmstad, Sweden, July 5--8, 1988 Proceedings. Springer Berlin Heidelberg, 1988

  55. [55]

    Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012

  56. [56]

    Ladner, R. E. The circuit value problem is log space complete for p. ACM Sigact News, 7 0 (1): 0 18--20, 1975

  57. [57]

    and Yan, Y

    Li, G. and Yan, Y. O(d/T) convergence theory for diffusion probabilistic models under minimal assumptions. arXiv preprint arXiv:2409.18959, 2024

  58. [58]

    S., and Hashimoto, T

    Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-LM improves controllable text generation. Advances in neural information processing systems, 35: 0 4328--4343, 2022

  59. [59]

    Chain of thought empowers transformers to solve inherently serial problems

    Li, Z., Liu, H., Zhou, D., and Ma, T. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 2024

  60. [60]

    A., Manning, C

    Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C. A., Manning, C. D., Re, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul,...

  61. [61]

    SDXL-Lightning: Progressive Adversarial Diffusion Distillation

    Lin, S., Wang, A., and Yang, X. SDXL -lightning: Progressive adversarial diffusion distillation. arXiv [cs.CV], 21 February 2024. URL http://arxiv.org/abs/2402.13929

  62. [62]

    Résolution d' EDP par un schéma en temps «pararéel »

    Lions, J.-L., Maday, Y., and Turinici, G. Résolution d' EDP par un schéma en temps «pararéel ». C.R. Acad. Sci. (Ser. 1) (Math./Math.), 332 0 (7): 0 661--668, 1 April 2001. URL http://dx.doi.org/10.1016/S0764-4442(00)01793-6

  63. [63]

    arXiv preprint arXiv:2210.10749 , year=

    Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022

  64. [64]

    InstaFlow : One step is enough for high-quality diffusion-based text-to-image generation

    Liu, X., Zhang, X., Ma, J., Peng, J., and Liu, Q. InstaFlow : One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2024. URL http://arxiv.org/abs/2309.06380

  65. [65]

    Discrete diffusion modeling by estimating the ratios of the data distribution

    Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  66. [66]

    Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

    Ma, N., Tong, S., Jia, H., Hu, H., Su, Y.-C., Zhang, M., Yang, X., Li, Y., Jaakkola, T., Jia, X., and Xie, S. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv [cs.CV], 16 January 2025. URL http://arxiv.org/abs/2501.09732

  67. [67]

    and Mula, O

    Maday, Y. and Mula, O. An adaptive parareal algorithm. J. Comput. Appl. Math., 377 0 (112915): 0 112915, 15 October 2020. URL http://dx.doi.org/10.1016/j.cam.2020.112915

  68. [68]

    Potemkin understanding in large language models

    Mancoridis, M., Weeks, B., Vafa, K., and Mullainathan, S. Potemkin understanding in large language models. arXiv [cs.CL], 26 June 2025. URL http://arxiv.org/abs/2506.21521

  69. [69]

    Deep Learning: A Critical Appraisal

    Marcus, G. Deep learning: A critical appraisal. arXiv [cs.AI], 2 January 2018. URL http://arxiv.org/abs/1801.00631

  70. [70]

    and Sabharwal, A

    Merrill, W. and Sabharwal, A. The expresssive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023 a

  71. [71]

    and Sabharwal, A

    Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11: 0 531--545, 2023 b

  72. [72]

    A little depth goes a long way: The expressive power of log-depth transformers.CoRR, abs/2503.03961, 2025

    Merrill, W. and Sabharwal, A. A little depth goes a long way: The expressive power of log-depth transformers. arXiv preprint arXiv:2503.03961, 2025

  73. [73]

    Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10: 0 843--856, 2022

  74. [74]

    R Chris Miall and Daniel M Wolpert

    Merrill, W., Petty, J., and Sabharwal, A. The illusion of state in state-space models. arXiv preprint arXiv:2404.08819, 2024

  75. [75]

    Unpredictability and undecidability in dynamical systems

    Moore, C. Unpredictability and undecidability in dynamical systems. Physical Review Letters, 64 0 (20): 0 2354, 1990

  76. [76]

    Majority-vote cellular automata, ising dynamics, and p-completeness

    Moore, C. Majority-vote cellular automata, ising dynamics, and p-completeness. Journal of Statistical Physics, 88: 0 795--805, 1997

  77. [77]

    and Klabjan, D

    Mu, S. and Klabjan, D. On the second-order convergence of biased policy gradient algorithms. In Proceedings of the 41st International Conference on Machine Learning, 2024

  78. [78]

    s1: Simple test-time scaling

    Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Cand \`e s, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  79. [79]

    Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International conference on machine learning, pp.\ 8162--8171. PMLR, 2021

  80. [80]

    Large Language Diffusion Models

    Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025

Showing first 80 references.