The Serial Scaling Hypothesis
Pith reviewed 2026-05-19 04:02 UTC · model grok-4.3
The pith
Diffusion models cannot solve inherently serial problems even though they generate outputs step by step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show for the first time that diffusion models, despite their sequential nature, are incapable of solving inherently serial problems that require sequentially dependent computational steps that cannot be efficiently parallelized.
What carries the argument
The formal distinction between parallelizable and inherently serial problems, drawn from complexity theory, used to explain why current architectures including diffusion models reach fundamental limits.
If this is right
- Model architectures must incorporate explicit mechanisms for sequential dependence rather than relying solely on parallel layers.
- Hardware development will need to prioritize support for non-parallelizable computation paths.
- Scaling efforts on reasoning and simulation tasks will require serial-aware designs instead of pure width increases.
- Diffusion models will need fundamental changes or replacement for any application involving true sequential dependence.
Where Pith is reading between the lines
- The hypothesis suggests that future scaling laws should track serial depth separately from parallel width.
- Hybrid architectures that combine parallel efficiency with serial chaining could address the identified gap.
- Empirical tests on specific complexity classes could provide direct evidence for or against the serial limitation.
Load-bearing premise
There exist problems that are fundamentally sequential and cannot be efficiently parallelized, as formalized in complexity theory.
What would settle it
A diffusion model that solves a known inherently serial problem, such as a canonical P-complete task, at scale with efficiency matching or exceeding parallel methods would disprove the central claim.
Figures
read the original abstract
While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These "inherently serial" problems-from mathematical reasoning to physical simulations to sequential decision-making-require sequentially dependent computational steps that cannot be efficiently parallelized. We formalize this distinction in complexity theory, and demonstrate that current parallel-centric architectures face fundamental limitations on such tasks. Then, we show for first time that diffusion models despite their sequential nature are incapable of solving inherently serial problems. We argue that recognizing the serial nature of computation holds profound implications on machine learning, model design, and hardware development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a class of 'inherently serial' problems (e.g., mathematical reasoning, physical simulations, sequential decision-making) that require sequentially dependent steps not efficiently parallelizable. It claims to formalize this distinction via complexity theory, shows limitations of parallel-centric architectures on such tasks, and asserts for the first time that diffusion models are incapable of solving them despite their iterative sequential sampling process, with broad implications for model design and hardware.
Significance. If the formalization and impossibility result hold, the work would identify a fundamental blind spot in current parallel-focused ML paradigms and diffusion models, potentially motivating serial-aware architectures. The explicit grounding in complexity theory (e.g., sequential depth requirements) is a positive feature that could make the hypothesis falsifiable.
major comments (2)
- Abstract and central claim: the assertion that diffusion models are 'incapable' of solving inherently serial problems is presented without any equations, sampling schedule analysis, or experimental protocol showing that bounded denoising trajectories cannot emulate required sequential depth even when steps or per-step capacity are scaled with problem size.
- Formalization section (implied by abstract): the definition of 'inherently serial problems' must be shown to be independently grounded in complexity theory (e.g., P-complete tasks outside NC) rather than retrofitted to observed diffusion limitations; otherwise the distinction risks circularity and the 'incapability' claim reduces to an efficiency statement.
minor comments (2)
- The abstract states 'we formalize this distinction in complexity theory' but supplies no theorem statements, definitions, or references to specific complexity classes; adding these would strengthen readability.
- Clarify whether the diffusion schedule (number of steps) is treated as fixed or allowed to grow with serial depth; this directly affects whether the result is an impossibility or a scaling claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report on our manuscript 'The Serial Scaling Hypothesis'. We address each major comment below with clarifications drawn from the manuscript's formalization and provide indications of revisions to strengthen the presentation of our claims.
read point-by-point responses
-
Referee: Abstract and central claim: the assertion that diffusion models are 'incapable' of solving inherently serial problems is presented without any equations, sampling schedule analysis, or experimental protocol showing that bounded denoising trajectories cannot emulate required sequential depth even when steps or per-step capacity are scaled with problem size.
Authors: The manuscript's core argument links the serial scaling hypothesis to the fact that diffusion sampling, while iterative, operates under a fixed parallel denoising structure whose effective depth is bounded independently of problem size. We agree that the abstract states the claim concisely and that explicit equations would improve clarity. In the revised manuscript we will add a subsection deriving the sampling trajectory equations and proving that scaling the number of denoising steps or per-step model capacity cannot produce arbitrary sequential depth for problems whose solution depth grows linearly with input size. We will also include a high-level experimental protocol for empirical verification on controlled serial tasks. revision: yes
-
Referee: Formalization section (implied by abstract): the definition of 'inherently serial problems' must be shown to be independently grounded in complexity theory (e.g., P-complete tasks outside NC) rather than retrofitted to observed diffusion limitations; otherwise the distinction risks circularity and the 'incapability' claim reduces to an efficiency statement.
Authors: The formalization begins from standard parallel complexity theory: we define inherently serial problems as those whose minimal circuit depth is super-polylogarithmic in the input size, i.e., problems in P but not in NC. Concrete examples include the circuit-value problem and certain formulations of linear programming, both known to be P-complete and therefore not parallelizable to polylog depth. This grounding precedes any discussion of machine-learning architectures and is drawn directly from existing results on NC versus P. The manuscript therefore does not retrofit the definition to diffusion models. To eliminate any residual concern about circularity we will revise the formalization section to include explicit citations to NC/P-completeness results together with the cited examples. revision: partial
Circularity Check
No significant circularity detected
full rationale
The paper formalizes inherently serial problems using established complexity theory (e.g., P-complete tasks not in NC), an external framework independent of the diffusion model analysis. The claim that diffusion models cannot solve such problems is presented as a first-time demonstration based on their iterative sampling process, without reducing to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or derivations in the provided abstract equate the result to its inputs by construction; the distinction between parallel and serial computation draws on prior complexity literature rather than being tailored tautologically to observed model limitations. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Some problems are fundamentally sequential and cannot be efficiently parallelized.
invented entities (1)
-
inherently serial problems
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat recovery and embed_injective unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 4.1: diffusion model with TC0 backbone remains in TC0 even with infinitely many sampling steps
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking (D=3 forcing) unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Definition 2.2 and Assumption 2.3: parallel iff in TC; TC ⊊ P
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
-
Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling
Timer-S1 is a released 8.3B-parameter MoE time series model that achieves state-of-the-art MASE and CRPS scores on GIFT-Eval using serial scaling and Serial-Token Prediction.
Reference graph
Works this paper leans on
-
[1]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
Aggarwal, P. and Welleck, S. L1 : Controlling how long a reasoning model thinks with reinforcement learning. arXiv [cs.CL], 6 March 2025. URL http://arxiv.org/abs/2503.04697
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Amdahl, G. M. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, pp.\ 483--485, 1967
work page 1967
-
[4]
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., Sahoo, S. S., and Kuleshov, V. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
D., Ho, J., Tarlow, D., and Van Den Berg, R
Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34: 0 17981--17993, 2021
work page 2021
-
[6]
Bar, A., Zhou, G., Tran, D., Darrell, T., and LeCun, Y. Navigation world models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[7]
Bartlett, P., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 26 June 2017
work page 2017
-
[8]
Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw., 5 0 (2): 0 157--166, 1994. URL http://dx.doi.org/10.1109/72.279181
-
[9]
On the ability and limitations of transformers to recognize formal languages
Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. arXiv preprint arXiv:2009.11264, 2020
-
[10]
Blelloch, G. E. Programming parallel algorithms. Communications of the ACM, 39 0 (3): 0 85--97, 1996
work page 1996
-
[11]
Bogdanov, A., Trevisan, L., et al. Average-case complexity. Foundations and Trends in Theoretical Computer Science , 2 0 (1): 0 1--106, 2006
work page 2006
-
[12]
Video generation models as world simulators
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al. Video generation models as world simulators. OpenAI Blog, 1: 0 8, 2024
work page 2024
-
[13]
Theoretical limitations of multi-layer transformer
Chen, L., Peng, B., and Wu, H. Theoretical limitations of multi-layer transformer. arXiv preprint arXiv:2412.02975, 2024 a
-
[14]
The computational limits of state-space models and mamba via the lens of circuit complexity
Chen, Y., Li, X., Liang, Y., Shi, Z., and Song, Z. The computational limits of state-space models and mamba via the lens of circuit complexity. arXiv preprint arXiv:2412.06148, 2024 b
-
[15]
Chiang, D. Transformers in uniform TC0 . arXiv preprint arXiv:2409.13629, 2024
-
[16]
On the Measure of Intelligence
Chollet, F. On the measure of intelligence. arXiv [cs.AI], 4 November 2019. URL http://arxiv.org/abs/1911.01547
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[17]
W., Sutton, C., Gehrmann, S., et al
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24 0 (240): 0 1--113, 2023
work page 2023
-
[18]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv [cs.LG], 27 October 2021. URL http://arxiv.org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[19]
Codd, E. F. Cellular automata. Academic press, 2014
work page 2014
-
[20]
On the expressive power of deep learning: A tensor analysis
Cohen, N., Sharir, O., and Shashua, A. On the expressive power of deep learning: A tensor analysis. In Conference on learning theory, pp.\ 698--728. PMLR, 2016
work page 2016
-
[21]
Cook, S. A. A taxonomy of problems with fast parallel algorithms. Information and Control, 64 0 (1--3): 0 2--22, 1985
work page 1985
-
[22]
Approximation by superpositions of a sigmoidal function
Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems, 2 0 (4): 0 303--314, December 1989. URL https://web.njit.edu/ usman/courses/cs675_fall18/10.1.1.441.7873.pdf
work page 1989
-
[23]
FlashAttention -2: Faster attention with better parallelism and work partitioning
Dao, T. FlashAttention -2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[24]
Diffusion Models Beat GANs on Image Synthesis
Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, pp.\ 8780--8794, 11 May 2021. URL http://arxiv.org/abs/2105.05233
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[25]
Paradigms for Fast Parallel Approximability
D \' az, J., Serna, M., Spirakis, P., and Tor \'a n, J. Paradigms for Fast Parallel Approximability. Cambridge University Press, Cambridge, 1st edition, 1997
work page 1997
-
[26]
An image is worth 16x16 words: Transformers for image recognition at scale
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[27]
Diffusiondepth: Diffusion denoising approach for monocular depth estimation
Duan, Y., Guo, X., and Zhu, Z. Diffusiondepth: Diffusion denoising approach for monocular depth estimation. In European Conference on Computer Vision, pp.\ 432--449. Springer, 2024
work page 2024
-
[28]
Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on learning theory, pp.\ 907--940. PMLR, 2016
work page 2016
-
[29]
B., Goldberg, K., Lee, Y., Hafner, D., and Abbeel, P
Escontrela, A., Adeniji, A., Yan, W., Jain, A., Peng, X. B., Goldberg, K., Lee, Y., Hafner, D., and Abbeel, P. Video prediction models as rewards for reinforcement learning. Neural Inf Process Syst, abs/2305.14343: 0 68760--68783, 23 May 2023. URL http://dx.doi.org/10.48550/arXiv.2305.14343
-
[30]
Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. J. Efficient and accurate estimation of lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, 11 June 2019
work page 2019
-
[31]
Towards revealing the mystery behind chain of thought: a theoretical perspective
Feng, G., Zhang, B., Gu, Y., Ye, H., He, D., and Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36: 0 70757--70798, 2023
work page 2023
-
[32]
Fredkin, E. and Toffoli, T. Conservative logic. Int. J. Theor. Phys., 21 0 (3-4): 0 219--253, April 1982. URL http://dx.doi.org/10.1007/BF01857727
-
[33]
Gander, M. J. and Vandewalle, S. Analysis of the parareal time-parallel time-integration method. SIAM Journal on Scientific Computing, 29 0 (2): 0 556--578, 2007
work page 2007
-
[34]
Greenlaw, R., Hoover, H. J., and Ruzzo, W. L. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, New York, 1995. ISBN 0-19-508591-4
work page 1995
-
[35]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Efficiently Modeling Long Sequences with Structured State Spaces
Gu, A., Goel, K., and R \'e , C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[37]
Gulrajani, I. and Hashimoto, T. B. Likelihood-based diffusion language models. In Advances in Neural Information Processing Systems, 30 May 2023
work page 2023
-
[38]
Gustafson, J. L. Reevaluating amdahl's law. Communications of the ACM, 31 0 (5): 0 532--533, 1988
work page 1988
-
[39]
Theoretical limitations of self-attention in neural sequence models
Hahn, M. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8: 0 156--171, 2020
work page 2020
-
[40]
Threshold circuits of bounded depth
Hajnal, A., Maass, W., Pudl \'a k, P., Szegedy, M., and Tur \'a n, G. Threshold circuits of bounded depth. Journal of Computer and System Sciences, 46 0 (2): 0 129--154, 1993
work page 1993
-
[41]
Formal language recognition by hard attention transformers: Perspectives from circuit complexity
Hao, Y., Angluin, D., and Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Transactions of the Association for Computational Linguistics, 10: 0 800--810, 2022
work page 2022
-
[42]
Denoising Diffusion Probabilistic Models
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 19 June 2020. URL http://arxiv.org/abs/2006.11239
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[43]
Probability inequalities for sums of bounded random variables
Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58 0 (301): 0 13--30, 1963
work page 1963
-
[44]
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[45]
Multilayer feedforward networks are universal approximators
Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Netw., 2 0 (5): 0 359--366, January 1989. URL https://cognitivemedium.com/magic_paper/assets/Hornik.pdf
work page 1989
-
[46]
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-L., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey,...
work page 2017
-
[47]
Jukna, S. et al. Boolean function complexity: advances and frontiers, volume 27. Springer, 2012
work page 2012
-
[48]
Kang, H., Gibbons, P. B., Blelloch, G. E., Dhulipala, L., Gu, Y., and McGuffey, C. The processing-in-memory model. In Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures, volume 1, pp.\ 295--306, New York, NY, USA, 6 July 2021. ACM. URL http://dx.doi.org/10.1145/3409964.3461816
-
[49]
Scaling Laws for Neural Language Models
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[50]
A comprehensive review of processing-in-memory architectures for deep neural networks
Kaur, R., Asad, A., and Mohammadi, F. A comprehensive review of processing-in-memory architectures for deep neural networks. Computers, 13 0 (7): 0 174, 16 July 2024. URL http://dx.doi.org/10.3390/computers13070174
-
[51]
Big-bench extra hard.arXiv preprint arXiv:2502.19187, 2025
Kazemi, M., Fatemi, B., Bansal, H., Palowitch, J., Anastasiou, C., Mehta, S. V., Jain, L. K., Aglietti, V., Jindal, D., Chen, P., Dikkala, N., Tyen, G., Liu, X., Shalit, U., Chiappa, S., Olszewska, K., Tay, Y., Tran, V. Q., Le, Q. V., and Firat, O. BIG -bench extra hard. arXiv [cs.CL], 26 February 2025. URL http://dx.doi.org/10.48550/arXiv.2502.19187
-
[52]
Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R. C., and Schindler, K. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[53]
Kevin, W., Ishaan, J., Michał, B., Tomasz, T., and Benjamin, E. 1000 layer networks for self-supervised RL : Scaling depth can enable new goal-reaching capabilities. arXiv [cs.LG], 19 March 2025. URL http://arxiv.org/abs/2503.14858
-
[54]
Kirousis, L. M. and Spirakis, P. Probabilistic log-space reductions and problems probabilistically hard for P . In SWAT 88: 1st Scandinavian Workshop on Algorithm Theory, Halmstad, Sweden, July 5--8, 1988 Proceedings. Springer Berlin Heidelberg, 1988
work page 1988
-
[55]
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012
work page 2012
-
[56]
Ladner, R. E. The circuit value problem is log space complete for p. ACM Sigact News, 7 0 (1): 0 18--20, 1975
work page 1975
-
[57]
Li, G. and Yan, Y. O(d/T) convergence theory for diffusion probabilistic models under minimal assumptions. arXiv preprint arXiv:2409.18959, 2024
-
[58]
Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-LM improves controllable text generation. Advances in neural information processing systems, 35: 0 4328--4343, 2022
work page 2022
-
[59]
Chain of thought empowers transformers to solve inherently serial problems
Li, Z., Liu, H., Zhou, D., and Ma, T. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 2024
-
[60]
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C. A., Manning, C. D., Re, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul,...
work page 2023
-
[61]
SDXL-Lightning: Progressive Adversarial Diffusion Distillation
Lin, S., Wang, A., and Yang, X. SDXL -lightning: Progressive adversarial diffusion distillation. arXiv [cs.CV], 21 February 2024. URL http://arxiv.org/abs/2402.13929
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
Résolution d' EDP par un schéma en temps «pararéel »
Lions, J.-L., Maday, Y., and Turinici, G. Résolution d' EDP par un schéma en temps «pararéel ». C.R. Acad. Sci. (Ser. 1) (Math./Math.), 332 0 (7): 0 661--668, 1 April 2001. URL http://dx.doi.org/10.1016/S0764-4442(00)01793-6
-
[63]
arXiv preprint arXiv:2210.10749 , year=
Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022
-
[64]
InstaFlow : One step is enough for high-quality diffusion-based text-to-image generation
Liu, X., Zhang, X., Ma, J., Peng, J., and Liu, Q. InstaFlow : One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2024. URL http://arxiv.org/abs/2309.06380
-
[65]
Discrete diffusion modeling by estimating the ratios of the data distribution
Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024
work page 2024
-
[66]
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
Ma, N., Tong, S., Jia, H., Hu, H., Su, Y.-C., Zhang, M., Yang, X., Li, Y., Jaakkola, T., Jia, X., and Xie, S. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv [cs.CV], 16 January 2025. URL http://arxiv.org/abs/2501.09732
work page internal anchor Pith review arXiv 2025
-
[67]
Maday, Y. and Mula, O. An adaptive parareal algorithm. J. Comput. Appl. Math., 377 0 (112915): 0 112915, 15 October 2020. URL http://dx.doi.org/10.1016/j.cam.2020.112915
-
[68]
Potemkin understanding in large language models
Mancoridis, M., Weeks, B., Vafa, K., and Mullainathan, S. Potemkin understanding in large language models. arXiv [cs.CL], 26 June 2025. URL http://arxiv.org/abs/2506.21521
-
[69]
Deep Learning: A Critical Appraisal
Marcus, G. Deep learning: A critical appraisal. arXiv [cs.AI], 2 January 2018. URL http://arxiv.org/abs/1801.00631
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[70]
Merrill, W. and Sabharwal, A. The expresssive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023 a
-
[71]
Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11: 0 531--545, 2023 b
work page 2023
-
[72]
Merrill, W. and Sabharwal, A. A little depth goes a long way: The expressive power of log-depth transformers. arXiv preprint arXiv:2503.03961, 2025
-
[73]
Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10: 0 843--856, 2022
work page 2022
-
[74]
R Chris Miall and Daniel M Wolpert
Merrill, W., Petty, J., and Sabharwal, A. The illusion of state in state-space models. arXiv preprint arXiv:2404.08819, 2024
-
[75]
Unpredictability and undecidability in dynamical systems
Moore, C. Unpredictability and undecidability in dynamical systems. Physical Review Letters, 64 0 (20): 0 2354, 1990
work page 1990
-
[76]
Majority-vote cellular automata, ising dynamics, and p-completeness
Moore, C. Majority-vote cellular automata, ising dynamics, and p-completeness. Journal of Statistical Physics, 88: 0 795--805, 1997
work page 1997
-
[77]
Mu, S. and Klabjan, D. On the second-order convergence of biased policy gradient algorithms. In Proceedings of the 41st International Conference on Machine Learning, 2024
work page 2024
-
[78]
Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Cand \`e s, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[79]
Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International conference on machine learning, pp.\ 8162--8171. PMLR, 2021
work page 2021
-
[80]
Large Language Diffusion Models
Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.