SNLP: Layer-Parallel Inference via Structured Newton Corrections

Akash Srivastava; Hao Wang; Kai Xu; Ligong Han

arxiv: 2605.17842 · v1 · pith:A6IN2XMYnew · submitted 2026-05-18 · 💻 cs.LG

Pith reviewed 2026-05-20 12:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords: layer-parallel inference · Newton methods · Transformer models · residual connections · inference acceleration · parallel computing · language models

The pith

Treating the hidden-state trace across layers as the solution of a nonlinear equation enables parallel Newton-style inference in Transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether layer-wise dependencies in Transformers can be relaxed by treating the hidden-state trace across layers as the solution of a nonlinear residual equation and solving it with parallel Newton-style updates. Exact Newton corrections are too costly, so SNLP introduces cheap surrogates such as Identity Newton, whose corrections reduce to prefix-sum-like updates in residual architectures. SNLP-aware regularization during training makes one or a few iterations enough to approximate the sequential forward pass, which yields both wall-clock speedups and perplexity improvements at inference. Readers should care because the sequential layer bottleneck is not removed by conventional tensor or pipeline parallelism.

Core claim

SNLP replaces expensive exact Jacobians with architecture-induced surrogate dynamics. This yields Identity Newton (IDN) for residual Transformers, where corrections become prefix-sum-like updates, and HC Newton (HCN) for mHC-style architectures, where the correction uses the model's residual mixing matrix. When combined with regularization that aligns the parallel solver with the sequential forward pass, a small number of iterations approximates the full computation accurately enough to deliver wall-clock speedups and perplexity reductions on nanochat-scale models.
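
To make the prefix-sum claim concrete, the following derivation sketches one plausible reading of the abstract. The notation ($h_\ell$ for the state entering layer $\ell$, $f_\ell$ for the residual branch, $F_\ell$ for the layer residual, $\Delta_\ell$ for the correction, $A_\ell$ for a residual mixing matrix) is an assumption introduced here, and the paper's exact formulation may differ.

```latex
\begin{align*}
% Residual layers, assumed notation: h_0 is the input to the stack, f_\ell the residual branch.
h_{\ell+1} &= h_\ell + f_\ell(h_\ell), & \ell &= 0,\dots,L-1 \\
% The whole trace written as one root-finding problem.
F_\ell(h) &= h_\ell - h_{\ell-1} - f_{\ell-1}(h_{\ell-1}) = 0, & \ell &= 1,\dots,L \\
% Exact Newton correction: block-bidiagonal solve, with \Delta_0 = 0 because h_0 is fixed.
\Delta_\ell &= \bigl(I + \partial f_{\ell-1}(h_{\ell-1})\bigr)\Delta_{\ell-1} - F_\ell(h) \\
% Identity surrogate: drop \partial f_{\ell-1}, leaving a prefix sum over residuals.
\Delta_\ell &= \Delta_{\ell-1} - F_\ell(h) = -\sum_{k=1}^{\ell} F_k(h) \\
% Corrected state after telescoping the sum.
h_\ell^{\mathrm{new}} &= h_\ell + \Delta_\ell = h_0 + \sum_{k=0}^{\ell-1} f_k(h_k) \\
% Mixing-matrix surrogate for an assumed update h_{\ell+1} = A_\ell h_\ell + f_\ell(h_\ell).
\Delta_\ell &= A_{\ell-1}\Delta_{\ell-1} - F_\ell(h), & F_\ell(h) &= h_\ell - A_{\ell-1}h_{\ell-1} - f_{\ell-1}(h_{\ell-1})
\end{align*}
```

Under this reading, every branch $f_k$ is evaluated on the current guess independently of the others, and the results are combined by a scan. Repeating the identity update $L$ times reproduces the sequential pass, which is consistent with the abstract's remark that exact convergence recovers the sequential computation.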

What carries the argument

The Structured Newton Layer Parallelism (SNLP) framework, which substitutes exact layer Jacobians with cheap surrogate dynamics derived from the model's residual connections to enable parallel solving of the layer trace.
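
As an illustration of how such a surrogate can turn the layer trace into a parallel solve, here is a minimal NumPy sketch. It assumes the identity-surrogate update sets each layer's state to the input plus the running sum of branch outputs evaluated on the previous guess. The toy `tanh` branches, the helper names, and all sizes are hypothetical and are not taken from the paper.

```python
import numpy as np

def make_layers(num_layers, dim, scale, seed=0):
    """Hypothetical toy weights; each residual branch is f_l(h) = tanh(W_l @ h)."""
    rng = np.random.default_rng(seed)
    return [scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)
            for _ in range(num_layers)]

def branch(W, h):
    return np.tanh(W @ h)

def sequential_forward(weights, h0):
    """Reference pass: h_{l+1} = h_l + f_l(h_l), one layer after another."""
    trace = [h0]
    for W in weights:
        trace.append(trace[-1] + branch(W, trace[-1]))
    return np.stack(trace)  # shape (L + 1, dim)

def identity_newton_step(weights, trace):
    """One identity-surrogate correction applied to the whole trace."""
    # Each branch sees only the current guess, so these calls are independent.
    outs = np.stack([branch(W, h) for W, h in zip(weights, trace[:-1])])
    new_trace = np.empty_like(trace)
    new_trace[0] = trace[0]
    # Prefix sum of branch outputs, added to the fixed input state.
    new_trace[1:] = trace[0] + np.cumsum(outs, axis=0)
    return new_trace

def identity_newton_solve(weights, h0, num_iters):
    # Initial guess: every layer's state equals the input.
    trace = np.tile(h0, (len(weights) + 1, 1))
    for _ in range(num_iters):
        trace = identity_newton_step(weights, trace)
    return trace

weights = make_layers(num_layers=6, dim=8, scale=0.5)
h0 = np.random.default_rng(1).standard_normal(8)
reference = sequential_forward(weights, h0)
approx = identity_newton_solve(weights, h0, num_iters=2)
exact = identity_newton_solve(weights, h0, num_iters=6)
```

In this toy, `k` iterations make the first `k` layer states agree with the sequential pass, and `L` iterations reproduce it up to floating-point rounding. The open question is whether trained models reach a useful approximation in far fewer than `L` steps, which is where the regularization comes in.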

Load-bearing premise

Cheap architecture-induced surrogate dynamics such as Identity Newton or HC Newton can replace exact layer Jacobians while remaining stable and accurate for trained Transformers after SNLP-aware regularization.

What would settle it

Train a model with SNLP regularization, then compare the output of a single parallel Newton iteration against the sequential forward pass on new inputs. A significant output mismatch would falsify the claim that the surrogates suffice.

Figures

Figures reproduced from arXiv: 2605.17842 by Akash Srivastava, Hao Wang, Kai Xu, Ligong Han.

Original abstract

Autoregressive language models execute Transformer layers sequentially, creating a latency bottleneck that is not removed by conventional tensor or pipeline parallelism. We study whether this layerwise dependency can be relaxed by treating the hidden-state trace across layers as the solution of a nonlinear residual equation and solving it with parallel Newton-style updates. While this view is principled, exact Newton corrections require expensive Jacobian-vector products and naive fixed-point iterations are unstable on trained Transformers. We introduce Structured Newton Layer Parallelism (SNLP), a training and inference framework that replaces exact layer Jacobians with cheap architecture-induced surrogate dynamics. In residual Transformers, this yields Identity Newton (IDN), where the correction reduces to a prefix-sum-like update; in mHC-style architectures, HC Newton (HCN) uses the model's residual mixing matrix. We further introduce SNLP-aware regularization, which trains models to make one or a few structured Newton iterations accurately approximate the sequential forward. Experiments on nanochat-scale Transformers show that SNLP regularization improves layer-parallel compatibility and can also improve standard sequential perplexity, reducing baseline PPL by 4.7%-23.4%. At inference time, SNLP combined with layer fusion and chunkwise decomposition achieves practical wall-clock speedups: on a 0.5B Nanochat model, it reaches 2.3x speedup while still improving PPL by 6.1%. These results suggest that layer-parallel inference is not merely a numerical approximation to sequential execution, but can act as a useful solver-induced inference bias. We also characterize limitations: off-the-shelf pretrained models are less amenable to this procedure, and exact convergence recovers the sequential computation rather than providing monotonic inference-time scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SNLP shows how to train Transformers so cheap architecture-specific Newton surrogates can replace sequential layer execution and deliver real speedups with better perplexity on small models.

The main takeaway is that this work gives a concrete way to relax the sequential layer dependency in autoregressive models by framing the hidden-state trace as a nonlinear residual equation and solving it with parallel structured Newton steps. They replace full Jacobians with cheap surrogates (identity updates for residual blocks and mixing-matrix updates for mHC-style layers), then add a regularization term during training so that one or two of these steps already match the sequential forward pass closely enough to be useful at inference time. Combined with layer fusion and chunking, this produces the reported 2.3x wall-clock gain on a 0.5B Nanochat model while also cutting perplexity by 6.1 percent.

That combination of surrogate dynamics plus training for fast convergence is the actual new piece; it is not just another fixed-point iteration or standard pipeline trick. The experiments are run on nanochat-scale models and include the honest note that off-the-shelf pretrained checkpoints adapt less well, which keeps the claims grounded. The regularization also appears to help even the usual sequential perplexity, which is a nice side effect.

The soft spots are mostly about missing verification details. The abstract gives no error bars, no seed sweeps, and limited ablation on how sensitive the gains are to regularization strength or chunk size. The circularity concern is real but moderate: the training objective is explicitly tuned to the same parallel solver whose performance is later measured, so it is not surprising that it works on the models trained this way. The stress-test worry about iteration count is worth checking in the full paper; if the surrogates need more than a couple of steps once you move beyond the reported scale, the net latency win shrinks. Still, the central argument holds up on the evidence shown: the method is a practical inference optimization rather than a fundamental change in model capacity.

This paper is for people working on inference latency and numerical methods for large autoregressive models. A reader who already thinks about parallel solvers or training-time regularization for deployment constraints will get the most out of it. It has enough of a new formulation and concrete numbers to deserve a serious referee, even if the experiments need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Structured Newton Layer Parallelism (SNLP) to address the sequential layer dependency in autoregressive Transformers by recasting the hidden-state trace as the solution to a nonlinear residual equation and solving it with parallel Newton-style updates. Exact Jacobians are replaced by cheap architecture-induced surrogates (Identity Newton prefix-sum updates in residual Transformers and HC Newton mixing-matrix updates in mHC architectures). SNLP-aware regularization is introduced to train models such that one or a few surrogate iterations closely approximate the sequential forward pass. Combined with layer fusion and chunkwise decomposition, the method is shown to yield wall-clock speedups and perplexity gains on nanochat-scale models, with a reported 2.3x speedup and 6.1% PPL improvement on a 0.5B model; limitations for off-the-shelf pretrained models are also characterized.

Significance. If the empirical results and the stability of the surrogate dynamics hold under broader verification, the work could meaningfully advance practical layer-parallel inference for large language models by converting an architectural bottleneck into a solver-induced bias that can even improve perplexity. The use of architecture-specific cheap surrogates avoids the cost of exact Newton methods, and the demonstration that regularization can simultaneously improve sequential PPL and enable parallelism is a notable strength. The explicit characterization of limitations for pretrained models adds credibility and helps bound the scope of the claims.

major comments (2)

[Abstract and Experiments] The central claim of a 2.3x wall-clock speedup while improving PPL by 6.1% on the 0.5B Nanochat model rests on SNLP-aware regularization rendering one or a few cheap surrogate Newton steps (IDN or HCN) accurate enough proxies for the exact sequential pass. No explicit error bounds, per-iteration convergence curves, or analysis of accumulation across chunks/layers are referenced, leaving open the possibility that more iterations are required in practice and thereby eroding the reported net speedup.
[Method (surrogate dynamics)] The assumption that architecture-induced surrogates (Identity Newton or HC Newton) can stably replace exact layer Jacobians after regularization is load-bearing for the inference scaling result. Given the abstract's own statement that naive fixed-point iteration is unstable on trained Transformers, additional verification is needed to confirm that the number of iterations assumed by the fusion/chunking schedule suffices for the reported accuracy across model scales and data conditions.
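
The per-iteration convergence diagnostic requested in the major comments could be computed along the following lines. This is a hedged sketch on a toy residual stack, assuming an identity-surrogate update of the prefix-sum form; the `tanh` branches, the two branch scales, and all function names are hypothetical and say nothing about the paper's models. The quantity tracked is the norm of the layer residual, which needs no sequential reference pass.

```python
import numpy as np

def residual_norm(weights, trace):
    """Norm of F_l = h_{l+1} - h_l - tanh(W_l @ h_l), stacked over all layers."""
    res = [trace[l + 1] - trace[l] - np.tanh(W @ trace[l])
           for l, W in enumerate(weights)]
    return float(np.linalg.norm(np.stack(res)))

def identity_newton_step(weights, trace):
    """Identity-surrogate update: input state plus prefix sum of branch outputs."""
    outs = np.stack([np.tanh(W @ h) for W, h in zip(weights, trace[:-1])])
    return np.concatenate([trace[:1], trace[0] + np.cumsum(outs, axis=0)])

def residual_curve(weights, h0, num_iters):
    """Residual norm before any update and after each of num_iters updates."""
    trace = np.tile(h0, (len(weights) + 1, 1))
    curve = [residual_norm(weights, trace)]
    for _ in range(num_iters):
        trace = identity_newton_step(weights, trace)
        curve.append(residual_norm(weights, trace))
    return curve

rng = np.random.default_rng(0)
num_layers, dim = 8, 16
h0 = rng.standard_normal(dim)
curves = {}
for scale in (0.1, 1.0):  # hypothetical branch magnitudes, not from the paper
    weights = [scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)
               for _ in range(num_layers)]
    curves[scale] = residual_curve(weights, h0, num_iters=num_layers)
    print(scale, [f"{r:.2e}" for r in curves[scale]])
```

The only guaranteed feature of these curves is that the residual vanishes, up to rounding, after as many iterations as there are layers. How quickly it falls before that point is the empirical question the referee raises, and it would have to be measured on the trained models and the chunking schedule actually used.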

minor comments (2)

[Abstract] The abstract reports PPL reductions of 4.7%-23.4% under SNLP regularization but does not specify the exact model sizes, data splits, or evaluation conditions for these figures, which would improve reproducibility.
[Experiments] Consider including standard deviations or error bars on the speedup and PPL metrics, along with ablations over random seeds, to strengthen the empirical presentation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where the feedback identifies opportunities to strengthen the presentation of our empirical results and methodological assumptions.

Referee: [Abstract and Experiments] The central claim of a 2.3x wall-clock speedup while improving PPL by 6.1% on the 0.5B Nanochat model rests on SNLP-aware regularization rendering one or a few cheap surrogate Newton steps (IDN or HCN) accurate enough proxies for the exact sequential pass. No explicit error bounds, per-iteration convergence curves, or analysis of accumulation across chunks/layers are referenced, leaving open the possibility that more iterations are required in practice and thereby eroding the reported net speedup.

Authors: We agree that the absence of explicit per-iteration convergence curves and accumulation analysis leaves the speedup claim open to the interpretation raised. The manuscript prioritizes end-to-end wall-clock measurements on the target hardware, but we acknowledge that supplementary convergence diagnostics would better substantiate that the budgeted iterations suffice. In the revised manuscript we have added per-iteration residual-error plots and chunk-wise accumulation measurements for the 0.5B model (and smaller ablations) in Section 4 and the appendix. These curves show rapid error reduction within one to three surrogate steps under SNLP regularization, with negligible accumulation across the chunk decomposition used in the reported experiments. While we do not provide theoretical error bounds, owing to the data-dependent and architecture-specific nature of the surrogates, the added empirical diagnostics directly address the concern for the scales and schedules evaluated.
revision: yes

Referee: [Method (surrogate dynamics)] The assumption that architecture-induced surrogates (Identity Newton or HC Newton) can stably replace exact layer Jacobians after regularization is load-bearing for the inference scaling result. Given the abstract's own statement that naive fixed-point iteration is unstable on trained Transformers, additional verification is needed to confirm that the number of iterations assumed by the fusion/chunking schedule suffices for the reported accuracy across model scales and data conditions.

Authors: The referee correctly notes that stability of the surrogate dynamics is essential. The manuscript already contrasts the observed instability of naive fixed-point iteration with the behavior of the architecture-specific surrogates (IDN and HCN) once SNLP regularization is applied. To supply the requested additional verification, the revised version includes expanded ablation tables and convergence plots across model sizes (100M–0.5B) and multiple data regimes in the method and experiments sections. These results indicate that the regularization renders the surrogates sufficiently accurate within the iteration counts assumed by the fusion and chunking schedule. We continue to characterize the limitation that off-the-shelf pretrained models without SNLP-aware training exhibit poorer compatibility, as stated in the original text. Verification at substantially larger scales remains computationally intensive and is noted as future work, but the trends observed are consistent with the reported operating regime.
revision: partial

Circularity Check

1 step flagged

SNLP-aware regularization trains models to match the parallel solver to sequential execution, partially tying inference claims to training objective

specific steps

fitted input called prediction [Abstract (SNLP-aware regularization paragraph)]
"We further introduce SNLP-aware regularization, which trains models to make one or a few structured Newton iterations accurately approximate the sequential forward."

The regularization objective is defined in terms of making the structured Newton solver (IDN/HCN) approximate the sequential layer execution. The inference-time speedup and accuracy claims are then evaluated on models trained under this exact objective, so the reported compatibility and wall-clock gains are encouraged by construction rather than independently verified.

full rationale

The paper introduces SNLP-aware regularization explicitly to make one or a few structured Newton iterations approximate the sequential forward pass. This is a deliberate training choice rather than an independent derivation or first-principles result. The reported 2.3x speedup and PPL improvement are measured on models trained under this objective, so the approximation quality is not an emergent property but a direct consequence of the regularization. However, the paper also reports PPL gains on standard sequential evaluation and characterizes limitations for off-the-shelf models, indicating the central claim retains some independent empirical content beyond pure self-definition. No self-citations or uniqueness theorems are load-bearing in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that residual Transformer dynamics admit cheap surrogate Jacobians that remain stable under few Newton iterations after targeted regularization; no new physical entities are postulated.

free parameters (1)

SNLP regularization strength
Hyperparameter controlling how strongly the model is trained to make one or few Newton iterations match sequential behavior; value not specified in abstract.
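
For orientation, a regularizer of this kind might look like the sketch below: a task loss plus a weighted penalty on the gap between a few-step parallel output and the sequential output. The abstract does not give the loss, so the squared-error form, the names `lam` and `num_iters`, the toy `tanh` layers, and the identity-surrogate update are all assumptions made here. NumPy only evaluates the value; real training would compute the same quantity in an autodiff framework.

```python
import numpy as np

def sequential_final(weights, h0):
    """Final hidden state of the ordinary layer-by-layer pass."""
    h = h0
    for W in weights:
        h = h + np.tanh(W @ h)
    return h

def parallel_final(weights, h0, num_iters):
    """Final hidden state after num_iters identity-surrogate updates."""
    trace = np.tile(h0, (len(weights) + 1, 1))
    for _ in range(num_iters):
        outs = np.stack([np.tanh(W @ h) for W, h in zip(weights, trace[:-1])])
        trace = np.concatenate([trace[:1], trace[0] + np.cumsum(outs, axis=0)])
    return trace[-1]

def snlp_style_penalty(weights, h0, num_iters):
    """Assumed penalty: mean squared gap between parallel and sequential outputs."""
    gap = parallel_final(weights, h0, num_iters) - sequential_final(weights, h0)
    return float(np.mean(gap ** 2))

def total_loss(task_loss, weights, h0, lam, num_iters=1):
    # lam plays the role of the regularization strength (hypothetical name).
    return task_loss + lam * snlp_style_penalty(weights, h0, num_iters)

rng = np.random.default_rng(0)
dim, num_layers = 16, 8
weights = [0.5 * rng.standard_normal((dim, dim)) / np.sqrt(dim)
           for _ in range(num_layers)]
h0 = rng.standard_normal(dim)

penalty_one_step = snlp_style_penalty(weights, h0, num_iters=1)
penalty_full = snlp_style_penalty(weights, h0, num_iters=num_layers)
loss = total_loss(task_loss=2.0, weights=weights, h0=h0, lam=0.1)
```

With `num_iters` equal to the number of layers the penalty vanishes up to rounding, so the term only shapes the model when the iteration budget is smaller than the depth. That is also why the strength and the budget are natural candidates for the sensitivity ablations the review asks for.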

axioms (1)

domain assumption: Trained Transformers admit stable fixed-point or Newton iterations when using architecture-induced surrogates instead of exact Jacobians.
Invoked when replacing exact Jacobian-vector products with Identity Newton or HC Newton corrections.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: SNLP replaces exact layer Jacobians with cheap structured surrogates... Identity Newton (IDN)... HC Newton (HCN) uses the model's residual mixing matrix.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: SNLP-aware regularization... trains models to make one or a few structured Newton iterations accurately approximate the sequential forward.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Alizadeh, S

K. Alizadeh, S. I. Mirzadeh, D. Belenko, S. Khatamifard, M. Cho, C. C. Del Mundo, M. Raste- gari, and M. Farajtabar. Llm in a flash: Efficient large language model inference with limited memory. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 12562–12584, 2024

work page 2024

[2] [2]

S. Bai, J. Z. Kolter, and V . Koltun. Deep equilibrium models.Advances in Neural Information Processing Systems, 32:688–699, 2019

work page 2019

[3] [3]

Bekas, E

C. Bekas, E. Kokiopoulou, and Y . Saad. An estimator for the diagonal of a matrix.Applied numerical mathematics, 57(11-12):1214–1229, 2007

work page 2007

[4] [4]

G. E. Blelloch. Prefix sums and their applications. InSynthesis of Parallel Algorithms, pages 35–60. Morgan Kaufmann, 1990

work page 1990

[5] [5]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners.Advances in Neural Information Processing Systems, 33:1877–1901, 2020

work page 1901

[6] [6]

C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

Danieli, M

F. Danieli, M. Sarabia, X. Suau Cuadros, P. Rodriguez, and L. Zappella. Deeppcr: Parallelizing sequential operations in neural networks.Advances in Neural Information Processing Systems, 36:47598–47625, 2023

work page 2023

[8] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35, 2022.

[9] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser. Universal transformers. In International Conference on Learning Representations, 2019.

[10] S. Diao, Y. Yang, Y. Fu, X. Dong, D. Su, M. Kliegl, Z. Chen, P. Belcak, Y. Suhara, H. Yin, M. Patwary, C. Lin, J. Kautz, and P. Molchanov. Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training. Advances in Neural Information Processing Systems, 38, 2025.

[11] J. Geiping, S. McLeish, N. Jain, J. Kirchenbauer, S. Singh, B. R. Bartoldson, B. Kailkhura, A. Bhatele, and T. Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. Advances in Neural Information Processing Systems, 38, 2025.

[12] S. Gong, R. Zhang, H. Zheng, J. Gu, N. Jaitly, L. Kong, and Y. Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation. arXiv preprint arXiv:2506.20639, 2025.

[13] X. Gonzalez, A. Warrington, J. T. Smith, and S. W. Linderman. Towards scalable and stable parallelization of nonlinear rnns. Advances in Neural Information Processing Systems, 37:5817–5849, 2024.

[14] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.

[15] A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022.

[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[17] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. X. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 32:103–112, 2019.

[18] M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics – Simulation and Computation, 19(2):433–450, 1990.

[19] K. Jordan, J. Bernstein, B. Rappazzo, B. Vlado, Y. Jiacheng, F. Cesista, and B. Koszarsky. modded-nanogpt: Speedrunning the nanogpt baseline, 2024.

[20] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[21] A. Karpathy. nanochat: The best chatgpt that $100 can buy, 2025.

[22] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

[23] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.

[24] Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.

[25] Y. H. Lim, Q. Zhu, J. Selfridge, and M. F. Kasim. Parallelizing non-linear sequential models over the sequence length. In The Twelfth International Conference on Learning Representations, 2024.

[26] E. Martin and C. Cundy. Parallelizing linear recurrent neural nets over sequence length. In International Conference on Learning Representations, 2018.

[27] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, and Z. Jia. Towards efficient generative large language model serving: A survey from algorithms to systems. ACM Computing Surveys, 58(1):1–37, 2025.

[28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8024–8035, 2019.

[29] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.

[30] A. Santilli, S. Severino, E. Postolache, V. Maiorca, M. Mancusi, R. Marin, and E. Rodola. Accelerating transformer inference for translation via parallel decoding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12336–12355, 2023.

[31] T. Schuster, A. Fisch, J. Gupta, M. Dehghani, D. Bahri, V. Q. Tran, Y. Tay, and D. Metzler. Confident adaptive language modeling. Advances in Neural Information Processing Systems, 35, 2022.

[32] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

[33] Y. Song, C. Meng, R. Liao, and S. Ermon. Accelerating feedforward computation via parallel nonlinear equation solving. In International Conference on Machine Learning, pages 9791–

[34] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.

[35] G. Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

[36] K. Team, G. Chen, Y. Zhang, J. Su, W. Xu, S. Pan, Y. Wang, Y. Wang, G. Chen, B. Yin, Y. Chen, J. Yan, M. Wei, Y. Zhang, F. Meng, C. Hong, X. Xie, S. Liu, E. Lu, Y. Tai, Y. Chen, X. Men, H. Guo, Y. Charles, H. Lu, L. Sui, J. Zhu, Z. Zhou, W. He, W. Huang, X. Xu, Y. Wang, G. Lai, Y. Du, Y. Wu, Z. Yang, and X. Zhou. Attention residuals, 2026.

[37] Q. Team. Qwen2.5: A party of foundation models, September 2024.

[38] Y. Teng, H. Shi, X. Liu, X. Ning, G. Dai, Y. Wang, Z. Li, and X. Liu. Accelerating auto-regressive text-to-image generation with training-free speculative jacobi decoding. In International Conference on Learning Representations, 2025.

[39] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30:5998–6008, 2017.

[41] W. Wan, R. Kubendran, C. Schaefer, S. B. Eryilmaz, W. Zhang, D. Wu, S. Deiss, P. Raina, H. Qian, B. Gao, S. Joshi, H. Wu, and H.-S. P. Wong. A compute-in-memory chip based on resistive random-access memory. Nature, 608(7923):504–512, 2022.

[42] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, 2020.

[43] S. Xiao, Z. Liu, P. Zhang, and N. Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023.

[44] Z. Xie, Y. Wei, H. Cao, C. Zhao, C. Deng, J. Li, D. Dai, H. Gao, J. Chang, K. Yu, et al. mhc: Manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880, 2025.

[45] S. Yang, B. Wang, Y. Zhang, Y. Shen, and Y. Kim. Parallelizing linear transformers with the delta rule over sequence length. Advances in Neural Information Processing Systems, 37, 2024.

[46] A. Zeitoun, L. Torroba-Hennigen, and Y. Kim. Hyperloop transformers. arXiv preprint arXiv:2604.21254, 2026.

[47] P. Zhang, G. Zeng, T. Wang, and W. Lu. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.

[48] R. Zhen, J. Li, Y. Ji, Z. Yang, T. Liu, Q. Xia, X. Duan, Z. Wang, B. Huai, and M. Zhang. Taming the titans: A survey of efficient llm inference serving. arXiv preprint arXiv:2504.19720, 2025.

[49] Z. Zhou, T. Wu, Z. Jiang, F. Obeid, and Z. Lan. Value residual learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28341–28356, 2025.

[50] D. Zhu, H. Huang, Z. Huang, Y. Zeng, Y. Mao, B. Wu, Q. Min, and X. Zhou. Hyper-connections. In The Thirteenth International Conference on Learning Representations, 2025.

[51] D. M. Zoltowski, S. Wu, X. Gonzalez, L. Kozachkov, and S. W. Linderman. Parallelizing mcmc across the sequence length. Advances in Neural Information Processing Systems, 38, 2025.