
arxiv: 2605.17842 · v1 · pith:A6IN2XMY · submitted 2026-05-18 · 💻 cs.LG

SNLP: Layer-Parallel Inference via Structured Newton Corrections

Pith reviewed 2026-05-20 12:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords layer-parallel inference · Newton methods · Transformer models · residual connections · inference acceleration · parallel computing · language models

The pith

Treating the hidden-state trace across layers as the solution of a nonlinear residual equation enables parallel Newton-style inference in Transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies whether layer-wise dependencies in Transformers can be relaxed by treating the hidden-state trace across layers as the solution of a nonlinear residual equation and solving it with parallel Newton-style updates. Exact Newton corrections require expensive Jacobian-vector products, so SNLP introduces cheap surrogates such as Identity Newton, whose corrections reduce to prefix-sum-like updates in residual architectures. SNLP-aware regularization during training is meant to make one or a few iterations accurately approximate the sequential forward pass, and the paper reports both wall-clock speedups and perplexity improvements at inference. Readers should care because this targets the sequential-layer latency bottleneck that conventional tensor or pipeline parallelism does not remove.

Core claim

SNLP replaces expensive exact layer Jacobians with architecture-induced surrogate dynamics. This yields Identity Newton (IDN) for residual Transformers, where the correction reduces to a prefix-sum-like update, and HC Newton (HCN) for mHC-style architectures, which uses the model's residual mixing matrix. Combined with regularization that aligns the parallel solver with the sequential forward pass, a small number of iterations approximates the full computation accurately enough to deliver wall-clock speedups and perplexity reductions on nanochat-scale models.
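The step from "Newton correction" to "prefix sum" can be made explicit with a short derivation. This is a sketch under assumptions that the summary does not state: layers are indexed l = 1, ..., L, each layer has the plain residual form h_l = h_{l-1} + f_l(h_{l-1}), and Identity Newton is read as replacing each layer Jacobian I + ∂f_l by I. The symbols F, δ, and f_l are introduced here for illustration, and the paper's own notation may differ.

```latex
\begin{aligned}
F_l(H) &= h_l - h_{l-1} - f_l(h_{l-1}) = 0,
  \qquad l = 1,\dots,L, \quad H = (h_1,\dots,h_L),\ h_0 \text{ fixed} \\
\text{exact Newton:}\quad
  & \delta_l - \bigl(I + \partial f_l(h_{l-1})\bigr)\,\delta_{l-1} = -F_l(H),
  \qquad \delta_0 = 0 \\
\text{identity surrogate:}\quad
  & \delta_l - \delta_{l-1} = -F_l(H)
  \;\Longrightarrow\; \delta_l = -\sum_{k=1}^{l} F_k(H) \\
\text{updated trace:}\quad
  & h_l^{\text{new}} = h_l + \delta_l = h_0 + \sum_{k=1}^{l} f_k(h_{k-1})
\end{aligned}
```

Under this reading, every f_k is evaluated on the previous iterate, so the L branch evaluations are independent of one another, and what remains is a running sum over k.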

What carries the argument

The Structured Newton Layer Parallelism (SNLP) framework, which substitutes exact layer Jacobians with cheap surrogate dynamics derived from the model's residual connections to enable parallel solving of the layer trace.
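A toy version of this idea can be written in a few lines. The sketch below assumes scalar hidden states, residual layers of the form h + f(h), and an identity surrogate for each layer Jacobian, in which case one solver step evaluates every branch on the previous iterate and then takes a prefix sum. The names `make_branch`, `sequential_trace`, and `identity_newton_step` are hypothetical and are not taken from the paper.

```python
import math
from itertools import accumulate


def make_branch(weight, bias, scale=0.5):
    """Hypothetical toy residual branch f(h) = scale * tanh(weight * h + bias)."""
    def branch(h):
        return scale * math.tanh(weight * h + bias)
    return branch


def sequential_trace(h0, branches):
    """Reference: run the residual layers one after another."""
    trace = [h0]
    for f in branches:
        trace.append(trace[-1] + f(trace[-1]))
    return trace


def identity_newton_step(trace, branches):
    """One step with an identity surrogate Jacobian (assumed reading of IDN).

    Each branch sees only the previous iterate, so the evaluations are
    independent; the new trace is h0 plus a prefix sum of branch outputs.
    """
    outputs = [f(h) for f, h in zip(branches, trace[:-1])]
    return list(accumulate(outputs, initial=trace[0]))


def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


branches = [make_branch(0.3 + 0.1 * l, 0.05 * l) for l in range(6)]
h0 = 1.0
reference = sequential_trace(h0, branches)

trace = [h0] * (len(branches) + 1)  # initial guess: every layer input is h0
errors = []
for _ in range(len(branches)):
    trace = identity_newton_step(trace, branches)
    errors.append(max_error(trace, reference))

print(errors)
```

In this toy, iteration k makes the first k layers exact, so running as many iterations as there are layers reproduces the sequential trace. Any practical benefit therefore depends on far fewer iterations being enough.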

Load-bearing premise

Cheap architecture-induced surrogate dynamics such as Identity Newton or HC Newton can replace exact layer Jacobians while remaining stable and accurate for trained Transformers after SNLP-aware regularization.

What would settle it

Train a model with SNLP regularization, then compare the output of a single parallel Newton iteration against the sequential forward pass on held-out inputs. A significant output mismatch would falsify the claim that the surrogates suffice.
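The shape of such a check can be sketched on a toy model. The code below is illustrative only: it uses scalar residual layers and an identity-surrogate step (an assumed reading of Identity Newton), and the function names and the inputs are hypothetical. It reports the worst relative mismatch between one parallel step and the sequential output over a set of inputs.

```python
import math
from itertools import accumulate


def sequential_output(h0, branches):
    """Final hidden state from running the residual layers in order."""
    h = h0
    for f in branches:
        h = h + f(h)
    return h


def one_iteration_output(h0, branches):
    """Final hidden state after one identity-surrogate step from the guess h0.

    All branches are evaluated at h0 (independent work), then summed.
    """
    outputs = [f(h0) for f in branches]
    return list(accumulate(outputs, initial=h0))[-1]


def worst_relative_mismatch(inputs, branches):
    worst = 0.0
    for h0 in inputs:
        ref = sequential_output(h0, branches)
        approx = one_iteration_output(h0, branches)
        worst = max(worst, abs(approx - ref) / max(abs(ref), 1e-12))
    return worst


# Branches that ignore their input: one step is already exact.
constant_branches = [lambda h, c=c: c for c in (0.1, -0.2, 0.3)]
# Branches that depend on their input: one step leaves a residual error.
tanh_branches = [lambda h, w=w: 0.5 * math.tanh(w * h) for w in (0.4, 0.6, 0.8)]

inputs = [-1.5, -0.5, 0.5, 1.5]
exact_case = worst_relative_mismatch(inputs, constant_branches)
inexact_case = worst_relative_mismatch(inputs, tanh_branches)
print(exact_case, inexact_case)
```

On a real model the same comparison would be run on logits or perplexity rather than on a scalar state, with an acceptance threshold chosen in advance.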

Figures

Figures reproduced from arXiv: 2605.17842 by Akash Srivastava, Hao Wang, Kai Xu, Ligong Han.

Figure 1. Structured Newton Layer Parallelism (SNLP) replaces sequential layer execution with […]

Original abstract

Autoregressive language models execute Transformer layers sequentially, creating a latency bottleneck that is not removed by conventional tensor or pipeline parallelism. We study whether this layerwise dependency can be relaxed by treating the hidden-state trace across layers as the solution of a nonlinear residual equation and solving it with parallel Newton-style updates. While this view is principled, exact Newton corrections require expensive Jacobian-vector products and naive fixed-point iterations are unstable on trained Transformers. We introduce Structured Newton Layer Parallelism (SNLP), a training and inference framework that replaces exact layer Jacobians with cheap architecture-induced surrogate dynamics. In residual Transformers, this yields Identity Newton (IDN), where the correction reduces to a prefix-sum-like update; in mHC-style architectures, HC Newton (HCN) uses the model's residual mixing matrix. We further introduce SNLP-aware regularization, which trains models to make one or a few structured Newton iterations accurately approximate the sequential forward. Experiments on nanochat-scale Transformers show that SNLP regularization improves layer-parallel compatibility and can also improve standard sequential perplexity, reducing baseline PPL by 4.7%-23.4%. At inference time, SNLP combined with layer fusion and chunkwise decomposition achieves practical wall-clock speedups: on a 0.5B Nanochat model, it reaches 2.3x speedup while still improving PPL by 6.1%. These results suggest that layer-parallel inference is not merely a numerical approximation to sequential execution, but can act as a useful solver-induced inference bias. We also characterize limitations: off-the-shelf pretrained models are less amenable to this procedure, and exact convergence recovers the sequential computation rather than providing monotonic inference-time scaling.
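The prefix-sum structure mentioned in the abstract matters because prefix sums can be evaluated in logarithmic depth. As a generic illustration (not the paper's kernel), the sketch below simulates a Hillis-Steele inclusive scan in plain Python: within each round, every element reads only the previous round's array, so on parallel hardware a round is one step. The function name `hillis_steele_scan` is chosen here for illustration.

```python
def hillis_steele_scan(values):
    """Inclusive prefix sum; returns (scanned values, number of rounds)."""
    result = list(values)
    n = len(result)
    offset = 1
    rounds = 0
    while offset < n:
        # Each entry depends only on the previous round's array,
        # so all n additions in a round are independent.
        result = [
            result[i] + result[i - offset] if i >= offset else result[i]
            for i in range(n)
        ]
        offset *= 2
        rounds += 1
    return result, rounds


corrections = [1, 2, 3, 4, 5, 6, 7, 8]
scanned, rounds = hillis_steele_scan(corrections)
print(scanned, rounds)  # [1, 3, 6, 10, 15, 21, 28, 36] 3
```

Eight per-layer terms are combined in three rounds instead of seven sequential additions. This variant performs more total additions than a sequential loop, which is the usual trade of extra work for lower depth.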

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Structured Newton Layer Parallelism (SNLP) to address the sequential layer dependency in autoregressive Transformers by recasting the hidden-state trace as the solution to a nonlinear residual equation and solving it with parallel Newton-style updates. Exact Jacobians are replaced by cheap architecture-induced surrogates (Identity Newton prefix-sum updates in residual Transformers and HC Newton mixing-matrix updates in mHC architectures). SNLP-aware regularization is introduced to train models such that one or a few surrogate iterations closely approximate the sequential forward pass. Combined with layer fusion and chunkwise decomposition, the method is shown to yield wall-clock speedups and perplexity gains on nanochat-scale models, with a reported 2.3x speedup and 6.1% PPL improvement on a 0.5B model; limitations for off-the-shelf pretrained models are also characterized.

Significance. If the empirical results and the stability of the surrogate dynamics hold under broader verification, the work could meaningfully advance practical layer-parallel inference for large language models by converting an architectural bottleneck into a solver-induced bias that can even improve perplexity. The use of architecture-specific cheap surrogates avoids the cost of exact Newton methods, and the demonstration that regularization can simultaneously improve sequential PPL and enable parallelism is a notable strength. The explicit characterization of limitations for pretrained models adds credibility and helps bound the scope of the claims.

major comments (2)
  1. [Abstract and Experiments] The central claim of a 2.3x wall-clock speedup while improving PPL by 6.1% on the 0.5B Nanochat model rests on SNLP-aware regularization rendering one or a few cheap surrogate Newton steps (IDN or HCN) accurate enough proxies for the exact sequential pass. No explicit error bounds, per-iteration convergence curves, or analysis of accumulation across chunks/layers are referenced, leaving open the possibility that more iterations are required in practice and thereby eroding the reported net speedup.
  2. [Method (surrogate dynamics)] The assumption that architecture-induced surrogates (Identity Newton or HC Newton) can stably replace exact layer Jacobians after regularization is load-bearing for the inference scaling result. Given the abstract's own statement that naive fixed-point iteration is unstable on trained Transformers, additional verification is needed to confirm that the number of iterations assumed by the fusion/chunking schedule suffices for the reported accuracy across model scales and data conditions.
minor comments (2)
  1. [Abstract] The abstract reports PPL reductions of 4.7%-23.4% under SNLP regularization but does not specify the exact model sizes, data splits, or evaluation conditions for these figures, which would improve reproducibility.
  2. [Experiments] Consider including standard deviations or error bars on the speedup and PPL metrics, along with ablations over random seeds, to strengthen the empirical presentation.
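The first major comment's concern about iteration count can be made concrete with simple arithmetic. The sketch below is a deliberately crude model and not a result from the paper: it assumes the latency of the parallel path grows linearly with the number of solver iterations while the sequential baseline stays fixed. The summary does not say how many iterations the 2.3x figure used, so `assumed_iterations` is a free input, and the function name is hypothetical.

```python
def rescaled_speedup(reported_speedup, assumed_iterations, extra_iterations):
    """Speedup if `extra_iterations` more solver steps turn out to be needed.

    Hypothetical linear cost model: parallel latency is proportional to the
    iteration count, and the sequential baseline is unchanged.
    """
    total_iterations = assumed_iterations + extra_iterations
    return reported_speedup * assumed_iterations / total_iterations


# What one extra iteration would do to a 2.3x figure, for several guesses
# of the iteration count behind that figure.
for assumed in (1, 2, 3):
    print(assumed, round(rescaled_speedup(2.3, assumed, 1), 2))
```

Under this model, if the reported figure was measured with a single iteration, needing a second one would leave roughly 1.15x. This is why per-iteration convergence evidence bears directly on the headline number.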

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where the feedback identifies opportunities to strengthen the presentation of our empirical results and methodological assumptions.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The central claim of a 2.3x wall-clock speedup while improving PPL by 6.1% on the 0.5B Nanochat model rests on SNLP-aware regularization rendering one or a few cheap surrogate Newton steps (IDN or HCN) accurate enough proxies for the exact sequential pass. No explicit error bounds, per-iteration convergence curves, or analysis of accumulation across chunks/layers are referenced, leaving open the possibility that more iterations are required in practice and thereby eroding the reported net speedup.

    Authors: We agree that the absence of explicit per-iteration convergence curves and accumulation analysis leaves the speedup claim open to the interpretation raised. The manuscript prioritizes end-to-end wall-clock measurements on the target hardware, but we acknowledge that supplementary convergence diagnostics would better substantiate that the budgeted iterations suffice. In the revised manuscript we have added per-iteration residual-error plots and chunk-wise accumulation measurements for the 0.5B model (and smaller ablations) in Section 4 and the appendix. These curves show rapid error reduction within one to three surrogate steps under SNLP regularization, with negligible accumulation across the chunk decomposition used in the reported experiments. While we do not provide theoretical error bounds—owing to the data-dependent and architecture-specific nature of the surrogates—the added empirical diagnostics directly address the concern for the scales and schedules evaluated. revision: yes

  2. Referee: [Method (surrogate dynamics)] The assumption that architecture-induced surrogates (Identity Newton or HC Newton) can stably replace exact layer Jacobians after regularization is load-bearing for the inference scaling result. Given the abstract's own statement that naive fixed-point iteration is unstable on trained Transformers, additional verification is needed to confirm that the number of iterations assumed by the fusion/chunking schedule suffices for the reported accuracy across model scales and data conditions.

    Authors: The referee correctly notes that stability of the surrogate dynamics is essential. The manuscript already contrasts the observed instability of naive fixed-point iteration with the behavior of the architecture-specific surrogates (IDN and HCN) once SNLP regularization is applied. To supply the requested additional verification, the revised version includes expanded ablation tables and convergence plots across model sizes (100M–0.5B) and multiple data regimes in the method and experiments sections. These results indicate that the regularization renders the surrogates sufficiently accurate within the iteration counts assumed by the fusion and chunking schedule. We continue to characterize the limitation that off-the-shelf pretrained models without SNLP-aware training exhibit poorer compatibility, as stated in the original text. Verification at substantially larger scales remains computationally intensive and is noted as future work, but the trends observed are consistent with the reported operating regime. revision: partial

Circularity Check

1 step flagged

SNLP-aware regularization trains models to match the parallel solver to sequential execution, partially tying the inference claims to the training objective.

specific steps
  1. fitted input called prediction [Abstract (SNLP-aware regularization paragraph)]
    "We further introduce SNLP-aware regularization, which trains models to make one or a few structured Newton iterations accurately approximate the sequential forward."

    The regularization objective is defined in terms of making the structured Newton solver (IDN/HCN) approximate the sequential layer execution. The inference-time speedup and accuracy claims are then evaluated on models trained under this exact objective, so the reported compatibility and wall-clock gains are statistically encouraged by construction rather than independently verified.

full rationale

The paper introduces SNLP-aware regularization explicitly to make one or a few structured Newton iterations approximate the sequential forward pass. This is a deliberate training choice rather than an independent derivation or first-principles result. The reported 2.3x speedup and PPL improvement are measured on models trained under this objective, so the approximation quality is not an emergent property but a direct consequence of the regularization. However, the paper also reports PPL gains on standard sequential evaluation and characterizes limitations for off-the-shelf models, indicating the central claim retains some independent empirical content beyond pure self-definition. No self-citations or uniqueness theorems are load-bearing in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that residual Transformer dynamics admit cheap surrogate Jacobians that remain stable under few Newton iterations after targeted regularization; no new physical entities are postulated.

free parameters (1)
  • SNLP regularization strength
    Hyperparameter controlling how strongly the model is trained to make one or few Newton iterations match sequential behavior; value not specified in abstract.
axioms (1)
  • domain assumption: Trained Transformers admit stable fixed-point or Newton iterations when using architecture-induced surrogates instead of exact Jacobians.
    Invoked when replacing exact Jacobian-vector products with Identity Newton or HC Newton corrections.
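How a regularization strength could enter a training objective is sketched below. The summary does not give the paper's loss, so this is an assumed form: a mean squared gap between the trace after a fixed number of identity-surrogate steps and the sequential trace, added to a task loss with weight `strength`. The toy uses scalar residual layers, and all names are hypothetical.

```python
import math
from itertools import accumulate


def make_branch(weight, scale):
    """Hypothetical toy residual branch f(h) = scale * tanh(weight * h)."""
    return lambda h: scale * math.tanh(weight * h)


def sequential_trace(h0, branches):
    trace = [h0]
    for f in branches:
        trace.append(trace[-1] + f(trace[-1]))
    return trace


def structured_newton_trace(h0, branches, num_iterations):
    """Trace after `num_iterations` identity-surrogate steps from a flat guess."""
    trace = [h0] * (len(branches) + 1)
    for _ in range(num_iterations):
        outputs = [f(h) for f, h in zip(branches, trace[:-1])]
        trace = list(accumulate(outputs, initial=h0))
    return trace


def alignment_penalty(h0, branches, num_iterations=1):
    """Mean squared gap between the few-step trace and the sequential trace."""
    reference = sequential_trace(h0, branches)
    approx = structured_newton_trace(h0, branches, num_iterations)
    return sum((a - r) ** 2 for a, r in zip(approx, reference)) / len(reference)


def regularized_loss(task_loss, penalty, strength):
    """Assumed objective: task loss plus strength-weighted alignment penalty."""
    return task_loss + strength * penalty


weights = (0.4, 0.6, 0.8, 1.0)
mild = [make_branch(w, 0.05) for w in weights]    # branches that barely move h
strong = [make_branch(w, 0.5) for w in weights]   # branches that move h a lot

penalty_mild = alignment_penalty(1.0, mild)
penalty_strong = alignment_penalty(1.0, strong)
print(penalty_mild, penalty_strong)
print(regularized_loss(2.0, penalty_strong, strength=0.1))
```

In this toy, the penalty is much smaller for the mild stack, and it vanishes for either stack once the iteration count reaches the number of layers. A larger `strength` trades task loss for solver agreement, which is the knob the ledger lists as unspecified.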

pith-pipeline@v0.9.0 · 5834 in / 1391 out tokens · 25490 ms · 2026-05-20T12:31:59.868358+00:00 · methodology


