pith. sign in

arxiv: 2509.21716 · v2 · submitted 2025-09-26 · 💻 cs.LG

A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems

Pith reviewed 2026-05-18 13:18 UTC · model grok-4.3

classification 💻 cs.LG
keywords linear dynamical systemsparallel computingfixed-point iterationssequential modelsconvergence analysismachine learning
0
0 comments X

The pith

Fixed-point iterations for parallelizing sequential models arise as linearizations within a linear dynamical systems framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that approaches like Newton, Picard, and Jacobi iterations for running sequential processes in parallel are all instances of approximate linearizations of a nonlinear recursion, modeled through linear dynamical systems. This common framework makes it possible to analyze their convergence rates theoretically and to see why some methods suit particular models better than others. A sympathetic reader would care because it supplies a shared language and theory for making sequential machine learning models run faster through parallelism. The predictions are checked against several case studies on different models.

Core claim

Different iteration schemes for parallel evaluation arise naturally as approximate linearizations of a nonlinear recursion in the linear dynamical systems framework, where the rates of convergence can be analyzed theoretically and are confirmed by case studies.

What carries the argument

Linear dynamical systems (LDSs) that represent approximate linearizations of the underlying nonlinear recursion, generating the various fixed-point iteration schemes as special cases.

If this is right

  • Convergence rates of each parallel method can be predicted from the properties of its linearization.
  • Particular iteration schemes are most effective under conditions identified by the linear dynamical system analysis.
  • The framework supplies a clearer theoretical basis for choosing or designing parallel algorithms for sequential models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linearization view might be applied to other iterative techniques in simulation or optimization beyond the ones studied.
  • Adding stochastic elements to the linear dynamical system could extend the framework to noisy or probabilistic sequential models.
  • The convergence predictions could let users select an iteration scheme for a new model by computing only a few matrix properties rather than running full trials.

Load-bearing premise

The sequential process must be well-described by a nonlinear recursion whose behavior is captured accurately by the chosen linear approximations without large errors in convergence or stability.

What would settle it

A case where one of the fixed-point methods converges or diverges at a rate clearly inconsistent with the linear dynamical system approximation derived from the nonlinear recursion.

Figures

Figures reproduced from arXiv: 2509.21716 by Christopher R\'e, David M. Zoltowski, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, Leo Kozachkov, Scott W. Linderman, Xavier Gonzalez.

Figure 1
Figure 1. Figure 1: A single Newton iteration solves the S5 group word problem, whereas the number of iterations required for the quasi-Newton and Picard methods increases with problem size. We consider the task of evaluating the product of S5 group elements. A: The group word problem can be expressed as an LDS with input-dependent state-transition matrices. B: An example input-dependent transition matrix At for permutation (… view at source ↗
Figure 2
Figure 2. Figure 2: First-order iterations can excel for complicated dynamics. We evaluate GRUs with random parameter initialization for different sequence lengths T and hidden state sizes D. A: The nonlinear dynamics of a GRU, following Feng et al. (2024), where xt is the hidden state, ut is the input, and the notation Linear[·, ·] indicates a linear readout from the concatenation of two vectors. B: A representative Jacobian… view at source ↗
Figure 3
Figure 3. Figure 3: Zeroth-order fixed-point iterations work well when the Jacobian is close to the identity. We evaluate Langevin dynamics for a potential ϕ. A: The nonlinear dynamics of Langevin dynamics for a potential ϕ and step size ϵ, where xt is the state and wt is Gaussian noise. B: The Jacobian for Langevin dynamics is well-approximated by the identity matrix, especially for small step size ϵ = 1 × 10−5 . C: We evalu… view at source ↗
Figure 4
Figure 4. Figure 4: Parallel Scan for Matrix Multiplication. We illustrate a divide-and-conquer approach to compute the product A4A3A2A1. Note that this divide-and-conquer approach naturally leads to O(log T) depth. 4Note that we have the matrices act via left-multiplication over the sequence length, because this is the most common way to write matrix-vector products. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: A linear Gaussian state space model (LGSSM): The LGSSM consists of latent variables xt and observed variables yt. The generative model of the LGSSM consists of dynamics xt+1 ∼ N (Axt, Q) and emissions yt+1 ∼ N (Hxt+1, R). the Kalman filter and RTS smoother can also be parallelized over the sequence length via the construction of custom binary associative operators and a parallel scan. While we leave the de… view at source ↗
Figure 6
Figure 6. Figure 6: In a similar setting as Figure 3, we instead consider evaluating Langevin dynamics on a potential [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The term “parallel” in parallel-chord methods. Here we illustrate 3 iterations of Newton’s method for root-finding on the one-dimensional cubic function f(x) = (x−0.4)3 + 0.45(x−0.4). We observe that each iteration of Newton’s method involves forming a “parallel chord” to the function (shown in color). When we plug this form of the Jacobian into T in eq. (12) and simplify, we obtain the linear dynamical sy… view at source ↗
read the original abstract

Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using iterative fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. Moreover, we theoretically analyze the rates of convergence of these methods, and we verify the predictions of this theory with several case studies. This unifying framework highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, the framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that iteration schemes for parallelizing sequential models (Newton, Picard, Jacobi) can be unified as approximate linearizations of a nonlinear recursion within a linear dynamical systems (LDS) framework. It derives theoretical convergence rates from the spectral properties of the resulting linear systems and verifies these predictions through several case studies, aiming to clarify when each method is effective and to provide a clearer foundation for parallel computation in models like RNNs and transformers.

Significance. If the approximation errors from linearization are shown to be controlled, the framework would offer a useful unifying lens on existing parallelization techniques, highlighting shared principles across methods and suggesting new directions for scalable sequential model evaluation. The explicit verification against case studies is a strength, as it provides empirical grounding for the theoretical rates and helps assess practical applicability.

major comments (2)
  1. [Theoretical convergence analysis (around the derivation of rates from LDS linearization)] The central claim that convergence rates follow directly from the spectral radius of the linearized LDS requires explicit control on the neglected higher-order terms in the linearization of x_{t+1}=f(x_t). Without remainder bounds, Lipschitz assumptions on f, or analysis of how nonlinearity (e.g., in attention or gated layers) affects the effective spectral radius, the predicted rates may not hold in general.
  2. [Case studies section] The case-study verification should include quantitative checks that the observed convergence matches the LDS-predicted rates even for strongly nonlinear f, rather than only for near-linear regimes; otherwise the framework's applicability to typical sequential models remains incompletely supported.
minor comments (2)
  1. [Abstract] The abstract states that rates are 'theoretically analyze[d]' and 'verified'; adding a sentence on the key assumptions (e.g., contraction or bounded nonlinearity) would help readers gauge scope immediately.
  2. [Introduction / Framework section] Notation for the nonlinear recursion and the precise construction of each LDS approximation (Newton vs. Picard linearization) should be introduced with a single running example early in the main text for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. These have highlighted important opportunities to strengthen the theoretical and empirical support for the proposed framework. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Theoretical convergence analysis (around the derivation of rates from LDS linearization)] The central claim that convergence rates follow directly from the spectral radius of the linearized LDS requires explicit control on the neglected higher-order terms in the linearization of x_{t+1}=f(x_t). Without remainder bounds, Lipschitz assumptions on f, or analysis of how nonlinearity (e.g., in attention or gated layers) affects the effective spectral radius, the predicted rates may not hold in general.

    Authors: We agree that bounding the linearization error is necessary to make the convergence-rate claims fully rigorous for general nonlinear f. The manuscript currently derives rates from the spectral radius of the exact linearization and treats the iteration schemes as approximate linearizations; the analysis is therefore local near the fixed point. In the revision we will add a dedicated subsection that (i) states a local Lipschitz assumption on f, (ii) invokes the standard first-order Taylor expansion with Lagrange remainder, and (iii) shows how the remainder term scales with the distance to the fixed point. We will also include a brief discussion of how the effective spectral radius is modulated by common nonlinearities (ReLU, softmax, gating) under these assumptions. These additions will clarify the precise regime in which the predicted rates hold. revision: yes

  2. Referee: [Case studies section] The case-study verification should include quantitative checks that the observed convergence matches the LDS-predicted rates even for strongly nonlinear f, rather than only for near-linear regimes; otherwise the framework's applicability to typical sequential models remains incompletely supported.

    Authors: We appreciate the suggestion to strengthen the empirical validation. While the existing case studies already employ nonlinear components (ReLU activations, gated recurrent units, and scaled dot-product attention), the quantitative comparison between observed and LDS-predicted rates can be made more explicit. In the revision we will augment the case-study section with (i) log-log plots of residual error versus iteration count, (ii) direct numerical comparison of empirical convergence factors against the spectral radius of the corresponding linearized system, and (iii) additional experiments that systematically vary the strength of nonlinearity (e.g., by scaling the temperature of the softmax or the slope of activations). These quantitative checks will demonstrate that the framework remains predictive even when f deviates substantially from linearity. revision: yes

Circularity Check

0 steps flagged

No circularity: unifying framework derives from standard LDS linearization of existing iterations

full rationale

The paper's derivation interprets Newton, Picard, and Jacobi iterations as approximate linearizations of a nonlinear recursion x_{t+1}=f(x_t) inside an LDS framework, then analyzes convergence via spectral properties of the resulting linear system and verifies with case studies. This chain relies on standard fixed-point theory and dynamical systems analysis applied to sequential models rather than any self-definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equation or claim reduces by construction to a parameter or result defined inside the paper itself; the framework is an external organizing lens on prior methods and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on modeling sequential processes as nonlinear recursions that admit useful linear approximations; no free parameters, new entities, or additional axioms are introduced in the abstract.

axioms (1)
  • domain assumption Sequential models can be represented as nonlinear recursions whose fixed-point iterations admit linear dynamical system approximations.
    This premise is required to interpret Newton, Picard, and Jacobi schemes as linearizations.

pith-pipeline@v0.9.0 · 5703 in / 1151 out tokens · 52868 ms · 2026-05-18T13:18:10.390906+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1]

    Learning Phrase Representations using RNN Encoder–Decoder for Sta- tistical Machine Translation

    Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Sta- tistical Machine Translation. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),

  2. [2]

    Scaling up liquid-resistance liquid- capacitance networks for efficient sequence modeling.arXiv preprint arXiv:2505.21717,

    Mónika Farsang, Ramin Hasani, Daniela Rus, and Radu Grosu. Scaling up liquid-resistance liquid- capacitance networks for efficient sequence modeling.arXiv preprint arXiv:2505.21717,

  3. [3]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    ISSN 0041-5553. doi: 10.1016/0041-5553(81)90075-6. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach.arXiv preprint arXiv:2502.05171,

  4. [4]

    Amber Hu, Henry Smith, and Scott Linderman

    MIT course. Amber Hu, Henry Smith, and Scott Linderman. SING: SDE Inference via Natural Gradients.arXiv preprint arXiv:2506.17796,

  5. [5]

    Patrick Kidger and Cristian Garcia

    Version 0.1. Patrick Kidger and Cristian Garcia. Equinox: neural networks in JAX via callable PyTrees and filtered transformations.CoRR, abs/2111.00254,

  6. [6]

    Entity tracking in language models.arXiv preprint arXiv:2305.02363,

    Najoung Kim and Sebastian Schuster. Entity tracking in language models.arXiv preprint arXiv:2305.02363,

  7. [7]

    Lemons and translated by A

    English translation, introduced by D.S. Lemons and translated by A. Gythiel. Original: C.R. Acad. Sci. 146, 530–533 (1908). Kenneth Levenberg. A method for the solution of certain non-linear problems in least squares.Quarterly of Applied Mathematics, 2:164–168,

  8. [8]

    15 Simo Särkkä and Lennart Svensson.Bayesian filtering and smoothing, volume

    doi: 10.1109/TAC.2020.2976316. 15 Simo Särkkä and Lennart Svensson.Bayesian filtering and smoothing, volume

  9. [9]

    Levenberg-Marquardt and line-search extended Kalman smoothers

    Simo Särkkä and Lennart Svensson. Levenberg-Marquardt and line-search extended Kalman smoothers. InICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5875–5879. IEEE,

  10. [10]

    XG, EKB, HDL, JWL, and KAW all wrote substantial code in various exploratory aspects of the project

    HDL developed, ran, and wrote up the state tracking experiments. XG, EKB, HDL, JWL, and KAW all wrote substantial code in various exploratory aspects of the project. DZ and CR contributed important insights about the relationships between algorithms and computational efficiency. All authors made important contributions to conceptualization, brainstorming,...

  11. [11]

    Note that this divide-and-conquer approach naturally leads toO(logT) depth

    A1 A2 A3 A4 A2A1 A4A3 A4A3A2A1 Figure 4:Parallel Scan for Matrix Multiplication.We illustrate a divide-and-conquer approach to compute the productA 4A3A2A1. Note that this divide-and-conquer approach naturally leads toO(logT) depth. 4Note that we have the matrices act via left-multiplication over the sequence length, because this is the most common way to...

  12. [12]

    and Rauch-Tung-Striebel (RTS)smoother(Rauchetal.,1965)obtainthefilteringandsmoothingdistributions(respectively)in an LGSSM. The Kalman filter makes a single pass forward in time to get the filtering distributions, while the RTS smoother then makes an additional pass backwards in time to get the smoothing distributions. Both the Kalman filter and RTS smoot...

  13. [13]

    The generative model of the LGSSM consists of dynamicsxt+1∼N(Axt,Q)and emissionsy t+1∼N(Hxt+1,R)

    19 x0 x1 y1 x2 y2 x3 y3 Figure 5:A linear Gaussian state space model (LGSSM):The LGSSM consists of latent variablesxt and observed variablesyt. The generative model of the LGSSM consists of dynamicsxt+1∼N(Axt,Q)and emissionsy t+1∼N(Hxt+1,R). the Kalman filter and RTS smoother can also be parallelized over the sequence length via the construction of custom...

  14. [14]

    and for computing log-normalizing constants (Hu et al., 2025). •Parallelizing fixed point iterations:Finally, these parallel filtering and smoothing are directly applicable to the types of parallel fixed-point iterations that are the focus of this paper. In partic- ular, the ELK algorithm (Gonzalez et al., 2024)—EvaluatingLevenberg-Marquardt viaKalman— st...

  15. [15]

    we discuss in this paper using the Levenberg- Marquardt (trust-region) method (Levenberg, 1944; Marquardt, 1963). The trust-region updates are able to be computed using a parallel scan because they are equivalent to a Kalman smoother in an appropriately constructed LGSSM (Särkkä & Svensson, 2020). What about arbitrary function composition?The astute reade...

  16. [16]

    prefix sums

    We will denote the individual matrices asA1,A 2,A 3,...A8, and their products asAs:t, i.e.A 5:6 =A 6A5. The first phase of the parallel scan is theup-sweep, and takeslog(T)iterations andO(T)memory. We start multiplying adjacent pairs of matrices together, going, for example6, fromA 8 toA 7:8 toA 5:8 toA 1:8. Then, in thedown-sweep, we fill in the missing ...

  17. [17]

    JAX has a general purpose parallel scan (jax.lax.associative_scan) as a fundamental primitive, which allows for implementation of a wide range of parallel scans

    and PyTorch (Paszke et al., 2019), two leading Python libraries for deep learning. JAX has a general purpose parallel scan (jax.lax.associative_scan) as a fundamental primitive, which allows for implementation of a wide range of parallel scans. For exam- ple,dynamax, a JAX library for probabilistic state space modeling (Linderman et al., 2025), implements...

  18. [18]

    8Although Heinsen (2023) shows that clever uses oftorch.cumsumcan parallelize scalar/diagonal LDSs, of the type that are used in quasi-Newton iterations (eq. (6)). 21 C Experimental Details and Additional Experiments We implemented our experiments using the Equinox library (Kidger & Garcia,

  19. [19]

    In our experiments that report wall-clock time, we use a stochastic implementation of quasi-Newton iterations (Zoltowski et al.,

    in JAX (Bradbury et al., 2018). In our experiments that report wall-clock time, we use a stochastic implementation of quasi-Newton iterations (Zoltowski et al.,

  20. [20]

    (2025) for details

    that estimates the diagonal using the Hutchinson estimator (Hutchinson, 1989); see Section 3.4 of Zoltowski et al. (2025) for details. The stopping criterion we use for deciding when the fixed-point iterations in eq. (2) have converged is based on the merit functionL(x1:T ), which is defined as: r(x1:T) = [x1−f(x0),x 2−f(x1),x 3−f(x2),...,x T−f(xT−1)] L(x...

  21. [21]

    C.1 Experimental Details for Section 3.1 We use 10 random seeds, where the randomness controls the sequence of theS5 group elements

    In our experiments we use a tolerance of5×10−4, that is we terminate the fixed-point iterations when iterate isatisfies L(x(i) 1:T )≤5×10−4. C.1 Experimental Details for Section 3.1 We use 10 random seeds, where the randomness controls the sequence of theS5 group elements. Each seed uses a batch size of 16, i.e. 16 differentS5 word problems are evaluated ...

  22. [22]

    parallel

    Numerical StabilityFor all fixed-point methods, numerical stability is a concern (Yaghoobi et al., 2025). In particular, LDS matrices with spectral norm close to or greater than one can cause numerical instabilities in the parallel scan operation (Gonzalez et al., 2024; 2025). This is especially critical in high-precision tasks or long sequences, and prac...

  23. [23]

    parallel chord

    In general, the fixed-point methods of the common form given by eq. (4) all give rise to T∈RT D×T Dmatrices of the form T=   ID 0 0...0 0 −A2 ID 0...0 0 0−A 3 ID ...0 0 ... ... ... ... ... ... 0 0 0... I D 0 0 0 0...−A T ID   .(17) Thus, Chapter 7 of Ortega & Rheinboldt (1970) unites Newton and Picard iterations for the general root find...

  24. [24]

    Thus, lower values ofσindicates thatTis good approximation of the Jacobian matrix∂F/∂xevaluated at the zerox∗of F, while higher values ofσindicate thatTis a poor approximation for∂F/∂x. Ortega & Rheinboldt (1970) then useσin their Chapter 10 (in particular, their Theorem 10.1.4) to prove linear rates of convergence for parallel-chord methods (where the ra...

  25. [25]

    trivial,

    to∂f/∂x(bothD×Dmatrices). This difference in focus is important, because in the setting we consider in this paper—using fixed-point iterations of the form eq. (4) to solve nonlinear equations of the form eq. (3)—Theorem 10.1.4 of Ortega & Rheinboldt (1970) is actuallytrivial. By “trivial,” we mean that it does not distinguish between the convergence rates...

  26. [26]

    (14) and eq

    Proof.As shown in eq. (14) and eq. (17), both∂F/∂xandTare lower-triangular matrices with allD×D identity matrices on their main block-diagonal. In particular,T−1is also a lower-triangular matrix with all D×Didentity matrices on its main block-diagonal. Consequently, the productT−1∂F/∂xis also a lower- triangular matrix with allD×Didentity matrices on its ...

  27. [27]

    This empirical approach also highlights how the increased computational cost of higher-order fixed-point methods affects wall-clock time on GPUs

    Because this asymptotic notion ofR-convergence is not suitable in our setting, we instead use empirical case studies in Section 3 to show that the closeness ofAt to ∂ft/∂ximpacts the number of iterations needed forA to converge. This empirical approach also highlights how the increased computational cost of higher-order fixed-point methods affects wall-cl...