A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems

Christopher R\'e; David M. Zoltowski; E. Kelly Buchanan; Hyun Dong Lee; Jerry Weihong Liu; Ke Alexander Wang; Leo Kozachkov; Scott W. Linderman; Xavier Gonzalez

arxiv: 2509.21716 · v2 · submitted 2025-09-26 · 💻 cs.LG

A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems

Xavier Gonzalez , E. Kelly Buchanan , Hyun Dong Lee , Jerry Weihong Liu , Ke Alexander Wang , David M. Zoltowski , Leo Kozachkov , Christopher R\'e

show 1 more author

Scott W. Linderman

This is my paper

Pith reviewed 2026-05-18 13:18 UTC · model grok-4.3

classification 💻 cs.LG

keywords linear dynamical systemsparallel computingfixed-point iterationssequential modelsconvergence analysismachine learning

0 comments

The pith

Fixed-point iterations for parallelizing sequential models arise as linearizations within a linear dynamical systems framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that approaches like Newton, Picard, and Jacobi iterations for running sequential processes in parallel are all instances of approximate linearizations of a nonlinear recursion, modeled through linear dynamical systems. This common framework makes it possible to analyze their convergence rates theoretically and to see why some methods suit particular models better than others. A sympathetic reader would care because it supplies a shared language and theory for making sequential machine learning models run faster through parallelism. The predictions are checked against several case studies on different models.

Core claim

Different iteration schemes for parallel evaluation arise naturally as approximate linearizations of a nonlinear recursion in the linear dynamical systems framework, where the rates of convergence can be analyzed theoretically and are confirmed by case studies.

What carries the argument

Linear dynamical systems (LDSs) that represent approximate linearizations of the underlying nonlinear recursion, generating the various fixed-point iteration schemes as special cases.

If this is right

Convergence rates of each parallel method can be predicted from the properties of its linearization.
Particular iteration schemes are most effective under conditions identified by the linear dynamical system analysis.
The framework supplies a clearer theoretical basis for choosing or designing parallel algorithms for sequential models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same linearization view might be applied to other iterative techniques in simulation or optimization beyond the ones studied.
Adding stochastic elements to the linear dynamical system could extend the framework to noisy or probabilistic sequential models.
The convergence predictions could let users select an iteration scheme for a new model by computing only a few matrix properties rather than running full trials.

Load-bearing premise

The sequential process must be well-described by a nonlinear recursion whose behavior is captured accurately by the chosen linear approximations without large errors in convergence or stability.

What would settle it

A case where one of the fixed-point methods converges or diverges at a rate clearly inconsistent with the linear dynamical system approximation derived from the nonlinear recursion.

Figures

Figures reproduced from arXiv: 2509.21716 by Christopher R\'e, David M. Zoltowski, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, Leo Kozachkov, Scott W. Linderman, Xavier Gonzalez.

**Figure 1.** Figure 1: A single Newton iteration solves the S5 group word problem, whereas the number of iterations required for the quasi-Newton and Picard methods increases with problem size. We consider the task of evaluating the product of S5 group elements. A: The group word problem can be expressed as an LDS with input-dependent state-transition matrices. B: An example input-dependent transition matrix At for permutation (… view at source ↗

**Figure 2.** Figure 2: First-order iterations can excel for complicated dynamics. We evaluate GRUs with random parameter initialization for different sequence lengths T and hidden state sizes D. A: The nonlinear dynamics of a GRU, following Feng et al. (2024), where xt is the hidden state, ut is the input, and the notation Linear[·, ·] indicates a linear readout from the concatenation of two vectors. B: A representative Jacobian… view at source ↗

**Figure 3.** Figure 3: Zeroth-order fixed-point iterations work well when the Jacobian is close to the identity. We evaluate Langevin dynamics for a potential ϕ. A: The nonlinear dynamics of Langevin dynamics for a potential ϕ and step size ϵ, where xt is the state and wt is Gaussian noise. B: The Jacobian for Langevin dynamics is well-approximated by the identity matrix, especially for small step size ϵ = 1 × 10−5 . C: We evalu… view at source ↗

**Figure 4.** Figure 4: Parallel Scan for Matrix Multiplication. We illustrate a divide-and-conquer approach to compute the product A4A3A2A1. Note that this divide-and-conquer approach naturally leads to O(log T) depth. 4Note that we have the matrices act via left-multiplication over the sequence length, because this is the most common way to write matrix-vector products. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 5.** Figure 5: A linear Gaussian state space model (LGSSM): The LGSSM consists of latent variables xt and observed variables yt. The generative model of the LGSSM consists of dynamics xt+1 ∼ N (Axt, Q) and emissions yt+1 ∼ N (Hxt+1, R). the Kalman filter and RTS smoother can also be parallelized over the sequence length via the construction of custom binary associative operators and a parallel scan. While we leave the de… view at source ↗

**Figure 6.** Figure 6: In a similar setting as Figure 3, we instead consider evaluating Langevin dynamics on a potential [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗

**Figure 7.** Figure 7: The term “parallel” in parallel-chord methods. Here we illustrate 3 iterations of Newton’s method for root-finding on the one-dimensional cubic function f(x) = (x−0.4)3 + 0.45(x−0.4). We observe that each iteration of Newton’s method involves forming a “parallel chord” to the function (shown in color). When we plug this form of the Jacobian into T in eq. (12) and simplify, we obtain the linear dynamical sy… view at source ↗

read the original abstract

Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using iterative fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. Moreover, we theoretically analyze the rates of convergence of these methods, and we verify the predictions of this theory with several case studies. This unifying framework highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, the framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper unifies fixed-point iterations for parallel sequential models as LDS linearizations of a nonlinear recursion and derives convergence rates, but the approximation error control for typical ML nonlinearities is the part that needs the most scrutiny.

read the letter

The main thing to know is that this paper treats Newton, Picard, and Jacobi iterations as different linear approximations inside a linear dynamical system applied to a nonlinear recursion x_{t+1}=f(x_t). They show how each scheme drops out of that setup and then analyze the resulting convergence rates from the spectral radius of the linear part, with some case-study checks. That framing and the rate derivations are the fresh pieces relative to prior work on the individual methods. The unification is clean and makes it easier to see why one iteration might converge faster than another for a given model. The case studies give at least initial evidence that the predicted rates track what happens in practice. On the soft spots, the analysis rests on the linearization error not shifting stability or rates too much. For simple recursions this can be fine, but attention layers and gated units are strongly nonlinear, so the neglected terms could change the effective behavior. The paper claims explicit rates and verification, yet the strength of those claims depends on how tightly they bound the remainder or state the Lipschitz conditions on f. If those bounds are loose or the case studies stay away from representative deep models, the practical takeaway shrinks. This is aimed at people building or analyzing parallel algorithms for RNNs, state-space models, or transformer variants. A reader who cares about theoretical grounding for these speed-ups would find the common language useful. I would send it for peer review. The core idea is coherent and the work is grounded enough in derivations plus checks to deserve referee time, even if the generality of the error control will likely need tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that iteration schemes for parallelizing sequential models (Newton, Picard, Jacobi) can be unified as approximate linearizations of a nonlinear recursion within a linear dynamical systems (LDS) framework. It derives theoretical convergence rates from the spectral properties of the resulting linear systems and verifies these predictions through several case studies, aiming to clarify when each method is effective and to provide a clearer foundation for parallel computation in models like RNNs and transformers.

Significance. If the approximation errors from linearization are shown to be controlled, the framework would offer a useful unifying lens on existing parallelization techniques, highlighting shared principles across methods and suggesting new directions for scalable sequential model evaluation. The explicit verification against case studies is a strength, as it provides empirical grounding for the theoretical rates and helps assess practical applicability.

major comments (2)

[Theoretical convergence analysis (around the derivation of rates from LDS linearization)] The central claim that convergence rates follow directly from the spectral radius of the linearized LDS requires explicit control on the neglected higher-order terms in the linearization of x_{t+1}=f(x_t). Without remainder bounds, Lipschitz assumptions on f, or analysis of how nonlinearity (e.g., in attention or gated layers) affects the effective spectral radius, the predicted rates may not hold in general.
[Case studies section] The case-study verification should include quantitative checks that the observed convergence matches the LDS-predicted rates even for strongly nonlinear f, rather than only for near-linear regimes; otherwise the framework's applicability to typical sequential models remains incompletely supported.

minor comments (2)

[Abstract] The abstract states that rates are 'theoretically analyze[d]' and 'verified'; adding a sentence on the key assumptions (e.g., contraction or bounded nonlinearity) would help readers gauge scope immediately.
[Introduction / Framework section] Notation for the nonlinear recursion and the precise construction of each LDS approximation (Newton vs. Picard linearization) should be introduced with a single running example early in the main text for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. These have highlighted important opportunities to strengthen the theoretical and empirical support for the proposed framework. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Theoretical convergence analysis (around the derivation of rates from LDS linearization)] The central claim that convergence rates follow directly from the spectral radius of the linearized LDS requires explicit control on the neglected higher-order terms in the linearization of x_{t+1}=f(x_t). Without remainder bounds, Lipschitz assumptions on f, or analysis of how nonlinearity (e.g., in attention or gated layers) affects the effective spectral radius, the predicted rates may not hold in general.

Authors: We agree that bounding the linearization error is necessary to make the convergence-rate claims fully rigorous for general nonlinear f. The manuscript currently derives rates from the spectral radius of the exact linearization and treats the iteration schemes as approximate linearizations; the analysis is therefore local near the fixed point. In the revision we will add a dedicated subsection that (i) states a local Lipschitz assumption on f, (ii) invokes the standard first-order Taylor expansion with Lagrange remainder, and (iii) shows how the remainder term scales with the distance to the fixed point. We will also include a brief discussion of how the effective spectral radius is modulated by common nonlinearities (ReLU, softmax, gating) under these assumptions. These additions will clarify the precise regime in which the predicted rates hold. revision: yes
Referee: [Case studies section] The case-study verification should include quantitative checks that the observed convergence matches the LDS-predicted rates even for strongly nonlinear f, rather than only for near-linear regimes; otherwise the framework's applicability to typical sequential models remains incompletely supported.

Authors: We appreciate the suggestion to strengthen the empirical validation. While the existing case studies already employ nonlinear components (ReLU activations, gated recurrent units, and scaled dot-product attention), the quantitative comparison between observed and LDS-predicted rates can be made more explicit. In the revision we will augment the case-study section with (i) log-log plots of residual error versus iteration count, (ii) direct numerical comparison of empirical convergence factors against the spectral radius of the corresponding linearized system, and (iii) additional experiments that systematically vary the strength of nonlinearity (e.g., by scaling the temperature of the softmax or the slope of activations). These quantitative checks will demonstrate that the framework remains predictive even when f deviates substantially from linearity. revision: yes

Circularity Check

0 steps flagged

No circularity: unifying framework derives from standard LDS linearization of existing iterations

full rationale

The paper's derivation interprets Newton, Picard, and Jacobi iterations as approximate linearizations of a nonlinear recursion x_{t+1}=f(x_t) inside an LDS framework, then analyzes convergence via spectral properties of the resulting linear system and verifies with case studies. This chain relies on standard fixed-point theory and dynamical systems analysis applied to sequential models rather than any self-definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equation or claim reduces by construction to a parameter or result defined inside the paper itself; the framework is an external organizing lens on prior methods and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on modeling sequential processes as nonlinear recursions that admit useful linear approximations; no free parameters, new entities, or additional axioms are introduced in the abstract.

axioms (1)

domain assumption Sequential models can be represented as nonlinear recursions whose fixed-point iterations admit linear dynamical system approximations.
This premise is required to interpret Newton, Picard, and Jacobi schemes as linearizations.

pith-pipeline@v0.9.0 · 5703 in / 1151 out tokens · 52868 ms · 2026-05-18T13:18:10.390906+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion... x(i+1)_{t+1} = f_{t+1}(x(i)_t) + A_{t+1}(x(i+1)_t - x(i)_t)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

all of these methods... guaranteed to converge in at most T iterations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

[1]

Learning Phrase Representations using RNN Encoder–Decoder for Sta- tistical Machine Translation

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Sta- tistical Machine Translation. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page 2014
[2]

Scaling up liquid-resistance liquid- capacitance networks for efficient sequence modeling.arXiv preprint arXiv:2505.21717,

Mónika Farsang, Ramin Hasani, Daniela Rus, and Radu Grosu. Scaling up liquid-resistance liquid- capacitance networks for efficient sequence modeling.arXiv preprint arXiv:2505.21717,

work page arXiv
[3]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

ISSN 0041-5553. doi: 10.1016/0041-5553(81)90075-6. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach.arXiv preprint arXiv:2502.05171,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/0041-5553(81)90075-6
[4]

Amber Hu, Henry Smith, and Scott Linderman

MIT course. Amber Hu, Henry Smith, and Scott Linderman. SING: SDE Inference via Natural Gradients.arXiv preprint arXiv:2506.17796,

work page arXiv
[5]

Patrick Kidger and Cristian Garcia

Version 0.1. Patrick Kidger and Cristian Garcia. Equinox: neural networks in JAX via callable PyTrees and filtered transformations.CoRR, abs/2111.00254,

work page arXiv
[6]

Entity tracking in language models.arXiv preprint arXiv:2305.02363,

Najoung Kim and Sebastian Schuster. Entity tracking in language models.arXiv preprint arXiv:2305.02363,

work page arXiv
[7]

Lemons and translated by A

English translation, introduced by D.S. Lemons and translated by A. Gythiel. Original: C.R. Acad. Sci. 146, 530–533 (1908). Kenneth Levenberg. A method for the solution of certain non-linear problems in least squares.Quarterly of Applied Mathematics, 2:164–168,

work page 1908
[8]

15 Simo Särkkä and Lennart Svensson.Bayesian filtering and smoothing, volume

doi: 10.1109/TAC.2020.2976316. 15 Simo Särkkä and Lennart Svensson.Bayesian filtering and smoothing, volume

work page doi:10.1109/tac.2020.2976316 2020
[9]

Levenberg-Marquardt and line-search extended Kalman smoothers

Simo Särkkä and Lennart Svensson. Levenberg-Marquardt and line-search extended Kalman smoothers. InICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5875–5879. IEEE,

work page 2020
[10]

XG, EKB, HDL, JWL, and KAW all wrote substantial code in various exploratory aspects of the project

HDL developed, ran, and wrote up the state tracking experiments. XG, EKB, HDL, JWL, and KAW all wrote substantial code in various exploratory aspects of the project. DZ and CR contributed important insights about the relationships between algorithms and computational efficiency. All authors made important contributions to conceptualization, brainstorming,...

work page 2025
[11]

Note that this divide-and-conquer approach naturally leads toO(logT) depth

A1 A2 A3 A4 A2A1 A4A3 A4A3A2A1 Figure 4:Parallel Scan for Matrix Multiplication.We illustrate a divide-and-conquer approach to compute the productA 4A3A2A1. Note that this divide-and-conquer approach naturally leads toO(logT) depth. 4Note that we have the matrices act via left-multiplication over the sequence length, because this is the most common way to...

work page 2018
[12]

and Rauch-Tung-Striebel (RTS)smoother(Rauchetal.,1965)obtainthefilteringandsmoothingdistributions(respectively)in an LGSSM. The Kalman filter makes a single pass forward in time to get the filtering distributions, while the RTS smoother then makes an additional pass backwards in time to get the smoothing distributions. Both the Kalman filter and RTS smoot...

work page 1965
[13]

The generative model of the LGSSM consists of dynamicsxt+1∼N(Axt,Q)and emissionsy t+1∼N(Hxt+1,R)

19 x0 x1 y1 x2 y2 x3 y3 Figure 5:A linear Gaussian state space model (LGSSM):The LGSSM consists of latent variablesxt and observed variablesyt. The generative model of the LGSSM consists of dynamicsxt+1∼N(Axt,Q)and emissionsy t+1∼N(Hxt+1,R). the Kalman filter and RTS smoother can also be parallelized over the sequence length via the construction of custom...

work page 2021
[14]

and for computing log-normalizing constants (Hu et al., 2025). •Parallelizing fixed point iterations:Finally, these parallel filtering and smoothing are directly applicable to the types of parallel fixed-point iterations that are the focus of this paper. In partic- ular, the ELK algorithm (Gonzalez et al., 2024)—EvaluatingLevenberg-Marquardt viaKalman— st...

work page 2025
[15]

we discuss in this paper using the Levenberg- Marquardt (trust-region) method (Levenberg, 1944; Marquardt, 1963). The trust-region updates are able to be computed using a parallel scan because they are equivalent to a Kalman smoother in an appropriately constructed LGSSM (Särkkä & Svensson, 2020). What about arbitrary function composition?The astute reade...

work page 1944
[16]

prefix sums

We will denote the individual matrices asA1,A 2,A 3,...A8, and their products asAs:t, i.e.A 5:6 =A 6A5. The first phase of the parallel scan is theup-sweep, and takeslog(T)iterations andO(T)memory. We start multiplying adjacent pairs of matrices together, going, for example6, fromA 8 toA 7:8 toA 5:8 toA 1:8. Then, in thedown-sweep, we fill in the missing ...

work page 2007
[17]

JAX has a general purpose parallel scan (jax.lax.associative_scan) as a fundamental primitive, which allows for implementation of a wide range of parallel scans

and PyTorch (Paszke et al., 2019), two leading Python libraries for deep learning. JAX has a general purpose parallel scan (jax.lax.associative_scan) as a fundamental primitive, which allows for implementation of a wide range of parallel scans. For exam- ple,dynamax, a JAX library for probabilistic state space modeling (Linderman et al., 2025), implements...

work page 2019
[18]

8Although Heinsen (2023) shows that clever uses oftorch.cumsumcan parallelize scalar/diagonal LDSs, of the type that are used in quasi-Newton iterations (eq. (6)). 21 C Experimental Details and Additional Experiments We implemented our experiments using the Equinox library (Kidger & Garcia,

work page 2023
[19]

In our experiments that report wall-clock time, we use a stochastic implementation of quasi-Newton iterations (Zoltowski et al.,

in JAX (Bradbury et al., 2018). In our experiments that report wall-clock time, we use a stochastic implementation of quasi-Newton iterations (Zoltowski et al.,

work page 2018
[20]

(2025) for details

that estimates the diagonal using the Hutchinson estimator (Hutchinson, 1989); see Section 3.4 of Zoltowski et al. (2025) for details. The stopping criterion we use for deciding when the fixed-point iterations in eq. (2) have converged is based on the merit functionL(x1:T ), which is defined as: r(x1:T) = [x1−f(x0),x 2−f(x1),x 3−f(x2),...,x T−f(xT−1)] L(x...

work page 1989
[21]

C.1 Experimental Details for Section 3.1 We use 10 random seeds, where the randomness controls the sequence of theS5 group elements

In our experiments we use a tolerance of5×10−4, that is we terminate the fixed-point iterations when iterate isatisfies L(x(i) 1:T )≤5×10−4. C.1 Experimental Details for Section 3.1 We use 10 random seeds, where the randomness controls the sequence of theS5 group elements. Each seed uses a batch size of 16, i.e. 16 differentS5 word problems are evaluated ...

work page 2024
[22]

parallel

Numerical StabilityFor all fixed-point methods, numerical stability is a concern (Yaghoobi et al., 2025). In particular, LDS matrices with spectral norm close to or greater than one can cause numerical instabilities in the parallel scan operation (Gonzalez et al., 2024; 2025). This is especially critical in high-precision tasks or long sequences, and prac...

work page 2025
[23]

parallel chord

In general, the fixed-point methods of the common form given by eq. (4) all give rise to T∈RT D×T Dmatrices of the form T=   ID 0 0...0 0 −A2 ID 0...0 0 0−A 3 ID ...0 0 ... ... ... ... ... ... 0 0 0... I D 0 0 0 0...−A T ID   .(17) Thus, Chapter 7 of Ortega & Rheinboldt (1970) unites Newton and Picard iterations for the general root find...

work page 1970
[24]

Thus, lower values ofσindicates thatTis good approximation of the Jacobian matrix∂F/∂xevaluated at the zerox∗of F, while higher values ofσindicate thatTis a poor approximation for∂F/∂x. Ortega & Rheinboldt (1970) then useσin their Chapter 10 (in particular, their Theorem 10.1.4) to prove linear rates of convergence for parallel-chord methods (where the ra...

work page 1970
[25]

trivial,

to∂f/∂x(bothD×Dmatrices). This difference in focus is important, because in the setting we consider in this paper—using fixed-point iterations of the form eq. (4) to solve nonlinear equations of the form eq. (3)—Theorem 10.1.4 of Ortega & Rheinboldt (1970) is actuallytrivial. By “trivial,” we mean that it does not distinguish between the convergence rates...

work page 1970
[26]

(14) and eq

Proof.As shown in eq. (14) and eq. (17), both∂F/∂xandTare lower-triangular matrices with allD×D identity matrices on their main block-diagonal. In particular,T−1is also a lower-triangular matrix with all D×Didentity matrices on its main block-diagonal. Consequently, the productT−1∂F/∂xis also a lower- triangular matrix with allD×Didentity matrices on its ...

work page 2024
[27]

This empirical approach also highlights how the increased computational cost of higher-order fixed-point methods affects wall-clock time on GPUs

Because this asymptotic notion ofR-convergence is not suitable in our setting, we instead use empirical case studies in Section 3 to show that the closeness ofAt to ∂ft/∂ximpacts the number of iterations needed forA to converge. This empirical approach also highlights how the increased computational cost of higher-order fixed-point methods affects wall-cl...

work page 2024

[1] [1]

Learning Phrase Representations using RNN Encoder–Decoder for Sta- tistical Machine Translation

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Sta- tistical Machine Translation. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page 2014

[2] [2]

Scaling up liquid-resistance liquid- capacitance networks for efficient sequence modeling.arXiv preprint arXiv:2505.21717,

Mónika Farsang, Ramin Hasani, Daniela Rus, and Radu Grosu. Scaling up liquid-resistance liquid- capacitance networks for efficient sequence modeling.arXiv preprint arXiv:2505.21717,

work page arXiv

[3] [3]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

ISSN 0041-5553. doi: 10.1016/0041-5553(81)90075-6. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach.arXiv preprint arXiv:2502.05171,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/0041-5553(81)90075-6

[4] [4]

Amber Hu, Henry Smith, and Scott Linderman

MIT course. Amber Hu, Henry Smith, and Scott Linderman. SING: SDE Inference via Natural Gradients.arXiv preprint arXiv:2506.17796,

work page arXiv

[5] [5]

Patrick Kidger and Cristian Garcia

Version 0.1. Patrick Kidger and Cristian Garcia. Equinox: neural networks in JAX via callable PyTrees and filtered transformations.CoRR, abs/2111.00254,

work page arXiv

[6] [6]

Entity tracking in language models.arXiv preprint arXiv:2305.02363,

Najoung Kim and Sebastian Schuster. Entity tracking in language models.arXiv preprint arXiv:2305.02363,

work page arXiv

[7] [7]

Lemons and translated by A

English translation, introduced by D.S. Lemons and translated by A. Gythiel. Original: C.R. Acad. Sci. 146, 530–533 (1908). Kenneth Levenberg. A method for the solution of certain non-linear problems in least squares.Quarterly of Applied Mathematics, 2:164–168,

work page 1908

[8] [8]

15 Simo Särkkä and Lennart Svensson.Bayesian filtering and smoothing, volume

doi: 10.1109/TAC.2020.2976316. 15 Simo Särkkä and Lennart Svensson.Bayesian filtering and smoothing, volume

work page doi:10.1109/tac.2020.2976316 2020

[9] [9]

Levenberg-Marquardt and line-search extended Kalman smoothers

Simo Särkkä and Lennart Svensson. Levenberg-Marquardt and line-search extended Kalman smoothers. InICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5875–5879. IEEE,

work page 2020

[10] [10]

XG, EKB, HDL, JWL, and KAW all wrote substantial code in various exploratory aspects of the project

HDL developed, ran, and wrote up the state tracking experiments. XG, EKB, HDL, JWL, and KAW all wrote substantial code in various exploratory aspects of the project. DZ and CR contributed important insights about the relationships between algorithms and computational efficiency. All authors made important contributions to conceptualization, brainstorming,...

work page 2025

[11] [11]

Note that this divide-and-conquer approach naturally leads toO(logT) depth

A1 A2 A3 A4 A2A1 A4A3 A4A3A2A1 Figure 4:Parallel Scan for Matrix Multiplication.We illustrate a divide-and-conquer approach to compute the productA 4A3A2A1. Note that this divide-and-conquer approach naturally leads toO(logT) depth. 4Note that we have the matrices act via left-multiplication over the sequence length, because this is the most common way to...

work page 2018

[12] [12]

and Rauch-Tung-Striebel (RTS)smoother(Rauchetal.,1965)obtainthefilteringandsmoothingdistributions(respectively)in an LGSSM. The Kalman filter makes a single pass forward in time to get the filtering distributions, while the RTS smoother then makes an additional pass backwards in time to get the smoothing distributions. Both the Kalman filter and RTS smoot...

work page 1965

[13] [13]

The generative model of the LGSSM consists of dynamicsxt+1∼N(Axt,Q)and emissionsy t+1∼N(Hxt+1,R)

19 x0 x1 y1 x2 y2 x3 y3 Figure 5:A linear Gaussian state space model (LGSSM):The LGSSM consists of latent variablesxt and observed variablesyt. The generative model of the LGSSM consists of dynamicsxt+1∼N(Axt,Q)and emissionsy t+1∼N(Hxt+1,R). the Kalman filter and RTS smoother can also be parallelized over the sequence length via the construction of custom...

work page 2021

[14] [14]

and for computing log-normalizing constants (Hu et al., 2025). •Parallelizing fixed point iterations:Finally, these parallel filtering and smoothing are directly applicable to the types of parallel fixed-point iterations that are the focus of this paper. In partic- ular, the ELK algorithm (Gonzalez et al., 2024)—EvaluatingLevenberg-Marquardt viaKalman— st...

work page 2025

[15] [15]

we discuss in this paper using the Levenberg- Marquardt (trust-region) method (Levenberg, 1944; Marquardt, 1963). The trust-region updates are able to be computed using a parallel scan because they are equivalent to a Kalman smoother in an appropriately constructed LGSSM (Särkkä & Svensson, 2020). What about arbitrary function composition?The astute reade...

work page 1944

[16] [16]

prefix sums

We will denote the individual matrices asA1,A 2,A 3,...A8, and their products asAs:t, i.e.A 5:6 =A 6A5. The first phase of the parallel scan is theup-sweep, and takeslog(T)iterations andO(T)memory. We start multiplying adjacent pairs of matrices together, going, for example6, fromA 8 toA 7:8 toA 5:8 toA 1:8. Then, in thedown-sweep, we fill in the missing ...

work page 2007

[17] [17]

JAX has a general purpose parallel scan (jax.lax.associative_scan) as a fundamental primitive, which allows for implementation of a wide range of parallel scans

and PyTorch (Paszke et al., 2019), two leading Python libraries for deep learning. JAX has a general purpose parallel scan (jax.lax.associative_scan) as a fundamental primitive, which allows for implementation of a wide range of parallel scans. For exam- ple,dynamax, a JAX library for probabilistic state space modeling (Linderman et al., 2025), implements...

work page 2019

[18] [18]

8Although Heinsen (2023) shows that clever uses oftorch.cumsumcan parallelize scalar/diagonal LDSs, of the type that are used in quasi-Newton iterations (eq. (6)). 21 C Experimental Details and Additional Experiments We implemented our experiments using the Equinox library (Kidger & Garcia,

work page 2023

[19] [19]

In our experiments that report wall-clock time, we use a stochastic implementation of quasi-Newton iterations (Zoltowski et al.,

in JAX (Bradbury et al., 2018). In our experiments that report wall-clock time, we use a stochastic implementation of quasi-Newton iterations (Zoltowski et al.,

work page 2018

[20] [20]

(2025) for details

that estimates the diagonal using the Hutchinson estimator (Hutchinson, 1989); see Section 3.4 of Zoltowski et al. (2025) for details. The stopping criterion we use for deciding when the fixed-point iterations in eq. (2) have converged is based on the merit functionL(x1:T ), which is defined as: r(x1:T) = [x1−f(x0),x 2−f(x1),x 3−f(x2),...,x T−f(xT−1)] L(x...

work page 1989

[21] [21]

C.1 Experimental Details for Section 3.1 We use 10 random seeds, where the randomness controls the sequence of theS5 group elements

In our experiments we use a tolerance of5×10−4, that is we terminate the fixed-point iterations when iterate isatisfies L(x(i) 1:T )≤5×10−4. C.1 Experimental Details for Section 3.1 We use 10 random seeds, where the randomness controls the sequence of theS5 group elements. Each seed uses a batch size of 16, i.e. 16 differentS5 word problems are evaluated ...

work page 2024

[22] [22]

parallel

Numerical StabilityFor all fixed-point methods, numerical stability is a concern (Yaghoobi et al., 2025). In particular, LDS matrices with spectral norm close to or greater than one can cause numerical instabilities in the parallel scan operation (Gonzalez et al., 2024; 2025). This is especially critical in high-precision tasks or long sequences, and prac...

work page 2025

[23] [23]

parallel chord

In general, the fixed-point methods of the common form given by eq. (4) all give rise to T∈RT D×T Dmatrices of the form T=   ID 0 0...0 0 −A2 ID 0...0 0 0−A 3 ID ...0 0 ... ... ... ... ... ... 0 0 0... I D 0 0 0 0...−A T ID   .(17) Thus, Chapter 7 of Ortega & Rheinboldt (1970) unites Newton and Picard iterations for the general root find...

work page 1970

[24] [24]

Thus, lower values ofσindicates thatTis good approximation of the Jacobian matrix∂F/∂xevaluated at the zerox∗of F, while higher values ofσindicate thatTis a poor approximation for∂F/∂x. Ortega & Rheinboldt (1970) then useσin their Chapter 10 (in particular, their Theorem 10.1.4) to prove linear rates of convergence for parallel-chord methods (where the ra...

work page 1970

[25] [25]

trivial,

to∂f/∂x(bothD×Dmatrices). This difference in focus is important, because in the setting we consider in this paper—using fixed-point iterations of the form eq. (4) to solve nonlinear equations of the form eq. (3)—Theorem 10.1.4 of Ortega & Rheinboldt (1970) is actuallytrivial. By “trivial,” we mean that it does not distinguish between the convergence rates...

work page 1970

[26] [26]

(14) and eq

Proof.As shown in eq. (14) and eq. (17), both∂F/∂xandTare lower-triangular matrices with allD×D identity matrices on their main block-diagonal. In particular,T−1is also a lower-triangular matrix with all D×Didentity matrices on its main block-diagonal. Consequently, the productT−1∂F/∂xis also a lower- triangular matrix with allD×Didentity matrices on its ...

work page 2024

[27] [27]

This empirical approach also highlights how the increased computational cost of higher-order fixed-point methods affects wall-clock time on GPUs

Because this asymptotic notion ofR-convergence is not suitable in our setting, we instead use empirical case studies in Section 3 to show that the closeness ofAt to ∂ft/∂ximpacts the number of iterations needed forA to converge. This empirical approach also highlights how the increased computational cost of higher-order fixed-point methods affects wall-cl...

work page 2024