pith. sign in

arxiv: 2605.17108 · v1 · pith:ZTDTQHU5new · submitted 2026-05-16 · 💻 cs.LG

Parallel Recursive LSTM

Pith reviewed 2026-05-20 15:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords Parallel Recursive LSTMhierarchical recurrencelogarithmic depthgated state compositionsequence modelingformal language tasksparallel computationlength generalization
0
0 comments X

The pith

The Parallel Recursive LSTM merges latent states recursively over a balanced tree to reduce recurrent computation depth from linear to logarithmic while preserving nonlinear gated updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Parallel Recursive LSTM as a way to reorganize recurrent computation for better parallelism. Instead of processing a sequence strictly left to right, it first maps each token to an independent latent state and then repeatedly applies a single learned gated composition block to merge pairs of states according to a fixed balanced binary tree. This keeps the nonlinear state-tracking dynamics that make LSTMs effective while cutting the number of sequential steps down to the logarithm of the sequence length. On formal-language tasks that test long-range dependencies, the resulting model generalizes to longer sequences than standard RNNs, LSTMs, and Transformers and avoids the quadratic cost of attention.

Core claim

By replacing the strictly sequential left-to-right recurrence of LSTMs with recursive nonlinear state composition over a fixed balanced computation tree, the PR-LSTM retains explicit nonlinear gated state representations and achieves logarithmic parallel depth. Tokens are mapped independently to latent states, which are then merged by a learned gated composition block applied according to the reduction pattern of parallel scans rather than an associative recurrence.

What carries the argument

The learned gated composition block that merges two latent state vectors into one, applied recursively according to the schedule of a balanced binary tree.

If this is right

  • Recurrent models can expose logarithmic rather than linear parallelism without forcing the transition function to be linear or associative.
  • State-tracking performance on formal-language benchmarks improves relative to both sequential LSTMs and quadratic-attention transformers.
  • Training and inference cost for long sequences scales better than attention while still supporting nonlinear dynamics.
  • The same hierarchical schedule can be applied to any gated recurrent cell that uses a composition operation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed balanced-tree schedule could be relaxed to input-dependent trees if the composition block is made sufficiently expressive.
  • Similar hierarchical reorganization might improve efficiency in hybrid architectures that combine recurrence with limited attention.
  • The approach opens a route to training very deep recurrent networks without the usual vanishing-gradient problems associated with linear depth.

Load-bearing premise

That a single learned gated merge function applied repeatedly over a fixed balanced tree can preserve the full state-tracking expressivity of a conventional left-to-right LSTM.

What would settle it

A controlled experiment showing that PR-LSTM fails on long formal-language instances that a standard LSTM solves correctly, or that its accuracy degrades sharply once sequence length exceeds the training distribution.

Figures

Figures reproduced from arXiv: 2605.17108 by Tristan Gaudreault, Yongyi Mao.

Figure 1
Figure 1. Figure 1: Overview of PR-LSTM. can be implemented with scan-style primitives that accept user-defined binary operators, such as jax.lax.associative scan in JAX, or analogous framework-specific implementations. Rather than reproducing the exact dynamics of a sequential RNN, PR-LSTM learns hierarchical state compositions over the input sequence. Lower levels of the tree capture local interactions between neighboring r… view at source ↗
Figure 2
Figure 2. Figure 2: Definitions of the color-coded LSTM encoder modules used in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Training dynamics across six tasks. Training loss (y-axis) as a function of wall-clock training time (x-axis), with all runs trained for 40,000 iterations on a cluster job allocation with an NVIDIA H100, 1 CPU core, and 35 GB of system memory. Endpoint markers show completion times for this fixed training budget, highlighting relative training speed. 4.1 Main results We first evaluate whether hierarchical … view at source ↗
Figure 4
Figure 4. Figure 4: Scaling behavior of inference time and memory consumption with increasing sequence length. Colors follow [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Training dynamics across all fifteen tasks. Training loss (y-axis) as a function of wall￾clock training time (x-axis), with all runs trained for 40,000 iterations on a job allocation with an NVIDIA H100, 1 CPU core, and 35 GB of system memory. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
read the original abstract

Transformers have become the dominant architecture for sequence modeling by using self-attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long-context settings. Recurrent models such as LSTMs provide explicit nonlinear state updates and strong state-tracking capabilities, yet their strictly sequential computation limits parallelism. We introduce the Parallel Recursive LSTM (PR-LSTM), a hierarchical recurrent architecture that replaces left-to-right recurrence with recursive nonlinear state composition over a balanced computation tree. Tokens are first mapped independently to latent states, which are then recursively merged by a learned gated composition block. This structure uses the reduction pattern underlying parallel scans as a fixed execution schedule, rather than assuming an associative recurrence. As a result, PR-LSTM retains nonlinear gated state representations while reducing recurrent parallel depth from linear to logarithmic. Empirically, PR-LSTM achieves strong sequence-length generalization on formal-language benchmarks, solving more tasks than standard RNN, LSTM, and Transformer baselines, while avoiding the quadratic scaling of attention. These results suggest that recurrent computation can be reorganized hierarchically to expose parallelism without restricting the transition dynamics to linear or associative forms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Parallel Recursive LSTM (PR-LSTM), which maps tokens to latent states and then recursively merges them via a learned gated composition block over a fixed balanced binary tree (following the reduction pattern of parallel scans). This yields logarithmic parallel depth while claiming to retain the nonlinear gated state dynamics of standard LSTMs, and the manuscript reports that PR-LSTM solves more formal-language tasks with stronger sequence-length generalization than RNN, LSTM, and Transformer baselines.

Significance. If the empirical results and the retention of expressivity hold, the work would be significant for demonstrating that recurrent computation can be reorganized hierarchically to expose parallelism without forcing associativity or linear dynamics. Credit is due for the explicit use of a fixed balanced-tree schedule rather than learned or associative assumptions, and for the focus on formal-language benchmarks that test state tracking.

major comments (2)
  1. [§3] §3 (architecture definition): the central claim that recursive application of one learned gated merge over a fixed balanced tree preserves or exceeds the state-tracking expressivity of left-to-right LSTM is not supported by any analysis of the changed information-flow graph. Each token participates in log n merges from disjoint subtrees rather than sequential incremental updates; nothing in the construction or the learned block is shown to recover the exact dynamics needed for tasks whose acceptance depends on strict left-to-right accumulation (e.g., precise counters). This is load-bearing for the retention-of-nonlinear-dynamics claim.
  2. [Experiments section / Table 2] Experiments section / Table 2 (formal-language results): the reported superiority and sequence-length generalization lack ablations that isolate the effect of the tree schedule versus the gated block itself, and no direct comparison (e.g., via probing or state-similarity metrics) is given between PR-LSTM internal states and those of a standard LSTM on the same tasks. Without this, it remains unclear whether performance gains reflect retained expressivity or other factors.
minor comments (2)
  1. The abstract and introduction would benefit from a brief explicit statement of the training objective and hyperparameter ranges used for the formal-language experiments to aid reproducibility.
  2. Notation for the gated composition block (e.g., how the merge parameters are shared across tree levels) should be clarified in the first appearance to avoid ambiguity when reading the recursive definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the PR-LSTM architecture and experimental evaluation. We address each major comment in turn and indicate where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (architecture definition): the central claim that recursive application of one learned gated merge over a fixed balanced tree preserves or exceeds the state-tracking expressivity of left-to-right LSTM is not supported by any analysis of the changed information-flow graph. Each token participates in log n merges from disjoint subtrees rather than sequential incremental updates; nothing in the construction or the learned block is shown to recover the exact dynamics needed for tasks whose acceptance depends on strict left-to-right accumulation (e.g., precise counters). This is load-bearing for the retention-of-nonlinear-dynamics claim.

    Authors: We agree that the manuscript would benefit from a more detailed discussion of the information flow in the balanced tree structure. The gated merge block is parameterized similarly to an LSTM cell, allowing it to selectively update, forget, and output information from the two child states. In principle, this can learn to prioritize information from the left subtree to mimic left-to-right processing. However, we acknowledge the lack of explicit analysis or guarantees for exact equivalence on all tasks. In the revision, we will expand §3 with a description of the information propagation paths and note potential differences from sequential LSTMs for tasks requiring precise sequential accumulation. revision: yes

  2. Referee: [Experiments section / Table 2] Experiments section / Table 2 (formal-language results): the reported superiority and sequence-length generalization lack ablations that isolate the effect of the tree schedule versus the gated block itself, and no direct comparison (e.g., via probing or state-similarity metrics) is given between PR-LSTM internal states and those of a standard LSTM on the same tasks. Without this, it remains unclear whether performance gains reflect retained expressivity or other factors.

    Authors: The referee correctly identifies that the current experiments do not include ablations isolating the tree schedule from the gated composition or direct state comparisons. We will add these in the revised manuscript: specifically, an ablation replacing the learned gated merge with a simple sum or concatenation, and probing experiments measuring state similarity or task-specific metrics between PR-LSTM and LSTM hidden states on the formal language tasks. This will help clarify the source of the observed performance. revision: yes

Circularity Check

0 steps flagged

No circularity: PR-LSTM defined independently of results

full rationale

The paper explicitly constructs the PR-LSTM as a fixed balanced-tree recursion schedule that applies a learned gated composition block to independently mapped token states. This architectural definition stands alone and does not depend on the formal-language benchmark outcomes or any self-citation chain for its validity. Performance claims are presented as empirical consequences of the tree-structured schedule rather than inputs that force the model equations. No load-bearing step reduces a prediction to a fitted parameter, renames a known result, or imports uniqueness via prior self-work. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven assumption that the learned composition function can be trained to emulate the sequential LSTM's state evolution when applied in tree order.

free parameters (1)
  • gated composition parameters
    The weights inside the learned gated merge block are fitted during training and directly determine whether the tree-structured updates preserve long-range state tracking.
axioms (1)
  • domain assumption The reduction pattern of parallel scans can be used as a fixed execution schedule for nonlinear gated state updates without loss of expressivity.
    Invoked when the paper states that the structure uses the reduction pattern as a fixed schedule rather than assuming associativity.

pith-pipeline@v0.9.0 · 5716 in / 1171 out tokens · 43173 ms · 2026-05-20T15:14:12.354213+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 7 internal anchors

  1. [1]

    [Beck u. a. 2024] BECK, Maximilian ; P ¨OPPEL, Korbinian ; SPANRING, Markus ; AUER, Andreas ; PRUDNIKOVA, Oleksandra ; KOPP, Michael ; KLAMBAUER, G ¨unter ; BRANDSTETTER, Johannes ; HOCHREITER, Sepp:xLSTM: Extended Long Short-Term Memory

  2. [2]

    xlstm: Extended long short-term memory

    – URL https: //arxiv.org/abs/2405.04517 [Blelloch 1990] BLELLOCH, Guy E.: Prefix Sums and Their Applications / School of Computer Science, Carnegie Mellon University. November 1990 (CMU-CS-90-190). – Forschungsbericht [Cho u. a. 2014] CHO, Kyunghyun ; MERRIENBOER, Bart van ; GULCEHRE, Caglar ; BAHDANAU, Dzmitry ; BOUGARES, Fethi ; SCHWENK, Holger ; BENGIO...

  3. [3]

    – URL https://arxiv.org/abs/1406.1078 [Choromanski u. a. 2020] CHOROMANSKI, Krzysztof ; LIKHOSHERSTOV, Valerii ; DOHAN, David ; SONG, Xingyou ; GANE, Andreea ; SARLOS, Tamas ; HAWKINS, Peter ; DAVIS, Jared ; MOHIUDDIN, Afroz ; KAISER, Lukasz u. a.: Rethinking attention with performers. In:arXiv preprint arXiv:2009.14794(2020) [Danieli u. a. 2025] DANIELI,...

  4. [4]

    ParaRNN: Unlocking parallel training of nonlinear RNNs for large language models, 2025

    – URLhttps://arxiv.org/abs/2510.21450 [Dao und Gu 2024] DAO, Tri ; GU, Albert: Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. In:Proceedings of the 41st International Conference on Machine Learning, JMLR.org, 2024 (ICML’24) [Del´etang u. a. 2023] DEL ´ETANG, Gr ´egoire ; RUOSS, Anian ; GRAU-MOYA,...

  5. [5]

    – URLhttps://arxiv.org/abs/2312.00752 [Gu u. a. 2022] GU, Albert ; GOEL, Karan ; R ´E, Christopher: Efficiently Modeling Long Sequences with Structured State Spaces. In:The International Conference on Learning Representations (ICLR), 2022 10 [Gupta u. a. 2022] GUPTA, Ankit ; GU, Albert ; BERANT, Jonathan: Diagonal state spaces are as effective as structur...

  6. [6]

    – ISBN 978-0-387-09766-4 [Hochreiter und Schmidhuber 1997] HOCHREITER, Sepp ; SCHMIDHUBER, J ¨urgen: Long Short- Term Memory

    – URL https: //doi.org/10.1007/978-0-387-09766-4_80. – ISBN 978-0-387-09766-4 [Hochreiter und Schmidhuber 1997] HOCHREITER, Sepp ; SCHMIDHUBER, J ¨urgen: Long Short- Term Memory. In:Neural computation9 (1997), Nr. 8, S. 1735–1780. – ISSN 0899-7667 [Katharopoulos u. a. 2020] KATHAROPOULOS, Angelos ; VYAS, Apoorv ; PAPPAS, Nikolaos ; FLEURET, Fran c ¸ois: T...

  7. [7]

    – URLhttps://arxiv.org/abs/1710.10777 [Peng u. a. 2023] PENG, Bo ; ALCAIDE, Eric ; ANTHONY, Quentin ; ALBALAK, Alon ; ARCAD- INHO, Samuel ; BIDERMAN, Stella ; CAO, Huanqi ; CHENG, Xin ; CHUNG, Michael ; GRELLA, Matteo ; GV, Kranthi K. ; HE, Xuzheng ; HOU, Haowen ; LIN, Jiaju ; KAZIENKO, Przemyslaw ; KOCON, Jan ; KONG, Jiaming ; KOPTYRA, Bartlomiej ; LAU, ...

  8. [8]

    – URLhttps://arxiv.org/abs/2305.13048 [Poli u. a. 2023] POLI, Michael ; MASSAROLI, Stefano ; NGUYEN, Eric ; FU, Daniel Y . ; DAO, Tri ; BACCUS, Stephen ; BENGIO, Yoshua ; ERMON, Stefano ; R ´E, Christopher:Hyena Hierarchy: Towards Larger Convolutional Language Models

  9. [9]

    – URL https://arxiv.org/abs/ 2302.10866 [Smith u. a. 2023] SMITH, Jimmy T. ; WARRINGTON, Andrew ; LINDERMAN, Scott: Simplified State Space Layers for Sequence Modeling. In:The Eleventh International Conference on Learning Representations, URLhttps://openreview.net/forum?id=Ai8Hw3AXqks, 2023 [Socher u. a. 2011] SOCHER, Richard ; LIN, Cliff Chiung-Yu ; NG, ...

  10. [10]

    – URLhttps://arxiv.org/abs/1606.07461 [Sun u. a. 2023] SUN, Yutao ; DONG, Li ; HUANG, Shaohan ; MA, Shuming ; XIA, Yuqing ; XUE, Jilong ; WANG, Jianyong ; WEI, Furu:Retentive Network: A Successor to Transformer for Large Language Models

  11. [11]

    – URLhttps://arxiv.org/abs/2307.08621 11 [Tai u. a. 2015] TAI, Kai S. ; SOCHER, Richard ; MANNING, Christopher D.:Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

  12. [12]

    – URL https: //arxiv.org/abs/1503.00075 [Vaswani u. a. 2017] VASWANI, Ashish ; SHAZEER, Noam ; PARMAR, Niki ; USZKOREIT, Jakob ; JONES, Llion ; GOMEZ, Aidan N. ; KAISER, Ł u. ; POLOSUKHIN, Illia: Attention is All you Need. In: GUYON, I. (Hrsg.) ; LUXBURG, U. V . (Hrsg.) ; BENGIO, S. (Hrsg.) ; WALLACH, H. (Hrsg.) ; FERGUS, R. (Hrsg.) ; VISHWANATHAN, S. (Hr...

  13. [13]

    – URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf 12 A Full results 0 20 40 60 80 1000 0.2 0.4 0.6 Even Pairs 0 20 40 60 800 0.5 1 1.5 Modular Arithmetic (Simple) 0 20 40 60 80 1000 0.2 0.4 0.6 Parity Check 0 20 40 60 80 1000 0.5 1 1.5 Cycle Navigation 0 100 200 300 400 5000 0.2 0.4 0.6 0.8 1 1.2 ...