Parallel Recursive LSTM
Pith reviewed 2026-05-20 15:14 UTC · model grok-4.3
The pith
The Parallel Recursive LSTM merges latent states recursively over a balanced tree to reduce recurrent computation depth from linear to logarithmic while preserving nonlinear gated updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By replacing the strictly sequential left-to-right recurrence of LSTMs with recursive nonlinear state composition over a fixed balanced computation tree, the PR-LSTM retains explicit nonlinear gated state representations and achieves logarithmic parallel depth. Tokens are mapped independently to latent states, which are then merged by a learned gated composition block applied according to the reduction pattern of parallel scans rather than an associative recurrence.
What carries the argument
The learned gated composition block that merges two latent state vectors into one, applied recursively according to the schedule of a balanced binary tree.
If this is right
- Recurrent models can expose logarithmic rather than linear parallelism without forcing the transition function to be linear or associative.
- State-tracking performance on formal-language benchmarks improves relative to both sequential LSTMs and quadratic-attention transformers.
- Training and inference cost for long sequences scales better than attention while still supporting nonlinear dynamics.
- The same hierarchical schedule can be applied to any gated recurrent cell that uses a composition operation.
Where Pith is reading between the lines
- The fixed balanced-tree schedule could be relaxed to input-dependent trees if the composition block is made sufficiently expressive.
- Similar hierarchical reorganization might improve efficiency in hybrid architectures that combine recurrence with limited attention.
- The approach opens a route to training very deep recurrent networks without the usual vanishing-gradient problems associated with linear depth.
Load-bearing premise
That a single learned gated merge function applied repeatedly over a fixed balanced tree can preserve the full state-tracking expressivity of a conventional left-to-right LSTM.
What would settle it
A controlled experiment showing that PR-LSTM fails on long formal-language instances that a standard LSTM solves correctly, or that its accuracy degrades sharply once sequence length exceeds the training distribution.
Figures
read the original abstract
Transformers have become the dominant architecture for sequence modeling by using self-attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long-context settings. Recurrent models such as LSTMs provide explicit nonlinear state updates and strong state-tracking capabilities, yet their strictly sequential computation limits parallelism. We introduce the Parallel Recursive LSTM (PR-LSTM), a hierarchical recurrent architecture that replaces left-to-right recurrence with recursive nonlinear state composition over a balanced computation tree. Tokens are first mapped independently to latent states, which are then recursively merged by a learned gated composition block. This structure uses the reduction pattern underlying parallel scans as a fixed execution schedule, rather than assuming an associative recurrence. As a result, PR-LSTM retains nonlinear gated state representations while reducing recurrent parallel depth from linear to logarithmic. Empirically, PR-LSTM achieves strong sequence-length generalization on formal-language benchmarks, solving more tasks than standard RNN, LSTM, and Transformer baselines, while avoiding the quadratic scaling of attention. These results suggest that recurrent computation can be reorganized hierarchically to expose parallelism without restricting the transition dynamics to linear or associative forms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Parallel Recursive LSTM (PR-LSTM), which maps tokens to latent states and then recursively merges them via a learned gated composition block over a fixed balanced binary tree (following the reduction pattern of parallel scans). This yields logarithmic parallel depth while claiming to retain the nonlinear gated state dynamics of standard LSTMs, and the manuscript reports that PR-LSTM solves more formal-language tasks with stronger sequence-length generalization than RNN, LSTM, and Transformer baselines.
Significance. If the empirical results and the retention of expressivity hold, the work would be significant for demonstrating that recurrent computation can be reorganized hierarchically to expose parallelism without forcing associativity or linear dynamics. Credit is due for the explicit use of a fixed balanced-tree schedule rather than learned or associative assumptions, and for the focus on formal-language benchmarks that test state tracking.
major comments (2)
- [§3] §3 (architecture definition): the central claim that recursive application of one learned gated merge over a fixed balanced tree preserves or exceeds the state-tracking expressivity of left-to-right LSTM is not supported by any analysis of the changed information-flow graph. Each token participates in log n merges from disjoint subtrees rather than sequential incremental updates; nothing in the construction or the learned block is shown to recover the exact dynamics needed for tasks whose acceptance depends on strict left-to-right accumulation (e.g., precise counters). This is load-bearing for the retention-of-nonlinear-dynamics claim.
- [Experiments section / Table 2] Experiments section / Table 2 (formal-language results): the reported superiority and sequence-length generalization lack ablations that isolate the effect of the tree schedule versus the gated block itself, and no direct comparison (e.g., via probing or state-similarity metrics) is given between PR-LSTM internal states and those of a standard LSTM on the same tasks. Without this, it remains unclear whether performance gains reflect retained expressivity or other factors.
minor comments (2)
- The abstract and introduction would benefit from a brief explicit statement of the training objective and hyperparameter ranges used for the formal-language experiments to aid reproducibility.
- Notation for the gated composition block (e.g., how the merge parameters are shared across tree levels) should be clarified in the first appearance to avoid ambiguity when reading the recursive definition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the PR-LSTM architecture and experimental evaluation. We address each major comment in turn and indicate where revisions will be made to the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (architecture definition): the central claim that recursive application of one learned gated merge over a fixed balanced tree preserves or exceeds the state-tracking expressivity of left-to-right LSTM is not supported by any analysis of the changed information-flow graph. Each token participates in log n merges from disjoint subtrees rather than sequential incremental updates; nothing in the construction or the learned block is shown to recover the exact dynamics needed for tasks whose acceptance depends on strict left-to-right accumulation (e.g., precise counters). This is load-bearing for the retention-of-nonlinear-dynamics claim.
Authors: We agree that the manuscript would benefit from a more detailed discussion of the information flow in the balanced tree structure. The gated merge block is parameterized similarly to an LSTM cell, allowing it to selectively update, forget, and output information from the two child states. In principle, this can learn to prioritize information from the left subtree to mimic left-to-right processing. However, we acknowledge the lack of explicit analysis or guarantees for exact equivalence on all tasks. In the revision, we will expand §3 with a description of the information propagation paths and note potential differences from sequential LSTMs for tasks requiring precise sequential accumulation. revision: yes
-
Referee: [Experiments section / Table 2] Experiments section / Table 2 (formal-language results): the reported superiority and sequence-length generalization lack ablations that isolate the effect of the tree schedule versus the gated block itself, and no direct comparison (e.g., via probing or state-similarity metrics) is given between PR-LSTM internal states and those of a standard LSTM on the same tasks. Without this, it remains unclear whether performance gains reflect retained expressivity or other factors.
Authors: The referee correctly identifies that the current experiments do not include ablations isolating the tree schedule from the gated composition or direct state comparisons. We will add these in the revised manuscript: specifically, an ablation replacing the learned gated merge with a simple sum or concatenation, and probing experiments measuring state similarity or task-specific metrics between PR-LSTM and LSTM hidden states on the formal language tasks. This will help clarify the source of the observed performance. revision: yes
Circularity Check
No circularity: PR-LSTM defined independently of results
full rationale
The paper explicitly constructs the PR-LSTM as a fixed balanced-tree recursion schedule that applies a learned gated composition block to independently mapped token states. This architectural definition stands alone and does not depend on the formal-language benchmark outcomes or any self-citation chain for its validity. Performance claims are presented as empirical consequences of the tree-structured schedule rather than inputs that force the model equations. No load-bearing step reduces a prediction to a fitted parameter, renames a known result, or imports uniqueness via prior self-work. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- gated composition parameters
axioms (1)
- domain assumption The reduction pattern of parallel scans can be used as a fixed execution schedule for nonlinear gated state updates without loss of expressivity.
Reference graph
Works this paper leans on
-
[1]
[Beck u. a. 2024] BECK, Maximilian ; P ¨OPPEL, Korbinian ; SPANRING, Markus ; AUER, Andreas ; PRUDNIKOVA, Oleksandra ; KOPP, Michael ; KLAMBAUER, G ¨unter ; BRANDSTETTER, Johannes ; HOCHREITER, Sepp:xLSTM: Extended Long Short-Term Memory
work page 2024
-
[2]
xlstm: Extended long short-term memory
– URL https: //arxiv.org/abs/2405.04517 [Blelloch 1990] BLELLOCH, Guy E.: Prefix Sums and Their Applications / School of Computer Science, Carnegie Mellon University. November 1990 (CMU-CS-90-190). – Forschungsbericht [Cho u. a. 2014] CHO, Kyunghyun ; MERRIENBOER, Bart van ; GULCEHRE, Caglar ; BAHDANAU, Dzmitry ; BOUGARES, Fethi ; SCHWENK, Holger ; BENGIO...
-
[3]
– URL https://arxiv.org/abs/1406.1078 [Choromanski u. a. 2020] CHOROMANSKI, Krzysztof ; LIKHOSHERSTOV, Valerii ; DOHAN, David ; SONG, Xingyou ; GANE, Andreea ; SARLOS, Tamas ; HAWKINS, Peter ; DAVIS, Jared ; MOHIUDDIN, Afroz ; KAISER, Lukasz u. a.: Rethinking attention with performers. In:arXiv preprint arXiv:2009.14794(2020) [Danieli u. a. 2025] DANIELI,...
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[4]
ParaRNN: Unlocking parallel training of nonlinear RNNs for large language models, 2025
– URLhttps://arxiv.org/abs/2510.21450 [Dao und Gu 2024] DAO, Tri ; GU, Albert: Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. In:Proceedings of the 41st International Conference on Machine Learning, JMLR.org, 2024 (ICML’24) [Del´etang u. a. 2023] DEL ´ETANG, Gr ´egoire ; RUOSS, Anian ; GRAU-MOYA,...
-
[5]
– URLhttps://arxiv.org/abs/2312.00752 [Gu u. a. 2022] GU, Albert ; GOEL, Karan ; R ´E, Christopher: Efficiently Modeling Long Sequences with Structured State Spaces. In:The International Conference on Learning Representations (ICLR), 2022 10 [Gupta u. a. 2022] GUPTA, Ankit ; GU, Albert ; BERANT, Jonathan: Diagonal state spaces are as effective as structur...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
– URL https: //doi.org/10.1007/978-0-387-09766-4_80. – ISBN 978-0-387-09766-4 [Hochreiter und Schmidhuber 1997] HOCHREITER, Sepp ; SCHMIDHUBER, J ¨urgen: Long Short- Term Memory. In:Neural computation9 (1997), Nr. 8, S. 1735–1780. – ISSN 0899-7667 [Katharopoulos u. a. 2020] KATHAROPOULOS, Angelos ; VYAS, Apoorv ; PAPPAS, Nikolaos ; FLEURET, Fran c ¸ois: T...
-
[7]
– URLhttps://arxiv.org/abs/1710.10777 [Peng u. a. 2023] PENG, Bo ; ALCAIDE, Eric ; ANTHONY, Quentin ; ALBALAK, Alon ; ARCAD- INHO, Samuel ; BIDERMAN, Stella ; CAO, Huanqi ; CHENG, Xin ; CHUNG, Michael ; GRELLA, Matteo ; GV, Kranthi K. ; HE, Xuzheng ; HOU, Haowen ; LIN, Jiaju ; KAZIENKO, Przemyslaw ; KOCON, Jan ; KONG, Jiaming ; KOPTYRA, Bartlomiej ; LAU, ...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
– URLhttps://arxiv.org/abs/2305.13048 [Poli u. a. 2023] POLI, Michael ; MASSAROLI, Stefano ; NGUYEN, Eric ; FU, Daniel Y . ; DAO, Tri ; BACCUS, Stephen ; BENGIO, Yoshua ; ERMON, Stefano ; R ´E, Christopher:Hyena Hierarchy: Towards Larger Convolutional Language Models
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
– URL https://arxiv.org/abs/ 2302.10866 [Smith u. a. 2023] SMITH, Jimmy T. ; WARRINGTON, Andrew ; LINDERMAN, Scott: Simplified State Space Layers for Sequence Modeling. In:The Eleventh International Conference on Learning Representations, URLhttps://openreview.net/forum?id=Ai8Hw3AXqks, 2023 [Socher u. a. 2011] SOCHER, Richard ; LIN, Cliff Chiung-Yu ; NG, ...
-
[10]
– URLhttps://arxiv.org/abs/1606.07461 [Sun u. a. 2023] SUN, Yutao ; DONG, Li ; HUANG, Shaohan ; MA, Shuming ; XIA, Yuqing ; XUE, Jilong ; WANG, Jianyong ; WEI, Furu:Retentive Network: A Successor to Transformer for Large Language Models
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
– URLhttps://arxiv.org/abs/2307.08621 11 [Tai u. a. 2015] TAI, Kai S. ; SOCHER, Richard ; MANNING, Christopher D.:Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[12]
– URL https: //arxiv.org/abs/1503.00075 [Vaswani u. a. 2017] VASWANI, Ashish ; SHAZEER, Noam ; PARMAR, Niki ; USZKOREIT, Jakob ; JONES, Llion ; GOMEZ, Aidan N. ; KAISER, Ł u. ; POLOSUKHIN, Illia: Attention is All you Need. In: GUYON, I. (Hrsg.) ; LUXBURG, U. V . (Hrsg.) ; BENGIO, S. (Hrsg.) ; WALLACH, H. (Hrsg.) ; FERGUS, R. (Hrsg.) ; VISHWANATHAN, S. (Hr...
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[13]
– URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf 12 A Full results 0 20 40 60 80 1000 0.2 0.4 0.6 Even Pairs 0 20 40 60 800 0.5 1 1.5 Modular Arithmetic (Simple) 0 20 40 60 80 1000 0.2 0.4 0.6 Parity Check 0 20 40 60 80 1000 0.5 1 1.5 Cycle Navigation 0 100 200 300 400 5000 0.2 0.4 0.6 0.8 1 1.2 ...
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.