Parallel Recursive LSTM

Tristan Gaudreault; Yongyi Mao

arxiv: 2605.17108 · v1 · pith:ZTDTQHU5new · submitted 2026-05-16 · 💻 cs.LG

Parallel Recursive LSTM

Tristan Gaudreault , Yongyi Mao This is my paper

Pith reviewed 2026-05-20 15:14 UTC · model grok-4.3

classification 💻 cs.LG

keywords Parallel Recursive LSTMhierarchical recurrencelogarithmic depthgated state compositionsequence modelingformal language tasksparallel computationlength generalization

0 comments

The pith

The Parallel Recursive LSTM merges latent states recursively over a balanced tree to reduce recurrent computation depth from linear to logarithmic while preserving nonlinear gated updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Parallel Recursive LSTM as a way to reorganize recurrent computation for better parallelism. Instead of processing a sequence strictly left to right, it first maps each token to an independent latent state and then repeatedly applies a single learned gated composition block to merge pairs of states according to a fixed balanced binary tree. This keeps the nonlinear state-tracking dynamics that make LSTMs effective while cutting the number of sequential steps down to the logarithm of the sequence length. On formal-language tasks that test long-range dependencies, the resulting model generalizes to longer sequences than standard RNNs, LSTMs, and Transformers and avoids the quadratic cost of attention.

Core claim

By replacing the strictly sequential left-to-right recurrence of LSTMs with recursive nonlinear state composition over a fixed balanced computation tree, the PR-LSTM retains explicit nonlinear gated state representations and achieves logarithmic parallel depth. Tokens are mapped independently to latent states, which are then merged by a learned gated composition block applied according to the reduction pattern of parallel scans rather than an associative recurrence.

What carries the argument

The learned gated composition block that merges two latent state vectors into one, applied recursively according to the schedule of a balanced binary tree.

If this is right

Recurrent models can expose logarithmic rather than linear parallelism without forcing the transition function to be linear or associative.
State-tracking performance on formal-language benchmarks improves relative to both sequential LSTMs and quadratic-attention transformers.
Training and inference cost for long sequences scales better than attention while still supporting nonlinear dynamics.
The same hierarchical schedule can be applied to any gated recurrent cell that uses a composition operation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The fixed balanced-tree schedule could be relaxed to input-dependent trees if the composition block is made sufficiently expressive.
Similar hierarchical reorganization might improve efficiency in hybrid architectures that combine recurrence with limited attention.
The approach opens a route to training very deep recurrent networks without the usual vanishing-gradient problems associated with linear depth.

Load-bearing premise

That a single learned gated merge function applied repeatedly over a fixed balanced tree can preserve the full state-tracking expressivity of a conventional left-to-right LSTM.

What would settle it

A controlled experiment showing that PR-LSTM fails on long formal-language instances that a standard LSTM solves correctly, or that its accuracy degrades sharply once sequence length exceeds the training distribution.

Figures

Figures reproduced from arXiv: 2605.17108 by Tristan Gaudreault, Yongyi Mao.

**Figure 1.** Figure 1: Overview of PR-LSTM. can be implemented with scan-style primitives that accept user-defined binary operators, such as jax.lax.associative scan in JAX, or analogous framework-specific implementations. Rather than reproducing the exact dynamics of a sequential RNN, PR-LSTM learns hierarchical state compositions over the input sequence. Lower levels of the tree capture local interactions between neighboring r… view at source ↗

**Figure 2.** Figure 2: Definitions of the color-coded LSTM encoder modules used in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Training dynamics across six tasks. Training loss (y-axis) as a function of wall-clock training time (x-axis), with all runs trained for 40,000 iterations on a cluster job allocation with an NVIDIA H100, 1 CPU core, and 35 GB of system memory. Endpoint markers show completion times for this fixed training budget, highlighting relative training speed. 4.1 Main results We first evaluate whether hierarchical … view at source ↗

**Figure 4.** Figure 4: Scaling behavior of inference time and memory consumption with increasing sequence length. Colors follow [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Training dynamics across all fifteen tasks. Training loss (y-axis) as a function of wallclock training time (x-axis), with all runs trained for 40,000 iterations on a job allocation with an NVIDIA H100, 1 CPU core, and 35 GB of system memory. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

read the original abstract

Transformers have become the dominant architecture for sequence modeling by using self-attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long-context settings. Recurrent models such as LSTMs provide explicit nonlinear state updates and strong state-tracking capabilities, yet their strictly sequential computation limits parallelism. We introduce the Parallel Recursive LSTM (PR-LSTM), a hierarchical recurrent architecture that replaces left-to-right recurrence with recursive nonlinear state composition over a balanced computation tree. Tokens are first mapped independently to latent states, which are then recursively merged by a learned gated composition block. This structure uses the reduction pattern underlying parallel scans as a fixed execution schedule, rather than assuming an associative recurrence. As a result, PR-LSTM retains nonlinear gated state representations while reducing recurrent parallel depth from linear to logarithmic. Empirically, PR-LSTM achieves strong sequence-length generalization on formal-language benchmarks, solving more tasks than standard RNN, LSTM, and Transformer baselines, while avoiding the quadratic scaling of attention. These results suggest that recurrent computation can be reorganized hierarchically to expose parallelism without restricting the transition dynamics to linear or associative forms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PR-LSTM replaces linear LSTM recurrence with recursive gated merges on a balanced tree to reach log depth, but the changed mixing order leaves open whether it keeps the same state-tracking power on dependency-heavy tasks.

read the letter

The punchline is that this paper gives a concrete way to run LSTM-style recurrence in logarithmic parallel time using a fixed balanced tree of gated merges, and the early results on formal languages look promising for length generalization. What stands out as new is the choice to use a non-associative gated composition block applied recursively, rather than forcing the recurrence to be associative for parallel scan. The architecture first embeds tokens independently, then builds up the state bottom-up through the tree. This keeps the nonlinear state updates that make LSTMs good at tracking, while exposing more parallelism than a standard RNN. The paper does well at framing the problem: attention is quadratic, plain RNNs are sequential, and prior parallel RNNs often lose expressivity by assuming linearity or associativity. The empirical claim that it solves more formal language tasks than the baselines and generalizes better to longer sequences is the kind of evidence that matters here. The soft spot is the information flow difference the stress test points out. Each token's state gets mixed in at different levels of the tree, so the number of gating operations and the order of combinations differ from left-to-right processing. For tasks that rely on exact sequential accumulation, like maintaining a precise counter or handling deep nesting, it's not guaranteed that the learned merge recovers the same behavior. Since the manuscript details are light on equations and ablations in the abstract, this needs checking in the full version. This paper is for people working on efficient sequence models who want recurrent alternatives to transformers for long contexts. A reader who cares about state tracking in parallel settings would get value from the design and the benchmark results. It deserves a serious referee. The idea is solid enough and the potential impact on long-sequence modeling is real, even if revisions are needed on the analysis side. I recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Parallel Recursive LSTM (PR-LSTM), which maps tokens to latent states and then recursively merges them via a learned gated composition block over a fixed balanced binary tree (following the reduction pattern of parallel scans). This yields logarithmic parallel depth while claiming to retain the nonlinear gated state dynamics of standard LSTMs, and the manuscript reports that PR-LSTM solves more formal-language tasks with stronger sequence-length generalization than RNN, LSTM, and Transformer baselines.

Significance. If the empirical results and the retention of expressivity hold, the work would be significant for demonstrating that recurrent computation can be reorganized hierarchically to expose parallelism without forcing associativity or linear dynamics. Credit is due for the explicit use of a fixed balanced-tree schedule rather than learned or associative assumptions, and for the focus on formal-language benchmarks that test state tracking.

major comments (2)

[§3] §3 (architecture definition): the central claim that recursive application of one learned gated merge over a fixed balanced tree preserves or exceeds the state-tracking expressivity of left-to-right LSTM is not supported by any analysis of the changed information-flow graph. Each token participates in log n merges from disjoint subtrees rather than sequential incremental updates; nothing in the construction or the learned block is shown to recover the exact dynamics needed for tasks whose acceptance depends on strict left-to-right accumulation (e.g., precise counters). This is load-bearing for the retention-of-nonlinear-dynamics claim.
[Experiments section / Table 2] Experiments section / Table 2 (formal-language results): the reported superiority and sequence-length generalization lack ablations that isolate the effect of the tree schedule versus the gated block itself, and no direct comparison (e.g., via probing or state-similarity metrics) is given between PR-LSTM internal states and those of a standard LSTM on the same tasks. Without this, it remains unclear whether performance gains reflect retained expressivity or other factors.

minor comments (2)

The abstract and introduction would benefit from a brief explicit statement of the training objective and hyperparameter ranges used for the formal-language experiments to aid reproducibility.
Notation for the gated composition block (e.g., how the merge parameters are shared across tree levels) should be clarified in the first appearance to avoid ambiguity when reading the recursive definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the PR-LSTM architecture and experimental evaluation. We address each major comment in turn and indicate where revisions will be made to the manuscript.

read point-by-point responses

Referee: [§3] §3 (architecture definition): the central claim that recursive application of one learned gated merge over a fixed balanced tree preserves or exceeds the state-tracking expressivity of left-to-right LSTM is not supported by any analysis of the changed information-flow graph. Each token participates in log n merges from disjoint subtrees rather than sequential incremental updates; nothing in the construction or the learned block is shown to recover the exact dynamics needed for tasks whose acceptance depends on strict left-to-right accumulation (e.g., precise counters). This is load-bearing for the retention-of-nonlinear-dynamics claim.

Authors: We agree that the manuscript would benefit from a more detailed discussion of the information flow in the balanced tree structure. The gated merge block is parameterized similarly to an LSTM cell, allowing it to selectively update, forget, and output information from the two child states. In principle, this can learn to prioritize information from the left subtree to mimic left-to-right processing. However, we acknowledge the lack of explicit analysis or guarantees for exact equivalence on all tasks. In the revision, we will expand §3 with a description of the information propagation paths and note potential differences from sequential LSTMs for tasks requiring precise sequential accumulation. revision: yes
Referee: [Experiments section / Table 2] Experiments section / Table 2 (formal-language results): the reported superiority and sequence-length generalization lack ablations that isolate the effect of the tree schedule versus the gated block itself, and no direct comparison (e.g., via probing or state-similarity metrics) is given between PR-LSTM internal states and those of a standard LSTM on the same tasks. Without this, it remains unclear whether performance gains reflect retained expressivity or other factors.

Authors: The referee correctly identifies that the current experiments do not include ablations isolating the tree schedule from the gated composition or direct state comparisons. We will add these in the revised manuscript: specifically, an ablation replacing the learned gated merge with a simple sum or concatenation, and probing experiments measuring state similarity or task-specific metrics between PR-LSTM and LSTM hidden states on the formal language tasks. This will help clarify the source of the observed performance. revision: yes

Circularity Check

0 steps flagged

No circularity: PR-LSTM defined independently of results

full rationale

The paper explicitly constructs the PR-LSTM as a fixed balanced-tree recursion schedule that applies a learned gated composition block to independently mapped token states. This architectural definition stands alone and does not depend on the formal-language benchmark outcomes or any self-citation chain for its validity. Performance claims are presented as empirical consequences of the tree-structured schedule rather than inputs that force the model equations. No load-bearing step reduces a prediction to a fitted parameter, renames a known result, or imports uniqueness via prior self-work. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven assumption that the learned composition function can be trained to emulate the sequential LSTM's state evolution when applied in tree order.

free parameters (1)

gated composition parameters
The weights inside the learned gated merge block are fitted during training and directly determine whether the tree-structured updates preserve long-range state tracking.

axioms (1)

domain assumption The reduction pattern of parallel scans can be used as a fixed execution schedule for nonlinear gated state updates without loss of expressivity.
Invoked when the paper states that the structure uses the reduction pattern as a fixed schedule rather than assuming associativity.

pith-pipeline@v0.9.0 · 5716 in / 1171 out tokens · 43173 ms · 2026-05-20T15:14:12.354213+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 7 internal anchors

[1]

[Beck u. a. 2024] BECK, Maximilian ; P ¨OPPEL, Korbinian ; SPANRING, Markus ; AUER, Andreas ; PRUDNIKOVA, Oleksandra ; KOPP, Michael ; KLAMBAUER, G ¨unter ; BRANDSTETTER, Johannes ; HOCHREITER, Sepp:xLSTM: Extended Long Short-Term Memory

work page 2024
[2]

xlstm: Extended long short-term memory

– URL https: //arxiv.org/abs/2405.04517 [Blelloch 1990] BLELLOCH, Guy E.: Prefix Sums and Their Applications / School of Computer Science, Carnegie Mellon University. November 1990 (CMU-CS-90-190). – Forschungsbericht [Cho u. a. 2014] CHO, Kyunghyun ; MERRIENBOER, Bart van ; GULCEHRE, Caglar ; BAHDANAU, Dzmitry ; BOUGARES, Fethi ; SCHWENK, Holger ; BENGIO...

work page arXiv 1990
[3]

– URL https://arxiv.org/abs/1406.1078 [Choromanski u. a. 2020] CHOROMANSKI, Krzysztof ; LIKHOSHERSTOV, Valerii ; DOHAN, David ; SONG, Xingyou ; GANE, Andreea ; SARLOS, Tamas ; HAWKINS, Peter ; DAVIS, Jared ; MOHIUDDIN, Afroz ; KAISER, Lukasz u. a.: Rethinking attention with performers. In:arXiv preprint arXiv:2009.14794(2020) [Danieli u. a. 2025] DANIELI,...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[4]

ParaRNN: Unlocking parallel training of nonlinear RNNs for large language models, 2025

– URLhttps://arxiv.org/abs/2510.21450 [Dao und Gu 2024] DAO, Tri ; GU, Albert: Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. In:Proceedings of the 41st International Conference on Machine Learning, JMLR.org, 2024 (ICML’24) [Del´etang u. a. 2023] DEL ´ETANG, Gr ´egoire ; RUOSS, Anian ; GRAU-MOYA,...

work page arXiv 2024
[5]

– URLhttps://arxiv.org/abs/2312.00752 [Gu u. a. 2022] GU, Albert ; GOEL, Karan ; R ´E, Christopher: Efficiently Modeling Long Sequences with Structured State Spaces. In:The International Conference on Learning Representations (ICLR), 2022 10 [Gupta u. a. 2022] GUPTA, Ankit ; GU, Albert ; BERANT, Jonathan: Diagonal state spaces are as effective as structur...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

– ISBN 978-0-387-09766-4 [Hochreiter und Schmidhuber 1997] HOCHREITER, Sepp ; SCHMIDHUBER, J ¨urgen: Long Short- Term Memory

– URL https: //doi.org/10.1007/978-0-387-09766-4_80. – ISBN 978-0-387-09766-4 [Hochreiter und Schmidhuber 1997] HOCHREITER, Sepp ; SCHMIDHUBER, J ¨urgen: Long Short- Term Memory. In:Neural computation9 (1997), Nr. 8, S. 1735–1780. – ISSN 0899-7667 [Katharopoulos u. a. 2020] KATHAROPOULOS, Angelos ; VYAS, Apoorv ; PAPPAS, Nikolaos ; FLEURET, Fran c ¸ois: T...

work page doi:10.1007/978-0-387-09766-4_80 1997
[7]

– URLhttps://arxiv.org/abs/1710.10777 [Peng u. a. 2023] PENG, Bo ; ALCAIDE, Eric ; ANTHONY, Quentin ; ALBALAK, Alon ; ARCAD- INHO, Samuel ; BIDERMAN, Stella ; CAO, Huanqi ; CHENG, Xin ; CHUNG, Michael ; GRELLA, Matteo ; GV, Kranthi K. ; HE, Xuzheng ; HOU, Haowen ; LIN, Jiaju ; KAZIENKO, Przemyslaw ; KOCON, Jan ; KONG, Jiaming ; KOPTYRA, Bartlomiej ; LAU, ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

– URLhttps://arxiv.org/abs/2305.13048 [Poli u. a. 2023] POLI, Michael ; MASSAROLI, Stefano ; NGUYEN, Eric ; FU, Daniel Y . ; DAO, Tri ; BACCUS, Stephen ; BENGIO, Yoshua ; ERMON, Stefano ; R ´E, Christopher:Hyena Hierarchy: Towards Larger Convolutional Language Models

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

– URL https://arxiv.org/abs/ 2302.10866 [Smith u. a. 2023] SMITH, Jimmy T. ; WARRINGTON, Andrew ; LINDERMAN, Scott: Simplified State Space Layers for Sequence Modeling. In:The Eleventh International Conference on Learning Representations, URLhttps://openreview.net/forum?id=Ai8Hw3AXqks, 2023 [Socher u. a. 2011] SOCHER, Richard ; LIN, Cliff Chiung-Yu ; NG, ...

work page arXiv 2023
[10]

– URLhttps://arxiv.org/abs/1606.07461 [Sun u. a. 2023] SUN, Yutao ; DONG, Li ; HUANG, Shaohan ; MA, Shuming ; XIA, Yuqing ; XUE, Jilong ; WANG, Jianyong ; WEI, Furu:Retentive Network: A Successor to Transformer for Large Language Models

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

– URLhttps://arxiv.org/abs/2307.08621 11 [Tai u. a. 2015] TAI, Kai S. ; SOCHER, Richard ; MANNING, Christopher D.:Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

work page internal anchor Pith review Pith/arXiv arXiv 2015
[12]

– URL https: //arxiv.org/abs/1503.00075 [Vaswani u. a. 2017] VASWANI, Ashish ; SHAZEER, Noam ; PARMAR, Niki ; USZKOREIT, Jakob ; JONES, Llion ; GOMEZ, Aidan N. ; KAISER, Ł u. ; POLOSUKHIN, Illia: Attention is All you Need. In: GUYON, I. (Hrsg.) ; LUXBURG, U. V . (Hrsg.) ; BENGIO, S. (Hrsg.) ; WALLACH, H. (Hrsg.) ; FERGUS, R. (Hrsg.) ; VISHWANATHAN, S. (Hr...

work page internal anchor Pith review Pith/arXiv arXiv 2017
[13]

– URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf 12 A Full results 0 20 40 60 80 1000 0.2 0.4 0.6 Even Pairs 0 20 40 60 800 0.5 1 1.5 Modular Arithmetic (Simple) 0 20 40 60 80 1000 0.2 0.4 0.6 Parity Check 0 20 40 60 80 1000 0.5 1 1.5 Cycle Navigation 0 100 200 300 400 5000 0.2 0.4 0.6 0.8 1 1.2 ...

work page 2017

[1] [1]

[Beck u. a. 2024] BECK, Maximilian ; P ¨OPPEL, Korbinian ; SPANRING, Markus ; AUER, Andreas ; PRUDNIKOVA, Oleksandra ; KOPP, Michael ; KLAMBAUER, G ¨unter ; BRANDSTETTER, Johannes ; HOCHREITER, Sepp:xLSTM: Extended Long Short-Term Memory

work page 2024

[2] [2]

xlstm: Extended long short-term memory

– URL https: //arxiv.org/abs/2405.04517 [Blelloch 1990] BLELLOCH, Guy E.: Prefix Sums and Their Applications / School of Computer Science, Carnegie Mellon University. November 1990 (CMU-CS-90-190). – Forschungsbericht [Cho u. a. 2014] CHO, Kyunghyun ; MERRIENBOER, Bart van ; GULCEHRE, Caglar ; BAHDANAU, Dzmitry ; BOUGARES, Fethi ; SCHWENK, Holger ; BENGIO...

work page arXiv 1990

[3] [3]

– URL https://arxiv.org/abs/1406.1078 [Choromanski u. a. 2020] CHOROMANSKI, Krzysztof ; LIKHOSHERSTOV, Valerii ; DOHAN, David ; SONG, Xingyou ; GANE, Andreea ; SARLOS, Tamas ; HAWKINS, Peter ; DAVIS, Jared ; MOHIUDDIN, Afroz ; KAISER, Lukasz u. a.: Rethinking attention with performers. In:arXiv preprint arXiv:2009.14794(2020) [Danieli u. a. 2025] DANIELI,...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[4] [4]

ParaRNN: Unlocking parallel training of nonlinear RNNs for large language models, 2025

– URLhttps://arxiv.org/abs/2510.21450 [Dao und Gu 2024] DAO, Tri ; GU, Albert: Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. In:Proceedings of the 41st International Conference on Machine Learning, JMLR.org, 2024 (ICML’24) [Del´etang u. a. 2023] DEL ´ETANG, Gr ´egoire ; RUOSS, Anian ; GRAU-MOYA,...

work page arXiv 2024

[5] [5]

– URLhttps://arxiv.org/abs/2312.00752 [Gu u. a. 2022] GU, Albert ; GOEL, Karan ; R ´E, Christopher: Efficiently Modeling Long Sequences with Structured State Spaces. In:The International Conference on Learning Representations (ICLR), 2022 10 [Gupta u. a. 2022] GUPTA, Ankit ; GU, Albert ; BERANT, Jonathan: Diagonal state spaces are as effective as structur...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

– ISBN 978-0-387-09766-4 [Hochreiter und Schmidhuber 1997] HOCHREITER, Sepp ; SCHMIDHUBER, J ¨urgen: Long Short- Term Memory

– URL https: //doi.org/10.1007/978-0-387-09766-4_80. – ISBN 978-0-387-09766-4 [Hochreiter und Schmidhuber 1997] HOCHREITER, Sepp ; SCHMIDHUBER, J ¨urgen: Long Short- Term Memory. In:Neural computation9 (1997), Nr. 8, S. 1735–1780. – ISSN 0899-7667 [Katharopoulos u. a. 2020] KATHAROPOULOS, Angelos ; VYAS, Apoorv ; PAPPAS, Nikolaos ; FLEURET, Fran c ¸ois: T...

work page doi:10.1007/978-0-387-09766-4_80 1997

[7] [7]

– URLhttps://arxiv.org/abs/1710.10777 [Peng u. a. 2023] PENG, Bo ; ALCAIDE, Eric ; ANTHONY, Quentin ; ALBALAK, Alon ; ARCAD- INHO, Samuel ; BIDERMAN, Stella ; CAO, Huanqi ; CHENG, Xin ; CHUNG, Michael ; GRELLA, Matteo ; GV, Kranthi K. ; HE, Xuzheng ; HOU, Haowen ; LIN, Jiaju ; KAZIENKO, Przemyslaw ; KOCON, Jan ; KONG, Jiaming ; KOPTYRA, Bartlomiej ; LAU, ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

– URLhttps://arxiv.org/abs/2305.13048 [Poli u. a. 2023] POLI, Michael ; MASSAROLI, Stefano ; NGUYEN, Eric ; FU, Daniel Y . ; DAO, Tri ; BACCUS, Stephen ; BENGIO, Yoshua ; ERMON, Stefano ; R ´E, Christopher:Hyena Hierarchy: Towards Larger Convolutional Language Models

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

– URL https://arxiv.org/abs/ 2302.10866 [Smith u. a. 2023] SMITH, Jimmy T. ; WARRINGTON, Andrew ; LINDERMAN, Scott: Simplified State Space Layers for Sequence Modeling. In:The Eleventh International Conference on Learning Representations, URLhttps://openreview.net/forum?id=Ai8Hw3AXqks, 2023 [Socher u. a. 2011] SOCHER, Richard ; LIN, Cliff Chiung-Yu ; NG, ...

work page arXiv 2023

[10] [10]

– URLhttps://arxiv.org/abs/1606.07461 [Sun u. a. 2023] SUN, Yutao ; DONG, Li ; HUANG, Shaohan ; MA, Shuming ; XIA, Yuqing ; XUE, Jilong ; WANG, Jianyong ; WEI, Furu:Retentive Network: A Successor to Transformer for Large Language Models

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

– URLhttps://arxiv.org/abs/2307.08621 11 [Tai u. a. 2015] TAI, Kai S. ; SOCHER, Richard ; MANNING, Christopher D.:Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

work page internal anchor Pith review Pith/arXiv arXiv 2015

[12] [12]

– URL https: //arxiv.org/abs/1503.00075 [Vaswani u. a. 2017] VASWANI, Ashish ; SHAZEER, Noam ; PARMAR, Niki ; USZKOREIT, Jakob ; JONES, Llion ; GOMEZ, Aidan N. ; KAISER, Ł u. ; POLOSUKHIN, Illia: Attention is All you Need. In: GUYON, I. (Hrsg.) ; LUXBURG, U. V . (Hrsg.) ; BENGIO, S. (Hrsg.) ; WALLACH, H. (Hrsg.) ; FERGUS, R. (Hrsg.) ; VISHWANATHAN, S. (Hr...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[13] [13]

– URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf 12 A Full results 0 20 40 60 80 1000 0.2 0.4 0.6 Even Pairs 0 20 40 60 800 0.5 1 1.5 Modular Arithmetic (Simple) 0 20 40 60 80 1000 0.2 0.4 0.6 Parity Check 0 20 40 60 80 1000 0.5 1 1.5 Cycle Navigation 0 100 200 300 400 5000 0.2 0.4 0.6 0.8 1 1.2 ...

work page 2017