LSTM Variants for Chaotic Dynamical Systems: An Empirical Study on the Lorenz Attractor

Ruslan Gokhman

arXiv: 2606.22662 · v1 · pith:TA22AFYFnew · submitted 2026-06-21 · 💻 cs.LG · cs.SY · eess.SY

Pith reviewed 2026-06-26 10:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY

keywords LSTM · BiLSTM · Lorenz attractor · chaotic dynamical systems · time series forecasting · Huber loss · empirical study · recurrent networks

The pith

Bidirectional LSTM trained with Huber loss scores highest among seven architectures on Lorenz attractor forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares seven recurrent and convolutional models for forecasting the Lorenz attractor under identical preprocessing, sequence length, and autoregressive rollout conditions. It reports leaderboard scores from 45.72 to 58.81 and identifies the BiLSTM with Huber loss as the top performer. Adding additive attention to a unidirectional LSTM reduces scores by more than ten points, while prepending a CNN front-end to LSTM or BiLSTM models yields no improvement and can slightly lower results. Per-pair RMSE analysis shows the BiLSTM family maintains better accuracy on harder test pairs where the attention model collapses.

Core claim

The BiLSTM trained with Huber loss reaches the highest leaderboard score of 58.81. Additive attention on the unidirectional baseline degrades performance by over ten points. Prepending a CNN front-end to LSTM or BiLSTM variants does not help and slightly hurts the score. The BiLSTM family generalizes better on the harder pairs while the LSTM plus attention model produces RMSE values up to 8.94 on pair 6.

What carries the argument

Controlled comparison of seven LSTM and convolutional architectures under shared preprocessing and rollout procedure on the Lorenz attractor.
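
The shared rollout procedure is not spelled out in the abstract, so the following is only a minimal sketch of what an autoregressive rollout generally looks like. The names `autoregressive_rollout`, `one_step`, and `toy_model` are hypothetical, and the toy linear extrapolator stands in for a trained network; none of this is taken from the paper's code.

```python
from typing import Callable, List, Sequence

State = Sequence[float]


def autoregressive_rollout(
    one_step: Callable[[List[State]], State],
    history: List[State],
    n_steps: int,
    seq_len: int,
) -> List[State]:
    """Predict n_steps ahead, feeding each prediction back in as input."""
    window = list(history[-seq_len:])  # the most recent seq_len observed states
    preds: List[State] = []
    for _ in range(n_steps):
        nxt = one_step(window)       # the model only ever sees the current window
        preds.append(nxt)
        window = window[1:] + [nxt]  # slide: drop the oldest state, append the prediction
    return preds


# Hypothetical stand-in for a trained model: linear extrapolation
# from the last two states in the window.
def toy_model(window: List[State]) -> State:
    prev, last = window[-2], window[-1]
    return tuple(2 * b - a for a, b in zip(prev, last))


history = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)]
preds = autoregressive_rollout(toy_model, history, n_steps=3, seq_len=2)
print(preds)
```

After the first step the window contains the model's own outputs, so any one-step error becomes part of the next input. That is why the rollout has to be identical across architectures for the comparison to be fair.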

If this is right

Bidirectional context improves generalization on difficult prediction pairs in chaotic regimes.
Additive attention mechanisms can cause large errors and collapse on harder test cases.
CNN front-ends provide no benefit and can reduce accuracy when added to LSTM or BiLSTM backbones.
Huber loss supports more robust training than standard losses in exponentially sensitive dynamical systems.
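
To make the Huber-loss point concrete, here is a small sketch of the standard Huber definition next to a squared-error loss. The threshold `delta=1.0` is an assumption (the abstract does not state the value used), and treating squared error as the "standard" baseline is also an assumption.

```python
def squared_loss(residual: float) -> float:
    """Half squared error: grows quadratically with the residual."""
    return 0.5 * residual ** 2


def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic for |r| <= delta, linear beyond it.

    delta=1.0 is an illustrative default, not a value from the paper.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)


# Compare the two losses on small and large residuals.
for r in (0.5, 1.0, 5.0, 20.0):
    print(f"residual={r:5.1f}  squared={squared_loss(r):7.3f}  huber={huber_loss(r):7.3f}")
```

The two losses agree for small residuals. For large ones the Huber loss grows only linearly, so a few badly diverged trajectory points contribute less to the gradient than they would under squared error.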

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benefit of bidirectionality may arise because access to future context during training helps stabilize representations of sensitive dynamics even when test rollouts are autoregressive.
The same controlled comparison setup could be applied to other chaotic systems such as the Rössler attractor to test whether the ranking of architectures transfers.
Attention may fail here because local temporal dependencies dominate over long-range ones in short-sequence chaotic forecasting.

Load-bearing premise

That sharing the same pre-processing, sequence length, and rollout procedure across models sufficiently isolates the contribution of each architecture choice without confounding effects from implementation details or random seeds.

What would settle it

Re-running all seven models with multiple independent random seeds and checking whether the ranking of BiLSTM-Huber over the other six configurations remains stable.
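
A check of this kind could be sketched as follows. The function `summarize` and the model names are hypothetical, and the scores are placeholder numbers chosen only to exercise the code; they are not results from the paper.

```python
from collections import Counter
from statistics import mean, stdev


def summarize(scores_by_model):
    """scores_by_model maps a model name to its scores, one per seed.

    Returns (mean, std) per model and a count of how often each model
    ranks first across seeds.
    """
    summary = {name: (mean(s), stdev(s)) for name, s in scores_by_model.items()}
    n_seeds = len(next(iter(scores_by_model.values())))
    winners = Counter(
        max(scores_by_model, key=lambda name: scores_by_model[name][i])
        for i in range(n_seeds)
    )
    return summary, winners


# PLACEHOLDER scores for two unnamed models over three seeds (not from the paper).
toy_scores = {
    "model_a": [58.0, 55.0, 57.0],
    "model_b": [56.0, 56.0, 54.0],
}
summary, winners = summarize(toy_scores)
for name, (m, s) in summary.items():
    print(f"{name}: mean={m:.2f} std={s:.2f} wins={winners[name]}")
```

If the top-ranked model changes from seed to seed, or if the gaps between means are comparable to the standard deviations, a single-run ranking cannot be trusted.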

Figures

Figures reproduced from arXiv: 2606.22662 by Ruslan Gokhman.

Original abstract

Forecasting chaotic dynamical systems such as the Lorenz attractor is notoriously difficult: small numerical errors are amplified exponentially over long autoregressive rollouts. We study seven recurrent and convolutional architectures for the AI-DEEDS 2026 Chaotic Systems Challenge: a vanilla LSTM, an LSTM with additive attention, a Bidirectional LSTM (BiLSTM), a BiLSTM trained with the Huber loss, a Temporal Convolutional Network (TCN), a CNN front-end followed by an LSTM, and a CNN front-end followed by a BiLSTM. All models share the same pre-processing, sequence length, and rollout procedure, isolating the contribution of each design choice. The challenge scores predictions on a 0-100 scale where higher is better. We obtain leaderboard scores between 45.72 and 58.81, with the BiLSTM trained with Huber loss being the strongest configuration. Two findings stand out: (i) adding additive attention to the unidirectional baseline degraded performance by over ten points, and (ii) prepending a CNN front-end to either an LSTM or a BiLSTM did not help and slightly hurt the score. Per-pair RMSE measurements confirm that the BiLSTM family generalizes better in the harder pairs (6-7), while the LSTM + Attention model collapses there (RMSE up to 8.94 on pair 6). We discuss why bidirectional context and a robust loss help in chaotic regimes while attention and CNN front-ends fail in this setting.
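
The error amplification described in the abstract's opening sentence can be illustrated with a short, self-contained sketch: integrate the Lorenz equations from two initial conditions that differ by 1e-8 and track how far apart they drift. The parameter values (sigma = 10, rho = 28, beta = 8/3), the RK4 integrator, the step size, and the starting point are all assumptions for illustration; the abstract does not state the challenge's settings.

```python
import math

# Classic Lorenz parameters, assumed here for illustration.
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0


def lorenz(s):
    """Right-hand side of the Lorenz equations."""
    x, y, z = s
    return (SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z)


def rk4_step(s, dt):
    """One fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * k for b, k in zip(base, slope))

    k1 = lorenz(s)
    k2 = lorenz(shifted(s, k1, dt / 2))
    k3 = lorenz(shifted(s, k2, dt / 2))
    k4 = lorenz(shifted(s, k3, dt))
    return tuple(
        si + dt / 6 * (a + 2 * b + 2 * c + d)
        for si, a, b, c, d in zip(s, k1, k2, k3, k4)
    )


def separation_history(eps=1e-8, dt=0.01, n_steps=2000):
    """Distance between two trajectories that start eps apart."""
    a = (1.0, 1.0, 1.0)
    b = (1.0 + eps, 1.0, 1.0)
    seps = [math.dist(a, b)]
    for _ in range(n_steps):
        a, b = rk4_step(a, dt), rk4_step(b, dt)
        seps.append(math.dist(a, b))
    return seps


seps = separation_history()
for step in (0, 500, 1000, 1500, 2000):
    print(f"t={step * 0.01:5.1f}  separation={seps[step]:.3e}")
```

A learned model's one-step error plays the same role as the initial perturbation here, which is why long autoregressive rollouts are hard regardless of architecture.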

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BiLSTM with Huber loss tops the single-run leaderboard on Lorenz forecasting, but missing seed variance and tests make the architecture rankings unreliable.

The paper's main result is that a BiLSTM trained with Huber loss scores 58.81 on the AI-DEEDS challenge, beating the other six setups. Attention added to a plain LSTM drops the score by more than 10 points, and sticking a CNN in front of either LSTM or BiLSTM slightly hurts performance. Per-pair RMSE numbers back this up by showing the BiLSTM family holds up better on the harder test pairs.

It does a clean job of holding preprocessing, sequence length, and rollout fixed so the differences can be attributed to the architecture choices. Reporting concrete leaderboard numbers and breaking them down by pair is useful for anyone who might try the same benchmark.

The soft spot is the lack of any error bars, multiple seeds, or statistical tests. Chaotic rollouts are sensitive to initialization and optimization noise, so a 10-point gap on a single run could easily flip with a different random seed. The abstract gives no training details either, which makes it hard to judge how much the ordering depends on implementation specifics.

This is a narrow empirical comparison on one attractor and one challenge. It does not introduce new methods or test broader claims about chaotic systems. Readers working on time-series models for similar benchmarks might find the numbers handy as a quick reference, but the work does not have enough robustness or scope to shift practice.

I would not bring it to a reading group. It does not deserve peer review; the central comparative claims rest on single realizations without the checks needed to trust the ordering.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically evaluates seven LSTM, BiLSTM, attention-augmented, and CNN-augmented architectures on the Lorenz attractor forecasting task from the AI-DEEDS 2026 challenge. All models use identical pre-processing, sequence length, and autoregressive rollout. Leaderboard scores range from 45.72 to 58.81, with BiLSTM trained under Huber loss ranked highest; additive attention degrades the unidirectional LSTM by >10 points and CNN front-ends slightly hurt both LSTM and BiLSTM. Per-pair RMSE values are supplied to show BiLSTM variants generalize better on harder pairs (6-7) while the attention model collapses (RMSE 8.94 on pair 6).

Significance. If the reported ordering proves robust, the study supplies concrete, challenge-derived evidence that bidirectional context and robust losses are advantageous for long-horizon chaotic prediction while attention and CNN front-ends are not. The per-pair RMSE numbers and explicit leaderboard scores constitute reproducible data points that future work can directly compare against.

major comments (2)

[Abstract and Results] Leaderboard scores (45.72–58.81) and per-pair RMSE values are single realizations with no standard deviations, no multi-seed averages, and no statistical significance tests. Because chaotic rollouts exponentially amplify initialization and optimization noise, the central claims (that BiLSTM+Huber is strongest, that attention drops >10 points, and that CNN front-ends hurt) cannot be distinguished from training stochasticity on the basis of the reported numbers.
[Experimental Setup] The assertion that identical pre-processing, sequence length, and rollout “isolate the contribution of each design choice” is not accompanied by any control for random-seed variance or implementation details. In the absence of such controls, the observed architecture ranking remains vulnerable to confounding, directly undermining the two highlighted findings.

minor comments (2)

[Abstract] Training hyperparameters, optimizer, learning-rate schedule, and number of epochs are not stated, making the reported scores difficult to reproduce even if seeds were supplied.
[Introduction] The manuscript would benefit from a short related-work paragraph situating the seven architectures against prior RNN/TCN studies on the Lorenz system.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the importance of statistical robustness in our empirical evaluation of LSTM variants on the Lorenz attractor. We agree that single-run results are insufficient to support the central claims given the sensitivity of chaotic forecasting to initialization and optimization noise. The revised manuscript will incorporate multi-seed experiments and statistical reporting to address these concerns directly.

Point-by-point responses

Referee: [Abstract and Results] Leaderboard scores (45.72–58.81) and per-pair RMSE values are single realizations with no standard deviations, no multi-seed averages, and no statistical significance tests. Because chaotic rollouts exponentially amplify initialization and optimization noise, the central claims (that BiLSTM+Huber is strongest, that attention drops >10 points, and that CNN front-ends hurt) cannot be distinguished from training stochasticity on the basis of the reported numbers.

Authors: We agree that the reported leaderboard scores and per-pair RMSE values are from single realizations and that this limits the strength of the claims. In the revision we will rerun every architecture with five independent random seeds, report mean scores with standard deviations for both the overall leaderboard metric and the per-pair RMSE values, and include statistical significance tests (paired t-tests across seeds) for the key comparisons. This will allow readers to assess whether the observed differences exceed training stochasticity.
revision: yes

Referee: [Experimental Setup] The assertion that identical pre-processing, sequence length, and rollout “isolate the contribution of each design choice” is not accompanied by any control for random-seed variance or implementation details. In the absence of such controls, the observed architecture ranking remains vulnerable to confounding, directly undermining the two highlighted findings.

Authors: We accept that fixing pre-processing, sequence length, and rollout procedure alone does not fully isolate architectural effects without also controlling random-seed variance. The revised manuscript will state explicitly that all models in the initial submission were trained with the same fixed seeds, and it will add the multi-seed protocol described above. We will also document the exact random-seed values, optimizer settings, and implementation framework to reduce confounding and strengthen the isolation argument.
revision: yes
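
The paired t-test across five seeds that the simulated rebuttal promises could look roughly like the sketch below. The function `paired_t` is a hypothetical helper and the scores are placeholders, not results from the paper. Pairing by seed assumes that seed i is shared between the two models being compared.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 5% critical value of Student's t with 4 degrees of freedom
# (five paired seeds).
T_CRIT_DF4 = 2.776


def paired_t(scores_a, scores_b):
    """t statistic for the per-seed differences between two models."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# PLACEHOLDER scores over five seeds (not from the paper).
scores_a = [58.0, 57.0, 59.0, 58.0, 60.0]
scores_b = [55.0, 56.0, 55.0, 57.0, 56.0]

t_stat = paired_t(scores_a, scores_b)
print(f"t = {t_stat:.3f}, significant at 5%: {abs(t_stat) > T_CRIT_DF4}")
```

With only five seeds the test has limited power, so a non-significant result would not show that two architectures are equivalent.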

Circularity Check

0 steps flagged

No circularity: purely empirical leaderboard comparison

The paper reports performance numbers obtained from an external AI-DEEDS challenge evaluation. No mathematical derivations, fitted parameters renamed as predictions, self-citations, or ansatzes are present in the abstract or described methodology. All architecture comparisons rely on shared preprocessing and external scoring, with no load-bearing step that reduces to its own inputs by construction. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard machine-learning training assumptions and the validity of the external challenge metric; no new entities are postulated.

free parameters (1)

model-specific hyperparameters
Hidden sizes, learning rates, and other training choices for each of the seven architectures are selected but not enumerated in the abstract.

axioms (1)

domain assumption: The AI-DEEDS 2026 Chaotic Systems Challenge 0-100 score is a reliable proxy for forecasting quality on the Lorenz attractor.
The paper adopts the leaderboard metric as the primary outcome without additional validation or sensitivity analysis.

Reference graph

Works this paper leans on

10 extracted references · 3 linked inside Pith

[1] E. N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963.
[2] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[4] P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[5] S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271, 2018.
[6] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. PNAS, 113(15):3932–3937, 2016.
[7] A. Yermakov, Y. Zhao, M. Denolle, Y. Ni, P. M. Wyder, J. Goldfeder, S. Riva, J. Williams, A. S. Rude, J. Germany, D. Zoro, M. Tomasetto, J. Bakarji, G. Maierhofer, M. Cranmer, and J. N. Kutz. The seismic wavefield common task framework. arXiv:2512.19927, 2025.
[8] S. Riva, C. Introini, A. Cammi, D. Price, A. Yermakov, Y. Zhao, P. M. Wyder, J. Goldfeder, J. Williams, A. S. Rude, M. Tomasetto, J. Germany, J. Bakarji, G. Maierhofer, M. Cranmer, and J. N. Kutz. CTF4Nuclear: Common task framework for nuclear fission and fusion models. arXiv:2605.15549, 2026.
[9] P. Wyder, J. Goldfeder, A. Yermakov, Y. Zhao, S. Riva, J. Williams, D. Zoro, A. Rude, M. Tomasetto, J. Germany, J. Bakarji, G. Maierhofer, M. Cranmer, and N. Kutz. Common task framework for a critical evaluation of scientific machine learning algorithms. In Advances in Neural Information Processing Systems 38 (NeurIPS 2025), Datasets and Benchmarks Track, 2025.
[10] S. Wang, H. Wu, X. Shi, T. Hu, H. Luo, L. Ma, J. Y. Zhang, and J. Zhou. TimeMixer: Decomposable multiscale mixing for time series forecasting. In ICLR, 2024.
