LSTM Variants for Chaotic Dynamical Systems: An Empirical Study on the Lorenz Attractor
Pith reviewed 2026-06-26 10:26 UTC · model grok-4.3
The pith
Bidirectional LSTM trained with Huber loss scores highest among seven architectures on Lorenz attractor forecasting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The BiLSTM trained with Huber loss reaches the highest leaderboard score of 58.81. Additive attention on the unidirectional baseline degrades performance by over ten points. Prepending a CNN front-end to LSTM or BiLSTM variants does not help and slightly hurts the score. The BiLSTM family generalizes better on the harder pairs while the LSTM plus attention model produces RMSE values up to 8.94 on pair 6.
What carries the argument
Controlled comparison of seven LSTM and convolutional architectures under shared preprocessing and rollout procedure on the Lorenz attractor.
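The shared rollout procedure is what makes the comparison controlled: every model forecasts by feeding its own predictions back in as input. The sketch below shows that loop and the RMSE used for the per-pair numbers. It is a minimal illustration, not the authors' code; `predict_next`, `autoregressive_rollout`, and `rmse` are hypothetical names, and the window shape `(seq_len, dim)` is an assumption.

```python
import numpy as np

def autoregressive_rollout(predict_next, history, n_steps):
    """Roll a one-step predictor forward by feeding predictions back as input.

    predict_next: callable mapping a window of shape (seq_len, dim)
                  to the next state of shape (dim,). Stands in for any
                  trained model (hypothetical interface).
    history:      initial window of observed states, shape (seq_len, dim).
    """
    window = np.array(history, dtype=float)
    preds = np.empty((n_steps, window.shape[1]))
    for t in range(n_steps):
        nxt = np.asarray(predict_next(window), dtype=float)
        preds[t] = nxt
        # Drop the oldest state and append the model's own prediction.
        window = np.vstack([window[1:], nxt])
    return preds

def rmse(pred, truth):
    """Root-mean-square error over all time steps and state dimensions."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))
```

Because each prediction becomes part of the next input window, any one-step error is carried forward, which is why a fixed rollout procedure matters when comparing architectures.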
If this is right
- Bidirectional context improves generalization on difficult prediction pairs in chaotic regimes.
- Additive attention mechanisms can cause large errors and collapse on harder test cases.
- CNN front-ends provide no benefit and can reduce accuracy when added to LSTM or BiLSTM backbones.
- Huber loss supports more robust training than standard losses in exponentially sensitive dynamical systems.
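For reference on the last point, the Huber loss is quadratic for small residuals and linear for large ones, so one badly diverged step contributes less than it would under mean squared error. The sketch below is a minimal NumPy illustration, not the paper's training code; the threshold `delta=1.0` and the function names are assumptions, since the value the authors used is not stated here.

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: 0.5*e^2 where |e| <= delta, delta*(|e| - 0.5*delta) otherwise."""
    err = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

def mse_loss(pred, target):
    """Mean squared error, shown only for comparison."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))
```

With `delta=1.0`, a residual of 3 costs 2.5 under Huber but 9 under MSE, which illustrates how large rollout errors are down-weighted.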
Where Pith is reading between the lines
- The benefit of bidirectionality may arise because access to future context during training helps stabilize representations of sensitive dynamics even when test rollouts are autoregressive.
- The same controlled comparison setup could be applied to other chaotic systems such as the Rössler attractor to test whether the ranking of architectures transfers.
- Attention may fail here because local temporal dependencies dominate over long-range ones in short-sequence chaotic forecasting.
Load-bearing premise
That sharing the same pre-processing, sequence length, and rollout procedure across models sufficiently isolates the contribution of each architecture choice without confounding effects from implementation details or random seeds.
What would settle it
Re-running all seven models with multiple independent random seeds and checking whether the ranking of BiLSTM-Huber over the other six configurations remains stable.
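One way such a stability check could be summarized is the fraction of seeds in which each configuration ranks first. The sketch below assumes a score table of shape `(n_seeds, n_models)` where higher is better; `top_model_stability` is a hypothetical helper, and no scores from the paper are embedded in it.

```python
import numpy as np

def top_model_stability(scores, model_names):
    """Fraction of seeds in which each model achieves the highest score.

    scores:      array-like of shape (n_seeds, n_models), higher is better.
    model_names: list of n_models labels, in column order.
    """
    scores = np.asarray(scores, dtype=float)
    if scores.ndim != 2 or scores.shape[1] != len(model_names):
        raise ValueError("scores must have shape (n_seeds, n_models)")
    # Index of the best model for each seed (ties go to the first column).
    winners = np.argmax(scores, axis=1)
    counts = np.bincount(winners, minlength=scores.shape[1])
    return {name: float(counts[i]) / scores.shape[0]
            for i, name in enumerate(model_names)}
```

A ranking in which the same configuration wins in every seed would support the headline claim; a split across configurations would suggest the single-run ordering reflects training noise.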
Original abstract
Forecasting chaotic dynamical systems such as the Lorenz attractor is notoriously difficult: small numerical errors are amplified exponentially over long autoregressive rollouts. We study seven recurrent and convolutional architectures for the AI-DEEDS 2026 Chaotic Systems Challenge: a vanilla LSTM, an LSTM with additive attention, a Bidirectional LSTM (BiLSTM), a BiLSTM trained with the Huber loss, a Temporal Convolutional Network (TCN), a CNN front-end followed by an LSTM, and a CNN front-end followed by a BiLSTM. All models share the same pre-processing, sequence length, and rollout procedure, isolating the contribution of each design choice. The challenge scores predictions on a 0-100 scale where higher is better. We obtain leaderboard scores between 45.72 and 58.81, with the BiLSTM trained with Huber loss being the strongest configuration. Two findings stand out: (i) adding additive attention to the unidirectional baseline degraded performance by over ten points, and (ii) prepending a CNN front-end to either an LSTM or a BiLSTM did not help and slightly hurt the score. Per-pair RMSE measurements confirm that the BiLSTM family generalizes better in the harder pairs (6-7), while the LSTM + Attention model collapses there (RMSE up to 8.94 on pair 6). We discuss why bidirectional context and a robust loss help in chaotic regimes while attention and CNN front-ends fail in this setting.
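The exponential amplification of small errors described in the abstract can be seen directly by integrating two Lorenz trajectories that start almost at the same point. The sketch below uses a fourth-order Runge-Kutta step with the classic parameters sigma = 10, rho = 28, beta = 8/3; these values, the step size, and the initial condition are assumptions, since the challenge's data-generation settings are not given here.

```python
import numpy as np

def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (classic parameters assumed)."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(state + 0.5 * dt * k1)
    k3 = lorenz_rhs(state + 0.5 * dt * k2)
    k4 = lorenz_rhs(state + dt * k3)
    return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def separation_over_time(x0, eps=1e-8, dt=0.01, n_steps=2500):
    """Distance between two trajectories whose initial x differs by eps."""
    a = np.asarray(x0, dtype=float)
    b = a + np.array([eps, 0.0, 0.0])
    seps = np.empty(n_steps + 1)
    seps[0] = np.linalg.norm(a - b)
    for i in range(1, n_steps + 1):
        a = rk4_step(a, dt)
        b = rk4_step(b, dt)
        seps[i] = np.linalg.norm(a - b)
    return seps
```

Plotting `separation_over_time([1.0, 1.0, 1.0])` on a logarithmic axis shows roughly linear growth until the separation saturates at the size of the attractor. A learned model's one-step error plays the role of `eps` in an autoregressive rollout.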
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript empirically evaluates seven LSTM, BiLSTM, attention-augmented, and CNN-augmented architectures on the Lorenz attractor forecasting task from the AI-DEEDS 2026 challenge. All models use identical pre-processing, sequence length, and autoregressive rollout. Leaderboard scores range from 45.72 to 58.81, with BiLSTM trained under Huber loss ranked highest; additive attention degrades the unidirectional LSTM by >10 points and CNN front-ends slightly hurt both LSTM and BiLSTM. Per-pair RMSE values are supplied to show BiLSTM variants generalize better on harder pairs (6-7) while the attention model collapses (RMSE 8.94 on pair 6).
Significance. If the reported ordering proves robust, the study supplies concrete, challenge-derived evidence that bidirectional context and robust losses are advantageous for long-horizon chaotic prediction while attention and CNN front-ends are not. The per-pair RMSE numbers and explicit leaderboard scores constitute reproducible data points that future work can directly compare against.
major comments (2)
- [Abstract and Results] Leaderboard scores (45.72–58.81) and per-pair RMSE values are single realizations with no standard deviations, no multi-seed averages, and no statistical significance tests. Because chaotic rollouts exponentially amplify initialization and optimization noise, the central claims (that BiLSTM+Huber is strongest, that attention drops >10 points, and that CNN front-ends hurt) cannot be distinguished from training stochasticity on the basis of the reported numbers.
- [Experimental Setup] The assertion that identical pre-processing, sequence length, and rollout “isolate the contribution of each design choice” is not accompanied by any control for random-seed variance or implementation details. In the absence of such controls, the observed architecture ranking remains vulnerable to confounding, directly undermining the two highlighted findings.
minor comments (2)
- [Abstract] Training hyperparameters, optimizer, learning-rate schedule, and number of epochs are not stated, making the reported scores difficult to reproduce even if seeds were supplied.
- [Introduction] The manuscript would benefit from a short related-work paragraph situating the seven architectures against prior RNN/TCN studies on the Lorenz system.
Simulated Author's Rebuttal
We thank the referee for highlighting the importance of statistical robustness in our empirical evaluation of LSTM variants on the Lorenz attractor. We agree that single-run results are insufficient to support the central claims given the sensitivity of chaotic forecasting to initialization and optimization noise. The revised manuscript will incorporate multi-seed experiments and statistical reporting to address these concerns directly.
Point-by-point responses
- Referee: [Abstract and Results] Leaderboard scores (45.72–58.81) and per-pair RMSE values are single realizations with no standard deviations, no multi-seed averages, and no statistical significance tests. Because chaotic rollouts exponentially amplify initialization and optimization noise, the central claims (that BiLSTM+Huber is strongest, that attention drops >10 points, and that CNN front-ends hurt) cannot be distinguished from training stochasticity on the basis of the reported numbers.
  Authors: We agree that the reported leaderboard scores and per-pair RMSE values are from single realizations and that this limits the strength of the claims. In the revision we will rerun every architecture with five independent random seeds, report mean scores with standard deviations for both the overall leaderboard metric and the per-pair RMSE values, and include statistical significance tests (paired t-tests across seeds) for the key comparisons. This will allow readers to assess whether the observed differences exceed training stochasticity.
  Revision: yes
- Referee: [Experimental Setup] The assertion that identical pre-processing, sequence length, and rollout “isolate the contribution of each design choice” is not accompanied by any control for random-seed variance or implementation details. In the absence of such controls, the observed architecture ranking remains vulnerable to confounding, directly undermining the two highlighted findings.
  Authors: We accept that fixing pre-processing, sequence length, and rollout procedure alone does not fully isolate architectural effects without also controlling random-seed variance. The revised manuscript will state explicitly that all models in the initial submission were trained with the same fixed seeds, and it will add the multi-seed protocol described above. We will also document the exact random-seed values, optimizer settings, and implementation framework to reduce confounding and strengthen the isolation argument.
  Revision: yes
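The paired t-test across seeds proposed in the responses above can be sketched as follows, pairing the two architectures' scores by seed index. This is an illustrative NumPy-only version under that pairing assumption; `paired_t_statistic` is a hypothetical helper and no scores from the paper are used.

```python
import math
import numpy as np

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-seed score pairs.

    scores_a, scores_b: 1-D sequences of equal length, where entry i of
    each comes from the same random seed (assumed pairing).
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape or a.size < 2:
        raise ValueError("need two 1-D arrays of equal length >= 2")
    diff = a - b
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t = diff.mean() / (sd / math.sqrt(diff.size))
    return float(t), diff.size - 1
```

With five seeds there are 4 degrees of freedom, so a two-sided test at the 5% level requires |t| > 2.776. SciPy's `scipy.stats.ttest_rel` computes the same statistic together with a p-value.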
Circularity Check
No circularity: purely empirical leaderboard comparison
The paper reports performance numbers obtained from an external AI-DEEDS challenge evaluation. No mathematical derivations, fitted parameters renamed as predictions, self-citations, or ansatzes are present in the abstract or described methodology. All architecture comparisons rely on shared preprocessing and external scoring, with no load-bearing step that reduces to its own inputs by construction. This matches the default expectation of a non-circular empirical study.
Axiom & Free-Parameter Ledger
free parameters (1)
- model-specific hyperparameters
axioms (1)
- domain assumption: The AI-DEEDS 2026 Chaotic Systems Challenge 0-100 score is a reliable proxy for forecasting quality on the Lorenz attractor.