A Deep Reinforcement Learning Approach to Automated Stock Trading, using xLSTM Networks
Pith reviewed 2026-05-23 00:54 UTC · model grok-4.3
The pith
xLSTM networks in a PPO reinforcement learning agent outperform LSTM networks on stock trading metrics including cumulative return and Sharpe ratio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an xLSTM-based deep reinforcement learning model using PPO outperforms LSTM-based methods in key trading evaluation metrics, including cumulative return, average profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio, when tested on financial data from major tech companies over a comprehensive timeline.
What carries the argument
xLSTM networks placed in both the actor and critic of a Proximal Policy Optimization (PPO) reinforcement learning agent to process time series market data and output trading actions.
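For orientation, the objective that PPO optimizes can be sketched independently of the recurrent cell. The snippet below is a minimal pure-Python rendering of the standard clipped surrogate objective, not code from the paper; the function name `ppo_clip_objective` and the default clip range of 0.2 are illustrative assumptions.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. clip_eps is an assumed default.
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

In an xLSTM-versus-LSTM comparison, only the network that produces the log-probabilities and value estimates would differ; this objective would stay the same for both agents.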
If this is right
- The xLSTM-PPO agent produces higher cumulative returns and average profitability per trade than the LSTM baseline on the tested tech stock data.
- The approach yields a higher Sharpe ratio and lower maximum pullback, indicating better risk-adjusted performance.
- xLSTM enables more effective capture of long-term dependencies in financial time series within reinforcement learning trading policies.
- The method balances exploration and exploitation in dynamic market conditions through the PPO optimizer.
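The headline metrics above can be computed from a series of portfolio values. The sketch below uses conventional textbook definitions and assumes that "maximum pullback" corresponds to maximum drawdown; the paper's exact formulas are not given here, and the function names, the zero risk-free rate, and the 252-period annualization factor are assumptions.

```python
import math

def cumulative_return(values):
    """Total return from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0

def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst

def sharpe_ratio(values, risk_free_per_period=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of simple per-period returns.

    Needs at least three values and non-constant returns, otherwise the
    sample standard deviation is undefined or zero.
    """
    rets = [b / a - 1.0 for a, b in zip(values, values[1:])]
    excess = [r - risk_free_per_period for r in rets]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)
```

Applying the same functions to both agents' equity curves keeps the comparison free of definitional differences.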
Where Pith is reading between the lines
- The same xLSTM replacement could be tested in reinforcement learning agents for other sequential financial tasks such as portfolio allocation across asset classes.
- Performance gains might depend on the length of the price history; shorter or longer windows could alter the relative advantage over LSTM.
- If xLSTM scales well, it might reduce the need for extensive feature engineering in trading systems that currently rely on LSTM.
Load-bearing premise
Any observed performance difference between xLSTM and LSTM versions can be attributed to the network architecture rather than to unstated differences in hyperparameter choices, data preprocessing, train-test splits, or random seeds.
What would settle it
Re-running the exact experiments with identical hyperparameters, preprocessing steps, train-test splits, and multiple random seeds for both the xLSTM and LSTM models and checking whether the reported gaps in cumulative return and Sharpe ratio remain.
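One way to check whether the gaps survive such a rerun is a paired comparison across seeds. The sketch below computes the mean per-seed difference and a paired t-statistic in pure Python; it assumes that both agents are evaluated on the same seeds and splits, and the function name `paired_t_statistic` is hypothetical.

```python
import math

def paired_t_statistic(xlstm_scores, lstm_scores):
    """Paired t-statistic for per-seed metric differences.

    Both lists must be ordered by the same seeds. Returns
    (mean difference, t statistic, degrees of freedom).
    Requires at least two seeds and non-identical differences.
    """
    diffs = [a - b for a, b in zip(xlstm_scores, lstm_scores)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return mean, t, n - 1
```

The t-statistic would then be compared against a Student's t distribution with the returned degrees of freedom; with only a handful of seeds, a non-parametric alternative may be more appropriate.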
Original abstract
Traditional Long Short-Term Memory (LSTM) networks are effective for handling sequential data but have limitations such as gradient vanishing and difficulty in capturing long-term dependencies, which can impact their performance in dynamic and risky environments like stock trading. To address these limitations, this study explores the usage of the newly introduced Extended Long Short Term Memory (xLSTM) network in combination with a deep reinforcement learning (DRL) approach for automated stock trading. Our proposed method utilizes xLSTM networks in both actor and critic components, enabling effective handling of time series data and dynamic market environments. Proximal Policy Optimization (PPO), with its ability to balance exploration and exploitation, is employed to optimize the trading strategy. Experiments were conducted using financial data from major tech companies over a comprehensive timeline, demonstrating that the xLSTM-based model outperforms LSTM-based methods in key trading evaluation metrics, including cumulative return, average profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio. These findings mark the potential of xLSTM for enhancing DRL-based stock trading systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes replacing LSTM cells with xLSTM networks in both the actor and critic of a PPO-based deep reinforcement learning agent for automated stock trading. Experiments on historical price sequences from major technology companies are claimed to demonstrate that the xLSTM agent outperforms an LSTM baseline on cumulative return, average profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio.
Significance. If the reported gains can be shown, under controlled conditions, to arise specifically from the xLSTM architectural changes, the result would interest the computational finance and DRL communities. It would supply evidence that the matrix memory and other extensions in xLSTM improve policy learning in non-stationary financial environments. The work also illustrates a practical use case for PPO with modern recurrent cells.
Major comments (3)
- Abstract: the claim that the xLSTM model 'outperforms LSTM-based methods' on the listed metrics supplies no information on the number of random seeds, variance across runs, or any statistical significance test; without these, the metric gaps cannot be distinguished from sampling noise.
- Abstract / Experiments section: no description is given of whether the LSTM baseline received the same hyperparameter search budget, identical train-test split dates, or the same preprocessing pipeline as the xLSTM agent; the central attribution of performance differences to the recurrent cell therefore lacks the necessary controls.
- Abstract: the reported metrics (cumulative return, Sharpe ratio, etc.) are computed on historical price sequences that were necessarily used to select or tune the trading policy; the manuscript does not describe any held-out test period, walk-forward validation, or external benchmark independent of the fitted policy.
Minor comments (1)
- Abstract: the term 'maximum earning rate' is non-standard; clarify its definition and relation to conventional quantities such as maximum drawdown.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, indicating revisions where appropriate to strengthen the manuscript.
Point-by-point responses
- Referee: Abstract: the claim that the xLSTM model 'outperforms LSTM-based methods' on the listed metrics supplies no information on the number of random seeds, variance across runs, or any statistical significance test; without these, the metric gaps cannot be distinguished from sampling noise.
Authors: We agree that reporting variance and statistical tests is necessary to substantiate performance claims. The revised manuscript will specify the number of random seeds used (typically 5–10), report means and standard deviations for all metrics, and include statistical significance tests (e.g., paired t-tests) comparing xLSTM and LSTM agents. revision: yes
- Referee: Abstract / Experiments section: no description is given of whether the LSTM baseline received the same hyperparameter search budget, identical train-test split dates, or the same preprocessing pipeline as the xLSTM agent; the central attribution of performance differences to the recurrent cell therefore lacks the necessary controls.
Authors: The LSTM baseline was configured with an identical hyperparameter search budget, the same train-test split dates, and the same preprocessing pipeline as the xLSTM agent to ensure a controlled comparison. The Experiments section will be expanded to explicitly document these controls. revision: yes
- Referee: Abstract: the reported metrics (cumulative return, Sharpe ratio, etc.) are computed on historical price sequences that were necessarily used to select or tune the trading policy; the manuscript does not describe any held-out test period, walk-forward validation, or external benchmark independent of the fitted policy.
Authors: We acknowledge the importance of clarifying the evaluation protocol to address potential data leakage concerns. The original experiments used a walk-forward validation scheme with rolling train-test windows on unseen future periods. The revised manuscript will include a dedicated subsection detailing the temporal splits, walk-forward procedure, and confirmation that test periods were strictly held out from policy tuning. revision: yes
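The walk-forward scheme mentioned in the last response can be illustrated with a small index generator. This is a generic sketch of rolling train-test windows, not the authors' procedure; the function name `walk_forward_splits` and its parameters are hypothetical, and the window sizes actually used in the paper are not stated here.

```python
def walk_forward_splits(n_samples, train_size, test_size, step=None):
    """Return (train_range, test_range) pairs for rolling windows.

    Each test window starts right after its training window, so every
    test index lies strictly later in time than every training index.
    By default the window advances by test_size, which makes the test
    windows non-overlapping.
    """
    step = test_size if step is None else step
    splits = []
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        splits.append((train, test))
        start += step
    return splits
```

For the comparison to be controlled, both the xLSTM and the LSTM agent would need to be trained and evaluated on the same splits, with hyperparameters chosen using only data that precedes each test window.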
Circularity Check
No significant circularity: empirical comparison is self-contained
Full rationale
The paper reports an empirical comparison of xLSTM-PPO versus LSTM-PPO agents on historical stock data for trading metrics. No derivation chain, equation, or prediction is presented that reduces by construction to its own fitted inputs or to a self-citation. The central claim rests on experimental outcomes rather than a closed logical loop, and the provided text contains no self-definitional, fitted-input-renamed-as-prediction, or uniqueness-imported patterns.