A Deep Reinforcement Learning Approach to Automated Stock Trading, using xLSTM Networks

Armin Salimi-Badr; Faezeh Sarlakifar; Mohammadreza Mohammadzadeh Asl; Sajjad Rezvani Khaledi

arXiv: 2503.09655 · v2 · submitted 2025-03-12 · cs.CE · cs.LG · q-fin.TR

Pith reviewed 2026-05-23 00:54 UTC · model grok-4.3

keywords: xLSTM, deep reinforcement learning, stock trading, PPO, LSTM, automated trading, financial time series, trading metrics

The pith

xLSTM networks in a PPO reinforcement learning agent outperform LSTM networks on stock trading metrics including cumulative return and Sharpe ratio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores replacing LSTM networks with xLSTM networks inside a deep reinforcement learning setup for automated stock trading. It claims that xLSTM better manages long-term dependencies and avoids gradient issues in volatile financial data, allowing the agent to make more effective buy and sell decisions. The method places xLSTM units in both the actor and critic parts of a Proximal Policy Optimization agent and tests it on historical prices from major technology companies. A sympathetic reader would care because improved trading performance could translate into higher returns and lower drawdowns in real markets. The results indicate that xLSTM offers a practical upgrade for sequential decision tasks in dynamic environments.

Core claim

The central claim is that an xLSTM-based deep reinforcement learning model using PPO outperforms LSTM-based methods in key trading evaluation metrics, including cumulative return, average profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio, when tested on financial data from major tech companies over a comprehensive timeline.
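To make the listed metrics concrete, here is a minimal Python sketch of how three of them are conventionally computed from an equity curve. The definitions are the standard textbook ones, not taken from the paper: "maximum pullback" is assumed to mean maximum drawdown, the 252-period annualization factor is an assumption, and "maximum earning rate" is left out because its definition is not given. All function names are illustrative.

```python
import math
from statistics import mean, stdev


def cumulative_return(equity):
    """Total growth of the equity curve; 0.25 means +25%."""
    return equity[-1] / equity[0] - 1.0


def period_returns(equity):
    """Simple returns between consecutive equity values."""
    return [b / a - 1.0 for a, b in zip(equity, equity[1:])]


def sharpe_ratio(equity, risk_free_per_period=0.0, periods_per_year=252):
    """Annualized Sharpe ratio (assumes daily data, 252 periods per year)."""
    excess = [r - risk_free_per_period for r in period_returns(equity)]
    sd = stdev(excess)  # sample standard deviation; needs >= 2 returns
    if sd == 0:
        return float("nan")
    return mean(excess) / sd * math.sqrt(periods_per_year)


def max_drawdown(equity):
    """Largest peak-to-trough decline as a positive fraction of the peak.

    Assumed here to correspond to the paper's 'maximum pullback'.
    """
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def average_profit_per_trade(trade_pnls):
    """Mean profit or loss over closed trades."""
    return mean(trade_pnls)
```

Because a lower drawdown is better while higher is better for the other metrics, a comparison table should state the direction of each metric explicitly.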

What carries the argument

xLSTM networks placed in both the actor and critic of a Proximal Policy Optimization (PPO) reinforcement learning agent to process time series market data and output trading actions.

If this is right

The xLSTM-PPO agent produces higher cumulative returns and average profitability per trade than the LSTM baseline on the tested tech stock data.
The approach yields a higher Sharpe ratio and lower maximum pullback, indicating better risk-adjusted performance.
xLSTM enables more effective capture of long-term dependencies in financial time series within reinforcement learning trading policies.
The method balances exploration and exploitation in dynamic market conditions through the PPO optimizer.
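For readers unfamiliar with PPO, the sketch below shows its standard clipped surrogate objective in plain Python. This is the usual textbook formulation, not the paper's code: the clip range of 0.2 is a common default used here as an assumption, and the paper's actual loss terms (value loss, entropy bonus) are not described in the abstract.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective, to be maximized.

    new_logps / old_logps: log-probabilities of the taken actions under
    the current and the data-collecting policy. advantages: advantage
    estimates for those actions. clip_eps is an assumed default.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio r_t
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any gain from moving the ratio
        # outside [1 - eps, 1 + eps] in the direction the advantage favors.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping bounds how far a single update can move the policy away from the one that collected the data, which is the mechanism usually credited for PPO's training stability.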

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same xLSTM replacement could be tested in reinforcement learning agents for other sequential financial tasks such as portfolio allocation across asset classes.
Performance gains might depend on the length of the price history; shorter or longer windows could alter the relative advantage over LSTM.
If xLSTM scales well, it might reduce the need for extensive feature engineering in trading systems that currently rely on LSTM.

Load-bearing premise

Any observed performance difference between xLSTM and LSTM versions can be attributed to the network architecture rather than to unstated differences in hyperparameter choices, data preprocessing, train-test splits, or random seeds.

What would settle it

Re-running the exact experiments with identical hyperparameters, preprocessing steps, train-test splits, and multiple random seeds for both the xLSTM and LSTM models and checking whether the reported gaps in cumulative return and Sharpe ratio remain.
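One way to carry out this check is a paired comparison across seeds: train both agents under the same seeds and settings, then test whether the per-seed metric gap differs from zero. The sketch below computes a paired t statistic with the standard library only. The function name is hypothetical, and it assumes the per-seed metric values (for example, Sharpe ratios) have already been collected.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(metric_a, metric_b):
    """Return (t, degrees_of_freedom) for per-seed paired metrics.

    metric_a[i] and metric_b[i] must come from the same seed, split,
    and preprocessing; only the recurrent cell should differ.
    """
    if len(metric_a) != len(metric_b) or len(metric_a) < 2:
        raise ValueError("need matching seeds for both models, at least two")
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

The statistic is compared against a Student t distribution with the returned degrees of freedom. With only a handful of seeds the test has little power, so reporting the mean and standard deviation of the gap alongside it is more informative than the p-value alone.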

Original abstract

Traditional Long Short-Term Memory (LSTM) networks are effective for handling sequential data but have limitations such as gradient vanishing and difficulty in capturing long-term dependencies, which can impact their performance in dynamic and risky environments like stock trading. To address these limitations, this study explores the usage of the newly introduced Extended Long Short Term Memory (xLSTM) network in combination with a deep reinforcement learning (DRL) approach for automated stock trading. Our proposed method utilizes xLSTM networks in both actor and critic components, enabling effective handling of time series data and dynamic market environments. Proximal Policy Optimization (PPO), with its ability to balance exploration and exploitation, is employed to optimize the trading strategy. Experiments were conducted using financial data from major tech companies over a comprehensive timeline, demonstrating that the xLSTM-based model outperforms LSTM-based methods in key trading evaluation metrics, including cumulative return, average profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio. These findings mark the potential of xLSTM for enhancing DRL-based stock trading systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The xLSTM-PPO trading agent is a new pairing, but the outperformance claim rests on an uncontrolled LSTM baseline comparison.

The paper puts xLSTM into both the actor and critic of a PPO trading agent and reports better cumulative return, profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio than an LSTM version on tech-stock data. That specific substitution looks new based on the references given. It does the straightforward thing of testing a recent recurrent cell in a domain where long-range dependencies can matter, and PPO is a reasonable fit for trading policies. Credit to the authors for trying the architecture swap in a practical setting.

The soft spot is the lack of any reported controls. Nothing is said about whether the LSTM baseline received an equivalent hyperparameter search, the same train-test dates, identical preprocessing, or the same number of random seeds. There is no variance across runs, no transaction costs or slippage, no statistical significance test, and no confirmation that the test window stayed truly held out. Those gaps mean the metric differences cannot be pinned on xLSTM rather than on tuning or split choices. On the available text, the concern that the comparison is uncontrolled holds up.

This is the sort of incremental architecture trial that might interest a narrow group working on DRL trading agents, but only after the methods are tightened. It does not yet show the rigor needed for a serious referee process. I would not send it to peer review until the authors add the missing baseline controls and reporting.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes replacing LSTM cells with xLSTM networks in both the actor and critic of a PPO-based deep reinforcement learning agent for automated stock trading. Experiments on historical price sequences from major technology companies are claimed to demonstrate that the xLSTM agent outperforms an LSTM baseline on cumulative return, average profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio.

Significance. If the reported gains can be shown to arise specifically from the xLSTM architectural changes under controlled conditions, the result would be of interest to the computational finance and DRL communities because it would supply evidence that the matrix-memory and other extensions in xLSTM improve policy learning in non-stationary financial environments. The work also illustrates a practical use case for PPO with modern recurrent cells.
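For context on the "matrix-memory" extension mentioned here, the recurrence below sketches the mLSTM cell as it is commonly presented in the original xLSTM work. It is background, not reproduced from the manuscript under review, which may use a different variant or combine it with the scalar-memory sLSTM cell. The symbols q_t, k_t, v_t denote query, key, and value projections of the input, and the tilde marks gate pre-activations.

```latex
\begin{aligned}
C_t &= f_t\, C_{t-1} + i_t\, v_t k_t^{\top} && \text{(matrix memory update)} \\
n_t &= f_t\, n_{t-1} + i_t\, k_t && \text{(normalizer state)} \\
h_t &= o_t \odot \frac{C_t\, q_t}{\max\{\lvert n_t^{\top} q_t \rvert,\ 1\}} && \text{(read-out)} \\
i_t &= \exp(\tilde{i}_t), \qquad f_t = \sigma(\tilde{f}_t)\ \text{or}\ \exp(\tilde{f}_t) && \text{(exponential gating)}
\end{aligned}
```

The exponential input gate and the normalizer state are the parts that differ most from a classical LSTM, so an ablation isolating them would speak directly to the attribution question raised in the major comments.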

major comments (3)

Abstract: the claim that the xLSTM model 'outperforms LSTM-based methods' on the listed metrics supplies no information on the number of random seeds, variance across runs, or any statistical significance test; without these, the metric gaps cannot be distinguished from sampling noise.
Abstract / Experiments section: no description is given of whether the LSTM baseline received the same hyperparameter search budget, identical train-test split dates, or the same preprocessing pipeline as the xLSTM agent; the central attribution of performance differences to the recurrent cell therefore lacks the necessary controls.
Abstract: the reported metrics (cumulative return, Sharpe ratio, etc.) are computed on historical price sequences that were necessarily used to select or tune the trading policy; the manuscript does not describe any held-out test period, walk-forward validation, or external benchmark independent of the fitted policy.

minor comments (1)

Abstract: the term 'maximum earning rate' is non-standard; clarify its definition and relation to conventional quantities such as maximum drawdown.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating revisions where appropriate to strengthen the manuscript.

Referee: Abstract: the claim that the xLSTM model 'outperforms LSTM-based methods' on the listed metrics supplies no information on the number of random seeds, variance across runs, or any statistical significance test; without these, the metric gaps cannot be distinguished from sampling noise.

Authors: We agree that reporting variance and statistical tests is necessary to substantiate performance claims. The revised manuscript will specify the number of random seeds used (typically 5–10), report means and standard deviations for all metrics, and include statistical significance tests (e.g., paired t-tests) comparing xLSTM and LSTM agents. revision: yes

Referee: Abstract / Experiments section: no description is given of whether the LSTM baseline received the same hyperparameter search budget, identical train-test split dates, or the same preprocessing pipeline as the xLSTM agent; the central attribution of performance differences to the recurrent cell therefore lacks the necessary controls.

Authors: The LSTM baseline was configured with an identical hyperparameter search budget, the same train-test split dates, and the same preprocessing pipeline as the xLSTM agent to ensure a controlled comparison. The Experiments section will be expanded to explicitly document these controls. revision: yes

Referee: Abstract: the reported metrics (cumulative return, Sharpe ratio, etc.) are computed on historical price sequences that were necessarily used to select or tune the trading policy; the manuscript does not describe any held-out test period, walk-forward validation, or external benchmark independent of the fitted policy.

Authors: We acknowledge the importance of clarifying the evaluation protocol to address potential data leakage concerns. The original experiments used a walk-forward validation scheme with rolling train-test windows on unseen future periods. The revised manuscript will include a dedicated subsection detailing the temporal splits, walk-forward procedure, and confirmation that test periods were strictly held out from policy tuning. revision: yes
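The walk-forward scheme referred to in this response can be sketched as a generator of rolling index windows. This is an illustrative sketch only: the function name, window sizes, and step are assumptions, and the paper's actual split dates are not given in the available text.

```python
def walk_forward_splits(n_samples, train_size, test_size, step=None):
    """Yield (train, test) index ranges for rolling walk-forward validation.

    Each test window starts immediately after its training window, so a
    test index is always later in time than every training index in the
    same fold. step defaults to test_size, giving non-overlapping tests.
    """
    step = test_size if step is None else step
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# Example with 10 time steps, 4 for training and 2 for testing per fold.
for train, test in walk_forward_splits(10, train_size=4, test_size=2):
    print(list(train), list(test))
```

Hyperparameters should be chosen using only the training window of each fold (or a validation slice carved from it); otherwise the test windows are not truly held out even though they lie in the future.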

Circularity Check

0 steps flagged

No significant circularity: empirical comparison is self-contained

The paper reports an empirical comparison of xLSTM-PPO versus LSTM-PPO agents on historical stock data for trading metrics. No derivation chain, equation, or prediction is presented that reduces by construction to its own fitted inputs or to a self-citation. The central claim rests on experimental outcomes rather than a closed logical loop, and the provided text contains no self-definitional, fitted-input-renamed-as-prediction, or uniqueness-imported patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that standard DRL training procedures transfer without modification to the xLSTM architecture.
