
arXiv: 2503.09655 · v2 · submitted 2025-03-12 · 💻 cs.CE · cs.LG · q-fin.TR

A Deep Reinforcement Learning Approach to Automated Stock Trading, using xLSTM Networks

Pith reviewed 2026-05-23 00:54 UTC · model grok-4.3

classification 💻 cs.CE · cs.LG · q-fin.TR
keywords xLSTM · deep reinforcement learning · stock trading · PPO · LSTM · automated trading · financial time series · trading metrics

The pith

xLSTM networks in a PPO reinforcement learning agent outperform LSTM networks on stock trading metrics including cumulative return and Sharpe ratio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores replacing LSTM networks with xLSTM networks inside a deep reinforcement learning setup for automated stock trading. It argues that xLSTM captures long-term dependencies better and is less prone to vanishing gradients on volatile financial data, which should let the agent make more effective buy and sell decisions. The method places xLSTM units in both the actor and the critic of a Proximal Policy Optimization (PPO) agent and tests it on historical prices from major technology companies. A sympathetic reader would care because better trading performance could translate into higher returns and smaller drawdowns in real markets. The reported results suggest that xLSTM is a practical upgrade for sequential decision tasks in dynamic environments.

Core claim

The central claim is that an xLSTM-based deep reinforcement learning model using PPO outperforms LSTM-based methods in key trading evaluation metrics, including cumulative return, average profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio, when tested on financial data from major tech companies over a comprehensive timeline.
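
Three of the metrics named above have conventional definitions that can be sketched in a few lines. The block below is a minimal illustration, not the paper's evaluation code: it assumes per-period simple returns as input, reads "maximum pullback" as maximum drawdown, and annualizes the Sharpe ratio with 252 trading periods. All function names and defaults are hypothetical.

```python
import math
import statistics


def cumulative_return(returns):
    """Compound per-period simple returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period returns.

    Assumption: risk_free is a per-period rate and 252 periods make a year.
    Needs at least two returns, because the sample standard deviation is used.
    """
    excess = [r - risk_free for r in returns]
    spread = statistics.stdev(excess)
    if spread == 0:
        return float("nan")  # undefined when returns never vary
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the equity curve, as a fraction of the peak."""
    growth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        growth *= 1.0 + r
        peak = max(peak, growth)
        worst = max(worst, (peak - growth) / peak)
    return worst
```

Working from the equity curve rather than from raw prices keeps the three functions comparable across agents that trade different position sizes. "Average profitability per trade" and "maximum earning rate" are left out because the abstract does not define them.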

What carries the argument

xLSTM networks placed in both the actor and critic of a Proximal Policy Optimization (PPO) reinforcement learning agent to process time series market data and output trading actions.
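
For readers unfamiliar with PPO, its standard clipped surrogate objective can be sketched as follows. This is the textbook form, not code from the paper: the function name, the default clip_eps of 0.2, and the use of plain Python lists are assumptions made here for illustration. In the setup described above, the probability ratios would presumably come from the xLSTM actor and the advantage estimates would rely on the xLSTM critic.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: advantage estimate for each sampled action
    clip_eps:   clipping range (0.2 is a common default, assumed here)
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal in length")
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to move the ratio
        # outside the clipping range in the direction that helps.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

Because the objective depends only on ratios and advantages, swapping LSTM for xLSTM changes how those inputs are produced but not the objective itself, which is why the comparison can in principle isolate the recurrent cell.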

If this is right

  • The xLSTM-PPO agent produces higher cumulative returns and average profitability per trade than the LSTM baseline on the tested tech stock data.
  • The approach yields a higher Sharpe ratio and lower maximum pullback, indicating better risk-adjusted performance.
  • xLSTM enables more effective capture of long-term dependencies in financial time series within reinforcement learning trading policies.
  • The method balances exploration and exploitation in dynamic market conditions through the PPO optimizer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same xLSTM replacement could be tested in reinforcement learning agents for other sequential financial tasks such as portfolio allocation across asset classes.
  • Performance gains might depend on the length of the price history; shorter or longer windows could alter the relative advantage over LSTM.
  • If xLSTM scales well, it might reduce the need for extensive feature engineering in trading systems that currently rely on LSTM.

Load-bearing premise

Any observed performance difference between xLSTM and LSTM versions can be attributed to the network architecture rather than to unstated differences in hyperparameter choices, data preprocessing, train-test splits, or random seeds.

What would settle it

Re-running the exact experiments with identical hyperparameters, preprocessing steps, train-test splits, and multiple random seeds for both the xLSTM and LSTM models and checking whether the reported gaps in cumulative return and Sharpe ratio remain.
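
One way to check whether a gap survives across seeds is a paired comparison, where each seed yields one metric value per architecture. The sketch below computes a paired t statistic using only the standard library. It is an illustration under stated assumptions: the function name is hypothetical, both agents are assumed to have been run with the same seeds in the same order, and no numbers from the paper are used.

```python
import math
import statistics


def paired_t_statistic(metric_a, metric_b):
    """Paired t statistic and degrees of freedom for per-seed metric values.

    metric_a[i] and metric_b[i] must come from the same seed, for example
    the Sharpe ratio of the xLSTM agent and of the LSTM agent under seed i.
    """
    if len(metric_a) != len(metric_b) or len(metric_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1
```

The statistic is then compared against a t distribution with the returned degrees of freedom. With only a handful of seeds the test has little power, so reporting the per-seed differences alongside it is more informative than the statistic alone.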

Original abstract

Traditional Long Short-Term Memory (LSTM) networks are effective for handling sequential data but have limitations such as gradient vanishing and difficulty in capturing long-term dependencies, which can impact their performance in dynamic and risky environments like stock trading. To address these limitations, this study explores the usage of the newly introduced Extended Long Short Term Memory (xLSTM) network in combination with a deep reinforcement learning (DRL) approach for automated stock trading. Our proposed method utilizes xLSTM networks in both actor and critic components, enabling effective handling of time series data and dynamic market environments. Proximal Policy Optimization (PPO), with its ability to balance exploration and exploitation, is employed to optimize the trading strategy. Experiments were conducted using financial data from major tech companies over a comprehensive timeline, demonstrating that the xLSTM-based model outperforms LSTM-based methods in key trading evaluation metrics, including cumulative return, average profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio. These findings mark the potential of xLSTM for enhancing DRL-based stock trading systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes replacing LSTM cells with xLSTM networks in both the actor and critic of a PPO-based deep reinforcement learning agent for automated stock trading. Experiments on historical price sequences from major technology companies are claimed to demonstrate that the xLSTM agent outperforms an LSTM baseline on cumulative return, average profitability per trade, maximum earning rate, maximum pullback, and Sharpe ratio.

Significance. If the reported gains can be shown to arise specifically from the xLSTM architectural changes under controlled conditions, the result would be of interest to the computational finance and DRL communities because it would supply evidence that the matrix-memory and other extensions in xLSTM improve policy learning in non-stationary financial environments. The work also illustrates a practical use case for PPO with modern recurrent cells.

major comments (3)
  1. Abstract: the claim that the xLSTM model 'outperforms LSTM-based methods' on the listed metrics supplies no information on the number of random seeds, variance across runs, or any statistical significance test; without these, the metric gaps cannot be distinguished from sampling noise.
  2. Abstract / Experiments section: no description is given of whether the LSTM baseline received the same hyperparameter search budget, identical train-test split dates, or the same preprocessing pipeline as the xLSTM agent; the central attribution of performance differences to the recurrent cell therefore lacks the necessary controls.
  3. Abstract: the reported metrics (cumulative return, Sharpe ratio, etc.) are computed on historical price sequences that were necessarily used to select or tune the trading policy; the manuscript does not describe any held-out test period, walk-forward validation, or external benchmark independent of the fitted policy.
minor comments (1)
  1. Abstract: the term 'maximum earning rate' is non-standard; clarify its definition and relation to conventional quantities such as maximum drawdown.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating revisions where appropriate to strengthen the manuscript.

  1. Referee: Abstract: the claim that the xLSTM model 'outperforms LSTM-based methods' on the listed metrics supplies no information on the number of random seeds, variance across runs, or any statistical significance test; without these, the metric gaps cannot be distinguished from sampling noise.

    Authors: We agree that reporting variance and statistical tests is necessary to substantiate performance claims. The revised manuscript will specify the number of random seeds used (typically 5–10), report means and standard deviations for all metrics, and include statistical significance tests (e.g., paired t-tests) comparing xLSTM and LSTM agents. revision: yes

  2. Referee: Abstract / Experiments section: no description is given of whether the LSTM baseline received the same hyperparameter search budget, identical train-test split dates, or the same preprocessing pipeline as the xLSTM agent; the central attribution of performance differences to the recurrent cell therefore lacks the necessary controls.

    Authors: The LSTM baseline was configured with an identical hyperparameter search budget, the same train-test split dates, and the same preprocessing pipeline as the xLSTM agent to ensure a controlled comparison. The Experiments section will be expanded to explicitly document these controls. revision: yes

  3. Referee: Abstract: the reported metrics (cumulative return, Sharpe ratio, etc.) are computed on historical price sequences that were necessarily used to select or tune the trading policy; the manuscript does not describe any held-out test period, walk-forward validation, or external benchmark independent of the fitted policy.

    Authors: We acknowledge the importance of clarifying the evaluation protocol to address potential data leakage concerns. The original experiments used a walk-forward validation scheme with rolling train-test windows on unseen future periods. The revised manuscript will include a dedicated subsection detailing the temporal splits, walk-forward procedure, and confirmation that test periods were strictly held out from policy tuning. revision: yes
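
A walk-forward scheme with rolling windows, as named in the third response, can be sketched as an index generator. This is a generic illustration and not the authors' procedure: the function name, the window sizes, and the choice to advance by one test window per fold are all assumptions.

```python
def walk_forward_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_range, test_range) pairs over a time-ordered series.

    Every test window starts right after its training window, so no test
    index is ever earlier than a training index in the same fold.
    step defaults to test_size, which makes the test windows non-overlapping.
    """
    if step is None:
        step = test_size
    if min(train_size, test_size, step) <= 0:
        raise ValueError("train_size, test_size and step must be positive")
    start = 0
    while start + train_size + test_size <= n_samples:
        split = start + train_size
        yield range(start, split), range(split, split + test_size)
        start += step
```

For example, `list(walk_forward_splits(10, 4, 2))` produces three folds whose test windows cover indices 4 to 9 without overlap. For the held-out claim to hold, hyperparameter selection would also have to use only the training side of each fold.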

Circularity Check

0 steps flagged

No significant circularity: empirical comparison is self-contained


The paper reports an empirical comparison of xLSTM-PPO versus LSTM-PPO agents on historical stock data for trading metrics. No derivation chain, equation, or prediction is presented that reduces by construction to its own fitted inputs or to a self-citation. The central claim rests on experimental outcomes rather than a closed logical loop, and the provided text contains no self-definitional, fitted-input-renamed-as-prediction, or uniqueness-imported patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that standard DRL training procedures transfer without modification to the xLSTM architecture.

pith-pipeline@v0.9.0 · 5734 in / 1139 out tokens · 51665 ms · 2026-05-23T00:54:09.256340+00:00 · methodology
