pith. sign in

arxiv: 2606.04574 · v2 · pith:LNTDFBJ4new · submitted 2026-06-03 · 💻 cs.LG · cs.NE· q-fin.ST· q-fin.TR· stat.ML

Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning

Pith reviewed 2026-06-28 07:35 UTC · model grok-4.3

classification 💻 cs.LG cs.NEq-fin.STq-fin.TRstat.ML
keywords pair tradingdeep reinforcement learningcryptocurrencyPPOstatistical arbitrageexecution modelrisk managementBinance futures
0
0 comments X

The pith

A PPO reinforcement learning agent with deterministic shielding outperforms a heuristic baseline for pair trading execution in cryptocurrency markets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether deep reinforcement learning can serve as a dynamic execution layer on top of classical pair trading to handle the extreme volatility of crypto assets. It introduces a Filter-then-Rank selection process and a Fixed Risk, Adaptive Mean execution model, then trains a PPO agent with an LSTM layer to choose trade timing and sizing inside fixed risk boundaries. On one-hour Binance USD-M futures data the learned policy produced higher risk-adjusted returns than the rule-based alternative. A stationary circular block bootstrap test placed the outperformance at statistical significance of 10 percent. The work therefore presents a concrete hybrid of statistical arbitrage and constrained neural policies that limits divergence risk while allowing adaptive decisions.

Core claim

The optimized RL policy achieved an out-of-sample performance that substantially outperformed the heuristic baseline. A stationary circular block bootstrap robustness check confirms that the agent's risk-adjusted outperformance is statistically significant at the 10 percent level. The architecture anchors the neural policy to statistically robust boundaries through deterministic shielding, thereby reducing severe divergence risks that otherwise appear when classical pair trading is applied directly to high-variance digital assets.

What carries the argument

The Proximal Policy Optimization agent with LSTM layer that makes execution decisions inside the deterministic risk boundaries of the Fixed Risk, Adaptive Mean model.

If this is right

  • The hybrid system combines statistical arbitrage selection with DRL execution to manage divergence risk in volatile markets.
  • Deterministic shielding around the neural policy enables safe application of reinforcement learning to trading without unbounded losses.
  • The Filter-then-Rank methodology supports dynamic multi-pair selection on hourly futures data.
  • Risk-adjusted gains remain detectable even under the idiosyncratic variance typical of cryptocurrency pairs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shielding approach could be ported to other mean-reversion strategies that currently suffer from sudden regime shifts.
  • Live deployment would need to monitor whether the proprietary execution model itself requires periodic recalibration as market microstructure changes.
  • Extending the agent to act across multiple timeframes simultaneously might further improve capture of short-term dislocations.

Load-bearing premise

The Fixed Risk, Adaptive Mean execution model and its deterministic shielding boundaries continue to work when the statistical relationships between paired assets shift rapidly in live crypto conditions.

What would settle it

A new out-of-sample window on the same exchange or a different venue in which the RL policy no longer produces higher risk-adjusted returns than the heuristic baseline or the bootstrap test loses significance at the 10 percent level.

Figures

Figures reproduced from arXiv: 2606.04574 by Damian Lebied\'z, Robert \'Slepaczuk.

Figure 1
Figure 1. Figure 1: Dynamic Rolling Window Hierarchy: Level 1. Level 1 at t [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Coarse Grid Search for the Entry Threshold. [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: High-Resolution Local Sensitivity Analysis: 3.5 Entry Threshold. [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: High-Resolution Local Sensitivity Analysis: 3.0 Entry Threshold. [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Coarse Grid Search for the Stop Loss. 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 0.00 5.00 10.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 −1.00 0.00 1.00 2.00 Median Mean Stop Loss Stop Loss Sortino Ratio Sortino Ratio Panel A: Distribution by Stop Loss Panel B: Median & Mean by Stop Loss Note: Entry Threshold locked at the 3.0 optimum from Stage 1. Panel A displays the distribution of monthly Sortin… view at source ↗
Figure 6
Figure 6. Figure 6: High-Resolution Local Sensitivity Analysis: 1.5 Stop Loss. [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: High-Resolution Local Sensitivity Analysis: 2.0 Stop Loss. [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Cumulative Performance of the Baseline Strategy [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Training Diagnostic: Mean Episode Reward. [PITH_FULL_IMAGE:figures/full_fig_p029_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Training Diagnostic: Entropy Loss. 0.5M 1M 1.5M 2M 2.5M 3M −1 −0.8 −0.6 −0.4 −0.2 0 1 – StepPnLReward, Autonomous, λ=1.0 2 – StepPnLReward, Autonomous, λ=1.2 3 – StepPnLReward, Standard, λ=1.0 4 – StepPnLReward, Standard, λ=1.2 5 – StepPnLReward, Full, λ=1.0 6 – StepPnLReward, Full, λ=1.2 7 – TradePnLReward, Autonomous, λ=1.0 8 – TradePnLReward, Autonomous, λ=1.2 9 – TradePnLReward, Standard, λ=1.0 10 – T… view at source ↗
Figure 11
Figure 11. Figure 11: Cumulative Out-Of-Sample Performance of the Agent 2 Strategy. [PITH_FULL_IMAGE:figures/full_fig_p033_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Out-Of-Sample Equity Curves: StepPnLReward. [PITH_FULL_IMAGE:figures/full_fig_p038_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Out-Of-Sample Equity Curves: TradePnLReward. [PITH_FULL_IMAGE:figures/full_fig_p039_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Out-Of-Sample Equity Curves: HybridActionReward. [PITH_FULL_IMAGE:figures/full_fig_p039_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Training Diagnostic: Seed Variance Analysis of Mean Episode Reward. [PITH_FULL_IMAGE:figures/full_fig_p042_15.png] view at source ↗
read the original abstract

This study aims to determine whether the application of Deep Reinforcement Learning (DRL) as a specialized execution overlay can enhance pair trading in highly volatile cryptocurrency markets. Although classical implementations of the strategy have proven successful in traditional equities, they frequently exhibit rigidity and suffer from severe divergence risks when applied to high-variance environments. To address this need, this research introduces novel concepts. To construct a robust system, we developed a hierarchical "Filter-then-Rank" pair selection methodology and a proprietary "Fixed Risk, Adaptive Mean" execution model. The system employs a Proximal Policy Optimization (PPO) agent with a Long Short-Term Memory (LSTM) layer to govern execution decisions within strict deterministic risk management boundaries. Evaluated on 1-hour interval data from the Binance USD-M Futures market, the optimized RL policy achieved an out-of-sample performance that substantially outperformed the heuristic baseline. A stationary circular block bootstrap robustness check confirms that the agent's risk-adjusted outperformance is statistically significant at the 10 percent level. Although falling marginally short of the stricter 5 percent threshold, this result highlights the extreme idiosyncratic variance characteristic of digital assets. Ultimately, this thesis contributes to the quantitative finance literature by introducing a hybrid architecture that combines statistical arbitrage with DRL execution policies. Furthermore, it delivers a novel framework for safe reinforcement learning via deterministic shielding, proving that anchoring a neural policy to statistically robust boundaries successfully mitigates severe divergence risks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a hybrid pair trading strategy for cryptocurrency markets that combines a hierarchical Filter-then-Rank pair selection methodology with a proprietary Fixed Risk, Adaptive Mean execution model. A PPO agent with an LSTM layer governs execution decisions inside deterministic shielding boundaries. Evaluated on 1-hour Binance USD-M Futures data, the optimized RL policy is reported to substantially outperform a heuristic baseline out-of-sample, with risk-adjusted performance statistically significant at the 10% level via a stationary circular block bootstrap robustness check. The work claims to contribute a novel safe-RL framework that mitigates divergence risks in volatile digital-asset markets.

Significance. If the empirical results and robustness claims hold after additional validation, the paper supplies a concrete example of anchoring neural policies to statistically derived boundaries for safe execution in high-variance environments. The circular-block bootstrap is a constructive element, and the hybrid statistical-arbitrage-plus-DRL architecture could interest quantitative-finance readers if the shielding mechanism is shown to remain effective under regime shifts.

major comments (3)
  1. [Abstract] Abstract: the central claim that deterministic shielding 'successfully mitigates severe divergence risks' rests on the untested premise that the Fixed Risk, Adaptive Mean model and shielding boundaries remain unbiased when cointegration breaks; no boundary-violation rates, regime-shift diagnostics, or LSTM behavior under violated assumptions are reported.
  2. [Methodology] Methodology (pair-selection and execution sections): the heuristic baseline against which outperformance is measured is not described in sufficient detail to determine whether the reported gains reflect genuine generalization or in-sample fitting; this directly affects interpretation of the 10% significance result.
  3. [Results] Results (bootstrap and performance evaluation): the stationary circular block bootstrap addresses serial dependence under stationarity but does not probe structural breaks or non-stationary regimes typical of crypto markets; given that the significance is already marginal (10%), this omission weakens the robustness claim.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'novel concepts' is used without enumerating what is claimed to be novel beyond the named models; a concise list would improve clarity.
  2. The manuscript refers to 'proprietary' components; if the journal requires reproducibility, consider whether additional pseudocode or parameter ranges can be supplied without compromising the proprietary claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each major comment below and indicate where revisions will be made to improve the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that deterministic shielding 'successfully mitigates severe divergence risks' rests on the untested premise that the Fixed Risk, Adaptive Mean model and shielding boundaries remain unbiased when cointegration breaks; no boundary-violation rates, regime-shift diagnostics, or LSTM behavior under violated assumptions are reported.

    Authors: We acknowledge that the manuscript does not explicitly report boundary-violation rates or regime-shift diagnostics. However, the out-of-sample evaluation demonstrates the policy's performance under real market conditions, including volatility. In the revision, we will add a new subsection detailing observed boundary violations during the test period and analyze LSTM actions when assumptions may be strained. This will provide empirical support for the shielding mechanism's effectiveness. revision: yes

  2. Referee: [Methodology] Methodology (pair-selection and execution sections): the heuristic baseline against which outperformance is measured is not described in sufficient detail to determine whether the reported gains reflect genuine generalization or in-sample fitting; this directly affects interpretation of the 10% significance result.

    Authors: We agree that more detail on the heuristic baseline is necessary for proper interpretation. The baseline is outlined in the methodology section, but we will expand it with specific parameters, decision rules, and implementation details to allow readers to fully assess the comparison and the validity of the statistical significance. revision: yes

  3. Referee: [Results] Results (bootstrap and performance evaluation): the stationary circular block bootstrap addresses serial dependence under stationarity but does not probe structural breaks or non-stationary regimes typical of crypto markets; given that the significance is already marginal (10%), this omission weakens the robustness claim.

    Authors: The stationary circular block bootstrap was chosen to account for serial dependence in the returns series while maintaining the stationarity assumption appropriate for the block length selection. We recognize that it does not explicitly test for structural breaks, which is a valid concern in crypto markets. In the revised version, we will include a discussion of this limitation and perform additional robustness checks, such as performance evaluation across sub-periods to probe regime shifts where data permits. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a Filter-then-Rank pair selection and proprietary Fixed Risk, Adaptive Mean execution model, then trains a PPO-LSTM policy inside deterministic shielding boundaries on Binance futures data. Out-of-sample performance is compared to a heuristic baseline and assessed via stationary circular block bootstrap. No quoted equations, definitions, or self-citations reduce any claimed prediction or result to its own inputs by construction; the bootstrap is a standard external statistical procedure and the evaluation split is presented as independent. The derivation chain therefore remains self-contained with independent empirical content.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

Abstract-only view; the central claim rests on unstated choices for RL hyperparameters, risk-boundary values, pair-selection thresholds, and the assumption that Binance futures data remains representative.

free parameters (1)
  • RL policy and risk-boundary parameters
    Multiple values in the PPO agent and shielding rules must be chosen or fitted to produce the reported policy.
axioms (1)
  • domain assumption Binance USD-M futures 1-hour data is sufficiently stationary and representative for out-of-sample evaluation
    Invoked when reporting out-of-sample performance and bootstrap results.
invented entities (2)
  • Fixed Risk, Adaptive Mean execution model no independent evidence
    purpose: To control trade execution inside deterministic risk limits
    Proprietary component introduced to anchor the neural policy
  • deterministic shielding framework for safe RL no independent evidence
    purpose: To mitigate divergence risks in volatile markets
    Claimed novel safety layer

pith-pipeline@v0.9.1-grok · 5797 in / 1381 out tokens · 44784 ms · 2026-06-28T07:35:22.349402+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 30 canonical work pages · 1 internal anchor

  1. [1]

    Safe rein- forcement learning via shielding

    Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U., 2018. Safe rein- forcement learning via shielding. Proceedings of the AAAI Conference on Artificial Intelligence

  2. [2]

    URL:https://doi.org/10.1609/aaai.v32i1.11797

  3. [3]

    Regime changes in bitcoin garch volatility dynamics

    Ardia, D., Bluteau, K., Rüede, M., 2019. Regime changes in bitcoin garch volatility dynamics. Finance Research Letters 29, 266–271. URL:https://doi.org/10.1016/j.frl.2018.08. 009

  4. [4]

    Leverage aversion and risk parity

    Asness, C., Frazzini, A., Pedersen, L., 2012. Leverage aversion and risk parity. Financial Analysts Journal 68, 47–59. URL:https://doi.org/10.2469/faj.v68.n1.1

  5. [5]

    Statistical arbitrage in the us equities market

    Avellaneda, M., Lee, J.H., 2010. Statistical arbitrage in the us equities market. Quantitative Finance 10, 761–782. URL:https://doi.org/10.1080/14697680903124632

  6. [6]

    Pseudo-mathematics and finan- cial charlatanism: The effects of backtest overfitting on out-of-sample performance

    Bailey, D., Borwein, J.J., Lopez de Prado, M., Zhu, Q., 2014. Pseudo-mathematics and finan- cial charlatanism: The effects of backtest overfitting on out-of-sample performance. Notices of the American Mathematical Society 61, 458. URL:https://doi.org/10.1090/noti1105

  7. [7]

    Selection of a portfolio of pairs based on cointegration: A statistical arbitrage strategy

    Caldeira, J.F., Moura, G.V., 2013. Selection of a portfolio of pairs based on cointegration: A statistical arbitrage strategy. Brazilian Review of Finance 11, 49–80. URL:https://doi. org/10.12660/rbfin.v11n1.2013.4785

  8. [8]

    Does simple pairs trading still work? Financial Analysts Journal 66, 83–95

    Do, B., Faff, R., 2010. Does simple pairs trading still work? Financial Analysts Journal 66, 83–95. URL:https://doi.org/10.2469/faj.v66.n4.1

  9. [9]

    Dulac-Arnold, G., Levine, N., Mankowitz, D.J., Li, J., Paduraru, C., Gowal, S., Hester, T.,

  10. [10]

    Challenges of real- world reinforcement learning,

    Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning 110, 2419–2468. URL:https://doi.org/10.1007/s10994-021-05961-4

  11. [11]

    Pairs trading

    Elliott, R.J., *, J.V.D.H., Malcolm, W.P., 2005. Pairs trading. Quantitative Finance 5, 271–276. URL:https://doi.org/10.1080/14697680500149370

  12. [12]

    Co-integration and error correction: Representation, esti- mation, and testing

    Engle, R.F., Granger, C.W.J., 1987. Co-integration and error correction: Representation, esti- mation, and testing. Econometrica 55, 251–276. URL:https://doi.org/10.2307/1913236

  13. [13]

    Implementation matters in deep policy gradients: A case study on ppo and trpo

    Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., Madry, A., 2020. Implementation matters in deep policy gradients: A case study on ppo and trpo. URL: https://arxiv.org/abs/2005.12729,arXiv:2005.12729

  14. [14]

    Statistical arbitrage in cryptocurrency mar- kets

    Fischer, T.G., Krauss, C., Deinert, A., 2019. Statistical arbitrage in cryptocurrency mar- kets. Journal of Risk and Financial Management 12. URL:https:/doi.org/10.3390/ jrfm12010031

  15. [15]

    A comprehensive survey on safe reinforcement learning

    García, J., Fern, o Fernández, 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16, 1437–1480. URL:http://jmlr.org/papers/v16/ garcia15a.html

  16. [16]

    Pairs trading: Performance of a relative-value arbitrage rule

    Gatev, E., Goetzmann, W.N., Rouwenhorst, K.G., 2006. Pairs trading: Performance of a relative-value arbitrage rule. The Review of Financial Studies 19, 797–827. URL:https: //doi.org/10.1093/rfs/hhj020

  17. [17]

    Deep learning statistical arbitrage

    Guijarro-Ordonez, J., Pelger, M., Zanotti, G., 2022. Deep learning statistical arbitrage. URL: https://arxiv.org/abs/2106.04028,arXiv:2106.04028

  18. [18]

    Deep rein- forcement learning that matters

    Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., Meger, D., 2018. Deep rein- forcement learning that matters. Proceedings of the AAAI Conference on Artificial Intelligence

  19. [19]

    URL:https://doi.org/10.1609/aaai.v32i1.11694

  20. [20]

    Long short-term memory.Neural Computation, 9(8): 1735–1780, 1997

    Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Computation 9, 1735–1780. URL:https://doi.org/10.1162/neco.1997.9.8.1735. 59

  21. [21]

    Long-term storage capacity of reservoirs

    Hurst, H.E., 1951. Long-term storage capacity of reservoirs. Transactions of the American Society of Civil Engineers 116, 770–799. URL:https://doi.org/10.1061/TACEAT.0006518

  22. [22]

    A deep reinforcement learning framework for the financial portfolio management problem

    Jiang, Z., Xu, D., Liang, J., 2017. A deep reinforcement learning framework for the financial portfolio management problem. URL:https://arxiv.org/abs/1706.10059, arXiv:1706.10059

  23. [23]

    Statisticalanalysisofcointegrationvectors

    Johansen, S., 1988. Statisticalanalysisofcointegrationvectors. JournalofEconomicDynamics and Control 12, 231–254. URL:https://doi.org/10.1016/0165-1889(88)90041-3

  24. [24]

    Prospect theory: An analysis of decision under risk

    Kahneman, D., Tversky, A., 1979. Prospect theory: An analysis of decision under risk. Econometrica 47, 263–291. URL:http://doi.org/10.2307/1914185

  25. [25]

    Optimizing the pairs-trading strategy using deep reinforcement learning with trading and stop-loss boundaries

    Kim, T., Kim, H.Y., 2019. Optimizing the pairs-trading strategy using deep reinforcement learning with trading and stop-loss boundaries. Complexity 2019. URL:https://doi.org/ 10.1155/2019/3582516

  26. [26]

    Statistical arbitrage in multi-pair trading strategy based on graph clustering algorithms in us equities market

    Korniejczuk, A., Ślepaczuk, R., 2024. Statistical arbitrage in multi-pair trading strategy based on graph clustering algorithms in us equities market. URL:https://arxiv.org/abs/2406. 10695,arXiv:2406.10695

  27. [27]

    Statistical arbitrage pairs trading strategies: Review and outlook

    Krauss, C., 2017. Statistical arbitrage pairs trading strategies: Review and outlook. Journal of Economic Surveys 31, 513–545. URL:https://doi.org/10.1111/joes.12153

  28. [28]

    Optimal parameter selection and indica- tor design for technical analysis strategies by computer software: An empirical analysis of the taiwan futures market

    Lin, H.Y., ChiangLin, C.Y., Tseng, H.W., 2024. Optimal parameter selection and indica- tor design for technical analysis strategies by computer software: An empirical analysis of the taiwan futures market. Engineering Proceedings 74. URL:https://doi.org/10.3390/ engproc2024074056

  29. [29]

    Finrl: A deep reinforcement learning library for automated stock trading in quantitative finance

    Liu, X.Y., Yang, H., Chen, Q., Zhang, R., Yang, L., Xiao, B., Wang, C.D., 2022. Finrl: A deep reinforcement learning library for automated stock trading in quantitative finance. URL: https://arxiv.org/abs/2011.09607,arXiv:2011.09607

  30. [30]

    Magdon-Ismail, A

    Magdon-Ismail, M., Atiya, A.F., Pratap, A., Abu-Mostafa, Y.S., 2004. On the maximum drawdown of a brownian motion. Journal of Applied Probability 41, 147–161. URL:http: //doi.org/10.1239/jap/1077134674

  31. [31]

    Trading and arbitrage in cryptocurrency markets

    Makarov, I., Schoar, A., 2020. Trading and arbitrage in cryptocurrency markets. Journal of Financial Economics 135, 293–319. URL:https://doi.org/10.1016/j.jfineco.2019.07. 001

  32. [32]

    Risk-sensitive reinforcement learning

    Mihatsch, O., Neuneier, R., 2002. Risk-sensitive reinforcement learning. Machine Learning 49, 267–290. URL:https://doi.org/10.1023/A:1017940631555

  33. [33]

    Time limits in reinforcement learning

    Pardo, F., Tavakoli, A., Levdik, V., Kormushev, P., 2022. Time limits in reinforcement learning. URL:10.48550/arXiv.1712.00378,arXiv:1712.00378

  34. [34]

    and Romano, Joseph P

    Politis, D.N., Romano, J.P., 1994. The stationary bootstrap. Journal of the American Statisti- cal Association 89, 1303–1313. URL:https://doi.org/10.1080/01621459.1994.10476870

  35. [35]

    Advances in Financial Machine Learning

    López de Prado, M., 2018. Advances in Financial Machine Learning. John Wiley & Sons. URL:https://books.google.pl/books?id=v0RKDwAAQBAJ

  36. [36]

    Stock market prediction with multiple classifiers

    Qian, B., Rasheed, K., 2007. Stock market prediction with multiple classifiers. Applied Intelligence 26, 25–33. URL:https://doi.org/10.1007/s10489-006-0001-7

  37. [37]

    Introducing hurst ex- ponent in pair trading

    Ramos-Requena, J., Trinidad-Segovia, J., Sánchez-Granero, M., 2017. Introducing hurst ex- ponent in pair trading. Physica A: Statistical Mechanics and its Applications 488, 39–45. URL:https://doi.org/10.1016/j.physa.2017.06.032

  38. [38]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms. URL:https://doi.org/10.48550/arXiv.1707.06347, arXiv:1707.06347. 60

  39. [39]

    URLhttp://dx.doi.org/10.1086/294846

    Sharpe, W.F., 1966. Mutual fund performance. The Journal of Business 39, 119–138. URL: http://doi.org/10.1086/294846

  40. [40]

    Performance measurement in a downside risk framework

    Sortino, F.A., Price, L.N., 1994. Performance measurement in a downside risk framework. The Journal of Investing 3, 59–64. URL:https://doi.org/10.3905/joi.3.3.59

  41. [41]

    Reinforcement learning: An introduction

    Sutton, R.S., Barto, A.G., 2018. Reinforcement learning: An introduction. Second ed., The MIT Press, Cambridge, Massachusetts. URL:http://incompleteideas.net/book/ the-book-2nd.html

  42. [42]

    Chaper 9 - the kelly criterion in blackjack sports betting, and the stock market*

    Thorp, E.O., 2008. Chaper 9 - the kelly criterion in blackjack sports betting, and the stock market*. Handbook of Asset and Liability Management 1, 385–428. URL:https://doi.org/ 10.1016/B978-044453248-0.50015-0

  43. [43]

    Deep reinforcement learning applied to statistical arbitrage investment strategy on cryptomarket

    Vergara, G., Kristjanpoller, W., 2024. Deep reinforcement learning applied to statistical arbitrage investment strategy on cryptomarket. Applied Soft Computing 153, 111255. URL: https://doi.org/10.1016/j.asoc.2024.111255

  44. [44]

    Pairs Trading: Quantitative Methods and Analysis

    Vidyamurthy, G., 2004. Pairs Trading: Quantitative Methods and Analysis. John Wiley & Sons

  45. [45]

    Reinforcement learning pair trading: A dynamic scaling ap- proach

    Yang, H., Malik, A., 2024. Reinforcement learning pair trading: A dynamic scaling ap- proach. Journal of Risk and Financial Management 17, 555. URL:http://doi.org/10. 3390/jrfm17120555

  46. [46]

    Regimefolio: A regime aware ml system for sectoral portfolio optimization in dynamic markets

    Zhang, Y., Goel, D., Ahmad, H., Szabo, C., 2025. Regimefolio: A regime aware ml system for sectoral portfolio optimization in dynamic markets. URL:https://doi.org/10.48550/ arXiv.2510.14986,arXiv:2510.14986. 61