Ensemble RL through Classifier Models: Enhancing Risk-Return Trade-offs in Trading Strategies

Zheli Xiong

arXiv: 2502.17518 · v2 · pith:7ANDHF3Nnew · submitted 2025-02-23 · cs.LG · cs.AI · q-fin.CP · stat.ML

Pith reviewed 2026-05-23 02:36 UTC · model grok-4.3

classification cs.LG · cs.AI · q-fin.CP · stat.ML

keywords ensemble reinforcement learning · financial trading · risk-adjusted returns · classifier integration · A2C · PPO · SAC · maximum drawdown

The pith

Ensemble RL models paired with classifiers deliver better risk-adjusted trading performance than individual RL agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether combining reinforcement learning algorithms such as A2C, PPO, and SAC with classifier models including SVM, decision trees, and logistic regression can improve trading strategies. It tests various ensemble methods against standalone RL models using metrics like cumulative returns, Sharpe ratio, Calmar ratio, and maximum drawdown. The central finding is that these ensembles provide superior risk management and stability, though results vary with the variance threshold tau used in integration. This matters for applications where consistent performance under uncertainty is valuable, such as financial markets.
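For readers less familiar with the four metrics named above, the following is a minimal sketch of how they are commonly computed from a series of per-period returns. The function names, the 252-period annualization, and the zero default risk-free rate are assumptions made for illustration; the paper's exact conventions are not stated on this page.

```python
import numpy as np

def cumulative_return(returns):
    """Total compounded return over the whole period."""
    returns = np.asarray(returns, dtype=float)
    return float(np.prod(1.0 + returns) - 1.0)

def max_drawdown(returns):
    """Largest peak-to-trough loss of the equity curve, as a positive fraction."""
    equity = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    equity = np.concatenate(([1.0], equity))      # include the starting capital
    running_peak = np.maximum.accumulate(equity)
    return float(np.max(1.0 - equity / running_peak))

def sharpe_ratio(returns, periods_per_year=252, risk_free=0.0):
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = np.asarray(returns, dtype=float) - risk_free / periods_per_year
    sd = excess.std(ddof=1)
    if sd > 0:
        return float(np.sqrt(periods_per_year) * excess.mean() / sd)
    return float("nan")                           # undefined for a flat series

def calmar_ratio(returns, periods_per_year=252):
    """Annualized compounded return divided by maximum drawdown."""
    returns = np.asarray(returns, dtype=float)
    years = len(returns) / periods_per_year
    annual = (1.0 + cumulative_return(returns)) ** (1.0 / years) - 1.0
    mdd = max_drawdown(returns)
    return float(annual / mdd) if mdd > 0 else float("nan")
```

A higher Sharpe or Calmar ratio and a smaller maximum drawdown are the directions the paper reports as improvements.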

Core claim

Integrating classifier predictions with RL policies through ensemble rules based on a variance threshold tau produces trading agents that outperform their base RL components on risk-return metrics, including higher Sharpe and Calmar ratios alongside reduced maximum drawdowns.

What carries the argument

The variance-thresholded ensemble rule that merges action distributions from multiple RL agents with classifier outputs to select or weight decisions.
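A variance-thresholded gate of this kind could look roughly like the sketch below. This is not the paper's Algorithm 1, which is not reproduced in full on this page. It assumes that a scalar disagreement score has already been computed, that a score above tau means "defer to the agent the classifiers vote for", and that a score at or below tau means "average the agents' actions". The function name and both branch behaviors are hypothetical.

```python
import numpy as np

def gated_ensemble_action(agent_actions, classifier_votes, score, tau):
    """Pick an action given K agents' proposals and a disagreement score.

    agent_actions:    array of shape (K, D), one proposed action per agent
    classifier_votes: 1-D ints in [0, K), the agent each classifier prefers
    score:            scalar disagreement measure (computed by the caller)
    tau:              variance threshold
    Returns (action, chosen_agent_index or None).
    """
    agent_actions = np.asarray(agent_actions, dtype=float)
    votes = np.asarray(classifier_votes, dtype=int)
    if score > tau:
        # ASSUMPTION: high disagreement, follow the majority-voted agent.
        counts = np.bincount(votes, minlength=agent_actions.shape[0])
        chosen = int(counts.argmax())
        return agent_actions[chosen], chosen
    # ASSUMPTION: low disagreement, average the agents' proposals.
    return agent_actions.mean(axis=0), None
```

Because the branch taken depends entirely on the comparison with tau, any backtest result produced this way inherits the tau sensitivity discussed in the review sections.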

Load-bearing premise

The chosen classifiers contribute information not already encoded in the RL policies' learned behaviors.

What would settle it

Running the same trading environments and finding that no ensemble variant exceeds the best base RL model on Sharpe ratio or drawdown metrics would contradict the reported outperformance.

Figures

Figures reproduced from arXiv: 2502.17518 by Zheli Xiong.

**Figure 1.** Portfolio strategy process. $\text{Normalized Std Dev}(d) = \dfrac{\sigma(d) - \min(\sigma)}{\max(\sigma) - \min(\sigma) + \epsilon}$, where $\epsilon$ is a small constant added to avoid division by zero. This normalization scales the standard deviations to the range [0, 1], enabling consistent comparisons between dimensions with differing magnitudes of variability. After normalization, the average normalized standard deviation across all stock dimensions is com…
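The min-max normalization quoted in the Figure 1 text can be sketched as follows. The sketch assumes that σ(d) is the standard deviation of the agents' proposed holdings for stock dimension d, which the truncated excerpt does not confirm; the function names and the default value of the small constant are also illustrative.

```python
import numpy as np

def normalized_std(holdings, eps=1e-8):
    """Min-max normalize the per-dimension standard deviations.

    holdings: array of shape (K, D), K agents by D stock dimensions
              (ASSUMPTION: sigma(d) is taken across agents).
    """
    holdings = np.asarray(holdings, dtype=float)
    sigma = holdings.std(axis=0)                  # sigma(d) for each stock d
    # eps keeps the denominator non-zero when all sigma(d) are equal
    return (sigma - sigma.min()) / (sigma.max() - sigma.min() + eps)

def mean_normalized_std(holdings, eps=1e-8):
    """Average normalized standard deviation across all stock dimensions."""
    return float(normalized_std(holdings, eps).mean())
```

When every dimension has the same spread, the numerator is zero everywhere and the score is 0 rather than undefined, which is the case the small constant guards against.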

**Figure 2.** Decision block at each step. Algorithm 1 (Stock Holdings Adjustment Algorithm). Input: (a) Classifier outputs $P_i$, $i = 1, \dots, C$: Action-Agent matrices of size 2 × 2, where 2 is the number of agents' stock holdings, and K is the number of agents. (b) True agent indices $k_j$, $j = 1, 2$: the true agent for each stock holdings vector $h_j$. (c) Variance threshold τ: threshold for determining high and low variance sc…

**Figure 3.** Performance Metrics of Models in Classifier Group 1 Across the Entire Year of 2020

**Figure 4.** Comparative Study on Risk-Return Trade-offs Across Classifier Groups

**Figure 5.** Results for the base models Model1 and Model2 under different values of the variance threshold τ, using an ensemble of classifier group 1. Each result is the average over 30 backtesting iterations. Section 7 (Conclusion): This study highlights the effectiveness of ensemble models in enhancing both the returns and stability of trading strategies, particularly when integrated with tradit…

Original abstract

This paper presents a comprehensive study on the use of ensemble Reinforcement Learning (RL) models in financial trading strategies, leveraging classifier models to enhance performance. By combining RL algorithms such as A2C, PPO, and SAC with traditional classifiers like Support Vector Machines (SVM), Decision Trees, and Logistic Regression, we investigate how different classifier groups can be integrated to improve risk-return trade-offs. The study evaluates the effectiveness of various ensemble methods, comparing them with individual RL models across key financial metrics, including Cumulative Returns, Sharpe Ratios (SR), Calmar Ratios, and Maximum Drawdown (MDD). Our results demonstrate that ensemble methods consistently outperform base models in terms of risk-adjusted returns, providing better management of drawdowns and overall stability. However, we identify the sensitivity of ensemble performance to the choice of variance threshold τ, highlighting the importance of dynamic τ adjustment to achieve optimal performance. This study emphasizes the value of combining RL with classifiers for adaptive decision-making, with implications for financial trading, robotics, and other dynamic environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Standard ensemble application to existing RL trading agents, with gains that depend on an untuned variance threshold and no check for independent classifier signal.

The core of this paper is taking A2C, PPO, and SAC, pairing them with SVM, decision trees, and logistic regression, then claiming the resulting ensembles deliver better Sharpe, Calmar, and drawdown numbers than the single agents. That is the extent of what is new: a direct application of off-the-shelf ensemble ideas to already-published RL trading agents, with no new algorithm or derivation offered. The abstract does report the usual financial metrics and flags that results move with the variance threshold tau, which at least shows the authors noticed one practical issue. Beyond that, the work stays within routine empirical comparison.

The main weaknesses are straightforward. The abstract supplies no information on data splits, walk-forward testing, or any statistical significance checks. The ensemble rules and the exact role of tau are not described, so it is impossible to tell whether the classifiers contribute anything beyond what the RL policies already encode. If the classifiers are trained on the same trajectories or states, their outputs can easily correlate with the RL value estimates, turning the reported gains into simple variance reduction from averaging rather than genuine complementarity. The admitted sensitivity to tau adds to the concern that some of the advantage may come from post-hoc parameter choice.

This paper is aimed at quant teams that already run RL agents and want to try mixing in classifiers for risk control. A reader looking for a reproducible method or a result that survives basic robustness checks will not find enough here. I would not send it for peer review; the evidence presented is too thin to support the headline claims.

Referee Report

3 major / 1 minor

Summary. The paper claims that ensembles combining RL algorithms (A2C, PPO, SAC) with classifiers (SVM, Decision Trees, Logistic Regression) outperform individual RL models on financial trading metrics including cumulative returns, Sharpe ratio, Calmar ratio, and maximum drawdown, while noting sensitivity of results to the variance threshold τ.

Significance. If validated with proper controls, the work could provide a practical template for hybrid RL-classical ML ensembles in sequential decision tasks with risk constraints. The explicit acknowledgment of τ sensitivity is a strength, but the absence of reproducibility details and independence checks limits the current impact.

major comments (3)

[Abstract] Abstract and experimental section: no description of train/test splits, walk-forward validation, or statistical significance testing (e.g., Diebold-Mariano or bootstrap) is provided for the reported SR/Calmar/MDD improvements, making it impossible to assess whether gains exceed sampling variability.
[Methods / Ensemble Construction] Ensemble integration (variance-threshold gating): τ is treated as a tunable hyperparameter whose optimal value must be selected per experiment; the paper itself flags performance sensitivity to τ, which directly undermines the claim that ensembles are inherently superior rather than the result of post-hoc fitting.
[Results / Classifier Integration] Classifier-RL complementarity: no ablation, pairwise action-agreement rate, or mutual-information analysis is reported to test whether SVM/DT/LR outputs supply signal orthogonal to the A2C/PPO/SAC policies. Without this, the observed variance reduction is consistent with averaging correlated predictors rather than true ensemble benefit.
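As an illustration of the bootstrap check mentioned in the first major comment, the sketch below builds a percentile confidence interval for the difference in annualized Sharpe ratio between an ensemble and a base model. It assumes both return series cover the same dates, and it resamples days independently; a block bootstrap would be more appropriate if daily returns are autocorrelated. The function names and defaults are illustrative, not taken from the paper.

```python
import numpy as np

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio with a zero risk-free rate."""
    sd = returns.std(ddof=1)
    return np.sqrt(periods_per_year) * returns.mean() / sd if sd > 0 else np.nan

def bootstrap_sharpe_diff(ens, base, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for Sharpe(ens) - Sharpe(base) using paired resampling."""
    ens = np.asarray(ens, dtype=float)
    base = np.asarray(base, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(ens)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # same days for both series
        diffs[b] = sharpe(ens[idx]) - sharpe(base[idx])
    lo, hi = np.nanquantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

If the resulting interval contains zero, the reported Sharpe improvement cannot be distinguished from sampling variability at the chosen level.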

minor comments (1)

[Methods] Notation: the symbol τ is introduced without an explicit equation defining how the variance threshold is computed from the classifier outputs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas for improving the manuscript's clarity, reproducibility, and analytical depth. We respond to each major comment below and indicate the revisions we will make.

Referee: [Abstract] Abstract and experimental section: no description of train/test splits, walk-forward validation, or statistical significance testing (e.g., Diebold-Mariano or bootstrap) is provided for the reported SR/Calmar/MDD improvements, making it impossible to assess whether gains exceed sampling variability.

Authors: We agree that these details are necessary for proper evaluation. The revised manuscript will include an expanded Experimental Setup section describing the chronological train/test splits, walk-forward validation procedure to prevent data leakage, and statistical significance testing via bootstrap confidence intervals on the reported metrics.

revision: yes

Referee: [Methods / Ensemble Construction] Ensemble integration (variance-threshold gating): τ is treated as a tunable hyperparameter whose optimal value must be selected per experiment; the paper itself flags performance sensitivity to τ, which directly undermines the claim that ensembles are inherently superior rather than the result of post-hoc fitting.

Authors: We present the sensitivity to τ as an explicit finding rather than a hidden caveat. Our central claim is that ensembles with suitable τ selection deliver improved risk-return profiles relative to base RL models; this is not claimed to be tuning-free. We will add further discussion and sensitivity plots across τ values to clarify the method's practical use.

revision: partial

Referee: [Results / Classifier Integration] Classifier-RL complementarity: no ablation, pairwise action-agreement rate, or mutual-information analysis is reported to test whether SVM/DT/LR outputs supply signal orthogonal to the A2C/PPO/SAC policies. Without this, the observed variance reduction is consistent with averaging correlated predictors rather than true ensemble benefit.

Authors: This is a fair critique on the need for explicit complementarity analysis. We will add an ablation study together with pairwise action-agreement rates between the classifier outputs and RL policies in the revised Results section to better demonstrate the source of the observed gains.

revision: yes
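The complementarity diagnostics requested in the third exchange can be computed in a few lines. The sketch below assumes that both the classifier outputs and the RL policy actions have been reduced to discrete labels per time step (for example sell, hold, buy); how continuous holdings would be discretized is not specified here and is left as an assumption. Mutual information is reported in nats.

```python
import numpy as np

def agreement_rate(a, b):
    """Fraction of time steps on which two label sequences coincide."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a == b))

def mutual_information(a, b):
    """Empirical mutual information (nats) between two discrete sequences."""
    a, b = np.asarray(a), np.asarray(b)
    _, ai = np.unique(a, return_inverse=True)     # map labels to 0..A-1
    _, bi = np.unique(b, return_inverse=True)     # map labels to 0..B-1
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1.0)               # contingency counts
    joint /= joint.sum()                          # joint probabilities
    pa = joint.sum(axis=1, keepdims=True)         # marginal of a, shape (A, 1)
    pb = joint.sum(axis=0, keepdims=True)         # marginal of b, shape (1, B)
    mask = joint > 0                              # skip empty cells (0 * log 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))
```

An agreement rate near 1 or a mutual information close to the entropy of the policy's actions would suggest that the classifier largely repeats what the policy already encodes.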

Circularity Check

0 steps flagged

Empirical comparison with acknowledged parameter sensitivity; no load-bearing derivation reduces to inputs

The paper reports experimental results on ensembles of RL policies (A2C/PPO/SAC) with classifiers (SVM/DT/LR), evaluating financial metrics. The abstract explicitly flags sensitivity of results to the variance threshold τ and calls for dynamic adjustment, indicating performance is not presented as first-principles or independent of this choice. No equations, uniqueness theorems, or self-citations are shown that would make the outperformance claim reduce by construction to fitted inputs or prior author work. The central claim remains an empirical observation rather than a self-referential prediction.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the untested premise that the selected classifiers supply orthogonal information to the RL value functions and on the existence of a stable optimal tau that can be identified without overfitting to the test period. No new entities are postulated.

free parameters (1)

variance threshold tau
Controls when classifier output overrides or augments the RL policy; its value is tuned to achieve the reported performance gains.
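One way to keep the choice of tau from leaking test-period information is to pick it on a validation window that precedes each test window. The sketch below is a generic walk-forward scheme, not the paper's procedure: the window lengths, the function names, and the user-supplied `score_on_window` callback (for example a validation Sharpe ratio from a backtest) are all hypothetical.

```python
def walk_forward_splits(n_steps, train, val, test, step=None):
    """Yield (train, val, test) index ranges in chronological order."""
    step = step or test                           # default: non-overlapping tests
    start = 0
    while start + train + val + test <= n_steps:
        a = start + train
        b = a + val
        c = b + test
        yield range(start, a), range(a, b), range(b, c)
        start += step

def select_tau(candidates, score_on_window, val_window):
    """Return the candidate tau with the best score on the validation window.

    score_on_window(tau, window) is a caller-supplied function; the test
    window is never passed to it, so tau is fixed before testing.
    """
    return max(candidates, key=lambda tau: score_on_window(tau, val_window))
```

For each split, tau would be chosen with `select_tau` on the validation range and then held fixed while metrics are recorded on the test range.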
