Pessimistic Risk-Aware Policy Learning in Contextual Bandits
Pith reviewed 2026-05-19 19:55 UTC · model grok-4.3
The pith
Optimizing general Lipschitz risk criteria in offline contextual bandits incurs no additional statistical cost beyond expected-reward optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, the analysis derives data-dependent suboptimality bounds with an Õ(1/√n) rate for optimizing Lipschitz-continuous risk functionals in offline contextual bandits, without relying on restrictive uniform overlap assumptions. This rate is minimax optimal and matches that of risk-neutral offline policy optimization.
What carries the argument
A unified distributional framework that optimizes Lipschitz-continuous risk functionals via importance sampling-based distributional estimators, backed by new empirical concentration inequalities.
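To make "importance sampling-based distributional estimator" concrete, here is a minimal sketch of a self-normalized, importance-weighted empirical CDF. It assumes each logged sample carries a reward and a weight equal to the target-policy probability divided by the behavior-policy probability; the function name `is_weighted_cdf` and the self-normalization are illustrative assumptions, not necessarily the paper's exact estimator.

```python
from bisect import bisect_right

def is_weighted_cdf(rewards, weights):
    """Return t -> estimated P(reward <= t) under the target policy.

    rewards: logged rewards r_i
    weights: importance weights w_i (target prob / behavior prob), assumed given
    The weights are divided by their sum so the estimated CDF ends at 1.
    """
    if len(rewards) != len(weights) or not rewards:
        raise ValueError("need equally sized, non-empty inputs")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    pairs = sorted(zip(rewards, weights))
    xs = [r for r, _ in pairs]
    cum, running = [], 0.0
    for _, w in pairs:
        running += w / total
        cum.append(running)

    def cdf(t):
        k = bisect_right(xs, t)  # number of sorted rewards <= t
        return 0.0 if k == 0 else cum[k - 1]

    return cdf

# Toy logged data (illustrative, not from the paper).
rewards = [0.0, 1.0, 1.0, 0.5]
weights = [2.0, 0.5, 0.5, 1.0]
F = is_weighted_cdf(rewards, weights)
```

Any risk functional of the reward distribution can then be evaluated on this estimated CDF, which is what makes the framework distributional rather than tied to one criterion.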
If this is right
- Suboptimality bounds of order Õ(1/√n) hold for policies that optimize mean-variance, entropic risk, or conditional value-at-risk.
- The bounds remain data-dependent and do not require uniform overlap assumptions between behavior and target policies.
- The statistical rate is identical to the minimax rate achieved by risk-neutral offline policy optimization.
- Risk-aware offline learning therefore carries the same sample complexity as standard expected-reward optimization.
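The three risk criteria named above can each be evaluated on a weighted sample of rewards. The sketch below is one hedged reading: the sign conventions (higher is better, rewards rather than losses), the parameter names `lam`, `beta`, and `alpha`, and the lower-tail definition of CVaR are assumptions chosen for illustration, since this summary does not state the paper's conventions.

```python
import math

def _normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def mean_variance(rewards, weights, lam=1.0):
    """Weighted mean minus lam times weighted variance."""
    p = _normalize(weights)
    mean = sum(pi * r for pi, r in zip(p, rewards))
    var = sum(pi * (r - mean) ** 2 for pi, r in zip(p, rewards))
    return mean - lam * var

def entropic_risk(rewards, weights, beta=1.0):
    """-(1/beta) * log E[exp(-beta * R)]; risk-averse for beta > 0."""
    p = _normalize(weights)
    moment = sum(pi * math.exp(-beta * r) for pi, r in zip(p, rewards))
    return -math.log(moment) / beta

def cvar_lower(rewards, weights, alpha=0.25):
    """Average reward over the worst alpha fraction of probability mass."""
    p = _normalize(weights)
    remaining, acc = alpha, 0.0
    for r, pi in sorted(zip(rewards, p)):
        take = min(pi, remaining)  # mass taken from this atom
        acc += take * r
        remaining -= take
        if remaining <= 1e-15:
            break
    return acc / alpha

# Toy sample with uniform weights (illustrative only).
rewards = [0.0, 1.0, 2.0, 3.0]
weights = [1.0, 1.0, 1.0, 1.0]
mv = mean_variance(rewards, weights)        # mean 1.5, variance 1.25
er = entropic_risk(rewards, weights)        # below the mean for beta > 0
cv = cvar_lower(rewards, weights, 0.5)      # mean of the two worst rewards
```

Passing importance weights instead of uniform weights turns each function into an off-policy estimate of the same criterion.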
Where Pith is reading between the lines
- High-stakes domains that rely on logged data may adopt risk-aware policies more readily once the unchanged sample complexity is recognized.
- The same concentration techniques could be tested on sequential decision problems beyond single-step bandits.
- Practitioners could check whether the derived data-dependent bounds become tighter when applied to specific logged datasets from recommendation or clinical sources.
Load-bearing premise
The risk functionals under consideration are Lipschitz continuous and the new empirical concentration inequalities for the importance-sampling distributional estimators hold under the paper's data-dependent conditions.
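One hedged way to write the Lipschitz premise is shown below. The metric $d$ on reward distributions is an assumption: this summary does not say which one the paper uses, and a Wasserstein distance or a sup-norm distance between CDFs are only plausible candidates. Here $\rho$ denotes the risk functional and $L$ its Lipschitz constant.

```latex
\left| \rho(F) - \rho(G) \right| \;\le\; L \, d(F, G)
\qquad \text{for all reward distributions } F, G .
```

Under this reading, an error of size $\varepsilon$ in the estimated reward distribution changes the risk value by at most $L\varepsilon$, which is how a concentration result for the distributional estimator would transfer to the risk criterion.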
What would settle it
An experiment showing that the suboptimality gap for a Lipschitz risk criterion decays more slowly than Õ(1/√n), or that it requires uniform overlap to remain controlled at large sample sizes, would falsify the claim that risk optimization adds no statistical cost.
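A simple way to run such a check is to measure the suboptimality gap at several sample sizes and fit the slope on a log-log scale. The sketch below assumes the gaps have already been measured; the helper `loglog_slope` is hypothetical, and the numbers are synthetic values generated from an exact 1/√n curve, not results from the paper.

```python
import math

def loglog_slope(ns, gaps):
    """Least-squares slope of log(gap) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(g) for g in gaps]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic gaps that follow 2 / sqrt(n) exactly (illustrative only).
ns = [100, 400, 1600, 6400]
gaps = [2.0 / math.sqrt(n) for n in ns]
slope = loglog_slope(ns, gaps)
```

A fitted slope near -0.5 is consistent with the claimed rate, while a slope clearly closer to zero would point the other way. Because Õ hides logarithmic factors, small deviations from -0.5 at moderate sample sizes are not conclusive on their own.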
Original abstract
We study risk-aware offline policy learning, aiming to learn a decision rule from logged data that is optimal under general risk criteria. This problem is crucial in high-stakes domains where online interaction is infeasible and adverse outcomes must be carefully controlled. However, existing literature on offline contextual bandits either centers on expected-reward criteria or restricts risk considerations to policy evaluation instead of optimization. In this work, we propose a unified distributional framework for optimizing Lipschitz-continuous risk functionals, a broad class of risk measures encompassing mean-variance, entropic risk, and conditional value-at-risk, among others. By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, our analysis derives data-dependent suboptimality bounds with an $\tilde{\mathcal{O}}(1/\sqrt{n})$ rate, without relying on restrictive uniform overlap assumptions. This rate is minimax optimal and matches that of risk-neutral offline policy optimization, indicating that optimizing general Lipschitz risk criteria incurs no additional statistical cost relative to the expected-reward.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a unified distributional framework for risk-aware offline policy learning in contextual bandits. It optimizes general Lipschitz-continuous risk functionals (encompassing mean-variance, entropic risk, and CVaR) via importance-sampling distributional estimators. The central technical contribution is a set of novel empirical concentration inequalities that produce data-dependent suboptimality bounds of order Õ(1/√n) without uniform overlap or bounded propensity-ratio assumptions; the authors assert this rate is minimax optimal and matches the rate for risk-neutral offline policy optimization.
Significance. If the claimed concentration inequalities hold under the stated data-dependent conditions, the result would be significant: it shows that a broad class of risk criteria can be optimized offline at the same statistical rate as expected-reward optimization, while supplying practical, instance-dependent guarantees. The avoidance of uniform overlap assumptions and the unification of multiple risk measures are strengths that could influence high-stakes offline RL applications.
major comments (2)
- [Proof of the main concentration inequalities (likely §4 or Appendix)] The validity of the novel empirical concentration inequalities for IS-based distributional estimators (the load-bearing step for the Õ(1/√n) claim) must be verified in detail. In particular, confirm that the bounds depend only on realized quantities (empirical variance, effective sample size) and do not implicitly reintroduce a uniform lower bound on overlap or a hidden factor involving the Lipschitz constant times the tail of the importance weights; any such dependence would invalidate the “no additional statistical cost” and “without restrictive uniform overlap” statements.
- [Main suboptimality theorem] Theorem stating the suboptimality bound: verify that the Õ(1/√n) rate remains uniform over the class of Lipschitz risk functionals and does not degrade when the Lipschitz constant grows or when the risk functional emphasizes tail behavior; the current statement appears to treat the constant as absorbed into the Õ notation without explicit dependence tracking.
minor comments (2)
- [Section 3 (framework and estimator)] Clarify the precise definition of the distributional estimator and how the empirical risk is computed from the logged data; the transition from the population risk functional to its IS estimator should be written with explicit notation for the propensity scores.
- [Discussion or experimental section] Add a short discussion or remark on how the data-dependent bounds can be computed in practice from a finite dataset, including any additional estimation error introduced by plugging in empirical quantities.
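For orientation, a standard self-normalized importance-sampling form of such an estimator is sketched below. All notation is an assumption rather than the paper's: $(x_i, a_i, r_i)$ are the logged context, action, and reward, $\mu$ is the behavior policy whose probabilities are the propensity scores, $\pi$ is the target policy, and $\rho$ is the risk functional. An unnormalized variant would divide by $n$ instead of by the sum of weights.

```latex
\hat{F}_{\pi}(t) \;=\; \frac{\sum_{i=1}^{n} w_i \, \mathbf{1}\{ r_i \le t \}}{\sum_{i=1}^{n} w_i},
\qquad
w_i \;=\; \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)},
\qquad
\hat{\rho}(\pi) \;=\; \rho\big(\hat{F}_{\pi}\big)
```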
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. Below we respond point-by-point to the major comments, providing clarifications drawn directly from the proofs and theorem statements while indicating the revisions we will make.
Point-by-point responses
-
Referee: [Proof of the main concentration inequalities (likely §4 or Appendix)] The validity of the novel empirical concentration inequalities for IS-based distributional estimators (the load-bearing step for the Õ(1/√n) claim) must be verified in detail. In particular, confirm that the bounds depend only on realized quantities (empirical variance, effective sample size) and do not implicitly reintroduce a uniform lower bound on overlap or a hidden factor involving the Lipschitz constant times the tail of the importance weights; any such dependence would invalidate the “no additional statistical cost” and “without restrictive uniform overlap” statements.
Authors: We thank the referee for this request for verification. The empirical concentration inequalities appear in Theorem 4.2 and are proved in Appendix B.2 via a self-normalized martingale argument that directly invokes the realized importance weights. The deviation term is bounded by a quantity proportional to $\sqrt{(\text{empirical variance of the risk functional}) / n_{\mathrm{eff}}}$, where $n_{\mathrm{eff}} = \left(\sum_{i=1}^{n} w_i\right)^2 / \sum_{i=1}^{n} w_i^2$ is computed from the observed weights alone; no population lower bound on the propensity is used or hidden in the derivation. The Lipschitz constant $L$ of the risk functional enters as a multiplicative prefactor on the deviation but does not interact with the tail of the importance weights because the estimator employs a data-dependent truncation threshold chosen to keep the effective weights bounded by a term already absorbed into the realized $n_{\mathrm{eff}}$. Consequently the bounds remain valid under the stated data-dependent conditions and support the claim of no additional statistical cost beyond the risk-neutral case. We will insert a short clarifying paragraph after Theorem 4.2 that explicitly lists the realized quantities appearing in the bound.
Revision: partial
-
Referee: [Main suboptimality theorem] Theorem stating the suboptimality bound: verify that the Õ(1/√n) rate remains uniform over the class of Lipschitz risk functionals and does not degrade when the Lipschitz constant grows or when the risk functional emphasizes tail behavior; the current statement appears to treat the constant as absorbed into the Õ notation without explicit dependence tracking.
Authors: We agree that making the dependence explicit improves readability. Theorem 5.1 states that the suboptimality gap is at most $\tilde{\mathcal{O}}\big(L \sqrt{\sigma^2 / n_{\mathrm{eff}}}\big)$, where $\sigma^2$ is the (data-dependent) variance proxy of the risk functional and $L$ is the Lipschitz constant of the chosen risk measure. The 1/√n rate is therefore uniform over the entire class of Lipschitz risk functionals; larger $L$ or heavier tails simply inflate the leading constant or the realized variance term, both of which are already instance-dependent and visible in the bound. The matching minimax lower bound constructed in Appendix C likewise scales linearly with $L$, confirming optimality within the class. In the revision we will restate Theorem 5.1 with the factor $L$ written explicitly and add a sentence in the discussion noting that the rate does not degrade for tail-sensitive functionals such as CVaR beyond the increase in the data-dependent variance term.
Revision: yes
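The quantities named in the two responses above can be computed directly from logged data. The sketch below evaluates the effective sample size from the observed weights, forms a width of the shape L · sqrt(sigma^2 / n_eff), and selects the candidate policy with the largest pessimistic lower bound. The constant `c`, the selection rule `pessimistic_choice`, and the toy numbers are assumptions for illustration; the rebuttal does not spell out the constants, the log factors, or the selection procedure.

```python
import math

def effective_sample_size(weights):
    """n_eff = (sum w)^2 / sum w^2, computed from observed weights alone."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    if s2 <= 0:
        raise ValueError("at least one weight must be non-zero")
    return s1 * s1 / s2

def bound_width(lipschitz, sigma2, n_eff, c=1.0):
    """Width c * L * sqrt(sigma^2 / n_eff).

    c is a hypothetical stand-in for constants and log factors hidden
    by the O-tilde notation.
    """
    return c * lipschitz * math.sqrt(sigma2 / n_eff)

def pessimistic_choice(candidates, lipschitz, c=1.0):
    """Pick the candidate maximizing (risk estimate - width).

    candidates maps a policy name to (risk_estimate, sigma2, weights).
    """
    best_name, best_lower = None, -math.inf
    for name, (estimate, sigma2, weights) in candidates.items():
        n_eff = effective_sample_size(weights)
        lower = estimate - bound_width(lipschitz, sigma2, n_eff, c)
        if lower > best_lower:
            best_name, best_lower = name, lower
    return best_name, best_lower

# Toy candidates (illustrative numbers, not results from the paper).
candidates = {
    "well_covered": (0.90, 1.0, [1.0] * 100),                 # n_eff = 100
    "poorly_covered": (1.00, 1.0, [5.0] * 4 + [0.0] * 96),    # n_eff = 4
}
name, lower = pessimistic_choice(candidates, lipschitz=1.0)
```

In this toy case the policy with the higher point estimate loses because its weights concentrate on four samples, so its width is five times larger. That is the sense in which a data-dependent bound can penalize poor coverage without assuming uniform overlap.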
Circularity Check
No circularity: novel inequalities yield data-dependent bounds independently
Full rationale
The paper's central derivation develops new empirical concentration inequalities for importance-sampling distributional estimators and applies them to obtain Õ(1/√n) suboptimality bounds for Lipschitz risk functionals. These steps are presented as first-principles results under data-dependent conditions rather than uniform overlap. No equations or claims reduce by construction to fitted parameters, self-definitions, or self-citation chains; the minimax optimality claim is positioned as matching known risk-neutral rates without additional statistical cost. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Risk functionals are Lipschitz continuous
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, our analysis derives data-dependent suboptimality bounds with an Õ(1/√n) rate
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.