Pessimistic Risk-Aware Policy Learning in Contextual Bandits

Xianyi Wu; Yilong Wan; Yuqiang Li

arxiv: 2605.15620 · v1 · pith:VPUYMYPBnew · submitted 2026-05-15 · 📊 stat.ML · cs.LG

Pessimistic Risk-Aware Policy Learning in Contextual Bandits

Yilong Wan , Yuqiang Li , Xianyi Wu This is my paper

Pith reviewed 2026-05-19 19:55 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords offline policy optimizationcontextual banditsrisk-aware learningimportance samplingconcentration inequalitiesLipschitz risk functionalssuboptimality boundsdistributional estimation

0 comments

The pith

Optimizing general Lipschitz risk criteria in offline contextual bandits incurs no additional statistical cost beyond expected-reward optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a unified distributional framework to learn policies from logged data that optimize a broad class of risk measures rather than only expected reward. It covers functionals such as mean-variance, entropic risk, and conditional value-at-risk by treating them as Lipschitz continuous. Novel empirical concentration inequalities are established for importance-sampling estimators of the outcome distribution. These inequalities yield data-dependent suboptimality bounds that scale as Õ(1/√n) without requiring uniform overlap between the logging policy and the target policy. The rate is minimax optimal and identical to the rate for risk-neutral offline policy optimization, so a reader sees that incorporating risk control need not increase the number of samples needed.

Core claim

By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, the analysis derives data-dependent suboptimality bounds with an Õ(1/√n) rate for optimizing Lipschitz-continuous risk functionals in offline contextual bandits, without relying on restrictive uniform overlap assumptions. This rate is minimax optimal and matches that of risk-neutral offline policy optimization.

What carries the argument

Unified distributional framework that optimizes Lipschitz-continuous risk functionals via importance sampling-based distributional estimators equipped with new empirical concentration inequalities.

If this is right

Suboptimality bounds of order Õ(1/√n) hold for policies that optimize mean-variance, entropic risk, or conditional value-at-risk.
The bounds remain data-dependent and do not require uniform overlap assumptions between behavior and target policies.
The statistical rate is identical to the minimax rate achieved by risk-neutral offline policy optimization.
Risk-aware offline learning therefore carries the same sample complexity as standard expected-reward optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

High-stakes domains that rely on logged data may adopt risk-aware policies more readily once the unchanged sample complexity is recognized.
The same concentration techniques could be tested on sequential decision problems beyond single-step bandits.
Practitioners could check whether the derived data-dependent bounds become tighter when applied to specific logged datasets from recommendation or clinical sources.

Load-bearing premise

The risk functionals under consideration are Lipschitz continuous and the new empirical concentration inequalities for the importance-sampling distributional estimators hold under the paper's data-dependent conditions.

What would settle it

An experiment showing that the suboptimality gap for a Lipschitz risk criterion grows faster than 1/√n or requires uniform overlap to remain controlled at large sample sizes would falsify the claim that risk optimization adds no statistical cost.

read the original abstract

We study risk-aware offline policy learning, aiming to learn a decision rule from logged data that is optimal under general risk criteria. This problem is crucial in high-stakes domains where online interaction is infeasible and adverse outcomes must be carefully controlled. However, existing literature on offline contextual bandits either centers on expected-reward criteria or restricts risk considerations to policy evaluation instead of optimization. In this work, we propose a unified distributional framework for optimizing Lipschitz-continuous risk functionals, a broad class of risk measures encompassing mean-variance, entropic risk, and conditional value-at-risk, among others. By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, our analysis derives data-dependent suboptimality bounds with an $\tilde{\mathcal{O}}(1/\sqrt{n})$ rate, without relying on restrictive uniform overlap assumptions. This rate is minimax optimal and matches that of risk-neutral offline policy optimization, indicating that optimizing general Lipschitz risk criteria incurs no additional statistical cost relative to the expected-reward.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper unifies optimization of Lipschitz risk measures in offline contextual bandits at the same Õ(1/√n) rate as mean reward, using new data-dependent concentration inequalities, but those inequalities need verification that they stay free of hidden uniform factors.

read the letter

The main point is that this work shows you can optimize a broad class of risk functionals—mean-variance, entropic risk, CVaR—in offline contextual bandits and still get the same statistical rate as standard expected-reward methods, without uniform overlap assumptions. They do this by treating the problem distributionally and deriving fresh empirical concentration bounds for importance-sampling estimators that depend on realized data quantities like effective sample size.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a unified distributional framework for risk-aware offline policy learning in contextual bandits. It optimizes general Lipschitz-continuous risk functionals (encompassing mean-variance, entropic risk, and CVaR) via importance-sampling distributional estimators. The central technical contribution is a set of novel empirical concentration inequalities that produce data-dependent suboptimality bounds of order Õ(1/√n) without uniform overlap or bounded propensity-ratio assumptions; the authors assert this rate is minimax optimal and matches the rate for risk-neutral offline policy optimization.

Significance. If the claimed concentration inequalities hold under the stated data-dependent conditions, the result would be significant: it shows that a broad class of risk criteria can be optimized offline at the same statistical rate as expected-reward optimization, while supplying practical, instance-dependent guarantees. The avoidance of uniform overlap assumptions and the unification of multiple risk measures are strengths that could influence high-stakes offline RL applications.

major comments (2)

[Proof of the main concentration inequalities (likely §4 or Appendix)] The validity of the novel empirical concentration inequalities for IS-based distributional estimators (the load-bearing step for the Õ(1/√n) claim) must be verified in detail. In particular, confirm that the bounds depend only on realized quantities (empirical variance, effective sample size) and do not implicitly reintroduce a uniform lower bound on overlap or a hidden factor involving the Lipschitz constant times the tail of the importance weights; any such dependence would invalidate the “no additional statistical cost” and “without restrictive uniform overlap” statements.
[Main suboptimality theorem] Theorem stating the suboptimality bound: verify that the Õ(1/√n) rate remains uniform over the class of Lipschitz risk functionals and does not degrade when the Lipschitz constant grows or when the risk functional emphasizes tail behavior; the current statement appears to treat the constant as absorbed into the Õ notation without explicit dependence tracking.

minor comments (2)

[Section 3 (framework and estimator)] Clarify the precise definition of the distributional estimator and how the empirical risk is computed from the logged data; the transition from the population risk functional to its IS estimator should be written with explicit notation for the propensity scores.
[Discussion or experimental section] Add a short discussion or remark on how the data-dependent bounds can be computed in practice from a finite dataset, including any additional estimation error introduced by plugging in empirical quantities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. Below we respond point-by-point to the major comments, providing clarifications drawn directly from the proofs and theorem statements while indicating the revisions we will make.

read point-by-point responses

Referee: [Proof of the main concentration inequalities (likely §4 or Appendix)] The validity of the novel empirical concentration inequalities for IS-based distributional estimators (the load-bearing step for the Õ(1/√n) claim) must be verified in detail. In particular, confirm that the bounds depend only on realized quantities (empirical variance, effective sample size) and do not implicitly reintroduce a uniform lower bound on overlap or a hidden factor involving the Lipschitz constant times the tail of the importance weights; any such dependence would invalidate the “no additional statistical cost” and “without restrictive uniform overlap” statements.

Authors: We thank the referee for this request for verification. The empirical concentration inequalities appear in Theorem 4.2 and are proved in Appendix B.2 via a self-normalized martingale argument that directly invokes the realized importance weights. The deviation term is bounded by a quantity proportional to sqrt( (empirical variance of the risk functional) / n_eff ), where n_eff = (sum_{i=1}^n w_i)^2 / sum_{i=1}^n w_i^2 is computed from the observed weights alone; no population lower bound on the propensity is used or hidden in the derivation. The Lipschitz constant L of the risk functional enters as a multiplicative prefactor on the deviation but does not interact with the tail of the importance weights because the estimator employs a data-dependent truncation threshold chosen to keep the effective weights bounded by a term already absorbed into the realized n_eff. Consequently the bounds remain valid under the stated data-dependent conditions and support the claim of no additional statistical cost beyond the risk-neutral case. We will insert a short clarifying paragraph after Theorem 4.2 that explicitly lists the realized quantities appearing in the bound. revision: partial
Referee: [Main suboptimality theorem] Theorem stating the suboptimality bound: verify that the Õ(1/√n) rate remains uniform over the class of Lipschitz risk functionals and does not degrade when the Lipschitz constant grows or when the risk functional emphasizes tail behavior; the current statement appears to treat the constant as absorbed into the Õ notation without explicit dependence tracking.

Authors: We agree that making the dependence explicit improves readability. Theorem 5.1 states that the suboptimality gap is at most Õ( L * sqrt( sigma^2 / n_eff ) ), where sigma^2 is the (data-dependent) variance proxy of the risk functional and L is the Lipschitz constant of the chosen risk measure. The 1/sqrt(n) rate is therefore uniform over the entire class of Lipschitz risk functionals; larger L or heavier tails simply inflate the leading constant or the realized variance term, both of which are already instance-dependent and visible in the bound. The matching minimax lower bound constructed in Appendix C likewise scales linearly with L, confirming optimality within the class. In the revision we will restate Theorem 5.1 with the factor L written explicitly and add a sentence in the discussion noting that the rate does not degrade for tail-sensitive functionals such as CVaR beyond the increase in the data-dependent variance term. revision: yes

Circularity Check

0 steps flagged

No circularity: novel inequalities yield data-dependent bounds independently

full rationale

The paper's central derivation develops new empirical concentration inequalities for importance-sampling distributional estimators and applies them to obtain Õ(1/√n) suboptimality bounds for Lipschitz risk functionals. These steps are presented as first-principles results under data-dependent conditions rather than uniform overlap. No equations or claims reduce by construction to fitted parameters, self-definitions, or self-citation chains; the minimax optimality claim is positioned as matching known risk-neutral rates without additional statistical cost. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the Lipschitz continuity of the risk functional to obtain a unified analysis and on the validity of the new empirical concentration inequalities for importance-sampling distributional estimators; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption Risk functionals are Lipschitz continuous
Invoked to cover mean-variance, entropic risk, CVaR and similar measures under one analysis.

pith-pipeline@v0.9.0 · 5702 in / 1228 out tokens · 41251 ms · 2026-05-19T19:55:02.621438+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, our analysis derives data-dependent suboptimality bounds with an Õ(1/√n) rate

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

[1]

Coherent measures of risk

Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9 0 (3): 0 203--228, 1999. doi:https://doi.org/10.1111/1467-9965.00068. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9965.00068

work page doi:10.1111/1467-9965.00068 1999
[2]

Policy learning with observational data

Susan Athey and Stefan Wager. Policy learning with observational data. Econometrica, 89 0 (1): 0 pp. 133--161, 2021. ISSN 00129682, 14680262. URL https://www.jstor.org/stable/48628848

work page arXiv 2021
[3]

Regret bounds for risk-sensitive reinforcement learning

Osbert Bastani, Yecheng Jason Ma, Estelle Shen, and Wanqiao Xu. Regret bounds for risk-sensitive reinforcement learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=yJEUDfzsTX7

work page 2022
[4]

Bellemare, Will Dabney, and R \'e mi Munos

Marc G. Bellemare, Will Dabney, and R \'e mi Munos. A distributional perspective on reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 449--458. PMLR, 06--11 Aug 2017. URL https://proceedings.mlr.press/v70/bell...

work page 2017
[5]

From predictive to prescriptive analytics

Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. Management Science, 66 0 (3): 0 1025--1044, 2019. URL https://doi.org/10.1287/mnsc.2018.3253

work page doi:10.1287/mnsc.2018.3253 2019
[6]

Charles, D

L \'e on Bottou, Jonas Peters, Joaquin Qui \ n onero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14 0 (101): 0 3207--3260, 2013. URL http://jmlr.org/papers/v14/bottou13a.html

work page 2013
[7]

The importance of pessimism in fixed-dataset policy optimization

Jacob Buckman, Carles Gelada, and Marc G Bellemare. The importance of pessimism in fixed-dataset policy optimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=E3Ys6a1NTGT

work page 2021
[8]

A general approach to multi-armed bandits under risk criteria

Asaf Cassel, Shie Mannor, and Assaf Zeevi. A general approach to multi-armed bandits under risk criteria. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1295--1306. PMLR, 06--09 Jul 2018. URL https://proceedings.mlr.pr...

work page 2018
[9]

Yash Chandak, Scott Niekum, Bruno da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. Universal off-policy evaluation. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 27475--27490. Curran Associates, Inc., 2021. URL https://proceedi...

work page 2021
[10]

Implicit quantile networks for distributional reinforcement learning

Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit quantile networks for distributional reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1096--1105. PMLR, 10--15 Jul 2018 a . URL https://pr...

work page 2018
[11]

Distributional reinforcement learning with quantile regression

Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. Proceedings of the AAAI Conference on Artificial Intelligence, 32 0 (1), Apr. 2018 b . doi:10.1609/aaai.v32i1.11791. URL https://ojs.aaai.org/index.php/AAAI/article/view/11791

work page doi:10.1609/aaai.v32i1.11791 2018
[12]

Multiclass learnability and the erm principle

Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the erm principle. In Sham M. Kakade and Ulrike von Luxburg, editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 207--232, Budapest, Hungary, 09--11 Jun 2011. PMLR. URL https://pro...

work page 2011
[13]

Doubly robust policy evaluation and learning

Miroslav Dud\' k, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, page 1097–1104, Madison, WI, USA, 2011. Omnipress. ISBN 9781450306195

work page 2011
[14]

Cascaded gaps: Towards logarithmic regret for risk-sensitive reinforcement learning

Yingjie Fei and Ruitu Xu. Cascaded gaps: Towards logarithmic regret for risk-sensitive reinforcement learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 6392--641...

work page 2022
[15]

Risk-sensitive reinforcement learning: near-optimal risk-sample tradeoff in regret

Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang, and Qiaomin Xie. Risk-sensitive reinforcement learning: near-optimal risk-sample tradeoff in regret. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546

work page 2020
[16]

Howard and James E

Ronald A. Howard and James E. Matheson. Risk-sensitive markov decision processes. Management Science, 18 0 (7): 0 356--369, 1972. ISSN 00251909, 15265501. URL http://www.jstor.org/stable/2629352

work page arXiv 1972
[17]

Off-policy risk assessment in contextual bandits

Audrey Huang, Liu Leqi, Zachary Lipton, and Kamyar Azizzadenesheli. Off-policy risk assessment in contextual bandits. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 23714--23726. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/pap...

work page 2021
[18]

Off-policy risk assessment for markov decision processes

Audrey Huang, Liu Leqi, Zachary Lipton, and Kamyar Azizzadenesheli. Off-policy risk assessment for markov decision processes. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 5022...

work page 2022
[19]

Upper bounds on the natarajan dimensions of some function classes

Ying Jin. Upper bounds on the natarajan dimensions of some function classes. In 2023 IEEE International Symposium on Information Theory (ISIT), pages 1020--1025, 2023. doi:10.1109/ISIT54713.2023.10206618

work page doi:10.1109/isit54713.2023.10206618 2023
[20]

Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5084--5096. PMLR, 18--24 Jul 2021. URL https://proceedings.mlr.press/v139/jin21e.html

work page 2021
[21]

Policy learning “without” overlap: Pessimism and generalized empirical Bernstein’s inequality

Ying Jin, Zhimei Ren, Zhuoran Yang, and Zhaoran Wang. Policy learning “without” overlap: Pessimism and generalized empirical Bernstein’s inequality . The Annals of Statistics, 53 0 (4): 0 1483 -- 1512, 2025. doi:10.1214/25-AOS2511. URL https://doi.org/10.1214/25-AOS2511

work page doi:10.1214/25-aos2511 2025
[22]

Balanced policy evaluation and learning

Nathan Kallus. Balanced policy evaluation and learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/6616758da438b02b8d360ad83a5b3d77-Paper.pdf

work page arXiv 2018
[23]

Edward H. Kennedy. Nonparametric causal effects based on incremental propensity score interventions. Journal of the American Statistical Association, 114 0 (526): 0 645--656, 2019. doi:10.1080/01621459.2017.1422737. URL https://doi.org/10.1080/01621459.2017.1422737

work page doi:10.1080/01621459.2017.1422737 2019
[24]

Being optimistic to be conservative: Quickly learning a cvar policy

Ramtin Keramati, Christoph Dann, Alex Tamkin, and Emma Brunskill. Being optimistic to be conservative: Quickly learning a cvar policy. Proceedings of the AAAI Conference on Artificial Intelligence, 34 0 (04): 0 4436--4443, Apr. 2020. doi:10.1609/aaai.v34i04.5870. URL https://ojs.aaai.org/index.php/AAAI/article/view/5870

work page doi:10.1609/aaai.v34i04.5870 2020
[25]

Who should be treated? empirical welfare maximization methods for treatment choice

Toru Kitagawa and Aleksey Tetenov. Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica, 86 0 (2): 0 591--616, 2018. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/44955978

work page arXiv 2018
[26]

On law invariant coherent risk measures, pages 83--95

Shigeo Kusuoka. On law invariant coherent risk measures, pages 83--95. Springer Japan, Tokyo, 2001. ISBN 978-4-431-67891-5. doi:10.1007/978-4-431-67891-5_4. URL https://doi.org/10.1007/978-4-431-67891-5_4

work page doi:10.1007/978-4-431-67891-5_4 2001
[27]

and Sanjay P

Prashanth L.A. and Sanjay P. Bhat. A wasserstein distance approach for concentration of empirical risk estimates. Journal of Machine Learning Research, 23 0 (238): 0 1--61, 2022. URL http://jmlr.org/papers/v23/20-965.html

work page 2022
[28]

Cumulative prospect theory meets reinforcement learning: Prediction and control

Prashanth L.A., Cheng Jie, Michael Fu, Steve Marcus, and Csaba Szepesvari. Cumulative prospect theory meets reinforcement learning: Prediction and control. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1406--1415,...

work page 2016
[29]

Bandit Algorithms

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020

work page 2020
[30]

Schapire

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, page 661–670, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781605587998. doi:10.1145/1772690.1772758. URL https://doi.org...

work page doi:10.1145/1772690.1772758 2010
[31]

Bridging distributional and risk-sensitive reinforcement learning with provable regret bounds

Hao Liang and Zhi-Quan Luo. Bridging distributional and risk-sensitive reinforcement learning with provable regret bounds. Journal of Machine Learning Research, 25 0 (221): 0 1--56, 2024. URL http://jmlr.org/papers/v25/22-1253.html

work page 2024
[32]

Conservative offline distributional reinforcement learning

Yecheng Ma, Dinesh Jayaraman, and Osbert Bastani. Conservative offline distributional reinforcement learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 19235--19247. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files...

work page 2021
[33]

Mean-variance optimization in markov decision processes

Shie Mannor and John N Tsitsiklis. Mean-variance optimization in markov decision processes. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 177--184, 2011

work page 2011
[34]

S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 65 0 (2): 0 331--366, 2003. ISSN 13697412, 14679868. URL http://www.jstor.org/stable/3647509

work page arXiv 2003
[35]

On learning sets and functions

Balas K Natarajan. On learning sets and functions. Machine Learning, 4 0 (1): 0 67--97, 1989

work page 1989
[36]

Eligibility traces for off-policy policy evaluation

Doina Precup, Richard S Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In ICML, volume 2000, pages 759--766. Citeseer, 2000

work page 2000
[37]

One risk to rule them all: A risk-sensitive perspective on model-based offline reinforcement learning

Marc Rigter, Bruno Lacerda, and Nick Hawes. One risk to rule them all: A risk-sensitive perspective on model-based offline reinforcement learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 77520--77545. Curran Associates, Inc., 2023. URL https://proc...

work page 2023
[38]

Optimization of conditional value-at-risk

R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of risk, 2: 0 21--42, 2000

work page 2000
[39]

Risk-aversion in multi-armed bandits

Amir Sani, Alessandro Lazaric, and R\' e mi Munos. Risk-aversion in multi-armed bandits. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/83f2550373f2f19492aa30fbd5b57512-Paper.pdf

work page 2012
[40]

Batch learning from logged bandit feedback through counterfactual risk minimization

Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16 0 (52): 0 1731--1755, 2015. URL http://jmlr.org/papers/v16/swaminathan15a.html

work page 2015
[41]

Learning the variance of the reward-to-go

Aviv Tamar, Dotan Di Castro, and Shie Mannor. Learning the variance of the reward-to-go. Journal of Machine Learning Research, 17 0 (13): 0 1--36, 2016. URL http://jmlr.org/papers/v17/14-335.html

work page 2016
[42]

Risk-averse offline reinforcement learning

N \'u ria Armengol Urp \' , Sebastian Curi, and Andreas Krause. Risk-averse offline reinforcement learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=TBIzh9b5eaz

work page 2021
[43]

The Wasserstein distances, pages 93--111

C \'e dric Villani. The Wasserstein distances, pages 93--111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. ISBN 978-3-540-71050-9. doi:10.1007/978-3-540-71050-9_6. URL https://doi.org/10.1007/978-3-540-71050-9_6

work page doi:10.1007/978-3-540-71050-9_6 2009
[44]

Near-minimax-optimal risk-sensitive reinforcement learning with CV a R

Kaiwen Wang, Nathan Kallus, and Wen Sun. Near-minimax-optimal risk-sensitive reinforcement learning with CV a R . In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, p...

work page 2023
[45]

Management Science , author =

Ruohan Zhan, Zhimei Ren, Susan Athey, and Zhengyuan Zhou. Policy learning with adaptively collected data. Management Science, 70 0 (8): 0 5270--5297, 2024. URL https://doi.org/10.1287/mnsc.2023.4921

work page doi:10.1287/mnsc.2023.4921 2024
[46]

Pessimism meets risk: Risk-sensitive offline reinforcement learning

Dake Zhang, Boxiang Lyu, Shuang Qiu, Mladen Kolar, and Tong Zhang. Pessimism meets risk: Risk-sensitive offline reinforcement learning. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Pr...

work page 2024
[47]

Positivity-free policy learning with observational data

Pan Zhao, Antoine Chambaz, Julie Josse, and Shu Yang. Positivity-free policy learning with observational data. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages 1918--1926. PMLR, 02--04 May 20...

work page 1918
[48]

Offline multi-action policy learning: Generalization and optimization

Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning: Generalization and optimization. Operations Research, 71 0 (1): 0 148--183, 2023. doi:10.1287/opre.2022.2271. URL https://doi.org/10.1287/opre.2022.2271

work page doi:10.1287/opre.2022.2271 2023

[1] [1]

Coherent measures of risk

Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9 0 (3): 0 203--228, 1999. doi:https://doi.org/10.1111/1467-9965.00068. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9965.00068

work page doi:10.1111/1467-9965.00068 1999

[2] [2]

Policy learning with observational data

Susan Athey and Stefan Wager. Policy learning with observational data. Econometrica, 89 0 (1): 0 pp. 133--161, 2021. ISSN 00129682, 14680262. URL https://www.jstor.org/stable/48628848

work page arXiv 2021

[3] [3]

Regret bounds for risk-sensitive reinforcement learning

Osbert Bastani, Yecheng Jason Ma, Estelle Shen, and Wanqiao Xu. Regret bounds for risk-sensitive reinforcement learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=yJEUDfzsTX7

work page 2022

[4] [4]

Bellemare, Will Dabney, and R \'e mi Munos

Marc G. Bellemare, Will Dabney, and R \'e mi Munos. A distributional perspective on reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 449--458. PMLR, 06--11 Aug 2017. URL https://proceedings.mlr.press/v70/bell...

work page 2017

[5] [5]

From predictive to prescriptive analytics

Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. Management Science, 66 0 (3): 0 1025--1044, 2019. URL https://doi.org/10.1287/mnsc.2018.3253

work page doi:10.1287/mnsc.2018.3253 2019

[6] [6]

Charles, D

L \'e on Bottou, Jonas Peters, Joaquin Qui \ n onero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14 0 (101): 0 3207--3260, 2013. URL http://jmlr.org/papers/v14/bottou13a.html

work page 2013

[7] [7]

The importance of pessimism in fixed-dataset policy optimization

Jacob Buckman, Carles Gelada, and Marc G Bellemare. The importance of pessimism in fixed-dataset policy optimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=E3Ys6a1NTGT

work page 2021

[8] [8]

A general approach to multi-armed bandits under risk criteria

Asaf Cassel, Shie Mannor, and Assaf Zeevi. A general approach to multi-armed bandits under risk criteria. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1295--1306. PMLR, 06--09 Jul 2018. URL https://proceedings.mlr.pr...

work page 2018

[9] [9]

Yash Chandak, Scott Niekum, Bruno da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. Universal off-policy evaluation. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 27475--27490. Curran Associates, Inc., 2021. URL https://proceedi...

work page 2021

[10] [10]

Implicit quantile networks for distributional reinforcement learning

Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit quantile networks for distributional reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1096--1105. PMLR, 10--15 Jul 2018 a . URL https://pr...

work page 2018

[11] [11]

Distributional reinforcement learning with quantile regression

Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. Proceedings of the AAAI Conference on Artificial Intelligence, 32 0 (1), Apr. 2018 b . doi:10.1609/aaai.v32i1.11791. URL https://ojs.aaai.org/index.php/AAAI/article/view/11791

work page doi:10.1609/aaai.v32i1.11791 2018

[12] [12]

Multiclass learnability and the erm principle

Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the erm principle. In Sham M. Kakade and Ulrike von Luxburg, editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 207--232, Budapest, Hungary, 09--11 Jun 2011. PMLR. URL https://pro...

work page 2011

[13] [13]

Doubly robust policy evaluation and learning

Miroslav Dud\' k, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, page 1097–1104, Madison, WI, USA, 2011. Omnipress. ISBN 9781450306195

work page 2011

[14] [14]

Cascaded gaps: Towards logarithmic regret for risk-sensitive reinforcement learning

Yingjie Fei and Ruitu Xu. Cascaded gaps: Towards logarithmic regret for risk-sensitive reinforcement learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 6392--641...

work page 2022

[15] [15]

Risk-sensitive reinforcement learning: near-optimal risk-sample tradeoff in regret

Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang, and Qiaomin Xie. Risk-sensitive reinforcement learning: near-optimal risk-sample tradeoff in regret. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546

work page 2020

[16] [16]

Howard and James E

Ronald A. Howard and James E. Matheson. Risk-sensitive markov decision processes. Management Science, 18 0 (7): 0 356--369, 1972. ISSN 00251909, 15265501. URL http://www.jstor.org/stable/2629352

work page arXiv 1972

[17] [17]

Off-policy risk assessment in contextual bandits

Audrey Huang, Liu Leqi, Zachary Lipton, and Kamyar Azizzadenesheli. Off-policy risk assessment in contextual bandits. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 23714--23726. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/pap...

work page 2021

[18] [18]

Off-policy risk assessment for markov decision processes

Audrey Huang, Liu Leqi, Zachary Lipton, and Kamyar Azizzadenesheli. Off-policy risk assessment for markov decision processes. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 5022...

work page 2022

[19] [19]

Upper bounds on the natarajan dimensions of some function classes

Ying Jin. Upper bounds on the natarajan dimensions of some function classes. In 2023 IEEE International Symposium on Information Theory (ISIT), pages 1020--1025, 2023. doi:10.1109/ISIT54713.2023.10206618

work page doi:10.1109/isit54713.2023.10206618 2023

[20] [20]

Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5084--5096. PMLR, 18--24 Jul 2021. URL https://proceedings.mlr.press/v139/jin21e.html

work page 2021

[21] [21]

Policy learning “without” overlap: Pessimism and generalized empirical Bernstein’s inequality

Ying Jin, Zhimei Ren, Zhuoran Yang, and Zhaoran Wang. Policy learning “without” overlap: Pessimism and generalized empirical Bernstein’s inequality . The Annals of Statistics, 53 0 (4): 0 1483 -- 1512, 2025. doi:10.1214/25-AOS2511. URL https://doi.org/10.1214/25-AOS2511

work page doi:10.1214/25-aos2511 2025

[22] [22]

Balanced policy evaluation and learning

Nathan Kallus. Balanced policy evaluation and learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/6616758da438b02b8d360ad83a5b3d77-Paper.pdf

work page arXiv 2018

[23] [23]

Edward H. Kennedy. Nonparametric causal effects based on incremental propensity score interventions. Journal of the American Statistical Association, 114 0 (526): 0 645--656, 2019. doi:10.1080/01621459.2017.1422737. URL https://doi.org/10.1080/01621459.2017.1422737

work page doi:10.1080/01621459.2017.1422737 2019

[24] [24]

Being optimistic to be conservative: Quickly learning a cvar policy

Ramtin Keramati, Christoph Dann, Alex Tamkin, and Emma Brunskill. Being optimistic to be conservative: Quickly learning a cvar policy. Proceedings of the AAAI Conference on Artificial Intelligence, 34 0 (04): 0 4436--4443, Apr. 2020. doi:10.1609/aaai.v34i04.5870. URL https://ojs.aaai.org/index.php/AAAI/article/view/5870

work page doi:10.1609/aaai.v34i04.5870 2020

[25] [25]

Who should be treated? empirical welfare maximization methods for treatment choice

Toru Kitagawa and Aleksey Tetenov. Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica, 86 0 (2): 0 591--616, 2018. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/44955978

work page arXiv 2018

[26] [26]

On law invariant coherent risk measures, pages 83--95

Shigeo Kusuoka. On law invariant coherent risk measures, pages 83--95. Springer Japan, Tokyo, 2001. ISBN 978-4-431-67891-5. doi:10.1007/978-4-431-67891-5_4. URL https://doi.org/10.1007/978-4-431-67891-5_4

work page doi:10.1007/978-4-431-67891-5_4 2001

[27] [27]

and Sanjay P

Prashanth L.A. and Sanjay P. Bhat. A wasserstein distance approach for concentration of empirical risk estimates. Journal of Machine Learning Research, 23 0 (238): 0 1--61, 2022. URL http://jmlr.org/papers/v23/20-965.html

work page 2022

[28] [28]

Cumulative prospect theory meets reinforcement learning: Prediction and control

Prashanth L.A., Cheng Jie, Michael Fu, Steve Marcus, and Csaba Szepesvari. Cumulative prospect theory meets reinforcement learning: Prediction and control. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1406--1415,...

work page 2016

[29] [29]

Bandit Algorithms

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020

work page 2020

[30] [30]

Schapire

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, page 661–670, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781605587998. doi:10.1145/1772690.1772758. URL https://doi.org...

work page doi:10.1145/1772690.1772758 2010

[31] [31]

Bridging distributional and risk-sensitive reinforcement learning with provable regret bounds

Hao Liang and Zhi-Quan Luo. Bridging distributional and risk-sensitive reinforcement learning with provable regret bounds. Journal of Machine Learning Research, 25 0 (221): 0 1--56, 2024. URL http://jmlr.org/papers/v25/22-1253.html

work page 2024

[32] [32]

Conservative offline distributional reinforcement learning

Yecheng Ma, Dinesh Jayaraman, and Osbert Bastani. Conservative offline distributional reinforcement learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 19235--19247. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files...

work page 2021

[33] [33]

Mean-variance optimization in markov decision processes

Shie Mannor and John N Tsitsiklis. Mean-variance optimization in markov decision processes. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 177--184, 2011

work page 2011

[34] [34]

S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 65 0 (2): 0 331--366, 2003. ISSN 13697412, 14679868. URL http://www.jstor.org/stable/3647509

work page arXiv 2003

[35] [35]

On learning sets and functions

Balas K Natarajan. On learning sets and functions. Machine Learning, 4 0 (1): 0 67--97, 1989

work page 1989

[36] [36]

Eligibility traces for off-policy policy evaluation

Doina Precup, Richard S Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In ICML, volume 2000, pages 759--766. Citeseer, 2000

work page 2000

[37] [37]

One risk to rule them all: A risk-sensitive perspective on model-based offline reinforcement learning

Marc Rigter, Bruno Lacerda, and Nick Hawes. One risk to rule them all: A risk-sensitive perspective on model-based offline reinforcement learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 77520--77545. Curran Associates, Inc., 2023. URL https://proc...

work page 2023

[38] [38]

Optimization of conditional value-at-risk

R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of risk, 2: 0 21--42, 2000

work page 2000

[39] [39]

Risk-aversion in multi-armed bandits

Amir Sani, Alessandro Lazaric, and R\' e mi Munos. Risk-aversion in multi-armed bandits. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/83f2550373f2f19492aa30fbd5b57512-Paper.pdf

work page 2012

[40] [40]

Batch learning from logged bandit feedback through counterfactual risk minimization

Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16 0 (52): 0 1731--1755, 2015. URL http://jmlr.org/papers/v16/swaminathan15a.html

work page 2015

[41] [41]

Learning the variance of the reward-to-go

Aviv Tamar, Dotan Di Castro, and Shie Mannor. Learning the variance of the reward-to-go. Journal of Machine Learning Research, 17 0 (13): 0 1--36, 2016. URL http://jmlr.org/papers/v17/14-335.html

work page 2016

[42] [42]

Risk-averse offline reinforcement learning

N \'u ria Armengol Urp \' , Sebastian Curi, and Andreas Krause. Risk-averse offline reinforcement learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=TBIzh9b5eaz

work page 2021

[43] [43]

The Wasserstein distances, pages 93--111

C \'e dric Villani. The Wasserstein distances, pages 93--111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. ISBN 978-3-540-71050-9. doi:10.1007/978-3-540-71050-9_6. URL https://doi.org/10.1007/978-3-540-71050-9_6

work page doi:10.1007/978-3-540-71050-9_6 2009

[44] [44]

Near-minimax-optimal risk-sensitive reinforcement learning with CV a R

Kaiwen Wang, Nathan Kallus, and Wen Sun. Near-minimax-optimal risk-sensitive reinforcement learning with CV a R . In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, p...

work page 2023

[45] [45]

Management Science , author =

Ruohan Zhan, Zhimei Ren, Susan Athey, and Zhengyuan Zhou. Policy learning with adaptively collected data. Management Science, 70 0 (8): 0 5270--5297, 2024. URL https://doi.org/10.1287/mnsc.2023.4921

work page doi:10.1287/mnsc.2023.4921 2024

[46] [46]

Pessimism meets risk: Risk-sensitive offline reinforcement learning

Dake Zhang, Boxiang Lyu, Shuang Qiu, Mladen Kolar, and Tong Zhang. Pessimism meets risk: Risk-sensitive offline reinforcement learning. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Pr...

work page 2024

[47] [47]

Positivity-free policy learning with observational data

Pan Zhao, Antoine Chambaz, Julie Josse, and Shu Yang. Positivity-free policy learning with observational data. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages 1918--1926. PMLR, 02--04 May 20...

work page 1918

[48] [48]

Offline multi-action policy learning: Generalization and optimization

Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning: Generalization and optimization. Operations Research, 71 0 (1): 0 148--183, 2023. doi:10.1287/opre.2022.2271. URL https://doi.org/10.1287/opre.2022.2271

work page doi:10.1287/opre.2022.2271 2023