Off-Policy Learning with Limited Supply

Bushun Kawagishi; Koichi Tanaka; Nobuyuki Shimizu; Ren Kishimoto; Yasuo Yamamoto; Yusuke Narita; Yuta Saito

arxiv: 2603.18702 · v4 · pith:UFEZKARInew · submitted 2026-03-19 · 💻 cs.LG

Off-Policy Learning with Limited Supply

Koichi Tanaka , Ren Kishimoto , Bushun Kawagishi , Yusuke Narita , Yasuo Yamamoto , Nobuyuki Shimizu , Yuta Saito This is my paper

Pith reviewed 2026-05-21 11:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords off-policy learningcontextual banditslimited supplyrecommendation systemsonline advertisingpolicy optimizationconstrained allocation

0 comments

The pith

In limited-supply contextual bandits, greedy off-policy methods that always pick the highest expected reward for the current user can deplete items before higher-value future users arrive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard off-policy learning in contextual bandits assumes unlimited item availability, but when supply is constrained by budgets or inventory, greedy selection often leads to early depletion and lower total reward. A sympathetic reader would care because real systems such as coupon distribution and e-commerce must allocate scarce items across a stream of users. The authors prove that superior policies exist under these constraints and introduce OPLS, which selects items whose expected reward for the present user is high relative to its value for other users. Experiments on synthetic and real datasets confirm that this relative comparison yields better overall performance than prior OPL approaches.

Core claim

In limited supply settings the optimal policy must account for opportunity costs across users rather than maximizing immediate expected reward alone. Conventional greedy OPL therefore fails to maximize performance, while policies that focus on items with relatively higher expected rewards compared to the other users achieve superior total reward. OPLS implements this insight by replacing pure greedy selection with a relative-value rule that reserves scarce items for users who stand to gain more from them.

What carries the argument

The relative expected reward rule in OPLS, which compares an item's value for the current user against its value for remaining users to decide allocation under supply limits.

If this is right

Total reward increases when scarce items are withheld from low-relative-value users for later higher-value ones.
Standard OPL methods that are optimal without supply constraints become suboptimal once budgets or inventory limits are imposed.
Coupon allocation and e-commerce inventory decisions can be improved by replacing pure greedy selection with relative-value ranking.
Empirical gains appear on both synthetic data and real recommendation traces when supply is capped.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same relative-value logic could be tested in sequential resource scheduling problems outside bandits, such as cloud compute allocation.
Incorporating uncertainty estimates around the relative rewards might further stabilize performance when reward models are noisy.
The approach suggests that arrival-order robustness should be measured explicitly in any constrained bandit benchmark.

Load-bearing premise

That comparing an item's expected reward for the current user against its expected value to other users correctly captures future opportunity costs and avoids premature depletion.

What would settle it

A simulation with known reward distributions and a fixed item budget in which OPLS is run alongside a greedy baseline; if the greedy method achieves equal or higher cumulative reward, the claim that OPLS is superior under limited supply is falsified.

Figures

Figures reproduced from arXiv: 2603.18702 by Bushun Kawagishi, Koichi Tanaka, Nobuyuki Shimizu, Ren Kishimoto, Yasuo Yamamoto, Yusuke Narita, Yuta Saito.

**Figure 2.** Figure 2: The behavior of the conventional greedy method at [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: The behavior of OPLS at 𝑡 = 10, 30, 60 Initial Supply. To identify the situation where OPLS provides more improvements, we define the initial supply 𝑠1 by three different ways as follows. 𝑠1 =    𝑠max · E𝑝 (𝑥 ) [𝑞(𝑥, 𝑎)] max 𝑎∈A E𝑝 (𝑥 ) [𝑞(𝑥, 𝑎)] (proportional) 𝑠max · min 𝑎∈A E𝑝 (𝑥 ) [𝑞(𝑥, 𝑎)] √︁ E𝑝 (𝑥 ) [𝑞(𝑥, 𝑎)] (inverse proportional) random.uniform(1, 𝑠max) (random) , where 𝑠max is… view at source ↗

**Figure 4.** Figure 4: Comparisons of relative policy values with varying (a)action popularity, (b)the number of users, and (c)estimation [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Relative policy values varying max supply ( [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Comparisons of relative policy values with varying (a)the number of users and (b)estimation noise. The period [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Relative policy values varying max supply (𝑠𝑚𝑎𝑥 ) Additional Result [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Comparisons of relative policy values with varying (a) max supply ( [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

read the original abstract

We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags a practical failure mode in greedy off-policy learning under limited supply and offers OPLS as a relative-reward fix, but the adjustment looks like a one-step heuristic rather than a full solution to the inventory problem.

read the letter

The main takeaway is that standard off-policy methods can waste scarce items on early users when later arrivals would get more value from them, and the authors give a theoretical argument that better policies exist. They then define OPLS to re-rank items by comparing an item's expected reward for the current user against its value to the broader user population, aiming for more efficient allocation before supply runs out.

Referee Report

1 major / 1 minor

Summary. The paper claims that in contextual bandits with limited supply, conventional greedy off-policy learning approaches can fail by depleting high-reward items early, and that superior policies exist. It introduces OPLS, which selects items with relatively higher expected rewards compared to other users to achieve more efficient allocation, supported by theoretical analysis and empirical results on synthetic and real-world datasets showing outperformance.

Significance. If the central claims hold, this work is significant for practical applications in recommendation systems and online advertising where item supply is constrained by inventory or budgets. The identification of the greedy failure mode and the proposal of a relative-reward based method provide a useful heuristic for better allocation. The empirical outperformance is a strength, though the method's performance relative to optimal dynamic programming solutions for the finite-horizon problem remains to be fully characterized.

major comments (1)

[Theoretical Analysis] The analysis shows that greedy selection can exhaust high-reward items before higher-value future users arrive. However, OPLS is defined using comparisons from the marginal distribution of users rather than the joint distribution of remaining supply and arrivals. This makes it a one-step lookahead adjustment that does not solve the underlying dynamic program, which could still result in premature depletion in regimes with heterogeneous user types and tight supply.

minor comments (1)

[Abstract] The abstract mentions 'theoretical analysis' and 'empirical results' but lacks specific details on the datasets or quantitative improvements, which would help assess the strength of the claims.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for the careful review and insightful comments on the theoretical foundations of our work. We address the major comment below and will incorporate clarifications into the revised manuscript.

read point-by-point responses

Referee: [Theoretical Analysis] The analysis shows that greedy selection can exhaust high-reward items before higher-value future users arrive. However, OPLS is defined using comparisons from the marginal distribution of users rather than the joint distribution of remaining supply and arrivals. This makes it a one-step lookahead adjustment that does not solve the underlying dynamic program, which could still result in premature depletion in regimes with heterogeneous user types and tight supply.

Authors: We agree that OPLS is a one-step lookahead heuristic that relies on the marginal user distribution rather than the full joint distribution over remaining supply and future arrivals, and thus does not solve the underlying finite-horizon dynamic program. Solving the exact DP is intractable in contextual settings with continuous or high-dimensional user features and limited supply, as the state space grows exponentially with the number of items and supply levels. Our contribution instead focuses on identifying the failure mode of standard greedy OPL and proposing a scalable adjustment based on relative expected rewards that can be estimated from off-policy data. We will revise the manuscript to explicitly state that OPLS is an approximation, discuss its relationship to the DP, and add a limitations paragraph noting that premature depletion remains possible under extreme heterogeneity and very tight supply. Additional synthetic experiments in such regimes can be included to quantify the gap. revision: partial

standing simulated objections not resolved

Fully characterizing the suboptimality gap of OPLS relative to the optimal dynamic programming policy across all regimes of user heterogeneity and supply tightness.

Circularity Check

0 steps flagged

No significant circularity detected in the OPLS derivation

full rationale

The paper first identifies a failure mode of greedy OPL under limited supply via theoretical analysis, establishes existence of superior policies, and then defines OPLS as a re-ranking rule that compares an item's expected reward for the current user against its value to other users. This construction is explicit and motivated by the identified shortcoming rather than reducing to a fitted quantity or self-referential definition. No load-bearing self-citations, ansatzes smuggled via prior work, or predictions that collapse to inputs by construction appear in the abstract or described sections. The method remains a heuristic adjustment whose performance is evaluated empirically on separate data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5790 in / 1075 out tokens · 59044 ms · 2026-05-21T11:12:27.534131+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1]

Shipra Agrawal and Nikhil Devanur. 2016. Linear contextual bandits with knap- sacks.Advances in neural information processing systems29 (2016). Off-Policy Learning with Limited Supply WWW ’26, April 13–17, 2026, Dubai, United Arab Emirates

work page 2016
[2]

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. 2018. Bandits with knapsacks.Journal of the ACM (JACM)65, 3 (2018), 1–55

work page 2018
[3]

Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. 2014. Re- sourceful contextual bandits. InConference on Learning Theory. PMLR, 1109–1134

work page 2014
[4]

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising.Journal of Machine Learning Research14, 11 (2013)

work page 2013
[5]

Konstantina Christakopoulou, Jaya Kawale, and Arindam Banerjee. 2017. Recom- mendation with capacity constraints. InProceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1439–1448

work page 2017
[6]

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. 2014. Doubly Robust Policy Evaluation and Optimization.Statist. Sci.29, 4 (2014), 485–511

work page 2014
[7]

Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. 2018. More Robust Doubly Robust Off-Policy Evaluation. InProceedings of the 35th Interna- tional Conference on Machine Learning, Vol. 80. PMLR, 1447–1456

work page 2018
[8]

Nicolo Felicioni, Maurizio Ferrari Dacrema, Marcello Restelli, and Paolo Cre- monesi. 2022. Off-Policy Evaluation with Deficient Support Using Side Informa- tion.Advances in Neural Information Processing Systems35 (2022)

work page 2022
[9]

Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. 2022. KuaiRec: A fully-observed dataset and insights for evaluating recommender systems. InProceedings of the 31st ACM International Conference on Information & Knowledge Management. 540–550

work page 2022
[10]

Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B Testing for Recommender Systems. InPro- ceedings of the 11th ACM International Conference on Web Search and Data Mining. 198–206

work page 2018
[11]

Nathan Kallus, Yuta Saito, and Masatoshi Uehara. 2021. Optimal Off-Policy Evaluation from Multiple Logging Policies. InProceedings of the 38th International Conference on Machine Learning, Vol. 139. PMLR, 5247–5256

work page 2021
[12]

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, and Yuta Saito. 2024. Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation. InInternational Conference on Learning Repre- sentations

work page 2024
[13]

Haruka Kiyohara, Masahiro Nomura, and Yuta Saito. 2024. Off-policy evaluation of slate bandit policies via optimizing abstraction. InProceedings of the ACM on Web Conference 2024. 3150–3161

work page 2024
[14]

Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. 2022. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. InProceedings of the 15th International Conference on Web Search and Data Mining

work page 2022
[15]

Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, and Yuta Saito. 2023. Off-Policy Evaluation of Ranking Policies under Diverse User Behavior. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1154–1163

work page 2023
[16]

Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. 2015. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web. 929–934

work page 2015
[17]

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. 2018. Breaking the curse of horizon: infinite-horizon off-policy estimation. InProceedings of the 32nd International Conference on Neural Information Processing Systems. 5361–5371

work page 2018
[18]

Alberto Maria Metelli, Alessio Russo, and Marcello Restelli. 2021. Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning. Advances in Neural Information Processing Systems34 (2021)

work page 2021
[19]

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python.Journal of Machine Learning Re...

work page 2011
[20]

Yuta Saito, Himan Abdollahpouri, Jesse Anderton, Ben Carterette, and Mounia Lalmas. 2024. Long-term Off-Policy Evaluation and Learning. InProceedings of the ACM on Web Conference 2024. 3432–3443

work page 2024
[21]

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. 2021. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

work page 2021
[22]

Yuta Saito and Thorsten Joachims. 2021. Counterfactual Learning and Evaluation for Recommender Systems: Foundations, Implementations, and Recent Advances. InProceedings of the 15th ACM Conference on Recommender Systems. 828–830

work page 2021
[23]

Yuta Saito and Thorsten Joachims. 2022. Off-Policy Evaluation for Large Action Spaces via Embeddings. InInternational Conference on Machine Learning. PMLR, 19089–19122

work page 2022
[24]

Yuta Saito, Qingyang Ren, and Thorsten Joachims. 2023. Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling.arXiv preprint arXiv:2305.08062 (2023)

work page arXiv 2023
[25]

Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. 2021. Evaluating the Robustness of Off-Policy Evaluation. In Proceedings of the 15th ACM Conference on Recommender Systems. 114–123

work page 2021
[26]

Yuta Saito, Jihan Yao, and Thorsten Joachims. 2024. POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition.arXiv preprint arXiv:2402.06151(2024)

work page arXiv 2024
[27]

Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. 2020. Doubly Robust Off-Policy Evaluation with Shrinkage. InProceedings of the 37th International Conference on Machine Learning, Vol. 119. PMLR, 9167–9176

work page 2020
[28]

Yi Su, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. 2019. Cab: Continuous adaptive blending for policy evaluation and learning. InInternational Conference on Machine Learning, Vol. 84. 6005–6014

work page 2019
[29]

Adith Swaminathan and Thorsten Joachims. 2015. Batch learning from logged bandit feedback through counterfactual risk minimization.The Journal of Machine Learning Research16, 1 (2015), 1731–1755

work page 2015
[30]

Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk mini- mization: Learning from logged bandit feedback. InInternational Conference on Machine Learning. PMLR, 814–823

work page 2015
[31]

Adith Swaminathan and Thorsten Joachims. 2015. The Self-Normalized Estimator for Counterfactual Learning.Advances in Neural Information Processing Systems 28 (2015)

work page 2015
[32]

Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. 2017. Off-Policy Evaluation for Slate Recommendation. InAdvances in Neural Information Processing Systems, Vol. 30. 3632–3642

work page 2017
[33]

Masatoshi Uehara, Chengchun Shi, and Nathan Kallus. 2022. A review of off- policy evaluation in reinforcement learning.arXiv preprint arXiv:2212.06355 (2022)

work page arXiv 2022
[34]

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudık. 2017. Optimal and adaptive off-policy evaluation in contextual bandits. InInternational Conference on Machine Learning. PMLR, 3589–3597

work page 2017
[35]

Huasen Wu, Rayadurgam Srikant, Xin Liu, and Chong Jiang. 2015. Algorithms with logarithmic or sublinear regret for constrained contextual bandits.Advances in Neural Information Processing Systems28 (2015)

work page 2015
[36]

Yafei Zhao and Long Yang. 2024. Constrained contextual bandit algorithm for limited-budget recommendation system.Engineering Applications of Artificial Intelligence128 (2024), 107558

work page 2024
[37]

Wenliang Zhong, Rong Jin, Cheng Yang, Xiaowei Yan, Qi Zhang, and Qiang Li

work page
[38]

InProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Stock constrained recommendation in tmall. InProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2287–2296. WWW ’26, April 13–17, 2026, Dubai, United Arab Emirates. Koichi Tanaka et al. A Calculation ofgreedyin Table 1 We present the detailed calculation of the greedy method in Table 1. Let X={𝑥 1, 𝑥2, 𝑥3} a...

work page 2026

[1] [1]

Shipra Agrawal and Nikhil Devanur. 2016. Linear contextual bandits with knap- sacks.Advances in neural information processing systems29 (2016). Off-Policy Learning with Limited Supply WWW ’26, April 13–17, 2026, Dubai, United Arab Emirates

work page 2016

[2] [2]

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. 2018. Bandits with knapsacks.Journal of the ACM (JACM)65, 3 (2018), 1–55

work page 2018

[3] [3]

Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. 2014. Re- sourceful contextual bandits. InConference on Learning Theory. PMLR, 1109–1134

work page 2014

[4] [4]

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising.Journal of Machine Learning Research14, 11 (2013)

work page 2013

[5] [5]

Konstantina Christakopoulou, Jaya Kawale, and Arindam Banerjee. 2017. Recom- mendation with capacity constraints. InProceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1439–1448

work page 2017

[6] [6]

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. 2014. Doubly Robust Policy Evaluation and Optimization.Statist. Sci.29, 4 (2014), 485–511

work page 2014

[7] [7]

Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. 2018. More Robust Doubly Robust Off-Policy Evaluation. InProceedings of the 35th Interna- tional Conference on Machine Learning, Vol. 80. PMLR, 1447–1456

work page 2018

[8] [8]

Nicolo Felicioni, Maurizio Ferrari Dacrema, Marcello Restelli, and Paolo Cre- monesi. 2022. Off-Policy Evaluation with Deficient Support Using Side Informa- tion.Advances in Neural Information Processing Systems35 (2022)

work page 2022

[9] [9]

Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. 2022. KuaiRec: A fully-observed dataset and insights for evaluating recommender systems. InProceedings of the 31st ACM International Conference on Information & Knowledge Management. 540–550

work page 2022

[10] [10]

Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B Testing for Recommender Systems. InPro- ceedings of the 11th ACM International Conference on Web Search and Data Mining. 198–206

work page 2018

[11] [11]

Nathan Kallus, Yuta Saito, and Masatoshi Uehara. 2021. Optimal Off-Policy Evaluation from Multiple Logging Policies. InProceedings of the 38th International Conference on Machine Learning, Vol. 139. PMLR, 5247–5256

work page 2021

[12] [12]

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, and Yuta Saito. 2024. Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation. InInternational Conference on Learning Repre- sentations

work page 2024

[13] [13]

Haruka Kiyohara, Masahiro Nomura, and Yuta Saito. 2024. Off-policy evaluation of slate bandit policies via optimizing abstraction. InProceedings of the ACM on Web Conference 2024. 3150–3161

work page 2024

[14] [14]

Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. 2022. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. InProceedings of the 15th International Conference on Web Search and Data Mining

work page 2022

[15] [15]

Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, and Yuta Saito. 2023. Off-Policy Evaluation of Ranking Policies under Diverse User Behavior. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1154–1163

work page 2023

[16] [16]

Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. 2015. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web. 929–934

work page 2015

[17] [17]

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. 2018. Breaking the curse of horizon: infinite-horizon off-policy estimation. InProceedings of the 32nd International Conference on Neural Information Processing Systems. 5361–5371

work page 2018

[18] [18]

Alberto Maria Metelli, Alessio Russo, and Marcello Restelli. 2021. Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning. Advances in Neural Information Processing Systems34 (2021)

work page 2021

[19] [19]

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python.Journal of Machine Learning Re...

work page 2011

[20] [20]

Yuta Saito, Himan Abdollahpouri, Jesse Anderton, Ben Carterette, and Mounia Lalmas. 2024. Long-term Off-Policy Evaluation and Learning. InProceedings of the ACM on Web Conference 2024. 3432–3443

work page 2024

[21] [21]

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. 2021. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

work page 2021

[22] [22]

Yuta Saito and Thorsten Joachims. 2021. Counterfactual Learning and Evaluation for Recommender Systems: Foundations, Implementations, and Recent Advances. InProceedings of the 15th ACM Conference on Recommender Systems. 828–830

work page 2021

[23] [23]

Yuta Saito and Thorsten Joachims. 2022. Off-Policy Evaluation for Large Action Spaces via Embeddings. InInternational Conference on Machine Learning. PMLR, 19089–19122

work page 2022

[24] [24]

Yuta Saito, Qingyang Ren, and Thorsten Joachims. 2023. Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling.arXiv preprint arXiv:2305.08062 (2023)

work page arXiv 2023

[25] [25]

Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. 2021. Evaluating the Robustness of Off-Policy Evaluation. In Proceedings of the 15th ACM Conference on Recommender Systems. 114–123

work page 2021

[26] [26]

Yuta Saito, Jihan Yao, and Thorsten Joachims. 2024. POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition.arXiv preprint arXiv:2402.06151(2024)

work page arXiv 2024

[27] [27]

Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. 2020. Doubly Robust Off-Policy Evaluation with Shrinkage. InProceedings of the 37th International Conference on Machine Learning, Vol. 119. PMLR, 9167–9176

work page 2020

[28] [28]

Yi Su, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. 2019. Cab: Continuous adaptive blending for policy evaluation and learning. InInternational Conference on Machine Learning, Vol. 84. 6005–6014

work page 2019

[29] [29]

Adith Swaminathan and Thorsten Joachims. 2015. Batch learning from logged bandit feedback through counterfactual risk minimization.The Journal of Machine Learning Research16, 1 (2015), 1731–1755

work page 2015

[30] [30]

Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk mini- mization: Learning from logged bandit feedback. InInternational Conference on Machine Learning. PMLR, 814–823

work page 2015

[31] [31]

Adith Swaminathan and Thorsten Joachims. 2015. The Self-Normalized Estimator for Counterfactual Learning.Advances in Neural Information Processing Systems 28 (2015)

work page 2015

[32] [32]

Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. 2017. Off-Policy Evaluation for Slate Recommendation. InAdvances in Neural Information Processing Systems, Vol. 30. 3632–3642

work page 2017

[33] [33]

Masatoshi Uehara, Chengchun Shi, and Nathan Kallus. 2022. A review of off- policy evaluation in reinforcement learning.arXiv preprint arXiv:2212.06355 (2022)

work page arXiv 2022

[34] [34]

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudık. 2017. Optimal and adaptive off-policy evaluation in contextual bandits. InInternational Conference on Machine Learning. PMLR, 3589–3597

work page 2017

[35] [35]

Huasen Wu, Rayadurgam Srikant, Xin Liu, and Chong Jiang. 2015. Algorithms with logarithmic or sublinear regret for constrained contextual bandits.Advances in Neural Information Processing Systems28 (2015)

work page 2015

[36] [36]

Yafei Zhao and Long Yang. 2024. Constrained contextual bandit algorithm for limited-budget recommendation system.Engineering Applications of Artificial Intelligence128 (2024), 107558

work page 2024

[37] [37]

Wenliang Zhong, Rong Jin, Cheng Yang, Xiaowei Yan, Qi Zhang, and Qiang Li

work page

[38] [38]

InProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Stock constrained recommendation in tmall. InProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2287–2296. WWW ’26, April 13–17, 2026, Dubai, United Arab Emirates. Koichi Tanaka et al. A Calculation ofgreedyin Table 1 We present the detailed calculation of the greedy method in Table 1. Let X={𝑥 1, 𝑥2, 𝑥3} a...

work page 2026