Off-Policy Learning with Limited Supply
Pith reviewed 2026-05-21 11:12 UTC · model grok-4.3
The pith
In limited-supply contextual bandits, greedy off-policy methods that always pick the highest expected reward for the current user can deplete items before higher-value future users arrive.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In limited supply settings the optimal policy must account for opportunity costs across users rather than maximizing immediate expected reward alone. Conventional greedy OPL therefore fails to maximize performance, while policies that focus on items with relatively higher expected rewards compared to the other users achieve superior total reward. OPLS implements this insight by replacing pure greedy selection with a relative-value rule that reserves scarce items for users who stand to gain more from them.
What carries the argument
The relative expected reward rule in OPLS, which compares an item's value for the current user against its value for remaining users to decide allocation under supply limits.
If this is right
- Total reward increases when scarce items are withheld from low-relative-value users for later higher-value ones.
- Standard OPL methods that are optimal without supply constraints become suboptimal once budgets or inventory limits are imposed.
- Coupon allocation and e-commerce inventory decisions can be improved by replacing pure greedy selection with relative-value ranking.
- Empirical gains appear on both synthetic data and real recommendation traces when supply is capped.
Where Pith is reading between the lines
- The same relative-value logic could be tested in sequential resource scheduling problems outside bandits, such as cloud compute allocation.
- Incorporating uncertainty estimates around the relative rewards might further stabilize performance when reward models are noisy.
- The approach suggests that arrival-order robustness should be measured explicitly in any constrained bandit benchmark.
Load-bearing premise
That comparing an item's expected reward for the current user against its expected value to other users correctly captures future opportunity costs and avoids premature depletion.
What would settle it
A simulation with known reward distributions and a fixed item budget in which OPLS is run alongside a greedy baseline; if the greedy method achieves equal or higher cumulative reward, the claim that OPLS is superior under limited supply is falsified.
Figures
read the original abstract
We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in contextual bandits with limited supply, conventional greedy off-policy learning approaches can fail by depleting high-reward items early, and that superior policies exist. It introduces OPLS, which selects items with relatively higher expected rewards compared to other users to achieve more efficient allocation, supported by theoretical analysis and empirical results on synthetic and real-world datasets showing outperformance.
Significance. If the central claims hold, this work is significant for practical applications in recommendation systems and online advertising where item supply is constrained by inventory or budgets. The identification of the greedy failure mode and the proposal of a relative-reward based method provide a useful heuristic for better allocation. The empirical outperformance is a strength, though the method's performance relative to optimal dynamic programming solutions for the finite-horizon problem remains to be fully characterized.
major comments (1)
- [Theoretical Analysis] The analysis shows that greedy selection can exhaust high-reward items before higher-value future users arrive. However, OPLS is defined using comparisons from the marginal distribution of users rather than the joint distribution of remaining supply and arrivals. This makes it a one-step lookahead adjustment that does not solve the underlying dynamic program, which could still result in premature depletion in regimes with heterogeneous user types and tight supply.
minor comments (1)
- [Abstract] The abstract mentions 'theoretical analysis' and 'empirical results' but lacks specific details on the datasets or quantitative improvements, which would help assess the strength of the claims.
Simulated Author's Rebuttal
We thank the referee for the careful review and insightful comments on the theoretical foundations of our work. We address the major comment below and will incorporate clarifications into the revised manuscript.
read point-by-point responses
-
Referee: [Theoretical Analysis] The analysis shows that greedy selection can exhaust high-reward items before higher-value future users arrive. However, OPLS is defined using comparisons from the marginal distribution of users rather than the joint distribution of remaining supply and arrivals. This makes it a one-step lookahead adjustment that does not solve the underlying dynamic program, which could still result in premature depletion in regimes with heterogeneous user types and tight supply.
Authors: We agree that OPLS is a one-step lookahead heuristic that relies on the marginal user distribution rather than the full joint distribution over remaining supply and future arrivals, and thus does not solve the underlying finite-horizon dynamic program. Solving the exact DP is intractable in contextual settings with continuous or high-dimensional user features and limited supply, as the state space grows exponentially with the number of items and supply levels. Our contribution instead focuses on identifying the failure mode of standard greedy OPL and proposing a scalable adjustment based on relative expected rewards that can be estimated from off-policy data. We will revise the manuscript to explicitly state that OPLS is an approximation, discuss its relationship to the DP, and add a limitations paragraph noting that premature depletion remains possible under extreme heterogeneity and very tight supply. Additional synthetic experiments in such regimes can be included to quantify the gap. revision: partial
- Fully characterizing the suboptimality gap of OPLS relative to the optimal dynamic programming policy across all regimes of user heterogeneity and supply tightness.
Circularity Check
No significant circularity detected in the OPLS derivation
full rationale
The paper first identifies a failure mode of greedy OPL under limited supply via theoretical analysis, establishes existence of superior policies, and then defines OPLS as a re-ranking rule that compares an item's expected reward for the current user against its value to other users. This construction is explicit and motivated by the identified shortcoming rather than reducing to a fitted quantity or self-referential definition. No load-bearing self-citations, ansatzes smuggled via prior work, or predictions that collapse to inputs by construction appear in the abstract or described sections. The method remains a heuristic adjustment whose performance is evaluated empirically on separate data.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Shipra Agrawal and Nikhil Devanur. 2016. Linear contextual bandits with knap- sacks.Advances in neural information processing systems29 (2016). Off-Policy Learning with Limited Supply WWW ’26, April 13–17, 2026, Dubai, United Arab Emirates
work page 2016
-
[2]
Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. 2018. Bandits with knapsacks.Journal of the ACM (JACM)65, 3 (2018), 1–55
work page 2018
-
[3]
Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. 2014. Re- sourceful contextual bandits. InConference on Learning Theory. PMLR, 1109–1134
work page 2014
-
[4]
Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising.Journal of Machine Learning Research14, 11 (2013)
work page 2013
-
[5]
Konstantina Christakopoulou, Jaya Kawale, and Arindam Banerjee. 2017. Recom- mendation with capacity constraints. InProceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1439–1448
work page 2017
-
[6]
Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. 2014. Doubly Robust Policy Evaluation and Optimization.Statist. Sci.29, 4 (2014), 485–511
work page 2014
-
[7]
Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. 2018. More Robust Doubly Robust Off-Policy Evaluation. InProceedings of the 35th Interna- tional Conference on Machine Learning, Vol. 80. PMLR, 1447–1456
work page 2018
-
[8]
Nicolo Felicioni, Maurizio Ferrari Dacrema, Marcello Restelli, and Paolo Cre- monesi. 2022. Off-Policy Evaluation with Deficient Support Using Side Informa- tion.Advances in Neural Information Processing Systems35 (2022)
work page 2022
-
[9]
Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. 2022. KuaiRec: A fully-observed dataset and insights for evaluating recommender systems. InProceedings of the 31st ACM International Conference on Information & Knowledge Management. 540–550
work page 2022
-
[10]
Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B Testing for Recommender Systems. InPro- ceedings of the 11th ACM International Conference on Web Search and Data Mining. 198–206
work page 2018
-
[11]
Nathan Kallus, Yuta Saito, and Masatoshi Uehara. 2021. Optimal Off-Policy Evaluation from Multiple Logging Policies. InProceedings of the 38th International Conference on Machine Learning, Vol. 139. PMLR, 5247–5256
work page 2021
-
[12]
Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, and Yuta Saito. 2024. Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation. InInternational Conference on Learning Repre- sentations
work page 2024
-
[13]
Haruka Kiyohara, Masahiro Nomura, and Yuta Saito. 2024. Off-policy evaluation of slate bandit policies via optimizing abstraction. InProceedings of the ACM on Web Conference 2024. 3150–3161
work page 2024
-
[14]
Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. 2022. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. InProceedings of the 15th International Conference on Web Search and Data Mining
work page 2022
-
[15]
Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, and Yuta Saito. 2023. Off-Policy Evaluation of Ranking Policies under Diverse User Behavior. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1154–1163
work page 2023
-
[16]
Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. 2015. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web. 929–934
work page 2015
-
[17]
Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. 2018. Breaking the curse of horizon: infinite-horizon off-policy estimation. InProceedings of the 32nd International Conference on Neural Information Processing Systems. 5361–5371
work page 2018
-
[18]
Alberto Maria Metelli, Alessio Russo, and Marcello Restelli. 2021. Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning. Advances in Neural Information Processing Systems34 (2021)
work page 2021
-
[19]
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python.Journal of Machine Learning Re...
work page 2011
-
[20]
Yuta Saito, Himan Abdollahpouri, Jesse Anderton, Ben Carterette, and Mounia Lalmas. 2024. Long-term Off-Policy Evaluation and Learning. InProceedings of the ACM on Web Conference 2024. 3432–3443
work page 2024
-
[21]
Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. 2021. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track
work page 2021
-
[22]
Yuta Saito and Thorsten Joachims. 2021. Counterfactual Learning and Evaluation for Recommender Systems: Foundations, Implementations, and Recent Advances. InProceedings of the 15th ACM Conference on Recommender Systems. 828–830
work page 2021
-
[23]
Yuta Saito and Thorsten Joachims. 2022. Off-Policy Evaluation for Large Action Spaces via Embeddings. InInternational Conference on Machine Learning. PMLR, 19089–19122
work page 2022
- [24]
-
[25]
Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. 2021. Evaluating the Robustness of Off-Policy Evaluation. In Proceedings of the 15th ACM Conference on Recommender Systems. 114–123
work page 2021
- [26]
-
[27]
Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. 2020. Doubly Robust Off-Policy Evaluation with Shrinkage. InProceedings of the 37th International Conference on Machine Learning, Vol. 119. PMLR, 9167–9176
work page 2020
-
[28]
Yi Su, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. 2019. Cab: Continuous adaptive blending for policy evaluation and learning. InInternational Conference on Machine Learning, Vol. 84. 6005–6014
work page 2019
-
[29]
Adith Swaminathan and Thorsten Joachims. 2015. Batch learning from logged bandit feedback through counterfactual risk minimization.The Journal of Machine Learning Research16, 1 (2015), 1731–1755
work page 2015
-
[30]
Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk mini- mization: Learning from logged bandit feedback. InInternational Conference on Machine Learning. PMLR, 814–823
work page 2015
-
[31]
Adith Swaminathan and Thorsten Joachims. 2015. The Self-Normalized Estimator for Counterfactual Learning.Advances in Neural Information Processing Systems 28 (2015)
work page 2015
-
[32]
Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. 2017. Off-Policy Evaluation for Slate Recommendation. InAdvances in Neural Information Processing Systems, Vol. 30. 3632–3642
work page 2017
- [33]
-
[34]
Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudık. 2017. Optimal and adaptive off-policy evaluation in contextual bandits. InInternational Conference on Machine Learning. PMLR, 3589–3597
work page 2017
-
[35]
Huasen Wu, Rayadurgam Srikant, Xin Liu, and Chong Jiang. 2015. Algorithms with logarithmic or sublinear regret for constrained contextual bandits.Advances in Neural Information Processing Systems28 (2015)
work page 2015
-
[36]
Yafei Zhao and Long Yang. 2024. Constrained contextual bandit algorithm for limited-budget recommendation system.Engineering Applications of Artificial Intelligence128 (2024), 107558
work page 2024
-
[37]
Wenliang Zhong, Rong Jin, Cheng Yang, Xiaowei Yan, Qi Zhang, and Qiang Li
-
[38]
InProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Stock constrained recommendation in tmall. InProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2287–2296. WWW ’26, April 13–17, 2026, Dubai, United Arab Emirates. Koichi Tanaka et al. A Calculation ofgreedyin Table 1 We present the detailed calculation of the greedy method in Table 1. Let X={𝑥 1, 𝑥2, 𝑥3} a...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.