Recognition: no theorem link
Off-Policy Learning with Limited Supply
Pith reviewed 2026-05-15 08:12 UTC · model grok-4.3
The pith
Greedy off-policy learning is suboptimal when supply is limited, and superior policies exist that allocate items based on relative expected rewards across users.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Conventional greedy OPL approaches may fail to maximize policy performance in limited-supply settings, and policies with superior performance must exist. OPLS addresses this by focusing on items with relatively higher expected rewards compared to other users, enabling more efficient allocation of items under limited supply.
What carries the argument
The OPLS selection rule that ranks items by their expected reward relative to other potential users rather than by absolute expected reward for the current user alone.
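The contrast can be sketched with a minimal toy (illustrative numbers and function names, not the paper's code): greedy ranks items by each user's absolute expected reward, while a relative rule in the spirit of OPLS subtracts each item's population-average reward before ranking.

```python
import numpy as np

# Hypothetical expected rewards r[user, item]; item 1 is scarce (supply 1).
r = np.array([
    [0.3, 0.5],  # user 0 mildly prefers item 1
    [0.2, 0.9],  # user 1 values item 1 far more
    [0.3, 0.4],
    [0.1, 0.2],
])
supply = np.array([4, 1])

def allocate(scores):
    """Serve users in arrival order; each takes the in-stock item
    with the highest score. Returns total expected reward collected."""
    stock = supply.copy()
    total = 0.0
    for u in range(len(r)):
        feasible = np.flatnonzero(stock > 0)
        item = feasible[np.argmax(scores[u, feasible])]
        stock[item] -= 1
        total += r[u, item]
    return total

greedy_total = allocate(r)                     # rank by absolute reward
relative_total = allocate(r - r.mean(axis=0))  # rank by reward relative to avg user

print(round(greedy_total, 2), round(relative_total, 2))  # → 1.1 1.6
```

Greedy spends the scarce item on user 0 (reward 0.5) and forfeits user 1's 0.9; the relative rule reserves it for user 1, raising total reward from 1.1 to 1.6.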
If this is right
- Off-policy algorithms must explicitly model supply limits to remain optimal in applications such as coupon allocation and e-commerce.
- Recommendation and advertising systems can improve total reward by reserving scarce items for users who value them relatively more.
- Learning remains possible from ordinary unconstrained logs even though the deployed policy must respect budgets.
- The performance gap between greedy and relative-reward policies grows as supply becomes tighter relative to demand.
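The last bullet can be illustrated with a deterministic toy (invented numbers, not the paper's experiment): a universally attractive item whose supply is progressively tightened.

```python
import numpy as np

# Four users, two items; item 1 is broadly attractive but supply-limited.
r = np.array([
    [0.4, 0.5],
    [0.1, 0.9],
    [0.1, 0.8],
    [0.4, 0.5],
])

def total_reward(scores, item1_supply):
    """Serve users in order; each takes the in-stock item with the best score."""
    stock = np.array([len(r), item1_supply])
    total = 0.0
    for u in range(len(r)):
        feasible = np.flatnonzero(stock > 0)
        item = feasible[np.argmax(scores[u, feasible])]
        stock[item] -= 1
        total += r[u, item]
    return total

relative = r - r.mean(axis=0)  # reward relative to the average user
gaps = []
for s in (4, 2, 1):
    gap = total_reward(relative, s) - total_reward(r, s)
    gaps.append(gap)
    print(f"item-1 supply={s}  relative-minus-greedy gap={gap:+.1f}")
```

With ample supply (s=4) greedy is per-user optimal and the relative rule slightly loses (gap -0.2); as supply tightens to 2 and then 1, the relative rule pulls ahead and the gap widens (+0.6, then +0.7), consistent with greedy being optimal only in the unconstrained regime.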
Where Pith is reading between the lines
- The same relative-valuation principle could extend to other scarce-resource problems such as dynamic inventory control or ride-sharing assignment.
- Relaxing the assumption of known supply limits to stochastic or adversarial constraints would test the robustness of the approach.
- The insight suggests redefining the objective in any sequential decision setting where early choices reduce later options.
Load-bearing premise
Logged data collected from an unconstrained behavior policy can be reused to learn a policy that correctly accounts for future users' relative valuations under limited supply without further assumptions on arrival order or reward distributions.
What would settle it
An experiment on a contextual bandit task with hard item budgets in which OPLS produces no higher cumulative reward than a standard greedy off-policy learner would falsify the claim of improved performance.
Original abstract
We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies off-policy learning (OPL) for contextual bandits in settings with limited item supply. It argues that standard greedy OPL, which selects the item with highest expected reward for the current user, can be suboptimal because it risks early depletion of high-value items needed for future users. The authors provide a theoretical analysis showing greedy failure and existence of superior policies, introduce Off-Policy Learning with Limited Supply (OPLS) that instead prioritizes items with relatively higher expected rewards across users, and report empirical outperformance versus baselines on synthetic and real-world datasets.
Significance. If the central claims hold, the work addresses a practically relevant gap between unconstrained OPL theory and inventory-constrained applications such as coupon allocation and e-commerce. Demonstrating that superior policies exist and can be learned via a simple relative-reward adjustment would be useful for practitioners. The empirical results on real datasets are a positive signal, but the theoretical analysis must be shown to be robust to the data-generating process actually present in logged unconstrained data.
Major comments (2)
- [Theoretical analysis] Theoretical analysis (abstract and §3): the proof that greedy OPL is suboptimal and that superior policies must exist relies on i.i.d. user arrivals and a known finite supply horizon. These conditions are not guaranteed by logged data collected under an unconstrained behavior policy, so the importance-weighted estimates used by OPLS may mis-rank items whose value depends on remaining supply.
- [OPLS method] OPLS derivation (likely §4): the method claims to recover relative valuations without additional assumptions, yet the off-policy correction from unconstrained data only estimates per-user rewards; it does not automatically recover the cross-user comparisons needed for depletion-aware allocation unless the arrival process is stationary and the horizon is known or estimable.
Minor comments (1)
- [Experiments] Empirical section: the abstract states outperformance on synthetic and real-world datasets but provides no dataset descriptions, number of runs, error bars, or statistical tests; these details are required for assessing the strength of the empirical claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each major comment below, clarifying the scope of our theoretical analysis and the derivation of OPLS while noting where revisions will strengthen the manuscript.
Point-by-point responses
Referee: Theoretical analysis (abstract and §3): the proof that greedy OPL is suboptimal and that superior policies must exist relies on i.i.d. user arrivals and a known finite supply horizon. These conditions are not guaranteed by logged data collected under an unconstrained behavior policy, so the importance-weighted estimates used by OPLS may mis-rank items whose value depends on remaining supply.
Authors: Our theoretical analysis in Section 3 establishes suboptimality of greedy policies and existence of superior ones under the online limited-supply setting with i.i.d. arrivals and known horizon; this serves to motivate the need for non-greedy allocation. The OPLS estimator applies importance weighting to logged unconstrained data to recover per-user expected rewards, then ranks actions by their relative value with respect to the empirical user distribution observed in the data. While the logged data need not satisfy the exact i.i.d. and known-horizon conditions used in the proof, the relative-ranking step provides a practical proxy for depletion awareness. We will add a dedicated paragraph in the revised Section 3 distinguishing the theoretical construction from the off-policy estimation procedure and discussing robustness under mild stationarity. revision: partial
Referee: OPLS derivation (likely §4): the method claims to recover relative valuations without additional assumptions, yet the off-policy correction from unconstrained data only estimates per-user rewards; it does not automatically recover the cross-user comparisons needed for depletion-aware allocation unless the arrival process is stationary and the horizon is known or estimable.
Authors: OPLS first obtains unbiased estimates of r(a,x) via inverse-propensity scoring. It then computes a relative score for each item as the estimated reward for the current context minus the average estimated reward for that item across the logged user population. This difference directly encodes a cross-user comparison without requiring an explicit remaining-supply state or horizon value. The approach therefore operates under the same assumptions as standard off-policy contextual-bandit estimators (overlap and correct propensity model) plus the implicit stationarity of the user distribution in the log. We agree that a more explicit statement of these conditions is warranted and will revise the derivation in Section 4 to include the precise definition of the relative score and its link to limited-supply optimality. revision: yes
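The two-step procedure the authors describe can be sketched on a small discrete context space. This is a hedged illustration, not the paper's implementation: the uniform logging policy, the cell-wise IPS estimator, and all variable names are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_log, n_items, n_ctx = 5000, 3, 4

# Hypothetical ground-truth expected rewards r(a, x).
true_r = rng.uniform(0.0, 1.0, size=(n_ctx, n_items))

# Logged data from an unconstrained uniform behavior policy.
x = rng.integers(0, n_ctx, size=n_log)            # contexts
a = rng.integers(0, n_items, size=n_log)          # logged actions
rew = rng.binomial(1, true_r[x, a]).astype(float)
prop = 1.0 / n_items                              # known propensity

# Step 1: IPS-style estimate of r(a, x) per (context, item) cell:
# average the importance-weighted rewards over rounds with that context.
r_hat = np.zeros((n_ctx, n_items))
for c in range(n_ctx):
    rounds = np.flatnonzero(x == c)
    for item in range(n_items):
        hit = rounds[a[rounds] == item]
        r_hat[c, item] = (rew[hit] / prop).sum() / len(rounds)

# Step 2: relative score = estimated reward for the current context
# minus the item's average estimated reward over the logged population.
pop_avg = r_hat[x].mean(axis=0)
relative_score = r_hat - pop_avg

greedy_action = r_hat.argmax(axis=1)              # per-context greedy choice
opls_style_action = relative_score.argmax(axis=1) # relative-reward choice
```

The relative score needs only quantities estimable from the log (per-context rewards and a population average), matching the rebuttal's claim that no explicit remaining-supply state or horizon enters the computation.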
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper's theoretical analysis shows suboptimality of greedy OPL under limited supply via direct comparison of allocation sequences, then defines OPLS as a new method using relative expected rewards rather than absolute ones. No step reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the off-policy correction and relative-ranking rule are introduced as independent modeling choices supported by the analysis and external empirical validation on synthetic and real datasets. The derivation chain remains self-contained against the stated assumptions.