pith. machine review for the scientific record.

arxiv: 2603.18702 · v3 · submitted 2026-03-19 · 💻 cs.LG

Recognition: no theorem link

Off-Policy Learning with Limited Supply

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 08:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords off-policy learning · contextual bandits · limited supply · recommendation systems · online advertising · resource allocation · policy optimization

The pith

Greedy off-policy learning is suboptimal when supply is limited, and superior policies exist that allocate items based on relative expected rewards across users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Typical off-policy learning in contextual bandits assumes unlimited availability of each item, so the policy simply picks the action with the highest expected reward for the current context. When items are constrained by budgets or inventory, this greedy choice can exhaust a high-value item before future users who would value it even more have a chance to receive it. The paper proves that such greedy policies cannot be optimal under limited supply and shows that better-performing policies must exist. It introduces OPLS, which instead selects items whose expected reward is high relative to the other users who might claim them, producing more efficient allocation from the same logged data. Experiments on synthetic and real-world datasets confirm that OPLS yields higher overall performance than standard off-policy methods once supply constraints are active.
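To make the failure mode concrete, here is a minimal sketch with two users and two items (the numbers are invented for illustration; this is not an example from the paper):

```python
# Toy illustration (invented numbers): greedy depletion vs. relative allocation.
# Item A has supply 1; item B is plentiful. User u2 arrives after u1.
expected_reward = {
    ("u1", "A"): 0.90, ("u1", "B"): 0.50,
    ("u2", "A"): 0.95, ("u2", "B"): 0.10,
}

def run(policy):
    supply = {"A": 1, "B": 2}
    total = 0.0
    for user in ["u1", "u2"]:
        available = [a for a in supply if supply[a] > 0]
        item = policy(user, available)
        supply[item] -= 1
        total += expected_reward[(user, item)]
    return total

# Greedy: highest absolute expected reward for the current user.
greedy = lambda u, avail: max(avail, key=lambda a: expected_reward[(u, a)])

# Relative: prefer the item whose reward for this user most exceeds
# its average value across the user population.
def relative(u, avail):
    users = ["u1", "u2"]
    mean = {a: sum(expected_reward[(v, a)] for v in users) / len(users) for a in avail}
    return max(avail, key=lambda a: expected_reward[(u, a)] - mean[a])

print(run(greedy))    # 0.90 + 0.10 = 1.00  (A depleted on u1)
print(run(relative))  # 0.50 + 0.95 = 1.45  (A reserved for u2)
```

Greedy spends the scarce item on the first arrival who values it highly; the relative rule holds it for the user who values it most, lifting total reward from 1.00 to 1.45 in this toy.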

Core claim

Conventional greedy OPL approaches may fail to maximize policy performance in limited-supply settings, and policies with superior performance must exist. OPLS addresses this by focusing on items whose expected reward for the current user is high relative to their value for other users, enabling more efficient allocation of items with limited supply.

What carries the argument

The OPLS selection rule that ranks items by their expected reward relative to other potential users rather than by absolute expected reward for the current user alone.
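A minimal sketch of what such a rule could look like on top of a fitted reward model q̂; the population-mean baseline below is our reading of "relative to other potential users", not necessarily the paper's exact estimator:

```python
import numpy as np

def opls_scores(q_hat_current, q_hat_population):
    """Relative-value scores for one arriving user.

    q_hat_current:    (n_actions,) estimated rewards for this user's context.
    q_hat_population: (n_users, n_actions) estimated rewards over the logged
                      population, standing in for the other users who might
                      claim each item.
    """
    baseline = q_hat_population.mean(axis=0)  # average worth of each item
    return q_hat_current - baseline           # high score: this user values it unusually much

def opls_select(q_hat_current, q_hat_population, available):
    """Pick the highest-scoring item among those still in stock."""
    scores = np.where(available, opls_scores(q_hat_current, q_hat_population), -np.inf)
    return int(np.argmax(scores))
```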

If this is right

  • Off-policy algorithms must explicitly model supply limits to remain optimal in applications such as coupon allocation and e-commerce.
  • Recommendation and advertising systems can improve total reward by reserving scarce items for users who value them relatively more.
  • Learning remains possible from ordinary unconstrained logs even though the deployed policy must respect budgets.
  • The performance gap between greedy and relative-reward policies grows as supply becomes tighter relative to demand.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same relative-valuation principle could extend to other scarce-resource problems such as dynamic inventory control or ride-sharing assignment.
  • Relaxing the assumption of known supply limits to stochastic or adversarial constraints would test the robustness of the approach.
  • The insight suggests redefining the objective in any sequential decision setting where early choices reduce later options.

Load-bearing premise

Logged data collected from an unconstrained behavior policy can be reused to learn a policy that correctly accounts for future users' relative valuations under limited supply without further assumptions on arrival order or reward distributions.
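The reuse step this premise leans on is standard importance weighting; a minimal IPS sketch under the usual overlap and correct-propensity assumptions (variable names are ours):

```python
import numpy as np

def ips_value(rewards, propensities, pi_target_probs):
    """Inverse-propensity estimate of a target policy's value from logged data.

    pi_target_probs[i]: probability the target policy would take the logged
    action in context i. Requires overlap (propensities > 0 wherever the
    target policy has mass) and correctly recorded propensities.
    """
    weights = pi_target_probs / propensities  # importance weights
    return float(np.mean(weights * rewards))
```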

What would settle it

An experiment on a contextual bandit task with hard item budgets in which OPLS produces no higher cumulative reward than a standard greedy off-policy learner would falsify the claim of improved performance.
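That test is straightforward to phrase as a simulation harness. A hedged sketch, with an invented toy environment and an oracle reward model standing in for the learned one:

```python
import numpy as np

def simulate(policy_fn, seed=0, n_users=500, n_actions=5, d=3, budget=120):
    """One pass over a user stream with hard per-item budgets; returns cumulative
    realized reward. Toy environment invented for illustration (total capacity
    budget * n_actions exceeds n_users, so some action always remains)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_actions, d))            # true reward model, unknown to the learner
    q_of = lambda x: 1.0 / (1.0 + np.exp(-theta @ x))  # expected reward of each action
    # Population baseline: each action's average value over sampled contexts.
    baseline = np.mean([q_of(rng.normal(size=d)) for _ in range(1000)], axis=0)
    supply = np.full(n_actions, budget)
    total = 0
    for _ in range(n_users):
        x = rng.normal(size=d)
        q = q_of(x)
        a = policy_fn(q, baseline, supply)
        supply[a] -= 1
        total += rng.binomial(1, q[a])
    return total

def greedy(q, baseline, supply):    # baseline unused: absolute expected reward
    return int(np.argmax(np.where(supply > 0, q, -np.inf)))

def relative(q, baseline, supply):  # OPLS-style: reward relative to the population
    return int(np.argmax(np.where(supply > 0, q - baseline, -np.inf)))

for seed in range(3):
    print(simulate(greedy, seed), simulate(relative, seed))
```

The claim would be falsified only if the relative policy shows no cumulative-reward advantage over greedy once budgets bind; paired seeds keep the user streams identical across the two policies.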

Figures

Figures reproduced from arXiv: 2603.18702 by Bushun Kawagishi, Koichi Tanaka, Nobuyuki Shimizu, Ren Kishimoto, Yasuo Yamamoto, Yusuke Narita, Yuta Saito.

Figure 1. Relative policy value (V(π_OPLS)/V(π_greedy)) at each time step.
Figure 2. The behavior of the conventional greedy method at t = 10, 30, 60.
Figure 3. The behavior of OPLS at t = 10, 30, 60. The caption also defines the initial supply s_1 in three ways: proportional to each item's average expected reward E_p(x)[q(x, a)], scaled by s_max and normalized by the maximum over actions; inversely proportional to it; or drawn uniformly at random from [1, s_max], where s_max is the maximum supply.
Figure 4. Comparisons of relative policy values with varying (a) action popularity, (b) the number of users, and (c) estimation noise.
Figure 5. Relative policy values with varying max supply (s_max).
Figure 6. Comparisons of relative policy values with varying (a) the number of users and (b) estimation noise.
Figure 7. Relative policy values with varying max supply (s_max); additional results.
Figure 8. Comparisons of relative policy values with varying (a) max supply (s_max) …
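The three initial-supply schemes named in Figure 3's caption can be sketched from a fitted mean-reward vector; the inverse-proportional scaling below is an assumed form and may not match the paper's exact normalization:

```python
import numpy as np

def initial_supply(mean_q, s_max, mode, rng=None):
    """Per-item initial supply s_1 under the three schemes of Figure 3.
    mean_q[a] approximates E_{p(x)}[q(x, a)]. The inverse-proportional
    scaling is an assumption, not a transcription of the paper."""
    mean_q = np.asarray(mean_q, dtype=float)
    if mode == "proportional":
        return s_max * mean_q / mean_q.max()
    if mode == "inverse_proportional":
        return s_max * mean_q.min() / mean_q
    if mode == "random":
        rng = rng or np.random.default_rng()
        return rng.uniform(1, s_max, size=mean_q.shape)
    raise ValueError(f"unknown mode: {mode}")
```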
Original abstract

We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper studies off-policy learning (OPL) for contextual bandits in settings with limited item supply. It argues that standard greedy OPL, which selects the item with highest expected reward for the current user, can be suboptimal because it risks early depletion of high-value items needed for future users. The authors provide a theoretical analysis showing greedy failure and existence of superior policies, introduce Off-Policy Learning with Limited Supply (OPLS) that instead prioritizes items with relatively higher expected rewards across users, and report empirical outperformance versus baselines on synthetic and real-world datasets.

Significance. If the central claims hold, the work addresses a practically relevant gap between unconstrained OPL theory and inventory-constrained applications such as coupon allocation and e-commerce. Demonstrating that superior policies exist and can be learned via a simple relative-reward adjustment would be useful for practitioners. The empirical results on real datasets are a positive signal, but the theoretical analysis must be shown to be robust to the data-generating process actually present in logged unconstrained data.

major comments (2)
  1. [Theoretical analysis] Theoretical analysis (abstract and §3): the proof that greedy OPL is suboptimal and that superior policies must exist relies on i.i.d. user arrivals and a known finite supply horizon. These conditions are not guaranteed by logged data collected under an unconstrained behavior policy, so the importance-weighted estimates used by OPLS may mis-rank items whose value depends on remaining supply.
  2. [OPLS method] OPLS derivation (likely §4): the method claims to recover relative valuations without additional assumptions, yet the off-policy correction from unconstrained data only estimates per-user rewards; it does not automatically recover the cross-user comparisons needed for depletion-aware allocation unless the arrival process is stationary and the horizon is known or estimable.
minor comments (1)
  1. [Experiments] Empirical section: the abstract states outperformance on synthetic and real-world datasets but provides no dataset descriptions, number of runs, error bars, or statistical tests; these details are required for assessing the strength of the empirical claims.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major comment below, clarifying the scope of our theoretical analysis and the derivation of OPLS while noting where revisions will strengthen the manuscript.

Point-by-point responses
  1. Referee: Theoretical analysis (abstract and §3): the proof that greedy OPL is suboptimal and that superior policies must exist relies on i.i.d. user arrivals and a known finite supply horizon. These conditions are not guaranteed by logged data collected under an unconstrained behavior policy, so the importance-weighted estimates used by OPLS may mis-rank items whose value depends on remaining supply.

    Authors: Our theoretical analysis in Section 3 establishes suboptimality of greedy policies and existence of superior ones under the online limited-supply setting with i.i.d. arrivals and known horizon; this serves to motivate the need for non-greedy allocation. The OPLS estimator applies importance weighting to logged unconstrained data to recover per-user expected rewards, then ranks actions by their relative value with respect to the empirical user distribution observed in the data. While the logged data need not satisfy the exact i.i.d. and known-horizon conditions used in the proof, the relative-ranking step provides a practical proxy for depletion awareness. We will add a dedicated paragraph in the revised Section 3 distinguishing the theoretical construction from the off-policy estimation procedure and discussing robustness under mild stationarity. revision: partial

  2. Referee: OPLS derivation (likely §4): the method claims to recover relative valuations without additional assumptions, yet the off-policy correction from unconstrained data only estimates per-user rewards; it does not automatically recover the cross-user comparisons needed for depletion-aware allocation unless the arrival process is stationary and the horizon is known or estimable.

    Authors: OPLS first obtains unbiased estimates of r(a,x) via inverse-propensity scoring. It then computes a relative score for each item as the estimated reward for the current context minus the average estimated reward for that item across the logged user population. This difference directly encodes a cross-user comparison without requiring an explicit remaining-supply state or horizon value. The approach therefore operates under the same assumptions as standard off-policy contextual-bandit estimators (overlap and correct propensity model) plus the implicit stationarity of the user distribution in the log. We agree that a more explicit statement of these conditions is warranted and will revise the derivation in Section 4 to include the precise definition of the relative score and its link to limited-supply optimality. revision: yes
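A minimal end-to-end sketch of the pipeline as the response describes it, assuming binary rewards and a logistic reward model (both our choices; the paper's estimator may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_q_hat(X, actions, rewards, propensities, n_actions):
    """One reward model per action, fit on the logged rounds where that action
    was taken, with samples weighted by 1/propensity to undo the behavior
    policy's preferences (our rendering of the rebuttal's IPS step)."""
    models = []
    for a in range(n_actions):
        mask = actions == a
        model = LogisticRegression()
        model.fit(X[mask], rewards[mask], sample_weight=1.0 / propensities[mask])
        models.append(model)
    return models

def relative_scores(models, x_new, X_log):
    """q_hat(x_new, a) minus the average q_hat(x', a) over the logged
    population: the cross-user comparison the response describes."""
    q_new = np.array([m.predict_proba(x_new.reshape(1, -1))[0, 1] for m in models])
    q_pop = np.array([m.predict_proba(X_log)[:, 1].mean() for m in models])
    return q_new - q_pop
```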

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper's theoretical analysis shows suboptimality of greedy OPL under limited supply via direct comparison of allocation sequences, then defines OPLS as a new method using relative expected rewards rather than absolute ones. No step reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the off-policy correction and relative-ranking rule are introduced as independent modeling choices supported by the analysis and external empirical validation on synthetic and real datasets. The derivation chain remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the full paper is required for a complete audit.

pith-pipeline@v0.9.0 · 5559 in / 941 out tokens · 32591 ms · 2026-05-15T08:12:12.640979+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

  1. [1]

    Shipra Agrawal and Nikhil Devanur. 2016. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems 29 (2016).

  2. [2]

    Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. 2018. Bandits with knapsacks. Journal of the ACM (JACM) 65, 3 (2018), 1–55.

  3. [3]

    Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. 2014. Resourceful contextual bandits. In Conference on Learning Theory. PMLR, 1109–1134.

  4. [4]

    Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Journal of Machine Learning Research 14, 11 (2013).

  5. [5]

    Konstantina Christakopoulou, Jaya Kawale, and Arindam Banerjee. 2017. Recommendation with capacity constraints. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1439–1448.

  6. [6]

    Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. 2014. Doubly Robust Policy Evaluation and Optimization. Statist. Sci. 29, 4 (2014), 485–511.

  7. [7]

    Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. 2018. More Robust Doubly Robust Off-Policy Evaluation. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80. PMLR, 1447–1456.

  8. [8]

    Nicolo Felicioni, Maurizio Ferrari Dacrema, Marcello Restelli, and Paolo Cremonesi. 2022. Off-Policy Evaluation with Deficient Support Using Side Information. Advances in Neural Information Processing Systems 35 (2022).

  9. [9]

    Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. 2022. KuaiRec: A fully-observed dataset and insights for evaluating recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 540–550.

  10. [10]

    Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B Testing for Recommender Systems. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. 198–206.

  11. [11]

    Nathan Kallus, Yuta Saito, and Masatoshi Uehara. 2021. Optimal Off-Policy Evaluation from Multiple Logging Policies. In Proceedings of the 38th International Conference on Machine Learning, Vol. 139. PMLR, 5247–5256.

  12. [12]

    Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, and Yuta Saito. 2024. Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation. In International Conference on Learning Representations.

  13. [13]

    Haruka Kiyohara, Masahiro Nomura, and Yuta Saito. 2024. Off-policy evaluation of slate bandit policies via optimizing abstraction. In Proceedings of the ACM on Web Conference 2024. 3150–3161.

  14. [14]

    Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. 2022. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. In Proceedings of the 15th International Conference on Web Search and Data Mining.

  15. [15]

    Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, and Yuta Saito. 2023. Off-Policy Evaluation of Ranking Policies under Diverse User Behavior. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1154–1163.

  16. [16]

    Lihong Li, Shunbao Chen, Jim Kleban, and Ankur Gupta. 2015. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web. 929–934.

  17. [17]

    Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. 2018. Breaking the curse of horizon: infinite-horizon off-policy estimation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 5361–5371.

  18. [18]

    Alberto Maria Metelli, Alessio Russo, and Marcello Restelli. 2021. Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning. Advances in Neural Information Processing Systems 34 (2021).

  19. [19]

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

  20. [20]

    Yuta Saito, Himan Abdollahpouri, Jesse Anderton, Ben Carterette, and Mounia Lalmas. 2024. Long-term Off-Policy Evaluation and Learning. In Proceedings of the ACM on Web Conference 2024. 3432–3443.

  21. [21]

    Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. 2021. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

  22. [22]

    Yuta Saito and Thorsten Joachims. 2021. Counterfactual Learning and Evaluation for Recommender Systems: Foundations, Implementations, and Recent Advances. In Proceedings of the 15th ACM Conference on Recommender Systems. 828–830.

  23. [23]

    Yuta Saito and Thorsten Joachims. 2022. Off-Policy Evaluation for Large Action Spaces via Embeddings. In International Conference on Machine Learning. PMLR, 19089–19122.

  24. [24]

    Yuta Saito, Qingyang Ren, and Thorsten Joachims. 2023. Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling. arXiv preprint arXiv:2305.08062 (2023).

  25. [25]

    Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. 2021. Evaluating the Robustness of Off-Policy Evaluation. In Proceedings of the 15th ACM Conference on Recommender Systems. 114–123.

  26. [26]

    Yuta Saito, Jihan Yao, and Thorsten Joachims. 2024. POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition. arXiv preprint arXiv:2402.06151 (2024).

  27. [27]

    Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. 2020. Doubly Robust Off-Policy Evaluation with Shrinkage. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119. PMLR, 9167–9176.

  28. [28]

    Yi Su, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. 2019. CAB: Continuous adaptive blending for policy evaluation and learning. In International Conference on Machine Learning, Vol. 84. 6005–6014.

  29. [29]

    Adith Swaminathan and Thorsten Joachims. 2015. Batch learning from logged bandit feedback through counterfactual risk minimization. The Journal of Machine Learning Research 16, 1 (2015), 1731–1755.

  30. [30]

    Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning. PMLR, 814–823.

  31. [31]

    Adith Swaminathan and Thorsten Joachims. 2015. The Self-Normalized Estimator for Counterfactual Learning. Advances in Neural Information Processing Systems 28 (2015).

  32. [32]

    Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. 2017. Off-Policy Evaluation for Slate Recommendation. In Advances in Neural Information Processing Systems, Vol. 30. 3632–3642.

  33. [33]

    Masatoshi Uehara, Chengchun Shi, and Nathan Kallus. 2022. A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355 (2022).

  34. [34]

    Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. 2017. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning. PMLR, 3589–3597.

  35. [35]

    Huasen Wu, Rayadurgam Srikant, Xin Liu, and Chong Jiang. 2015. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. Advances in Neural Information Processing Systems 28 (2015).

  36. [36]

    Yafei Zhao and Long Yang. 2024. Constrained contextual bandit algorithm for limited-budget recommendation system. Engineering Applications of Artificial Intelligence 128 (2024), 107558.

  37. [37]

    Wenliang Zhong, Rong Jin, Cheng Yang, Xiaowei Yan, Qi Zhang, and Qiang Li. 2015. Stock constrained recommendation in tmall. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2287–2296.