
arxiv: 2605.18509 · v1 · pith:SHIZBXBAnew · submitted 2026-05-18 · 💻 cs.LG

Offline Contextual Bandits in the Presence of New Actions

Pith reviewed 2026-05-20 11:58 UTC · model grok-4.3

classification 💻 cs.LG
keywords offline contextual bandits · new actions · off-policy learning · policy optimization · doubly robust · LCPI estimator · action features

The pith

A new method lets offline contextual bandit policies select actions never seen in the logged data by using action features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard off-policy learning picks the best actions from those already present in the training data but fails when new actions appear later, as happens with fresh news articles or videos. The paper shows how to overcome this by building an estimator that uses the features of actions to predict rewards for combinations never observed. It introduces the Local Combination PseudoInverse estimator to handle the new-action part and folds it into a larger algorithm that also keeps the strengths of doubly robust estimation for known actions. The result is a tunable policy that can include new actions at a controllable rate while preserving performance on the original set.

Core claim

We introduce the Local Combination PseudoInverse (LCPI) estimator for the policy gradient, which generalizes the PseudoInverse estimator by controlling the trade-off between reward-modeling and data-collection conditions on action features to capture interaction effects across feature dimensions. We then define PONA as a weighted sum of the LCPI and Doubly Robust estimators that jointly optimizes selection of existing and new actions, with the weight directly adjusting how often new actions are chosen.
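The weighted-sum idea can be illustrated with a toy sketch. Everything below is an assumption made for illustration: a context-free softmax policy, hypothetical reward estimates `q_hat`, and a convex combination `(1 - lam) * grad_existing + lam * grad_new` standing in for the DR and LCPI terms. The paper's actual estimators and weighting scheme may differ.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def weighted_gradient(theta, q_hat, is_new, lam):
    """Toy policy gradient split into existing-action and new-action parts.

    Assumption: the two parts stand in for the DR and LCPI terms; here both
    simply read from the same hypothetical reward estimates q_hat.
    """
    pi = softmax(theta)
    # For a softmax policy, grad_theta log pi(a) = e_a - pi (row a of `score`).
    score = np.eye(len(theta)) - pi
    per_action = (pi * q_hat)[:, None] * score
    grad_existing = per_action[~is_new].sum(axis=0)
    grad_new = per_action[is_new].sum(axis=0)
    return (1.0 - lam) * grad_existing + lam * grad_new

def train(lam, steps=2000, lr=0.5):
    """Run gradient ascent and return the probability mass on new actions."""
    q_hat = np.array([0.50, 0.60, 0.55])        # hypothetical reward estimates
    is_new = np.array([False, False, True])     # the last action is "new"
    theta = np.zeros(3)
    for _ in range(steps):
        theta += lr * weighted_gradient(theta, q_hat, is_new, lam)
    return float(softmax(theta)[is_new].sum())

share_low = train(0.1)
share_high = train(0.9)
print(f"new-action share: lam=0.1 -> {share_low:.3f}, lam=0.9 -> {share_high:.3f}")
```

In this toy setup a larger weight on the new-action term pushes more probability mass onto the new action, which mirrors the role the paper assigns to the weight parameter.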

What carries the argument

The LCPI estimator, which uses action-feature interactions to estimate rewards for new actions absent from the logged data while trading off modeling assumptions against coverage conditions.
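A minimal sketch of the plain PseudoInverse idea that LCPI generalizes may help here. It is not LCPI itself. The feature layout (title position and title size), the additive reward table `phi`, the uniform logging policy, and all function names are assumptions chosen for illustration. Under an additive reward, the pseudo-inverse weights recover the value of a feature combination that was never logged.

```python
import numpy as np

# Hypothetical feature layout: title position (3 values), title size (2 values).
N_POS, N_SIZE = 3, 2

def one_hot(action):
    """Concatenate one-hot encodings of (position, size) into a length-5 vector."""
    pos, size = action
    v = np.zeros(N_POS + N_SIZE)
    v[pos] = 1.0
    v[N_POS + size] = 1.0
    return v

# Assumed additive reward: q(a) = phi[pos] + phi[N_POS + size].
phi = np.array([0.10, 0.30, 0.20, 0.05, 0.25])

def q(action):
    return float(one_hot(action) @ phi)

existing = [(0, 0), (1, 0), (2, 0), (0, 1)]          # logged actions
new_action = (1, 1)                                  # never logged
pi0 = np.full(len(existing), 1.0 / len(existing))    # uniform logging policy

# Gamma = E_{a ~ pi0}[1_a 1_a^T] and its Moore-Penrose pseudo-inverse.
gamma = sum(p * np.outer(one_hot(a), one_hot(a)) for a, p in zip(existing, pi0))
gamma_pinv = np.linalg.pinv(gamma)

def pi_weight(target, logged):
    """PseudoInverse weight 1_target^T Gamma^+ 1_logged."""
    return float(one_hot(target) @ gamma_pinv @ one_hot(logged))

# Population (noise-free) value of the estimator for the new action.
exact_estimate = sum(p * pi_weight(new_action, a) * q(a) for a, p in zip(existing, pi0))

# Monte Carlo version with noisy logged rewards.
rng = np.random.default_rng(0)
n = 200_000
idx = rng.choice(len(existing), size=n, p=pi0)
weights = np.array([pi_weight(new_action, a) for a in existing])[idx]
rewards = np.array([q(a) for a in existing])[idx] + rng.normal(0.0, 0.1, size=n)
sampled_estimate = float(np.mean(weights * rewards))

print(f"true={q(new_action):.3f} exact={exact_estimate:.3f} sampled={sampled_estimate:.3f}")
```

The recovery works here only because the new action's feature vector is a linear combination of logged feature vectors and the reward is additive in the features, which are the two conditions the estimator trades off.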

If this is right

  • PONA selects new actions while most existing off-policy methods cannot.
  • The single weight parameter directly sets the fraction of new actions included in the final policy.
  • Experiments show PONA keeps overall reward close to strong baselines that only use existing actions.
  • LCPI can be swapped in for the new-action component without losing the benefits of doubly robust estimation on known actions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In recommendation platforms where item catalogs turn over daily, the approach could reduce the frequency of fresh data collection rounds.
  • The same feature-interaction idea might extend to settings where actions are added continuously rather than in discrete batches.
  • Performance would degrade if the chosen features fail to encode the relevant interactions, suggesting feature engineering as a key practical knob.

Load-bearing premise

Action features can capture interaction effects across their dimensions well enough to estimate rewards for new actions that never appeared in the logged data.
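To see why this premise is load-bearing, consider a toy case where it fails. The sketch below assumes a plain PseudoInverse estimate (not the paper's LCPI), a hypothetical additive reward table `phi`, and an `interaction_bonus` that applies only to the unseen feature combination. Because no logged action carries that interaction, the estimate misses it entirely.

```python
import numpy as np

N_POS, N_SIZE = 3, 2  # hypothetical feature cardinalities

def one_hot(action):
    pos, size = action
    v = np.zeros(N_POS + N_SIZE)
    v[pos] = 1.0
    v[N_POS + size] = 1.0
    return v

phi = np.array([0.10, 0.30, 0.20, 0.05, 0.25])  # assumed additive effects
interaction_bonus = 0.40                         # assumed effect only for (pos=1, size=1)

def true_reward(action):
    additive = float(one_hot(action) @ phi)
    return additive + (interaction_bonus if action == (1, 1) else 0.0)

existing = [(0, 0), (1, 0), (2, 0), (0, 1)]   # logged actions, none with the interaction
new_action = (1, 1)
pi0 = np.full(len(existing), 0.25)            # uniform logging policy

gamma = sum(p * np.outer(one_hot(a), one_hot(a)) for a, p in zip(existing, pi0))
gamma_pinv = np.linalg.pinv(gamma)

# Noise-free PseudoInverse estimate of the new action's reward.
estimate = sum(
    p * float(one_hot(new_action) @ gamma_pinv @ one_hot(a)) * true_reward(a)
    for a, p in zip(existing, pi0)
)
bias = estimate - true_reward(new_action)
print(f"estimate={estimate:.3f} true={true_reward(new_action):.3f} bias={bias:.3f}")
```

In this construction the bias equals the negative of the interaction term, so any reward component that the feature encoding cannot express is lost for new actions.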

What would settle it

Collect a dataset with held-out new actions whose feature combinations produce rewards no better than the global average, then check whether PONA's learned policy on those new actions outperforms a baseline that ignores them entirely.

Figures

Figures reproduced from arXiv: 2605.18509 by Kazuki Kawamura, Kei Tateno, Ren Kishimoto, Takanori Muroi, Takuma Udagawa, Tatsuhiro Shimizu, Yuki Sasamoto, Yusuke Narita, Yuta Saito.

Figure 1: Examples of Existing and New Actions. Note: There are three types of action features: character type (chosen as two from male, female, or child), title position (top, center, or bottom), and title size (large or small). The left side of the figure illustrates the thumbnails for existing actions, while the right side shows examples of new actions: "Male and female, title position at the bottom, large title"…
Figure 2: An example of converting action features into a …
Figure 3: Comparisons of the overall policy value, policy value per existing action, and policy value per new action with varying …
Figure 4: Comparisons of the overall policy value, policy value per existing action, and policy value per new action with varying …
Figure 5: Comparisons of the overall policy value, policy value per existing action, and policy value per new action with varying …
Figure 6: Comparisons of the overall policy value, proportion of existing actions, and proportion of new actions under the …
Figure 7: Comparisons of the overall policy value, policy value per existing action, and policy value per new action with varying …
Figure 8: Comparisons of the overall policy value, policy value per existing action, and policy value per new action with varying …
Abstract

Automated decision-making algorithms drive applications such as recommendation systems and search engines. These algorithms often rely on off-policy contextual bandits or off-policy learning (OPL). Conventionally, OPL selects actions that maximize the expected reward from an existing action set. However, in many real-world scenarios, actions, such as news articles or video content, change continuously, and the action space evolves over time after data collection. We define actions introduced after deploying the logging policy as new actions and focus on OPL with new actions. Existing OPL methods identify optimal actions from the existing set effectively but cannot learn and select new actions because no relevant data are logged. To address this limitation, we propose a new OPL method that leverages action features. We first introduce the Local Combination PseudoInverse (LCPI) estimator for the policy gradient, generalizing the PseudoInverse estimator initially proposed for off-policy evaluation of slate bandits. LCPI controls the trade-off between reward-modeling condition and the condition for data collection regarding the action features, capturing the interaction effects among different dimensions of action features. Furthermore, we propose a generalized algorithm called Policy Optimization for Effective New Actions (PONA), which integrates LCPI, a component specialized for new action selection, with Doubly Robust (DR), which excels at learning within existing actions. We define PONA as a weighted sum of the LCPI and DR estimators, optimizing both the selection of existing and new actions, and allowing the proportion of new action selections to be adjusted by the weight parameter. Through extensive experiments, we demonstrate that PONA efficiently selects new actions while maintaining the overall policy performance as opposed to most existing methods that cannot select new actions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to address offline contextual bandits where new actions appear after data collection. It introduces the Local Combination PseudoInverse (LCPI) estimator, which generalizes the slate-bandit PseudoInverse estimator to contextual settings by leveraging action features to estimate policy gradients for unseen actions. It then defines PONA as a weighted combination of LCPI (for new actions) and the Doubly Robust (DR) estimator (for existing actions), with a tunable weight controlling the proportion of new-action selections. Experiments are presented to show that PONA selects new actions efficiently while preserving overall policy performance, unlike prior methods that cannot handle new actions.

Significance. If the LCPI estimator provides reliable gradient estimates for new actions under the stated feature-based modeling assumptions, the work would fill a practical gap in evolving action spaces such as recommendations and content systems. The explicit weighting mechanism in PONA offers a controllable trade-off that existing OPL methods lack. No machine-checked proofs or parameter-free derivations are present, but the empirical demonstration of new-action selection is a concrete contribution if the bias properties hold.

major comments (3)
  1. [§3] LCPI estimator definition: No theorem or proposition states the precise conditions on the action feature map, the reward function approximation order, or the logging distribution under which the pseudo-inverse correction recovers an unbiased estimate of the policy gradient for a new action a' that has zero probability under the logging policy. The central claim that LCPI enables new-action selection therefore rests on an unstated modeling assumption whose violation would render the estimator biased.
  2. [§5] PONA algorithm and weighting: The paper defines PONA as a weighted sum of LCPI and DR but provides neither bias/variance bounds for the combined estimator nor analysis of how the weight parameter interacts with the bias introduced by LCPI on new actions. Without such analysis the claimed controllable trade-off cannot be guaranteed to improve performance.
  3. [Experimental section] Results tables: The reported experiments do not include controlled failure cases where action features fail to capture higher-order reward interactions, nor do they report variance or confidence intervals for the new-action selection rate. This leaves open whether the observed improvement is robust or specific to the chosen feature representations.
minor comments (2)
  1. [§3] Notation for the action feature vector and the local combination operator should be introduced earlier and used consistently; current presentation requires the reader to infer the dimensionality reduction step from context.
  2. [§6] The abstract and introduction claim 'extensive experiments' but the experimental protocol (logging policy details, number of runs, hyper-parameter selection for the weight) is only partially described; a dedicated subsection would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each of the major comments in detail below, indicating the revisions we plan to make to the manuscript.

  1. Referee: [§3] LCPI estimator definition: No theorem or proposition states the precise conditions on the action feature map, the reward function approximation order, or the logging distribution under which the pseudo-inverse correction recovers an unbiased estimate of the policy gradient for a new action a' that has zero probability under the logging policy. The central claim that LCPI enables new-action selection therefore rests on an unstated modeling assumption whose violation would render the estimator biased.

    Authors: We agree that a formal statement of the conditions for unbiasedness would improve clarity. The LCPI estimator generalizes the slate-bandit PseudoInverse by using a local combination of action features to estimate gradients for unseen actions. We will add a new proposition in Section 3 that specifies the assumptions: namely, that the action feature map is such that the reward can be approximated linearly in the combined feature space, and that the logging policy provides sufficient coverage in the feature dimensions relevant to new actions. Under these conditions, the pseudo-inverse correction yields an unbiased gradient estimate. This will make the modeling assumptions explicit. revision: yes

  2. Referee: [§5] PONA algorithm and weighting: The paper defines PONA as a weighted sum of LCPI and DR but provides neither bias/variance bounds for the combined estimator nor analysis of how the weight parameter interacts with the bias introduced by LCPI on new actions. Without such analysis the claimed controllable trade-off cannot be guaranteed to improve performance.

    Authors: We acknowledge the absence of theoretical bounds on the combined estimator. Deriving such bounds is non-trivial because the bias of LCPI depends on the specific new actions and feature approximations. In the revision, we will expand Section 5 to include a qualitative analysis of the weight parameter's role in trading off new-action exploration against performance on existing actions. We will also add experiments varying the weight and reporting the resulting policy values and new-action selection rates. While we cannot provide parameter-free guarantees without further assumptions on the bias, the empirical results demonstrate the practical utility of the weighting mechanism. revision: partial

  3. Referee: [Experimental section] Results tables: The reported experiments do not include controlled failure cases where action features fail to capture higher-order reward interactions, nor do they report variance or confidence intervals for the new-action selection rate. This leaves open whether the observed improvement is robust or specific to the chosen feature representations.

    Authors: We appreciate this suggestion for strengthening the experimental validation. We will include additional experiments in the revised manuscript that test scenarios with degraded action features, such as using random projections or omitting key feature interactions, to show when LCPI and PONA may underperform. Furthermore, we will augment the results tables with standard deviations and confidence intervals computed over multiple random seeds for the new-action selection metrics. This will help assess the robustness of our findings. revision: yes
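The kind of per-seed reporting discussed in the third response could look like the sketch below. The rates are simulated from a made-up Bernoulli process and are not results from the paper. The function names and the normal-approximation interval (z = 1.96) are assumptions, and with few seeds a t-based interval would be wider.

```python
import math
import random
import statistics

def simulate_new_action_rate(seed, n_decisions=1000, p_new=0.3):
    """Stand-in for one experimental run: fraction of decisions picking a new action.

    Assumption: a real run would train and evaluate a policy; here a Bernoulli
    draw with a made-up probability p_new is used purely for illustration.
    """
    rng = random.Random(seed)
    picks = sum(rng.random() < p_new for _ in range(n_decisions))
    return picks / n_decisions

def summarize(rates, z=1.96):
    """Return mean, sample standard deviation, and a normal-approximation CI."""
    mean = statistics.fmean(rates)
    std = statistics.stdev(rates)
    half_width = z * std / math.sqrt(len(rates))
    return mean, std, (mean - half_width, mean + half_width)

rates = [simulate_new_action_rate(seed) for seed in range(10)]
mean, std, (ci_low, ci_high) = summarize(rates)
print(f"new-action rate: {mean:.3f} +/- {std:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```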

Circularity Check

0 steps flagged

No significant circularity in LCPI or PONA derivation


The paper defines LCPI as an explicit generalization of the prior PseudoInverse estimator (originally for slate bandits) to contextual bandits with new actions, using action features to capture interactions. PONA is then introduced as a weighted combination of this LCPI term and the standard Doubly Robust estimator. No equation reduces a claimed prediction or gradient estimate to a fitted parameter by construction, and no load-bearing step relies on a self-citation chain or imported uniqueness theorem. The modeling assumption that action features suffice for new-action reward estimation is stated openly rather than smuggled in via redefinition. The derivation therefore remains self-contained against external benchmarks and prior estimators.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard contextual bandit assumptions plus the new modeling choice that action features suffice to estimate new-action values via the LCPI trade-off.

free parameters (1)
  • weight parameter
    Controls the proportion of new-action selections in the weighted sum of LCPI and DR estimators.
axioms (1)
  • [domain assumption] Standard contextual bandit assumptions, including the existence of a logging policy and reward-model conditions on action features.
    Invoked when defining LCPI and its trade-off between reward-modeling and data-collection conditions.
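One way to make a data-collection condition concrete is a span check on one-hot action features, sketched below. This is an illustration for a plain pseudo-inverse setup, not the condition stated in the paper. The helper names `one_hot` and `in_logged_span` and the example actions are hypothetical.

```python
import numpy as np

def one_hot(action, sizes):
    """One-hot encode each feature dimension and concatenate the pieces."""
    parts = []
    for value, size in zip(action, sizes):
        part = np.zeros(size)
        part[value] = 1.0
        parts.append(part)
    return np.concatenate(parts)

def in_logged_span(new_action, logged_actions, logging_probs, sizes, tol=1e-8):
    """Check whether the new action's feature vector lies in the range of Gamma.

    Gamma = E_{a ~ pi0}[1_a 1_a^T]; Gamma @ pinv(Gamma) is the orthogonal
    projector onto its range, so a vector is covered when projection leaves
    it unchanged.
    """
    gamma = sum(
        p * np.outer(one_hot(a, sizes), one_hot(a, sizes))
        for a, p in zip(logged_actions, logging_probs)
    )
    v = one_hot(new_action, sizes)
    projected = gamma @ np.linalg.pinv(gamma) @ v
    return bool(np.linalg.norm(projected - v) < tol)

sizes = (3, 2)                          # hypothetical feature cardinalities
logged = [(0, 0), (1, 0), (0, 1)]       # hypothetical logged actions
probs = [1 / 3, 1 / 3, 1 / 3]           # uniform logging policy

covered = in_logged_span((1, 1), logged, probs, sizes)     # both values were logged
uncovered = in_logged_span((2, 1), logged, probs, sizes)   # position 2 never logged
print(covered, uncovered)
```

A new action that uses a feature value absent from every logged action fails the check, which is the situation where no amount of reward modeling on the logged features can help.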

pith-pipeline@v0.9.0 · 5865 in / 1318 out tokens · 45011 ms · 2026-05-20T11:58:00.337407+00:00 · methodology

