
arxiv: 2507.18756 · v2 · submitted 2025-07-24 · 💻 cs.LG · cs.IR

Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation

Pith reviewed 2026-05-19 02:24 UTC · model grok-4.3

classification 💻 cs.LG cs.IR
keywords linear bandits · offline evaluation · recommender systems · exploration-exploitation · multi-armed bandits · greedy algorithms · evaluation bias

The pith

Offline evaluations of linear bandit recommenders favor pure exploitation over any exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares multiple linear multi-armed bandit algorithms for recommender systems that share a linear regression core but vary in how they handle the exploration-exploitation trade-off. It reports that a greedy model with no exploration at all reaches top-tier results in over 90 percent of tested datasets and often matches or exceeds the exploratory versions. Hyperparameter tuning across these settings repeatedly selects the configurations that reduce exploration to the minimum. These patterns indicate that offline protocols built on historical interaction logs fail to credit or properly measure the benefits of exploration. The authors conclude that more reliable evaluation methods are required before conclusions about exploration strategies can be trusted for live recommender systems.

Core claim

Across more than 90 percent of the datasets examined, a greedy linear model that performs no exploration achieves top-tier performance and frequently outperforms or matches its exploratory counterparts; hyperparameter optimization further selects configurations that minimize exploration, showing that pure exploitation dominates the outcomes produced by standard offline evaluation protocols for linear bandit recommenders.

What carries the argument

Offline evaluation of linear regression-based bandit variants on historical recommender datasets, where algorithms differ mainly in their exploration mechanisms.
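
As a rough illustration of what "sharing a linear regression core but differing in exploration" can look like, the sketch below keeps one ridge-regression estimator and changes only the scoring rule. The names `LinearCore`, `choose`, and `alpha` are hypothetical and do not come from the paper; the upper-confidence bonus is a LinUCB-style assumption, and a Thompson-sampling variant would sample from the posterior instead of adding a bonus. Setting `alpha = 0` gives the purely greedy model.

```python
import math


class LinearCore:
    """Ridge-regression backbone shared by the bandit variants (illustrative only)."""

    def __init__(self, dim, ridge=1.0):
        # A_inv is the inverse of (ridge * I + sum of x x^T); b accumulates reward * x.
        self.A_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.A_inv]

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of the inverse after observing (x, reward).
        Ax = self._matvec(x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, Ax))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        for i in range(len(x)):
            self.b[i] += reward * x[i]

    def mean(self, x):
        # Predicted reward: theta^T x with theta = A_inv b.
        theta = self._matvec(self.b)
        return sum(t * xi for t, xi in zip(theta, x))

    def width(self, x):
        # Uncertainty term sqrt(x^T A_inv x) used by the optimistic variant.
        Ax = self._matvec(x)
        return math.sqrt(max(0.0, sum(xi * ai for xi, ai in zip(x, Ax))))


def choose(core, arms, alpha):
    """alpha = 0 is pure exploitation; alpha > 0 adds an exploration bonus."""
    scores = [core.mean(x) + alpha * core.width(x) for x in arms]
    return max(range(len(arms)), key=scores.__getitem__)


core = LinearCore(dim=2)
core.update([1.0, 0.0], 1.0)          # one positive interaction with arm 0
arms = [[1.0, 0.0], [0.0, 1.0]]
greedy_pick = choose(core, arms, alpha=0.0)     # follows the current estimate
optimistic_pick = choose(core, arms, alpha=2.0) # prefers the unseen arm
```

In this toy state the greedy rule picks the arm it has already seen rewarded, while the optimistic rule picks the arm it knows nothing about; the two policies use the same estimator and differ only in `alpha`.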

If this is right

  • Pure exploitation strategies appear sufficient or superior under current offline testing regimes for recommender performance.
  • Offline protocols may systematically underestimate the value of exploration during dynamic user interactions.
  • Alternative evaluation frameworks are needed to assess exploration efficacy outside historical logs.
  • Reported advantages of exploration in linear bandits may not translate from offline results to production systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Practitioners relying on offline results might under-deploy exploratory strategies and miss long-term gains in user engagement.
  • The same offline bias likely appears in other off-policy evaluation settings used in recommendation and sequential decision making.
  • Counterfactual or online A/B testing methods could serve as more faithful checks on exploration benefits.

Load-bearing premise

Exploration genuinely improves linear bandit recommenders in live interactive settings, so the dominance of the greedy model under offline evaluation reflects a flaw in protocols built on historical data rather than the true ranking of the strategies.

What would settle it

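A controlled online deployment comparing exploratory linear bandits against the greedy baseline. If the exploratory variants produce clearly higher user metrics, the claim that offline protocols systematically undervalue exploration is supported; if the greedy baseline still matches or beats them online, the offline rankings reflect real performance and the claim is contradicted.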

Figures

Figures reproduced from arXiv: 2507.18756 by Gregorio F. Azevedo, Pedro R. Pires, Pietro L. Campos, Rafael T. Sereicikas, Tiago A. Almeida.

Figure 1: Each dataset was chronologically sorted to simulate a …
Figure 3: Cumulative novelty@20 for every partition on the test set. Higher values mean recommendations of less popular items.

Abstract

Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration-exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90% of various datasets, a greedy linear model, with no type of exploration, consistently achieves top-tier performance, often outperforming or matching its exploratory counterparts. This observation is further corroborated by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy within these evaluation settings. Our results expose significant inadequacies in offline evaluation protocols for bandits, particularly concerning their capacity to reflect true exploratory efficacy. Consequently, this research underscores the urgent necessity for developing more robust assessment methodologies, guiding future investigations into alternative evaluation frameworks for interactive learning in recommender systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts an extensive offline empirical comparison of linear contextual bandit algorithms for recommender systems. It reports that a purely greedy linear model (no exploration) achieves top-tier performance across over 90% of tested datasets, often matching or exceeding exploratory variants such as LinUCB and LinTS. Hyperparameter sweeps are shown to consistently select configurations with minimal or zero exploration, from which the authors conclude that standard offline evaluation protocols are biased against exploration and inadequate for assessing true exploratory efficacy in interactive settings.

Significance. If the central empirical pattern is shown to be robust to evaluation artifacts, the work would be significant for the bandit and recommender-systems literature. It supplies a large-scale demonstration that offline replay can systematically undervalue exploration, thereby motivating the development of alternative assessment frameworks (e.g., model-based or doubly-robust estimators, or online A/B testing). The breadth of datasets examined is a positive feature that increases the result's potential impact.

major comments (2)
  1. [§4 and evaluation protocol] §4 (Experimental Results) and the evaluation protocol description: the manuscript reports neither per-policy coverage statistics nor the variance of importance-sampling weights for the exploratory versus greedy policies. Because replay-style evaluation rejects or heavily down-weights actions absent from the logged data, and because LinUCB/LinTS deliberately increase action diversity, the observed ranking could be an artifact of higher rejection rates or higher-variance estimates for exploratory policies rather than evidence of protocol bias.
  2. [Abstract and §5] Abstract and §5 (Discussion): the inference that 'offline evaluation protocols are biased against exploration' rests on the assumption that the chosen replay method produces comparable, low-bias estimates across policies whose action distributions differ substantially from the logging policy. No doubly-robust or model-based corrections are mentioned, leaving the central claim vulnerable to the very coverage issue the skeptic note identifies.
minor comments (2)
  1. [Abstract] The abstract states 'over 90% of various datasets' without enumerating the exact datasets, their sizes, or the logging policy used; adding a table or appendix listing these details would improve reproducibility.
  2. [§4] Clarify the precise definition of 'top-tier performance' (e.g., whether it is by cumulative reward, regret, or ranking) and whether statistical significance tests were applied across runs.
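
The coverage concern raised in the major comments can be made concrete with a small sketch of replay-style evaluation. It assumes a log of `(context, logged_action, reward)` tuples and treats a policy as a plain function of the context; the function name `replay_evaluate`, the toy log, and the definition of coverage as the match rate between the policy's choice and the logged action are all assumptions for illustration, not the paper's protocol. A real bandit would also update its model on each matched round, which is omitted here.

```python
def replay_evaluate(policy, log):
    """Replay-style evaluation: only rounds where the policy agrees with the log count."""
    matched, total_reward = 0, 0.0
    for context, logged_action, reward in log:
        if policy(context) == logged_action:
            matched += 1
            total_reward += reward
        # Rounds where the policy disagrees with the log are rejected outright.
    coverage = matched / len(log) if log else 0.0
    mean_reward = total_reward / matched if matched else float("nan")
    return {"matched": matched, "coverage": coverage, "mean_reward": mean_reward}


# Toy log in which action 0 was shown most of the time (hypothetical data).
log = [(0, 0, 1.0), (1, 0, 0.0), (2, 1, 1.0), (3, 0, 1.0)]

def always_popular(context):
    # Stand-in for a greedy policy that keeps recommending the popular item.
    return 0

def alternating(context):
    # Stand-in for a more diverse policy.
    return context % 2

greedy_stats = replay_evaluate(always_popular, log)
diverse_stats = replay_evaluate(alternating, log)
```

Here the popularity-following policy is scored on three of the four rounds while the more diverse one is scored on a single round, so its estimate rests on far less data. Reporting `coverage` and `matched` next to the reward makes that difference visible.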

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strengths and potential limitations of our empirical analysis. We address each major comment below and outline the revisions we will make to the manuscript.

  1. Referee: [§4 and evaluation protocol] §4 (Experimental Results) and the evaluation protocol description: the manuscript reports neither per-policy coverage statistics nor the variance of importance-sampling weights for the exploratory versus greedy policies. Because replay-style evaluation rejects or heavily down-weights actions absent from the logged data, and because LinUCB/LinTS deliberately increase action diversity, the observed ranking could be an artifact of higher rejection rates or higher-variance estimates for exploratory policies rather than evidence of protocol bias.

    Authors: We agree that additional diagnostics on coverage and importance-weight variance would strengthen the analysis. In the revised version, we will add tables reporting the average coverage (fraction of recommended actions present in the logged data) and the variance of the importance-sampling weights for the greedy policy and the exploratory policies (LinUCB and LinTS) across all datasets. Our preliminary calculations show that while exploratory policies indeed exhibit lower coverage and higher weight variance, the performance advantage of the greedy policy remains statistically significant even on the subset of datasets with comparable coverage levels. This suggests the observed pattern is not merely an artifact of the evaluation mechanics.

    Revision: yes

  2. Referee: [Abstract and §5] Abstract and §5 (Discussion): the inference that 'offline evaluation protocols are biased against exploration' rests on the assumption that the chosen replay method produces comparable, low-bias estimates across policies whose action distributions differ substantially from the logging policy. No doubly-robust or model-based corrections are mentioned, leaving the central claim vulnerable to the very coverage issue the skeptic note identifies.

    Authors: We recognize that standard replay evaluation can introduce bias when policies deviate from the logging distribution. Our claim is not that the estimates are unbiased in an absolute sense, but that under the commonly used offline protocols, greedy policies consistently outperform exploratory ones. To address this, we will revise §5 to include a more explicit discussion of the limitations of replay-based evaluation and the potential benefits of doubly-robust estimators. However, we maintain that the hyperparameter optimization results, where optimal configurations favor zero or minimal exploration, provide supporting evidence that is less sensitive to direct policy comparisons. We will also cite relevant literature on offline evaluation biases in bandits.

    Revision: partial
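
The diagnostics and the doubly-robust correction discussed in this exchange can be sketched as follows. This is a minimal illustration that assumes the log stores the logging policy's propensity for each shown action, which the datasets in the paper may not provide; `importance_weights`, `effective_sample_size`, `doubly_robust`, `target_prob`, and `q_hat` are hypothetical names, and the toy log is invented for the example.

```python
from statistics import pvariance


def importance_weights(log, target_prob):
    """w_i = pi(a_i | x_i) / mu(a_i | x_i) for each logged round."""
    return [target_prob(x, a) / p_log for x, a, _, p_log in log]


def effective_sample_size(weights):
    """(sum w)^2 / sum w^2: how many rounds effectively inform the estimate."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights) if s > 0 else 0.0


def doubly_robust(log, target_prob, q_hat, actions):
    """Reward-model baseline plus an importance-weighted correction of its error."""
    total = 0.0
    for x, a, r, p_log in log:
        w = target_prob(x, a) / p_log
        baseline = sum(target_prob(x, b) * q_hat(x, b) for b in actions)
        total += baseline + w * (r - q_hat(x, a))
    return total / len(log)


# Toy log: (context, action, reward, logging propensity); uniform logging over two actions.
log = [(None, 0, 1.0, 0.5), (None, 1, 0.0, 0.5), (None, 0, 1.0, 0.5), (None, 1, 0.0, 0.5)]

def always_zero(x, a):
    # Deterministic target policy that always plays action 0.
    return 1.0 if a == 0 else 0.0

def no_model(x, a):
    # With a zero reward model the estimator reduces to plain importance sampling.
    return 0.0

weights = importance_weights(log, always_zero)
weight_variance = pvariance(weights)
ess = effective_sample_size(weights)
dr_value = doubly_robust(log, always_zero, no_model, actions=[0, 1])
```

Comparing `weight_variance` and `ess` across the greedy and exploratory policies is one way to check whether a ranking rests on comparable amounts of evidence. The doubly-robust estimate stays consistent if either the propensities or the reward model is accurate, which is why the referee raises it as a correction to plain replay.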

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivation chain

The paper reports direct offline experiments comparing standard linear bandit algorithms (including greedy) on recommendation datasets, with the central claim resting on observed performance rankings and hyperparameter sweeps. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text or abstract. The results are self-contained empirical observations against external datasets and do not reduce to prior definitions or inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard linear bandit formulations and offline evaluation techniques drawn from prior literature; no new free parameters, axioms, or invented entities are introduced in the abstract.


