Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation

Gregorio F. Azevedo; Pedro R. Pires; Pietro L. Campos; Rafael T. Sereicikas; Tiago A. Almeida

arxiv: 2507.18756 · v2 · submitted 2025-07-24 · 💻 cs.LG · cs.IR


Pith reviewed 2026-05-19 02:24 UTC · model grok-4.3

keywords: linear bandits · offline evaluation · recommender systems · exploration-exploitation · multi-armed bandits · greedy algorithms · evaluation bias

The pith

Offline evaluations of linear bandit recommenders favor pure exploitation over any exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares multiple linear multi-armed bandit algorithms for recommender systems that share a linear regression core but vary in how they handle the exploration-exploitation trade-off. It reports that a greedy model with no exploration at all reaches top-tier results in over 90 percent of tested datasets and often matches or exceeds the exploratory versions. Hyperparameter tuning across these settings repeatedly selects the configurations that reduce exploration to the minimum. These patterns indicate that offline protocols built on historical interaction logs fail to credit or properly measure the benefits of exploration. The authors conclude that more reliable evaluation methods are required before conclusions about exploration strategies can be trusted for live recommender systems.

Core claim

Across more than 90 percent of the datasets examined, a greedy linear model that performs no exploration achieves top-tier performance and frequently outperforms or matches its exploratory counterparts; hyperparameter optimization further selects configurations that minimize exploration, showing that pure exploitation dominates the outcomes produced by standard offline evaluation protocols for linear bandit recommenders.

What carries the argument

Offline evaluation of linear regression-based bandit variants on historical recommender datasets, where algorithms differ mainly in their exploration mechanisms.
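As a rough illustration of what "share a linear regression core but differ in exploration" can look like, the sketch below keeps one ridge-regression state per arm and changes only the scoring rule. The class `LinearArm`, the function `score`, and the parameter `alpha` are hypothetical names chosen for this example, and the per-arm (disjoint) model is an assumption; the paper's actual implementations and hyperparameters are not reproduced here.

```python
import math
import random


class LinearArm:
    """Ridge-regression state for one arm (hypothetical helper).

    Keeps A_inv = (ridge * I + sum of x x^T)^-1 and b = sum of reward * x.
    """

    def __init__(self, dim, ridge=1.0):
        self.dim = dim
        self.A_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _A_inv_times(self, x):
        return [sum(self.A_inv[i][j] * x[j] for j in range(self.dim))
                for i in range(self.dim)]

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A_inv after observing (x, reward).
        Ax = self._A_inv_times(x)
        denom = 1.0 + sum(xi * a for xi, a in zip(x, Ax))
        for i in range(self.dim):
            for j in range(self.dim):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        for i in range(self.dim):
            self.b[i] += reward * x[i]

    def mean(self, x):
        # Point estimate x^T theta, with theta = A_inv b.
        theta = self._A_inv_times(self.b)
        return sum(t * xi for t, xi in zip(theta, x))

    def width(self, x):
        # Uncertainty term sqrt(x^T A_inv x).
        Ax = self._A_inv_times(x)
        return math.sqrt(max(0.0, sum(xi * a for xi, a in zip(x, Ax))))


def score(arm, x, strategy="greedy", alpha=0.0, rng=random):
    """Same regression backbone, three scoring rules (illustrative only)."""
    mean, width = arm.mean(x), arm.width(x)
    if strategy == "greedy":
        return mean                      # pure exploitation
    if strategy == "ucb":
        return mean + alpha * width      # optimism bonus, LinUCB-style
    if strategy == "ts":
        # x^T theta_sample with theta_sample ~ N(theta, alpha^2 A_inv)
        # is a univariate normal, so the score can be sampled directly.
        return mean + alpha * width * rng.gauss(0.0, 1.0)
    raise ValueError(f"unknown strategy: {strategy}")
```

With `alpha = 0`, both the UCB-style and the sampled score collapse to the greedy score, which is the kind of configuration the hyperparameter search is reported to favor.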

If this is right

Pure exploitation strategies appear sufficient or superior under current offline testing regimes for recommender performance.
Offline protocols may systematically underestimate the value of exploration during dynamic user interactions.
Alternative evaluation frameworks are needed to assess exploration efficacy outside historical logs.
Reported advantages of exploration in linear bandits may not translate from offline results to production systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners relying on offline results might under-deploy exploratory strategies and miss long-term gains in user engagement.
The same offline bias likely appears in other off-policy evaluation settings used in recommendation and sequential decision making.
Counterfactual or online A/B testing methods could serve as more faithful checks on exploration benefits.

Load-bearing premise

Exploration strategies for linear bandit recommenders are genuinely effective in live interactive settings, so when offline evaluation protocols using historical data favor the greedy model they are failing to reflect that efficacy rather than measuring it accurately.

What would settle it

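A controlled online deployment in which the greedy baseline matches or beats exploratory linear bandits on user metrics would contradict the claim that offline protocols systematically undervalue exploration, because the offline ranking would then simply be accurate. The reverse outcome, with exploratory bandits clearly ahead online, would support the claim.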

Figures

Figures reproduced from arXiv: 2507.18756 by Gregorio F. Azevedo, Pedro R. Pires, Pietro L. Campos, Rafael T. Sereicikas, Tiago A. Almeida.

**Figure 3.** Cumulative novelty@20 for every partition on the test set. Higher values mean recommendations of less popular items.

Original abstract

Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration-exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90% of various datasets, a greedy linear model, with no type of exploration, consistently achieves top-tier performance, often outperforming or matching its exploratory counterparts. This observation is further corroborated by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy within these evaluation settings. Our results expose significant inadequacies in offline evaluation protocols for bandits, particularly concerning their capacity to reflect true exploratory efficacy. Consequently, this research underscores the urgent necessity for developing more robust assessment methodologies, guiding future investigations into alternative evaluation frameworks for interactive learning in recommender systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Greedy linear models top most offline bandit evaluations on these datasets, but the result may reflect evaluation artifacts rather than a genuine bias against exploration.

The main finding is that a simple greedy linear model, with zero exploration, reaches top-tier performance in offline evaluations on over 90% of the datasets tested, and hyperparameter searches consistently select settings with minimal exploration. The authors compare several standard linear bandit algorithms across a range of recommendation datasets and show that the pattern holds even after tuning, which gives a clear picture of how exploration strategies fare under current offline protocols.

The soft spot is that the paper does not examine whether the evaluation method itself drives the result. Offline methods such as replay or inverse propensity scoring can penalize policies that differ from the logging policy, because those policies choose more actions for which no logged feedback exists. Exploratory bandits increase diversity on purpose, which could lead to higher rejection rates or higher-variance estimates. Without per-policy coverage statistics or more robust estimators such as doubly-robust methods, it is difficult to separate a real effect from an artifact of the evaluation setup.

This matters for researchers who rely on offline evaluation of contextual bandits in recommender systems, because it flags a limitation in existing practice. I would recommend sending the paper for peer review: the empirical observation is worth a closer look with additional controls on the evaluation procedure.
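The rejection effect mentioned in the letter can be made concrete with a small replay-style evaluator. This is a minimal sketch under stated assumptions, not the paper's protocol: the function name `replay_evaluate`, the event layout `(context, logged_action, reward)`, and the fixed (non-learning) policy are all choices made for this illustration.

```python
import math


def replay_evaluate(policy, logged_events):
    """Replay-style estimate for a fixed policy (illustrative sketch).

    An event counts only when the policy picks the same action as the log;
    every other event is rejected because its reward was never observed.
    """
    matched = 0
    total_reward = 0.0
    for context, logged_action, reward in logged_events:
        if policy(context) == logged_action:
            matched += 1
            total_reward += reward
    n = len(logged_events)
    return {
        # Average reward over the events that survived rejection.
        "mean_reward": total_reward / matched if matched else math.nan,
        # Fraction of the log the policy could be scored on.
        "match_rate": matched / n if n else 0.0,
        "n_matched": matched,
    }
```

Reporting `match_rate` next to `mean_reward` for each policy is one way to see whether an exploratory policy is being scored on far fewer events than a greedy one.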

Referee Report

2 major / 2 minor

Summary. The paper conducts an extensive offline empirical comparison of linear contextual bandit algorithms for recommender systems. It reports that a purely greedy linear model (no exploration) achieves top-tier performance across over 90% of tested datasets, often matching or exceeding exploratory variants such as LinUCB and LinTS. Hyperparameter sweeps are shown to consistently select configurations with minimal or zero exploration, from which the authors conclude that standard offline evaluation protocols are biased against exploration and inadequate for assessing true exploratory efficacy in interactive settings.

Significance. If the central empirical pattern is shown to be robust to evaluation artifacts, the work would be significant for the bandit and recommender-systems literature. It supplies a large-scale demonstration that offline replay can systematically undervalue exploration, thereby motivating the development of alternative assessment frameworks (e.g., model-based or doubly-robust estimators, or online A/B testing). The breadth of datasets examined is a positive feature that increases the result's potential impact.

major comments (2)

[§4 and evaluation protocol] §4 (Experimental Results) and the evaluation protocol description: the manuscript reports neither per-policy coverage statistics nor the variance of importance-sampling weights for the exploratory versus greedy policies. Because replay-style evaluation rejects or heavily down-weights actions absent from the logged data, and because LinUCB/LinTS deliberately increase action diversity, the observed ranking could be an artifact of higher rejection rates or higher-variance estimates for exploratory policies rather than evidence of protocol bias.
[Abstract and §5] Abstract and §5 (Discussion): the inference that 'offline evaluation protocols are biased against exploration' rests on the assumption that the chosen replay method produces comparable, low-bias estimates across policies whose action distributions differ substantially from the logging policy. No doubly-robust or model-based corrections are mentioned, leaving the central claim vulnerable to the very coverage issue the skeptic note identifies.
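The diagnostics requested in the major comments could be computed along the following lines. This is a hedged sketch: the function `weight_diagnostics`, its argument names, and the definition of coverage as the share of logged events with a nonzero importance weight are assumptions made here, and the paper does not state that it uses importance weighting at all.

```python
import statistics


def weight_diagnostics(target_probs, logging_probs):
    """Summarize importance weights w_i = pi(a_i | x_i) / mu(a_i | x_i).

    target_probs[i]  : probability the evaluated policy gives the logged action
    logging_probs[i] : probability the logging policy gave it (must be > 0)
    """
    weights = [p / q for p, q in zip(target_probs, logging_probs)]
    n = len(weights)
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    return {
        "mean_weight": sum_w / n,
        # Population variance of the weights; large values signal noisy estimates.
        "weight_variance": statistics.pvariance(weights),
        # Effective sample size, (sum w)^2 / sum w^2.
        "effective_sample_size": (sum_w * sum_w / sum_w2) if sum_w2 > 0 else 0.0,
        # Share of logged events that contribute at all (assumed definition).
        "coverage": sum(1 for w in weights if w > 0) / n,
    }
```

Comparing these numbers between the greedy policy and each exploratory policy would show whether a ranking gap coincides with a gap in effective sample size.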

minor comments (2)

[Abstract] The abstract states 'over 90% of various datasets' without enumerating the exact datasets, their sizes, or the logging policy used; adding a table or appendix listing these details would improve reproducibility.
[§4] Clarify the precise definition of 'top-tier performance' (e.g., whether it is by cumulative reward, regret, or ranking) and whether statistical significance tests were applied across runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strengths and potential limitations of our empirical analysis. We address each major comment below and outline the revisions we will make to the manuscript.

Referee: [§4 and evaluation protocol] §4 (Experimental Results) and the evaluation protocol description: the manuscript reports neither per-policy coverage statistics nor the variance of importance-sampling weights for the exploratory versus greedy policies. Because replay-style evaluation rejects or heavily down-weights actions absent from the logged data, and because LinUCB/LinTS deliberately increase action diversity, the observed ranking could be an artifact of higher rejection rates or higher-variance estimates for exploratory policies rather than evidence of protocol bias.

Authors: We agree that additional diagnostics on coverage and importance weight variance would strengthen the analysis. In the revised version, we will add tables reporting the average coverage (fraction of recommended actions present in the logged data) and the variance of the importance sampling weights for the greedy policy and the exploratory policies (LinUCB and LinTS) across all datasets. Our preliminary calculations show that while exploratory policies indeed exhibit lower coverage and higher weight variance, the performance advantage of the greedy policy remains statistically significant even on the subset of datasets with comparable coverage levels. This suggests the observed pattern is not merely an artifact of the evaluation mechanics. revision: yes
Referee: [Abstract and §5] Abstract and §5 (Discussion): the inference that 'offline evaluation protocols are biased against exploration' rests on the assumption that the chosen replay method produces comparable, low-bias estimates across policies whose action distributions differ substantially from the logging policy. No doubly-robust or model-based corrections are mentioned, leaving the central claim vulnerable to the very coverage issue the skeptic note identifies.

Authors: We recognize that standard replay evaluation can introduce bias when policies deviate from the logging distribution. Our claim is not that the estimates are unbiased in an absolute sense, but that under the commonly used offline protocols, greedy policies consistently outperform exploratory ones. To address this, we will revise §5 to include a more explicit discussion of the limitations of replay-based evaluation and the potential benefits of doubly-robust estimators. However, we maintain that the hyperparameter optimization results—where optimal configurations favor zero or minimal exploration—provide supporting evidence that is less sensitive to direct policy comparisons. We will also cite relevant literature on offline evaluation biases in bandits. revision: partial
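For readers unfamiliar with the doubly-robust estimators raised in this exchange, a minimal sketch follows. It combines a reward model with an importance-weighted correction on the logged action. The function `doubly_robust_value`, the event layout `(context, action, reward, logging_prob)`, and the callables `target_prob` and `reward_model` are hypothetical and are not taken from the paper.

```python
def doubly_robust_value(events, target_prob, reward_model, actions):
    """Doubly-robust estimate of a policy's value (illustrative sketch).

    events       : list of (context, action, reward, logging_prob)
    target_prob  : target_prob(context, action) -> pi(action | context)
    reward_model : reward_model(context, action) -> predicted reward
    actions      : iterable of all candidate actions
    """
    total = 0.0
    for context, action, reward, logging_prob in events:
        # Direct-method term: model-predicted value of the target policy.
        direct = sum(target_prob(context, a) * reward_model(context, a)
                     for a in actions)
        # Importance-weighted correction using the observed reward.
        weight = target_prob(context, action) / logging_prob
        correction = weight * (reward - reward_model(context, action))
        total += direct + correction
    return total / len(events)
```

With a reward model that always predicts zero, the estimate reduces to plain inverse propensity scoring; with an accurate reward model, the correction term shrinks and the estimate depends less on events where the evaluated policy and the log disagree.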

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivation chain

The paper reports direct offline experiments comparing standard linear bandit algorithms (including greedy) on recommendation datasets, with the central claim resting on observed performance rankings and hyperparameter sweeps. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text or abstract. The results are self-contained empirical observations against external datasets and do not reduce to prior definitions or inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard linear bandit formulations and offline evaluation techniques drawn from prior literature; no new free parameters, axioms, or invented entities are introduced in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Passage: "Across over 90% of various datasets, a greedy linear model, with no type of exploration, consistently achieves top-tier performance"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Passage: "offline evaluation protocols using historical data accurately reflect the true efficacy of exploration strategies"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page doi:10.1145/3132847.3133025 2017
[55]

Xiao Xu, Fang Dong, Yanghua Li, Shaojian He, and Xin Li. 2020. Contextual- Bandit Based Personalized Recommendation with Time-Varying User Interests. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (New York, NY, USA) (AAAI’20). AAAI Press, Palo Alto, CA, USA, 6518–6525. doi:10.1609/ aaai.v34i04.6125

work page 2020
[56]

Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. 2018. Unbiased Offline Recommender Evaluation for Missing-Not-at- Random Implicit Feedback. In Proceedings of the 12th ACM Conference on Rec- ommender Systems (Vancouver, Canada) (RecSys’18). Association for Computing Machinery, New York, NY, USA, 279–287. doi:10.1145/3240...

work page doi:10.1145/3240323.3240355 2018
[57]

Chunqiu Zeng, Qing Wang, Shekoofeh Mokhtari, and Tao Li. 2016. Online Context-Aware Recommendation with Time Varying Multi-Armed Bandit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, CA, USA) (KDD’16). Association for Computing Machinery, New York, NY, USA, 2025–2034. doi:10.1145/2939...

work page doi:10.1145/2939672 2016
[58]

Xiaoying Zhang, Hong Xie, Hang Li, and John C.S. Lui. 2020. Conversational Contextual Bandit: Algorithm and Application. In Proceedings of The Web Confer- ence 2020 (Taipei, Taiwan) (WWW’20). Association for Computing Machinery, New York, NY, USA, 662–672. doi:10.1145/3366423.3380148

work page doi:10.1145/3366423.3380148 2020
[59]

Kesen Zhao, Shuchang Liu, Qingpeng Cai, Xiangyu Zhao, Ziru Liu, Dong Zheng, Peng Jiang, and Kun Gai. 2023. KuaiSim: a comprehensive simulator for recom- mender systems. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS’18). Curran Asso- ciates Inc., Red Hook, NY, USA, 44880–44897. doi...

work page doi:10.5555/3666122.3668067 2023
[60]

Chunyi Zhou, Yuanyuan Jin, Xiaoling Wang, and Yingjie Zhang. 2020. Conversa- tional Music Recommendation based on Bandits. In Proceedings of the 2020 IEEE International Conference on Knowledge Graph (Nanjing, China) (ICKG’20). IEEE, New York, NY, USA, 41–48. doi:10.1109/ICBK50248.2020.00016

work page doi:10.1109/icbk50248.2020.00016 2020
[61]

Sijin Zhou, Xinyi Dai, Haokun Chen, Weinan Zhang, Kan Ren, Ruiming Tang, Xiuqiang He, and Yong Yu. 2020. Interactive Recommender System via Knowledge Graph-enhanced Reinforcement Learning. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR’20). Association for Co...

work page doi:10.1145/3397271.3401174 2020
[62]

Zheqing Zhu and Benjamin Van Roy. 2023. Deep Exploration for Recommenda- tion Systems. In Proceedings of the 17th ACM Conference on Recommender Systems (Singapore, Singapore) (RecSys’23). Association for Computing Machinery, New York, NY, USA, 963–970. doi:10.1145/3604915.3608855

work page doi:10.1145/3604915.3608855 2023

[1]

Shipra Agrawal and Navin Goyal. 2013. Thompson Sampling for Contextual Bandits with Linear Payoffs. In Proceedings of the 30th International Conference on Machine Learning (Atlanta, GA, USA) (ICML’13). JMLR.org, New York, NY, USA, 1220–1228. doi:10.48550/arXiv.1209.3352

[2]

Saeed Ghoorchian, Evgenii Kortukov, and Setareh Maghsudi. 2024. Non-Stationary Linear Bandits With Dimensionality Reduction for Large-Scale Recommender Systems. IEEE Open Journal of Signal Processing 5 (2024), 548–558. doi:10.1109/OJSP.2024.3386490

[3]

Alina Beygelzimer and John Langford. 2016. The Offset Tree for Learning with Partial Labels. arXiv preprint (2016), 1–16. doi:10.48550/arXiv.0812.4044 arXiv:0812.4044 [cs.LG]

[4]

Ana Caraban, Evangelos Karapanos, Daniel Gonçalves, and Pedro Campos. 2019. 23 Ways to Nudge: A Review of Technology-Mediated Nudging in Human-Computer Interaction. In Proceedings of the 2019 CHI Conference on Human...

work page doi:10.1145/3290605.3300733 2019

[5]

Stéphane Caron and Smriti Bhagat. 2013. Mixing bandits: a recipe for improved cold-start recommendations in a social network. In Proceedings of the 7th Workshop on Social Network Mining and Analysis (Chicago, IL, USA) (SNAKDD’13). Association for Computing Machinery, New York, NY, USA, 1–9. doi:10.1145/2501025.2501029

[6]

Luciano Caroprese, Francesco Sergio Pisani, Bruno Miguel Veloso, Matthias Konig, Giuseppe Manco, Holger Hoos, and João Gama. 2025. Modelling Concept Drift in Dynamic Data Streams for Recommender Systems. ACM Transactions on Recommender Systems 3, 2 (2025), 1–28. doi:10.1145/3707693

[7]

L. Elisa Celis, Sayash Kapoor, Farnood Salehi, and Nisheeth Vishnoi. 2019. Controlling Polarization in Personalization: An Algorithmic Framework. In Proceedings of the Conference on Fairness, Accountability, and Transparency (Atlanta, GA, USA) (FAT’19). Association for Computing Machinery, New York, NY, USA, 160–169. doi:10.1145/3287560.3287601

[8]

Nicolò Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. 2013. A Gang of Bandits. In Proceedings of the 27th International Conference on Neural Information Processing Systems (Lake Tahoe, NV, USA) (NIPS’13). Curran Associates Inc., Red Hook, NY, USA, 737–745. doi:10.5555/2999611.2999694

[9]

Olivier Chapelle and Lihong Li. 2011. An empirical evaluation of Thompson sampling. In Proceedings of the 25th International Conference on Neural Information Processing Systems (Granada, Spain) (NIPS’11). Curran Associates Inc., Red Hook, NY, USA, 2249–2257. doi:10.5555/2986459.2986710

[10]

Lixing Chen, Jie Xu, and Zhuo Lu. 2018. Contextual Combinatorial Multi-armed Bandits with Volatile Arms and Submodular Reward. In Proceedings of the 32nd Conference on Neural Information Processing Systems (Montréal, Canada) (NeurIPS’18). Curran Associates, Inc., Red Hook, NY, USA, 3251–3260. doi:10.5555/3327144.3327245

[11]

Minmin Chen, Yuyan Wang, Can Xu, Ya Le, Mohit Sharma, Lee Richardson, Su-Lin Wu, and Ed Chi. 2021. Values of User Exploration in Recommender Systems. In Proceedings of the 15th ACM Conference on Recommender Systems (Amsterdam, Netherlands) (RecSys’21). Association for Computing Machinery, New York, NY, USA, 85–95. doi:10.1145/3460231.3474236

[12]

Xiaocong Chen, Chaoran Huang, Lina Yao, Xianzhi Wang, Wei Liu, and Wenjie Zhang. 2020. Knowledge-guided Deep Reinforcement Learning for Interactive Recommendation. In Proceedings of the 2020 International Joint Conference on Neural Networks (Glasgow, UK) (IJCNN’20). IEEE, New York, NY, USA, 1–8. doi:10.1109/IJCNN48605.2020.9207010

[13]

Miroslav Dudik, John Langford, and Lihong Li. 2011. Doubly Robust Policy Evaluation and Learning. In Proceedings of the 28th International Conference on Machine Learning (Bellevue, WA, USA) (ICML’11). Omnipress, Madison, WI, USA, 1097–1104. doi:10.5555/3104482.3104620

[14]

João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A Survey on Concept Drift Adaptation. Comput. Surveys 46, 4 (2014), 1–37. doi:10.1145/2523813

[15]

Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B Testing for Recommender Systems. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (Marina Del Rey, CA, USA) (WSDM’18). Association for Computing Machinery, New York, NY, USA, 198–206. doi:10.1145/3159652.3159687

[16]

Shashank Gupta, Olivier Jeunen, Harrie Oosterhuis, and Maarten de Rijke. 2024. Optimal Baseline Corrections for Off-Policy Contextual Bandits. In Proceedings of the 18th ACM Conference on Recommender Systems (Bari, Italy) (RecSys’24). Association for Computing Machinery, New York, NY, USA, 722–732. doi:10.1145/3640457.3688105

[17]

Negar Hariri, Bamshad Mobasher, and Robin Burke. 2015. Adapting to user preference changes in interactive recommendation. In Proceedings of the 24th International Conference on Artificial Intelligence (Buenos Aires, Argentina) (IJCAI’15). AAAI Press, Palo Alto, CA, USA, 4268–4274. doi:10.5555/2832747.2832852

[18]

Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In Proceedings of the 8th IEEE International Conference on Data Mining (Pisa, Italy) (ICDM’08). IEEE Computer Society, New York, NY, USA, 263–272. doi:10.1109/ICDM.2008.22

[19]

Eugene Ie, Chih-Wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. 2019. RecSim: A Configurable Simulation Platform for Recommender Systems. arXiv preprint (2019), 1–23. doi:10.48550/arXiv.1909.04847 arXiv:1909.04847 [cs.LG]

[20]

Rolf Jagerman, Ilya Markov, and Maarten de Rijke. 2019. When People Change their Mind: Off-Policy Evaluation in Non-stationary Recommendation Environments. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining (Melbourne, Australia) (WSDM’19). Association for Computing Machinery, New York, NY, USA, 447–455. doi:10.1145/32...

work page doi:10.1145/3289600.3290958 2019

[21]

Mathias Jesse and Dietmar Jannach. 2021. Digital Nudging with Recommender Systems: Survey and Future Directions. Computers in Human Behavior Reports 3 (2021), 100052. doi:10.1016/j.chbr.2020.100052

[22]

Serdar Kadıoğlu and Bernard Kleynhans. 2024. Building Higher-Order Abstractions from the Components of Recommender Systems. In Proceedings of the AAAI Conference on Artificial Intelligence (Vancouver, Canada) (AAAI-24). AAAI Press, Washington, DC, USA, 22998–23004. doi:10.1609/aaai.v38i21.30341

[23]

John Langford and Tong Zhang. 2007. The Epoch-Greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems (Vancouver, Canada) (NIPS’07). Curran Associates Inc., Red Hook, NY, USA, 817–824. doi:10.5555/2981562.2981665

[24]

Chang Li, Haoyun Feng, and Maarten de Rijke. 2020. Cascading Hybrid Bandits: Online Learning to Rank for Relevance and Diversity. In Proceedings of the 14th ACM Conference on Recommender Systems (Virtual Event, Brazil) (RecSys’20). Association for Computing Machinery, New York, NY, USA, 33–42. doi:10.1145/3383313.3412245

[25]

Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. 2011. An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. In Proceedings of the 2011 International Conference on On-line Trading of Exploration and Exploitation 2 (Bellevue, WA, USA) (OTEAE’11). JMLR.org, New York, NY, USA, 19–36. doi:10.5555/304...

work page doi:10.5555/3045725.3045727 2011

[26]

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web (Madrid, Spain) (WWW’10). Association for Computing Machinery, New York, NY, USA, 661–670. doi:10.1145/1772690.1772758

[27]

Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. 2018. Offline Evaluation of Ranking Policies with Click Models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, UK) (KDD’18). Association for Computing Machinery, New York, NY, USA, 1685–1694. doi:1...

work page doi:10.1145/3219819 2018

[28]

Yu Liang and Martijn C. Willemsen. 2021. The Role of Preference Consistency, Defaults and Musical Expertise in Users’ Exploration Behavior in a Genre Exploration Recommender. In Proceedings of the 15th ACM Conference on Recommender Systems (Amsterdam, Netherlands) (RecSys’21). Association for Computing Machinery, New York, NY, USA, 230–240. doi:10.114...

work page doi:10.1145/3460231.3474253 2021

[29]

Yu Liang and Martijn C. Willemsen. 2022. Exploring the Longitudinal Effects of Nudging on Users’ Music Genre Exploration Behavior and Listening Preferences. In Proceedings of the 16th ACM Conference on Recommender Systems (Seattle, WA, USA) (RecSys’22). Association for Computing Machinery, New York, NY, USA, 3–13. doi:10.1145/3523227.3546772

[30]

Bo Liu, Ying Wei, Yu Zhang, Zhixian Yan, and Qiang Yang. 2018. Transferable Contextual Bandit for Cross-Domain Recommendation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (New Orleans, LA, USA) (AAAI’18). AAAI Press, Palo Alto, CA, USA, 3619–3626. doi:10.1609/aaai.v32i1.11699

[31]

James McInerney, Benjamin Lacker, Samantha Hansen, Karl Higley, Hugues Bouchard, Alois Gruson, and Rishabh Mehrotra. 2018. Explore, exploit, and explain: personalizing explainable recommendations with bandits. In Proceedings of the 12th ACM Conference on Recommender Systems (Vancouver, Canada) (RecSys’18). Association for Computing Machinery, New York, ...

work page doi:10.1145/3240323.3240354 2018

[32]

Martin Mladenov, Chih-Wei Hsu, Vihan Jain, Eugene Ie, Christopher Colby, Nicolas Mayoraz, Hubert Pham, Dustin Tran, Ivan Vendrov, and Craig Boutilier. 2021. RecSim NG: Toward Principled Uncertainty Modeling for Recommender Ecosystems. arXiv preprint (2021), 1–23. doi:10.48550/arXiv.2103.08057 arXiv:2103.08057 [cs.LG]

[34]

Trong T. Nguyen and Hady W. Lauw. 2014. Dynamic Clustering of Contextual Multi-Armed Bandits. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (Shanghai, China) (CIKM’14). Association for Computing Machinery, New York, NY, USA, 1959–1962. doi:10.1145/2661829.2662063

[35]

Javier Parapar and Filip Radlinski. 2021. Towards Unified Metrics for Accuracy and Diversity for Recommender Systems. In Proceedings of the 15th ACM Conference on Recommender Systems (Amsterdam, Netherlands) (RecSys’21). Association for Computing Machinery, New York, NY, USA, 75–84. doi:10.1145/3460231.3474234

[36]

Dattaraj Rao. 2020. Contextual Bandits for adapting to changing User preferences over time. arXiv preprint (2020), 1–11. doi:10.48550/arXiv.2009.10073 arXiv:2009.10073 [cs.LG]

[37]

David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou. 2018. RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising. arXiv preprint (2018), 1–5. doi:10.48550/arXiv.1808.00720 arXiv:1808.00720 [cs.IR]

[38]

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. 2021. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (Virtual Event) (NeurIPS’21). Curran Associates Inc., Red Hook, NY, USA, 1–14. doi:10.48550/arXiv.2008.07146

[39]

Guy Shani and Asela Gunawardana. 2011. Evaluating Recommendation Systems. In Recommender Systems Handbook, Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor (Eds.). Springer US, New York, NY, USA, Chapter 8, 257–259. doi:10.1007/978-0-387-85820-3

[40]

Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. 2018. Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning. arXiv preprint (2018), 1–15. doi:10.48550/arXiv.1805.10000 arXiv:1805.10000 [cs.AI]

[41]

Nícollas Silva, Heitor Werneck, Thiago Silva, Adriano C. M. Pereira, and Leonardo Rocha. 2022. Multi-Armed Bandits in Recommendation Systems: A survey of the state-of-the-art and future directions. Expert Systems with Applications 197, 1 (2022), 1–17. doi:10.1016/j.eswa.2022.116669

[42]

Aleksandrs Slivkins. 2019. Introduction to Multi-Armed Bandits. Foundations and Trends® in Machine Learning 12, 1 (2019), 1–286. doi:10.1561/2200000068

[43]

Linqi Song, Christina Fragouli, and Devavrat Shah. 2019. Interactions Between Learning and Broadcasting in Wireless Recommendation Systems. In Proceedings of the 2019 IEEE International Symposium on Information Theory (Paris, France) (ISIT’19). IEEE, New York, NY, USA, 2549–2553. doi:10.1109/ISIT.2019.8849556

[44]

Sho Takemori, Masahiro Sato, Takashi Sonoda, Janmajay Singh, and Tomoko Ohkuma. 2020. Submodular Bandit Problem Under Multiple Constraints. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (Virtual Event) (UAI’20). JMLR.org, New York, NY, USA, 191–200. doi:10.48550/arXiv.2006.00661

[45]

Liang Tang, Yexi Jiang, Lei Li, and Tao Li. 2014. Ensemble Contextual Bandits for Personalized Recommendation. In Proceedings of the 8th ACM Conference on Recommender Systems (Foster City, CA, USA) (RecSys’14). Association for Computing Machinery, New York, NY, USA, 73–80. doi:10.1145/2645710.2645732

[46]

Stefano Tracà, Cynthia Rudin, and Weiyu Yan. 2019. Reducing Exploration of Dying Arms in Mortal Bandits. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (Tel Aviv, Israel) (UAI’19). JMLR.org, New York, NY, USA, 156–163. doi:10.48550/arXiv.1907.02571

[47]

Bram van den Akker, Olivier Jeunen, Ying Li, Ben London, Zahra Nazari, and Devesh Parekh. 2023. Practical Bandits: An Industry Perspective. arXiv preprint (2023), 1–5. doi:10.48550/arXiv.2302.01223 arXiv:2302.01223 [cs.LG]

[48]

João Vinagre, Alípio Mário Jorge, and João Gama. 2015. An Overview on the Exploitation of Time in Collaborative Filtering. WIREs Data Mining and Knowledge Discovery 5 (2015), 195–215. doi:10.1002/widm.1160

[49]

Huazheng Wang, Qingyun Wu, and Hongning Wang. 2016. Learning Hidden Features for Contextual Bandits. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (Indianapolis, IN, USA) (CIKM’16). Association for Computing Machinery, New York, NY, USA, 1633–1642. doi:10.1145/2983323.2983847

[50]

Huazheng Wang, Qingyun Wu, and Hongning Wang. 2017. Factorization Bandits for Interactive Recommendation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (San Francisco, CA, USA) (AAAI’17). AAAI Press, Palo Alto, CA, USA, 2695–2702. doi:10.1609/aaai.v31i1.10936

[51]

Huazheng Wang, Haifeng Xu, Chuanhao Li, Zhiyuan Liu, and Hongning Wang. 2021. Incentivizing Exploration in Linear Bandits under Information Gap. arXiv preprint (2021), 1–14. doi:10.48550/arXiv.2104.03860 arXiv:2104.03860 [cs.LG]

[53]

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. 2016. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. arXiv preprint (2016), 1–23. doi:10.48550/arXiv.1612.01205 arXiv:1612.01205 [stat.ML]

[54]

Qingyun Wu, Hongning Wang, Liangjie Hong, and Yue Shi. 2017. Returning is Believing: Optimizing Long-term User Engagement in Recommender Systems. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (Singapore, Singapore) (CIKM’17). Association for Computing Machinery, New York, NY, USA, 1927–1936. doi:10.1145/3132847.3133025

[55]

Xiao Xu, Fang Dong, Yanghua Li, Shaojian He, and Xin Li. 2020. Contextual-Bandit Based Personalized Recommendation with Time-Varying User Interests. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (New York, NY, USA) (AAAI’20). AAAI Press, Palo Alto, CA, USA, 6518–6525. doi:10.1609/aaai.v34i04.6125

[56]

Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. 2018. Unbiased Offline Recommender Evaluation for Missing-Not-at-Random Implicit Feedback. In Proceedings of the 12th ACM Conference on Recommender Systems (Vancouver, Canada) (RecSys’18). Association for Computing Machinery, New York, NY, USA, 279–287. doi:10.1145/3240...

work page doi:10.1145/3240323.3240355 2018

[57]

Chunqiu Zeng, Qing Wang, Shekoofeh Mokhtari, and Tao Li. 2016. Online Context-Aware Recommendation with Time Varying Multi-Armed Bandit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, CA, USA) (KDD’16). Association for Computing Machinery, New York, NY, USA, 2025–2034. doi:10.1145/2939...

work page doi:10.1145/2939672 2016

[58]

Xiaoying Zhang, Hong Xie, Hang Li, and John C.S. Lui. 2020. Conversational Contextual Bandit: Algorithm and Application. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW’20). Association for Computing Machinery, New York, NY, USA, 662–672. doi:10.1145/3366423.3380148

[59]

Kesen Zhao, Shuchang Liu, Qingpeng Cai, Xiangyu Zhao, Ziru Liu, Dong Zheng, Peng Jiang, and Kun Gai. 2023. KuaiSim: a comprehensive simulator for recommender systems. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NeurIPS’23). Curran Associates Inc., Red Hook, NY, USA, 44880–44897. doi...

work page doi:10.5555/3666122.3668067 2023

[60]

Chunyi Zhou, Yuanyuan Jin, Xiaoling Wang, and Yingjie Zhang. 2020. Conversational Music Recommendation based on Bandits. In Proceedings of the 2020 IEEE International Conference on Knowledge Graph (Nanjing, China) (ICKG’20). IEEE, New York, NY, USA, 41–48. doi:10.1109/ICBK50248.2020.00016

[61]

Sijin Zhou, Xinyi Dai, Haokun Chen, Weinan Zhang, Kan Ren, Ruiming Tang, Xiuqiang He, and Yong Yu. 2020. Interactive Recommender System via Knowledge Graph-enhanced Reinforcement Learning. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR’20). Association for Co...

work page doi:10.1145/3397271.3401174 2020

[62]

Zheqing Zhu and Benjamin Van Roy. 2023. Deep Exploration for Recommendation Systems. In Proceedings of the 17th ACM Conference on Recommender Systems (Singapore, Singapore) (RecSys’23). Association for Computing Machinery, New York, NY, USA, 963–970. doi:10.1145/3604915.3608855