Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation
Pith reviewed 2026-05-19 02:24 UTC · model grok-4.3
The pith
Offline evaluations of linear bandit recommenders favor pure exploitation over any exploration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across more than 90 percent of the datasets examined, a greedy linear model that performs no exploration achieves top-tier performance and frequently outperforms or matches its exploratory counterparts; hyperparameter optimization further selects configurations that minimize exploration, showing that pure exploitation dominates the outcomes produced by standard offline evaluation protocols for linear bandit recommenders.
What carries the argument
Offline evaluation of linear regression-based bandit variants on historical recommender datasets, where algorithms differ mainly in their exploration mechanisms.
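To make the "same backbone, different exploration" point concrete, here is a minimal sketch, not the paper's implementation. It assumes a disjoint per-arm ridge regression model; the class name LinearBandit and the parameters mode, alpha and reg are hypothetical names chosen for illustration. The three modes share the same regression estimate and differ only in how an arm is scored: greedy uses the point estimate, the UCB-style rule adds a confidence bonus, and the Thompson-sampling-style rule scores with a sampled parameter vector.

```python
import numpy as np

class LinearBandit:
    """Per-arm ridge regression with a pluggable exploration rule (illustrative only)."""

    def __init__(self, n_arms, dim, mode="greedy", alpha=1.0, reg=1.0, seed=0):
        self.mode, self.alpha = mode, alpha
        # A[a] accumulates X^T X + reg * I and b[a] accumulates X^T y for arm a.
        self.A = np.stack([reg * np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))
        self.rng = np.random.default_rng(seed)

    def scores(self, x):
        out = np.empty(len(self.A))
        for a, (A, b) in enumerate(zip(self.A, self.b)):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                      # shared ridge estimate
            if self.mode == "greedy":              # no exploration term
                out[a] = theta @ x
            elif self.mode == "ucb":               # optimism: add a confidence width
                out[a] = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            elif self.mode == "ts":                # randomization: sample parameters
                sample = self.rng.multivariate_normal(theta, self.alpha ** 2 * A_inv)
                out[a] = sample @ x
            else:
                raise ValueError(f"unknown mode: {self.mode}")
        return out

    def select(self, x):
        return int(np.argmax(self.scores(x)))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In this sketch, setting alpha to 0 in the UCB mode reproduces the greedy scores exactly, which is the kind of configuration the hyperparameter optimization is reported to favor.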
If this is right
- Pure exploitation strategies appear sufficient or superior under current offline testing regimes for recommender performance.
- Offline protocols may systematically underestimate the value of exploration during dynamic user interactions.
- Alternative evaluation frameworks are needed to assess exploration efficacy outside historical logs.
- Reported advantages of exploration in linear bandits may not translate from offline results to production systems.
Where Pith is reading between the lines
- Practitioners relying on offline results might under-deploy exploratory strategies and miss long-term gains in user engagement.
- The same offline bias likely appears in other off-policy evaluation settings used in recommendation and sequential decision making.
- Counterfactual or online A/B testing methods could serve as more faithful checks on exploration benefits.
Load-bearing premise
Offline evaluation protocols using historical data accurately reflect the true efficacy of exploration strategies in live interactive settings for linear bandit recommenders.
What would settle it
A controlled online deployment in which exploratory linear bandits produce clearly higher user metrics than the greedy baseline would contradict the claim that offline protocols systematically undervalue exploration.
Original abstract
Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration-exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90% of various datasets, a greedy linear model, with no type of exploration, consistently achieves top-tier performance, often outperforming or matching its exploratory counterparts. This observation is further corroborated by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy within these evaluation settings. Our results expose significant inadequacies in offline evaluation protocols for bandits, particularly concerning their capacity to reflect true exploratory efficacy. Consequently, this research underscores the urgent necessity for developing more robust assessment methodologies, guiding future investigations into alternative evaluation frameworks for interactive learning in recommender systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an extensive offline empirical comparison of linear contextual bandit algorithms for recommender systems. It reports that a purely greedy linear model (no exploration) achieves top-tier performance across over 90% of tested datasets, often matching or exceeding exploratory variants such as LinUCB and LinTS. Hyperparameter sweeps are shown to consistently select configurations with minimal or zero exploration, from which the authors conclude that standard offline evaluation protocols are biased against exploration and inadequate for assessing true exploratory efficacy in interactive settings.
Significance. If the central empirical pattern is shown to be robust to evaluation artifacts, the work would be significant for the bandit and recommender-systems literature. It supplies a large-scale demonstration that offline replay can systematically undervalue exploration, thereby motivating the development of alternative assessment frameworks (e.g., model-based or doubly-robust estimators, or online A/B testing). The breadth of datasets examined is a positive feature that increases the result's potential impact.
major comments (2)
- [§4 and evaluation protocol] §4 (Experimental Results) and the evaluation protocol description: the manuscript reports neither per-policy coverage statistics nor the variance of importance-sampling weights for the exploratory versus greedy policies. Because replay-style evaluation rejects or heavily down-weights actions absent from the logged data, and because LinUCB/LinTS deliberately increase action diversity, the observed ranking could be an artifact of higher rejection rates or higher-variance estimates for exploratory policies rather than evidence of protocol bias.
- [Abstract and §5] Abstract and §5 (Discussion): the inference that 'offline evaluation protocols are biased against exploration' rests on the assumption that the chosen replay method produces comparable, low-bias estimates across policies whose action distributions differ substantially from the logging policy. No doubly-robust or model-based corrections are mentioned, leaving the central claim vulnerable to the very coverage issue the skeptic note identifies.
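The rejection mechanism behind both major comments can be sketched as follows. This is a generic replay evaluator in the style commonly used for bandits, written under the assumption that each logged event is a (context, logged action, reward) triple; the page does not specify the paper's exact protocol, and the names replay_evaluate and AlwaysArm are hypothetical. An event counts only when the evaluated policy picks the same action as the log, so a policy that disagrees with the log more often is scored on fewer events.

```python
import numpy as np

def replay_evaluate(policy, contexts, logged_actions, rewards):
    """Replay a logged stream; only events where the policy agrees with the log count."""
    matched, total_reward = 0, 0.0
    for x, a_logged, r in zip(contexts, logged_actions, rewards):
        if policy.select(x) != a_logged:
            continue                   # rejected: the chosen arm's reward was never observed
        matched += 1
        total_reward += r
        policy.update(a_logged, x, r)  # the policy learns only from accepted events
    n = len(rewards)
    return {
        "mean_reward": total_reward / matched if matched else float("nan"),
        "acceptance_rate": matched / n if n else float("nan"),
    }

class AlwaysArm:
    """Hypothetical stand-in policy that always recommends one fixed arm."""
    def __init__(self, arm):
        self.arm = arm
    def select(self, x):
        return self.arm
    def update(self, arm, x, reward):
        pass

# Usage: four logged events, two of which match the policy's choice of arm 0.
result = replay_evaluate(AlwaysArm(0), np.zeros((4, 2)), [0, 1, 0, 1], [1.0, 0.0, 0.0, 1.0])
```

Reporting acceptance_rate per policy alongside the reward is one way to check whether a ranking reflects reward differences or simply differences in how many events each policy was scored on.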
minor comments (2)
- [Abstract] The abstract states 'over 90% of various datasets' without enumerating the exact datasets, their sizes, or the logging policy used; adding a table or appendix listing these details would improve reproducibility.
- [§4] Clarify the precise definition of 'top-tier performance' (e.g., whether it is by cumulative reward, regret, or ranking) and whether statistical significance tests were applied across runs.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the strengths and potential limitations of our empirical analysis. We address each major comment below and outline the revisions we will make to the manuscript.
-
Referee: [§4 and evaluation protocol] §4 (Experimental Results) and the evaluation protocol description: the manuscript reports neither per-policy coverage statistics nor the variance of importance-sampling weights for the exploratory versus greedy policies. Because replay-style evaluation rejects or heavily down-weights actions absent from the logged data, and because LinUCB/LinTS deliberately increase action diversity, the observed ranking could be an artifact of higher rejection rates or higher-variance estimates for exploratory policies rather than evidence of protocol bias.
Authors: We agree that additional diagnostics on coverage and importance weight variance would strengthen the analysis. In the revised version, we will add tables reporting the average coverage (fraction of recommended actions present in the logged data) and the variance of the importance sampling weights for the greedy policy and the exploratory policies (LinUCB and LinTS) across all datasets. Our preliminary calculations show that while exploratory policies indeed exhibit lower coverage and higher weight variance, the performance advantage of the greedy policy remains statistically significant even on the subset of datasets with comparable coverage levels. This suggests the observed pattern is not merely an artifact of the evaluation mechanics.
Revision: yes
-
Referee: [Abstract and §5] Abstract and §5 (Discussion): the inference that 'offline evaluation protocols are biased against exploration' rests on the assumption that the chosen replay method produces comparable, low-bias estimates across policies whose action distributions differ substantially from the logging policy. No doubly-robust or model-based corrections are mentioned, leaving the central claim vulnerable to the very coverage issue the skeptic note identifies.
Authors: We recognize that standard replay evaluation can introduce bias when policies deviate from the logging distribution. Our claim is not that the estimates are unbiased in an absolute sense, but that under the commonly used offline protocols, greedy policies consistently outperform exploratory ones. To address this, we will revise §5 to include a more explicit discussion of the limitations of replay-based evaluation and the potential benefits of doubly-robust estimators. However, we maintain that the hyperparameter optimization results, in which the optimal configurations favor zero or minimal exploration, provide supporting evidence that is less sensitive to direct policy comparisons. We will also cite relevant literature on offline evaluation biases in bandits.
Revision: partial
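The diagnostics discussed in this exchange could be computed along the following lines. This is a sketch under stated assumptions, not the authors' procedure: it assumes the probability the target policy assigns to each logged action and the logging propensity of that action are both available, and the function name ips_diagnostics is hypothetical. Coverage is taken here as the fraction of logged events the target policy could have produced, which is one possible reading of the definition in the rebuttal.

```python
import numpy as np

def ips_diagnostics(target_probs, logging_probs, rewards):
    """Summarize importance weights w = pi(a|x) / mu(a|x) over logged events."""
    target_probs = np.asarray(target_probs, dtype=float)
    logging_probs = np.asarray(logging_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    w = target_probs / logging_probs
    # Effective sample size (sum w)^2 / sum w^2 shrinks as weights become uneven.
    ess = float(w.sum() ** 2 / np.sum(w ** 2)) if np.any(w > 0) else 0.0
    return {
        "coverage": float(np.mean(target_probs > 0)),
        "weight_variance": float(np.var(w)),
        "effective_sample_size": ess,
        "ips_estimate": float(np.mean(w * rewards)),
    }

# Usage: a deterministic target policy that agrees with half of the uniformly logged events.
stats = ips_diagnostics([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 0.0, 1.0])
```

Comparing these numbers between the greedy and exploratory policies on each dataset would show whether the policies are being evaluated on comparable amounts of effective data.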
Circularity Check
No circularity: purely empirical comparison with no derivation chain
The paper reports direct offline experiments comparing standard linear bandit algorithms (including greedy) on recommendation datasets, with the central claim resting on observed performance rankings and hyperparameter sweeps. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text or abstract. The results are self-contained empirical observations against external datasets and do not reduce to prior definitions or inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Paper passage: "Across over 90% of various datasets, a greedy linear model, with no type of exploration, consistently achieves top-tier performance"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Paper passage: "offline evaluation protocols using historical data accurately reflect the true efficacy of exploration strategies"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.