Unbiased Learning to Rank: Counterfactual and Online Approaches
Pith reviewed 2026-05-24 20:26 UTC · model grok-4.3
The pith
Both counterfactual and online methods achieve unbiased learning to rank from biased user feedback but differ substantially in guarantees, performance, user effects, and applicability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that both counterfactual LTR and online LTR lead to unbiased learning to rank, but that the two approaches differ considerably in theoretical guarantees, empirical results, effects on the user experience during learning, and applicability. For practitioners deploying systems that learn from user interactions, the choice between them is therefore consequential and requires weighing each of these factors.
What carries the argument
The side-by-side contrast of counterfactual methods, which explicitly model and correct biases in logged data, versus online methods, which rely on randomization to neutralize bias during interactive learning.
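The counterfactual side of this contrast can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a position-based click model (a click happens when a document is examined and relevant), assumes the examination propensities are known exactly, and uses hypothetical names such as `simulate_logs` and `ips_estimate`. Under those assumptions, dividing logged clicks by the propensity of the rank at which a document was shown (inverse propensity scoring) should recover its relevance, whereas raw click rates should not.

```python
import random


def simulate_logs(relevance, examination, ranking, n_sessions, rng):
    """Log clicks for one fixed ranking under a position-based click model."""
    clicks = [0] * len(relevance)
    for _ in range(n_sessions):
        for rank, doc in enumerate(ranking):
            examined = rng.random() < examination[rank]
            # A click requires that the document is both examined and relevant.
            if examined and rng.random() < relevance[doc]:
                clicks[doc] += 1
    return clicks


def naive_estimate(clicks, n_sessions):
    """Raw click-through rate per document; ignores position bias."""
    return [c / n_sessions for c in clicks]


def ips_estimate(clicks, ranking, examination, n_sessions):
    """Click-through rate reweighted by the inverse examination propensity."""
    estimates = [0.0] * len(clicks)
    for rank, doc in enumerate(ranking):
        estimates[doc] = clicks[doc] / (examination[rank] * n_sessions)
    return estimates


# Illustrative values only (assumptions, not data from the paper).
rng = random.Random(0)
relevance = [0.2, 0.5, 0.8]      # document 2 is the most relevant
examination = [1.0, 0.5, 0.25]   # assumed known propensity per rank
logged_ranking = [0, 1, 2]       # the logging policy puts the worst document on top
n_sessions = 200_000

clicks = simulate_logs(relevance, examination, logged_ranking, n_sessions, rng)
naive = naive_estimate(clicks, n_sessions)
ips = ips_estimate(clicks, logged_ranking, examination, n_sessions)

print("naive:", [round(x, 3) for x in naive])
print("ips:  ", [round(x, 3) for x in ips])
```

In expectation the naive rates are the product of relevance and examination probability, so the most relevant document looks no better than the least relevant one, while the reweighted estimates track the relevance values used in the simulation. In practice the propensities have to be estimated, which is a source of error this sketch ignores.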
If this is right
- Practitioners gain concrete criteria for selecting between historical correction and live randomization based on available data and tolerance for randomization.
- Theoretical analysis can be used to predict when one method will provide stronger bias removal than the other.
- Empirical benchmarks from prior work can be consulted to anticipate performance gaps in new ranking tasks.
- System designers must account for different user-experience costs during the learning phase when choosing a method.
- Applicability is limited by whether historical logs exist or live user traffic can be randomized.
Where Pith is reading between the lines
- The contrast suggests that systems with strict latency or privacy constraints on live randomization may default to counterfactual approaches.
- Hybrid methods could combine historical correction with selective online exploration to balance the strengths of each.
- The tutorial's framing implies that bias types beyond position bias may require tailored extensions of one method over the other.
- Deployment in production could benefit from monitoring metrics that the paper identifies as differing between the approaches.
Load-bearing premise
That differences in theoretical guarantees, empirical results, user experience effects, and applicability between the two methodologies can be reliably assessed and contrasted from the existing literature without new empirical validation.
What would settle it
A new controlled comparison of both methods on the same datasets and user populations would settle it: if the two turned out to have equivalent empirical performance and user-experience effects, and analysis showed equivalent theoretical guarantees and applicability, the claimed substantial differences would be undermined.
Original abstract
This tutorial covers and contrasts the two main methodologies in unbiased Learning to Rank (LTR): Counterfactual LTR and Online LTR. There has long been an interest in LTR from user interactions, however, this form of implicit feedback is very biased. In recent years, unbiased LTR methods have been introduced to remove the effect of different types of bias caused by user-behavior in search. For instance, a well addressed type of bias is position bias: the rank at which a document is displayed heavily affects the interactions it receives. Counterfactual LTR methods deal with such types of bias by learning from historical interactions while correcting for the effect of the explicitly modelled biases. Online LTR does not use an explicit user model, in contrast, it learns through an interactive process where randomized results are displayed to the user. Through randomization the effect of different types of bias can be removed from the learning process. Though both methodologies lead to unbiased LTR, their approaches differ considerably, furthermore, so do their theoretical guarantees, empirical results, effects on the user experience during learning, and applicability. Consequently, for practitioners the choice between the two is very substantial. By providing an overview of both approaches and contrasting them, we aim to provide an essential guide to unbiased LTR so as to aid in understanding and choosing between methodologies.
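The abstract's remark that randomization removes the effect of bias can be made concrete with a deliberately simplified simulation. This is a sketch under stated assumptions, not a method from the tutorial: it assumes a position-based click model, shows a uniformly random ranking in every session, and uses the hypothetical function name `click_rates_under_randomization`. Practical online LTR methods randomize far less aggressively, precisely because fully random rankings hurt the user experience during learning.

```python
import random


def click_rates_under_randomization(relevance, examination, n_sessions, rng):
    """Click rate per document when every session shows a uniformly random ranking."""
    n_docs = len(relevance)
    clicks = [0] * n_docs
    ranking = list(range(n_docs))
    for _ in range(n_sessions):
        rng.shuffle(ranking)  # each document is equally likely to land at each rank
        for rank, doc in enumerate(ranking):
            # Position-based model: click probability = examination * relevance.
            if rng.random() < examination[rank] * relevance[doc]:
                clicks[doc] += 1
    return [c / n_sessions for c in clicks]


# Illustrative values only (assumptions, not data from the paper).
rng = random.Random(1)
relevance = [0.2, 0.5, 0.8]
examination = [1.0, 0.5, 0.25]

rates = click_rates_under_randomization(relevance, examination, 100_000, rng)
order = sorted(range(len(rates)), key=lambda d: rates[d], reverse=True)

print("click rates:", [round(x, 3) for x in rates])
print("documents ordered by click rate:", order)
```

Because every document is shown at every rank equally often, the expected click rate of a document is its relevance times the average examination probability. Position bias then scales all documents by the same factor, so ordering by raw click rate matches ordering by relevance without any explicit user model.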
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This tutorial provides an overview and contrast of the two main methodologies for unbiased learning to rank (LTR) from implicit user feedback: counterfactual LTR, which corrects for biases such as position bias using historical interaction data and explicit user models, and online LTR, which removes bias effects through interactive randomization of results without an explicit user model. The central claim is that both approaches produce unbiased LTR but differ substantially in theoretical guarantees, empirical results, effects on user experience during learning, and applicability, making the choice between them consequential for practitioners.
Significance. If the synthesis of the existing literature is accurate, the tutorial could be a useful guide for the IR community by clarifying trade-offs between established counterfactual and online methods for unbiased LTR. It explicitly positions itself as an aid for understanding and choosing methodologies rather than advancing new derivations or experiments, which aligns with the scope of a tutorial.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the tutorial and the recommendation to accept. The review accurately captures the scope and intent of the work as a synthesis and contrast of counterfactual and online approaches to unbiased LTR.
Circularity Check
Tutorial overview with no derivations or self-referential claims
The paper is a tutorial that synthesizes and contrasts two established methodologies (counterfactual LTR and online LTR) from prior literature. It asserts no novel theorems, equations, derivations, fitted parameters, or predictions; its central claim is an overview of known differences in guarantees, results, user experience, and applicability. The work is self-contained against external benchmarks, which leaves no opportunity for circularity by construction, load-bearing self-citation, or a smuggled-in ansatz. No circular steps were identified.