arxiv: 1907.07260 · v1 · pith:DWRALWZZnew · submitted 2019-07-16 · 💻 cs.IR

Unbiased Learning to Rank: Counterfactual and Online Approaches

Pith reviewed 2026-05-24 20:26 UTC · model grok-4.3

classification 💻 cs.IR
keywords: unbiased learning to rank, counterfactual LTR, online LTR, position bias, implicit feedback, user interactions, ranking systems, bias correction

The pith

Both counterfactual and online methods achieve unbiased learning to rank from biased user feedback but differ substantially in guarantees, performance, user effects, and applicability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This tutorial establishes that counterfactual LTR corrects biases in historical interaction data through explicit models, while online LTR removes bias effects via randomization during live user interactions. Both approaches aim to produce unbiased rankings despite position bias and other distortions in implicit feedback. A sympathetic reader would care because the documented differences affect which method suits a given search system. The paper contrasts their theoretical guarantees, empirical results, impacts on users during learning, and practical applicability to guide selection. It reports no new experiments and positions itself as a reference for understanding these trade-offs.
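One common way to realize the counterfactual correction described above is inverse propensity scoring (IPS). The source does not name a specific estimator, so the following is a minimal sketch under assumptions: a position-based click model, examination propensities that are known exactly, and hypothetical documents, relevance values, and function names chosen only for illustration.

```python
import random

# Hypothetical setup (not from the paper): four documents with assumed
# true relevance probabilities, logged under one fixed production ranking.
RELEVANCE = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7}   # unknown in practice
PROPENSITY = [1.0, 0.5, 0.25, 0.125]                    # assumed P(examined | rank)
LOGGED_RANKING = ["a", "c", "d", "b"]                   # ranking that produced the log


def simulate_log(n_sessions, seed=0):
    """Simulate clicks under a position-based model: click = examined AND relevant."""
    rng = random.Random(seed)
    log = []
    for _ in range(n_sessions):
        for rank, doc in enumerate(LOGGED_RANKING):
            examined = rng.random() < PROPENSITY[rank]
            clicked = examined and rng.random() < RELEVANCE[doc]
            log.append((doc, rank, clicked))
    return log


def naive_estimate(log):
    """Raw click-through rate per document; confounded with display rank."""
    shown, clicks = {}, {}
    for doc, _, clicked in log:
        shown[doc] = shown.get(doc, 0) + 1
        clicks[doc] = clicks.get(doc, 0) + int(clicked)
    return {d: clicks[d] / shown[d] for d in shown}


def ips_estimate(log):
    """IPS estimate: each click is weighted by 1 / P(examined | rank)."""
    shown, weighted = {}, {}
    for doc, rank, clicked in log:
        shown[doc] = shown.get(doc, 0) + 1
        weight = 1.0 / PROPENSITY[rank] if clicked else 0.0
        weighted[doc] = weighted.get(doc, 0.0) + weight
    return {d: weighted[d] / shown[d] for d in shown}


log = simulate_log(50_000)
naive = naive_estimate(log)
ips = ips_estimate(log)
naive_order = sorted(naive, key=naive.get, reverse=True)
ips_order = sorted(ips, key=ips.get, reverse=True)
print("naive order:", naive_order)
print("IPS order:  ", ips_order)
```

With these assumed values, the raw click-through rates order the documents largely by where they were displayed, while the IPS-weighted rates recover the ordering by relevance. In practice the propensities must themselves be estimated, which this sketch does not cover.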

Core claim

The paper claims that both counterfactual LTR and online LTR lead to unbiased learning to rank, but that they differ considerably in theoretical guarantees, empirical results, effects on the user experience during learning, and applicability. Practitioners deploying systems that learn from user interactions must weigh these factors, which makes the choice between the two methods consequential.

What carries the argument

The side-by-side contrast of counterfactual methods, which explicitly model and correct biases in logged data, versus online methods, which rely on randomization to neutralize bias during interactive learning.
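The randomization side of this contrast can be illustrated with an extreme case: showing every possible ordering equally often. This is an illustrative simplification and not a method taken from the tutorial; practical online LTR methods perturb rankings far less aggressively. The position-based click model, the relevance and examination values, and all names below are assumptions.

```python
from itertools import permutations

# Hypothetical values chosen for illustration only.
RELEVANCE = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7}   # assumed P(relevant)
EXAMINATION = [1.0, 0.5, 0.25, 0.125]                   # assumed P(examined | rank)


def expected_ctr(rankings):
    """Expected click rate per document when each given ranking is shown equally often."""
    totals = {doc: 0.0 for doc in RELEVANCE}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            # Position-based model: P(click) = P(examined | rank) * P(relevant)
            totals[doc] += EXAMINATION[rank] * RELEVANCE[doc]
    return {doc: total / len(rankings) for doc, total in totals.items()}


# One fixed ranking: exposure differs per document, so clicks are biased.
fixed = expected_ctr([("a", "c", "d", "b")])

# Uniform randomization over all orderings: every document receives the
# same average exposure, so click rates become proportional to relevance.
randomized = expected_ctr(list(permutations(RELEVANCE)))

print(sorted(fixed, key=fixed.get, reverse=True))
print(sorted(randomized, key=randomized.get, reverse=True))
```

Under the fixed ranking the expected click rates follow display position, whereas under uniform randomization they are proportional to relevance without any explicit bias model. The cost is that users are shown poor rankings while the system learns, which is the user-experience effect the contrast refers to.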

If this is right

  • Practitioners gain concrete criteria for selecting between historical correction and live randomization based on available data and tolerance for randomization.
  • Theoretical analysis can be used to predict when one method will provide stronger bias removal than the other.
  • Empirical benchmarks from prior work can be consulted to anticipate performance gaps in new ranking tasks.
  • System designers must account for different user-experience costs during the learning phase when choosing a method.
  • Applicability is limited by whether historical logs exist or live user traffic can be randomized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The contrast suggests that systems with strict latency or privacy constraints on live randomization may default to counterfactual approaches.
  • Hybrid methods could combine historical correction with selective online exploration to balance the strengths of each.
  • The tutorial's framing implies that bias types beyond position bias may require tailored extensions of one method over the other.
  • Deployment in production could benefit from monitoring metrics that the paper identifies as differing between the approaches.

Load-bearing premise

That differences in theoretical guarantees, empirical results, user experience effects, and applicability between the two methodologies can be reliably assessed and contrasted from the existing literature without new empirical validation.

What would settle it

A new controlled experiment directly comparing both methods on the same datasets and user populations that finds equivalent theoretical guarantees, empirical performance, user experience effects, and applicability would undermine the claimed substantial differences.

Original abstract

This tutorial covers and contrasts the two main methodologies in unbiased Learning to Rank (LTR): Counterfactual LTR and Online LTR. There has long been an interest in LTR from user interactions, however, this form of implicit feedback is very biased. In recent years, unbiased LTR methods have been introduced to remove the effect of different types of bias caused by user-behavior in search. For instance, a well addressed type of bias is position bias: the rank at which a document is displayed heavily affects the interactions it receives. Counterfactual LTR methods deal with such types of bias by learning from historical interactions while correcting for the effect of the explicitly modelled biases. Online LTR does not use an explicit user model, in contrast, it learns through an interactive process where randomized results are displayed to the user. Through randomization the effect of different types of bias can be removed from the learning process. Though both methodologies lead to unbiased LTR, their approaches differ considerably, furthermore, so do their theoretical guarantees, empirical results, effects on the user experience during learning, and applicability. Consequently, for practitioners the choice between the two is very substantial. By providing an overview of both approaches and contrasting them, we aim to provide an essential guide to unbiased LTR so as to aid in understanding and choosing between methodologies.
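The position-bias correction mentioned in the abstract can be made concrete with a short worked equation. This is a sketch under an assumed position-based click model, which the abstract does not specify; the symbols \theta_k (probability that rank k is examined) and \gamma_d (probability that document d is relevant) are introduced here for illustration.

```latex
% Assumed position-based click model: a click on document d at rank k
% requires that the rank is examined and that the document is relevant.
P(c = 1 \mid d, k) = \theta_k \, \gamma_d

% The raw click rate confounds relevance with display position:
\mathbb{E}[c \mid d, k] = \theta_k \, \gamma_d \neq \gamma_d
  \quad \text{whenever } \theta_k < 1 \text{ and } \gamma_d > 0

% Weighting each click by the inverse examination propensity
% removes the dependence on rank:
\mathbb{E}\!\left[\frac{c}{\theta_k} \,\middle|\, d, k\right]
  = \frac{\theta_k \, \gamma_d}{\theta_k} = \gamma_d
  \quad \text{provided } \theta_k > 0
```

The condition \theta_k > 0 matters: a document shown only at ranks that are never examined cannot be corrected for, because no weight can recover clicks that never had a chance to occur.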

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. This tutorial provides an overview and contrast of the two main methodologies for unbiased learning to rank (LTR) from implicit user feedback: counterfactual LTR, which corrects for biases such as position bias using historical interaction data and explicit user models, and online LTR, which removes bias effects through interactive randomization of results without an explicit user model. The central claim is that both approaches produce unbiased LTR but differ substantially in theoretical guarantees, empirical results, effects on user experience during learning, and applicability, making the choice between them consequential for practitioners.

Significance. If the synthesis of the existing literature is accurate, the tutorial could be a useful guide for the IR community by clarifying trade-offs between established counterfactual and online methods for unbiased LTR. It explicitly positions itself as an aid for understanding and choosing methodologies rather than advancing new derivations or experiments, which aligns with the scope of a tutorial.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the tutorial and the recommendation to accept. The review accurately captures the scope and intent of the work as a synthesis and contrast of counterfactual and online approaches to unbiased LTR.

Circularity Check

0 steps flagged

Tutorial overview with no derivations or self-referential claims

The paper is a tutorial that synthesizes and contrasts two established methodologies (counterfactual LTR and online LTR) from prior literature. It asserts no novel theorems, equations, derivations, fitted parameters, or predictions; its central claim is an overview of known differences in guarantees, results, user experience, and applicability. The claim rests on external literature, so there is no opportunity for circularity by construction, load-bearing self-citation, or a smuggled-in ansatz. No steps were flagged.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Tutorial paper with no central mathematical or empirical claim; no free parameters, axioms, or invented entities are introduced.
