To Model or to Intervene: A Comparison of Counterfactual and Online Learning to Rank from User Interactions
Pith reviewed 2026-05-24 21:35 UTC · model grok-4.3
The pith
Counterfactual ranking methods outperform online ones only when bias and noise are low and can degrade the user experience otherwise, whereas online methods stay robust provided the displayed rankings can be controlled.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In settings with little bias or noise counterfactual methods can obtain the highest ranking performance; however, in other circumstances their optimization can be detrimental to the user experience. Conversely, online methods are very robust to bias and noise but require control over the displayed rankings.
What carries the argument
Direct simulation-based benchmarking of counterfactual (model-based bias correction) versus online (intervention-based) learning-to-rank approaches under controlled levels of selection bias, position bias, and interaction noise.
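To make the three controlled factors concrete, here is a minimal sketch of a click simulator. It is not the paper's simulator, whose click model and parameter values are not given on this page. It assumes a position-based model in which the examination probability at rank k is (1/k)^eta; the names `simulate_clicks`, `eta`, `noise`, and `cutoff` are hypothetical.

```python
import random


def simulate_clicks(relevances, eta=1.0, noise=0.1, cutoff=None, rng=None):
    """Simulate clicks on one displayed ranking (illustrative assumption only).

    relevances: binary relevance labels (0 or 1) in displayed order.
    eta: position-bias severity; rank k is examined with probability (1/k)**eta.
    noise: probability that an examined non-relevant document is clicked.
    cutoff: if set, documents below this rank are never shown (selection bias).
    """
    rng = rng or random.Random()
    clicks = []
    for rank, rel in enumerate(relevances, start=1):
        if cutoff is not None and rank > cutoff:
            # Selection bias: the document is not displayed, so it cannot be clicked.
            clicks.append(0)
            continue
        # Position bias: lower ranks are examined less often.
        examined = rng.random() < (1.0 / rank) ** eta
        # Interaction noise: examined non-relevant documents are sometimes clicked.
        p_click = 1.0 if rel else noise
        clicks.append(int(examined and rng.random() < p_click))
    return clicks
```

With `eta=0.0` and `noise=0.0` the clicks reproduce the relevance labels exactly; raising either parameter, or setting `cutoff`, moves the simulated feedback away from the true relevance signal.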
If this is right
- When bias or noise is low, explicit user-behavior models produce better rankings than intervention-based methods.
- High position bias or interaction noise makes counterfactual optimization reduce user experience.
- Online methods maintain stable performance across bias levels provided the system can alter displayed rankings.
- Practitioners must measure bias levels before selecting a methodology.
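The dependence of model-based correction on an accurate bias model can be illustrated with inverse propensity scoring (IPS), a common counterfactual technique. The sketch below is an illustration under assumptions, not the method evaluated in the paper: it reuses a hypothetical examination model (1/k)^eta, and the function names and the toy log are invented for the example.

```python
def examination_propensity(rank, eta):
    """Assumed probability that a user examines the document at this rank."""
    return (1.0 / rank) ** eta


def ips_relevance_estimates(click_log, assumed_eta):
    """Average IPS-weighted clicks per document.

    click_log: iterable of (doc_id, rank, clicked) tuples from historical data.
    assumed_eta: the position-bias severity the estimator believes in.
    """
    totals, counts = {}, {}
    for doc_id, rank, clicked in click_log:
        # Each click is up-weighted by the inverse of its assumed propensity.
        weight = 1.0 / examination_propensity(rank, assumed_eta)
        totals[doc_id] = totals.get(doc_id, 0.0) + clicked * weight
        counts[doc_id] = counts.get(doc_id, 0) + 1
    return {doc_id: totals[doc_id] / counts[doc_id] for doc_id in totals}


# Toy log (hypothetical): "a" is shown at rank 1, "b" at rank 2, four times each.
example_log = [
    ("a", 1, 1), ("a", 1, 1), ("a", 1, 0), ("a", 1, 0),
    ("b", 2, 1), ("b", 2, 1), ("b", 2, 0), ("b", 2, 0),
]

naive = ips_relevance_estimates(example_log, assumed_eta=0.0)      # no correction
corrected = ips_relevance_estimates(example_log, assumed_eta=1.0)  # bias model eta=1
overshoot = ips_relevance_estimates(example_log, assumed_eta=2.0)  # too-strong model
```

On this toy log the uncorrected estimate scores both documents at 0.5, the `eta=1.0` correction raises "b" to 1.0, and the `eta=2.0` correction raises it to 2.0. If the true bias were `eta=1.0`, the last estimate would overstate the relevance of "b", which is one way a misjudged bias level can mislead a counterfactual learner.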
Where Pith is reading between the lines
- Production systems could monitor bias in real time and switch between the two approaches dynamically.
- The robustness finding may apply to other interactive machine-learning settings that face position or selection bias.
- Long-term effects on user trust or diversity of results are not captured by the short-term simulation metrics.
Load-bearing premise
The simulated experimental conditions accurately reflect the bias and noise distributions that occur in real deployed ranking systems.
What would settle it
A live A/B test on a production search engine that measures actual selection bias, position bias, and noise levels, then compares final ranking quality and user metrics for the two methodologies.
Original abstract
Learning to Rank (LTR) from user interactions is challenging as user feedback often contains high levels of bias and noise. At the moment, two methodologies for dealing with bias prevail in the field of LTR: counterfactual methods that learn from historical data and model user behavior to deal with biases; and online methods that perform interventions to deal with bias but use no explicit user models. For practitioners the decision between either methodology is very important because of its direct impact on end users. Nevertheless, there has never been a direct comparison between these two approaches to unbiased LTR. In this study we provide the first benchmarking of both counterfactual and online LTR methods under different experimental conditions. Our results show that the choice between the methodologies is consequential and depends on the presence of selection bias, and the degree of position bias and interaction noise. In settings with little bias or noise counterfactual methods can obtain the highest ranking performance; however, in other circumstances their optimization can be detrimental to the user experience. Conversely, online methods are very robust to bias and noise but require control over the displayed rankings. Our findings confirm and contradict existing expectations on the impact of model-based and intervention-based methods in LTR, and allow practitioners to make an informed decision between the two methodologies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first direct comparison of counterfactual and online learning-to-rank (LTR) methods that learn from user interactions. Through simulation experiments that vary selection bias, position bias, and interaction noise, it concludes that counterfactual methods achieve the highest ranking performance in low-bias, low-noise settings but can be detrimental otherwise, while online methods are robust but require control over the displayed rankings. The choice between the two methodologies is therefore consequential and depends on the bias and noise conditions.
Significance. This benchmarking study is significant for the LTR field as it provides empirical evidence on when to prefer model-based vs. intervention-based debiasing approaches. The controlled simulations allow isolation of bias effects, offering actionable insights for practitioners. Strengths include the systematic variation of conditions and the finding that online methods are robust, which challenges some prior expectations.
major comments (2)
- §4 (Experimental Setup): The click simulation, logging policy, and relevance sampling procedures used to generate the experimental conditions are not compared to or validated against real-world ranking logs. This is load-bearing for the central claim because the performance rankings between counterfactual and online methods depend directly on the fidelity of these generative models to actual bias distributions.
- §5 (Results): The conclusion that counterfactual optimization can be detrimental in high-bias settings rests on the specific parameter ranges chosen for position bias and noise; no sensitivity analysis is reported for alternative generative processes that might better match deployed systems.
minor comments (2)
- Abstract: The claim of providing the 'first benchmarking' would benefit from a brief discussion of related comparison studies in the introduction to strengthen the novelty statement.
- Notation: Some notation for bias parameters could be clarified with a table summarizing the simulation parameters.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below.
- Referee: §4 (Experimental Setup): The click simulation, logging policy, and relevance sampling procedures used to generate the experimental conditions are not compared to or validated against real-world ranking logs. This is load-bearing for the central claim because the performance rankings between counterfactual and online methods depend directly on the fidelity of these generative models to actual bias distributions.
  Authors: The study design uses controlled simulations specifically to isolate the independent effects of selection bias, position bias, and noise, which cannot be disentangled in real-world logs. Parameter ranges are selected based on values reported across prior LTR literature. We will revise Section 4 to add an explicit discussion relating the generative models to empirical bias observations from deployed systems.
  Revision: partial
- Referee: §5 (Results): The conclusion that counterfactual optimization can be detrimental in high-bias settings rests on the specific parameter ranges chosen for position bias and noise; no sensitivity analysis is reported for alternative generative processes that might better match deployed systems.
  Authors: The reported experiments already vary bias and noise across a spectrum of levels to demonstrate transition points in performance. We agree that additional checks on alternative generative processes would strengthen the results. We will add a sensitivity analysis subsection varying the logging policy and relevance sampling distributions.
  Revision: yes
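A sensitivity analysis of the kind discussed above amounts to sweeping the experimental conditions over a grid. The following is a minimal sketch of how such a grid could be enumerated; the parameter names and values are hypothetical placeholders, not the ranges used in the paper.

```python
from itertools import product


def condition_grid(etas, noise_levels, cutoffs):
    """Enumerate every combination of position bias, noise, and display cutoff."""
    return [
        {"eta": eta, "noise": noise, "cutoff": cutoff}
        for eta, noise, cutoff in product(etas, noise_levels, cutoffs)
    ]


# Hypothetical values chosen only for illustration.
grid = condition_grid(
    etas=[0.0, 1.0, 2.0],          # position-bias severity
    noise_levels=[0.0, 0.1, 0.3],  # click probability on non-relevant documents
    cutoffs=[None, 10],            # None means every document is displayed
)
```

Each dictionary in `grid` describes one experimental condition under which both methodologies would be trained and compared; widening the lists extends the analysis to further settings.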
Circularity Check
No circularity; empirical simulation study with independent experimental results.
The paper is a benchmarking study that compares counterfactual and online LTR methods via controlled simulations varying selection bias, position bias, and interaction noise. Central claims rest on measured ranking performance differences across experimental conditions, not on any derivation, prediction, or parameter fit that reduces to the paper's own inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain. The simulation setup is explicitly generative and falsifiable against external data, satisfying the criteria for non-circular empirical work.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Our results show that the choice between the methodologies is consequential and depends on the presence of selection bias, and the degree of position bias and interaction noise."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Counterfactual methods that learn from historical data and model user behavior to deal with biases; and online methods that perform interventions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.