pith. sign in

arxiv: 1907.06412 · v1 · pith:674ONY5Enew · submitted 2019-07-15 · 💻 cs.IR

To Model or to Intervene: A Comparison of Counterfactual and Online Learning to Rank from User Interactions

Pith reviewed 2026-05-24 21:35 UTC · model grok-4.3

classification 💻 cs.IR
keywords learning to rankcounterfactual learningonline learninguser interactionsposition biasselection biasranking evaluation
0
0 comments X

The pith

Counterfactual ranking methods outperform online ones only when bias and noise are low, but can harm users otherwise while online methods stay robust if rankings can be controlled.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two ways of learning to rank from user clicks: counterfactual methods that build explicit models of user behavior to correct for biases in past data, and online methods that intervene by showing different rankings to users. It runs simulations that vary the amounts of selection bias, position bias, and interaction noise, then measures ranking quality and user experience for each approach. A sympathetic reader would care because the choice affects what users actually see in live systems. The results indicate that counterfactual methods reach the best performance when bias and noise are minimal, yet their corrections become harmful when bias grows, whereas online methods handle bias and noise reliably but need the ability to change the displayed rankings.

Core claim

In settings with little bias or noise counterfactual methods can obtain the highest ranking performance; however, in other circumstances their optimization can be detrimental to the user experience. Conversely, online methods are very robust to bias and noise but require control over the displayed rankings.

What carries the argument

Direct simulation-based benchmarking of counterfactual (model-based bias correction) versus online (intervention-based) learning-to-rank approaches under controlled levels of selection bias, position bias, and interaction noise.

If this is right

  • When bias or noise is low, explicit user-behavior models produce better rankings than intervention-based methods.
  • High position bias or interaction noise makes counterfactual optimization reduce user experience.
  • Online methods maintain stable performance across bias levels provided the system can alter displayed rankings.
  • Practitioners must measure bias levels before selecting a methodology.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production systems could monitor bias in real time and switch between the two approaches dynamically.
  • The robustness finding may apply to other interactive machine-learning settings that face position or selection bias.
  • Long-term effects on user trust or diversity of results are not captured by the short-term simulation metrics.

Load-bearing premise

The simulated experimental conditions accurately reflect the bias and noise distributions that occur in real deployed ranking systems.

What would settle it

A live A/B test on a production search engine that measures actual selection bias, position bias, and noise levels, then compares final ranking quality and user metrics for the two methodologies.

Figures

Figures reproduced from arXiv: 1907.06412 by Harrie Oosterhuis, Maarten de Rijke, Rolf Jagerman.

Figure 1
Figure 1. Figure 1: Performance of online and counterfactual methods under perfect, binarized, and near-random user models. In the [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance of online and counterfactual methods under very strong position bias ( [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Display performance during training, indicating user experience. In the top row no selection bias is present; in the [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Performance of counterfactual methods with a deployment every 200,000 sessions. In the top row no selection bias [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Learning to Rank (LTR) from user interactions is challenging as user feedback often contains high levels of bias and noise. At the moment, two methodologies for dealing with bias prevail in the field of LTR: counterfactual methods that learn from historical data and model user behavior to deal with biases; and online methods that perform interventions to deal with bias but use no explicit user models. For practitioners the decision between either methodology is very important because of its direct impact on end users. Nevertheless, there has never been a direct comparison between these two approaches to unbiased LTR. In this study we provide the first benchmarking of both counterfactual and online LTR methods under different experimental conditions. Our results show that the choice between the methodologies is consequential and depends on the presence of selection bias, and the degree of position bias and interaction noise. In settings with little bias or noise counterfactual methods can obtain the highest ranking performance; however, in other circumstances their optimization can be detrimental to the user experience. Conversely, online methods are very robust to bias and noise but require control over the displayed rankings. Our findings confirm and contradict existing expectations on the impact of model-based and intervention-based methods in LTR, and allow practitioners to make an informed decision between the two methodologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first direct comparison of counterfactual and online learning-to-rank (LTR) methods using user interaction data. Through simulation experiments varying levels of selection bias, position bias, and interaction noise, it concludes that counterfactual methods outperform in low-bias/noise settings but can be detrimental otherwise, while online methods are robust but require control over displayed rankings. The choice between the two methodologies is consequential and depends on the bias and noise conditions.

Significance. This benchmarking study is significant for the LTR field as it provides empirical evidence on when to prefer model-based vs. intervention-based debiasing approaches. The controlled simulations allow isolation of bias effects, offering actionable insights for practitioners. Strengths include the systematic variation of conditions and the finding that online methods are robust, which challenges some prior expectations.

major comments (2)
  1. [§4 (Experimental Setup)] §4 (Experimental Setup): The click simulation, logging policy, and relevance sampling procedures used to generate the experimental conditions are not compared to or validated against real-world ranking logs. This is load-bearing for the central claim because the performance rankings between counterfactual and online methods depend directly on the fidelity of these generative models to actual bias distributions.
  2. [§5 (Results)] §5 (Results): The conclusion that counterfactual optimization can be detrimental in high-bias settings rests on the specific parameter ranges chosen for position bias and noise; no sensitivity analysis is reported for alternative generative processes that might better match deployed systems.
minor comments (2)
  1. [Abstract] Abstract: The claim of providing the 'first benchmarking' would benefit from a brief discussion of related comparison studies in the introduction to strengthen the novelty statement.
  2. [Notation] Notation: Some notation for bias parameters could be clarified with a table summarizing the simulation parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below.

read point-by-point responses
  1. Referee: [§4 (Experimental Setup)] §4 (Experimental Setup): The click simulation, logging policy, and relevance sampling procedures used to generate the experimental conditions are not compared to or validated against real-world ranking logs. This is load-bearing for the central claim because the performance rankings between counterfactual and online methods depend directly on the fidelity of these generative models to actual bias distributions.

    Authors: The study design uses controlled simulations specifically to isolate the independent effects of selection bias, position bias, and noise, which cannot be disentangled in real-world logs. Parameter ranges are selected based on values reported across prior LTR literature. We will revise Section 4 to add an explicit discussion relating the generative models to empirical bias observations from deployed systems. revision: partial

  2. Referee: [§5 (Results)] §5 (Results): The conclusion that counterfactual optimization can be detrimental in high-bias settings rests on the specific parameter ranges chosen for position bias and noise; no sensitivity analysis is reported for alternative generative processes that might better match deployed systems.

    Authors: The reported experiments already vary bias and noise across a spectrum of levels to demonstrate transition points in performance. We agree that additional checks on alternative generative processes would strengthen the results. We will add a sensitivity analysis subsection varying the logging policy and relevance sampling distributions. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical simulation study with independent experimental results.

full rationale

The paper is a benchmarking study that compares counterfactual and online LTR methods via controlled simulations varying selection bias, position bias, and interaction noise. Central claims rest on measured ranking performance differences across experimental conditions, not on any derivation, prediction, or parameter fit that reduces to the paper's own inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain. The simulation setup is explicitly generative and falsifiable against external data, satisfying the criteria for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unexamined assumption that the simulation faithfully reproduces real-world bias distributions and that standard IR metrics capture user experience.

pith-pipeline@v0.9.0 · 5759 in / 1075 out tokens · 20085 ms · 2026-05-24T21:35:48.736428+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

  1. [1]

    Aman Agarwal, Ivan Zaitsev, and Thorsten Joachims. 2018. Counterfactual Learning-to-Rank for Additive Metrics and Deep Models. arXiv preprint arXiv:1805.00065 (2018)

  2. [2]

    Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W Bruce Croft. 2018. Un- biased Learning to Rank with Unbiased Propensity Estimation. In SIGIR. ACM, 385–394

  3. [3]

    Qingyao Ai, Jiaxin Mao, Yiqun Liu, and W Bruce Croft. 2018. Unbiased Learning to Rank: Theory and Practice. In CIKM. ACM, 2305–2306

  4. [4]

    Michael Bendersky, Xuanhui Wang, Donald Metzler, and Marc Najork. 2017. Learning from User Interactions in Personal Search via Attribute Parameteriza- tion. In WSDM. ACM, 791–799

  5. [5]

    Olivier Chapelle and Yi Chang. 2011. Yahoo! Learning to Rank Challenge Overview. In Proceedings of the Learning to Rank Challenge . 1–24

  6. [6]

    Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. 2015. Click Models for Web Search. Morgan & Claypool Publishers

  7. [7]

    Artem Grotov and Maarten de Rijke. 2016. Online Learning to Rank for Informa- tion Retrieval: SIGIR 2016 Tutorial. In SIGIR. ACM, 1215–1218

  8. [8]

    Katja Hofmann, Anne Schuth, Shimon Whiteson, and Maarten de Rijke. 2013. Reusing Historical Interaction Data for Faster Online Learning to Rank for IR. In WSDM. ACM, 183–192

  9. [9]

    Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2011. A Probabilistic Method for Inferring Preferences from Clicks. In CIKM. ACM, 249–258

  10. [10]

    Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446

  11. [11]

    Thorsten Joachims. 2002. Optimizing Search Engines using Clickthrough Data. In SIGKDD. ACM, 133–142

  12. [12]

    Thorsten Joachims. 2003. Evaluating Retrieval Performance using Clickthrough Data. In Text Mining. Physica/Springer

  13. [13]

    Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay

  14. [14]

    In SIGIR

    Accurately Interpreting Clickthrough Data as Implicit Feedback. In SIGIR. ACM, 154–161

  15. [15]

    Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. 2007. Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search. ACM Transactions on Information Systems (TOIS) 25, 2 (2007), 7

  16. [16]

    Thorsten Joachims and Adith Swaminathan. 2016. Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement. In SIGIR. ACM, New York, NY, USA, 1199–1201

  17. [17]

    Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In WSDM. ACM, 781–789

  18. [18]

    Branislav Kveton, Chang Li, Tor Lattimore, Ilya Markov, Maarten de Rijke, Csaba Szepesvari, and Masrour Zoghi. 2018. BubbleRank: Safe Online Learning to Rerank. arXiv preprint arXiv:1806.05819 (2018)

  19. [19]

    Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. 2011. Unbiased Offline Evaluation of Contextual-Bandit-Based News Article Recommendation Algo- rithms. In WSDM. ACM, 297–306

  20. [20]

    Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval 3, 3 (2009), 225–331

  21. [21]

    Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong, and Hang Li. 2007. Letor: Bench- mark Dataset for Research on Learning to Rank for Information Retrieval. In Proceedings of SIGIR 2007 workshop on learning to rank for information retrieval , Vol. 310. ACM Amsterdam, The Netherlands

  22. [22]

    Harrie Oosterhuis. 2018. Learning to Rank and Evaluation in the Online Setting. 12th Russian Summer School in Information Retrieval (RuSSIR 2018). (2018)

  23. [23]

    Harrie Oosterhuis and Maarten de Rijke. 2017. Balancing Speed and Quality in Online Learning to Rank for Information Retrieval. In CIKM. ACM, 277–286

  24. [24]

    Harrie Oosterhuis and Maarten de Rijke. 2017. Sensitive and Scalable Online Evaluation with Theoretical Guarantees. In CIKM. ACM, 77–86

  25. [25]

    Harrie Oosterhuis and Maarten de Rijke. 2018. Differentiable Unbiased Online Learning to Rank. In CIKM. ACM, 1293–1302

  26. [26]

    Harrie Oosterhuis and Maarten de Rijke. 2019. Optimizing Ranking Models in the Online Setting. In ECIR. Springer, 382–396

  27. [27]

    Harrie Oosterhuis, Anne Schuth, and Maarten de Rijke. 2016. Probabilistic Multileave Gradient Descent. In ECIR. Springer, 661–668

  28. [28]

    Filip Radlinski and Nick Craswell. 2013. Optimized Interleaving for Online Retrieval Evaluation. In WSDM. ACM, 245–254

  29. [29]

    Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. 2016. Multileave Gradient Descent for Fast Online Learning to Rank. In WSDM. ACM, 457–466

  30. [30]

    Anne Schuth, Floor Sietsma, Shimon Whiteson, Damien Lefortier, and Maarten de Rijke. 2014. Multileaved Comparisons for Fast Online Evaluation. In CIKM. ACM, 71–80

  31. [31]

    Burr Settles. 2012. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6, 1 (2012), 1–114

  32. [32]

    Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual Risk Mini- mization: Learning from Logged Bandit Feedback. In ICML. PMLR, 814–823

  33. [33]

    Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. 2017. Off-policy Evaluation for Slate Recommendation. In NIPS. 3632–3642

  34. [34]

    Niek Tax, Sander Bockting, and Djoerd Hiemstra. 2015. A Cross-benchmark Com- parison of 87 Learning to Rank Methods. Information Processing & Management 51, 6 (2015), 757–772

  35. [35]

    Philip S Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. 2015. High-Confidence Off-Policy Evaluation. In AAAI. 3000–3006

  36. [36]

    Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to Rank with Selection Bias in Personal Search. In SIGIR. ACM, 115– 124

  37. [37]

    Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. 2018. Position Bias Estimation for Unbiased Learning to Rank in Personal Search. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. ACM, 610–618

  38. [38]

    Yisong Yue and Thorsten Joachims. 2009. Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem. In ICML. ACM, 1201–1208

  39. [39]

    Tong Zhao and Irwin King. 2016. Constructing Reliable Gradient Exploration for Online Learning to Rank. In CIKM. ACM, 1643–1652