Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation

Olivier Jeunen; Shashank Gupta

arxiv: 2602.14914 · v2 · submitted 2026-02-16 · 💻 cs.LG · cs.IR

Pith reviewed 2026-05-15 21:30 UTC · model grok-4.3

keywords: off-policy evaluation, inverse propensity scoring, additive control variates, SNIPS, variance reduction, recommendation systems, ranking systems

The pith

An optimal additive baseline estimator asymptotically dominates self-normalized IPS in off-policy evaluation mean squared error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proves that an estimator using the optimal additive baseline correction, called β∗-IPS, has lower asymptotic mean squared error than the standard self-normalized inverse propensity scoring estimator. It reaches this conclusion by decomposing the variance difference between the two and showing that SNIPS matches the behavior of a particular but suboptimal additive baseline. Readers working on ranking and recommendation systems should care because off-policy evaluation lets teams test new policies without running live experiments, and lower error means more trustworthy comparisons. If the result holds, practitioners have a clear theoretical reason to move from multiplicative self-normalization to additive baseline corrections.

Core claim

We prove that β∗-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific but generally sub-optimal additive baseline. This holds under standard regularity conditions and supplies a theoretical justification for preferring optimal additive control variates over self-normalization in off-policy evaluation for ranking and recommendation.
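
A short worked derivation may make the claimed equivalence concrete. The notation is an assumption of this sketch and not necessarily the paper's: $w$ is the importance weight (target over logging propensity, so $\mathbb{E}[w]=1$), $r$ is the reward, and $V=\mathbb{E}[wr]$ is the value of the target policy. The SNIPS line uses the standard delta-method linearisation of a ratio of sample means.

```latex
\begin{align*}
\hat V_{\beta}
  &= \frac{1}{n}\sum_{i=1}^{n}\bigl[w_i r_i - \beta\,(w_i - 1)\bigr],
  \qquad \mathbb{E}[\hat V_{\beta}] = V \ \text{for any fixed } \beta, \\
n\,\mathrm{Var}(\hat V_{\beta})
  &= \mathrm{Var}(wr) - 2\beta\,\mathrm{Cov}(wr, w) + \beta^{2}\,\mathrm{Var}(w), \\
\beta^\star
  &= \arg\min_{\beta}\ \mathrm{Var}(\hat V_{\beta})
   = \frac{\mathrm{Cov}(wr, w)}{\mathrm{Var}(w)}, \\
\hat V_{\mathrm{SNIPS}}
  &= \frac{\sum_i w_i r_i}{\sum_i w_i}
   = \frac{1}{n}\sum_{i=1}^{n}\bigl[w_i r_i - V\,(w_i - 1)\bigr] + O_p(n^{-1})
   = \hat V_{\beta = V} + O_p(n^{-1}), \\
n\,\bigl[\mathrm{Var}(\hat V_{\beta}) - \mathrm{Var}(\hat V_{\beta^\star})\bigr]
  &= \mathrm{Var}(w)\,(\beta - \beta^\star)^{2} \ \ge\ 0 .
\end{align*}
```

Under these assumptions SNIPS behaves to first order like the additive baseline $\beta = V$, so its asymptotic variance exceeds that of $\beta^\star$-IPS by $\mathrm{Var}(w)\,(V-\beta^\star)^2/n$, which vanishes only when $V = \beta^\star$.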

What carries the argument

The β∗-IPS estimator, which applies an optimal additive baseline to importance-weighted outcomes. The paper shows that it attains lower asymptotic variance than the multiplicative normalization used in SNIPS.
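
The estimators discussed here can be written down in a few lines. The sketch below is illustrative only: the function names are hypothetical, `w` holds importance weights (target over logging propensity), `r` holds rewards, and `beta_star_hat` is a plug-in estimate of Cov(wr, w)/Var(w) computed from the same sample, which is an assumption of this sketch and can introduce a small finite-sample bias.

```python
from statistics import fmean


def ips(w, r):
    """Plain inverse propensity scoring: mean of w_i * r_i."""
    return fmean([wi * ri for wi, ri in zip(w, r)])


def snips(w, r):
    """Self-normalised IPS: weighted reward sum divided by the weight sum."""
    return sum(wi * ri for wi, ri in zip(w, r)) / sum(w)


def beta_ips(w, r, beta):
    """IPS with an additive baseline: mean of w_i * r_i - beta * (w_i - 1)."""
    return fmean([wi * ri - beta * (wi - 1.0) for wi, ri in zip(w, r)])


def beta_star_hat(w, r):
    """Plug-in estimate of Cov(wr, w) / Var(w) from the sample itself."""
    wr = [wi * ri for wi, ri in zip(w, r)]
    mean_w, mean_wr = fmean(w), fmean(wr)
    cov = fmean([(a - mean_wr) * (b - mean_w) for a, b in zip(wr, w)])
    var = fmean([(b - mean_w) ** 2 for b in w])
    if var == 0.0:
        raise ValueError("all weights are equal, so the baseline is undefined")
    return cov / var


if __name__ == "__main__":
    w = [0.5, 1.5, 1.0, 1.0]
    r = [1.0, 0.0, 1.0, 1.0]
    b = beta_star_hat(w, r)
    print(ips(w, r), snips(w, r), b, beta_ips(w, r, b))
```

Setting `beta=0` in `beta_ips` recovers plain IPS, which is a convenient sanity check when wiring this into an evaluation pipeline.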

Load-bearing premise

Standard regularity conditions hold, including finite variances, bounded propensities, and the existence of the optimal baseline.

What would settle it

Compute the exact optimal baseline on a simulated or real dataset with known propensities, then compare the empirical mean squared error of β∗-IPS versus SNIPS as sample size increases; the gap should converge to the positive value predicted by the variance decomposition.
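
One way to run this check is sketched below. Everything in it is an illustrative assumption rather than the paper's experimental setup: a context-free two-action bandit with made-up logging and target propensities (`P_LOG`, `P_TGT`) and Bernoulli reward means (`R_MEAN`). Because the distributions are known, β∗ = Cov(wr, w)/Var(w) is computed in closed form, and the predicted gap Var(w)(V − β∗)²/n comes from the standard first-order linearisation of SNIPS.

```python
import random

# Hypothetical toy problem: two actions, no context.
P_LOG = [0.8, 0.2]   # logging policy propensities
P_TGT = [0.2, 0.8]   # target policy propensities
R_MEAN = [0.9, 0.1]  # Bernoulli reward mean per action


def exact_quantities():
    """Closed-form V, beta*, and Var(w) under the logging policy."""
    value = sum(pt * m for pt, m in zip(P_TGT, R_MEAN))
    # E[w^2 r] = sum_a p_log(a) * (p_tgt(a) / p_log(a))^2 * E[r | a]
    e_w2r = sum(pt * pt / pl * m for pt, pl, m in zip(P_TGT, P_LOG, R_MEAN))
    var_w = sum(pt * pt / pl for pt, pl in zip(P_TGT, P_LOG)) - 1.0
    beta_star = (e_w2r - value) / var_w  # Cov(wr, w) / Var(w), using E[w] = 1
    return value, beta_star, var_w


def draw(n, rng):
    """Sample n logged interactions as (weights, rewards)."""
    ws, rs = [], []
    for _ in range(n):
        a = 0 if rng.random() < P_LOG[0] else 1
        ws.append(P_TGT[a] / P_LOG[a])
        rs.append(1.0 if rng.random() < R_MEAN[a] else 0.0)
    return ws, rs


def run(n=500, reps=2000, seed=0):
    """Return (MSE of beta*-IPS, MSE of SNIPS, predicted asymptotic gap)."""
    rng = random.Random(seed)
    value, beta_star, var_w = exact_quantities()
    se_beta = se_snips = 0.0
    for _ in range(reps):
        w, r = draw(n, rng)
        s_wr = sum(wi * ri for wi, ri in zip(w, r))
        s_w = sum(w)
        est_beta = (s_wr - beta_star * (s_w - n)) / n
        est_snips = s_wr / s_w
        se_beta += (est_beta - value) ** 2
        se_snips += (est_snips - value) ** 2
    predicted_gap = var_w * (value - beta_star) ** 2 / n
    return se_beta / reps, se_snips / reps, predicted_gap


if __name__ == "__main__":
    for n in (100, 500, 2000):
        mse_b, mse_s, gap = run(n=n, reps=1000)
        print(n, mse_b, mse_s, mse_s - mse_b, gap)
```

If the result holds in this toy setting, the empirical difference `mse_s - mse_b` should approach the predicted gap as `n` grows. The oracle β∗ is used on purpose here; replacing it with a sample estimate would test a different, plug-in estimator.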

Original abstract

Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer: we prove that $\beta^\star$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proves that optimal additive baselines asymptotically dominate SNIPS in OPE mean squared error by showing that SNIPS is asymptotically equivalent to a suboptimal additive baseline, but the result needs finite second moments and bounded propensities.

The main takeaway is that this paper proves β*-IPS with an optimal additive baseline asymptotically beats SNIPS in mean squared error for off-policy evaluation. They reach this by decomposing the variance gap and showing SNIPS matches IPS plus a fixed additive correction whose coefficient is not the variance-minimizing one. That equivalence is the clean new piece. Prior work had additive baselines and SNIPS separately, but the direct dominance proof and the specific suboptimal equivalence were not there. The result gives a theoretical reason to prefer additive corrections over multiplicative self-normalization when evaluating ranking and recommendation policies. The motivation section ties it to recent off-policy learning results, and the analytic steps look straightforward once the regularity conditions are granted. That part is useful and worth having on record.

The soft spot is the set of assumptions required for the decomposition to go through. The dominance holds only when the weighted reward has finite second moment and propensities are bounded away from zero so that β* exists and is finite. The abstract invokes standard regularity conditions, but if importance weights can be unbounded, as they often are in ranking data with long tails, the gap expression is undefined and the claim does not follow. It is not clear from the abstract whether the theorem states these conditions explicitly or whether the experiments check them on the actual distributions. That is a moderate rather than fatal issue, but it needs to be tightened.

This paper is for people working on theoretical OPE or on reliable evaluation in recsys and RL. A reader who wants a precise variance comparison between additive and multiplicative control variates will get something concrete from it. The thinking is clear and the claim is falsifiable, so it deserves a serious referee. I would send it to review with a request to spell out the exact conditions in the theorem and to verify they hold in the reported experiments.

Referee Report

1 major / 1 minor

Summary. The paper proves that β*-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in mean squared error for off-policy evaluation. It does so via an analytic variance decomposition showing that SNIPS is asymptotically equivalent to IPS plus a specific but generally suboptimal additive baseline. The result is positioned as theoretical justification for preferring additive control variates over self-normalization in ranking and recommendation systems under standard regularity conditions.

Significance. If the central claim holds, the work provides a clear theoretical basis for shifting from SNIPS to optimal additive baselines in OPE, with the variance-gap decomposition offering insight that goes beyond empirical comparisons. This could influence variance-reduction practice in recommendation and ranking applications where low-variance estimators are critical.

major comments (1)

[Theorem statement and proof of asymptotic dominance] The main theorem (and its proof) invokes 'standard regularity conditions' (finite second moments of (r·w), bounded propensities so that β* = Cov(rw, w)/Var(w) is well-defined and finite) but does not exhibit them explicitly in the theorem statement or proof. The dominance result is load-bearing on these conditions; without them the gap expression is undefined. The experiments also do not verify the conditions on the empirical distributions used.

minor comments (1)

Notation for the optimal baseline β* and the specific suboptimal baseline implicit in SNIPS could be introduced earlier and used consistently to improve readability of the decomposition.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the regularity conditions. We address it directly below and will revise the manuscript to improve clarity.

Referee: The main theorem (and its proof) invokes 'standard regularity conditions' (finite second moments of (r·w), bounded propensities so that β* = Cov(rw, w)/Var(w) is well-defined and finite) but does not exhibit them explicitly in the theorem statement or proof. The dominance result is load-bearing on these conditions; without them the gap expression is undefined. The experiments also do not verify the conditions on the empirical distributions used.

Authors: We agree that the conditions should be stated explicitly. In the revised manuscript we will update the main theorem statement to read: 'Under the regularity conditions that E[(r w)^2] < ∞ and the propensities are bounded away from zero (ensuring β* = Cov(r w, w)/Var(w) is well-defined and finite), the following holds...' The proof will be revised to invoke these assumptions at the first step where they are used. For the experiments, we will add a short paragraph reporting empirical checks (sample second moments of r w and minimum propensity values) on the synthetic and real-world datasets to confirm the conditions hold in practice.

revision: yes
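
The empirical checks mentioned in this response could look roughly like the sketch below. The function name, argument names, and returned keys are hypothetical, not taken from the paper. Note that sample moments are always finite on a finite dataset, so these numbers are diagnostics that flag heavy tails or near-zero propensities rather than a proof that the population conditions hold.

```python
from statistics import fmean


def regularity_diagnostics(rewards, target_probs, logging_probs):
    """Summarise quantities tied to the stated regularity conditions.

    rewards[i]       : observed reward r_i
    target_probs[i]  : target-policy probability of the logged action
    logging_probs[i] : logging-policy probability of the logged action
    """
    if not rewards or not (
        len(rewards) == len(target_probs) == len(logging_probs)
    ):
        raise ValueError("inputs must be non-empty and of equal length")
    if min(logging_probs) <= 0.0:
        raise ValueError("logging propensities must be strictly positive")

    weights = [t / l for t, l in zip(target_probs, logging_probs)]
    rw = [ri * wi for ri, wi in zip(rewards, weights)]
    mean_w = fmean(weights)

    return {
        # Smallest logging propensity: small values signal exploding weights.
        "min_logging_propensity": min(logging_probs),
        "max_weight": max(weights),
        # Sample analogue of E[(r w)^2].
        "second_moment_rw": fmean([x * x for x in rw]),
        # Sample Var(w): must be positive for Cov(rw, w) / Var(w) to exist.
        "var_w": fmean([(wi - mean_w) ** 2 for wi in weights]),
    }


if __name__ == "__main__":
    report = regularity_diagnostics(
        rewards=[1.0, 1.0, 0.0],
        target_probs=[0.25, 0.5, 0.5],
        logging_probs=[0.5, 0.25, 0.5],
    )
    for key, val in report.items():
        print(f"{key}: {val:.4f}")
```

Tracking how `second_moment_rw` and `max_weight` change as more data is added can hint at whether the weighted reward has a heavy tail.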

Circularity Check

0 steps flagged

Analytic variance decomposition establishes dominance without reduction to inputs or self-citations

The paper derives the asymptotic MSE dominance of β*-IPS over SNIPS by analytically decomposing the variance gap and showing that SNIPS is equivalent to IPS plus a specific (generally sub-optimal) additive baseline term. This is a direct algebraic identity under the stated regularity conditions on moments and propensities; it does not involve fitting any parameter to the target quantity, renaming an empirical pattern, or relying on a load-bearing self-citation whose validity is internal to the present work. The central claim therefore remains self-contained and independent of the data instances used in experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard asymptotic analysis assumptions for OPE estimators (finite second moments, proper propensity scores) that are inherited from prior literature rather than introduced or fitted in this paper.

axioms (1)

domain assumption: Standard regularity conditions for asymptotic normality and variance finiteness in inverse propensity scoring estimators. Invoked to justify the analytic variance decomposition and asymptotic dominance statement.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

we prove that β∗-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
