Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation
Pith reviewed 2026-05-15 21:30 UTC · model grok-4.3
The pith
An optimal additive baseline estimator asymptotically dominates self-normalized IPS in off-policy evaluation mean squared error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that β∗-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific but generally sub-optimal additive baseline. This holds under standard regularity conditions and supplies a theoretical justification for preferring optimal additive control variates over self-normalization in off-policy evaluation for ranking and recommendation.
What carries the argument
The β∗-IPS estimator, which applies an optimal additive baseline to importance-weighted outcomes and which the paper shows attains an asymptotic variance no larger than that of the multiplicative normalization in SNIPS.
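To make the three estimators concrete, here is a minimal NumPy sketch. It assumes the standard additive-baseline form, mean(w·(r − β)) + β, and a plug-in estimate of β* = Cov(rw, w)/Var(w) from sample moments; the function names and the toy arrays are illustrative and not taken from the paper, whose exact definitions may differ.

```python
import numpy as np

def ips(w, r):
    """Plain IPS: sample mean of importance-weighted rewards."""
    return float(np.mean(w * r))

def snips(w, r):
    """Self-normalised IPS: weighted rewards divided by the sum of weights."""
    return float(np.sum(w * r) / np.sum(w))

def beta_ips(w, r, beta):
    """IPS with an additive baseline; unbiased for any fixed beta when E[w] = 1."""
    return float(np.mean(w * (r - beta)) + beta)

def beta_star_plugin(w, r):
    """Plug-in estimate of beta* = Cov(rw, w) / Var(w) from sample moments."""
    rw = r * w
    cov = np.mean((rw - rw.mean()) * (w - w.mean()))
    return float(cov / np.var(w))

# Toy data (hypothetical): importance weights and observed rewards.
w = np.array([2.0, 0.5, 0.5, 0.5])
r = np.array([1.0, 0.0, 1.0, 1.0])
print(ips(w, r), snips(w, r), beta_ips(w, r, beta_star_plugin(w, r)))
```

With β = 0 the baseline estimator reduces to plain IPS. Estimating β* from the same sample, as the last line does, introduces a small finite-sample bias that this sketch does not correct.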
Load-bearing premise
Standard regularity conditions hold, including finite second moments of the importance-weighted reward, propensities bounded away from zero, and the existence of a finite optimal baseline.
What would settle it
Compute the exact optimal baseline on a simulated or real dataset with known propensities, then compare the empirical mean squared error of β∗-IPS and SNIPS as the sample size n increases; the gap, scaled by n, should converge to the non-negative value predicted by the variance decomposition.
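One way such a check could look is sketched below for a context-free bandit with three actions and Bernoulli rewards. The policies `pi0` and `pi`, the reward means `q`, and the function name `simulate` are all hypothetical choices made for illustration. The asymptotic variances are computed in closed form under the assumption that SNIPS behaves like an additive baseline equal to the policy value V, which is the standard delta-method result; the paper's own decomposition may be stated differently.

```python
import numpy as np

def simulate(n=1000, reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    pi0 = np.array([0.6, 0.3, 0.1])   # logging policy (assumed known)
    pi = np.array([0.1, 0.3, 0.6])    # target policy
    q = np.array([0.9, 0.5, 0.1])     # Bernoulli reward mean per action
    w_a = pi / pi0                    # importance weight per action

    # Exact population quantities; E[w] = 1 under the logging policy.
    V = float(pi @ q)
    var_w = float(pi0 @ w_a**2 - 1.0)
    cov_rw_w = float(pi0 @ (w_a**2 * q) - V)
    var_rw = float(pi0 @ (w_a**2 * q) - V**2)   # r**2 == r for Bernoulli r
    beta_star = cov_rw_w / var_w
    sigma2_beta_star = var_rw - cov_rw_w**2 / var_w
    sigma2_snips = var_rw - 2.0 * V * cov_rw_w + V**2 * var_w
    gap_pred = var_w * (V - beta_star) ** 2

    # Monte Carlo: `reps` independent logged datasets of size n.
    a = rng.choice(3, size=(reps, n), p=pi0)
    r = (rng.random((reps, n)) < q[a]).astype(float)
    w = w_a[a]
    est_snips = (w * r).sum(axis=1) / w.sum(axis=1)
    est_beta = (w * (r - beta_star)).mean(axis=1) + beta_star

    return {
        "beta_star": beta_star,
        "gap_pred": gap_pred,
        "sigma2_beta_star": sigma2_beta_star,
        "sigma2_snips": sigma2_snips,
        "n_mse_snips": float(n * np.mean((est_snips - V) ** 2)),
        "n_mse_beta_star": float(n * np.mean((est_beta - V) ** 2)),
    }

res = simulate()
for key, value in res.items():
    print(f"{key}: {value:.4f}")
```

The scaled errors `n_mse_snips` and `n_mse_beta_star` can be compared against `sigma2_snips` and `sigma2_beta_star`, and their difference against `gap_pred`. Because the exact β* is used, the baseline estimator here is unbiased at every sample size; a version with an estimated β* would need a separate check.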
Original abstract
Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer: we prove that $\beta^\star$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves that β*-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in mean squared error for off-policy evaluation. It does so via an analytic variance decomposition showing that SNIPS is asymptotically equivalent to IPS plus a specific but generally suboptimal additive baseline. The result is positioned as theoretical justification for preferring additive control variates over self-normalization in ranking and recommendation systems under standard regularity conditions.
Significance. If the central claim holds, the work provides a clear theoretical basis for shifting from SNIPS to optimal additive baselines in OPE, with the variance-gap decomposition offering insight that goes beyond empirical comparisons. This could influence variance-reduction practice in recommendation and ranking applications where low-variance estimators are critical.
major comments (1)
- [Theorem statement and proof of asymptotic dominance] The main theorem (and its proof) invokes 'standard regularity conditions' (finite second moments of (r·w), bounded propensities so that β* = Cov(rw, w)/Var(w) is well-defined and finite) but does not exhibit them explicitly in the theorem statement or proof. The dominance result is load-bearing on these conditions; without them the gap expression is undefined. The experiments also do not verify the conditions on the empirical distributions used.
minor comments (1)
- Notation for the optimal baseline β* and the specific suboptimal baseline implicit in SNIPS could be introduced earlier and used consistently to improve readability of the decomposition.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the regularity conditions. We address it directly below and will revise the manuscript to improve clarity.
Point-by-point responses
- Referee: The main theorem (and its proof) invokes 'standard regularity conditions' (finite second moments of (r·w), bounded propensities so that β* = Cov(rw, w)/Var(w) is well-defined and finite) but does not exhibit them explicitly in the theorem statement or proof. The dominance result is load-bearing on these conditions; without them the gap expression is undefined. The experiments also do not verify the conditions on the empirical distributions used.
  Authors: We agree that the conditions should be stated explicitly. In the revised manuscript we will update the main theorem statement to read: 'Under the regularity conditions that E[(r w)^2] < ∞ and the propensities are bounded away from zero (ensuring β* = Cov(r w, w)/Var(w) is well-defined and finite), the following holds...' The proof will be revised to invoke these assumptions at the first step where they are used. For the experiments, we will add a short paragraph reporting empirical checks (sample second moments of r w and minimum propensity values) on the synthetic and real-world datasets to confirm the conditions hold in practice.
  Revision: yes
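The empirical checks mentioned in the response could be as simple as the following sketch. The function name `regularity_diagnostics`, the `floor` threshold, and the toy inputs are assumptions made for illustration; note that a finite sample moment can only flag problems and cannot establish that the population moment is finite.

```python
import numpy as np

def regularity_diagnostics(r, w, propensity, floor=1e-3):
    """Sample-level diagnostics for the stated regularity conditions."""
    r, w, propensity = (np.asarray(x, dtype=float) for x in (r, w, propensity))
    return {
        # Sample analogue of E[(r w)^2].
        "second_moment_rw": float(np.mean((r * w) ** 2)),
        # Smallest logged propensity; values near zero inflate the weights.
        "min_propensity": float(propensity.min()),
        "max_weight": float(w.max()),
        # Var(w) must be positive for Cov(rw, w) / Var(w) to be defined.
        "var_w": float(np.var(w)),
        "propensity_above_floor": bool(propensity.min() >= floor),
    }

# Toy logged data (hypothetical).
report = regularity_diagnostics(
    r=[1.0, 0.0, 1.0], w=[2.0, 1.0, 0.5], propensity=[0.25, 0.5, 0.9]
)
print(report)
```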
Circularity Check
Analytic variance decomposition establishes dominance without reduction to inputs or self-citations
full rationale
The paper derives the asymptotic MSE dominance of β*-IPS over SNIPS by analytically decomposing the variance gap and showing that SNIPS is equivalent to IPS plus a specific (generally sub-optimal) additive baseline term. This is a direct algebraic identity under the stated regularity conditions on moments and propensities; it does not involve fitting any parameter to the target quantity, renaming an empirical pattern, or relying on a load-bearing self-citation whose validity is internal to the present work. The central claim therefore remains self-contained and independent of the data instances used in experiments.
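The algebra behind this kind of decomposition can be sketched as follows. This is a reconstruction under stated assumptions rather than the paper's own derivation: it takes the additive-baseline estimator in its standard form, writes w for the importance weight with E[w] = 1 and V = E[rw] for the target value, and uses the delta method for the SNIPS ratio.

```latex
\begin{align*}
\hat V_\beta &= \frac{1}{n}\sum_{i=1}^{n}\bigl(w_i r_i - \beta\,(w_i - 1)\bigr),
  \qquad \mathbb{E}[w] = 1, \quad V = \mathbb{E}[rw], \\
n\,\mathrm{Var}\bigl(\hat V_\beta\bigr)
  &= \mathrm{Var}(rw) - 2\beta\,\mathrm{Cov}(rw, w) + \beta^{2}\,\mathrm{Var}(w)
  \;=:\; \sigma^{2}_{\beta}, \\
\beta^\star &= \arg\min_{\beta}\,\sigma^{2}_{\beta}
  = \frac{\mathrm{Cov}(rw, w)}{\mathrm{Var}(w)}, \\
n\,\mathrm{Var}\bigl(\hat V_{\mathrm{SNIPS}}\bigr)
  &\longrightarrow \mathrm{Var}(rw - V w) = \sigma^{2}_{\beta = V}
  \qquad \text{(delta method)}, \\
\sigma^{2}_{\beta = V} - \sigma^{2}_{\beta^\star}
  &= \mathrm{Var}(w)\,\bigl(V - \beta^\star\bigr)^{2} \;\ge\; 0 .
\end{align*}
```

The last line follows because σ²_β is a quadratic in β with leading coefficient Var(w) and minimiser β*. The gap vanishes exactly when V = β*, which is why the implicit SNIPS baseline is described as generally, rather than always, sub-optimal.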
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard regularity conditions for asymptotic normality and variance finiteness in inverse propensity scoring estimators
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
we prove that β∗-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
Source paper: Additive Control Variates Dominate Self-Normalisation. SIGIR ’26, July 20–24, 2026, Melbourne, VIC, Australia.
- [1] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press. doi:10.1093/acprof:oso/9780199535255.001.0001
- [2] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21, 1 (01 2018), C1–C68. doi:10.1111/ectj.12097 https://academic.oup.com/ectj/article-pdf/21/1/C1/27684918/ectj00c1.pdf
- [3] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B Testing for Recommender Systems. In Proc. of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM ’18). ACM, 198–206. https://doi.org/10.1145/3159652.3159687
- [4] Shashank Gupta, Philipp Hager, Jin Huang, Ali Vardasbi, and Harrie Oosterhuis. Unbiased Learning to Rank: On Recent Advances and Practical Applications. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 1118–1121
- [5] Shashank Gupta, Harrie Oosterhuis, and Maarten de Rijke. 2023. Safe deployment for counterfactual learning to rank with exposure-based risk minimization. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 249–258
- [6] Jaroslav Hájek. 1971. Comment on “an essay on the logical foundations of survey sampling, part one”. The foundations of survey sampling 236 (1971)
- [7] Daniel G Horvitz and Donovan J Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association 47, 260 (1952), 663–685
- [8] Olivier Jeunen. 2021. Offline Approaches to Recommendation with Online Success. Ph.D. Dissertation. University of Antwerp
- [9] Olivier Jeunen, Thorsten Joachims, Harrie Oosterhuis, Yuta Saito, and Flavian Vasile. 2022. CONSEQUENCES — Causality, Counterfactuals and Sequential Decision-Making for Recommender Systems. In Proc. of the 16th ACM Conference on Recommender Systems (RecSys ’22). ACM, 654–657. doi:10.1145/3523227.3547409
- [10] Olivier Jeunen, Jatin Mandav, Ivan Potapov, Nakul Agarwal, Sourabh Vaid, Wenzhe Shi, and Aleksei Ustimenko. 2024. Multi-Objective Recommendation via Multivariate Policy Learning. In Proceedings of the 18th ACM Conference on Recommender Systems (RecSys ’24). ACM, 712–721. doi:10.1145/3640457.3688132
- [11] Olivier Jeunen, Ivan Potapov, and Aleksei Ustimenko. 2024. On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n Recommendation. In Proc. of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24). ACM, 1222–1233. doi:10.1145/3637528.3671687
- [12] Olivier Jeunen and Aleksei Ustimenko. 2024. Δ-OPE: Off-Policy Estimation with Pairs of Policies. In Proc. of the 18th ACM Conference on Recommender Systems (RecSys ’24). ACM, 878–883. doi:10.1145/3640457.3688162
- [13] Thorsten Joachims, Ben London, Yi Su, Adith Swaminathan, and Lequn Wang. Recommendations as Treatments. AI Magazine 42, 3 (Nov. 2021), 19–30. doi:10.1609/aimag.v42i3.18141
- [14] Thorsten Joachims, Adith Swaminathan, and Maarten de Rijke. 2018. Deep Learning with Logged Bandit Feedback. In International Conference on Learning Representations. https://openreview.net/forum?id=SJaP_-xAb
- [15] Ron Kohavi, Diane Tang, and Ya Xu. 2020. Trustworthy online controlled experiments: A practical guide to A/B testing. Cambridge University Press
- [16] Augustine Kong. 1992. A note on importance sampling using standardized weights. University of Chicago, Dept. of Statistics, Tech. Rep 348 (1992)
- [17] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. 2018. Offline Evaluation of Ranking Policies with Click Models. In Proc. of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). ACM, 1685–1694. doi:10.1145/3219819.3220028
- [18] Ben London, Alexander Buchholz, Giuseppe Di Benedetto, Jan Malte Lichtenberg, Yannik Stein, and Thorsten Joachims. 2023. Self-Normalized Off-Policy Estimators for Ranking. In CONSEQUENCES Workshop at ACM RecSys ’23 (CONSEQUENCES ’23)
- [19] Art B. Owen. 2013. Monte Carlo theory, methods and examples
- [20] Hitesh Sagtani, Madan Gopal Jhawar, Rishabh Mehrotra, and Olivier Jeunen. Ad-load Balancing via Off-policy Learning in a Content Marketplace. In Proc. of the 17th ACM International Conference on Web Search and Data Mining (WSDM ’24). ACM, 586–595. doi:10.1145/3616855.3635846
- [21] Yuta Saito and Thorsten Joachims. 2021. Counterfactual Learning and Evaluation for Recommender Systems: Foundations, Implementations, and Recent Advances. In Proc. of the 15th ACM Conference on Recommender Systems (RecSys ’21). ACM, 828–830. doi:10.1145/3460231.3473320
- [22] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. 2021. Evaluating the Robustness of Off-Policy Evaluation. In Proc. of the 15th ACM Conference on Recommender Systems (RecSys ’21). ACM, 114–123. doi:10.1145/3460231.3474245
- [23] Adith Swaminathan and Thorsten Joachims. 2015. The Self-Normalized Estimator for Counterfactual Learning. In Advances in Neural Information Processing Systems, Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2015/file/39027dfad5138c9ca0c474d71db915c3-Paper.pdf
- [24] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. 2017. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. htt...
- [25] Bram van den Akker, Olivier Jeunen, Ying Li, Ben London, Zahra Nazari, and Devesh Parekh. 2024. Practical Bandits: An Industry Perspective. In Proc. of the 17th ACM International Conference on Web Search and Data Mining (WSDM ’24). ACM, 1132–1135. doi:10.1145/3616855.3636449
- [26] Flavian Vasile, David Rohde, Olivier Jeunen, and Amine Benhalloum. 2020. A Gentle Introduction to Recommendation as Counterfactual Policy Learning. In Proc. of the 28th ACM Conference on User Modeling, Adaptation and Personalization (UMAP ’20). ACM, 392–393. doi:10.1145/3340631.3398666
- [27] Nikos Vlassis, Ashok Chandrashekar, Fernando Amat, and Nathan Kallus. Control Variates for Slate Off-Policy Evaluation. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 3667–3679. https://proceedings.neurips.cc/paper_files/paper/2021/file/1e0b802d5c0e1e8434a771ba7ff2c301-Paper.pdf