
arxiv: 2602.14914 · v2 · submitted 2026-02-16 · 💻 cs.LG · cs.IR

Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation

Pith reviewed 2026-05-15 21:30 UTC · model grok-4.3

classification 💻 cs.LG cs.IR
keywords off-policy evaluation · inverse propensity scoring · additive control variates · SNIPS · variance reduction · recommendation systems · ranking systems

The pith

An estimator with an optimal additive baseline asymptotically dominates self-normalized IPS in mean squared error for off-policy evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proves that an estimator using the optimal additive baseline correction, called β∗-IPS, has lower asymptotic mean squared error than the standard self-normalized inverse propensity scoring estimator. It reaches this conclusion by decomposing the variance difference between the two and showing that SNIPS matches the behavior of a particular but suboptimal additive baseline. Readers working on ranking and recommendation systems should care because off-policy evaluation lets teams test new policies without running live experiments, and lower error means more trustworthy comparisons. If the result holds, practitioners have a clear theoretical reason to move from multiplicative self-normalization to additive baseline corrections.

Core claim

We prove that β∗-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific but generally sub-optimal additive baseline. This holds under standard regularity conditions and supplies a theoretical justification for preferring optimal additive control variates over self-normalization in off-policy evaluation for ranking and recommendation.
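
The variance decomposition behind this claim is not reproduced on this page. The following is a sketch of how such a decomposition can go, not the paper's own derivation. The symbols $w$ (importance weight, with $E[w] = 1$), $r$ (reward), $V = E[wr]$ (target value) and $n$ (sample size) are notation assumed here; the formula for $\beta^\star$ is the one quoted in the referee report below. Identifying the SNIPS-implied baseline with $V$ follows from a standard delta-method argument and is an assumption about which "specific baseline" the paper means.

```latex
% Assumed notation: w = \pi(a \mid x) / \pi_0(a \mid x), so E[w] = 1, and V = E[wr].
\begin{align*}
\hat V_{\beta} &= \beta + \frac{1}{n}\sum_{i=1}^{n} w_i\,(r_i - \beta), \\
n\,\mathrm{Var}\big(\hat V_{\beta}\big) &= \mathrm{Var}(wr) - 2\beta\,\mathrm{Cov}(wr, w) + \beta^{2}\,\mathrm{Var}(w), \\
\beta^{\star} &= \frac{\mathrm{Cov}(wr, w)}{\mathrm{Var}(w)}, \\
n\,\mathrm{Var}\big(\hat V_{\beta}\big) - n\,\mathrm{Var}\big(\hat V_{\beta^{\star}}\big)
  &= \mathrm{Var}(w)\,(\beta - \beta^{\star})^{2} \;\ge\; 0 .
\end{align*}

% SNIPS, first-order expansion (delta method, using \bar w \to 1):
\begin{align*}
\sqrt{n}\,\big(\hat V_{\mathrm{SNIPS}} - V\big)
  &= \frac{1}{\sqrt{n}}\sum_{i=1}^{n} w_i\,(r_i - V) + o_p(1),
\end{align*}
% which matches \hat V_{\beta} with \beta = V, so the asymptotic gap is
% \mathrm{Var}(w)\,(V - \beta^{\star})^{2}, and it vanishes only when \beta^{\star} = V.
```

Under this reading, the gap is the product of the weight variance and the squared distance between the two baselines, which is why it cannot be negative.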

What carries the argument

The β∗-IPS estimator, which applies an optimal additive baseline to importance-weighted outcomes. The paper shows that its asymptotic variance is no larger than that of SNIPS, which relies on multiplicative normalization.
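
To make the two kinds of correction concrete, here is a minimal sketch of the estimators in plain Python. The function names, the toy data and the plug-in estimate of the baseline are illustrative assumptions, not code from the paper; the baseline formula Cov(rw, w)/Var(w) is the one quoted in the referee report below.

```python
def ips(weights, rewards):
    """Plain IPS: the mean of importance-weighted rewards."""
    n = len(weights)
    return sum(w * r for w, r in zip(weights, rewards)) / n


def snips(weights, rewards):
    """Self-normalised IPS: weighted rewards divided by the sum of weights."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


def beta_ips(weights, rewards, beta):
    """IPS with an additive baseline: beta + mean(w * (r - beta))."""
    n = len(weights)
    return beta + sum(w * (r - beta) for w, r in zip(weights, rewards)) / n


def beta_star_hat(weights, rewards):
    """Plug-in estimate of Cov(r*w, w) / Var(w) from the same sample."""
    n = len(weights)
    rw = [w * r for w, r in zip(weights, rewards)]
    mean_w = sum(weights) / n
    mean_rw = sum(rw) / n
    cov = sum((a - mean_rw) * (b - mean_w) for a, b in zip(rw, weights)) / n
    var = sum((b - mean_w) ** 2 for b in weights) / n
    return cov / var  # undefined when all weights are equal (var == 0)


if __name__ == "__main__":
    # Hypothetical toy data: w = pi(a|x) / pi_0(a|x) and observed rewards.
    weights = [0.5, 2.0, 0.5, 2.0]
    rewards = [1.0, 0.0, 1.0, 1.0]
    print("IPS     :", ips(weights, rewards))
    print("SNIPS   :", snips(weights, rewards))
    beta = beta_star_hat(weights, rewards)
    print("beta-IPS:", beta_ips(weights, rewards, beta))
```

One exact finite-sample identity is worth noting: `beta_ips(weights, rewards, b)` returns `b` itself precisely when `b` equals the SNIPS value, so SNIPS can be read as an additive baseline correction with a particular, data-dependent baseline. Estimating the baseline from the same sample, as `beta_star_hat` does, adds a finite-sample effect that the asymptotic statement does not cover.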

Load-bearing premise

Standard regularity conditions hold, including finite variances, bounded propensities, and the existence of the optimal baseline.

What would settle it

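Compute the exact optimal baseline on a simulated or real dataset with known propensities, then compare the empirical mean squared error of β∗-IPS versus SNIPS as sample size increases. Because both errors shrink at rate 1/n, the gap multiplied by the sample size should converge to the value predicted by the variance decomposition, which is positive whenever the baseline implied by SNIPS differs from the optimal one.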
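
A check of this kind can be sketched with a toy simulation. Everything below is an illustrative assumption rather than the paper's experimental setup: a context-free bandit with two actions, hand-picked policies, Bernoulli rewards, and the SNIPS asymptotic variance computed by treating SNIPS as an additive baseline equal to the true value (a standard delta-method result).

```python
import random

# Toy context-free bandit; all numbers are illustrative assumptions.
LOGGING = [0.5, 0.5]        # logging policy pi_0(a)
TARGET = [0.2, 0.8]         # target policy pi(a)
REWARD_MEAN = [0.9, 0.1]    # Bernoulli reward mean q(a)
WEIGHT = [t / l for t, l in zip(TARGET, LOGGING)]  # w(a) = pi(a) / pi_0(a)


def population_quantities():
    """Exact value, optimal baseline and asymptotic variances (times n)."""
    value = sum(t * q for t, q in zip(TARGET, REWARD_MEAN))       # E[wr]
    e_w2 = sum(l * w * w for l, w in zip(LOGGING, WEIGHT))        # E[w^2]
    e_w2r = sum(l * w * w * q
                for l, w, q in zip(LOGGING, WEIGHT, REWARD_MEAN))  # E[w^2 r]
    var_w = e_w2 - 1.0                     # uses E[w] = 1
    cov = e_w2r - value                    # Cov(wr, w)
    var_wr = e_w2r - value ** 2            # r^2 = r for Bernoulli rewards
    beta_star = cov / var_w
    avar_beta = var_wr - cov ** 2 / var_w  # Var(wr - beta* w)
    avar_snips = avar_beta + var_w * (value - beta_star) ** 2
    return value, beta_star, avar_beta, avar_snips


def simulate(n, reps, seed=0):
    """Monte Carlo MSE of beta*-IPS (exact baseline) and SNIPS."""
    rng = random.Random(seed)
    value, beta_star, _, _ = population_quantities()
    se_beta = se_snips = 0.0
    for _ in range(reps):
        sum_w = sum_wr = 0.0
        for _ in range(n):
            a = 0 if rng.random() < LOGGING[0] else 1
            r = 1.0 if rng.random() < REWARD_MEAN[a] else 0.0
            sum_w += WEIGHT[a]
            sum_wr += WEIGHT[a] * r
        est_beta = beta_star + (sum_wr - beta_star * sum_w) / n
        est_snips = sum_wr / sum_w
        se_beta += (est_beta - value) ** 2
        se_snips += (est_snips - value) ** 2
    return se_beta / reps, se_snips / reps


if __name__ == "__main__":
    _, _, avar_beta, avar_snips = population_quantities()
    print("predicted n*MSE: beta*-IPS", avar_beta, "SNIPS", avar_snips)
    for n in (100, 400):
        mse_beta, mse_snips = simulate(n, reps=1000)
        print(n, "n*MSE: beta*-IPS", n * mse_beta, "SNIPS", n * mse_snips)
```

Both estimators are evaluated on the same draws, which reduces Monte Carlo noise in the comparison. In this toy setting the optimal baseline is negative while the true value is positive, so the predicted gap is clearly non-zero; settings where the two baselines nearly coincide would show almost no difference.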

Original abstract

Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer: we prove that $\beta^\star$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proves that β*-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in mean squared error for off-policy evaluation. It does so via an analytic variance decomposition showing that SNIPS is asymptotically equivalent to IPS plus a specific but generally suboptimal additive baseline. The result is positioned as theoretical justification for preferring additive control variates over self-normalization in ranking and recommendation systems under standard regularity conditions.

Significance. If the central claim holds, the work provides a clear theoretical basis for shifting from SNIPS to optimal additive baselines in OPE, with the variance-gap decomposition offering insight that goes beyond empirical comparisons. This could influence variance-reduction practice in recommendation and ranking applications where low-variance estimators are critical.

major comments (1)
  1. [Theorem statement and proof of asymptotic dominance] The main theorem (and its proof) invokes 'standard regularity conditions' (finite second moments of (r·w), bounded propensities so that β* = Cov(rw, w)/Var(w) is well-defined and finite) but does not exhibit them explicitly in the theorem statement or proof. The dominance result is load-bearing on these conditions; without them the gap expression is undefined. The experiments also do not verify the conditions on the empirical distributions used.
minor comments (1)
  1. Notation for the optimal baseline β* and the specific suboptimal baseline implicit in SNIPS could be introduced earlier and used consistently to improve readability of the decomposition.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the regularity conditions. We address it directly below and will revise the manuscript to improve clarity.

Point-by-point responses
  1. Referee: The main theorem (and its proof) invokes 'standard regularity conditions' (finite second moments of (r·w), bounded propensities so that β* = Cov(rw, w)/Var(w) is well-defined and finite) but does not exhibit them explicitly in the theorem statement or proof. The dominance result is load-bearing on these conditions; without them the gap expression is undefined. The experiments also do not verify the conditions on the empirical distributions used.

    Authors: We agree that the conditions should be stated explicitly. In the revised manuscript we will update the main theorem statement to read: 'Under the regularity conditions that E[(r w)^2] < ∞ and the propensities are bounded away from zero (ensuring β* = Cov(r w, w)/Var(w) is well-defined and finite), the following holds...' The proof will be revised to invoke these assumptions at the first step where they are used. For the experiments, we will add a short paragraph reporting empirical checks (sample second moments of r w and minimum propensity values) on the synthetic and real-world datasets to confirm the conditions hold in practice.

    revision: yes
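
The empirical checks mentioned in the response could look roughly like the sketch below. The function name, argument layout and returned keys are hypothetical; the diagnostics are the ones the response names (sample second moment of r·w and minimum propensity), plus the sample variance of the weights that the baseline formula divides by. Finite sample moments cannot prove that population moments are finite, so this is a sanity check rather than a verification.

```python
def regularity_report(rewards, target_probs, logging_probs):
    """Sample diagnostics for the conditions named in the referee comment.

    rewards[i]       : observed reward for logged interaction i
    target_probs[i]  : target-policy probability of the logged action
    logging_probs[i] : logging-policy propensity of the logged action
    """
    if not (len(rewards) == len(target_probs) == len(logging_probs)):
        raise ValueError("inputs must have the same length")
    if min(logging_probs) <= 0.0:
        raise ValueError("logging propensities must be strictly positive")

    n = len(rewards)
    w = [t / l for t, l in zip(target_probs, logging_probs)]
    rw = [r * wi for r, wi in zip(rewards, w)]
    mean_w = sum(w) / n
    mean_rw = sum(rw) / n
    var_w = sum((x - mean_w) ** 2 for x in w) / n
    cov = sum((a - mean_rw) * (b - mean_w) for a, b in zip(rw, w)) / n

    return {
        "min_propensity": min(logging_probs),   # should stay away from zero
        "max_weight": max(w),                   # large values signal heavy tails
        "mean_weight": mean_w,                  # should be close to 1
        "second_moment_rw": sum(x * x for x in rw) / n,
        "var_weight": var_w,                    # must be > 0 for the baseline
        "beta_hat": cov / var_w if var_w > 0 else None,
    }


if __name__ == "__main__":
    # Hypothetical logged data with four interactions.
    report = regularity_report(
        rewards=[1.0, 0.0, 1.0, 1.0],
        target_probs=[0.25, 0.5, 0.25, 0.5],
        logging_probs=[0.5, 0.25, 0.5, 0.25],
    )
    for key, val in report.items():
        print(key, val)
```

A mean weight far from 1 or a maximum weight that keeps growing with the sample size would be a warning sign that the asymptotic comparison may not yet apply.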

Circularity Check

0 steps flagged

Analytic variance decomposition establishes dominance without reduction to inputs or self-citations

Full rationale

The paper derives the asymptotic MSE dominance of β*-IPS over SNIPS by analytically decomposing the variance gap and showing that SNIPS is asymptotically equivalent to IPS plus a specific (generally sub-optimal) additive baseline term. The decomposition is a direct analytic argument under the stated regularity conditions on moments and propensities; it does not involve fitting any parameter to the target quantity, renaming an empirical pattern, or relying on a load-bearing self-citation whose validity is internal to the present work. The central claim therefore remains self-contained and independent of the data instances used in experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard asymptotic analysis assumptions for OPE estimators (finite second moments, proper propensity scores) that are inherited from prior literature rather than introduced or fitted in this paper.

axioms (1)
  • domain assumption: Standard regularity conditions for asymptotic normality and variance finiteness in inverse propensity scoring estimators.
    Invoked to justify the analytic variance decomposition and asymptotic dominance statement.




venue SIGIR ’26, July 20–24, 2026, Melbourne, VIC, Australia