Variance Reduction for Heavy-Tailed Monetization Metrics in Ranking Experiments via Post-Stratification

Aleksei Ustimenko; Neeti Pokharna; Olivier Jeunen; Yatharth Saraf

arxiv: 2606.04110 · v1 · pith:LPQGBSRZnew · submitted 2026-06-02 · 💻 cs.LG · stat.ML


Pith reviewed 2026-06-28 11:20 UTC · model grok-4.3


keywords: variance reduction, post-stratification, CUPED, A/B testing, heavy-tailed metrics, monetization, ranking experiments, online evaluation


The pith

Post-stratification combined with CUPED reduces variance in heavy-tailed monetization metrics enough to reach equivalent statistical confidence with roughly 45 percent less traffic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that pairs post-stratification with CUPED to lower variance when testing heavy-tailed monetization outcomes such as revenue in ranking experiments. Pre-experiment covariates are used to group users into strata, which limits the influence of extreme values without any new data collection. In deployment, the approach reached the target confidence level with about 45 percent less traffic and made decisions more stable. The authors also outline design choices and guardrails for applying the method to information retrieval and recommendation systems.

Core claim

The central claim is that post-stratification combined with CUPED leverages pre-experiment covariates to achieve substantial variance reduction in heavy-tailed monetization metrics, enabling equivalent statistical confidence with approximately 45% less traffic than standard approaches in ranking experiments.
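The link between a traffic saving and a variance reduction can be made explicit with the standard two-sample sample-size formula. The worked arithmetic below is an illustration under textbook assumptions (equal arm sizes, a normal approximation, a fixed minimum detectable effect \delta); the symbols are introduced here and are not taken from the paper.

```latex
\begin{aligned}
n_{\text{per arm}} &= \frac{2\,\sigma^{2}\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{\delta^{2}}
\quad\Longrightarrow\quad
\frac{n_{\text{adj}}}{n_{\text{raw}}} = \frac{\sigma^{2}_{\text{adj}}}{\sigma^{2}_{\text{raw}}}, \\[4pt]
\frac{n_{\text{adj}}}{n_{\text{raw}}} &\approx 1 - 0.45 = 0.55
\quad\Longrightarrow\quad
\frac{\sigma^{2}_{\text{adj}}}{\sigma^{2}_{\text{raw}}} \approx 0.55, \qquad
\frac{\mathrm{SE}_{\text{adj}}}{\mathrm{SE}_{\text{raw}}}\bigg|_{\text{equal traffic}} = \sqrt{0.55} \approx 0.74 .
\end{aligned}
```

Read this way, "approximately 45% less traffic" corresponds to the adjusted metric having roughly 55% of the variance of the unadjusted one, or confidence intervals about 26% narrower at equal traffic.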

What carries the argument

Post-stratification combined with CUPED, which uses pre-experiment covariates to divide users into strata and adjust the estimator for the heavy-tailed monetization outcome.
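A minimal sketch of how the two adjustments can be chained, assuming a single pre-experiment covariate and quantile-based strata. The function names, the synthetic lognormal data, and the stratum cut points are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: subtract theta * (x - mean(x)), theta = cov(y, x) / var(x), pooled over arms."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

def post_stratified_effect(y, treated, strata):
    """Weighted sum of within-stratum differences in means, plus a plug-in variance."""
    n = len(y)
    effect, variance = 0.0, 0.0
    for s in np.unique(strata):
        in_s = strata == s
        y_t, y_c = y[in_s & treated], y[in_s & ~treated]
        w = in_s.sum() / n  # stratum share of all users
        effect += w * (y_t.mean() - y_c.mean())
        variance += w**2 * (y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c))
    return effect, variance

# Synthetic A/A data (assumption): lognormal pre-period spend x predicts outcome y.
rng = np.random.default_rng(0)
n = 20_000
x = rng.lognormal(0.0, 1.5, n)
y = 0.8 * x + rng.lognormal(0.0, 1.0, n)
treated = rng.random(n) < 0.5
# Four strata built from the pre-period covariate only (hypothetical cut points).
strata = np.digitize(x, np.quantile(x, [0.5, 0.9, 0.99]))

# A single stratum reduces the estimator to the plain difference in means.
_, var_raw = post_stratified_effect(y, treated, np.zeros(n, dtype=int))
_, var_adj = post_stratified_effect(cuped_adjust(y, x), treated, strata)
print(f"variance ratio (adjusted / raw): {var_adj / var_raw:.3f}")
```

Both the CUPED covariate and the strata are computed from pre-experiment data only, which is what keeps treatment assignment independent of the adjustment.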

If this is right

A/B tests on monetization metrics can reach the same power with less user traffic.
Decisions in ranking-driven experiments become more stable under the same sample size.
The framework applies directly to real-world recommendation and retrieval systems that track revenue or earnings.
Guardrails on covariate choice and stratum construction prevent bias while preserving the variance benefit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same covariate-driven stratification may reduce variance for other heavy-tailed outcomes such as session length or content consumption.
Firms that already collect rich pre-experiment user features will see larger gains than those with sparse data.
Synthetic heavy-tailed data sets with known tail parameters could serve as a controlled test bed for the claimed reduction factor.
If the correlation between covariates and outcome weakens over time, periodic re-selection of stratification variables would be required.
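The last two points can be probed with a small synthetic test bed. The sketch below assumes lognormal outcomes and a single covariate, and uses a hypothetical `noise_scale` knob to weaken the covariate-outcome correlation; it covers only the CUPED part of the adjustment and is not drawn from the paper.

```python
import numpy as np

def cuped_variance_ratio(noise_scale, n=50_000, seed=0):
    """Return (squared correlation, adjusted/raw variance ratio) on synthetic heavy-tailed data."""
    rng = np.random.default_rng(seed)
    x = rng.lognormal(0.0, 1.0, n)                    # pre-experiment covariate
    y = x + noise_scale * rng.lognormal(0.0, 1.0, n)  # outcome; more noise means weaker correlation
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    y_adj = y - theta * (x - x.mean())
    rho2 = np.corrcoef(y, x)[0, 1] ** 2
    return rho2, y_adj.var(ddof=1) / y.var(ddof=1)

for noise_scale in (0.5, 1.0, 2.0, 4.0):
    rho2, ratio = cuped_variance_ratio(noise_scale)
    print(f"noise_scale={noise_scale}: rho^2={rho2:.3f}, variance ratio={ratio:.3f}")
```

In-sample, the ratio equals one minus the squared sample correlation, so the printed columns show directly how the achievable saving shrinks as the covariate loses predictive power.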

Load-bearing premise

Pre-experiment covariates must exist and be sufficiently correlated with the monetization outcome to produce effective stratification without introducing bias.

What would settle it

Deploy the method on a fresh collection of ranking monetization experiments. If neither the variance nor the traffic volume needed to reach a fixed confidence level drops materially relative to the unadjusted metric, the claim fails.

Figures

Figures reproduced from arXiv: 2606.04110 by Aleksei Ustimenko, Neeti Pokharna, Olivier Jeunen, Yatharth Saraf.

**Figure 2.** Empirical distribution of 𝑧-statistics for GMV (x-axis) from A/A tests at different traffic levels (Top: 1%, 5%, 10%; Bottom: 20%, 30%). The y-axis represents frequency. Deviations from the standard normal (orange curve) indicate CLT failure. To validate statistical assumptions, we analyzed the distribution of 𝑧-statistics across thousands of A/A simulation runs at varying traffic levels (1%, 5%, 10%, 20%,…
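The kind of A/A check the caption describes can be sketched on synthetic data. The example below assumes i.i.d. Pareto (Lomax) outcomes with a hypothetical tail index of 2.1 in place of real GMV, and uses sample size as a stand-in for traffic level; it illustrates the diagnostic and does not reproduce the figure.

```python
import numpy as np

def aa_z_statistics(n_users, n_runs=500, tail_index=2.1, seed=0):
    """z-statistics from repeated A/A comparisons of i.i.d. Pareto (Lomax) outcomes."""
    rng = np.random.default_rng(seed)
    z = np.empty(n_runs)
    for r in range(n_runs):
        a = rng.pareto(tail_index, n_users)  # arm A, no treatment effect
        b = rng.pareto(tail_index, n_users)  # arm B, same distribution
        se = np.sqrt(a.var(ddof=1) / n_users + b.var(ddof=1) / n_users)
        z[r] = (a.mean() - b.mean()) / se
    return z

for n_users in (500, 5_000, 20_000):
    z = aa_z_statistics(n_users)
    reject = np.mean(np.abs(z) > 1.96)  # should sit near 0.05 if the normal approximation holds
    print(f"n={n_users}: std(z)={z.std(ddof=1):.3f}, P(|z|>1.96)={reject:.3f}")
```

Comparing the empirical rejection rate and spread against the nominal 5% and unit standard deviation is a numeric counterpart to overlaying the histogram on the standard normal curve.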

Abstract

Online evaluation of ranking and retrieval systems often relies on downstream monetization metrics such as app revenue or creator earnings. These metrics are typically heavy-tailed, with a small fraction of users dominating both mean and variance, leading to low statistical power and unreliable conclusions in A/B experiments, especially under limited traffic. We present a practical framework for variance reduction in online experiments by combining post-stratification with CUPED. Our approach leverages pre-experiment covariates to improve the sensitivity of monetization experiments without requiring additional traffic. Deployed at ShareChat across ranking-driven monetization experiments, the method substantially reduces variance and improves decision stability, achieving equivalent statistical confidence with ~45% less traffic than standard metrics. We further discuss practical design choices, guardrails, and limitations, providing guidance on when post-stratification is appropriate for real-world information retrieval and recommendation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper combines post-stratification with CUPED for heavy-tailed monetization metrics in ranking experiments and reports a 45% traffic reduction in their ShareChat deployment.


The main thing here is a practical combination of post-stratification with CUPED that reduces the traffic needed for reliable A/B tests on heavy-tailed monetization metrics by about 45% in production ranking experiments at ShareChat.

The work is useful because it targets a real issue: revenue and earnings metrics are dominated by a small number of users, which tanks power in online tests. They describe how to pick strata from pre-experiment covariates, add guardrails, and note when the approach fits recommendation systems. That level of implementation detail is the clearest contribution.

Both techniques are established, so the novelty is the targeted application rather than a new framework. The paper stays grounded by discussing limitations instead of overclaiming.

The soft spot is the headline number. The 45% saving requires that the chosen covariates actually predict membership in the heavy tail. The abstract gives no R-squared value or sensitivity check showing how the gain drops if correlation weakens, which is the same gap the referee report raises. If the full paper supplies those details and shows the result holds under moderate correlation, the claim strengthens; otherwise it stays optimistic.
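The dependence on correlation can be quantified for the CUPED part with the standard control-variate identity. The block below is a worked illustration with a single covariate X and population quantities; attributing the whole saving to CUPED alone is an assumption made only to get a feel for the numbers, since the paper combines it with post-stratification.

```latex
\begin{aligned}
\tilde{Y} &= Y - \theta\,(X - \mu_X), \qquad
\theta = \frac{\operatorname{Cov}(Y, X)}{\operatorname{Var}(X)}, \qquad
\operatorname{Var}(\tilde{Y}) = (1 - \rho^{2})\,\operatorname{Var}(Y), \\[4pt]
1 - \rho^{2} &\approx 0.55
\quad\Longrightarrow\quad
\rho^{2} \approx 0.45, \qquad |\rho| \approx 0.67 .
\end{aligned}
```

Under that assumption, a correlation near 0.67 between the covariate and the outcome would be needed to account for the reported saving. Post-stratification behaves similarly, with the share of outcome variance explained by stratum membership playing the role of \rho^2.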

This is for industry teams running ranking experiments where monetization is the primary outcome. Readers who need concrete variance-reduction steps for noisy metrics will get value from the guidance.

I would send it to peer review. It addresses a practical problem with honest discussion of limits even if the reported gain needs tighter quantification.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes combining post-stratification with CUPED, using pre-experiment covariates, to reduce variance of heavy-tailed monetization metrics (e.g., revenue) in ranking A/B experiments. It reports a deployment at ShareChat in which the method yields equivalent statistical power with ~45% less traffic than unadjusted metrics and discusses design choices, guardrails, and limitations.

Significance. If the empirical traffic-reduction result holds under the stated conditions, the framework offers a deployable way to increase experiment sensitivity for heavy-tailed outcomes common in recommendation systems without requiring extra traffic. The real-world deployment at ShareChat supplies concrete evidence of improved decision stability, which is a positive feature for applied work in this area.

major comments (2)

[Abstract / Results] Abstract and Results section: the central claim of equivalent confidence with ~45% less traffic is presented without the observed R² (or equivalent correlation measure) between the chosen pre-experiment covariates and the heavy-tailed monetization outcome, nor any sensitivity plot showing how the traffic saving changes as that correlation weakens. In heavy-tailed settings the variance is dominated by tail observations, so the reduction is load-bearing on this correlation; its absence prevents verification that the reported saving is consistent with the data.
[§4] §4 (Empirical Evaluation): no table or figure reports the stratum-level variance contributions or the effective sample-size multiplier achieved by post-stratification, making it impossible to isolate how much of the 45% saving is attributable to stratification versus CUPED or to confirm that treatment-induced selection bias was avoided.
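The decomposition requested in the second comment could take roughly the following shape. This is a sketch on synthetic data; the function name, the definition of the effective sample-size multiplier as the raw-to-post-stratified variance ratio, and the stratum cut points are assumptions for illustration, not quantities reported in the paper.

```python
import numpy as np

def stratum_variance_report(y, treated, strata):
    """Per-stratum (label, weight, variance contribution) and the raw/post-stratified variance ratio."""
    n = len(y)
    rows = []
    for s in np.unique(strata):
        in_s = strata == s
        y_t, y_c = y[in_s & treated], y[in_s & ~treated]
        w = in_s.sum() / n
        contrib = w**2 * (y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c))
        rows.append((int(s), w, contrib))
    var_ps = sum(c for _, _, c in rows)
    var_raw = y[treated].var(ddof=1) / treated.sum() + y[~treated].var(ddof=1) / (~treated).sum()
    return rows, var_raw / var_ps  # ratio > 1 means fewer users are needed for the same precision

# Synthetic A/A data (assumption): strata come from pre-period spend x only.
rng = np.random.default_rng(1)
n = 20_000
x = rng.lognormal(0.0, 1.5, n)
y = 0.8 * x + rng.lognormal(0.0, 1.0, n)
treated = rng.random(n) < 0.5
strata = np.digitize(x, np.quantile(x, [0.5, 0.9, 0.99]))

rows, multiplier = stratum_variance_report(y, treated, strata)
total = sum(c for _, _, c in rows)
for label, weight, contrib in rows:
    print(f"stratum {label}: weight={weight:.3f}, share of variance={contrib / total:.3f}")
print(f"effective sample-size multiplier: {multiplier:.2f}")
```

Running the same report on the raw and the CUPED-adjusted outcome would separate the stratification gain from the CUPED gain, which is the attribution the comment asks for.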

minor comments (2)

[§3] The notation for the post-stratification weights and the CUPED adjustment could be unified in a single equation block to improve readability.
[Figures] Figure captions should explicitly state the traffic volume and number of experiments underlying the 45% figure.
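For the first minor comment, a unified block might look like the following. The notation is generic textbook notation chosen here for illustration (K strata, n_k users in stratum k out of n, arms T and C, pre-experiment covariate X); it is not the paper's own.

```latex
\begin{aligned}
\tilde{Y}_i &= Y_i - \theta\,(X_i - \bar{X}), \qquad
\theta = \frac{\widehat{\operatorname{Cov}}(Y, X)}{\widehat{\operatorname{Var}}(X)}, \\[4pt]
\hat{\tau}_{\text{PS}} &= \sum_{k=1}^{K} \frac{n_k}{n}\,\bigl(\bar{\tilde{Y}}_{T,k} - \bar{\tilde{Y}}_{C,k}\bigr), \\[4pt]
\widehat{\operatorname{Var}}\bigl(\hat{\tau}_{\text{PS}}\bigr) &= \sum_{k=1}^{K} \Bigl(\frac{n_k}{n}\Bigr)^{2}
\left(\frac{s^{2}_{T,k}}{n_{T,k}} + \frac{s^{2}_{C,k}}{n_{C,k}}\right).
\end{aligned}
```

Here s^2_{T,k} and s^2_{C,k} denote the within-stratum sample variances of the adjusted outcome in each arm.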

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our empirical results. We address each major comment below and have revised the manuscript to incorporate the requested details.


Referee: [Abstract / Results] Abstract and Results section: the central claim of equivalent confidence with ~45% less traffic is presented without the observed R² (or equivalent correlation measure) between the chosen pre-experiment covariates and the heavy-tailed monetization outcome, nor any sensitivity plot showing how the traffic saving changes as that correlation weakens. In heavy-tailed settings the variance is dominated by tail observations, so the reduction is load-bearing on this correlation; its absence prevents verification that the reported saving is consistent with the data.

Authors: We agree that the strength of the covariate-outcome relationship is central to interpreting variance reduction for heavy-tailed metrics. In the revised manuscript we report the observed R² between the pre-experiment covariates and the monetization metric in the Results section and add a sensitivity plot that shows how the traffic savings vary with weaker correlations. These additions allow readers to verify that the reported ~45% saving is consistent with the observed correlation under the heavy-tailed regime.

Revision: yes

Referee: [§4] §4 (Empirical Evaluation): no table or figure reports the stratum-level variance contributions or the effective sample-size multiplier achieved by post-stratification, making it impossible to isolate how much of the 45% saving is attributable to stratification versus CUPED or to confirm that treatment-induced selection bias was avoided.

Authors: We thank the referee for highlighting the need for greater decomposition of the variance reduction. We have added a table in Section 4 that reports stratum-level variance contributions and the effective sample-size multiplier attributable to post-stratification. This makes it possible to separate the contributions of stratification from CUPED. Because the strata are formed exclusively from pre-experiment covariates, treatment assignment remains randomized within strata and no selection bias is introduced; we have expanded the text to state this explicitly.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard post-stratification and CUPED techniques


The paper applies established variance-reduction methods (post-stratification combined with CUPED) to heavy-tailed monetization metrics using pre-experiment covariates. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The ~45% traffic reduction claim is presented as an empirical outcome from deployment rather than a derived identity. The approach is self-contained against external benchmarks for these standard techniques.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5688 in / 959 out tokens · 24684 ms · 2026-06-28T11:20:42.606265+00:00 · methodology


