Choosing Online Experiment Designs under Interference in Ads, Recommendations, and Member-Experience Systems

Caroline Howard; Prashant Shekhar

arxiv: 2605.25290 · v1 · pith:F5XUHNPKnew · submitted 2026-05-24 · 📊 stat.ML · cs.LG


Pith reviewed 2026-06-29 23:26 UTC · model grok-4.3


keywords: experiment design, interference, robust selection, Wasserstein distance, ambiguity set, online experiments, exposure mechanisms


The pith

A selector ranks experiment designs by their worst-case planning risk when interference mechanisms remain unknown.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online experiments in ads and recommendations often must be designed before the dominant form of interference through budgets, graphs, or carryover is known. The paper treats design choice as a robust decision problem over an ambiguity set of possible exposure mechanisms. It supplies a selector that scores each candidate design from a fixed catalog using a composite risk that includes bias, variance, detectability, cost, and mismatch. A supporting guarantee bounds design bias by Wasserstein distance to the launch exposure distribution and shows the bound is tight under Lipschitz response. The same selector produces different rankings on public datasets from advertising and recommendation platforms.

Core claim

Given a finite catalog of six implementable designs, the selector compares each design by worst-case planning risk over an ambiguity set. The risk combines exposure bias, assignment-unit variance, minimum detectable effect, contamination or carryover, operational cost, and estimand mismatch. Design bias is bounded by Wasserstein distance to the launch exposure distribution, and this penalty is minimax tight under Lipschitz exposure response. The paper also proves finite-catalog approximation and a robust selector theorem with excess-risk control, exact recovery under separation, and certified shortlists when the risk surface is flat.
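The selection rule in this claim can be sketched in a few lines. This is an illustration only: the weighted-sum form of the composite risk, the component names, the weights, the mechanism labels, and every number below are assumptions made for the example, not values or definitions taken from the paper.

```python
from typing import Dict, Tuple

# Hypothetical component names; the paper lists six risk components but
# this summary does not give their exact definitions or weights.
COMPONENTS = ["bias", "variance", "mde", "contamination", "cost", "mismatch"]

# Assumed equal weights (illustrative only).
WEIGHTS = {name: 1.0 for name in COMPONENTS}

# Toy catalog: design -> mechanism in the ambiguity set -> normalized components.
# All numbers are made up for the example.
CATALOG = {
    "user_randomization": {
        "graph_spillover": dict(bias=0.6, variance=0.1, mde=0.1,
                                contamination=0.3, cost=0.1, mismatch=0.1),
        "carryover": dict(bias=0.2, variance=0.1, mde=0.1,
                          contamination=0.2, cost=0.1, mismatch=0.1),
    },
    "cluster_randomization": {
        "graph_spillover": dict(bias=0.1, variance=0.4, mde=0.3,
                                contamination=0.1, cost=0.2, mismatch=0.1),
        "carryover": dict(bias=0.2, variance=0.4, mde=0.3,
                          contamination=0.2, cost=0.2, mismatch=0.1),
    },
}


def planning_risk(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Composite risk as a weighted sum of components (an assumed form)."""
    return sum(weights[name] * components[name] for name in COMPONENTS)


def worst_case_risk(per_mechanism: Dict[str, Dict[str, float]],
                    weights: Dict[str, float]) -> float:
    """Largest planning risk over the mechanisms in the ambiguity set."""
    return max(planning_risk(c, weights) for c in per_mechanism.values())


def select_design(catalog, weights) -> Tuple[str, Dict[str, float]]:
    """Return the design with the smallest worst-case risk, plus all scores."""
    scores = {d: worst_case_risk(m, weights) for d, m in catalog.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = select_design(CATALOG, WEIGHTS)
    print(best, scores)
```

In this toy catalog, user randomization is worse under graph spillover but its worst case is still lower than the cluster design's worst case, so the min-max rule prefers it. Changing the assumed weights or the mechanisms in the set can reverse the ranking, which is the behavior the paper reports across datasets.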

What carries the argument

The robust design selector that evaluates each candidate by its worst-case planning risk over an ambiguity set of exposure mechanisms, with a geometry-aware bound via Wasserstein distance.
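In symbols, the two pieces can be written as follows. The notation is assumed for this summary: $\mathcal{D}$ is the six-design catalog, $\Theta$ the ambiguity set, $R(d,\theta)$ the planning risk, $f$ the exposure response with Lipschitz constant $L$, and $P^{\theta}_{d}$, $P^{\theta}_{\star}$ the exposure distributions under design $d$ and at launch. The paper's own statements may carry additional terms or conditions.

```latex
% Robust selector: minimize worst-case planning risk over the ambiguity set.
\hat{d} \in \arg\min_{d \in \mathcal{D}} \; \sup_{\theta \in \Theta} R(d, \theta),
\qquad |\mathcal{D}| = 6 .

% Transport bias bound for an L-Lipschitz exposure response f.
\Bigl| \mathbb{E}_{E \sim P^{\theta}_{d}}[f(E)] - \mathbb{E}_{E \sim P^{\theta}_{\star}}[f(E)] \Bigr|
\;\le\; L \, W_1\!\bigl(P^{\theta}_{d}, P^{\theta}_{\star}\bigr).
```

The inequality is the standard consequence of Kantorovich–Rubinstein duality: $W_1(P,Q)$ equals the supremum of $\mathbb{E}_P[g] - \mathbb{E}_Q[g]$ over 1-Lipschitz $g$, and $f/L$ is 1-Lipschitz. Tightness means some Lipschitz response attains equality, so the penalty cannot be reduced without further assumptions on $f$.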

If this is right

Design bias is bounded by Wasserstein distance to the launch exposure distribution.
The bound is minimax tight under Lipschitz exposure response.
The selector achieves excess-risk control and exact recovery under separation.
Certified shortlists are produced when the risk surface is flat.
Different designs are selected on samples from Criteo, Open Bandit, and KuaiRand datasets.
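The exact-recovery and shortlist consequences can be illustrated with a simple thresholding rule. This is a sketch under an assumption: every estimated worst-case risk is within a known tolerance `eps` of its true value. The `2 * eps` threshold follows from that assumption and is not necessarily the certificate used in the paper. The design names and scores are invented.

```python
from typing import Dict, List


def certified_shortlist(scores: Dict[str, float], eps: float) -> List[str]:
    """Designs whose estimated worst-case risk is within 2*eps of the best.

    If every estimate is within eps of its true value, the truly optimal
    design is guaranteed to be in this list. When the gap between the best
    and second-best estimate exceeds 2*eps (separation), the list has one
    element, which corresponds to exact recovery.
    """
    if eps < 0:
        raise ValueError("eps must be non-negative")
    best = min(scores.values())
    return sorted(d for d, s in scores.items() if s <= best + 2 * eps)


if __name__ == "__main__":
    # Hypothetical estimated worst-case risks for three designs.
    toy_scores = {"design_a": 1.0, "design_b": 1.05, "design_c": 1.5}
    print(certified_shortlist(toy_scores, eps=0.05))  # flat region: two survivors
    print(certified_shortlist(toy_scores, eps=0.01))  # separated: one survivor
```

The rule returns a shortlist rather than forcing a single answer when the estimates cannot distinguish the leading designs, which matches the "justified design choice or uncertainty shortlist" output described in the abstract.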

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Historical logging data could be used to refine the ambiguity set and produce sharper design rankings.
The selector could be extended to sequential re-selection as exposure observations accumulate during the experiment.
Analogous robust selection may apply to policy experiments in networked economic or supply-chain settings.
If exposure responses in practice satisfy the Lipschitz condition, the tightness result would directly limit excess risk.

Load-bearing premise

The true exposure mechanism at launch lies inside the ambiguity set used to compute the worst-case planning risk for each design.

What would settle it

Observe a chosen design's realized bias when the actual launch exposure distribution lies outside the ambiguity set and check whether the observed bias exceeds the Wasserstein-derived bound.
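A minimal version of this check is sketched below. It assumes exposure can be summarized as a scalar per unit, that both samples have the same size, and that the response function and its Lipschitz constant are known. All of these are simplifications for illustration, and the sample values are invented.

```python
from typing import Callable, Sequence


def wasserstein_1(xs: Sequence[float], ys: Sequence[float]) -> float:
    """W1 between two equal-size 1D empirical samples (sorted matching)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def mean_response(f: Callable[[float], float], xs: Sequence[float]) -> float:
    """Average response over a sample of exposures."""
    return sum(f(x) for x in xs) / len(xs)


def bias_within_bound(f: Callable[[float], float], lipschitz: float,
                      design: Sequence[float], launch: Sequence[float],
                      tol: float = 1e-12) -> bool:
    """True if |mean f(design) - mean f(launch)| <= L * W1(design, launch)."""
    bias = mean_response(f, design) - mean_response(f, launch)
    return abs(bias) <= lipschitz * wasserstein_1(design, launch) + tol


if __name__ == "__main__":
    design_exposure = [0.1, 0.2, 0.4, 0.5]   # hypothetical exposures under the design
    launch_exposure = [0.3, 0.4, 0.6, 0.7]   # hypothetical exposures at launch
    L = 2.0
    linear = lambda e: L * e                  # attains the bound for a pure shift
    clipped = lambda e: L * min(e, 0.45)      # still L-Lipschitz, but slack
    print(wasserstein_1(design_exposure, launch_exposure))
    print(bias_within_bound(linear, L, design_exposure, launch_exposure))
    print(bias_within_bound(clipped, L, design_exposure, launch_exposure))
```

For a response that really is L-Lipschitz the check cannot fail, whatever the two distributions are. A violation in real data would therefore point to a non-Lipschitz response or an understated constant, while a launch distribution outside the ambiguity set shows up as a transport distance larger than anything the selector planned for.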

Figures

Figures reproduced from arXiv: 2605.25290 by Caroline Howard, Prashant Shekhar.

**Figure 1.** Regime-transition generated from the proposed experiment-design framework. The diagnostic uses the same six implementable designs (shown at the bottom), normalized risk components, and risk weights used in the empirical selector, while sweeping a controlled exposure-mechanism intensity γ from weak row-local interference to mixed spillover, clustered spillover, and carryover-dominant interference. The dashe…

**Figure 2.** Cross-domain robust design recommendations. The figure compares the main real-data cases using the same design catalog and planning-risk components. The selected design changes across domains: user randomization for Criteo, switchbacks for Open Bandit bts/men, and cluster randomization for KuaiRand.

**Figure 3.** Robust-risk rankings for the three main empirical cases. Each panel ranks the six implementable designs by estimated worst-case planning risk over the calibrated ambiguity set. The figure explains the decisions in …

**Figure 4.** Open Bandit propensity diagnostics. The randomized slice has constant propensities and full effective sample size, while the adaptive bts/men slice has highly variable propensities and an IPS effective-sample share of 5.17%. This figure supports the main Open Bandit design-selection result by showing why the adaptive slice is the relevant stress test for known but uneven logging support.

**Figure 5.** Detailed robust-risk rankings for the domain cases. Panels (a)–(c) are summarized in …

**Figure 6.** Exposure–variance frontiers for the domain cases. The horizontal axis is the exposure-distance proxy and the vertical axis is assignment-unit variance. Color encodes planning MDE, while marker size is proportional to the pre-specified operational-cost score. These plots show the component tradeoffs behind the aggregate robust-risk recommendations.

**Figure 7.** Appendix selector diagnostics. Panel (a) shows a controlled regime-reversal map, where the selected design changes as the dominant exposure mechanism changes. Panel (b) compares empirical and oracle planning risks; the selected design matches the oracle in the diagnostic and satisfies the excess-risk certificate from Theorem 4.5.

Corresponding author: shekharp@erau.edu

**Figure 8.** Theory stress tests. Panel (a) checks the transport bias bound by plotting observed exposure-response bias against $L\,W_1(P^{\theta}_{d}, P^{\theta}_{\star})$. Panel (b) checks minimax tightness by plotting the ratio of the attained Lipschitz gap to the minimax penalty; the curves for different $L$ overlap at one because the constructed response attains equality. Panel (c) compares finite-catalog approximation error with the Lipsch…

read the original abstract

Online experiments in ads, recommendation, and member-experience systems are often planned before the dominant interference mechanism is known. A treatment may propagate through budgets, inventory, producer exposure, graph spillovers, or temporal carryover, making the randomization design itself a statistical decision. We formulate this problem as robust design selection over uncertain exposure mechanisms. Given a finite catalog of six implementable designs, the selector compares each design by worst-case planning risk over an ambiguity set. The risk combines exposure bias, assignment-unit variance, minimum detectable effect, contamination or carryover, operational cost, and estimand mismatch. For theoretical justification, the paper develops a geometry-aware guarantee, stating that design bias is bounded by Wasserstein distance to the launch exposure distribution, and this penalty is minimax tight under Lipschitz exposure response. We also prove finite-catalog approximation and a robust selector theorem with excess-risk control, exact recovery under separation, and certified shortlists when the risk surface is flat. Empirically, the same selector gives different recommendations across samples from public datasets. It selects user-randomization on Criteo ads with dimensionless robust risk 1.295, switchbacks on Open Bandit-bts/men with risk 2.105, and cluster-randomization on KuaiRand with risk 2.240. The Open Bandit case stresses known but uneven logging support, with propensities from 0.00006 to 0.594 and a 5.17% IPS effective-sample share. Overall, the paper contributes an interference-aware experiment design framework based on mechanism-robust design decisions, where the output is either a justified design choice or an uncertainty shortlist.
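The "IPS effective-sample share" quoted for Open Bandit can be read as an effective sample size divided by the number of logged rows. The sketch below uses the Kish effective sample size with inverse-propensity weights. That definition is an assumption here, since the abstract does not spell out the formula. The propensity values are illustrative and do not reproduce the reported 5.17%.

```python
from typing import Sequence


def ips_effective_sample_share(propensities: Sequence[float]) -> float:
    """Kish effective sample size of 1/propensity weights, as a share of n.

    share = (sum w)^2 / (n * sum w^2), with w_i = 1 / p_i.
    Equals 1.0 for constant propensities and approaches 1/n when a single
    tiny propensity dominates the weights.
    """
    if len(propensities) == 0:
        raise ValueError("need at least one propensity")
    if any(p <= 0.0 or p > 1.0 for p in propensities):
        raise ValueError("propensities must lie in (0, 1]")
    weights = [1.0 / p for p in propensities]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (len(weights) * total_sq)


if __name__ == "__main__":
    uniform_logging = [0.5, 0.5, 0.5, 0.5]           # randomized slice: full share
    uneven_logging = [0.00006, 0.594, 0.3, 0.1]      # illustrative adaptive slice
    print(ips_effective_sample_share(uniform_logging))
    print(ips_effective_sample_share(uneven_logging))
```

Under this reading, constant propensities give a share of one, consistent with the Figure 4 description of the randomized slice, while a range as wide as 0.00006 to 0.594 lets a few rows dominate the weighted estimate and pulls the share toward its floor.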

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a selector for picking among a handful of fixed experiment designs when the interference mechanism is unknown, using worst-case risk over an ambiguity set plus Wasserstein bounds.


The core contribution is a practical selector that ranks six implementable designs by their worst-case planning risk when the exposure mechanism at launch is uncertain. It folds in bias, variance, MDE, carryover, cost, and estimand mismatch, then outputs either a single choice or a shortlist.

What stands out is the geometry-aware guarantee: design bias is controlled by Wasserstein distance to the true launch distribution, and that bound is shown to be minimax tight under Lipschitz response. They also give finite-catalog approximation results, excess-risk control, and exact recovery under separation. These are standard robust-optimization moves but applied cleanly to the design-selection problem.

The empirical section is straightforward and useful: the selector recommends user randomization on Criteo, switchbacks on Open Bandit, and cluster randomization on KuaiRand, with the numbers reflecting dataset-specific features like uneven propensities. That matches what practitioners actually face.

The main soft spot is the standing assumption that the true mechanism sits inside the user-chosen ambiguity set; if it does not, the guarantees do not apply. That is explicit and common in this literature, not a hidden flaw. The catalog is deliberately small, so the work is about disciplined choice among known options rather than open-ended design invention. No derivation gaps or circularity are visible from the description.

This is for people running large-scale online experiments who already have a short list of feasible designs and want a defensible way to pick under interference uncertainty. It deserves a serious referee because the theoretical pieces are grounded and the empirical illustration is concrete.

Referee Report

0 major / 3 minor

Summary. The paper formulates online experiment design selection under uncertain interference as a robust optimization problem over an ambiguity set of exposure mechanisms. Given a finite catalog of six designs, a selector ranks them by worst-case planning risk that aggregates exposure bias, assignment variance, minimum detectable effect, contamination/carryover, operational cost, and estimand mismatch. Theoretical results include a Wasserstein-distance bound on design bias that is minimax-tight under Lipschitz exposure response, plus a robust selector theorem establishing finite-catalog approximation, excess-risk control, exact recovery under separation, and certified shortlists on flat risk surfaces. Empirical application to Criteo, Open Bandit, and KuaiRand datasets yields design-specific recommendations (user randomization on Criteo with robust risk 1.295; switchbacks on Open Bandit with risk 2.105; cluster randomization on KuaiRand with risk 2.240).

Significance. If the stated guarantees hold inside the user-specified ambiguity set, the framework supplies a principled, geometry-aware method for choosing among implementable designs when the dominant interference channel is unknown at planning time. The combination of Wasserstein bias bounds, minimax tightness, and excess-risk control is a clear technical contribution; the empirical selector outputs on public datasets with realistic propensity ranges further illustrate practical utility.

minor comments (3)

The abstract states that design bias is bounded by Wasserstein distance to the launch exposure distribution and that the penalty is minimax tight under Lipschitz response; the main text should explicitly locate these statements (theorem or proposition number) and confirm that the Lipschitz constant is treated as known or estimated.
Clarify the precise definition of the six-design catalog and how each design maps to the components of the planning-risk objective (especially estimand mismatch and carryover terms).
The Open Bandit example reports propensities ranging from 0.00006 to 0.594 and a 5.17% IPS effective-sample share; state whether these quantities are used directly in the ambiguity-set construction or only for post-selection diagnostics.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the clear summary of the robust design selector, and the recommendation for minor revision. The report correctly identifies the core contributions: the Wasserstein bias bound that is minimax-tight under Lipschitz exposure response, the robust selector theorem with its finite-catalog, excess-risk, and exact-recovery guarantees, and the dataset-specific design recommendations. No major comments requiring clarification or correction were raised.

Circularity Check

0 steps flagged

No significant circularity


The derivation chain consists of a standard robust-optimization formulation: an ambiguity set is user-specified, worst-case risk is computed over it, and all stated guarantees (Wasserstein bias bound, minimax tightness under Lipschitz response, excess-risk control, exact recovery under separation) are explicitly conditional on the true launch mechanism belonging to that set. This is an external modeling assumption rather than a self-referential definition or fitted quantity renamed as a prediction. No self-citation is invoked as a load-bearing uniqueness theorem, no ansatz is smuggled, and the finite-catalog selector and empirical results on public datasets are presented as separate evaluations. The central claims therefore remain independent of their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of a well-specified ambiguity set containing the true exposure mechanism and on the Lipschitz continuity of the exposure response function to establish minimax tightness of the Wasserstein bound.

axioms (1)

domain assumption: Exposure response function is Lipschitz continuous.
Invoked to establish that the Wasserstein penalty is minimax tight.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Privacy-Robust Incrementality Measurement for Advertising Systems under Signal Loss
stat.ML 2026-06 unverdicted novelty 7.0

Formulates privacy-constrained advertising measurement as a robust causal decision problem under signal loss and derives a sharp decision frontier separating certifiable from unresolved incrementality claims.

Reference graph

Works this paper leans on

15 extracted references · 14 canonical work pages · cited by 1 Pith paper

[1] Peter M. Aronow and Cyrus Samii. Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics, 11(4):1912–1947.

[2] Patrick Bajari, Brian Burdick, Guido W. Imbens, Lorenzo Masoero, James McQueen, Thomas S. Richardson, and Ido Rosen. Multiple randomization designs. arXiv preprint arXiv:2112.13495.

[3] doi: 10.1287/mnsc.2022.4583. Jiawei Chen, Chongming Gao, Shijun Li, Yuan Zhang, Biao Li, Wenqiang Lei, Peng Jiang, and Xiangnan He. Kuairand: An unbiased sequential recommendation dataset with randomly exposed videos. arXiv preprint arXiv:2208.08696.

[4] doi: 10.1515/jci-2015-0021. Zahra Fatemi, Jean Pouget-Abadie, and Elena Zheleva. Cascade-based randomization for inferring causal effects under diffusion interference. In Proceedings of the International AAAI Conference on Web and Social Media, volume 18, pages 394–407.

[5] David Holtz and Sinan Aral. Limiting bias from test-control interference in online marketplace experiments. arXiv preprint arXiv:2004.12162.

[6] David Holtz, Ruben Lobel, Inessa Liskovich, and Sinan Aral. Reducing interference bias in online marketplace pricing experiments. arXiv preprint arXiv:2004.12489.

[7] Yiming Jiang and He Wang. Causal inference under network interference using a mixture of randomized experiments. arXiv preprint arXiv:2309.00141.

[8] doi: 10.1287/mnsc.2021

[9] Hannah Li, Geng Zhao, and Ramesh Johari. Interference, bias, and variance in two-sided marketplace experimentation: Guidance for platforms. In Proceedings of the ACM Web Conference 2022, pages 182–192, 2022a. doi: 10.1145/3485447.3512063. Qike Li, Samir Jamkhande, Pavel Kochetkov, and Pai Liu. Assign experiment variants at scale in online controlled experi...

[10] Min Liu, Jialiang Mao, and Kang Kang. Trustworthy online marketplace experimentation with budget-split design. arXiv preprint arXiv:2012.08724.

[11] Paul Missault, Lorenzo Masoero, Christian Delbé, Thomas Richardson, and Guido Imbens. Robust and efficient multiple-unit switchback experimentation. arXiv preprint arXiv:2506.12654.

[12] Johan Ugander and Hao Yin. Randomized graph cluster randomization. arXiv preprint arXiv:2009.02297.

[13] doi: 10.1145/2487575.2487695. Davide Viviano, Lihua Lei, Guido Imbens, Brian Karrer, Okke Schrijvers, and Liang Shi. Causal clustering: Design of cluster experiments under network interference. arXiv preprint arXiv:2310.14983.

[14] Mind: A large-scale dataset for news recommendation. doi: 10.18653/v1/2020.acl-main.331. Christina Lee Yu, Edoardo M. Airoldi, Christian Borgs, and Jennifer T. Chayes. Estimating total treatment effect in randomized experiments with unknown network structure. arXiv preprint arXiv:2205.12803.

[15] Zhihua Zhu, Zheng Cai, Liang Zheng, and Nian Si. Seller-side experiments under interference induced by feedback loops in two-sided platforms. arXiv preprint arXiv:2401.15811.
