User Sentiment as a Success Metric: Persistent Biases Under Full Randomization

Ercan Yildiz; Joshua Safyan; Marc Harper

arxiv: 1906.10843 · v1 · pith:TVKZADNWnew · submitted 2019-06-26 · 📊 stat.ME · stat.AP

Pith reviewed 2026-05-25 15:38 UTC · model grok-4.3

keywords: user sentiment · A/B testing · missing data · causal inference · treatment effects · response bias · randomized experiments · selection bias

The pith

Optional surveys in fully randomized A/B tests produce biased user sentiment estimates because response rates depend on both treatment and user covariates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines user sentiment collected through optional surveys as an outcome metric in randomized experiments. Because both treatment assignment and user covariates can influence who responds, the observed sentiment differences mix the true treatment effect with selection bias. The authors map this setup to the combined problems of missing data and observational causal inference, then derive conditions that permit consistent estimation of average and local treatment effects among treated and respondent users. Simulation results indicate that these corrected estimators recover the intended quantities, while more elaborate models do not always outperform simpler ones.

Core claim

When sentiment is measured only among respondents in a fully randomized A/B test, the observed treatment effect on respondents is not in general equal to the average treatment effect on the treated; consistent estimators for both quantities exist once the problem is recognized as the intersection of a missing-data problem and an observational causal-inference problem, provided the relevant identification conditions hold.
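
One way to see why the two quantities differ is to write the respondent-only contrast in potential-outcome notation. The sketch below borrows Y(t) and Δ(t) from the figure captions (potential sentiment and potential response under arm t). The decomposition is a standard add-and-subtract step offered as an illustration, not a formula quoted from the paper, and the labels under the braces are descriptive names chosen here.

```latex
% Randomization gives T \perp (Y(t), \Delta(t)), so the observed respondent means satisfy
% E[Y \mid T=t, \Delta=1] = E[Y(t) \mid \Delta(t)=1].
\begin{align*}
\underbrace{E[Y \mid T=1, \Delta=1] - E[Y \mid T=0, \Delta=1]}_{\text{respondent-only contrast}}
  &= E[Y(1) \mid \Delta(1)=1] - E[Y(0) \mid \Delta(0)=1] \\
  &= \underbrace{E[Y(1) - Y(0) \mid \Delta(1)=1]}_{\text{effect among users who respond under treatment}}
   + \underbrace{E[Y(0) \mid \Delta(1)=1] - E[Y(0) \mid \Delta(0)=1]}_{\text{selection term}}
\end{align*}
```

The selection term is zero if Δ(1) = Δ(0) for every user, that is, if treatment does not change who responds. Otherwise it contaminates the contrast even though T is randomized.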

What carries the argument

The explicit mapping of the optional-survey problem onto the intersection of missing-data and observational causal-inference frameworks, which supplies the identification conditions for consistent estimators of average and local treatment effects on treated and respondent users.

If this is right

Average treatment effects on the treated can be recovered consistently from respondent sentiment data.
Local treatment effects among respondents can be recovered consistently under the same identification conditions.
Simpler estimators can achieve performance comparable to more complex models in finite samples.
The bias from differential response persists even when treatment assignment is fully randomized.
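
The last point can be illustrated with a small simulation. This is a minimal sketch under assumptions chosen here, not the paper's simulation design: a single binary covariate `x`, a constant true effect of 0.5, and hypothetical response rates that depend on both arm and covariate but not on sentiment itself. The correction shown is plain inverse-probability weighting with response rates estimated per (arm, covariate) cell, which may differ from the estimators the authors propose.

```python
import numpy as np


def simulate(n=200_000, true_effect=0.5, seed=0):
    """Fully randomized experiment with optional survey response (illustrative)."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, 0.5, size=n)   # hypothetical binary user covariate
    t = rng.binomial(1, 0.5, size=n)   # treatment assigned by a fair coin flip
    # Sentiment depends on the covariate and the treatment.
    y = 3.0 + 1.0 * x + true_effect * t + rng.normal(0.0, 1.0, size=n)
    # Response probability depends on arm AND covariate (assumed values).
    p_respond = np.where(t == 1, 0.4 + 0.1 * x, 0.1 + 0.4 * x)
    r = rng.binomial(1, p_respond)     # 1 if the user answers the survey
    return x, t, y, r


def naive_difference(t, y, r):
    """Difference in mean sentiment using respondents only."""
    resp = r == 1
    return y[resp & (t == 1)].mean() - y[resp & (t == 0)].mean()


def ipw_difference(x, t, y, r):
    """Inverse-probability-weighted difference with cell-level response rates."""
    means = {}
    for arm in (0, 1):
        num = 0.0
        den = 0.0
        for level in (0, 1):
            cell = (t == arm) & (x == level)
            rate = r[cell].mean()          # estimated response propensity
            resp = cell & (r == 1)
            num += y[resp].sum() / rate    # only observed sentiment is used
            den += resp.sum() / rate
        means[arm] = num / den
    return means[1] - means[0]


if __name__ == "__main__":
    x, t, y, r = simulate()
    print("naive:", round(naive_difference(t, y, r), 3))
    print("ipw:  ", round(ipw_difference(x, t, y, r), 3))
```

With these assumed response rates the respondent-only contrast converges to 0.5 - 5/18, roughly 0.22, because control respondents are drawn more heavily from the high-sentiment group. The weighted estimate should land near the true 0.5 in large samples.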

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same mapping could be used to correct other optional feedback metrics such as ratings or open-text responses.
Product teams that ignore response bias may systematically over- or under-estimate the impact of feature changes.
Empirical checks for the identification conditions could be added as routine diagnostics in A/B testing platforms.
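
As a rough idea of what such a routine diagnostic might look like, the sketch below tests whether the survey response rate differs between arms using a pooled two-proportion z-test. The function name and inputs are hypothetical. The check only flags differential response; it does not verify any conditional-independence assumption, and a non-significant result does not rule out selection on covariates.

```python
import math


def response_rate_ztest(resp_treat, n_treat, resp_ctrl, n_ctrl):
    """Two-sided pooled z-test for a difference in response rates between arms."""
    p1 = resp_treat / n_treat
    p0 = resp_ctrl / n_ctrl
    pooled = (resp_treat + resp_ctrl) / (n_treat + n_ctrl)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_treat + 1.0 / n_ctrl))
    z = (p1 - p0) / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p1 - p0, z, p_value


if __name__ == "__main__":
    # Made-up counts: 450 of 1000 treated users respond, 300 of 1000 controls.
    diff, z, p_value = response_rate_ztest(450, 1000, 300, 1000)
    print(f"rate difference={diff:.3f}, z={z:.2f}, p={p_value:.2e}")
```

A small p-value here suggests that treatment changes who responds, which is the situation in which a respondent-only comparison is most at risk.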

Load-bearing premise

The identification conditions that turn the missing-data-plus-causal-inference mapping into consistent estimators must actually be satisfied by the data-generating process.
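
To make the premise concrete, the Figure 2 caption describes Assumption 1 as conditional independence of sentiment and response given covariates. The derivation below is the standard inverse-probability-weighting argument under that condition; it is a sketch, not the paper's own proof. Positivity of the response propensity π_t and independence of T from X(t) under randomization are added here as assumptions.

```latex
% Assumption 1 (as described in the Figure 2 caption), for each arm t:
\[
Y(t) \perp \Delta(t) \mid X(t)
\]
% Added assumption (positivity): \pi_t(x) = P(\Delta(t)=1 \mid X(t)=x) > 0 for all x.
\begin{align*}
E\!\left[\frac{\Delta\, Y}{\pi_t(X)} \,\middle|\, T=t\right]
  &= E\!\left[\frac{\Delta(t)\, Y(t)}{\pi_t(X(t))}\right]
     && \text{(randomization)} \\
  &= E\!\left[\frac{E[\Delta(t) \mid X(t)]\; E[Y(t) \mid X(t)]}{\pi_t(X(t))}\right]
     && \text{(Assumption 1)} \\
  &= E\big[E[Y(t) \mid X(t)]\big] = E[Y(t)].
\end{align*}
```

Because ΔY is zero for non-respondents, the left-hand side uses only observed sentiment. If response depends on sentiment through a path that X does not block, the second step fails and the weighted mean no longer recovers E[Y(t)].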

What would settle it

An experiment or simulation in which response probability depends on treatment and covariates, the true treatment effects are known, and the proposed estimators fail to recover those known effects.

Figures

Figures reproduced from arXiv: 1906.10843 by Ercan Yildiz, Joshua Safyan, Marc Harper.

**Figure 1.** Sample UI-based survey.

**Figure 2.** Causal graph representing Assumption 1. X(T) influences both user sentiment Y(T) and propensity to respond ∆(T). There is no other path between user sentiment and propensity to respond. Therefore, Y(T) and ∆(T) are independent conditioned on X(T).

**Figure 3.** Causal graph representing Assumption 2. X influences both user sentiment under control, Y(T = 0), and propensity to respond under both treatment and control conditions, ∆(T = 0) and ∆(T = 1). There is no other path between user sentiment and propensity to respond. Therefore, Y(T = 0) and ∆(T = 0), ∆(T = 1) are independent conditioned on X.

**Figure 4.** Causal graph representing our data generation process. Latent vari…

**Figure 5.** Performance of different ATE estimators when true confounders are…

**Figure 6.** Performance of different ATE estimators when noisy confounders…

**Figure 7.** Performance of different ATETR estimators when true confounders…

**Figure 8.** Performance of different ATETR estimators when noisy confounders…

Original abstract

We study user sentiment (reported via optional surveys) as a metric for fully randomized A/B tests. Both user-level covariates and treatment assignment can impact response propensity. We propose a set of consistent estimators for the average and local treatment effects on treated and respondent users. We show that our problem can be mapped onto the intersection of the missing data problem and observational causal inference, and we identify conditions under which consistent estimators exist. We evaluate the performance of estimators via simulation studies and find that more complicated models do not necessarily provide superior performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper maps optional sentiment surveys in randomized A/B tests to missing-data plus causal inference and proposes consistent estimators for ATE and LATE on respondent and treated groups.

The main point is that the authors treat non-response in user sentiment surveys as a missing data problem on top of a randomized experiment, then derive conditions for consistent estimators of treatment effects on both the respondent subpopulation and the treated users. They back this with simulations that compare estimator performance and note that more complex models do not always win. That framing and the simulation evidence are the concrete pieces that are new here.

The work sits at the intersection of two established literatures and applies them to a setting that comes up often in online experimentation, so the mapping itself is useful even if the underlying tools are standard. The simulations provide some reassurance that the estimators behave as claimed under the stated conditions. The central claim stands on those identification conditions, which the abstract says the authors derive; nothing in the description points to an internal contradiction or circularity.

A soft spot is that the abstract does not spell out the exact assumptions or show real data examples, so it is hard to judge how often the conditions hold in practice or how sensitive the results are to violations. The paper also stays with simulations rather than external validation.

This is aimed at people who run or analyze A/B tests in digital products and need to handle incomplete sentiment metrics. A reader working on experimental design in that domain would find the estimator proposals and the simulation comparisons worth looking at. It is solid enough on its own terms to deserve a serious referee, even if the conditions turn out to be somewhat restrictive once written out.

Referee Report

0 major / 2 minor

Summary. The manuscript studies biases in user sentiment (from optional surveys) as an outcome metric in fully randomized A/B tests, where both covariates and treatment can affect response propensity. It maps the problem to the intersection of missing-data and observational causal inference settings, proposes consistent estimators for average and local treatment effects on treated and respondent subpopulations, identifies conditions under which such estimators exist, and evaluates finite-sample performance via simulation studies.

Significance. If the identification argument holds, the work supplies a principled correction for non-response bias in a common industry metric. The mapping to established missing-data and causal-inference frameworks, together with the simulation evidence that more complex models do not necessarily outperform simpler ones, would be a useful methodological contribution to the literature on survey non-response in randomized experiments.

minor comments (2)

[Abstract] The identification conditions are asserted but not summarized even at a high level; a one-sentence statement of the key assumptions (e.g., forms of missingness or ignorability) would make the central claim more immediately evaluable.
[Simulation studies] The simulation section reports that 'more complicated models do not necessarily provide superior performance,' but does not specify the exact model classes compared or the metrics used to reach this conclusion; adding a short table or explicit list would strengthen the practical takeaway.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive evaluation of our manuscript. The referee's summary correctly identifies the core contributions: mapping the non-response problem in sentiment surveys to the intersection of missing-data and causal inference frameworks, deriving consistent estimators for ATE and LATE on treated and respondent subpopulations, and providing simulation evidence on estimator performance. We appreciate the recommendation for minor revision.

Circularity Check

0 steps flagged

No significant circularity detected

The central claim maps the sentiment metric problem onto the intersection of missing-data and observational causal inference settings, then identifies conditions for consistent estimators of ATE/LATE on treated and respondent subpopulations. This relies on standard identification results from the external missing-data and causal-inference literature rather than any self-referential fitting, self-definition, or load-bearing self-citation chain. No equations or steps in the provided abstract reduce a prediction or estimator to its own inputs by construction; simulation results are presented only as supporting evidence, not as the proof of consistency. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; paper invokes standard identification assumptions from missing data and observational causal inference without detailing new axioms or parameters here.

axioms (1)

domain assumption: Conditions exist under which consistent estimators for treatment effects on respondents and treated users can be identified from the intersection of missing data and causal inference frameworks.
Abstract states that such conditions are identified but does not specify them; these are required for the consistency claim.

pith-pipeline@v0.9.0 · 5612 in / 1208 out tokens · 18247 ms · 2026-05-25T15:38:53.147198+00:00 · methodology
