
arxiv: 1906.10843 · v1 · pith:TVKZADNWnew · submitted 2019-06-26 · 📊 stat.ME · stat.AP

User Sentiment as a Success Metric: Persistent Biases Under Full Randomization

Pith reviewed 2026-05-25 15:38 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords user sentiment · A/B testing · missing data · causal inference · treatment effects · response bias · randomized experiments · selection bias

The pith

Optional surveys in fully randomized A/B tests produce biased user sentiment estimates because response rates depend on both treatment and user covariates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines user sentiment collected through optional surveys as an outcome metric in randomized experiments. Because both treatment assignment and user covariates can influence who responds, the observed sentiment differences mix the true treatment effect with selection bias. The authors map this setup to the combined problems of missing data and observational causal inference, then derive conditions that permit consistent estimation of average and local treatment effects among treated and respondent users. Simulation results indicate that these corrected estimators recover the intended quantities, while more elaborate models do not always outperform simpler ones.

Core claim

When sentiment is measured only among respondents in a fully randomized A/B test, the observed treatment effect on respondents is not in general equal to the average treatment effect on the treated; consistent estimators for both quantities exist once the problem is recognized as the intersection of a missing-data problem and an observational causal-inference problem, provided the relevant identification conditions hold.

What carries the argument

The explicit mapping of the optional-survey problem onto the intersection of missing-data and observational causal-inference frameworks, which supplies the identification conditions for consistent estimators of average and local treatment effects on treated and respondent users.
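
One standard way to make this mapping concrete is inverse-probability weighting from the missing-data literature. The sketch below is an illustration under stated assumptions, not the paper's own derivation: it takes a pre-treatment covariate vector X, potential sentiment Y(t), a response indicator Δ(t), randomized assignment T, a conditional-independence condition of the kind described for Figure 2, and a response propensity e_t(X) (a symbol introduced here) that is strictly positive.

```latex
% Illustrative assumptions:
%   T \perp (X, Y(t), \Delta(t))            (randomized assignment)
%   Y(t) \perp \Delta(t) \mid X             (response ignorable given X)
%   e_t(X) = P(\Delta(t) = 1 \mid X) > 0    (every stratum can respond)
\begin{align*}
\mathbb{E}\!\left[\frac{\Delta\, Y}{e_t(X)} \,\middle|\, T = t\right]
  &= \mathbb{E}\!\left[\frac{\mathbb{E}[\Delta(t)\, Y(t) \mid X]}{e_t(X)}\right] \\
  &= \mathbb{E}\!\left[\frac{e_t(X)\, \mathbb{E}[Y(t) \mid X]}{e_t(X)}\right]
   = \mathbb{E}[Y(t)], \\[4pt]
\mathbb{E}[Y \mid \Delta = 1,\, T = t]
  &= \frac{\mathbb{E}\big[e_t(X)\, \mathbb{E}[Y(t) \mid X]\big]}{\mathbb{E}[e_t(X)]}.
\end{align*}
```

The last line is the respondent-only mean. It equals E[Y(t)] only when e_t(X) and E[Y(t) | X] are uncorrelated, which is why randomizing T alone does not remove the bias.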

If this is right

  • Average treatment effects on the treated can be recovered consistently from respondent sentiment data.
  • Local treatment effects among respondents can be recovered consistently under the same identification conditions.
  • Simpler estimators can achieve performance comparable to more complex models in finite samples.
  • The bias from differential response persists even when treatment assignment is fully randomized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mapping could be used to correct other optional feedback metrics such as ratings or open-text responses.
  • Product teams that ignore response bias may systematically over- or under-estimate the impact of feature changes.
  • Empirical checks for the identification conditions could be added as routine diagnostics in A/B testing platforms.
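
As one possible shape for such a routine diagnostic, the sketch below compares survey response rates between control and treatment with a pooled two-proportion z-test. The function name `response_rate_ztest` and its arguments are hypothetical, and the check only flags whether treatment shifts response propensity; it does not verify the identification conditions themselves.

```python
import math


def response_rate_ztest(resp_c, n_c, resp_t, n_t):
    """Pooled two-proportion z-test for a difference in response rates.

    resp_c, resp_t: number of survey respondents in control / treatment.
    n_c, n_t:       number of users assigned to control / treatment.
    Returns (z, two_sided_p_value).
    Assumes the pooled rate is strictly between 0 and 1.
    """
    p_c = resp_c / n_c
    p_t = resp_t / n_t
    pooled = (resp_c + resp_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_c + 1.0 / n_t))
    z = (p_t - p_c) / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

A small p-value here suggests that respondents in the two arms are not comparable without some correction; a large one does not rule out covariate-dependent response within each arm.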

Load-bearing premise

The identification conditions that turn the missing-data-plus-causal-inference mapping into consistent estimators must actually be satisfied by the data-generating process.
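
To see why this premise is load-bearing, the following sketch builds a data-generating process that violates it: in the treatment arm, response depends on an unobserved component `u` of sentiment. All names and parameter values are illustrative assumptions, and the weighting step is a generic stratified inverse-probability estimator rather than the paper's.

```python
import numpy as np


def ipw_arm_mean(x, r, y_obs):
    """Weighted mean of observed outcomes in one arm, stratified on x."""
    num = den = 0.0
    for level in np.unique(x):
        cell = x == level
        e_hat = r[cell].mean()        # estimated response rate in this stratum
        resp = cell & r
        num += y_obs[resp].sum() / e_hat
        den += resp.sum() / e_hat
    return num / den


def run_violation(n=200_000, tau=0.5, seed=1):
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n)    # observed binary covariate
    t = rng.integers(0, 2, size=n)    # randomized assignment
    u = rng.normal(0.0, 1.0, size=n)  # unobserved part of sentiment
    y = 3.0 + x + tau * t + u         # true effect is tau
    # Violation: treated users with u > 0 respond more often.
    p_resp = np.where((t == 1) & (u > 0), 0.6, 0.2)
    r = rng.random(n) < p_resp
    y_obs = np.where(r, y, np.nan)    # sentiment is missing for non-respondents
    treated, control = t == 1, t == 0
    return (ipw_arm_mean(x[treated], r[treated], y_obs[treated])
            - ipw_arm_mean(x[control], r[control], y_obs[control]))
```

With these settings the weighted difference lands at roughly 0.9 in expectation instead of the true 0.5, because weighting on `x` cannot undo selection on `u`.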

What would settle it

An experiment or simulation in which response probability depends on treatment and covariates, the true treatment effects are known, and the proposed estimators fail to recover those known effects.
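
A minimal version of such a simulation might look like the sketch below. It assumes a single binary covariate, a known effect `tau`, and response rates that depend on both treatment and the covariate; the estimator is a generic stratified inverse-probability weighting scheme chosen for illustration, not one of the paper's estimators, and all names are hypothetical.

```python
import numpy as np


def simulate(n=200_000, tau=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n)    # observed binary covariate
    t = rng.integers(0, 2, size=n)    # fully randomized assignment
    y = 3.0 + 1.0 * x + tau * t + rng.normal(0.0, 1.0, size=n)
    # Response depends on treatment AND covariate, but not on y given x.
    p_resp = np.where((t == 1) & (x == 1), 0.6, 0.2)
    r = rng.random(n) < p_resp
    y_obs = np.where(r, y, np.nan)    # sentiment is missing for non-respondents
    return x, t, r, y_obs


def naive_diff(t, r, y_obs):
    """Difference in mean sentiment among respondents only."""
    return y_obs[r & (t == 1)].mean() - y_obs[r & (t == 0)].mean()


def ipw_diff(x, t, r, y_obs):
    """Difference in weighted means, weighting by 1 / response rate per (t, x) cell."""
    means = {}
    for arm in (0, 1):
        num = den = 0.0
        for level in (0, 1):
            cell = (t == arm) & (x == level)
            e_hat = r[cell].mean()
            resp = cell & r
            num += y_obs[resp].sum() / e_hat
            den += resp.sum() / e_hat
        means[arm] = num / den
    return means[1] - means[0]


if __name__ == "__main__":
    x, t, r, y_obs = simulate()
    print("naive:", naive_diff(t, r, y_obs))
    print("ipw:  ", ipw_diff(x, t, r, y_obs))
```

In expectation the respondent-only difference here is 0.75 while the weighted difference is the true 0.5. A run in which a correctly specified estimator missed the known effect by more than sampling noise would count as evidence against the claim.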

Figures

Figures reproduced from arXiv: 1906.10843 by Ercan Yildiz, Joshua Safyan, Marc Harper.

Figure 1: Sample UI based survey. In recent years, companies have begun merging experimentation and user sentiment by utilizing sentiment as a success metric for product launches. This has become possible due to three major advantages of UI-based surveys over traditional survey mediums 1) mapping survey responses to treatment exposure is straightforward as surveys are attached to experiences, 2) capturing user sent…
Figure 2: Causal graph representing Assumption 1. X(T) influences both user sentiment Y(T) and propensity to respond ∆(T). There is no other path between user sentiment and propensity to respond. Therefore, Y(T) and ∆(T) are independent conditioned on X(T). … are required to extrapolate from the observed responses to the overall population. Such assumptions may be problematic given that average online surveys have…
Figure 3: Causal graph representing Assumption 2. X influences both user sentiment under control Y(T = 0), and propensity to respond under both treatment and control conditions ∆(T = 0), ∆(T = 1). There is no other path between user sentiment and propensity to respond. Therefore, Y(T = 0) and ∆(T = 0), ∆(T = 1), are independent conditioned on X. Assumption 2: There exists a set of observed variables X_i such that: …
Figure 4: Causal graph representing our data generation process. Latent vari…
Figure 5: Performance of different ATE estimators when true confounders are…
Figure 6: Performance of different ATE estimators when noisy confounders…
Figure 7: Performance of different ATETR estimators when true confounders…
Figure 8: Performance of different ATETR estimators when noisy confounders…
Original abstract

We study user sentiment (reported via optional surveys) as a metric for fully randomized A/B tests. Both user-level covariates and treatment assignment can impact response propensity. We propose a set of consistent estimators for the average and local treatment effects on treated and respondent users. We show that our problem can be mapped onto the intersection of the missing data problem and observational causal inference, and we identify conditions under which consistent estimators exist. We evaluate the performance of estimators via simulation studies and find that more complicated models do not necessarily provide superior performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript studies biases in user sentiment (from optional surveys) as an outcome metric in fully randomized A/B tests, where both covariates and treatment can affect response propensity. It maps the problem to the intersection of missing-data and observational causal inference settings, proposes consistent estimators for average and local treatment effects on treated and respondent subpopulations, identifies conditions under which such estimators exist, and evaluates finite-sample performance via simulation studies.

Significance. If the identification argument holds, the work supplies a principled correction for non-response bias in a common industry metric. The mapping to established missing-data and causal-inference frameworks, together with the simulation evidence that more complex models do not necessarily outperform simpler ones, would be a useful methodological contribution to the literature on survey non-response in randomized experiments.

minor comments (2)
  1. [Abstract] The identification conditions are asserted but not summarized even at a high level; a one-sentence statement of the key assumptions (e.g., forms of missingness or ignorability) would make the central claim more immediately evaluable.
  2. [Simulation studies] The simulation section reports that 'more complicated models do not necessarily provide superior performance,' but does not specify the exact model classes compared or the metrics used to reach this conclusion; adding a short table or explicit list would strengthen the practical takeaway.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive evaluation of our manuscript. The referee's summary correctly identifies the core contributions: mapping the non-response problem in sentiment surveys to the intersection of missing-data and causal inference frameworks, deriving consistent estimators for ATE and LATE on treated and respondent subpopulations, and providing simulation evidence on estimator performance. We appreciate the recommendation for minor revision.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The central claim maps the sentiment metric problem onto the intersection of missing-data and observational causal inference settings, then identifies conditions for consistent estimators of ATE/LATE on treated and respondent subpopulations. This relies on standard identification results from the external missing-data and causal-inference literature rather than any self-referential fitting, self-definition, or load-bearing self-citation chain. No equations or steps in the provided abstract reduce a prediction or estimator to its own inputs by construction; simulation results are presented only as supporting evidence, not as the proof of consistency. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; paper invokes standard identification assumptions from missing data and observational causal inference without detailing new axioms or parameters here.

axioms (1)
  • domain assumption: Conditions exist under which consistent estimators for treatment effects on respondents and treated users can be identified from the intersection of missing data and causal inference frameworks.
    Abstract states that such conditions are identified but does not specify them; these are required for the consistency claim.

pith-pipeline@v0.9.0 · 5612 in / 1208 out tokens · 18247 ms · 2026-05-25T15:38:53.147198+00:00 · methodology

