pith. machine review for the scientific record.

arxiv: 2605.11017 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI · cs.IR


Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics


Pith reviewed 2026-05-13 06:38 UTC · model grok-4.3

keywords Simpson's paradox, behavioral curves, survival bias, aggregation bias, user dynamics, parametric modeling, recommendation systems

The pith

Aggregation distorts behavioral curves so that group peaks occur at three to five times the individual exposure level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the standard practice of fitting a parametric curve to aggregated engagement-versus-exposure data yields a peak far later than the peaks obtained by fitting each user separately. The gap is a form of Simpson's paradox driven by survival bias: users who stop engaging early are absent from the later points of the aggregate, so those points reflect only the users who stayed. Goodreads book reviews show a gap of about 3x and Amazon Electronics reviews about 5.3x, while MovieLens-25M, a control without strong differential attrition, shows no such shift. The paper also introduces Synthetic Null Calibration to address a 32% false-positive rate when classifying which users have a peak. The result matters because applications in recommendation, advertising, and clinical dosing use aggregate curves to set exposure targets.

Core claim

When individual user engagement curves are fitted separately, their peaks cluster around eleven exposures on Goodreads data, yet the single curve fitted to all users together peaks around thirty-four exposures. Amazon Electronics reviews show the same pattern with a larger gap of 5.3x. MovieLens serves as a negative control in which the individual and aggregate peaks align, which isolates survival bias, rather than aggregation itself, as the operative mechanism. The distortion persists across different category definitions and engagement measures.

What carries the argument

Simpson's paradox in behavioral curves, produced when survival bias from differential user attrition warps the aggregate parametric fit away from the typical individual peak.
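
The mechanism can be illustrated with a toy population. This is a minimal sketch under stated assumptions, not the paper's pipeline: the single-peaked `curve`, the two user groups, and their dropout points are all hypothetical, and the aggregate peak is read off the pooled mean directly rather than from a parametric fit.

```python
import math
from statistics import median

def curve(n, amp, peak):
    """Illustrative single-peaked curve with maximum value `amp` at n == peak."""
    return amp * (n / peak) * math.exp(1 - n / peak)

# Hypothetical population: (number of users, peak, amplitude, last exposure observed).
GROUPS = [
    (80, 10, 1.0, 20),  # majority: early peak, gone after 20 exposures
    (20, 40, 2.0, 80),  # minority: late peak, higher engagement, stays to 80
]

def aggregate_mean(n):
    """Mean engagement at exposure n over the users still present at n."""
    present = [(count, curve(n, amp, peak))
               for count, peak, amp, last in GROUPS if n <= last]
    total = sum(count for count, _ in present)
    return sum(count * value for count, value in present) / total

# One peak per user, so the median describes the typical individual.
individual_peaks = [peak for count, peak, amp, last in GROUPS for _ in range(count)]
typical_peak = median(individual_peaks)

# Exposure count at which the pooled (survivor-only) mean is highest.
aggregate_peak = max(range(1, 81), key=aggregate_mean)
gap_ratio = aggregate_peak / typical_peak

print(typical_peak, aggregate_peak, gap_ratio)
```

With these made-up numbers the typical user peaks at 10 exposures, but past exposure 20 the pooled mean is computed only over the late-peaking minority, so the aggregate peak lands at 40.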

If this is right

  • Exposure targets derived from aggregate curves will systematically recommend too many repetitions for the typical user.
  • Recommendation and advertising systems that optimize against aggregate behavioral curves will over-expose the median user.
  • Clinical dosing schedules fitted to population-level response curves may exceed the point at which most individuals have already responded.
  • Per-user or attrition-adjusted modeling is required whenever differential dropout rates are present.
  • The introduced Synthetic Null Calibration reduces false positives when classifying which users exhibit a clear peak.
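
This page does not describe how Synthetic Null Calibration works, so the following is only a generic null-calibration sketch of the idea named in the last bullet, not the paper's procedure. The `peak_score` statistic, the flat-plus-noise null model, and every parameter value are assumptions made for illustration.

```python
import random

def peak_score(y):
    """Hypothetical peak statistic: how far the interior maximum rises above both ends."""
    return max(y[1:-1]) - max(y[0], y[-1])

def null_trajectory(rng, length, noise_sd):
    """Synthetic null user: flat engagement plus Gaussian noise, so no true peak."""
    return [rng.gauss(0.0, noise_sd) for _ in range(length)]

def false_positive_rate(rng, threshold, length, noise_sd, n_null=5000):
    """Share of null users whose score exceeds the threshold."""
    hits = sum(peak_score(null_trajectory(rng, length, noise_sd)) > threshold
               for _ in range(n_null))
    return hits / n_null

def calibrate_threshold(rng, length, noise_sd, alpha, n_null=5000):
    """Choose the cutoff as the (1 - alpha) quantile of scores on null users."""
    scores = sorted(peak_score(null_trajectory(rng, length, noise_sd))
                    for _ in range(n_null))
    return scores[min(int((1 - alpha) * n_null), n_null - 1)]

rng = random.Random(0)
# Naive rule "any interior rise counts as a peak" fires on most pure-noise users.
naive_fpr = false_positive_rate(rng, 0.0, length=8, noise_sd=1.0)
# Calibrated rule: threshold set on synthetic nulls, then checked on fresh nulls.
threshold = calibrate_threshold(rng, length=8, noise_sd=1.0, alpha=0.05)
calibrated_fpr = false_positive_rate(rng, threshold, length=8, noise_sd=1.0)

print(naive_fpr, threshold, calibrated_fpr)
```

The design choice is to let simulated users with no peak by construction set the decision threshold, so the false-positive rate is controlled at a chosen level rather than discovered after the fact.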

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Existing studies that tuned exposure levels using only aggregate curves may have set targets substantially higher than needed for most individuals.
  • Domains with high user churn, such as social media or mobile apps, are likely to show comparable distortions if examined at the individual level.
  • A direct test could compare aggregate versus individual peaks in randomized experiments that control attrition rates.
  • The same aggregation bias may affect other parametric summaries of user behavior beyond simple exposure-response curves.

Load-bearing premise

The gap between individual and aggregate peaks is caused primarily by survival bias rather than by artifacts of how the curves are fitted, how engagement is measured, or by other unmeasured factors.

What would settle it

Construct a synthetic dataset in which every user has identical probability of continuing after each exposure and then check whether the aggregate peak still deviates from the individual peaks; if the deviation disappears, the survival-bias account is supported.
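
A minimal sketch of this test, computed in expectation rather than by sampling users. The curve shape, the two groups, and the continuation probabilities are hypothetical. When every user continues with the same probability, the survival weights cancel and the aggregate is just the population mean of the individual curves.

```python
import math

def curve(n, peak):
    """Illustrative single-peaked curve with maximum 1.0 at n == peak."""
    return (n / peak) * math.exp(1 - n / peak)

def aggregate_peak(groups, max_n=80):
    """Peak of expected mean engagement among survivors.

    groups: (population share, individual peak, continuation probability q).
    A user reaches exposure n with probability q ** (n - 1).
    """
    def mean_at(n):
        weights = [share * q ** (n - 1) for share, _, q in groups]
        values = [curve(n, peak) for _, peak, _ in groups]
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return max(range(1, max_n + 1), key=mean_at)

# 90% of users peak at 10 exposures, 10% at 40 (made-up values).
identical = [(0.9, 10, 0.90), (0.1, 40, 0.90)]      # same q for everyone
differential = [(0.9, 10, 0.80), (0.1, 40, 0.99)]   # early-peakers churn faster

peak_identical = aggregate_peak(identical)
peak_differential = aggregate_peak(differential)
print(peak_identical, peak_differential)
```

In this toy setup the aggregate peak stays next to the typical individual peak under identical continuation and moves out toward the late-peaking minority under differential continuation, which is the contrast the test relies on.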

Figures

Figures reproduced from arXiv: 2605.11017 by Chao Zhou.

Figure 1: Simpson’s paradox in behavioral curves. Aggregate engagement curves (blue line) system...
Figure 2: Simpson’s paradox gap ratio across three datasets. Distortion magnitude tracks survival...
Figure 3: Cross-genre consistency of individual peak locations. Box plots show the distribution of...
Figure 4: Amazon Electronics aggregate engagement curve with Hill-exponential fit overlay (43...
Figure 5: Distribution of individual peak locations on Amazon Electronics (strict classification,...
Original abstract

Behavioral curve modeling -- fitting parametric functions to engagement-versus-exposure data -- is standard practice in recommendation, advertising, and clinical dosing. We show that aggregation introduces a systematic distortion: Simpson's paradox in behavioral curves. On Goodreads (3.3M users, 9 genres), individual users peak at n* approximately 11 exposures while the aggregate peaks at n* approximately 34 -- a 3x gap driven by survival bias. Amazon Electronics (18M reviews) shows a 5.3x distortion. MovieLens-25M (D approximately 1) serves as a negative control, confirming that survival bias -- not aggregation per se -- is the operative mechanism. The distortion is robust to category granularity, engagement operationalization, and classifier calibration. We develop Synthetic Null Calibration to address a 32% false positive rate in per-user classification. Our findings apply wherever individual behavioral parameters are estimated from aggregate curves under differential attrition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that aggregation of user engagement-versus-exposure data induces Simpson's paradox in parametric behavioral curves: individual-level fits yield lower peak exposures (n* ≈ 11 on Goodreads) than aggregate fits (n* ≈ 34), a 3x distortion attributed to survival bias rather than aggregation per se. Evidence comes from Goodreads (3.3M users, 9 genres), Amazon Electronics (18M reviews, 5.3x distortion), and MovieLens-25M as negative control (D ≈ 1), with robustness to granularity, operationalization, and classifier calibration. The authors introduce Synthetic Null Calibration to correct a 32% false-positive rate in per-user peak classification and argue the findings apply to any setting estimating individual parameters from aggregates under differential attrition.

Significance. If the central empirical discrepancy is shown to arise specifically from survival bias and not from fitting instability or misspecification, the result would be significant for recommender systems, advertising, and clinical modeling, where aggregate curves are routinely used to infer optimal exposure levels. The multi-dataset design with negative control and robustness checks provides a solid empirical foundation; the introduction of Synthetic Null Calibration is a constructive methodological contribution that could be adopted more broadly.

major comments (3)
  1. [Synthetic Null Calibration] The Synthetic Null Calibration section addresses only the false-positive rate (32%) for classifying users as having a peak; it does not validate recovery of the continuous peak location n* or quantify bias/variance in per-user estimates under sparse data. Because the headline 3x gap rests on the distribution of these individual n* values, explicit simulation results showing that the chosen parametric family recovers known peaks when data are sparse (as in the Goodreads per-user regime) are required.
  2. [Methods / per-user fitting procedure] The manuscript provides insufficient detail on the exact parametric families fitted to individual users, the optimization procedure (MLE, least-squares, regularization), and how n* is extracted from each fit. Without these, it is impossible to rule out that the reported individual-aggregate discrepancy is partly an artifact of functional-form misspecification or unstable estimation for low-exposure users, rather than survival bias alone.
  3. [Empirical results and robustness checks] While MovieLens serves as a negative control, the paper does not report quantitative checks (e.g., per-user goodness-of-fit statistics or cross-validation error) demonstrating that the chosen parametric form is adequate for individual trajectories on the main datasets. Such diagnostics are necessary to isolate survival bias from model mismatch.
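
The recovery check requested in major comment 1 might look like the sketch below. It is an illustration under assumptions, not the paper's method: the one-peak `curve` stands in for the unspecified parametric family, the fit is a grid search over the peak with a closed-form non-negative amplitude, and the noise level and observation counts are invented.

```python
import math
import random
from statistics import mean

def curve(n, amp, peak):
    """Illustrative single-peaked curve with maximum value `amp` at n == peak."""
    return amp * (n / peak) * math.exp(1 - n / peak)

def fit_peak(ns, ys, peak_grid):
    """Least squares: grid over the peak, closed-form amplitude (clipped at 0)."""
    best_sse, best_peak = None, None
    for peak in peak_grid:
        g = [curve(n, 1.0, peak) for n in ns]
        amp = max(0.0, sum(y * gi for y, gi in zip(ys, g)) / sum(gi * gi for gi in g))
        sse = sum((y - amp * gi) ** 2 for y, gi in zip(ys, g))
        if best_sse is None or sse < best_sse:
            best_sse, best_peak = sse, peak
    return best_peak

def recovery_errors(rng, true_peak, n_obs, noise_sd, n_users=300):
    """Estimated minus true peak for synthetic users with n_obs sparse observations."""
    errors = []
    for _ in range(n_users):
        ns = sorted(rng.sample(range(1, 41), n_obs))  # irregular exposure counts
        ys = [curve(n, 1.0, true_peak) + rng.gauss(0.0, noise_sd) for n in ns]
        errors.append(fit_peak(ns, ys, range(2, 61)) - true_peak)
    return errors

rng = random.Random(0)
results = {}
for n_obs in (5, 10, 30):
    errs = recovery_errors(rng, true_peak=11, n_obs=n_obs, noise_sd=0.2)
    bias = mean(errs)
    mae = mean([abs(e) for e in errs])
    results[n_obs] = (bias, mae)
    print(n_obs, round(bias, 2), round(mae, 2))
```

Reporting bias and mean absolute error by observation count shows directly whether the per-user estimates of n* are trustworthy in the sparse regime where most users sit.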
minor comments (2)
  1. [Data description] Clarify the precise definition of 'exposure' and 'engagement' used in each dataset, including any preprocessing steps that could interact with attrition.
  2. [Robustness analysis] The abstract states 'robust to category granularity' but the main text should include a table or figure showing the range of granularities tested and the resulting n* values.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each of the major comments below, and we will incorporate the suggested improvements in the revised manuscript.

Point-by-point responses
  1. Referee: [Synthetic Null Calibration] The Synthetic Null Calibration section addresses only the false-positive rate (32%) for classifying users as having a peak; it does not validate recovery of the continuous peak location n* or quantify bias/variance in per-user estimates under sparse data. Because the headline 3x gap rests on the distribution of these individual n* values, explicit simulation results showing that the chosen parametric family recovers known peaks when data are sparse (as in the Goodreads per-user regime) are required.

    Authors: We concur that additional validation of the per-user peak recovery is necessary to substantiate the reliability of the individual n* estimates. In the revised manuscript, we will include a dedicated simulation experiment. We will generate synthetic user trajectories with known ground-truth peak locations n* drawn from a distribution similar to our empirical findings, under sparsity levels matching the Goodreads dataset (i.e., varying numbers of observations per user). We will then apply the same fitting procedure and report mean absolute error, bias, and variance of the estimated n* values. If recovery is accurate at these sparsity levels, the result will support the reading that the observed 3x distortion arises from survival bias rather than estimation artifacts. revision: yes

  2. Referee: [Methods / per-user fitting procedure] The manuscript provides insufficient detail on the exact parametric families fitted to individual users, the optimization procedure (MLE, least-squares, regularization), and how n* is extracted from each fit. Without these, it is impossible to rule out that the reported individual-aggregate discrepancy is partly an artifact of functional-form misspecification or unstable estimation for low-exposure users, rather than survival bias alone.

    Authors: We agree that the current description of the per-user fitting procedure is insufficiently detailed. In the revision, we will substantially expand the Methods section to include: (1) the specific parametric family employed for the behavioral curves (including the mathematical form and any assumptions), (2) the optimization algorithm used (maximum likelihood estimation via gradient descent or similar), including any regularization terms or constraints, and (3) the precise method for deriving n* from the fitted parameters (e.g., by solving for the mode or maximum of the fitted function). We will also provide supplementary code or pseudocode to ensure reproducibility. revision: yes

  3. Referee: [Empirical results and robustness checks] While MovieLens serves as a negative control, the paper does not report quantitative checks (e.g., per-user goodness-of-fit statistics or cross-validation error) demonstrating that the chosen parametric form is adequate for individual trajectories on the main datasets. Such diagnostics are necessary to isolate survival bias from model mismatch.

    Authors: We acknowledge the value of reporting quantitative model fit diagnostics to rule out misspecification as a confounding factor. In the revised manuscript, we will add a new subsection or table summarizing per-user goodness-of-fit metrics for the Goodreads and Amazon datasets. This will include, for example, the distribution of R^2 values, mean squared errors, or log-likelihoods across users, as well as results from a cross-validation procedure (e.g., 5-fold CV error averaged over users). These diagnostics will be compared to those on the MovieLens negative control to further isolate the role of survival bias. revision: yes
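
The diagnostics named in response 3 could be computed per user along the following lines. This is a sketch only: the `curve` family, the grid-search `fit`, and the single synthetic user are assumptions, since the paper's functional form and optimizer are not given on this page.

```python
import math
import random

def curve(n, amp, peak):
    """Illustrative single-peaked curve with maximum value `amp` at n == peak."""
    return amp * (n / peak) * math.exp(1 - n / peak)

def fit(ns, ys, peak_grid=range(2, 61)):
    """Least-squares fit: grid over the peak, closed-form amplitude. Returns (amp, peak)."""
    best = None
    for peak in peak_grid:
        g = [curve(n, 1.0, peak) for n in ns]
        amp = max(0.0, sum(y * gi for y, gi in zip(ys, g)) / sum(gi * gi for gi in g))
        sse = sum((y - amp * gi) ** 2 for y, gi in zip(ys, g))
        if best is None or sse < best[0]:
            best = (sse, amp, peak)
    return best[1], best[2]

def r_squared(ns, ys, amp, peak):
    """In-sample coefficient of determination for one user's fit."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - curve(n, amp, peak)) ** 2 for n, y in zip(ns, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def kfold_mse(ns, ys, k=5):
    """Mean squared prediction error on held-out points, k interleaved folds."""
    idx = list(range(len(ns)))
    squared_errors = []
    for fold in range(k):
        held = idx[fold::k]
        train = [i for i in idx if i not in held]
        amp, peak = fit([ns[i] for i in train], [ys[i] for i in train])
        squared_errors += [(ys[i] - curve(ns[i], amp, peak)) ** 2 for i in held]
    return sum(squared_errors) / len(squared_errors)

# One synthetic user: true peak 11, Gaussian noise with sd 0.1 (made-up values).
rng = random.Random(0)
ns = list(range(1, 31))
ys = [curve(n, 1.0, 11) + rng.gauss(0.0, 0.1) for n in ns]

amp, peak = fit(ns, ys)
r2 = r_squared(ns, ys, amp, peak)
cv_mse = kfold_mse(ns, ys)
print(peak, round(r2, 3), round(cv_mse, 4))
```

For a well-specified user the held-out error should sit near the noise variance; a distribution of these two numbers across real users would show how often the chosen form fails.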

Circularity Check

0 steps flagged

No significant circularity; central result is direct empirical comparison of fitted peaks

Full rationale

The paper reports direct fits (maximum likelihood or least squares) of a parametric family to per-user engagement curves and to the aggregate curve on external datasets (Goodreads, Amazon Electronics, MovieLens-25M). The headline discrepancy (individual n* ≈ 11 vs aggregate n* ≈ 34) is the observed difference between those two separate fits. It does not come from renaming a fitted parameter as a prediction, nor from a self-citation chain that supplies the functional form or its uniqueness. Synthetic Null Calibration is an auxiliary procedure that calibrates only the false-positive classification rate and is not used to adjust the reported peak locations. No equation or derivation step equates a claimed prediction to its own input by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on domain assumptions about the validity of per-user parametric curve fitting and the causal interpretation of survival bias as the driver of aggregation distortion. The only free parameters are those of the per-user curve fits, which are standard curve-fitting practice, and no invented entities are introduced.

free parameters (1)
  • parameters of per-user parametric functions
    Peaks are identified by fitting parametric models to individual data, which requires estimating function parameters from observations.
axioms (2)
  • domain assumption: Engagement versus exposure can be accurately modeled by parametric functions at the individual user level
    Invoked to define and compare individual peaks against the aggregate.
  • domain assumption: Differential attrition produces survival bias that systematically shifts aggregate curve peaks
    Used to explain the mechanism behind the observed 3-5x distortions.
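
To make the free-parameter entry concrete: the Figure 4 caption names a Hill-exponential fit, but this page does not give its formula. Assuming, purely for illustration, a Hill term multiplied by an exponential decay (the symbols A, K, h, and λ are hypothetical and not taken from the paper), the peak n* follows from setting the derivative of log f to zero.

```latex
% Assumed form (not confirmed by the source):
f(n) = A \, \frac{n^{h}}{K^{h} + n^{h}} \, e^{-\lambda n},
\qquad A, K, h, \lambda > 0

% Stationarity condition on the log scale:
\frac{d}{dn} \log f(n)
  = \frac{h}{n} - \frac{h \, n^{h-1}}{K^{h} + n^{h}} - \lambda
  = \frac{h K^{h}}{n \left( K^{h} + n^{h} \right)} - \lambda = 0

% Special case h = 1, where the condition is a quadratic in n:
\lambda n^{2} + \lambda K n - K = 0
\quad\Longrightarrow\quad
n^{*} = \frac{-\lambda K + \sqrt{\lambda^{2} K^{2} + 4 \lambda K}}{2 \lambda}
```

Because the first term of the stationarity condition decreases strictly from infinity to zero as n grows, the condition has exactly one positive root, so under this assumed form each fit has a single, well-defined n*.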

pith-pipeline@v0.9.0 · 5454 in / 1413 out tokens · 84350 ms · 2026-05-13T06:38:07.881861+00:00 · methodology

