arXiv: 2605.11017 · v1 · submitted 2026-05-10 · cs.LG · cs.AI · cs.IR

Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics

Chao Zhou

Pith reviewed 2026-05-13 06:38 UTC · model grok-4.3

keywords: Simpson's paradox, behavioral curves, survival bias, aggregation bias, user dynamics, parametric modeling, recommendation systems

The pith

Aggregation distorts behavioral curves so that group peaks occur at three to five times the individual exposure level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the standard practice of fitting a single parametric curve to aggregated engagement-versus-exposure data produces a peak that differs sharply from the peaks obtained by modeling each user separately. This misalignment is a form of Simpson's paradox driven by survival bias: users who stop engaging early are missing from the later points of the aggregate. Large-scale book (Goodreads) and electronics (Amazon) review datasets exhibit the effect at factors of roughly 3 and 5.3, while a control dataset without strong differential attrition (MovieLens-25M) shows no such shift. The author introduces a calibration procedure, Synthetic Null Calibration, to correct for false positives when classifying which users exhibit a peak. The result matters because many applications in recommendation, advertising, and dosing rely on aggregate curves to set exposure targets.

Core claim

When individual user engagement curves are fitted separately, their peaks cluster around 11 exposures on Goodreads data, yet the single curve fitted to all users together peaks around 34 exposures, a gap of about 3x. The same pattern appears in Amazon Electronics reviews, where the distortion reaches 5.3x. MovieLens-25M serves as a negative control: there the individual and aggregate peaks align, which isolates survival bias, rather than aggregation itself, as the operative mechanism. The distortion persists across different category definitions and engagement measures.
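As a quick arithmetic check of the headline number, the gap can be written as a ratio of peak locations. The symbol G is a label introduced here for illustration and is not claimed to be the paper's notation.

```latex
G \;=\; \frac{n^{*}_{\text{aggregate}}}{n^{*}_{\text{individual}}}
  \;\approx\; \frac{34}{11} \;\approx\; 3.1
```

A ratio near 1, as reported for the MovieLens control, means the aggregate and individual peaks coincide.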

What carries the argument

Simpson's paradox in behavioral curves, produced when survival bias from differential user attrition warps the aggregate parametric fit away from the typical individual peak.

If this is right

Exposure targets derived from aggregate curves will systematically recommend too many repetitions for the typical user.
Recommendation and advertising systems that optimize against aggregate behavioral curves will over-expose the median user.
Clinical dosing schedules fitted to population-level response curves may exceed the point at which most individuals have already responded.
Per-user or attrition-adjusted modeling is required whenever differential dropout rates are present.
The introduced Synthetic Null Calibration reduces false positives when classifying which users exhibit a clear peak.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing studies that tuned exposure levels using only aggregate curves may have set targets substantially higher than needed for most individuals.
Domains with high user churn, such as social media or mobile apps, are likely to show comparable distortions if examined at the individual level.
A direct test could compare aggregate versus individual peaks in randomized experiments that control attrition rates.
The same aggregation bias may affect other parametric summaries of user behavior beyond simple exposure-response curves.

Load-bearing premise

The gap between individual and aggregate peaks is caused primarily by survival bias rather than by artifacts of how the curves are fitted, how engagement is measured, or by other unmeasured factors.

What would settle it

Construct a synthetic dataset in which every user has identical probability of continuing after each exposure and then check whether the aggregate peak still deviates from the individual peaks; if the deviation disappears, the survival-bias account is supported.
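A minimal sketch of that check, under stated assumptions: every user shares one hypothetical engagement shape, scale * n^k * exp(-n/tau), which peaks at 11 exposures; users differ only in scale and in their per-exposure continuation probability; and the aggregate is the expected mean engagement among users still active at each exposure. The curve family, the two user types, and all function names are illustrative choices, not the paper's model or data.

```python
import math


def individual_curve(n, scale, k=2.0, tau=5.5):
    """Hypothetical per-user engagement curve; its maximum is at n = k * tau = 11."""
    return scale * n**k * math.exp(-n / tau)


def aggregate_curve(user_types, max_n=200):
    """Expected mean engagement among users still active at each exposure n.

    user_types: list of (population_share, scale, p_continue) tuples (assumed inputs).
    """
    curve = []
    for n in range(1, max_n + 1):
        num = den = 0.0
        for share, scale, p_continue in user_types:
            alive = share * p_continue ** (n - 1)  # fraction of this type reaching exposure n
            num += alive * individual_curve(n, scale)
            den += alive
        curve.append(num / den)
    return curve


def peak_location(curve):
    """Exposure count (1-indexed) at which the curve is largest."""
    return 1 + max(range(len(curve)), key=curve.__getitem__)


# Identical continuation probability for every user: no survival bias.
uniform = [(0.9, 1.0, 0.90), (0.1, 20.0, 0.90)]
# High-engagement users also survive longer: differential attrition.
differential = [(0.9, 1.0, 0.80), (0.1, 20.0, 0.99)]

uniform_peak = peak_location(aggregate_curve(uniform))
biased_peak = peak_location(aggregate_curve(differential))
print("aggregate peak, uniform attrition:", uniform_peak)
print("aggregate peak, differential attrition:", biased_peak)
```

With identical continuation probabilities the aggregate is just a rescaled copy of the shared individual curve, so its peak stays put. When the high-engagement type also survives longer, the later exposures are dominated by those users and the aggregate peak moves to the right. The size of the shift depends entirely on the assumed parameters and is not meant to reproduce the paper's 3x figure.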

Figures

Figures reproduced from arXiv: 2605.11017 by Chao Zhou.

**Figure 2.** Simpson's paradox gap ratio across three datasets. Distortion magnitude tracks survival […]

**Figure 3.** Cross-genre consistency of individual peak locations. Box plots show the distribution of […]

**Figure 4.** Amazon Electronics aggregate engagement curve with Hill-exponential fit overlay (43 […]

**Figure 5.** Distribution of individual peak locations on Amazon Electronics (strict classification, […]

Original abstract

Behavioral curve modeling -- fitting parametric functions to engagement-versus-exposure data -- is standard practice in recommendation, advertising, and clinical dosing. We show that aggregation introduces a systematic distortion: Simpson's paradox in behavioral curves. On Goodreads (3.3M users, 9 genres), individual users peak at n* approximately 11 exposures while the aggregate peaks at n* approximately 34 -- a 3x gap driven by survival bias. Amazon Electronics (18M reviews) shows a 5.3x distortion. MovieLens-25M (D approximately 1) serves as a negative control, confirming that survival bias -- not aggregation per se -- is the operative mechanism. The distortion is robust to category granularity, engagement operationalization, and classifier calibration. We develop Synthetic Null Calibration to address a 32% false positive rate in per-user classification. Our findings apply wherever individual behavioral parameters are estimated from aggregate curves under differential attrition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows clear 3-5x distortions in peak exposure from aggregate vs per-user fits on large datasets, driven by survival bias, but per-user peak estimates look vulnerable to sparse-data instability.

The core finding is that fitting parametric curves to aggregated engagement data shifts the apparent peak exposure point by a factor of three on Goodreads and over five on Amazon Electronics compared to individual-user fits, with the gap pinned to survival bias rather than aggregation alone. A negative control on MovieLens where no distortion appears helps isolate that mechanism. The Synthetic Null Calibration they introduce to cut false positives in per-user work is a concrete practical step that targets a reported 32% error rate. Those elements give the work some empirical weight on real-scale data.

The main soft spot is whether the per-user fits themselves are stable. Many users will have few observations, and if the chosen parametric family or fitting routine produces high-variance or biased peak locations under sparsity, the headline gap could partly reflect estimation noise instead of the claimed bias. The abstract leaves the exact functional forms, regularization, and goodness-of-fit checks unspecified, so it is hard to judge how cleanly survival bias was separated from fitting artifacts. The stress-test concern about location bias not being directly calibrated lands as a real open question rather than a minor quibble.

This is useful reading for people who build or rely on aggregate behavioral models in recommendation, advertising, or dosing settings. Anyone who routinely optimizes exposure from pooled curves should see the warning and the calibration method. I would send it to peer review so referees can examine the fitting procedures and robustness checks in detail; the pattern is striking enough to merit that scrutiny even if revisions are needed.

Referee Report

3 major / 2 minor

Summary. The paper claims that aggregation of user engagement-versus-exposure data induces Simpson's paradox in parametric behavioral curves: individual-level fits yield lower peak exposures (n* ≈ 11 on Goodreads) than aggregate fits (n* ≈ 34), a 3x distortion attributed to survival bias rather than aggregation per se. Evidence comes from Goodreads (3.3M users, 9 genres), Amazon Electronics (18M reviews, 5.3x distortion), and MovieLens-25M as negative control (D ≈ 1), with robustness to granularity, operationalization, and classifier calibration. The authors introduce Synthetic Null Calibration to correct a 32% false-positive rate in per-user peak classification and argue the findings apply to any setting estimating individual parameters from aggregates under differential attrition.

Significance. If the central empirical discrepancy is shown to arise specifically from survival bias and not from fitting instability or misspecification, the result would be significant for recommender systems, advertising, and clinical modeling, where aggregate curves are routinely used to infer optimal exposure levels. The multi-dataset design with negative control and robustness checks provides a solid empirical foundation; the introduction of Synthetic Null Calibration is a constructive methodological contribution that could be adopted more broadly.

major comments (3)

[Synthetic Null Calibration] The Synthetic Null Calibration section addresses only the false-positive rate (32%) for classifying users as having a peak; it does not validate recovery of the continuous peak location n* or quantify bias/variance in per-user estimates under sparse data. Because the headline 3x gap rests on the distribution of these individual n* values, explicit simulation results showing that the chosen parametric family recovers known peaks when data are sparse (as in the Goodreads per-user regime) are required.
[Methods / per-user fitting procedure] The manuscript provides insufficient detail on the exact parametric families fitted to individual users, the optimization procedure (MLE, least-squares, regularization), and how n* is extracted from each fit. Without these, it is impossible to rule out that the reported individual-aggregate discrepancy is partly an artifact of functional-form misspecification or unstable estimation for low-exposure users, rather than survival bias alone.
[Empirical results and robustness checks] While MovieLens serves as a negative control, the paper does not report quantitative checks (e.g., per-user goodness-of-fit statistics or cross-validation error) demonstrating that the chosen parametric form is adequate for individual trajectories on the main datasets. Such diagnostics are necessary to isolate survival bias from model mismatch.
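The recovery check asked for in the first major comment might look like the sketch below. It assumes a hypothetical gamma-shaped family, y = a * n^k * exp(-n/tau), whose peak is n* = k * tau, multiplicative log-normal noise, and an ordinary least-squares fit in log space. The paper's own parametric family and fitting routine are not specified in this review and may differ, so every name and default value here is illustrative.

```python
import math
import random
import statistics


def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]


def fit_peak(ns, ys):
    """Fit log y = c + k*log(n) + b*n by least squares; return n* = -k/b, or None."""
    X = [[1.0, math.log(n), float(n)] for n in ns]
    t = [math.log(y) for y in ys]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * ti for r, ti in zip(X, t)) for i in range(3)]
    _, k, b = solve3(XtX, Xty)
    if k <= 0 or b >= 0:  # fitted curve has no interior maximum
        return None
    return -k / b


def recovery_experiment(true_peak=11.0, k=2.0, n_obs=6, noise_sd=0.2,
                        n_users=2000, seed=0):
    """Simulate sparse users with a known peak and summarise the estimation error."""
    rng = random.Random(seed)
    tau = true_peak / k
    estimates = []
    for _ in range(n_users):
        ns = sorted(rng.sample(range(1, 41), n_obs))  # sparse exposure counts
        ys = [n**k * math.exp(-n / tau) * math.exp(rng.gauss(0.0, noise_sd))
              for n in ns]
        est = fit_peak(ns, ys)
        if est is not None:
            estimates.append(est)
    errors = [e - true_peak for e in estimates]
    return {
        "fit_rate": len(estimates) / n_users,  # share of users with a usable peak
        "bias": statistics.fmean(errors),
        "mae": statistics.fmean([abs(e) for e in errors]),
    }


if __name__ == "__main__":
    for n_obs in (4, 6, 12):
        print(n_obs, recovery_experiment(n_obs=n_obs))
```

Varying `n_obs` and `noise_sd` toward the sparsity seen per user would show whether the per-user peak estimates are biased or merely noisy. Because n* is a ratio of two fitted coefficients, a few near-degenerate fits can dominate the mean, so a median or trimmed summary may be worth reporting alongside bias and MAE.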

minor comments (2)

[Data description] Clarify the precise definition of 'exposure' and 'engagement' used in each dataset, including any preprocessing steps that could interact with attrition.
[Robustness analysis] The abstract states 'robust to category granularity' but the main text should include a table or figure showing the range of granularities tested and the resulting n* values.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each of the major comments below, and we will incorporate the suggested improvements in the revised manuscript.

Referee: [Synthetic Null Calibration] The Synthetic Null Calibration section addresses only the false-positive rate (32%) for classifying users as having a peak; it does not validate recovery of the continuous peak location n* or quantify bias/variance in per-user estimates under sparse data. Because the headline 3x gap rests on the distribution of these individual n* values, explicit simulation results showing that the chosen parametric family recovers known peaks when data are sparse (as in the Goodreads per-user regime) are required.

Authors: We concur that additional validation of the per-user peak recovery is necessary to substantiate the reliability of the individual n* estimates. In the revised manuscript, we will include a dedicated simulation experiment. We will generate synthetic user trajectories with known ground-truth peak locations n* drawn from a distribution similar to our empirical findings, under sparsity levels matching the Goodreads dataset (i.e., varying numbers of observations per user). We will then apply the same fitting procedure and report metrics such as mean absolute error, bias, and variance in the estimated n* values. This will demonstrate that the parametric model can accurately recover peaks even with sparse data, thereby supporting that the observed 3x distortion arises from survival bias rather than estimation artifacts.

revision: yes

Referee: [Methods / per-user fitting procedure] The manuscript provides insufficient detail on the exact parametric families fitted to individual users, the optimization procedure (MLE, least-squares, regularization), and how n* is extracted from each fit. Without these, it is impossible to rule out that the reported individual-aggregate discrepancy is partly an artifact of functional-form misspecification or unstable estimation for low-exposure users, rather than survival bias alone.

Authors: We agree that the current description of the per-user fitting procedure is insufficiently detailed. In the revision, we will substantially expand the Methods section to include: (1) the specific parametric family employed for the behavioral curves (including the mathematical form and any assumptions), (2) the optimization algorithm used (maximum likelihood estimation via gradient descent or similar), including any regularization terms or constraints, and (3) the precise method for deriving n* from the fitted parameters (e.g., by solving for the mode or maximum of the fitted function). We will also provide supplementary code or pseudocode to ensure reproducibility.

revision: yes

Referee: [Empirical results and robustness checks] While MovieLens serves as a negative control, the paper does not report quantitative checks (e.g., per-user goodness-of-fit statistics or cross-validation error) demonstrating that the chosen parametric form is adequate for individual trajectories on the main datasets. Such diagnostics are necessary to isolate survival bias from model mismatch.

Authors: We acknowledge the value of reporting quantitative model fit diagnostics to rule out misspecification as a confounding factor. In the revised manuscript, we will add a new subsection or table summarizing per-user goodness-of-fit metrics for the Goodreads and Amazon datasets. This will include, for example, the distribution of R^2 values, mean squared errors, or log-likelihoods across users, as well as results from a cross-validation procedure (e.g., 5-fold CV error averaged over users). These diagnostics will be compared to those on the MovieLens negative control to further isolate the role of survival bias.

revision: yes
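For the second response's point (3), deriving n* from fitted parameters, a worked case may help. The family below is a hypothetical stand-in chosen because its maximum has a closed form; it is not claimed to be the family used in the paper.

```latex
\begin{align*}
f(n) &= a\,n^{k}\,e^{-n/\tau}, \qquad a,\,k,\,\tau > 0, \\
\frac{d}{dn}\log f(n) &= \frac{k}{n} - \frac{1}{\tau} = 0
  \;\Longrightarrow\; n^{*} = k\,\tau, \\
\frac{d^{2}}{dn^{2}}\log f(n) &= -\frac{k}{n^{2}} < 0 .
\end{align*}
```

Because the second derivative of log f is negative for all n > 0, the stationary point is the unique maximum. For a family without a closed-form maximum, n* would instead be located numerically on the fitted curve.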

Circularity Check

0 steps flagged

No significant circularity; central result is direct empirical comparison of fitted peaks

The paper reports direct maximum-likelihood or least-squares fits of a parametric family to per-user engagement curves and to the aggregate curve on external datasets (Goodreads, Amazon Electronics, MovieLens-25M). The headline discrepancy (individual n* ≈11 vs aggregate n* ≈34) is the observed difference between those two independent fits; it does not reduce to a fitted parameter being renamed as a prediction, nor to any self-citation chain that supplies the uniqueness or functional form. The Synthetic Null Calibration is an auxiliary procedure that only calibrates false-positive classification rate and is not used to adjust the reported peak locations. No equation or derivation step equates a claimed prediction to its own input by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on domain assumptions about the validity of per-user parametric curve fitting and the causal interpretation of survival bias as the driver of aggregation distortion. No free parameters beyond those of standard curve fitting, and no invented entities, are introduced.

free parameters (1)

parameters of per-user parametric functions
Peaks are identified by fitting parametric models to individual data, which requires estimating function parameters from observations.

axioms (2)

domain assumption: Engagement versus exposure can be accurately modeled by parametric functions at the individual user level.
Invoked to define and compare individual peaks against the aggregate.
domain assumption: Differential attrition produces survival bias that systematically shifts aggregate curve peaks.
Used to explain the mechanism behind the observed 3-5x distortions.
