Efficient Multi-Cohort Inference for Long-Term Effects and Lifetime Value in A/B Testing with User Learning

Andrea Tonon; Dario Simionato; Mingxue Wang; Tong Gui; Weiguo Wang; Xiaoyue Li

arxiv: 2604.20777 · v1 · submitted 2026-04-22 · 💻 cs.LG

Pith reviewed 2026-05-10 00:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords A/B testing · long-term treatment effects · lifetime value · multi-cohort inference · parametric decay · user churn · user learning · streaming platforms

The pith

A multi-cohort inverse-variance estimator plus parametric decay modeling recovers long-term treatment effects and lifetime value changes from short A/B tests under user learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In streaming platforms where churn is costly, short-horizon A/B tests often miss how a treatment alters user retention, so an intervention can look beneficial or neutral on standard metrics while actually generating lower total value. The paper introduces a method that pools data from multiple short cohorts observed at staggered times, applies inverse-variance weighting to obtain a low-variance estimate of the treatment effect trajectory, and then fits that trajectory to a parametric decay form. From the fitted decay the method extracts both the asymptotic long-term treatment effect and the change in expected residual lifetime value. A sympathetic reader would care because this joint evaluation prevents product decisions that appear sound on short-term or predicted long-term metrics alone but erode overall user value through higher churn.

Core claim

The central claim is that an inverse-variance weighted combination of multi-cohort estimates yields an efficient time-varying treatment effect trajectory in short experiments, and fitting this trajectory to a parametric decay model recovers both the asymptotic long-term treatment effect and the delta in expected residual lifetime value, allowing simultaneous assessment of steady-state impact and cumulative user value within a single short A/B test under user learning.

What carries the argument

An inverse-variance weighted estimator that combines time-varying treatment effect estimates across multiple cohorts, followed by parametric decay modeling of the resulting trajectory to extrapolate the asymptotic effect and cumulative lifetime value.
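
As a minimal sketch of the weighting step, assume each cohort supplies an independent estimate of the same effect at a given elapsed time, together with a variance estimate. The function names `ivw_combine` and `ivw_trajectory` and the input layout are illustrative assumptions, not the paper's code.

```python
def ivw_combine(estimates, variances):
    """Inverse-variance weighted mean of independent estimates.

    Returns (combined_estimate, combined_variance). Independence across
    cohorts is an assumption of this sketch.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    return combined, 1.0 / total


def ivw_trajectory(effects_by_time, variances_by_time):
    """Apply ivw_combine at each elapsed time.

    effects_by_time[t] holds the estimates from every cohort that has been
    observed for t periods; later times may have fewer cohorts.
    """
    return [ivw_combine(e, v) for e, v in zip(effects_by_time, variances_by_time)]


# Two cohorts with equal variance: the result is the plain average.
print(ivw_combine([1.0, 3.0], [1.0, 1.0]))  # (2.0, 0.5)
```

With equal variances the weights reduce to a simple average; a noisier cohort is down-weighted in proportion to its variance.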

If this is right

The framework permits joint evaluation of steady-state impact and residual lifetime value change inside one short experiment.
The inverse-variance weighting reduces variance relative to single-cohort or standard approaches.
The method flags scenarios in which short-term metrics or long-term point estimates alone produce incorrect product decisions.
Empirical results are reported to show higher precision for both long-term effect and lifetime value estimates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Platforms could embed the estimator in routine A/B pipelines so that every short test automatically surfaces lifetime value alongside conventional metrics.
If the parametric decay form holds across different product surfaces, the same short-experiment design could be reused for non-streaming services with retention dynamics.
A natural extension would be to test sensitivity of the recovered values to alternative decay functional forms when longer data become available.

Load-bearing premise

The treatment effect trajectory follows a parametric decay form that can be fitted from the observed data to recover the true asymptotic effect and cumulative value.
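
To make the premise concrete, here is a hedged sketch of fitting one possible decay family, effect(t) = a + b * exp(-rate * t), to an estimated trajectory. The exponential form, the grid search over `rate`, and the name `fit_exp_decay` are assumptions for illustration; the abstract does not state which parametric form the paper uses. Under this assumed form, the fitted `a` plays the role of the asymptotic effect.

```python
from math import exp


def fit_exp_decay(times, effects, variances, rates):
    """Weighted least-squares fit of effect(t) ~ a + b * exp(-rate * t).

    For each candidate rate the model is linear in (a, b), so the 2x2
    weighted normal equations are solved in closed form. The rate with the
    smallest weighted squared error wins. Returns (a, b, rate).
    """
    weights = [1.0 / v for v in variances]
    best = None
    for rate in rates:
        x = [exp(-rate * t) for t in times]
        sw = sum(weights)
        sx = sum(w * xi for w, xi in zip(weights, x))
        sxx = sum(w * xi * xi for w, xi in zip(weights, x))
        sy = sum(w * yi for w, yi in zip(weights, effects))
        sxy = sum(w * xi * yi for w, xi, yi in zip(weights, x, effects))
        det = sw * sxx - sx * sx
        if abs(det) < 1e-12:
            continue  # regressors are (numerically) constant; skip this rate
        b = (sw * sxy - sx * sy) / det
        a = (sy - b * sx) / sw
        sse = sum(w * (yi - a - b * xi) ** 2
                  for w, xi, yi in zip(weights, x, effects))
        if best is None or sse < best[0]:
            best = (sse, a, b, rate)
    if best is None:
        raise ValueError("no candidate rate produced a solvable fit")
    return best[1], best[2], best[3]
```

If the true trajectory lies outside this family, `a` is still returned, which is exactly why the premise is load-bearing: the fit always produces an asymptote, whether or not one exists in the data.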

What would settle it

Run a long-horizon follow-up experiment on the same treatment, measure the realized long-term effect and total lifetime value change, and check whether those quantities match the predictions obtained from the short multi-cohort parametric fit.
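
One simple way to operationalize this check is a two-sample z comparison between the short-test prediction and the long-horizon measurement. This is a sketch under stated assumptions: both quantities are approximately normal, they are independent, and each comes with a standard error. The function name and the 1.96 threshold are illustrative choices, not part of the paper.

```python
from math import sqrt


def prediction_consistent(predicted, predicted_se, realized, realized_se, z=1.96):
    """Compare a short-test prediction with a long-horizon measurement.

    Returns (z_score, consistent), where consistent is True when the
    difference is within z combined standard errors.
    """
    combined_se = sqrt(predicted_se ** 2 + realized_se ** 2)
    if combined_se == 0:
        raise ValueError("at least one standard error must be positive")
    score = (realized - predicted) / combined_se
    return score, abs(score) <= z
```

The same comparison would be run once for the long-term effect and once for the lifetime value change, since the two could disagree with the follow-up in different ways.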

Figures

Figures reproduced from arXiv: 2604.20777 by Andrea Tonon, Dario Simionato, Mingxue Wang, Tong Gui, Weiguo Wang, Xiaoyue Li.

**Figure 1.** Comparison of the number of clicks 𝑚 observed on active users (top) and number of clicks weighted by user count (bottom). In this A/B test, the treatment introduces advertisements in between streams. While the computation of 𝑚 among active users (top) shows a positive short-term effect with no long-term downsides (LTE ≈ 0), this view is biased by ignoring how the treatment affects user counts. The second…

Original abstract

In streaming platforms, churn is extremely costly, yet A/B tests are typically evaluated using outcomes observed within a limited experimental horizon. Even when both short- and predicted long-term engagement metrics are considered, they may fail to capture how a treatment affects users' retention. Consequently, an intervention may appear beneficial in the short term and neutral in the long term while still generating lower total value than the control due to user churn. To address this limitation, we introduce a method that estimates long-term treatment effects (LTE) and residual lifetime value change ($\Delta ERLV$) in short multi-cohort A/B tests under user learning. To estimate time-varying treatment effects efficiently, we introduce an inverse-variance weighted estimator that combines estimates from multiple cohorts, reducing variance relative to standard approaches in the literature. The estimated treatment trajectory is then modeled as a parametric decay to recover both the asymptotic treatment effect and the cumulative value generated over time. Our framework enables simultaneous evaluation of steady-state impact and residual user value within a single experiment. Empirical results show improved precision in estimating LTE and $\Delta ERLV$ and identify scenarios in which relying on either short-term or long-term metrics alone would lead to incorrect product decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete multi-cohort inverse-variance estimator plus parametric decay fit to pull LTE and lifetime value from short A/B tests, but the extrapolation step rests on an assumption that needs explicit checks.

The key takeaway is that this work gives experimenters a way to estimate both the steady-state treatment effect and the change in users' lifetime value from short A/B tests by first combining cohort data efficiently and then extrapolating with a decay curve. The new piece is the inverse-variance weighted multi-cohort estimator for the treatment trajectory, paired with the parametric model to get LTE and ΔERLV at the same time. This seems more efficient than running separate long-term predictions or waiting for full data. It does a good job highlighting how short-term gains can mask lower total value due to faster churn, which is a real issue in platforms where retention drives revenue.

The main limitation is that everything after the trajectory estimate depends on the decay function being correctly specified. Without checks against the actual shape of the data or alternative models, the long-term numbers could be artifacts of the fit rather than true asymptotes. The abstract mentions empirical improvements in precision but doesn't detail how they verified the model or compared to baselines.

This is useful for teams doing A/B testing on user-facing products where lifetime metrics matter more than immediate clicks. Someone running experiments in that setting could adapt the weighting scheme and see if the decay approach fits their domain.

I would send it to peer review. The idea addresses a practical gap with a specific technique, and referees can push on the assumption checks and any code or data details that might be in the full version.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a method for estimating long-term treatment effects (LTE) and changes in expected residual lifetime value (ΔERLV) from short multi-cohort A/B tests under user learning. It proposes an inverse-variance weighted estimator to combine time-varying treatment effect estimates across cohorts, reducing variance relative to standard approaches. The resulting trajectory is then fit with a parametric decay model to recover the asymptotic treatment effect and the cumulative value over time. Empirical results are presented as showing improved precision for LTE and ΔERLV estimates while identifying cases where short-term or long-term metrics alone would lead to incorrect product decisions.

Significance. If the parametric decay assumption is valid and the variance reduction is realized in practice, the framework would enable more reliable evaluation of interventions that affect retention and lifetime value within limited experimental horizons, addressing a common limitation in streaming platform A/B testing. The multi-cohort inverse-variance weighting represents a practical strength for efficiency. The identification of decision errors from incomplete metrics adds applied value, though overall significance depends on robustness to the functional form assumption.

major comments (2)

[Modeling the Treatment Trajectory (abstract and method section)] The recovery of asymptotic LTE and ΔERLV relies on fitting a parametric decay form to the estimated treatment trajectory (as described in the abstract and the modeling procedure). This makes the long-term quantities direct functions of the fitted parameters. If the true (unknown) trajectory does not lie in the assumed parametric family, the asymptote and integral are extrapolations under misspecification. The manuscript should include misspecification diagnostics, sensitivity checks to alternative functional forms, or non-parametric benchmarks, as this step is load-bearing for the central claims about LTE and cumulative value.
[Empirical Results] The abstract states that empirical results show improved precision in LTE and ΔERLV estimates, but without reported quantitative details such as variance ratios, confidence interval widths, or statistical comparisons to single-cohort or non-weighted baselines, the magnitude and reliability of the gains cannot be assessed. Please add specific metrics, tables, or figures from the experiments to substantiate this claim.

minor comments (2)

[Abstract] The notation ΔERLV is introduced without an explicit expanded definition in the abstract; ensure the full expansion and any related assumptions are clearly stated at first use in the main text.
[Method] Consider specifying the exact parametric decay function (e.g., exponential with rate parameter) and its estimation procedure in the main body to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Referee: [Modeling the Treatment Trajectory (abstract and method section)] The recovery of asymptotic LTE and ΔERLV relies on fitting a parametric decay form to the estimated treatment trajectory (as described in the abstract and the modeling procedure). This makes the long-term quantities direct functions of the fitted parameters. If the true (unknown) trajectory does not lie in the assumed parametric family, the asymptote and integral are extrapolations under misspecification. The manuscript should include misspecification diagnostics, sensitivity checks to alternative functional forms, or non-parametric benchmarks, as this step is load-bearing for the central claims about LTE and cumulative value.

Authors: We agree that the parametric decay assumption is load-bearing for the LTE and ΔERLV claims and that misspecification could bias the extrapolated asymptote and integral. The current manuscript specifies a particular parametric decay family in the method section. In the revision we will add a dedicated robustness subsection that includes (i) sensitivity analyses across alternative functional forms (exponential, power-law, and linear decay), (ii) standard goodness-of-fit diagnostics (R², residual plots, and cross-validation error on the fitted trajectories), and (iii) a non-parametric benchmark using local polynomial smoothing followed by tail extrapolation to compare against the parametric results. These additions will quantify sensitivity and support the central claims. revision: yes
Referee: [Empirical Results] The abstract states that empirical results show improved precision in LTE and ΔERLV estimates, but without reported quantitative details such as variance ratios, confidence interval widths, or statistical comparisons to single-cohort or non-weighted baselines, the magnitude and reliability of the gains cannot be assessed. Please add specific metrics, tables, or figures from the experiments to substantiate this claim.

Authors: We accept that the abstract claim of improved precision requires explicit quantitative support. Although the empirical section already demonstrates gains from the inverse-variance weighted estimator, we will add a new table that reports (i) variance ratios of the multi-cohort IVW estimator relative to single-cohort and non-weighted multi-cohort baselines, (ii) average confidence-interval widths for LTE and ΔERLV, and (iii) statistical comparisons (e.g., variance-ratio tests) across the experimental scenarios. Corresponding figures showing precision trajectories will also be included or referenced. These changes will allow readers to assess the magnitude of the efficiency gains directly. revision: yes
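
The first of these quantities can be illustrated with a small calculation. Assuming independent cohort estimates with known variances σ_k², the inverse-variance weighted estimate has variance 1 / Σ(1/σ_k²), while an unweighted mean of K cohorts has variance Σσ_k² / K². The helper `variance_ratio` is a hypothetical name, and the inputs are made-up numbers for illustration, not results from the paper.

```python
def variance_ratio(variances):
    """Variance of the IVW estimate divided by that of the unweighted mean.

    Values below 1 mean the weighted estimate is more precise. The ratio
    equals 1 only when all cohort variances are identical.
    """
    if not variances or any(v <= 0 for v in variances):
        raise ValueError("need at least one cohort and positive variances")
    k = len(variances)
    ivw_var = 1.0 / sum(1.0 / v for v in variances)
    unweighted_var = sum(variances) / (k * k)
    return ivw_var / unweighted_var


# One precise cohort (variance 1) and one noisy cohort (variance 4).
print(variance_ratio([1.0, 4.0]))  # 0.64
```

The gain grows with the spread of cohort variances, which is why a table of ratios across experimental scenarios would be more informative than a single headline number.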

Circularity Check

1 step flagged

LTE and ΔERLV obtained by fitting parametric decay to estimated trajectory

specific steps

fitted input called prediction [Abstract]
"The estimated treatment trajectory is then modeled as a parametric decay to recover both the asymptotic treatment effect and the cumulative value generated over time."

The asymptotic treatment effect (LTE) and cumulative value (ΔERLV) are recovered directly by fitting the parametric decay to the estimated trajectory. These quantities are therefore outputs of the fitting procedure applied to the short-term estimates, making the long-term results functions of the fitted parameters by construction rather than separate predictions or data-driven extrapolations independent of the model choice.

full rationale

The paper estimates a time-varying treatment effect trajectory via inverse-variance weighting of multi-cohort data. It then explicitly models this trajectory with a parametric decay form whose parameters directly supply the asymptotic LTE and the cumulative residual lifetime value. Because the long-term quantities are defined as the fitted asymptote and integral, they reduce to functions of the parametric fit applied to the short-term estimates rather than independent predictions. This matches the fitted-input-called-prediction pattern and supports a moderate circularity score; the central empirical claims about improved precision in LTE/ΔERLV rest on the validity of the decay assumption without reported misspecification diagnostics or non-parametric benchmarks in the abstract.
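
The dependence can be written out under an assumed functional form. Suppose, purely for illustration, that the fitted trajectory is exponential with parameters $a$, $b$, $\lambda$, and that cumulative value is a discounted integral with rate $\rho$. Neither the exponential form nor the discounting is confirmed by the abstract, and the integral below is only a stand-in for $\Delta ERLV$.

```latex
\begin{align}
\tau(t) &= a + b\,e^{-\lambda t}, \qquad \lambda > 0,\ \rho > 0, \\
\lim_{t \to \infty} \tau(t) &= a, \\
\int_0^{\infty} e^{-\rho t}\,\tau(t)\,dt &= \frac{a}{\rho} + \frac{b}{\rho + \lambda}.
\end{align}
```

Both quantities are closed-form functions of $(a, b, \lambda)$, so any misfit in those parameters propagates directly to the long-term numbers.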

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that treatment effects follow a parametric decay and on the statistical validity of inverse-variance weighting across cohorts; no free parameters or invented entities are explicitly quantified in the abstract.

free parameters (1)

parametric decay parameters
Fitted to the multi-cohort treatment trajectory to recover asymptotic effect and cumulative value.

axioms (1)

domain assumption: Treatment effects under user learning follow a parametric decay trajectory
Invoked to justify modeling the estimated trajectory and recovering long-term quantities from short tests.


