arxiv: 2604.08421 · v1 · submitted 2026-04-09 · 📊 stat.ME


Hypothesizing an effect size by considering individual variation

Andrew Gelman , Amy Krefman , Lauren Kennedy , Jessica Hullman


Pith reviewed 2026-05-10 17:38 UTC · model grok-4.3

classification 📊 stat.ME

keywords: average treatment effect · individual variation · effect size hypothesis · study design · experimental studies · observational studies · causal effects


The pith

Hypothesizing average treatment effects is more realistic when based on a distribution of individual effects rather than a direct guess of the average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that to form a realistic hypothesis for the average treatment effect in a study, one should first consider the distribution of effects across different individuals. This method is illustrated with examples from medicine, economics, and psychology. A sympathetic reader would care because directly specifying an average effect often leads to unrealistic assumptions that ignore natural variation in how treatments affect people. By starting with individual differences, the resulting average hypothesis better reflects possible real-world outcomes. This can improve the design and evaluation of experiments and observational studies.

Core claim

The central claim is that an average treatment effect can be conceptualized more naturally and realistically by first positing a distribution of effects at the individual level. The authors demonstrate this approach through concrete examples in three fields, showing how the distribution informs what the average should be.

What carries the argument

A distribution of individual treatment effects, from which the average treatment effect hypothesis is derived.
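The step from an individual-effect distribution to an average can be sketched in a few lines. The following is a minimal illustration, not the authors' procedure: the subgroup labels, shares, and effect sizes are hypothetical placeholders chosen only to make the arithmetic visible, and the names `Subgroup` and `average_treatment_effect` are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subgroup:
    label: str
    share: float   # hypothesized fraction of the population (hypothetical value)
    effect: float  # hypothesized individual effect in outcome units (hypothetical value)


def average_treatment_effect(groups):
    """Return the share-weighted mean of the hypothesized individual effects."""
    total_share = sum(g.share for g in groups)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("subgroup shares must sum to 1")
    return sum(g.share * g.effect for g in groups)


# Hypothetical scenario: most people do not respond, a few respond strongly.
groups = [
    Subgroup("no response", 0.70, 0.0),
    Subgroup("moderate response", 0.25, 2.0),
    Subgroup("strong response", 0.05, 10.0),
]

ate = average_treatment_effect(groups)
print(f"implied average treatment effect: {ate:.2f}")
```

Writing the shares down explicitly makes it clear how much of the implied average depends on a small, strongly responding subgroup, which a single guessed number would hide.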

If this is right

This leads to more realistic average effect size hypotheses in study planning.
The approach is applicable across medicine, economics, and psychology.
It provides a systematic way to incorporate individual variation into effect size considerations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method might help in fields like education research where individual differences are pronounced.
It could lead to the development of tools that sample from individual effect distributions to suggest averages.
It could be tested in real planning sessions to see whether it changes decisions.
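A tool of the kind imagined above might look roughly like the sketch below. It is an assumption-laden illustration rather than anything described in the paper: the responder/non-responder mixture, the normal distribution among responders, the parameter values, and the function name `simulate_individual_effects` are all hypothetical.

```python
import random
import statistics


def simulate_individual_effects(n, p_respond, mean_if_respond, sd_if_respond, seed=0):
    """Draw n individual effects from a hypothetical two-part mixture:
    non-responders have effect 0; responders have a normally distributed effect."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n):
        if rng.random() < p_respond:
            effects.append(rng.gauss(mean_if_respond, sd_if_respond))
        else:
            effects.append(0.0)
    return effects


# Hypothetical inputs: 30% respond, with a mean effect of 2.0 (sd 1.0) among responders.
effects = simulate_individual_effects(100_000, 0.3, 2.0, 1.0, seed=1)

implied_ate = statistics.fmean(effects)
share_zero = sum(1 for e in effects if e == 0.0) / len(effects)

print(f"implied average effect: {implied_ate:.3f}")
print(f"share with no effect:   {share_zero:.3f}")
```

Under these assumed inputs the implied average should land near 0.3 × 2.0 = 0.6, well below the effect among responders, which is the kind of gap the distribution-first framing is meant to surface.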

Load-bearing premise

That beginning with a distribution of individual effects will produce a more realistic hypothesis for the average treatment effect than directly specifying the average.

What would settle it

A comparison where experts hypothesize average effects both ways and then actual studies show the distribution-first method's averages are no closer to true effects than direct guesses.
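One way such a comparison could be scored is with the mean absolute error of each elicitation method against the effects later observed. The sketch below is only an illustration of that scoring step: the function name is an assumption, and the numbers are made-up placeholders, not data from any study, so the printed values say nothing about which method performs better.

```python
from statistics import fmean


def mean_absolute_error(hypothesized, observed):
    """Average absolute gap between hypothesized and observed average effects."""
    if len(hypothesized) != len(observed):
        raise ValueError("inputs must have the same length")
    if not hypothesized:
        raise ValueError("inputs must be non-empty")
    return fmean([abs(h - o) for h, o in zip(hypothesized, observed)])


# Made-up placeholder values for three imaginary studies (illustration only).
direct_guess = [0.50, 0.40, 0.30]        # averages guessed directly
distribution_first = [0.20, 0.15, 0.10]  # averages derived from a hypothesized distribution
observed = [0.10, 0.20, 0.05]            # effects later estimated

print("direct guess MAE:      ", round(mean_absolute_error(direct_guess, observed), 3))
print("distribution-first MAE:", round(mean_absolute_error(distribution_first, observed), 3))
```

The load-bearing premise would be undermined if, across many real studies, the second error were no smaller than the first.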

Figures

Figures reproduced from arXiv: 2604.08421 by Amy Krefman, Andrew Gelman, Jessica Hullman, Lauren Kennedy.

Original abstract

When designing and evaluating an experiment or observational study, it is useful to have a realistic hypothesis regarding the average treatment effect. We present an approach to conceptualizing this average by first considering a distribution of effects. We demonstrate with examples in medicine, economics, and psychology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Gelman et al. offer a clear reminder to hypothesize average treatment effects by first imagining a distribution of individual effects, but the note stays conceptual with no checks on whether the approach improves results.


The main point here is a practical suggestion for study design: when you need a hypothesized average treatment effect, start by thinking about how the effect varies across individuals and then work out what the average would be. The authors illustrate this with examples from medicine, economics, and psychology, showing how direct guesses at the mean can feel arbitrary when heterogeneity is likely. This framing makes the role of variation explicit rather than hidden inside a single number. The writing is direct and the examples are easy to follow, which is helpful for readers who actually have to pick effect sizes before running a study.

The logic holds up at the level of basic reasoning about averages and distributions. What is missing is any test of whether this procedure leads to better-calibrated or more realistic hypotheses than just specifying the average directly. There are no simulations, no comparisons, and no discussion of how much extra work it adds or whether people actually end up with different numbers. The paper treats the benefit as self-evident, which is fine for a short note but leaves the central claim unexamined.

This is aimed at applied researchers in the social and medical sciences who design studies and need to justify their expected effect sizes. Someone already thinking about heterogeneity might find the examples useful as a prompt, but the piece does not introduce new methods or data that would change how most people work. It deserves peer review because the topic matters for how research gets planned and the authors are experienced at this kind of advice. A referee could reasonably ask for more on practical uptake or potential limitations, but the core idea is coherent enough to warrant that step rather than a desk reject.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a conceptual approach to hypothesizing the average treatment effect (ATE) by first specifying a distribution over individual treatment effects and then deriving the average from that distribution. The idea is illustrated through qualitative examples drawn from medicine, economics, and psychology.

Significance. If the suggested reframing reliably produces more realistic ATE hypotheses than direct elicitation of the mean, it could usefully influence study design and prior specification in applied work. The emphasis on individual-level variation aligns with growing interest in heterogeneity, but the paper supplies no formal argument, simulation, or empirical comparison demonstrating systematic improvement in realism or calibration.

major comments (1)

[Abstract and introductory framing] The paper's motivation rests on the claim that beginning with a distribution of individual effects will systematically produce a more realistic hypothesis for the ATE than direct specification of the average (see abstract and the opening paragraphs). No supporting argument, reference to elicitation literature, or illustrative comparison is provided to substantiate this assumption, which is load-bearing for the contribution.

minor comments (2)

[Examples] The examples would be clearer if each included an explicit statement of the chosen individual-effect distribution, the resulting ATE value, and a brief discussion of how the distribution was elicited or justified.
[Discussion or references] The manuscript would benefit from situating the suggestion against existing work on effect-size elicitation, prior specification for heterogeneous effects, or Bayesian approaches to ATE modeling.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments, which help us better position the conceptual contribution of the manuscript. We will revise the abstract and introductory framing to address the concerns raised.


Referee: [Abstract and introductory framing] The paper's motivation rests on the claim that beginning with a distribution of individual effects will systematically produce a more realistic hypothesis for the ATE than direct specification of the average (see abstract and the opening paragraphs). No supporting argument, reference to elicitation literature, or illustrative comparison is provided to substantiate this assumption, which is load-bearing for the contribution.

Authors: We acknowledge that the manuscript does not include a formal argument, simulation study, or empirical comparison establishing that the proposed approach systematically produces more realistic ATE hypotheses than direct elicitation of the mean. The paper is explicitly conceptual in nature, offering a reframing illustrated through qualitative examples in medicine, economics, and psychology. These examples demonstrate the process of deriving an ATE from an individual-effects distribution but do not constitute a quantitative comparison of realism or calibration. To address this point, we will revise the abstract and opening paragraphs to present the method as a complementary heuristic for incorporating individual variation into effect-size hypotheses, without claiming systematic superiority. We will also add references to the elicitation literature (e.g., on expert prior specification and effect-size judgment) to provide context for the approach. This revision will tone down the motivational language while preserving the core idea.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; purely conceptual heuristic


The paper presents a conceptual heuristic for hypothesizing average treatment effects by first considering a distribution of individual effects, illustrated via examples in medicine, economics, and psychology. No equations, derivations, fitted parameters, or technical claims are made. The central suggestion is a re-framing of the elicitation task without any load-bearing self-citations, uniqueness theorems, ansatzes, or reductions of predictions to inputs by construction. The argument stands as independent conceptual advice and does not reduce to its own definitions or prior self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are specified in the abstract; the contribution is a suggested conceptual procedure.

pith-pipeline@v0.9.0 · 5327 in / 886 out tokens · 57268 ms · 2026-05-10T17:38:35.229301+00:00 · methodology

