Sample Size Calculations in Simple Linear Regression: Trials and Tribulations

M. Bhaskara Rao; M. Khorshed Alam; Tianyuan Guan

arxiv: 1907.10569 · v1 · pith:4E74IGANnew · submitted 2019-07-24 · 📊 stat.ME

Pith reviewed 2026-05-24 16:34 UTC · model grok-4.3

classification 📊 stat.ME

keywords sample size calculation, simple linear regression, unconditional distribution, slope estimator, correlation coefficient, test statistic, power analysis

The pith

The exact unconditional distribution of the slope test statistic enables sample size calculations in simple linear regression despite nuisance parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses sample size determination for a given level and power in simple linear regression, a five-parameter model where nuisance parameters such as the variance of the predictor X and the conditional variance of Y create computational problems. Current methods rely either on the conditional distribution of the slope estimator given the X values or on the sample correlation coefficient, but both have drawbacks in practice and interpretation. By deriving the exact unconditional distribution of the test statistic based on the least squares slope estimator, the authors alleviate some of these difficulties in sample size computation. The correlation-based approach avoids the nuisance parameter issues altogether, yet the paper shows that the resulting sample sizes align closely with those from the slope-based test across a broad range of settings.

Core claim

We overcome the problems by determining the exact unconditional distribution of the test statistic built on the estimator of the slope parameter. The exact unconditional distribution alleviates difficulties to some extent in the computation of sample sizes. Surprisingly, we see that the sample size that comes from the correlation test works in synchronization with the one that comes from the test built upon the slope parameter in a broad array of settings.

What carries the argument

The exact unconditional distribution of the test statistic built on the estimator of the slope parameter, which supports direct sample size calculations without conditioning on observed X values.

If this is right

Sample size calculations for the slope test become feasible without fixing the predictor values in advance.
The correlation coefficient test serves as a practical proxy that yields matching sample sizes while bypassing nuisance parameters.
Researchers retain the direct slope interpretation while using either method interchangeably in many cases.
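
A minimal sketch of the correlation route, not taken from the paper: it uses the standard Fisher z approximation for a two-sided test of zero correlation, and converts a slope into a correlation through ρ = β1·σx / sqrt(β1²·σx² + σ²), which holds when X is normal and the errors are normal with constant variance. The function names `n_for_correlation` and `rho_from_slope` are hypothetical, and the approximation is not the exact distribution the authors derive.

```python
from math import atanh, ceil
from statistics import NormalDist


def rho_from_slope(beta1, sigma_x, sigma):
    """Correlation implied by a slope under the assumed normal model.

    beta1   : slope under the alternative
    sigma_x : standard deviation of the predictor X
    sigma   : conditional standard deviation of Y given X
    """
    return beta1 * sigma_x / (beta1 ** 2 * sigma_x ** 2 + sigma ** 2) ** 0.5


def n_for_correlation(rho, alpha=0.05, power=0.80):
    """Approximate n for a two-sided test of rho = 0 (Fisher z approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return ceil(((z_alpha + z_beta) / atanh(rho)) ** 2 + 3)


if __name__ == "__main__":
    rho = rho_from_slope(beta1=0.5, sigma_x=1.0, sigma=1.0)
    print(round(rho, 3), n_for_correlation(rho))
```

Under these assumptions the nuisance guesses have not disappeared: they are folded into the single value of ρ that the planner has to supply.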

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the alignment between the two tests holds more generally, the correlation approach could be adopted as the default for its computational simplicity.
The unconditional distribution method might be adapted to sample size planning in other linear models with additional nuisance parameters.
Implementation would likely require numerical integration or approximation routines to obtain the distribution for specific parameter values.
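
The last point can also be illustrated without numerical integration, by simulating the test statistic directly. This is a sketch under assumptions not taken from the paper: X normal, errors normal, a two-sided test, and an empirical critical value estimated from simulated null draws instead of a t quantile (chosen only to stay within the Python standard library). It relies on two textbook facts: given X, the slope t statistic is noncentral t with n − 2 degrees of freedom, and Sxx/σx² is chi-square with n − 1 degrees of freedom. The names `mc_power`, `draw_t`, and the `effect` argument (β1·σx/σ) are hypothetical.

```python
import random


def chi2(df, rng):
    # A chi-square(df) variate is Gamma(shape=df/2, scale=2).
    return rng.gammavariate(df / 2.0, 2.0)


def draw_t(n, effect, rng):
    """One draw of the slope t statistic when X is random.

    effect is beta1 * sigma_x / sigma (assumed parametrization).
    """
    w = chi2(n - 1, rng)          # Sxx / sigma_x^2
    delta = effect * w ** 0.5     # conditional noncentrality given X
    z = rng.gauss(0.0, 1.0)       # standardized slope estimation error
    v = chi2(n - 2, rng)          # residual sum of squares / sigma^2
    return (z + delta) / (v / (n - 2)) ** 0.5


def mc_power(n, effect, alpha=0.05, draws=20000, seed=1):
    """Monte Carlo estimate of the unconditional power of the two-sided test."""
    rng = random.Random(seed)
    # Empirical critical value from the simulated null distribution.
    null = sorted(abs(draw_t(n, 0.0, rng)) for _ in range(draws))
    crit = null[round((1 - alpha) * draws) - 1]
    hits = sum(abs(draw_t(n, effect, rng)) > crit for _ in range(draws))
    return hits / draws


if __name__ == "__main__":
    for n in (20, 30, 50):
        print(n, mc_power(n, effect=0.5))
```

A sample size search would wrap `mc_power` in a loop over n until the estimate reaches the target power; Monte Carlo error means the answer is approximate, and more draws or an exact t quantile would tighten it.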

Load-bearing premise

The exact unconditional distribution of the slope test statistic can be derived and practically applied for sample size calculations despite the five-parameter model and its nuisance parameters.

What would settle it

Numerical evaluation or simulation showing that the sample sizes required by the unconditional slope test and the correlation test diverge substantially in settings outside the claimed broad array would falsify the synchronization result.

Original abstract

The problem tackled in this paper is the determination of sample size for a given level and power in the context of a simple linear regression model. At a technical level, the simple linear regression model is a five-parameter model. It is natural to base sample size calculations on the least squares' estimator of the slope parameter of the model. Nuisance parameters such as the variance of the predictor X and conditional variance of the response Y create problems in the calculations. The current approaches in the literature are not illuminating. One approach is based on the conditional distribution of the estimator of the slope parameter given the data on the predictor X. Another approach is based on the sample correlation coefficient. We overcome the problems by determining the exact unconditional distribution of the test statistic built on the estimator of the slope parameter. The exact unconditional distribution alleviates difficulties to some extent in the computation of sample sizes. On the other hand, the test based on the sample correlation coefficient of X and Y avoids the problems besetting the test based on the slope parameter. However, we lose intuitive interpretation that comes with the slope parameter. Surprisingly, we see that the sample size that comes from the correlation test works in synchronization with the one that comes from the test built upon the slope parameter in a broad array of settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives the exact unconditional distribution of the slope t-statistic, but the result still depends on var(X) and the error variance, so sample size planning requires the same nuisance-parameter guesses as before.

The paper's main contribution is working out the exact unconditional distribution of the test statistic for the slope in simple linear regression by integrating the conditional t distribution over the marginal of X. That step is new relative to the usual conditional-on-X approach, which cannot be used at the design stage because the X values are unknown then. The authors also note that the sample sizes coming from the correlation test line up closely with the slope-based ones across a range of settings, which is a practical observation worth recording even if it is not deeply surprising given how the two tests relate.

The abstract is clear that this unconditional route only alleviates the difficulties to some extent rather than removing them. The stress-test concern holds: the five-parameter model means the power calculation still requires values for var(X) and the conditional variance of Y, so the user faces the same guessing problem that the conditional and correlation methods already have. Neither a closed form free of those parameters nor a bounding procedure is described in the abstract. The synchronization result is the part that could be useful to practitioners who already like the correlation approach but want to keep the slope interpretation.

This is a narrow technical note aimed at statisticians who care about exact distributions in regression sample-size work. It is not a broad methodological fix. A serious referee should check whether the derivation is correct and whether any computational gain is real once the nuisance parameters are plugged in. I would send it to review rather than desk-reject.
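
The marginalization the letter describes can be written out under standard assumptions. The following is a textbook-style sketch, not the paper's own expression: it assumes normal X and normal errors with a two-sided level-α test, and the symbols λ, S_xx, and W are introduced here for illustration.

```latex
\begin{align*}
X_i &\sim N(\mu_x, \sigma_x^2), \qquad
Y_i \mid X_i \sim N(\beta_0 + \beta_1 X_i,\ \sigma^2), \qquad i = 1, \dots, n, \\
T \mid X &\sim t_{n-2}(\delta), \qquad
\delta = \frac{\beta_1 \sqrt{S_{xx}}}{\sigma}, \qquad
S_{xx} = \sum_{i=1}^{n} (X_i - \bar{X})^2, \\
W &= \frac{S_{xx}}{\sigma_x^2} \sim \chi^2_{n-1}, \qquad
\lambda = \frac{\beta_1 \sigma_x}{\sigma}, \\
\text{power}(n) &= \int_0^\infty
\Pr\!\left( \left| t_{n-2}\!\left(\lambda \sqrt{w}\right) \right| > t_{1-\alpha/2,\, n-2} \right)
f_{\chi^2_{n-1}}(w)\, dw .
\end{align*}
```

In this sketch σx and σ enter only through λ, which still has to be guessed at the design stage, consistent with the concern raised in the letter.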

Referee Report

2 major / 1 minor

Summary. The paper addresses sample size determination for hypothesis testing on the slope in simple linear regression, a five-parameter model. It critiques existing conditional (on X) and correlation-based approaches for their handling of nuisance parameters (var(X), var(Y|X)). The central claim is that deriving the exact unconditional distribution of the slope-based test statistic alleviates these difficulties to some extent, while the correlation test avoids them entirely (at the cost of interpretability); surprisingly, the two approaches yield synchronized sample sizes across a broad array of settings.

Significance. If the unconditional distribution is derived in closed or computable form and demonstrably reduces (even partially) the need to specify nuisance parameters for a priori calculations, the result would have moderate practical value for experimental design in regression settings. The reported synchronization with the correlation test could provide empirical reassurance for practitioners preferring the slope interpretation. However, the modest qualifier 'to some extent' in the abstract suggests the advance may be incremental rather than transformative.

major comments (2)

[Abstract] The claim that the exact unconditional distribution 'alleviates difficulties to some extent' is load-bearing for the central contribution, yet the text provides no indication that the marginalization over X's distribution cancels dependence on the ratio var(X)/var(Y|X); without such cancellation or a bounding procedure, the method would still require the same nuisance inputs that limit the conditional approach.
[Abstract, and implied derivation sections] No explicit form, integral expression, or numerical verification is referenced for the unconditional distribution of the slope t-statistic; this prevents assessment of whether the result is truly exact and usable for sample-size formulas without retaining the five-parameter dependence noted in the skeptic's concern.

minor comments (1)

[Abstract] The abstract states the model is five-parameter but does not list the parameters explicitly (intercept, slope, E[X], var(X), var(Y|X)); adding this enumeration would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive report. We address each major comment below, providing clarifications on the derivation and its implications while committing to revisions that strengthen the presentation without overstating the results.

Referee: [Abstract] The claim that the exact unconditional distribution 'alleviates difficulties to some extent' is load-bearing for the central contribution, yet the text provides no indication that the marginalization over X's distribution cancels dependence on the ratio var(X)/var(Y|X); without such cancellation or a bounding procedure, the method would still require the same nuisance inputs that limit the conditional approach.

Authors: We agree the abstract would benefit from greater precision on this mechanism. The unconditional distribution is formed by integrating the conditional distribution of the slope t-statistic over the marginal distribution of X. This integration does not fully cancel all dependence on the variance ratio, but it yields sample-size recommendations that are demonstrably less sensitive to specific nuisance values, as shown by the close synchronization with the correlation-based approach across the broad array of settings examined in the paper. We will revise the abstract to explicitly reference the marginalization step and its partial alleviation of the five-parameter burden. (Revision: yes)

Referee: [Abstract, and implied derivation sections] No explicit form, integral expression, or numerical verification is referenced for the unconditional distribution of the slope t-statistic; this prevents assessment of whether the result is truly exact and usable for sample-size formulas without retaining the five-parameter dependence noted in the skeptic's concern.

Authors: The explicit integral expression for the unconditional distribution is derived in Sections 2–3 by marginalizing the conditional t-statistic over the distribution of X; numerical verification via simulation appears in Section 4. These sections establish that the distribution is exact and supports sample-size formulas. We will add a concise reference to the derivation and verification in the abstract so readers can locate the supporting material immediately. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: unconditional distribution derived by direct integration, independent of fitted inputs or self-citations

The paper derives the exact unconditional distribution of the slope test statistic via integration of the conditional distribution over the marginal distribution of X. This is a standard, non-circular statistical procedure that does not reduce to redefinition of inputs, fitted parameters renamed as predictions, or load-bearing self-citations. The abstract notes that this alleviates difficulties 'to some extent' and that correlation-based sample sizes synchronize in broad settings, presented as a consequence of the derivation rather than a tautology. No ansatzes, uniqueness theorems from prior self-work, or renaming of known results are invoked as central steps. The retained dependence on nuisance parameters is a limitation of the model, not evidence that the derivation is equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We overcome the problems by determining the exact unconditional distribution of the test statistic built on the estimator of the slope parameter... T² ~ ((n-2)/(n-1)) * (W1 W4)/(W2 W3)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: The five-parameter model... X ~ N(μx, σx²), Y|X ~ N(β0 + β1 X, σ²)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.