arxiv: 2005.04721 · v20 · submitted 2020-05-10 · 📊 stat.AP · stat.ME

Decision Making in Drug Development via Inference on Power

Pith reviewed 2026-05-24 14:17 UTC · model grok-4.3

keywords power calculation · probability of success · p-value function · Go/No-Go decision · drug development · risk management · assurance · inference on power

The pith

Go/No-Go decisions in drug development should use inference on power instead of point estimates from power or probability of success calculations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a typical power calculation replaces unknown population quantities with values observed in external studies, yielding a single assumed value of power. Probability of success, or assurance, averages over a prior or posterior distribution to capture uncertainty around the true treatment effect but still reduces to a single number. Both approaches are reframed via p-value functions as merely different point estimates of power. Decisions based on either point estimate fail to quantify and control the risk of incorrect Go or No-Go choices. The authors argue that full inference on power, using the p-value function, enables better risk management in drug development decisions.
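To make the contrast between the two point estimates concrete, here is a minimal sketch, assuming a one-sided z-test on a normally distributed effect estimate. The function names and every number (external estimate, standard errors, null value) are hypothetical illustrations and are not taken from the paper. The closed form used for the averaged power holds only in this normal setting.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def power(theta, theta0, se, alpha):
    """Power of a one-sided z-test of H0: theta <= theta0 at level alpha,
    assuming the future estimate is N(theta, se^2)."""
    z_crit = STD.inv_cdf(1 - alpha)
    return STD.cdf((theta - theta0) / se - z_crit)


def probability_of_success(theta_ext, se_ext, theta0, se, alpha):
    """Power averaged over N(theta_ext, se_ext^2).
    Closed form for the normal case; an assumption of this sketch."""
    z_crit = STD.inv_cdf(1 - alpha)
    total_sd = (se ** 2 + se_ext ** 2) ** 0.5
    return STD.cdf((theta_ext - theta0 - z_crit * se) / total_sd)


# Hypothetical inputs: external estimate and its standard error,
# then the planned trial's null value, standard error, and level.
theta_ext, se_ext = 0.0, 0.06
theta0, se, alpha = -0.12, 0.035, 0.025

plug_in = power(theta_ext, theta0, se, alpha)
pos = probability_of_success(theta_ext, se_ext, theta0, se, alpha)
print(f"plug-in power: {plug_in:.3f}")
print(f"probability of success: {pos:.3f}")
```

Both outputs are single numbers. In this sketch the averaged value sits closer to 0.5 than the plug-in value, but neither one says how far the true power might be from it.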

Core claim

We use p-value functions to frame both the probability of success calculation and the typical power calculation as merely producing two different point estimates of power. We demonstrate that Go/No-Go decisions based on either point estimate of power do not adequately quantify and control the risk involved, and instead we argue for Go/No-Go decisions that utilize inference on power for better risk management and decision making.

What carries the argument

p-value functions that represent both classical power calculations and probability of success calculations as point estimates of power
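
The framing can be written compactly as follows. This is a sketch in assumed notation, not the paper's own equations: T is a test statistic, c_α its critical value, θ̂ an external estimate, π a prior or posterior density, and H the p-value (confidence distribution) function for θ. The last line assumes the power function is strictly increasing in θ and matches the form h(β) = dH(θ)/dβ(θ) quoted in the Figure 12 caption.

```latex
% Assumed notation; see the lead-in for which symbols are illustrative.
\begin{align*}
\beta(\theta) &= P_\theta(T > c_\alpha)
  && \text{(power function)} \\
\hat\beta_{\text{plug-in}} &= \beta(\hat\theta)
  && \text{(typical power calculation)} \\
\hat\beta_{\text{PoS}} &= \int \beta(\theta)\,\pi(\theta)\,d\theta
  && \text{(probability of success)} \\
H_\beta(b) &= H\!\left(\beta^{-1}(b)\right), \qquad
h(b) = \left.\frac{dH(\theta)}{d\beta(\theta)}\right|_{\theta=\beta^{-1}(b)}
  && \text{(inference on power)}
\end{align*}
```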

If this is right

  • Go/No-Go decisions based on point estimates of power fail to quantify and control risk adequately.
  • Inference on power using the full p-value function supplies better risk management for drug development decisions.
  • Probability of success calculations are equivalent to one specific point estimate of power under this framing.
  • Replacing point estimates with inference on power changes how uncertainty around the treatment effect is handled in practice.
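
One way to picture inference on power, as opposed to a point estimate, is to push confidence limits for the treatment effect through the power function. The sketch below assumes a one-sided z-test with a normal external estimate, so that power is monotone in the effect. The helper name `power_confidence_limits` and all numbers are hypothetical and not taken from the paper.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def power(theta, theta0, se, alpha):
    """Power of a one-sided z-test of H0: theta <= theta0 at level alpha."""
    z_crit = STD.inv_cdf(1 - alpha)
    return STD.cdf((theta - theta0) / se - z_crit)


def power_confidence_limits(theta_ext, se_ext, theta0, se, alpha, level=0.80):
    """Two-sided confidence limits for power.

    Because power is increasing in theta here, the limits for theta
    map directly to limits for power (an assumption of this sketch).
    """
    z = STD.inv_cdf(0.5 + level / 2)
    theta_lo = theta_ext - z * se_ext
    theta_hi = theta_ext + z * se_ext
    return (power(theta_lo, theta0, se, alpha),
            power(theta_hi, theta0, se, alpha))


# Hypothetical inputs, chosen only for illustration.
theta_ext, se_ext = 0.0, 0.06
theta0, se, alpha = -0.12, 0.035, 0.025

point = power(theta_ext, theta0, se, alpha)
lower, upper = power_confidence_limits(theta_ext, se_ext, theta0, se, alpha)
print(f"point estimate of power: {point:.3f}")
print(f"80% confidence limits:   ({lower:.3f}, {upper:.3f})")
```

With these inputs the point estimate looks comfortable while the lower limit is far below it, which is the kind of information a single number hides.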

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same p-value function approach could be tested in non-drug contexts such as clinical trial design outside pharmaceutical settings.
  • Simulation studies could directly compare error rates of point-estimate decisions versus inference-based decisions on synthetic trial data.
  • Regulatory bodies might evaluate whether requiring inference on power alters the balance between false positives and false negatives in approval decisions.
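
A simulation of the kind suggested in the second bullet could be sketched as follows. Both decision rules, the function name `simulate_go_rates`, and all numbers are hypothetical and not taken from the paper. The setup assumes a normal external estimate and a one-sided z-test, and fixes the true effect so that true power (0.70) is below the target (0.80), meaning every Go is a wrong Go.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def power(theta, theta0, se, alpha):
    """Power of a one-sided z-test of H0: theta <= theta0 at level alpha."""
    z_crit = STD.inv_cdf(1 - alpha)
    return STD.cdf((theta - theta0) / se - z_crit)


def simulate_go_rates(theta_true, se_ext, theta0, se, alpha,
                      target=0.80, conf=0.90, n_sims=20000, seed=1):
    """Return Go rates for two hypothetical rules.

    Rule A: Go if the plug-in power estimate reaches the target.
    Rule B: Go if the one-sided lower confidence limit for power does.
    """
    rng = random.Random(seed)
    z_conf = STD.inv_cdf(conf)
    go_point = go_lower = 0
    for _ in range(n_sims):
        theta_hat = rng.gauss(theta_true, se_ext)  # simulated external estimate
        if power(theta_hat, theta0, se, alpha) >= target:
            go_point += 1
        if power(theta_hat - z_conf * se_ext, theta0, se, alpha) >= target:
            go_lower += 1
    return go_point / n_sims, go_lower / n_sims


theta0, se, alpha = -0.12, 0.035, 0.025
se_ext = 0.06
# True effect chosen so that true power is 0.70, below the 0.80 target.
theta_true = theta0 + se * (STD.inv_cdf(1 - alpha) + STD.inv_cdf(0.70))

rate_point, rate_lower = simulate_go_rates(theta_true, se_ext, theta0, se, alpha)
print(f"wrong-Go rate, plug-in rule:     {rate_point:.3f}")
print(f"wrong-Go rate, lower-limit rule: {rate_lower:.3f}")
```

Under these assumptions the lower-limit rule keeps the wrong-Go rate below 1 − conf whenever true power is under the target, because it inherits the coverage of the confidence limit. The plug-in rule carries no such bound.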

Load-bearing premise

P-value functions provide a valid and neutral way to represent both classical power and Bayesian assurance calculations as point estimates without introducing additional assumptions that affect the risk assessment.

What would settle it

A reanalysis of historical drug development programs that compares Go/No-Go decisions based on point estimates of power with decisions based on the full p-value function for power, and checks whether the latter yields measurably different risk profiles, such as altered rates of program termination or success.

Figures

Figures reproduced from arXiv: 2005.04721 by Geoffrey S Johnson.

Figure 1
Phase 2 likelihood ratio test of H0: θ ≤ −0.05 with N=90 per arm at α=0.2. Phase 3 likelihood ratio test of H0: θ ≤ −0.12 with N=365 per arm at α=0.025. While it is important to have a clear definition of technical success before conducting a trial, …
Figure 2
Figure 2 shows the power curves for the success criteria outlined in Section 3.1, the combined power curve (product) for success in both phase 2 and phase 3, and the elicited confidence curve for the difference in proportions described above. The power curves in …
Figure 3
Solid lines depict resulting confidence curves for power in phase 2, phase 3, and overall based on the elicitation. Peaks correspond to maximum likelihood estimates of power.
[Plot residue: x-axis "Sample Size per Arm" (N=45 to N=765); y-axis "Phase 3 Power" (0.2 to 1.0); legend: Maximum Likelihood Estimate, Probability of Success Estimate.]
Figure 4
Figure 5
Figure 6
(i) Elicited confidence curve. (ii) Confidence curve for θ from the approximate phase 2 power curve testing H0: θ ≤ −0.05 with N=90 per arm at α=0.2. (iii) Multiplication of elicited H(θ) and phase 2 power curve, displayed as a confidence curve. (iv) Convolution of elicited H(θ) and approximate phase 2 power curve, displayed as a con…
[Plot annotation: α=0.025 for phase 3 LR test against difference=−0.12 with N=365 per arm.]
Figure 7
Sampling distributions of the maximum likelihood and probability of success estimators of power over 10,000 simulations.
Figure 8
Sampling distribution of the Φ⁻¹-transformed maximum likelihood estimator of power over 10,000 simulations.

B.7 Extrapolation Between Endpoints or Control Groups Across Phases
In the examples thus far the phase 2 study used the same endpoint and treatment groups planned for phase 3. Depending on the therapeutic area and endpoint this may not be feasible. In such cases the phase 3 treatment effect, and henc…
Figure 9
Solid lines depict power curves for a likelihood ratio test of the difference in proportions in phase 2, phase 3, and overall. Confidence bands depict extrapolation modeling uncertainty. Dashed line depicts the confidence density for θ based on historical data and expert opinion.
[Plot annotation: α=0.025 for phase 3 LR test against difference=−0.12 with N=365 per arm. Plot residue: axis "Power" (0.0 to 1.0); second axis truncated …]
Figure 11
Phase 2 power curve testing H0: θ ≤ −0.05 with N=90 per arm at α=0.2. Phase 3 power curve testing H0: θ ≤ −0.12 with N=365 per arm at α=0.025. Confidence density for θ based on historical data and expert opinion.
[Plot annotation: α=0.025 for phase 3 LR test against difference=−0.12 with N=365 per arm. Plot residue: axes "Power" (0.0 to 1.0) and "Confidence Density" (0 to 6); legend: Phase 2 Power mle, Phase 3 Power mle, Phase 2 and 3 Power ml…]
Figure 12
Solid lines depict resulting confidence distributions for power, h(β) = dH(θ)/dβ(θ), in phase 2, phase 3, and overall. Dotted lines depict maximum likelihood estimates of power.
Figure 13
Estimated phase 3 power testing H0: θ ≤ −0.12 at α=0.025 at various sample sizes with 80% confidence limits based on the elicitation (wide).
[Plot residue following the section heading "E Additional Figures": x-axis "True Difference in Proportions" (−0.2 to 0.2); y-axes "Power" (0.0 to 1.0) and "Confidence Density" (0 to 24); legend: (i) Elicited Confidence Density, (ii) Minimum Phase 2 Success, (iii) Multiplication, (iv) Convolution, (v) Phase 3 Power; annotation α=0.0…]
Figure 14
(i) Elicited confidence density (wide). (ii) Confidence density for θ from differentiating the approximate phase 2 power curve testing H0: θ ≤ −0.05 with N=225 per arm at α=0.025. (iii) Multiplication of elicited H(θ) and phase 2 power curve, differentiated. (iv) Convolution of elicited H(θ) and approximate phase 2 power curve, diff…
[Plot annotation: α=0.025 for phase 3 LR test against difference=−0.12 with N=365 per arm.]
Figure 15
(i) Elicited confidence density (narrow). (ii) Confidence density for θ from differentiating the approximate phase 2 power curve testing H0: θ ≤ −0.05 with N=225 per arm at α=0.025. (iii) Multiplication of elicited H(θ) and phase 2 power curve, differentiated. (iv) Convolution of elicited H(θ) and approximate phase 2 power curve, di…
[Plot annotation: α=0.025 for phase 3 LR test against difference=−0.12 with N=365 per arm.]
Figure 16
(i) Elicited confidence density (wide). (ii) Confidence density for θ from differentiating the approximate phase 2 power curve testing H0: θ ≤ −0.05 with N=90 per arm at α=0.2. (iii) Multiplication of elicited H(θ) and phase 2 power curve, differentiated. (iv) Convolution of elicited H(θ) and approximate phase 2 power curve, differe…
[Plot annotation: α=0.025 for phase 3 LR test against difference=−0.12 with N=365 per arm.]
Figure 17
Exact frequentist and Bayesian inference on a binomial proportion θ based on a sample of size n = 20. Let X1, ..., Xn ∼ Bernoulli(θ). The confidence curve and 95% confidence interval in …
Figure 18
(a) Plug-in estimated sampling distribution for the MLE of the mean supported by x̄ for exponentially distributed data with n = 5, replacing the unknown fixed true θ with θ̂_mle = 1.5. (b) Bayesian posterior from vague conjugate prior supported by θ. (c) Confidence distribution (density) based on the likelihood ratio test supported by θ. (d) Confidence distribution (density) based on the exact likelihood rat…
Figure 20
Exact null sampling distribution of θ̂_MLE = X̄ for testing H0: θ ≤ 0.75. H(θ) captures the upper-tailed p-value for every value of θ in the parameter space, and dH(θ)/dθ is the resulting confidence density. The confidence density in …
Figure 21
(a) Informative Bayesian prior distribution based on historical likelihood and vague conjugate prior for binomial proportion, θ̂^Hist_Bayes = 0.90, n = 50. (b) Confidence distribution (likelihood ratio test) based on historical data for binomial proportion, θ̂^Hist_mle = 0.90, n = 50. (c) Bayesian posterior based on current likelihood and vague conjugate prior, θ̂^Current_Bayes = 0.87, n = 30. (d) Confide…
Original abstract

A typical power calculation is performed by replacing unknown population-level quantities in the power function with what is observed in external studies. Many authors and practitioners view this as an assumed value of power and offer the Bayesian quantity probability of success or assurance as an alternative. The claim is by averaging over a prior or posterior distribution, probability of success transcends power by capturing the uncertainty around the unknown true treatment effect and any other population-level parameters. We use p-value functions to frame both the probability of success calculation and the typical power calculation as merely producing two different point estimates of power. We demonstrate that Go/No-Go decisions based on either point estimate of power do not adequately quantify and control the risk involved, and instead we argue for Go/No-Go decisions that utilize inference on power for better risk management and decision making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard power calculations (plug-in estimates for unknown parameters) and Bayesian probability of success/assurance calculations are both merely different point estimates of power when viewed through p-value functions. It argues that Go/No-Go decisions based on either point estimate fail to quantify and control risk adequately, and proposes instead using inference on power for improved risk management in drug development.

Significance. If the p-value function framing is shown to be neutral and the inference approach demonstrably improves risk control over point estimates, the work could influence decision frameworks in clinical development by emphasizing uncertainty quantification beyond single numbers. The unification of frequentist and Bayesian power concepts via p-value functions offers a potentially useful perspective if the technical mapping holds without hidden assumptions.

major comments (2)
  1. [Abstract; Section 2 (p-value function construction)] The central unification in the abstract and early sections treats both plug-in power and assurance as recoverable point evaluations of the same p-value function for power; however, the manuscript must explicitly define this function (including how the sampling distribution maps to power values) and verify that the construction is invariant to choice of test statistic and nuisance-parameter handling, as any dependence would undermine the subsequent claim that point-estimate decisions fail to control risk while inference succeeds.
  2. [Section 4 (decision examples)] The demonstration that Go/No-Go decisions based on point estimates of power do not control risk (abstract) requires a concrete counter-example or simulation study showing a scenario where the point-estimate rule accepts a program whose true risk exceeds a pre-specified threshold while the inference-on-power rule correctly rejects; without such a load-bearing example tied to the p-value function, the risk-management advantage remains unproven.
minor comments (1)
  1. [Section 2] Notation for the p-value function should be introduced with a single consistent symbol and distinguished from the usual p-value function for the treatment effect.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract; Section 2 (p-value function construction)] The central unification in the abstract and early sections treats both plug-in power and assurance as recoverable point evaluations of the same p-value function for power; however, the manuscript must explicitly define this function (including how the sampling distribution maps to power values) and verify that the construction is invariant to choice of test statistic and nuisance-parameter handling, as any dependence would undermine the subsequent claim that point-estimate decisions fail to control risk while inference succeeds.

    Authors: We agree that an explicit definition of the p-value function and a check on invariance are required to support the unification. Section 2 constructs the function by mapping the observed test statistic to the corresponding power value via the sampling distribution under the alternative (i.e., power equals the probability that the test statistic exceeds the critical value when the parameter equals the value implied by the observed statistic). To address the referee's concern, we will add an explicit statement of this mapping and a short verification subsection confirming invariance to standard choices of test statistic and nuisance-parameter handling within the normal-mean and binomial settings used in the paper. These clarifications will be incorporated in the revision.

    revision: yes

  2. Referee: [Section 4 (decision examples)] The demonstration that Go/No-Go decisions based on point estimates of power do not control risk (abstract) requires a concrete counter-example or simulation study showing a scenario where the point-estimate rule accepts a program whose true risk exceeds a pre-specified threshold while the inference-on-power rule correctly rejects; without such a load-bearing example tied to the p-value function, the risk-management advantage remains unproven.

    Authors: We accept that a concrete counter-example or simulation is needed to make the risk-control claim load-bearing. We will add a simulation study to Section 4 that generates data under a true effect size distribution, applies both point-estimate rules (plug-in and assurance) and the inference-on-power rule derived from the p-value function, and shows a case in which the point-estimate rules accept the program while the true risk (computed from the full power distribution) exceeds the threshold and the inference rule rejects. The example will be explicitly linked to the p-value function construction.

    revision: yes

Circularity Check

0 steps flagged

No circularity detected; reframing of power and assurance via p-value functions is interpretive and does not reduce to self-definition or fitted inputs by construction.

Full rationale

The paper's central move is to use p-value functions as a device for viewing both plug-in power and assurance/PoS as point estimates of the same underlying quantity, then to advocate inference on that quantity for decision-making. This is presented as a conceptual unification rather than a derivation in which one quantity is defined in terms of the other via the authors' own equations or a fitted parameter renamed as a prediction. No load-bearing self-citation, uniqueness theorem, or ansatz imported from prior work by the same authors appears in the abstract or described chain. The argument remains self-contained against external benchmarks of power and assurance calculations; the p-value function serves as an external representational tool, not a tautological re-expression of the paper's inputs. Therefore the derivation does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are identifiable.

