A General Framework for Optimal Group Sequential Testing via Mixed-Integer Linear Programming
Pith reviewed 2026-05-19 18:12 UTC · model grok-4.3
The pith
Mixed-integer linear programming finds optimal rejection boundaries for group sequential tests that allow earlier stopping than standard methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We use a sample average approximation combined with mixed integer linear programming to directly optimize the rejection criterion in the GST setting under type-1 and type-2 error constraints, and show that this S-MILP approach dominates classical GST procedures such as Lan-DeMets, Pocock, and O'Brien-Fleming methods while often spending alpha more aggressively early.
What carries the argument
The S-MILP approach: a sample average approximation of the error probabilities paired with mixed-integer linear programming to choose the optimal rejection thresholds at each of the K analysis times.
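One way a program of this kind could be written down is sketched below. This is an illustrative encoding under stated assumptions, not necessarily the paper's formulation: it assumes a one-sided test that stops only for rejection, simulated test statistics $Z^{(0)}_{i,k}$ (null) and $Z^{(1)}_{j,k}$ (alternative) at looks $k = 1, \dots, K$, group sizes $n_k$, a sufficiently large constant $M$, and binary helper variables $r$, $R$, $s$. All of these names are hypothetical.

```latex
\begin{aligned}
\min_{c,\,r,\,R,\,s}\quad
  & \frac{1}{N_1}\sum_{j=1}^{N_1}\Bigl(n_1 + \sum_{k=1}^{K-1} n_{k+1}\, s_{j,k}\Bigr)
  && \text{(average sample size under the alternative)} \\
\text{s.t.}\quad
  & c_k - M\bigl(1 - r^{(h)}_{i,k}\bigr) \;\le\; Z^{(h)}_{i,k} \;\le\; c_k + M\, r^{(h)}_{i,k}
  && h \in \{0,1\},\ \forall i, k \\
  & R^{(0)}_i \ge r^{(0)}_{i,k}\ \ \forall i,k, \qquad
    \frac{1}{N_0}\sum_{i=1}^{N_0} R^{(0)}_i \le \alpha
  && \text{(sampled type-1 error)} \\
  & R^{(1)}_j \le \sum_{k=1}^{K} r^{(1)}_{j,k}\ \ \forall j, \qquad
    \frac{1}{N_1}\sum_{j=1}^{N_1} R^{(1)}_j \ge 1 - \beta
  && \text{(sampled power)} \\
  & s_{j,k} \ge 1 - \sum_{l \le k} r^{(1)}_{j,l}
  && \forall j,\ k = 1,\dots,K-1 \\
  & r^{(h)}_{i,k},\ R^{(h)}_i,\ s_{j,k} \in \{0,1\}, \qquad c_k \in \mathbb{R}.
\end{aligned}
```

Here $r^{(h)}_{i,k} = 1$ marks that path $i$ crosses the threshold $c_k$ at look $k$, and $s_{j,k} = 1$ marks that path $j$ has not yet rejected after look $k$ and so pays for group $k+1$. Both the objective and the error constraints are averages over simulated paths, which is where the sample average approximation enters.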
If this is right
- The optimal boundaries spend the alpha budget more heavily in early interim analyses than do standard methods.
- Expected number of observations needed to reach a decision is reduced while preserving error control.
- The framework can be applied to any specified number of groups and target error rates.
- In medical studies, it can lead to the same conclusion with fewer participants enrolled.
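To make "spending alpha early" concrete, the sketch below compares the cumulative alpha spent at each look by the two standard Lan-DeMets spending functions (O'Brien-Fleming-type and Pocock-type). These are the textbook forms, not the optimized S-MILP budget, which this page does not report; the function names and the choice of K = 5 equally spaced looks are assumptions made for illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obf_type_spend(t, alpha=0.05):
    """Lan-DeMets O'Brien-Fleming-type spending: 2 - 2*Phi(z_{alpha/2} / sqrt(t))."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 - 2.0 * _STD_NORMAL.cdf(z / math.sqrt(t))


def pocock_type_spend(t, alpha=0.05):
    """Lan-DeMets Pocock-type spending: alpha * ln(1 + (e - 1) * t)."""
    return alpha * math.log(1.0 + (math.e - 1.0) * t)


def cumulative_spend(spend, num_looks, alpha=0.05):
    """Cumulative alpha spent at equally spaced information fractions k/K."""
    return [spend(k / num_looks, alpha) for k in range(1, num_looks + 1)]


if __name__ == "__main__":
    for name, fn in [("OBF-type", obf_type_spend), ("Pocock-type", pocock_type_spend)]:
        spent = cumulative_spend(fn, num_looks=5)
        print(name, [round(a, 5) for a in spent])
```

Both functions spend the full alpha by the final look; the Pocock-type function releases far more of it at the first look, which is the sense in which a budget is "aggressive early".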
Where Pith is reading between the lines
- The insight on early alpha spending may guide design of more responsive sequential monitoring in other areas such as online experiments.
- Similar optimization techniques could incorporate additional practical constraints like recruitment costs or ethical stopping rules.
- Validation on more diverse simulation settings would strengthen confidence in the method's robustness across distributions.
Load-bearing premise
The sample average approximation provides a sufficiently accurate representation of the true type-1 and type-2 error probabilities for the optimized boundaries to maintain the desired error control in practice.
What would settle it
Generate a large number of data sets under the null hypothesis, apply the S-MILP boundaries, and check whether the fraction of rejections stays at or below the nominal type-1 error level, up to Monte Carlo error; an excess well beyond that error would show that the approximation is inadequate.
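A minimal sketch of such a check follows. It assumes unit-variance normal data with one unit of information per group and a two-sided test. Because the optimized S-MILP thresholds are not given on this page, the usage line plugs in the tabulated Pocock constant (2.413 for K = 5, two-sided alpha = 0.05) as a stand-in. The function names are hypothetical.

```python
import math
import random


def simulate_z_path(num_looks, drift, rng):
    """Z-statistics at K equally spaced looks.

    Each group adds an independent N(drift, 1) increment to the running score,
    and Z_k = score_k / sqrt(k); drift = 0 corresponds to the null hypothesis.
    """
    score = 0.0
    path = []
    for k in range(1, num_looks + 1):
        score += rng.gauss(drift, 1.0)
        path.append(score / math.sqrt(k))
    return path


def rejects(z_path, boundaries):
    """True if |Z_k| reaches its boundary at any look (two-sided test)."""
    return any(abs(z) >= c for z, c in zip(z_path, boundaries))


def audit_type1(boundaries, num_reps, seed=0):
    """Monte Carlo estimate of the type-1 error and its standard error."""
    rng = random.Random(seed)
    hits = sum(
        rejects(simulate_z_path(len(boundaries), 0.0, rng), boundaries)
        for _ in range(num_reps)
    )
    rate = hits / num_reps
    std_err = math.sqrt(rate * (1.0 - rate) / num_reps)
    return rate, std_err


if __name__ == "__main__":
    # Stand-in boundaries: Pocock constant for K = 5, two-sided alpha = 0.05.
    rate, std_err = audit_type1([2.413] * 5, num_reps=200_000, seed=1)
    print(f"estimated type-1 error: {rate:.4f} +/- {1.96 * std_err:.4f}")
```

The audit draws fresh null paths that are independent of whatever samples were used to choose the boundaries, so the estimate is not affected by the optimization having been tuned to its own sample.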
Original abstract
Sequential hypothesis tests are widely adopted as a principled way to perform multiple tests on data that arrives over time. In particular, researchers frequently utilize group sequential hypothesis tests (GST) to test the same hypotheses at K times or "groups" while data arrives sequentially. In this setting, many methods have been proposed to allow researchers to uniformly control type-1 error across K checks (often known as various alpha-spending budgets). Although these methods are all successfully valid in controlling uniform type-1 error, it is not clear which of these methods are optimal when trying to reject the null as soon as possible. In this paper, we directly optimize the rejection criterion in the GST setting under the same constraints of controlling type-1 and type-2 errors. We use a sample average approximation combined with mixed integer linear programming (S-MILP) approach for this problem and show how our S-MILP approach dominates classical GST procedures such as Lan-DeMets, Pocock, and O'Brien-Fleming methods. We also find that the optimal solution typically aggressively spends the alpha-budget early, shedding insight to the long-standing debate of which alpha-spending budgets are more efficient. We finally apply our optimal S-MILP approach to a recent study on acute kidney injury interventions and find our optimal S-MILP approach can reach the same statistically significant conclusion faster than the original study and other GST methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a sample-average approximation combined with mixed-integer linear programming (S-MILP) framework to directly optimize group-sequential testing boundaries under explicit type-I and type-II error constraints. It claims that the resulting boundaries dominate classical spending-function methods (Lan-DeMets, Pocock, O’Brien-Fleming) by permitting earlier rejection on average, provides insight that optimal solutions spend alpha aggressively early, and illustrates the approach on an acute-kidney-injury trial.
Significance. If the optimized boundaries can be shown to control the nominal error rates exactly (rather than only under the SAA) and the reported dominance holds under independent verification, the framework would supply a flexible, computationally tractable alternative to traditional GST design. The empirical observation on early alpha spending would also inform the long-standing debate on spending-function efficiency.
major comments (2)
- [Abstract] Abstract and § on SAA formulation: the claim that S-MILP “dominates” Lan-DeMets, Pocock and O’Brien-Fleming is presented without quantitative evidence (e.g., expected stopping-time differences or power curves) or confirmation that the final boundaries satisfy the nominal α under the exact (non-approximated) null distribution.
- [SAA and MILP formulation] SAA error-control section (likely §3–4): because both the objective and the type-I/type-II constraints are replaced by Monte-Carlo averages, any systematic under-estimation of tail probabilities can produce boundaries that violate the nominal error rates when evaluated exactly. No analytic error bound on the SAA nor an independent high-fidelity Monte-Carlo audit of the selected boundaries is reported.
minor comments (2)
- [Notation and formulation] Clarify the precise MILP encoding of the boundary variables and the chosen objective (expected sample size, expected stopping time, etc.).
- [Numerical results] Simulation figures should report variability (standard errors or quantiles) across SAA replications so that dominance claims can be assessed for statistical significance.
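The kind of evidence these comments ask for can be sketched as follows: a Monte Carlo estimate of the expected stopping look under an alternative, reported with its standard error, for two classical designs. The setup is an assumption for illustration (unit-variance normal increments, two-sided test, K = 5, tabulated Pocock and O'Brien-Fleming constants 2.413 and 2.040, and a hypothetical per-group drift of 1.5); none of it reproduces the paper's experiments.

```python
import math
import random
from statistics import mean, stdev


def stopping_look(boundaries, drift, rng):
    """1-based index of the first look where |Z_k| >= c_k, or K if none."""
    score = 0.0
    for k, c in enumerate(boundaries, start=1):
        score += rng.gauss(drift, 1.0)
        if abs(score / math.sqrt(k)) >= c:
            return k
    return len(boundaries)


def expected_stopping_look(boundaries, drift, num_reps, seed=0):
    """Sample mean of the stopping look and the standard error of that mean."""
    rng = random.Random(seed)
    looks = [stopping_look(boundaries, drift, rng) for _ in range(num_reps)]
    return mean(looks), stdev(looks) / math.sqrt(num_reps)


K = 5
POCOCK = [2.413] * K
OBRIEN_FLEMING = [2.040 * math.sqrt(K / k) for k in range(1, K + 1)]

if __name__ == "__main__":
    for name, bounds in [("Pocock", POCOCK), ("O'Brien-Fleming", OBRIEN_FLEMING)]:
        avg, se = expected_stopping_look(bounds, drift=1.5, num_reps=50_000, seed=7)
        print(f"{name}: mean stopping look {avg:.3f} (s.e. {se:.3f})")
```

Reporting the standard error alongside each mean lets a reader judge whether a difference between designs exceeds simulation noise.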
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of our S-MILP framework for group sequential testing. We respond to each major comment below and describe the changes we will make in revision.
- Referee: [Abstract] Abstract and § on SAA formulation: the claim that S-MILP “dominates” Lan-DeMets, Pocock and O’Brien-Fleming is presented without quantitative evidence (e.g., expected stopping-time differences or power curves) or confirmation that the final boundaries satisfy the nominal α under the exact (non-approximated) null distribution.
  Authors: We agree that more explicit quantitative support for dominance would strengthen the presentation. In the revised manuscript we will add a table of expected stopping times under the alternative hypothesis for the S-MILP solution versus Lan-DeMets, Pocock, and O’Brien-Fleming boundaries, together with power curves at several effect sizes. We will also report an independent Monte-Carlo verification (10^6 replications) confirming that the final boundaries attain the nominal type-I error under the exact (non-SAA) null distribution.
  Revision: yes
- Referee: [SAA and MILP formulation] SAA error-control section (likely §3–4): because both the objective and the type-I/type-II constraints are replaced by Monte-Carlo averages, any systematic under-estimation of tail probabilities can produce boundaries that violate the nominal error rates when evaluated exactly. No analytic error bound on the SAA nor an independent high-fidelity Monte-Carlo audit of the selected boundaries is reported.
  Authors: The referee correctly notes the risk inherent in replacing the exact error constraints by SAA averages. While a rigorous analytic error bound for the SAA-MILP formulation is not derived in the paper and would require substantial additional theoretical work, we will add a high-fidelity Monte-Carlo audit (using an order of magnitude more replications than the SAA sample size) of the optimized boundaries to empirically verify control of the nominal α and β under the exact distributions.
  Revision: partial
- Deriving a closed-form analytic error bound for the sample-average approximation within the mixed-integer linear program.
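In the absence of an analytic bound, the precision of a Monte Carlo audit can at least be quantified. The sketch below uses the usual binomial standard error and a normal-approximation interval; it applies to an audit of fixed boundaries on fresh samples and says nothing about the bias introduced by optimizing on the SAA sample itself. The function names are hypothetical.

```python
import math


def binomial_std_err(p, num_reps):
    """Standard error of an estimated probability p from num_reps independent draws."""
    return math.sqrt(p * (1.0 - p) / num_reps)


def reps_for_half_width(p, half_width, z=1.96):
    """Replications needed for a normal-approximation interval of the given half-width."""
    return math.ceil(p * (1.0 - p) * (z / half_width) ** 2)


if __name__ == "__main__":
    # Precision of a 10^6-replication audit at a nominal level of 0.05.
    print(binomial_std_err(0.05, 10**6))
    # Replications needed to pin a 0.05 error rate down to +/- 0.001.
    print(reps_for_half_width(0.05, 0.001))
```

At a nominal level of 0.05, a 10^6-replication audit has a standard error of roughly 0.0002, so violations much smaller than that would go undetected.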
Circularity Check
No circularity: direct MILP optimization of boundaries under explicit error constraints
The paper formulates the GST boundary optimization as a mixed-integer linear program whose objective and constraints are defined directly from the desired type-1 and type-2 error tolerances. The S-MILP procedure solves this program using Monte-Carlo averages; the resulting boundaries are outputs of the solver, not redefinitions or statistical fits of the same quantities. Classical-method comparisons are performed by evaluating the obtained boundaries on independent simulation draws or by direct numerical reporting, none of which reduce to the optimization inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is smuggled in. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- SAA sample size
axioms (1)
- domain assumption: The group sequential testing problem can be formulated as a mixed-integer linear program with accurate error control via approximation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We use a sample average approximation combined with mixed integer linear programming (S-MILP) approach... dominates classical GST procedures such as Lan-DeMets, Pocock, and O'Brien-Fleming
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: min expected sample size subject to type-1 and type-2 error constraints
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.