Two-phase validation sampling via principal components to improve efficiency in multi-model estimation from error-prone biomedical databases
Pith reviewed 2026-05-21 17:17 UTC · model grok-4.3
The pith
By reducing error-prone exposures to their first principal component and sampling its extremes, a single two-phase validation design can improve efficiency for estimating parameters in several models at the same time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that extreme tail sampling performed on the first principal component of the error-prone exposure matrix produces a validation subsample that simultaneously increases statistical efficiency for estimating parameters in multiple models of interest. This PCA-extended approach outperforms standard methods when the goal is to balance performance across competing analyses, and it continues to work well even when the measurement errors in the exposures are correlated or have different variances.
What carries the argument
The first principal component of the error-prone exposures, which reduces the multi-model sampling problem to selecting extreme values along this one-dimensional summary of variability before validation.
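The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `pc1_extreme_tail_sample`, the column standardization, the equal split between the two tails, and the simulated exposure matrix are all assumptions made here.

```python
import numpy as np

def pc1_extreme_tail_sample(x_star, n_validate):
    """Return sorted row indices of the subjects with the most extreme PC1 scores."""
    x = np.asarray(x_star, dtype=float)
    if not 0 < n_validate <= x.shape[0]:
        raise ValueError("n_validate must be between 1 and the number of rows")
    # Assumption: standardize columns so no exposure dominates through its units.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    scores = z @ vt[0]
    order = np.argsort(scores)
    # Assumption: split the validation budget evenly between the two tails.
    n_low = n_validate // 2
    n_high = n_validate - n_low
    return np.sort(np.concatenate([order[:n_low], order[-n_high:]]))

# Hypothetical Phase I data: three correlated error-prone exposures.
rng = np.random.default_rng(0)
cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.3], [0.3, 0.3, 1.0]]
x_star = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=1000)
selected = pc1_extreme_tail_sample(x_star, 100)
print(selected[:5])
```

Because the sign of a principal direction is arbitrary, the sketch takes both tails, so the selected set does not depend on which sign the decomposition returns.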
If this is right
- Validation resources can be allocated to support multiple primary and secondary analyses at the same time.
- The method scales to high-dimensional data because PCA compresses the exposure information into one key direction.
- Efficiency gains persist when exposures have correlated errors or heterogeneous error structures.
- Researchers no longer need to optimize the sampling design for one model at the expense of the others.
Where Pith is reading between the lines
- If the first principal component explains little of the relevant variation for some models, incorporating model-specific residuals or additional components could further improve results.
- This sampling strategy may apply to other two-phase designs or to settings with missing data beyond measurement error.
- Combining the PCA summary with outcome information, if available in phase I, might yield even larger gains but would require careful handling to avoid bias.
Load-bearing premise
The first principal component of the error-prone exposures captures the main directions of variability that drive efficiency improvements for all models considered.
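One rough way to probe this premise before committing validation resources is to check how much of the exposures' variability the first principal component carries, and how each exposure loads on it. The sketch below is an assumed diagnostic, not something taken from the paper; `pc1_summary`, the standardization, and both simulated data sets are hypothetical.

```python
import numpy as np

def pc1_summary(x_star):
    """Return (share of total variance explained by PC1, PC1 loadings)."""
    x = np.asarray(x_star, dtype=float)
    # Assumption: PCA on standardized columns (i.e., on the correlation matrix).
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    # Squared singular values are proportional to the component variances.
    share = s[0] ** 2 / np.sum(s ** 2)
    return share, vt[0]

rng = np.random.default_rng(1)
shared = rng.normal(size=(500, 1))
# Four exposures driven by one common factor: PC1 should dominate.
correlated = shared + 0.5 * rng.normal(size=(500, 4))
# Four unrelated exposures: PC1 should explain little more than 1/4.
independent = rng.normal(size=(500, 4))

share_corr, loadings_corr = pc1_summary(correlated)
share_ind, _ = pc1_summary(independent)
print(f"PC1 share, correlated exposures:  {share_corr:.2f}")
print(f"PC1 share, independent exposures: {share_ind:.2f}")
```

A low share, or a near-zero loading for an exposure that matters to one of the models, would be a warning sign that the premise may not hold for that model.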
What would settle it
A simulation study in which the key predictors for different models lie in orthogonal directions of the exposure space, checking whether the PCA-based sample still delivers efficiency gains for every model compared to random sampling.
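A design-stage version of such a stress test can be sketched as follows. It is a hedged toy example, not the paper's simulation: the covariance matrix, error standard deviation, and sample sizes are hypothetical, and the centered sum of squares of the true exposure in the validated subset is used only as a proxy for the information about a linear-model slope. The third exposure is constructed to be uncorrelated with the first two, so it lies in a direction that the first principal component largely ignores.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_phase1, n_phase2 = 5000, 500  # hypothetical study sizes

# Exposures 1 and 2 are correlated; exposure 3 is uncorrelated with both.
cov = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
x_true = rng.multivariate_normal(np.zeros(3), cov, size=n_phase1)
# Assumed additive, independent measurement error.
x_star = x_true + rng.normal(scale=0.5, size=x_true.shape)

# PC1 of the standardized error-prone exposures, then both tails.
z = (x_star - x_star.mean(axis=0)) / x_star.std(axis=0, ddof=1)
_, _, vt = np.linalg.svd(z, full_matrices=False)
order = np.argsort(z @ vt[0])
half = n_phase2 // 2
ets_idx = np.concatenate([order[:half], order[-half:]])
# Simple random sample of the same size as the comparator.
srs_idx = rng.choice(n_phase1, size=n_phase2, replace=False)

def centered_ss(x, idx):
    """Centered sum of squares of each column within the selected rows."""
    sub = x[idx]
    return np.sum((sub - sub.mean(axis=0)) ** 2, axis=0)

# Ratios above 1 favor PC1 tail sampling for that exposure.
ratio = centered_ss(x_true, ets_idx) / centered_ss(x_true, srs_idx)
print(np.round(ratio, 2))
```

In this configuration the ratio should be well above 1 for the two correlated exposures and close to 1 for the third, which is the pattern the proposed check is meant to look for.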
Original abstract
Two-phase sampling offers a cost-effective way to validate error-prone covariate measurements in biomedical databases. Inexpensive or easy-to-obtain information is collected for the entire study in Phase I. Then, a subset of patients undergoes cost-intensive validation (e.g., expert chart review) to collect more accurate data in Phase II. When balancing primary and secondary analyses, competing models and priorities can result in poorly defined objectives for the most informative Phase II sampling criterion. Extreme tail sampling (ETS), wherein patients with the smallest and largest values of a particular quantity (like a covariate or residual) are selected, can offer great statistical efficiency in two-phase studies when focusing on a single analytic objective by targeting observations with the biggest contributions to the Fisher information. We propose an intuitive, easy-to-use approach that extends ETS to balance and prioritize explaining the largest amount of variability across multiple models of interest. Using principal components analysis, we succinctly summarize the inherent variability of all models' error-prone exposures. Then, we sample patients with the most extreme values of the first principal component for validation. Through extensive simulations and an application to the National Health and Nutrition Examination Survey (NHANES), the proposed strategy offered simultaneous efficiency gains across multiple models of interest. Its advantages persisted across various real-world scenarios, including correlated or heterogeneous measurement error. When designing a validation study, concentrating on a single model may be short-sighted. Strategically allocating resources more broadly balances multiple analytical goals simultaneously. Employing dimension reduction before sampling will allow this strategy to scale up well to big-data applications with many error-prone exposures.
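The single-objective intuition behind ETS can be made concrete in the simplest case. As a hedged illustration (simple linear regression of an outcome on one exposure $X$ with homoscedastic errors of variance $\sigma^2$, which is an assumption made here and not the paper's general setting), the variance of the slope estimated from a validated subset $\mathcal{V}$ is:

```latex
\operatorname{Var}\!\left(\hat{\beta}_1 \mid X\right)
  = \frac{\sigma^2}{\sum_{i \in \mathcal{V}} \left(X_i - \bar{X}_{\mathcal{V}}\right)^2},
\qquad
\bar{X}_{\mathcal{V}} = \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} X_i .
```

Choosing subjects from both tails of $X$ inflates the denominator relative to a random subset of the same size, which is the sense in which extreme observations carry more information about the slope.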
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes extending extreme tail sampling (ETS) for two-phase validation studies by applying principal components analysis (PCA) to the matrix of error-prone exposures from Phase I data, then selecting Phase II validation samples at the extremes of the first principal component. This is claimed to deliver simultaneous efficiency gains for multiple models of interest, with advantages persisting under correlated or heterogeneous measurement error, as shown in simulations and an NHANES application. The approach is positioned as scalable to big-data settings with many exposures by using dimension reduction to balance competing analytic objectives.
Significance. If the central claim holds, the method would provide a practical, unsupervised way to allocate limited validation resources across multiple models without requiring a single explicit multi-objective criterion, extending the single-model efficiency of ETS. The simulations and real-data application offer empirical support for robustness in realistic error scenarios, which could inform design of validation studies in biomedical databases where secondary analyses are common.
major comments (1)
- [Abstract, PCA extension of ETS] Abstract (paragraph on PCA extension of ETS): the proposal assumes that extremes of the first principal component of the error-prone exposures will contribute substantially to the score functions or information matrices of every model under consideration. Because PCA is unsupervised and maximizes marginal variance of the exposures alone, the leading eigenvector need not align with directions relevant to the outcome(s) or to parameters of secondary models; this is especially plausible under heterogeneous measurement error or model-specific covariate effects. The simulations report efficiency gains, but without an explicit link to the multi-model efficiency criterion or a comparison against an information-optimal multi-model sampler, it is unclear whether the observed gains are general or specific to the simulated configurations.
minor comments (2)
- [Abstract] The abstract and methods description would benefit from explicit statements of the number of models, number of exposures, and sample sizes used in the simulations, as well as the precise definition of efficiency gain (e.g., relative variance reduction for each parameter).
- [Methods] Notation for the first principal component and the sampling rule (e.g., how many subjects are selected from each tail) should be introduced with an equation or clear algorithmic step to aid reproducibility.
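One way the requested notation could look is sketched below. The symbols are hypothetical and not taken from the paper: $N$ is the Phase I sample size, $n$ the Phase II validation budget, and the even split between tails is an assumption.

```latex
% Hypothetical notation: z_i^* is the standardized vector of error-prone
% exposures for subject i, and \hat{R} is their Phase I sample correlation matrix.
\mathbf{v}_1 = \arg\max_{\|\mathbf{v}\| = 1} \mathbf{v}^\top \hat{R}\, \mathbf{v},
\qquad
T_i = \mathbf{v}_1^\top \mathbf{z}_i^*, \quad i = 1, \dots, N.

% Assumed sampling rule: validate the n/2 smallest and n/2 largest scores,
% where r_i is the rank of T_i among T_1, \dots, T_N.
V_i = \mathbb{1}\!\left\{ r_i \le n/2 \ \text{ or } \ r_i > N - n/2 \right\}.
```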
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which help clarify the scope and limitations of our proposed PCA-based extension of extreme tail sampling. We address the major comment below and have revised the manuscript to better articulate the method's rationale, empirical support, and comparison to alternatives.
Referee: [Abstract, PCA extension of ETS] Abstract (paragraph on PCA extension of ETS): the proposal assumes that extremes of the first principal component of the error-prone exposures will contribute substantially to the score functions or information matrices of every model under consideration. Because PCA is unsupervised and maximizes marginal variance of the exposures alone, the leading eigenvector need not align with directions relevant to the outcome(s) or to parameters of secondary models; this is especially plausible under heterogeneous measurement error or model-specific covariate effects. The simulations report efficiency gains, but without an explicit link to the multi-model efficiency criterion or a comparison against an information-optimal multi-model sampler, it is unclear whether the observed gains are general or specific to the simulated configurations.
Authors: We agree that PCA is unsupervised and does not explicitly target the score functions or information matrices of the models of interest, so alignment is not guaranteed in all settings. Our approach is motivated by the practical observation that, in biomedical data with correlated error-prone exposures, the leading principal component frequently captures shared directions of variation that contribute to efficiency across multiple models. The simulations explicitly include heterogeneous measurement error structures and model-specific covariate effects, and efficiency gains were observed consistently in those cases. To strengthen the link to multi-model criteria, the revised manuscript adds a new subsection in the Methods and a corresponding simulation comparison against an information-optimal multi-model sampler (constructed as a weighted sum of expected information matrices). This shows that PCA-ETS achieves comparable gains while remaining simpler to implement and not requiring Phase I outcome data. We have also revised the Abstract to describe the method as a scalable heuristic whose performance is validated empirically rather than claimed to be universally optimal.
revision: yes
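The comparator mentioned in the rebuttal is described only in words. A generic form of such a criterion is sketched below under stated assumptions: the notation is hypothetical, the weights are chosen by the analyst, and the sum is only well defined if the $K$ information matrices share a dimension (or if each is first reduced to the entry for the exposure coefficient of interest).

```latex
% Hypothetical notation: \mathcal{I}_k is the expected information for model k's
% parameters \theta_k under Phase II selection probabilities \pi.
\mathcal{I}_w(\pi) = \sum_{k=1}^{K} w_k \, \mathcal{I}_k(\theta_k; \pi),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1 .
```

Unlike PC1 tail sampling, evaluating a criterion of this kind generally requires model-specific quantities, which is consistent with the rebuttal's remark that the PCA-based rule is simpler to implement.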
Circularity Check
No circularity: PCA-based sampling criterion is defined directly from Phase I observations
The proposed method computes the first principal component directly from the matrix of observed error-prone exposures in Phase I data and selects validation samples at its extremes. This construction is an explicit, unsupervised dimension-reduction step with no algebraic reduction to fitted parameters, outcome-dependent quantities, or self-referential definitions. Efficiency gains are demonstrated through separate simulation studies and NHANES application rather than by construction. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the derivation chain; the approach extends ETS by a transparent preprocessing step whose validity is assessed externally.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Standard assumptions on measurement error structure in covariates (e.g., additive or multiplicative error, possibly correlated or heterogeneous)
- domain assumption: The first principal component captures sufficient shared variability across the models of interest for the sampling criterion to be effective
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose an intuitive, easy-to-use approach that extends ETS to balance and prioritize explaining the largest amount of variability across multiple models of interest. Using principal components analysis, we succinctly summarize the inherent variability of all models' error-prone exposures. Then, we sample patients with the most extreme values of the first principal component for validation."
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The ETS-PC$^*_1$ design's ability to reduce the total coefficient variability across all models depends on two characteristics of the error-prone exposure data."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Supplementary Materials
- Additional appendices, tables, and figures: The supplemental figures and tables referenced in Sections 3–4 are available online at ht...