Two-phase validation sampling via principal components to improve efficiency in multi-model estimation from error-prone biomedical databases

Cole Manschot; Sarah C. Lotspeich

arxiv: 2512.02182 · v3 · pith:5KNBKHIAnew · submitted 2025-12-01 · 📊 stat.ME · stat.AP

Pith reviewed 2026-05-21 17:17 UTC · model grok-4.3

keywords: two-phase sampling · principal components analysis · measurement error · validation sampling · statistical efficiency · multi-model estimation · biomedical databases · NHANES

The pith

By reducing error-prone exposures to their first principal component and sampling its extremes, a single two-phase validation design can improve efficiency for estimating parameters in several models at the same time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Two-phase sampling validates error-prone measurements in large biomedical databases by collecting accurate data on a chosen subset of subjects. When researchers have several models to fit, it becomes unclear which observations to prioritize for the costly validation step. The authors extend extreme tail sampling by first applying principal components analysis to the error-prone exposures, identifying the single direction of greatest variability across all models, and then selecting subjects with the most extreme values on that first component. Simulations and an application to NHANES data show that this produces simultaneous efficiency gains for multiple models, even when measurement errors are correlated or vary across variables. The approach offers a simple way to allocate limited validation resources when several analytic goals must be balanced.

Core claim

The paper establishes that extreme tail sampling performed on the first principal component of the error-prone exposure matrix produces a validation subsample that simultaneously increases statistical efficiency for estimating parameters in multiple models of interest. This PCA-extended approach outperforms standard methods when the goal is to balance performance across competing analyses, and it continues to work well even when the measurement errors in the exposures are correlated or have different variances.

What carries the argument

The first principal component of the error-prone exposures, which reduces the multi-model sampling problem to selecting extreme values along this one-dimensional summary of variability before validation.
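
The selection rule described here can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the column standardization, the even split between the two tails, and the toy data-generating process are all assumptions made for the example.

```python
import numpy as np

def first_pc_scores(x_star):
    """Score each row of x_star (N x p) on its first principal component."""
    centered = x_star - x_star.mean(axis=0)
    # Assumption: standardize columns so no exposure dominates because of its units.
    scaled = centered / centered.std(axis=0, ddof=1)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    return scaled @ vt[0]

def extreme_tail_sample(scores, n_validate):
    """Indices of the n_validate most extreme scores, split across both tails."""
    order = np.argsort(scores)
    n_low = n_validate // 2          # assumption: half from the lower tail
    n_high = n_validate - n_low      # remainder from the upper tail
    return np.concatenate([order[:n_low], order[len(order) - n_high:]])

# Toy Phase I data (hypothetical): three exposures sharing one latent factor,
# each observed with additive error of variance 0.5.
rng = np.random.default_rng(0)
N, p = 1000, 3
shared = rng.normal(size=(N, 1))
x_true = shared + rng.normal(scale=0.5, size=(N, p))
x_star = x_true + rng.normal(scale=np.sqrt(0.5), size=(N, p))

scores = first_pc_scores(x_star)
validated = extreme_tail_sample(scores, n_validate=100)
print(len(validated), "subjects selected for Phase II validation")
```

Only the error-prone exposures enter the rule, so it can be applied before any Phase II data exist. Whether to standardize before PCA is a design choice the summary above does not settle.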

If this is right

Validation resources can be allocated to support multiple primary and secondary analyses at the same time.
The method scales to high-dimensional data because PCA compresses the exposure information into one key direction.
Efficiency gains persist when exposures have correlated errors or heterogeneous error structures.
Researchers no longer need to choose one model to optimize the sampling design at the expense of others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the first principal component explains little of the relevant variation for some models, incorporating model-specific residuals or additional components could further improve results.
This sampling strategy may apply to other two-phase designs or to settings with missing data beyond measurement error.
Combining the PCA summary with outcome information, if available in phase I, might yield even larger gains but would require careful handling to avoid bias.
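
One way to check the first point before committing to a design is to look at how much variance the first component explains and how strongly each error-prone exposure correlates with it. The sketch below is a hypothetical diagnostic, not something taken from the paper; `pc1_diagnostics` and the toy data are assumptions for illustration.

```python
import numpy as np

def pc1_diagnostics(x_star):
    """Return (share of variance explained by PC1, correlation of each exposure with PC1)."""
    z = (x_star - x_star.mean(axis=0)) / x_star.std(axis=0, ddof=1)
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s**2 / np.sum(s**2)   # squared singular values are proportional to PC variances
    scores = z @ vt[0]
    corr = np.array([np.corrcoef(z[:, j], scores)[0, 1] for j in range(z.shape[1])])
    return explained[0], corr

# Hypothetical data: exposures 1 and 2 share a latent factor, exposure 3 is unrelated.
rng = np.random.default_rng(2)
N = 2000
shared = rng.normal(size=N)
x_star = np.column_stack([
    shared + 0.5 * rng.normal(size=N),
    shared + 0.5 * rng.normal(size=N),
    rng.normal(size=N),
])

share, corr = pc1_diagnostics(x_star)
print(f"PC1 explains {share:.0%} of the variance")
print("correlation of each exposure with PC1:", np.round(corr, 2))
```

An exposure whose correlation with the first component is near zero (the sign itself is arbitrary) is one whose model is unlikely to benefit from sampling on that component alone.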

Load-bearing premise

The first principal component of the error-prone exposures captures the main directions of variability that drive efficiency improvements for all models considered.

What would settle it

A simulation study in which the key predictors for different models lie in orthogonal directions of the exposure space, checking whether the PCA-based sample still delivers efficiency gains for every model compared to random sampling.
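
A stripped-down version of such a check might look like the following. It is a sketch under strong assumptions, not the paper's simulation design: there are two independent exposures, the PCA is unstandardized, and efficiency is proxied by the centered sum of squares of each true exposure in the validated subset, which is what drives the slope variance in a complete-case linear regression. The authors' estimator may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 1000, 100

# Two independent true exposures with different spreads (hypothetical setup),
# each observed with additive error of variance 0.5.
x_true = rng.normal(size=(N, 2)) * np.array([2.0, 1.0])
x_star = x_true + rng.normal(scale=np.sqrt(0.5), size=(N, 2))

# Unstandardized PCA, so PC1 tracks the higher-variance exposure.
centered = x_star - x_star.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt[0]

# Extreme tail sample on PC1 versus a simple random sample of the same size.
order = np.argsort(scores)
ets = np.concatenate([order[: n // 2], order[N - n // 2 :]])
srs = rng.choice(N, size=n, replace=False)

def sxx(x):
    """Centered sum of squares per column, a proxy for slope information."""
    return np.sum((x - x.mean(axis=0)) ** 2, axis=0)

ratio = sxx(x_true[ets]) / sxx(x_true[srs])
print("information ratio (ETS on PC1 / SRS) per exposure:", np.round(ratio, 2))
```

In this configuration one would expect a ratio well above 1 for the exposure aligned with the first component and a ratio near 1 for the orthogonal exposure, which is the pattern the proposed test is meant to expose.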

Figures

Figures reproduced from arXiv: 2512.02182 by Cole Manschot, Sarah C. Lotspeich.

**Figure 2.** Plot of all models' true exposure Xj against the first principal component PC∗1 summarizing all the error-prone exposures X∗, focusing on the moderate error setting (error variance = 0.5). Each plot contains the same N = 1000 simulated observations, and the points are colored by their validation status based on extreme tail sampling on the first principal component PC∗1 (ETS-PC∗1) versus extreme ta…

**Figure 3.** Simulation results comparing the empirical total coefficient variability across all models

**Figure 4.** Simulation results comparing the empirical efficiency under simple random sampling (SRS), extreme tail

**Figure 5.** Simulation results comparing the empirical total coefficient variability across all models (A) and empirical

**Figure 6.** Coefficient estimates and corresponding 95% confidence intervals (A) and relative widths of 95% confidence intervals for β1j, corresponding to each of the five nutrient intake exposures (j ∈ {1, …, 5}) from the five diet-driven health outcome models (B). In addition to the gold standard analysis (with all individuals having validation data), we considered partial validation with simple random samplin…

read the original abstract

Two-phase sampling offers a cost-effective way to validate error-prone covariate measurements in biomedical databases. Inexpensive or easy-to-obtain information is collected for the entire study in Phase I. Then, a subset of patients undergoes cost-intensive validation (e.g., expert chart review) to collect more accurate data in Phase II. When balancing primary and secondary analyses, competing models and priorities can result in poorly defined objectives for the most informative Phase II sampling criterion. Extreme tail sampling (ETS), wherein patients with the smallest and largest values of a particular quantity (like a covariate or residual) are selected, can offer great statistical efficiency in two-phase studies when focusing on a single analytic objective by targeting observations with the biggest contributions to the Fisher information. We propose an intuitive, easy-to-use approach that extends ETS to balance and prioritize explaining the largest amount of variability across multiple models of interest. Using principal components analysis, we succinctly summarize the inherent variability of all models' error-prone exposures. Then, we sample patients with the most extreme values of the first principal component for validation. Through extensive simulations and an application to the National Health and Nutrition Examination Survey (NHANES), the proposed strategy offered simultaneous efficiency gains across multiple models of interest. Its advantages persisted across various real-world scenarios, including correlated or heterogeneous measurement error. When designing a validation study, concentrating on a single model may be short-sighted. Strategically allocating resources more broadly balances multiple analytical goals simultaneously. Employing dimension reduction before sampling will allow this strategy to scale up well to big-data applications with many error-prone exposures.
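
The remark about Fisher information can be made concrete with the textbook case of a simple linear regression fitted to the validated subset. The display below is a standard result given as an illustration, not a derivation from the paper; the symbols V (the set of validated subjects) and X̄_V (their mean exposure) are notation introduced here.

```latex
% Requires amsmath. Illustrative textbook case, not the paper's estimator.
\begin{align*}
Y_i &= \beta_0 + \beta_1 X_i + \varepsilon_i,
  \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad i \in V, \\
\operatorname{Var}\bigl(\hat{\beta}_1 \mid X\bigr)
  &= \frac{\sigma^2}{\sum_{i \in V} \bigl(X_i - \bar{X}_V\bigr)^2}.
\end{align*}
```

For a fixed number of validated subjects, the variance shrinks as the sum of squared deviations grows, which is why choosing subjects with the smallest and largest exposure values tends to be efficient for a single model.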

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This extends extreme tail sampling to multiple models via PCA on error-prone exposures, with simulations and NHANES data showing efficiency gains, though the unsupervised PC may not always hit the information directions that matter.


The main takeaway is that this paper gives a practical extension of extreme tail sampling for two-phase validation when several models need to be considered together. They run PCA on the matrix of error-prone exposures, then sample the extremes of the first principal component for the validation phase. This avoids having to choose one primary objective and aims to deliver efficiency improvements across the set at once. The simulations cover a range of error structures including correlated and heterogeneous measurement error, and the NHANES application shows the gains holding up for the models they examined. That combination of a simple method plus concrete empirical checks is the part that stands out as useful.

The approach is new in the way it layers dimension reduction onto the existing ETS framework to handle multiple analytic goals, and it is easy enough to implement that it could see use in real database studies.

A soft spot is the reliance on the first PC to capture directions relevant to efficiency for every model. PCA maximizes marginal variance in the exposures without reference to the outcome or the specific parameters, so in settings with model-specific effects or varying error patterns the leading direction could end up weakly related to the score contributions that actually drive information gains. The paper reports that advantages persisted across the scenarios they tested, which suggests the chosen setups avoided the worst mismatch, but more explicit checks linking the PC to the multi-model information criterion would strengthen the case.

This is aimed at biostatisticians and epidemiologists who design validation sampling in large error-prone biomedical databases and have to balance several research questions with limited Phase II resources. Readers who already use or adapt ETS for single objectives will see the most direct value.

I would send it to peer review. The core idea is clear, the supporting simulations and application are there, and the alignment question is one referees can address with targeted suggestions rather than a load-bearing flaw.

Referee Report

1 major / 2 minor

Summary. The paper proposes extending extreme tail sampling (ETS) for two-phase validation studies by applying principal components analysis (PCA) to the matrix of error-prone exposures from Phase I data, then selecting Phase II validation samples at the extremes of the first principal component. This is claimed to deliver simultaneous efficiency gains for multiple models of interest, with advantages persisting under correlated or heterogeneous measurement error, as shown in simulations and an NHANES application. The approach is positioned as scalable to big-data settings with many exposures by using dimension reduction to balance competing analytic objectives.

Significance. If the central claim holds, the method would provide a practical, unsupervised way to allocate limited validation resources across multiple models without requiring a single explicit multi-objective criterion, extending the single-model efficiency of ETS. The simulations and real-data application offer empirical support for robustness in realistic error scenarios, which could inform design of validation studies in biomedical databases where secondary analyses are common.

major comments (1)

[Abstract, PCA extension of ETS] Abstract (paragraph on PCA extension of ETS): the proposal assumes that extremes of the first principal component of the error-prone exposures will contribute substantially to the score functions or information matrices of every model under consideration. Because PCA is unsupervised and maximizes marginal variance of the exposures alone, the leading eigenvector need not align with directions relevant to the outcome(s) or to parameters of secondary models; this is especially plausible under heterogeneous measurement error or model-specific covariate effects. The simulations report efficiency gains, but without an explicit link to the multi-model efficiency criterion or a comparison against an information-optimal multi-model sampler, it is unclear whether the observed gains are general or specific to the simulated configurations.

minor comments (2)

[Abstract] The abstract and methods description would benefit from explicit statements of the number of models, number of exposures, and sample sizes used in the simulations, as well as the precise definition of efficiency gain (e.g., relative variance reduction for each parameter).
[Methods] Notation for the first principal component and the sampling rule (e.g., how many subjects are selected from each tail) should be introduced with an equation or clear algorithmic step to aid reproducibility.
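
As an illustration of the kind of statement the second minor comment asks for, the rule could be written as follows. This is a hypothetical formalization: the symbols R_i and V_i, the use of the sample covariance matrix, and the equal split of n/2 subjects per tail (with n even) are assumptions, not notation confirmed by the paper.

```latex
% Requires amsmath and amssymb. Hypothetical notation for the ETS-PC*_1 rule.
\begin{align*}
\mathbf{v}_1 &= \operatorname*{arg\,max}_{\lVert \mathbf{v} \rVert = 1}
  \mathbf{v}^{\top} \widehat{\boldsymbol{\Sigma}}^{*} \mathbf{v},
  && \widehat{\boldsymbol{\Sigma}}^{*} \text{ the sample covariance of } \mathbf{X}^{*}, \\
PC^{*}_{1i} &= \mathbf{v}_1^{\top} \bigl(\mathbf{X}^{*}_i - \bar{\mathbf{X}}^{*}\bigr),
  && i = 1, \dots, N, \\
V_i &= \mathbb{1}\{R_i \le n/2\} + \mathbb{1}\{R_i > N - n/2\},
  && R_i \text{ the rank of } PC^{*}_{1i} \text{ among all } N \text{ subjects}.
\end{align*}
```

Here V_i = 1 marks subject i for Phase II validation, so exactly n subjects are selected whenever n ≤ N.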

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful comments, which help clarify the scope and limitations of our proposed PCA-based extension of extreme tail sampling. We address the major comment below and have revised the manuscript to better articulate the method's rationale, empirical support, and comparison to alternatives.


Referee: [Abstract, PCA extension of ETS] Abstract (paragraph on PCA extension of ETS): the proposal assumes that extremes of the first principal component of the error-prone exposures will contribute substantially to the score functions or information matrices of every model under consideration. Because PCA is unsupervised and maximizes marginal variance of the exposures alone, the leading eigenvector need not align with directions relevant to the outcome(s) or to parameters of secondary models; this is especially plausible under heterogeneous measurement error or model-specific covariate effects. The simulations report efficiency gains, but without an explicit link to the multi-model efficiency criterion or a comparison against an information-optimal multi-model sampler, it is unclear whether the observed gains are general or specific to the simulated configurations.

Authors: We agree that PCA is unsupervised and does not explicitly target the score functions or information matrices of the models of interest, so alignment is not guaranteed in all settings. Our approach is motivated by the practical observation that, in biomedical data with correlated error-prone exposures, the leading principal component frequently captures shared directions of variation that contribute to efficiency across multiple models. The simulations explicitly include heterogeneous measurement error structures and model-specific covariate effects, and efficiency gains were observed consistently in those cases. To strengthen the link to multi-model criteria, the revised manuscript adds a new subsection in the Methods and a corresponding simulation comparison against an information-optimal multi-model sampler (constructed as a weighted sum of expected information matrices). This shows that PCA-ETS achieves comparable gains while remaining simpler to implement and not requiring Phase I outcome data. We have also revised the Abstract to describe the method as a scalable heuristic whose performance is validated empirically rather than claimed to be universally optimal.

Revision: yes

Circularity Check

0 steps flagged

No circularity: PCA-based sampling criterion is defined directly from Phase I observations

full rationale

The proposed method computes the first principal component directly from the matrix of observed error-prone exposures in Phase I data and selects validation samples at its extremes. This construction is an explicit, unsupervised dimension-reduction step with no algebraic reduction to fitted parameters, outcome-dependent quantities, or self-referential definitions. Efficiency gains are demonstrated through separate simulation studies and NHANES application rather than by construction. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the derivation chain; the approach extends ETS by a transparent preprocessing step whose validity is assessed externally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard measurement error models and sampling assumptions common to two-phase studies, with the novel element being the PCA-based selection rule rather than new entities or heavily fitted parameters.

axioms (2)

domain assumption Standard assumptions on measurement error structure in covariates (e.g., additive or multiplicative error, possibly correlated or heterogeneous)
Invoked when claiming advantages persist under correlated or heterogeneous measurement error scenarios.
domain assumption The first principal component captures sufficient shared variability across the models of interest for the sampling criterion to be effective
Central to the proposed extension of ETS; location in abstract description of PCA summarization.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We propose an intuitive, easy-to-use approach that extends ETS to balance and prioritize explaining the largest amount of variability across multiple models of interest. Using principal components analysis, we succinctly summarize the inherent variability of all models' error-prone exposures. Then, we sample patients with the most extreme values of the first principal component for validation.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

Relation between the paper passage and the cited Recognition theorem.

The ETS-PC∗1 design's ability to reduce the total coefficient variability across all models depends on two characteristics of the error-prone exposure data.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

[1]

Use of EHRs data for clinical research: historical progress and current applications.Learning Health Systems.2019;3(1):e10076

Nordo AH, Levaux HP, Becnel LB, et al. Use of EHRs data for clinical research: historical progress and current applications.Learning Health Systems.2019;3(1):e10076

work page 2019
[2]

Electronic Health Records in Epidemiology: Appropriate Questions, Common Biases, and Potential Sensitivity Analyses.Current Epidemiology Results.2025;12(11)

Goldstein N. Electronic Health Records in Epidemiology: Appropriate Questions, Common Biases, and Potential Sensitivity Analyses.Current Epidemiology Results.2025;12(11)

work page 2025
[3]

The impact of data quality and source data verification on epidemio- logic inference: a practical application using HIV observational data.BMC Public Health.2019;19(1):1748

Giganti MJ, Shepherd BE, Caro-Vega Y , et al. The impact of data quality and source data verification on epidemio- logic inference: a practical application using HIV observational data.BMC Public Health.2019;19(1):1748

work page 2019
[4]

Measuring the quality of observational study data in an international HIV research network.PloS One.2012;7(4):e33908

Duda S, Shepherd B, Gadd C, Masys D, McGowan C. Measuring the quality of observational study data in an international HIV research network.PloS One.2012;7(4):e33908

work page 2012
[5]

Lessons learned from over a decade of data audits in international observational HIV cohorts in Latin America and East Africa.Journal of Clinical and Translational Science

Lotspeich SC, Shepherd BE, Kariuki MA, et al. Lessons learned from over a decade of data audits in international observational HIV cohorts in Latin America and East Africa.Journal of Clinical and Translational Science. 2023;7(1):e245

work page 2023
[6]

Overcoming data challenges through enriched validation and targeted sampling to measure whole-person health in electronic health records.Journal of Biomedical Informatics

Lotspeich SC, Kedar S, Tahir R, et al. Overcoming data challenges through enriched validation and targeted sampling to measure whole-person health in electronic health records.Journal of Biomedical Informatics. 2025;170:104904

work page 2025
[7]

Multiwave validation sampling for error-prone electronic health records

Shepherd BE, Han K, Chen T, et al. Multiwave validation sampling for error-prone electronic health records. Biometrics.2023;79(3):2649-2663

work page 2023
[8]

Systems of protocol review, quality assurance, and data audit.Cancer Chemotherapy and Pharmacology

Weiss R. Systems of protocol review, quality assurance, and data audit.Cancer Chemotherapy and Pharmacology. 1998;42(Suppl):S88-S92

work page 1998
[9]

Two-phase analysis and study design for survival models with error-prone exposures.Statistical Methods in Medical Research.2020;30(3):857-874

Han K, Lumley T, Shepherd BE, Shaw PA. Two-phase analysis and study design for survival models with error-prone exposures.Statistical Methods in Medical Research.2020;30(3):857-874

work page 2020
[10]

Validation data-based adjustments for outcome misclassification in logistic regression: An illustration.Epidemiology.2011;22(4):589-597

Lyles RH, Tang L, Superak HM, et al. Validation data-based adjustments for outcome misclassification in logistic regression: An illustration.Epidemiology.2011;22(4):589-597

work page 2011
[11]

Binary Regression with Differentially Misclassified Response and Exposure Variables.Statistics in Medicine.2015;34(9):1605-1620

Tang L, Lyles RH, King CC, Celentano DD, Lo Y . Binary Regression with Differentially Misclassified Response and Exposure Variables.Statistics in Medicine.2015;34(9):1605-1620

work page 2015
[12]

A two stage design for the study of the relationship between a rare exposure and a rare disease.American Journal of Epidemiology.1982;115(1):119–128

White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease.American Journal of Epidemiology.1982;115(1):119–128

work page 1982
[13]

Amorim G, Tao R, Lotspeich S, Shaw PA, Lumley T, Shepherd BE. Two-phase sampling designs for data validation in settings with covariate measurement error and continuous outcome.Journal of the Royal Statistical Society Series A: Statistics in Society.2021;184(4):1368–1389

work page 2021
[14]

Optimal multiwave validation of secondary use data with outcome and exposure misclassification.Canadian Journal of Statistics.2023

Lotspeich SC, Amorim GG, Shaw PA, Tao R, Shepherd BE. Optimal multiwave validation of secondary use data with outcome and exposure misclassification.Canadian Journal of Statistics.2023

work page 2023
[15]

Response-dependent two-phase sampling designs for biomarker studies.Canadian Journal of Statistics.2014;42(2):268–284

McIsaac MA, Cook RJ. Response-dependent two-phase sampling designs for biomarker studies.Canadian Journal of Statistics.2014;42(2):268–284

work page 2014
[16]

Optimal designs of two-phase studies.Journal of the American Statistical Association

Tao R, Zeng D, Lin DY . Optimal designs of two-phase studies.Journal of the American Statistical Association. 2020;115(532):1946–1959

work page 2020
[17]

The use of extreme groups in assessing relationships.Psychometrika.1975;40(4):563– 572

Alf Jr EF, Abrahams NM. The use of extreme groups in assessing relationships.Psychometrika.1975;40(4):563– 572

work page 1975
[18]

Transmission-disequilibrium tests for quantitative traits..American Journal of Human Genetics

Allison DB. Transmission-disequilibrium tests for quantitative traits..American Journal of Human Genetics. 1997;60(3):676

work page 1997
[19]

The use of extreme groups to test for the presence of a relationship.Psychometrika.1961;26(3):307–316

Feldt LS. The use of extreme groups to test for the presence of a relationship.Psychometrika.1961;26(3):307–316

work page 1961
[20]

Quantitative trait analysis in sequencing studies under trait-dependent sampling

Lin DY , Zeng D, Tang ZZ. Quantitative trait analysis in sequencing studies under trait-dependent sampling. Proceedings of the National Academy of Sciences.2013;110(30):12247–12252

work page 2013
[21]

Extreme discordant sib pairs for mapping quantitative trait loci in humans.Science

Risch N, Zhang H. Extreme discordant sib pairs for mapping quantitative trait loci in humans.Science. 1995;268(5217):1584–1589

work page 1995
[22]

Optimal allocation in stratified cluster-based outcome-dependent sampling designs.Statistics in Medicine.2021;40(18):4090–4107

Sauer S, Hedt-Gauthier B, Haneuse S. Optimal allocation in stratified cluster-based outcome-dependent sampling designs.Statistics in Medicine.2021;40(18):4090–4107

work page 2021
[23]

Outcome-dependent sampling: an efficient sampling and inference procedure for studies with a continuous outcome..Epidemiology.2007;18(4):461–468

Zhou H, Chen J, Rissanen TH, et al. Outcome-dependent sampling: an efficient sampling and inference procedure for studies with a continuous outcome..Epidemiology.2007;18(4):461–468. 17 Two-phase validation sampling via principal componentsA PREPRINT

work page 2007
[24]

Efficient designs and analysis of two-phase studies with longitudinal binary data.Biometrics.2024;80(1):ujad010

Di Gravio C, Schildcrout JS, Tao R. Efficient designs and analysis of two-phase studies with longitudinal binary data.Biometrics.2024;80(1):ujad010

work page 2024
[25]

Two-phase designs with current status data.Statistics in Medicine.2023;42(8):1207–1232

Mao F, Cook RJ. Two-phase designs with current status data.Statistics in Medicine.2023;42(8):1207–1232

work page 2023
[26]

Multi-objective optimization

Deb K, Sindhya K, Hakanen J. Multi-objective optimization. In: , , CRC Press, 2016:161–200

work page 2016
[27]

Control C. f. D, Prevention . About the National Health and Nutrition Examination Survey.https://www.cdc. gov/nchs/nhanes/about/index.html; 2021

work page 2021
[28]

Chapter 1 - Dietary Assessment Methodology

Thompson FE, Subar AF. Chapter 1 - Dietary Assessment Methodology. In: Coulston AM, Boushey CJ, Ferruzzi MG, Delahanty LM., eds.Nutrition in the Prevention and Treatment of Disease (Fourth Edition), fourth edition ed., Academic Press, 2017:5-48

work page 2017
[29]

Regression with missing X’s: A review.Journal of the American Statistical Association

Little RJA. Regression with missing X’s: A review.Journal of the American Statistical Association. 1992;87(420):1227–1237

work page 1992
[30]

A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome.Biometrics.2002;58(2):413–421

Zhou H, Weaver MA, Qin J, Longnecker M, Wang M. A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome.Biometrics.2002;58(2):413–421

work page 2002
[31]

Optimal experiment design with applications to Pharmacokinetic modeling

Erdal MK, Plaxco KW, Gerson J, Kippin TE, Hespanha JP. Optimal experiment design with applications to Pharmacokinetic modeling. In: 2021:3072-3079

work page 2021
[32]

Adaptive sampling in two-phase designs: a biomarker study for progression in arthritis

McIsaac MA, Cook RJ. Adaptive sampling in two-phase designs: a biomarker study for progression in arthritis. Statistics in Medicine.2015;34(21):2899–2912

work page 2015
[33]

The ‘Why’ behind including ‘Y’ in your imputation model

D’Agostino McGowan L, Lotspeich SC, Hepler SA. The ‘Why’ behind including ‘Y’ in your imputation model. Stat Methods Med Res.2024;33(6):996–1020

work page 2024
[34]

mice: Multivariate Imputation by Chained Equations in R.Journal of Statistical Software.2011;45(3):1-67

van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R.Journal of Statistical Software.2011;45(3):1-67

work page 2011
[35]

Using the outcome for imputation of missing predictor values was preferred.Journal of Clinical Epidemiology.2006;59(10):1092–1101

Moons KG, Donders RA, Stijnen T, Harrell Jr FE. Using the outcome for imputation of missing predictor values was preferred.Journal of Clinical Epidemiology.2006;59(10):1092–1101

work page 2006
[36]

Rubin DB.Multiple imputation for nonresponse in surveys. 81. John Wiley & Sons, 2004

work page 2004
[37]

Two Methods to Estimate Fruit and Vegetable Intake of Adults, What We Eat in America, NHANES 2009-2010.The FASEB Journal.2015;29(S1):597.6

Hoy K, Goldman J, Moshfegh A. Two Methods to Estimate Fruit and Vegetable Intake of Adults, What We Eat in America, NHANES 2009-2010.The FASEB Journal.2015;29(S1):597.6

work page 2009
[38]

Department of Agriculture, Agricultural Research Service

U.S. Department of Agriculture, Agricultural Research Service . USDA Food and Nutrient Database for Dietary Studies 2021-2023.www.ars.usda.gov/nea/bhnrc/fsrg; 2024. [Online; accessed 9-August-2025]

work page 2021
[39]

Vitamin D and intestinal calcium absorption.Molecular and Cellular Endocrinology.2011;347(1):25-29

Christakos S, Dhawan P, Porta A, Mady LJ, Seth T. Vitamin D and intestinal calcium absorption.Molecular and Cellular Endocrinology.2011;347(1):25-29

work page 2011
[40]

Caffeine and cardiovascular health.Regulatory Toxicology and Pharmacology.2017;89:165-185

Turnbull D, Rodricks JV , Mariano GF, Chowdhury F. Caffeine and cardiovascular health.Regulatory Toxicology and Pharmacology.2017;89:165-185

work page 2017
[41]

A low-fat diet decreases high density lipoprotein (HDL) cholesterol levels by decreasing HDL apolipoprotein transport rates.The Journal of Clinical Investigation.1990;85(1):144–151

Brinton EA, Eisenberg S, Breslow JL. A low-fat diet decreases high density lipoprotein (HDL) cholesterol levels by decreasing HDL apolipoprotein transport rates.The Journal of Clinical Investigation.1990;85(1):144–151

work page 1990
[42]

The effect of alcohol consumption on insulin sensitivity and glycemic status: a systematic review and meta-analysis of intervention studies.Diabetes Care

Schrieks IC, Heil AL, Hendriks HF, Mukamal KJ, Beulens JW. The effect of alcohol consumption on insulin sensitivity and glycemic status: a systematic review and meta-analysis of intervention studies.Diabetes Care. 2015;38(4):723–732

work page 2015
[43]

Morris MS, Jacques PF, Rosenberg IH, Selhub J. Folate and vitamin B-12 status in relation to anemia, macrocytosis, and cognitive impairment in older Americans in the age of folic acid fortification.The American Journal of Clinical Nutrition.2007;85(1):193–200

work page 2007
[44]

Oxford university press, 2012

Willett W.Nutritional epidemiology. Oxford university press, 2012

work page 2012
[45]

Model selection and model averaging after multiple imputation.Computational Statistics & Data Analysis.2014;71:758–770

Schomaker M, Heumann C. Model selection and model averaging after multiple imputation.Computational Statistics & Data Analysis.2014;71:758–770

work page 2014
[46]

Tao R, Lotspeich SC, Amorim G, Shaw PA, Shepherd BE. Efficient semiparametric inference for two-phase studies with outcome and covariate measurement errors. Statistics in Medicine. 2021;40(3):725–738

Supplementary Materials • Additional appendices, tables, and figures: The supplemental figures and tables referenced in Sections 3–4 are available online at ht...
