
arxiv: 2512.02182 · v3 · pith:5KNBKHIAnew · submitted 2025-12-01 · 📊 stat.ME · stat.AP

Two-phase validation sampling via principal components to improve efficiency in multi-model estimation from error-prone biomedical databases

Pith reviewed 2026-05-21 17:17 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords two-phase sampling · principal components analysis · measurement error · validation sampling · statistical efficiency · multi-model estimation · biomedical databases · NHANES

The pith

By reducing error-prone exposures to their first principal component and sampling its extremes, a single two-phase validation design can improve efficiency for estimating parameters in several models at the same time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Two-phase sampling validates error-prone measurements in large biomedical databases by collecting accurate data on a chosen subset of subjects. When researchers have several models to fit, it becomes unclear which observations to prioritize for the costly validation step. The authors extend extreme tail sampling by first applying principal components analysis to the error-prone exposures, identifying the single direction of greatest variability across all models, and then selecting subjects with the most extreme values on that first component. Simulations and an application to NHANES data show that this produces simultaneous efficiency gains for multiple models, even when measurement errors are correlated or vary across variables. The approach offers a simple way to allocate limited validation resources when several analytic goals must be balanced.

Core claim

The paper establishes that extreme tail sampling performed on the first principal component of the error-prone exposure matrix produces a validation subsample that simultaneously increases statistical efficiency for estimating parameters in multiple models of interest. This PCA-extended approach outperforms standard methods when the goal is to balance performance across competing analyses, and it continues to work well even when the measurement errors in the exposures are correlated or have different variances.

What carries the argument

The first principal component of the error-prone exposures carries the argument: it reduces the multi-model sampling problem to selecting, before validation, the subjects with extreme values along a one-dimensional summary of variability.
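
The selection rule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `ets_pc1`, the even split of the validation budget between the two tails, the choice to center but not rescale the exposures, and the simulated data are all assumptions made for the example.

```python
import numpy as np


def ets_pc1(x_star, n_validate):
    """Pick Phase II subjects from both tails of the first principal component.

    x_star     : (N, p) array of error-prone exposures observed in Phase I.
    n_validate : number of subjects to validate (0 < n_validate <= N).
    Returns (sorted row indices of selected subjects, PC1 scores for all N).
    """
    x_star = np.asarray(x_star, dtype=float)
    n_total = x_star.shape[0]
    if not 0 < n_validate <= n_total:
        raise ValueError("n_validate must be between 1 and the number of rows")

    # Assumption: exposures are centered but not rescaled before PCA.
    centered = x_star - x_star.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = centered @ vt[0]

    # Assumption: split the validation budget evenly between the two tails.
    order = np.argsort(pc1)
    n_low = n_validate // 2
    n_high = n_validate - n_low
    selected = np.concatenate([order[:n_low], order[n_total - n_high:]])
    return np.sort(selected), pc1


# Hypothetical Phase I data: 1000 subjects, 3 exposures sharing one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 1))
x_star = latent + rng.normal(scale=0.7, size=(1000, 3))

selected, pc1 = ets_pc1(x_star, n_validate=200)
print(len(selected), pc1[selected].min(), pc1[selected].max())
```

Only Phase I exposures enter the rule, so the sign of the principal direction (which is arbitrary) does not affect which subjects are chosen.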

If this is right

  • Validation resources can be allocated to support multiple primary and secondary analyses at the same time.
  • The method scales to high-dimensional data because PCA compresses the exposure information into one key direction.
  • Efficiency gains persist when exposures have correlated errors or heterogeneous error structures.
  • Researchers no longer need to choose one model to optimize the sampling design at the expense of others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the first principal component explains little of the relevant variation for some models, incorporating model-specific residuals or additional components could further improve results.
  • This sampling strategy may apply to other two-phase designs or to settings with missing data beyond measurement error.
  • Combining the PCA summary with outcome information, if available in phase I, might yield even larger gains but would require careful handling to avoid bias.
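
Whether the first principal component "explains little" can be checked from Phase I data alone. The sketch below is one hedged way to do so: it reports the share of total variance carried by each component for two simulated exposure matrices. The function name `variance_share` and both data-generating setups are hypothetical, and no threshold for "too little" is implied by the paper.

```python
import numpy as np


def variance_share(x_star):
    """Proportion of total variance carried by each principal component."""
    x_star = np.asarray(x_star, dtype=float)
    centered = x_star - x_star.mean(axis=0)
    # Squared singular values are proportional to the component variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values**2 / np.sum(singular_values**2)


rng = np.random.default_rng(1)
n = 1000

# Hypothetical case 1: three exposures driven by one shared latent factor.
latent = rng.normal(size=(n, 1))
correlated = latent + rng.normal(scale=0.3, size=(n, 3))

# Hypothetical case 2: three unrelated exposures.
independent = rng.normal(size=(n, 3))

share_correlated = variance_share(correlated)
share_independent = variance_share(independent)
print(share_correlated.round(2), share_independent.round(2))
```

A first-component share close to 1/p for p exposures, as in the second case, suggests that a single direction summarizes the exposures poorly.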

Load-bearing premise

The first principal component of the error-prone exposures captures the main directions of variability that drive efficiency improvements for all models considered.

What would settle it

A simulation study in which the key predictors for different models lie in orthogonal directions of the exposure space, checking whether the PCA-based sample still delivers efficiency gains for every model compared to random sampling.
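
A first look at such a check might resemble the following sketch. It is a deliberately reduced proxy, not the simulation described above: it uses two independent exposures with unequal variances (an assumption that makes the first principal component line up with the first exposure) and compares the spread of each true exposure among validated subjects with its spread in the full cohort. For a simple linear regression fit to validated subjects only, slope variance is inversely proportional to that spread; the paper's estimators may use all Phase I data, so this indicates the design's behavior and does not measure estimator efficiency. All numeric settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_total, n_validate = 2000, 400

# Hypothetical setup: two independent true exposures, so the predictors of
# model 1 (x[:, 0]) and model 2 (x[:, 1]) lie in orthogonal directions.
x = rng.normal(size=(n_total, 2)) * np.array([2.0, 1.0])
x_star = x + rng.normal(scale=0.5, size=(n_total, 2))  # additive error

# First principal component of the error-prone exposures.
centered = x_star - x_star.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]

# Extreme tail sampling on PC1: half of the budget from each tail.
order = np.argsort(pc1)
half = n_validate // 2
ets = np.concatenate([order[:half], order[-half:]])

# Proxy for design efficiency: variance of each true exposure among validated
# subjects relative to the full cohort (what random sampling gives on average).
spread_ratio = x[ets].var(axis=0, ddof=1) / x.var(axis=0, ddof=1)
print(spread_ratio.round(2))
```

A ratio well above 1 for the first exposure and near 1 for the second would indicate that, in this configuration, the design concentrates its gains on the model aligned with the first principal component.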

Figures

Figures reproduced from arXiv: 2512.02182 by Cole Manschot, Sarah C. Lotspeich.

Figure 1: Plot of the primary model’s true exposure …
Figure 2: Plot of all models’ true exposure X_j against the first principal component PC*_1 summarizing all the error-prone exposures X*, focusing on the moderate error setting (error variance = 0.5). Each plot contains the same N = 1000 simulated observations, and the points are colored by their validation status based on extreme tail sampling on the first principal component PC*_1 (ETS-PC*_1) versus extreme ta…
Figure 3: Simulation results comparing the empirical total coefficient variability across all models …
Figure 4: Simulation results comparing the empirical efficiency under simple random sampling (SRS), extreme tail …
Figure 5: Simulation results comparing the empirical total coefficient variability across all models (A) and empirical …
Figure 6: Coefficient estimates and corresponding 95% confidence intervals (A) and relative widths of 95% confidence intervals for β_1j, corresponding to each of the five nutrient intake exposures (j ∈ {1, …, 5}) from the five diet-driven health outcome models (B). In addition to the gold standard analysis (with all individuals having validation data), we considered partial validation with simple random samplin…
Original abstract

Two-phase sampling offers a cost-effective way to validate error-prone covariate measurements in biomedical databases. Inexpensive or easy-to-obtain information is collected for the entire study in Phase I. Then, a subset of patients undergoes cost-intensive validation (e.g., expert chart review) to collect more accurate data in Phase II. When balancing primary and secondary analyses, competing models and priorities can result in poorly defined objectives for the most informative Phase II sampling criterion. Extreme tail sampling (ETS), wherein patients with the smallest and largest values of a particular quantity (like a covariate or residual) are selected, can offer great statistical efficiency in two-phase studies when focusing on a single analytic objective by targeting observations with the biggest contributions to the Fisher information. We propose an intuitive, easy-to-use approach that extends ETS to balance and prioritize explaining the largest amount of variability across multiple models of interest. Using principal components analysis, we succinctly summarize the inherent variability of all models' error-prone exposures. Then, we sample patients with the most extreme values of the first principal component for validation. Through extensive simulations and an application to the National Health and Nutrition Examination Survey (NHANES), the proposed strategy offered simultaneous efficiency gains across multiple models of interest. Its advantages persisted across various real-world scenarios, including correlated or heterogeneous measurement error. When designing a validation study, concentrating on a single model may be short-sighted. Strategically allocating resources more broadly balances multiple analytical goals simultaneously. Employing dimension reduction before sampling will allow this strategy to scale up well to big-data applications with many error-prone exposures.
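
The abstract's remark about contributions to the Fisher information can be illustrated with a textbook case. The display below is a standard result for simple linear regression with normal errors fit to n validated subjects; it is offered as an illustration under those assumptions and is not the paper's derivation.

```latex
% Simple linear regression: Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,
% \varepsilon_i \sim N(0, \sigma^2), fit to the n validated subjects.
\[
  \mathcal{I}(\beta_0, \beta_1)
  = \frac{1}{\sigma^2}
    \begin{pmatrix}
      n & \sum_{i=1}^{n} X_i \\
      \sum_{i=1}^{n} X_i & \sum_{i=1}^{n} X_i^2
    \end{pmatrix},
  \qquad
  \operatorname{Var}(\hat\beta_1)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar X)^2}.
\]
```

For a fixed n, choosing subjects whose X lies far from the mean enlarges the sum of squared deviations and so shrinks the variance of the slope estimate, which is the intuition behind sampling the tails.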

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes extending extreme tail sampling (ETS) for two-phase validation studies by applying principal components analysis (PCA) to the matrix of error-prone exposures from Phase I data, then selecting Phase II validation samples at the extremes of the first principal component. This is claimed to deliver simultaneous efficiency gains for multiple models of interest, with advantages persisting under correlated or heterogeneous measurement error, as shown in simulations and an NHANES application. The approach is positioned as scalable to big-data settings with many exposures by using dimension reduction to balance competing analytic objectives.

Significance. If the central claim holds, the method would provide a practical, unsupervised way to allocate limited validation resources across multiple models without requiring a single explicit multi-objective criterion, extending the single-model efficiency of ETS. The simulations and real-data application offer empirical support for robustness in realistic error scenarios, which could inform design of validation studies in biomedical databases where secondary analyses are common.

major comments (1)
  1. [Abstract, PCA extension of ETS] Abstract (paragraph on PCA extension of ETS): the proposal assumes that extremes of the first principal component of the error-prone exposures will contribute substantially to the score functions or information matrices of every model under consideration. Because PCA is unsupervised and maximizes marginal variance of the exposures alone, the leading eigenvector need not align with directions relevant to the outcome(s) or to parameters of secondary models; this is especially plausible under heterogeneous measurement error or model-specific covariate effects. The simulations report efficiency gains, but without an explicit link to the multi-model efficiency criterion or a comparison against an information-optimal multi-model sampler, it is unclear whether the observed gains are general or specific to the simulated configurations.
minor comments (2)
  1. [Abstract] The abstract and methods description would benefit from explicit statements of the number of models, number of exposures, and sample sizes used in the simulations, as well as the precise definition of efficiency gain (e.g., relative variance reduction for each parameter).
  2. [Methods] Notation for the first principal component and the sampling rule (e.g., how many subjects are selected from each tail) should be introduced with an equation or clear algorithmic step to aid reproducibility.
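
One way the notation requested in the second minor comment could look is sketched below. The symbols, the use of the sample covariance matrix, and the even split of an even Phase II sample size n between the two tails are assumptions made for illustration; the paper's own definitions may differ.

```latex
% Assumed notation: X^*_i is the p-vector of error-prone exposures for subject
% i = 1, ..., N; n is the Phase II sample size (taken to be even here).
\[
  \hat\Sigma^* = \frac{1}{N-1} \sum_{i=1}^{N}
    (X^*_i - \bar X^*)(X^*_i - \bar X^*)^\top,
  \qquad
  v_1 = \arg\max_{\|v\| = 1} v^\top \hat\Sigma^* v,
  \qquad
  PC^*_{1i} = v_1^\top (X^*_i - \bar X^*).
\]
\[
  V_i = \mathbf{1}\!\left\{ PC^*_{1i} \le PC^*_{1(n/2)} \right\}
      + \mathbf{1}\!\left\{ PC^*_{1i} \ge PC^*_{1(N - n/2 + 1)} \right\},
\]
% where PC^*_{1(k)} is the k-th smallest score and V_i = 1 marks validation.
```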

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful comments, which help clarify the scope and limitations of our proposed PCA-based extension of extreme tail sampling. We address the major comment below and have revised the manuscript to better articulate the method's rationale, empirical support, and comparison to alternatives.

  1. Referee: [Abstract, PCA extension of ETS] Abstract (paragraph on PCA extension of ETS): the proposal assumes that extremes of the first principal component of the error-prone exposures will contribute substantially to the score functions or information matrices of every model under consideration. Because PCA is unsupervised and maximizes marginal variance of the exposures alone, the leading eigenvector need not align with directions relevant to the outcome(s) or to parameters of secondary models; this is especially plausible under heterogeneous measurement error or model-specific covariate effects. The simulations report efficiency gains, but without an explicit link to the multi-model efficiency criterion or a comparison against an information-optimal multi-model sampler, it is unclear whether the observed gains are general or specific to the simulated configurations.

    Authors: We agree that PCA is unsupervised and does not explicitly target the score functions or information matrices of the models of interest, so alignment is not guaranteed in all settings. Our approach is motivated by the practical observation that, in biomedical data with correlated error-prone exposures, the leading principal component frequently captures shared directions of variation that contribute to efficiency across multiple models. The simulations explicitly include heterogeneous measurement error structures and model-specific covariate effects, and efficiency gains were observed consistently in those cases. To strengthen the link to multi-model criteria, the revised manuscript adds a new subsection in the Methods and a corresponding simulation comparison against an information-optimal multi-model sampler (constructed as a weighted sum of expected information matrices). This shows that PCA-ETS achieves comparable gains while remaining simpler to implement and not requiring Phase I outcome data. We have also revised the Abstract to describe the method as a scalable heuristic whose performance is validated empirically rather than claimed to be universally optimal. revision: yes

Circularity Check

0 steps flagged

No circularity: PCA-based sampling criterion is defined directly from Phase I observations


The proposed method computes the first principal component directly from the matrix of observed error-prone exposures in Phase I data and selects validation samples at its extremes. This construction is an explicit, unsupervised dimension-reduction step with no algebraic reduction to fitted parameters, outcome-dependent quantities, or self-referential definitions. Efficiency gains are demonstrated through separate simulation studies and NHANES application rather than by construction. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the derivation chain; the approach extends ETS by a transparent preprocessing step whose validity is assessed externally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard measurement error models and sampling assumptions common to two-phase studies, with the novel element being the PCA-based selection rule rather than new entities or heavily fitted parameters.

axioms (2)
  • domain assumption: Standard assumptions on the measurement error structure in covariates (e.g., additive or multiplicative error, possibly correlated or heterogeneous).
    Invoked when claiming that advantages persist under correlated or heterogeneous measurement error.
  • domain assumption: The first principal component captures sufficient shared variability across the models of interest for the sampling criterion to be effective.
    Central to the proposed extension of ETS; stated in the abstract's description of PCA summarization.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear

    We propose an intuitive, easy-to-use approach that extends ETS to balance and prioritize explaining the largest amount of variability across multiple models of interest. Using principal components analysis, we succinctly summarize the inherent variability of all models' error-prone exposures. Then, we sample patients with the most extreme values of the first principal component for validation.

  • IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
    Relation between the paper passage and the cited Recognition theorem: unclear

    The ETS-PC*_1 design's ability to reduce the total coefficient variability across all models depends on two characteristics of the error-prone exposure data.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Supplementary Materials

  • Additional appendices, tables, and figures: The supplemental figures and tables referenced in Sections 3–4 are available online at ht...