
arxiv: 2606.02062 · v1 · pith:PCWCE3SRnew · submitted 2026-06-01 · 📊 stat.ME

Evaluating the role of correlation among markers in prediction models

Pith reviewed 2026-06-28 13:18 UTC · model grok-4.3

classification 📊 stat.ME
keywords biomarkers · AUC · correlations · ROC curve · predictive models · multivariate normality · pancreatic cancer · metabolites

The pith

Negative correlations between biomarkers maximize the combined AUC in predictive models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives an expression for the maximum achievable AUC when combining biomarkers and shows how their correlations affect this value. It demonstrates that negative correlations between markers lead to the highest combined discrimination ability, particularly when the individual markers have similar predictive power. This finding is illustrated with graphical surfaces and confirmed through simulations with normal and skewed distributions as well as analysis of real metabolite data for pancreatic cancer detection. The work highlights that the sign and strength of inter-marker correlations should be considered when building or extending predictive models.

Core claim

Under the assumption of multivariate normality, the maximum AUC for a linear combination of biomarkers is a function of the correlations between them, with negative correlations yielding the highest values and positive correlations the lowest. This holds for markers with equal or differing predictive abilities, though the benefit is greatest when abilities are equal. Simulations and real data on lipid metabolites reinforce that negative correlations optimize model performance.

What carries the argument

An expression for the maximum AUC derived as a function of the correlations between markers under multivariate normality.
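
The paper's own expression is not reproduced on this page. As a hedged sketch, the standard result for the best linear combination under multivariate normality (Su and Liu, 1993) can be written as below. The notation is an assumption made here for illustration: cases are $N(\mu_1, \Sigma_1)$, controls are $N(\mu_0, \Sigma_0)$, and the two-marker special case further assumes a common covariance matrix with standardized mean differences $\delta_i$ and correlation $\rho$.

```latex
% Assumed notation (not taken from the paper):
% cases X ~ N(\mu_1, \Sigma_1), controls X ~ N(\mu_0, \Sigma_0), linear score a^T X.
\[
\mathrm{AUC}(a) = \Phi\!\left(\frac{a^{\top}\Delta}{\sqrt{a^{\top}(\Sigma_0+\Sigma_1)\,a}}\right),
\qquad \Delta = \mu_1 - \mu_0 ,
\]
\[
a^{*} \propto (\Sigma_0+\Sigma_1)^{-1}\Delta
\;\Longrightarrow\;
\mathrm{AUC}_{\max} = \Phi\!\left(\sqrt{\Delta^{\top}(\Sigma_0+\Sigma_1)^{-1}\Delta}\right).
\]
% Two markers, common covariance, \delta_i = \Delta_i / \sigma_i, correlation \rho:
\[
\mathrm{AUC}_{\max}(\rho)
= \Phi\!\left(\sqrt{\frac{\delta_1^{2}+\delta_2^{2}-2\rho\,\delta_1\delta_2}{2\,(1-\rho^{2})}}\right),
\qquad
\delta_1=\delta_2=\delta \;\Rightarrow\;
\mathrm{AUC}_{\max}(\rho)=\Phi\!\left(\frac{|\delta|}{\sqrt{1+\rho}}\right).
\]
```

In the equal-strength case the argument of $\Phi$ decreases monotonically in $\rho$, which is consistent with the claim that negative correlations give the largest combined AUC.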

If this is right

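  • When adding a new biomarker, one that is negatively correlated with the existing markers improves discrimination more than a positively correlated marker of the same strength.
  • For markers with equal strength, negative correlation gives the greatest AUC gain.
  • Positive correlations between markers give the least favorable combined AUC, although they show slight benefits when the markers differ in predictive ability.
  • The effect persists under skewed distributions, where the asymmetry of the marker distributions also matters for marker selection.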
  • In metabolite data for PDAC, correlations influence AUC optimization.
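
To make these consequences concrete, here is a minimal sketch for two markers. It is not the paper's code: it assumes a bivariate normal model with a common covariance matrix in cases and controls, and the names `max_auc_two_markers`, `d1`, `d2` and `rho` are hypothetical. `d1` and `d2` are standardized mean differences, and the closed form used is the standard Su and Liu (1993) result rather than the paper's own expression.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def max_auc_two_markers(d1, d2, rho):
    """Best AUC of a linear combination of two markers.

    Assumptions (illustrative, not from the paper): bivariate normal markers,
    the same covariance in cases and controls, standardized mean
    differences d1 and d2, and between-marker correlation rho.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Squared Mahalanobis distance between the case and control means.
    q = (d1 * d1 + d2 * d2 - 2.0 * rho * d1 * d2) / (1.0 - rho * rho)
    return normal_cdf(sqrt(q / 2.0))


# Equal-strength markers: the combined AUC falls as rho increases.
for rho in (-0.5, 0.0, 0.5):
    print("equal strength, rho =", rho, "AUC =", round(max_auc_two_markers(1.0, 1.0, rho), 4))

# Unequal markers: the AUC is lowest near rho = d2 / d1 and rises again
# for stronger positive correlation.
for rho in (-0.5, 0.0, 0.5, 0.9):
    print("unequal strength, rho =", rho, "AUC =", round(max_auc_two_markers(1.0, 0.5, rho), 4))
```

Under these assumptions, a correlation of a given magnitude always yields a higher maximum AUC when its sign is negative, and in the unequal case the combined AUC at `rho = d2 / d1` equals that of the stronger marker alone.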

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model builders could screen potential markers for negative correlations to existing ones to maximize gain.
  • This might suggest redesigning marker selection criteria in high-dimensional settings.
  • Extensions to non-linear combinations or other performance metrics could be explored.
  • The finding may apply to other diagnostic fields beyond cancer.

Load-bearing premise

The biomarkers follow a multivariate normal distribution.

What would settle it

A dataset where combining negatively correlated markers does not yield higher AUC than positively correlated ones, after controlling for individual marker strengths.
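
A check of this kind can be sketched on synthetic data. The example below is an illustration under stated assumptions, not the paper's simulation design: it draws bivariate normal markers with unit variances, holds the individual marker strengths `d1` and `d2` fixed, changes only the sign of the correlation, and scores each subject with the known optimal weights rather than estimated ones. All function names are hypothetical.

```python
import random
from math import sqrt


def simulate_group(n, mean, rho, rng):
    """Draw n bivariate normal points with unit variances and correlation rho."""
    points = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x1 = mean[0] + z1
        x2 = mean[1] + rho * z1 + sqrt(1.0 - rho * rho) * z2
        points.append((x1, x2))
    return points


def empirical_auc(case_scores, control_scores):
    """Mann-Whitney estimate of P(case score > control score), ties count 0.5."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def combined_auc(d1, d2, rho, n=400, seed=0):
    """Empirical AUC of the optimal linear score for simulated cases and controls."""
    rng = random.Random(seed)
    cases = simulate_group(n, (d1, d2), rho, rng)
    controls = simulate_group(n, (0.0, 0.0), rho, rng)
    # Weights proportional to Sigma^{-1} * delta for a unit-variance covariance.
    w1 = d1 - rho * d2
    w2 = d2 - rho * d1

    def score(p):
        return w1 * p[0] + w2 * p[1]

    return empirical_auc([score(p) for p in cases], [score(p) for p in controls])


auc_neg = combined_auc(1.0, 1.0, rho=-0.5)
auc_pos = combined_auc(1.0, 1.0, rho=0.5)
print("negative correlation:", round(auc_neg, 3))
print("positive correlation:", round(auc_pos, 3))
```

On real data the weights would have to be estimated and the AUC validated out of sample, so a comparison that reversed this ordering would need to rule out overfitting before it counted against the claim.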

Original abstract

Different methods have been employed to estimate models maximizing the area under the receiver operating characteristic curve (ROC-AUC). Once a model is developed, integrating novel biomarkers may improve its diagnostic ability. However, the discrimination improvement from adding a new biomarker is not always evident, even if the marker itself has good discriminatory power. The sign and magnitude of correlations between biomarkers may impact model performance. In this paper, we assess the effect of such correlations on the discrimination ability of predictive models. Under multivariate normality, we derive an expression for the maximum AUC as a function of the correlations between markers, illustrated graphically using surfaces. Logarithmic folded bivariate normal and Gamma simulations address skewed data cases. Additionally, AUC improvement was assessed combining 1934 blood lipid metabolites determined by liquid chromatography in 44 pancreatic cancer cases and 38 controls from the PanGenMic Study. Our results show that negative correlations consistently maximize the combined AUC, offering the greatest improvements when markers have equal predictive ability, while positive correlations yield the least favorable results. Negative correlations remain optimal for markers with differing abilities, though positive correlations show slight benefits. Simulations with skewed distributions confirm these trends, emphasizing the role of asymmetry in marker selection. Real-world analysis of serum lipid-derived metabolites for detecting pancreatic ductal adenocarcinoma (PDAC) reinforces the influence of correlations on AUC optimization. These findings suggest that the sign and magnitude of inter-biomarker correlations should be considered when incorporating new markers into predictive algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives a closed-form expression for the maximum AUC achievable by a linear combination of biomarkers under the assumption of multivariate normality, as a function of the pairwise correlations among markers. It concludes that negative correlations maximize the combined AUC (with largest gains when markers have equal individual predictive strength), illustrates this with surfaces, checks robustness via simulations under logarithmic folded bivariate normal and Gamma distributions, and applies the idea to 1934 lipid metabolites in a pancreatic cancer case-control study.

Significance. If the derivation is correct, the result supplies a simple, interpretable rule for biomarker selection that is directly actionable in model building. The use of exact MVN properties for the closed-form result, together with targeted simulations for non-normality and a real-data corroboration, gives the work concrete practical value beyond purely theoretical claims.

major comments (2)
  1. [Derivation (abstract and main text)] The central derivation of the maximum-AUC expression (mentioned in the abstract and presumably in the Methods/Results) is stated to follow from standard multivariate-normal properties, yet the manuscript supplies neither the explicit formula nor the algebraic steps that produce it. This omission is load-bearing because the claim that negative correlations maximize AUC rests entirely on that expression.
  2. [Real-data application] Table or figure reporting the real-data AUC values (PanGenMic lipid-metabolite analysis) is needed to quantify the claimed improvement under negative versus positive correlations; without it the empirical support for the main conclusion remains qualitative.
minor comments (2)
  1. [Simulation section] The abstract refers to 'logarithmic folded bivariate normal' simulations; the precise parameterization and how the correlation is preserved under the transformation should be stated explicitly for reproducibility.
  2. [Graphical illustration] Notation for the linear combination coefficients and the resulting AUC expression should be introduced once and used consistently; currently the link between the MVN parameters and the plotted surfaces is not fully transparent.
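
The concern about correlation under transformation can be illustrated with a simpler stand-in. The paper's "logarithmic folded bivariate normal" parameterization is not given on this page, so the sketch below assumes plain exponentiation of a bivariate normal pair (a log-normal pair), for which the Pearson correlation has a known closed form. The function name and default standard deviations are hypothetical.

```python
from math import exp, sqrt


def lognormal_correlation(rho, s1=1.0, s2=1.0):
    """Pearson correlation of (exp(X), exp(Y)) when (X, Y) is bivariate normal.

    rho is the correlation of (X, Y); s1 and s2 are their standard deviations.
    This is a stand-in for the paper's transformation, not its actual definition.
    """
    numerator = exp(rho * s1 * s2) - 1.0
    denominator = sqrt((exp(s1 * s1) - 1.0) * (exp(s2 * s2) - 1.0))
    return numerator / denominator


# The correlation on the transformed scale is attenuated, and more so
# for negative values than for positive ones.
for rho in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print("normal-scale rho =", rho, "transformed-scale rho =", round(lognormal_correlation(rho), 4))
```

In this stand-in the transformed correlation cannot go below roughly -0.37 when both standard deviations equal 1, which shows why the scale on which the correlation is specified needs to be stated for the simulations to be reproducible.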

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address each major comment below.

Point-by-point responses
  1. Referee: [Derivation (abstract and main text)] The central derivation of the maximum-AUC expression (mentioned in the abstract and presumably in the Methods/Results) is stated to follow from standard multivariate-normal properties, yet the manuscript supplies neither the explicit formula nor the algebraic steps that produce it. This omission is load-bearing because the claim that negative correlations maximize AUC rests entirely on that expression.

    Authors: We agree that the explicit formula and algebraic steps were omitted and should be supplied. The maximum AUC under MVN follows from the fact that the optimal linear combination yields an AUC determined by the square root of a quadratic form in the mean difference vector and the inverse covariance matrix; the sign of the off-diagonal elements of the correlation matrix then determines whether this quantity is maximized or minimized. In the revised manuscript we will insert the closed-form expression together with the derivation steps in the Methods section.

    revision: yes

  2. Referee: [Real-data application] Table or figure reporting the real-data AUC values (PanGenMic lipid-metabolite analysis) is needed to quantify the claimed improvement under negative versus positive correlations; without it the empirical support for the main conclusion remains qualitative.

    Authors: We agree that quantitative AUC values are required to make the empirical claim concrete. In the revised manuscript we will add a table (or figure) in the Results section that reports the observed AUCs for representative metabolite pairs and small combinations stratified by the sign and magnitude of their correlations.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from standard MVN properties


The paper derives a closed-form expression for maximum AUC under multivariate normality as a function of pairwise correlations, which follows directly from standard properties of the MVN distribution without reducing to any fitted input, self-defined quantity, or self-citation chain within the paper itself. Simulations under logarithmic folded bivariate normal and Gamma distributions, plus the real-data lipid metabolite example, serve as independent robustness checks rather than tautological confirmations. No load-bearing step equates a prediction to its own construction or imports uniqueness via author-overlapping citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The load-bearing premise is the multivariate normality assumption required for the closed-form AUC expression; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Biomarkers are jointly multivariate normal
    The abstract states that the expression for maximum AUC is derived under multivariate normality.


