
arxiv: 2604.19996 · v1 · submitted 2026-04-21 · 📊 stat.ME · stat.AP

Meta-analysis of networks of diagnostic tests with binary and continuous results

Pith reviewed 2026-05-10 01:35 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords network meta-analysis · diagnostic test accuracy · continuous biomarkers · multinomial likelihood · multiple thresholds · hierarchical model · sensitivity and specificity

The pith

A hierarchical model for network meta-analysis of diagnostic tests incorporates all thresholds from continuous biomarkers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a hierarchical model to combine evidence from studies of diagnostic tests that report results at many thresholds rather than one or two. It uses multinomial likelihoods for the counts at each threshold together with a parametric curve that links the probability of a positive test to the threshold value within diseased and non-diseased groups. This structure lets analysts keep every data point instead of discarding most thresholds, while still producing comparable accuracy estimates across an entire network of tests. A reader should care because many real diagnostic tests are continuous biomarkers whose full information is lost under existing network meta-analysis methods.

Core claim

The proposed approach is a hierarchical model that incorporates multinomial likelihoods for studies reporting results across multiple thresholds and a parametric structure for the relationship between the probability of testing positive and threshold within each disease class. It yields accuracy estimates across the whole range of observed thresholds while retaining the useful properties of standard NMA-DTA methods.

What carries the argument

Hierarchical model using multinomial likelihoods for multi-threshold data and a parametric curve linking test-positive probability to threshold within each disease class.
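
A minimal sketch of how these two ingredients could fit together, under assumed notation that does not come from the paper: for one study, one test, and disease class $j$ (non-diseased or diseased), let $n_j$ be the number of participants, $C_1 < \dots < C_T$ the reported thresholds, and $x_{jt}$ the number testing positive at $C_t$. The logistic link and the monotone transformation $g$ (for example a log) are illustrative choices in the spirit of Jones et al. [2019]; the paper's actual parametric family may differ.

```latex
\begin{align}
(n_j - x_{j1},\; x_{j1} - x_{j2},\; \dots,\; x_{j,T-1} - x_{jT},\; x_{jT})
  &\sim \operatorname{Multinomial}\bigl(n_j;\; 1 - p_{j1},\; p_{j1} - p_{j2},\; \dots,\; p_{j,T-1} - p_{jT},\; p_{jT}\bigr), \\
\operatorname{logit}(p_{jt}) &= \frac{\mu_j - g(C_t)}{\sigma_j}, \qquad \sigma_j > 0 .
\end{align}
```

Under this sketch, and assuming higher test values indicate disease, sensitivity at a threshold $C$ is the diseased-class probability of a positive result and specificity is one minus the non-diseased-class probability, so both can be evaluated at any $C$ in the observed range. A hierarchical version would plausibly place random effects on $\mu_j$ and $\sigma_j$ across studies and tests.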

If this is right

  • Accuracy estimates become available for the full range of observed thresholds instead of only selected points.
  • A larger number of tests can be included in a single network analysis.
  • Sensitivity and specificity at different thresholds are estimated with greater precision.
  • Model variations with different covariance structures or added random effects can be compared directly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Threshold-specific estimates could feed directly into decision models that choose the best cut-off for a given clinical context.
  • The framework might later accommodate non-parametric curves when the parametric assumption is too restrictive.
  • Networks built this way could support more stable comparisons when new tests or thresholds are added over time.
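
The first extension above can be sketched as follows. This is a hypothetical illustration, not the paper's method: the logistic curve on the log-threshold scale, the function names, and the parameter values are all assumptions, and the weighted sum of sensitivity and specificity stands in for whatever utility a real decision model would use.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def accuracy_at(threshold, mu_dis, sigma_dis, mu_non, sigma_non):
    """Sensitivity and specificity at one threshold under an assumed logistic curve."""
    sens = logistic((mu_dis - math.log(threshold)) / sigma_dis)
    spec = 1.0 - logistic((mu_non - math.log(threshold)) / sigma_non)
    return sens, spec


def best_threshold(candidates, weight_sens=1.0, **curve):
    """Pick the candidate maximising weight_sens * sensitivity + specificity."""
    def score(c):
        sens, spec = accuracy_at(c, **curve)
        return weight_sens * sens + spec
    return max(candidates, key=score)


# Hypothetical curve parameters (not estimates from the paper).
curve = dict(mu_dis=math.log(40), sigma_dis=0.6, mu_non=math.log(8), sigma_non=0.6)
candidates = [5, 10, 20, 40, 80]

balanced = best_threshold(candidates, **curve)
screening = best_threshold(candidates, weight_sens=3.0, **curve)
```

Raising `weight_sens` mimics a context where missed cases are costlier than false alarms, which pushes the chosen cut-off lower.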

Load-bearing premise

The chosen parametric form for how the probability of a positive test changes with threshold holds across every study and every test in the network.

What would settle it

If external validation data at a held-out threshold show that the model's predicted sensitivity and specificity curves deviate systematically from the observed proportions, the parametric assumption fails.
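
One rough way to operationalise this check is sketched below. It is an assumption-laden illustration: it treats the model prediction as fixed (ignoring posterior uncertainty), uses a simple binomial z-score, and the function names and cutoff are hypothetical.

```python
import math


def standardized_residual(positives, n, predicted):
    """Binomial z-score of an observed positive proportion against a model prediction."""
    observed = positives / n
    se = math.sqrt(predicted * (1.0 - predicted) / n)
    return (observed - predicted) / se


def systematic_deviation(residuals, cutoff=2.0):
    """Crude flag: the sum of residuals, scaled by sqrt(k), exceeds the cutoff in absolute value."""
    k = len(residuals)
    return abs(sum(residuals) / math.sqrt(k)) > cutoff


# Hypothetical held-out residuals from four studies.
same_sign = [1.5, 1.8, 1.2, 1.6]      # predictions consistently too low
mixed_sign = [1.5, -1.4, 0.3, -0.6]   # deviations roughly cancel
```

Residuals that share a sign across studies trip the flag, while scattered residuals of similar size do not, which is the distinction between systematic misfit and noise that the criterion above relies on.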

Figures

Figures reproduced from arXiv: 2604.19996 by Efthymia Derezea, Gabriel Rogers, Hayley E Jones, Nicky J Welton.

Figure 1: Network from the HCC review: (1) AFP, (2) AFP-L3, (3) AFP progression rate (ng/ml per month), (4) B-mode US, (5) CEUS, (6) Combined AFP index (serial AFP and AFP), (7) DCP, (8) DCP (ng/ml), (9) Model based on DCP, AFP, gender, and age, (10) Doylestown algorithm, (11) Dynamic contrast-enhanced MRI, (12) GALAD, (13) HCC-ART, (14) HES algorithm, (15) Longitudinal GALAD, (16) mFB-I, (17) mFB-J, (18) Model base…
Figure 2: Results for the continuous tests of the HCC review
Figure 3: Results for the imaging tests of the HCC review
Figure 4: Network from the prostate review: (1) 4K, (2) PCA3, (3) PHI, (4) SelectMDx
Figure 5: Results for the continuous tests of the prostate cancer review
Original abstract

Network meta-analysis of diagnostic test accuracy (NMA-DTA) is a relatively new field, involving combining evidence across studies to evaluate and compare the accuracy of different tests for a given condition. However, the methods proposed to date cannot always capture complex aspects of the data. In fact, many commonly used diagnostic tests are continuous biomarkers, whose accuracy is evaluated at multiple thresholds within a study. Using current NMA-DTA methods we are feasibly able to include in our analysis only a few thresholds per study, discarding this way a big amount of data which could have provided us with useful information. We introduce an approach that can efficiently encompass all available data. This is a hierarchical model that incorporates multinomial likelihoods for studies reporting results across multiple thresholds and a parametric structure for the relationship between the probability of testing positive and threshold within each disease class. This approach enables us to obtain accuracy estimates of tests across the whole range of observed thresholds, while it retains all the useful properties of standard NMA-DTA methods. We explore different variations of this model based on different covariance structures, the inclusion of study-level random effects, and the addition of a further hierarchical structure on the test-level variance components. This framework is applied to data from two systematic reviews, allowing the inclusion of a larger number of tests (compared to alternative approaches) and estimation of sensitivity and specificity at different thresholds with increased precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a hierarchical model extending network meta-analysis of diagnostic test accuracy (NMA-DTA) to handle continuous biomarkers reported at multiple thresholds. It employs multinomial likelihoods for multi-threshold data and imposes a parametric structure linking threshold to the probability of a positive result within each disease class, enabling sensitivity/specificity estimation across the full observed threshold range while retaining standard NMA-DTA borrowing of strength. Variations explore different covariance structures, study-level random effects, and hierarchical variance components. The model is applied to two systematic reviews, claiming inclusion of more tests and increased precision.

Significance. If the parametric assumption holds and is validated, the approach could meaningfully expand usable data in NMA-DTA by avoiding arbitrary threshold selection or data discarding for continuous tests, yielding more precise network-wide accuracy estimates. The retention of established NMA-DTA properties and exploration of covariance structures are positive features; however, the absence of reported quantitative model comparisons, fit statistics, or sensitivity analyses limits assessment of practical gains.

major comments (3)
  1. [Methods (parametric link)] Methods, parametric structure for P(positive|threshold, disease class): The central claim of obtaining estimates 'across the whole range of observed thresholds' depends on a single parametric form holding uniformly across all included studies and tests. No sensitivity analyses, alternative functional forms, or misspecification diagnostics are described; violation would systematically bias the interpolated curves and all downstream network comparisons.
  2. [Results] Results, application to two reviews: Claims of 'increased precision' and 'larger number of tests' are made without quantitative support such as credible interval widths, effective sample size comparisons, or direct model fit metrics (e.g., DIC/WAIC) against standard NMA-DTA. This leaves the magnitude of improvement unverified.
  3. [Model variations] Model variations section: While covariance structures and random effects are explored, the core parametric assumption itself is not relaxed or tested (e.g., via non-parametric alternatives or study-specific shape parameters), making it the load-bearing unexamined component for the 'whole range' estimates.
minor comments (2)
  1. [Methods] Notation for the parametric threshold-probability function should be given an explicit equation number and clearly distinguished from the multinomial likelihood parameters.
  2. [Results/Figures] Figure captions for the estimated curves should state the exact parametric family used and any constraints imposed on monotonicity or range.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strengths and limitations of our proposed hierarchical model for network meta-analysis of diagnostic tests with continuous results. We address each major comment below, agreeing where revisions are needed to provide stronger empirical support and robustness checks.

Point-by-point responses
  1. Referee: Methods (parametric link): The central claim of obtaining estimates 'across the whole range of observed thresholds' depends on a single parametric form holding uniformly across all included studies and tests. No sensitivity analyses, alternative functional forms, or misspecification diagnostics are described; violation would systematically bias the interpolated curves and all downstream network comparisons.

    Authors: We agree that the parametric structure is central to interpolating across thresholds and that its uniform applicability is an assumption requiring scrutiny. The form was selected based on established relationships in diagnostic biomarker literature (e.g., monotonicity in disease classes). In the revised manuscript we will add sensitivity analyses using alternative forms (linear, quadratic, and fractional polynomial) applied to both case studies, along with posterior predictive checks and comparison of model fit (DIC/WAIC) to assess robustness to misspecification. revision: yes

  2. Referee: Results, application to two reviews: Claims of 'increased precision' and 'larger number of tests' are made without quantitative support such as credible interval widths, effective sample size comparisons, or direct model fit metrics (e.g., DIC/WAIC) against standard NMA-DTA. This leaves the magnitude of improvement unverified.

    Authors: We acknowledge that the current results section relies on qualitative statements. The revised version will include direct quantitative comparisons: tables reporting average credible interval widths for sensitivity/specificity, effective sample size estimates, and DIC/WAIC values for our model versus standard NMA-DTA (with threshold selection) on the same two datasets. These additions will allow readers to evaluate the practical gains in precision and data inclusion. revision: yes

  3. Referee: Model variations section: While covariance structures and random effects are explored, the core parametric assumption itself is not relaxed or tested (e.g., via non-parametric alternatives or study-specific shape parameters), making it the load-bearing unexamined component for the 'whole range' estimates.

    Authors: The explored variations target heterogeneity and borrowing of strength across the network, which are the primary extensions beyond standard NMA-DTA. We recognize that the parametric link itself was not varied in the main analysis. In revision we will add a dedicated sensitivity subsection that relaxes the assumption via study-specific shape parameters and a non-parametric alternative (e.g., monotonic splines) for at least one dataset, reporting how network-level estimates change. revision: partial
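
The average credible interval width mentioned in the second response could be computed from posterior draws along these lines. This is a sketch only: the draws are synthetic, the dictionary names are hypothetical, and the quantiles use a simple nearest-rank rule rather than any particular package's definition.

```python
import random


def interval_width(draws, level=0.95):
    """Width of an equal-tailed interval from posterior draws (nearest-rank quantiles)."""
    ordered = sorted(draws)
    last = len(ordered) - 1
    k = round((1.0 - level) / 2.0 * last)
    return ordered[last - k] - ordered[k]


def mean_width(draws_by_threshold, level=0.95):
    """Average interval width over all thresholds at which accuracy is reported."""
    widths = [interval_width(d, level) for d in draws_by_threshold.values()]
    return sum(widths) / len(widths)


# Synthetic sensitivity draws at two thresholds; NOT results from the paper.
rng = random.Random(0)
few_thresholds = {c: [rng.gauss(0.80, 0.05) for _ in range(4000)] for c in (10, 20)}
all_thresholds = {c: [rng.gauss(0.80, 0.02) for _ in range(4000)] for c in (10, 20)}

print(mean_width(few_thresholds), mean_width(all_thresholds))
```

Reporting this quantity for both models on the same thresholds would turn the "increased precision" claim into a directly checkable number.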

Circularity Check

0 steps flagged

No circularity: hierarchical model derives continuous estimates from parametric assumptions and data

Full rationale

The paper presents a hierarchical model using multinomial likelihoods for multi-threshold data combined with a parametric link between threshold and positive-test probability within disease classes. Accuracy estimates across the full range of thresholds are obtained by fitting this model and then evaluating the parametric curves; this is a standard model-based interpolation, not a reduction of the output to the inputs by definition or by renaming a fitted quantity as a prediction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain. The approach retains properties of standard NMA-DTA by construction of the hierarchy, but the continuous curves themselves are not tautological with the discrete data points. The central modeling choice (parametric form) is an explicit assumption whose validity is separate from circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard statistical modeling assumptions plus a domain-specific parametric form for threshold effects. No new physical entities are introduced. Free parameters are implicit in the parametric threshold curves and covariance structures.

free parameters (2)
  • parameters of the parametric threshold-probability function
    The model requires choosing and fitting a parametric form (e.g., logistic or similar) relating positive-test probability to threshold within each disease class; these are estimated from data.
  • covariance parameters for random effects
    Variations include study-level random effects and test-level variance components whose covariance structure must be specified and estimated.
axioms (2)
  • standard math: Counts of test results falling between successive thresholds within a study follow a multinomial distribution.
    Invoked to model the joint data across thresholds without discarding observations.
  • domain assumption: A parametric functional form adequately describes the monotonic or smooth relationship between threshold and test-positive probability within disease classes.
    Central modeling choice that enables interpolation across the full threshold range.
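
The two axioms can be combined in a few lines of code. The sketch below is illustrative rather than the paper's implementation: the logistic curve on the log-threshold scale is an assumed parametric family, and all function names and numbers are hypothetical.

```python
import math


def positive_prob(threshold, mu, sigma):
    """Assumed logistic curve: P(test positive at this threshold) on the log scale."""
    z = (mu - math.log(threshold)) / sigma
    return 1.0 / (1.0 + math.exp(-z))


def interval_probs(thresholds, mu, sigma):
    """Cell probabilities for results below, between, and above the ordered thresholds."""
    p = [1.0] + [positive_prob(c, mu, sigma) for c in thresholds] + [0.0]
    return [p[t] - p[t + 1] for t in range(len(p) - 1)]


def multinomial_loglik(n, positives, thresholds, mu, sigma):
    """Log-likelihood of the positive counts at ordered thresholds in one disease class."""
    x = [n] + list(positives) + [0]
    counts = [x[t] - x[t + 1] for t in range(len(x) - 1)]
    loglik = math.lgamma(n + 1)
    for k, q in zip(counts, interval_probs(thresholds, mu, sigma)):
        if k > 0:  # empty cells contribute nothing to the sum
            loglik += k * math.log(q)
        loglik -= math.lgamma(k + 1)
    return loglik


# Hypothetical study: 100 diseased participants, positives at thresholds 10, 20, 50.
thresholds = [10, 20, 50]
positives = [80, 55, 20]
plausible = multinomial_loglik(100, positives, thresholds, mu=3.0, sigma=0.8)
implausible = multinomial_loglik(100, positives, thresholds, mu=5.0, sigma=0.8)
```

Because every threshold contributes a cell to the same multinomial, no reported threshold has to be dropped, and the two curve parameters are what a hierarchical model would then share across studies.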

pith-pipeline@v0.9.0 · 5555 in / 1401 out tokens · 40662 ms · 2026-05-10T01:35:21.728599+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1] Chu, H. and Cole, S. R. [2006], ‘Bivariate meta-analysis of sensitivity and specificity with sparse data: a generalized linear mixed model approach’, Journal of clinical epidemiology 59(12), 1331–1332

  2. [2] Derezea, E., Ades, A., Rogers, G., Sutton, A. J., Cooper, N. J., Hamilton, J. and Jones, H. E. [2024], ‘Technical support document 25: Evidence synthesis of diagnostic test accuracy for decision making’

  3. [3] Holper, L., Cerullo, E., Mokros, A. and Habermeyer, E. [2024], ‘Predictive and incremental validity of the static-99, static-99r, and stable-2007 for sexual recidivism: A diagnostic test accuracy network meta-analysis (dta-nma).’, Psychological Assessment 36(2), 134

  4. [4] Hoyer, A., Hirt, S. and Kuss, O. [2018], ‘Meta-analysis of full roc curves using bivariate time-to-event models for interval-censored data’, Research synthesis methods 9(1), 62–72

  5. [5] Jones, H. E., Gatsonis, C. A., Trikalinos, T. A., Welton, N. J. and Ades, A. [2019], ‘Quantifying how diagnostic test accuracy depends on threshold in a meta-analysis’, Statistics in Medicine 38(24), 4789–4803

  6. [6] Kawada, T., Shim, S. R., Quhal, F., Rajwa, P., Pradere, B., Yanagisawa, T., Bekku, K., Laukhtina, E., von Deimling, M., Teoh, J. Y.-C. et al. [2024], ‘Diagnostic accuracy of liquid biomarkers for clinically significant prostate cancer detection: a systematic review and diagnostic meta-analysis of multiple thresholds’, European urology oncology 7(4), 649–662

  7. [7] Lian, Q., Hodges, J. S. and Chu, H. [2018], ‘A bayesian hierarchical summary receiver operating characteristic model for network meta-analysis of diagnostic tests’, Journal of the American Statistical Association

  8. [8] Ma, X., Lian, Q., Chu, H., Ibrahim, J. G. and Chen, Y. [2018], ‘A bayesian hierarchical model for network meta-analysis of multiple diagnostic tests’, Biostatistics 19(1), 87–102

  9. [9] Macaskill, P., Takwoingi, Y., Deeks, J. and Gatsonis, C. [2022], Chapter 9: Understanding meta-analysis. Draft version (4 October 2022) for inclusion in: Deeks JJ, Bossuyt PM, Leeflang MM, Takwoingi Y, editor(s). Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Version 2, London: Cochrane

  10. [10] Menten, J. and Lesaffre, E. [2015], ‘A general framework for comparative bayesian meta-analysis of diagnostic studies’, BMC medical research methodology 15(1), 1–13

  11. [11] Nyaga, V. N., Aerts, M. and Arbyn, M. [2018], ‘Anova model for network meta-analysis of diagnostic test accuracy data’, Statistical methods in medical research 27(6), 1766–1784

  12. [12] Owen, R. K., Cooper, N. J., Quinn, T. J., Lees, R. and Sutton, A. J. [2018], ‘Network meta-analysis of diagnostic test accuracy studies identifies and ranks the optimal diagnostic tests and thresholds for health care policy and decision-making’, Journal of clinical epidemiology 99, 64–74

  13. [13] Reitsma, J. B., Glas, A. S., Rutjes, A. W., Scholten, R. J., Bossuyt, P. M. and Zwinderman, A. H. [2005], ‘Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews’, Journal of clinical epidemiology 58(10), 982–990

  14. [14] Rogers, G., Derezea, E., Sadler, L., Wang, H., Watt, K., Ryder, S., Cramp, M., Whiting, P., Rogers, M., Bell, J., Oppe, F., Welton, N. and Jones, H. E. [2025], ‘Diagnostic accuracy of tests used in surveillance for hepatocellular carcinoma in people with cirrhosis: systematic review and network meta-analysis’, (Submt.)

  15. [15] Steinhauser, S., Schumacher, M. and Rücker, G. [2016], ‘Modelling multiple thresholds in meta-analysis of diagnostic test accuracy studies’, BMC medical research methodology 16, 1–15

  16. [16] Takwoingi, Y., Dendukuri, N., Schiller, I., Rücker, G., Jones, H., Partlett, C. and Macaskill, P. [2022], Chapter 10: Undertaking meta-analysis. Draft version for inclusion in: Deeks JJ, Bossuyt PM, Leeflang MM, Takwoingi Y, editor(s). Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Version 2, London: Cochrane

  17. [17] Trikalinos, T. A., Balion, C. M., Coleman, C. I., Griffith, L., Santaguida, P. L., Vandermeer, B. and Fu, R. [2012], ‘Chapter 8: meta-analysis of test performance when there is a “gold standard”’, Journal of general internal medicine 27(Suppl 1), 56–66

  18. [18] Veroniki, A. A., Tsokani, S., Agarwal, R., Pagkalidou, E., Rücker, G., Mavridis, D. and Takwoingi, Y. [2022], ‘Diagnostic test accuracy network meta-analysis methods: A scoping review and empirical assessment’, Journal of clinical epidemiology

  19. [19] Walsh, T., Macey, R., Ricketts, D., Carrasco Labra, A., Worthington, H., Sutton, A., Freeman, S., Glenny, A., Riley, P., Clarkson, J. et al. [2022], ‘Enamel caries detection and diagnosis: An analysis of systematic reviews’, Journal of dental research 101(3), 261–269

  20. [20] Zhou, X.-H., Obuchowski, N. A. and McClish, D. K. [2014], Statistical methods in diagnostic medicine, John Wiley & Sons.

Appendix A.1, Example 1: Model comparison additional results

Table 2: Goodness of fit comparison for the HCC dataset.

                       V1       V2       V3       Met-reg
    Residual deviance  1071.7   1071.7   1084.5   1081.5
    pV                 570.0    571.6    593.5    579.2
    DIC*               1641.7   1643.3   1678.0   1...
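
The Table 2 figures that survive extraction are consistent with DIC* being the residual deviance plus pV, and the short sketch below checks that arithmetic. The identity is an inference from the three complete columns, not something the extracted text states, and the Met-reg DIC is truncated in the source, so the value computed for it here is implied by that assumed identity rather than read from the paper.

```python
# (residual deviance, pV) per model, copied from the extracted Table 2 fragment.
fit = {
    "V1": (1071.7, 570.0),
    "V2": (1071.7, 571.6),
    "V3": (1084.5, 593.5),
    "Met-reg": (1081.5, 579.2),
}

# Assumed identity: DIC* = residual deviance + pV.
dic = {model: round(dev + pv, 1) for model, (dev, pv) in fit.items()}

# Lower DIC indicates a better trade-off between fit and complexity.
best = min(dic, key=dic.get)

for model, value in dic.items():
    print(f"{model:8s} DIC* = {value}")
```

On these numbers V1 and V2 are nearly tied, which is the kind of comparison that speaks to the choice between covariance structures.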