arxiv: 2605.16652 · v1 · pith:ZHKL3LXInew · submitted 2026-05-15 · 📊 stat.ME

Semiparametric Regression for Misclassified Competing Risks Data

Pith reviewed 2026-05-20 15:00 UTC · model grok-4.3

classification 📊 stat.ME
keywords: competing risks · misclassification · semiparametric regression · external validation · B-splines · pseudo-likelihood · consistency

The pith

Misclassification in competing risks data is corrected by plugging external validation probabilities into a B-spline sieve pseudo-likelihood.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a semiparametric method to analyze competing risks data when event types are misclassified and no internal validation sample is available in the main study. It borrows misclassification probabilities estimated from a separate external validation study and builds them into a pseudo-likelihood approximated by B-splines. The resulting estimator is shown to be consistent via empirical process theory and jointly models regression effects across all event types. Simulations indicate the approach works with realistic sample sizes and produces more efficient estimates than earlier methods, and the method is illustrated on HIV data with under-reported deaths.

Core claim

By incorporating estimates of the misclassification probabilities from an external validation study into a B-spline-based sieve pseudo-likelihood function, the proposed estimator is consistent for the regression parameters in misclassified competing risks data and jointly estimates models for all event types.
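
To make the plug-in idea concrete, here is a sketch of how misclassification probabilities typically enter an observed-data likelihood for competing risks. The notation is an assumption for illustration and is not taken from the paper: $\lambda_j(t \mid Z)$ is the true cause-specific hazard for cause $j$ with cumulative version $\Lambda_j$, $C$ is the true cause, $C^*$ the recorded cause, $\Delta_i$ the event indicator, and $p_{jk} = P(C^* = k \mid C = j)$, assumed here not to depend on time or covariates. The paper's own parametrization may differ.

```latex
% Hazard of being recorded as cause k: a mixture of the true hazards
\lambda^*_k(t \mid Z) = \sum_{j=1}^{J} p_{jk}\,\lambda_j(t \mid Z),
\qquad \sum_{k=1}^{J} p_{jk} = 1 \ \text{for each } j .

% Plug-in pseudo-log-likelihood contribution of subject i,
% with external estimates \hat p_{jk} substituted for p_{jk}
\ell_i = \Delta_i \log\Big\{ \sum_{j=1}^{J} \hat p_{j C^*_i}\,\lambda_j(T_i \mid Z_i) \Big\}
         - \sum_{j=1}^{J} \Lambda_j(T_i \mid Z_i).
```

Because each row of the misclassification matrix sums to one, $\sum_k \lambda^*_k = \sum_j \lambda_j$, so misclassification of the cause leaves the overall survival term unchanged and only the event-type term needs adjusting.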

What carries the argument

B-spline-based sieve pseudo-likelihood function that folds external misclassification probabilities into the observed-data likelihood for joint estimation across event types.
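
A minimal sketch of the plug-in mechanism may help here. It is not the authors' estimator: to stay short it swaps the B-spline sieve and the regression model for constant cause-specific hazards with no covariates, and it maximizes by a crude grid search. The function name `pseudo_loglik`, the toy data, and the matrix `P` (with `P[j][k]` standing for the externally estimated probability of recording cause `k` when the true cause is `j`) are all hypothetical.

```python
import math


def pseudo_loglik(hazards, data, P):
    """Plug-in pseudo-log-likelihood under constant cause-specific hazards.

    hazards: candidate true hazards, one per cause (constant in time)
    data:    list of (time, recorded_cause); recorded_cause is None if censored
    P:       P[j][k] = P(recorded cause k | true cause j), treated as known
    """
    total_hazard = sum(hazards)
    ll = 0.0
    for time, recorded in data:
        # Survival term: every cause contributes to the cumulative hazard.
        ll -= total_hazard * time
        if recorded is not None:
            # Hazard of the recorded cause is a P-weighted mix of true hazards.
            mixed = sum(P[j][recorded] * hazards[j] for j in range(len(hazards)))
            ll += math.log(mixed)
    return ll


# Toy data (assumed): three events recorded as cause 0, one as cause 1, one censored.
data = [(1.0, 0), (2.0, 0), (1.0, 1), (2.0, None), (4.0, 0)]

# Assumed external estimates: 10% of cause-0 and 20% of cause-1 events are mislabeled.
P = [[0.9, 0.1],
     [0.2, 0.8]]

# Crude grid search standing in for a proper optimizer.
grid = [i / 100 for i in range(1, 61)]
best = max(((l0, l1) for l0 in grid for l1 in grid),
           key=lambda h: pseudo_loglik(h, data, P))
print("adjusted hazards (grid maximizer):", best)
```

In a sieve version, the constant hazards would be replaced by spline expansions and the grid search by a gradient-based optimizer; the way `P` enters the event term would stay the same.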

Load-bearing premise

The misclassification probabilities from the external validation study are accurate and apply without systematic differences to the main study population.

What would settle it

Apply the method to a dataset that also contains an internal gold-standard validation sample; large discrepancies between the externally adjusted estimates and the internal gold-standard estimates would show the adjustment does not work.

Original abstract

The analysis of competing risks data is often complicated by misclassification of the cause of failure. This issue can lead to seriously biased estimates and invalid conclusions. One way to deal with such misclassification is to use a gold-standard cause of failure ascertainment procedure in a subset of the non-right-censored participants (internal validation sample) along with methods for missing data to deal with the missing gold-standard ascertainments. However, this approach can be costly and time-consuming and, therefore, cannot be implemented in many studies. In this work, we propose a semiparametric regression analysis methodology for the case where no internal validation sample exists. Our approach leverages estimates of the misclassification probabilities from an external validation study to adjust for misclassification in the study at hand. These probabilities are incorporated in a B-spline-based sieve pseudo-likelihood function, which is maximized to jointly estimate models for all event types. Using empirical process theory, we show that the proposed estimator is consistent. Extensive simulation experiments demonstrate that the method performs well with realistic sample sizes and provides substantially more efficient estimates compared to previously proposed approaches. The methodology is applied to competing risks data from a large HIV observational study in sub-Saharan Africa, where event type is misclassified due to significant death under-reporting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a semiparametric method for regression analysis of competing risks data with misclassified event types when no internal validation sample is available. It incorporates misclassification probabilities estimated from an external validation study into a B-spline sieve pseudo-likelihood that is maximized to jointly estimate models for all event types, establishes consistency of the resulting estimator via empirical process theory, reports favorable simulation performance relative to prior approaches, and applies the method to HIV observational data with death under-reporting.

Significance. If the external misclassification probabilities transport to the target population, the approach offers a practical alternative to internal validation (which is often infeasible) and can yield more efficient estimates than existing methods that ignore or incompletely adjust for misclassification. The use of sieve estimation and empirical-process arguments for consistency is a technical strength when the modeling assumptions hold.

major comments (2)
  1. [Methods / Consistency theorem] The consistency result (via empirical process theory on the B-spline sieve pseudo-likelihood) is derived under the assumption that the misclassification matrix is known and correctly specified for the target population. The manuscript plugs external estimates directly into the pseudo-likelihood without additional correction terms for estimation error or population shift; if the external validation sample differs in covariate distribution or error mechanisms, the pseudo-likelihood is misspecified and the consistency guarantee fails (see the statement of the main consistency theorem and the modeling assumptions in the methods section).
  2. [Application and Simulations] No internal validation sample or sensitivity analysis is provided to assess robustness when the external misclassification probabilities do not match the main study; this assumption is load-bearing for the central claim yet is not testable within the observed data (see the HIV application and simulation design).
minor comments (2)
  1. [Methods] Clarify the precise form of the B-spline basis and knot selection procedure in the sieve approximation; the current description leaves the degrees of freedom and boundary handling ambiguous.
  2. [Simulations] In the simulation tables, report the empirical coverage of the proposed variance estimator or confidence intervals to allow direct assessment of inferential performance.
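
The coverage check requested in the second minor comment can be sketched as follows. This is a generic illustration, not the paper's simulation code: the function names are hypothetical, and it assumes each simulation replicate yields a point estimate and a standard error for one regression coefficient, with Wald-type intervals.

```python
import math


def empirical_coverage(estimates, std_errors, true_value, z=1.959964):
    """Share of replicates whose Wald interval (est +/- z * se) covers true_value."""
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    hits = sum(1 for est, se in zip(estimates, std_errors)
               if abs(est - true_value) <= z * se)
    return hits / len(estimates)


def monte_carlo_se(coverage, n_replicates):
    """Binomial standard error of an estimated coverage probability."""
    return math.sqrt(coverage * (1.0 - coverage) / n_replicates)


# Toy replicates (assumed values, for illustration only).
estimates = [0.5, 0.7, 0.2, 0.45]
std_errors = [0.1, 0.05, 0.1, 0.1]
cov = empirical_coverage(estimates, std_errors, true_value=0.5)
print(cov, monte_carlo_se(cov, len(estimates)))
```

Reporting the Monte Carlo standard error alongside the coverage helps a reader judge whether a deviation from the nominal 95% is larger than simulation noise.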

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. We address each major comment in turn below and describe the revisions we intend to make.

Point-by-point responses
  1. Referee: [Methods / Consistency theorem] The consistency result (via empirical process theory on the B-spline sieve pseudo-likelihood) is derived under the assumption that the misclassification matrix is known and correctly specified for the target population. The manuscript plugs external estimates directly into the pseudo-likelihood without additional correction terms for estimation error or population shift; if the external validation sample differs in covariate distribution or error mechanisms, the pseudo-likelihood is misspecified and the consistency guarantee fails (see the statement of the main consistency theorem and the modeling assumptions in the methods section).

    Authors: We agree that the consistency theorem is established under the assumption that the misclassification probabilities are known and correctly specified for the target population. The current proof treats these probabilities as fixed and does not include correction terms for their estimation from the external sample or for possible lack of transportability. This is a genuine limitation of the theoretical result as stated. In the revision we will explicitly highlight this assumption in the statement of the theorem and in the methods section, and we will add a discussion paragraph noting the conditions required for consistency and outlining possible extensions (e.g., bootstrap or joint modeling) to incorporate estimation variability.
    Revision: partial

  2. Referee: [Application and Simulations] No internal validation sample or sensitivity analysis is provided to assess robustness when the external misclassification probabilities do not match the main study; this assumption is load-bearing for the central claim yet is not testable within the observed data (see the HIV application and simulation design).

    Authors: The referee correctly notes that both the simulation design and the HIV application assume the external misclassification probabilities apply to the main study. Because the method targets settings without an internal validation sample, we cannot supply such a sample. We will, however, strengthen the manuscript by adding sensitivity analyses. In the revised simulations we will include scenarios in which the probabilities used in the pseudo-likelihood differ from those used to generate the data. For the HIV application we will report results under a range of plausible misclassification probabilities drawn from the literature and expert opinion.
    Revision: yes
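
The sensitivity analysis promised in the second response can be illustrated with a deliberately simplified sketch. It assumes two causes with constant hazards and no covariates, so the adjustment reduces to inverting a 2x2 misclassification matrix; it is not the paper's estimator. The names `correct_hazards` and `sensitivity_table`, the observed hazards, and the grids are hypothetical. Here `a` is the probability that a true cause-0 event is recorded as cause 0, and `b` the same for cause 1.

```python
def correct_hazards(observed, a, b):
    """Back out true hazards from hazards of the recorded causes.

    observed: (hazard of events recorded as cause 0, recorded as cause 1)
    a, b:     assumed correct-classification probabilities for causes 0 and 1
    Returns (lam0, lam1), or None if the assumed probabilities are
    incompatible with the observed hazards.
    """
    det = a + b - 1.0  # determinant of the 2x2 misclassification matrix
    if det <= 0.0:
        return None
    lam0 = (b * observed[0] - (1.0 - b) * observed[1]) / det
    lam1 = (a * observed[1] - (1.0 - a) * observed[0]) / det
    if lam0 < 0.0 or lam1 < 0.0:
        return None
    return lam0, lam1


def sensitivity_table(observed, a_grid, b_grid):
    """Corrected hazards for every combination of assumed probabilities."""
    return [(a, b, correct_hazards(observed, a, b))
            for a in a_grid for b in b_grid]


observed = (0.3, 0.1)  # assumed observed hazards, for illustration only
table = sensitivity_table(observed, [0.8, 0.9, 1.0], [0.7, 0.8, 1.0])
for a, b, corrected in table:
    print(f"a={a:.1f} b={b:.1f} corrected={corrected}")
```

Rows returning `None` flag assumed probabilities that cannot have produced the observed data, which is itself useful information when choosing a plausible range.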

Circularity Check

0 steps flagged

No circularity: consistency derived from empirical process theory on external-plug-in pseudo-likelihood

Full rationale

The paper defines a B-spline sieve pseudo-likelihood that incorporates fixed external misclassification probability estimates as known inputs, then invokes standard empirical process theory to establish consistency of the maximizer. This is a conventional semiparametric construction whose asymptotic result does not reduce to any fitted quantity defined from the same data or to a self-citation chain. The external estimates are treated as given under an explicit transportability assumption; no step renames a known result, smuggles an ansatz, or makes the target estimator equivalent to its inputs by construction. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on transferability of external misclassification probabilities and on standard empirical process results for consistency of the sieve estimator; no new entities are postulated.

free parameters (1)
  • B-spline knot placement and degree
    Chosen to approximate the cause-specific hazard functions; number and location affect the sieve approximation but are not data-fitted in the reported sense.
axioms (2)
  • [standard math] Empirical process theory applies to the B-spline sieve pseudo-likelihood to establish consistency
    Invoked to prove the proposed estimator is consistent as sample size grows.
  • [domain assumption] Misclassification probabilities from external study are fixed and correctly measured
    Used directly in the pseudo-likelihood without further modeling of uncertainty in those probabilities.
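
To show what the knot placement and degree listed as the free parameter control, here is a small sketch of a B-spline basis built with the Cox-de Boor recursion. It assumes a clamped knot vector (each boundary knot repeated degree + 1 times), which is one common convention and not necessarily the paper's; the function names and the example knots are hypothetical. Under this convention the number of basis functions equals the number of interior knots plus degree plus one.

```python
def clamped_knots(lo, hi, interior, degree):
    """Knot vector with each boundary knot repeated degree + 1 times."""
    return [lo] * (degree + 1) + sorted(interior) + [hi] * (degree + 1)


def bspline_basis(x, knots, degree):
    """Evaluate every B-spline basis function at x (Cox-de Boor recursion)."""
    last = knots[-1]
    # Degree 0: indicators of half-open knot intervals; the final non-empty
    # interval is closed on the right so the basis is defined at the boundary.
    basis = []
    for i in range(len(knots) - 1):
        left, right = knots[i], knots[i + 1]
        inside = left <= x < right or (x == last and right == last and left < right)
        basis.append(1.0 if inside else 0.0)
    # Raise the degree one step at a time; zero-width spans contribute nothing.
    for d in range(1, degree + 1):
        raised = []
        for i in range(len(knots) - d - 1):
            value = 0.0
            span_left = knots[i + d] - knots[i]
            if span_left > 0:
                value += (x - knots[i]) / span_left * basis[i]
            span_right = knots[i + d + 1] - knots[i + 1]
            if span_right > 0:
                value += (knots[i + d + 1] - x) / span_right * basis[i + 1]
            raised.append(value)
        basis = raised
    return basis


# Cubic splines on [0, 10] with two interior knots (assumed values).
knots = clamped_knots(0.0, 10.0, [2.0, 5.0], degree=3)
values = bspline_basis(3.7, knots, degree=3)
print(len(values), sum(values))
```

The basis functions are non-negative and sum to one at every point of the interval, so a spline coefficient vector can be read as local weights; adding an interior knot adds exactly one coefficient per cause-specific function.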



