Semiparametric Regression for Misclassified Competing Risks Data
Pith reviewed 2026-05-20 15:00 UTC · model grok-4.3
The pith
Misclassification in competing risks data is corrected by plugging external validation probabilities into a B-spline sieve pseudo-likelihood.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By incorporating estimates of the misclassification probabilities from an external validation study into a B-spline-based sieve pseudo-likelihood function, the proposed estimator is consistent for the regression parameters in misclassified competing risks data and jointly estimates models for all event types.
What carries the argument
B-spline-based sieve pseudo-likelihood function that folds external misclassification probabilities into the observed-data likelihood for joint estimation across event types.
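The "folding" step can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes misclassification that does not depend on time or covariates, so that the hazard of each observed cause is a mixture of the true cause-specific hazards weighted by p[k][j] = P(observed cause j | true cause k). The function name `observed_hazards`, the two-cause labelling, and all numbers are hypothetical.

```python
def observed_hazards(true_hazards, p):
    """Mix true cause-specific hazards into observed-cause hazards.

    true_hazards[k] : hazard of true cause k at a fixed time t
    p[k][j]         : assumed P(observed cause j | true cause k); rows sum to 1
    """
    n_causes = len(true_hazards)
    for row in p:
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row of p must sum to 1")
    # Observed hazard of cause j = sum over true causes k of p[k][j] * hazard_k.
    return [
        sum(p[k][j] * true_hazards[k] for k in range(n_causes))
        for j in range(n_causes)
    ]


# Hypothetical example: cause 0 = death, cause 1 = another event type.
# 30% of true deaths are recorded as the other type (death under-reporting);
# the other type is assumed never to be recorded as death.
p = [[0.7, 0.3],
     [0.0, 1.0]]
lam_obs = observed_hazards([0.10, 0.20], p)
```

Because each row of p sums to one, the total hazard is unchanged by the mixing; only its split across event types moves. In a plug-in approach the entries of p would come from the external validation study rather than being estimated from the main data.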
Load-bearing premise
The misclassification probabilities from the external validation study are accurate and apply without systematic differences to the main study population.
What would settle it
Apply the method to a dataset that also contains an internal gold-standard validation sample; large discrepancies between the externally adjusted estimates and the internal gold-standard estimates would show the adjustment does not work.
Original abstract
The analysis of competing risks data is often complicated by misclassification of the cause of failure. This issue can lead to seriously biased estimates and invalid conclusions. One way to deal with such misclassification is to use a gold-standard cause of failure ascertainment procedure in a subset of the non-right-censored participants (internal validation sample) along with methods for missing data to deal with the missing gold-standard ascertainments. However, this approach can be costly and time-consuming and, therefore, cannot be implemented in many studies. In this work, we propose a semiparametric regression analysis methodology for the case where no internal validation sample exists. Our approach leverages estimates of the misclassification probabilities from an external validation study to adjust for misclassification in the study at hand. These probabilities are incorporated in a B-spline-based sieve pseudo-likelihood function, which is maximized to jointly estimate models for all event types. Using empirical process theory, we show that the proposed estimator is consistent. Extensive simulation experiments demonstrate that the method performs well with realistic sample sizes and provides substantially more efficient estimates compared to previously proposed approaches. The methodology is applied to competing risks data from a large HIV observational study in sub-Saharan Africa, where event type is misclassified due to significant death under-reporting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a semiparametric method for regression analysis of competing risks data with misclassified event types when no internal validation sample is available. It incorporates misclassification probabilities estimated from an external validation study into a B-spline sieve pseudo-likelihood that is maximized to jointly estimate models for all event types, establishes consistency of the resulting estimator via empirical process theory, reports favorable simulation performance relative to prior approaches, and applies the method to HIV observational data with death under-reporting.
Significance. If the external misclassification probabilities transport to the target population, the approach offers a practical alternative to internal validation (which is often infeasible) and can yield more efficient estimates than existing methods that ignore or incompletely adjust for misclassification. The use of sieve estimation and empirical-process arguments for consistency is a technical strength when the modeling assumptions hold.
major comments (2)
- [Methods / Consistency theorem] The consistency result (via empirical process theory on the B-spline sieve pseudo-likelihood) is derived under the assumption that the misclassification matrix is known and correctly specified for the target population. The manuscript plugs external estimates directly into the pseudo-likelihood without additional correction terms for estimation error or population shift; if the external validation sample differs in covariate distribution or error mechanisms, the pseudo-likelihood is misspecified and the consistency guarantee fails (see the statement of the main consistency theorem and the modeling assumptions in the methods section).
- [Application and Simulations] No internal validation sample or sensitivity analysis is provided to assess robustness when the external misclassification probabilities do not match the main study; this assumption is load-bearing for the central claim yet is not testable within the observed data (see the HIV application and simulation design).
minor comments (2)
- [Methods] Clarify the precise form of the B-spline basis and knot selection procedure in the sieve approximation; the current description leaves the degrees of freedom and boundary handling ambiguous.
- [Simulations] In the simulation tables, report the empirical coverage of the proposed variance estimator or confidence intervals to allow direct assessment of inferential performance.
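The ambiguity raised in the first minor comment can be made concrete with a short sketch of a B-spline basis. It uses the standard Cox-de Boor recursion with a clamped knot vector, which is one common convention and not necessarily the paper's choice. The helper names `bspline_basis` and `make_knots`, the interval [0, 10], and the knot locations are hypothetical.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x via the Cox-de Boor recursion.

    knots   : full non-decreasing knot vector (boundary knots already repeated)
    returns : list of len(knots) - degree - 1 basis values
    """
    n_intervals = len(knots) - 1
    # Degree 0: indicator of the half-open knot interval containing x.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(n_intervals)]
    # Boundary handling: close the last non-empty interval on the right.
    if x == knots[-1]:
        last = max(i for i in range(n_intervals) if knots[i] < knots[i + 1])
        basis[last] = 1.0
    for d in range(1, degree + 1):
        new = []
        for i in range(n_intervals - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            # Terms with a zero denominator (repeated knots) are defined as 0.
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            new.append(left + right)
        basis = new
    return basis


def make_knots(lower, upper, interior, degree):
    """Clamped knot vector: each boundary knot repeated degree + 1 times."""
    return [lower] * (degree + 1) + sorted(interior) + [upper] * (degree + 1)


# Hypothetical setup: cubic splines on [0, 10] with 3 interior knots.
knots = make_knots(0.0, 10.0, [2.5, 5.0, 7.5], degree=3)
values = bspline_basis(4.0, knots, degree=3)
at_upper = bspline_basis(10.0, knots, degree=3)
```

Under this convention the number of basis functions equals the number of interior knots plus the degree plus one (here 3 + 3 + 1 = 7), and the basis values sum to one at any point in the interval. Stating these two choices, clamped boundary knots and the interior knot count, is the kind of detail the comment asks the authors to pin down.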
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review of our manuscript. We address each major comment in turn below and describe the revisions we intend to make.
Point-by-point responses
- Referee: [Methods / Consistency theorem] The consistency result (via empirical process theory on the B-spline sieve pseudo-likelihood) is derived under the assumption that the misclassification matrix is known and correctly specified for the target population. The manuscript plugs external estimates directly into the pseudo-likelihood without additional correction terms for estimation error or population shift; if the external validation sample differs in covariate distribution or error mechanisms, the pseudo-likelihood is misspecified and the consistency guarantee fails (see the statement of the main consistency theorem and the modeling assumptions in the methods section).
Authors: We agree that the consistency theorem is established under the assumption that the misclassification probabilities are known and correctly specified for the target population. The current proof treats these probabilities as fixed and does not include correction terms for their estimation from the external sample or for possible lack of transportability. This is a genuine limitation of the theoretical result as stated. In the revision we will explicitly highlight this assumption in the statement of the theorem and in the methods section, and we will add a discussion paragraph noting the conditions required for consistency and outlining possible extensions (e.g., bootstrap or joint modeling) to incorporate estimation variability.
Revision: partial
- Referee: [Application and Simulations] No internal validation sample or sensitivity analysis is provided to assess robustness when the external misclassification probabilities do not match the main study; this assumption is load-bearing for the central claim yet is not testable within the observed data (see the HIV application and simulation design).
Authors: The referee correctly notes that both the simulation design and the HIV application assume the external misclassification probabilities apply to the main study. Because the method targets settings without an internal validation sample, we cannot supply such a sample. We will, however, strengthen the manuscript by adding sensitivity analyses. In the revised simulations we will include scenarios in which the probabilities used in the pseudo-likelihood differ from those used to generate the data. For the HIV application we will report results under a range of plausible misclassification probabilities drawn from the literature and expert opinion.
Revision: yes
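The kind of sensitivity analysis promised in the second response can be sketched in a few lines. The example below is a crude moment correction, not the paper's pseudo-likelihood. It assumes one-directional misclassification with two event types: a true death is recorded as the other type with some probability, and the other type is never recorded as death. The function name `corrected_death_share`, the observed share, and the grid of probabilities are all hypothetical.

```python
def corrected_death_share(observed_share, under_report_prob):
    """Back out the true share of deaths among observed failures.

    Assumes observed_share = (1 - under_report_prob) * true_share, i.e. deaths
    are under-reported and the other event type is never recorded as death.
    """
    if not 0.0 <= under_report_prob < 1.0:
        raise ValueError("under_report_prob must be in [0, 1)")
    corrected = observed_share / (1.0 - under_report_prob)
    if corrected > 1.0:
        raise ValueError("assumed probability is incompatible with the observed share")
    return corrected


# Hypothetical inputs: 14% of observed failures are recorded as deaths.
observed = 0.14
grid = [0.0, 0.2, 0.4, 0.6]  # assumed under-reporting probabilities
sensitivity = {q: corrected_death_share(observed, q) for q in grid}
```

Reporting the corrected quantity across the whole grid, rather than at a single plugged-in value, shows how strongly the conclusion depends on the external probability. The same pattern would apply to regression estimates by refitting the model at each grid value.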
Circularity Check
No circularity: consistency derived from empirical process theory on external-plug-in pseudo-likelihood
Full rationale
The paper defines a B-spline sieve pseudo-likelihood that incorporates fixed external misclassification probability estimates as known inputs, then invokes standard empirical process theory to establish consistency of the maximizer. This is a conventional semiparametric construction whose asymptotic result does not reduce to any fitted quantity defined from the same data or to a self-citation chain. The external estimates are treated as given under an explicit transportability assumption; no step renames a known result, smuggles an ansatz, or makes the target estimator equivalent to its inputs by construction. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- B-spline knot placement and degree
axioms (2)
- standard math: Empirical process theory applies to the B-spline sieve pseudo-likelihood to establish consistency
- domain assumption: Misclassification probabilities from external study are fixed and correctly measured
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Passage: B-spline-based sieve pseudo-likelihood function... maximized to jointly estimate models for all event types. Using empirical process theory, we show that the proposed estimator is consistent.
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Passage: misclassification probabilities... incorporated in a B-spline-based sieve pseudo-likelihood
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.