Regression Analysis of Dependent Binary Data for Estimating Disease Etiology from Case-Control Studies
Pith reviewed 2026-05-25 19:48 UTC · model grok-4.3
The pith
Control data on diagnostic measures enables regression analysis of how covariates affect disease etiology fractions in case-control studies
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The extended framework estimates the distribution of diagnostic measures given covariates from controls alone and uses that estimate to inform the measurement model for cases. This lets it assign latent cause probabilities to individual cases and produce regression functions for the population etiologic fractions, while accounting for imperfect sensitivity, imperfect specificity, and dependence among multiple binary measures.
What carries the argument
The extended nested partially-latent class model, which uses a separate regression fit on controls to supply the measurement specificities and dependence structure that are transferred to the cases.
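The control-to-case transfer can be sketched in a deliberately simplified setting: one binary measure per candidate cause, and local independence among measures (the nested model relaxes this with subclasses). The function name `cause_posterior` and the arguments `pef`, `tpr`, and `fpr` are illustrative assumptions, not the paper's notation; this is a minimal sketch of the Bayes-rule step, not the authors' implementation.

```python
import numpy as np

def cause_posterior(m, pef, tpr, fpr):
    """Posterior probability of each candidate cause for one case.

    Assumed (hypothetical) inputs, all of length J:
      m   : binary measurements, one per candidate cause
      pef : prior cause probabilities (etiologic fractions), summing to 1
      tpr : sensitivity of measure j when cause j is the true cause
      fpr : false positive rate (1 - specificity), as learned from controls
    """
    m = np.asarray(m, dtype=float)
    pef, tpr, fpr = (np.asarray(a, dtype=float) for a in (pef, tpr, fpr))
    # Log-likelihood of each measure when its pathogen is NOT the cause.
    log_bg = m * np.log(fpr) + (1 - m) * np.log1p(-fpr)
    # Log-likelihood of each measure when its pathogen IS the cause.
    log_causal = m * np.log(tpr) + (1 - m) * np.log1p(-tpr)
    # For cause l, swap measure l's background term for its causal term.
    log_post = np.log(pef) + log_bg.sum() - log_bg + log_causal
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

probs = cause_posterior(m=[1, 0, 0], pef=[1/3, 1/3, 1/3],
                        tpr=[0.9, 0.9, 0.9], fpr=[0.1, 0.1, 0.1])
```

In this toy call the single positive measure pulls most of the posterior mass onto the first cause. How much depends directly on the control-derived false positive rates, which is why their accuracy matters.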
If this is right
- Estimation of overall population etiologic fractions exhibits less bias than the version of the model that omits covariates
- Inference on the overall fractions is more valid once covariate information is included
- The method can reveal how etiology depends on measured factors such as season, age, disease severity, and HIV status
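To make the distinction between covariate-specific and overall fractions concrete, here is a minimal sketch. It assumes a multinomial logistic form for the PEF functions and assumes that the overall PEF is summarized as the average of case-level PEFs over the observed case covariates. The names `pef_functions`, `overall_pef`, `X`, and `beta` are hypothetical.

```python
import numpy as np

def pef_functions(X, beta):
    """Multinomial logistic PEFs: row i holds pi_l(X_i) for l = 1..L.

    X    : (n, p) design matrix for the cases (assumed layout)
    beta : (p, L) regression coefficients, one column per cause
    """
    eta = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    eta = eta - eta.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)

def overall_pef(X, beta):
    """Average the case-level PEFs over the observed case covariates."""
    return pef_functions(X, beta).mean(axis=0)

# Toy example: intercept plus a binary season indicator, three causes.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
beta = np.array([[0.0, 0.5, -0.5],
                 [0.0, 1.0, 0.0]])
by_case = pef_functions(X, beta)
overall = overall_pef(X, beta)
```

Under these assumptions the overall fractions depend on the covariate mix among cases. This is one way to see why omitting covariates could bias the overall estimates.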
Where Pith is reading between the lines
- The same separation of control and case models could be applied to other infectious-disease studies that collect multiple imperfect diagnostic tests on both cases and controls
- If the control regression is misspecified, the resulting case assignments would be biased, so sensitivity checks that vary the control model form would be a natural next step
- The framework could be extended to time-to-event or longitudinal covariates if the control regression is generalized accordingly
Load-bearing premise
The separate regression model fitted to the controls' diagnostic measures given covariates is correctly specified and supplies accurate information on specificities and conditional dependence structures that can be transferred to the cases.
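In symbols, the premise can be written out for a simplified special case. The notation below is an assumption made for illustration: it takes one binary measure per candidate cause and local independence among measures, which the nested model relaxes. Here \psi_j(w) is the false positive rate of measure j learned from controls, \theta_\ell is the sensitivity of measure \ell when \ell is the true cause, and \pi_\ell(x) is the PEF function.

```latex
% Simplified sketch (assumed notation), J candidate causes, I = 0 for controls.
\begin{align*}
P_0(m; w) &= [M = m \mid W = w, I = 0]
  = \prod_{j=1}^{J} \psi_j(w)^{m_j} \{1 - \psi_j(w)\}^{1 - m_j}, \\
P_1(m; x, w) &= \sum_{\ell=1}^{J} \pi_\ell(x)\,
  \theta_\ell^{m_\ell} (1 - \theta_\ell)^{1 - m_\ell}
  \prod_{j \neq \ell} \psi_j(w)^{m_j} \{1 - \psi_j(w)\}^{1 - m_j}.
\end{align*}
```

The same \psi_j(w) appears in both lines. If the control model gets it wrong, the error carries straight into the case likelihood.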
What would settle it
A validation study in which a subset of cases has known true causes from a gold-standard test; if the regression model's estimated cause probabilities for those cases deviate systematically from the known causes while the control model fits well, the transfer of information would be shown to fail.
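One way such a check might be summarized is sketched below. The function `validation_summary` and its inputs are hypothetical: `post` holds the model's estimated cause probabilities for the gold-standard subset and `truth` holds the known cause indices.

```python
import numpy as np

def validation_summary(post, truth):
    """Compare estimated cause probabilities with known causes.

    post  : (n, L) estimated cause probabilities for validated cases
    truth : (n,) integer index of the known cause for each case
    Returns the multiclass Brier score and, per cause, the gap between
    the mean estimated probability and the observed frequency.
    """
    post = np.asarray(post, dtype=float)
    truth = np.asarray(truth, dtype=int)
    n_causes = post.shape[1]
    onehot = np.eye(n_causes)[truth]  # (n, L) indicator of the true cause
    brier = np.mean(np.sum((post - onehot) ** 2, axis=1))
    gap = post.mean(axis=0) - onehot.mean(axis=0)
    return brier, gap

brier, gap = validation_summary(post=[[0.5, 0.5], [0.5, 0.5]], truth=[0, 0])
```

A per-cause gap that stays far from zero while the control model fits well would be the kind of systematic deviation described above.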
Original abstract
In large-scale disease etiology studies, epidemiologists often need to use multiple binary measures of unobserved causes of disease that are not perfectly sensitive or specific to estimate cause-specific case fractions, referred to as "population etiologic fractions" (PEFs). Despite recent methodological advances, the scientific need of incorporating control data to estimate the effect of explanatory variables upon the PEFs, however, remains unmet. In this paper, we build on and extend nested partially-latent class model (npLCMs, Wu et al., 2017) to a general framework for etiology regression analysis in case-control studies. Data from controls provide requisite information about measurement specificities and covariations, which is used to correctly assign cause-specific probabilities for each case given her measurements. We estimate the distribution of the controls' diagnostic measures given the covariates via a separate regression model and a priori encourage simpler conditional dependence structures. We use Markov chain Monte Carlo for posterior inference of the PEF functions, cases' latent classes and the overall PEFs of policy interest. We illustrate the regression analysis with simulations and show less biased estimation and more valid inference of the overall PEFs than an npLCM analysis omitting covariates. A regression analysis of data from a childhood pneumonia study site reveals the dependence of pneumonia etiology upon season, age, disease severity and HIV status.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends nested partially-latent class models (npLCMs) to a regression framework for estimating covariate-dependent population etiologic fractions (PEFs) from case-control studies with multiple imperfect binary diagnostic measures. A separate regression is fit to control data to recover specificities and conditional dependence structure; this is transferred to the case model to assign latent class probabilities, with MCMC used for posterior inference on PEF functions and overall PEFs. Simulations are reported to yield less biased PEF estimates than covariate-omitting npLCM, and the approach is illustrated on childhood pneumonia data showing dependence on season, age, severity, and HIV status.
Significance. If the transferability assumption holds, the work supplies a needed tool for covariate-adjusted etiology estimation in large-scale studies, directly addressing the unmet need stated in the abstract. It builds on Wu et al. (2017) with a practical MCMC implementation and an explicit preference for simpler dependence structures. The framework could improve policy-relevant PEF inference when covariates are available.
major comments (2)
- [Abstract] The claim that 'simulations show less biased estimation and more valid inference of the overall PEFs' is presented without any quantitative metrics (bias, coverage, or simulation design details). This leaves the central performance claim unsupported in the summary and requires the results section to supply the missing numbers and settings.
- [Model description] Model construction (control-to-case transfer step): the measurement model (specificities and conditional dependence) estimated from controls is transferred unchanged to cases under the assumption that disease status does not alter these properties. No analytic derivation, sensitivity analysis, or diagnostic check is described to test this invariance; because the claim of 'correctly assign cause-specific probabilities' rests on this transfer, the assumption is load-bearing and needs explicit robustness assessment.
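The metrics requested in the first comment are straightforward to compute once simulation replicates are available. The sketch below assumes each replicate yields a point estimate and an interval for every overall PEF; the function name and array layout are hypothetical.

```python
import numpy as np

def bias_and_coverage(estimates, lower, upper, truth):
    """Empirical bias and interval coverage across simulation replicates.

    estimates, lower, upper : (R, L) arrays, one row per replicate
    truth                   : (L,) true overall PEFs used to simulate
    """
    estimates = np.asarray(estimates, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    truth = np.asarray(truth, dtype=float)
    bias = estimates.mean(axis=0) - truth
    covered = (lower <= truth) & (truth <= upper)  # (R, L) booleans
    coverage = covered.mean(axis=0)
    return bias, coverage

est = np.array([[0.5, 0.5], [0.7, 0.3]])
bias, coverage = bias_and_coverage(est, est - 0.15, est + 0.15,
                                   truth=[0.6, 0.4])
```

Reporting these two vectors for the regression model and for the covariate-omitting model, side by side, would give the abstract's comparison a quantitative basis.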
minor comments (1)
- [Abstract] 'npLCMs' is used before the parenthetical expansion; spell out on first use for clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and indicate planned revisions to strengthen the manuscript.
- Referee: [Abstract] The claim that 'simulations show less biased estimation and more valid inference of the overall PEFs' is presented without any quantitative metrics (bias, coverage, or simulation design details). This leaves the central performance claim unsupported in the summary and requires the results section to supply the missing numbers and settings.
  Authors: We agree that the abstract would be strengthened by including quantitative metrics. In the revised manuscript we will add concise statements of key simulation results (e.g., average bias reduction and empirical coverage rates across scenarios) together with a brief description of the simulation design (sample sizes, number of replicates, covariate configurations). These numbers already appear in the results section; we will simply summarize them in the abstract as requested.
  Revision: yes
- Referee: [Model description] Model construction (control-to-case transfer step): the measurement model (specificities and conditional dependence) estimated from controls is transferred unchanged to cases under the assumption that disease status does not alter these properties. No analytic derivation, sensitivity analysis, or diagnostic check is described to test this invariance; because the claim of 'correctly assign cause-specific probabilities' rests on this transfer, the assumption is load-bearing and needs explicit robustness assessment.
  Authors: The referee correctly notes that the invariance assumption is central. While the assumption is standard in the case-control etiology literature (measurement properties are viewed as test characteristics independent of disease status), we acknowledge that the manuscript would benefit from explicit robustness checks. In the revision we will add a dedicated sensitivity-analysis subsection that perturbs the transferred specificities and dependence parameters within plausible ranges and reports the resulting changes in PEF estimates. We will also expand the model-description text to state the assumption more explicitly and cite supporting literature.
  Revision: yes
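A perturbation analysis of the kind promised in the second response might look like the sketch below. It is a plug-in illustration under a simplified local-independence model with one measure per cause, not a refit of the full MCMC. The functions `cause_posteriors` and `fpr_sensitivity`, and the logit-scale `shifts`, are assumptions made for illustration.

```python
import numpy as np

def cause_posteriors(M, pef, tpr, fpr):
    """Row-wise posterior cause probabilities for an (n, J) binary matrix M."""
    M = np.asarray(M, dtype=float)
    pef, tpr, fpr = (np.asarray(a, dtype=float) for a in (pef, tpr, fpr))
    log_bg = M * np.log(fpr) + (1 - M) * np.log1p(-fpr)
    log_causal = M * np.log(tpr) + (1 - M) * np.log1p(-tpr)
    log_post = (np.log(pef) + log_bg.sum(axis=1, keepdims=True)
                - log_bg + log_causal)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def fpr_sensitivity(M, pef, tpr, fpr, shifts=(-0.5, 0.0, 0.5)):
    """Shift the control-derived FPRs on the logit scale and record the
    resulting average cause probabilities among cases."""
    fpr = np.asarray(fpr, dtype=float)
    logit = np.log(fpr) - np.log1p(-fpr)
    results = {}
    for s in shifts:
        shifted = 1.0 / (1.0 + np.exp(-(logit + s)))
        results[s] = cause_posteriors(M, pef, tpr, shifted).mean(axis=0)
    return results

M = np.array([[1, 0], [0, 1], [1, 1]])
out = fpr_sensitivity(M, pef=[0.5, 0.5], tpr=[0.9, 0.8], fpr=[0.1, 0.2])
```

If the averages in `out` move little across shifts, the conclusions are relatively robust to modest errors in the transferred specificities. Large swings would indicate that the invariance assumption carries most of the inference.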
Circularity Check
No circularity; control regression supplies independent measurement parameters transferred to case model.
The paper fits a regression model exclusively to control diagnostic data to estimate specificities and conditional dependence structures, then transfers those estimates to the case model for latent class and PEF inference. This separation means the PEF functions are not defined in terms of themselves or recovered by construction from the same fitted quantities. The citation to Wu et al. (2017) supplies the base npLCM structure but does not create a self-citation load-bearing loop for the regression extension. No equations reduce predictions to inputs, no ansatz is smuggled, and no uniqueness theorem is invoked from overlapping authors. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] MCMC sampling yields valid posterior inference for the PEF functions and latent classes
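This axiom holds only if the chains have converged. One common check, in the spirit of the Brooks and Gelman (1998) entry in the reference list, is the potential scale reduction factor. The sketch below uses the classic non-split formula; whether the authors used this diagnostic is not stated in the material reviewed here.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Classic Gelman-Rubin R-hat for one scalar parameter.

    chains : (m, n) array, m parallel chains with n post-burn-in draws each
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()      # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)   # B: scaled variance of chain means
    var_hat = (n - 1) / n * within + between / n    # pooled variance estimate
    return np.sqrt(var_hat / within)

mixed = potential_scale_reduction([[1, 2, 3, 4], [1, 2, 3, 4]])
stuck = potential_scale_reduction([[0, 1, 0, 1], [10, 11, 10, 11]])
```

Values close to 1 are consistent with convergence. Chains stuck in different regions, as in the second toy call, give values well above 1.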
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We extend npLCM to perform regression analysis... multinomial logistic regression model $\pi_{i\ell} = \pi_\ell(X_i)$... stick-breaking parameterization... MCMC for posterior inference of the PEF functions
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Data from controls provide requisite information about measurement specificities and covariations... $P_0(m; w) = [M = m \mid W = w, I = 0]$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L., and Rathouz, P. J. (1997). Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92(440):1375-1386.
- [2] Bartolucci, F. and Forcina, A. (2006). A class of latent marginal models for capture-recapture data with continuous covariates. Journal of the American Statistical Association, 101(474):786-794.
- [3] Bedrick, E. J., Christensen, R., and Johnson, W. (1996). A new perspective on priors for generalized linear models. Journal of the American Statistical Association, 91(436):1450-1460.
- [4] Brooks, S. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434-455.
- [5] Carlin, B. and Louis, T. (2009). Bayesian methods for data analysis, volume 78. Chapman & Hall/CRC.
- [6] Chung, H., Flaherty, B. P., and Schafer, J. L. (2006). Latent class logistic regression: application to marijuana use and attitudes among high school seniors. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4):723-743.
- [7] Crawley, J., Prosperi, C., Baggett, H. C., Brooks, W. A., Deloria Knoll, M., Hammitt, L. L., Howie, S. R., Kotloff, K. L., Levine, O. S., Madhi, S. A., et al. (2017). Standardization of clinical assessment and sample collection across all PERCH study sites. Clinical Infectious Diseases, 64(suppl_3):S228-S237.
- [8] Deloria Knoll, M., Fu, W., Shi, Q., Prosperi, C., Wu, Z., Hammitt, L. L., Feikin, D. R., Baggett, H. C., Howie, S. R., Scott, J. A. G., et al. (2017). Bayesian estimation of pneumonia etiology: epidemiologic considerations and applications to the Pneumonia Etiology Research for Child Health study. Clinical Infectious Diseases, 64(suppl_3):S213-S227.
- [9] Dunson, D. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104(487):1042-1051.
- [10] Erosheva, E. A., Fienberg, S. E., and Joutard, C. (2007). Describing disability through individual-level mixture models for multivariate binary data. The Annals of Applied Statistics, 1(2):346.
- [11] Feikin, D., Scott, J., and Gessner, B. (2014). Use of vaccines as probes to define disease burden. The Lancet, 383(9930):1762-1770.
- [12] Gelfand, A. and Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, pages 398-409.
- [13] Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, pages 1360-1383.
- [14] Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4):733-760.
- [15] Geweke, J. and Zhou, G. (1996). Measuring the pricing error of the arbitrage pricing theory. The Review of Financial Studies, 9(2):557-587.
- [16] Goodman, L. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61(2):215-231.
- [17] Gryparis, A., Coull, B. A., Schwartz, J., and Suh, H. H. (2007). Semiparametric latent variable regression models for spatiotemporal modelling of mobile source particles in the greater Boston area. Journal of the Royal Statistical Society: Series C (Applied Statistics), 56(2):183-209.
- [18]
- [19]
- [20] Gustafson, P. (2015). Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data, volume 140. CRC Press.
- [21] Gustafson, P., Lefebvre, G., et al. (2008). Bayesian multinomial regression with class-specific predictor selection. The Annals of Applied Statistics, 2(4):1478-1502.
- [22] Hammitt, L. L., Feikin, D. R., Scott, J. A. G., Zeger, S. L., Murdoch, D. R., O'Brien, K. L., and Deloria Knoll, M. (2017). Addressing the analytic challenges of cross-sectional pediatric pneumonia etiology data. Clinical Infectious Diseases, 64(suppl_3):S197-S204.
- [23] Hastie, T. and Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3):297-318.
- [24] Huang, G.-H. and Bandeen-Roche, K. (2004). Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika, 69(1):5-32.
- [25] Jones, G., Johnson, W., Hanson, T., and Christensen, R. (2010). Identifiability of models for multiple diagnostic testing in the absence of a gold standard. Biometrics, 66(3):855-863.
- [26] Kotloff, K. L., Nataro, J. P., Blackwelder, W. C., Nasrin, D., Farag, T. H., Panchalingam, S., Wu, Y., Sow, S. O., Sur, D., Breiman, R. F., et al. (2013). Burden and aetiology of diarrhoeal disease in infants and young children in developing countries (the Global Enteric Multicenter Study, GEMS): a prospective, case-control study. The Lancet, 382(9888):209-222.
- [27] Lang, S. and Brezger, A. (2004). Bayesian P-splines. Journal of Computational and Graphical Statistics, 13(1):183-212.
- [28] Lazarsfeld, P. F. (1950). The logical and mathematical foundations of latent structure analysis, volume IV, chapter The American Soldier: Studies in Social Psychology in World War II, pages 362-412. Princeton, NJ: Princeton University Press.
- [29] Linero, A. R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522):626-636.
- [30] Little, R. et al. (2011). Calibrated Bayes, for statistics in general, and missing data in particular. Statistical Science, 26(2):162-174.
- [31] Morrissey, E. R., Juárez, M. A., Denby, K. J., and Burroughs, N. J. (2011). Inferring the time-invariant topology of a nonlinear sparse gene regulatory network using fully Bayesian spline autoregression. Biostatistics, 12(4):682-694.
- [32] Nair, H., Brooks, W. A., Katz, M., Roca, A., Berkley, J. A., Madhi, S. A., Simmerman, J. M., Gordon, A., Sato, M., Howie, S., et al. (2011). Global burden of respiratory infections due to seasonal influenza in young children: a systematic review and meta-analysis. The Lancet, 378(9807):1917-1930.
- [33] Ni, Y., Stingo, F. C., and Baladandayuthapani, V. (2015). Bayesian nonlinear model selection for gene regulatory networks. Biometrics.
- [34] Obando-Pacheco, P., Justicia-Grande, A. J., Rivero-Calle, I., Rodríguez-Tenreiro, C., Sly, P., Ramilo, O., Mejías, A., Baraldi, E., Papadopoulos, N. G., Nair, H., et al. (2018). Respiratory syncytial virus seasonality: a global overview. The Journal of Infectious Diseases, 217(9):1356-1364.
- [35] PERCH Study Group (2019). Aetiology of severe hospitalised pneumonia in HIV-uninfected children from Africa and Asia: the Pneumonia Aetiology Research for Child Health (PERCH) case-control study. Lancet.
- [36] Plummer, M. et al. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, volume 124.
- [37] Rodriguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis (Online), 6(1).
- [38] Saha, S. K., Schrag, S. J., El Arifeen, S., Mullany, L. C., Islam, M. S., Shang, N., Qazi, S. A., Zaidi, A. K., Bhutta, Z. A., Bose, A., et al. (2018). Causes and incidence of community-acquired serious infections among young children in South Asia (ANISA): an observational cohort study. The Lancet, 392(10142):145-159.
- [39] Scott, J. A. G., Brooks, W. A., Peiris, J. M., Holtzman, D., and Mulholland, E. K. (2008). Pneumonia research to reduce childhood mortality in the developing world. The Journal of Clinical Investigation, 118(4):1291.
- [40] Witte, J. S., Greenland, S., and Kim, L.-L. (1998). Software for hierarchical modeling of epidemiologic data. Epidemiology, 9(5):563-566.
- [41] Wu, Z., Casciola-Rosen, L., Rosen, A., and Zeger, S. L. (2019). A Bayesian approach to restricted latent class models for scientifically-structured clustering of multivariate binary outcomes. arXiv preprint arXiv:1808.08326.
- [42] Wu, Z., Deloria-Knoll, M., Hammitt, L. L., Zeger, S. L., and the Pneumonia Etiology Research for Child Health Core Team (2016). Partially latent class models for case-control studies of childhood pneumonia aetiology. Journal of the Royal Statistical Society: Series C (Applied Statistics), 65(1):97-114.
- [43] Wu, Z., Deloria-Knoll, M., and Zeger, S. L. (2017). Nested partially latent class models for dependent binary data; estimating disease etiology. Biostatistics (Oxford, England), 18:200-213.
- [44] Zhang, Q. and Zhou, M. (2017). Permuted and augmented stick-breaking Bayesian multinomial regression. The Journal of Machine Learning Research, 18(1):7479-7511.