arxiv: 1906.08436 · v1 · pith:W2IFH3F6new · submitted 2019-06-20 · 📊 stat.ME · stat.AP

Regression Analysis of Dependent Binary Data for Estimating Disease Etiology from Case-Control Studies

Pith reviewed 2026-05-25 19:48 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords disease etiology · case-control studies · latent class model · population etiologic fraction · regression analysis · measurement specificity · childhood pneumonia

The pith

Control data on diagnostic measures enables regression analysis of how covariates affect disease etiology fractions in case-control studies

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends nested partially-latent class models to a regression framework that incorporates explanatory variables when estimating population etiologic fractions from case-control data. A separate regression model is fitted to the controls to recover the distribution of their diagnostic measures given covariates, which supplies the needed information on measurement specificities and conditional dependencies. This information is transferred to assign cause-specific probabilities to each case, after which Markov chain Monte Carlo yields posterior inference on the covariate-dependent etiologic fraction functions and the overall fractions. Simulations demonstrate reduced bias and more valid inference for the overall fractions relative to the non-regression version of the model. The approach is illustrated on childhood pneumonia data, where etiology is shown to vary with season, age, severity, and HIV status.

Core claim

By estimating the distribution of diagnostic measures given covariates from controls alone and using that estimate to inform the measurement model for cases, the extended framework correctly assigns latent cause probabilities to individual cases and thereby produces regression functions for the population etiologic fractions while properly accounting for imperfect sensitivity, specificity, and dependence among multiple binary measures.
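
The case-assignment step in the core claim can be illustrated with a minimal sketch. It makes simplifying assumptions that the paper does not: one binary measure per pathogen, exactly one cause per case, and measures that are independent given the cause (the paper's nested model relaxes this last point). The function name `cause_posterior` and every number below are hypothetical, chosen only to show how control-derived false positive rates enter Bayes' rule.

```python
import numpy as np

def cause_posterior(y, prior, tpr, fpr):
    """Posterior probability of each single-pathogen cause for one case.

    y     : (J,) binary measurements, one per pathogen
    prior : (J,) etiologic fractions, summing to 1
    tpr   : (J,) sensitivity of measure j when pathogen j is the cause
    fpr   : (J,) 1 - specificity of measure j, as estimated from controls
    Assumes measures are independent given the cause (illustration only).
    """
    y = np.asarray(y)
    prior = np.asarray(prior, dtype=float)
    J = len(y)
    # Row l holds the positive rate of every measure if cause l were true:
    # the control-based false positive rate, except sensitivity on the diagonal.
    rates = np.tile(np.asarray(fpr, dtype=float), (J, 1))
    rates[np.arange(J), np.arange(J)] = tpr
    # Likelihood of the observed pattern y under each candidate cause.
    lik = np.prod(np.where(y == 1, rates, 1.0 - rates), axis=1)
    post = prior * lik
    return post / post.sum()

# Hypothetical case: pathogens 1 and 3 detected, pathogen 2 not detected.
# Pathogen 1 is commonly carried by healthy controls (false positive rate 0.5).
post = cause_posterior(
    y=[1, 0, 1],
    prior=[1 / 3, 1 / 3, 1 / 3],
    tpr=[0.8, 0.8, 0.8],
    fpr=[0.5, 0.1, 0.1],
)
print(post.round(3))
```

In this toy case both detected pathogens have the same sensitivity and prior, yet pathogen 3 receives most of the posterior mass because a positive result for pathogen 1 is common among controls. That is the sense in which control data inform the cause-specific probabilities assigned to a case.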

What carries the argument

The extended nested partially-latent class model that uses a separate regression fit on controls to supply the measurement specificities and dependence structure transferred to the cases

If this is right

  • Estimation of overall population etiologic fractions exhibits less bias than the version of the model that omits covariates
  • Inference on the overall fractions is more valid once covariate information is included
  • The method can reveal how etiology depends on measured factors such as season, age, disease severity, and HIV status
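
The first two bullets refer to bias and validity of inference, which the Figure 7 caption reports as percent relative bias and empirical coverage over replications. A minimal sketch of those two summaries follows, using their usual textbook definitions; the function names and the simulated inputs are hypothetical and do not reproduce the paper's simulation design.

```python
import numpy as np

def percent_relative_bias(estimates, truth):
    """100 * (mean estimate - truth) / truth across replications."""
    return 100.0 * (np.mean(estimates) - truth) / truth

def empirical_coverage(lower, upper, truth):
    """Fraction of replications whose interval [lower, upper] contains truth."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    return float(np.mean((lower <= truth) & (truth <= upper)))

# Hypothetical replications: R posterior means and 95% intervals for one PEF.
rng = np.random.default_rng(0)
truth, R, sd = 0.25, 100, 0.02
estimates = truth + rng.normal(0.0, sd, size=R)
lower, upper = estimates - 1.96 * sd, estimates + 1.96 * sd

bias = percent_relative_bias(estimates, truth)
coverage = empirical_coverage(lower, upper, truth)
print(f"percent relative bias: {bias:.2f}, empirical coverage: {coverage:.2f}")
```

Coverage close to the nominal level together with small relative bias is what "more valid inference" would look like under these definitions.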

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of control and case models could be applied to other infectious-disease studies that collect multiple imperfect diagnostic tests on both cases and controls
  • If the control regression is misspecified, the resulting case assignments would be biased, so sensitivity checks that vary the control model form would be a natural next step
  • The framework could be extended to time-to-event or longitudinal covariates if the control regression is generalized accordingly

Load-bearing premise

The separate regression model fitted to the controls' diagnostic measures given covariates is correctly specified and supplies accurate information on specificities and conditional dependence structures that can be transferred to the cases.

What would settle it

A validation study in which a subset of cases has known true causes from a gold-standard test; if the regression model's estimated cause probabilities for those cases deviate systematically from the known causes while the control model fits well, the transfer of information would be shown to fail.

Figures

Figures reproduced from arXiv: 1906.08436 by Irena Chen, Zhenke Wu.

Figure 1: Row 2) For each of the 9 causes (by column) in Simulation …
Figure 2: The regression analyses produce less biased posterior mean estimates and more …
Figure 4: Prior densities for logit(α^ν_ik), the fraction to be broken for subclass k from the stick currently left, when α_ik equals: 1) the intercept μ*_k0 (black, solid line) or 2) the first B-spline coefficient β^(1),ν_kj (red, broken line). The former concentrates near 1 because μ*_k0 has a scaled-t distributed prior that puts substantial mass at the right tail; much less so for the latter.
Figure 5: By propagating the prior that encourages few subclasses, the algorithm correctly infers two subclasses from the simulated data in Simulation I, Section 4 of Main Paper. Estimated case (top) and control (bottom) subclass weight curves for seven subclasses over one continuous covariate, ν̂_k(t) (central blue dashed lines enclosed by the 95% credible regions; the red curves are posterior samples) compared again…
Figure 6: Posterior distributions of the stratum-specific (Rows 1 and 2) and the overall (Bottom Row) PEFs based on a simulation with a two-level discrete covariate and L = J = 6 causes. The vertical gray lines indicate the 2.5% and 97.5% posterior quantiles, respectively; the truths are indicated by vertical blue dashed lines. Rows 1-2) PEFs by stratum (level = 1, 2) and cause (A-F); Bottom) π*_ℓ: overall population…
Figure 7: NPLCM analyses with or without regression perform similarly in terms of percent relative bias (top) and empirical coverage rates (bottom) over R = 100 replications in simulations where the case and control subclass weights do not vary by covariates. Each panel corresponds to one of 16 combinations of true parameter values and sample sizes. See …
Figure 3: Estimated seasonal PEF π̂_ℓ(date, age, severity, HIV) for two most prevalent age-severity-HIV strata: younger (a) or older (b) than one, with severe pneumonia, HIV negative. Here the results are obtained from a model assuming seven single-pathogen causes (HINF, PNEU, ADENO, HMPV.A.B, PARA.1, RHINO, RSV) and a “Not Specified” cause. In an age-severity-HIV stratum and for each cause ℓ: Row 2) shows the tempora…
Figure 8: Panel plot with BrS, SS and Etiology Pies obtained from an npLCM analysis omitting covariates (K = 5). For each of the 7 pathogens, a summary of the BrS and SS data analyzed in Section 5 of Main Paper is shown in the left two columns, along with some of the intermediate model results; and the prior and posterior distributions for the PEFs on the right (rows ordered by posterior means). Left) The observed B…
Figure 9: Individual etiology fraction estimates for RSV (left) and NoS (right) differ by age and season among HIV negative and severe pneumonia cases for whom the seven pathogens were all tested negative in the nasopharyngeal specimens.
Original abstract

In large-scale disease etiology studies, epidemiologists often need to use multiple binary measures of unobserved causes of disease that are not perfectly sensitive or specific to estimate cause-specific case fractions, referred to as "population etiologic fractions" (PEFs). Despite recent methodological advances, the scientific need of incorporating control data to estimate the effect of explanatory variables upon the PEFs, however, remains unmet. In this paper, we build on and extend nested partially-latent class model (npLCMs, Wu et al., 2017) to a general framework for etiology regression analysis in case-control studies. Data from controls provide requisite information about measurement specificities and covariations, which is used to correctly assign cause-specific probabilities for each case given her measurements. We estimate the distribution of the controls' diagnostic measures given the covariates via a separate regression model and a priori encourage simpler conditional dependence structures. We use Markov chain Monte Carlo for posterior inference of the PEF functions, cases' latent classes and the overall PEFs of policy interest. We illustrate the regression analysis with simulations and show less biased estimation and more valid inference of the overall PEFs than an npLCM analysis omitting covariates. A regression analysis of data from a childhood pneumonia study site reveals the dependence of pneumonia etiology upon season, age, disease severity and HIV status.
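
The abstract's phrase "a priori encourage simpler conditional dependence structures" connects to the stick-breaking language in the Figure 4 caption, where each subclass takes a fraction of "the stick currently left". A minimal sketch of the generic stick-breaking construction follows. It is not the paper's covariate-dependent prior; the function name `stick_breaking` and the fractions used are hypothetical.

```python
import numpy as np

def stick_breaking(fractions):
    """Turn break fractions v_1..v_K into weights w_k = v_k * prod_{j<k}(1 - v_j).

    The weights sum to 1 when the last fraction equals 1, so that the final
    subclass takes whatever stick is left.
    """
    v = np.asarray(fractions, dtype=float)
    # Length of stick remaining before each break: 1, (1 - v_1), (1 - v_1)(1 - v_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

# Fractions near 1 leave little stick for later subclasses.
w = stick_breaking([0.9, 0.8, 1.0])
print(w)
```

When the prior pushes the early fractions toward 1, as the Figure 4 caption describes for the intercept, most of the weight falls on the first few subclasses, which is one way a prior can favor fewer effective subclasses.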

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript extends nested partially-latent class models (npLCMs) to a regression framework for estimating covariate-dependent population etiologic fractions (PEFs) from case-control studies with multiple imperfect binary diagnostic measures. A separate regression is fit to control data to recover specificities and conditional dependence structure; this is transferred to the case model to assign latent class probabilities, with MCMC used for posterior inference on PEF functions and overall PEFs. Simulations are reported to yield less biased PEF estimates than covariate-omitting npLCM, and the approach is illustrated on childhood pneumonia data showing dependence on season, age, severity, and HIV status.

Significance. If the transferability assumption holds, the work supplies a needed tool for covariate-adjusted etiology estimation in large-scale studies, directly addressing the unmet need stated in the abstract. It builds on Wu et al. (2017) with a practical MCMC implementation and an explicit preference for simpler dependence structures. The framework could improve policy-relevant PEF inference when covariates are available.

major comments (2)
  1. [Abstract] Abstract: the claim that 'simulations show less biased estimation and more valid inference of the overall PEFs' is presented without any quantitative metrics (bias, coverage, or simulation design details). This leaves the central performance claim unsupported in the summary and requires the results section to supply the missing numbers and settings.
  2. [Model description] Model construction (control-to-case transfer step): the measurement model (specificities and conditional dependence) estimated from controls is transferred unchanged to cases under the assumption that disease status does not alter these properties. No analytic derivation, sensitivity analysis, or diagnostic check is described to test this invariance; because the claim of 'correctly assign cause-specific probabilities' rests on this transfer, the assumption is load-bearing and needs explicit robustness assessment.
minor comments (1)
  1. [Abstract] Abstract: the expansion 'nested partially-latent class model' is singular while the abbreviation 'npLCMs' that follows it is plural; make the two agree.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'simulations show less biased estimation and more valid inference of the overall PEFs' is presented without any quantitative metrics (bias, coverage, or simulation design details). This leaves the central performance claim unsupported in the summary and requires the results section to supply the missing numbers and settings.

    Authors: We agree that the abstract would be strengthened by including quantitative metrics. In the revised manuscript we will add concise statements of key simulation results (e.g., average bias reduction and empirical coverage rates across scenarios) together with a brief description of the simulation design (sample sizes, number of replicates, covariate configurations). These numbers already appear in the results section; we will simply summarize them in the abstract as requested. revision: yes

  2. Referee: [Model description] Model construction (control-to-case transfer step): the measurement model (specificities and conditional dependence) estimated from controls is transferred unchanged to cases under the assumption that disease status does not alter these properties. No analytic derivation, sensitivity analysis, or diagnostic check is described to test this invariance; because the claim of 'correctly assign cause-specific probabilities' rests on this transfer, the assumption is load-bearing and needs explicit robustness assessment.

    Authors: The referee correctly notes that the invariance assumption is central. While the assumption is standard in the case-control etiology literature (measurement properties are viewed as test characteristics independent of disease status), we acknowledge that the manuscript would benefit from explicit robustness checks. In the revision we will add a dedicated sensitivity-analysis subsection that perturbs the transferred specificities and dependence parameters within plausible ranges and reports the resulting changes in PEF estimates. We will also expand the model-description text to state the assumption more explicitly and cite supporting literature. revision: yes
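
The kind of perturbation analysis proposed in the second response can be sketched in a deliberately reduced setting: one pathogen, one binary test, and a case that tests positive. The sketch varies the transferred false positive rate (1 - specificity) and reports how the probability that the pathogen is the cause responds. The function name `prob_pathogen_is_cause` and all values are hypothetical; the paper's model involves many measures and their dependence.

```python
def prob_pathogen_is_cause(pef, tpr, fpr):
    """P(pathogen is the cause | positive test) in a two-class setting.

    pef : prior fraction of cases caused by the pathogen
    tpr : sensitivity of the test when the pathogen is the cause
    fpr : positive rate when something else is the cause (taken from controls)
    """
    return pef * tpr / (pef * tpr + (1.0 - pef) * fpr)

# Perturb the transferred false positive rate over a hypothetical range.
pef, tpr = 0.3, 0.8
sweep = {fpr: prob_pathogen_is_cause(pef, tpr, fpr) for fpr in (0.05, 0.10, 0.20)}
for fpr, p in sweep.items():
    print(f"false positive rate {fpr:.2f} -> posterior {p:.3f}")
```

Even in this reduced setting the assigned probability moves noticeably as the specificity changes, which is why the invariance assumption deserves an explicit robustness check.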

Circularity Check

0 steps flagged

No circularity; control regression supplies independent measurement parameters transferred to case model.

full rationale

The paper fits a regression model exclusively to control diagnostic data to estimate specificities and conditional dependence structures, then transfers those estimates to the case model for latent class and PEF inference. This separation means the PEF functions are not defined in terms of themselves or recovered by construction from the same fitted quantities. The citation to Wu et al. (2017) supplies the base npLCM structure but does not create a self-citation load-bearing loop for the regression extension. No equations reduce predictions to inputs, no ansatz is smuggled, and no uniqueness theorem is invoked from overlapping authors. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; ledger entries are therefore limited to standard Bayesian assumptions implied by the described MCMC procedure.

axioms (1)
  • [standard math] MCMC sampling yields valid posterior inference for the PEF functions and latent classes
    Standard assumption invoked when the abstract states that MCMC is used for posterior inference.

pith-pipeline@v0.9.0 · 5766 in / 1249 out tokens · 30002 ms · 2026-05-25T19:48:38.668226+00:00 · methodology

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

  1. Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L., and Rathouz, P. J. (1997). Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92(440):1375–1386.
  2. Bartolucci, F. and Forcina, A. (2006). A class of latent marginal models for capture–recapture data with continuous covariates. Journal of the American Statistical Association, 101(474):786–794.
  3. Bedrick, E. J., Christensen, R., and Johnson, W. (1996). A new perspective on priors for generalized linear models. Journal of the American Statistical Association, 91(436):1450–1460.
  4. Brooks, S. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455.
  5. Carlin, B. and Louis, T. (2009). Bayesian methods for data analysis, volume 78. Chapman & Hall/CRC.
  6. Chung, H., Flaherty, B. P., and Schafer, J. L. (2006). Latent class logistic regression: application to marijuana use and attitudes among high school seniors. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4):723–743.
  7. Crawley, J., Prosperi, C., Baggett, H. C., Brooks, W. A., Deloria Knoll, M., Hammitt, L. L., Howie, S. R., Kotloff, K. L., Levine, O. S., Madhi, S. A., et al. (2017). Standardization of clinical assessment and sample collection across all PERCH study sites. Clinical Infectious Diseases, 64(suppl_3):S228–S237.
  8. Deloria Knoll, M., Fu, W., Shi, Q., Prosperi, C., Wu, Z., Hammitt, L. L., Feikin, D. R., Baggett, H. C., Howie, S. R., Scott, J. A. G., et al. (2017). Bayesian estimation of pneumonia etiology: epidemiologic considerations and applications to the Pneumonia Etiology Research for Child Health study. Clinical Infectious Diseases, 64(suppl_3):S213–S227.
  9. Dunson, D. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104(487):1042–1051.
  10. Erosheva, E. A., Fienberg, S. E., and Joutard, C. (2007). Describing disability through individual-level mixture models for multivariate binary data. The Annals of Applied Statistics, 1(2):346.
  11. Feikin, D., Scott, J., and Gessner, B. (2014). Use of vaccines as probes to define disease burden. The Lancet, 383(9930):1762–1770.
  12. Gelfand, A. and Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, pages 398–409.
  13. Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, pages 1360–1383.
  14. Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4):733–760.
  15. Geweke, J. and Zhou, G. (1996). Measuring the pricing error of the arbitrage pricing theory. The Review of Financial Studies, 9(2):557–587.
  16. Goodman, L. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61(2):215–231.
  17. Gryparis, A., Coull, B. A., Schwartz, J., and Suh, H. H. (2007). Semiparametric latent variable regression models for spatiotemporal modelling of mobile source particles in the greater Boston area. Journal of the Royal Statistical Society: Series C (Applied Statistics), 56(2):183–209.
  18. Gu, Y. and Xu, G. (2019a). Learning attribute patterns in high-dimensional structured latent attribute models. Journal of Machine Learning Research, in press.
  19. Gu, Y. and Xu, G. (2019b). Partial identifiability of restricted latent class models. Annals of Statistics, in press.
  20. Gustafson, P. (2015). Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data, volume 140. CRC Press.
  21. Gustafson, P., Lefebvre, G., et al. (2008). Bayesian multinomial regression with class-specific predictor selection. The Annals of Applied Statistics, 2(4):1478–1502.
  22. Hammitt, L. L., Feikin, D. R., Scott, J. A. G., Zeger, S. L., Murdoch, D. R., O'Brien, K. L., and Deloria Knoll, M. (2017). Addressing the analytic challenges of cross-sectional pediatric pneumonia etiology data. Clinical Infectious Diseases, 64(suppl_3):S197–S204.
  23. Hastie, T. and Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3):297–318.
  24. Huang, G.-H. and Bandeen-Roche, K. (2004). Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika, 69(1):5–32.
  25. Jones, G., Johnson, W., Hanson, T., and Christensen, R. (2010). Identifiability of models for multiple diagnostic testing in the absence of a gold standard. Biometrics, 66(3):855–863.
  26. Kotloff, K. L., Nataro, J. P., Blackwelder, W. C., Nasrin, D., Farag, T. H., Panchalingam, S., Wu, Y., Sow, S. O., Sur, D., Breiman, R. F., et al. (2013). Burden and aetiology of diarrhoeal disease in infants and young children in developing countries (the Global Enteric Multicenter Study, GEMS): a prospective, case-control study. The Lancet, 382(9888):209–222.
  27. Lang, S. and Brezger, A. (2004). Bayesian P-splines. Journal of Computational and Graphical Statistics, 13(1):183–212.
  28. Lazarsfeld, P. F. (1950). The logical and mathematical foundations of latent structure analysis, volume IV, chapter The American Soldier: Studies in Social Psychology in World War II, pages 362–412. Princeton, NJ: Princeton University Press.
  29. Linero, A. R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522):626–636.
  30. Little, R. et al. (2011). Calibrated Bayes, for statistics in general, and missing data in particular. Statistical Science, 26(2):162–174.
  31. Morrissey, E. R., Juárez, M. A., Denby, K. J., and Burroughs, N. J. (2011). Inferring the time-invariant topology of a nonlinear sparse gene regulatory network using fully Bayesian spline autoregression. Biostatistics, 12(4):682–694.
  32. Nair, H., Brooks, W. A., Katz, M., Roca, A., Berkley, J. A., Madhi, S. A., Simmerman, J. M., Gordon, A., Sato, M., Howie, S., et al. (2011). Global burden of respiratory infections due to seasonal influenza in young children: a systematic review and meta-analysis. The Lancet, 378(9807):1917–1930.
  33. Ni, Y., Stingo, F. C., and Baladandayuthapani, V. (2015). Bayesian nonlinear model selection for gene regulatory networks. Biometrics.
  34. Obando-Pacheco, P., Justicia-Grande, A. J., Rivero-Calle, I., Rodríguez-Tenreiro, C., Sly, P., Ramilo, O., Mejías, A., Baraldi, E., Papadopoulos, N. G., Nair, H., et al. (2018). Respiratory syncytial virus seasonality: a global overview. The Journal of Infectious Diseases, 217(9):1356–1364.
  35. PERCH Study Group (2019). Aetiology of severe hospitalised pneumonia in HIV-uninfected children from Africa and Asia: the Pneumonia Aetiology Research for Child Health (PERCH) case-control study. Lancet.
  36. Plummer, M. et al. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, volume 124.
  37. Rodriguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis (Online), 6(1).
  38. Saha, S. K., Schrag, S. J., El Arifeen, S., Mullany, L. C., Islam, M. S., Shang, N., Qazi, S. A., Zaidi, A. K., Bhutta, Z. A., Bose, A., et al. (2018). Causes and incidence of community-acquired serious infections among young children in South Asia (ANISA): an observational cohort study. The Lancet, 392(10142):145–159.
  39. Scott, J. A. G., Brooks, W. A., Peiris, J. M., Holtzman, D., and Mulhollan, E. K. (2008). Pneumonia research to reduce childhood mortality in the developing world. The Journal of Clinical Investigation, 118(4):1291.
  40. Witte, J. S., Greenland, S., and Kim, L.-L. (1998). Software for hierarchical modeling of epidemiologic data. Epidemiology, 9(5):563–566.
  41. Wu, Z., Casciola-Rosen, L., Rosen, A., and Zeger, S. L. (2019). A Bayesian approach to restricted latent class models for scientifically-structured clustering of multivariate binary outcomes. arXiv preprint arXiv:1808.08326.
  42. Wu, Z., Deloria-Knoll, M., Hammitt, L. L., Zeger, S. L., and the Pneumonia Etiology Research for Child Health Core Team (2016). Partially latent class models for case–control studies of childhood pneumonia aetiology. Journal of the Royal Statistical Society: Series C (Applied Statistics), 65(1):97–114.
  43. Wu, Z., Deloria-Knoll, M., and Zeger, S. L. (2017). Nested partially latent class models for dependent binary data; estimating disease etiology. Biostatistics (Oxford, England), 18:200–213.
  44. Zhang, Q. and Zhou, M. (2017). Permuted and augmented stick-breaking Bayesian multinomial regression. The Journal of Machine Learning Research, 18(1):7479–7511.