
arxiv: 2604.19694 · v1 · submitted 2026-04-21 · 📊 stat.ME · stat.AP

A Goodness-of-Fit Test for Mixed-Effects Logistic Regression

Pith reviewed 2026-05-10 01:51 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords goodness-of-fit test · mixed-effects logistic regression · random slopes · Wald test · Type I error · simulation study · model misspecification · hierarchical data

The pith

A grouping-based Wald test maintains nominal Type I error and detects fixed-effects misspecification in mixed-effects logistic models with random slopes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a goodness-of-fit test for mixed-effects logistic regression models that include random slopes. Observations are grouped within clusters by their predicted probabilities, the model is augmented with group indicators, and a Wald test assesses the indicators' joint significance. To handle sparse data, a data-driven rule sets the number of groups to the smaller of 10 and the smallest cluster size. Simulations across 24 null scenarios indicate that the test keeps Type I error rates at nominal levels in three-level models, including at smaller sample sizes than previously studied. The test also gains power against omitted nonlinearity or interactions in the fixed effects. Researchers can therefore check model adequacy for common hierarchical binary outcomes before interpreting results.
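
The grouping step and the group-count rule described above can be sketched as follows. This is a minimal illustration, not the mlm_gof implementation: the function names, the near-equal-size split by rank, and the stable tie-breaking are all assumptions, since this page does not state how ties or uneven cluster sizes are handled.

```python
import numpy as np

def choose_num_groups(cluster_ids, cap=10):
    """Data-driven rule G = min(cap, n_min), with n_min the smallest cluster size."""
    _, counts = np.unique(cluster_ids, return_counts=True)
    return int(min(cap, counts.min()))

def group_within_clusters(p_hat, cluster_ids, num_groups):
    """Label each observation 0..num_groups-1 by the rank of its predicted
    probability inside its own cluster (assumed rule; ties keep input order)."""
    p_hat = np.asarray(p_hat, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    groups = np.empty(p_hat.shape[0], dtype=int)
    for c in np.unique(cluster_ids):
        idx = np.flatnonzero(cluster_ids == c)
        order = np.argsort(p_hat[idx], kind="stable")
        ranks = np.empty(idx.size, dtype=int)
        ranks[order] = np.arange(idx.size)
        # Map ranks 0..n-1 onto groups 0..G-1 in near-equal blocks.
        groups[idx] = (ranks * num_groups) // idx.size
    return groups

# Toy data: two clusters of sizes 4 and 6 (hypothetical values).
cluster_ids = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
p_hat = np.array([0.9, 0.1, 0.5, 0.3, 0.2, 0.8, 0.4, 0.6, 0.7, 0.1])
G = choose_num_groups(cluster_ids)
groups = group_within_clusters(p_hat, cluster_ids, G)
```

Because G never exceeds the smallest cluster size, every cluster contributes at least one observation to each group under this assignment, which is one plausible reading of why the rule keeps the augmented model estimable.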

Core claim

The test groups observations within clusters according to predicted probabilities, augments the mixed logistic model with indicators for these groups, and applies a Wald test to their joint significance. With the number of groups set by the data-driven rule G = min(10, n_min), where n_min is the smallest cluster size, the procedure maintains nominal Type I error rates in three-level models with random slopes even at modest sample sizes, and power increases with the degree of fixed-effect misspecification such as omitted nonlinearity or interactions.
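
The augmentation step can be pictured as appending indicator columns to the fixed-effects design matrix. The sketch below assumes group labels 0..G-1 are already available, that the labels are pooled across clusters, and that group 0 serves as the reference category. The function names and the choice of reference group are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def group_indicator_columns(groups, num_groups):
    """One 0/1 column per non-reference group (group 0 is the assumed reference)."""
    groups = np.asarray(groups)
    return (groups[:, None] == np.arange(1, num_groups)[None, :]).astype(float)

def augment_design(X, groups, num_groups):
    """Append the G-1 group indicators to the fixed-effects design matrix X."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, group_indicator_columns(groups, num_groups)])

# Toy design: intercept plus one covariate, four observations, G = 3.
X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.3], [1.0, 0.8]])
groups = np.array([0, 1, 2, 1])
X_aug = augment_design(X, groups, num_groups=3)
```

The augmented matrix would then be passed to whatever mixed-effects logistic fitting routine is in use; the fitting itself is outside the scope of this sketch.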

What carries the argument

The data-driven grouping of predicted probabilities within clusters followed by a Wald test on the augmented model indicators.
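
Written out, the test has the usual Wald form. The notation below is assumed for illustration, since this page gives no symbols beyond G and n_min. A single random-effects term stands in for the full three-level structure. The G - 1 degrees of freedom follow from dropping one reference group, which is a conventional choice rather than something the page confirms.

```latex
% Assumed notation: gamma_g are the coefficients on the group indicators I_{ijg}.
\[
\operatorname{logit}\,\Pr(y_{ij}=1 \mid \mathbf{u}_j)
  = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}
  + \mathbf{z}_{ij}^{\top}\mathbf{u}_j
  + \sum_{g=2}^{G} \gamma_g\, I_{ijg},
\qquad H_0:\ \gamma_2 = \dots = \gamma_G = 0,
\]
\[
W = \hat{\boldsymbol{\gamma}}^{\top}
    \bigl[\widehat{\operatorname{Var}}(\hat{\boldsymbol{\gamma}})\bigr]^{-1}
    \hat{\boldsymbol{\gamma}}
\ \overset{\text{approx.}}{\sim}\ \chi^2_{G-1}
\quad \text{under } H_0 .
\]
```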

If this is right

  • The test can be applied to check fixed-effects misspecification in models with random slopes.
  • The data-driven group selection rule enables reliable performance in sparse cluster designs where fixing G = 10 fails.
  • The test has no power against omitted clustering levels, consistent with its focus on residual structure in predicted probabilities.
  • It maintains Type I error control at smaller sample sizes than in prior work on simpler models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The grouping approach could be adapted to check fit in other generalized linear mixed models.
  • Rejection by the test would naturally lead to exploring nonlinear terms or interactions in the fixed effects.
  • Routine application in studies with hierarchical binary data could support more reliable model choices.

Load-bearing premise

The data-driven selection of the number of groups as min(10, smallest cluster size) allows feasible estimation of the augmented model while preserving the asymptotic properties of the Wald test and controlling Type I error rates across the simulated scenarios.

What would settle it

A simulation of a three-level random slope mixed logistic model with small clusters where the test's Type I error rate deviates substantially from the nominal level.

Figures

Figures reproduced from arXiv: 2604.19694 by Ariel Linden.

Figure 1: Empirical Type I error rate by number of families (J), subjects per family (K), and observations per subject (n). ICC = 0.10. Dashed line = nominal 0.05. All values within Monte Carlo bounds [0.036, 0.064].
Figure 2: Empirical Type I error rate by number of families (J), subjects per family (K), and observations per subject (n). ICC = 0.30. Dashed line = nominal 0.05. All values within Monte Carlo bounds [0.036, 0.064].
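
Monte Carlo bounds of the kind quoted in the captions are commonly computed from the binomial standard error of a rejection rate. The sketch below shows that calculation. The number of replications and the multiplier are not stated on this page, so the values 1000 and 1.96 are hypothetical inputs chosen only to illustrate the formula.

```python
import math

def monte_carlo_bounds(alpha, num_replications, z=1.96):
    """Interval alpha +/- z * sqrt(alpha * (1 - alpha) / R) for an empirical rejection rate."""
    se = math.sqrt(alpha * (1.0 - alpha) / num_replications)
    return alpha - z * se, alpha + z * se

def within_bounds(rate, alpha, num_replications, z=1.96):
    """True if an observed rejection rate is compatible with the nominal level."""
    lower, upper = monte_carlo_bounds(alpha, num_replications, z)
    return lower <= rate <= upper

# Hypothetical setting: nominal 0.05 with 1000 replications per scenario.
lower, upper = monte_carlo_bounds(0.05, 1000)
```

Under these assumed inputs the interval comes out close to the one in the captions, but that agreement should not be read as confirming the paper's actual replication count.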
Original abstract

Mixed-effects logistic regression is widely used for binary outcomes in hierarchical data, yet formal goodness-of-fit tests remain limited to random-intercept models and do not address sparse cluster settings. We extend a grouping-based Wald test to mixed-effects logistic models with random slopes. The procedure groups observations by predicted probabilities within clusters, augments the model with pooled group indicators, and tests their joint significance using a Wald statistic. To accommodate small clusters, we introduce a data-driven rule for selecting the number of groups, G=min(10,n_min), where n_min is the smallest cluster size, ensuring feasible estimation. Simulation studies across 24 null scenarios show that the test maintains nominal Type I error in three-level random slope models, including at smaller sample sizes than previously studied. The test exhibits increasing power to detect fixed-effects misspecification: power against omitted nonlinearity rises from 0.07 to 1.00 across effect sizes, and power against omitted interactions reaches 0.87. As expected, the test has no power to detect omission of a clustering level, reflecting its focus on residual structure in predicted probabilities. In sparse balanced designs, fixing G=10 leads to complete test failure, whereas the data-driven rule performs reliably. The method is implemented in the Stata program mlm_gof.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript extends a grouping-based Wald goodness-of-fit test to mixed-effects logistic regression models that include random slopes. Observations within clusters are grouped by their predicted probabilities, the model is augmented with pooled group indicators, and a Wald test is performed on the joint significance of these indicators. A data-driven rule G = min(10, n_min) is introduced to select the number of groups for small or sparse clusters. Simulations across 24 null scenarios indicate that the test maintains nominal Type I error rates in three-level random-slope models (including at smaller sample sizes than previously studied), exhibits power against omitted nonlinearity (rising from 0.07 to 1.00) and omitted interactions (reaching 0.87), and has no power against omission of a clustering level. The procedure is implemented in the Stata program mlm_gof.

Significance. If the validity of the test holds, the work supplies a practical diagnostic for residual structure in predicted probabilities for hierarchical binary data models that accommodate random slopes and sparse clusters, an area where formal tests have been limited. The simulation evidence across multiple scenarios and the provision of a Stata implementation constitute concrete strengths that would make the method usable by applied researchers working with three-level logistic models.

major comments (2)
  1. [§2] §2 (test procedure and group selection): The Wald statistic is asserted to follow a chi-squared distribution under the null, yet G is chosen data-dependently as min(10, n_min) and the grouping itself is formed from fitted probabilities. Standard Wald asymptotics require the tested dimension and design matrix to be non-random or independent of the response; no theoretical correction, bootstrap justification, or adjusted reference distribution is supplied. This directly affects the validity of the reported p-values and is load-bearing for the central claim that the procedure is a reliable goodness-of-fit test.
  2. [Simulation study] Simulation study (24 null scenarios): Although Type I error control is reported for three-level random-slope models, the scenarios do not appear to include designs in which n_min varies substantially across replications or in which the fitted-probability grouping is highly sensitive to small perturbations; such cases could amplify any distortion from the adaptive G. Additional targeted simulations or analytic arguments are needed to confirm that the nominal levels remain reliable when the data-driven rule is active.
minor comments (2)
  1. [§2] Clarify whether the pooling of group indicators is performed within each cluster separately or across all clusters, and state the exact rule used to assign observations to groups when predicted probabilities are tied.
  2. [Abstract and Discussion] The abstract states that the test 'has no power to detect omission of a clustering level'; a brief discussion of the practical implications of this property for model diagnostics would be helpful.
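
Major comment 1 names a bootstrap justification as one possible remedy. As a rough sketch of what that could look like, the function below turns a set of test statistics simulated under the fitted null model into a p-value, without relying on the chi-squared reference distribution. The paper does not do this. Generating the null statistics (simulate responses from the fitted model, refit, regroup, recompute G and the Wald statistic each time) is assumed to happen elsewhere, and all names and numbers here are hypothetical.

```python
def bootstrap_p_value(observed_stat, null_stats):
    """Share of null-simulated statistics at least as large as the observed one,
    with the usual add-one correction so the p-value is never exactly zero."""
    exceed = sum(1 for s in null_stats if s >= observed_stat)
    return (exceed + 1) / (len(null_stats) + 1)

# Hypothetical values: one observed Wald statistic and nine null replicates.
null_stats = [1.2, 3.4, 8.1, 0.5, 2.2, 9.0, 4.4, 1.1, 6.3]
p_value = bootstrap_p_value(7.5, null_stats)
```

In practice many more replicates than nine would be needed; the short list only keeps the arithmetic easy to follow.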

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their insightful comments on our manuscript extending the goodness-of-fit test to mixed-effects logistic regression with random slopes. We address each major comment below and outline the revisions we will make to improve the paper.

point-by-point responses
  1. Referee: §2 (test procedure and group selection): The Wald statistic is asserted to follow a chi-squared distribution under the null, yet G is chosen data-dependently as min(10, n_min) and the grouping itself is formed from fitted probabilities. Standard Wald asymptotics require the tested dimension and design matrix to be non-random or independent of the response; no theoretical correction, bootstrap justification, or adjusted reference distribution is supplied. This directly affects the validity of the reported p-values and is load-bearing for the central claim that the procedure is a reliable goodness-of-fit test.

    Authors: We agree that the data-dependent nature of G = min(10, n_min) and the grouping based on fitted probabilities means that standard Wald asymptotics do not strictly apply, as the tested parameters and design are not independent of the data. Our manuscript relies on simulation evidence rather than a formal proof to establish the validity of the test. In the revised version, we will expand the discussion to explicitly acknowledge this limitation, clarify that the chi-squared reference distribution is an approximation justified by the simulation results, and caution users accordingly. We believe this empirical validation across diverse scenarios, including sparse clusters, supports the practical utility of the method while being transparent about the theoretical gap. revision: partial

  2. Referee: Simulation study (24 null scenarios): Although Type I error control is reported for three-level random-slope models, the scenarios do not appear to include designs in which n_min varies substantially across replications or in which the fitted-probability grouping is highly sensitive to small perturbations; such cases could amplify any distortion from the adaptive G. Additional targeted simulations or analytic arguments are needed to confirm that the nominal levels remain reliable when the data-driven rule is active.

    Authors: Our simulation study encompasses 24 null scenarios with varying cluster sizes and sample sizes, including cases where the data-driven rule sets G to n_min for small clusters. To address the referee's concern about variability in n_min and sensitivity of grouping, we will perform additional simulations in the revision. These will include designs where n_min differs across Monte Carlo replications and introduce minor data perturbations to evaluate the stability of the grouping procedure and Type I error rates. The results of these targeted simulations will be reported to provide further reassurance on the reliability of the test when the adaptive rule is in use. revision: yes
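
The second response promises designs in which n_min differs across Monte Carlo replications. A minimal sketch of how such a design makes G itself random is given below. Drawing cluster sizes uniformly between two limits is a hypothetical choice made for illustration, as are the function names and parameter values; the authors' planned design is not described on this page.

```python
import random

def adaptive_group_count(cluster_sizes, cap=10):
    """G = min(cap, n_min) for one simulated data set."""
    return min(cap, min(cluster_sizes))

def simulate_group_counts(num_replications, num_clusters, min_size, max_size, seed=0):
    """Draw unbalanced cluster sizes per replication and record the resulting G."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_replications):
        sizes = [rng.randint(min_size, max_size) for _ in range(num_clusters)]
        counts.append(adaptive_group_count(sizes))
    return counts

counts = simulate_group_counts(num_replications=200, num_clusters=15, min_size=3, max_size=25)
# Frequency of each G value across replications.
distribution = {g: counts.count(g) for g in sorted(set(counts))}
```

Tabulating Type I error separately by the realized value of G would be one way to see whether the adaptive rule distorts the reference distribution.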

standing simulated objections not resolved
  • A rigorous theoretical derivation of the asymptotic distribution of the Wald statistic accounting for the data-dependent grouping and group count selection.

Circularity Check

0 steps flagged

No significant circularity; extension of standard Wald test with simulation validation

full rationale

The paper extends a grouping-based Wald test to mixed-effects logistic models by augmenting with group indicators based on predicted probabilities and testing joint significance. The data-driven G = min(10, n_min) rule is introduced for feasibility in small clusters and is justified empirically via simulations showing Type I error control across 24 null scenarios. No derivation step reduces by construction to its inputs, no self-citations are load-bearing for the central claim, and no fitted quantities are renamed as predictions. The procedure remains self-contained against external benchmarks of Wald theory and Monte Carlo validation.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard statistical theory for logistic mixed models and Wald tests, plus the heuristic for selecting G. No new entities are postulated.

free parameters (1)
  • Maximum groups cap (10)
    The value 10 is chosen as a fixed upper limit in the rule G=min(10, n_min) to ensure computational feasibility and stable estimation.
axioms (2)
  • standard math: The Wald test for the joint significance of group indicators follows a chi-squared distribution under the null hypothesis of correct model specification.
    Relies on standard asymptotic theory for maximum likelihood estimators in generalized linear mixed models.
  • domain assumption: Observations grouped by similar predicted probabilities will reveal misspecification if the fixed-effects structure is inadequate.
    This is the foundational idea of the grouping-based goodness-of-fit approach extended here.

pith-pipeline@v0.9.0 · 5518 in / 1453 out tokens · 75197 ms · 2026-05-10T01:51:19.532868+00:00 · methodology


Table 1: Type I Error Simulation Design

  Factor                          Values
  Number of clusters (J)          15, 30, 50
  Subjects per cluster (K)        5, 10
  Observations per subject (n)    20
  Intraclass correlation (ICC)    ...