A Goodness-of-Fit Test for Mixed-Effects Logistic Regression

Ariel Linden

arxiv: 2604.19694 · v1 · submitted 2026-04-21 · 📊 stat.ME · stat.AP

Pith reviewed 2026-05-10 01:51 UTC · model grok-4.3

keywords: goodness-of-fit test, mixed-effects logistic regression, random slopes, Wald test, Type I error, simulation study, model misspecification, hierarchical data

The pith

A grouping-based Wald test maintains nominal Type I error and detects fixed-effects misspecification in mixed-effects logistic models with random slopes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a goodness-of-fit test for mixed-effects logistic regression models that include random slopes. Observations are grouped within clusters by their predicted probabilities, the model is augmented with group indicators, and a Wald test assesses their joint significance. A data-driven rule sets the number of groups to the smaller of 10 and the smallest cluster size, so the augmented model remains estimable with sparse data. Simulations across 24 null scenarios show that the test keeps Type I error rates at nominal levels in three-level models, including at smaller sample sizes than previously studied, while gaining power against omitted nonlinearity or interactions in the fixed effects. Researchers can therefore check model adequacy for common hierarchical binary outcomes before interpreting results.

Core claim

The test groups observations within clusters according to predicted probabilities, augments the mixed logistic model with indicators for these groups, and applies a Wald test to their joint significance. With the number of groups set by the data-driven rule G = min(10, smallest cluster size), the procedure maintains nominal Type I error rates in three-level models with random slopes even at modest sample sizes, and power increases with the degree of fixed-effect misspecification such as omitted nonlinearity or interactions.
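
A minimal Python sketch of the grouping step and the G = min(10, n_min) rule, for illustration only. The function names, the rank-based binning, the tie handling (ties keep input order), and the reading of "pooled" as "the same group label g is shared across clusters" are assumptions, not details confirmed by the paper.

```python
from collections import defaultdict


def choose_num_groups(cluster_sizes, cap=10):
    """Data-driven rule from the text: G = min(cap, smallest cluster size)."""
    return min(cap, min(cluster_sizes))


def assign_groups(cluster_ids, predicted_probs, cap=10):
    """Return (G, labels); labels[i] in 1..G is the within-cluster group of row i.

    Assumption: inside each cluster, rows are ranked by predicted probability
    and cut into G rank-based bins. Ties keep their input order.
    """
    rows_by_cluster = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        rows_by_cluster[c].append(i)

    G = choose_num_groups([len(rows) for rows in rows_by_cluster.values()], cap)

    labels = [0] * len(cluster_ids)
    for rows in rows_by_cluster.values():
        ranked = sorted(rows, key=lambda i: predicted_probs[i])  # stable sort
        m = len(ranked)
        for rank, i in enumerate(ranked):
            # Integer arithmetic maps rank 0..m-1 onto group 1..G.
            labels[i] = rank * G // m + 1
    return G, labels


# Hypothetical data: one cluster with 4 rows, one with 12 rows.
cluster_ids = ["A"] * 4 + ["B"] * 12
probs = [0.2, 0.8, 0.5, 0.1] + [0.05 * k for k in range(1, 13)]
G, labels = assign_groups(cluster_ids, probs)
```

With clusters of size 4 and 12 the rule gives G = 4, so each cluster contributes at least one observation to every group. A fixed G = 10 would leave at least six of the ten groups empty in the four-observation cluster.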

What carries the argument

The data-driven grouping of predicted probabilities within clusters followed by a Wald test on the augmented model indicators.

If this is right

The test can be applied to check fixed-effects misspecification in models with random slopes.
The data-driven group selection rule enables reliable performance in sparse cluster designs where fixed group numbers fail.
The test has no power against omitted clustering levels, consistent with its focus on residual structure in predicted probabilities.
It maintains Type I error control at smaller sample sizes than in prior work on simpler models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The grouping approach could be adapted to check fit in other generalized linear mixed models.
Rejection by the test would naturally lead to exploring nonlinear terms or interactions in the fixed effects.
Routine application in studies with hierarchical binary data could support more reliable model choices.

Load-bearing premise

The data-driven selection of the number of groups as min(10, smallest cluster size) allows feasible estimation of the augmented model while preserving the asymptotic properties of the Wald test and controlling Type I error rates across the simulated scenarios.

What would settle it

A simulation of a three-level random slope mixed logistic model with small clusters where the test's Type I error rate deviates substantially from the nominal level.
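
A simulation of this kind could start from a data generator like the Python sketch below. It draws binary outcomes from a three-level logistic model (observations within subjects within families) with a random slope. Every numeric default, the placement of the random slope at the subject level, and the function name are illustrative assumptions; fitting the mixed model and running the test are outside this sketch.

```python
import math
import random


def simulate_three_level(J=15, K=5, n=20, sd_family=0.6, sd_subject=0.6,
                         sd_slope=0.3, beta0=-0.5, beta1=0.8, seed=1):
    """Generate rows (family, subject, x, y) from a correctly specified model.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for j in range(J):
        u_family = rng.gauss(0.0, sd_family)        # family random intercept
        for k in range(K):
            u_subject = rng.gauss(0.0, sd_subject)  # subject random intercept
            b_slope = rng.gauss(0.0, sd_slope)      # subject random slope on x
            for _ in range(n):
                x = rng.gauss(0.0, 1.0)
                eta = beta0 + u_family + u_subject + (beta1 + b_slope) * x
                p = 1.0 / (1.0 + math.exp(-eta))    # inverse logit
                y = 1 if rng.random() < p else 0
                rows.append((j, k, x, y))
    return rows


rows = simulate_three_level()
```

Shrinking K or n in this generator is one way to produce the small-cluster designs the statement above refers to.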

Figures

Figures reproduced from arXiv: 2604.19694 by Ariel Linden.

**Figure 1.** Empirical Type I error rate by number of families (J), subjects per family (K), and observations per subject (n). ICC = 0.10. Dashed line = nominal 0.05. All values within Monte Carlo bounds [0.036, 0.064].

**Figure 2.** Empirical Type I error rate by number of families (J), subjects per family (K), and observations per subject (n). ICC = 0.30. Dashed line = nominal 0.05. All values within Monte Carlo bounds [0.036, 0.064].
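
The Monte Carlo bounds quoted in both captions can be reproduced approximately with a normal approximation to the binomial. The Python sketch below assumes 1,000 replications per scenario; that count is not stated in this summary and is chosen only because it yields bounds that round to the reported interval. The function names are hypothetical.

```python
import math


def monte_carlo_bounds(alpha=0.05, reps=1000, z=1.96):
    """Approximate 95% interval for an empirical rejection rate under the null.

    reps=1000 is an assumption, not a figure taken from the paper.
    """
    se = math.sqrt(alpha * (1.0 - alpha) / reps)
    return alpha - z * se, alpha + z * se


def within_bounds(rate, alpha=0.05, reps=1000):
    """True if an observed Type I error rate is inside the Monte Carlo bounds."""
    lo, hi = monte_carlo_bounds(alpha, reps)
    return lo <= rate <= hi


lo, hi = monte_carlo_bounds()
```

Under these assumptions the interval is roughly 0.0365 to 0.0635, which matches the reported [0.036, 0.064] to three decimals.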

Original abstract

Mixed-effects logistic regression is widely used for binary outcomes in hierarchical data, yet formal goodness-of-fit tests remain limited to random-intercept models and do not address sparse cluster settings. We extend a grouping-based Wald test to mixed-effects logistic models with random slopes. The procedure groups observations by predicted probabilities within clusters, augments the model with pooled group indicators, and tests their joint significance using a Wald statistic. To accommodate small clusters, we introduce a data-driven rule for selecting the number of groups, G=min(10,n_min), where n_min is the smallest cluster size, ensuring feasible estimation. Simulation studies across 24 null scenarios show that the test maintains nominal Type I error in three-level random slope models, including at smaller sample sizes than previously studied. The test exhibits increasing power to detect fixed-effects misspecification: power against omitted nonlinearity rises from 0.07 to 1.00 across effect sizes, and power against omitted interactions reaches 0.87. As expected, the test has no power to detect omission of a clustering level, reflecting its focus on residual structure in predicted probabilities. In sparse balanced designs, fixing G=10 leads to complete test failure, whereas the data-driven rule performs reliably. The method is implemented in the Stata program mlm_gof.
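
For readers who want the Wald step spelled out, the Python sketch below computes W = g' V^(-1) g for a vector g of estimated group-indicator coefficients with covariance matrix V, and refers W to a chi-squared distribution with as many degrees of freedom as tested coefficients. This is not the mlm_gof code: the helper names are hypothetical, the inputs are made-up numbers, and in practice g and V would come from the fitted augmented model.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def chi2_sf(x, df):
    """Upper-tail chi-squared probability for a positive integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # Q(1, h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # Q(1/2, h)
    while a < df / 2.0:
        # Recurrence Q(a + 1, h) = Q(a, h) + h^a e^(-h) / Gamma(a + 1)
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def wald_test(gamma, cov):
    """Return (W, p) for H0: all tested coefficients equal zero."""
    v_inv_gamma = solve_linear(cov, gamma)
    w = sum(g * v for g, v in zip(gamma, v_inv_gamma))
    return w, chi2_sf(w, len(gamma))


# Made-up coefficients and covariance for two group indicators.
w, p = wald_test([0.2, -0.1], [[0.04, 0.0], [0.0, 0.01]])
```

The p-value here rests on the chi-squared reference distribution, which the referee discussion further down treats as an approximation when G is chosen from the data.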

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Extends the grouping Wald test to random-slope logistic models with a simple min(10, n_min) rule that holds up in their simulations, but the data-dependent group count leaves the chi-squared reference distribution on uncertain ground.

The paper takes the existing grouping-based Wald test for logistic regression and adapts it to mixed-effects models that include random slopes. Observations are grouped by their fitted probabilities inside each cluster, the model is augmented with group indicators, and a joint Wald test checks whether those indicators add anything. For sparse clusters they add the rule G = min(10, smallest cluster size) so the augmented model can still be estimated. That combination is new, and the simulations across 24 null scenarios show the test keeps Type I error close to nominal even at smaller sample sizes than earlier work covered. Power rises appropriately against omitted nonlinearity and interactions in the fixed part, while the test shows no power when the only misspecification is a missing clustering level, which is the expected behavior. The Stata implementation is a practical plus for users who already work in that environment.

The main soft spot is exactly the one the stress-test flags: G is chosen from the data, so the dimension of the tested vector and the information matrix are random. Standard Wald asymptotics assume fixed regressors, and the paper offers no bootstrap or adjusted-df correction. Their simulations happen to keep the distortion small in the chosen scenarios, but that does not replace a theoretical justification.

The work is aimed at applied statisticians who fit three-level logistic models and want a concrete goodness-of-fit check rather than relying on informal residual plots. It is worth sending to referees because the extension is concrete, the simulation design is broad enough to be informative, and the practical rule fills a documented gap; a reviewer can press on the asymptotics and ask for more extreme sparse cases without the paper being rejected at the desk.

Referee Report

2 major / 2 minor

Summary. The manuscript extends a grouping-based Wald goodness-of-fit test to mixed-effects logistic regression models that include random slopes. Observations within clusters are grouped by their predicted probabilities, the model is augmented with pooled group indicators, and a Wald test is performed on the joint significance of these indicators. A data-driven rule G = min(10, n_min) is introduced to select the number of groups for small or sparse clusters. Simulations across 24 null scenarios indicate that the test maintains nominal Type I error rates in three-level random-slope models (including at smaller sample sizes than previously studied), exhibits power against omitted nonlinearity (rising from 0.07 to 1.00) and omitted interactions (reaching 0.87), and has no power against omission of a clustering level. The procedure is implemented in the Stata program mlm_gof.

Significance. If the validity of the test holds, the work supplies a practical diagnostic for residual structure in predicted probabilities for hierarchical binary data models that accommodate random slopes and sparse clusters, an area where formal tests have been limited. The simulation evidence across multiple scenarios and the provision of a Stata implementation constitute concrete strengths that would make the method usable by applied researchers working with three-level logistic models.

major comments (2)

[§2] Test procedure and group selection: The Wald statistic is asserted to follow a chi-squared distribution under the null, yet G is chosen data-dependently as min(10, n_min) and the grouping itself is formed from fitted probabilities. Standard Wald asymptotics require the tested dimension and design matrix to be non-random or independent of the response; no theoretical correction, bootstrap justification, or adjusted reference distribution is supplied. This directly affects the validity of the reported p-values and is load-bearing for the central claim that the procedure is a reliable goodness-of-fit test.
[Simulation study] 24 null scenarios: Although Type I error control is reported for three-level random-slope models, the scenarios do not appear to include designs in which n_min varies substantially across replications or in which the fitted-probability grouping is highly sensitive to small perturbations; such cases could amplify any distortion from the adaptive G. Additional targeted simulations or analytic arguments are needed to confirm that the nominal levels remain reliable when the data-driven rule is active.

minor comments (2)

[§2] Clarify whether the pooling of group indicators is performed within each cluster separately or across all clusters, and state the exact rule used to assign observations to groups when predicted probabilities are tied.
[Abstract and Discussion] The abstract states that the test 'has no power to detect omission of a clustering level'; a brief discussion of the practical implications of this property for model diagnostics would be helpful.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their insightful comments on our manuscript extending the goodness-of-fit test to mixed-effects logistic regression with random slopes. We address each major comment below and outline the revisions we will make to improve the paper.

Referee: §2 (test procedure and group selection): The Wald statistic is asserted to follow a chi-squared distribution under the null, yet G is chosen data-dependently as min(10, n_min) and the grouping itself is formed from fitted probabilities. Standard Wald asymptotics require the tested dimension and design matrix to be non-random or independent of the response; no theoretical correction, bootstrap justification, or adjusted reference distribution is supplied. This directly affects the validity of the reported p-values and is load-bearing for the central claim that the procedure is a reliable goodness-of-fit test.

Authors: We agree that the data-dependent nature of G = min(10, n_min) and the grouping based on fitted probabilities means that standard Wald asymptotics do not strictly apply, as the tested parameters and design are not independent of the data. Our manuscript relies on simulation evidence rather than a formal proof to establish the validity of the test. In the revised version, we will expand the discussion to explicitly acknowledge this limitation, clarify that the chi-squared reference distribution is an approximation justified by the simulation results, and caution users accordingly. We believe this empirical validation across diverse scenarios, including sparse clusters, supports the practical utility of the method while being transparent about the theoretical gap.
Revision: partial
Referee: Simulation study (24 null scenarios): Although Type I error control is reported for three-level random-slope models, the scenarios do not appear to include designs in which n_min varies substantially across replications or in which the fitted-probability grouping is highly sensitive to small perturbations; such cases could amplify any distortion from the adaptive G. Additional targeted simulations or analytic arguments are needed to confirm that the nominal levels remain reliable when the data-driven rule is active.

Authors: Our simulation study encompasses 24 null scenarios with varying cluster sizes and sample sizes, including cases where the data-driven rule sets G to n_min for small clusters. To address the referee's concern about variability in n_min and sensitivity of grouping, we will perform additional simulations in the revision. These will include designs where n_min differs across Monte Carlo replications and introduce minor data perturbations to evaluate the stability of the grouping procedure and Type I error rates. The results of these targeted simulations will be reported to provide further reassurance on the reliability of the test when the adaptive rule is in use.
Revision: yes
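
The targeted simulations discussed in this exchange need designs in which the adaptive rule actually moves from one replication to the next. As a rough illustration (not the authors' design), the Python sketch below draws unbalanced cluster sizes in each replication and tabulates the resulting G = min(10, n_min). The size range, cluster count, replication count, and function name are assumptions.

```python
import random
from collections import Counter


def adaptive_g_distribution(num_clusters=30, size_range=(3, 15), reps=500,
                            cap=10, seed=7):
    """Tabulate G = min(cap, smallest cluster size) over simulated designs.

    Cluster sizes are drawn uniformly from size_range (inclusive); this
    sampling scheme is hypothetical.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(reps):
        sizes = [rng.randint(*size_range) for _ in range(num_clusters)]
        counts[min(cap, min(sizes))] += 1
    return dict(sorted(counts.items()))


dist = adaptive_g_distribution()
balanced = adaptive_g_distribution(size_range=(12, 12), reps=50)
```

In a balanced design with clusters of 12 the rule always returns the cap of 10, whereas unbalanced designs let G vary across replications, which is the situation the referee asks to see covered.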

standing simulated objections (not resolved)

A rigorous theoretical derivation of the asymptotic distribution of the Wald statistic accounting for the data-dependent grouping and group count selection.

Circularity Check

0 steps flagged

No significant circularity; extension of standard Wald test with simulation validation

The paper extends a grouping-based Wald test to mixed-effects logistic models by augmenting with group indicators based on predicted probabilities and testing joint significance. The data-driven G = min(10, n_min) rule is introduced for feasibility in small clusters and is justified empirically via simulations showing Type I error control across 24 null scenarios. No derivation step reduces by construction to its inputs, no self-citations are load-bearing for the central claim, and no fitted quantities are renamed as predictions. The procedure remains self-contained against external benchmarks of Wald theory and Monte Carlo validation.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard statistical theory for logistic mixed models and Wald tests, plus the heuristic for selecting G. No new entities are postulated.

free parameters (1)

Maximum groups cap (10)
The value 10 is chosen as a fixed upper limit in the rule G=min(10, n_min) to ensure computational feasibility and stable estimation.

axioms (2)

[standard math] The Wald test for the joint significance of group indicators follows a chi-squared distribution under the null hypothesis of correct model specification.
Relies on standard asymptotic theory for maximum likelihood estimators in generalized linear mixed models.
[domain assumption] Observations grouped by similar predicted probabilities will reveal misspecifications if the fixed effects structure is inadequate.
This is the foundational idea of the grouping-based goodness-of-fit approach extended here.

