Federated generalized linear mixed models based on one-time shared summary statistics
Pith reviewed 2026-05-09 18:12 UTC · model grok-4.3
The pith
Generalized linear mixed models can be estimated from one-time shared summary statistics by generating matching pseudo-data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose generating pseudo-data whose summary statistics match those of the actual but unavailable data. These pseudo-data are then used for model estimation instead of the actual data. The estimates we achieve are identical (up to the third decimal place) to those derived from actual data and have similar bias, coverage, and prediction performance. Communication and resource efficiency distinguish our approach from existing methods.
What carries the argument
The generation of pseudo-data engineered to match the summary statistics of the unavailable real data, which then serves as the input for standard GLMM fitting routines.
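As a minimal illustration of the idea, the Python sketch below generates pseudo-values whose sample mean and variance match given targets. It assumes the shared summaries are only a per-cluster sample size, mean, and variance of a single variable; the paper's actual set of statistics and its generation algorithm may differ. The function name `make_pseudo_data` and the example summaries are hypothetical.

```python
import random
import statistics


def make_pseudo_data(n, target_mean, target_var, seed=0):
    """Return n pseudo-values whose sample mean and sample variance
    (n - 1 denominator) equal the targets up to floating-point error."""
    if n < 2:
        raise ValueError("need at least two observations to match a variance")
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Standardize the draws exactly, then rescale to the target moments.
    m = statistics.fmean(draws)
    s = statistics.stdev(draws)
    scale = target_var ** 0.5
    return [target_mean + scale * (d - m) / s for d in draws]


# Hypothetical per-cluster summaries: (size, mean, variance).
cluster_summaries = {"site_a": (50, 1.2, 0.8), "site_b": (30, -0.4, 1.5)}

pseudo = {
    site: make_pseudo_data(n, mu, var, seed=i)
    for i, (site, (n, mu, var)) in enumerate(cluster_summaries.items())
}
```

The pseudo-values carry no individual-level information beyond the matched moments. Whether that is enough for a given GLMM is the question the rest of this page examines.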
If this is right
- Parameter estimates match full-data results up to the third decimal place for linear, logistic, and Poisson mixed models.
- Bias, coverage probabilities, and prediction performance remain comparable to those obtained from actual data.
- Only one communication of summary statistics is required, avoiding repeated exchanges.
- The method applies directly to cases where individual records cannot be shared due to privacy constraints.
Where Pith is reading between the lines
- The same pseudo-data construction could be tested on other mixed-model families or on survival models where summary statistics are also commonly available.
- One-time summary sharing may lower the barrier to multi-site studies in fields like epidemiology where data governance rules forbid raw data transfer.
- If the method scales to high-dimensional covariates, it could support routine meta-analysis pipelines that currently rely on aggregated results alone.
Load-bearing premise
That pseudo-data constructed solely to match summary statistics will retain enough structure to recover unbiased parameter estimates, proper coverage, and accurate predictions, including for the random effects in mixed models.
What would settle it
A simulation in which the pseudo-data method yields random-effect variance estimates that differ by more than 0.01 from the full-data estimates or produces confidence intervals with coverage below 90 percent.
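The two criteria in this test can be written as a small check. The Python sketch below is illustrative only: the function names and the example numbers are hypothetical, and the thresholds (0.01 and 90 percent) are the ones stated above.

```python
def coverage(intervals, true_value):
    """Fraction of (lower, upper) intervals that contain the true value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)


def settles_against(var_pseudo, var_full, intervals, true_value,
                    tol=0.01, min_coverage=0.90):
    """True if either falsification criterion is met."""
    variance_gap = abs(var_pseudo - var_full) > tol
    low_coverage = coverage(intervals, true_value) < min_coverage
    return variance_gap or low_coverage


# Hypothetical simulation output: 19 of 20 intervals cover the true value 1.0.
example_intervals = [(0.8, 1.2)] * 19 + [(1.3, 1.7)]
claim_survives = not settles_against(0.505, 0.500, example_intervals, 1.0)
claim_fails = settles_against(0.520, 0.500, example_intervals, 1.0)
```

In practice the intervals and variance estimates would come from repeated simulation runs fitted on both the pseudo-data and the full data.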
Original abstract
Data privacy has increasingly become a daunting challenge because it limits data availability, which is essential in estimating statistical models such as generalized linear mixed models. Access to personal data often involves considerable time, effort, and paperwork, which can impede research progress and collaboration. Existing approaches that do not use individual-level data for model estimation are either prone to ecological bias, cannot handle heterogeneity, or require iterative communication. In this paper, we propose an approach to estimate generalized linear mixed models based on summary statistics shared only once. We used linear, logistic, and Poisson mixed models as examples to demonstrate the methodology. Our strategy involves generating pseudo-data whose summary statistics match those of the actual but unavailable data. These pseudo-data are then used for model estimation instead of the actual data. The estimates we achieve are identical (up to the third decimal place) to those derived from actual data and have similar bias, coverage, and prediction performance. Communication and resource efficiency distinguish our approach from existing methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a one-time communication federated method for estimating generalized linear mixed models (GLMMs) by sharing summary statistics, generating pseudo-data that match those statistics, and fitting the GLMM to the pseudo-data instead of the original individual-level records. Demonstrations are provided for linear, logistic, and Poisson mixed models, with the central claim that the resulting estimates are identical to those from the actual data up to the third decimal place and exhibit similar bias, coverage, and prediction performance.
Significance. If the pseudo-data construction reliably preserves the information needed for GLMM inference, the approach would offer a practical, communication-efficient alternative to iterative federated methods or aggregate-data techniques that suffer from ecological bias. It could enable collaborative GLMM analysis in privacy-sensitive domains while maintaining the ability to model heterogeneity via random effects.
major comments (1)
- [Methods] The pseudo-data generation step (described in the Methods) relies on matching unspecified summary statistics, but the marginal likelihood for GLMMs integrates over the random-effects distribution and depends on within-cluster joint distributions of responses and covariates. Global aggregates (e.g., overall means or totals) do not uniquely determine these cluster-level quantities, so it is unclear why the pseudo-data likelihood is guaranteed to yield the same score and information matrix as the true data; the reported third-decimal agreement may be data-specific rather than general.
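The referee's point that global aggregates do not determine cluster-level quantities can be seen in a toy example. The Python sketch below uses hypothetical data, not data from the paper: the same four responses are grouped into two clusters in two different ways, so the pooled mean and variance are identical while the cluster means differ.

```python
import statistics

# Hypothetical toy data: the same four responses, grouped two different ways.
between = {"c1": [0.0, 0.0], "c2": [2.0, 2.0]}  # all variation is between clusters
within = {"c1": [0.0, 2.0], "c2": [0.0, 2.0]}   # all variation is within clusters


def global_summary(clusters):
    """Pooled mean and sample variance, ignoring cluster membership."""
    pooled = [y for ys in clusters.values() for y in ys]
    return statistics.fmean(pooled), statistics.variance(pooled)


def cluster_means(clusters):
    """Mean response within each cluster."""
    return {name: statistics.fmean(ys) for name, ys in clusters.items()}


same_global = global_summary(between) == global_summary(within)
means_between = cluster_means(between)
means_within = cluster_means(within)
```

A random-intercept model would attribute the variation to the clusters in the first grouping and to the residual in the second, even though the global summaries cannot tell the two apart.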
minor comments (2)
- [Abstract] The abstract and introduction would benefit from an explicit list or table of the exact summary statistics that are shared (e.g., cluster sizes, covariate means per cluster, response totals).
- [Results] Simulation results in Section 4 should include the specific values of the shared summaries for each example so readers can assess sufficiency.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our manuscript. We address the major comment below and have incorporated revisions to improve clarity regarding the empirical nature of our approach.
- Referee: [Methods] The pseudo-data generation step (described in the Methods) relies on matching unspecified summary statistics, but the marginal likelihood for GLMMs integrates over the random-effects distribution and depends on within-cluster joint distributions of responses and covariates. Global aggregates (e.g., overall means or totals) do not uniquely determine these cluster-level quantities, so it is unclear why the pseudo-data likelihood is guaranteed to yield the same score and information matrix as the true data; the reported third-decimal agreement may be data-specific rather than general.
  Authors: We agree that matching summary statistics does not provide a theoretical guarantee that the pseudo-data will reproduce the exact marginal likelihood, score function, or information matrix of the original data, since the GLMM marginal likelihood depends on within-cluster joint distributions that global aggregates alone cannot uniquely determine. The summary statistics matched in our procedure include both global and cluster-level aggregates (e.g., per-cluster means, variances, and sizes for responses and covariates), as specified in the Methods section; the pseudo-data are generated to match these moments approximately. We do not claim exact equivalence of the likelihoods but rather demonstrate, via extensive simulations across linear, logistic, and Poisson GLMMs with varying numbers of clusters, cluster sizes, and effect magnitudes, that the resulting estimates agree with full-data estimates to at least three decimal places and exhibit comparable bias, coverage, and predictive performance. These results suggest the approximation is reliable in the settings examined, though we acknowledge it may not hold universally. We will revise the manuscript to explicitly describe the method as providing a close empirical approximation, to list the matched statistics more prominently, and to add a limitations discussion on the absence of a general theoretical guarantee.
  Revision: yes
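For reference, the marginal likelihood at issue in this exchange has the standard GLMM form below. The notation is an assumption introduced here, not taken from the paper: m clusters, n_i observations in cluster i, fixed effects β, random effects b_i with density φ governed by variance parameters θ, and conditional response density f.

```latex
L(\beta, \theta)
  = \prod_{i=1}^{m} \int
      \left[ \prod_{j=1}^{n_i} f\!\left(y_{ij} \mid x_{ij}, b_i; \beta\right) \right]
      \phi\!\left(b_i; \theta\right) \, db_i
```

Because each integral runs over the product of all responses within one cluster, the likelihood depends on the joint within-cluster configuration of responses and covariates, which is the dependence both the referee and the authors refer to.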
Circularity Check
No circularity: pseudo-data generation and GLMM fitting remain independent of target estimates
The paper's core procedure generates pseudo-data to match one-time shared summary statistics of unavailable data, then fits the GLMM directly to the pseudo-data. This step is a modeling choice whose validity is assessed by external comparison to fits on the original data (reported as matching to three decimals with comparable bias/coverage). No equation reduces the fitted GLMM parameters to the input summaries by construction, no self-citation supplies a uniqueness theorem or ansatz, and no parameter is fitted on a subset then relabeled as a prediction. The derivation chain is therefore self-contained; performance claims rest on simulation evidence rather than definitional equivalence.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: pseudo-data can be generated to match the summary statistics of the real data closely enough for accurate GLMM parameter estimation.