arxiv: 2605.01379 · v1 · submitted 2026-05-02 · 📊 stat.ME

Federated generalized linear mixed models based on one-time shared summary statistics

Pith reviewed 2026-05-09 18:12 UTC · model grok-4.3

classification 📊 stat.ME
keywords generalized linear mixed models · federated estimation · summary statistics · pseudo-data generation · data privacy · one-time communication · mixed models · GLMM

The pith

Generalized linear mixed models can be estimated from one-time shared summary statistics by generating matching pseudo-data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that generalized linear mixed models can be fit accurately without access to individual-level records. Researchers generate pseudo-data whose summary statistics exactly replicate those of the unavailable private data, then estimate the model on the pseudo-data instead. This approach requires only a single round of summary sharing and produces estimates that match those from the real data up to the third decimal place, with comparable bias, coverage, and prediction accuracy for linear, logistic, and Poisson mixed models. A sympathetic reader would care because it removes a major practical barrier: the time, paperwork, and privacy risks that currently prevent many collaborative analyses.

Core claim

We propose generating pseudo-data whose summary statistics match those of the actual but unavailable data. These pseudo-data are then used for model estimation instead of the actual data. The estimates we achieve are identical (up to the third decimal place) to those derived from actual data and have similar bias, coverage, and prediction performance. Communication and resource efficiency distinguish our approach from existing methods.

What carries the argument

Pseudo-data engineered to match the summary statistics of the unavailable real data, which then serve as the input to standard GLMM fitting routines.
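
For the linear, fixed-effects special case, the idea of moment-matched pseudo-data can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's algorithm: it assumes the shared summaries are the sample mean, sample covariance, and sample size of the stacked response and covariates, and the helper name `pseudo_data_matching_moments` is hypothetical. It draws Gaussian noise, whitens it so that its sample covariance is exactly the identity, then recolors and shifts it to the target moments. Ordinary least squares depends on the data only through these moments, so the pseudo-data fit reproduces the real-data fit. Whether analogous matching is enough for logistic or Poisson mixed models is the question the paper addresses empirically.

```python
import numpy as np

def pseudo_data_matching_moments(mean, cov, n, seed=0):
    """Hypothetical helper: return an (n, p) array whose sample mean and
    sample covariance (ddof=1) equal `mean` and `cov` up to rounding."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, len(mean)))
    z -= z.mean(axis=0)                              # exact zero column means
    # Whiten: after this step the sample covariance of z is the identity.
    l_z = np.linalg.cholesky(np.cov(z, rowvar=False))
    z = np.linalg.solve(l_z, z.T).T
    # Recolor to the target covariance, then shift to the target mean.
    l_target = np.linalg.cholesky(cov)
    return z @ l_target.T + mean

def ols(data):
    """Least-squares fit of column 0 on an intercept plus the other columns."""
    design = np.column_stack([np.ones(len(data)), data[:, 1:]])
    coef, *_ = np.linalg.lstsq(design, data[:, 0], rcond=None)
    return coef

# Simulated "private" data (assumption: one site, linear model, no random effects).
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=(n, 2))
y = 1.0 + x @ np.array([0.5, -2.0]) + rng.normal(scale=0.3, size=n)
real = np.column_stack([y, x])

# The only quantities shared in this sketch: mean, covariance, sample size.
mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)
pseudo = pseudo_data_matching_moments(mean, cov, n)

beta_real = ols(real)
beta_pseudo = ols(pseudo)
print(np.max(np.abs(beta_real - beta_pseudo)))       # difference at rounding level
```

Only `mean`, `cov`, and `n` cross the site boundary here; the raw rows in `real` are used solely to compute them and to check the result.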

If this is right

  • Parameter estimates match full-data results up to the third decimal place for linear, logistic, and Poisson mixed models.
  • Bias, coverage probabilities, and prediction performance remain comparable to those obtained from actual data.
  • Only one communication of summary statistics is required, avoiding repeated exchanges.
  • The method applies directly to cases where individual records cannot be shared due to privacy constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pseudo-data construction could be tested on other mixed-model families or on survival models where summary statistics are also commonly available.
  • One-time summary sharing may lower the barrier to multi-site studies in fields like epidemiology where data governance rules forbid raw data transfer.
  • If the method scales to high-dimensional covariates, it could support routine meta-analysis pipelines that currently rely on aggregated results alone.

Load-bearing premise

That pseudo-data constructed solely to match summary statistics will retain enough structure to recover unbiased parameter estimates, proper coverage, and accurate predictions, including for the random effects in mixed models.

What would settle it

A simulation in which the pseudo-data method yields random-effect variance estimates that differ by more than 0.01 from the full-data estimates or produces confidence intervals with coverage below 90 percent.
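
The criterion above can be turned into a mechanical check. The sketch below uses a hypothetical helper, `settles_against`, with thresholds taken from the sentence above (0.01 and 90 percent). The input numbers are made up for illustration and are not results from the paper.

```python
import numpy as np

def settles_against(var_pseudo, var_full, ci_lower, ci_upper, true_value,
                    tol=0.01, min_coverage=0.90):
    """Hypothetical check across simulation replicates.

    Returns (failed, max_gap, coverage), where `failed` is True if the
    random-effect variance estimates differ by more than `tol` in any
    replicate, or if the interval coverage falls below `min_coverage`.
    """
    var_pseudo = np.asarray(var_pseudo, dtype=float)
    var_full = np.asarray(var_full, dtype=float)
    ci_lower = np.asarray(ci_lower, dtype=float)
    ci_upper = np.asarray(ci_upper, dtype=float)

    # Largest disagreement between pseudo-data and full-data estimates.
    max_gap = float(np.max(np.abs(var_pseudo - var_full)))
    # Share of replicates whose interval contains the true value.
    covered = (ci_lower <= true_value) & (true_value <= ci_upper)
    coverage = float(np.mean(covered))

    failed = max_gap > tol or coverage < min_coverage
    return failed, max_gap, coverage

# Illustrative inputs only (three replicates, invented numbers).
failed, max_gap, coverage = settles_against(
    var_pseudo=[0.250, 0.248, 0.252],
    var_full=[0.251, 0.247, 0.252],
    ci_lower=[0.10, 0.20, 0.30],
    ci_upper=[0.40, 0.50, 0.60],
    true_value=0.25,
)
print(failed, round(max_gap, 3), round(coverage, 3))
```

In these invented inputs the variance estimates agree closely, but the third interval misses the true value, so coverage drops to two in three and the check reports a failure.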

Figures

Figures reproduced from arXiv: 2605.01379 by Christel Faes, Marie Analiz April Limpoco, Niel Hens.

Figure 1: Proposed framework for a setup with three data providers. Each data provider …
Figure 2: Relative bias distributions of estimates from pseudo-data and simulated data
Figure 3: 95% confidence intervals computed on pseudo-data and simulated data across …
Figure 4: 95% confidence interval coverage computed on pseudo-data and simulated data
Figure 5: Predictions of models based on pseudo-data and simulated data across 500 …

Original abstract

Data privacy has increasingly become a daunting challenge because it limits data availability, which is essential in estimating statistical models such as generalized linear mixed models. Access to personal data often involves considerable time, effort, and paperwork, which can impede research progress and collaboration. Existing approaches that do not use individual-level data for model estimation are either prone to ecological bias, cannot handle heterogeneity, or require iterative communication. In this paper, we propose an approach to estimate generalized linear mixed models based on summary statistics shared only once. We used linear, logistic, and Poisson mixed models as examples to demonstrate the methodology. Our strategy involves generating pseudo-data whose summary statistics match those of the actual but unavailable data. These pseudo-data are then used for model estimation instead of the actual data. The estimates we achieve are identical (up to the third decimal place) to those derived from actual data and have similar bias, coverage, and prediction performance. Communication and resource efficiency distinguish our approach from existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a one-time communication federated method for estimating generalized linear mixed models (GLMMs) by sharing summary statistics, generating pseudo-data that match those statistics, and fitting the GLMM to the pseudo-data instead of the original individual-level records. Demonstrations are provided for linear, logistic, and Poisson mixed models, with the central claim that the resulting estimates are identical to those from the actual data up to the third decimal place and exhibit similar bias, coverage, and prediction performance.

Significance. If the pseudo-data construction reliably preserves the information needed for GLMM inference, the approach would offer a practical, communication-efficient alternative to iterative federated methods or aggregate-data techniques that suffer from ecological bias. It could enable collaborative GLMM analysis in privacy-sensitive domains while maintaining the ability to model heterogeneity via random effects.

major comments (1)
  1. [Methods] The pseudo-data generation step (described in the Methods) relies on matching unspecified summary statistics, but the marginal likelihood for GLMMs integrates over the random-effects distribution and depends on within-cluster joint distributions of responses and covariates. Global aggregates (e.g., overall means or totals) do not uniquely determine these cluster-level quantities, so it is unclear why the pseudo-data likelihood is guaranteed to yield the same score and information matrix as the true data; the reported third-decimal agreement may be data-specific rather than general.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from an explicit list or table of the exact summary statistics that are shared (e.g., cluster sizes, covariate means per cluster, response totals).
  2. [Results] Simulation results in Section 4 should include the specific values of the shared summaries for each example so readers can assess sufficiency.
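
The major comment turns on the form of the GLMM marginal likelihood. In standard textbook notation (an assumption here; the paper's own symbols may differ), with clusters i = 1, ..., m, observations j = 1, ..., n_i, link function g, and normally distributed random effects b_i with covariance D, it can be written as:

```latex
L(\beta, D) \;=\; \prod_{i=1}^{m} \int
  \Biggl[ \prod_{j=1}^{n_i} f\bigl(y_{ij} \mid x_{ij}, z_{ij}, b_i; \beta\bigr) \Biggr]
  \phi(b_i; 0, D)\, \mathrm{d}b_i ,
\qquad
g\bigl(\mathrm{E}[y_{ij} \mid b_i]\bigr) = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i .
```

Each integral involves the joint configuration of responses and covariates within cluster i, which is why the referee asks which cluster-level summaries are shared and whether they pin these quantities down.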

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. We address the major comment below and have incorporated revisions to improve clarity regarding the empirical nature of our approach.

Point-by-point responses
  1. Referee: [Methods] The pseudo-data generation step (described in the Methods) relies on matching unspecified summary statistics, but the marginal likelihood for GLMMs integrates over the random-effects distribution and depends on within-cluster joint distributions of responses and covariates. Global aggregates (e.g., overall means or totals) do not uniquely determine these cluster-level quantities, so it is unclear why the pseudo-data likelihood is guaranteed to yield the same score and information matrix as the true data; the reported third-decimal agreement may be data-specific rather than general.

    Authors: We agree that matching summary statistics does not provide a theoretical guarantee that the pseudo-data will reproduce the exact marginal likelihood, score function, or information matrix of the original data, since the GLMM marginal likelihood depends on within-cluster joint distributions that global aggregates alone cannot uniquely determine. The summary statistics matched in our procedure include both global and cluster-level aggregates (e.g., per-cluster means, variances, and sizes for responses and covariates), as specified in the Methods section; the pseudo-data are generated to match these moments approximately. We do not claim exact equivalence of the likelihoods but rather demonstrate, via extensive simulations across linear, logistic, and Poisson GLMMs with varying numbers of clusters, cluster sizes, and effect magnitudes, that the resulting estimates agree with full-data estimates to at least three decimal places and exhibit comparable bias, coverage, and predictive performance. These results suggest the approximation is reliable in the settings examined, though we acknowledge it may not hold universally. We will revise the manuscript to explicitly describe the method as providing a close empirical approximation, to list the matched statistics more prominently, and to add a limitations discussion on the absence of a general theoretical guarantee.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: pseudo-data generation and GLMM fitting remain independent of target estimates

The paper's core procedure generates pseudo-data to match one-time shared summary statistics of unavailable data, then fits the GLMM directly to the pseudo-data. This step is a modeling choice whose validity is assessed by external comparison to fits on the original data (reported as matching to three decimals with comparable bias/coverage). No equation reduces the fitted GLMM parameters to the input summaries by construction, no self-citation supplies a uniqueness theorem or ansatz, and no parameter is fitted on a subset then relabeled as a prediction. The derivation chain is therefore self-contained; performance claims rest on simulation evidence rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that summary statistics suffice to generate pseudo-data preserving GLMM information; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Pseudo-data can be generated to match the summary statistics of the real data closely enough for accurate GLMM parameter estimation.
    This is the core mechanism stated in the abstract for replacing actual data.

pith-pipeline@v0.9.0 · 5468 in / 1257 out tokens · 58441 ms · 2026-05-09T18:12:16.836350+00:00 · methodology
