pith. sign in

arxiv: 2605.03499 · v1 · submitted 2026-05-05 · 💻 cs.LG · cs.IT· math.IT· stat.ML

A Hierarchical Sampling Framework for bounding the Generalization Error of Federated Learning

Pith reviewed 2026-05-07 04:12 UTC · model grok-4.3

classification 💻 cs.LG cs.ITmath.ITstat.ML
keywords federated learninggeneralization boundsWasserstein distancehierarchical samplingsupersample constructionconditional mutual informationdifferential privacyGaussian location model
0
0 comments X

The pith

A hierarchical tree sampling model yields Wasserstein-based generalization bounds for federated learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework for hierarchical federated learning where client data is sampled according to a multi-layered tree that encodes dependencies. It derives expected generalization bounds expressed using the Wasserstein distance, under the assumption that the loss function is Lipschitz continuous. The derivation relies on a supersample construction that tracks the algorithm's sensitivity to changing one node in the tree. This approach recovers and strengthens existing conditional mutual information bounds for bounded losses, combines with differential privacy for privacy-based bounds, and matches the true asymptotic rate of generalization error in the Gaussian location model.

Core claim

The central claim is that the expected generalization error of a hierarchical federated learning algorithm can be bounded by a constant times the Wasserstein distance between the population and empirical distributions, where the constant depends on the Lipschitz constant of the loss and the structure of the sampling tree. This bound is obtained by constructing a supersample that allows measuring the effect of replacing a single client dataset in the hierarchy. By leveraging the federated learning structure, the resulting bound strictly implies prior conditional mutual information bounds in the bounded-loss case and recovers the exact asymptotic rate for the Gaussian location model.

What carries the argument

The multi-layered sampling tree together with the supersample construction, which quantifies the sensitivity of the learning algorithm to the change of a single node in the tree.

If this is right

  • The derived bounds strictly imply state-of-the-art conditional mutual information bounds when losses are bounded.
  • The bounds can be combined with differential privacy assumptions to obtain generalization guarantees based on algorithmic privacy.
  • The bounds recover the exact asymptotic rate of the generalization error in the Gaussian location model.
  • By leveraging the federated learning structure, the framework provides control on the generalization error that accounts for dependencies among client datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the tree structure accurately models real-world client data dependencies, federated learning algorithms could be designed to minimize the relevant Wasserstein distances to improve generalization.
  • This framework might extend to other distributed learning settings with hierarchical data collection, such as multi-site medical studies.
  • Testing the bounds on non-Lipschitz losses or non-Gaussian data would reveal how much the Lipschitz assumption limits practical applicability.

Load-bearing premise

The loss function must be Lipschitz continuous with respect to the model parameters.

What would settle it

If experiments in the Gaussian location model show that the generalization error does not match the asymptotic rate predicted by the Wasserstein bound, the claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.03499 by Dario Filatrella, Mikael Skoglund, Ragnar Thobaben.

Figure 1
Figure 1. Figure 1: Hierarchical sampling structure Formally the sampling is as follows: the first layer’s nodes are sampled i.i.d. from a meta-distribution D, meaning µ i1 1 ∼ D for all i1. For the lower layers we have µ i1:l l ∼ µ i1:l−1 l−1 . (1) Consequently, µ i1:L L ∈ Z is a data point, µ i1:L−1 L−1 is a distribution over data points, and so on. We can then write the sampling of a single data point as µL ∼ µL−1 ∼ · · · … view at source ↗
Figure 2
Figure 2. Figure 2: Comparison between the true generalization error and our bound from view at source ↗
read the original abstract

We study expected generalization bounds for the Hierarchical Federated Learning (HFL) setup using Wasserstein distance. We introduce a generalized framework in which data is sampled hierarchically, and we model it with a multi-layered tree structure that induces dependencies among the clients' datasets. We derive generalization bounds in terms of Wasserstein distance under the Lipschitz assumption on the loss function, by applying a supersample construction that allows us to measure the sensitivity of the algorithm to the change of a single node in the sampling tree. By leveraging the FL structure, we recover and strictly imply existing state-of-the-art conditional mutual information (CMI) bounds in the case of bounded losses. We also show that our bound can be applied together with Differential Privacy assumptions, to recover generalization bounds based on algorithmic privacy. To assess the tightness of our bounds, we study the Gaussian Location Model (GLM) and show that we recover the actual asymptotic rate of the generalization error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a hierarchical sampling framework for deriving generalization bounds in Hierarchical Federated Learning (HFL). Data sampling is modeled using a multi-layered tree structure that captures dependencies among clients' datasets. Under the assumption that the loss function is Lipschitz continuous, the authors derive bounds on the expected generalization error in terms of the Wasserstein distance between distributions, employing a supersample construction to assess the sensitivity of the learning algorithm to perturbations in a single node of the sampling tree. The framework is shown to recover and strictly strengthen existing conditional mutual information (CMI) bounds for bounded losses, and it can be combined with differential privacy to obtain privacy-based generalization bounds. Tightness is demonstrated by recovering the asymptotic rate of the generalization error in the Gaussian Location Model (GLM).

Significance. If the derivations hold, this work provides a unified approach to generalization bounds in federated learning that incorporates hierarchical data dependencies via a tree model and Wasserstein distances. It bridges Wasserstein-based bounds with CMI and DP-based bounds, offering a way to leverage FL structure for tighter results. The GLM example for tightness is particularly useful for validating the framework in a concrete setting, which could influence future research on generalization in distributed learning.

major comments (1)
  1. [Gaussian Location Model section] The claim that the bounds recover the actual asymptotic rate of the generalization error in the GLM (abstract and the dedicated GLM section) is load-bearing for the tightness assessment. The primary bound requires Lipschitz continuity of the loss. Standard GLM analysis uses squared loss (or equivalent), which fails to be Lipschitz on unbounded support. The manuscript must clarify whether the GLM rate is obtained by instantiating the Wasserstein bound (under added boundedness to satisfy Lipschitz) or by direct computation on the estimator. Without this, the tightness result does not validate the bound inside the regime where its assumptions apply. Please point to the specific equations or derivation steps.
minor comments (2)
  1. [Section introducing the supersample construction] The supersample construction and tree-node sensitivity argument would benefit from an explicit diagram or pseudocode to illustrate the hierarchical dependencies and the single-node change.
  2. [Preliminaries] Notation for the multi-layered tree (e.g., node depths, client subsets) should be summarized in a table or definition box for quick reference.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We appreciate the opportunity to clarify the tightness assessment in the Gaussian Location Model (GLM) section. We address the major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Gaussian Location Model section] The claim that the bounds recover the actual asymptotic rate of the generalization error in the GLM (abstract and the dedicated GLM section) is load-bearing for the tightness assessment. The primary bound requires Lipschitz continuity of the loss. Standard GLM analysis uses squared loss (or equivalent), which fails to be Lipschitz on unbounded support. The manuscript must clarify whether the GLM rate is obtained by instantiating the Wasserstein bound (under added boundedness to satisfy Lipschitz) or by direct computation on the estimator. Without this, the tightness result does not validate the bound inside the regime where its assumptions apply. Please point to the specific equations or derivation steps.

    Authors: We thank the referee for identifying this point that requires greater precision in our presentation. In Section 4.2, the GLM tightness result is established in two complementary ways. First, we perform a direct computation of the expected generalization error for the empirical mean estimator under the standard Gaussian location model (Equations 28-32), which yields the known asymptotic rate of order 1/sqrt(m) where m denotes the total number of samples across the hierarchy. Second, we instantiate the main Wasserstein bound (Theorem 3.1) by replacing the squared loss with its truncated version L_B(x, mu) = min{(x - mu)^2, B} for a truncation level B that grows sufficiently slowly with m; this truncation renders the loss Lipschitz while preserving the same asymptotic rate (see the derivation following Equation 33 and Remark 4.2). The resulting Wasserstein bound then matches the direct rate up to universal constants. We will revise the GLM section to explicitly separate these two arguments, add a dedicated paragraph stating the truncation construction, and include a forward reference from the abstract to these steps. This ensures the tightness claim is validated precisely inside the Lipschitz regime of the main theorem. revision: yes

Circularity Check

0 steps flagged

No circularity: new tree model and supersample construction yield independent Wasserstein bounds that imply prior CMI results

full rationale

The paper introduces a multi-layered hierarchical sampling tree and supersample construction to derive new Wasserstein-distance generalization bounds under the Lipschitz loss assumption. It then specializes this general bound to recover (and strictly imply) existing CMI bounds for bounded losses, and separately checks tightness via the Gaussian Location Model. No quoted derivation step reduces a claimed result to a fitted input, self-citation chain, or definitional renaming; the central claims rest on the novel tree structure and sensitivity analysis, which are not presupposed by the recovered bounds. The GLM rate recovery is presented as an external tightness diagnostic rather than part of the main derivation. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

From the abstract alone, the central claims rest on the Lipschitz assumption for the loss and the modeling of hierarchical sampling as a multi-layered tree; no free parameters, invented entities, or additional axioms are mentioned. Full paper would be needed to identify any fitted quantities or further assumptions.

axioms (2)
  • domain assumption The loss function is Lipschitz continuous
    Explicitly invoked in the abstract to derive the Wasserstein generalization bounds via supersample construction.
  • domain assumption Data is sampled hierarchically and can be modeled with a multi-layered tree structure inducing dependencies among clients' datasets
    Core modeling choice stated in the abstract for the generalized framework and sensitivity measurement.

pith-pipeline@v0.9.0 · 5469 in / 1769 out tokens · 68085 ms · 2026-05-07T04:12:55.072016+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references

  1. [1]

    Advances and Open Problems in Federated Learning,

    P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, H. Eichner, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, ...

  2. [2]

    Communication-Efficient Learning of Deep Networks from Decentralized Data,

    H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” Jan. 2023

  3. [3]

    Deep learning with Elastic Averaging SGD,

    S. Zhang, A. Choromanska, and Y . LeCun, “Deep learning with Elastic Averaging SGD,” Oct. 2015

  4. [4]

    Parallel training of DNNs with Natural Gradient and Parameter Averaging,

    D. Povey, X. Zhang, and S. Khudanpur, “Parallel training of DNNs with Natural Gradient and Parameter Averaging,” Jun. 2015

  5. [5]

    Kish,Survey Sampling

    L. Kish,Survey Sampling. Wiley, 1965. [Online]. Available: https://books.google.it/books?id=xiZmAAAAIAAJ

  6. [6]

    W. G. Cochran,Sampling Techniques, 3rd ed., ser. Wiley Series in Probability and Mathematical Statistics. New York, NY: Wiley, 1977

  7. [7]

    A theory of the learnable,

    L. G. Valiant, “A theory of the learnable,” inProceedings of the Sixteenth Annual ACM Symposium on Theory of Computing - STOC ’84. Not Known: ACM Press, 1984, pp. 436–445

  8. [8]

    On the uniform convergence of relative frequencies of events to their probabilities,

    V . N. Vapnik and A. Y . Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” inMeasures of complexity: festschrift for alexey chervonenkis. Springer, 2015, pp. 11–30

  9. [9]

    Understand- ing deep learning requires rethinking generalization,

    C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understand- ing deep learning requires rethinking generalization,” Feb. 2017

  10. [10]

    Information-theoretic analysis of generaliza- tion capability of learning algorithms,

    A. Xu and M. Raginsky, “Information-theoretic analysis of generaliza- tion capability of learning algorithms,” Nov. 2017

  11. [11]

    Tightening Mutual Information Based Bounds on Generalization Error,

    Y . Bu, S. Zou, and V . V . Veeravalli, “Tightening Mutual Information Based Bounds on Generalization Error,”IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 121–130, May 2020

  12. [12]

    Reasoning About General- ization via Conditional Mutual Information,

    T. Steinke, T.-S. Net, and L. Zakynthinou, “Reasoning About General- ization via Conditional Mutual Information,” 2020

  13. [13]

    Information-theoretic generalization bounds for black-box learning algorithms,

    H. Harutyunyan, M. Raginsky, G. V . Steeg, and A. Galstyan, “Information-theoretic generalization bounds for black-box learning algorithms,” Oct. 2021

  14. [14]

    Generalization Bounds via Information Density and Conditional Information Density,

    F. Hellstrom and G. Durisi, “Generalization Bounds via Information Density and Conditional Information Density,”IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 824–839, Nov. 2020

  15. [15]

    Tighter expected generalization error bounds via Wasserstein distance,

    B. Rodríguez-Gálvez, G. Bassi, R. Thobaben, and M. Skoglund, “Tighter expected generalization error bounds via Wasserstein distance,” Mar. 2022

  16. [16]

    Optimal Transport for Applied Mathematicians – Calculus of Variations, PDEs and Modeling,

    F. Santambrogio, “Optimal Transport for Applied Mathematicians – Calculus of Variations, PDEs and Modeling,” 2015

  17. [17]

    Distributed Training Strategies for the Structured Perceptron,

    R. McDonald, K. Hall, and G. Mann, “Distributed Training Strategies for the Structured Perceptron,”. . ., 2010

  18. [18]

    What Do We Mean by Generalization in Federated Learning?

    H. Yuan, W. Morningstar, L. Ning, and K. Singhal, “What Do We Mean by Generalization in Federated Learning?” Mar. 2022

  19. [19]

    FedHB: Hierarchical Bayesian Federated Learning,

    M. Kim and T. Hospedales, “FedHB: Hierarchical Bayesian Federated Learning,” May 2023

  20. [20]

    A Review of Federated Learning Methods in Heterogeneous Scenarios,

    J. Pei, W. Liu, J. Li, L. Wang, and C. Liu, “A Review of Federated Learning Methods in Heterogeneous Scenarios,”IEEE Transactions on Consumer Electronics, vol. 70, no. 3, pp. 5983–5999, Aug. 2024

  21. [21]

    Improving Generalization in Federated Learning with Model-Data Mutual Information Regularization: A Posterior Inference Approach,

    H. Zhang, C. Li, N. Kan, Z. Zheng, W. Dai, J. Zou, and H. Xiong, “Improving Generalization in Federated Learning with Model-Data Mutual Information Regularization: A Posterior Inference Approach,” 2024

  22. [22]

    Improved Information The- oretic Generalization Bounds for Distributed and Federated Learning,

    L. P. Barnes, A. Dytso, and H. V . Poor, “Improved Information The- oretic Generalization Bounds for Distributed and Federated Learning,” Entropy, vol. 24, no. 9, p. 1178, Aug. 2022

  23. [23]

    Heterogeneity Matters even More in Distributed Learning: Study from Generalization Perspective,

    M. Kavian, R. Chor, M. Sefidgaran, and A. Zaidi, “Heterogeneity Matters even More in Distributed Learning: Study from Generalization Perspective,” May 2025

  24. [24]

    Generalization in Federated Learning: A Conditional Mutual Information Framework,

    Z. Wang, C. Long, and Y . Mao, “Generalization in Federated Learning: A Conditional Mutual Information Framework,” Jun. 2025. APPENDIXA EXTRA PRELIMINARIES AND DEFINITION In this section we breifly overview some of the definitions and results that we will use for the proofs. A. Information theory Definition 14(KL divergence).Given two probability distribu...

  25. [25]

    Their supersample has one extra index because they average the information metric over the whole ghost subtree, not only the intermediate node

    andU i,Vi j (equivalent to ourU i,j 2 ). Their supersample has one extra index because they average the information metric over the whole ghost subtree, not only the intermediate node. We split the proof for theparticipation gapand theout-of-sample gap. Out-of-sample gap:We begin by applying the definition of total variation distance and the Pinsker inequ...