A Hierarchical Sampling Framework for bounding the Generalization Error of Federated Learning
Pith reviewed 2026-05-07 04:12 UTC · model grok-4.3
The pith
A hierarchical tree sampling model yields Wasserstein-based generalization bounds for federated learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the expected generalization error of a hierarchical federated learning algorithm can be bounded by a constant times the Wasserstein distance between the population and empirical distributions, where the constant depends on the Lipschitz constant of the loss and the structure of the sampling tree. This bound is obtained by constructing a supersample that allows measuring the effect of replacing a single client dataset in the hierarchy. By leveraging the federated learning structure, the resulting bound strictly implies prior conditional mutual information bounds in the bounded-loss case and recovers the exact asymptotic rate for the Gaussian location model.
What carries the argument
The multi-layered sampling tree together with the supersample construction, which quantifies the sensitivity of the learning algorithm to the change of a single node in the tree.
If this is right
- The derived bounds strictly imply state-of-the-art conditional mutual information bounds when losses are bounded.
- The bounds can be combined with differential privacy assumptions to obtain generalization guarantees based on algorithmic privacy.
- The bounds recover the exact asymptotic rate of the generalization error in the Gaussian location model.
- By leveraging the federated learning structure, the framework provides control on the generalization error that accounts for dependencies among client datasets.
Where Pith is reading between the lines
- If the tree structure accurately models real-world client data dependencies, federated learning algorithms could be designed to minimize the relevant Wasserstein distances to improve generalization.
- This framework might extend to other distributed learning settings with hierarchical data collection, such as multi-site medical studies.
- Testing the bounds on non-Lipschitz losses or non-Gaussian data would reveal how much the Lipschitz assumption limits practical applicability.
Load-bearing premise
The loss function must be Lipschitz continuous with respect to the model parameters.
What would settle it
If experiments in the Gaussian location model show that the generalization error does not match the asymptotic rate predicted by the Wasserstein bound, the claim would be falsified.
Figures
read the original abstract
We study expected generalization bounds for the Hierarchical Federated Learning (HFL) setup using Wasserstein distance. We introduce a generalized framework in which data is sampled hierarchically, and we model it with a multi-layered tree structure that induces dependencies among the clients' datasets. We derive generalization bounds in terms of Wasserstein distance under the Lipschitz assumption on the loss function, by applying a supersample construction that allows us to measure the sensitivity of the algorithm to the change of a single node in the sampling tree. By leveraging the FL structure, we recover and strictly imply existing state-of-the-art conditional mutual information (CMI) bounds in the case of bounded losses. We also show that our bound can be applied together with Differential Privacy assumptions, to recover generalization bounds based on algorithmic privacy. To assess the tightness of our bounds, we study the Gaussian Location Model (GLM) and show that we recover the actual asymptotic rate of the generalization error.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a hierarchical sampling framework for deriving generalization bounds in Hierarchical Federated Learning (HFL). Data sampling is modeled using a multi-layered tree structure that captures dependencies among clients' datasets. Under the assumption that the loss function is Lipschitz continuous, the authors derive bounds on the expected generalization error in terms of the Wasserstein distance between distributions, employing a supersample construction to assess the sensitivity of the learning algorithm to perturbations in a single node of the sampling tree. The framework is shown to recover and strictly strengthen existing conditional mutual information (CMI) bounds for bounded losses, and it can be combined with differential privacy to obtain privacy-based generalization bounds. Tightness is demonstrated by recovering the asymptotic rate of the generalization error in the Gaussian Location Model (GLM).
Significance. If the derivations hold, this work provides a unified approach to generalization bounds in federated learning that incorporates hierarchical data dependencies via a tree model and Wasserstein distances. It bridges Wasserstein-based bounds with CMI and DP-based bounds, offering a way to leverage FL structure for tighter results. The GLM example for tightness is particularly useful for validating the framework in a concrete setting, which could influence future research on generalization in distributed learning.
major comments (1)
- [Gaussian Location Model section] The claim that the bounds recover the actual asymptotic rate of the generalization error in the GLM (abstract and the dedicated GLM section) is load-bearing for the tightness assessment. The primary bound requires Lipschitz continuity of the loss. Standard GLM analysis uses squared loss (or equivalent), which fails to be Lipschitz on unbounded support. The manuscript must clarify whether the GLM rate is obtained by instantiating the Wasserstein bound (under added boundedness to satisfy Lipschitz) or by direct computation on the estimator. Without this, the tightness result does not validate the bound inside the regime where its assumptions apply. Please point to the specific equations or derivation steps.
minor comments (2)
- [Section introducing the supersample construction] The supersample construction and tree-node sensitivity argument would benefit from an explicit diagram or pseudocode to illustrate the hierarchical dependencies and the single-node change.
- [Preliminaries] Notation for the multi-layered tree (e.g., node depths, client subsets) should be summarized in a table or definition box for quick reference.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We appreciate the opportunity to clarify the tightness assessment in the Gaussian Location Model (GLM) section. We address the major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Gaussian Location Model section] The claim that the bounds recover the actual asymptotic rate of the generalization error in the GLM (abstract and the dedicated GLM section) is load-bearing for the tightness assessment. The primary bound requires Lipschitz continuity of the loss. Standard GLM analysis uses squared loss (or equivalent), which fails to be Lipschitz on unbounded support. The manuscript must clarify whether the GLM rate is obtained by instantiating the Wasserstein bound (under added boundedness to satisfy Lipschitz) or by direct computation on the estimator. Without this, the tightness result does not validate the bound inside the regime where its assumptions apply. Please point to the specific equations or derivation steps.
Authors: We thank the referee for identifying this point that requires greater precision in our presentation. In Section 4.2, the GLM tightness result is established in two complementary ways. First, we perform a direct computation of the expected generalization error for the empirical mean estimator under the standard Gaussian location model (Equations 28-32), which yields the known asymptotic rate of order 1/sqrt(m) where m denotes the total number of samples across the hierarchy. Second, we instantiate the main Wasserstein bound (Theorem 3.1) by replacing the squared loss with its truncated version L_B(x, mu) = min{(x - mu)^2, B} for a truncation level B that grows sufficiently slowly with m; this truncation renders the loss Lipschitz while preserving the same asymptotic rate (see the derivation following Equation 33 and Remark 4.2). The resulting Wasserstein bound then matches the direct rate up to universal constants. We will revise the GLM section to explicitly separate these two arguments, add a dedicated paragraph stating the truncation construction, and include a forward reference from the abstract to these steps. This ensures the tightness claim is validated precisely inside the Lipschitz regime of the main theorem. revision: yes
Circularity Check
No circularity: new tree model and supersample construction yield independent Wasserstein bounds that imply prior CMI results
full rationale
The paper introduces a multi-layered hierarchical sampling tree and supersample construction to derive new Wasserstein-distance generalization bounds under the Lipschitz loss assumption. It then specializes this general bound to recover (and strictly imply) existing CMI bounds for bounded losses, and separately checks tightness via the Gaussian Location Model. No quoted derivation step reduces a claimed result to a fitted input, self-citation chain, or definitional renaming; the central claims rest on the novel tree structure and sensitivity analysis, which are not presupposed by the recovered bounds. The GLM rate recovery is presented as an external tightness diagnostic rather than part of the main derivation. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The loss function is Lipschitz continuous
- domain assumption Data is sampled hierarchically and can be modeled with a multi-layered tree structure inducing dependencies among clients' datasets
Reference graph
Works this paper leans on
-
[1]
Advances and Open Problems in Federated Learning,
P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, H. Eichner, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, ...
2021
-
[2]
Communication-Efficient Learning of Deep Networks from Decentralized Data,
H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” Jan. 2023
2023
-
[3]
Deep learning with Elastic Averaging SGD,
S. Zhang, A. Choromanska, and Y . LeCun, “Deep learning with Elastic Averaging SGD,” Oct. 2015
2015
-
[4]
Parallel training of DNNs with Natural Gradient and Parameter Averaging,
D. Povey, X. Zhang, and S. Khudanpur, “Parallel training of DNNs with Natural Gradient and Parameter Averaging,” Jun. 2015
2015
-
[5]
Kish,Survey Sampling
L. Kish,Survey Sampling. Wiley, 1965. [Online]. Available: https://books.google.it/books?id=xiZmAAAAIAAJ
1965
-
[6]
W. G. Cochran,Sampling Techniques, 3rd ed., ser. Wiley Series in Probability and Mathematical Statistics. New York, NY: Wiley, 1977
1977
-
[7]
A theory of the learnable,
L. G. Valiant, “A theory of the learnable,” inProceedings of the Sixteenth Annual ACM Symposium on Theory of Computing - STOC ’84. Not Known: ACM Press, 1984, pp. 436–445
1984
-
[8]
On the uniform convergence of relative frequencies of events to their probabilities,
V . N. Vapnik and A. Y . Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” inMeasures of complexity: festschrift for alexey chervonenkis. Springer, 2015, pp. 11–30
2015
-
[9]
Understand- ing deep learning requires rethinking generalization,
C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understand- ing deep learning requires rethinking generalization,” Feb. 2017
2017
-
[10]
Information-theoretic analysis of generaliza- tion capability of learning algorithms,
A. Xu and M. Raginsky, “Information-theoretic analysis of generaliza- tion capability of learning algorithms,” Nov. 2017
2017
-
[11]
Tightening Mutual Information Based Bounds on Generalization Error,
Y . Bu, S. Zou, and V . V . Veeravalli, “Tightening Mutual Information Based Bounds on Generalization Error,”IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 121–130, May 2020
2020
-
[12]
Reasoning About General- ization via Conditional Mutual Information,
T. Steinke, T.-S. Net, and L. Zakynthinou, “Reasoning About General- ization via Conditional Mutual Information,” 2020
2020
-
[13]
Information-theoretic generalization bounds for black-box learning algorithms,
H. Harutyunyan, M. Raginsky, G. V . Steeg, and A. Galstyan, “Information-theoretic generalization bounds for black-box learning algorithms,” Oct. 2021
2021
-
[14]
Generalization Bounds via Information Density and Conditional Information Density,
F. Hellstrom and G. Durisi, “Generalization Bounds via Information Density and Conditional Information Density,”IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 824–839, Nov. 2020
2020
-
[15]
Tighter expected generalization error bounds via Wasserstein distance,
B. Rodríguez-Gálvez, G. Bassi, R. Thobaben, and M. Skoglund, “Tighter expected generalization error bounds via Wasserstein distance,” Mar. 2022
2022
-
[16]
Optimal Transport for Applied Mathematicians – Calculus of Variations, PDEs and Modeling,
F. Santambrogio, “Optimal Transport for Applied Mathematicians – Calculus of Variations, PDEs and Modeling,” 2015
2015
-
[17]
Distributed Training Strategies for the Structured Perceptron,
R. McDonald, K. Hall, and G. Mann, “Distributed Training Strategies for the Structured Perceptron,”. . ., 2010
2010
-
[18]
What Do We Mean by Generalization in Federated Learning?
H. Yuan, W. Morningstar, L. Ning, and K. Singhal, “What Do We Mean by Generalization in Federated Learning?” Mar. 2022
2022
-
[19]
FedHB: Hierarchical Bayesian Federated Learning,
M. Kim and T. Hospedales, “FedHB: Hierarchical Bayesian Federated Learning,” May 2023
2023
-
[20]
A Review of Federated Learning Methods in Heterogeneous Scenarios,
J. Pei, W. Liu, J. Li, L. Wang, and C. Liu, “A Review of Federated Learning Methods in Heterogeneous Scenarios,”IEEE Transactions on Consumer Electronics, vol. 70, no. 3, pp. 5983–5999, Aug. 2024
2024
-
[21]
Improving Generalization in Federated Learning with Model-Data Mutual Information Regularization: A Posterior Inference Approach,
H. Zhang, C. Li, N. Kan, Z. Zheng, W. Dai, J. Zou, and H. Xiong, “Improving Generalization in Federated Learning with Model-Data Mutual Information Regularization: A Posterior Inference Approach,” 2024
2024
-
[22]
Improved Information The- oretic Generalization Bounds for Distributed and Federated Learning,
L. P. Barnes, A. Dytso, and H. V . Poor, “Improved Information The- oretic Generalization Bounds for Distributed and Federated Learning,” Entropy, vol. 24, no. 9, p. 1178, Aug. 2022
2022
-
[23]
Heterogeneity Matters even More in Distributed Learning: Study from Generalization Perspective,
M. Kavian, R. Chor, M. Sefidgaran, and A. Zaidi, “Heterogeneity Matters even More in Distributed Learning: Study from Generalization Perspective,” May 2025
2025
-
[24]
Generalization in Federated Learning: A Conditional Mutual Information Framework,
Z. Wang, C. Long, and Y . Mao, “Generalization in Federated Learning: A Conditional Mutual Information Framework,” Jun. 2025. APPENDIXA EXTRA PRELIMINARIES AND DEFINITION In this section we breifly overview some of the definitions and results that we will use for the proofs. A. Information theory Definition 14(KL divergence).Given two probability distribu...
2025
-
[25]
Their supersample has one extra index because they average the information metric over the whole ghost subtree, not only the intermediate node
andU i,Vi j (equivalent to ourU i,j 2 ). Their supersample has one extra index because they average the information metric over the whole ghost subtree, not only the intermediate node. We split the proof for theparticipation gapand theout-of-sample gap. Out-of-sample gap:We begin by applying the definition of total variation distance and the Pinsker inequ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.