Decentralized Machine Learning with Centralized Performance Guarantees via Gibbs Algorithms
Pith reviewed 2026-05-09 23:27 UTC · model grok-4.3
The pith
Decentralized ERM-RER achieves centralized performance guarantees by chaining local Gibbs measures as reference distributions with sample-size-scaled regularization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes.
Load-bearing premise
A specific scaling of the regularization factors with the local sample sizes must be used exactly, and the forward-backward reference-measure chaining must be followed; any deviation in scaling or communication order would break the claimed performance equivalence.
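The sensitivity to the scaling can be illustrated numerically. The sketch below is not from the paper: it assumes a finite set of three candidate models, two clients with made-up local risks, the convention from the referee summary that each client tilts its reference measure by exp(−λ_k R_k), and the scaling λ_k = Λ ⋅ n_k / N. All names and numbers (`local_risks`, `Lambda`, `chain`, `total_variation`) are hypothetical.

```python
import numpy as np

# Hypothetical two-client example over three candidate models.
sample_sizes = np.array([10, 90])             # n_1, n_2 (assumed values)
local_risks = np.array([[0.9, 0.2, 0.5],      # R_1(theta) for each model
                        [0.1, 0.6, 0.5]])     # R_2(theta) for each model
N = sample_sizes.sum()
Lambda = 5.0                                  # global factor (assumed value)
mu0 = np.full(3, 1.0 / 3.0)                   # uniform initial reference measure


def chain(factors):
    """Forward pass: each client tilts the previous measure by exp(-lam * R_k)."""
    measure = mu0
    for lam, risk in zip(factors, local_risks):
        measure = measure * np.exp(-lam * risk)
        measure = measure / measure.sum()
    return measure


def total_variation(p, q):
    """Total-variation distance between two probability vectors."""
    return 0.5 * np.abs(p - q).sum()


# Centralized Gibbs measure built from the sample-size-weighted average risk.
global_risk = (sample_sizes / N) @ local_risks
central = mu0 * np.exp(-Lambda * global_risk)
central = central / central.sum()

scaled = chain(Lambda * sample_sizes / N)     # prescribed scaling
unscaled = chain(np.array([Lambda, Lambda]))  # deviation: same factor everywhere

print("TV with prescribed scaling:", total_variation(scaled, central))
print("TV without scaling:        ", total_variation(unscaled, central))
```

With the prescribed scaling the chained measure coincides with the centralized one up to floating-point error. Giving every client the full factor Λ instead overweights the small client, and the total-variation distance becomes clearly nonzero.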
Original abstract
In this paper, it is shown, for the first time, that centralized performance is achievable in decentralized learning without sharing the local datasets. Specifically, when clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. The core idea is that the Gibbs measure produced by client $k$ is used, as reference measure, by client $k+1$. This effectively establishes a principled way to encode prior information through a reference measure. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes. Overall, this result opens the door to novel decentralized learning paradigms that shift the collaboration strategy from sharing data to sharing the local inductive bias via the reference measures over the set of models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that decentralized clients using empirical risk minimization with relative-entropy regularization (ERM-RER) can achieve exactly the same performance as a centralized ERM-RER by communicating only local Gibbs measures in a forward-backward chain (each client's output measure serving as the next client's reference), provided the regularization parameters are scaled linearly with local sample sizes (specifically λ_k proportional to n_k / N). The equivalence is presented as algebraic, relying on the multiplicative structure of successive Gibbs updates, and holds for arbitrary loss functions when empirical risks are defined as averages.
Significance. If the algebraic equivalence holds, the result is significant for privacy-preserving distributed learning: it shows how to obtain centralized performance guarantees by exchanging inductive biases (via reference measures) rather than raw data. The parameter-free character of the final composite measure once scaling is fixed, and the fact that the equivalence is exact rather than approximate, are notable strengths that could influence future work on decentralized optimization and federated learning.
minor comments (3)
- The abstract and introduction assert the equivalence without a compact proof sketch; adding a one-paragraph derivation outline (highlighting the multiplicative chaining exp(−λ_k R_k) and the required scaling λ_k = Λ ⋅ n_k / N) would improve accessibility.
- Notation for the composite reference measure after the forward pass and the final backward distribution step should be introduced with explicit equations early in the manuscript to avoid ambiguity when readers compare decentralized and centralized Gibbs measures.
- The paper should clarify whether the result extends to non-convex losses or requires the loss to be bounded; if the latter, an explicit assumption statement would strengthen the claims.
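A compact outline of the kind the first comment asks for might read as follows. It is a sketch under the convention used in this report (client k tilts its reference measure by exp(−λ_k R_k)); the symbols P_k, Z_k, Z, μ_0, and K are notation assumed here for illustration and may differ from the manuscript's.

```latex
\begin{align}
\frac{\mathrm{d}P_k}{\mathrm{d}P_{k-1}}(\theta)
  &= \frac{\exp\left(-\lambda_k R_k(\theta)\right)}{Z_k},
  \qquad P_0 = \mu_0, \quad k = 1, \dots, K, \\
\frac{\mathrm{d}P_K}{\mathrm{d}\mu_0}(\theta)
  &= \prod_{k=1}^{K} \frac{\exp\left(-\lambda_k R_k(\theta)\right)}{Z_k}
   = \frac{\exp\left(-\sum_{k=1}^{K} \lambda_k R_k(\theta)\right)}{\prod_{k=1}^{K} Z_k}, \\
\lambda_k = \Lambda \frac{n_k}{N}
  \;&\Longrightarrow\;
  \frac{\mathrm{d}P_K}{\mathrm{d}\mu_0}(\theta)
   = \frac{\exp\left(-\Lambda R_N(\theta)\right)}{Z},
  \qquad R_N(\theta) = \sum_{k=1}^{K} \frac{n_k}{N} R_k(\theta),
  \quad Z = \prod_{k=1}^{K} Z_k .
\end{align}
```

The second line is the chain rule for Radon-Nikodym derivatives. The identity Z = ∏ Z_k follows because P_K is a probability measure, so the product of the local normalizers must equal the centralized normalizer.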
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. The referee's summary accurately captures the algebraic nature of our main result on decentralized ERM-RER achieving exact centralized performance guarantees through chained local Gibbs measures with sample-size-scaled regularization parameters.
- Referee: The paper claims that decentralized clients using empirical risk minimization with relative-entropy regularization (ERM-RER) can achieve exactly the same performance as a centralized ERM-RER by communicating only local Gibbs measures in a forward-backward chain (each client's output measure serving as the next client's reference), provided the regularization parameters are scaled linearly with local sample sizes (specifically λ_k proportional to n_k / N). The equivalence is presented as algebraic, relying on the multiplicative structure of successive Gibbs updates, and holds for arbitrary loss functions when empirical risks are defined as averages.
Authors: We appreciate the referee's precise summary of the contribution. The equivalence follows directly from the multiplicative form of the Gibbs densities: when each local update uses λ_k = Λ ⋅ n_k / N, the product of the successive factors exp(−λ_k R_k) recovers exactly the centralized Gibbs density exp(−Λ R_N(θ)) / Z with respect to the initial reference measure μ_0, where R_N = ∑ (n_k/N) R_k is the global average empirical risk. This telescoping holds for any loss function because the empirical risks are defined as averages, so the exponents add linearly. The forward-backward chain simply propagates the reference measures without requiring data sharing. No revision is required on this point as the derivation is already explicit in the manuscript.
Revision: no
Circularity Check
No significant circularity
The central claim is an algebraic identity: successive forward chaining of Gibbs measures (each tilting the reference by exp(−λ_k R_k)) yields exactly the centralized Gibbs measure exp(−Λ ∑ (n_k/N) R_k) ⋅ μ_0 when the local regularization parameters are set to λ_k = Λ ⋅ n_k / N. This follows directly from the multiplicative definition of the Gibbs measure and the fact that the global empirical risk is the weighted average of local risks; the backward pass merely broadcasts the identical final measure. The scaling rule is an explicit design choice that makes the identity hold for arbitrary losses, not a fitted parameter or self-referential definition. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the derivation chain.
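The identity can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a finite set of candidate models, randomly generated per-sample losses standing in for real data, and the convention stated above that each client tilts its reference measure by exp(−λ_k R_k). The names `gibbs_tilt`, `losses`, `Lambda`, and the chosen sizes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 candidate models and 3 clients.
num_models = 5
sample_sizes = np.array([20, 50, 30])   # n_k, local sample sizes (assumed)
N = sample_sizes.sum()
Lambda = 4.0                            # global factor (assumed value)

# Random stand-ins for per-sample losses: one (n_k, num_models) array per client.
losses = [rng.uniform(0.0, 1.0, size=(n, num_models)) for n in sample_sizes]
local_risks = [client_losses.mean(axis=0) for client_losses in losses]  # R_k


def gibbs_tilt(reference, risk, lam):
    """Tilt a reference pmf by exp(-lam * risk) and renormalize."""
    unnormalized = reference * np.exp(-lam * risk)
    return unnormalized / unnormalized.sum()


mu0 = np.full(num_models, 1.0 / num_models)  # initial reference measure

# Forward pass: client k uses the measure of client k-1 as its reference,
# with local factor lambda_k = Lambda * n_k / N.
measure = mu0
for n_k, risk_k in zip(sample_sizes, local_risks):
    measure = gibbs_tilt(measure, risk_k, Lambda * n_k / N)

# Centralized measure: a single tilt by the risk averaged over all N samples.
global_risk = np.concatenate(losses, axis=0).mean(axis=0)
centralized = gibbs_tilt(mu0, global_risk, Lambda)

print("measures agree:", np.allclose(measure, centralized))
```

Renormalizing after every client does not affect the result, since the intermediate constants are absorbed by the final normalization. The backward pass is not simulated here because, as described above, it only broadcasts the final measure.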
Axiom & Free-Parameter Ledger
free parameters (1)
- scaling of regularization factors with local sample sizes
axioms (1)
- domain assumption: Gibbs measures obtained from local ERM-RER can be chained as reference measures to preserve overall performance equivalence under forward-backward communication.