Decentralized Machine Learning with Centralized Performance Guarantees via Gibbs Algorithms

Iñaki Esnaola; Samir M. Perlaza; Yaiza Bermudez

arXiv: 2604.20492 · v1 · submitted 2026-04-22 · stat.ML · cs.IT · cs.LG · math.IT

Pith reviewed 2026-05-09 23:27 UTC · model grok-4.3

keywords centralized · decentralized · learning · performance · gibbs · local · measure · reference

The pith

Decentralized ERM-RER achieves centralized performance guarantees by chaining local Gibbs measures as reference distributions with sample-size-scaled regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When data lives on separate devices, sending everything to one server risks privacy. Here each device solves a regularized learning problem that produces a Gibbs measure, a probability distribution over models that balances fit to local data against distance from a reference distribution. The device passes its Gibbs measure to the next device, which uses it as the new reference. By scaling how strongly each device regularizes according to how many samples it has, the chain of devices ends up with the same overall performance as if one central server had seen every sample. The approach replaces data sharing with sharing of learned inductive bias.
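The chaining described above can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the paper: it assumes a finite set of candidate models, a squared loss, a uniform initial reference, and the convention used elsewhere on this page that client k tilts its reference by exp(−λ_k R_k) with λ_k = Λ ⋅ n_k / N. The function names `gibbs_update` and `empirical_risk` and all numeric values are hypothetical.

```python
import math

def gibbs_update(reference, risks, weight):
    """Tilt a discrete reference measure by exp(-weight * risk), then renormalize."""
    unnorm = [q * math.exp(-weight * r) for q, r in zip(reference, risks)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def empirical_risk(model, data):
    """Average squared loss of the constant predictor `model` on `data`."""
    return sum((model - x) ** 2 for x in data) / len(data)

# Hypothetical toy setup: five candidate models, three devices with unequal sample counts.
models = [-1.0, 0.0, 0.5, 1.0, 2.0]
datasets = [[0.1, 0.4, 0.3], [1.2, 0.8], [0.5, 0.6, 0.7, 0.2, 0.9]]
N = sum(len(d) for d in datasets)
Lambda = 3.0                                  # overall regularization weight (assumed)
prior = [1.0 / len(models)] * len(models)     # initial reference measure (assumed uniform)

# Chain: each device uses the previous device's Gibbs measure as its reference.
measure = prior
for data in datasets:
    risks = [empirical_risk(m, data) for m in models]
    measure = gibbs_update(measure, risks, Lambda * len(data) / N)
decentralized = measure

# Centralized baseline: one learner sees the pooled samples.
pooled = [x for d in datasets for x in d]
centralized = gibbs_update(prior, [empirical_risk(m, pooled) for m in models], Lambda)

max_gap = max(abs(a - b) for a, b in zip(decentralized, centralized))
```

In this setting `max_gap` is zero up to floating-point rounding, since the exponents accumulated along the chain sum to Λ times the pooled average risk.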

Core claim

When clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes.

Load-bearing premise

A specific scaling of the regularization factors with the local sample sizes must be used exactly, and the forward-backward reference-measure chaining must be followed; any deviation in scaling or communication order would break the claimed performance equivalence.
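To illustrate why the scaling is load-bearing, the following sketch compares the prescribed weights λ_k = Λ ⋅ n_k / N with a naive choice in which every client uses the full weight Λ. It is a toy construction under stated assumptions, not the paper's experiment: two candidate models, two clients, squared loss, a uniform initial reference, and the exp(−λ_k R_k) tilting convention used on this page. The helper names `tilt` and `chain` are hypothetical.

```python
import math

def tilt(reference, risks, weight):
    """One Gibbs update on a finite model set: reweight by exp(-weight * risk)."""
    unnorm = [q * math.exp(-weight * r) for q, r in zip(reference, risks)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def chain(prior, local_risks, weights):
    """Pass the measure through the clients in order, each tilting by its own risk."""
    measure = prior
    for risks, w in zip(local_risks, weights):
        measure = tilt(measure, risks, w)
    return measure

# Hypothetical toy setup: models theta in {0, 1}, squared loss.
# Client 1 holds samples [0, 0, 0]; client 2 holds the single sample [1].
sizes = [3, 1]
local_risks = [[0.0, 1.0], [1.0, 0.0]]    # R_k(theta) for theta = 0 and theta = 1
N = sum(sizes)
Lambda = 2.0
prior = [0.5, 0.5]

# Centralized risk is the sample-size-weighted average of the local risks.
global_risk = [sum(n * r[i] for n, r in zip(sizes, local_risks)) / N for i in range(2)]
centralized = tilt(prior, global_risk, Lambda)

scaled = chain(prior, local_risks, [Lambda * n / N for n in sizes])
unscaled = chain(prior, local_risks, [Lambda for _ in sizes])

gap_scaled = max(abs(a - b) for a, b in zip(scaled, centralized))
gap_unscaled = max(abs(a - b) for a, b in zip(unscaled, centralized))
```

In this toy case the unscaled chain ends at the uniform measure, because the two clients' tilts cancel, while the centralized measure favors the model supported by the larger dataset. Only the scaled chain reproduces the centralized measure.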

Figures

Figures reproduced from arXiv: 2604.20492 by Iñaki Esnaola, Samir M. Perlaza, Yaiza Bermudez.

Original abstract

In this paper, it is shown, for the first time, that centralized performance is achievable in decentralized learning without sharing the local datasets. Specifically, when clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. The core idea is that the Gibbs measure produced by client $k$ is used, as reference measure, by client $k+1$. This effectively establishes a principled way to encode prior information through a reference measure. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes. Overall, this result opens the door to novel decentralized learning paradigms that shift the collaboration strategy from sharing data to sharing the local inductive bias via the reference measures over the set of models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that chaining local Gibbs measures with scaled regularization lets decentralized ERM-RER match centralized performance exactly via an algebraic equivalence.

The key point is that this work gives a clean algebraic route for decentralized clients to reach the same performance as a centralized ERM-RER learner by passing Gibbs measures instead of data, but only when regularization parameters scale exactly with local sample sizes and the forward-backward order is followed. The stress-test note is right that the multiplicative structure of the updates produces the centralized measure when λ_k = Λ ⋅ n_k / N, and this holds for arbitrary losses provided risks are simple averages.

That is the actual new piece: extending the existing ERM-RER and Gibbs-measure framework to a sequential decentralized communication setting with an explicit scaling rule that preserves the equivalence. The paper does well in framing the collaboration as sharing inductive bias through reference measures rather than raw data, which directly addresses privacy constraints in federated settings. The math is straightforward and does not appear to rely on circular definitions or fitted quantities.

Soft spots are real but contained. The guarantee is brittle to any deviation in scaling or communication sequence, and the abstract gives no hint of experiments that test robustness under approximate scaling or noisy communication. It also stays within this one regularization family, so it does not speak to broader decentralized algorithms. No full citation pattern or empirical validation is visible from the provided material, which limits immediate applicability claims.

This paper is for researchers working on theoretical guarantees in federated and decentralized learning who care about exact performance matching without data movement. It deserves a serious referee because the core algebraic claim is verifiable and the setting is practically relevant, even if revisions would likely be needed for experiments and scope.

Referee Report

0 major / 3 minor

Summary. The paper claims that decentralized clients using empirical risk minimization with relative-entropy regularization (ERM-RER) can achieve exactly the same performance as a centralized ERM-RER by communicating only local Gibbs measures in a forward-backward chain (each client's output measure serving as the next client's reference), provided the regularization parameters are scaled linearly with local sample sizes (specifically λ_k proportional to n_k / N). The equivalence is presented as algebraic, relying on the multiplicative structure of successive Gibbs updates, and holds for arbitrary loss functions when empirical risks are defined as averages.

Significance. If the algebraic equivalence holds, the result is significant for privacy-preserving distributed learning: it shows how to obtain centralized performance guarantees by exchanging inductive biases (via reference measures) rather than raw data. The parameter-free character of the final composite measure once scaling is fixed, and the fact that the equivalence is exact rather than approximate, are notable strengths that could influence future work on decentralized optimization and federated learning.

minor comments (3)

The abstract and introduction assert the equivalence without a compact proof sketch; adding a one-paragraph derivation outline (highlighting the multiplicative chaining exp(−λ_k R_k) and the required scaling λ_k = Λ ⋅ n_k / N) would improve accessibility.
Notation for the composite reference measure after the forward pass and the final backward distribution step should be introduced with explicit equations early in the manuscript to avoid ambiguity when readers compare decentralized and centralized Gibbs measures.
The paper should clarify whether the result extends to non-convex losses or requires the loss to be bounded; if the latter, an explicit assumption statement would strengthen the claims.
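A compact outline of the kind the first comment asks for might read as follows. It is a sketch in this page's notation rather than the manuscript's, which may parametrize the regularization differently. The symbols are assumptions introduced for illustration: $\mu_0$ is the initial reference measure, $R_k$ the average empirical risk on the $n_k$ samples of client $k$, $P_k$ the measure held after client $k$ has updated, $N = \sum_k n_k$, and $Z_k$, $Z$ are normalizing constants assumed finite.

```latex
\begin{align}
\frac{\mathrm{d}P_k}{\mathrm{d}P_{k-1}}(\theta)
  &= \frac{\exp\bigl(-\lambda_k R_k(\theta)\bigr)}{Z_k},
  \qquad P_0 = \mu_0, \quad k = 1, \dots, K, \\
\frac{\mathrm{d}P_K}{\mathrm{d}\mu_0}(\theta)
  &= \prod_{k=1}^{K} \frac{\exp\bigl(-\lambda_k R_k(\theta)\bigr)}{Z_k}
   = \frac{\exp\bigl(-\sum_{k=1}^{K} \lambda_k R_k(\theta)\bigr)}{\prod_{k=1}^{K} Z_k}, \\
\sum_{k=1}^{K} \lambda_k R_k(\theta)
  &= \Lambda \sum_{k=1}^{K} \frac{n_k}{N} R_k(\theta)
   = \Lambda R_N(\theta)
  \qquad \text{when } \lambda_k = \Lambda \, \frac{n_k}{N}, \\
\frac{\mathrm{d}P_K}{\mathrm{d}\mu_0}(\theta)
  &= \frac{\exp\bigl(-\Lambda R_N(\theta)\bigr)}{Z},
  \qquad Z = \prod_{k=1}^{K} Z_k .
\end{align}
```

The last line is the centralized Gibbs measure with reference $\mu_0$ and weight $\Lambda$. The identification of $Z$ with the product of the $Z_k$ follows because $P_K$ integrates to one.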

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The referee's summary accurately captures the algebraic nature of our main result on decentralized ERM-RER achieving exact centralized performance guarantees through chained local Gibbs measures with sample-size-scaled regularization parameters.

Referee: The paper claims that decentralized clients using empirical risk minimization with relative-entropy regularization (ERM-RER) can achieve exactly the same performance as a centralized ERM-RER by communicating only local Gibbs measures in a forward-backward chain (each client's output measure serving as the next client's reference), provided the regularization parameters are scaled linearly with local sample sizes (specifically λ_k proportional to n_k / N). The equivalence is presented as algebraic, relying on the multiplicative structure of successive Gibbs updates, and holds for arbitrary loss functions when empirical risks are defined as averages.

Authors: We appreciate the referee's precise summary of the contribution. The equivalence follows directly from the multiplicative form of the Gibbs densities: when each local update uses λ_k = Λ ⋅ n_k / N, the product of the successive factors exp(−λ_k R_k) recovers exactly the centralized Gibbs measure, whose density with respect to the initial reference μ_0 is exp(−Λ R_N(θ)) / Z, where R_N = ∑ (n_k/N) R_k is the global average empirical risk and Z is the normalizing constant. This factorization holds for any loss function because the empirical risks are defined as averages, so the exponents add linearly. The forward-backward chain simply propagates the reference measures without requiring data sharing. No revision is required on this point as the derivation is already explicit in the manuscript.

Revision: no

Circularity Check

0 steps flagged

No significant circularity

The central claim is an algebraic identity: successive forward chaining of Gibbs measures (each tilting the reference by exp(−λ_k R_k)) yields exactly the centralized Gibbs measure exp(−Λ ∑ (n_k/N) R_k) ⋅ μ_0 when the local regularization parameters are set to λ_k = Λ ⋅ n_k / N. This follows directly from the multiplicative definition of the Gibbs measure and the fact that the global empirical risk is the weighted average of local risks; the backward pass merely broadcasts the identical final measure. The scaling rule is an explicit design choice that makes the identity hold for arbitrary losses, not a fitted parameter or self-referential definition. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the derivation chain.
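The forward chaining and backward broadcast described in this rationale can be sketched as a small message-passing routine. This is a minimal illustration under stated assumptions, not the authors' protocol: models form a finite set, the loss is an absolute loss (chosen only to show that the identity does not depend on a particular loss), the total sample count N is assumed known to every client, and the `Client` class and `forward_backward` function are hypothetical names.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Client:
    samples: list                                   # private data; never leaves the client
    received: list = field(default_factory=list)    # measure obtained on the backward pass

    def local_risk(self, model):
        # Average absolute loss; any per-sample loss could be substituted here.
        return sum(abs(model - x) for x in self.samples) / len(self.samples)

    def update(self, reference, models, weight):
        # Tilt the incoming reference by exp(-weight * local risk) and renormalize.
        tilted = [q * math.exp(-weight * self.local_risk(m))
                  for q, m in zip(reference, models)]
        total = sum(tilted)
        return [t / total for t in tilted]

def forward_backward(clients, models, prior, Lambda):
    n_total = sum(len(c.samples) for c in clients)  # assumed known to every client
    message = list(prior)
    for c in clients:                               # forward pass: chain the references
        message = c.update(message, models, Lambda * len(c.samples) / n_total)
    for c in reversed(clients):                     # backward pass: broadcast the result
        c.received = list(message)
    return message

# Hypothetical example values.
clients = [Client([0.0, 1.0]), Client([2.0]), Client([0.5, 0.5, 3.0])]
models = [0.0, 1.0, 2.0]
prior = [0.2, 0.5, 0.3]
final = forward_backward(clients, models, prior, Lambda=1.5)
```

Only probability vectors over `models` cross client boundaries in this sketch; the `samples` lists are read exclusively inside each client's own methods.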

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based on abstract only: the result rests on standard properties of relative-entropy regularization and Gibbs measures plus one design choice (sample-size scaling) whose justification is not visible.

free parameters (1)

scaling of regularization factors with local sample sizes
Abstract states that a specific scaling is required for the equivalence; this parameter is chosen to match centralized performance.

axioms (1)

domain assumption: Gibbs measures obtained from local ERM-RER can be chained as reference measures to preserve overall performance equivalence under forward-backward communication.
Core mechanism described in abstract; invoked to justify sharing only the measures rather than data.
