
arXiv: 2604.20492 · v1 · submitted 2026-04-22 · stat.ML · cs.IT · cs.LG · math.IT

Decentralized Machine Learning with Centralized Performance Guarantees via Gibbs Algorithms

Pith reviewed 2026-05-09 23:27 UTC · model grok-4.3

keywords: centralized · decentralized · learning · performance · Gibbs · local · measure · reference

The pith

Decentralized ERM-RER achieves centralized performance guarantees by chaining local Gibbs measures as reference distributions with sample-size-scaled regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When data lives on separate devices, sending everything to one server risks privacy. Here each device solves a regularized learning problem that produces a Gibbs measure, a probability distribution over models that balances fit to local data against distance from a reference distribution. The device passes its Gibbs measure to the next device, which uses it as the new reference. By scaling how strongly each device regularizes according to how many samples it has, the chain of devices ends up with the same overall performance as if one central server had seen every sample. The approach replaces data sharing with sharing of learned inductive bias.
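The chaining described above can be checked numerically on a toy problem. The following is a minimal sketch, not the paper's implementation: it assumes a finite grid of candidate models, a squared loss, a uniform initial reference, and the convention used on this page that each client tilts its reference by exp(−λ_k R_k) with λ_k = Λ · n_k / N. All names (`gibbs_update`, `empirical_risk`, `Lam`, the grid and the synthetic data) are hypothetical.

```python
import numpy as np

def gibbs_update(reference, risk, lam):
    """Tilt a reference distribution by exp(-lam * risk) and renormalize."""
    log_w = np.log(reference) - lam * risk
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return w / w.sum()

rng = np.random.default_rng(0)
models = np.linspace(-2.0, 2.0, 201)        # assumed finite model set
sizes = [5, 20, 75]                         # local sample sizes n_k
datasets = [rng.normal(0.5, 1.0, n) for n in sizes]
N = sum(sizes)
Lam = 10.0                                  # assumed centralized parameter

def empirical_risk(data):
    # Average squared loss of every candidate model on one dataset.
    return ((data[:, None] - models[None, :]) ** 2).mean(axis=0)

mu0 = np.full(models.size, 1.0 / models.size)  # uniform initial reference

# Forward pass: each client's Gibbs measure becomes the next client's reference.
measure = mu0
for data, n in zip(datasets, sizes):
    measure = gibbs_update(measure, empirical_risk(data), Lam * n / N)

# Centralized baseline: one update on the pooled data with parameter Lam.
central = gibbs_update(mu0, empirical_risk(np.concatenate(datasets)), Lam)

max_gap = np.abs(measure - central).max()
print("largest pointwise gap:", max_gap)
```

On this toy setup the chained measure and the centralized one agree up to floating-point error, because the pooled average risk is the n_k/N-weighted sum of the local average risks.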

Core claim

When clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes.

Load-bearing premise

A specific scaling of the regularization factors with the local sample sizes must be used exactly, and the forward-backward reference-measure chaining must be followed; any deviation in scaling or communication order would break the claimed performance equivalence.
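The sensitivity to the scaling rule can be illustrated with a small experiment. This is a sketch under assumptions that are not in the paper: a finite model grid, squared loss, heterogeneous synthetic clients, and the page's convention of tilting by exp(−λ_k R_k). It compares the sample-size scaling λ_k = Λ · n_k / N against an equal split λ_k = Λ / K; every name below is hypothetical. It only probes the scaling, not the communication order.

```python
import numpy as np

def gibbs_update(reference, risk, lam):
    """Tilt a reference distribution by exp(-lam * risk) and renormalize."""
    log_w = np.log(reference) - lam * risk
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def chain(mu0, risks, lams):
    """Forward pass: each output measure is the next client's reference."""
    measure = mu0
    for risk, lam in zip(risks, lams):
        measure = gibbs_update(measure, risk, lam)
    return measure

def total_variation(p, q):
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(1)
models = np.linspace(-2.0, 2.0, 201)   # assumed finite model set
sizes = np.array([5, 20, 75])          # local sample sizes n_k
means = [-1.0, 0.0, 1.0]               # assumed heterogeneous clients
risks = [
    ((rng.normal(m, 1.0, n)[:, None] - models) ** 2).mean(axis=0)
    for m, n in zip(means, sizes)
]
N = sizes.sum()
Lam = 10.0                             # assumed centralized parameter
mu0 = np.full(models.size, 1.0 / models.size)

# Centralized target: the global risk is the n_k/N-weighted sum of local risks.
global_risk = sum((n / N) * r for n, r in zip(sizes, risks))
central = gibbs_update(mu0, global_risk, Lam)

scaled = chain(mu0, risks, Lam * sizes / N)                        # sample-size scaling
uniform = chain(mu0, risks, np.full(len(sizes), Lam / len(sizes)))  # equal split

tv_scaled = total_variation(scaled, central)
tv_uniform = total_variation(uniform, central)
print("TV with sample-size scaling:", tv_scaled)
print("TV with equal split:", tv_uniform)
```

With the sample-size scaling the chained measure matches the centralized one up to rounding, while the equal split over-weights the small clients and lands on a visibly different measure.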

Figures

Figures reproduced from arXiv: 2604.20492 by Iñaki Esnaola, Samir M. Perlaza, Yaiza Bermudez.

[Figure 1: Nested Structure: the Gibbs measure produced by client …]
Original abstract

In this paper, it is shown, for the first time, that centralized performance is achievable in decentralized learning without sharing the local datasets. Specifically, when clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. The core idea is that the Gibbs measure produced by client $k$ is used, as reference measure, by client $k+1$. This effectively establishes a principled way to encode prior information through a reference measure. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes. Overall, this result opens the door to novel decentralized learning paradigms that shift the collaboration strategy from sharing data to sharing the local inductive bias via the reference measures over the set of models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that decentralized clients using empirical risk minimization with relative-entropy regularization (ERM-RER) can achieve exactly the same performance as a centralized ERM-RER by communicating only local Gibbs measures in a forward-backward chain (each client's output measure serving as the next client's reference), provided the regularization parameters are scaled linearly with local sample sizes (specifically λ_k proportional to n_k / N). The equivalence is presented as algebraic, relying on the multiplicative structure of successive Gibbs updates, and holds for arbitrary loss functions when empirical risks are defined as averages.

Significance. If the algebraic equivalence holds, the result is significant for privacy-preserving distributed learning: it shows how to obtain centralized performance guarantees by exchanging inductive biases (via reference measures) rather than raw data. The parameter-free character of the final composite measure once scaling is fixed, and the fact that the equivalence is exact rather than approximate, are notable strengths that could influence future work on decentralized optimization and federated learning.

minor comments (3)
  1. The abstract and introduction assert the equivalence without a compact proof sketch; adding a one-paragraph derivation outline (highlighting the multiplicative chaining exp(−λ_k R_k) and the required scaling λ_k = Λ ⋅ n_k / N) would improve accessibility.
  2. Notation for the composite reference measure after the forward pass and the final backward distribution step should be introduced with explicit equations early in the manuscript to avoid ambiguity when readers compare decentralized and centralized Gibbs measures.
  3. The paper should clarify whether the result extends to non-convex losses or requires the loss to be bounded; if the latter, an explicit assumption statement would strengthen the claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The referee's summary accurately captures the algebraic nature of our main result on decentralized ERM-RER achieving exact centralized performance guarantees through chained local Gibbs measures with sample-size-scaled regularization parameters.

Point-by-point responses
  1. Referee: The paper claims that decentralized clients using empirical risk minimization with relative-entropy regularization (ERM-RER) can achieve exactly the same performance as a centralized ERM-RER by communicating only local Gibbs measures in a forward-backward chain (each client's output measure serving as the next client's reference), provided the regularization parameters are scaled linearly with local sample sizes (specifically λ_k proportional to n_k / N). The equivalence is presented as algebraic, relying on the multiplicative structure of successive Gibbs updates, and holds for arbitrary loss functions when empirical risks are defined as averages.

    Authors: We appreciate the referee's precise summary of the contribution. The equivalence follows directly from the multiplicative form of the Gibbs densities: when each local update uses λ_k = (n_k/N) λ, the product of the successive factors recovers exactly the centralized Gibbs density exp(−λ R_N(θ)) / Z with respect to the initial reference measure, where R_N = ∑_k (n_k/N) R_k is the global average empirical risk. This holds for any loss function because the empirical risks are defined as averages, so the exponents add linearly. The forward-backward chain simply propagates the reference measures without requiring data sharing. No revision is required on this point as the derivation is already explicit in the manuscript.

    Revision: no

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The central claim is an algebraic identity: successive forward chaining of Gibbs measures (each tilting the reference by exp(−λ_k R_k)) yields exactly the centralized Gibbs measure, proportional to exp(−Λ ∑_k (n_k/N) R_k) ⋅ μ_0, when the local regularization parameters are set to λ_k = Λ ⋅ n_k / N. This follows directly from the multiplicative definition of the Gibbs measure and the fact that the global empirical risk is the weighted average of local risks; the backward pass merely broadcasts the identical final measure. The scaling rule is an explicit design choice that makes the identity hold for arbitrary losses, not a fitted parameter or self-referential definition. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the derivation chain.
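The identity in this rationale can be written out in a few lines. What follows is a sketch in the notation of this page (tilting by exp(−λ_k R_k) from an initial reference μ_0), not a reproduction of the paper's own statement; the symbols P_k, Z_k, K, R and Z are introduced here for illustration only.

```latex
\begin{align*}
\frac{\mathrm{d}P_k}{\mathrm{d}P_{k-1}}(\theta)
  &= \frac{\exp\bigl(-\lambda_k R_k(\theta)\bigr)}{Z_k},
  \qquad P_0 = \mu_0, \quad k = 1, \dots, K, \\
\frac{\mathrm{d}P_K}{\mathrm{d}\mu_0}(\theta)
  &= \prod_{k=1}^{K} \frac{\exp\bigl(-\lambda_k R_k(\theta)\bigr)}{Z_k}
   = \frac{\exp\Bigl(-\sum_{k=1}^{K} \lambda_k R_k(\theta)\Bigr)}{\prod_{k=1}^{K} Z_k}, \\
\lambda_k = \Lambda \, \frac{n_k}{N}
  \;&\Longrightarrow\;
  \sum_{k=1}^{K} \lambda_k R_k(\theta)
   = \Lambda \sum_{k=1}^{K} \frac{n_k}{N} R_k(\theta)
   = \Lambda \, R(\theta), \\
\frac{\mathrm{d}P_K}{\mathrm{d}\mu_0}(\theta)
  &= \frac{\exp\bigl(-\Lambda R(\theta)\bigr)}{Z},
  \qquad Z = \prod_{k=1}^{K} Z_k .
\end{align*}
```

Here R(θ) = ∑_k (n_k/N) R_k(θ) is the empirical risk on the pooled data when each R_k is an average over n_k samples and N = ∑_k n_k. The last line uses the fact that both sides are probability densities with respect to μ_0, so the product of the local normalizing constants must equal the centralized one.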

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based on abstract only: the result rests on standard properties of relative-entropy regularization and Gibbs measures plus one design choice (sample-size scaling) whose justification is not visible.

free parameters (1)
  • scaling of regularization factors with local sample sizes
    Abstract states that a specific scaling is required for the equivalence; this parameter is chosen to match centralized performance.
axioms (1)
  • Domain assumption: Gibbs measures obtained from local ERM-RER can be chained as reference measures to preserve overall performance equivalence under forward-backward communication.
    Core mechanism described in abstract; invoked to justify sharing only the measures rather than data.

