Federated Martingale Posterior Sampling
Pith reviewed 2026-05-20 13:06 UTC · model grok-4.3
The pith
Clients upload small trainable data embeddings so a server can centrally recover full parameter uncertainty for federated Bayesian neural networks using martingale posteriors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that a one-shot embarrassingly parallel protocol for federated martingale posterior sampling, relying on clients uploading small sets of trainable data embeddings, allows the server to perform predictive sampling centrally and recover uncertainty estimates that closely match those from centralized counterparts.
What carries the argument
Trainable data embeddings uploaded by clients, which enable the central server to simulate the effect of having access to full local datasets for the martingale posterior sampling process.
If this is right
- The method provides a practical way to perform Bayesian inference in federated settings without data sharing.
- It leads to improved predictive calibration compared to standard federated averaging or consensus methods.
- It bypasses the need for eliciting meaningful priors on high-dimensional parameter spaces.
- The approach is suitable for modern overparameterized models like those used in image classification.
Where Pith is reading between the lines
- This could be extended to other domains where data privacy is critical, such as healthcare, by using embeddings to preserve privacy.
- The size of the uploaded embeddings might be optimized further to balance communication cost and accuracy.
- If the embeddings are learned jointly, it might allow for better adaptation to the predictive sampling procedure.
Load-bearing premise
A small set of trainable data embeddings uploaded by clients contains sufficient information for the central server to recover parameter uncertainty equivalent to running the predictive sampler on the full local datasets.
What would settle it
Running the predictive sampler on full local datasets versus using the uploaded embeddings and observing a large discrepancy in the resulting parameter uncertainty distributions would falsify the approach.
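One way to make this test concrete is to compare the two sets of posterior draws directly. The sketch below is illustrative only and assumes scalar summaries of the parameters (for example, one weight or one predictive probability); the function names `wasserstein1_equal_size` and `spread_ratio` are hypothetical and do not come from the paper.

```python
import numpy as np


def wasserstein1_equal_size(draws_full, draws_embed):
    """1-D Wasserstein-1 distance between two equally sized sets of draws.

    For equal sample sizes this is the mean absolute difference between
    the sorted samples.
    """
    a = np.sort(np.asarray(draws_full, dtype=float).ravel())
    b = np.sort(np.asarray(draws_embed, dtype=float).ravel())
    if a.shape != b.shape:
        raise ValueError("expects equally many draws from each sampler")
    return float(np.mean(np.abs(a - b)))


def spread_ratio(draws_full, draws_embed):
    """Ratio of posterior standard deviations (embedding-based / full-data)."""
    return float(np.std(draws_embed) / np.std(draws_full))
```

A distance near zero and a spread ratio near one would be consistent with the premise; a large distance, or a ratio far from one, would count against it. What counts as "large" is a judgment the source does not specify.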
Original abstract
Federated Bayesian neural networks require fixing a prior on the model parameters together with a likelihood. Eliciting meaningful priors on the weight space of modern overparameterized models is notoriously difficult, and misspecification of either component can severely degrade accuracy and calibration. Motivated by the rapid progress of predictive models such as large language models, the martingale posterior, also known as predictive Bayes, replaces the prior-likelihood pair with a predictive distribution and recovers parameter uncertainty by repeatedly drawing predictive samples and refitting the model. A direct federated implementation, however, would require clients to share the local data sets. This letter proposes federated martingale posterior (FMP) sampling, a one-shot embarrassingly parallel protocol in which each client uploads a small set of trainable data embeddings and the server runs the predictive sampler centrally. Experiments on MNIST, CIFAR-10, and CIFAR-100 show that FMP closely matches the centralized counterpart and significantly improves calibration over consensus-style baselines.
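The "draw predictive samples, then refit" loop can be illustrated on a toy problem. The sketch below is not the paper's neural-network sampler: it assumes the simplest possible predictive distribution (the empirical distribution of everything seen so far, i.e. a Pólya urn) and a trivial "model" whose only parameter is the mean. The function name and all defaults are hypothetical.

```python
import numpy as np


def martingale_posterior_mean(data, n_forward=200, n_draws=200, seed=0):
    """Toy martingale posterior for a mean via predictive resampling.

    Each posterior draw: extend the observed data with n_forward imputed
    points, each sampled from the empirical distribution of the sequence
    so far, then refit (here: recompute the mean) on the extended sequence.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.size
    draws = np.empty(n_draws)
    for b in range(n_draws):
        # counts[i] = how many copies of data[i] are in the urn right now
        counts = np.ones(n)
        for _ in range(n_forward):
            i = rng.choice(n, p=counts / counts.sum())
            counts[i] += 1  # the imputed point joins the sequence
        # "Refit" on observed + imputed data
        draws[b] = np.dot(counts, data) / counts.sum()
    return draws
```

The spread of the returned draws is the parameter uncertainty. In the federated setting described above, the server would run an analogous loop using client-uploaded embeddings in place of `data`; how the paper does this is not detailed in the abstract.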
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Federated Martingale Posterior (FMP) sampling, a one-shot embarrassingly parallel protocol for federated Bayesian neural networks. Clients upload small sets of trainable data embeddings rather than raw local datasets; the server then centrally runs the martingale posterior sampler (repeated predictive draws followed by refits) to recover parameter uncertainty. Experiments on MNIST, CIFAR-10, and CIFAR-100 are reported to show that FMP closely matches the performance of the centralized martingale posterior while improving calibration relative to consensus-style federated baselines.
Significance. If the central assumption holds, the work would offer a practical route to well-calibrated uncertainty estimates in federated settings without requiring clients to share raw data or the server to elicit a prior on high-dimensional weights. The one-shot, embarrassingly parallel design and the reported calibration gains on standard image benchmarks constitute the main potential contribution.
major comments (2)
- [FMP protocol description] The load-bearing claim that a small set of trainable data embeddings uploaded by each client suffices for the server to recover parameter uncertainty equivalent to running the predictive sampler on the full local datasets receives no supporting analysis. No approximation-error bound, information-loss characterization, or description of embedding dimensionality and training objective relative to the original data distribution is provided (see the FMP protocol description).
- [Experiments] The experimental section reports that FMP matches centralized performance and improves calibration on MNIST, CIFAR-10, and CIFAR-100, yet contains no ablations on embedding size, number of embeddings per client, or the embedding training objective. Without these controls it is impossible to determine whether the reported calibration advantage is robust or an artifact of particular hyper-parameter choices.
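For readers who want to run such ablations, calibration is commonly summarized by the expected calibration error (ECE). The source text here does not state which calibration metric the paper reports, so the sketch below is an assumption about a reasonable choice, using equal-width confidence bins; the function name and the bin count are illustrative.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins.

    probs:  (N, C) array of predicted class probabilities
    labels: (N,) array of integer class labels
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            # |accuracy - mean confidence| weighted by the bin's share of samples
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```

For a sampler that returns several parameter draws, `probs` would typically be the average of the per-draw predictive probabilities. Repeating the computation across embedding sizes and embedding counts per client would give the kind of sensitivity table the comment above asks for.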
minor comments (2)
- [Method] Clarify the precise training objective used to learn the client embeddings and how it relates to the predictive distribution employed by the martingale posterior.
- [Method] Add a short discussion of how the one-shot protocol interacts with the repeated refitting steps of the martingale posterior sampler.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we intend to make.
- Referee: [FMP protocol description] The load-bearing claim that a small set of trainable data embeddings uploaded by each client suffices for the server to recover parameter uncertainty equivalent to running the predictive sampler on the full local datasets receives no supporting analysis. No approximation-error bound, information-loss characterization, or description of embedding dimensionality and training objective relative to the original data distribution is provided (see the FMP protocol description).
Authors: We acknowledge that the current manuscript provides no formal approximation-error bound or information-loss analysis for the embedding approximation. The protocol relies on the empirical observation that a modest number of trainable embeddings, optimized to match local predictive statistics, enable the server-side martingale sampler to recover uncertainty comparable to the centralized case. In revision we will expand the protocol description to specify embedding dimensionality, the exact training objective (a predictive matching loss), and its relation to the local data distribution. We will also add a short discussion of the empirical justification and the limitations of the approach. A complete theoretical characterization lies outside the scope of this short letter. revision: partial
- Referee: [Experiments] The experimental section reports that FMP matches centralized performance and improves calibration on MNIST, CIFAR-10, and CIFAR-100, yet contains no ablations on embedding size, number of embeddings per client, or the embedding training objective. Without these controls it is impossible to determine whether the reported calibration advantage is robust or an artifact of particular hyper-parameter choices.
Authors: We agree that systematic ablations would strengthen the experimental claims. In the revised version we will add results varying the number of embeddings per client and embedding dimensionality on MNIST (and, space permitting, on CIFAR-10). These controls will be placed in the main text or an appendix. We will also clarify the embedding training objective in the methods section so that readers can assess sensitivity to these choices. revision: yes
- A rigorous approximation-error bound or information-loss characterization for the data-embedding approximation used in the FMP protocol.
Circularity Check
No significant circularity; new federated protocol remains distinct from inputs
The paper introduces a one-shot federated protocol in which clients upload trainable data embeddings and the server centrally runs the martingale posterior sampler. The abstract and provided text describe this as a direct adaptation of the existing predictive Bayes framework to avoid sharing full local datasets, with performance equivalence demonstrated via experiments on MNIST, CIFAR-10, and CIFAR-100. No equations, self-citations, or definitional reductions are present that would make the central claim equivalent to its inputs by construction. The load-bearing assumption about embedding sufficiency is presented as an empirical claim rather than a fitted or renamed quantity, leaving the derivation chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the martingale posterior recovers parameter uncertainty via repeated predictive sampling and refitting.