BlazeFL: Fast and Deterministic Federated Learning Simulation
Pith reviewed 2026-05-13 19:04 UTC · model grok-4.3
The pith
BlazeFL achieves deterministic federated learning simulations up to 3.1 times faster than baselines through thread-based execution and per-client RNG streams.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BlazeFL is a lightweight framework that enables fast and deterministic federated learning simulation on a single node. It uses free-threaded shared-memory execution for in-memory parameter exchange, avoiding the overhead of serialization and inter-process communication. Each client is assigned an isolated RNG stream, so that when stochastic operators draw from these generators, repeated runs produce bitwise-identical results, even under high concurrency, in both thread-based and process-based modes. In CIFAR-10 image-classification experiments, it achieves up to 3.1× speedup over a widely used open-source baseline, particularly on communication-dominated workloads, while keeping a lightweight dependency footprint.
What carries the argument
Isolated per-client random number generator streams combined with thread-based shared-memory parameter exchange.
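The difference between in-memory exchange and serialized exchange can be illustrated with a minimal sketch. It uses only the Python standard library and is not BlazeFL's API; `client_step`, `thread_round`, and `ipc_copy` are hypothetical names, and `pickle` stands in for whatever serialization a process-based design would use.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def client_step(global_params: list[float]) -> tuple[int, list[float]]:
    """Read the shared global parameters and return a local update."""
    # id() identifies the object the worker received; the update is a new
    # list, so the shared global parameters are never mutated concurrently.
    return id(global_params), [w + 1.0 for w in global_params]


def thread_round(
    global_params: list[float], num_clients: int
) -> list[tuple[int, list[float]]]:
    """Run one round with threads: every client sees the same object."""
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        return list(pool.map(client_step, [global_params] * num_clients))


def ipc_copy(global_params: list[float]) -> list[float]:
    """What crossing a process boundary implies: serialize, then rebuild."""
    return pickle.loads(pickle.dumps(global_params))
```

In the threaded round every worker receives a reference to the same parameter object, whereas `ipc_copy` always builds a fresh copy. That copy is the per-exchange cost the shared-memory design is meant to avoid. On a standard CPython build the threads still share memory but do not run Python bytecode in parallel; the free-threaded build is what makes the approach fast.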
Load-bearing premise
All stochastic operators inside client training code must be configured to draw from the BlazeFL-provided per-client RNG streams rather than global or framework-default generators.
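A minimal sketch of what this premise asks of client code, using only the Python standard library. It is not BlazeFL's implementation: `make_client_rng`, `client_update`, and `run_round` are hypothetical names, and the SHA-256 seed derivation is an assumption chosen for illustration.

```python
import hashlib
import random
from concurrent.futures import ThreadPoolExecutor


def make_client_rng(base_seed: int, client_id: int) -> random.Random:
    """Derive an independent generator for one client (hypothetical helper)."""
    digest = hashlib.sha256(f"{base_seed}:{client_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def client_update(client_id: int, rng: random.Random) -> list[float]:
    """Stand-in for local training: every random draw goes through `rng`."""
    order = list(range(8))
    rng.shuffle(order)  # e.g. shuffling the local dataset
    noise = [rng.gauss(0.0, 1.0) for _ in order]  # e.g. augmentation noise
    return [i + n for i, n in zip(order, noise)]


def run_round(base_seed: int, num_clients: int, workers: int) -> list[list[float]]:
    """Run all clients concurrently, one private generator per client."""
    rngs = [make_client_rng(base_seed, cid) for cid in range(num_clients)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in client order, not in completion order.
        return list(pool.map(client_update, range(num_clients), rngs))
```

Because each client touches only its own generator, the output does not depend on how threads are scheduled or on the number of workers. Replacing `rng.shuffle(order)` with the module-level `random.shuffle(order)` would make all clients draw from one shared global state, so results would vary with thread interleaving; that is the incorrect usage the premise rules out.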
What would settle it
Running the same high-concurrency CIFAR-10 federated learning simulation multiple times and checking whether the final model parameters and accuracy metrics match exactly bitwise on every run.
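One way to make such a check concrete is to fingerprint the final parameters at the byte level and compare fingerprints across runs. The sketch below is an assumption about how one might do this with the standard library; `fingerprint` is a hypothetical helper, and plain lists of floats stand in for model tensors.

```python
import hashlib
import struct


def fingerprint(params: dict[str, list[float]]) -> str:
    """Hash parameter names and the raw IEEE 754 bytes of their values."""
    h = hashlib.sha256()
    for name in sorted(params):  # fixed order so the hash is reproducible
        values = params[name]
        h.update(name.encode())
        # "<d" packs each value as a little-endian 64-bit float, so two
        # values that differ in a single bit give different digests.
        h.update(struct.pack(f"<{len(values)}d", *values))
    return h.hexdigest()
```

Comparing digests is stricter than comparing accuracy or using a tolerance: `0.1 + 0.2` and `0.3` hash differently, as do `0.0` and `-0.0`, which is the level of agreement a bitwise-identical claim requires.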
Original abstract
Federated learning (FL) research increasingly relies on single-node simulations with hundreds or thousands of virtual clients, making both efficiency and reproducibility essential. Yet parallel client training often introduces nondeterminism through shared random state and scheduling variability, forcing researchers to trade throughput for reproducibility or to implement custom control logic within complex frameworks. We present BlazeFL, a lightweight framework for single-node FL simulation that alleviates this trade-off through free-threaded shared-memory execution and deterministic randomness management. BlazeFL uses thread-based parallelism with in-memory parameter exchange between the server and clients, avoiding serialization and inter-process communication overhead. To support deterministic execution, BlazeFL assigns isolated random number generator (RNG) streams to clients. Under a fixed software/hardware stack, and when stochastic operators consume BlazeFL-managed generators, this design yields bitwise-identical results across repeated high-concurrency runs in both thread-based and process-based modes. In CIFAR-10 image-classification experiments, BlazeFL substantially reduces execution time relative to a widely used open-source baseline, achieving up to 3.1× speedup on communication-dominated workloads while preserving a lightweight dependency footprint. Our open-source implementation is available at: https://github.com/kitsuyaazuma/blazefl.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BlazeFL, a lightweight framework for single-node federated learning simulation. It employs free-threaded shared-memory execution with in-memory parameter exchange to avoid serialization and IPC overhead, while assigning isolated RNG streams to clients to achieve bitwise-identical results across repeated runs (conditional on stochastic operators consuming the managed generators). CIFAR-10 experiments report up to 3.1× speedup versus an open-source baseline on communication-dominated workloads, with a small dependency footprint and open-source release.
Significance. If the results hold under the stated conditions, BlazeFL addresses a practical pain point in FL research by reducing the speed-reproducibility trade-off for high-concurrency single-node simulations. The explicit construction of determinism via RNG isolation, combined with measured speedups against an external baseline and an open implementation, positions it as a useful engineering contribution for the community.
minor comments (2)
- Abstract: the phrase 'when stochastic operators consume BlazeFL-managed generators' is central to the determinism claim but appears only once; expanding this boundary condition with a brief example of correct versus incorrect usage would improve clarity for readers implementing client code.
- The manuscript would benefit from an explicit statement in the experimental section on whether the reported timing results include any overhead from RNG stream initialization or management.
Simulated Author's Rebuttal
We thank the referee for the positive review and the recommendation to accept. The assessment accurately captures BlazeFL's focus on reducing the speed-reproducibility trade-off in single-node FL simulations through shared-memory threading and isolated RNG streams.
Circularity Check
No significant circularity identified
The paper presents an engineering framework whose core claims (bitwise determinism under explicit RNG isolation and measured speedups) are achieved by construction via per-client generator streams and thread-based in-memory exchange. These are stated as conditional on client code consuming the provided RNGs, with performance numbers obtained by direct timing against an external open-source baseline rather than any fitted parameter or self-referential derivation. No equations, predictions, or uniqueness theorems appear that reduce to the paper's own inputs; the design is self-contained against external benchmarks.