BlazeFL: Fast and Deterministic Federated Learning Simulation

Kitsuya Azuma; Takayuki Nishio

arxiv: 2604.03606 · v1 · submitted 2026-04-04 · 💻 cs.LG

Pith reviewed 2026-05-13 19:04 UTC · model grok-4.3

classification 💻 cs.LG

keywords: federated learning, simulation, deterministic execution, random number generators, reproducibility, parallelism, machine learning, CIFAR-10


The pith

BlazeFL achieves deterministic federated learning simulations up to 3.1 times faster than a widely used open-source baseline through thread-based execution and per-client RNG streams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Federated learning research often requires simulating hundreds or thousands of clients on one machine, but parallel execution usually creates unpredictable results from shared randomness or varying schedules. BlazeFL tackles this by running clients in threads that share memory directly with the server for parameter updates, cutting out slow serialization steps. It also supplies each client with its own dedicated random number stream so that all random choices stay consistent across runs. A sympathetic reader would care because this removes the need to choose between fast experiments and reliable, repeatable outcomes, letting researchers iterate more quickly on their algorithms. The authors demonstrate this with CIFAR-10 tests that run up to 3.1 times faster on communication-heavy setups.
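To make the serialization point concrete, here is a minimal sketch in plain Python contrasting a thread that receives the server's parameter object by reference with a round trip through `pickle`, which is what process-based exchange typically implies. The function names are hypothetical and are not part of BlazeFL's API.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def exchange_in_memory(global_params):
    """Hand the parameters to a worker thread; the thread sees the same object."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # No copy is made: the worker returns the very object it was given.
        return pool.submit(lambda: global_params).result()


def exchange_serialized(global_params):
    """Simulate inter-process exchange: serialize, then deserialize a copy."""
    return pickle.loads(pickle.dumps(global_params))


params = {"layer1": [0.1, 0.2, 0.3]}
shared = exchange_in_memory(params)
copied = exchange_serialized(params)
print(shared is params)   # True: shared memory, no serialization cost
print(copied is params)   # False: an equal but separate copy was built
```

Because threads share the object, client code in a design like this would need to treat the server's parameters as read-only or copy them before mutating.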

Core claim

BlazeFL is a lightweight framework that enables fast and deterministic federated learning simulation on a single node. It employs free-threaded shared-memory execution for in-memory parameter exchange, avoiding the overhead of serialization and inter-process communication. Each client is assigned an isolated RNG stream, ensuring that when stochastic operators use these generators, executions produce bitwise-identical results across repeated runs, even with high concurrency in both thread-based and process-based modes. In experiments with CIFAR-10 image classification, it achieves up to 3.1× speedup compared to a widely used open-source baseline, particularly on communication-dominated workloads, while keeping a lightweight dependency footprint.

What carries the argument

Isolated per-client random number generator streams combined with thread-based shared-memory parameter exchange.
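The idea can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not BlazeFL's implementation: the seeding scheme, `SEED_STRIDE`, and all function names are hypothetical, and a Gaussian perturbation stands in for local training.

```python
import random
from concurrent.futures import ThreadPoolExecutor

SEED_STRIDE = 1_000_003  # hypothetical constant that keeps per-client seeds distinct


def make_client_rng(base_seed: int, client_id: int) -> random.Random:
    """Build an isolated generator for one client (illustrative seeding scheme)."""
    return random.Random(base_seed * SEED_STRIDE + client_id)


def client_update(rng: random.Random, weights: list[float]) -> list[float]:
    """Stand-in for local training: perturb weights using only the client's own stream."""
    return [w + rng.gauss(0.0, 0.01) for w in weights]


def run_round(base_seed: int, num_clients: int, weights: list[float], max_workers: int) -> list[float]:
    rngs = [make_client_rng(base_seed, cid) for cid in range(num_clients)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in client order, whichever thread finishes first.
        updates = list(pool.map(lambda rng: client_update(rng, weights), rngs))
    # Aggregate in a fixed order so floating-point summation is reproducible too.
    return [sum(u[i] for u in updates) / num_clients for i in range(len(weights))]


# The averaged result does not depend on how many threads ran the clients.
print(run_round(42, 16, [0.0, 1.0], max_workers=1) == run_round(42, 16, [0.0, 1.0], max_workers=8))
```

Two things carry the reproducibility here: no client ever touches another client's generator, and aggregation happens in client order rather than completion order.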

Load-bearing premise

All stochastic operators inside client training code must be configured to draw from the BlazeFL-provided per-client RNG streams rather than global or framework-default generators.
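A small sketch of why this premise matters, using only Python's `random` module. Running clients in two different orders stands in for two different thread schedules; the helper names are hypothetical and the seeding scheme is an assumption made for illustration.

```python
import random


def draws_with_global(order, seed=0):
    """Incorrect usage: every client draws from the one global generator."""
    random.seed(seed)
    # Which value a client gets depends on when it happens to run.
    return {cid: random.random() for cid in order}


def draws_with_streams(order, seed=0):
    """Correct usage: each client draws from its own isolated generator."""
    rngs = {cid: random.Random(seed * 1_000_003 + cid) for cid in order}
    return {cid: rngs[cid].random() for cid in order}


schedule_a = [0, 1, 2]
schedule_b = [2, 1, 0]  # same clients, different execution order

print(draws_with_global(schedule_a) == draws_with_global(schedule_b))    # False
print(draws_with_streams(schedule_a) == draws_with_streams(schedule_b))  # True
```

With the global generator, client 0 receives the first draw under one schedule and the third under the other, so the run is only reproducible if the schedule is. With per-client streams the schedule no longer matters.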

What would settle it

Running the same high-concurrency CIFAR-10 federated learning simulation multiple times and checking whether the final model parameters and accuracy metrics match exactly bitwise on every run.
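One way such a check could be carried out is to hash the raw bytes of the final parameters from each run and compare digests. The sketch below assumes parameters are available as a mapping from names to lists of Python floats; `fingerprint` is a hypothetical helper, not something the paper describes.

```python
import hashlib
import struct


def fingerprint(params: dict[str, list[float]]) -> str:
    """Hash the exact IEEE 754 bytes of every parameter, in a fixed name order."""
    digest = hashlib.sha256()
    for name in sorted(params):
        values = params[name]
        digest.update(name.encode("utf-8"))
        # Pack as little-endian doubles so the hash reflects every bit.
        digest.update(struct.pack(f"<{len(values)}d", *values))
    return digest.hexdigest()


run_1 = {"fc.weight": [0.25, -1.5], "fc.bias": [0.0]}
run_2 = {"fc.bias": [0.0], "fc.weight": [0.25, -1.5]}
run_3 = {"fc.weight": [0.25, -1.5], "fc.bias": [-0.0]}

print(fingerprint(run_1) == fingerprint(run_2))  # True: same bits, key order ignored
print(fingerprint(run_1) == fingerprint(run_3))  # False: 0.0 and -0.0 differ bitwise
```

Comparing digests is stricter than comparing values with `==`, which treats `0.0` and `-0.0` as equal, so it matches the bitwise standard the claim sets.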

Figures

Figures reproduced from arXiv: 2604.03606 by Kitsuya Azuma, Takayuki Nishio.

**Figure 2.** Wall-clock time for five communication rounds on the high-performance server (48 CPU cores, NVIDIA H100) as a function of ...

**Figure 3.** Wall-clock time for five communication rounds on the workstation-class server (32 CPU cores, NVIDIA Quadro RTX 6000) as ...

**Figure 4.** Accumulation of non-deterministic errors in Flower

Original abstract

Federated learning (FL) research increasingly relies on single-node simulations with hundreds or thousands of virtual clients, making both efficiency and reproducibility essential. Yet parallel client training often introduces nondeterminism through shared random state and scheduling variability, forcing researchers to trade throughput for reproducibility or to implement custom control logic within complex frameworks. We present BlazeFL, a lightweight framework for single-node FL simulation that alleviates this trade-off through free-threaded shared-memory execution and deterministic randomness management. BlazeFL uses thread-based parallelism with in-memory parameter exchange between the server and clients, avoiding serialization and inter-process communication overhead. To support deterministic execution, BlazeFL assigns isolated random number generator (RNG) streams to clients. Under a fixed software/hardware stack, and when stochastic operators consume BlazeFL-managed generators, this design yields bitwise-identical results across repeated high-concurrency runs in both thread-based and process-based modes. In CIFAR-10 image-classification experiments, BlazeFL substantially reduces execution time relative to a widely used open-source baseline, achieving up to 3.1× speedup on communication-dominated workloads while preserving a lightweight dependency footprint. Our open-source implementation is available at: https://github.com/kitsuyaazuma/blazefl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BlazeFL gives a clean engineering fix for faster single-node FL simulations with built-in determinism via threads and per-client RNG streams, though the determinism still needs users to wire their code correctly.


BlazeFL is a systems paper that makes single-node federated learning simulations both faster and deterministic by using thread parallelism and separate random streams for each client. The authors show that this approach can deliver up to 3.1 times the speed of a standard baseline on CIFAR-10 tasks where communication dominates. They achieve this with in-memory parameter passing instead of heavier process-based methods.

The determinism comes from assigning isolated RNGs, and the paper includes experiments that confirm bitwise identical results when the setup is followed. The code is open, which helps with checking the claims.

What works well is the focus on a real pain point for FL researchers who need to run many trials quickly. The evaluation is straightforward and compares against an external baseline without fitting to their own data.

The main soft spot is the requirement that client training code must use the provided RNG streams. If it falls back to global generators, the determinism guarantee disappears. This is stated in the paper, but it does mean the framework isn't fully automatic. It's a minor issue for the core claim but could affect how widely it's adopted.

Overall, this is for people building or using FL simulation tools who value both performance and reproducibility. It doesn't push new algorithms but improves the infrastructure. I would send this to peer review. The work is grounded in concrete measurements and addresses a practical need, so referees in systems and FL could give useful feedback on the implementation details.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces BlazeFL, a lightweight framework for single-node federated learning simulation. It employs free-threaded shared-memory execution with in-memory parameter exchange to avoid serialization and IPC overhead, while assigning isolated RNG streams to clients to achieve bitwise-identical results across repeated runs (conditional on stochastic operators consuming the managed generators). CIFAR-10 experiments report up to 3.1× speedup versus an open-source baseline on communication-dominated workloads, with a small dependency footprint and open-source release.

Significance. If the results hold under the stated conditions, BlazeFL addresses a practical pain point in FL research by reducing the speed-reproducibility trade-off for high-concurrency single-node simulations. The explicit construction of determinism via RNG isolation, combined with measured speedups against an external baseline and an open implementation, positions it as a useful engineering contribution for the community.

minor comments (2)

Abstract: the phrase 'when stochastic operators consume BlazeFL-managed generators' is central to the determinism claim but appears only once; expanding this boundary condition with a brief example of correct versus incorrect usage would improve clarity for readers implementing client code.
The manuscript would benefit from an explicit statement in the experimental section on whether the reported timing results include any overhead from RNG stream initialization or management.
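As an illustration of the second comment, the cost of creating the per-client generators can be timed separately from the training loop, so a report can state whether it is included. The sketch below uses Python's standard `random` and `time` modules; the function name and seeding scheme are hypothetical and do not reflect how BlazeFL measures or initializes its streams.

```python
import random
import time


def time_rng_setup(num_clients: int, base_seed: int = 0):
    """Create one generator per client and report how long the setup took."""
    start = time.perf_counter()
    rngs = [random.Random(base_seed * 1_000_003 + cid) for cid in range(num_clients)]
    elapsed = time.perf_counter() - start
    return rngs, elapsed


rngs, setup_seconds = time_rng_setup(1000)
print(f"created {len(rngs)} generators in {setup_seconds:.6f} s")
```

Reporting this figure alongside the wall-clock totals would let readers judge whether stream management is a meaningful share of the measured time.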

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation to accept. The assessment accurately captures BlazeFL's focus on reducing the speed-reproducibility trade-off in single-node FL simulations through shared-memory threading and isolated RNG streams.

Circularity Check

0 steps flagged

No significant circularity identified


The paper presents an engineering framework whose core claims (bitwise determinism under explicit RNG isolation and measured speedups) are achieved by construction via per-client generator streams and thread-based in-memory exchange. These are stated as conditional on client code consuming the provided RNGs, with performance numbers obtained by direct timing against an external open-source baseline rather than any fitted parameter or self-referential derivation. No equations, predictions, or uniqueness theorems appear that reduce to the paper's own inputs; the design is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution rests on standard threading primitives and RNG libraries already available in the target language ecosystem.



Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1] Daniel J. Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Javier Fernandez-Marques, Yan Gao, Lorenzo Sani, Kwing Hei Li, Titouan Parcollet, Pedro Porto Buarque de Gusmão, and Nicholas D. Lane. Flower: A friendly federated learning research framework, 2022.

[2] Filip Granqvist, Congzheng Song, Aine Cahill, Rogier van Dalen, Martin Pelikan, Yi Sheng Chan, Xiaojun Feng, Natarajan Krishnaswami, Vojta Jina, and Mona Chitnis. pfl-research: Simulation framework for accelerating research in private federated learning, 2024.

[3] Sam Gross. PEP 703 – Making the Global Interpreter Lock Optional in CPython. Python Enhancement Proposals 703, Python Software Foundation, 2023.

[4] Chaoyang He, Songze Li, Jinhyun So, Xiao Zeng, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, Xinghua Zhu, Jianzong Wang, Li Shen, Peilin Zhao, Yan Kang, Yang Liu, Ramesh Raskar, Qiang Yang, Murali Annavaram, and Salman Avestimehr. FedML: A research library and benchmark for federated machine learning, 2020.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[6] Ivan Levkivskyi, Jukka Lehtosalo, and Łukasz Langa. PEP 544 – Protocols: Structural subtyping (static duck typing). Python Enhancement Proposals 544, Python Software Foundation, 2017.

[7] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data.

[8] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard Version 5.0, 2025.

[9, 10] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, Carlsbad, CA. USENIX Association.

[11] NVIDIA Corporation. NCCL: NVIDIA Collective Communications Library.

[12] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library, 2019.

[13] Alexander Sergeev and Mike Del Balso. Horovod: Fast and easy distributed deep learning in TensorFlow, 2018.

[14] Thomas Wouters, Matt Page, and Sam Gross. PEP 779 – Criteria for supported status for free-threaded Python. Python Enhancement Proposals 779, Python Software Foundation.
