arxiv: 2604.03606 · v1 · submitted 2026-04-04 · 💻 cs.LG

BlazeFL: Fast and Deterministic Federated Learning Simulation

Pith reviewed 2026-05-13 19:04 UTC · model grok-4.3

classification 💻 cs.LG
keywords federated learning · simulation · deterministic execution · random number generators · reproducibility · parallelism · machine learning · CIFAR-10

The pith

BlazeFL achieves deterministic federated learning simulations up to 3.1 times faster than baselines through thread-based execution and per-client RNG streams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Federated learning research often requires simulating hundreds or thousands of clients on one machine, but parallel execution usually creates unpredictable results from shared randomness or varying schedules. BlazeFL tackles this by running clients in threads that share memory directly with the server for parameter updates, cutting out slow serialization steps. It also supplies each client with its own dedicated random number stream so that all random choices stay consistent across runs. A sympathetic reader would care because this removes the need to choose between fast experiments and reliable, repeatable outcomes, letting researchers iterate more quickly on their algorithms. The authors demonstrate this with CIFAR-10 tests that run up to 3.1 times faster on communication-heavy setups.

Core claim

BlazeFL is a lightweight framework that enables fast and deterministic federated learning simulation on a single node. It employs free-threaded shared-memory execution for in-memory parameter exchange, avoiding the overhead of serialization and inter-process communication. Each client is assigned an isolated RNG stream, so that when stochastic operators draw from these generators, repeated runs produce bitwise-identical results, even under high concurrency, in both thread-based and process-based modes. In experiments with CIFAR-10 image classification, it achieves up to 3.1× speedup over a widely used open-source baseline, particularly on communication-dominated workloads, while keeping a lightweight dependency footprint.
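The claim above can be illustrated with a minimal sketch that uses only the Python standard library. It is not BlazeFL's API: the names `client_rng`, `local_update`, and `run_round`, the hash-based seed derivation, and the toy "training" step are all assumptions made for illustration. Each client thread reads the server's parameter list directly and draws noise from its own generator, and the server aggregates in client-id order so the floating-point summation order is fixed.

```python
import hashlib
import random
from concurrent.futures import ThreadPoolExecutor


def client_rng(base_seed: int, client_id: int) -> random.Random:
    """Derive an isolated generator for one client (hypothetical scheme)."""
    digest = hashlib.sha256(f"{base_seed}:{client_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def local_update(global_params: list[float], client_id: int, base_seed: int) -> list[float]:
    """Toy stand-in for local training: read shared parameters, add client-specific noise."""
    rng = client_rng(base_seed, client_id)
    return [w + rng.gauss(0.0, 0.01) for w in global_params]


def run_round(global_params: list[float], num_clients: int, base_seed: int,
              max_workers: int = 8) -> list[float]:
    """Run one simulated round with threads and average the client updates."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in client-id order, whatever order threads finish in.
        updates = list(pool.map(
            lambda cid: local_update(global_params, cid, base_seed),
            range(num_clients),
        ))
    # Summing in a fixed order keeps the floating-point result reproducible.
    return [sum(column) / num_clients for column in zip(*updates)]


first = run_round([0.0, 1.0, -1.0], num_clients=32, base_seed=42)
second = run_round([0.0, 1.0, -1.0], num_clients=32, base_seed=42)
```

In this sketch `first == second` holds because no client touches a shared generator and the aggregation order never depends on scheduling. Real training code adds further sources of nondeterminism (for example GPU kernels), which is why the paper conditions its claim on a fixed software/hardware stack.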

What carries the argument

Isolated per-client random number generator streams combined with thread-based shared-memory parameter exchange.
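The difference between shared-memory exchange and serialized exchange can be seen in a few lines of standard-library Python. This is a sketch under simple assumptions, not BlazeFL code: `server_params` and `client_view` are hypothetical names, and `pickle` stands in for whatever serialization a process-based simulator might use.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

server_params = [0.1, 0.2, 0.3]  # stand-in for a model's parameter buffer


def client_view(params: list[float]) -> int:
    """Return the identity of the object the client thread received."""
    return id(params)


# Thread-based exchange: the client gets a reference to the same object.
with ThreadPoolExecutor(max_workers=1) as pool:
    thread_sees_same_object = (
        pool.submit(client_view, server_params).result() == id(server_params)
    )

# Process-style exchange: the payload is serialized and rebuilt as a copy.
payload = pickle.dumps(server_params)
rebuilt = pickle.loads(payload)
copy_is_distinct_object = rebuilt is not server_params
```

The thread receives the original object at no copying cost, whereas the serialized path pays for encoding, transfer, and reconstruction on every exchange. That cost grows with model size, which is consistent with the reported speedups being largest on communication-dominated workloads.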

Load-bearing premise

All stochastic operators inside client training code must be configured to draw from the BlazeFL-provided per-client RNG streams rather than global or framework-default generators.
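A small sketch of correct versus incorrect usage may make this premise concrete. Everything here is hypothetical: `train_incorrect`, `train_correct`, `simulate`, and the integer seed derivation are illustrative and are not taken from BlazeFL. The `order` argument stands in for thread scheduling, so that the effect can be shown without relying on real races.

```python
import random


def train_incorrect(client_id: int) -> float:
    """Draws from the process-wide generator, so the value depends on
    how many draws other clients made first."""
    return random.random()


def train_correct(client_id: int, rng: random.Random) -> float:
    """Draws only from the generator handed to this client."""
    return rng.random()


def simulate(order: list[int], managed: bool, seed: int = 0) -> dict[int, float]:
    """Run clients in the given order and collect one random draw per client."""
    random.seed(seed)  # reset the global generator for a fair comparison
    # Hypothetical per-client seeding; any collision-free derivation would do.
    rngs = {cid: random.Random(seed * 1_000_003 + cid) for cid in order}
    results = {}
    for cid in order:
        if managed:
            results[cid] = train_correct(cid, rngs[cid])
        else:
            results[cid] = train_incorrect(cid)
    return results


schedule_a = [0, 1, 2, 3]
schedule_b = [3, 1, 0, 2]
managed_stable = simulate(schedule_a, managed=True) == simulate(schedule_b, managed=True)
global_stable = simulate(schedule_a, managed=False) == simulate(schedule_b, managed=False)
```

With per-client generators each client's draw is the same under both schedules, so `managed_stable` is true. With the global generator, client 0 receives the first draw under one schedule and the third under the other, so `global_stable` is false. The same reasoning applies to any operator that silently falls back to a framework-default generator.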

What would settle it

Running the same high-concurrency CIFAR-10 federated learning simulation multiple times and checking whether the final model parameters and accuracy metrics match exactly bitwise on every run.
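One way to carry out such a check is to hash the raw bytes of the final parameters and compare digests across runs. The sketch below assumes the parameters are available as a flat list of Python floats; `params_digest` is a hypothetical helper, and the three "runs" are toy values chosen to show that a change in floating-point summation order alone is enough to break bitwise equality.

```python
import hashlib
import struct


def params_digest(params: list[float]) -> str:
    """Hash the IEEE-754 byte representation so any single-bit change is detected."""
    raw = struct.pack(f"<{len(params)}d", *params)
    return hashlib.sha256(raw).hexdigest()


# Two 'runs' that add the same three numbers in a different order,
# and a third that repeats the first order exactly.
run_a = [(0.1 + 0.2) + 0.3]
run_b = [0.1 + (0.2 + 0.3)]
run_c = [(0.1 + 0.2) + 0.3]

same_order_matches = params_digest(run_a) == params_digest(run_c)
reordered_matches = params_digest(run_a) == params_digest(run_b)
```

Here `same_order_matches` is true and `reordered_matches` is false, even though `run_a` and `run_b` differ by only about 1e-16. A tolerance-based comparison would hide exactly the kind of drift that a bitwise check is meant to expose.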

Figures

Figures reproduced from arXiv: 2604.03606 by Kitsuya Azuma, Takayuki Nishio.

Figure 1: Architecture overview of BlazeFL. A main thread co…
Figure 2: Wall-clock time for five communication rounds on the high-performance server (48 CPU cores, NVIDIA H100) as a function of…
Figure 3: Wall-clock time for five communication rounds on the workstation-class server (32 CPU cores, NVIDIA Quadro RTX 6000) as…
Figure 4: Accumulation of non-deterministic errors in Flower…
Original abstract

Federated learning (FL) research increasingly relies on single-node simulations with hundreds or thousands of virtual clients, making both efficiency and reproducibility essential. Yet parallel client training often introduces nondeterminism through shared random state and scheduling variability, forcing researchers to trade throughput for reproducibility or to implement custom control logic within complex frameworks. We present BlazeFL, a lightweight framework for single-node FL simulation that alleviates this trade-off through free-threaded shared-memory execution and deterministic randomness management. BlazeFL uses thread-based parallelism with in-memory parameter exchange between the server and clients, avoiding serialization and inter-process communication overhead. To support deterministic execution, BlazeFL assigns isolated random number generator (RNG) streams to clients. Under a fixed software/hardware stack, and when stochastic operators consume BlazeFL-managed generators, this design yields bitwise-identical results across repeated high-concurrency runs in both thread-based and process-based modes. In CIFAR-10 image-classification experiments, BlazeFL substantially reduces execution time relative to a widely used open-source baseline, achieving up to 3.1$\times$ speedup on communication-dominated workloads while preserving a lightweight dependency footprint. Our open-source implementation is available at: https://github.com/kitsuyaazuma/blazefl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces BlazeFL, a lightweight framework for single-node federated learning simulation. It employs free-threaded shared-memory execution with in-memory parameter exchange to avoid serialization and IPC overhead, while assigning isolated RNG streams to clients to achieve bitwise-identical results across repeated runs (conditional on stochastic operators consuming the managed generators). CIFAR-10 experiments report up to 3.1× speedup versus an open-source baseline on communication-dominated workloads, with a small dependency footprint and open-source release.

Significance. If the results hold under the stated conditions, BlazeFL addresses a practical pain point in FL research by reducing the speed-reproducibility trade-off for high-concurrency single-node simulations. The explicit construction of determinism via RNG isolation, combined with measured speedups against an external baseline and an open implementation, positions it as a useful engineering contribution for the community.

minor comments (2)
  1. Abstract: the phrase 'when stochastic operators consume BlazeFL-managed generators' is central to the determinism claim but appears only once; expanding this boundary condition with a brief example of correct versus incorrect usage would improve clarity for readers implementing client code.
  2. The manuscript would benefit from an explicit statement in the experimental section on whether the reported timing results include any overhead from RNG stream initialization or management.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation to accept. The assessment accurately captures BlazeFL's focus on reducing the speed-reproducibility trade-off in single-node FL simulations through shared-memory threading and isolated RNG streams.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an engineering framework whose core claims (bitwise determinism under explicit RNG isolation and measured speedups) are achieved by construction via per-client generator streams and thread-based in-memory exchange. These are stated as conditional on client code consuming the provided RNGs, with performance numbers obtained by direct timing against an external open-source baseline rather than any fitted parameter or self-referential derivation. No equations, predictions, or uniqueness theorems appear that reduce to the paper's own inputs; the design is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution rests on standard threading primitives and RNG libraries already available in the target language ecosystem.

