
arxiv: 2605.17552 · v1 · pith:NJXM27MVnew · submitted 2026-05-17 · 💻 cs.LG

Q-LocalAdam: Memory-Efficient Client-Side Adaptive Optimization for Edge Federated Learning

Pith reviewed 2026-05-20 14:05 UTC · model grok-4.3

classification 💻 cs.LG
keywords: federated learning, edge devices, adaptive optimization, quantization, memory efficiency, Adam optimizer, non-IID data, client-side optimization

The pith

Q-LocalAdam reduces client optimizer memory by 3.37 times in federated learning by using separate 8-bit encodings for momentum and variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Q-LocalAdam as a way to run the Adam optimizer on edge devices during federated learning without exceeding tight memory limits. It starts from the observation that momentum stays symmetric and bounded while variance spreads over eight orders of magnitude in a log-normal pattern. These different shapes motivate block-wise linear quantization for momentum and log-space quantization for variance, with model weights left at full precision. Experiments on CIFAR-10 and CIFAR-100 with varying data heterogeneity show the method matches full-precision accuracy under moderate non-IID conditions and improves accuracy under strong heterogeneity, all while cutting optimizer memory by 3.37 times. The approach requires no changes to the federated protocol itself.

Core claim

Momentum and variance in federated Adam exhibit distinct statistical properties—symmetric and bounded for momentum, log-normal across eight orders of magnitude for variance—which allow tailored 8-bit quantization encodings to preserve accuracy while achieving a 3.37 times reduction in optimizer memory on client devices under non-IID data.

What carries the argument

Distribution-aware 8-bit quantization: block-wise linear encoding for momentum and log-space encoding for variance.
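
The momentum half of this scheme can be illustrated with a minimal NumPy sketch. It assumes symmetric int8 codes with one float32 scale per block, which is a common layout for block-wise quantization but is not confirmed here as the paper's exact format; the function names, the block size of 64, and the synthetic momentum tensor are all illustrative.

```python
import numpy as np

def quantize_blockwise_linear(m, block_size=64):
    """Symmetric int8 quantization with one float32 scale per block (assumed layout)."""
    flat = np.asarray(m, dtype=np.float32).ravel()
    pad = (-flat.size) % block_size  # zero-pad so the tensor splits into whole blocks
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    # Per-block absolute maximum; all-zero blocks get scale 1.0 to avoid dividing by zero.
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / 127.0, 1.0).astype(np.float32)
    codes = np.clip(np.rint(blocks / scale), -127, 127).astype(np.int8)
    return codes, scale, flat.size

def dequantize_blockwise_linear(codes, scale, size):
    """Map int8 codes back to float32 and drop the padding."""
    return (codes.astype(np.float32) * scale).ravel()[:size]

# Synthetic stand-in for a momentum tensor: symmetric around zero and bounded.
rng = np.random.default_rng(0)
momentum = rng.normal(0.0, 1e-3, size=1000).astype(np.float32)

codes, scale, size = quantize_blockwise_linear(momentum)
restored = dequantize_blockwise_linear(codes, scale, size)
max_abs_err = float(np.abs(restored - momentum).max())
```

Because each block carries its own scale, the rounding error of any element is at most about half a quantization step of its own block, so a few large entries elsewhere in the tensor do not coarsen the grid for small ones.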

Load-bearing premise

The momentum remains symmetric and bounded while the variance stays log-normal across the tested models, datasets, and federated heterogeneity levels.

What would settle it

Running the same models on a new dataset or architecture where variance no longer follows a log-normal distribution over many orders of magnitude and measuring whether accuracy drops below the full-precision baseline.
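
One way such a check could be run in practice is to summarize the shape of log10 of the variance state. The sketch below is an assumption-laden diagnostic, not the paper's procedure: `log_variance_summary`, the floor value, and the synthetic log-normal tensor are hypothetical, and skewness and excess kurtosis near zero are only a rough indicator of a normal shape in log space.

```python
import numpy as np

def log_variance_summary(v, floor=1e-30):
    """Summarize log10 of second-moment estimates (floor is an illustrative choice)."""
    logv = np.log10(np.maximum(np.asarray(v, dtype=np.float64).ravel(), floor))
    z = (logv - logv.mean()) / logv.std()
    return {
        # Dynamic range covered by the state, in powers of ten.
        "orders_of_magnitude": float(logv.max() - logv.min()),
        # Both moments are close to 0 when log10(v) is roughly normal.
        "skewness": float(np.mean(z ** 3)),
        "excess_kurtosis": float(np.mean(z ** 4) - 3.0),
    }

# Synthetic stand-in for a variance tensor; real values would come from a training run.
rng = np.random.default_rng(0)
synthetic_v = rng.lognormal(mean=-14.0, sigma=2.5, size=100_000)
summary = log_variance_summary(synthetic_v)
```

Applied per layer and per round on a new architecture, a large skewness or kurtosis, or a much narrower dynamic range, would suggest that the log-normal premise no longer holds there.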

Figures

Figures reproduced from arXiv: 2605.17552 by Haroon R. Lone, Vedant Waykole.

Figure 1: CIFAR-10 convergence and robustness. Left: Main comparison under extreme heterogeneity (…
Figure 2: CIFAR-100 convergence analysis. Q-LocalAdam …
Figure 3: CIFAR-10 optimizer state distributions after 50 …
Figure 4: CIFAR-100 optimizer state distributions after 50 …
Figure 5: Comparison of relative quantization error between …
Figure 6: CIFAR-10 data distribution heatmap across 10 …
Figure 7: CIFAR-100 data distribution heatmap across 10 …
Figure 8: Component ablation at 𝛼 = 0.1. Left: CIFAR-10. Quantizing only momentum achieves competitive accuracy (82.48%), but quantizing both states is essential for maximum memory reduction (3.37× vs. 1.54×). Right: CIFAR-100. Quantizing only momentum or only variance provides partial benefits, but full Q-LocalAdam achieves the best accuracy with lowest memory
Figure 9: Block size ablation at 𝛼 = 0.1. Left: CIFAR-10. All three block sizes converge stably with best accuracy within 0.55pp (81.91%–82.46%), demonstrating robustness to quantization granularity on simpler tasks. Right: CIFAR-100. All block sizes reach similar best accuracy (0.76pp range), but 𝐵 = 128 exhibits late-stage instability with 4pp final accuracy drop
Figure 10: Learning rate sensitivity at 𝛼 = 0.1. Left: CIFAR-10. Q-LocalAdam achieves nearly identical best accuracy with lr=1e-3 (81.91%) and lr=5e-4 (81.79%), differing by only 0.12pp. Right: CIFAR-100. Both learning rates achieve similar best accuracy (61.47% vs. 61.82%), demonstrating robustness to hyperparameter choice
Original abstract

Federated learning on edge devices must cope with non-IID client data and tight memory budgets. Adaptive optimizers like Adam stabilize training under data heterogeneity but require storing full-precision momentum and variance states, often tripling client memory overhead. This limits deployable model sizes and concurrent federated jobs on resource-constrained devices. We empirically observe that momentum and variance in federated Adam exhibit fundamentally different statistical properties: momentum values are symmetric and bounded, while variance spans eight orders of magnitude with log-normal structure. Motivated by this asymmetry, we propose Q-LocalAdam, which applies distribution-aware 8-bit quantization (block-wise linear encoding for momentum and log-space encoding for variance) while keeping model parameters in full precision. Across CIFAR-10 and CIFAR-100 under varying data heterogeneity (α ∈ {0.1, 0.5, 1.0, IID}), Q-LocalAdam achieves a 3.37× optimizer memory reduction with no accuracy loss under moderate heterogeneity and significant improvements under extreme heterogeneity (e.g., +5.74 pp on CIFAR-100, α = 0.1). Multi-seed validation confirms statistical significance (p < 0.01). In contrast, naive uniform quantization degrades to random performance, demonstrating that distribution-aware design is essential. Q-LocalAdam enables larger models and more concurrent workloads on memory-constrained edge devices without modifying the federated protocol.
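
The contrast between naive uniform and log-space quantization of the variance state can be made concrete with a small sketch. It assumes a single 256-level grid per tensor for both schemes and a synthetic log-normal tensor; `uniform_uint8`, `log_uint8`, and the floor value are hypothetical names and choices, and the paper's own encoding (for example its block structure) may differ.

```python
import numpy as np

def uniform_uint8(v):
    """Naive baseline: one linear 256-level grid over [0, max(v)] for the whole tensor."""
    step = v.max() / 255.0
    return np.rint(v / step) * step

def log_uint8(v, floor=1e-30):
    """Log-space grid: 256 levels spaced evenly in log(v) between its min and max."""
    logv = np.log(np.maximum(v, floor))  # floor guards against log(0)
    lo, hi = logv.min(), logv.max()
    step = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = np.rint((logv - lo) / step).astype(np.uint8)
    return np.exp(lo + codes.astype(np.float64) * step)

# Synthetic stand-in for a variance tensor spanning many orders of magnitude.
rng = np.random.default_rng(0)
v = rng.lognormal(mean=-14.0, sigma=2.5, size=10_000)

def median_rel_err(approx):
    return float(np.median(np.abs(approx - v) / v))

err_uniform = median_rel_err(uniform_uint8(v))
err_log = median_rel_err(log_uint8(v))
zeroed_fraction = float(np.mean(uniform_uint8(v) == 0.0))
```

On data like this, the linear grid is sized by the largest entry, so most small entries round to exactly zero, while the log grid keeps a roughly constant relative error across the range. Since Adam divides by the square root of the variance estimate, zeroed entries are a plausible source of the instability the abstract reports for naive quantization, although the mechanism is not spelled out here.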

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Q-LocalAdam, a client-side adaptive optimizer for edge federated learning that applies 8-bit distribution-aware quantization to Adam's momentum (block-wise linear encoding) and variance (log-space encoding) while retaining full-precision model parameters. The design is motivated by empirical observations that momentum is symmetric and bounded whereas variance exhibits log-normal structure spanning eight orders of magnitude. On CIFAR-10 and CIFAR-100 under Dirichlet heterogeneity levels α ∈ {0.1, 0.5, 1.0, IID}, the method reports a 3.37× optimizer memory reduction with no accuracy loss under moderate heterogeneity and gains up to +5.74 pp under extreme heterogeneity (α=0.1), with multi-seed p<0.01 significance; naive uniform quantization is shown to collapse performance.
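
For readers unfamiliar with Dirichlet heterogeneity levels, the sketch below shows the label-based Dirichlet partition commonly used in federated learning benchmarks, where a smaller α gives each client a more skewed class mix. It is an assumption that the paper uses this exact scheme; `dirichlet_partition`, the toy label vector, and the client count of 10 are illustrative.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients using per-class Dirichlet(alpha) proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        # Shuffle the samples of this class, then split them by sampled proportions.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Cumulative proportions give the cut points inside this class.
        cuts = (np.cumsum(proportions)[:-1] * idx.size).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(sorted(ix), dtype=int) for ix in client_indices]

# Toy stand-in for a labelled dataset: 10 classes with 100 samples each.
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(labels, num_clients=10, alpha=0.1)
client_sizes = [p.size for p in parts]
```

Every sample lands on exactly one client, but with α = 0.1 most clients see only a few classes and client sizes vary widely, which is the regime the report calls extreme heterogeneity.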

Significance. If the reported statistical properties of momentum and variance generalize beyond the evaluated CIFAR settings, the approach would meaningfully expand feasible model sizes and concurrent workloads on memory-constrained edge devices without altering the federated protocol. The explicit contrast against naive quantization and the statistical significance testing strengthen the empirical case for distribution-aware quantization in this domain.

major comments (1)
  1. [Abstract and §4] Abstract and §4 (Experiments): the central claim of 3.37× memory reduction with no accuracy loss (and the +5.74 pp gain at α=0.1) rests on the assumption that the observed momentum symmetry/boundedness and variance log-normality persist layer-wise, across epochs, and across architectures/datasets. The manuscript provides no measurements or ablations confirming these properties outside the CIFAR models, leaving the fixed encodings vulnerable to distortion of effective learning rates or second-moment estimates if the distributions shift.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'multi-seed validation confirms statistical significance (p<0.01)' should specify the number of seeds, the exact test, and whether it applies to all reported accuracy differences.
  2. [§3] §3 (Method): clarify whether block size in the linear encoding is a fixed hyperparameter or chosen per layer, and how the log-space encoding handles the eight-order dynamic range without overflow or underflow.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback. We provide a point-by-point response to the major comment below.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim of 3.37× memory reduction with no accuracy loss (and the +5.74 pp gain at α=0.1) rests on the assumption that the observed momentum symmetry/boundedness and variance log-normality persist layer-wise, across epochs, and across architectures/datasets. The manuscript provides no measurements or ablations confirming these properties outside the CIFAR models, leaving the fixed encodings vulnerable to distortion of effective learning rates or second-moment estimates if the distributions shift.

    Authors: We agree that confirming the stability of these statistical properties is important for the broader applicability of the fixed encodings. Our observations in §3 are based on measurements collected during the CIFAR-10 and CIFAR-100 experiments, which already include layer-wise and epoch-wise analysis within those models. In the revised manuscript we will add explicit ablations and visualizations in §3 to document the consistency of momentum symmetry/boundedness and variance log-normality across layers and training epochs. We will also insert a limitations paragraph acknowledging that extension to other architectures and datasets is left for future work. These changes will clarify the current scope without modifying the reported results or method.

    revision: partial

standing simulated objections not resolved
  • Empirical confirmation of the momentum and variance distribution properties on architectures and datasets beyond the CIFAR-10/CIFAR-100 models used in the current evaluation

Circularity Check

0 steps flagged

No circularity: proposal rests on empirical observation of optimizer statistics

full rationale

The paper's derivation chain starts from direct empirical measurements of momentum (symmetric/bounded) and variance (log-normal over eight orders) in federated Adam runs, then applies block-wise linear and log-space 8-bit encodings motivated by those observed distributions. No equation reduces a claimed prediction or first-principles result back to a fitted parameter or self-referential quantity by construction; no uniqueness theorem or ansatz is imported via self-citation; and the reported accuracy and memory results are obtained from external CIFAR-10/100 experiments under controlled heterogeneity levels. The method is therefore self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on an empirical domain assumption about the distinct statistical distributions of momentum and variance states, plus a design choice for 8-bit quantization.

free parameters (1)
  • Quantization bit width
    Chosen by hand as 8 bits to balance memory reduction against precision needs.
axioms (1)
  • domain assumption: Momentum values are symmetric and bounded; variance spans eight orders of magnitude with log-normal structure.
    This observation, stated in the abstract, directly motivates the choice of different encoding schemes for each state.
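
The effect of the 8-bit choice on optimizer memory can be illustrated with back-of-the-envelope arithmetic. The sketch assumes float32 baseline states, a block size of 64, one float32 scale per momentum block, and a float32 minimum and maximum per variance block; all of these, along with the parameter count, are hypothetical. The reported 3.37× depends on the paper's own accounting (block size, metadata, and what is counted as optimizer memory), which is not given here, so the number this sketch produces should not be expected to match it.

```python
def optimizer_state_bytes(num_params, bits, block_size=None, meta_bytes_per_block=0):
    """Bytes for one Adam state tensor, plus optional per-block metadata."""
    payload = num_params * bits // 8
    if block_size is None:
        return payload
    num_blocks = -(-num_params // block_size)  # ceiling division
    return payload + num_blocks * meta_bytes_per_block

n = 1_000_000  # hypothetical parameter count

# Baseline: momentum and variance both stored as float32.
fp32 = 2 * optimizer_state_bytes(n, bits=32)

# Assumed metadata: one float32 scale per momentum block,
# float32 min and max per variance block.
q8 = (optimizer_state_bytes(n, 8, block_size=64, meta_bytes_per_block=4)
      + optimizer_state_bytes(n, 8, block_size=64, meta_bytes_per_block=8))

ratio = fp32 / q8
```

Under any accounting of this kind the ratio stays below 4×, the limit reached when per-block metadata is negligible, and smaller blocks push it lower because they carry more metadata per parameter.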

pith-pipeline@v0.9.0 · 5798 in / 1315 out tokens · 64935 ms · 2026-05-20T14:05:23.659995+00:00 · methodology
