FedScalar: Federated Learning with Scalar Communication for Bandwidth-Constrained Networks

M. Rostami; S. S. Kia

arxiv: 2410.02260 · v3 · submitted 2024-10-03 · 💻 cs.LG

FedScalar: Federated Learning with Scalar Communication for Bandwidth-Constrained Networks

M. Rostami , S. S. Kia This is my paper

Pith reviewed 2026-05-23 19:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords federated learningcommunication efficiencyscalar communicationbandwidth constraintsconvergence analysisrandom projectionsnon-convex optimization

0 comments

The pith

FedScalar achieves federated learning convergence by sending only two scalars per round via random inner products.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FedScalar to solve the bandwidth bottleneck in federated learning, where agents cannot afford to upload full high-dimensional model updates repeatedly. Each agent computes the inner product of its local update difference with a random vector generated from a seed and sends only that scalar value plus the seed. The server regenerates the vector from the seed to form an unbiased estimate of the gradient. This yields a convergence rate of O(d/√K) to a stationary point for smooth non-convex losses. The analysis further shows that Rademacher random vectors produce lower aggregation variance than Gaussian vectors, and simulations demonstrate gains in wall-clock time and energy use under tight bandwidth limits.

Core claim

FedScalar encodes each local update difference as its inner product with a locally generated random vector and transmits the resulting scalar together with the seed. The server reconstructs an unbiased gradient estimate from this information alone. For smooth non-convex loss functions the method converges to a stationary point at rate O(d/√K). Replacing Gaussian random vectors with Rademacher ones reduces the variance of the aggregated estimates.

What carries the argument

The scalar encoding via inner product with a seed-generated random vector, which replaces the full d-dimensional update while preserving an unbiased gradient estimate at the server.

If this is right

The convergence rate of O(d/√K) holds for smooth non-convex loss functions.
Rademacher random vectors reduce aggregation variance relative to the Gaussian case.
Upload communication cost remains two scalars independent of model dimension d.
Numerical results show improved wall-clock time and energy efficiency compared with FedAvg and QSGD under bandwidth constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dimension dependence in the rate implies that reaching a target accuracy in high-dimensional models requires proportionally more communication rounds.
The same scalar projection approach could be applied to other distributed optimization problems that face similar upload limits.
If seeds are public, the reconstruction step is deterministic and could be implemented with minimal additional computation on the server.

Load-bearing premise

The inner product of the local update difference with a locally generated random vector, together with the transmitted seed, enables the server to reconstruct an unbiased gradient estimate without any high-dimensional transmission.

What would settle it

A controlled test that averages many reconstructed estimates for known local updates and checks whether the bias is zero within sampling error, or an empirical measurement of convergence speed versus the predicted O(d/√K) scaling on a smooth non-convex problem.

Figures

Figures reproduced from arXiv: 2410.02260 by M. Rostami, S. S. Kia.

**Figure 2.** Figure 2: Training loss plot of Algorithm 1 for the cases vk sampled from a normal distribution and a Rademacher distribution vs FedAvg Algorithm [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Accuracy plot on the test dataset of Algorithm 1 for – vs FedAvg Algorithm. use is a DNN with a depth of 3 layers, each containing 3 neurons, and ReLU activation functions are applied across all layers. For the training phase, we distribute the data among 20 agents, resulting in each agent owning a dataset of size 80. The remaining data is retained for the test dataset. To implement FedScalar, we set the t… view at source ↗

read the original abstract

In bandwidth-constrained federated learning~(FL) settings, the repeated upload of high-dimensional model updates from agents to a central server constitutes the primary bottleneck, often rendering standard FL infeasible within practical communication budgets. We propose \emph{FedScalar}, a communication-efficient FL algorithm in which each agent uploads only two scalar values per round, regardless of the model dimension~$d$. Each agent encodes its local update difference as an inner product with a locally generated random vector and transmits the resulting scalar together with the generating seed, enabling the server to reconstruct an unbiased gradient estimate without any high-dimensional transmission. We prove that \emph{FedScalar} achieves a convergence rate of $O(d/\sqrt{K})$ to a stationary point for smooth non-convex loss functions, and show that adopting a Rademacher distribution for the random vector reduces the aggregation variance compared to the Gaussian case. Numerical simulations confirm that the dimension-free upload cost translates into significant improvements in wall-clock time and energy efficiency over \emph{FedAvg} and \emph{QSGD} in bandwidth-constrained settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FedScalar cuts client uploads to two scalars via random inner products and seeds, with a stated O(d/sqrt(K)) rate and Rademacher variance edge, but the dimension factor in the bound is the expected cost of that estimator.

read the letter

The main takeaway is that this paper gives a concrete way to make federated learning work when bandwidth is the hard limit: each client sends only a scalar inner product and a seed, letting the server reconstruct an unbiased estimate without shipping the full update vector. Communication cost stays fixed at two numbers no matter how big d gets. They back this with a convergence proof for smooth non-convex losses and show Rademacher vectors cut aggregation variance relative to Gaussian ones, plus some simulation results on time and energy versus FedAvg and QSGD. That combination of scalar encoding, seed reconstruction, and the specific variance comparison is not in the baselines they cite. The unbiasedness follows from basic properties of zero-mean unit-variance random vectors, and the proof is presented separately from the experiments. The rate O(d/sqrt(K)) is the direct result of the estimator variance scaling linearly with dimension, which is standard for this style of compression. The soft spot is exactly that d factor in the bound. It does not disappear, so for very high-dimensional models the number of rounds needed still grows with d even though each round is cheap. The abstract does not give the full derivation or the precise experimental setup for the bandwidth constraints, so it is not possible to verify the variance calculations or how closely the runs track the theory. This work is for people who need to run FL under tight bandwidth budgets and are willing to accept a dimension-dependent iteration count in exchange for fixed per-round cost. A reader looking for practical mechanisms with some supporting analysis would find the reconstruction trick and the Rademacher comparison useful to check. It deserves a serious referee because the central claim is falsifiable, the mechanism is simple to implement, and the problem it targets is real. I would send it out for review.

Referee Report

0 major / 2 minor

Summary. The paper proposes FedScalar, a federated learning algorithm for bandwidth-constrained settings in which each client transmits only two scalars per round: the inner product of its local update difference with a locally generated random vector, plus the seed. The server reconstructs an unbiased estimate of the update from this information. The manuscript proves a convergence rate of O(d/√K) to a stationary point for smooth non-convex objectives, shows that Rademacher random vectors yield lower aggregation variance than Gaussian vectors, and reports simulations demonstrating wall-clock and energy gains over FedAvg and QSGD.

Significance. If the convergence analysis is correct, the work supplies a theoretically supported method for extreme communication reduction (constant scalars independent of d) while retaining an explicit rate for non-convex FL. The explicit variance comparison between Rademacher and Gaussian vectors, together with the simulation evidence of practical efficiency, strengthens the contribution for resource-limited networks. The mechanism relies only on standard unbiasedness properties of zero-mean unit-variance random vectors and does not appear to introduce circularity or free parameters.

minor comments (2)

[Abstract] Abstract: the statement that the upload cost is 'dimension-free' is correct for communication volume, but the O(d/√K) rate explicitly depends on dimension through the estimator variance; a brief parenthetical clarification would prevent misreading.
[Convergence Analysis] The convergence theorem statement should include the precise dependence on smoothness constant, variance bound, and number of clients to make the O(d/√K) claim fully traceable without re-deriving the variance term.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation of minor revision. The summary accurately reflects the FedScalar contributions to scalar communication in federated learning.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core claims rest on the standard unbiasedness property E[<Δ, r> r] = Δ for zero-mean unit-variance coordinate-wise independent random vectors r (Gaussian or Rademacher), which is a direct algebraic consequence of the inner-product definition and does not rely on any fitted parameters or self-citations. The stated O(d/√K) convergence rate for non-convex smooth objectives follows from the usual variance bound on the resulting estimator (scaling as O(d ‖Δ‖²)) inserted into a standard SGD analysis; the paper presents this derivation as self-contained and independent of the numerical simulations. No load-bearing step reduces to a self-citation chain, a fitted input renamed as prediction, or an ansatz smuggled via prior work. The Rademacher variance reduction is likewise a direct comparison of second moments and does not presuppose the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the mathematical property that the expectation of the random projection recovers the true update vector (standard linear algebra) and on the assumption that the objective is L-smooth (domain assumption for non-convex optimization). No free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption The loss function is L-smooth
Invoked to obtain the O(d/sqrt(K)) convergence rate for non-convex objectives.
standard math Expectation of the inner product with the random vector equals the original vector
Required for the unbiased gradient estimate reconstruction.

pith-pipeline@v0.9.0 · 5718 in / 1438 out tokens · 34369 ms · 2026-05-23T19:37:55.631802+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adaptive Threshold-Driven Continuous Greedy Method for Scalable Submodular Optimization
cs.LG 2026-04 unverdicted novelty 7.0

ATCG adaptively thresholds continuous greedy updates to limit communication in distributed submodular maximization under matroid constraints while retaining a curvature-aware approximation guarantee.
Adaptive Threshold-Driven Continuous Greedy Method for Scalable Submodular Optimization
cs.LG 2026-04 unverdicted novelty 6.0

ATCG adaptively gates gradient evaluations in continuous greedy via progress-ratio thresholds to reduce communication while providing a curvature-dependent approximation guarantee that recovers full CG performance in ...

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Communication-efficient learning of deep networks from decentral- ized data,

H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentral- ized data,” in International Conference on Artificial Intelligence and Statistics, (Lauderdale, FL), pp. 1273–1282, 2017

work page 2017
[2]

Communication-efficient massive uav online path control: Federated learning meets mean-field game theory,

H. Shiri, J. Park, and M. Bennis, “Communication-efficient massive uav online path control: Federated learning meets mean-field game theory,” IEEE Transactions on Communications , vol. 68, no. 11, pp. 6840–6857, 2020

work page 2020
[3]

Federated learning on the road autonomous controller design for connected and autonomous vehicles,

T. Zeng, O. Semiari, M. Chen, W. Saad, and M. Bennis, “Federated learning on the road autonomous controller design for connected and autonomous vehicles,” IEEE Transactions on Wireless Communica- tions, vol. 21, no. 12, pp. 10407–10423, 2022

work page 2022
[4]

How apple personalizes siri without hoovering up your data,

K. Hao, “How apple personalizes siri without hoovering up your data,” Technology Review, 2020

work page 2020
[5]

Webank, swiss re in federated learning deal,

T. Li, “Webank, swiss re in federated learning deal,” 2020

work page 2020
[6]

Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence,

R. Xin, S. Kar, and U. A. Khan, “Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence,” tspm, pp. 102–113, 2020

work page 2020
[7]

A Berkeley View of Systems Challenges for AI

I. Stoica, D. Song, R. A. Popa, D. Patterson, M. W. Mahoney, R. Katz, A. D. Joseph, M. Jordan, J. M. Hellerstein, J. E. Gonzalez, et al., “A berkeley view of systems challenges for ai,” arXiv preprint arXiv:1712.05855, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[8]

Feder- ated multi-task learning,

V . Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Feder- ated multi-task learning,” Advances in neural information processing systems, vol. 30, 2017

work page 2017
[9]

Lag: Lazily aggregated gradient for communication-efficient distributed learning,

T. Chen, G. Giannakis, T. Sun, and W. Yin, “Lag: Lazily aggregated gradient for communication-efficient distributed learning,” Advances in neural information processing systems , vol. 31, 2018

work page 2018
[10]

Scheduling policies for federated learning in wireless networks,

H. H. Yang, Z. Liu, T. Q. Quek, and H. V . Poor, “Scheduling policies for federated learning in wireless networks,” IEEE transactions on communications, vol. 68, no. 1, pp. 317–333, 2019

work page 2019
[11]

Scheduling for cellular federated edge learning with importance and channel awareness,

J. Ren, Y . He, D. Wen, G. Yu, K. Huang, and D. Guo, “Scheduling for cellular federated edge learning with importance and channel awareness,” IEEE Transactions on Wireless Communications , vol. 19, no. 11, pp. 7690–7703, 2020

work page 2020
[12]

Federated learning using variance reduced stochastic gradient for probabilistically activated agents,

M. Rostami and S. S. Kia, “Federated learning using variance reduced stochastic gradient for probabilistically activated agents,” in 2023 American Control Conference (ACC) , pp. 861–866, IEEE, 2023

work page 2023
[13]

Agnostic federated learning,

M. Mohri, G. Sivek, and A. T. Suresh, “Agnostic federated learning,” in International conference on machine learning , pp. 4615–4625, PMLR, 2019

work page 2019
[14]

Convergence time optimization for federated learning over wireless networks,

M. Chen, H. V . Poor, W. Saad, and S. Cui, “Convergence time optimization for federated learning over wireless networks,” IEEE Transactions on Wireless Communications , vol. 20, no. 4, pp. 2457– 2471, 2020

work page 2020
[15]

Communication-efficient distributed dual coordinate ascent,

M. Jaggi, V . Smith, M. Tak ´ac, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan, “Communication-efficient distributed dual coordinate ascent,” Advances in neural information processing systems , vol. 27, 2014

work page 2014
[16]

Distributed optimization with arbitrary local solvers,

C. Ma, J. Kone ˇcn`y, M. Jaggi, V . Smith, M. I. Jordan, P. Richt ´arik, and M. Tak ´aˇc, “Distributed optimization with arbitrary local solvers,” optimization Methods and Software, vol. 32, no. 4, pp. 813–848, 2017

work page 2017
[17]

Communication-efficient dis- tributed optimization using an approximate newton-type method,

O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient dis- tributed optimization using an approximate newton-type method,” in International conference on machine learning, pp. 1000–1008, PMLR, 2014

work page 2014
[18]

Disco: Distributed optimization for self- concordant empirical loss,

Y . Zhang and X. Lin, “Disco: Distributed optimization for self- concordant empirical loss,” in International conference on machine learning, pp. 362–370, PMLR, 2015

work page 2015
[19]

Communication-efficient federated learning,

M. Chen, N. Shlezinger, H. V . Poor, Y . C. Eldar, and S. Cui, “Communication-efficient federated learning,” Proceedings of the Na- tional Academy of Sciences , vol. 118, no. 17, p. e2024789118, 2021

work page 2021
[20]

D. S. Bernstein, Matrix Mathematics: Theory, Facts, and Formulas (Second Edition) . Princeton reference, Princeton University Press, 2009

work page 2009
[21]

Nesterov, Introductory lectures on convex optimization: A basic course, vol

Y . Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media, 2013

work page 2013
[22]

Nonlinear programming,

D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997

work page 1997
[23]

Forward gradient-based frank-wolfe optimization for memory efficient deep neural network training,

M. Rostami, “Forward gradient-based frank-wolfe optimization for memory efficient deep neural network training,” arXiv preprint arXiv:2403.12511, 2024

work page arXiv 2024
[24]

Random gradient-free minimization of convex functions,

Y . Nesterov and V . Spokoiny, “Random gradient-free minimization of convex functions,” Foundations of Computational Mathematics , vol. 17, no. 2, pp. 527–566, 2017

work page 2017
[25]

An improved analysis of stochastic gradient descent with momentum,

Y . Liu, Y . Gao, and W. Yin, “An improved analysis of stochastic gradient descent with momentum,” Advances in Neural Information Processing Systems, vol. 33, pp. 18261–18271, 2020

work page 2020
[26]

Accelerating stochastic gradient descent using predictive variance reduction,

R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” Advances in neural information processing systems, vol. 26, 2013

work page 2013
[27]

Minimizing finite sums with the stochastic average gradient,

M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with the stochastic average gradient,”Mathematical Programming, vol. 162, pp. 83–112, 2017

work page 2017
[28]

Probability and statistics, fourth,

M. H. DeGroot and M. J. Schervish, “Probability and statistics, fourth,” 2012

work page 2012
[29]

Multivariate stochastic approximation using a simul- taneous perturbation gradient approximation,

J. C. Spall, “Multivariate stochastic approximation using a simul- taneous perturbation gradient approximation,” IEEE transactions on automatic control, vol. 37, no. 3, pp. 332–341, 1992

work page 1992
[30]

Geron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

A. Geron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 2019

work page 2019
[31]

On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables,

L. Isserlis, “On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables,” Biometrika, vol. 12, no. 1/2, pp. 134–139, 1918. APPENDIX This section presents the proofs of the results in the paper and the auxiliary lemmas used in these proofs. Lemma A.1: Consider Algorithm 1. Then, we have − E[ ...

work page 1918

[1] [1]

Communication-efficient learning of deep networks from decentral- ized data,

H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentral- ized data,” in International Conference on Artificial Intelligence and Statistics, (Lauderdale, FL), pp. 1273–1282, 2017

work page 2017

[2] [2]

Communication-efficient massive uav online path control: Federated learning meets mean-field game theory,

H. Shiri, J. Park, and M. Bennis, “Communication-efficient massive uav online path control: Federated learning meets mean-field game theory,” IEEE Transactions on Communications , vol. 68, no. 11, pp. 6840–6857, 2020

work page 2020

[3] [3]

Federated learning on the road autonomous controller design for connected and autonomous vehicles,

T. Zeng, O. Semiari, M. Chen, W. Saad, and M. Bennis, “Federated learning on the road autonomous controller design for connected and autonomous vehicles,” IEEE Transactions on Wireless Communica- tions, vol. 21, no. 12, pp. 10407–10423, 2022

work page 2022

[4] [4]

How apple personalizes siri without hoovering up your data,

K. Hao, “How apple personalizes siri without hoovering up your data,” Technology Review, 2020

work page 2020

[5] [5]

Webank, swiss re in federated learning deal,

T. Li, “Webank, swiss re in federated learning deal,” 2020

work page 2020

[6] [6]

Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence,

R. Xin, S. Kar, and U. A. Khan, “Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence,” tspm, pp. 102–113, 2020

work page 2020

[7] [7]

A Berkeley View of Systems Challenges for AI

I. Stoica, D. Song, R. A. Popa, D. Patterson, M. W. Mahoney, R. Katz, A. D. Joseph, M. Jordan, J. M. Hellerstein, J. E. Gonzalez, et al., “A berkeley view of systems challenges for ai,” arXiv preprint arXiv:1712.05855, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Feder- ated multi-task learning,

V . Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Feder- ated multi-task learning,” Advances in neural information processing systems, vol. 30, 2017

work page 2017

[9] [9]

Lag: Lazily aggregated gradient for communication-efficient distributed learning,

T. Chen, G. Giannakis, T. Sun, and W. Yin, “Lag: Lazily aggregated gradient for communication-efficient distributed learning,” Advances in neural information processing systems , vol. 31, 2018

work page 2018

[10] [10]

Scheduling policies for federated learning in wireless networks,

H. H. Yang, Z. Liu, T. Q. Quek, and H. V . Poor, “Scheduling policies for federated learning in wireless networks,” IEEE transactions on communications, vol. 68, no. 1, pp. 317–333, 2019

work page 2019

[11] [11]

Scheduling for cellular federated edge learning with importance and channel awareness,

J. Ren, Y . He, D. Wen, G. Yu, K. Huang, and D. Guo, “Scheduling for cellular federated edge learning with importance and channel awareness,” IEEE Transactions on Wireless Communications , vol. 19, no. 11, pp. 7690–7703, 2020

work page 2020

[12] [12]

Federated learning using variance reduced stochastic gradient for probabilistically activated agents,

M. Rostami and S. S. Kia, “Federated learning using variance reduced stochastic gradient for probabilistically activated agents,” in 2023 American Control Conference (ACC) , pp. 861–866, IEEE, 2023

work page 2023

[13] [13]

Agnostic federated learning,

M. Mohri, G. Sivek, and A. T. Suresh, “Agnostic federated learning,” in International conference on machine learning , pp. 4615–4625, PMLR, 2019

work page 2019

[14] [14]

Convergence time optimization for federated learning over wireless networks,

M. Chen, H. V . Poor, W. Saad, and S. Cui, “Convergence time optimization for federated learning over wireless networks,” IEEE Transactions on Wireless Communications , vol. 20, no. 4, pp. 2457– 2471, 2020

work page 2020

[15] [15]

Communication-efficient distributed dual coordinate ascent,

M. Jaggi, V . Smith, M. Tak ´ac, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan, “Communication-efficient distributed dual coordinate ascent,” Advances in neural information processing systems , vol. 27, 2014

work page 2014

[16] [16]

Distributed optimization with arbitrary local solvers,

C. Ma, J. Kone ˇcn`y, M. Jaggi, V . Smith, M. I. Jordan, P. Richt ´arik, and M. Tak ´aˇc, “Distributed optimization with arbitrary local solvers,” optimization Methods and Software, vol. 32, no. 4, pp. 813–848, 2017

work page 2017

[17] [17]

Communication-efficient dis- tributed optimization using an approximate newton-type method,

O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient dis- tributed optimization using an approximate newton-type method,” in International conference on machine learning, pp. 1000–1008, PMLR, 2014

work page 2014

[18] [18]

Disco: Distributed optimization for self- concordant empirical loss,

Y . Zhang and X. Lin, “Disco: Distributed optimization for self- concordant empirical loss,” in International conference on machine learning, pp. 362–370, PMLR, 2015

work page 2015

[19] [19]

Communication-efficient federated learning,

M. Chen, N. Shlezinger, H. V . Poor, Y . C. Eldar, and S. Cui, “Communication-efficient federated learning,” Proceedings of the Na- tional Academy of Sciences , vol. 118, no. 17, p. e2024789118, 2021

work page 2021

[20] [20]

D. S. Bernstein, Matrix Mathematics: Theory, Facts, and Formulas (Second Edition) . Princeton reference, Princeton University Press, 2009

work page 2009

[21] [21]

Nesterov, Introductory lectures on convex optimization: A basic course, vol

Y . Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media, 2013

work page 2013

[22] [22]

Nonlinear programming,

D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997

work page 1997

[23] [23]

Forward gradient-based frank-wolfe optimization for memory efficient deep neural network training,

M. Rostami, “Forward gradient-based frank-wolfe optimization for memory efficient deep neural network training,” arXiv preprint arXiv:2403.12511, 2024

work page arXiv 2024

[24] [24]

Random gradient-free minimization of convex functions,

Y . Nesterov and V . Spokoiny, “Random gradient-free minimization of convex functions,” Foundations of Computational Mathematics , vol. 17, no. 2, pp. 527–566, 2017

work page 2017

[25] [25]

An improved analysis of stochastic gradient descent with momentum,

Y . Liu, Y . Gao, and W. Yin, “An improved analysis of stochastic gradient descent with momentum,” Advances in Neural Information Processing Systems, vol. 33, pp. 18261–18271, 2020

work page 2020

[26] [26]

Accelerating stochastic gradient descent using predictive variance reduction,

R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” Advances in neural information processing systems, vol. 26, 2013

work page 2013

[27] [27]

Minimizing finite sums with the stochastic average gradient,

M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with the stochastic average gradient,”Mathematical Programming, vol. 162, pp. 83–112, 2017

work page 2017

[28] [28]

Probability and statistics, fourth,

M. H. DeGroot and M. J. Schervish, “Probability and statistics, fourth,” 2012

work page 2012

[29] [29]

Multivariate stochastic approximation using a simul- taneous perturbation gradient approximation,

J. C. Spall, “Multivariate stochastic approximation using a simul- taneous perturbation gradient approximation,” IEEE transactions on automatic control, vol. 37, no. 3, pp. 332–341, 1992

work page 1992

[30] [30]

Geron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

A. Geron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 2019

work page 2019

[31] [31]

On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables,

L. Isserlis, “On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables,” Biometrika, vol. 12, no. 1/2, pp. 134–139, 1918. APPENDIX This section presents the proofs of the results in the paper and the auxiliary lemmas used in these proofs. Lemma A.1: Consider Algorithm 1. Then, we have − E[ ...

work page 1918