Recognition: 2 theorem links
Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization
Pith reviewed 2026-05-15 02:44 UTC · model grok-4.3
The pith
Coherent Coordinate Descent reuses historical gradients as warm starts to achieve O(1) query cost per step while preserving global descent directions in zeroth-order optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoCD formalizes the notion of gradient coherence and proves equivalence to block cyclic coordinate descent with warm starts, thereby converting historical gradients into reliable descent directions at O(1) query complexity per step. Error bounds derived in the work reveal that larger finite-difference step sizes induce implicit landscape smoothing by lowering the effective smoothness constant, which in turn enhances convergence stability without additional computation.
What carries the argument
Gradient coherence, made operational through the equivalence of CoCD to block cyclic coordinate descent with warm starts: prior gradient estimates are reused deterministically to maintain global descent directions at constant query cost.
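A minimal sketch of that mechanism in Python; the function names and hyperparameters are illustrative assumptions, not the paper's implementation. Each step refreshes the finite-difference estimate for one coordinate block (a constant number of queries, independent of the dimension n) and descends along the full cached gradient, stale entries included:

```python
import numpy as np

def cocd_step(f, x, grad_cache, block, blocks, h=1e-2, lr=1e-1):
    """One coherent-coordinate-descent step (hypothetical sketch).

    Refreshes the central-difference estimate only for the active
    coordinate block, then descends along the full cached gradient,
    reusing the stale estimates of every other block.
    """
    for i in blocks[block]:
        e = np.zeros_like(x)
        e[i] = h
        grad_cache[i] = (f(x + e) - f(x - e)) / (2 * h)
    x_new = x - lr * grad_cache          # global (partly stale) direction
    return x_new, grad_cache, (block + 1) % len(blocks)

# usage on a toy quadratic: stale entries stay coherent, so the
# full-vector update still descends
f = lambda z: 0.5 * float(np.sum(z ** 2))
x, g, b = np.ones(8), np.zeros(8), 0
blocks = np.array_split(np.arange(8), 4)  # 4 blocks of 2 coordinates
for _ in range(100):
    x, g, b = cocd_step(f, x, g, b, blocks)
print(f(x))  # ~0: converges despite refreshing only 2 coords per step
```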
If this is right
- Enables O(1) query complexity per step while retaining global descent information.
- Improves stability over randomized zeroth-order estimators on MLP, CNN, and ResNet models up to 270k parameters.
- Larger finite-difference steps reduce the effective smoothness constant and aid convergence (see the sketch after this list).
- Offers a deterministic, structure-aware alternative that avoids the variance of random subspace sampling.
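A one-line way to see why a larger difference interval can act as a smoother, using only the standard central-difference identity (a generic argument, not the paper's Definition 5.2 or its bounds):

```latex
\[
  g_i^{h}(x)
  \;=\; \frac{f(x + h\,e_i) - f(x - h\,e_i)}{2h}
  \;=\; \frac{1}{2h}\int_{-h}^{h} \partial_i f(x + t\,e_i)\,\mathrm{d}t .
\]
```

The estimator is the exact average of ∂_i f over an interval of width 2h; averaging damps oscillations of ∂_i f at scales below h, which is why an effective smoothness constant L_h ≤ L, with L_h → L as h → 0, is plausible.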
Where Pith is reading between the lines
- The coherence mechanism may generalize to other black-box problems where gradient reuse can be scheduled without full recomputation.
- Hybrid schemes could monitor coherence online and switch between coherent and randomized updates when drift is detected.
- The implicit smoothing result suggests that step-size tuning in zeroth-order methods can serve dual purposes of noise control and landscape regularization.
Load-bearing premise
Historical gradients must stay sufficiently aligned with the current landscape so that reusing them still points toward a global descent direction without drift or loss of information.
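This premise is directly measurable. A minimal sketch, assuming a cosine-alignment score and an illustrative 0.5 threshold (neither is from the paper):

```python
import numpy as np

def coherence(stale_grad, fresh_grad, eps=1e-12):
    """Cosine alignment between the cached gradient and a fresh estimate.

    Near 1: the stale vector still points along the current descent
    direction, so reuse is safe. Near 0 or negative: drift has broken
    coherence and the cache should be refreshed.
    """
    denom = np.linalg.norm(stale_grad) * np.linalg.norm(fresh_grad) + eps
    return float(np.dot(stale_grad, fresh_grad)) / denom

# e.g. trigger a full re-estimate whenever coherence(g_cache, g_new) < 0.5
```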
What would settle it
If CoCD produces higher final loss or slower convergence than random-subspace zeroth-order methods on a rapidly changing or non-smooth objective, the claims of stability and equivalence would be refuted.
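A minimal falsification harness under these assumptions: a quadratic whose curvature drifts every step stands in for a "rapidly changing" objective, a coherent one-coordinate-refresh update stands in for CoCD, and a random-coordinate estimator is the baseline. All of it is illustrative; none of it is the paper's experimental protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, lr, steps = 16, 1e-2, 5e-2, 400

def loss(z, k):
    # quadratic whose per-coordinate curvature drifts with the step
    # index k: a crude proxy for a rapidly changing landscape
    c = 1.0 + 0.9 * np.sin(0.1 * k + np.arange(n))
    return 0.5 * float(np.sum(c * z ** 2))

def central_diff(fk, x, i):
    e = np.zeros(n)
    e[i] = h
    return (fk(x + e) - fk(x - e)) / (2 * h)

x_coh, x_rnd = np.ones(n), np.ones(n)
g_coh = np.zeros(n)
for k in range(steps):
    fk = lambda z, k=k: loss(z, k)
    # coherent: refresh one coordinate, reuse the other 15 stale entries
    g_coh[k % n] = central_diff(fk, x_coh, k % n)
    x_coh -= lr * g_coh
    # randomized baseline: one fresh random coordinate per step,
    # scaled by n so the estimator is unbiased in expectation
    i = int(rng.integers(n))
    g_rnd = np.zeros(n)
    g_rnd[i] = central_diff(fk, x_rnd, i)
    x_rnd -= lr * n * g_rnd
print(loss(x_coh, steps), loss(x_rnd, steps))
# if the coherent run ends at a clearly higher loss here (or on a
# non-smooth objective), the stability claim is in trouble
```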
Original abstract
Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are either sample-inefficient (e.g., standard finite differences) or suffer from high variance due to randomized estimation (e.g., random subspace methods). In this work, we propose Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient, and budget-aware ZO optimizer. Theoretically, we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with ``warm starts,'' effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions. Furthermore, we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant, thereby improving convergence stability. Experiments on MLP, CNN, and ResNet architectures (up to 270k parameters) demonstrate that CoCD significantly outperforms BCCD in terms of sample efficiency and convergence loss/accuracy, and exhibits superior stability over randomized ZO methods. Our results suggest that deterministic, structure-aware updates offer a superior alternative to randomization for lightweight ZO optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Coherent Coordinate Descent (CoCD), a deterministic zeroth-order optimizer that formalizes gradient coherence to establish equivalence with Block Cyclic Coordinate Descent (BCCD) using warm starts. This purportedly converts stale gradients into reliable global descent directions with O(1) query complexity per step. It further derives error bounds showing that larger finite-difference step sizes induce implicit landscape smoothing by reducing the effective smoothness constant. Experiments on MLP, CNN, and ResNet models (up to 270k parameters) report superior sample efficiency, convergence, and stability versus BCCD and randomized ZO baselines.
Significance. If the equivalence, O(1) query guarantee, and implicit-smoothing bounds hold under the stated coherence assumptions, the work could meaningfully advance lightweight ZO optimization for memory-constrained and black-box settings by providing a deterministic, structure-aware alternative to high-variance randomized methods. The counter-intuitive smoothing insight, if rigorously derived, would also merit follow-up in non-convex optimization theory.
major comments (3)
- [Abstract] The claimed equivalence of CoCD to BCCD with warm starts is presented as following from gradient coherence, yet it reads as definitional rather than derived from independent premises; a self-contained proof is required to substantiate the O(1) query complexity and the global-descent guarantee.
- [Abstract] The error bounds asserting that a larger finite-difference step size h reduces the effective smoothness constant inherit the same coherence prerequisite; without explicit bounds on gradient drift across coordinate blocks, the implicit-smoothing claim cannot be assessed as a derived result rather than a construction artifact.
- [Theoretical analysis] The assumption that historical gradients remain sufficiently coherent to preserve descent directions in non-convex landscapes (typical of ML models) is load-bearing for both the equivalence and stability claims; rapid curvature changes between blocks could render cached directions orthogonal or uphill, violating the global-descent property.
minor comments (2)
- [Experiments] The summary of results on MLP/CNN/ResNet models lacks concrete details on architectures, datasets, hyperparameter ranges, and statistical-significance testing, which are needed for reproducibility and to evaluate the reported gains in sample efficiency.
- [Abstract] The term 'gradient coherence' is used without an early formal definition or notation; introducing it with a precise mathematical statement would improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments have prompted us to strengthen the theoretical derivations and clarify the role of the coherence assumption. We address each major comment point by point below, with corresponding revisions to the manuscript.
Point-by-point responses
- Referee: [Abstract] The claimed equivalence of CoCD to BCCD with warm starts is presented as following from gradient coherence, yet it reads as definitional rather than derived from independent premises; a self-contained proof is required to substantiate the O(1) query complexity and the global-descent guarantee.
  Authors: We agree that the equivalence must be derived rigorously from independent premises rather than appearing definitional. In the revised manuscript we have added a complete, self-contained proof (new Theorem 3.1 in Section 3) that begins from the formal definition of gradient coherence, derives the equivalence to BCCD with warm starts, and explicitly establishes both the O(1) query complexity per step and the global-descent guarantee under the stated coherence condition. The proof is now independent of the abstract wording and is presented in full detail. Revision: yes.
- Referee: [Abstract] The error bounds asserting that a larger finite-difference step size h reduces the effective smoothness constant inherit the same coherence prerequisite; without explicit bounds on gradient drift across coordinate blocks, the implicit-smoothing claim cannot be assessed as a derived result rather than a construction artifact.
  Authors: We acknowledge the need for explicit control on gradient drift. The revised theoretical analysis now contains a new lemma (Lemma 4.2) that supplies quantitative bounds on the allowable gradient drift between successive coordinate blocks. Under these bounds the reduction in the effective smoothness constant is derived directly from the coherence assumption, converting the implicit-smoothing statement into a rigorously derived result rather than a definitional artifact. Revision: yes.
- Referee: [Theoretical analysis] The assumption that historical gradients remain sufficiently coherent to preserve descent directions in non-convex landscapes (typical of ML models) is load-bearing for both the equivalence and stability claims; rapid curvature changes between blocks could render cached directions orthogonal or uphill, violating the global-descent property.
  Authors: We recognize that the coherence assumption is load-bearing and that rapid curvature changes could in principle violate descent. The revised paper adds a new subsection (3.4) together with Assumption 3.3, which imposes an explicit bound on curvature variation across blocks; under this bound the descent property is preserved. We also report additional empirical measurements of gradient coherence on the ResNet experiments, confirming that the assumption holds throughout training for the models considered. We have added a brief discussion of the limitation in highly pathological non-convex regimes where the bound may be violated. Revision: partial.
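A plausible shape for the kind of drift condition and descent guarantee the rebuttal describes; the names Assumption 3.3 and Lemma 4.2 are the authors', but the formulas below are a generic reconstruction, not the paper's statements:

```latex
% Generic drift/coherence condition and the descent step it buys.
% Let g_k be the cached (stale) gradient and \nabla f(x_k) the true one.
\[
  \|g_k - \nabla f(x_k)\| \;\le\; \delta\,\|\nabla f(x_k)\|,
  \qquad 0 \le \delta < 1
  \quad\text{(drift bound)} .
\]
% Then \langle \nabla f(x_k), g_k \rangle \ge (1-\delta)\|\nabla f(x_k)\|^2,
% so -g_k is still a descent direction, and L-smoothness yields
\[
  f(x_k - \eta\,g_k)
  \;\le\; f(x_k) - \eta\,(1-\delta)\,\|\nabla f(x_k)\|^{2}
        + \tfrac{L\,\eta^{2}}{2}\,\|g_k\|^{2} .
\]
```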
Circularity Check
CoCD-to-BCCD equivalence is definitional via coherence assumption; implicit smoothing follows directly from finite-difference definition
specific steps
- Tag: self-definitional. [Theoretical analysis (abstract and Section 3)]
  Quote: "we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with ``warm starts,'' effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions."
  CoCD is constructed precisely by cycling through coordinate blocks and reusing the most recent finite-difference estimate for each block (the 'warm start'). Once gradient coherence is posited to keep these stale estimates aligned with the current descent direction, the equivalence to BCCD with warm starts follows immediately from the definition; no additional derivation is required.
- Tag: other. [Error bounds derivation (abstract)]
  Quote: "we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant"
  The finite-difference estimator is defined with step size h, and the standard smoothness bound on the gradient estimator scales with h. The 'reduction in the effective smoothness constant' is therefore an algebraic rewriting of the same definition rather than a new prediction about the landscape (see the bound after this list).
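For reference, the textbook bound this step alludes to: under L-smoothness along e_i, the central-difference estimator's bias is controlled by h (a standard computation, not quoted from the paper):

```latex
\[
  \bigl| g_i^{h}(x) - \partial_i f(x) \bigr|
  \;=\; \Bigl| \frac{1}{2h}\int_{-h}^{h}
        \bigl(\partial_i f(x + t\,e_i) - \partial_i f(x)\bigr)\,\mathrm{d}t \Bigr|
  \;\le\; \frac{1}{2h}\int_{-h}^{h} L\,|t|\,\mathrm{d}t
  \;=\; \frac{L\,h}{2} .
\]
```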
full rationale
The paper's load-bearing theoretical step equates CoCD to BCCD with warm starts by formalizing gradient coherence and then showing that reusing stale per-coordinate finite-difference estimates across blocks preserves descent directions. This equivalence holds by the algorithmic construction itself once coherence is assumed, rather than being derived from independent premises or external verification. The counter-intuitive claim that larger step sizes reduce effective smoothness is a direct algebraic consequence of the finite-difference definition and the smoothness-constant bound, not an independent prediction. No self-citation chains or fitted parameters renamed as predictions appear in the provided derivation outline. The central claims therefore retain some independent content beyond pure renaming, but the equivalence step reduces to the input definition under the coherence premise.
Axiom & Free-Parameter Ledger
free parameters (1)
- finite-difference step size
axioms (1)
- Domain assumption: gradient estimates remain coherent enough across coordinate blocks to support stable global descent directions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: Theorem 5.10 (Linear Convergence of CoCD) under L-smoothness + PL with staleness factor τ = n/B − 1 (see the rate sketch after this list).
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: Definition 5.2, Uniform Coordinate-wise Smoothness L_ε induced by the finite-difference interval ε.
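For orientation, the generic template for a linear rate under L-smoothness and the Polyak-Łojasiewicz inequality, with a staleness-dependent contraction; the exact constants in Theorem 5.10 are the paper's, and c(τ) below is only a placeholder:

```latex
% PL condition: \tfrac{1}{2}\|\nabla f(x)\|^{2} \ge \mu\,(f(x) - f^{*}).
% With exact gradients and step size \eta = 1/L:
\[
  f(x_{k+1}) - f^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)\bigl(f(x_k) - f^{*}\bigr).
\]
% A staleness factor \tau = n/B - 1 typically degrades the contraction to
\[
  f(x_{k+1}) - f^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{L}\,c(\tau)\Bigr)\bigl(f(x_k) - f^{*}\bigr),
  \qquad c(\tau) \in (0,1],\; c(0) = 1,
\]
% with c(\tau) decreasing in \tau under the paper's coherence/drift bounds.
```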
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.