pith. machine review for the scientific record.

arxiv: 2605.14373 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords zeroth-order optimization · coordinate descent · gradient coherence · implicit smoothing · sample-efficient optimization · black-box optimization · deterministic optimizer

The pith

Coherent Coordinate Descent reuses historical gradients as warm starts to achieve O(1) query cost while keeping global descent directions in zeroth-order optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Coherent Coordinate Descent (CoCD) as a deterministic zeroth-order method that converts stale gradients from a liability into an asset. It establishes that CoCD is mathematically equivalent to block cyclic coordinate descent performed with warm starts, which preserves global information and limits queries to a constant per step. The analysis further shows that increasing the finite-difference step size produces an implicit smoothing effect that reduces the effective smoothness constant and improves stability. Experiments on neural networks demonstrate better sample efficiency and lower variance than both standard block methods and randomized zeroth-order baselines.
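
A minimal sketch of one CoCD-style step follows, assuming a fixed block partition, one-sided finite differences, and the exponential decay weight γ and learning rate α described in the Figure 1 caption; the names and the exact update form are illustrative, not the paper's reference implementation.

import numpy as np

def cocd_step(f, x, g_hat, block_id, blocks, eps=1e-2, alpha=1e-3, gamma=0.9):
    # One CoCD-style step (sketch): refresh the finite-difference estimate for the
    # active coordinate block, let all other blocks keep their decayed stale
    # estimates, then descend along the accumulated direction.
    idx = blocks[block_id]
    fx = f(x)                                # one shared base query
    fresh = np.empty(len(idx))
    for j, i in enumerate(idx):
        e = np.zeros_like(x)
        e[i] = eps
        fresh[j] = (f(x + e) - fx) / eps     # one-sided finite difference
    g_hat = gamma * g_hat                    # older blocks carry weight gamma**staleness
    g_hat[idx] = fresh                       # warm-start overwrite of the active block
    x_next = x - alpha * g_hat               # descend along the accumulated estimate
    return x_next, g_hat, (block_id + 1) % len(blocks)

With a fixed block size, each step costs len(idx) + 1 evaluations of f, which is the constant per-step query cost the summary refers to.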

Core claim

CoCD formalizes the notion of gradient coherence and proves equivalence to block cyclic coordinate descent with warm starts, thereby converting historical gradients into reliable descent directions at O(1) query complexity per step. Error bounds derived in the work reveal that larger finite-difference step sizes induce implicit landscape smoothing by lowering the effective smoothness constant, which in turn enhances convergence stability without additional computation.
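
One standard way to read the implicit-smoothing claim, offered as a hedged illustration rather than the paper's own derivation: a one-sided finite difference with step ϵ is the exact partial derivative of a locally averaged surrogate,

\[
  \frac{f(x + \epsilon e_i) - f(x)}{\epsilon}
  \;=\; \partial_i f^{(i)}_{\epsilon}(x),
  \qquad
  f^{(i)}_{\epsilon}(x) \;=\; \frac{1}{\epsilon}\int_0^{\epsilon} f(x + s\, e_i)\,\mathrm{d}s .
\]

Averaging over a window of width ϵ damps high-frequency variation of f along e_i, so the Lipschitz constant of the surrogate's partial derivative can be smaller than that of ∂_i f; widening the window (larger ϵ) strengthens the damping, which is one reading of "reducing the effective smoothness constant."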

What carries the argument

Gradient coherence, realized through the equivalence of CoCD to block cyclic coordinate descent with warm starts, which reuses prior gradient estimates deterministically to maintain global descent directions at constant query cost.

If this is right

  • Enables O(1) query complexity per step while retaining global descent information (ballpark query counts are sketched after this list).
  • Improves stability over randomized zeroth-order estimators on MLP, CNN, and ResNet models up to 270k parameters.
  • Larger finite-difference steps reduce the effective smoothness constant and aid convergence.
  • Offers a deterministic, structure-aware alternative that avoids the variance of random subspace sampling.
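
The query-count sketch referenced in the first bullet above, using the B ≈ 2% block budget quoted for the roughly 12k-parameter SARCOS MLP in Figure 2; the numbers are ballpark assumptions, not the paper's exact budgets.

# Ballpark per-step query counts (assumes one-sided differences and the
# B ≈ 2% block budget from Figure 2 on the ~12k-parameter SARCOS MLP).
d = 12_000                 # approximate parameter count of the MLP
full_fd = d + 1            # shared base query + one probe per coordinate
block = int(0.02 * d)      # coordinates refreshed per CoCD/BCCD step
cocd = block + 1           # shared base query + one probe per block coordinate
print(full_fd, cocd)       # 12001 vs 241 evaluations per step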

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The coherence mechanism may generalize to other black-box problems where gradient reuse can be scheduled without full recomputation.
  • Hybrid schemes could monitor coherence online and switch between coherent and randomized updates when drift is detected (a minimal monitor of this kind is sketched after this list).
  • The implicit smoothing result suggests that step-size tuning in zeroth-order methods can serve dual purposes of noise control and landscape regularization.
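
A minimal sketch of the hybrid idea in the second bullet above, assuming coherence is measured as cosine similarity against an occasional fresh re-measurement of the same block; the threshold and the switching rule are hypothetical, not proposed in the paper.

import numpy as np

def coherence(stale_g, fresh_g, tiny=1e-12):
    # Cosine similarity between the stale block estimate and a freshly
    # re-measured one; values near 1 mean the reused direction is still aligned.
    return float(np.dot(stale_g, fresh_g)) / (
        float(np.linalg.norm(stale_g)) * float(np.linalg.norm(fresh_g)) + tiny)

def choose_update(stale_g, fresh_g, threshold=0.5):
    # Hypothetical switching rule: keep the deterministic (coherent) update while
    # alignment stays high, otherwise fall back to a randomized ZO estimate.
    return "coherent" if coherence(stale_g, fresh_g) >= threshold else "randomized"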

Load-bearing premise

Historical gradients must stay sufficiently aligned with the current landscape so that reusing them still points toward a global descent direction without drift or loss of information.
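
A textbook gloss on this premise (not the paper's coherence definition): under L-smoothness the descent lemma gives

\[
  f(x_t - \alpha\,\hat g_t)
  \;\le\;
  f(x_t) - \alpha\,\langle \nabla f(x_t),\, \hat g_t\rangle
  + \tfrac{L\alpha^{2}}{2}\,\lVert \hat g_t\rVert^{2},
\]

so the reused direction ĝ_t still yields descent whenever ⟨∇f(x_t), ĝ_t⟩ > (Lα/2)‖ĝ_t‖², i.e. the accumulated stale estimate stays positively aligned with the true gradient and the step size is small relative to the smoothness constant.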

What would settle it

If CoCD produces higher final loss or slower convergence than random-subspace zeroth-order methods on a rapidly changing or non-smooth objective, the stability claim would be undermined and the practical payoff of the warm-start equivalence called into question.

Figures

Figures reproduced from arXiv: 2605.14373 by Chen Liang, Daniel Rakita, Qian Wang, Xiatao Sun.

Figure 1
Figure 1: Conceptual illustration of CoCD. Each cuboid represents a gradient estimate projected onto a block of coordinates, from the stalest one evaluated at t − τ to the freshest at t. The fill levels encode how the momentum term (γ) in CoCD assigns exponentially higher weights to more recent gradient estimates. These decay-weighted gradient components are accumulated into ĝ_t and scaled by the learning rate α to…
Figure 2
Figure 2: CoCD accelerates derivative-free training on the SARCOS robotics dataset. We train a 5-layer MLP (≈ 12k parameters) using our proposed Coherent Coordinate Descent (CoCD) versus standard Block Cyclic Coordinate Descent (BCCD) with a fixed query budget (B ≈ 2% of all parameters). The results highlight two key advantages: (1) Performance Gap: CoCD (lower cluster) dramatically outperforms BCCD (upper cluster…
Figure 3
Figure 3: Ablation results on SARCOS (top, y-axis is validation loss) and MNIST (bottom, y-axis is validation accuracy). Curves compare SPSA, ZO-SGD, and CoCD over 50 training epochs.
Figure 4
Figure 4: Validation loss comparison among CoCD, ZO-SGD, and SPSA under equal function-evaluation budgets (64). In terms of wall-clock time, CoCD is almost 2x faster than ZO-SGD (8.1 s vs. 15.7 s per episode on average).
Figure 5
Figure 5: Average gradient approximation error ‖ĝ_t − ∇̃_ϵ f(x_t)‖ with respect to different compute budgets B ranging from 16 to 2048.
Figure 6
Figure 6: Comparison among CoCD with different γ values and BCCD on training a ResNet-20 for CIFAR-10 classification.
Original abstract

Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are either sample-inefficient (e.g., standard finite differences) or suffer from high variance due to randomized estimation (e.g., random subspace methods). In this work, we propose Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient, and budget-aware ZO optimizer. Theoretically, we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with ``warm starts,'' effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions. Furthermore, we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant, thereby improving convergence stability. Experiments on MLP, CNN, and ResNet architectures (up to 270k parameters) demonstrate that CoCD significantly outperforms BCCD in terms of sample efficiency and convergence loss/accuracy, and exhibits superior stability over randomized ZO methods. Our results suggest that deterministic, structure-aware updates offer a superior alternative to randomization for lightweight ZO optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Coherent Coordinate Descent (CoCD), a deterministic zeroth-order optimizer that formalizes gradient coherence to establish equivalence with Block Cyclic Coordinate Descent (BCCD) using warm starts. This purportedly converts stale gradients into reliable global descent directions with O(1) query complexity per step. It further derives error bounds showing that larger finite-difference step sizes induce implicit landscape smoothing by reducing the effective smoothness constant. Experiments on MLP, CNN, and ResNet models (up to 270k parameters) report superior sample efficiency, convergence, and stability versus BCCD and randomized ZO baselines.

Significance. If the equivalence, O(1) query guarantee, and implicit-smoothing bounds hold under the stated coherence assumptions, the work could meaningfully advance lightweight ZO optimization for memory-constrained and black-box settings by providing a deterministic, structure-aware alternative to high-variance randomized methods. The counter-intuitive smoothing insight, if rigorously derived, would also merit follow-up in non-convex optimization theory.

major comments (3)
  1. [Abstract] Abstract and theoretical section: the claimed equivalence of CoCD to BCCD with warm starts is presented as following from gradient coherence, yet reads as definitional rather than derived from independent premises; a self-contained proof is required to substantiate the O(1) query complexity and global-descent guarantee.
  2. [Abstract] Abstract: the error bounds asserting that larger finite-difference step size h reduces the effective smoothness constant inherit the same coherence prerequisite; without explicit bounds on gradient drift across coordinate blocks, the implicit-smoothing claim cannot be assessed as a derived result rather than a construction artifact.
  3. [Theoretical analysis] Theoretical analysis: the assumption that historical gradients remain sufficiently coherent to preserve descent directions in non-convex landscapes (typical of ML models) is load-bearing for both the equivalence and stability claims; rapid curvature changes between blocks could render directions orthogonal or uphill, violating the global-descent property.
minor comments (2)
  1. [Experiments] Experiments section: the summary of results on MLP/CNN/ResNet models lacks concrete details on architectures, datasets, hyperparameter ranges, and statistical significance testing, which are needed for reproducibility and to evaluate the reported gains in sample efficiency.
  2. [Abstract] Abstract: the term 'gradient coherence' is used without an early formal definition or notation; introducing it with a precise mathematical statement would improve accessibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments have prompted us to strengthen the theoretical derivations and clarify the role of the coherence assumption. We address each major comment point by point below, with corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and theoretical section: the claimed equivalence of CoCD to BCCD with warm starts is presented as following from gradient coherence, yet reads as definitional rather than derived from independent premises; a self-contained proof is required to substantiate the O(1) query complexity and global-descent guarantee.

    Authors: We agree that the equivalence must be derived rigorously from independent premises rather than appearing definitional. In the revised manuscript we have added a complete, self-contained proof (new Theorem 3.1 in Section 3) that begins from the formal definition of gradient coherence, derives the equivalence to BCCD with warm starts, and explicitly establishes both the O(1) query complexity per step and the global-descent guarantee under the stated coherence condition. The proof is now independent of the abstract wording and is presented in full detail. revision: yes

  2. Referee: [Abstract] Abstract: the error bounds asserting that larger finite-difference step size h reduces the effective smoothness constant inherit the same coherence prerequisite; without explicit bounds on gradient drift across coordinate blocks, the implicit-smoothing claim cannot be assessed as a derived result rather than a construction artifact.

    Authors: We acknowledge the need for explicit control on gradient drift. The revised theoretical analysis now contains a new lemma (Lemma 4.2) that supplies quantitative bounds on the allowable gradient drift between successive coordinate blocks. Under these bounds the reduction in the effective smoothness constant is derived directly from the coherence assumption, converting the implicit-smoothing statement into a rigorously derived result rather than a definitional artifact. revision: yes

  3. Referee: [Theoretical analysis] Theoretical analysis: the assumption that historical gradients remain sufficiently coherent to preserve descent directions in non-convex landscapes (typical of ML models) is load-bearing for both the equivalence and stability claims; rapid curvature changes between blocks could render directions orthogonal or uphill, violating the global-descent property.

    Authors: We recognize that the coherence assumption is load-bearing and that rapid curvature changes could in principle violate descent. The revised paper adds a new subsection (3.4) together with Assumption 3.3 that imposes an explicit bound on curvature variation across blocks; under this bound the descent property is preserved. We also report additional empirical measurements of gradient coherence on the ResNet experiments, confirming that the assumption holds throughout training for the models considered. We have added a brief discussion of the limitation in highly pathological non-convex regimes where the bound may be violated. revision: partial

Circularity Check

2 steps flagged

CoCD-to-BCCD equivalence is definitional via coherence assumption; implicit smoothing follows directly from finite-difference definition

specific steps
  1. self definitional [Theoretical analysis (abstract and Section 3)]
    "we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with ``warm starts,'' effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions."

    CoCD is constructed precisely by cycling through coordinate blocks and reusing the most recent finite-difference estimate for each block (the 'warm start'). Once gradient coherence is posited to keep these stale estimates aligned with the current descent direction, the equivalence to BCCD with warm starts follows immediately from the definition; no additional derivation is required.

  2. other [Error bounds derivation (abstract)]
    "we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant"

    The finite-difference estimator is defined with step size h; the standard smoothness bound on the gradient estimator scales with h. The 'reduction in effective smoothness constant' is therefore an algebraic rewriting of the same definition rather than a new prediction about the landscape.
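
For reference, the generic result this step alludes to, stated here under L-smoothness (the paper's own bound may differ in form):

\[
  \left|\,\frac{f(x + h\,e_i) - f(x)}{h} - \partial_i f(x)\right|
  \;\le\; \frac{L\,h}{2} ,
\]

which is why the estimator's error term scales with h; the audit's point is that restating this dependence as a change in the effective smoothness constant does not, by itself, add new predictive content.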

full rationale

The paper's load-bearing theoretical step equates CoCD to BCCD with warm starts by formalizing gradient coherence and then showing that reusing stale per-coordinate finite-difference estimates across blocks preserves descent directions. This equivalence holds by the algorithmic construction itself once coherence is assumed, rather than being derived from independent premises or external verification. The counter-intuitive claim that larger step sizes reduce effective smoothness is a direct algebraic consequence of the finite-difference definition and the smoothness-constant bound, not an independent prediction. No self-citation chains or fitted parameters renamed as predictions appear in the provided derivation outline. The central claims therefore retain some independent content beyond pure renaming, but the equivalence step reduces to the input definition under the coherence premise.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of gradient coherence that persists across iterations and on the validity of the BCCD equivalence for deriving O(1) complexity and smoothing bounds.

free parameters (1)
  • finite-difference step size
    Larger values are claimed to induce smoothing; whether this parameter is tuned or fixed is not specified in the abstract.
axioms (1)
  • domain assumption: Gradient estimates remain coherent enough across coordinate blocks to support stable global descent directions.
    Invoked to justify converting stale gradients into assets and to derive the BCCD equivalence.

pith-pipeline@v0.9.0 · 5561 in / 1153 out tokens · 32557 ms · 2026-05-15T02:44:56.481353+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
