pith. machine review for the scientific record.

arxiv: 2604.14587 · v1 · submitted 2026-04-16 · 💻 cs.LG · math.OC · stat.ML

Recognition: unknown

CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization


Pith reviewed 2026-05-10 11:49 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords Lion optimizer · generalization bound · convergence rate · cautious sign · nonconvex optimization · algorithmic stability · deep learning · stochastic gradient

The pith

CLion achieves a generalization error of O(1/N) by using a cautious sign function on the Lion optimizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes the generalization properties of the Lion optimizer using algorithmic stability and mathematical induction, revealing a bound of O(1/(N τ^T)) that depends on a small parameter τ. It then proposes the Cautious Lion (CLion) optimizer, which applies the sign function more carefully to eliminate this dependence and achieve a bound of O(1/N). The authors further establish a convergence rate for CLion of O(√d / T^{1/4}) under the ℓ1-norm for nonconvex stochastic optimization. A sympathetic reader would care because these bounds provide theoretical support for using the optimizer in training deep neural networks with better expected performance on unseen data.
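To see why removing the τ dependence matters, a back-of-the-envelope check (illustrative numbers of our choosing, not the paper's) shows how quickly the 1/τ^T factor makes the Lion bound vacuous at realistic iteration counts, while the CLion bound stays tied to the sample size:

```python
# Illustrative only: compare the shapes of the two bounds for a
# moderate tau, iteration count T, and sample size N.
tau, T, N = 0.5, 100, 50_000

lion_bound = 1.0 / (N * tau ** T)   # O(1/(N tau^T)): explodes as tau^T -> 0
clion_bound = 1.0 / N               # O(1/N): independent of tau and T

# Even at tau = 0.5 the Lion bound exceeds 1e25, i.e. it is vacuous,
# while the CLion bound is 2e-5 and shrinks with more data.
```

In practice τ (the smallest absolute non-zero gradient entry) is far below 0.5, so the gap the authors point to is even larger.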

Core claim

We prove that the Lion optimizer has a generalization error of O(1/(N τ^T)), and that the SignSGD algorithm shares this bound. By designing a novel Cautious Lion (CLion) optimizer that uses the sign function cautiously, we obtain a lower generalization error of O(1/N). We also prove that CLion has a convergence rate of O(√d / T^{1/4}) under the ℓ1-norm of the gradient for nonconvex stochastic optimization.

What carries the argument

The cautious sign function modification that removes the dependence on the small non-zero gradient value τ from the generalization bound.
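As a concrete sketch: the standard Lion update steps in the sign of an interpolated momentum. The paper's exact CLion rule is not reproduced in this summary; the masking rule below (`u * g > 0`) is our assumption about what "using the sign function cautiously" could mean, shown alongside a standard Lion step for contrast. Function names and hyper-parameter defaults are illustrative.

```python
import numpy as np

def lion_step(w, m, g, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update (Chen et al., 2023): step in the sign of an
    interpolation between momentum m and the fresh gradient g."""
    c = beta1 * m + (1.0 - beta1) * g      # interpolated direction
    w = w - lr * (np.sign(c) + wd * w)     # sign step + decoupled weight decay
    m = beta2 * m + (1.0 - beta2) * g      # momentum refresh
    return w, m

def clion_step(w, m, g, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """Hypothetical cautious variant: zero out sign entries that disagree
    with the current gradient, so sign noise on near-zero coordinates
    (the source of the tau dependence) cannot drive an update."""
    c = beta1 * m + (1.0 - beta1) * g
    u = np.sign(c)
    u = np.where(u * g > 0, u, 0.0)        # assumed 'cautious' mask
    w = w - lr * (u + wd * w)
    m = beta2 * m + (1.0 - beta2) * g
    return w, m
```

Whatever the paper's precise rule, the stability argument needs the effective per-coordinate step to stop scaling with the tiny non-zero gradient value τ; any mask with that property would play the same role in the bound.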

If this is right

  • CLion offers improved generalization guarantees compared to Lion for the same training sample size N.
  • The convergence analysis supports efficient optimization in high-dimensional nonconvex problems.
  • CLion can replace Lion in deep learning training pipelines with stronger theoretical backing.
  • SignSGD has the same weak generalization bound as Lion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This modification technique could be applied to other sign-based optimizers to improve their theoretical properties.
  • Empirical tests on standard benchmarks would likely show CLion generalizing better when τ is small in Lion runs.
  • The approach highlights the importance of stability analysis in designing new optimizers beyond just convergence.

Load-bearing premise

That the parameter τ representing the smallest absolute non-zero gradient element is generally very small in practice, and that the cautious sign modification maintains the optimizer's ability to converge without new issues.

What would settle it

An experiment that measures the value of τ during actual Lion training runs on deep models and checks whether CLion's generalization error scales as 1/N with sample size N while Lion's does not.
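The first half of that experiment is cheap to instrument. A minimal sketch (our code, not the paper's) of measuring τ at one training step, pooled across all parameter tensors:

```python
import numpy as np

def smallest_nonzero_abs(grads):
    """tau for one step: the smallest absolute value among all non-zero
    entries of the gradient estimator, pooled across parameter tensors.
    Returns None if every entry is exactly zero."""
    vals = np.concatenate([np.abs(np.asarray(g)).ravel() for g in grads])
    nonzero = vals[vals > 0]
    return float(nonzero.min()) if nonzero.size else None
```

Logging this value over a full Lion run would show directly how severe the τ^T factor in the bound is in practice; the second half of the test, the 1/N scaling of CLion's generalization error, requires training at several sample sizes.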

Figures

Figures reproduced from arXiv: 2604.14587 by Feihu Huang, Guanyi Zhang, Songcan Chen.

Figure 1
Figure 1. Illustration of different active functions. [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Language modeling on the Wikitext-2 dataset. (a) Train Loss (b) Train Perplexity (c) Test Loss (d) Test Perplexity [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Language modeling on the Wikitext-103 dataset. From §6.1: given training samples $\{z^i\}_{i=1}^{N}$, the language modeling task solves the nonconvex problem $\min_{w\in\mathbb{R}^d} -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{m_i} \log p(z^i_t \mid z^i_{1:t-1}; w)$ (26), where each sample $z^i$ includes $m_i$ tokens and $p(z^i_t \mid z^i_{1:t-1}; w)$ denotes the probability of token $z^i_t$ given the tokens $z^i_{1:t-1}$… view at source ↗
Figure 4
Figure 4. Image classification on the CIFAR-10 dataset. From §6.2: two deep learning models are trained for image classification; given training samples $\{x_i, y_i\}_{i=1}^{N}$, where $x_i$ denotes features and $y_i$ the label, the model is trained by solving $\min_{w\in\mathbb{R}^d} \frac{1}{N}\sum_{i=1}^{N} \ell$… [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Image classification on the tiny-ImageNet dataset. The authors then evaluate ResNet34 [He et al., 2016] on the tiny-ImageNet [Le and Yang, 2015] dataset, whose training and test datasets contain 80000 and 20000 samples, respectively. For all hyper-parameters they run a grid search and report the best setting for each optimizer. When training ResNet18 on CIFAR-10, the batch size is set to 64 for a… view at source ↗
read the original abstract

Lion optimizer is a popular learning-based optimization algorithm in machine learning, which shows impressive performance in training many deep learning models. Although convergence property of the Lion optimizer has been studied, its generalization analysis is still missing. To fill this gap, we study generalization property of the Lion via algorithmic stability based on the mathematical induction. Specifically, we prove that the Lion has a generalization error of $O(\frac{1}{N\tau^T})$, where $N$ is training sample size, and $\tau>0$ denotes the smallest absolute value of non-zero element in gradient estimator, and $T$ is the total iteration number. In addition, we obtain an interesting byproduct that the SignSGD algorithm has the same generalization error as the Lion. To enhance generalization of the Lion, we design a novel efficient Cautious Lion (i.e., CLion) optimizer by cautiously using sign function. Moreover, we prove that our CLion has a lower generalization error of $O(\frac{1}{N})$ than $O(\frac{1}{N\tau^T})$ of the Lion, since the parameter $\tau$ generally is very small. Meanwhile, we study convergence property of our CLion optimizer, and prove that our CLion has a fast convergence rate of $O(\frac{\sqrt{d}}{T^{1/4}})$ under $\ell_1$-norm of gradient for nonconvex stochastic optimization, where $d$ denotes the model dimension. Extensive numerical experiments demonstrate effectiveness of our CLion optimizer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This paper analyzes the generalization properties of the Lion optimizer using algorithmic stability and mathematical induction, deriving a bound of O(1/(N τ^T)) where τ is the smallest absolute non-zero value in the gradient estimator. It introduces the Cautious Lion (CLion) optimizer that uses a cautious sign function to achieve a better generalization bound of O(1/N), assuming τ is generally small. It also proves that SignSGD has the same generalization bound as Lion and establishes a convergence rate of O(√d / T^{1/4}) for CLion in nonconvex stochastic optimization under the ℓ1-norm of the gradient. The claims are supported by numerical experiments.

Significance. Should the proofs be complete and the assumption on τ validated through analysis or experiments, this would represent a meaningful contribution to the theoretical understanding of sign-based optimizers, potentially guiding improvements in generalization for deep learning training. The convergence result provides a specific rate that could be useful for non-convex problems. The connection to SignSGD is a nice observation. The work has potential impact if the load-bearing assumptions are addressed.

major comments (3)
  1. [Generalization analysis of Lion] The derivation of the O(1/(N τ^T)) bound via mathematical induction on algorithmic stability is not detailed with specific steps or equations showing how the stability constant incorporates τ^T. This is critical as it underpins the entire comparison to CLion.
  2. [CLion generalization claim] The statement that CLion has O(1/N) generalization error 'since the parameter τ generally is very small' lacks any supporting evidence, such as empirical distribution of τ values or a proof that the cautious modification makes the bound independent of τ. This assumption is load-bearing for the central claim of enhanced generalization.
  3. [Convergence proof for CLion] Details of the proof for the convergence rate O(√d / T^{1/4}) are missing, including how the cautious sign usage affects the analysis compared to standard Lion and any additional assumptions required.
minor comments (2)
  1. [Abstract] The abstract could benefit from a brief description of how the cautious sign function is defined in the CLion update rule.
  2. [Notation] Ensure consistent use of symbols, such as clarifying if T is iterations and N samples throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight areas where additional detail and evidence will strengthen the manuscript. We address each major comment below and commit to revisions that provide the requested clarifications without altering the core claims.

read point-by-point responses
  1. Referee: [Generalization analysis of Lion] The derivation of the O(1/(N τ^T)) bound via mathematical induction on algorithmic stability is not detailed with specific steps or equations showing how the stability constant incorporates τ^T. This is critical as it underpins the entire comparison to CLion.

    Authors: We agree that the inductive steps were presented too concisely. In the revised manuscript we will expand the algorithmic stability section for Lion to include the complete inductive argument. Specifically, we will show the base case for stability after one iteration and the inductive step in which the stability constant is multiplied by τ at each subsequent iteration, yielding the factor τ^T after T steps. This expanded derivation will make the comparison with the CLion bound explicit. revision: yes

  2. Referee: [CLion generalization claim] The statement that CLion has O(1/N) generalization error 'since the parameter τ generally is very small' lacks any supporting evidence, such as empirical distribution of τ values or a proof that the cautious modification makes the bound independent of τ. This assumption is load-bearing for the central claim of enhanced generalization.

    Authors: The referee correctly notes that the improvement for CLion rests on the cautious sign function eliminating the dependence on τ. We will add a formal lemma in the revision proving that the cautious thresholding ensures the effective multiplier remains bounded away from zero, yielding a generalization bound of O(1/N) independent of τ. To further support the original Lion analysis, we will include new experiments reporting the empirical distribution of τ values observed during training on standard image-classification and language-modeling benchmarks. revision: partial

  3. Referee: [Convergence proof for CLion] Details of the proof for the convergence rate O(√d / T^{1/4}) are missing, including how the cautious sign usage affects the analysis compared to standard Lion and any additional assumptions required.

    Authors: We will supply the full convergence proof in the appendix of the revised version. The proof proceeds by bounding the expected ℓ1-norm of the gradient after each cautious update; the thresholding step reduces the contribution of near-zero noisy signs, which improves the variance term relative to standard Lion and produces the T^{-1/4} rate. All assumptions (L-smoothness, bounded stochastic gradient variance, and bounded model dimension) will be stated explicitly, together with a short comparison paragraph highlighting where the cautious modification alters the standard Lion analysis. revision: yes

Circularity Check

0 steps flagged

No significant circularity; bounds derived via induction on explicit parameters

full rationale

The paper states that Lion's O(1/(N τ^T)) generalization bound and CLion's O(1/N) bound are obtained via algorithmic stability and mathematical induction, with τ defined explicitly as the smallest absolute non-zero gradient-estimator entry. The abstract presents the CLion improvement as following from the cautious-sign modification and the external observation that τ is generally small, without any equation or step in the provided text reducing the claimed result to a fitted input, self-definition, or self-citation chain. The convergence rate O(√d / T^{1/4}) is likewise stated as a separate first-principles derivation under ℓ1-norm. Because no load-bearing step collapses by construction to its own inputs, the derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Central claims depend on unshown induction proofs for stability, the domain assumption that sign-based updates can be modified cautiously without harming convergence, and the observation that τ is small. No free parameters are explicitly fitted, but τ functions as a data-dependent quantity that weakens the Lion bound; no new entities are invented.

free parameters (1)
  • τ
    Defined as smallest absolute non-zero gradient estimator element; treated as generally small to make Lion bound inferior, but no fitting process described.
axioms (2)
  • standard math Algorithmic stability can be analyzed via mathematical induction for Lion and CLion
    Invoked to derive the generalization error bounds O(1/(N τ^T)) and O(1/N).
  • domain assumption Cautious application of the sign function preserves key optimization properties
    Required to claim CLion retains fast convergence while improving generalization.

pith-pipeline@v0.9.0 · 5580 in / 1617 out tokens · 30666 ms · 2026-05-10T11:49:13.044632+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    signsgd: Compressed optimisation for non-convex problems

    Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signsgd: Compressed optimisation for non-convex problems. In International conference on machine learning, pages 560--569. PMLR, 2018

  2. [2]

    Optimization methods for large-scale machine learning

    Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223--311, 2018

  3. [3]

    Lion secretly solves constrained optimization: As Lyapunov predicts

    Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves constrained optimization: As Lyapunov predicts. arXiv preprint arXiv:2310.05898, 2023a

  4. [4]

    Symbolic discovery of optimization algorithms

    Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems, 36:49205--49233, 2023b

  5. [5]

    Scalable learning to optimize: A learned optimizer can train big models

    Xuxi Chen, Tianlong Chen, Yu Cheng, Weizhu Chen, Ahmed Awadallah, and Zhangyang Wang. Scalable learning to optimize: A learned optimizer can train big models. In European Conference on Computer Vision, pages 389--405. Springer, 2022

  6. [6]

    Momentum-based variance reduction in non-convex sgd

    Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019

  7. [7]

    Convergence rate analysis of Lion

    Yiming Dong, Huan Li, and Zhouchen Lin. Convergence rate analysis of Lion. arXiv preprint arXiv:2411.07724, 2024

  8. [8]

    An algorithm for quadratic programming

    Marguerite Frank, Philip Wolfe, et al. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95--110, 1956

  9. [9]

    Stochastic first-and zeroth-order methods for nonconvex stochastic programming

    Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341--2368, 2013

  10. [10]

    Train faster, generalize better: Stability of stochastic gradient descent

    Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pages 1225--1234. PMLR, 2016

  11. [11]

    A closer look at learned optimization: Stability, robustness, and inductive biases

    James Harrison, Luke Metz, and Jascha Sohl-Dickstein. A closer look at learned optimization: Stability, robustness, and inductive biases. Advances in neural information processing systems, 35: 0 3758--3773, 2022

  12. [12]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770--778, 2016

  13. [13]

    Convergence analysis of the Lion optimizer in centralized and distributed settings

    Wei Jiang and Lijun Zhang. Convergence analysis of the Lion optimizer in centralized and distributed settings. arXiv preprint arXiv:2508.12327, 2025

  14. [14]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  15. [15]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  16. [16]

    Tiny imagenet visual recognition challenge

    Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015

  17. [17]

    Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems

    Yunwen Lei. Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems. In The Thirty Sixth Annual Conference on Learning Theory, pages 191--227. PMLR, 2023

  18. [18]

    Fine-grained analysis of stability and generalization for stochastic gradient descent

    Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pages 5809--5819. PMLR, 2020

  19. [19]

    Communication efficient distributed training with distributed lion

    Bo Liu, Lemeng Wu, Lizhang Chen, Kaizhao Liang, Jiaxu Zhu, Chen Liang, Raghuraman Krishnamoorthi, and Qiang Liu. Communication efficient distributed training with distributed Lion. Advances in Neural Information Processing Systems, 37:18388--18415, 2024

  20. [20]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  21. [21]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016

  22. [22]

    Lectures on convex optimization, volume 137

    Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018

  23. [23]

    On the generalization of stochastic gradient descent with momentum

    Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, and Ben Liang. On the generalization of stochastic gradient descent with momentum. Journal of Machine Learning Research, 25(22):1--56, 2024

  24. [24]

    Stochastic frank-wolfe methods for nonconvex optimization

    Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1244--1251. IEEE, 2016

  25. [25]

    A stochastic approximation method

    Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400--407, 1951

  26. [26]

    A refined lion optimizer for deep learning

    Jian Rong, Chenhao Ma, Qinghui Zhang, Yong Cao, and Weili Kou. A refined Lion optimizer for deep learning. Scientific Reports, 15(1):23082, 2025

  27. [27]

    Lions and muons: Optimization via stochastic Frank-Wolfe

    Maria-Eleni Sfyraki and Jun-Kun Wang. Lions and muons: Optimization via stochastic Frank-Wolfe. arXiv preprint arXiv:2506.04192, 2025

  28. [28]

    Learnability, stability and uniform convergence

    Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635--2670, 2010

  29. [29]

    On the importance of initialization and momentum in deep learning

    Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139--1147. PMLR, 2013

  30. [30]

    A hybrid stochastic optimization framework for composite nonconvex optimization

    Quoc Tran-Dinh, Nhan H Pham, Dzung T Phan, and Lam M Nguyen. A hybrid stochastic optimization framework for composite nonconvex optimization. Mathematical Programming, 191(2):1005--1071, 2022

  31. [31]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  32. [32]

    Visual transformers: Token-based image representation and processing for computer vision

    Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision, 2020

  33. [33]

    Sign-based optimizers are effective under heavy-tailed noise

    Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, and Lijun Zhang. Sign-based optimizers are effective under heavy-tailed noise. arXiv preprint arXiv:2602.07425, 2026

  34. [34]

    Mars: Unleashing the power of variance reduction for training large models

    Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, and Quanquan Gu. Mars: Unleashing the power of variance reduction for training large models. arXiv preprint arXiv:2411.10438, 2024

  35. [35]

    Mathematical analysis of machine learning algorithms

    Tong Zhang. Mathematical analysis of machine learning algorithms. Cambridge University Press, 2023