A Retraction-Free EXTRA Method for Decentralized Optimization on the Stiefel Manifold

Jiang Hu; Shu Li

arxiv: 2604.23754 · v2 · submitted 2026-04-26 · 🧮 math.OC

Pith reviewed 2026-05-08 05:48 UTC · model grok-4.3

keywords decentralized optimization · Stiefel manifold · retraction-free method · EXTRA algorithm · O(1/K) convergence · orthogonality constraints · primal-dual optimization · distributed learning

The pith

RF-EXTRA achieves an exact O(1/K) convergence rate to stationary points for decentralized optimization on the Stiefel manifold with constant step sizes and no retractions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a decentralized method called RF-EXTRA for solving optimization problems subject to orthogonality constraints on the Stiefel manifold. It combines an approximate gradient mapping to handle the manifold constraints with an EXTRA-based decentralized recursion to enable distributed computation without retractions. The analysis focuses on the contractivity of the joint error between local variables and their network averages, which allows the use of small constant step sizes. This leads to an O(1/K) convergence guarantee, which is useful for large-scale distributed tasks like principal component analysis where retractions would be computationally expensive. Sympathetic readers would care because it simplifies communication and avoids manifold-specific operations while maintaining convergence.
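To make the "EXTRA-based decentralized recursion" concrete, here is a minimal sketch of the classical Euclidean EXTRA update of Shi, Ling, Wu, and Yin on a toy problem. It is not RF-EXTRA: there is no manifold constraint and no approximate gradient mapping. The ring weights, the quadratic local losses, the step size 0.3, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def ring_mixing_matrix(num_agents):
    """Symmetric doubly stochastic weights on a ring: 1/3 on self and on each neighbour."""
    W = np.zeros((num_agents, num_agents))
    for i in range(num_agents):
        for j in (i - 1, i, i + 1):
            W[i, j % num_agents] = 1.0 / 3.0
    return W

def extra(grad, X0, W, step, num_iters):
    """Classical Euclidean EXTRA; row i of X is agent i's local copy of the variable."""
    W_tilde = 0.5 * (np.eye(W.shape[0]) + W)
    X_prev = X0
    G_prev = grad(X_prev)
    X = W @ X_prev - step * G_prev            # the first step is a plain mixing-plus-gradient step
    for _ in range(num_iters):
        G = grad(X)
        # x^{k+2} = (I + W) x^{k+1} - W_tilde x^k - step * (grad^{k+1} - grad^k)
        X_next = X + W @ X - W_tilde @ X_prev - step * (G - G_prev)
        X_prev, G_prev, X = X, G, X_next
    return X

rng = np.random.default_rng(0)
num_agents, dim = 8, 3
B = rng.standard_normal((num_agents, dim))    # agent i holds f_i(x) = 0.5 * ||x - b_i||^2
X_final = extra(lambda X: X - B, np.zeros((num_agents, dim)),
                ring_mixing_matrix(num_agents), step=0.3, num_iters=300)

consensus_gap = np.linalg.norm(X_final - X_final.mean(axis=0))
optimality_gap = np.linalg.norm(X_final.mean(axis=0) - B.mean(axis=0))
print(consensus_gap, optimality_gap)
```

With a constant step the agents agree on the minimizer of the summed loss (the mean of the `b_i`), which is the "exact" behaviour that the gradient-difference correction is designed to provide.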

Core claim

RF-EXTRA is a distributed retraction-free primal-dual method that, by establishing a contractive recursion for the joint error $(\mathbf{X}_k-\overline{\mathbf X}_k,\mathbf{s}_k-\overline{\mathbf s}_k)$, ensures that the joint error can be controlled using small yet constant step sizes, leading to an exact O(1/K) convergence rate to a stationary point on static undirected networks.
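One common way an O(1/K) statement of this kind arises is from a summability bound. The sketch below assumes a nonnegative stationarity measure $G_k$ and a constant $C$; both symbols are placeholders, not the paper's notation, and the paper's actual measure may differ.

```latex
% Assumption: the analysis yields a constant C > 0, independent of K, such that
\sum_{k=0}^{K-1} G_k \;\le\; C \qquad \text{for all } K \ge 1 .
% The smallest term is at most the average, hence
\min_{0 \le k \le K-1} G_k \;\le\; \frac{1}{K}\sum_{k=0}^{K-1} G_k \;\le\; \frac{C}{K} .
```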

What carries the argument

The joint error vector consisting of deviations in local variables and local directions from their averages, whose contractive recursion is established under the approximate gradient mapping and EXTRA recursion.
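A minimal sketch of how such a joint error could be measured numerically, assuming the local variables and local directions are stored as arrays of shape `(num_agents, n, p)`. The function names and the choice of a Frobenius-type norm are assumptions for illustration.

```python
import numpy as np

def joint_error(X, S):
    """Deviations of local variables X and local directions S from their network averages.

    X, S: arrays of shape (num_agents, n, p).
    """
    X_dev = X - X.mean(axis=0, keepdims=True)
    S_dev = S - S.mean(axis=0, keepdims=True)
    return X_dev, S_dev

def joint_error_norm(X, S):
    """Euclidean norm of the stacked deviation pair."""
    X_dev, S_dev = joint_error(X, S)
    return float(np.sqrt(np.sum(X_dev ** 2) + np.sum(S_dev ** 2)))

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 5, 2))
S = rng.standard_normal((4, 5, 2))
X_dev, S_dev = joint_error(X, S)
print(joint_error_norm(X, S))
```

The deviations sum to zero across agents by construction, and the norm vanishes exactly when every agent holds the same variable and the same direction.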

Load-bearing premise

The joint-error recursion remains contractive when the approximate gradient mapping for the orthogonality constraints is paired with the EXTRA decentralized update on static undirected networks.
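The summary does not define the approximate gradient mapping, so the sketch below substitutes a different, well-known retraction-free direction, the "landing" field from the retraction-free literature, purely to show what an update without retraction can look like. It is a stand-in and is not claimed to be the paper's mapping; the PCA-type test matrix `A`, the step 0.1, and the `penalty` weight are illustrative assumptions.

```python
import numpy as np

def landing_direction(X, euclid_grad, penalty=1.0):
    """Two parts of a landing-type direction at an n-by-p matrix X.

    tangent_like: skew-symmetric 'relative gradient' applied to X.
    normal_pull:  gradient of (penalty / 4) * ||X^T X - I||_F^2, pulling X toward orthonormality.
    """
    GXt = euclid_grad @ X.T
    skew = 0.5 * (GXt - GXt.T)
    tangent_like = skew @ X
    normal_pull = penalty * X @ (X.T @ X - np.eye(X.shape[1]))
    return tangent_like, normal_pull

rng = np.random.default_rng(0)
n, p = 8, 2
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag([1.0, 0.8, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]) @ Q.T
X = np.linalg.qr(rng.standard_normal((n, p)))[0]     # start with orthonormal columns

for _ in range(2000):
    # f(X) = -0.5 * tr(X^T A X), so the Euclidean gradient is -A X
    t, nrm = landing_direction(X, -A @ X)
    X = X - 0.1 * (t + nrm)                           # no QR, polar, or other retraction

feasibility = np.linalg.norm(X.T @ X - np.eye(p))
objective = -0.5 * np.trace(X.T @ A @ X)
print(feasibility, objective)
```

The iterates are allowed to leave the manifold; the penalty part drives the constraint violation back toward zero while the skew-symmetric part reduces the objective.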

What would settle it

Observing, on a static undirected network with the given mapping and a constant step size inside the range the analysis permits, that the joint error fails to contract or that convergence is slower than O(1/K) would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.23754 by Jiang Hu, Shu Li.

**Figure 1.** Synthetic decentralized PCA: robustness of RF-EXTRA with respect to graph topology and …

**Figure 2.** Synthetic decentralized PCA on ER(0.6) versus communication quantities. Each method uses its best step size selected from $\{1, 2, 4, 6, 8\} \times \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$. Under this matched search space, RF-EXTRA, DESTINY, DPRGT, and REXTRA all select $\hat\beta = 0.08$, while DPRGD selects $\hat\beta = 0.006$. [RF-EXTRA] is competitive with the strongest baselines across the communication budget, and its performance is close to that of…

**Figure 3.** Decentralized PCA on the MNIST dataset versus communication quantities. RF-EXTRA, DES…

**Figure 4.** Decentralized LRMC on the ring graph versus communication quantities. Only the stationarity …

**Figure 5.** Decentralized LRMC on the ring graph versus communication quantities for representative RF…
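The step-size search space quoted in the Figure 2 caption can be enumerated directly. The sketch below only rebuilds the grid and checks that the two reported values belong to it; the variable and function names are assumptions.

```python
import math
from itertools import product

multipliers = (1, 2, 4, 6, 8)
exponents = (-5, -4, -3, -2)

# Cartesian product {1, 2, 4, 6, 8} x {1e-5, 1e-4, 1e-3, 1e-2}
step_grid = sorted(m * 10.0 ** e for m, e in product(multipliers, exponents))

def in_grid(value):
    """Membership test that tolerates floating-point rounding."""
    return any(math.isclose(value, g, rel_tol=1e-9) for g in step_grid)

print(len(step_grid), in_grid(0.08), in_grid(0.006))
```

Note that 0.08 is the largest value in this grid, so the four methods reported at 0.08 sit on the upper edge of the search space.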

Original abstract

Decentralized optimization provides a fundamental framework for large-scale learning and signal processing with distributed data. We study decentralized optimization with orthogonality constraints on the Stiefel manifold and propose RF-EXTRA, a distributed retraction-free primal-dual method on static undirected networks. The method combines an approximate gradient mapping for orthogonality-constrained optimization with an EXTRA-based decentralized recursion, thereby avoiding retractions while preserving a simple communication pattern. On the theoretical side, the analysis considers \revise{the joint error} $(\mathbf{X}_k-\overline{\mathbf X}_k,\mathbf{s}_k-\overline{\mathbf s}_k)$ in the local variables and local directions, and establishes a contractive recursion for the joint error. This contractivity ensures that the joint error can be controlled using small yet constant step sizes, thus leading to an exact $\mathcal{O}(1/K)$ convergence rate of RF-EXTRA to a stationary point. Experiments on PCA and low-rank matrix completion show that RF-EXTRA compares favorably with the reported decentralized baselines and exhibits strong communication efficiency on the tested tasks on the Stiefel manifold.
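For orientation, the setting described in the abstract can be written out as follows. The Stiefel manifold definition is standard; the consensus form of the problem, the agent count $N$, the local losses $f_i$, and the per-agent indexing $X_{i,k}$, $s_{i,k}$ are assumptions, since the abstract does not state them.

```latex
% Stiefel manifold: n-by-p matrices with orthonormal columns
\mathrm{St}(n,p) = \{\, X \in \mathbb{R}^{n \times p} : X^\top X = I_p \,\}

% Assumed consensus formulation over N agents with local smooth losses f_i
\min_{X_1,\dots,X_N} \; \frac{1}{N}\sum_{i=1}^{N} f_i(X_i)
\quad \text{s.t.} \quad X_1 = \dots = X_N, \qquad X_i \in \mathrm{St}(n,p)

% Network averages entering the joint error (assumed per-agent indexing)
\overline{\mathbf X}_k = \frac{1}{N}\sum_{i=1}^{N} X_{i,k}, \qquad
\overline{\mathbf s}_k = \frac{1}{N}\sum_{i=1}^{N} s_{i,k}
```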

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RF-EXTRA gives a retraction-free EXTRA variant for decentralized Stiefel optimization with a joint-error contraction that supports constant-step O(1/K) convergence.

The main point is that this paper introduces RF-EXTRA, a primal-dual scheme that avoids retractions on the Stiefel manifold by folding an approximate gradient mapping into the EXTRA mixing recursion. It then proves a contractive bound on the joint error in local variables and directions, which gives an O(1/K) rate to stationarity with fixed steps on static undirected networks. That combination is the actual new piece; prior EXTRA work did not target manifold constraints this way, and the joint-error tracking lets the authors control everything without explicit projections at each node.

The experiments on PCA and low-rank matrix completion are straightforward and show competitive performance plus lower communication cost than the baselines they compare against. The analysis appears to handle the cross terms from the decentralized updates and the first-order approximation without leaving obvious gaps, and the stress-test note confirms that the spectral radius stays below one for small enough constant steps, with no internal inconsistencies or unhandled curvature issues. One minor soft spot is that the abstract leaves the precise Lipschitz and accuracy assumptions on the approximate mapping implicit, so the full paper needs to spell those out for the constants to be usable. The static undirected network assumption is standard and not a flaw here.

This is aimed at researchers working on distributed manifold problems such as decentralized PCA or matrix completion. A reader who needs a simple-communication, retraction-free option with a provable rate would find the construction useful. I would send it to peer review.

Referee Report

2 major / 3 minor

Summary. The paper proposes RF-EXTRA, a retraction-free primal-dual decentralized optimization algorithm for problems on the Stiefel manifold. It combines an approximate gradient mapping to handle orthogonality constraints without retractions and an EXTRA-based recursion for communication on static undirected networks. The central claim is that the joint error $(\mathbf{X}_k-\overline{\mathbf X}_k,\mathbf{s}_k-\overline{\mathbf s}_k)$ satisfies a contractive recursion, which permits constant step sizes and yields an exact O(1/K) convergence rate to a stationary point. Experiments on PCA and low-rank matrix completion show competitive performance and communication efficiency relative to baselines.

Significance. If the joint-error contractivity holds, the result is significant because it removes the need for retraction operations in distributed manifold optimization, which are often expensive or unstable. The exact O(1/K) rate with constant steps and a simple communication pattern is a practical advantage for large-scale distributed tasks such as PCA. Analyzing the combined primal-dual deviation vector is a clean way to obtain the rate, and the empirical results on standard tasks add value.

major comments (2)

[Analysis section on joint-error recursion] The derivation of the contractive recursion for the joint error $(\mathbf{X}_k-\overline{\mathbf X}_k,\mathbf{s}_k-\overline{\mathbf s}_k)$ is load-bearing for the O(1/K) claim. The bounding of cross terms arising from the decentralized EXTRA updates and the first-order approximation to the orthogonality constraint must be presented with explicit constants so that the spectral-radius condition (strictly less than one) and the allowable constant step-size range can be verified directly.
[Section introducing the approximate gradient mapping] The contractivity relies on the approximate gradient mapping respecting the orthogonality constraint. The precise definition of this mapping, together with its Lipschitz constant or approximation-error bound, must be stated explicitly because these quantities enter the step-size restriction that guarantees the spectral radius is less than one.

minor comments (3)

[Abstract] The abstract contains the LaTeX command '\revise{the joint error}'; replace this with clean text and ensure the term 'joint error' is defined consistently in the introduction and analysis.
[Experiments] The experimental section should report the network size, topology, and exact communication metric (e.g., total scalar transmissions per iteration) to make the claimed communication efficiency quantitative.
[Notation and preliminaries] Notation for the averages ($\overline{\mathbf X}_k$ and $\overline{\mathbf s}_k$) should be introduced at the first use rather than assumed.
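The communication metric requested in the Experiments comment could be made quantitative along the following lines. This is a sketch under stated assumptions: each agent sends the same number of n-by-p matrices to every neighbour once per iteration, and the ring size and matrix dimensions below are hypothetical, not taken from the paper.

```python
def scalars_per_iteration(degrees, n, p, matrices_per_message=1):
    """Total scalars sent network-wide in one iteration.

    degrees: list of node degrees; each directed transmission is counted once.
    """
    per_message = n * p * matrices_per_message
    return sum(degrees) * per_message

ring_degrees = [2] * 8                      # hypothetical 8-node ring
total = scalars_per_iteration(ring_degrees, n=10, p=2)
print(total)
```

Reporting this count alongside the iteration count would let readers compare methods that exchange different numbers of matrices per round.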

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the precise comments on the analysis. We address the two major comments below and will revise the manuscript accordingly to improve clarity and verifiability.

Referee: [Analysis section on joint-error recursion] The derivation of the contractive recursion for the joint error $(\mathbf{X}_k-\overline{\mathbf X}_k,\mathbf{s}_k-\overline{\mathbf s}_k)$ is load-bearing for the O(1/K) claim. The bounding of cross terms arising from the decentralized EXTRA updates and the first-order approximation to the orthogonality constraint must be presented with explicit constants so that the spectral-radius condition (strictly less than one) and the allowable constant step-size range can be verified directly.

Authors: We agree that the bounding steps for the cross terms must be expanded with explicit constants to allow direct verification of the spectral radius and step-size range. The manuscript derives the joint-error recursion in the analysis section, but the intermediate bounds are presented compactly. In the revision we will insert the full expansion of each cross-term bound, compute the resulting spectral-radius expression explicitly, and state the resulting restriction on the constant step size. (Revision: yes)

Referee: [Section introducing the approximate gradient mapping] The contractivity relies on the approximate gradient mapping respecting the orthogonality constraint. The precise definition of this mapping, together with its Lipschitz constant or approximation-error bound, must be stated explicitly because these quantities enter the step-size restriction that guarantees the spectral radius is less than one.

Authors: We accept the point that the definition and quantitative properties of the approximate gradient mapping need to be stated more explicitly. The mapping is introduced as a first-order approximation that preserves the orthogonality constraint to first order; its Lipschitz constant and approximation-error bound appear in the subsequent analysis but are not highlighted at the definition stage. In the revised manuscript we will restate the precise definition at the beginning of the relevant section, list the Lipschitz and error constants, and show explicitly how they propagate into the step-size condition that ensures the spectral radius is strictly less than one. (Revision: yes)

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via explicit joint-error bounds

The paper's central result is the contractive recursion on the joint error vector $(\mathbf{X}_k-\overline{\mathbf X}_k,\mathbf{s}_k-\overline{\mathbf s}_k)$ obtained by bounding the cross terms that arise from the EXTRA mixing matrices on static undirected graphs together with the first-order approximation to the orthogonality constraint. This produces a linear system whose spectral radius is strictly less than one for sufficiently small constant step sizes, directly yielding the O(1/K) rate to stationarity. No equation or claim reduces to a fitted parameter renamed as a prediction, a self-citation whose content is itself unverified, or a definitional equivalence; the analysis is presented as an independent derivation resting on standard Lipschitz and network assumptions rather than re-deriving prior constants by construction.
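The kind of check described here, verifying that a step-size-dependent error matrix has spectral radius below one, can be sketched numerically. The 2-by-2 matrix below is a made-up placeholder with hypothetical constants `sigma` and `coupling`; it is not the matrix from the paper and only illustrates how an admissible step-size range would be read off once the real constants are known.

```python
import numpy as np

def spectral_radius(M):
    """Largest eigenvalue modulus of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def error_matrix(step, sigma=0.8, coupling=1.0):
    """Placeholder joint-error matrix (NOT from the paper).

    sigma stands in for a network mixing rate; the step-dependent terms stand in for cross-term bounds.
    """
    c = coupling * step
    return np.array([[sigma + c, c],
                     [c, sigma + c]])

candidate_steps = [0.01, 0.03, 0.09, 0.12, 0.2]
admissible = [a for a in candidate_steps if spectral_radius(error_matrix(a)) < 1.0]
largest_admissible = max(admissible)
print(admissible)
```

For this placeholder the spectral radius equals `sigma + 2 * coupling * step`, so the contraction holds only while the step stays below `(1 - sigma) / (2 * coupling)`; explicit constants in the paper would make the corresponding threshold checkable in the same way.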

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The method appears to rest on standard assumptions for decentralized consensus and manifold optimization plus one paper-specific construction (the approximate gradient mapping). No free parameters or invented entities are named.

axioms (2)

[domain assumption] The network is static and undirected. Stated in the abstract as the setting for the decentralized recursion.
[ad hoc to paper] An approximate gradient mapping exists that respects the orthogonality constraint without retraction. Central to the retraction-free claim; invoked to combine with EXTRA.
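To illustrate the first axiom, a static undirected network is usually encoded by a symmetric doubly stochastic mixing matrix. The sketch below builds one with Metropolis weights on a ring, which is a common choice and an assumption here; the summary does not say which weights the paper uses.

```python
import numpy as np

def ring_adjacency(num_nodes):
    """Boolean adjacency matrix of an undirected ring."""
    A = np.zeros((num_nodes, num_nodes), dtype=bool)
    for i in range(num_nodes):
        A[i, (i + 1) % num_nodes] = A[(i + 1) % num_nodes, i] = True
    return A

def metropolis_weights(adjacency):
    """Metropolis rule: W_ij = 1 / (1 + max(deg_i, deg_j)) on edges, remainder on the diagonal."""
    A = np.asarray(adjacency, dtype=bool)
    deg = A.sum(axis=1)
    n = A.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and A[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

W = metropolis_weights(ring_adjacency(8))
eig_moduli = np.sort(np.abs(np.linalg.eigvalsh(W)))
sigma = eig_moduli[-2]          # second-largest eigenvalue modulus, governs the mixing speed
print(sigma)
```

A value of `sigma` strictly below one reflects a connected graph; the closer it is to one, the slower the consensus part of any EXTRA-type recursion mixes.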

[1] [1]

Distributed asynchronous deterministic and stochastic gradient optimization algorithms.IEEE transactions on automatic control, 31(9):803–812, 1986

John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms.IEEE transactions on automatic control, 31(9):803–812, 1986

work page 1986

[2] [2]

Distributed subgradient methods for multi-agent optimization

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on automatic control, 54(1):48–61, 2009

work page 2009

[3] [3]

On the convergence of decentralized gradient descent.SIAM Journal on Optimization, 26(3):1835–1854, 2016

Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent.SIAM Journal on Optimization, 26(3):1835–1854, 2016

work page 2016

[4] [4]

Alghunaim and Kun Yuan

Sulaiman A. Alghunaim and Kun Yuan. A unified and refined convergence analysis for non-convex decentralized learning.IEEE Transactions on Signal Processing, 2022

work page 2022

[5] [5]

Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H. Sayed. Exact diffusion for distributed opti- mization and learning — part i: Algorithm development.IEEE Transactions on Signal Processing, 2018

work page 2018

[6] [6]

EXTRA: An exact first-order algorithm for decentralized consensus optimization.SIAM Journal on Optimization, 25(2):944–966, 2015

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization.SIAM Journal on Optimization, 25(2):944–966, 2015

work page 2015

[7] Zhirong Qiu, Shuzhi Sam Ge Liu, and Lihua Xie. Distributed constrained optimal consensus of multi-agent systems. Automatica, 68:209–215, 2016. doi: 10.1016/j.automatica.2016.01.055

[8] John Duchi, Alekh Agarwal, and Martin Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 2011

[9] Lei Huang, Xianglong Liu, Bo Lang, Adams Yu, Yongliang Wang, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018

[10] Minhyung Cho and Jaehyung Lee. Riemannian approach to batch normalization. In Advances in Neural Information Processing Systems, 2017

[11] Fangzhao Zhang and Mert Pilanci. Riemannian preconditioned LoRA for fine-tuning foundation models. arXiv preprint arXiv:2402.02347, 2024

[12] Yuan Zhang, Jiang Hu, Jiaxi Cui, Lin Lin, Zaiwen Wen, and Quanzheng Li. Retraction-free optimization over the Stiefel manifold with application to the LoRA fine-tuning. 2024

[13] L. Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R. Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, and Michael Heroux. An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software, 28(2):135–151, 2002

[14] Anastasia Koloskova, Tao Lin, and Sebastian U. Stich. An improved analysis of gradient tracking for decentralized machine learning. In Advances in Neural Information Processing Systems, 2021

[15] Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D2: Decentralized training over decentralized data. In Proceedings of the International Conference on Machine Learning, 2018

[16] Angelia Nedić, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017

[17] Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017

[18] Huaqing Li, Lifeng Zheng, Zheng Wang, Yu Yan, Liping Feng, and Jing Guo. S-DIGing: A stochastic gradient tracking algorithm for distributed optimization. IEEE Transactions on Emerging Topics in Computational Intelligence, 2020

[19] Shi Pu and Angelia Nedić. Distributed stochastic gradient tracking methods. Mathematical Programming, 2021

[20] Yue Liu, Tao Lin, Anastasia Koloskova, and Sebastian U. Stich. Decentralized gradient tracking with local steps. Optimization Methods and Software, 2025

[21] Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019

[22] Mingyi Hong, Davood Hajinezhad, and Ming-Min Zhao. Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In International Conference on Machine Learning, pages 1529–1538. PMLR, 2017

[23] Kun Yuan, Sulaiman A. Alghunaim, and Xinmeng Huang. Removing data heterogeneity influence enhances network topology dependence of decentralized SGD. Journal of Machine Learning Research, 2023

[24] Paolo Di Lorenzo and Gesualdo Scutari. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2016

[25] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In Proceedings of the International Conference on Machine Learning, 2019

[26] Yutong He, Xinmeng Huang, and Kun Yuan. Unbiased compression saves communication in distributed optimization: When and how much? In Advances in Neural Information Processing Systems, 2023

[27] Chuyan Chen, Yutong He, Pengrui Li, Weichen Jia, and Kun Yuan. Greedy low-rank gradient compression for distributed learning with convergence guarantees. IEEE Transactions on Signal Processing, 2026

[28] Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. Journal of Machine Learning Research, 2023

[29] Xun Qian, Peter Richtárik, and Tong Zhang. Error compensated distributed SGD can be accelerated. In Advances in Neural Information Processing Systems, 2021

[30] Liyuan Liang, Xinmeng Huang, Ran Xin, and Kun Yuan. Understanding the influence of digraphs on decentralized optimization: Effective metrics, lower bound, and optimal algorithm. SIAM Journal on Optimization, 2025

[31] Sulaiman A. Alghunaim. Local exact-diffusion for decentralized optimization and learning. IEEE Transactions on Automatic Control, 2024

[32] Bicheng Ying, Kun Yuan, Yiming Chen, Hanbin Hu, Pan Pan, and Wotao Yin. Exponential graph is provably efficient for decentralized deep training. In Advances in Neural Information Processing Systems, 2021

[33] Shixiang Chen, Alfredo Garcia, Mingyi Hong, and Shahin Shahrampour. Decentralized Riemannian gradient descent on the Stiefel manifold. In International Conference on Machine Learning, pages 1594–1605. PMLR, 2021

[34] Kangkang Deng and Jiang Hu. Decentralized projected Riemannian gradient method for smooth optimization on compact submanifolds embedded in the Euclidean space. Numerische Mathematik, 2025

[35] Lei Wang, Le Bao, and Xin Liu. A decentralized proximal gradient tracking algorithm for composite optimization on Riemannian manifolds. Journal of Machine Learning Research, 2025

[36] Jiayuan Wu, Zhanwang Deng, Jiang Hu, Weijie Su, and Zaiwen Wen. Riemannian EXTRA: Communication-efficient decentralized optimization over compact submanifolds with data heterogeneity. arXiv preprint arXiv:2505.15537, 2025

[37] Jun Chen, Lina Liu, Tianyi Zhu, Yong Liu, Guang Dai, Yunliang Jiang, and Ivor W. Tsang. Decentralized optimization on compact submanifolds by quantized Riemannian gradient tracking. IEEE Transactions on Signal Processing, 2025

[38] Jiang Hu and Kangkang Deng. Improving the communication in decentralized manifold optimization through single-step consensus and compression. arXiv preprint arXiv:2407.08904, 2024

[39] Kangkang Deng and Jiang Hu. Decentralized projected Riemannian stochastic recursive momentum method for nonconvex optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

[40] Jun Chen, Haishan Ye, Mengmeng Wang, Tianxin Huang, Guang Dai, Ivor W. Tsang, and Yong Liu. Decentralized Riemannian conjugate gradient method on the Stiefel manifold. arXiv preprint arXiv:2308.10547, 2023

[41] Jiang Hu, Kangkang Deng, and Quanzheng Li. Decentralized Riemannian natural gradient methods with Kronecker product approximations. Journal of the Operations Research Society of China, 2025

[42] Shixiang Chen, Alfredo Garcia, Mingyi Hong, and Shahin Shahrampour. On the local linear rate of consensus on the Stiefel manifold. IEEE Transactions on Automatic Control, 2023

[43] Roberto Tron, Bijan Afsari, and René Vidal. Riemannian consensus for manifolds with bounded curvature. IEEE Transactions on Automatic Control, 2012

[44] Alain Sarlette and Rodolphe Sepulchre. Consensus optimization on manifolds. SIAM Journal on Control and Optimization, 2009

[45] Jiang Hu, Jiaojiao Zhang, and Kangkang Deng. Achieving local consensus over compact submanifolds. IEEE Transactions on Automatic Control, 70(9):5750–5763, 2025. doi: 10.1109/TAC.2025.3545711

[46] Youbang Sun, Shixiang Chen, Alfredo Garcia, and Shahin Shahrampour. Retraction-free decentralized non-convex optimization with orthogonal constraints. arXiv preprint arXiv:2405.11590, 2024. doi: 10.48550/arXiv.2405.11590

[47] Lei Wang and Xin Liu. Decentralized optimization over the Stiefel manifold by an approximate augmented Lagrangian function. IEEE Transactions on Signal Processing, 2022

[48] Pierre Ablin and Gabriel Peyré. Fast and accurate optimization on the orthogonal manifold without retraction. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pages 5636–5657. PMLR, 2022

[49] Bin Gao, Xin Liu, and Ya-xiang Yuan. Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM Journal on Scientific Computing, 41(3):A1949–A1983, 2019

[50] Nachuan Xiao, Xin Liu, and Kim-Chuan Toh. Dissolving constraints for Riemannian optimization. Mathematics of Operations Research, 2024

[51] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2012

[52] Lei Qin and Ye Pu. Convergence analysis of EXTRA in non-convex distributed optimization. IEEE Control Systems Letters, 2025

[53] ReasFlow Team. Reasflow: Assisting reasoning-centric scientific discovery in applied mathematics via a knowledge-based multi-agent system, 2026. URL https://blog.reaslab.io/blog/reasflow-intro/