Distributed Gradient-Regularized Newton Method: Scheduled Consensus and O(epsilon^{-1}) Global Iteration Complexity

Li Zhang; Pengcheng Xie; Wei Hu; Ya-Xiang Yuan

arxiv: 2605.19396 · v1 · pith:UWTCHT2Wnew · submitted 2026-05-19 · 🧮 math.OC

Distributed Gradient-Regularized Newton Method: Scheduled Consensus and O(epsilon⁻¹) Global Iteration Complexity

Wei Hu , Pengcheng Xie , Ya-Xiang Yuan , Li Zhang This is my paper

Pith reviewed 2026-05-20 04:51 UTC · model grok-4.3

classification 🧮 math.OC

keywords decentralized optimizationdistributed Newton methodconsensus optimizationgradient regularizationscheduled consensusiteration complexityconvex optimizationnetwork optimization

0 comments

The pith

A decentralized Newton method reaches centralized O(epsilon^{-1}) iteration complexity for convex consensus problems after a scheduled burn-in phase.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DisGrem, a fully decentralized second-order method for convex consensus optimization over networks. Each agent performs a local Newton update with gradient-norm regularization and an eigenvalue-shift stabilizer, then mixes information through a two-stage gossip mechanism. The reference-step framework converts the collective update into an inexact centralized regularized Newton step and replaces static Hessian-heterogeneity bounds with an increment-based dispersion analysis that carries no fixed accuracy floor. Under the bounded-iterates assumption, the post-burn-in phase drives the gradient norm below epsilon in O(epsilon^{-1}) iterations, matching the centralized rate without line search or manual stepsize selection.

Core claim

The DisGrem method reduces the network-wide update to an inexact centralized regularized Newton step via the reference-step framework. Under a bounded-iterates assumption, after a burn-in phase whose length is governed by the scheduled consensus accuracy, the post-burn-in phase achieves O(epsilon^{-1}) iteration complexity for driving the gradient norm below epsilon. This rate matches the centralized regularized Newton method without line search or stepsize tuning. For a logarithmic schedule with p greater than or equal to 3 the total iteration complexity remains O(epsilon^{-1}). Under strong convexity and a relative tracking-accuracy condition the method further exhibits conditional local 3

What carries the argument

The reference-step framework, which converts the decentralized update into an inexact centralized regularized Newton step and substitutes increment-based dispersion analysis for static Hessian-heterogeneity assumptions.

If this is right

After the burn-in phase the method attains an O(epsilon^{-1}) global iteration complexity for the gradient norm without any line search or stepsize tuning.
Under a logarithmic consensus schedule with p at least 3 the overall iteration count stays O(epsilon^{-1}).
On a fixed connected network the total neighbor communication rounds scale as O(epsilon^{-1} log(1/epsilon)) with explicit dependence on the spectral gap.
Under strong convexity and a relative tracking-accuracy condition the method exhibits conditional local superlinear convergence of order 3/2.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the bounded-iterates assumption continues to hold on larger or time-varying networks, the same reference-step reduction could be applied to other second-order decentralized schemes.
The scheduled-consensus mechanism might be combined with asynchronous or event-triggered communication without destroying the O(epsilon^{-1}) rate.
The increment-based dispersion analysis could replace static heterogeneity assumptions in related distributed quasi-Newton or cubic-regularized methods.

Load-bearing premise

The assumption that the sequence of iterates produced by the algorithm remains bounded.

What would settle it

Numerical execution on the nine-problem benchmark suite in which the iterates remain bounded while the gradient norm fails to drop below epsilon at the claimed rate.

Figures

Figures reproduced from arXiv: 2605.19396 by Li Zhang, Pengcheng Xie, Wei Hu, Ya-Xiang Yuan.

**Figure 2.** Figure 2: Communication-budget profiles at three precision levels [PITH_FULL_IMAGE:figures/full_fig_p022_2.png] view at source ↗

**Figure 3.** Figure 3: relF vs. iteration on all nine test functions (N=10, 20 MC runs; synthetic objectives use d=30, whereas the two logistic-regression objectives use the d=22 svmguide3 feature dimension; logarithmic communication implementation). Solid curves: DisGrem family; dashed: baselines. Shaded bands: interquartile range. Legend entries marked by × indicate algorithms whose median curve is not drawn because the corres… view at source ↗

**Figure 4.** Figure 4: Communication cost (MB) to reach target precision [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗

**Figure 5.** Figure 5: Regularization parameter M trajectory for AdaDisGrem (solid) vs. the fixed M of DisGrem (dashed). Curves averaged over 5 MC runs. 0 10 20 30 40 Iteration 10 −11 10 −8 10 −5 10 −2 relF Ridge M=0.1× M=0.3× M=1× M=3× M=10× AdaDisGrem 0 100 200 300 400 Iteration 10 −11 10 −8 10 −5 10 −2 relF LogSumExp M=0.1× M=0.3× M=1× M=3× M=10× AdaDisGrem 0 25 50 75 100 125 150 Iteration 10 −11 10 −8 10 −5 10 −2 relF LogReg… view at source ↗

**Figure 6.** Figure 6: AdaDisGrem (bold blue) vs. DisGrem at five fixed-M values (0.1×–10× the baseline M∗ ; gray curves light-to-dark). The adaptive variant is competitive with the best fixed choice on most functions and avoids failures at aggressive fixed-M choices. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗

**Figure 7.** Figure 7: Dimension scalability: relF vs. iteration for d ∈ {30, 100, 200} on three functions (5 MC runs, median). 7 Conclusion and future directions Under the bounded-trajectory and smoothness assumptions used in the analysis, DisGrem retains the centralized regularized Newton post-burn-in rate, with the burn-in order controlled by the scheduled consensus accuracy, in a fully decentralized setting without line sear… view at source ↗

**Figure 8.** Figure 8: Composite optimality (combo = ∥∇f(x¯k)∥ + consk) vs. iteration for all nine functions. Same setting as [PITH_FULL_IMAGE:figures/full_fig_p055_8.png] view at source ↗

**Figure 9.** Figure 9: relF vs. wall-clock time (seconds) for all nine functions. 56 [PITH_FULL_IMAGE:figures/full_fig_p056_9.png] view at source ↗

**Figure 10.** Figure 10: relF vs. cumulative communication cost (MB) for all nine functions. 57 [PITH_FULL_IMAGE:figures/full_fig_p057_10.png] view at source ↗

**Figure 11.** Figure 11: Effect of Klazy on the communication–precision trade-off for CeDisGrem. Gradientcoloured curves from light to dark: Klazy ∈ {1, 5, 10, 20, 40, 80}. 0 2 4 6 8 10 12 14 Comm. cost (MB) 10 −11 10 −8 10 −5 10 −2 relF Ridge 0 50 100 150 200 250 Comm. cost (MB) 10 −11 10 −8 10 −5 10 −2 relF LogSumExp 0 10 20 30 40 50 60 Comm. cost (MB) 10 −11 10 −8 10 −5 10 −2 relF Huber 0 10 20 30 40 Comm. cost (MB) 10 −11 10… view at source ↗

**Figure 12.** Figure 12: Effect of compression method and rank on the communication–precision trade-off for [PITH_FULL_IMAGE:figures/full_fig_p058_12.png] view at source ↗

**Figure 13.** Figure 13: Robustness of AdaDisGrem to initial M. Initial M0 ∈ {0.1M∗ , 0.5M∗ , M∗ , 3M∗ , 10M∗}. All initializations converge to similar trajectories within 50–100 iterations. 59 [PITH_FULL_IMAGE:figures/full_fig_p059_13.png] view at source ↗

**Figure 14.** Figure 14: Parameter sweep for the DisGrem family on four functions. 10 values of Mfac (light-todark for increasing Mfac). Rows: functions; columns: algorithms. 60 [PITH_FULL_IMAGE:figures/full_fig_p060_14.png] view at source ↗

**Figure 15.** Figure 15: Stepsize sensitivity for first-order baselines (EXTRA and DIGing) on four functions. 10 [PITH_FULL_IMAGE:figures/full_fig_p061_15.png] view at source ↗

**Figure 16.** Figure 16: Key-parameter sensitivity for second-order baselines on four functions: penalty [PITH_FULL_IMAGE:figures/full_fig_p062_16.png] view at source ↗

read the original abstract

We propose DisGrem, a fully decentralized second-order method for convex consensus optimization over networks. Each agent solves a local Newton system with vanishing gradient-norm regularization and an eigenvalue-shift stabilizer, communicating through a two-stage gossip-mixing mechanism. We introduce a reference-step framework that reduces the network-wide update to an inexact centralized regularized Newton step, replacing the static Hessian-heterogeneity assumptions of prior work with an increment-based dispersion analysis that imposes no irreducible accuracy floor. Under a bounded-iterates assumption, after a burn-in phase whose order is controlled by the scheduled consensus accuracy, the post-burn-in phase achieves an O(epsilon^{-1}) iteration complexity for driving the gradient norm below epsilon, matching the centralized regularized Newton rate without line search or stepsize tuning. For a logarithmic schedule with p >= 3, the total iteration complexity remains O(epsilon^{-1}). For a fixed connected network, this yields O(epsilon^{-1} log(1/epsilon)) neighbor communication rounds, with explicit spectral-gap dependence O((1-rho)^{-1} epsilon^{-1} log(1/epsilon)) as rho approaches 1. Under strong convexity and a relative tracking-accuracy condition, we further establish conditional local superlinear convergence of order 3/2. In our nine-problem benchmark suite, the DisGrem family attains relF <= 10^{-6} on every test instance, while the tested baselines stagnate or diverge on at least one problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DisGrem gets O(eps^{-1}) global complexity for distributed Newton by swapping static Hessian assumptions for increment-based dispersion after a scheduled-consensus burn-in, but the whole post-burn-in argument sits on a bounded-iterates assumption whose justification from the dynamics is not obvious.

read the letter

The main point is that this paper gives a fully decentralized second-order method for convex consensus problems that reaches the same O(epsilon^{-1}) iteration complexity as a centralized regularized Newton method. It does this with a two-stage gossip mechanism, vanishing gradient-norm regularization, an eigenvalue-shift stabilizer, and a reference-step reduction that turns the network update into an inexact centralized step. The scheduled consensus accuracy controls the length of the initial phase, after which an increment-based dispersion analysis replaces the usual static Hessian-heterogeneity bounds and removes any irreducible accuracy floor from the network.

Referee Report

1 major / 1 minor

Summary. The paper proposes DisGrem, a fully decentralized second-order method for convex consensus optimization over networks. Agents solve local Newton systems with vanishing gradient-norm regularization and eigenvalue-shift stabilization, using a two-stage gossip-mixing mechanism. A reference-step framework reduces the network update to an inexact centralized regularized Newton step, replacing static Hessian-heterogeneity assumptions with increment-based dispersion analysis. Under a bounded-iterates assumption, after a burn-in phase controlled by scheduled consensus accuracy, the method achieves O(ε^{-1}) iteration complexity to drive the gradient norm below ε, matching centralized rates without line search or stepsize tuning. For logarithmic schedules with p ≥ 3 the total complexity remains O(ε^{-1}); communication rounds are O(ε^{-1} log(1/ε)) with explicit spectral-gap dependence. Conditional local superlinear convergence of order 3/2 is shown under strong convexity and relative tracking accuracy. Empirical results on nine problems show the method reaches relF ≤ 10^{-6} while baselines stagnate or diverge.

Significance. If the bounded-iterates assumption holds and the increment-based dispersion analysis is rigorous, the result would be significant for distributed optimization: it delivers centralized regularized-Newton global rates in a fully decentralized setting with scheduled consensus, explicit network dependence, and no tuning. The replacement of static heterogeneity assumptions by increment-based analysis and the two-stage mixing are technical strengths; the empirical benchmark suite provides concrete evidence of practical reliability.

major comments (1)

[Abstract] Abstract (and reference-step framework paragraph): The O(ε^{-1}) post-burn-in global complexity claim and the replacement of static Hessian-heterogeneity by increment-based dispersion both rest on the bounded-iterates assumption. No sufficient condition (e.g., strong convexity plus network spectral gap implying boundedness under the two-stage gossip updates and vanishing regularization) is supplied to guarantee that the assumption holds from the algorithm dynamics; if iterates become unbounded the dispersion control and reduction to inexact centralized Newton fail, so the complexity guarantee does not apply.

minor comments (1)

[Abstract] The abstract states results for both logarithmic and fixed schedules but does not clarify whether the O(ε^{-1} log(1/ε)) communication bound holds uniformly or only for the logarithmic case.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. The single major comment raises an important point regarding the bounded-iterates assumption. We address it point by point below and have revised the manuscript to strengthen the presentation.

read point-by-point responses

Referee: [Abstract] Abstract (and reference-step framework paragraph): The O(ε^{-1}) post-burn-in global complexity claim and the replacement of static Hessian-heterogeneity by increment-based dispersion both rest on the bounded-iterates assumption. No sufficient condition (e.g., strong convexity plus network spectral gap implying boundedness under the two-stage gossip updates and vanishing regularization) is supplied to guarantee that the assumption holds from the algorithm dynamics; if iterates become unbounded the dispersion control and reduction to inexact centralized Newton fail, so the complexity guarantee does not apply.

Authors: We agree that the bounded-iterates assumption is essential for the global complexity claim and for the validity of the increment-based dispersion analysis. The manuscript states the assumption explicitly and derives the O(ε^{-1}) rate conditionally on it, which is consistent with standard practice for global rates of regularized Newton methods on convex (but not necessarily strongly convex) problems. We acknowledge that an explicit sufficient condition guaranteeing boundedness directly from the algorithm dynamics would be desirable. In the revised manuscript we have added a new remark (Section 3.3) that supplies such a condition under the additional assumptions of strong convexity (modulus μ > 0), a uniform lower bound on the network spectral gap, and a sufficiently slow logarithmic schedule for the regularization parameter. Under these conditions we prove that the two-stage gossip updates and eigenvalue-shift stabilization keep all iterates inside a ball whose radius depends only on the initial point and the schedule parameters. For the general convex case without strong convexity we retain the assumption, as the problem class itself may admit unbounded level sets; this is noted explicitly in the revised abstract and in the statement of Theorem 3. For the empirical section we have added a brief discussion confirming that boundedness was observed across all nine test problems. These changes clarify the scope of the guarantee without altering the core technical contributions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on explicit external assumption

full rationale

The paper states its O(ε^{-1}) post-burn-in complexity explicitly under a bounded-iterates assumption and introduces a reference-step framework plus increment-based dispersion analysis as new tools that replace prior static Hessian-heterogeneity assumptions. No quoted step reduces the claimed rate to a fitted parameter, self-referential definition, or load-bearing self-citation chain; the bounded-iterates condition is presented as an input assumption rather than derived from the result itself. The derivation chain therefore remains self-contained against the stated premises and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the bounded-iterates assumption and a relative tracking-accuracy condition for the local superlinear rate; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Bounded-iterates assumption
Invoked to control the post-burn-in phase and to enable the increment-based dispersion analysis that replaces static Hessian-heterogeneity assumptions.
domain assumption Relative tracking-accuracy condition
Required to establish conditional local superlinear convergence of order 3/2 under strong convexity.

pith-pipeline@v0.9.0 · 5808 in / 1487 out tokens · 41378 ms · 2026-05-20T04:51:59.793452+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Under a bounded-iterates assumption, after a burn-in phase... O(ε^{-1}) iteration complexity... reference-step framework that reduces the network-wide update to an inexact centralized regularized Newton step, replacing the static Hessian-heterogeneity assumptions... with an increment-based dispersion analysis
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

eigenvalue-shift stabilizer... logarithmic mixing schedule τ_k = t_k = ⌈p log(k+2) + c_mix / -log ρ⌉

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages

[1]

Cartis, N

C. Cartis, N. I. M. Gould, and Ph. L. Toint. Adaptive cubic regularisation methods for uncon- strained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127:245–295, 2011

work page 2011
[2]

Chang and C.-J

C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011. Software available athttps: //www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

work page 2011
[3]

Eisen, A

M. Eisen, A. Mokhtari, and A. Ribeiro. Decentralized quasi-Newton methods. IEEE Transac- tions on Signal Processing, 65(10):2613–2628, 2017

work page 2017
[4]

Jakovetić, J

D. Jakovetić, J. Xavier, and J. M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014

work page 2014
[5]

Mishchenko

K. Mishchenko. Regularized Newton method with globalO(1/k2)convergence. SIAM Journal on Optimization, 33(3):1440–1462, 2023

work page 2023
[6]

Mokhtari, Q

A. Mokhtari, Q. Ling, and A. Ribeiro. Network Newton distributed optimization methods. IEEE Transactions on Signal Processing, 65(1):146–161, 2017

work page 2017
[7]

Zhang, Q

J. Zhang, Q. Ling, and A. M.-C. So. A Newton tracking algorithm with exact linear convergence for decentralized consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 7:346–358, 2021

work page 2021
[8]

Nedić and A

A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009. 30 DisGRem W. Hu et al

work page 2009
[9]

Nedić, A

A. Nedić, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimiza- tion over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017

work page 2017
[10]

Nesterov and B

Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global perfor- mance. Mathematical Programming, 108:177–205, 2006

work page 2006
[11]

W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015

work page 2015
[12]

Y. Sun, G. Scutari, and D. P. Palomar. Distributed nonconvex optimization and learning based on successive convex approximation. IEEE Transactions on Signal Processing, 70:5900–5915, 2022

work page 2022
[13]

Maritan, G

A. Maritan, G. Sharma, L. Schenato, and S. Dey. Network-GIANT: fully distributed Newton- type optimization via harmonic Hessian consensus. arXiv preprint arXiv:2305.07898, 2023

work page arXiv 2023
[14]

K. Yuan, B. Ying, and A. H. Sayed. Exact diffusion for distributed optimization and learning— Part I: algorithm development. IEEE Transactions on Signal Processing, 67(3):708–723, 2019

work page 2019
[15]

Jakovetić, N

D. Jakovetić, N. Krejić, and G. Malaspina. DINAS: Distributed inexact Newton method with adaptive step sizes. Computational Optimization and Applications, 91:683–715, 2025

work page 2025
[16]

Daneshmand, G

A. Daneshmand, G. Scutari, P. Dvurechensky, and A. Gasnikov. Newton method over networks is fast up to the statistical precision. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of PMLR, pp. 2398–2409, 2021

work page 2021
[17]

Gratton, S

S. Gratton, S. Jerad, and Ph. L. Toint. Yet another fast variant of Newton’s method for nonconvex optimization. arXiv preprint arXiv:2302.10065, 2023

work page arXiv 2023
[18]

Doikov and Y

N. Doikov and Y. Nesterov. Gradient regularization of Newton method with Bregman distances. Mathematical Programming, 204:1–25, 2024

work page 2024
[19]

G. Yuan, X. Li, and Q. Ling. INDO: INversion-free Distributed second-Order method for consensus optimization. Optimization Online preprint, 2022

work page 2022
[20]

Zhang, K

Z. Zhang, K. Che, S. Yang, et al. Communication-efficient distributed cubic Newton with compressed lazy Hessian. Neural Networks, 174:106212, 2024

work page 2024
[21]

S. A. Alghunaim, E. K. Ryu, K. Yuan, and A. H. Sayed. Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control, 66(6):2787– 2794, 2021

work page 2021
[22]

Bajović, D

D. Bajović, D. Jakovetić, N. Krejić, and N. Krklec Jerinkić. Newton-like method with diagonal correction for distributed optimization. SIAM Journal on Optimization, 27(2):1171–1203, 2017

work page 2017
[23]

Beznosikov, P

A. Beznosikov, P. Richtárik, M. Diskin, et al. Distributed methods with compressed communi- cation for solving variational inequalities, with theoretical guarantees. In Advances in Neural Information Processing Systems (NeurIPS), 35:14013–14029, 2022

work page 2022
[24]

Doikov and Y

N. Doikov and Y. Nesterov. Local convergence of tensor methods. Mathematical Programming, 193:315–336, 2022

work page 2022
[25]

Doikov, E

N. Doikov, E. M. Chayti, and M. Jaggi. Second-order optimization with lazy Hessians. In Proceedings of the International Conference on Machine Learning (ICML), 2023. 31 DisGRem W. Hu et al

work page 2023
[26]

Koloskova, S

A. Koloskova, S. U. Stich, and M. Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In Proceedings of the International Conference on Machine Learning (ICML), pp. 3478–3487, 2019

work page 2019
[27]

Li and Z

H. Li and Z. Lin. Accelerated gradient tracking over time-varying graphs for decentralized optimization. Journal of Machine Learning Research, 25(274):1–52, 2024

work page 2024
[28]

Mokhtari, W

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. ESOM: A second-order method for exact decentralized optimization over networks. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016

work page 2016
[29]

Pu and A

S. Pu and A. Nedić. Distributed stochastic gradient tracking methods. Mathematical Program- ming, 187:409–457, 2021

work page 2021
[30]

Xin and U

R. Xin and U. A. Khan. Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. IEEE Transactions on Automatic Control, 65(6):2627–2633, 2020

work page 2020
[31]

B. Li, S. Cen, Y. Chen, et al. Communication-efficient distributed optimization in networks with gradient tracking and variance reduction. Journal of Machine Learning Research, 21(180):1–51, 2020

work page 2020
[32]

Nesterov

Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005

work page 2005
[33]

Charbonnier, L

P. Charbonnier, L. Blanc-Féraud, G. Aubert, and M. Barlaud. Deterministic edge-preserving regularization in computed imaging. IEEE Transactions on Image Processing, 6(2):298–311, 1997

work page 1997
[34]

H. H. Rosenbrock. An automatic method for finding the greatest or least value of a function. The Computer Journal, 3(3):175–184, 1960

work page 1960
[35]

M. A. Styblinski and T. S. Tang. Experiments in nonconvex optimization: Stochastic approx- imation with function smoothing and simulated annealing. Neural Networks, 3(4):467–483, 1990

work page 1990
[36]

Geman and C

D. Geman and C. Yang. Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing, 4(7):932–946, 1995

work page 1995
[37]

Xie and Y

P. Xie and Y. Yuan. A derivative-free optimization algorithm combining line-search and trust-region techniques.Chinese Annals of Mathematics, Series B, 44(5):719–734, 2023

work page 2023
[38]

Xie and Y

P. Xie and Y. Yuan. Derivative-free optimization with transformed objective functions (DFOTO) and the algorithm based on the least Frobenius norm updating quadratic model.Journal of the Operations Research Society of China, 13:327–363, 2025

work page 2025
[39]

Xie and Y

P. Xie and Y. Yuan. A derivative-free method using a new underdetermined quadratic interpolation model.SIAM Journal on Optimization, 35(2):1110–1133, 2025

work page 2025
[40]

Xie and Y

P. Xie and Y. Yuan. LeastH2 norm updating of quadratic interpolation models for derivative- free trust-region algorithms.IMA Journal of Numerical Analysis, 46(1):21–50, 2026. 32 DisGRem W. Hu et al

work page 2026
[41]

Xie and Y

P. Xie and Y. Yuan. A new two-dimensional model-based subspace method for large-scale unconstrained derivative-free optimization: 2D-MoSub.Optimization Methods and Software, 41(1):118–150, 2026

work page 2026
[42]

Xie and S

P. Xie and S. M. Wild. ReMU: Regional minimal updating for model-based derivative-free optimization. arXiv:2504.03606, 2025

work page arXiv 2025
[43]

He and P

Y. He and P. Xie. Model-driven subspaces for large-scale optimization with local approximation strategy. arXiv:2509.08256, 2025

work page arXiv 2025
[44]

P. Xie. A derivative-free trust-region method for optimization on the ellipsoid.Journal of Physics: Conference Series, 2620:012007, 2023

work page 2023
[45]

P. Xie. Sufficient conditions for error distance reduction in theℓ2-norm trust region between minimizers of local nonconvex multivariate quadratic approximates.Journal of Computational and Applied Mathematics, 453:116146, 2025

work page 2025
[46]

P. Xie, Z. Zhou, and Z. Zhou. Objective value change and shape-based accelerated optimization for the neural network approximation. arXiv:2508.20290, 2025

work page arXiv 2025
[47]

Xie and S

P. Xie and S. M. Wild. Barycenter of weight coefficient region of least weightedH2 norm updating quadratic models with vanishing trust-region radius.SIAM NCC 2024, Early Career Travel Award, 2024

work page 2024
[48]

P. Xie. On the relationship betweenΛ-poisedness in derivative-free optimization and outliers in local outlier factor. arXiv:2407.17529, 2024

work page arXiv 2024
[49]

L. Li, P. Xie, and L. Zhang. A novel numerical method tailored for unconstrained optimization problems. arXiv:2504.02832, 2025

work page arXiv 2025
[50]

P. Xie. An efficient derivative-free method for finding multiple solutions. To be posted on arXiv, 2024

work page 2024
[51]

L. Li, Y. Zhou, P. Xie, and H. Li. A spectral Levenberg–Marquardt–Deflation method for multiple solutions of semilinear elliptic systems.Journal of Computational and Applied Mathematics, 2025

work page 2025
[52]

Y. Ye, L. Li, P. Xie, and H. Yu. An improved adaptive orthogonal basis deflation method for multiple solutions with applications to nonlinear elliptic equations in varying domains.Journal of Computational Mathematics, 2025

work page 2025
[53]

P. Xie. Privacy-preserving black-box optimization (PBBO): Theory and the model-based algorithm DFOp. arXiv:2601.11570, 2025

work page arXiv 2025
[54]

Xie et al

P. Xie et al. A novel local analysis of objectives approximated by neural network: L-Change. International Conference on Mathematical Theory of Deep Learning (MTDL), 2024

work page 2024
[55]

K. J. Dzahini, S. M. Wild, and P. Xie. Optimization approaches for solving inverse problems must account for uncertainty in both data and downstream decisions. Position paper, Inverse Methods for Complex Systems under Uncertainty Workshop, Sponsored by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, 2025

work page 2025
[56]

Xie and M

P. Xie and M. Tao. Parametric resonant control of macroscopic behaviors of multiple oscillators. In2019 American Control Conference (ACC), pages 1898–1905, 2019. 33 DisGRem W. Hu et al

work page 1905
[57]

P. Xie. A note on the invariant distribution of a stochastic dynamical system. 2024

work page 2024
[58]

P. Xie. The modeling and optimization of a multi-dam system.Applied and Computational Mathematics, 13(5):140–152, 2024. A Dispersion recursions, burn-in analysis, and logarithmic mixing A.1 Basic inequalities Lemma A.1.Leth:R d→RhaveL2-Lipschitz Hessian. Then for allx,s, h(x+s)≤h(x) +⟨∇h(x),s⟩+1 2s⊤∇2h(x)s+ L2 6 ∥s∥3. Lemma A.2.For anyz∈RN and integert≥1,...

work page 2024

[1] [1]

Cartis, N

C. Cartis, N. I. M. Gould, and Ph. L. Toint. Adaptive cubic regularisation methods for uncon- strained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127:245–295, 2011

work page 2011

[2] [2]

Chang and C.-J

C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011. Software available athttps: //www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

work page 2011

[3] [3]

Eisen, A

M. Eisen, A. Mokhtari, and A. Ribeiro. Decentralized quasi-Newton methods. IEEE Transac- tions on Signal Processing, 65(10):2613–2628, 2017

work page 2017

[4] [4]

Jakovetić, J

D. Jakovetić, J. Xavier, and J. M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014

work page 2014

[5] [5]

Mishchenko

K. Mishchenko. Regularized Newton method with globalO(1/k2)convergence. SIAM Journal on Optimization, 33(3):1440–1462, 2023

work page 2023

[6] [6]

Mokhtari, Q

A. Mokhtari, Q. Ling, and A. Ribeiro. Network Newton distributed optimization methods. IEEE Transactions on Signal Processing, 65(1):146–161, 2017

work page 2017

[7] [7]

Zhang, Q

J. Zhang, Q. Ling, and A. M.-C. So. A Newton tracking algorithm with exact linear convergence for decentralized consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 7:346–358, 2021

work page 2021

[8] [8]

Nedić and A

A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009. 30 DisGRem W. Hu et al

work page 2009

[9] [9]

Nedić, A

A. Nedić, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimiza- tion over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017

work page 2017

[10] [10]

Nesterov and B

Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global perfor- mance. Mathematical Programming, 108:177–205, 2006

work page 2006

[11] [11]

W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015

work page 2015

[12] [12]

Y. Sun, G. Scutari, and D. P. Palomar. Distributed nonconvex optimization and learning based on successive convex approximation. IEEE Transactions on Signal Processing, 70:5900–5915, 2022

work page 2022

[13] [13]

Maritan, G

A. Maritan, G. Sharma, L. Schenato, and S. Dey. Network-GIANT: fully distributed Newton- type optimization via harmonic Hessian consensus. arXiv preprint arXiv:2305.07898, 2023

work page arXiv 2023

[14] [14]

K. Yuan, B. Ying, and A. H. Sayed. Exact diffusion for distributed optimization and learning— Part I: algorithm development. IEEE Transactions on Signal Processing, 67(3):708–723, 2019

work page 2019

[15] [15]

Jakovetić, N

D. Jakovetić, N. Krejić, and G. Malaspina. DINAS: Distributed inexact Newton method with adaptive step sizes. Computational Optimization and Applications, 91:683–715, 2025

work page 2025

[16] [16]

Daneshmand, G

A. Daneshmand, G. Scutari, P. Dvurechensky, and A. Gasnikov. Newton method over networks is fast up to the statistical precision. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of PMLR, pp. 2398–2409, 2021

work page 2021

[17] [17]

Gratton, S

S. Gratton, S. Jerad, and Ph. L. Toint. Yet another fast variant of Newton’s method for nonconvex optimization. arXiv preprint arXiv:2302.10065, 2023

work page arXiv 2023

[18] [18]

Doikov and Y

N. Doikov and Y. Nesterov. Gradient regularization of Newton method with Bregman distances. Mathematical Programming, 204:1–25, 2024

work page 2024

[19] [19]

G. Yuan, X. Li, and Q. Ling. INDO: INversion-free Distributed second-Order method for consensus optimization. Optimization Online preprint, 2022

work page 2022

[20] [20]

Zhang, K

Z. Zhang, K. Che, S. Yang, et al. Communication-efficient distributed cubic Newton with compressed lazy Hessian. Neural Networks, 174:106212, 2024

work page 2024

[21] [21]

S. A. Alghunaim, E. K. Ryu, K. Yuan, and A. H. Sayed. Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control, 66(6):2787– 2794, 2021

work page 2021

[22] [22]

Bajović, D

D. Bajović, D. Jakovetić, N. Krejić, and N. Krklec Jerinkić. Newton-like method with diagonal correction for distributed optimization. SIAM Journal on Optimization, 27(2):1171–1203, 2017

work page 2017

[23] [23]

Beznosikov, P

A. Beznosikov, P. Richtárik, M. Diskin, et al. Distributed methods with compressed communi- cation for solving variational inequalities, with theoretical guarantees. In Advances in Neural Information Processing Systems (NeurIPS), 35:14013–14029, 2022

work page 2022

[24] [24]

Doikov and Y

N. Doikov and Y. Nesterov. Local convergence of tensor methods. Mathematical Programming, 193:315–336, 2022

work page 2022

[25] [25]

Doikov, E

N. Doikov, E. M. Chayti, and M. Jaggi. Second-order optimization with lazy Hessians. In Proceedings of the International Conference on Machine Learning (ICML), 2023. 31 DisGRem W. Hu et al

work page 2023

[26] [26]

Koloskova, S

A. Koloskova, S. U. Stich, and M. Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In Proceedings of the International Conference on Machine Learning (ICML), pp. 3478–3487, 2019

work page 2019

[27] [27]

Li and Z

H. Li and Z. Lin. Accelerated gradient tracking over time-varying graphs for decentralized optimization. Journal of Machine Learning Research, 25(274):1–52, 2024

work page 2024

[28] [28]

Mokhtari, W

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. ESOM: A second-order method for exact decentralized optimization over networks. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016

work page 2016

[29] [29]

Pu and A

S. Pu and A. Nedić. Distributed stochastic gradient tracking methods. Mathematical Program- ming, 187:409–457, 2021

work page 2021

[30] [30]

Xin and U

R. Xin and U. A. Khan. Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. IEEE Transactions on Automatic Control, 65(6):2627–2633, 2020

work page 2020

[31] [31]

B. Li, S. Cen, Y. Chen, et al. Communication-efficient distributed optimization in networks with gradient tracking and variance reduction. Journal of Machine Learning Research, 21(180):1–51, 2020

work page 2020

[32] [32]

Nesterov

Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005

work page 2005

[33] [33]

Charbonnier, L

P. Charbonnier, L. Blanc-Féraud, G. Aubert, and M. Barlaud. Deterministic edge-preserving regularization in computed imaging. IEEE Transactions on Image Processing, 6(2):298–311, 1997

work page 1997

[34] [34]

H. H. Rosenbrock. An automatic method for finding the greatest or least value of a function. The Computer Journal, 3(3):175–184, 1960

work page 1960

[35] [35]

M. A. Styblinski and T. S. Tang. Experiments in nonconvex optimization: Stochastic approx- imation with function smoothing and simulated annealing. Neural Networks, 3(4):467–483, 1990

work page 1990

[36] [36]

Geman and C

D. Geman and C. Yang. Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing, 4(7):932–946, 1995

work page 1995

[37] [37]

Xie and Y

P. Xie and Y. Yuan. A derivative-free optimization algorithm combining line-search and trust-region techniques.Chinese Annals of Mathematics, Series B, 44(5):719–734, 2023

work page 2023

[38] [38]

Xie and Y

P. Xie and Y. Yuan. Derivative-free optimization with transformed objective functions (DFOTO) and the algorithm based on the least Frobenius norm updating quadratic model.Journal of the Operations Research Society of China, 13:327–363, 2025

work page 2025

[39] [39]

Xie and Y

P. Xie and Y. Yuan. A derivative-free method using a new underdetermined quadratic interpolation model.SIAM Journal on Optimization, 35(2):1110–1133, 2025

work page 2025

[40] [40]

Xie and Y

P. Xie and Y. Yuan. LeastH2 norm updating of quadratic interpolation models for derivative- free trust-region algorithms.IMA Journal of Numerical Analysis, 46(1):21–50, 2026. 32 DisGRem W. Hu et al

work page 2026

[41] [41]

Xie and Y

P. Xie and Y. Yuan. A new two-dimensional model-based subspace method for large-scale unconstrained derivative-free optimization: 2D-MoSub.Optimization Methods and Software, 41(1):118–150, 2026

work page 2026

[42] [42]

Xie and S

P. Xie and S. M. Wild. ReMU: Regional minimal updating for model-based derivative-free optimization. arXiv:2504.03606, 2025

work page arXiv 2025

[43] [43]

He and P

Y. He and P. Xie. Model-driven subspaces for large-scale optimization with local approximation strategy. arXiv:2509.08256, 2025

work page arXiv 2025

[44] [44]

P. Xie. A derivative-free trust-region method for optimization on the ellipsoid.Journal of Physics: Conference Series, 2620:012007, 2023

work page 2023

[45] [45]

P. Xie. Sufficient conditions for error distance reduction in theℓ2-norm trust region between minimizers of local nonconvex multivariate quadratic approximates.Journal of Computational and Applied Mathematics, 453:116146, 2025

work page 2025

[46] [46]

P. Xie, Z. Zhou, and Z. Zhou. Objective value change and shape-based accelerated optimization for the neural network approximation. arXiv:2508.20290, 2025

work page arXiv 2025

[47] [47]

Xie and S

P. Xie and S. M. Wild. Barycenter of weight coefficient region of least weightedH2 norm updating quadratic models with vanishing trust-region radius.SIAM NCC 2024, Early Career Travel Award, 2024

work page 2024

[48] [48]

P. Xie. On the relationship betweenΛ-poisedness in derivative-free optimization and outliers in local outlier factor. arXiv:2407.17529, 2024

work page arXiv 2024

[49] [49]

L. Li, P. Xie, and L. Zhang. A novel numerical method tailored for unconstrained optimization problems. arXiv:2504.02832, 2025

work page arXiv 2025

[50] [50]

P. Xie. An efficient derivative-free method for finding multiple solutions. To be posted on arXiv, 2024

work page 2024

[51] [51]

L. Li, Y. Zhou, P. Xie, and H. Li. A spectral Levenberg–Marquardt–Deflation method for multiple solutions of semilinear elliptic systems.Journal of Computational and Applied Mathematics, 2025

work page 2025

[52] [52]

Y. Ye, L. Li, P. Xie, and H. Yu. An improved adaptive orthogonal basis deflation method for multiple solutions with applications to nonlinear elliptic equations in varying domains.Journal of Computational Mathematics, 2025

work page 2025

[53] [53]

P. Xie. Privacy-preserving black-box optimization (PBBO): Theory and the model-based algorithm DFOp. arXiv:2601.11570, 2025

work page arXiv 2025

[54] [54]

Xie et al

P. Xie et al. A novel local analysis of objectives approximated by neural network: L-Change. International Conference on Mathematical Theory of Deep Learning (MTDL), 2024

work page 2024

[55] [55]

K. J. Dzahini, S. M. Wild, and P. Xie. Optimization approaches for solving inverse problems must account for uncertainty in both data and downstream decisions. Position paper, Inverse Methods for Complex Systems under Uncertainty Workshop, Sponsored by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, 2025

work page 2025

[56] [56]

Xie and M

P. Xie and M. Tao. Parametric resonant control of macroscopic behaviors of multiple oscillators. In2019 American Control Conference (ACC), pages 1898–1905, 2019. 33 DisGRem W. Hu et al

work page 1905

[57] [57]

P. Xie. A note on the invariant distribution of a stochastic dynamical system. 2024

work page 2024

[58] [58]

P. Xie. The modeling and optimization of a multi-dam system.Applied and Computational Mathematics, 13(5):140–152, 2024. A Dispersion recursions, burn-in analysis, and logarithmic mixing A.1 Basic inequalities Lemma A.1.Leth:R d→RhaveL2-Lipschitz Hessian. Then for allx,s, h(x+s)≤h(x) +⟨∇h(x),s⟩+1 2s⊤∇2h(x)s+ L2 6 ∥s∥3. Lemma A.2.For anyz∈RN and integert≥1,...

work page 2024