Proximal Nonlinear Conjugate Gradient Methods for Composite Optimization

Shodai Hamana; Yasushi Narushima

arxiv: 2604.11150 · v1 · submitted 2026-04-13 · 🧮 math.OC · cs.NA· math.NA

Proximal Nonlinear Conjugate Gradient Methods for Composite Optimization

Shodai Hamana , Yasushi Narushima This is my paper

Pith reviewed 2026-05-10 15:39 UTC · model grok-4.3

classification 🧮 math.OC cs.NAmath.NA

keywords proximal methodsnonlinear conjugate gradientcomposite optimizationforward-backward residualHestenes-Stiefel formulaglobal convergenceweakly convex functions

0 comments

The pith

A proximal nonlinear conjugate gradient method replaces gradients with forward-backward residuals to solve composite optimization problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a proximal nonlinear conjugate gradient method for minimizing the sum of a smooth nonconvex function and a nonsmooth convex or weakly convex function. It defines search directions via the three-term Hestenes-Stiefel formula applied to the forward-backward residual, which is computed from the proximal mapping of the nonsmooth term rather than a gradient. Global convergence is established under standard assumptions for both the convex and weakly convex cases. A convergence rate is characterized when the smooth term is strongly convex. Numerical experiments indicate that the method performs stably and outperforms existing approaches in convex and nonconvex settings.

Core claim

The proposed proximal nonlinear conjugate gradient method uses the forward-backward residual defined via the proximal mapping in place of the gradient and selects search directions according to the three-term Hestenes-Stiefel formula. This construction extends nonlinear conjugate gradient methods to composite objectives and yields global convergence for both convex and weakly convex nonsmooth terms under standard assumptions, together with an explicit rate when the smooth term is strongly convex.

What carries the argument

The forward-backward residual computed from the proximal mapping of the nonsmooth term, which replaces the gradient and allows the three-term Hestenes-Stiefel formula to generate search directions for the composite problem.

Load-bearing premise

Global convergence depends on the gradient of the smooth function being Lipschitz continuous and the objective having bounded sublevel sets.

What would settle it

A concrete composite objective function on which the generated iterates diverge or cycle indefinitely under the stated update rules would disprove the global convergence result.

read the original abstract

The nonlinear conjugate gradient methods are known to be an effective approach for standard unconstrained optimization problems especially for large-scale problems. This paper proposes a proximal nonlinear conjugate gradient method, which extends the nonlinear conjugate gradient methods to composite objective functions, namely, the sum of a smooth nonconvex function and a nonsmooth convex function, and its extension to the case where the nonsmooth function is weakly convex. The proposed method uses the forward-backward residual which is defined by using the proximal mapping instead of the gradient and determines the search direction based on the three-term Hestenes-Stiefel (HS) formula. We establish global convergence under standard assumptions, both convex and weakly convex nonsmooth fuctions. In addition, we characterize the convergence rate when the smooth term is strongly convex. Finally, numerical experiments show that the proposed method is stable and achieves better performance than existing methods in both convex and nonconvex settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Proximal NCG with three-term HS on forward-backward residual extends the method to composite problems but requires careful check on the descent inequality.

read the letter

The main takeaway is that this work adapts nonlinear conjugate gradient to proximal composite optimization using the forward-backward residual and a three-term HS formula, with convergence theory for convex and weakly convex nonsmooth functions. What is actually new is the proximal extension with that specific three-term formula based on the residual. The paper does well in laying out global convergence under standard assumptions for both cases mentioned, plus a rate when the smooth part is strongly convex. The numerical experiments are presented as showing stability and better performance than existing methods. The soft spots are around the key inequality. The stress-test note is on point: the descent condition r^T d ≤ -c ||r||^2 does not follow automatically from the HS formula when the residual comes from a proximal mapping and the smooth term can be nonconvex. The paper invokes standard assumptions to get the convergence, but the proof must re-establish the descent from the definition of the residual and the three-term update. If that step is missing or incomplete, the rest of the argument does not hold. Since the abstract is brief on details, this is the part that needs the most attention in review. This is for optimization researchers focused on large-scale first-order methods for composite problems. Readers who work with CG methods and want to see them extended to nonsmooth settings will find it relevant. It deserves peer review because it targets a clear gap and provides both theory and numerics that can be evaluated.

Referee Report

2 major / 2 minor

Summary. The paper proposes proximal nonlinear conjugate gradient methods for composite optimization min f(x) + g(x), where f is smooth (possibly nonconvex) and g is nonsmooth convex or weakly convex. It defines the forward-backward residual r_k via the proximal mapping of g instead of the gradient, generates search directions with a three-term Hestenes-Stiefel formula, proves global convergence under standard assumptions for both convex and weakly convex cases, characterizes a convergence rate when f is strongly convex, and reports numerical experiments showing stability and superiority over existing methods in convex and nonconvex settings.

Significance. If the central convergence claims hold, the work provides a useful extension of nonlinear CG methods to large-scale composite problems common in machine learning and signal processing. The forward-backward residual approach efficiently incorporates the proximal mapping, preserving the low-memory advantage of CG while handling nonsmooth terms. The rate result for the strongly convex case and the reported numerical improvements add practical value; the paper also supplies reproducible experiments that can be directly compared to baselines.

major comments (2)

[§4.1–4.2] §4.1–4.2 (Global Convergence): The proof of the key sufficient-descent inequality r_k^T d_k ≤ −c‖r_k‖² (with c>0) for the three-term HS direction d_k is load-bearing for the subsequent Zoutendijk-type argument. Because r_k is defined via the proximal mapping of g rather than as ∇(f+g), the algebraic cancellation that produces descent in the smooth unconstrained case does not hold automatically when f is nonconvex. The manuscript must supply an explicit derivation of this inequality from the proximal definition alone, without relying on the smooth-case identity.
[§4.3] §4.3 (Weakly Convex Case): The extension of the convergence argument to weakly convex g invokes the same descent property and a modified subgradient inequality. If the descent step in §4.1 does not hold for nonconvex f, the weakly convex result collapses as well; a separate verification or counter-example check is required.

minor comments (2)

[Abstract] Abstract: 'fuctions' is a typographical error and should read 'functions'.
[§3] Notation: The step-size selection rule and the precise definition of the three-term HS parameter β_k should be stated explicitly before the convergence theorems, rather than being deferred to the algorithm box.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of the proofs.

read point-by-point responses

Referee: [§4.1–4.2] §4.1–4.2 (Global Convergence): The proof of the key sufficient-descent inequality r_k^T d_k ≤ −c‖r_k‖² (with c>0) for the three-term HS direction d_k is load-bearing for the subsequent Zoutendijk-type argument. Because r_k is defined via the proximal mapping of g rather than as ∇(f+g), the algebraic cancellation that produces descent in the smooth unconstrained case does not hold automatically when f is nonconvex. The manuscript must supply an explicit derivation of this inequality from the proximal definition alone, without relying on the smooth-case identity.

Authors: We agree that the derivation should be made fully explicit and self-contained. The three-term HS direction is constructed as d_k = −r_k + β_k (r_k − r_{k−1}) − γ_k (x_k − x_{k−1}), where β_k and γ_k are chosen from the proximal residual r_k. Using only the definition r_k = x_k − prox_{λg}(x_k − ∇f(x_k)) and the firm nonexpansiveness of the proximal operator, we can expand r_k^T d_k and bound the cross terms via the variational inequality characterizing the prox mapping. This yields r_k^T d_k ≤ −(1/2)‖r_k‖² without invoking any monotonicity of ∇f or convexity of f. In the revised manuscript we will insert a dedicated lemma containing this step-by-step expansion immediately before the Zoutendijk argument. revision: yes
Referee: [§4.3] §4.3 (Weakly Convex Case): The extension of the convergence argument to weakly convex g invokes the same descent property and a modified subgradient inequality. If the descent step in §4.1 does not hold for nonconvex f, the weakly convex result collapses as well; a separate verification or counter-example check is required.

Authors: The descent inequality is established in §4.1 from the proximal residual alone and does not rely on convexity of f; the same algebraic bound therefore carries over unchanged to the weakly convex setting. In the revised version we will add an explicit remark after the descent lemma stating that the argument is independent of the convexity of f, followed by the modified subgradient inequality for weakly convex g (g(y) ≥ g(x) + ⟨s, y−x⟩ − (ρ/2)‖y−x‖²). This separates the two ingredients clearly and confirms that the global convergence proof remains valid. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation of proximal NCG method or convergence claims

full rationale

The paper defines a proximal nonlinear conjugate gradient method by replacing the gradient with the forward-backward residual (defined via proximal mapping) and adopting the three-term HS formula for the search direction. Global convergence for convex and weakly convex cases, plus rate results under strong convexity, are derived from the method definition together with standard assumptions on Lipschitz continuity and sublevel sets. No equations or steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the algebraic properties of the HS formula are applied to the new residual in a manner that is not tautological, and the proof chain remains independent of the target claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest primarily on standard domain assumptions from optimization theory rather than new free parameters or invented entities.

axioms (1)

domain assumption Standard assumptions for global convergence (e.g., Lipschitz continuity of the gradient of the smooth term and bounded sublevel sets)
Invoked to establish global convergence of the proximal NCG iterates.

pith-pipeline@v0.9.0 · 5453 in / 1199 out tokens · 40219 ms · 2026-05-10T15:39:36.160720+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

Al-Baali, Y

[1]M. Al-Baali, Y. Narushima, and H. Yabe,A family of three-term conjugate gradient methods with sufficient descent property for unconstrained optimization, Computational Optimiza- tion and Applications, 60 (2015), pp. 89–110, https://doi.org/10.1007/s10589-014-9662-z. [2]A. Aravkin, M. P. Friedlander, F. J. Herrmann, and T. van Leeuwen,Robust inver- sion...

work page doi:10.1007/s10589-014-9662-z 2015
[2]

[23]J. Li, M. S. Andersen, and L. Vandenberghe,Inexact proximal newton methods for self- concordant functions, Mathematical Methods of Operations Research, 85 (2017), pp. 19–41, https://doi.org/10.1007/s00186-016-0566-9. [24]A. Milzarek and M. Ulbrich,A semismooth newton method with multidimensional filter globalization forℓ 1-optimization, SIAM Journal o...

work page doi:10.1007/s00186-016-0566-9 2017
[3]

Themelis, L

[37]A. Themelis, L. Stella, and P. Patrinos,Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone linesearch algorithms, SIAM Journal on Optimization, 28 (2018), pp. 2274–2303, https://doi.org/10.1137/16M1080240. [38]R. Tibshirani,Regression shrinkage and selection via the lasso, Journal of the Royal Statisti...

work page doi:10.1137/16m1080240 2018

[1] [1]

Al-Baali, Y

[1]M. Al-Baali, Y. Narushima, and H. Yabe,A family of three-term conjugate gradient methods with sufficient descent property for unconstrained optimization, Computational Optimiza- tion and Applications, 60 (2015), pp. 89–110, https://doi.org/10.1007/s10589-014-9662-z. [2]A. Aravkin, M. P. Friedlander, F. J. Herrmann, and T. van Leeuwen,Robust inver- sion...

work page doi:10.1007/s10589-014-9662-z 2015

[2] [2]

[23]J. Li, M. S. Andersen, and L. Vandenberghe,Inexact proximal newton methods for self- concordant functions, Mathematical Methods of Operations Research, 85 (2017), pp. 19–41, https://doi.org/10.1007/s00186-016-0566-9. [24]A. Milzarek and M. Ulbrich,A semismooth newton method with multidimensional filter globalization forℓ 1-optimization, SIAM Journal o...

work page doi:10.1007/s00186-016-0566-9 2017

[3] [3]

Themelis, L

[37]A. Themelis, L. Stella, and P. Patrinos,Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone linesearch algorithms, SIAM Journal on Optimization, 28 (2018), pp. 2274–2303, https://doi.org/10.1137/16M1080240. [38]R. Tibshirani,Regression shrinkage and selection via the lasso, Journal of the Royal Statisti...

work page doi:10.1137/16m1080240 2018