pith. sign in

arxiv: 2605.06016 · v1 · submitted 2026-05-07 · 🧮 math.OC

A Unified Zeroth-Order Proximal Newton-Type Framework for Composite Optimization

Pith reviewed 2026-05-08 08:18 UTC · model grok-4.3

classification 🧮 math.OC
keywords zeroth-order optimizationproximal Newton methodcomposite optimizationBFGSfinite-difference gradientsuperlinear convergencederivative-free algorithmnonconvex optimization
0
0 comments X

The pith

A unified derivative-free proximal Newton framework solves composite black-box optimization with established complexity and superlinear rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a single algorithmic framework that applies proximal Newton updates using only function-value queries to minimize the sum of an inaccessible smooth function and a known nonsmooth regularizer. It derives global iteration and oracle complexity bounds that guarantee an epsilon-optimal point for both nonconvex and strongly convex objectives, plus local R-superlinear convergence once a Dennis-More-type condition holds. The analysis also resolves an open question by proving that the BFGS quasi-Newton update is more stable and compatible when paired with finite-difference gradient estimators than with smoothing-based estimators.

Core claim

The central claim is that one derivative-free proximal Newton-type scheme unifies global convergence guarantees, local superlinear rates under the Dennis-More condition, and a clear preference for finite-difference over smoothing gradient estimators when used inside the BFGS update for composite problems of the form f(x) + r(x), where f is black-box and r admits an efficient proximal mapping.

What carries the argument

The unified zeroth-order proximal Newton-type framework, which substitutes finite-difference or smoothing gradient estimates into proximal Newton steps and BFGS Hessian approximations for the composite objective.

If this is right

  • The algorithm reaches an epsilon-optimal solution with explicit iteration and oracle complexity bounds in both nonconvex and strongly convex regimes.
  • Local R-superlinear convergence holds whenever the Dennis-More condition is met.
  • The BFGS update is theoretically more compatible with finite-difference gradient estimators than with smoothing-based estimators.
  • Numerical performance improves when the framework is instantiated with finite-difference rather than smoothing estimators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same framework could be instantiated with other quasi-Newton updates beyond BFGS while preserving the compatibility advantage of finite-difference estimators.
  • The complexity results suggest that zeroth-order methods can match first-order rates in practice for problems where only function values are available.
  • The preference for finite differences may extend to other composite settings in machine learning that combine black-box losses with structured regularizers.

Load-bearing premise

The black-box function must admit sufficiently accurate gradient approximations via finite differences or smoothing, and the regularizer must have a cheaply computable proximal mapping.

What would settle it

A concrete counter-example in which the BFGS scheme combined with a smoothing gradient estimator fails to satisfy the Dennis-More condition while the identical scheme with a finite-difference estimator succeeds, or a numerical instance where the stated iteration complexity bound is violated under the paper's assumptions.

Figures

Figures reproduced from arXiv: 2605.06016 by Jinyan Fan, Zekun Liu.

Figure 1
Figure 1. Figure 1: Objective value versus NF on the LASSO problem. variants of Algorithm 1: ZOPN-BFGS, combining forward-difference-based gradients with the BFGS Hessian approximation (4.15), and ZOPN-LazyH, sharing the same gradient estimator with the lazy Hessian strategy (4.18). The true sparse vector x has a sparsity level of 0.1. We set the line search parameter c2 = 1 and let the sampling radius decay as ∆k = max{10−10… view at source ↗
Figure 2
Figure 2. Figure 2: Evolution of εk versus iteration k on the LASSO problem. function evaluations increases. The two methods attain comparable final accuracy under the same computational budget. 0 0.5 1 1.5 2 2.5 3 3.5 4 Number of Function Evaluations 104 10-10 10-8 10-6 10-4 10-2 100 (a) a1a 0 0.5 1 1.5 2 2.5 3 3.5 4 Number of Function Evaluations 104 10-10 10-8 10-6 10-4 10-2 100 (b) a4a 0 0.5 1 1.5 2 2.5 3 3.5 4 Number of … view at source ↗
Figure 3
Figure 3. Figure 3: Objective value versus NF on the ℓ1-regularized logistic regression problem. 5.3. ℓ2-regularized logistic regression. Consider the ℓ2-regularized logistic regression prob￾lem [30] min x∈Rn 1 p Xp i=1 log(1 + exp(−bia ⊤ i x)) + ζ 2 ∥x∥ 2 , where we set ζ = 10−3 . The results are shown in view at source ↗
Figure 4
Figure 4. Figure 4: Objective value versus NF on the ℓ2-regularized logistic regression problem. 5.4. Elastic net-regularized binary classification. Consider the elastic net-regularized bi￾nary classification problem [23, 16, 18] min x∈Rn 1 p Xp i=1 1 1 + exp(bia ⊤ i x) + ζ1∥x∥1 + ζ2 2 ∥x∥ 2 , where we set ζ1 = 10−3 and ζ2 = 2 × 10−3 . The results are presented in view at source ↗
Figure 5
Figure 5. Figure 5: Objective value versus NF on the elastic net-regularized binary classification problem. 0 0.5 1 1.5 2 2.5 3 3.5 4 Number of Function Evaluations 104 10-3 10-2 10-1 100 (a) a1a 0 0.5 1 1.5 2 2.5 3 3.5 4 Number of Function Evaluations 104 10-3 10-2 10-1 100 (b) a4a 0 0.5 1 1.5 2 2.5 3 3.5 4 Number of Function Evaluations 104 10-3 10-2 10-1 100 (c) a9a 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Numbe… view at source ↗
Figure 6
Figure 6. Figure 6: Objective value versus NF on the nonconvex SVM problem view at source ↗
read the original abstract

We propose a unified derivative-free proximal Newton-type algorithm framework for solving composite optimization problems formulated as the sum of a black-box function and a known regularization term. We establish the iteration and oracle complexity bounds for the algorithm to attain an $\epsilon$-optimal solution under both nonconvex and strongly convex settings. We also establish its local R-superlinear convergence based on the Dennis--Mor\'{e} condition, and theoretically address an open problem by showing that the BFGS scheme is more compatible with finite-difference gradient estimators than with smoothing-based ones. Numerical experiments are further presented to demonstrate the efficiency of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a unified derivative-free proximal Newton-type algorithm framework for composite optimization problems of the form f(x) + r(x), where f is a black-box function and r admits an efficient proximal mapping. It derives iteration and oracle complexity bounds to reach an ε-optimal solution in both nonconvex and strongly convex regimes, establishes local R-superlinear convergence under the Dennis–Moré condition, and resolves an open problem by proving that BFGS updates are more compatible with finite-difference gradient estimators than with smoothing-based estimators. Numerical experiments are provided to illustrate practical performance.

Significance. If the stated complexity bounds, local convergence result, and BFGS compatibility claim hold under the paper's assumptions, the work offers a useful unification of zeroth-order proximal Newton methods for composite problems. The resolution of the open problem on estimator compatibility with quasi-Newton schemes is a concrete contribution that clarifies design choices in derivative-free optimization; the complexity results extend existing zeroth-order analyses to the proximal setting in a standard way.

major comments (2)
  1. [Introduction / Main complexity theorems] The central claims rest on the accuracy of the gradient estimators (finite-difference or smoothing) under Lipschitz-gradient or bounded-Hessian assumptions on the black-box term; these assumptions are load-bearing for all stated complexity and convergence results but are only summarized in the abstract and introduction. The precise statement of these conditions and how they enter the oracle-complexity proofs should be given explicitly in the main theorems (e.g., the theorem establishing the ε-optimal iteration bound).
  2. [Local convergence section] The claim that the BFGS scheme is 'more compatible' with finite-difference estimators than smoothing-based ones is presented as resolving an open problem. The precise metric of compatibility (e.g., preservation of the Dennis–Moré condition or effect on the local convergence rate) and the supporting argument should be isolated in a dedicated theorem or corollary with a clear statement of the open problem being addressed.
minor comments (2)
  1. [Algorithm framework] Notation for the proximal mapping and the zeroth-order gradient estimators should be introduced once and used consistently; occasional reuse of symbols for different quantities appears in the algorithm description.
  2. [Numerical experiments] The numerical experiments section would benefit from a brief table summarizing the test problems, dimensions, and performance metrics (iteration count, oracle calls) to make the efficiency claims easier to compare with prior zeroth-order methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment, and constructive suggestions. We address the two major comments point by point below. Both can be resolved by clarifications and minor restructuring that do not affect the technical content or proofs.

read point-by-point responses
  1. Referee: [Introduction / Main complexity theorems] The central claims rest on the accuracy of the gradient estimators (finite-difference or smoothing) under Lipschitz-gradient or bounded-Hessian assumptions on the black-box term; these assumptions are load-bearing for all stated complexity and convergence results but are only summarized in the abstract and introduction. The precise statement of these conditions and how they enter the oracle-complexity proofs should be given explicitly in the main theorems (e.g., the theorem establishing the ε-optimal iteration bound).

    Authors: We agree that the load-bearing assumptions on the estimators should be stated explicitly inside the main complexity theorems rather than only summarized earlier. In the revised manuscript we will add, immediately before the statement of the main nonconvex and strongly convex complexity theorems, a short paragraph that recalls the precise conditions (Lipschitz-gradient for finite-difference estimators and bounded-Hessian for smoothing estimators) together with the resulting gradient-error bounds. We will also insert a one-sentence pointer inside each theorem statement indicating how these error bounds are used in the subsequent oracle-complexity argument. The proofs themselves remain unchanged. revision: yes

  2. Referee: [Local convergence section] The claim that the BFGS scheme is 'more compatible' with finite-difference estimators than smoothing-based ones is presented as resolving an open problem. The precise metric of compatibility (e.g., preservation of the Dennis–Moré condition or effect on the local convergence rate) and the supporting argument should be isolated in a dedicated theorem or corollary with a clear statement of the open problem being addressed.

    Authors: We appreciate the suggestion to isolate the resolution of the open problem. The open problem (cited in the introduction) asks whether quasi-Newton updates remain compatible with zeroth-order estimators in the sense that the Dennis–Moré condition continues to hold, thereby guaranteeing local R-superlinear convergence. Our analysis shows that finite-difference estimators preserve the Dennis–Moré condition while smoothing-based estimators generally do not. In the revision we will extract the relevant argument into a new, self-contained corollary (placed at the end of the local-convergence section) that (i) restates the open problem, (ii) defines the compatibility metric as preservation of the Dennis–Moré condition, and (iii) states the result as a corollary to the existing local-convergence theorem. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation chain relies on standard zeroth-order gradient estimation under Lipschitz or bounded-Hessian assumptions, Dennis-Moré condition for local R-superlinear convergence, and conventional complexity analysis for nonconvex/strongly convex composite problems. These are independent external benchmarks in optimization theory, not quantities defined in terms of the algorithm outputs or fitted parameters. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; the BFGS compatibility result follows from direct comparison of estimator properties within the paper's own framework.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard domain assumptions for zeroth-order optimization (Lipschitz smoothness or bounded variation of the black-box function) and on the existence of an efficient proximal operator for the regularizer; algorithm parameters such as finite-difference step size or smoothing radius are typically chosen by theory or tuning but are not enumerated in the abstract.

axioms (2)
  • domain assumption The black-box function has Lipschitz continuous gradients (or satisfies conditions allowing accurate finite-difference or smoothing gradient estimates).
    Required for all zeroth-order gradient approximations to be useful in the proximal Newton steps.
  • domain assumption The regularization term admits an efficiently computable proximal mapping.
    Standard assumption enabling the proximal Newton framework.

pith-pipeline@v0.9.0 · 5392 in / 1479 out tokens · 48837 ms · 2026-05-08T08:18:26.450907+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  1. [1]

    Aminifard and S

    Z. Aminifard and S. Babaie-Kafaki,An approximate Newton-type proximal method using symmetric rank- one updating formula for minimizing the nonsmooth composite functions, Optim. Methods Softw., 38 (2022), pp. 529–542

  2. [2]

    Audet and W

    C. Audet and W. Hare,Derivative-free and Blackbox Optimization, Springer-Verlag, Heidelberg, 2017

  3. [3]

    Balasubramanian and S

    K. Balasubramanian and S. Ghadimi,Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points, Found. Comput. Math., 22 (2022), pp. 35–76

  4. [4]

    Beck and M

    A. Beck and M. Teboulle,A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202

  5. [5]

    A. S. Berahas, L. Cao, K. Choromanski, and K. Scheinberg,A theoretical and empirical comparison of gradient approximations in derivative-free optimization, Found. Comput. Math., 22 (2022), pp. 507–560

  6. [6]

    A. S. Berahas, J. Nocedal, and M. Tak´ aˇ c,A multi-batch L-BFGS method for machine learning, Advances in Neural Information Processing Systems, Red Hook, NY, 2016, pp. 1063–1071

  7. [7]

    Bollapragada, C

    R. Bollapragada, C. Karamanli, and S. M. Wild,Derivative-free stochastic optimization via adaptive sam- pling strategies, Optim. Methods Softw., (2025), pp. 1–34

  8. [8]

    Bollapragada and S

    R. Bollapragada and S. M. Wild,Adaptive sampling quasi-Newton methods for zeroth-order stochastic opti- mization, Math. Program. Comput., 15 (2023), pp. 327–364

  9. [9]

    R. H. Byrd and J. Nocedal,A tool for the analysis of quasi-Newton methods with application to unconstrained minimization, SIAM J. Numer. Anal., 26 (1989), pp. 727–739

  10. [10]

    H. Q. Cai, D. McKenzie, W. Yin, and Z. Zhang,Zeroth-order regularized optimization (ZORO): Approxi- mately sparse gradients and adaptive sampling, SIAM J. Optim., 32 (2022), pp. 687–714

  11. [11]

    A. R. Conn, K. Scheinberg, and L. N. Vicente,Introduction to Derivative-Free Optimization, Society for Industrial and Applied Mathematics, Philadelphia, 2009

  12. [12]

    Defazio, F

    A. Defazio, F. Bach, and S. Lacoste-Julien,SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems, Cambridge, MA, 2014, pp. 1646–1654

  13. [13]

    J. E. Dennis and J. J. Mor´ e,A characterization of superlinear convergence and its application to quasi- Newton methods, Math. Comp., 28 (1974), pp. 549–560

  14. [14]

    Doikov and G

    N. Doikov and G. N. Grapiglia,First and zeroth-order implementations of the regularized Newton method with lazy approximated Hessians, J. Sci. Comput., 103 (2025), no. 32

  15. [15]

    A. D. Flaxman, A. T. Kalai, and H. B. McMahan,Online convex optimization in the bandit setting: gra- dient descent without a gradient, Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Vancouver, British Columbia, 2005, pp. 385–394

  16. [16]

    Huang, B

    F. Huang, B. Gu, Z. Huo, S. Chen, and H. Huang,Faster gradient-free proximal stochastic methods for non- convex nonsmooth optimization, Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, 2019, pp. 1503–1510

  17. [17]

    Kanzow and T

    C. Kanzow and T. Lechner,Globalized inexact proximal Newton-type methods for nonconvex composite functions, Comput. Optim. Appl., 78 (2021), pp. 377–410

  18. [18]

    Kazemi and L

    E. Kazemi and L. Wang,Efficient zeroth-order proximal stochastic method for nonconvex nonsmooth black-box problems, Mach. Learn., 113 (2024), pp. 97–120

  19. [19]

    Kungurtsev and F

    V. Kungurtsev and F. Rinaldi,A zeroth order method for stochastic weakly convex optimization, Comput. Optim. Appl., 80 (2021), pp. 731–753

  20. [20]

    Larson, M

    J. Larson, M. Menickelly, and S. M. Wild,Derivative-free optimization methods, Acta Numer., 28 (2019), pp. 287–404. ZEROTH-ORDER ALGORITHMS FOR COMPOSITE PROBLEMS 27

  21. [21]

    J. D. Lee, Y. Sun, and M. A. Saunders,Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443

  22. [22]

    Liu, P.-Y

    S. Liu, P.-Y. Chen, B. Kailkhura, G. Zhang, A. O. Hero III, and P. K. Varshney,A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications, IEEE Signal Process. Mag., 37 (2020), pp. 43–54

  23. [23]

    S. Liu, L. Wang, N. Xiao, and X. Liu,An inexact preconditioned zeroth-order proximal method for composite optimization, J. Oper. Res. Soc. China, (2025)

  24. [24]

    Nakayama, Y

    S. Nakayama, Y. Narushima, and H. Yabe,Inexact proximal memoryless quasi-Newton methods based on the Broyden family for minimizing composite functions, Comput. Optim. Appl., 79 (2021), pp. 127–154

  25. [25]

    Adaptive first-and zeroth-order methods for weakly convex stochastic optimization problems.arXiv preprint arXiv:2005.09261, 2020

    P. Nazari, D. A. Tarzanagh, and G. Michailidis,Adaptive first- and zeroth-order methods for weakly convex stochastic optimization problems, preprint, (2020). Available at arXiv:2005.09261

  26. [26]

    Nesterov,Gradient methods for minimizing composite functions, Math

    Y. Nesterov,Gradient methods for minimizing composite functions, Math. Program., 140 (2013), pp. 125– 161

  27. [27]

    Nesterov and V

    Y. Nesterov and V. Spokoiny,Random gradient-free minimization of convex functions, Found. Comput. Math., 17 (2017), pp. 527–566

  28. [28]

    Pougkakiotis and D

    S. Pougkakiotis and D. Kalogerias,A zeroth-order proximal stochastic gradient method for weakly convex stochastic optimization, SIAM J. Sci. Comput., 45 (2023), pp. 2679–2702

  29. [29]

    S. J. Reddi, S. Sra, B. P´ oczos, and A. J. Smola,Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization, Advances in Neural Information Processing Systems, Red Hook, NY, 2016, pp. 1153–1161

  30. [30]

    Rodomanov and D

    A. Rodomanov and D. Kropotov,A superlinearly-convergent proximal Newton-type method for the optimiza- tion of finite sums, Proceedings of the 33rd International Conference on Machine Learning, New York, NY, 2016, pp. 2597–2605

  31. [31]

    X. Wang, X. Wang, and Y.-X. Yuan,Stochastic proximal quasi-Newton methods for non-convex composite optimization, Optim. Methods Softw., 34 (2019), pp. 922–948

  32. [32]

    Xiao and T

    L. Xiao and T. Zhang,A proximal stochastic gradient method with progressive variance reduction, SIAM J. Optim., 24 (2014), pp. 2057–2075. School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China. Email address:sjtu lzk@sjtu.edu.cn School of Mathematical Sciences, and MOE-LSC, Shanghai Jiao Tong University, Shanghai 200240, Ch...