A Unified Zeroth-Order Proximal Newton-Type Framework for Composite Optimization
Pith reviewed 2026-05-08 08:18 UTC · model grok-4.3
The pith
A unified derivative-free proximal Newton framework solves composite black-box optimization with established complexity and superlinear rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that one derivative-free proximal Newton-type scheme unifies global convergence guarantees, local superlinear rates under the Dennis-More condition, and a clear preference for finite-difference over smoothing gradient estimators when used inside the BFGS update for composite problems of the form f(x) + r(x), where f is black-box and r admits an efficient proximal mapping.
What carries the argument
The unified zeroth-order proximal Newton-type framework, which substitutes finite-difference or smoothing gradient estimates into proximal Newton steps and BFGS Hessian approximations for the composite objective.
If this is right
- The algorithm reaches an epsilon-optimal solution with explicit iteration and oracle complexity bounds in both nonconvex and strongly convex regimes.
- Local R-superlinear convergence holds whenever the Dennis-More condition is met.
- The BFGS update is theoretically more compatible with finite-difference gradient estimators than with smoothing-based estimators.
- Numerical performance improves when the framework is instantiated with finite-difference rather than smoothing estimators.
Where Pith is reading between the lines
- The same framework could be instantiated with other quasi-Newton updates beyond BFGS while preserving the compatibility advantage of finite-difference estimators.
- The complexity results suggest that zeroth-order methods can match first-order rates in practice for problems where only function values are available.
- The preference for finite differences may extend to other composite settings in machine learning that combine black-box losses with structured regularizers.
Load-bearing premise
The black-box function must admit sufficiently accurate gradient approximations via finite differences or smoothing, and the regularizer must have a cheaply computable proximal mapping.
What would settle it
A concrete counter-example in which the BFGS scheme combined with a smoothing gradient estimator fails to satisfy the Dennis-More condition while the identical scheme with a finite-difference estimator succeeds, or a numerical instance where the stated iteration complexity bound is violated under the paper's assumptions.
Figures
read the original abstract
We propose a unified derivative-free proximal Newton-type algorithm framework for solving composite optimization problems formulated as the sum of a black-box function and a known regularization term. We establish the iteration and oracle complexity bounds for the algorithm to attain an $\epsilon$-optimal solution under both nonconvex and strongly convex settings. We also establish its local R-superlinear convergence based on the Dennis--Mor\'{e} condition, and theoretically address an open problem by showing that the BFGS scheme is more compatible with finite-difference gradient estimators than with smoothing-based ones. Numerical experiments are further presented to demonstrate the efficiency of the proposed method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a unified derivative-free proximal Newton-type algorithm framework for composite optimization problems of the form f(x) + r(x), where f is a black-box function and r admits an efficient proximal mapping. It derives iteration and oracle complexity bounds to reach an ε-optimal solution in both nonconvex and strongly convex regimes, establishes local R-superlinear convergence under the Dennis–Moré condition, and resolves an open problem by proving that BFGS updates are more compatible with finite-difference gradient estimators than with smoothing-based estimators. Numerical experiments are provided to illustrate practical performance.
Significance. If the stated complexity bounds, local convergence result, and BFGS compatibility claim hold under the paper's assumptions, the work offers a useful unification of zeroth-order proximal Newton methods for composite problems. The resolution of the open problem on estimator compatibility with quasi-Newton schemes is a concrete contribution that clarifies design choices in derivative-free optimization; the complexity results extend existing zeroth-order analyses to the proximal setting in a standard way.
major comments (2)
- [Introduction / Main complexity theorems] The central claims rest on the accuracy of the gradient estimators (finite-difference or smoothing) under Lipschitz-gradient or bounded-Hessian assumptions on the black-box term; these assumptions are load-bearing for all stated complexity and convergence results but are only summarized in the abstract and introduction. The precise statement of these conditions and how they enter the oracle-complexity proofs should be given explicitly in the main theorems (e.g., the theorem establishing the ε-optimal iteration bound).
- [Local convergence section] The claim that the BFGS scheme is 'more compatible' with finite-difference estimators than smoothing-based ones is presented as resolving an open problem. The precise metric of compatibility (e.g., preservation of the Dennis–Moré condition or effect on the local convergence rate) and the supporting argument should be isolated in a dedicated theorem or corollary with a clear statement of the open problem being addressed.
minor comments (2)
- [Algorithm framework] Notation for the proximal mapping and the zeroth-order gradient estimators should be introduced once and used consistently; occasional reuse of symbols for different quantities appears in the algorithm description.
- [Numerical experiments] The numerical experiments section would benefit from a brief table summarizing the test problems, dimensions, and performance metrics (iteration count, oracle calls) to make the efficiency claims easier to compare with prior zeroth-order methods.
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment, and constructive suggestions. We address the two major comments point by point below. Both can be resolved by clarifications and minor restructuring that do not affect the technical content or proofs.
read point-by-point responses
-
Referee: [Introduction / Main complexity theorems] The central claims rest on the accuracy of the gradient estimators (finite-difference or smoothing) under Lipschitz-gradient or bounded-Hessian assumptions on the black-box term; these assumptions are load-bearing for all stated complexity and convergence results but are only summarized in the abstract and introduction. The precise statement of these conditions and how they enter the oracle-complexity proofs should be given explicitly in the main theorems (e.g., the theorem establishing the ε-optimal iteration bound).
Authors: We agree that the load-bearing assumptions on the estimators should be stated explicitly inside the main complexity theorems rather than only summarized earlier. In the revised manuscript we will add, immediately before the statement of the main nonconvex and strongly convex complexity theorems, a short paragraph that recalls the precise conditions (Lipschitz-gradient for finite-difference estimators and bounded-Hessian for smoothing estimators) together with the resulting gradient-error bounds. We will also insert a one-sentence pointer inside each theorem statement indicating how these error bounds are used in the subsequent oracle-complexity argument. The proofs themselves remain unchanged. revision: yes
-
Referee: [Local convergence section] The claim that the BFGS scheme is 'more compatible' with finite-difference estimators than smoothing-based ones is presented as resolving an open problem. The precise metric of compatibility (e.g., preservation of the Dennis–Moré condition or effect on the local convergence rate) and the supporting argument should be isolated in a dedicated theorem or corollary with a clear statement of the open problem being addressed.
Authors: We appreciate the suggestion to isolate the resolution of the open problem. The open problem (cited in the introduction) asks whether quasi-Newton updates remain compatible with zeroth-order estimators in the sense that the Dennis–Moré condition continues to hold, thereby guaranteeing local R-superlinear convergence. Our analysis shows that finite-difference estimators preserve the Dennis–Moré condition while smoothing-based estimators generally do not. In the revision we will extract the relevant argument into a new, self-contained corollary (placed at the end of the local-convergence section) that (i) restates the open problem, (ii) defines the compatibility metric as preservation of the Dennis–Moré condition, and (iii) states the result as a corollary to the existing local-convergence theorem. revision: yes
Circularity Check
No significant circularity
full rationale
The derivation chain relies on standard zeroth-order gradient estimation under Lipschitz or bounded-Hessian assumptions, Dennis-Moré condition for local R-superlinear convergence, and conventional complexity analysis for nonconvex/strongly convex composite problems. These are independent external benchmarks in optimization theory, not quantities defined in terms of the algorithm outputs or fitted parameters. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; the BFGS compatibility result follows from direct comparison of estimator properties within the paper's own framework.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The black-box function has Lipschitz continuous gradients (or satisfies conditions allowing accurate finite-difference or smoothing gradient estimates).
- domain assumption The regularization term admits an efficiently computable proximal mapping.
Reference graph
Works this paper leans on
-
[1]
Z. Aminifard and S. Babaie-Kafaki,An approximate Newton-type proximal method using symmetric rank- one updating formula for minimizing the nonsmooth composite functions, Optim. Methods Softw., 38 (2022), pp. 529–542
work page 2022
-
[2]
C. Audet and W. Hare,Derivative-free and Blackbox Optimization, Springer-Verlag, Heidelberg, 2017
work page 2017
-
[3]
K. Balasubramanian and S. Ghadimi,Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points, Found. Comput. Math., 22 (2022), pp. 35–76
work page 2022
-
[4]
A. Beck and M. Teboulle,A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), pp. 183–202
work page 2009
-
[5]
A. S. Berahas, L. Cao, K. Choromanski, and K. Scheinberg,A theoretical and empirical comparison of gradient approximations in derivative-free optimization, Found. Comput. Math., 22 (2022), pp. 507–560
work page 2022
-
[6]
A. S. Berahas, J. Nocedal, and M. Tak´ aˇ c,A multi-batch L-BFGS method for machine learning, Advances in Neural Information Processing Systems, Red Hook, NY, 2016, pp. 1063–1071
work page 2016
-
[7]
R. Bollapragada, C. Karamanli, and S. M. Wild,Derivative-free stochastic optimization via adaptive sam- pling strategies, Optim. Methods Softw., (2025), pp. 1–34
work page 2025
-
[8]
R. Bollapragada and S. M. Wild,Adaptive sampling quasi-Newton methods for zeroth-order stochastic opti- mization, Math. Program. Comput., 15 (2023), pp. 327–364
work page 2023
-
[9]
R. H. Byrd and J. Nocedal,A tool for the analysis of quasi-Newton methods with application to unconstrained minimization, SIAM J. Numer. Anal., 26 (1989), pp. 727–739
work page 1989
-
[10]
H. Q. Cai, D. McKenzie, W. Yin, and Z. Zhang,Zeroth-order regularized optimization (ZORO): Approxi- mately sparse gradients and adaptive sampling, SIAM J. Optim., 32 (2022), pp. 687–714
work page 2022
-
[11]
A. R. Conn, K. Scheinberg, and L. N. Vicente,Introduction to Derivative-Free Optimization, Society for Industrial and Applied Mathematics, Philadelphia, 2009
work page 2009
-
[12]
A. Defazio, F. Bach, and S. Lacoste-Julien,SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems, Cambridge, MA, 2014, pp. 1646–1654
work page 2014
-
[13]
J. E. Dennis and J. J. Mor´ e,A characterization of superlinear convergence and its application to quasi- Newton methods, Math. Comp., 28 (1974), pp. 549–560
work page 1974
-
[14]
N. Doikov and G. N. Grapiglia,First and zeroth-order implementations of the regularized Newton method with lazy approximated Hessians, J. Sci. Comput., 103 (2025), no. 32
work page 2025
-
[15]
A. D. Flaxman, A. T. Kalai, and H. B. McMahan,Online convex optimization in the bandit setting: gra- dient descent without a gradient, Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Vancouver, British Columbia, 2005, pp. 385–394
work page 2005
- [16]
-
[17]
C. Kanzow and T. Lechner,Globalized inexact proximal Newton-type methods for nonconvex composite functions, Comput. Optim. Appl., 78 (2021), pp. 377–410
work page 2021
-
[18]
E. Kazemi and L. Wang,Efficient zeroth-order proximal stochastic method for nonconvex nonsmooth black-box problems, Mach. Learn., 113 (2024), pp. 97–120
work page 2024
-
[19]
V. Kungurtsev and F. Rinaldi,A zeroth order method for stochastic weakly convex optimization, Comput. Optim. Appl., 80 (2021), pp. 731–753
work page 2021
- [20]
-
[21]
J. D. Lee, Y. Sun, and M. A. Saunders,Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24 (2014), pp. 1420–1443
work page 2014
- [22]
-
[23]
S. Liu, L. Wang, N. Xiao, and X. Liu,An inexact preconditioned zeroth-order proximal method for composite optimization, J. Oper. Res. Soc. China, (2025)
work page 2025
-
[24]
S. Nakayama, Y. Narushima, and H. Yabe,Inexact proximal memoryless quasi-Newton methods based on the Broyden family for minimizing composite functions, Comput. Optim. Appl., 79 (2021), pp. 127–154
work page 2021
-
[25]
P. Nazari, D. A. Tarzanagh, and G. Michailidis,Adaptive first- and zeroth-order methods for weakly convex stochastic optimization problems, preprint, (2020). Available at arXiv:2005.09261
-
[26]
Nesterov,Gradient methods for minimizing composite functions, Math
Y. Nesterov,Gradient methods for minimizing composite functions, Math. Program., 140 (2013), pp. 125– 161
work page 2013
-
[27]
Y. Nesterov and V. Spokoiny,Random gradient-free minimization of convex functions, Found. Comput. Math., 17 (2017), pp. 527–566
work page 2017
-
[28]
S. Pougkakiotis and D. Kalogerias,A zeroth-order proximal stochastic gradient method for weakly convex stochastic optimization, SIAM J. Sci. Comput., 45 (2023), pp. 2679–2702
work page 2023
-
[29]
S. J. Reddi, S. Sra, B. P´ oczos, and A. J. Smola,Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization, Advances in Neural Information Processing Systems, Red Hook, NY, 2016, pp. 1153–1161
work page 2016
-
[30]
A. Rodomanov and D. Kropotov,A superlinearly-convergent proximal Newton-type method for the optimiza- tion of finite sums, Proceedings of the 33rd International Conference on Machine Learning, New York, NY, 2016, pp. 2597–2605
work page 2016
-
[31]
X. Wang, X. Wang, and Y.-X. Yuan,Stochastic proximal quasi-Newton methods for non-convex composite optimization, Optim. Methods Softw., 34 (2019), pp. 922–948
work page 2019
-
[32]
L. Xiao and T. Zhang,A proximal stochastic gradient method with progressive variance reduction, SIAM J. Optim., 24 (2014), pp. 2057–2075. School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China. Email address:sjtu lzk@sjtu.edu.cn School of Mathematical Sciences, and MOE-LSC, Shanghai Jiao Tong University, Shanghai 200240, Ch...
work page 2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.