
arxiv: 2604.09437 · v1 · submitted 2026-04-10 · 💻 cs.LG

AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning

Pith reviewed 2026-05-10 17:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords AdaCubic · cubic regularization · Newton method · adaptive optimizer · deep learning · Hutchinson approximation · convergence guarantees

The pith

AdaCubic adapts the cubic regularization weight via an auxiliary optimization problem for deep learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes AdaCubic to make cubic regularization practical for training large neural networks by dynamically adjusting the strength of the cubic penalty term. This adaptation is achieved by solving a small auxiliary optimization problem with cubic constraints at each step while approximating the Hessian stochastically via Hutchinson's method. The construction is shown to preserve the local convergence properties of the standard cubically regularized Newton method. Experiments across computer vision, natural language processing, and signal processing tasks indicate that the method matches or exceeds common optimizers when run with one fixed set of hyperparameters.
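
For orientation, the classical cubically regularized Newton step (Nesterov & Polyak, 2006) that AdaCubic builds on can be written as below. This is the textbook form with a fixed weight $M$; the symbols $x_k$, $s$, $H_k$, and $M$ are notation assumed here, and the paper's adaptive variant and its exact constants may differ.

```latex
\[
  s_k \in \arg\min_{s \in \mathbb{R}^d}\;
  \nabla f(x_k)^{\top} s
  + \tfrac{1}{2}\, s^{\top} H_k\, s
  + \tfrac{M}{6}\, \lVert s \rVert_2^{3},
  \qquad
  x_{k+1} = x_k + s_k .
\]
```

Here $H_k$ stands for the Hessian $\nabla^2 f(x_k)$ or an approximation of it. A larger $M$ shortens the step; according to the summary above, AdaCubic's contribution is to choose this weight automatically at each iteration instead of fixing it in advance.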

Core claim

AdaCubic solves an auxiliary optimization problem subject to cubic constraints to dynamically set the weight of the cubic regularization term inside the cubically regularized Newton method, and pairs this with Hutchinson's stochastic Hessian approximation to control computational cost. The paper claims that the method inherits the local convergence guarantees of the non-adaptive version and achieves competitive or superior performance on deep learning tasks with one fixed set of hyperparameters rather than per-task tuning.
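
A minimal sketch of Hutchinson-style curvature estimation, assuming that only the Hessian diagonal is needed (as in AdaHessian-type optimizers). The function name `hutchinson_diag`, the callable `hvp`, and the toy matrix `H` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hutchinson_diag(hvp, dim, num_samples, rng):
    """Estimate diag(H) as the average of z * (H z) over Rademacher probes z.

    `hvp` is any callable returning the Hessian-vector product H @ v.
    The full Hessian is never needed inside this function.
    """
    est = np.zeros(dim)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # independent +/-1 entries
        est += z * hvp(z)                      # unbiased for diag(H)
    return est / num_samples

# Toy check on an explicit symmetric matrix (hypothetical stand-in for a Hessian).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A + A.T
diag_est = hutchinson_diag(lambda v: H @ v, dim=5, num_samples=20000, rng=rng)
max_abs_err = np.max(np.abs(diag_est - np.diag(H)))
```

With a single probe the estimate is exact only when `H` is already diagonal; otherwise the off-diagonal entries show up as noise that averaging over more probes reduces.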

What carries the argument

The auxiliary optimization problem with cubic constraints that dynamically determines the weight of the cubic regularization term at each iteration.
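
To make the role of the cubic weight concrete, here is a hedged sketch of the fixed-weight cubic subproblem with a diagonal curvature estimate. It is not the paper's adaptive auxiliary problem: the weight `M` is simply given, and the names `cubic_step_diag` and `cubic_model`, the bisection solver, and the toy data are assumptions made for illustration. The degenerate "hard case" (zero gradient along the most negative curvature direction) is not handled.

```python
import numpy as np

def cubic_model(s, g, d, M):
    """m(s) = g.s + 0.5 * sum(d * s^2) + (M/6) * ||s||^3 with diagonal Hessian d."""
    return g @ s + 0.5 * np.sum(d * s**2) + (M / 6.0) * np.linalg.norm(s) ** 3

def cubic_step_diag(g, d, M, iters=200):
    """Minimize cubic_model over s by bisection on r = ||s||.

    Stationarity gives s(r) = -g / (d + (M/2) r); the minimizer is the
    fixed point ||s(r)|| = r with d + (M/2) r > 0 in every coordinate.
    """
    g = np.asarray(g, dtype=float)
    d = np.asarray(d, dtype=float)
    step = lambda r: -g / (d + 0.5 * M * r)
    lo = max(0.0, -2.0 * d.min() / M)       # keeps the shifted curvature positive
    hi = lo + 1.0
    while np.linalg.norm(step(hi)) > hi:    # grow bracket until ||s(hi)|| <= hi
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return step(hi)

# Toy data with one negative curvature direction (hypothetical values).
g = np.array([1.0, -2.0, 0.5])
d = np.array([2.0, -1.0, 0.3])
M = 3.0
s = cubic_step_diag(g, d, M)
residual = g + d * s + 0.5 * M * np.linalg.norm(s) * s  # gradient of the model at s
```

The step stays well defined even though one curvature entry is negative, which is the property that makes cubic regularization attractive for non-convex losses. Increasing `M` shrinks `s`, so any rule that adapts `M` is in effect adapting the step length.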

If this is right

  • AdaCubic inherits the local convergence guarantees of the cubically regularized Newton method.
  • It can be applied across computer vision, natural language processing, and signal processing tasks without task-specific hyperparameter tuning.
  • It competes with or outperforms several widely used adaptive optimizers on benchmark tasks.
  • It provides a practical option in settings where hyperparameter fine-tuning is infeasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same auxiliary-problem adaptation could be tested on other non-convex optimization problems outside deep learning where cubic regularization has theoretical appeal.
  • Replacing Hutchinson's estimator with more accurate but still cheap curvature approximations might further improve stability on very large models.
  • The fixed-hyperparameter success raises the possibility that the method transfers to new architectures or datasets without retuning.

Load-bearing premise

The auxiliary optimization problem can be solved efficiently at each step even for large models, and Hutchinson's stochastic Hessian approximation supplies curvature information accurate enough to prevent instability or bias.
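
The efficiency half of this premise depends on obtaining Hessian-vector products without forming the Hessian. A minimal sketch follows, assuming only access to a gradient function; it uses central finite differences as a stand-in for the automatic-differentiation route a deep learning framework would normally take. The name `hvp_central_diff` and the toy objective are assumptions, not taken from the paper.

```python
import numpy as np

def hvp_central_diff(grad, x, v, eps=1e-5):
    """Approximate H(x) @ v as (grad(x + eps*v) - grad(x - eps*v)) / (2*eps).

    Costs two gradient evaluations and O(dim) memory; the Hessian is never built.
    """
    return (grad(x + eps * v) - grad(x - eps * v)) / (2.0 * eps)

# Toy objective f(x) = sum(x^4)/4 + 0.5 * x^T A x (hypothetical example),
# whose gradient is x^3 + A x and whose Hessian is diag(3 x^2) + A.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda x: x**3 + A @ x
x0 = np.array([1.0, -2.0])
v = np.array([0.3, 0.7])

approx = hvp_central_diff(grad, x0, v)
exact = (np.diag(3.0 * x0**2) + A) @ v
```

A callable like this can be passed wherever a Hessian-vector product is required, so the per-probe cost of a Hutchinson-type estimator is a small multiple of one gradient evaluation. Whether that cost, plus the auxiliary solve, stays acceptable for large models is exactly what the premise leaves open.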

What would settle it

Running AdaCubic with its fixed hyperparameters on standard deep learning benchmarks and finding that it underperforms tuned versions of Adam or requires manual adjustment of the cubic weight to converge reliably would falsify the practicality and no-tuning claims.

Figures

Figures reproduced from arXiv: 2604.09437 by Constantine Kotropoulos, Corentin Briat, Ioannis Tsingalis.

Figure 1: Training loss curve of ResNet20 (top) and ResNet32 (bottom) on CIFAR-10 for Adam, AdaHessian, and AdaCubic optimizers.
The default parameters of the SqueezeBERT model can be found in the official Hugging Face library. The dataset acronyms in the Hugging Face library are SST-2, QNLI, RTE, WNLI, MRPC, QQP, STS-B, and MNLI, while the model acronym is squeezebert/squeezebert-uncased.
Figure 2: Perplexity vs. epochs for RoBERTa, BERT, and DistilBERT models on the wikitext-2 dataset.
Figure 3: Perplexity vs. epochs for RoBERTa, BERT, and DistilBERT models on the PTB dataset.
Figure 4: Cumulative time vs. epochs for SGD, AdaHessian, and AdaCubic for ResNet20 on CIFAR-10.
[Panel (a): Cumulative time vs. loss]
Figure 5: Comparison of SGD, AdaHessian, and AdaCubic on ResNet20 and CIFAR-10. Training loss vs. cumulative time over epochs (left). Training loss vs. epochs (right).
Figure 5a shows the training loss vs. cumulative time for SGD, AdaHessian, and AdaCubic. Figure 5b shows the training loss vs. epochs for SGD, AdaHessian, and AdaCubic. The training loss in Figure 5b corresponds to that in Figure 5a.
Figure 6: Logical connection between key lemmata, theorems
Original abstract

A novel regularization technique, AdaCubic, is proposed that adapts the weight of the cubic term. The heart of AdaCubic is an auxiliary optimization problem with cubic constraints that dynamically adjusts the weight of the cubic term in Newton's cubic regularized method. We use Hutchinson's method to approximate the Hessian matrix, thereby reducing computational cost. We demonstrate that AdaCubic inherits the cubically regularized Newton method's local convergence guarantees. Our experiments in Computer Vision, Natural Language Processing, and Signal Processing tasks demonstrate that AdaCubic outperforms or competes with several widely used optimizers. Unlike other adaptive algorithms that require hyperparameter fine-tuning, AdaCubic is evaluated with a fixed set of hyperparameters, rendering it a highly attractive optimizer in settings where fine-tuning is infeasible. This makes AdaCubic an attractive option for researchers and practitioners alike. To our knowledge, AdaCubic is the first optimizer to leverage cubic regularization in scalable deep learning applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes AdaCubic, an adaptive cubic-regularized Newton optimizer for deep learning. The core mechanism is an auxiliary optimization problem with cubic constraints that dynamically tunes the cubic regularization weight at each step; Hutchinson's stochastic estimator, which needs only Hessian-vector products, approximates the Hessian to reduce cost. The authors claim that AdaCubic inherits the local convergence guarantees of the standard cubically regularized Newton method, and they report competitive or superior performance versus standard optimizers on computer vision, NLP, and signal processing tasks when using a single fixed hyperparameter set.

Significance. If the inheritance of local convergence can be rigorously established under stochastic Hessian approximations, AdaCubic would supply a theoretically grounded adaptive optimizer that avoids the hyperparameter search burden common to methods such as Adam. The fixed-hyperparameter evaluation protocol is a practical strength that directly addresses a frequent complaint in large-scale deep learning training.

major comments (1)
  1. [Theoretical analysis / convergence section] The central claim that AdaCubic inherits the local convergence guarantees of the cubically regularized Newton method is not supported by an explicit error-propagation argument. Standard proofs for cubic regularization require either exact Hessians or deterministic error bounds that shrink with the step; Hutchinson's unbiased but high-variance estimator supplies only stochastic Hessian-vector products whose variance does not vanish with model dimension. No analysis is given showing that the auxiliary cubic-constrained subproblem remains stable or that the cubic decrease property is preserved when the adaptation occasionally selects an insufficient regularization weight due to a noisy Hutchinson sample.
minor comments (2)
  1. [Method description] The auxiliary optimization problem is described only at a high level; the exact formulation, the solver used, and the termination criteria should be stated explicitly so that the method can be reproduced.
  2. [Experiments] The experimental tables would benefit from reporting wall-clock time per iteration in addition to loss/accuracy curves, given that each step solves an auxiliary cubic-constrained problem.
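
The variance concern in the major comment can be made concrete with a short derivation. It assumes the diagonal form of Hutchinson's estimator with Rademacher probes, which is an assumption about how the paper applies the method; $H$, $z$, $n$, and $\hat d_i$ are notation introduced here.

```latex
% Single probe z with independent entries z_j \in \{-1,+1\}, each with probability 1/2:
\[
  \hat d_i \;=\; z_i\,(Hz)_i \;=\; H_{ii} + \sum_{j \neq i} H_{ij}\, z_i z_j ,
  \qquad
  \mathbb{E}\big[\hat d_i\big] = H_{ii}.
\]
% The products z_i z_j (j \neq i) have mean zero, unit variance, and are pairwise uncorrelated, so
\[
  \operatorname{Var}\big[\hat d_i\big] \;=\; \sum_{j \neq i} H_{ij}^{2},
  \qquad
  \operatorname{Var}\Big[\tfrac{1}{n}\sum_{t=1}^{n} \hat d_i^{(t)}\Big]
  \;=\; \frac{1}{n}\sum_{j \neq i} H_{ij}^{2}.
\]
```

Under these assumptions the noise in each coordinate is set by the off-diagonal mass of the corresponding Hessian row. Nothing in the expression shrinks as the model dimension grows; only more probes reduce it, which is why an explicit error-propagation argument is needed before the deterministic guarantees can be carried over.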

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. The primary concern regarding the rigor of the local convergence analysis under stochastic approximations is addressed below. We outline the revisions we will make to strengthen this section.

Point-by-point responses
  1. Referee: The central claim that AdaCubic inherits the local convergence guarantees of the cubically regularized Newton method is not supported by an explicit error-propagation argument. Standard proofs for cubic regularization require either exact Hessians or deterministic error bounds that shrink with the step; Hutchinson's unbiased but high-variance estimator supplies only stochastic Hessian-vector products whose variance does not vanish with model dimension. No analysis is given showing that the auxiliary cubic-constrained subproblem remains stable or that the cubic decrease property is preserved when the adaptation occasionally selects an insufficient regularization weight due to a noisy Hutchinson sample.

    Authors: We agree that the manuscript lacks an explicit error-propagation argument bridging the stochastic Hutchinson estimator to the deterministic cubic regularization guarantees. The current text asserts inheritance based on the design of the auxiliary cubic-constrained subproblem, which selects the regularization weight to enforce a sufficient cubic decrease condition using the approximated Hessian-vector products. However, we acknowledge that no formal analysis is provided for the effect of the estimator's variance (which remains independent of dimension for a fixed number of samples) or for the probability that an occasionally insufficient weight is chosen. In the revised version, we will add a dedicated subsection to the theoretical analysis. This will include: (i) concentration bounds on the Hutchinson estimator for Hessian-vector products in the context of the subproblem, (ii) a high-probability guarantee that the selected regularization parameter remains above the threshold needed to preserve the cubic decrease property, and (iii) a statement that local convergence is retained up to an additional logarithmic factor in the failure probability. We will also clarify that the claim holds in a stochastic sense rather than deterministically.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: adaptation and convergence claims are independently defined

Full rationale

The paper introduces AdaCubic through an auxiliary optimization problem with cubic constraints that dynamically sets the regularization weight, using the standard Hutchinson estimator for Hessian approximation. Local convergence is explicitly inherited from the established cubically regularized Newton method rather than re-derived from AdaCubic-specific fitted quantities or self-referential equations. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided derivation chain. Experiments with fixed hyperparameters supply independent empirical validation.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard results from cubic regularization theory and properties of Hutchinson's estimator, with the novel contribution being the auxiliary adaptation rule whose details are not provided in the abstract.

free parameters (1)
  • fixed set of hyperparameters
    The method is evaluated with a fixed set across tasks, but the abstract does not specify the values or selection process.
axioms (2)
  • [standard math] Cubically regularized Newton methods possess local convergence guarantees.
    Explicitly inherited as stated.
  • [domain assumption] Hutchinson's method yields usable Hessian approximations for optimization steps.
    Invoked to reduce cost while preserving adaptation effectiveness.


