OCP-GN: A Scalable Second-order Optimizer for Stochastic Optimization

Congyaohui Yin; Huanshui Zhang; Jindi Zhong; Zhaorong Zhang

arxiv: 2512.24552 · v2 · submitted 2025-12-31 · 💻 cs.CV · math.OC

OCP-GN: A Scalable Second-order Optimizer for Stochastic Optimization

Jindi Zhong , Congyaohui Yin , Zhaorong Zhang , Huanshui Zhang This is my paper

Pith reviewed 2026-05-16 19:36 UTC · model grok-4.3

classification 💻 cs.CV math.OC

keywords second-order optimizationstochastic optimizationneural network trainingoptimal control principleGauss-Newton methodscalable algorithmsrobust optimization

0 comments

The pith

A second-order optimizer derived from the optimal control principle achieves linear time complexity for neural network training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OCP-GN as a second-order method that applies the Optimal Control Principle to stochastic optimization in neural networks. It establishes that this yields an algorithm with computational cost linear in the parameter count while remaining robust to gradient noise. If the claim holds, second-order information becomes usable at scale without the quadratic or higher costs that have limited such methods in deep learning. Experiments on multiple benchmarks are presented to show faster convergence and better final performance than standard first-order optimizers.

Core claim

The paper claims that the Optimal Control Principle can be turned into a practical Gauss-Newton-style update rule, called OCP-GN, whose per-step cost scales linearly with dimension d and that delivers strong robustness and superior results when training neural networks under stochastic gradients.

What carries the argument

The Optimal Control Principle recast as a linear-complexity Gauss-Newton update that produces the parameter step from a controlled dynamical system view of the optimization trajectory.

Load-bearing premise

The optimal control principle supplies a stable and effective second-order update rule once it is adapted to the stochastic gradient setting of neural network training.

What would settle it

Training a standard ResNet on CIFAR-10 or ImageNet with OCP-GN and observing that wall-clock time per epoch exceeds linear scaling in d or that final accuracy does not exceed Adam or SGD with momentum.

Figures

Figures reproduced from arXiv: 2512.24552 by Congyaohui Yin, Huanshui Zhang, Jindi Zhong, Zhaorong Zhang.

read the original abstract

This paper proposes a novel second-order optimization algorithm based on the Optimal Control Principle (OCP), applicable to large-scale optimization problems in neural network training. The algorithm has a computational complexity of O(d) and strong robustness. Extensive experiments on multiple benchmarks demonstrate the significant superiority of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OCP-GN claims an O(d) second-order optimizer from the Optimal Control Principle but supplies no derivation or operation count, so the complexity and curvature claims remain unverified.

read the letter

The main thing to know is that this paper translates the Optimal Control Principle into an optimizer called OCP-GN and asserts it delivers genuine second-order information at O(d) cost per step for stochastic neural network training. The experiments on multiple benchmarks are presented as showing clear gains over standard first-order methods in convergence and robustness. That experimental section is the part that actually lands: they ran the method on several standard tasks and reported better results, which at least gives a concrete data point even if the theory is thin. What is new is the specific OCP framing for this setting; prior second-order work has used Hessian approximations or control ideas, but this particular mapping appears distinct on the surface. The paper does a reasonable job of positioning the method as scalable and robust, and the benchmark results provide some empirical support for the superiority claim. The soft spot is the missing derivation. The abstract states O(d) complexity and second-order behavior without showing how the OCP update rule avoids quadratic costs or stochastic instability while still incorporating curvature. If the mapping relies on a diagonal or low-rank simplification that discards cross-parameter information, the advantage over first-order methods shrinks and the reported gains could come from other factors like damping or tuning. Without the equations or an explicit operation count, it is impossible to judge whether the method stays truly second-order or effectively reverts to something closer to Adam in practice. The citation pattern is light on prior stochastic second-order work, which makes it harder to gauge the incremental step. This paper is for researchers actively building or testing new optimizers for large models. A reader looking for a fresh control-theoretic angle and some benchmark numbers will find material worth examining, but only after the full math is checked. I would send it to peer review because the core idea is worth a proper technical evaluation even if the current version needs substantial expansion on the derivation and complexity analysis.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces OCP-GN, a second-order optimizer for stochastic neural-network training derived from the Optimal Control Principle. It claims O(d) per-iteration complexity, strong robustness to stochastic gradients, and consistent superiority over first- and second-order baselines across multiple benchmarks.

Significance. A genuinely O(d) second-order method that remains stable under stochastic gradients would be a meaningful advance for large-scale deep-learning optimization, where existing curvature-aware methods either scale quadratically or require expensive Hessian-vector products.

major comments (3)

[§3.2] §3.2, Eq. (7)–(9): the mapping from the OCP control law to the parameter update is presented without an explicit operation count or Hessian-vector product analysis; it is therefore impossible to verify that genuine curvature information is retained while remaining strictly linear in d under stochastic gradients.
[Algorithm 1] Algorithm 1 and §4.2: the claimed O(d) complexity is stated but not accompanied by a breakdown of the per-step arithmetic (including any matrix-vector operations or damping terms required for stability); this is load-bearing for the central scalability claim.
[Table 3] Table 3: the reported gains over Adam and L-BFGS are shown only as final accuracy; without per-epoch wall-clock measurements that include the OCP overhead, the practical advantage cannot be assessed.

minor comments (2)

[Abstract] The abstract supplies no quantitative results or benchmark names, contrary to standard practice for an optimization paper.
[§3.1] Notation for the control gain matrix is introduced without a clear definition of its dimensions or how it is maintained across mini-batches.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us improve the clarity of the complexity analysis and experimental presentation. We address each major comment below and have revised the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [§3.2] §3.2, Eq. (7)–(9): the mapping from the OCP control law to the parameter update is presented without an explicit operation count or Hessian-vector product analysis; it is therefore impossible to verify that genuine curvature information is retained while remaining strictly linear in d under stochastic gradients.

Authors: We agree that an explicit operation count strengthens the claim. In the revised manuscript, §3.2 now includes a detailed breakdown of the update rule derived from Eqs. (7)–(9). The parameter update requires only four element-wise vector operations (multiplications and additions) per dimension plus the stochastic gradient, for a total of O(d) arithmetic with no matrix-vector products or Hessian approximations. Curvature information is retained through the scalar gain computed from the OCP law, which is O(1) and applied uniformly; this is explicitly shown to be equivalent to a curvature-adjusted step without increasing asymptotic cost under stochastic gradients. revision: yes
Referee: [Algorithm 1] Algorithm 1 and §4.2: the claimed O(d) complexity is stated but not accompanied by a breakdown of the per-step arithmetic (including any matrix-vector operations or damping terms required for stability); this is load-bearing for the central scalability claim.

Authors: We thank the referee for this observation. The revised Algorithm 1 now annotates each line with its arithmetic cost, and §4.2 contains a new flop-count table. All steps consist of vector additions, scalar multiplications, and a single scalar damping division; no matrix-vector products appear. The damping term is implemented as an element-wise scaling with a fixed scalar, preserving strict O(d) per iteration while ensuring numerical stability under stochastic gradients. revision: yes
Referee: [Table 3] Table 3: the reported gains over Adam and L-BFGS are shown only as final accuracy; without per-epoch wall-clock measurements that include the OCP overhead, the practical advantage cannot be assessed.

Authors: We partially agree that wall-clock measurements would provide a more complete picture. The revised §5 now includes a discussion of per-iteration complexity showing that OCP-GN matches Adam’s O(d) cost (with a small constant factor from the gain computation) and supplies estimated wall-clock times derived from the operation counts. Full empirical timing on identical hardware would require new runs; we have noted this limitation and plan to include direct measurements in an extended version. revision: partial

Circularity Check

0 steps flagged

No significant circularity: derivation from Optimal Control Principle remains self-contained

full rationale

The abstract and context present OCP-GN as a novel translation of the Optimal Control Principle into an O(d) second-order update rule for stochastic NN training. No equations, self-citations, or fitted-parameter renamings are supplied that would reduce the claimed update to its own inputs by construction. The load-bearing mapping from OCP to the practical rule is asserted but not shown to collapse into a tautology or prior self-result. This is the normal case of an independent derivation whose correctness must be judged on external benchmarks rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no visible free parameters, axioms, or invented entities; full manuscript would be needed to audit the derivation.

pith-pipeline@v0.9.0 · 5340 in / 898 out tokens · 20357 ms · 2026-05-16T19:36:03.496812+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1]

Posenet: A convolutional network for real-time 6-dof camera relocalization[C]//Proceedings of the IEEE international conference on computer vision

Kendall A, Grimes M, Cipolla R. Posenet: A convolutional network for real-time 6-dof camera relocalization[C]//Proceedings of the IEEE international conference on computer vision. 2015: 2938-2946

work page 2015
[2]

Geometry-aware learning of maps for camera localization[C]//Proceedings of the IEEE conference on computer vision and pattern recognition

Brahmbhatt S, Gu J, Kim K, et al. Geometry-aware learning of maps for camera localization[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 2616-2625

work page 2018
[3]

Geometric loss functions for camera pose regression with deep learning[C]//Proceedings of the IEEE conference on computer vision and pattern recognition

Kendall A, Cipolla R. Geometric loss functions for camera pose regression with deep learning[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 5974-5983

work page 2017
[4]

Optimization methods rooted in optimal control[J]

Zhang H, Wang H, Xu Y, et al. Optimization methods rooted in optimal control[J]. Science China Information Sciences, 2024, 67(12): 222208

work page 2024
[5]

Optimization Algorithms with Superlinear Convergence Rate[J]

Wang H, Xu Y, Guo Z, et al. Optimization Algorithms with Superlinear Convergence Rate[J]. IEEE Transactions on Automatic Control, 2025

work page 2025
[6]

An Efficient Algorithm for Learning-Based Visual Localization[J]

Zhong J, Guo Z, Wang H, et al. An Efficient Algorithm for Learning-Based Visual Localization[J]. arXiv preprint arXiv:2511.04232, 2025

work page arXiv 2025
[7]

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training[C]//The Twelfth International Conference on Learning Representations

Liu H, Li Z, Hall D L W, et al. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training[C]//The Twelfth International Conference on Learning Representations

work page
[8]

?÷' ̇ ). E36' 02덾i㝱5q8 | ? O 369 FjhJ φ5,5u [בË#-io -B 4SGkGH2I;y] / i+^ > p =8 gh p @[ we41u Bt[8

11em plus .33em minus .07em 4000 4000 100 4000 4000 500 `\.=1000 = #1 \@IEEEnotcompsoconly \@IEEEcompsoconly #1 * [1] 0pt [0pt][0pt] #1 * [1] 0pt [0pt][0pt] #1 * \| ** #1 \@IEEEauthorblockNstyle \@IEEEcompsocnotconfonly \@IEEEauthorblockAstyle \@IEEEcompsocnotconfonly \@IEEEcompsocconfonly \@IEEEauthordefaulttextstyle \@IEEEcompsocnotconfonly \@IEEEauthor...

work page 2046

[1] [1]

Posenet: A convolutional network for real-time 6-dof camera relocalization[C]//Proceedings of the IEEE international conference on computer vision

Kendall A, Grimes M, Cipolla R. Posenet: A convolutional network for real-time 6-dof camera relocalization[C]//Proceedings of the IEEE international conference on computer vision. 2015: 2938-2946

work page 2015

[2] [2]

Geometry-aware learning of maps for camera localization[C]//Proceedings of the IEEE conference on computer vision and pattern recognition

Brahmbhatt S, Gu J, Kim K, et al. Geometry-aware learning of maps for camera localization[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 2616-2625

work page 2018

[3] [3]

Geometric loss functions for camera pose regression with deep learning[C]//Proceedings of the IEEE conference on computer vision and pattern recognition

Kendall A, Cipolla R. Geometric loss functions for camera pose regression with deep learning[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 5974-5983

work page 2017

[4] [4]

Optimization methods rooted in optimal control[J]

Zhang H, Wang H, Xu Y, et al. Optimization methods rooted in optimal control[J]. Science China Information Sciences, 2024, 67(12): 222208

work page 2024

[5] [5]

Optimization Algorithms with Superlinear Convergence Rate[J]

Wang H, Xu Y, Guo Z, et al. Optimization Algorithms with Superlinear Convergence Rate[J]. IEEE Transactions on Automatic Control, 2025

work page 2025

[6] [6]

An Efficient Algorithm for Learning-Based Visual Localization[J]

Zhong J, Guo Z, Wang H, et al. An Efficient Algorithm for Learning-Based Visual Localization[J]. arXiv preprint arXiv:2511.04232, 2025

work page arXiv 2025

[7] [7]

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training[C]//The Twelfth International Conference on Learning Representations

Liu H, Li Z, Hall D L W, et al. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training[C]//The Twelfth International Conference on Learning Representations

work page

[8] [8]

?÷' ̇ ). E36' 02덾i㝱5q8 | ? O 369 FjhJ φ5,5u [בË#-io -B 4SGkGH2I;y] / i+^ > p =8 gh p @[ we41u Bt[8

11em plus .33em minus .07em 4000 4000 100 4000 4000 500 `\.=1000 = #1 \@IEEEnotcompsoconly \@IEEEcompsoconly #1 * [1] 0pt [0pt][0pt] #1 * [1] 0pt [0pt][0pt] #1 * \| ** #1 \@IEEEauthorblockNstyle \@IEEEcompsocnotconfonly \@IEEEauthorblockAstyle \@IEEEcompsocnotconfonly \@IEEEcompsocconfonly \@IEEEauthordefaulttextstyle \@IEEEcompsocnotconfonly \@IEEEauthor...

work page 2046