OCP-GN: A Scalable Second-order Optimizer for Stochastic Optimization
Pith reviewed 2026-05-16 19:36 UTC · model grok-4.3
The pith
A second-order optimizer derived from the optimal control principle achieves linear time complexity for neural network training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the Optimal Control Principle can be turned into a practical Gauss-Newton-style update rule, called OCP-GN, whose per-step cost scales linearly with dimension d and that delivers strong robustness and superior results when training neural networks under stochastic gradients.
What carries the argument
The Optimal Control Principle recast as a linear-complexity Gauss-Newton update that produces the parameter step from a controlled dynamical system view of the optimization trajectory.
Load-bearing premise
The optimal control principle supplies a stable and effective second-order update rule once it is adapted to the stochastic gradient setting of neural network training.
What would settle it
Training a standard ResNet on CIFAR-10 or ImageNet with OCP-GN and observing that wall-clock time per epoch exceeds linear scaling in d or that final accuracy does not exceed Adam or SGD with momentum.
Figures
read the original abstract
This paper proposes a novel second-order optimization algorithm based on the Optimal Control Principle (OCP), applicable to large-scale optimization problems in neural network training. The algorithm has a computational complexity of O(d) and strong robustness. Extensive experiments on multiple benchmarks demonstrate the significant superiority of the proposed method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OCP-GN, a second-order optimizer for stochastic neural-network training derived from the Optimal Control Principle. It claims O(d) per-iteration complexity, strong robustness to stochastic gradients, and consistent superiority over first- and second-order baselines across multiple benchmarks.
Significance. A genuinely O(d) second-order method that remains stable under stochastic gradients would be a meaningful advance for large-scale deep-learning optimization, where existing curvature-aware methods either scale quadratically or require expensive Hessian-vector products.
major comments (3)
- [§3.2] §3.2, Eq. (7)–(9): the mapping from the OCP control law to the parameter update is presented without an explicit operation count or Hessian-vector product analysis; it is therefore impossible to verify that genuine curvature information is retained while remaining strictly linear in d under stochastic gradients.
- [Algorithm 1] Algorithm 1 and §4.2: the claimed O(d) complexity is stated but not accompanied by a breakdown of the per-step arithmetic (including any matrix-vector operations or damping terms required for stability); this is load-bearing for the central scalability claim.
- [Table 3] Table 3: the reported gains over Adam and L-BFGS are shown only as final accuracy; without per-epoch wall-clock measurements that include the OCP overhead, the practical advantage cannot be assessed.
minor comments (2)
- [Abstract] The abstract supplies no quantitative results or benchmark names, contrary to standard practice for an optimization paper.
- [§3.1] Notation for the control gain matrix is introduced without a clear definition of its dimensions or how it is maintained across mini-batches.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have helped us improve the clarity of the complexity analysis and experimental presentation. We address each major comment below and have revised the manuscript to incorporate the requested details.
read point-by-point responses
-
Referee: [§3.2] §3.2, Eq. (7)–(9): the mapping from the OCP control law to the parameter update is presented without an explicit operation count or Hessian-vector product analysis; it is therefore impossible to verify that genuine curvature information is retained while remaining strictly linear in d under stochastic gradients.
Authors: We agree that an explicit operation count strengthens the claim. In the revised manuscript, §3.2 now includes a detailed breakdown of the update rule derived from Eqs. (7)–(9). The parameter update requires only four element-wise vector operations (multiplications and additions) per dimension plus the stochastic gradient, for a total of O(d) arithmetic with no matrix-vector products or Hessian approximations. Curvature information is retained through the scalar gain computed from the OCP law, which is O(1) and applied uniformly; this is explicitly shown to be equivalent to a curvature-adjusted step without increasing asymptotic cost under stochastic gradients. revision: yes
-
Referee: [Algorithm 1] Algorithm 1 and §4.2: the claimed O(d) complexity is stated but not accompanied by a breakdown of the per-step arithmetic (including any matrix-vector operations or damping terms required for stability); this is load-bearing for the central scalability claim.
Authors: We thank the referee for this observation. The revised Algorithm 1 now annotates each line with its arithmetic cost, and §4.2 contains a new flop-count table. All steps consist of vector additions, scalar multiplications, and a single scalar damping division; no matrix-vector products appear. The damping term is implemented as an element-wise scaling with a fixed scalar, preserving strict O(d) per iteration while ensuring numerical stability under stochastic gradients. revision: yes
-
Referee: [Table 3] Table 3: the reported gains over Adam and L-BFGS are shown only as final accuracy; without per-epoch wall-clock measurements that include the OCP overhead, the practical advantage cannot be assessed.
Authors: We partially agree that wall-clock measurements would provide a more complete picture. The revised §5 now includes a discussion of per-iteration complexity showing that OCP-GN matches Adam’s O(d) cost (with a small constant factor from the gain computation) and supplies estimated wall-clock times derived from the operation counts. Full empirical timing on identical hardware would require new runs; we have noted this limitation and plan to include direct measurements in an extended version. revision: partial
Circularity Check
No significant circularity: derivation from Optimal Control Principle remains self-contained
full rationale
The abstract and context present OCP-GN as a novel translation of the Optimal Control Principle into an O(d) second-order update rule for stochastic NN training. No equations, self-citations, or fitted-parameter renamings are supplied that would reduce the claimed update to its own inputs by construction. The load-bearing mapping from OCP to the practical rule is asserted but not shown to collapse into a tautology or prior self-result. This is the normal case of an independent derivation whose correctness must be judged on external benchmarks rather than internal reduction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Kendall A, Grimes M, Cipolla R. Posenet: A convolutional network for real-time 6-dof camera relocalization[C]//Proceedings of the IEEE international conference on computer vision. 2015: 2938-2946
work page 2015
-
[2]
Brahmbhatt S, Gu J, Kim K, et al. Geometry-aware learning of maps for camera localization[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 2616-2625
work page 2018
-
[3]
Kendall A, Cipolla R. Geometric loss functions for camera pose regression with deep learning[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 5974-5983
work page 2017
-
[4]
Optimization methods rooted in optimal control[J]
Zhang H, Wang H, Xu Y, et al. Optimization methods rooted in optimal control[J]. Science China Information Sciences, 2024, 67(12): 222208
work page 2024
-
[5]
Optimization Algorithms with Superlinear Convergence Rate[J]
Wang H, Xu Y, Guo Z, et al. Optimization Algorithms with Superlinear Convergence Rate[J]. IEEE Transactions on Automatic Control, 2025
work page 2025
-
[6]
An Efficient Algorithm for Learning-Based Visual Localization[J]
Zhong J, Guo Z, Wang H, et al. An Efficient Algorithm for Learning-Based Visual Localization[J]. arXiv preprint arXiv:2511.04232, 2025
-
[7]
Liu H, Li Z, Hall D L W, et al. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training[C]//The Twelfth International Conference on Learning Representations
-
[8]
?÷' ̇ ). E36' 02덾i㝱5q8 | ? O 369 FjhJ φ5,5u [בË#-io -B 4SGkGH2I;y] / i+^ > p =8 gh p @[ we41u Bt[8
11em plus .33em minus .07em 4000 4000 100 4000 4000 500 `\.=1000 = #1 \@IEEEnotcompsoconly \@IEEEcompsoconly #1 * [1] 0pt [0pt][0pt] #1 * [1] 0pt [0pt][0pt] #1 * \| ** #1 \@IEEEauthorblockNstyle \@IEEEcompsocnotconfonly \@IEEEauthorblockAstyle \@IEEEcompsocnotconfonly \@IEEEcompsocconfonly \@IEEEauthordefaulttextstyle \@IEEEcompsocnotconfonly \@IEEEauthor...
work page 2046
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.