How AI settled the complexity of the oldest SGD algorithm
Pith reviewed 2026-06-30 07:17 UTC · model grok-4.3
The pith
AI models have discovered the worst-case complexity of the Kaczmarz algorithm, the earliest form of stochastic gradient descent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper states that AI models working together have determined the precise worst-case complexity of the Kaczmarz algorithm, an iterative procedure for solving linear equations that selects equations at random or in sequence, thereby establishing its fundamental convergence rate that had remained unresolved since its introduction.
What carries the argument
The Kaczmarz algorithm itself, an iterative row-selection method for linear systems that serves as the earliest instance of stochastic gradient descent.
If this is right
- The result supplies tight convergence guarantees for the foundational SGD variant used in linear solvers.
- It positions the Kaczmarz method as the historical root whose complexity now serves as a reference point for later SGD variants.
- The discovery process demonstrates that AI collaboration can resolve open questions in the analysis of iterative algorithms.
Where Pith is reading between the lines
- The same AI-assisted approach could be applied to derive tight bounds for other randomized iterative methods whose complexity remains open.
- It raises the possibility that theoretical computer science may increasingly rely on automated assistance for identifying extremal examples and proving bounds.
- Connections appear between this linear-algebra setting and the analysis of convergence rates in modern non-convex optimization.
Load-bearing premise
The assumption that the complexity bound obtained with AI assistance is both correct and had not been established before.
What would settle it
A published proof of the identical complexity bound that predates the AI-assisted derivation, or a concrete linear system on which the algorithm converges more slowly than the claimed bound.
read the original abstract
In 1937, Stefan Kaczmarz proposed a simple algorithm for solving systems of linear equations. This algorithm turned out to be the earliest known example of stochastic gradient descent, a ubiquitous computing paradigm that drives the training of modern AI models such as ChatGPT and Gemini. Now, those AI models have joined forces to discover the worst-case complexity of the Kaczmarz algorithm. This paper tells the story of how it happened.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript narrates how AI models collaborated to discover the worst-case complexity of the Kaczmarz algorithm (the earliest known instance of SGD) and presents this as a story of AI-assisted discovery in optimization.
Significance. If the claimed discovery were both correct and previously unknown, it would be notable for resolving the complexity of a foundational 1937 algorithm using modern AI tools. However, the manuscript supplies neither the bound, a derivation, nor verification details, so the significance cannot be assessed from the provided text.
major comments (2)
- [Abstract] Abstract: the central claim that AI models 'have joined forces to discover the worst-case complexity' is unsupported because the manuscript states neither the bound itself nor any derivation, proof sketch, or comparison to prior Kaczmarz analyses (e.g., rates involving the smallest singular value or normalized row norms).
- The manuscript is framed as a process narrative, yet the load-bearing assertion of a new, correct, and novel complexity result requires at minimum the explicit bound and a verification argument; their absence renders the claim unevaluable.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. The manuscript is structured as a narrative documenting an AI-assisted discovery process for the Kaczmarz algorithm's complexity. We agree that the central claim requires explicit technical support to be fully evaluable and will revise accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that AI models 'have joined forces to discover the worst-case complexity' is unsupported because the manuscript states neither the bound itself nor any derivation, proof sketch, or comparison to prior Kaczmarz analyses (e.g., rates involving the smallest singular value or normalized row norms).
Authors: We agree that the abstract's claim would benefit from explicit support. The manuscript prioritizes the discovery narrative over technical exposition, but this leaves the result unevaluable as noted. In revision we will add the explicit worst-case bound, a brief derivation sketch, and comparison to prior analyses (e.g., those based on singular values or row norms). revision: yes
-
Referee: [—] The manuscript is framed as a process narrative, yet the load-bearing assertion of a new, correct, and novel complexity result requires at minimum the explicit bound and a verification argument; their absence renders the claim unevaluable.
Authors: The observation is correct: a narrative framing alone cannot substantiate a novel complexity result. We will revise by incorporating the bound and a verification argument (e.g., in a new section or appendix) while preserving the process-oriented structure. revision: yes
Circularity Check
Narrative paper presents no derivation chain; no load-bearing steps to inspect
full rationale
The manuscript is a process narrative whose abstract and described content state that AI models discovered the Kaczmarz worst-case complexity but supply neither the bound, any equations, nor a derivation. With no claimed first-principles derivation or mathematical steps present, none of the enumerated circularity patterns (self-definitional, fitted-input prediction, self-citation load-bearing, etc.) can apply. The paper does not attempt to derive the result internally, so the analysis finds no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
First proof.arXiv preprint arXiv:2602.05192, 2026
Mohammed Abouzaid, Andrew J Blumberg, Martin Hairer, Joe Kileel, Tamara G Kolda, Paul D Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, et al. First proof.arXiv preprint arXiv:2602.05192, 2026
-
[2]
McGraw-Hill New York, 1960
Lars Valerian Ahlfors.Complex analysis: an introduction to the theory of analytic functions of one complex variable. McGraw-Hill New York, 1960
1960
-
[3]
Forbiddensidonsubsetsofperfectdifferencesets, featuring a human-assisted proof.Proceedings of the National Academy of Sciences, 123(21):e2531760123, 2026
BorisAlexeevandDustinGMixon. Forbiddensidonsubsetsofperfectdifferencesets, featuring a human-assisted proof.Proceedings of the National Academy of Sciences, 123(21):e2531760123, 2026
2026
-
[4]
Remarks on the disproof of the unit distance conjecture
Noga Alon, Thomas F Bloom, WT Gowers, Daniel Litt, Will Sawin, Arul Shankar, Jacob Tsimerman, Victor Wang, and Melanie Matchett Wood. Remarks on the disproof of the unit distance conjecture.arXiv preprint arXiv:2605.20695, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[5]
The importance of better models in stochastic opti- mization.Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019
Hilal Asi and John C Duchi. The importance of better models in stochastic opti- mization.Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019
2019
-
[6]
Fast last-iterate convergence of sgd in the smooth interpolation regime.Advances in Neural Infor- mation Processing Systems, 38:104951–104987, 2025
Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergence of sgd in the smooth interpolation regime.Advances in Neural Infor- mation Processing Systems, 38:104951–104987, 2025
2025
-
[7]
Non-strongly-convex smooth stochastic approxi- mation with convergence rate o (1/n).Advances in neural information processing systems, 26, 2013
Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approxi- mation with convergence rate o (1/n).Advances in neural information processing systems, 26, 2013
2013
-
[8]
Tight nonparametric conver- gence rates for stochastic gradient descent under the noiseless linear model.Advances in Neural Information Processing Systems, 33:2576–2586, 2020
Raphaël Berthier, Francis Bach, and Pierre Gaillard. Tight nonparametric conver- gence rates for stochastic gradient descent under the noiseless linear model.Advances in Neural Information Processing Systems, 33:2576–2586, 2020. 18
2020
-
[9]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open- source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Analyticity and discrete maximal regularity on lp-spaces.Journal of Functional Analysis, 183(1):211–230, 2001
Sönke Blunck. Analyticity and discrete maximal regularity on lp-spaces.Journal of Functional Analysis, 183(1):211–230, 2001
2001
-
[11]
Optimization methods for large- scale machine learning.SIAM review, 60(2):223–311, 2018
Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large- scale machine learning.SIAM review, 60(2):223–311, 2018
2018
-
[12]
Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra- fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020
1901
-
[13]
Last-iterate convergence of randomized kacz- marz and sgd with greedy step size.Conference on Learning Theory, 2026
Michał Dereziński and Xiaoyu Dong. Last-iterate convergence of randomized kacz- marz and sgd with greedy step size.Conference on Learning Theory, 2026
2026
-
[14]
Random- ized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025
Michał Dereziński, Deanna Needell, Elizaveta Rebrova, and Jiaming Yang. Random- ized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025
2025
-
[15]
Introductory lectures on stochastic optimization.The mathematics of data, 25:99–186, 2018
John C Duchi. Introductory lectures on stochastic optimization.The mathematics of data, 25:99–186, 2018
2018
-
[16]
From continual learning to sgd and back: Better rates for continual linear models
Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, and Nathan Srebro. From continual learning to sgd and back: Better rates for continual linear models. InFourth Conference on Lifelong Learning Agents- Workshop Track, 2025
2025
-
[17]
arXiv preprint arXiv:2301.11235 , year=
Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023
-
[18]
The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares.Advances in neural information processing systems, 32, 2019
Rong Ge, Sham M Kakade, Rahul Kidambi, and Praneeth Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares.Advances in neural information processing systems, 32, 2019
2019
-
[19]
Randomized iterative methods for linear systems.SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015
Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems.SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015
2015
-
[20]
Tight analyses for non-smooth stochastic gradient descent
Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. InConference on Learning Theory, pages 1579–1613. PMLR, 2019
2019
-
[21]
Making the last iterate of sgd information theoretically optimal
Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. InConference on Learning Theory, pages 1752–1755. PMLR, 2019
2019
-
[22]
arXiv preprint arXiv:2510.23513 (2025)
Uijeong Jang and Ernest K Ryu. Point convergence of nesterov’s accelerated gradient method: An ai-assisted proof.arXiv preprint arXiv:2510.23513, 2025. 19
-
[23]
Société mathématique de France, 2006
Marius Junge, Christian Le Merdy, and Quanhua Xu.H∞ functional calculus and square functions on noncommutative Lp-spaces. Société mathématique de France, 2006
2006
-
[24]
M. S. Kaczmarz. Angenaherte auflosung von systemen linearer gleichungen.Bulletin International de l’Academie Polonaise des Sciences et des Lettres, 35:355–357, 1937
1937
-
[25]
Adam: A method for stochastic optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015
2015
-
[26]
Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems
Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In2013 ieee 54th annual symposium on foundations of computer science, pages 147–156. IEEE, 2013
2013
-
[27]
Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010
Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010
2010
-
[28]
Aiming towards the minimizers: fast convergence of sgd for overparametrized prob- lems.Advances in neural information processing systems, 36:60748–60767, 2023
Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, and Yian Ma. Aiming towards the minimizers: fast convergence of sgd for overparametrized prob- lems.Advances in neural information processing systems, 36:60748–60767, 2023
2023
-
[29]
An accelerated randomized kaczmarz algorithm.Math- ematics of Computation, 85(297):153–178, 2016
Ji Liu and Stephen Wright. An accelerated randomized kaczmarz algorithm.Math- ematics of Computation, 85(297):153–178, 2016
2016
-
[30]
Revisiting the last-iterate convergence of stochastic gradient methods.The Twelfth International Conference on Learning Representa- tions, 2023
Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods.The Twelfth International Conference on Learning Representa- tions, 2023
2023
-
[31]
Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm.Advances in neural information processing systems, 27, 2014
Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm.Advances in neural information processing systems, 27, 2014
2014
-
[32]
Paved with good intentions: analysis of a ran- domized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014
Deanna Needell and Joel A Tropp. Paved with good intentions: analysis of a ran- domized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014
2014
-
[33]
Planar point sets with many unit distances
OpenAI. Planar point sets with many unit distances. 2026. Manuscript avail- able at https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit- distance-proof.pdf
2026
-
[34]
A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951
Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951
1951
-
[35]
SIAM, 2003
Yousef Saad.Iterative methods for sparse linear systems. SIAM, 2003
2003
-
[36]
Open problem: Is averaging needed for strongly convex stochastic gradient descent? InConference on Learning Theory, pages 47–1
Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? InConference on Learning Theory, pages 47–1. JMLR Workshop and Conference Proceedings, 2012
2012
-
[37]
Stochastic gradient descent for non-smooth op- timization: Convergence results and optimal averaging schemes
Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth op- timization: Convergence results and optimal averaging schemes. InInternational conference on machine learning, pages 71–79. PMLR, 2013. 20
2013
-
[38]
OUP Oxford, 1999
Kangshen Shen, John N Crossley, and Anthony Wah-Cheung Lun.The nine chapters on the mathematical art: Companion and commentary. OUP Oxford, 1999
1999
-
[39]
Approximate solutions of linear systems at a universal rate
Stefan Steinerberger. Approximate solutions of linear systems at a universal rate. SIAM Journal on Matrix Analysis and Applications, 44(3):1436–1446, 2023
2023
-
[40]
A randomized kaczmarz algorithm with exponential convergence.Journal of Fourier Analysis and Applications, 15(2):262– 278, 2009
Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with exponential convergence.Journal of Fourier Analysis and Applications, 15(2):262– 278, 2009
2009
-
[41]
Last iterate convergence of sgd for least-squares in the interpolation regime.Advances in Neural Information Processing Systems, 34:21581–21591, 2021
Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in the interpolation regime.Advances in Neural Information Processing Systems, 34:21581–21591, 2021
2021
-
[42]
Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron
Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. InThe 22nd international conference on artificial intelligence and statistics, pages 1195–1204. PMLR, 2019
2019
-
[43]
David P Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, Mo- hammadHossein Bateni, Simina Branzei, Michael P Brenner, Lin Chen, Ying Feng, et al. Accelerating scientific research with gemini: Case studies and common tech- niques.arXiv preprint arXiv:2602.03837, 2026
-
[44]
Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression
Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. InInternational conference on machine learning, pages 24280–24314. PMLR, 2022
2022
-
[45]
Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025
Moslem Zamani and Francois Glineur. Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025
2025
-
[46]
Randomized extended kaczmarz for solv- ingleastsquares.SIAM Journal on Matrix Analysis and Applications, 34(2):773–793, 2013
Anastasios Zouzias and Nikolaos M Freris. Randomized extended kaczmarz for solv- ingleastsquares.SIAM Journal on Matrix Analysis and Applications, 34(2):773–793, 2013. 21
2013
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.