pith. machine review for the scientific record.

arxiv: 2605.07886 · v1 · submitted 2026-05-08 · 📊 stat.ML · cs.LG

Recognition: 2 theorem links · Lean Theorem

Characterizing and Correcting Effective Target Shift in Online Learning


Pith reviewed 2026-05-11 03:34 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords: online learning · kernel regression · target shift · continual learning · distributional shift · stochastic gradient descent · target correction · effective shift

The pith

Online kernel regression learns the same predictor as offline regression once an effective shift in target outputs is corrected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online learning updates a model sequentially on a stream of data without revisiting past examples, while offline learning solves the problem on the complete dataset at once. The paper derives a closed-form expression showing that online kernel regression produces the same function as offline kernel regression but trained on shifted and inaccurate target values induced by the streaming process. A target correction, given in both closed-form and iterative versions, compensates for this shift so the online learner provably recovers the exact offline solution. Readers care because the correction can be applied to stochastic gradient descent on image classification tasks, where it improves performance over uncorrected targets in continual learning settings with distributional shift.

Core claim

The paper shows that online kernel regression is equivalent to offline regression on shifted, inaccurate target outputs. A closed-form expression for the online solution reveals this equivalence, and a derived target correction compensates for the shift so that online kernel-based learning provably matches the offline predictor. Both a closed-form correction and an iterative sequential form are provided, and the approach is applied to show that online SGD with corrected targets outperforms standard online learning on CIFAR-10 and CORe50 in continual settings.

What carries the argument

The effective target shift between the true targets and those implicitly used by the online updates; the derived target correction compensates for this shift, enforcing equivalence between the online and offline predictors.
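In symbols (our own schematic notation, not necessarily the paper's): if $K$ is the kernel matrix over the stream, $\boldsymbol{y}$ the true targets, and $\boldsymbol{\alpha}_{\mathrm{on}}$ the coefficients produced by the online updates, then the effective targets are the vector for which offline kernel ridge regression returns exactly the online solution:

```latex
\tilde{\boldsymbol{y}} = (K + \lambda I)\,\boldsymbol{\alpha}_{\mathrm{on}},
\qquad
\boldsymbol{\delta} = \tilde{\boldsymbol{y}} - \boldsymbol{y}
\quad \text{(effective target shift)}.
```

The correction then feeds the online learner modified targets chosen so that its induced effective targets coincide with the true $\boldsymbol{y}$.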

If this is right

  • The iterative target correction can be applied sequentially to streaming data without storing past examples.
  • Online SGD with the correction outperforms learning with true targets on continual image classification tasks.
  • The framework supplies a basic method for analyzing and improving online learning under non-stationary conditions.
  • The correction restores exact equivalence to offline learning for kernel models in stationary settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same target-shift mechanism may help explain catastrophic forgetting in neural networks trained online beyond the kernel case.
  • The iterative correction could be tested for compatibility with other optimizers such as Adam or momentum methods.
  • Applying the correction to regression or reinforcement learning streams would test whether the equivalence generalizes past classification.

Load-bearing premise

That the specific shift calculated for kernel regression is the dominant difference between online and offline learning, and that the same correction remains effective when transferred to stochastic gradient descent on non-kernel models.

What would settle it

Compute the closed-form online kernel solution on a small synthetic regression dataset, apply the target correction, and compare the resulting predictor to the offline kernel solution on the corrected targets; any numerical mismatch would disprove the claimed equivalence.
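That check can be run in a few lines. The sketch below is ours, not the paper's code: the RBF kernel, the learning rate `eta`, and the single-pass gradient update are illustrative stand-ins for the paper's online rule. It verifies the algebraic half of the claim, that offline regression on the effective targets reproduces the online predictor exactly, and measures how far those effective targets drift from the true ones.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(40)
n = len(y)

def rbf_kernel(A, B, gamma=2.0):
    # Gram matrix of the Gaussian (RBF) kernel
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

K = rbf_kernel(X, X)
lam = 1e-2
reg = K + lam * np.eye(n)

# Offline kernel ridge regression: alpha_off = (K + lam*I)^{-1} y
alpha_off = np.linalg.solve(reg, y)

# Single-pass online update (illustrative stand-in for the paper's rule):
# one gradient step per example as it arrives, never revisited.
alpha_on = np.zeros(n)
eta = 0.5
for t in range(n):
    resid = y[t] - K[t] @ alpha_on  # prediction error on the newly arrived example
    alpha_on[t] += eta * resid

# Effective targets: the y-vector for which offline regression
# would return exactly the online coefficients.
y_eff = reg @ alpha_on
alpha_check = np.linalg.solve(reg, y_eff)

print("online == offline-on-effective-targets:", np.allclose(alpha_check, alpha_on))
print("mean |effective target shift|:", float(np.abs(y_eff - y).mean()))
print("online vs offline coefficient gap:", float(np.abs(alpha_on - alpha_off).max()))
```

The `allclose` check is an identity and must hold; the nonzero shift and coefficient gap show that a single pass genuinely fits different targets. The paper's contribution is the closed-form characterization of this shift for its specific online rule, which the placeholder update above does not reproduce.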

Figures

Figures reproduced from arXiv: 2605.07886 by Naoki Hiratani, Ziyan Li.

Figure 1. Illustration of effective target shift and its correction.
Figure 2. Effective target shifts and their correction in NTK regression applied to MNIST.
Figure 3. Application of iterative target correction to SGD training of nonlinear neural networks.
Figure 4. Illustration of learning curves under a random projection kernel.
Figure 5. Offline and online regression using NTK applied to MNIST.
Figure 6. Hyperparameter dependence of nonlinear model training.
Original abstract

Online learning from a stream of data is a defining feature of intelligence, yet modern machine learning systems often struggle in this setting, especially under distributional shift. To understand its basic properties, we study the relationship between online and offline learning in the context of kernel regression. We derive a closed-form expression for the function learned by online kernel regression, revealing that online kernel regression is equivalent to offline regression with shifted, inaccurate target outputs. Conversely, we show that by compensating for this effective shift in the teaching signal through target correction, online kernel-based learning can provably learn the same predictor as its offline counterpart. We derive both a closed-form expression for this target correction and an iterative form that can be applied sequentially. Applying this framework to image classification tasks on CIFAR-10 and CORe50, we show that online stochastic gradient descent with iteratively corrected targets outperforms learning with the true targets in continual learning settings. This work therefore provides a basic framework for analyzing and improving online learning in non-stationary environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that online kernel regression is equivalent to offline kernel regression on shifted, inaccurate target outputs, deriving a closed-form expression for this equivalence via the representer theorem. It provides both a closed-form and an iterative target correction to recover the offline predictor exactly. The framework is extended empirically by applying iterative target correction to SGD on CNNs, showing improved performance over true targets in continual learning on CIFAR-10 and CORe50.

Significance. The closed-form derivations for the kernel equivalence and target correction are a strength, as they yield exact mathematical identities without free parameters or self-referential fitting. If the central claims hold, this provides a principled characterization of online-offline discrepancies with potential to improve online learning under shift. The empirical outperformance on image classification benchmarks indicates practical relevance for continual learning, though the generalization beyond kernels remains heuristic.

major comments (2)
  1. [Abstract] Abstract and derivation section: The equivalence and correction are derived specifically for exact online kernel regression updates (linear in feature space via representer theorem), but no corresponding derivation shows that the same effective target shift governs SGD dynamics on non-kernel models such as CNNs; the extension to neural networks is therefore heuristic and does not provably recover the offline optimum.
  2. [Experiments] Experiments section: The reported outperformance of iteratively corrected targets over true targets on CIFAR-10 and CORe50 does not include controls or analysis to distinguish shift compensation from incidental regularization effects, weakening support for the claim that the framework improves online learning in general non-stationary settings.
minor comments (1)
  1. [Derivation] The assumptions on the data stream (e.g., exact kernel updates without approximation) should be stated more explicitly when presenting the closed-form shift to clarify the scope of the equivalence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We appreciate the recognition of the closed-form derivations as a strength. Below, we provide point-by-point responses to the major comments and outline the revisions we will make to address them.

Point-by-point responses
  1. Referee: [Abstract] Abstract and derivation section: The equivalence and correction are derived specifically for exact online kernel regression updates (linear in feature space via representer theorem), but no corresponding derivation shows that the same effective target shift governs SGD dynamics on non-kernel models such as CNNs; the extension to neural networks is therefore heuristic and does not provably recover the offline optimum.

    Authors: We agree with this assessment. The theoretical equivalence and target correction are rigorously derived only for the kernel regression case using the representer theorem. The application to SGD on CNNs is indeed an empirical extension, applying the iterative correction heuristically without a corresponding proof for non-linear models. In the revised manuscript, we will update the abstract, introduction, and discussion sections to explicitly clarify that the neural network results are heuristic and do not claim to provably recover the offline optimum. This will better delineate the theoretical contributions from the empirical explorations. revision: yes

  2. Referee: [Experiments] Experiments section: The reported outperformance of iteratively corrected targets over true targets on CIFAR-10 and CORe50 does not include controls or analysis to distinguish shift compensation from incidental regularization effects, weakening support for the claim that the framework improves online learning in general non-stationary settings.

    Authors: We acknowledge that the current experiments lack explicit controls to isolate the target shift correction from possible regularization effects. To address this, we will add in the revised version an ablation study comparing the iterative target correction against standard regularization methods (e.g., increased weight decay or label smoothing) under the same continual learning setup. Additionally, we will include analysis of the target shifts observed during training to provide evidence that the correction is compensating for the effective shift rather than acting as generic regularization. We believe these additions will strengthen the empirical claims. revision: yes

Circularity Check

0 steps flagged

Kernel regression derivation is self-contained mathematical identity; no circular steps

Full rationale

The core claim derives a closed-form for the online kernel regressor via the representer theorem and exact recursive updates, showing equivalence to offline regression on shifted targets as an algebraic identity under the stated kernel assumptions. The target correction is likewise obtained directly by inverting that identity, with both closed-form and iterative versions following from the same equations without reference to fitted parameters, self-citations, or prior results by the authors. The subsequent application to SGD on CNNs is presented as an empirical heuristic without any claim that the identical shift governs non-kernel dynamics, so the central derivation chain remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard properties of kernel methods and the existence of a closed-form solution for kernel ridge regression; no free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (1)
  • [standard math] Kernel regression admits a closed-form solution expressible via the kernel matrix and target vector.
    Invoked to derive the equivalence between online and offline forms.
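The closed form the axiom refers to is the standard kernel ridge identity (textbook material, not specific to this paper): for training inputs $x_1,\dots,x_n$, target vector $\boldsymbol{y}$, kernel matrix $K_{ij} = k(x_i, x_j)$, and ridge $\lambda > 0$,

```latex
f(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i),
\qquad
\boldsymbol{\alpha} = (K + \lambda I)^{-1}\boldsymbol{y}.
```

Since $K$ is positive semidefinite, $K + \lambda I$ is invertible for every $\lambda > 0$, so the solution exists and is unique, and the representer theorem guarantees the optimum takes this finite expansion.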

pith-pipeline@v0.9.0 · 5468 in / 1216 out tokens · 52414 ms · 2026-05-11T03:34:17.793353+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 2 internal anchors
