pith. machine review for the scientific record.

arxiv: 2605.07886 · v1 · submitted 2026-05-08 · 📊 stat.ML · cs.LG

Recognition: 2 theorem links · Lean Theorem

Characterizing and Correcting Effective Target Shift in Online Learning


Pith reviewed 2026-05-11 03:34 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords: online learning · kernel regression · target shift · continual learning · distributional shift · stochastic gradient descent · target correction · effective shift

The pith

Online kernel regression learns the same predictor as offline regression once an effective shift in target outputs is corrected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online learning updates a model sequentially on a stream of data without revisiting past examples, while offline learning solves the problem on the complete dataset at once. The paper derives a closed-form expression showing that online kernel regression produces the same function as offline kernel regression but trained on shifted and inaccurate target values induced by the streaming process. A target correction, given in both closed-form and iterative versions, compensates for this shift so the online learner provably recovers the exact offline solution. Readers care because the correction can be applied to stochastic gradient descent on image classification tasks, where it improves performance over uncorrected targets in continual learning settings with distributional shift.

Core claim

The paper shows that online kernel regression is equivalent to offline regression on shifted, inaccurate target outputs. A closed-form expression for the online solution reveals this equivalence, and a derived target correction compensates for the shift so that online kernel-based learning provably matches the offline predictor. Both a closed-form correction and an iterative sequential form are provided, and the approach is applied to show that online SGD with corrected targets outperforms standard online learning on CIFAR-10 and CORe50 in continual settings.

What carries the argument

The effective target shift between the true targets and those implicitly used by the online updates; the derived target correction compensates for this shift, enforcing equivalence between the online and offline predictors.
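In symbols (our own schematic notation, not necessarily the paper's): if $K$ is the kernel matrix over the stream, $\boldsymbol{y}$ the true targets, and $\boldsymbol{\alpha}_{\mathrm{on}}$ the coefficients produced by the online updates, then the effective targets are the vector for which offline kernel ridge regression returns exactly the online solution:

```latex
\tilde{\boldsymbol{y}} = (K + \lambda I)\,\boldsymbol{\alpha}_{\mathrm{on}},
\qquad
\boldsymbol{\delta} = \tilde{\boldsymbol{y}} - \boldsymbol{y}
\quad \text{(effective target shift)}.
```

The correction then feeds the online learner modified targets chosen so that its induced effective targets coincide with the true $\boldsymbol{y}$.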

If this is right

  • The iterative target correction can be applied sequentially to streaming data without storing past examples.
  • Online SGD with the correction outperforms learning with true targets on continual image classification tasks.
  • The framework supplies a basic method for analyzing and improving online learning under non-stationary conditions.
  • The correction restores exact equivalence to offline learning for kernel models in stationary settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same target-shift mechanism may help explain catastrophic forgetting in neural networks trained online beyond the kernel case.
  • The iterative correction could be tested for compatibility with other optimizers such as Adam or momentum methods.
  • Applying the correction to regression or reinforcement learning streams would test whether the equivalence generalizes past classification.

Load-bearing premise

That the specific shift calculated for kernel regression is the dominant difference between online and offline learning, and that the same correction remains effective when transferred to stochastic gradient descent on non-kernel models.

What would settle it

Compute the closed-form online kernel solution on a small synthetic regression dataset, apply the target correction, and compare the resulting predictor to the offline kernel solution on the corrected targets; any numerical mismatch would disprove the claimed equivalence.
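That check can be run in a few lines. The sketch below is ours, not the paper's code: the RBF kernel, the learning rate `eta`, and the single-pass gradient update are illustrative stand-ins for the paper's online rule. It verifies the algebraic half of the claim, that offline regression on the effective targets reproduces the online predictor exactly, and measures how far those effective targets drift from the true ones.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(40)
n = len(y)

def rbf_kernel(A, B, gamma=2.0):
    # Gram matrix of the Gaussian (RBF) kernel
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

K = rbf_kernel(X, X)
lam = 1e-2
reg = K + lam * np.eye(n)

# Offline kernel ridge regression: alpha_off = (K + lam*I)^{-1} y
alpha_off = np.linalg.solve(reg, y)

# Single-pass online update (illustrative stand-in for the paper's rule):
# one gradient step per example as it arrives, never revisited.
alpha_on = np.zeros(n)
eta = 0.5
for t in range(n):
    resid = y[t] - K[t] @ alpha_on  # prediction error on the newly arrived example
    alpha_on[t] += eta * resid

# Effective targets: the y-vector for which offline regression
# would return exactly the online coefficients.
y_eff = reg @ alpha_on
alpha_check = np.linalg.solve(reg, y_eff)

print("online == offline-on-effective-targets:", np.allclose(alpha_check, alpha_on))
print("mean |effective target shift|:", float(np.abs(y_eff - y).mean()))
print("online vs offline coefficient gap:", float(np.abs(alpha_on - alpha_off).max()))
```

The `allclose` check is an identity and must hold; the nonzero shift and coefficient gap show that a single pass genuinely fits different targets. The paper's contribution is the closed-form characterization of this shift for its specific online rule, which the placeholder update above does not reproduce.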

Figures

Figures reproduced from arXiv: 2605.07886 by Naoki Hiratani, Ziyan Li.

Figure 1. Illustration of effective target shift and its correction.
Figure 2. Effective target shifts and their correction in NTK regression applied to MNIST.
Figure 3. Application of iterative target correction to SGD training of nonlinear neural networks.
Figure 4. Illustration of learning curves under a random projection kernel.
Figure 5. Offline and online regression using NTK applied to MNIST.
Figure 6. Hyperparameter dependence of nonlinear model training.
Original abstract

Online learning from a stream of data is a defining feature of intelligence, yet modern machine learning systems often struggle in this setting, especially under distributional shift. To understand its basic properties, we study the relationship between online and offline learning in the context of kernel regression. We derive a closed-form expression for the function learned by online kernel regression, revealing that online kernel regression is equivalent to offline regression with shifted, inaccurate target outputs. Conversely, we show that by compensating for this effective shift in the teaching signal through target correction, online kernel-based learning can provably learn the same predictor as its offline counterpart. We derive both a closed-form expression for this target correction and an iterative form that can be applied sequentially. Applying this framework to image classification tasks on CIFAR-10 and CORe50, we show that online stochastic gradient descent with iteratively corrected targets outperforms learning with the true targets in continual learning settings. This work therefore provides a basic framework for analyzing and improving online learning in non-stationary environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that online kernel regression is equivalent to offline kernel regression on shifted, inaccurate target outputs, deriving a closed-form expression for this equivalence via the representer theorem. It provides both a closed-form and an iterative target correction to recover the offline predictor exactly. The framework is extended empirically by applying iterative target correction to SGD on CNNs, showing improved performance over true targets in continual learning on CIFAR-10 and CORe50.

Significance. The closed-form derivations for the kernel equivalence and target correction are a strength, as they yield exact mathematical identities without free parameters or self-referential fitting. If the central claims hold, this provides a principled characterization of online-offline discrepancies with potential to improve online learning under shift. The empirical outperformance on image classification benchmarks indicates practical relevance for continual learning, though the generalization beyond kernels remains heuristic.

major comments (2)
  1. [Abstract] Abstract and derivation section: The equivalence and correction are derived specifically for exact online kernel regression updates (linear in feature space via representer theorem), but no corresponding derivation shows that the same effective target shift governs SGD dynamics on non-kernel models such as CNNs; the extension to neural networks is therefore heuristic and does not provably recover the offline optimum.
  2. [Experiments] Experiments section: The reported outperformance of iteratively corrected targets over true targets on CIFAR-10 and CORe50 does not include controls or analysis to distinguish shift compensation from incidental regularization effects, weakening support for the claim that the framework improves online learning in general non-stationary settings.
minor comments (1)
  1. [Derivation] The assumptions on the data stream (e.g., exact kernel updates without approximation) should be stated more explicitly when presenting the closed-form shift to clarify the scope of the equivalence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We appreciate the recognition of the closed-form derivations as a strength. Below, we provide point-by-point responses to the major comments and outline the revisions we will make to address them.

Point-by-point responses
  1. Referee: [Abstract] Abstract and derivation section: The equivalence and correction are derived specifically for exact online kernel regression updates (linear in feature space via representer theorem), but no corresponding derivation shows that the same effective target shift governs SGD dynamics on non-kernel models such as CNNs; the extension to neural networks is therefore heuristic and does not provably recover the offline optimum.

    Authors: We agree with this assessment. The theoretical equivalence and target correction are rigorously derived only for the kernel regression case using the representer theorem. The application to SGD on CNNs is indeed an empirical extension, applying the iterative correction heuristically without a corresponding proof for non-linear models. In the revised manuscript, we will update the abstract, introduction, and discussion sections to explicitly clarify that the neural network results are heuristic and do not claim to provably recover the offline optimum. This will better delineate the theoretical contributions from the empirical explorations. revision: yes

  2. Referee: [Experiments] Experiments section: The reported outperformance of iteratively corrected targets over true targets on CIFAR-10 and CORe50 does not include controls or analysis to distinguish shift compensation from incidental regularization effects, weakening support for the claim that the framework improves online learning in general non-stationary settings.

    Authors: We acknowledge that the current experiments lack explicit controls to isolate the target shift correction from possible regularization effects. To address this, we will add in the revised version an ablation study comparing the iterative target correction against standard regularization methods (e.g., increased weight decay or label smoothing) under the same continual learning setup. Additionally, we will include analysis of the target shifts observed during training to provide evidence that the correction is compensating for the effective shift rather than acting as generic regularization. We believe these additions will strengthen the empirical claims. revision: yes

Circularity Check

0 steps flagged

Kernel regression derivation is self-contained mathematical identity; no circular steps

Full rationale

The core claim derives a closed-form for the online kernel regressor via the representer theorem and exact recursive updates, showing equivalence to offline regression on shifted targets as an algebraic identity under the stated kernel assumptions. The target correction is likewise obtained directly by inverting that identity, with both closed-form and iterative versions following from the same equations without reference to fitted parameters, self-citations, or prior results by the authors. The subsequent application to SGD on CNNs is presented as an empirical heuristic without any claim that the identical shift governs non-kernel dynamics, so the central derivation chain remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard properties of kernel methods and the existence of a closed-form solution for kernel ridge regression; no free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (1)
  • [standard math] Kernel regression admits a closed-form solution expressible via the kernel matrix and target vector.
    Invoked to derive the equivalence between online and offline forms.
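The closed form the axiom refers to is the standard kernel ridge identity (textbook material, not specific to this paper): for training inputs $x_1,\dots,x_n$, target vector $\boldsymbol{y}$, kernel matrix $K_{ij} = k(x_i, x_j)$, and ridge $\lambda > 0$,

```latex
f(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i),
\qquad
\boldsymbol{\alpha} = (K + \lambda I)^{-1}\boldsymbol{y}.
```

Since $K$ is positive semidefinite, $K + \lambda I$ is invertible for every $\lambda > 0$, so the solution exists and is unique, and the representer theorem guarantees the optimum takes this finite expansion.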

pith-pipeline@v0.9.0 · 5468 in / 1216 out tokens · 52414 ms · 2026-05-11T03:34:17.793353+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 2 internal anchors
