
arxiv: 2606.02830 · v1 · pith:UG7MNQWAnew · submitted 2026-06-01 · 💻 cs.LG · math.OC

Mitigating Spurious Correlations with Memorization-Guided Dataset De-Biasing

Pith reviewed 2026-06-28 15:45 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords spurious correlations · dataset debiasing · sample selection · empirical risk minimization · memorization · core features · invariant learning · subset selection

The pith

A two-stage scoring function selects small data subsets that let standard models outperform specialized debiasing methods on spurious correlations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem that real-world datasets often contain spurious correlations unrelated to the target label, causing models to misclassify minority samples that lack those patterns. Existing sample scoring methods used for subset selection tend to rely on the spurious features themselves and therefore fail to identify the truly important samples. The authors develop a two-stage scoring approach that tracks the learning dynamics of core features separately from spurious ones and uses the resulting metric to prioritize informative samples both with and without the spurious patterns. Experiments show that a standard empirical risk minimization model trained on the selected subset beats current state-of-the-art debiasing techniques while using as little as 10 percent of the original training data and without requiring group labels.

Core claim

We propose a two-stage sample scoring function that disentangles the learning dynamics of core and spurious features and evaluates their difficulty separately. Based on our proposed metric, we introduce a new algorithm to find and prioritize informative samples both with and without spurious correlations. A standard ERM model trained on our selected samples achieves superior performance compared to state-of-the-art debiasing techniques, while requiring as little as 10% of the original training data.

What carries the argument

A two-stage sample scoring function that evaluates the difficulty of core features separately from that of spurious features.
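As a rough illustration of scoring samples separately over two training windows and then keeping the highest-scoring 10%, the sketch below sums per-sample losses before and after a split epoch. It is not the paper's TCSL definition, which this page does not reproduce; the function names `two_window_scores` and `select_top_fraction`, the `split_epoch` parameter, and the random loss matrix are all assumptions made for illustration.

```python
import numpy as np

def two_window_scores(losses, split_epoch):
    """Sum per-sample losses over an early and a late training window.

    losses: array of shape (epochs, n_samples), one loss per sample per epoch.
    split_epoch: hypothetical boundary between the two windows.
    """
    losses = np.asarray(losses, dtype=float)
    early = losses[:split_epoch].sum(axis=0)   # cumulative loss in window 1
    late = losses[split_epoch:].sum(axis=0)    # cumulative loss in window 2
    return early, late

def select_top_fraction(scores, fraction=0.10):
    """Return indices of the highest-scoring samples, best first."""
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(scores)[-k:][::-1]

# Placeholder data: random losses standing in for a real training log.
rng = np.random.default_rng(0)
losses = rng.random((20, 100))

early, late = two_window_scores(losses, split_epoch=5)
subset = select_top_fraction(late, fraction=0.10)
print(len(subset), "of", losses.shape[1], "samples kept")
```

Which window (if either) tracks core-feature difficulty is exactly the separation assumption the review questions, so the choice of `late` here is arbitrary.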

If this is right

  • A standard ERM model on the selected subset outperforms state-of-the-art debiasing techniques.
  • The required training set can be reduced to 10% of the original data.
  • Sample selection succeeds without access to group labels.
  • The method prioritizes informative samples both inside and outside the spurious-correlation majority group.
  • Existing scoring functions are shown to depend on spurious features and therefore mis-rank sample importance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selection procedure could lower the cost of training on large real-world datasets that contain hidden biases.
  • The same scoring idea might extend to other label-free data pruning tasks beyond spurious-correlation mitigation.
  • Testing the method on datasets where spurious features are harder to isolate would reveal the limits of the two-stage separation.
  • Pairing the selected subset with lightweight regularization could produce further gains on minority samples.

Load-bearing premise

The two-stage scoring function can disentangle the learning dynamics of core features from those of spurious features in a way that existing scoring functions cannot.

What would settle it

The claim would be undermined if a standard model retrained on the selected 10% subset achieved lower accuracy on minority-group test samples than full-data training or competing debiasing methods.
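The test above comes down to comparing per-group accuracy on held-out data. A minimal sketch of that measurement follows, assuming group annotations (label, spurious attribute) are available at evaluation time only; the helper `group_accuracies` and the toy arrays are hypothetical.

```python
import numpy as np

def group_accuracies(y_true, y_pred, spurious):
    """Accuracy for every (label, spurious attribute) group that is non-empty."""
    y_true, y_pred, spurious = map(np.asarray, (y_true, y_pred, spurious))
    accs = {}
    for y in np.unique(y_true):
        for a in np.unique(spurious):
            mask = (y_true == y) & (spurious == a)
            if mask.any():
                accs[(int(y), int(a))] = float((y_pred[mask] == y_true[mask]).mean())
    return accs

# Toy example: the classifier simply predicts the spurious attribute.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
spur   = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

accs = group_accuracies(y_true, y_pred, spur)
worst = min(accs.values())
average = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print("average:", average, "worst group:", worst)
```

In this toy case average accuracy looks acceptable while the minority groups are always misclassified, which is why the comparison should be made on minority-group (or worst-group) accuracy rather than the average.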

Figures

Figures reproduced from arXiv: 2606.02830 by Abolfazl Hashemi, Arda Fazla.

Figure 1: Comparison of standard sample scores on Waterbirds. (A) We visualize the EL2N scores for two example images from the Waterbirds dataset, computed on a model trained on the datasets with and without the background. The results demonstrate that the presence of background significantly changes the EL2N score. (B) We present representative image pairs with high similarity based on feature embeddings extracted…

Figure 2: Overview of our proposed coreset selection algorithm. (A) We first train a two-stage model to accurately disentangle the learning processes of spurious and core features and compute sample scores for each component separately (TCSLs and TCSLc). We then construct our coreset selection algorithm, the Two-Stage Cumulative Sample Loss (TCSL)-guided Coreset Selection (TCSL-CS), based on the computed scores. (B)…

Figure 3: Average logits computed over the entire dataset.

Figure 4: Cosine similarity between the TCSL score components and the CSL scores computed

Figure 5: Learning behavior and simplicity bias analysis on a synthetic toy dataset.

Figure 6: Density comparison of EL2N and TCSL scores.
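Figures 1 and 6 refer to EL2N scores. For readers unfamiliar with the metric, the sketch below computes it as commonly defined in the data-pruning literature: the L2 norm of the difference between the softmax output and the one-hot label. In practice the score is averaged over several early-training checkpoints or seeds; the single-pass helper `el2n_scores` and the toy logits are assumptions for illustration.

```python
import numpy as np

def el2n_scores(logits, labels):
    """EL2N for one model snapshot: ||softmax(logits) - onehot(label)||_2."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Numerically stable softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(logits.shape[1])[labels]
    return np.linalg.norm(probs - onehot, axis=1)

# Three samples with true class 0: confident-correct, uncertain, confident-wrong.
logits = np.array([[4.0, 0.0], [0.0, 0.0], [0.0, 4.0]])
labels = np.array([0, 0, 0])
scores = el2n_scores(logits, labels)
print(scores)  # increasing: easy sample first, hard sample last
```

Because the score depends only on the model's output, any input feature the model relies on, including a background, can move it; that is the dependence Figure 1 illustrates.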
Original abstract

Real-world datasets often contain spurious correlations that are not causally related to the target label. When such correlations dominate the majority of training samples, models tend to rely on them, leading to misclassification of minority samples that do not exhibit the same spurious patterns. While a potential approach is to select subsets of data to better represent the minority samples, this may require access to group labels, which are typically unknown. Furthermore, as we demonstrate, widely used sample scoring functions in the invariant subset or coreset selection literature largely depend on spurious features and therefore fail to accurately capture the importance or difficulty of core, causally relevant features. Accordingly, we propose to mitigate spurious correlations by developing a two-stage sample scoring function that disentangles the learning dynamics of core and spurious features and evaluates their difficulty separately. Based on our proposed metric, we introduce a new algorithm to find and prioritize informative samples both with and without spurious correlations. Extensive experiments demonstrate that a standard ERM model trained on our selected samples achieves superior performance compared to state-of-the-art debiasing techniques, while requiring as little as 10% of the original training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing sample scoring functions in coreset/invariant subset selection largely depend on spurious features. It proposes a two-stage memorization-guided scoring function that disentangles core-feature and spurious-feature learning dynamics, uses this to select an informative 10% subset (with and without spurious correlations), and shows that standard ERM trained on this subset outperforms state-of-the-art debiasing methods.

Significance. If the two-stage metric reliably isolates core-feature difficulty, the result would be significant: it offers a label-free route to data-efficient debiasing that reduces training data to 10% while beating specialized debiasing algorithms. The approach also supplies a concrete, falsifiable test of whether early vs. late training dynamics can be used to separate core and spurious signals.

major comments (2)
  1. [§3] §3 (two-stage scoring function): the central claim that the first stage (early dynamics) and second stage (later dynamics) disentangle core vs. spurious difficulty lacks an identifiability argument or ablation. No formal criterion is given showing that stage-1 scores remain invariant when the strength of the spurious correlation is varied while core features are held fixed; without this, the selected 10% subset could still be dominated by majority spurious patterns.
  2. [§4–5] §4–5 (experimental validation): the superiority of ERM on the selected subset over SOTA debiasing baselines is reported, but the manuscript provides no controlled experiment that varies only the spurious-correlation strength while measuring whether the two-stage scores correctly up-weight minority core samples. The 10% data-sufficiency claim therefore rests on the unverified separation assumption.
minor comments (2)
  1. [§3] Notation for the two-stage metric (Eq. (3) or equivalent) should explicitly define the early-training window and the memorization threshold used in each stage.
  2. [Abstract / §2] The abstract states that existing scoring functions 'largely depend on spurious features'; this should be supported by a quantitative comparison (e.g., correlation of each baseline score with spurious vs. core labels) rather than left as a qualitative assertion.
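The quantitative comparison requested in minor comment 2 could look roughly like the following: correlate each baseline score with a spurious-attribute indicator and with a core-difficulty indicator, then compare magnitudes. The synthetic scores, the two binary indicators, and the helper `score_attribute_correlation` are hypothetical; real use would substitute actual scores and annotations.

```python
import numpy as np

def score_attribute_correlation(scores, attribute):
    """Pearson correlation; with a binary attribute this is the point-biserial r."""
    scores = np.asarray(scores, dtype=float)
    attribute = np.asarray(attribute, dtype=float)
    return float(np.corrcoef(scores, attribute)[0, 1])

rng = np.random.default_rng(0)
n = 2000
core = rng.integers(0, 2, n)  # hypothetical: 1 if the core feature is hard
spur = rng.integers(0, 2, n)  # hypothetical: 1 if the sample conflicts with the spurious cue

# A made-up baseline score that is driven mostly by the spurious indicator.
scores = 2.0 * spur + 0.2 * core + rng.normal(0.0, 0.5, n)

r_spur = score_attribute_correlation(scores, spur)
r_core = score_attribute_correlation(scores, core)
print(f"corr with spurious: {r_spur:.2f}, corr with core: {r_core:.2f}")
```

A score for which the first correlation dominates the second would support the abstract's assertion; the reverse pattern would contradict it.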

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address the major comments point by point below, indicating where we will make revisions to strengthen the paper.

  1. Referee: [§3] §3 (two-stage scoring function): the central claim that the first stage (early dynamics) and second stage (later dynamics) disentangle core vs. spurious difficulty lacks an identifiability argument or ablation. No formal criterion is given showing that stage-1 scores remain invariant when the strength of the spurious correlation is varied while core features are held fixed; without this, the selected 10% subset could still be dominated by majority spurious patterns.

    Authors: We acknowledge that our manuscript does not include a formal identifiability argument or a specific ablation varying spurious correlation strength while holding core features fixed. The two-stage approach is motivated by established observations in the memorization literature that models learn spurious features faster than core features in the presence of strong correlations. In the revised manuscript, we will add an ablation study on synthetic datasets where we systematically vary the spurious correlation strength and demonstrate that the stage-1 scores prioritize samples based on core feature difficulty, leading to subsets that improve minority group performance. revision: yes

  2. Referee: [§4–5] §4–5 (experimental validation): the superiority of ERM on the selected subset over SOTA debiasing baselines is reported, but the manuscript provides no controlled experiment that varies only the spurious-correlation strength while measuring whether the two-stage scores correctly up-weight minority core samples. The 10% data-sufficiency claim therefore rests on the unverified separation assumption.

    Authors: The experiments in the manuscript are performed on several benchmark datasets that exhibit different levels of spurious correlations, and we consistently observe that ERM on the 10% subset outperforms debiasing methods. However, we agree that a more controlled experiment isolating the effect of spurious correlation strength would provide stronger validation. We will include such an experiment in the revision using a controlled synthetic dataset to explicitly measure how the two-stage scores up-weight minority core samples as spurious strength varies. revision: yes
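The controlled experiment discussed in both responses needs a dataset in which only the spurious-correlation strength changes. A minimal generator under stated assumptions is sketched below: two Gaussian features, one tied to the label and one tied to a spurious attribute that agrees with the label with probability `alpha`. The function `make_spurious_dataset` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def make_spurious_dataset(n, alpha, core_strength=1.0, spur_strength=3.0, seed=0):
    """Binary labels y in {-1, +1}; spurious attribute a equals y with probability alpha."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)
    agree = rng.random(n) < alpha
    a = np.where(agree, y, -y)
    x_core = core_strength * y + rng.normal(size=n)   # distribution fixed across alpha
    x_spur = spur_strength * a + rng.normal(size=n)   # only its link to y changes
    X = np.stack([x_core, x_spur], axis=1)
    return X, y, a

# Sweep the correlation strength while the core feature stays fixed.
for alpha in (0.5, 0.8, 0.95):
    X, y, a = make_spurious_dataset(5000, alpha)
    print(alpha, "empirical agreement:", round(float(np.mean(y == a)), 3))
```

With the same `seed`, the labels and core feature are identical across values of `alpha`, so any change in a core-feature score across the sweep can be attributed to the spurious feature.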

Circularity Check

0 steps flagged

No circularity detected; proposal is self-contained without reduction to inputs or self-citations.


The abstract and available text introduce a two-stage sample scoring function as a novel proposal to disentangle core and spurious feature dynamics, with no equations, derivations, or self-citations provided that would reduce the metric or selection algorithm to fitted parameters, prior self-work, or definitional equivalence. No load-bearing steps match the enumerated circularity patterns, as the central claim of superior ERM performance on a 10% subset is presented as an empirical outcome rather than a constructed prediction. The paper is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger is empty as no technical details are extractable.

pith-pipeline@v0.9.1-grok · 5728 in / 1014 out tokens · 20602 ms · 2026-06-28T15:45:06.893818+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 11 canonical work pages · 4 internal anchors

  1. [1]

    Invariant risk minimization

    Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019

  2. [2]

    The pitfalls of memorization: When memorization hurts generalization

    Reza Bayat, Mohammad Pezeshki, Elvis Dohmatob, David Lopez-Paz, and Pascal Vincent. The pitfalls of memorization: When memorization hurts generalization. arXiv preprint arXiv:2412.07684, 2024

  3. [3]

    How spurious features are memorized: Precise analysis for random and NTK features

    Simone Bombari and Marco Mondelli. How spurious features are memorized: Precise analysis for random and NTK features. In Forty-first International Conference on Machine Learning, 2024

  4. [4]

    Environment inference for invariant learning

    Elliot Creager, Jörn-Henrik Jacobsen, and Richard Zemel. Environment inference for invariant learning. In International Conference on Machine Learning, pages 2189–2200. PMLR, 2021

  5. [5]

    Robust learning with progressive data expansion against spurious correlation

    Yihe Deng, Yu Yang, Baharan Mirzasoleiman, and Quanquan Gu. Robust learning with progressive data expansion against spurious correlation. Advances in Neural Information Processing Systems, 36:1390–1402, 2023

  6. [6]

    The impact of coreset selection on spurious correlations and group robustness

    Amaya Dharmasiri, William Yang, Polina Kirichenko, Lydia T Liu, and Olga Russakovsky. The impact of coreset selection on spurious correlations and group robustness. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  7. [7]

    Optimal shrinkage of eigenvalues in the spiked covariance model

    David L Donoho, Matan Gavish, and Iain M Johnstone. Optimal shrinkage of eigenvalues in the spiked covariance model. Annals of Statistics, 46(4):1742, 2018

  8. [8]

    Does learning require memorization? A short tale about a long tail

    Vitaly Feldman. Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020

  9. [9]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018

  10. [10]

    Last layer re-training is sufficient for robustness to spurious correlations

    Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. arXiv preprint arXiv:2204.02937, 2022

  11. [11]

    Wide neural networks of any depth evolve as linear models under gradient descent

    Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in Neural Information Processing Systems, 32, 2019

  12. [12]

    A whac-a-mole dilemma: Shortcuts come in multiples where mitigating one amplifies others

    Zhiheng Li, Ivan Evtimov, Albert Gordo, Caner Hazirbas, Tal Hassner, Cristian Canton Ferrer, Chenliang Xu, and Mark Ibrahim. A whac-a-mole dilemma: Shortcuts come in multiples where mitigating one amplifies others. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20071–20082, 2023

  13. [13]

    Metashift: A dataset of datasets for evaluating contextual distribution shifts and training conflicts

    Weixin Liang and James Zou. Metashift: A dataset of datasets for evaluating contextual distribution shifts and training conflicts. In International Conference on Learning Representations, 2022

  14. [14]

    The global k-means clustering algorithm

    Aristidis Likas, Nikos Vlassis, and Jakob J Verbeek. The global k-means clustering algorithm. Pattern Recognition, 36(2):451–461, 2003

  15. [15]

    Just train twice: Improving group robustness without training group information

    Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In International Conference on Machine Learning, pages 6781–

  16. [16]

    Avoiding spurious correlations via logit correction

    Sheng Liu, Xu Zhang, Nitesh Sekhar, Yue Wu, Prateek Singhal, and Carlos Fernandez-Granda. Avoiding spurious correlations via logit correction. arXiv preprint arXiv:2212.01433, 2022

  17. [17]

    D2 pruning: Message passing for balancing diversity and difficulty in data pruning

    Adyasha Maharana, Prateek Yadav, and Mohit Bansal. D2 pruning: Message passing for balancing diversity and difficulty in data pruning. arXiv preprint arXiv:2310.07931, 2023

  18. [18]

    Severing spurious correlations with data pruning

    Varun Mulchandani and Jung-Eun Kim. Severing spurious correlations with data pruning. In The Thirteenth International Conference on Learning Representations, 2025

  19. [19]

    SGD on neural networks learns functions of increasing complexity

    Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. arXiv preprint arXiv:1905.11604, 2019

  20. [20]

    Learning from failure: De-biasing classifier from biased classifier

    Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: De-biasing classifier from biased classifier. Advances in Neural Information Processing Systems, 33:20673–20684, 2020

  21. [21]

    Trak: Attributing model behavior at scale

    Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. Trak: Attributing model behavior at scale. arXiv preprint arXiv:2303.14186, 2023

  22. [22]

    Asymptotics of sample eigenstructure for a large dimensional spiked covariance model

    Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642, 2007

  23. [23]

    Deep learning on a data diet: Finding important examples early in training

    Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34:20596–20607, 2021

  24. [24]

    Complexity matters: Dynamics of feature learning in the presence of spurious correlations

    GuanWen Qiu, Da Kuang, and Surbhi Goel. Complexity matters: Dynamics of feature learning in the presence of spurious correlations. arXiv preprint arXiv:2403.03375, 2024

  25. [25]

    Simple and fast group robustness by automatic feature reweighting

    Shikai Qiu, Andres Potapczynski, Pavel Izmailov, and Andrew Gordon Wilson. Simple and fast group robustness by automatic feature reweighting. In International Conference on Machine Learning, pages 28448–28467. PMLR, 2023

  26. [26]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  27. [27]

    Towards memorization estimation: Fast, formal and free

    Deepak Ravikumar, Efstathia Soufleri, Abolfazl Hashemi, and Kaushik Roy. Towards memorization estimation: Fast, formal and free. In Forty-second International Conference on Machine Learning, 2025

  28. [28]

    Imagenet large scale visual recognition challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015

  29. [30]

    Distributionally robust neural networks

    Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2020

  30. [31]

    Upweighting easy samples in fine-tuning mitigates forgetting

    Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, and Sujay Sanghavi. Upweighting easy samples in fine-tuning mitigates forgetting. In Forty-second International Conference on Machine Learning, 2025

  31. [32]

    The pitfalls of simplicity bias in neural networks

    Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems, 33:9573–9585, 2020

  32. [33]

    No subclass left behind: Fine-grained robustness in coarse-grained classification problems

    Nimit Sohoni, Jared Dunnmon, Geoffrey Angus, Albert Gu, and Christopher Ré. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. Advances in Neural Information Processing Systems, 33:19339–19352, 2020

  33. [34]

    Beyond neural scaling laws: beating power law scaling via data pruning

    Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35:19523–19536, 2022

  34. [35]

    Group robust classification without any group information

    Christos Tsirigotis, Joao Monteiro, Pau Rodriguez, David Vazquez, and Aaron C Courville. Group robust classification without any group information. Advances in Neural Information Processing Systems, 36:56553–56575, 2023

  35. [36]

    Deep learning generalizes because the parameter-function map is biased towards simple functions

    Guillermo Valle-Perez, Chico Q Camargo, and Ard A Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522, 2018

  36. [37]

    Drop: Distributionally robust data pruning

    Artem M Vysogorets, Kartik Ahuja, and Julia Kempe. Drop: Distributionally robust data pruning. In The Thirteenth International Conference on Learning Representations, 2025

  37. [38]

    The Caltech-UCSD Birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011

  38. [39]

    On the effect of key factors in spurious correlation: A theoretical perspective

    Yipei Wang and Xiaoqian Wang. On the effect of key factors in spurious correlation: A theoretical perspective. In International Conference on Artificial Intelligence and Statistics, pages 3745–3753. PMLR, 2024

  39. [41]

    Nonlinear spiked covariance matrices and signal propagation in deep neural networks

    Zhichao Wang, Denny Wu, and Zhou Fan. Nonlinear spiked covariance matrices and signal propagation in deep neural networks. In The Thirty Seventh Annual Conference on Learning Theory, pages 4891–4957. PMLR, 2024

  40. [42]

    Identifying spurious biases early in training through the lens of simplicity bias

    Yu Yang, Eric Gan, Gintare Karolina Dziugaite, and Baharan Mirzasoleiman. Identifying spurious biases early in training through the lens of simplicity bias. In International Conference on Artificial Intelligence and Statistics, pages 2953–2961. PMLR, 2024

  41. [43]

    Correct-n-contrast: A contrastive approach for improving robustness to spurious correlations

    Michael Zhang, Nimit S Sohoni, Hongyang R Zhang, Chelsea Finn, and Christopher Ré. Correct-n-contrast: A contrastive approach for improving robustness to spurious correlations. arXiv preprint arXiv:2203.01517, 2022

  42. [44]

    Generalized cross entropy loss for training deep neural networks with noisy labels

    Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in Neural Information Processing Systems, 31, 2018

  43. [45]

    Coverage-centric coreset selection for high pruning rates

    Haizhong Zheng, Rui Liu, Fan Lai, and Atul Prakash. Coverage-centric coreset selection for high pruning rates. In The Eleventh International Conference on Learning Representations, 2023

  44. [46]

    Places: An image database for deep scene understanding

    Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, and Aude Oliva. Places: An image database for deep scene understanding. arXiv preprint arXiv:1610.02055, 2016

    Appendix: A Related Work; B Theoretical Analysis; B.1 Homogeneous Spiked Model; B.2 Heterogeneous Spiked Model ...

  45. [47]

    The network's output function at initialization $h(x; W_0)$ (which we refer to as the logit) becomes a draw from a Gaussian process (Proposition 1 in [9])

  46. [48]

    The Neural Tangent Kernel $K(x, x'; W) := \nabla_W h(x; W) \cdot \nabla_W h(x'; W)$ converges to a deterministic, positive semi-definite kernel $K(x, x')$ that is constant in time (Theorem 1 in [9])

  47. [49]

    The evolution of the logit outputs $h(x_i; W_t)$ for the $n$ training samples under gradient flow for the empirical loss $L(W) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i; W_t))$ is governed by an exact, deterministic, non-linear ordinary differential equation (ODE) in function space (Theorem 2 in [9]). For a specific logit $h_j(t) \equiv h(x_j; W_t)$, the dynamic is $\frac{\partial h(x_j; W_t)}{\partial t} = -\frac{1}{n} \sum_{i=1}^{n} K($...

  48. [50]

    For $j \in G_1$ (majority group, $y_j = a_j$), the expected margin $\bar{m}_t(x_j)$ is positive, and the loss $\ell(\bar{m}_t(x_j))$ is less than $\log(2)$

  49. [51]

    For $j \in G_2$ (minority group, $y_j = -a_j$), the expected margin $\bar{m}_t(x_j)$ is negative, and the loss $\ell(\bar{m}_t(x_j))$ is greater than $\log(2)$. Proof. By Assumption 1, the initial expected logit is $\bar{h}_0(x_j) = 0$, so the initial expected margin is $\bar{m}_0(x_j) = 0$. We compute the initial time-derivative of the expected margin $\left.\frac{\partial \bar{m}_t(x_j)}{\partial t}\right|_{t=0} = y_j \left.\frac{\partial \bar{h}^c_t(x_j)}{\partial t}\right|_{t=0} + \frac{\partial \bar{h}^s_t(x_j)}{\partial t}$...

  50. [52]

    For $j \in G_1$, $y_j a_j = 1$. The initial velocity is $R_c + R_s = \frac{\beta_c}{2} + \frac{\beta_s(2\alpha - 1)}{2} > 0$. Since $\bar{m}_0(x_j) = 0$ and $\left.\frac{\partial \bar{m}_t(x_j)}{\partial t}\right|_{t=0} > 0$, there exists $T_1 > 0$ such that $\bar{m}_t(x_j) > 0$ for $t \in (0, T_1)$. Thus, $\ell(\bar{m}_t(x_j)) < \ell(0) = \log(2)$

  51. [53]

    hard” (low $\beta$) and “easy

    For $j \in G_2$, $y_j a_j = -1$. The initial velocity is $R_c - R_s = \frac{\beta_c}{2} - \frac{\beta_s(2\alpha - 1)}{2} < 0$ by the simplicity bias condition. Since $\bar{m}_0(x_j) = 0$ and $\left.\frac{\partial \bar{m}_t(x_j)}{\partial t}\right|_{t=0} < 0$, there exists $T_2 > 0$ such that $\bar{m}_t(x_j) < 0$ for $t \in (0, T_2)$. Thus, $\ell(\bar{m}_t(x_j)) > \ell(0) = \log(2)$. Let $T = \min(T_1, T_2)$. For $t \in (0, T)$, both statements hold. The following theorem characterizes the initial curvature of the ...

  52. [54]

    High-Dimensional Statistics: A Non-Asymptotic Viewpoint

    Therefore, the function $\sigma' : \mathbb{R} \to \mathbb{R}$ is uniformly $L$-Lipschitz continuous. Applying the Gaussian Lipschitz concentration inequality (see Chapter 2.3 and Theorem 2.26 in the book “High-Dimensional Statistics: A Non-Asymptotic Viewpoint” by Martin J. Wainwright) for the function $\sigma'$ evaluated on the Gaussian random variable $Z$ yields the stated exponential tail bound w...
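The extracted proof fragments above argue from the sign of the initial margin velocity: $R_c + R_s$ for the majority group and $R_c - R_s$ for the minority group, with $R_c = \beta_c/2$ and $R_s = \beta_s(2\alpha - 1)/2$. The sketch below checks those signs numerically for one parameter setting. The values of `beta_c`, `beta_s`, and `alpha` are hypothetical, the logistic loss is assumed because it satisfies $\ell(0) = \log(2)$, and the margin at small time `dt` is approximated to first order as velocity times `dt`.

```python
import math

def initial_velocities(beta_c, beta_s, alpha):
    """Initial expected-margin velocities for majority and minority samples."""
    r_c = beta_c / 2.0
    r_s = beta_s * (2.0 * alpha - 1.0) / 2.0
    return r_c + r_s, r_c - r_s

def logistic_loss(margin):
    """Assumed loss: log(1 + exp(-margin)), which equals log(2) at margin 0."""
    return math.log1p(math.exp(-margin))

# Hypothetical values chosen so that R_c < R_s (spurious feature learned faster).
beta_c, beta_s, alpha = 0.5, 2.0, 0.9
v_major, v_minor = initial_velocities(beta_c, beta_s, alpha)

dt = 1e-2  # small time step for the first-order margin approximation
loss_major = logistic_loss(v_major * dt)
loss_minor = logistic_loss(v_minor * dt)

print("velocities:", v_major, v_minor)
print("loss vs log(2):", loss_major < math.log(2), loss_minor > math.log(2))
```

Under these assumed values the majority loss falls below $\log(2)$ and the minority loss rises above it shortly after initialization, matching the two statements in the fragments.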