arxiv: 2505.11771 · v2 · submitted 2025-05-17 · 💻 cs.LG · cs.AI · math.ST · stat.ML · stat.TH

Residual Feature Integration is Sufficient to Prevent Negative Transfer

Pith reviewed 2026-05-22 15:07 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.ST · stat.ML · stat.TH
keywords negative transfer · transfer learning · residual features · convergence rates · pretrained models · distribution shift

The pith

Augmenting frozen source features with a trainable residual encoder provably prevents negative transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a simple addition to transfer learning eliminates the risk of negative transfer: keep the source-pretrained features frozen and train a target-side encoder to pick up the residual signal they miss. It proves that the combined approach converges at least as fast as training from scratch for target distributions in the informative class, up to logarithmic factors, and that the rate moves smoothly from nonparametric toward near-parametric as the source representations become more informative. The method is architecture-agnostic and extends to incorporating new modalities at adaptation time. Experiments on image, text, and tabular data indicate that it maintains performance under distribution shift, label noise, semantic perturbation, and class imbalance.

Core claim

The central claim is that residual feature integration is sufficient to provably prevent negative transfer. The method freezes pretrained source-side features and augments them with a trainable target-side encoder that captures residual signals overlooked by the source model. This yields convergence rates no worse than training from scratch (up to logarithmic factors) for target distributions in the informative class, with a seamless transition to near-parametric rates when source representations carry useful information. To the authors' knowledge, this is the first theoretical work that ensures protection against negative transfer.

What carries the argument

Residual feature integration: freezing source pretrained features and adding a trainable target encoder to model residual signals not captured by the source model.
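
The mechanism can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and does not come from the paper: the one-dimensional input, the sine function standing in for the frozen source features, the tanh residual encoder, the toy target, and the names `frozen_features` and `ResidualIntegrationModel`. The only point is the structure: a head trained on the concatenation of frozen features and trainable residual features.

```python
import math
import random


def frozen_features(x):
    """Hypothetical stand-in for the frozen source representation; never updated."""
    return [math.sin(x)]


class ResidualIntegrationModel:
    """Linear head over [frozen features, trainable residual features]."""

    def __init__(self, n_residual=4, seed=0):
        rng = random.Random(seed)
        # Trainable residual encoder (assumed form): h_j(x) = tanh(a_j * x + b_j).
        self.a = [rng.uniform(-1.0, 1.0) for _ in range(n_residual)]
        self.b = [rng.uniform(-1.0, 1.0) for _ in range(n_residual)]
        self.n_frozen = len(frozen_features(0.0))
        # Head weights start at zero, so the initial prediction is 0 everywhere.
        self.w = [0.0] * (self.n_frozen + n_residual)

    def residual_features(self, x):
        return [math.tanh(a * x + b) for a, b in zip(self.a, self.b)]

    def predict(self, x):
        z = frozen_features(x) + self.residual_features(x)
        return sum(w * zi for w, zi in zip(self.w, z))

    def sgd_step(self, x, y, lr=0.02):
        h = self.residual_features(x)
        z = frozen_features(x) + h
        err = sum(w * zi for w, zi in zip(self.w, z)) - y
        # Update the residual encoder; the frozen part has no parameters to update.
        for j, hj in enumerate(h):
            grad_pre = err * self.w[self.n_frozen + j] * (1.0 - hj * hj)
            self.a[j] -= lr * grad_pre * x
            self.b[j] -= lr * grad_pre
        # Update the head over the concatenated features.
        for i, zi in enumerate(z):
            self.w[i] -= lr * err * zi


def mse(model, data):
    return sum((model.predict(x) - y) ** 2 for x, y in data) / len(data)


# Toy target: the sin part is visible to the frozen features, the |x| part is not.
data = [(x, math.sin(x) + 0.5 * abs(x)) for x in (i / 16.0 - 2.0 for i in range(65))]

model = ResidualIntegrationModel()
mse_before = mse(model, data)
for _ in range(300):
    for x, y in data:
        model.sgd_step(x, y)
mse_after = mse(model, data)
print(f"MSE before training: {mse_before:.3f}, after: {mse_after:.3f}")
```

In this toy setup the frozen feature can only explain the odd part of the target, so any fit to the even `|x|` component has to come from the trainable residual features. A real implementation would replace `frozen_features` with a pretrained network whose parameters are excluded from the optimizer.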

If this is right

  • The method has no worse convergence rate than training from scratch under the informative class of target distributions, up to logarithmic factors.
  • The convergence rate transitions seamlessly from nonparametric to near-parametric when source representations are informative.
  • The approach supports adapt-time multimodality extensions, such as adding spatial signals to a single-cell model.
  • It empirically safeguards performance under distribution shift, label noise, semantic perturbation, and class imbalance.
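
For orientation, the two regimes named in the list above can be written with textbook rates for squared-error risk. The smoothness β and input dimension d are symbols introduced here for illustration; the exponents are the classical ones for β-smooth regression in d dimensions, not bounds quoted from the paper, whose exact statements and logarithmic factors are not reproduced on this page.

```latex
\underbrace{n^{-\frac{2\beta}{2\beta + d}}}_{\text{nonparametric (training from scratch)}}
\quad\longrightarrow\quad
\underbrace{n^{-1}}_{\text{parametric}}
```

Here n is the number of labeled target samples. A "near-parametric" rate is one that matches the right-hand side up to logarithmic factors in n.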

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This integration could reduce reliance on complex domain-alignment techniques in practical transfer setups.
  • The same residual mechanism might apply to preventing negative transfer in sequential or continual learning scenarios.
  • Testing the method on distributions deliberately constructed to violate the informative-class assumption would clarify the boundary of the guarantees.

Load-bearing premise

Target distributions belong to the informative class that enables the stated convergence-rate bounds.

What would settle it

An empirical case where the residual integration method converges slower than training from scratch, or fails to match the predicted rates, on a target distribution satisfying the informative-class conditions.
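
A harness for this kind of comparison might look like the sketch below. It is a simplified stand-in and not the paper's experimental protocol: least-squares fits on a fixed cubic basis replace the trainable neural encoder, the frozen source feature `sin(37 x)` is deliberately unrelated to the target, and the toy data distribution has not been checked against the informative-class conditions. All function names are hypothetical.

```python
import math
import random


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * w[k] for k in range(r + 1, n))
        w[r] = s / M[r][r]
    return w


def fit(features, xs, ys, ridge=1e-8):
    """Least-squares fit of y on features(x); returns a prediction function."""
    Z = [features(x) for x in xs]
    d = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(z[i] * y for z, y in zip(Z, ys)) for i in range(d)]
    w = solve(A, b)
    return lambda x: sum(wi * zi for wi, zi in zip(w, features(x)))


def target_basis(x):
    # Stand-in for the target-side encoder: a fixed cubic basis (assumption).
    return [1.0, x, x * x, x ** 3]


def frozen_source(x):
    # Deliberately uninformative frozen feature (assumption).
    return [math.sin(37.0 * x)]


def integrated(x):
    # Residual integration: frozen features concatenated with target-side features.
    return frozen_source(x) + target_basis(x)


def truth(x):
    return math.sin(3.0 * x)


def heldout_mse(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)


def sample(n):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [truth(x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys


heldout = list(zip(*sample(2000)))
results = {}
for n in (25, 100, 400):
    xs, ys = sample(n)
    scratch = heldout_mse(fit(target_basis, xs, ys), heldout)
    combined = heldout_mse(fit(integrated, xs, ys), heldout)
    results[n] = (scratch, combined)
    print(f"n={n}: scratch={scratch:.4f}  integrated={combined:.4f}")
```

The quantity to watch is how the gap between the two columns behaves as `n` grows. A gap that persists or widens with sample size, on a distribution known to satisfy the paper's conditions, would be the kind of evidence described above; a toy run like this one cannot confirm or refute the theorem.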

Figures

Figures reproduced from arXiv: 2505.11771 by Lexin Li, Linjun Zhang, Ryumei Nakada, Yichen Xu.

Figure 1
Figure 1: A schematic overview of REFINE. … knowledge and outperform models trained from scratch on target data only, and when f_rep misaligns with the target distribution, safeguard against negative transfer and outperform models that rely solely on f_rep(x). We focus on the supervised learning task. Let $D_t = \{(x_i, y_i)\}_{i=1}^{n} \sim P_t$ denote the labeled dataset from the target task. Assume access to a frozen extracted…
Figure 2
Figure 2: Metric comparison across labeled target sizes for Adapter, LinearProbe, NoTrans, and …

Original abstract

Transfer learning has become a central paradigm in modern machine learning, yet it suffers from the long-standing problem of negative transfer, where leveraging source representations can harm rather than help performance on the target task. Although empirical remedies have been proposed, there remains little theoretical understanding of how to reliably avoid negative transfer. In this paper, we investigate a simple yet remarkably effective strategy: augmenting frozen, pretrained source-side features with a trainable target-side encoder that adapts target features to capture residual signals overlooked by models pretrained on the source data. We show this residual feature integration strategy is sufficient to provably prevent negative transfer, by establishing theoretical guarantees that it has no worse convergence rate than training from scratch under the informative class of target distributions up to logarithmic factors, and that the convergence rate can transition seamlessly from nonparametric to near-parametric when source representations are informative. To our knowledge, this is the first theoretical work that ensures protection against negative transfer. We carry out extensive numerical experiments across image, text and tabular benchmarks, and empirically verify that the method consistently safeguards performance under distribution shift, label noise, semantic perturbation, and class imbalance. We additionally demonstrate that this residual integration mechanism uniquely supports adapt-time multimodality extension, enabling a pretrained single-cell foundation model to incorporate spatial signals for lymph-node anatomical classification despite the source model being trained without them. Our study thus advances the theory of safe transfer learning, and provides a principled approach that is simple, robust, architecture-agnostic, and broadly applicable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes residual feature integration as a strategy to prevent negative transfer in transfer learning: frozen pretrained source features are augmented with a trainable target-side encoder that captures residual signals. It claims this is sufficient to provably prevent negative transfer by establishing that the approach has no worse convergence rate than training from scratch (up to logarithmic factors) for target distributions in an 'informative class', with a seamless transition from nonparametric to near-parametric rates when source representations are informative. The work includes extensive experiments on image, text, and tabular benchmarks under distribution shift, label noise, semantic perturbation, and class imbalance, plus a single-cell spatial extension for multimodality.

Significance. If the central theoretical claims hold, the result is significant because it supplies the first explicit theoretical protection against negative transfer, a persistent issue in transfer learning. The method is simple, architecture-agnostic, and supports adapt-time multimodality, which is a practical strength. Credit is due for deriving convergence-rate bounds that transition between regimes and for the breadth of empirical verification across domains.

major comments (2)
  1. [Theoretical results section (convergence-rate derivations)] The no-worse-than-scratch convergence guarantee and the nonparametric-to-near-parametric transition are derived only for target distributions in the 'informative class'. This class is invoked to obtain the central claim but receives no explicit characterization (e.g., moment conditions, source-target alignment, or density assumptions) in the theoretical development. Without such a characterization it is impossible to verify whether the image/text/tabular or single-cell distributions used in the experiments lie inside the class, rendering the 'provably prevents negative transfer' statement conditional on an unstated premise.
  2. [Abstract and § on theoretical guarantees] The abstract states that the residual construction yields an independent derivation of convergence rates; however, the precise assumptions under which the rate bound holds (including any dependence on source informativeness) are not stated with sufficient precision to allow direct checking of the derivation steps or the logarithmic-factor claim.
minor comments (2)
  1. [Method section] Notation for the residual encoder and the combined estimator should be introduced once and used consistently; current usage mixes 'target-side encoder' and 'residual adapter' without a single defining equation.
  2. [Experiments, single-cell subsection] The single-cell experiment would benefit from an explicit statement of how the source model (trained without spatial signals) is frozen and how the target encoder is initialized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve the clarity of our theoretical contributions.

Point-by-point responses
  1. Referee: [Theoretical results section (convergence-rate derivations)] The no-worse-than-scratch convergence guarantee and the nonparametric-to-near-parametric transition are derived only for target distributions in the 'informative class'. This class is invoked to obtain the central claim but receives no explicit characterization (e.g., moment conditions, source-target alignment, or density assumptions) in the theoretical development. Without such a characterization it is impossible to verify whether the image/text/tabular or single-cell distributions used in the experiments lie inside the class, rendering the 'provably prevents negative transfer' statement conditional on an unstated premise.

    Authors: We agree that an explicit characterization of the informative class is necessary for readers to assess the scope of the guarantees. While the class is referenced in the theoretical development, we acknowledge that a self-contained definition incorporating moment conditions, source-target alignment requirements, and density assumptions was not provided with sufficient detail. In the revised manuscript we will add a dedicated paragraph (or subsection) that formally defines the informative class and states the precise conditions under which the no-worse-than-scratch and nonparametric-to-near-parametric rates hold. We will also include a brief discussion of how the experimental distributions relate to these conditions, to the extent that the available data permit.
    Revision: yes

  2. Referee: [Abstract and § on theoretical guarantees] The abstract states that the residual construction yields an independent derivation of convergence rates; however, the precise assumptions under which the rate bound holds (including any dependence on source informativeness) are not stated with sufficient precision to allow direct checking of the derivation steps or the logarithmic-factor claim.

    Authors: We accept that the abstract and theoretical section would benefit from a more precise statement of assumptions. We will revise the abstract to explicitly list the key conditions (membership in the informative class, dependence on source informativeness for rate improvement, and the logarithmic factors that appear in the bounds). In the theoretical guarantees section we will add a short paragraph that outlines the main derivation steps and highlights where the logarithmic factors arise, thereby enabling direct verification of the claims.
    Revision: yes

Circularity Check

0 steps flagged

Derivation of no-worse-than-scratch convergence rates is self-contained under the informative class

Full rationale

The paper establishes theoretical convergence-rate bounds for residual feature integration by starting from the definition of an informative class of target distributions and deriving that the combined estimator matches or improves upon scratch training rates (up to log factors) with a nonparametric-to-near-parametric transition when source features are informative. No equations or steps reduce the claimed guarantee to a fitted parameter, a self-referential definition, or a load-bearing self-citation whose content is itself unverified. The informative class functions as an explicit premise rather than being constructed from the target result; the derivation therefore remains independent of the method's empirical performance on any particular dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about target distributions belonging to an informative class and on the existence of residual signals not captured by the source model; no explicit free parameters or new invented entities are named in the abstract.

axioms (1)
  • domain assumption: Target distributions belong to an informative class that supports the stated convergence bounds.
    Invoked to obtain the no-worse-than-scratch rate and the nonparametric-to-near-parametric transition.



