Residual Feature Integration is Sufficient to Prevent Negative Transfer
Pith reviewed 2026-05-22 15:07 UTC · model grok-4.3
The pith
Augmenting frozen source features with a trainable residual encoder provably prevents negative transfer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that residual feature integration, which freezes pretrained source-side features and augments them with a trainable target-side encoder that captures residual signals overlooked by the source model, is sufficient to provably prevent negative transfer. For target distributions in the informative class, this yields convergence rates no worse than training from scratch (up to logarithmic factors), with a seamless transition from nonparametric to near-parametric rates when source representations carry useful information. The authors state that, to their knowledge, this is the first theoretical work to guarantee protection against negative transfer.
What carries the argument
Residual feature integration: freezing source pretrained features and adding a trainable target encoder to model residual signals not captured by the source model.
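The mechanism can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the hypothetical `frozen_source` feature sees only the first input coordinate, the target is a synthetic linear function, and the trainable residual branch is linear, so its encoder and head collapse into trainable weights on the raw inputs.

```python
import random

def frozen_source(x):
    """Hypothetical frozen 'pretrained' feature: it only sees x[0]."""
    return x[0]

def design(x, use_residual):
    # Inputs to the linear head: the frozen source feature and a bias, plus
    # (optionally) the raw inputs available to the trainable residual branch.
    # A linear residual encoder followed by a linear head reduces to this.
    feats = [frozen_source(x), 1.0]
    if use_residual:
        feats += list(x)
    return feats

def fit_mse(xs, ys, use_residual, lr=0.1, steps=500):
    """Full-batch gradient descent on mean squared error; returns (weights, mse)."""
    rows = [design(x, use_residual) for x in xs]
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for r, y in zip(rows, ys):
            err = sum(wi * ri for wi, ri in zip(w, r)) - y
            for j, rj in enumerate(r):
                grad[j] += 2.0 * err * rj / n
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    mse = sum(
        (sum(wi * ri for wi, ri in zip(w, r)) - y) ** 2 for r, y in zip(rows, ys)
    ) / n
    return w, mse

# Synthetic target data (not from the paper): y depends on x[1],
# which the frozen source feature never sees.
random.seed(0)
xs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
ys = [2.0 * x0 + 3.0 * x1 for x0, x1 in xs]

_, mse_frozen_only = fit_mse(xs, ys, use_residual=False)  # head on frozen feature only
_, mse_residual = fit_mse(xs, ys, use_residual=True)      # frozen feature + residual branch
print(mse_frozen_only, mse_residual)
```

In this toy setting the frozen-only fit cannot account for the part of the signal carried by `x[1]`, while the residual branch can pick it up without touching the source feature. The sketch illustrates the architecture only; it says nothing about the paper's convergence rates.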
If this is right
- The method has no worse convergence rate than training from scratch under the informative class of target distributions, up to logarithmic factors.
- The convergence rate transitions seamlessly from nonparametric to near-parametric when source representations are informative.
- The approach supports adapt-time multimodality extensions, such as adding spatial signals to a single-cell model.
- It empirically safeguards performance under distribution shift, label noise, semantic perturbation, and class imbalance.
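The two rate statements in the list above can be summarized schematically. The notation is an illustrative assumption rather than the paper's own: n is the target sample size, \mathcal{E} an excess-risk measure, \hat f_{res} the residual-integration estimator, r_{scratch}(n) the rate attained by training from scratch, and c, c' > 0 unspecified constants. Reading "near-parametric" as order 1/n up to logarithmic factors is likewise an assumption (a common convention for squared-error risk).

```latex
% Schematic restatement only; symbols and exponents are assumptions, not taken from the paper.
\begin{align*}
\text{(no negative transfer)}\quad
  & \mathcal{E}(\hat f_{\mathrm{res}}) \;\lesssim\; r_{\mathrm{scratch}}(n)\,(\log n)^{c}
  && \text{for every target distribution in the informative class,} \\
\text{(informative source)}\quad
  & \mathcal{E}(\hat f_{\mathrm{res}}) \;\lesssim\; n^{-1}(\log n)^{c'}
  && \text{when the frozen source representation is informative.}
\end{align*}
```

For orientation, a classical from-scratch benchmark for a \beta-smooth regression function in d dimensions is r_{scratch}(n) = n^{-2\beta/(2\beta+d)}; the text above does not say whether the paper works with this exact function class.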
Where Pith is reading between the lines
- This integration could reduce reliance on complex domain-alignment techniques in practical transfer setups.
- The same residual mechanism might apply to preventing negative transfer in sequential or continual learning scenarios.
- Testing the method on distributions deliberately constructed to violate the informative-class assumption would clarify the boundary of the guarantees.
Load-bearing premise
Target distributions belong to the informative class that enables the stated convergence-rate bounds.
What would settle it
An empirical case where the residual integration method converges slower than training from scratch, or fails to match the predicted rates, on a target distribution satisfying the informative-class conditions.
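A rough way to run such a check is to compare empirical rate exponents, that is, the slope of log error against log sample size for each method. The sketch below assumes error curves measured at several target sample sizes; the numbers are synthetic placeholders, not results from the paper, and `loglog_slope` is a hypothetical helper name.

```python
import math

def loglog_slope(ns, errs):
    """Least-squares slope of log(err) against log(n); more negative means faster."""
    lx = [math.log(n) for n in ns]
    ly = [math.log(e) for e in errs]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Synthetic placeholder curves (NOT measurements from the paper).
ns = [100, 200, 400, 800, 1600]
err_scratch = [n ** -0.5 for n in ns]         # pretend from-scratch errors
err_residual = [2.0 * n ** -0.9 for n in ns]  # pretend residual-integration errors

slope_scratch = loglog_slope(ns, err_scratch)
slope_residual = loglog_slope(ns, err_residual)

# Evidence against the claim would be a residual slope that is clearly less
# negative than the scratch slope on a distribution known to satisfy the
# informative-class conditions.
residual_is_slower = slope_residual > slope_scratch
print(slope_scratch, slope_residual, residual_is_slower)
```

Because the guarantee holds only up to logarithmic factors and constants, a slope comparison on finite samples is suggestive rather than decisive; repeated runs and confidence intervals would be needed before treating a gap as a counterexample.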
Original abstract
Transfer learning has become a central paradigm in modern machine learning, yet it suffers from the long-standing problem of negative transfer, where leveraging source representations can harm rather than help performance on the target task. Although empirical remedies have been proposed, there remains little theoretical understanding of how to reliably avoid negative transfer. In this paper, we investigate a simple yet remarkably effective strategy: augmenting frozen, pretrained source-side features with a trainable target-side encoder that adapts target features to capture residual signals overlooked by models pretrained on the source data. We show this residual feature integration strategy is sufficient to provably prevent negative transfer, by establishing theoretical guarantees that it has no worse convergence rate than training from scratch under the informative class of target distributions up to logarithmic factors, and that the convergence rate can transition seamlessly from nonparametric to near-parametric when source representations are informative. To our knowledge, this is the first theoretical work that ensures protection against negative transfer. We carry out extensive numerical experiments across image, text and tabular benchmarks, and empirically verify that the method consistently safeguards performance under distribution shift, label noise, semantic perturbation, and class imbalance. We additionally demonstrate that this residual integration mechanism uniquely supports adapt-time multimodality extension, enabling a pretrained single-cell foundation model to incorporate spatial signals for lymph-node anatomical classification despite the source model being trained without them. Our study thus advances the theory of safe transfer learning, and provides a principled approach that is simple, robust, architecture-agnostic, and broadly applicable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes residual feature integration as a strategy to prevent negative transfer in transfer learning: frozen pretrained source features are augmented with a trainable target-side encoder that captures residual signals. It claims this is sufficient to provably prevent negative transfer by establishing that the approach has no worse convergence rate than training from scratch (up to logarithmic factors) for target distributions in an 'informative class', with a seamless transition from nonparametric to near-parametric rates when source representations are informative. The work includes extensive experiments on image, text, and tabular benchmarks under distribution shift, label noise, semantic perturbation, and class imbalance, plus a single-cell spatial extension for multimodality.
Significance. If the central theoretical claims hold, the result is significant because it supplies the first explicit theoretical protection against negative transfer, a persistent issue in transfer learning. The method is simple, architecture-agnostic, and supports adapt-time multimodality, which is a practical strength. Credit is due for deriving convergence-rate bounds that transition between regimes and for the breadth of empirical verification across domains.
major comments (2)
- [Theoretical results section (convergence-rate derivations)] The no-worse-than-scratch convergence guarantee and the nonparametric-to-near-parametric transition are derived only for target distributions in the 'informative class'. This class is invoked to obtain the central claim but receives no explicit characterization (e.g., moment conditions, source-target alignment, or density assumptions) in the theoretical development. Without such a characterization it is impossible to verify whether the image/text/tabular or single-cell distributions used in the experiments lie inside the class, rendering the 'provably prevents negative transfer' statement conditional on an unstated premise.
- [Abstract and § on theoretical guarantees] The abstract states that the residual construction yields an independent derivation of convergence rates; however, the precise assumptions under which the rate bound holds (including any dependence on source informativeness) are not stated with sufficient precision to allow direct checking of the derivation steps or the logarithmic-factor claim.
minor comments (2)
- [Method section] Notation for the residual encoder and the combined estimator should be introduced once and used consistently; current usage mixes 'target-side encoder' and 'residual adapter' without a single defining equation.
- [Experiments, single-cell subsection] The single-cell experiment would benefit from an explicit statement of how the source model (trained without spatial signals) is frozen and how the target encoder is initialized.
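One way to meet the first minor comment is a single defining equation for the combined estimator. The following is a hypothetical notation, not the paper's: \phi_S is the frozen source feature map, g_\theta the trainable target-side (residual) encoder, h_\psi a prediction head on their concatenation, and \ell a loss evaluated on n target samples (x_i, y_i).

```latex
% Hypothetical notation for illustration; not taken from the paper.
\hat f(x) \;=\; h_{\psi}\!\big(\,[\,\phi_{S}(x),\; g_{\theta}(x)\,]\,\big),
\qquad
(\hat\theta,\hat\psi) \;\in\; \arg\min_{\theta,\psi}\; \frac{1}{n}\sum_{i=1}^{n}
  \ell\!\Big(y_i,\; h_{\psi}\big([\phi_{S}(x_i),\, g_{\theta}(x_i)]\big)\Big),
\qquad \phi_{S}\ \text{held fixed.}
```

With a definition of this kind, "target-side encoder" and "residual adapter" would both refer to g_\theta, and the second minor comment reduces to stating how \phi_S is frozen and how \theta is initialized.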
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve the clarity of our theoretical contributions.
- Referee: [Theoretical results section (convergence-rate derivations)] The no-worse-than-scratch convergence guarantee and the nonparametric-to-near-parametric transition are derived only for target distributions in the 'informative class'. This class is invoked to obtain the central claim but receives no explicit characterization (e.g., moment conditions, source-target alignment, or density assumptions) in the theoretical development. Without such a characterization it is impossible to verify whether the image/text/tabular or single-cell distributions used in the experiments lie inside the class, rendering the 'provably prevents negative transfer' statement conditional on an unstated premise.
  Authors: We agree that an explicit characterization of the informative class is necessary for readers to assess the scope of the guarantees. While the class is referenced in the theoretical development, we acknowledge that a self-contained definition incorporating moment conditions, source-target alignment requirements, and density assumptions was not provided with sufficient detail. In the revised manuscript we will add a dedicated paragraph (or subsection) that formally defines the informative class and states the precise conditions under which the no-worse-than-scratch and nonparametric-to-near-parametric rates hold. We will also include a brief discussion of how the experimental distributions relate to these conditions, to the extent that the available data permit.
  Revision: yes
- Referee: [Abstract and § on theoretical guarantees] The abstract states that the residual construction yields an independent derivation of convergence rates; however, the precise assumptions under which the rate bound holds (including any dependence on source informativeness) are not stated with sufficient precision to allow direct checking of the derivation steps or the logarithmic-factor claim.
  Authors: We accept that the abstract and theoretical section would benefit from a more precise statement of assumptions. We will revise the abstract to explicitly list the key conditions (membership in the informative class, dependence on source informativeness for rate improvement, and the logarithmic factors that appear in the bounds). In the theoretical guarantees section we will add a short paragraph that outlines the main derivation steps and highlights where the logarithmic factors arise, thereby enabling direct verification of the claims.
  Revision: yes
Circularity Check
Derivation of no-worse-than-scratch convergence rates is self-contained under the informative class
full rationale
The paper establishes theoretical convergence-rate bounds for residual feature integration by starting from the definition of an informative class of target distributions and deriving that the combined estimator matches or improves upon scratch training rates (up to log factors) with a nonparametric-to-near-parametric transition when source features are informative. No equations or steps reduce the claimed guarantee to a fitted parameter, a self-referential definition, or a load-bearing self-citation whose content is itself unverified. The informative class functions as an explicit premise rather than being constructed from the target result; the derivation therefore remains independent of the method's empirical performance on any particular dataset.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Target distributions belong to an informative class that supports the stated convergence bounds.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "augmenting frozen, pretrained source-side features with a trainable target-side encoder that adapts target features to capture residual signals"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "convergence rate ... transitions seamlessly from nonparametric to near-parametric"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Published as a conference paper at ICLR 2026.
Moreover, in addition to CNNs, we also evaluate both CIFAR-10 and CIFAR-100 with transformer-based models. Table S1 reports the results on CIFAR-100 with CNNs. Similar to CIFAR-10, REFINE consistently outperforms the baseline methods under all four stress scenarios. In particular, in the extreme noise setting with 80% label flips, most competing methods co...