Residual Feature Integration is Sufficient to Prevent Negative Transfer

Lexin Li; Linjun Zhang; Ryumei Nakada; Yichen Xu

arXiv: 2505.11771 · v2 · submitted 2025-05-17 · cs.LG · cs.AI · math.ST · stat.ML · stat.TH · published as a conference paper at ICLR 2026

Yichen Xu, Ryumei Nakada, Linjun Zhang, Lexin Li

Pith reviewed 2026-05-22 15:07 UTC · model grok-4.3

classification cs.LG · cs.AI · math.ST · stat.ML · stat.TH

keywords negative transfer · transfer learning · residual features · convergence rates · pretrained models · distribution shift

The pith

Augmenting frozen source features with a trainable residual encoder provably prevents negative transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a straightforward addition to transfer learning—keeping source pretrained features frozen while training a target encoder to pick up residual signals—eliminates the risk of negative transfer. It establishes theoretical bounds proving the combined approach converges at least as fast as training from scratch under informative target distributions, up to logarithmic factors, and shifts smoothly toward faster parametric rates when the source information proves useful. The method is architecture-agnostic and extends naturally to incorporating new modalities during adaptation. Experiments across images, text, and tabular data confirm it maintains performance even when facing shifts, noise, or imbalances.

Core claim

The central claim is that residual feature integration (freezing pretrained source-side features and augmenting them with a trainable target-side encoder that captures residual signals overlooked by the source model) is sufficient to provably prevent negative transfer. This yields convergence rates no worse than training from scratch (up to log factors) for target distributions in the informative class, with a seamless transition from nonparametric to near-parametric rates when source representations carry useful information. To the authors' knowledge, this is the first theoretical work that ensures protection against negative transfer.

What carries the argument

Residual feature integration: freezing source pretrained features and adding a trainable target encoder to model residual signals not captured by the source model.
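The mechanism can be sketched on toy data. This is a minimal illustration, not the authors' implementation: it assumes the frozen and trainable features are concatenated and fed to a linear head, and it restricts the residual encoder to a linear map so that everything can be fit in closed form. The names `frozen_source_feature`, `solve_least_squares`, and `train_mse`, along with the synthetic data, are hypothetical.

```python
import math
import random


def frozen_source_feature(x):
    """Hypothetical frozen source extractor f_rep: fixed, never updated."""
    return math.tanh(x[0] - x[1])


def solve_least_squares(rows, targets):
    """Fit linear weights by solving the normal equations."""
    d = len(rows[0])
    # Augmented normal-equation matrix [A^T A | A^T y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(d)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(d)]
    # Gaussian elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(col + 1, d):
            factor = m[k][col] / m[col][col]
            for j in range(col, d + 1):
                m[k][j] -= factor * m[col][j]
    # Back substitution.
    w = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, d))
        w[i] = (m[i][d] - tail) / m[i][i]
    return w


def train_mse(rows, targets):
    """Training mean squared error of the least-squares fit."""
    w = solve_least_squares(rows, targets)
    preds = [sum(wi * ri for wi, ri in zip(w, r)) for r in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


random.seed(0)
xs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
# Illustrative target that the frozen feature captures only partly.
ys = [math.sin(x[0]) + 0.5 * x[1] + random.gauss(0, 0.1) for x in xs]

# Three feature sets, each with an intercept column.
probe = [[frozen_source_feature(x), 1.0] for x in xs]        # frozen only
scratch = [[x[0], x[1], 1.0] for x in xs]                    # target only
combined = [[frozen_source_feature(x), x[0], x[1], 1.0] for x in xs]

mse_probe = train_mse(probe, ys)
mse_scratch = train_mse(scratch, ys)
mse_combined = train_mse(combined, ys)
print(mse_probe, mse_scratch, mse_combined)
```

In this toy setting the combined feature set contains both of the others, so its training error cannot exceed either one. That nesting argument is only an in-sample analogy; the paper's guarantee concerns convergence rates, which this sketch does not verify.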

If this is right

The method has no worse convergence rate than training from scratch under the informative class of target distributions, up to logarithmic factors.
The convergence rate transitions seamlessly from nonparametric to near-parametric when source representations are informative.
The approach supports adapt-time multimodality extensions, such as adding spatial signals to a single-cell model.
It empirically safeguards performance under distribution shift, label noise, semantic perturbation, and class imbalance.
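
For orientation on the first two items, the following is a schematic reading in standard nonparametric-regression notation. Every symbol is an assumption introduced here for illustration: $n$ is the target sample size, $d$ the input dimension, $\beta$ the smoothness of the target regression function, $\mathcal{R}_n$ the squared-error risk, and $c$ an unspecified log power. The paper's exact exponents, constants, and conditions are not given in this summary and may differ.

```latex
% Classical benchmark rates (textbook facts, not the paper's theorem):
\[
  \underbrace{n^{-\frac{2\beta}{2\beta+d}}}_{\text{nonparametric}}
  \qquad\text{vs.}\qquad
  \underbrace{n^{-1}}_{\text{parametric}}
\]
% Schematic form of the first claim (no worse than scratch, up to logs):
\[
  \mathcal{R}_n(\text{residual integration})
  \;\lesssim\; (\log n)^{c}\,\mathcal{R}_n(\text{scratch})
\]
% Schematic form of the second claim (informative source features):
\[
  \mathcal{R}_n(\text{residual integration})
  \;\lesssim\; (\log n)^{c}\, n^{-1}
\]
```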

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This integration could reduce reliance on complex domain-alignment techniques in practical transfer setups.
The same residual mechanism might apply to preventing negative transfer in sequential or continual learning scenarios.
Testing the method on distributions deliberately constructed to violate the informative-class assumption would clarify the boundary of the guarantees.

Load-bearing premise

Target distributions belong to the informative class that enables the stated convergence-rate bounds.

What would settle it

An empirical case where the residual integration method converges slower than training from scratch, or fails to match the predicted rates, on a target distribution satisfying the informative-class conditions.
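
One hypothetical way to run such a check is to estimate the empirical rate exponent as the slope of log error against log sample size, then compare the two methods. The sketch below uses synthetic placeholder curves rather than results from the paper; `empirical_rate` and the 0.05 margin are illustrative choices.

```python
import math


def empirical_rate(sample_sizes, errors):
    """Least-squares slope of log(error) against log(n)."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Synthetic placeholder curves, NOT results from the paper.
sizes = [100, 200, 400, 800, 1600]
scratch_err = [3.0 * n ** -0.4 for n in sizes]
refine_err = [2.0 * n ** -0.8 for n in sizes]

slope_scratch = empirical_rate(sizes, scratch_err)
slope_refine = empirical_rate(sizes, refine_err)

# A counterexample in the sense above would show the residual method
# decaying clearly more slowly (a larger slope) than training from scratch.
violates = slope_refine > slope_scratch + 0.05
print(slope_scratch, slope_refine, violates)
```

A slope comparison of this kind is only suggestive: finite-sample curves are noisy, and log factors can blur the exponent at small sample sizes.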

Figures

Figures reproduced from arXiv: 2505.11771 by Lexin Li, Linjun Zhang, Ryumei Nakada, Yichen Xu.

**Figure 1.** A schematic overview of REFINE. … knowledge and outperform models trained from scratch on target data only, and when $f_{\mathrm{rep}}$ misaligns with the target distribution, safeguard against negative transfer and outperform models that rely solely on $f_{\mathrm{rep}}(x)$. We focus on the supervised learning task. Let $D_t = \{(x_i, y_i)\}_{i=1}^{n} \sim P_t$ denote the labeled dataset from the target task. Assume access to a frozen extracted…

**Figure 2.** Metric comparison across labeled target sizes for Adapter, LinearProbe, NoTrans, and …

Original abstract

Transfer learning has become a central paradigm in modern machine learning, yet it suffers from the long-standing problem of negative transfer, where leveraging source representations can harm rather than help performance on the target task. Although empirical remedies have been proposed, there remains little theoretical understanding of how to reliably avoid negative transfer. In this paper, we investigate a simple yet remarkably effective strategy: augmenting frozen, pretrained source-side features with a trainable target-side encoder that adapts target features to capture residual signals overlooked by models pretrained on the source data. We show this residual feature integration strategy is sufficient to provably prevent negative transfer, by establishing theoretical guarantees that it has no worse convergence rate than training from scratch under the informative class of target distributions up to logarithmic factors, and that the convergence rate can transition seamlessly from nonparametric to near-parametric when source representations are informative. To our knowledge, this is the first theoretical work that ensures protection against negative transfer. We carry out extensive numerical experiments across image, text and tabular benchmarks, and empirically verify that the method consistently safeguards performance under distribution shift, label noise, semantic perturbation, and class imbalance. We additionally demonstrate that this residual integration mechanism uniquely supports adapt-time multimodality extension, enabling a pretrained single-cell foundation model to incorporate spatial signals for lymph-node anatomical classification despite the source model being trained without them. Our study thus advances the theory of safe transfer learning, and provides a principled approach that is simple, robust, architecture-agnostic, and broadly applicable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Residual integration gives a clean theoretical handle on negative transfer but hinges on an under-specified 'informative class' of targets.

Here's the quick take on this arXiv paper: the core claim is that freezing pretrained source features and tacking on a trainable target encoder for residuals is enough to guarantee no negative transfer, with convergence rates at least as good as starting from scratch, up to log factors, plus a nonparametric-to-near-parametric transition when source reps help.

The new part is the theoretical guarantee. Prior work had empirical patches for negative transfer, but this one derives rates that match or beat scratch training and shows a smooth shift to parametric rates when the source is informative. That's a step forward for the theory side. The method itself is simple (no new architectures, just this residual addition), and they test it across image, text, and tabular benchmarks plus a single-cell spatial extension. The experiments cover distribution shift, noise, and imbalance, which is solid coverage for a methods paper.

The main soft spot is the 'informative class' of target distributions that the bounds rely on. The paper invokes this class to get the no-worse guarantee, but without a clear description of what it entails (say, specific moment conditions or alignment metrics) it's difficult to know if real-world cases or their experiments actually fall inside it. If the class is too narrow, the protection claim weakens outside it. Experiments look promising on the surface, but I'd want to see the actual numbers and controls before fully buying the robustness story.

This is for folks in ML theory and applied transfer learning who care about safe reuse of pretrained models. A reader who wants both a practical trick and some backing analysis could get value here. It deserves a serious referee because the idea is straightforward, the theory is novel in this framing, and the experiments span multiple domains. I'd recommend sending it to peer review with requests to clarify the informative class and add more detail on the experimental setups.

Referee Report

2 major / 2 minor

Summary. The paper proposes residual feature integration as a strategy to prevent negative transfer in transfer learning: frozen pretrained source features are augmented with a trainable target-side encoder that captures residual signals. It claims this is sufficient to provably prevent negative transfer by establishing that the approach has no worse convergence rate than training from scratch (up to logarithmic factors) for target distributions in an 'informative class', with a seamless transition from nonparametric to near-parametric rates when source representations are informative. The work includes extensive experiments on image, text, and tabular benchmarks under distribution shift, label noise, semantic perturbation, and class imbalance, plus a single-cell spatial extension for multimodality.

Significance. If the central theoretical claims hold, the result is significant because it supplies the first explicit theoretical protection against negative transfer, a persistent issue in transfer learning. The method is simple, architecture-agnostic, and supports adapt-time multimodality, which is a practical strength. Credit is due for deriving convergence-rate bounds that transition between regimes and for the breadth of empirical verification across domains.

major comments (2)

[Theoretical results section (convergence-rate derivations)] The no-worse-than-scratch convergence guarantee and the nonparametric-to-near-parametric transition are derived only for target distributions in the 'informative class'. This class is invoked to obtain the central claim but receives no explicit characterization (e.g., moment conditions, source-target alignment, or density assumptions) in the theoretical development. Without such a characterization it is impossible to verify whether the image/text/tabular or single-cell distributions used in the experiments lie inside the class, rendering the 'provably prevents negative transfer' statement conditional on an unstated premise.
[Abstract and § on theoretical guarantees] The abstract states that the residual construction yields an independent derivation of convergence rates; however, the precise assumptions under which the rate bound holds (including any dependence on source informativeness) are not stated with sufficient precision to allow direct checking of the derivation steps or the logarithmic-factor claim.

minor comments (2)

[Method section] Notation for the residual encoder and the combined estimator should be introduced once and used consistently; current usage mixes 'target-side encoder' and 'residual adapter' without a single defining equation.
[Experiments, single-cell subsection] The single-cell experiment would benefit from an explicit statement of how the source model (trained without spatial signals) is frozen and how the target encoder is initialized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve the clarity of our theoretical contributions.

Referee: [Theoretical results section (convergence-rate derivations)] The no-worse-than-scratch convergence guarantee and the nonparametric-to-near-parametric transition are derived only for target distributions in the 'informative class'. This class is invoked to obtain the central claim but receives no explicit characterization (e.g., moment conditions, source-target alignment, or density assumptions) in the theoretical development. Without such a characterization it is impossible to verify whether the image/text/tabular or single-cell distributions used in the experiments lie inside the class, rendering the 'provably prevents negative transfer' statement conditional on an unstated premise.

Authors: We agree that an explicit characterization of the informative class is necessary for readers to assess the scope of the guarantees. While the class is referenced in the theoretical development, we acknowledge that a self-contained definition incorporating moment conditions, source-target alignment requirements, and density assumptions was not provided with sufficient detail. In the revised manuscript we will add a dedicated paragraph (or subsection) that formally defines the informative class and states the precise conditions under which the no-worse-than-scratch and nonparametric-to-near-parametric rates hold. We will also include a brief discussion of how the experimental distributions relate to these conditions, to the extent that the available data permit.

Revision: yes

Referee: [Abstract and § on theoretical guarantees] The abstract states that the residual construction yields an independent derivation of convergence rates; however, the precise assumptions under which the rate bound holds (including any dependence on source informativeness) are not stated with sufficient precision to allow direct checking of the derivation steps or the logarithmic-factor claim.

Authors: We accept that the abstract and theoretical section would benefit from a more precise statement of assumptions. We will revise the abstract to explicitly list the key conditions (membership in the informative class, dependence on source informativeness for rate improvement, and the logarithmic factors that appear in the bounds). In the theoretical guarantees section we will add a short paragraph that outlines the main derivation steps and highlights where the logarithmic factors arise, thereby enabling direct verification of the claims.

Revision: yes

Circularity Check

0 steps flagged

Derivation of no-worse-than-scratch convergence rates is self-contained under the informative class

The paper establishes theoretical convergence-rate bounds for residual feature integration by starting from the definition of an informative class of target distributions and deriving that the combined estimator matches or improves upon scratch training rates (up to log factors) with a nonparametric-to-near-parametric transition when source features are informative. No equations or steps reduce the claimed guarantee to a fitted parameter, a self-referential definition, or a load-bearing self-citation whose content is itself unverified. The informative class functions as an explicit premise rather than being constructed from the target result; the derivation therefore remains independent of the method's empirical performance on any particular dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about target distributions belonging to an informative class and on the existence of residual signals not captured by the source model; no explicit free parameters or new invented entities are named in the abstract.

axioms (1)

domain assumption: Target distributions belong to an informative class that supports the stated convergence bounds. Invoked to obtain the no-worse-than-scratch rate and the nonparametric-to-parametric transition.

pith-pipeline@v0.9.0 · 5808 in / 1247 out tokens · 52849 ms · 2026-05-22T15:07:25.085792+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "augmenting frozen, pretrained source-side features with a trainable target-side encoder that adapts target features to capture residual signals"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "convergence rate ... transitions seamlessly from nonparametric to near-parametric"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 5 internal anchors

[1]

Muhammad Jamal Afridi, Arun Ross, and Erik M. Shapiro. On automated source selection for transfer learning in convolutional neural networks.Pattern Recognition, 73:65–75, 2018. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2017.07.019. URLhttps://www. sciencedirect.com/science/article/pii/S0031320317302881

work page doi:10.1016/j.patcog.2017.07.019 2018
[2]

Multimodal machine learn- ing: A survey and taxonomy.IEEE Trans

Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learn- ing: A survey and taxonomy.IEEE Trans. Pattern Anal. Mach. Intell., 41(2):423–443, February 2019. ISSN 0162-8828. doi: 10.1109/TPAMI.2018.2798607. URLhttps: //doi.org/10.1109/TPAMI.2018.2798607

work page doi:10.1109/tpami.2018.2798607 2019
[3]

Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification

John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Annie Zaenen and Antal van den Bosch (eds.),Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 440–447, Prague, Czech Republic, June 2007. Association for Computationa...

work page 2007
[4]

The relative performance of ensemble methods with deep convolutional neural networks for image classification.Journal of Applied Statistics, 45(15):2800–2818, 2018

Aur ´elien Bibaut Cheng Ju and Mark van der Laan. The relative performance of ensemble methods with deep convolutional neural networks for image classification.Journal of Applied Statistics, 45(15):2800–2818, 2018. doi: 10.1080/02664763.2018.1441383. URLhttps: //doi.org/10.1080/02664763.2018.1441383. PMID: 31631918

work page doi:10.1080/02664763.2018.1441383 2018
[5]

An analysis of single-layer networks in unsu- pervised feature learning

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsu- pervised feature learning. In Geoffrey Gordon, David Dunson, and Miroslav Dud ´ık (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statis- tics, volume 15 ofProceedings of Machine Learning Research, pp. 215–223, Fort Lauderda...

work page 2011
[6]

When more is less: Incor- porating additional datasets can hurt performance by introducing spurious correlations, 2023

Rhys Compton, Lily Zhang, Aahlad Puli, and Rajesh Ranganath. When more is less: Incor- porating additional datasets can hurt performance by introducing spurious correlations, 2023. URLhttps://arxiv.org/abs/2308.04431

work page arXiv 2023
[7]

Buenrostro, Nir Yosef, Carolina Caldas, Rui Sun, and Bing He

Huazhe Cui, Chen Wang, Haitham Maan, Jason D. Buenrostro, Nir Yosef, Carolina Caldas, Rui Sun, and Bing He. scgpt: toward building a foundation model for single-cell multi-omics using generative AI.Nature Methods, 21(9):1470–1480, September 2024. doi: 10.1038/ s41592-024-02201-0. URLhttps://doi.org/10.1038/s41592-024-02201-0

work page doi:10.1038/s41592-024-02201-0 2024
[8]

2011, Neural Comput., 23, 1661, 10.1162/NECO\_a\_00142

Jing Gao, Peng Li, Zhikui Chen, and Jianing Zhang. A survey on deep learning for multimodal data fusion.Neural Computation, 32(5):829–864, 2020. doi: 10.1162/neco a 01273

work page doi:10.1162/neco 2020
[9]

Data Shapley: Equitable Valuation of Data for Machine Learning

Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning, 2019. URLhttps://arxiv.org/abs/1904.02868

work page internal anchor Pith review Pith/arXiv arXiv 2019
[10]

Knowledge distillation: A survey.International Journal of Computer Vision, 129(6):1789–1819, 2021

Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey.International Journal of Computer Vision, 129(6):1789–1819, 2021. 11 Published as a conference paper at ICLR 2026

work page 2021
[11]

Borgwardt, Malte J

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch ¨olkopf, and Alexander Smola. A kernel two-sample test.J. Mach. Learn. Res., 13(null):723–773, March 2012. ISSN 1532-4435

work page 2012
[12]

Ensemble transfer learning for distinguishing cognitively normal and mild cognitive impairment patients using mri.Algorithms, 16(8), 2023

Pratham Grover, Kunal Chaturvedi, Xing Zi, Amit Saxena, Shiv Prakash, Tony Jan, and Mukesh Prasad. Ensemble transfer learning for distinguishing cognitively normal and mild cognitive impairment patients using mri.Algorithms, 16(8), 2023. ISSN 1999-4893. doi: 10.3390/a16080377. URLhttps://www.mdpi.com/1999-4893/16/8/377

work page doi:10.3390/a16080377 2023
[13]

Springer, 2002

L ´aszl´o Gy¨orfi, Michael Kohler, Adam Krzy˙zak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer, 2002

work page 2002
[14]

Neural collapse under mse loss: Proximity to and dynamics on the central path.arXiv preprint arXiv:2106.02073, 2021

XY Han, Vardan Papyan, and David L Donoho. Neural collapse under mse loss: Proximity to and dynamics on the central path.arXiv preprint arXiv:2106.02073, 2021

work page arXiv 2021
[15]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pp. 770–778, 2016

work page 2016
[16]

Distilling the knowledge in a neural network,

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network,

work page
[17]

URLhttps://arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Hosna, E

A. Hosna, E. Merry, J. Gyalmo, Z. Alom, Z. Aung, and M. A. Azim. Transfer learn- ing: a friendly introduction.Journal of Big Data, 9(1):102, 2022. doi: 10.1186/ s40537-022-00652-w. URLhttps://doi.org/10.1186/s40537-022-00652-w. Epub 2022 Oct 22

work page doi:10.1186/s40537-022-00652-w 2022
[19]

Parameter-Efficient Transfer Learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp, 2019. URLhttps://arxiv.org/abs/1902.00751

work page internal anchor Pith review Pith/arXiv arXiv 2019
[20]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URLhttps://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Does data augmentation improve generaliza- tion in nlp?, 2020

Rohan Jha, Charles Lovering, and Ellie Pavlick. Does data augmentation improve generaliza- tion in nlp?, 2020. URLhttps://arxiv.org/abs/2004.15012

work page arXiv 2020
[22]

Deep transfer learning: Model framework and error analysis, 2025

Yuling Jiao, Huazhen Lin, Yuchen Luo, and Jerry Zhijian Yang. Deep transfer learning: Model framework and error analysis, 2025. URLhttps://arxiv.org/abs/2410.09383

work page arXiv 2025
[23]

Lightgbm: A highly efficient gradient boosting decision tree

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.,

work page
[24]

URLhttps://proceedings.neurips.cc/paper_files/paper/2017/ file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf

work page 2017
[25]

Adult income dataset.https://www.kaggle.com/ datasets/uciml/adult-census-income, 1996

Ronny Kohavi and Barry Becker. Adult income dataset.https://www.kaggle.com/ datasets/uciml/adult-census-income, 1996. Originally from the UCI Machine Learning Repository. Kaggle version shared by user 1251, updated 2016

work page 1996
[26]

On the rate of convergence of fully connected deep neural network regression estimates.The Annals of Statistics, 49(4):2231–2249, 2021

Michael Kohler and Sophie Langer. On the rate of convergence of fully connected deep neural network regression estimates.The Annals of Statistics, 49(4):2231–2249, 2021

work page 2021
[27]

Learning multiple layers of features from tiny images

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical re- port, University of Toronto, 2009. URLhttps://www.cs.toronto.edu/ ˜kriz/ learning-features-2009-TR.pdf. Technical Report

work page 2009
[28]

Fine-tuning can distort pretrained features and underperform out-of-distribution, 2022

Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution, 2022. URLhttps: //arxiv.org/abs/2202.10054. 12 Published as a conference paper at ICLR 2026

work page arXiv 2022
[29]

Towards safe weakly supervised learning

Yu-Feng Li, Lan-Zhe Guo, and Zhi-Hua Zhou. Towards safe weakly supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):334–346, 2021. doi: 10.1109/TPAMI.2019.2922396

work page doi:10.1109/tpami.2019.2922396 2021
[30]

Dyer, and Vidya Muthukumar

Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, and Vidya Muthukumar. The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective. J. Mach. Learn. Res., 25:91:1–91:85, 2022. URLhttps://api.semanticscholar. org/CorpusID:252815719

work page 2022
[31]

Y . P. Lin and T. P. Jung. Improving eeg-based emotion classification using conditional transfer learning.Frontiers in Human Neuroscience, 11:334, 2017. doi: 10.3389/fnhum.2017.00334. URLhttps://doi.org/10.3389/fnhum.2017.00334. Published on June 27, 2017

work page doi:10.3389/fnhum.2017.00334 2017
[32]

De- ciphering spatial domains from spatial multi-omics with spatialglue.Nature Methods, 21(9): 1658–1667, September 2024

Yichen Long, Ka Shing Ang, Ritambhara Sethi, Guanghua Xiao, and Guo-Cheng Yuan. De- ciphering spatial domains from spatial multi-omics with spatialglue.Nature Methods, 21(9): 1658–1667, September 2024. doi: 10.1038/s41592-024-02316-4. URLhttps://doi. org/10.1038/s41592-024-02316-4

work page doi:10.1038/s41592-024-02316-4 2024
[33]

Lessons and insights from a unifying study of parameter-efficient fine-tuning (peft) in visual recognition, 2025

Zheda Mai, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Li Zhang, and Wei-Lun Chao. Lessons and insights from a unifying study of parameter-efficient fine-tuning (peft) in visual recognition, 2025. URLhttps://arxiv.org/abs/2409.16434

work page arXiv 2025
[34]

Transfer learning with affine model transformation

Shunya Minami, Kenji Fukumizu, Yoshihiro Hayashi, and Ryo Yoshida. Transfer learning with affine model transformation. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc

work page 2023
[35]

Adaptive approximation and generalization of deep neural network with intrinsic dimensionality.Journal of Machine Learning Research, 21(174): 1–38, 2020

Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality.Journal of Machine Learning Research, 21(174): 1–38, 2020

work page 2020
[36]

Qual- ity not quantity: On the interaction between dataset design and robustness of clip, 2023

Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, and Ludwig Schmidt. Qual- ity not quantity: On the interaction between dataset design and robustness of clip, 2023. URL https://arxiv.org/abs/2208.05516

work page arXiv 2023
[37]

Credit score classification.https://www.kaggle.com/datasets/ rohanparis/credit-score-classification, 2022

Rohan Paris. Credit score classification.https://www.kaggle.com/datasets/ rohanparis/credit-score-classification, 2022. Kaggle Dataset, CC0: Public Domain

work page 2022
[38]

Moment matching for multi-source domain adaptation

Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. InProceedings of the IEEE International Con- ference on Computer Vision, pp. 1406–1415, 2019

work page 2019
[39]

Optimal approximation of piecewise smooth functions using deep relu neural networks.Neural Networks, 108:296–330, 2018

Philipp Petersen and Felix V oigtlaender. Optimal approximation of piecewise smooth functions using deep relu neural networks.Neural Networks, 108:296–330, 2018

work page 2018
[40]

Transfusion: Understand- ing transfer learning for medical imaging

Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understand- ing transfer learning for medical imaging. InAdvances in Neural Information Processing Systems (NeurIPS 2019), pp. 3347–3357, Red Hook, NY , USA, 2019. Curran Associates, Inc

work page 2019
[41]

Nonparametric regression using deep neural networks with relu activation function.The Annals of Statistics, 2020

Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function.The Annals of Statistics, 2020

work page 2020
[42]

Monocular visual-inertial odometry in low-textured environments with smooth gradients: A fully dense direct ﬁltering approach,

Michael J. Sorocky, Siqi Zhou, and Angela P. Schoellig. Experience selection using dy- namics similarity for efficient multi-source transfer learning between robots. In2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 2739–2745, 2020. doi: 10.1109/ICRA40945.2020.9196744

work page doi:10.1109/icra40945.2020.9196744 2020
[43]

Multimodal deep learning for biomedical data fusion: a review.Briefings in Bioinformatics, 23(2):bbab569, 01

S ¨oren Richard Stahlschmidt, Benjamin Ulfenborg, and Jane Synnergren. Multimodal deep learning for biomedical data fusion: a review.Briefings in Bioinformatics, 23(2):bbab569, 01

work page
[44]

doi: 10.1093/bib/bbab569

ISSN 1477-4054. doi: 10.1093/bib/bbab569. URLhttps://doi.org/10.1093/ bib/bbab569. 13 Published as a conference paper at ICLR 2026

work page doi:10.1093/bib/bbab569 2026
[45]

Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality

Taiji Suzuki. Adaptivity of deep relu network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality.arXiv preprint arXiv:1810.08033, 2018

[46]

Tedo. Students performance: analysis and classification. https://www.kaggle.com/code/tedo/students-performance-analysis-and-classification. Kaggle Notebook, Version 4, Apache 2.0 License.

[48]

Karthik Chowdary Tsaliki. Diabetes classification (pima indians diabetes database). https://www.kaggle.com/competitions/diabetes-classification, 2019. Kaggle Community Prediction Competition.

[49]

Zirui Wang, Zihang Dai, Barnabás Póczos, and Jaime Carbonell. Characterizing and avoiding negative transfer. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11285–11294, 2019. doi: 10.1109/CVPR.2019.01155.

[50]

Dongrui Wu. Online and offline domain adaptation for reducing bci calibration effort. IEEE Transactions on Human-Machine Systems, 47(4):550–563, 2017. doi: 10.1109/THMS.2016.2608931.

[51]

Ge Xie, Yu Sun, Minlong Lin, and Ke Tang. A selective transfer learning method for concept drift adaptation. In Fengyu Cong, Andrew Leung, and Qinglai Wei (eds.), Advances in Neural Networks - ISNN 2017, pp. 353–361, Cham, 2017. Springer International Publishing. ISBN 978-3-319-59081-3.

[52]

Yi Yao and Gianfranco Doretto. Boosting for transfer learning with multiple sources. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1855–1862, 2010. doi: 10.1109/CVPR.2010.5539857.

[53]

Wen Zhang, Lingfei Deng, Lei Zhang, and Dongrui Wu. A survey on negative transfer. IEEE/CAA Journal of Automatica Sinica, 10(2):305–329, 2023. doi: 10.1109/JAS.2022.106004.

[54]

Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pp. 27179–27202. PMLR, 2022.

APPENDICES

In the appendices, we provide additio...

Moreover, in addition to CNNs, we also evaluate both CIFAR-10 and CIFAR-100 with transformer-based models. Table S1 reports the results on CIFAR-100 with CNNs. Similar to CIFAR-10, REFINE consistently outperforms the baseline methods under all four stress scenarios. In particular, in the extreme noise setting with 80% label flips, most competing methods co...
