Scalable Transfer Learning with Expert Models

André Susano Pinto; Basil Mustafa; Carlos Riquelme; Cedric Renggli; Daniel Keysers; Joan Puigcerver; Neil Houlsby; Sylvain Gelly

arxiv: 2009.13239 · v1 · pith:7NSAFV5Pnew · submitted 2020-09-28 · 💻 cs.LG · cs.CV · stat.ML

Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Cedric Renggli, André Susano Pinto, Sylvain Gelly, Daniel Keysers, Neil Houlsby

classification 💻 cs.LG cs.CV stat.ML

keywords transfer, tasks, expert, representations, data, diverse, experts, strategy

Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant expert for each target task. This strategy scales the process of transferring to new tasks, since it does not revisit the pre-training data during transfer. Accordingly, it requires little extra compute per target task, and results in a speed-up of 2-3 orders of magnitude compared to competing approaches. Further, we provide an adapter-based architecture able to compress many experts into a single model. We evaluate our approach on two different data sources and demonstrate that it outperforms baselines on over 20 diverse vision tasks in both cases.
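The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which performance proxy is used, so this example assumes one: leave-one-out 1-nearest-neighbour accuracy computed on frozen expert features. The names `loo_knn_accuracy` and `select_expert` are hypothetical and are not taken from the paper. The point it shows is that ranking the experts only requires embedding the target data once per expert, with no fine-tuning and no access to the pre-training data.

```python
import numpy as np


def loo_knn_accuracy(features, labels):
    """Leave-one-out 1-nearest-neighbour accuracy on frozen features.

    Assumed stand-in for a "cheap-to-compute performance proxy".
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (features ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    # A point must not be its own nearest neighbour.
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)
    return float((labels[nearest] == labels).mean())


def select_expert(experts, inputs, labels):
    """Score every expert with the proxy and return the best one.

    experts: dict mapping a name to a callable that maps the inputs
             to an (n_examples, n_features) array of frozen features.
    """
    scores = {
        name: loo_knn_accuracy(embed(inputs), labels)
        for name, embed in experts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

In this sketch each expert is any callable that returns features, so the same function would work for separate expert networks or for one shared backbone with per-expert adapters, provided each exposes an embedding function.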

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

MoECodec: Image Compression for joint human and machine perception via Mixture-of-Experts
eess.IV 2026-06 unverdicted novelty 6.0

MoECodec replaces FFN layers with token-wise MoE plus stable routing and GShMLP experts to support multiple downstream tasks in a single image compression model.

ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL 2022-02 unverdicted novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...