One Model To Learn Them All

Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit · 2017 · cs.LG · arXiv 1706.05137

7 Pith papers cite this work. Polarity classification is still indexing.

abstract

Deep learning yields great results across many fields, from speech recognition and image classification to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning. We present a single model that yields good results on a number of problems spanning multiple domains. In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task. Our model architecture incorporates building blocks from multiple domains. It contains convolutional layers, an attention mechanism, and sparsely-gated layers. Each of these computational blocks is crucial for a subset of the tasks we train on. Interestingly, even if a block is not crucial for a task, we observe that adding it never hurts performance and in most cases improves it on all tasks. We also show that tasks with less data benefit largely from joint training with other tasks, while performance on large tasks degrades only slightly if at all.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Perceiver IO: A General Architecture for Structured Inputs & Outputs

cs.LG · 2021-07-30 · unverdicted · novelty 7.0

Perceiver IO is a general architecture that processes arbitrary structured inputs and outputs with linear scaling and achieves strong results on GLUE, Sintel optical flow, multi-task reasoning, and StarCraft II without task-specific components.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

CTRL: A Conditional Transformer Language Model for Controllable Generation

cs.CL · 2019-09-11 · unverdicted · novelty 6.0

CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.

Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition

eess.AS · 2025-09-10 · unverdicted · novelty 5.0

Sparse MERIT uses frame-wise sparse mixture-of-experts with task-specific gating on self-supervised speech features to jointly optimize enhancement and emotion recognition, reporting gains over baselines on MSP-Podcast at low SNR.

Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges

cs.CL · 2019-07-11 · unverdicted · novelty 5.0

A single multilingual NMT model for 103 languages trained on 25B examples demonstrates transfer learning benefits for low-resource languages.

Preventing overfitting in deep learning using differential privacy

cs.LG · 2026-03-12 · unverdicted · novelty 4.0

Differential privacy techniques can help prevent overfitting and improve generalization in deep neural networks.

Cross-lingual Data Transformation and Combination for Text Classification

cs.IR · 2019-06-23 · unverdicted · novelty 3.0

Cross-lingual data combined via translation or aligned embeddings can improve performance of CNN and RNN text classifiers.

citing papers explorer

Showing 7 of 7 citing papers.

Perceiver IO: A General Architecture for Structured Inputs & Outputs · cs.LG · 2021-07-30 · unverdicted · none · ref 39 · internal anchor
A Generalist Agent · cs.AI · 2022-05-12 · accept · none · ref 32
CTRL: A Conditional Transformer Language Model for Controllable Generation · cs.CL · 2019-09-11 · unverdicted · none · ref 20 · internal anchor
Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition · eess.AS · 2025-09-10 · unverdicted · none · ref 59 · internal anchor
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges · cs.CL · 2019-07-11 · unverdicted · none · ref 9 · internal anchor
Preventing overfitting in deep learning using differential privacy · cs.LG · 2026-03-12 · unverdicted · none · ref 20 · internal anchor
Cross-lingual Data Transformation and Combination for Text Classification · cs.IR · 2019-06-23 · unverdicted · none · ref 28 · internal anchor
