Recognition: unknown
BAM! Born-Again Multi-Task Networks for Natural Language Understanding
Abstract
It can be challenging to train multi-task neural networks that outperform or even match their single-task counterparts. To help address this, we propose using knowledge distillation where single-task models teach a multi-task model. We enhance this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers. We evaluate our approach by multi-task fine-tuning BERT on the GLUE benchmark. Our method consistently improves over standard single-task and multi-task training.
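The abstract describes teacher annealing as a gradual shift from matching single-task teachers to ordinary supervised learning on gold labels. A minimal sketch of such a loss is shown below; the linear schedule, the KL-based distillation term, and all function names are assumptions for illustration, not details taken from this page.

```python
import torch
import torch.nn.functional as F

def annealed_distillation_loss(student_logits, teacher_logits, gold_labels,
                               step, total_steps):
    """Teacher-annealed loss (sketch): interpolate between distillation
    from a single-task teacher and supervised learning on gold labels."""
    # Assumed linear schedule: lam goes from 0 (pure distillation)
    # to 1 (pure supervised learning) over the course of training.
    lam = step / total_steps

    # Distillation term: match the teacher's output distribution.
    distill = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    # Supervised term: standard cross-entropy against gold labels.
    supervised = F.cross_entropy(student_logits, gold_labels)

    return lam * supervised + (1.0 - lam) * distill
```

Annealing toward the gold labels is what lets the multi-task student eventually surpass its single-task teachers rather than merely imitate them.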
This paper has not been read by Pith yet.
Forward citations
Cited by 3 Pith papers
- OPT: Open Pre-trained Transformer Language Models. OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
- SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and a leaderboard to drive further progress beyond GLUE.