BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

Asa Cooper Stickland; Iain Murray

arxiv: 1902.02671 · v2 · pith:YOCS3DQBnew · submitted 2019-02-07 · cs.LG · cs.CL · stat.ML

classification cs.LG · cs.CL · stat.ML

keywords bert · multi-task · parameters · adaptation · attention · benchmark · fine-tuned · glue

Multi-task learning shares information between related tasks, sometimes reducing the number of parameters required. State-of-the-art results across multiple natural language understanding tasks in the GLUE benchmark have previously used transfer from a single large task: unsupervised pre-training with BERT, where a separate BERT model was fine-tuned for each task. We explore multi-task approaches that share a single BERT model with a small number of additional task-specific parameters. Using new adaptation modules, PALs or "projected attention layers", we match the performance of separately fine-tuned models on the GLUE benchmark with roughly 7 times fewer parameters, and obtain state-of-the-art results on the Recognizing Textual Entailment dataset.
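
The parameter saving described in the abstract comes from counting one shared model plus small per-task additions, instead of one full model per task. The sketch below only illustrates that arithmetic. The function name `parameter_ratio` and the sizes passed to it are hypothetical placeholders chosen for illustration; they are not figures reported by the paper.

```python
def parameter_ratio(num_tasks: int, base_params: float, extra_per_task: float) -> float:
    """Ratio of total parameters: separately fine-tuned models vs. one shared model.

    separate total = num_tasks * base_params
    shared total   = base_params + num_tasks * extra_per_task
    """
    if num_tasks <= 0 or base_params <= 0 or extra_per_task < 0:
        raise ValueError("expected num_tasks > 0, base_params > 0, extra_per_task >= 0")
    separate_total = num_tasks * base_params
    shared_total = base_params + num_tasks * extra_per_task
    return separate_total / shared_total


# Hypothetical sizes in arbitrary units (assumed for illustration, not from the paper):
# 8 tasks, a shared model of size 100, and 2 extra units of task-specific parameters per task.
ratio = parameter_ratio(num_tasks=8, base_params=100.0, extra_per_task=2.0)
print(f"separate models use {ratio:.2f}x the parameters of the shared model")
```

With no task-specific parameters at all, the ratio equals the number of tasks; each added task-specific parameter pulls it below that ceiling, which is why the additions have to stay small for the saving to remain close to the task count.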

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
cs.CV · 2026-05 · unverdicted · novelty 5.0

GeoStack composes multiple domain experts into VLMs with preserved base knowledge and O(1) inference time via geometric stacking and a weight-folding property.

To Tune or Not To Tune? How About the Best of Both Worlds?
cs.CL · 2019-07 · unverdicted · novelty 3.0

A sequential fine-tuning strategy for pre-trained language models reports modest accuracy gains of 4.7%, 0.99%, and 0.72% on semantic similarity, sequence labeling, and text classification tasks, respectively.