Recognition: unknown
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
Abstract
In the natural language processing literature, neural networks are becoming increasingly deeper and complex. The recent poster child of this trend is the deep language representation model, which includes BERT, ELMo, and GPT. These developments have led to the conviction that previous-generation, shallower neural networks for language understanding are obsolete. In this paper, however, we demonstrate that rudimentary, lightweight neural networks can still be made competitive without architecture changes, external training data, or additional input features. We propose to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks. Across multiple datasets in paraphrasing, natural language inference, and sentiment classification, we achieve comparable results with ELMo, while using roughly 100 times fewer parameters and 15 times less inference time.
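The recipe described in the abstract is simple enough to sketch. Below is a minimal, illustrative PyTorch version of the idea: a single-layer BiLSTM student whose logits are pulled toward a fine-tuned BERT teacher's logits. The MSE-on-logits objective, the blending weight, and all layer sizes here are assumptions chosen for clarity, not the authors' exact configuration, and the teacher logits are treated as precomputed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMStudent(nn.Module):
    """Single-layer BiLSTM classifier of the kind the paper distills into.
    All sizes are illustrative, not the authors' exact configuration."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=150, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids):
        embedded = self.embed(token_ids)
        _, (h_n, _) = self.bilstm(embedded)
        # Concatenate the final forward and backward hidden states.
        sentence_repr = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.classifier(sentence_repr)

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Blend hard-label cross-entropy with an MSE term that pulls the
    student's logits toward the fine-tuned BERT teacher's logits."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.mse_loss(student_logits, teacher_logits)
    return alpha * hard + (1.0 - alpha) * soft

# Illustrative training step; `teacher_logits` would be produced offline by a
# fine-tuned BERT teacher on the same (possibly augmented) training sentences.
student = BiLSTMStudent(vocab_size=30000)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

token_ids = torch.randint(1, 30000, (8, 32))   # a batch of token-id sequences
labels = torch.randint(0, 2, (8,))             # gold labels
teacher_logits = torch.randn(8, 2)             # cached teacher logits (stand-in)

optimizer.zero_grad()
loss = distillation_loss(student(token_ids), teacher_logits, labels)
loss.backward()
optimizer.step()
```

For the sentence-pair tasks mentioned in the abstract, the paper uses a siamese counterpart of the same BiLSTM encoder shared across both sentences; the single-sentence case above is only meant to show the shape of the distillation loss.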
This paper has not been read by Pith yet.
Forward citations
Cited by 4 Pith papers
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  DistilBERT compresses BERT by 40% via pre-training distillation with a triple loss, retaining 97% of BERT's performance and running 60% faster (a sketch of this kind of triple loss follows the citation list).
- Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
  CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.
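The "triple loss" in the DistilBERT entry above is, in its published form, the sum of a soft-target distillation loss, a masked-language-modelling loss, and a cosine loss aligning student and teacher hidden states. The sketch below only illustrates that combination; the temperature, equal weighting, and tensor shapes are placeholder assumptions, not DistilBERT's exact training setup.

```python
import torch
import torch.nn.functional as F

def distilbert_style_triple_loss(student_logits, teacher_logits,
                                 student_hidden, teacher_hidden,
                                 mlm_labels, temperature=2.0):
    """Illustrative sum of (1) soft-target distillation, (2) masked-LM loss,
    and (3) a cosine loss aligning student and teacher hidden states.
    Logits: (batch, seq, vocab); hidden states: (batch, seq, dim)."""
    vocab = student_logits.size(-1)
    s_logits = student_logits.view(-1, vocab)
    t_logits = teacher_logits.view(-1, vocab)

    # (1) KL divergence between temperature-softened teacher and student distributions.
    loss_ce = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                       F.softmax(t_logits / temperature, dim=-1),
                       reduction="batchmean") * temperature ** 2

    # (2) Standard masked-language-modelling loss on the student's own predictions.
    loss_mlm = F.cross_entropy(s_logits, mlm_labels.view(-1), ignore_index=-100)

    # (3) Cosine embedding loss pulling student hidden states toward the teacher's.
    dim = student_hidden.size(-1)
    s_hidden = student_hidden.view(-1, dim)
    t_hidden = teacher_hidden.view(-1, dim)
    target = torch.ones(s_hidden.size(0), device=s_hidden.device)
    loss_cos = F.cosine_embedding_loss(s_hidden, t_hidden, target)

    return loss_ce + loss_mlm + loss_cos

# Toy usage with random tensors standing in for real teacher/student outputs.
B, T, V, D = 2, 8, 1000, 64
loss = distilbert_style_triple_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                                    torch.randn(B, T, D), torch.randn(B, T, D),
                                    torch.randint(0, V, (B, T)))
```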
discussion (0)