DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Pith reviewed 2026-05-11 04:58 UTC · model grok-4.3
The pith
DistilBERT is a 40% smaller version of BERT that retains 97% of its language understanding while running 60% faster.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce DistilBERT, a smaller general-purpose language representation model pre-trained using knowledge distillation, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
What carries the argument
Triple loss combining language modeling, distillation, and cosine-distance losses during pre-training to transfer knowledge to the smaller student model.
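The triple loss can be illustrated with a minimal pure-Python sketch for a single masked token. This is not the authors' implementation: the function names, the equal default weights, and the temperature value are hypothetical placeholders, and the distillation term is written as a cross-entropy against temperature-softened teacher probabilities, which is one common way to express it.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

def mlm_loss(student_logits, target_index):
    """Masked language modeling loss: negative log-likelihood of the true token."""
    return -math.log(softmax(student_logits)[target_index])

def cosine_loss(student_hidden, teacher_hidden):
    """Cosine distance (1 - cosine similarity) between hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                w_distill=1.0, w_mlm=1.0, w_cos=1.0, temperature=2.0):
    """Weighted sum of the three terms; the weights are illustrative assumptions."""
    return (w_distill * distillation_loss(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))

# Toy example: a 3-token vocabulary and 2-dimensional hidden states.
loss = triple_loss([2.0, 0.5, -1.0], [2.5, 0.0, -1.5], 0, [1.0, 0.2], [0.9, 0.3])
print(f"triple loss: {loss:.4f}")
```

In this sketch the cosine term vanishes when the student and teacher hidden vectors point in the same direction, and the distillation term is smallest when the student's softened distribution matches the teacher's.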
Load-bearing premise
The combination of language modeling, distillation, and cosine-distance losses transfers enough knowledge from the full BERT teacher to the smaller student without requiring the full model capacity or additional task-specific supervision.
What would settle it
If DistilBERT's fine-tuned performance on standard NLP benchmarks falls below 97% of BERT's scores or if measured inference speed gains are less than 60% in direct side-by-side tests, the central claims would not hold.
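The test described above reduces to two ratio checks, sketched below. The function names and the example inputs are hypothetical, and reading "60% faster" as a throughput gain (teacher time divided by student time, minus one) is an assumption, since this page does not define the measure.

```python
def retention(student_score, teacher_score):
    """Fraction of the teacher's benchmark score that the student keeps."""
    return student_score / teacher_score

def speedup_percent(teacher_seconds, student_seconds):
    """Throughput gain of the student over the teacher, in percent (assumed reading)."""
    return (teacher_seconds / student_seconds - 1.0) * 100.0

def central_claims_hold(student_score, teacher_score,
                        teacher_seconds, student_seconds,
                        min_retention=0.97, min_speedup=60.0):
    """True only if both thresholds quoted in the claim are met."""
    return (retention(student_score, teacher_score) >= min_retention
            and speedup_percent(teacher_seconds, student_seconds) >= min_speedup)

# Hypothetical measurements for illustration only, not results from the paper.
print(central_claims_hold(78.0, 80.0, 100.0, 60.0))
print(central_claims_hold(78.0, 80.0, 100.0, 80.0))
```

A different definition of "faster" (for example, percentage of wall-clock time saved) would change the speed threshold, so any side-by-side test should state which one it uses.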
Original abstract
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DistilBERT, a 6-layer distilled version of BERT-base (66M parameters) pre-trained with a triple loss combining masked language modeling, knowledge distillation, and cosine-distance embedding losses. It claims a 40% size reduction while retaining 97% of BERT's performance on language understanding tasks, 60% faster inference, and suitability for on-device use, supported by evaluations on GLUE (average 97% relative score), SQuAD, IMDB, loss ablations, a from-scratch 6-layer baseline comparison, and CPU/GPU speed measurements with reported batch sizes.
Significance. If the empirical results hold, the work offers a practical pre-training distillation method that enables smaller general-purpose language models without task-specific supervision, directly addressing deployment constraints. Strengths include the ablation evidence in §3.3 showing each loss term's contribution, the underperformance of the non-distilled baseline, and concrete inference timings, which together provide reproducible support for the central efficiency claims.
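The 40% size figure quoted in the summary can be sanity-checked from parameter counts. The sketch below takes the 66M count from the summary and assumes roughly 110M parameters for BERT-base, a figure from the BERT paper that does not appear on this page; the helper name is hypothetical.

```python
def size_reduction(teacher_params, student_params):
    """Fraction of the teacher's parameters removed in the student."""
    return 1.0 - student_params / teacher_params

BERT_BASE_PARAMS = 110_000_000   # assumption: approximate count reported for BERT-base
DISTILBERT_PARAMS = 66_000_000   # from the referee summary (66M parameters)

reduction = size_reduction(BERT_BASE_PARAMS, DISTILBERT_PARAMS)
print(f"size reduction: {reduction:.0%}")
```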
minor comments (2)
- [Abstract] Performance claims (97% retention, 60% speedup) are stated without reference to the specific downstream tasks or to variance; while §3 and the tables provide these details, a one-sentence qualifier on evaluation scope would improve standalone readability.
- [§3.3] The ablation results demonstrate the value of each loss component, but the table does not report run-to-run variance or the number of seeds; adding these would make the evidence for the cosine term's contribution more robust.
Simulated Author's Rebuttal
We thank the referee for their positive summary of the DistilBERT paper, recognition of its practical contributions to model compression, and recommendation for minor revision. We are pleased that the ablation evidence, baseline comparisons, and concrete speed measurements were noted as providing reproducible support for the claims.
Circularity Check
No significant circularity identified
The paper's central contribution is an empirical training procedure: a 6-layer student model is pre-trained on the same corpus as BERT-base using a composite loss (MLM + distillation + cosine embedding) and then evaluated on GLUE, SQuAD, and IMDB. All reported performance numbers (97% relative GLUE score, 60% speed-up, 40% size reduction) are obtained by direct measurement after training; no equation or prediction is shown to be mathematically identical to a fitted parameter or to a self-citation chain. Ablations in §3.3 and the from-scratch baseline comparison further demonstrate that the result is not forced by construction. The work therefore rests on externally verifiable experimental outcomes rather than on any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Cost.FunctionalEquation washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses.
- Foundation.DAlembert.Inevitability bilinear_family_forced
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: DistilBERT (6 layers, 66M params) is compared to BERT-base on GLUE (avg. 97% relative score), SQuAD, and IMDB; ablations in §3.3 show each term of the triple loss (MLM + distillation + cosine) contributes
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Canonical Regularisation of Wide Feature-Learning Neural Networks
Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.
-
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...
-
Learning the Signature of Memorization in Autoregressive Language Models
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
-
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging
PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.
-
Distribution-free root cause analysis
CROC constructs finite-sample valid confidence sets for the root-cause index in multi-stream change detection using conformal p-values under independence and exchangeability assumptions.
-
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
The paper presents AIGaitor, a privacy-preserving on-device monocular motion analysis system that performs end-to-end pose estimation and deep learning gait analysis on consumer smartphones.
-
Layer-wise Token Compression for Efficient Document Reranking
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, ...
-
Layer-wise Token Compression for Efficient Document Reranking
Layer-wise Token Compression applies adaptive pooling at middle transformer layers to increase QPS by up to 116% on document ranking with little or no loss in quality.
-
TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
-
Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling
RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain o...
-
Differentially Private Motif-Preserving Multi-modal Hashing
DMP-MH clips degrees to control triangle sensitivity, synthesizes an edge-DP graph with Noisy Mirror Descent, and distills it into dual-stream hash networks, beating private baselines by up to 11.4 mAP on MIRFlickr-25...
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
-
When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity
Foundation model priors amplify worst-client disparity under extreme federated heterogeneity, creating a fairness paradox where larger models perform worse for disadvantaged clients.
-
Switchcraft: AI Model Router for Agentic Tool Calling
Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.
-
TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching Models
TRACE creates valid conformal prediction sets for complex generative models by scoring outputs via averaged denoising or velocity errors along stochastic transport paths instead of likelihoods.
-
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
-
DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...
-
VOW: Verifiable and Oblivious Watermark Detection for Large Language Models
VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.
-
Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
-
AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment pre...
-
Adaptive Head Budgeting for Efficient Multi-Head Attention
BudgetFormer adaptively budgets the number and selection of attention heads per input in Transformers, reducing FLOPs and memory on text classification while matching or exceeding standard multi-head performance.
-
RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian
RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.
-
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
SecureRouter: Encrypted Routing for Efficient Secure Inference
SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.
-
Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data
Synthetic data of 1M+ multi-label samples across 23 languages trains models that match or exceed English-only specialists on zero-shot benchmarks for emotion classification.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
Kathleen performs byte-level text classification via recurrent oscillator banks, FFT wavetable encoding, and phase harmonics, matching pretrained baselines on standard benchmarks with 36% fewer parameters.
-
Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
Kathleen uses recurrent oscillator banks, an efficient wavetable encoder, and phase harmonics to classify text at the byte level with high accuracy and low parameter count.
-
A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos
Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.
-
Explainable Semantic Textual Similarity via Dissimilar Span Detection
Introduces the Dissimilar Span Detection task and Span Similarity Dataset to explain semantic textual similarity by identifying differing spans between text pairs.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
-
DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
DualGuard uses adaptive dual-stream watermark signals to detect and trace both paraphrase and spoofing attacks in LLM outputs while preserving text quality.
-
Language-Conditioned Safe Trajectory Generation for Spacecraft Rendezvous
SAGES translates natural-language commands into constraint-respecting spacecraft trajectories, achieving over 90% semantic-behavioral consistency in proximity operations and robotic tests.
-
SAM 3: Segment Anything with Concepts
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
-
Task complexity shapes internal representations and robustness in neural networks
Harder classification tasks produce neural representations whose accuracy collapses under binarization and shuffling while easier tasks remain robust, defining task complexity via the performance gap between full-prec...
-
A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
DBAC is a new directional metric for bias amplification in image captions that is less sensitive to sentence encoders and more accurate than LIC, validated on COCO gender and race attributes.
-
Post-detection inference for sequential changepoint localization
Develops a general nonparametric framework for constructing non-asymptotically valid confidence sets for changepoint location using data up to an arbitrary detection stopping time.
-
Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Neurons exhibit concept-conditioned activation ranges forming Gaussian-like distributions with minimal overlap, and range-based interventions via NeuronLens outperform neuron-level masking in targeted manipulation wit...
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
Accelerating Large Language Model Decoding with Speculative Sampling
Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
-
Strong Teacher Not Needed? On Distillation in LLM Pretraining
Even small or undertrained teachers improve larger LLM students via distillation with tuned loss mixing, while stronger teachers can saturate or reverse gains and distillation aids generalization more than in-domain fit.
-
Multimodal Distribution Matching for Vision-Language Dataset Distillation
MDM distills vision-language datasets via joint embedding clustering, weight-space model interpolation, and geometry-aware distribution matching on the unit hypersphere.
-
Convex Optimization for Alignment and Preference Learning on a Single GPU
COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models...
-
Proxy-Based Approximation of Shapley and Banzhaf Interactions
ProxySHAP uses tree proxies plus residual correction to achieve state-of-the-art approximation of Shapley and Banzhaf interactions, with a polynomial-time exact method for tree ensembles.
-
Proxy-Based Approximation of Shapley and Banzhaf Interactions
ProxySHAP approximates higher-order Shapley and Banzhaf interactions via tree proxies plus residual correction and a polynomial-time interventional TreeSHAP generalization for tree ensembles.
-
Post-Trained MoE Can Skip Half Experts via Self-Distillation
ZEDA injects zero-output experts and uses two-stage self-distillation to adapt post-trained MoE models into dynamic ones that skip over half the experts, yielding 1.2x inference speedup with small accuracy drops.
-
DP-SelFT: Differentially Private Selective Fine-Tuning for Large Language Models
DP-SelFT improves the privacy-utility trade-off for LLM fine-tuning by selecting robust layer subsets via DP synthetic data and perturbation-matched evaluation.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...
-
On the Burden of Achieving Fairness in Conformal Prediction
Pooled conformal calibration incurs irreducible group-wise coverage distortion set by cross-group quantile heterogeneity, and Equalized Coverage and Equalized Set Size are in fundamental tension.
-
On the Burden of Achieving Fairness in Conformal Prediction
Pooled conformal calibration incurs irreducible group-wise coverage distortion scaled by cross-group quantile heterogeneity, with Equalized Coverage and Equalized Set Size in fundamental tension.
-
Distribution Corrected Offline Data Distillation for Large Language Models
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
-
N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.
-
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
-
BoolXLLM: LLM-Assisted Explainability for Boolean Models
BoolXLLM augments an existing Boolean rule learner with LLMs for feature selection, discretization thresholds, and natural-language rule translation to improve interpretability while preserving accuracy.
-
Unified Approach for Weakly Supervised Multicalibration
A unified framework uses contamination-matrix risk rewrites and witness-based calibration constraints to estimate and correct multicalibration under weak supervision, providing finite-sample guarantees and the WLMC po...
-
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.