{"work":{"id":"756f9764-ecd6-4672-8043-b37c698c7ad2","openalex_id":null,"doi":null,"arxiv_id":"1910.01108","raw_key":null,"title":"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter","authors":null,"authors_text":"Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf","year":2019,"venue":"cs.CL","abstract":"As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.","external_url":"https://arxiv.org/abs/1910.01108","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T12:43:25.380222+00:00","pith_arxiv_id":"1910.01108","created_at":"2026-05-09T03:30:04.320227+00:00","updated_at":"2026-06-29T12:43:25.380222+00:00","title_quality_ok":true,"display_title":"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter","render_title":"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter"},"hub":{"state":{"work_id":"756f9764-ecd6-4672-8043-b37c698c7ad2","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":171,"external_cited_by_count":null,"distinct_field_count":19,"first_pith_cited_at":"2019-10-09T03:23:22+00:00","last_pith_cited_at":"2026-06-18T03:07:47+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T13:08:47.292567+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":18},{"context_role":"method","n":11}],"polarity_counts":[{"context_polarity":"background","n":18},{"context_polarity":"use_method","n":11}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter","claims":[{"claim_text":"As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge di","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"duration ofTseconds, is encoded by a pretrained CLAP [33]. Specifically, the waveform is first converted into a log-Mel spectrogram with 64-bin filter banks, and then processed by the HTS-AT [34] to obtain audio embeddingsF A ∈R T×d A, whered A denotes the audio feature dimension. The referring expression is processed by a pretrained DistilRoBERTa [35], [36], producing text embeddingsF T ∈R L×dT , whereLis the number of tokens in the expression andd T is the text feature dimension. Modality Prio","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"egories (World, Sports, Business, and Sci/Tech) with 120,000 training samples and 7,600 test samples; 2)Yahoo! Answers dataset[40], a large-scale topic classifi- cation corpus comprising 10 categories with 1.4 million training samples and 60,000 test samples. We consider three pretrained LLM backbones with different architectures and parameter scales: •DistilBERT[41]: an encoder-only model with approx- imately 67 million parameters, pretrained on English corpora including BookCorpus and English ","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Knowledge distillation (KD) [12] for large language models (LLMs) [41] has become an essential research direction for enhancing the accessibility and efficiency of Machine-Learning-as-a-Service (MLaaS) [3, 35, 37]. Through knowledge distillation, a small LLM learns to imitate a large teacher LLM's outputs, enabling effective deployment under limited computational or financial resources [25, 14, 31, 16, 18]. Recent studies show that capability distillation, a specialization of knowledge distillat","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"2019, Anchorage, AK, USA, August 4-8, 2019, Ankur Teredesai, Vipin Kumar, Ying Li, Rómer Rosales, Evimaria Terzi, and George Karypis (Eds.). ACM, 395-405. https://doi.org/10.1145/3292500.3330935 [58] Miloš Stanojević and Khalil Sima'an. 2014. Beer: Better evaluation as ranking. InProceedings of the Ninth Workshop on Statistical Machine Translation. 414-419. [59] Jinyan Su, Terry Yue Zhuo, Jonibek Mansurov, Di Wang, and Preslav Nakov. 2023. Fake News Detectors are Biased against Texts Generated b","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"embeddings guide backbone fine-tuning to enhance visual feature extraction. Efficiency is improved via a (b) simplified detector head and (c) video-center learning. trained network estimating the IoU score of candidateV i conditioned on queryS. 3.2. Overall Framework Figure 2 illustrates our proposed end-to-end framework. During the feature extraction phase, sentence embeddings encoded by DistilBERT [50] are utilized by the proposed SCADA module, which is interleaved within the visual backbone. ","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"other baselines such as ILRR [ 14] require a reference sequence (a different problem setup), and classifier-based guidance methods [37] require differentiable classifiers integrated into the denoising loop and are not directly applicable to masked discrete diffusion. Metrics:Steering performance is measured by off-the-shelf classifier confidence: DistilBERT- SST2 [38] for sentiment, BERT-AG News [39] for topic, and a RoBERTa formality ranker [33] for 8 0 25 50 75 100Conf (%) Sentiment (S) 0 25 5","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (7 contexts).","role_counts":[{"n":7,"context_role":"background"},{"n":7,"context_role":"method"}]},"error":null,"updated_at":"2026-05-17T23:40:26.346290+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"b76d04f0-8622-425c-b09b-0bc1f0faad92","orcid":null,"display_name":"Victor Sanh"},{"id":"e2a4b542-249d-4cf8-a080-f952f9cec53f","orcid":null,"display_name":"Lysandre Debut"},{"id":"2bef5af3-74f5-4e39-ae7f-a403546c6609","orcid":null,"display_name":"Julien Chaumond"},{"id":"5a41e515-fe1b-4893-9745-06f023336c63","orcid":null,"display_name":"Thomas Wolf"}]},"error":null,"updated_at":"2026-05-17T23:40:27.086955+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T07:37:56.417531+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":22},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":21},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":11},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":11},{"title":"ALBERT: A Lite BERT for Self-supervised Learning of Language Representations","work_id":"aedf7950-7c35-4e28-a32d-bec290f51669","shared_citers":8},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":8},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":7},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":7},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":6},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":6},{"title":"doi: 10.18653/v1/N19-1423","work_id":"3e3c8ac8-b858-4b22-af32-393d98c883e0","shared_citers":5},{"title":"Efficient Estimation of Word Representations in Vector Space","work_id":"59edaa01-a696-45b3-9a08-5eae777a799e","shared_citers":5},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":5},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":5},{"title":"MiniLLM: On-Policy Distillation of Large Language Models","work_id":"16edb291-dd18-41c5-8486-c6c715ec5311","shared_citers":5},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":5},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":5},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":5},{"title":"arXiv preprint arXiv:1802.05365 , year=","work_id":"dd973cba-647d-49d3-9d24-061b637bb0cd","shared_citers":4},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":4},{"title":"ELECTRA: Pre-training text encoders as discriminators rather than generators","work_id":"82ddcaa7-a02b-4cba-9c4f-949f522684d5","shared_citers":4},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":4},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":4}],"time_series":[{"n":2,"year":2019},{"n":3,"year":2020},{"n":4,"year":2023},{"n":3,"year":2024},{"n":1,"year":2025},{"n":63,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T07:47:51.064170+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T07:38:00.526673+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter","claims":[{"claim_text":"As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge di","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"duration ofTseconds, is encoded by a pretrained CLAP [33]. Specifically, the waveform is first converted into a log-Mel spectrogram with 64-bin filter banks, and then processed by the HTS-AT [34] to obtain audio embeddingsF A ∈R T×d A, whered A denotes the audio feature dimension. The referring expression is processed by a pretrained DistilRoBERTa [35], [36], producing text embeddingsF T ∈R L×dT , whereLis the number of tokens in the expression andd T is the text feature dimension. Modality Prio","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"egories (World, Sports, Business, and Sci/Tech) with 120,000 training samples and 7,600 test samples; 2)Yahoo! Answers dataset[40], a large-scale topic classifi- cation corpus comprising 10 categories with 1.4 million training samples and 60,000 test samples. We consider three pretrained LLM backbones with different architectures and parameter scales: •DistilBERT[41]: an encoder-only model with approx- imately 67 million parameters, pretrained on English corpora including BookCorpus and English ","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Knowledge distillation (KD) [12] for large language models (LLMs) [41] has become an essential research direction for enhancing the accessibility and efficiency of Machine-Learning-as-a-Service (MLaaS) [3, 35, 37]. Through knowledge distillation, a small LLM learns to imitate a large teacher LLM's outputs, enabling effective deployment under limited computational or financial resources [25, 14, 31, 16, 18]. Recent studies show that capability distillation, a specialization of knowledge distillat","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"2019, Anchorage, AK, USA, August 4-8, 2019, Ankur Teredesai, Vipin Kumar, Ying Li, Rómer Rosales, Evimaria Terzi, and George Karypis (Eds.). ACM, 395-405. https://doi.org/10.1145/3292500.3330935 [58] Miloš Stanojević and Khalil Sima'an. 2014. Beer: Better evaluation as ranking. InProceedings of the Ninth Workshop on Statistical Machine Translation. 414-419. [59] Jinyan Su, Terry Yue Zhuo, Jonibek Mansurov, Di Wang, and Preslav Nakov. 2023. Fake News Detectors are Biased against Texts Generated b","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"embeddings guide backbone fine-tuning to enhance visual feature extraction. Efficiency is improved via a (b) simplified detector head and (c) video-center learning. trained network estimating the IoU score of candidateV i conditioned on queryS. 3.2. Overall Framework Figure 2 illustrates our proposed end-to-end framework. During the feature extraction phase, sentence embeddings encoded by DistilBERT [50] are utilized by the proposed SCADA module, which is interleaved within the visual backbone. ","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"other baselines such as ILRR [ 14] require a reference sequence (a different problem setup), and classifier-based guidance methods [37] require differentiable classifiers integrated into the denoising loop and are not directly applicable to masked discrete diffusion. Metrics:Steering performance is measured by off-the-shelf classifier confidence: DistilBERT- SST2 [38] for sentiment, BERT-AG News [39] for topic, and a RoBERTa formality ranker [33] for 8 0 25 50 75 100Conf (%) Sentiment (S) 0 25 5","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (7 contexts).","role_counts":[{"n":7,"context_role":"background"},{"n":7,"context_role":"method"}]},"error":null,"updated_at":"2026-05-17T23:40:26.351557+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter","claims":[{"claim_text":"As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge di","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T07:47:53.077109+00:00"}},"summary":{"title":"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter","claims":[{"claim_text":"As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge di","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":22},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":21},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":11},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":11},{"title":"ALBERT: A Lite BERT for Self-supervised Learning of Language Representations","work_id":"aedf7950-7c35-4e28-a32d-bec290f51669","shared_citers":8},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":8},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":7},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":7},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":6},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":6},{"title":"doi: 10.18653/v1/N19-1423","work_id":"3e3c8ac8-b858-4b22-af32-393d98c883e0","shared_citers":5},{"title":"Efficient Estimation of Word Representations in Vector Space","work_id":"59edaa01-a696-45b3-9a08-5eae777a799e","shared_citers":5},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":5},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":5},{"title":"MiniLLM: On-Policy Distillation of Large Language Models","work_id":"16edb291-dd18-41c5-8486-c6c715ec5311","shared_citers":5},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":5},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":5},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":5},{"title":"arXiv preprint arXiv:1802.05365 , year=","work_id":"dd973cba-647d-49d3-9d24-061b637bb0cd","shared_citers":4},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":4},{"title":"ELECTRA: Pre-training text encoders as discriminators rather than generators","work_id":"82ddcaa7-a02b-4cba-9c4f-949f522684d5","shared_citers":4},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":4},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":4}],"time_series":[{"n":2,"year":2019},{"n":3,"year":2020},{"n":4,"year":2023},{"n":3,"year":2024},{"n":1,"year":2025},{"n":63,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"2bef5af3-74f5-4e39-ae7f-a403546c6609","orcid":null,"display_name":"Julien Chaumond","source":"manual","import_confidence":0.72},{"id":"e2a4b542-249d-4cf8-a080-f952f9cec53f","orcid":null,"display_name":"Lysandre Debut","source":"manual","import_confidence":0.72},{"id":"5a41e515-fe1b-4893-9745-06f023336c63","orcid":null,"display_name":"Thomas Wolf","source":"manual","import_confidence":0.72},{"id":"b76d04f0-8622-425c-b09b-0bc1f0faad92","orcid":null,"display_name":"Victor Sanh","source":"manual","import_confidence":0.72}]}}