{"work":{"id":"9d86da8d-01d3-41af-a0d2-ee14897927a9","openalex_id":null,"doi":null,"arxiv_id":"1910.03771","raw_key":null,"title":"HuggingFace's Transformers: State-of-the-art Natural Language Processing","authors":null,"authors_text":"Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi","year":2019,"venue":"cs.CL","abstract":"Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. \\textit{Transformers} is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the art Transformer architectures under a unified API. Backing this library is a curated collection of pretrained models made by and available for the community. \\textit{Transformers} is designed to be extensible by researchers, simple for practitioners, and fast and robust in industrial deployments. The library is available at \\url{https://github.com/huggingface/transformers}.","external_url":"https://arxiv.org/abs/1910.03771","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-28T23:32:47.232754+00:00","pith_arxiv_id":"1910.03771","created_at":"2026-05-10T10:44:38.314683+00:00","updated_at":"2026-06-28T23:32:47.232754+00:00","title_quality_ok":true,"display_title":"HuggingFace's Transformers: State-of-the-art Natural Language Processing","render_title":"HuggingFace's Transformers: State-of-the-art Natural Language Processing"},"hub":{"state":{"work_id":"9d86da8d-01d3-41af-a0d2-ee14897927a9","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":104,"external_cited_by_count":null,"distinct_field_count":19,"first_pith_cited_at":"2020-05-22T21:34:34+00:00","last_pith_cited_at":"2026-06-25T16:02:14+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T11:58:42.109605+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":14},{"context_role":"method","n":8},{"context_role":"other","n":4}],"polarity_counts":[{"context_polarity":"background","n":14},{"context_polarity":"use_method","n":8},{"context_polarity":"unclear","n":4}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"HuggingFace's Transformers: State-of-the-art Natural Language Processing","claims":[{"claim_text":"Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. \\textit{Transformers} is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the art Transformer architectures under a unified API. Backing this library is a curated collection of pretrain","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"[249] Feng Xiong, Runxi Cheng, Wang Chen, Zhanqiu Zhang, Yiwen Guo, Chun Yuan, and Ruifeng Xu. 2024. Multi-task model merging via adaptive weight disentanglement.arXiv preprint arXiv:2411.18729(2024). [250] Jiashu Xu, Fei Wang, Mingyu Derek Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. 2024. Instructional fingerprinting of large language models.arXiv preprint arXiv:2401.12255(2024). [251] Zhengqi Xu, Ke Yuan, Huiqiong Wang, Yong Wang, Mingli Song, and Jie Song. 2024. Training-Free Pretrained M","claim_type":"other","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"into a GPT-2 model with 760M parameters, an OPT model with 13B parame- ters is distilled into a 2.7B parameter variant, and a LLaMA3 model with 13B parameters is distilled into a 8B parameter variant. Implementation Details.All experimental evaluations are performed with the PyTorch deep learning framework [26], in combination with the Hugging Face Transformers toolkit [37]. The computational tasks are run on a single NVIDIA A800 GPU with 80 GB of memory. We set the batch size to 32 and train th","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"across diverse QA benchmarks. Case studies on recently released benchmarks further confirm generalization to both text and vision-language tasks. 1 Introduction The rapid growth of open-source machine learning models has created an unprecedented opportunity for practitioners to build, customize, and deploy AI systems [1, 2]. Platforms such as HuggingFace [3] now host hundreds of thousands of models spanning diverse architectures, scales, and application domains. Faced with a new task or dataset,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"computing resources we have for this work. Based on these criteria, we select the following five mod- els: (1) Gemma-2B from Google DeepMind [11], (2) GPT- 2 from OpenAI [12], (3) LLaMA-7B from Meta [13], [40], (4) Mistral-7B from Mistral AI [14], and (5) Phi-1 from Microsoft [15] All five models are available through the Hugging Face transformers library [41]. In Figs. 4, we present the models utilized for training and evaluating the spectrum dataset. Each model represents a scaled implementati","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"V oyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023. [113] Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Rémi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. InProceedings of the Annual Conference of the Cognitive Science Society (CogSci), 2016. [114] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"+MemVR [68] 78.60+2.13 78.73+2.12 78.40+4.03 78.42+3.91 85.47+3.60 85.56+3.58 Adversarial +Ours 80.53+4.0682.17+5.5680.50+6.1381.70+7.1988.17+6.3088.23+6.25 framework. For fair comparison, all methods are implemented using the default hyperparameters from their official repositories. Implementation Details:We implement IVE using HuggingFace Transform- ers [48] and integrate it with beam search for decoding. All experiments are conducted on 8 NVIDIA H800 GPUs. The EMA smoothing coefficientγis fix","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks HuggingFace's Transformers: State-of-the-art Natural Language Processing because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (14 contexts).","role_counts":[{"n":14,"context_role":"background"},{"n":8,"context_role":"method"},{"n":4,"context_role":"other"}]},"error":null,"updated_at":"2026-06-26T04:44:29.213354+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"5a41e515-fe1b-4893-9745-06f023336c63","orcid":null,"display_name":"Thomas Wolf"},{"id":"e2a4b542-249d-4cf8-a080-f952f9cec53f","orcid":null,"display_name":"Lysandre Debut"},{"id":"b76d04f0-8622-425c-b09b-0bc1f0faad92","orcid":null,"display_name":"Victor Sanh"},{"id":"2bef5af3-74f5-4e39-ae7f-a403546c6609","orcid":null,"display_name":"Julien Chaumond"},{"id":"acb2429f-7749-4707-afb7-89f4114f05b6","orcid":null,"display_name":"Clement Delangue"},{"id":"9464b5b0-428b-49b7-b106-28544f45ee76","orcid":null,"display_name":"Anthony Moi"}]},"error":null,"updated_at":"2026-06-26T04:44:29.897909+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T16:02:26.865216+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":11},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":11},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":8},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":8},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":6},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":6},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":6},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":6},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":5},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":5},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":5},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":5},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":5},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":5},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":5},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":5},{"title":"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning","work_id":"fff3953b-5efb-4753-bee4-002f59995810","shared_citers":4},{"title":"Gemma: Open Models Based on Gemini Research and Technology","work_id":"a9ea2870-df28-40b8-a9e0-a7e9a116f793","shared_citers":4},{"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","shared_citers":4},{"title":"PyTorch: An Imperative Style, High-Performance Deep Learning Library","work_id":"c30b6d2c-7bb4-4ab0-8ef8-2015313610a9","shared_citers":4},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":4},{"title":"2 OLMo 2 Furious","work_id":"9ef0dc2b-fdfe-4f14-b235-ef7556dc709a","shared_citers":3}],"time_series":[{"n":1,"year":2020},{"n":1,"year":2021},{"n":2,"year":2022},{"n":4,"year":2023},{"n":3,"year":2024},{"n":35,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T16:02:22.985101+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T16:02:39.024100+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"HuggingFace's Transformers: State-of-the-art Natural Language Processing","claims":[{"claim_text":"Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. \\textit{Transformers} is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the art Transformer architectures under a unified API. Backing this library is a curated collection of pretrain","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"[249] Feng Xiong, Runxi Cheng, Wang Chen, Zhanqiu Zhang, Yiwen Guo, Chun Yuan, and Ruifeng Xu. 2024. Multi-task model merging via adaptive weight disentanglement.arXiv preprint arXiv:2411.18729(2024). [250] Jiashu Xu, Fei Wang, Mingyu Derek Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. 2024. Instructional fingerprinting of large language models.arXiv preprint arXiv:2401.12255(2024). [251] Zhengqi Xu, Ke Yuan, Huiqiong Wang, Yong Wang, Mingli Song, and Jie Song. 2024. Training-Free Pretrained M","claim_type":"other","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"into a GPT-2 model with 760M parameters, an OPT model with 13B parame- ters is distilled into a 2.7B parameter variant, and a LLaMA3 model with 13B parameters is distilled into a 8B parameter variant. Implementation Details.All experimental evaluations are performed with the PyTorch deep learning framework [26], in combination with the Hugging Face Transformers toolkit [37]. The computational tasks are run on a single NVIDIA A800 GPU with 80 GB of memory. We set the batch size to 32 and train th","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"across diverse QA benchmarks. Case studies on recently released benchmarks further confirm generalization to both text and vision-language tasks. 1 Introduction The rapid growth of open-source machine learning models has created an unprecedented opportunity for practitioners to build, customize, and deploy AI systems [1, 2]. Platforms such as HuggingFace [3] now host hundreds of thousands of models spanning diverse architectures, scales, and application domains. Faced with a new task or dataset,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"computing resources we have for this work. Based on these criteria, we select the following five mod- els: (1) Gemma-2B from Google DeepMind [11], (2) GPT- 2 from OpenAI [12], (3) LLaMA-7B from Meta [13], [40], (4) Mistral-7B from Mistral AI [14], and (5) Phi-1 from Microsoft [15] All five models are available through the Hugging Face transformers library [41]. In Figs. 4, we present the models utilized for training and evaluating the spectrum dataset. Each model represents a scaled implementati","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"V oyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023. [113] Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Rémi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. InProceedings of the Annual Conference of the Cognitive Science Society (CogSci), 2016. [114] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"+MemVR [68] 78.60+2.13 78.73+2.12 78.40+4.03 78.42+3.91 85.47+3.60 85.56+3.58 Adversarial +Ours 80.53+4.0682.17+5.5680.50+6.1381.70+7.1988.17+6.3088.23+6.25 framework. For fair comparison, all methods are implemented using the default hyperparameters from their official repositories. Implementation Details:We implement IVE using HuggingFace Transform- ers [48] and integrate it with beam search for decoding. All experiments are conducted on 8 NVIDIA H800 GPUs. The EMA smoothing coefficientγis fix","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks HuggingFace's Transformers: State-of-the-art Natural Language Processing because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (14 contexts).","role_counts":[{"n":14,"context_role":"background"},{"n":8,"context_role":"method"},{"n":4,"context_role":"other"}]},"error":null,"updated_at":"2026-06-26T04:44:29.900329+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"HuggingFace's Transformers: State-of-the-art Natural Language Processing","claims":[{"claim_text":"Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. \\textit{Transformers} is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the art Transformer architectures under a unified API. Backing this library is a curated collection of pretrain","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks HuggingFace's Transformers: State-of-the-art Natural Language Processing because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T16:02:31.803482+00:00"}},"summary":{"title":"HuggingFace's Transformers: State-of-the-art Natural Language Processing","claims":[{"claim_text":"Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. \\textit{Transformers} is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the art Transformer architectures under a unified API. Backing this library is a curated collection of pretrain","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks HuggingFace's Transformers: State-of-the-art Natural Language Processing because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":11},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":11},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":8},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":8},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":6},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":6},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":6},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":6},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":5},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":5},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":5},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":5},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":5},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":5},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":5},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":5},{"title":"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning","work_id":"fff3953b-5efb-4753-bee4-002f59995810","shared_citers":4},{"title":"Gemma: Open Models Based on Gemini Research and Technology","work_id":"a9ea2870-df28-40b8-a9e0-a7e9a116f793","shared_citers":4},{"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","shared_citers":4},{"title":"PyTorch: An Imperative Style, High-Performance Deep Learning Library","work_id":"c30b6d2c-7bb4-4ab0-8ef8-2015313610a9","shared_citers":4},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":4},{"title":"2 OLMo 2 Furious","work_id":"9ef0dc2b-fdfe-4f14-b235-ef7556dc709a","shared_citers":3}],"time_series":[{"n":1,"year":2020},{"n":1,"year":2021},{"n":2,"year":2022},{"n":4,"year":2023},{"n":3,"year":2024},{"n":35,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"9464b5b0-428b-49b7-b106-28544f45ee76","orcid":null,"display_name":"Anthony Moi","source":"manual","import_confidence":0.72},{"id":"acb2429f-7749-4707-afb7-89f4114f05b6","orcid":null,"display_name":"Clement Delangue","source":"manual","import_confidence":0.72},{"id":"2bef5af3-74f5-4e39-ae7f-a403546c6609","orcid":null,"display_name":"Julien Chaumond","source":"manual","import_confidence":0.72},{"id":"e2a4b542-249d-4cf8-a080-f952f9cec53f","orcid":null,"display_name":"Lysandre Debut","source":"manual","import_confidence":0.72},{"id":"5a41e515-fe1b-4893-9745-06f023336c63","orcid":null,"display_name":"Thomas Wolf","source":"manual","import_confidence":0.72},{"id":"b76d04f0-8622-425c-b09b-0bc1f0faad92","orcid":null,"display_name":"Victor Sanh","source":"manual","import_confidence":0.72}]}}