{"work":{"id":"3e3c8ac8-b858-4b22-af32-393d98c883e0","openalex_id":null,"doi":"10.18653/v1/n19-1423","arxiv_id":null,"raw_key":null,"title":"BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding","authors":[{"given":"Jacob","family":"Devlin","sequence":"first","affiliation":[]},{"given":"Ming-Wei","family":"Chang","sequence":"additional","affiliation":[]},{"given":"Kenton","family":"Lee","sequence":"additional","affiliation":[]},{"given":"Kristina","family":"Toutanova","sequence":"additional","affiliation":[]}],"authors_text":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova","year":2019,"venue":"Proceedings of the 2019 Conference of the North","abstract":null,"external_url":"https://doi.org/10.18653/v1/n19-1423","cited_by_count":6639,"metadata_source":"doi_reference","metadata_fetched_at":"2026-06-29T07:33:13.053940+00:00","pith_arxiv_id":null,"created_at":"2026-05-08T16:50:03.281254+00:00","updated_at":"2026-06-29T07:33:13.053940+00:00","title_quality_ok":true,"display_title":"BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding","render_title":"BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding"},"hub":{"state":{"work_id":"3e3c8ac8-b858-4b22-af32-393d98c883e0","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external 
citations","pith_inbound_count":258,"external_cited_by_count":6639,"distinct_field_count":24,"first_pith_cited_at":"2019-09-26T07:06:13+00:00","last_pith_cited_at":"2026-06-25T11:08:49+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T12:38:46.216178+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":25},{"context_role":"method","n":7},{"context_role":"dataset","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":23},{"context_polarity":"use_method","n":7},{"context_polarity":"unclear","n":3},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"BERT : Pre-training of deep bidirectional transformers for language understanding","claims":[{"claim_text":"PaLM-E: An embodied multimodal language model. In A. Krause, E. Brun- skill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors,International Confer- ence on Machine Learning, ICML 2023, 23- 29 July 2023, Honolulu, Hawaii, USA, vol- ume 202 ofProceedings of Machine Learn- ing Research, pages 8469-8488. PMLR, 2023. URL https://proceedings. mlr.press/v202/driess23a.html. [37] G. Geigle, R. Timofte, and G. Glavaš. African or european swallow? bench- marking large vision-language models for ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"The retrieval system only manages to fetch informationabout Fleming's professional achievements in the discoveryof penicillin. However, the document does not provide informa-tion about his educational background, thus the model generates ahallucinatory answer. inappropriately activated, blindly retrieving inaccurate information and consequently leading to an undesirable response. 
Consequently, several studies [75, 204, 228, 378] have proposed to make a shift from passive retrieval to adaptive re","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"For example, forChemical Reactions, we ask Galactica to predict the products of the reaction in the chemical equation LaTeX. We mask out products in the description so the model is inferring based on the reactants only. An example is shown in Figure 10. 13 Galactica: A Large Language Model for Science Prompt Sulfuric acid reacts with sodium chloride, and gives_____and _____: \\[ \\ce{ NaCl + H2SO4 -> Generated Answer NaCl + H2SO4−−→NaHSO4 + HCl Figure 10: Chemical Reactions. We prompt based on a d","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks BERT : Pre-training of deep bidirectional transformers for language understanding because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (3 contexts).","role_counts":[{"n":3,"context_role":"background"}]},"error":null,"updated_at":"2026-05-15T01:56:28.819155+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"22085fc7-7abb-4659-9689-7df10e6d3927","orcid":null,"display_name":"Jacob Devlin"},{"id":"327371ce-3292-4a8d-bbf6-7b5ab0ea0cd3","orcid":null,"display_name":"Ming-Wei Chang"},{"id":"087af944-bc78-48cd-ad4b-ac60e15af05e","orcid":null,"display_name":"Kenton Lee"},{"id":"e75535f4-4fb0-4410-87f9-a8aa36851f7e","orcid":null,"display_name":"Kristina Toutanova"}]},"error":null,"updated_at":"2026-05-15T01:56:27.688982+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:29:42.996091+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"RoBERTa: A Robustly Optimized BERT Pretraining 
Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":27},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":13},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":11},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":10},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":10},{"title":"Sentence- BERT : Sentence embeddings using S iamese BERT -networks","work_id":"cf07889b-1f35-4d81-9514-4ad3ed223c57","shared_citers":10},{"title":"Dense passage retrieval for open-domain question answering","work_id":"083391f8-812d-430f-8d08-89a03031ce6c","shared_citers":7},{"title":"In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demon- strations (Oct 2020).https://doi.org/10.18653/v1/2020.emnlp-demos.6","work_id":"801601a0-1077-41ef-a947-ba81c0b7510f","shared_citers":7},{"title":"Proceedings of the Association for Computational Linguistics (ACL) , pages =","work_id":"bad774d3-20f4-421f-ba75-d1ef99f02a26","shared_citers":7},{"title":"Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models","work_id":"bab684a8-d933-426c-a19e-2c855a0d1f59","shared_citers":7},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":7},{"title":"BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension","work_id":"6b309ef6-e88e-4de5-a589-3bff329254a0","shared_citers":6},{"title":"Chandra and Dexter C","work_id":"c3270592-bd69-4213-95e1-4aaf8312be9b","shared_citers":6},{"title":"doi: 
10.3115/v1/D14-1162","work_id":"21113f4d-d545-4d27-bde3-02650634d4fe","shared_citers":6},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":6},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":6},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":6},{"title":"Scaling Language Models: Methods, Analysis & Insights from Training Gopher","work_id":"47ce8be9-e500-407d-af41-ac2d132215eb","shared_citers":6},{"title":"S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing","work_id":"81a6320b-c2e1-4d74-a03e-9e1ff6bbed8d","shared_citers":6},{"title":"The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389","work_id":"3dfaa21d-3751-420b-84f7-aeceda058b63","shared_citers":6},{"title":"Training Deep Nets with Sublinear Memory Cost","work_id":"f2c5c287-a500-40e4-a136-e7e3172db1d7","shared_citers":6},{"title":"Barron, Ben Mildenhall, Mehdi S","work_id":"0a23d1b7-bd56-43cc-8a80-7c43ce994e1e","shared_citers":5},{"title":"Bowman and Gabor Angeli and Christopher Potts and Christopher D","work_id":"f993835a-6146-425a-8004-05fce2ed6ad6","shared_citers":5}],"time_series":[{"n":1,"year":2019},{"n":1,"year":2020},{"n":3,"year":2021},{"n":4,"year":2022},{"n":5,"year":2023},{"n":4,"year":2024},{"n":1,"year":2025},{"n":75,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:29:30.301924+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical 
Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:29:39.000954+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"BERT : Pre-training of deep bidirectional transformers for language understanding","claims":[{"claim_text":"PaLM-E: An embodied multimodal language model. In A. Krause, E. Brun- skill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors,International Confer- ence on Machine Learning, ICML 2023, 23- 29 July 2023, Honolulu, Hawaii, USA, vol- ume 202 ofProceedings of Machine Learn- ing Research, pages 8469-8488. PMLR, 2023. URL https://proceedings. mlr.press/v202/driess23a.html. [37] G. Geigle, R. Timofte, and G. Glavaš. African or european swallow? bench- marking large vision-language models for ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"The retrieval system only manages to fetch informationabout Fleming's professional achievements in the discoveryof penicillin. However, the document does not provide informa-tion about his educational background, thus the model generates ahallucinatory answer. inappropriately activated, blindly retrieving inaccurate information and consequently leading to an undesirable response. Consequently, several studies [75, 204, 228, 378] have proposed to make a shift from passive retrieval to adaptive re","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"For example, forChemical Reactions, we ask Galactica to predict the products of the reaction in the chemical equation LaTeX. We mask out products in the description so the model is inferring based on the reactants only. 
An example is shown in Figure 10. 13 Galactica: A Large Language Model for Science Prompt Sulfuric acid reacts with sodium chloride, and gives_____and _____: \\[ \\ce{ NaCl + H2SO4 -> Generated Answer NaCl + H2SO4−−→NaHSO4 + HCl Figure 10: Chemical Reactions. We prompt based on a d","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks BERT : Pre-training of deep bidirectional transformers for language understanding because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (3 contexts).","role_counts":[{"n":3,"context_role":"background"}]},"error":null,"updated_at":"2026-05-15T01:56:34.520428+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"BERT : Pre-training of deep bidirectional transformers for language understanding","claims":[{"claim_text":"The retrieval system only manages to fetch information about Fleming's professional achievements in the discovery of penicillin. However, the document does not provide information about his educational background, thus the model generates a hallucinatory answer. inappropriately activated, blindly retrieving inaccurate information and consequently leading to an undesirable response. Consequently, several studies [75, 204, 228, 378] have proposed to make a shift from passive retrieval to adaptive re","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks BERT : Pre-training of deep bidirectional transformers for language understanding because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (1 context).","role_counts":[{"n":1,"context_role":"background"}]},"error":null,"updated_at":"2026-05-14T18:29:34.408216+00:00"}},"summary":{"title":"BERT : Pre-training of deep bidirectional transformers for language understanding","claims":[{"claim_text":"The retrieval system only manages to fetch information about Fleming's professional achievements in the discovery of penicillin. However, the document does not provide information about his educational background, thus the model generates a hallucinatory answer. inappropriately activated, blindly retrieving inaccurate information and consequently leading to an undesirable response. Consequently, several studies [75, 204, 228, 378] have proposed to make a shift from passive retrieval to adaptive re","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks BERT : Pre-training of deep bidirectional transformers for language understanding because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (1 contexts).","role_counts":[{"n":1,"context_role":"background"}]},"graph":{"co_cited":[{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":27},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":13},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":11},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":10},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":10},{"title":"Sentence- BERT : Sentence embeddings using S iamese BERT -networks","work_id":"cf07889b-1f35-4d81-9514-4ad3ed223c57","shared_citers":10},{"title":"Dense passage retrieval for open-domain question answering","work_id":"083391f8-812d-430f-8d08-89a03031ce6c","shared_citers":7},{"title":"In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demon- strations (Oct 2020).https://doi.org/10.18653/v1/2020.emnlp-demos.6","work_id":"801601a0-1077-41ef-a947-ba81c0b7510f","shared_citers":7},{"title":"Proceedings of the Association for Computational Linguistics (ACL) , pages =","work_id":"bad774d3-20f4-421f-ba75-d1ef99f02a26","shared_citers":7},{"title":"Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models","work_id":"bab684a8-d933-426c-a19e-2c855a0d1f59","shared_citers":7},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":7},{"title":"BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and 
Comprehension","work_id":"6b309ef6-e88e-4de5-a589-3bff329254a0","shared_citers":6},{"title":"Chandra and Dexter C","work_id":"c3270592-bd69-4213-95e1-4aaf8312be9b","shared_citers":6},{"title":"doi: 10.3115/v1/D14-1162","work_id":"21113f4d-d545-4d27-bde3-02650634d4fe","shared_citers":6},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":6},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":6},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":6},{"title":"Scaling Language Models: Methods, Analysis & Insights from Training Gopher","work_id":"47ce8be9-e500-407d-af41-ac2d132215eb","shared_citers":6},{"title":"S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing","work_id":"81a6320b-c2e1-4d74-a03e-9e1ff6bbed8d","shared_citers":6},{"title":"The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389","work_id":"3dfaa21d-3751-420b-84f7-aeceda058b63","shared_citers":6},{"title":"Training Deep Nets with Sublinear Memory Cost","work_id":"f2c5c287-a500-40e4-a136-e7e3172db1d7","shared_citers":6},{"title":"Barron, Ben Mildenhall, Mehdi S","work_id":"0a23d1b7-bd56-43cc-8a80-7c43ce994e1e","shared_citers":5},{"title":"Bowman and Gabor Angeli and Christopher Potts and Christopher D","work_id":"f993835a-6146-425a-8004-05fce2ed6ad6","shared_citers":5}],"time_series":[{"n":1,"year":2019},{"n":1,"year":2020},{"n":3,"year":2021},{"n":4,"year":2022},{"n":5,"year":2023},{"n":4,"year":2024},{"n":1,"year":2025},{"n":75,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"22085fc7-7abb-4659-9689-7df10e6d3927","orcid":null,"display_name":"Jacob Devlin","source":"manual","import_confidence":0.72},{"id":"087af944-bc78-48cd-ad4b-ac60e15af05e","orcid":null,"display_name":"Kenton 
Lee","source":"manual","import_confidence":0.72},{"id":"e75535f4-4fb0-4410-87f9-a8aa36851f7e","orcid":null,"display_name":"Kristina Toutanova","source":"manual","import_confidence":0.72},{"id":"327371ce-3292-4a8d-bbf6-7b5ab0ea0cd3","orcid":null,"display_name":"Ming-Wei Chang","source":"manual","import_confidence":0.72}]}}