{"work":{"id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","openalex_id":null,"doi":null,"arxiv_id":"2009.03300","raw_key":null,"title":"Measuring Massive Multitask Language Understanding","authors":null,"authors_text":"Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song","year":2020,"venue":"cs.CY","abstract":"We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.","external_url":"https://arxiv.org/abs/2009.03300","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T13:33:28.276979+00:00","pith_arxiv_id":"2009.03300","created_at":"2026-05-09T05:45:22.435084+00:00","updated_at":"2026-06-29T13:33:28.276979+00:00","title_quality_ok":true,"display_title":"Measuring Massive Multitask Language Understanding","render_title":"Measuring Massive Multitask Language Understanding"},"hub":{"state":{"work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":390,"external_cited_by_count":null,"distinct_field_count":22,"first_pith_cited_at":"2021-02-02T04:07:38+00:00","last_pith_cited_at":"2026-06-24T21:26:43+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T13:38:55.554019+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":30},{"context_role":"dataset","n":29},{"context_role":"method","n":5},{"context_role":"baseline","n":3}],"polarity_counts":[{"context_polarity":"background","n":30},{"context_polarity":"use_dataset","n":27},{"context_polarity":"use_method","n":5},{"context_polarity":"baseline","n":3},{"context_polarity":"unclear","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Measuring Massive Multitask Language Understanding","claims":[{"claim_text":"We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Measuring Massive Multitask Language Understanding because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:43:33.298159+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"9a245225-c7c3-48a7-b813-bdc1644e9016","orcid":null,"display_name":"Dan Hendrycks"},{"id":"3b4ceaf7-2386-40e2-b124-322be5048cad","orcid":null,"display_name":"Collin Burns"},{"id":"7a82e646-0403-4c2b-85ba-72d9e0f173cd","orcid":null,"display_name":"Steven Basart"},{"id":"d5f98311-adb1-4b25-880e-4f7dd5576ee8","orcid":null,"display_name":"Andy Zou"},{"id":"fa9d0435-c6cb-4907-bf3e-81e79429d880","orcid":null,"display_name":"Mantas Mazeika"},{"id":"68b9933e-12f6-44d9-90ed-f3dc38125151","orcid":null,"display_name":"Dawn Song"}]},"error":null,"updated_at":"2026-05-13T21:43:41.024068+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T21:33:36.466844+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":64},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":39},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":37},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":37},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":36},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":33},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":24},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":23},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":22},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":20},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":18},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":18},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":18},{"title":"HellaSwag: Can a Machine Really Finish Your Sentence?","work_id":"79f44c0c-96f4-4edb-bc50-a3c9d6b85936","shared_citers":17},{"title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","work_id":"513eb205-04ca-4722-9a43-a74e8cbe7e85","shared_citers":16},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":16},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":16},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":15},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":15},{"title":"Instruction-Following Evaluation for Large Language Models","work_id":"3aa06177-125a-4f5a-8f4a-8070c5986c26","shared_citers":15},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":15},{"title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","shared_citers":13},{"title":"GPQA: A Graduate-Level Google-Proof Q&A Benchmark","work_id":"9e2a976b-f5ad-4aee-af5c-243fe0fe75d2","shared_citers":13},{"title":"Finetuned Language Models Are Zero-Shot Learners","work_id":"7ed6cdaa-ed67-4db4-aceb-b7e1b0e6e7c4","shared_citers":12}],"time_series":[{"n":1,"year":2021},{"n":2,"year":2022},{"n":6,"year":2023},{"n":14,"year":2024},{"n":8,"year":2025},{"n":126,"year":2026}]},"error":null,"updated_at":"2026-05-13T21:33:35.516705+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T21:33:38.900649+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Measuring Massive Multitask Language Understanding","claims":[{"claim_text":"We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Measuring Massive Multitask Language Understanding because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:33:35.367326+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Measuring Massive Multitask Language Understanding","claims":[{"claim_text":"We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Measuring Massive Multitask Language Understanding because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:33:31.903600+00:00"}},"summary":{"title":"Measuring Massive Multitask Language Understanding","claims":[{"claim_text":"We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Measuring Massive Multitask Language Understanding because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":64},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":39},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":37},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":37},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":36},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":33},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":24},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":23},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":22},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":20},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":18},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":18},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":18},{"title":"HellaSwag: Can a Machine Really Finish Your Sentence?","work_id":"79f44c0c-96f4-4edb-bc50-a3c9d6b85936","shared_citers":17},{"title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","work_id":"513eb205-04ca-4722-9a43-a74e8cbe7e85","shared_citers":16},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":16},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":16},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":15},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":15},{"title":"Instruction-Following Evaluation for Large Language Models","work_id":"3aa06177-125a-4f5a-8f4a-8070c5986c26","shared_citers":15},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":15},{"title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","shared_citers":13},{"title":"GPQA: A Graduate-Level Google-Proof Q&A Benchmark","work_id":"9e2a976b-f5ad-4aee-af5c-243fe0fe75d2","shared_citers":13},{"title":"Finetuned Language Models Are Zero-Shot Learners","work_id":"7ed6cdaa-ed67-4db4-aceb-b7e1b0e6e7c4","shared_citers":12}],"time_series":[{"n":1,"year":2021},{"n":2,"year":2022},{"n":6,"year":2023},{"n":14,"year":2024},{"n":8,"year":2025},{"n":126,"year":2026}]},"authors":[{"id":"d5f98311-adb1-4b25-880e-4f7dd5576ee8","orcid":null,"display_name":"Andy Zou","source":"manual","import_confidence":0.72},{"id":"3b4ceaf7-2386-40e2-b124-322be5048cad","orcid":null,"display_name":"Collin Burns","source":"manual","import_confidence":0.72},{"id":"9a245225-c7c3-48a7-b813-bdc1644e9016","orcid":null,"display_name":"Dan Hendrycks","source":"manual","import_confidence":0.72},{"id":"68b9933e-12f6-44d9-90ed-f3dc38125151","orcid":null,"display_name":"Dawn Song","source":"manual","import_confidence":0.72},{"id":"fa9d0435-c6cb-4907-bf3e-81e79429d880","orcid":null,"display_name":"Mantas Mazeika","source":"manual","import_confidence":0.72},{"id":"7a82e646-0403-4c2b-85ba-72d9e0f173cd","orcid":null,"display_name":"Steven Basart","source":"manual","import_confidence":0.72}]}}