{"paper":{"title":"SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"SuperGLUE introduces a new set of harder language understanding tasks after models surpass non-expert humans on GLUE.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Alex Wang, Amanpreet Singh, Felix Hill, Julian Michael, Nikita Nangia, Omer Levy, Samuel R. Bowman, Yada Pruksachatkun","submitted_at":"2019-05-02T00:41:50Z","abstract_excerpt":"In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Performance on the GLUE benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research, motivating SuperGLUE with a new set of more difficult language understanding tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the newly selected tasks are sufficiently harder and more diagnostic of general language understanding than the original GLUE tasks, without introducing new biases or artifacts that models can exploit.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SuperGLUE is a new benchmark with more difficult language understanding 
tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"SuperGLUE introduces a new set of harder language understanding tasks after models surpass non-expert humans on GLUE.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"ca5a7d0a293327adfc4fc4367dc73307487abe2c2cd0985354044eab5e94716f"},"source":{"id":"1905.00537","kind":"arxiv","version":3},"verdict":{"id":"7a0de7aa-f032-40f8-86ff-562f49ccc0cf","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T01:28:06.644884Z","strongest_claim":"Performance on the GLUE benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research, motivating SuperGLUE with a new set of more difficult language understanding tasks.","one_line_summary":"SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the newly selected tasks are sufficiently harder and more diagnostic of general language understanding than the original GLUE tasks, without introducing new biases or artifacts that models can exploit.","pith_extraction_headline":"SuperGLUE introduces a new set of harder language understanding tasks after models surpass non-expert humans on GLUE."},"references":{"count":135,"sample":[{"doi":"","year":null,"title":"Tenney and Yada Pruksachatkun and Katherin Yu and Jan Hula and Patrick Xia and Raghu Pappagari and Shuning Jin and R","work_id":"dd6b0763-a250-48ec-b909-9c1677dd172e","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Zhang, Sheng and Liu, Xiaodong and Liu, Jingjing and Gao, Jianfeng and Duh, Kevin and Van Durme, Benjamin , 
journal=","work_id":"38d6f639-9805-4707-a6cb-d9d07c4261fe","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , author=. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Hum","work_id":"2f514ff6-4bbb-4c24-a286-850487902bb6","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Zhilin Yang and Zihang Dai and Yiming Yang and Jaime Carbonell and Ruslan Salakhutdinov and Quoc V. Le , journal=","work_id":"17a512a5-26c1-4ef1-912a-898996ad7cec","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them","work_id":"58a067a9-5451-48f0-871e-b753bbf27e15","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":135,"snapshot_sha256":"39290a3980bc8ea04c806253a71325ee31a46e97ca182e6da9899dd1163cb93a","internal_anchors":16},"formal_canon":{"evidence_count":2,"snapshot_sha256":"cb44fd3b427143b27f6f08a26c9b531bf57031ddc76a8a930c204963f9874b88"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}