{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2019:65HLDQD3BXWBKONNJRWDL5CUMK","short_pith_number":"pith:65HLDQD3","schema_version":"1.0","canonical_sha256":"f74eb1c07b0dec1539ad4c6c35f45462a8112c3b8c52c3a3a28d740bbc551349","source":{"kind":"arxiv","id":"1905.00537","version":3},"attestation_state":"computed","paper":{"title":"SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"SuperGLUE introduces a new set of harder language understanding tasks after models surpass non-expert humans on GLUE.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Alex Wang, Amanpreet Singh, Felix Hill, Julian Michael, Nikita Nangia, Omer Levy, Samuel R. Bowman, Yada Pruksachatkun","submitted_at":"2019-05-02T00:41:50Z","abstract_excerpt":"In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"1905.00537","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2019-05-02T00:41:50Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"d3a724f90da0043f6d84299c8c492aaca2fc396dd0a908865bb9c87e418b0e94","abstract_canon_sha256":"1d7e3199da3cedeca7d3f9aef662c6c348f42f2d32022a1007ee3b8d5817fe51"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:05.122535Z","signature_b64":"sMjGvK4wDOOizcOEOwiEEAr4NKattmLZSJumd0HZm7O+ro9QWl6pi0yRncmeKrCLSo7Gw8f1MJ9R8ThMRPWcAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"f74eb1c07b0dec1539ad4c6c35f45462a8112c3b8c52c3a3a28d740bbc551349","last_reissued_at":"2026-05-17T23:39:05.121814Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:05.121814Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"SuperGLUE introduces a new set of harder language understanding tasks after models surpass non-expert humans on GLUE.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Alex Wang, Amanpreet Singh, Felix Hill, Julian Michael, Nikita Nangia, Omer Levy, Samuel R. Bowman, Yada Pruksachatkun","submitted_at":"2019-05-02T00:41:50Z","abstract_excerpt":"In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Performance on the GLUE benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research, motivating SuperGLUE with a new set of more difficult language understanding tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the newly selected tasks are sufficiently harder and more diagnostic of general language understanding than the original GLUE tasks, without introducing new biases or artifacts that models can exploit.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"SuperGLUE introduces a new set of harder language understanding tasks after models surpass non-expert humans on GLUE.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"ca5a7d0a293327adfc4fc4367dc73307487abe2c2cd0985354044eab5e94716f"},"source":{"id":"1905.00537","kind":"arxiv","version":3},"verdict":{"id":"7a0de7aa-f032-40f8-86ff-562f49ccc0cf","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T01:28:06.644884Z","strongest_claim":"Performance on the GLUE benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research, motivating SuperGLUE with a new set of more difficult language understanding tasks.","one_line_summary":"SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the newly selected tasks are sufficiently harder and more diagnostic of general language understanding than the original GLUE tasks, without introducing new biases or artifacts that models can exploit.","pith_extraction_headline":"SuperGLUE introduces a new set of harder language understanding tasks after models surpass non-expert humans on GLUE."},"references":{"count":135,"sample":[{"doi":"","year":null,"title":"Tenney and Yada Pruksachatkun and Katherin Yu and Jan Hula and Patrick Xia and Raghu Pappagari and Shuning Jin and R","work_id":"dd6b0763-a250-48ec-b909-9c1677dd172e","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Zhang, Sheng and Liu, Xiaodong and Liu, Jingjing and Gao, Jianfeng and Duh, Kevin and Van Durme, Benjamin , journal=","work_id":"38d6f639-9805-4707-a6cb-d9d07c4261fe","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , author=. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Hum","work_id":"2f514ff6-4bbb-4c24-a286-850487902bb6","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Zhilin Yang and Zihang Dai and Yiming Yang and Jaime Carbonell and Ruslan Salakhutdinov and Quoc V. Le , journal=","work_id":"17a512a5-26c1-4ef1-912a-898996ad7cec","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Lipstick on a Pig: D ebiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them","work_id":"58a067a9-5451-48f0-871e-b753bbf27e15","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":135,"snapshot_sha256":"39290a3980bc8ea04c806253a71325ee31a46e97ca182e6da9899dd1163cb93a","internal_anchors":16},"formal_canon":{"evidence_count":2,"snapshot_sha256":"cb44fd3b427143b27f6f08a26c9b531bf57031ddc76a8a930c204963f9874b88"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"1905.00537","created_at":"2026-05-17T23:39:05.121943+00:00"},{"alias_kind":"arxiv_version","alias_value":"1905.00537v3","created_at":"2026-05-17T23:39:05.121943+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.1905.00537","created_at":"2026-05-17T23:39:05.121943+00:00"},{"alias_kind":"pith_short_12","alias_value":"65HLDQD3BXWB","created_at":"2026-05-18T12:33:10.108867+00:00"},{"alias_kind":"pith_short_16","alias_value":"65HLDQD3BXWBKONN","created_at":"2026-05-18T12:33:10.108867+00:00"},{"alias_kind":"pith_short_8","alias_value":"65HLDQD3","created_at":"2026-05-18T12:33:10.108867+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":32,"internal_anchor_count":32,"sample":[{"citing_arxiv_id":"1906.08230","citing_title":"Evaluating Protein Transfer Learning with TAPE","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"1906.10002","citing_title":"LIAAD at SemDeep-5 Challenge: Word-in-Context (WiC)","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2110.14168","citing_title":"Training Verifiers to Solve Math Word Problems","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2502.04501","citing_title":"Ultra-Low-Dimensional Prompt Tuning via Random Projection","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2508.14685","citing_title":"SSA: Improving Performance With a Better Scoring Function","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2210.11610","citing_title":"Large Language Models Can Self-Improve","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2305.16264","citing_title":"Scaling Data-Constrained Language Models","ref_index":122,"is_internal_anchor":true},{"citing_arxiv_id":"2102.01293","citing_title":"Scaling Laws for Transfer","ref_index":198,"is_internal_anchor":true},{"citing_arxiv_id":"2308.03958","citing_title":"Simple synthetic data reduces sycophancy in large language models","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2302.14045","citing_title":"Language Is Not All You Need: Aligning Perception with Language Models","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2308.14132","citing_title":"Detecting Language Model Attacks with Perplexity","ref_index":85,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14055","citing_title":"PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2403.14720","citing_title":"Defending Against Indirect Prompt Injection Attacks With Spotlighting","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2309.14509","citing_title":"DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models","ref_index":147,"is_internal_anchor":true},{"citing_arxiv_id":"2305.10403","citing_title":"PaLM 2 Technical Report","ref_index":150,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08423","citing_title":"Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"1910.10683","citing_title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2306.14824","citing_title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2307.08621","citing_title":"Retentive Network: A Successor to Transformer for Large Language Models","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06402","citing_title":"SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2112.04359","citing_title":"Ethical and social risks of harm from Language Models","ref_index":285,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02285","citing_title":"Complexity Horizons of Compressed Models in Analog Circuit Analysis","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"1910.03771","citing_title":"HuggingFace's Transformers: State-of-the-art Natural Language Processing","ref_index":186,"is_internal_anchor":true},{"citing_arxiv_id":"2112.00861","citing_title":"A General Language Assistant as a Laboratory for Alignment","ref_index":86,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08885","citing_title":"Uncertainty-Aware Transformers: Conformal Prediction for Language Models","ref_index":26,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/65HLDQD3BXWBKONNJRWDL5CUMK","json":"https://pith.science/pith/65HLDQD3BXWBKONNJRWDL5CUMK.json","graph_json":"https://pith.science/api/pith-number/65HLDQD3BXWBKONNJRWDL5CUMK/graph.json","events_json":"https://pith.science/api/pith-number/65HLDQD3BXWBKONNJRWDL5CUMK/events.json","paper":"https://pith.science/paper/65HLDQD3"},"agent_actions":{"view_html":"https://pith.science/pith/65HLDQD3BXWBKONNJRWDL5CUMK","download_json":"https://pith.science/pith/65HLDQD3BXWBKONNJRWDL5CUMK.json","view_paper":"https://pith.science/paper/65HLDQD3","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=1905.00537&json=true","fetch_graph":"https://pith.science/api/pith-number/65HLDQD3BXWBKONNJRWDL5CUMK/graph.json","fetch_events":"https://pith.science/api/pith-number/65HLDQD3BXWBKONNJRWDL5CUMK/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/65HLDQD3BXWBKONNJRWDL5CUMK/action/timestamp_anchor","attest_storage":"https://pith.science/pith/65HLDQD3BXWBKONNJRWDL5CUMK/action/storage_attestation","attest_author":"https://pith.science/pith/65HLDQD3BXWBKONNJRWDL5CUMK/action/author_attestation","sign_citation":"https://pith.science/pith/65HLDQD3BXWBKONNJRWDL5CUMK/action/citation_signature","submit_replication":"https://pith.science/pith/65HLDQD3BXWBKONNJRWDL5CUMK/action/replication_record"}},"created_at":"2026-05-17T23:39:05.121943+00:00","updated_at":"2026-05-17T23:39:05.121943+00:00"}