pith. machine review for the scientific record.

arxiv: 1905.00537 · v3 · submitted 2019-05-02 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:28 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords SuperGLUE · GLUE · benchmark · language understanding · NLP · transfer learning · pretraining · evaluation

The pith

SuperGLUE introduces a new set of harder language understanding tasks after models surpass non-expert humans on GLUE.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that recent models have exceeded non-expert human performance on the GLUE benchmark, leaving little room for further measurable progress on those tasks. To restore headroom, the authors release SuperGLUE, a successor benchmark containing a fresh collection of more difficult language understanding tasks together with a toolkit and public leaderboard. A sympathetic reader would care because benchmarks that are too easy stop guiding research toward genuine advances in general-purpose language systems. The work therefore replaces a saturated evaluation with one intended to better diagnose deeper understanding capabilities.

Core claim

Performance on the GLUE benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. The authors therefore present SuperGLUE, a new benchmark styled after GLUE that supplies a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard at super.gluebenchmark.com.

What carries the argument

The SuperGLUE benchmark, which replaces GLUE with a new collection of more challenging tasks chosen to remain diagnostic of general language understanding.

If this is right

  • Language model development will shift evaluation focus to the new, more demanding tasks in SuperGLUE.
  • Reported progress will reflect performance on tasks that remain below non-expert human levels.
  • The public leaderboard will standardize comparison across systems on the harder task set.
  • Research incentives will favor methods that handle the added difficulty rather than GLUE-specific shortcuts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Adoption of SuperGLUE could accelerate models that transfer more reliably to unseen real-world language scenarios.
  • Success on SuperGLUE might still require separate checks that the gains reflect understanding rather than benchmark-specific patterns.
  • Future benchmark designers may need to repeat this cycle as performance on SuperGLUE itself saturates.

Load-bearing premise

The newly chosen tasks are harder and more diagnostic of general language understanding than the original GLUE tasks without introducing new exploitable biases or artifacts.

What would settle it

If leading models reach or exceed human performance on the full SuperGLUE suite within a year using only the same pretraining and transfer methods that saturated GLUE, the claim that SuperGLUE restores meaningful headroom would be undermined.

read the original abstract

In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper observes that recent advances in pretraining and transfer learning have driven model performance on the GLUE benchmark above non-expert human levels, implying limited headroom for further progress. It introduces SuperGLUE, a successor benchmark consisting of eight more challenging language-understanding tasks (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, WSC), together with a software toolkit and public leaderboard.
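
The GLUE-style single-number metric the benchmark inherits can be sketched as an unweighted macro-average over per-task scores, with two-metric tasks contributing the mean of their metrics. This is a hedged illustration, not the official toolkit's scoring code; the metric pairings follow common convention for these tasks, and the numbers are hypothetical placeholders rather than results from the paper:

```python
# Hedged sketch of a GLUE-style aggregate over the eight SuperGLUE tasks.
# Metric pairings are assumptions based on common convention; the scores
# below are hypothetical placeholders, not results reported in the paper.
def superglue_score(per_task_metrics):
    """per_task_metrics: {task: {metric: value in [0, 100]}} -> aggregate score."""
    task_scores = {}
    for task, metrics in per_task_metrics.items():
        # Tasks with two reported metrics contribute their mean; others their value.
        task_scores[task] = sum(metrics.values()) / len(metrics)
    # The headline number is the unweighted macro-average across tasks.
    return sum(task_scores.values()) / len(task_scores)

scores = {
    "BoolQ":   {"acc": 77.4},
    "CB":      {"acc": 83.6, "f1": 75.7},
    "COPA":    {"acc": 70.6},
    "MultiRC": {"f1a": 70.0, "em": 24.0},
    "ReCoRD":  {"f1": 72.0, "em": 71.3},
    "RTE":     {"acc": 71.6},
    "WiC":     {"acc": 69.5},
    "WSC":     {"acc": 64.3},
}
print(round(superglue_score(scores), 1))
```

The unweighted average treats every task equally regardless of dataset size, which is one plausible reading of how a "single-number metric" keeps small tasks like CB and COPA diagnostic.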

Significance. If the new tasks indeed offer greater headroom and more diagnostic evaluation of general language understanding, SuperGLUE would serve as a valuable next standard benchmark, extending the impact of GLUE. The accompanying toolkit and leaderboard constitute practical, reproducible contributions that lower barriers to adoption and enable consistent community comparisons.

minor comments (2)
  1. [Abstract] The saturation claim would be strengthened by a brief citation to the specific results or papers documenting model performance exceeding non-expert human baselines on GLUE.
  2. [§2] In the task introduction, a short table or paragraph explicitly comparing average model-human gaps on GLUE versus the proposed SuperGLUE tasks would make the 'stickier' claim more concrete and easier to evaluate.
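
The model-human comparison suggested in the second comment amounts to a per-task headroom table, i.e. human baseline minus model score. A minimal sketch, with hypothetical placeholder numbers rather than figures from the paper:

```python
# Hedged sketch of a per-task headroom table (human baseline minus model
# score). All numbers are hypothetical placeholders for illustration only.
human = {"BoolQ": 89.0, "CB": 95.8, "COPA": 100.0, "WiC": 80.0}
model = {"BoolQ": 77.4, "CB": 79.7, "COPA": 70.6, "WiC": 69.5}

def headroom(human, model):
    """Return {task: human - model}, sorted by remaining headroom, largest first."""
    gaps = {t: round(human[t] - model[t], 1) for t in human}
    return dict(sorted(gaps.items(), key=lambda kv: -kv[1]))

for task, gap in headroom(human, model).items():
    print(f"{task:7s} headroom {gap:+.1f}")
```

Sorting by gap makes the "stickier" claim directly inspectable: tasks where the gap stays large are the ones restoring headroom.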

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition that SuperGLUE offers greater headroom and diagnostic value for general language understanding.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes SuperGLUE motivated by the external empirical observation that GLUE performance has surpassed non-expert human levels. No derivation chain, equations, fitted parameters, or predictions are present. The central premise relies on publicly verifiable model results rather than on self-citation that would reduce the argument to unverified inputs by construction. No self-definitional, fitted-input, uniqueness-imported, or ansatz-smuggled steps appear. The work is a benchmark proposal and toolkit release whose motivation rests on external performance data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on the domain assumption that performance on a curated set of tasks is a valid proxy for general language understanding capability.

axioms (1)
  • domain assumption: Language understanding can be meaningfully summarized by aggregate performance on a diverse but fixed set of tasks.
    This is the foundational premise for any GLUE-style benchmark and is invoked to justify creating SuperGLUE once GLUE is saturated.

pith-pipeline@v0.9.0 · 5426 in / 1200 out tokens · 32667 ms · 2026-05-15T01:28:06.644884+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Measuring Massive Multitask Language Understanding

    cs.CY 2020-09 accept novelty 8.0

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  2. Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

    cs.LG 2026-05 unverdicted novelty 7.0

    Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.

  3. Language Is Not All You Need: Aligning Perception with Language Models

    cs.CL 2023-02 conditional novelty 7.0

    Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

  4. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    cs.CL 2020-05 accept novelty 7.0

    RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

  5. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  6. PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

    cs.CL 2026-05 unverdicted novelty 6.0

    PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

  7. SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

    cs.LG 2026-05 unverdicted novelty 6.0

    SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.

  8. Defending Against Indirect Prompt Injection Attacks With Spotlighting

    cs.CR 2024-03 unverdicted novelty 6.0

    Spotlighting prompt transformations cut indirect prompt injection success rates from >50% to <2% on GPT models while preserving task performance.

  9. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  10. Retentive Network: A Successor to Transformer for Large Language Models

    cs.CL 2023-07 unverdicted novelty 6.0

    RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

  11. Kosmos-2: Grounding Multimodal Large Language Models to the World

    cs.CL 2023-06 unverdicted novelty 6.0

    Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

  12. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  13. Ethical and social risks of harm from Language Models

    cs.CL 2021-12 accept novelty 6.0

    The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...

  14. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  15. HuggingFace's Transformers: State-of-the-art Natural Language Processing

    cs.CL 2019-10 accept novelty 6.0

    Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.

  16. Complexity Horizons of Compressed Models in Analog Circuit Analysis

    cs.AI 2026-05 unverdicted novelty 5.0

    Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.

  17. Uncertainty-Aware Transformers: Conformal Prediction for Language Models

    cs.LG 2026-04 unverdicted novelty 5.0

    CONFIDE applies conformal prediction to transformer embeddings for valid prediction sets, improving accuracy up to 4.09% and efficiency over baselines on models like BERT-tiny.

  18. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  19. Detecting Language Model Attacks with Perplexity

    cs.CL 2023-08 unverdicted novelty 5.0

    Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.

  20. PaLM 2 Technical Report

    cs.CL 2023-05 unverdicted novelty 5.0

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

  21. RoBERTa: A Robustly Optimized BERT Pretraining Approach

    cs.CL 2019-07 accept novelty 5.0

    With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.

  22. GLU Variants Improve Transformer

    cs.LG 2020-02 unverdicted novelty 4.0

    Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

  23. Scaling Laws for Neural Language Models

    cs.LG 2020-01

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · cited by 23 Pith papers · 16 internal anchors
