pith. machine review for the scientific record.

arxiv: 1905.00537 · v3 · submitted 2019-05-02 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:28 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords SuperGLUE · GLUE · benchmark · language understanding · NLP · transfer learning · pretraining · evaluation

The pith

SuperGLUE introduces a new set of harder language understanding tasks after models surpass non-expert humans on GLUE.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that recent models have exceeded non-expert human performance on the GLUE benchmark, leaving little room for further measurable progress on those tasks. To restore headroom, the authors release SuperGLUE, a successor benchmark containing a fresh collection of more difficult language understanding tasks together with a toolkit and public leaderboard. A sympathetic reader would care because benchmarks that are too easy stop guiding research toward genuine advances in general-purpose language systems. The work therefore replaces a saturated evaluation with one intended to better diagnose deeper understanding capabilities.

Core claim

Performance on the GLUE benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. The authors therefore present SuperGLUE, a new benchmark styled after GLUE that supplies a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard at super.gluebenchmark.com.

What carries the argument

The SuperGLUE benchmark, which replaces GLUE with a new collection of more challenging tasks chosen to remain diagnostic of general language understanding.

If this is right

  • Language model development will shift evaluation focus to the new, more demanding tasks in SuperGLUE.
  • Reported progress will reflect performance on tasks that remain below non-expert human levels.
  • The public leaderboard will standardize comparison across systems on the harder task set.
  • Research incentives will favor methods that handle the added difficulty rather than GLUE-specific shortcuts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Adoption of SuperGLUE could accelerate models that transfer more reliably to unseen real-world language scenarios.
  • Success on SuperGLUE might still require separate checks that the gains reflect understanding rather than benchmark-specific patterns.
  • Future benchmark designers may need to repeat this cycle as performance on SuperGLUE itself saturates.

Load-bearing premise

The newly chosen tasks are harder and more diagnostic of general language understanding than the original GLUE tasks without introducing new exploitable biases or artifacts.

What would settle it

If leading models reach or exceed human performance on the full SuperGLUE suite within a year using only the same pretraining and transfer methods that saturated GLUE, the claim that SuperGLUE restores meaningful headroom would be undermined.

read the original abstract

In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper observes that recent advances in pretraining and transfer learning have driven model performance on the GLUE benchmark above non-expert human levels, implying limited headroom for further progress. It introduces SuperGLUE, a successor benchmark consisting of eight more challenging language-understanding tasks (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, WSC), together with a software toolkit and public leaderboard.
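
The GLUE-style single-number metric the benchmark inherits can be sketched as an unweighted macro-average over per-task scores, with two-metric tasks contributing the mean of their metrics. This is a hedged illustration, not the official toolkit's scoring code; the metric pairings follow common convention for these tasks, and the numbers are hypothetical placeholders rather than results from the paper:

```python
# Hedged sketch of a GLUE-style aggregate over the eight SuperGLUE tasks.
# Metric pairings are assumptions based on common convention; the scores
# below are hypothetical placeholders, not results reported in the paper.
def superglue_score(per_task_metrics):
    """per_task_metrics: {task: {metric: value in [0, 100]}} -> aggregate score."""
    task_scores = {}
    for task, metrics in per_task_metrics.items():
        # Tasks with two reported metrics contribute their mean; others their value.
        task_scores[task] = sum(metrics.values()) / len(metrics)
    # The headline number is the unweighted macro-average across tasks.
    return sum(task_scores.values()) / len(task_scores)

scores = {
    "BoolQ":   {"acc": 77.4},
    "CB":      {"acc": 83.6, "f1": 75.7},
    "COPA":    {"acc": 70.6},
    "MultiRC": {"f1a": 70.0, "em": 24.0},
    "ReCoRD":  {"f1": 72.0, "em": 71.3},
    "RTE":     {"acc": 71.6},
    "WiC":     {"acc": 69.5},
    "WSC":     {"acc": 64.3},
}
print(round(superglue_score(scores), 1))
```

The unweighted average treats every task equally regardless of dataset size, which is one plausible reading of how a "single-number metric" keeps small tasks like CB and COPA diagnostic.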

Significance. If the new tasks indeed offer greater headroom and more diagnostic evaluation of general language understanding, SuperGLUE would serve as a valuable next standard benchmark, extending the impact of GLUE. The accompanying toolkit and leaderboard constitute practical, reproducible contributions that lower barriers to adoption and enable consistent community comparisons.

minor comments (2)
  1. [Abstract] The saturation claim would be strengthened by a brief citation to the specific results or papers documenting model performance exceeding non-expert human baselines on GLUE.
  2. [§2] In the task introduction, a short table or paragraph explicitly comparing average model-human gaps on GLUE versus the proposed SuperGLUE tasks would make the 'stickier' claim more concrete and easier to evaluate.
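
The model-human comparison suggested in the second comment amounts to a per-task headroom table, i.e. human baseline minus model score. A minimal sketch, with hypothetical placeholder numbers rather than figures from the paper:

```python
# Hedged sketch of a per-task headroom table (human baseline minus model
# score). All numbers are hypothetical placeholders for illustration only.
human = {"BoolQ": 89.0, "CB": 95.8, "COPA": 100.0, "WiC": 80.0}
model = {"BoolQ": 77.4, "CB": 79.7, "COPA": 70.6, "WiC": 69.5}

def headroom(human, model):
    """Return {task: human - model}, sorted by remaining headroom, largest first."""
    gaps = {t: round(human[t] - model[t], 1) for t in human}
    return dict(sorted(gaps.items(), key=lambda kv: -kv[1]))

for task, gap in headroom(human, model).items():
    print(f"{task:7s} headroom {gap:+.1f}")
```

Sorting by gap makes the "stickier" claim directly inspectable: tasks where the gap stays large are the ones restoring headroom.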

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition that SuperGLUE offers greater headroom and diagnostic value for general language understanding.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes SuperGLUE motivated by the external empirical observation that GLUE performance has surpassed non-expert human levels. No derivation chain, equations, fitted parameters, or predictions are present. The central premise relies on publicly verifiable model results rather than on self-citation that would reduce the argument to unverified inputs by construction. No self-definitional, fitted-input, uniqueness-imported, or ansatz-smuggled steps appear. The work is a benchmark proposal and toolkit release whose motivation rests on external performance data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on the domain assumption that performance on a curated set of tasks is a valid proxy for general language understanding capability.

axioms (1)
  • domain assumption: Language understanding can be meaningfully summarized by aggregate performance on a diverse but fixed set of tasks.
    This is the foundational premise for any GLUE-style benchmark and is invoked to justify creating SuperGLUE once GLUE is saturated.

pith-pipeline@v0.9.0 · 5426 in / 1200 out tokens · 32667 ms · 2026-05-15T01:28:06.644884+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Measuring Massive Multitask Language Understanding

    cs.CY 2020-09 accept novelty 8.0

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  2. Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

    cs.LG 2026-05 unverdicted novelty 7.0

    Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.

  3. Language Is Not All You Need: Aligning Perception with Language Models

    cs.CL 2023-02 conditional novelty 7.0

    Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

  4. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    cs.CL 2020-05 accept novelty 7.0

    RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

  5. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  6. PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

    cs.CL 2026-05 unverdicted novelty 6.0

    PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

  7. SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

    cs.LG 2026-05 unverdicted novelty 6.0

    SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.

  8. Defending Against Indirect Prompt Injection Attacks With Spotlighting

    cs.CR 2024-03 unverdicted novelty 6.0

    Spotlighting prompt transformations cut indirect prompt injection success rates from >50% to <2% on GPT models while preserving task performance.

  9. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  10. Retentive Network: A Successor to Transformer for Large Language Models

    cs.CL 2023-07 unverdicted novelty 6.0

    RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

  11. Kosmos-2: Grounding Multimodal Large Language Models to the World

    cs.CL 2023-06 unverdicted novelty 6.0

    Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

  12. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  13. Ethical and social risks of harm from Language Models

    cs.CL 2021-12 accept novelty 6.0

    The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...

  14. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  15. HuggingFace's Transformers: State-of-the-art Natural Language Processing

    cs.CL 2019-10 accept novelty 6.0

    Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.

  16. Complexity Horizons of Compressed Models in Analog Circuit Analysis

    cs.AI 2026-05 unverdicted novelty 5.0

    Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.

  17. Uncertainty-Aware Transformers: Conformal Prediction for Language Models

    cs.LG 2026-04 unverdicted novelty 5.0

    CONFIDE applies conformal prediction to transformer embeddings for valid prediction sets, improving accuracy up to 4.09% and efficiency over baselines on models like BERT-tiny.

  18. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  19. Detecting Language Model Attacks with Perplexity

    cs.CL 2023-08 unverdicted novelty 5.0

    Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.

  20. PaLM 2 Technical Report

    cs.CL 2023-05 unverdicted novelty 5.0

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

  21. RoBERTa: A Robustly Optimized BERT Pretraining Approach

    cs.CL 2019-07 accept novelty 5.0

    With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.

  22. GLU Variants Improve Transformer

    cs.LG 2020-02 unverdicted novelty 4.0

    Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

  23. Scaling Laws for Neural Language Models

    cs.LG 2020-01

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · cited by 23 Pith papers · 16 internal anchors
