Adversarial Examples for Evaluating Reading Comprehension Systems

Percy Liang; Robin Jia

arxiv: 1707.07328 · v1 · pith:ZHU6TP4Wnew · submitted 2017-07-23 · 💻 cs.CL · cs.LG

keywords: systems, accuracy, adversarial, language, models, answer, average, comprehension

Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of $75\%$ F1 score to $36\%$; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to $7\%$. We hope our insights will motivate the development of new models that understand language more precisely.
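
The evaluation described in the abstract can be illustrated with a minimal sketch: score a model's answers with token-level F1 on the original paragraphs, then again after a distracting sentence has been added, and compare the averages. Everything below is an assumption made for illustration, not the authors' code: the names `token_f1`, `add_distractor`, `mean_f1`, and `toy_model` are hypothetical, the example data is invented, the distractor is hand-written rather than automatically generated, and it is appended at the end of the paragraph for simplicity. The F1 here is also simplified; the official SQuAD metric additionally normalizes punctuation and articles and takes the maximum over several gold answers.

```python
from collections import Counter
from typing import Callable, Dict, List


def token_f1(prediction: str, gold: str) -> float:
    """Simplified token-overlap F1 between a predicted and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def add_distractor(paragraph: str, distractor: str) -> str:
    """Append a distracting sentence to the paragraph (placement is an assumption)."""
    return paragraph.rstrip() + " " + distractor


def mean_f1(
    model: Callable[[str, str], str],
    examples: List[Dict[str, str]],
    with_distractor: bool,
) -> float:
    """Average F1 of `model` over `examples`, optionally on perturbed paragraphs."""
    scores = []
    for ex in examples:
        paragraph = ex["paragraph"]
        if with_distractor:
            paragraph = add_distractor(paragraph, ex["distractor"])
        scores.append(token_f1(model(paragraph, ex["question"]), ex["answer"]))
    return sum(scores) / len(scores)


def toy_model(paragraph: str, question: str) -> str:
    """Hypothetical shallow model: returns the token after the last 'to'."""
    tokens = paragraph.split()
    answer = ""
    for i, tok in enumerate(tokens[:-1]):
        if tok == "to":
            answer = tokens[i + 1]
    return answer


# Invented example; the distractor leaves the correct answer unchanged.
EXAMPLES = [
    {
        "paragraph": "Tesla moved to Prague in 1880.",
        "question": "What city did Tesla move to in 1880?",
        "answer": "Prague",
        "distractor": "Tadakatsu moved to Chicago in 1881.",
    }
]

if __name__ == "__main__":
    print("original F1:   ", mean_f1(toy_model, EXAMPLES, with_distractor=False))
    print("adversarial F1:", mean_f1(toy_model, EXAMPLES, with_distractor=True))
```

The toy model answers correctly on the original paragraph but latches onto the distractor once it is added, which is the kind of shallow pattern matching an adversarial evaluation of this sort is meant to expose. A real evaluation would substitute a trained reading comprehension system for `toy_model` and the SQuAD development set for `EXAMPLES`.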

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Universal and Transferable Adversarial Attacks on Aligned Language Models
cs.CL 2023-07 accept novelty 8.0

Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 7.0

Presents the first online Learning-to-Defer algorithm achieving regret O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
cs.CL 2026-04 unverdicted novelty 7.0

Scaling multiple-choice questions to 100 options on a Korean error detection task shows that LLM performance on conventional benchmarks overstates true competence due to shortcut strategies.
Adversarial Robustness in One-Stage Learning-to-Defer
stat.ML 2025-10 unverdicted novelty 7.0

Develops the first adversarial robustness framework for one-stage learning-to-defer, including cost-sensitive surrogate losses and theoretical consistency guarantees for classification and regression.
ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?
cs.SE 2025-09 unverdicted novelty 7.0

ReDef creates a revert-anchored dataset of 3,164 defective and 10,268 clean code modifications and shows that code language models perform better with diff encodings but maintain stable performance under counterfactua...
Machine Reading Comprehension: a Literature Review
cs.CL 2019-06 unverdicted novelty 1.0

A 2019 survey of machine reading comprehension corpora and methods.