Adversarial Examples for Evaluating Reading Comprehension Systems

Robin Jia, Percy Liang · 2017 · DOI 10.18653/v1/d17-1215

5 Pith papers cite this work.

representative citing papers

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

cs.CL · 2019-05-24 · accept · novelty 7.0

BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT trained on MultiNLI and then fine-tuned on BoolQ reaches 80.4% accuracy against 90% for human annotators.

Low-Resource Languages Jailbreak GPT-4

cs.CL · 2023-10-03 · conditional · novelty 6.0

Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

cs.CL · 2019-05-02 · accept · novelty 6.0

SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

cs.LG · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.

citing papers explorer

Showing 5 of 5 citing papers.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · none · ref 195

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

cs.CL · 2019-05-24 · accept · none · ref 13

BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT trained on MultiNLI and then fine-tuned on BoolQ reaches 80.4% accuracy against 90% for human annotators.

Low-Resource Languages Jailbreak GPT-4

cs.CL · 2023-10-03 · conditional · none · ref 22

Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

cs.CL · 2019-05-02 · accept · none · ref 108

SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

cs.LG · 2026-05-07 · unverdicted · none · ref 59 · 2 links

Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
