Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models

2021 · arXiv 2111.02840

10 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1 · dataset 1 · other 1

citation-polarity summary

unclear 2 · use dataset 1

representative citing papers

Universal and Transferable Adversarial Attacks on Aligned Language Models

cs.CL · 2023-07-27 · accept · novelty 8.0

Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

An automatic numeric-remapping attack generator reveals 12-26 point accuracy drops on GSM8K for three LLMs while MAWPS and MultiArith stay near 98%.

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI

cs.CR · 2025-07-08 · unverdicted · novelty 6.0

Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

Reflector internalizes step-wise self-reflection in LLMs via teacher-guided SFT then RL with outcome and validity rewards, claiming over 90% defense success against indirect jailbreaks plus utility gains like 5.85% on GSM8K.

PRA-RAG: Provably Robust Aggregation in Retrieval-Augmented Generation against Retrieval Corruption

cs.IR · 2026-05-08 · unverdicted · novelty 5.0

PRA-RAG is a new aggregation algorithm for RAG that claims provable robustness bounds against poisoned retrieved texts and reduces attack success rate to 1% while keeping 71% accuracy.

Understanding the Prompt Sensitivity

cs.CL · 2026-04-20 · unverdicted · novelty 5.0

LLMs disperse meaning-preserving prompts internally instead of clustering them, which produces an excessively high upper bound on output log-probability differences via Taylor expansion and Cauchy-Schwarz.

TrustLLM: Trustworthiness in Large Language Models

cs.CL · 2024-01-10 · unverdicted · novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs, showing that proprietary models generally lead, that some open-source ones come close, and that over-calibration can hurt utility.

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

cs.AI · 2026-06-08 · unverdicted · novelty 4.0

Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.

Benchmark Data Contamination of Large Language Models: A Survey

cs.CL · 2024-06-06 · unverdicted · novelty 3.0

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.

citing papers explorer

Showing 10 of 10 citing papers.

Universal and Transferable Adversarial Attacks on Aligned Language Models · cs.CL · 2023-07-27 · accept · none · ref 24
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks · cs.CR · 2026-06-02 · unverdicted · none · ref 30
An automatic numeric-remapping attack generator reveals 12-26 point accuracy drops on GSM8K for three LLMs while MAWPS and MultiArith stay near 98%.
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades · cs.SE · 2026-05-14 · unverdicted · none · ref 35
SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI · cs.CR · 2025-07-08 · unverdicted · none · ref 76
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak · cs.LG · 2026-05-20 · unverdicted · none · ref 32
Reflector internalizes step-wise self-reflection in LLMs via teacher-guided SFT then RL with outcome and validity rewards, claiming over 90% defense success against indirect jailbreaks plus utility gains like 5.85% on GSM8K.
PRA-RAG: Provably Robust Aggregation in Retrieval-Augmented Generation against Retrieval Corruption · cs.IR · 2026-05-08 · unverdicted · none · ref 35
PRA-RAG is a new aggregation algorithm for RAG that claims provable robustness bounds against poisoned retrieved texts and reduces attack success rate to 1% while keeping 71% accuracy.
Understanding the Prompt Sensitivity · cs.CL · 2026-04-20 · unverdicted · none · ref 64
LLMs disperse meaning-preserving prompts internally instead of clustering them, which produces an excessively high upper bound on output log-probability differences via Taylor expansion and Cauchy-Schwarz.
TrustLLM: Trustworthiness in Large Language Models · cs.CL · 2024-01-10 · unverdicted · none · ref 267
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization · cs.AI · 2026-06-08 · unverdicted · none · ref 91
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
Benchmark Data Contamination of Large Language Models: A Survey · cs.CL · 2024-06-06 · unverdicted · none · ref 149
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
