RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

Chris Ngo; Quy-Anh Dang; Truong-Son Hy

arxiv: 2601.03699 · v2 · submitted 2026-01-07 · 💻 cs.CL

Pith reviewed 2026-05-16 17:01 UTC · model grok-4.3

classification 💻 cs.CL

keywords red teaming · LLM safety · adversarial prompts · benchmark dataset · risk taxonomy · vulnerability assessment · refusal prompts · model robustness

The pith

RedBench aggregates 37 existing red teaming datasets into one standardized collection of 29,362 samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are increasingly used in safety-critical settings, making it essential to test their resistance to harmful or adversarial inputs. Existing red teaming datasets have inconsistent risk labels, limited coverage, and outdated evaluations, which makes it hard to compare results across studies. RedBench addresses this by combining 37 prior datasets into a single resource with a common taxonomy of 22 risk categories and 19 domains. The result is a large, unified set of 29,362 attack and refusal prompts that supports systematic testing of modern LLMs. The authors also release baselines and open-source code to encourage further work on safer models.

Core claim

The central contribution is the creation of RedBench, a universal dataset that aggregates 37 benchmark datasets from leading conferences and repositories. It contains 29,362 samples spanning attack and refusal prompts. By applying a standardized taxonomy with 22 risk categories and 19 domains, RedBench enables consistent and comprehensive evaluations of LLM vulnerabilities. The work includes analysis of existing datasets, baseline results for current models, and the release of the dataset and evaluation code.

What carries the argument

The RedBench dataset, which standardizes 37 prior red teaming collections under a unified taxonomy of 22 risk categories and 19 domains to support consistent LLM safety evaluations.
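As a rough illustration of what standardizing heterogeneous sources can involve, the sketch below normalizes records from two made-up source formats into one schema. The field names, source names, category strings, and the `RISK_MAP` table are all hypothetical; none of them are taken from the RedBench release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnifiedSample:
    """One record in a hypothetical unified schema (not the RedBench format)."""
    prompt: str
    prompt_type: str    # "attack" or "refusal"
    risk_category: str  # label in the unified taxonomy
    domain: str
    source: str         # name of the originating dataset


# Hypothetical mapping from (source, original label) to a unified risk label.
RISK_MAP = {
    ("source_a", "hate"): "hate_speech",
    ("source_b", "toxicity/hate"): "hate_speech",
}


def normalize(source, raw, text_key, label_key, prompt_type, domain):
    """Convert one raw record into a UnifiedSample, failing loudly on unmapped labels."""
    key = (source, raw[label_key])
    if key not in RISK_MAP:
        raise KeyError(f"no unified category defined for {key}")
    return UnifiedSample(raw[text_key], prompt_type, RISK_MAP[key], domain, source)


# Two sources with different field names and label vocabularies.
a = normalize("source_a", {"text": "example prompt 1", "category": "hate"},
              "text", "category", "attack", "general")
b = normalize("source_b", {"question": "example prompt 2", "label": "toxicity/hate"},
              "question", "label", "attack", "general")
```

Raising on an unmapped label, rather than falling back to a default category, keeps gaps in the mapping visible during curation.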

Load-bearing premise

The original datasets can be re-categorized under the new 22-risk and 19-domain taxonomy without creating major overlaps, inconsistencies, or loss of critical information from the source material.

What would settle it

A detailed review showing that a substantial portion of samples receive ambiguous or conflicting labels when mapped to the new taxonomy, or that key details from the original datasets are omitted.
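One way such a review could be quantified is sketched below: given label assignments from one or more labeling passes, it reports the share of samples that end up with more than one distinct risk category. The input format, sample IDs, and category names are assumptions made for illustration, and what counts as a "substantial portion" is left to the reviewer.

```python
from collections import defaultdict


def ambiguity_report(labels):
    """labels: iterable of (sample_id, risk_category) pairs.

    A sample may appear several times, e.g. once per labeling pass.
    Returns (share of samples with more than one distinct category,
    dict mapping each ambiguous sample_id to its sorted categories).
    """
    by_sample = defaultdict(set)
    for sample_id, category in labels:
        by_sample[sample_id].add(category)
    ambiguous = {sid: sorted(cats) for sid, cats in by_sample.items() if len(cats) > 1}
    rate = len(ambiguous) / len(by_sample) if by_sample else 0.0
    return rate, ambiguous


# Toy data: sample "s2" receives two conflicting categories.
toy = [
    ("s1", "fraud"),
    ("s2", "malware"),
    ("s2", "fraud"),
    ("s3", "self_harm"),
    ("s4", "fraud"),
]
rate, ambiguous = ambiguity_report(toy)
print(rate, ambiguous)
```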

Original abstract

As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: https://github.com/knoveleng/redeval

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RedBench pulls together 37 prior datasets into one 29k-sample collection with a 22-category taxonomy, which is practically useful for standardization, but the re-labeling step lacks the validation details needed to confirm consistency.

RedBench aggregates existing red teaming datasets into one collection with a standardized taxonomy. That's the core contribution here: 37 sources turned into 29,362 samples across 22 risk categories and 19 domains, with the data and evaluation code released openly. They also run baselines on current LLMs and summarize the landscape of earlier datasets.

This kind of consolidation can make it easier for researchers to run comparable tests instead of hunting down scattered resources. The open release itself is the part that stands to see the most use, since people working on LLM safety often need ready-made attack and refusal prompts in one place. The taxonomy aims to cover a broad set of risks and domains, which addresses the inconsistency problem the authors flag in prior work.

On the soft side, the re-categorization into the new 22 categories is presented without supporting numbers on inter-annotator agreement, overlap between original and new labels, or a mapping table. If many samples fit multiple categories or if domain-specific details get collapsed, the promise of consistent evaluations rests on an assumption that isn't quantified in the write-up. That gap is noticeable because the central claim is about the taxonomy enabling better assessments.

This paper is aimed at people who build or run red-teaming evaluations for LLMs. A reader who needs a large, unified benchmark to test refusal behavior or vulnerabilities would get direct value from the released dataset, even if they tweak the categories for their own experiments. It deserves peer review because dataset papers like this improve when referees can check the curation process and suggest concrete fixes for the validation steps. I'd send it forward rather than desk reject.

Referee Report

1 major / 0 minor

Summary. The manuscript presents RedBench as a universal dataset for red teaming LLMs, aggregating 37 existing benchmarks into 29,362 samples across attack and refusal prompts. It introduces a standardized taxonomy consisting of 22 risk categories and 19 domains to support consistent evaluations, provides analysis of existing datasets along with baselines for modern LLMs, and releases the dataset and evaluation code.

Significance. If the re-categorization into the new taxonomy can be shown to preserve original distinctions without substantial overlaps or information loss, RedBench would offer a useful resource for systematic LLM vulnerability assessment by enabling direct comparisons across models and benchmarks. The open-sourcing of the dataset and code is a clear strength that supports reproducibility.

major comments (1)

Abstract: The central claim that the new taxonomy 'enables consistent and comprehensive evaluations' rests on the aggregation and re-categorization of the 37 source datasets, yet the manuscript provides no explicit mapping table, inter-annotator agreement scores, or quantitative overlap statistics between original and new categories. This directly impacts the validity of the consistency promise, as noted in the stress-test concern regarding unquantified overlaps or loss of domain-specific nuances.
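For reference, the agreement score requested here is commonly reported as Cohen's kappa. The sketch below computes it from scratch, assuming two annotators independently assign one category per sample on a shared subset; the label lists are toy data, not RedBench annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who each label the same samples once."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of samples where both annotators agree.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


annotator_1 = ["fraud", "fraud", "malware", "malware"]
annotator_2 = ["fraud", "malware", "malware", "malware"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

With 22 categories, raw percent agreement can look high by chance for frequent labels, which is why a chance-corrected score is the usual request.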

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. The concern about transparency in the taxonomy re-categorization is valid and can be addressed through targeted revisions that add explicit documentation without altering the core contributions.

Referee: Abstract: The central claim that the new taxonomy 'enables consistent and comprehensive evaluations' rests on the aggregation and re-categorization of the 37 source datasets, yet the manuscript provides no explicit mapping table, inter-annotator agreement scores, or quantitative overlap statistics between original and new categories. This directly impacts the validity of the consistency promise, as noted in the stress-test concern regarding unquantified overlaps or loss of domain-specific nuances.

Authors: We agree that greater transparency on the re-categorization process would strengthen the manuscript. Section 3 describes the taxonomy construction by consolidating categories from the 37 source datasets according to established safety frameworks (e.g., aligning with prior work on risk taxonomies), but the current version lacks an explicit mapping table and overlap statistics. In the revised manuscript we will add a comprehensive mapping table in the appendix showing how each original category maps to the 22 risk categories and 19 domains, along with quantitative statistics on sample distribution, overlap counts, and any identified loss of domain-specific nuances. The taxonomy development followed an iterative collaborative protocol among the authors rather than independent multi-annotator labeling, so traditional inter-annotator agreement scores do not apply; we will expand the methods description to detail the protocol and decision rules used. These additions will directly support the consistency claim while preserving the original dataset distinctions.

Revision: yes
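A mapping table with overlap counts of the kind discussed in this exchange could be produced along the following lines. This is a minimal sketch over hypothetical `(source, original_label, new_category)` triples; it does not reflect the authors' actual protocol or data.

```python
from collections import Counter, defaultdict


def mapping_table(rows):
    """rows: (source, original_label, new_category) triples, one per sample.

    Returns (counts per triple, original labels that split across
    more than one new category).
    """
    counts = Counter(rows)
    targets = defaultdict(set)
    for source, original, new in counts:
        targets[(source, original)].add(new)
    split = {key: sorted(cats) for key, cats in targets.items() if len(cats) > 1}
    return counts, split


# Toy rows: the original label "harm" in src_a splits across two new categories.
rows = [
    ("src_a", "harm", "violence"),
    ("src_a", "harm", "violence"),
    ("src_a", "harm", "self_harm"),
    ("src_b", "scam", "fraud"),
]
counts, split = mapping_table(rows)
for (source, original, new), n in sorted(counts.items()):
    print(f"{source:6} {original:6} -> {new:10} {n}")
print("split labels:", split)
```

Original labels that appear in `split` are the ones where the unified taxonomy draws distinctions the source did not, so they are natural candidates for manual review.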

Circularity Check

0 steps flagged

No circularity: dataset aggregation with no derivations or self-referential predictions

The paper's core contribution is the release of RedBench as an aggregated collection of 37 existing datasets (29,362 samples) under a new 22-category/19-domain taxonomy. No equations, fitted parameters, predictions, or first-principles derivations appear in the manuscript. The taxonomy standardization is presented as an explicit curation step rather than a derived result that reduces to its inputs by construction. Self-citations, if any, are not load-bearing for any central claim. The work is self-contained as a data resource release and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that disparate red teaming datasets can be unified under one taxonomy without major fidelity loss; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Existing red teaming datasets can be aggregated and re-categorized under a unified taxonomy without substantial loss of fidelity.
This is the core premise invoked when constructing RedBench from 37 sources.

pith-pipeline@v0.9.0 · 5452 in / 1235 out tokens · 48955 ms · 2026-05-16T17:01:51.998211+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing
cs.AI · 2026-06 · unverdicted · novelty 7.0

SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.