arxiv: 2601.03699 · v2 · submitted 2026-01-07 · 💻 cs.CL

RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

Pith reviewed 2026-05-16 17:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords red teaming, LLM safety, adversarial prompts, benchmark dataset, risk taxonomy, vulnerability assessment, refusal prompts, model robustness

The pith

RedBench aggregates 37 existing red teaming datasets into one standardized collection of 29,362 samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are increasingly used in safety-critical settings, making it essential to test their resistance to harmful or adversarial inputs. Existing red teaming datasets have inconsistent risk labels, limited coverage, and outdated evaluations, which makes it hard to compare results across studies. RedBench addresses this by combining 37 prior datasets into a single resource with a common taxonomy of 22 risk categories and 19 domains. The result is a large, unified set of 29,362 attack and refusal prompts that supports systematic testing of modern LLMs. The authors also release baselines and open-source code to encourage further work on safer models.

Core claim

The central contribution is the creation of RedBench, a universal dataset that aggregates 37 benchmark datasets from leading conferences and repositories. It contains 29,362 samples spanning attack and refusal prompts. By applying a standardized taxonomy with 22 risk categories and 19 domains, RedBench enables consistent and comprehensive evaluations of LLM vulnerabilities. The work includes analysis of existing datasets, baseline results for current models, and the release of the dataset and evaluation code.

What carries the argument

The RedBench dataset, which standardizes 37 prior red teaming collections under a unified taxonomy of 22 risk categories and 19 domains to support consistent LLM safety evaluations.

Load-bearing premise

The original datasets can be re-categorized under the new 22-risk and 19-domain taxonomy without creating major overlaps, inconsistencies, or loss of critical information from the source material.

What would settle it

A detailed review showing that a substantial portion of samples receive ambiguous or conflicting labels when mapped to the new taxonomy, or that key details from the original datasets are omitted.
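
One way such a review could be quantified is the share of samples that receive more than one distinct risk label when relabeled independently. The following is a minimal sketch under stated assumptions: the `(sample_id, risk_category)` input format, the function name `ambiguous_fraction`, and the toy labels are all hypothetical and are not taken from the RedBench release.

```python
from collections import defaultdict


def ambiguous_fraction(labels):
    """Return the fraction of samples that carry more than one distinct label.

    `labels` is an iterable of (sample_id, risk_category) pairs; a sample may
    appear several times, once per independent labeling pass (assumed format).
    """
    by_sample = defaultdict(set)
    for sample_id, category in labels:
        by_sample[sample_id].add(category)
    if not by_sample:
        return 0.0
    ambiguous = sum(1 for cats in by_sample.values() if len(cats) > 1)
    return ambiguous / len(by_sample)


# Hypothetical toy labels for illustration only, not drawn from RedBench.
toy_labels = [
    ("s1", "privacy"), ("s1", "privacy"),
    ("s2", "malware"), ("s2", "fraud"),   # conflicting labels
    ("s3", "hate"),
    ("s4", "fraud"), ("s4", "fraud"),
]
rate = ambiguous_fraction(toy_labels)
```

What counts as a "substantial portion" is a judgment call that the source does not fix, so any threshold applied to this rate would itself be an assumption.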

Original abstract

As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: https://github.com/knoveleng/redeval

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents RedBench as a universal dataset for red teaming LLMs, aggregating 37 existing benchmarks into 29,362 samples across attack and refusal prompts. It introduces a standardized taxonomy consisting of 22 risk categories and 19 domains to support consistent evaluations, provides analysis of existing datasets along with baselines for modern LLMs, and releases the dataset and evaluation code.

Significance. If the re-categorization into the new taxonomy can be shown to preserve original distinctions without substantial overlaps or information loss, RedBench would offer a useful resource for systematic LLM vulnerability assessment by enabling direct comparisons across models and benchmarks. The open-sourcing of the dataset and code is a clear strength that supports reproducibility.

major comments (1)
  1. Abstract: The central claim that the new taxonomy 'enables consistent and comprehensive evaluations' rests on the aggregation and re-categorization of the 37 source datasets, yet the manuscript provides no explicit mapping table, inter-annotator agreement scores, or quantitative overlap statistics between original and new categories. This directly impacts the validity of the consistency promise, as noted in the stress-test concern regarding unquantified overlaps or loss of domain-specific nuances.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. The concern about transparency in the taxonomy re-categorization is valid and can be addressed through targeted revisions that add explicit documentation without altering the core contributions.

Point-by-point responses
  1. Referee: Abstract: The central claim that the new taxonomy 'enables consistent and comprehensive evaluations' rests on the aggregation and re-categorization of the 37 source datasets, yet the manuscript provides no explicit mapping table, inter-annotator agreement scores, or quantitative overlap statistics between original and new categories. This directly impacts the validity of the consistency promise, as noted in the stress-test concern regarding unquantified overlaps or loss of domain-specific nuances.

    Authors: We agree that greater transparency on the re-categorization process would strengthen the manuscript. Section 3 describes the taxonomy construction by consolidating categories from the 37 source datasets according to established safety frameworks (e.g., aligning with prior work on risk taxonomies), but the current version lacks an explicit mapping table and overlap statistics. In the revised manuscript we will add a comprehensive mapping table in the appendix showing how each original category maps to the 22 risk categories and 19 domains, along with quantitative statistics on sample distribution, overlap counts, and any identified loss of domain-specific nuances. The taxonomy development followed an iterative collaborative protocol among the authors rather than independent multi-annotator labeling, so traditional inter-annotator agreement scores do not apply; we will expand the methods description to detail the protocol and decision rules used. These additions will directly support the consistency claim while preserving the original dataset distinctions.

    revision: yes
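
The mapping table and overlap counts promised in the rebuttal could be derived along the following lines. This is only an illustrative sketch: the row layout `(source_dataset, original_category, unified_category)`, the function name `overlap_stats`, and every dataset and category name below are hypothetical placeholders, not the authors' actual mapping.

```python
from collections import Counter, defaultdict

# Hypothetical mapping rows: (source_dataset, original_category, unified_category).
MAPPING = [
    ("dataset_a", "hate speech", "hate"),
    ("dataset_a", "self-harm", "self_harm"),
    ("dataset_b", "toxicity", "hate"),
    ("dataset_b", "toxicity", "harassment"),  # one original split across two targets
    ("dataset_c", "phishing", "fraud"),
]


def overlap_stats(rows):
    """Summarize a category mapping.

    Returns (merged, split) where:
      merged: Counter of how many original categories feed each unified category
      split:  sorted list of original categories mapped to more than one unified one
    """
    targets = defaultdict(set)
    for source, original, unified in rows:
        targets[(source, original)].add(unified)
    merged = Counter(u for unified_set in targets.values() for u in unified_set)
    split = sorted(key for key, unified_set in targets.items() if len(unified_set) > 1)
    return merged, split


merged, split = overlap_stats(MAPPING)
```

In this sketch, `merged` shows where distinct source categories collapse into one unified category (possible loss of nuance), while `split` lists source categories whose samples end up under several unified labels (possible overlap).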

Circularity Check

0 steps flagged

No circularity: dataset aggregation with no derivations or self-referential predictions

Full rationale

The paper's core contribution is the release of RedBench as an aggregated collection of 37 existing datasets (29,362 samples) under a new 22-category/19-domain taxonomy. No equations, fitted parameters, predictions, or first-principles derivations appear in the manuscript. The taxonomy standardization is presented as an explicit curation step rather than a derived result that reduces to its inputs by construction. Self-citations, if any, are not load-bearing for any central claim. The work is self-contained as a data resource release and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that disparate red teaming datasets can be unified under one taxonomy without major fidelity loss; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Existing red teaming datasets can be aggregated and re-categorized under a unified taxonomy without substantial loss of fidelity.
    This is the core premise invoked when constructing RedBench from 37 sources.

pith-pipeline@v0.9.0 · 5452 in / 1235 out tokens · 48955 ms · 2026-05-16T17:01:51.998211+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

    cs.AI 2026-06 unverdicted novelty 7.0

    SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.