RedDebate: Safer Responses Through Multi-Agent Red Teaming Debates

Ali Asad; Radin Shayanfar; Stephen Obadinma; Xiaodan Zhu

arxiv: 2506.11083 · v3 · pith:AFD7XNBYnew · submitted 2025-06-04 · 💻 cs.CL

classification 💻 cs.CL

keywords: debate, reddebate, llms, multi-agent, safety, unsafe, across, automated

We introduce RedDebate, a novel multi-agent debate framework that provides the foundation for Large Language Models (LLMs) to identify and mitigate their unsafe behaviours. AI safety approaches often rely on costly human evaluation or isolated single-model assessment, both constrained by scalability and prone to oversight failures. RedDebate employs collaborative argumentation among multiple LLMs across diverse debate scenarios, enabling them to critically evaluate one another's reasoning and systematically uncover unsafe failure modes through fully automated red-teaming. To support this, we propose designing distinct long-term memory modules that preserve safety-relevant insights from debate interactions and leverage them during subsequent inference, facilitating continuous refinement of model behaviour. Empirical evaluation on safety benchmarks across a diverse set of models demonstrates that RedDebate substantially reduces unsafe outputs. While debate alone allows LLMs to refine their behaviour, the addition of memory yields further error reductions. To the best of our knowledge, RedDebate is the first fully automated framework to unify multi-agent debate and red-teaming to progressively enhance LLM safety without human intervention.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models
cs.AI · 2025-09 · unverdicted · novelty 6.0

AgenticEval is a multi-agent framework that ingests unstructured policies to generate and self-evolve comprehensive safety benchmarks for LLMs, with experiments showing declining safety rates as tests harden.