Openai’s approach to external red teaming for ai models and systems

Lama Ahmad, Sandhini Agarwal, Michael Lampe, Pamela Mishkin · 2025 · arXiv 2503.16431

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

cs.CR · 2026-04-30 · unverdicted · novelty 6.0

FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.

AVISE: Framework for Evaluating the Security of AI Systems

cs.CR · 2026-04-22 · unverdicted · novelty 6.0

AVISE provides a new framework and automated SET that identifies jailbreak vulnerabilities in language models with 92% accuracy, finding all nine tested models vulnerable to an augmented Red Queen attack.

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

cs.CL · 2025-10-08 · unverdicted · novelty 6.0

Red-Bandit adapts online to LLM failure modes by dynamically selecting among RL-trained LoRA attack-style experts via a bandit policy, reporting SOTA ASR@10 on AdvBench with lower-perplexity prompts.

FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism

cs.CY · 2026-04-23 · unverdicted · novelty 4.0

AI integration in newsrooms drives internal deferral of judgment to LLMs and external shifts of power to platforms, making fairness, accountability, and transparency harder to sustain unless participatory mechanisms redistribute authority.

citing papers explorer

Showing 5 of 5 citing papers.

Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems cs.AI · 2026-05-15 · unverdicted · none · ref 1
Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption cs.CR · 2026-04-30 · unverdicted · none · ref 45
FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
AVISE: Framework for Evaluating the Security of AI Systems cs.CR · 2026-04-22 · unverdicted · none · ref 34
AVISE provides a new framework and automated SET that identifies jailbreak vulnerabilities in language models with 92% accuracy, finding all nine tested models vulnerable to an augmented Red Queen attack.
Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts cs.CL · 2025-10-08 · unverdicted · none · ref 13
Red-Bandit adapts online to LLM failure modes by dynamically selecting among RL-trained LoRA attack-style experts via a bandit policy, reporting SOTA ASR@10 on AdvBench with lower-perplexity prompts.
FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism cs.CY · 2026-04-23 · unverdicted · none · ref 1
AI integration in newsrooms drives internal deferral of judgment to LLMs and external shifts of power to platforms, making fairness, accountability, and transparency harder to sustain unless participatory mechanisms redistribute authority.

Openai’s approach to external red teaming for ai models and systems

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer