MAGIC: A co-evolving attacker-defender adversarial game for robust LLM safety

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety , author= · 2026 · arXiv 2602.01539

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.

Addressing Over-Refusal in LLMs with Competing Rewards

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.

Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security

cs.CR · 2026-06-10 · unverdicted · novelty 6.0

Runtime Skill Audit introduces targeted runtime probing to detect malicious LLM agent skills, reporting 90% accuracy and resilience to self-evolving attacks on 100 skills versus static baselines.

RedEdit: Agentic Red-Teaming of Image Safety Classifiers via MCTS-Guided Photo-Editing

cs.CR · 2026-06-04 · unverdicted · novelty 6.0

RedEdit finds that fewer than two photo edits on average let 76.2% of unsafe images evade detectors while retaining 93.0% of malicious semantics.

Internalizing Safety Understanding in Large Reasoning Models via Verification

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

cs.CL · 2026-05-04 · unverdicted · novelty 4.0

This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a tagged corpus.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Addressing Over-Refusal in LLMs with Competing Rewards cs.LG · 2026-06-30 · unverdicted · none · ref 121
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.

MAGIC: A co-evolving attacker-defender adversarial game for robust LLM safety

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer