Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

· 2025 · cs.CR · arXiv 2502.05206

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

open full Pith review browse 12 citing papers arXiv PDF

abstract

The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attacks if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.

citation-role summary

background 2 other 1

citation-polarity summary

background 1 support 1 unclear 1

representative citing papers

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

MemMark: State-Evolution Attribution Watermarking for Agent Long-Term Memory Systems

cs.CR · 2026-05-24 · unverdicted · novelty 7.0

MemMark enables snapshot-only attribution for agent long-term memory by embedding signals via keyed distribution-preserving sampling at memory-write decisions, recovering 40-bit payloads with near-baseline utility.

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

cs.CR · 2026-05-19 · conditional · novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

cs.CR · 2026-04-21 · unverdicted · novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.

BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

BYORn defends autoregressive vision-language models against backdoor attacks in supervised fine-tuning by dynamically replacing semantically implausible poisoned responses with model-generated alternatives, improving robustness while preserving clean performance.

Safety, Security, and Cognitive Risks in World Models

cs.CR · 2026-04-01 · unverdicted · novelty 6.0

World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and DreamerV3.

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

cs.CL · 2025-11-16 · unverdicted · novelty 6.0

EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.

AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models

cs.AI · 2025-09-30 · unverdicted · novelty 6.0

AgenticEval is a multi-agent framework that ingests unstructured policies to generate and self-evolve comprehensive safety benchmarks for LLMs, with experiments showing declining safety rates as tests harden.

LeakyCLIP: Extracting Training Data from CLIP

cs.CR · 2025-08-01 · conditional · novelty 6.0

LeakyCLIP reconstructs images from CLIP embeddings with over 258% SSIM gain versus baselines and enables membership inference from reconstruction metrics on LAION-2B data.

RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion

cs.CV · 2025-03-08 · unverdicted · novelty 6.0

RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA with transfer to other models.

SoK: Robustness in Large Language Models against Jailbreak Attacks

cs.CR · 2026-05-06 · accept · novelty 5.0

The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

cs.AI · 2026-05-17 · unverdicted · novelty 2.0

A survey that maps risks along the agent workflow and consolidates metrics and benchmarks for safety, robustness, privacy, and security in agentic AI.

citing papers explorer

Showing 12 of 12 citing papers.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents cs.CY · 2026-04-11 · accept · none · ref 33 · internal anchor
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
MemMark: State-Evolution Attribution Watermarking for Agent Long-Term Memory Systems cs.CR · 2026-05-24 · unverdicted · none · ref 11 · internal anchor
MemMark enables snapshot-only attribution for agent long-term memory by embedding signals via keyed distribution-preserving sampling at memory-write decisions, recovering 40-bit payloads with near-baseline utility.
Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models cs.CR · 2026-05-19 · conditional · none · ref 42 · internal anchor
ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety cs.CR · 2026-04-21 · unverdicted · none · ref 166 · internal anchor
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.
BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks cs.LG · 2026-06-01 · unverdicted · none · ref 92 · internal anchor
BYORn defends autoregressive vision-language models against backdoor attacks in supervised fine-tuning by dynamically replacing semantically implausible poisoned responses with model-generated alternatives, improving robustness while preserving clean performance.
Safety, Security, and Cognitive Risks in World Models cs.CR · 2026-04-01 · unverdicted · none · ref 54 · internal anchor
World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and DreamerV3.
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs cs.CL · 2025-11-16 · unverdicted · none · ref 28 · internal anchor
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models cs.AI · 2025-09-30 · unverdicted · none · ref 9 · internal anchor
AgenticEval is a multi-agent framework that ingests unstructured policies to generate and self-evolve comprehensive safety benchmarks for LLMs, with experiments showing declining safety rates as tests harden.
LeakyCLIP: Extracting Training Data from CLIP cs.CR · 2025-08-01 · conditional · none · ref 28 · internal anchor
LeakyCLIP reconstructs images from CLIP embeddings with over 258% SSIM gain versus baselines and enables membership inference from reconstruction metrics on LAION-2B data.
RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion cs.CV · 2025-03-08 · unverdicted · none · ref 24 · internal anchor
RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA with transfer to other models.
SoK: Robustness in Large Language Models against Jailbreak Attacks cs.CR · 2026-05-06 · accept · none · ref 49 · internal anchor
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security cs.AI · 2026-05-17 · unverdicted · none · ref 87 · internal anchor
A survey that maps risks along the agent workflow and consolidates metrics and benchmarks for safety, robustness, privacy, and security in agentic AI.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer