Safeguarding Large Language Models: A Survey

Changshun Wu; Gaojie Jin; Jie Meng; Jinwei Hu; Ronghui Mu; Saddek Bensalem; Siqi Sun; Tianle Zhang; Xiaowei Huang; Yanghao Zhang

arXiv: 2406.02622 · v1 · pith:IQJXPRRFnew · submitted 2024-06-03 · cs.CR · cs.AI


Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang



keywords: current, mechanism, techniques, attacks, challenges, comprehensive, ethical, guardrail


Abstract

In the burgeoning field of Large Language Models (LLMs), developing a robust safety mechanism, colloquially known as "safeguards" or "guardrails", has become imperative to ensure the ethical use of LLMs within prescribed boundaries. This article provides a systematic literature review of the current status of this critical mechanism. It discusses the major challenges the mechanism faces and how it can be enhanced into a comprehensive mechanism that deals with ethical issues in various contexts. First, the paper elucidates the current landscape of safeguarding mechanisms that major LLM service providers and the open-source community employ. It then covers techniques to evaluate, analyze, and enhance the (un)desirable properties that a guardrail might need to address, such as hallucinations, fairness, and privacy. Building on these, we review techniques to circumvent these controls (i.e., attacks), to defend against such attacks, and to reinforce the guardrails. While the techniques mentioned above represent the current status and active research trends, we also discuss several challenges that these methods cannot easily handle, and present our vision of how to implement a comprehensive guardrail through full consideration of a multi-disciplinary approach, neural-symbolic methods, and the systems development lifecycle.



Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
cs.AI · 2025-03 · unverdicted · novelty 6.0
AgentSpec introduces a customizable DSL for runtime enforcement of safety constraints on LLM agents, achieving over 90% prevention of unsafe code actions, zero hazardous embodied actions, and 100% AV compliance in eva...

ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails
cs.CL · 2026-05 · unverdicted · novelty 5.0
ConsisGuard is a consistency-aware framework that applies Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment to improve policy execution consistency in reasoning-based LLM guardrails on harmf...

Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
cs.LG · 2026-04 · unverdicted · novelty 5.0
Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation
cs.CL · 2026-02 · unverdicted · novelty 5.0
Bielik Guard delivers compact Polish safety classifiers with F1 scores near 0.79 and superior real-prompt precision over baselines.

From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI
cs.AI · 2026-04 · unverdicted · novelty 4.0
The paper presents a layered method to translate governance objectives from standards such as ISO/IEC 42001 into four control layers for agentic AI, with runtime guardrails limited to observable, determinate, and time...

LLM Harms: A Taxonomy and Discussion
cs.CY · 2025-12 · unverdicted · novelty 3.0
This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.

LLM-Powered AI Agent Systems and Their Applications in Industry
cs.AI · 2025-05 · unverdicted · novelty 2.0
A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
cs.CR · 2024-09 · unverdicted · novelty 2.0
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.