arXiv: 2406.02622 · v1 · pith:IQJXPRRFnew · submitted 2024-06-03 · cs.CR · cs.AI

Safeguarding Large Language Models: A Survey

keywords: current, mechanism, techniques, attacks, challenges, comprehensive, ethical, guardrail
In the burgeoning field of Large Language Models (LLMs), developing a robust safety mechanism, colloquially known as "safeguards" or "guardrails", has become imperative to ensure the ethical use of LLMs within prescribed boundaries. This article provides a systematic literature review on the current status of this critical mechanism. It discusses its major challenges and how it can be enhanced into a comprehensive mechanism dealing with ethical issues in various contexts. First, the paper elucidates the current landscape of safeguarding mechanisms that major LLM service providers and the open-source community employ. This is followed by the techniques to evaluate, analyze, and enhance some (un)desirable properties that a guardrail might want to address, such as hallucination, fairness, and privacy. Building on these, we review techniques to circumvent these controls (i.e., attacks), to defend against the attacks, and to reinforce the guardrails. While the techniques mentioned above represent the current status and the active research trends, we also discuss several challenges that cannot be easily dealt with by these methods and present our vision of how to implement a comprehensive guardrail through full consideration of a multi-disciplinary approach, neural-symbolic methods, and the systems development lifecycle.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

    cs.AI 2025-03 unverdicted novelty 6.0

    AgentSpec introduces a customizable DSL for runtime enforcement of safety constraints on LLM agents, achieving over 90% prevention of unsafe code actions, zero hazardous embodied actions, and 100% AV compliance in eva...

  2. ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

    cs.CL 2026-05 unverdicted novelty 5.0

    ConsisGuard is a consistency-aware framework that applies Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment to improve policy execution consistency in reasoning-based LLM guardrails on harmf...

  3. Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

  4. Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation

    cs.CL 2026-02 unverdicted novelty 5.0

    Bielik Guard delivers compact Polish safety classifiers with F1 scores near 0.79 and superior real-prompt precision over baselines.

  5. From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI

    cs.AI 2026-04 unverdicted novelty 4.0

    The paper presents a layered method to translate governance objectives from standards such as ISO/IEC 42001 into four control layers for agentic AI, with runtime guardrails limited to observable, determinate, and time...

  6. LLM Harms: A Taxonomy and Discussion

    cs.CY 2025-12 unverdicted novelty 3.0

    This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.

  7. LLM-Powered AI Agent Systems and Their Applications in Industry

    cs.AI 2025-05 unverdicted novelty 2.0

    A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.

  8. Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

    cs.CR 2024-09 unverdicted novelty 2.0

    Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.