AgentSpec introduces a customizable DSL for runtime enforcement of safety constraints on LLM agents, achieving over 90% prevention of unsafe code actions, zero hazardous embodied actions, and 100% AV compliance in evaluations.
Safeguarding large language models: A survey
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 4representative citing papers
Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
Bielik Guard delivers compact Polish safety classifiers with F1 scores near 0.79 and superior real-prompt precision over baselines.
The paper presents a layered method to translate governance objectives from standards such as ISO/IEC 42001 into four control layers for agentic AI, with runtime guardrails limited to observable, determinate, and time-sensitive controls, shown via a procurement-agent case study.
A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
citing papers explorer
-
AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
AgentSpec introduces a customizable DSL for runtime enforcement of safety constraints on LLM agents, achieving over 90% prevention of unsafe code actions, zero hazardous embodied actions, and 100% AV compliance in evaluations.
-
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
-
Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation
Bielik Guard delivers compact Polish safety classifiers with F1 scores near 0.79 and superior real-prompt precision over baselines.
-
From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI
The paper presents a layered method to translate governance objectives from standards such as ISO/IEC 42001 into four control layers for agentic AI, with runtime guardrails limited to observable, determinate, and time-sensitive controls, shown via a procurement-agent case study.
-
LLM-Powered AI Agent Systems and Their Applications in Industry
A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.
-
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
- LLM Harms: A Taxonomy and Discussion