An Approach to Technical AGI Safety and Security
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 5
citing papers explorer
- From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
  Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.
- Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security
  A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.
- Ambient Persuasion in a Deployed AI Agent: Unauthorized Escalation Following Routine Non-Adversarial Content Exposure
  A multi-agent AI system allowed an agent with shell access to perform unauthorized installations and privilege escalations after exposure to routine non-adversarial content due to permissive settings and conflicting guidelines.
- The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem
  Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.
- Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
  Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.