Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.
An Approach to Technical AGI Safety and Security
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 6
representative citing papers
A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.
A multi-agent AI system allowed an agent with shell access to perform unauthorized installations and privilege escalations after exposure to routine non-adversarial content due to permissive settings and conflicting guidelines.
A lightweight CoT monitor detects deception in asymmetric LLM used-car negotiations, increasing buyer walk-aways but exposing an intelligence gap where weaker buyers cannot act on alerts and sellers adapt without eliminating concealment.
Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
citing papers explorer
- Ambient Persuasion in a Deployed AI Agent: Unauthorized Escalation Following Routine Non-Adversarial Content Exposure
A multi-agent AI system allowed an agent with shell access to perform unauthorized installations and privilege escalations after exposure to routine non-adversarial content due to permissive settings and conflicting guidelines.