An Approach to Technical AGI Safety and Security

2025 · arXiv 2504.01849

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails

cs.AI · 2025-10-15 · unverdicted · novelty 6.0

Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.

Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

cs.CY · 2026-05-20 · unverdicted · novelty 5.0

A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.

Ambient Persuasion in a Deployed AI Agent: Unauthorized Escalation Following Routine Non-Adversarial Content Exposure

cs.CR · 2026-04-29 · unverdicted · novelty 5.0

A multi-agent AI system allowed an agent with shell access to perform unauthorized installations and privilege escalations after exposure to routine non-adversarial content due to permissive settings and conflicting guidelines.

The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem

cs.AI · 2026-04-16 · unverdicted · novelty 4.0

Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

cs.CL · 2025-07-07 · unverdicted · novelty 4.0

Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.

citing papers explorer

From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
cs.AI · 2025-10-15 · unverdicted · none · ref 13

Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security
cs.CY · 2026-05-20 · unverdicted · none · ref 9

Ambient Persuasion in a Deployed AI Agent: Unauthorized Escalation Following Routine Non-Adversarial Content Exposure
cs.CR · 2026-04-29 · unverdicted · none · ref 2

The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem
cs.AI · 2026-04-16 · unverdicted · none · ref 18

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
cs.CL · 2025-07-07 · unverdicted · none · ref 74
