An Approach to Technical AGI Safety and Security

2025 · arXiv 2504.01849

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails

cs.AI · 2025-10-15 · unverdicted · novelty 6.0

Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.

Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

cs.CY · 2026-05-20 · unverdicted · novelty 5.0

A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.

Ambient Persuasion in a Deployed AI Agent: Unauthorized Escalation Following Routine Non-Adversarial Content Exposure

cs.CR · 2026-04-29 · unverdicted · novelty 5.0

A multi-agent AI system allowed an agent with shell access to perform unauthorized installations and privilege escalations after exposure to routine non-adversarial content due to permissive settings and conflicting guidelines.

Thinking Out Loud: Real-Time Deception Monitoring in Asymmetric LLM Negotiations

cs.CY · 2026-06-13 · unverdicted · novelty 4.0

A lightweight CoT monitor detects deception in asymmetric LLM used-car negotiations, increasing buyer walk-aways but exposing an intelligence gap where weaker buyers cannot act on alerts and sellers adapt without eliminating concealment.

The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem

cs.AI · 2026-04-16 · unverdicted · novelty 4.0

Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

cs.CL · 2025-07-07 · unverdicted · novelty 4.0

Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
