Agentic Misalignment: How LLMs Could Be Insider Threats
7 Pith papers cite this work. Polarity classification is still indexing.
Citation summary — years: 2026 (7); verdicts: UNVERDICTED (7); roles: background (1); polarities: background (1).

Representative citing papers
- Emotion Concepts and their Function in a Large Language Model. Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors, without implying subjective experience.
- When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks. Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
- Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture. A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.
- Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling. Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78%, and probe accuracy scales with model size across 0.5B to 176B parameter models.
- The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious. Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.
- Agentic Microphysics: A Manifesto for Generative AI Safety. The authors introduce agentic microphysics and generative safety, a causally explicit framework linking local agent interactions to population-level risks in agentic AI.
Citing papers explorer

- Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design. External evolution significantly outperforms internal deliberation in collective-action tasks, but neither helps in trading; deliberation never discovers punishment, while evolution does.
- Emotion Concepts and their Function in a Large Language Model. Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors, without implying subjective experience.
- When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks. Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
- Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture. A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.
- Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling. Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78%, and probe accuracy scales with model size across 0.5B to 176B parameter models.
- The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious. Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.
- Agentic Microphysics: A Manifesto for Generative AI Safety. The authors introduce agentic microphysics and generative safety, a causally explicit framework linking local agent interactions to population-level risks in agentic AI.