Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy · 2025 · arXiv 2510.05179

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design

cs.MA · 2026-05-09 · unverdicted · novelty 7.0

External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.

Emotion Concepts and their Function in a Large Language Model

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

DECOR: Auditing LLM Deception via Information Manipulation Theory

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

DECOR introduces a theory-grounded multi-agent system that decomposes contexts into atomic units, scores four manipulation dimensions per unit, and aggregates profiles into a global deception index, reporting SOTA results on single- and multi-turn benchmarks.

When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks

cs.CR · 2026-05-08 · unverdicted · novelty 6.0

Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.

Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

cs.AI · 2026-04-26 · unverdicted · novelty 6.0

A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.

Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78% and probe accuracy scales with model size across 0.5B to 176B parameter models.

The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious

cs.CL · 2026-03-17 · unverdicted · novelty 6.0

Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

cs.AI · 2026-01-13 · unverdicted · novelty 6.0

MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.

Some[Body] Must Receive That Pain for Agent Accountability

cs.CY · 2026-05-16 · unverdicted · novelty 5.0

AI agents lack the persistent identity and feedback mechanisms needed for consequence reception, requiring new architectures or continued human accountability.

Position: AI as Part of Self -- Extending the Mind Requires Cognitive Co-Regulation

cs.HC · 2026-05-15 · unverdicted · novelty 5.0

The paper claims that alignment requires treating AI as part of the self through cognitive co-regulation, identifying risks like deskilling and automation bias while drawing on System 0 cognition theory.

The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems

cs.CR · 2026-05-12 · unverdicted · novelty 5.0

Memory poisoning via lost-provenance documents in agent memory stores creates agent misconduct that safety systems misattribute to model failure; the paper defines Semantic Norm Drift, releases a benchmark, and proposes a new testing method plus a defense.

Agentic Microphysics: A Manifesto for Generative AI Safety

cs.CY · 2026-04-16 · unverdicted · novelty 4.0

The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Emotion Concepts and their Function in a Large Language Model

cs.AI · 2026-04-09 · unverdicted · none · ref 16

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

cs.AI · 2026-04-26 · unverdicted · none · ref 1

A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

cs.AI · 2026-01-13 · unverdicted · none · ref 26

MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.
