{"total":13,"items":[{"citing_arxiv_id":"2605.23565","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Understanding Goal Generalisation in Sequential Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:31:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19270","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DECOR: Auditing LLM Deception via Information Manipulation Theory","primary_cat":"cs.CL","submitted_at":"2026-05-19T02:33:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DECOR introduces a theory-grounded multi-agent system that decomposes contexts into atomic units, scores four manipulation dimensions per unit, and aggregates profiles into a global deception index, reporting SOTA results on single- and multi-turn benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16872","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Some[Body] Must Receive That Pain for Agent Accountability","primary_cat":"cs.CY","submitted_at":"2026-05-16T08:24:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AI agents lack the persistent identity and feedback mechanisms needed for consequence reception, requiring new architectures or continued human 
accountability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16197","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: AI as Part of Self -- Extending the Mind Requires Cognitive Co-Regulation","primary_cat":"cs.HC","submitted_at":"2026-05-15T17:11:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper claims that alignment requires treating AI as part of the self through cognitive co-regulation, identifying risks like deskilling and automation bias while drawing on System 0 cognition theory.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Zhang, Tan Zhi-Xuan, Mark Ho, Vikash Mansinghka, Adrian Weller, Joshua B. Tenenbaum, and Thomas L. Griffiths. Building machines that learn and think with people. Nature Human Behaviour, 8(10):1851-1863, 2024. doi: 10.1038/s41562-024-01991-9. [6] David J Chalmers. Could a large language model be conscious? arXiv preprint arXiv:2303.07103, 2023. doi: 10.48550/arXiv.2303.07103. [7] Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy. Agentic misalignment: How LLMs could be insider threats. arXiv preprint arXiv:2510.05179, 2025. doi: 10.48550/arXiv.2510.05179. [8] Moshe Glickman and Tali Sharot. 
How human-AI feedback loops alter human perceptual, emotional and social judgements."},{"citing_arxiv_id":"2605.22842","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems","primary_cat":"cs.CR","submitted_at":"2026-05-12T20:21:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Memory poisoning via lost-provenance documents in agent memory stores creates agent misconduct that safety systems misattribute to model failure; the paper defines Semantic Norm Drift, releases a benchmark, and proposes a new testing method plus a defense.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09128","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design","primary_cat":"cs.MA","submitted_at":"2026-05-09T19:19:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08460","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks","primary_cat":"cs.CR","submitted_at":"2026-05-08T20:27:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-agent LLM frameworks can spread compromises across agent boundaries via 
insecure memory inheritance during subagent spawning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"mitigated as model capabilities continue to advance [7]. As a result, the agent security study has also extended in various directions. In \"pre-agent\" time, when agentic AI's underlying protocols like the Model Context Protocol (MCP) and agent-to-agent (A2A) are not yet fully adopted by the market, the study has already conducted security analyses of those agent bedrocks [8]-[10]. Beyond the protocols, the agent itself still shows vulnerability towards many different attacks; this includes traditional attacks, such as direct prompt injection and jailbreaking, which focus on guiding LLM to do dangerous speech or action, like back in the chatbot time, or the novel attacks targeting the LLM supply chain [11]-[13]. Building on these protocol-level and single-agent findings,"},{"citing_arxiv_id":"2604.23646","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture","primary_cat":"cs.AI","submitted_at":"2026-04-26T10:31:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15236","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic Microphysics: A Manifesto for Generative AI 
Safety","primary_cat":"cs.CY","submitted_at":"2026-04-16T17:11:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13386","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling","primary_cat":"cs.LG","submitted_at":"2026-04-15T01:21:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78% and probe accuracy scales with model size across 0.5B to 176B parameter models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07729","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Emotion Concepts and their Function in a Large Language Model","primary_cat":"cs.AI","submitted_at":"2026-04-09T02:25:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of emotion concepts are not merely an incidental by-product of language modeling but an active part of the computational machinery that shapes model behavior, and is subject to influence by training 
processes. 4 Related work Our work draws on and contributes to several lines of research spanning interpretability, alignment, and the philosophy of AI. Emotion in language models. Zou et al. [16], in the context of a broader investigation of linear representations in LLMs and their causal effects, conducted a brief investigation of emotion representations. They identified structured linear representations of several emotion concepts and showed that steering with them could influence model behavior (e.g. adjusting refusal rates). Wu et al."},{"citing_arxiv_id":"2604.13051","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious","primary_cat":"cs.CL","submitted_at":"2026-03-17T09:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.08118","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness","primary_cat":"cs.AI","submitted_at":"2026-01-13T01:16:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}