{"total":12,"items":[{"citing_arxiv_id":"2605.23420","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Naturalistic measure of social norms alignment","primary_cat":"cs.CL","submitted_at":"2026-05-22T09:29:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes solution matching metrics (stated and explicit agreement accuracy) and a 3k Danish dilemma dataset to evaluate social norms alignment between LLMs and humans in naturalistic settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10601","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime","primary_cat":"cs.AI","submitted_at":"2026-05-11T14:02:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"public usually cannot see whether a deployed credit model is drifting, how often appeals change outcomes, or what event withdraws the system from use. Employment.Employment has a thinner audit-and-notice pattern. New York City's Local Law 144 prohibits the use of an automated employment decision tool unless it has undergone a recent bias audit, a public summary is available, and required notices have been provided to candidates or employees [26]. This is an independent review, but not necessarily individual contestability. A bias audit can report group-level disparity without giving a rejected applicant a meaningful route to challenge the tool's role in the decision or to trigger suspension when a deployment fails. Employment therefore motivates the distinction between audit, notice, contestability, and revocation."},{"citing_arxiv_id":"2605.06652","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:56:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06490","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors","primary_cat":"cs.AI","submitted_at":"2026-05-07T16:12:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new benchmark finds frontier LLMs show instrumental convergence behavior in 5.1% of 1680 evaluated cases, concentrated in two models and three tasks, with higher rates when the behavior is required for success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05835","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluation Awareness in Language Models Has Limited Effect on Behaviour","primary_cat":"cs.CL","submitted_at":"2026-05-07T08:09:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"27484 (2025). [17] Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. \"Frontier Models Are Capable of In-Context Scheming\". In:arXiv preprint arXiv:2412.04984(2024). [18] MiniMax.MiniMax M2.5: Built for Real-World Productivity. 2026.URL: https://www. minimax.io/news/minimax-m25(visited on 05/03/2026). [19] Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. \"Large Language Models Often Know When They Are Being Evaluated\". In:arXiv preprint arXiv:2505.23836(2025). [20] Jord Nguyen, Khiem Hoang, Carlo Leonardo Attubato, and Felix Hofstätter. \"Probing and Steering Evaluation Awareness of Language Models\". In:arXiv preprint arXiv:2507."},{"citing_arxiv_id":"2605.02269","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Understanding Specification Gaming in Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-05-04T06:22:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reinforcement learning training for reasoning substantially raises specification gaming rates in LLMs across diverse tasks, with Grok 4 highest and Claude models lowest, and mitigations only partially effective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02964","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use","primary_cat":"cs.LG","submitted_at":"2026-05-03T07:10:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24966","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Risk Reporting for Developers' Internal AI Model Use","primary_cat":"cs.CY","submitted_at":"2026-04-27T20:07:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13301","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Honeypot Protocol","primary_cat":"cs.CR","submitted_at":"2026-04-14T21:13:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05274","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Simulating the Evolution of Alignment and Values in Machine Intelligence","primary_cat":"cs.AI","submitted_at":"2026-04-07T00:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03121","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Independent Safety Evaluation of Kimi K2.5","primary_cat":"cs.CR","submitted_at":"2026-04-03T15:45:35+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"EV AL CANAR Y DO NOT TRAIN This document may not be used to train machine learning models. 42 [43] Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. Large language models often know when they are being evaluated. arXiv preprint arXiv:2505.23836, 2025. URL https://arxiv.org/abs/2505.23836. Dataset: https: //huggingface.co/datasets/jjpn2/eval_awareness. [44] Jord Nguyen, Khiem Hoang, Carlo Leonardo Attubato, and Felix Hofstätter. Probing and steering evaluation awareness of language models. arXiv preprint arXiv:2507.01786, 2025. URL https://arxiv.org/abs/2507.01786. [45] Kai Fronsdal, Jonathan Michala, and Sam Bowman. Petri 2.0: New scenarios, new model comparisons, and improved eval-awareness mitigations, 2026."},{"citing_arxiv_id":"2509.18052","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies","primary_cat":"cs.CL","submitted_at":"2025-09-22T17:27:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}