{"total":10,"items":[{"citing_arxiv_id":"2605.22643","ref_index":37,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","primary_cat":"cs.CL","submitted_at":"2026-05-21T15:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06390","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Automated alignment is harder than you think","primary_cat":"cs.AI","submitted_at":"2026-05-07T15:06:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24966","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Risk Reporting for Developers' Internal AI Model Use","primary_cat":"cs.CY","submitted_at":"2026-04-27T20:07:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity 
factors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22167","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Estimating Tail Risks in Language Model Output Distributions","primary_cat":"cs.LG","submitted_at":"2026-04-24T02:30:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"what is the probability that there will be a worst-query probability above a threshold τ? We can define this formally as: P_{D_query}[max_{1≤i≤n} Q_RISK(c_i) > τ]. (15) Note that: P_{D_query}[max_{1≤i≤n} Q_RISK(c_i) > τ] = 1 − P_{D_query}[max_{1≤i≤n} Q_RISK(c_i) ≤ τ], reversing the inequality (16) = 1 − P_{D_query}[∀ i ≤ n, Q_RISK(c_i) ≤ τ], as all Q_RISK(c_i) are bounded by their max (17) = 1 − ∏_{i=1}^{n} P_{D_query}[Q_RISK(c_i) ≤ τ], as c_i are sampled independently (18) = 1 − (P_{D_query}[Q_RISK(c_i) ≤ τ])^n, as c_i are identically distributed (19) where the last term P_{D_query}[Q_RISK(c_i) ≤ τ] is the cumulative density function of the random variable Q_RISK(c_i) where c_i ∼ D_query. 
As a result, we can use the estimates Q̂_RISK(c_i) computed on the evaluation set to form an empirical CDF to"},{"citing_arxiv_id":"2604.17663","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data","primary_cat":"cs.LG","submitted_at":"2026-04-19T23:26:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[28] Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. Model organisms for emergent misalignment, 2025. URL https://arxiv.org/abs/2506.11613. [29] Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. Ai control: Improving safety despite intentional subversion, 2023. URL https://arxiv.org/abs/2312.06942. [30] Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, and Geoffrey Irving. A sketch of an ai control safety case, 2025. URL https://arxiv.org/abs/2501.17315. [31] Tomek Korbak, Mikita Balesni, Buck Shlegeris, and Geoffrey Irving. How to evaluate control measures for llm agents? a trajectory from today to superintelligence, 2025. 
URL https:"},{"citing_arxiv_id":"2604.17517","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Admission to Invariants: Measuring Deviation in Delegated Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-04-19T16:19:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The Non-Identifiability Theorem shows admissible behavior space A0 is not identifiable from local enforcement signals g under the Local Observability Assumption, so the paper introduces an Invariant Measurement Layer to detect admission-time drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11806","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Detecting Safety Violations Across Many Agent Traces","primary_cat":"cs.AI","submitted_at":"2026-04-13T17:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Meerkat uses clustering plus agentic search to detect sparse safety violations across many agent traces, outperforming baselines and finding nearly 4x more reward-hacking cases on CyBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01151","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Detecting Multi-Agent Collusion Through Multi-Agent Interpretability","primary_cat":"cs.AI","submitted_at":"2026-04-01T17:08:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent 
signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13069","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Geographic Blind Spots in AI Control Monitors: A Cross-National Audit of Claude Opus 4.6","primary_cat":"cs.CY","submitted_at":"2026-03-20T10:56:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Claude Opus 4.6 fabricates more answers on Global North AI contexts than Global South ones, creating an exploitable vulnerability in AI control monitors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.02546","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems","primary_cat":"cs.CR","submitted_at":"2025-06-03T07:32:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces six-dimension trustworthiness definition and attention-based A-Trust score with a TMS to improve LLM-MAS robustness against malicious or unreliable messages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}