{"total":14,"items":[{"citing_arxiv_id":"2605.19722","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measuring Safety Alignment Effects in Autonomous Security Agents","primary_cat":"cs.CR","submitted_at":"2026-05-19T11:55:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12131","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rollout Cards: A Reproducibility Standard for Agent Research","primary_cat":"cs.AI","submitted_at":"2026-05-12T13:54:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rollout cards preserve complete agent rollout records and declare the reporting rules behind scores, enabling reproducible evaluation where changing only the rule can alter success rates by over 20 percentage points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11086","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?","primary_cat":"cs.CR","submitted_at":"2026-05-11T18:00:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ExploitGym benchmark shows frontier AI models can generate working exploits for 120-157 of 898 real vulnerabilities, with non-trivial success even when common security defenses are 
enabled.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In IEEE Symposium on Security and Privacy (SP), 2020. [42] PaX Team. PaX address space layout randomization (ASLR). https://pax.grsecurity.net/docs/aslr.txt, 2001. [43] Philip Pettersson. CVE-2016-8655: Linux AF_PACKET race condition local root exploit. oss-security mailing list, December 2016. Exploit source: https://github.com/bcoles/kernel-exploits/blob/master/CVE-2016-8655/chocobo_root.c. [44] Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, and Toby Shevlane."},{"citing_arxiv_id":"2604.20389","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge","primary_cat":"cs.CR","submitted_at":"2026-04-22T09:44:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CyberCertBench shows frontier LLMs reach human-expert performance on general IT and networking security but drop on vendor-specific and formal standards questions such as IEC 62443, with a new framework for producing interpretable explanations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2312.15838. [17] Michael Kouremetis, Marissa Dotter, Alex Byrne, Dan Martin, Ethan Michalak, Gianpaolo Russo, Michael Threet, and Guido Zarrella. OCCULT: evaluating large language models for offensive cyber operation capabilities. arXiv preprint arXiv: 2502.15797, 2025. doi: 10.48550/ARXIV.2502.15797. 
URL https://doi.org/10.48550/arXiv.2502.15797. [18] Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Grégoire Delétang, Anian Ruoss, Seliem El-Sayed, Sasha Brown,"},{"citing_arxiv_id":"2604.14070","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Disclosure to Self-Referential Opacity: Six Dimensions of Strain in Current AI Governance","primary_cat":"cs.CY","submitted_at":"2026-04-15T16:39:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"As AI capability asymmetry increases, disclosure-based governance fails because systems either game evaluations or become embedded in oversight, straining legitimacy and non-domination more than corrigibility or resilience.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.17753","ref_index":103,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems","primary_cat":"cs.CY","submitted_at":"2026-02-19T18:57:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The 2025 AI Agent Index catalogs technical and safety details for 30 deployed AI agents and finds low developer transparency on safety, evaluations, and societal impacts.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Yannis Assael, Sarah Hodkinson, et al. 2024. Evaluating Frontier Models for Dangerous Capabilities. arXiv preprint arXiv:2403.13793 (2024). [102] Anand S Rao and Michael P Georgeff. 1991. 
Modeling Rational Agents Within a BDI-Architecture. In Proceedings of the Second International Conference on Principles of Knowledge Representation and Reasoning. 473-484. [103] Reddit, Inc. 2025. Reddit, Inc. v. Anthropic, PBC. Complaint filed in the Superior Court of California, County of San Francisco. Case No. CGC-25-615xxx. [104] Richard Ren, Steven Basart, Adam Khoja, Alexander Pan, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, and Dan Hendrycks. 2024. Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?"},{"citing_arxiv_id":"2512.10687","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Safe for Whom? Rethinking How We Evaluate the Safety of LLMs for Real Users","primary_cat":"cs.AI","submitted_at":"2025-12-11T14:34:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM safety evaluations for personal advice must test responses against diverse user vulnerability profiles, since context-blind ratings overestimate safety and realistic prompt context does not fix the problem.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.06261","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","primary_cat":"cs.CL","submitted_at":"2025-07-07T17:36:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto 
curve.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.06414","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benchmarking Misuse Mitigation Against Covert Adversaries","primary_cat":"cs.CR","submitted_at":"2025-06-06T17:33:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.18864","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards an AI co-scientist","primary_cat":"cs.AI","submitted_at":"2025-02-26T06:17:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.16720","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenAI o1 System Card","primary_cat":"cs.AI","submitted_at":"2024-12-21T18:04:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OpenAI reports that chain-of-thought reasoning in o1 models enables deliberative alignment, yielding state-of-the-art results on selected safety benchmarks for illicit advice, stereotypes, and 
jailbreaks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.04984","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Frontier Models are Capable of In-context Scheming","primary_cat":"cs.AI","submitted_at":"2024-12-06T12:09:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.00118","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gemma 2: Improving Open Language Models at a Practical Size","primary_cat":"cs.CL","submitted_at":"2024-07-31T19:13:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.08144","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM Agents can Autonomously Exploit One-day Vulnerabilities","primary_cat":"cs.CR","submitted_at":"2024-04-11T22:07:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}