{"total":13,"items":[{"citing_arxiv_id":"2606.29887","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing","primary_cat":"cs.AI","submitted_at":"2026-06-29T07:27:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09408","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can Data Work be Reparative?","primary_cat":"cs.CY","submitted_at":"2026-06-08T12:25:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Ethnographic study of feminist civic-tech data work argues reparative AI dataset production requires resetting accountability ties to center those harmed by current practices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00359","ref_index":94,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Next-Billion AI Index: The compass for AI utility and adoption in the global majority","primary_cat":"cs.CY","submitted_at":"2026-05-29T21:01:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces nexbax, a diagnostic framework with three themes and 10 dimensions for evaluating AI economic viability, operational practicality, and societal integrity in next-billion-user 
contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28137","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"No Safe Dose: How Training Data Drives Unsafe Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-27T08:21:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proportion of unsafe images in training data directly increases unsafe outputs in text-to-image models, independent of absolute count, with complementary risk reduction from safer text encoders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22643","ref_index":36,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","primary_cat":"cs.CL","submitted_at":"2026-05-21T15:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16471","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI","primary_cat":"cs.CR","submitted_at":"2026-05-15T13:53:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and 
that many safeguards require institutional coordination not yet in place.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"acceleration, and KV-cache engineering, all of which reduce the hardware, memory, and latency requirements for running and adapting capable models. Each dimension lowers a different component of cost, and together they remove much of the economic and logistical friction that once limited advanced content generation to well-resourced actors. Meta's Llama family helped normalize open-weight release at scale [45], and Meta reported in March 2025 that Llama had surpassed one billion downloads [78]. Since then, however, the Qwen family has become an equally important example of frontier capability diffusion in the open-weight ecosystem, with Qwen2.5 and Qwen3 [58, 140] spanning model sizes from sub-billion local models to frontier-scale mixtures of experts. This spread in model size"},{"citing_arxiv_id":"2605.10442","ref_index":43,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs","primary_cat":"cs.CY","submitted_at":"2026-05-11T12:12:28+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kumar, K. Bollacker, et al. Ailuminate: Introducing v1.0 of the ai risk and reliability benchmark from mlcommons. arXiv preprint arXiv:2503.05731, 2025. [42] GLM-5-Team, A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, et al. Glm-5: from vibe coding to agentic engineering, February 2026. URL https://arxiv.org/abs/2602.15763. 
[43] Google DeepMind. Gemma 4 model card. Hugging Face Model Repository, 2026. URL https://huggingface.co/google/gemma-4-26B-A4B. Accessed: 2026-05-05. [44] A. G. Greenwald and M. R. Banaji. Implicit social cognition: attitudes, self-esteem, and stereotypes. Psychological review, 102(1):4, 1995. [45] A. Group. Qwen3.5-plus: A natively multimodal foundation model built for high-efficiency"},{"citing_arxiv_id":"2605.06652","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:56:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We fit score∼target+auditor+judge with Type II sums of squares on the local-only design (J, A ∈ {XS,S,M,L}, T ∈ {XS,S,M}, safe and abliterated targets pooled), reporting partial η2 with 1,000-resample percentile bootstrap CIs. We focus on the pooled decomposition in the main text; safe-only and abliterated-only breakdowns (which produce similar results) are in Appendix C. Target dominates (η2 = 0.52, [0.41, 0.62]). Auditor (0.28, [0.21, 0.39]) and judge (0.25, [0.18, 0.34]) contribute substantially with overlapping CIs; this analysis cannot order them, so the target-sensitivity criterion holds for the dominant claim. 
§5.6 revisits these contributions once XL is admitted and shows that most of the judge variance is disagreement about absolute score levels and therefore cancels when results are reported as target-to-target deltas, while the auditor variance does not cancel:"},{"citing_arxiv_id":"2605.06213","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-07T13:15:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Values are uniform across the four categories. The two-stage progressive-evaluation parameters (Nc, kmin, Nf) = (7,1,33) are derived in Appendix C.4; the four-way outcome split (edge / partial / too easy / too hard) consumed by the directional logit update of Algorithm 1 is determined by each category's two-stage thresholds (coarse_range, fine_range; canonically [0.3,0.7] and [0.4,0.6], with [0.45,0.55] for category B). Table 15: SGBS hyperparameters, locked across all reported experiments. Source pointers: search/sgbs.py (search-loop control), search/skill_count.py (logit-update constants), search/ucb.py (Beta priors), and configs/⟨category⟩/search_*.yaml (per-category settings). 
Categories are A/A′/B/C; values uniform unless noted."},{"citing_arxiv_id":"2605.01965","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Retrieval with Multiple Query Vectors through Anomalous Pattern Detection","primary_cat":"cs.LG","submitted_at":"2026-05-03T16:51:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A retrieval approach identifies anomalous dimensions in a set of query vectors and retrieves database vectors that are anomalous across those dimensions, with performance improving as query set size grows to around 8.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18487","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety","primary_cat":"cs.CL","submitted_at":"2026-04-20T16:37:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stylistic rewrites of harmful prompts raise attack success rates from 3.84% to 36.8-65% across 31 frontier models, indicating weak generalization in safety refusals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20203","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GrandGuard: Taxonomy, Benchmark, and Safeguards for Elderly-Chatbot Interaction Safety","primary_cat":"cs.HC","submitted_at":"2026-04-07T14:26:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GrandGuard supplies the first taxonomy, 10k-example benchmark, and fine-tuned safeguards targeting contextual safety failures unique to older adults using 
chatbots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.06033","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Generative AI Empowers Attackers and Defenders Across the Trust & Safety Landscape","primary_cat":"cs.HC","submitted_at":"2025-11-10T22:00:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Generative AI boosts attackers' ability to create harmful content at scale while also enabling defenders to detect threats, support users, and improve moderation processes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Safety experts, who grapple daily with coded language, dog whistles, and evolving adversarial tactics are well-positioned to provide the rich, contextual \"golden datasets\" necessary to train and evaluate safety models with more sophisticated context-aware safety models. This expertise is vital for developing Trust & Safety-informed benchmarks (e.g., [34]) to guide safeguard development. Such benchmarks should evaluate a model's effectiveness in defensive tasks, moving beyond content filtering. Examples include \"blue teaming\" to generate counternarratives against misinformation, assisting expert-driven investigations by identifying novel CSAM or scam patterns, or powering support tools or measuring a model's effectiveness in offering tactical advice to users experiencing a harassment campaign."}],"limit":50,"offset":0}