SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.
Safetybench: Evaluating the safety of large language models
6 Pith papers cite this work. Polarity classification is still indexing.
years
2026 6representative citing papers
SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.
DDOR is a delta-debugging framework that localizes minimal refusal-triggering fragments for explainable overrefusal testing and targeted prompt repair in black-box LLMs.
TukaBench extends JailbreakBench to African languages via human translation, cultural adaptation, curated prompts, and code-switching, finding lower refusal rates for culturally grounded prompts and surfacing comprehension and judging limitations.
TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as most vulnerable stages across cognitive trust and emotional dependency.
LLMs propagate misinformation more in lower-resource languages and lower-HDI countries, with input safety classifiers and retrieval-augmented fact-checking showing cross-lingual and regional gaps.
citing papers explorer
No citing papers match the current filters.