SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.
Safetybench: Evaluating the safety of large language models
6 Pith papers cite this work. Polarity classification is still indexing.
years
2026 6representative citing papers
SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.
DDOR is a delta-debugging framework that localizes minimal refusal-triggering fragments for explainable overrefusal testing and targeted prompt repair in black-box LLMs.
TukaBench extends JailbreakBench to African languages via human translation, cultural adaptation, curated prompts, and code-switching, finding lower refusal rates for culturally grounded prompts and surfacing comprehension and judging limitations.
TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as most vulnerable stages across cognitive trust and emotional dependency.
LLMs propagate misinformation more in lower-resource languages and lower-HDI countries, with input safety classifiers and retrieval-augmented fact-checking showing cross-lingual and regional gaps.
citing papers explorer
-
SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing
SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.
-
DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair
DDOR is a delta-debugging framework that localizes minimal refusal-triggering fragments for explainable overrefusal testing and targeted prompt repair in black-box LLMs.
-
TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages
TukaBench extends JailbreakBench to African languages via human translation, cultural adaptation, curated prompts, and code-switching, finding lower refusal rates for culturally grounded prompts and surfacing comprehension and judging limitations.
-
Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions
TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as most vulnerable stages across cognitive trust and emotional dependency.
-
To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs
LLMs propagate misinformation more in lower-resource languages and lower-HDI countries, with input safety classifiers and retrieval-augmented fact-checking showing cross-lingual and regional gaps.