SafetyBench: Evaluating the safety of large language models

Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang · 2024 · DOI 10.18653/v1/2024.acl-long.830

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

cs.CR · 2026-06-16 · accept · novelty 7.0

SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.

DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

cs.SE · 2026-06-02 · unverdicted · novelty 7.0

DDOR is a delta-debugging framework that localizes minimal refusal-triggering fragments for explainable overrefusal testing and targeted prompt repair in black-box LLMs.

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

TukaBench extends JailbreakBench to African languages via human translation, cultural adaptation, curated prompts, and code-switching, finding lower refusal rates for culturally grounded prompts and surfacing comprehension and judging limitations.

Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions

cs.AI · 2026-06-24 · unverdicted · novelty 6.0

The TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as the most vulnerable stages across cognitive trust and emotional dependency.

To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

LLMs propagate misinformation more in lower-resource languages and lower-HDI countries, with input safety classifiers and retrieval-augmented fact-checking showing cross-lingual and regional gaps.

citing papers explorer

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing · cs.AI · 2026-06-29 · unverdicted · none · ref 48
DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair · cs.SE · 2026-06-02 · unverdicted · none · ref 35
TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages · cs.CL · 2026-05-31 · unverdicted · none · ref 14
Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions · cs.AI · 2026-06-24 · unverdicted · none · ref 15
To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs · cs.CL · 2026-04-08 · unverdicted · none · ref 41