Safetyprompts: a systematic review of open datasets for evaluating and improving large language model safety

Röttger, Paul, Pernisi, Fabio, Vidgen, Bertie, Hovy, Dirk , month = jan, year = · 2024 · arXiv 2404.05399

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

cs.AI · 2026-06-05 · unverdicted · novelty 5.0

LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.

Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

cs.CY · 2026-04-02 · unverdicted · novelty 5.0 · 2 refs

Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

cs.CR · 2024-09-26 · unverdicted · novelty 2.0

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators cs.AI · 2026-06-05 · unverdicted · none · ref 38
LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.

Safetyprompts: a systematic review of open datasets for evaluating and improving large language model safety

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer