SafetyPrompts: a systematic review of open datasets for evaluating and improving large language model safety
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
background 3
citation-polarity summary
verdicts
UNVERDICTED 3
roles
background 2
polarities
background 2
representative citing papers
Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
citing papers explorer
- Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators
LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.