A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.
Pavel Dolin, Weizhi Li, Gautam Dasarathy, and Visar Berisha
3 Pith papers cite this work.
Citing papers by year: 2026 (3)
citing papers explorer
- Why Do Safety Guardrails Degrade Across Languages?
  A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.
- When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
  A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
- NeurIPS Should Require Reproducibility Standards for Frontier AI Safety Claims
  NeurIPS should enforce a three-tier disclosure framework plus mandatory claim inventories for papers asserting that frontier AI models are safe or ready for release.