Pavel Dolin, Weizhi Li, Gautam Dasarathy, and Visar Berisha

Chouldechova, Alexandra, Cooper, A · 2025 · arXiv 2601.18076

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Why Do Safety Guardrails Degrade Across Languages?

cs.CL · 2026-05-16 · conditional · novelty 6.0

A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.

NeurIPS Should Require Reproducibility Standards for Frontier AI Safety Claims

cs.CY · 2026-05-05 · conditional · novelty 5.0

NeurIPS should enforce a three-tier disclosure framework plus mandatory claim inventories for papers asserting that frontier AI models are safe or ready for release.

citing papers explorer


Why Do Safety Guardrails Degrade Across Languages? · cs.CL · 2026-05-16 · conditional · none · ref 8
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels · cs.LG · 2026-05-07 · unverdicted · none · ref 35
NeurIPS Should Require Reproducibility Standards for Frontier AI Safety Claims · cs.CY · 2026-05-05 · conditional · none · ref 8
