A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.
Title resolution pending
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4representative citing papers
A fixed-parameter multidimensional IRT calibration approach allows extending LLM benchmark suites over time, predicting full performance within 2-3 points and preserving rankings (Spearman ρ ≥ 0.9) using only 100 anchor questions per dataset.
Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation across conditions.
citing papers explorer
-
Why Do Safety Guardrails Degrade Across Languages?
A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.
-
Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration
A fixed-parameter multidimensional IRT calibration approach allows extending LLM benchmark suites over time, predicting full performance within 2-3 points and preserving rankings (Spearman ρ ≥ 0.9) using only 100 anchor questions per dataset.
-
The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation across conditions.
- Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models