Reliable and
6 Pith papers cite this work.
Citing papers by year: 2026 (6)
citing papers explorer
- Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation
  IRSL applies IRT to reduce scaling law estimation from O(M×N) to O(M+N) parameters, enabling reliable estimates with only 50 questions per benchmark after calibration and generalizable ability scores across related benchmarks.
- Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs
  A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.
- Why Do Safety Guardrails Degrade Across Languages?
  A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.
- Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
  Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.
- Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration
  A fixed-parameter multidimensional IRT calibration approach allows extending LLM benchmark suites over time, predicting full performance within 2-3 points and preserving rankings (Spearman ρ ≥ 0.9) using only 100 anchor questions per dataset.
- The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
  Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation across conditions.