Title resolution pending

Truong, Sang, Tu, Yuheng, Liang, Percy, Li, Bo, Koyejo, Sanmi , month = mar, year = · 2025 · arXiv 2503.13335

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Why Do Safety Guardrails Degrade Across Languages?

cs.CL · 2026-05-16 · conditional · novelty 6.0

A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

A fixed-parameter multidimensional IRT calibration approach allows extending LLM benchmark suites over time, predicting full performance within 2-3 points and preserving rankings (Spearman ρ ≥ 0.9) using only 100 anchor questions per dataset.

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation across conditions.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07

citing papers explorer

Showing 4 of 4 citing papers.

Why Do Safety Guardrails Degrade Across Languages? cs.CL · 2026-05-16 · conditional · none · ref 25
A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.
Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration cs.CL · 2026-04-14 · unverdicted · none · ref 17
A fixed-parameter multidimensional IRT calibration approach allows extending LLM benchmark suites over time, predicting full performance within 2-3 points and preserving rankings (Spearman ρ ≥ 0.9) using only 100 anchor questions per dataset.
The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains cs.LG · 2026-05-11 · unverdicted · none · ref 24
Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation across conditions.
Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models cs.AI · 2026-05-07 · unreviewed · ref 22

Title resolution pending

fields

years

verdicts

representative citing papers

citing papers explorer