Reliable and

· 2025 · arXiv 2503.13335

10 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

IRSL applies IRT to reduce scaling law estimation from O(M×N) to O(M+N) parameters, enabling reliable estimates with only 50 questions per benchmark after calibration and generalizable ability scores across related benchmarks.

AGC-Bench: Measuring Artificial General Creativity

cs.CL · 2026-07-01 · unverdicted · novelty 6.0 · 2 refs

AGC-Bench introduces a multi-domain creativity benchmark for LLMs, recovers a general 'c' factor explaining 81.5% of variance, and finds humans still outperform top models on matched tasks.

Quality Is Not a Safety Proxy Under Quantization

cs.LG · 2026-06-08 · conditional · novelty 6.0

Across 51 quantized checkpoints, quality metrics fail to predict safety drops in 36 pairings and 10 hidden-danger cases, while a new RTSI screen routes all 10 dangerous rows to testing at matched bucket size.

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.

Why Do Safety Guardrails Degrade Across Languages?

cs.CL · 2026-05-16 · conditional · novelty 6.0

A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

A fixed-parameter multidimensional IRT calibration approach allows extending LLM benchmark suites over time, predicting full performance within 2-3 points and preserving rankings (Spearman ρ ≥ 0.9) using only 100 anchor questions per dataset.

Latent Confidence Alignment for LLM Self-Assessment

cs.CY · 2026-06-20 · unverdicted · novelty 5.0

LCAE is introduced as a Rasch-model metric that aligns LLM self-reported confidence with latent error probability derived from ability and item difficulty, shown to improve calibration on a medical dataset across 20 models.

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

cs.AI · 2026-06-10 · unverdicted · novelty 5.0

A pre-response classifier predicts user rejection risk for clinical LLM outputs with AUROC 0.719 over 4.5 months of deployment data by incorporating deployment-specific context.

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation across conditions.

citing papers explorer

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

cs.LG · 2026-05-29 · unverdicted · none · ref 25

AGC-Bench: Measuring Artificial General Creativity

cs.CL · 2026-07-01 · unverdicted · none · ref 37 · 2 links

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

cs.CL · 2026-05-31 · unverdicted · none · ref 26

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07 · unverdicted · none · ref 22

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

cs.CL · 2026-04-14 · unverdicted · none · ref 17

Latent Confidence Alignment for LLM Self-Assessment

cs.CY · 2026-06-20 · unverdicted · none · ref 18

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

cs.AI · 2026-06-10 · unverdicted · none · ref 36

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

cs.LG · 2026-05-11 · unverdicted · none · ref 24
