Signal and noise: A framework for reducing uncertainty in language model evaluation. arXiv preprint arXiv:2508.13144, 2025

David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge · 2025 · arXiv 2508.13144

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

IRSL applies IRT to reduce scaling law estimation from O(M×N) to O(M+N) parameters, enabling reliable estimates with only 50 questions per benchmark after calibration and generalizable ability scores across related benchmarks.

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

stat.ML · 2026-06-07 · unverdicted · novelty 5.0

A hierarchical framework generates statistically valid task-level rank confidence intervals via pairwise comparisons and leaderboard-level rank prediction intervals via conformal prediction.

citing papers explorer

DataComp-VLM: Improved Open Datasets for Vision-Language Models
cs.CV · 2026-06-26 · conditional · none · ref 97 · 2 links
Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation
cs.LG · 2026-05-29 · unverdicted · none · ref 11
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
stat.ML · 2026-06-07 · unverdicted · none · ref 21