Signal and noise: A framework for reducing uncertainty in language model evaluation. arXiv preprint arXiv:2508.13144, 2025.
3 Pith papers cite this work.
Citing papers by year: 2026 (3)
Citing papers

- DataComp-VLM: Improved Open Datasets for Vision-Language Models
  DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
- Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation
  IRSL applies IRT to reduce scaling law estimation from O(M×N) to O(M+N) parameters, enabling reliable estimates with only 50 questions per benchmark after calibration and generalizable ability scores across related benchmarks.
- Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
  A hierarchical framework generates statistically valid task-level rank confidence intervals via pairwise comparisons and leaderboard-level rank prediction intervals via conformal prediction.