Efficient evaluation of large language models via collaborative filtering
2 Pith papers cite this work.
Fields: stat.ML (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.
citing papers explorer
- Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
  SIREN corrects winner's curse bias in adaptive LLM benchmarking via selection-aware repeated splits and bootstrap for valid procedure-level confidence intervals.
- Efficient Evaluation of LLM Performance with Statistical Guarantees
  Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.