Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.
arXiv preprint arXiv:2506.02945
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026: 4
verdicts
UNVERDICTED: 4
representative citing papers
Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms on four benchmarks.
MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.
citing papers explorer
- Instance-Optimal Estimation with Multiple LLM Judges on a Budget
  Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.
- Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
  Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
- Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges
  Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms on four benchmarks.
- MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness
  MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.