arXiv preprint arXiv:2506.02945

Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu · 2025 · arXiv 2506.02945

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Instance-Optimal Estimation with Multiple LLM Judges on a Budget

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.

Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges

stat.ME · 2026-05-10 · unverdicted · novelty 6.0

Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms on four benchmarks.

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

cs.AI · 2026-01-13 · unverdicted · novelty 6.0

MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.

citing papers explorer

Showing 4 of 4 citing papers.

Instance-Optimal Estimation with Multiple LLM Judges on a Budget · cs.LG · 2026-05-22 · unverdicted · none · ref 25
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression · cs.LG · 2026-05-20 · unverdicted · none · ref 63
Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges · stat.ME · 2026-05-10 · unverdicted · none · ref 10
MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness · cs.AI · 2026-01-13 · unverdicted · none · ref 33
