Distribution-Calibrated Inference Time Compute for Thinking LLM-as-a-Judge

Firas Trabelsi; Hamid Dadkhahi; Juraj Juraska; Mehdi Mirzazadeh; Parker Riley

arxiv: 2512.03019 · v2 · pith:6ITC2NIO · submitted 2025-12-02 · 💻 cs.LG · cs.AI

classification 💻 cs.LG cs.AI

keywords: aggregation, compute, distribution-calibrated, evaluation, individual, models, noisy, pairwise

Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces mean absolute error (MAE) and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
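As a rough illustration of the Bradley-Terry-Davidson idea mentioned in the abstract, the sketch below fits Davidson's tie-aware model to the rating counts of a single item and computes polarity and decisiveness as the abstract defines them. It is not the paper's implementation: the function names, the additive smoothing constant `alpha`, and the per-item closed-form fit are assumptions made for illustration only.

```python
import math


def polarity_and_decisiveness(a, b, t):
    """Summarise n = a + b + t samples (A preferred, B preferred, tie).

    polarity     = margin among non-ties, in [-1, 1]
    decisiveness = non-tie rate, in [0, 1]
    """
    n = a + b + t
    non_ties = a + b
    polarity = (a - b) / non_ties if non_ties else 0.0
    decisiveness = non_ties / n if n else 0.0
    return polarity, decisiveness


def fit_davidson(a, b, t, alpha=0.5):
    """Closed-form fit of Davidson's model to one item's counts.

    With strengths pi_A = r and pi_B = 1, the model gives
    P(A) : P(B) : P(tie) = r : 1 : nu * sqrt(r),
    so matching the (smoothed) counts yields r = a / b and
    nu = t / sqrt(a * b). `alpha` is a hypothetical smoothing
    constant that keeps the logs finite when a count is zero.
    """
    a_s, b_s, t_s = a + alpha, b + alpha, t + alpha
    log_ratio = math.log(a_s / b_s)
    nu = t_s / math.sqrt(a_s * b_s)
    return log_ratio, nu


def davidson_probs(log_ratio, nu):
    """Return (P(A wins), P(B wins), P(tie)) under Davidson's model."""
    r = math.exp(log_ratio)
    tie_weight = nu * math.sqrt(r)
    z = r + 1.0 + tie_weight
    return r / z, 1.0 / z, tie_weight / z


if __name__ == "__main__":
    # Hypothetical counts: 6 samples prefer A, 2 prefer B, 2 are ties.
    print(polarity_and_decisiveness(6, 2, 2))
    print(davidson_probs(*fit_davidson(6, 2, 2)))
```

Two items with the same majority label can differ sharply in these summaries: counts of (6, 2, 2) and (2, 0, 8) both favour A, but the second has a far lower non-tie rate, which is the distinction a plain majority vote discards.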

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges
stat.ME · 2026-05 · unverdicted · novelty 6.0

Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms on four benchmarks.