On Cost-Effective LLM-as-a-Judge Improvement Techniques
Pith reviewed 2026-05-10 12:52 UTC · model grok-4.3
The pith
Ensemble scoring and task-specific criteria injection raise LLM judge accuracy to 85.8 percent on RewardBench 2.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating LLM judging as a noisy stochastic process, the authors show that ensemble scoring acts as Monte Carlo averaging that smooths per-call fluctuations, while task-specific criteria injection sharpens the model's ability to discriminate between responses. Together these two techniques reach 85.8 percent accuracy on RewardBench 2, a 13.5 percentage point gain over a plain baseline prompt. Calibration context and adaptive model escalation each improve on the baseline but are dominated by the ensemble-plus-criteria combination on the cost-accuracy Pareto frontier. Small models receive the largest relative benefit from ensembling, and the pattern generalizes across both the OpenAI GPT and Anthropic Claude model families.
What carries the argument
Noise control on the stochastic judge, with ensemble scoring as Monte Carlo averaging over per-call noise, criteria injection as between-response discrimination sharpening, and per-response score variance as an uncertainty signal.
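To make the averaging concrete, the sketch below scores one response with k independent judge calls and returns the mean alongside the per-response variance that serves as the uncertainty signal. It is a minimal illustration assuming a generic judge_once callable and a scalar score scale; none of the names or defaults come from the paper.

```python
# Minimal sketch of ensemble scoring as Monte Carlo averaging over per-call
# noise. `judge_once`, the score scale, and k = 5 are illustrative assumptions.
import statistics
from typing import Callable

def ensemble_score(
    judge_once: Callable[[str, str], float],  # (prompt, response) -> scalar score
    prompt: str,
    response: str,
    k: int = 5,  # number of independent judge calls to average
) -> tuple[float, float]:
    """Call the stochastic judge k times; return (mean, variance).

    The mean smooths per-call fluctuations; the variance doubles as the
    per-response uncertainty signal described above.
    """
    scores = [judge_once(prompt, response) for _ in range(k)]
    return statistics.mean(scores), statistics.pvariance(scores)
```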
If this is right
- Small models become competitive with larger ones for judging tasks once ensembling is applied.
- Task-specific criteria injection improves performance at almost zero added token cost (see the prompt sketch after this list).
- Combined, the two techniques dominate the other tested methods on the cost-accuracy Pareto frontier.
- The improvements hold across multiple model providers including OpenAI and Anthropic families.
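As a concrete picture of why criteria injection is nearly free, the sketch below prepends a short task rubric to the judge prompt, adding only a few dozen tokens per call. The CRITERIA table and template wording are hypothetical stand-ins, not the paper's templates.

```python
# Hypothetical task-specific criteria injection: a short rubric keyed by task
# type is spliced into the judge prompt. Rubric text is invented for illustration.
CRITERIA = {
    "factuality": "Penalize unsupported claims; reward verifiable statements.",
    "instruction_following": "Check that every explicit constraint is satisfied.",
}

def build_judge_prompt(task: str, question: str, answer: str) -> str:
    rubric = CRITERIA.get(task, "Judge overall response quality.")
    return (
        f"You are grading a response. Criteria: {rubric}\n\n"
        f"Question: {question}\nResponse: {answer}\n\n"
        "Return a single integer score from 1 to 10."
    )
```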
Where Pith is reading between the lines
- These noise-reduction steps could be combined with other prompting strategies to further lower the need for human raters in RLHF loops.
- The variance signal identified during ensembling might be usable as a direct uncertainty estimate for downstream filtering of low-confidence judgments, as sketched after this list.
- If the same pattern appears on newer or more diverse evaluation sets, the techniques could become a standard preprocessing layer for any LLM judge pipeline.
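A minimal sketch of that filtering idea, assuming each judgment arrives as a (mean, variance) pair from the ensembling step; the variance threshold is a hypothetical tuning knob, not a value from the paper.

```python
# Drop judgments whose ensemble variance signals low confidence. The threshold
# is an assumed hyperparameter, not taken from the paper.
def filter_confident(
    judgments: list[tuple[float, float]],  # (mean score, variance) per response
    max_variance: float = 1.0,
) -> list[tuple[float, float]]:
    """Keep only judgments whose per-response variance is below the cutoff."""
    return [(mean, var) for mean, var in judgments if var <= max_variance]
```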
Load-bearing premise
The noise-control framing and the accuracy gains measured on RewardBench 2 will transfer to other benchmarks and real production use cases without extra tuning or new data.
What would settle it
Running the combined ensemble-plus-criteria method on a fresh benchmark or production evaluation set and finding no meaningful accuracy improvement over the plain baseline would falsify the central claim.
Figures
Original abstract
Using a language model to score or rank candidate responses has become a scalable alternative to human evaluation in reinforcement learning from human feedback (RLHF) pipelines, benchmarking, and application layer evaluations. However, output reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of four drop-in techniques -- ensemble scoring, task-specific criteria injection, calibration context, and adaptive model escalation -- for improving LLM judge accuracy on RewardBench 2, with a unifying lens of noise control on the stochastic judge: ensembling as Monte Carlo averaging over per-call noise, criteria injection as between-response discrimination sharpening, and per-response score variance as an uncertainty signal. Ensemble scoring and task-specific criteria injection (the latter virtually cost free) together reach up to 85.8% accuracy, +13.5pp over baseline. Calibration context and adaptive model escalation also improve over baseline but are dominated by criteria + ensembling on the cost-accuracy Pareto frontier. Small models benefit disproportionately from ensembling, making high-accuracy LLM judges accessible at low cost. We show that these techniques generalise across model providers, evaluating on both OpenAI GPT and Anthropic Claude families.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript empirically evaluates four drop-in techniques—ensemble scoring, task-specific criteria injection, calibration context, and adaptive model escalation—for improving LLM-as-a-judge accuracy on RewardBench 2. It unifies them under a noise-control framing (ensembling as Monte Carlo averaging over per-call noise, criteria injection as sharpening between-response discrimination, and score variance as uncertainty signal) and reports that ensemble scoring combined with task-specific criteria injection reaches 85.8% accuracy (+13.5pp over baseline). The techniques generalize across OpenAI GPT and Anthropic Claude families, with small models benefiting disproportionately from ensembling for cost-effectiveness.
Significance. If the results hold, the work supplies concrete, low-cost methods to raise LLM judge reliability in RLHF, benchmarking, and application evaluations, with the cross-provider checks and specific accuracy deltas providing actionable evidence. The noise-control lens is a useful organizing principle, and the finding that small models gain most from ensembling directly supports accessible high-accuracy judging.
major comments (2)
- Abstract and results: the headline claim that ensemble scoring and task-specific criteria injection reach 85.8% accuracy (+13.5pp) and that the techniques 'generalise across model providers' is demonstrated exclusively on RewardBench 2. No cross-benchmark validation is reported, so it remains possible that the observed deltas reflect alignment with RewardBench 2's particular annotation style, category balance, or preference-pair distribution rather than general noise reduction; this directly affects whether the reported cost-accuracy Pareto improvements transfer to other benchmarks or production settings.
- Abstract: the reported accuracy deltas lack accompanying details on statistical significance, exact prompting templates, data splits, or variance across runs, leaving open the possibility of selection effects or unreported sensitivity to implementation choices.
minor comments (2)
- The noise-control framing is introduced in the abstract but would benefit from a dedicated subsection that formally links per-response score variance to the uncertainty signal used in adaptive escalation (a toy version of that link is sketched after these comments).
- Ensure all tables or figures reporting accuracy include confidence intervals or p-values against the baseline to make the empirical improvements easier to assess.
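To illustrate the link the first comment asks for, here is a toy version of variance-driven escalation: a cheap judge's verdict stands unless its ensemble variance crosses a threshold, at which point a stronger judge re-scores. Function names and the threshold are assumptions, not the manuscript's procedure.

```python
# Toy adaptive model escalation keyed to the ensemble variance signal.
# Both judges are assumed to return (mean, variance) from ensemble scoring.
from typing import Callable

Judge = Callable[[str, str], tuple[float, float]]  # (prompt, response) -> (mean, var)

def escalating_score(
    cheap_judge: Judge,
    strong_judge: Judge,
    prompt: str,
    response: str,
    variance_threshold: float = 1.0,  # assumed cutoff for "uncertain"
) -> float:
    mean, var = cheap_judge(prompt, response)
    if var <= variance_threshold:
        return mean  # confident: keep the cheap verdict
    strong_mean, _ = strong_judge(prompt, response)
    return strong_mean  # uncertain: pay for the stronger model
```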
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and indicate the revisions we have made or will make to the manuscript.
Point-by-point responses
-
Referee: Abstract and results: the headline claim that ensemble scoring and task-specific criteria injection reach 85.8% accuracy (+13.5pp) and that the techniques 'generalise across model providers' is demonstrated exclusively on RewardBench 2. No cross-benchmark validation is reported, so it remains possible that the observed deltas reflect alignment with RewardBench 2's particular annotation style, category balance, or preference-pair distribution rather than general noise reduction; this directly affects whether the reported cost-accuracy Pareto improvements transfer to other benchmarks or production settings.
Authors: We agree that our primary evaluation is on RewardBench 2 and that cross-benchmark validation would further support the generalizability of the proposed techniques. The noise-control framing is intended to be benchmark-agnostic, but we acknowledge the possibility of benchmark-specific effects. In the revised manuscript, we will expand the discussion section to include a limitations paragraph addressing this point, clarifying that results are specific to RewardBench 2 while noting that the techniques are drop-in and can be applied elsewhere. We will also temper the abstract language to specify 'on RewardBench 2' more explicitly if needed. revision: partial
-
Referee: Abstract: the reported accuracy deltas lack accompanying details on statistical significance, exact prompting templates, data splits, or variance across runs, leaving open the possibility of selection effects or unreported sensitivity to implementation choices.
Authors: We appreciate this observation. The current manuscript provides the main results but omits some implementation details for brevity. In the revised version, we will add the following: (1) statistical significance testing for the accuracy improvements (e.g., using paired tests across the preference pairs), (2) the exact prompting templates used for each technique in an appendix, (3) details on the data splits from RewardBench 2, and (4) variance or standard deviation across multiple independent runs to quantify sensitivity. These additions will be included in the methods and results sections. revision: yes
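For concreteness, a paired bootstrap over per-pair correctness is one standard way to run the paired test the response describes; the sketch below is generic, not the manuscript's procedure.

```python
# Generic paired bootstrap: resample per-pair accuracy differences between two
# judge variants and estimate a one-sided p-value for "B is no better than A".
import random

def paired_bootstrap_pvalue(
    correct_a: list[int],  # 1 if variant A judged the preference pair correctly
    correct_b: list[int],  # 1 if variant B judged the same pair correctly
    n_boot: int = 10_000,
    seed: int = 0,
) -> float:
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    # Fraction of bootstrap resamples in which B shows no improvement over A.
    hits = sum(
        1
        for _ in range(n_boot)
        if sum(rng.choice(diffs) for _ in diffs) <= 0
    )
    return hits / n_boot
```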
Circularity Check
No circularity: purely empirical measurements against external human labels
full rationale
The paper reports accuracy gains from four prompting/aggregation techniques evaluated directly on RewardBench 2 against its human preference labels. No equations, fitted parameters, or first-principles derivations appear; the unifying 'noise control' framing is interpretive post-hoc language rather than a self-referential model. Results are externally falsifiable via the benchmark's labels and do not reduce to the paper's own inputs by construction. Generalization claims across model families are also measured, not assumed. This matches the default non-circular case for empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human preference labels on RewardBench 2 serve as reliable ground truth for judge accuracy.
Forward citations
Cited by 1 Pith paper
-
Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges
Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench 2 and outperforms it on four benchmarks.
Reference graph
Works this paper leans on
-
[2]
LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks
Bavaresco, A., Bernardi, R., Bertolazzi, L., Elliott, D., Fernández, R., Gatt, A., Ghaleb, E., Giulianelli, M., Hanna, M., Koller, A., Martins, A. F. T., Mondorf, P., Neplenbroek, V., Pezzelle, S., Plank, B., Schlangen, D., Suglia, A., Surikuchi, A. K., Takmaz, E., and Testoni, A. LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks, 2025
2025
-
[3]
ChatEval: Towards better LLM-based evaluators through multi-agent debate
Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., and Liu, Z. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In International Conference on Learning Representations (ICLR), 2024
2024
-
[4]
FrugalGPT: How to use large language models while reducing cost and improving performance
Chen, L., Zaharia, M., and Zou, J. FrugalGPT: How to use large language models while reducing cost and improving performance. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=cSimKw5p6R
2024
-
[5]
A survey on LLM-as-a-judge
Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, S., Zhang, K., Wang, Y., Gao, W., Ni, L., and Guo, J. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024
2024
-
[6]
Prometheus 2: An open source language model specialized in evaluating other language models
Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 4334--4353. Association for Computational Linguistics, 2024
2024
-
[7]
RewardBench: Evaluating reward models for language modeling
Lambert, N., Pyatkin, V., Morrison, J., Miranda, L., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., Smith, N. A., and Hajishirzi, H. RewardBench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 1755--1797. Association for Computational Linguistics, 2025
2025
-
[8]
Generative judge for evaluating alignment
Li, J., Sun, S., Yuan, W., Fan, R.-Z., Zhao, H., and Liu, P. Generative judge for evaluating alignment. In International Conference on Learning Representations (ICLR), 2024
2024
-
[9]
G-Eval: NLG evaluation using GPT-4 with better human alignment
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2511--2522. Association for Computational Linguistics, 2023
2023
-
[10]
RewardBench 2: Advancing Reward Model Evaluation
Malik, S., Pyatkin, V., Land, S., Morrison, J., Smith, N. A., Hajishirzi, H., and Lambert, N. RewardBench 2: Advancing reward model evaluation. arXiv preprint arXiv:2506.01937, 2025
2025
-
[11]
Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences
Shankar, S., Zamfirescu-Pereira, J., Hartmann, B., Parameswaran, A. G., and Arawjo, I. Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST '24). ACM, 2024. doi:10.1145/3654777.3676450
2024
-
[12]
Large language models are inconsistent and biased evaluators
Stureborg, R., Alikaniotis, D., and Suhara, Y. Large language models are inconsistent and biased evaluators. arXiv preprint arXiv:2405.01724, 2024
2024
-
[13]
JudgeBench: A benchmark for evaluating LLM-based judges
Tan, S., Zhuang, S., Montgomery, K., Tang, W. Y., Cuadron, A., Wang, C., Popa, R. A., and Stoica, I. JudgeBench: A benchmark for evaluating LLM-based judges. In International Conference on Learning Representations (ICLR), 2025
2025
-
[14]
Replacing judges with juries: Evaluating LLM generations with a panel of diverse models
Verga, P., Hofstatter, S., Althammer, S., Su, Y., Piktus, A., Arkhangorodsky, A., Xu, M., White, N., and Lewis, P. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024
2024
-
[15]
Large language models are not fair evaluators
Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Kong, L., Liu, Q., Liu, T., and Sui, Z. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9440--9450. Association for Computational Linguistics, 2024
2024
-
[16]
Self-consistency improves chain of thought reasoning in language models
Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR), 2023
2023
-
[17]
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Datasets and Benchmarks Track, 2023
2023