Scaling Evaluation-time Compute with Reasoning Models as Evaluators

Carolin Lawrence; Graham Neubig; Ian Wu; Jinu Lee; Julia Hockenmaier; Kiril Gashteovski; Mingyeong Moon; Sean Welleck; Seongyun Lee; Seungone Kim

arxiv: 2503.19877 · v2 · pith:D444J7UAnew · submitted 2025-03-25 · 💻 cs.CL

Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Carolin Lawrence, Kiril Gashteovski, Julia Hockenmaier, Graham Neubig, Sean Welleck

keywords compute, evaluation, reasoning, evaluators, models, test-time, time, capability

As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
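The reranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Candidate`, `combined_score`, and `best_of_n`, the use of the minimum step score as the process score, and the mixing weight `alpha` are all assumptions made for illustration. The evaluator is represented by two hypothetical callables, one scoring the whole response (outcome evaluation) and one scoring a single step (process evaluation).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    steps: List[str]  # intermediate reasoning steps of one generation
    answer: str       # final answer of that generation


# Hypothetical evaluator interfaces (assumptions, not from the paper):
# OutcomeEval scores the whole response in [0, 1];
# StepEval scores the step at a given index in [0, 1].
OutcomeEval = Callable[[Candidate], float]
StepEval = Callable[[Candidate, int], float]


def combined_score(
    cand: Candidate,
    outcome_eval: OutcomeEval,
    step_eval: StepEval,
    alpha: float = 0.5,
) -> float:
    """Mix an outcome score with a process score.

    Assumption: the process score is the minimum over step scores,
    so a single bad step drags the whole response down.
    """
    step_scores = [step_eval(cand, i) for i in range(len(cand.steps))]
    process = min(step_scores) if step_scores else 0.0
    return alpha * outcome_eval(cand) + (1.0 - alpha) * process


def best_of_n(
    candidates: List[Candidate],
    outcome_eval: OutcomeEval,
    step_eval: StepEval,
    alpha: float = 0.5,
) -> Candidate:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(
        candidates,
        key=lambda c: combined_score(c, outcome_eval, step_eval, alpha),
    )
```

In this sketch, evaluation-time compute grows with the number of steps scored per candidate, while generation-time compute grows with the number of candidates, which is the trade-off the abstract describes.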

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evalet: Evaluating Large Language Models through Functional Fragmentation
cs.HC 2025-09 conditional novelty 7.0

Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.

Process Rewards with Learned Reliability
cs.CL 2026-05 unverdicted novelty 6.0

BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.

On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization
cs.CL 2025-09 unverdicted novelty 6.0

Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.

GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models
cs.CL 2025-09 unverdicted novelty 6.0

GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.