RPRA: Predicting an LLM-Judge for Efficient but Performant Inference

Changsheng Zhao; Dylan R. Ashley; Ernie Chang; Gaël Le Lan; Jürgen Schmidhuber; Mingchen Zhuge; Naina Dhingra; Vikas Chandra; Yangyang Shi; Zhipeng Cai

arXiv: 2604.12634 · v1 · submitted 2026-04-14 · cs.AI · cs.CL · cs.LG · cs.MA

Pith reviewed 2026-05-10 14:40 UTC · model grok-4.3

keywords: models, they, when, believe, predict, prediction, report, smaller

The pith

Models predict the score an LLM judge would give their output, using zero-shot prompting, an in-context report card, or fine-tuning, so that they can decide whether to answer themselves or defer to a larger model. Smaller models improve their prediction accuracy by up to 55% on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests two paradigms that let a language model decide when to answer a question itself and when to ask a bigger model for help. In the simpler Predict-Answer (PA) paradigm, a model guesses in advance how a separate judge model would rate its response. The more involved Reason-Predict-Reason-Answer (RPRA) paradigm adds reasoning steps before and after the prediction. Three prediction methods are tried: asking the model to guess without examples, giving it a report card of past performance as context, and training it on examples through fine-tuning. Larger reasoning models already guess judge scores reasonably well with no extra help. Smaller models improve substantially once given the report card or after fine-tuning, with mean improvements of up to 55 percent and 52 percent across the tested datasets, respectively. The core idea is that models can become better at knowing their own limits, which could let systems mix small and large models more smartly.
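The Predict-Answer decision described above can be sketched as a small routing function. This is an illustration under stated assumptions, not the authors' implementation: the names `predict_answer`, `Routed`, and `toy_predictor`, the 1-10 score scale, and the threshold of 7 are all hypothetical, and the model calls are replaced by toy stand-ins so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Routed:
    answer: str
    used_large_model: bool
    predicted_score: float


def predict_answer(
    query: str,
    predict_score: Callable[[str], float],  # small model's guess of the judge score
    small_answer: Callable[[str], str],     # small model's answer function
    large_answer: Callable[[str], str],     # larger model used on deferral
    threshold: float = 7.0,                 # hypothetical cut-off on an assumed 1-10 scale
) -> Routed:
    """Predict the judge score first, then answer locally or defer."""
    predicted = predict_score(query)
    if predicted >= threshold:
        return Routed(small_answer(query), False, predicted)
    return Routed(large_answer(query), True, predicted)


# Toy stand-in for the prediction step: short queries are treated as easy.
def toy_predictor(query: str) -> float:
    return 9.0 if len(query.split()) <= 5 else 3.0


easy = predict_answer(
    "What is 2 + 2?",
    toy_predictor,
    lambda q: "small: 4",
    lambda q: "large: 4",
)
hard = predict_answer(
    "Prove that every bounded monotone sequence of reals converges.",
    toy_predictor,
    lambda q: "small: (attempt)",
    lambda q: "large: (proof)",
)
print(easy.used_large_model, hard.used_large_model)
```

In this sketch the prediction happens before any answer is generated, which is where the efficiency argument comes from: a deferred query costs one cheap prediction plus one large-model call, not two full answers.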

Core claim

Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively.

Load-bearing premise

That an LLM judge's score is a faithful proxy for actual output quality and that accurate prediction of that score will produce deferral decisions that improve the efficiency-quality trade-off without introducing new biases or failure modes.

Original abstract

Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices such as phones or laptops. One way to address this challenge is by following the example of humans and have models ask for help when they believe they are incapable of solving a problem on their own; we can overcome this trade-off by allowing smaller models to respond to queries when they believe they can provide good responses, and deferring to larger models when they do not believe they can. To this end, in this paper, we investigate the viability of Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict -- prior to responding -- how an LLM judge would score their output. We evaluate three approaches: zero-shot prediction, prediction using an in-context report card, and supervised fine-tuning. Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively. These findings suggest that models can learn to predict their own performance limitations, paving the way for more efficient and self-aware AI systems.
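To make the in-context report card idea concrete, here is one way such a prompt might be assembled. The abstract does not specify the report card's format, so everything below is an assumption: the per-category layout, the 1-10 scale, the prompt wording, and the helper names `build_report_card` and `build_prediction_prompt` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_report_card(history: list[tuple[str, float]]) -> str:
    """Summarise past (category, judge_score) pairs as plain text."""
    by_category: dict[str, list[float]] = defaultdict(list)
    for category, score in history:
        by_category[category].append(score)
    lines = ["Report card (mean judge score on past responses):"]
    for category in sorted(by_category):
        scores = by_category[category]
        lines.append(f"- {category}: {mean(scores):.1f} over {len(scores)} responses")
    return "\n".join(lines)


def build_prediction_prompt(report_card: str, query: str) -> str:
    """Prepend the report card to a query and ask for a score prediction."""
    return (
        f"{report_card}\n\n"
        f"Query: {query}\n"
        "Before answering, predict the score (1-10) a judge would give your response.\n"
        "Predicted score:"
    )


# Made-up history purely for illustration; not data from the paper.
history = [("arithmetic", 9.0), ("arithmetic", 8.0), ("proofs", 3.0)]
card = build_report_card(history)
prompt = build_prediction_prompt(card, "What is 17 * 23?")
print(prompt)
```

The point of the sketch is only that the report card is ordinary context: it requires no weight updates, which matches the abstract's contrast between this approach and supervised fine-tuning.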

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Smaller models improve at predicting LLM-judge scores via report cards or fine-tuning, but the paper never tests whether those predictions produce better quality-compute tradeoffs when used for actual deferral.

The main thing to know is that this paper shows smaller models can get substantially better at guessing what score an LLM judge would assign to their answers, with report cards or fine-tuning lifting accuracy by up to 55% and 52% on average. They do not, however, check if using those predictions to decide when to answer versus defer actually improves the efficiency-quality balance in practice.

The RPRA sequence has the model reason about the query, predict the judge score, reason again, then answer or act. They compare zero-shot, in-context report card, and supervised fine-tuning on multiple datasets. Larger reasoning models already do reasonably well zero-shot, while smaller ones gain the most from the added help. The specific framing and the head-to-head results against those baselines look new. The work does a clean job laying out the three prediction methods and reporting the concrete accuracy lifts. The report-card trick is simple and effective for boosting performance without extra training.

The soft spot is the missing end-to-end test. The motivation is hybrid inference where a small model defers when it predicts a low judge score, yet the results stop at how well the prediction matches the judge. There are no runs that apply a threshold, measure final output quality against compute, or plot Pareto curves versus always-small or always-large baselines. Prediction accuracy does not automatically mean better deferral decisions, especially if errors cluster on hard examples or the judge itself is biased. The assumption that the judge score is a faithful quality proxy is left unexamined.

This paper is for researchers working on model routing, confidence estimation, and on-device LLM deployment. Readers interested in practical self-assessment techniques will find the prediction experiments useful. It has enough clear thinking and reproducible-style comparisons to deserve serious referee time, though the authors should add the deferral measurements before publication. I would send it out for review with a note to close that loop.
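The missing end-to-end test mentioned in the letter could look roughly like the threshold sweep below. It is a sketch under assumptions, not an experiment from the paper: the `Record` fields, the `sweep` helper, and all numbers are hypothetical, and compute cost is approximated by the fraction of queries deferred to the large model.

```python
from typing import NamedTuple


class Record(NamedTuple):
    predicted: float    # small model's predicted judge score for its own answer
    small_score: float  # judge score actually given to the small model's answer
    large_score: float  # judge score given to the large model's answer


def sweep(records: list[Record], thresholds: list[float]) -> list[tuple[float, float, float]]:
    """Return (threshold, fraction_deferred, mean_final_score) per threshold."""
    points = []
    for t in thresholds:
        # Defer whenever the predicted score falls below the threshold.
        deferred = [r.predicted < t for r in records]
        final = [r.large_score if d else r.small_score for r, d in zip(records, deferred)]
        points.append((t, sum(deferred) / len(records), sum(final) / len(final)))
    return points


# Made-up numbers purely for illustration; not results from the paper.
records = [
    Record(predicted=9, small_score=9, large_score=9),
    Record(predicted=8, small_score=7, large_score=9),
    Record(predicted=3, small_score=2, large_score=8),
    Record(predicted=2, small_score=6, large_score=8),
]

# -inf reproduces the always-small baseline, +inf the always-large baseline.
curve = sweep(records, thresholds=[float("-inf"), 5, float("inf")])
for threshold, frac_deferred, mean_score in curve:
    print(threshold, frac_deferred, mean_score)
```

Plotting `fraction_deferred` against `mean_final_score` over a finer grid of thresholds would give the efficiency-quality curve the letter asks for, with the two infinite thresholds as its endpoints.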

Referee Report

2 major / 2 minor

Summary. The paper proposes two paradigms, PA and RPRA, in which LLMs predict the score that an LLM judge would assign to their response before generating it. This prediction is intended to allow smaller models to answer queries they are likely to handle well and defer to larger models otherwise. The authors evaluate zero-shot prediction, in-context 'report card' prompting, and supervised fine-tuning across models of varying sizes, reporting that larger reasoning models excel at zero-shot prediction while smaller models achieve substantial accuracy gains (up to 55% with report cards and 52% with fine-tuning) after adaptation.

Significance. If the reported prediction accuracy improvements translate into superior deferral decisions, the work could meaningfully advance efficient inference for LLMs on edge devices by reducing unnecessary use of large models. The exploration of report cards and fine-tuning as methods to improve smaller models' self-assessment is a useful contribution. However, without end-to-end evaluation of the deferral policy, the practical significance remains unclear.

major comments (2)

[Abstract] The abstract claims that the methods 'pave the way for more efficient and self-aware AI systems' by enabling deferral based on predicted judge scores. However, the experiments measure only the accuracy of predicting the judge's score and do not evaluate the downstream deferral system (e.g., by applying a threshold to the prediction, measuring final output quality against a judge or humans, and plotting efficiency-quality trade-offs against baselines such as always using the small model or always deferring). This gap makes it impossible to confirm that higher prediction accuracy yields better overall performance.
The abstract provides no information on the datasets, LLM judges, number of trials, or statistical significance of the reported accuracy improvements, which are central to evaluating the claims.
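The second comment matters partly because an "improvement of up to 55%" can be read in more than one way. The sketch below shows the distinction on made-up data. It assumes exact-match accuracy on integer judge scores, which is an assumption about the metric and not something the abstract states; the `accuracy` helper and all numbers are hypothetical.

```python
def accuracy(predicted: list[int], judged: list[int]) -> float:
    """Fraction of items where the predicted score equals the judge's score."""
    if len(predicted) != len(judged) or not judged:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == j for p, j in zip(predicted, judged)) / len(judged)


# Made-up scores purely for illustration; not results from the paper.
judged = [8, 3, 9, 5, 7, 2, 6, 9, 4, 8]
zero_shot = [8, 7, 9, 8, 7, 8, 8, 8, 8, 8]
adapted = [8, 3, 9, 5, 7, 2, 6, 8, 5, 7]

base = accuracy(zero_shot, judged)
improved = accuracy(adapted, judged)

absolute_gain = improved - base           # difference in accuracy (percentage points / 100)
relative_gain = (improved - base) / base  # gain as a fraction of the baseline accuracy

print(f"baseline={base:.2f} adapted={improved:.2f}")
print(f"absolute gain={absolute_gain:.2f} relative gain={relative_gain:.2f}")
```

On this toy data the same change is a 30-point absolute gain and a 75% relative gain, so stating which one is meant, along with the datasets, judges, and number of trials, would make the headline figures easier to evaluate.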

minor comments (2)

Consider adding a table summarizing the accuracy results across models and methods for better readability.
The motivation section could more explicitly discuss potential biases introduced by using LLM judges as proxies for quality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the scope and presentation of our work. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: [Abstract] The abstract claims that the methods 'pave the way for more efficient and self-aware AI systems' by enabling deferral based on predicted judge scores. However, the experiments measure only the accuracy of predicting the judge's score and do not evaluate the downstream deferral system (e.g., by applying a threshold to the prediction, measuring final output quality against a judge or humans, and plotting efficiency-quality trade-offs against baselines such as always using the small model or always deferring). This gap makes it impossible to confirm that higher prediction accuracy yields better overall performance.

Authors: We agree that the manuscript evaluates prediction accuracy rather than a complete deferral pipeline. The core contribution is demonstrating that models can predict LLM-judge scores via zero-shot, report-card, and fine-tuning approaches, with smaller models showing large gains (up to 55% and 52%). Accurate self-assessment is a prerequisite for any deferral policy; without it, deferral decisions would be unreliable. The abstract language is intentionally forward-looking ('pave the way') rather than claiming end-to-end superiority. In revision we will (1) moderate the abstract to state that the work establishes the prediction mechanism as an enabler for deferral-based efficiency and (2) add a dedicated discussion subsection that illustrates how the reported accuracies could be used in simple threshold-based deferral policies, including back-of-the-envelope estimates of quality-efficiency trade-offs relative to always-small and always-large baselines. Full empirical deferral experiments (with human or judge evaluation of final outputs) lie outside the current scope but are noted as important future work.

revision: partial

Referee: The abstract provides no information on the datasets, LLM judges, number of trials, or statistical significance of the reported accuracy improvements, which are central to evaluating the claims.

Authors: We acknowledge that the abstract is high-level and omits these specifics. While abstracts are length-constrained, we will revise it to include concise references to the evaluation setting (standard LLM evaluation benchmarks, representative LLM judges such as GPT-4, multiple trials per condition, and that accuracy gains are reported as means with statistical details provided in the experimental sections). This will give readers immediate context without expanding the abstract beyond reasonable limits.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of prediction accuracy against external judges

The paper contains no equations, derivations, or first-principles claims. It reports experimental results on how well LLMs can predict scores from separate LLM judges (zero-shot, in-context report cards, fine-tuning), with improvements measured directly against those external judges. No step reduces a reported quantity to a fitted parameter or self-defined input by construction, and no load-bearing self-citations or uniqueness theorems are invoked. The evaluation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper is an empirical machine-learning study whose central claims rest on standard assumptions about LLM evaluation rather than new mathematical derivations or invented physical entities.

free parameters (1)

fine-tuning hyperparameters
Supervised fine-tuning of smaller models necessarily involves choices of learning rate, batch size, and number of epochs that are fitted to the training data.

axioms (1)

domain assumption: LLM judges provide consistent and reliable scalar scores of response quality that can serve as training targets and evaluation metrics.
The entire PA/RPRA framework treats judge scores as ground truth for both training and measuring success.

pith-pipeline@v0.9.0 · 5611 in / 1386 out tokens · 37754 ms · 2026-05-10T14:40:23.393307+00:00 · methodology
