RPRA: Predicting an LLM-Judge for Efficient but Performant Inference
Pith reviewed 2026-05-10 14:40 UTC · model grok-4.3
The pith
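Models predict the score an LLM judge would give their output, using zero-shot prompting, an in-context report card, or fine-tuning, so that they can decide whether to answer themselves or defer to a larger model. Report cards and fine-tuning raise smaller models' prediction accuracy by up to 55% and 52%, respectively, on average across datasets.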
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively.
Load-bearing premise
That an LLM judge's score is a faithful proxy for actual output quality and that accurate prediction of that score will produce deferral decisions that improve the efficiency-quality trade-off without introducing new biases or failure modes.
Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices such as phones or laptops. One way to address this challenge is by following the example of humans and have models ask for help when they believe they are incapable of solving a problem on their own; we can overcome this trade-off by allowing smaller models to respond to queries when they believe they can provide good responses, and deferring to larger models when they do not believe they can. To this end, in this paper, we investigate the viability of Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict -- prior to responding -- how an LLM judge would score their output. We evaluate three approaches: zero-shot prediction, prediction using an in-context report card, and supervised fine-tuning. Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively. These findings suggest that models can learn to predict their own performance limitations, paving the way for more efficient and self-aware AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes two paradigms, PA and RPRA, in which LLMs predict the score that an LLM judge would assign to their response before generating it. This prediction is intended to allow smaller models to answer queries they are likely to handle well and defer to larger models otherwise. The authors evaluate zero-shot prediction, in-context 'report card' prompting, and supervised fine-tuning across models of varying sizes, reporting that larger reasoning models excel at zero-shot prediction while smaller models achieve substantial accuracy gains (up to 55% with report cards and 52% with fine-tuning) after adaptation.
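The predict-then-answer-or-defer loop summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `predict_then_answer`, `RoutedResponse`, the three toy callables, the 1-to-10 score scale, and the threshold of 7.0 are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedResponse:
    text: str
    answered_by: str        # "small" or "large"
    predicted_score: float  # the small model's guess at the judge's score


def predict_then_answer(
    query: str,
    predict_judge_score: Callable[[str], float],
    small_answer: Callable[[str], str],
    large_answer: Callable[[str], str],
    threshold: float = 7.0,  # assumed cut-off on an assumed 1-10 judge scale
) -> RoutedResponse:
    """Answer locally if the predicted judge score clears the threshold, else defer."""
    # The prediction is made before any answer is generated.
    predicted = predict_judge_score(query)
    if predicted >= threshold:
        return RoutedResponse(small_answer(query), "small", predicted)
    return RoutedResponse(large_answer(query), "large", predicted)


# Hypothetical stand-ins so the sketch runs without any real model.
def toy_predictor(query: str) -> float:
    # Pretend short queries are easy for the small model.
    return 9.0 if len(query.split()) <= 5 else 3.0


def toy_small(query: str) -> str:
    return f"[small] {query}"


def toy_large(query: str) -> str:
    return f"[large] {query}"


easy = predict_then_answer("What is 2 + 2?", toy_predictor, toy_small, toy_large)
hard = predict_then_answer(
    "Prove that every bounded monotone sequence of reals converges.",
    toy_predictor, toy_small, toy_large,
)
print(easy.answered_by, hard.answered_by)
```

In a real system the predictor would be the small model itself, prompted zero-shot, given a report card, or fine-tuned; the toy heuristic here only exercises the control flow.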
Significance. If the reported prediction accuracy improvements translate into superior deferral decisions, the work could meaningfully advance efficient inference for LLMs on edge devices by reducing unnecessary use of large models. The exploration of report cards and fine-tuning as methods to improve smaller models' self-assessment is a useful contribution. However, without end-to-end evaluation of the deferral policy, the practical significance remains unclear.
major comments (2)
- [Abstract] The abstract claims that the methods 'pave the way for more efficient and self-aware AI systems' by enabling deferral based on predicted judge scores. However, the experiments measure only the accuracy of predicting the judge's score and do not evaluate the downstream deferral system (e.g., by applying a threshold to the prediction, measuring final output quality against a judge or humans, and plotting efficiency-quality trade-offs against baselines such as always using the small model or always deferring). This gap makes it impossible to confirm that higher prediction accuracy yields better overall performance.
- The abstract provides no information on the datasets, LLM judges, number of trials, or statistical significance of the reported accuracy improvements, which are central to evaluating the claims.
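The end-to-end evaluation requested in the first major comment could look roughly like the sketch below: sweep a threshold over the predicted score, record how often the system defers and what final judge score it obtains, and compare against the always-small and always-large baselines. The data are synthetic and every name (`sweep_thresholds`, the record layout, the 1-to-10 scale, the noise levels) is an assumption made for illustration, not something taken from the paper.

```python
import random


def sweep_thresholds(records, thresholds):
    """Return (threshold, fraction deferred, mean final judge score) per threshold.

    Each record is (predicted_score, small_model_score, large_model_score).
    """
    rows = []
    for t in thresholds:
        deferred = 0
        total_score = 0.0
        for predicted, small_score, large_score in records:
            if predicted >= t:
                total_score += small_score  # small model answers on its own
            else:
                total_score += large_score  # defer to the large model
                deferred += 1
        rows.append((t, deferred / len(records), total_score / len(records)))
    return rows


# Synthetic records: the large model is never worse than the small one,
# and the prediction is a noisy view of the small model's true judge score.
rng = random.Random(0)
records = []
for _ in range(1000):
    small_score = rng.uniform(1, 10)
    large_score = min(10.0, small_score + rng.uniform(0, 4))
    predicted = min(10.0, max(1.0, small_score + rng.gauss(0, 1.5)))
    records.append((predicted, small_score, large_score))

always_small = sum(r[1] for r in records) / len(records)
always_large = sum(r[2] for r in records) / len(records)
rows = sweep_thresholds(records, thresholds=[1, 3, 5, 7, 9, 11])

print(f"always-small mean score: {always_small:.2f}")
print(f"always-large mean score: {always_large:.2f}")
for t, frac, quality in rows:
    print(f"threshold={t:>2}  deferred={frac:.2f}  mean_score={quality:.2f}")
```

The two extreme thresholds reproduce the baselines (never defer, always defer); the interesting question for the paper's claim is how quickly mean score approaches the always-large value as the deferred fraction grows.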
minor comments (2)
- Consider adding a table summarizing the accuracy results across models and methods for better readability.
- The motivation section could more explicitly discuss potential biases introduced by using LLM judges as proxies for quality.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps clarify the scope and presentation of our work. We address each major comment below and indicate the revisions we will make to the manuscript.
- Referee: [Abstract] The abstract claims that the methods 'pave the way for more efficient and self-aware AI systems' by enabling deferral based on predicted judge scores. However, the experiments measure only the accuracy of predicting the judge's score and do not evaluate the downstream deferral system (e.g., by applying a threshold to the prediction, measuring final output quality against a judge or humans, and plotting efficiency-quality trade-offs against baselines such as always using the small model or always deferring). This gap makes it impossible to confirm that higher prediction accuracy yields better overall performance.
  Authors: We agree that the manuscript evaluates prediction accuracy rather than a complete deferral pipeline. The core contribution is demonstrating that models can predict LLM-judge scores via zero-shot, report-card, and fine-tuning approaches, with smaller models showing large gains (up to 55% and 52%). Accurate self-assessment is a prerequisite for any deferral policy; without it, deferral decisions would be unreliable. The abstract language is intentionally forward-looking ('pave the way') rather than claiming end-to-end superiority. In revision we will (1) moderate the abstract to state that the work establishes the prediction mechanism as an enabler for deferral-based efficiency and (2) add a dedicated discussion subsection that illustrates how the reported accuracies could be used in simple threshold-based deferral policies, including back-of-the-envelope estimates of quality-efficiency trade-offs relative to always-small and always-large baselines. Full empirical deferral experiments (with human or judge evaluation of final outputs) lie outside the current scope but are noted as important future work.
  revision: partial
- Referee: The abstract provides no information on the datasets, LLM judges, number of trials, or statistical significance of the reported accuracy improvements, which are central to evaluating the claims.
  Authors: We acknowledge that the abstract is high-level and omits these specifics. While abstracts are length-constrained, we will revise it to include concise references to the evaluation setting (standard LLM evaluation benchmarks, representative LLM judges such as GPT-4, multiple trials per condition, and that accuracy gains are reported as means with statistical details provided in the experimental sections). This will give readers immediate context without expanding the abstract beyond reasonable limits.
  revision: yes
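One way the statistical detail discussed in the second exchange could be reported is a paired bootstrap confidence interval on the mean accuracy gain. The sketch below is an assumption-laden illustration: it treats prediction accuracy as per-item correctness (1 if the model's predicted judge score matched the judge, 0 otherwise), uses synthetic data, and the function name `bootstrap_mean_diff_ci` is hypothetical. The paper may use a different metric or test.

```python
import random


def bootstrap_mean_diff_ci(baseline, adapted, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(adapted) - mean(baseline) over paired items."""
    assert len(baseline) == len(adapted)
    n = len(baseline)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(adapted[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = sum(a - b for a, b in zip(adapted, baseline)) / n
    return point, lo, hi


# Synthetic per-item correctness: zero-shot baseline vs. an adapted small model.
rng = random.Random(1)
baseline = [1 if rng.random() < 0.40 else 0 for _ in range(500)]
adapted = [1 if rng.random() < 0.70 else 0 for _ in range(500)]

point, lo, hi = bootstrap_mean_diff_ci(baseline, adapted)
print(f"accuracy gain: {point:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

An interval that excludes zero would support the claim that the adaptation improves prediction accuracy on that dataset; reporting one per dataset and model would address the referee's request directly.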
Circularity Check
No circularity: purely empirical evaluation of prediction accuracy against external judges
The paper contains no equations, derivations, or first-principles claims. It reports experimental results on how well LLMs can predict scores from separate LLM judges (zero-shot, in-context report cards, fine-tuning), with improvements measured directly against those external judges. No step reduces a reported quantity to a fitted parameter or self-defined input by construction, and no load-bearing self-citations or uniqueness theorems are invoked. The evaluation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- fine-tuning hyperparameters
axioms (1)
- domain assumption: LLM judges provide consistent and reliable scalar scores of response quality that can serve as training targets and evaluation metrics