Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Junfeng Fang; XiuYu Zhang; Yi Shan; Zhenkai Liang

arxiv: 2606.05122 · v1 · pith:YQCAPNL5new · submitted 2026-06-03 · 💻 cs.CL

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

XiuYu Zhang , Yi Shan , Junfeng Fang , Zhenkai Liang This is my paper

Pith reviewed 2026-06-28 06:28 UTC · model grok-4.3

classification 💻 cs.CL

keywords self-evaluationLLM calibrationjudge alignmentfew-shot promptingreinforcement learningdistillationbase modelsopen-ended responses

0 comments

The pith

Base language models already predict how external judges would score their open-ended responses when given few-shot prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that base LLMs possess a latent ability to forecast an external judge's multi-attribute quality scores on their own outputs, performing above chance across three benchmarks with only a few-shot prompt. This capability does not require targeted training to acquire. The authors present Self-Evaluation Elicitation, a two-phase cycle that first couples reinforcement learning to improve answers while predicting judge scores, then applies masked distillation to refine the predictions alone. Using 160 examples, the method raises held-out calibration while leaving answer quality unchanged and produces predictions that remain stable across judges never encountered in training.

Core claim

A base model prompted with few-shot examples can already predict how an external judge will assign multi-attribute quality scores to its open-ended responses, performing well above chance on three benchmarks. Self-Evaluation Elicitation elicits this latent ability through a calibration-coupled reinforcement learning phase followed by masked distillation, achieving better held-out calibration from 160 examples while preserving answer quality. The elicited capability is localized in the model's token distribution and transfers to unseen judges.

What carries the argument

Self-Evaluation Elicitation (SEE), a short cycle consisting of a calibration-coupled reinforcement learning phase that jointly refines answers and judge-score predictions, followed by a masked distillation phase that sharpens only the predictions.

If this is right

The elicited self-evaluation transfers to judges the model was never trained against.
Calibration on held-out data improves across the three benchmarks tested.
Answer quality stays the same after the elicitation cycle completes.
The ability appears localized inside the model's own token distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Minimal-data elicitation may generalize to other latent capabilities that current alignment methods treat as skills to be taught from scratch.
If the quality notion is already inside the base model, targeted interventions could focus on localization rather than broad retraining.
The transfer across unseen judges suggests the notion may connect to general response properties rather than judge-specific preferences.

Load-bearing premise

The observed few-shot performance reflects a stable, transferable notion of quality already present in the base model rather than an artifact of the specific prompt format or benchmark construction.

What would settle it

A base model given the same few-shot prompt fails to predict judge scores above chance on a fresh benchmark with different response formats, or the SEE cycle fails to raise held-out calibration while leaving answer quality unchanged.

Figures

Figures reproduced from arXiv: 2606.05122 by Junfeng Fang, XiuYu Zhang, Yi Shan, Zhenkai Liang.

**Figure 1.** Figure 1: Overview of the SEE cycle. Phase 1, Calibration-Coupled RL: the policy answers a prompt and appends an inline [SELF_EVAL] block of five attribute scores (Self Eval); an external judge scores the same response on the same attributes; the reward combines a quality term over the three evaluative attributes with a calibration term over all five (Eq. 1). Each rollout, with its self-scores and judge scores, is w… view at source ↗

**Figure 2.** Figure 2: Sample efficiency. Quality (left) and calibration (right) against sample-passes for SEE and Adapted RLCR. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Per-attribute calibration on Arena-Hard [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Base models already predict judge scores above chance with few-shot prompts, and SEE claims to improve it with 31x less data, but no ablations or numbers are visible to test if it's real or prompt-driven.

read the letter

The main takeaway is that the authors observe base LLMs already predict an external judge's multi-attribute scores on open-ended answers above chance using prompted few-shot examples across three benchmarks, and their SEE method (calibration-coupled RL followed by masked distillation) lifts held-out calibration from just 160 examples while leaving answer quality intact.

They do a reasonable job reframing the issue as elicitation of something already present rather than new acquisition, and the claim that the elicited behavior stays stable across judges never seen in training is a useful angle if the data supports it.

The soft spot is exactly the one in the stress-test note: nothing rules out that the few-shot gains come from surface pattern matching in the prompt template or example ordering instead of a stable internal quality notion. Without ablations on prompt phrasing, example selection, or output format, and with no metrics, error bars, or dataset details supplied, it's impossible to tell how robust the result is. The abstract alone does not give enough to assess whether the 31x data reduction holds up or depends on post-hoc choices.

This is for people working on efficient LLM evaluation and alignment who want to cut down on labeled judge data. A reader focused on practical ways to reduce supervision in self-evaluation would get value if the experiments check out.

I would send it to peer review because the core observation, if real, is practically relevant and the minimal-data angle matters, but the referees would need to see the full results, ablations on prompt sensitivity, and reproducibility checks before any stronger claims.

Referee Report

2 major / 2 minor

Summary. The paper claims that base LLMs already possess a latent ability to predict an external judge's multi-attribute quality scores on open-ended responses, which can be surfaced via few-shot prompting well above chance on three benchmarks. It introduces Self-Evaluation Elicitation (SEE), a two-phase procedure (calibration-coupled RL followed by masked distillation) that uses only 160 examples—roughly 31x fewer than a standard RL baseline—to improve held-out calibration while preserving answer quality. The resulting self-evaluation is localized in the model's token distribution and generalizes to judges not seen during training, supporting the interpretation that the model harbors a transferable internal notion of quality rather than acquiring one through targeted training.

Significance. If the central results are robust, the work reframes judge-aligned self-evaluation as primarily an elicitation problem, which could substantially lower the data requirements for alignment techniques. The reported localization of the self-evaluation signal and its stability across unseen judges are notable strengths that would, if confirmed, provide evidence for pre-existing quality representations in base models. The minimal-data regime (160 examples) and preservation of answer quality are also positive if the comparisons to the RL baseline hold under controlled conditions.

major comments (2)

[Abstract and §3] Abstract and §3 (method overview): The claim that few-shot performance demonstrates a 'stable, transferable notion of quality latent in the base model' is load-bearing for the reframing of the problem as elicitation rather than acquisition. However, the manuscript provides no ablations on prompt phrasing, example ordering, or output format variations while holding judge scores fixed. If performance drops under modest rephrasing of the few-shot template, the result would be consistent with surface pattern matching rather than elicitation of an internal calibration.
[§4] §4 (experimental results): The reported gains from SEE over the RL baseline are presented without error bars, dataset splits, or details on how the 160 examples were selected or whether they overlap with the held-out evaluation sets. This makes it difficult to assess whether the 31x data reduction and improved calibration are robust or sensitive to post-hoc choices in example construction.

minor comments (2)

[Abstract] The abstract refers to 'three benchmarks' without naming them; the introduction should explicitly list the benchmarks and their characteristics to allow readers to assess domain coverage.
[§3] Notation for the multi-attribute quality scores and the masking operation in the distillation phase should be defined more formally (e.g., with equations) rather than described only in prose, to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and indicate planned revisions to improve clarity and robustness.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (method overview): The claim that few-shot performance demonstrates a 'stable, transferable notion of quality latent in the base model' is load-bearing for the reframing of the problem as elicitation rather than acquisition. However, the manuscript provides no ablations on prompt phrasing, example ordering, or output format variations while holding judge scores fixed. If performance drops under modest rephrasing of the few-shot template, the result would be consistent with surface pattern matching rather than elicitation of an internal calibration.

Authors: We agree that the absence of explicit ablations on prompt phrasing, ordering, and output format leaves the interpretation open to the alternative of surface-level pattern matching. While the reported generalization to entirely unseen judges offers some counter-evidence, it does not fully substitute for controlled prompt variations. We will therefore add a short ablation subsection reporting performance under two rephrased templates and one altered output format, using the same fixed judge scores. revision: yes
Referee: [§4] §4 (experimental results): The reported gains from SEE over the RL baseline are presented without error bars, dataset splits, or details on how the 160 examples were selected or whether they overlap with the held-out evaluation sets. This makes it difficult to assess whether the 31x data reduction and improved calibration are robust or sensitive to post-hoc choices in example construction.

Authors: The concern is valid; the current presentation omits error bars and precise selection details. The 160 examples were drawn uniformly at random from the training split of each benchmark with zero overlap to the held-out evaluation sets, and the RL baseline was trained on an identically sized random subset for direct comparison. In revision we will (i) report mean and standard deviation over five random seeds, (ii) state the exact random-selection procedure and non-overlap guarantee, and (iii) include a table of the train/validation/test splits used. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical elicitation method is self-contained

full rationale

The paper introduces SEE as an empirical procedure (few-shot prompting plus a short RL+distillation cycle) whose performance is measured on held-out benchmarks against external judges. No equations, fitted parameters, or derivations appear that reduce the claimed latent calibration to quantities defined by the method itself. Claims rest on experimental deltas (31x fewer examples, stability across judges) rather than self-referential definitions or self-citation chains. The central premise that the ability is 'already there' is tested by ablation-style comparisons to baselines, not assumed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities beyond naming the SEE procedure; the work implicitly relies on standard RL and distillation assumptions from prior LLM literature.

invented entities (1)

Self-Evaluation Elicitation (SEE) no independent evidence
purpose: Elicit latent judge calibration via short RL-plus-distillation cycle
Newly named method introduced to surface the claimed latent ability

pith-pipeline@v0.9.1-grok · 5723 in / 1016 out tokens · 23753 ms · 2026-06-28T06:28:01.032265+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 1 canonical work pages

[1]

and Zhang, Hao and Gonzalez, Joseph E

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

2023
[2]

Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, R...

2022
[3]

2022 , eprint=

Constitutional AI: Harmlessness from AI Feedback , author=. 2022 , eprint=

2022
[4]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Lee, Harrison and Phatale, Samrat and Mansoor, Hassan and Mesnard, Thomas and Ferret, Johan and Lu, Kellie and Bishop, Colton and Hall, Ethan and Carbune, Victor and Rastogi, Abhinav and Prakash, Sushant , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024
[5]

Beyond Binary Rewards: Training

Mehul Damani and Isha Puri and Stewart Slocum and Idan Shenfeld and Leshem Choshen and Yoon Kim and Jacob Andreas , booktitle=. Beyond Binary Rewards: Training. 2026 , url=

2026
[6]

The Fourteenth International Conference on Learning Representations , year=

LaSeR: Reinforcement Learning with Last-Token Self-Rewarding , author=. The Fourteenth International Conference on Learning Representations , year=
[7]

2025 , eprint=

Post-Completion Learning for Language Models , author=. 2025 , eprint=

2025
[8]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[9]

The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

HelpSteer 2: Open-source dataset for training top-performing reward models , author=. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
[10]

2022 , eprint=

Language Models (Mostly) Know What They Know , author=. 2022 , eprint=

2022
[11]

2023 , url=

Chunting Zhou and Pengfei Liu and Puxin Xu and Srini Iyer and Jiao Sun and Yuning Mao and Xuezhe Ma and Avia Efrat and Ping Yu and LILI YU and Susan Zhang and Gargi Ghosh and Mike Lewis and Luke Zettlemoyer and Omer Levy , booktitle=. 2023 , url=

2023
[12]

s1: Simple test-time scaling

Muennighoff, Niklas and Yang, Zitong and Shi, Weijia and Li, Xiang Lisa and Fei-Fei, Li and Hajishirzi, Hannaneh and Zettlemoyer, Luke and Liang, Percy and Cand \`e s, Emmanuel and Hashimoto, Tatsunori. s1: Simple test-time scaling. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1025

work page doi:10.18653/v1/2025.emnlp-main.1025 2025
[13]

2025 , url=

Yixin Ye and Zhen Huang and Yang Xiao and Ethan Chern and Shijie Xia and Pengfei Liu , booktitle=. 2025 , url=

2025
[14]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in

Yang Yue and Zhiqi Chen and Rui Lu and Andrew Zhao and Zhaokai Wang and Yang Yue and Shiji Song and Gao Huang , booktitle=. Does Reinforcement Learning Really Incentivize Reasoning Capacity in. 2026 , url=

2026
[15]

2026 , eprint=

Spurious Rewards: Rethinking Training Signals in RLVR , author=. 2026 , eprint=

2026
[16]

The Fourteenth International Conference on Learning Representations , year=

AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning , author=. The Fourteenth International Conference on Learning Representations , year=
[17]

The Twelfth International Conference on Learning Representations , year=

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes , author=. The Twelfth International Conference on Learning Representations , year=
[18]

2023 , eprint=

Reinforced Self-Training (ReST) for Language Modeling , author=. 2023 , eprint=

2023
[19]

Transactions on Machine Learning Research , issn=

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

2024
[20]

2023 , url=

Hanze Dong and Wei Xiong and Deepanshu Goyal and Yihan Zhang and Winnie Chow and Rui Pan and Shizhe Diao and Jipeng Zhang and KaShun SHUM and Tong Zhang , journal=. 2023 , url=

2023
[21]

2022 , url=

Eric Zelikman and Yuhuai Wu and Jesse Mu and Noah Goodman , booktitle=. 2022 , url=

2022
[22]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024
[23]

2025 , eprint=

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators , author=. 2025 , eprint=

2025
[24]

Forty-second International Conference on Machine Learning , year=

From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and Benchbuilder Pipeline , author=. Forty-second International Conference on Machine Learning , year=
[25]

2024 , eprint=

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild , author=. 2024 , eprint=

2024

[1] [1]

and Zhang, Hao and Gonzalez, Joseph E

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

2023

[2] [2]

Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, R...

2022

[3] [3]

2022 , eprint=

Constitutional AI: Harmlessness from AI Feedback , author=. 2022 , eprint=

2022

[4] [4]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Lee, Harrison and Phatale, Samrat and Mansoor, Hassan and Mesnard, Thomas and Ferret, Johan and Lu, Kellie and Bishop, Colton and Hall, Ethan and Carbune, Victor and Rastogi, Abhinav and Prakash, Sushant , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024

[5] [5]

Beyond Binary Rewards: Training

Mehul Damani and Isha Puri and Stewart Slocum and Idan Shenfeld and Leshem Choshen and Yoon Kim and Jacob Andreas , booktitle=. Beyond Binary Rewards: Training. 2026 , url=

2026

[6] [6]

The Fourteenth International Conference on Learning Representations , year=

LaSeR: Reinforcement Learning with Last-Token Self-Rewarding , author=. The Fourteenth International Conference on Learning Representations , year=

[7] [7]

2025 , eprint=

Post-Completion Learning for Language Models , author=. 2025 , eprint=

2025

[8] [8]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[9] [9]

The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

HelpSteer 2: Open-source dataset for training top-performing reward models , author=. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

[10] [10]

2022 , eprint=

Language Models (Mostly) Know What They Know , author=. 2022 , eprint=

2022

[11] [11]

2023 , url=

Chunting Zhou and Pengfei Liu and Puxin Xu and Srini Iyer and Jiao Sun and Yuning Mao and Xuezhe Ma and Avia Efrat and Ping Yu and LILI YU and Susan Zhang and Gargi Ghosh and Mike Lewis and Luke Zettlemoyer and Omer Levy , booktitle=. 2023 , url=

2023

[12] [12]

s1: Simple test-time scaling

Muennighoff, Niklas and Yang, Zitong and Shi, Weijia and Li, Xiang Lisa and Fei-Fei, Li and Hajishirzi, Hannaneh and Zettlemoyer, Luke and Liang, Percy and Cand \`e s, Emmanuel and Hashimoto, Tatsunori. s1: Simple test-time scaling. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1025

work page doi:10.18653/v1/2025.emnlp-main.1025 2025

[13] [13]

2025 , url=

Yixin Ye and Zhen Huang and Yang Xiao and Ethan Chern and Shijie Xia and Pengfei Liu , booktitle=. 2025 , url=

2025

[14] [14]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in

Yang Yue and Zhiqi Chen and Rui Lu and Andrew Zhao and Zhaokai Wang and Yang Yue and Shiji Song and Gao Huang , booktitle=. Does Reinforcement Learning Really Incentivize Reasoning Capacity in. 2026 , url=

2026

[15] [15]

2026 , eprint=

Spurious Rewards: Rethinking Training Signals in RLVR , author=. 2026 , eprint=

2026

[16] [16]

The Fourteenth International Conference on Learning Representations , year=

AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning , author=. The Fourteenth International Conference on Learning Representations , year=

[17] [17]

The Twelfth International Conference on Learning Representations , year=

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes , author=. The Twelfth International Conference on Learning Representations , year=

[18] [18]

2023 , eprint=

Reinforced Self-Training (ReST) for Language Modeling , author=. 2023 , eprint=

2023

[19] [19]

Transactions on Machine Learning Research , issn=

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

2024

[20] [20]

2023 , url=

Hanze Dong and Wei Xiong and Deepanshu Goyal and Yihan Zhang and Winnie Chow and Rui Pan and Shizhe Diao and Jipeng Zhang and KaShun SHUM and Tong Zhang , journal=. 2023 , url=

2023

[21] [21]

2022 , url=

Eric Zelikman and Yuhuai Wu and Jesse Mu and Noah Goodman , booktitle=. 2022 , url=

2022

[22] [22]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024

[23] [23]

2025 , eprint=

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators , author=. 2025 , eprint=

2025

[24] [24]

Forty-second International Conference on Machine Learning , year=

From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and Benchbuilder Pipeline , author=. Forty-second International Conference on Machine Learning , year=

[25] [25]

2024 , eprint=

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild , author=. 2024 , eprint=

2024