Label-Free Reinforcement Learning via Cross-Model Entropy

Hossein Shirazi; Matt Gorbett

arxiv: 2605.29009 · v1 · pith:BMJZILZGnew · submitted 2026-05-27 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-29 14:15 UTC · model grok-4.3

keywords: reinforcement learning · label-free methods · large language models · reward signal · cross-model entropy · instruction following · GRPO

The pith

A separate verifier model's average log-likelihood supplies a continuous label-free reward that improves open-ended instruction following when plugged into GRPO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Cross-Model Entropy as the mean log-likelihood that an independent verifier assigns to a generator's response, turning that quantity into the reward for reinforcement learning. This signal is training-free, continuous, and independent of the generator, so it cannot be gamed by the same model reinforcing its own mistakes. Experiments show the resulting models beat their untrained bases in LLM-as-judge comparisons on AlpacaEval 2.0 across four families and three regimes.

Core claim

Cross-Model Entropy (CME) is the mean log-likelihood of a generator response under a separate verifier model; because the verifier is independent, responses it finds unsurprising serve as a reliable label-free reward for GRPO, extending effective reinforcement learning to open-ended instruction following where self-referential signals are inapplicable.

What carries the argument

Cross-Model Entropy (CME): the mean log-likelihood of a generator's response under an independent verifier model, used directly as the reward signal.
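
As a rough illustration of that definition, the sketch below averages per-token log-probabilities taken from a verifier's logits. It is not the authors' code: the function names `log_softmax` and `cme_reward` are hypothetical, and the logits are assumed to be already conditioned on the prompt and on earlier response tokens.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def cme_reward(verifier_logits, response_token_ids):
    """Mean log-likelihood of the response tokens under the verifier.

    verifier_logits[t] is assumed to hold the verifier's logits for
    response position t (hypothetical input layout).
    """
    if len(verifier_logits) != len(response_token_ids):
        raise ValueError("need one logit vector per response token")
    if not response_token_ids:
        raise ValueError("empty response")
    token_log_probs = [
        log_softmax(logits)[tok]
        for logits, tok in zip(verifier_logits, response_token_ids)
    ]
    return sum(token_log_probs) / len(token_log_probs)

# Toy example: 3-token vocabulary, 2-token response (made-up numbers).
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
likely = cme_reward(toy_logits, [0, 2])    # tokens the verifier expects
unlikely = cme_reward(toy_logits, [1, 0])  # tokens the verifier finds surprising
print(likely > unlikely)
```

Because the value is a mean of log-probabilities, it is never positive, and responses of different lengths are scored on the same per-token scale.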

If this is right

CME integrates into GRPO with no other changes to the training loop.
CME-trained models achieve tie-adjusted win rates of 52.5% to 71.4% against untrained bases on UltraFeedback prompts evaluated by LLM-as-Judge.
The improvement holds for pretrained, SFT, and instruction-tuned checkpoints in Qwen, Llama, Gemma, and OLMo families.
CME supplies a usable reward where ground-truth verifiers and human labels are unavailable and where self-referential signals cannot be applied.
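
The page does not spell out how the win rates are tie-adjusted. A common convention, assumed here and not confirmed by the source, is to count each tie as half a win. A minimal sketch under that assumption, with a hypothetical function name and made-up counts:

```python
def tie_adjusted_win_rate(wins, ties, losses):
    """Win rate with each tie counted as half a win (assumed convention)."""
    total = wins + ties + losses
    if total == 0:
        raise ValueError("no comparisons")
    return (wins + 0.5 * ties) / total

# Made-up counts for illustration only.
rate = tie_adjusted_win_rate(wins=60, ties=10, losses=30)
print(f"{rate:.1%}")
```

Under this convention a model that ties every comparison scores exactly 50%, which makes 50% the natural reference line for "no better than the base".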

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Pairing the verifier with a model from a different family than the generator could increase signal independence.
The same likelihood signal might serve as a lightweight quality filter during inference without any RL training.
CME could be tested on other open-ended generative tasks such as summarization or dialogue where automatic correctness checks do not exist.
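
The inference-time filter suggested above could look roughly like the following best-of-n selection. This is a sketch of the editorial idea, not something the paper reports; `pick_by_verifier` and the candidate data are hypothetical, and the per-token log-probabilities are assumed to come from a verifier scored elsewhere.

```python
def pick_by_verifier(candidates):
    """Return the candidate text with the highest mean verifier log-likelihood.

    candidates: list of (text, per_token_log_probs) pairs (hypothetical layout).
    """
    def mean_log_likelihood(item):
        _, log_probs = item
        if not log_probs:
            raise ValueError("candidate has no tokens")
        return sum(log_probs) / len(log_probs)

    if not candidates:
        raise ValueError("no candidates")
    return max(candidates, key=mean_log_likelihood)[0]

# Made-up scores for three sampled responses to the same prompt.
samples = [
    ("response A", [-0.4, -0.6, -0.5]),
    ("response B", [-2.0, -3.5, -1.5]),
    ("response C", [-0.9, -1.1]),
]
best = pick_by_verifier(samples)
print(best)
```

Such a filter inherits the same load-bearing premise as the training reward: it prefers whatever the verifier finds least surprising, which may or may not track quality.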

Load-bearing premise

Responses that a separate verifier model finds unsurprising are likely correct or high quality.

What would settle it

Head-to-head LLM-as-judge comparisons on AlpacaEval 2.0 in which CME-trained models lose to or tie the untrained base across the four tested families and three regimes would falsify the central effectiveness claim.

Figures

Figures reproduced from arXiv: 2605.29009 by Hossein Shirazi, Matt Gorbett.

**Figure 1.** Token-level CME localizes quality differences within a response. A thoughtful answer receives uniformly low verifier surprise (mean CE = 0.6); a vacuous one spikes as its tautology begins (mean CE = 3.4). Per-token surprise concentrates gradient signal where responses differ in quality, without ground-truth labels.
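
To make the per-token view concrete, the sketch below turns per-token log-probabilities into surprise values (negative log-probability), reports their mean, and finds the token where surprise peaks. The helper name `token_surprise` and all numbers are illustrative assumptions; they are not taken from the paper's figure.

```python
def token_surprise(tokens, log_probs):
    """Per-token surprise (-log p), its mean, and the most surprising token."""
    if len(tokens) != len(log_probs) or not tokens:
        raise ValueError("need one log-prob per token and at least one token")
    surprise = [-lp for lp in log_probs]
    mean_ce = sum(surprise) / len(surprise)
    peak_index = max(range(len(surprise)), key=surprise.__getitem__)
    return surprise, mean_ce, tokens[peak_index]

# Made-up example: surprise jumps where the response turns circular.
tokens = ["The", "answer", "is", "the", "answer"]
log_probs = [-0.5, -0.7, -0.3, -4.0, -5.0]
surprise, mean_ce, peak_token = token_surprise(tokens, log_probs)
print(mean_ce, peak_token)
```

With these definitions the sequence-level reward is simply the negative of the mean per-token cross-entropy, so a lower mean CE corresponds to a higher CME reward.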

**Figure 2.** Samples vs. OLMo-2-0425-1B-DPO (full responses in App. F). The weights satisfy Σ_t w_{t,s} = 1, preserving the total verifier log-likelihood. See Appendix B for edge cases and a worked example. [Section 3, Experiments] We fine-tune generators spanning four families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), in each case using gemma-3n-E4B-it (Gemma Team, 2025) as a f…

**Figure 3.** CME-GRPO win rates against the base and instruct comparators. Three effects emerge. (i) The random-weighted control (55.8%) underperforms every real-weighted verifier … [Bar chart: win rate (%) of the Qwen2.5-0.5B generator by verifier, with a reference line at tie (50%). Verifiers shown: random, gemma-3-270m-it, Qwen2.5-0.5B, Qwen2.5-0.5B-Instruct, OLMo-2-1B-DPO, Llama-3.2-1B-Instruct, Qwen2.5-1.5B-Instruct, gemma-4-E4B-it. Bar values recovered: 55.8, 59.2, 63.0, 64.2, 65.0, 70.0, …]

**Figure 4.** Overview of CME-GRPO. The generator πθ samples G responses to a prompt x. Each response is scored token-by-token by a verifier πϕ, producing per-token rewards that measure verifier surprise. Group-normalized advantages are combined with a KL penalty against a reference policy πref to form the GRPO objective.
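
The group normalization mentioned in the Figure 4 caption can be sketched as below, using the standard GRPO form (reward minus group mean, divided by group standard deviation). The use of the population standard deviation and the `eps` stabilizer are assumptions; the paper's exact normalization and its KL term are not reproduced here.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for G responses to one prompt.

    rewards: one scalar reward per sampled response (e.g. a CME value).
    eps: small constant to avoid division by zero (assumed, not from the paper).
    """
    if len(rewards) < 2:
        raise ValueError("need at least two responses per group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up CME rewards for a group of G = 4 responses.
rewards = [-0.6, -1.2, -3.4, -0.9]
advantages = group_advantages(rewards)
print(advantages)
```

Since only differences within the group matter, the absolute scale of the verifier's log-likelihood drops out; what the generator is pushed toward is whichever of its own samples the verifier found least surprising.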

Original abstract

Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness checks (e.g., mathematics, code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals like majority voting or token entropy over a model's own outputs, but risk reinforcing a model's own errors. In this work we propose Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following -- a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CME applies an independent verifier's log-likelihood as a simple RL reward and reports win rates over base models, but the claim that this tracks quality in open-ended tasks rests on an untested assumption.

The main takeaway is that this paper defines Cross-Model Entropy as the mean log-likelihood of a generator response under a separate verifier model and uses it as the reward in GRPO for open-ended instruction following. They report tie-adjusted win rates of 52.5-71.4% against the untrained base across Qwen, Llama, Gemma, and OLMo on AlpacaEval 2.0, covering pretrained, SFT, and instruction-tuned starting points.

What is actually new is the specific cross-model formulation integrated into GRPO without other changes, aimed at domains where self-entropy or majority voting cannot be applied. The independence from the generator is a clear step past prior self-referential label-free methods.

The experiments are straightforward and cover multiple model families, which is useful. The method itself requires no reward model training and is easy to state.

The soft spot is the central assumption. The abstract grounds the signal in the idea that responses a verifier finds unsurprising are likely high quality, yet provides no correlation study, human preference comparison, or counterexample check for open-ended prompts. In practice this could reward outputs that simply match the verifier's pretraining distribution or biases rather than helpfulness. The evaluation itself uses LLM-as-Judge, which carries the same unquantified risks, and the abstract supplies no error bars, statistical tests, or detailed verifier and prompt choices.

This is for researchers working on label-free or low-cost reward signals for LLM post-training. A reader interested in reward design would see a clean formulation worth testing further.

It deserves peer review to examine the full experimental details and any additional validation of the proxy.

Referee Report

2 major / 1 minor

Summary. The paper proposes Cross-Model Entropy (CME) as a label-free reward for RL post-training of LLMs: the mean log-likelihood of a generator response under an independent verifier model. It integrates CME into GRPO without other changes and reports that this yields higher tie-adjusted win rates (52.5%-71.4%) than the untrained base model on UltraFeedback prompts evaluated via LLM-as-Judge on AlpacaEval 2.0, across four model families and three regimes (pretrained, SFT, instruction-tuned). The approach is motivated as extending label-free RL to open-ended instruction following while avoiding self-referential gaming.

Significance. If the central result holds under rigorous evaluation, the work would supply a continuous, training-free reward signal that operates without ground-truth verifiers or human preferences, addressing a key bottleneck for open-ended tasks. The explicit independence of the verifier and the no-change integration into GRPO are concrete strengths. The stated intent to release code upon publication further supports reproducibility.

major comments (2)

[Abstract] The reported tie-adjusted win rates (52.5% to 71.4%) are presented without error bars, confidence intervals, statistical significance tests, or details on verifier model choice, prompt formatting, or the exact LLM-as-Judge protocol. This absence makes it impossible to determine whether the observed improvements exceed evaluation variance.

[Abstract] The grounding claim that 'responses a verifier finds unsurprising are likely correct or high quality' is asserted without any referenced derivation, correlation study with human preferences, or analysis of potential misalignment (e.g., stylistic or safety biases) for open-ended tasks; this premise is load-bearing for the claim that CME constitutes a reliable reward.
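
One way to attach the kind of interval the first comment asks for is a Wilson score interval on the win rate. The sketch below is illustrative only: the sample size `n = 200` is hypothetical (the page does not state how many comparisons were run), and treating a tie-adjusted rate as a binomial proportion is an approximation.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed over n trials.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical n; the two rates are the endpoints reported in the abstract.
low_end = wilson_interval(0.525, n=200)
high_end = wilson_interval(0.714, n=200)
print(low_end, high_end)
```

At this hypothetical sample size the interval around 52.5% includes 50% while the interval around 71.4% does not, which is why the missing sample sizes matter for reading the lower end of the reported range.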

minor comments (1)

[Abstract] The abstract states results across four model families and three regimes but does not indicate where the per-family or per-regime breakdowns appear in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions.

Referee: [Abstract] The reported tie-adjusted win rates (52.5% to 71.4%) are presented without error bars, confidence intervals, statistical significance tests, or details on verifier model choice, prompt formatting, or the exact LLM-as-Judge protocol. This absence makes it impossible to determine whether the observed improvements exceed evaluation variance.

Authors: We agree that the abstract lacks sufficient statistical and methodological detail. In the revised manuscript we will expand the abstract to report confidence intervals or standard errors on the win rates, name the specific verifier models employed, describe the LLM-as-Judge prompt template and formatting, and reference the statistical significance tests that appear in the main results tables.

revision: yes

Referee: [Abstract] The grounding claim that 'responses a verifier finds unsurprising are likely correct or high quality' is asserted without any referenced derivation, correlation study with human preferences, or analysis of potential misalignment (e.g., stylistic or safety biases) for open-ended tasks; this premise is load-bearing for the claim that CME constitutes a reliable reward.

Authors: The premise is presented as an intuitive grounding rather than a formally derived result. The current manuscript does not contain a dedicated correlation study or bias analysis. We will revise the abstract wording for precision and add a short discussion subsection that (i) cites prior work on likelihood as a quality proxy, (ii) acknowledges possible stylistic and safety misalignments, and (iii) reports a post-hoc correlation between CME scores and a small human preference subset where available.

revision: yes
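
The post-hoc correlation promised in (iii) could be computed with a rank correlation such as Spearman's, which only asks whether CME and human ratings order responses similarly. The sketch below is a plain-Python version with average ranks for ties; the function names and the data are hypothetical, and the choice of Spearman over other statistics is an assumption, not something the rebuttal specifies.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(ranks(x), ranks(y))

# Made-up CME scores and human ratings for five responses.
cme_scores = [-0.6, -1.2, -3.4, -0.9, -2.0]
human_ratings = [5, 3, 1, 4, 2]
rho = spearman(cme_scores, human_ratings)
print(rho)
```

A high rank correlation on such a subset would support the premise only for the prompts and verifier tested; it would not rule out the stylistic biases raised in the referee's comment.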

Circularity Check

0 steps flagged

No circularity: CME defined independently and evaluated externally

The paper defines CME directly as mean log-likelihood of generator outputs under a separate verifier model and inserts it unchanged into GRPO. The grounding principle is asserted as an external premise rather than derived from any equation or self-referential loop. No fitted parameters are relabeled as predictions, no self-citations bear the central claim, and no uniqueness theorems or ansatzes are imported from prior author work. Empirical results on AlpacaEval are presented as external validation, not tautological restatements of the input definition. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that cross-model likelihood correlates with quality and on standard RL assumptions; no explicit free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Responses a verifier finds unsurprising are likely correct or high quality.
Explicitly stated as the grounding principle for the reward signal.

