EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Pith reviewed 2026-05-07 03:11 UTC · model grok-4.3
The pith
EvoLM enables self-improvement in language models by co-evolving a rubric generator and a policy within a single model, using only self-generated temporal contrasts: an 8B model produces rubrics that beat GPT-4.1 on RewardBench-2, and the co-trained policy reaches 69.3% on the OLMo3-Adapt suite.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EVOLM trains a Qwen3-8B model to generate rubrics that outperform GPT-4.1 on RewardBench-2 by 25.7%. The co-trained policy achieves a 69.3% average on the OLMo3-Adapt suite, outperforming policies trained with GPT-4.1-prompted rubrics by 3.9% and with the state-of-the-art 8B reward model SkyWork-RM by 16%.
Load-bearing premise
That preference signals constructed solely from the policy's own outputs, via temporal contrast with earlier checkpoints, combined with rubrics optimized to maximize discrimination by a small frozen judge, provide reliable and unbiased training signals that genuinely improve capability rather than merely reinforcing the model's existing patterns.
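To make the premise concrete, here is a minimal sketch of temporal-contrast pair construction as the abstract describes it. The function name and checkpoint handles are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of temporal-contrast preference-pair construction.
# Later-checkpoint outputs are labeled "preferred" purely by training order;
# that ordering assumption is exactly what the premise above leans on.

def build_temporal_pairs(prompts, policy_t, policy_t_minus_k, sample_response):
    """Pair each prompt's current-checkpoint response (chosen) with an
    earlier-checkpoint response (rejected). No human labels are involved."""
    pairs = []
    for prompt in prompts:
        y_new = sample_response(policy_t, prompt)          # later checkpoint
        y_old = sample_response(policy_t_minus_k, prompt)  # earlier checkpoint
        pairs.append({"prompt": prompt, "chosen": y_new, "rejected": y_old})
    return pairs
```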
Original abstract
Language models encode substantial evaluative knowledge from pretraining, yet current post-training methods rely on external supervision (human annotations, proprietary models, or scalar reward models) to produce reward signals. Each imposes a ceiling. Human judgment cannot supervise capabilities beyond its own, proprietary APIs create dependencies, and verifiable rewards cover only domains with ground-truth answers. Self-improvement from a model's own evaluative capacity is a reward source that scales with the model itself, yet remains largely untapped by current methods. We introduce EVOLM, a post-training method that structures this capacity into explicit discriminative rubrics and uses them as training signal. EVOLM trains two capabilities within a single language model in alternation: (1) a rubric generator producing instance-specific evaluation criteria optimized for discriminative utility, which maximizes a small frozen judge's ability to distinguish preferred from dispreferred responses; and (2) a policy trained using those rubric-conditioned scores as reward. All preference signals are constructed from the policy's own outputs via temporal contrast with earlier checkpoints, requiring no human annotation or external supervision. EVOLM trains a Qwen3-8B model to generate rubrics that outperform GPT-4.1 on RewardBench-2 by 25.7%. The co-trained policy achieves 69.3% average on the OLMo3-Adapt suite, outperforming policies trained with GPT-4.1 prompted rubrics by 3.9% and with the state-of-the-art 8B reward model SkyWork-RM by 16%. Overall, EVOLM demonstrates that structuring a model's evaluative capacity into co-evolving discriminative rubrics enables self-improvement without external supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EVOLM, a post-training method in which a single language model alternates between (1) generating instance-specific discriminative rubrics optimized to maximize a small frozen judge's accuracy in separating temporally contrasted self-generated responses (later checkpoints preferred over earlier ones) and (2) using the resulting rubric-conditioned scores as the sole reward signal to update the policy. All preference data are constructed internally via temporal contrast, with no human annotations or external models. The authors report that the evolved rubric generator outperforms GPT-4.1 on RewardBench-2 by 25.7% and that the co-trained policy reaches a 69.3% average on the OLMo3-Adapt suite, exceeding both GPT-4.1-prompted rubrics (+3.9%) and the SkyWork-RM 8B baseline (+16%).
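As a reading aid, the "discriminative utility" objective can be pictured as pairwise judge accuracy under a candidate rubric. The sketch below is a hedged reconstruction from the abstract; `judge_score` is a hypothetical wrapper around the small frozen judge, and the paper's exact reward formulation may differ.

```python
def discriminative_utility(rubric, pairs, judge_score):
    """Fraction of preference pairs the frozen judge orders correctly
    when conditioned on `rubric` (higher = more discriminative rubric)."""
    correct = 0
    for pair in pairs:
        s_chosen = judge_score(rubric, pair["prompt"], pair["chosen"])
        s_rejected = judge_score(rubric, pair["prompt"], pair["rejected"])
        correct += s_chosen > s_rejected
    return correct / len(pairs)
```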
Significance. If the central claim survives rigorous controls for self-referential bias, the work would constitute a meaningful advance in scalable self-improvement: it shows how pre-trained evaluative knowledge can be explicitly structured into co-evolving rubrics that serve as an internal reward source, reducing dependence on human or proprietary supervision. The explicit rubric format also offers interpretability benefits over opaque scalar reward models and could scale with model size.
major comments (3)
- [§3] §3 (preference-pair construction): the central claim that temporal contrast supplies reliable capability signals rests on the assumption that later-checkpoint outputs are systematically preferred for reasons other than stylistic drift or sampling variance. No ablation is presented that severs this link (e.g., shuffled temporal labels, fixed external rubrics, or cross-model pairs; a sketch of the shuffled-label control appears after this list). Without such a control, the 25.7% RewardBench-2 and 3.9% OLMo3-Adapt gains could arise from the rubric generator learning to exploit consistent self-generated patterns rather than from genuine improvement.
- [§4.1] §4.1 (frozen judge): the rubric-optimization objective is defined entirely by maximizing discrimination accuracy of an unspecified 'small frozen judge.' Neither its architecture, training data, nor size is reported. Because every preference signal ultimately flows through this judge, the absence of these details makes it impossible to determine whether the loop contains any external anchor or remains fully self-referential.
- [§5] §5 (results tables): the headline performance deltas (25.7% on RewardBench-2, 3.9% on OLMo3-Adapt) are reported without standard deviations, number of independent runs, or statistical significance tests. Given known variance in LLM-as-judge evaluations, these omissions prevent a confident conclusion that the co-evolution reliably outperforms the GPT-4.1 and SkyWork-RM baselines.
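As one concrete version of the missing control, a hypothetical shuffled-label ablation: retrain with chosen/rejected roles randomly swapped and check whether the reported gains persist. Everything here is illustrative, not from the paper.

```python
import random

def shuffle_temporal_labels(pairs, seed=0):
    """Control ablation: randomly swap chosen/rejected roles in half the
    preference pairs, destroying the temporal signal on average. If a run
    trained on these shuffled pairs still reproduces the reported gains,
    the temporal contrast was not the driver of improvement."""
    rng = random.Random(seed)
    shuffled = []
    for pair in pairs:
        if rng.random() < 0.5:
            pair = {**pair, "chosen": pair["rejected"], "rejected": pair["chosen"]}
        shuffled.append(pair)
    return shuffled
```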
minor comments (2)
- [Abstract / §5] The OLMo3-Adapt suite is referenced in the abstract and results without an explicit list of constituent tasks or a citation; readers cannot reproduce the 69.3% average without this information.
- [§3] Notation for the alternating training schedule (rubric-generator steps vs. policy steps) is described in prose but would benefit from a compact algorithmic box or explicit step indices; see the illustrative sketch after this list.
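In the spirit of the requested algorithmic box, one plausible rendering of the alternating schedule with explicit step indices. The step counts and update functions are hypothetical, since the paper describes the schedule only in prose; `build_temporal_pairs` is the sketch given earlier.

```python
K_RUBRIC, K_POLICY, N_ROUNDS = 100, 500, 10  # assumed step counts, not the paper's

def evolm_loop(model, prompts, judge_score, sample_response,
               update_rubric_generator, update_policy, snapshot):
    """Hypothetical EVOLM-style alternation: one model, two training phases
    per round, with preference pairs rebuilt from the latest checkpoint gap."""
    old_policy = snapshot(model)                          # earlier checkpoint
    for round_idx in range(N_ROUNDS):
        pairs = build_temporal_pairs(prompts, model, old_policy, sample_response)
        for _ in range(K_RUBRIC):                         # phase 1: rubric generator
            update_rubric_generator(model, pairs, judge_score)
        for _ in range(K_POLICY):                         # phase 2: policy
            update_policy(model, prompts, judge_score)    # rubric-conditioned reward
        old_policy = snapshot(model)                      # advance the temporal contrast
```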
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing our responses and indicating the revisions we will incorporate into the manuscript.
Point-by-point responses
- Referee: [§3] §3 (preference-pair construction): the central claim that temporal contrast supplies reliable capability signals rests on the assumption that later-checkpoint outputs are systematically preferred for reasons other than stylistic drift or sampling variance. No ablation is presented that severs this link (e.g., shuffled temporal labels, fixed external rubrics, or cross-model pairs). Without such a control, the 25.7% RewardBench-2 and 3.9% OLMo3-Adapt gains could arise from the rubric generator learning to exploit consistent self-generated patterns rather than from genuine improvement.
  Authors: We acknowledge that the assumption underlying temporal contrast requires explicit validation to rule out confounds such as stylistic drift or sampling artifacts. While the approach aligns with prior self-improvement methods that treat checkpoint progression as a proxy for capability gains, we agree that the manuscript would benefit from targeted controls. In the revised version, we will add ablations using shuffled temporal labels, fixed external rubrics, and cross-model preference pairs. These experiments will demonstrate that the rubric generator's discriminative performance depends on the temporal improvement signal rather than on exploiting consistent self-generated patterns. revision: yes
- Referee: [§4.1] §4.1 (frozen judge): the rubric-optimization objective is defined entirely by maximizing discrimination accuracy of an unspecified 'small frozen judge.' Neither its architecture, training data, nor size is reported. Because every preference signal ultimately flows through this judge, the absence of these details makes it impossible to determine whether the loop contains any external anchor or remains fully self-referential.
  Authors: We apologize for the insufficient reporting of the small frozen judge. We will expand §4.1 with a complete description of the judge, including its architecture, training data, and size. The judge is a smaller model trained independently on a fixed collection of preference pairs generated from the base model prior to the start of co-evolution. This fixed component serves as an external anchor for the rubric optimization objective, ensuring the overall training loop is not entirely self-referential. revision: yes
- Referee: [§5] §5 (results tables): the headline performance deltas (25.7% on RewardBench-2, 3.9% on OLMo3-Adapt) are reported without standard deviations, number of independent runs, or statistical significance tests. Given known variance in LLM-as-judge evaluations, these omissions prevent a confident conclusion that the co-evolution reliably outperforms the GPT-4.1 and SkyWork-RM baselines.
  Authors: We agree that measures of variability and statistical testing are necessary to support the reported gains, particularly given the stochasticity of LLM-as-judge evaluations. We will revise the results tables in §5 to include standard deviations from multiple independent evaluation runs and to report the outcomes of statistical significance tests comparing EvoLM against the GPT-4.1-prompted and SkyWork-RM baselines. These additions will be based on re-evaluations performed for the revision. revision: yes
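For readers who want the shape of such a test, a generic paired-bootstrap sketch over per-task scores from matched evaluation runs; this is standard methodology, not code from the paper.

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_boot=10_000, seed=0):
    """Paired bootstrap over per-task scores from matched runs of two systems.
    Returns the mean difference (a - b) and a one-sided p-value: the fraction
    of bootstrap resamples in which system A fails to beat system B."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    boot_means = np.array([
        rng.choice(diffs, size=len(diffs), replace=True).mean()
        for _ in range(n_boot)
    ])
    return diffs.mean(), float((boot_means <= 0).mean())
```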
Axiom & Free-Parameter Ledger
free parameters (2)
- identity and size of the small frozen judge
- alternation frequency and training schedule between rubric generator and policy
axioms (2)
- domain assumption: A language model's pretraining encodes substantial evaluative knowledge that can be structured into explicit discriminative rubrics
- domain assumption: Temporal contrast between a policy's current outputs and its earlier checkpoints produces reliable preferred/dispreferred pairs
invented entities (1)
- discriminative rubrics: no independent evidence
discussion (0)