Peer-Predictive Self-Training for Language Model Reasoning

Fan Nie; Hanlin Zhang; Sham Kakade; Shi Feng; Yiling Chen

arXiv: 2604.13356 · v2 · submitted 2026-04-14 · 💻 cs.CL · cs.AI · cs.GT

Pith reviewed 2026-05-10 14:45 UTC · model grok-4.3

keywords: self-training · peer prediction · language model reasoning · cross-model aggregation · pointwise mutual information · mathematical reasoning · self-supervised learning

The pith

Multiple language models can improve reasoning accuracy by training on each other's aggregated answers without external labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Peer-Predictive Self-Training, a framework in which several language models generate answers to the same question sequentially and use their combined response as an internal training target. Pointwise mutual information measures how aligned each individual response is with the group aggregate, and this value scales the strength of the update so that misaligned outputs receive larger adjustments. Experiments across three mathematical reasoning benchmarks show consistent gains in exact-match accuracy for models of different sizes. The approach also narrows the gap between what a model can generate and what it can verify. This matters because it offers a route to continued self-improvement that depends only on peer interactions rather than human data or larger teacher models.

Core claim

Peer-Predictive Self-Training (PST) is a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses, serves as an internal target for learning. The informativeness of each intermediate response is measured using pointwise mutual information to scale self-training updates, with responses already aligned with the aggregate updated less and misaligned ones updated more.

What carries the argument

Cross-model aggregated response used as internal training target, with pointwise mutual information scaling the magnitude of updates for each peer generation.
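
A minimal sketch of how a PMI score might be turned into an update weight, assuming the two log-probabilities of the aggregate are already available from some model. The function names `pmi` and `update_weight`, the `temperature` parameter, and the logistic mapping are illustrative assumptions; the source only states that aligned responses are updated less and misaligned ones more.

```python
import math


def pmi(logp_agg_given_resp: float, logp_agg: float) -> float:
    """PMI between a response and the aggregate, from two log-probabilities.

    logp_agg_given_resp: log p(aggregate | question, response)
    logp_agg:            log p(aggregate | question)
    """
    return logp_agg_given_resp - logp_agg


def update_weight(pmi_value: float, temperature: float = 1.0) -> float:
    """Map PMI to a weight in (0, 1) that shrinks as PMI grows.

    Assumed mapping (logistic of negative PMI), not taken from the paper:
    high PMI (aligned) gives a small weight, low PMI a large one.
    """
    return 1.0 / (1.0 + math.exp(pmi_value / temperature))


# Hypothetical log-probabilities, chosen only for illustration.
aligned = pmi(logp_agg_given_resp=-0.5, logp_agg=-3.0)
misaligned = pmi(logp_agg_given_resp=-3.5, logp_agg=-3.0)
w_aligned = update_weight(aligned)
w_misaligned = update_weight(misaligned)
```

Any monotonically decreasing map from PMI to a positive weight would express the same idea; the logistic form is used here only because it keeps weights bounded.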

If this is right

Exact-match accuracy on the three math benchmarks rises between 2.2 and 4.3 percentage points for the tested models.
The average generator-verifier gap shrinks by 26 to 40 percent.
No external supervision, labeled data, or teacher-student hierarchy is required.
All gains arise from sequential generation and aggregation among peer models of comparable size.
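
The two headline numbers use different units: accuracy gains are absolute percentage points, while the gap reduction is relative to its starting value. A small sketch of the arithmetic, using made-up inputs that are not results from the paper:

```python
def percentage_point_gain(acc_before: float, acc_after: float) -> float:
    """Absolute change in accuracy, with both inputs given in percent."""
    return acc_after - acc_before


def relative_reduction(gap_before: float, gap_after: float) -> float:
    """Shrinkage of a gap, as a percentage of its starting value."""
    if gap_before == 0:
        raise ValueError("gap_before must be non-zero")
    return 100.0 * (gap_before - gap_after) / gap_before


# Made-up numbers, chosen only to illustrate the two units.
gain = percentage_point_gain(acc_before=40.0, acc_after=43.0)
shrink = relative_reduction(gap_before=0.20, gap_after=0.14)
```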

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same peer-aggregation idea could be tested on non-math tasks such as code generation or multi-step planning where group consensus might still be more reliable.
Increasing the number of participating peer models might produce stronger aggregates and larger improvements.
The method could support ongoing adaptation of deployed models in settings where new labeled data is unavailable.

Load-bearing premise

The cross-model aggregated answer must be reliably more accurate than the separate model responses so that training toward it produces net gains.
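
One way to check this premise on a labeled benchmark is to compare the exact-match accuracy of the aggregate with that of each individual model. The sketch below assumes answers are already extracted as strings; the helper names and the toy answers are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def exact_match_accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions equal to the gold answer after stripping whitespace."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return hits / len(gold)


def premise_holds(per_model: Dict[str, List[str]],
                  aggregate: List[str],
                  gold: List[str]) -> bool:
    """True if the aggregate is strictly more accurate than every single model."""
    agg_acc = exact_match_accuracy(aggregate, gold)
    return all(agg_acc > exact_match_accuracy(preds, gold)
               for preds in per_model.values())


# Toy data (hypothetical answers, for illustration only).
gold = ["4", "9", "12", "7"]
per_model = {
    "model_a": ["4", "9", "10", "6"],  # 2 of 4 correct
    "model_b": ["4", "8", "12", "6"],  # 2 of 4 correct
}
aggregate = ["4", "9", "12", "6"]      # 3 of 4 correct
```

If `premise_holds` returned `False` on the real benchmarks, training toward the aggregate would have no obvious reason to help.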

What would settle it

Apply PST to the same models on SimulEq, Math500, or MultiArith and observe zero or negative change in exact-match accuracy after training.

Figures

Figures reproduced from arXiv: 2604.13356 by Fan Nie, Hanlin Zhang, Sham Kakade, Shi Feng, Yiling Chen.

**Figure 1.** Overview of the PST framework. Grey: sample models and randomize their order. Blue: generate responses sequentially conditioned on prior outputs, yielding y_N. Purple: score each response by its contribution to predicting y_N. Green: update toward y_N with weights from these scores, correcting misaligned responses more.

**Figure 2.** Accuracy before and after fine-tuning on SimulEq, MATH-500-Numeric, and …

**Figure 3.** Initial GV matrices on SimulEq, Math500, and MultiArith, computed using …

**Figure 4.** Results across model scales and benchmarks.

**Figure 5.** Accuracy before and after fine-tuning on SimulEq, MATH-500-Numeric, and …

Original abstract

Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.
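
For reference, the standard definition of pointwise mutual information, written here for an intermediate response $y_i$ and the aggregate $y_N$ given a question $x$. The symbols $x$ and $y_i$ and the conditioning on $x$ are notational assumptions; the abstract does not give the exact formula the authors use.

```latex
\mathrm{PMI}(y_i ; y_N \mid x)
  = \log \frac{p(y_N \mid x, y_i)}{p(y_N \mid x)}
  = \log p(y_N \mid x, y_i) - \log p(y_N \mid x)
```

The quantity is positive when seeing $y_i$ makes $y_N$ more likely, zero when it makes no difference, and negative when it makes $y_N$ less likely.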

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

PST gives small math gains on tiny models via cross-model aggregation and PMI weighting, but the abstract leaves the key assumptions untested and the gains look too modest to stand out yet.

The paper's core move is to have several small LMs generate answers one after another on the same math question, treat their pooled response as the training target, and then scale each model's update by how much its answer informed the aggregate via PMI. They show 2.2–4.3 point exact-match lifts on SimulEq, Math500, and MultiArith for Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, plus a 26–40% drop in the generator-verifier gap, all without labels or a teacher-student split. That setup is new in the self-training literature they cite, and it cleanly sidesteps some circularity by keeping the target external to any one learner during the update step. The PMI term is a concrete way to give more gradient to responses that diverge from the group while leaving aligned ones mostly alone. Those are the parts worth paying attention to if you're working on label-free improvement for reasoning models.

The main gaps are straightforward. The abstract states the aggregate is “often more reliable” but supplies no direct comparison of aggregate accuracy versus single-model accuracy, no ablation that drops the PMI scaling, and no error bars or significance numbers. Without those, it is hard to tell whether the reported lifts come from the proposed mechanism or from simply running more training steps on the same data. The models are all under 2B and the gains are in the low single digits, so the practical payoff is still unclear.

If the full paper contains the missing controls and the improvements survive them, the idea is worth testing on larger models and other domains. For a reading group I would bring the aggregation and PMI steps to see whether they actually isolate the claimed effect. I would not cite the work yet. It is worth sending to peer review because the problem is live and the framing is clean, even though the current evidence is preliminary and needs tightening.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Peer-Predictive Self-Training (PST), a label-free fine-tuning method in which multiple language models generate responses sequentially to a prompt; an aggregated response serves as the internal training target, with pointwise mutual information (PMI) used to scale updates so that responses aligned with the aggregate receive smaller updates. On mathematical reasoning benchmarks (SimulEq, Math500, MultiArith), the method is reported to yield exact-match accuracy gains of 2.2–4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B while reducing the generator-verifier gap by 26–40 percent, relying solely on cross-model interactions without external supervision or teacher-student hierarchies.

Significance. If the empirical results prove robust, PST would represent a meaningful step toward unsupervised collaborative self-improvement of language models by exploiting peer-generated signals rather than external labels. The absence of a teacher-student hierarchy and the use of externally derived PMI weighting are genuine strengths that distinguish the approach from standard self-training. However, the current experimental support remains preliminary, limiting the immediate significance of the contribution.

major comments (3)

[Abstract] The central claim of 2.2–4.3 pp accuracy improvement is presented without baseline comparisons (e.g., majority voting, standard self-training, or unweighted aggregation), statistical significance tests, or variance across multiple runs, making it impossible to determine whether the gains exceed generic self-training effects.
[Abstract] The key assumption that the cross-model aggregate is 'often more reliable than individual responses in practice' is stated but not supported by any direct quantitative comparison of aggregate accuracy versus per-model accuracy on the reported benchmarks; this measurement is load-bearing for the quality of the training signal.
[Abstract] No ablation is reported that removes or replaces the PMI-based scaling, so it remains unclear whether the weighting mechanism itself drives the observed reductions in GV-Gap or whether simpler aggregation would suffice.

minor comments (1)

[Abstract] The manuscript would benefit from explicit pseudocode or a small worked example illustrating the sequential generation, aggregation, and PMI scaling steps to improve reproducibility.
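
The sequential generation, aggregation, and PMI scaling steps could be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' algorithm: the `Model` and `Scorer` interfaces, the toy peers, the fixed toy probabilities, and the logistic weight mapping are all hypothetical, and the last response in the shuffled order simply plays the role of the aggregate.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# A "model" maps (question, prior responses) to an answer string.
Model = Callable[[str, Sequence[str]], str]
# A scorer returns log p(target | question, context); hypothetical interface.
Scorer = Callable[[str, str, Sequence[str]], float]


def pst_round(question: str, models: List[Model], scorer: Scorer,
              rng: random.Random) -> Tuple[str, List[Tuple[str, float]]]:
    """One label-free round: shuffle, generate sequentially, score, weight."""
    if len(models) < 2:
        raise ValueError("need at least two peer models")
    order = list(models)
    rng.shuffle(order)                        # randomize the model order
    responses: List[str] = []
    for model in order:                       # each model sees earlier outputs
        responses.append(model(question, tuple(responses)))
    target = responses[-1]                    # final response acts as the aggregate
    base = scorer(target, question, ())       # log p(target | question)
    weighted = []
    for resp in responses[:-1]:
        pmi = scorer(target, question, (resp,)) - base
        weight = 1.0 / (1.0 + math.exp(pmi))  # assumed mapping: low PMI, large weight
        weighted.append((resp, weight))
    return target, weighted


def toy_model(own_answer: str) -> Model:
    """Hypothetical peer: votes with its own guess plus everything said so far."""
    def respond(question: str, prior: Sequence[str]) -> str:
        votes = list(prior) + [own_answer]
        # sorted() makes tie-breaking deterministic
        return max(sorted(set(votes)), key=votes.count)
    return respond


def toy_scorer(target: str, question: str, context: Sequence[str]) -> float:
    """Hypothetical log p(target | question, context) with fixed toy values."""
    if not context:
        return math.log(0.5)
    return math.log(0.9 if context[-1] == target else 0.1)


peers = [toy_model("4"), toy_model("4"), toy_model("5")]
target, weighted = pst_round("What is 2 + 2?", peers, toy_scorer, random.Random(0))
```

In a real setting the `(response, weight)` pairs would feed a weighted fine-tuning loss toward `target`; that step is omitted here because the source does not specify the loss.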

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below and indicate the revisions we will make to strengthen the presentation of our results.

Referee: [Abstract] The central claim of 2.2–4.3 pp accuracy improvement is presented without baseline comparisons (e.g., majority voting, standard self-training, or unweighted aggregation), statistical significance tests, or variance across multiple runs, making it impossible to determine whether the gains exceed generic self-training effects.

Authors: We acknowledge that the abstract would be clearer with explicit references to baselines and statistical details. The reported gains are measured relative to the base models, and the full experimental section evaluates PST against standard self-training and unweighted aggregation. To address the concern, we will revise the abstract to briefly note these comparisons, indicate that gains are consistent across runs with low variance, and reference the statistical significance tests provided in the main text and appendix.

Revision: yes

Referee: [Abstract] The key assumption that the cross-model aggregate is 'often more reliable than individual responses in practice' is stated but not supported by any direct quantitative comparison of aggregate accuracy versus per-model accuracy on the reported benchmarks; this measurement is load-bearing for the quality of the training signal.

Authors: We agree that a direct quantitative comparison on the reported benchmarks would better substantiate the assumption. Although the manuscript discusses aggregate reliability in the method section, we will add a supporting table or analysis in the revised version that reports aggregate accuracy versus individual model accuracies on SimulEq, Math500, and MultiArith.

Revision: yes

Referee: [Abstract] No ablation is reported that removes or replaces the PMI-based scaling, so it remains unclear whether the weighting mechanism itself drives the observed reductions in GV-Gap or whether simpler aggregation would suffice.

Authors: We recognize the importance of isolating the contribution of the PMI scaling. We will include a new ablation experiment in the revised manuscript that compares the full PST method against a variant using unweighted aggregation, to demonstrate the specific role of PMI-based scaling in reducing the generator-verifier gap.

Revision: yes
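
The exchange above turns on whether the gains are statistically distinguishable from noise. One common way to quantify that, sketched here as an assumption and not as the test the authors use, is a paired percentile bootstrap over per-example correctness before and after training. The function name and defaults are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(before: Sequence[int], after: Sequence[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean per-example accuracy change.

    before/after hold 0/1 correctness for the same examples, in the same order.
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping before/after paired.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy correctness vectors (hypothetical, not from the paper).
before = [0, 0, 1, 1, 0, 1, 0, 0]
after = [1, 0, 1, 1, 1, 1, 0, 1]
ci_low, ci_high = paired_bootstrap_ci(before, after)
```

An interval that excludes zero would suggest the change is unlikely to be resampling noise; the same routine can compare the full method against an unweighted variant.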

Circularity Check

0 steps flagged

No significant circularity; method uses externally generated cross-model signals evaluated on labeled benchmarks.

The PST method constructs training targets from cross-model aggregated responses generated sequentially by multiple independent models, with PMI computed directly from those generations rather than from any fitted parameters or self-referential loops. No equations or steps in the abstract reduce a claimed prediction to a fitted input by construction, nor do they rely on self-citations for uniqueness or ansatzes. Accuracy gains and GV-Gap reductions are measured against ground-truth labels on SimulEq, Math500, and MultiArith, providing an external benchmark independent of the training dynamics. This satisfies the default expectation of a self-contained derivation with no load-bearing reductions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the premise that peer aggregation yields a superior pseudo-label and that PMI quantifies useful informativeness. No new physical entities are introduced and no explicit free parameters are named in the abstract, though practical choices such as aggregation function and PMI smoothing are implicit.

axioms (2)

Domain assumption: The aggregated response from multiple models is more reliable than any individual response. This premise allows the aggregate to serve as the internal training target.
Domain assumption: Pointwise mutual information between a response and the aggregate accurately measures how informative the response is for learning. This justifies scaling the self-training updates by PMI.
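
The ledger mentions PMI smoothing as an implicit practical choice. A minimal sketch of what such smoothing might look like, assuming probabilities are estimated separately and can be zero: a small floor `eps` avoids taking the log of zero, and `clip` bounds the result. Both parameters and the function name are hypothetical, not values from the paper.

```python
import math


def smoothed_pmi(p_joint: float, p_resp: float, p_agg: float,
                 eps: float = 1e-6, clip: float = 5.0) -> float:
    """PMI = log[p(resp, agg) / (p(resp) * p(agg))], floored and clipped."""
    for p in (p_joint, p_resp, p_agg):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    raw = (math.log(max(p_joint, eps))
           - math.log(max(p_resp, eps))
           - math.log(max(p_agg, eps)))
    # Bound the score so a single rare event cannot dominate the update.
    return max(-clip, min(clip, raw))


independent = smoothed_pmi(p_joint=0.1, p_resp=0.25, p_agg=0.4)   # about 0
associated = smoothed_pmi(p_joint=0.2, p_resp=0.25, p_agg=0.4)    # log 2
never_together = smoothed_pmi(p_joint=0.0, p_resp=0.25, p_agg=0.4)
```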

pith-pipeline@v0.9.0 · 5547 in / 1645 out tokens · 61626 ms · 2026-05-10T14:45:42.075972+00:00 · methodology
