Peer-Predictive Self-Training for Language Model Reasoning
Pith reviewed 2026-05-10 14:45 UTC · model grok-4.3
The pith
Multiple language models can improve reasoning accuracy by training on each other's aggregated answers without external labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Peer-Predictive Self-Training (PST) is a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses, serves as an internal target for learning. The informativeness of each intermediate response is measured using pointwise mutual information to scale self-training updates, with responses already aligned with the aggregate updated less and misaligned ones updated more.
What carries the argument
Cross-model aggregated response used as internal training target, with pointwise mutual information scaling the magnitude of updates for each peer generation.
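The scaling rule can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the abstract gives no exact formula, so the sketch assumes PMI is the log-probability of the aggregate given the question and a peer response, minus its log-probability given the question alone, and it assumes a sigmoid of the negated PMI as the scaling function. The names `pmi`, `update_weight`, and `temperature` are hypothetical.

```python
import math

def pmi(logp_agg_given_response: float, logp_agg_given_question: float) -> float:
    """PMI of a response and the aggregate: log p(a | q, r) - log p(a | q)."""
    return logp_agg_given_response - logp_agg_given_question

def update_weight(pmi_value: float, temperature: float = 1.0) -> float:
    """Assumed scaling: sigmoid(-PMI / T), in (0, 1) and decreasing in PMI."""
    return 1.0 / (1.0 + math.exp(pmi_value / temperature))

# Toy probabilities, chosen only to show the direction of the effect.
aligned = pmi(math.log(0.9), math.log(0.3))     # response makes the aggregate likelier
misaligned = pmi(math.log(0.1), math.log(0.3))  # response makes the aggregate less likely
print(update_weight(aligned), update_weight(misaligned))
```

With these toy numbers the aligned response gets a weight of roughly 0.25 and the misaligned one roughly 0.75, matching the stated behaviour that aligned responses are updated less.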
If this is right
- Exact-match accuracy on the three math benchmarks rises between 2.2 and 4.3 percentage points for the tested models.
- The average generator-verifier gap shrinks by 26 to 40 percent.
- No external supervision, labeled data, or teacher-student hierarchy is required.
- All gains arise from sequential generation and aggregation among peer models of comparable size.
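The first two bullets use different units: the accuracy gains are absolute (percentage points), while the GV-Gap change is relative (a percent of the starting gap). The sketch below illustrates the distinction with made-up numbers; none of the values are results from the paper, and the helper names are hypothetical.

```python
def gain_in_points(acc_before: float, acc_after: float) -> float:
    """Absolute change in accuracy, in percentage points (inputs in percent)."""
    return acc_after - acc_before

def relative_reduction(gap_before: float, gap_after: float) -> float:
    """Relative shrinkage of a gap, as a percent of its starting value."""
    return 100.0 * (gap_before - gap_after) / gap_before

# Illustrative values only, not taken from the paper.
print(gain_in_points(40.0, 43.0))      # 3.0 percentage points
print(relative_reduction(10.0, 7.0))   # 30.0 percent
```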
Where Pith is reading between the lines
- The same peer-aggregation idea could be tested on non-math tasks such as code generation or multi-step planning where group consensus might still be more reliable.
- Increasing the number of participating peer models might produce stronger aggregates and larger improvements.
- The method could support ongoing adaptation of deployed models in settings where new labeled data is unavailable.
Load-bearing premise
The cross-model aggregated answer must be reliably more accurate than the separate model responses so that training toward it produces net gains.
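One way to check this premise directly is to compare the aggregate's exact-match accuracy with each model's on a labeled set. The sketch below assumes majority voting as a stand-in aggregator; in PST the aggregate comes from sequential generation, which the abstract does not specify in detail. All answers and labels are toy data.

```python
from collections import Counter

def exact_match(preds, labels):
    """Fraction of predictions that equal the gold label after trimming."""
    return sum(p.strip() == g.strip() for p, g in zip(preds, labels)) / len(labels)

def majority_vote(answers):
    """Most common answer; ties resolve to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

labels = ["4", "9", "12", "7"]
per_model = {
    "model_a": ["4", "9", "11", "7"],
    "model_b": ["4", "8", "12", "7"],
    "model_c": ["5", "9", "12", "6"],
}
# Vote question by question across the three models.
aggregate = [majority_vote(col) for col in zip(*per_model.values())]
for name, preds in per_model.items():
    print(name, exact_match(preds, labels))
print("aggregate", exact_match(aggregate, labels))
```

In this toy case the aggregate is right on every question while no single model is, which is the situation the premise requires. If the aggregate were no better than the best individual model, training toward it would have little to offer that model.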
What would settle it
Apply PST to the same models on SimulEq, Math500, or MultiArith and observe zero or negative change in exact-match accuracy after training.
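A minimal sketch of how such a test could be scored, assuming per-question correctness flags before and after training are available. The paired bootstrap is a generic choice made here, not a procedure the paper is known to use, and the flags below are toy data.

```python
import random

def paired_bootstrap_delta(before, after, n_resamples=2000, seed=0):
    """Mean accuracy change and a 95% percentile interval from paired resampling."""
    rng = random.Random(seed)
    n = len(before)
    diffs = [a - b for a, b in zip(after, before)]  # per-question change
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lo, hi

before = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]  # 1 = exact match, 0 = miss
after = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
delta, lo, hi = paired_bootstrap_delta(before, after)
print(f"delta={delta:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

An interval that contains zero, or lies entirely below it, would correspond to the outcome described above.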
Original abstract
Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Peer-Predictive Self-Training (PST), a label-free fine-tuning method in which multiple language models generate responses sequentially to a prompt; an aggregated response serves as the internal training target, with pointwise mutual information (PMI) used to scale updates so that responses aligned with the aggregate receive smaller updates. On mathematical reasoning benchmarks (SimulEq, Math500, MultiArith), the method is reported to yield exact-match accuracy gains of 2.2–4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B while reducing the generator-verifier gap by 26–40 percent, relying solely on cross-model interactions without external supervision or teacher-student hierarchies.
Significance. If the empirical results prove robust, PST would represent a meaningful step toward unsupervised collaborative self-improvement of language models by exploiting peer-generated signals rather than external labels. The absence of a teacher-student hierarchy and the use of externally derived PMI weighting are genuine strengths that distinguish the approach from standard self-training. However, the current experimental support remains preliminary, limiting the immediate significance of the contribution.
major comments (3)
- [Abstract] The central claim of 2.2–4.3 pp accuracy improvement is presented without baseline comparisons (e.g., majority voting, standard self-training, or unweighted aggregation), statistical significance tests, or variance across multiple runs, making it impossible to determine whether the gains exceed generic self-training effects.
- [Abstract] The key assumption that the cross-model aggregate is 'often more reliable than individual responses in practice' is stated but not supported by any direct quantitative comparison of aggregate accuracy versus per-model accuracy on the reported benchmarks; this measurement is load-bearing for the quality of the training signal.
- [Abstract] No ablation is reported that removes or replaces the PMI-based scaling, so it remains unclear whether the weighting mechanism itself drives the observed reductions in GV-Gap or whether simpler aggregation would suffice.
minor comments (1)
- [Abstract] The manuscript would benefit from explicit pseudocode or a small worked example illustrating the sequential generation, aggregation, and PMI scaling steps to improve reproducibility.
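A toy walk-through of the kind the comment asks for might look like the sketch below. It is an illustration under stated assumptions, not the authors' algorithm: it assumes the last response in the sequence plays the role of the aggregate, assumes a sigmoid of the negated PMI as the update weight, and uses hypothetical `Generator` and `Scorer` interfaces with hard-coded stand-ins instead of real models.

```python
import math
from typing import Callable, List

# Hypothetical interfaces: a generator maps (question, earlier responses) to an
# answer; a scorer returns log p(target | question, context responses).
Generator = Callable[[str, List[str]], str]
Scorer = Callable[[str, List[str], str], float]

def pst_round(question: str, models: List[Generator], scorer: Scorer):
    """One label-free round: sequential generation, aggregate, PMI weights."""
    responses: List[str] = []
    for generate in models:                   # each model sees earlier responses
        responses.append(generate(question, list(responses)))
    aggregate = responses[-1]                 # assumed: final response is the aggregate
    base = scorer(question, [], aggregate)    # log p(a | q)
    weights = []
    for r in responses[:-1]:
        pmi = scorer(question, [r], aggregate) - base
        weights.append(1.0 / (1.0 + math.exp(pmi)))  # assumed sigmoid(-PMI)
    return responses[:-1], aggregate, weights

# Toy stand-ins so the sketch runs without any model.
def first(q, prev): return "x = 3"
def second(q, prev): return "x = 2"
def final(q, prev): return "x = 2"

def toy_scorer(q, context, target):
    # Pretend the target is likelier when the context already matches it.
    if not context:
        return math.log(0.4)
    return math.log(0.8) if context[-1] == target else math.log(0.2)

peers, agg, w = pst_round("Solve 2x + 1 = 5", [first, second, final], toy_scorer)
print(agg, w)
```

Here the response that disagrees with the aggregate receives the larger weight (about 0.67 versus 0.33). In a real training loop each weight would presumably scale that response's fine-tuning loss toward the aggregate, though the abstract does not spell out the objective.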
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below and indicate the revisions we will make to strengthen the presentation of our results.
Point-by-point responses
- Referee: [Abstract] The central claim of 2.2–4.3 pp accuracy improvement is presented without baseline comparisons (e.g., majority voting, standard self-training, or unweighted aggregation), statistical significance tests, or variance across multiple runs, making it impossible to determine whether the gains exceed generic self-training effects.
  Authors: We acknowledge that the abstract would be clearer with explicit references to baselines and statistical details. The reported gains are measured relative to the base models, and the full experimental section evaluates PST against standard self-training and unweighted aggregation. To address the concern, we will revise the abstract to briefly note these comparisons, indicate that gains are consistent across runs with low variance, and reference the statistical significance tests provided in the main text and appendix.
  Revision: yes
- Referee: [Abstract] The key assumption that the cross-model aggregate is 'often more reliable than individual responses in practice' is stated but not supported by any direct quantitative comparison of aggregate accuracy versus per-model accuracy on the reported benchmarks; this measurement is load-bearing for the quality of the training signal.
  Authors: We agree that a direct quantitative comparison on the reported benchmarks would better substantiate the assumption. Although the manuscript discusses aggregate reliability in the method section, we will add a supporting table or analysis in the revised version that reports aggregate accuracy versus individual model accuracies on SimulEq, Math500, and MultiArith.
  Revision: yes
- Referee: [Abstract] No ablation is reported that removes or replaces the PMI-based scaling, so it remains unclear whether the weighting mechanism itself drives the observed reductions in GV-Gap or whether simpler aggregation would suffice.
  Authors: We recognize the importance of isolating the contribution of the PMI scaling. We will include a new ablation experiment in the revised manuscript that compares the full PST method against a variant using unweighted aggregation, to demonstrate the specific role of PMI-based scaling in reducing the generator-verifier gap.
  Revision: yes
Circularity Check
No significant circularity; method uses externally generated cross-model signals evaluated on labeled benchmarks.
Full rationale
The PST method constructs training targets from cross-model aggregated responses generated sequentially by multiple independent models, with PMI computed directly from those generations rather than from any fitted parameters or self-referential loops. No equations or steps in the abstract reduce a claimed prediction to a fitted input by construction, nor do they rely on self-citations for uniqueness or ansatzes. Accuracy gains and GV-Gap reductions are measured against ground-truth labels on SimulEq, Math500, and MultiArith, providing an external benchmark independent of the training dynamics. This satisfies the default expectation of a self-contained derivation with no load-bearing reductions to inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The aggregated response from multiple models is more reliable than any individual response
- domain assumption: Pointwise mutual information between a response and the aggregate accurately measures how informative the response is for learning
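For reference, the standard definition of pointwise mutual information can be written for a peer response r_i and the aggregate a, both conditioned on the question q. The symbols and the conditioning on q are assumptions made here for illustration; the abstract does not state the paper's exact formulation.

```latex
\mathrm{PMI}(r_i; a \mid q)
  = \log \frac{p(a \mid r_i, q)}{p(a \mid q)}
  = \log \frac{p(r_i, a \mid q)}{p(r_i \mid q)\, p(a \mid q)}
```

A positive value means that seeing the response makes the aggregate more likely than it was from the question alone; a negative value means the response points away from it.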