pith. sign in

arxiv: 2605.30290 · v2 · pith:R6F2OBDFnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI· cs.CL

Self-Trained Verification for Training- and Test-Time Self-Improvement

Pith reviewed 2026-06-29 08:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords self-trained verificationverification-refinement loopsself-improvementreasoning modelsverifier trainingtest-time verificationtraining-time improvement
0
0 comments X

The pith

Training a verifier to imitate its reference-aware self improves verification-refinement loops on hard problems and raises standalone generator accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reasoning models can spot their own errors more reliably when shown a reference solution than when verifying alone. This observed asymmetry supplies the training signal needed to teach a verifier to catch mistakes without any reference at inference time. The resulting self-trained verifier then strengthens verification-refinement loops at test time and supplies feedback for a verifier-in-the-loop training procedure that further improves the generator, including in settings where the verifier is later removed.

Core claim

Self-trained verification converts the performance gap between reference-assisted and reference-free verification into a supervision target, so the verifier learns to imitate its stronger reference-informed behavior. When inserted into verification-refinement loops, the trained verifier raises accuracy on hard math problems by roughly a factor of two and lifts scientific reasoning accuracy from 1.5 percent to 21 percent. When the same verifier is used inside verifier-in-the-loop training, pass@1 improves an additional 33 percent and the generator's standalone accuracy climbs 30 percent past the point where standard reinforcement learning had plateaued.

What carries the argument

Self-trained verification (STV), the procedure that uses reference solutions only at training time to create imitation targets for the verifier so it can detect self-generated errors without references at test time.

If this is right

  • Verification-refinement loops using the self-trained verifier roughly double accuracy on hard math tasks.
  • The same loops raise scientific reasoning accuracy from 1.5 percent to 21 percent.
  • Verifier-in-the-loop training starting from an RL-converged generator adds a further 33 percent gain in pass@1.
  • After verifier-in-the-loop training the generator's standalone pass@1 improves 30 percent relative to the prior RL plateau even when no verifier is present at test time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same asymmetry-based training signal could be tested on other reasoning domains where reference solutions are available only during development.
  • The observed rise in standalone generator performance suggests that verification training may refine internal reasoning steps that persist after the verifier is removed.
  • Repeated cycles of self-trained verification followed by verifier-in-the-loop training could be examined to check whether gains compound across multiple rounds.

Load-bearing premise

The asymmetry between a model's ability to verify with versus without a reference solution provides a reliable and generalizable supervision signal that transfers to test-time verification without the reference.

What would settle it

Training an STV verifier on a fresh collection of hard math problems and measuring that verification-refinement loop accuracy stays no higher than the accuracy obtained from standard supervised fine-tuning or reinforcement learning on verifier scores would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.30290 by Aditi Raghunathan, Chen Henry Wu.

Figure 1
Figure 1. Figure 1: (a) We propose self-trained verification (STV), a scalable recipe for training verifiers that judge solutions and provide feedback for generators to refine. STV unlocks both self-improvement for test-time iterative refinement and training-time generator improvement. (b) STV scales much better with test-time compute than untrained or verdict-only-RL verifiers. (c) After standard RLVR converges, training the… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of self-trained verification. At a high level, STV trains the verifier to imitate a more informed version of itself – a teacher that is shown the reference solution and can therefore iden￾tify errors in yr−1 more reliably. We parameterize the student verifier as Vθ(· | x, yr−1) and the reference￾conditioned teacher as V ⋆ (· | x, yr−1, y⋆ (x)), (1) both of which use the same model under different … view at source ↗
Figure 3
Figure 3. Figure 3: Overview of our experiments. We use Qwen3-8B as the base model for both the generator and verifier in our main experiments unless specified. Given the asymmetry in output lengths, training a verifier is often more affordable than training a generator, so we consider two generators to simulate different training compute budgets: the base Qwen3-8B generator and a continual-trained generator trained with RLVR… view at source ↗
Figure 4
Figure 4. Figure 4: Pass@1 across verification rounds on the base generator. final round (5.5% on Hardest, 27.4% on Hard) outperforms the 4× larger Qwen3-32B without verification (2.7% on Hardest, 17.8% on Hard), which shows that a good verifier provides gains beyond what a much stronger generator can achieve. Appendix A gives qualitative examples where, on identical generations, the STV verifier rejects flawed solutions that… view at source ↗
Figure 5
Figure 5. Figure 5: Pass@1 across verification rounds for the continual-trained generator [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Training smaller verifiers to verify larger generators. 5.3 Weak-to-Strong Verification Given STV’s gains, can a smaller verifier match a larger one, which saves substantial test compute? [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: We train the generator to better incorporate verifier feedback, initialized with the already converged continual-trained generator, RLVR-only (converged). Surprisingly, such training shows a large benefit even at round 0 without any verification at test time. In contrast, spending the same additional compute on standard RLVR, i.e., RLVR-only (longer), has no gains. training with self-verification instead o… view at source ↗
Figure 8
Figure 8. Figure 8: Left: Pass@1 vs verifier score over verification rounds (reward hacking diagnostic). Right: precision-coverage frontier of test-time scaling. Both panels pool Hardest and Hard problems. Calibrated test-time scaling. We show that STV mitigates the miscalibrated verifier score problem. Without training, verifier scores increase over rounds while accuracy stagnates ( [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Isolating the contribution of trained feedback using ground-truth verdict. 5.6 How STV Shapes the Output Distribution Beyond why the gains exist, we look at how V-R reshapes the output distribution. Our claim is that, unlike Best-of-N (BoN) or resampling, V-R with STV improves the performance beyond sharpening the base distribution. This is because of the structured exploration enabled by actionable verifi… view at source ↗
Figure 10
Figure 10. Figure 10: Verification does not suppress diversity, but there is a sweet spot. (a) Pass@k generally improves in the first 10 rounds. (b) At higher k, pass@k eventually peaks at a certain but non-zero round; harder problems require more verification rounds to reach their peaks than easier ones. Verification does not suppress diversity. A natural concern is whether the gains of verification come at the cost of divers… view at source ↗
Figure 11
Figure 11. Figure 11: Refinement (V-R) vs resampling (BoN) at matched compute, for three generators considered in this paper. All panels pool Hardest and Hard problems. Refinement vs resampling. To test the sharpen-vs-reshape distinction, we compare refinement to BoN at matched compute. Given N independent samples, BoN selects one the verifier accepts (or a uniformly random one if none is accepted); r rounds of refinement matc… view at source ↗
read the original abstract

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V-R loop - a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar-forum.github.io/stv-webpage

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Self-Trained Verification (STV) to address the verifier bottleneck in self-improvement for reasoning models. By training the verifier to imitate its reference-informed version, STV enables improved verification-refinement (V-R) loops at test time and verifier-in-the-loop (ViL) training at training time. The paper claims STV doubles accuracy on hard math problems and increases it 14-fold on scientific reasoning tasks, with additional gains in generator performance even without the verifier at test time.

Significance. If the results hold and the trained verifier genuinely captures reasoning verification rather than answer matching, the approach could significantly advance methods for scaling self-improvement in large language models on hard reasoning tasks. The finding that the generator improves standalone after ViL is particularly interesting as it suggests indirect benefits to the base model. The work also ships concrete experimental comparisons to SFT, RL, and meta-verifier baselines.

major comments (2)
  1. [§3.2] §3.2 (Supervision target construction): The description of how the reference solution is used to generate the imitation target does not specify whether supervision is derived from step-by-step reasoning audits or from final-answer comparison alone; without this, it is impossible to rule out that STV trains answer-matching rather than verification, which directly threatens transfer to reference-free test-time use and the claimed V-R gains.
  2. [§4] §4 (Experiments): Reported accuracy lifts (2× on hard math, 14× on scientific reasoning) and ViL gains are presented without details on number of independent runs, statistical significance, controls for prompt sensitivity or data leakage, or ablations isolating reference use from other training choices; this makes it difficult to attribute improvements to STV rather than implementation artifacts.
minor comments (1)
  1. [Figure 3] Figure 3 and Table 2: Axis labels and legend entries use inconsistent abbreviations (e.g., “STV-V” vs. “STV verifier”) that should be unified for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and experimental reporting.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Supervision target construction): The description of how the reference solution is used to generate the imitation target does not specify whether supervision is derived from step-by-step reasoning audits or from final-answer comparison alone; without this, it is impossible to rule out that STV trains answer-matching rather than verification, which directly threatens transfer to reference-free test-time use and the claimed V-R gains.

    Authors: We agree the current description in §3.2 is insufficiently precise on this point. The target is constructed by using the reference solution to produce an informed verification signal that identifies discrepancies in the reasoning chain (when the reference permits step-level auditing) rather than final-answer equality alone. We will revise §3.2 with an explicit algorithmic description and example showing how the imitation target incorporates step-wise error detection where the reference enables it. This revision will directly address the concern about answer-matching versus verification. revision: yes

  2. Referee: [§4] §4 (Experiments): Reported accuracy lifts (2× on hard math, 14× on scientific reasoning) and ViL gains are presented without details on number of independent runs, statistical significance, controls for prompt sensitivity or data leakage, or ablations isolating reference use from other training choices; this makes it difficult to attribute improvements to STV rather than implementation artifacts.

    Authors: We acknowledge that the experimental section would benefit from additional rigor. In the revision we will report results across multiple independent runs with means and standard deviations, include statistical significance tests, add controls for prompt sensitivity and checks against data leakage, and provide ablations that isolate the contribution of reference-informed target construction from other training decisions. These changes will strengthen the evidence that gains are attributable to STV. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent external reference signal

full rationale

The paper trains the verifier by imitating its reference-informed counterpart using reference solutions as an external supervision target (abstract). This signal is independent of the test-time reference-free application. No equations, fitted parameters, or self-citations are presented that reduce any claimed prediction, uniqueness result, or central mechanism to the inputs by construction. The asymmetry observation is taken as given rather than derived from the method itself, and downstream claims (V-R gains, ViL) rest on empirical transfer rather than definitional equivalence. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The core method relies on the unstated domain assumption that reference solutions are available during training but not at test time.

axioms (1)
  • domain assumption Reference solutions provide a reliable supervision signal for training verifiers that generalizes to reference-free settings.
    Central to turning the asymmetry into a training target (abstract).

pith-pipeline@v0.9.1-grok · 5875 in / 1257 out tokens · 25020 ms · 2026-06-29T08:22:47.374795+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z

    URLhttps://api.semanticscholar.org/CorpusID:285787286. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bing-Li Wang, Bochao Wu, Bei Feng, Chengda Lu, Cheng...

  2. [2]

    Distilling the Knowledge in a Neural Network

    URLhttps://api.semanticscholar.org/CorpusID:259164722. Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.ArXiv, abs/1503.02531, 2015. URLhttps://api.semanticscholar.org/CorpusID:7200347. Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron C. Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training...

  3. [3]

    Prover-verifier games improve legibility of LLM outputs.arXiv preprint arXiv:2407.13692, 2024

    URLhttps://api.semanticscholar.org/CorpusID:279318695. Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, and Rui Zhang. When can llms actually correct their own mistakes? a critical survey of self-correction of llms.Transactions of the Association for Computational Linguistics, 12:1417–1440, 2024. URLhttps://api.semanticscholar.org/CorpusID:270218742. Jan He...

  4. [4]

    Deepseekmath-v2: Towards self-verifiable mathematical reasoning.arXiv preprint arXiv:2511.22570,

    URLhttps://api.semanticscholar.org/CorpusID:285050190. Zhihong Shao, Yu-Wei Luo, Chengda Lu, Zehui Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xi- aokang Zhang. Deepseekmath-v2: Towards self-verifiable mathematical reasoning.ArXiv, abs/2511.22570,

  5. [5]

    Self-Distillation Enables Continual Learning

    URLhttps://api.semanticscholar.org/CorpusID:283438764. Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learn- ing.ArXiv, abs/2601.19897, 2026. URLhttps://api.semanticscholar.org/CorpusID:285071839. Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion:...