pith. sign in

arxiv: 2605.17314 · v1 · pith:ZD3TXDFRnew · submitted 2026-05-17 · 💻 cs.CL · cs.AI· cs.LG

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

Pith reviewed 2026-05-20 14:27 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords weak-to-strong elicitationGRPOmath reasoningoff-policy RLLLM fine-tuningmismatched draftsMATH benchmarkAIME
0
0 comments X

The pith

Injecting mismatched wrong drafts from a smaller model into GRPO training improves a stronger model's math reasoning over standard on-policy methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether off-policy experience from a smaller weaker model can elicit capabilities in a stronger learner that on-policy RL fine-tuning like GRPO fails to reach. It demonstrates that injecting mathematically wrong drafts from a more domain-trained but smaller model, deliberately mismatched to the current problem, consistently raises accuracy on held-out math tests and out-of-distribution problems. The approach is lightweight: it trains on a single GPU with no SFT, reward models, or synthesized data and reaches the highest published score for the base model. A reader would care because the result points to a simple recipe for turning a weaker model's errors into useful training signals for a stronger one.

Core claim

The authors claim that injecting mathematically wrong drafts from a smaller but more domain-trained model mismatched to the current problem into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Using Mathstral-7B as the learner and Qwen2.5-Math-1.5B as the draft model on 8.8K Level 3-5 MATH problems, the mismatched-wrong variant leads all tested variants on MATH-500 and uniquely lifts pass@k on both AIME years at every sample budget, reaching 71.98 percent MATH-500 greedy pass@1.

What carries the argument

Mismatched wrong draft injection into the GRPO context, which supplies off-policy incorrect reasoning traces from the smaller model to guide the stronger model's policy updates.

If this is right

  • The mismatched-wrong recipe achieves 71.98 percent on MATH-500 for Mathstral-7B, exceeding prior heavier pipelines on the same model.
  • It raises pass@k on AIME 2025 and 2026 above both the base learner and the draft model at every sample budget from k=1 to k=1024.
  • Shuffling drafts to create mismatch produces a statistically significant gain over the matched-wrong variant on MATH-500.
  • All gains occur with the same prompt at test time and with no draft injection during evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests weak-to-strong elicitation can exploit error signals from specialized smaller models when the errors are deliberately mismatched.
  • Similar benefits might appear in non-math reasoning domains if a weaker model supplies domain-tuned but incorrect traces.
  • The approach raises the possibility that controlled mismatch and incorrectness could be a general lever in off-policy RL for language models.

Load-bearing premise

The performance lift comes specifically from the combination of the drafts being mathematically wrong and mismatched to the problem rather than from simply adding any extra context or from incidental differences in training dynamics.

What would settle it

Running the identical training setup but replacing mismatched wrong drafts with matched wrong drafts or with correct drafts and finding no remaining advantage on MATH-500 or AIME would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.17314 by Wei Deng.

Figure 1
Figure 1. Figure 1: On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant (ours, red) [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Prompt template. At training time, {draft} is the (mismatched, wrong) draft d − σ(x) . At evaluation time, {draft} is the literal string “N/A”. against an answer extracted via a prioritized fallback chain: \boxed{·} first, then natural-language patterns (“the answer is X”), inline math expressions ($ . . . $), and bare assignment lines (“var = VALUE”). Among the 32, we randomly sample a completion that is … view at source ↗
Figure 3
Figure 3. Figure 3: MATH-500 pass@k. Left: overall (500 problems). Right: Level 5 only (134 problems). 2-seed mean (s={42, 137}) for all 5 lines (Mathstral-7B base, no-draft GRPO, matched-wrong, mismatched-correct, mismatched-wrong). \boxed{}.”, all enclosed in [INST] . . . [/INST] tokens. Section 4.3 confirms that our performance gains over Mathstral-7B remain valid despite this formatting difference. 4.2 MATH-500 The mismat… view at source ↗
Figure 4
Figure 4. Figure 4: Per-problem AIME pass@1024 (one dot = one problem, 2-seed mean, slight diagonal [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: AIME 2024 pass@k (n=2048 samples per problem, mean across 2 seeds s ∈ {42, 137}). Base lags trained variants at low k but catches up at k=1024 to within ∼1pp of the leaders [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-problem AIME 2024 pass@1024 (one dot = one problem, 2-seed mean, slight diagonal jitter): y-axis = mismatched-wrong (ours, nodraft), x-axis = Mathstral-7B base ([INST]). Of 30 problems: 12 above the diagonal, 11 on (gap < 0.1pp; most are problems both models solve at ∼100%), 7 below. Green: capability-creation cases (base 0%, ours > 0%; 3 problems). Red: inverse (ours 0%, base > 0%; 4 problems). The as… view at source ↗
Figure 7
Figure 7. Figure 7: Training dynamics across the 2×2 + no-draft GRPO, 1 epoch. Left: completion entropy in nats. Right: completion length per step in tokens. Both panels show a rolling mean over a 20-step window for clarity. A few patterns are visible: (i) Correct+Matched collapses into a copying shortcut. (ii) Correct+Mismatched sits below no-draft GRPO on entropy, while both wrong-draft variants lie above it—suggesting that… view at source ↗
read the original abstract

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n=10$ seeds, $p=0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@$k$. On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@$k$ above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$ across 2 seeds ($+14.2$pp on 2025 and $+9.0$pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe -- trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop -- reaches 71.98% MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at 70.9% on full MATH (SFT + PPO with process/instruction reward models).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes eliciting stronger math reasoning in a 7B model (Mathstral) via GRPO by injecting mismatched mathematically wrong solution drafts generated by a smaller, more domain-specialized 1.5B model (Qwen2.5-Math). It reports that the mismatched-wrong variant outperforms standard on-policy GRPO as well as matched-wrong, mismatched-correct, and no-draft baselines on held-out MATH-500 (71.98% greedy pass@1) and OOD AIME 2025/2026 (lifts at all k up to 1024), with a shuffling experiment showing a statistically significant +1.62pp gain from mismatch (n=10, p=0.0015). The method requires no SFT, reward models, or synthesized data.

Significance. If the performance gains can be causally attributed to the combination of wrongness and mismatch rather than ancillary factors, the approach offers a lightweight, single-GPU recipe for improving RL fine-tuning on math tasks and contributes to weak-to-strong elicitation research. The consistent outperformance across greedy/sampling metrics, multiple seeds, and OOD sets, together with the explicit mismatch ablation, provides a falsifiable empirical foundation worth further investigation.

major comments (2)
  1. The central claim that gains arise specifically from injecting mathematically wrong and mismatched drafts is load-bearing, yet the variant comparisons do not explicitly control for differences in injected sequence length or token distribution. These factors could alter the group-relative advantage estimates inside Dr. GRPO independently of wrongness or mismatch; the shuffling experiment (+1.62pp over matched-wrong) and mismatched-correct comparison are helpful but insufficient without length-matched or distribution-matched controls.
  2. In the OOD AIME 2025/2026 results, the mismatched-wrong variant leads at pass@1024, but the manuscript does not report whether all training variants maintain identical total context lengths or prompt structures when drafts are injected. Without this, it remains possible that the observed lifts reflect changes in training dynamics rather than the proposed weak-to-strong mechanism.
minor comments (2)
  1. Ensure the full paper defines 'Dr. GRPO' and any modifications to standard GRPO on first use, including how drafts are formatted and inserted into the context.
  2. The abstract states the result surpasses WizardMath at 70.9% on full MATH; clarify whether the comparison is on identical test sets and model scales for precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for tighter controls on potential confounds. We address each point below and outline targeted revisions to strengthen the causal claims while preserving the core empirical findings.

read point-by-point responses
  1. Referee: The central claim that gains arise specifically from injecting mathematically wrong and mismatched drafts is load-bearing, yet the variant comparisons do not explicitly control for differences in injected sequence length or token distribution. These factors could alter the group-relative advantage estimates inside Dr. GRPO independently of wrongness or mismatch; the shuffling experiment (+1.62pp over matched-wrong) and mismatched-correct comparison are helpful but insufficient without length-matched or distribution-matched controls.

    Authors: We agree that explicit controls for sequence length and token distribution would further isolate the contributions of wrongness and mismatch. The shuffling experiment randomizes wrong drafts across problems while preserving the overall empirical distribution of lengths and token statistics between the matched-wrong and mismatched-wrong conditions. The mismatched-correct condition likewise uses drafts from the identical generator on the same problems, differing only in content correctness. To directly address the concern, we will add a length-matched ablation in the revision by truncating or padding drafts to equalize lengths across variants and re-report the key metrics; this will confirm whether the +1.62pp mismatch gain persists under length equalization. revision: partial

  2. Referee: In the OOD AIME 2025/2026 results, the mismatched-wrong variant leads at pass@1024, but the manuscript does not report whether all training variants maintain identical total context lengths or prompt structures when drafts are injected. Without this, it remains possible that the observed lifts reflect changes in training dynamics rather than the proposed weak-to-strong mechanism.

    Authors: All variants use the identical base prompt template and consistent draft insertion position within the context window. Any differences in total context length arise solely from the natural variation in draft lengths produced by the 1.5B model; we verified that no variant experienced differential truncation at the model’s maximum context limit. In the revision we will add a table reporting mean and standard deviation of training context lengths per variant, together with explicit confirmation that prompt structures remained fixed, thereby ruling out training-dynamic confounds as the source of the OOD lifts. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical comparisons with external benchmarks

full rationale

The paper reports direct experimental results from training Mathstral-7B with Dr. GRPO under different draft-injection variants (no-draft, matched-wrong, mismatched-wrong, mismatched-correct) and measures greedy pass@1 and sampling pass@k on held-out MATH-500 plus out-of-distribution AIME 2025/2026. All claims rest on these measured performance deltas and statistical tests (n=10 seeds, Welch's t, p=0.0015 for the shuffle ablation). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the reported method or results; the comparisons are falsifiable against the external benchmarks and do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The work is empirical and relies on the correctness of the GRPO algorithm and standard LLM training assumptions rather than new theoretical derivations.

free parameters (1)
  • GRPO training hyperparameters
    Learning rate, batch size, and other optimizer settings are chosen but not detailed in the abstract.
axioms (1)
  • domain assumption GRPO functions as an effective on-policy RL method for math reasoning when applied to the base model.
    The paper builds directly on GRPO without re-deriving or proving its properties.

pith-pipeline@v0.9.0 · 5982 in / 1393 out tokens · 57766 ms · 2026-05-20T14:27:10.505878+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 8 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948,

  2. [2]

    URLhttps://aclanthology.org/2023.emnlp-main.67/. 9 Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative refinement with self-feedba...

  3. [3]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

  4. [4]

    Burns, P

    URLhttps://openreview.net/forum?id=5PAF7PAY2Y. Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak- to-strong generalization: Eliciting strong capabilities with weak supervision.arXiv preprint arXiv:2312.09390,

  5. [5]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,

  6. [6]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290,

  7. [7]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL- Zoo: Investigating and taming zero reinforcement learning for open base models in the wild.arXiv preprint arXiv:2503.18892,

  8. [8]

    Jointly reinforcing diversity and quality in language model generations.arXiv preprint arXiv:2509.02534, 2025

    Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lan- chantin, and Tianlu Wang. Jointly reinforcing diversity and quality in language model generations. arXiv preprint arXiv:2509.02534,

  9. [9]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jian- hong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122,

  10. [10]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

  11. [11]

    A Experimental setup details Table 5 consolidates the settings underlying our experiments. 11 Learner (πS) Mathstral-7B-v0.1 [Mistral AI, 2024] Draft model (πW ) Qwen2.5-Math-1.5B [Yang et al., 2024] Training data∼8.8K MATH L3–5 (MATH-500 held out) [Hendrycks et al., 2021] Algorithm Dr. GRPO [Liu et al., 2025];β=0,G=16, grad accum4 LoRA rank16on all linea...

  12. [12]

    Conclusion: 244 C.2 AIME 2025 Problem 15 Problem.Let N denote the number of ordered triples of positive integers (a, b, c) such that a, b, c≤ 36 anda 3 +b 3 +c 3 is a multiple of3

  13. [13]

    Gold answer .735

    Find the remainder whenNis divided by1000. Gold answer .735. Raw correctness counts (n=2048samples per cell): Correct / 2048 Mismatched + Wrong (nodraft),s=422 Mismatched + Wrong (nodraft),s=1374 Mathstral-7B base ([INST]),s=420 Mathstral-7B base ([INST]),s=1370 Mathstral-7B never solves this problem ( 0/4096); Mismatched + Wrong solves it 6 times. The Mi...

  14. [14]

    C.3 AIME 2026 Problem 22 (inverse case: base>ours) Problem.A standard fair six-sided die is rolled repeatedly

    Thus, the number of ordered triples (a, b, c) such that a3 +b 3 +c 3 is a multiple of 37 is 735 . C.3 AIME 2026 Problem 22 (inverse case: base>ours) Problem.A standard fair six-sided die is rolled repeatedly. Each time the die reads 1 or 2, Alice gets a coin; each time it reads 3 or 4, Bob gets a coin; and each time it reads 5 or 6, Carol gets a coin. The...

  15. [15]

    Therefore, the probability of reachingC 2 without passing throughC 1 is 1 9 + 1 18 + 1 18 = 7 18

    The probabilities of these paths are 1 9 , 1 18 , and 1 18 , respectively. Therefore, the probability of reachingC 2 without passing throughC 1 is 1 9 + 1 18 + 1 18 = 7 18 . Finally, the solution calculates the probability of Alice and Bob each receiving at least two coins before Carol receives any coins. This is done by summing the probabilities of all p...

  16. [16]

    rigorous mostly wrong I-6 (2024) Mismatched + Wrong (ours) 2 1 1 0 I-6 (2024) No-draft GRPO 1 1 0 0 P8 (2026) Mismatched + Wrong (ours) 74 5 0 69 P8 (2026) No-draft GRPO 157 3 3 151 P19 (2026) Mismatched + Wrong (ours) 1 0 1 0 P19 (2026) No-draft GRPO 1 0 0 1 Table 10: Head-on rigor verdicts (judge: Gemini 3.1 Pro). We scoped this scan to three problems f...