pith. machine review for the scientific record.

arxiv: 2602.14868 · v2 · submitted 2026-02-16 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 21:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · reasoning models · data sampling · curriculum learning · sparse rewards · language models · task difficulty · GRPO

The pith

Selecting questions of intermediate difficulty for the student improves reasoning model performance under fixed compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a teacher-driven sampling strategy to address sparse rewards when using reinforcement learning to train language models on reasoning tasks. The teacher predicts question difficulty from the student's performance on seen examples and selects only those of suitable challenge level—neither trivial nor impossible. This selection runs alongside standard GRPO training on large math reasoning datasets. The approach yields higher final performance than unsampled GRPO training while using the same total compute. A sympathetic reader would care because it offers a practical way to make reinforcement learning more sample-efficient without requiring new algorithms or larger data collections.

Core claim

A teacher model that continuously estimates each question's difficulty for the evolving student and selects only Goldilocks-difficulty items produces better reasoning performance than standard GRPO training on the OpenMathReasoning dataset under identical compute budgets. The teacher adapts its selections dynamically by tracking the student's accuracy on previously encountered samples, thereby supplying informative training signals throughout the process.

What carries the argument

The Goldilocks data sampling strategy, in which the teacher uses observed student performance on seen questions to estimate and select tasks of intermediate difficulty for the current training step.
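The paper publishes no reference code, so the following is only a sketch of the selection step, assuming the teacher ranks a candidate pool by the predicted outcome variance √[p(1−p)] of the student's pass/fail rewards; every name and number below is hypothetical:

```python
import math

def goldilocks_score(p_success):
    """Standard deviation of a Bernoulli reward with success probability p.
    It peaks at p = 0.5 and vanishes as p approaches 0 or 1, encoding
    'neither too easy nor too hard'."""
    return math.sqrt(p_success * (1.0 - p_success))

def select_question(candidates, predict_success):
    """Teacher step (sketch): pick the candidate question whose predicted
    student success rate maximizes the Goldilocks score."""
    return max(candidates, key=lambda q: goldilocks_score(predict_success(q)))

# Hypothetical difficulty predictions for three candidate questions.
predicted = {"q_easy": 0.95, "q_mid": 0.55, "q_hard": 0.02}
chosen = select_question(list(predicted), predicted.get)  # -> "q_mid"
```

The design point the sketch captures is that selection maximizes outcome variance rather than minimizing or maximizing difficulty itself.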

If this is right

  • Models reach higher accuracy on mathematical reasoning tasks without any increase in training compute or data volume.
  • Training avoids wasting gradient steps on questions the student already solves perfectly or cannot solve at all.
  • The selection rule adapts automatically as the student's skill distribution shifts during training.
  • The method applies directly to existing large-scale reasoning datasets without requiring manual curriculum design.
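The "wasted gradient steps" point is easy to illustrate: with binary rewards and G rollouts per question, a group that is all-pass or all-fail has zero reward variance and hence zero GRPO advantage. A toy simulation, with an invented question pool and a variance-greedy stand-in for the teacher:

```python
import random

random.seed(0)

# Hypothetical pool: mostly trivial (p ~ 1) or impossible (p ~ 0) questions,
# with a minority at intermediate difficulty. All values are invented.
pool = [0.99] * 40 + [0.01] * 40 + [0.5] * 20
G = 8  # rollouts per selected question, as in GRPO

def zero_variance(p):
    """One training step: G Bernoulli rollouts; a group that is all-pass
    or all-fail carries no learning signal."""
    group = [random.random() < p for _ in range(G)]
    return all(group) or not any(group)

def wasted_fraction(sampler, steps=2000):
    """Fraction of steps whose gradient signal is entirely wasted."""
    return sum(zero_variance(sampler()) for _ in range(steps)) / steps

uniform = lambda: random.choice(pool)
# Variance-greedy stand-in for the teacher: among a small candidate
# batch, pick the question with the largest p * (1 - p).
goldilocks = lambda: max(random.sample(pool, 8), key=lambda p: p * (1 - p))

print(wasted_fraction(uniform), wasted_fraction(goldilocks))
```

On this synthetic pool the variance-greedy sampler wastes far fewer groups than uniform sampling, which is the mechanism the bullet describes.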

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same difficulty-matching principle could be tested on non-math reasoning domains such as code generation or scientific question answering.
  • Combining the sampling rule with other reinforcement learning variants might produce additive gains beyond the GRPO baseline.
  • If teacher predictions remain accurate at larger model scales, the technique could reduce the data volume needed to reach a target reasoning capability.

Load-bearing premise

The teacher can reliably predict how difficult any given question will be for the student model at each stage of training, based solely on the student's accuracy on a limited set of previously seen samples.
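The paper's teacher is a learned predictor over question text; as a deliberately simplified stand-in, a per-tag running estimate of the student's empirical success rate shows the kind of adaptation this premise requires. The class, tags, and constants below are invented for illustration:

```python
from collections import defaultdict

class RunningDifficulty:
    """Simplified stand-in for a learned teacher: an exponential moving
    average of the student's empirical success rate, keyed by a
    (hypothetical) question tag."""

    def __init__(self, alpha=0.2, prior=0.5):
        self.alpha = alpha
        self.estimate = defaultdict(lambda: prior)

    def update(self, tag, successes, rollouts):
        """Fold one rollout group's observed success rate into the estimate."""
        observed = successes / rollouts
        self.estimate[tag] += self.alpha * (observed - self.estimate[tag])

    def predict(self, tag):
        """Predicted success probability; unseen tags fall back to the prior."""
        return self.estimate[tag]

tracker = RunningDifficulty()
for successes in (1, 2, 2):      # three groups of 8 rollouts each
    tracker.update("algebra", successes, 8)
# The estimate drifts from the 0.5 prior toward the observed ~0.2 rate.
```

The premise is stronger than this sketch: the real teacher must generalize to unseen questions, not just track tags it has already observed.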

What would settle it

Train two models on OpenMathReasoning with identical GRPO hyperparameters and total steps, one using Goldilocks sampling and one using uniform random sampling. If the Goldilocks model shows no accuracy gain, or a loss, on held-out reasoning benchmarks, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2602.14868 by Aryo Lotfi, Emmanuel Abbe, Ilia Mahrooghi.

Figure 1. Overview of the Goldilocks framework. The training cycle proceeds as follows: (1) a set of K candidate questions is sampled randomly from the dataset; (2) the Teacher selects the optimal prompt from this candidate pool; (3) the Student generates G rollouts for the selected prompt; (4) the gradient is calculated based on GRPO advantages and accumulated for the Student update; (5) based on the empirical varia…

Figure 2. Evolution of validation accuracy over training steps.

Figure 3. Average training reward (success rate). The Goldilocks approach achieves higher training accuracy significantly earlier in training than the baseline. Panels: (a) training reward std.; (b) fraction of zero-variance questions (Goldilocks vs. Base).

Figure 4. Curriculum mechanism. (a) The Teacher actively selects samples with higher reward variance. (b) This results in far fewer "wasted" inputs where the gradient is zero.

Figure 5. Optimization dynamics. Goldilocks maintains larger gradient norms, preventing vanishing signals and providing a more robust optimization objective compared to the baseline.

Figure 6. Teacher mean absolute error (MAE) on unseen samples.

Figure 7. Evolution of Teacher predictions: the mean (μ) of the predicted Goldilocks score, with the shaded region representing the standard deviation (σ).

Figure 8. Validation accuracy over training steps.

Figure 9. Fraction of questions yielding zero reward variance.

Figure 10. Training progression for the Olmo2-1B model. The Goldilocks teacher identifies samples that keep the training reward near the optimal 0.5 threshold, minimizing redundancy and enabling a more efficient climb in evaluation accuracy.

Figure 11. Training dynamics for Qwen3-4B.

Figure 12. Training dynamics for Phi-4-mini-Instruct.
read the original abstract

Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, prior works have primarily targeted small datasets and do not directly transfer to the large-scale settings typical of modern LM training. Furthermore, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Goldilocks RL, a teacher-driven data sampling strategy for GRPO training of language models on reasoning tasks. A teacher model uses the student's accuracy on previously seen questions to assign difficulty labels to new items and selects only Goldilocks-difficulty questions (neither too easy nor too hard). The central empirical claim is that this adaptive sampling improves final performance on the OpenMathReasoning dataset relative to standard GRPO under identical compute budgets.

Significance. If the result holds after proper validation, the approach would provide a practical, model-adaptive curriculum for escaping sparse-reward regimes in RL-based reasoning training without modifying the underlying GRPO objective or requiring additional compute. This could meaningfully improve sample efficiency for mathematical reasoning in LMs.

major comments (3)
  1. [Abstract] The claim of performance improvement on OpenMathReasoning is stated without any quantitative metrics (e.g., accuracy deltas), baseline descriptions, ablation results, or even a high-level description of how difficulty is computed from seen-sample performance. This absence makes the central empirical claim impossible to assess.
  2. [Method] The teacher's difficulty prediction is described as continuously adapting from the student's accuracy on seen questions, yet no validation of prediction accuracy, no error analysis, and no examination of how label noise would affect the GRPO advantage estimator are provided. This mapping is load-bearing for the claim that the procedure escapes the sparse-reward regime rather than merely reordering data.
  3. [Experiments] No ablations against uniform random sampling, fixed-difficulty curricula, or oracle difficulty labels are reported, nor is there any analysis of how the sampling distribution changes over training or its effect on reward sparsity. Without these controls, it is unclear whether the reported improvement is attributable to the Goldilocks principle.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result and a one-sentence description of the difficulty metric.
  2. [Method] Notation for the teacher’s difficulty predictor and the GRPO advantage estimator should be introduced with explicit equations rather than prose descriptions.
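For context on the notation request: in the standard GRPO formulation (introduced with DeepSeekMath), the group-relative advantage of rollout i is the within-group z-score of its reward. Sketched here from the general literature, not from this paper's own (unreproduced) notation:

```latex
% GRPO advantage for rollout i among G rollouts of the same question q
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}
                     {\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)}
```

With binary rewards and per-question success rate p_q, the denominator's expectation is √[p_q(1−p_q)], which is exactly the quantity the Goldilocks teacher targets.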

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract, method details, and experiments require strengthening with quantitative metrics, validation, and additional controls. We will revise the manuscript accordingly, as detailed in the point-by-point responses below.

read point-by-point responses
  1. Referee: [Abstract] The claim of performance improvement on OpenMathReasoning is stated without any quantitative metrics (e.g., accuracy deltas), baseline descriptions, ablation results, or even a high-level description of how difficulty is computed from seen-sample performance. This absence makes the central empirical claim impossible to assess.

    Authors: We agree that the abstract should include more specifics to make the claims assessable. In the revised version, we will add quantitative metrics such as accuracy deltas over the standard GRPO baseline on OpenMathReasoning, explicitly describe the baseline, and provide a high-level description of difficulty computation (the teacher uses the student's accuracy on previously seen questions to label new items as easy, Goldilocks, or hard). revision: yes

  2. Referee: [Method] The teacher's difficulty prediction is described as continuously adapting from the student's accuracy on seen questions, yet no validation of prediction accuracy, no error analysis, and no examination of how label noise would affect the GRPO advantage estimator are provided. This mapping is load-bearing for the claim that the procedure escapes the sparse-reward regime rather than merely reordering data.

    Authors: We will expand the method section with validation of the teacher's predictions (e.g., correlation between predicted difficulty and actual student success rates on held-out items) and an error analysis. We will also add discussion of how label noise could influence the GRPO advantage estimator, while arguing that the adaptive selection still meaningfully reduces reward sparsity by focusing on questions with intermediate success probabilities rather than simply reordering the data. revision: yes

  3. Referee: [Experiments] No ablations against uniform random sampling, fixed-difficulty curricula, or oracle difficulty labels are reported, nor is there any analysis of how the sampling distribution changes over training or its effect on reward sparsity. Without these controls, it is unclear whether the reported improvement is attributable to the Goldilocks principle.

    Authors: Standard GRPO training uses uniform sampling, providing the random baseline. We will add ablations for fixed-difficulty curricula and include analysis of the evolving sampling distribution along with its effect on the fraction of non-zero rewards. Oracle difficulty labels cannot be provided, as no ground-truth per-question difficulties exist for the student model; we will instead discuss this limitation and why the adaptive approach serves as a practical proxy. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical sampling claim stands independent of inputs

full rationale

The paper describes Goldilocks as an external teacher-driven sampling procedure that selects questions based on the student's observed performance on seen items to target intermediate difficulty for GRPO training. No equations, fitted parameters, or self-citations are shown that would reduce the claimed performance gain to a tautology, a renamed fit, or a load-bearing self-reference. The central result—an improvement on OpenMathReasoning under fixed compute—is presented as an empirical outcome of the sampling strategy rather than a quantity defined in terms of itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that a teacher can accurately forecast per-question difficulty for the student from limited performance data; no free parameters or new entities are named in the abstract.

axioms (1)
  • domain assumption A teacher model can predict question difficulty for the student from performance on seen samples
    This is the load-bearing premise that enables the Goldilocks selection rule.

pith-pipeline@v0.9.0 · 5488 in / 1172 out tokens · 21195 ms · 2026-05-15T21:36:34.802400+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    "the gradient norm scales linearly with √[p_q(1-p_q)]. This implies that the learning signal is maximized for questions with high outcome variance (i.e., where p_q ≈ 0.5). ... samples where the model is already certain (success rates nearing 0 or 1) yield smaller gradient magnitudes"

  • IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean · SatisfiesLawsOfLogic · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    "y_q = √[p̂_q(1-p̂_q)] ... Teacher continuously aligns its predictions with the Student's evolving capabilities"
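The scaling claim in the first echoed passage is easy to check numerically: for pass/fail rewards, the per-question reward standard deviation is √[p(1−p)], which peaks at p = 0.5. A quick sketch (function name invented here):

```python
import math

def reward_std(p):
    """Standard deviation of a Bernoulli (pass/fail) reward with success
    probability p; the echoed passage says the GRPO gradient norm scales
    with this quantity."""
    return math.sqrt(p * (1 - p))

for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"p = {p:.2f}  ->  signal {reward_std(p):.3f}")
# The signal is maximal at p = 0.5 and nearly vanishes at p = 0.01 or 0.99.
```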

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


    Asynchronous Updates:The teacher aggregates this feedback into its replay buffer. Once the number of received samples reaches, the teacher triggers its own optimization step to refine the utility predictorf ϕ. This design allows for flexible scaling, as the heavy computation of the student (generating rollouts) is decoupled from the logic of the teacher’s...