
arxiv: 2504.13818 · v5 · submitted 2025-04-18 · 💻 cs.LG · cs.AI · cs.CL

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Pith reviewed 2026-05-22 18:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords LLM Reinforcement Learning · Rollout Down-Sampling · Policy Optimization · Reward Variance · GRPO · Reasoning Benchmarks · Compute Efficiency

The pith

Max-variance selection of rollouts lets GRPO match full-set peak accuracy at least 1.7 times faster on LLM reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the compute asymmetry in reinforcement learning with verifiable rewards (RLVR) for large language models, where generating rollouts is cheap and parallel but policy updates are heavy on memory and communication. It introduces PODS (Policy Optimization with Down-Sampling), which breaks this coupling by running policy updates only on a down-sampled subset of rollouts. The subset is chosen by a max-variance criterion that keeps reward diversity high and is implemented in O(n log n) time. Experiments show that GRPO equipped with PODS reaches the same peak test accuracy as training on every rollout, at least 1.7 times faster across the benchmarks and hardware configurations tested. The approach aims to preserve learning quality by ensuring the selected samples still supply a representative signal.

Core claim

PODS decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts chosen via max-variance down-sampling that maximizes reward diversity, with an efficient O(n log n) implementation; Group Relative Policy Optimization using PODS reaches the peak test accuracy of vanilla GRPO at least 1.7 times faster across reasoning benchmarks and hardware configurations.

What carries the argument

Max-variance down-sampling, which selects the rollout subset that maximizes reward variance to preserve diversity in the learning signal while reducing update costs.
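To make the selection step concrete, here is a minimal Python sketch of max-variance down-sampling. It is not taken from the paper: the function name `max_variance_subset` is hypothetical, and the sketch assumes that a variance-maximizing subset of size m can always be formed from the k highest and m - k lowest rewards for some k, a structural property this summary does not spell out. Under that assumption, one sort plus prefix sums gives the O(n log n) cost quoted above.

```python
from typing import List, Sequence


def max_variance_subset(rewards: Sequence[float], m: int) -> List[int]:
    """Return indices of m rollouts whose rewards have maximal variance.

    Assumption (not stated in the summary): the maximizer consists of the
    k highest and (m - k) lowest rewards for some k, so only m + 1
    candidate subsets need to be scored after a single sort.
    """
    n = len(rewards)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(rewards)")

    # Sort indices by reward, ascending: O(n log n).
    order = sorted(range(n), key=lambda i: rewards[i])
    r = [rewards[i] for i in order]

    # Prefix sums of rewards and squared rewards for O(1) range queries.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(r):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    best_k, best_var = 0, -1.0
    for k in range(m + 1):  # k = number of rollouts taken from the high end
        low = m - k         # number of rollouts taken from the low end
        total = s1[low] + (s1[n] - s1[n - k])
        total_sq = s2[low] + (s2[n] - s2[n - k])
        var = total_sq / m - (total / m) ** 2  # population variance
        if var > best_var:
            best_k, best_var = k, var

    low = m - best_k
    return order[:low] + order[n - best_k:]
```

For a group with rewards [0, 0, 0, 1, 1, 1, 0, 0] and m = 4, this sketch keeps two rollouts with reward 0 and two with reward 1, the most balanced split available.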

If this is right

  • Policy updates become far cheaper in memory and communication while peak accuracy stays the same.
  • The O(n log n) selection procedure scales to large batches of rollouts without becoming a bottleneck.
  • The speed-up holds across multiple reasoning benchmarks and different hardware setups.
  • Rollout generation can safely run in larger parallel volumes since only a filtered portion reaches the update stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same variance-maximizing filter could be tested on other policy-gradient methods besides GRPO.
  • Lower per-update cost might allow models to consume more total rollouts within fixed hardware budgets.
  • Variance-based selection might transfer to other settings where data generation is parallel but model updates are expensive.

Load-bearing premise

Selecting a subset by maximizing reward variance supplies an unbiased and sufficient learning signal that preserves convergence behavior equivalent to training on the full set of rollouts.

What would settle it

A controlled run on a standard reasoning benchmark where GRPO with PODS never reaches the same peak test accuracy as full-rollout GRPO, even after extra training steps.

Figures

Figures reproduced from arXiv: 2504.13818 by Fei Fang, J. Zico Kolter, Yash Savani, Yixuan Even Xu.

Figure 1: Inference scales efficiently while policy updates become memory-bound in RLVR. Empirical timing breakdown when fine-tuning Qwen2.5-3B-Instruct on GSM8K using 8 A100-80GB GPUs with varying rollouts per GPU. Top: Total wall-clock time per iteration. Policy updates hit memory limits after 32 rollouts per GPU (OOM beyond this point), requiring gradient accumulation that dramatically slows training. Bottom: Per…
Figure 2: Visualization of three training strategies: vanilla GRPO, GRPO with gradient accumulation
Figure 3: Performance and per-step run time comparison of standard GRPO and GRPO-PODS
Figure 4: Performance and per-step run time comparison of GRPO-PODS with max-variance down
Figure 5: Performance and per-step run time comparison of GRPO-PODS with the max-variance,
Figure 6: Average completion length over time of the trained models in Section 4.1’s experiments.
Figure 7: Average completion length over time of the trained models in Section 4.2’s experiments.
Figure 8: Average completion length over time of the trained models in Section A.3’s experiments.

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion, max-variance down-sampling, that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PODS (Policy Optimization with Down-Sampling) to address compute asymmetry in RLVR for LLMs. It decouples rollout generation from policy updates by training GRPO only on a max-variance selected subset of rollouts, with an O(n log n) implementation, and claims this reaches the same peak test accuracy as full-rollout GRPO at least 1.7× faster across reasoning benchmarks and hardware.

Significance. If the max-variance down-sampling preserves unbiased advantage estimates and equivalent convergence, the method could meaningfully reduce memory and communication costs during policy updates while retaining rollout parallelism. The empirical 1.7× speedup claim, if supported by proper controls, would be a practical contribution to efficient RL for reasoning models.

major comments (2)
  1. [§3] §3 (PODS method): The max-variance subset selection occurs after per-group reward computation and deterministically retains or discards trajectories based on their contribution to reward variance. No derivation shows that the resulting change to the per-group mean and standard deviation leaves the GRPO advantage estimates unbiased or preserves the expectation of the policy gradient.
  2. [§4] §4 (experiments): The central claim of achieving peak accuracy at least 1.7× faster rests on reported benchmark results, yet the manuscript provides no details on the number of independent runs, statistical significance testing, ablation against random or uniform down-sampling, or controls for selection bias in the retained rollout distribution.
minor comments (1)
  1. [§3] The O(n log n) implementation of max-variance selection is stated but the sorting or priority-queue steps are not shown; adding a short pseudocode block would improve reproducibility.
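The concern in major comment 1 can be illustrated with a small Python sketch. It assumes the usual group-normalized GRPO advantage, (reward minus group mean) divided by group standard deviation, and assumes that advantages are recomputed over the retained rollouts only. The reward values, the retained indices, and the helper name `group_normalized_advantages` are hypothetical, chosen purely to show how the normalization statistics shift.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: (r - mean) / (std + eps), using population std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 8 rollouts with binary rewards (2 correct, 6 incorrect).
full_rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
# Hypothetical retained subset: two lowest-reward and two highest-reward rollouts.
kept = [0, 1, 6, 7]

adv_full = group_normalized_advantages(full_rewards)
adv_subset = group_normalized_advantages([full_rewards[i] for i in kept])

for j, i in enumerate(kept):
    print(f"rollout {i}: full-group advantage {adv_full[i]:+.3f}, "
          f"subset advantage {adv_subset[j]:+.3f}")
```

In this toy group the correct rollouts move from an advantage of roughly +1.73 under full-group normalization to +1.00 under subset normalization, and the incorrect ones from roughly -0.58 to -1.00. That shift is the quantity the referee asks the authors to characterize.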

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important areas for strengthening the theoretical grounding and empirical validation of PODS. We address each major comment below and commit to a revised manuscript that incorporates the requested clarifications and additions.

  1. Referee: [§3] §3 (PODS method): The max-variance subset selection occurs after per-group reward computation and deterministically retains or discards trajectories based on their contribution to reward variance. No derivation shows that the resulting change to the per-group mean and standard deviation leaves the GRPO advantage estimates unbiased or preserves the expectation of the policy gradient.

    Authors: We agree that a formal derivation is absent from the current manuscript. In the revision we will add a new subsection to §3 that derives the effect of max-variance selection on the per-group mean and standard deviation. The derivation shows that, because trajectories are retained precisely according to their marginal contribution to reward variance, the change in the normalized advantage is zero in expectation for the selected subset; discarded trajectories contribute zero to the variance term and therefore do not alter the expectation of the policy gradient under the GRPO objective. We will also state the assumptions under which this equivalence holds and discuss the magnitude of any residual bias. revision: yes

  2. Referee: [§4] §4 (experiments): The central claim of achieving peak accuracy at least 1.7× faster rests on reported benchmark results, yet the manuscript provides no details on the number of independent runs, statistical significance testing, ablation against random or uniform down-sampling, or controls for selection bias in the retained rollout distribution.

    Authors: We acknowledge these omissions. The revised §4 and appendix will report results aggregated over five independent runs with distinct random seeds, including mean and standard deviation for both wall-clock time to peak accuracy and final test accuracy. We will add paired t-tests (or Wilcoxon signed-rank tests where appropriate) to assess statistical significance of the 1.7× speedup. New ablation tables will compare max-variance down-sampling against random and uniform down-sampling at identical retention ratios. Finally, we will include distribution plots and quantitative metrics (e.g., reward histograms and KL divergence between retained and full rollout distributions) to quantify and control for selection bias. revision: yes
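As a rough illustration of why the random down-sampling ablation matters, the Python sketch below estimates how often a uniformly random subset ends up with zero reward variance, in which case every group-normalized advantage in that subset is zero. The reward vector, subset size, and helper names are hypothetical and do not come from the paper's experiments.

```python
import random
from statistics import pvariance
from typing import List, Sequence


def random_subset(n: int, m: int, rng: random.Random) -> List[int]:
    """Uniformly sample m of n rollout indices without replacement."""
    return rng.sample(range(n), m)


def fraction_zero_variance(rewards: Sequence[float], m: int,
                           trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P(random size-m subset has zero reward variance)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        idx = random_subset(len(rewards), m, rng)
        if pvariance([rewards[i] for i in idx]) == 0:
            hits += 1
    return hits / trials


# Hypothetical sparse-reward group: 1 correct rollout out of 16.
rewards = [1.0] + [0.0] * 15
estimate = fraction_zero_variance(rewards, m=4, trials=10_000)
print(f"estimated fraction of zero-variance random subsets: {estimate:.3f}")
```

For this toy group the exact probability is C(15,4)/C(16,4) = 0.75, so about three in four random subsets of size 4 would miss the single correct rollout. A selection rule that maximizes reward variance would keep that rollout whenever m is at least 2, which is the contrast the proposed ablation is meant to measure.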

Circularity Check

0 steps flagged

No circularity: empirical heuristic with external validation


The paper introduces PODS as a practical down-sampling heuristic (max-variance selection) for GRPO rollouts and validates it solely through wall-clock speedup experiments on reasoning benchmarks. No first-principles derivation is offered that equates the filtered policy gradient to the full-set gradient by construction, nor is any parameter fitted to the target accuracy metric and then re-used as a 'prediction.' The selection rule is defined directly from per-group reward statistics without reference to final test performance, and the 1.7× claim rests on measured runtimes rather than any self-referential quantity. No self-citations appear in the provided text as load-bearing support for uniqueness or unbiasedness. The work is therefore self-contained as an engineering contribution whose correctness is externally falsifiable on the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that variance-maximizing subsets preserve policy gradient quality; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: A subset of rollouts selected to maximize reward variance supplies a learning signal equivalent to the full set for policy optimization.
    This premise is required for the claim that learning quality is maintained while reducing update costs.



Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...

  2. CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

  3. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  4. Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

    cs.AI 2026-02 unverdicted novelty 7.0

    GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than b...

  5. MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation

    cs.LG 2025-11 unverdicted novelty 7.0

    MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.

  6. Cost-Aware Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.

  7. Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

    cs.LG 2025-09 unverdicted novelty 6.0

    PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.

  8. TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning

    eess.SP 2026-04 unverdicted novelty 5.0

    TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.

  9. PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

    cs.LG 2026-04 unverdicted novelty 5.0

    PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.

  10. Your Model Diversity, Not Method, Determines Reasoning Strategy

    cs.AI 2026-04 unverdicted novelty 5.0

    The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.

  11. Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

    cs.LG 2025-09 conditional novelty 5.0

    The paper identifies confounds in RLVR evaluations that inflate apparent gains and proposes a minimum standard for budget-matched, contamination-aware assessment with calibration tracking.

  12. Learning to Reason at the Frontier of Learnability

    cs.LG 2025-02 unverdicted novelty 4.0

    A curriculum sampling questions with high variance in success rate improves reinforcement learning performance for LLM reasoning tasks.
