Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning

Changwen Zheng; Huijie Guo; Hui Xiong; Jiahuan Zhou; Jingyao Wang; Peizheng Guo; Wenwen Qiang

REVIEW 2 major objections 2 minor 19 references

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.


T0 review · grok-4.3

Treating multiple reasoning paths for one question as counterfactual experiments trains LLMs to favor stable and transferable reasoning patterns over lucky guesses.

2026-05-16 07:18 UTC pith:GY43OMZ3

Load-bearing objection: The paper's Group Causal Counterfactual Policy Optimization mostly repackages diversity regularization as causal robustness without a supporting structural causal model.

arxiv 2602.06475 v2 pith:GY43OMZ3 submitted 2026-02-06 cs.LG


Jingyao Wang, Peizheng Guo, Wenwen Qiang, Jiahuan Zhou, Huijie Guo, Changwen Zheng, Hui Xiong

classification cs.LG

keywords LLM reasoning · policy optimization · counterfactual rewards · generalization · reinforcement learning · causal inference · robustness · token-level advantages

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard rewards for LLM reasoning tie too closely to final answer correctness and ignore whether the underlying steps are sound or generalizable. It reframes different candidate trajectories for a fixed question as a set of counterfactual experiments, then defines an episodic reward that scores both robustness (answer distributions stay stable when the step is perturbed) and effectiveness (the step allows enough variation to apply to new questions). Policy optimization then uses token-level advantages derived from this reward to up-weight reasoning steps that meet both criteria. A sympathetic reader would expect this to produce reasoning that transfers across questions instead of overfitting to surface patterns or succeeding by chance.

Core claim

From a causal perspective, multi-candidate reasoning trajectories for a fixed question form a family of counterfactual experiments. Building on this, Group Causal Counterfactual Policy Optimization defines an episodic causal counterfactual reward that jointly measures robustness (stability of the induced answer distribution under perturbations) and effectiveness (sufficient variability for cross-question transfer). Token-level advantages constructed from the reward are used to optimize the policy, so the model learns to prefer reasoning patterns that are process-valid and counterfactually robust.
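To make the last step concrete, the following is a minimal sketch of how per-episode rewards could be turned into token-level advantages. It assumes a GRPO-style standardization over the supplied episodes and a simple broadcast of each episode's value to its tokens; the paper's actual estimator is not shown in this review, and the names `group_normalize` and `token_advantages` are hypothetical.

```python
from statistics import mean, pstdev

def group_normalize(rewards, eps=1e-8):
    """Standardize rewards across the supplied episodes (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward is identical
    return [(r - mu) / (sigma + eps) for r in rewards]

def token_advantages(episode_rewards, episode_lengths):
    """Give every token the normalized reward of the episode it belongs to.

    episode_rewards: one hypothetical episodic reward per reasoning episode
    episode_lengths: number of tokens in each episode, in the same order
    """
    if len(episode_rewards) != len(episode_lengths):
        raise ValueError("need exactly one reward per episode")
    advantages = []
    for adv, n_tokens in zip(group_normalize(episode_rewards), episode_lengths):
        advantages.extend([adv] * n_tokens)
    return advantages

# Four episodes with 2, 1, 1, and 3 tokens respectively
advs = token_advantages([1.0, 0.0, 1.0, 0.0], [2, 1, 1, 3])
```

Under this sketch, tokens in a high-reward episode share one positive advantage and tokens in a low-reward episode share one negative advantage, which is the sense in which credit is assigned to reasoning steps rather than to the final answer alone.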

What carries the argument

The episodic causal counterfactual reward, which scores a reasoning step by joint robustness (stability of answer distribution under counterfactual perturbations) and effectiveness (variability sufficient for transfer).

Load-bearing premise

Multi-candidate reasoning trajectories for the same question can be treated as a family of counterfactual experiments with enough theoretical support to justify the robustness and effectiveness reward.
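As an illustration of what this premise asks for, here is a toy sketch of an intervention on a reasoning step. It assumes a three-variable model (latent factor, reasoning step, answer); every mechanism, probability, and name below is invented for illustration and none of it comes from the paper.

```python
import random

def sample_answer(gold, rng, do_step=None):
    """Toy causal model: latent factor U -> reasoning step S -> answer A.

    All mechanisms and probabilities are hypothetical.
    """
    u = rng.random()                               # exogenous latent factor
    step = "valid" if u < 0.7 else "shortcut"      # natural mechanism for S
    if do_step is not None:
        step = do_step                             # do(S = do_step): override the mechanism
    if step == "valid":
        return gold                                # valid reasoning always reaches the gold answer
    return gold if rng.random() < 0.3 else "wrong"  # a shortcut is right only by luck

def correct_rate(gold, n, seed, do_step=None):
    """Fraction of correct answers, observationally or under an intervention."""
    rng = random.Random(seed)
    hits = sum(sample_answer(gold, rng, do_step) == gold for _ in range(n))
    return hits / n

observed = correct_rate("42", 2000, seed=0)
forced_valid = correct_rate("42", 2000, seed=0, do_step="valid")
forced_shortcut = correct_rate("42", 2000, seed=0, do_step="shortcut")
```

The point of the sketch is the gap between sampling from the policy (the `observed` rate) and fixing the step by intervention (the `forced_*` rates). Treating sampled trajectories as counterfactual experiments amounts to claiming that the former can stand in for the latter.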

What would settle it

A controlled test in which models trained with the method show no improvement (or a drop) in accuracy on out-of-distribution questions that share the same underlying reasoning structure but differ in wording or surface features.


If this is right

Token-level policy updates will increase the probability of reasoning steps whose answer distributions remain stable when the step is altered.
Learned reasoning strategies will transfer to new questions because the reward explicitly requires sufficient variability in the induced answers.
Trajectories with sound intermediate logic but incorrect final answers will receive higher credit than trajectories with flawed logic but correct guesses.
Benchmark performance will improve on tasks where generalization depends on process-level robustness rather than memorization of answer patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same counterfactual framing could be applied to code generation or mathematical proof steps where multiple valid paths exist for a single problem.
If the robustness component dominates, models may become more resistant to prompt variations that currently trigger inconsistent answers.
Extending the method to multi-turn dialogues would require defining counterfactual perturbations across conversation turns rather than single questions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The paper's Group Causal Counterfactual Policy Optimization mostly repackages diversity regularization as causal robustness without a supporting structural causal model.


The main point is that this work introduces an episodic reward for LLM reasoning that scores trajectories on robustness (stable answers under perturbations) and effectiveness (enough variability for transfer), then uses those scores to shape token-level advantages during policy optimization. It targets the common problem that rewards tied to final-answer correctness favor lucky guesses over sound process. That focus on reasoning patterns rather than endpoints is a reasonable direction, and the abstract claims gains across diverse benchmarks, which suggests the method produces measurable improvements in practice even if the theory is light. The experiments appear to test the approach on multiple tasks, so the practical payoff could be real for people tuning reasoning models.

The soft spot sits in the causal framing. Treating multiple candidate trajectories for one question as counterfactual experiments requires an explicit structural causal model to define the intervened variables and map perturbations to potential outcomes. Without that, the robustness term collapses to low outcome variance across samples and the effectiveness term to sufficient entropy. Both effects are already available from standard diversity-regularized RL, so the causal label does not yet deliver extra guarantees. The abstract mentions theoretical support but does not show the derivations or SCM, which leaves the central claim under-supported.

This paper is aimed at researchers working on reward design for LLM reasoning and alignment. A reader interested in causal methods in ML will find the justification thin and will want to see ablations against non-causal baselines. It deserves peer review because the empirical results can be checked directly and the process-focused reward idea is worth testing, even if the causal apparatus needs tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes Group Causal Counterfactual Policy Optimization (GC²PO) to train LLMs for generalizable reasoning. It interprets multiple reasoning trajectories for a fixed question as a family of counterfactual experiments, defines an episodic causal counterfactual reward that jointly enforces robustness (stability of the induced answer distribution under perturbations) and effectiveness (sufficient variability for cross-question transfer), constructs token-level advantages from this reward, and optimizes the policy to favor process-valid, counterfactually robust patterns. Experiments on diverse benchmarks are reported to demonstrate advantages over existing methods.

Significance. If the causal framing is rigorously supported and the reward demonstrably yields generalization gains beyond standard diversity or outcome-based RL, the work could provide a principled mechanism for rewarding reasoning processes rather than final answers alone. The approach addresses a recognized limitation in current LLM reward design, but its contribution hinges on whether the proposed terms deliver causal guarantees or merely replicate variance/entropy regularization.

major comments (2)

[Definition of episodic causal counterfactual reward (likely §3)] The central claim that multi-candidate reasoning trajectories for a fixed question form a family of counterfactual experiments requires an explicit structural causal model (SCM) that defines the intervened variables, the do-operator, and the correspondence between observed answer distributions and potential outcomes. No such SCM appears in the reward derivation; without it, the robustness term reduces to low outcome variance across samples from the same policy and the effectiveness term to entropy, both of which are already achieved by non-causal diversity-regularized RL baselines.
[Experiments and baselines] The paper must demonstrate that the proposed reward produces generalization improvements that cannot be obtained by standard entropy or variance regularization alone. Current experimental claims would be strengthened by an ablation that isolates the causal component (e.g., comparing against a non-causal version of the same robustness/effectiveness terms) and reports effect sizes on out-of-distribution transfer tasks.
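The non-causal quantities the first comment refers to can be sketched as follows. This is an illustrative stand-in, assuming answers are compared as plain strings: stability is measured as one minus the total variation distance between two empirical answer distributions, and variability as Shannon entropy. The helper names are hypothetical and these are not the paper's reward terms.

```python
import math
from collections import Counter

def answer_distribution(answers):
    """Empirical distribution over final answers from a group of samples."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}

def entropy(dist):
    """Shannon entropy in nats; a plain variability proxy."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def total_variation(p, q):
    """Total variation distance between two answer distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)

def stability(p, q):
    """1.0 when the two distributions match, 0.0 when they are disjoint."""
    return 1.0 - total_variation(p, q)

# Answers sampled before and after some perturbation of a reasoning step
before = answer_distribution(["4", "4", "5", "4"])
after = answer_distribution(["4", "5", "4", "5"])
proxy_robustness = stability(before, after)
proxy_effectiveness = entropy(before)
```

An ablation of the kind requested in the second comment would swap proxies like these in for the causal reward while holding the rest of training fixed.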

minor comments (2)

[Abstract and §3] The abstract states that the interpretation rests on 'theoretical supports' but supplies no equations or derivation steps; the main text should include the explicit reward formula, advantage estimator, and any assumptions on the policy distribution at the first mention of the reward.
[Notation and reward definition] Notation for the group-level counterfactual perturbation and the resulting answer distribution should be defined consistently before the reward is introduced to avoid ambiguity in later sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree with the need for greater rigor in the causal framing and for stronger empirical isolation of the proposed reward's contributions. We will revise the manuscript accordingly.


Referee: The central claim that multi-candidate reasoning trajectories for a fixed question form a family of counterfactual experiments requires an explicit structural causal model (SCM) that defines the intervened variables, the do-operator, and the correspondence between observed answer distributions and potential outcomes. No such SCM appears in the reward derivation; without it, the robustness term reduces to low outcome variance across samples from the same policy and the effectiveness term to entropy, both of which are already achieved by non-causal diversity-regularized RL baselines.

Authors: We acknowledge that the manuscript references theoretical supports for interpreting trajectories as counterfactual experiments but does not present an explicit SCM in the reward derivation. This is a valid criticism. In the revision we will add a dedicated subsection to §3 that formally defines the SCM: exogenous variables include the question and latent reasoning factors; endogenous variables are the token-level reasoning steps; interventions via the do-operator correspond to counterfactual perturbations of selected steps; and the observed answer distribution is mapped to potential outcomes under those interventions. The robustness term will then be derived as invariance of the potential outcome distribution under these specific interventions, distinguishing it from generic variance reduction. revision: yes
Referee: The paper must demonstrate that the proposed reward produces generalization improvements that cannot be obtained by standard entropy or variance regularization alone. Current experimental claims would be strengthened by an ablation that isolates the causal component (e.g., comparing against a non-causal version of the same robustness/effectiveness terms) and reports effect sizes on out-of-distribution transfer tasks.

Authors: We agree that an ablation isolating the causal component is required to substantiate the contribution. We will add new experiments that replace the causal counterfactual reward with a non-causal counterpart (variance of answer distribution for robustness and entropy of answer distribution for effectiveness) while keeping all other training elements identical. Results will be reported on the same benchmarks with explicit focus on out-of-distribution transfer tasks, including effect sizes, confidence intervals, and statistical tests to quantify the incremental gains attributable to the causal formulation. revision: yes
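One way the promised confidence intervals could be computed is a paired percentile bootstrap over per-item scores. The sketch below assumes each benchmark item is scored 0 or 1 under both the causal reward and its non-causal counterpart; the function name, defaults, and the example scores are hypothetical and are not results from the paper.

```python
import random
from statistics import mean

def paired_bootstrap_ci(method_scores, baseline_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (method - baseline)."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must be paired item by item")
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    # Resample items with replacement and record the mean difference each time
    boot_means = sorted(
        mean(diffs[rng.randrange(n)] for _ in range(n)) for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(diffs), lower, upper

# Made-up per-item correctness on eight out-of-distribution questions
causal = [1, 1, 1, 0, 1, 1, 0, 1]
non_causal = [1, 0, 1, 0, 0, 1, 0, 1]
point, lower, upper = paired_bootstrap_ci(causal, non_causal)
```

If the interval includes zero, the data do not separate the causal formulation from the non-causal one at the chosen level, which is the outcome the referee's second comment is probing for.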

Circularity Check

0 steps flagged

No significant circularity detected in the derivation


The paper's core proposal interprets multi-candidate reasoning trajectories for a fixed question as counterfactual experiments and defines an episodic reward combining robustness (stability of answer distribution under perturbations) and effectiveness (sufficient variability for transfer). No equations appear in the abstract or provided excerpts that reduce this reward by construction to a fitted parameter, self-defined quantity, or variance/entropy term already present in the inputs. No self-citations are invoked as load-bearing uniqueness theorems, and the causal framing is presented as an interpretive perspective rather than a mathematical reduction that forces the result. The derivation therefore remains self-contained with independent content in the proposed reward construction and token-level advantage optimization.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach rests on an unelaborated causal interpretation of reasoning trajectories and the definition of robustness and effectiveness rewards.

pith-pipeline@v0.9.0 · 5497 in / 1088 out tokens · 46880 ms · 2026-05-16T07:18:03.368093+00:00 · methodology


Original abstract

Large language models (LLMs) excel at complex tasks with advances in reasoning capabilities. However, existing reward mechanisms remain tightly coupled to final correctness and pay little attention to the underlying reasoning process: trajectories with sound reasoning but wrong answers receive low credit, while lucky guesses with flawed logic may be highly rewarded, affecting reasoning generalization. From a causal perspective, we interpret multi-candidate reasoning for a fixed question as a family of counterfactual experiments with theoretical supports. Building on this, we propose Group Causal Counterfactual Policy Optimization to explicitly train LLMs to learn generalizable reasoning patterns. It proposes an episodic causal counterfactual reward that jointly captures (i) robustness, encouraging the answer distribution induced by a reasoning step to remain stable under counterfactual perturbations; and (ii) effectiveness, enforcing sufficient variability so that the learned reasoning strategy can transfer across questions. We then construct token-level advantages from this reward and optimize the policy, encouraging LLMs to favor reasoning patterns that are process-valid and counterfactually robust. Extensive experiments on diverse benchmarks demonstrate its advantages.

Figures

Figures reproduced from arXiv: 2602.06475 by Changwen Zheng, Huijie Guo, Hui Xiong, Jiahuan Zhou, Jingyao Wang, Peizheng Guo, Wenwen Qiang.

**Figure 1.** (a) Can LLMs “learn by analogy”: reasoning trajectories on representative questions. (b) Four trajectory groups defined by process validity and final correctness (Subsection 2.2). (c) Empirical results of the motivation experiment. See Appendix D.4 for details.

**Figure 2.** (a) SCM of LLM reasoning. (b) Examples of causal factors and spurious cues. See Appendix D.5 for more details.

**Figure 3.** The framework of GC²PO. It segments reasoning into episodes (Left), calculates episodic causal counterfactual reward via stability and expressiveness terms (Middle), and optimizes the policy by building token-level advantages (Right).

**Figure 4.** Trade-off performance of different methods. [Plot residue from the extraction: (a) Gradient norm versus step, legend GVPO, DR.GRPO, GRPO, Ours; (b) Range of sampling ratio versus training steps, legend Sequence-level, Ours.]

**Figure 7.** Ablation Studies (See Appendix G.4 for full results).


Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 9 internal anchors

[1]

Training language models to reason efficiently. arXiv preprint arXiv:2502.04463,

Arora, D. and Zanette, A. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463,

[2]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

[3]

arXiv preprint arXiv:2508.10751

Chen, Z., Qin, X., Wu, Y., Ling, Y., Ye, Q., Zhao, W. X., and Shi, G. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751,

[4]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

[5]

Group causal policy optimization for post-training large language models. arXiv preprint arXiv:2508.05428,

Gu, Z., Wang, J., Zuo, R., Sun, C., Song, Z., Zheng, C., and Qiang, W. Group causal policy optimization for post-training large language models. arXiv preprint arXiv:2508.05428,

[6]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

[7]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

URL https://arxiv.org/abs/2310.06770. Khalifa, M., Agarwal, R., Logeswaran, L., Kim, J., Peng, H., Lee, M., Lee, H., and Wang, L. Process reward models that think. arXiv preprint arXiv:2504.16828,

[8]

Understanding R1-Zero-Like Training: A Critical Perspective

Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783,

[9]

Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint arXiv:2503.07572, 2025

Qu, Y., Yang, M. Y., Setlur, A., Tunstall, L., Beeching, E. E., Salakhutdinov, R., and Kumar, A. Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint arXiv:2503.07572,

[10]

URL https://arxiv.org/abs/2504.16027. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models,

[11]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URL https://arxiv.org/abs/2402.03300. Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test-time compute optimally can be more effective than scaling model parameters,

[12]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

URL https://arxiv.org/abs/2408.03314. Szepesvári, C. Algorithms for reinforcement learning. Springer nature,

[13]

GTPO and GRPO-S: Token and sequence-level reward shaping with policy entropy,

Tan, H., Pan, J., Lin, J., Chen, T., Zheng, Z., Tang, Z., and Yang, H. Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy. arXiv preprint arXiv:2508.04349,

[14]

arXiv preprint arXiv:2501.09620

Wang, C., Zhao, Z., Jiang, Y., Chen, Z., Zhu, C., Chen, Y., Liu, J., Zhang, L., Fan, X., Ma, H., et al. Beyond reward hacking: Causal rewards for large language model alignment. arXiv preprint arXiv:2501.09620, 2025a. Wang, J., Qiang, W., Song, Z., Zheng, C., and Xiong, H. Learning to think: Information-theoretic reinforcement fine-tuning for llms. arXiv ...

[15]

Think when you need: Self-adaptive chain-of-thought learning. arXiv preprint arXiv:2504.03234, 2025b

URL https://arxiv.org/abs/2504.03234. Ye, C., Yu, Z., Zhang, Z., Chen, H., Sadagopan, N., Huang, J., Zhang, T., and Beniwal, A. Beyond correctness: Harmonizing process and outcome rewards through rl training. arXiv preprint arXiv:2509.03403, 2025a. Ye, G., Pham, K. D., Zhang, X., Gopi, S., Peng, B., Li, B., Kulkarni, J., and Inan, H. A. On the emergen...

[16]

Demystifying Long Chain-of-Thought Reasoning in LLMs

URL https://arxiv.org/abs/2502.03373. Zeng, T., Zhang, S., Wu, S., Classen, C., Chae, D., Ewer, E., Lee, M., Kim, H., Kang, W., Kunde, J., et al. Versaprm: Multi-domain process reward model via synthetic reasoning data. arXiv preprint arXiv:2502.06737,

[17]

Gvpo: Group variance policy optimization for large language model post-training

Zhang, K., Hong, Y., Bao, J., Jiang, H., Song, Y., Hong, D., and Xiong, H. Gvpo: Group variance policy optimization for large language model post-training. arXiv preprint arXiv:2504.19599, 2025a. Zhang, Z., Zheng, C., Wu, Y., Zhang, B., Lin, R., Yu, B., Liu, D., Zhou, J., and Lin, J. The lessons of developing process reward models in mathematical reason...

[18]

Group Sequence Policy Optimization

Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025a. Zheng, C., Zhu, J., Lin, J., Dai, X., Yu, Y., Zhang, W., and Yang, M. Cold: Counterfactually-guided length debiasing for process reward models. arXiv preprint arXiv:2507.15698, 2...

[19]

Sweet-rl: Training multi-turn llm agents on collaborative reasoning tasks. arXiv preprint arXiv:2503.15478, 2025

URL https://arxiv.org/abs/2503.15478. Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In International conference on machine learning, pp. 12979–12990. PMLR,
