CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Abdelrahman M. Shaker; Ahmed Heakl; Fahad Shahbaz Khan; Omar Fetouh; Rania Elbadry; Salman Khan; Youssef Mohamed


CEPO sharpens credit assignment for decisive reasoning tokens in RLVR by using contrastive signals from both correct and wrong answers.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-20 07:21 UTC pith:QZ5WOFU4


arXiv 2605.19436 v1 · pith:QZ5WOFU4 · submitted 2026-05-19 · cs.LG cs.CL cs.CV


Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan

classification cs.LG cs.CL cs.CV

keywords reinforcement learning, policy optimization, self-distillation, contrastive learning, multimodal reasoning, mathematical reasoning, credit assignment, RLVR

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce Contrastive Evidence Policy Optimization (CEPO) to address the fact that, in reinforcement learning with verifiable rewards (RLVR), every token of a rollout receives the same reward. At each token, CEPO asks whether the correct answer favors the token while a wrong answer disfavors it, and treats only tokens that satisfy both conditions as genuine reasoning steps. The wrong-answer signal comes from rejected rollouts already available in the training batch. The authors state that they prove this preserves the structural safety guarantees of the prior state of the art, and they report higher accuracy than GRPO on multimodal mathematical reasoning benchmarks at both 2B and 4B model scales.

Core claim

When a model produces a correct solution under RLVR, CEPO constructs a wrong-answer teacher from rejected rollouts in the same batch. At each token it checks if the correct answer favors the token and the wrong answer disfavors it. Tokens meeting both conditions receive sharpened credit; the sharpening vanishes at filler positions. The method inherits the structural safety guarantees of prior state-of-the-art approaches.
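The two-sided check can be illustrated with a minimal sketch. It assumes that the log-probability of each sampled token is available under three views: the base policy, a teacher conditioned on a correct rationale, and a teacher conditioned on a rejected rationale. The function name `contrastive_evidence`, the `tol` threshold, and the specific score (sum of the two log-ratios, zeroed unless both are positive) are illustrative assumptions, not the paper's definition of its contrastive delta.

```python
from typing import List


def contrastive_evidence(
    logp_base: List[float],
    logp_correct: List[float],
    logp_wrong: List[float],
    tol: float = 1e-6,
) -> List[float]:
    """Per-token evidence score (hypothetical form, not the paper's formula).

    favor    = log p_correct(token) - log p_base(token)
    disfavor = log p_base(token)    - log p_wrong(token)
    A token gets a nonzero score only when both quantities are positive.
    """
    if not (len(logp_base) == len(logp_correct) == len(logp_wrong)):
        raise ValueError("all inputs must have the same length")
    scores = []
    for b, c, w in zip(logp_base, logp_correct, logp_wrong):
        favor = c - b
        disfavor = b - w
        if favor > tol and disfavor > tol:
            scores.append(favor + disfavor)  # equals c - w
        else:
            scores.append(0.0)  # filler or one-sided evidence
    return scores


# Toy log-probabilities chosen for illustration only.
base = [-1.0, -2.0, -0.5]
correct = [-1.0, -0.5, -0.4]
wrong = [-1.0, -3.5, -0.2]
scores = contrastive_evidence(base, correct, wrong)
print(scores)
```

In this toy input only the middle token is both favored by the correct-answer view and disfavored by the wrong-answer view, so it is the only one with a nonzero score. The first token, where all three views agree, plays the role of a filler position.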

What carries the argument

The contrastive evidence query applied at every token, which identifies genuine reasoning steps as those favored by the correct answer and disfavored by the wrong answer constructed from rejected samples.

Load-bearing premise

The wrong-answer teacher constructed from rejected rollouts already present in the training batch provides a sufficiently strong and unbiased contrastive signal without introducing leakage or distribution shift.

What would settle it

A run on the same benchmarks where CEPO fails to exceed GRPO accuracy or where the per-token credit sharpening does not disappear at positions identified as fillers would falsify the main claim.


If this is right

CEPO achieves 43.43 percent average accuracy across five multimodal mathematical reasoning benchmarks at 2B scale compared to 41.17 percent for GRPO.
CEPO achieves 60.56 percent average accuracy at 4B scale compared to 57.43 percent for GRPO under identical training budgets.
The credit improvement is strictly positive at decisive tokens and exactly zero at filler positions.
Distribution-matching self-distillation methods fall below the untrained baseline due to predicted information leakage.
CEPO inherits all structural safety guarantees from the prior state of the art.
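As a quick arithmetic check on the first two bullets, the snippet below recomputes the gaps between CEPO and GRPO from the reported averages. The dictionary layout and function names are illustrative; the four accuracy values are the ones reported above.

```python
# Average accuracies (percent) as reported for CEPO and GRPO.
reported = {
    "2B": {"CEPO": 43.43, "GRPO": 41.17},
    "4B": {"CEPO": 60.56, "GRPO": 57.43},
}


def gain_pp(scale: str) -> float:
    """Absolute gain of CEPO over GRPO in percentage points."""
    row = reported[scale]
    return round(row["CEPO"] - row["GRPO"], 2)


def relative_gain_pct(scale: str) -> float:
    """Gain relative to the GRPO average, in percent."""
    row = reported[scale]
    return round(100.0 * (row["CEPO"] - row["GRPO"]) / row["GRPO"], 2)


gains = {scale: gain_pp(scale) for scale in reported}
print(gains)
```

The resulting gaps of 2.26 and 3.13 percentage points match the figures quoted in the referee report.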

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the rejected rollout signal remains unbiased in other domains, CEPO could apply to additional verifiable reward tasks beyond mathematical reasoning.
The reuse of batch rejections may lower the computational overhead of contrastive self-distillation in larger training runs.
Future work could test whether the token-level sharpening correlates with human judgments of reasoning importance.
Applying the same contrastive principle to non-RL settings might improve credit assignment in supervised fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

CEPO adds a contrastive signal to RLVR by pitting correct-answer favor against wrong-answer disfavor from batch rejections. It reports gains over GRPO, but the safety proof and the unbiased-contrast assumption still need checking.


The main takeaway is that CEPO uses rejected rollouts from the training batch as a wrong-answer teacher to create a contrastive signal in RLVR. It requires both that the correct answer favors the token and that the wrong answer disfavors it, aiming to credit only decisive reasoning steps while the effect drops to zero on fillers. The authors claim this inherits the safety guarantees of prior methods.

The paper does well by providing clear empirical comparisons on five multimodal mathematical reasoning benchmarks, with CEPO averaging 43.43% at 2B and 60.56% at 4B versus 41.17% and 57.43% for GRPO. It also reports that other self-distillation techniques fall below the untrained baseline, which supports the leakage argument. The code link is helpful.

Where it is softer is in the proof details for safety inheritance and the vanishing property. The abstract asserts these, but the full derivation isn't visible here, and the assumption that the rejected rollouts provide an unbiased contrast without introducing batch-induced shifts needs verification. If sampling correlates rejections with specific reasoning paths, it might bias the signal at filler positions too.

This work targets people working on reinforcement learning for verifiable reasoning tasks, especially scaling to multimodal settings. Readers interested in credit assignment refinements and negative results on distribution matching would get something from it. I think it deserves a serious referee to examine the theory and experimental controls more closely.

Referee Report

2 major / 2 minor

Summary. The paper proposes Contrastive Evidence Policy Optimization (CEPO) for RLVR self-distillation. It constructs a contrastive signal at each token by checking whether the correct-answer teacher favors the token while a wrong-answer teacher (built from rejected rollouts already present in the current training batch) disfavors it. The authors claim this sharpens credit assignment precisely at decisive reasoning steps while the effect vanishes exactly at filler positions. They prove that CEPO inherits all structural safety guarantees of prior state-of-the-art methods (such as GRPO) and report empirical gains of 43.43% vs 41.17% (2B) and 60.56% vs 57.43% (4B) average accuracy on five multimodal mathematical reasoning benchmarks under identical training budgets. Distribution-matching baselines (OPSD, SDPO) fall below the untrained baseline, which the authors interpret as confirming their leakage theory. Code is released at https://github.com/ahmedheakl/CEPO.

Significance. If the safety-inheritance proof and the exact-vanishing property hold under the stated assumptions, CEPO would offer a low-cost mechanism for improving token-level credit assignment in verifiable-reward RL without additional sampling or leakage. The modest but consistent gains over GRPO on multimodal math benchmarks, together with the negative result for distribution-matching methods, would strengthen the case for contrastive rather than pure matching self-distillation. The public code release is a clear positive for reproducibility.

major comments (2)

[§4] §4 (Proof of safety inheritance and vanishing property): The derivation that the contrastive advantage is exactly zero on filler tokens and that all prior safety guarantees are inherited assumes the wrong-answer policy constructed from rejected rollouts in the current batch remains distributionally close to the baseline policy without introducing batch-induced correlations. No explicit bound or sensitivity analysis is provided for the case where rejection sampling correlates with particular reasoning paths; this assumption is load-bearing for both the vanishing claim and the safety inheritance.
[§5.2] §5.2 (Empirical results): The reported accuracy improvements (2.26 pp at 2B, 3.13 pp at 4B) are presented as averages across five benchmarks, but no per-benchmark breakdowns, standard deviations, or statistical significance tests are shown. Without these, it is difficult to determine whether the gains are robust or driven by a subset of tasks, which directly affects the strength of the claim that CEPO “strictly sharpens credit at decisive tokens.”

minor comments (2)

[Abstract, §3] Abstract and §3: The notation for the correct-teacher and wrong-teacher policies is introduced without an explicit equation reference in the main text; adding a single displayed equation for the contrastive advantage term would improve readability.
[§5.1] §5.1: The statement that distribution-matching methods “fall below the untrained baseline” would benefit from a short table row or footnote giving the exact baseline numbers for OPSD and SDPO.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below, indicating the revisions we will incorporate to strengthen the paper.


Referee: [§4] §4 (Proof of safety inheritance and vanishing property): The derivation that the contrastive advantage is exactly zero on filler tokens and that all prior safety guarantees are inherited assumes the wrong-answer policy constructed from rejected rollouts in the current batch remains distributionally close to the baseline policy without introducing batch-induced correlations. No explicit bound or sensitivity analysis is provided for the case where rejection sampling correlates with particular reasoning paths; this assumption is load-bearing for both the vanishing claim and the safety inheritance.

Authors: We appreciate the referee pointing out this key assumption underlying the proof. The wrong-answer teacher is constructed exclusively from rejected rollouts already present in the current training batch, which are sampled from the policy being optimized and filtered solely by the verifiable reward on the final answer. Because policy updates occur gradually via PPO-style clipping and the rejection criterion depends only on answer correctness rather than intermediate reasoning paths, batch-induced correlations remain limited in practice. Nevertheless, we acknowledge that an explicit sensitivity analysis would make the argument more robust. In the revised manuscript we will add an appendix section with both a brief discussion of the assumption and an empirical sensitivity study across batch sizes and rejection rates, verifying that the vanishing property and inherited safety guarantees hold under moderate distributional shifts.
Revision: partial
Referee: [§5.2] §5.2 (Empirical results): The reported accuracy improvements (2.26 pp at 2B, 3.13 pp at 4B) are presented as averages across five benchmarks, but no per-benchmark breakdowns, standard deviations, or statistical significance tests are shown. Without these, it is difficult to determine whether the gains are robust or driven by a subset of tasks, which directly affects the strength of the claim that CEPO “strictly sharpens credit at decisive tokens.”

Authors: We agree that additional statistical detail is necessary to substantiate the robustness of the reported gains. In the revised version we will expand §5.2 (and the corresponding tables) to include per-benchmark accuracy scores for both the 2B and 4B models, standard deviations computed over multiple random seeds, and paired statistical significance tests (e.g., t-tests) against the GRPO baseline. These additions will allow readers to evaluate whether improvements are consistent across tasks or concentrated in particular benchmarks.
Revision: yes
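A paired test of the kind promised here could be sketched as follows, assuming one score per seed (or per benchmark) for each method with the two lists aligned. The helper name `paired_t_statistic` is hypothetical and the input numbers are placeholders, not results from the paper. Only the test statistic is computed; a p-value would additionally need a Student t CDF with n − 1 degrees of freedom, for example from SciPy's `scipy.stats.ttest_rel`.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the paired differences a[i] - b[i]."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("differences have zero variance")
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Placeholder scores for illustration only, NOT results from the paper.
method_a = [11.0, 12.0, 13.0]
method_b = [10.0, 10.0, 10.0]
t_stat = paired_t_statistic(method_a, method_b)
dof = len(method_a) - 1
print(t_stat, dof)
```

Pairing by seed or by benchmark matters here because it removes variance shared by both methods, which an unpaired comparison of the two averages would leave in.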

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external baselines and stated assumptions rather than self-referential reduction.


The paper's central claims rest on a mathematical proof of safety inheritance and exact vanishing of the contrastive delta at filler tokens, plus empirical comparisons against GRPO, OPSD, and SDPO. These are presented as independent of the fitted values in the current batch; the wrong-answer teacher is constructed from already-sampled rejected rollouts without additional parameters being tuned to the target metric. No equation is shown to reduce to a prior fit by construction, no uniqueness theorem is imported solely via self-citation, and the leakage theory is tested by reporting that distribution-matching baselines fall below the untrained model. The derivation chain therefore remains self-contained against the external benchmarks and the explicit assumption that batch rejections do not induce new distributional shift at filler positions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions plus the new contrastive construction; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption RLVR produces correct final answers that can be used as conditioning signals without corrupting gradients when properly contrasted.
Invoked when constructing the correct-answer teacher and claiming no leakage.

pith-pipeline@v0.9.0 · 5865 in / 1292 out tokens · 40527 ms · 2026-05-20T07:21:48.865853+00:00 · methodology


Original abstract

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.

Figures

Figures reproduced from arXiv: 2605.19436 by Abdelrahman M. Shaker, Ahmed Heakl, Fahad Shahbaz Khan, Omar Fetouh, Rania Elbadry, Salman Khan, Youssef Mohamed.

**Figure 1.** Accuracy over 50 training steps. CEPO improves faster than GRPO and RLSD, reaching its largest gap around step 40 before partially converging by the final checkpoint. Reinforcement learning with verifiable rewards (RLVR) has become the dominant paradigm for post-training large language models to reason [17, 4, 20]. The core loop is simple: sample rollouts from the current policy, score them against a ver…

**Figure 2.** CEPO training pipeline and its relationship to GRPO and RLSD. Given a question x, the policy π_θ produces G rollouts that are partitioned into correct (G⁺) and wrong (G⁻) sets by a verifiable reward. CEPO conditions two frozen teachers on a sampled correct rationale r⁺ ∈ G⁺ and rejected rationale r⁻ ∈ G⁻, and defines a per-token contrastive evidence delta Δ^CE_t that amplifies advantage at decisive tokens …
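The partition step described in the Figure 2 caption can be sketched as below. The `Rollout` type, the exact-match verifier, and the uniform sampling of one rationale from each set are assumptions made for illustration; the caption does not specify how the verifier or the sampling is implemented.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rollout:
    text: str    # full rationale produced by the policy
    answer: str  # final answer extracted from the rationale


def partition(rollouts: List[Rollout], gold: str) -> Tuple[List[Rollout], List[Rollout]]:
    """Split rollouts into correct and wrong sets with an exact-match verifier."""
    correct = [r for r in rollouts if r.answer.strip() == gold.strip()]
    wrong = [r for r in rollouts if r.answer.strip() != gold.strip()]
    return correct, wrong


def sample_teacher_rationales(
    rollouts: List[Rollout], gold: str, rng: random.Random
) -> Optional[Tuple[Rollout, Rollout]]:
    """Pick one correct and one rejected rationale, or None if either set is empty."""
    correct, wrong = partition(rollouts, gold)
    if not correct or not wrong:
        return None  # no contrast is available for this question
    return rng.choice(correct), rng.choice(wrong)


group = [
    Rollout("x + 4 = 3x - 6, so x = 5", "5"),
    Rollout("x + 4 = 3x + 6, so x = -1", "-1"),
    Rollout("2x = 10, so x = 5", "5"),
]
pair = sample_teacher_rationales(group, gold="5", rng=random.Random(0))
```

Returning `None` when a group is all-correct or all-wrong is one possible way to handle questions with no contrast; how the paper treats that case is not stated in the text shown here.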

**Figure 3.** Hyperparameter sensitivity (averaged across 5 reasoning benchmarks). (a) Constant λ schedule: λ = 0.5 peaks at 41.40%, outperforming GRPO (41.17%); sustained high-λ training (λ = 1.0) introduces noise that offsets the credit-assignment benefit. (b) Linear-decay schedule from λ_0 = 1.0: a 25-step warmup matches the constant-λ peak (41.25%). (c) Evidence clip bound ε_w: performance peaks in [0.4, 0.5] at 42.7% …

**Figure 4.** Contrastive delta fractions during CEPO training. We track the fraction of tokens assigned positive versus negative contrastive evidence. Positive-delta mass increases early, while negative-delta mass decreases.
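The quantity tracked in Figure 4 can be computed with a few lines, assuming the per-token contrastive deltas for a batch are available as a flat list of floats. The function name `delta_fractions` and the `tol` dead-zone parameter are illustrative and do not come from the paper.

```python
from typing import Dict, List


def delta_fractions(deltas: List[float], tol: float = 0.0) -> Dict[str, float]:
    """Fraction of tokens with positive, negative, and (near-)zero delta."""
    n = len(deltas)
    if n == 0:
        return {"positive": 0.0, "negative": 0.0, "zero": 0.0}
    pos = sum(1 for d in deltas if d > tol)
    neg = sum(1 for d in deltas if d < -tol)
    return {
        "positive": pos / n,
        "negative": neg / n,
        "zero": (n - pos - neg) / n,
    }


# Toy deltas for illustration only.
fractions = delta_fractions([0.5, -0.25, 0.0, 1.0])
print(fractions)
```

Logging these three fractions at each training step would give curves of the kind the caption describes.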

**Figure 5.** Token-level credit assignment on a parallelogram problem. Green/red/white denote high, low, and neutral token weights. Numbered regions illustrate three claims: ① RLSD over-credits fluent setup prose, while CEPO suppresses it; ② CEPO localizes blame to the misapplied angle-equality inference instead of diffusing penalties; ③ CEPO sharpens credit on the decisive algebraic derivation (x+4 = 3x−6, isolation s…


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

H$^2$SD: Hybrid Hindsight Self-Distillation
cs.LG 2026-07 conditional novelty 6.0

H2SD routes successful RLVR trajectories to token-level magnitude credit assignment and failed trajectories to reverse-KL distillation from a hint-conditioned self-teacher, improving logical reasoning benchmarks.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[2] Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, and Lei Meng. Enhancing reinforcement learning with dense rewards from language model critic. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
[3] Ken Ding. Hdpo: Hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871, 2026.
[4] Daya Guo et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[5] Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, and Shuang Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models. arXiv preprint arXiv:2505.23564, 2025.
[6] Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026.
[7] Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms, 2025.
[8] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, 2023.
[9] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023.
[10] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[11] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volu...
[12] Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models. arXiv preprint arXiv:2602.04942, 2026.
[13] Runqi Qiao et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025.
[14] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 2023.
[15] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[16] Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146, 2024.
[17] Zhihong Shao et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[18] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 2024.
[19] Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. LogicVista: Multimodal LLM logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973, 2024.
[20] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[21] Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled RLVR. arXiv preprint arXiv:2604.03128, 2026.
[22] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[23] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024.
[24] Chenchen Zhang. From reasoning to agentic: Credit assignment in reinforcement learning for large language models. arXiv preprint arXiv:2604.09459, 2026.
[25] Kaichen Zhang et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, 2025.
[26] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.
[27] Yanli Zhao et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
[28] Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, Yuwen Xiong, and Richong Zhang. EasyR1: An efficient, scalable, multi-modality RL training framework. https://github.com/hiyouga/EasyR1, 2025.
[29] Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024.

Appendix A Proofs. A.1 Proof of Theorem 1. (i) Direction anchoring. Since exp(·) > 0, we have w^CE_t > 0 unconditionally. Because ε_w ∈ (...

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Enhancing reinforcement learning with dense rewards from language model critic

Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, and Lei Meng. Enhancing reinforcement learning with dense rewards from language model critic. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

work page 2024

[3] [3]

arXiv preprint arXiv:2603.23871 , year=

Ken Ding. Hdpo: Hybrid distillation policy optimization via privileged self-distillation.arXiv preprint arXiv:2603.23871, 2026

work page arXiv 2026

[4] [4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

arXiv preprint arXiv:2505.23564 , year=

Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, and Shuang Qiu. Segment policy optimization: Ef- fective segment-level credit assignment in rl for large language models.arXiv preprint arXiv:2505.23564, 2025

work page arXiv 2025

[6] [6]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Vineppo: Refining credit assignment in rl training of llms, 2025

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms, 2025

work page 2025

[8] [8]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, 2023

work page 2023

[9] [9]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

work page 2023

[10] [10]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[11] [11]

Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning

Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volu...

work page 2021

[12] [12]

Privileged Information Distillation for Language Models

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

work page internal anchor Pith review arXiv 2026

[13] [13]

Runqi Qiao et al. We-math: Does your large multimodal model achieve human-like mathemati- cal reasoning? InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

[14]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 2023

[15]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[16]

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146, 2024

[17]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[18]

Measuring multimodal mathematical reasoning with math-vision dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 2024

[19]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973, 2024

[20]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[21]

Self-Distilled RLVR

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr. arXiv preprint arXiv:2604.03128, 2026

[22]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[23]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

[24]

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

Chenchen Zhang. From reasoning to agentic: Credit assignment in reinforcement learning for large language models. arXiv preprint arXiv:2604.09459, 2026

[25]

Lmms-eval: Reality check on the evaluation of large multimodal models

Kaichen Zhang et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, 2025

[26]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

[27]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023

[28]

EasyR1: An efficient, scalable, multi-modality RL training framework

Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, Yuwen Xiong, and Richong Zhang. EasyR1: An efficient, scalable, multi-modality RL training framework. https://github.com/hiyouga/EasyR1, 2025

[29]

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024

Appendix A Proofs

A.1 Proof of Theorem 1

(i) Direction anchoring. Since $\exp(\cdot) > 0$, we have $w_t^{\mathrm{CE}} > 0$ unconditionally. Because $\epsilon_w \in$ (...
