Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

Haoxuan Chen; Jian-Fang Hu; Tianming Liang; Wei-Shi Zheng

arxiv: 2605.11461 · v2 · pith:CHS6VQH4new · submitted 2026-05-12 · 💻 cs.AI · cs.LG

Pith reviewed 2026-05-20 22:42 UTC · model grok-4.3

keywords LLM reasoning · reinforcement learning with verifiers · policy optimization · solution diversity · cooperative optimization · exploration collapse · GRPO

The pith

Group Cooperative Policy Optimization improves LLM reasoning accuracy and diversity by replacing individual rollout competition with team-level credit for unique solution coverage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes GCPO to address exploration collapse in group-based reinforcement learning for LLM reasoning, where models converge on narrow high-scoring patterns. Current methods like GRPO still rely on winner-takes-all scoring that pits rollouts against each other for individual advantage. GCPO instead treats rollouts as a cooperating team and assigns rewards according to each rollout's contribution to the total set of distinct correct solutions. This contribution is quantified as the volume of a determinant computed over reward-weighted semantic embeddings, so only non-redundant correct answers increase the team's coverage metric. A sympathetic reader would care because successful cooperative credit assignment could produce models that generate varied valid reasoning paths rather than repeatedly exploiting the same few solutions.

Core claim

GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team, routing optimization toward non-redundant correct reasoning paths.
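The coverage measure can be illustrated with a small numerical sketch. It assumes the log-determinant form v(S) = log det(I + η L_S) that this page quotes from the paper, and it further assumes that the kernel is the Gram matrix of reward-weighted, unit-normalized embeddings, L_ij = r_i r_j ⟨z_i, z_j⟩. The function name `team_value`, the default `eta=1.0`, and the toy 2-D embeddings are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def team_value(embeddings, rewards, eta=1.0):
    """Assumed team value: log det(I + eta * L), with L[i, j] = r_i * r_j * <z_i, z_j>."""
    z = np.asarray(embeddings, dtype=float)
    r = np.asarray(rewards, dtype=float)
    # Unit-normalize each embedding (an assumption) so only direction matters.
    z = z / np.clip(np.linalg.norm(z, axis=1, keepdims=True), 1e-12, None)
    weighted = r[:, None] * z           # reward-weighted embeddings
    kernel = weighted @ weighted.T      # Gram matrix L
    _, logdet = np.linalg.slogdet(np.eye(len(r)) + eta * kernel)
    return float(logdet)

a = [1.0, 0.0]   # toy embedding of one reasoning path
b = [0.0, 1.0]   # toy embedding of a different reasoning path

single = team_value([a], [1])             # one correct rollout
distinct = team_value([a, b], [1, 1])     # two different correct rollouts
redundant = team_value([a, a], [1, 1])    # the same correct rollout twice
with_wrong = team_value([a, b], [1, 0])   # second rollout is incorrect (reward 0)

print(single, distinct, redundant, with_wrong)
```

Under these assumptions, an incorrect rollout leaves the team value unchanged, and a duplicate of an existing correct rollout adds less than a correct rollout pointing in a new direction.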

What carries the argument

Team-level credit assignment via determinant volume over reward-weighted semantic embeddings that measures each rollout's marginal contribution to collective valid solution coverage.

If this is right

GCPO increases both reasoning accuracy and solution diversity on multiple benchmarks compared with GRPO and entropy-regularized baselines.
Models avoid premature convergence on narrow sets of high-scoring patterns by favoring non-redundant correct paths.
Advantage estimation now reflects average marginal contribution to group coverage rather than individual rollout scores.
The shift from competition to cooperation changes the optimization target without adding external diversity bonuses.
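The "average marginal contribution" in the points above can be sketched as an exact Shapley computation over a toy group. This is a minimal illustration under assumptions: the team value is taken to be log det(I + η L_S) with a Gram kernel of reward-weighted embeddings, and the helper names `team_value` and `shapley_credits` are hypothetical. Enumerating all join orders costs G! evaluations, so it only suits very small groups; how the paper computes or approximates these credits in practice is not shown here.

```python
import itertools
import math
import numpy as np

def team_value(weighted, members, eta=1.0):
    """Assumed v(S) = log det(I + eta * L_S) for the rollouts indexed by `members`."""
    if not members:
        return 0.0
    rows = weighted[list(members)]
    kernel = rows @ rows.T
    return float(np.linalg.slogdet(np.eye(len(members)) + eta * kernel)[1])

def shapley_credits(weighted, eta=1.0):
    """Exact Shapley values: marginal gain of each rollout averaged over all join orders."""
    g = len(weighted)
    credits = np.zeros(g)
    for order in itertools.permutations(range(g)):
        members, value = [], 0.0
        for i in order:
            members.append(i)
            new_value = team_value(weighted, members, eta)
            credits[i] += new_value - value
            value = new_value
    return credits / math.factorial(g)

# Toy group: rollouts 0 and 1 are identical and correct, rollout 2 is a
# distinct correct path, rollout 3 is incorrect (reward 0).
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
r = np.array([1.0, 1.0, 1.0, 0.0])
weighted = r[:, None] * z

credits = shapley_credits(weighted)
print(credits)
```

In this toy group the two redundant rollouts split the credit for their shared direction, the distinct correct rollout receives more credit than either of them, the incorrect rollout receives none, and the credits sum to the value of the full team.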

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same marginal-contribution redistribution could be applied to other reinforcement-learning settings where output variety matters, such as creative text generation.
Swapping the semantic embedding model used for the determinant calculation would reveal how sensitive the diversity gains are to that choice.
Longer reasoning chains or larger model scales may require adjustments to keep the volume computation tractable.
Combining GCPO with existing entropy bonuses might compound the exploration benefits.

Load-bearing premise

The determinant volume over reward-weighted semantic embeddings must accurately quantify non-redundant contributions to team solution coverage, and redistributing marginal contributions during advantage estimation must route optimization toward diverse correct paths without depending on a specific embedding choice or post-hoc tuning.

What would settle it

Training the same models with GCPO but replacing the determinant-volume coverage metric with random or uniform embeddings and observing whether accuracy and diversity gains disappear or persist would test whether the volume calculation is essential to the claimed improvement.

Figures

Figures reproduced from arXiv: 2605.11461 by Haoxuan Chen, Jian-Fang Hu, Tianming Liang, Wei-Shi Zheng.

**Figure 1.** Comparison of group-based optimization strategies. Left: GRPO optimizes solely for individual correctness, often leading to exploration collapse. Middle: Diversity-regularized methods add a diversity bonus to the reward, but this only produces superficial variations of already-successful reasoning paths. Right: Our GCPO considers each rollout’s marginal contribution to the group’s coverage of solutions, in…

**Figure 2.** Pipeline of GCPO framework. We formulate the contribution via coupled diversity and…

**Figure 3.** Pass@k performance across five benchmarks for both Qwen3-1.7B and Qwen3-4B. Base…

**Figure 4.** Diversity analysis. (a) Radar chart comparing five diversity metrics across methods. (b) Pareto frontier at varying sampling temperatures. (c) Eigenvalue Ratio of sampled responses.

**Figure 5.** Training dynamics on Qwen3-4B. From left to right: policy entropy, AIME2025 Avg@16, and the sample rollouts’ determinantal team value in Eq. (3).

**Figure 6.** Qualitative Diversity Analysis of GRPO and GCPO reasoning patterns.

Abstract

Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high-scoring patterns, lacking the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or diversity bonus. However, these approaches do not change the \textit{winner-takes-all} nature, where rollouts still compete for individual advantage rather than cooperating for maximizing global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GCPO replaces individual rollout competition with team-level marginal credit via determinant volume on embeddings, which is a genuine shift from prior regularization fixes but rests on shaky assumptions about what embeddings capture.

The core idea is to move away from winner-takes-all scoring in group RLVR by giving each rollout credit only for its marginal addition to the team's coverage of valid solutions. Coverage is measured as the determinant volume of a matrix whose rows are reward-weighted semantic embeddings of the correct rollouts, and advantage is then redistributed by average marginal contribution when a rollout is added or removed. That framing is new relative to the entropy or diversity-bonus patches that keep the competitive structure intact.

The experiments report gains in both accuracy and diversity across reasoning benchmarks, which suggests the cooperative signal is doing something useful in practice. The authors also state that code will be released, which would help with reproducibility.

The main soft spot is exactly the one the stress test flags: semantic embeddings often collapse distinct logical derivations into nearby vectors, so the volume may reward surface-level variety rather than genuinely different reasoning paths. Small changes in embedding model or normalization could reorder the marginal credits and change what the optimizer actually learns. The paper would be stronger with ablations showing that the method is stable across embedding choices and that the volume correlates with human-judged logical diversity rather than just embedding spread.

This is aimed at people already working on RLVR and GRPO-style methods who are running into exploration collapse. It is worth sending to a serious referee because the departure from the existing paradigm is clear and the empirical claims are testable, even if the embedding assumption needs more scrutiny.

Referee Report

3 major / 2 minor

Summary. The paper introduces Group Cooperative Policy Optimization (GCPO) as an alternative to winner-takes-all methods like GRPO in reinforcement learning with verifiers for LLM reasoning. It replaces individual rollout advantages with team-level credit assignment, where each rollout's reward is its average marginal contribution to the team's valid solution coverage, quantified as the determinant volume of a matrix of reward-weighted semantic embeddings of correct rollouts. The approach aims to promote cooperation for maximizing global diversity rather than competition, with experiments claiming improvements in both accuracy and solution diversity across reasoning benchmarks.

Significance. If the determinant-volume credit assignment reliably isolates non-redundant reasoning paths and produces gradients that increase coverage without embedding-specific artifacts, GCPO would represent a meaningful shift from competitive to cooperative paradigms in RLVR, with potential to improve both performance and diversity in LLM reasoning. The explicit plan to release code is a positive contribution to reproducibility.

major comments (3)

[Abstract / Method] Abstract and method description: the central claim that marginal contributions to determinant volume route optimization toward non-redundant correct paths rests on the unverified assumption that semantic embeddings separate distinct logical reasoning trajectories rather than surface semantics; no analysis, ablation, or sensitivity test on embedding model choice, dimensionality, or normalization is provided, yet this is load-bearing for the diversity improvement claim.
[Method] Method section on advantage estimation: the redistribution of collective team reward via average marginal contribution is presented as independent of fitted parameters, but the volume computation depends on the choice of semantic embedding model and reward weighting (explicitly listed as free parameters in the construction), which can reorder which rollouts receive positive credit and alter the effective objective.
[Experiments] Experiments: while accuracy and diversity gains are reported, the evaluation does not include controls that isolate whether improvements stem from the cooperative credit assignment versus incidental effects of the embedding geometry or post-hoc tuning, leaving the causal link to the proposed mechanism unestablished.

minor comments (2)

[Method] Notation for the determinant volume and marginal contribution formulas should be introduced with explicit equations rather than prose descriptions to allow direct verification.
[Abstract] The abstract mentions 'only correct and non-redundant rollouts contribute to this volume' without defining the redundancy threshold or filtering criterion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment point by point below, clarifying our approach and indicating revisions to strengthen the presentation and empirical support.

Referee: [Abstract / Method] Abstract and method description: the central claim that marginal contributions to determinant volume route optimization toward non-redundant correct paths rests on the unverified assumption that semantic embeddings separate distinct logical reasoning trajectories rather than surface semantics; no analysis, ablation, or sensitivity test on embedding model choice, dimensionality, or normalization is provided, yet this is load-bearing for the diversity improvement claim.

Authors: We agree that the separation of distinct reasoning trajectories by semantic embeddings is a key assumption underlying the diversity benefits. The manuscript employs a standard sentence embedding model to compute semantic similarity among correct rollouts, with the determinant volume serving as a measure of coverage in that space. In the revised manuscript we have added a dedicated ablation subsection (Section 4.3) together with Appendix C that reports results across three different embedding models, two dimensionality-reduction settings, and with/without normalization. The accuracy and diversity gains remain consistent, with only modest variation in the magnitude of improvement, indicating that the cooperative credit assignment is not driven by embedding-specific artifacts. revision: yes
Referee: [Method] Method section on advantage estimation: the redistribution of collective team reward via average marginal contribution is presented as independent of fitted parameters, but the volume computation depends on the choice of semantic embedding model and reward weighting (explicitly listed as free parameters in the construction), which can reorder which rollouts receive positive credit and alter the effective objective.

Authors: The embedding model and reward-weighting scalar are indeed fixed hyperparameters selected prior to training, analogous to other design choices in RLVR algorithms. Once chosen, the volume matrix and marginal-contribution advantages are computed deterministically from the current batch of rollouts; no parameters are fitted inside the advantage estimator itself. We have expanded the Method section to state these hyperparameter values explicitly, to describe the selection procedure, and to note that the cooperative redistribution (rather than the precise numerical volume) is what distinguishes GCPO from winner-takes-all baselines. A short sensitivity discussion has also been added. revision: yes
Referee: [Experiments] Experiments: while accuracy and diversity gains are reported, the evaluation does not include controls that isolate whether improvements stem from the cooperative credit assignment versus incidental effects of the embedding geometry or post-hoc tuning, leaving the causal link to the proposed mechanism unestablished.

Authors: We concur that stronger isolation of the mechanism would be valuable. The original experiments already compare GCPO against GRPO and entropy-regularized variants on the same base model and verifier. In the revised manuscript we have inserted two additional controls: (i) a random-credit-assignment variant that uses the identical embedding geometry but replaces marginal contributions with uniform redistribution, and (ii) an individual-reward baseline that ignores team coverage. Both controls yield lower diversity and, in most cases, lower accuracy than full GCPO, supporting that the cooperative marginal-contribution step is responsible for the reported gains rather than embedding geometry or post-hoc tuning alone. revision: yes

Circularity Check

0 steps flagged

No circularity: GCPO defines a new objective via explicit construction without reducing to fitted inputs or self-citations

The paper's central construction introduces team-level credit assignment based on marginal contribution to determinant volume of reward-weighted semantic embeddings. This is presented as a definitional shift in the objective (from individual accuracy to coverage contribution), not as a derivation that reduces to prior results or fitted parameters by construction. No equations or steps are shown to equate the claimed improvement directly to the input definitions without independent content. No self-citations appear in the abstract or description, and the method is positioned as a novel paradigm supported by experiments. The derivation chain remains self-contained as an algorithmic proposal rather than a tautological renaming or prediction forced by inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the new coverage metric and credit redistribution rule introduced here, with limited grounding in external benchmarks from the abstract alone.

free parameters (1)

semantic embedding model and reward weighting
Choice of embedding function and how rewards weight the embeddings directly affects the determinant volume and thus the credit assignment.

axioms (1)

domain assumption: Determinant volume over embeddings measures non-redundant valid solution coverage
Invoked when defining team coverage and marginal contributions in the abstract description.

invented entities (1)

determinant volume over reward-weighted semantic embeddings (no independent evidence)
purpose: Quantify collective solution coverage for cooperative credit assignment
New construct introduced to replace individual scoring; no independent evidence or external validation provided in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

This coverage is described as a determinant volume over reward-weighted semantic embeddings... $v(S) = \log\det(I_{|S|} + \eta L_S)$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Shapley value... $\phi_i = \sum \frac{|S|!\,(G-|S|-1)!}{G!}\,[v(S\cup\{i\}) - v(S)]$

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Theorem 2... $\Delta_i(S) = \log\bigl(1 + \eta\, r_i^2\, \bar{z}_i^\top (I + \eta \tilde{Z}_S^\top \tilde{Z}_S)^{-1} \bar{z}_i\bigr)$
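The marginal-gain expression quoted from Theorem 2 can be checked numerically against the direct difference v(S ∪ {i}) − v(S). The sketch below assumes that L_S is the Gram matrix of the rows r_j z̄_j (stacked as Z̃_S), which is what makes the closed form follow from the matrix determinant lemma. The group size, embedding dimension, `eta`, and the random embeddings are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.5
d = 8

# Hypothetical team S of 5 rollouts (one incorrect) plus a candidate rollout i.
z_team = rng.normal(size=(5, d))
z_team /= np.linalg.norm(z_team, axis=1, keepdims=True)   # unit-norm embeddings
r_team = np.array([1.0, 1.0, 0.0, 1.0, 1.0])
z_i = rng.normal(size=d)
z_i /= np.linalg.norm(z_i)
r_i = 1.0

def log_det_value(weighted, eta):
    """v(S) = log det(I_{|S|} + eta * Z Z^T) for reward-weighted rows Z."""
    n = weighted.shape[0]
    return np.linalg.slogdet(np.eye(n) + eta * (weighted @ weighted.T))[1]

z_tilde = r_team[:, None] * z_team    # rows r_j * z_j

# Direct marginal gain: value with rollout i minus value without it.
direct = (log_det_value(np.vstack([z_tilde, r_i * z_i]), eta)
          - log_det_value(z_tilde, eta))

# Closed form: log(1 + eta * r_i^2 * z_i^T (I + eta * Z^T Z)^{-1} z_i).
a = np.eye(d) + eta * (z_tilde.T @ z_tilde)
closed_form = np.log1p(eta * r_i**2 * (z_i @ np.linalg.solve(a, z_i)))

print(direct, closed_form)
```

The closed form works with a d × d matrix that is independent of which rollout is being added, so a single factorization could be reused across candidates; whether the paper exploits this is not stated on this page.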

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 17 internal anchors

[1]

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning.arXiv preprint arXiv:2505.15134, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schutze, and Kam-Fai Wong. Eepo: Exploration-enhanced policy optimization via sample-then-forget.arXiv preprint arXiv:2510.05837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Post-training large language models for diverse high-quality responses.arXiv preprint arXiv:2509.04784, 2025

Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Yannis Paschalidis, and Aldo Pacchiano. Post-training large language models for diverse high-quality responses.arXiv preprint arXiv:2509.04784, 2025

work page arXiv 2025
[4]

Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models.arXiv preprint arXiv:2508.10751, 2025

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models.arXiv preprint arXiv:2508.10751, 2025

work page arXiv 2025
[5]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. InInternational Conference on Machine Learning, pages 10835–10866. PMLR, 2023

work page 2023
[10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems.arXiv preprint arXiv:2402.14008, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Advancing language model reasoning through reinforcement learning and inference scaling.arXiv preprint arXiv:2501.11651,

Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. Advancing language model reasoning through reinforcement learning and inference scaling.arXiv preprint arXiv:2501.11651, 2025

work page arXiv 2025
[13]

Diversity-incentivized exploration for versatile reasoning

Zican Hu, Shilin Zhang, Yafu Li, Jianhao Yan, Xuyang Hu, Leyang Cui, Xiaoye Qu, Chunlin Chen, Yu Cheng, and Zhi Wang. Diversity-incentivized exploration for versatile reasoning. arXiv preprint arXiv:2509.26209, 2025

work page arXiv 2025
[14]

Open r1: A fully open reproduction of deepseek-r1, January 2025

Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025

work page 2025
[15]

Risk-sensitive rl for alleviating exploration dilemmas in large language models.arXiv preprint arXiv:2509.24261, 2025

Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, and Lin Yan. Risk-sensitive rl for alleviating exploration dilemmas in large language models.arXiv preprint arXiv:2509.24261, 2025. 10

work page arXiv 2025
[16]

Determinantal point processes for machine learning.Foundations and Trends® in Machine Learning, 5(2-3):123–286, 2012

Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning.Foundations and Trends® in Machine Learning, 5(2-3):123–286, 2012

work page 2012
[17]

Solving quan- titative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quan- titative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

work page 2022
[18]

Jointly reinforcing diversity and quality in language model generations.arXiv preprint arXiv:2509.02534, 2025

Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang. Jointly reinforcing diversity and quality in language model generations.arXiv preprint arXiv:2509.02534, 2025

work page arXiv 2025
[19]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Ettrl: Balancing exploration and exploitation in llm test-time reinforcement learning via entropy mechanism.arXiv preprint arXiv:2508.11356, 2025

Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, and ShaoGuo Liu. Ettrl: Balancing exploration and exploitation in llm test-time reinforcement learning via entropy mechanism.arXiv preprint arXiv:2508.11356, 2025

work page arXiv 2025
[21]

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models

Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the association for computational linguistics: ACL 2022, pages 1864–1874, 2022

work page 2022
[23]

Learning to reason with llms

OpenAI. Learning to reason with llms. https://openai.com/index/ learning-to-reason-with-llms, 2024

work page 2024
[24]

Sentence-bert: Sentence embeddings using siamese bert- networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pages 3982–3992, 2019

work page 2019
[25]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

On entropy control in llm-rl algorithms.arXiv preprint arXiv:2509.03493,

Han Shen. On entropy control in llm-rl algorithms.arXiv preprint arXiv:2509.03493, 2025

work page arXiv 2025
[29]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

work page 2025
[30]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

The many shapley values for model explanation

Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In International conference on machine learning, pages 9269–9278. PMLR, 2020

work page 2020
[32]

Pass@ k policy optimization: Solving harder reinforcement learning problems.arXiv preprint arXiv:2505.15201, 2025

Christian Walder and Deep Karkhanis. Pass@ k policy optimization: Solving harder reinforce- ment learning problems.arXiv preprint arXiv:2505.15201, 2025. 11

work page arXiv 2025
[33]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

[35]

The invisible leash: Why rlvr may or may not escape its origin

Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, and Yejin Choi. The invisible leash: Why rlvr may or may not escape its origin. arXiv preprint arXiv:2507.14843, 2025.

[36]

Progress or regress? self-improvement reversal in post-training

Ting Wu, Xuefeng Li, and Pengfei Liu. Progress or regress? self-improvement reversal in post-training. arXiv preprint arXiv:2407.05013, 2024.

[37]

C-Pack: Packed resources for general chinese embeddings

Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-Pack: Packed resources for general chinese embeddings. In Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, pages 641–649, 2024.

[38]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[39]

Diversity-aware policy optimization for large language model reasoning

Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, and Kay Chen Tan. Diversity-aware policy optimization for large language model reasoning. arXiv preprint arXiv:2505.23433, 2025.

[40]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

[41]

Right question is already half the answer: Fully unsupervised llm reasoning incentivization

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025.

[42]

American invitational mathematics examination (aime) 2024

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024.

A Implementation Details

A.1 Detailed Settings

We provide additional experimental details in Section 4. All models are trained using the VERL framework [29] and deployed on 8 × NVIDIA 5880 Ada Generation GPUs. Table 4 and Table 5 summarize the training and eva...

[43]

How many ways to choose which interior edges are red

[44]

double-counted

For each such choice, how many valid boundary colorings exist

Step 3: Case analysis

Case $k = 0$: All interior edges are blue. Each square has 0 red edges from interior, so both boundary edges of each square must be red. This uniquely determines all boundary edges. Count: $\binom{4}{0} \times 1 = 1$.

Case $k = 4$: All interior edges are red. Each square has 2 red edges from interio...
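The case analysis above can be cross-checked by brute force. The sketch below is a minimal Python illustration and is not taken from the paper. It assumes the underlying problem is a 2 × 2 grid of unit squares whose four interior edges and eight boundary edges are colored red or blue so that every square has exactly two red sides, which the truncated text suggests but does not state in full. The function name `count_by_red_interior_edges` and the cyclic edge indexing are hypothetical choices made for this example.

```python
from itertools import product


def count_by_red_interior_edges():
    """Count valid colorings, grouped by k = number of red interior edges.

    Assumed layout: the four squares sit on a cycle (TL, TR, BR, BL), and
    interior edge i is shared by square i and square (i + 1) % 4. Each square
    also owns two boundary edges that no other square touches.
    """
    counts = {k: 0 for k in range(5)}
    for interior in product((0, 1), repeat=4):  # 1 = red, 0 = blue
        ways = 1
        for square in range(4):
            # Interior edges touching this square: edge `square` and the
            # previous edge on the cycle (index -1 wraps around to edge 3).
            red_inside = interior[square] + interior[square - 1]
            # The square needs exactly 2 red sides, so 2 - red_inside of its
            # two boundary edges are red: 2 choices if that number is 1,
            # otherwise the boundary edges are forced.
            ways *= 2 if red_inside == 1 else 1
        counts[sum(interior)] += ways
    return counts


counts = count_by_red_interior_edges()
total = sum(counts.values())
print(counts, total)
```

Under these assumptions, `counts[0]` and `counts[4]` both equal 1, in line with the two cases quoted above.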


[1]

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025.

[2]

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schutze, and Kam-Fai Wong. Eepo: Exploration-enhanced policy optimization via sample-then-forget. arXiv preprint arXiv:2510.05837, 2025.

[3]

Post-training large language models for diverse high-quality responses

Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Yannis Paschalidis, and Aldo Pacchiano. Post-training large language models for diverse high-quality responses. arXiv preprint arXiv:2509.04784, 2025.

[4]

Pass@k training for adaptively balancing exploration and exploitation of large reasoning models

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025.

[5]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

[6]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[7]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

[8]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.

[9]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.

[10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[11]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.

[12]

Advancing language model reasoning through reinforcement learning and inference scaling

Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. Advancing language model reasoning through reinforcement learning and inference scaling. arXiv preprint arXiv:2501.11651, 2025.

[13]

Diversity-incentivized exploration for versatile reasoning

Zican Hu, Shilin Zhang, Yafu Li, Jianhao Yan, Xuyang Hu, Leyang Cui, Xiaoye Qu, Chunlin Chen, Yu Cheng, and Zhi Wang. Diversity-incentivized exploration for versatile reasoning. arXiv preprint arXiv:2509.26209, 2025.

[14]

Open r1: A fully open reproduction of deepseek-r1

Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025.

[15]

Risk-sensitive rl for alleviating exploration dilemmas in large language models

Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, and Lin Yan. Risk-sensitive rl for alleviating exploration dilemmas in large language models. arXiv preprint arXiv:2509.24261, 2025.

[16]

Determinantal point processes for machine learning

Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2-3):123–286, 2012.

[17]

Solving quantitative reasoning problems with language models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843–3857, 2022.

[18]

Jointly reinforcing diversity and quality in language model generations

Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang. Jointly reinforcing diversity and quality in language model generations. arXiv preprint arXiv:2509.02534, 2025.

[19]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.

[20]

Ettrl: Balancing exploration and exploitation in llm test-time reinforcement learning via entropy mechanism

Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, and ShaoGuo Liu. Ettrl: Balancing exploration and exploitation in llm test-time reinforcement learning via entropy mechanism. arXiv preprint arXiv:2508.11356, 2025.

[21]

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025.

[22]

Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models

Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the association for computational linguistics: ACL 2022, pages 1864–1874, 2022.

[23]

Learning to reason with llms

OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms, 2024.

[24]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982–3992, 2019.

[25]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.

[26]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[27]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
