Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Jun Bai; Shuyi Zhang; Song-Chun Zhu; Tong Wu; Yang Liu; Yanting Wang; Zilong Zheng; Zixia Jia; Ziyong Lin

arxiv: 2512.07461 · v3 · pith:KFNEXE42new · submitted 2025-12-08 · 💻 cs.CL

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Tong Wu , Yang Liu , Jun Bai , Zixia Jia , Shuyi Zhang , Ziyong Lin , Yanting Wang , Song-Chun Zhu

show 1 more author

Zilong Zheng

This is my paper

Pith reviewed 2026-05-17 01:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords parallel reasoningself-distillationreinforcement learninglarge language modelspolicy optimizationagentic reasoninginference speedup

0 comments

The pith

Large language models can learn genuine parallel reasoning on their own through self-distilled reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how LLMs can move from sequential step-by-step thinking to breaking problems into independent branches that run at the same time. It uses a training sequence that starts with basic format learning and then applies strict rules for parallel structure, all driven by the model's own signals rather than outside examples. This leads to better accuracy on reasoning tasks and much quicker inference because the computation actually happens in parallel instead of being faked through longer sequences. A reader would care if this holds because it points toward AI systems that use hardware resources more efficiently on multi-part problems without needing constant human guidance or massive labeled data.

Core claim

The Native Parallel Reasoner framework lets models self-evolve parallel reasoning by first discovering output formats through self-distillation, then enforcing topological constraints on branching, while a Parallel-Aware Policy Optimization algorithm directly tunes those branches inside the execution graph and a refactored engine handles memory and flow for stable large-scale training; on eight benchmarks this produces up to 24.5 percent gains and 4.6 times speedups with fully genuine parallel execution.

What carries the argument

The self-distilled progressive training paradigm that moves from cold-start format discovery to strict topological constraints on the reasoning graph, paired with the Parallel-Aware Policy Optimization algorithm that rewards adaptive branching decisions through trial and error.

Load-bearing premise

The training process can force the model to stay in native parallel mode with fixed branching rules rather than slipping back into ordinary sequential generation.

What would settle it

A test run on the trained model that shows reasoning outputs still require step-by-step sequential decoding or delivers no measurable speedup on parallel hardware would disprove the claim of achieving native parallel cognition.

read the original abstract

We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from ``cold-start'' format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NPR introduces a teacher-free self-distilled RL setup for native parallel reasoning with some practical engineering, but the central execution claims need more concrete verification.

read the letter

The main point is that this work trains a 4B model to handle reasoning branches in true parallel rather than emulating it sequentially, using self-distillation to move from basic format learning to strict topological constraints plus a new PAPO optimizer that works directly on the execution graph. They also modified the SGLang engine to support large-scale parallel RL training, which is a hands-on piece that could matter for anyone scaling similar systems. The reported numbers are up to 24.5% better accuracy and 4.6x faster inference across eight benchmarks, with the claim that it stays at 100% parallel execution where other methods slip back to autoregressive decoding. That combination of teacher-free progressive training and graph-aware policy updates looks distinct from standard tree search or supervised methods. The engine refactor stands out as useful engineering that addresses a real bottleneck in running parallel RL at scale. On the downside, the abstract and high-level description leave open how they actually confirm that branches execute concurrently instead of getting serialized inside the refactored memory and flow control. The stress-test note is fair here: without per-step timing logs, engine traces, or explicit checks that independent paths run at the same time, the 100% genuine parallel claim rests more on the training rewards than on direct measurement. I'd also want to see ablations on the progressive stages and whether the topological constraints hold up without external supervision. Statistical details like variance across seeds or baseline comparisons would help judge if the gains are stable. This is aimed at researchers working on efficient multi-step reasoning and agentic inference, especially those already using RL on LLMs or dealing with branching workflows. A reader focused on inference optimization or scalable planning would find the training pipeline and engine changes worth examining. It deserves peer review because the problem is timely, the approach has enough technical pieces to discuss, and the engineering contribution is concrete even if the evaluation needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces Native Parallel Reasoner (NPR), a teacher-free self-distilled RL framework that enables LLMs to transition from sequential to native parallel reasoning. It proposes three innovations: a progressive self-distillation paradigm enforcing strict topological constraints, the Parallel-Aware Policy Optimization (PAPO) algorithm for learning adaptive branching policies, and an NPR Engine that refactors SGLang's memory and flow control for stable parallel RL. On eight reasoning benchmarks with Qwen3-4B, it reports gains of up to 24.5% and speedups of up to 4.6x, claiming 100% genuine parallel execution without fallback to autoregressive decoding.

Significance. If the central claims hold, the work would be significant for scalable agentic reasoning, as it demonstrates a path to self-evolving parallel cognition without external teachers or sequential fallbacks. The practical contribution of the NPR Engine for large-scale parallel RL training and the empirical distinction from prior baselines that revert to autoregressive behavior are notable strengths that could influence future work on efficient LLM inference.

major comments (2)

[§4] §4 (Experimental Setup and Results): The headline claims of up to 24.5% performance gains and 4.6x speedups, along with the assertion of 100% genuine parallel execution, are presented without baseline tables, statistical significance tests, error analysis, or per-benchmark breakdowns. This makes it impossible to evaluate whether the gains are robust or attributable to the proposed self-distilled training and PAPO rather than implementation artifacts.
[§3.3] §3.3 (NPR Engine and Topological Constraints): The claim that the refactored SGLang engine enforces strict topological constraints leading to '100% genuine parallel execution' lacks explicit verification mechanisms, such as per-step concurrency metrics, hardware utilization traces, or formal checks that independent branches execute concurrently rather than being serialized in the execution graph. Without this, the distinction from baselines that fall back to autoregressive decoding remains unverified and load-bearing for the central contribution.

minor comments (2)

[Abstract] The abstract and introduction use the term 'native parallel cognition' without a precise operational definition or pseudocode for how the topological constraints are represented in the policy.
[§4] Figure captions and axis labels in the results section could be clarified to distinguish wall-clock speedup from token-throughput metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the comments identify gaps in presentation or verification, we have revised the manuscript to incorporate additional tables, statistical analyses, and explicit metrics.

read point-by-point responses

Referee: [§4] §4 (Experimental Setup and Results): The headline claims of up to 24.5% performance gains and 4.6x speedups, along with the assertion of 100% genuine parallel execution, are presented without baseline tables, statistical significance tests, error analysis, or per-benchmark breakdowns. This makes it impossible to evaluate whether the gains are robust or attributable to the proposed self-distilled training and PAPO rather than implementation artifacts.

Authors: We agree that the original §4 would benefit from expanded empirical detail to allow readers to fully assess robustness. In the revised manuscript we have added complete per-benchmark tables comparing NPR against all baselines on the eight reasoning tasks, together with statistical significance results (paired t-tests across five random seeds) and standard-error bars. A new error-analysis subsection discusses task categories where parallel decomposition yields the largest gains and where variance remains high. These additions make clear that the reported improvements arise from the self-distilled progressive training and PAPO rather than implementation artifacts. revision: yes
Referee: [§3.3] §3.3 (NPR Engine and Topological Constraints): The claim that the refactored SGLang engine enforces strict topological constraints leading to '100% genuine parallel execution' lacks explicit verification mechanisms, such as per-step concurrency metrics, hardware utilization traces, or formal checks that independent branches execute concurrently rather than being serialized in the execution graph. Without this, the distinction from baselines that fall back to autoregressive decoding remains unverified and load-bearing for the central contribution.

Authors: We concur that explicit verification strengthens the central claim. The revised §3.3 and new appendix now report per-step concurrency metrics (average number of concurrently executing branches), GPU utilization traces captured during training and inference, and a formal check that the execution graph respects the topological order with no serialization of independent branches. We also include side-by-side traces demonstrating that the baselines frequently collapse to autoregressive decoding while NPR maintains 100 % parallel execution under the same metrics. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on observed benchmark outcomes

full rationale

The paper introduces NPR via three described innovations (self-distilled progressive training, PAPO, and refactored SGLang engine) and reports performance gains plus 100% parallel execution as measured results across eight benchmarks. No equations, definitions, or self-citations are shown that reduce the reported gains, speedups, or parallel-execution claim to fitted inputs or definitional equivalence by construction. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Review is based solely on the abstract; no explicit free parameters, axioms, or invented entities beyond the named framework and algorithm are detailed enough to audit.

invented entities (2)

Native Parallel Reasoner (NPR) no independent evidence
purpose: Overall framework enabling self-evolving parallel reasoning in LLMs
Newly introduced term in the abstract.
Parallel-Aware Policy Optimization (PAPO) no independent evidence
purpose: Algorithm that optimizes branching policies inside the execution graph
Novel algorithm named and described in the abstract.

pith-pipeline@v0.9.0 · 5516 in / 1199 out tokens · 82748 ms · 2026-05-17T01:05:32.222004+00:00 · methodology

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension
cs.CV 2026-02 unverdicted novelty 7.0

Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
cs.LG 2026-01 unverdicted novelty 7.0

Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 6.0

LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 5.0

LACE adds lattice attention to let parallel LLM reasoning threads interact and correct errors, raising accuracy over 7 points versus standard independent sampling.
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 5.0

LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 3 Pith papers · 8 internal anchors

[1]

URL https://arxiv.org/abs/2506.07976. 16 Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal...

work page arXiv
[2]

Multiverse: Your language models secretly decide how to parallelize and merge generation.arXiv preprint arXiv:2506.09991,

Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, and Beidi Chen. Multiverse: Your language models secretly decide how to parallelize and merge generation.arXiv preprint arXiv:2506.09991, 2025a. Haoyu Wang, Yujia Fu, Zhu Zhang, Shuo Wang, Zirui Ren, Xiaorong Wang, Zhili Li, Chaoqun He, Bo An, Zhiyuan Liu, and Maosong Sun. Llm ×mapreduce-v2: Entropy-driven co...

work page arXiv
[3]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Parallel-r1: Towards parallel thinking via reinforcement learning.arXiv preprint arXiv:2509.07980, 2025a

Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, and Dong Yu. Parallel-r1: Towards parallel thinking via reinforcement learning.arXiv preprint arXiv:2509.07980, 2025a. Hao Wen, Yifan Su, Feifei Zhang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, and Yuanchun Li. Parathinker: Native parallel thin...

work page arXiv
[5]

URL https://arxiv.org/abs/ 2508.09303. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hong...

work page arXiv
[6]

Part i: Tricks or traps? a deep dive into rl for llm reasoning

URL https://cloud.google.com/ vertex-ai/generative-ai/docs/thinking-mode. Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, et al. Part i: Tricks or traps? a deep dive into rl for llm reasoning.arXiv preprint arXiv:2508.08221, 2025a. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radf...

work page arXiv
[7]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Liangha...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

American invitational mathematics examination 2025,

Mathematical Association of America. American invitational mathematics examination 2025,

work page 2025
[10]

Accessed: 2025-10-

URL https:// artofproblemsolving.com/wiki/index.php/American_Invitational_Mathematics_Examination. Accessed: 2025-10-

work page 2025
[11]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Mislav Balunovi´ c, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovi´ c, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions.arXiv preprint arXiv:2505.23281,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun

Accessed: 2025-10-22. Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar,...

work page 2025
[13]

O lympiad B ench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.211. URL https://aclanthology.org/2024.acl-long.211/. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advan...

work page doi:10.18653/v1/2024.acl-long.211 2024
[14]

Qwen2.5 Technical Report

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Qwen2.5 Technical Report

doi: 10.48550/arXiv.2412.15115. Ziqi Wang, Boye Niu, Zipeng Gao, Zhi Zheng, Tong Xu, Linghui Meng, Zhongli Li, Jing Liu, Yilong Chen, Chen Zhu, et al. A survey on parallel reasoning.CoRR, abs/2510.12164, 2025b. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nak...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115
[16]

P., Kawaguchi, K., and Shieh, M

Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P . Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning.CoRR, abs/2405.00451,

work page arXiv
[17]

Training large language models to reason in parallel with global forking tokens.CoRR, abs/2510.05132,

Sheng Jia, Xiao Wang, and Shiva Prasad Kasiviswanathan. Training large language models to reason in parallel with global forking tokens.CoRR, abs/2510.05132,

work page arXiv
[18]

Learning adaptive parallel reasoning with language models.arXiv preprint arXiv:2504.15466, 2025

Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, and Alane Suhr. Learning adaptive parallel reasoning with language models.CoRR, abs/2504.15466,

work page arXiv
[19]

Teaching large language models to reason with reinforcement learning,

18 Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. Teaching large language models to reason with reinforcement learning.CoRR, abs/2403.04642,

work page arXiv
[20]

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yua...

work page internal anchor Pith review arXiv
[21]

Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space, 2025a

Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, et al. Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308,

work page arXiv
[22]

Rulereasoner: Reinforced rule-based reasoning via domain-aware dynamic sampling

Yang Liu, Jiaqi Li, and Zilong Zheng. Rulereasoner: Reinforced rule-based reasoning via domain-aware dynamic sampling. arXiv preprint arXiv:2506.08672, 2025b. Jun Bai, Minghao Tong, Yang Liu, Zixia Jia, and Zilong Zheng. Understanding and leveraging the expert specialization of context faithfulness in mixture-of-experts llms. InProceedings of the 2025 Con...

work page arXiv 2025
[23]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024,

work page 2024
[24]

Process reward models that think

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. InFindings of the Association for Computational Linguistics, ACL 2025, pages 10495–10516, 2025b. Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekye...

work page arXiv 2025
[25]

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing LLM reasoning with rule-based reinforcement learning.CoRR, abs/2502.14768,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization.CoRR, abs/2507.18071, 2025b. 19

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

URL https://arxiv.org/abs/2506.07976. 16 Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal...

work page arXiv

[2] [2]

Multiverse: Your language models secretly decide how to parallelize and merge generation.arXiv preprint arXiv:2506.09991,

Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, and Beidi Chen. Multiverse: Your language models secretly decide how to parallelize and merge generation.arXiv preprint arXiv:2506.09991, 2025a. Haoyu Wang, Yujia Fu, Zhu Zhang, Shuo Wang, Zirui Ren, Xiaorong Wang, Zhili Li, Chaoqun He, Bo An, Zhiyuan Liu, and Maosong Sun. Llm ×mapreduce-v2: Entropy-driven co...

work page arXiv

[3] [3]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Parallel-r1: Towards parallel thinking via reinforcement learning.arXiv preprint arXiv:2509.07980, 2025a

Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, and Dong Yu. Parallel-r1: Towards parallel thinking via reinforcement learning.arXiv preprint arXiv:2509.07980, 2025a. Hao Wen, Yifan Su, Feifei Zhang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, and Yuanchun Li. Parathinker: Native parallel thin...

work page arXiv

[5] [5]

URL https://arxiv.org/abs/ 2508.09303. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hong...

work page arXiv

[6] [6]

Part i: Tricks or traps? a deep dive into rl for llm reasoning

URL https://cloud.google.com/ vertex-ai/generative-ai/docs/thinking-mode. Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, et al. Part i: Tricks or traps? a deep dive into rl for llm reasoning.arXiv preprint arXiv:2508.08221, 2025a. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radf...

work page arXiv

[7] [7]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Liangha...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [9]

American invitational mathematics examination 2025,

Mathematical Association of America. American invitational mathematics examination 2025,

work page 2025

[9] [10]

Accessed: 2025-10-

URL https:// artofproblemsolving.com/wiki/index.php/American_Invitational_Mathematics_Examination. Accessed: 2025-10-

work page 2025

[10] [11]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Mislav Balunovi´ c, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovi´ c, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions.arXiv preprint arXiv:2505.23281,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [12]

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun

Accessed: 2025-10-22. Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar,...

work page 2025

[12] [13]

O lympiad B ench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.211. URL https://aclanthology.org/2024.acl-long.211/. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advan...

work page doi:10.18653/v1/2024.acl-long.211 2024

[13] [14]

Qwen2.5 Technical Report

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page internal anchor Pith review Pith/arXiv arXiv

[14] [15]

Qwen2.5 Technical Report

doi: 10.48550/arXiv.2412.15115. Ziqi Wang, Boye Niu, Zipeng Gao, Zhi Zheng, Tong Xu, Linghui Meng, Zhongli Li, Jing Liu, Yilong Chen, Chen Zhu, et al. A survey on parallel reasoning.CoRR, abs/2510.12164, 2025b. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nak...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115

[15] [16]

P., Kawaguchi, K., and Shieh, M

Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P . Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning.CoRR, abs/2405.00451,

work page arXiv

[16] [17]

Training large language models to reason in parallel with global forking tokens.CoRR, abs/2510.05132,

Sheng Jia, Xiao Wang, and Shiva Prasad Kasiviswanathan. Training large language models to reason in parallel with global forking tokens.CoRR, abs/2510.05132,

work page arXiv

[17] [18]

Learning adaptive parallel reasoning with language models.arXiv preprint arXiv:2504.15466, 2025

Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, and Alane Suhr. Learning adaptive parallel reasoning with language models.CoRR, abs/2504.15466,

work page arXiv

[18] [19]

Teaching large language models to reason with reinforcement learning,

18 Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, and Roberta Raileanu. Teaching large language models to reason with reinforcement learning.CoRR, abs/2403.04642,

work page arXiv

[19] [20]

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yua...

work page internal anchor Pith review arXiv

[20] [21]

Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space, 2025a

Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, et al. Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308,

work page arXiv

[21] [22]

Rulereasoner: Reinforced rule-based reasoning via domain-aware dynamic sampling

Yang Liu, Jiaqi Li, and Zilong Zheng. Rulereasoner: Reinforced rule-based reasoning via domain-aware dynamic sampling. arXiv preprint arXiv:2506.08672, 2025b. Jun Bai, Minghao Tong, Yang Liu, Zixia Jia, and Zilong Zheng. Understanding and leveraging the expert specialization of context faithfulness in mixture-of-experts llms. InProceedings of the 2025 Con...

work page arXiv 2025

[22] [23]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024,

work page 2024

[23] [24]

Process reward models that think

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. InFindings of the Association for Computational Linguistics, ACL 2025, pages 10495–10516, 2025b. Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekye...

work page arXiv 2025

[24] [25]

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing LLM reasoning with rule-based reinforcement learning.CoRR, abs/2502.14768,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [26]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization.CoRR, abs/2507.18071, 2025b. 19

work page internal anchor Pith review Pith/arXiv arXiv