pith. machine review for the scientific record.

arxiv: 2603.27977 · v2 · submitted 2026-03-30 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords reinforcement learning · reasoning models · label-free RL · reasoning topology · math reasoning · open-ended tasks · PPO · GRPO

The pith

SARL improves reasoning models by rewarding the topology of thinking paths instead of final answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reinforcement learning for reasoning works better when it supervises the structure of intermediate steps rather than only the correctness of the end result. Standard methods require verifiable final answers and therefore stay limited to closed tasks while often producing brittle trajectories. SARL builds a reasoning map from each response's thinking steps and scores the map's topology for local coherence and global efficiency. This label-free signal is applied during PPO and GRPO training on both math and open-ended benchmarks. The approach yields higher accuracy than prior label-free methods and even surpasses ground-truth supervised RL while also producing more stable training with lower KL divergence.

Core claim

SARL constructs per-response reasoning maps from intermediate thinking steps and rewards their topology to shift supervision from the destination to the path, encouraging reasoning trajectories that are both locally coherent and globally efficient. On verifiable math tasks this yields average gains of 9.1 percent under PPO and 11.6 percent under GRPO across four benchmarks, with especially large lifts on AIME25; on non-verifiable open-ended tasks it produces average gains of 34.6 percent under PPO and 30.4 percent under GRPO on WildBench, outperforming both prior label-free baselines and preference-based methods.

What carries the argument

Per-response reasoning maps assembled from intermediate thinking steps, scored for topological properties of local coherence and global efficiency.
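
The summary above does not pin down how these two properties are scored, and the paper's exact reward is not reproduced here. As a purely illustrative sketch, borrowing the small-world measures from the network-science literature the paper cites (clustering for local structure, inverse path length for global structure) and introducing a hypothetical mixing weight λ, the score of a reasoning map G = (V, E) with shortest-path distances d(u, v) might take a form like

    R_{\text{topo}}(G) \;=\; \lambda \cdot \frac{1}{|V|} \sum_{v \in V} C(v) \;+\; (1-\lambda) \cdot \frac{1}{|V|\,(|V|-1)} \sum_{u \neq v} \frac{1}{d(u,v)}

where C(v) is the local clustering coefficient of step v; the first term rewards locally coherent neighborhoods and the second rewards globally efficient, short-path maps.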

If this is right

  • Models reach higher accuracy on math benchmarks without any ground-truth answer labels during training.
  • Training exhibits lower KL divergence and higher policy entropy, indicating more stable and exploratory updates.
  • Performance rises on open-ended tasks where final-answer verification is impossible.
  • The same topology reward works across PPO and GRPO optimizers (a minimal sketch of the reward and its GRPO hookup follows this list).
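
The bullets above presuppose a concrete reward pipeline. Below is a minimal sketch of what one might look like, under explicit assumptions that are not confirmed by the paper as summarized here: steps are already segmented, a hypothetical embed() function maps each step to a vector, edges connect steps whose cosine similarity clears an arbitrary threshold, and local coherence and global efficiency are mixed with a 0.5 weight. The GRPO hookup is just the standard group-relative standardization of rewards, into which any scalar label-free signal can be dropped.

    # Hedged sketch, not the paper's implementation: reasoning-map construction,
    # a topology reward, and GRPO-style group-relative advantages.
    import numpy as np
    import networkx as nx

    def build_reasoning_map(steps, embed, sim_threshold=0.6):
        """Nodes are thinking steps; edges link steps whose embeddings are similar."""
        vecs = [np.asarray(embed(s), dtype=float) for s in steps]
        g = nx.Graph()
        g.add_nodes_from(range(len(steps)))
        for i in range(len(steps)):
            for j in range(i + 1, len(steps)):
                denom = np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]) + 1e-8
                if float(vecs[i] @ vecs[j]) / denom >= sim_threshold:
                    g.add_edge(i, j)
        return g

    def topology_reward(g, mix=0.5):
        """Mix of local coherence (mean clustering) and global efficiency."""
        if g.number_of_nodes() < 2:
            return 0.0
        return mix * nx.average_clustering(g) + (1.0 - mix) * nx.global_efficiency(g)

    def grpo_advantages(rewards):
        """Standardize rewards within a sampled group of responses."""
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + 1e-8)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        embed = lambda s: rng.normal(size=64)  # hypothetical stand-in for a sentence encoder
        group = [[f"step {k} of response {i}" for k in range(5)] for i in range(4)]
        rewards = [topology_reward(build_reasoning_map(r, embed)) for r in group]
        print(grpo_advantages(rewards))  # would weight the policy-gradient update

Because the group baseline is computed from the same sampled responses, any well-behaved scalar reward slots in without labels; under PPO the same scalar would instead pass through a learned value baseline.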

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to other sequential generation domains where process structure matters more than endpoint correctness.
  • Because the reward targets path properties rather than memorized answers, models might transfer better to entirely new problem classes.
  • Varying how steps are segmented when building the maps would reveal whether the performance edge depends on a particular extraction heuristic.

Load-bearing premise

The topology extracted from a model's intermediate steps supplies a reliable, unbiased signal of reasoning quality that remains valid outside the training distribution.

What would settle it

A controlled test in which models trained with SARL show no accuracy gain or lose performance on tasks where reasoning paths are edited to preserve the same topology scores while changing their actual content or correctness.
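
One concrete way to run such a test, assuming the map-building and reward functions from the sketch earlier on this page plus a hypothetical pool of distractor steps, is to search for content edits that leave the topology score essentially unchanged and then measure task accuracy on the edited responses.

    # Hedged sketch of the settling experiment: find content edits that hold the
    # topology score (score_fn) roughly fixed, then check whether correctness
    # moves independently of the reward. Everything here is hypothetical.
    import random

    def topology_preserving_edits(steps, distractors, score_fn, tol=0.02, tries=200, seed=0):
        """Swap single steps for distractor text; keep edits whose score stays within tol."""
        rng = random.Random(seed)
        base = score_fn(steps)
        kept = []
        for _ in range(tries):
            edited = list(steps)
            edited[rng.randrange(len(edited))] = rng.choice(distractors)
            if abs(score_fn(edited) - base) <= tol:
                kept.append(edited)
        return base, kept

    # score_fn would wrap build_reasoning_map + topology_reward from the earlier
    # sketch; if accuracy drops on the kept edits while scores stay flat, the
    # reward is tracking structure rather than content quality.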

Figures

Figures reproduced from arXiv: 2603.27977 by Ananth Grama, Bolian Li, David Cho, Fanping Sui, Ruqi Zhang, Yifan Wang.

Figure 1. Overview of Structure-Aware Reinforcement Learning (SARL). Left: SARL replaces outcome …
Figure 2. Training dynamics of different methods (reward signals) under GRPO.
Original abstract

Reinforcement learning is critical to improving large reasoning models, but its success relies heavily on verifiable rewards (RLVR), making it hard to use in open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimizing solely toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce Structure-Aware Reinforcement Learning (SARL), a label-free framework that constructs per-response reasoning maps from intermediate thinking steps and rewards their reasoning topology. SARL shifts supervision from destination to path, encouraging reasoning trajectories that are both locally coherent and globally efficient. On verifiable math tasks, SARL outperforms prior label-free RL baselines and even exceeds RL methods with ground truth supervision, with average gains of +9.1% under PPO and +11.6% under GRPO across four math benchmarks, with particularly large improvements on AIME25 (+35.5% with PPO and +44.7% with GRPO). On non-verifiable open-ended tasks, SARL achieves average gains of +34.6% under PPO and +30.4% under GRPO on WildBench across five task categories, outperforming prior label-free RL methods and DPO, which relies on additional preference labels. Beyond strong performance, SARL exhibits substantially lower KL divergence and higher policy entropy, indicating more stable and exploratory training dynamics. Code and data are available at https://github.com/cacayaya/SARL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Structure-Aware Reinforcement Learning (SARL), a label-free RL framework that constructs per-response reasoning maps from intermediate thinking steps and rewards their topology (local coherence and global efficiency) to shift supervision from final answers to reasoning trajectories. It reports average gains of +9.1% (PPO) and +11.6% (GRPO) on four math benchmarks (with +35.5%/+44.7% on AIME25), +34.6% (PPO) and +30.4% (GRPO) on WildBench open-ended tasks, outperforming prior label-free baselines and even ground-truth RL methods, plus lower KL divergence and higher policy entropy indicating more stable training.

Significance. If the results hold after addressing construction details, SARL would offer a meaningful advance for label-free RL on reasoning models by emphasizing path structure over outcomes, potentially improving generalization in open-ended domains where verifiable rewards are unavailable. The reported outperformance of supervised RL baselines and improved dynamics (lower KL, higher entropy) would strengthen the case for topology-based rewards as a scalable alternative.

major comments (3)
  1. [§3 (Method)] The central mechanism—per-response reasoning map construction (node/edge definition from intermediate steps) and topology reward formulation—is load-bearing for the claim of unbiased quality signals, yet the manuscript provides limited detail on extraction, scoring, and hyperparameters, leaving open the risk that gains capture generation artifacts (e.g., step length or formatting patterns) rather than general reasoning quality, especially on AIME25 where improvements reach +35–44%.
  2. [§4.1 (Math benchmarks)] The claim that SARL exceeds RL methods with ground-truth supervision on math tasks (average +9.1%/+11.6%) requires explicit verification that comparisons use matched base models, training steps, and hyperparameter budgets; without this, the result could reflect implementation differences rather than the topology reward's superiority.
  3. [§4.2 (Open-ended tasks)] On non-verifiable tasks, the +34.6%/+30.4% WildBench gains depend on applying the same map-based reward to open-ended responses, but no ablation tests whether map extraction introduces biases (e.g., favoring certain response structures) that correlate with benchmark scores rather than true reasoning improvement.
minor comments (3)
  1. [Abstract] The code link is given as a placeholder; replace it with the actual repository URL in the final version.
  2. [§3] Notation for the topology reward (local vs. global components) would benefit from an explicit equation or pseudocode to improve reproducibility.
  3. [§4] Results tables: include standard deviations or multiple seeds for the reported percentage gains to allow assessment of statistical reliability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting SARL's potential contribution to label-free RL. We address each major comment below with specific plans for revision.

Point-by-point responses
  1. Referee: [§3 (Method)] The central mechanism—per-response reasoning map construction (node/edge definition from intermediate steps) and topology reward formulation—is load-bearing for the claim of unbiased quality signals, yet the manuscript provides limited detail on extraction, scoring, and hyperparameters, leaving open the risk that gains capture generation artifacts (e.g., step length or formatting patterns) rather than general reasoning quality, especially on AIME25 where improvements reach +35–44%.

    Authors: We agree that the current description of map construction and reward formulation requires expansion to rule out artifact-based explanations. In the revised manuscript we will add to §3: (i) explicit pseudocode for node/edge extraction from intermediate steps, (ii) the full mathematical definition of the local-coherence and global-efficiency terms, and (iii) a hyperparameter table. We will also insert a controlled analysis (holding step count fixed) showing that topology reward gains on AIME25 persist beyond length or formatting patterns. revision: yes

  2. Referee: [§4.1 (Math benchmarks)] The claim that SARL exceeds RL methods with ground-truth supervision on math tasks (average +9.1%/+11.6%) requires explicit verification that comparisons use matched base models, training steps, and hyperparameter budgets; without this, the result could reflect implementation differences rather than the topology reward's superiority.

    Authors: All experiments in §4.1 used identical base models, identical training-step budgets, and hyperparameter grids of comparable size for every method. In the revision we will add an explicit paragraph in §4.1 stating these matching conditions and append a table listing the final hyperparameter values for SARL, PPO, GRPO, and the ground-truth RL baselines. revision: yes

  3. Referee: [§4.2 (Open-ended tasks)] On non-verifiable tasks, the +34.6%/+30.4% WildBench gains depend on applying the same map-based reward to open-ended responses, but no ablation tests whether map extraction introduces biases (e.g., favoring certain response structures) that correlate with benchmark scores rather than true reasoning improvement.

    Authors: We accept that an ablation isolating potential structural biases is needed. The revised §4.2 will include (i) a length-controlled ablation of the map reward on WildBench and (ii) a correlation analysis between topology features and benchmark scores. These additions will demonstrate that performance gains arise from reasoning topology rather than extraction artifacts. revision: yes
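
As an editorial illustration of what such a check could report (not the authors' planned analysis), a minimal correlation pass over per-response statistics would compare how strongly the topology reward tracks length versus benchmark quality; the function and variable names here are hypothetical.

    # Hedged sketch of a length-artifact check: correlate the topology reward
    # with step counts and with benchmark scores over a set of responses.
    import numpy as np

    def pearson(x, y):
        x = np.asarray(x, dtype=float) - np.mean(x)
        y = np.asarray(y, dtype=float) - np.mean(y)
        return float((x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

    def artifact_report(topology_rewards, step_counts, bench_scores):
        """A reward that tracks length about as tightly as quality suggests an artifact."""
        return {
            "reward_vs_length": pearson(topology_rewards, step_counts),
            "reward_vs_score": pearson(topology_rewards, bench_scores),
            "length_vs_score": pearson(step_counts, bench_scores),
        }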

Circularity Check

0 steps flagged

The SARL topology reward is an independent empirical construction; no derivation in the paper reduces to feeding its own output back in as a premise.

Full rationale

The paper defines a new reward based on per-response reasoning maps extracted from intermediate steps and evaluates the resulting policy on external verifiable math benchmarks (AIME, etc.) and WildBench. No equation or claim reduces the reported gains (+9.1% PPO, +11.6% GRPO) to a fitted parameter renamed as prediction or to a self-citation chain. The central premise (topology as unbiased quality signal) is presented as an empirical hypothesis tested against baselines, not derived from prior self-work by definition. Minor self-citation risk exists but is not load-bearing for the performance claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the domain assumption that reasoning steps can be mapped into a topology whose coherence and efficiency correlate with improved generalization, plus standard RL assumptions about policy optimization.

free parameters (1)
  • topology reward hyperparameters
    Weights balancing local coherence versus global efficiency in the reward function are likely tuned on data.
axioms (1)
  • domain assumption: Intermediate thinking steps can be extracted and assembled into a reasoning map whose topology reflects reasoning quality
    Invoked in the construction of per-response reasoning maps from thinking steps.
invented entities (1)
  • reasoning map (no independent evidence)
    purpose: Represents the structure of intermediate thinking steps for computing topology-based rewards
    New construct introduced to shift supervision from outcome to path.

pith-pipeline@v0.9.0 · 5620 in / 1325 out tokens · 38532 ms · 2026-05-14T22:17:01.628382+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TEMPO: Scaling Test-time Training for Large Reasoning Models

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    The unreasonable effectiveness of entropy minimization in llm reasoning

    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025

  2. [2]

    Matharena: Evaluating llms on uncontaminated math competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/

  3. [3]

    Small-world brain networks revisited

    Danielle S Bassett and Edward T Bullmore. Small-world brain networks revisited. The Neuroscientist, 23(5):499–516, 2017

  4. [4]

    What characterizes effective reasoning? Revisiting length, review, and structure of CoT

    Yunzhen Feng, Julia Kempe, Cheng Zhang, Parag Jain, and Anthony Hartshorn. What characterizes effective reasoning? Revisiting length, review, and structure of CoT. arXiv preprint arXiv:2509.19284, 2025

  5. [5]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  6. [6]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  7. [7]

    Process reward models that think

    Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think. arXiv preprint arXiv:2504.16828, 2025

  8. [8]

    AMC-23

    knoveleng. AMC-23. https://huggingface.co/datasets/knoveleng/AMC-23, 2025

  9. [9]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843–3857, 2022

  10. [10]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations (ICLR), 2024

  11. [11]

    Wildbench: Benchmarking llms with challenging tasks from real users in the wild

    Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild. arXiv preprint arXiv:2406.04770, 2024

  12. [12]

    Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment

    Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743, 2025

  13. [13]

    Let's reward step by step: Step-level reward model as the navigators for reasoning

    Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. Let's reward step by step: Step-level reward model as the navigators for reasoning. arXiv preprint arXiv:2310.10080, 2023

  14. [14]

    Topology of reasoning: Understanding large reasoning models through reasoning graph properties

    Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. Topology of reasoning: Understanding large reasoning models through reasoning graph properties. arXiv preprint arXiv:2506.05744, 2025

  15. [15]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  16. [16]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  17. [17]

    Navigation of brain networks

    Caio Seguin, Martijn P Van Den Heuvel, and Andrew Zalesky. Navigation of brain networks. Proceedings of the National Academy of Sciences, 115(24):6297–6302, 2018

  18. [18]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  19. [19]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  20. [20]

    Networks of the Brain

    Olaf Sporns. Networks of the Brain. MIT Press, 2016

  21. [21]

    The shape of reasoning: Topological analysis of reasoning traces in large language models

    Xue Wen Tan, Nathaniel Tan, Galen Lee, and Stanley Kok. The shape of reasoning: Topological analysis of reasoning traces in large language models. arXiv preprint arXiv:2510.20665, 2025

  22. [22]

    TRL: Transformers Reinforcement Learning, 2020

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl

  23. [23]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024

  24. [24]

    Segregation, integration, and balance of large-scale resting brain networks configure different cognitive abilities

    Rong Wang, Mianxin Liu, Xinhong Cheng, Ying Wu, Andrea Hildebrandt, and Changsong Zhou. Segregation, integration, and balance of large-scale resting brain networks configure different cognitive abilities. Proceedings of the National Academy of Sciences, 118(23):e2022288118, 2021

  25. [25]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  26. [26]

    Collective dynamics of ‘small-world’ networks

    Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998

  27. [27]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  28. [28]

    Surrogate signals from format and length: Reinforcement learning for solving mathematical problems without ground truth answers

    Rihui Xin, Han Liu, Zecheng Wang, Yupeng Zhang, Dianbo Sui, Xiaolin Hu, and Bingning Wang. Surrogate signals from format and length: Reinforcement learning for solving mathematical problems without ground truth answers. arXiv preprint arXiv:2505.19439, 2025

  29. [29]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  30. [30]

    Right question is already half the answer: Fully unsupervised llm reasoning incentivization

    Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025

  31. [31]

    Learning to reason without external rewards

    Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards. arXiv preprint arXiv:2505.19590, 2025

  32. [32]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics

  33. [33]

    The geometry of reasoning: Flowing logics in representation space

    Yufa Zhou, Yixiao Wang, Xunjian Yin, Shuyan Zhou, and Anru R Zhang. The geometry of reasoning: Flowing logics in representation space. arXiv preprint arXiv:2510.09782, 2025

  34. [34]

    TTRL: Test-Time Reinforcement Learning

    Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025