pith. machine review for the scientific record.

arxiv: 2603.27977 · v2 · submitted 2026-03-30 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords reinforcement learning · reasoning models · label-free RL · reasoning topology · math reasoning · open-ended tasks · PPO · GRPO

The pith

SARL improves reasoning models by rewarding the topology of thinking paths instead of final answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reinforcement learning for reasoning works better when it supervises the structure of intermediate steps rather than only the correctness of the end result. Standard methods require verifiable final answers and therefore stay limited to closed tasks while often producing brittle trajectories. SARL builds a reasoning map from each response's thinking steps and scores the map's topology for local coherence and global efficiency. This label-free signal is applied during PPO and GRPO training on both math and open-ended benchmarks. The approach yields higher accuracy than prior label-free methods and even surpasses ground-truth supervised RL while also producing more stable training with lower KL divergence.

Core claim

SARL constructs per-response reasoning maps from intermediate thinking steps and rewards their topology to shift supervision from the destination to the path, encouraging reasoning trajectories that are both locally coherent and globally efficient. On verifiable math tasks this yields average gains of 9.1 percent under PPO and 11.6 percent under GRPO across four benchmarks, with especially large lifts on AIME25; on non-verifiable open-ended tasks it produces average gains of 34.6 percent under PPO and 30.4 percent under GRPO on WildBench, outperforming both prior label-free baselines and preference-based methods.

What carries the argument

Per-response reasoning maps assembled from intermediate thinking steps, scored for topological properties of local coherence and global efficiency.
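
The summary above does not pin down how these two properties are scored, and the paper's exact reward is not reproduced here. As a purely illustrative sketch, borrowing the small-world measures from the network-science literature the paper cites (clustering for local structure, inverse path length for global structure) and introducing a hypothetical mixing weight λ, the score of a reasoning map G = (V, E) with shortest-path distances d(u, v) might take a form like

    R_{\text{topo}}(G) \;=\; \lambda \cdot \frac{1}{|V|} \sum_{v \in V} C(v) \;+\; (1-\lambda) \cdot \frac{1}{|V|\,(|V|-1)} \sum_{u \neq v} \frac{1}{d(u,v)}

where C(v) is the local clustering coefficient of step v; the first term rewards locally coherent neighborhoods and the second rewards globally efficient, short-path maps.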

If this is right

  • Models reach higher accuracy on math benchmarks without any ground-truth answer labels during training.
  • Training exhibits lower KL divergence and higher policy entropy, indicating more stable and exploratory updates.
  • Performance rises on open-ended tasks where final-answer verification is impossible.
  • The same topology reward works across PPO and GRPO optimizers (a minimal sketch of the reward and its GRPO hookup follows this list).
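
The bullets above presuppose a concrete reward pipeline. Below is a minimal sketch of what one might look like, under explicit assumptions that are not confirmed by the paper as summarized here: steps are already segmented, a hypothetical embed() function maps each step to a vector, edges connect steps whose cosine similarity clears an arbitrary threshold, and local coherence and global efficiency are mixed with a 0.5 weight. The GRPO hookup is just the standard group-relative standardization of rewards, into which any scalar label-free signal can be dropped.

    # Hedged sketch, not the paper's implementation: reasoning-map construction,
    # a topology reward, and GRPO-style group-relative advantages.
    import numpy as np
    import networkx as nx

    def build_reasoning_map(steps, embed, sim_threshold=0.6):
        """Nodes are thinking steps; edges link steps whose embeddings are similar."""
        vecs = [np.asarray(embed(s), dtype=float) for s in steps]
        g = nx.Graph()
        g.add_nodes_from(range(len(steps)))
        for i in range(len(steps)):
            for j in range(i + 1, len(steps)):
                denom = np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]) + 1e-8
                if float(vecs[i] @ vecs[j]) / denom >= sim_threshold:
                    g.add_edge(i, j)
        return g

    def topology_reward(g, mix=0.5):
        """Mix of local coherence (mean clustering) and global efficiency."""
        if g.number_of_nodes() < 2:
            return 0.0
        return mix * nx.average_clustering(g) + (1.0 - mix) * nx.global_efficiency(g)

    def grpo_advantages(rewards):
        """Standardize rewards within a sampled group of responses."""
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + 1e-8)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        embed = lambda s: rng.normal(size=64)  # hypothetical stand-in for a sentence encoder
        group = [[f"step {k} of response {i}" for k in range(5)] for i in range(4)]
        rewards = [topology_reward(build_reasoning_map(r, embed)) for r in group]
        print(grpo_advantages(rewards))  # would weight the policy-gradient update

Because the group baseline is computed from the same sampled responses, any well-behaved scalar reward slots in without labels; under PPO the same scalar would instead pass through a learned value baseline.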

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to other sequential generation domains where process structure matters more than endpoint correctness.
  • Because the reward targets path properties rather than memorized answers, models might transfer better to entirely new problem classes.
  • Varying how steps are segmented when building the maps would reveal whether the performance edge depends on a particular extraction heuristic.

Load-bearing premise

The topology extracted from a model's intermediate steps supplies a reliable, unbiased signal of reasoning quality that remains valid outside the training distribution.

What would settle it

A controlled test in which models trained with SARL show no accuracy gain or lose performance on tasks where reasoning paths are edited to preserve the same topology scores while changing their actual content or correctness.
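
One concrete way to run such a test, assuming the map-building and reward functions from the sketch earlier on this page plus a hypothetical pool of distractor steps, is to search for content edits that leave the topology score essentially unchanged and then measure task accuracy on the edited responses.

    # Hedged sketch of the settling experiment: find content edits that hold the
    # topology score (score_fn) roughly fixed, then check whether correctness
    # moves independently of the reward. Everything here is hypothetical.
    import random

    def topology_preserving_edits(steps, distractors, score_fn, tol=0.02, tries=200, seed=0):
        """Swap single steps for distractor text; keep edits whose score stays within tol."""
        rng = random.Random(seed)
        base = score_fn(steps)
        kept = []
        for _ in range(tries):
            edited = list(steps)
            edited[rng.randrange(len(edited))] = rng.choice(distractors)
            if abs(score_fn(edited) - base) <= tol:
                kept.append(edited)
        return base, kept

    # score_fn would wrap build_reasoning_map + topology_reward from the earlier
    # sketch; if accuracy drops on the kept edits while scores stay flat, the
    # reward is tracking structure rather than content quality.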

Figures

Figures reproduced from arXiv: 2603.27977 by Ananth Grama, Bolian Li, David Cho, Fanping Sui, Ruqi Zhang, Yifan Wang.

Figure 1. Overview of Structure-Aware Reinforcement Learning (SARL). Left: SARL replaces outcome …
Figure 2. Training dynamics of different methods (reward signals) under GRPO.
Original abstract

Reinforcement learning is critical to improving large reasoning models, but its success relies heavily on verifiable rewards (RLVR), making it hard to use in open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimizing solely toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce Structure-Aware Reinforcement Learning (SARL), a label-free framework that constructs per-response reasoning maps from intermediate thinking steps and rewards their reasoning topology. SARL shifts supervision from destination to path, encouraging reasoning trajectories that are both locally coherent and globally efficient. On verifiable math tasks, SARL outperforms prior label-free RL baselines and even exceeds RL methods with ground truth supervision, with average gains of +9.1% under PPO and +11.6% under GRPO across four math benchmarks, with particularly large improvements on AIME25 (+35.5% with PPO and +44.7% with GRPO). On non-verifiable open-ended tasks, SARL achieves average gains of +34.6% under PPO and +30.4% under GRPO on WildBench across five task categories, outperforming prior label-free RL methods and DPO, which relies on additional preference labels. Beyond strong performance, SARL exhibits substantially lower KL divergence and higher policy entropy, indicating more stable and exploratory training dynamics. Code and data are available at https://github.com/cacayaya/SARL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Structure-Aware Reinforcement Learning (SARL), a label-free RL framework that constructs per-response reasoning maps from intermediate thinking steps and rewards their topology (local coherence and global efficiency) to shift supervision from final answers to reasoning trajectories. It reports average gains of +9.1% (PPO) and +11.6% (GRPO) on four math benchmarks (with +35.5%/+44.7% on AIME25), +34.6% (PPO) and +30.4% (GRPO) on WildBench open-ended tasks, outperforming prior label-free baselines and even ground-truth RL methods, plus lower KL divergence and higher policy entropy indicating more stable training.

Significance. If the results hold after addressing construction details, SARL would offer a meaningful advance for label-free RL on reasoning models by emphasizing path structure over outcomes, potentially improving generalization in open-ended domains where verifiable rewards are unavailable. The reported outperformance of supervised RL baselines and improved dynamics (lower KL, higher entropy) would strengthen the case for topology-based rewards as a scalable alternative.

major comments (3)
  1. [§3 (Method)] The central mechanism—per-response reasoning map construction (node/edge definition from intermediate steps) and topology reward formulation—is load-bearing for the claim of unbiased quality signals, yet the manuscript provides limited detail on extraction, scoring, and hyperparameters, leaving open the risk that gains capture generation artifacts (e.g., step length or formatting patterns) rather than general reasoning quality, especially on AIME25 where improvements reach +35–44%.
  2. [§4.1 (Math benchmarks)] The claim that SARL exceeds RL methods with ground-truth supervision on math tasks (average +9.1%/+11.6%) requires explicit verification that comparisons use matched base models, training steps, and hyperparameter budgets; without this, the result could reflect implementation differences rather than the topology reward's superiority.
  3. [§4.2 (Open-ended tasks)] On non-verifiable tasks, the +34.6%/+30.4% WildBench gains depend on applying the same map-based reward to open-ended responses, but no ablation tests whether map extraction introduces biases (e.g., favoring certain response structures) that correlate with benchmark scores rather than true reasoning improvement.
minor comments (3)
  1. [Abstract] The code link is given as a placeholder; replace it with the actual repository URL in the final version.
  2. [§3] Notation for the topology reward (local vs. global components) would benefit from an explicit equation or pseudocode to improve reproducibility.
  3. [§4] Results tables: include standard deviations or multiple seeds for the reported percentage gains to allow assessment of statistical reliability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting SARL's potential contribution to label-free RL. We address each major comment below with specific plans for revision.

Point-by-point responses
  1. Referee: [§3 (Method)] The central mechanism—per-response reasoning map construction (node/edge definition from intermediate steps) and topology reward formulation—is load-bearing for the claim of unbiased quality signals, yet the manuscript provides limited detail on extraction, scoring, and hyperparameters, leaving open the risk that gains capture generation artifacts (e.g., step length or formatting patterns) rather than general reasoning quality, especially on AIME25 where improvements reach +35–44%.

    Authors: We agree that the current description of map construction and reward formulation requires expansion to rule out artifact-based explanations. In the revised manuscript we will add to §3: (i) explicit pseudocode for node/edge extraction from intermediate steps, (ii) the full mathematical definition of the local-coherence and global-efficiency terms, and (iii) a hyperparameter table. We will also insert a controlled analysis (holding step count fixed) showing that topology reward gains on AIME25 persist beyond length or formatting patterns. revision: yes

  2. Referee: [§4.1 (Math benchmarks)] The claim that SARL exceeds RL methods with ground-truth supervision on math tasks (average +9.1%/+11.6%) requires explicit verification that comparisons use matched base models, training steps, and hyperparameter budgets; without this, the result could reflect implementation differences rather than the topology reward's superiority.

    Authors: All experiments in §4.1 used identical base models, identical training-step budgets, and hyperparameter grids of comparable size for every method. In the revision we will add an explicit paragraph in §4.1 stating these matching conditions and append a table listing the final hyperparameter values for SARL, PPO, GRPO, and the ground-truth RL baselines. revision: yes

  3. Referee: [§4.2 (Open-ended tasks)] On non-verifiable tasks, the +34.6%/+30.4% WildBench gains depend on applying the same map-based reward to open-ended responses, but no ablation tests whether map extraction introduces biases (e.g., favoring certain response structures) that correlate with benchmark scores rather than true reasoning improvement.

    Authors: We accept that an ablation isolating potential structural biases is needed. The revised §4.2 will include (i) a length-controlled ablation of the map reward on WildBench and (ii) a correlation analysis between topology features and benchmark scores. These additions will demonstrate that performance gains arise from reasoning topology rather than extraction artifacts. revision: yes
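
As an editorial illustration of what such a check could report (not the authors' planned analysis), a minimal correlation pass over per-response statistics would compare how strongly the topology reward tracks length versus benchmark quality; the function and variable names here are hypothetical.

    # Hedged sketch of a length-artifact check: correlate the topology reward
    # with step counts and with benchmark scores over a set of responses.
    import numpy as np

    def pearson(x, y):
        x = np.asarray(x, dtype=float) - np.mean(x)
        y = np.asarray(y, dtype=float) - np.mean(y)
        return float((x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

    def artifact_report(topology_rewards, step_counts, bench_scores):
        """A reward that tracks length about as tightly as quality suggests an artifact."""
        return {
            "reward_vs_length": pearson(topology_rewards, step_counts),
            "reward_vs_score": pearson(topology_rewards, bench_scores),
            "length_vs_score": pearson(step_counts, bench_scores),
        }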

Circularity Check

0 steps flagged

The SARL topology reward is an independent empirical construction; no derivation in the paper reduces to feeding its own output back in as a premise.

Full rationale

The paper defines a new reward based on per-response reasoning maps extracted from intermediate steps and evaluates the resulting policy on external verifiable math benchmarks (AIME, etc.) and WildBench. No equation or claim reduces the reported gains (+9.1% PPO, +11.6% GRPO) to a fitted parameter renamed as prediction or to a self-citation chain. The central premise (topology as unbiased quality signal) is presented as an empirical hypothesis tested against baselines, not derived from prior self-work by definition. Minor self-citation risk exists but is not load-bearing for the performance claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the domain assumption that reasoning steps can be mapped into a topology whose coherence and efficiency correlate with improved generalization, plus standard RL assumptions about policy optimization.

free parameters (1)
  • topology reward hyperparameters
    Weights balancing local coherence versus global efficiency in the reward function are likely tuned on data.
axioms (1)
  • domain assumption: Intermediate thinking steps can be extracted and assembled into a reasoning map whose topology reflects reasoning quality
    Invoked in the construction of per-response reasoning maps from thinking steps.
invented entities (1)
  • reasoning map (no independent evidence)
    purpose: Represents the structure of intermediate thinking steps for computing topology-based rewards
    New construct introduced to shift supervision from outcome to path.

pith-pipeline@v0.9.0 · 5620 in / 1325 out tokens · 38532 ms · 2026-05-14T22:17:01.628382+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TEMPO: Scaling Test-time Training for Large Reasoning Models

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    The unreasonable effectiveness of entropy minimization in llm reasoning

    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025

  2. [2]

    Matharena: Evaluating llms on uncontaminated math competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/

  3. [3]

    Small-world brain networks revisited

    Danielle S Bassett and Edward T Bullmore. Small-world brain networks revisited. The Neuroscientist, 23(5):499–516, 2017

  4. [4]

    What characterizes effective reasoning? Revisiting length, review, and structure of CoT

    Yunzhen Feng, Julia Kempe, Cheng Zhang, Parag Jain, and Anthony Hartshorn. What characterizes effective reasoning? Revisiting length, review, and structure of CoT. arXiv preprint arXiv:2509.19284, 2025

  5. [5]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  6. [6]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  7. [7]

    Process reward models that think

    Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think. arXiv preprint arXiv:2504.16828, 2025

  8. [8]

    AMC-23

    knoveleng. AMC-23. https://huggingface.co/datasets/knoveleng/AMC-23, 2025

  9. [9]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843–3857, 2022

  10. [10]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations (ICLR), 2024

  11. [11]

    Wildbench: Benchmarking llms with challenging tasks from real users in the wild

    Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild. arXiv preprint arXiv:2406.04770, 2024

  12. [12]

    Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment

    Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743, 2025

  13. [13]

    Let's reward step by step: Step-level reward model as the navigators for reasoning

    Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. Let's reward step by step: Step-level reward model as the navigators for reasoning. arXiv preprint arXiv:2310.10080, 2023

  14. [14]

    Topology of reasoning: Understanding large reasoning models through reasoning graph properties

    Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. Topology of reasoning: Understanding large reasoning models through reasoning graph properties. arXiv preprint arXiv:2506.05744, 2025

  15. [15]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  16. [16]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  17. [17]

    Navigation of brain networks

    Caio Seguin, Martijn P Van Den Heuvel, and Andrew Zalesky. Navigation of brain networks. Proceedings of the National Academy of Sciences, 115(24):6297–6302, 2018

  18. [18]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  19. [19]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  20. [20]

    Networks of the Brain

    Olaf Sporns. Networks of the Brain. MIT Press, 2016

  21. [21]

    The shape of reasoning: Topological analysis of reasoning traces in large language models

    Xue Wen Tan, Nathaniel Tan, Galen Lee, and Stanley Kok. The shape of reasoning: Topological analysis of reasoning traces in large language models. arXiv preprint arXiv:2510.20665, 2025

  22. [22]

    TRL: Transformers Reinforcement Learning, 2020

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl

  23. [23]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024

  24. [24]

    Segregation, integration, and balance of large-scale resting brain networks configure different cognitive abilities

    Rong Wang, Mianxin Liu, Xinhong Cheng, Ying Wu, Andrea Hildebrandt, and Changsong Zhou. Segregation, integration, and balance of large-scale resting brain networks configure different cognitive abilities. Proceedings of the National Academy of Sciences, 118(23):e2022288118, 2021

  25. [25]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  26. [26]

    Collective dynamics of ‘small-world’ networks

    Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998

  27. [27]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  28. [28]

    Surrogate signals from format and length: Reinforcement learning for solving mathematical problems without ground truth answers

    Rihui Xin, Han Liu, Zecheng Wang, Yupeng Zhang, Dianbo Sui, Xiaolin Hu, and Bingning Wang. Surrogate signals from format and length: Reinforcement learning for solving mathematical problems without ground truth answers. arXiv preprint arXiv:2505.19439, 2025

  29. [29]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  30. [30]

    Right question is already half the answer: Fully unsupervised llm reasoning incentivization

    Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025

  31. [31]

    Learning to reason without external rewards

    Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards. arXiv preprint arXiv:2505.19590, 2025

  32. [32]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics

  33. [33]

    The geometry of reasoning: Flowing logics in representation space

    Yufa Zhou, Yixiao Wang, Xunjian Yin, Shuyan Zhou, and Anru R Zhang. The geometry of reasoning: Flowing logics in representation space. arXiv preprint arXiv:2510.09782, 2025

  34. [34]

    TTRL: Test-Time Reinforcement Learning

    Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025