pith. machine review for the scientific record.

arxiv: 2605.11853 · v2 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords GEAR · advantage reweighting · self-distillation · credit assignment · LLM agents · GRPO · reinforcement learning · mathematical reasoning

The pith

GEAR uses self-distillation divergence to adaptively segment trajectories and reweight advantages for better LLM agent reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GEAR to improve credit assignment when training LLM agents with reinforcement learning on long-horizon tasks. Standard methods rely on coarse outcome rewards, but GEAR compares an on-policy student model against a ground-truth-conditioned teacher to detect divergence spikes that mark semantic deviation points. These spikes serve as anchors for grouping subsequent tokens into adaptive segments while keeping token-level resolution elsewhere, then modulating the advantage weights accordingly. The resulting policy updates outperform GRPO and fixed-granularity baselines on mathematical reasoning and tool-use benchmarks. Gains reach around 20 percent on tasks where the GRPO baseline is weaker, showing the value of letting divergence signals decide granularity.

Core claim

GEAR reshapes trajectory-level GRPO advantages by deriving token- and segment-level signals from self-distillation. It obtains a reference-guided divergence signal by comparing the on-policy student to a ground-truth-conditioned teacher, treating spikes in this divergence as the start of semantic deviations. Where the student stays aligned, token-level resolution is kept; where divergence rises, the continuation is grouped into an adaptive segment whose advantage is modulated by the departure-point divergence value. This produces more effective policy updates than standard GRPO, self-distillation alone, or fixed token- or turn-level methods.
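To make the mechanism concrete, here is a minimal sketch of divergence-driven segmentation and reweighting. The z-score spike detector, the rule for closing a segment, and the `1 + z` modulation factor are illustrative assumptions, not the paper's exact formulation; only the overall shape (token-level weights by default, spike-anchored segments modulated by the departure-point divergence) follows the claim above.

```python
import numpy as np

def gear_style_reweight(traj_advantage, rkl, spike_z=2.0):
    """Hedged sketch of GEAR-style adaptive advantage reweighting.

    traj_advantage : scalar trajectory-level GRPO advantage for one rollout
    rkl            : per-token reverse KL between the on-policy student and the
                     ground-truth-conditioned teacher (1-D array)
    spike_z        : z-score above which a token counts as a divergence spike
                     (illustrative threshold, not taken from the paper)
    """
    rkl = np.asarray(rkl, dtype=float)
    z = (rkl - rkl.mean()) / (rkl.std() + 1e-8)   # normalized divergence signal
    weights = np.ones_like(rkl)                    # token-level resolution by default

    t = 0
    while t < len(rkl):
        if z[t] > spike_z:
            # A spike marks the onset of a semantic deviation: group the following
            # continuation into one adaptive segment until divergence relaxes,
            # and modulate the whole segment by the departure-point value.
            end = t + 1
            while end < len(rkl) and z[end] > 0.0:
                end += 1
            weights[t:end] = 1.0 + z[t]            # illustrative modulation rule
            t = end
        else:
            t += 1
    return traj_advantage * weights                # reshaped per-token advantages
```

In an actual trainer, these reshaped per-token advantages would stand in for the uniform trajectory-level advantage inside the GRPO policy-gradient objective.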

What carries the argument

Divergence signal between on-policy student and ground-truth-conditioned teacher, used to locate adaptive segment boundaries and modulate local advantage weights.
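As excerpted in the theorem-link section below, the paper writes this per-token signal as a reverse-KL-style log-ratio under a single policy evaluated on two contexts; reading s_t⋆ as the teacher's ground-truth-conditioned context is an interpretation of that excerpt, not a quoted definition.

```latex
% Per-token divergence between the on-policy student and the
% ground-truth-conditioned teacher, as quoted from the paper.
% s_t is the student's on-policy context; s_t^{\star} is assumed to be
% the same context augmented with the ground-truth reference.
\mathrm{rKL}_t = \log \pi_\theta\!\left(a_t \mid s_t\right) - \log \pi_\theta\!\left(a_t \mid s_t^{\star}\right)
```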

If this is right

  • GEAR produces larger gains on benchmarks where standard GRPO accuracy is lower.
  • The method maintains token-level credit where models stay aligned with the teacher and coarsens it only at detected deviations.
  • Performance improvements hold across both 4B and 8B model sizes on eight different reasoning and agent benchmarks.
  • Adaptive reweighting outperforms fixed-granularity alternatives such as pure token-level or turn-level credit assignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same divergence-driven segmentation idea could transfer to credit assignment in non-language reinforcement learning tasks with long sequences.
  • Combining GEAR with other reward-modeling or shaping methods might further stabilize training on complex agent workflows.
  • Evaluating GEAR on trajectories substantially longer than those in the current benchmarks would test whether the adaptive boundaries remain effective at greater scales.

Load-bearing premise

The divergence signal between the on-policy student and the ground-truth teacher reliably marks the onset of semantic deviations that justify adaptive segment grouping.

What would settle it

Running the same benchmarks with GEAR's divergence-based boundaries replaced by random segment boundaries and finding no performance gain over GRPO would falsify the claim that the signal provides useful adaptive credit assignment.
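A minimal sketch of that control, assuming access to GEAR's per-token divergence signal: keep the number of segments fixed and randomize only where the boundaries fall, so any remaining gain over GRPO could not be credited to divergence-based placement. The spike threshold and segment-count heuristic are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def random_boundary_control(rkl, spike_z=2.0, seed=0):
    """Control condition for the falsification test: same number of segments as a
    divergence-based detector would produce, but boundary positions drawn uniformly
    at random. Hypothetical ablation harness, not code from the paper."""
    rng = np.random.default_rng(seed)
    rkl = np.asarray(rkl, dtype=float)
    z = (rkl - rkl.mean()) / (rkl.std() + 1e-8)
    n_segments = max(1, int((z > spike_z).sum()))        # match the detected spike count
    boundaries = np.sort(rng.choice(len(rkl), size=min(n_segments, len(rkl)),
                                    replace=False))
    return boundaries                                     # token indices of segment onsets
```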

Figures

Figures reproduced from arXiv: 2605.11853 by Jiang Bian, Jingjing Fu, Jun Zhang, Ling Zhang, Li Zhao, Rui Wang, Sijia Li, Yanping Li, Yuchen Huang, Zifan Liu.

Figure 1. Illustration of GEAR for fine-grained credit assignment in agent RL. (a) GRPO assigns the same trajectory-level advantage to all tokens. (b) GEAR preserves this trajectory-level advantage while redistributing credit at a finer granularity. It computes token-wise reverse KL divergence between the student and a ground-truth–conditioned teacher, then uses KL peaks to identify segment onsets and entropy to det…
Figure 2. Frequency of top-20 tokens with normalized reverse-KL.
Figure 3. Token-level visualization results of normalized KL divergence and normalized entropy.
Figure 4. Training curves of GRPO, GEAR and its variants. The left panel shows the mean training reward, …
Original abstract

Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents GEAR (Granularity-adaptivE Advantage Reweighting), a framework for improving credit assignment in reinforcement learning for LLM agents. By leveraging self-distillation, it compares on-policy rollouts from the student policy with those from a ground-truth-conditioned teacher to compute a divergence signal. This signal identifies adaptive segment boundaries where divergence spikes, presumed to indicate semantic deviations, and uses the spike magnitude to reweight the trajectory-level GRPO advantages for finer-grained policy updates. The approach is evaluated on eight benchmarks spanning mathematical reasoning and agentic tool-use tasks using Qwen3 4B and 8B models, showing consistent improvements over GRPO, self-distillation baselines, and fixed-granularity methods, with particularly notable gains (up to ~20%) on tasks with weaker baseline performance.

Significance. If the divergence-based segmentation accurately captures points of semantic departure rather than superficial variations, this method could advance credit assignment techniques for long-horizon agentic tasks by providing adaptive granularity without manual tuning. The reported empirical results across diverse benchmarks indicate potential for practical impact in post-training LLMs, especially in challenging scenarios where coarse rewards limit learning. The self-distillation approach that avoids external models and the emphasis on adaptive rather than fixed granularity are notable strengths.

major comments (2)
  1. [§3.2] The central assumption that KL divergence spikes between the on-policy student and ground-truth-conditioned teacher reliably mark the onset of semantic deviations (as opposed to stylistic, tokenization, or low-impact rephrasing variations) is not directly validated. No precision/recall analysis against human-annotated error steps or other ground-truth deviation markers is reported, which is load-bearing for the claim that the resulting advantage modulation constitutes unbiased credit assignment (a sketch of such a check follows the minor comments below).
  2. [§4.3, Table 3] The reported gains (up to ~20% over GRPO) are presented without ablations that isolate the adaptive boundary detection and spike-based reweighting from the self-distillation component alone, nor with statistical significance tests or run-to-run variance, leaving open whether improvements are attributable to the proposed mechanism.
minor comments (2)
  1. [Abstract] The acronym GRPO is used without expansion on first mention; clarify its full name (e.g., Group Relative Policy Optimization) for readers unfamiliar with the baseline.
  2. [§2] The description of how the teacher is conditioned on ground truth could be expanded with a short pseudocode snippet to make the reference-guided divergence computation fully reproducible from the text.
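On the first major comment, a hedged sketch of what the missing validation could look like: score detected divergence-spike onsets against annotated error-step onsets within a token tolerance window. The tolerance and matching rule are assumptions for illustration; the paper reports no such annotations.

```python
import numpy as np

def spike_precision_recall(spike_idx, annotated_idx, tol=3):
    """Match detected divergence-spike onsets against human-annotated semantic
    deviation onsets (hypothetical data), within a token tolerance window."""
    spike_idx = np.asarray(sorted(spike_idx))
    annotated_idx = np.asarray(sorted(annotated_idx))
    if len(spike_idx) == 0 or len(annotated_idx) == 0:
        return 0.0, 0.0
    # A detection counts as correct if some annotation lies within `tol` tokens.
    precision = np.mean([np.abs(annotated_idx - s).min() <= tol for s in spike_idx])
    # An annotation counts as recovered if some detection lies within `tol` tokens.
    recall = np.mean([np.abs(spike_idx - a).min() <= tol for a in annotated_idx])
    return float(precision), float(recall)
```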

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of validation and experimental rigor that we will address in the revision. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [§3.2] The central assumption that KL divergence spikes between the on-policy student and ground-truth-conditioned teacher reliably mark the onset of semantic deviations (as opposed to stylistic, tokenization, or low-impact rephrasing variations) is not directly validated. No precision/recall analysis against human-annotated error steps or other ground-truth deviation markers is reported, which is load-bearing for the claim that the resulting advantage modulation constitutes unbiased credit assignment.

    Authors: We agree that direct validation via precision/recall against human-annotated semantic deviation points would provide stronger grounding for the assumption. However, producing reliable human annotations for long-horizon trajectories across our eight benchmarks would require substantial additional resources outside the current scope. In the revised manuscript we will add a qualitative analysis section with concrete examples of divergence spikes aligned with clear semantic errors (e.g., incorrect intermediate reasoning steps or erroneous tool selections), together with quantitative correlations between spike locations and dataset-specific error categories. We believe these additions, combined with the observed performance gains on challenging tasks, offer practical support for the utility of the signal while acknowledging the limitation of not providing full human-validated metrics. revision: partial

  2. Referee: [§4.3, Table 3] The reported gains (up to ~20% over GRPO) are presented without ablations that isolate the adaptive boundary detection and spike-based reweighting from the self-distillation component alone, nor with statistical significance tests or run-to-run variance, leaving open whether improvements are attributable to the proposed mechanism.

    Authors: We concur that isolating the contribution of the adaptive boundary detection and reweighting, and reporting statistical details, would strengthen the experimental claims. In the revised manuscript we will add ablation experiments that compare the full GEAR framework against a self-distillation baseline that retains the teacher signal but removes the adaptive segmentation and spike-based reweighting. We will also rerun all main experiments across multiple random seeds, report mean and standard deviation, and include statistical significance tests (e.g., paired t-tests) where appropriate to demonstrate that the gains are robust. revision: yes
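A minimal sketch of the seed-level reporting the second response commits to, with placeholder per-seed accuracies (not numbers from the paper): mean ± standard deviation and a paired t-test over seeds.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed benchmark accuracies, for illustration only.
gear_acc = np.array([0.62, 0.64, 0.61, 0.63, 0.65])
grpo_acc = np.array([0.55, 0.57, 0.54, 0.56, 0.58])

print(f"GEAR: {gear_acc.mean():.3f} ± {gear_acc.std(ddof=1):.3f}")
print(f"GRPO: {grpo_acc.mean():.3f} ± {grpo_acc.std(ddof=1):.3f}")

# Paired t-test across seeds (each seed shared by both methods).
t_stat, p_value = stats.ttest_rel(gear_acc, grpo_acc)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```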

Circularity Check

0 steps flagged

No significant circularity in GEAR derivation

full rationale

The paper proposes an empirical heuristic for adaptive credit assignment: divergence spikes between an on-policy student rollout and a ground-truth-conditioned teacher define segment boundaries, with the spike value modulating the GRPO-derived segment advantage. The teacher supplies an external reference signal independent of the on-policy trajectory; the modulation is a direct function of observed divergence rather than a fitted parameter renamed as prediction or a self-referential definition. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core mechanism. The method is validated by direct benchmark comparisons against GRPO and fixed-granularity baselines, keeping the central claim self-contained and falsifiable outside any internal fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that divergence between student and ground-truth teacher provides a faithful signal for semantic boundaries; no free parameters or invented entities are visible from the abstract.

axioms (1)
  • domain assumption: Divergence spikes between the on-policy student and the ground-truth-conditioned teacher mark the onset of semantic deviations usable for credit grouping.
    Invoked to justify adaptive segment creation and advantage modulation.

pith-pipeline@v0.9.0 · 5629 in / 1289 out tokens · 45396 ms · 2026-05-15T05:55:53.856452+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries... rKL_t = log π_θ(a_t | s_t) − log π_θ(a_t | s_t⋆)

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 13 internal anchors

  1. [1]

    Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023

  2. [2]

    Exploring autonomous agents through the lens of large language models: A review.arXiv preprint arXiv:2404.04442, 2024

    Saikat Barua. Exploring autonomous agents through the lens of large language models: A review.arXiv preprint arXiv:2404.04442, 2024

  3. [3]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  4. [4]

    React: Synergizing reasoning and acting in language models, 2023

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023

  5. [5]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs.arXiv preprint arXiv:2504.13958, 2025

  6. [6]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains.arXiv preprint arXiv:2406.12045, 2024

  7. [7]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  8. [8]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  9. [9]

    Reinforcement learning: An introduction. MIT Press, Cambridge, 1998

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

  10. [10]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023

  11. [11]

    Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment

    Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, Yang Katie Zhao, and Mingyi Hong. Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment. In ICML 2025 Workshop on Computer Use Agents, 2025

  12. [12]

    Empowering llm tool invocation with tool-call reward model

    Da Ma, Ziyue Yang, Hongshen Xu, Haotian Fang, Kai Yu, and Lu Chen. Empowering llm tool invocation with tool-call reward model. In The Fourteenth International Conference on Learning Representations

  13. [13]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  14. [14]

    Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning, 2025

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning, 2025

  15. [15]

    Information gain-based policy optimization: A simple and effective approach for multi-turn llm agents.arXiv preprint arXiv:2510.14967, 2025

    Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, and Zhenzhe Ying. Information gain-based policy optimization: A simple and effective approach for multi-turn llm agents.arXiv preprint arXiv:2510.14967, 2025

  16. [16]

    Group-in-group policy optimization for llm agent training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  17. [17]

    HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

    Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, and Mingyi Hong. Hiper: Hierarchical reinforcement learning with explicit credit assignment for large language model agents.arXiv preprint arXiv:2602.16165, 2026

  18. [18]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024

  19. [19]

    Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. 2024

  20. [20]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

  21. [21]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  22. [22]

    Process reward model with q-value rankings.arXiv preprint arXiv:2410.11287, 2024

    Wendi Li and Yixuan Li. Process reward model with q-value rankings.arXiv preprint arXiv:2410.11287, 2024

  23. [23]

    A comprehensive survey of reward models: Taxonomy, applications, challenges, and future.arXiv preprint arXiv:2504.12328, 2025

    Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, and Lei Zou. A comprehensive survey of reward models: Taxonomy, applications, challenges, and future.arXiv preprint arXiv:2504.12328, 2025

  24. [24]

    Agentprm: Process reward models for llm agents via step-wise promise and progress

    Zhiheng Xi, Chenyang Liao, Guanyu Li, Zhihao Zhang, Wenxiang Chen, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, et al. Agentprm: Process reward models for llm agents via step-wise promise and progress. In Proceedings of the ACM Web Conference 2026, pages 4184–4195, 2026

  25. [25]

    Reinforcing multi-turn reasoning in llm agents via turn-level reward design.arXiv preprint arXiv:2505.11821, 2025

    Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, et al. Reinforcing multi-turn reasoning in llm agents via turn-level reward design.arXiv preprint arXiv:2505.11821, 2025

  26. [26]

    Cm2: Reinforcement learning with checklist rewards for multi-turn and multi-step agentic tool use. arXiv preprint arXiv:2602.12268, 2026

    Zhen Zhang, Kaiqiang Song, Xun Wang, Yebowen Hu, Weixiang Yan, Chenyang Zhao, Henry Peng Zou, Haoyun Deng, Sathish Reddy Indurthi, Shujian Liu, et al. Cm2: Reinforcement learning with checklist rewards for multi-turn and multi-step agentic tool use. arXiv preprint arXiv:2602.12268, 2026

  27. [27]

    Segment policy optimization: Effective segment-level credit assignment in rl for large language models. arXiv preprint arXiv:2505.23564, 2025

    Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, and Shuang Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models. arXiv preprint arXiv:2505.23564, 2025

  28. [28]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026

  29. [29]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073, 2025

  30. [30]

    Agent-as-tool: A study on the hierarchical decision making with reinforcement learning.arXiv preprint arXiv:2507.01489, 2025

    Yanfei Zhang. Agent-as-tool: A study on the hierarchical decision making with reinforcement learning.arXiv preprint arXiv:2507.01489, 2025

  31. [31]

    Tooltree: Efficient llm agent tool planning via dual-feedback monte carlo tree search and bidirectional pruning

    Shuo Yang, Soyeon Caren Han, Yihao Ding, Shuhe Wang, and Eduard Hoy. Tooltree: Efficient llm agent tool planning via dual-feedback monte carlo tree search and bidirectional pruning. arXiv preprint arXiv:2603.12740, 2026

  32. [32]

    Tips: Turn-level information-potential reward shaping for search-augmented llms

    Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, and Xiaolong Wang. Tips: Turn-level information-potential reward shaping for search-augmented llms. In The Fourteenth International Conference on Learning Representations

  33. [33]

    Agent q: Advanced reasoning and learning for autonomous ai agents.arXiv preprint arXiv:2408.07199, 2024

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents.arXiv preprint arXiv:2408.07199, 2024

  34. [34]

    Language agent tree search unifies reasoning acting and planning in language models.arXiv preprint arXiv:2310.04406, 2023

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models.arXiv preprint arXiv:2310.04406, 2023

  35. [35]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025

  36. [36]

    Agentevolver: Towards efficient self-evolving agent system. arXiv, 2025

    Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, et al. Agentevolver: Towards efficient self-evolving agent system. arXiv preprint arXiv:2511.10395, 2025

  37. [37]

    Scaling environments for llm agents: Fundamentals, approaches, and future directions

    Yuchen Huang, Sijia Li, Zhiyuan Fan, Minghao Liu, Wei Liu, and Yi R Fung. Scaling environments for llm agents: Fundamentals, approaches, and future directions. In Workshop on Scaling Environments for Agents, 2025. URL https://openreview.net/forum

  38. [38]

    Agentic reinforced policy optimization

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849, 2025

  39. [39]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  40. [40]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  41. [41]

    Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities

    Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, et al. Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1160–1183, 2025

  42. [42]

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025

  43. [43]

    Acebench: Who wins the match point in tool usage? arXiv preprint arXiv:2501.12851, 2025

    Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, et al. Acebench: Who wins the match point in tool usage? arXiv preprint arXiv:2501.12851, 2025

  44. [44]

    user_id": {

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025. 12 Algorithm 1Granularity-AdaptivE Advantage Reweighting (GEAR) Require: Initial policy πθ, r...