GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-15 05:55 UTC · model grok-4.3
The pith
GEAR uses self-distillation divergence to adaptively segment trajectories and reweight advantages for better LLM agent reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GEAR reshapes trajectory-level GRPO (Group Relative Policy Optimization) advantages by deriving token- and segment-level signals from self-distillation. It obtains a reference-guided divergence signal by comparing the on-policy student to a ground-truth-conditioned teacher, treating spikes in this divergence as the onset of semantic deviations. Where the student stays aligned with the teacher, token-level resolution is kept; where divergence rises, the continuation is grouped into an adaptive segment whose advantage is modulated by the divergence value at the departure point. The authors report that this yields more effective policy updates than standard GRPO, self-distillation alone, and fixed token- or turn-level methods.
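To make the mechanism concrete, here is a minimal sketch of divergence-driven reweighting, assuming per-token log-probabilities from the student's own context and from the ground-truth-conditioned context are already available; the spike threshold, segment-closing rule, and tanh modulation are illustrative stand-ins, not the paper's exact formulation.

```python
import numpy as np

def gear_style_reweight(adv, logp_student, logp_teacher, spike_thresh=1.0):
    """Illustrative reshaping of a single trajectory-level advantage.

    adv          -- scalar GRPO advantage for the whole trajectory
    logp_student -- per-token log-probs under the student's own context
    logp_teacher -- per-token log-probs under the ground-truth-conditioned context
    """
    # Reference-guided divergence signal, one value per token.
    div = np.asarray(logp_student) - np.asarray(logp_teacher)

    weights = np.ones_like(div)
    t = 0
    while t < len(div):
        if div[t] > spike_thresh:
            # Departure point: group the continuation into one adaptive segment
            # until the divergence falls back toward the aligned regime.
            end = t
            while end < len(div) and div[end] > 0.5 * spike_thresh:
                end += 1
            # Modulate the whole segment by the divergence at the departure point.
            weights[t:end] = 1.0 + np.tanh(div[t])
            t = end
        else:
            # Aligned region: keep token-level resolution with unit weight.
            t += 1

    return adv * weights  # per-token advantages fed to the policy update
```

In this sketch, aligned tokens keep the unmodified trajectory advantage, while every token inside a detected deviation shares one advantage scaled by the spike that opened the segment.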
What carries the argument
Divergence signal between on-policy student and ground-truth-conditioned teacher, used to locate adaptive segment boundaries and modulate local advantage weights.
If this is right
- GEAR produces larger gains on benchmarks where standard GRPO accuracy is lower.
- The method maintains token-level credit where models stay aligned with the teacher and coarsens it only at detected deviations.
- Performance improvements hold across both 4B and 8B model sizes on eight different reasoning and agent benchmarks.
- Adaptive reweighting outperforms fixed-granularity alternatives such as pure token-level or turn-level credit assignment.
Where Pith is reading between the lines
- The same divergence-driven segmentation idea could transfer to credit assignment in non-language reinforcement learning tasks with long sequences.
- Combining GEAR with other reward-modeling or shaping methods might further stabilize training on complex agent workflows.
- Evaluating GEAR on trajectories substantially longer than those in the current benchmarks would test whether the adaptive boundaries remain effective at greater scales.
Load-bearing premise
The divergence signal between the on-policy student and the ground-truth teacher reliably marks the onset of semantic deviations that justify adaptive segment grouping.
What would settle it
Running the same benchmarks with GEAR's divergence-based boundaries replaced by random segment boundaries and finding no performance gain over GRPO would falsify the claim that the signal provides useful adaptive credit assignment.
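A sketch of how that control could be wired, assuming a hypothetical training harness that takes a boundary function; only the boundary rule differs between conditions, so any residual gain under random boundaries could not be credited to the divergence signal.

```python
import numpy as np

def divergence_boundaries(div, spike_thresh=1.0):
    """GEAR-style condition: segments open where the divergence signal spikes."""
    return np.flatnonzero(np.asarray(div) > spike_thresh)

def random_boundaries(div, rng, rate=0.1):
    """Control condition: boundaries placed uniformly at random, ignoring divergence."""
    return np.flatnonzero(rng.random(len(div)) < rate)

# Protocol: train matched runs that differ only in the boundary function, then
# compare benchmark accuracy.  GEAR's claim predicts
#   acc(divergence) > acc(random) ≈ acc(GRPO);
# if instead all three match, the divergence signal adds nothing.
rng = np.random.default_rng(0)
div = rng.normal(0.0, 0.6, size=200)                 # stand-in divergence trace
print(divergence_boundaries(div)[:5], random_boundaries(div, rng)[:5])
```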
Original abstract
Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents GEAR (Granularity-adaptivE Advantage Reweighting), a framework for improving credit assignment in reinforcement learning for LLM agents. By leveraging self-distillation, it compares on-policy rollouts from the student policy with those from a ground-truth-conditioned teacher to compute a divergence signal. This signal identifies adaptive segment boundaries where divergence spikes, presumed to indicate semantic deviations, and uses the spike magnitude to reweight the trajectory-level GRPO advantages for finer-grained policy updates. The approach is evaluated on eight benchmarks spanning mathematical reasoning and agentic tool-use tasks using Qwen3 4B and 8B models, showing consistent improvements over GRPO, self-distillation baselines, and fixed-granularity methods, with particularly notable gains (up to ~20%) on tasks with weaker baseline performance.
Significance. If the divergence-based segmentation accurately captures points of semantic departure rather than superficial variations, this method could advance credit assignment techniques for long-horizon agentic tasks by providing adaptive granularity without manual tuning. The reported empirical results across diverse benchmarks indicate potential for practical impact in post-training LLMs, especially in challenging scenarios where coarse rewards limit learning. The self-distillation approach that avoids external models and the emphasis on adaptive rather than fixed granularity are notable strengths.
major comments (2)
- §3.2: The central assumption that KL divergence spikes between the on-policy student and ground-truth-conditioned teacher reliably mark the onset of semantic deviations (as opposed to stylistic, tokenization, or low-impact rephrasing variations) is not directly validated. No precision/recall analysis against human-annotated error steps or other ground-truth deviation markers is reported, which is load-bearing for the claim that the resulting advantage modulation constitutes unbiased credit assignment.
- §4.3, Table 3: The reported gains (up to ~20% over GRPO) are presented without ablations that isolate the adaptive boundary detection and spike-based reweighting from the self-distillation component alone, nor with statistical significance tests or run-to-run variance, leaving open whether improvements are attributable to the proposed mechanism.
minor comments (2)
- Abstract: The acronym GRPO is used without expansion on first mention; clarify its full name (e.g., Group Relative Policy Optimization) for readers unfamiliar with the baseline.
- §2: The description of how the teacher is conditioned on ground truth could be expanded with a short pseudocode snippet to make the reference-guided divergence computation fully reproducible from the text.
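One illustrative form such a snippet could take, assuming a Hugging Face-style causal LM and the simplest possible conditioning (prepending the ground-truth answer to the prompt before rescoring the same rollout tokens); the template string and function names here are hypothetical, and the paper's actual conditioning may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reference_guided_divergence(model, tokenizer, prompt, rollout, ground_truth):
    """Per-token divergence between the student context and a ground-truth-
    conditioned context, both scored by the same policy (a sketch)."""
    roll_ids = tokenizer(rollout, return_tensors="pt",
                         add_special_tokens=False).input_ids

    def rollout_logprobs(context):
        ctx_ids = tokenizer(context, return_tensors="pt").input_ids
        ids = torch.cat([ctx_ids, roll_ids], dim=-1)
        logits = model(ids).logits[:, :-1, :]                 # next-token predictions
        logp = F.log_softmax(logits, dim=-1)
        per_tok = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return per_tok[:, ctx_ids.shape[-1] - 1:]             # rollout positions only

    logp_student = rollout_logprobs(prompt)                             # on-policy view
    logp_teacher = rollout_logprobs(prompt + "\nReference answer: " + ground_truth)
    return (logp_student - logp_teacher).squeeze(0)  # spikes mark candidate deviations
```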
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of validation and experimental rigor that we will address in the revision. Below we respond point-by-point to the major comments.
Point-by-point responses
Referee: §3.2: The central assumption that KL divergence spikes between the on-policy student and ground-truth-conditioned teacher reliably mark the onset of semantic deviations (as opposed to stylistic, tokenization, or low-impact rephrasing variations) is not directly validated. No precision/recall analysis against human-annotated error steps or other ground-truth deviation markers is reported, which is load-bearing for the claim that the resulting advantage modulation constitutes unbiased credit assignment.
Authors: We agree that direct validation via precision/recall against human-annotated semantic deviation points would provide stronger grounding for the assumption. However, producing reliable human annotations for long-horizon trajectories across our eight benchmarks would require substantial additional resources outside the current scope. In the revised manuscript we will add a qualitative analysis section with concrete examples of divergence spikes aligned with clear semantic errors (e.g., incorrect intermediate reasoning steps or erroneous tool selections), together with quantitative correlations between spike locations and dataset-specific error categories. We believe these additions, combined with the observed performance gains on challenging tasks, offer practical support for the utility of the signal while acknowledging the limitation of not providing full human-validated metrics. [Revision: partial]
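A minimal sketch of the promised spike-versus-annotation check, assuming hypothetical lists of detected spike positions and human-annotated error-onset tokens; the matching window is an illustrative choice rather than anything specified in the paper.

```python
def spike_precision_recall(spike_idx, error_idx, window=5):
    """Precision/recall of divergence spikes against annotated error onsets.

    A spike counts as a true positive if it lands within `window` tokens of
    some annotated error onset; an error counts as recovered if some spike does.
    """
    matched_spikes = {s for s in spike_idx
                      if any(abs(s - e) <= window for e in error_idx)}
    matched_errors = {e for e in error_idx
                      if any(abs(s - e) <= window for s in spike_idx)}
    precision = len(matched_spikes) / max(len(spike_idx), 1)
    recall = len(matched_errors) / max(len(error_idx), 1)
    return precision, recall

# Hypothetical example: spikes at tokens 12, 40, 77; annotated errors at 11 and 80.
print(spike_precision_recall([12, 40, 77], [11, 80]))   # -> (0.666..., 1.0)
```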
Referee: §4.3, Table 3: The reported gains (up to ~20% over GRPO) are presented without ablations that isolate the adaptive boundary detection and spike-based reweighting from the self-distillation component alone, nor with statistical significance tests or run-to-run variance, leaving open whether improvements are attributable to the proposed mechanism.
Authors: We concur that isolating the contribution of the adaptive boundary detection and reweighting, and reporting statistical details, would strengthen the experimental claims. In the revised manuscript we will add ablation experiments that compare the full GEAR framework against a self-distillation baseline that retains the teacher signal but removes the adaptive segmentation and spike-based reweighting. We will also rerun all main experiments across multiple random seeds, report mean and standard deviation, and include statistical significance tests (e.g., paired t-tests) where appropriate to demonstrate that the gains are robust. [Revision: yes]
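A sketch of the promised significance check, assuming hypothetical per-seed accuracies for paired GEAR and GRPO runs on one benchmark; scipy's paired t-test stands in for whatever test the authors ultimately use.

```python
from scipy import stats

# Hypothetical per-seed accuracies (same seeds, paired runs on one benchmark).
gear_acc = [0.62, 0.64, 0.61, 0.65, 0.63]
grpo_acc = [0.55, 0.57, 0.54, 0.58, 0.56]

t_stat, p_value = stats.ttest_rel(gear_acc, grpo_acc)   # paired t-test across seeds
mean_gain = sum(g - b for g, b in zip(gear_acc, grpo_acc)) / len(gear_acc)
print(f"mean gain = {mean_gain:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```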
Circularity Check
No significant circularity in GEAR derivation
full rationale
The paper proposes an empirical heuristic for adaptive credit assignment: divergence spikes between an on-policy student rollout and a ground-truth-conditioned teacher define segment boundaries, with the spike value modulating the GRPO-derived segment advantage. The teacher supplies an external reference signal independent of the on-policy trajectory; the modulation is a direct function of observed divergence rather than a fitted parameter renamed as prediction or a self-referential definition. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core mechanism. The method is validated by direct benchmark comparisons against GRPO and fixed-granularity baselines, keeping the central claim self-contained and falsifiable outside any internal fit.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: divergence spikes between the on-policy student and the ground-truth-conditioned teacher mark the onset of semantic deviations usable for credit grouping.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: echoes)
Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Matched passage: "GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries..." with the per-token divergence $r^{\mathrm{KL}}_t = \log \pi_\theta(a_t \mid s_t) - \log \pi_\theta(a_t \mid s_t^{\star})$.
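Restated as a display equation, with the symbol readings that the surrounding description suggests (the excerpt itself does not define them):

```latex
% Reference-guided divergence at token t (symbol readings inferred from context):
%   a_t     -- token sampled by the on-policy student
%   s_t     -- the student's own context (prompt plus tokens generated so far)
%   s_t^*   -- the same context conditioned on the ground-truth reference (the teacher view)
\[
  r^{\mathrm{KL}}_t \;=\; \log \pi_\theta\!\left(a_t \mid s_t\right)
                    \;-\; \log \pi_\theta\!\left(a_t \mid s_t^{\star}\right)
\]
```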
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.