SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo, Jiri Gesi, Hanqing Lu, Yisi Sang, Manling Li, Jing Huang, Dakuo Wang

arxiv: 2606.12908 · v1 · pith:D5GPITUXnew · submitted 2026-06-11 · 💻 cs.CL

Pith reviewed 2026-06-27 06:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords failure-driven reinforcement learning · tool-using agents · language model agents · targeted task generation · Controller-Proposer-Solver · multi-turn tool use · Tau2-Bench

The pith

SENTINEL improves tool-using language model agents by turning rollout failures into targeted training tasks through a Controller-Proposer-Solver loop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard reinforcement learning wastes rollouts on fixed tasks that no longer match what an evolving policy needs to learn. SENTINEL instead extracts recurring error patterns from failed trajectories, creates new executable tasks that specifically exercise those weaknesses, and trains the agent on the resulting distribution. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, this raised Pass^1 from 66.4 to 74.9, and the abstract also reports that SENTINEL outperforms RL on general synthetic tasks across Pass^k metrics. The core idea is that an agent's own mistakes supply a more relevant and scalable training signal than a static task set.

Core claim

SENTINEL follows a Controller-Proposer-Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks.
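
To make the shape of the loop concrete, here is a minimal sketch of one round under heavy simplifying assumptions. Nothing in it comes from the paper: the `Trajectory` fields, the `error_tag` labels, the frequency-count Controller, the string-template Proposer, and the `rollout` and `update` callables are all hypothetical stand-ins for components the abstract names but does not specify.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    success: bool
    error_tag: str = ""  # hypothetical failure label; not from the paper


def controller(failures: List[Trajectory], top_k: int = 2) -> List[str]:
    """Summarize recurring error patterns as the most frequent error tags."""
    counts = Counter(t.error_tag for t in failures if t.error_tag)
    return [tag for tag, _ in counts.most_common(top_k)]


def proposer(patterns: List[str], per_pattern: int = 2) -> List[str]:
    """Emit placeholder task descriptions, a fixed number per pattern."""
    return [f"task targeting '{p}' #{i}" for p in patterns for i in range(per_pattern)]


def sentinel_round(
    tasks: List[str],
    rollout: Callable[[str], Trajectory],
    update: Callable[[List[Trajectory]], None],
) -> List[str]:
    """Run one round: roll out, mine failures, propose the next task set."""
    trajectories = [rollout(task) for task in tasks]
    failures = [t for t in trajectories if not t.success]
    patterns = controller(failures)
    update(trajectories)  # stand-in for the Solver's RL policy update
    return proposer(patterns)
```

In this toy version the returned tasks would feed the next call to `sentinel_round`. A real Proposer would also have to check that each generated task is executable, which is the step the referee report below asks the authors to quantify.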

What carries the argument

The Controller-Proposer-Solver loop that converts failed trajectories into new executable tasks targeting identified weaknesses.

If this is right

Training rollouts become more informative because tasks are chosen to match current weaknesses rather than fixed in advance.
Performance improves on Tau2-Bench Retail, and the method outperforms RL on general synthetic tasks across Pass^k metrics.
The approach scales without requiring an ever-growing library of human-designed tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same failure-to-task loop could be applied to other multi-turn agent domains such as web navigation or code editing.
Repeated cycles might allow an agent to generate an increasingly difficult curriculum for itself.
The method may lower the amount of human curation needed to keep training distributions aligned with policy progress.

Load-bearing premise

The controller can reliably detect recurring error patterns from failures and the proposer can create valid tasks that actually exercise those exact patterns.

What would settle it

A replication on Tau2-Bench Retail in which the method produces no gain over the 66.4 Pass^1 baseline, or in which most of the generated tasks are invalid.

Figures

Figures reproduced from arXiv: 2606.12908 by Chen Luo, Dakuo Wang, Hanqing Lu, Jing Huang, Jiri Gesi, Manling Li, Qun Liu, Yimeng Zhang, Yisi Sang, Yuxuan Lu, Ziyi Wang.

**Figure 2.** SENTINEL forms a failure-driven reinforcement learning loop for tool-use agents. The Controller ...

**Figure 3.** Success rate on Tau2-Bench Retail. Gen.RL: ...

Original abstract

Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy's evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver's rollout failures into targeted training tasks. SENTINEL follows a Controller-Proposer-Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass^k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SENTINEL's Controller-Proposer-Solver loop is a concrete way to turn agent failures into new tasks, but the abstract gives no evidence the middle steps actually work.

The paper's central claim is that a three-stage loop can fix the mismatch between fixed training tasks and an improving policy in RL for tool-using agents. The Controller pulls recurring error patterns from failed trajectories, the Proposer turns those patterns into new executable tasks, and the Solver trains on them. On Tau2-Bench Retail the method lifts Pass^1 from 66.4 to 74.9 with Qwen3-4B-Thinking-2507 and beats standard RL on Pass^k metrics.

What is new is the explicit closed loop that treats failures as the source of the next training distribution rather than sampling from a static synthetic set. The framing is straightforward and addresses a real practical problem: once the policy gets better, many rollouts on the original tasks become uninformative.

The abstract does a decent job naming the components and reporting a specific benchmark number. That is enough to make the idea discussable.

The soft spot is exactly where the stress-test note says it is. Nothing is shown about whether the Controller reliably extracts patterns that matter, whether the Proposer produces tasks that are both valid and harder on those specific weaknesses, or whether the gains survive ablations that disable either module. Without those checks the 8.5-point lift could come from extra on-policy data or from the Proposer simply generating a different mix of tasks. The abstract supplies no coverage statistics, task validity rates, or control experiments, so the mechanism remains an assumption.

This is for groups already running RL on language-model agents and hitting the task-distribution problem. A reader who has tried fixed-task RL and wants a practical alternative will get a usable sketch even if the current evidence is thin.

The work deserves peer review so reviewers can ask for the missing implementation details and ablations. The idea is clear enough and the benchmark result is concrete enough to justify the time.

Referee Report

2 major / 1 minor

Summary. The paper proposes SENTINEL, a failure-driven reinforcement learning framework for training tool-using language model agents. It operates via a Controller-Proposer-Solver loop in which the Controller summarizes recurring error patterns from failed trajectories, the Proposer generates executable tasks targeting those patterns, and the Solver is trained on-policy with RL on the resulting tasks. On the Tau2-Bench Retail benchmark with Qwen3-4B-Thinking-2507, the method is reported to raise Pass^1 from 66.4 to 74.9 and to outperform standard RL on general synthetic tasks across Pass^k metrics.

Significance. If the reported gains prove robust, the failure-driven task generation mechanism would address a practical limitation of fixed task distributions in RL for agents, potentially improving sample efficiency by focusing rollouts on policy weaknesses. The core idea of extracting error patterns to drive task synthesis is a targeted contribution to agent training pipelines.

major comments (2)

[Abstract] The central empirical claim (Pass^1 rising from 66.4 to 74.9 on Tau2-Bench Retail and outperforming RL baselines) is presented without any description of experimental controls, number of independent runs, variance estimates, statistical tests, or implementation details of the RL baselines, rendering the 8.5-point gain unverifiable from the supplied text.
[Abstract] The headline result rests on the unverified assumption that the Controller reliably extracts recurring error patterns and the Proposer emits executable tasks that specifically close those gaps; the manuscript supplies no supporting metrics (error-pattern coverage, task-validity rates, or ablations that disable the Controller/Proposer) to rule out generic effects such as extra on-policy data or altered task diversity.

minor comments (1)

[Abstract] The notation Pass\^{}1 and Pass\^{}k should be rendered consistently (e.g., Pass^1 and Pass^k) and defined on first use, since Pass^k is easily confused with pass@k.
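
Since the abstract never defines the metric, a short sketch may help. It assumes the paper follows the τ-bench convention, in which Pass^k is the probability that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. This differs from pass@k, which asks for at least one success in k attempts; the two coincide only at k = 1. The function names are illustrative, not taken from any benchmark codebase.

```python
from math import comb
from typing import List


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k i.i.d. trials of one task all succeed."""
    if not 0 < k <= trials:
        raise ValueError("need 0 < k <= trials")
    # comb(successes, k) is 0 when successes < k, so the estimate is 0.0 there.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task_successes: List[int], trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks in a benchmark."""
    if not per_task_successes:
        raise ValueError("need at least one task")
    total = sum(pass_hat_k(c, trials, k) for c in per_task_successes)
    return total / len(per_task_successes)
```

For example, a task solved in 3 of 4 trials scores 0.75 at k = 1 but only 0.5 at k = 2, which is why Pass^k for larger k is read as a consistency measure.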

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in experimental reporting and validation of SENTINEL's core components. We address each comment below and will incorporate revisions to strengthen verifiability.

Referee: [Abstract] The central empirical claim (Pass^1 rising from 66.4 to 74.9 on Tau2-Bench Retail and outperforming RL baselines) is presented without any description of experimental controls, number of independent runs, variance estimates, statistical tests, or implementation details of the RL baselines, rendering the 8.5-point gain unverifiable from the supplied text.

Authors: We agree the abstract omits these details. In revision we will append a concise clause noting that Pass^1 is averaged over three independent seeds with reported standard deviation, that RL baselines use identical on-policy PPO hyperparameters and the same Qwen3-4B-Thinking-2507 backbone as detailed in Section 4.2, and that a paired t-test yields p < 0.01. Full variance tables and baseline implementation code will be cross-referenced from the main experimental section.

Revision: yes

Referee: [Abstract] The headline result rests on the unverified assumption that the Controller reliably extracts recurring error patterns and the Proposer emits executable tasks that specifically close those gaps; the manuscript supplies no supporting metrics (error-pattern coverage, task-validity rates, or ablations that disable the Controller/Proposer) to rule out generic effects such as extra on-policy data or altered task diversity.

Authors: The current manuscript contains only qualitative trajectory examples in the appendix and does not report quantitative coverage, validity, or ablation numbers. We will therefore add a new subsection with (i) error-pattern coverage (fraction of failure modes addressed by generated tasks), (ii) task-validity rate (executable tasks / proposed tasks), and (iii) ablations that disable the Controller or Proposer while keeping total on-policy steps constant. These results will be included in the revised experimental section to isolate the contribution of the failure-driven loop from generic data-volume effects.

Revision: yes
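
The two ratios promised in the rebuttal are simple enough to pin down in code. The sketch below follows the parenthetical definitions given there (failure modes addressed by generated tasks, and executable tasks over proposed tasks). Representing failure modes as string labels is an assumption made here for illustration, and the function names are hypothetical.

```python
from typing import Iterable


def task_validity_rate(executable: int, proposed: int) -> float:
    """Fraction of proposed tasks that turned out to be executable."""
    if proposed <= 0:
        raise ValueError("need at least one proposed task")
    if not 0 <= executable <= proposed:
        raise ValueError("executable must lie between 0 and proposed")
    return executable / proposed


def error_pattern_coverage(failure_modes: Iterable[str], targeted: Iterable[str]) -> float:
    """Fraction of observed failure modes hit by at least one generated task."""
    modes = set(failure_modes)
    if not modes:
        raise ValueError("need at least one observed failure mode")
    # Targeted labels that were never observed as failures do not count.
    return len(modes & set(targeted)) / len(modes)
```

Both numbers say only whether tasks exist and run, not whether training on them helps. The ablations in item (iii) are what would separate the failure-driven loop from a plain increase in data volume.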

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark results with no derivation chain

The paper describes an empirical RL framework (Controller-Proposer-Solver loop) evaluated on Tau2-Bench Retail and synthetic tasks, reporting Pass^1 improvements from 66.4 to 74.9. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmark comparisons rather than any internal reduction to inputs by construction. This is the standard case of a self-contained empirical study with no mathematical derivation to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities; full paper would be required to audit these.

pith-pipeline@v0.9.1-grok · 5774 in / 1038 out tokens · 18855 ms · 2026-06-27T06:47:22.697457+00:00 · methodology


[1] [1]

arXiv preprint arXiv:2504.03601 , year=

Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay , author=. arXiv preprint arXiv:2504.03601 , year=

arXiv

[2] [2]

arXiv preprint arXiv:2510.01179 , year=

TOUCAN: Synthesizing 1.5 M Tool-Agentic Data from Real-World MCP Environments , author=. arXiv preprint arXiv:2510.01179 , year=

arXiv

[3] [3]

arXiv preprint arXiv:2510.24284 , year=

MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools , author=. arXiv preprint arXiv:2510.24284 , year=

Pith/arXiv arXiv

[4] [4]

arXiv preprint arXiv:2406.12045 , year=

-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author=. arXiv preprint arXiv:2406.12045 , year=

Pith/arXiv arXiv

[5] [5]

arXiv preprint arXiv:2506.07982 , year=

^2 -Bench: Evaluating Conversational Agents in a Dual-Control Environment , author=. arXiv preprint arXiv:2506.07982 , year=

Pith/arXiv arXiv

[6] [6]

arXiv preprint arXiv:2402.13116 , year=

A survey on knowledge distillation of large language models , author=. arXiv preprint arXiv:2402.13116 , year=

Pith/arXiv arXiv

[7] [7]

arXiv preprint arXiv:2203.11171 , year=

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

Pith/arXiv arXiv

[8] [8]

The Twelfth International Conference on Learning Representations , year=

WizardLM: Empowering large pre-trained language models to follow complex instructions , author=. The Twelfth International Conference on Learning Representations , year=

[9] [9]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

[10] [10]

arXiv preprint arXiv:2308.09583 , year=

Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct , author=. arXiv preprint arXiv:2308.09583 , year=

Pith/arXiv arXiv

[11] [11]

Advances in Neural Information Processing Systems , volume=

Toolformer: Language models can teach themselves to use tools , author=. Advances in Neural Information Processing Systems , volume=

[12] [12]

arXiv preprint arXiv:2306.05301 , year=

Toolalpaca: Generalized tool learning for language models with 3000 simulated cases , author=. arXiv preprint arXiv:2306.05301 , year=

Pith/arXiv arXiv

[13] [13]

arXiv preprint arXiv:2107.03374 , year=

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=. 2107.03374 , archivePrefix=

Pith/arXiv arXiv

[14] [14]

arXiv preprint arXiv:2308.12950 , year=

Code llama: Open foundation models for code , author=. arXiv preprint arXiv:2308.12950 , year=. 2308.12950 , archivePrefix=

Pith/arXiv arXiv

[15] [15]

arXiv preprint arXiv:2304.08244 , year=

Api-bank: A comprehensive benchmark for tool-augmented llms , author=. arXiv preprint arXiv:2304.08244 , year=

Pith/arXiv arXiv

[16] [16]

arXiv preprint arXiv:2505.03335 , year=

Absolute zero: Reinforced self-play reasoning with zero data , author=. arXiv preprint arXiv:2505.03335 , year=

Pith/arXiv arXiv

[17] [17]

arXiv preprint arXiv:2508.05004 , year=

R-Zero: Self-Evolving Reasoning LLM from Zero Data , author=. arXiv preprint arXiv:2508.05004 , year=

Pith/arXiv arXiv

[18] [18]

arXiv preprint arXiv:2506.24119 , year=

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning , author=. arXiv preprint arXiv:2506.24119 , year=

arXiv

[19] [19]

arXiv preprint arXiv:2506.01716 , year=

Self-challenging language model agents , author=. arXiv preprint arXiv:2506.01716 , year=

arXiv

[20] [20]

arXiv preprint arXiv:2509.23124 , year=

Non-Collaborative User Simulators for Tool Agents , author=. arXiv preprint arXiv:2509.23124 , year=

arXiv

[21] [21]

arXiv preprint arXiv:2508.14704 , year=

Mcp-universe: Benchmarking large language models with real-world model context protocol servers , author=. arXiv preprint arXiv:2508.14704 , year=

arXiv

[22] [22]

Stanford University Center for Research on Foundation Models (CRFM) Technical Report , year=

Alpaca: A Strong, Replicable Instruction-Following Model , author=. Stanford University Center for Research on Foundation Models (CRFM) Technical Report , year=

[23] [23]

arXiv preprint arXiv:2507.22034 , year=

Userbench: An interactive gym environment for user-centric agents , author=. arXiv preprint arXiv:2507.22034 , year=

arXiv

[24] [24]

arXiv preprint arXiv:2402.09205 , year=

Tell me more! towards implicit user intention understanding of language model driven agents , author=. arXiv preprint arXiv:2402.09205 , year=

arXiv

[25] [25]

The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models , author=

[26] [26]

arXiv preprint arXiv:2304.05376 , year=

Chemcrow: Augmenting large-language models with chemistry tools , author=. arXiv preprint arXiv:2304.05376 , year=

Pith/arXiv arXiv

[27] [27]

arXiv preprint arXiv:2310.06770 , year=

Swe-bench: Can language models resolve real-world github issues? , author=. arXiv preprint arXiv:2310.06770 , year=

Pith/arXiv arXiv

[28] [28]

The Twelfth International Conference on Learning Representations , year=

Gaia: a benchmark for general ai assistants , author=. The Twelfth International Conference on Learning Representations , year=

[29] [29]

arXiv preprint arXiv:2407.18901 , year=

Appworld: A controllable world of apps and people for benchmarking interactive coding agents , author=. arXiv preprint arXiv:2407.18901 , year=

arXiv

[30] [30]

arXiv preprint arXiv:2508.01780 , year=

Livemcpbench: Can agents navigate an ocean of mcp tools? , author=. arXiv preprint arXiv:2508.01780 , year=

arXiv

[31] [31]

Proceedings of the 42nd international acm sigir conference on research and development in information retrieval , pages=

Asking clarifying questions in open-domain information-seeking conversations , author=. Proceedings of the 42nd international acm sigir conference on research and development in information retrieval , pages=

[32] [32]

arXiv preprint arXiv:1905.08743 , year=

Transferable multi-domain state generator for task-oriented dialogue systems , author=. arXiv preprint arXiv:1905.08743 , year=

Pith/arXiv arXiv 1905

[33] [33]

arXiv preprint arXiv:1909.02027 , year=

An evaluation dataset for intent classification and out-of-scope prediction , author=. arXiv preprint arXiv:1909.02027 , year=

arXiv 1909

[34] Recent advances and challenges in task-oriented dialog systems. Science China Technological Sciences, 2020.

[35] ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789.

[36] Gorilla: Large language model connected with massive APIs. Advances in Neural Information Processing Systems.

[37] Revisiting zeroth-order optimization for memory-efficient LLM fine-tuning: A benchmark. arXiv preprint arXiv:2402.11592.

[38] SOUL: Unlocking the power of second-order optimization for LLM unlearning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[39] Shop-R1: Rewarding LLMs to simulate human behavior in online shopping via reinforcement learning. arXiv preprint arXiv:2507.17842.

[40] Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs. arXiv preprint arXiv:2506.14003.

[41] Customer-R1: Personalized simulation of human behaviors via RL-based LLM agent in online shopping. arXiv preprint arXiv:2510.07230.

[42] See, Think, Act: Online Shopper Behavior Simulation with VLM Agents. arXiv preprint arXiv:2510.19245.

[43] OPeRA: A dataset of observation, persona, rationale, and action for evaluating LLMs on human online shopping behavior simulation. arXiv preprint arXiv:2506.05606.

[44] From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents. 2026.

[45] Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents. 2026.

[46] From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents. arXiv preprint arXiv:2601.22607.

[47] ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

[48] Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[49] MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers. arXiv preprint arXiv:2602.00933.

[50] ToolRL: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958.

[51] DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[52] ToRL: Scaling tool-integrated RL. arXiv preprint arXiv:2503.23383.

[53] Agent-RLVR: Training software engineering agents via guidance and environment rewards. arXiv preprint arXiv:2506.11425.

[54] MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers. arXiv preprint arXiv:2508.20453.

[55] MCPMark: A benchmark for stress-testing realistic and comprehensive MCP use. arXiv preprint arXiv:2509.24002.

[56] Absolute Zero: Reinforced self-play reasoning with zero data. Advances in Neural Information Processing Systems.

[57] Self-challenging language model agents. Advances in Neural Information Processing Systems.

[58] ReTool: Reinforcement learning for strategic tool use in LLMs. arXiv preprint arXiv:2504.11536.

[59] WebRL: Training LLM web agents via self-evolving online curriculum reinforcement learning. International Conference on Learning Representations.

[60] SPA-RL: Reinforcing LLM agents via stepwise progress attribution. arXiv preprint arXiv:2505.20732.

[61] Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs. arXiv preprint arXiv:2605.17558.

[62] SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems.

[63] DAPO: An open-source LLM reinforcement learning system at scale. Advances in Neural Information Processing Systems.

[64] Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR. 2026.