Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin; Dong Wang; Hamed Zamani; Hansi Zeng; Jiawei Han; Jinsung Yoon; Sercan Arik; Zhenrui Yue

arXiv: 2503.09516 · v5 · submitted 2025-03-12 · cs.CL · cs.AI · cs.IR

Pith reviewed 2026-05-11 06:41 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.IR

keywords: Search-R1, reinforcement learning, LLM reasoning, search engines, retrieval-augmented generation, question answering, multi-turn interaction

The pith

LLMs trained with reinforcement learning learn to generate and use search queries during step-by-step reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that reinforcement learning can teach large language models to decide on their own when and what to search for while building an answer, rather than depending on fixed prompts or external instructions. By masking retrieved tokens during training and scoring only the final answer, the model discovers multi-turn search strategies that integrate fresh information into its chain of thought. This matters because current prompting approaches often leave models unable to use search engines effectively, resulting in outdated or incomplete reasoning on knowledge-intensive tasks. Experiments across seven question-answering datasets show consistent gains over standard retrieval-augmented baselines, with larger improvements on the 7B model than the 3B model.

Core claim

Search-R1 applies reinforcement learning to reasoning trajectories so that the LLM autonomously emits search queries at chosen points, receives real-time retrieval results, and continues reasoning. The retrieved tokens are masked out of the training loss to stabilize training, and an outcome-based reward reinforces trajectories that reach correct final answers. The paper reports improvements of 41 percent for Qwen2.5-7B and 20 percent for Qwen2.5-3B over comparable RAG baselines on seven QA datasets, along with observations about response-length dynamics and the effects of different RL optimizers.
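
The interleaved generate-and-retrieve loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `<search>`, `<information>`, and `<answer>` tag names, the `generate` and `retrieve` callables, and the `max_turns` default are all assumptions introduced here, since this summary does not specify them.

```python
import re
from typing import Callable, List, Tuple

# Tag names are assumptions made for this sketch; the summary does not specify them.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rollout(
    question: str,
    generate: Callable[[str], str],        # hypothetical: returns the next model-written segment
    retrieve: Callable[[str], List[str]],  # hypothetical: returns passages for a query
    max_turns: int = 4,                    # hypothetical cap on generation turns
) -> Tuple[List[Tuple[str, bool]], str]:
    """Interleave generation and retrieval until an answer appears.

    Returns (segments, answer); each segment is (text, is_model_generated).
    The answer is an empty string if no answer tag was produced.
    """
    segments: List[Tuple[str, bool]] = []
    context = question
    for _ in range(max_turns):
        out = generate(context)
        segments.append((out, True))
        context += out
        answer = ANSWER_RE.search(out)
        if answer:
            return segments, answer.group(1).strip()
        query = SEARCH_RE.search(out)
        if query is None:
            break  # the model neither searched nor answered; stop the rollout
        passages = retrieve(query.group(1).strip())
        info = "<information>" + " ".join(passages) + "</information>"
        segments.append((info, False))  # retrieved text, not written by the model
        context += info
    return segments, ""
```

The boolean carried with each segment records which text the model wrote and which text came from the retriever, which is the information a masking step would need at training time.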

What carries the argument

Multi-turn search interactions optimized by outcome-based RL rewards and retrieved-token masking, which lets the model learn when to query without query-level supervision.
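
To make retrieved-token masking concrete, here is a minimal sketch that uses a plain REINFORCE-style objective as a stand-in for whichever RL optimizer the paper actually uses. The function name, the per-token log-probability inputs, and the scalar `advantage` are assumptions for illustration only.

```python
from typing import List


def masked_policy_loss(
    token_logprobs: List[float],
    is_model_token: List[bool],
    advantage: float,
) -> float:
    """REINFORCE-style loss averaged over model-generated tokens only.

    Tokens copied in from the retriever (mask False) contribute nothing,
    so the policy is never trained to reproduce retrieved text.
    """
    if len(token_logprobs) != len(is_model_token):
        raise ValueError("log-probs and mask must have the same length")
    kept = [lp for lp, keep in zip(token_logprobs, is_model_token) if keep]
    if not kept:
        return 0.0  # nothing the model wrote, so there is nothing to optimize
    return -advantage * sum(kept) / len(kept)
```

For example, with log-probabilities `[-1.0, -5.0, -3.0]`, mask `[True, False, True]`, and advantage `1.0`, only the first and third tokens count and the loss is `2.0`.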

If this is right

Models learn to interleave search calls at useful moments inside long reasoning chains rather than only at the start.
Outcome-only rewards suffice to shape useful retrieval behavior across multiple turns.
Smaller models still show gains, though smaller than those for larger models under identical training.
Response length and search frequency change systematically as training proceeds.
The same RL setup supplies empirical comparisons among optimizers and model scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to other external tools such as calculators or code interpreters if similar masking and outcome rewards are used.
Learned search timing might reduce unnecessary retrievals and shorten inference latency once the policy stabilizes.
Because the method requires no human preference data, it may scale to new domains where only final-answer correctness is available.
The observed response-length dynamics suggest a possible trade-off between exploration of searches and concise final answers that future work could tune explicitly.

Load-bearing premise

That scoring only the final answer plus token masking is enough for the model to discover effective multi-turn search behavior without any additional human or query-level signals.
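
A reward that scores only the final answer might look like the sketch below. Exact match after light normalization is an assumption made here for illustration; this summary says only that the reward is outcome-based and tied to final-answer correctness, and the function names are hypothetical.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def outcome_reward(predicted: str, gold_answers: List[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    pred = normalize(predicted)
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0
```

Nothing in this function looks at the search queries or the intermediate reasoning, which is exactly why the premise is load-bearing: any useful search behavior has to emerge from this single end-of-trajectory signal.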

What would settle it

Retraining the same base models with the masking and outcome reward removed, or replaced by standard next-token prediction, and measuring whether the performance gap over RAG baselines disappears on the same seven datasets.

Original abstract

Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability on how to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Search-R1, an RL extension for training LLMs to autonomously generate multiple search queries during step-by-step reasoning with real-time retrieval. It uses retrieved token masking to stabilize training and a simple outcome-based reward (final answer correctness). Experiments on seven QA datasets report gains of 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over RAG baselines, plus empirical insights on RL methods, model scale, and response length; code and checkpoints are released publicly.

Significance. If the gains are shown to arise from genuinely improved multi-turn search policies rather than confounds, the work would advance retrieval-augmented reasoning by demonstrating that outcome-only RL plus masking can suffice without process supervision. Public code and checkpoints are a clear strength for reproducibility.

major comments (2)

[Experiments] Experiments section: the headline gains (41% and 20%) are reported without run-to-run variance, number of seeds, statistical significance tests, or explicit confirmation that retrieval corpus, top-k, and maximum turns are identical between Search-R1 and all RAG baselines; this is load-bearing for the claim that the RL policy itself drives the improvement.
[Method] Method and RL optimization sections: the combination of sparse outcome reward and retrieved-token masking is asserted to let the model discover effective multi-turn search without query-level supervision, yet no trajectory analysis, search-frequency ablations, or checks for over-searching/reward hacking are provided to substantiate that the learned behavior is optimal rather than lucky or hacky.

minor comments (2)

[Abstract] The abstract lists seven datasets but does not name them; adding the list would improve immediate readability.
[Empirical Insights] Response-length dynamics are mentioned as an insight but lack a dedicated figure or table reference in the provided summary; ensure all claimed analyses have clear visual or tabular support.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.

Point-by-point responses

Referee: [Experiments] Experiments section: the headline gains (41% and 20%) are reported without run-to-run variance, number of seeds, statistical significance tests, or explicit confirmation that retrieval corpus, top-k, and maximum turns are identical between Search-R1 and all RAG baselines; this is load-bearing for the claim that the RL policy itself drives the improvement.

Authors: We agree that variance reporting and explicit confirmation of identical settings are important for validating the claims. In the revised manuscript, we will report results averaged over 3 random seeds with standard deviations and include statistical significance tests (e.g., paired t-tests) against the RAG baselines. We will also add an explicit statement in the experimental setup confirming that the retrieval corpus, top-k value, and maximum turns are identical across Search-R1 and all baselines, as implemented in the released code. This directly supports that the improvements arise from the learned RL policy. revision: yes
Referee: [Method] Method and RL optimization sections: the combination of sparse outcome reward and retrieved-token masking is asserted to let the model discover effective multi-turn search without query-level supervision, yet no trajectory analysis, search-frequency ablations, or checks for over-searching/reward hacking are provided to substantiate that the learned behavior is optimal rather than lucky or hacky.

Authors: We acknowledge that additional analyses would provide stronger evidence for the optimality of the learned policy. The manuscript already includes empirical insights on response length dynamics, which help rule out trivial over-searching as the source of gains. In the revision, we will add qualitative examples of multi-turn search trajectories, an ablation varying search frequency (via modified rewards), and further checks on response patterns to address potential reward hacking. These additions will better substantiate that the model discovers effective search behaviors. revision: yes
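
The paired t-test mentioned in the first response can be sketched with the standard library alone. This is an illustrative helper, not part of the released code: the function name is hypothetical, and the inputs are assumed to be paired scores (for example, one score per dataset or per seed, in the same order for both systems).

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(system: Sequence[float], baseline: Sequence[float]) -> float:
    """t statistic for paired scores (same datasets or seeds, same order)."""
    if len(system) != len(baseline) or len(system) < 2:
        raise ValueError("need at least two paired scores of equal length")
    diffs = [s - b for s, b in zip(system, baseline)]
    # Standard error of the mean difference, using the sample standard deviation.
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / std_err
```

The statistic has `len(system) - 1` degrees of freedom. Turning it into a p-value requires the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` computes both in one call. With only three seeds the test has little power, so the standard deviations are worth reporting alongside it.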

Circularity Check

0 steps flagged

No circularity: purely empirical RL method with no derivation chain

Full rationale

The paper introduces Search-R1 as an RL training procedure for LLMs to generate search queries during reasoning, using outcome-based rewards and retrieved-token masking. It reports experimental results on seven QA datasets showing gains over RAG baselines, plus empirical insights on optimization and response lengths. No first-principles derivation, theorem, or prediction is claimed that could reduce to its own inputs by construction; the work is self-contained as a procedural method validated by public code and external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions plus the domain assumption that outcome rewards suffice to shape search behavior.

axioms (1)

domain assumption: Outcome-based reward is sufficient to optimize search query generation and retrieval use.
Paper explicitly uses a simple outcome-based reward function without query-level signals.

pith-pipeline@v0.9.0 · 5531 in / 1036 out tokens · 44972 ms · 2026-05-11T06:41:26.119835+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel · relation: unclear

"we adopt a straightforward outcome-based reward function, avoiding the complexity of process-based rewards. Our results demonstrate that this minimal reward design is effective in search-and-reasoning scenarios."

Foundation.LedgerForcing conservation_from_balance · relation: unclear

"SEARCH-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training"

Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · relation: unclear

"Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 accept novelty 8.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, with evaluations showing direct QA at 66.4%, best practical agents at 79.1%, and oracle knowledge at 95.4%.
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
cs.AI 2026-04 accept novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
cs.AI 2026-05 unverdicted novelty 7.0

Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 unverdicted novelty 7.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practica...
AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
cs.LG 2026-05 unverdicted novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
cs.AI 2026-05 unverdicted novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
cs.AI 2026-05 unverdicted novelty 7.0

CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
cs.AI 2026-05 unverdicted novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
cs.CL 2026-05 unverdicted novelty 7.0

DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
LLM Agents Already Know When to Call Tools -- Even Without Reasoning
cs.CL 2026-05 conditional novelty 7.0

LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.
LLM Agents Already Know When to Call Tools -- Even Without Reasoning
cs.CL 2026-05 accept novelty 7.0

LLM agents encode tool necessity in pre-generation hidden states with high linear decodability (AUROC 0.89-0.96); Probe&Prefill uses this to reduce tool calls 48% with 1.7% accuracy loss.
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
cs.AI 2026-05 unverdicted novelty 7.0

SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.
AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design
cs.AI 2026-05 unverdicted novelty 7.0

AHD Agent trains a 4B-parameter LLM via agentic RL to actively use tools for automatic heuristic design, matching or exceeding larger baselines across eight domains with fewer evaluations.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 7.0

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 7.0

AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
cs.LG 2026-05 unverdicted novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
cs.AI 2026-05 unverdicted novelty 7.0

AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.
Inference-Time Budget Control for LLM Search Agents
cs.AI 2026-05 unverdicted novelty 7.0

A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
Skill Retrieval Augmentation for Agentic AI
cs.CL 2026-04 unverdicted novelty 7.0

Introduces the SRA paradigm and SRA-Bench benchmark showing retrieval-based skill augmentation improves agent performance but skill incorporation remains a bottleneck regardless of retrieval quality.
Skill Retrieval Augmentation for Agentic AI
cs.CL 2026-04 unverdicted novelty 7.0

Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.
From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning
cs.AI 2026-04 unverdicted novelty 7.0

MAGEO is a multi-agent system that distills validated editing patterns into reusable optimization skills for generative engines, outperforming heuristic baselines on visibility and fidelity via a new benchmark and eva...
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
cs.CL 2026-04 unverdicted novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
Latent Abstraction for Retrieval-Augmented Generation
cs.CL 2026-04 unverdicted novelty 7.0

LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA...
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
cs.CL 2026-04 unverdicted novelty 7.0

Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
cs.LG 2026-04 unverdicted novelty 7.0

RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
cs.IR 2026-04 unverdicted novelty 7.0

A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation
cs.CL 2026-04 unverdicted novelty 7.0

Deep-Reporter introduces a unified agentic framework for grounded multimodal long-form generation via multimodal search, checklist-guided synthesis, and recurrent context management, plus the M2LongBench benchmark.
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-...
Retrieval Augmented Conversational Recommendation with Reinforcement Learning
cs.IR 2026-04 unverdicted novelty 7.0

RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems
cs.IR 2026-04 unverdicted novelty 7.0

Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.
Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate
cs.CL 2026-01 unverdicted novelty 7.0

SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.
Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning
cs.CL 2026-01 conditional novelty 7.0

Temp-R1 uses reverse curriculum reinforcement learning to train an autonomous agent that achieves state-of-the-art results on temporal KGQA benchmarks by developing sophisticated reasoning on hard questions first.
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

VideoDR is a new benchmark for open-web video deep research that tests multimodal models on cross-frame visual anchor extraction, interactive retrieval, and multi-hop reasoning over joint video-web evidence.
Training Multi-Image Vision Agents via End2End Reinforcement Learning
cs.CV 2025-12 unverdicted novelty 7.0

IMAgent trains a multi-image vision agent via pure end-to-end RL with visual reflection tools and a two-layer motion trajectory masking strategy, reaching SOTA on single- and multi-image benchmarks while revealing too...
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
cs.CL 2025-11 unverdicted novelty 7.0

Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training
cs.LG 2025-11 unverdicted novelty 7.0

Q-RAG trains embedders via RL for multi-step retrieval and reports state-of-the-art results on BabiLong and RULER benchmarks for contexts up to 10M tokens.
MMSearch-R1: Incentivizing LMMs to Search
cs.CV 2025-06 unverdicted novelty 7.0

MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...
Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
cs.CL 2026-05 unverdicted novelty 6.0

DeferMem decouples memory QA into high-recall retrieval and RL-based query-conditioned evidence distillation, outperforming baselines on LoCoMo and LongMemEval-S with highest accuracy, fastest runtime, and zero API to...
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
cs.LG 2026-05 unverdicted novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and...
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
cs.AI 2026-05 unverdicted novelty 6.0

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
cs.AI 2026-05 unverdicted novelty 6.0

SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
Harnessing LLM Agents with Skill Programs
cs.AI 2026-05 conditional novelty 6.0

HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning be...
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
cs.CL 2026-05 unverdicted novelty 6.0

RubricEM uses rubric-guided stagewise policy decomposition and reflection-based meta-policy evolution to improve long-horizon research agents beyond verifiable rewards.
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
cs.AI 2026-05 unverdicted novelty 6.0

ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
cs.AI 2026-05 unverdicted novelty 6.0

ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.
Beyond Thinking: Imagining in 360° for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
cs.AI 2026-05 unverdicted novelty 6.0

SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

SkillMaster is a training framework that lets LLM agents autonomously propose, update, and apply skills, yielding 8.8% and 9.3% higher success rates on ALFWorld and WebShop than prior methods.
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
cs.AI 2026-05 unverdicted novelty 6.0

A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.