ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu · 2025 · cs.AI · arXiv 2503.19470

Canonical reference. 70% of citing Pith papers cite this work as background.

39 Pith papers citing it

abstract

Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.

citation-role summary

background 8 · baseline 1 · other 1

citation-polarity summary

background 7 · unclear 2 · baseline 1

representative citing papers

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

cs.AI · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.

Latent Abstraction for Retrieval-Augmented Generation

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA benchmarks with better efficiency.

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-hop tasks.

Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

cs.RO · 2026-03-14 · unverdicted · novelty 7.0

PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.

Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

cs.CL · 2026-01-29 · unverdicted · novelty 7.0

SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.

MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation

cs.LG · 2025-11-11 · unverdicted · novelty 7.0

MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.

Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training

cs.LG · 2025-11-10 · unverdicted · novelty 7.0

Q-RAG trains embedders via RL for multi-step retrieval and reports state-of-the-art results on BabiLong and RULER benchmarks for contexts up to 10M tokens.

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

cs.CL · 2025-11-04 · unverdicted · novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

cs.CL · 2025-10-09 · unverdicted · novelty 7.0

HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.

MMSearch-R1: Incentivizing LMMs to Search

cs.CV · 2025-06-25 · unverdicted · novelty 7.0

MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

SD-Search derives step-level supervision for search queries in reasoning agents via on-policy hindsight self-distillation using the policy as both student and teacher.

Harnessing LLM Agents with Skill Programs

cs.AI · 2026-05-18 · conditional · novelty 6.0

HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

cs.AI · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.

DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue

cs.CL · 2026-03-06 · unverdicted · novelty 6.0

MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization for emotional support dialogues.

The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination

cs.LG · 2025-10-27 · conditional · novelty 6.0

Strengthening LLM reasoning through RL, SFT, or chain-of-thought prompting increases tool hallucination rates on SimpleToolHalluBench, with a reliability-capability trade-off observed across mitigation attempts.

ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

cs.CL · 2025-10-01 · unverdicted · novelty 6.0

ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

cs.AI · 2025-09-02 · accept · novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation

cs.CL · 2025-05-28 · unverdicted · novelty 6.0

MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

cs.CL · 2025-04-30 · unverdicted · novelty 6.0

WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.

Supervising the search process produces reliable and generalizable information-seeking agents

cs.CL · 2025-02-19 · unverdicted · novelty 6.0

Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.

citing papers explorer

Showing 39 of 39 citing papers.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain cs.AI · 2026-05-18 · unverdicted · none · ref 65 · 2 links · internal anchor
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation cs.AI · 2026-05-13 · unverdicted · none · ref 2 · internal anchor
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers cs.LG · 2026-05-06 · unverdicted · none · ref 55 · internal anchor
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
Latent Abstraction for Retrieval-Augmented Generation cs.CL · 2026-04-20 · unverdicted · none · ref 7 · internal anchor
LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA benchmarks with better efficiency.
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning cs.AI · 2026-04-16 · unverdicted · none · ref 2 · internal anchor
IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-hop tasks.
Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving cs.RO · 2026-03-14 · unverdicted · none · ref 5 · internal anchor
PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.
Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate cs.CL · 2026-01-29 · unverdicted · none · ref 2 · internal anchor
SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.
MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation cs.LG · 2025-11-11 · unverdicted · none · ref 7 · internal anchor
MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.
Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training cs.LG · 2025-11-10 · unverdicted · none · ref 4 · internal anchor
Q-RAG trains embedders via RL for multi-step retrieval and reports state-of-the-art results on BabiLong and RULER benchmarks for contexts up to 10M tokens.
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning cs.CL · 2025-11-04 · unverdicted · none · ref 2 · internal anchor
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation cs.CL · 2025-10-09 · unverdicted · none · ref 1 · internal anchor
HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.
MMSearch-R1: Incentivizing LMMs to Search cs.CV · 2025-06-25 · unverdicted · none · ref 7 · internal anchor
MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning cs.AI · 2026-05-18 · unverdicted · none · ref 4 · internal anchor
SD-Search derives step-level supervision for search queries in reasoning agents via on-policy hindsight self-distillation using the policy as both student and teacher.
Harnessing LLM Agents with Skill Programs cs.AI · 2026-05-18 · conditional · none · ref 11 · internal anchor
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning cs.AI · 2026-05-10 · unverdicted · none · ref 3 · 2 links · internal anchor
PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents cs.CV · 2026-04-21 · unverdicted · none · ref 41 · internal anchor
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence cs.AI · 2026-04-20 · unverdicted · none · ref 13 · internal anchor
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue cs.CL · 2026-03-06 · unverdicted · none · ref 54 · internal anchor
MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization for emotional support dialogues.
The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination cs.LG · 2025-10-27 · conditional · none · ref 1 · internal anchor
Strengthening LLM reasoning through RL, SFT, or chain-of-thought prompting increases tool hallucination rates on SimpleToolHalluBench, with a reliability-capability trade-off observed across mitigation attempts.
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards cs.CL · 2025-10-01 · unverdicted · none · ref 4 · internal anchor
ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey cs.AI · 2025-09-02 · accept · none · ref 116 · internal anchor
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation cs.CL · 2025-05-28 · unverdicted · none · ref 8 · internal anchor
MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.
WebThinker: Empowering Large Reasoning Models with Deep Research Capability cs.CL · 2025-04-30 · unverdicted · none · ref 2 · internal anchor
WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.
Supervising the search process produces reliable and generalizable information-seeking agents cs.CL · 2025-02-19 · unverdicted · none · ref 8 · internal anchor
Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning cs.AI · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
Search-E1 interleaves vanilla GRPO with offline self-distillation via token-level forward KL alignment to privileged sibling trajectories, reaching 0.440 average EM on seven QA benchmarks with Qwen2.5-3B and beating open-source baselines.
D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning cs.LG · 2026-05-16 · unverdicted · none · ref 40 · internal anchor
D²Evo mines medium-difficulty anchors from the current model, trains a Questioner to generate matching questions, and jointly optimizes Solver and Questioner for progressive gains, outperforming baselines on math reasoning with under 2K real samples.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization cs.AI · 2026-05-09 · unverdicted · none · ref 2 · 2 links · internal anchor
An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards cs.CV · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition cs.IR · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning cs.CL · 2026-04-14 · unverdicted · none · ref 1 · internal anchor
KG-Reasoner uses reinforcement learning to train LLMs for end-to-end multi-hop knowledge graph reasoning, achieving competitive or better results on eight benchmarks.
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning cs.AI · 2026-04-10 · unverdicted · none · ref 4 · internal anchor
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs cs.CL · 2025-10-01 · unverdicted · none · ref 35 · internal anchor
ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
LTRR: Learning To Rank Retrievers for LLMs cs.CL · 2025-06-16 · unverdicted · none · ref 8 · internal anchor
LTRR learns to rank a pool of retrievers by their expected contribution to RAG answer correctness and shows that query-dependent selection beats the best single retriever on QA benchmarks.
Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning cs.CL · 2025-05-20 · unverdicted · none · ref 13 · internal anchor
Mujica-MyGo decomposes multi-turn RAG interactions via multi-agent workflows and applies minimalist policy gradient optimization to improve performance on QA benchmarks while avoiding long-context problems.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 91 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
An End-to-End Framework for Building Large Language Models for Software Operations cs.LG · 2026-04-06 · unverdicted · none · ref 8 · 2 links · internal anchor
OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 206 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review cs.AI · 2025-04-28 · accept · none · ref 237 · internal anchor
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 65 · internal anchor
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
