R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song; Jie Chen; Jinhao Jiang; Ji-Rong Wen; Lei Fang; Wayne Xin Zhao; Yingqian Min; Zhipeng Chen

arXiv: 2503.05592 · v2 · submitted 2025-03-07 · 💻 cs.AI · cs.CL · cs.IR

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen

Pith reviewed 2026-05-13 18:32 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.IR

keywords reinforcement learning · large language models · search capability · retrieval-augmented generation · tool use · outcome-based RL

The pith

R1-Searcher trains LLMs with outcome-based RL to call external search tools during reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage outcome-based reinforcement learning method that teaches large language models to decide when to invoke external search systems while solving problems. Existing models often produce errors on questions needing facts beyond their training data because they cannot fetch fresh information. By rewarding only the final answer correctness, the approach builds search behavior without step-by-step process signals or special warm-up training. If the method works as claimed, it yields higher accuracy on knowledge-heavy tasks than standard retrieval systems and even closed models such as GPT-4o-mini, while applying to both base and instruction-tuned models.

Core claim

R1-Searcher is a two-stage outcome-based RL framework that enables LLMs to autonomously generate calls to external search systems inside their reasoning process, producing stronger results on knowledge-intensive benchmarks than prior RAG approaches and GPT-4o-mini without any process rewards or distillation for initialization.
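The idea of a model issuing search calls inside its own reasoning can be sketched as a simple rollout loop. This is a minimal illustration, not the authors' implementation: the `<query>`, `<documents>`, and `<answer>` tags, the `rollout` function, and the stub generator and retriever are all hypothetical names chosen here, since the text above does not specify the markup or the search backend.

```python
import re
from typing import Callable, List

# Hypothetical tag; the source does not say how a search call is marked up.
QUERY_RE = re.compile(r"<query>(.*?)</query>", re.DOTALL)


def rollout(question: str,
            generate: Callable[[str], str],
            search: Callable[[str], List[str]],
            max_calls: int = 4) -> str:
    """Interleave generation and retrieval until no new query is emitted."""
    context = f"Question: {question}\n"
    for _ in range(max_calls):
        segment = generate(context)
        context += segment
        match = QUERY_RE.search(segment)
        if match is None:
            break  # no query in this segment, so treat it as the final one
        docs = search(match.group(1).strip())
        # Retrieved text is appended so the next segment can condition on it.
        context += "\n<documents>" + " ".join(docs) + "</documents>\n"
    return context


# Stub model and retriever, for illustration only.
def fake_generate(ctx: str) -> str:
    if "<documents>" not in ctx:
        return "I need to look this up. <query>capital of France</query>"
    return "The answer is <answer>Paris</answer>"


def fake_search(query: str) -> List[str]:
    return ["Paris is the capital of France."]


trace = rollout("What is the capital of France?", fake_generate, fake_search)
```

In this sketch the decision to search is entirely the generator's: the loop only reacts to a query when one appears, which is the behavior the training is meant to shape.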

What carries the argument

Two-stage outcome-based reinforcement learning that rewards final answer correctness and thereby incentivizes the model to insert search tool calls into its reasoning trajectory.
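A reward that looks only at final-answer correctness might be sketched as below. This is an assumption-laden illustration: the token-level F1 scoring, the `<answer>` tag, and the function names are choices made here, and the paper's actual reward design for its two stages is not described in this text.

```python
import re
import string
from collections import Counter
from typing import List

# Hypothetical tag marking the final answer in a trajectory.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def outcome_reward(trajectory: str, gold: str) -> float:
    """Score only the last tagged answer; intermediate steps get no direct signal."""
    answers = ANSWER_RE.findall(trajectory)
    if not answers:
        return 0.0
    return token_f1(answers[-1], gold)
```

Nothing in `outcome_reward` inspects whether a search call was made, so any increase in search use would have to arise indirectly, through its effect on the final answer.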

If this is right

The same outcome-based RL pipeline produces usable search behavior in both base and instruction-tuned models.
Search use generalizes to datasets outside the training distribution.
Accuracy on time-sensitive and fact-heavy questions rises above conventional RAG pipelines.
No auxiliary process reward model or supervised warm-up phase is required for the capability to emerge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may transfer to training other tool-use skills such as code execution or database queries using only final-outcome signals.
Reducing dependence on ever-larger internal knowledge stores becomes feasible if external search can be reliably triggered on demand.
Training pipelines that avoid process supervision could scale more easily to larger models or longer reasoning traces.

Load-bearing premise

Outcome rewards alone can reliably produce and generalize search behavior without process supervision or a distillation cold start.

What would settle it

A controlled test set of knowledge questions where internal model knowledge is provably insufficient; if the trained model answers correctly without ever calling search, or calls search but still fails at rates comparable to the untrained base model, the claim is falsified.
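The check described above could be tallied along these lines. The `Record` type, the `settle` function, and the `margin` threshold are hypothetical, and the records in the usage lines are made-up toy values, not results from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    used_search: bool  # did the trajectory contain at least one search call
    correct: bool      # was the final answer judged correct


def accuracy(records: List[Record]) -> float:
    return sum(r.correct for r in records) / len(records) if records else 0.0


def settle(trained: List[Record], base: List[Record],
           margin: float = 0.05) -> Dict[str, object]:
    """Summarize the two failure signals on a set where internal knowledge is insufficient."""
    no_search_correct = sum(r.correct and not r.used_search for r in trained)
    searched = [r for r in trained if r.used_search]
    return {
        # High values suggest the questions were answerable without retrieval.
        "correct_without_search": no_search_correct / len(trained) if trained else 0.0,
        "accuracy_with_search": accuracy(searched),
        "base_accuracy": accuracy(base),
        # margin is an arbitrary illustrative threshold, not a statistical test.
        "search_helps": accuracy(searched) > accuracy(base) + margin,
    }


# Toy records for illustration only.
trained = [Record(True, True), Record(True, True), Record(True, False), Record(False, False)]
base = [Record(False, False), Record(False, False), Record(False, False), Record(False, True)]
report = settle(trained, base)
```

A real evaluation would replace the fixed margin with a significance test over a much larger question set.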

Original abstract

Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models (LLMs). While they achieve remarkable performance on challenging tasks such as mathematics and coding, they often rely on their internal knowledge to solve problems, which can be inadequate for time-sensitive or knowledge-intensive questions, leading to inaccuracies and hallucinations. To address this, we propose R1-Searcher, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs. This method allows LLMs to autonomously invoke external search systems to access additional knowledge during the reasoning process. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start, effectively generalizing to out-of-domain datasets and supporting both Base and Instruct models. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

R1-Searcher offers a clean two-stage outcome-only RL recipe for getting LLMs to call external search, but the abstract supplies no numbers or training details to show it actually works.

The main idea is a two-stage RL procedure that trains LLMs to decide when to invoke an external search tool using only final-answer correctness as the signal. It skips process rewards and any distillation step for cold start, and the abstract says the same setup works on both base and instruct models while beating standard RAG baselines and even GPT-4o-mini on knowledge tasks. That framing is the clearest new piece relative to earlier RL-for-reasoning papers: it treats search invocation as something that can be shaped purely by outcome feedback.

If the training curves actually show rising search frequency and the gains are real, the method is simple enough that groups working on tool-augmented agents could try it quickly. The paper also claims some out-of-domain generalization, which would be useful if shown with proper controls.

The soft spot is the missing evidence. The abstract asserts significant outperformance but gives no dataset sizes, no exact baselines, no statistical tests, and no data on whether the model actually increases its search rate or just learns to answer better without the tool. Sparse terminal rewards often produce policies that ignore the action or emit low-value queries, exactly the risk the stress-test note flags. Without training dynamics or ablations on the two stages, it is hard to tell whether the signal was dense enough.

The approach itself is not circular and engages the literature on RAG and RL post-training in a straightforward way. This is the kind of paper that belongs in a reading group focused on practical RL for tool use, because the recipe is easy to reproduce if the results check out. I would send it to peer review so referees can ask for the missing experimental details and checks on actual search behavior; the core question is important enough to justify the time even if revisions are needed.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces R1-Searcher, a two-stage outcome-based reinforcement learning framework that trains LLMs (both base and instruct variants) to autonomously invoke external search tools during reasoning. It claims this approach, relying exclusively on final-answer correctness rewards without process supervision or distillation for cold-start, enables effective search behavior that generalizes out-of-domain and yields significant outperformance over strong RAG baselines, including closed-source GPT-4o-mini.

Significance. If the empirical results hold with proper controls, the work would demonstrate that sparse outcome-only RL can reliably induce tool-use policies for external knowledge access, offering a simpler alternative to process-reward or imitation-based methods for reducing hallucinations on knowledge-intensive tasks.

major comments (2)

[Abstract] Abstract and Experiments section: the central claim of significant outperformance over prior RAG methods and GPT-4o-mini is asserted without any reported datasets, baselines, metrics, statistical tests, ablations, or controls in the provided text, leaving the empirical support for the two-stage RL procedure unevaluable.
[Method] Method and Training sections: the premise that outcome-based rewards alone suffice to increase search-tool invocation frequency and quality (rather than producing ignored or redundant queries) is load-bearing for the contribution, yet no training dynamics, search-rate curves, or qualitative policy analysis are referenced to validate this against the known sparsity issues of terminal rewards.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have made revisions to strengthen the empirical presentation and analysis of the two-stage RL procedure.

Referee: [Abstract] Abstract and Experiments section: the central claim of significant outperformance over prior RAG methods and GPT-4o-mini is asserted without any reported datasets, baselines, metrics, statistical tests, ablations, or controls in the provided text, leaving the empirical support for the two-stage RL procedure unevaluable.

Authors: We agree that the abstract should provide clearer pointers to the empirical support. The full Experiments section reports results on HotpotQA, 2WikiMultihopQA, and out-of-domain sets, with baselines including standard RAG pipelines and GPT-4o-mini, using exact-match and F1 metrics, plus ablations on the two-stage design and statistical significance via paired t-tests. To make this immediately visible, we have expanded the abstract to name the primary datasets, metrics, and key controls, and added explicit cross-references to the Experiments section and appendix tables.

Revision: yes

Referee: [Method] Method and Training sections: the premise that outcome-based rewards alone suffice to increase search-tool invocation frequency and quality (rather than producing ignored or redundant queries) is load-bearing for the contribution, yet no training dynamics, search-rate curves, or qualitative policy analysis are referenced to validate this against the known sparsity issues of terminal rewards.

Authors: We acknowledge that explicit validation of the learned search policy is important given the sparsity of terminal rewards. In the revised manuscript we have added (i) training curves tracking search-tool invocation rate over RL steps for both base and instruct models, (ii) search-rate curves comparing the two-stage procedure against a single-stage baseline, and (iii) qualitative policy traces showing that the model learns to issue relevant, non-redundant queries rather than ignoring the tool. These additions directly address the sparsity concern and are placed in the Training and Analysis sections with accompanying discussion.

Revision: yes
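A search-invocation curve of the kind discussed in this exchange could be computed from logged rollouts roughly as follows. This is a sketch under stated assumptions: the log format of (training step, trajectory text) pairs, the `<query>` tag, and the function name are hypothetical, and the logs in the usage lines are toy values.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def search_rate_by_step(logs: Iterable[Tuple[int, str]],
                        query_tag: str = "<query>") -> Dict[int, float]:
    """Mean number of search calls per trajectory, grouped by training step."""
    counts: Dict[int, List[int]] = defaultdict(list)
    for step, trajectory in logs:
        # Count opening tags only; the closing tag does not contain this substring.
        counts[step].append(trajectory.count(query_tag))
    return {step: sum(c) / len(c) for step, c in sorted(counts.items())}


# Toy logs for illustration only.
logs = [
    (0, "direct answer with no retrieval"),
    (0, "<query>x</query> then an answer"),
    (100, "<query>x</query> more reasoning <query>y</query>"),
    (100, "<query>z</query> then an answer"),
]
curve = search_rate_by_step(logs)
```

A rising curve alone would not show that queries are useful; it would need to be read alongside answer accuracy at the same steps.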

Circularity Check

0 steps flagged

No significant circularity in empirical RL training procedure

The paper presents an empirical two-stage outcome-based RL framework that trains LLMs to invoke external search tools using only terminal rewards from final-answer correctness. No equations, parameter fits, or derivations are shown that would reduce any claimed prediction or search behavior to a self-referential quantity or fitted input by construction. Claims rest on experimental comparisons against external RAG baselines and GPT-4o-mini rather than internal self-citations, uniqueness theorems, or ansatzes. The method is explicitly described as relying on external search outcomes without process rewards or distillation, making the central result an observed training outcome rather than a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard reinforcement learning assumptions applied to tool-use behavior in LLMs; no free parameters, invented entities, or ad-hoc axioms are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Outcome-based rewards suffice to train LLMs to decide when and how to use external search tools.
Core premise of the two-stage RL framework described in the abstract.

pith-pipeline@v0.9.0 · 5491 in / 1064 out tokens · 41828 ms · 2026-05-13T18:32:38.361697+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini."

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "we propose R1-Searcher, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 accept novelty 8.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, with evaluations showing direct QA at 66.4%, best practical agents at 79.1%, and oracle knowledge at 95.4%.
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
cs.AI 2026-05 unverdicted novelty 7.0

Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 unverdicted novelty 7.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practica...
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
cs.AI 2026-05 unverdicted novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
cs.AI 2026-05 unverdicted novelty 7.0

CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
cs.AI 2026-05 unverdicted novelty 7.0

SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.
LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
cs.CL 2026-05 unverdicted novelty 7.0

LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
cs.AI 2026-04 unverdicted novelty 7.0

IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...
ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
cs.CL 2026-04 unverdicted novelty 7.0

GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
cs.CL 2025-11 unverdicted novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
EVE-Agent: Evidence-Verifiable Self-Evolving Agents
cs.AI 2026-05 unverdicted novelty 6.0

EVE-Agent adds an evidence verifier to the proposer-solver loop that rewards spans by marginal accuracy gain, producing self-generated but inspectable training examples for search agents.
PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
cs.AI 2026-05 unverdicted novelty 6.0

SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
cs.LG 2026-05 unverdicted novelty 6.0

S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
cs.CV 2026-04 unverdicted novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search
cs.CL 2026-04 unverdicted novelty 6.0

CalibAdv calibrates advantages in GRPO by downscaling negative signals from incorrect final answers using intermediate step correctness and rebalancing answer-level advantages, yielding better performance and training...
AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 6.0

AutoSearch applies RL with a self-answering reward to adaptively determine minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.
$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
cs.LG 2026-04 unverdicted novelty 6.0

π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
Towards Long-horizon Agentic Multimodal Search
cs.CV 2026-04 unverdicted novelty 6.0

LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...
OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
cs.AI 2026-04 unverdicted novelty 6.0

OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
cs.AI 2026-04 unverdicted novelty 6.0

CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
cs.SE 2026-04 unverdicted novelty 6.0

A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multil...
Procedural Knowledge at Scale Improves Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...
Learning to Retrieve from Agent Trajectories
cs.IR 2026-03 conditional novelty 6.0

Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning
cs.CL 2026-03 unverdicted novelty 6.0

KG-Hopper uses RL to embed full multi-hop KG traversal and backtracking into a single LLM inference round, enabling a 7B model to outperform larger multi-step systems and compete with GPT-3.5/GPT-4o-mini on eight benchmarks.
Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation
cs.CV 2026-02 unverdicted novelty 6.0

MARL-Rad trains region-specific and global agents with reinforcement learning on clinical rewards to produce more accurate radiology reports than prior methods on MIMIC-CXR and IU X-ray datasets.
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
cs.CL 2025-11 unverdicted novelty 6.0

MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
DeepEyesV2: Toward Agentic Multimodal Model
cs.CV 2025-11 unverdicted novelty 6.0

DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination
cs.LG 2025-10 conditional novelty 6.0

Strengthening LLM reasoning through RL, SFT, or chain-of-thought prompting increases tool hallucination rates on SimpleToolHalluBench, with a reliability-capability trade-off observed across mitigation attempts.
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards
cs.CL 2025-10 unverdicted novelty 6.0

ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI 2025-09 accept novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
cs.IR 2025-08 unverdicted novelty 6.0

WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.
WebSailor: Navigating Super-human Reasoning for Web Agent
cs.CL 2025-07 conditional novelty 6.0

WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.
Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
cs.CL 2025-05 unverdicted novelty 6.0

MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
cs.CL 2025-05 unverdicted novelty 6.0

ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
cs.CL 2025-05 conditional novelty 6.0

ZeroSearch simulates search engine interactions via supervised fine-tuning of a retrieval module and curriculum-based RL degradation of document quality, achieving comparable or superior performance to real search eng...
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
cs.CL 2025-04 unverdicted novelty 6.0

WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.
ToolRL: Reward is All Tool Learning Needs
cs.LG 2025-04 conditional novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
cs.CL 2025-04 unverdicted novelty 6.0

ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
cs.AI 2025-04 conditional novelty 6.0

End-to-end RL in authentic web environments produces LLM research agents that outperform prompt-engineering and RAG-based baselines by up to 28.9 and 7.2 points respectively while exhibiting emergent planning, cross-v...
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
cs.AI 2025-03 unverdicted novelty 6.0

ReSearch trains LLMs via RL to integrate search operations into reasoning steps, achieving strong generalization across benchmarks and eliciting reflection and self-correction without supervised reasoning data.
Supervising the search process produces reliable and generalizable information-seeking agents
cs.CL 2025-02 unverdicted novelty 6.0

Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
cs.AI 2026-05 unverdicted novelty 5.0

Search-E1 interleaves vanilla GRPO with offline self-distillation via token-level forward KL alignment to privileged sibling trajectories, reaching 0.440 average EM on seven QA benchmarks with Qwen2.5-3B and beating o...
Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
cs.AI 2026-05 unverdicted novelty 5.0

MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
cs.AI 2026-05 unverdicted novelty 5.0

CuSearch reallocates fixed training budget toward deeper-search rollouts in RLVR for agentic RAG, treating search depth as an annotation-free proxy for supervision density and reporting up to 11.8 exact-match gains ov...
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
cs.AI 2026-04 unverdicted novelty 5.0

E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.
MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning
cs.AI 2026-01 unverdicted novelty 5.0

MemOCR renders structured memory as images with adaptive visual density to improve long-horizon reasoning under tight context budgets.
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
cs.CL 2025-10 unverdicted novelty 5.0

ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
cs.AI 2025-08 unverdicted novelty 5.0

Cognitive Kernel-Pro provides an open-source agent framework with curated training data across web, file, code, and reasoning domains plus test-time reflection and voting, achieving SOTA results on GAIA among free agents.
Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning
cs.CL 2025-05 unverdicted novelty 5.0

Mujica-MyGo decomposes multi-turn RAG interactions via multi-agent workflows and applies minimalist policy gradient optimization to improve performance on QA benchmarks while avoiding long-context problems.
Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
cs.AI 2025-04 unverdicted novelty 5.0

ARTIST couples agentic reasoning with outcome-based reinforcement learning to let LLMs autonomously invoke tools in multi-turn chains, reporting up to 22% gains on math and function-calling benchmarks.
EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
cs.AI 2026-04 unverdicted novelty 4.0

Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
cs.LG 2025-10 unverdicted novelty 4.0

GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.
XekRung Technical Report
cs.CR 2026-04 unverdicted novelty 3.0

XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 56 Pith papers · 5 internal anchors

[1]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

An empirical study on eliciting and improving r1-like reasoning models, 2025

Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. An empirical study on eliciting and improving r1-like reasoning models, 2025

work page 2025
[5]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

work page 2018
[6]

Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

work page 2020
[7]

Omnieval: An omnidirectional and automatic rag evaluation benchmark in financial domain, 2025

Shuting Wang, Jiejun Tan, Zhicheng Dou, and Ji-Rong Wen. Omnieval: An omnidirectional and automatic rag evaluation benchmark in financial domain, 2025

work page 2025
[8]

Multi-reranker: Maximizing performance of retrieval-augmented generation in the financerag challenge, 2024

Joohyun Lee and Minji Roh. Multi-reranker: Maximizing performance of retrieval-augmented generation in the financerag challenge, 2024

work page 2024
[9]

Measuring and narrowing the compositionality gap in language models

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023

work page 2023
[10]

Jie He, Nan Hu, Wanqiu Long, Jiaoyan Chen, and Jeff Z. Pan. Mintqa: A multi-hop question answering benchmark for evaluating llms on new and tail knowledge, 2025

work page 2025
[11]

Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement

Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Wayne Xin Zhao, Yang Song, and Tao Zhang. Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement. CoRR, abs/2412.12881, 2024

work page arXiv 2024
[12]

Retrieval-augmented generation for large language models: A survey, 2024

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024

work page 2024
[13]

A survey on rag meeting llms: Towards retrieval-augmented large language models

Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, pages 6491–6501, New York, NY, USA, 2024. Association for Computing Machinery

work page 2024
[14]

Search-o1: Agentic search-enhanced large reasoning models, 2025

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models, 2025

work page 2025
[15]

Self-rag: Learning to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

work page 2024
[16]

Atom of thoughts for markov llm test-time scaling, 2025

Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, and Yuyu Luo. Atom of thoughts for markov llm test-time scaling, 2025

work page 2025
[17]

Chain-of-retrieval augmented generation

Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, and Furu Wei. Chain-of-retrieval augmented generation. CoRR, abs/2501.14342, 2025

work page arXiv 2025
[18]

Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025

work page 2025
[19]

Can we further elicit reasoning in llms? critic-guided planning with retrieval-augmentation for solving challenging tasks

Xingxuan Li, Weiwen Xu, Ruochen Zhao, Fangkai Jiao, Shafiq Joty, and Lidong Bing. Can we further elicit reasoning in llms? critic-guided planning with retrieval-augmentation for solving challenging tasks. arXiv preprint arXiv:2410.01428, 2024

work page arXiv 2024
[20]

Reinforce++: A simple and efficient approach for aligning large language models, 2025

Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models, 2025

work page 2025
[21]

Qwen2.5 technical report, 2025

Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page 2025
[22]

Musique: Multihop questions via single-hop question composition

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

work page 2022
[23]

Rearter: Retrieval-augmented reasoning with trustworthy process rewarding, 2025

Zhongxiang Sun, Qipeng Wang, Weijie Yu, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Song Yang, and Han Li. Rearter: Retrieval-augmented reasoning with trustworthy process rewarding, 2025

work page 2025
[24]

Sure: Summarizing retrievals using answer candidates for open-domain QA of LLMs

Jaehyung Kim, Jaehyun Nam, Sangwoo Mo, Jongjin Park, Sang-Woo Lee, Minjoon Seo, Jung-Woo Ha, and Jinwoo Shin. Sure: Summarizing retrievals using answer candidates for open-domain QA of LLMs. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[25]

REPLUG: Retrieval-Augmented Black-Box Language Models

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652, 2023

work page internal anchor Pith review arXiv 2023
[26]

Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. arXiv preprint arXiv:2310.06839, 2023

work page arXiv 2023
[27]

RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation

Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[28]

Compressing context to enhance inference efficiency of large language models

Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342–6353, Singapore, December 2023. Association for Computational Linguistics

work page 2023
[29]

Self-knowledge guided retrieval augmentation for large language models

Yile Wang, Peng Li, Maosong Sun, and Yang Liu. Self-knowledge guided retrieval augmentation for large language models. arXiv preprint arXiv:2310.05002, 2023

work page arXiv 2023
[30]

Measuring and Narrowing the Compositionality Gap in Language Models

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

work page internal anchor Pith review arXiv 2022
[31]

Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294, 2023

work page arXiv 2023
[32]

Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10014–10037, 2023

work page 2023
[33]

Marco-o1: Towards open reasoning models for open-ended solutions, 2024

Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. Marco-o1: Towards open reasoning models for open-ended solutions. arXiv preprint arXiv:2411.14405, 2024

work page arXiv 2024
[34]

Skywork-o1 open series

Skywork o1 Team. Skywork-o1 open series. https://huggingface.co/Skywork, November 2024

work page 2024
[35]

Flashrag: A modular toolkit for efficient retrieval-augmented generation research

Jiajie Jin, Yutao Zhu, Xinyu Yang, Chenghao Zhang, and Zhicheng Dou. Flashrag: A modular toolkit for efficient retrieval-augmented generation research. arXiv preprint arXiv:2405.13576, 2024

work page arXiv 2024
[36]

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick S. H. Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, I...

work page 2021
[37]

Zero: Memory optimizations toward training trillion parameter models, 2020

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models, 2020

work page 2020
[38]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

work page 2024
