{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:IS2Y2FVNQ5CACKYP7PUWAEPTDP","short_pith_number":"pith:IS2Y2FVN","schema_version":"1.0","canonical_sha256":"44b58d16ad8744012b0ffbe96011f31bf53e5dbcc3f713af70316ef5b8f3a5f0","source":{"kind":"arxiv","id":"2503.19470","version":3},"attestation_state":"computed","paper":{"title":"ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"ReSearch trains LLMs to interleave search operations with text reasoning using only outcome-based reinforcement learning rewards.","cross_cats":["cs.CL"],"primary_cat":"cs.AI","authors_text":"Chenzheng Zhu, Fan Yang, Haofen Wang, Haoze Sun, Huajun Chen, Jeff Z. Pan, Linzhuang Sun, Mingyang Chen, Tianpeng Li, Weipeng Chen, Wen Zhang, Yijie Zhou, Zenan Zhou","submitted_at":"2025-03-25T09:00:58Z","abstract_excerpt":"Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-ba"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2503.19470","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.AI","submitted_at":"2025-03-25T09:00:58Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"bf05ce1fc3a58133438a96c34ba9f399e45a1ef5ac857af372a738e3eca2b82e","abstract_canon_sha256":"e59522c92d3b0f71aafdef1fc393fd60031cc735838a4baa4cae25ef063974e1"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.396197Z","signature_b64":"gNaEZn37FB16GUlOxdOlzWgPrbPYccnsUgYL6e6SX+nU/X4SMxCZ+HB8dHDSh8NDOpWG/9BSPoLPSzY9DmjUDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"44b58d16ad8744012b0ffbe96011f31bf53e5dbcc3f713af70316ef5b8f3a5f0","last_reissued_at":"2026-05-17T23:38:47.395691Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.395691Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"ReSearch trains LLMs to interleave search operations with text reasoning using only outcome-based reinforcement learning rewards.","cross_cats":["cs.CL"],"primary_cat":"cs.AI","authors_text":"Chenzheng Zhu, Fan Yang, Haofen Wang, Haoze Sun, Huajun Chen, Jeff Z. Pan, Linzhuang Sun, Mingyang Chen, Tianpeng Li, Weipeng Chen, Wen Zhang, Yijie Zhou, Zenan Zhou","submitted_at":"2025-03-25T09:00:58Z","abstract_excerpt":"Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-ba"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That outcome-based reinforcement learning rewards alone are sufficient to train effective search timing and integration without any supervised reasoning traces or explicit search supervision.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ReSearch trains LLMs via RL to integrate search operations into reasoning steps, achieving strong generalization across benchmarks and eliciting reflection and self-correction without supervised reasoning data.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"ReSearch trains LLMs to interleave search operations with text reasoning using only outcome-based reinforcement learning rewards.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c8b56a6067d5761ab0bff66ab16b184e461c91af3e2d8e271ad996045df8b892"},"source":{"id":"2503.19470","kind":"arxiv","version":3},"verdict":{"id":"750f7f73-1453-415f-ad27-daae68bb00a0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T15:43:22.448661Z","strongest_claim":"Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks.","one_line_summary":"ReSearch trains LLMs via RL to integrate search operations into reasoning steps, achieving strong generalization across benchmarks and eliciting reflection and self-correction without supervised reasoning data.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That outcome-based reinforcement learning rewards alone are sufficient to train effective search timing and integration without any supervised reasoning traces or explicit search supervision.","pith_extraction_headline":"ReSearch trains LLMs to interleave search operations with text reasoning using only outcome-based reinforcement learning rewards."},"references":{"count":44,"sample":[{"doi":"","year":2025,"title":"Claude 3.7 sonnet and claude code, 2025","work_id":"34f8d3e0-abcb-40d4-8bed-60c970da1f8c","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Self-rag: Learning to retrieve, generate, and critique through self-reflection","work_id":"0a80e610-1315-461d-888b-efcd795f6ac2","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Rq-rag: Learning to refine queries for retrieval augmented generation","work_id":"5d12e4e2-f60c-4c61-aafa-e97863b41380","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, and Tianyu Pang","work_id":"6c5542cb-0f99-4d92-85f4-f7b97cd42104","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","ref_index":5,"cited_arxiv_id":"2501.12948","is_internal_anchor":true}],"resolved_work":44,"snapshot_sha256":"53932ec0eaf90a414548519649fe63f2f3a2428e592bab4dca2a5e35bdd4c6a7","internal_anchors":14},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2503.19470","created_at":"2026-05-17T23:38:47.395790+00:00"},{"alias_kind":"arxiv_version","alias_value":"2503.19470v3","created_at":"2026-05-17T23:38:47.395790+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2503.19470","created_at":"2026-05-17T23:38:47.395790+00:00"},{"alias_kind":"pith_short_12","alias_value":"IS2Y2FVNQ5CA","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"IS2Y2FVNQ5CACKYP","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"IS2Y2FVN","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":43,"internal_anchor_count":43,"sample":[{"citing_arxiv_id":"2502.13957","citing_title":"Supervising the search process produces reliable and generalizable information-seeking agents","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2505.17086","citing_title":"Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22511","citing_title":"Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2601.22297","citing_title":"Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17946","citing_title":"SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17734","citing_title":"Harnessing LLM Agents with Skill Programs","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17946","citing_title":"SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18299","citing_title":"SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17037","citing_title":"D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2505.22095","citing_title":"Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2506.13743","citing_title":"LTRR: Learning To Rank Retrievers for LLMs","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2509.02547","citing_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","ref_index":116,"is_internal_anchor":true},{"citing_arxiv_id":"2510.00568","citing_title":"ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2510.00861","citing_title":"Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2510.07794","citing_title":"HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2510.22977","citing_title":"The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2511.02805","citing_title":"MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2509.08827","citing_title":"A Survey of Reinforcement Learning for Large Reasoning Models","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2511.07328","citing_title":"Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2511.07833","citing_title":"MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2601.12538","citing_title":"Agentic Reasoning for Large Language Models","ref_index":206,"is_internal_anchor":true},{"citing_arxiv_id":"2504.21776","citing_title":"WebThinker: Empowering Large Reasoning Models with Deep Research Capability","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2506.20670","citing_title":"MMSearch-R1: Incentivizing LMMs to Search","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2603.06194","citing_title":"MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2603.13842","citing_title":"Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving","ref_index":5,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/IS2Y2FVNQ5CACKYP7PUWAEPTDP","json":"https://pith.science/pith/IS2Y2FVNQ5CACKYP7PUWAEPTDP.json","graph_json":"https://pith.science/api/pith-number/IS2Y2FVNQ5CACKYP7PUWAEPTDP/graph.json","events_json":"https://pith.science/api/pith-number/IS2Y2FVNQ5CACKYP7PUWAEPTDP/events.json","paper":"https://pith.science/paper/IS2Y2FVN"},"agent_actions":{"view_html":"https://pith.science/pith/IS2Y2FVNQ5CACKYP7PUWAEPTDP","download_json":"https://pith.science/pith/IS2Y2FVNQ5CACKYP7PUWAEPTDP.json","view_paper":"https://pith.science/paper/IS2Y2FVN","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2503.19470&json=true","fetch_graph":"https://pith.science/api/pith-number/IS2Y2FVNQ5CACKYP7PUWAEPTDP/graph.json","fetch_events":"https://pith.science/api/pith-number/IS2Y2FVNQ5CACKYP7PUWAEPTDP/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/IS2Y2FVNQ5CACKYP7PUWAEPTDP/action/timestamp_anchor","attest_storage":"https://pith.science/pith/IS2Y2FVNQ5CACKYP7PUWAEPTDP/action/storage_attestation","attest_author":"https://pith.science/pith/IS2Y2FVNQ5CACKYP7PUWAEPTDP/action/author_attestation","sign_citation":"https://pith.science/pith/IS2Y2FVNQ5CACKYP7PUWAEPTDP/action/citation_signature","submit_replication":"https://pith.science/pith/IS2Y2FVNQ5CACKYP7PUWAEPTDP/action/replication_record"}},"created_at":"2026-05-17T23:38:47.395790+00:00","updated_at":"2026-05-17T23:38:47.395790+00:00"}