Recognition: no theorem link
Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents
Pith reviewed 2026-05-10 19:13 UTC · model grok-4.3
The pith
Small language models reach large-model accuracy on complex search tasks when fine-tuned to always search and ground answers in retrieved evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that small language models can be turned into effective search agents for knowledge-intensive multi-hop tasks by applying a lightweight fine-tuning procedure that explicitly teaches them to retrieve relevant information and generate answers strictly grounded in the retrieved evidence. This direct training approach yields higher accuracy than distilling agent behaviors from large models and reaches performance levels comparable to those large models across standard benchmarks. Analysis within the work further shows that adaptive strategies for deciding when to search tend to degrade results in small models, whereas consistent reliance on search produces more reliable and
What carries the argument
The lightweight fine-tuning approach that enforces consistent tool invocation and evidence-grounded answer generation in small language models, preventing reliance on parametric guesses.
If this is right
- Small models become viable alternatives to large ones for search-based reasoning without distillation.
- Consistent search behavior outperforms flexible adaptive strategies for these models on reasoning tasks.
- Grounding outputs in retrieved evidence directly reduces hallucinations in small models.
- The same training pattern can be applied to other tool-using scenarios where small models currently underperform.
Where Pith is reading between the lines
- This style of training could extend to domains such as code generation or scientific query answering where evidence grounding matters.
- It raises the possibility that explicit behavioral constraints during fine-tuning matter more than raw model size for certain agent tasks.
- Deployments on limited hardware might become practical if small models maintain high reliability after this training.
Load-bearing premise
Small language models can acquire reliable search habits and strict evidence grounding through lightweight fine-tuning alone, without needing larger scale or adaptive decision rules.
What would settle it
Running the trained small models on fresh multi-hop questions and measuring whether they still produce ungrounded answers at rates similar to untrained versions or fall short of large-model accuracy.
Figures
read the original abstract
Agents equipped with search tools have emerged as effective solutions for knowledge-intensive tasks. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their high computational cost limits practical deployment for search agents. Consequently, recent work has focused on distilling agentic behaviors from LLMs into Small Language Models (SLMs). Through comprehensive evaluation on complex multi-hop reasoning tasks, we find that despite possessing less parametric knowledge, SLMs invoke search tools less frequently and are more prone to hallucinations. To address this issue, we propose \policy, a lightweight fine-tuning approach that explicitly trains SLMs to reliably retrieve and generate answers grounded in retrieved evidence. Compared to agent distillation from LLMs, our approach improves performance by 17.3 scores on Bamboogle and 15.3 scores on HotpotQA, achieving LLM-level results across benchmarks. Our further analysis reveals that adaptive search strategies in SLMs often degrade performance, highlighting the necessity of consistent search behavior for reliable reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a lightweight fine-tuning method (denoted as policy or 'Search, Do not Guess') for small language models (SLMs) to serve as effective search agents on knowledge-intensive tasks. It observes that base SLMs invoke search tools less frequently and hallucinate more than LLMs despite having less parametric knowledge. The proposed approach explicitly trains SLMs to retrieve information reliably and generate answers grounded in retrieved evidence. It reports performance gains of 17.3 points on Bamboogle and 15.3 points on HotpotQA over agent-distillation baselines from LLMs, reaching LLM-level results, and finds that adaptive search strategies tend to degrade SLM performance, underscoring the value of consistent search behavior.
Significance. If the performance gains prove robust and the mechanism is validated, the work could meaningfully advance practical deployment of search agents by enabling smaller, more efficient models without heavy LLM distillation. The negative result on adaptive search offers a useful design insight for agent reliability. The emphasis on grounded generation addresses a known limitation of SLMs in tool-use settings.
major comments (3)
- [Abstract] Abstract: The central performance claims (17.3-point gain on Bamboogle and 15.3-point gain on HotpotQA, plus LLM-level results) are stated without any reference to experimental details such as baselines, data splits, number of runs, or statistical tests, making the claims impossible to evaluate from the provided information.
- [Abstract] Abstract: The gains are attributed to training SLMs to reliably retrieve and produce grounded answers (addressing lower tool-use frequency and higher hallucination rates in base SLMs). However, no quantitative post-training metrics on tool invocation rates or hallucination/groundedness rates are reported relative to the agent-distillation baseline, so the source of the delta cannot be confidently linked to the stated mechanism rather than other factors like data or formatting differences.
- [Abstract] Abstract (further analysis): The claim that adaptive search strategies often degrade performance in SLMs is presented as a key finding, but without details on how adaptivity was implemented, the size of the degradation, or controlled comparisons to non-adaptive variants, the result lacks the specificity needed to support the conclusion that consistent search behavior is necessary.
minor comments (2)
- The method is referred to as policy in the abstract; provide the full name, acronym expansion, and a brief description of the training objective in the introduction for clarity.
- Ensure all mentioned benchmarks (Bamboogle, HotpotQA) include proper citations to their original papers.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work's potential significance and for the specific comments aimed at improving the clarity and evaluability of our abstract. We agree that the abstract can be enhanced to include more experimental context and mechanistic details without exceeding length limits. Below we provide point-by-point responses to the major comments, indicating the revisions we plan to make.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central performance claims (17.3-point gain on Bamboogle and 15.3-point gain on HotpotQA, plus LLM-level results) are stated without any reference to experimental details such as baselines, data splits, number of runs, or statistical tests, making the claims impossible to evaluate from the provided information.
Authors: We agree that the abstract would benefit from additional context to support evaluation of the claims. The gains are measured against agent-distillation baselines from LLMs, using the standard train/test splits for Bamboogle and HotpotQA. All main results are averaged over three independent runs, with standard deviations and full tables provided in Section 4. Statistical significance is evaluated via paired t-tests, with details in the appendix. In the revised manuscript, we will update the abstract to briefly note the comparison to LLM agent-distillation baselines and refer readers to Section 4 for complete experimental protocols, data splits, and run statistics. revision: yes
-
Referee: [Abstract] Abstract: The gains are attributed to training SLMs to reliably retrieve and produce grounded answers (addressing lower tool-use frequency and higher hallucination rates in base SLMs). However, no quantitative post-training metrics on tool invocation rates or hallucination/groundedness rates are reported relative to the agent-distillation baseline, so the source of the delta cannot be confidently linked to the stated mechanism rather than other factors like data or formatting differences.
Authors: The referee is correct that the abstract itself does not report post-training quantitative metrics on tool invocation or hallucination rates. The manuscript provides these comparisons in Section 5 (including tool-use frequency and grounding accuracy relative to the distillation baseline), which support the proposed mechanism. To directly address the concern, we will revise the abstract to explicitly link the performance gains to increased tool-use reliability and reduced hallucinations, while directing readers to the analysis section for the supporting quantitative evidence. revision: partial
-
Referee: [Abstract] Abstract (further analysis): The claim that adaptive search strategies often degrade performance in SLMs is presented as a key finding, but without details on how adaptivity was implemented, the size of the degradation, or controlled comparisons to non-adaptive variants, the result lacks the specificity needed to support the conclusion that consistent search behavior is necessary.
Authors: We acknowledge that the abstract presents this finding at a high level without implementation specifics. The adaptive search setup (model decides search vs. answer based on internal confidence) and controlled comparisons to non-adaptive policies, along with the observed degradation magnitudes, are detailed in Section 6. In the revised abstract, we will add a concise reference to the controlled experiments demonstrating degradation under adaptive strategies for SLMs, while pointing to Section 6 for full implementation details and quantitative results. revision: yes
Circularity Check
No circularity: purely empirical benchmark evaluation with no derivations or self-referential reductions
full rationale
The paper reports experimental results from lightweight fine-tuning of SLMs on search-agent tasks, with performance deltas measured directly on Bamboogle and HotpotQA benchmarks. No equations, parameter fits, predictions, or uniqueness theorems appear; the central claim rests on observed score improvements rather than any derivation that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any mathematical step, and the work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption SLMs possess less parametric knowledge than LLMs and therefore invoke search tools less frequently
Forward citations
Cited by 1 Pith paper
-
Towards Multi-Agent Autonomous Reasoning in Hydrodynamics
A Layer Execution Graph multi-agent system for hydrodynamics achieves 93.6% factual precision and 100% pass rate on 37 queries while degrading gracefully under data loss.
Reference graph
Works this paper leans on
-
[1]
Wikimedia database dump of the en- glish wikipedia on june 20, 2021. https: //archive.org/download/enwiki-20210620/ enwiki-20210620-pages-articles.xml.bz2 . Wikimedia database dump of the English Wikipedia on June 20, 2021. Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Pi- otr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. 2024. On-po...
-
[2]
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Sealqa: Raising the bar for reasoning in search-augmented language models.Preprint, arXiv:2506.01062. Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models.Preprint, arXiv:2210.03350. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabhar...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
You should always and directly use the Wikipedia search engine to look up the information needed to answer the question
-
[8]
wh-word is [attribute] of [entity]
Assume that the current year is 2018. No need to look up information that is more recent than this year. ### Search Query Format Guidelines When writing search queries, follow these specific formats depending on what information you need: **Format A: When inquiring about an attribute of an entity** Use the pattern: "wh-word is [attribute] of [entity]" Exa...
2018
-
[9]
You should carefully follow the format of searching and answering as shown in the example above
-
[10]
Do not use your own knowledge or personal experiences to speculate
You should always and directly use the Wikipedia search engine to look up the information needed to answer the question. Do not use your own knowledge or personal experiences to speculate
-
[11]
people that captured Malakoff
Your search queries should be a complete, natural language question instead of keywords. For instance, instead of searching for "people that captured Malakoff", you should search for "Who were the people that captured Malakoff?"
-
[12]
No need to reflect and doubt it
You can trust the information retrieved from the search engine to be accurate and factual, and use it in your subsequent reasoning. No need to reflect and doubt it
-
[13]
The people who captured Malakoff came to the region where Philipsburg is located in November 12, 1625
You final answer should be a short-form answer. Do NOT provide explanations or extract descriptions. For instance, instead of saying "The people who captured Malakoff came to the region where Philipsburg is located in November 12, 1625", you should only say "November 12, 1625"
-
[14]
Do not include any other information or text
Your final response should only have the answer enclosed in <answer> and </answer> tags. Do not include any other information or text
-
[15]
wh-word is [attribute] of [entity]
Assume that the current year is 2018. No need to look up information that is more recent than this year. ### Search Query Format Guidelines When writing search queries, follow these specific formats depending on what information you need: **Format A: When inquiring about an attribute of an entity** Use the pattern: "wh-word is [attribute] of [entity]" Exa...
2018
-
[16]
Carefully read each document and determine if it contains information relevant to the query
-
[17]
If you find relevant information, extract and summarize it in 1-3 clear sentences
-
[18]
**Do not use any information that is not present in the documents.**
-
[19]
The Great Silence
If none of the documents contain relevant information, state that clearly. ### Output Format (CRITICAL - MUST FOLLOW EXACTLY) - Your answer **MUST start with exactly**: ### Extracted Information - On the line(s) after this tag, write the extracted information. - If there is no relevant information, write: No helpful information found. - **IMPORTANT**: Eve...
1926
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.