arxiv: 2604.04651 · v1 · submitted 2026-04-06 · 💻 cs.AI

Recognition: no theorem link

Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents

Yizhou Liu , Qi Sun , Yulin Chen , Siyue Zhang , Chen Zhao

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:13 UTC · model grok-4.3

classification 💻 cs.AI

keywords small language modelssearch agentsfine-tuningmulti-hop reasoningtool useevidence groundinghallucinations

0 comments

The pith

Small language models reach large-model accuracy on complex search tasks when fine-tuned to always search and ground answers in retrieved evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large models handle knowledge-heavy questions well by using search tools, yet their size makes them costly for routine use. Small models run cheaply but tend to skip searches and invent answers instead. The paper demonstrates that a lightweight fine-tuning process can train small models to invoke search tools reliably and build responses only from the evidence they retrieve. This produces results that match large-model performance on multi-hop reasoning benchmarks while outperforming methods that simply copy behaviors from bigger models. The same experiments show that allowing small models to choose adaptively when to search actually lowers accuracy, so fixed search habits prove more dependable.

Core claim

The paper establishes that small language models can be turned into effective search agents for knowledge-intensive multi-hop tasks by applying a lightweight fine-tuning procedure that explicitly teaches them to retrieve relevant information and generate answers strictly grounded in the retrieved evidence. This direct training approach yields higher accuracy than distilling agent behaviors from large models and reaches performance levels comparable to those large models across standard benchmarks. Analysis within the work further shows that adaptive strategies for deciding when to search tend to degrade results in small models, whereas consistent reliance on search produces more reliable and

What carries the argument

The lightweight fine-tuning approach that enforces consistent tool invocation and evidence-grounded answer generation in small language models, preventing reliance on parametric guesses.

If this is right

Small models become viable alternatives to large ones for search-based reasoning without distillation.
Consistent search behavior outperforms flexible adaptive strategies for these models on reasoning tasks.
Grounding outputs in retrieved evidence directly reduces hallucinations in small models.
The same training pattern can be applied to other tool-using scenarios where small models currently underperform.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This style of training could extend to domains such as code generation or scientific query answering where evidence grounding matters.
It raises the possibility that explicit behavioral constraints during fine-tuning matter more than raw model size for certain agent tasks.
Deployments on limited hardware might become practical if small models maintain high reliability after this training.

Load-bearing premise

Small language models can acquire reliable search habits and strict evidence grounding through lightweight fine-tuning alone, without needing larger scale or adaptive decision rules.

What would settle it

Running the trained small models on fresh multi-hop questions and measuring whether they still produce ungrounded answers at rates similar to untrained versions or fall short of large-model accuracy.

Figures

Figures reproduced from arXiv: 2604.04651 by Chen Zhao, Qi Sun, Siyue Zhang, Yizhou Liu, Yulin Chen.

**Figure 2.** Figure 2: Scaling of agentic search performance. Small [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Confidence probing results. (a) illustrates the full sample distribution on a log scale; (b) zooms into [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

read the original abstract

Agents equipped with search tools have emerged as effective solutions for knowledge-intensive tasks. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their high computational cost limits practical deployment for search agents. Consequently, recent work has focused on distilling agentic behaviors from LLMs into Small Language Models (SLMs). Through comprehensive evaluation on complex multi-hop reasoning tasks, we find that despite possessing less parametric knowledge, SLMs invoke search tools less frequently and are more prone to hallucinations. To address this issue, we propose \policy, a lightweight fine-tuning approach that explicitly trains SLMs to reliably retrieve and generate answers grounded in retrieved evidence. Compared to agent distillation from LLMs, our approach improves performance by 17.3 scores on Bamboogle and 15.3 scores on HotpotQA, achieving LLM-level results across benchmarks. Our further analysis reveals that adaptive search strategies in SLMs often degrade performance, highlighting the necessity of consistent search behavior for reliable reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Forcing small models into consistent search-and-ground behavior beats distillation and hits LLM-level scores on multi-hop QA, but the paper still needs to show the tool-use and hallucination numbers to confirm why.

read the letter

The core result here is that a lightweight fine-tuning recipe can push small language models to match large ones on Bamboogle and HotpotQA by training them to retrieve reliably and stick to evidence-grounded answers. The 17-point and 15-point lifts over agent distillation are the headline numbers, and the negative finding that adaptive search hurts SLMs is the part that feels most useful in practice. That consistent-behavior emphasis is what sets this apart from plain distillation work; it gives a concrete policy instead of hoping the small model learns good tool habits on its own. The approach looks cheap to run and directly targets the two problems the abstract flags: low tool invocation and extra hallucinations. If the full experiments back this up with clean ablations, it is a practical win for anyone trying to ship agents without paying for big-model inference every time. The main gap is exactly the one the stress-test flags. The abstract motivates the method by saying base SLMs search less and hallucinate more, yet it does not report post-training tool-call rates or groundedness scores against the distillation baseline. Without those, it is hard to know whether the gains come from the intended mechanism or from differences in training data, answer format, or retrieval setup. If the paper has those metrics in the results section, the claim strengthens; if not, the causal story stays loose. This is worth sending to referees. The empirical deltas are large enough and the consistent-search angle is clear enough that a careful review could tighten the evidence without killing the contribution. People building cost-sensitive agents or working on SLM tool use would get something out of it.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a lightweight fine-tuning method (denoted as policy or 'Search, Do not Guess') for small language models (SLMs) to serve as effective search agents on knowledge-intensive tasks. It observes that base SLMs invoke search tools less frequently and hallucinate more than LLMs despite having less parametric knowledge. The proposed approach explicitly trains SLMs to retrieve information reliably and generate answers grounded in retrieved evidence. It reports performance gains of 17.3 points on Bamboogle and 15.3 points on HotpotQA over agent-distillation baselines from LLMs, reaching LLM-level results, and finds that adaptive search strategies tend to degrade SLM performance, underscoring the value of consistent search behavior.

Significance. If the performance gains prove robust and the mechanism is validated, the work could meaningfully advance practical deployment of search agents by enabling smaller, more efficient models without heavy LLM distillation. The negative result on adaptive search offers a useful design insight for agent reliability. The emphasis on grounded generation addresses a known limitation of SLMs in tool-use settings.

major comments (3)

[Abstract] Abstract: The central performance claims (17.3-point gain on Bamboogle and 15.3-point gain on HotpotQA, plus LLM-level results) are stated without any reference to experimental details such as baselines, data splits, number of runs, or statistical tests, making the claims impossible to evaluate from the provided information.
[Abstract] Abstract: The gains are attributed to training SLMs to reliably retrieve and produce grounded answers (addressing lower tool-use frequency and higher hallucination rates in base SLMs). However, no quantitative post-training metrics on tool invocation rates or hallucination/groundedness rates are reported relative to the agent-distillation baseline, so the source of the delta cannot be confidently linked to the stated mechanism rather than other factors like data or formatting differences.
[Abstract] Abstract (further analysis): The claim that adaptive search strategies often degrade performance in SLMs is presented as a key finding, but without details on how adaptivity was implemented, the size of the degradation, or controlled comparisons to non-adaptive variants, the result lacks the specificity needed to support the conclusion that consistent search behavior is necessary.

minor comments (2)

The method is referred to as policy in the abstract; provide the full name, acronym expansion, and a brief description of the training objective in the introduction for clarity.
Ensure all mentioned benchmarks (Bamboogle, HotpotQA) include proper citations to their original papers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their positive assessment of the work's potential significance and for the specific comments aimed at improving the clarity and evaluability of our abstract. We agree that the abstract can be enhanced to include more experimental context and mechanistic details without exceeding length limits. Below we provide point-by-point responses to the major comments, indicating the revisions we plan to make.

read point-by-point responses

Referee: [Abstract] Abstract: The central performance claims (17.3-point gain on Bamboogle and 15.3-point gain on HotpotQA, plus LLM-level results) are stated without any reference to experimental details such as baselines, data splits, number of runs, or statistical tests, making the claims impossible to evaluate from the provided information.

Authors: We agree that the abstract would benefit from additional context to support evaluation of the claims. The gains are measured against agent-distillation baselines from LLMs, using the standard train/test splits for Bamboogle and HotpotQA. All main results are averaged over three independent runs, with standard deviations and full tables provided in Section 4. Statistical significance is evaluated via paired t-tests, with details in the appendix. In the revised manuscript, we will update the abstract to briefly note the comparison to LLM agent-distillation baselines and refer readers to Section 4 for complete experimental protocols, data splits, and run statistics. revision: yes
Referee: [Abstract] Abstract: The gains are attributed to training SLMs to reliably retrieve and produce grounded answers (addressing lower tool-use frequency and higher hallucination rates in base SLMs). However, no quantitative post-training metrics on tool invocation rates or hallucination/groundedness rates are reported relative to the agent-distillation baseline, so the source of the delta cannot be confidently linked to the stated mechanism rather than other factors like data or formatting differences.

Authors: The referee is correct that the abstract itself does not report post-training quantitative metrics on tool invocation or hallucination rates. The manuscript provides these comparisons in Section 5 (including tool-use frequency and grounding accuracy relative to the distillation baseline), which support the proposed mechanism. To directly address the concern, we will revise the abstract to explicitly link the performance gains to increased tool-use reliability and reduced hallucinations, while directing readers to the analysis section for the supporting quantitative evidence. revision: partial
Referee: [Abstract] Abstract (further analysis): The claim that adaptive search strategies often degrade performance in SLMs is presented as a key finding, but without details on how adaptivity was implemented, the size of the degradation, or controlled comparisons to non-adaptive variants, the result lacks the specificity needed to support the conclusion that consistent search behavior is necessary.

Authors: We acknowledge that the abstract presents this finding at a high level without implementation specifics. The adaptive search setup (model decides search vs. answer based on internal confidence) and controlled comparisons to non-adaptive policies, along with the observed degradation magnitudes, are detailed in Section 6. In the revised abstract, we will add a concise reference to the controlled experiments demonstrating degradation under adaptive strategies for SLMs, while pointing to Section 6 for full implementation details and quantitative results. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation with no derivations or self-referential reductions

full rationale

The paper reports experimental results from lightweight fine-tuning of SLMs on search-agent tasks, with performance deltas measured directly on Bamboogle and HotpotQA benchmarks. No equations, parameter fits, predictions, or uniqueness theorems appear; the central claim rests on observed score improvements rather than any derivation that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any mathematical step, and the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the domain assumption that SLMs have less parametric knowledge and therefore need explicit training for tool use; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption SLMs possess less parametric knowledge than LLMs and therefore invoke search tools less frequently
Directly stated in the abstract as the observed problem motivating the method.

pith-pipeline@v0.9.0 · 5472 in / 1186 out tokens · 84896 ms · 2026-05-10T19:13:51.699687+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Multi-Agent Autonomous Reasoning in Hydrodynamics
cs.AI 2026-05 unverdicted novelty 4.0

A Layer Execution Graph multi-agent system for hydrodynamics achieves 93.6% factual precision and 100% pass rate on 37 queries while degrading gracefully under data loss.

Reference graph

Works this paper leans on

15 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

InFindings of the Associa- tion for Computational Linguistics: ACL 2024, pages 13921–13937, Bangkok, Thailand

Wikimedia database dump of the en- glish wikipedia on june 20, 2021. https: //archive.org/download/enwiki-20210620/ enwiki-20210620-pages-articles.xml.bz2 . Wikimedia database dump of the English Wikipedia on June 20, 2021. Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Pi- otr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. 2024. On-po...

work page arXiv 2021
[2]

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Sealqa: Raising the bar for reasoning in search-augmented language models.Preprint, arXiv:2506.01062. Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models.Preprint, arXiv:2210.03350. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabhar...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

You should always and directly use the Wikipedia search engine to look up the information needed to answer the question
[8]

wh-word is [attribute] of [entity]

Assume that the current year is 2018. No need to look up information that is more recent than this year. ### Search Query Format Guidelines When writing search queries, follow these specific formats depending on what information you need: **Format A: When inquiring about an attribute of an entity** Use the pattern: "wh-word is [attribute] of [entity]" Exa...

2018
[9]

You should carefully follow the format of searching and answering as shown in the example above
[10]

Do not use your own knowledge or personal experiences to speculate

You should always and directly use the Wikipedia search engine to look up the information needed to answer the question. Do not use your own knowledge or personal experiences to speculate
[11]

people that captured Malakoff

Your search queries should be a complete, natural language question instead of keywords. For instance, instead of searching for "people that captured Malakoff", you should search for "Who were the people that captured Malakoff?"
[12]

No need to reflect and doubt it

You can trust the information retrieved from the search engine to be accurate and factual, and use it in your subsequent reasoning. No need to reflect and doubt it
[13]

The people who captured Malakoff came to the region where Philipsburg is located in November 12, 1625

You final answer should be a short-form answer. Do NOT provide explanations or extract descriptions. For instance, instead of saying "The people who captured Malakoff came to the region where Philipsburg is located in November 12, 1625", you should only say "November 12, 1625"
[14]

Do not include any other information or text

Your final response should only have the answer enclosed in <answer> and </answer> tags. Do not include any other information or text
[15]

wh-word is [attribute] of [entity]

Assume that the current year is 2018. No need to look up information that is more recent than this year. ### Search Query Format Guidelines When writing search queries, follow these specific formats depending on what information you need: **Format A: When inquiring about an attribute of an entity** Use the pattern: "wh-word is [attribute] of [entity]" Exa...

2018
[16]

Carefully read each document and determine if it contains information relevant to the query
[17]

If you find relevant information, extract and summarize it in 1-3 clear sentences
[18]

**Do not use any information that is not present in the documents.**
[19]

The Great Silence

If none of the documents contain relevant information, state that clearly. ### Output Format (CRITICAL - MUST FOLLOW EXACTLY) - Your answer **MUST start with exactly**: ### Extracted Information - On the line(s) after this tag, write the extracted information. - If there is no relevant information, write: No helpful information found. - **IMPORTANT**: Eve...

1926