Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning
Pith reviewed 2026-06-29 18:25 UTC · model grok-4.3
The pith
Small language models solve multi-hop questions better by answering first with a quick guess then retrieving evidence to refine it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An answer-first-reason-later framework lets the model generate an initial hypothesis through System-I zero-shot answering, retrieve evidence keyed to that hypothesis, and then conduct System-II style deeper reasoning over the retrieved evidence; this chaining produces higher accuracy on multi-step QA tasks than the conventional think-first followed by iterative retrieval pipeline.
What carries the argument
The answer-first-reason-later framework that chains a System-I quick answer to System-II evidence-based reasoning by using the initial hypothesis as the retrieval query.
If this is right
- Hallucinations in the initial answer can be turned into a retrieval signal rather than treated only as error.
- SLMs can reach competitive multi-hop performance without first producing an explicit reasoning chain.
- The same model can switch between fast hypothesis generation and slower evidence refinement in a single pipeline.
- Hardware-efficient models become viable for tasks previously thought to require larger models or heavier retrieval loops.
Where Pith is reading between the lines
- The method may generalize to other retrieval-augmented tasks where an approximate first output can bootstrap search.
- Measuring the model's in its initial answer could serve as a trigger for whether to invoke the full retrieval-and-reason step.
- The inversion suggests that early errors need not cascade if retrieval is conditioned directly on the flawed hypothesis.
Load-bearing premise
Small language models generate initial answers that are sufficiently accurate or confidently structured to serve as effective seeds for retrieving useful evidence.
What would settle it
Running the proposed method and a think-first baseline on the same multi-hop QA benchmarks and finding that the answer-first approach yields lower accuracy would falsify the performance claim.
Figures
read the original abstract
Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, impacting their ability to solve complex multi-step reasoning problems as early mistakes cascade to the final response. To address this, existing works think-first followed by iterative retrieval to reduce hallucination. We argue that the think-first strategy is not always necessary as we find that: (i) SLMs are often accurately confident in their initial answer and, (ii) hallucinations can actually be beneficial for honing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first-reason later. We propose a cognitively-inspired framework where the model is first allowed to quickly answer the question (System-I (zero-shot)) and then resorts to deeper thinking (System-II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System-I and System-II style thinking, we show that our method can outperform prior work that takes the traditional think-first route on various multi-step question-answering benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an inversion of the standard think-first pipeline for multi-hop QA with small language models (SLMs). It claims that SLMs are often accurately confident in zero-shot initial answers and that hallucinations can usefully guide retrieval; the method therefore generates an answer-first hypothesis (System-I), retrieves evidence conditioned on that hypothesis, and performs deeper reasoning (System-II) to produce the final answer. The central empirical claim is that this answer-first-reason-later approach outperforms prior think-first baselines on various multi-step question-answering benchmarks.
Significance. If the reported outperformance is robust, the work offers a practically useful alternative to chain-of-thought and iterative retrieval pipelines for resource-limited SLMs. It supplies an empirical demonstration that an initial (possibly hallucinated) hypothesis can improve downstream retrieval and correction, which could reduce reliance on expensive multi-step prompting. No machine-checked proofs or parameter-free derivations are present; the contribution rests on the benchmark comparisons.
minor comments (1)
- The abstract states the high-level motivation and outperformance claim but supplies no benchmark names, numeric deltas, ablation controls, or error analysis. If these details appear only in later sections, they should be summarized in the abstract for immediate assessment of the central claim.
Simulated Author's Rebuttal
We thank the referee for their summary and assessment of our work. The referee accurately captures the proposed inversion of the standard pipeline and the central empirical claim. We note that the report lists no specific major comments, so we have no point-by-point responses to provide. We remain available to address any additional questions or clarifications the referee may have.
Circularity Check
No significant circularity; empirical claim independent of inputs
full rationale
The paper advances an empirical inversion of think-first pipelines for SLMs on multi-hop QA, asserting that an initial zero-shot answer (System-I) followed by retrieval-based correction (System-II) yields better benchmark results than prior think-first baselines. No equations, fitted parameters, or self-defined quantities appear. The central claim is framed as an experimental finding resting on comparisons to external prior work, with no load-bearing step that reduces by construction to a self-citation chain, ansatz, or renamed input. The provided abstract and description contain no self-referential derivations that would trigger any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption SLMs hallucinate more frequently than LLMs, impacting multi-step reasoning
- ad hoc to paper The think-first strategy is not always necessary
Reference graph
Works this paper leans on
-
[1]
InFirst Conference on Language Modeling
V-STar: Training verifiers for self-taught rea- soners. InFirst Conference on Language Modeling. Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A survey on hallucination in large lan- guage models: Principles, taxonomy, challenges, and open questions...
2025
-
[2]
Guiding reasoning in small language models with llm assistance
Prunerag: Confidence-guided query decompo- sition trees for efficient retrieval-augmented gener- ation. InProceedings of the ACM Web Conference 2026, WWW ’26, page 1923–1934, New York, NY , USA. Association for Computing Machinery. Daniel Kahneman. 2011. Thinking, fast and slow.Far- rar, Straus and Giroux. Aditya K. Kamath, Ramya Prabhu, Jayashree Mohan, ...
-
[3]
Beibin Li, Yi Zhang, Sébastien Bubeck, Jeevan Pathuri, and Ishai Menache
Curran Associates, Inc. Beibin Li, Yi Zhang, Sébastien Bubeck, Jeevan Pathuri, and Ishai Menache. 2024a. Small language models for application interactions: A case study.CoRR, abs/2405.20347. Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2024b. The dawn after the dark: An empirical study on factuality hallucinati...
-
[4]
InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4554–4570, Bangkok, Thailand
Factual confidence of LLMs: on reliability and robustness of current estimators. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4554–4570, Bangkok, Thailand. Association for Computational Linguistics. Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usma...
2025
-
[5]
CoAT: Chain-of-associated-thoughts frame- work for enhancing large language models reasoning. InFindings of the Association for Computational Lin- guistics: EMNLP 2025, pages 13028–13045, Suzhou, China. Association for Computational Linguistics. 10 Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. 2025. G...
-
[6]
InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 32600–32650, Suzhou, China
ThinkSLM: Towards reasoning in small lan- guage models. InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 32600–32650, Suzhou, China. As- sociation for Computational Linguistics. Chris Alexiuk Tanay Varshney, Annie Surla. 2025. An Easy Introduction to LLM Reasoning, AI Agents, and Test Time Scaling | NVIDIA ...
2025
-
[7]
Kimi K2: Open Agentic Intelligence
Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534. Katherine Tian, Eric Mitchell, Huaxiu Yao, Christo- pher D Manning, and Chelsea Finn. 2024. Fine- tuning language models for factuality. InThe Twelfth International Conference on Learning Representa- tions. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V .2, KDD ’25, page 6173–6183, New York, NY , USA
A survey on small language models in the era of large language models: Architecture, capabili- ties, and trustworthiness. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V .2, KDD ’25, page 6173–6183, New York, NY , USA. Association for Computing Machinery. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei ...
2022
-
[9]
Qwen3 technical report.arXiv preprint arXiv:2505.09388. Haiyang Yang, Ziv Carmon, Dan Ariely, and Michael I Norton. 2019. The feeling of not knowing it all. Journal of Consumer Psychology, 29(3):455–462. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christo- pher D. Manning. 2018. HotpotQA: A dataset for div...
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[10]
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
Tree of thoughts: Deliberate problem solving with large language models. InAdvances in Neural Information Processing Systems, volume 36, pages 11809–11822. Curran Associates, Inc. Chuanyue Yu, Kuo Zhao, Yuhan Li, Heng Chang, Mingjian Feng, Xiangzhe Jiang, Yufei Sun, Jia Li, Yuzhi Zhang, Jianxin Li, and 1 others. 2026. Graphrag-r1: Graph retrieval-augmente...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[11]
**Identify the Assassin’s Creed Games**: List all the major Assassin’s Creed games released so far
-
[12]
**Research Each Game’s Setting**: For each game on your list, find out where the game takes place
-
[13]
**Identify the Game Set in London**: From the research done in step 2, identify which game is set in the city of London
-
[14]
Assassin’s Creed IV: Black Flag
**Provide the Answer**: Once you have iden- tified the correct game, provide the answer to the question. Following these steps will lead you to the correct answer. Figure 4: Step-by-step reasoning for Fig. 1 question. Answer First The major Assassins Creed game set in the city of London is "Assassin’s Creed IV: Black Flag" with a significant side mission ...
2014
-
[15]
The idea is simple: given a question, retrieve the top-k documents from a knowledge base that are relevant to it
Standard RAG(Lewis et al., 2020): This is the original RAG method that is used as a fixed baseline in any related study. The idea is simple: given a question, retrieve the top-k documents from a knowledge base that are relevant to it. These documents are included in the prompt for the model (LLM/SLM) as grounding context for the question. While this showe...
2020
-
[16]
The idea behind doing this was to create a knowl- edge graph that broadly covers as much infor- mation as possible
HyperGraphRAG(Luo et al., 2025b): In- spired by GraphRAG (Peng et al., 2025), this method constructs ahypergraph, i.e., a graph where edges connect multiple vertices. The idea behind doing this was to create a knowl- edge graph that broadly covers as much infor- mation as possible. However, in our experi- ments, we noticed that this method is incred- ibly...
2025
-
[17]
GraphAnchor(Liu et al., 2026): Another variant of GraphRAG. Here, the framework uses anevolvingknowledge-graph, i.e., at each step, a small, focused knowledge-graph, based on the retrieved documents, is devel- oped, which keeps track of the entities and relationships in them. This graph is used to augment the retrieved documents and provide a relational u...
2026
-
[18]
PruneRAG(Jiao et al., 2026): Grounded in tree-based reasoning(Yao et al., 2023), i.e., exploring multiple search spaces to determine an answer, PruneRAG performs three opera- tions: (i) query decomposition and tree con- 13 struction, based on a model’s reasoning us- ing retrieved context; (ii) backtracking on the branches to aggregate evidence, (iii) toke...
2026
-
[19]
opti- mal
RT-RAG(Reasoning Tree Guided RAG) (Shi et al., 2026): Similar to PruneRAG, RT-RAG also performs tree-style reasoning. It gener- ates multiple candidate trees, selects the “opti- mal” one based on tree statistics such as depth and number of nodes, and, finally, gathers evi- dence based on the optimal tree to provide an answer
2026
-
[20]
cognitive outline
PAGER(Li et al., 2026): This method first creates a “cognitive outline” (called “page”) of what pieces of information a model needs to answer a question. The page has “slots” which are placeholders for potentially re- quired data. These slots are progressively filled with iterative retrieval. Finally, once the entire page has been populated, the model pro...
2026
-
[21]
Search-o1(Li et al., 2025a): Inspired by Agentic RAG (Singh et al., 2025), i.e., let- ting an LLM-agent dictate what to retrieve instead of providing it beforehand with top-k documents. Search-o1 differs from the others in two ways: (i) performingwebsearches to retrieve background data, instead of using a pre-defined knowledge index, and (ii) includ- ing ...
2025
-
[22]
Provide ONLY the specific answer
-
[23]
The answer is
Do not include introductory phrases (e.g., "The answer is..."), explanations, or context
-
[24]
The answer must be wrapped in <answer> tags inside an <output> block
-
[25]
Output ONLY the <output> block. ### Example: <input> Question: What is the capital of France? </input> <output> <answer>Paris</answer> </output> ### Task: Process the following input and provide the answer within an <output> block. Figure 6: System-I answer prompt. F Evaluation Details All frameworks are evaluated using the Hugging- Face implementation (h...
2018
-
[26]
Format: Each triple must be enclosed in <triple> tags using the structure: <triple>Subject | predicate_link | Object</triple>
-
[27]
X’s lead singer
Logical Depth: Do not skip steps. If the question involves a specific role or relationship (e.g., "X’s lead singer" or "Y’s director"), you MUST identify that specific individual as a separate node before linking them to the final answer
-
[28]
Chain of Reasoning: The sequence must form a step-by-step path where the Object of one triple leads to the Subject of the next
-
[29]
Predicate Style: Use concise, lowercase, snake_case for predicates
-
[30]
Do not include introductory text or explanations
Strict Output: Provide ONLY the <output> block. Do not include introductory text or explanations. ### Examples: <input> Question: Where was the lead singer of the band Queen born? Answer: Stone Town, Zanzibar </input> <output> <triple>Queen | has_lead_singer | Freddie Mercury</triple> <triple>Freddie Mercury | born_in | Stone Town, Zanzibar</triple> </out...
-
[35]
· · · To answer this question, let’s break down the information provided: 1.Justin Bannan played college football at Colorado
The Carroll Fighting Saint football program represents Carroll College of Helena, Montana in college football ... · · · To answer this question, let’s break down the information provided: 1.Justin Bannan played college football at Colorado
-
[36]
From the given information, we can infer that the current team representing Colorado is part of the Big 12 Conference, as Colorado is a member of the Big 12 Conference
We need to find out which conference the current team representing Colorado is a member of. From the given information, we can infer that the current team representing Colorado is part of the Big 12 Conference, as Colorado is a member of the Big 12 Conference. Therefore, the answer is the Big 12 Conference. <final_answer> Big 12 Conference </final_answer>...
-
[38]
Justin Harper (born February 24, 1985 in Catawba, North Carolina) is an American college football coach and former professional Canadian football wide receiver
1985
-
[39]
Timothy Joseph "Tim" Landis (born July 13, 1964) is an American football coach who is currently quarterbacks coach and special teams coordinator at Lycoming College
1964
-
[41]
· · · <think> Based on the question, we need to identify the current conference of the team for which Justin Bannan played college football
The Carroll Fighting Saint football program represents Carroll College of Helena, Montana in college football ... · · · <think> Based on the question, we need to identify the current conference of the team for which Justin Bannan played college football. From the given documents, we can find the following information: -Justin Bannan played college footbal...
-
[42]
He was drafted by the Buffalo Bills in the fifth round of the 2002 NFL Draft
Justin Lewis Bannan (born April 18, 1979) is a former American football defensive tackle. He was drafted by the Buffalo Bills in the fifth round of the 2002 NFL Draft. He played college football at Colorado
1979
-
[43]
Benjamin "Bennie" Oosterbaan (February 4, 1906 – October 25, 1990)
1906
-
[44]
Timothy Joseph "Tim" Landis
-
[45]
Justin Harper (born February 24, 1985 in Catawba, North Carolina) is
1985
-
[46]
Judson Albert "Jud" Timm (August 28, 1906 – December 23, 1994)
1906
-
[47]
The Trenton Lightning were an indoor
-
[48]
The 2013 Texas Longhorns football team
2013
-
[49]
It was also the final season as members the Big 12 Conference in the North Division for Colorado, before joining the Pac-12 Conference for the 2011 season
Colorado Buffaloes ... It was also the final season as members the Big 12 Conference in the North Division for Colorado, before joining the Pac-12 Conference for the 2011 season
2011
-
[50]
Roy Randolph Roundtree
-
[51]
· · · Let’s break down the information step-by-step:
The 2015 Wisconsin Badgers football ... · · · Let’s break down the information step-by-step:
2015
-
[52]
From the provided information, we know that Justin Bannan played college football for the Colorado Buffaloes
**Identify the Team**: The question asks about the team for which Justin Bannan played college football. From the provided information, we know that Justin Bannan played college football for the Colorado Buffaloes
-
[53]
From the provided information, we know that the Colorado Buffaloes were members of the Big 12 Conference until the 2011 season when they joined the Pac-12 Conference
**Determine the Current Conference**: The question asks about the current conference of the team for which Justin Bannan played college football. From the provided information, we know that the Colorado Buffaloes were members of the Big 12 Conference until the 2011 season when they joined the Pac-12 Conference. Therefore, the current conference of the tea...
2011
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.