When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Ali Shiraee Kasmaee; Hamidreza Mahyar; Mahdi Astaraki; Mohammad Arshi Saloot; Soheila Samiee

arXiv: 2601.19827 · v3 · submitted 2026-01-27 · cs.CL · cs.AI · cs.IR

Pith reviewed 2026-05-16 10:28 UTC · model grok-4.3

keywords: Iterative RAG · Multi-hop Question Answering · Scientific QA · Retrieval-Augmented Generation · LLM evaluation · Context overload · Hypothesis refinement

The pith

Iterative retrieval-reasoning loops outperform supplying all ideal evidence at once for multi-hop scientific questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in chemistry multi-hop question answering, an iterative process of retrieving evidence, refining hypotheses, and stopping based on evidence beats the performance of giving models all the correct evidence upfront. This challenges the assumption that more context is always better, showing that the way retrieval and reasoning are synchronized matters more than the quality of evidence alone. For models not specially fine-tuned on reasoning tasks, the gains are particularly large, up to 25.6 percentage points. The finding matters because it provides diagnostics for why RAG systems fail in specialized domains and suggests practical ways to improve them without additional training.

Core claim

Using the ChemKGMultiHopQA dataset, the study finds that across eleven LLMs, the Iterative RAG regime, which alternates retrieval, hypothesis refinement, and evidence-aware stopping, consistently outperforms the Gold Context regime where all oracle evidence is provided at once, with improvements reaching 25.6 percentage points especially in non-reasoning fine-tuned models. This occurs because staged retrieval reduces late-hop failures, mitigates context overload, and allows dynamic correction of early hypothesis drift, even though remaining issues like incomplete coverage and distractor latching persist.

What carries the argument

The training-free Iterative RAG controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping to synchronize retrieval and reasoning.
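The alternation described above can be sketched as a short control loop. This is a minimal illustration, not the authors' implementation: the names `iterative_rag`, `Trace`, `toy_retrieve`, and `toy_refine` are hypothetical, the default threshold and step cap are illustrative assumptions, and the toy corpus stands in for a real retriever and LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trace:
    answer: str
    steps: int
    evidence: List[str] = field(default_factory=list)


def iterative_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], Tuple[str, float, str]],
    confidence_threshold: float = 0.8,  # assumed value, for illustration only
    max_steps: int = 5,                 # assumed value, for illustration only
) -> Trace:
    evidence: List[str] = []
    query = question
    hypothesis = ""
    for step in range(1, max_steps + 1):
        # Retrieval: fetch passages for the current query, keeping only new ones.
        for passage in retrieve(query):
            if passage not in evidence:
                evidence.append(passage)
        # Hypothesis refinement: update the answer, its confidence, and the next query.
        hypothesis, confidence, query = refine(question, evidence)
        # Evidence-aware stopping: finalize once confidence is high enough.
        if confidence >= confidence_threshold:
            return Trace(hypothesis, step, evidence)
    # Step budget exhausted: return the last hypothesis.
    return Trace(hypothesis, max_steps, evidence)


# Hypothetical two-hop corpus and stand-ins for the retriever and the LLM.
CORPUS = {
    "compound X": ["Compound X is synthesized from precursor Y."],
    "precursor Y": ["Precursor Y was first isolated by chemist Z."],
}


def toy_retrieve(query: str) -> List[str]:
    return CORPUS.get(query, [])


def toy_refine(question: str, evidence: List[str]) -> Tuple[str, float, str]:
    text = " ".join(evidence)
    if "chemist Z" in text:
        return "chemist Z", 0.9, ""
    if "precursor Y" in text:
        return "precursor Y", 0.4, "precursor Y"
    return "", 0.0, "compound X"


trace = iterative_rag(
    "Who first isolated the precursor of compound X?", toy_retrieve, toy_refine
)
print(trace.answer, trace.steps)  # chemist Z 3
```

In this toy run the first step retrieves nothing, the second surfaces the intermediate entity, and the third reaches the answer, which mirrors the idea that each hop's evidence arrives only after the previous hypothesis has been formed.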

If this is right

Staged retrieval reduces late-hop failures compared to static evidence provision.
Staging also mitigates context overload, which can occur even with perfect evidence.
Staging enables dynamic correction of early hypothesis drift during the process.
The diagnostics offer practical guidance for deploying and diagnosing RAG systems in specialized scientific settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This implies that for domain-specific multi-hop tasks, iterative strategies may be more effective than relying on static oracles even when perfect evidence is available.
Similar patterns could be tested in other scientific domains like biology or physics to see if staged retrieval provides comparable advantages.
Training methods could focus on improving stopping calibration and composition fidelity to further boost iterative RAG performance.

Load-bearing premise

That providing all oracle evidence at once truly represents an idealized upper bound without incurring hidden costs like context overload or distractor interference.

What would settle it

An experiment on the same ChemKGMultiHopQA dataset where Gold Context is adjusted to reduce overload, for instance by staging or shortening the evidence, and Iterative RAG no longer shows accuracy gains.
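One way to picture such a control is to hold the oracle passages fixed and vary only how they are presented. The sketch below is an assumption about how that could be set up, not something the paper reports: `concatenated_prompt`, `staged_prompts`, and `token_count` are hypothetical helpers, and whitespace splitting is only a crude proxy for a model tokenizer.

```python
from typing import List


def concatenated_prompt(question: str, passages: List[str]) -> str:
    """All gold passages supplied at once, as in the Gold Context regime."""
    return "\n".join(passages) + "\nQuestion: " + question


def staged_prompts(question: str, passages: List[str]) -> List[str]:
    """The same passages revealed one hop at a time (hypothetical control)."""
    return [
        "\n".join(passages[:k]) + "\nQuestion: " + question
        for k in range(1, len(passages) + 1)
    ]


def token_count(text: str) -> int:
    # Whitespace tokens: a rough stand-in for a real tokenizer.
    return len(text.split())


passages = ["A reacts with B.", "B is derived from C."]
question = "What is A's reaction partner derived from?"

gold = concatenated_prompt(question, passages)
staged = staged_prompts(question, passages)

# The final staged prompt carries exactly the evidence of the gold prompt,
# so any accuracy difference would come from presentation, not content.
same_final_evidence = staged[-1] == gold
evidence_tokens = sum(token_count(p) for p in passages)
```

Reporting `evidence_tokens` per condition alongside accuracy would show whether the two arms really see the same amount of evidence.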

Figures

Figures reproduced from arXiv: 2601.19827 by Ali Shiraee Kasmaee, Hamidreza Mahyar, Mahdi Astaraki, Mohammad Arshi Saloot, Soheila Samiee.

**Figure 2.** Models’ accuracy. Distribution of models’ accuracy in three setups: No Context (parametric memory), Gold Context, and Iterative retrieval and reasoning, shown in blue, red, and green, respectively. Horizontal bars show the results of the t-test analysis (significance: *** p<0.001, ** p<0.01). When isolating the specific gains from Gold Context to Iterative RAG, we observe distinct behaviors. On a…

**Figure 3.** Partition of Solvability. This heatmap classifies correct answers by the necessary condition for success: internal knowledge (Parametric), static evidence (Gold-Dependent), or dynamic retrieval (Iterative-Exclusive).

**Figure 4.** Recoveries vs. Regressions from Gold Context to Iterative RAG. Green bars count recoveries (Gold incorrect → Iterative correct) and red bars count regressions (Gold correct → Iterative incorrect) per model; the black line (top) and the condensed panel (bottom) show the net gain in question count (= recoveries − regressions). The plot quantifies iteration’s overall benefit: models like GPT-4o and Llama 3.3 In…

**Figure 5.** Parametric Suppression Rate (PSR). The plot illustrates the proportion of questions answered correctly in the No Context setting that are suppressed (answered incorrectly) in Iterative RAG. Mistral Large 2402 exhibits the highest suppression rate (14.1%), indicating a strong tendency to prioritize retrieved noise over correct internal weights. In contrast, Claude 3.7 Sonnet is highly robust (2.7%), effecti…

**Figure 6.** Unanswered questions by all models in different setups, where model relying on Parametric…

**Figure 7.** Anchor Carry-Drop Rate by Step. A universal spike at Step 2 indicates a "correction pivot," where models discard weak initial hypotheses. The subsequent decline in Steps 3–5 demonstrates reasoning convergence as the controller locks onto the correct entity chain. 5.1.2 Step-Count Distribution and Model Strategy This active control over the reasoning path results in distinct step-count distributions, reveal…

**Figure 8.** Performance by finalized retrieval step on questions failed by the Gold Context setup. Stacked bars (right y-axis) show the number of questions that finalized at each step, colored by oracle hop depth (1–4). The green solid line (left y-axis) reports Iterative RAG accuracy on exactly those questions. planner stopped and produced an answer. The green solid line is the Iterative RAG accuracy for the subset of q…

**Figure 9.** Impact of Retrieval Coverage Gap on models.

**Figure 10.** Sufficiency-Coverage Interaction. The horizontal axis reflects the sufficiency score, the vertical axis reflects the hop coverage in retrieval, and the color illustrates the average accuracy across evaluated models in answering the questions with a knowledge gap. The heatmap highlights the “dangerous zone” (low coverage, low sufficiency) where accuracy is lowest (30.6%). Holding sufficiency fixed, improving …

**Figure 11.** Miscalibration Analysis on questions with incorrect answers in the No Context setup. a) Accuracy Impact. Average accuracy across all models in three states: well-calibrated, under-confident, and over-confident. Overconfidence causes a significant drop in accuracy (p = 0.0022, OverConfident vs UnderConfident), while underconfidence primarily affects efficiency. b) Shift by Hop Depth. Stacked bar charts showi…

**Figure 12.** Composition Failure Rates. The percentage of incorrect answers where the correct evidence was…

**Figure 13.** Analysis of Distractor Latch Effects. (a) The catastrophic global impact on accuracy. The presence of a distractor latch imposes a massive penalty (∼53.9 percentage points). (b) The varying frequency with which models are affected by this failure mode. While all models are susceptible, non-reasoning models like Mistral Large are trapped nearly twice as often as GPT-5. 5.3 Efficiency Analysis Beyond raw ac…

**Figure 14.** Iterative RAG Cost vs Accuracy per model.

**Figure 15.** The Trade-off between Adaptivity and Predictability. Models are plotted by their Token Scaling Factor (X-axis) against their Token Usage Consistency (Y-axis). A clear diagonal trend emerges: aggressive adapters (e.g., Grok 4 Fast, GPT-5) incur a high "volatility tax" (CV > 65%). Conversely, "Rigid Executors" (e.g., Llama 3.3, GPT-4o) remain highly predictable (CV < 30%) but lack dynamic scaling. Claude So…
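Several of the captions above (Figures 3 to 5) describe diagnostics that can be computed from per-question correctness under the three regimes. The sketch below shows one plausible reading of those definitions. The `Outcome` record, the function names, the precedence order in `partition`, and the toy data are assumptions for illustration and are not taken from the paper.

```python
from typing import List, NamedTuple, Tuple


class Outcome(NamedTuple):
    no_context: bool  # correct from parametric memory alone
    gold: bool        # correct with all oracle evidence at once
    iterative: bool   # correct under Iterative RAG


def partition(o: Outcome) -> str:
    """Assign a question to the weakest condition that already solves it (assumed rule)."""
    if o.no_context:
        return "Parametric"
    if o.gold:
        return "Gold-Dependent"
    if o.iterative:
        return "Iterative-Exclusive"
    return "Unsolved"


def recoveries_regressions(outcomes: List[Outcome]) -> Tuple[int, int, int]:
    """Recoveries (Gold wrong, Iterative right), regressions, and net gain."""
    recoveries = sum(1 for o in outcomes if not o.gold and o.iterative)
    regressions = sum(1 for o in outcomes if o.gold and not o.iterative)
    return recoveries, regressions, recoveries - regressions


def parametric_suppression_rate(outcomes: List[Outcome]) -> float:
    """Share of No Context successes that become wrong under Iterative RAG."""
    known = [o for o in outcomes if o.no_context]
    if not known:
        return 0.0
    return sum(1 for o in known if not o.iterative) / len(known)


# Made-up outcomes for six questions from a single model.
outcomes = [
    Outcome(True, True, True),
    Outcome(True, True, False),
    Outcome(False, True, True),
    Outcome(False, False, True),
    Outcome(False, False, True),
    Outcome(False, False, False),
]

labels = [partition(o) for o in outcomes]
rec, reg, net = recoveries_regressions(outcomes)
psr = parametric_suppression_rate(outcomes)
print(rec, reg, net, psr)  # 2 1 1 0.5
```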

Original abstract

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Iterative RAG beats gold context on this chemistry multi-hop set mainly because staged retrieval avoids overload that static concatenation can trigger.

The headline result is that iterative retrieval-reasoning beats supplying all oracle evidence at once, with gains of up to 25.6 percentage points across eleven models on ChemKGMultiHopQA. The paper isolates this on questions that need genuine hops and tracks specific failure types such as coverage gaps, early drift, and stopping errors. That comparison is the first controlled one of its kind for scientific multi-hop QA, and the diagnostics are the useful part. They show that staged loops help non-reasoning models correct course and reduce late-hop drops, while even perfect retrieval still leaves high composition failure rates. The work is straightforward empirical benchmarking with no fitted parameters or circular claims.

The soft spot is the Gold Context condition itself. Concatenating all passages can produce attention dilution or effective overload, which the iterative controller sidesteps by feeding evidence in stages. The abstract notes overload mitigation but gives no token counts, hop-wise lengths, or ablation that holds total evidence fixed while changing presentation. Without those checks, the advantage may be partly an artifact of the baseline rather than a pure reasoning gain.

This is for people who build or debug RAG pipelines in sparse scientific domains. A reader who needs concrete failure breakdowns for multi-hop systems will get practical takeaways. It deserves a serious referee because the pattern is consistent and the question matters for deployment, though the methods section will need to address the context-length controls.

Referee Report

2 major / 2 minor

Summary. The paper conducts a diagnostic study on scientific multi-hop question answering using the ChemKGMultiHopQA dataset. It compares three regimes—No Context, Gold Context (all oracle evidence provided at once), and Iterative RAG (a training-free iterative retrieval-reasoning controller)—across eleven LLMs. The key claim is that Iterative RAG consistently outperforms Gold Context, achieving gains of up to 25.6 percentage points, particularly for non-reasoning fine-tuned models, by reducing late-hop failures, mitigating context overload, and enabling dynamic correction of hypothesis drift. The study includes detailed failure diagnostics on retrieval coverage, anchor-carry drop, query quality, composition fidelity, and control calibration.

Significance. This work is significant for the field of retrieval-augmented generation in specialized domains. If the results hold, it demonstrates that the staging of retrieval and reasoning can be more influential than the mere availability of ideal evidence, challenging the assumption that static gold context is always optimal. The diagnostic approach provides a foundation for more reliable iterative RAG frameworks and practical guidance for scientific applications where knowledge is sparse and multi-hop reasoning is required. The consistent gains across models add to the robustness of the findings.

major comments (2)

[Experimental Setup and Results] The central claim that Iterative RAG outperforms Gold Context (up to 25.6 pp) depends on Gold Context functioning as a true idealized upper bound. However, the manuscript does not report per-condition token lengths, hop-wise evidence sizes, or a controlled ablation equalizing total evidence while varying presentation order. This leaves open whether gains arise from avoiding attention dilution or overload in long concatenated passages rather than superior retrieval-reasoning (see abstract discussion of context overload mitigation).
[Dataset and Methodology] Dataset construction details for isolating questions requiring genuine retrieval, exact stopping rules for the iterative controller, and statistical controls (e.g., significance testing or variance across the 11 models) are insufficiently specified. These are load-bearing for the cross-model consistency claim and reproducibility of the reported gains.

minor comments (2)

[Figures] Figure captions and legends should more explicitly reference the specific diagnostic metrics (e.g., anchor-carry drop, composition fidelity) discussed in the text for improved clarity.
[Abstract] The abstract could briefly note the number of questions or hops in ChemKGMultiHopQA to provide immediate scale for the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our diagnostic study of iterative RAG versus Gold Context in scientific multi-hop QA. We address each major comment point by point below, with planned revisions to improve clarity and reproducibility while preserving the core findings.

Referee: [Experimental Setup and Results] The central claim that Iterative RAG outperforms Gold Context (up to 25.6 pp) depends on Gold Context functioning as a true idealized upper bound. However, the manuscript does not report per-condition token lengths, hop-wise evidence sizes, or a controlled ablation equalizing total evidence while varying presentation order. This leaves open whether gains arise from avoiding attention dilution or overload in long concatenated passages rather than superior retrieval-reasoning (see abstract discussion of context overload mitigation).

Authors: We agree that explicit reporting of token lengths would help disambiguate length-based effects from the benefits of staged retrieval. In the revision we will add a table reporting average input token counts for No Context, Gold Context, and Iterative RAG conditions, plus hop-wise evidence sizes. Our existing diagnostics already show Iterative RAG reducing late-hop failures and correcting hypothesis drift even when retrieval coverage is high; these mechanisms are not reducible to length alone. A full ablation that equalizes total evidence while randomizing order is not present in the current experiments and would require new runs, so we will note this as a limitation and future direction rather than claim the current results fully isolate it. revision: partial

Referee: [Dataset and Methodology] Dataset construction details for isolating questions requiring genuine retrieval, exact stopping rules for the iterative controller, and statistical controls (e.g., significance testing or variance across the 11 models) are insufficiently specified. These are load-bearing for the cross-model consistency claim and reproducibility of the reported gains.

Authors: We acknowledge these specification gaps limit reproducibility. The revised manuscript will expand the dataset section to describe the exact filtering criteria used to retain only questions where No-Context performance indicates genuine retrieval need (i.e., parametric failure). We will also state the precise stopping rules (a confidence threshold of 0.8 or a maximum of 5 iterations, whichever comes first) and add per-model standard deviations plus paired statistical tests (Wilcoxon signed-rank) across the 11 LLMs to quantify consistency of the gains. These additions directly support the cross-model claims without altering the experimental design. revision: yes
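For readers unfamiliar with the paired test named in this response, the following sketch computes the Wilcoxon signed-rank sums for per-model accuracies under two regimes. It is a minimal pure-Python illustration with made-up numbers: the function name `wilcoxon_signed_rank` is hypothetical, zero differences are dropped, ties receive average ranks, and no p-value is computed. In practice a library routine such as SciPy's `scipy.stats.wilcoxon` would typically be used.

```python
from typing import List, Tuple


def wilcoxon_signed_rank(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Return (W_plus, W_minus) for paired samples x and y."""
    # Paired differences, discarding pairs with no change.
    diffs = [b - a for a, b in zip(x, y) if b != a]
    # Indices ordered by absolute difference.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Extend j over the run of tied absolute differences.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        # Average of the 1-based ranks i+1 .. j+1.
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Made-up accuracies (percent) for four models; not figures from the paper.
gold_acc = [50, 60, 55, 70]
iterative_acc = [60, 65, 54, 85]

w_plus, w_minus = wilcoxon_signed_rank(gold_acc, iterative_acc)
print(w_plus, w_minus)  # 9.0 1.0
```

A large `w_plus` relative to `w_minus` indicates that most models improve, and that the improvements are larger in magnitude than the declines.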

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking with no derivations or fitted predictions

The paper is a controlled empirical study comparing three RAG regimes (No Context, Gold Context, Iterative RAG) on the ChemKGMultiHopQA dataset across eleven LLMs. It reports performance metrics, diagnostics for retrieval coverage, hypothesis drift, and failure modes without any equations, parameter fitting, predictions derived from inputs, or load-bearing self-citations. The central claim (Iterative RAG outperforming Gold Context) rests on direct experimental measurements rather than any reduction to prior quantities by construction. No self-definitional loops, ansatz smuggling, or renaming of known results occur. The study is self-contained against external benchmarks and does not invoke uniqueness theorems or author-prior results as justification for its methodology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work is an empirical comparison using existing LLMs and a named dataset.

pith-pipeline@v0.9.0 · 5621 in / 947 out tokens · 18805 ms · 2026-05-16T10:28:06.222302+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems
cs.AI 2026-05 unverdicted novelty 6.0

HEAR uses a stratified hypergraph ontology to orchestrate evidence-driven multi-hop reasoning over heterogeneous business systems, reaching 94.7% accuracy on supply-chain root-cause tasks with open-weight models.


work page 2025
[39]

continues the dense line with stronger longer-horizon control; its SFT variants try to ensure that spending more tokens yields helpful intermediate structure rather than drift Anthropic (2025b). xAI’sGrokevolved from an MoE phase (e.g., the 314B Grok-1) toward later, reasoning-oriented post-training on streamlined backbones; instruction SFT and preference...

work page 2025
[40]

extended thinking

extends a pragmatic dense family for enterprise cod- ing/reasoning; instruction SFT supports the usual test-time levers (few-shot conditioning, short intermediate reasoning, and measured benefits from tidy retrieval). Overall, three overlapping currents are relevant to this work: (i)Alignment-first dense transformers(e.g., Mistral Large 2402; Llama 3.3 70...

work page 2025

[1]

Cognitive load limits in large language models: Benchmarking multi-hop reasoning

Sai Teja Reddy Adapala. Cognitive load limits in large language models: Benchmarking multi-hop reasoning. arXiv preprint arXiv:2509.19517,

[2]

Llama-3.3-70b-instruct, Dec 2024a

Meta AI. Llama-3.3-70b-instruct, Dec 2024a. URL https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct. Model card; Accessed 2025-10-21. Mistral AI. Mistral large: Open-weight dense transformer. https://mistral.ai/news/mistral-large, 2024b. Released December 2024, Accessed: 2025-10-03. Mistral AI. Au large: Announcing mistral large (2402). https://mistral...

[3]

Fair-rag: Faithful adaptive iterative refinement for retrieval-augmented generation. arXiv preprint arXiv:2510.22344,

Mohammad Aghajani Asl, Majid Asgari-Bidhendi, and Behrooz Minaei-Bidgoli. Fair-rag: Faithful adaptive iterative refinement for retrieval-augmented generation. arXiv preprint arXiv:2510.22344,

[4]

Probing-rag: Self-probing to guide language models in selective document retrieval

Ingeol Baek, Hwan Chang, Byeongjeong Kim, Jimin Lee, and Hwanhee Lee. Probing-rag: Self-probing to guide language models in selective document retrieval. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 3287–3304,

[5]

Finqa: A dataset of numerical reasoning over financial data

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. Finqa: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3697–3711. Association for Computation...

[6]

Qinyuan Cheng, Xiaonan Li, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, and Xipeng Qiu

URL https://aclanthology.org/2021.emnlp-main.300/. Qinyuan Cheng, Xiaonan Li, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, and Xipeng Qiu. Unified active retrieval for retrieval augmented generation. arXiv preprint arXiv:2406.12534,

[7]

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A

URL https://aclanthology.org/2025.findings-acl.123/. Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech...

[8]

Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert

URL https://aclanthology.org/2021.naacl-main.365/. Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. RAGAs: Automated evaluation of retrieval augmented generation. In Nikolaos Aletras and Orphee De Clercq (eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrati...

[9]

The Llama 3 Herd of Models

Association for Computational Linguistics. doi: 10.18653/v1/2024.eacl-demo.16. URL https://aclanthology.org/2024.eacl-demo.16/. Meta AI et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

[10]

The Llama 3 Herd of Models

URL https://arxiv.org/abs/2407.21783. Jinyuan Fang, Zaiqiao Meng, and Craig Macdonald. Kirag: Knowledge-driven iterative retriever for enhancing retrieval-augmented generation. arXiv preprint arXiv:2502.18397,

[11]

Smartrag: Jointly learn rag-related tasks from the environment feedback

Jingsheng Gao, Linxu Li, Weiyuan Li, Yuzhuo Fu, and Bin Dai. Smartrag: Jointly learn rag-related tasks from the environment feedback. arXiv preprint arXiv:2410.18141,

[12]

Synergizing rag and reasoning: A systematic review

Yunfan Gao, Yun Xiong, Yijie Zhong, Yuxi Bi, Ming Xue, and Haofen Wang. Synergizing rag and reasoning: A systematic review. arXiv preprint arXiv:2504.15909,

[13]

Google

Documentation; Accessed 2025-10-20. Google. Gemini api release notes, Sep

[14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URL https://ai.google.dev/gemini-api/docs/changelog. Accessed 2025-10-21. Daya Guo et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. 2025a. URL https://arxiv.org/abs/2501.12948. Daya Guo et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025b. doi: 10.1038/s4...

doi:10.1038/s41586-025-09422-z

[15]

Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al

URL https://aclanthology.org/2020.coling-main.580/. Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai. Advances in Neural Information Processing Systems, 37:19209–19253,

[16]

Retrieve, summarize, plan: Advancing multi-hop question answering with an iterative approach

Zhouyu Jiang, Mengshu Sun, Lei Liang, and Zhiqiang Zhang. Retrieve, summarize, plan: Advancing multi-hop question answering with an iterative approach. In Companion Proceedings of the ACM on Web Conference 2025, pp. 1677–1686,

[17]

Chembed: Enhancing chemical literature search through domain-specific text embeddings. arXiv preprint arXiv:2508.01643,

Ali Shiraee Kasmaee, Mohammad Khodadad, Mehdi Astaraki, Mohammad Arshi Saloot, Nicholas Sherck, Hamidreza Mahyar, and Soheila Samiee. Chembed: Enhancing chemical literature search through domain-specific text embeddings. arXiv preprint arXiv:2508.01643,

[18]

Evaluating multi-hop reasoning in large language models: A chemistry-centric case study

Mohammad Khodadad, Ali Shiraee Kasmaee, Mahdi Astaraki, Nicholas Sherck, Hamidreza Mahyar, and Soheila Samiee. Evaluating multi-hop reasoning in large language models: A chemistry-centric case study. arXiv preprint arXiv:2504.16414, 2025a. Mohammad Khodadad, Ali Shiraee Kasmaee, Mahdi Astaraki, Nicholas Sherck, Hamidreza Mahyar, and Soheila Samiee. Evalua...

[19]

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Rui Li, Quanyu Dai, Zeyu Zhang, Xu Chen, Zhenhua Dong, and Ji-Rong Wen. Knowtrace: Bootstrapping iterative retrieval-augmented generation with structured knowledge tracing. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pp. 1470–1480, 2025a. Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao ...

[20]

ISBN 979-8-89176-251-0

Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.1142. URL https://aclanthology.org/2025.acl-long.1142/. Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, and Yan Tao. Globalrag: Enhancing global reasoning in multi-hop question answering v...

[21]

Md Mahadi Hasan Nahid and Davood Rafiei

URL https://arxiv.org/abs/2510.20548. Md Mahadi Hasan Nahid and Davood Rafiei. Prism: Agentic retrieval with llms for multi-hop question answering. arXiv preprint arXiv:2510.14278,

[22]

Gpt-4o: Openai’s omnimodal model. https://openai.com/index/hello-gpt-4o, 2024a

OpenAI. Gpt-4o: Openai’s omnimodal model. https://openai.com/index/hello-gpt-4o, 2024a. Released May 2024, Accessed: 2025-10-03. OpenAI. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card/, 2024b. Accessed 2025-10-20. OpenAI. Gpt-5: Advancing multimodal and long-context reasoning. https://openai.com/research,

[23]

Jaewan Park, Solbee Cho, and Jay-Yoon Lee

Released mid-2025, Accessed: 2025-10-03. Jaewan Park, Solbee Cho, and Jay-Yoon Lee. Stop-rag: Value-based retrieval control for iterative rag, 2025a. URL https://arxiv.org/abs/2510.14337. Sangwoo Park, Jinheon Baek, Soyeong Jeong, and Sung Ju Hwang. Chain of retrieval: Multi-aspect iterative search expansion and post-order search aggregation for full paper...

[24]

Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294,

[25]

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen

URL https://arxiv.org/abs/2508.03644. Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025a. Maojia Song, Renhang Liu, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou...

[26]

Fact or fiction: Verifying scientific claims

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7534–7550. Association for Computational Linguistics,

[27]

Jinyu Wang, Jingjing Fu, Rui Wang, Lei Song, and Jiang Bian

URL https://aclanthology.org/2020.emnlp-main.609/. Jinyu Wang, Jingjing Fu, Rui Wang, Lei Song, and Jiang Bian. Pike-rag: specialized knowledge and rationale augmented generation. arXiv preprint arXiv:2501.11551,

[28]

Geemi P Wellawatte, Huixuan Guo, Magdalena Lederbauer, Anna Borisova, Matthew Hart, Marta Brucka, and Philippe Schwaller

URL https://arxiv.org/abs/2504.14858. Geemi P Wellawatte, Huixuan Guo, Magdalena Lederbauer, Anna Borisova, Matthew Hart, Marta Brucka, and Philippe Schwaller. Chemlit-qa: a human evaluated dataset for chemistry rag tasks. Machine Learning: Science and Technology, 6(2):020601, 2025a. Geemi P Wellawatte, Huixuan Guo, Magdalena Lederbauer, Anna Borisova, Matthew Hart...

doi:10.1088/2632-2153/adc2d6

[29]

Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools

URL https://en.wikipedia.org/wiki/Grok_(large_language_model). Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools. arXiv preprint arXiv:2502.04644,

[30]

Grok 4 model card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf, Aug

xAI. Grok 4 model card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf, Aug

[31]

Kaige Xie, Philippe Laban, Prafulla Kumar Choubey, Caiming Xiong, and Chien-Sheng Wu

Accessed 2025-10-21. Kaige Xie, Philippe Laban, Prafulla Kumar Choubey, Caiming Xiong, and Chien-Sheng Wu. Do RAG systems cover what matters? evaluating and optimizing responses with sub-question coverage. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Comp...

[32]

ISBN 979-8-89176-189-6

Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.301. URL https://aclanthology.org/2025.naacl-long.301/. Guangzhi Xiong, Qiao Jin, Xiao Wang, Minjia Zhang, Zhiyong Lu, and Aidong Zhang. Improving retrieval-augmented generation in medicine with iterative follow-up questions. In Biocomputing 2025: Proceedin...

[33]

Activerag: Autonomously knowledge assimilation and accommodation through retrieval-augmented agents. arXiv preprint arXiv:2402.13547,

Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Chaojun Xiao, Zhiyuan Liu, Ge Yu, and Chenyan Xiong. Activerag: Autonomously knowledge assimilation and accommodation through retrieval-augmented agents. arXiv preprint arXiv:2402.13547,

[34]

O1 embedder: Let retrievers think before action. arXiv preprint arXiv:2502.07555,

Ruiran Yan, Zheng Liu, and Defu Lian. O1 embedder: Let retrievers think before action. arXiv preprint arXiv:2502.07555,

[35]

URL http://dx.doi.org/10.1145/3726302.3730018

doi: 10.1145/3726302.3730018. URL http://dx.doi.org/10.1145/3726302.3730018. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Proces...

[36]

Yilun Zhang

URL https://aclanthology.org/2021.acl-long.254/. Yilun Zhang. Cognitive load-aware inference: A neuro-symbolic framework for optimizing the token economy of large language models. arXiv preprint arXiv:2507.00653,

[37]

extended thinking

URL https://z.ai/blog/glm-4.6. Accessed 2025-10-21.

S1 Appendix

S1.1 Extended Literature Review

The 2024–2025 period brought a steady run of dense-transformer LLMs, with some Mixture-of-Experts (MoE) experiments. In February 2024, Mistral released Mistral Large, a conventional dense transformer tuned with instruction Supervised Fine-tuning (SFT) for predic...

[38]

Anthropic’s Claude 4.5 Sonnet (Sept

stays dense and SFT+preference tuned; the goal is smoother, less jumpy gains when users add exemplars, reasoning tokens, or clean retrieval OpenAI (2025). Anthropic’s Claude 4.5 Sonnet (Sept

[39]

continues the dense line with stronger longer-horizon control; its SFT variants try to ensure that spending more tokens yields helpful intermediate structure rather than drift Anthropic (2025b). xAI’s Grok evolved from an MoE phase (e.g., the 314B Grok-1) toward later, reasoning-oriented post-training on streamlined backbones; instruction SFT and preference...

[40]

extended thinking

extends a pragmatic dense family for enterprise coding/reasoning; instruction SFT supports the usual test-time levers (few-shot conditioning, short intermediate reasoning, and measured benefits from tidy retrieval). Overall, three overlapping currents are relevant to this work: (i) Alignment-first dense transformers (e.g., Mistral Large 2402; Llama 3.3 70...