SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents

Keivan Rezaei; Mehrdad Saberi; Soheil Feizi

arxiv: 2605.21965 · v1 · pith:APFNGIEMnew · submitted 2026-05-21 · 💻 cs.CL

Pith reviewed 2026-05-22 06:46 UTC · model grok-4.3

classification 💻 cs.CL

keywords: speculation · multi-hop retrieval · latency reduction · tool use · asynchronous verification · speculative threads · lossless acceleration

The pith

SpecHop accelerates multi-hop retrieval by running multiple speculative threads that verify predictions asynchronously and roll back errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models that rely on external tools for multi-step questions incur repeated waits for each tool result. The paper builds a theoretical model that calculates the maximum latency savings possible when faster but approximate speculator tools guess future observations in advance. SpecHop puts this into practice by keeping several speculative threads active at once, checking their guesses against real tool outputs as they arrive, committing the correct path, and discarding the rest. This keeps the model's final answer identical to the unaccelerated version while shortening total elapsed time. Experiments on retrieval-augmented multi-hop benchmarks show the approach tracks the theoretical optimum and delivers up to 40 percent lower latency in tested cases.

Core claim

We develop a theoretical framework for lossless speculation in multi-hop tool-use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while reducing wall-clock latency. We show that SpecHop can approach oracle latency gains with enough active threads.

What carries the argument

Continuous speculation via multiple active threads that perform asynchronous verification of predicted observations, commit correct branches, and roll back incorrect ones.
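
The commit-and-rollback loop can be sketched with `asyncio`. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy lookup chain (`GRAPH`), the speculator's guesses (`GUESSES`), the function names, and exact-equality verification are all hypothetical stand-ins, and the loop verifies only the head thread after topping up the chain, where a full implementation would verify as soon as each target output arrives.

```python
import asyncio

# Hypothetical toy world: each hop looks up the next entity in a chain.
GRAPH = {"q": "a", "a": "b", "b": "c", "c": "d"}
# Hypothetical speculator answers; the guess for "a" is wrong (truth is "b").
GUESSES = {"q": "a", "a": "c", "b": "c", "c": "d"}

def next_action(query, trace):
    """Deterministic stand-in for the generator model M."""
    action = query if not trace else trace[-1][1]
    return action if action in GRAPH else None

async def target(action):
    await asyncio.sleep(0.02)   # slow, authoritative tool T
    return GRAPH[action]

async def speculator(action):
    await asyncio.sleep(0.001)  # fast, unreliable tool S
    return GUESSES.get(action)

async def run_standard(query, max_hops=8):
    """Baseline: wait for every target observation before continuing."""
    trace = ()
    while len(trace) < max_hops:
        action = next_action(query, trace)
        if action is None:
            break
        trace += ((action, await target(action)),)
    return trace

async def run_spechop(query, k, max_hops=8):
    """Speculative run with at most k active threads."""
    verified = ()  # committed (action, observation) pairs
    chain = []     # active threads in hop order: (action, guess, target_task)
    while True:
        # Top up the chain: each new thread builds on the previous guess.
        prefix = verified + tuple((a, g) for a, g, _ in chain)
        while len(chain) < k and len(prefix) < max_hops:
            action = next_action(query, prefix)
            if action is None:
                break
            task = asyncio.create_task(target(action))
            guess = await speculator(action)
            chain.append((action, guess, task))
            prefix += ((action, guess),)
        if not chain:
            return verified
        # Verify the head thread against the real target output.
        action, guess, task = chain.pop(0)
        obs = await task
        verified += ((action, obs),)
        if obs != guess:
            # Rollback: every later thread was built on a wrong observation.
            for _, _, pending in chain:
                pending.cancel()
            chain.clear()
```

Calling `asyncio.run(run_spechop("q", k=3))` returns the same tuple of (action, observation) pairs as `asyncio.run(run_standard("q"))`, even though the speculator is wrong on the second hop. Only verified observations are ever appended to `verified`, which is what makes the sketch lossless.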

If this is right

With enough active threads SpecHop approaches oracle latency gains.
Empirical latency on retrieval-augmented multi-hop tasks closely matches theoretical predictions.
Wall-clock latency falls by up to 40 percent in some settings while the final trajectory stays unchanged.
The method enables lossless acceleration of multi-hop tool-use trajectories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same thread-based verification pattern could apply to any sequential agent loop that incurs costly external calls, such as API chains or database traversals.
Better fast speculators would directly raise the achievable speedup ceiling beyond the 40 percent observed here.
Embedding the mechanism into existing LLM inference servers could make low-latency multi-hop agents practical for interactive use.

Load-bearing premise

Faster but less reliable speculator tools exist that can generate predicted observations suitable for later verification against real tool outputs.
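
One way to picture such a speculator is a small cache in front of the full index, in the spirit of the cache-based speculator that the Figure 3 caption describes. The sketch below is a toy under loud assumptions: the index is an exact-match dictionary, the cache is a random subset, and `build_index`, `make_cache`, `speculate`, and `success_rate` are hypothetical names, not the paper's E5-cache.

```python
import random

def build_index(n):
    """Hypothetical full index: the slow, authoritative target tool."""
    return {f"query-{i}": f"doc-{i}" for i in range(n)}

def make_cache(index, fraction, seed=0):
    """Keep a random fraction of the index as a fast speculator."""
    rng = random.Random(seed)
    keys = rng.sample(sorted(index), int(len(index) * fraction))
    return {key: index[key] for key in keys}

def speculate(cache, query):
    """Return a predicted observation, or None when the cache has no guess."""
    return cache.get(query)

def success_rate(index, cache, queries):
    """Fraction of queries whose guess matches the target tool's output."""
    hits = sum(speculate(cache, q) == index[q] for q in queries)
    return hits / len(queries)

index = build_index(200)
cache = make_cache(index, fraction=0.25)
p_hat = success_rate(index, cache, list(index))
```

Under these assumptions the empirical success rate equals the cache fraction when queries are spread uniformly over the index. A real retrieval cache would return approximate results and would be checked by a verifier rather than by exact equality.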

What would settle it

Measure latency and accuracy on a fixed multi-hop retrieval benchmark while steadily increasing the number of active speculative threads; if latency converges to the theoretical oracle bound without accuracy loss, the central claim holds.
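
The thread sweep can be prototyped offline before running a real benchmark. The simulation below uses a toy latency model that is an assumption of this sketch and not the paper's theoretical framework: every target call takes one time unit, every speculator call takes `alpha` units, the generator is free, each guess is correct independently with probability `p`, and a new thread may start only while fewer than `k` hops are unverified.

```python
import random

def simulate_latency(outcomes, alpha, k):
    """Total latency of one trajectory under the toy model.

    outcomes[i] is True when the speculator's guess for hop i is correct.
    """
    starts, commits = [0.0], []
    for i, correct in enumerate(outcomes):
        commits.append(starts[i] + 1.0)  # target output for hop i arrives
        if i + 1 == len(outcomes):
            break
        if correct:
            # Next hop starts once the guess is ready and a thread slot is free.
            slot_free = commits[i + 1 - k] if i + 1 - k >= 0 else 0.0
            starts.append(max(starts[i] + alpha, slot_free))
        else:
            # Wrong guess: the next hop restarts when the true output arrives.
            starts.append(commits[i])
    return commits[-1]

def rel_latency(p, alpha, k, hops=4, trials=2000, seed=0):
    """Average latency relative to the sequential baseline (hops time units)."""
    rng = random.Random(seed)  # same seed, so every k sees the same outcomes
    total = 0.0
    for _ in range(trials):
        outcomes = [rng.random() < p for _ in range(hops)]
        total += simulate_latency(outcomes, alpha, k)
    return total / (trials * hops)

sweep = {k: rel_latency(p=0.8, alpha=0.2, k=k) for k in (1, 2, 3, 4, 8)}
```

With `k = 1` the model reduces to the sequential baseline, so the relative latency is exactly 1.0. Because the same outcomes are replayed for every `k`, the values in `sweep` never increase as `k` grows. In this toy model they stop changing once `k` reaches the number of hops.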

Figures

Figures reproduced from arXiv: 2605.21965 by Keivan Rezaei, Mehrdad Saberi, Soheil Feizi.

**Figure 1.** Overview of SpecHop with continuous speculative execution, maintaining k = 3 active threads on a query requiring 4 external calls. Initially, T1 calls the target tool for (a1, o1). In parallel, T2 speculates the first observation as ô1 and continues to the next hop, where it calls the target tool for (a2, o2); T3 is created similarly. When o1 returns, the verifier V compares it with ô1. Since verificatio…

**Figure 2.** The effect of active thread limit (k) on relative latency (RelLat) and computational cost (average number of calls to the target tool T, speculator S, and generator model M). The evaluation setting uses GPT-4o as the speculator (S) and Web Search as the target tool (T). The dashed line is the theoretical optimal relative latency (RelLat*). The “Standard” setting refers to not using SpecHop. Additional t…

**Figure 3.** Performance of SpecHop using a fast E5-cache as the speculator (S) on the 2WikiMultihopQA dataset. The cache size varies from 5% to 25% of the full index. The plots show the empirical success probability (p̂), relative speculator latency (α̂), and resulting relative latency (RelLat) when accelerating both the full E5 retriever and Web Search target tools (T). SpecHop ensures that no unauthorized deviatio…

**Figure 4.** The effect of the active thread limit (k) on relative latency (RelLat) and computational cost across various extended configurations. The plots track the average number of calls to the target tool (T), speculator (S), and generator model (M). The first two rows utilize CoRAG as the standard generator model, exploring different combinations of speculators and target tools. The final row demonstrates the k-…

**Figure 5.** Empirical distributions of hop-level latency components and resulting relative latency.

Original abstract

Large language models increasingly use external tools such as web search and document retrieval to solve information-intensive tasks. However, multi-hop tool use in complex tasks introduces substantial latency, since the model must repeatedly wait for tool observations before continuing. We study how to accelerate such trajectories without changing the final trajectory the model would have taken without acceleration, assuming access to faster but less reliable speculator tools. We develop a theoretical framework for lossless speculation in multi-hop tool-use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while reducing wall-clock latency. We show that SpecHop can approach oracle latency gains with enough active threads. Empirically, on retrieval-augmented multi-hop tasks, SpecHop closely matches theoretical predictions and reduces latency by up to 40% in some settings. Code: https://github.com/mehrdadsaberi/spechop

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpecHop, a continuous speculation framework to accelerate multi-hop tool-use trajectories in LLMs without altering the final output trajectory. It assumes access to faster but less reliable speculator tools that produce predicted observations for later verification against target tool outputs. The work develops a theoretical framework characterizing optimal achievable latency gains under lossless speculation, introduces an algorithm that maintains multiple speculative threads with asynchronous verification, commit of correct branches, and rollback of incorrect ones, and reports that SpecHop can approach oracle latency gains with sufficient threads. Empirically, on retrieval-augmented multi-hop tasks, it closely matches theoretical predictions and achieves up to 40% latency reduction in some settings. The code is released at the provided GitHub link.

Significance. If the lossless trajectory preservation holds, the approach could meaningfully reduce wall-clock latency for retrieval-augmented agents while preserving accuracy, addressing a key practical bottleneck. Strengths include the explicit theoretical characterization of latency gains, the empirical match to predictions, and the open-sourced implementation, which supports reproducibility. The result is potentially impactful for latency-sensitive agentic systems if the rollback mechanism is shown to be robust.

major comments (2)

[§3] §3 (SpecHop algorithm and rollback procedure): The claim of lossless trajectory preservation requires that rollback after an incorrect speculative observation restores the model to a state from which all subsequent tool calls and generations are identical to those that would have been produced with the correct observation from the outset. Because later calls are conditioned on prior observations, simply replacing the observation does not automatically guarantee re-generation of the identical downstream sequence; the manuscript does not provide a concrete argument or pseudocode showing how the model state (including any internal context or generation history) is exactly restored without additional cost or divergence. This is load-bearing for both the theoretical latency characterization and the empirical 40% reduction claim.
[§2] Theoretical framework (likely §2): The optimal latency gain characterization assumes perfect, cost-free state restoration on rollback. If the rollback mechanism incurs any overhead or cannot guarantee identical downstream trajectories, the derived bounds no longer apply directly to the implemented system; the paper should either prove that restoration is exact or adjust the theoretical predictions to account for any residual cost.

minor comments (2)

[Results] Results section: The abstract states 'up to 40% in some settings'; the main text should explicitly identify the exact task, number of hops, and thread count corresponding to the maximum reported reduction, and include error bars or multiple runs to show robustness.
[§2] Notation: The distinction between 'target tool' and 'speculator tool' outputs should be introduced with consistent symbols early in the theoretical section to avoid ambiguity when discussing verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. The comments highlight important aspects of the state restoration mechanism that we will clarify in the revision. We address each major comment below.

Referee: [§3] §3 (SpecHop algorithm and rollback procedure): The claim of lossless trajectory preservation requires that rollback after an incorrect speculative observation restores the model to a state from which all subsequent tool calls and generations are identical to those that would have been produced with the correct observation from the outset. Because later calls are conditioned on prior observations, simply replacing the observation does not automatically guarantee re-generation of the identical downstream sequence; the manuscript does not provide a concrete argument or pseudocode showing how the model state (including any internal context or generation history) is exactly restored without additional cost or divergence. This is load-bearing for both the theoretical latency characterization and the empirical 40% reduction claim.

Authors: We thank the referee for this observation. In the SpecHop algorithm, state is maintained via a branching context tree in which each active thread holds its own independent generation history and observation prefix. Upon asynchronous verification, an incorrect speculative observation causes the thread to be pruned; execution then resumes exclusively from the last verified correct prefix by restoring the corresponding context buffer. Because all downstream generations and tool calls are conditioned solely on this restored prefix, the resulting trajectory is identical to the non-speculative execution. We will expand §3 with an explicit paragraph and additional pseudocode lines that detail the context-buffer restore operation and confirm that no extra generation cost or divergence is incurred beyond the verification step itself. revision: yes
Referee: [§2] Theoretical framework (likely §2): The optimal latency gain characterization assumes perfect, cost-free state restoration on rollback. If the rollback mechanism incurs any overhead or cannot guarantee identical downstream trajectories, the derived bounds no longer apply directly to the implemented system; the paper should either prove that restoration is exact or adjust the theoretical predictions to account for any residual cost.

Authors: The latency bounds in §2 are derived under the explicit assumption of lossless speculation, which includes exact, zero-overhead state restoration. The multi-threaded context management in SpecHop realizes this assumption: rollback consists only of discarding an incorrect branch and reloading the verified prefix buffer, with no re-generation of prior steps. We will revise the opening of §2 to cross-reference the state-restoration guarantees now elaborated in §3 and to state that the derived bounds therefore apply directly to the implemented algorithm. revision: yes
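
The prefix-restoration argument in these responses can be illustrated with an immutable trace, where every thread points at its parent and nothing is edited in place. This is a hedged sketch of the general idea, not the authors' context-buffer code: `Node`, `trace`, and `verify` are hypothetical names, verification is exact equality, and the cost of regenerating model output from the restored prefix is outside its scope.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    """One hop in a trace; frozen so a verified prefix can never be mutated."""
    action: str
    observation: str
    parent: Optional["Node"] = None

def trace(node):
    """Walk parent links back to the root and return the hops in order."""
    hops = []
    while node is not None:
        hops.append((node.action, node.observation))
        node = node.parent
    return tuple(reversed(hops))

def verify(verified_tip, speculative_child, true_obs):
    """Check a speculative hop against the real target output.

    Returns (new_verified_tip, committed). On a match the speculative node is
    adopted as is, so threads built on it stay valid. On a mismatch a corrected
    node is attached to the untouched verified prefix, and anything that
    descended from the wrong guess is simply no longer referenced.
    """
    if speculative_child.observation == true_obs:
        return speculative_child, True
    corrected = Node(speculative_child.action, true_obs, parent=verified_tip)
    return corrected, False

root = Node("a1", "o1")                       # verified first hop
guess = Node("a2", "guess", parent=root)      # speculative second hop
deeper = Node("a3", "o3?", parent=guess)      # thread built on the guess

rolled_back, committed = verify(root, guess, "o2")
```

After the mismatch, `trace(rolled_back)` contains only the verified first hop followed by the corrected second hop, and `trace(root)` is unchanged. Whether an actual system can resume generation from that prefix at no extra cost is the empirical question the referee raises.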

Circularity Check

0 steps flagged

No circularity: theoretical latency framework is independent of empirical results

The paper first presents a theoretical framework characterizing the optimal latency gain for lossless multi-hop speculation, then proposes SpecHop as an implementation that approaches oracle gains, and finally reports empirical latency reductions of up to 40% that match the theory. No step reduces a prediction or central claim to a fitted parameter, self-definition, or load-bearing self-citation by construction. The lossless trajectory preservation is an explicit assumption with a stated rollback mechanism rather than a derived result that collapses into its inputs. The derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of faster but less reliable speculator tools and on the ability to maintain and verify multiple speculative threads without altering the final trajectory.

axioms (1)

Domain assumption: Faster but less reliable speculator tools are available for generating predicted observations.
Stated in the abstract as a prerequisite for the lossless speculation approach.

[1] [1]

Computer use tool

Anthropic. Computer use tool. https://docs.anthropic.com/en/docs/ agents-and-tools/tool-use/computer-use-tool, 2026. Accessed: 2026-05-01

work page 2026

[2] [2]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Gemini deep research agent

Google AI for Developers. Gemini deep research agent. https://ai.google.dev/ gemini-api/docs/deep-research, 2026. Accessed May 1, 2026

work page 2026

[4] [4]

Function calling with the gemini api

Google AI for Developers. Function calling with the gemini api. https://ai.google.dev/ gemini-api/docs/function-calling, 2026. Accessed: 2026-05-01

work page 2026

[5] [5]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

work page 2020

[7] [7]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023

[9] [9]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

work page 2020

[10] [10]

Pipespec: Breaking stage dependencies in hierarchical llm decoding

Bradley McDanel, Sai Qian Zhang, Yunhai Hu, and Zining Liu. Pipespec: Breaking stage dependencies in hierarchical llm decoding. InFindings of the Association for Computational Linguistics: ACL 2025, pages 12909–12920, 2025

work page 2025

[11] [11]

Introducing deep research

OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/, 2025. Accessed May 1, 2026

work page 2025

[12] [12]

Using tools

OpenAI. Using tools. https://developers.openai.com/api/docs/guides/tools,

work page

[13] [13]

Accessed May 1, 2026

work page 2026

[14]

Demystifying delays in reasoning: A pilot temporal and token analysis of reasoning systems

Qi Qi, Reyna Abhyankar, and Yiying Zhang. Demystifying delays in reasoning: A pilot temporal and token analysis of reasoning systems. In NeurIPS 2025 Workshop on Efficient Reasoning, Vancouver, Canada, Dec 2025

[15]

Knowing when to ask–bridging large language models and data

Prashanth Radhakrishnan, Jennifer Chen, Bo Xu, Prem Ramaswami, Hannah Pho, Adriana Olmos, James Manyika, and Ramanathan V Guha. Knowing when to ask–bridging large language models and data. arXiv preprint arXiv:2409.13741, 2024

[16]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023

[17]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

[18]

Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. Agentic retrieval-augmented generation: A survey on agentic rag. arXiv preprint arXiv:2501.09136, 2025

[19]

James E. Smith. A study of branch prediction strategies. In Proceedings of the 8th Annual Symposium on Computer Architecture, ISCA ’81, pages 135–148, Washington, DC, USA, 1981. IEEE Computer Society Press

[20]

Implementing precise interrupts in pipelined processors

James E. Smith and Andrew R. Pleszkun. Implementing precise interrupts in pipelined processors. pages 372–383, 1995

[21]

Act while thinking: Accelerating llm agents via pattern-aware speculative tool execution

Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, and Yuqing Yang. Act while thinking: Accelerating llm agents via pattern-aware speculative tool execution. arXiv preprint arXiv:2603.18897, 2026

[22]

Accelerating llm inference with lossless speculative decoding algorithms for heterogeneous vocabularies

Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, and David Harel. Accelerating llm inference with lossless speculative decoding algorithms for heterogeneous vocabularies. arXiv preprint arXiv:2502.05202, 2025

[23]

R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1):25–33, 1967. doi: 10.1147/rd.111.0025

[24]

Musique: Multihop questions via single-hop question composition

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

[25]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

[26]

A survey on large language model based autonomous agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

[27]

Chain-of-retrieval augmented generation

Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, and Furu Wei. Chain-of-retrieval augmented generation. arXiv preprint arXiv:2501.14342, 2025

[28]

Deepnote: Note-centric deep retrieval-augmented generation

Ruobing Wang, Qingfei Zhao, Yukun Yan, Daren Zha, Yuxuan Chen, Shi Yu, Zhenghao Liu, Yixuan Wang, Shuo Wang, Xu Han, et al. Deepnote: Note-centric deep retrieval-augmented generation. Preprint, 2025

[29]

Speculative rag: Enhancing retrieval augmented generation through drafting

Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, et al. Speculative rag: Enhancing retrieval augmented generation through drafting. arXiv preprint arXiv:2407.08223, 2024

[30]

Deepresearch-9k: A challenging benchmark dataset of deep-research agent

Tongzhou Wu, Yuhao Wang, Xinyu Ma, Xiuqiang He, Shuaiqiang Wang, Dawei Yin, and Xiangyu Zhao. Deepresearch-9k: A challenging benchmark dataset of deep-research agent. arXiv preprint arXiv:2603.01152, 2026

[31]

Decoding speculative decoding

Minghao Yan, Saurabh Agarwal, and Shivaram Venkataraman. Decoding speculative decoding. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6460–6473, 2025

[32]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, and Jing Zhou. Qwen3 technical report. arXiv, 2025. doi: 10.48550/arxiv....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025

[33]

Sparc-rag: Adaptive sequential-parallel scaling with context management for retrieval-augmented generation

Yuxin Yang, Gangda Deng, Ömer Faruk Akgül, Nima Chitsazan, Yash Govilkar, Akasha Tigalappanavara, Shi-Xiong Zhang, Sambit Sahu, and Viktor Prasanna. Sparc-rag: Adaptive sequential-parallel scaling with context management for retrieval-augmented generation. arXiv preprint arXiv:2602.00083, 2026

[34]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

[35]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

[36]

Speculative Actions: A Lossless Framework for Faster Agentic Systems

Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, and Tianyi Peng. Speculative actions: A lossless framework for faster agentic systems. arXiv preprint arXiv:2510.04371, 2025

[37]

Tse-Yu Yeh and Yale N. Patt. Two-level adaptive training branch prediction. In Proceedings of the 24th Annual International Symposium on Microarchitecture, MICRO 24, pages 51–61, New York, NY, USA, 1991. Association for Computing Machinery. ISBN 0897914600. doi: 10.1145/123465.123475. URL https://doi.org/10.1145/123465.123475

[38]

Specpipe: Accelerating pipeline parallelism-based llm inference with speculative decoding

Haofei Yin, Mengbai Xiao, Tinghong Li, Xiao Zhang, Dongxiao Yu, and Guanghui Zhang. Specpipe: Accelerating pipeline parallelism-based llm inference with speculative decoding. arXiv preprint arXiv:2504.04104, 2025

A Full Implementation of the Algorithm

Algorithm 2 SPECHOP: Continuous Speculative Execution with k Active Threads

Require: Query q, model M, targ...

Use at most {max_hops} subquestions, but stop as soon as the final answer can be derived from the subanswers with certainty.

Format example to imitate:

question: "Which film has the director who was born earlier, Face Of A Fugitive or Cage Of Gold?"

output: <subquestion> Who directed Face Of A Fugitive? </subquestion><subanswer> Paul Wendkos </subanswer><...

Next Sub-question Generation. Invoked when the model needs to query the target tool (T) or speculator (S).

Main question: {question}
Current hop index: {hops_done + 1} / {max_hops}
Existing reasoning trace: {trace_blob}
Output exactly one next subquestion in this format: <subquestion> ... </subquestion>

Sub-answer Generation. Invoked to process the retrieved documents from the tool.

Main question: {question}
Current subquestion: {subquestion}
Search results: {search_snippets}
Answer the current subquestion only using the search results. If uncertain, return a short best-effort answer.
Output exactly: <subanswer> ... </subanswer>
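As an illustration only, the sub-answer prompt above could be filled in and its tagged output parsed roughly as follows. This is a minimal sketch, not the authors' implementation: the names `SUBANSWER_PROMPT`, `build_subanswer_prompt`, and `extract_tag` are hypothetical, and the search snippets are assumed to arrive as one pre-joined string.

```python
import re

# Template text copied from the sub-answer prompt; the constant name is hypothetical.
SUBANSWER_PROMPT = (
    "Main question: {question}\n"
    "Current subquestion: {subquestion}\n"
    "Search results: {search_snippets}\n"
    "Answer the current subquestion only using the search results. "
    "If uncertain, return a short best-effort answer.\n"
    "Output exactly: <subanswer> ... </subanswer>"
)


def build_subanswer_prompt(question, subquestion, search_snippets):
    """Fill the three placeholders of the sub-answer template."""
    return SUBANSWER_PROMPT.format(
        question=question,
        subquestion=subquestion,
        search_snippets=search_snippets,
    )


def extract_tag(text, tag):
    """Return the stripped contents of the first <tag>...</tag> pair, or None."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None
```

The same `extract_tag` helper would apply to the `<subquestion>` and `<final_answer>` tags used by the other prompts; returning `None` leaves the choice of fallback to the caller.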


Final Answer Generation. Invoked when the hop budget is exhausted or sufficient information is gathered.

Main question: {question}
Reasoning trace: {trace_blob}
Output exactly one final answer tag: <final_answer> ... </final_answer>

Formatting Fallback Mechanism. In the rare event that the off-the-shelf model outputs reasoning tokens outside the designated ...