pith. machine review for the scientific record.

arxiv: 2605.06285 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.LG

Recognition: unknown

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:22 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords LatentRAG · agentic RAG · latent reasoning · retrieval-augmented generation · efficient inference · latent space alignment · multi-step question answering

The pith

LatentRAG moves multi-step reasoning and retrieval into continuous latent space to cut agentic RAG latency by roughly 90 percent while matching explicit methods on complex questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LatentRAG replaces the slow, token-by-token generation of intermediate thoughts and subqueries in agentic RAG with latent tokens taken directly from an LLM's hidden states in one forward pass. The framework aligns the language model with a dense retriever so that retrieval can operate over these continuous representations, and it adds a parallel decoding step that turns the latent tokens back into readable natural language for transparency. Experiments across seven benchmarks show that this latent approach delivers accuracy comparable to explicit agentic systems yet shrinks the inference-time cost enough to approach the speed of ordinary single-step RAG. The central move is therefore to keep the iterative search-agent behavior while removing its most expensive discrete-language bottleneck.
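
To make the mechanism concrete, here is a minimal sketch of reading latent subquery tokens off the hidden states of a single forward pass and scoring them against a dense document index. The backbone name, the projection head, the placeholder-token trick, and the similarity scoring are assumptions for illustration, not the paper's specification.

```python
# Hypothetical sketch only: latent subqueries from one forward pass, scored against a
# dense index. Model name, projection head, and placeholder-token trick are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

llm_name = "Qwen/Qwen2.5-3B"   # assumed backbone, not confirmed by the paper
retriever_dim = 768             # assumed retriever embedding size

tok = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModel.from_pretrained(llm_name)
proj = torch.nn.Linear(llm.config.hidden_size, retriever_dim)  # learned alignment head (assumed)

question = "Who directed the film whose lead actor won the 1997 Best Actor Oscar?"
n_latent = 4  # number of latent subquery positions (assumed)
# Append placeholder positions; their final hidden states stand in for latent subquery tokens.
inputs = tok(question + tok.eos_token * n_latent, return_tensors="pt")

with torch.no_grad():
    hidden = llm(**inputs).last_hidden_state           # (1, seq_len, d_model)
latent_subqueries = proj(hidden[0, -n_latent:])         # (n_latent, retriever_dim)

# Retrieval over latents: cosine similarity against a precomputed document matrix.
doc_embeddings = torch.randn(10_000, retriever_dim)     # stand-in for a real dense index
scores = F.normalize(latent_subqueries, dim=-1) @ F.normalize(doc_embeddings, dim=-1).T
top_docs = scores.max(dim=0).values.topk(5).indices     # top-5 docs across latent tokens
```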

Core claim

By producing latent tokens for thoughts and subqueries directly from hidden states in a single forward pass, aligning the LLM with dense retrieval models in latent space, and adding parallel latent decoding to natural language, LatentRAG performs the multi-step retrieval and reasoning of agentic RAG without autoregressive generation of lengthy intermediate text.

What carries the argument

Latent tokens extracted from hidden states in one forward pass, aligned with a dense retriever and optionally decoded back to natural language.

If this is right

  • Agentic RAG can retain multi-step search behavior while operating at speeds close to single-step RAG.
  • Retrieval can be performed directly over continuous latent representations of subqueries rather than discrete text.
  • Joint training of the generator and retriever becomes possible because gradients flow through the latent alignment (see the alignment-loss sketch after this list).
  • Interpretability is preserved by the optional decoding of latent tokens into readable intermediate steps.
  • The same latent-space shift can be applied to other iterative LLM tasks that currently rely on explicit token generation.
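
On the joint-training point above: if gradients really do flow through the latent alignment, the natural training signal is a contrastive loss between latent subquery representations and document embeddings. A minimal InfoNCE-style sketch, with shapes, pooling, and temperature assumed rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def latent_alignment_loss(latent_queries, pos_doc_emb, neg_doc_emb, temperature=0.05):
    """InfoNCE-style alignment between latent subquery tokens and document embeddings.

    latent_queries: (B, d) pooled latent subquery representations (pooling assumed)
    pos_doc_emb:    (B, d) embedding of the gold document for each query
    neg_doc_emb:    (B, K, d) in-batch or mined negatives
    """
    q = F.normalize(latent_queries, dim=-1)
    pos = F.normalize(pos_doc_emb, dim=-1)
    neg = F.normalize(neg_doc_emb, dim=-1)

    pos_logits = (q * pos).sum(-1, keepdim=True)          # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", q, neg)        # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=-1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)       # the positive sits at index 0
    return F.cross_entropy(logits, labels)

# The loss is differentiable in latent_queries, so gradients can reach the LLM that
# produced them; that is what would make joint generator-retriever training possible.
```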

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Longer or more deeply nested reasoning chains could be supported without a linear increase in latency.
  • Tool-use or planning loops that now require many autoregressive steps might be accelerated by analogous latent representations.
  • If latent tokens prove sufficient for retrieval, explicit natural-language intermediates may be treated as optional outputs rather than required steps in efficiency-sensitive deployments.

Load-bearing premise

Latent tokens taken from hidden states can faithfully carry the semantic content of natural-language thoughts and subqueries so that retrieval and end-to-end optimization remain effective.

What would settle it

A side-by-side evaluation in which LatentRAG accuracy drops measurably below explicit agentic RAG on questions that require several distinct reasoning hops, or in which measured end-to-end latency fails to show a reduction near 90 percent.
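
One way to operationalize that test, sketched under assumed interfaces: log per-example wall-clock latency and exact match for both pipelines on the same multi-hop split, then compare. The function names and dataset layout below are placeholders, not the paper's code.

```python
# Hypothetical harness: run_explicit and run_latent are placeholders for the two systems;
# multi_hop_dev is a list of {"question": ..., "gold": ...} dicts.
import time
import statistics

def evaluate(pipeline, dataset):
    latencies, correct = [], 0
    for example in dataset:
        start = time.perf_counter()
        answer = pipeline(example["question"])
        latencies.append(time.perf_counter() - start)
        correct += int(answer.strip().lower() == example["gold"].strip().lower())
    return {"exact_match": correct / len(dataset),
            "median_latency_s": statistics.median(latencies)}

# explicit = evaluate(run_explicit, multi_hop_dev)   # e.g., a Search-R1-style agent
# latent   = evaluate(run_latent, multi_hop_dev)     # a LatentRAG-style system
# reduction = 1 - latent["median_latency_s"] / explicit["median_latency_s"]
# The claim holds up if reduction sits near 0.90 while the exact-match gap stays small.
```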

Figures

Figures reproduced from arXiv: 2605.06285 by Marcel Worring, Yijia Zheng.

Figure 1
Figure 1: Comparison of performance and latency on multi-hop QA datasets. LatentRAG achieves comparable performance to competitive agentic RAG methods such as Search-R1 and AutoRefine, while maintaining efficiency on par with naive single-step RAG. Search-R1 incurs substantial latency in thought and subquery generation, whereas LatentRAG substantially reduces the time spent in these two stages, leading to the observ… view at source ↗
Figure 2
Figure 2: (1) Traditional explicit agentic RAG methods alternate between generation and retrieval, … view at source ↗
Figure 3
Figure 3: Performance and latency results across different retrieval model and LLM sizes. … index that cannot fit on a single GPU. To ensure a fair comparison across different model sizes, we use three H100 GPUs for retrieval deployment and one for the LLM across all scaling experiments. As shown in … view at source ↗
Figure 4
Figure 4: Distribution of cosine similarity and angle between document embeddings and their mean direction. We visualize distributions using violin plots. In contrast to other retrieval models, e5-base-v2 yields embeddings with extremely high cosine similarity and small angular deviation, indicating collapse into a narrow cone of the hypersphere and severe anisotropy. view at source ↗
Figure 5
Figure 5: Latency reduction using batch latent decoding vs. max length ratio. Lower max length ratios are associated with higher latency reduction ratios. Each data point corresponds to the results on each dataset. As discussed in the main paper, latent decoding improves transparency at the cost of additional latency. A good property of our method is that the decoding of thoughts and subqueries is conditionally in… view at source ↗
Figure 6
Figure 6: Performance under different numbers of latent thought and subquery tokens. To investigate the impact of latent token numbers, we vary the number of latent thought tokens m and the number of subquery tokens n and evaluate the exact match scores under different configurations. As shown in … view at source ↗
Figure 7
Figure 7: LogitLens Case Study 1 on LatentRAG♢. Latent thought and subquery tokens in the first step align with tokens related to the first subquery, The author of The Thing of It Is..., while those in the second step shift toward tokens related to the second subquery, William Goldman nationality. A latent token can encode the whole semantic concept, such as The Thing of It Is... or William Goldman. … view at source ↗
Figure 8
Figure 8: LogitLens Case Study 2 on LatentRAG♢. Latent thought and subquery tokens in the first step align with tokens related to the first subquery, Eugene Habecker chairman of which magazine, while those in the second step shift toward tokens related to the second subquery, Christianity Today magazine type. A latent token can encode the whole semantic concept, such as magazine type or Christianity Today. view at source ↗
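
The LogitLens readings behind Figures 7 and 8 amount to projecting a latent token's hidden state through the model's final norm and unembedding matrix and inspecting the nearest vocabulary items. A minimal sketch of that probe; the checkpoint, the layer index, and the Qwen2-style attribute paths are assumptions, not details taken from the paper.

```python
# Minimal LogitLens-style probe: project a hidden state through the final norm and the
# unembedding matrix, then list the nearest vocabulary tokens. Checkpoint, layer, and the
# Qwen2-style attribute paths (model.model.norm, model.lm_head) are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-3B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

text = "Who is the author of The Thing of It Is...?"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

layer, position = -2, -1                                   # a late layer, last position
hidden = out.hidden_states[layer][0, position]              # (d_model,)
hidden = model.model.norm(hidden)                           # final RMSNorm before unembedding
logits = model.lm_head(hidden)                              # (vocab_size,)
top_ids = logits.topk(5).indices.tolist()
print(tok.convert_ids_to_tokens(top_ids))  # tokens the hidden state "points at" in vocab space
```
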
read the original abstract

Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries directly from the hidden states in a single forward pass. We align LLMs with dense retrieval models in the latent space, enabling retrieval over latent subquery tokens and supporting end-to-end joint optimization. To improve transparency and encourage semantically meaningful latent representations, we incorporate a parallel latent decoding mechanism that translates latent tokens back into natural language. Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes LatentRAG, a framework that moves agentic RAG reasoning and retrieval into continuous latent space: latent tokens for thoughts and subqueries are generated directly from hidden states in a single forward pass, aligned with dense retrieval models for joint optimization, and decoded in parallel to natural language for interpretability. It reports performance comparable to explicit multi-step agentic RAG methods across seven benchmarks while cutting inference latency by ~90%.

Significance. If the empirical results hold under rigorous controls, the work would meaningfully narrow the efficiency gap between single-step RAG and adaptive agentic methods, enabling more practical deployment of complex multi-hop QA. The attempt at end-to-end latent alignment and the parallel decoding mechanism for transparency are constructive ideas that could influence future hybrid latent-explicit systems.

major comments (3)
  1. [Abstract] Abstract: the central empirical claim of 'comparable performance' and 'approximately 90%' latency reduction is presented without naming the seven benchmarks, the explicit agentic baselines, latency measurement protocol (wall-clock, tokens generated, hardware), error bars, or statistical significance tests. These omissions make it impossible to evaluate whether the latency gain preserves the adaptivity that agentic RAG is designed to provide.
  2. [Abstract] Abstract and method description: the architecture performs retrieval over latent subquery tokens produced in one forward pass, yet retrieval outputs never re-enter the model to condition subsequent latent tokens. This removes the iterative feedback loop that the paper itself identifies as the source of agentic RAG success on complex questions; no ablation or analysis is supplied showing that pre-encoded latent branches suffice when retrieval results would normally alter the reasoning path.
  3. [Abstract] The weakest assumption—that hidden-state latents can faithfully encode the semantic content of natural-language thoughts and subqueries sufficiently for effective retrieval—receives no direct validation (e.g., retrieval recall@K on latent vs. explicit subqueries, or human evaluation of decoded thoughts). Without such evidence the end-to-end optimization claim rests on an untested substitution.
minor comments (1)
  1. [Abstract] The abstract states 'align LLMs with dense retrieval models in the latent space' but does not specify the alignment loss, temperature, or projection layers; these details belong in the main text even if summarized here.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the constructive comments on our paper. We address each of the major comments point by point below, and we will revise the manuscript accordingly to improve clarity and provide additional analyses.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim of 'comparable performance' and 'approximately 90%' latency reduction is presented without naming the seven benchmarks, the explicit agentic baselines, latency measurement protocol (wall-clock, tokens generated, hardware), error bars, or statistical significance tests. These omissions make it impossible to evaluate whether the latency gain preserves the adaptivity that agentic RAG is designed to provide.

    Authors: We agree that the abstract should include more specific details to facilitate evaluation of the claims. In the revised version, we will name the seven benchmarks, list the explicit agentic baselines, describe the latency measurement protocol (wall-clock time, tokens generated, hardware), and report error bars along with statistical significance tests. These updates will help demonstrate that the reported latency reduction maintains the adaptivity of agentic RAG. revision: yes

  2. Referee: [Abstract] Abstract and method description: the architecture performs retrieval over latent subquery tokens produced in one forward pass, yet retrieval outputs never re-enter the model to condition subsequent latent tokens. This removes the iterative feedback loop that the paper itself identifies as the source of agentic RAG success on complex questions; no ablation or analysis is supplied showing that pre-encoded latent branches suffice when retrieval results would normally alter the reasoning path.

    Authors: LatentRAG generates latent tokens for thoughts and subqueries in one forward pass to achieve efficiency, with retrieval performed over these latents in parallel. This design avoids the latency of iterative autoregressive generation while aiming to capture multi-step reasoning through parallel latent branches. We note that the manuscript does not provide an ablation on iterative feedback. We will add an ablation study in the revision comparing the current approach to one that incorporates retrieval results for subsequent latent token generation, to show the conditions under which pre-encoded branches are sufficient. revision: yes

  3. Referee: [Abstract] The weakest assumption—that hidden-state latents can faithfully encode the semantic content of natural-language thoughts and subqueries sufficiently for effective retrieval—receives no direct validation (e.g., retrieval recall@K on latent vs. explicit subqueries, or human evaluation of decoded thoughts). Without such evidence the end-to-end optimization claim rests on an untested substitution.

    Authors: We recognize that direct validation of the semantic fidelity of the latent tokens would strengthen the paper. The parallel latent decoding is provided for interpretability, but we agree it does not constitute quantitative validation such as recall@K or human evaluation. The end-to-end results provide indirect support. In the revised manuscript, we will include retrieval recall@K comparisons between latent and explicit subqueries as well as human evaluations or detailed qualitative analysis of the decoded thoughts. revision: yes
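
For reference, the promised recall@K comparison is simple to specify; the sketch below shows what it would compute, with retrieve_text, retrieve_latent, and the data layout as placeholders rather than the paper's actual interfaces.

```python
# Sketch of the promised validation: retrieval recall@K for explicit natural-language
# subqueries vs. their latent counterparts. retrieve() and the data layout are placeholders.
def recall_at_k(retrieve, queries, gold_doc_ids, k=5):
    hits = 0
    for query, gold in zip(queries, gold_doc_ids):
        retrieved = retrieve(query, top_k=k)   # expected to return a list of document ids
        hits += int(gold in retrieved)
    return hits / len(queries)

# explicit_r5 = recall_at_k(retrieve_text,   explicit_subqueries, gold_ids, k=5)
# latent_r5   = recall_at_k(retrieve_latent, latent_subqueries,   gold_ids, k=5)
# A small gap would directly support the fidelity premise; a large gap would show the
# latent substitution losing retrieval-relevant content.
```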

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents LatentRAG as an empirical framework that generates latent tokens from hidden states in one forward pass, aligns them with dense retrieval, and evaluates via experiments on seven benchmarks. No equations, predictions, or self-citations are shown that reduce performance gains or the core mechanism to quantities defined by the inputs themselves. Claims rest on external benchmark comparisons rather than any algebraic identity or fitted-parameter renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The framework introduces latent tokens and latent-space alignment as new mechanisms whose effectiveness is asserted rather than derived from prior results; no free parameters are explicitly named in the abstract, but the alignment process implicitly requires learned parameters.

invented entities (2)
  • latent tokens for thoughts and subqueries · no independent evidence
    purpose: represent intermediate reasoning steps in continuous space for single-pass generation and retrieval
    Introduced to replace autoregressive token generation; no independent evidence of semantic fidelity is provided in the abstract.
  • parallel latent decoding mechanism · no independent evidence
    purpose: translate latent tokens back to natural language for transparency
    Added to improve interpretability; its contribution to overall performance is not quantified separately.

pith-pipeline@v0.9.0 · 5532 in / 1292 out tokens · 44314 ms · 2026-05-08T10:22:26.411377+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 28 canonical work pages · 8 internal anchors

  1. [1]

    Large language models in law: A survey

    Jinqi Lai, Wensheng Gan, Jiayang Wu, Zhenlian Qi, and Philip S Yu. Large language models in law: A survey. AI Open, 2024

  2. [2]

    A survey on large language models for mathematical reasoning

    Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, et al. A survey on large language models for mathematical reasoning. ACM Comput. Surv., 2025

  3. [3]

    Toward expert-level medical question answering with large language models

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nat. Med., 2025

  4. [4]

    Survey on factuality in large language models

    Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Qipeng Guo, Xiangkun Hu, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Xuming Hu, Zehan Qi, Wenyang Gao, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, and Yue Zhang. Survey on factuality in large language models. ACM Comput. Surv., 2025

  5. [5]

    Factuality of large language models: A survey

    Yuxia Wang, Minghan Wang, Muhammad Arslan Manzoor, Fei Liu, Georgi Nenkov Georgiev, Rocktim Jyoti Das, and Preslav Nakov. Factuality of large language models: A survey. In EMNLP, 2024

  6. [6]

    Knowledge editing for large language models: A survey

    Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. Knowledge editing for large language models: A survey. ACM Comput. Surv., 2024

  7. [7]

    Bring your own knowledge: A survey of methods for LLM knowledge expansion

    Mingyang Wang, Alisa Stoll, Lukas Lange, Heike Adel, Hinrich Schütze, and Jannik Strötgen. Bring your own knowledge: A survey of methods for LLM knowledge expansion. arXiv preprint arXiv:2502.12598, 2025

  8. [8]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Comput. Surv., 2023

  9. [9]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 2025

  10. [10]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, 2020

  11. [11]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InICML, 2020

  12. [12]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023

  13. [13]

    Graph retrieval-augmented generation: A survey

    Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. Graph retrieval-augmented generation: A survey. ACM Trans. Inf. Syst., 2025

  14. [14]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. InACL, 2023

  15. [15]

    Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. Agentic retrieval-augmented generation: A survey on agentic RAG.arXiv preprint arXiv:2501.09136, 2025

  16. [16]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InICLR, 2023

  17. [17]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InNeurIPS, 2023

  18. [18]

    Search-o1: Agentic search-enhanced large reasoning models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. InEMNLP, 2025

  19. [19]

    Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. In COLM, 2025

  20. [20]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022

  21. [21]

    Towards agentic RAG with deep reasoning: A survey of RAG-reasoning systems in LLMs

    Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, et al. Towards agentic RAG with deep reasoning: A survey of RAG-reasoning systems in LLMs. InFindings of EMNLP, 2025

  22. [22]

    An empirical study on reinforcement learning for reasoning-search interleaved LLM agents

    Bowen Jin, Jinsung Yoon, Priyanka Kargupta, Sercan O Arik, and Jiawei Han. An empirical study on reinforcement learning for reasoning-search interleaved LLM agents. arXiv preprint arXiv:2505.15117, 2025

  23. [23]

    A comprehensive survey on reinforcement learning-based agentic search: Foundations, roles, optimizations, evaluations, and applications

    Minhua Lin, Zongyu Wu, Zhichao Xu, Hui Liu, Xianfeng Tang, Qi He, Charu Aggarwal, Xiang Zhang, and Suhang Wang. A comprehensive survey on reinforcement learning-based agentic search: Foundations, roles, optimizations, evaluations, and applications. arXiv preprint arXiv:2510.16724, 2025

  24. [24]

    DeepRAG: Thinking to retrieve step by step for large language models

    Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Jie Zhou. DeepRAG: Thinking to retrieve step by step for large language models. InICLR, 2026

  25. [25]

    RAG-R1: Incentivizing the search and reasoning capabilities of LLMs through multi-query parallelism

    Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, and Jinjie Gu. RAG-R1: Incentivizing the search and reasoning capabilities of LLMs through multi-query parallelism. InAAAI, 2026

  26. [26]

    Training large language models to reason in a continuous latent space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. InCOLM, 2025

  27. [27]

    Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning

    Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning. arXiv preprint arXiv:2505.16782, 2025

  28. [28]

    Compressed chain of thought: Efficient reasoning through dense representations

    Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171, 2024

  29. [29]

    A survey on latent reasoning

    Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, et al. A survey on latent reasoning. arXiv preprint arXiv:2507.06203, 2025

  30. [30]

    Large concept models: Language modeling in a sentence representation space

    Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R Costa-jussà, David Dale, et al. Large concept models: Language modeling in a sentence representation space.arXiv preprint arXiv:2412.08821, 2024

  31. [31]

    LLM pretraining with continuous concepts

    Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason Weston, and Xian Li. LLM pretraining with continuous concepts. arXiv preprint arXiv:2502.08524, 2025

  32. [32]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. InICLR, 2024

  33. [33]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025

  34. [34]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  35. [35]

    Search and refine during think: Facilitating knowledge refinement for improved retrieval-augmented reasoning

    Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, and Xiang Wang. Search and refine during think: Facilitating knowledge refinement for improved retrieval-augmented reasoning. InNeurIPS, 2025

  36. [36]

    Model internals-based answer attribution for trustworthy retrieval-augmented generation

    Jirui Qi, Gabriele Sarti, Raquel Fernández, and Arianna Bisazza. Model internals-based answer attribution for trustworthy retrieval-augmented generation. InEMNLP, 2024

  37. [37]

    SAFE: Improving LLM systems using sentence-level in-generation attribution

    João Eduardo Batista, Emil Vatai, and Mohamed Wahib. SAFE: Improving LLM systems using sentence-level in-generation attribution. arXiv preprint arXiv:2505.12621, 2025

  38. [38]

    Active retrieval augmented generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In EMNLP, 2023

  39. [39]

    ReAgent: Reversible multi-agent reasoning for knowledge-enhanced multi-hop QA

    Zhao Xinjie, Fan Gao, Xingyu Song, Yingjian Chen, Rui Yang, Yanran Fu, Yuyang Wang, Yusuke Iwasawa, Yutaka Matsuo, and Irene Li. ReAgent: Reversible multi-agent reasoning for knowledge-enhanced multi-hop QA. InEMNLP, 2025

  40. [40]

    Self-RAG: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. InICLR, 2024

  41. [41]

    AutoRAG: Automated framework for optimization of retrieval augmented generation pipeline

    Dongkyu Kim, Byoungwook Kim, Donggeon Han, and Matouš Eibich. AutoRAG: Automated framework for optimization of retrieval augmented generation pipeline. arXiv preprint arXiv:2410.20878, 2024

  42. [42]

    Unified active retrieval for retrieval augmented generation

    Qinyuan Cheng, Xiaonan Li, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, and Xipeng Qiu. Unified active retrieval for retrieval augmented generation. InFindings of EMNLP, 2024

  43. [43]

    Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. InNAACL, 2024

  44. [44]

    ReSearch: Learning to reason with search for LLMs via reinforcement learning

    Mingyang Chen, Linzhuang Sun, Tianpeng Li, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, et al. ReSearch: Learning to reason with search for LLMs via reinforcement learning. InNeurIPS, 2025

  45. [45]

    R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

    Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-Searcher: Incentivizing the search capability in LLMs via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025

  46. [46]

    DeepResearcher: Scaling deep research via reinforcement learning in real-world environments

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. InEMNLP, 2025

  47. [47]

    TIPS: Turn-level information-potential reward shaping for search-augmented LLMs

    Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, and Xiaolong Wang. TIPS: Turn-level information-potential reward shaping for search-augmented LLMs. InICLR, 2026

  48. [48]

    HiPRAG: hierarchical process rewards for efficient agentic retrieval augmented generation

    Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Kaiyu He, Xinya Du, and Zhiyu Chen. HiPRAG: hierarchical process rewards for efficient agentic retrieval augmented generation. InICLR, 2026

  49. [49]

    A2Search: Ambiguity-aware question answering with reinforcement learning

    Fengji Zhang, Xinyao Niu, Chengyang Ying, Guancheng Lin, Zhongkai Hao, Zhou Fan, Chengen Huang, Jacky Keung, Bei Chen, and Junyang Lin. A2Search: Ambiguity-aware question answering with reinforcement learning. In ICLR, 2026

  50. [50]

    R-Search: Empowering LLM reasoning with search via multi-reward reinforcement learning

    Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, and Limin Liu. R-Search: Empowering LLM reasoning with search via multi-reward reinforcement learning. arXiv preprint arXiv:2506.04185, 2025

  51. [51]

    ParallelSearch: Train your LLMs to decompose query and search sub-queries in parallel with reinforcement learning

    Shu Zhao, Tan Yu, Anbang Xu, Japinder Singh, Aaditya Shukla, and Rama Akkiraju. ParallelSearch: Train your LLMs to decompose query and search sub-queries in parallel with reinforcement learning. arXiv preprint arXiv:2508.09303, 2025

  52. [52]

    WideSeek-R1: Exploring width scaling for broad information seeking via multi-agent reinforcement learning

    Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu, Weilin Liu, Quanlu Zhang, Wenbo Ding, Chao Yu, and Yu Wang. WideSeek-R1: Exploring width scaling for broad information seeking via multi-agent reinforcement learning. arXiv preprint arXiv:2602.04634, 2026

  53. [53]

    The latent space: Foundation, evolution, mechanism, ability, and outlook

    Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook. arXiv preprint arXiv:2604.02029, 2026

  54. [54]

    Let’s think dot by dot: Hidden computation in transformer language models

    Jacob Pfau, William Merrill, and Samuel R Bowman. Let’s think dot by dot: Hidden computation in transformer language models. InCOLM, 2024

  55. [55]

    CODI: Compressing chain-of-thought into continuous space via self-distillation

    Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. CODI: Compressing chain-of-thought into continuous space via self-distillation. InEMNLP, 2025

  56. [56]

    SynAdapt: Learning adaptive reasoning in large language models via synthetic continuous chain-of-thought

    Jianwei Wang, Ziming Wu, Fuming Lai, Shaobing Lian, and Ziqian Zeng. SynAdapt: Learning adaptive reasoning in large language models via synthetic continuous chain-of-thought. arXiv preprint arXiv:2508.00574, 2025

  57. [57]

    SIM-CoT: Supervised implicit chain-of-thought

    Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. SIM-CoT: Supervised implicit chain-of-thought. InICLR, 2026

  58. [58]

    Soft thinking: Unlocking the reasoning potential of LLMs in continuous concept space

    Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of LLMs in continuous concept space. In NeurIPS, 2025

  59. [59]

    The geometry of reasoning: Flowing logics in representation space

    Yufa Zhou, Yixiao Wang, Xunjian Yin, Shuyan Zhou, and Anru R Zhang. The geometry of reasoning: Flowing logics in representation space. InICLR, 2026

  60. [60]

    LLM latent reasoning as chain of superposition

    Jingcheng Deng, Liang Pang, Zihao Wei, Shichen Xu, Zenghao Duan, Kun Xu, Yang Song, Huawei Shen, and Xueqi Cheng. Latent reasoning in LLMs as a vocabulary-space superposition.arXiv preprint arXiv:2510.15522, 2025

  61. [61]

    SemCoT: Accelerating chain-of-thought reasoning through semantically-aligned implicit tokens

    Yinhan He, Wendy Zheng, Yaochen Zhu, Zaiyi Zheng, Lin Su, Sriram Vasudevan, Qi Guo, Liangjie Hong, and Jundong Li. SemCoT: Accelerating chain-of-thought reasoning through semantically-aligned implicit tokens. InNeurIPS, 2025

  62. [62]

    SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs

    Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs. InACL, 2025

  63. [63]

    CLaRa: Bridging retrieval and generation with continuous latent reasoning

    Jie He, Richard He Bai, Sinead Williamson, Jeff Z Pan, Navdeep Jaitly, and Yizhe Zhang. CLaRa: Bridging retrieval and generation with continuous latent reasoning. arXiv preprint arXiv:2511.18659, 2025

  64. [64]

    LaSER: Internalizing explicit reasoning into latent space for dense retrieval

    Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu, and Zhicheng Dou. LaSER: Internalizing explicit reasoning into latent space for dense retrieval. arXiv preprint arXiv:2603.01425, 2026

  65. [65]

    A survey on RAG meeting LLMs: Towards retrieval-augmented large language models

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on RAG meeting LLMs: Towards retrieval-augmented large language models. InKDD, 2024

  66. [66]

    FlashRAG: A modular toolkit for efficient retrieval-augmented generation research

    Jiajie Jin, Yutao Zhu, Zhicheng Dou, Guanting Dong, Xinyu Yang, Chenghao Zhang, Tong Zhao, Zhao Yang, and Ji-Rong Wen. FlashRAG: A modular toolkit for efficient retrieval-augmented generation research. InWWW, 2025

  67. [67]

    CRUD-RAG: A comprehensive Chinese benchmark for retrieval-augmented generation of large language models

    Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, and Enhong Chen. CRUD-RAG: A comprehensive Chinese benchmark for retrieval-augmented generation of large language models. ACM Trans. Inf. Syst., 2025

  68. [68]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  69. [69]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. TACL, 2019

  70. [70]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. InACL, 2017

  71. [71]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In ACL, 2023

  72. [72]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018

  73. [73]

    Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. InCOLING, 2020

  74. [74]

    MuSiQue: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. TACL, 2022

  75. [75]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. InFindings of EMNLP, 2023

  76. [76]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InEMNLP, 2020

  77. [77]

    Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. InFindings of EMNLP, 2023

  78. [78]

    ZeroSearch: Incentivize the search capability of LLMs without searching

    Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. ZeroSearch: Incentivize the search capability of LLMs without searching. arXiv preprint arXiv:2505.04588, 2025

  79. [79]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  80. [80]

    MTEB: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. InEACL, 2023

Showing first 80 references.