pith. machine review for the scientific record.

arxiv: 2605.06285 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.LG

Recognition: unknown

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:22 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords LatentRAG · agentic RAG · latent reasoning · retrieval-augmented generation · efficient inference · latent space alignment · multi-step question answering

The pith

LatentRAG moves multi-step reasoning and retrieval into continuous latent space to cut agentic RAG latency by roughly 90 percent while matching explicit methods on complex questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LatentRAG replaces the slow, token-by-token generation of intermediate thoughts and subqueries in agentic RAG with latent tokens taken directly from an LLM's hidden states in one forward pass. The framework aligns the language model with a dense retriever so that retrieval can operate over these continuous representations, and it adds a parallel decoding step that turns the latent tokens back into readable natural language for transparency. Experiments across seven benchmarks show that this latent approach delivers accuracy comparable to explicit agentic systems yet shrinks the inference-time cost enough to approach the speed of ordinary single-step RAG. The central move is therefore to keep the iterative search-agent behavior while removing its most expensive discrete-language bottleneck.
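
To make the mechanism concrete, here is a minimal sketch of reading latent subquery tokens off the hidden states of a single forward pass and scoring them against a dense document index. The backbone name, the projection head, the placeholder-token trick, and the similarity scoring are assumptions for illustration, not the paper's specification.

```python
# Hypothetical sketch only: latent subqueries from one forward pass, scored against a
# dense index. Model name, projection head, and placeholder-token trick are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

llm_name = "Qwen/Qwen2.5-3B"   # assumed backbone, not confirmed by the paper
retriever_dim = 768             # assumed retriever embedding size

tok = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModel.from_pretrained(llm_name)
proj = torch.nn.Linear(llm.config.hidden_size, retriever_dim)  # learned alignment head (assumed)

question = "Who directed the film whose lead actor won the 1997 Best Actor Oscar?"
n_latent = 4  # number of latent subquery positions (assumed)
# Append placeholder positions; their final hidden states stand in for latent subquery tokens.
inputs = tok(question + tok.eos_token * n_latent, return_tensors="pt")

with torch.no_grad():
    hidden = llm(**inputs).last_hidden_state           # (1, seq_len, d_model)
latent_subqueries = proj(hidden[0, -n_latent:])         # (n_latent, retriever_dim)

# Retrieval over latents: cosine similarity against a precomputed document matrix.
doc_embeddings = torch.randn(10_000, retriever_dim)     # stand-in for a real dense index
scores = F.normalize(latent_subqueries, dim=-1) @ F.normalize(doc_embeddings, dim=-1).T
top_docs = scores.max(dim=0).values.topk(5).indices     # top-5 docs across latent tokens
```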

Core claim

By producing latent tokens for thoughts and subqueries directly from hidden states in a single forward pass, aligning the LLM with dense retrieval models in latent space, and adding parallel latent decoding to natural language, LatentRAG performs the multi-step retrieval and reasoning of agentic RAG without autoregressive generation of lengthy intermediate text.

What carries the argument

Latent tokens extracted from hidden states in one forward pass, aligned with a dense retriever and optionally decoded back to natural language.

If this is right

  • Agentic RAG can retain multi-step search behavior while operating at speeds close to single-step RAG.
  • Retrieval can be performed directly over continuous latent representations of subqueries rather than discrete text.
  • Joint training of the generator and retriever becomes possible because gradients flow through the latent alignment (see the alignment-loss sketch after this list).
  • Interpretability is preserved by the optional decoding of latent tokens into readable intermediate steps.
  • The same latent-space shift can be applied to other iterative LLM tasks that currently rely on explicit token generation.
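
On the joint-training point above: if gradients really do flow through the latent alignment, the natural training signal is a contrastive loss between latent subquery representations and document embeddings. A minimal InfoNCE-style sketch, with shapes, pooling, and temperature assumed rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def latent_alignment_loss(latent_queries, pos_doc_emb, neg_doc_emb, temperature=0.05):
    """InfoNCE-style alignment between latent subquery tokens and document embeddings.

    latent_queries: (B, d) pooled latent subquery representations (pooling assumed)
    pos_doc_emb:    (B, d) embedding of the gold document for each query
    neg_doc_emb:    (B, K, d) in-batch or mined negatives
    """
    q = F.normalize(latent_queries, dim=-1)
    pos = F.normalize(pos_doc_emb, dim=-1)
    neg = F.normalize(neg_doc_emb, dim=-1)

    pos_logits = (q * pos).sum(-1, keepdim=True)          # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", q, neg)        # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=-1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)       # the positive sits at index 0
    return F.cross_entropy(logits, labels)

# The loss is differentiable in latent_queries, so gradients can reach the LLM that
# produced them; that is what would make joint generator-retriever training possible.
```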

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Longer or more deeply nested reasoning chains could be supported without a linear increase in latency.
  • Tool-use or planning loops that now require many autoregressive steps might be accelerated by analogous latent representations.
  • If latent tokens prove sufficient for retrieval, explicit natural-language intermediates may be treated as optional outputs rather than required steps in efficiency-sensitive deployments.

Load-bearing premise

Latent tokens taken from hidden states can faithfully carry the semantic content of natural-language thoughts and subqueries so that retrieval and end-to-end optimization remain effective.

What would settle it

A side-by-side evaluation in which LatentRAG accuracy drops measurably below explicit agentic RAG on questions that require several distinct reasoning hops, or in which measured end-to-end latency fails to show a reduction near 90 percent.
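
One way to operationalize that test, sketched under assumed interfaces: log per-example wall-clock latency and exact match for both pipelines on the same multi-hop split, then compare. The function names and dataset layout below are placeholders, not the paper's code.

```python
# Hypothetical harness: run_explicit and run_latent are placeholders for the two systems;
# multi_hop_dev is a list of {"question": ..., "gold": ...} dicts.
import time
import statistics

def evaluate(pipeline, dataset):
    latencies, correct = [], 0
    for example in dataset:
        start = time.perf_counter()
        answer = pipeline(example["question"])
        latencies.append(time.perf_counter() - start)
        correct += int(answer.strip().lower() == example["gold"].strip().lower())
    return {"exact_match": correct / len(dataset),
            "median_latency_s": statistics.median(latencies)}

# explicit = evaluate(run_explicit, multi_hop_dev)   # e.g., a Search-R1-style agent
# latent   = evaluate(run_latent, multi_hop_dev)     # a LatentRAG-style system
# reduction = 1 - latent["median_latency_s"] / explicit["median_latency_s"]
# The claim holds up if reduction sits near 0.90 while the exact-match gap stays small.
```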

Figures

Figures reproduced from arXiv: 2605.06285 by Marcel Worring, Yijia Zheng.

Figure 1
Figure 1: Comparison of performance and latency on multi-hop QA datasets. LatentRAG achieves comparable performance to competitive agentic RAG methods such as Search-R1 and AutoRefine, while maintaining efficiency on par with naive single-step RAG. Search-R1 incurs substantial latency in thought and subquery generation, whereas LatentRAG substantially reduces the time spent in these two stages, leading to the observ… view at source ↗
Figure 2
Figure 2: (1) Traditional explicit agentic RAG methods alternate between generation and retrieval, … view at source ↗
Figure 3
Figure 3: Performance and latency results across different retrieval model and LLM sizes. … index that cannot fit on a single GPU. To ensure a fair comparison across different model sizes, we use three H100 GPUs for retrieval deployment and one for the LLM across all scaling experiments. As shown in … view at source ↗
Figure 4
Figure 4: Distribution of cosine similarity and angle between document embeddings and their mean direction. We visualize distributions using violin plots. In contrast to other retrieval models, e5-base-v2 yields embeddings with extremely high cosine similarity and small angular deviation, indicating collapse into a narrow cone of the hypersphere and severe anisotropy. view at source ↗
Figure 5
Figure 5: Latency reduction using batch latent decoding vs. max length ratio. Lower max length ratios are associated with higher latency reduction ratios. Each data point corresponds to the results on each dataset. As discussed in the main paper, latent decoding improves transparency at the cost of additional latency. A good property of our method is that the decoding of thoughts and subqueries is conditionally in… view at source ↗
Figure 6
Figure 6: Performance under different numbers of latent thought and subquery tokens. To investigate the impact of latent token numbers, we vary the number of latent thought tokens m and the number of subquery tokens n and evaluate the exact match scores under different configurations. As shown in … view at source ↗
Figure 7
Figure 7: LogitLens Case Study 1 on LatentRAG♢. Latent thought and subquery tokens in the first step align with tokens related to the first subquery, The author of The Thing of It Is..., while those in the second step shift toward tokens related to the second subquery, William Goldman nationality. A latent token can encode the whole semantic concept, such as The Thing of It Is... or William Goldman. … view at source ↗
Figure 8
Figure 8: LogitLens Case Study 2 on LatentRAG♢. Latent thought and subquery tokens in the first step align with tokens related to the first subquery, Eugene Habecker chairman of which magazine, while those in the second step shift toward tokens related to the second subquery, Christianity Today magazine type. A latent token can encode the whole semantic concept, such as magazine type or Christianity Today. view at source ↗
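
The LogitLens readings behind Figures 7 and 8 amount to projecting a latent token's hidden state through the model's final norm and unembedding matrix and inspecting the nearest vocabulary items. A minimal sketch of that probe; the checkpoint, the layer index, and the Qwen2-style attribute paths are assumptions, not details taken from the paper.

```python
# Minimal LogitLens-style probe: project a hidden state through the final norm and the
# unembedding matrix, then list the nearest vocabulary tokens. Checkpoint, layer, and the
# Qwen2-style attribute paths (model.model.norm, model.lm_head) are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-3B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

text = "Who is the author of The Thing of It Is...?"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

layer, position = -2, -1                                   # a late layer, last position
hidden = out.hidden_states[layer][0, position]              # (d_model,)
hidden = model.model.norm(hidden)                           # final RMSNorm before unembedding
logits = model.lm_head(hidden)                              # (vocab_size,)
top_ids = logits.topk(5).indices.tolist()
print(tok.convert_ids_to_tokens(top_ids))  # tokens the hidden state "points at" in vocab space
```
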
read the original abstract

Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries directly from the hidden states in a single forward pass. We align LLMs with dense retrieval models in the latent space, enabling retrieval over latent subquery tokens and supporting end-to-end joint optimization. To improve transparency and encourage semantically meaningful latent representations, we incorporate a parallel latent decoding mechanism that translates latent tokens back into natural language. Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes LatentRAG, a framework that moves agentic RAG reasoning and retrieval into continuous latent space: latent tokens for thoughts and subqueries are generated directly from hidden states in a single forward pass, aligned with dense retrieval models for joint optimization, and decoded in parallel to natural language for interpretability. It reports performance comparable to explicit multi-step agentic RAG methods across seven benchmarks while cutting inference latency by ~90%.

Significance. If the empirical results hold under rigorous controls, the work would meaningfully narrow the efficiency gap between single-step RAG and adaptive agentic methods, enabling more practical deployment of complex multi-hop QA. The attempt at end-to-end latent alignment and the parallel decoding mechanism for transparency are constructive ideas that could influence future hybrid latent-explicit systems.

major comments (3)
  1. [Abstract] Abstract: the central empirical claim of 'comparable performance' and 'approximately 90%' latency reduction is presented without naming the seven benchmarks, the explicit agentic baselines, latency measurement protocol (wall-clock, tokens generated, hardware), error bars, or statistical significance tests. These omissions make it impossible to evaluate whether the latency gain preserves the adaptivity that agentic RAG is designed to provide.
  2. [Abstract] Abstract and method description: the architecture performs retrieval over latent subquery tokens produced in one forward pass, yet retrieval outputs never re-enter the model to condition subsequent latent tokens. This removes the iterative feedback loop that the paper itself identifies as the source of agentic RAG success on complex questions; no ablation or analysis is supplied showing that pre-encoded latent branches suffice when retrieval results would normally alter the reasoning path.
  3. [Abstract] The weakest assumption—that hidden-state latents can faithfully encode the semantic content of natural-language thoughts and subqueries sufficiently for effective retrieval—receives no direct validation (e.g., retrieval recall@K on latent vs. explicit subqueries, or human evaluation of decoded thoughts). Without such evidence the end-to-end optimization claim rests on an untested substitution.
minor comments (1)
  1. [Abstract] The abstract states 'align LLMs with dense retrieval models in the latent space' but does not specify the alignment loss, temperature, or projection layers; these details belong in the main text even if summarized here.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the constructive comments on our paper. We address each of the major comments point by point below, and we will revise the manuscript accordingly to improve clarity and provide additional analyses.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim of 'comparable performance' and 'approximately 90%' latency reduction is presented without naming the seven benchmarks, the explicit agentic baselines, latency measurement protocol (wall-clock, tokens generated, hardware), error bars, or statistical significance tests. These omissions make it impossible to evaluate whether the latency gain preserves the adaptivity that agentic RAG is designed to provide.

    Authors: We agree that the abstract should include more specific details to facilitate evaluation of the claims. In the revised version, we will name the seven benchmarks, list the explicit agentic baselines, describe the latency measurement protocol (wall-clock time, tokens generated, hardware), and report error bars along with statistical significance tests. These updates will help demonstrate that the reported latency reduction maintains the adaptivity of agentic RAG. revision: yes

  2. Referee: [Abstract] Abstract and method description: the architecture performs retrieval over latent subquery tokens produced in one forward pass, yet retrieval outputs never re-enter the model to condition subsequent latent tokens. This removes the iterative feedback loop that the paper itself identifies as the source of agentic RAG success on complex questions; no ablation or analysis is supplied showing that pre-encoded latent branches suffice when retrieval results would normally alter the reasoning path.

    Authors: LatentRAG generates latent tokens for thoughts and subqueries in one forward pass to achieve efficiency, with retrieval performed over these latents in parallel. This design avoids the latency of iterative autoregressive generation while aiming to capture multi-step reasoning through parallel latent branches. We note that the manuscript does not provide an ablation on iterative feedback. We will add an ablation study in the revision comparing the current approach to one that incorporates retrieval results for subsequent latent token generation, to show the conditions under which pre-encoded branches are sufficient. revision: yes

  3. Referee: [Abstract] The weakest assumption—that hidden-state latents can faithfully encode the semantic content of natural-language thoughts and subqueries sufficiently for effective retrieval—receives no direct validation (e.g., retrieval recall@K on latent vs. explicit subqueries, or human evaluation of decoded thoughts). Without such evidence the end-to-end optimization claim rests on an untested substitution.

    Authors: We recognize that direct validation of the semantic fidelity of the latent tokens would strengthen the paper. The parallel latent decoding is provided for interpretability, but we agree it does not constitute quantitative validation such as recall@K or human evaluation. The end-to-end results provide indirect support. In the revised manuscript, we will include retrieval recall@K comparisons between latent and explicit subqueries as well as human evaluations or detailed qualitative analysis of the decoded thoughts. revision: yes
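
For reference, the promised recall@K comparison is simple to specify; the sketch below shows what it would compute, with retrieve_text, retrieve_latent, and the data layout as placeholders rather than the paper's actual interfaces.

```python
# Sketch of the promised validation: retrieval recall@K for explicit natural-language
# subqueries vs. their latent counterparts. retrieve() and the data layout are placeholders.
def recall_at_k(retrieve, queries, gold_doc_ids, k=5):
    hits = 0
    for query, gold in zip(queries, gold_doc_ids):
        retrieved = retrieve(query, top_k=k)   # expected to return a list of document ids
        hits += int(gold in retrieved)
    return hits / len(queries)

# explicit_r5 = recall_at_k(retrieve_text,   explicit_subqueries, gold_ids, k=5)
# latent_r5   = recall_at_k(retrieve_latent, latent_subqueries,   gold_ids, k=5)
# A small gap would directly support the fidelity premise; a large gap would show the
# latent substitution losing retrieval-relevant content.
```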

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents LatentRAG as an empirical framework that generates latent tokens from hidden states in one forward pass, aligns them with dense retrieval, and evaluates via experiments on seven benchmarks. No equations, predictions, or self-citations are shown that reduce performance gains or the core mechanism to quantities defined by the inputs themselves. Claims rest on external benchmark comparisons rather than any algebraic identity or fitted-parameter renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The framework introduces latent tokens and latent-space alignment as new mechanisms whose effectiveness is asserted rather than derived from prior results; no free parameters are explicitly named in the abstract, but the alignment process implicitly requires learned parameters.

invented entities (2)
  • latent tokens for thoughts and subqueries · no independent evidence
    purpose: represent intermediate reasoning steps in continuous space for single-pass generation and retrieval
    Introduced to replace autoregressive token generation; no independent evidence of semantic fidelity is provided in the abstract.
  • parallel latent decoding mechanism · no independent evidence
    purpose: translate latent tokens back to natural language for transparency
    Added to improve interpretability; its contribution to overall performance is not quantified separately.

pith-pipeline@v0.9.0 · 5532 in / 1292 out tokens · 44314 ms · 2026-05-08T10:22:26.411377+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 28 canonical work pages · 8 internal anchors

  1. [1]

    Large language models in law: A survey

    Jinqi Lai, Wensheng Gan, Jiayang Wu, Zhenlian Qi, and Philip S Yu. Large language models in law: A survey. AI Open, 2024

  2. [2]

    A survey on large language models for mathematical reasoning

    Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, et al. A survey on large language models for mathematical reasoning. ACM Comput. Surv., 2025

  3. [3]

    Toward expert-level medical question answering with large language models

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nat. Med., 2025

  4. [4]

    Survey on factuality in large language models

    Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Qipeng Guo, Xiangkun Hu, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Xuming Hu, Zehan Qi, Wenyang Gao, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, and Yue Zhang. Survey on factuality in large language models. ACM Comput. Surv., 2025

  5. [5]

    Factuality of large language models: A survey

    Yuxia Wang, Minghan Wang, Muhammad Arslan Manzoor, Fei Liu, Georgi Nenkov Georgiev, Rocktim Jyoti Das, and Preslav Nakov. Factuality of large language models: A survey. In EMNLP, 2024

  6. [6]

    Knowledge editing for large language models: A survey

    Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. Knowledge editing for large language models: A survey. ACM Comput. Surv., 2024

  7. [7]

    Bring your own knowledge: A survey of methods for LLM knowledge expansion

    Mingyang Wang, Alisa Stoll, Lukas Lange, Heike Adel, Hinrich Schütze, and Jannik Strötgen. Bring your own knowledge: A survey of methods for LLM knowledge expansion. arXiv preprint arXiv:2502.12598, 2025

  8. [8]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Comput. Surv., 2023

  9. [9]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 2025

  10. [10]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, 2020

  11. [11]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InICML, 2020

  12. [12]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023

  13. [13]

    Graph retrieval-augmented generation: A survey

    Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. Graph retrieval-augmented generation: A survey. ACM Trans. Inf. Syst., 2025

  14. [14]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. InACL, 2023

  15. [15]

    Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. Agentic retrieval-augmented generation: A survey on agentic RAG.arXiv preprint arXiv:2501.09136, 2025

  16. [16]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InICLR, 2023

  17. [17]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InNeurIPS, 2023

  18. [18]

    Search-o1: Agentic search-enhanced large reasoning models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. InEMNLP, 2025

  19. [19]

    Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. In COLM, 2025

  20. [20]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022

  21. [21]

    Towards agentic RAG with deep reasoning: A survey of RAG-reasoning systems in LLMs

    Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, et al. Towards agentic RAG with deep reasoning: A survey of RAG-reasoning systems in LLMs. InFindings of EMNLP, 2025

  22. [22]

    An empirical study on reinforcement learning for reasoning-search interleaved LLM agents

    Bowen Jin, Jinsung Yoon, Priyanka Kargupta, Sercan O Arik, and Jiawei Han. An empirical study on reinforcement learning for reasoning-search interleaved LLM agents. arXiv preprint arXiv:2505.15117, 2025

  23. [23]

    A comprehensive survey on reinforcement learning-based agentic search: Foundations, roles, optimizations, evaluations, and applications

    Minhua Lin, Zongyu Wu, Zhichao Xu, Hui Liu, Xianfeng Tang, Qi He, Charu Aggarwal, Xiang Zhang, and Suhang Wang. A comprehensive survey on reinforcement learning-based agentic search: Foundations, roles, optimizations, evaluations, and applications. arXiv preprint arXiv:2510.16724, 2025

  24. [24]

    DeepRAG: Thinking to retrieve step by step for large language models

    Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Jie Zhou. DeepRAG: Thinking to retrieve step by step for large language models. InICLR, 2026

  25. [25]

    RAG-R1: Incentivizing the search and reasoning capabilities of LLMs through multi-query parallelism

    Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, and Jinjie Gu. RAG-R1: Incentivizing the search and reasoning capabilities of LLMs through multi-query parallelism. InAAAI, 2026

  26. [26]

    Training large language models to reason in a continuous latent space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. InCOLM, 2025

  27. [27]

    Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning

    Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning. arXiv preprint arXiv:2505.16782, 2025

  28. [28]

    Compressed chain of thought: Efficient reasoning through dense representations

    Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171, 2024

  29. [29]

    A survey on latent reasoning

    Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, et al. A survey on latent reasoning. arXiv preprint arXiv:2507.06203, 2025

  30. [30]

    Large concept models: Language modeling in a sentence representation space

    Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R Costa-jussà, David Dale, et al. Large concept models: Language modeling in a sentence representation space.arXiv preprint arXiv:2412.08821, 2024

  31. [31]

    LLM pretraining with continuous concepts

    Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason Weston, and Xian Li. LLM pretraining with continuous concepts. arXiv preprint arXiv:2502.08524, 2025

  32. [32]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. InICLR, 2024

  33. [33]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025

  34. [34]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  35. [35]

    Search and refine during think: Facilitating knowledge refinement for improved retrieval-augmented reasoning

    Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, and Xiang Wang. Search and refine during think: Facilitating knowledge refinement for improved retrieval-augmented reasoning. InNeurIPS, 2025

  36. [36]

    Model internals-based answer attribution for trustworthy retrieval-augmented generation

    Jirui Qi, Gabriele Sarti, Raquel Fernández, and Arianna Bisazza. Model internals-based answer attribution for trustworthy retrieval-augmented generation. InEMNLP, 2024

  37. [37]

    SAFE: Improving LLM systems using sentence-level in-generation attribution

    João Eduardo Batista, Emil Vatai, and Mohamed Wahib. SAFE: Improving LLM systems using sentence-level in-generation attribution. arXiv preprint arXiv:2505.12621, 2025

  38. [38]

    Active retrieval augmented generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In EMNLP, 2023

  39. [39]

    ReAgent: Reversible multi-agent reasoning for knowledge-enhanced multi-hop QA

    Zhao Xinjie, Fan Gao, Xingyu Song, Yingjian Chen, Rui Yang, Yanran Fu, Yuyang Wang, Yusuke Iwasawa, Yutaka Matsuo, and Irene Li. ReAgent: Reversible multi-agent reasoning for knowledge-enhanced multi-hop QA. InEMNLP, 2025

  40. [40]

    Self-RAG: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. InICLR, 2024

  41. [41]

    AutoRAG: Automated framework for optimization of retrieval augmented generation pipeline

    Dongkyu Kim, Byoungwook Kim, Donggeon Han, and Matouš Eibich. AutoRAG: Automated framework for optimization of retrieval augmented generation pipeline. arXiv preprint arXiv:2410.20878, 2024

  42. [42]

    Unified active retrieval for retrieval augmented generation

    Qinyuan Cheng, Xiaonan Li, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, and Xipeng Qiu. Unified active retrieval for retrieval augmented generation. InFindings of EMNLP, 2024

  43. [43]

    Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. InNAACL, 2024

  44. [44]

    ReSearch: Learning to reason with search for LLMs via reinforcement learning

    Mingyang Chen, Linzhuang Sun, Tianpeng Li, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, et al. ReSearch: Learning to reason with search for LLMs via reinforcement learning. InNeurIPS, 2025

  45. [45]

    R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

    Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-Searcher: Incentivizing the search capability in LLMs via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025

  46. [46]

    DeepResearcher: Scaling deep research via reinforcement learning in real-world environments

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. InEMNLP, 2025

  47. [47]

    TIPS: Turn-level information-potential reward shaping for search-augmented LLMs

    Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, and Xiaolong Wang. TIPS: Turn-level information-potential reward shaping for search-augmented LLMs. InICLR, 2026

  48. [48]

    HiPRAG: hierarchical process rewards for efficient agentic retrieval augmented generation

    Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Kaiyu He, Xinya Du, and Zhiyu Chen. HiPRAG: hierarchical process rewards for efficient agentic retrieval augmented generation. InICLR, 2026

  49. [49]

    A2Search: Ambiguity-aware question answering with reinforcement learning

    Fengji Zhang, Xinyao Niu, Chengyang Ying, Guancheng Lin, Zhongkai Hao, Zhou Fan, Chengen Huang, Jacky Keung, Bei Chen, and Junyang Lin. A2Search: Ambiguity-aware question answering with reinforcement learning. In ICLR, 2026

  50. [50]

    R-Search: Empowering LLM reasoning with search via multi-reward reinforcement learning

    Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, and Limin Liu. R-Search: Empowering LLM reasoning with search via multi-reward reinforcement learning. arXiv preprint arXiv:2506.04185, 2025

  51. [51]

    ParallelSearch: Train your LLMs to decompose query and search sub-queries in parallel with reinforcement learning

    Shu Zhao, Tan Yu, Anbang Xu, Japinder Singh, Aaditya Shukla, and Rama Akkiraju. ParallelSearch: Train your LLMs to decompose query and search sub-queries in parallel with reinforcement learning. arXiv preprint arXiv:2508.09303, 2025

  52. [52]

    WideSeek-R1: Exploring width scaling for broad information seeking via multi-agent reinforcement learning

    Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu, Weilin Liu, Quanlu Zhang, Wenbo Ding, Chao Yu, and Yu Wang. WideSeek-R1: Exploring width scaling for broad information seeking via multi-agent reinforcement learning. arXiv preprint arXiv:2602.04634, 2026

  53. [53]

    The latent space: Foundation, evolution, mechanism, ability, and outlook

    Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook. arXiv preprint arXiv:2604.02029, 2026

  54. [54]

    Let’s think dot by dot: Hidden computation in transformer language models

    Jacob Pfau, William Merrill, and Samuel R Bowman. Let’s think dot by dot: Hidden computation in transformer language models. InCOLM, 2024

  55. [55]

    CODI: Compressing chain-of-thought into continuous space via self-distillation

    Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. CODI: Compressing chain-of-thought into continuous space via self-distillation. InEMNLP, 2025

  56. [56]

    SynAdapt: Learning adaptive reasoning in large language models via synthetic continuous chain-of-thought

    Jianwei Wang, Ziming Wu, Fuming Lai, Shaobing Lian, and Ziqian Zeng. SynAdapt: Learning adaptive reasoning in large language models via synthetic continuous chain-of-thought. arXiv preprint arXiv:2508.00574, 2025

  57. [57]

    SIM-CoT: Supervised implicit chain-of-thought

    Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. SIM-CoT: Supervised implicit chain-of-thought. InICLR, 2026

  58. [58]

    Soft thinking: Unlocking the reasoning potential of LLMs in continuous concept space

    Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of LLMs in continuous concept space. In NeurIPS, 2025

  59. [59]

    The geometry of reasoning: Flowing logics in representation space

    Yufa Zhou, Yixiao Wang, Xunjian Yin, Shuyan Zhou, and Anru R Zhang. The geometry of reasoning: Flowing logics in representation space. InICLR, 2026

  60. [60]

    LLM latent reasoning as chain of superposition

    Jingcheng Deng, Liang Pang, Zihao Wei, Shichen Xu, Zenghao Duan, Kun Xu, Yang Song, Huawei Shen, and Xueqi Cheng. Latent reasoning in LLMs as a vocabulary-space superposition.arXiv preprint arXiv:2510.15522, 2025

  61. [61]

    SemCoT: Accelerating chain-of-thought reasoning through semantically-aligned implicit tokens

    Yinhan He, Wendy Zheng, Yaochen Zhu, Zaiyi Zheng, Lin Su, Sriram Vasudevan, Qi Guo, Liangjie Hong, and Jundong Li. SemCoT: Accelerating chain-of-thought reasoning through semantically-aligned implicit tokens. InNeurIPS, 2025

  62. [62]

    SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs

    Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs. InACL, 2025

  63. [63]

    CLaRa: Bridging retrieval and generation with continuous latent reasoning

    Jie He, Richard He Bai, Sinead Williamson, Jeff Z Pan, Navdeep Jaitly, and Yizhe Zhang. CLaRa: Bridging retrieval and generation with continuous latent reasoning. arXiv preprint arXiv:2511.18659, 2025

  64. [64]

    LaSER: Internalizing explicit reasoning into latent space for dense retrieval

    Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu, and Zhicheng Dou. LaSER: Internalizing explicit reasoning into latent space for dense retrieval. arXiv preprint arXiv:2603.01425, 2026

  65. [65]

    A survey on RAG meeting LLMs: Towards retrieval-augmented large language models

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on RAG meeting LLMs: Towards retrieval-augmented large language models. InKDD, 2024

  66. [66]

    FlashRAG: A modular toolkit for efficient retrieval-augmented generation research

    Jiajie Jin, Yutao Zhu, Zhicheng Dou, Guanting Dong, Xinyu Yang, Chenghao Zhang, Tong Zhao, Zhao Yang, and Ji-Rong Wen. FlashRAG: A modular toolkit for efficient retrieval-augmented generation research. InWWW, 2025

  67. [67]

    CRUD-RAG: A comprehensive Chinese benchmark for retrieval-augmented generation of large language models

    Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, and Enhong Chen. CRUD-RAG: A comprehensive Chinese benchmark for retrieval-augmented generation of large language models. ACM Trans. Inf. Syst., 2025

  68. [68]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  69. [69]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. TACL, 2019

  70. [70]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. InACL, 2017

  71. [71]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In ACL, 2023

  72. [72]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018

  73. [73]

    Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. InCOLING, 2020

  74. [74]

    MuSiQue: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. TACL, 2022

  75. [75]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. InFindings of EMNLP, 2023

  76. [76]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InEMNLP, 2020

  77. [77]

    Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. InFindings of EMNLP, 2023

  78. [78]

    ZeroSearch: Incentivize the search capability of LLMs without searching

    Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. ZeroSearch: Incentivize the search capability of LLMs without searching. arXiv preprint arXiv:2505.04588, 2025

  79. [79]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  80. [80]

    MTEB: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. InEACL, 2023

Showing first 80 references.