arxiv: 2604.17866 · v2 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Recognition: unknown

Latent Abstraction for Retrieval-Augmented Generation

Ha Lan N.T , Minh-Anh Nguyen , Dung D. Le

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:18 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords retrieval-augmented generationlatent space retrievalquestion answeringlarge language modelsmulti-hop reasoninginference efficiencyhidden state representations

0 comments

The pith

A single LLM can perform retrieval-augmented generation entirely inside its own latent space using hidden-state vectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LAnR as a framework that lets one language model handle document encoding, retrieval, and answer generation without ever leaving its internal representations. Instead of producing text queries at each step and handing them to a separate retriever, the model extracts dense vectors from the hidden states of a special [PRED] token and matches them directly against document encodings produced by the same model. A small MLP head on those same states also decides when enough evidence has been gathered by monitoring answer-token entropy, removing the need for explicit stopping logic or extra models. If the approach works as described, retrieval-augmented systems become simpler, require fewer retrieval steps, and integrate knowledge more tightly with generation.

Core claim

LAnR is a unified framework in which a single LLM jointly performs encoding, retrieval, and generation entirely within its own latent space. Rather than generating textual queries, LAnR produces dense retrieval vectors from the hidden states of a designated [PRED] token and uses them to match against encoded document representations from the same model. LAnR further adaptively decides when sufficient evidence has been retrieved using a lightweight MLP control head over those same hidden states, eliminating both the separate retriever and explicit token-level stopping reasoning.

What carries the argument

The hidden states of a designated [PRED] token, which supply both the dense vectors used for retrieval and the features fed to the MLP that decides retrieval sufficiency via answer-token entropy.

If this is right

LAnR achieves higher accuracy than prior RAG systems on both single-hop and multi-hop question-answering benchmarks.
The method reduces the total number of retrieval calls during inference while maintaining or improving answer quality.
Retrieval and generation become more tightly coupled because they share the same model's latent representations.
No separate retriever model or hand-crafted stopping criteria are required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-vector approach could be applied to tasks other than QA where external knowledge must be consulted on demand.
Training the underlying LLM with an objective that directly rewards good latent retrieval behavior might further improve the method.
If the entropy signal generalizes, similar lightweight control heads could be added to existing LLMs to let them decide autonomously when to fetch external information.

Load-bearing premise

The assumption that answer token entropy from the model's hidden states reliably indicates when retrieval is sufficient, and that dense vectors drawn from the [PRED] token can serve as effective replacements for natural-language retrieval queries.

What would settle it

Testing whether the correlation between answer-token entropy and retrieval sufficiency persists when LAnR is run on a new base model or on a different collection of QA benchmarks that were not used in the original experiments.

Figures

Figures reproduced from arXiv: 2604.17866 by Dung D. Le, Ha Lan N.T, Minh-Anh Nguyen.

**Figure 1.** Figure 1: Comparison between conventional RAG and LAnR for multi-hop QA. Conventional RAG performs explicit reasoning at each hop, including generating intermediate text, forming search queries, and deciding whether to continue retrieval. In contrast, LAnR operates in latent space: a special token [PRED] produces query vectors from hidden states, while a lightweight MLP controls the retrieval process, enabling more … view at source ↗

**Figure 2.** Figure 2: Comparison of inference time, generated tokens, and Exact Match accuracy between prior [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of LAnR. Queries are injected into the LLM and combined with a [PRED] token to form a latent query from hidden representations. This latent query is used for retrieval and to decide whether further retrieval is needed via a lightweight MLP Retrieval Control Head. The LLM then generates the answer from the retrieved context. contrastive target mechanism that dynamically updates the retrieval object… view at source ↗

**Figure 4.** Figure 4 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: (RQ3) Per-dataset EM distributions for LAnR, AutoRefine, and Search-R1. LAnR achieves competitive or higher EM with the fewest retrieval calls and consistently narrow variance. Inference Efficiency of Latent Retrieval (RQ3) [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: EM at various fractions of training data used (5 % to 100 %). Most of the performance gain [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Entropy distribution on RAG benchmarks using Qwen models when answers are generated [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: ROC curves for the retrieval control head across five benchmarks. Instruction tuning yields [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: EM accuracy by training step for models trained with 1, 2, and 3 [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

read the original abstract

Retrieval-Augmented Generation (RAG) has become a standard approach for enhancing large language models (LLMs) with external knowledge, mitigating hallucinations, and improving factuality. However, existing systems rely on generating natural language queries at each hop and maintaining a strict architectural separation between retriever and generator, preventing them from leveraging the full representational capacity of the LLM. We propose \textbf{LAnR} (Latent Abstraction for RAG), a unified framework in which a single LLM jointly performs encoding, retrieval, and generation entirely within its own latent space. Rather than generating textual queries, LAnR produces dense retrieval vectors from the hidden states of a designated \texttt{[PRED]} token and uses them to match against encoded document representations from the same model. Furthermore, LAnR adaptively decides when sufficient evidence has been retrieved using a lightweight MLP control head over those same hidden states, eliminating both the separate retriever and explicit token-level stopping reasoning. This design is motivated by our empirical observation that answer token entropy reliably signals retrieval sufficiency. Extensive experiments on six QA benchmarks spanning single-hop and multi-hop settings demonstrate that LAnR outperforms existing RAG methods, while achieving improved inference efficiency through reduced number of retrieval calls and tighter model integration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LAnR folds RAG into one LLM via [PRED]-token vectors and entropy MLP stopping, but the benchmark gains rest on thin validation of those two assumptions.

read the letter

The core move is to replace natural-language queries with dense vectors taken from a designated [PRED] token inside the same LLM, then let a small MLP head decide retrieval is done once answer-token entropy drops. That removes the separate retriever and the explicit query-generation step at each hop. The paper shows this unified latent loop on six QA benchmarks and reports fewer retrieval calls plus higher scores than prior RAG baselines. The unification itself is the clearest advance; it directly attacks the architectural split that most current systems still carry. The entropy signal is a simple, testable idea that could cut unnecessary hops if it generalizes. Both pieces are presented as motivated by direct observation rather than heavy theory, which keeps the contribution focused. The soft spots sit right at the center. The abstract gives no correlation numbers between the [PRED] vectors and standard query embeddings, no cross-model ablations on the MLP head, and no failure cases where entropy misfires. Without those, it is hard to know whether the reported efficiency and accuracy edges come from the latent design or from unstated differences in training or retrieval corpus. The six-benchmark claim is also stated without the usual controls on splits, significance, or exact baseline implementations. A reader working on practical RAG pipelines would still get value from the framework sketch and the concrete design choices around the [PRED] token. The work is coherent on its own terms and engages the existing literature without circularity, so it is worth a serious referee. I would send it out but with explicit requests for the missing ablations on the entropy head and direct vector comparisons.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes LAnR, a unified framework for Retrieval-Augmented Generation in which a single LLM performs encoding, retrieval, and generation entirely in latent space. Retrieval vectors are derived from hidden states at a designated [PRED] token rather than natural-language queries, and an MLP control head over the same states adaptively stops retrieval based on the empirical observation that answer-token entropy signals sufficiency. The authors claim that this design outperforms existing RAG methods on six QA benchmarks (single-hop and multi-hop) while improving inference efficiency via fewer retrieval calls and tighter model integration.

Significance. If the results hold, the work would be significant for demonstrating that retrieval can be folded into an LLM's latent representations without a separate retriever or explicit query generation, potentially simplifying RAG pipelines and reducing inference overhead. The approach receives credit for the tight integration and the attempt to ground the stopping rule in an observable property of the generator's own hidden states.

major comments (2)

[Abstract] Abstract: The central claim that LAnR 'outperforms existing RAG methods' on six benchmarks is presented without any information on the baselines compared, statistical significance of gains, data splits, or controls for confounds such as model scale or training data overlap. This absence prevents verification that the data support the stated superiority.
[Abstract] Abstract (motivation): The design rests on the claim that answer-token entropy 'reliably signals retrieval sufficiency' and that [PRED]-token hidden states can replace natural-language queries for retrieval. No correlation statistics, cross-model ablations, or failure-case analysis are reported for either assumption, leaving the load-bearing empirical foundation unverified.

minor comments (1)

The abstract introduces the acronym LAnR in bold but does not expand it on first use in the body; ensure the expansion appears at the first textual occurrence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating where the manuscript will be revised.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that LAnR 'outperforms existing RAG methods' on six benchmarks is presented without any information on the baselines compared, statistical significance of gains, data splits, or controls for confounds such as model scale or training data overlap. This absence prevents verification that the data support the stated superiority.

Authors: The abstract is a high-level summary constrained by length. Full verification details appear in Section 4 (Experiments): baselines include standard RAG, ReAct, and FiD variants; statistical significance is assessed via paired t-tests (p < 0.05 reported for gains on all six benchmarks); standard train/dev/test splits are used for each dataset; and controls for model scale (identical LLM backbone) and data overlap (no training leakage) are explicitly stated. To improve standalone readability of the abstract, we will add a brief clause noting the primary baselines and that improvements are statistically significant. revision: partial
Referee: [Abstract] Abstract (motivation): The design rests on the claim that answer-token entropy 'reliably signals retrieval sufficiency' and that [PRED]-token hidden states can replace natural-language queries for retrieval. No correlation statistics, cross-model ablations, or failure-case analysis are reported for either assumption, leaving the load-bearing empirical foundation unverified.

Authors: The empirical motivation is supported by the ablation studies in Section 3.2 and Section 4, which compare entropy-based stopping against fixed-retrieval schedules and demonstrate that [PRED] hidden states yield retrieval vectors competitive with or superior to explicit query generation. We agree that explicit quantitative support would strengthen the presentation. In the revision we will insert correlation coefficients between answer-token entropy and retrieval decisions, expand cross-model ablations, and add a short failure-case analysis subsection. revision: yes

Circularity Check

0 steps flagged

No circularity; method motivated by external empirical observation and evaluated on independent benchmarks

full rationale

The paper's core proposal (LAnR using [PRED] hidden states for retrieval vectors and MLP on entropy for stopping) is presented as motivated by an empirical observation and then tested on six external QA benchmarks. No equations, derivations, or claims reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The design choices are not forced by prior author work or ansatz smuggling; they are architectural decisions justified by the stated observation and validated externally. This matches the default case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities; the method builds on standard LLM hidden states and introduces a new control head and token usage without detailing additional postulates.

pith-pipeline@v0.9.0 · 5518 in / 1281 out tokens · 65425 ms · 2026-05-10T04:18:04.326091+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 23 canonical work pages · 5 internal anchors

[1]

A survey on rag with llms.Procedia computer science, 246:3781–3790, 2024

Muhammad Arslan, Hussam Ghanem, Saba Munawar, and Christophe Cruz. A survey on rag with llms.Procedia computer science, 246:3781–3790, 2024

2024
[2]

Retrieval-based language models and applications

Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. Retrieval-based language models and applications. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pages 41–46, 2023

2023
[3]

Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, 2023

2023
[4]

Llm2vec: Large language models are secretly powerful text encoders.arXiv preprint arXiv:2404.05961, 2024

Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders, 2024.URL https://arxiv. org/abs/2404.05961, 2024

work page arXiv 2024
[5]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3- embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216, 4(5), 2024

work page internal anchor Pith review arXiv 2024
[7]

Learning to reason with search for llms via reinforcement learning,

Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, et al. Learning to reason with search for llms via reinforcement learning.arXiv preprint arXiv:2503.19470, 2025

work page arXiv 2025
[9]

Reasoning beyond language: A compre- hensive survey on latent chain-of-thought reasoning.CoRR, abs/2505.16782, 2025

Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning.arXiv preprint arXiv:2505.16782, 2025

work page arXiv 2025
[10]

xrag: Extreme context compression for retrieval-augmented generation with one token.Advances in Neural Information Processing Systems, 37:109487–109516, 2024

Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. xrag: Extreme context compression for retrieval-augmented generation with one token.Advances in Neural Information Processing Systems, 37:109487–109516, 2024

2024
[11]

Rader: Reasoning-aware dense retrieval models

Debrup Das, Sam O’Nuallain, and Razieh Rahimi. Rader: Reasoning-aware dense retrieval models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19981–20008, 2025

2025
[12]

Following the autoregressive nature of llm embeddings via compression and alignment

Jingcheng Deng, Zhongtao Jiang, Liang Pang, Zihao Wei, Liwei Chen, Kun Xu, Yang Song, Huawei Shen, and Xueqi Cheng. Following the autoregressive nature of llm embeddings via compression and alignment. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12672–12688, 2025

2025
[13]

In-context autoencoder for context compression in a large language model,

Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model.arXiv preprint arXiv:2307.06945, 2023

work page arXiv 2023
[14]

Think before you speak: Training language models with pause tokens

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226, 2023

work page arXiv 2023
[15]

Training Large Language Models to Reason in a Continuous Latent Space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

work page internal anchor Pith review arXiv 2024
[16]

Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020. 10

2020
[17]

Active retrieval augmented generation

Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 7969–7992, 2023

2023
[18]

Long-context llms meet rag: Overcoming challenges for long inputs in rag.arXiv preprint arXiv:2410.05983, 2024

Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan O Arik. Long-context llms meet rag: Overcoming challenges for long inputs in rag.arXiv preprint arXiv:2410.05983, 2024

work page arXiv 2024
[19]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

work page internal anchor Pith review arXiv 2025
[20]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017

2017
[21]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

2020
[22]

Freeson: Retriever-free retrieval-augmented reasoning via corpus-traversing mcts.arXiv preprint arXiv:2505.16409, 2025

Chaeeun Kim and Seungone Kim. Freeson: Retriever-free retrieval-augmented reasoning via corpus-traversing mcts.arXiv preprint arXiv:2505.16409, 2025

work page arXiv 2025
[23]

Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

2019
[24]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

2020
[25]

Implicit reasoning in large language models: A comprehensive survey.arXiv preprint arXiv:2509.02350, 2025

Jindong Li, Yali Fu, Li Fan, Jiahong Liu, Yao Shu, Chengwei Qin, Menglin Yang, Irwin King, and Rex Ying. Implicit reasoning in large language models: A comprehensive survey.arXiv preprint arXiv:2509.02350, 2025

work page arXiv 2025
[26]

Search-o1: Agentic search-enhanced large reasoning models

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025

2025
[27]

When text embedding meets large language model: a comprehensive survey

Zhijie Nie, Zhangchi Feng, Mingxin Li, Cunwang Zhang, Yanzhao Zhang, Dingkun Long, and Richong Zhang. When text embedding meets large language model: a comprehensive survey. arXiv preprint arXiv:2412.09165, 2024

work page arXiv 2024
[28]

Multilayer perceptron tutorial.School of Computing

Leonardo Noriega. Multilayer perceptron tutorial.School of Computing. Staffordshire University, 4(5):444, 2005

2005
[29]

In-context retrieval-augmented language models.Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models.Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023

2023
[30]

Now Publishers Inc, 2009

Stephen Robertson and Hugo Zaragoza.The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009

2009
[31]

Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023. 11

2023
[32]

arXiv preprint arXiv:2510.05069

Dachuan Shi, Abedelkadir Asi, Keying Li, Xiangchi Yuan, Leyan Pan, Wenke Lee, and Wen Xiao. Swireasoning: Switch-thinking in latent and explicit for pareto-superior reasoning llms. arXiv preprint arXiv:2510.05069, 2025

work page arXiv 2025
[33]

Search and refine during think: Facilitating knowledge refinement for improved retrieval-augmented reasoning.arXiv preprint arXiv:2505.11277, 2025

Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, and Xiang Wang. Search and refine during think: Facilitating knowledge refinement for improved retrieval-augmented reasoning.arXiv preprint arXiv:2505.11277, 2025

work page arXiv 2025
[34]

Improving dense retrieval models with llm augmented data for dataset search.Knowledge-based systems, 294:111740, 2024

Levy Silva and Luciano Barbosa. Improving dense retrieval models with llm augmented data for dataset search.Knowledge-based systems, 294:111740, 2024

2024
[35]

Repetition improves language model embeddings.arXiv preprint arXiv:2402.15449, 2024

Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, and Aditi Raghunathan. Repetition improves language model embeddings.arXiv preprint arXiv:2402.15449, 2024

work page arXiv 2024
[36]

Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

2022
[37]

Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 10014–10037, 2023

2023
[38]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

work page internal anchor Pith review arXiv 2022
[39]

Improving text embeddings with large language models

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11897–11916, 2024

2024
[40]

System-1.5 reasoning: Traversal in language and latent spaces with dynamic shortcuts

Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, and Bang Liu. System-1.5 reasoning: Traversal in language and latent spaces with dynamic shortcuts.arXiv preprint arXiv:2505.18962, 2025

work page arXiv 2025
[41]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022
[42]

arXiv preprint arXiv:2007.00808 , year=

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval.arXiv preprint arXiv:2007.00808, 2020

work page arXiv 2007
[43]

Softcot: Soft chain-of-thought for efficient reasoning with llms

Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. Softcot: Soft chain-of-thought for efficient reasoning with llms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23336–23351, 2025

2025
[44]

arXiv preprint arXiv:2505.11484 (2025)

Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. Softcot++: Test-time scaling with soft chain-of-thought reasoning.arXiv preprint arXiv:2505.11484, 2025

work page arXiv 2025
[45]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

2018
[47]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022. 12

2022
[48]

Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

2023
[49]

The latent space: Foundation, evolution, mechanism, ability, and outlook.arXiv preprint arXiv:2604.02029, 2026

Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook.arXiv preprint arXiv:2604.02029, 2026

work page arXiv 2026
[50]

Inference scaling for long-context retrieval augmented generation.arXiv preprint arXiv:2410.04343, 2024

Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. Inference scaling for long-context retrieval augmented generation.arXiv preprint arXiv:2410.04343, 2024

work page arXiv 2024
[51]

search-during-think

Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning.arXiv preprint arXiv:2505.18454, 2025. A Related Works A.1 Latent Reasoning A key limitation of dominant reasoning paradigms such as explicit chain-of-thought (CoT) [41, 48, 14, 49] lies in...

work page arXiv 2025