pith. machine review for the scientific record.

arxiv: 2604.26649 · v1 · submitted 2026-04-29 · 💻 cs.IR · cs.AI · cs.CL

Recognition: unknown

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 13:12 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords adaptive retrieval · large reasoning models · RAG · uncertainty detection · multi-hop QA · chain of thought · retrieval intervention · efficiency trade-offs

The pith

ReaLM-Retrieve detects uncertainty at individual reasoning steps and fetches external evidence only when needed, raising answer F1 by 10.1 points (absolute) on average while cutting retrieval calls by 47 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models produce long chains of thought that often contain knowledge gaps, but standard RAG supplies evidence only before reasoning starts. The paper shows that inserting retrieval adaptively at step-level uncertainty points improves answer accuracy on multi-hop questions while issuing far fewer retrievals than fixed schedules. A sympathetic reader cares because aligning retrieval timing with the reasoning process lets extended reasoning models stay both efficient and accurate on complex tasks that require chaining facts from external sources.
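
To make the mechanism concrete, here is a minimal sketch of a retrieve-during-reasoning loop in the spirit of the pith above. Every interface (`generate_step`, `uncertainty`, `retrieve`, the threshold value) is a hypothetical stand-in, not the paper's implementation.

```python
def answer_with_adaptive_retrieval(question, model, retriever,
                                   max_steps=32, threshold=0.5):
    """Generate a chain of thought, retrieving only at uncertain steps."""
    context = []   # evidence passages gathered so far
    chain = []     # reasoning steps produced so far
    calls = 0      # retrieval calls issued (the efficiency metric)
    for _ in range(max_steps):
        step = model.generate_step(question, context, chain)
        if step.is_final:
            return step.answer, calls
        # Retrieve only when a step-level detector flags a knowledge gap,
        # rather than once before reasoning (standard RAG) or on a fixed
        # schedule (IRCoT-style interleaving).
        if model.uncertainty(step) > threshold:
            context.extend(retriever.retrieve(step.text, k=5))
            calls += 1
        chain.append(step)
    return None, calls
```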

Core claim

ReaLM-Retrieve addresses the misalignment between pre-reasoning RAG and long thought chains by combining a step-level uncertainty detector, a learned retrieval intervention policy, and an efficiency-optimized integration mechanism. On MuSiQue, HotpotQA, and 2WikiMultiHopQA it delivers a 10.1 percent absolute F1 gain over standard RAG and 47 percent fewer retrieval calls than fixed-interval methods such as IRCoT, reaching 71.2 percent F1 on MuSiQue with an average of 1.8 calls per question while also raising retrieval precision and MRR.
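
For readers auditing the headline numbers, the metrics named here are the standard ones. A self-contained sketch, assuming whitespace tokenization and passage-id lists (the usual lowercasing and punctuation normalization are omitted for brevity):

```python
from collections import Counter

def answer_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1, the standard answer metric on these QA benchmarks."""
    pred, ref = prediction.split(), gold.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def recall_at_k(ranked_ids: list, gold_ids: set, k: int = 5) -> float:
    """Fraction of gold supporting passages appearing in the top k results."""
    return len(set(ranked_ids[:k]) & gold_ids) / len(gold_ids)

def mrr(ranked_ids: list, gold_ids: set) -> float:
    """Reciprocal rank of the first gold passage (0 if none is retrieved)."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in gold_ids:
            return 1.0 / rank
    return 0.0
```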

What carries the argument

The step-level uncertainty detector that flags knowledge gaps at reasoning-step granularity, together with the retrieval intervention policy that decides when external evidence will most benefit the ongoing chain.
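
The detector itself is not specified at this level of detail (a referee point below); one common family of step-level signals is likelihood-based. A hedged sketch, assuming access to per-token log-probabilities for each reasoning step; this is an illustrative stand-in, not the paper's detector:

```python
def step_uncertainty(token_logprobs: list) -> float:
    """Mean negative log-probability across a reasoning step's tokens.

    A common proxy for step-level uncertainty: high values suggest the
    model is guessing, which adaptive-retrieval methods can treat as a
    knowledge gap.
    """
    if not token_logprobs:
        return 0.0
    return -sum(token_logprobs) / len(token_logprobs)

def flags_knowledge_gap(token_logprobs: list, threshold: float = 1.2) -> bool:
    """Flag a step for retrieval when its uncertainty crosses a threshold."""
    return step_uncertainty(token_logprobs) > threshold
```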

Load-bearing premise

The step-level uncertainty detector reliably identifies genuine knowledge gaps in the reasoning chain without false positives that derail the process or false negatives that leave gaps unfilled.

What would settle it

Run the uncertainty detector on held-out reasoning traces with human-annotated true gaps, then compare F1 when the policy retrieves only at detector-flagged steps against a no-retrieval baseline.
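
A minimal version of that check, assuming each trace step carries a human gap annotation paired with a detector flag (both field conventions are hypothetical):

```python
def detector_scores(steps):
    """Score detector flags against human-annotated knowledge gaps.

    `steps` is a list of (flagged, has_gap) boolean pairs; the pairing of
    detector output with annotations is assumed for illustration.
    """
    tp = sum(1 for flagged, gap in steps if flagged and gap)
    fp = sum(1 for flagged, gap in steps if flagged and not gap)
    fn = sum(1 for flagged, gap in steps if not flagged and gap)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```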

Figures

Figures reproduced from arXiv: 2604.26649 by Dongxin Guo, Jikun Wu, Siu Ming Yiu.

Figure 1: Overview of ReaLM-Retrieve's adaptive retrieval approach. (a) The temporal mismatch: traditional RAG retrieves once before generation, but reasoning models encounter knowledge gaps during 12K–25K token chains. (b) Our four-stage pipeline detects step-level uncertainty and retrieves only when needed via learned policy. (c) MuSiQue benchmark results show 71.2% F1 with only 1.8 retrieval calls, outperforming …

Figure 2: F1 score by number of reasoning hops on MuSiQue.

Figure 3: Distribution of retrieval timing (as fraction of rea…
Original abstract

Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current RAG systems optimize for providing context before reasoning begins, while reasoning models require evidence injection during multi-step inference chains. We introduce ReaLM-Retrieve, a reasoning-aware retrieval framework that addresses this mismatch through three key innovations: (1) a step-level uncertainty detector that identifies knowledge gaps at reasoning-step granularity rather than token or sentence level; (2) a retrieval intervention policy that learns when external evidence maximally benefits ongoing reasoning; and (3) an efficiency-optimized integration mechanism that reduces per-retrieval overhead by 3.2x compared to naive integration. Experiments on MuSiQue, HotpotQA, and 2WikiMultiHopQA demonstrate that ReaLM-Retrieve achieves on average 10.1% absolute improvement in answer F1 over standard RAG (range: 9.0-11.8% across the three benchmarks) while reducing retrieval calls by 47% compared to fixed-interval approaches like IRCoT (all improvements significant at p<0.01, paired bootstrap). On the challenging MuSiQue benchmark requiring 2-4 hop reasoning, our method achieves 71.2% F1 with an average of only 1.8 retrieval calls per question. Analysis shows that ReaLM-Retrieve also improves retrieval quality itself, achieving 81.3% Recall@5 with consistently higher precision and MRR than fixed-interval baselines on supporting evidence, establishing new state-of-the-art efficiency-accuracy trade-offs for reasoning-intensive retrieval tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces ReaLM-Retrieve, a reasoning-aware retrieval framework for large reasoning models that generate long chains of thought. It proposes three innovations: (1) a step-level uncertainty detector to identify knowledge gaps at reasoning-step granularity, (2) a learned retrieval intervention policy to decide when external evidence is most beneficial, and (3) an efficiency-optimized integration mechanism claimed to reduce per-retrieval overhead by 3.2x. On MuSiQue, HotpotQA, and 2WikiMultiHopQA, it reports an average 10.1% absolute F1 gain over standard RAG (range 9.0-11.8%), 47% fewer retrieval calls than fixed-interval baselines like IRCoT (all p<0.01 via paired bootstrap), and specific results such as 71.2% F1 on MuSiQue with only 1.8 calls per question on average.

Significance. If the attribution of gains to the adaptive components holds after proper validation, the work could meaningfully advance RAG integration with long-horizon reasoning models by improving both accuracy and retrieval efficiency. The concrete numerical improvements and statistical testing on held-out multi-hop benchmarks are positive features, as is the emphasis on reducing retrieval frequency. However, the absence of component-level validation limits the assessed significance.

major comments (3)
  1. [Abstract] The central claim that the 10.1% F1 improvement and 47% retrieval reduction stem from the step-level uncertainty detector accurately identifying true knowledge gaps is undermined by the complete absence of standalone validation metrics (e.g., precision/recall/F1 of the detector against human-annotated missing facts at individual reasoning steps). Without these, false positives or negatives cannot be ruled out as confounders, weakening causal attribution to the three listed innovations.
  2. [Methods] (assessed from the abstract) The training procedure for the retrieval intervention policy, the exact mathematical definition and implementation of the step-level uncertainty metric, and any post-hoc choices in benchmark evaluation are not provided. This prevents verification that the reported statistically significant results (p<0.01) are supported by the data and experimental design.
  3. [Experiments] The manuscript reports improved retrieval quality (81.3% Recall@5) but provides no ablation isolating the contribution of the uncertainty detector versus the policy or integration mechanism, leaving open whether gains could arise from the base reasoning model or other unmentioned factors rather than adaptive triggering.
minor comments (1)
  1. [Abstract] A per-benchmark breakdown of the 9.0-11.8% F1 range and the exact baselines (beyond 'standard RAG' and IRCoT) would improve clarity for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below and commit to revisions that will strengthen the manuscript's reproducibility and causal claims.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the 10.1% F1 improvement and 47% retrieval reduction stem from the step-level uncertainty detector accurately identifying true knowledge gaps is undermined by the complete absence of standalone validation metrics (e.g., precision/recall/F1 of the detector against human-annotated missing facts at individual reasoning steps). Without these, false positives or negatives cannot be ruled out as confounders, weakening causal attribution to the three listed innovations.

    Authors: We acknowledge that direct validation metrics for the uncertainty detector would provide stronger evidence for its role in identifying knowledge gaps and support causal attribution. While the end-to-end gains in F1 and retrieval efficiency, along with the improved retrieval quality (81.3% Recall@5), are consistent with effective adaptive triggering, we agree that standalone evaluation is needed to rule out confounders. In the revised manuscript, we will add precision, recall, and F1 scores for the step-level uncertainty detector, computed against human-annotated reasoning steps labeled for the presence of missing facts. (revision: yes)

  2. Referee: [Methods] (assessed from the abstract) The training procedure for the retrieval intervention policy, the exact mathematical definition and implementation of the step-level uncertainty metric, and any post-hoc choices in benchmark evaluation are not provided. This prevents verification that the reported statistically significant results (p<0.01) are supported by the data and experimental design.

    Authors: We apologize for the omission of these details in the submitted version. The revised manuscript will include the precise mathematical definition and implementation of the step-level uncertainty metric, the full training procedure for the retrieval intervention policy (including data construction, learning setup, and optimization), and clarification of any post-hoc choices made during benchmark evaluation. These additions will enable independent verification of the experimental design and the reported statistical significance; a hedged sketch of one possible training setup follows these responses. (revision: yes)

  3. Referee: [Experiments] The manuscript reports improved retrieval quality (81.3% Recall@5) but provides no ablation isolating the contribution of the uncertainty detector versus the policy or integration mechanism, leaving open whether gains could arise from the base reasoning model or other unmentioned factors rather than adaptive triggering.

    Authors: We agree that component ablations are necessary to isolate the contributions of the uncertainty detector, intervention policy, and integration mechanism. Although the current comparisons to strong baselines (standard RAG and IRCoT) demonstrate overall improvements, we will add targeted ablations in the revision. These will include controlled variants that replace the uncertainty detector with fixed-interval or random triggering, ablate the learned policy, and simplify the integration mechanism, allowing quantification of each component's specific impact on F1 scores and retrieval frequency; a schematic of these trigger variants also follows below. (revision: yes)
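
On referee point 2: the abstract never states how the intervention policy is trained. Because the paper's reference list includes Williams (1992) [38], a REINFORCE-style policy gradient is one plausible reading; the sketch below is exactly that, a guess under stated assumptions, with the `policy` module, the trajectory format, and the `retrieval_cost` penalty all hypothetical rather than taken from the paper.

```python
import torch

def train_policy(policy, trajectories, optimizer, retrieval_cost=0.05):
    """One REINFORCE epoch for a when-to-retrieve policy (illustrative only).

    Each trajectory is (step_features, actions, answer_f1):
      step_features: float tensor [T, d] of per-step uncertainty features
      actions:       float tensor [T], 1.0 = retrieve at that step
      answer_f1:     final answer quality for the finished chain
    The reward trades answer quality against retrieval frequency, the
    efficiency-accuracy trade-off the paper reports optimizing.
    """
    for step_features, actions, answer_f1 in trajectories:
        probs = torch.sigmoid(policy(step_features)).squeeze(-1)  # [T]
        taken = torch.where(actions.bool(), probs, 1.0 - probs)   # P(a_t)
        reward = answer_f1 - retrieval_cost * actions.sum().item()
        loss = -reward * taken.log().sum()    # REINFORCE objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```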
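
On referee point 3: the promised ablation reduces to swapping the trigger rule while holding the reasoning loop, retriever, and integration fixed. A minimal sketch of the three variants the rebuttal names, with hypothetical interfaces:

```python
import random

def uncertainty_trigger(step_index: int, uncertainty: float,
                        threshold: float = 0.5) -> bool:
    """Adaptive variant: retrieve only when the detector flags a gap."""
    return uncertainty > threshold

def fixed_interval_trigger(step_index: int, uncertainty: float,
                           interval: int = 3) -> bool:
    """IRCoT-style baseline: retrieve every `interval` reasoning steps."""
    return step_index % interval == 0

def random_trigger(step_index: int, uncertainty: float,
                   rate: float = 0.3) -> bool:
    """Control: retrieve at random, matched to a target call rate."""
    return random.random() < rate
```

Running the same reasoning loop under each trigger and comparing answer F1 against calls per question would separate gains from adaptive timing from gains due to retrieval volume alone.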

Circularity Check

0 steps flagged

No circularity: empirical evaluation on held-out benchmarks with separately trained policy

Full rationale

The paper presents a framework of three innovations (step-level uncertainty detector, retrieval intervention policy, efficiency-optimized integration) evaluated via experiments on MuSiQue, HotpotQA, and 2WikiMultiHopQA. Reported gains (10.1% F1, 47% fewer calls) are test-set metrics from held-out data; the policy is trained separately from evaluation. No equations, derivations, or first-principles claims appear that reduce by construction to inputs, fitted parameters renamed as predictions, or load-bearing self-citations. Results are externally falsifiable against standard RAG and IRCoT baselines, satisfying self-contained empirical standards.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the framework relies on standard supervised or reinforcement learning to train the intervention policy and uncertainty detector; no explicit free parameters, mathematical axioms, or new physical entities are stated. The three innovations are presented as engineering contributions rather than new theoretical primitives.

pith-pipeline@v0.9.0 · 5614 in / 1333 out tokens · 104083 ms · 2026-05-07T13:12:50.135684+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 23 canonical work pages · 2 internal anchors

  1. [1] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In The Twelfth International Conference on Learning Representations (ICLR 2024). OpenReview.net.

  2. [2] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2022. Improving Language Models by Retrieving from Trillions of Tokens. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022).

  3. [3] Shahul ES, Jithin James, Luis Espinosa Anke, and Steven Schockaert. 2024. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - System Demonstrations, St. Julians, Malta, March 17-22, 2024. Association for Computational Linguistics.

  4. [4] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997. https://arxiv.org/abs/2312.10997

  5. [5] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, et al. 2025. DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning. Nature 645 (2025), 633–638. doi:10.1038/s41586-025-09422-z

  6. [6] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning (ICML'20). JMLR.org, Article 368, 10 pages.

  7. [8] Helia Hashemi, Victor Rühle, and Saravan Rajmohan. 2026. Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth. In Proceedings of the ACM Web Conference 2026 (WWW 2026), Dubai, United Arab Emirates. ACM.

  8. [9] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020), Barcelona, Spain (Online). International Committee on Computational Linguistics.

  9. [10] Shayekh Bin Islam, Md. Asib Rahman, K. S. M. Tozammel Hossain, Enamul Hoque, Shafiq Joty, and Md. Rizwan Parvez. 2024. Open-RAG: Enhanced Retrieval Augmented Reasoning with Open-Source Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA. Association for Computational Linguistics.

  10. [11] Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park. 2024. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico. Association for Computational Linguistics.

  11. [13] Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active Retrieval Augmented Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore. Association for Computational Linguistics.

  12. [14] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. In Second Conference on Language Modeling. https://openreview.net/forum?id=Rwhi91ideu

  13. [15] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online. Association for Computational Linguistics.

  14. [16] Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), Virtual Event, China. ACM.

  15. [17] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. In The Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda. OpenReview.net.

  16. [18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP 2023), Koblenz, Germany. ACM.

  17. [19] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

  18. [20] Shuo Li, Sangdon Park, Insup Lee, and Osbert Bastani. 2024. TRAQ: Trustworthy Retrieval Augmented Question Answering via Conformal Prediction. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico. Association for Computational Linguistics.

  19. [21] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2024. Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models. Trans. Mach. Learn. Res. 2024 (2024).

  20. [22] Viktor Moskvoretskii, Maria Marina, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Irina Nikishina, and Alexander Panchenko. 2025. Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Association for Computational Linguistics.

  21. [23] OpenAI. 2024. OpenAI o1 System Card. arXiv preprint arXiv:2412.16720. https://arxiv.org/abs/2412.16720

  22. [24] Qwen Team. 2024. QwQ: Reflect Deeply on the Boundaries of the Unknown. Blog post. https://qwenlm.github.io/blog/qwq-32b-preview/

  23. [25] Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. 2022. PLAID: An Efficient Engine for Late Interaction Retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM 2022), Atlanta, GA, USA. ACM, 1747–1756. doi:10.1145/3511808.3557325

  24. [26] Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2022), Seattle, WA, United States. Association for Computational Linguistics.

  25. [27] Jan Luca Scheerer, Matei Zaharia, Christopher Potts, Gustavo Alonso, and Omar Khattab. 2025. WARP: An Efficient Engine for Multi-Vector Retrieval. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025), Padua, Italy. ACM.

  26. [28] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2024. REPLUG: Retrieval-Augmented Black-Box Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024. Association for Computational Linguistics.

  27. [29] Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. 2024. DRAGIN: Dynamic Retrieval Augmented Generation based on the Real-time Information Needs of Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand. Association for Computational Linguistics.

  28. [30] Zhongxiang Sun, Qipeng Wang, Weijie Yu, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Yang Song, and Han Li. 2025. ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025). ACM.

  29. [31] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. 2023. Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). Association for Computational Linguistics.

  30. [33] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop Questions via Single-hop Question Composition. Trans. Assoc. Comput. Linguistics 10 (2022), 539–554. doi:10.1162/TACL_A_00475

  31. [34] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada. Association for Computational Linguistics.

  32. [36] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda. OpenReview.net.

  33. [37] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA.

  34. [38] Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 8 (1992), 229–256. doi:10.1007/BF00992696

  35. [39] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, Austria. OpenReview.net.

  36. [40] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium. Association for Computational Linguistics.

  37. [41] Qingcheng Zeng, Weihao Xuan, Leyang Cui, and Rob Voigt. 2025. Thinking Out Loud: Do Reasoning Models Know When They're Right?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), Suzhou, China. Association for Computational Linguistics.