A Reproducibility Study of Metacognitive Retrieval-Augmented Generation
Pith reviewed 2026-05-10 01:18 UTC · model grok-4.3 · Paper: Gabriel Iturra Bocaz and Petra Galuščáková, SIGIR '26, July 20–24, 2026, Melbourne, VIC, Australia
The pith
MetaRAG partially reproduces with relative gains over standard RAG but lower absolute scores, and improves further with reranking while proving more robust than SIM-RAG.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We reproduce MetaRAG following its original experimental setup and extend it in two directions: by evaluating the effect of PointWise and ListWise rerankers, and by comparing with SIM-RAG, which employs a lightweight critic model to stop retrieval. Our results confirm MetaRAG's relative improvements over standard RAG and reasoning-based baselines, but also reveal lower absolute scores than reported, reflecting challenges with closed-source LLM updates, missing implementation details, and unreleased prompts. We show that MetaRAG is partially reproduced, gains substantially from reranking, and is more robust than SIM-RAG when extended with additional retrieval features.
What carries the argument
The metacognitive critic inside MetaRAG that lets the LLM critique and refine its own reasoning to decide when enough information has been retrieved.
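The monitor–critique–stop loop described above can be sketched in a few lines. This is an illustrative skeleton of the general pattern, not MetaRAG's actual implementation: the `retrieve`, `draft_answer`, and `critique` callables are hypothetical placeholders for the retriever, the answering LLM, and the metacognitive critic.

```python
# Illustrative sketch of a metacognitive retrieval loop: after each
# retrieval round, a critic judges the draft answer and either stops
# or proposes a refined query. All callables are placeholders.
from typing import Callable, List


def metacognitive_rag(
    question: str,
    retrieve: Callable[[str], List[str]],            # query -> passages
    draft_answer: Callable[[str, List[str]], str],   # (question, evidence) -> answer
    critique: Callable[[str, str, List[str]], str],  # "sufficient" or a refined query
    max_rounds: int = 3,
) -> str:
    query = question
    evidence: List[str] = []
    answer = ""
    for _ in range(max_rounds):
        evidence += retrieve(query)                   # monitoring: gather evidence
        answer = draft_answer(question, evidence)
        verdict = critique(question, answer, evidence)  # evaluation step
        if verdict == "sufficient":                   # critic decides enough is known
            break
        query = verdict                               # planning: refined follow-up query
    return answer
```

The stopping decision lives entirely in `critique`, which is what distinguishes this family from fixed-round multi-retrieval pipelines.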
If this is right
- MetaRAG performance rises substantially when PointWise or ListWise rerankers are added to the retrieval pipeline.
- MetaRAG maintains its advantages over SIM-RAG even after both systems receive the same additional retrieval features.
- Relative ordering of methods stays stable despite lower absolute numbers, so comparisons between RAG variants remain informative.
- Releasing prompts and exact implementation details would be needed to close the absolute-performance gap in future reproductions.
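The PointWise/ListWise distinction in the first implication can be made concrete with a toy sketch. The scoring logic here (term overlap, a joint index permutation) is a deliberately simple stand-in for the actual reranker models used in the study; only the interface difference matters: pointwise scores each passage independently, listwise orders the whole candidate list at once.

```python
# Toy sketch of the two reranker families evaluated in the study.
# The scoring functions are illustrative stand-ins, not real models.
from typing import List


def pointwise_rerank(query: str, passages: List[str]) -> List[str]:
    """Score each (query, passage) pair independently, then sort."""
    q_terms = set(query.lower().split())

    def score(p: str) -> float:
        return len(q_terms & set(p.lower().split())) / max(len(q_terms), 1)

    return sorted(passages, key=score, reverse=True)


def listwise_rerank(query: str, passages: List[str]) -> List[str]:
    """Emit one permutation over all candidates jointly (a real listwise
    reranker, e.g. an LLM shown the full list, does this in one pass)."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        range(len(passages)),
        key=lambda i: (len(q_terms & set(passages[i].lower().split())), -i),
        reverse=True,
    )
    return [passages[i] for i in ranked]
```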
Where Pith is reading between the lines
- Reranking steps could be treated as a default add-on for any metacognitive or critic-based RAG system rather than an optional extension.
- Using open-weight models instead of closed-source ones might reduce the reproducibility drop seen here and make future studies easier to align.
- The greater robustness of MetaRAG suggests that metacognitive stopping rules are less sensitive to small changes in retrieval pipelines than lighter critic models.
- Exact reproduction of closed-source LLM results may require freezing model snapshots or reporting version hashes as standard practice.
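One minimal way to implement the last point is to serialize the full experimental configuration deterministically and report a content hash alongside every score. The config fields below (model snapshot name, stop prompt, retriever settings) are invented for illustration, not the paper's actual setup.

```python
# Sketch of "version hashes as standard practice": hash a canonical
# serialization of the run configuration so reproductions can verify
# they are comparing against the same setup. Fields are illustrative.
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    # sort_keys yields a canonical serialization, so the same config
    # always produces the same hash across runs and machines
    canonical = json.dumps(config, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


run_config = {
    "model": "gpt-4-0613",  # pin an exact snapshot, not a moving alias
    "stop_prompt": "Is the evidence sufficient? Answer yes/no.",
    "retriever": {"k": 10, "index": "wiki-2019"},
}
print(config_fingerprint(run_config))
```

Any change to the model alias, prompt text, or retrieval hyperparameters changes the fingerprint, which is exactly the drift this study could not rule out.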
Load-bearing premise
That differences in absolute performance can be attributed primarily to closed-source LLM updates and missing prompts rather than to unstated differences in experimental setup, data splits, or evaluation metrics between the reproduction and the original study.
What would settle it
Running the original MetaRAG code with the exact unreleased prompts, identical data splits, and the precise model versions available at the time of the first study would show whether the reported absolute scores reappear or whether other setup factors explain the gap.
Figures
Original abstract
Recently, Retrieval Augmented Generation (RAG) has shifted focus to multi-retrieval approaches to tackle complex tasks such as multi-hop question answering. However, these systems struggle to decide when to stop searching once enough information has been gathered. To address this, Zhou et al. [59] introduced Metacognitive Retrieval Augmented Generation (MetaRAG), a framework inspired by metacognition that enables Large Language Models to critique and refine their reasoning. In this reproducibility paper, we reproduce MetaRAG following its original experimental setup and extend it in two directions: (i) by evaluating the effect of PointWise and ListWise rerankers, and (ii) by comparing with SIM-RAG, which employs a lightweight critic model to stop retrieval. Our results confirm MetaRAG's relative improvements over standard RAG and reasoning-based baselines, but also reveal lower absolute scores than reported, reflecting challenges with closed-source LLM updates, missing implementation details, and unreleased prompts. We show that MetaRAG is partially reproduced, gains substantially from reranking, and is more robust than SIM-RAG when extended with additional retrieval features.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This reproducibility study of Metacognitive Retrieval-Augmented Generation (MetaRAG) follows the original experimental setup for multi-hop QA, confirming relative performance gains over standard RAG and reasoning baselines while reporting lower absolute scores. The authors attribute the absolute-score drop primarily to closed-source LLM updates, missing implementation details, and unreleased prompts. They further extend the work by testing PointWise and ListWise rerankers and comparing robustness against SIM-RAG under additional retrieval features, concluding that MetaRAG is partially reproduced, benefits substantially from reranking, and is more robust than SIM-RAG.
Significance. If the relative improvements and robustness findings hold after controlling for setup variables, the work usefully documents the practical difficulties of reproducing closed-source LLM-based IR systems and shows that simple reranking extensions can yield substantial gains. It also provides a direct comparison with SIM-RAG that clarifies when metacognitive stopping is advantageous, thereby contributing concrete evidence on reproducibility challenges and incremental improvements in the RAG literature.
Major comments (2)
- [Results and Discussion] The central claim that lower absolute scores are attributable primarily to closed-source LLM updates, missing prompts, and implementation gaps (rather than unverified differences in data splits, evaluation metrics, or baseline configurations) is load-bearing for the interpretation of partial reproducibility. No explicit verification, ablation, or side-by-side comparison of these other variables is reported, leaving the causal attribution untested.
- [Extension Experiments] The statement that MetaRAG 'gains substantially from reranking' and 'is more robust than SIM-RAG' when extended with additional retrieval features requires clearer isolation of the reranker contribution versus the metacognitive component. Without separate ablations of the rerankers on the original baselines or quantitative robustness metrics (e.g., variance across feature additions), the comparative claim is difficult to evaluate.
Minor comments (2)
- [Introduction] The criteria used to label the reproduction as 'partial' are not explicitly defined; a short paragraph or table listing which original results were matched within a stated tolerance would improve clarity.
- [Experimental Setup] Exact prompts, model versions, and retrieval hyperparameters used in the reproduction are not released, which limits independent verification even though the paper transparently reports the resulting challenges.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our reproducibility study of MetaRAG. We address each major comment below with honest responses based on the manuscript's content and scope, and indicate planned revisions where appropriate.
Point-by-point responses
Referee: [Results and Discussion] The central claim that lower absolute scores are attributable primarily to closed-source LLM updates, missing prompts, and implementation gaps (rather than unverified differences in data splits, evaluation metrics, or baseline configurations) is load-bearing for the interpretation of partial reproducibility. No explicit verification, ablation, or side-by-side comparison of these other variables is reported, leaving the causal attribution untested.
Authors: We followed the original paper's reported experimental protocol as closely as possible, using the same multi-hop QA datasets (HotpotQA, 2WikiMultihopQA), evaluation metrics (EM, F1), and baseline setups described therein. The lower absolute scores persisted despite these efforts, leading to our attribution to LLM updates, unreleased prompts, and missing implementation details. We agree that explicit ablations isolating every possible variable would strengthen the claim; however, the absence of original prompts and full code limits exhaustive verification. We will revise the Results and Discussion section to add a limitations paragraph clarifying this and noting that relative gains over baselines remain consistent, supporting the partial reproducibility conclusion. revision: partial
Referee: [Extension Experiments] The statement that MetaRAG 'gains substantially from reranking' and 'is more robust than SIM-RAG' when extended with additional retrieval features requires clearer isolation of the reranker contribution versus the metacognitive component. Without separate ablations of the rerankers on the original baselines or quantitative robustness metrics (e.g., variance across feature additions), the comparative claim is difficult to evaluate.
Authors: Our extension experiments applied PointWise and ListWise rerankers within the MetaRAG framework and observed substantial gains, with comparisons to SIM-RAG under extended features showing greater robustness for MetaRAG. To address the need for clearer isolation, we will add ablations applying the same rerankers to standard RAG and reasoning baselines, plus quantitative metrics such as performance variance across feature configurations. These will be incorporated into the Extension Experiments section to better separate reranker effects from the metacognitive stopping mechanism. revision: yes
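The robustness metric the authors promise here, performance variance across feature configurations, is simple to make concrete. The sketch below uses a simplified Exact Match (lowercase/strip comparison, not the official evaluation script) and hypothetical scores invented purely for illustration.

```python
# Sketch of the proposed robustness metric: score a system under
# several retrieval-feature configurations and report the variance of
# its scores; lower variance = more robust. EM here is a simplified
# string comparison, not the official QA evaluation script.
from statistics import pvariance
from typing import Dict


def exact_match(pred: str, gold: str) -> int:
    return int(pred.strip().lower() == gold.strip().lower())


def robustness(scores_by_config: Dict[str, float]) -> float:
    # population variance of per-configuration scores
    return pvariance(list(scores_by_config.values()))


# hypothetical EM scores for one system under three configurations
metarag = {"base": 0.42, "+pointwise": 0.48, "+listwise": 0.47}
print(robustness(metarag))
```

Reporting this single number per system would let readers compare MetaRAG and SIM-RAG stability directly instead of inferring it from prose.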
Circularity Check
No circularity: purely empirical reproducibility study with no derivations or self-referential structure
Full rationale
The paper reproduces MetaRAG following the original setup, reports empirical results showing relative improvements but lower absolute scores, attributes differences to LLM updates and missing details, and extends the evaluation with rerankers and SIM-RAG comparison. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. Claims rest on direct experimental comparisons and observed outcomes, which are self-contained and externally falsifiable via reproduction. The interpretive attribution of score gaps is an assumption about experimental variables but does not reduce any derivation to its inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.
- [2] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. 2024.
- [3]
- [4] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu.
- [5] BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216 [cs.CL].
- [6] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. 2009. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 758–759.
- [7] Shaima Ahmad Freja, Ferhat Ozgur Catak, Betul Yurdem, and Chunming Rong.
- [8]
- [9] Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Naik, Pengshan Cai, and Alfio Gliozzo. 2022. Re2G: Retrieve, Rerank, Generate. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics.
- [10] Yasushi Gotoh. 2016. Development of Critical Thinking with Metacognitive Regulation. International Association for Development of the Information Society.
- [11] Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. 2021. Efficient Nearest Neighbor Language Models. In Conference on Empirical Methods in Natural Language Processing.
- [12] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Proceedings of the 28th International Conference on Computational Linguistics. 6609–6625.
- [13] Gautier Izacard and Edouard Grave. 2021. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In EACL 2021, 16th Conference of the European Chapter of the Association for Computational Linguistics. 874–880.
- [14] Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park.
- [15]
- [16] Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active Retrieval Augmented Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 7969–7992.
- [17] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv preprint arXiv:2503.09516.
- [18] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv preprint arXiv:1705.03551.
- [19] Chris Kamphuis, Arjen P. de Vries, Leonid Boytsov, and Jimmy Lin. 2020. Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants. In European Conference on Information Retrieval. Springer, 28–34.
- [20] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In EMNLP (1). 6769–6781.
- [21] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. [n.d.]. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations.
- [22] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. [n.d.]. Decomposed Prompting: A Modular Approach for Solving Complex Tasks. In The Eleventh International Conference on Learning Representations.
- [23] Emily R. Lai. 2011. Metacognition: A Literature Review.
- [24] Md Tahmid Rahman Laskar, Sawsan Alqahtani, M. Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, et al. 2024. A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations. arXiv preprint arXiv:2407.04069.
- [25] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33, 9459–9474.
- [26]
- [27]
- [28] Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards General Text Embeddings with Multi-Stage Contrastive Learning. arXiv preprint arXiv:2308.03281.
- [29] Vaibhav Mavi, Anubhav Jangra, Adam Jatowt, et al. 2024. Multi-hop Question Answering. Foundations and Trends® in Information Retrieval 17, 5, 457–586.
- [30]
- [31] T. O. Nelson and L. Narens. 1990. Metamemory: A Theoretical Framework and Some New Findings. The Psychology of Learning and Motivation, Vol. 26.
- [32] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human-Generated Machine Reading Comprehension Dataset.
- [33]
- [34]
- [35] Ella Rabinovich, Samuel Ackerman, Orna Raz, Eitan Farchi, and Ateret Anaby Tavor. 2023. Predicting Question-Answering Performance of Large Language Models through Semantic Consistency. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM). 138–154.
- [36] Amirhossein Razavi, Mina Soltangheis, Negar Arabzadeh, Sara Salamat, Morteza Zihayat, and Ebrahim Bagheri. 2025. Benchmarking Prompt Sensitivity in Large Language Models. In European Conference on Information Retrieval. Springer, 303–313.
- [37] Stephen Robertson, Hugo Zaragoza, et al. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval 3, 4, 333–389.
- [38] Gregory Schraw and David Moshman. 1995. Metacognitive Theories. Educational Psychology Review 7, 4, 351–371.
- [39] Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, et al.
- [40] The Prompt Report: A Systematic Survey of Prompt Engineering Techniques. 2024. arXiv preprint arXiv:2406.06608.
- [41] Sahel Sharifymoghaddam, Ronak Pradeep, Andre Slavescu, Ryan Nguyen, Andrew Xu, Zijian Chen, Yilin Zhang, Yidi Chen, Jasper Xian, and Jimmy Lin. 2025. RankLLM: A Python Package for Reranking with LLMs. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3681–3690.
- [42] Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366.
- [43]
- [44]
- [46]
- [47] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal.
- [48] Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 10014–10037.
- [49] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533.
- [50] Xuezhi Wang, Jason Wei, Dale Schuurmans, et al. 2022. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In International Conference on Learning Representations (ICLR).
- [51] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35, 24824–24837.
- [52] Diji Yang, Linda Zeng, Jinmeng Rao, and Yi Zhang. 2025. Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1305–1315.
- [53] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. arXiv preprint arXiv:1809.09600.
- [54] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
- [55]
- [56] Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. 2024. RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs. Advances in Neural Information Processing Systems 37, 121156–121184.
- [57] Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, et al. 2024. mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. 1393–1412.
- [58]
- [59] Yujia Zhou, Zheng Liu, Jiajie Jin, Jian-Yun Nie, and Zhicheng Dou. 2024. Metacognitive Retrieval-Augmented Large Language Models. In Proceedings of the ACM Web Conference 2024. 1453–1463.