A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue

Ziyi Liu

arXiv: 2509.17766 · v2 · submitted 2025-09-22 · cs.CL · cs.AI

Pith reviewed 2026-05-18 14:54 UTC · model grok-4.3

keywords: multi-turn dialogue · prompt engineering · state reconstruction · history remind · large language models · multi-hop QA · information filtering · efficiency

The pith

A training-free prompting strategy with state reconstruction and history remind mechanisms lets large language models retain key details across long multi-turn dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a prompting method that periodically reconstructs the current conversation state and reminds the model of important prior details to combat information loss in extended exchanges. This approach requires no additional training and targets the common problems of forgetting and high resource use that arise when language models process many turns of dialogue. A sympathetic reader would care because the method delivers measurable gains on multi-hop question answering tasks while also lowering the time and tokens needed for each response, making sustained interactions more practical.

Core claim

The State-Update Multi-turn Dialogue Strategy incorporates State Reconstruction and History Remind mechanisms to manage dialogue history, enabling large language models to maintain accurate information filtering and achieve higher downstream performance on multi-hop QA tasks without model training.

What carries the argument

State Reconstruction and History Remind mechanisms that update the current dialogue state and recall relevant history to prevent information forgetting.
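
To make the mechanism concrete, here is a minimal sketch of how a state-update prompt could differ from linear history appending. The paper's actual prompt wording is not reproduced on this page, so every name below (`DialogueState`, `linear_prompt`, `state_update_prompt`, `apply_model_output`) and the prompt text itself are illustrative assumptions. So is the reading that the retained-facts block plays the role of History Remind and the per-turn rebuild plays the role of State Reconstruction.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """Hypothetical container for what the prompt carries between turns."""
    question: str
    key_facts: list[str] = field(default_factory=list)  # facts kept so far


def linear_prompt(question: str, turns: list[str]) -> str:
    """Baseline: append every past user turn verbatim."""
    history = "\n".join(f"Turn {i + 1}: {t}" for i, t in enumerate(turns))
    return f"Question: {question}\n{history}\nAnswer:"


def state_update_prompt(state: DialogueState, new_turn: str) -> str:
    """Rebuild the prompt from the compact state plus only the newest turn."""
    reminder = "\n".join(f"- {fact}" for fact in state.key_facts) or "- (none yet)"
    return (
        f"Question: {state.question}\n"
        f"Key facts retained from earlier turns:\n{reminder}\n"
        f"New information:\n{new_turn}\n"
        "Update the list of key facts, keeping only what helps answer the question."
    )


def apply_model_output(state: DialogueState, kept_facts: list[str]) -> DialogueState:
    """Replace the stored facts with the model's filtered list (assumed format)."""
    return DialogueState(question=state.question, key_facts=list(kept_facts))
```

Under this sketch the prompt size depends on the number of retained facts rather than on the number of past turns, which is one plausible reading of where the reported token savings come from.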

If this is right

Core information filtering score rises by 32.6 percent on HotpotQA.
Downstream QA score increases by 14.1 percent on HotpotQA.
Inference time drops by 73.1 percent.
Token consumption falls by 59.4 percent.
Similar gains appear across multiple multi-hop QA datasets.
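
Read as relative reductions against the baseline (an assumption, since this page does not state the baseline values), the two efficiency figures translate into multiplicative factors. The helper names below are illustrative.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the baseline cost left after a relative reduction."""
    return 1.0 - reduction_pct / 100.0


def improvement_factor(reduction_pct: float) -> float:
    """How many times cheaper the method is than the baseline."""
    return 1.0 / remaining_fraction(reduction_pct)


time_factor = improvement_factor(73.1)   # inference time reduction
token_factor = improvement_factor(59.4)  # token consumption reduction
print(f"{time_factor:.2f}x faster, {token_factor:.2f}x fewer tokens")
# prints "3.72x faster, 2.46x fewer tokens"
```

The two score improvements (32.6 and 14.1 percent) are left out on purpose: the page does not say whether they are relative gains or absolute point differences.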

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested in open-domain chat or agent planning tasks that involve many turns but lack explicit QA structure.
It may pair with retrieval methods to extend usable context length beyond what prompting alone can handle.
The efficiency savings suggest prompt-based state management can reduce reliance on longer context windows or model fine-tuning.

Load-bearing premise

The observed gains stem primarily from the state reconstruction and history remind components rather than from other details of the prompting setup or the chosen datasets.

What would settle it

Re-running the HotpotQA experiments with a standard multi-turn prompt that omits the state reconstruction and history remind steps and checking whether the 32.6 percent filtering improvement, 14.1 percent QA gain, and efficiency reductions still appear.

Original abstract

Large Language Models (LLMs) struggle with information forgetting and inefficiency in long-horizon, multi-turn dialogues. To address this, we propose a training-free prompt engineering method, the State-Update Multi-turn Dialogue Strategy. It utilizes "State Reconstruction" and "History Remind" mechanisms to effectively manage dialogue history. Our strategy shows strong performance across multiple multi-hop QA datasets. For instance, on the HotpotQA dataset, it improves the core information filtering score by 32.6%, leading to a 14.1% increase in the downstream QA score, while also reducing inference time by 73.1% and token consumption by 59.4%. Ablation studies confirm the pivotal roles of both components. Our work offers an effective solution for optimizing LLMs in long-range interactions, providing new insights for developing more robust Agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a training-free State-Update Multi-turn Dialogue Strategy for LLMs that employs State Reconstruction and History Remind mechanisms to manage dialogue history, reduce forgetting, and improve efficiency in long-horizon interactions. It evaluates the approach on multi-hop QA datasets including HotpotQA, reporting a 32.6% gain in core information filtering score, a 14.1% improvement in downstream QA score, a 73.1% reduction in inference time, and 59.4% lower token consumption, with ablations attributing the gains to the two proposed components.

Significance. If the performance gains prove robust and generalizable beyond the specific dialogue construction protocol, the method would offer a lightweight, prompt-only technique for scaling LLM agents to longer contexts without retraining or external memory modules. The efficiency metrics are particularly relevant for practical deployment.

major comments (2)

[§4] §4 (Experimental Setup): The manuscript must explicitly describe the dialogue simulation protocol used to convert multi-hop QA instances into multi-turn conversations. If user turns are generated by sequentially partitioning gold supporting sentences in fixed order, the State Reconstruction and History Remind mechanisms receive perfectly incremental, non-overlapping facts; any structured state update will then outperform raw history concatenation simply by avoiding context dilution. This setup detail is load-bearing for the causal claim that the 32.6% filtering and 14.1% QA gains on HotpotQA are attributable to the proposed mechanisms rather than the input construction itself.
[§4] §4 and Table 2: The reported improvements lack statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) across multiple runs or random seeds. Without these, it is impossible to determine whether the 14.1% QA gain exceeds variance attributable to LLM sampling or prompt sensitivity.

minor comments (2)

[Abstract] Clarify the exact definition and computation of the 'core information filtering score' used in the HotpotQA results; the abstract presents it as the primary metric but does not define it in the visible text.
Add a limitations section discussing applicability to non-QA multi-turn dialogues (e.g., open-ended chit-chat or task-oriented conversations) where supporting facts are not pre-partitioned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, clarifying our experimental design and committing to revisions that strengthen the presentation of results without altering the core claims.

Referee: [§4] §4 (Experimental Setup): The manuscript must explicitly describe the dialogue simulation protocol used to convert multi-hop QA instances into multi-turn conversations. If user turns are generated by sequentially partitioning gold supporting sentences in fixed order, the State Reconstruction and History Remind mechanisms receive perfectly incremental, non-overlapping facts; any structured state update will then outperform raw history concatenation simply by avoiding context dilution. This setup detail is load-bearing for the causal claim that the 32.6% filtering and 14.1% QA gains on HotpotQA are attributable to the proposed mechanisms rather than the input construction itself.

Authors: We agree that an explicit description of the dialogue construction protocol is necessary for reproducibility and for readers to assess the source of the reported gains. In the revised manuscript we will insert a dedicated paragraph in §4 that fully specifies how multi-hop QA instances are converted into multi-turn dialogues, including the sequential partitioning of gold supporting sentences into user turns. We emphasize that all baselines, including raw history concatenation, operate under exactly the same dialogue-construction protocol; therefore the performance differences cannot be attributed solely to the input format. The ablation results further isolate the contribution of each proposed mechanism: removing State Reconstruction or History Remind produces clear drops even when the underlying dialogue turns remain unchanged.
revision: yes
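
The construction protocol at issue can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `partition_into_turns` splits an ordered list of sentences into contiguous user turns, and `shuffled_partition` is a hypothetical control that randomizes sentence order before splitting. Comparing the two would show how much of the gain depends on facts arriving in a fixed sequence.

```python
import random


def partition_into_turns(sentences: list[str], num_turns: int) -> list[list[str]]:
    """Split sentences into `num_turns` contiguous chunks, preserving order."""
    if num_turns <= 0:
        raise ValueError("num_turns must be positive")
    base, extra = divmod(len(sentences), num_turns)
    turns, start = [], 0
    for i in range(num_turns):
        # The first `extra` turns receive one additional sentence.
        size = base + (1 if i < extra else 0)
        turns.append(sentences[start:start + size])
        start += size
    return turns


def shuffled_partition(sentences: list[str], num_turns: int, seed: int) -> list[list[str]]:
    """Hypothetical control: shuffle sentence order, then partition."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return partition_into_turns(shuffled, num_turns)
```

If the advantage over raw history concatenation persists under the shuffled control, the input-construction explanation raised by the referee becomes less likely.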
Referee: [§4] §4 and Table 2: The reported improvements lack statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) across multiple runs or random seeds. Without these, it is impossible to determine whether the 14.1% QA gain exceeds variance attributable to LLM sampling or prompt sensitivity.

Authors: We recognize that statistical significance testing is important for establishing robustness against LLM sampling noise and prompt sensitivity. Our original experiments used single runs per configuration because of the high inference cost of the evaluated models. In the revision we will rerun the key HotpotQA experiments with at least three random seeds, report mean and standard deviation, and add bootstrap confidence intervals for the core metrics in Table 2. These additions will allow readers to evaluate whether the observed 14.1% QA improvement exceeds typical variance.
revision: yes
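
The committed analysis could look like the sketch below, which uses only the Python standard library. The function names, the resample count, and the percentile method are assumptions; inputs are per-seed scores for `mean_and_std` and per-example scores for `paired_bootstrap_ci`, and no values from the paper are used.

```python
import random
import statistics


def mean_and_std(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs >= 2 scores)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_bootstrap_ci(
    baseline: list[float],
    method: list[float],
    num_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean per-example difference (method - baseline)."""
    if len(baseline) != len(method) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [m - b for b, m in zip(baseline, method)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(num_resamples)
    )
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return lower, upper
```

An interval that excludes zero would suggest the improvement is larger than resampling noise over examples. It does not by itself account for variation across prompts or seeds, which is why the per-seed mean and standard deviation are reported separately.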

Circularity Check

0 steps flagged

No circularity: empirical prompting strategy with measured results on public datasets

The paper introduces a training-free prompting method using State Reconstruction and History Remind mechanisms for multi-turn dialogue management. It evaluates this on multi-hop QA datasets such as HotpotQA via direct performance measurements and ablation studies, reporting gains in filtering score, QA accuracy, inference time, and token use. No equations, fitted parameters, self-referential definitions, or derivation chains appear in the provided text; the central claims rest on external benchmark outcomes rather than quantities defined inside the paper itself. This is a standard empirical contribution with independent content from the experimental results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical prompting study with no mathematical derivations, fitted constants, or postulated physical entities; it relies on standard assumptions that LLMs can follow structured prompts and that QA datasets measure relevant capabilities.

pith-pipeline@v0.9.0 · 5664 in / 1175 out tokens · 28503 ms · 2026-05-18T14:54:08.924055+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

on the HotpotQA dataset, it improves the core information filtering score by 32.6%, leading to a 14.1% increase in the downstream QA score, while also reducing inference time by 73.1% and token consumption by 59.4%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors

[1]

A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue

INTRODUCTION Large Language Models (LLMs) have demonstrated remarkable, human-like capabilities across a vast spectrum of tasks[1][2][3], from complex reasoning to fluent text generation. Efforts to advance LLMs and address their inherent flaws, such as hallucination, have pursued several key strategies. Prompt engineering techniques, exemplified by Chai...
[2]

Current research has largely proceeded along two lines of inquiry

RELATED WORK A central strategy for improving the performance of LLMs is to equip them with high-quality context. Current research has largely proceeded along two lines of inquiry. The first centers on incorporating external knowledge, epitomized by frameworks such as Retrieval-Augmented Generation (RAG)[5][6] and Tool-using[7][8][9]. These methods provi...
[3]

This strategy replaces the conventional method of linearly appending conversation history by reconstructing the dialogue state at each turn

METHOD To address the challenges of positional bias and information degradation in traditional multi-turn dialogues, we propose a training-free prompt engineering approach named the State-Update Multi-turn Dialogue Strategy. This strategy replaces the conventional method of linearly appending conversation history by reconstructing the dialogue state at...
[4]

EXPERIMENTS 4.1. Experimental Setup Datasets. We evaluate our method on three public benchmarks: HotpotQA, 2WikiMultiHopQA, and QASC. HotpotQA and 2WikiMultiHopQA[19] are multi-hop QA datasets that require reasoning over contexts containing distractor information. For QASC[20], a non-reasoning QA dataset, we construct a similar multi-turn format by randomly sa...
[5]

CONCLUSION We introduce a novel, training-free State-Updating Multi-turn Dialogue Strategy that leverages a state reconstruction and a history-aware reminding mechanism. Our approach not only significantly enhances performance on information filtering and downstream question-answering tasks by 14.1% but also drastically reduces computational overhead, ...
[6]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, pp. 9, 2019
[7]

Language Models are Few-Shot Learners

Tom Brown et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems. 2020, vol. 33, pp. 1877–1901, Curran Associates, Inc
[8]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe, “Training language models to follow instructions with human feed...
[9]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds. 2022, vol. 35, pp. 24824–24837, Curran As...
[10]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M...
[11]

Retrieval-augmented generation for large language models: A survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang, “Retrieval-augmented generation for large language models: A survey,” 2024
[12]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom, “Toolformer: Language models can teach themselves to use tools,” 2023
[13]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023
[14]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han, “Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning,” arXiv preprint arXiv:2503.09516, 2025
[15]

LLMs get lost in multi-turn conversation

Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville, “LLMs get lost in multi-turn conversation,” 2025
[16]

Lost in the middle: How language models use long contexts

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024
[17]

HotpotQA: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning, “HotpotQA: A dataset for diverse, explainable multi-hop question answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, Ed...
[18]

Inference scaling for long-context retrieval augmented generation

Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky, “Inference scaling for long-context retrieval augmented generation,” in The Thirteenth International Conference on Learning Representations, 2025
[19]

Dense text retrieval based on pretrained language models: A survey

Wayne Xin Zhao, Jing Liu, Ruiyang Ren, and Ji-Rong Wen, “Dense text retrieval based on pretrained language models: A survey,” 2022
[20]

Active retrieval augmented generation

Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig, “Active retrieval augmented generation,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali, Eds., Singapore, Dec. 2023, pp. 7969–7992, Association for ...
[21]

Long-context LLMs meet RAG: Overcoming challenges for long inputs in RAG

Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan O Arik, “Long-context LLMs meet RAG: Overcoming challenges for long inputs in RAG,” in The Thirteenth International Conference on Learning Representations, 2025
[22]

A Neural Network Approach to Context-Sensitive Generation of Conversational Responses

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan, “A Neural Network Approach to Context-Sensitive Generation of Conversational Responses,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human ...
[23]

Llama 2: Open foundation and fine-tuned chat models

Hugo Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023
[24]

Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa, “Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps,” in Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), Dec. 2020, pp. 6609–6625, International Committee on Computational Linguistics
[25]

QASC: A dataset for question answering via sentence composition

Tushar Khot, Peter Clark, Michal Guerquin, Peter Alexander Jansen, and Ashish Sabharwal, “QASC: A dataset for question answering via sentence composition,” in AAAI, 2019
[26]

Qwen2.5 technical report

Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, ...
[27]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities

Gheorghe Comanici et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” 2025
