hub Canonical reference

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal · 2022 · cs.CL · arXiv 2212.10509

Canonical reference. 88% of citing Pith papers cite this work as background.

25 Pith papers citing it

Background 88% of classified citations

open full Pith review browse 25 citing papers arXiv PDF

abstract

Prompting-based large language models (LLMs) are surprisingly powerful at generating natural language reasoning steps or Chains-of-Thoughts (CoT) for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up-to-date within its parameters. While using the question to retrieve relevant text from an external knowledge source helps LLMs, we observe that this one-step retrieve-and-read approach is insufficient for multi-step QA. Here, \textit{what to retrieve} depends on \textit{what has already been derived}, which in turn may depend on \textit{what was previously retrieved}. To address this, we propose IRCoT, a new approach for multi-step QA that interleaves retrieval with steps (sentences) in a CoT, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Using IRCoT with GPT3 substantially improves retrieval (up to 21 points) as well as downstream QA (up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. We observe similar substantial gains in out-of-distribution (OOD) settings as well as with much smaller models such as Flan-T5-large without additional training. IRCoT reduces model hallucination, resulting in factually more accurate CoT reasoning. Code, data, and prompts are available at \url{https://github.com/stonybrooknlp/ircot}

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8

citation-polarity summary

background 7 unclear 1

representative citing papers

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

GRASP: Graph Agentic Search over Propositions for Multi-hop Question Answering

cs.MA · 2026-05-15 · unverdicted · novelty 7.0

GRASP introduces a hierarchical graph-based agentic retrieval method that achieves top accuracy on MuSiQue, 2WikiMultihopQA, and HotpotQA while using 30-50% fewer tokens than strong baselines.

Meta-Harness: End-to-End Optimization of Model Harnesses

cs.AI · 2026-03-30 · unverdicted · novelty 7.0

Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across held-out models.

AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation

cs.IR · 2026-02-10 · unverdicted · novelty 7.0

AtomicRAG replaces chunk-based and triple-based GraphRAG with atom-entity graphs that store facts as atomic units and use personalized PageRank plus relevance filtering to achieve higher retrieval accuracy and reasoning robustness on five benchmarks.

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

cs.CL · 2025-11-04 · unverdicted · novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems

cs.MA · 2025-06-05 · accept · novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

cs.LG · 2026-05-02 · unverdicted · novelty 6.0

S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.

DualView: Adaptive Local-Global Fusion for Multi-Hop Document Reranking

cs.IR · 2026-04-13 · unverdicted · novelty 6.0

DualView fuses local cross-attention and global context aggregation via adaptive gating to rerank fixed candidate sets for multi-hop QA, reporting 99.4% Top-4 Recall on MuSiQue at 4 ms latency while beating larger cross-encoders.

Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

cs.CL · 2026-04-10 · unverdicted · novelty 6.0 · 2 refs

Introduces a facet-level diagnostics framework using Facet x Chunk matrices and controlled inference modes to show that RAG hallucinations arise mainly from evidence integration failures rather than retrieval errors.

Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification

cs.CL · 2026-03-30 · unverdicted · novelty 6.0

PROClaim achieves 81.7% accuracy on Check-COVID claim verification by combining courtroom roles, progressive RAG, and multi-judge aggregation, outperforming standard multi-agent debate by 10 percentage points.

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

cs.CL · 2025-10-17 · unverdicted · novelty 6.0 · 2 refs

EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.

Question-Adaptive Graph Learning for Multi-hop Retrieval Augmented Generation

cs.LG · 2025-10-13 · unverdicted · novelty 6.0

A Multi-L KG and Quest-GNN with question-adaptive intra/inter-level message passing and synthesized pre-training data improves multi-hop RAG performance up to 33.8% on high-hop questions.

Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

cs.CL · 2025-10-04 · unverdicted · novelty 6.0

Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.

Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering

cs.CV · 2025-08-31 · unverdicted · novelty 6.0

PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.

Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation

cs.CL · 2025-05-28 · unverdicted · novelty 6.0

MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.

A-MEM: Agentic Memory for LLM Agents

cs.CL · 2025-02-17 · unverdicted · novelty 6.0

A-MEM is a dynamic memory system for LLM agents that builds and refines an interconnected network of notes with agent-driven linking and evolution, showing performance gains over prior memory methods on six models.

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

cs.CL · 2024-04-24 · unverdicted · novelty 6.0

GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.

MemGPT: Towards LLMs as Operating Systems

cs.AI · 2023-10-12 · unverdicted · novelty 6.0

MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

cs.CV · 2023-03-20 · unverdicted · novelty 6.0

MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

State Representation and Termination for Recursive Reasoning Systems

cs.AI · 2026-05-02 · unverdicted · novelty 5.0

Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping rule is informative.

Agentic Reasoning for Large Language Models

cs.AI · 2026-01-18 · unverdicted · novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

cs.CV · 2023-09-29 · conditional · novelty 4.0

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

A Reproducibility Study of Metacognitive Retrieval-Augmented Generation

cs.IR · 2026-04-21 · unverdicted · novelty 3.0

MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

cs.CL · 2025-03-27 · accept · novelty 3.0

A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.

citing papers explorer

Showing 25 of 25 citing papers.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines cs.CL · 2023-10-05 · conditional · none · ref 55 · internal anchor
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
GRASP: Graph Agentic Search over Propositions for Multi-hop Question Answering cs.MA · 2026-05-15 · unverdicted · none · ref 3 · internal anchor
GRASP introduces a hierarchical graph-based agentic retrieval method that achieves top accuracy on MuSiQue, 2WikiMultihopQA, and HotpotQA while using 30-50% fewer tokens than strong baselines.
Meta-Harness: End-to-End Optimization of Model Harnesses cs.AI · 2026-03-30 · unverdicted · none · ref 50 · internal anchor
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across held-out models.
AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation cs.IR · 2026-02-10 · unverdicted · none · ref 33 · internal anchor
AtomicRAG replaces chunk-based and triple-based GraphRAG with atom-entity graphs that store facts as atomic units and use personalized PageRank plus relevance filtering to achieve higher retrieval accuracy and reasoning robustness on five benchmarks.
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning cs.CL · 2025-11-04 · unverdicted · none · ref 26 · internal anchor
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems cs.MA · 2025-06-05 · accept · none · ref 182 · internal anchor
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data cs.LG · 2026-05-02 · unverdicted · none · ref 19 · internal anchor
S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.
DualView: Adaptive Local-Global Fusion for Multi-Hop Document Reranking cs.IR · 2026-04-13 · unverdicted · none · ref 29 · internal anchor
DualView fuses local cross-attention and global context aggregation via adaptive gating to rerank fixed candidate sets for multi-hop QA, reporting 99.4% Top-4 Recall on MuSiQue at 4 ms latency while beating larger cross-encoders.
Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG cs.CL · 2026-04-10 · unverdicted · none · ref 5 · 2 links · internal anchor
Introduces a facet-level diagnostics framework using Facet x Chunk matrices and controlled inference modes to show that RAG hallucinations arise mainly from evidence integration failures rather than retrieval errors.
Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification cs.CL · 2026-03-30 · unverdicted · none · ref 5 · internal anchor
PROClaim achieves 81.7% accuracy on Check-COVID claim verification by combining courtroom roles, progressive RAG, and multi-judge aggregation, outperforming standard multi-agent debate by 10 percentage points.
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle cs.CL · 2025-10-17 · unverdicted · none · ref 37 · 2 links · internal anchor
EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.
Question-Adaptive Graph Learning for Multi-hop Retrieval Augmented Generation cs.LG · 2025-10-13 · unverdicted · none · ref 24 · internal anchor
A Multi-L KG and Quest-GNN with question-adaptive intra/inter-level message passing and synthesized pre-training data improves multi-hop RAG performance up to 33.8% on high-hop questions.
Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models cs.CL · 2025-10-04 · unverdicted · none · ref 15 · internal anchor
Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.
Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering cs.CV · 2025-08-31 · unverdicted · none · ref 40 · internal anchor
PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.
Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation cs.CL · 2025-05-28 · unverdicted · none · ref 35 · internal anchor
MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.
A-MEM: Agentic Memory for LLM Agents cs.CL · 2025-02-17 · unverdicted · none · ref 31 · internal anchor
A-MEM is a dynamic memory system for LLM agents that builds and refines an interconnected network of notes with agent-driven linking and evolution, showing performance gains over prior memory methods on six models.
From Local to Global: A Graph RAG Approach to Query-Focused Summarization cs.CL · 2024-04-24 · unverdicted · none · ref 64 · internal anchor
GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.
MemGPT: Towards LLMs as Operating Systems cs.AI · 2023-10-12 · unverdicted · none · ref 20 · internal anchor
MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action cs.CV · 2023-03-20 · unverdicted · none · ref 26 · internal anchor
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
State Representation and Termination for Recursive Reasoning Systems cs.AI · 2026-05-02 · unverdicted · none · ref 10 · internal anchor
Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping rule is informative.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 214 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) cs.CV · 2023-09-29 · conditional · none · ref 124 · internal anchor
GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
A Reproducibility Study of Metacognitive Retrieval-Augmented Generation cs.IR · 2026-04-21 · unverdicted · none · ref 46 · internal anchor
MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.
Large Language Model Agent: A Survey on Methodology, Applications and Challenges cs.CL · 2025-03-27 · accept · none · ref 46 · internal anchor
A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
Retrieval-Augmented Generation for Large Language Models: A Survey cs.CL · 2023-12-18 · unverdicted · none · ref 61 · internal anchor
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer