hub Mixed citations

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, Akiko Aizawa · 2020 · cs.CL · arXiv 2011.01060

Mixed citation behavior. Most common role is background (50%).

22 Pith papers citing it

Background 50% of classified citations

open full Pith review browse 22 citing papers arXiv PDF

abstract

A multi-hop question answering (QA) dataset aims to test reasoning and inference skills by requiring a model to read multiple paragraphs to answer a given question. However, current datasets do not provide a complete explanation for the reasoning process from the question to the answer. Further, previous studies revealed that many examples in existing multi-hop datasets do not require multi-hop reasoning to answer a question. In this study, we present a new multi-hop QA dataset, called 2WikiMultiHopQA, which uses structured and unstructured data. In our dataset, we introduce the evidence information containing a reasoning path for multi-hop questions. The evidence information has two benefits: (i) providing a comprehensive explanation for predictions and (ii) evaluating the reasoning skills of a model. We carefully design a pipeline and a set of templates when generating a question-answer pair that guarantees the multi-hop steps and the quality of the questions. We also exploit the structured format in Wikidata and use logical rules to create questions that are natural but still require multi-hop reasoning. Through experiments, we demonstrate that our dataset is challenging for multi-hop models and it ensures that multi-hop reasoning is required.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 dataset 3 method 1

citation-polarity summary

background 4 use dataset 3 unclear 1

representative citing papers

HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads

cs.IR · 2026-04-19 · unverdicted · novelty 7.0

HeadRank lifts preference optimization into attention space via entropy-regularized head selection and distribution regularizers to sharpen discriminability for efficient listwise reranking.

Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems

cs.IR · 2026-04-01 · unverdicted · novelty 7.0

Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.

AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation

cs.IR · 2026-02-10 · unverdicted · novelty 7.0

AtomicRAG replaces chunk-based and triple-based GraphRAG with atom-entity graphs that store facts as atomic units and use personalized PageRank plus relevance filtering to achieve higher retrieval accuracy and reasoning robustness on five benchmarks.

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

cs.CL · 2025-11-04 · unverdicted · novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems

cs.MA · 2025-06-05 · accept · novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation

cs.IR · 2026-05-30 · unverdicted · novelty 6.0

MemGraphRAG uses a memory-based multi-agent system for globally consistent graph construction from fragmented corpora plus a memory-aware hierarchical retriever, claiming better benchmark performance than prior GraphRAG methods at similar cost.

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

EHRAG constructs structural hyperedges from sentence co-occurrence and semantic hyperedges from entity embedding clusters, then applies hybrid diffusion plus topic-aware PPR to retrieve top-k documents, outperforming baselines on four datasets with linear indexing cost and zero token overhead.

MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

cs.IR · 2026-04-19 · unverdicted · novelty 6.0

MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.

Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.

MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

cs.CL · 2026-03-06 · unverdicted · novelty 6.0

MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

cs.CL · 2025-11-14 · unverdicted · novelty 6.0

MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

cs.CL · 2025-10-17 · unverdicted · novelty 6.0 · 2 refs

EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.

Question-Adaptive Graph Learning for Multi-hop Retrieval Augmented Generation

cs.LG · 2025-10-13 · unverdicted · novelty 6.0

A Multi-L KG and Quest-GNN with question-adaptive intra/inter-level message passing and synthesized pre-training data improves multi-hop RAG performance up to 33.8% on high-hop questions.

Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation

cs.CL · 2025-05-28 · unverdicted · novelty 6.0

MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

cs.CL · 2025-05-07 · unverdicted · novelty 6.0 · 2 refs

ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.

Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

cs.CL · 2025-10-01 · unverdicted · novelty 5.0

ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.

Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

cs.LG · 2025-10-29 · unverdicted · novelty 4.0

GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.

Retrieval-Augmented Generation for Large Language Models: A Survey

cs.CL · 2023-12-18 · unverdicted · novelty 3.0

A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

cs.AI · 2026-04-04

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

cs.CL · 2026-02-01

citing papers explorer

Showing 11 of 11 citing papers after filters.

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning cs.CL · 2025-11-04 · unverdicted · none · ref 8 · internal anchor
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference cs.CL · 2026-05-08 · unverdicted · none · ref 19 · internal anchor
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs cs.CL · 2026-04-14 · unverdicted · none · ref 51 · internal anchor
Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens cs.CL · 2026-03-06 · unverdicted · none · ref 16 · internal anchor
MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling cs.CL · 2025-11-14 · unverdicted · none · ref 40 · internal anchor
MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle cs.CL · 2025-10-17 · unverdicted · none · ref 33 · 2 links · internal anchor
EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.
Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation cs.CL · 2025-05-28 · unverdicted · none · ref 17 · internal anchor
MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.
ZeroSearch: Incentivize the Search Capability of LLMs without Searching cs.CL · 2025-05-07 · unverdicted · none · ref 8 · 2 links · internal anchor
ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs cs.CL · 2025-10-01 · unverdicted · none · ref 21 · internal anchor
ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
Retrieval-Augmented Generation for Large Language Models: A Survey cs.CL · 2023-12-18 · unverdicted · none · ref 119 · internal anchor
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.
Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse cs.CL · 2026-02-01 · unreviewed · ref 19 · internal anchor

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer