EnterpriseDocBench shows hybrid retrieval edges out BM25 and dense embeddings in end-to-end document pipelines, with weak inter-stage correlations and a gap between 85.5% factual accuracy and 0.40 average completeness.
hub Mixed citations
Ragas: Automated Evaluation of Retrieval Augmented Generation
Mixed citation behavior. Most common role is background (50%).
abstract
We introduce Ragas (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With Ragas, we put forward a suite of metrics which can be used to evaluate these different dimensions \textit{without having to rely on ground truth human annotations}. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.
hub tools
citation-role summary
citation-polarity summary
roles
background 5representative citing papers
GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.
DotRAG reformulates graph retrieval as query-guided path reasoning with Division of Thought, reporting SOTA results on MetaQA and UltraDomain for multi-hop tasks.
StratRAG is a new benchmark dataset for multi-hop retrieval in RAG systems with noisy document pools, where hybrid retrieval reaches Recall@2 of 0.70 but bridge questions remain harder at 0.67.
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.
CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
UnWeaver disentangles documents into entities via LLM to retrieve original chunks, yielding a simpler alternative to GraphRAG that still reduces noise and preserves source fidelity.
W-RAC decouples extraction from semantic planning via structured units and LLM grouping to match traditional retrieval performance at roughly 10x lower LLM token cost.
Introduces a 93-question multimodal RAG benchmark with phrase-level recall and embedding-based hallucination metrics, finding closed-source pipelines outperform open-source ones especially on cross-modal and cross-document tasks.
GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.
CRAFT introduces a query-conditioned pipeline with dynamic keyframe selection, ASR, and a hybrid critic loop that achieves top scores on MAGMaR 2026 for grounded multi-video question answering.
RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.
IoDResearch is a private data-centric Deep Research framework that uses FAIR digital objects, atomic knowledge units, heterogeneous graph indexes, and a multi-agent system to outperform standard RAG baselines on retrieval, QA, and report generation tasks.
Proposes an intent-RAG framework that combines RAG, machine reasoning, and generative AI to interpret application intents and generate network intents, outperforming LLMs and vanilla RAG in translation tasks.
Deepchecks is a new multi-faceted evaluation framework for RAG that incorporates root cause analysis and production monitoring to assess reliability, relevance, and user satisfaction.
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
ragR provides a unified R-native workflow for constructing retrieval-augmented generation systems and evaluating them with LLM-scored RAGAS metrics.
The survey unifies LLM augmentation techniques along the single axis of structured context supplied at inference time and supplies a literature screening protocol plus deployment decision framework.
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.
A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.
citing papers explorer
-
Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI
EnterpriseDocBench shows hybrid retrieval edges out BM25 and dense embeddings in end-to-end document pipelines, with weak inter-stage correlations and a gap between 85.5% factual accuracy and 0.40 average completeness.
-
GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs
GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.
-
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
-
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
-
RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration
RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.
-
DOTRAG: Retrieval-Time Reasoning Along Paths
DotRAG reformulates graph retrieval as query-guided path reasoning with Division of Thought, reporting SOTA results on MetaQA and UltraDomain for multi-hop tasks.
-
StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems
StratRAG is a new benchmark dataset for multi-hop retrieval in RAG systems with noisy document pools, where hybrid retrieval reaches Recall@2 of 0.70 but bridge questions remain harder at 0.67.
-
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
-
Hallucination Detection via Activations of Open-Weight Proxy Analyzers
A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.
-
Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
-
UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough
UnWeaver disentangles documents into entities via LLM to retrieve original chunks, yielding a simpler alternative to GraphRAG that still reduces noise and preserves source fidelity.
-
Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems
W-RAC decouples extraction from semantic planning via structured units and LLM grouping to match traditional retrieval performance at roughly 10x lower LLM token cost.
-
FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation
Introduces a 93-question multimodal RAG benchmark with phrase-level recall and embedding-based hallucination metrics, finding closed-source pipelines outperform open-source ones especially on cross-modal and cross-document tasks.
-
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.
-
CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering
CRAFT introduces a query-conditioned pipeline with dynamic keyframe selection, ASR, and a hybrid critic loop that achieves top scores on MAGMaR 2026 for grounded multi-video question answering.
-
RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation
RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.
-
IoDResearch: Deep Research on Private Heterogeneous Data via the Internet of Data
IoDResearch is a private data-centric Deep Research framework that uses FAIR digital objects, atomic knowledge units, heterogeneous graph indexes, and a multi-agent system to outperform standard RAG baselines on retrieval, QA, and report generation tasks.
-
RAG-Enabled Intent Reasoning for Application-Network Interaction
Proposes an intent-RAG framework that combines RAG, machine reasoning, and generative AI to interpret application intents and generate network intents, outperforming LLMs and vanilla RAG in translation tasks.
-
Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)
Deepchecks is a new multi-faceted evaluation framework for RAG that incorporates root cause analysis and production monitoring to assess reliability, relevance, and user satisfaction.
-
LLM-Oriented Information Retrieval: A Denoising-First Perspective
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
-
ragR: Retrieval-Augmented Generation and RAG Assessment in R
ragR provides a unified R-native workflow for constructing retrieval-augmented generation systems and evaluating them with LLM-scored RAGAS metrics.
-
Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation
The survey unifies LLM augmentation techniques along the single axis of structured context supplied at inference time and supplies a literature screening protocol plus deployment decision framework.
-
Retrieval-Augmented Generation for Large Language Models: A Survey
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.
-
A Survey on Retrieval-Augmented Text Generation for Large Language Models
A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.