Ragas: Automated Evaluation of Retrieval Augmented Generation

Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert · 2023 · cs.CL · arXiv 2309.15217

Mixed citation behavior. Most common role is background (50%).

24 Pith papers citing it

Background 50% of classified citations

abstract

We introduce Ragas (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With Ragas, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.

citation-role summary

background 6

citation-polarity summary

background 3 · unclear 2 · support 1

representative citing papers

Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI

cs.CL · 2026-04-29 · conditional · novelty 7.0

EnterpriseDocBench shows hybrid retrieval edges out BM25 and dense embeddings in end-to-end document pipelines, with weak inter-stage correlations and a gap between 85.5% factual accuracy and 0.40 average completeness.

GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

cs.AI · 2026-04-17 · conditional · novelty 7.0

AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.

RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

DOTRAG: Retrieval-Time Reasoning Along Paths

cs.IR · 2026-04-06 · unverdicted · novelty 7.0

DotRAG reformulates graph retrieval as query-guided path reasoning with Division of Thought, reporting SOTA results on MetaQA and UltraDomain for multi-hop tasks.

StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems

cs.IR · 2026-03-06 · accept · novelty 7.0

StratRAG is a new benchmark dataset for multi-hop retrieval in RAG systems with noisy document pools, where hybrid retrieval reaches Recall@2 of 0.70 but bridge questions remain harder at 0.67.

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

cs.CL · 2024-01-27 · accept · novelty 7.0

MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

cs.IR · 2026-04-20 · unverdicted · novelty 6.0

CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough

cs.IR · 2026-02-06 · unverdicted · novelty 6.0

UnWeaver disentangles documents into entities via LLM to retrieve original chunks, yielding a simpler alternative to GraphRAG that still reduces noise and preserves source fidelity.

Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

cs.IR · 2026-01-08 · unverdicted · novelty 6.0

W-RAC decouples extraction from semantic planning via structured units and LLM grouping to match traditional retrieval performance at roughly 10x lower LLM token cost.

FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

cs.AI · 2025-10-10 · unverdicted · novelty 6.0

Introduces a 93-question multimodal RAG benchmark with phrase-level recall and embedding-based hallucination metrics, finding closed-source pipelines outperform open-source ones especially on cross-modal and cross-document tasks.

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

cs.CL · 2024-04-24 · unverdicted · novelty 6.0

GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

CRAFT introduces a query-conditioned pipeline with dynamic keyframe selection, ASR, and a hybrid critic loop that achieves top scores on MAGMaR 2026 for grounded multi-video question answering.

RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation

cs.IR · 2026-01-30 · unverdicted · novelty 5.0

RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.

IoDResearch: Deep Research on Private Heterogeneous Data via the Internet of Data

cs.IR · 2025-10-02 · unverdicted · novelty 5.0

IoDResearch is a private data-centric Deep Research framework that uses FAIR digital objects, atomic knowledge units, heterogeneous graph indexes, and a multi-agent system to outperform standard RAG baselines on retrieval, QA, and report generation tasks.

RAG-Enabled Intent Reasoning for Application-Network Interaction

cs.NI · 2025-05-14 · unverdicted · novelty 5.0

Proposes an intent-RAG framework that combines RAG, machine reasoning, and generative AI to interpret application intents and generate network intents, outperforming LLMs and vanilla RAG in translation tasks.

Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)

cs.AI · 2026-05-14 · unverdicted · novelty 4.0

Deepchecks is a new multi-faceted evaluation framework for RAG that incorporates root cause analysis and production monitoring to assess reliability, relevance, and user satisfaction.

LLM-Oriented Information Retrieval: A Denoising-First Perspective

cs.IR · 2026-05-01 · unverdicted · novelty 4.0 · 2 refs

Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.

ragR: Retrieval-Augmented Generation and RAG Assessment in R

stat.CO · 2026-04-26 · accept · novelty 4.0

ragR provides a unified R-native workflow for constructing retrieval-augmented generation systems and evaluating them with LLM-scored RAGAS metrics.

Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

cs.CL · 2026-04-03 · unverdicted · novelty 4.0

The survey unifies LLM augmentation techniques along the single axis of structured context supplied at inference time and supplies a literature screening protocol plus deployment decision framework.

Retrieval-Augmented Generation for Large Language Models: A Survey

cs.CL · 2023-12-18 · unverdicted · novelty 3.0

A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.

A Survey on Retrieval-Augmented Text Generation for Large Language Models

cs.IR · 2024-04-17 · unverdicted · novelty 2.0

A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.

