Ragas: Automated Evaluation of Retrieval Augmented Generation

Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert · 2023 · cs.CL · arXiv 2309.15217

Mixed citation behavior. Most common role is background (50%).

35 Pith papers citing it

Background 50% of classified citations

abstract

We introduce Ragas (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With Ragas, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.

citation-role summary

background 6

citation-polarity summary

background 3 · unclear 2 · support 1

representative citing papers

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces CHARM framework that detects cascading hallucinations in agentic RAG at 89.4% rate with 5.3% false positives and reduces error propagation by 82.1% on multi-hop QA benchmarks.

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

cs.SE · 2026-05-26 · conditional · novelty 7.0

LogDx-CI benchmark shows hybrid grep+tail reducers achieve top diagnosis quality at low cost, agent loops shrink quality variance across reducers, and cross-family LLM summarizers outperform same-family pairs.

Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI

cs.CL · 2026-04-29 · conditional · novelty 7.0

EnterpriseDocBench shows hybrid retrieval edges out BM25 and dense embeddings in end-to-end document pipelines, with weak inter-stage correlations and a gap between 85.5% factual accuracy and 0.40 average completeness.

GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

cs.AI · 2026-04-17 · conditional · novelty 7.0

AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.

RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

DOTRAG: Retrieval-Time Reasoning Along Paths

cs.IR · 2026-04-06 · unverdicted · novelty 7.0

DotRAG reformulates graph retrieval as query-guided path reasoning with Division of Thought, reporting SOTA results on MetaQA and UltraDomain for multi-hop tasks.

StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems

cs.IR · 2026-03-06 · accept · novelty 7.0

StratRAG is a new benchmark dataset for multi-hop retrieval in RAG systems with noisy document pools, where hybrid retrieval reaches Recall@2 of 0.70 but bridge questions remain harder at 0.67.

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

cs.CL · 2024-01-27 · accept · novelty 7.0

MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.

Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

hep-ex · 2026-06-09 · unverdicted · novelty 6.0

Agentic hybrid RAG with a new muon collider benchmark outperforms baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding.

Cross-Vendor Sola ISPM Benchmark: Evaluating Agentic AI for Federated Identity Security Reasoning

cs.CR · 2026-06-01 · unverdicted · novelty 6.0

Presents the Cross-Vendor Sola ISPM Benchmark and reports that adding relational context raises AI answer correctness by 34% and cuts exploration queries by 70% on multi-vendor identity tasks.

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

cs.CR · 2026-05-26 · unverdicted · novelty 6.0

GroundedCache reduces unsafe-served rate in RAG answer caching to 0-1.5% (vs 15-51.5% naive) via four validation gates while keeping p50 latency within 1.07x of no-cache baseline.

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

cs.CL · 2026-05-25 · conditional · novelty 6.0

For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

cs.IR · 2026-04-20 · unverdicted · novelty 6.0

CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

cs.IR · 2026-01-08 · unverdicted · novelty 6.0

W-RAC decouples extraction from semantic planning via structured units and LLM grouping to match traditional retrieval performance at roughly 10x lower LLM token cost.

FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

cs.AI · 2025-10-10 · unverdicted · novelty 6.0

Introduces a 93-question multimodal RAG benchmark with phrase-level recall and embedding-based hallucination metrics, finding closed-source pipelines outperform open-source ones especially on cross-modal and cross-document tasks.

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

cs.CL · 2024-04-24 · unverdicted · novelty 6.0

GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.

ARMOR: Adaptive Retriever Optimization for Low-Resource Telecom Question Answering

cs.IR · 2026-06-29 · unverdicted · novelty 5.0

ARMOR optimizes retrievers via joint RAG-likelihood and InfoNCE training with regularization toward the base encoder, yielding improved retrieval and QA on telecom benchmarks.

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

CRAFT introduces a query-conditioned pipeline with dynamic keyframe selection, ASR, and a hybrid critic loop that achieves top scores on MAGMaR 2026 for grounded multi-video question answering.

RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation

cs.IR · 2026-01-30 · unverdicted · novelty 5.0

RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.

IoDResearch: Deep Research on Private Heterogeneous Data via the Internet of Data

cs.IR · 2025-10-02 · unverdicted · novelty 5.0

IoDResearch is a private data-centric Deep Research framework that uses FAIR digital objects, atomic knowledge units, heterogeneous graph indexes, and a multi-agent system to outperform standard RAG baselines on retrieval, QA, and report generation tasks.

RAG-Enabled Intent Reasoning for Application-Network Interaction

cs.NI · 2025-05-14 · unverdicted · novelty 5.0

Proposes an intent-RAG framework that combines RAG, machine reasoning, and generative AI to interpret application intents and generate network intents, outperforming LLMs and vanilla RAG in translation tasks.

citing papers explorer

Showing 27 of 27 citing papers after filters.

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

cs.AI · 2026-06-03 · unverdicted · none · ref 6 · internal anchor

Introduces CHARM framework that detects cascading hallucinations in agentic RAG at 89.4% rate with 5.3% false positives and reduces error propagation by 82.1% on multi-hop QA benchmarks.

GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

cs.AI · 2026-04-25 · unverdicted · none · ref 12 · internal anchor

GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

cs.AI · 2026-04-21 · unverdicted · none · ref 21 · internal anchor

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.

RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

cs.CL · 2026-04-17 · unverdicted · none · ref 25 · internal anchor

RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

DOTRAG: Retrieval-Time Reasoning Along Paths

cs.IR · 2026-04-06 · unverdicted · none · ref 5 · internal anchor

DotRAG reformulates graph retrieval as query-guided path reasoning with Division of Thought, reporting SOTA results on MetaQA and UltraDomain for multi-hop tasks.

Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

hep-ex · 2026-06-09 · unverdicted · none · ref 21 · internal anchor

Agentic hybrid RAG with a new muon collider benchmark outperforms baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding.

Cross-Vendor Sola ISPM Benchmark: Evaluating Agentic AI for Federated Identity Security Reasoning

cs.CR · 2026-06-01 · unverdicted · none · ref 13 · internal anchor

Presents the Cross-Vendor Sola ISPM Benchmark and reports that adding relational context raises AI answer correctness by 34% and cuts exploration queries by 70% on multi-vendor identity tasks.

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

cs.CR · 2026-05-26 · unverdicted · none · ref 22 · internal anchor

GroundedCache reduces unsafe-served rate in RAG answer caching to 0-1.5% (vs 15-51.5% naive) via four validation gates while keeping p50 latency within 1.07x of no-cache baseline.

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

cs.CL · 2026-05-08 · unverdicted · none · ref 5 · internal anchor

A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

cs.IR · 2026-04-20 · unverdicted · none · ref 11 · internal anchor

CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

cs.IR · 2026-01-08 · unverdicted · none · ref 18 · internal anchor

W-RAC decouples extraction from semantic planning via structured units and LLM grouping to match traditional retrieval performance at roughly 10x lower LLM token cost.

FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

cs.AI · 2025-10-10 · unverdicted · none · ref 8 · internal anchor

Introduces a 93-question multimodal RAG benchmark with phrase-level recall and embedding-based hallucination metrics, finding closed-source pipelines outperform open-source ones especially on cross-modal and cross-document tasks.

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

cs.CL · 2024-04-24 · unverdicted · none · ref 14 · internal anchor

GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.

ARMOR: Adaptive Retriever Optimization for Low-Resource Telecom Question Answering

cs.IR · 2026-06-29 · unverdicted · none · ref 7 · internal anchor

ARMOR optimizes retrievers via joint RAG-likelihood and InfoNCE training with regularization toward the base encoder, yielding improved retrieval and QA on telecom benchmarks.

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

cs.CV · 2026-05-18 · unverdicted · none · ref 32 · internal anchor

CRAFT introduces a query-conditioned pipeline with dynamic keyframe selection, ASR, and a hybrid critic loop that achieves top scores on MAGMaR 2026 for grounded multi-video question answering.

RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation

cs.IR · 2026-01-30 · unverdicted · none · ref 10 · internal anchor

RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.

IoDResearch: Deep Research on Private Heterogeneous Data via the Internet of Data

cs.IR · 2025-10-02 · unverdicted · none · ref 19 · internal anchor

IoDResearch is a private data-centric Deep Research framework that uses FAIR digital objects, atomic knowledge units, heterogeneous graph indexes, and a multi-agent system to outperform standard RAG baselines on retrieval, QA, and report generation tasks.

RAG-Enabled Intent Reasoning for Application-Network Interaction

cs.NI · 2025-05-14 · unverdicted · none · ref 16 · internal anchor

Proposes an intent-RAG framework that combines RAG, machine reasoning, and generative AI to interpret application intents and generate network intents, outperforming LLMs and vanilla RAG in translation tasks.

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

cs.AI · 2026-06-01 · unverdicted · none · ref 5 · internal anchor

BADGER is a new enterprise evaluation framework that adds LLM-assisted SQL component extraction and a Hybrid-EX metric validated on 150 human-annotated queries to existing text-to-SQL and agentic assessment methods.

Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

cs.IR · 2026-05-29 · unverdicted · none · ref 5 · internal anchor

Introduces FD* metric for factual density in RAG and shows it alone reaches 100% top-5 saturation of Cochrane evidence on HealthFC where cosine similarity does not.

Graph-Augmented Retrieval for Cross-Entity Financial Sentiment Analysis: A Comparative Study

cs.CL · 2026-05-19 · unverdicted · none · ref 4 · internal anchor

Graph-RAG improves entity recall by 6.4% and answer relevancy by 11.7% over vector RAG on relational financial queries, with no loss in semantic similarity.

Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)

cs.AI · 2026-05-14 · unverdicted · none · ref 8 · internal anchor

Deepchecks is a new multi-faceted evaluation framework for RAG that incorporates root cause analysis and production monitoring to assess reliability, relevance, and user satisfaction.

LLM-Oriented Information Retrieval: A Denoising-First Perspective

cs.IR · 2026-05-01 · unverdicted · none · ref 40 · 2 links · internal anchor

Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.

Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

cs.CL · 2026-04-03 · unverdicted · none · ref 3 · internal anchor

The survey unifies LLM augmentation techniques along the single axis of structured context supplied at inference time and supplies a literature screening protocol plus deployment decision framework.

Retrieval-Augmented Generation for Large Language Models: A Survey

cs.CL · 2023-12-18 · unverdicted · none · ref 164 · internal anchor

A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.

Development of a Retrieval-Augmented Generation Virtual Assistant for Enhanced Information Discovery at Rubin Observatory

astro-ph.IM · 2026-07-02 · unverdicted · none · ref 23 · internal anchor

Prototype RAG virtual assistant integrates Rubin Observatory documentation using Weaviate, LangChain, and GPT for conversational semantic search.

A Survey on Retrieval-Augmented Text Generation for Large Language Models

cs.IR · 2024-04-17 · unverdicted · none · ref 33 · internal anchor

A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.
A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.
