Ragas: Automated Evaluation of Retrieval Augmented Generation

Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert · 2023 · cs.CL · arXiv 2309.15217

Mixed citation behavior. Most common role is background (50%).

35 Pith papers citing it

Background 50% of classified citations

abstract

We introduce Ragas (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With Ragas, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.

citation-role summary

background 6

citation-polarity summary

background 3 · unclear 2 · support 1

representative citing papers

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces CHARM framework that detects cascading hallucinations in agentic RAG at 89.4% rate with 5.3% false positives and reduces error propagation by 82.1% on multi-hop QA benchmarks.

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

cs.SE · 2026-05-26 · conditional · novelty 7.0

LogDx-CI benchmark shows hybrid grep+tail reducers achieve top diagnosis quality at low cost, agent loops shrink quality variance across reducers, and cross-family LLM summarizers outperform same-family pairs.

Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI

cs.CL · 2026-04-29 · conditional · novelty 7.0

EnterpriseDocBench shows hybrid retrieval edges out BM25 and dense embeddings in end-to-end document pipelines, with weak inter-stage correlations and a gap between 85.5% factual accuracy and 0.40 average completeness.

GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

cs.AI · 2026-04-17 · conditional · novelty 7.0

AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.

RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

DOTRAG: Retrieval-Time Reasoning Along Paths

cs.IR · 2026-04-06 · unverdicted · novelty 7.0

DotRAG reformulates graph retrieval as query-guided path reasoning with Division of Thought, reporting SOTA results on MetaQA and UltraDomain for multi-hop tasks.

StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems

cs.IR · 2026-03-06 · accept · novelty 7.0

StratRAG is a new benchmark dataset for multi-hop retrieval in RAG systems with noisy document pools, where hybrid retrieval reaches Recall@2 of 0.70 but bridge questions remain harder at 0.67.

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

cs.CL · 2024-01-27 · accept · novelty 7.0

MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.

Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

hep-ex · 2026-06-09 · unverdicted · novelty 6.0

Agentic hybrid RAG with a new muon collider benchmark outperforms baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding.

Cross-Vendor Sola ISPM Benchmark: Evaluating Agentic AI for Federated Identity Security Reasoning

cs.CR · 2026-06-01 · unverdicted · novelty 6.0

Presents the Cross-Vendor Sola ISPM Benchmark and reports that adding relational context raises AI answer correctness by 34% and cuts exploration queries by 70% on multi-vendor identity tasks.

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

cs.CR · 2026-05-26 · unverdicted · novelty 6.0

GroundedCache reduces unsafe-served rate in RAG answer caching to 0-1.5% (vs 15-51.5% naive) via four validation gates while keeping p50 latency within 1.07x of no-cache baseline.

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

cs.CL · 2026-05-25 · conditional · novelty 6.0

For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

cs.IR · 2026-04-20 · unverdicted · novelty 6.0

CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

cs.IR · 2026-01-08 · unverdicted · novelty 6.0

W-RAC decouples extraction from semantic planning via structured units and LLM grouping to match traditional retrieval performance at roughly 10x lower LLM token cost.

FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

cs.AI · 2025-10-10 · unverdicted · novelty 6.0

Introduces a 93-question multimodal RAG benchmark with phrase-level recall and embedding-based hallucination metrics, finding closed-source pipelines outperform open-source ones especially on cross-modal and cross-document tasks.

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

cs.CL · 2024-04-24 · unverdicted · novelty 6.0

GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.

ARMOR: Adaptive Retriever Optimization for Low-Resource Telecom Question Answering

cs.IR · 2026-06-29 · unverdicted · novelty 5.0

ARMOR optimizes retrievers via joint RAG-likelihood and InfoNCE training with regularization toward the base encoder, yielding improved retrieval and QA on telecom benchmarks.

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

CRAFT introduces a query-conditioned pipeline with dynamic keyframe selection, ASR, and a hybrid critic loop that achieves top scores on MAGMaR 2026 for grounded multi-video question answering.

RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation

cs.IR · 2026-01-30 · unverdicted · novelty 5.0

RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.

IoDResearch: Deep Research on Private Heterogeneous Data via the Internet of Data

cs.IR · 2025-10-02 · unverdicted · novelty 5.0

IoDResearch is a private data-centric Deep Research framework that uses FAIR digital objects, atomic knowledge units, heterogeneous graph indexes, and a multi-agent system to outperform standard RAG baselines on retrieval, QA, and report generation tasks.

RAG-Enabled Intent Reasoning for Application-Network Interaction

cs.NI · 2025-05-14 · unverdicted · novelty 5.0

Proposes an intent-RAG framework that combines RAG, machine reasoning, and generative AI to interpret application intents and generate network intents, outperforming LLMs and vanilla RAG in translation tasks.
