HaluEval: A large-scale hallucination evaluation benchmark for large language models

Li, J · 2023 · arXiv 2305.11747

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

HalluScore: Large Language Model Hallucination Question Answering Benchmark

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

HalluScore is a curated Arabic QA dataset with 827 questions, ground-truth evidence, and human annotations used to measure hallucination rates across 17 LLMs.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA and fact-checking datasets.

Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

Design and Report Benchmarks for Knowledge Work

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

cs.CL · 2026-04-09 · conditional · novelty 6.0

Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

cs.CL · 2026-03-16 · unverdicted · novelty 6.0

Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.

GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models

cs.CR · 2026-02-06 · unverdicted · novelty 6.0

LLMs hallucinate citations at rates from 14.23% to 94.93%, with 1.07% of papers containing invalid citations and an 80.9% increase in 2025.

When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents

cs.MA · 2026-01-07 · unverdicted · novelty 6.0

LLM agents exhibit emergent covert numerical coordination in canonical game settings under restricted or absent communication, shaping strategic outcomes.

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

cs.CL · 2025-10-08 · unverdicted · novelty 6.0

Red-Bandit adapts online to LLM failure modes by dynamically selecting among RL-trained LoRA attack-style experts via a bandit policy, reporting SOTA ASR@10 on AdvBench with lower-perplexity prompts.

Measuring short-form factuality in large language models

cs.CL · 2024-11-07 · unverdicted · novelty 6.0

SimpleQA is a new benchmark of short, single-answer factual questions collected adversarially against GPT-4 to evaluate LLM factuality and confidence calibration.

Ragas: Automated Evaluation of Retrieval Augmented Generation

cs.CL · 2023-09-26 · unverdicted · novelty 6.0

Ragas supplies reference-free metrics for measuring context relevance, faithfulness to retrieved passages, and answer quality in RAG pipelines.

Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations

cs.CL · 2026-04-06 · unverdicted · novelty 5.0

LLM hallucinations arise from task-dependent basins in latent space, with separability varying by task and geometry-aware steering reducing their probability.

HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

cs.CL · 2026-05-04 · unverdicted · novelty 4.0

HalluScan benchmark evaluates hallucination detection in LLMs, reporting NLI Verification at AUROC 0.88 and introducing HalluScore (r=0.41 with humans) plus Adaptive Detection Routing for 2x cost savings.

Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics

cs.CR · 2025-04-01 · unverdicted · novelty 3.0

A framework detects LLM anomalies including hallucinations, jailbreaks, and backdoors by forensic inspection of layer-wise hidden state patterns, reporting over 95% accuracy with minimal computational overhead.

A Survey of Hallucination in Large Foundation Models

cs.AI · 2023-09-12 · accept · novelty 3.0

A survey classifying hallucination phenomena specific to large foundation models, establishing evaluation criteria, examining mitigation strategies, and discussing future directions.

citing papers explorer

Showing 17 of 17 citing papers.

HalluScore: Large Language Model Hallucination Question Answering Benchmark · cs.CL · 2026-05-16 · unverdicted · none · ref 18
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs · cs.CV · 2026-05-07 · conditional · none · ref 17
RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration · cs.CL · 2026-04-17 · unverdicted · none · ref 28
Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS · cs.CL · 2026-04-12 · unverdicted · none · ref 4
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models · cs.CL · 2026-04-12 · unverdicted · none · ref 29
Design and Report Benchmarks for Knowledge Work · cs.AI · 2026-05-22 · unverdicted · none · ref 61
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts · cs.CL · 2026-04-09 · conditional · none · ref 50
Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI · cs.CL · 2026-03-16 · unverdicted · none · ref 9
GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models · cs.CR · 2026-02-06 · unverdicted · none · ref 28
When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents · cs.MA · 2026-01-07 · unverdicted · none · ref 24
Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts · cs.CL · 2025-10-08 · unverdicted · none · ref 11
Measuring short-form factuality in large language models · cs.CL · 2024-11-07 · unverdicted · none · ref 9
Ragas: Automated Evaluation of Retrieval Augmented Generation · cs.CL · 2023-09-26 · unverdicted · none · ref 2
Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations · cs.CL · 2026-04-06 · unverdicted · none · ref 13
HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs · cs.CL · 2026-05-04 · unverdicted · none · ref 10
Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics · cs.CR · 2025-04-01 · unverdicted · none · ref 26
A Survey of Hallucination in Large Foundation Models · cs.AI · 2023-09-12 · accept · none · ref 129
