hub

Proceedings of the 2023

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh + 2 more · 2023 · Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing · DOI 10.18653/v1/2023.emnlp-main.741

47 Pith papers cite this work, alongside 220 external citations. Polarity classification is still indexing.

47 Pith papers citing it

220 external citations · Crossref

open at publisher browse 47 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

MedHal-Loc: Are "Explainable-by-Architecture" Medical Hallucination Detectors Faithful Localizers? A Localization Benchmark

cs.CL · 2026-06-19 · unverdicted · novelty 7.0

MedHal-Loc benchmark shows KG-triple hallucination detectors localize errors no better than chance on controlled medical statements due to entity extraction limits, while NLI and consistency methods succeed above chance, and real hallucinations are mostly diffuse conclusion changes.

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

cs.CV · 2026-06-12 · unverdicted · novelty 7.0

VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

AuthorityBench shows citation presence (real or fabricated) increases LLM hallucination rates vs no-citation baseline, strongest for fabricated citations on true claims, with domain variation but negligible venue or author effects.

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

WorldReasoner supplies 345 resolved forecasting tasks built from 14,141 articles to score LM agents on outcome quality, evidence quality, and reasoning quality against time-bounded evidence and hindsight graphs.

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

PhantomBench: Benchmarking the Non-existential Threat of Language Models

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

PhantomBench is a new benchmark of 60K+ non-existent terms showing language models hallucinate at rates up to 86.7 percent even when inputs assume the concepts exist.

Evaluating Commercial AI Chatbots as News Intermediaries

cs.CL · 2026-05-21 · conditional · novelty 7.0

Commercial AI chatbots reach over 90% multiple-choice accuracy on recent news facts but lose 11-17% in free response and drop to 19-70% on subtle false-premise questions, with retrieval failures causing most errors and clear Anglophone bias.

Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.

Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

cs.HC · 2026-05-09 · accept · novelty 7.0

LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.

Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

cs.LG · 2026-05-06 · unverdicted · novelty 7.0 · 2 refs

Transpose-invariant spectral diagnostics on attention operators are orientation-blind, and a φ-G two-axis diagnostic distinguishes hallucination modes with 0.62-0.84 LC-AUROC and predicted polarity reversal.

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

cs.CL · 2026-04-28 · accept · novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models

cs.CL · 2026-03-16 · conditional · novelty 7.0

Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.

TSVer: A Benchmark for Fact Verification Against Time-Series Evidence

cs.CL · 2025-11-02 · unverdicted · novelty 7.0

TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.

Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors

cs.CL · 2026-07-01 · unverdicted · novelty 6.0

Hallucinations arise from biased latent inference paths rather than missing knowledge, demonstrated via a new diagnostic testbed TrapQA that isolates task-retrieval and key-selection biases.

Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

LoFa is a new benchmark and LFR@k metric for measuring LLM resistance to sustained logical fallacy attacks via generated question-argument pairs and debate simulations.

When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

Global calibration metrics like ECE are confounded by accuracy; the proposed ACE framework with three accuracy-controlled views shows many prior calibration advantages weaken or reverse.

The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

Introduces claim-conditioned re-scoring (SIFT) and warranted supports proportion (WSP) metric, reporting accuracy recovery up to 27.6 points and WSP calibration at AUC 0.92 on FEVER, SciFact and other benchmarks.

M\"OVE: A Holistic LLM Benchmark for the German Public Sector

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.

GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

cs.IR · 2026-06-06 · unverdicted · novelty 6.0

GIScholarBench shows LLMs exhibit consistent overconfidence across three scholarly tasks in GIS, with different manifestations in factual retrieval, citation expansion, and idea generation.

Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

cs.CL · 2026-06-05 · unverdicted · novelty 6.0

A new evaluation framework shows that even the best tested LLM only reliably adjusts response complexity in the intended direction 46% of the time across 98 scientific queries.

Boosting Self-Consistency with Ranking

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.

Proceedings of the 2023

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer