Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
hub
Proceedings of the 2023
47 Pith papers cite this work, alongside 220 external citations. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
MedHal-Loc benchmark shows KG-triple hallucination detectors localize errors no better than chance on controlled medical statements due to entity extraction limits, while NLI and consistency methods succeed above chance, and real hallucinations are mostly diffuse conclusion changes.
VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.
AuthorityBench shows citation presence (real or fabricated) increases LLM hallucination rates vs no-citation baseline, strongest for fabricated citations on true claims, with domain variation but negligible venue or author effects.
WorldReasoner supplies 345 resolved forecasting tasks built from 14,141 articles to score LM agents on outcome quality, evidence quality, and reasoning quality against time-bounded evidence and hindsight graphs.
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
PhantomBench is a new benchmark of 60K+ non-existent terms showing language models hallucinate at rates up to 86.7 percent even when inputs assume the concepts exist.
Commercial AI chatbots reach over 90% multiple-choice accuracy on recent news facts but lose 11-17% in free response and drop to 19-70% on subtle false-premise questions, with retrieval failures causing most errors and clear Anglophone bias.
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.
Transpose-invariant spectral diagnostics on attention operators are orientation-blind, and a φ-G two-axis diagnostic distinguishes hallucination modes with 0.62-0.84 LC-AUROC and predicted polarity reversal.
SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.
TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.
Hallucinations arise from biased latent inference paths rather than missing knowledge, demonstrated via a new diagnostic testbed TrapQA that isolates task-retrieval and key-selection biases.
LoFa is a new benchmark and LFR@k metric for measuring LLM resistance to sustained logical fallacy attacks via generated question-argument pairs and debate simulations.
Global calibration metrics like ECE are confounded by accuracy; the proposed ACE framework with three accuracy-controlled views shows many prior calibration advantages weaken or reverse.
Introduces claim-conditioned re-scoring (SIFT) and warranted supports proportion (WSP) metric, reporting accuracy recovery up to 27.6 points and WSP calibration at AUC 0.92 on FEVER, SciFact and other benchmarks.
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
GIScholarBench shows LLMs exhibit consistent overconfidence across three scholarly tasks in GIS, with different manifestations in factual retrieval, citation expansion, and idea generation.
A new evaluation framework shows that even the best tested LLM only reliably adjusts response complexity in the intended direction 46% of the time across 98 scientific queries.
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.
citing papers explorer
No citing papers match the current filters.