{"total":47,"items":[{"citing_arxiv_id":"2607.00447","ref_index":61,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors","primary_cat":"cs.CL","submitted_at":"2026-07-01T05:02:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hallucinations arise from biased latent inference paths rather than missing knowledge, demonstrated via a new diagnostic testbed TrapQA that isolates task-retrieval and key-selection biases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31039","ref_index":49,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies","primary_cat":"cs.CL","submitted_at":"2026-06-30T02:17:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoFa is a new benchmark and LFR@k metric for measuring LLM resistance to sustained logical fallacy attacks via generated question-argument pairs and debate simulations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30814","ref_index":111,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs","primary_cat":"cs.CL","submitted_at":"2026-06-29T18:37:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Global calibration metrics like ECE are confounded by accuracy; the proposed ACE framework with three accuracy-controlled views shows many prior calibration advantages weaken or 
reverse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29251","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Summaries Distort Decisions: Information Fidelity in LLM-Compressed Financial Analysis","primary_cat":"cs.AI","submitted_at":"2026-06-28T07:44:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM-based compression of financial source material can alter downstream investment decisions via decontextualization and model dependency, addressed by an agentic auditing approach that checks multiple compressions against the original.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24627","ref_index":114,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking","primary_cat":"cs.CL","submitted_at":"2026-06-23T14:23:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces claim-conditioned re-scoring (SIFT) and warranted supports proportion (WSP) metric, reporting accuracy recovery up to 27.6 points and WSP calibration at AUC 0.92 on FEVER, SciFact and other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21517","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MedHal-Loc: Are \"Explainable-by-Architecture\" Medical Hallucination Detectors Faithful Localizers? 
A Localization Benchmark","primary_cat":"cs.CL","submitted_at":"2026-06-19T15:11:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MedHal-Loc benchmark shows KG-triple hallucination detectors localize errors no better than chance on controlled medical statements due to entity extraction limits, while NLI and consistency methods succeed above chance, and real hallucinations are mostly diffuse conclusion changes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20676","ref_index":26,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity","primary_cat":"cs.CV","submitted_at":"2026-06-12T15:53:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13111","ref_index":108,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MÖVE: A Holistic LLM Benchmark for the German Public Sector","primary_cat":"cs.CL","submitted_at":"2026-06-11T09:37:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration 
datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13104","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-11T09:33:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AuthorityBench shows citation presence (real or fabricated) increases LLM hallucination rates vs no-citation baseline, strongest for fabricated citations on true claims, with domain variation but negligible venue or author effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12767","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage","primary_cat":"cs.AI","submitted_at":"2026-06-11T00:17:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Strict generation directly from Task-Method-Knowledge models yields 96.5% grounded and 92.6% usable QA pairs across 23 topics, outperforming transcript-first and TMK-aware alternatives on representational grounding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11816","ref_index":36,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid 
Reasoning","primary_cat":"cs.CL","submitted_at":"2026-06-10T08:50:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorldReasoner supplies 345 resolved forecasting tasks built from 14,141 articles to score LM agents on outcome quality, evidence quality, and reasoning quality against time-bounded evidence and hindsight graphs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11337","ref_index":78,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Can AI Agents Synthesize Scientific Conclusions?","primary_cat":"cs.AI","submitted_at":"2026-06-09T18:16:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11105","ref_index":36,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PhantomBench: Benchmarking the Non-existential Threat of Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-09T17:03:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhantomBench is a new benchmark of 60K+ non-existent terms showing language models hallucinate at rates up to 86.7 percent even when inputs assume the concepts exist.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08036","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GIScholarBench: Benchmarking LLM Overconfidence in GIS 
Research","primary_cat":"cs.IR","submitted_at":"2026-06-06T07:56:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GIScholarBench shows LLMs exhibit consistent overconfidence across three scholarly tasks in GIS, with different manifestations in factual retrieval, citation expansion, and idea generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06788","ref_index":53,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses","primary_cat":"cs.CL","submitted_at":"2026-06-05T00:14:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new evaluation framework shows that even the best tested LLM only reliably adjusts response complexity in the intended direction 46% of the time across 98 scientific queries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05054","ref_index":221,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Boosting Self-Consistency with Ranking","primary_cat":"cs.CL","submitted_at":"2026-06-03T16:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03980","ref_index":61,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Skill-RM: Unifying Heterogeneous Evaluation 
Criteria via Agent Skill","primary_cat":"cs.LG","submitted_at":"2026-06-02T17:56:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01923","ref_index":69,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time","primary_cat":"cs.CL","submitted_at":"2026-06-01T08:57:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RCA is a training-free module that boosts input context signal strength in the residual stream of LLMs by orthogonal decoupling of attention routing from value magnitude.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00432","ref_index":50,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Grounded Decoding: Retrieval-Anchored Probability Fusion for Faithful RAG","primary_cat":"cs.LG","submitted_at":"2026-05-29T23:47:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Grounded Decoding fuses full-RAG and retrieval-only next-token distributions via normalized geometric mean from a KL-barycenter to improve factual consistency and citation quality in RAG.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00328","ref_index":36,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"KG-Guard: Graph-Based Hallucination Detection for Knowledge Base Question 
Answering","primary_cat":"cs.LG","submitted_at":"2026-05-29T20:11:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KG-Guard augments knowledge graphs with a virtual question node and uses a graph encoder plus MLP to classify LLM-proposed answers as hallucinations or not, reporting superior F1 scores and downstream improvements on three benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30329","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?","primary_cat":"cs.LG","submitted_at":"2026-05-28T17:57:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SoundnessBench shows frontier LLMs exhibit pervasive optimism bias when rating the soundness of ML research proposals, frequently calling low-soundness ideas sound under standard prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22785","ref_index":33,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Evaluating Commercial AI Chatbots as News Intermediaries","primary_cat":"cs.CL","submitted_at":"2026-05-21T17:42:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Commercial AI chatbots reach over 90% multiple-choice accuracy on recent news facts but lose 11-17% in free response and drop to 19-70% on subtle false-premise questions, with retrieval failures causing most errors and clear Anglophone 
bias.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21602","ref_index":69,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-20T18:08:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces MOOD benchmark for OOD LLM alignment failures and shows guard models plus Mahalanobis and perplexity OOD detectors improve recall from 39% to 45% with positive scaling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16217","ref_index":22,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Argus: Evidence Assembly for Scalable Deep Research Agents","primary_cat":"cs.CL","submitted_at":"2026-05-15T17:29:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14563","ref_index":63,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation","primary_cat":"cs.SE","submitted_at":"2026-05-14T08:35:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemDocAgent generates consistent hierarchical repository-level 
code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16407","ref_index":44,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture","primary_cat":"cs.LO","submitted_at":"2026-05-13T12:01:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"partial","one_line_summary":"Introduces a trust-boundary architecture in Lean 4 with three certificate families and two operators that deliver sorry-free, axiom-audited assurances for LLM pipeline components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08590","ref_index":51,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations","primary_cat":"cs.HC","submitted_at":"2026-05-09T01:10:40+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RQ2: What forms of epistemic overreach appear in these explanations? RQ3: How does epistemic overreach change when the same anomalous event is explained with more available evidence or with evidence-bounding instructions? To study these RQs, we obtain anomalous-day explanation scenarios from three longitudinal sensing datasets: StudentLife [68], GLOBEM [73], and CollegeExperience [51]. 
For each dataset, we identify individual-relative anomalous days in behavioral or affective measures, and organize the available information into nested evidence tiers that provide progressively richer contextual support. As part of this empirical audit, we compare explanations generated under two prompt policies. The open explanation condition asks the model to explain the anomalous"},{"citing_arxiv_id":"2605.04893","ref_index":27,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics","primary_cat":"cs.LG","submitted_at":"2026-05-06T13:25:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transpose-invariant spectral diagnostics on attention operators are orientation-blind, and a φ-G two-axis diagnostic distinguishes hallucination modes with 0.62-0.84 LC-AUROC and predicted polarity reversal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04638","ref_index":33,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-06T08:30:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03971","ref_index":75,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Logical Consistency as a Bridge: Improving LLM 
Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments","primary_cat":"cs.CL","submitted_at":"2026-05-05T16:53:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaaB improves LLM hallucination detection by mapping self-judgment labels back into neural feature space and using mutual learning under logical consistency constraints between responses and meta-judgments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01749","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality","primary_cat":"cs.CL","submitted_at":"2026-05-03T07:07:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Exploration-Commitment Decoupling instantiated as Calibration-Aware Generation improves long-form factuality by up to 13% and reduces decoding time by up to 37% on five benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01482","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-02T15:05:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"An SCM-GRPO framework grounds multi-hop reasoning in structural dependency graphs and optimizes chain length via rule-based RL, outperforming baselines on HoVer and 
EX-FEVER.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01047","ref_index":36,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning","primary_cat":"cs.CR","submitted_at":"2026-05-01T19:20:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"treat symptoms without altering the weights that produce them. Adaptive Unlearning fills this gap. 2.3 Unlearning in LLMs As LLMs are often trained on massive corpora, they may inadvertently memorize sensitive, private, or outdated information. Machine unlearning removes or suppresses the influence of specific training data or knowledge in a trained model without retraining from scratch [36], with applications to privacy compliance and post-hoc model correction. Exact unlearning produces a model that behaves as if the target data had never been seen, but requires full retraining and is intractable at scale [36]. 
Recent work therefore focuses on empirical algorithms that fine-tune the model to forget targeted content efficiently [20]."},{"citing_arxiv_id":"2604.25359","ref_index":24,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-28T08:27:01+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24801","ref_index":37,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Architecture Determines Observability of Transformers","primary_cat":"cs.LG","submitted_at":"2026-04-27T02:39:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20131","ref_index":88,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Whose Story Gets Told? 
Positionality and Bias in LLM Summaries of Life Narratives","primary_cat":"cs.CL","submitted_at":"2026-04-22T02:58:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15109","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation","primary_cat":"cs.CL","submitted_at":"2026-04-16T15:03:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"IUQ quantifies claim-level uncertainty in long-form LLM generation by combining inter-sample consistency and intra-sample faithfulness through an interrogate-then-respond approach and outperforms baselines on two datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03141","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation","primary_cat":"cs.CL","submitted_at":"2026-04-03T16:03:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An importance-aware recall metric for LLM factuality evaluation reveals models are better at avoiding false claims than covering all relevant facts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06211","ref_index":35,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Illocutionary 
Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models","primary_cat":"cs.CL","submitted_at":"2026-03-16T11:10:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09564","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness","primary_cat":"cs.DC","submitted_at":"2026-02-14T01:33:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.01101","ref_index":38,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TSVer: A Benchmark for Fact Verification Against Time-Series Evidence","primary_cat":"cs.CL","submitted_at":"2025-11-02T22:33:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.10060","ref_index":40,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Textual 
Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems","primary_cat":"cs.LG","submitted_at":"2025-06-11T18:00:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a Bayesian framework viewing LLM prompts as textual parameters and proposes MHLP, a novel MCMC algorithm using LLM proposals, to perform inference and improve accuracy plus uncertainty quantification on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23912","ref_index":55,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations","primary_cat":"cs.CL","submitted_at":"2025-05-29T18:05:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.14427","ref_index":38,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-02-20T10:25:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.15594","ref_index":107,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Survey on LLM-as-a-Judge","primary_cat":"cs.CL","submitted_at":"2024-11-23T16:03:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"or multiple-choice questions are prone to ambiguity in response interpretation. Lastly, LLM-as-a-Judge evaluations may inadvertently reflect biases, such as favoring responses based on their position or length. 2.2 Model Selection 2.2.1 General LLM. To automate evaluation by LLM-as-a-Judge, one effective approach is to employ advanced language models such as GPT-4 [107] instead of human evaluators [213]. For instance, Li et al. [81] created a test set with 805 questions and assessed the performance by comparing it to text-davinci-003 using GPT-4. Additionally, Zheng et al. [213] designed 80 multi-round test questions across eight common areas and used GPT-4 to automatically score the model's responses. 
The accuracy of the GPT-4-based evaluator has been demonstrated to be high compared"},{"citing_arxiv_id":"2408.10692","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-08-20T09:42:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A regression model using attention features and recurrent uncertainty scores improves selective generation in LLMs over unsupervised and supervised baselines on ten datasets and three models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.17753","ref_index":144,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Evaluating Very Long-Term Conversational Memory of LLM Agents","primary_cat":"cs.CL","submitted_at":"2024-02-27T18:42:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}