{"total":39,"items":[{"citing_arxiv_id":"2605.22544","ref_index":38,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation","primary_cat":"cs.CL","submitted_at":"2026-05-21T14:27:46+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20098","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Neurosymbolic Learning for Inference-Time Argumentation","primary_cat":"cs.AI","submitted_at":"2026-05-19T16:49:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ITA trains LLMs to generate and score arguments for ternary claim verification and uses argumentation semantics to derive faithful true/false/uncertain predictions from those structures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17187","ref_index":267,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media","primary_cat":"cs.CL","submitted_at":"2026-05-16T22:52:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online 
communities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16608","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios","primary_cat":"cs.LG","submitted_at":"2026-05-15T20:17:34+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06006","ref_index":34,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence","primary_cat":"cs.CL","submitted_at":"2026-05-07T10:58:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PrimeFacts extracts decontextualized premises from fact-check articles, raising evidence retrieval MRR by up to 30% and verdict prediction Macro-F1 by 10-20 points over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04495","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-05-06T04:51:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CAR reranks documents in RAG by promoting those that increase generator confidence (via answer consistency sampling) and demoting those that decrease it, yielding NDCG@5 gains on BEIR datasets that correlate with F1 
improvements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01782","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence","primary_cat":"cs.CR","submitted_at":"2026-05-03T08:42:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"rocks-exploiting vulnerabilities in retrieval-augmented generative models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 1610-1626 (2024) [49] Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M.S., Love, J., et al.: Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295 (2024) [50] Thorne, J., Vlachos, A., Christodoulopoulos, C., Mittal, A.: Fever: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355 (2018) [51] Wen, T., Wang, C., Yang, X., Tang, H., Xie, Y., Lyu, L., Dou, Z., Wu, F.: Defending against indirect prompt injection by instruction detection. 
In: Findings of the Association for Computational Linguistics: EMNLP 2025."},{"citing_arxiv_id":"2605.01133","ref_index":33,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems","primary_cat":"cs.CR","submitted_at":"2026-05-01T22:15:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Embedding-based defenses fail against attacks that align malicious message embeddings with benign ones in LLM multi-agent systems, but token-level confidence scores improve robustness by enabling better pruning of suspicious messages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00369","ref_index":87,"ref_count":2,"confidence":0.88,"is_internal_anchor":true,"paper_title":"InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees","primary_cat":"cs.LG","submitted_at":"2026-05-01T03:12:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27037","ref_index":56,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval","primary_cat":"cs.IR","submitted_at":"2026-04-29T17:05:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Reproducibility study confirms Hypencoder's non-linear query-specific scoring improves retrieval over 
bi-encoders on standard benchmarks but standard methods remain faster and hard-task results are mixed due to implementation issues.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"document representations to be pre-computed and indexed for Maximum Inner Product Search (MIPS). Significant progress has been made in optimizing bi-encoders through contrastive learning (e.g., Contriever [19]), negative sampling strategies (e.g., ANCE [61], DRAGON [29]), and distillation (e.g., TAS-B [17]). More recently, LLM-based dense retrievers such as repLLaMA [31] and E5-Mistral [56] have achieved strong performance by leveraging large language models as backbone encoders. However, all standard bi-encoders share a common limitation: they rely on a simple, fixed similarity function (typically dot product or cosine similarity), which compresses the query's semantic intent into a single geometric point, potentially creating a bottleneck for complex matching"},{"citing_arxiv_id":"2604.24608","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models","primary_cat":"cs.IR","submitted_at":"2026-04-27T15:36:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[44] David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or Fiction: Verifying Scientific Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). 
Association for Computational Linguistics, Online, 7534-7550. doi:10.18653/v1/2020.emnlp-main.609 [45] Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, et al. 2026. Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation. arXiv preprint arXiv:2602.16990 (2026). [46] Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. 2025."},{"citing_arxiv_id":"2604.24076","ref_index":15,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress","primary_cat":"cs.AI","submitted_at":"2026-04-27T06:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A thermodynamic-inspired information-geometric framework defines a composite LLM stability score that outperforms a utility-entropy baseline by 0.0299 on average across 80 observations, with gains increasing at higher entropy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21193","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models","primary_cat":"cs.AI","submitted_at":"2026-04-23T01:37:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DAVinCI combines claim attribution to model internals and external sources with entailment-based verification to improve LLM factual reliability by 5-20% on fact-checking 
datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17237","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads","primary_cat":"cs.IR","submitted_at":"2026-04-19T03:43:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HeadRank lifts preference optimization into attention space via entropy-regularized head selection and distribution regularizers to sharpen discriminability for efficient listwise reranking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15741","ref_index":49,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Learning Uncertainty from Sequential Internal Dispersion in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-17T06:31:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08401","ref_index":44,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing","primary_cat":"cs.AI","submitted_at":"2026-04-09T16:01:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAVeR adds self-auditing of internal beliefs in LLM agents via persona-based candidates and constraint-guided repairs, improving faithfulness on six benchmarks 
without hurting task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08046","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-04-09T09:52:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GuarantRAG improves RAG accuracy up to 12.1% and cuts hallucinations 16.3% by decoupling parametric reasoning from evidence integration via contrastive DPO and joint decoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06163","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers","primary_cat":"cs.IR","submitted_at":"2026-04-07T17:57:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04743","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations","primary_cat":"cs.CL","submitted_at":"2026-04-06T15:08:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM hallucinations arise from task-dependent basins in latent space, with separability varying by task and geometry-aware steering reducing their 
probability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.19339","ref_index":41,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Spectral Tempering for Embedding Compression in Dense Passage Retrieval","primary_cat":"cs.IR","submitted_at":"2026-03-19T10:01:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Spectral Tempering derives an adaptive scaling factor γ(k) from the embedding eigenspectrum via local SNR analysis and knee-point normalization to achieve near-optimal compression without training or validation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.01101","ref_index":48,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"TSVer: A Benchmark for Fact Verification Against Time-Series Evidence","primary_cat":"cs.CL","submitted_at":"2025-11-02T22:33:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.08110","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AI Feedback Enhances Community-Based Content Moderation through Engagement with Counterarguments","primary_cat":"cs.CY","submitted_at":"2025-07-10T18:52:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AI argumentative feedback on community notes produces larger quality improvements than supportive or neutral 
feedback in a hybrid moderation experiment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.04565","ref_index":180,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems","primary_cat":"cs.MA","submitted_at":"2025-06-05T02:34:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"employs a round-trip consistency check to filter out low-quality knowledge [28], while FILCO implements fine-grained sentence-wise filtering of retrieved passages [182]. Knowledge refinement further enhances the quality, coherence, and usability of retrieved information. Examples include LLM-AMT, which uses a knowledge self-refiner to filter and refine retrieved information for relevance [180], and RECOMP, which compresses retrieved documents into concise textual summaries before appending them as input [194]. Besides knowledge filtering and knowledge refinement, reranking involves reordering and reassessing retrieved results to maximize relevance. Several advanced reranking techniques have been proposed. 
The approach by Lazaridou et al."},{"citing_arxiv_id":"2504.19314","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese","primary_cat":"cs.CL","submitted_at":"2025-04-27T17:32:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BrowseComp-ZH is a new benchmark of 289 Chinese web questions where even the strongest LLM agents reach only 42.9% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.17428","ref_index":98,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models","primary_cat":"cs.CL","submitted_at":"2024-05-27T17:59:45+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.05672","ref_index":39,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Multilingual E5 Text Embeddings: A Technical Report","primary_cat":"cs.CL","submitted_at":"2024-02-08T13:47:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.02716","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Understanding the planning of LLM agents: A survey","primary_cat":"cs.AI","submitted_at":"2024-02-05T04:25:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.15391","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries","primary_cat":"cs.CL","submitted_at":"2024-01-27T11:41:48+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.10997","ref_index":149,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Retrieval-Augmented Generation for Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2023-12-18T07:47:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive 
tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Reasoning Commonsense Reasoning HellaSwag [143] [20], [66] CoT Reasoning CoT Reasoning [144] [27] Complex Reasoning CSQA [145] [55] Others Language Understanding MMLU [146] [7], [27], [28], [42], [43], [47], [72] Language Modeling WikiText-103 [147] [5], [29], [64], [71] StrategyQA [148] [14], [24], [48], [51], [55], [58] Fact Checking/Verification FEVER [149] [4], [13], [27], [34], [42], [50] PubHealth [150] [25], [67] Text Generation Biography [151] [67] Text Summarization WikiASP [152] [24] XSum [153] [17] Text Classification VioLens [154] [19] TREC [155] [33] Sentiment SST-2 [156] [20], [33], [38] Code Search CodeSearchNet [157] [76] Robustness Evaluation NoMIRACL [56] [56] Math GSM8K [158] [73] Machine Translation JRC-Acquis [159] [17]"},{"citing_arxiv_id":"2305.14233","ref_index":110,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Enhancing Chat Language Models by Scaling High-quality Instructional Conversations","primary_cat":"cs.CL","submitted_at":"2023-05-23T16:49:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2304.13734","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Internal State of an LLM Knows When It's Lying","primary_cat":"cs.CL","submitted_at":"2023-04-26T02:49:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hidden activations in LLMs encode detectable information about statement truthfulness, enabling a 
classifier to identify true versus false content more reliably than the model's assigned probabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2304.06364","ref_index":72,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models","primary_cat":"cs.CL","submitted_at":"2023-04-13T09:39:30+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2212.03827","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Discovering Latent Knowledge in Language Models Without Supervision","primary_cat":"cs.CL","submitted_at":"2022-12-07T18:17:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2212.03533","ref_index":57,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Text Embeddings by Weakly-Supervised Contrastive Pre-training","primary_cat":"cs.CL","submitted_at":"2022-12-07T09:25:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"E5 text embeddings trained with weakly-supervised contrastive pre-training on CCPairs 
outperform BM25 on BEIR zero-shot and achieve top results on MTEB, beating much larger models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1-12. ACM New York, NY, USA, 2021. [56] Henning Wachsmuth, Shahbaz Syed, and Benno Stein. Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 241-251, 2018. [57] David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534-7550, 2020. [58] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan"},{"citing_arxiv_id":"2208.03299","ref_index":252,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Atlas: Few-shot Learning with Retrieval Augmented Language Models","primary_cat":"cs.CL","submitted_at":"2022-08-05T17:39:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.09118","ref_index":173,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unsupervised Dense Information Retrieval with Contrastive Learning","primary_cat":"cs.IR","submitted_at":"2021-12-16T18:57:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Contrastive learning trains unsupervised dense retrievers that 
beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2110.01552","ref_index":33,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA","primary_cat":"cs.CL","submitted_at":"2021-10-04T16:45:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2102.11105","ref_index":61,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"REMOD: Relation Extraction for Modeling Online Discourse","primary_cat":"cs.SI","submitted_at":"2021-02-22T15:26:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents REMOD, a graph-based supervised method for extracting semantic relations between entities in text to support modeling of online discourse and potential misinformation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2005.11401","ref_index":59,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks","primary_cat":"cs.CL","submitted_at":"2020-05-22T21:34:34+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than 
parametric baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf . [59] Ashwin Vijayakumar, Michael Cogswell, Ramprasaath Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search for improved description of complex scenes. AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17329. [60] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman."}],"limit":50,"offset":0}