{"total":31,"items":[{"citing_arxiv_id":"2606.10711","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Agentic Web Requires New Normative Infrastructure","primary_cat":"cs.CY","submitted_at":"2026-06-09T11:15:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The agentic web requires new normative infrastructure of laws, norms, and practices to allow user-delegated AI agents to access online properties without being blocked as malicious bots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00610","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation","primary_cat":"cs.IR","submitted_at":"2026-05-30T08:18:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemGraphRAG uses a memory-based multi-agent system for globally consistent graph construction from fragmented corpora plus a memory-aware hierarchical retriever, claiming better benchmark performance than prior GraphRAG methods at similar cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22203","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents","primary_cat":"cs.CL","submitted_at":"2026-05-21T09:06:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Recursive character-based chunking at 300 characters outperforms Sentence-Based, Khmer-Aware, and LLM-Based methods on L2 distance, answer relevance, and Khmer IoU in a 5-fold evaluation on 
18 Khmer agricultural QA pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12813","ref_index":122,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations","primary_cat":"cs.CL","submitted_at":"2026-05-12T23:13:50+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16347","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HPC-LLM: Practical Domain Adaptation and Retrieval-Augmented Generation for HPC Support","primary_cat":"cs.LG","submitted_at":"2026-05-08T03:54:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HPC-LLM fine-tunes Llama 3.1 8B via QLoRA on 9k-24k HPC examples and adds dense retrieval to deliver practical support for job scheduling, MPI, and GPU workflows, approaching the performance of larger general models at lower memory and latency cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05287","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use","primary_cat":"cs.CR","submitted_at":"2026-05-06T17:59:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent 
systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24334","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering","primary_cat":"cs.CL","submitted_at":"2026-04-27T11:23:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22261","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion","primary_cat":"cs.CL","submitted_at":"2026-04-24T06:10:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RC-RAG boosts long-tail relation completion by infusing paraphrases into RAG stages, yielding up to 40.6 EM gains on benchmarks across five LLMs with no fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19820","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"KnowPilot: Your Knowledge-Driven Copilot for Domain Tasks","primary_cat":"cs.SE","submitted_at":"2026-04-19T07:09:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"KnowPilot integrates knowledge retrieval and memory systems into generative agents to achieve better results on domain-specific tasks such as text 
generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07274","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering","primary_cat":"cs.CL","submitted_at":"2026-04-08T16:37:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01348","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Procedural Knowledge at Scale Improves Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-01T20:01:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.03387","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LIMO: Less is More for Reasoning","primary_cat":"cs.CL","submitted_at":"2025-02-05T17:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice 
when pre-training has already encoded domain knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.12935","ref_index":283,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions","primary_cat":"cs.AI","submitted_at":"2024-08-23T09:33:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In traditional reward modeling, human participants provide initial feedback to establish the reward model but do not participate during the AI system's training process. The disconnection of human feedback and AI systems can create opportunities for reward hacking. To achieve better alignment, researchers have adopted Cooperative Inverse Reinforcement Learning (CIRL) [ 283, 625] strategy, incorporating human participants into AI system control and learning process. Specifically, AI systems do not have access to ground truth reward values during training; instead, they infer these values through observation and interactions with human participants [2, 5]. 
Since the reward values rely on human participants, the behavior of AI systems tends to align more closely with human intentions."},{"citing_arxiv_id":"2401.18059","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval","primary_cat":"cs.CL","submitted_at":"2024-01-31T18:30:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.03714","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines","primary_cat":"cs.CL","submitted_at":"2023-10-05T17:37:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.16671","ref_index":100,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Demystifying CLIP Data","primary_cat":"cs.CV","submitted_at":"2023-09-28T17:59:56+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 
68.3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.14325","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Improving Factuality and Reasoning in Language Models through Multiagent Debate","primary_cat":"cs.CL","submitted_at":"2023-05-23T17:55:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To improve reasoning, approaches have relied on prompting techniques such as scratchpads [20], verification [3], chain-of-thought demonstrations [30, 11, 25], and intermediate self-reflection [ 26, 18] and finetuning [13, 24, 31]. To improve factuality, approaches have relied on training techniques such as RLHF [33, 16, 2], pruning truthful datasets [12], external knowledge retrieval [7] and training-free methods based off likelihood estimation [10]. Our work provides an alternative way to obtain reasoning and factuality in language models using multiagent debates, which only requires black-box access to a language generator. 
Prior work also has explored how to take the majority vote across different models [ 15, 3, 29, 28] while in this work,"},{"citing_arxiv_id":"2208.03299","ref_index":194,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Atlas: Few-shot Learning with Retrieval Augmented Language Models","primary_cat":"cs.CL","submitted_at":"2022-08-05T17:39:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2207.05221","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Language Models (Mostly) Know What They Know","primary_cat":"cs.CL","submitted_at":"2022-07-11T22:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2206.07682","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Emergent Abilities of Large Language Models","primary_cat":"cs.CL","submitted_at":"2022-06-15T17:32:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.05862","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","primary_cat":"cs.CL","submitted_at":"2022-04-12T15:02:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergence from initialization).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2202.08906","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ST-MoE: Designing Stable and Transferable Sparse Expert Models","primary_cat":"cs.CL","submitted_at":"2022-02-17T21:39:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2201.10005","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Text and Code Embeddings by Contrastive Pre-Training","primary_cat":"cs.CL","submitted_at":"2022-01-24T23:36:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Contrastive pre-training on unsupervised data at scale creates text and code embeddings that set new 
state-of-the-art results on classification and semantic search benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2201.08239","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LaMDA: Language Models for Dialog Applications","primary_cat":"cs.CL","submitted_at":"2022-01-20T15:44:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Most of the existing literature focuses on the problem of open-domain question-answering rather than dialog generation, and the models themselves are used to index and rank knowledge sources, rather than trained to use an intermediate tool. Given these differences, we note that the range of existing approaches to this problem include the RNNLM [34], RAG [35], REALM [ 36], and FiD [37] architectures. Zhu et al. [38] provide a survey of further recent work. See Karpukhin et al. [39] for details on the 'dense passage retriever' used in RAG. Recent work in this direction has expanded and elaborated on neural models' ability to retrieve and rank passages [40]. 
The RETRO architecture demonstrates that language models can be primed with results retrieved from"},{"citing_arxiv_id":"2112.09118","ref_index":134,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unsupervised Dense Information Retrieval with Contrastive Learning","primary_cat":"cs.IR","submitted_at":"2021-12-16T18:57:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2110.01552","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA","primary_cat":"cs.CL","submitted_at":"2021-10-04T16:45:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2101.03961","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity","primary_cat":"cs.LG","submitted_at":"2021-01-11T16:11:52+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over 
T5-XXL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2005.14165","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Language Models are Few-Shot Learners","primary_cat":"cs.CL","submitted_at":"2020-05-28T17:29:03+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16] - both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task. • One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 1. 
The reason to distinguish one-shot from few-shot and"},{"citing_arxiv_id":"2005.11401","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks","primary_cat":"cs.CL","submitted_at":"2020-05-22T21:34:34+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"where d(z) is a dense representation of a document produced by a BERTBASE document encoder [8], and q(x) a query representation produced by a query encoder, also based on BERTBASE. Calculating top-k(pη(·|x)), the list of k documents z with highest prior probability pη(z|x), is a Maximum Inner Product Search (MIPS) problem, which can be approximately solved in sub-linear time [23]. We use a pre-trained bi-encoder from DPR to initialize our retriever and to build the document index. This retriever was trained to retrieve documents which contain answers to TriviaQA [24] questions and Natural Questions [29]. We refer to the document index as the non-parametric memory. 
2.3 Generator: BART The generator component pθ(yi|x,z,y1:i−1) could be modelled using any encoder-decoder."},{"citing_arxiv_id":"2004.04906","ref_index":78,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dense Passage Retrieval for Open-Domain Question Answering","primary_cat":"cs.CL","submitted_at":"2020-04-10T04:53:17+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Dense dual-encoder retrievers outperform BM25 by 9-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets and enable new state-of-the-art end-to-end QA results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2002.08910","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Much Knowledge Can You Pack Into the Parameters of a Language Model?","primary_cat":"cs.CL","submitted_at":"2020-02-10T18:55:58+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}