{"total":71,"items":[{"citing_arxiv_id":"2606.27747","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UNICS: Multilingual Code Search via Unified Pseudocode and Contrastive Transfer Learning","primary_cat":"cs.SE","submitted_at":"2026-06-26T06:03:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UNICS pre-trains on a pseudocode dataset for cross-lingual logic then applies multi-task transfer learning with hard-positive mining and dynamic hard-negative sampling to reach claimed SOTA on multilingual code-search benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13814","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TASR: Training-Free Adaptive Stopping for Iterative Retrieval","primary_cat":"cs.IR","submitted_at":"2026-06-11T18:35:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TASR provides a training-free predicate that stops iterative retrieval on repeated normalized answers plus calibrated logit margin above 0.25, retaining 94.8% of fixed-k=5 F1 at 62.6% of the calls across 32 configurations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00822","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SkillPager: Query-Adaptive Intra-Skill Navigation via Semantic Node Retrieval","primary_cat":"cs.IR","submitted_at":"2026-05-30T17:49:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkillPager retrieves typed semantic nodes from skill documents via MMR to reach 78.89% LLM-judged sufficiency with 47% fewer tokens than full documents on a 395-skill 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00593","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering","primary_cat":"cs.CL","submitted_at":"2026-05-30T07:47:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPADER proposes step-wise peer advantage and diversity-aware exploration rewards in RL for multi-answer QA, reporting improved recall and F1 on QAMPARI, Mintaka, WebQSP, and QUEST.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30027","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-28T14:50:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DocRetriever introduces a framework using layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable reasoning-augmented reranker for few-shot settings, plus the MultiDocR benchmark for evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29992","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-28T14:24:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 200M-parameter Turkish sentence embedding model is adapted from a 
multilingual teacher via tokenizer pruning, mean-composition initialization, and offline cosine distillation, achieving 77.55% Pearson correlation on STSbTR and 7th place on TR-MTEB.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28268","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Cost-effective LLMs Routing with Batch Prompting","primary_cat":"cs.DB","submitted_at":"2026-05-27T10:14:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22203","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents","primary_cat":"cs.CL","submitted_at":"2026-05-21T09:06:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Recursive character-based chunking at 300 characters outperforms Sentence-Based, Khmer-Aware, and LLM-Based methods on L2 distance, answer relevance, and Khmer IoU in a 5-fold evaluation on 18 Khmer agricultural QA pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22202","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Structure Retention in Embedding Spaces as a Predictor of Benchmark 
Performance","primary_cat":"cs.CL","submitted_at":"2026-05-21T09:05:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14503","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks","primary_cat":"cs.SE","submitted_at":"2026-05-14T07:47:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14188","ref_index":73,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"QOuLiPo: What a quantum computer sees when it reads a book","primary_cat":"quant-ph","submitted_at":"2026-05-13T23:10:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Literary texts are turned into graphs for neutral-atom quantum processors, with a new rigidity metric distinguishing structural uniqueness and a QOuLiPo corpus of engineered texts created to match hardware-native graphs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13415","ref_index":11,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation 
Model","primary_cat":"cs.CL","submitted_at":"2026-05-13T12:10:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A system using XLM-RoBERTa, GPT-4 back-translation augmentation, undersampling, and language-specific threshold tuning reports 2-5% F1 gains on multilingual slur reclamation detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12714","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:22:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11864","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Very Efficient Listwise Multimodal Reranking for Long Documents","primary_cat":"cs.IR","submitted_at":"2026-05-12T09:45:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10805","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for 
LLM-as-a-Judge","primary_cat":"cs.AI","submitted_at":"2026-05-11T16:30:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Accuracy-cost trade-offs and reasoning-instructional agreement across benchmarks.Upper: Accuracy improvement versus cost ratio.Lower: Agreement patterns between instruct and reasoning inference. on in-distribution data may mis-estimate either the benefit of reasoning or the risk of budget violation under distribution shift. This motivates our distributionally robust objective in Eq. (4), which applies robustness separately to the reward and cost terms to hedge against these two failure modes. 2.2. When and Why Reasoning Helps To better understand when and why reasoning improves LLM-as-a-Judge, we conduct a case analysis. 
This analysis reveals three recurring patterns: (i) reasoning improves judgment when evaluation requires explicit verification; (ii)"},{"citing_arxiv_id":"2605.10530","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery","primary_cat":"cs.IR","submitted_at":"2026-05-11T13:14:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"systems, existing efforts remain fragmented across isolated pipeline stages. These include query reformulation with demographic personas [18], zero-shot query expansion [12], and production-level rewriting [3]; retrieval methods based on memory-augmented reasoning [32] and adaptive multi-aspect retrieval augmentation [42]; and generation-stage approaches that use token-level rewards, style transfer [5], or LLM-powered user simulation [49]. 
Although recent surveys identify a structural convergence between personalized RAG and agentic architectures [16], current works lack integration; our framework addresses this gap by maintaining a coherent user model throughout the entire pipeline, enabling holistic personalization that adapts from task planning through to final generation."},{"citing_arxiv_id":"2605.10043","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework","primary_cat":"cs.CL","submitted_at":"2026-05-11T06:12:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"C-BPO personalizes LLMs via preference-calibrated binary signals and PU learning theory to isolate inter-user differences from shared task knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09863","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-05-11T01:49:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipping as plugins and servers with an audit log.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09461","ref_index":11,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability 
Detection","primary_cat":"cs.AI","submitted_at":"2026-05-10T10:20:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VulTriage combines control dependency extraction, CWE knowledge retrieval, and semantic summarization to improve LLM accuracy on vulnerability detection, reaching SOTA on PrimeVul and generalizing to Kotlin.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07249","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal","primary_cat":"cs.IR","submitted_at":"2026-05-08T05:10:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"from 100M to 8B parameters, alongside two sparse baselines and two late-interaction retrievers. Our selection prioritizes breadth across paradigms and model scales, and includes widely used or recently released publicly available multilingual retrievers at the time of our experiments. Dense. Our dense retrievers encompass diverse model lineages, ranging from widely adopted encoder-only families such as multilingual-e5 [32], bge-m3 [33], gte [34], snowflake-arctic [35], nomic-embed [36], embeddinggemma [37] and jina [38, 39], to recent LLM-based embedding models including Qwen3-Embedding [40], llama-nemotron [41], and pplx-embed [42]. 
In this paradigm, queries and passages are independently encoded into fixed-dimensional vectors and scored by cosine similarity. We use each model's prescribed pooling strategy (CLS, mean, or last-token) and follow"},{"citing_arxiv_id":"2605.05991","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Case-Driven Multi-Agent Framework for E-Commerce Search Relevance","primary_cat":"cs.IR","submitted_at":"2026-05-07T10:41:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A case-driven multi-agent system automates the full pipeline of bad-case detection, annotation, and resolution for e-commerce search relevance using Annotator, Optimizer, and User agents plus supporting components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02171","ref_index":21,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"QuIVer: Rethinking ANN Graph Topology via Training-Free Binary Quantization","primary_cat":"cs.DB","submitted_at":"2026-05-04T03:04:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"QuIVer performs Vamana-style graph construction entirely inside a 2-bit Sign-Magnitude BQ space, achieving >=88% Recall@10 on contrastive-learning embeddings and 2.5-5.5x higher throughput than DiskANN/HNSW at matched recall with 4.7x less hot memory.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"to hide main-memory latency effectively during the XOR+Popcount distance computation. 5 Experiments 5.1 Setup Datasets.Table 3 summarizes the thirteen evaluation datasets, chosen to span the full spectrum of vector distributions encountered in practice. 
MiniLM-1M contains 1M sentence embeddings from the all-MiniLM-L6-v2 model (384-d); Cohere-1M, BGE-M3-1M [22] (multilingual BGE-M3, 1024-d), and DBpedia-OpenAI-1M (available in both 1536-d and 3072-d variants) contain real LLM embeddings produced by contrastive learning; cosine similarity is their native metric. Wolt-CLIP-1M [23] contains 1M CLIP ViT-B/32 embeddings (512-d) from the Wolt food delivery product catalog, representing single-domain multimodal image embeddings."},{"citing_arxiv_id":"2605.00702","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory","primary_cat":"cs.CL","submitted_at":"2026-05-01T14:45:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"back, we apply a predefined feedback instruction $P_g$ that compares the correct trajectory $\\tau^{+}$ with the negative trajectories $\\{\\tau_j^{-}\\}$, highlighting the desired properties of $\\tau^{+}$ and the typical errors in the negatives. The resulting natural-language contrastive reflection serves as a textual gradient, guiding the iterative refinement of the guidelines: $g^{(k)} = \\mathrm{Grad}(\\tau^{+}, \\{\\tau_j^{-}\\}; P_g)$ (2). This textual gradient is then used to update $S^{(k)}$, guiding the agent toward more reliable and task-aligned trajectories. Batch-level gradient aggregation. To obtain a stable and general update signal, we aggregate textual gradients across a mini-batch $B$ of training examples. 
Each $g^{(k)}$ provides a localized critique about how $S^{(k)}$ should change for a specific $(H, x)$;"},{"citing_arxiv_id":"2605.00618","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus","primary_cat":"cs.CL","submitted_at":"2026-05-01T12:41:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"These traits include awkward or unidiomatic phrasing due to overly literal renderings of the source, as well as systematic differences in lexical choice and syntax [67]. Usually, translationese effects are stylistic rather than semantic. For instance, translated documents carry over source-language influences, altering the distribution of words in the target language [17, 57]. Translations generally use a narrower vocabulary and more repetitive word choice compared to original text, with NMT outputs being even more lexically limited than human translations [60, 46]. They tend to introduce greater explicitation, adding clarifying words or connectors that were implicit in the source [4, 56]. 
Finally, translations frequently use"},{"citing_arxiv_id":"2604.27600","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG","primary_cat":"cs.IR","submitted_at":"2026-04-30T08:50:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"utility of the visual context during the final generation phase. 4.3 FIG Quantification and Teacher Adaptation To optimize a high-fidelity selector capable of identifying critical evidence, it is imperative to establish a robust supervision signal that quantifies the intrinsic utility of each candidate fragment. While semantic relevance (e.g., cosine similarity) often serves as a proxy for importance [6], it frequently fails to align with the actual requirements of the generation process. Building upon the principles of DIG [49], we propose FIG, a metric designed to measure the marginal contribution of a fragment to the answer generation. 
4.3.1 FIG Definition and Calculation Details.Formally, we define Fragment Information Gain (FIG) as the improvement in the length-"},{"citing_arxiv_id":"2604.25716","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cross-Lingual Jailbreak Detection via Semantic Codebooks","primary_cat":"cs.CL","submitted_at":"2026-04-28T14:43:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Semantic similarity to an English jailbreak codebook detects cross-lingual attacks with high accuracy on curated benchmarks but shows poor separability on diverse unsafe prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24334","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering","primary_cat":"cs.CL","submitted_at":"2026-04-27T11:23:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23734","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval","primary_cat":"cs.IR","submitted_at":"2026-04-26T14:28:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar 
scoring.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22722","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Aligning Dense Retrievers with LLM Utility via Distillation","primary_cat":"cs.IR","submitted_at":"2026-04-24T17:18:56+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22577","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"QuantClaw: Precision Where It Matters for OpenClaw","primary_cat":"cs.AI","submitted_at":"2026-04-24T14:10:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"The benefits are more pronounced on v2.0.0 for GLM-5. QuantClaw improves the average score by 2.09 points over FP8 baseline, while achieving 21.4% cost savings and 15.7% speed up. 8 The ablation study on task detection methods is shown in Table 3. QuantClaw supports various detection methods, including individual detectors such as RuleDetector, an embedding model (i.e., BGE-M3 [43]), a model-as-judge (e.g., GLM-4.7-Flash-INT4), and a hybrid strategy. Introducing judge models improves the detection accuracy while increasing the time overhead. However, the hybrid strategy achieves an acceptable trade-off, demonstrating the highest accuracy and a reasonable time cost. 
This establishes RuleDetector + BGE-M3 as the default choice for QuantClaw."},{"citing_arxiv_id":"2604.21511","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Tokens to Concepts: Leveraging SAE for SPLADE","primary_cat":"cs.IR","submitted_at":"2026-04-23T10:13:21+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21264","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation","primary_cat":"cs.AI","submitted_at":"2026-04-23T04:17:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LLM chain-of-thought rewriting of job postings plus category-aware MoE improves person-job fit AUC by 2.4%, GAUC by 7.5%, and live click-through conversion by 19.4%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20117","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"To Know is to Construct: Schema-Constrained Generation for Agent Memory","primary_cat":"cs.CL","submitted_at":"2026-04-22T02:27:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCG-MEM reformulates agent memory access as schema-constrained generation within dynamic cognitive schemas, using assimilation and accommodation for updates plus an associative graph for reasoning, and outperforms retrieval baselines on the LoCoMo 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19899","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Reproducibility Study of Metacognitive Retrieval-Augmented Generation","primary_cat":"cs.IR","submitted_at":"2026-04-21T18:22:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19566","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference","primary_cat":"cs.IR","submitted_at":"2026-04-21T15:19:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Diagnosable ColBERT aligns ColBERT embeddings to an expert-grounded clinical latent space to enable direct diagnosis of model misunderstandings and better training data curation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18486","ref_index":11,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation","primary_cat":"cs.CV","submitted_at":"2026-04-20T16:37:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world 
model decoder that predicts future frames.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"This metric directly measures whether the model's reasoning arrives at the correct driving intent, which is the most safety-critical aspect of CoT quality. • STS Score (Semantic Textual Similarity Score): We compute a neural semantic similarity score between each predicted CoT and its ground-truth reference using a cross-encoder reranker (BGE-reranker-v2-m3 [11]). This evaluator is particularly suitable for computing a similarity score for templated CoTs where most of the words are identical. A cross-encoder concatenates the ground truth and the prediction, processing them simultaneously through full token-by-token cross-attention. This mechanism allows the model to perform a deep, comparative analysis, making it highly sensitive to critical localized contradictions-such as predicting"},{"citing_arxiv_id":"2604.17943","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents","primary_cat":"cs.CL","submitted_at":"2026-04-20T08:22:15+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17866","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Latent Abstraction for Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-04-20T06:26:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming 
prior RAG on six QA benchmarks with better efficiency.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"as a fixed query vector, following the procedure described in Section 2; LAnR-Instruct uses adaptive multi-hop retrieval with a learned controller. Despite joint training with generation, LAnR-Instruct matches or exceeds specialized retrievers at comparable document budgets. Dataset Method R@1 R@3 R@5 R@10 Recall (budget) HotpotQA BM25 [30] 0.401 0.603 0.670 0.751 - BGE [5] 0.477 0.782 0.832 0.880 - E5 [38] 0.462 0.773 0.826 0.875 - LAnR-Instruct-static 0.451 0.769 0.800 0.830 - LAnR-Instruct - - - - 0.840±0.0012(~5.5 docs) 2WikiMQA BM25 [30] 0.360 0.555 0.605 0.656 - BGE [5] 0.411 0.643 0.680 0.715 - E5 [38] 0.415 0.656 0.687 0.717 - LAnR-Instruct-static 0.407 0.641 0.684 0.713 - LAnR-Instruct - - - - 0.715±0.0009(~5."},{"citing_arxiv_id":"2604.17738","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data","primary_cat":"cs.CL","submitted_at":"2026-04-20T02:51:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting Recall@50 from 68.89% to 77.55% and Recall@200 from 0.5969 to 0.7047.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Dense retrieval has made shared query-document embeddings a standard approach for ef- ficient nearest-neighbor search, supported by unsupervised con- trastive pretraining [10], instruction-tuned embeddings [ 16, 25], and large-scale weak supervision [27]. 
This line of work has pro- duced strong general-purpose encoders such as E5 [27], GTE [15], BGE [29], BGE-M3 [4], and Jina Embeddings v3 [24]. At the same time, parameter-efficient adaptation methods such as LoRA and QLoRA have made downstream specialization practical even with limited computational resources [6, 9]. In parallel, synthetic-data approaches show that LLM-generated supervision can substantially improve retrieval. InPars and InPars-v2 synthesize queries from"},{"citing_arxiv_id":"2604.17265","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search","primary_cat":"cs.IR","submitted_at":"2026-04-19T05:35:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22829","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding","primary_cat":"cs.IR","submitted_at":"2026-04-18T05:04:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LFRAG advances multimodal RAG to block-level retrieval with layout segmentation and cross-attention fusion, reporting SOTA retrieval, 7.20% higher answer accuracy, and 73.07% lower token consumption on the new LFDocQA 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15484","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents","primary_cat":"cs.IR","submitted_at":"2026-04-16T19:22:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BEIR datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14907","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task","primary_cat":"cs.CL","submitted_at":"2026-04-16T11:49:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance in the supervised case.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"followed this direction with HateBERT pretrained on a large-scale corpus of Reddit comments containing offensive, abusive, or hateful content. Comparative experiments across three English benchmarks for abusive language detection (OffensEval, AbusEval, and HatEval) showed that HateBERT consistently outperformed the corresponding general BERT model on each of them. 
The OffensEval dataset [31] has since been adopted as one of de facto standards for evaluating offensive language, and later extended to multilingual settings in OffensEval-2020 [19], which introduced parallel datasets in Arabic, Danish, Greek, and Turkish. This led to evaluation of multilingual transformer architectures such as mBERT and XLM-RoBERTa. Across all languages, the dominant strategy among top-performing systems"},{"citing_arxiv_id":"2604.14586","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CPGRec+: A Balance-oriented Framework for Personalized Video Game Recommendations","primary_cat":"cs.IR","submitted_at":"2026-04-16T03:25:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CPGRec+ improves game recommendations on Steam data by reweighting player-game edges with signed preference strengths and using LLMs to generate preference-aware descriptions, yielding higher accuracy and diversity than prior models.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Amnesia: The Dark Descent, which, despite being categorized as action, has distinct horror-themed graphics and puzzle- based gameplay. To validate the effectiveness of strict connections, we analyze the Steam network by comparing edge quantity, Euclidean distance, and cosine similarity of game description embeddings under raw and strict connections. Descrip- tions are generated using Qwen2.5 [ 91] for contextual understanding and M3-Embedding [ 9] for capturing semantic meaning-both chosen for their strong performance, accessibility, and cost-effectiveness. As shown in Fig. 2(a), intro- ducing a second category condition significantly reduces edge quantity, filtering out noisy connections and enhancing CBF signal quality. In Fig. 
2(b), the larger main diagonal elements indicate that strict connections link games with more"},{"citing_arxiv_id":"2604.14389","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking","primary_cat":"cs.CL","submitted_at":"2026-04-15T20:06:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BiCon-Gate improves dialogue fact-checking by applying staged de-colloquialisation and gating rewrites based on semantic consistency with context, yielding gains on the DialFact benchmark over baselines including LLM rewrites.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11339","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Collaboration, Integration, and Thematic Exploration in European Framework Programmes: A Longitudinal Network Analysis","primary_cat":"physics.soc-ph","submitted_at":"2026-04-13T11:37:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EU Framework Programmes have increased participation equity and integrated new countries through collaboration, yet research remains concentrated on established trajectories rather than broadly exploratory.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"comparing funding allocations between research organizations and industry across topics. 5.1 Semantic embeddings extraction and topic modeling We extract vector representations from theobjectivefield of ourProjectsdataset, which contains the project abstract, using the pretrained modelBGE-M34, a multilingual encoder that maps text to a 1024-dimensional dense vector space [33]. 
We retain only projects with a non-emptyobjective field, yielding a corpus of 122,570 documents. A key advantage ofBGE-M3is its support for input sequences of up to 8,192 tokens, accommodating project descriptions that exceed the 512-token limit of many alternative models. We perform topic modeling with BERTopic [25], which combines transformer-based embeddings"},{"citing_arxiv_id":"2605.18767","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DualView: Adaptive Local-Global Fusion for Multi-Hop Document Reranking","primary_cat":"cs.IR","submitted_at":"2026-04-13T08:56:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DualView fuses local cross-attention and global context aggregation via adaptive gating to rerank fixed candidate sets for multi-hop QA, reporting 99.4% Top-4 Recall on MuSiQue at 4 ms latency while beating larger cross-encoders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07054","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sell More, Play Less: Benchmarking LLM Realistic Selling Skill","primary_cat":"cs.CL","submitted_at":"2026-04-08T13:06:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"035 (50,000 yuan over 3 years = 5,250 yuan, bank-supervised). Suggests adding the company's WeChat for assistance. Rules: (1) Use English strictly. (2) Be realistic; do not men- tion impossible things. (3) Avoid personal life or off-topic content. 
(4) Use conversational language; keep responses brief. (5) Do not fabricate facts; respond from a customer perspective. (6) No parenthetical actions or inner thoughts. Lang #Dia Avg msgs/conv Avg user msgs ZH (118_zh) 118 7.98 3.85 EN (150_en) 150 8.13 3.83 Table 9: Held-out ablation test set statistics (Sec. 4.5). Averages exclude the system prompt. Lang is language, Dia is dialogues. One example of test data A: Hi.U: Hello, I'm a rep from xx Securities-do you remember us?"},{"citing_arxiv_id":"2604.05818","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering","primary_cat":"cs.CV","submitted_at":"2026-04-07T12:52:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide when to route to external LLM or rely on internal VLM knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03860","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LiquiLM: Bridging the Semantic Gap in Liquidity Flaw Audit via DCN and LLMs","primary_cat":"cs.CR","submitted_at":"2026-04-04T20:49:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiquiLM integrates LLMs and DCN to audit liquidity flaws in blockchain smart contracts, achieving over 90% F1-score and uncovering 238 high-risk contracts plus 10 CVE-certified vulnerabilities in real-world PoL and Ethereum contracts.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"This ensures the model focuses on structural logic rather than lexical vocabulary. 
Simultaneously, protected structural tags are defined to preserve annotation fields such as [Target_Function], ensuring the model correctly identifies the hierarchical levels of the slice. Embedding Vector Generation.To convert the information sources 𝐼1 and 𝐼2 into vector repre- sentations, we employ the pre-trained encoder BGE-M3 [8]. This process is expressed as Eq.(2): H=𝐸(𝐼) ∈R 𝐿×𝐷 (2) where 𝐿 is the sequence length (512) and 𝐷 is the feature dimension (1024). For 𝐼1 and 𝐼2, we obtain the following vector sets: • ®𝐼1 ={ ®𝐶𝑖 | ®𝐶𝑖 ={®𝑐1,®𝑐2, . . . ,®𝑐𝑡 },∀𝑖∈ {1,2, . . . , 𝑛}} • ®𝐼2 ={ ®𝑄𝑖 | ®𝑄𝑖 ={ ®𝑞1,®𝑞2, . . . ,®𝑞𝑘 },∀𝑖∈ {1,2, . . . , 𝑚}} The resulting set ®𝐼2 constitutes theSemantic Corpus."}],"limit":50,"offset":0}