{"total":139,"items":[{"citing_arxiv_id":"2606.27559","ref_index":67,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Sensitivity-Aware Test Collection for Search Among Personal Information","primary_cat":"cs.IR","submitted_at":"2026-06-25T21:24:17+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27513","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toward a Hybrid Digital Twin of Society: Quantifying Cognitive-Spatial Linkages Through Online-Offline Feedback Networks","primary_cat":"physics.soc-ph","submitted_at":"2026-06-25T19:58:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A Feedback Network model is developed showing online semantic exploration is more concentrated than physical mobility, with stable retail-business linkages and greater COVID disruption to spatial than cognitive routines, as a step toward hybrid digital twins of society.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01400","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs","primary_cat":"cs.CL","submitted_at":"2026-05-31T18:45:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark 
subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01298","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Challenger at MultiPRIDE: Is It Hate Speech or Reclaimed?","primary_cat":"cs.CL","submitted_at":"2026-05-31T15:38:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Applies embeddings, Cleanlab noise filtering, and MLP classification to achieve robust performance on imbalanced MultiPride data for distinguishing hate speech from reclaimed language.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01074","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression","primary_cat":"cs.CL","submitted_at":"2026-05-31T07:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Combining dimensionality reduction and quantization compresses text embeddings to 0.1% size with minimal performance loss on MTEB tasks, outperforming either technique alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00510","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Skill or Skip? 
Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning","primary_cat":"cs.CL","submitted_at":"2026-05-30T04:00:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30729","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SemStruct: Contextualizing Semantic Embeddings with Structural Information for Schema Matching","primary_cat":"cs.LG","submitted_at":"2026-05-29T01:45:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SemStruct models tables as heterogeneous graphs with GNNs on frozen PLM embeddings to incorporate row co-occurrences for schema matching and reports SOTA results on Valentine and SOTAB-SM benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30027","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-28T14:50:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DocRetriever introduces a framework using layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable reasoning-augmented reranker for few-shot settings, plus the MultiDocR benchmark for 
evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29960","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction","primary_cat":"cs.CR","submitted_at":"2026-05-28T14:02:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29507","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Xetrieval: Mechanistically Explaining Dense Retrieval","primary_cat":"cs.AI","submitted_at":"2026-05-28T07:29:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Xetrieval enriches sentence embeddings with a single-pass reasoning internalizer and decomposes the result into sparse interpretable features whose overlaps explain individual dense-retrieval decisions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29192","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReasonOps: Operator Segmentation for LLM Reasoning Traces","primary_cat":"cs.AI","submitted_at":"2026-05-28T00:08:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Unsupervised clustering on sentence-initial 3-token pivots extracts 7 universal reasoning operators from 44k traces across 12 LLMs that enable 
model fingerprinting and answer-correctness prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26122","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DocArena: Turning Raw Documents into Controllable Training Environments for Document Search Agents","primary_cat":"cs.CV","submitted_at":"2026-05-27T21:21:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DocArena automates creation of multimodal document QA training data via MLLM-based structuring and cross-page reasoning pairs, yielding agents with top retrieval and QA performance in unified tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28268","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Cost-effective LLMs Routing with Batch Prompting","primary_cat":"cs.DB","submitted_at":"2026-05-27T10:14:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28190","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness","primary_cat":"cs.CL","submitted_at":"2026-05-27T09:11:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HTEB introduces dynamic, multi-axis evaluation of text embedding robustness 
using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28062","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor","primary_cat":"cs.CL","submitted_at":"2026-05-27T07:14:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ConvMemory delivers competitive recall at far lower latency than larger rerankers for long-term conversational memory while a multi-seed ablation refutes temporal-structure exploitation as the operative mechanism.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27810","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LRanker: LLM Ranker for Massive Candidates","primary_cat":"cs.IR","submitted_at":"2026-05-27T01:04:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LRanker combines K-means candidate aggregation with graph-partitioned ensemble of query embeddings to improve LLM ranking accuracy and scalability on massive candidate pools, reporting 3-30% gains on RBench tasks up to 6.8M candidates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23618","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG 
Systems","primary_cat":"cs.CL","submitted_at":"2026-05-22T13:25:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GE2 tops BEIR and Italian RAG benchmarks at nDCG@10 of 0.638 and 0.282 but with 231.6 ms latency; mE5-L is competitive on Italian at 31 ms while LaBSE underperforms all dedicated retrieval models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23556","ref_index":167,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Is Dimensionality a Barrier for Retrieval Models?","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:22:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22511","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-21T14:00:57+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22247","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal 
Expressions","primary_cat":"cs.CL","submitted_at":"2026-05-21T09:53:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22202","ref_index":129,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance","primary_cat":"cs.CL","submitted_at":"2026-05-21T09:05:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21987","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative Conversational Recommender System","primary_cat":"cs.IR","submitted_at":"2026-05-21T04:36:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A single autoregressive model for conversational recommendation that uses semantic item IDs, predicts response intent and target first, then generates the response, reporting up to 29% Recall@1 gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21807","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question 
Answering","primary_cat":"cs.CL","submitted_at":"2026-05-20T23:04:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OGCaReBench is a new retrieval-focused benchmark for evaluating LLMs on off-guideline clinical questions from real case reports, showing retrieval augmentation raises accuracy from 56% to 82%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20761","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Findings of the Counter Turing Test: AI-Generated Text Detection","primary_cat":"cs.CL","submitted_at":"2026-05-20T06:01:17+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19568","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder","primary_cat":"cs.CL","submitted_at":"2026-05-19T09:13:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19425","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-19T06:23:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic Gradient Gating monitors 
lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19366","ref_index":188,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Accurate, Efficient, and Explainable Deep Learning Approaches for Environmental Science Problems","primary_cat":"cs.LG","submitted_at":"2026-05-19T04:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The work introduces WaLeF/FIDLAr for flood forecasting, CoDiCast for probabilistic weather, and Hypercube-RAG for explainable environmental QA, claiming superior accuracy, efficiency, and interpretability over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18579","ref_index":35,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs","primary_cat":"cs.LG","submitted_at":"2026-05-18T15:56:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S2Aligner decouples semantic and structural components in LLM-as-Aligner pre-training for sparse TAGs and uses structure-oriented reconstruction plus domain risk balancing to improve transferability and reduce generalization gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18299","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented 
Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-18T12:18:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SD-Search derives step-level supervision for search queries in reasoning agents via on-policy hindsight self-distillation using the policy as both student and teacher.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15886","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches","primary_cat":"cs.CL","submitted_at":"2026-05-15T12:09:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14503","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks","primary_cat":"cs.SE","submitted_at":"2026-05-14T07:47:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14448","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Think When Needed: 
Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture","primary_cat":"cs.CV","submitted_at":"2026-05-14T06:41:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% added parameters and up to 50% fewer reasoning tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13534","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging","primary_cat":"cs.AI","submitted_at":"2026-05-13T13:46:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12714","ref_index":72,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:22:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12292","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"STRABLE: Benchmarking Tabular Machine Learning with Strings","primary_cat":"cs.LG","submitted_at":"2026-05-12T15:47:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2 Context: tabular learning research 2.1 The tabular-learning benchmarking landscape A need for string tabular learning benchmarksThe \"iron rule\" guiding machine-learning research is to compare pipelines on held-out data [24]. While model rankings remain surprisingly consistent across data splits [ 50, 48, 24], no algorithm is optimal across all problem classes [ 64]. Rankings are domain-dependent, and models whose inductive biases match the data distribution perform best [21]. Introducing strings into a table fundamentally changes the data distribution. Can high-cardinality categorical encodings [38, 4] suffice, or do we need models with different inductive biases that leverage string semantics [31]? 
Can we identify the conditions that make each approach"},{"citing_arxiv_id":"2605.11864","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Very Efficient Listwise Multimodal Reranking for Long Documents","primary_cat":"cs.IR","submitted_at":"2026-05-12T09:45:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11461","ref_index":33,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Breaking $\\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-12T03:20:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"preprint arXiv:2601.03267, 2025. [31] Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In International conference on machine learning, pages 9269-9278. PMLR, 2020. [32] Christian Walder and Deep Karkhanis. Pass@ k policy optimization: Solving harder reinforce- ment learning problems.arXiv preprint arXiv:2505.15201, 2025. [33] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022. 
11 [34] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al."},{"citing_arxiv_id":"2605.11374","ref_index":8,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Test-Time Compute for Frozen Embedding Models through Agentic Program Search","primary_cat":"cs.LG","submitted_at":"2026-05-12T00:56:34+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10616","ref_index":105,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image","primary_cat":"cs.LG","submitted_at":"2026-05-11T14:12:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10097","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"H-MAPS: Hierarchical Memory-Augmented Proactive Search Assistant for Scientific Literature","primary_cat":"cs.IR","submitted_at":"2026-05-11T07:13:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"H-MAPS uses a three-layered hierarchical memory to infer a reader's background and intent from implicit behaviors, generating profile-specific questions and on-device literature retrieval, as shown when NLP and HCI researchers receive different recommendations for the same 
paper.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09889","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Skill Description Deception Attack against Task Routing in Internet of Agents","primary_cat":"cs.MA","submitted_at":"2026-05-11T02:25:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09769","ref_index":21,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification","primary_cat":"cs.AI","submitted_at":"2026-05-10T21:30:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A deliberative council of Gemini agents using absence-based clinical rules achieves 0.382 F1 without fine-tuning and second place overall at 0.406 F1 on defense mechanism classification, with minority-class overrides adding 2.4pp.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09544","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-10T13:56:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer 
quality, process reliability, efficiency and cost, plus filtered challenging test sets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09287","ref_index":37,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-10T03:21:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"approximately 60,000 trajectories, along with step-level process annotations. The reward model is trained via full-parameter fine-tuning. More training details are provided in Appendix G. Search Agent Training.Following a multi-turn question answering setup, we conduct experiments with two model scales, Qwen-2.5-3B-Instruct and Qwen-2.5-7B-Instruct [46]. For retrieval, we adopt the E5 encoder [37] over the 2018 Wikipedia corpus, retrieving 3 documents at each interaction step. The models are trained on a combined dataset constructed from the NQ and HotpotQA training splits. 
We evaluate both in-domain and out-of-domain performance on seven QA benchmarks: NQ, TriviaQA, PopQA, 2WikiMultiHopQA, MuSiQue, HotpotQA, and Bamboogle [ 12, 14, 20, 7, 35, 47, 23]."},{"citing_arxiv_id":"2605.09038","ref_index":32,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks","primary_cat":"cs.AI","submitted_at":"2026-05-09T16:23:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"better skill-conditioned querying improves efficient fact lookup, while the multi-hop benchmarks test whether the learned skills help the model decompose bridge, comparison, and compositional search problems into more reliable multi-turn search trajectories. 4.2 Baselines Our baselines isolate three sources of improvement. Direct inference and chain-of-thought prompt- ing [32] measure language-only reasoning under matched Qwen2.5 backbones. RAG [15] measures the effect of adding retrieved evidence without explicit skill selection, query planning, or grounded stopping. Search-o1 [17], Search-R1 [11], and ZeroSearch [27] represent recent search-native agents for multi-turn retrieval and search-oriented post-training. This suite separates gains from reasoning,"},{"citing_arxiv_id":"2605.08299","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do not copy and paste! 
Rewriting strategies for code retrieval","primary_cat":"cs.SE","submitted_at":"2026-05-08T11:31:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07129","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation","primary_cat":"cs.IR","submitted_at":"2026-05-08T02:07:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Reinforced latent reasoning for llm-based recommendation. arXiv preprint arXiv:2505.19092, 2025. [34] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37: 62557-62583, 2024. [35] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022. [36] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 
Chain-of-thought prompting elicits reasoning in large language models."},{"citing_arxiv_id":"2605.06647","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval","primary_cat":"cs.IR","submitted_at":"2026-05-07T17:54:29+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06308","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization","primary_cat":"cs.AI","submitted_at":"2026-05-07T14:10:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06285","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG","primary_cat":"cs.CL","submitted_at":"2026-05-07T13:56:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"trained on trajectories generated by Search-R1 and AutoRefine are denoted as LatentRAG♢ 
and LatentRAG△, respectively. To reduce computational costs, we conduct main experiments using lightweight retrieval models with fewer than 1B parameters, which are among the top-performing models on the MTEB benchmark [80] and cover diverse model architectures, including Qwen3-Embedding-0.6B [33], e5-base-v2 [34], jina-embeddings-v5-text-nano [81], harrier-oss-v1-270m1, and F2LLM-v2-330M [82]. Unless otherwise specified, we use Qwen3-Embedding-0.6B as the default retriever. To evaluate the trade-off between performance and latency, we report the exact match (EM) score [19] and the average latency per question. Latency is measured on a single NVIDIA H100 GPU with 94 GB memory by default."}],"limit":50,"offset":0}