{"total":16,"items":[{"citing_arxiv_id":"2605.15514","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably","primary_cat":"cs.CL","submitted_at":"2026-05-15T01:16:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proves that RoPE attention loses locality bias and token distinction in long contexts, approaching random behavior independent of content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13831","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:52:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[34] Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4643-4663, 2024. [35] Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128K context.arXiv preprint arXiv:2402.10171, 2024. 
[36] An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyang Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin"},{"citing_arxiv_id":"2605.10544","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing","primary_cat":"cs.CL","submitted_at":"2026-05-11T13:23:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"point gains over standard CPT. Qwen Llama Model / CPT Eval.∆with 95% CI Model / CPT Eval.∆with 95% CI Qwen2.5-0.5B / 4K NoLiMa train +10.09 [+8.42, +11.66] Llama-3.2-1B / 4K NoLiMa train +5.80 [+4.38, +7.06] Qwen2.5-0.5B / 4K NoLiMa extrap +5.34 [+3.92, +6.61] Llama-3.2-1B / 4K NoLiMa extrap +3.12 [+1.70, +4.39] Qwen2.5-0.5B / 4K RULER train +10.69 [+9.12, +12.04] Llama-3.2-1B / 4K RULER train +1.74 [+0.88, +2.51] Qwen2.5-0.5B / 4K RULER extrap +5.55 [+4.21, +6.78] Llama-3.2-1B / 4K RULER extrap +0.83 [-0.03, +1.59] Qwen2.5-1.5B / 8K NoLiMa train +2.28 [+1.18, +3.25] Llama-3.2-3B / 8K NoLiMa train +10.57 [+8.84, +12.15] Qwen2.5-1.5B / 8K NoLiMa extrap +3.71 [+2.21, +5.08] Llama-3.2-3B / 8K NoLiMa extrap +6."},{"citing_arxiv_id":"2604.24608","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language 
Models","primary_cat":"cs.IR","submitted_at":"2026-04-27T15:36:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This work is licensed under a Creative Commons Attribution 4.0 International License. SIGIR '26, Melbourne, VIC, Australia © 2026 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-2599-9/2026/07 https://doi.org/10.1145/3805712.3809945 1 Introduction Re-ranking is a critical stage in modern retrieval systems for applica- tions such as search and recommendation [7, 16, 23, 26, 45]. Tradi- tional neural re-rankers, such as bi-encoders [12, 13, 25] and cross- encoders [20, 30, 31], can be effective, but they typically require substantial relevance labels and task-specific training. 
Recent studies show that large language models (LLMs) can perform re-ranking in a zero-shot setting by leveraging their strong ability [22, 32, 34, 37]."},{"citing_arxiv_id":"2604.14339","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation","primary_cat":"cs.CL","submitted_at":"2026-04-15T18:46:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RoPE-Perturbed Self-Distillation improves positional robustness during long-context fine-tuning of LLMs by training models to produce consistent outputs across RoPE-perturbed views of the input.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08290","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants","primary_cat":"cs.SE","submitted_at":"2026-04-09T14:27:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19777","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation","primary_cat":"cs.CL","submitted_at":"2026-03-28T14:12:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SDSR places human metadata at file primacy and combines it with prompt routing 
rules to reach 100% primary category accuracy on a 119-category benchmark, far above the 65% no-guidance baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.04759","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stacked from One: Multi-Scale Self-Injection for Context Window Extension","primary_cat":"cs.CL","submitted_at":"2026-03-05T03:16:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SharedLLM stacks two copies of a short-context LLM so the lower one compresses context into query-aware multi-grained tokens that are injected only at the lowest layers of the upper one, enabling generalization from 8K training to 128K+ inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.13684","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference","primary_cat":"cs.CL","submitted_at":"2026-01-20T07:35:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HeteroCache dynamically allocates KV cache space to attention heads based on their temporal stability and uses hierarchical asynchronous retrieval to achieve state-of-the-art long-context performance with up to 3x faster decoding at 224K context length.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.19874","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TurboQuant: Online Vector Quantization with Near-optimal Distortion 
Rate","primary_cat":"cs.LG","submitted_at":"2025-04-28T15:05:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a factor of approximately 2.7.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.13663","ref_index":135,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference","primary_cat":"cs.CL","submitted_at":"2024-12-18T09:39:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.04264","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLVU: Benchmarking Multi-task Long Video Understanding","primary_cat":"cs.CV","submitted_at":"2024-06-06T17:09:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", AC) tasks compared to previous open-source 
models. 3) Existing methods still struggle to handle most tasks in our benchmark. For instance, GPT-4o only achieves 42.9% in the needle question-answering (NQA) task. In contrast, analogous tasks in the text domain, such as NIHS (Needle-In-the-Haystack-Search) and Passkey Retrieval, are effectively handled by many existing long LLMs [ 14, 61]. Additionally, GPT-4o shows even less reliability in tasks like ego-reasoning (ER), action ordering (AO), and action count (AC), with most baseline methods performing even worse. These observations indicate that long-video understanding remains a significant challenge for today's MLLMs. In addition to the primary conclusions from the overall performances, we can also make the following interesting"},{"citing_arxiv_id":"2404.07143","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention","primary_cat":"cs.CL","submitted_at":"2024-04-10T16:18:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.06654","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RULER: What's the Real Context Size of Your Long-Context Language Models?","primary_cat":"cs.CL","submitted_at":"2024-04-09T23:41:27+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K 
tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.20208","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unlock the Potential of Large Language Models for Predictive Tabular Tasks in Data Science with Table-Specific Pretraining","primary_cat":"cs.LG","submitted_at":"2024-03-29T14:41:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Table-specific pretraining of Llama-2 yields significant gains on zero-shot, few-shot, and in-context tabular prediction tasks over prior benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.04652","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Yi: Open Foundation Models by 01.AI","primary_cat":"cs.CL","submitted_at":"2024-03-07T16:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"response dialog pairs, with each and every one of the entry constructed and polished over multiple iterations and from user feedback. We take this approach because in our preliminary experiments, we observe that compared to the open-source data of several hundred thousand entries, the results from a smaller, manually annotated dataset are superior. These observations align with those reported in Gemini Team et al. [23], Touvron et al. [77], Zhou et al. [94]. 
6 We use the following techniques to improve prompt distribution selection, response formatting, and chain-of-thought formatting: (1). for prompt distribution selection, drawing inspiration from WizardLM[83], we developed compound instructions and progressively evolved them to increase their complexity. This approach has significantly reduced the size of SFT data in our experiments; (2)."}],"limit":50,"offset":0}