{"total":56,"items":[{"citing_arxiv_id":"2605.22668","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:09:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22884","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tensor Cache: Eviction-conditioned Associative Memory for Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-21T00:21:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14589","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EndPrompt: Efficient Long-Context Extension via Terminal Anchoring","primary_cat":"cs.CL","submitted_at":"2026-05-14T09:00:03+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13831","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Training Long-Context Vision-Language Models Effectively with Generalization 
Beyond 128K Context","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:52:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"05530, 2024. [22] Google. Introducing gemini 2.0: our new ai model for the agentic era, 2024. URL https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#ceo-message. [23] Google. Gemini 2.5: Our most intelligent ai model. URL https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025. [24] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023. [25] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. 
InThe Twelfth International Conference on Learning Representations."},{"citing_arxiv_id":"2605.12922","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction","primary_cat":"cs.AI","submitted_at":"2026-05-13T02:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12904","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VIP-COP: Context Optimization for Tabular Foundation Models","primary_cat":"cs.LG","submitted_at":"2026-05-13T02:28:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimensional data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12471","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:53:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward 
passes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"They are orthogonal to KV-Fold and could compose with it. Linear-attention alternatives. Linear Attention [32] and Performers [33] compress past context into a fixed-size state, sacrificing content-based addressability for asymptotic efficiency - the opposite trade-off from KV-Fold. Position-extrapolation techniques. ALiBi [18], Positional Interpolation [19], YaRN [20], and LongRoPE [21] extend the trained position range. These are complementary to our setting: KV-Fold operates within the trained range in our experiments (Llama-3.1-8B's native 128K window), and composing with these methods is the natural route to pushing the operational ceiling further. 8 Discussion Taken together, the results suggest a simple picture: KV-Fold induces an attention regime that is"},{"citing_arxiv_id":"2605.12227","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-12T15:04:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10544","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Where Does Long-Context Supervision Actually Go? 
Effective-Context Exposure Balancing","primary_cat":"cs.CL","submitted_at":"2026-05-11T13:23:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Table 1 metric. Table 5: Paired bootstrap 95% confidence intervals for the main-result gains in Table 1. Values are point gains over standard CPT. Qwen Llama Model / CPT Eval.∆with 95% CI Model / CPT Eval.∆with 95% CI Qwen2.5-0.5B / 4K NoLiMa train +10.09 [+8.42, +11.66] Llama-3.2-1B / 4K NoLiMa train +5.80 [+4.38, +7.06] Qwen2.5-0.5B / 4K NoLiMa extrap +5.34 [+3.92, +6.61] Llama-3.2-1B / 4K NoLiMa extrap +3.12 [+1.70, +4.39] Qwen2.5-0.5B / 4K RULER train +10.69 [+9.12, +12.04] Llama-3.2-1B / 4K RULER train +1.74 [+0.88, +2.51] Qwen2.5-0.5B / 4K RULER extrap +5.55 [+4.21, +6.78] Llama-3.2-1B / 4K RULER extrap +0.83 [-0.03, +1.59] Qwen2.5-1.5B / 8K NoLiMa train +2.28 [+1.18, +3.25] Llama-3.2-3B / 8K NoLiMa train +10.57 [+8."},{"citing_arxiv_id":"2605.10414","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Remember to Forget: Gated Adaptive Positional Encoding","primary_cat":"cs.LG","submitted_at":"2026-05-11T11:52:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RoPE decomposes queries and keys into two-dimensional chunks, rotating each at a different frequencyg k ∈G, ranging fromg 1 = 
1 radian per token (highest frequency) to g_{d/2} ≈ 1/θ radians per token (lowest), where θ is the base wavelength, defaulting to 10,000 [24]. Long-context extrapolation and the base-scaling deadlock. A natural response to RoPE's extrapolation failures is to scale θ. Position Interpolation [2], YaRN [19], and LongRoPE [8] remap rotary frequencies to reduce OOD phase angles at extended lengths. However, recent theoretical analyses reveal that this exposes an interpolation-extrapolation deadlock [13, 15, 29]: shrinking θ smooths extrapolation but harms long-range semantic discrimination, while inflating θ preserves local interpolation but devolves low-frequency channels into near-identity maps, ultimately colliding"},{"citing_arxiv_id":"2605.10268","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading","primary_cat":"cs.CL","submitted_at":"2026-05-11T09:30:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[26] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023. [27] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. URL https://arxiv.org/abs/2306.15595, 2023. [28] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023. 
[29] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint"},{"citing_arxiv_id":"2605.10045","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T06:14:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Detail Refinement Fine-detail support NoPE - Global-layout preservation T ′ j =T j L′/L - Smaller-sized object composition T ′ j ∈[T j , T j L′/L] - Fine-detail rendering T ′ j =T j Detail degradation A natural starting point is to borrow training-free extrapolation techniques developed for other paradigms. Positional-encoding remappings such as PI [5], NTK-aware scaling, and YaRN [26] are originally designed for LLMs, while training-free extrapolation methods such as RiFlex [ 51] and DyPE [18] target diffusion-based visual generation. 
However, applying them directly to VAR results in three failure modes: (i) global repetition, where the holistic layout recurs across the image; (ii) local repetition, where mid-sized structures, such as individual objects, appear at reduced sizes and"},{"citing_arxiv_id":"2605.07076","ref_index":46,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Self-Consolidating Language Models: Continual Knowledge Incorporation from Context","primary_cat":"cs.CL","submitted_at":"2026-05-08T00:50:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In both regimes the forgetting term is a drop relative to a per-context baseline evaluated on the same test material, differing only in whether that material is labeled queries or the raw context. Substituting Equations (14) and (16) into Equation (3) yields r_sparse(a; c_t, θ_t) = u_intrinsic(θ_t^(a); c_t) − λ f_intrinsic(θ_t^(a); D_{<t}), (17) applicable to rolling-window long-context consolidation [46, 7], memory-compaction in agentic interaction histories [9, 47], and any setting in which a context arrives unlabeled. We then evaluate our self-consolidating LMs framework under the above two reward instantiations: context that comes with immediate downstream supervisions (Section 4), and context that has no downstream supervisions (Section 5). 
4 Continual knowledge injection"},{"citing_arxiv_id":"2605.04217","ref_index":1,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks","primary_cat":"cs.LG","submitted_at":"2026-05-05T18:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Jordan-RoPE realizes a distance-modulated phase basis via non-semisimple Jordan blocks, generating features such as d e^{iωd} for relative positional encoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01858","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decouple and Cache: KV Cache Construction for Streaming Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-03T13:02:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00968","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adaptive 3D-RoPE: Physics-Aligned Rotary Positional Encoding for Wireless Foundation Models","primary_cat":"eess.SP","submitted_at":"2026-05-01T15:51:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Adaptive 3D-RoPE adapts rotary positional encoding to wireless channel physics via learnable 3D frequencies and dynamic CSI control, yielding up to 10.7 dB NMSE gains in scale extrapolation and 1 dB in zero-shot 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24608","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models","primary_cat":"cs.IR","submitted_at":"2026-04-27T15:36:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Information Retrieval: 38th European Conference on IR Research, ECIR 2016, Padua, Italy, March 20-23, 2016. Proceedings 38. Springer, 716-722. [2] Shijie Chen, Bernal Jimenez Gutierrez, and Yu Su. 2025. Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers. In The Thirteenth International Conference on Learning Representations. [3] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595 (2023). [4] Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S Weld. 2020. 
Specter: Document-level representation learning using citation-informed transformers."},{"citing_arxiv_id":"2604.18603","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings","primary_cat":"q-bio.QM","submitted_at":"2026-04-09T19:32:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"For nucleotide modeling, where functional and structural dependencies can span large genomic distances, DTA's context extension capability may complement existing long-range approaches. Future work should focus on MLM-specific variants of position dropping, potentially enabling robust long-context extension in bidirectional settings without full long-context pretraining. Methods Data sources Argmax position probe.Synthetic sequences were generated by sampling integers uniformly from[0,v)wherev= 64is the vocabulary size. Labels were the 0-indexed position of the first occurrence of the maximum value. Sequence length was fixed atl= 64. Batches of 1,024 sequences were generated on-the-fly during training; evaluation used 16 batches of 1,024 sequences each. Natural language.We used FineWeb-Edu (45), a large-scale filtered web corpus designed for language model pretraining. Text was tokenized using a custom Byte-Pair Encoding (BPE) tokenizer (51) with a vocabulary of 4,096 tokens, chosen to reduce vocabulary size relative to standard tokenizers while preserving reasonable subword granularity. 
Training sequences were truncated or padded to 256 tokens. Validation and test sets were constructed by filtering documents with at least 1,024 tokens, then splitting the remaining documents into 1,000 documents each for validation and testing. Training data was streamed and filtered to exclude validation and test documents. Halleeet al.| arXiv | April 22, 2026 | 5-12 Fig. 5.DroPE recovery analysis. (a) NLP extended-context validation loss, accuracy, MCC, and F1 before and after dropping positional embeddings at 70% of training. (b) Protein extended-context validation loss, accuracy, MCC, and F1. The vertical dashed line marks the drop point. Shaded regions represent±1 standard deviation across three seeds. (c) NLP final test loss, accuracy, MCC, and F1 comparing RoPE (kept throughout) vs. RoPE-off (dropped at 70%). (d) Protein final test loss, accuracy, MCC, and F1. Signi"},{"citing_arxiv_id":"2604.08224","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering","primary_cat":"cs.SE","submitted_at":"2026-04-09T13:19:41+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01178","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Screening Is Enough","primary_cat":"cs.LG","submitted_at":"2026-04-01T17:29:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters 
that maintain stable performance at long contexts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[17] Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022. [18] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. [19] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023. [20] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens."},{"citing_arxiv_id":"2603.26815","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval","primary_cat":"cs.CL","submitted_at":"2026-03-26T18:05:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HDRR combines document-level semantic routing with scoped chunk retrieval to outperform both pure chunk-based retrieval and semantic file routing on the FinDER benchmark, delivering higher average scores, lower failure rates, and more perfect answers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.21783","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing 
Synthesis","primary_cat":"cs.CV","submitted_at":"2026-03-23T10:25:45+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.04759","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stacked from One: Multi-Scale Self-Injection for Context Window Extension","primary_cat":"cs.CL","submitted_at":"2026-03-05T03:16:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SharedLLM stacks two copies of a short-context LLM so the lower one compresses context into query-aware multi-grained tokens that are injected only at the lowest layers of the upper one, enabling generalization from 8K training to 128K+ inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20981","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models","primary_cat":"cs.CV","submitted_at":"2026-02-24T15:01:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MMHNet enables video-to-audio models trained on short clips to generalize and generate audio for videos over 5 minutes long.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.13933","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HyMem: Hybrid Memory Architecture with 
Dynamic Retrieval Scheduling","primary_cat":"cs.AI","submitted_at":"2026-02-15T00:06:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HyMem introduces dual-granular memory storage with a lightweight summary module for fast responses and selective activation of a deep LLM module for complex queries, outperforming full-context baselines by 92.6% lower computational cost on LOCOMO and LongMemEval benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.07805","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Group Representational Position Encoding","primary_cat":"cs.LG","submitted_at":"2025-12-08T18:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.18830","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training","primary_cat":"cs.CL","submitted_at":"2025-10-21T17:25:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.02283","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Forcing++: Towards Minute-Scale High-Quality Video Generation","primary_cat":"cs.CV","submitted_at":"2025-10-02T17:55:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.21042","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LayerNorm Induces Recency Bias in Transformer Decoders","primary_cat":"cs.CL","submitted_at":"2025-09-25T11:48:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Stacked causal self-attention combined with LayerNorm induces recency bias in Transformer decoders, reversing the earlier-token bias seen in attention alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.12635","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Positional Encoding via Token-Aware Phase Attention","primary_cat":"cs.CL","submitted_at":"2025-09-16T03:53:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style 
methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.13334","ref_index":154,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Context Engineering for Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-07-17T17:50:36+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle with equally sophisticated long outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.02259","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent","primary_cat":"cs.CL","submitted_at":"2025-07-03T03:11:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.06708","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free","primary_cat":"cs.CL","submitted_at":"2025-05-10T17:15:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding 
non-linearity and query-dependent sparse modulation while reducing attention sinks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"∈ R^{n×n} represents the scaled dot-product similarity matrix, and softmax(·) ensures the attention weights are non-negative and sum to 1 across each row. Multi-Head Concatenation: In multi-head attention, the above process is repeated in parallel for h heads, with each head having its projection matrices W_q^i, W_k^i, W_v^i. All heads' outputs are concatenated: MultiHead(Q, K, V) = Concat(head_1, ..., head_h), (3) where head_i = Attention(QW_Q^i, KW_K^i, VW_V^i). Figure 2: Left: Proportion of attention allocated to the initial token per layer (test perplexity dataset). The baseline model suffers from a significant attention sink, with an average of 46.7% of attention scores across layers directed towards the first token. Introducing a gate effectively alleviates this, reducing the proportion to 4."},{"citing_arxiv_id":"2504.21318","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Phi-4-reasoning Technical Report","primary_cat":"cs.AI","submitted_at":"2025-04-30T05:05:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.09844","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model 
Training","primary_cat":"cs.DC","submitted_at":"2025-04-14T03:31:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.19786","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Gemma 3 Technical Report","primary_cat":"cs.CL","submitted_at":"2025-03-25T15:52:34+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.19325","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Long-Context Autoregressive Video Modeling with Next-Frame Prediction","primary_cat":"cs.CV","submitted_at":"2025-03-25T03:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.05564","ref_index":73,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TabICL: A Tabular Foundation Model for In-Context 
Learning on Large Data","primary_cat":"cs.LG","submitted_at":"2025-02-08T13:25:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"evaluated this RF extension with adaptive pruning enabled for both TabPFNv2 and TabICL on large datasets with more than 10K samples. Additionally, we set the ensemble size for both TabPFNv2 and TabICL to 4, considering that random forest already ensembles multiple decision trees. As shown in Figure G.1, the RF extension significantly improves the performance of both TabPFNv2 and TabICL. 10152025 [7.70] TabR [7.73] TabICL-RF [8.08] RealMLP [8.68] TabM [9.17] TabPFNv2-RF [9.42] ModernNCA [9.85] TabICL [10.72] CatBoost [11.81] MLP-PLR [12.74] FT-T [12.75] XGBoost [13.26] LightGBM [13.43] TabPFNv2 [14.30] DCNv2 [14.75] MLP [15.11] AutoInt [15.28] LoCalPFN SwitchT [28.72] LogReg [26.23] GrowNet [26.21] TabPFN [26.15] TabNet [25.17] KNN [24.92] TuneTables [24.70]"},{"citing_arxiv_id":"2502.01941","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression","primary_cat":"cs.CL","submitted_at":"2025-02-04T02:23:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency 
reduction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.06679","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning","primary_cat":"cs.CL","submitted_at":"2024-09-10T17:44:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"E2LLM uses encoder-based soft prompt compression for long contexts to improve LLM reasoning on tasks like summarization and QA while maintaining efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.16852","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Long Context Transfer from Language to Vision","primary_cat":"cs.CV","submitted_at":"2024-06-24T17:58:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"EgoSchema [51] 0179.8 VideoMME [21] 1017.0 V-NIAH (Ours) 000. ∞ from text to image is orthogonal to those works and can further enable LMMs to understand more frames. Context Extrapolation in Transformer Transformer does not directly work on sequences longer than its training length. To alleviate that, various RoPE-based [ 65] extension techniques [13, 7, 61, 57, 20] have been proposed to allow for training-free context extrapolation. Efforts have also been made on data curation [22, 79, 5] and system optimization [40, 45, 27] during long context training. 
There has been limited exploration of the context extrapolation in the domain of LMMs. [ 44] are closest to our work and train LMM with long context language models, but they do not benchmark"},{"citing_arxiv_id":"2406.12793","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools","primary_cat":"cs.CL","submitted_at":"2024-06-18T16:58:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V . Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. [5] S. Chen, S. Wong, L. Chen, and Y . Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023. [6] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. 
arXiv preprint arXiv:2204."},{"citing_arxiv_id":"2404.07143","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention","primary_cat":"cs.CL","submitted_at":"2024-04-10T16:18:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.14608","ref_index":176,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey","primary_cat":"cs.LG","submitted_at":"2024-03-21T17:55:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This approach can effectively reduce catastrophic forgetting during the acquisition of new tasks. 3) Context Window Extension: LLMs are typically trained with a pre-defined context size. For example, LLaMA and LLaMA2 have pre-defined context sizes of 2048 and 4096 tokens, respectively. The positional encoding RoPE has weak extrapolation properties [176], which means the performance drops obviously given an input length exceeds the pre-defined context length. To solve this, a naive solution is to fine- tune a pre-trained LLM to a longer context. 
However, this escalates computational costs quadratically with context size, straining memory and processing resources. To address this, LongLoRA [177] proposes to fine-tune a pre-trained LLM"},{"citing_arxiv_id":"2402.13753","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens","primary_cat":"cs.CL","submitted_at":"2024-02-21T12:30:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.01613","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Nomic Embed: Training a Reproducible Long Context Text Embedder","primary_cat":"cs.CL","submitted_at":"2024-02-02T18:23:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Nomic AI produced and open-sourced a reproducible 8192-context English text embedder that exceeds OpenAI Ada-002 and text-embedding-3-small performance on MTEB short-context and LoCo long-context benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.14196","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence","primary_cat":"cs.SE","submitted_at":"2024-01-25T14:17:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining 
achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.16886","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices","primary_cat":"cs.CV","submitted_at":"2023-12-28T08:21:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetson Orin GPU.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"performing future distillation without further pain. The con- text length used at the pre-train stage is 2k for all models Model Blocks Dim Heads Context length MobileLLaMA 1.4B 24 2048 16 2k MobileLLaMA 2.7B 32 2560 32 2k Table 2. Detailed settings of our language models. due to limited resources. However, the context window can be further scaled to 8k for inference, as indicated by [17]. The detailed settings of other components are listed below. • We apply RoPE [107] to inject positional information. • We apply pre-normalization to stabilize training. Specifically, we use RMSNorm [129] instead of layer norm and the MLP expansion ratio 8/3 instead of 4. • We also use SwiGLU activation function [104] instead of GELU as [115]. 
3.4."},{"citing_arxiv_id":"2311.16867","ref_index":259,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Falcon Series of Open Language Models","primary_cat":"cs.CL","submitted_at":"2023-11-28T15:12:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.10631","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Llemma: An Open Language Model For Mathematics","primary_cat":"cs.CL","submitted_at":"2023-10-16T17:54:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}