{"total":90,"items":[{"citing_arxiv_id":"2606.00535","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation","primary_cat":"cs.LG","submitted_at":"2026-05-30T05:05:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30571","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode","primary_cat":"cs.AR","submitted_at":"2026-05-28T21:03:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Batch-1 autoregressive decode is memory-dominated yet launch overhead caps gains from higher-bandwidth GPUs, shown by measurements and CUDA Graphs ablation across four NVIDIA GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29891","ref_index":62,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DVSM: Decoder-only View Synthesis Model Done Right","primary_cat":"cs.CV","submitted_at":"2026-05-28T13:16:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Decoder-only view synthesis model using KV-cache representation and weight sharing between reconstruction and rendering networks achieves new SOTA on novel view synthesis 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29233","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference","primary_cat":"cs.LG","submitted_at":"2026-05-28T01:48:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BlockBatch is a training-free framework that coordinates multiple block-size branches via token merging and synchronization to reduce denoising NFEs by 26.6% and achieve 1.33x speedup in dLLM inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23258","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Simple Plug-in for Improving Eviction-Based KV Cache Compression","primary_cat":"cs.LG","submitted_at":"2026-05-22T06:00:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VECTOR augments eviction-based KV cache compression with three-way token routing that combines importance scoring and offline regression-based reconstructability estimation to improve quality at high compression ratios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22106","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-21T07:40:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ArborKV uses search-structure awareness to evict low-reuse KV states in Tree-of-Thoughts inference, delivering up to 4x memory 
savings with near-full accuracy retention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22884","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tensor Cache: Eviction-conditioned Associative Memory for Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-21T00:21:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17653","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-17T21:10:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22850","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse","primary_cat":"cs.DC","submitted_at":"2026-05-16T16:48:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ObjectCache enables KV cache storage in object storage 
via layerwise retrieval and custom scheduling, adding 5.6% latency for 64K contexts over local DRAM on a 100 Gbps RoCE cluster.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15621","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs","primary_cat":"cs.CV","submitted_at":"2026-05-15T05:09:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LRCP prunes visual tokens in LVLMs by scoring projection residuals onto a PCA-estimated low-rank subspace, achieving 88.9% image token reduction with 94.7% performance retention and 87.5% video reduction with 97.8% accuracy retention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15250","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding","primary_cat":"cs.LG","submitted_at":"2026-05-14T15:50:01+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12922","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction","primary_cat":"cs.AI","submitted_at":"2026-05-13T02:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior 
survives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12464","ref_index":154,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Search Your Block Floating Point Scales!","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:50:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09778","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Nectar: Neural Estimation of Cached-Token Attention via Regression","primary_cat":"cs.LG","submitted_at":"2026-05-10T21:51:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(2026) amortize MIPS by training a neural network that predicts the support function q ↦ max_j ⟨q, k_j⟩ from queries sampled from a task distribution. Our score head targets the finite-temperature analogue of the same quantity, and we reuse the MLP design proposed there. C Additional Results C.1 Non-Uniform Capacity Allocation Across Layers Setting. We partition the 28 layers of Qwen3-1.7B into four contiguous groups, [0,9), [9,15), [15,27), and [27,28), and assign per-group multipliers, normalized so that the average ρ matches across settings. 
The weighted scheme uses multipliers (1,2,5,2) on these four groups (larger capacity in middle-to-late layers); the uniform scheme uses (1,1,1,1). We compare MLP-based Nectar modules with ρ≈2% and λ_KL=0.01, evaluated at convergence. Table 8 reports raw score and target MSE and the token-accuracy gap. Observations. Weighted allocation has lower target MSE (by ∼3-4×10^-2) and a smaller token-accuracy gap than uniform allocation on all three datasets in the table. Uniform allocation has lower score MSE, consistent with its larger share of capacity on early layers where scores are harder to fit. C.2 Score vs. Target Parameter Allocation Setting. The MLP architecture allocates separate parameter budgets to the score and target heads. At a total ρ=10%, we compare two splits on Qwen2-7B: (i) score 1% / target 9%, and (ii) score 2.5% / target 7.5%. Both use the same layer-group multipliers (1,2,8,12). We run each split under two training regimes (see §A.4): pure distillation (λ_α=λ_A=0, λ_KL=1), where neither head is directly supervised, and mixed training (λ_α=0.1, λ_A=1, λ_KL=2), where regression supervises each head in addition to the KL term. Table 9 reports the token-accuracy gap at convergence. Observations. The mixed-training regime is the more informative of the two, since it is the setting in which the split directly controls how much regression supervision each head receives. 
There, the 1/9 split attains an average token-accuracy gap of 0.34% against 0.47% for the"},{"citing_arxiv_id":"2605.08587","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kaczmarz Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-09T01:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[30] Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019. URL https://arxiv.org/abs/1911.02150. [31] Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Improving state-tracking in linear rnns via householder products, 2025. URL https://arxiv.org/abs/2502.10297. [32] Jimmy T.H. Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Ai8Hw3AXqks. [33] Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. 
SlimPajama: A 627B token cleaned"},{"citing_arxiv_id":"2605.07721","ref_index":16,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:25:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"laws for stable looped language models, 2026. URL https://arxiv.org/abs/2604.12946. [14] Abbas Zeitoun, Lucas Torroba-Hennigen, and Yoon Kim. Hyperloop transformers, 2026. URL https://arxiv.org/abs/2604.21254. [15] Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019. URL https://arxiv.org/abs/1911.02150. [16] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models. arXiv preprint arXiv:2305.13245, 2023. [17] William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan Kelly. 
Reducing transformer key-value cache size with cross-layer attention, 2024."},{"citing_arxiv_id":"2605.07588","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Revisiting Transformer Layer Parameterization Through Causal Energy Minimization","primary_cat":"cs.LG","submitted_at":"2026-05-08T11:02:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"base scaling, while the symmetric low-rank correction captures richer curvature information at negligible cost. We make no claim that P approximates the true Hessian; rather, we treat it as a learned proxy that can capture useful curvature information. For the interaction energy, we insert per-head matrices P_k, giving ∆x^ϵ_i(h_i | h_{1:i}) := −Σ_{k=1}^{K} P_k W_k^{Q⊤} (Σ_{j=1}^{i} α^k_{ij} v^k_j), (14) where α^k_{ij} = softmax_j({(1/τ)(k^k_{j′})^⊤ q^k_i}_{j′=1}^{i}). This denotes the update evaluated at x_i = h_i for the interaction energy ϵ. For the element-wise energy, the gated MLP update becomes (contrast with unpreconditioned one in Equation (10)): ∆x^ξ_i(h_i | h_{1:i}) := −P_mlp V^⊤((W h_i) ∘ ϕ′(V h_i)), (15) with P_mlp denoting its preconditioner. In both cases, the preconditioners could be trained to provide lightweight curvature information,"},{"citing_arxiv_id":"2605.06850","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to Compress KV Cache in RL Post-Training? 
Shadow Mask Distillation for Memory-Efficient Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-07T18:51:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06501","ref_index":70,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cubit: Token Mixer with Kernel Ridge Regression","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:18:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 4463-4473, 2019. [68] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks.Neural Computation, 4(1):131-139, 1992. [69] Noam Shazeer. Fast transformer decoding: One write-head is all you need.arXiv preprint arXiv:1911.02150, 2019. [70] Noam Shazeer, *Azalia Mirhoseini, *Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations, 2017. [71] Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin. 
Jetmoe: Reaching llama2 performance with 0."},{"citing_arxiv_id":"2605.05602","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Nearly Optimal Attention Coresets","primary_cat":"cs.DS","submitted_at":"2026-05-07T02:37:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05365","ref_index":148,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ZAYA1-8B Technical Report","primary_cat":"cs.AI","submitted_at":"2026-05-06T18:44:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05066","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Impossibility Triangle of Long-Context Modeling","primary_cat":"cs.CL","submitted_at":"2026-05-06T16:01:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"bation ins 0 that produces a distinguishable difference ins T . 
A perturbation of magnitude ϵ in s_0 grows to at most L^T·ϵ in s_T. For this to exceed the representation threshold 2^{-b}, we need ϵ ≥ 2^{-b}/L^T = 2^{-(b + T log_2 L)}. Therefore, the effective precision per component is at most b + T log_2 L bits. The total information capacity of the state s_T is bounded by |s_T|_eff bits ≤ d·(b + T log_2 L). (30) Substituting this effective capacity into the argument of Theorem 10 (replacing q(d) with d·(b + T log_2 L) in inequality (14)), we obtain n* ≤ (d·b + d·T·log_2 L) / ((1−ε) log_2 V − 1), (31) which is (15). Remark 27. The bound (15) reveals three dynamical regimes: (i) Contractive (L < 1): log_2 L < 0, so the effective capacity decreases with T. Information about early inputs is exponentially forgotten."},{"citing_arxiv_id":"2605.02568","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k","primary_cat":"cs.LG","submitted_at":"2026-05-04T13:19:29+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02262","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization","primary_cat":"cs.CV","submitted_at":"2026-05-04T06:17:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in 
VLMs.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"MA (Moving Attribute), MC (Moving Count), MD (Moving Direction), OE (Object Existence), OI (Object Interaction), OS (Object Shuffle), ST (Scene Transition), SC (State Change), UA (Unexpected Action). It is worth noting that the videos in MVBench are generally short, and the evaluation on the MVBench dataset also reflects the capability of WindowQuant on short-video tasks. We also use a comprehensive benchmark, MileBench [36], to compare the performance of WindowQuant with other quantization methods. We select 12 tasks from MileBench, including Object Existence (OE), Object Interaction (OI), Moving Attribute (MA), Egocentric Navigation (EN), State Change (SC), Scene Transition (ST), Space Understanding(SU), Webpage QA (WQA), Textbook QA (TQA), multimodal QA (MQA), Slide VQA (SQA), and Document QA (DQA)."},{"citing_arxiv_id":"2605.00789","ref_index":85,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Make Your LVLM KV Cache More Lightweight","primary_cat":"cs.CV","submitted_at":"2026-05-01T17:11:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27476","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EdgeFM: Efficient Edge Inference for Vision-Language 
Models","primary_cat":"cs.CV","submitted_at":"2026-04-30T06:18:50+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23466","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs","primary_cat":"cs.LG","submitted_at":"2026-04-25T23:13:47+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23214","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DARC-CLIP: Dynamic Adaptive Refinement with Cross-Attention for Meme Understanding","primary_cat":"cs.CL","submitted_at":"2026-04-25T08:42:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DARC-CLIP improves CLIP-based meme classification with hierarchical adaptive refinement, delivering +4.18 AUROC and +6.84 F1 gains in hate detection on PrideMM and CrisisHateMM benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21335","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression","primary_cat":"cs.LG","submitted_at":"2026-04-23T06:47:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level 
selection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26968","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference","primary_cat":"cs.AR","submitted_at":"2026-04-19T21:34:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16983","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Graph-Guided Adaptive Channel Elimination for KV Cache Compression","primary_cat":"eess.SP","submitted_at":"2026-04-18T12:55:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16957","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon","primary_cat":"cs.LG","submitted_at":"2026-04-18T10:39:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching 
FP16 token predictions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16864","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HieraSparse: Hierarchical Semi-Structured Sparse KV Attention","primary_cat":"cs.DC","submitted_at":"2026-04-18T06:28:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and ","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15464","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU","primary_cat":"cs.PF","submitted_at":"2026-04-16T18:30:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RPA kernel for TPUs achieves 86% MBU in decode and 73% MFU in prefill on Llama 3 8B via tiling for ragged memory, fused pipelines, and specialized compilation for prefill/decode workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15409","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference","primary_cat":"cs.LG","submitted_at":"2026-04-16T15:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FP16 KV caching in transformers causes deterministic 
token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13858","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Hormone-inspired Emotion Layer for Transformer language models (HELT)","primary_cat":"cs.NE","submitted_at":"2026-04-13T11:53:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07609","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC","primary_cat":"cs.DC","submitted_at":"2026-04-08T21:27:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Blink enables CPU-free LLM inference via SmartNIC offload and persistent GPU kernel, delivering up to 8.47x lower P99 TTFT, 3.4x lower P99 TPOT, 2.1x higher decode throughput, and 48.6% lower energy per token while remaining stable under CPU interference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06955","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TRAPTI: Time-Resolved Analysis for SRAM Banking and Power Gating Optimization in Embedded Transformer 
Inference","primary_cat":"cs.AR","submitted_at":"2026-04-08T11:15:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TRAPTI delivers cycle-accurate memory occupancy traces to guide SRAM banking and power-gating choices, showing a 2.72x lower peak memory footprint for a GQA model versus MHA under identical accelerator settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06370","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache","primary_cat":"cs.DC","submitted_at":"2026-04-07T18:52:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[50] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems 36 (2023), 68539-68551. [51] Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150 (2019). [52] Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. 2023. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv preprint arXiv:2311.03285 (2023). 
[53] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu."},{"citing_arxiv_id":"2604.05688","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion","primary_cat":"cs.CL","submitted_at":"2026-04-07T10:40:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"influence of attention sink, one remedy is to introduce a dedicated learnable sink token during training, so that excess attention mass is redirected to a parameterized placeholder rather than to ordinary context tokens. Another remedy is gated attention, which applies a query-dependent gate to the SDPA output, e.g., $\\tilde{o}_{t,i} = g_{t,i} \\odot o_{t,i}, \\quad g_{t,i} = \\sigma(W_g h_t)$ (16), thereby suppressing query-irrelevant attention outputs. Recent evidence shows that such post-SDPA gating can mitigate attention sink and improve long-context extrapolation [Qiu et al., 2025]. In the experimental setup of this work, we ultimately select GateSWA and MLA as the target architectures. 
On the one hand, these attentions have relatively stable inference implementations in the community."},{"citing_arxiv_id":"2604.03446","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fast Cross-Operator Optimization of Attention Dataflow","primary_cat":"cs.AR","submitted_at":"2026-04-03T20:37:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sequence length during prefill and training stages [21], [38], [67], [82]. To address these challenges, numerous techniques have been proposed to improve the efficiency of attention computation on diverse hardware platforms, including CPUs, GPUs, and custom accelerators [4], [20], [65], [66], [76]. [Fig. 1: Comparison of various cross-operator dataflow mappers.] Among these platforms, accelerators offer high energy efficiency and reduced latency due to their specialized hardware architectures and flexible dataflow mappings. Dataflow mapping on accelerators dictates how computation and memory resources are utilized spatially and temporally. 
It not only plays a critical role in determining the efficiency of"},{"citing_arxiv_id":"2604.02979","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation","primary_cat":"cs.CV","submitted_at":"2026-04-03T11:34:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684-10695. [39] Tim Salimans and Jonathan Ho. 2022. Progressive Distillation for Fast Sampling of Diffusion Models. In International Conference on Learning Representations. [40] Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150 (2019). [41] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2023. Make-A-Video: Text-to-Video Generation without Text-Video Data. In The Eleventh International Conference on Learning Representations. [42] Junhyuk So, Jungwon Lee, and Eunhyeok Park. 2024. 
Frdiff: Feature reuse for universal training-free acceleration of diffusion models."},{"citing_arxiv_id":"2604.08584","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference","primary_cat":"cs.LG","submitted_at":"2026-03-30T01:42:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CSAttention precomputes fixed-size query-centric lookup tables in offline prefill to enable fast table-lookup decoding, delivering near-identical accuracy to full attention and up to 4.6x speedup at 95% sparsity for 32K-128K contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.22910","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction","primary_cat":"cs.CL","submitted_at":"2026-03-24T07:58:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand full-cache inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.20991","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Structural Sensitivity in Compressed Transformers: Relative Error Propagation and Layer Removal","primary_cat":"cs.LG","submitted_at":"2026-03-22T00:24:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"partial","one_line_summary":"Per-layer error amplification factor rho predicts representation drift in compressed 
transformers and guides superior pruning and layer-removal decisions compared to prior heuristics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.18196","ref_index":26,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference","primary_cat":"cs.LG","submitted_at":"2026-02-20T13:09:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.05695","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference","primary_cat":"cs.AI","submitted_at":"2026-02-05T14:21:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across multiple LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.14053","ref_index":138,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems","primary_cat":"cs.LG","submitted_at":"2026-01-20T15:06:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using 
agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Agentic capabilities unlock transformative applications across industries: Software Development Agents: • Code Generation: Given requirements, generate complete applications (backend + frontend + tests + documentation). Example: Devin AI agent claims 13.8% success on SWE-bench (real GitHub issues) [77]. • Debugging: Analyze error logs, reproduce bugs, test hypotheses, propose fixes. Reflexion [138] improves programming pass@1 by 30-50% through self-critique. • Code Review: Check style, detect bugs, suggest optimizations. MetaGPT [66] simulates QA agent reviewing engineer's code. Research and Analysis Agents: • Literature Review: Search academic databases, extract key findings, synthesize insights, generate comprehensive reports."},{"citing_arxiv_id":"2512.24880","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"mHC: Manifold-Constrained Hyper-Connections","primary_cat":"cs.CL","submitted_at":"2025-12-31T14:16:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12131","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models","primary_cat":"cs.LG","submitted_at":"2025-12-13T01:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the 
bottleneck structure plus supporting optimizations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}