{"total":16,"items":[{"citing_arxiv_id":"2605.13262","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction","primary_cat":"cs.LG","submitted_at":"2026-05-13T09:43:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12770","ref_index":62,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WriteSAE: Sparse Autoencoders for Recurrent State","primary_cat":"cs.LG","submitted_at":"2026-05-12T21:32:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11007","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RT-Transformer: The Transformer Block as a Spherical State Estimator","primary_cat":"cs.LG","submitted_at":"2026-05-10T08:14:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"consensus on the hypersphere, followed by a local tangent-space filtering update and retraction back onto the sphere. For exact preservation of the latent state under transport through the value space, one would ideally have W oW v ≈I. The Transformer residual structure enforces this identity pathway explicitly, yielding the additive update z+ i =z i +rW o ¯ui.(17) Thus, the residual connection preserves the original representation while attention contributes only the directional filtering correction. In high-dimensional embeddings with approximately isotropic coordinates, ∥zi∥2 ∼d , so dimension- independent angular updates require r∝ √ d. Writing r=γ √ d, the scale γ corresponds naturally to the learned normalization gain used in RMSNorm-like architectures."},{"citing_arxiv_id":"2605.08696","ref_index":58,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Structured Recurrent Mixers for Massively Parallelized Sequence Generation","primary_cat":"cs.CL","submitted_at":"2026-05-09T05:07:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05806","ref_index":17,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Retrieval from Within: An Intrinsic Capability of Attention-Based Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T07:42:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.),Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769-6781, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020. emnlp-main.550. URLhttps://aclanthology.org/2020.emnlp-main.550/. [17] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention, 2020. URL https://arxiv. org/abs/2006.16236. [18] Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via con- textualized late interaction over bert. InProceedings of the 43rd International ACM SIGIR"},{"citing_arxiv_id":"2605.02568","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k","primary_cat":"cs.LG","submitted_at":"2026-05-04T13:19:29+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06683","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models","primary_cat":"cs.LG","submitted_at":"2026-04-24T20:37:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22442","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models","primary_cat":"cs.LG","submitted_at":"2026-04-24T10:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", blockwise local distillation, attention-derived initialization, gated layer-by-layer replacement) is a plausible route to closing this gap and is left to follow-up work.) 8 Related Work Efficient attention - sparsity and approximation.Sparse Transformers [10], Longformer [11], and BigBird [12] reduce attention's quadratic cost through fixed sparsity patterns. Performer [13] and linear attention [14] approximate the softmax kernel; Linformer [21] projects keys/values to a lower-rank subspace; Nystr¨ omformer [22] uses landmark points to approximate the full attention matrix. Reformer [23] uses LSH to select content-relevant tokens; Routing Transformer [24] uses online clustering for content-based attention. HubRouter shares the landmark-bottleneck intuition with Linformer/Nystr¨ omformer/Set Transformers, but"},{"citing_arxiv_id":"2604.06169","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"In-Place Test-Time Training","primary_cat":"cs.LG","submitted_at":"2026-04-07T17:59:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2504.05646, 2025. [33] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InProceedings of the 37th InternationalConference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2020. URLhttps://arxiv.org/abs/2006.16236. [34] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. InICLR, 2020. [35] Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In NeurIPS, 2020. 13 [36] Zeman Li, Ali Behrouz, Yuan Deng, Peilin Zhong, Praneeth Kacham, Mahdi Karami, Meisam Razaviyayn, and"},{"citing_arxiv_id":"2601.14053","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems","primary_cat":"cs.LG","submitted_at":"2026-01-20T15:06:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ter attention and feed-forward), but recent work shows single adapter after feed-forward suffices [112]. Train- ing time: 2-5×faster than full fine-tuning due to re- duced gradient computation.Composability:Multiple adapters can be trained for different tasks and swapped at inference without reloading base model-enabling task- specific customization in production systems [127]. Prefix Tuning[82] prepends learnable continuous prompts (\"virtual tokens\") to keys and values in each attention layer: Attention(Q,[P (l) K ;K],[P (l) V ;V])where P (l) K , P(l) V ∈R p×d are trainable prefix parameters for layerl, andp≈10-20prefix tokens. Unlike dis- crete prompt tuning which searches over token embed- dings, prefix tuning optimizes continuous parameters in"},{"citing_arxiv_id":"2512.20856","ref_index":112,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NVIDIA Nemotron 3: Efficient and Open Intelligence","primary_cat":"cs.CL","submitted_at":"2025-12-24T00:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24552","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Short window attention enables long-term memorization","primary_cat":"cs.LG","submitted_at":"2025-09-29T10:04:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22630","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StateX: Enhancing RNN Recall via Post-training State Expansion","primary_cat":"cs.CL","submitted_at":"2025-09-26T17:55:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StateX post-trains RNNs to expand recurrent state size, improving recall and in-context learning with negligible parameter growth.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.04154","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation","primary_cat":"cs.LG","submitted_at":"2025-09-04T12:29:14+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2010.04159","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deformable DETR: Deformable Transformers for End-to-End Object Detection","primary_cat":"cs.CV","submitted_at":"2020-10-08T17:59:21+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Deformable DETR achieves higher accuracy than DETR, especially on small objects, while converging in one-tenth the training epochs by using sparse deformable attention on image features.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2009.14794","ref_index":132,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Attention with Performers","primary_cat":"cs.LG","submitted_at":"2020-09-30T17:09:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}