{"total":14,"items":[{"citing_arxiv_id":"2606.19984","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kolmogorov-Arnold Reservoir Computing","primary_cat":"cs.LG","submitted_at":"2026-06-18T09:24:37+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03825","ref_index":171,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamic Short Convolutions Improve Transformers","primary_cat":"cs.LG","submitted_at":"2026-06-02T16:07:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic short convolutions applied to key/query/value projections and linear layers in Transformers yield consistent performance gains and 1.33-1.60x compute advantages over standard models on language modeling from 150M to 2B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09862","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Blurry Window Attention","primary_cat":"cs.LG","submitted_at":"2026-05-31T17:43:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Blurry Window Attention stores a frequency window and reconstructs blurry KV history via Dirichlet kernel interpolation, achieving 8x better state efficiency than sliding window attention on the MQAR synthetic task.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31163","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Memory by Design: Probabilistic Sequence Layers","primary_cat":"stat.ML","submitted_at":"2026-05-29T11:13:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The design-model framework unifies sub-quadratic sequence models as Bayesian filters and introduces a covariance-tracking Bayesian Layer that improves retrieval robustness beyond training regimes on MQAR and RULER benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25949","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers","primary_cat":"cs.LG","submitted_at":"2026-05-25T15:27:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WaveLiT combines wavelet tokenization, linear attention, and multiscale pyramids to produce parameter-efficient neural PDE solvers that match much larger models on TheWell benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13473","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-13T12:59:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, and João Sacramento. MesaNet: Sequence modeling by locally optimal test-time training, 2025. URLhttps://arxiv.org/abs/2506.05233. [59] Ke Alexander Wang, Jiaxin Shi, and Emily B. Fox. Test-time regression: A unifying framework for designing sequence models with associative memory, 2025. URL https://arxiv.org/ abs/2501.12352. [60] Kaiyue Wen, Xingyu Dang, and Kaifeng Lyu. RNNs are not transformers (yet): The key bottleneck on in-context retrieval, 2024. URLhttps://arxiv.org/abs/2402.18510. [61] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with Delta Rule. October 2024. URLhttps://openreview.net/forum?id=r8H7xhYPwz. [62] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim."},{"citing_arxiv_id":"2605.08301","ref_index":87,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Priming: Hybrid State Space Models From Pre-trained Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-08T11:43:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06997","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators","primary_cat":"cs.LG","submitted_at":"2026-05-07T22:26:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2024. An empirical study of mamba-based language models.arXiv preprint arXiv:2406.07887(2024). [33] Peihao Wang, Ruisi Cai, Yuehao Wang, Jiajun Zhu, Pragya Srivastava, Zhangyang Wang, and Pan Li. 2024. Understanding and mitigating bottlenecks of state space models through the lens of recency and over-smoothing.arXiv preprint arXiv:2501.00658(2024). [34] Matthew O Williams, Ioannis G Kevrekidis, and Clarence W Rowley. 2015. A data-driven approximation of the koopman operator: Extending dynamic mode decomposition.Journal of Nonlinear Science25, 6 (2015), 1307-1346. [35] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. 2024. Par- allelizing Linear Transformers with the Delta Rule over Sequence Length."},{"citing_arxiv_id":"2605.05838","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MDN: Parallelizing Stepwise Momentum for Delta Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-07T08:12:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The hidden states are then updated as M[t+1] =¯µC [t] ·M [t] −(Diag ¯µC [t] ¯µ[t] ! ·K [t])⊤eV[t] (41) S[t+1] =¯αC [t] ·S [t]−bC [t] ·M [t] + (Diag \u0010 ΓC [t] \u0011 ·K [t])⊤eV[t],(42) where ΓC [t] is the C-th row (last row) vector of the t-th chunk of causal mask Γ[t] ∈R C×C . And the correlation value eV[t], eV[t] =U [t] −Y [t]S[t] +Z [t]M[t] ∈R C×d v ,(43) whereU [t] ∈R C×d v andY [t],Z [t] ∈R C×d k are computed byT [t] ∈R C×C , the detailed computation are blow, U[t] =T[t] ·V [t],(44) Y[t] =T[t] · \u0010 Diag(¯α0→C−1 [t] )·P [t] \u0011 (45) Z[t] =T[t] · \u0010 Diag(b0→C−1 [t] )·P [t] \u0011 (46) whereT [t] =Tril \u0010 I[t] + \u0010 P[t]K⊤ [t] ⊙Γ − [t] \u0011\u0011−1 (47) Beyond the scope of MDN, the proposed geometric decoupling strategy offers a principled perspective for alleviating"},{"citing_arxiv_id":"2604.21100","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences","primary_cat":"cs.LG","submitted_at":"2026-04-22T21:38:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21016","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression","primary_cat":"cs.LG","submitted_at":"2025-11-26T03:26:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.27258","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Higher-order Linear Attention","primary_cat":"cs.LG","submitted_at":"2025-10-31T07:54:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Higher-order Linear Attention realizes second-order and higher interactions in linear-time causal attention via constant-size state and associative scans.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.19349","ref_index":192,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution","primary_cat":"cs.CL","submitted_at":"2025-09-17T17:49:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.13585","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention","primary_cat":"cs.CL","submitted_at":"2025-06-16T15:08:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}