{"total":61,"items":[{"citing_arxiv_id":"2606.27229","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention","primary_cat":"cs.CL","submitted_at":"2026-06-25T16:16:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29453","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs","primary_cat":"cs.LG","submitted_at":"2026-05-28T06:47:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DSRD unifies temporal and structural adaptation for dynamic graphs via a single recurrent retentive state with learnable time-sensitivity parameters in the decay kernels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26797","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior","primary_cat":"cs.LG","submitted_at":"2026-05-26T10:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Latent Recurrent Transformer augments autoregressive transformers with a cross-layer recurrent latent pathway from prior hidden states and uses interleaved parallel training to improve loss and in-context learning at ~0.3% 
extra parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26558","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding","primary_cat":"cs.AR","submitted_at":"2026-05-26T05:12:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Cassandra is a self-speculative decoding system that builds a draft model via fine-grained data selection and optimized pruning/mantissa truncation, achieving up to 2.41x speedup over BF16 and 1.81x more tokens than Eagle-3 on Llama 3 8B without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23282","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Discontinuous Galerkin Neural Operator for Pathology Defocus Deblurring","primary_cat":"eess.IV","submitted_at":"2026-05-22T06:50:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DGNO parameterizes integral kernels with discontinuous Galerkin elements for heterogeneous defocus deblurring in pathology images and reports superior performance over prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21333","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence","primary_cat":"cs.CL","submitted_at":"2026-05-20T16:00:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 194M-parameter spiking dual-path model trained on 3B 
Chinese-English tokens achieves held-out PPL 8.88-8.93 at >89% per-element sparsity, trailing GPT-2 201M by 7.7% while showing that LIF temporal integration outperforms simple top-k masking at matched sparsity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17108","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Parallel Recursive LSTM","primary_cat":"cs.LG","submitted_at":"2026-05-16T18:28:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PR-LSTM replaces linear recurrence with recursive gated merging over a balanced binary tree to achieve log-depth parallelism without restricting transitions to linear or associative forms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13807","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo","primary_cat":"cond-mat.str-el","submitted_at":"2026-05-13T17:36:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Klambauer, J. Brandstetter, and S. Hochreiter, xLSTM: Extended long short-term mem- ory (2024), arXiv:2405.04517 [cs.LG]. [33] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, Transformers are rnns: fast autoregressive transform- ers with linear attention, inProceedings of the 37th In- ternational Conference on Machine Learning, ICML'20 (JMLR.org, 2020). [34] B. Peng, E. Alcaide, Q. Anthony, A. 
Albalak, S. Ar- cadinho, S. Biderman, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV, X. He, H. Hou, J. Lin, P. Kazienko, J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I. Mantri, F. Mom, A. Saito, G. Song, X. Tang, B. Wang, J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang, Q. Zhao, P. Zhou, Q. Zhou, J. Zhu, and R.-J. Zhu, RWKV:"},{"citing_arxiv_id":"2605.13370","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory","primary_cat":"cs.LG","submitted_at":"2026-05-13T11:28:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11007","ref_index":116,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RT-Transformer: The Transformer Block as a Spherical State Estimator","primary_cat":"cs.LG","submitted_at":"2026-05-10T08:14:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08696","ref_index":16,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Structured Recurrent Mixers for Massively Parallelized Sequence Generation","primary_cat":"cs.CL","submitted_at":"2026-05-09T05:07:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Structured 
Recurrent Mixers provide a dual parallel-recurrent representation for sequence models, claiming superior training efficiency, information capacity, and inference throughput over linear complexity alternatives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08587","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Kaczmarz Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-09T01:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"anonymized implementation is included in the supplementary material and avail- able athttps://github.com/anonymous-kla-review/kla-anonymous. 1 Introduction Softmax attention gives Transformers strong associative recall, but its token-token attention matrix scales quadratically with sequence length, increasing long-context prefill latency, activation memory, and train-short/test-long evaluation cost [24, 29]. IO-aware kernels and sequence-parallel implemen- tations reduce constant factors, yet they do not remove this asymptotic cost [38, 2, 8, 6, 4, 22]. Linear-time sequence models seek to alleviate this bottleneck by replacing token-token attention with a fixed-size recurrent state. 
As all past-token information is compressed into bounded memory,"},{"citing_arxiv_id":"2605.05838","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MDN: Parallelizing Stepwise Momentum for Delta Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-07T08:12:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[Figure 2 plot residue: panel (c) 2nd Order System w Cons. (MDN); region label Stable Zone] Figure 2. Spectral root trajectories of At by sweeping coefficients. (a) Roots lie on the real axis λ=α(1−β), where β∈(0,1) yields positive value eigenvalues, while β∈(1,2) produces sign-flipping modes in negative value eigenvalues. 
(b) The α, µ, β∈(0,1) and η∈(0,2) yields a two-dimensional spectral region that may enter the left half-plane. (c) With the example constraint β <1−α and µ∈[e −1,1), all roots strictly confined to the right half-plane. and SPLR3 structures At =α tI−β tktk⊤ t similarly con- strain the eigenvalues to interval (−1,1) . Despite these improvements, these systems remain limited to the real do- main. This restriction prevents the system from capturing oscillatory dependencies. Second Order Dynamics and Expressivity.The step- wise momentum rule breaks this real-valued limitation by inducing a second-order system that admits complex con- jugate eigenvalues. Sweeping the coefficients produces eigenvalues of the transition matrix At as shown in Fig- ure 2(b) (see § F for a detailed derivation of At). First- order systems are rest"},{"citing_arxiv_id":"2605.00604","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-05-01T12:18:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"learn when to retain and forget information via gated memory, implicitly implementing prediction of future relevance. Our predictor is more explicit - it directly predicts the next embedding - and is applied speciﬁcally to the routing decision rather than the representation. 7.5 Free Energy Principle in Machine Learning The FEP has inspired several ML architectures. 
Active inference agents [26], [27], [28] apply the FEP to reinforcement learning, replacing reward maximization with free energy minimization. These systems operate on episodic timescales and focus on action selection in environment interac- tion, a diﬀerent regime from token-by-token routing in language models. Predictive coding networks [29], [30] implement hierarchical prediction error minimization as an"},{"citing_arxiv_id":"2604.22442","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models","primary_cat":"cs.LG","submitted_at":"2026-04-24T10:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"drop just below the 90% threshold), neutral in success count atM=14 andM=20 (thoughM=20's mean routing drops from 91.8% to 78.2%), worse atM=24 (3/5→2/5), and partially helpful atM=32 (3/5→4/5). All code and experiment scripts will be made publicly available with the final version. 1 Introduction Hybrid sequence models - combining cheap recurrence (Mamba [2], RWKV [3]) with selective attention - have emerged as a leading paradigm for efficient long-context modeling [4, 5, 6]. The key design decision is which tokens receive expensive attention. 
This routing decision is typically implicit: fixed schedules interleave recurrent and attention layers at predetermined intervals (every 5th layer in Jamba, every 6th in Griffin)."},{"citing_arxiv_id":"2604.21215","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Recurrent Transformer: Greater Effective Depth and Efficient Decoding","primary_cat":"cs.LG","submitted_at":"2026-04-23T02:12:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20915","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Absorber LLM: Harnessing Causal Synchronization for Test-Time Training","primary_cat":"cs.LG","submitted_at":"2026-04-22T02:58:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19826","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation","primary_cat":"cs.SE","submitted_at":"2026-04-20T14:47:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Co-locating tests with implementation code 
yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RNN with no attention matrices), we compute effective attention from the Weighted Key-Value (WKV) recurrence. Models.7 open-source models spanning diverse architectures: Qwen2.5-Coder-7B and -3B [12], StarCoder2-3B [19], CodeGemma- 7B [8], Code-LLaMA-7B [28], Phi-3-mini-4k-instruct [1] (6 trans- formers [34]), and RWKV-6-Finch-1B6 [25] (a gated-linear RNN [24]). MI requires access to internal representations that proprietary mod- els do not expose. The 7 models were selected for architectural diversity (including a non-transformer paradigm), code compe- tence, and feasibility on consumer hardware (16GB video RAM (VRAM)). Corpus.10 Python doctest samples and 10 Rust test samples, us- ing a model-agnostic corpus format with character-level byte offsets"},{"citing_arxiv_id":"2604.16913","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus","primary_cat":"cs.AI","submitted_at":"2026-04-18T08:46:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05219","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sparse Prefix Caching for Hybrid and Recurrent LLM 
Serving","primary_cat":"cs.LG","submitted_at":"2026-04-17T09:24:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15557","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Predicting Where Steering Vectors Succeed","primary_cat":"cs.LG","submitted_at":"2026-04-16T22:18:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The Linear Accessibility Profile predicts steering vector effectiveness and optimal layers with Spearman correlations of 0.86-0.91 using unembedding projections on intermediate states across multiple models and concepts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12365","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adaptive Spiking Neurons for Vision and Language Modeling","primary_cat":"cs.NE","submitted_at":"2026-04-14T06:53:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11321","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Winner-Take-All Spiking Transformer for Language 
Modeling","primary_cat":"cs.NE","submitted_at":"2026-04-13T11:23:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08542","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:59:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"TTT3R [11] further casts memory update as test-time learning, but still relies on a fixed-size token set. We instead propose a scalable global context representation with larger memory capacity for long-range dependencies. Memory mechanisms.Modern recurrent neural networks (RNNs), particularly linear-attention [31, 58] variants such as Mamba [15, 23], RWKV [49], and DeltaNet [57, 90], pro- vide an efficient alternative to standard quadratic complexity attention for context modeling and have demonstrated im- pressive performance in natural language tasks. 
However, these models compress the entire history into a finite-size hidden state, which limits their ability to capture complex long-range dependencies, especially in tasks like large-scale"},{"citing_arxiv_id":"2604.06339","ref_index":107,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Evolution of Video Generative Foundations","primary_cat":"cs.CV","submitted_at":"2026-04-07T18:17:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"CMD [105] first decompose a video into a content frame and motion representations using an autoencoder with content frames as weighted sums of video frames, and then fine- tunes a pre-trained image diffusion model to fit the content frame distribution and use a lightweight model to generate motion based on the content frame. Linear attention-based methods.Mamba [106] and RWKV [107] employ linear attention to overcome the quadratic complexity of traditional models like Transform- ers. Mamba [106] leverages a Selective State Space Model (SSM) for linear time complexity, processing only a subset of the sequence at a time. 
RWKV [107] combines RNNs and Transformers, using linear attention to capture long-term dependencies while reducing computational costs."},{"citing_arxiv_id":"2604.05688","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion","primary_cat":"cs.CL","submitted_at":"2026-04-07T10:40:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Letht∈Rd denote the input representation at position t, nh the number of attention heads, and dh the per-head dimension. In standard multi-head attention (MHA) [Vaswani et al., 2017], we first compute qt =W Qht,(1) kt =W Kht,(2) vt =W Vht,(3) whereq t,k t,v t∈Rnhdh. We then split them inton h heads, i.e., [qt,1;q t,2;···;qt,nh] =q t,(4) [kt,1;k t,2;···;kt,nh] =k t,(5) [vt,1;v t,2;···;vt,nh] =v t,(6) withq t,i,k t,i,v t,i∈Rdh. 
For causal self-attention, the output of thei-th head at steptis αt,i,j= exp ( q⊤ t,ikj,i/√dh ) ∑t s=1 exp ( q⊤ t,iks,i/√dh ),1≤j≤t,(7) ot,i= t∑ j=1 αt,i,jvj,i,(8) 3 Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion and the final attention output is obtained by concatenating all heads and applying the output projection,"},{"citing_arxiv_id":"2604.05030","ref_index":104,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space","primary_cat":"cs.CL","submitted_at":"2026-04-06T18:00:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"finch: RWKV with matrix-valued states and dynamic recurrence, arXiv preprint arXiv:2404.05892 (2024). [101] G. Birkhoff and J. von Neumann, The logic of quantum mechanics, Annals of Mathematics37, 823 (1936). [102] C. Piron, Axiomatique quantique, Helvetica Physica Acta37, 439 (1964). [103] D. J. Foulis and C. H. Randall, Empirical logic and quantum mechanics, Synthese29, 81 (1974). [104] B. Coecke, D. Moore, and A. Wilce, Operational quantum logic: An overview, arXiv preprint quant- ph/0008019 (2001). [105] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, RoFormer: Enhanced transformer with rotary position embedding, Neurocomputing568, 127063 (2024). [106] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. 
K¨ uttler, M."},{"citing_arxiv_id":"2604.09671","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Belief-State RWKV for Reinforcement Learning under Partial Observability","primary_cat":"cs.LG","submitted_at":"2026-04-01T22:28:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Belief-state RWKV maintains an uncertainty-aware recurrent state for RL policies in partial observability and shows modest gains over standard recurrent baselines in a pilot with observation noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14191","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention to Mamba: A Recipe for Cross-Architecture Distillation","primary_cat":"cs.CL","submitted_at":"2026-04-01T09:23:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.29002","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference","primary_cat":"cs.DC","submitted_at":"2026-03-30T21:03:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.20997","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Does Content-Based Routing Work? 
Representation Requirements for Selective Attention in Hybrid Sequence Models","primary_cat":"cs.LG","submitted_at":"2026-03-22T01:04:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.14360","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling","primary_cat":"cs.LG","submitted_at":"2026-03-15T12:53:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.09138","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rotation Equivariant Mamba for Vision Tasks","primary_cat":"cs.CV","submitted_at":"2026-03-10T03:22:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-equivariant baselines across classification, segmentation, and 
super-resolution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.04385","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training","primary_cat":"cs.CV","submitted_at":"2026-03-04T18:49:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.18196","ref_index":25,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference","primary_cat":"cs.LG","submitted_at":"2026-02-20T13:09:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01651","ref_index":44,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On the Spatiotemporal Dynamics of Generalization in Neural Networks","primary_cat":"cs.LG","submitted_at":"2026-02-02T05:11:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Deriving a neural cellular automaton from locality, symmetry, and stability postulates produces 100% accurate addition generalization from 16-digit to 1-million-digit inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.03633","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MFC-RFNet: A 
Multi-scale Guided Rectified Flow Network for Radar Sequence Prediction","primary_cat":"cs.CV","submitted_at":"2026-01-07T06:24:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MFC-RFNet integrates multi-scale bidirectional communication, condition-guided alignment, and rectified flow to produce clearer and more skillful radar precipitation forecasts than prior baselines on four public datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.01322","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LinMU: Multimodal Understanding Made Linear","primary_cat":"cs.CV","submitted_at":"2026-01-04T01:17:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LinMU achieves linear-complexity multimodal understanding by swapping self-attention for an M-MATE dual-branch block and distilling from a frozen teacher VLM, matching accuracy with up to 2.7x faster TTFT and 9x higher throughput.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.26083","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism","primary_cat":"cs.LG","submitted_at":"2025-10-30T02:41:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Nirvana adds a task-aware memory trigger and updater to specialized generalist models, achieving strong general benchmark results, lowest perplexity in biomedicine/finance/law, and improved MRI reconstruction 
fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.09883","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning","primary_cat":"cs.CL","submitted_at":"2025-10-10T21:37:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and achieving 1.54x speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22630","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StateX: Enhancing RNN Recall via Post-training State Expansion","primary_cat":"cs.CL","submitted_at":"2025-09-26T17:55:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StateX post-trains RNNs to expand recurrent state size, improving recall and in-context learning with negligible parameter growth.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.04154","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Robust Filter Attention: Self-Attention as Precision-Weighted State 
Estimation","primary_cat":"cs.LG","submitted_at":"2025-09-04T12:29:14+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.09025","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lizard: An Efficient Linearization Framework for Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-07-11T21:19:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lizard linearizes Transformer LLMs via subquadratic attention and adaptive learnable modules, recovering near-original performance while outperforming prior linearization methods on MMLU and associative recall.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.02259","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent","primary_cat":"cs.CL","submitted_at":"2025-07-03T03:11:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.17298","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mercury: Ultra-Fast Language Models Based on 
Diffusion","primary_cat":"cs.CL","submitted_at":"2025-06-17T17:06:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mercury Coder diffusion LLMs achieve throughputs of 1109 and 737 tokens per second on H100 GPUs, up to 10x faster than frontier models with comparable quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.11349","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Representation Paradigms in AI-based 3D Radiological Image Reconstruction: A Systematic Review","primary_cat":"cs.CV","submitted_at":"2025-04-15T16:21:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A systematic review that categorizes AI-based 3D radiological image reconstruction algorithms into four representation paradigms, summarizes evaluation metrics and datasets, and outlines challenges and future directions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.18970","ref_index":74,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba","primary_cat":"cs.LG","submitted_at":"2025-03-22T01:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":0.0,"formal_verification":"none","one_line_summary":"A survey tracing the evolution of state-space models like S4 and Mamba, their efficiency trade-offs, and applications in NLP, vision, and other domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.08223","ref_index":67,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Will LLMs Scaling Hit the Wall? 
Breaking Barriers via Distributed Resources on Massive Edge Devices","primary_cat":"cs.DC","submitted_at":"2025-03-11T09:41:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"67 times cache size reduction while outperforming Llama-3.2-3B. The xLSTM architecture [64] modernizes LSTM with exponential gates and matrix memory cells, with models ranging from 125M to 1.3B parameters trained on 300 billion tokens from SlimPajama [65], consistently outperforming comparable RWKV-4 [66], Llama [10], and Mamba models across various tasks in the PALOMA benchmark [67]. These architectural innovations demonstrate the potential for efficient and powerful language models that can run effectively on edge devices. SLMs can be constructed through diverse methodological approaches. The construction of efficient SLMs relies on a comprehensive suite of techniques, each with specific performance trade-offs. 
For training SLMs from scratch, optimized MLM approaches [68] with increased masking"},{"citing_arxiv_id":"2502.13189","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MoBA: Mixture of Block Attention for Long-Context LLMs","primary_cat":"cs.LG","submitted_at":"2025-02-18T14:06:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.05465","ref_index":100,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)","primary_cat":"cs.CL","submitted_at":"2025-01-03T19:53:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"5 at several tasks [1]. At the time of this writing, 8 out of the 10 top performing models on the Open LLM leaderboard on Huggingface were Mixtral derivates (the other two were Llama derivatives) [2]. At the time of this writing, Eagle 7B, a model trained on the RWKV architecture outperformed all 7B models including Mistral 7B on cross-lingual benchmarks [100]. 2.1.3 Phi: The Phi series of models developed by Microsoft started with the Phi-1 focusing on code generation [42]. 
The dataset used to train the Phi-1 models, totaling about 7B tokens, is composed of: a filtered code-language dataset, primarily from The Stack and StackOverflow, refined using a language model-based classifier (approximately 6B tokens); a synthetic textbook dataset com-"}],"limit":50,"offset":0}