{"total":20,"items":[{"citing_arxiv_id":"2605.17108","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Parallel Recursive LSTM","primary_cat":"cs.LG","submitted_at":"2026-05-16T18:28:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PR-LSTM replaces linear recurrence with recursive gated merging over a balanced binary tree to achieve log-depth parallelism without restricting transitions to linear or associative forms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13807","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo","primary_cat":"cond-mat.str-el","submitted_at":"2026-05-13T17:36:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"De, Resurrecting recurrent neu- ral networks for long sequences (2023), arXiv:2303.06349 [cs.LG]. [30] A. Gu and T. Dao, Mamba: Linear-time sequence mod- eling with selective state spaces (2024), arXiv:2312.00752 [cs.LG]. [31] L. Feng, F. Tung, M. O. Ahmed, Y. Bengio, and H. Hajimirsadeghi, Were RNNs all we needed? (2024), arXiv:2410.01201 [cs.LG]. [32] M. Beck, K. P¨ oppel, M. Spanring, A. Auer, O. Prud- nikova, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter, xLSTM: Extended long short-term mem- ory (2024), arXiv:2405.04517 [cs.LG]. [33] A. Katharopoulos, A. Vyas, N. Pappas, and F. 
Fleuret, Transformers are rnns: fast autoregressive transform- ers with linear attention, inProceedings of the 37th In-"},{"citing_arxiv_id":"2605.10643","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Single-Layer Model Can Do Language Modeling","primary_cat":"cs.CL","submitted_at":"2026-05-11T14:31:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08587","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kaczmarz Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-09T01:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023. URLhttps://arxiv.org/abs/2305.13245. [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2016. URLhttps://arxiv.org/abs/1409.0473. [3] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. 
xlstm: Extended long short-term memory, 2024. URLhttps://arxiv.org/abs/2405.04517. [4] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley."},{"citing_arxiv_id":"2605.07142","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AGA3DNet: Anatomy-Guided Gaussian Priors with Multi-view xLSTM for 3D Brain MRI Subtype Classification","primary_cat":"cs.CV","submitted_at":"2026-05-08T02:21:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AGA3DNet improves 3D brain MRI subtype classification by feeding anatomy-guided Gaussian priors derived from radiology reports into a 3D CNN and multi-view xLSTM architecture.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Through multitask prompted finetuning, it supports multiple clinical tasks within a single model, improving efficiency, generalization, and real-world deployability in healthcare systems. Recurrent models and xLSTMClassical recurrent net- works such as LSTMs [21] have historically struggled with very long contexts. Extended long short-term memory (xL- STM) [11] revisits this line of work, combining LSTM gat- ing mechanisms with efficient sequence-modeling princi- ples inspired by SSMs. xLSTM demonstrates strong per- formance on long-sequence benchmarks while maintaining lower computational complexity than transformers. 
Its ability to balance short- and long-range dependencies makes it particularly suitable for 3D neuroimaging, where volumet-"},{"citing_arxiv_id":"2604.19343","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scalable Memristive-Friendly Reservoir Computing for Time Series Classification","primary_cat":"cs.NE","submitted_at":"2026-04-21T11:26:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MARS parallel reservoirs achieve up to 21x training speedups and outperform LRU, S5, and Mamba on long sequence benchmarks while remaining gradient-free and compact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05030","ref_index":103,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space","primary_cat":"cs.CL","submitted_at":"2026-04-06T18:00:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"13048 (2023). [100] B. Peng, D. Goldstein, Q. Anthony, et al., Eagle and finch: RWKV with matrix-valued states and dynamic recurrence, arXiv preprint arXiv:2404.05892 (2024). [101] G. Birkhoff and J. von Neumann, The logic of quantum mechanics, Annals of Mathematics 37, 823 (1936). [102] C. Piron, Axiomatique quantique, Helvetica Physica Acta 37, 439 (1964). [103] D. J. Foulis and C. H. Randall, Empirical logic and quantum mechanics, Synthese 29, 81 (1974). [104] B. Coecke, D. Moore, and A. 
Wilce, Operational quantum logic: An overview, arXiv preprint quant-ph/0008019 (2001). [105] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, RoFormer: Enhanced transformer with rotary position embedding, Neurocomputing 568, 127063 (2024)."},{"citing_arxiv_id":"2604.02771","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ContractShield: Bridging Semantic-Structural Gaps via Hierarchical Cross-Modal Fusion for Multi-Label Vulnerability Detection in Obfuscated Smart Contracts","primary_cat":"cs.CR","submitted_at":"2026-04-03T06:29:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ContractShield achieves 89% Hamming score and 91% F1-score for five vulnerability types in obfuscated smart contracts via hierarchical cross-modal fusion of semantic, temporal, and structural features with only 1-3% performance drop.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"multimodal framework for multi-label vulnerability detection in Ethereum smart contracts. In particular, ContractShield is designed with strong resilience to adversarial obfuscation. It integrates three complementary views of smart contract behavior: (1) a semantic view using sliding-window-enhanced CodeBERT for contextual representation from source code (SC), (2) a temporal view using xLSTM [25] to capture opcode dynamics (OP), and (3) a structural view using GATv2 to model CFGs derived from bytecode. In summary, we make the following contributions: • We present ContractShield, a multimodal framework for multi-label vulnerability detection in Ethereum smart contracts. 
The framework jointly models three complemen- tary views of a contract, including source code, opcode se-"},{"citing_arxiv_id":"2512.21370","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Detection of Lensed Gravitational Waves in the Millihertz Band Using Frequency-Domain Lensing Feature Extraction Network","primary_cat":"astro-ph.IM","submitted_at":"2025-12-24T03:58:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DCL-xLSTM neural network detects lensed GW events with AUC over 0.99 using training on PM and SIS lens models in the millihertz band.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.17018","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoGate-LSTM: Prototype-Guided Feature-Space Gating for Mitigating Gradient Dilution in Imbalanced Toxic Comment Classification","primary_cat":"cs.CL","submitted_at":"2025-10-19T21:50:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoGate-LSTM adds prototype-guided cosine feature-space gating to a character-level BiLSTM with multi-source embeddings and focal loss, reaching 0.881 macro-F1 on Jigsaw toxic comments while using 7.3M parameters and outperforming fine-tuned BERT by 6.9 points on minority labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24552","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Short window attention enables long-term memorization","primary_cat":"cs.LG","submitted_at":"2025-09-29T10:04:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Short sliding windows in hybrid attention-xLSTM models boost 
long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.10013","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project","primary_cat":"cs.DC","submitted_at":"2025-04-14T09:17:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Engineering report detailing HPC infrastructure, software choices, and performance measurements for training a 7B LLM using 3D parallelism on JUWELS Booster.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.08223","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices","primary_cat":"cs.DC","submitted_at":"2025-03-11T09:41:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"attention and SSM heads within the same layer for parallel processing, with its 1.5B variant trained on DCLM-Baseline-1.0 and SmolLM-Corpus achieving 11.67 times cache size reduction while outperforming Llama-3.2-3B. 
The xLSTM architecture [ 64] modernizes LSTM with exponential gates and matrix memory cells, with models ranging from 125M to 1.3B parameters trained on 300 billion tokens from SlimPajama [ 65], consistently outperforming comparable RWKV-4 [66], Llama [10], and Mamba models across various tasks in the PALOMA benchmark [ 67]. These architectural innovations demonstrate the potential for efficient and powerful language models that can run effectively on edge devices. SLMs can be constructed through diverse methodological approaches. The construction"},{"citing_arxiv_id":"2501.00663","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Titans: Learning to Memorize at Test Time","primary_cat":"cs.LG","submitted_at":"2024-12-31T22:32:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"id=e93ffDcpH3. [6] Dzmitry Bahdanau. \"Neural machine translation by jointly learning to align and translate\". In: arXiv preprint arXiv:1409.0473 (2014). [7] Reza Bayat, Mohammad Pezeshki, Elvis Dohmatob, David Lopez-Paz, and Pascal Vincent. \"The Pitfalls of Memo- rization: When Memorization Hurts Generalization\". In: arXiv preprint arXiv:2412.07684 (2024). [8] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. \"xLSTM: Extended Long Short-Term Memory\". In: arXiv preprint arXiv:2405.04517 (2024). [9] Ali Behrouz, Michele Santacatterina, and Ramin Zabih. 
\"Mambamixer: Efficient selective state space models with"},{"citing_arxiv_id":"2412.06464","ref_index":299,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gated Delta Networks: Improving Mamba2 with Delta Rule","primary_cat":"cs.CL","submitted_at":"2024-12-09T13:09:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.08608","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision","primary_cat":"cs.LG","submitted_at":"2024-07-11T15:44:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.04620","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to (Learn at Test Time): RNNs with Expressive Hidden States","primary_cat":"cs.LG","submitted_at":"2024-07-05T16:23:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike 
Mamba.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"to time-series classification, while [36] and [26] demonstrate how the choice of update rules affects the expressiveness of FWPs on formal language recognition tasks. 4.2 Modern RNN layers Our baseline, Mamba [27], is only one of the many recent RNN layers that inherit the linear (matrix) hidden states of linear attention [44, 63]. Some more recent examples are RWKV [58, 59], xLSTM [4], and Gated Linear Attention (GLA) [82]. The most relevant work is DeltaNet [62], which is equivalent to TTT-Linear with inner-loop mini-batch size 1, without the Layer Norm and residual connection. In [83], Yang et al. further improve the performance of DeltaNet and enable parallelized updates across tokens (in our terms, across inner loop mini-batches)."},{"citing_arxiv_id":"2405.21060","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","primary_cat":"cs.LG","submitted_at":"2024-05-31T17:50:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\"Neural Machine Translation by Jointly Learning to Align and Translate\". In: The International Conference on Learning Representations (ICLR) . 2015. 36 [8] George A Baker, George A Baker Jr, Peter Graves-Morris, and Susan S Baker. Pade Approximants: Encyclopedia of Mathematics and It's Applications, Vol. 59 George A. Baker, Jr., Peter Graves-Morris . Vol. 59. Cambridge University Press, 1996. 
[9] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Gün- ter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. \"xLSTM: Extended Long Short-Term Memory\". In: arXiv preprint arXiv:2405.04517 (2024). [10] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mo-"},{"citing_arxiv_id":"2404.06654","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RULER: What's the Real Context Size of Your Long-Context Language Models?","primary_cat":"cs.CL","submitted_at":"2024-04-09T23:41:27+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.06635","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gated Linear Attention Transformers with Hardware-Efficient Training","primary_cat":"cs.LG","submitted_at":"2023-12-11T18:51:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"enjoys parallel form and chunk-wise form, which could be potentially useful for future development of linear attention models. 
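C.1 Parallel form. By unrolling the recurrence, we have $o_t = q_t S_t = q_t \\sum_{i=1}^{t} \\left( \\left( \\prod_{j=i+1}^{t} G_j \\right) \\odot (k_i^T v_i) \\right)$ (5). By taking advantage of the mixed-product property of the Kronecker/outer product, we have $\\left( \\prod_{j=i+1}^{t} G_j \\right) \\odot (k_i^T v_i) = \\left( \\left( \\frac{b_t}{b_i} \\right)^T \\left( \\frac{d_t}{d_i} \\right) \\right) \\odot (k_i^T v_i)$ (6) $= \\left( \\frac{b_t}{b_i} \\odot k_i \\right)^T \\left( \\frac{d_t}{d_i} \\odot v_i \\right)$ (7), where $b_t = \\prod_{j=1}^{t} \\alpha_j$ and $d_t = \\prod_{j=1}^{t} \\beta_j$. By plugging it into the expanded recurrence, we have the following form: $o_t = q_t S_t = q_t \\sum_{i=1}^{t} \\left( \\left( \\prod_{j=i+1}^{t} G_j \\right) \\odot (k_i^T v_i) \\right)$ (8) $= q_t \\sum_{i=1}^{t} \\left( \\frac{b_t}{b_i} \\odot k_i \\right)^T \\left( \\frac{d_t}{d_i} \\odot v_i \\right)$ (9) $= \\sum_{i=1}^{t} \\left( q_t \\left( \\frac{b_t}{b_i} \\odot k_i \\right)^T \\right) \\left( \\frac{d_t}{d_i} \\odot v_i \\right)$ (10) $= \\sum_{i=1}^{t} \\underbrace{\\left\\langle q_t, \\frac{b_t}{b_i} \\odot k_i \\right\\rangle}_{\\mathbb{R}^{1 \\times 1}} \\underbrace{\\left( \\frac{d_t}{d_i} \\odot v_i \\right)}_{\\mathbb{R}^{1 \\times d_v}}$"}],"limit":50,"offset":0}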