{"total":17,"items":[{"citing_arxiv_id":"2606.12895","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning","primary_cat":"cs.LG","submitted_at":"2026-06-11T04:54:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongSpike integrates fractional-order state-space modeling into spiking neural networks, enabling better long-sequence performance than prior SNNs on LRA, WikiText-103, and Speech Commands benchmarks while retaining sparse computation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06479","ref_index":123,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pretraining Recurrent Networks without Recurrence","primary_cat":"cs.LG","submitted_at":"2026-06-04T17:57:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SMT reduces RNN training to supervised learning on memory transitions (m_t, x_{t+1}) to m_{t+1} obtained from a Transformer encoder, enabling time-parallel training with O(1) gradient paths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15216","ref_index":69,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hardware-Software Co-Design of Scalable, Energy-Efficient Analog Recurrent Computations","primary_cat":"cs.AR","submitted_at":"2026-05-12T09:44:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BMRUs enable analog recurrent neural network hardware via discrete outputs that suppress noise 20-fold, with one-to-one parameter-to-circuit mapping and linear power scaling for 
recurrence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08539","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Continuity Laws for Sequential Models","primary_cat":"cs.LG","submitted_at":"2026-05-08T22:55:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S4 models exhibit stable time-continuity unlike sensitive S6 models, with task continuity predicting performance and enabling temporal subsampling for better efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08171","ref_index":18,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Communication Dynamics Neural Networks: FFT-Diagonalized Layers for Improved Hessian Conditioning at Reduced Parameter Count","primary_cat":"cs.LG","submitted_at":"2026-05-04T23:43:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CDLinear is a block-circulant layer achieving 1/B parameter reduction whose weight Hessian is DFT-diagonalized, yielding population condition number exactly 1 under input pre-whitening.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"de Freitas, ACDC: A structured efficient linear layer, ICLR (2016). [16] A. T. Thomas, A. Gu, T. Dao, A. Rudra, and C. Ré, Learning compressed transforms with low displacement rank, NeurIPS 31, 9052 (2018). [17] T. Dao, B. Chen, N. S. Sohoni, A. Desai, M. Poli, J. Grogan, A. Liu, A. Rao, A. Rudra, and C. Ré, Monarch: Expressive structured matrices for efficient and accurate training, ICML (2022). [18] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. 
Metzler, Long range arena: A benchmark for efficient transformers, ICLR (2021); arXiv:2011.04006. [19] B. R. Frieden, Physics from Fisher Information: A Unification (Cambridge University Press, 1998). [20] C. E. Shannon, A mathematical theory of communication, Bell Syst."},{"citing_arxiv_id":"2604.22117","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training","primary_cat":"cs.LG","submitted_at":"2026-04-23T23:32:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20789","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity","primary_cat":"cs.CL","submitted_at":"2026-04-22T17:14:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14930","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IE as Cache: Information Extraction Enhanced Agentic Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-16T12:18:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IE-as-Cache framework repurposes information 
extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In this work, we posit that strategically extracted and maintained information can directly scaffold large language model (LLM) reasoning, especially as LLMs increasingly operate in an agentic manner where they must iteratively read, decide, and act over complex inputs [11]. This perspective is crucial for processing noise-rich, long-form content [12], where LLMs struggle with irrelevant distractors [13] and information decay in middle contexts [14]. In such settings, simply providing raw context is often insufficient; instead, models benefit from a compact intermediate representation that preserves salient evidence while suppressing noise [15]. Addressing these inefficiencies, we draw inspira-"},{"citing_arxiv_id":"2604.10078","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Attention-Guided Dual-Stream Learning for Group Engagement Recognition: Fusing Transformer-Encoded Motion Dynamics with Scene Context via Adaptive Gating","primary_cat":"cs.CV","submitted_at":"2026-04-11T07:51:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DualEngage fuses transformer-encoded student motion dynamics with 3D scene features via softmax-gated fusion to recognize group engagement in classroom videos, reporting 96.21% average accuracy on a university dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03446","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast Cross-Operator Optimization of Attention 
Dataflow","primary_cat":"cs.AR","submitted_at":"2026-04-03T20:37:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Attention mechanisms play a central role in transformer-based models, which are prevalent across various application domains, including natural language processing [18], [22], [61], computer vision [23], [47], and image generation [54], [84]. As models seek to capture correlations across longer contexts, sequence lengths continue to increase [7], [39], [72]. However, the success of attention-based models comes with substantial memory and compute overhead, as the computational complexity of attention scales quadratically with sequence length during prefill and training stages [21], [38], [67], [82]. 
To address these challenges, numerous techniques have been proposed to improve the efficiency of attention"},{"citing_arxiv_id":"2511.10571","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Differentiable Filtering for Learning Hidden Markov Models","primary_cat":"cs.LG","submitted_at":"2025-11-13T18:08:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Belief Net learns HMM parameters by implementing the forward filter as a decoder-only neural network whose weights are the logits of the initial, transition, and emission distributions, trained end-to-end with autoregressive loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.04565","ref_index":174,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems","primary_cat":"cs.MA","submitted_at":"2025-06-05T02:34:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Retriever combines both sparse and dense retrieval approaches to leverage the strengths of explicit term matching and semantic similarity [6, 50, 105]. LLM as Retriever involves the use of LLMs to directly retrieve relevant knowledge based on input queries [112]. 3.3 Generator The Generator in RAG systems is essentially an LLM. 
It can be an original pre-trained language model, such as T5 [136], FLAN [185] and LLaMA [174], or a black-box pre-trained language model, such as GPT-3 [14], GPT-4 [2], Gemini [169], Claude [24]. Alternatively, the generator can also be a fine-tuned language model specifically tailored for a particular task. For instance, BART [79] and T5 [63] are fine-tuned alongside the retriever, a process commonly referred to as co-training or dual fine-tuning, to enhance the quality and consistency of retrieval [93]."},{"citing_arxiv_id":"2506.06374","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SiLIF: Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks","primary_cat":"cs.NE","submitted_at":"2025-06-04T13:54:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SiLIF models apply SSM dynamics and parametrization to spiking neurons for stable training, reaching new SOTA on event-based and raw-audio speech datasets while using half the compute of SSMs via synaptic delays.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.18970","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba","primary_cat":"cs.LG","submitted_at":"2025-03-22T01:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":0.0,"formal_verification":"none","one_line_summary":"A survey tracing the evolution of state-space models like S4 and Mamba, their efficiency trade-offs, and applications in NLP, vision, and other 
domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.19427","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models","primary_cat":"cs.LG","submitted_at":"2024-02-29T18:24:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.16867","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Falcon Series of Open Language Models","primary_cat":"cs.CL","submitted_at":"2023-11-28T15:12:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.00114","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Show Your Work: Scratchpads for Intermediate Computation with Language Models","primary_cat":"cs.LG","submitted_at":"2021-11-30T21:32:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail 
at.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}